# Phase 8 - Technical Analysis and Root Cause Investigation

## Executive Summary

Phase 8 comprehensive benchmarking reveals **critical performance issues** with HAKMEM:

- **Working Set 256 (Hot Cache)**: 9.4% slower than System malloc, 45.2% slower than mimalloc
- **Working Set 8192 (Realistic)**: **246% slower than System malloc, 485% slower than mimalloc**

The most alarming finding: HAKMEM suffers a **4.8x performance degradation** when moving from the hot-cache workload to the realistic one, compared to only 1.5x for System malloc and 1.2x for mimalloc.

## Benchmark Results Summary

### Working Set 256 (Hot Cache)

| Allocator      | Avg (M ops/s) | StdDev | vs HAKMEM |
|----------------|---------------|--------|-----------|
| HAKMEM Phase 8 | 79.2          | ±2.4%  | 1.00x     |
| System malloc  | 86.7          | ±1.0%  | 1.09x     |
| mimalloc       | 114.9         | ±1.2%  | 1.45x     |

### Working Set 8192 (Realistic Workload)

| Allocator      | Avg (M ops/s) | StdDev | vs HAKMEM |
|----------------|---------------|--------|-----------|
| HAKMEM Phase 8 | 16.5          | ±2.5%  | 1.00x     |
| System malloc  | 57.1          | ±1.3%  | 3.46x     |
| mimalloc       | 96.5          | ±0.9%  | 5.85x     |

### Scalability Analysis

Performance degradation from WS256 → WS8192:

- **HAKMEM**: 4.80x slowdown (79.2 → 16.5 M ops/s)
- **System**: 1.52x slowdown (86.7 → 57.1 M ops/s)
- **mimalloc**: 1.19x slowdown (114.9 → 96.5 M ops/s)

**HAKMEM degrades 3.16x MORE than System malloc and 4.03x MORE than mimalloc.**

## Root Cause Analysis

### Evidence from Debug Logs

The benchmark output shows critical issues:

```
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
```

**Analysis**: The repeated "shared_fail→legacy" messages indicate SuperSlab exhaustion, forcing fallback to the legacy allocator path. This happens **4 times** during the WS8192 benchmark, suggesting severe SuperSlab fragmentation or capacity issues.

### Issue 1: SuperSlab Architecture Doesn't Scale

**Symptoms**:

- Performance collapses from 79.2 to 16.5 M ops/s (4.8x degradation)
- Shared SuperSlabs fail repeatedly
- TLS_SLL_HDR_RESET events occur (slab header corruption?)

**Root Causes (Hypotheses)**:

1. **SuperSlab Capacity**: Current 512KB SuperSlabs may be too small for WS8192 (rough arithmetic after this list)
   - 8192 objects × 16-1024 bytes each ≈ a 4-8MB working set
   - Multiple SuperSlabs needed → increased lookup overhead

2. **Fragmentation**: SuperSlabs become fragmented with larger working sets
   - Free slots scattered across multiple SuperSlabs
   - Linear search through slab list becomes expensive

3. **TLB Pressure**: More SuperSlabs = more page table entries
   - System malloc uses fewer, larger arenas
   - HAKMEM's 512KB slabs create more TLB misses

4. **Cache Pollution**: Slab metadata pollutes L1/L2 cache
   - Each SuperSlab has metadata overhead
   - More slabs = more metadata = less cache for actual data

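To make the capacity hypothesis concrete, here is the rough arithmetic referenced in item 1, using only the 512KB SuperSlab size and the benchmark working-set counts (per-slab metadata ignored, one size class per SuperSlab assumed):

```
512 KiB SuperSlab,  256 B class : ~2048 slots/slab → 8192 objects ≈  4 slabs
512 KiB SuperSlab, 1024 B class :  ~512 slots/slab → 8192 objects ≈ 16 slabs

WS256  (~0.13 MB total) : fits comfortably in a single SuperSlab per class
WS8192 (~4-8 MB total)  : on the order of 8-16 live SuperSlabs at once
```

If each allocation may have to walk that many slabs, or fall through to the shared/legacy path, the 4.8x collapse is at least plausible.
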
### Issue 2: TLS Drain Overhead

Debug logs show:

```
[TLS_SLL_DRAIN] Drain ENABLED (default)
[TLS_SLL_DRAIN] Interval=2048 (default)
```

**Analysis**: Even in the hot-cache case (WS256), HAKMEM is 9.4% slower than System malloc. This points to fast-path overhead from the TLS drain bookkeeping: the interval counter is checked on every operation, and a drain fires every 2048 operations (a minimal sketch of such a check follows the list below).

**Evidence**:

- WS256 should fit entirely in cache, yet HAKMEM still lags
- System malloc has a simpler fast path (no drain logic)
- At 3.5 GHz, the 9.4% gap works out to roughly 4 extra cycles per operation (~44 vs ~40 cycles/op)

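To illustrate where those cycles could come from, here is a minimal sketch of a TLS fast path with a drain-interval check. All names are hypothetical (this is not HAKMEM's actual code); only the 2048-operation interval is taken from the logs:

```c
/* Hypothetical sketch, not HAKMEM's real API: a per-thread free list with a
 * drain check on every allocation, mirroring the logged Interval=2048. */
#include <stddef.h>
#include <stdlib.h>

#define NUM_CLASSES    8
#define DRAIN_INTERVAL 2048

static __thread unsigned long tls_op_count;
static __thread void *tls_free_list[NUM_CLASSES];

static void tls_drain(int cls) { (void)cls; /* return surplus blocks to shared slabs */ }
static void *refill_from_superslab(int cls) { (void)cls; return malloc(64); /* stand-in */ }

static void *hak_alloc_fast(int cls) {
    /* This counter update and branch run on EVERY allocation, even when no
     * drain fires: a few cycles and a branch-predictor slot per call. */
    if (__builtin_expect(++tls_op_count % DRAIN_INTERVAL == 0, 0))
        tls_drain(cls);

    void *p = tls_free_list[cls];
    if (__builtin_expect(p != NULL, 1)) {
        tls_free_list[cls] = *(void **)p;   /* pop the singly-linked free list */
        return p;
    }
    return refill_from_superslab(cls);      /* miss: go to the backend */
}
```

Batching or removing that per-operation check (see Gap 1 below) is the cheapest of the proposed fixes to test.
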
### Issue 3: TLS_SLL_HDR_RESET Events

```
[TLS_SLL_HDR_RESET] cls=6 base=0x790999b35a0e got=0x00 expect=0xa6 count=0
```

**Analysis**: Header reset events suggest slab list corruption or validation failures. This shouldn't happen in normal operation and indicates potential race conditions or memory corruption.
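The log format itself hints at what the check is: `expect=0xa6` for `cls=6` looks like a per-class tag byte (e.g. `0xa0 | cls`). Purely as a speculative reconstruction (not HAKMEM's actual code), the check producing that line would look something like this, and its recovery path throws away the cached chain:

```c
#include <stdint.h>
#include <stdio.h>

/* Speculative reconstruction: each TLS free-list block is assumed to start
 * with a one-byte class tag; 0xa0 | cls matches expect=0xa6 for cls=6. */
#define HDR_TAG(cls) ((uint8_t)(0xa0 | (cls)))

/* Returns 1 if the list head looks sane; otherwise logs, resets, returns 0. */
static int tls_sll_validate(void **list_head, int cls, unsigned count) {
    uint8_t *base = (uint8_t *)*list_head;
    if (base == NULL) return 1;
    if (base[0] == HDR_TAG(cls)) return 1;
    fprintf(stderr,
            "[TLS_SLL_HDR_RESET] cls=%d base=%p got=0x%02x expect=0x%02x count=%u\n",
            cls, (void *)base, base[0], HDR_TAG(cls), count);
    *list_head = NULL;   /* drop the whole chain: safe, but leaks cached blocks */
    return 0;
}
```

If that interpretation is right, every reset both signals corruption and discards useful cached memory, which would compound the backend pressure described in Issue 1.
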
## Performance Breakdown

### Where HAKMEM Loses Performance (WS8192)

Estimated cycle budget (assuming a 3.5 GHz CPU; the conversion is shown after the list):

- **HAKMEM**: 16.5 M ops/s = ~212 cycles/operation
- **System**: 57.1 M ops/s = ~61 cycles/operation
- **mimalloc**: 96.5 M ops/s = ~36 cycles/operation

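These are straight conversions from the measured throughput (cycles/op ≈ clock rate / ops per second):

```
3.5e9 / 16.5e6 ≈ 212 cycles/op   (HAKMEM)
3.5e9 / 57.1e6 ≈  61 cycles/op   (System)
3.5e9 / 96.5e6 ≈  36 cycles/op   (mimalloc)
```
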
**Gap Analysis**:

- HAKMEM spends **151 extra cycles** per operation vs System malloc
- HAKMEM spends **176 extra cycles** per operation vs mimalloc

Where do these cycles go?

1. **SuperSlab Lookup** (~50-80 cycles)
   - Linear search through the slab list
   - Cache misses on slab metadata
   - TLB misses on slab pages

2. **TLS Drain Logic** (~10-15 cycles)
   - Drain counter checked on every allocation
   - Branch mispredictions

3. **Fragmentation Overhead** (~30-50 cycles)
   - Walking free lists
   - Finding suitable free blocks

4. **Legacy Fallback** (~50-100 cycles when triggered)
   - System malloc/mmap calls
   - Syscall entry and page-fault costs

## Competitive Analysis

### Why System malloc Wins (3.46x faster)

1. **Arena-based design**: Fewer, larger memory regions
2. **Thread caching**: Similar to HAKMEM TLS but better tuned
3. **Mature optimization**: Decades of tuning
4. **Simple fast path**: No drain logic, no SuperSlab lookup

### Why mimalloc Dominates (5.85x faster)

1. **Segment-based design**: Optimal for multi-threaded workloads
2. **Free list sharding**: Reduces contention
3. **Aggressive inlining**: Fast path is 15-20 instructions
4. **No locks in fast path**: Lock-free for thread-local allocations
5. **Delayed freeing**: Like HAKMEM drain but more efficient
6. **Minimal metadata**: Less cache pollution

## Critical Gaps to Address

### Gap 1: Fast Path Performance (9.4% slower at WS256)

**Target**: Match System malloc on the hot-cache workload

**Required improvement**: +9.4% ≈ +7.5 M ops/s

**Action items**:

- Profile TLS drain overhead
- Inline critical functions more aggressively
- Reduce branch mispredictions
- Consider removing drain logic or making it lazy

### Gap 2: Scalability (246% slower at WS8192)

**Target**: Get within 20% of System malloc at the realistic workload

**Required improvement**: +40.6 M ops/s (a 3.46x speedup) to fully match System malloc; roughly +29 M ops/s (2.8x) to reach the 80% target

**Action items**:

- Fix SuperSlab scaling
- Reduce fragmentation
- Optimize SuperSlab lookup (hash table or pointer masking instead of linear search; see the sketch after this list)
- Reduce TLB pressure (larger SuperSlabs or better placement)
- Profile cache misses

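For the lookup item above, one option worth prototyping is the approach mimalloc-style allocators use: if every SuperSlab is mapped at an address aligned to its own size, the owning slab of any pointer falls out of a mask, with no list walk and no hash table. A minimal sketch, assuming a hypothetical `SuperSlab` header at the start of each slab (not HAKMEM's actual layout):

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

#define SS_SIZE ((size_t)512 * 1024)        /* current SuperSlab size */

typedef struct SuperSlab {
    uint32_t magic;                          /* sanity tag */
    uint32_t cls;                            /* size class served */
    /* ... free-list metadata ... */
} SuperSlab;

/* Over-reserve, then use an SS_SIZE-aligned base inside the mapping. */
static SuperSlab *superslab_create(void) {
    void *raw = mmap(NULL, 2 * SS_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;
    uintptr_t base = ((uintptr_t)raw + SS_SIZE - 1) & ~(uintptr_t)(SS_SIZE - 1);
    /* (the unused head/tail of the 2x reservation could be munmap'd here) */
    return (SuperSlab *)base;
}

/* O(1) owner lookup: replaces the linear slab-list search entirely. */
static inline SuperSlab *superslab_of(const void *p) {
    return (SuperSlab *)((uintptr_t)p & ~(uintptr_t)(SS_SIZE - 1));
}
```

The trade-off is address-space waste from the aligned reservation, and it interacts with the TLB-pressure concern, so it should be tested together with the 1-4MB slab-size experiments proposed for Phase 9.
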
## Recommendations for Phase 9+

### Phase 9: CRITICAL - SuperSlab Investigation

**Goal**: Understand why SuperSlab performance collapses at WS8192

**Tasks**:

1. Add detailed profiling:
   - SuperSlab lookup latency distribution
   - Cache miss rates (L1, L2, L3)
   - TLB miss rates
   - Fragmentation metrics

2. Measure SuperSlab statistics (a counter sketch follows this list):
   - Number of active SuperSlabs at WS256 vs WS8192
   - Average slab list length
   - Hit rate for first-slab lookup

3. Experiment with SuperSlab sizes:
   - Try 1MB, 2MB, 4MB SuperSlabs
   - Measure impact on performance

4. Analyze "shared_fail→legacy" events:
   - Why do shared slabs fail?
   - How often does it happen?
   - Can we pre-allocate more capacity?

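For task 2, a sketch of the kind of counters that would answer those questions directly. The names are made up (this is not existing HAKMEM code); the increments would live in the backend lookup and fallback paths:

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical instrumentation, not existing HAKMEM code. */
typedef struct {
    atomic_ulong active_superslabs;   /* currently mapped SuperSlabs */
    atomic_ulong lookup_calls;        /* backend lookups */
    atomic_ulong lookup_first_hit;    /* satisfied by the first slab tried */
    atomic_ulong lookup_steps;        /* slabs walked in total */
    atomic_ulong shared_fail_legacy;  /* "shared_fail→legacy" fallbacks */
} ss_stats_t;

static ss_stats_t g_ss_stats;

/* Dump once per benchmark phase to compare WS256 vs WS8192. */
static void ss_stats_dump(void) {
    unsigned long calls = atomic_load(&g_ss_stats.lookup_calls);
    fprintf(stderr,
            "[SS_STATS] slabs=%lu lookups=%lu first_hit=%.1f%% avg_walk=%.2f legacy=%lu\n",
            atomic_load(&g_ss_stats.active_superslabs),
            calls,
            calls ? 100.0 * atomic_load(&g_ss_stats.lookup_first_hit) / calls : 0.0,
            calls ? (double)atomic_load(&g_ss_stats.lookup_steps) / calls : 0.0,
            atomic_load(&g_ss_stats.shared_fail_legacy));
}
```

Comparing these numbers at WS256 and WS8192 would confirm or eliminate hypotheses 1 and 2 before any redesign work starts.
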
### Phase 10: Fast Path Optimization

**Goal**: Close the 9.4% gap at WS256

**Tasks**:

1. Profile TLS drain overhead
2. Experiment with drain intervals (4096, 8192, disabled; see the sketch after this list)
3. Inline more aggressively
4. Add `__builtin_expect` hints for common paths
5. Reduce branch mispredictions

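The interval sweep in item 2 is easiest if the interval is a runtime knob rather than a rebuild. The logs already print the default, which suggests such a knob may exist; if not, a minimal sketch (the environment variable name here is invented for illustration, not a documented HAKMEM option):

```c
#include <stdlib.h>

/* Hypothetical runtime knob; HAKMEM_TLS_DRAIN_INTERVAL is not a documented
 * option, only the shape such a knob could take. */
static unsigned long g_drain_interval = 2048;   /* logged default */
static int           g_drain_enabled  = 1;

static void drain_config_init(void) {
    const char *s = getenv("HAKMEM_TLS_DRAIN_INTERVAL");
    if (s == NULL) return;
    unsigned long v = strtoul(s, NULL, 10);
    if (v == 0)
        g_drain_enabled = 0;        /* the "disabled" case in the sweep */
    else
        g_drain_interval = v;
}
```

Each configuration then becomes an environment change in the benchmark driver rather than a rebuild.
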
### Phase 11: Architecture Re-evaluation

**Goal**: Decide if SuperSlab model is viable

**Decision point**: If Phase 9 can't get within 50% of System malloc at WS8192, consider:

1. **Hybrid approach**: TLS fast path + different backend (jemalloc-style arenas?)
2. **Abandon SuperSlab**: Switch to segment-based design like mimalloc
3. **Radical simplification**: Focus on specific use case (small allocations only?)

## Success Criteria for Phase 9

Minimum acceptable improvements:

- WS256: 79.2 → 85+ M ops/s (+7%, roughly matching System malloc)
- WS8192: 16.5 → 35+ M ops/s (+112%, comfortably above 50% of System malloc)

Stretch goals:

- WS256: 90+ M ops/s (at or above System malloc)
- WS8192: 45+ M ops/s (~80% of System malloc)

## Raw Data

All benchmark runs completed successfully with good statistical stability (StdDev < 2.5%).

### Working Set 256

```
HAKMEM:   [78.5, 78.1, 77.0, 81.1, 81.2] M ops/s
System:   [87.3, 86.5, 87.5, 85.3, 86.6] M ops/s
mimalloc: [115.8, 115.2, 116.2, 112.5, 115.0] M ops/s
```

### Working Set 8192

```
HAKMEM:   [16.5, 15.8, 16.9, 16.7, 16.6] M ops/s
System:   [56.1, 57.8, 57.0, 57.7, 56.7] M ops/s
mimalloc: [96.8, 96.1, 95.5, 97.7, 96.3] M ops/s
```

## Conclusion

Phase 8 benchmarking reveals fundamental issues with HAKMEM's current architecture:

1. **SuperSlab scaling is broken** - 4.8x performance degradation is unacceptable
2. **Fast path has overhead** - Even hot cache shows 9.4% gap
3. **Competition is fierce** - mimalloc is 5.85x faster at realistic workloads

**Next priority**: Phase 9 MUST focus on understanding and fixing SuperSlab scalability. Without addressing this core issue, HAKMEM cannot compete with production allocators.

The benchmark data is statistically robust (low variance) and reproducible. The performance gaps are real and significant.