# Phase 8 - Technical Analysis and Root Cause Investigation

## Executive Summary

Phase 8 comprehensive benchmarking reveals **critical performance issues** with HAKMEM:

- **Working Set 256 (Hot Cache)**: 9.4% slower than System malloc, 45.2% slower than mimalloc
- **Working Set 8192 (Realistic)**: **246% slower than System malloc, 485% slower than mimalloc**

The most alarming finding: HAKMEM experiences a **4.8x performance degradation** when moving from hot cache to realistic workloads, compared to only 1.5x for System malloc and 1.2x for mimalloc.

## Benchmark Results Summary

### Working Set 256 (Hot Cache)

| Allocator      | Avg (M ops/s) | StdDev | vs HAKMEM |
|----------------|---------------|--------|-----------|
| HAKMEM Phase 8 | 79.2          | ±2.4%  | 1.00x     |
| System malloc  | 86.7          | ±1.0%  | 1.09x     |
| mimalloc       | 114.9         | ±1.2%  | 1.45x     |

### Working Set 8192 (Realistic Workload)

| Allocator      | Avg (M ops/s) | StdDev | vs HAKMEM |
|----------------|---------------|--------|-----------|
| HAKMEM Phase 8 | 16.5          | ±2.5%  | 1.00x     |
| System malloc  | 57.1          | ±1.3%  | 3.46x     |
| mimalloc       | 96.5          | ±0.9%  | 5.85x     |

### Scalability Analysis

Performance degradation from WS256 → WS8192:

- **HAKMEM**: 4.80x slowdown (79.2 → 16.5 M ops/s)
- **System**: 1.52x slowdown (86.7 → 57.1 M ops/s)
- **mimalloc**: 1.19x slowdown (114.9 → 96.5 M ops/s)

**HAKMEM degrades 3.16x MORE than System malloc and 4.03x MORE than mimalloc.**

## Root Cause Analysis

### Evidence from Debug Logs

The benchmark output shows critical issues:

```
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
```

**Analysis**: Repeated "shared_fail→legacy" messages indicate SuperSlab exhaustion, forcing a fallback to the legacy allocator path. This happens **4 times** during the WS8192 benchmark, suggesting severe SuperSlab fragmentation or capacity issues.

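To make the log concrete: the allocation path that emits this message presumably tries the shared SuperSlab for the size class first and only falls through to the legacy backend on failure. A minimal sketch of such a fallback path is shown below; the function names (`ss_backend_alloc`, `ss_shared_alloc`, `legacy_alloc`) and the counter are illustrative assumptions, not HAKMEM's actual API.

```c
/* Hypothetical sketch of the shared-SuperSlab-then-legacy fallback path.
 * ss_shared_alloc() and legacy_alloc() stand in for internals defined elsewhere. */
#include <stddef.h>
#include <stdio.h>

void *ss_shared_alloc(int cls);   /* returns NULL when the shared pool is exhausted */
void *legacy_alloc(size_t size);  /* slower legacy backend */

static unsigned long shared_fail_count[64];   /* per-class fallback counter */

void *ss_backend_alloc(int cls, size_t size) {
    void *p = ss_shared_alloc(cls);
    if (p != NULL)
        return p;                 /* fast path: shared SuperSlab had room */

    /* Shared SuperSlab exhausted (or fragmented): fall back to the legacy path.
     * This corresponds to the "[SS_BACKEND] shared_fail->legacy cls=N" log line. */
    shared_fail_count[cls]++;
    fprintf(stderr, "[SS_BACKEND] shared_fail->legacy cls=%d\n", cls);
    return legacy_alloc(size);
}
```

Counting how often the fallback fires per size class (as in `shared_fail_count`) would quantify the exhaustion the log only hints at.
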
### Issue 1: SuperSlab Architecture Doesn't Scale

**Symptoms**:
- Performance collapses from 79.2 to 16.5 M ops/s (4.8x degradation)
- Shared SuperSlabs fail repeatedly
- TLS_SLL_HDR_RESET events occur (slab header corruption?)

**Root Causes (Hypotheses)**:

1. **SuperSlab Capacity**: Current 512KB SuperSlabs may be too small for WS8192 (see the back-of-the-envelope check after this list)
   - 8192 objects × (16-1024 bytes average) = ~4-8MB working set
   - Multiple SuperSlabs needed → increased lookup overhead

2. **Fragmentation**: SuperSlabs become fragmented with larger working sets
   - Free slots scattered across multiple SuperSlabs
   - Linear search through the slab list becomes expensive

3. **TLB Pressure**: More SuperSlabs = more page table entries
   - System malloc uses fewer, larger arenas
   - HAKMEM's 512KB slabs create more TLB misses

4. **Cache Pollution**: Slab metadata pollutes L1/L2 cache
   - Each SuperSlab has metadata overhead
   - More slabs = more metadata = less cache for actual data

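A back-of-the-envelope check of hypothesis 1, assuming ~520 B average object size (the midpoint of the 16-1024 B range) and the current 512 KB SuperSlab size; both figures come from the text above, not from measurement:

```c
/* Capacity check for hypothesis 1.
 * Assumptions: 8192 live objects, ~520 B average size, 512 KiB SuperSlabs. */
#include <stdio.h>

int main(void) {
    const double objects    = 8192.0;
    const double avg_size   = 520.0;           /* midpoint of the 16-1024 B range */
    const double slab_bytes = 512.0 * 1024.0;  /* current SuperSlab size */

    double payload = objects * avg_size;                              /* ~4.1 MiB live data */
    long   slabs   = (long)((payload + slab_bytes - 1) / slab_bytes); /* ceiling */
    printf("payload ~= %.1f MiB -> needs at least %ld SuperSlabs of 512 KiB\n",
           payload / (1024.0 * 1024.0), slabs);
    return 0;
}
```

Even before metadata and fragmentation overhead, roughly nine SuperSlabs are live at once, so per-allocation lookups can no longer stay on a single hot slab.
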
### Issue 2: TLS Drain Overhead

Debug logs show:
```
[TLS_SLL_DRAIN] Drain ENABLED (default)
[TLS_SLL_DRAIN] Interval=2048 (default)
```

**Analysis**: Even in the hot-cache case (WS256), HAKMEM is 9.4% slower than System malloc. This suggests fast-path overhead from the TLS drain bookkeeping: the drain counter is checked on every operation, with an actual drain pass every 2048 operations. A sketch of what such a check typically looks like follows the evidence list below.

**Evidence**:
- WS256 should fit entirely in cache, yet HAKMEM still lags
- System malloc has a simpler fast path (no drain logic)
- At the 3.5 GHz clock assumed in the Performance Breakdown below, the 9.4% gap corresponds to roughly 4 extra cycles per operation (~44 vs ~40 cycles/op)

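A minimal sketch of the kind of per-operation bookkeeping that could account for this gap; the names (`hak_free_fast`, `tls_free_list_push`, `tls_drain_to_superslab`) are hypothetical, not HAKMEM's actual internals. Even when no drain is due, every free pays for a counter increment, a compare, and a branch:

```c
/* Hypothetical TLS drain check on the free fast path.
 * The helpers stand in for allocator internals defined elsewhere. */
#define DRAIN_INTERVAL 2048               /* matches "[TLS_SLL_DRAIN] Interval=2048" */

static __thread unsigned tls_op_count;    /* per-thread operation counter */

void tls_free_list_push(void *p, int cls);
void tls_drain_to_superslab(int cls);

void hak_free_fast(void *p, int cls) {
    tls_free_list_push(p, cls);           /* the actual fast-path work */

    /* Paid on EVERY free, even though the drain itself runs only once
     * per DRAIN_INTERVAL operations. */
    if (++tls_op_count >= DRAIN_INTERVAL) {
        tls_op_count = 0;
        tls_drain_to_superslab(cls);      /* slow path, runs rarely */
    }
}
```

If the branch is usually not taken, the cost is small but non-zero, which is consistent with a single-digit-percent gap at WS256 rather than a collapse.
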
### Issue 3: TLS_SLL_HDR_RESET Events

```
[TLS_SLL_HDR_RESET] cls=6 base=0x790999b35a0e got=0x00 expect=0xa6 count=0
```

**Analysis**: Header reset events suggest slab list corruption or validation failures. This shouldn't happen in normal operation and indicates potential race conditions or memory corruption.

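One plausible reading of this line, offered only as a hedged reconstruction: the slab header stores a class-specific magic byte (0xa6 for cls=6), a validation check read back 0x00 instead, and the cached list head was reset rather than followed. The magic scheme and names below are guesses for illustration, not HAKMEM's actual encoding:

```c
/* Hypothetical header-canary validation behind TLS_SLL_HDR_RESET.
 * The (0xa0 | cls) scheme is inferred from "expect=0xa6" for cls=6 and
 * may not match the real implementation. */
#include <stdint.h>
#include <stdio.h>

static inline uint8_t hdr_magic_for_class(int cls) {
    return (uint8_t)(0xa0 | (cls & 0x0f));
}

/* Returns 1 if the header is intact, 0 if the cached list must be reset. */
int tls_sll_validate_header(int cls, uint8_t *base, unsigned count) {
    uint8_t got    = base[0];
    uint8_t expect = hdr_magic_for_class(cls);
    if (got == expect)
        return 1;

    /* Corrupted or never-initialized header: drop the cached list rather
     * than follow a bogus pointer chain. */
    fprintf(stderr,
            "[TLS_SLL_HDR_RESET] cls=%d base=%p got=0x%02x expect=0x%02x count=%u\n",
            cls, (void *)base, (unsigned)got, (unsigned)expect, count);
    return 0;
}
```

Under that reading, `got=0x00` would point at either a use-after-free of the header page or a list head published before initialization, both worth ruling out in Phase 9.
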
## Performance Breakdown

### Where HAKMEM Loses Performance (WS8192)

Estimated cycle budget, assuming a 3.5 GHz CPU (see the conversion sketch after this list):

- **HAKMEM**: 16.5 M ops/s = ~212 cycles/operation
- **System**: 57.1 M ops/s = ~61 cycles/operation
- **mimalloc**: 96.5 M ops/s = ~36 cycles/operation

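The cycles-per-operation figures are simply clock frequency divided by throughput; the small helper below makes the conversion explicit (the 3.5 GHz clock is the same assumption as above):

```c
/* Convert benchmark throughput (M ops/s) into approximate cycles per
 * operation, assuming a fixed 3.5 GHz clock. */
#include <stdio.h>

static double cycles_per_op(double mops_per_sec, double clock_ghz) {
    return (clock_ghz * 1e9) / (mops_per_sec * 1e6);
}

int main(void) {
    printf("HAKMEM:   ~%.0f cycles/op\n", cycles_per_op(16.5, 3.5)); /* ~212 */
    printf("System:   ~%.0f cycles/op\n", cycles_per_op(57.1, 3.5)); /* ~61  */
    printf("mimalloc: ~%.0f cycles/op\n", cycles_per_op(96.5, 3.5)); /* ~36  */
    return 0;
}
```
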
**Gap Analysis**:
- HAKMEM uses **151 extra cycles** vs System malloc
- HAKMEM uses **176 extra cycles** vs mimalloc

Where do these cycles go?

1. **SuperSlab Lookup** (~50-80 cycles)
   - Linear search through the slab list
   - Cache misses on slab metadata
   - TLB misses on slab pages

2. **TLS Drain Logic** (~10-15 cycles)
   - Drain counter checks every allocation
   - Branch mispredictions

3. **Fragmentation Overhead** (~30-50 cycles)
   - Walking free lists
   - Finding suitable free blocks

4. **Legacy Fallback** (~50-100 cycles when triggered)
   - System malloc/mmap calls
   - Context switches

## Competitive Analysis

### Why System malloc Wins (3.46x faster)

1. **Arena-based design**: Fewer, larger memory regions
2. **Thread caching**: Similar to HAKMEM TLS but better tuned
3. **Mature optimization**: Decades of tuning
4. **Simple fast path**: No drain logic, no SuperSlab lookup

### Why mimalloc Dominates (5.85x faster)

1. **Segment-based design**: Optimal for multi-threaded workloads
2. **Free list sharding**: Reduces contention
3. **Aggressive inlining**: Fast path is 15-20 instructions
4. **No locks in fast path**: Lock-free for thread-local allocations
5. **Delayed freeing**: Like HAKMEM drain but more efficient
6. **Minimal metadata**: Less cache pollution

## Critical Gaps to Address

### Gap 1: Fast Path Performance (9.4% slower at WS256)

**Target**: Match System malloc on the hot-cache workload
**Required improvement**: +9.4% = +7.5 M ops/s

**Action items**:
- Profile TLS drain overhead
- Inline critical functions more aggressively
- Reduce branch mispredictions
- Consider removing drain logic or making it lazy

### Gap 2: Scalability (246% slower at WS8192)

**Target**: Get within 20% of System malloc at the realistic workload
**Required improvement**: +40.6 M ops/s to match System malloc outright (a 3.46x speedup); roughly +29 M ops/s (a ~2.8x speedup) to get within 20%

**Action items**:
- Fix SuperSlab scaling
- Reduce fragmentation
- Optimize SuperSlab lookup (hash table instead of linear search? see the sketch after this list)
- Reduce TLB pressure (larger SuperSlabs or better placement)
- Profile cache misses

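One way to attack the lookup cost, sketched below under the assumption that SuperSlabs are allocated at fixed, SS_SIZE-aligned addresses (which may not hold today): mask the object pointer down to its SuperSlab base and probe a small open-addressed hash table instead of walking a linked slab list. This is an illustrative design, not existing HAKMEM code, and it ignores thread-safety of registration.

```c
/* Sketch: near-O(1) SuperSlab lookup via an open-addressed hash table keyed by
 * the alignment-masked SuperSlab base address. Illustrative only; assumes
 * SS_SIZE-aligned SuperSlabs and single-threaded registration. */
#include <stdint.h>
#include <stddef.h>

#define SS_SIZE      (512u * 1024u)   /* current SuperSlab size */
#define SS_MAP_SLOTS 1024u            /* power of two, > max live SuperSlabs */

typedef struct SuperSlab SuperSlab;   /* metadata type, defined elsewhere */

static struct { uintptr_t base; SuperSlab *ss; } ss_map[SS_MAP_SLOTS];

static inline size_t ss_hash(uintptr_t base) {
    return ((base / SS_SIZE) * 0x9E3779B9u) & (SS_MAP_SLOTS - 1);  /* cheap mix */
}

SuperSlab *ss_lookup(const void *obj) {
    uintptr_t base = (uintptr_t)obj & ~(uintptr_t)(SS_SIZE - 1);
    for (size_t i = ss_hash(base), n = 0; n < SS_MAP_SLOTS;
         i = (i + 1) & (SS_MAP_SLOTS - 1), n++) {
        if (ss_map[i].base == base) return ss_map[i].ss;  /* usually one probe */
        if (ss_map[i].base == 0)    return NULL;          /* empty slot: unknown */
    }
    return NULL;
}

void ss_register(uintptr_t base, SuperSlab *ss) {
    for (size_t i = ss_hash(base); ; i = (i + 1) & (SS_MAP_SLOTS - 1)) {
        if (ss_map[i].base == 0 || ss_map[i].base == base) {
            ss_map[i].base = base;
            ss_map[i].ss   = ss;
            return;
        }
    }
}
```

The point of the sketch is that the free path would stop paying for a linear walk whose length grows with the number of live SuperSlabs, which is exactly the quantity that explodes between WS256 and WS8192.
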
## Recommendations for Phase 9+

### Phase 9: CRITICAL - SuperSlab Investigation

**Goal**: Understand why SuperSlab performance collapses at WS8192

**Tasks**:

1. Add detailed profiling:
   - SuperSlab lookup latency distribution
   - Cache miss rates (L1, L2, L3)
   - TLB miss rates
   - Fragmentation metrics

2. Measure SuperSlab statistics (see the counter sketch after this list):
   - Number of active SuperSlabs at WS256 vs WS8192
   - Average slab list length
   - Hit rate for first-slab lookup

3. Experiment with SuperSlab sizes:
   - Try 1MB, 2MB, 4MB SuperSlabs
   - Measure the impact on performance

4. Analyze "shared_fail→legacy" events:
   - Why do shared slabs fail?
   - How often does it happen?
   - Can we pre-allocate more capacity?

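For task 2, a lightweight option is a handful of counters bumped on the lookup path and dumped at exit; the counter names and the constructor/atexit hook below are hypothetical instrumentation, not something HAKMEM already has:

```c
/* Hypothetical SuperSlab statistics for Phase 9, task 2. The ss_lookup_*
 * counters would be incremented inside the lookup routine (not shown). */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

static atomic_ulong ss_active_count;      /* currently live SuperSlabs */
static atomic_ulong ss_lookup_calls;      /* total lookups */
static atomic_ulong ss_lookup_first_hit;  /* resolved by the first slab tried */
static atomic_ulong ss_lookup_steps;      /* total slabs walked across all lookups */

static void ss_stats_dump(void) {
    unsigned long calls = atomic_load(&ss_lookup_calls);
    fprintf(stderr,
            "[SS_STATS] active=%lu lookups=%lu first_hit=%.1f%% avg_walk=%.2f\n",
            atomic_load(&ss_active_count), calls,
            calls ? 100.0 * atomic_load(&ss_lookup_first_hit) / calls : 0.0,
            calls ? (double)atomic_load(&ss_lookup_steps) / calls : 0.0);
}

__attribute__((constructor))
static void ss_stats_init(void) {
    atexit(ss_stats_dump);                /* print the summary once at exit */
}
```

Comparing `active`, `first_hit`, and `avg_walk` between a WS256 run and a WS8192 run would directly test the "lookup length explodes" hypothesis.
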
### Phase 10: Fast Path Optimization

**Goal**: Close the 9.4% gap at WS256

**Tasks**:
1. Profile TLS drain overhead
2. Experiment with drain intervals (4096, 8192, disable)
3. Inline more aggressively
4. Add `__builtin_expect` hints for common paths (see the sketch after this list)
5. Reduce branch mispredictions

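For task 4, the usual pattern is a pair of likely/unlikely macros around the fast-path branches; applied to the drain check sketched under Issue 2 (names still hypothetical), it looks roughly like this:

```c
/* Branch hints for the free fast path (GCC/Clang builtins). The helper
 * names are the same hypothetical ones used in the Issue 2 sketch. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

#define DRAIN_INTERVAL 2048

static __thread unsigned tls_op_count;

void tls_free_list_push(void *p, int cls);
void tls_drain_to_superslab(int cls);

void hak_free_fast(void *p, int cls) {
    tls_free_list_push(p, cls);

    /* Mark the drain branch as cold so the common path stays straight-line
     * code with the rarely-taken branch moved out of the way. */
    if (UNLIKELY(++tls_op_count >= DRAIN_INTERVAL)) {
        tls_op_count = 0;
        tls_drain_to_superslab(cls);
    }
}
```
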
### Phase 11: Architecture Re-evaluation

**Goal**: Decide whether the SuperSlab model is viable

**Decision point**: If Phase 9 can't get within 50% of System malloc at WS8192, consider:

1. **Hybrid approach**: TLS fast path + a different backend (jemalloc-style arenas?)
2. **Abandon SuperSlab**: Switch to a segment-based design like mimalloc
3. **Radical simplification**: Focus on a specific use case (small allocations only?)

## Success Criteria for Phase 9

Minimum acceptable improvements:

- WS256: 79.2 → 85+ M ops/s (+7% improvement, roughly matching System malloc's 86.7)
- WS8192: 16.5 → 35+ M ops/s (+112% improvement, reaching ~60% of System malloc)

Stretch goals:

- WS256: 90+ M ops/s (at or above System malloc)
- WS8192: 45+ M ops/s (~80% of System malloc)

## Raw Data

All benchmark runs completed successfully with good statistical stability (StdDev < 2.5%).

### Working Set 256
```
HAKMEM:   [78.5, 78.1, 77.0, 81.1, 81.2] M ops/s
System:   [87.3, 86.5, 87.5, 85.3, 86.6] M ops/s
mimalloc: [115.8, 115.2, 116.2, 112.5, 115.0] M ops/s
```

### Working Set 8192
```
HAKMEM:   [16.5, 15.8, 16.9, 16.7, 16.6] M ops/s
System:   [56.1, 57.8, 57.0, 57.7, 56.7] M ops/s
mimalloc: [96.8, 96.1, 95.5, 97.7, 96.3] M ops/s
```

## Conclusion

Phase 8 benchmarking reveals fundamental issues with HAKMEM's current architecture:

1. **SuperSlab scaling is broken** - 4.8x performance degradation is unacceptable
2. **Fast path has overhead** - Even hot cache shows a 9.4% gap
3. **Competition is fierce** - mimalloc is 5.85x faster at realistic workloads

**Next priority**: Phase 9 MUST focus on understanding and fixing SuperSlab scalability. Without addressing this core issue, HAKMEM cannot compete with production allocators.

The benchmark data is statistically robust (low variance) and reproducible. The performance gaps are real and significant.