This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component, allowing precise identification of the root cause of Out-Of-Memory (OOM) issues related to ACE allocations. Key changes include:
- **ACE Tracing Implementation**:
  - Added an environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented the relevant allocation routines to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected the build so the new tracing support is properly linked, resolving a link error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated the wrapper's behavior under LD_PRELOAD, particularly its interaction with the associated runtime checks.
  - Enabled debugging flags for the test environment to prevent unintended fallbacks to the system allocator for non-tiny allocations, allowing comprehensive testing of the allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution-flow issues within interception and routing; these temporary logs have since been removed.
  - Added a test harness to facilitate exercising the tracing features.

This feature will significantly aid in diagnosing and resolving allocation-related OOM issues by providing clear insights into the failure pathways.
Phase 8 - Technical Analysis and Root Cause Investigation
Executive Summary
Phase 8 comprehensive benchmarking reveals critical performance issues with HAKMEM:
- Working Set 256 (Hot Cache): 9.4% slower than System malloc, 45.2% slower than mimalloc
- Working Set 8192 (Realistic): 246% slower than System malloc, 485% slower than mimalloc
The most alarming finding: HAKMEM experiences 4.8x performance degradation when moving from hot cache to realistic workloads, compared to only 1.5x for System malloc and 1.2x for mimalloc.
Benchmark Results Summary
Working Set 256 (Hot Cache)
| Allocator | Avg (M ops/s) | StdDev | vs HAKMEM |
|---|---|---|---|
| HAKMEM Phase 8 | 79.2 | ±2.4% | 1.00x |
| System malloc | 86.7 | ±1.0% | 1.09x |
| mimalloc | 114.9 | ±1.2% | 1.45x |
Working Set 8192 (Realistic Workload)
| Allocator | Avg (M ops/s) | StdDev | vs HAKMEM |
|---|---|---|---|
| HAKMEM Phase 8 | 16.5 | ±2.5% | 1.00x |
| System malloc | 57.1 | ±1.3% | 3.46x |
| mimalloc | 96.5 | ±0.9% | 5.85x |
Scalability Analysis
Performance degradation from WS256 → WS8192:
- HAKMEM: 4.80x slowdown (79.2 → 16.5 M ops/s)
- System: 1.52x slowdown (86.7 → 57.1 M ops/s)
- mimalloc: 1.19x slowdown (114.9 → 96.5 M ops/s)
HAKMEM degrades 3.16x MORE than System malloc and 4.03x MORE than mimalloc.
Root Cause Analysis
Evidence from Debug Logs
The benchmark output shows critical issues:
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
Analysis: Repeated "shared_fail→legacy" messages indicate SuperSlab exhaustion, forcing a fallback to the legacy allocator path. This happens 4 times during the WS8192 benchmark, suggesting severe SuperSlab fragmentation or capacity issues.
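HAKMEM's actual backend code is not reproduced in this report, but the control flow implied by these log lines is straightforward. The following is a minimal sketch of a "try the shared SuperSlab, then fall back to legacy" path; every name in it (ss_backend_alloc, ss_shared_alloc, legacy_alloc) is hypothetical, and the stub always fails so the sketch reproduces the shape of the log above.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of the "shared_fail->legacy" fallback path.
 * None of these names are taken from HAKMEM; they only illustrate the
 * control flow implied by the "[SS_BACKEND] shared_fail->legacy" log lines.
 * The stub below simulates an exhausted shared pool by always failing. */

static void *ss_shared_alloc(int size_class)
{
    (void)size_class;
    return NULL;                 /* stub: shared SuperSlab pool is exhausted */
}

static void *legacy_alloc(size_t size)
{
    return malloc(size);         /* stand-in for the slower legacy path */
}

static void *ss_backend_alloc(int size_class, size_t size)
{
    void *p = ss_shared_alloc(size_class);
    if (p != NULL)
        return p;                /* fast path: shared pool had a free slot */

    /* The shared SuperSlab could not satisfy this size class, so the request
     * is routed to the legacy allocator; each time this branch is taken,
     * the fallback cost estimated later in this document is paid. */
    fprintf(stderr, "[SS_BACKEND] shared_fail->legacy cls=%d\n", size_class);
    return legacy_alloc(size);
}

int main(void)
{
    void *p = ss_backend_alloc(7, 512);  /* cls=7, as in the captured logs */
    free(p);
    return 0;
}
```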
Issue 1: SuperSlab Architecture Doesn't Scale
Symptoms:
- Performance collapses from 79.2 to 16.5 M ops/s (4.8x degradation)
- Shared SuperSlabs fail repeatedly
- TLS_SLL_HDR_RESET events occur (slab header corruption?)
Root Causes (Hypotheses):
- SuperSlab Capacity: Current 512KB SuperSlabs may be too small for WS8192 (see the arithmetic sketch after this list)
  - 8192 objects × (16-1024 bytes average) = ~4-8MB working set
  - Multiple SuperSlabs needed → increased lookup overhead
- Fragmentation: SuperSlabs become fragmented with larger working sets
  - Free slots scattered across multiple SuperSlabs
  - Linear search through the slab list becomes expensive
- TLB Pressure: More SuperSlabs = more page table entries
  - System malloc uses fewer, larger arenas
  - HAKMEM's 512KB slabs create more TLB misses
- Cache Pollution: Slab metadata pollutes L1/L2 cache
  - Each SuperSlab has metadata overhead
  - More slabs = more metadata = less cache for actual data
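To make the capacity hypothesis concrete, the arithmetic from the first item can be spelled out. The 512KB SuperSlab size comes from the text above; the average object sizes below are assumptions spanning the 16-1024 byte range. Even at modest averages, WS8192 spans several SuperSlabs while WS256 fits comfortably inside one.

```c
#include <stdio.h>

/* Back-of-envelope check of the SuperSlab capacity hypothesis.
 * The 512 KiB SuperSlab size is taken from the text; the average object
 * sizes are assumptions within the 16-1024 byte range mentioned above. */
int main(void)
{
    const double superslab_bytes = 512.0 * 1024.0;
    const int    working_sets[]  = { 256, 8192 };
    const double avg_sizes[]     = { 64.0, 256.0, 512.0 };   /* assumed averages */

    for (int w = 0; w < 2; w++) {
        for (int s = 0; s < 3; s++) {
            double ws_bytes = working_sets[w] * avg_sizes[s];
            double slabs    = ws_bytes / superslab_bytes;
            printf("WS%-5d avg=%4.0fB -> %8.1f KiB working set, ~%.2f SuperSlabs\n",
                   working_sets[w], avg_sizes[s], ws_bytes / 1024.0, slabs);
        }
    }
    return 0;
}
```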
Issue 2: TLS Drain Overhead
Debug logs show:
[TLS_SLL_DRAIN] Drain ENABLED (default)
[TLS_SLL_DRAIN] Interval=2048 (default)
Analysis: Even in hot cache (WS256), HAKMEM is 9.4% slower than System malloc. This suggests fast-path overhead from TLS drain checks happening every 2048 operations.
Evidence:
- WS256 should fit entirely in cache, yet HAKMEM still lags
- System malloc has simpler fast path (no drain logic)
- Under the 3.5 GHz assumption used in the cycle budget below, the 9.4% gap corresponds to roughly 4 extra cycles per allocation (≈44 vs ≈40 cycles); a sketch of the kind of per-allocation check that could account for cycles on this scale follows.
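HAKMEM's drain implementation is not shown in this report, so the sketch below only illustrates the kind of per-allocation bookkeeping an Interval=2048 drain policy implies: a thread-local countdown plus a rarely taken branch on the fast path. All names (tls_fast_alloc, tls_sll_drain, tls_drain_countdown) are hypothetical, and the underlying allocation is stubbed with malloc.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch of a per-allocation TLS drain check, matching the
 * "[TLS_SLL_DRAIN] Interval=2048" setting reported in the logs.
 * Names are invented for illustration; they are not HAKMEM identifiers. */

#define TLS_DRAIN_INTERVAL 2048

static __thread unsigned tls_drain_countdown = TLS_DRAIN_INTERVAL;

/* Stubs standing in for the real thread-local free-list machinery. */
static void tls_sll_drain(void) { /* would flush deferred frees here */ }
static void *tls_fast_alloc_impl(size_t n) { return malloc(n); }

void *tls_fast_alloc(size_t n)
{
    /* This decrement + compare + (usually not-taken) branch runs on every
     * allocation, even when the working set fits in cache.  A few cycles
     * per call is one plausible source of a mid-single-digit percent
     * throughput gap on a ~40-cycle fast path. */
    if (--tls_drain_countdown == 0) {
        tls_drain_countdown = TLS_DRAIN_INTERVAL;
        tls_sll_drain();             /* slow path, once every 2048 calls */
    }
    return tls_fast_alloc_impl(n);
}

int main(void)
{
    void *p = tls_fast_alloc(64);
    free(p);
    return 0;
}
```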
Issue 3: TLS_SLL_HDR_RESET Events
[TLS_SLL_HDR_RESET] cls=6 base=0x790999b35a0e got=0x00 expect=0xa6 count=0
Analysis: Header reset events suggest slab list corruption or validation failures. This shouldn't happen in normal operation and indicates potential race conditions or memory corruption.
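The got=/expect= fields suggest a header byte that encodes the size class and is validated before the thread-local list is trusted. The sketch below is a hypothetical reconstruction of that kind of check; the (0xa0 | cls) tag encoding is purely an assumption chosen because 0xa6 appears together with cls=6 in the log, and none of the names are HAKMEM identifiers.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reconstruction of the header validation behind the
 * "[TLS_SLL_HDR_RESET] cls=6 ... got=0x00 expect=0xa6" log line.
 * The (0xa0 | cls) tag encoding is an assumption; HAKMEM's real layout
 * may differ. */

#define HDR_TAG(cls)  ((uint8_t)(0xa0u | (unsigned)(cls)))

/* Returns 1 if the header byte at 'base' matches the expected tag for this
 * size class; otherwise logs the mismatch and tells the caller to reset
 * (rebuild) its thread-local singly-linked list. */
static int tls_sll_hdr_check(int cls, const uint8_t *base, unsigned count)
{
    uint8_t got    = base[0];
    uint8_t expect = HDR_TAG(cls);

    if (got == expect)
        return 1;

    /* A zero byte where the tag should be points at a freshly mapped or
     * overwritten page: either stale list pointers (a lifetime bug) or a
     * race between the thread freeing and the thread reusing the block. */
    fprintf(stderr,
            "[TLS_SLL_HDR_RESET] cls=%d base=%p got=0x%02x expect=0x%02x count=%u\n",
            cls, (const void *)base, got, expect, count);
    return 0;
}

int main(void)
{
    uint8_t fake_block[16] = { 0 };               /* header byte is 0x00 */
    (void)tls_sll_hdr_check(6, fake_block, 0);    /* reproduces the log shape */
    return 0;
}
```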
Performance Breakdown
Where HAKMEM Loses Performance (WS8192)
Estimated cycle budget (assuming a 3.5 GHz CPU; the conversion is reproduced in the snippet after this list):
- HAKMEM: 16.5 M ops/s = ~212 cycles/operation
- System: 57.1 M ops/s = ~61 cycles/operation
- mimalloc: 96.5 M ops/s = ~36 cycles/operation
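These figures are plain conversions from measured throughput to cycles per operation under the stated 3.5 GHz assumption; the snippet below reproduces them.

```c
#include <stdio.h>

/* Reproduces the cycles-per-operation estimates above:
 * cycles/op = clock_hz / ops_per_second, assuming a 3.5 GHz clock. */
int main(void)
{
    const double clock_hz = 3.5e9;
    const struct { const char *name; double mops; } results[] = {
        { "HAKMEM  ", 16.5 },
        { "System  ", 57.1 },
        { "mimalloc", 96.5 },
    };

    for (int i = 0; i < 3; i++) {
        double cycles = clock_hz / (results[i].mops * 1e6);
        printf("%s %5.1f M ops/s -> ~%.0f cycles/op\n",
               results[i].name, results[i].mops, cycles);
    }
    return 0;
}
```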
Gap Analysis:
- HAKMEM uses 151 extra cycles vs System malloc
- HAKMEM uses 176 extra cycles vs mimalloc
Where do these cycles go?
- SuperSlab Lookup (~50-80 cycles)
  - Linear search through the slab list
  - Cache misses on slab metadata
  - TLB misses on slab pages
- TLS Drain Logic (~10-15 cycles)
  - Drain counter checks on every allocation
  - Branch mispredictions
- Fragmentation Overhead (~30-50 cycles)
  - Walking free lists
  - Finding suitable free blocks
- Legacy Fallback (~50-100 cycles when triggered)
  - System malloc/mmap calls
  - Syscall and page-fault overhead on kernel entry
Competitive Analysis
Why System malloc Wins (3.46x faster)
- Arena-based design: Fewer, larger memory regions
- Thread caching: Similar to HAKMEM TLS but better tuned
- Mature optimization: Decades of tuning
- Simple fast path: No drain logic, no SuperSlab lookup
Why mimalloc Dominates (5.85x faster)
- Segment-based design: Optimal for multi-threaded workloads
- Free list sharding: Reduces contention (the general idea is sketched after this list)
- Aggressive inlining: Fast path is 15-20 instructions
- No locks in fast path: Lock-free for thread-local allocations
- Delayed freeing: Like HAKMEM drain but more efficient
- Minimal metadata: Less cache pollution
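Of these, free list sharding is the most structural difference. The toy code below illustrates the general idea only: per-thread, per-size-class list heads so the common case touches no shared state and no locks. It is a simplified illustration, not mimalloc's or HAKMEM's actual layout, and all names are invented.

```c
#include <stddef.h>

/* Simplified illustration of free-list sharding: each thread-local heap keeps
 * one free-list head per size class, so the common case touches only
 * thread-local memory.  This shows the general idea referenced above, not
 * mimalloc's (or HAKMEM's) actual data structures. */

#define NUM_SIZE_CLASSES 64

typedef struct free_block {
    struct free_block *next;
} free_block_t;

typedef struct thread_heap {
    /* One sharded free list per size class instead of a single shared list:
     * less contention, shorter lists to walk, less metadata per lookup. */
    free_block_t *free_lists[NUM_SIZE_CLASSES];
} thread_heap_t;

static __thread thread_heap_t local_heap;

/* Pop from the thread-local shard: a handful of instructions, no locks. */
static inline void *shard_alloc(int cls)
{
    free_block_t *b = local_heap.free_lists[cls];
    if (b != NULL)
        local_heap.free_lists[cls] = b->next;
    return b;   /* NULL means this shard is empty and a refill is needed */
}

/* Push onto the thread-local shard on free. */
static inline void shard_free(int cls, void *p)
{
    free_block_t *b = (free_block_t *)p;
    b->next = local_heap.free_lists[cls];
    local_heap.free_lists[cls] = b;
}

int main(void)
{
    static void *block_storage[4];        /* pretend this came from a slab */
    shard_free(3, block_storage);         /* hypothetical size class 3 */
    void *p = shard_alloc(3);
    return p == (void *)block_storage ? 0 : 1;
}
```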
Critical Gaps to Address
Gap 1: Fast Path Performance (9.4% slower at WS256)
Target: Match System malloc at the hot cache workload.
Required improvement: +9.4% = +7.5 M ops/s
Action items:
- Profile TLS drain overhead
- Inline critical functions more aggressively
- Reduce branch mispredictions
- Consider removing drain logic or making it lazy
Gap 2: Scalability (246% slower at WS8192)
Target: Get within 20% of System malloc at the realistic workload.
Required improvement: 16.5 → ~45.7 M ops/s (a 2.77x speedup); matching System malloc outright would require +246% = +40.6 M ops/s (a 3.46x speedup).
Action items:
- Fix SuperSlab scaling
- Reduce fragmentation
- Optimize SuperSlab lookup (hash table instead of linear search? see the sketch after this list)
- Reduce TLB pressure (larger SuperSlabs or better placement)
- Profile cache misses
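One concrete option for the lookup item above is to replace the linear slab-list walk with a direct-mapped hash keyed on the SuperSlab base address. The sketch below assumes 512KB-aligned SuperSlabs, so an interior pointer can be masked down to its slab base; every identifier in it is hypothetical, not existing HAKMEM code.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical hashed index for SuperSlab lookup.  Assuming a 512 KiB
 * SuperSlab is 512 KiB-aligned, any pointer inside it can be mapped to its
 * slab base with a mask, and the base hashed into a small table --
 * near-O(1) instead of a linear walk over the slab list. */

#define SUPERSLAB_SIZE   (512u * 1024u)
#define SUPERSLAB_MASK   (~(uintptr_t)(SUPERSLAB_SIZE - 1))
#define SLAB_INDEX_BITS  8
#define SLAB_INDEX_SIZE  (1u << SLAB_INDEX_BITS)

typedef struct superslab {
    uintptr_t base;                 /* 512 KiB-aligned start address */
    struct superslab *next;         /* chain for hash collisions */
    /* ... per-slab metadata (free counts, size class map, ...) ... */
} superslab_t;

static superslab_t *slab_index[SLAB_INDEX_SIZE];

static inline unsigned slab_hash(uintptr_t base)
{
    /* Mix the bits above the 512 KiB offset into a small table index. */
    return (unsigned)((base >> 19) * 2654435761u) & (SLAB_INDEX_SIZE - 1);
}

static void slab_index_insert(superslab_t *ss)
{
    unsigned h = slab_hash(ss->base);
    ss->next = slab_index[h];
    slab_index[h] = ss;
}

/* Find the SuperSlab owning an arbitrary interior pointer. */
static superslab_t *slab_index_lookup(const void *p)
{
    uintptr_t base = (uintptr_t)p & SUPERSLAB_MASK;
    for (superslab_t *ss = slab_index[slab_hash(base)]; ss; ss = ss->next)
        if (ss->base == base)
            return ss;
    return NULL;                    /* not one of ours (e.g. a legacy alloc) */
}

int main(void)
{
    static superslab_t ss;
    ss.base = (uintptr_t)0x40000000u;   /* pretend 512 KiB-aligned slab */
    slab_index_insert(&ss);
    return slab_index_lookup((void *)(ss.base + 1234)) == &ss ? 0 : 1;
}
```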
Recommendations for Phase 9+
Phase 9: CRITICAL - SuperSlab Investigation
Goal: Understand why SuperSlab performance collapses at WS8192
Tasks:
- Add detailed profiling (a sketch of such instrumentation follows this list):
  - SuperSlab lookup latency distribution
  - Cache miss rates (L1, L2, L3)
  - TLB miss rates
  - Fragmentation metrics
- Measure SuperSlab statistics:
  - Number of active SuperSlabs at WS256 vs WS8192
  - Average slab list length
  - Hit rate for first-slab lookup
- Experiment with SuperSlab sizes:
  - Try 1MB, 2MB, 4MB SuperSlabs
  - Measure impact on performance
- Analyze "shared_fail→legacy" events:
  - Why do shared slabs fail?
  - How often does it happen?
  - Can we pre-allocate more capacity?
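For the profiling and statistics items above, a lightweight starting point is a cycle-counter histogram around the lookup routine plus a handful of counters dumped at exit. The sketch below uses the x86 __rdtsc() intrinsic and entirely hypothetical names; the real lookup is replaced by a stub so the snippet stands alone.

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86-64 */

/* Hypothetical instrumentation for the Phase 9 profiling/statistics tasks:
 * a coarse power-of-two histogram of SuperSlab lookup latency in TSC ticks,
 * plus counters for the statistics items.  None of these names exist in
 * HAKMEM today. */

#define LAT_BUCKETS 16

static uint64_t lookup_latency_hist[LAT_BUCKETS]; /* bucket i ~ 2^i ticks */
static uint64_t active_superslabs;                /* maintained by the allocator */
static uint64_t shared_fail_count;                /* bumped on shared_fail->legacy */

static void *superslab_lookup(const void *p)      /* stub for the real lookup */
{
    (void)p;
    return NULL;
}

static int log2_bucket(uint64_t ticks)
{
    int b = 0;
    while (ticks > 1 && b < LAT_BUCKETS - 1) { ticks >>= 1; b++; }
    return b;
}

static void *superslab_lookup_profiled(const void *p)
{
    uint64_t t0 = __rdtsc();
    void *ss = superslab_lookup(p);
    lookup_latency_hist[log2_bucket(__rdtsc() - t0)]++;
    return ss;
}

static void superslab_stats_dump(void)
{
    fprintf(stderr, "[SS_STATS] active_slabs=%llu shared_fail=%llu\n",
            (unsigned long long)active_superslabs,
            (unsigned long long)shared_fail_count);
    for (int i = 0; i < LAT_BUCKETS; i++)
        if (lookup_latency_hist[i])
            fprintf(stderr, "[SS_STATS] lookup ~2^%d ticks: %llu samples\n",
                    i, (unsigned long long)lookup_latency_hist[i]);
}

int main(void)
{
    char dummy = 0;
    for (int i = 0; i < 100000; i++)
        (void)superslab_lookup_profiled(&dummy);
    superslab_stats_dump();
    return 0;
}
```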
Phase 10: Fast Path Optimization
Goal: Close 9.4% gap at WS256
Tasks:
- Profile TLS drain overhead
- Experiment with drain intervals (4096, 8192, disable)
- Inline more aggressively
- Add __builtin_expect hints for common paths (see the sketch after this list)
- Reduce branch mispredictions
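For the __builtin_expect item, the usual pattern is a pair of LIKELY/UNLIKELY macros applied to fast-path branches (the size check, the empty-free-list check) so the compiler lays out the common case as straight-line code. The function and variable names below are hypothetical; only the hinting pattern itself is the point.

```c
#include <stddef.h>
#include <stdlib.h>

/* The usual way __builtin_expect hints are added to an allocator fast path:
 * wrap the branches so the compiler lays out the common case as fall-through.
 * small_alloc_fast / slow_path_alloc / tls_freelist are hypothetical names. */

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

#define SMALL_LIMIT 1024

static __thread void *tls_freelist[64];          /* stand-in TLS free lists */

static void *slow_path_alloc(size_t n) { return malloc(n); }   /* stub */

void *small_alloc_fast(size_t n, int cls)
{
    /* Most requests are small and the TLS list usually has a block, so both
     * hints mark the slow cases as unlikely; mispredicted or poorly laid-out
     * branches here cost cycles on every single allocation. */
    if (UNLIKELY(n > SMALL_LIMIT))
        return slow_path_alloc(n);

    void *p = tls_freelist[cls];
    if (UNLIKELY(p == NULL))
        return slow_path_alloc(n);               /* refill path */

    tls_freelist[cls] = *(void **)p;             /* pop the free list */
    return p;
}

int main(void)
{
    void *p = small_alloc_fast(64, 3);
    free(p);
    return 0;
}
```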
Phase 11: Architecture Re-evaluation
Goal: Decide if SuperSlab model is viable
Decision point: If Phase 9 can't get within 50% of System malloc at WS8192, consider:
- Hybrid approach: TLS fast path + different backend (jemalloc-style arenas?)
- Abandon SuperSlab: Switch to segment-based design like mimalloc
- Radical simplification: Focus on specific use case (small allocations only?)
Success Criteria for Phase 9
Minimum acceptable improvements:
- WS256: 79.2 → 85+ M ops/s (+7% improvement, match System malloc)
- WS8192: 16.5 → 35+ M ops/s (+112% improvement, get to 50% of System malloc)
Stretch goals:
- WS256: 90+ M ops/s (close to System malloc)
- WS8192: 45+ M ops/s (80% of System malloc)
Raw Data
All benchmark runs completed successfully with good statistical stability (StdDev < 2.5%).
Working Set 256
HAKMEM: [78.5, 78.1, 77.0, 81.1, 81.2] M ops/s
System: [87.3, 86.5, 87.5, 85.3, 86.6] M ops/s
mimalloc: [115.8, 115.2, 116.2, 112.5, 115.0] M ops/s
Working Set 8192
HAKMEM: [16.5, 15.8, 16.9, 16.7, 16.6] M ops/s
System: [56.1, 57.8, 57.0, 57.7, 56.7] M ops/s
mimalloc: [96.8, 96.1, 95.5, 97.7, 96.3] M ops/s
Conclusion
Phase 8 benchmarking reveals fundamental issues with HAKMEM's current architecture:
- SuperSlab scaling is broken - 4.8x performance degradation is unacceptable
- Fast path has overhead - Even hot cache shows 9.4% gap
- Competition is fierce - mimalloc is 5.85x faster at realistic workloads
Next priority: Phase 9 MUST focus on understanding and fixing SuperSlab scalability. Without addressing this core issue, HAKMEM cannot compete with production allocators.
The benchmark data is statistically robust (low variance) and reproducible. The performance gaps are real and significant.