This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component, allowing precise identification of the root cause of Out-Of-Memory (OOM) issues related to ACE allocations. Key changes include:
- **ACE Tracing Implementation**:
  - Added an environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented the relevant allocation routines to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected the build so the new tracing support is properly linked, resolving a link error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated the wrapper's behavior under LD_PRELOAD, particularly its interaction with the associated runtime checks.
  - Enabled debugging flags for the test environment to prevent unintended fallbacks to the system allocator for non-tiny allocations, allowing comprehensive testing of the allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution-flow issues within interception and routing; these temporary logs have since been removed.
  - Added a test harness to facilitate exercising the tracing features.

This feature will significantly aid in diagnosing and resolving allocation-related OOM issues by providing clear insights into the failure pathways.
Phase 8 - Technical Analysis and Root Cause Investigation
Executive Summary
Phase 8 comprehensive benchmarking reveals critical performance issues with HAKMEM:
- Working Set 256 (Hot Cache): 9.4% slower than System malloc, 45.2% slower than mimalloc
- Working Set 8192 (Realistic): 246% slower than System malloc, 485% slower than mimalloc
The most alarming finding: HAKMEM experiences 4.8x performance degradation when moving from hot cache to realistic workloads, compared to only 1.5x for System malloc and 1.2x for mimalloc.
Benchmark Results Summary
Working Set 256 (Hot Cache)
| Allocator | Avg (M ops/s) | StdDev | vs HAKMEM |
|---|---|---|---|
| HAKMEM Phase 8 | 79.2 | ±2.4% | 1.00x |
| System malloc | 86.7 | ±1.0% | 1.09x |
| mimalloc | 114.9 | ±1.2% | 1.45x |
Working Set 8192 (Realistic Workload)
| Allocator | Avg (M ops/s) | StdDev | vs HAKMEM |
|---|---|---|---|
| HAKMEM Phase 8 | 16.5 | ±2.5% | 1.00x |
| System malloc | 57.1 | ±1.3% | 3.46x |
| mimalloc | 96.5 | ±0.9% | 5.85x |
Scalability Analysis
Performance degradation from WS256 → WS8192:
- HAKMEM: 4.80x slowdown (79.2 → 16.5 M ops/s)
- System: 1.52x slowdown (86.7 → 57.1 M ops/s)
- mimalloc: 1.19x slowdown (114.9 → 96.5 M ops/s)
HAKMEM degrades 3.16x MORE than System malloc and 4.03x MORE than mimalloc.
Root Cause Analysis
Evidence from Debug Logs
The benchmark output shows critical issues:
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
Analysis: Repeated "shared_fail→legacy" messages indicate SuperSlab exhaustion, forcing a fallback to the legacy allocator path. This happens 4 times during the WS8192 benchmark, suggesting severe SuperSlab fragmentation or capacity issues.
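HAKMEM's actual backend code is not reproduced in this report, but the control flow implied by these log lines is straightforward. The following is a minimal sketch of a "try the shared SuperSlab, then fall back to legacy" path; every name in it (ss_backend_alloc, ss_shared_alloc, legacy_alloc) is hypothetical, and the stub always fails so the sketch reproduces the shape of the log above.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of the "shared_fail->legacy" fallback path.
 * None of these names are taken from HAKMEM; they only illustrate the
 * control flow implied by the "[SS_BACKEND] shared_fail->legacy" log lines.
 * The stub below simulates an exhausted shared pool by always failing. */

static void *ss_shared_alloc(int size_class)
{
    (void)size_class;
    return NULL;                 /* stub: shared SuperSlab pool is exhausted */
}

static void *legacy_alloc(size_t size)
{
    return malloc(size);         /* stand-in for the slower legacy path */
}

static void *ss_backend_alloc(int size_class, size_t size)
{
    void *p = ss_shared_alloc(size_class);
    if (p != NULL)
        return p;                /* fast path: shared pool had a free slot */

    /* The shared SuperSlab could not satisfy this size class, so the request
     * is routed to the legacy allocator; each time this branch is taken,
     * the fallback cost estimated later in this document is paid. */
    fprintf(stderr, "[SS_BACKEND] shared_fail->legacy cls=%d\n", size_class);
    return legacy_alloc(size);
}

int main(void)
{
    void *p = ss_backend_alloc(7, 512);  /* cls=7, as in the captured logs */
    free(p);
    return 0;
}
```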
Issue 1: SuperSlab Architecture Doesn't Scale
Symptoms:
- Performance collapses from 79.2 to 16.5 M ops/s (4.8x degradation)
- Shared SuperSlabs fail repeatedly
- TLS_SLL_HDR_RESET events occur (slab header corruption?)
Root Causes (Hypotheses):
- SuperSlab Capacity: Current 512KB SuperSlabs may be too small for WS8192 (see the arithmetic sketch after this list)
  - 8192 objects × (16-1024 bytes average) = ~4-8MB working set
  - Multiple SuperSlabs needed → increased lookup overhead
- Fragmentation: SuperSlabs become fragmented with larger working sets
  - Free slots scattered across multiple SuperSlabs
  - Linear search through the slab list becomes expensive
- TLB Pressure: More SuperSlabs = more page table entries
  - System malloc uses fewer, larger arenas
  - HAKMEM's 512KB slabs create more TLB misses
- Cache Pollution: Slab metadata pollutes L1/L2 cache
  - Each SuperSlab has metadata overhead
  - More slabs = more metadata = less cache for actual data
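To make the capacity hypothesis concrete, the arithmetic from the first item can be spelled out. The 512KB SuperSlab size comes from the text above; the average object sizes below are assumptions spanning the 16-1024 byte range. Even at modest averages, WS8192 spans several SuperSlabs while WS256 fits comfortably inside one.

```c
#include <stdio.h>

/* Back-of-envelope check of the SuperSlab capacity hypothesis.
 * The 512 KiB SuperSlab size is taken from the text; the average object
 * sizes are assumptions within the 16-1024 byte range mentioned above. */
int main(void)
{
    const double superslab_bytes = 512.0 * 1024.0;
    const int    working_sets[]  = { 256, 8192 };
    const double avg_sizes[]     = { 64.0, 256.0, 512.0 };   /* assumed averages */

    for (int w = 0; w < 2; w++) {
        for (int s = 0; s < 3; s++) {
            double ws_bytes = working_sets[w] * avg_sizes[s];
            double slabs    = ws_bytes / superslab_bytes;
            printf("WS%-5d avg=%4.0fB -> %8.1f KiB working set, ~%.2f SuperSlabs\n",
                   working_sets[w], avg_sizes[s], ws_bytes / 1024.0, slabs);
        }
    }
    return 0;
}
```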
Issue 2: TLS Drain Overhead
Debug logs show:
[TLS_SLL_DRAIN] Drain ENABLED (default)
[TLS_SLL_DRAIN] Interval=2048 (default)
Analysis: Even in hot cache (WS256), HAKMEM is 9.4% slower than System malloc. This suggests fast-path overhead from TLS drain checks happening every 2048 operations.
Evidence:
- WS256 should fit entirely in cache, yet HAKMEM still lags
- System malloc has simpler fast path (no drain logic)
- Under the 3.5 GHz assumption used in the cycle budget below, the 9.4% gap corresponds to roughly 4 extra cycles per allocation (≈44 vs ≈40 cycles); a sketch of the kind of per-allocation check that could account for cycles on this scale follows.
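HAKMEM's drain implementation is not shown in this report, so the sketch below only illustrates the kind of per-allocation bookkeeping an Interval=2048 drain policy implies: a thread-local countdown plus a rarely taken branch on the fast path. All names (tls_fast_alloc, tls_sll_drain, tls_drain_countdown) are hypothetical, and the underlying allocation is stubbed with malloc.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch of a per-allocation TLS drain check, matching the
 * "[TLS_SLL_DRAIN] Interval=2048" setting reported in the logs.
 * Names are invented for illustration; they are not HAKMEM identifiers. */

#define TLS_DRAIN_INTERVAL 2048

static __thread unsigned tls_drain_countdown = TLS_DRAIN_INTERVAL;

/* Stubs standing in for the real thread-local free-list machinery. */
static void tls_sll_drain(void) { /* would flush deferred frees here */ }
static void *tls_fast_alloc_impl(size_t n) { return malloc(n); }

void *tls_fast_alloc(size_t n)
{
    /* This decrement + compare + (usually not-taken) branch runs on every
     * allocation, even when the working set fits in cache.  A few cycles
     * per call is one plausible source of a mid-single-digit percent
     * throughput gap on a ~40-cycle fast path. */
    if (--tls_drain_countdown == 0) {
        tls_drain_countdown = TLS_DRAIN_INTERVAL;
        tls_sll_drain();             /* slow path, once every 2048 calls */
    }
    return tls_fast_alloc_impl(n);
}

int main(void)
{
    void *p = tls_fast_alloc(64);
    free(p);
    return 0;
}
```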
Issue 3: TLS_SLL_HDR_RESET Events
[TLS_SLL_HDR_RESET] cls=6 base=0x790999b35a0e got=0x00 expect=0xa6 count=0
Analysis: Header reset events suggest slab list corruption or validation failures. This shouldn't happen in normal operation and indicates potential race conditions or memory corruption.
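The got=/expect= fields suggest a header byte that encodes the size class and is validated before the thread-local list is trusted. The sketch below is a hypothetical reconstruction of that kind of check; the (0xa0 | cls) tag encoding is purely an assumption chosen because 0xa6 appears together with cls=6 in the log, and none of the names are HAKMEM identifiers.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reconstruction of the header validation behind the
 * "[TLS_SLL_HDR_RESET] cls=6 ... got=0x00 expect=0xa6" log line.
 * The (0xa0 | cls) tag encoding is an assumption; HAKMEM's real layout
 * may differ. */

#define HDR_TAG(cls)  ((uint8_t)(0xa0u | (unsigned)(cls)))

/* Returns 1 if the header byte at 'base' matches the expected tag for this
 * size class; otherwise logs the mismatch and tells the caller to reset
 * (rebuild) its thread-local singly-linked list. */
static int tls_sll_hdr_check(int cls, const uint8_t *base, unsigned count)
{
    uint8_t got    = base[0];
    uint8_t expect = HDR_TAG(cls);

    if (got == expect)
        return 1;

    /* A zero byte where the tag should be points at a freshly mapped or
     * overwritten page: either stale list pointers (a lifetime bug) or a
     * race between the thread freeing and the thread reusing the block. */
    fprintf(stderr,
            "[TLS_SLL_HDR_RESET] cls=%d base=%p got=0x%02x expect=0x%02x count=%u\n",
            cls, (const void *)base, got, expect, count);
    return 0;
}

int main(void)
{
    uint8_t fake_block[16] = { 0 };               /* header byte is 0x00 */
    (void)tls_sll_hdr_check(6, fake_block, 0);    /* reproduces the log shape */
    return 0;
}
```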
Performance Breakdown
Where HAKMEM Loses Performance (WS8192)
Estimated cycle budget (assuming a 3.5 GHz CPU; the conversion is reproduced in the snippet after this list):
- HAKMEM: 16.5 M ops/s = ~212 cycles/operation
- System: 57.1 M ops/s = ~61 cycles/operation
- mimalloc: 96.5 M ops/s = ~36 cycles/operation
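These figures are plain conversions from measured throughput to cycles per operation under the stated 3.5 GHz assumption; the snippet below reproduces them.

```c
#include <stdio.h>

/* Reproduces the cycles-per-operation estimates above:
 * cycles/op = clock_hz / ops_per_second, assuming a 3.5 GHz clock. */
int main(void)
{
    const double clock_hz = 3.5e9;
    const struct { const char *name; double mops; } results[] = {
        { "HAKMEM  ", 16.5 },
        { "System  ", 57.1 },
        { "mimalloc", 96.5 },
    };

    for (int i = 0; i < 3; i++) {
        double cycles = clock_hz / (results[i].mops * 1e6);
        printf("%s %5.1f M ops/s -> ~%.0f cycles/op\n",
               results[i].name, results[i].mops, cycles);
    }
    return 0;
}
```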
Gap Analysis:
- HAKMEM uses 151 extra cycles vs System malloc
- HAKMEM uses 176 extra cycles vs mimalloc
Where do these cycles go?
- SuperSlab Lookup (~50-80 cycles)
  - Linear search through the slab list
  - Cache misses on slab metadata
  - TLB misses on slab pages
- TLS Drain Logic (~10-15 cycles)
  - Drain counter checks on every allocation
  - Branch mispredictions
- Fragmentation Overhead (~30-50 cycles)
  - Walking free lists
  - Finding suitable free blocks
- Legacy Fallback (~50-100 cycles when triggered)
  - System malloc/mmap calls
  - Syscall and page-fault overhead on kernel entry
Competitive Analysis
Why System malloc Wins (3.46x faster)
- Arena-based design: Fewer, larger memory regions
- Thread caching: Similar to HAKMEM TLS but better tuned
- Mature optimization: Decades of tuning
- Simple fast path: No drain logic, no SuperSlab lookup
Why mimalloc Dominates (5.85x faster)
- Segment-based design: Optimal for multi-threaded workloads
- Free list sharding: Reduces contention (the general idea is sketched after this list)
- Aggressive inlining: Fast path is 15-20 instructions
- No locks in fast path: Lock-free for thread-local allocations
- Delayed freeing: Like HAKMEM drain but more efficient
- Minimal metadata: Less cache pollution
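Of these, free list sharding is the most structural difference. The toy code below illustrates the general idea only: per-thread, per-size-class list heads so the common case touches no shared state and no locks. It is a simplified illustration, not mimalloc's or HAKMEM's actual layout, and all names are invented.

```c
#include <stddef.h>

/* Simplified illustration of free-list sharding: each thread-local heap keeps
 * one free-list head per size class, so the common case touches only
 * thread-local memory.  This shows the general idea referenced above, not
 * mimalloc's (or HAKMEM's) actual data structures. */

#define NUM_SIZE_CLASSES 64

typedef struct free_block {
    struct free_block *next;
} free_block_t;

typedef struct thread_heap {
    /* One sharded free list per size class instead of a single shared list:
     * less contention, shorter lists to walk, less metadata per lookup. */
    free_block_t *free_lists[NUM_SIZE_CLASSES];
} thread_heap_t;

static __thread thread_heap_t local_heap;

/* Pop from the thread-local shard: a handful of instructions, no locks. */
static inline void *shard_alloc(int cls)
{
    free_block_t *b = local_heap.free_lists[cls];
    if (b != NULL)
        local_heap.free_lists[cls] = b->next;
    return b;   /* NULL means this shard is empty and a refill is needed */
}

/* Push onto the thread-local shard on free. */
static inline void shard_free(int cls, void *p)
{
    free_block_t *b = (free_block_t *)p;
    b->next = local_heap.free_lists[cls];
    local_heap.free_lists[cls] = b;
}

int main(void)
{
    static void *block_storage[4];        /* pretend this came from a slab */
    shard_free(3, block_storage);         /* hypothetical size class 3 */
    void *p = shard_alloc(3);
    return p == (void *)block_storage ? 0 : 1;
}
```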
Critical Gaps to Address
Gap 1: Fast Path Performance (9.4% slower at WS256)
Target: Match System malloc at the hot cache workload.
Required improvement: +9.4% = +7.5 M ops/s
Action items:
- Profile TLS drain overhead
- Inline critical functions more aggressively
- Reduce branch mispredictions
- Consider removing drain logic or making it lazy
Gap 2: Scalability (246% slower at WS8192)
Target: Get within 20% of System malloc at the realistic workload.
Required improvement: 16.5 → ~45.7 M ops/s (a 2.77x speedup); matching System malloc outright would require +246% = +40.6 M ops/s (a 3.46x speedup).
Action items:
- Fix SuperSlab scaling
- Reduce fragmentation
- Optimize SuperSlab lookup (hash table instead of linear search? see the sketch after this list)
- Reduce TLB pressure (larger SuperSlabs or better placement)
- Profile cache misses
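One concrete option for the lookup item above is to replace the linear slab-list walk with a direct-mapped hash keyed on the SuperSlab base address. The sketch below assumes 512KB-aligned SuperSlabs, so an interior pointer can be masked down to its slab base; every identifier in it is hypothetical, not existing HAKMEM code.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical hashed index for SuperSlab lookup.  Assuming a 512 KiB
 * SuperSlab is 512 KiB-aligned, any pointer inside it can be mapped to its
 * slab base with a mask, and the base hashed into a small table --
 * near-O(1) instead of a linear walk over the slab list. */

#define SUPERSLAB_SIZE   (512u * 1024u)
#define SUPERSLAB_MASK   (~(uintptr_t)(SUPERSLAB_SIZE - 1))
#define SLAB_INDEX_BITS  8
#define SLAB_INDEX_SIZE  (1u << SLAB_INDEX_BITS)

typedef struct superslab {
    uintptr_t base;                 /* 512 KiB-aligned start address */
    struct superslab *next;         /* chain for hash collisions */
    /* ... per-slab metadata (free counts, size class map, ...) ... */
} superslab_t;

static superslab_t *slab_index[SLAB_INDEX_SIZE];

static inline unsigned slab_hash(uintptr_t base)
{
    /* Mix the bits above the 512 KiB offset into a small table index. */
    return (unsigned)((base >> 19) * 2654435761u) & (SLAB_INDEX_SIZE - 1);
}

static void slab_index_insert(superslab_t *ss)
{
    unsigned h = slab_hash(ss->base);
    ss->next = slab_index[h];
    slab_index[h] = ss;
}

/* Find the SuperSlab owning an arbitrary interior pointer. */
static superslab_t *slab_index_lookup(const void *p)
{
    uintptr_t base = (uintptr_t)p & SUPERSLAB_MASK;
    for (superslab_t *ss = slab_index[slab_hash(base)]; ss; ss = ss->next)
        if (ss->base == base)
            return ss;
    return NULL;                    /* not one of ours (e.g. a legacy alloc) */
}

int main(void)
{
    static superslab_t ss;
    ss.base = (uintptr_t)0x40000000u;   /* pretend 512 KiB-aligned slab */
    slab_index_insert(&ss);
    return slab_index_lookup((void *)(ss.base + 1234)) == &ss ? 0 : 1;
}
```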
Recommendations for Phase 9+
Phase 9: CRITICAL - SuperSlab Investigation
Goal: Understand why SuperSlab performance collapses at WS8192
Tasks:
- Add detailed profiling (a sketch of such instrumentation follows this list):
  - SuperSlab lookup latency distribution
  - Cache miss rates (L1, L2, L3)
  - TLB miss rates
  - Fragmentation metrics
- Measure SuperSlab statistics:
  - Number of active SuperSlabs at WS256 vs WS8192
  - Average slab list length
  - Hit rate for first-slab lookup
- Experiment with SuperSlab sizes:
  - Try 1MB, 2MB, 4MB SuperSlabs
  - Measure impact on performance
- Analyze "shared_fail→legacy" events:
  - Why do shared slabs fail?
  - How often does it happen?
  - Can we pre-allocate more capacity?
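For the profiling and statistics items above, a lightweight starting point is a cycle-counter histogram around the lookup routine plus a handful of counters dumped at exit. The sketch below uses the x86 __rdtsc() intrinsic and entirely hypothetical names; the real lookup is replaced by a stub so the snippet stands alone.

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86-64 */

/* Hypothetical instrumentation for the Phase 9 profiling/statistics tasks:
 * a coarse power-of-two histogram of SuperSlab lookup latency in TSC ticks,
 * plus counters for the statistics items.  None of these names exist in
 * HAKMEM today. */

#define LAT_BUCKETS 16

static uint64_t lookup_latency_hist[LAT_BUCKETS]; /* bucket i ~ 2^i ticks */
static uint64_t active_superslabs;                /* maintained by the allocator */
static uint64_t shared_fail_count;                /* bumped on shared_fail->legacy */

static void *superslab_lookup(const void *p)      /* stub for the real lookup */
{
    (void)p;
    return NULL;
}

static int log2_bucket(uint64_t ticks)
{
    int b = 0;
    while (ticks > 1 && b < LAT_BUCKETS - 1) { ticks >>= 1; b++; }
    return b;
}

static void *superslab_lookup_profiled(const void *p)
{
    uint64_t t0 = __rdtsc();
    void *ss = superslab_lookup(p);
    lookup_latency_hist[log2_bucket(__rdtsc() - t0)]++;
    return ss;
}

static void superslab_stats_dump(void)
{
    fprintf(stderr, "[SS_STATS] active_slabs=%llu shared_fail=%llu\n",
            (unsigned long long)active_superslabs,
            (unsigned long long)shared_fail_count);
    for (int i = 0; i < LAT_BUCKETS; i++)
        if (lookup_latency_hist[i])
            fprintf(stderr, "[SS_STATS] lookup ~2^%d ticks: %llu samples\n",
                    i, (unsigned long long)lookup_latency_hist[i]);
}

int main(void)
{
    char dummy = 0;
    for (int i = 0; i < 100000; i++)
        (void)superslab_lookup_profiled(&dummy);
    superslab_stats_dump();
    return 0;
}
```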
Phase 10: Fast Path Optimization
Goal: Close 9.4% gap at WS256
Tasks:
- Profile TLS drain overhead
- Experiment with drain intervals (4096, 8192, disable)
- Inline more aggressively
- Add __builtin_expect hints for common paths (see the sketch after this list)
- Reduce branch mispredictions
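For the __builtin_expect item, the usual pattern is a pair of LIKELY/UNLIKELY macros applied to fast-path branches (the size check, the empty-free-list check) so the compiler lays out the common case as straight-line code. The function and variable names below are hypothetical; only the hinting pattern itself is the point.

```c
#include <stddef.h>
#include <stdlib.h>

/* The usual way __builtin_expect hints are added to an allocator fast path:
 * wrap the branches so the compiler lays out the common case as fall-through.
 * small_alloc_fast / slow_path_alloc / tls_freelist are hypothetical names. */

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

#define SMALL_LIMIT 1024

static __thread void *tls_freelist[64];          /* stand-in TLS free lists */

static void *slow_path_alloc(size_t n) { return malloc(n); }   /* stub */

void *small_alloc_fast(size_t n, int cls)
{
    /* Most requests are small and the TLS list usually has a block, so both
     * hints mark the slow cases as unlikely; mispredicted or poorly laid-out
     * branches here cost cycles on every single allocation. */
    if (UNLIKELY(n > SMALL_LIMIT))
        return slow_path_alloc(n);

    void *p = tls_freelist[cls];
    if (UNLIKELY(p == NULL))
        return slow_path_alloc(n);               /* refill path */

    tls_freelist[cls] = *(void **)p;             /* pop the free list */
    return p;
}

int main(void)
{
    void *p = small_alloc_fast(64, 3);
    free(p);
    return 0;
}
```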
Phase 11: Architecture Re-evaluation
Goal: Decide if SuperSlab model is viable
Decision point: If Phase 9 can't get within 50% of System malloc at WS8192, consider:
- Hybrid approach: TLS fast path + different backend (jemalloc-style arenas?)
- Abandon SuperSlab: Switch to segment-based design like mimalloc
- Radical simplification: Focus on specific use case (small allocations only?)
Success Criteria for Phase 9
Minimum acceptable improvements:
- WS256: 79.2 → 85+ M ops/s (+7% improvement, match System malloc)
- WS8192: 16.5 → 35+ M ops/s (+112% improvement, get to 50% of System malloc)
Stretch goals:
- WS256: 90+ M ops/s (close to System malloc)
- WS8192: 45+ M ops/s (80% of System malloc)
Raw Data
All benchmark runs completed successfully with good statistical stability (StdDev < 2.5%).
Working Set 256
HAKMEM: [78.5, 78.1, 77.0, 81.1, 81.2] M ops/s
System: [87.3, 86.5, 87.5, 85.3, 86.6] M ops/s
mimalloc: [115.8, 115.2, 116.2, 112.5, 115.0] M ops/s
Working Set 8192
HAKMEM: [16.5, 15.8, 16.9, 16.7, 16.6] M ops/s
System: [56.1, 57.8, 57.0, 57.7, 56.7] M ops/s
mimalloc: [96.8, 96.1, 95.5, 97.7, 96.3] M ops/s
Conclusion
Phase 8 benchmarking reveals fundamental issues with HAKMEM's current architecture:
- SuperSlab scaling is broken - 4.8x performance degradation is unacceptable
- Fast path has overhead - Even hot cache shows 9.4% gap
- Competition is fierce - mimalloc is 5.85x faster at realistic workloads
Next priority: Phase 9 MUST focus on understanding and fixing SuperSlab scalability. Without addressing this core issue, HAKMEM cannot compete with production allocators.
The benchmark data is statistically robust (low variance) and reproducible. The performance gaps are real and significant.