hakmem/PHASE8_COMPREHENSIVE_BENCHMARK_REPORT.md
Commit 4ef0171bc0 by Moe Charm (CI): feat: Add ACE allocation failure tracing and debug hooks
This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component, allowing precise identification of the root cause of Out-Of-Memory (OOM) issues related to ACE allocations.

Key changes include:
- **ACE Tracing Implementation**:
  - Added an environment variable to enable or disable detailed logging of allocation failures.
  - Instrumented the ACE allocation paths to distinguish between "Threshold" (size-class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure) as the cause of a failed allocation.
- **Build System Fixes**:
  - Corrected the build configuration so that the new tracing code is properly linked, resolving a link error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated the wrapper's behavior under LD_PRELOAD, particularly its interaction with the allocator's initialization and routing checks.
  - Enabled debugging flags in the test environment to prevent unintended fallbacks to the system allocator for non-tiny allocations, allowing comprehensive testing of the HAKMEM allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution-flow issues in malloc interception and allocation routing; these temporary logs have since been removed.
  - Created a test script to exercise the tracing features.

This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in HAKMEM by providing clear insight into the failure pathways.
Committed: 2025-12-01 16:37:59 +09:00
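
The commit text above elides the concrete identifiers, so the following is only a minimal sketch of what such an environment-gated failure trace could look like; every name in it (HAKMEM_ACE_TRACE, ace_trace_fail, the ace_fail_t enum) is hypothetical rather than HAKMEM's actual API.

```c
#include <stdio.h>
#include <stdlib.h>

/* Failure classes named in the commit description:
 *   Threshold  - requested size falls outside the ACE size classes
 *   Exhaustion - the matching pool is depleted
 *   MapFail    - the OS refused more memory (e.g. mmap failed)
 */
typedef enum { ACE_FAIL_THRESHOLD, ACE_FAIL_EXHAUSTION, ACE_FAIL_MAPFAIL } ace_fail_t;

/* Read the toggle once and cache it; re-running getenv() on every
 * failure would be wasteful under memory pressure. */
static int ace_trace_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        const char *v = getenv("HAKMEM_ACE_TRACE");   /* hypothetical name */
        cached = (v != NULL && *v != '\0' && *v != '0');
    }
    return cached;
}

/* Called from the allocation paths at the point of failure. */
static void ace_trace_fail(ace_fail_t why, size_t size) {
    static const char *const names[] = { "Threshold", "Exhaustion", "MapFail" };
    if (ace_trace_enabled())
        fprintf(stderr, "[ACE] alloc fail: %s (size=%zu)\n", names[why], size);
}

int main(void) {
    ace_trace_fail(ACE_FAIL_EXHAUSTION, 4096);   /* demo call */
    return 0;
}
```

Run with the variable set (e.g. HAKMEM_ACE_TRACE=1 ./a.out) to see the failure line; with it unset, the hook stays silent, keeping the cost of disabled tracing to one cached branch.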


================================================================================
Phase 8 Comprehensive Allocator Comparison - Analysis

Working Set 256 (Hot cache, Phase 7 comparison)

Allocator        Avg (M ops/s)   StdDev (%)   Min - Max       vs HAKMEM
HAKMEM Phase 8   79.2            ± 2.4%       77.0 - 81.2     1.00x
System malloc    86.7            ± 1.0%       85.3 - 87.5     1.09x
mimalloc         114.9           ± 1.2%       112.5 - 116.2   1.45x

Working Set 8192 (Realistic workload)

Allocator        Avg (M ops/s)   StdDev (%)   Min - Max       vs HAKMEM
HAKMEM Phase 8   16.5            ± 2.5%       15.8 - 16.9     1.00x
System malloc    57.1            ± 1.3%       56.1 - 57.8     3.46x
mimalloc         96.5            ± 0.9%       95.5 - 97.7     5.85x
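
The report does not include the harness source, so the sketch below only illustrates the kind of fixed-working-set malloc/free churn loop that the numbers above imply. The slot-replacement policy, the 16-271 byte size range, and counting one free+malloc pair per op are assumptions, not the actual Phase 8 harness.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Keep ws live allocations; each iteration frees one random slot and
 * reallocates it, so throughput reflects steady-state churn at that
 * working-set size (256 or 8192 in the tables above). */
static double bench(size_t ws, size_t iters) {
    void **slots = calloc(ws, sizeof *slots);
    if (!slots) return 0.0;
    uint64_t x = 0x9e3779b97f4a7c15ull;            /* xorshift64 seed */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        size_t slot = (size_t)(x % ws);
        free(slots[slot]);                          /* free(NULL) is a no-op */
        slots[slot] = malloc(16 + (size_t)(x & 255));
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    for (size_t i = 0; i < ws; i++) free(slots[i]);
    free(slots);
    return (double)iters / secs / 1e6;              /* M pair-ops per second */
}

int main(void) {
    printf("ws=256 : %.1f M ops/s\n", bench(256,  20 * 1000 * 1000));
    printf("ws=8192: %.1f M ops/s\n", bench(8192, 20 * 1000 * 1000));
    return 0;
}
```

Each configuration would be run five times per allocator (see the raw data at the end), with the allocator under test swapped in via LD_PRELOAD.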

================================================================================
Performance Analysis

1. Working Set 256 (Hot Cache) Results

  • HAKMEM Phase 8: 79.2 M ops/s
  • System malloc: 86.7 M ops/s (1.09x faster)
  • mimalloc: 114.9 M ops/s (1.45x faster)

System malloc is 9.4% faster than HAKMEM, and mimalloc is 45.2% faster.

2. Working Set 8192 (Realistic Workload) Results

  • HAKMEM Phase 8: 16.5 M ops/s
  • System malloc: 57.1 M ops/s (3.46x faster)
  • mimalloc: 96.5 M ops/s (5.85x faster)

System malloc is 246.0% faster than HAKMEM (3.46x), and mimalloc is 484.9% faster (5.85x).
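
For clarity, the two gap notations used in this report reduce to simple arithmetic over the table averages; using the WS8192 System malloc pair as the worked example:

  speedup            = competitor / HAKMEM       = 57.1 / 16.5   = 3.46x   ("246% faster")
  throughput deficit = 1 - HAKMEM / competitor   = 1 - 16.5/57.1 = 71.1%   (the "-71.1%" gap below)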

================================================================================ Critical Observations

HAKMEM Performance Gap Analysis

Performance degradation from WS256 to WS8192:

  • HAKMEM: 4.80x slowdown (79.2 → 16.5 M ops/s)
  • System: 1.52x slowdown (86.7 → 57.1 M ops/s)
  • mimalloc: 1.19x slowdown (114.9 → 96.5 M ops/s)

HAKMEM degrades 3.16x MORE than System malloc (4.80 vs 1.52) and 4.03x MORE than mimalloc (4.80 vs 1.19).

Key Issues Identified

  1. Hot Cache Performance (WS256):

    • HAKMEM: 79.2 M ops/s
    • Gap: -8.6% vs System, -31.1% vs mimalloc
    • Issue: Fast-path overhead (TLS drain, SuperSlab lookup)
  2. Realistic Workload Performance (WS8192):

    • HAKMEM: 16.5 M ops/s
    • Gap: -71.1% vs System, -82.9% vs mimalloc
    • Issue: SEVERE - SuperSlab scaling, fragmentation, TLB pressure
  3. Scalability Problem:

    • HAKMEM loses 4.8x performance with larger working sets
    • System loses only 1.5x
    • mimalloc loses only 1.2x
    • Root cause: SuperSlab architecture doesn't scale well

================================================================================ Recommendations for Phase 9+

CRITICAL PRIORITY: Fix WS8192 Performance Gap

The 71-83% performance gap at realistic working sets is UNACCEPTABLE.

Immediate Actions Required:

  1. Investigate SuperSlab Scaling (Phase 9)

    • Profile: Why does performance collapse with larger working sets?
    • Hypothesis: SuperSlab lookup overhead, fragmentation, or TLB misses
    • Debug logs show 'shared_fail→legacy' messages → shared slab exhaustion
  2. Optimize Fast Path (Phase 10)

    • Even WS256 shows 9-46% gap vs competitors
    • Profile TLS drain overhead
    • Consider reducing drain frequency or lazy draining
  3. Consider Alternative Architectures (Phase 11)

    • Current SuperSlab model may be fundamentally flawed
    • Benchmark shows 4.8x degradation vs 1.5x for System malloc
    • May need hybrid approach: TLS fast path + different backend
  4. Specific Debug Actions

    • Analyze '[SS_BACKEND] shared_fail→legacy' logs
    • Measure SuperSlab hit rate at different working set sizes (see the counter sketch below)
    • Profile cache misses and TLB misses
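
For the hit-rate measurement in action 4, a minimal sketch of per-thread counters follows; the names (ss_hits, ss_count_lookup, ss_report) are hypothetical, since the report does not show HAKMEM's internals.

```c
#include <stdio.h>

/* Hypothetical per-thread counters; instrument the SuperSlab lookup
 * path to call ss_count_lookup(1) on a hit and ss_count_lookup(0)
 * when it falls back (e.g. the 'shared_fail->legacy' path). */
static __thread unsigned long ss_hits, ss_misses;

static inline void ss_count_lookup(int hit) {
    if (hit) ss_hits++; else ss_misses++;
}

/* Call at thread exit or at the end of a benchmark run. */
static void ss_report(void) {
    unsigned long total = ss_hits + ss_misses;
    if (total > 0)
        fprintf(stderr, "[SS] hit rate: %.2f%% (%lu of %lu lookups)\n",
                100.0 * (double)ss_hits / (double)total, ss_hits, total);
}

int main(void) {
    /* Simulated lookups: 3 hits, 1 miss -> 75.00% */
    ss_count_lookup(1); ss_count_lookup(1); ss_count_lookup(1); ss_count_lookup(0);
    ss_report();
    return 0;
}
```

Cache and TLB misses can be sampled externally with Linux perf, e.g. perf stat -e cache-misses,dTLB-load-misses <benchmark>.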

================================================================================
Raw Data (for reproducibility)

hakmem_256   : [78480676, 78099247, 77034450, 81120430, 81206714]
system_256   : [87329938, 86497843, 87514376, 85308713, 86630819]
mimalloc_256 : [115842807, 115180313, 116209200, 112542094, 114950573]
hakmem_8192  : [16504443, 15799180, 16916987, 16687009, 16582555]
system_8192  : [56095157, 57843156, 56999206, 57717254, 56720055]
mimalloc_8192: [96824532, 96117137, 95521242, 97733856, 96327554]
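
As a sanity check, a short sketch that recomputes the Avg and StdDev columns from these samples; the ± figures in the tables match a sample (n-1) standard deviation. Compile with the math library (cc stats.c -lm).

```c
#include <math.h>
#include <stdio.h>

/* Recompute the table statistics from the five raw runs above. */
static void stats(const char *name, const double s[5]) {
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < 5; i++) mean += s[i] / 5.0;
    for (int i = 0; i < 5; i++) var  += (s[i] - mean) * (s[i] - mean);
    double sd = sqrt(var / 4.0);                  /* sample std dev (n-1) */
    printf("%-13s: %6.1f M ops/s +/- %.1f%%\n",
           name, mean / 1e6, 100.0 * sd / mean);
}

int main(void) {
    const double hakmem_256[5] = { 78480676, 78099247, 77034450, 81120430, 81206714 };
    const double system_256[5] = { 87329938, 86497843, 87514376, 85308713, 86630819 };
    stats("hakmem_256", hakmem_256);              /* -> 79.2 +/- 2.4% */
    stats("system_256", system_256);              /* -> 86.7 +/- 1.0% */
    return 0;
}
```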

================================================================================
Analysis Complete