Files
hakmem/PHASE9_1_INVESTIGATION_SUMMARY.md
Moe Charm (CI) 4ef0171bc0 feat: Add ACE allocation failure tracing and debug hooks
This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.

Key changes include:
- **ACE Tracing Implementation**:
  - Added  environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented , , and  to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected  to ensure  is properly linked into , resolving an  error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated and understood the  wrapper's behavior under , particularly its interaction with  and  checks.
  - Enabled debugging flags for  environment to prevent unintended fallbacks to 's  for non-tiny allocations, allowing comprehensive testing of the  allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution flow issues within  interception and  routing. These temporary logs have been removed.
  - Created  to facilitate testing of the tracing features.

This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in  by providing clear insights into the failure pathways.
2025-12-01 16:37:59 +09:00

8.3 KiB
Raw Blame History

Phase 9-1 Performance Investigation - Executive Summary

Date: 2025-11-30 Status: Investigation Complete Investigator: Claude (Sonnet 4.5)


TL;DR

Phase 9-1 hash table optimization had ZERO performance impact because:

  1. SuperSlab is DISABLED by default - optimized code never runs
  2. Real bottleneck is kernel overhead (55%) - mmap/munmap syscalls dominate
  3. SuperSlab lookup is NOT in hot path - only 1.14% of total time

Fix: Address SuperSlab backend failures and kernel overhead, not lookup performance.


Performance Data

Benchmark Results

Configuration Throughput Change
Phase 8 Baseline 16.5 M ops/s -
Phase 9-1 (SuperSlab OFF) 16.5 M ops/s 0%
Phase 9-1 (SuperSlab ON) 16.4 M ops/s 0%

Conclusion: Hash table optimization made no difference.

Perf Profile (WS8192)

Component CPU % Cycles Status
Kernel (mmap/munmap) 55% ~117 BOTTLENECK
├─ munmap / VMA splitting 30% ~64 Critical issue
└─ mmap / page setup 11% ~23 Expensive
free() wrapper 11% ~24 Wrapper overhead
main() benchmark loop 8% ~16 Measurement artifact
unified_cache_refill 4% ~9 Page faults
Fast free TLS path 1% ~3 Actual work!
Other 21% ~43 Misc

Key Insight: Only 3 cycles are spent in actual allocation work. The rest is overhead (117 cycles in kernel alone!).


Root Cause Analysis

1. SuperSlab Disabled by Default

Code: core/box/hak_core_init.inc.h:172-173

if (!getenv("HAKMEM_TINY_USE_SUPERSLAB")) {
    setenv("HAKMEM_TINY_USE_SUPERSLAB", "0", 0);  // DISABLED
}

Impact: Hash table code is never executed during benchmark.

2. Backend Failures Trigger Legacy Path

Debug Logs:

[SS_BACKEND] shared_fail→legacy cls=7  (4 times)
[TLS_SLL_HDR_RESET] cls=6 base=0x... got=0x00 expect=0xa6

Analysis:

  • Class 7 (1024 bytes) SuperSlab exhaustion
  • Falls back to system malloc → mmap/munmap
  • 4 failures × ~1000 allocs = ~4000 kernel syscalls
  • Explains 30% munmap overhead in perf

3. Hash Table Not in Hot Path

Perf Evidence:

  • hak_super_lookup() does NOT appear in top 20 functions
  • ss_map_lookup() hash table code: 0% visible overhead
  • Fast TLS path dominates: only 1.14% total free time

Code Path:

free(ptr)
  └─ hak_tiny_free_fast_v2()  [1.14% total]
      ├─ Read header (class_idx)
      ├─ Push to TLS freelist  ← FAST PATH (3 cycles)
      └─ hak_super_lookup()     ← VALIDATION ONLY (not in hot path)

Where Phase 8 Analysis Went Wrong

Phase 8 Claimed (INCORRECT)

Claim Reality
"SuperSlab lookup = 50-80 cycles" Lookup not in hot path (0% perf profile)
"Major bottleneck" Kernel overhead (55%) is real bottleneck
"Expected: 16.5M → 23-25M ops/s" Actual: 16.5M → 16.5M ops/s (0% change)

What Was Missed

  1. No profiling before optimization - Assumed bottleneck without evidence
  2. Didn't check default config - SuperSlab disabled by default
  3. Ignored kernel overhead - 55% of time in syscalls
  4. Optimized wrong thing - Lookup is validation, not hot path

Priority 1: Fix SuperSlab Backend (Immediate)

Problem: Class 7 (1024 bytes) exhaustion → legacy fallback → kernel overhead

Solutions:

  1. Increase SuperSlab size: 512KB → 2MB

    • 4x more blocks per slab
    • Reduces fragmentation
    • Expected: -20% kernel overhead = +30-40% throughput
  2. Pre-allocate SuperSlabs at startup:

    hak_ss_prewarm_class(7, 16);  // 16 SuperSlabs for class 7
    
    • Eliminates startup mmap overhead
    • Expected: -30% kernel overhead = +50-70% throughput
  3. Enable SuperSlab by default (after fixing backend):

    setenv("HAKMEM_TINY_USE_SUPERSLAB", "1", 0);  // Enable
    

Expected Result: 16.5 M ops/s → 25-35 M ops/s (+50-110%)

Priority 2: Reduce Kernel Overhead (Short-term)

Problem: 55% of time in mmap/munmap syscalls

Solutions:

  1. Fix backend failures (see Priority 1)
  2. Increase batch size to amortize syscall cost
  3. Pre-allocate memory pool to avoid runtime mmap
  4. Monitor VMA count: cat /proc/self/maps | wc -l

Expected Result: Kernel overhead 55% → 10-20%

Priority 3: Optimize User-space (Long-term)

Problem: 11% in free() wrapper overhead

Solutions:

  1. Inline wrapper more aggressively
  2. Remove stack canary checks in hot path
  3. Optimize TLS access (direct segment access)

Expected Result: -5% overhead = +6-8% throughput


Performance Projections

Changes:

  • Fix class 7 exhaustion
  • Pre-allocate SuperSlab pool
  • Enable SuperSlab by default

Expected:

  • Kernel: 55% → 10% (-45%)
  • Throughput: 16.5 M → 45-50 M ops/s (+170-200%)

Scenario 2: Increase SuperSlab Size Only

Changes:

  • Change default: 512KB → 2MB
  • No other changes

Expected:

  • Kernel: 55% → 35% (-20%)
  • Throughput: 16.5 M → 25-30 M ops/s (+50-80%)

Scenario 3: Do Nothing (Status Quo)

Result: 16.5 M ops/s (no change)

  • Hash table infrastructure exists but provides no benefit
  • Kernel overhead continues to dominate
  • SuperSlab backend remains unstable

Lessons Learned

What Went Well

  1. Clean implementation: Hash table code is well-architected
  2. Box pattern compliance: Single responsibility, clear contracts
  3. No regressions: 0% performance change (neither better nor worse)
  4. Good infrastructure: Enables future optimizations

What Could Be Better

  1. Profile before optimizing: Always run perf first
  2. Verify assumptions: Check default configuration
  3. Focus on hot path: Optimize what's actually slow
  4. Measure kernel time: Don't ignore syscall overhead

Key Takeaway

"Premature optimization is the root of all evil. Profile first, optimize second."

  • Donald Knuth

Phase 9-1 optimized SuperSlab lookup (not in hot path) while ignoring kernel overhead (55% of runtime). Always profile before optimizing!


Next Steps

Immediate (This Week)

  1. Investigate class 7 exhaustion:

    HAKMEM_SS_DEBUG=1 ./bench_random_mixed_hakmem 10000000 8192 42 2>&1 | grep -E "cls=7|shared_fail"
    
  2. Test SuperSlab size increase:

    • Change SUPERSLAB_SIZE_MIN from 512KB to 2MB
    • Re-run benchmark, expect +50-80% throughput
  3. Test prewarming:

    hak_ss_prewarm_class(7, 16);  // Pre-allocate 16 SuperSlabs
    
    • Expect +50-70% throughput

Short-term (Next 2 Weeks)

  1. Fix backend stability:

    • Investigate fragmentation metrics
    • Increase shared SuperSlab capacity
    • Add telemetry for exhaustion events
  2. Enable SuperSlab by default:

    • Only after backend is stable
    • Verify no regressions with full test suite
  3. Re-benchmark with fixed backend:

    • Target: 45-50 M ops/s at WS8192
    • Compare to mimalloc (96.5 M ops/s)

Long-term (Future Phases)

  1. Phase 10: Reduce wrapper overhead (11% → 5%)
  2. Phase 11: Architecture re-evaluation if still >2x slower than mimalloc
  3. Phase 12: Consider hybrid approach (TLS + different backend)

Files

Investigation Report (Full Details):

  • /mnt/workdisk/public_share/hakmem/PHASE9_PERF_INVESTIGATION.md

Summary (This File):

  • /mnt/workdisk/public_share/hakmem/PHASE9_1_INVESTIGATION_SUMMARY.md

Perf Data:

  • /tmp/phase9_perf.data (perf record output)

Related Documents:

  • PHASE8_TECHNICAL_ANALYSIS.md - Original (incorrect) bottleneck analysis
  • PHASE9_1_COMPLETE.md - Implementation completion report
  • PHASE9_1_PROGRESS.md - Detailed progress tracking

Conclusion

Phase 9-1 successfully delivered clean O(1) hash table infrastructure, but performance did not improve because:

  1. Wrong target: Optimized lookup (not in hot path)
  2. Real bottleneck: Kernel overhead (55% from mmap/munmap)
  3. Backend issues: SuperSlab exhaustion forces legacy fallback

Recommendation: Fix SuperSlab backend and reduce kernel overhead. Expected gain: +170-200% throughput (16.5 M → 45-50 M ops/s).


Prepared by: Claude (Sonnet 4.5) Date: 2025-11-30 Status: Complete - Action plan provided