Phase 9-1 Performance Investigation - Executive Summary
Date: 2025-11-30
Status: Investigation Complete
Investigator: Claude (Sonnet 4.5)
TL;DR
Phase 9-1 hash table optimization had ZERO performance impact because:
- SuperSlab is DISABLED by default - optimized code never runs
- Real bottleneck is kernel overhead (55%) - mmap/munmap syscalls dominate
- SuperSlab lookup is NOT in hot path - only 1.14% of total time
Fix: Address SuperSlab backend failures and kernel overhead, not lookup performance.
Performance Data
Benchmark Results
| Configuration | Throughput | Change |
|---|---|---|
| Phase 8 Baseline | 16.5 M ops/s | - |
| Phase 9-1 (SuperSlab OFF) | 16.5 M ops/s | 0% |
| Phase 9-1 (SuperSlab ON) | 16.4 M ops/s | -0.6% |
Conclusion: Hash table optimization made no difference.
Perf Profile (WS8192)
| Component | CPU % | Cycles/op (est.) | Status |
|---|---|---|---|
| Kernel (mmap/munmap) | 55% | ~117 | BOTTLENECK |
| ├─ munmap / VMA splitting | 30% | ~64 | Critical issue |
| └─ mmap / page setup | 11% | ~23 | Expensive |
| free() wrapper | 11% | ~24 | Wrapper overhead |
| main() benchmark loop | 8% | ~16 | Measurement artifact |
| unified_cache_refill | 4% | ~9 | Page faults |
| Fast free TLS path | 1% | ~3 | Actual work! |
| Other | 21% | ~43 | Misc |
Key Insight: Only ~3 cycles per operation go to actual allocation work. The rest is overhead (~117 cycles/op in the kernel alone!).
Root Cause Analysis
1. SuperSlab Disabled by Default
Code: core/box/hak_core_init.inc.h:172-173
if (!getenv("HAKMEM_TINY_USE_SUPERSLAB")) {
setenv("HAKMEM_TINY_USE_SUPERSLAB", "0", 0); // DISABLED
}
Impact: Hash table code is never executed during benchmark.
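For ad-hoc testing, the optimized path can still be exercised by overriding the default at launch (the benchmark invocation mirrors the one in Next Steps below):

```bash
# Force the SuperSlab path on for a single run (overrides the default above)
HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192 42
```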
2. Backend Failures Trigger Legacy Path
Debug Logs:
[SS_BACKEND] shared_fail→legacy cls=7 (4 times)
[TLS_SLL_HDR_RESET] cls=6 base=0x... got=0x00 expect=0xa6
Analysis:
- Class 7 (1024 bytes) SuperSlab exhaustion
- Falls back to system malloc → mmap/munmap
- 4 failures × ~1000 allocs = ~4000 kernel syscalls
- Explains 30% munmap overhead in perf
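To make the fallback concrete, here is a minimal hypothetical sketch of the path the logs describe (function names are illustrative, not hakmem's actual symbols):

```c
/* Hypothetical sketch of the failure path described above; the function
 * names are illustrative, not the actual hakmem symbols. */
void *ss_shared_alloc(int cls);   /* SuperSlab backend (may fail) */
void *legacy_alloc(int cls);      /* legacy path: ends in mmap()/munmap() */

void *tiny_alloc_slow(int cls) {
    void *p = ss_shared_alloc(cls);
    if (p)
        return p;
    /* Backend exhausted: this is the "[SS_BACKEND] shared_fail→legacy"
     * event in the logs, and every trip through here costs syscalls. */
    return legacy_alloc(cls);
}
```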
3. Hash Table Not in Hot Path
Perf Evidence:
- hak_super_lookup() does NOT appear in the top 20 functions
- ss_map_lookup() hash table code: 0% visible overhead
- Fast TLS path dominates: only 1.14% of total free time
Code Path:
free(ptr)
└─ hak_tiny_free_fast_v2() [1.14% total]
├─ Read header (class_idx)
├─ Push to TLS freelist ← FAST PATH (3 cycles)
└─ hak_super_lookup() ← VALIDATION ONLY (not in hot path)
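For intuition on why the fast path costs only a few cycles, here is a minimal sketch of a header-based TLS free-list push, assuming an inline one-byte class header (identifiers are illustrative, not the actual hakmem code):

```c
#include <stdint.h>

/* Hypothetical sketch of the fast path above: one header read plus a
 * thread-local free-list push — a handful of cycles, no locks, no syscalls. */
#define NUM_TINY_CLASSES 8        /* assumed class count, for illustration */

typedef struct free_node { struct free_node *next; } free_node;

static __thread free_node *tls_freelist[NUM_TINY_CLASSES];

static inline void tiny_free_fast(void *ptr) {
    uint8_t cls = ((uint8_t *)ptr)[-1];   /* class_idx from inline header */
    free_node *node = (free_node *)ptr;
    node->next = tls_freelist[cls];       /* push onto TLS freelist */
    tls_freelist[cls] = node;
}
```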
Where Phase 8 Analysis Went Wrong
Phase 8 Claimed (INCORRECT)
| Claim | Reality |
|---|---|
| "SuperSlab lookup = 50-80 cycles" | Lookup not in hot path (0% perf profile) |
| "Major bottleneck" | Kernel overhead (55%) is real bottleneck |
| "Expected: 16.5M → 23-25M ops/s" | Actual: 16.5M → 16.5M ops/s (0% change) |
What Was Missed
- No profiling before optimization - Assumed bottleneck without evidence
- Didn't check default config - SuperSlab disabled by default
- Ignored kernel overhead - 55% of time in syscalls
- Optimized wrong thing - Lookup is validation, not hot path
Recommended Action Plan
Priority 1: Fix SuperSlab Backend (Immediate)
Problem: Class 7 (1024 bytes) exhaustion → legacy fallback → kernel overhead
Solutions:
1. Increase SuperSlab size: 512KB → 2MB
   - 4x more blocks per slab
   - Reduces fragmentation
   - Expected: -20% kernel overhead = +30-40% throughput
2. Pre-allocate SuperSlabs at startup (a sketch follows this section):
   hak_ss_prewarm_class(7, 16); // 16 SuperSlabs for class 7
   - Eliminates startup mmap overhead
   - Expected: -30% kernel overhead = +50-70% throughput
3. Enable SuperSlab by default (after fixing backend):
   setenv("HAKMEM_TINY_USE_SUPERSLAB", "1", 0); // Enable
Expected Result: 16.5 M ops/s → 25-35 M ops/s (+50-110%)
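A minimal sketch of what the prewarm call from item 2 could look like; hak_ss_prewarm_class() is proposed in this report rather than existing API, and the helper names are assumptions:

```c
/* Hypothetical sketch of the proposed prewarm routine. The helpers are
 * assumed names, not existing hakmem functions. */
typedef struct superslab superslab;
superslab *ss_map_new(int cls);              /* mmap + format one SuperSlab */
void ss_shared_push(int cls, superslab *ss); /* park it on the shared pool */

void hak_ss_prewarm_class(int cls, int count) {
    for (int i = 0; i < count; i++) {
        superslab *ss = ss_map_new(cls);     /* pay the mmap cost at startup */
        if (!ss)
            break;                           /* stop on failure, keep the rest */
        ss_shared_push(cls, ss);
    }
}
```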
Priority 2: Reduce Kernel Overhead (Short-term)
Problem: 55% of time in mmap/munmap syscalls
Solutions:
- Fix backend failures (see Priority 1)
- Increase batch size to amortize syscall cost
- Pre-allocate memory pool to avoid runtime mmap
- Monitor VMA count (a small in-process helper follows this section):
  cat /proc/self/maps | wc -l
Expected Result: Kernel overhead 55% → 10-20%
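Where in-process monitoring is preferable to shelling out, a small helper along these lines (hypothetical, not part of hakmem) counts VMAs directly:

```c
#include <stdio.h>

/* Hypothetical helper for the VMA check above: counts the lines (= VMAs)
 * in /proc/self/maps from inside the process, e.g. for periodic telemetry. */
static long vma_count(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;
    long n = 0;
    for (int c; (c = fgetc(f)) != EOF; )
        if (c == '\n')
            n++;
    fclose(f);
    return n;
}
```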
Priority 3: Optimize User-space (Long-term)
Problem: 11% in free() wrapper overhead
Solutions:
- Inline wrapper more aggressively
- Remove stack canary checks in hot path
- Optimize TLS access (direct segment access; see the sketch after this list)
Expected Result: -5% overhead = +6-8% throughput
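As an illustration of the TLS item above, assuming a GCC/Clang toolchain: pinning a hot thread-local to the initial-exec model avoids the __tls_get_addr() call that the default dynamic model can incur in a preloaded library:

```c
/* Hypothetical illustration: the initial-exec TLS model lets the compiler
 * emit a direct segment-relative access instead of a __tls_get_addr() call.
 * Works for an LD_PRELOAD allocator as long as static TLS space remains. */
static __thread void *tls_hot_head __attribute__((tls_model("initial-exec")));
```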
Performance Projections
Scenario 1: Fix Backend + Prewarm (Recommended)
Changes:
- Fix class 7 exhaustion
- Pre-allocate SuperSlab pool
- Enable SuperSlab by default
Expected:
- Kernel: 55% → 10% (-45%)
- Throughput: 16.5 M → 45-50 M ops/s (+170-200%)
Scenario 2: Increase SuperSlab Size Only
Changes:
- Change default: 512KB → 2MB
- No other changes
Expected:
- Kernel: 55% → 35% (-20%)
- Throughput: 16.5 M → 25-30 M ops/s (+50-80%)
Scenario 3: Do Nothing (Status Quo)
Result: 16.5 M ops/s (no change)
- Hash table infrastructure exists but provides no benefit
- Kernel overhead continues to dominate
- SuperSlab backend remains unstable
Lessons Learned
What Went Well
- Clean implementation: Hash table code is well-architected
- Box pattern compliance: Single responsibility, clear contracts
- No regressions: 0% performance change (neither better nor worse)
- Good infrastructure: Enables future optimizations
What Could Be Better
- Profile before optimizing: Always run perf first
- Verify assumptions: Check default configuration
- Focus on hot path: Optimize what's actually slow
- Measure kernel time: Don't ignore syscall overhead
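Concretely, the profiling step that was skipped would have looked something like this (standard perf usage, with the benchmark invocation used elsewhere in this report):

```bash
perf record -g ./bench_random_mixed_hakmem 10000000 8192 42
perf report   # check the kernel vs. user split before picking a target
```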
Key Takeaway
"Premature optimization is the root of all evil. Profile first, optimize second."
- Donald Knuth
Phase 9-1 optimized SuperSlab lookup (not in hot path) while ignoring kernel overhead (55% of runtime). Always profile before optimizing!
Next Steps
Immediate (This Week)
1. Investigate class 7 exhaustion:
   HAKMEM_SS_DEBUG=1 ./bench_random_mixed_hakmem 10000000 8192 42 2>&1 | grep -E "cls=7|shared_fail"
2. Test SuperSlab size increase:
   - Change SUPERSLAB_SIZE_MIN from 512KB to 2MB
   - Re-run benchmark, expect +50-80% throughput
3. Test prewarming:
   hak_ss_prewarm_class(7, 16); // Pre-allocate 16 SuperSlabs
   - Expect +50-70% throughput
Short-term (Next 2 Weeks)
1. Fix backend stability:
   - Investigate fragmentation metrics
   - Increase shared SuperSlab capacity
   - Add telemetry for exhaustion events (a sketch follows this list)
2. Enable SuperSlab by default:
   - Only after backend is stable
   - Verify no regressions with full test suite
3. Re-benchmark with fixed backend:
   - Target: 45-50 M ops/s at WS8192
   - Compare to mimalloc (96.5 M ops/s)
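A minimal sketch of the exhaustion telemetry from item 1; the names and the class count are assumptions for illustration:

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical telemetry sketch: one relaxed atomic counter per size
 * class, dumped to stderr on demand. Names and class count are assumed. */
#define NUM_TINY_CLASSES 8

static _Atomic unsigned long ss_exhaust_count[NUM_TINY_CLASSES];

static inline void ss_note_exhaustion(int cls) {
    atomic_fetch_add_explicit(&ss_exhaust_count[cls], 1,
                              memory_order_relaxed);
}

static void ss_dump_exhaustion(void) {
    for (int c = 0; c < NUM_TINY_CLASSES; c++)
        fprintf(stderr, "[SS_TELEMETRY] cls=%d exhaust=%lu\n", c,
                atomic_load_explicit(&ss_exhaust_count[c],
                                     memory_order_relaxed));
}
```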
Long-term (Future Phases)
- Phase 10: Reduce wrapper overhead (11% → 5%)
- Phase 11: Architecture re-evaluation if still >2x slower than mimalloc
- Phase 12: Consider hybrid approach (TLS + different backend)
Files
Investigation Report (Full Details):
/mnt/workdisk/public_share/hakmem/PHASE9_PERF_INVESTIGATION.md
Summary (This File):
/mnt/workdisk/public_share/hakmem/PHASE9_1_INVESTIGATION_SUMMARY.md
Perf Data:
/tmp/phase9_perf.data (perf record output)
Related Documents:
- PHASE8_TECHNICAL_ANALYSIS.md - Original (incorrect) bottleneck analysis
- PHASE9_1_COMPLETE.md - Implementation completion report
- PHASE9_1_PROGRESS.md - Detailed progress tracking
Conclusion
Phase 9-1 successfully delivered clean O(1) hash table infrastructure, but performance did not improve because:
- Wrong target: Optimized lookup (not in hot path)
- Real bottleneck: Kernel overhead (55% from mmap/munmap)
- Backend issues: SuperSlab exhaustion forces legacy fallback
Recommendation: Fix SuperSlab backend and reduce kernel overhead. Expected gain: +170-200% throughput (16.5 M → 45-50 M ops/s).
Prepared by: Claude (Sonnet 4.5)
Date: 2025-11-30
Status: Complete - Action plan provided