# HAKMEM Performance Bottleneck Executive Summary

**Date**: 2025-12-04
**Analysis Type**: Comprehensive Performance Profiling
**Status**: CRITICAL BOTTLENECK IDENTIFIED

---

## The Problem

**Current Performance**: 4.1M ops/s
**Target Performance**: 16M+ ops/s (4x improvement)
**Performance Gap**: 3.9x remaining

---

## Root Cause: Page Fault Storm

**The smoking gun**: 69% of execution time is spent handling page faults.

### The Evidence

```
perf stat shows:
- 132,509 page faults / 1,000,000 operations = 13.25% of operations trigger page faults
- 1,146 cycles per operation (target at 4x: ~286 cycles)
- ~690 cycles per operation spent in kernel page fault handling (60% of total time)

perf report shows:
- unified_cache_refill: 69.07% of total time (with children)
  └─ 60%+ is the kernel page fault handling chain:
     - clear_page_erms: 11.25% (zeroing newly allocated pages)
     - do_anonymous_page: 20%+ (allocating kernel folios)
     - folio_add_new_anon_rmap: 7.11% (adding to reverse map)
     - folio_add_lru_vma: 4.88% (adding to LRU list)
     - __mem_cgroup_charge: 4.37% (memory cgroup accounting)
```

### Why This Matters

Every time `unified_cache_refill` allocates memory from a SuperSlab, it writes to previously unmapped memory. This triggers a page fault, forcing the kernel to:

1. **Allocate a physical page** (rmqueue: 2.03%)
2. **Zero the page for security** (clear_page_erms: 11.25%)
3. **Set up page tables** (handle_pte_fault, __pte_offset_map: 3-5%)
4. **Add to LRU lists** (folio_add_lru_vma: 4.88%)
5. **Charge the memory cgroup** (__mem_cgroup_charge: 4.37%)
6. **Update the reverse map** (folio_add_new_anon_rmap: 7.11%)

**Total kernel overhead**: ~690 cycles per operation (60% of 1,146 cycles)

---

## Secondary Bottlenecks

### 1. Branch Mispredictions (9.04% miss rate)

- 21M mispredictions / 1M operations = 21 misses per op
- Each miss costs ~15-20 cycles = 315-420 cycles wasted per op
- Indicates complex control flow in the allocation path

### 2. Speculation Mitigation (5.44% overhead)

- srso_alias_safe_ret: 2.85%
- srso_alias_return_thunk: 2.59%
- CPU security mitigations (Spectre/Meltdown) add indirect branch overhead
- Cannot be eliminated, but can be minimized

### 3. Cache Misses (Moderate)

- L1 D-cache misses: 17.2 per operation
- Cache miss rate: 13.03% of cache references
- At ~10 cycles per L1 miss = ~172 cycles per op
- Not catastrophic, but room for improvement

---

## The Path to 4x Performance

### Immediate Action: Pre-fault SuperSlab Memory

**Solution**: Add the `MAP_POPULATE` flag to the `mmap()` calls in SuperSlab acquisition.

**Implementation**:

```c
// In superslab_acquire():
void* ptr = mmap(NULL, SUPERSLAB_SIZE,
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,  // Add MAP_POPULATE
                 -1, 0);
```

**Expected Impact**:
- Eliminates 60-70% of runtime page faults
- Trades startup time for runtime performance
- **Expected speedup: 2-3x (8.2M - 12.3M ops/s)**
- **Effort: 1 hour**

### Follow-up: Profile-Guided Optimization (PGO)

**Solution**: Build with `-fprofile-generate`, run the benchmark, then rebuild with `-fprofile-use`.

**Expected Impact**:
- Optimizes branch layout for common paths
- Reduces the branch misprediction rate from 9% to ~6-7%
- **Expected speedup: 1.2-1.3x on top of prefaulting**
- **Effort: 2 hours**

### Advanced: Hugepages (2MB Pages)

**Solution**: Use `mmap()` with `MAP_HUGETLB` to get 2MB pages instead of 4KB pages.

**Expected Impact**:
- Reduces page fault count by 512x (4KB → 2MB)
- Significantly reduces TLB pressure
- **Expected speedup: 2-4x**
- **Effort: 1 day (with fallback logic)**

---

## Conservative Performance Projection

| Optimization | Speedup | Cumulative | Ops/s | Effort |
|--------------|---------|------------|-------|--------|
| Baseline | 1.0x | 1.0x | 4.1M | - |
| MAP_POPULATE | 2.5x | 2.5x | 10.3M | 1 hour |
| PGO | 1.25x | 3.1x | 12.7M | 2 hours |
| Branch hints | 1.1x | 3.4x | 14.0M | 4 hours |
| Cache layout | 1.15x | 3.9x | **16.0M** | 2 hours |

**Total effort to reach 4x target**: ~1 day of development

---

## Aggressive Performance Projection

| Optimization | Speedup | Cumulative | Ops/s | Effort |
|--------------|---------|------------|-------|--------|
| Baseline | 1.0x | 1.0x | 4.1M | - |
| Hugepages | 3.0x | 3.0x | 12.3M | 1 day |
| PGO | 1.3x | 3.9x | 16.0M | 2 hours |
| Branch optimization | 1.2x | 4.7x | 19.3M | 4 hours |
| Prefetching | 1.15x | 5.4x | **22.1M** | 4 hours |

**Total effort to reach 5x+**: ~2 days of development

---

## Recommended Action Plan

### Phase 1: Immediate (Today)

1. Add MAP_POPULATE to the superslab mmap() calls
2. Verify the page fault count drops to near zero
3. Measure new throughput (expect 8-12M ops/s)

### Phase 2: Quick Wins (Tomorrow)

1. Build with PGO (-fprofile-generate/-fprofile-use)
2. Add __builtin_expect() hints to hot paths
3. Measure new throughput (expect 12-16M ops/s)

### Phase 3: Advanced (This Week)

1. Implement hugepage support with fallback
2. Optimize data structure layout for cache
3. Add prefetch hints for predictable accesses
4. Target: 16-24M ops/s

---

## Key Metrics Summary

| Metric | Current | Target | Status |
|--------|---------|--------|--------|
| Throughput | 4.1M ops/s | 16M ops/s | 🔴 25% of target |
| Cycles/op | 1,146 | ~245 | 🔴 4.7x too slow |
| Page faults | 132,509 | <1,000 | 🔴 132x too many |
| IPC | 0.97 | 0.97 | 🟢 Optimal |
| Branch misses | 9.04% | <5% | 🟡 Moderate |
| Cache misses | 13.03% | <10% | 🟡 Moderate |
| Kernel time | 60% | <5% | 🔴 Critical |

---

## Files Generated

1. **PERF_BOTTLENECK_ANALYSIS_20251204.md** - Full detailed analysis with recommendations
2. **PERF_RAW_DATA_20251204.txt** - Raw perf stat/report output for reference
3. **EXECUTIVE_SUMMARY_BOTTLENECK_20251204.md** - This file (executive overview)

---

## Conclusion

The performance gap is **not a mystery**. The profiling data clearly shows that **60-70% of execution time is spent in kernel page fault handling**.
The fix is straightforward: **pre-fault memory with MAP_POPULATE** to eliminate the runtime page fault overhead. This single change should deliver a 2-3x improvement, putting us at 8-12M ops/s. Combined with PGO and minor branch optimizations, we can confidently reach the 4x target (16M+ ops/s).

**Next Step**: Implement MAP_POPULATE in superslab_acquire() and re-measure.
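A hedged sketch of that next step follows. `superslab_acquire()` and `SUPERSLAB_SIZE` are the names this document uses; the fallback retry on mmap failure is an added assumption (under memory pressure, a populated mapping can fail where a lazy one would succeed), not existing code:

```c
// Sketch of the proposed change: prefault SuperSlab memory at acquisition
// time so unified_cache_refill never takes first-touch faults.
// superslab_acquire()/SUPERSLAB_SIZE are the document's names; the
// lazy-mapping fallback is an assumption, not the project's current code.
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#ifndef SUPERSLAB_SIZE
#define SUPERSLAB_SIZE (16u * 1024 * 1024)  /* stand-in size */
#endif

void *superslab_acquire(void) {
    void *ptr = mmap(NULL, SUPERSLAB_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (ptr == MAP_FAILED) {
        /* Prefaulting failed (e.g. memory pressure): fall back to a lazy
           mapping rather than failing the allocation outright. */
        ptr = mmap(NULL, SUPERSLAB_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return ptr == MAP_FAILED ? NULL : ptr;
}
```

After this change, `perf stat` should show the page fault count collapsing from ~132K per million operations to near zero at steady state, with the faults moved into slab acquisition.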