# Comprehensive Profiling Analysis: HAKMEM Performance Gaps

## 🔍 Executive Summary

After the **Prefault Box + MAP_POPULATE fix**, the profiling shows:

### Current Performance Metrics

| Metric | Random Mixed | Tiny Hot | Gap |
|--------|---|---|---|
| **Cycles (lower is better)** | 72.6M | 72.3M | **SAME** 🤯 |
| **Page Faults** | 7,672 | 7,672 | **IDENTICAL** ⚠️ |
| **L1 Cache Misses** | 763K | 738K | Similar |
| **Throughput** | ~1.06M ops/s | ~1.23M ops/s | 1.16x |
| **Instructions/Cycle** | 0.74 | 0.73 | Similar |
| **TLB Miss Rate** | 48.65% (dTLB) | N/A | High |

### 🚨 KEY FINDING: Prefault is NOT working as expected!

**Problem:** Random Mixed and Tiny Hot show an identical page-fault count (7,672), which suggests:

1. ✗ The Prefault Box is either disabled or ineffective
2. ✗ The page faults come from somewhere other than the Superslab mmap
3. ✗ The MAP_POPULATE flag is not preventing runtime faults

---

## 📊 Detailed Performance Breakdown

### Random Mixed Workload (1M allocations, 256 slots, sizes 16-1040B)

**Top Kernel Functions (by CPU time):**

```
15.01%  asm_exc_page_fault        ← Page fault handling
11.65%  clear_page_erms           ← Page zeroing
 5.27%  zap_pte_range             ← Memory cleanup
 5.20%  handle_mm_fault           ← MMU fault handling
 4.06%  do_anonymous_page         ← Anonymous page allocation
 3.18%  __handle_mm_fault         ← Nested fault handling
 2.35%  rmqueue_bulk              ← Allocator backend
 2.35%  __memset_avx2_unaligned   ← Memory operations
 2.28%  do_user_addr_fault        ← User fault handling
 1.77%  arch_exit_to_user_mode    ← Context switch
```

**Kernel overhead:** ~63% of cycles
**L1 d-cache misses:** 763K over 1M operations
**Branch miss rate:** 11.94%

### Tiny Hot Workload (10M allocations, fixed size)

**Top Kernel Functions (by CPU time):**

```
14.19%  asm_exc_page_fault        ← Page fault handling
12.82%  clear_page_erms           ← Page zeroing
 5.61%  __memset_avx2_unaligned   ← Memory operations
 5.02%  do_anonymous_page         ← Anonymous page allocation
 3.31%  mem_cgroup_commit_charge  ← Memory accounting
 2.67%  __handle_mm_fault         ← MMU fault handling
 2.45%  do_user_addr_fault        ← User fault handling
```

**Kernel overhead:** ~66% of cycles
**L1 d-cache misses:** 738K over 10M operations
**Branch miss rate:** 11.03%

### Comparison: Why are cycles similar?

```
Random Mixed: 72.6M cycles / 1M ops  = 72.6 cycles/op
Tiny Hot:     72.3M cycles / 10M ops = 7.23 cycles/op

⚠️ THROUGHPUT DIFFERENCE UNEXPLAINED!
```

The raw cycle counts are nearly identical, but the naive normalization above is misleading:

- Random Mixed measures only 1M operations (baseline)
- Tiny Hot measures 10M operations (10x scale)
- Yet measured throughput differs by only 1.16x (1.06M vs 1.23M ops/s)

**This means Random Mixed is NOT the dramatically slower workload earlier profiling suggested; its per-operation cost is in the same ballpark as Tiny Hot's.**

---

## 🎯 Critical Findings

### Finding 1: Page Faults Are NOT Being Reduced

**Observed:**

- Random Mixed: 7,672 page faults
- Tiny Hot: 7,672 page faults
- **Difference: 0** ← This is wrong!

**Expected (with prefault):**

- Random Mixed: 7,672 → maybe 100-500 (90% reduction)
- Tiny Hot: 7,672 → ~50-100 (minimal change)

**Hypothesis:**

- The Prefault Box may not be enabled
- Or MAP_POPULATE is not honored on this kernel
- Or the faulting allocations never reach the Superslab mmap path

### Finding 2: TLB Misses Are HIGH (48.65%)

```
dTLB-loads:        49,160
dTLB-load-misses:  23,917  (48.65% miss rate!)
iTLB-load-misses:  17,590  (7748.90% — kernel measurement artifact)
```

**Meaning:** Nearly half of dTLB lookups miss, forcing page-table walks.

**Why this matters:**

- Each TLB miss costs ~10-40 cycles (vs 1-3 for a hit)
- 23,917 × 25 cycles ≈ 600K wasted cycles
- As counted that is under 1% of the 72.6M-cycle run, but these event counts look far too low for 1M operations (likely a short counting window), so the true cost is probably much higher

### Finding 3: Both Workloads Are Similar

Despite different access patterns:

- Both spend ~15% of cycles on page fault handling
- Both spend ~12% on page zeroing
- Both have similar L1 miss rates
- Both have similar branch miss rates

**Conclusion:** The memory subsystem is the bottleneck for BOTH workloads, not user-space code.
---

## 📈 Layer Analysis

### Kernel vs User Split

| Category | Random Mixed | Tiny Hot | Analysis |
|----------|---|---|---|
| **Kernel (page faults, scheduling, etc.)** | 63% | 66% | Dominant |
| **Kernel zeroing (clear_page_erms)** | 11.65% | 12.82% | Similar |
| **User malloc/free** | <1% | <1% | Not visible |
| **User pool/cache logic** | <1% | <1% | Not visible |

### User-Space Functions Visible in Profile

**Random Mixed:**

```
0.59%  hak_free_at.constprop.0   (hakmem free path)
```

**Tiny Hot:**

```
0.59%  hak_pool_mid_lookup       (hakmem pool routing)
```

**Conclusion:** User-space HAKMEM code is NOT a bottleneck (<1% each).

---

## 🔧 What's Really Happening

### Current State (POST-Prefault Box)

```
allocate(size):
  1. malloc wrapper          → <1% cycles
  2. Gatekeeper routing      → ~0.1% cycles
  3. unified_cache_refill    → (hidden in kernel time)
  4. shared_pool_acquire     → (hidden in kernel time)
  5. SuperSlab/mmap call     → Triggers kernel
  6. **KERNEL PAGE FAULTS**  → 15% cycles
  7. clear_page_erms (zero)  → 12% cycles
```

### Why Prefault Isn't Working

**Possible reasons:**

1. **Prefault Box disabled?**
   - Check: `HAKMEM_BOX_SS_PREFAULT_ENABLED`
   - Or: `g_ss_populate_once` not being set
2. **MAP_POPULATE not actually pre-faulting?**
   - The Linux kernel may still defer population even with MAP_POPULATE
   - `madvise(MADV_POPULATE_READ)` can force immediate faulting
   - Or use `mincore()` to check residency before allocation
3. **Allocations not from Superslab mmap?**
   - Page faults may come from TLS cache allocation
   - Or from libc-internal allocations
   - Not from the Superslab backend
4. **TLB misses dominating?**
   - A 48.65% TLB miss rate suggests a memory-layout issue
   - SuperSlab metadata may not be cache-friendly
   - The working set may be too large for the TLB

---

## 🎓 What We Learned From Previous Analysis

The earlier profiling report identified that:

- **Random Mixed was 21.7x slower**, with 61.7% of time in page faults
- **Expected with prefault:** that share should drop to ~5% or less

But NOW we see:

- **Random Mixed is NOT significantly slower** (per-op cost is similar)
- **Page faults are identical** to Tiny Hot
- **This contradicts expectations**

### Possible Explanation

The **earlier measurements** may have been taken:

- At benchmark startup (cold caches)
- With additional profiling overhead
- Or with different workload parameters

The **current measurements** are:

- Steady state (after initial allocation)
- At higher throughput (Tiny Hot = 10M ops)
- After recent optimizations

---

## 🎯 Next Steps — Three Options

### 📋 Option A: Verify Prefault is Actually Enabled

**Goal:** Confirm the prefault mechanism is working

**Steps:**

1. Add debug output to `ss_prefault_policy()` and `ss_prefault_region()`
2. Check whether the `MAP_POPULATE` flag is set in the actual mmap calls
3. Run with `strace` to see mmap flags:
   ```bash
   strace -e trace=mmap ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | grep MAP_POPULATE
   ```
4. Check whether `madvise(MADV_POPULATE_READ)` calls are happening

**Expected outcome:** MAP_POPULATE or MADV_POPULATE should appear in the traces

---

### 🎯 Option B: Reduce TLB Misses (48.65% → ~5%)

**Goal:** Improve memory layout to reduce TLB pressure

**Steps:**

1. **Analyze SuperSlab metadata layout:**
   - Current: Is metadata per-slab or centralized?
   - Check: the `sp_meta_find_or_create()` hot path
2. **Improve cache locality:**
   - Cache-align metadata structures
   - Use larger pages (2MB or 1GB hugepages)
   - Reduce the working-set size
3. **Profile with hugepages:**
   ```bash
   echo 10 > /proc/sys/vm/nr_hugepages
   HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
   ```

**Expected gain:** 1.5-2x speedup (eliminating the TLB-miss penalty)

---

### 🚀 Option C: Reduce Page Zeroing (12% → ~2%)

**Goal:** Skip unnecessary page zeroing

**Steps:**

1. **Analyze what needs zeroing:**
   - Are SuperSlab pages truly uninitialized?
   - Can we reuse memory without zeroing?
   - Use `MADV_DONTNEED` before reuse?
2. **Implement lazy zeroing:**
   - Don't zero pages on allocation
   - Only zero the portions actually used
   - Let the kernel handle the rest on free
3. **Use uninitialized pools:**
   - Pre-allocate without zeroing
   - Initialize on demand

**Expected gain:** 1.5x speedup (eliminating the 12% zeroing cost)

---

## 📊 Recommendation

Based on the analysis:

### Most Impactful (Order of Preference):

1. **Fix TLB Misses (48.65%)**
   - Potential gain: 1.5-2x
   - Implementation: Medium difficulty
   - Reason: Already showing a 48% miss rate
2. **Verify Prefault Actually Works**
   - Potential gain: Unknown (currently not working?)
   - Implementation: Easy (debugging)
   - Reason: Should already be solved, yet page faults are unchanged
3. **Reduce Page Zeroing**
   - Potential gain: 1.5x
   - Implementation: Medium difficulty
   - Reason: 12% of total time

---

## 🧪 Recommended Next Action

### Immediate (This Session)

Run diagnostics to confirm prefault status:

```bash
# Check if MAP_POPULATE appears in the actual mmap calls
strace -e trace=mmap ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | head -20

# Check build flags
grep -i prefault Makefile

# Check environment variables
env | grep -i HAKMEM | grep -i PREFAULT
```

### If Prefault is Disabled → Enable It

Then re-run profiling to verify the improvement.

### If Prefault is Enabled → Move to Option B (TLB)

Focus on reducing the 48% TLB miss rate.
---

## 📈 Expected Outcome After All Fixes

| Factor | Current | After | Gain |
|--------|---------|-------|------|
| Page faults | 7,672 | 500-1000 | 8-15x |
| TLB misses | 48.65% | ~5% | 3-5x |
| Page zeroing | 12% of cycles | ~2% | 2x |
| **Total per-op time** | 72.6 cycles | 20-25 cycles | **3-4x** |
| **Throughput** | 1.06M ops/s | 3.5-4M ops/s | **3-4x** |