# HAKMEM Allocator - Phase 2 Performance Analysis

## Quick Summary

| Metric | Phase 1 | Phase 2 | Change |
|--------|---------|---------|--------|
| **Throughput** | 72M ops/s | 79.8M ops/s | **+10.8%** ✓ |
| Cycles | 78.6M | 72.2M | -8.1% ✓ |
| Instructions | 167M | 153M | -8.4% ✓ |
| Branches | 36M | 23M | **-36%** ✓ |
| Branch Misses (miss rate) | 921K (2.56%) | 1.02M (4.43%) | ~+11% count, +73% rate ✗ |
| L3 Cache Misses (miss rate) | 173K (9.28%) | 216K (10.28%) | +25% count ✗ |
| dTLB Misses (miss rate) | N/A | 41 (0.01%) | **Excellent!** ✓ |

## Top 5 Hotspots (Phase 2, 628 samples)

1. **malloc()** - 36.51% CPU time
   - Function overhead (prologue/epilogue): ~18%
   - Lock operations: 5.05%
   - Initialization checks: ~15%

2. **main()** - 30.51% CPU time
   - Benchmark loop overhead (not allocator)

3. **free()** - 19.66% CPU time
   - Lock operations: 3.29%
   - Cached variable checks: ~15%
   - Function overhead: ~10%

4. **clear_page_erms (kernel)** - 9.31% CPU time
   - Page fault handling

5. **irqentry_exit_to_user_mode (kernel)** - 5.33% CPU time
   - Kernel exit overhead

## Phase 3 Optimization Targets (Ranked by Impact)

### 🔥 Priority 1: Fast-Path Inlining (Expected: +5-8%)

**Target**: Reduce malloc/free from 56% → ~33% CPU time

- Inline hot paths to eliminate function call overhead
- Remove stats counters from production builds
- Cache initialization state in TLS

### 🔥 Priority 2: Branch Optimization (Expected: +3-5%)

**Target**: Reduce branch misses from 1.02M → <700K

- Apply Profile-Guided Optimization (PGO)
- Add LIKELY/UNLIKELY hints
- Reduce branches in fast path from ~15 to 5-7

### 🔥 Priority 3: Cache Optimization (Expected: +2-4%)

**Target**: Reduce L3 misses from 216K → <180K

- Align hot structures to cache lines
- Add prefetching in allocation path
- Compact metadata structures

### 🎯 Priority 4: Remove Init Overhead (Expected: +2-3%)

- Cache g_initialized/g_enable checks in TLS
- Use constructor attributes more aggressively

### 🎯 Priority 5: Reduce Lock Contention (Expected: +1-2%)

- Move stats to TLS, aggregate periodically
- Eliminate atomic ops from fast path

### 🎯 Priority 6: Optimize TLS Operations (Expected: +1-2%)

- Reduce TLS reads/writes from ~10 to ~4 per operation
- Cache TLS values in registers

## Expected Phase 3 Results

**Target Throughput**: 87-95M ops/s (+9-19% improvement)

| Metric | Phase 2 | Phase 3 Target | Change |
|--------|---------|----------------|--------|
| Throughput | 79.8M ops/s | 87-95M ops/s | +9-19% |
| malloc CPU | 36.51% | ~22% | -40% |
| free CPU | 19.66% | ~11% | -44% |
| Branch miss rate | 4.43% | <3% | -32% |
| L3 cache miss rate | 10.28% | <8% | -22% |

## Key Insights

### ✅ What Worked in Phase 2

1. **SuperSlab size increase** (64KB → 512KB): Dramatically reduced branches (-36%)
2. **Amortized initialization**: memset overhead dropped from 6.41% → 1.77%
3. **Virtual memory optimization**: dTLB miss rate is excellent (0.01%)

### ❌ What Needs Work

1. **Branch prediction**: Miss rate rose from 2.56% to 4.43% (+73%) despite 36% fewer branches
2. **Cache pressure**: Larger SuperSlabs increased L3 misses (173K → 216K, +25%)
3. **Function overhead**: malloc/free dominate CPU time (56%)

### 🤔 Surprising Findings

1. **Cross-calling pattern**: malloc/free call each other 8-12% of the time
   - Thread-local cache flushing
   - Deferred release operations
   - May benefit from batching

2. **Kernel overhead increased**: clear_page_erms went from 2.23% → 9.31%
   - May need page pre-faulting strategy

3. **Main loop visible**: 30.51% CPU time
   - Benchmark overhead, not allocator
   - Real allocator overhead is ~56% (malloc + free)

## Files Generated

- `perf_phase2_stats.txt` - perf stat -d output
- `perf_phase2_symbols.txt` - Symbol-level hotspots
- `perf_phase2_callgraph.txt` - Call graph analysis
- `perf_phase2_detailed.txt` - Detailed counter breakdown
- `perf_malloc_annotate.txt` - Assembly annotation for malloc()
- `perf_free_annotate.txt` - Assembly annotation for free()
- `perf_analysis_summary.txt` - Detailed comparison with Phase 1
- `phase3_recommendations.txt` - Complete optimization roadmap

## How to Use This Data

### For Quick Reference

```bash
cat perf_phase2_stats.txt     # See overall metrics
cat perf_phase2_symbols.txt   # See top functions
```

### For Deep Analysis

```bash
cat perf_malloc_annotate.txt   # See assembly-level hotspots in malloc
cat perf_free_annotate.txt     # See assembly-level hotspots in free
cat perf_analysis_summary.txt  # See Phase 1 vs Phase 2 comparison
```

### For Planning Phase 3

```bash
cat phase3_recommendations.txt  # See ranked optimization opportunities
```

### To Re-run Analysis

```bash
# Quick stat
perf stat -d ./bench_random_mixed_hakmem 1000000 256 42

# Detailed profiling
perf record -F 9999 -g ./bench_random_mixed_hakmem 5000000 256 42
perf report --stdio --no-children --sort symbol
```

## Next Steps

1. **Week 1**: Implement fast-path inlining + remove stats locks (Expected: +8-10%)
2. **Week 2**: Apply PGO + branch hints (Expected: +3-5%)
3. **Week 3**: Cache line alignment + prefetching (Expected: +2-4%)
4. **Week 4**: TLS optimization + polish (Expected: +1-3%)

**Total Expected**: +14-22% improvement → **Target: 91-97M ops/s** (sum of the weekly estimates; the per-priority target above is 87-95M ops/s)

---

Generated: 2025-11-28

Phase: 2 → 3 transition

Baseline: 72M ops/s → Current: 79.8M ops/s → Target: 87-95M ops/s
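
The LIKELY/UNLIKELY hints listed under Priority 2 could look roughly like the sketch below. It assumes GCC or Clang (`__builtin_expect`) and a C11 compiler. The `free_node` list, `tls_free_list`, `fast_alloc`, `fast_free`, and the stubbed `slow_alloc` are hypothetical names for illustration, not HAKMEM's actual internals.

```c
#include <stddef.h>

/* Branch-prediction hints; no-ops on compilers without __builtin_expect. */
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Hypothetical intrusive free list: each free block stores the next pointer. */
typedef struct free_node {
    struct free_node *next;
} free_node;

/* Hypothetical per-thread cache for a single size class. */
static _Thread_local free_node *tls_free_list;

/* Stub slow path. A real allocator would refill from a SuperSlab here. */
static void *slow_alloc(size_t size) {
    (void)size;
    return NULL;
}

/* Fast path: one predictable branch when the thread cache is non-empty. */
static inline void *fast_alloc(size_t size) {
    free_node *n = tls_free_list;
    if (LIKELY(n != NULL)) {
        tls_free_list = n->next;
        return n;
    }
    return slow_alloc(size);
}

/* Fast path for free: push the block back onto the thread cache. */
static inline void fast_free(void *p) {
    if (UNLIKELY(p == NULL)) {
        return;
    }
    free_node *n = p;
    n->next = tls_free_list;
    tls_free_list = n;
}
```

Hints only pay off when they match the real branch distribution, so it is worth comparing the branch-miss counter in `perf stat -d` before and after adding them.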
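
Priority 5 ("Move stats to TLS, aggregate periodically") can be sketched as follows. This is a minimal illustration assuming C11 atomics and thread-local storage. The counter names, `stats_note_alloc`, `stats_read_allocs`, and the flush interval of 1024 are hypothetical choices, not values taken from HAKMEM.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flush interval: one atomic add per 1024 allocations. */
#define STATS_FLUSH_INTERVAL 1024u

/* Global aggregate, touched only when a thread flushes. */
static _Atomic uint64_t g_alloc_count;

/* Per-thread pending count, updated without atomics or locks. */
static _Thread_local uint32_t tls_alloc_pending;

/* Called on every allocation: plain increment, occasional flush. */
static inline void stats_note_alloc(void) {
    if (++tls_alloc_pending >= STATS_FLUSH_INTERVAL) {
        atomic_fetch_add_explicit(&g_alloc_count, tls_alloc_pending,
                                  memory_order_relaxed);
        tls_alloc_pending = 0;
    }
}

/* Approximate read: global total plus the calling thread's pending count.
 * Counts still pending in other threads are not included. */
static inline uint64_t stats_read_allocs(void) {
    return atomic_load_explicit(&g_alloc_count, memory_order_relaxed)
           + tls_alloc_pending;
}
```

The trade-off is accuracy for speed: readers may lag by up to one flush interval per thread, which is usually acceptable for statistics but not for anything the allocator depends on for correctness.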
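
For Priority 3 (cache-line alignment and prefetching in the allocation path), a rough sketch is shown below. It assumes a 64-byte cache line, which is typical on x86-64 but not guaranteed. The prefetch uses GCC/Clang's `__builtin_prefetch`. The `hot_class_state` layout and `class_pop` are hypothetical and only illustrate the idea.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; verify for the target CPU. */
#define CACHE_LINE 64

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)(p))
#endif

/* Hypothetical hot per-size-class state, padded to exactly one cache line
 * so that neighbouring classes never share a line. */
typedef struct {
    alignas(CACHE_LINE) void *free_head; /* first free block, or NULL */
    uint32_t free_count;                 /* blocks currently on the list */
    uint32_t capacity;                   /* total blocks in this class */
} hot_class_state;

_Static_assert(alignof(hot_class_state) == CACHE_LINE,
               "hot state must be cache-line aligned");
_Static_assert(sizeof(hot_class_state) == CACHE_LINE,
               "hot state must occupy exactly one cache line");

/* Pop one block and prefetch the next, so the following allocation is
 * more likely to find its link pointer already in cache. */
static inline void *class_pop(hot_class_state *c) {
    void *head = c->free_head;
    if (head != NULL) {
        void *next = *(void **)head; /* link stored in the free block */
        PREFETCH(next);              /* prefetch does not fault, even on NULL */
        c->free_head = next;
        c->free_count--;
    }
    return head;
}
```

Whether the prefetch helps depends on the access pattern, so the L3 miss counters from `perf stat -d` are the place to confirm any benefit.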