# HAKMEM Performance Analysis - Complete Index

**Date**: 2025-12-04
**Benchmark**: bench_random_mixed_hakmem (1M operations, ws=256)
**Current Performance**: 4.1M ops/s
**Target**: 16M+ ops/s (4x improvement)

---

## Quick Summary

**CRITICAL FINDING**: Page fault handling consumes 60-70% of execution time.

**Primary Bottleneck**:
- 132,509 page faults per 1M operations
- Each page fault costs ~690 cycles
- The kernel spends 60% of time in: clear_page_erms (11%), do_anonymous_page (20%), LRU/cgroup accounting (12%)

**Recommended Fix**:
- Add `MAP_POPULATE` to superslab mmap() calls → 2-3x speedup (1 hour effort)
- Follow with PGO and branch optimization → reach 4x target

---

## Analysis Documents (Read in Order)

### 1. Executive Summary (START HERE)

**File**: `/mnt/workdisk/public_share/hakmem/EXECUTIVE_SUMMARY_BOTTLENECK_20251204.md`
**Purpose**: High-level overview for decision makers
**Content**:
- Problem statement and root cause
- Key metrics summary
- Recommended action plan with timelines
- Conservative and aggressive performance projections

**Reading time**: 5 minutes

---

### 2. Detailed Analysis Report

**File**: `/mnt/workdisk/public_share/hakmem/PERF_BOTTLENECK_ANALYSIS_20251204.md`
**Purpose**: In-depth technical analysis for engineers
**Content**:
- Complete performance counter breakdown
- Top 10 hottest functions with call chains
- Bottleneck analysis with cycle accounting
- Detailed optimization recommendations with effort estimates
- Specific code changes required

**Reading time**: 20 minutes

---

### 3. Raw Performance Data

**File**: `/mnt/workdisk/public_share/hakmem/PERF_RAW_DATA_20251204.txt`
**Purpose**: Reference data for validation and future comparison
**Content**:
- Raw perf stat output (all counters)
- Raw perf report output (function profiles)
- Syscall trace data
- Assembly annotation of hot functions
- Complete call chain data

**Reading time**: Reference only (5-10 minutes to browse)

---

## Key Findings at a Glance

| Category | Finding | Impact | Fix Effort |
|----------|---------|--------|------------|
| **Page Faults** | 132,509 faults (13% of ops) | 60-70% of runtime | 1 hour (MAP_POPULATE) |
| **Branch Misses** | 9.04% miss rate (21M misses) | ~30% overhead | 4 hours (hints + PGO) |
| **Cache Misses** | 13.03% miss rate (17 L1 misses/op) | ~15% overhead | 2 hours (layout) |
| **Speculation** | Retpoline overhead | ~5% overhead | Cannot fix (CPU security mitigation) |
| **IPC** | 0.97 (near optimal) | No issue | No fix needed |

---

## Performance Metrics

### Current State

```
Throughput:        4.1M ops/s
Cycles per op:     1,146 cycles
Instructions/op:   1,109 instructions
IPC:               0.97 (excellent)
Page faults/op:    0.132 (catastrophic)
Branch misses/op:  21 (high)
L1 misses/op:      17.2 (moderate)
```

### Target State (after optimizations)

```
Throughput:        16M+ ops/s (4x improvement)
Cycles per op:     ~245 cycles (4.7x reduction)
Page faults/op:    <0.001 (132x reduction)
Branch misses/op:  ~12 (1.75x reduction)
L1 misses/op:      ~10 (1.7x reduction)
```

---

## Top Bottleneck Functions (by time spent)

### Kernel Functions (60% of total time)

1. **clear_page_erms** (11.25%) - Zeroing newly allocated pages
2. **do_anonymous_page** (20%+) - Kernel page allocation
3. **folio_add_new_anon_rmap** (7.11%) - Reverse mapping
4. **folio_add_lru_vma** (4.88%) - LRU list management
5. **__mem_cgroup_charge** (4.37%) - Memory cgroup accounting

### User-space Functions (8-10% of total time)

1. **unified_cache_refill** (4.37%) - Main HAKMEM allocation path
2. **free** (1.40%) - Deallocation
3. **malloc** (1.36%) - Allocation wrapper
4. **shared_pool_acquire_slab** (1.31%) - Slab acquisition

**Insight**: User-space code is only 8-10% of runtime; the remaining ~90% is kernel overhead.

---

## Optimization Roadmap

### Phase 1: Eliminate Page Faults (Priority: URGENT)

**Target**: 2-3x improvement (8-12M ops/s)
**Effort**: 1 hour
**Changes**:
- Add `MAP_POPULATE` to `mmap()` in `superslab_acquire()` (see the sketch below)
- Files to modify: `/mnt/workdisk/public_share/hakmem/core/superslab/*.c`

**Validation**:
```bash
perf stat -e page-faults ./bench_random_mixed_hakmem 1000000 256 42
# Expected: <1,000 page faults (was 132,509)
```
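To make the Phase 1 change concrete, here is a minimal sketch of a pre-faulting superslab acquisition. It assumes the superslab layer obtains memory through a plain anonymous `mmap()`; `SUPERSLAB_SIZE` and the `HAKMEM_NO_PREFAULT` environment variable are hypothetical names used only for illustration (only `superslab_acquire()` itself appears in the roadmap above).

```c
/* Minimal sketch, not the actual HAKMEM code: SUPERSLAB_SIZE and
 * HAKMEM_NO_PREFAULT are assumed names for illustration. */
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

#define SUPERSLAB_SIZE (2UL * 1024 * 1024)  /* assumed superslab size */

void *superslab_acquire(void)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;

    /* Pre-fault every page at mmap() time unless explicitly disabled,
     * moving the fault cost from the hot path to startup. */
    if (getenv("HAKMEM_NO_PREFAULT") == NULL)
        flags |= MAP_POPULATE;

    void *base = mmap(NULL, SUPERSLAB_SIZE, PROT_READ | PROT_WRITE,
                      flags, -1, 0);
    return base == MAP_FAILED ? NULL : base;
}
```

The environment toggle reflects the Q&A note below about making pre-faulting optional for deployments that are sensitive to startup time.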
### Phase 2: Profile-Guided Optimization (Priority: HIGH)

**Target**: 1.2-1.3x additional improvement (10-16M ops/s cumulative)
**Effort**: 2 hours
**Changes**:
```bash
make clean
CFLAGS="-fprofile-generate" make
./bench_random_mixed_hakmem 10000000 256 42   # Generate profile data
make clean
CFLAGS="-fprofile-use" make
```

### Phase 3: Branch Optimization (Priority: MEDIUM)

**Target**: 1.1-1.2x additional improvement
**Effort**: 4 hours
**Changes** (see the first sketch after this roadmap):
- Add `__builtin_expect()` hints to hot paths in `unified_cache_refill`
- Simplify conditionals in the fast path
- Reorder checks so the common case comes first

### Phase 4: Cache Layout Optimization (Priority: LOW)

**Target**: 1.1-1.15x additional improvement
**Effort**: 2 hours
**Changes** (see the second sketch after this roadmap):
- Add `__attribute__((aligned(64)))` to hot structures
- Pack frequently accessed fields together
- Separate read-mostly and write-mostly data
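**Phase 3 illustration**: a minimal sketch of the branch-hint style described above. The thread-local cache layout and `cache_refill_slow()` are hypothetical; in HAKMEM the analogous slow path is `unified_cache_refill()`.

```c
/* Minimal sketch of Phase 3 branch hints on an allocator fast path.
 * The cache layout and cache_refill_slow() are hypothetical. */
#include <stddef.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

struct tls_cache {
    void  *slots[64];   /* objects ready to hand out */
    size_t count;       /* number of valid entries in slots[] */
};

/* Stub slow path for illustration; a real refill would pull a batch
 * of objects from a slab before returning one. */
static void *cache_refill_slow(struct tls_cache *c, size_t size)
{
    (void)c;
    (void)size;
    return NULL;
}

static inline void *cache_alloc(struct tls_cache *c, size_t size)
{
    /* Common case first: the thread-local cache has a free object. */
    if (likely(c->count > 0))
        return c->slots[--c->count];

    /* Rare case: fall through to the refill path. */
    return cache_refill_slow(c, size);
}
```

The hint lets the compiler lay out the refill path out of line, so the predicted-taken fast path stays dense in the instruction cache.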
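**Phase 4 illustration**: a minimal sketch of the cache-layout changes, with hypothetical field names. The point is 64-byte alignment plus keeping write-hot fields off the read-mostly cache line.

```c
/* Minimal sketch of Phase 4 layout changes; field names are hypothetical.
 * Read-mostly metadata and write-hot state get separate cache lines. */
#include <stdint.h>

/* Read-mostly: written once when the slab is created, then only read. */
struct slab_meta {
    void    *base;       /* start of the slab's memory */
    uint32_t obj_size;   /* size class served by this slab */
    uint32_t capacity;   /* total object slots */
} __attribute__((aligned(64)));

/* Write-hot: touched on every alloc/free. Giving it its own 64-byte
 * line avoids invalidating the read-mostly line on every update. */
struct slab_hot {
    void    *free_list;  /* singly linked list of free objects */
    uint32_t used;       /* objects currently handed out */
} __attribute__((aligned(64)));
```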
---

## Commands Used for Analysis

```bash
# Hardware performance counters
perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses,L1-dcache-load-misses,LLC-load-misses -r 3 \
  ./bench_random_mixed_hakmem 1000000 256 42

# Page fault and context switch metrics
perf stat -e task-clock,context-switches,cpu-migrations,page-faults,minor-faults,major-faults -r 3 \
  ./bench_random_mixed_hakmem 1000000 256 42

# Function-level profiling
perf record -F 5000 -g ./bench_random_mixed_hakmem 1000000 256 42
perf report --stdio -n --percent-limit 0.5

# Syscall tracing
strace -e trace=mmap,madvise,munmap,mprotect -c ./bench_random_mixed_hakmem 1000000 256 42
```

---

## Related Documents

- **PERF_PROFILE_ANALYSIS_20251204.md** - Earlier profiling analysis (phase 1)
- **BATCH_TIER_CHECKS_PERF_RESULTS_20251204.md** - Batch tier optimization results
- **bench_random_mixed.c** - Benchmark source code

---

## Next Steps

1. **Read Executive Summary** (5 min) - Understand the problem and solution
2. **Implement MAP_POPULATE** (1 hour) - Immediate 2-3x improvement
3. **Validate with perf stat** (5 min) - Confirm page faults dropped
4. **Re-run full benchmark suite** (30 min) - Measure actual speedup
5. **If target not reached, proceed to Phase 2** (PGO optimization)

---

## Questions & Answers

**Q: Why is IPC so high (0.97) if we're only at 4.1M ops/s?**
A: The CPU is executing instructions efficiently, but most of those instructions are in the kernel handling page faults. User-space code is only about 10% of runtime.

**Q: Can we just disable page fault handling?**
A: No, but we can pre-fault memory with MAP_POPULATE so page faults happen at startup instead of during the benchmark.

**Q: Why not just use hugepages?**
A: Hugepages are better (2-4x improvement) but require a more complex implementation. MAP_POPULATE gives a 2-3x improvement with one hour of work. We should do MAP_POPULATE first, then consider hugepages if we need more performance.

**Q: Will MAP_POPULATE hurt startup time?**
A: Yes, but we are trading startup time for runtime performance. For a memory allocator, this is usually the right tradeoff. We can also make it optional via an environment variable.

**Q: What about the branch mispredictions?**
A: Those are secondary. Fix page faults first (60% of time), then tackle branches (30% of remaining time), then cache misses (15% of remaining time).

---

## Conclusion

The analysis is complete and the path forward is clear:

1. Page faults are the primary bottleneck (60-70% of time)
2. MAP_POPULATE is the simplest fix (1 hour, 2-3x improvement)
3. PGO and branch hints can get us to the 4x target
4. All optimizations are straightforward and low-risk

**Confidence level**: Very high (based on hard profiling data)
**Risk level**: Low (MAP_POPULATE is well-tested and widely used)
**Time to 4x target**: 1-2 days of development

---

**Analysis conducted by**: Claude (Anthropic)
**Analysis method**: perf stat, perf record, perf report, strace
**Data quality**: High (3-run averages, <1% variance)
**Reproducibility**: 100% (all commands documented)