# Phase 7 Full Benchmark Suite Execution Plan

**Date**: 2025-11-08
**Phase**: 7-1.3 (HEADER_CLASSIDX=1 optimization)
**Current Status**: Partial results available (Larson 1T: 2.63M ops/s, bench_random_mixed 128B: 17.7M ops/s)
**Goal**: Comprehensive performance evaluation across ALL benchmark patterns

---

## Executive Summary

### Available Benchmarks (5 categories)

1. **Larson** - Multi-threaded stress test (8-128B, mimalloc-bench derived)
2. **Random Mixed** - Single-threaded random allocation (16-8192B)
3. **Mid-Large MT** - Multi-threaded mid-size (8-32KB)
4. **VM Mixed** - Large allocations (512KB-2MB, L2.5/L2 test)
5. **Tiny Hot** - Hot path micro-benchmark (8-64B, LIFO)

### Current Build Status (Phase 7 = HEADER_CLASSIDX=1)

All benchmarks were built with HEADER_CLASSIDX=1 on 2025-11-07/08:

- ✅ `larson_hakmem` (2025-11-08 11:48)
- ✅ `bench_random_mixed_hakmem` (2025-11-08 11:48)
- ✅ `bench_mid_large_mt_hakmem` (2025-11-07 18:42)
- ✅ `bench_tiny_hot_hakmem` (2025-11-07 18:03)
- ✅ `bench_vm_mixed_hakmem` (2025-11-07 18:03)

**Note**: The Makefile has `HAKMEM_TINY_HEADER_CLASSIDX=1` permanently enabled (lines 99-100).
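Every comparison in this plan reduces to the same percent-delta computation on two throughput numbers. A minimal helper makes spot checks easy; in the example below, the 3.0M ops/s "system baseline" is a placeholder for illustration, not a measured result:

```shell
# Percent-delta helper, matching the "vs_sys" / "vs_mi" columns computed
# by the awk snippets in Phase 4.
pct_delta() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.1f%%\n", (a/b - 1) * 100 }'
}

# Example: the Larson 1T partial result (2.63M ops/s) against a
# hypothetical 3.0M ops/s system baseline (the baseline is an assumption).
pct_delta 2630000 3000000   # -> -12.3%
```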
---

## Execution Plan

### Phase 1: Verify Build Status (5 minutes)

**Verify HEADER_CLASSIDX=1 is enabled:**

```bash
# Check Makefile flag
grep "HAKMEM_TINY_HEADER_CLASSIDX" Makefile

# Verify all binaries are up-to-date
make -n bench_random_mixed_hakmem bench_tiny_hot_hakmem \
  bench_mid_large_mt_hakmem bench_vm_mixed_hakmem \
  larson_hakmem
```

**If a rebuild is needed:**

```bash
# Clean rebuild with HEADER_CLASSIDX=1 (already the default)
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system bench_random_mixed_mi \
  bench_tiny_hot_hakmem bench_tiny_hot_system bench_tiny_hot_mi \
  bench_mid_large_mt_hakmem bench_mid_large_mt_system bench_mid_large_mt_mi \
  bench_vm_mixed_hakmem bench_vm_mixed_system \
  larson_hakmem larson_system larson_mi
```

**Time**: ~3-5 minutes (if a rebuild is needed)

---

### Phase 2: Quick Sanity Test (2 minutes)

**Test that each benchmark runs successfully:**

```bash
# Larson (1T, 1 second)
./larson_hakmem 1 8 128 1024 1 12345 1

# Random Mixed (small run)
./bench_random_mixed_hakmem 1000 128 1234567

# Mid-Large MT (2 threads, small)
./bench_mid_large_mt_hakmem 2 1000 2048 42

# VM Mixed (small)
./bench_vm_mixed_hakmem 100 256 424242

# Tiny Hot (small)
./bench_tiny_hot_hakmem 32 10 1000
```

**Expected**: All benchmarks run without SEGV/crashes.
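The sanity commands above can be wrapped so that a single failure stands out at a glance. A sketch; the `sanity` helper is illustrative, not part of the repo:

```shell
# Report PASS/FAIL for each sanity command based on its exit status.
sanity() {
  desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc"
  fi
}

# Usage against the real binaries (assumed to be in the current directory):
#   sanity "larson 1T"     ./larson_hakmem 1 8 128 1024 1 12345 1
#   sanity "random_mixed"  ./bench_random_mixed_hakmem 1000 128 1234567
sanity "demo command" true   # -> PASS: demo command
```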
---

### Phase 3: Full Benchmark Suite Execution

#### Option A: Automated Suite Runner (RECOMMENDED) ⭐

**Use the existing `bench_suite_matrix.sh`:**

```bash
# Runs ALL benchmarks (random_mixed, mid_large_mt, vm_mixed, tiny_hot)
# across the system/mimalloc/HAKMEM variants
./scripts/bench_suite_matrix.sh
```

**Output**:
- CSV: `bench_results/suite//results.csv`
- Raw logs: `bench_results/suite//raw/*.out`

**Time**: ~15-20 minutes

**Coverage**:
- Random Mixed: 2 cycles × 2 ws × 3 variants = 12 runs
- Mid-Large MT: 2 threads × 3 variants = 6 runs
- VM Mixed: 2 cycles × 2 variants = 4 runs (system + hakmem only)
- Tiny Hot: 2 sizes × 3 variants = 6 runs

**Total**: 28 benchmark runs

---

#### Option B: Individual Benchmark Scripts (Detailed Analysis)

If you need more control or want to run A/B tests with environment variables:

##### 3.1 Larson Benchmark (Multi-threaded Stress)

**Basic run (1T, 4T, 8T):**

```bash
# 1 thread, 10 seconds
HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 1

# 4 threads, 10 seconds (CRITICAL: test multi-thread stability)
HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 4

# 8 threads, 10 seconds
HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 8
```

**A/B test with environment variables:**

```bash
# Use the automated script (includes PGO)
./scripts/bench_larson_1t_ab.sh
```

**Output**: `bench_results/larson_ab//results.csv`
**Time**: ~20-30 minutes (includes PGO build)

**Key Metrics**:
- Throughput (ops/s)
- Stability (4T should not crash - see the Phase 6-2.3 active counter fix)

---

##### 3.2 Random Mixed (Single-threaded, Mixed Sizes)

**Basic run:**

```bash
# 400K cycles, 8192B working set
HAKMEM_WRAP_TINY=1 ./bench_random_mixed_hakmem 400000 8192 1234567
./bench_random_mixed_system 400000 8192 1234567
./bench_random_mixed_mi 400000 8192 1234567
```

**A/B test with environment variables:**

```bash
# Runs 5 repetitions with median calculation
./scripts/bench_random_mixed_ab.sh
```

**Output**: `bench_results/random_mixed_ab//results.csv`
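The A/B script above sweeps environment variables internally. A minimal manual sweep over one knob might look like the following; `HAKMEM_TINY_FAST_CAP` is an assumed variable name derived from the FAST_CAP metric, so check `scripts/bench_random_mixed_ab.sh` for the real one:

```shell
# Manual A/B sweep sketch: run the same workload under several settings.
# HAKMEM_TINY_FAST_CAP is an assumed knob name (not confirmed in the repo);
# the real benchmark invocation is left commented out.
for cap in 8 16 32; do
  echo "=== FAST_CAP=$cap ==="
  # HAKMEM_TINY_FAST_CAP=$cap HAKMEM_WRAP_TINY=1 \
  #   ./bench_random_mixed_hakmem 400000 8192 1234567
done
```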
**Time**: ~15-20 minutes (5 reps × multiple configs)

**Key Metrics**:
- Throughput (ops/s) across different working set sizes
- SPECIALIZE_MASK impact (0 vs 0x0F)
- FAST_CAP impact (8 vs 16 vs 32)

---

##### 3.3 Mid-Large MT (Multi-threaded, 8-32KB)

**Basic run:**

```bash
# 4 threads, 40K cycles, 2KB working set
HAKMEM_WRAP_TINY=1 ./bench_mid_large_mt_hakmem 4 40000 2048 42
./bench_mid_large_mt_system 4 40000 2048 42
./bench_mid_large_mt_mi 4 40000 2048 42
```

**A/B test:**

```bash
./scripts/bench_mid_large_mt_ab.sh
```

**Output**: `bench_results/mid_large_mt_ab//results.csv`
**Time**: ~10-15 minutes

**Key Metrics**:
- Multi-threaded performance (2T vs 4T)
- HAKMEM's SuperSlab efficiency (expected: strong performance here)

**Note**: Previous results showed a HAKMEM weakness here (suite/20251107: 2.1M vs system 8.7M). This is unexpected given the Mid-Large benchmark success (+108% on 2025-11-02). Investigate whether this is a regression or a different test pattern.

---

##### 3.4 VM Mixed (Large Allocations, 512KB-2MB)

**Basic run:**

```bash
# 20K cycles, 256 working set
HAKMEM_BIGCACHE_L25=1 HAKMEM_WRAP_TINY=1 ./bench_vm_mixed_hakmem 20000 256 424242
./bench_vm_mixed_system 20000 256 424242
```

**Time**: ~5 minutes

**Key Metrics**:
- L2.5 cache effectiveness (BIGCACHE_L25=1 vs 0)
- Large allocation performance

---

##### 3.5 Tiny Hot (Hot Path Micro-benchmark)

**Basic run:**

```bash
# 32B, 100 batch, 60K cycles
HAKMEM_WRAP_TINY=1 ./bench_tiny_hot_hakmem 32 100 60000
./bench_tiny_hot_system 32 100 60000
./bench_tiny_hot_mi 32 100 60000

# 64B
HAKMEM_WRAP_TINY=1 ./bench_tiny_hot_hakmem 64 100 60000
./bench_tiny_hot_system 64 100 60000
./bench_tiny_hot_mi 64 100 60000
```

**Time**: ~5 minutes

**Key Metrics**:
- Hot path efficiency (direct TLS cache access)
- Expected weakness (Phase 6 analysis: -60% vs system)

---

### Phase 4: Analysis and Comparison

#### 4.1 Extract Results from the Suite Run

```bash
# Get latest suite results
latest=$(ls -td bench_results/suite/* | head -1)
cat ${latest}/results.csv

# Quick comparison. Note: "system" is a built-in awk function name and
# cannot be used as an array, so the array is named "sys"; per-variant
# counts keep each average correct, and the guards avoid division by
# zero when a variant (e.g. mi for vm_mixed) is missing.
awk -F, 'NR>1 {
  if ($2=="hakmem") { hk[$1]+=$4;  hc[$1]++ }
  if ($2=="system") { sys[$1]+=$4; sc[$1]++ }
  if ($2=="mi")     { mi[$1]+=$4;  mc[$1]++ }
} END {
  for (b in hk) {
    h = hk[b]/hc[b]
    s = sc[b] ? sys[b]/sc[b] : 0
    m = mc[b] ? mi[b]/mc[b]  : 0
    printf "%s: HAKMEM=%.2fM system=%.2fM mi=%.2fM (vs_sys=%+.1f%%, vs_mi=%+.1f%%)\n",
      b, h/1e6, s/1e6, m/1e6,
      s ? (h/s-1)*100 : 0, m ? (h/m-1)*100 : 0
  }
}' ${latest}/results.csv
```

#### 4.2 Key Comparisons

**Phase 7 vs system malloc:**

```bash
# Extract HAKMEM vs system for each benchmark
awk -F, 'NR>1 && ($2=="hakmem" || $2=="system") {
  key=$1 "," $3
  if ($2=="hakmem") h[key]=$4
  if ($2=="system") s[key]=$4
} END {
  for (k in h) {
    if (s[k]) {
      pct = (h[k]/s[k] - 1) * 100
      printf "%s: %.2fM vs %.2fM (%+.1f%%)\n", k, h[k]/1e6, s[k]/1e6, pct
    }
  }
}' ${latest}/results.csv | sort
```

**Phase 7 vs mimalloc:**

```bash
# Same comparison against mimalloc
awk -F, 'NR>1 && ($2=="hakmem" || $2=="mi") {
  key=$1 "," $3
  if ($2=="hakmem") h[key]=$4
  if ($2=="mi") m[key]=$4
} END {
  for (k in h) {
    if (m[k]) {
      pct = (h[k]/m[k] - 1) * 100
      printf "%s: %.2fM vs %.2fM (%+.1f%%)\n", k, h[k]/1e6, m[k]/1e6, pct
    }
  }
}' ${latest}/results.csv | sort
```

#### 4.3 Generate Summary Report

```bash
# Create a comprehensive summary skeleton.
# The heredoc delimiter is deliberately UNQUOTED so that $(date) and
# ${latest} expand (a quoted 'REPORT' would emit them literally).
cat > PHASE7_RESULTS_SUMMARY.md << REPORT
# Phase 7 Benchmark Results Summary

## Test Configuration
- Phase: 7-1.3 (HEADER_CLASSIDX=1)
- Date: $(date +%Y-%m-%d)
- Suite: $(basename ${latest})

## Overall Results

### Random Mixed (16-8192B, single-threaded)
[Insert results here]

### Mid-Large MT (8-32KB, multi-threaded)
[Insert results here]

### VM Mixed (512KB-2MB, large allocations)
[Insert results here]

### Tiny Hot (8-64B, hot path micro)
[Insert results here]

### Larson (8-128B, multi-threaded stress)
[Insert results here]

## Analysis

### Strengths
[Areas where HAKMEM outperforms]

### Weaknesses
[Areas where HAKMEM underperforms]

### Comparison with Previous Phases
[Phase 6 vs Phase 7 delta]

## Bottleneck Identification
[Performance profiling with perf]
REPORT
```

---

### Phase 5: Performance Profiling (Optional, if bottlenecks are found)

**Profile hot paths with perf:**

```bash
# Profile random_mixed (if slow)
perf record -g --call-graph dwarf -- \
  ./bench_random_mixed_hakmem 400000 8192 1234567
perf report --stdio > perf_random_mixed_phase7.txt

# Profile Larson 1T
perf record -g --call-graph dwarf -- \
  ./larson_hakmem 10 8 128 1024 1 12345 1
perf report --stdio > perf_larson_1t_phase7.txt
```

**Compare with Phase 6:**

```bash
# If Phase 6 binaries are saved, run them side-by-side
# and compare the perf reports
```

---

## Expected Results & Analysis Strategy

### Baseline Expectations (from Phase 6 analysis)

#### Strong Areas (Expected +50% to +171% vs System)

1. **Mid-Large (8-32KB)**: HAKMEM's SuperSlab should dominate
   - Expected: +100% to +150% vs system
   - Phase 7 improvement target: maintain or improve
2. **Large Allocations (VM Mixed)**: L2.5 layer efficiency
   - Expected: competitive with, or a slight win over, system

#### Weak Areas (Expected -50% to -70% vs System)

1. **Tiny (≤128B)**: Structural weakness identified in Phase 6
   - Expected: -40% to -60% vs system
   - Phase 7 HEADER_CLASSIDX may help: +10-20% improvement
2. **Random Mixed**: Magazine layer overhead
   - Expected: -20% to -50% vs system
   - Phase 7 target: reduce the gap
3. **Larson Multi-thread**: Contention issues
   - Expected: variable (1T: ok, 4T+: risk of crashes)
   - Phase 7 critical: verify 4T stability (active counter fix)

### What to Look For

#### Phase 7 Improvements (HEADER_CLASSIDX=1)

- **Tiny allocations**: +10-30% improvement (fewer header loads)
- **Random mixed**: +15-25% improvement (class_idx in header)
- **Cache efficiency**: better locality (1-byte header vs 2-byte)

#### Red Flags

- **Mid-Large regression**: should NOT regress (HEADER_CLASSIDX does not affect the mid-large path)
- **4T+ crashes in Larson**: the active counter bug should be fixed (Phase 6-2.3)
- **Severe regression (>20%)**: investigate immediately

#### Bottleneck Identification

If Phase 7 results are disappointing:

1. **Run perf** on the slow benchmarks
2. **Compare with Phase 6** perf profiles (if available)
3. **Check hot paths**:
   - `tiny_alloc_fast()` - should be 3-4 instructions
   - `tiny_free_fast()` - should be a fast header check
   - `superslab_refill()` - should use the P0 ctz optimization

---

## Time Estimates

### Minimal Run (Option A: Suite Script Only)

- Build verification: 2 min
- Sanity test: 2 min
- Suite execution: 15-20 min
- Quick analysis: 5 min
- **Total: ~25-30 minutes**

### Comprehensive Run (Option B: All Individual Scripts)

- Build verification: 2 min
- Sanity test: 2 min
- Larson A/B: 25 min
- Random Mixed A/B: 20 min
- Mid-Large MT A/B: 15 min
- VM Mixed: 5 min
- Tiny Hot: 5 min
- Analysis & report: 15 min
- **Total: ~90 minutes (1.5 hours)**

### With Performance Profiling

- Add: ~20-30 min per benchmark
- **Total: ~2-3 hours**

---

## Recommended Execution Order

### Quick Assessment (30 minutes)

1. ✅ Verify build status
2. ✅ Run the suite script (`bench_suite_matrix.sh`)
3. ✅ Generate a quick comparison
4. 🔍 Identify major wins/losses
5. 📝 Decide whether a deep dive is needed

### Deep Analysis (if needed, +60 minutes)

1. 🔬 Run individual A/B scripts for problem areas
2. 📊 Profile with perf
3. 📝 Compare with the Phase 6 baseline
4. 💡 Generate actionable insights

---

## Output Organization

```
bench_results/
├── suite/
│   └── /
│       ├── results.csv        # All benchmarks, all variants
│       └── raw/*.out          # Raw logs
├── random_mixed_ab/
│   └── /
│       ├── results.csv        # A/B test results
│       └── raw/*.txt          # Per-run data
├── larson_ab/
│   └── /
│       ├── results.csv
│       └── raw/*.out
├── mid_large_mt_ab/
│   └── /
│       ├── results.csv
│       └── raw/*.out
└── ...

# Analysis reports
PHASE7_RESULTS_SUMMARY.md      # High-level summary
PHASE7_DETAILED_ANALYSIS.md    # Deep dive (if needed)
perf_*.txt                     # Performance profiles
```

---

## Next Steps After Benchmark

### If Phase 7 Shows Strong Results (+30-50% overall)

1. ✅ Commit and document the improvements
2. 🎯 Focus on the remaining weak areas (tiny allocations)
3. 📢 Prepare a performance summary for stakeholders

### If Phase 7 Shows Modest Results (+10-20% overall)

1. 🔍 Identify specific bottlenecks (perf profiling)
2. 🧪 Test individual optimizations in isolation
3. 📊 Compare with Phase 6 to ensure no regressions

### If Phase 7 Shows Regressions (any area -10% or worse)

1. 🚨 Investigate immediately
2. 🔄 Bisect to find the regression point
3. 🧪 Consider reverting HEADER_CLASSIDX if severe

---

## Quick Reference Commands

```bash
# Full suite (automated)
./scripts/bench_suite_matrix.sh

# Individual benchmarks (quick test)
./larson_hakmem 1 8 128 1024 1 12345 1
./bench_random_mixed_hakmem 400000 8192 1234567
./bench_mid_large_mt_hakmem 4 40000 2048 42
./bench_vm_mixed_hakmem 20000 256 424242
./bench_tiny_hot_hakmem 32 100 60000

# A/B tests (environment variable sweeps)
./scripts/bench_larson_1t_ab.sh
./scripts/bench_random_mixed_ab.sh
./scripts/bench_mid_large_mt_ab.sh

# Latest results
ls -td bench_results/suite/* | head -1
cat $(ls -td bench_results/suite/* | head -1)/results.csv

# Performance profiling
perf record -g --call-graph dwarf -- ./bench_random_mixed_hakmem 400000 8192 1234567
perf report --stdio > perf_output.txt
```

---

## Key Success Metrics

### Primary Goal: Overall Improvement

- **Target**: +20-30% average throughput vs Phase 6
- **Minimum**: no regressions in mid-large (HAKMEM's strength)

### Secondary Goals

1. **Stability**: 4T+ Larson runs without crashes
2. **Tiny improvement**: -40% to -50% vs system (from -60%)
3. **Random mixed improvement**: -10% to -20% vs system (from -30%+)

### Stretch Goals

1. **Mid-large dominance**: maintain +100% vs system
2. **Overall parity**: match or beat system malloc on average
3. **Consistency**: no severe outliers (no single test <50% of system)

---

**Document Version**: 1.0
**Created**: 2025-11-08
**Author**: Claude (Task Agent)
**Status**: Ready for execution