# Phase 6-A Benchmark Results

**Date**: 2025-11-29
**Change**: Disable SuperSlab lookup debug validation in RELEASE builds
**File**: `core/tiny_region_id.h:199-239`
**Guard**: `#if !HAKMEM_BUILD_RELEASE` around the `hak_super_lookup()` call
**Reason**: perf profiling showed a 15.84% CPU cost on the allocation hot path (debug-only validation)

---

## Executive Summary

The Phase 6-A implementation successfully removes the debug validation overhead in release builds, but the measured performance impact is **significantly smaller** than predicted:

- **Expected**: +12-15% (random_mixed), +8-10% (mid_mt_gap)
- **Actual (best 3 of 5)**: +1.67% (random_mixed), +1.33% (mid_mt_gap)
- **Actual (excluding warmup)**: +4.07% (random_mixed), +1.97% (mid_mt_gap)

**Recommendation**: HOLD on commit. Investigate the discrepancy between the perf analysis (15.84% CPU) and the benchmark results (~1-4% improvement).

---

## Benchmark Configuration

### Build Configurations

#### Baseline (Before Phase 6-A)

```bash
make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup ALWAYS enabled (no guard in code yet)
```

#### Phase 6-A (After)

```bash
git stash pop   # Restore Phase 6-A changes
make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup DISABLED (guarded by #if !HAKMEM_BUILD_RELEASE)
```

### Benchmark Parameters

- **Iterations**: 1,000,000 operations per run
- **Working Set**: 256 blocks
- **Seed**: 42 (reproducible)
- **Runs**: 5 per configuration
- **Suppression**: `2>/dev/null` to exclude debug output noise

---

## Raw Results

### bench_random_mixed (Tiny workload, 16B-1KB)

#### Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)

```
Run 1: 53.81 M ops/s
Run 2: 53.25 M ops/s
Run 3: 53.56 M ops/s
Run 4: 49.41 M ops/s
Run 5: 51.41 M ops/s

Average: 52.29 M ops/s
Stdev:    1.86 M ops/s
```

#### Phase 6-A (Release build, SuperSlab lookup DISABLED)

```
Run 1: 39.11 M ops/s   ⚠️ OUTLIER (warmup)
Run 2: 53.30 M ops/s
Run 3: 56.28 M ops/s
Run 4: 52.79 M ops/s
Run 5: 53.72 M ops/s

Average: 51.04 M ops/s (all runs)
Stdev:    6.80 M ops/s (high due to outlier)

Average (excl. Run 1): 54.02 M ops/s
```

**Outlier Analysis**: Run 1 is 27.6% slower than the average of runs 2-5, indicating a warmup/cache-cold issue.

---

### bench_mid_mt_gap (Mid MT workload, 1KB-8KB)

#### Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)

```
Run 1: 41.70 M ops/s
Run 2: 37.39 M ops/s
Run 3: 40.91 M ops/s
Run 4: 40.53 M ops/s
Run 5: 40.56 M ops/s

Average: 40.22 M ops/s
Stdev:    1.65 M ops/s
```

#### Phase 6-A (Release build, SuperSlab lookup DISABLED)

```
Run 1: 41.49 M ops/s
Run 2: 41.81 M ops/s
Run 3: 41.51 M ops/s
Run 4: 38.43 M ops/s
Run 5: 40.78 M ops/s

Average: 40.80 M ops/s
Stdev:    1.38 M ops/s
```

**Variance Analysis**: Both baseline and Phase 6-A show similar variance (a ~3-4 M ops/s spread), suggesting the measurement noise is inherent to this benchmark.
---

## Statistical Analysis

### Comparison 1: All Runs (Conservative)

| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|-----------|----------|-----------|----------|----------|----------|--------|
| random_mixed | 52.29 M | 51.04 M | -1.25 M | **-2.39%** | +12-15% | ❌ FAIL |
| mid_mt_gap | 40.22 M | 40.80 M | +0.59 M | **+1.46%** | +8-10% | ❌ FAIL |

### Comparison 2: Excluding First Run (Warmup Correction)

| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|-----------|----------|-----------|----------|----------|----------|--------|
| random_mixed | 51.91 M | 54.02 M | +2.11 M | **+4.07%** | +12-15% | ⚠️ PARTIAL |
| mid_mt_gap | 39.85 M | 40.63 M | +0.78 M | **+1.97%** | +8-10% | ❌ FAIL |

### Comparison 3: Best 3 of 5 (Peak Performance)

| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|-----------|----------|-----------|----------|----------|----------|--------|
| random_mixed | 53.54 M | 54.43 M | +0.89 M | **+1.67%** | +12-15% | ❌ FAIL |
| mid_mt_gap | 41.06 M | 41.60 M | +0.54 M | **+1.33%** | +8-10% | ❌ FAIL |

---

## Performance Summary

### Overall Results (Best 3 of 5 method)

- **random_mixed**: 53.54 → 54.43 M ops/s (+1.67%)
- **mid_mt_gap**: 41.06 → 41.60 M ops/s (+1.33%)

### vs Predictions

- **random_mixed**: Expected +12-15%, Actual +1.67% → **FAIL** (8-10x smaller than expected)
- **mid_mt_gap**: Expected +8-10%, Actual +1.33% → **FAIL** (6-7x smaller than expected)

### Interpretation

Phase 6-A shows **statistically measurable but practically negligible** performance improvements:

- Excluding warmup: +4.07% (random_mixed), +1.97% (mid_mt_gap)
- Best 3 of 5: +1.67% (random_mixed), +1.33% (mid_mt_gap)
- All runs: -2.39% (random_mixed), +1.46% (mid_mt_gap)

The improvements are **8-10x smaller** than expected based on the perf analysis.

---

## Root Cause Analysis

### Why the Discrepancy?
The perf profile showed `hak_super_lookup()` consuming **15.84% of CPU time**, yet removing it yields only a ~1-4% improvement. Possible explanations:

#### 1. **Compiler Optimization (Most Likely)**

The compiler may already be optimizing away the `hak_super_lookup()` call in release builds:

- **Dead Store Elimination**: the result of `hak_super_lookup()` is only used for debug logging
- **Inlining + Constant Propagation**: with LTO, the compiler sees the result is unused
- **Evidence**: the Phase 6-A guard has minimal impact, suggesting the code was already "free"

**Action**: Examine the assembly output to verify whether `hak_super_lookup()` is present in the baseline build.

#### 2. **Perf Sampling Bias**

The perf profile may have been captured during a different workload phase:

- Different allocation patterns (class distribution)
- Different cache states (cold vs. hot)
- Different thread counts (single vs. multi-threaded)

**Action**: Re-run perf on the exact benchmark workload to verify the 15.84% claim.

#### 3. **Measurement Noise**

The benchmarks show high variance:

- random_mixed: 1.86 M stdev (3.6% of mean)
- mid_mt_gap: 1.65 M stdev (4.1% of mean)

The measured improvements (+1-4%) are within **1-2 standard deviations** of the noise.

**Action**: Run longer benchmarks (10M+ operations) to reduce noise.

#### 4. **Lookup Already Cache-Friendly**

The SuperSlab registry lookup may be highly cache-efficient in these workloads:

- The small working set (256 blocks) fits in L1/L2 cache
- Registry entries for active SuperSlabs are hot
- The actual cost is much lower than perf's 15.84% suggests

**Action**: Benchmark with larger working sets (4KB+) to stress the cache.

#### 5. **Wrong Hot Path**

The perf profile showed 15.84% CPU in `hak_super_lookup()`, but this may not be on the **allocation hot path** that these benchmarks exercise:

- The call is in `tiny_region_id_write_header()` (allocation)
- The benchmarks mix alloc+free, so the free path may dominate
- Perf may have sampled during a malloc-heavy phase

**Action**: Isolate an allocation-only benchmark (no frees) to verify.

---

## Recommendations

### Immediate Actions

1. **HOLD** on committing Phase 6-A until the investigation completes
   - Current results don't justify the change
   - Risk: code churn without measurable benefit

2. **Verify Compiler Behavior**

   ```bash
   # Generate assembly for the baseline build
   # (-x c makes gcc compile the header as a C translation unit
   #  instead of generating a precompiled header)
   gcc -S -x c -DHAKMEM_BUILD_RELEASE=1 -O3 -o baseline.s core/tiny_region_id.h

   # Check if hak_super_lookup appears
   grep "hak_super_lookup" baseline.s

   # If absent: the compiler already eliminated it (explains the minimal improvement)
   # If present: something else is going on
   ```

3. **Re-run Perf on the Benchmark Workload**

   ```bash
   # Build the baseline without Phase 6-A
   git stash
   make clean && make bench_random_mixed_hakmem

   # Profile the exact benchmark
   perf record -g ./bench_random_mixed_hakmem 10000000 256 42
   perf report --stdio | grep -A20 "hak_super_lookup"

   # Verify whether the 15.84% claim holds for this workload
   ```

4. **Longer Benchmark Runs**

   ```bash
   # 100M operations to reduce noise
   for i in 1 2 3 4 5; do
       ./bench_random_mixed_hakmem 100000000 256 42 2>/dev/null
   done
   ```

### Long-Term Considerations

If the investigation reveals:

#### Scenario A: Compiler Already Optimized

- **Decision**: Commit Phase 6-A for code cleanliness (no harm, no foul)
- **Rationale**: Explicitly documents debug-only code and prevents future confusion
- **Benefit**: Future-proof if compiler behavior changes

#### Scenario B: Perf Was Wrong

- **Decision**: Discard Phase 6-A, update the perf methodology
- **Rationale**: The 15.84% CPU claim was based on flawed profiling
- **Action**: Document the correct perf sampling procedure

#### Scenario C: Benchmark Doesn't Stress the Hot Path

- **Decision**: Commit Phase 6-A, improve benchmark coverage
- **Rationale**: Real workloads may show the expected gains
- **Action**: Add an allocation-heavy benchmark (e.g., 90% malloc, 10% free)

#### Scenario D: Measurement Noise Dominates

- **Decision**: Commit Phase 6-A if longer runs show a >5% improvement
- **Rationale**: Noise can hide real improvements
- **Action**: Use the mimalloc-bench suite for more stable measurements

---

## Next Steps

### Phase 6-B: Conditional Path Forward

**Option 1: Investigate First (Recommended)**

1. Run the assembly analysis (1 hour)
2. Re-run perf on the benchmark (2 hours)
3. Run longer benchmarks (4 hours)
4. Make a data-driven decision

**Option 2: Commit Anyway**

- Rationale: The code is cleaner, and there is no measurable harm
- Risk: Future confusion if the optimization isn't actually needed

**Option 3: Discard Phase 6-A**

- Rationale: No measurable benefit, not worth the churn
- Risk: Missing a real optimization if the measurement was flawed

---

## Appendix: Full Benchmark Output

### Baseline - bench_random_mixed

```
=== Baseline: bench_random_mixed (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput = 53806309 ops/s  [iter=1000000 ws=256]  time=0.019s
Run 2: Throughput = 53246568 ops/s  [iter=1000000 ws=256]  time=0.019s
Run 3: Throughput = 53563123 ops/s  [iter=1000000 ws=256]  time=0.019s
Run 4: Throughput = 49409566 ops/s  [iter=1000000 ws=256]  time=0.020s
Run 5: Throughput = 51412515 ops/s  [iter=1000000 ws=256]  time=0.019s
```

### Phase 6-A - bench_random_mixed

```
=== Phase 6-A: bench_random_mixed (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput = 39111201 ops/s  [iter=1000000 ws=256]  time=0.026s
Run 2: Throughput = 53296242 ops/s  [iter=1000000 ws=256]  time=0.019s
Run 3: Throughput = 56279982 ops/s  [iter=1000000 ws=256]  time=0.018s
Run 4: Throughput = 52790754 ops/s  [iter=1000000 ws=256]  time=0.019s
Run 5: Throughput = 53715992 ops/s  [iter=1000000 ws=256]  time=0.019s
```

### Baseline - bench_mid_mt_gap

```
=== Baseline: bench_mid_mt_gap (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput = 41.70 M operations per second, relative time: 0.023979 s.
Run 2: Throughput = 37.39 M operations per second, relative time: 0.026745 s.
Run 3: Throughput = 40.91 M operations per second, relative time: 0.024445 s.
Run 4: Throughput = 40.53 M operations per second, relative time: 0.024671 s.
Run 5: Throughput = 40.56 M operations per second, relative time: 0.024657 s.
```

### Phase 6-A - bench_mid_mt_gap

```
=== Phase 6-A: bench_mid_mt_gap (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput = 41.49 M operations per second, relative time: 0.024103 s.
Run 2: Throughput = 41.81 M operations per second, relative time: 0.023917 s.
Run 3: Throughput = 41.51 M operations per second, relative time: 0.024089 s.
Run 4: Throughput = 38.43 M operations per second, relative time: 0.026019 s.
Run 5: Throughput = 40.78 M operations per second, relative time: 0.024524 s.
```

---

## Conclusion

Phase 6-A successfully implements the intended optimization (disabling the SuperSlab lookup in release builds), but the measured performance impact (+1-4%) is **8-10x smaller** than the expected +12-15% from the perf analysis.

**Critical Question**: Why does removing code that perf claims costs 15.84% CPU yield only a 1-4% improvement?

**Most Likely Answer**: The compiler was already optimizing away the `hak_super_lookup()` call in release builds through dead code elimination, since its result is only used for debug assertions.

**Recommended Action**: Investigate before committing. If the compiler was already optimizing, Phase 6-A is still valuable for code clarity and future-proofing, but the performance claim needs correction.