# Phase 61: C7 ULTRA Header-Light A/B Test Results **Date**: 2025-12-17 **Status**: NEUTRAL (+0.31%, below +1.0% GO threshold) **Decision**: Keep OFF by default, available as research flag --- ## Test Configuration **Baseline**: `HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT=0` (header write on every alloc) **Treatment**: `HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT=1` (header write once at refill) **Profile**: MIXED_TINYV3_C7_SAFE (Speed-first) **Runs**: 10 iterations per configuration **Binary**: bench_random_mixed_hakmem_minimal --- ## Runtime Profiling (Step 0) **Command**: ```bash perf record -F 99 -g -- ./bench_random_mixed_hakmem_minimal 200000000 400 1 perf report --no-children | head -60 ``` **Top Hotspots**: 1. `free`: 30.92% 2. `malloc`: 24.77% 3. `tiny_region_id_write_header`: 2.32% (within `free` backtrace) 4. `tiny_c7_ultra_alloc`: 1.90% **Observation**: - Header write is 2.32% hotspot (down from 4.56% in Phase 42) - C7 ULTRA alloc is 1.90% of total cycles - Combined target overhead: ~4.22% --- ## A/B Test Results ### Baseline (HEADER_LIGHT=0) ``` Run 1: 60,596,666 ops/s Run 2: 60,631,338 ops/s Run 3: 58,848,585 ops/s Run 4: 57,592,486 ops/s Run 5: 60,072,235 ops/s Run 6: 58,936,742 ops/s Run 7: 59,389,954 ops/s Run 8: 59,785,720 ops/s Run 9: 59,956,318 ops/s Run 10: 59,619,539 ops/s ``` **Statistics**: - Mean: 59,542,958 ops/s - Median: 59,702,630 ops/s - Min: 57,592,486 ops/s - Max: 60,631,338 ops/s - StdDev: 912,145 - CV: 1.53% ### Treatment (HEADER_LIGHT=1) ``` Run 1: 58,677,671 ops/s Run 2: 59,459,236 ops/s Run 3: 61,090,929 ops/s Run 4: 57,586,075 ops/s Run 5: 61,556,526 ops/s Run 6: 61,837,526 ops/s Run 7: 58,629,333 ops/s Run 8: 60,012,916 ops/s Run 9: 57,548,197 ops/s Run 10: 60,888,920 ops/s ``` **Statistics**: - Mean: 59,728,733 ops/s - Median: 59,736,076 ops/s - Min: 57,548,197 ops/s - Max: 61,837,526 ops/s - StdDev: 1,591,714 - CV: 2.66% --- ## Analysis **Delta**: +0.31% (185,775 ops/s improvement) **Decision Matrix**: - GO: +1.0% or better → NOT MET - NEUTRAL: ±1.0% → **MATCHED** (+0.31%) - NO-GO: -1.0% or worse → NOT MET **Verdict**: **NEUTRAL** --- ## Discussion ### Why +0.31% is Below Expectations 1. **Header Write Overhead Lower Than Expected**: - Profiling shows 2.32% (not 4.56% as in Phase 42) - Mixed workload dilutes C7-specific hotspots - Expected: ~2-3% gain - Actual: +0.31% 2. **Higher Variance in Treatment**: - Baseline CV: 1.53% - Treatment CV: 2.66% (1.74x higher) - Suggests additional noise or cache effects 3. **Header Write Not the Bottleneck**: - C7 ULTRA alloc hit is already fast (~5-7 instructions) - Header write (~3-4 instructions) is small part - Other factors (TLS cache locality, refill overhead) dominate 4. **Refill Phase Overhead**: - Header-light mode writes headers during refill (cold path) - Adds branch in hot path (`if (header_light)`) - Net instruction reduction: ~2-3 instructions (not 5-7) ### Positive Observations 1. **No Regression**: +0.31% is positive (though small) 2. **Implementation Stable**: Pre-existing implementation works correctly 3. **No Safety Issues**: Invariant (headers present) holds 4. **Rollback Safe**: ENV gate=0 by default --- ## Recommendation **Status**: Keep as **research flag** (default OFF) **Rationale**: 1. Gain (+0.31%) is below significance threshold (+1.0%) 2. Higher variance (CV 2.66% vs 1.53%) suggests instability 3. Instruction reduction insufficient to justify complexity 4. Better opportunities exist (e.g., Phase 62: TLS prefetch, Phase 63: refill batching) **Future Re-evaluation**: - Retry with C7-heavy workload (>50% C7 allocations) - Combine with other C7 optimizations (batch refill, SIMD header write) - Profile with IPC/cache-miss counters (not just cycles) --- ## ENV Control **Variable**: `HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT` **Default**: 0 (OFF) **Location**: `core/box/tiny_front_v3_env_box.h:145-152` **Usage**: ```bash # Enable header-light mode (research only) export HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT=1 # Disable (default) export HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT=0 # or unset unset HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT ``` --- ## Next Steps 1. **Keep implementation**: Code is clean, no removal needed 2. **Document as research flag**: Available for future C7-heavy workloads 3. **Phase 62 priorities**: - TLS prefetch optimization (higher impact potential) - Refill batch size tuning (reduce cold path overhead) - IPC profiling (identify real bottlenecks) --- ## Conclusion Phase 61 achieves **NEUTRAL** status (+0.31%): - Implementation works correctly (no bugs) - Gain is real but insufficient (+0.31% < +1.0% threshold) - Keep as research flag (default OFF) - Focus on higher-impact optimizations (Phase 62+) **Lesson**: Micro-optimizations require precise profiling. Cycle count alone insufficient—need IPC, cache misses, and workload-specific analysis.