# Phase 62A: C7 ULTRA Alloc Dependency Chain Trim - Results

**Date**: 2025-12-17
**Status**: NEUTRAL (-0.71%, research box)
**Baseline**: 48.34% of mimalloc (Phase 59b Speed-first)

---

## Executive Summary

Phase 62A attempted to optimize the `tiny_c7_ultra_alloc()` hot path by eliminating the per-call `tiny_front_v3_c7_ultra_header_light_enabled()` check and using the TLS `headers_initialized` flag instead. The optimization resulted in a **-0.71% regression (NEUTRAL)**, indicating that the approach does not yield the expected +1-3% gain.

**Conclusion**: Research box (default OFF, `HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0`)

---

## A/B Test Results (Mixed benchmark, 10-run)

### Baseline (HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0)

**Runs** (M ops/s):
```
59.553, 59.906, 60.134, 59.533, 56.265, 59.368, 60.045, 58.487, 60.141, 59.569
```

**Statistics**:
- **Mean**: 59.300 M ops/s
- **Median**: 59.561 M ops/s
- **StdDev**: 1.173 M ops/s
- **CV**: 1.98%

---

### Treatment (HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=1)

**Runs** (M ops/s):
```
56.352, 58.924, 58.946, 60.109, 58.630, 58.689, 59.609, 58.160, 59.939, 59.430
```

**Statistics**:
- **Mean**: 58.879 M ops/s
- **Median**: 58.935 M ops/s
- **StdDev**: 1.079 M ops/s
- **CV**: 1.83%

---

## Comparison

| Metric | Baseline (M ops/s) | Treatment (M ops/s) | Delta |
|--------|--------------------|---------------------|-------|
| Mean | 59.300 | 58.879 | **-0.71%** |
| Median | 59.561 | 58.935 | -1.05% |
| StdDev | 1.173 | 1.079 | -8.0% |
| CV | 1.98% | 1.83% | -0.15pp |

**Verdict**: **NEUTRAL** (-0.71% is within the ±1.0% threshold, but negative). The 0.421 M ops/s gap between the means is also smaller than the run-to-run StdDev of either configuration (about 1.1 M ops/s).

---

## Implementation Details

### Optimization Strategy

**Original Code** (`tiny_c7_ultra_alloc` hot path):
```c
void* tiny_c7_ultra_alloc(size_t size) {
    tiny_c7_ultra_tls_t* tls = &g_tiny_c7_ultra_tls;
    const bool header_light = tiny_front_v3_c7_ultra_header_light_enabled();  // Per-call check

    uint16_t n = tls->count;
    if (n > 0) {
        void* base = tls->freelist[n - 1];
        tls->count = n - 1;

        if (header_light) {  // Per-call branch
            return (uint8_t*)base + 1;
        }
        return tiny_region_id_write_header(base, 7);
    }
    // ... refill and retry
}
```

**Optimized Code** (Phase 62A):
```c
void* tiny_c7_ultra_alloc(size_t size) {
    tiny_c7_ultra_tls_t* tls = &g_tiny_c7_ultra_tls;
    // No per-call header_light check - use TLS flag instead

    uint16_t n = tls->count;
    if (n > 0) {
        void* base = tls->freelist[n - 1];
        tls->count = n - 1;

        if (tls->headers_initialized) {  // TLS flag set during refill
            return (uint8_t*)base + 1;
        }
        return tiny_region_id_write_header(base, 7);
    }
    // ... refill and retry
}
```

**Intended Benefits**:
1. Eliminate the per-call `tiny_front_v3_c7_ultra_header_light_enabled()` function call
2. Replace it with a TLS field access (already in cache from the `count`/`freelist` accesses)
3. Reduce dependency chain length

---

## Root Cause Analysis

### Why No Improvement?

1. **LTO Optimization Already In Place**
   - In HAKMEM_BENCH_MINIMAL (`-flto`), `tiny_front_v3_c7_ultra_header_light_enabled()` is likely already inlined
   - Function call overhead may therefore already be zero in the compiled binary
   - Replacing it with a TLS field access doesn't improve latency (still an L1 cache hit)

2. **TLS Access Not Cheaper Than Expected**
   - The TLS field `headers_initialized` requires an offset calculation plus a memory access
   - The cost of the (inlined) function call may actually be lower (register-based, already predicted)
   - Branch prediction on `if (header_light)` may be extremely accurate (99.99%+)

3. **Layout Tax from Added Code**
   - Phases 43, 46A, 47 precedent: adding code branches can cause I-cache/alignment disruption
   - The added if-dispatch at function entry (`if (!c7_ultra_alloc_depchain_opt_enabled())`) may affect code layout
   - Result: the -0.71% regression is consistent with this pattern

4. **Hot Path May Already Be Optimal**
   - Phase 61 profiling showed `tiny_c7_ultra_alloc` at 5.18% stack %
   - But function-level optimization attempts (Phase 43/46A/47) all showed negative or marginal returns
   - This suggests the hot path is already well-optimized by the compiler

---

## Lessons Learned

### 1. Function Call Overhead is Negligible in LTO Mode

With `-flto` and link-time optimization, function calls to simple getters are aggressively inlined. Removing them doesn't necessarily improve performance because:
- The compiler has already determined the optimal inlining
- Instruction fetch overhead may not be the bottleneck
- Replacing a call with a memory access can have similar latency

### 2. Layout Tax is Real and Persistent

This is the third time (Phase 43: -1.18%, Phase 46A: -0.68%, Phase 62A: -0.71%) that code addition/reorganization has resulted in a regression despite targeting hot functions. The pattern suggests:
- I-cache alignment matters more than instruction count
- Code layout disruptions can negate micro-optimization gains
- The Box Theory "minimal code change" principle is well-justified

### 3. Per-Call Flags May Be Faster Than Per-TLS State

Counter-intuitive finding: accessing a per-call computed flag (via function inlining) may be faster than accessing TLS state, because:
- Function results are likely held in registers (temporaries)
- TLS access requires a memory load plus an offset calculation
- The branch predictor handles the pattern well

### 4. 5.18% Stack % ≠ Optimizable Hotspot

Phase 61 profiling showed `tiny_c7_ultra_alloc` at 5.18% combined stack overhead, but this is misleading because:
- Much of the time is in the malloc/free wrappers and the benchmark loop (not C7 ultra itself)
- Self time is likely 2-3% (actual function execution)
- Micro-optimizations on already-optimized paths yield diminishing returns

---

## Decision

**NEUTRAL (research box)**:
- Set the default to `HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0` (OFF)
- Keep the code behind the ENV gate for future reference
- Do not adopt as the production default

**Next Steps**:
1. Phase 62B: Try the secondary target (`tiny_region_id_write_header` reordering) - higher risk
2. Or pivot to Phase 62C: Accept 48.34% as the performance ceiling and focus on production readiness
3. Or Phase 62D: Algorithmic redesign (batching, prefault strategy) - very high cost/risk

---

## Box Theory Compliance

| Principle | Status | Notes |
|-----------|--------|-------|
| Single Conversion Point | ✅ Yes | `tiny_c7_ultra_alloc()` boundary |
| Clear Boundary | ✅ Yes | Env gate `HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT` |
| Reversible | ✅ Yes | Can switch via ENV or compile flag |
| No Side Effects | ✅ Yes | Pure optimization attempt, no new data structures |
| Performance | ❌ No | **-0.71% regression (NEUTRAL), not adopted as default** |

**Overall**: Box Theory compliant, but the performance criterion is not met.

---

## Appendix: Raw Data

### Baseline (10-run, M ops/s)
```
59.553099
59.906197
60.134051
59.533090
56.265139
59.367898
60.044922
58.486467
60.141028
59.568791
```

### Treatment (10-run, M ops/s)
```
56.351851
58.923605
58.946089
60.109441
58.629557
58.689160
59.609485
58.160391
59.939368
59.430088
```

---

**End of Phase 62A Report**
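
As a cross-check on the figures in this report, the mean, median, and relative delta can be recomputed from the appendix raw data. The sketch below is illustrative only: the helper names (`runs_mean`, `runs_median`, `delta_percent`, `verdict`) are hypothetical and not part of HAKMEM, and the `GO`/`NO-GO` labels outside the band are an assumption. The report itself only states the ±1.0% NEUTRAL threshold.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Raw throughput samples (M ops/s), copied from the appendix of this report. */
const double baseline_runs[10] = {
    59.553099, 59.906197, 60.134051, 59.533090, 56.265139,
    59.367898, 60.044922, 58.486467, 60.141028, 59.568791,
};
const double treatment_runs[10] = {
    56.351851, 58.923605, 58.946089, 60.109441, 58.629557,
    58.689160, 59.609485, 58.160391, 59.939368, 59.430088,
};

/* Arithmetic mean of n samples (n > 0). */
double runs_mean(const double* runs, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += runs[i];
    }
    return sum / (double)n;
}

/* Three-way comparison of two doubles for qsort. */
static int cmp_double(const void* a, const void* b) {
    double x = *(const double*)a;
    double y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Median of n samples (0 < n <= 64); sorts a local copy so the input is untouched. */
double runs_median(const double* runs, size_t n) {
    double sorted[64];
    memcpy(sorted, runs, n * sizeof(double));
    qsort(sorted, n, sizeof(double), cmp_double);
    return (n % 2) ? sorted[n / 2]
                   : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
}

/* Relative change of treatment versus baseline, in percent. */
double delta_percent(double baseline, double treatment) {
    return (treatment - baseline) / baseline * 100.0;
}

/* Hypothetical verdict rule: only the +/-1.0% NEUTRAL band comes from the report;
 * the GO and NO-GO labels outside that band are assumed for illustration. */
const char* verdict(double delta_pct) {
    if (delta_pct > 1.0) return "GO";
    if (delta_pct < -1.0) return "NO-GO";
    return "NEUTRAL";
}
```

Passing `delta_percent(runs_mean(baseline_runs, 10), runs_mean(treatment_runs, 10))` to `verdict` classifies the mean delta, which is the metric the report's verdict is based on. The median delta (-1.05%) falls just outside the ±1.0% band, so the classification is sensitive to which summary statistic is chosen.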