# Phase 40: BENCH_MINIMAL Gate Constantization Results

**Date**: 2025-12-16
**Verdict**: **NO-GO (-2.47%)**
**Status**: Reverted

## Executive Summary

Phase 40 attempted to constantize `tiny_header_mode()` in BENCH_MINIMAL mode, following the pattern that succeeded in Phase 39 (+1.98%). A/B testing instead showed an unexpected **-2.47% regression**, leading to a NO-GO verdict and a full revert of the changes.

## Hypothesis

Building on Phase 39's success with gate function constantization (+1.98%), Phase 40 targeted `tiny_header_mode()` as the next highest-impact candidate based on FAST v3 perf profiling:

- **Location**: `core/tiny_region_id.h:180-211`
- **Pattern**: Lazy-init with `static int g_header_mode = -1` + `getenv()`
- **Call site**: Hot path in `tiny_region_id_write_header()` (4.56% self-time)
- **Expected gain**: +0.3% to +0.8% (similar to Phase 39 targets)

## Implementation

### Change: tiny_header_mode() Constantization

**File**: `/mnt/workdisk/public_share/hakmem/core/tiny_region_id.h`

```c
static inline int tiny_header_mode(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 40: BENCH_MINIMAL → fixed to FULL (header write enabled)
    // Rationale: Eliminates lazy-init gate check in alloc hot path
    // Expected: +0.3~0.8% (TBD after A/B test)
    return TINY_HEADER_MODE_FULL;
#else
    static int g_header_mode = -1;
    if (__builtin_expect(g_header_mode == -1, 0)) {
        const char* e = getenv("HAKMEM_TINY_HEADER_MODE");
        // ... [original lazy-init logic] ...
    }
    return g_header_mode;
#endif
}
```

**Rationale**:

- In BENCH_MINIMAL mode, always return the constant `TINY_HEADER_MODE_FULL` (0)
- Eliminates the branch and lazy-init overhead in the hot path
- Matches default benchmark behavior (FULL mode)

## A/B Test Results

### Test Configuration

- **Benchmark**: `bench_random_mixed_hakmem_minimal`
- **Test harness**: `scripts/run_mixed_10_cleanenv.sh`
- **Parameters**: `ITERS=20000000 WS=400`
- **Method**: Git stash A/B (baseline vs treatment)

### Baseline (FAST v3 without Phase 40)

```
Run 1/10: 56789069 ops/s
Run 2/10: 56274671 ops/s
Run 3/10: 56513942 ops/s
Run 4/10: 56133590 ops/s
Run 5/10: 56634961 ops/s
Run 6/10: 54943677 ops/s
Run 7/10: 57088883 ops/s
Run 8/10: 56337157 ops/s
Run 9/10: 55930637 ops/s
Run 10/10: 56590285 ops/s

Mean: 56,323,700 ops/s
```

### Treatment (FAST v4 with Phase 40)

```
Run 1/10: 54355307 ops/s
Run 2/10: 56936372 ops/s
Run 3/10: 54694629 ops/s
Run 4/10: 54504756 ops/s
Run 5/10: 55137468 ops/s
Run 6/10: 52434980 ops/s
Run 7/10: 52438841 ops/s
Run 8/10: 54966798 ops/s
Run 9/10: 56834583 ops/s
Run 10/10: 57034821 ops/s

Mean: 54,933,856 ops/s
```

### Delta Analysis

```
Baseline:  56,323,700 ops/s
Treatment: 54,933,856 ops/s
Delta:     -1,389,844 ops/s (-2.47%)

Verdict: NO-GO (threshold: -0.5% or worse)
```

## Root Cause Analysis

### Why did Phase 40 fail when Phase 39 succeeded?

#### 1. Code Layout Effects (Phase 22-2 Precedent)

The regression is likely caused by **compiler code layout changes** rather than by the logic change itself:

- **LTO reordering**: Adding the `#if HAKMEM_BENCH_MINIMAL` block changes function layout
- **Instruction cache**: Small layout changes can significantly affect icache hit rates
- **Branch prediction**: Modified code placement affects CPU branch predictor state

**Evidence from Phase 22-2**:

- Physical code deletion caused a **-5.16% regression** despite removing "dead" code
- Reason: Layout changes disrupted hot path alignment and icache behavior
- Lesson: "Deleting to speed up" is unreliable with LTO

#### 2. Hot Path Already Optimized

Unlike the Phase 39 targets, `tiny_header_mode()` may already be effectively optimized:

**Phase 21 Hot/Cold Split**:

```c
// Phase 21: Hot/cold split for FULL mode (ENV-gated)
if (tiny_header_hotfull_enabled()) {
    int header_mode = tiny_header_mode();
    if (__builtin_expect(header_mode == TINY_HEADER_MODE_FULL, 1)) {
        // Hot path: straight-line code (no existing_header read, no guard call)
        uint8_t desired_header = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
        *header_ptr = desired_header;
        // ... fast path ...
        return user;
    }
    // Cold path
    return tiny_region_id_write_header_slow(base, class_idx, header_ptr);
}
```

**Key observation**:

- The hot path at line 349 calls `tiny_header_mode()` and checks for `TINY_HEADER_MODE_FULL`
- This call already happens only **once per allocation** and is **highly predictable** (always FULL in benchmarks)
- The `__builtin_expect` hint tells the compiler to lay out the FULL branch as the likely path
- The compiler may already be inlining the call and optimizing away the branch

**Phase 39 difference**:

- Phase 39 targeted gates called on **every path** without existing optimization
- Those gates had no Phase 21-style hot/cold split
- Constantization provided genuine branch elimination

#### 3. Snapshot Caching Interaction

The `TinyFrontV3Snapshot` mechanism caches the `tiny_header_mode()` value:

```c
// core/box/tiny_front_v3_env_box.h:13
uint8_t header_mode; // caches the value of tiny_header_mode()

// core/hakmem_tiny.c:83
.header_mode = (uint8_t)tiny_header_mode(),
```

If most allocations use the cached snapshot value rather than calling `tiny_header_mode()` directly, constantizing the function provides minimal benefit while still incurring the layout disruption cost.

## Lessons Learned

### 1. Not All Gates Are Created Equal

**Phase 39 success criteria** (gates that benefit from constantization):

- Called on **every hot path** without optimization
- No existing hot/cold split or branch prediction hints
- No snapshot caching mechanism
- Examples: `g_alloc_front_gate_enabled`, `g_alloc_prewarm_enabled`

**Phase 40 failure indicators** (gates that DON'T benefit):

- Already optimized with hot/cold split (Phase 21)
- Protected by `__builtin_expect` branch hints
- Cached in snapshot structures
- Infrequently called (once per allocation vs once per operation)

### 2. Code Layout Tax Exceeds Logic Benefit

Even when the logic change is sound, layout disruption can dominate. Only the net result below was measured; the logic benefit is the pre-test expectation, and the layout tax is the residual inferred from the two:

```
Logic benefit: ~+0.5% (eliminate branch + lazy-init; expected)
Layout tax:    ~-3.0% (icache/alignment disruption; inferred)
Net result:    -2.47% (measured, NO-GO)
```

### 3. Perf Profile Can Be Misleading

`tiny_region_id_write_header()` showed 4.56% self-time in perf, but:

- Most of that time is likely **actual header write work**, not gate overhead
- The `tiny_header_mode()` call may already be optimized by the compiler
- Function-level self-time does not separate "work" time from "gate" time

**Better heuristic**: Only constantize gates that:

1. Appear in perf with **high instruction count** (not just time)
2. Have visible `getenv()` calls in assembly
3. Lack existing optimization (no Phase 21-style split)

## Recommendation

**REVERT Phase 40 changes completely.**

### Alternative Approaches (Future Research)

If we still want to optimize `tiny_header_mode()`:

1. **Wait for Phase 21 BENCH_MINIMAL adoption**
   - Constantize `tiny_header_hotfull_enabled()` instead
   - Rationale: Eliminates the entire hot/cold branch, not just the mode check
   - Expected: +0.5% to +1% (higher leverage point)

2. **Profile-guided optimization**
   - Let the compiler optimize based on a runtime profile
   - Rationale: Avoids manual layout disruption
   - Method: `gcc -fprofile-generate` → run benchmark → `gcc -fprofile-use`

3. **Assembly inspection first**
   - Check whether the gate is actually compiled as a branch
   - Method: `objdump -d bench_random_mixed_hakmem_minimal | grep -A20 tiny_header_mode`
   - If already optimized away → skip constantization

## Files Modified (REVERTED)

- `/mnt/workdisk/public_share/hakmem/core/tiny_region_id.h` (lines 180-218)

## Next Steps

1. **Revert all Phase 40 changes** via `git restore`
2. **Update CURRENT_TASK.md** - Mark Phase 40 as NO-GO with analysis
3. **Document in scorecard** - Add Phase 40 as a research failure for future reference
4. **Re-evaluate gate candidates** - Use stricter criteria (see Lessons Learned #1)

## Appendix: Raw Test Data

### Baseline runs

```
56789069, 56274671, 56513942, 56133590, 56634961, 54943677, 57088883, 56337157, 55930637, 56590285
```

### Treatment runs

```
54355307, 56936372, 54694629, 54504756, 55137468, 52434980, 52438841, 54966798, 56834583, 57034821
```

### Variance Analysis

Standard deviations below are sample standard deviations (n - 1) recomputed from the raw runs above.

**Baseline**:

- Std dev: ~587K ops/s (1.04% CV)
- Range: 2.15M ops/s (54.9M - 57.1M)

**Treatment**:

- Std dev: ~1.67M ops/s (3.04% CV)
- Range: 4.60M ops/s (52.4M - 57.0M)

**Observation**: Treatment shows a roughly **2.8x higher standard deviation** than baseline, which is consistent with layout instability.

---

**Conclusion**: Phase 40 is a clear NO-GO. Revert all changes and refocus on gates without existing optimization.
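
### Recomputing the Summary Statistics (Sketch)

The means, delta, and variance figures in this report can be rechecked directly from the raw runs in the appendix. The following is a minimal sketch, not part of the hakmem tree: `RunStats`, `summarize`, `delta_pct`, `sqrt_newton`, and the `PHASE40_STATS_MAIN` macro are hypothetical names introduced here for illustration. It assumes the sample (n - 1) definition of standard deviation, since the report does not state which definition was used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Summary statistics for one benchmark arm (hypothetical helper type). */
typedef struct {
    double mean;    /* arithmetic mean, ops/s */
    double stddev;  /* sample standard deviation (n - 1), ops/s */
    double cv_pct;  /* coefficient of variation, percent */
} RunStats;

/* Newton's method square root, used only to avoid linking libm (-lm). */
static double sqrt_newton(double x) {
    if (x <= 0.0) return 0.0;
    double r = x;
    for (int i = 0; i < 100; i++) {
        r = 0.5 * (r + x / r);
    }
    return r;
}

/* Compute mean, sample standard deviation, and CV for n runs. */
static RunStats summarize(const double *runs, size_t n) {
    RunStats s = {0.0, 0.0, 0.0};
    assert(runs != NULL || n == 0);
    if (n == 0) return s;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += runs[i];
    s.mean = sum / (double)n;

    if (n > 1) {
        double ss = 0.0; /* sum of squared deviations from the mean */
        for (size_t i = 0; i < n; i++) {
            double d = runs[i] - s.mean;
            ss += d * d;
        }
        s.stddev = sqrt_newton(ss / (double)(n - 1));
    }
    if (s.mean != 0.0) s.cv_pct = 100.0 * s.stddev / s.mean;
    return s;
}

/* Relative change of the treatment mean against the baseline mean, percent. */
static double delta_pct(double baseline_mean, double treatment_mean) {
    return 100.0 * (treatment_mean - baseline_mean) / baseline_mean;
}

/* Raw runs copied from the appendix (ops/s). */
static const double BASELINE[10] = {
    56789069, 56274671, 56513942, 56133590, 56634961,
    54943677, 57088883, 56337157, 55930637, 56590285
};
static const double TREATMENT[10] = {
    54355307, 56936372, 54694629, 54504756, 55137468,
    52434980, 52438841, 54966798, 56834583, 57034821
};

#ifdef PHASE40_STATS_MAIN
int main(void) {
    RunStats b = summarize(BASELINE, 10);
    RunStats t = summarize(TREATMENT, 10);
    printf("baseline:  mean=%.0f sd=%.0f cv=%.2f%%\n", b.mean, b.stddev, b.cv_pct);
    printf("treatment: mean=%.0f sd=%.0f cv=%.2f%%\n", t.mean, t.stddev, t.cv_pct);
    printf("delta:     %.2f%%\n", delta_pct(b.mean, t.mean));
    return 0;
}
#endif
```

To run it standalone, compile with the demo entry point enabled, for example `cc -std=c99 -DPHASE40_STATS_MAIN stats.c -o stats` (the file name is arbitrary). Keeping the computation in one small file makes it easy to rerun when a future phase adds another set of ten runs.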