# Phase 41: ASM-First Gate Audit and Optimization - Results

**Date**: 2025-12-16
**Baseline**: FAST v3 = 55.97M ops/s (mean), 56.03M ops/s (median)
**Target**: +0.5% (56.32M+ ops/s) for GO
**Result**: **NO-GO** (-2.02% regression)

---

## Methodology: ASM-First Approach

Following Phase 40's lesson (where `tiny_header_mode()` was already optimized away by Phase 21), this phase implemented a strict **ASM inspection FIRST** methodology:

1. **Baseline measurement** before any code changes
2. **ASM inspection** to verify gates actually exist in assembly
3. **Optimization only if gates found** in hot paths
4. **Incremental testing** with proper A/B comparison

---

## Step 0: Baseline Measurement

**Command**: `make perf_fast`

### 10-Run Results (Baseline):

```
Run 1:  56.62M ops/s
Run 2:  55.62M ops/s
Run 3:  56.62M ops/s
Run 4:  56.62M ops/s
Run 5:  55.79M ops/s
Run 6:  55.42M ops/s
Run 7:  55.89M ops/s
Run 8:  56.16M ops/s
Run 9:  54.79M ops/s
Run 10: 56.17M ops/s
```

**Baseline Statistics**:

- **Mean**: 55.97M ops/s
- **Median**: 56.03M ops/s
- **Range**: 54.79M - 56.62M ops/s

---

## Step 1: ASM Inspection Results

### Target Gates (from Phase 40 preparation):

1. `mid_v3_enabled()` in `core/box/mid_hotbox_v3_env_box.h`
2. `mid_v3_debug_enabled()` in `core/box/mid_hotbox_v3_env_box.h`

### Inspection Command:

```bash
objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"
```

### Findings:

#### `mid_v3_debug_enabled()`: ✅ **FOUND in assembly**

- **Call count**: 19+ occurrences in disassembly
- **Function location**: `0x10630`
- **Call sites identified** (line numbers refer to the `grep -n` output over the disassembly):
  - Line 685: `call 10630`
  - Line 705: `call 10630`
  - Line 933: `call 10630`
  - Lines 9378, 9385, 9403, 10890, 31533, 31540, 31554, 31867, 32748, 32755, 32774, 33047, etc.
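
The occurrence count above mixes the function's own label line with its call sites, because the plain `grep -n` matches both. A minimal sketch of a stricter count, assuming GNU `objdump -d` output in its usual `call ADDR <symbol>` (or `callq`) form; `count_call_sites` is a hypothetical helper name, not something from the repository:

```bash
#!/usr/bin/env bash
# count_call_sites SYMBOL
# Reads `objdump -d` output on stdin and prints the number of direct
# call instructions that target SYMBOL. The function's own label line
# ("ADDR <SYMBOL>:") contains no call mnemonic, so it is not counted.
count_call_sites() {
    local symbol="$1"
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps callers running under `set -e` alive.
    grep -cE "call[a-z]*[[:space:]]+[0-9a-f]+ <${symbol}>" || true
}
```

Usage would look like `objdump -d ./bench_random_mixed_hakmem_minimal | count_call_sites mid_v3_debug_enabled`. Indirect calls and tail-call jumps are deliberately not counted here, so the result is a lower bound on references to the symbol.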

#### `mid_v3_enabled()`: ❌ **NOT FOUND in assembly**

- **Already optimized away** by compiler (likely inlined and dead-code eliminated)
- MID v3 is OFF by default (`g_enable = 0`), so compiler eliminated entire blocks

### Call Site Analysis:

**Source locations of `mid_v3_debug_enabled()`**:

1. **Alloc path** (`core/box/hak_alloc_api.inc.h`):
   - Line 84: Inside `if (mid_v3_enabled() && size >= 257 && size <= 768)` block
   - Line 95: Inside same block, after class selection
   - Line 106: Inside same block, after successful allocation

2. **Free path** (`core/box/hak_free_api.inc.h`):
   - Line 252: Inside `if (lk.kind == REGION_KIND_MID_V3)` block (SSOT path)
   - Line 273: Inside same block (legacy path)

3. **Mid-hotbox v3 implementation** (`core/mid_hotbox_v3.c`):
   - Multiple debug logging calls (lines 149, 158, 258, 270, 401, 423, 464, 507, 545)

### Key Insight:

`mid_v3_debug_enabled()` appears in assembly because it's called INSIDE blocks that are already guarded by `mid_v3_enabled()`. However, since `mid_v3_enabled()` returns 0 (OFF by default), these debug gates are NEVER actually executed at runtime. The compiler still generates the function calls as dead code.

**Pattern observed**:

```c
// In hot paths:
if (mid_v3_enabled() && ...) {  // Outer guard - optimized to "if (0)"
    // ...
    if (mid_v3_debug_enabled() && ...) {  // Inner debug gate - still in ASM!
        fprintf(stderr, ...);
    }
    // ...
}
```

---

## Step 2: Condition Reordering

**Status**: **SKIPPED** - Not applicable

**Reason**: All `mid_v3_debug_enabled()` calls are already inside `mid_v3_enabled()` guards. There are no opportunities for condition reordering to skip gate calls, because the outer gate (`mid_v3_enabled()`) is already at the top of the conditional chain.

---

## Step 3: BENCH_MINIMAL Constantization

### Implementation:

Modified `core/box/mid_hotbox_v3_env_box.h` to add compile-time constant returns for `HAKMEM_BENCH_MINIMAL`:

```c
#include "../hakmem_build_flags.h"

static inline int mid_v3_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    static int g_enable = -1;  // -1 = not yet read from the environment
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_ENABLED");
        if (e && *e) {
            g_enable = (*e != '0') ? 1 : 0;
        } else {
            g_enable = 0;  // default OFF
        }
    }
    return g_enable;
#endif
}

static inline int mid_v3_debug_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    static int g_debug = -1;  // -1 = not yet read from the environment
    if (__builtin_expect(g_debug == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_DEBUG");
        if (e && *e) {
            g_debug = (*e != '0') ? 1 : 0;
        } else {
            g_debug = 0;
        }
    }
    return g_debug;
#endif
}
```

### ASM Verification After Step 3:

```bash
objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"
```

**Result**: ✅ **Both gates ELIMINATED from assembly**

- No `mid_v3_debug_enabled` function in disassembly
- No call sites remaining
- Compiler successfully dead-code eliminated all MID v3 related code

### Performance Results (Step 3):

**Command**: `make perf_fast` (after Step 3 changes)

### 10-Run Results (Step 3):

```
Run 1:  54.60M ops/s
Run 2:  54.35M ops/s
Run 3:  54.11M ops/s
Run 4:  54.60M ops/s
Run 5:  54.84M ops/s
Run 6:  54.79M ops/s
Run 7:  54.53M ops/s
Run 8:  54.56M ops/s
Run 9:  55.96M ops/s
Run 10: 56.08M ops/s
```

**Step 3 Statistics**:

- **Mean**: 54.84M ops/s
- **Median**: 54.60M ops/s
- **Range**: 54.11M - 56.08M ops/s

### Comparison vs Baseline:

| Metric | Baseline | Step 3 | Delta | Percent |
|--------|----------|--------|-------|---------|
| Mean | 55.97M | 54.84M | -1.13M | **-2.02%** |
| Median | 56.03M | 54.60M | -1.43M | **-2.55%** |

**Verdict**: **NO-GO** (-2.02% regression)

---

## Root Cause Analysis: Layout Tax

### Why did constantization hurt performance?

**Hypothesis**: **Code layout tax** (same issue as Phase 40)

1. **Before Step 3**:
   - `mid_v3_debug_enabled()` exists as an outlined function (`mid_v3_enabled()` was already absent from the assembly, per Step 1)
   - Call sites reference this function but are never executed (dead code)
   - Hot path code layout is stable

2. **After Step 3**:
   - Both gates return compile-time constant `0`
   - Compiler inlines these and eliminates entire MID v3 blocks
   - Hot path code is re-laid out by compiler (different basic block arrangement)
   - **I-cache locality changes** → performance regression

### Precedent: Phase 40 Results

Phase 40 attempted to constantize `tiny_header_mode()`:

- **Result**: -2.47% regression
- **Cause**: Layout tax from code elimination
- **Lesson**: Removing already-optimized-away code can hurt more than help

### Why layout tax occurs:

Modern CPUs are extremely sensitive to:

- **Branch predictor state** (different code layout → different prediction patterns)
- **I-cache line alignment** (moving hot loops can cause cache line splits)
- **μop cache behavior** (LSD/DSB interactions change with layout)
- **TLB pressure** (code page mapping changes)

Even though we eliminated dead code, the **side effect of code relayout** outweighed the benefit of removing a few dead function calls.

---

## Final Decision: REVERT Step 3

**Action**: Reverted all changes to `core/box/mid_hotbox_v3_env_box.h`

```bash
git checkout core/box/mid_hotbox_v3_env_box.h
```

**Reason**: -2.02% regression is unacceptable for eliminating dead code that was never executed anyway.

---

## Lessons Learned

### 1. ASM-First Methodology Works ✅

Successfully identified that:

- `mid_v3_enabled()` was already optimized away
- `mid_v3_debug_enabled()` existed in ASM but was dead code (inside `if (0)` blocks)

### 2. Dead Code != Performance Impact ❌

**Counterintuitive finding**: Removing dead code can **hurt** performance due to layout tax

- The dead `mid_v3_debug_enabled()` calls were never executed
- But removing them caused code relayout → -2.02% regression
- **Lesson**: Leave dead code alone if it's already not executed

### 3. Layout Tax is Real and Significant

Both Phase 40 and Phase 41 hit layout tax:

- Phase 40: `tiny_header_mode()` constantization → -2.47%
- Phase 41: `mid_v3_*()` constantization → -2.02%

**Pattern**: Structural changes to inline functions → unpredictable layout effects

### 4. When to Stop Optimizing

**Stop criteria**:

1. If gate is already optimized away in ASM → Don't touch it
2. If gate appears in ASM but is never executed → **Still don't touch it** (layout risk)
3. Only optimize if gate is **executed frequently** in hot paths

### 5. ASM Inspection is Necessary but Not Sufficient

- ✅ ASM inspection told us gates exist
- ❌ ASM inspection didn't tell us they're dead code inside `if (0)` blocks
- ✅ **Need runtime profiling** (e.g., `perf record`) to confirm execution frequency

---

## Recommendations for Phase 42+

### 1. Add Runtime Profiling Step

**Before optimizing any gate**, use `perf` to verify it's actually executed:

```bash
# Profile hot functions
perf record -g -F 999 ./bench_random_mixed_hakmem_minimal
perf report --no-children --sort comm,dso,symbol

# Check if mid_v3_debug_enabled appears in profile
perf report | grep mid_v3
```

**Decision criteria**:

- If function appears in `perf report` → Worth optimizing
- If function is in ASM but NOT in `perf report` → Dead code, leave alone

### 2. Focus on Actually-Executed Gates

**Priority list** (requires profiling validation):

1. Gates that appear in `perf report` top 50 functions
2. Gates called in tight loops (identified by inspecting the per-instruction sample counts in `perf annotate` output)
3. Gates with measurable CPU time (>0.1% in profile)

### 3. Accept Dead Code in ASM

**Philosophy shift**:

- Old: "If it's in ASM, optimize it"
- New: "If it's in ASM but not executed, ignore it"

Dead code that's never executed has **zero runtime cost**. Removing it risks layout tax.

### 4. Test Layout Stability

Before committing any structural change:

1. Run 3× 10-run benchmarks (baseline, change, revert-verify)
2. Check if results are reproducible
3. Accept only if gain is **≥1.0%** (to overcome layout noise)

### 5. Alternative: Investigate Other Hot Gates

Instead of MID v3 gates (which are dead), profile to find:

- Tiny allocator gates that ARE executed
- Free path gates with measurable cost
- Size class routing decisions in hot paths

---

## Quantitative Summary

| Phase | Target Gate(s) | ASM Present? | Executed? | Change | Result | Verdict |
|-------|----------------|--------------|-----------|--------|--------|---------|
| Phase 21 | `tiny_header_mode()` | No (optimized away) | No | N/A | N/A | Skipped |
| Phase 40 | `tiny_header_mode()` | No | No | Constantization | -2.47% | NO-GO |
| Phase 41 Step 2 | Condition reorder | N/A | N/A | N/A | Skipped | N/A |
| Phase 41 Step 3 | `mid_v3_enabled()`, `mid_v3_debug_enabled()` | Yes (debug only) | **No** (dead code) | Constantization | **-2.02%** | **NO-GO** |

**Phase 41 Final Performance**: **55.97M ops/s** (baseline, no changes adopted)

---

## Conclusion

Phase 41 successfully demonstrated the **ASM-first gate audit methodology** and confirmed its value. However, it also revealed a critical limitation:

> **ASM presence ≠ Performance impact**

The gates we targeted (`mid_v3_debug_enabled()`) existed in assembly but were **dead code** inside `if (mid_v3_enabled())` guards that compile to `if (0)`. Attempting to eliminate this dead code via BENCH_MINIMAL constantization caused a **-2.02% layout tax regression**.

**Key Takeaway**:

- ✅ ASM inspection prevents wasting time on already-optimized gates (like Phase 21's `tiny_header_mode()`)
- ❌ But ASM inspection alone is insufficient - need **runtime profiling** to distinguish executed vs. dead code
- ⚠️ **Layout tax is a first-class optimization enemy** - structural changes risk unpredictable regressions

**Phase 42 Direction**:

1. Add `perf record/report` step to methodology
2. Target only gates that appear in runtime profiles
3. Accept dead code in ASM as zero-cost (don't fix what isn't broken)
4. Require ≥1.0% gain to overcome layout noise

**Phase 41 Verdict**: **NO-GO** - Revert all changes, baseline remains **FAST v3 = 55.97M ops/s**
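
---

## Appendix: Recomputing the Statistics

The mean, median, and GO/NO-GO verdict reported in this document can be recomputed from the raw 10-run lists. The following is a minimal sketch, assuming one throughput value per line (in M ops/s) on stdin; `run_stats` and `verdict` are hypothetical helper names for illustration and are not part of the repository or of `make perf_fast`:

```bash
#!/usr/bin/env bash
# run_stats: reads one numeric value per line on stdin and prints
# "MEAN MEDIAN", each rounded to two decimals.
run_stats() {
    # LC_ALL=C keeps "." as the decimal separator for both sort and awk.
    LC_ALL=C sort -n | LC_ALL=C awk '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR == 0) exit 1
            # Odd count: middle element. Even count: mean of the two middle elements.
            median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "%.2f %.2f\n", sum / NR, median
        }'
}

# verdict BASELINE CANDIDATE THRESHOLD_PCT
# Prints "GO" or "NO-GO" followed by the signed percent change of
# CANDIDATE relative to BASELINE.
verdict() {
    LC_ALL=C awk -v b="$1" -v c="$2" -v t="$3" 'BEGIN {
        pct = (c - b) / b * 100
        printf "%s %+.2f%%\n", (pct >= t) ? "GO" : "NO-GO", pct
    }'
}
```

Feeding the ten Step 3 values into `run_stats` prints `54.84 54.60`, and `verdict 55.97 54.84 0.5` prints `NO-GO -2.02%`, matching the comparison table. Passing `1.0` as the threshold would apply the stricter gain requirement proposed for Phase 42.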