hakmem/docs/analysis/PHASE41_ASM_FIRST_GATE_AUDIT_RESULTS.md

# Phase 41: ASM-First Gate Audit and Optimization - Results

**Date**: 2025-12-16
**Baseline**: FAST v3 = 55.97M ops/s (mean), 56.03M ops/s (median)
**Target**: +0.5% (56.32M+ ops/s) for GO
**Result**: **NO-GO** (-2.02% regression)

---

## Methodology: ASM-First Approach

Following Phase 40's lesson (where `tiny_header_mode()` was already optimized away by Phase 21), this phase implemented a strict **ASM inspection FIRST** methodology:

1. **Baseline measurement** before any code changes
2. **ASM inspection** to verify gates actually exist in assembly
3. **Optimization only if gates found** in hot paths
4. **Incremental testing** with proper A/B comparison

---

## Step 0: Baseline Measurement

**Command**: `make perf_fast`

### 10-Run Results (Baseline):
```
Run 1:  56.62M ops/s
Run 2:  55.62M ops/s
Run 3:  56.62M ops/s
Run 4:  56.62M ops/s
Run 5:  55.79M ops/s
Run 6:  55.42M ops/s
Run 7:  55.89M ops/s
Run 8:  56.16M ops/s
Run 9:  54.79M ops/s
Run 10: 56.17M ops/s
```

**Baseline Statistics**:
- **Mean**: 55.97M ops/s
- **Median**: 56.03M ops/s
- **Range**: 54.79M - 56.62M ops/s
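
The summary figures can be re-derived from the ten runs listed above. The following is a minimal sketch in C and is not part of hakmem: `runs_mean`, `runs_median`, and `phase41_baseline` are illustrative names, and the 64-sample cap is an arbitrary assumption that keeps the scratch copy on the stack.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles, ascending order. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples. */
static double runs_mean(const double *v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Median of n samples; sorts a private copy so the caller's order is kept.
 * The cap of 64 samples is an assumption made for this sketch. */
static double runs_median(const double *v, size_t n) {
    assert(n > 0 && n <= 64);
    double tmp[64];
    memcpy(tmp, v, n * sizeof *v);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    return (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}

/* The ten baseline runs from this report, in M ops/s. */
static const double phase41_baseline[10] = {
    56.62, 55.62, 56.62, 56.62, 55.79, 55.42, 55.89, 56.16, 54.79, 56.17
};
```

For the baseline runs this gives a mean of about 55.97 and a median of 56.025 (the average of the 5th and 6th sorted values, 55.89 and 56.16), which the report rounds to 56.03.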

---

## Step 1: ASM Inspection Results

### Target Gates (from Phase 40 preparation):
1. `mid_v3_enabled()` in `core/box/mid_hotbox_v3_env_box.h`
2. `mid_v3_debug_enabled()` in `core/box/mid_hotbox_v3_env_box.h`

### Inspection Command:
```bash
objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"
```

### Findings:

#### `mid_v3_debug_enabled()`: ✅ **FOUND in assembly**
- **Call count**: 19+ occurrences in disassembly
- **Function location**: `0x10630 <mid_v3_debug_enabled.lto_priv.0>`
- **Call sites identified**:
  - Line 685: `call 10630 <mid_v3_debug_enabled.lto_priv.0>`
  - Line 705: `call 10630 <mid_v3_debug_enabled.lto_priv.0>`
  - Line 933: `call 10630 <mid_v3_debug_enabled.lto_priv.0>`
  - Lines 9378, 9385, 9403, 10890, 31533, 31540, 31554, 31867, 32748, 32755, 32774, 33047, etc.

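#### `mid_v3_enabled()`: ❌ **NOT FOUND in assembly**
- **No standalone symbol or `call`**: the function is `static inline`, so it was most likely inlined at each call site
- MID v3 is OFF by default (`g_enable = 0`), but that value is set from `getenv()` at runtime, so the compiler cannot fold the guard to a constant. The guarded blocks stay in the binary (the `mid_v3_debug_enabled()` calls above live inside them) and are simply never entered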

### Call Site Analysis:

**Source locations of `mid_v3_debug_enabled()`**:

1. **Alloc path** (`core/box/hak_alloc_api.inc.h`):
   - Line 84: Inside `if (mid_v3_enabled() && size >= 257 && size <= 768)` block
   - Line 95: Inside same block, after class selection
   - Line 106: Inside same block, after successful allocation

2. **Free path** (`core/box/hak_free_api.inc.h`):
   - Line 252: Inside `if (lk.kind == REGION_KIND_MID_V3)` block (SSOT path)
   - Line 273: Inside same block (legacy path)

3. **Mid-hotbox v3 implementation** (`core/mid_hotbox_v3.c`):
   - Multiple debug logging calls (lines 149, 158, 258, 270, 401, 423, 464, 507, 545)

### Key Insight:

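`mid_v3_debug_enabled()` appears in assembly because it is called inside blocks guarded by `mid_v3_enabled()` (alloc path) or by the `REGION_KIND_MID_V3` check (free path). With MID v3 OFF (the default), `mid_v3_enabled()` returns 0 at runtime and those blocks are never entered, so the debug gates are never executed. The compiler still has to emit the calls, because the outer guard is a runtime value rather than a compile-time constant: the calls are unreached at runtime, not dead code in the compiler's sense.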

**Pattern observed**:
```c
// In hot paths:
if (mid_v3_enabled() && ...) {  // Outer guard - inlined; false at runtime while MID v3 is OFF
    // ...
    if (mid_v3_debug_enabled() && ...) {  // Inner debug gate - still in ASM, never reached
        fprintf(stderr, ...);
    }
    // ...
}
```
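
A small, self-contained sketch of how a gate can show up in the disassembly without ever being executed. It mirrors the pattern above using hypothetical names (`demo_outer_enabled`, `demo_inner_debug_enabled`, `DEMO_OUTER_ENABLED`); none of these exist in hakmem, and the call counter is added purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustration-only counter: how often the inner gate actually runs. */
static int g_inner_gate_calls = 0;

/* Outer guard: lazily reads a (hypothetical) env var, default OFF. */
static inline int demo_outer_enabled(void) {
    static int g_enable = -1;
    if (__builtin_expect(g_enable == -1, 0)) {
        const char *e = getenv("DEMO_OUTER_ENABLED");
        g_enable = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_enable;
}

/* Inner debug gate: kept out of line so its call stays visible in objdump. */
__attribute__((noinline)) static int demo_inner_debug_enabled(void) {
    g_inner_gate_calls++;
    return 0;
}

/* Returns 1 if the guarded block was entered, 0 otherwise. */
static int demo_alloc_path(size_t size) {
    if (demo_outer_enabled() && size >= 257 && size <= 768) {
        if (demo_inner_debug_enabled()) {
            /* debug logging would go here */
        }
        return 1;
    }
    return 0;
}

/* Drives the path n times; assumes DEMO_OUTER_ENABLED is unset (gate OFF). */
static void demo_run(size_t n) {
    for (size_t i = 0; i < n; i++) (void)demo_alloc_path(512);
    assert(g_inner_gate_calls == 0);  /* call site exists, but is never reached */
}
```

Because the outer guard depends on `getenv()`, the compiler has to keep the guarded block and the inner call in the binary, yet with the variable unset the counter stays at zero however many times the path runs.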

---

## Step 2: Condition Reordering

**Status**: **SKIPPED** - Not applicable

**Reason**: The `mid_v3_debug_enabled()` calls in the alloc path are already inside `mid_v3_enabled()` guards, and the free-path calls sit behind the `REGION_KIND_MID_V3` check. Reordering conditions cannot skip any further gate calls, because the outer guard is already evaluated first in each conditional chain.

---

## Step 3: BENCH_MINIMAL Constantization

### Implementation:

Modified `core/box/mid_hotbox_v3_env_box.h` to add compile-time constant returns for `HAKMEM_BENCH_MINIMAL`:

```c
#include "../hakmem_build_flags.h"

static inline int mid_v3_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    static int g_enable = -1;
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_ENABLED");
        if (e && *e) {
            g_enable = (*e != '0') ? 1 : 0;
        } else {
            g_enable = 0;  // default OFF
        }
    }
    return g_enable;
#endif
}

static inline int mid_v3_debug_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    static int g_debug = -1;
    if (__builtin_expect(g_debug == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_DEBUG");
        if (e && *e) {
            g_debug = (*e != '0') ? 1 : 0;
        } else {
            g_debug = 0;
        }
    }
    return g_debug;
#endif
}
```

### ASM Verification After Step 3:

```bash
objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"
```

**Result**: ✅ **Both gates ELIMINATED from assembly**
- No `mid_v3_debug_enabled` function in disassembly
- No call sites remaining
- Compiler successfully dead-code eliminated all MID v3 related code

### Performance Results (Step 3):

**Command**: `make perf_fast` (after Step 3 changes)

### 10-Run Results (Step 3):
```
Run 1:  54.60M ops/s
Run 2:  54.35M ops/s
Run 3:  54.11M ops/s
Run 4:  54.60M ops/s
Run 5:  54.84M ops/s
Run 6:  54.79M ops/s
Run 7:  54.53M ops/s
Run 8:  54.56M ops/s
Run 9:  55.96M ops/s
Run 10: 56.08M ops/s
```

**Step 3 Statistics**:
- **Mean**: 54.84M ops/s
- **Median**: 54.60M ops/s
- **Range**: 54.11M - 56.08M ops/s

### Comparison vs Baseline:

| Metric | Baseline | Step 3 | Delta | Percent |
|--------|----------|--------|-------|---------|
| Mean   | 55.97M   | 54.84M | -1.13M | **-2.02%** |
| Median | 56.03M   | 54.60M | -1.43M | **-2.55%** |

**Verdict**: **NO-GO** (-2.02% regression)
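
The verdict follows mechanically from the table. A minimal sketch of that arithmetic, using illustrative names (`delta_percent`, `gate_verdict`) that are not taken from the hakmem sources; the +0.5% threshold is the GO target stated at the top of this report.

```c
#include <assert.h>

typedef enum { VERDICT_NO_GO = 0, VERDICT_GO = 1 } verdict_t;

/* Relative change of candidate vs. baseline, in percent. */
static double delta_percent(double baseline, double candidate) {
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}

/* GO only if the candidate beats the baseline by at least min_gain_pct. */
static verdict_t gate_verdict(double baseline, double candidate,
                              double min_gain_pct) {
    return delta_percent(baseline, candidate) >= min_gain_pct
               ? VERDICT_GO
               : VERDICT_NO_GO;
}
```

With the means from the table, `delta_percent(55.97, 54.84)` evaluates to roughly -2.02, so `gate_verdict(55.97, 54.84, 0.5)` returns `VERDICT_NO_GO`.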

---

## Root Cause Analysis: Layout Tax

### Why did constantization hurt performance?

**Hypothesis**: **Code layout tax** (same issue as Phase 40)

1. **Before Step 3**:
   - `mid_v3_debug_enabled()` exists as an out-of-line function (`mid_v3_debug_enabled.lto_priv.0`); `mid_v3_enabled()` has no symbol of its own (see Step 1)
   - Call sites reference the debug gate but are never reached at runtime
   - Hot path code layout is stable

2. **After Step 3**:
   - Both gates return compile-time constant `0`
   - Compiler inlines these and eliminates entire MID v3 blocks
   - Hot path code is re-laid out by compiler (different basic block arrangement)
   - **I-cache locality changes** → performance regression

### Precedent: Phase 40 Results

Phase 40 attempted to constantize `tiny_header_mode()`:
- **Result**: -2.47% regression
- **Cause**: Layout tax from code elimination
- **Lesson**: Removing already-optimized-away code can hurt more than help

### Why layout tax occurs:

Modern CPUs are extremely sensitive to:
- **Branch predictor state** (different code layout → different prediction patterns)
- **I-cache line alignment** (moving hot loops can cause cache line splits)
- **μop cache behavior** (LSD/DSB interactions change with layout)
- **TLB pressure** (code page mapping changes)

Even though we eliminated dead code, the **side effect of code relayout** outweighed the benefit of removing a few dead function calls.

---

## Final Decision: REVERT Step 3

**Action**: Reverted all changes to `core/box/mid_hotbox_v3_env_box.h`

```bash
git checkout core/box/mid_hotbox_v3_env_box.h
```

**Reason**: -2.02% regression is unacceptable for eliminating dead code that was never executed anyway.

---

## Lessons Learned

### 1. ASM-First Methodology Works

✅ Successfully identified that:
- `mid_v3_enabled()` was already optimized away
- `mid_v3_debug_enabled()` existed in ASM but was never executed (its call sites sit behind guards that are false at runtime)

### 2. Dead Code != Performance Impact

❌ **Counterintuitive finding**: Removing dead code can **hurt** performance due to layout tax

- The dead `mid_v3_debug_enabled()` calls were never executed
- But removing them caused code relayout → -2.02% regression
- **Lesson**: Leave dead code alone if it's already not executed

### 3. Layout Tax is Real and Significant

Both Phase 40 and Phase 41 hit layout tax:
- Phase 40: `tiny_header_mode()` constantization → -2.47%
- Phase 41: `mid_v3_*()` constantization → -2.02%

**Pattern**: Structural changes to inline functions → unpredictable layout effects

### 4. When to Stop Optimizing

**Stop criteria**:
1. If gate is already optimized away in ASM → Don't touch it
2. If gate appears in ASM but is never executed → **Still don't touch it** (layout risk)
3. Only optimize if gate is **executed frequently** in hot paths

### 5. ASM Inspection is Necessary but Not Sufficient

- ✅ ASM inspection told us gates exist
- ❌ ASM inspection didn't tell us the calls are never reached at runtime
- ✅ **Need runtime profiling** (e.g., `perf record`) to confirm execution frequency

---

## Recommendations for Phase 42+

### 1. Add Runtime Profiling Step

**Before optimizing any gate**, use `perf` to verify it's actually executed:

```bash
# Profile hot functions
perf record -g -F 999 ./bench_random_mixed_hakmem_minimal
perf report --no-children --sort comm,dso,symbol

# Check if mid_v3_debug_enabled appears in profile
perf report | grep mid_v3
```

**Decision criteria**:
- If function appears in `perf report` → Worth optimizing
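- If function is in ASM but NOT in `perf report` → Not executed (or below sampling resolution), leave alone. Inlined gates are attributed to their caller's symbol, so check the caller before concluding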

### 2. Focus on Actually-Executed Gates

**Priority list** (requires profiling validation):
1. Gates that appear in `perf report` top 50 functions
2. Gates called in tight loops (identified by annotating the enclosing function with `perf annotate`)
3. Gates with measurable CPU time (>0.1% in profile)
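
The stop criteria from Lessons Learned and the 0.1% threshold above could be combined into a single triage step. The sketch below is one possible encoding, not existing hakmem code: `gate_triage` and the `gate_action_t` values are hypothetical names, and the inputs (whether the gate is present in ASM, and its share of the profile in percent) are assumed to be collected by hand from `objdump` and `perf report`.

```c
#include <assert.h>

typedef enum {
    GATE_SKIP_NOT_IN_ASM,    /* already optimized away: don't touch it */
    GATE_SKIP_NOT_EXECUTED,  /* in ASM but not above the profile threshold */
    GATE_OPTIMIZE            /* in ASM and measurably hot: worth a look */
} gate_action_t;

/* present_in_asm: nonzero if objdump shows the gate or its call sites.
 * profile_pct:    share of samples attributed to the gate, in percent.
 * min_pct:        threshold the gate must exceed (0.1 in this report). */
static gate_action_t gate_triage(int present_in_asm, double profile_pct,
                                 double min_pct) {
    assert(profile_pct >= 0.0 && min_pct >= 0.0);
    if (!present_in_asm) return GATE_SKIP_NOT_IN_ASM;
    if (profile_pct <= min_pct) return GATE_SKIP_NOT_EXECUTED;
    return GATE_OPTIMIZE;
}
```

Under this encoding the Phase 41 debug gate (present in ASM, absent from the profile) would land in `GATE_SKIP_NOT_EXECUTED`.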

### 3. Accept Dead Code in ASM

**Philosophy shift**:
- Old: "If it's in ASM, optimize it"
- New: "If it's in ASM but not executed, ignore it"

Code that is never executed costs nothing at runtime beyond the bytes it occupies in the binary. Removing it risks layout tax.

### 4. Test Layout Stability

Before committing any structural change:
1. Run 3× 10-run benchmarks (baseline, change, revert-verify)
2. Check if results are reproducible
3. Accept only if gain is **≥1.0%** (to overcome layout noise)
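
The three-way check above can be written down as a small predicate. This is a sketch under stated assumptions rather than an existing hakmem helper: `layout_stable_accept` and `pct_change` are illustrative names, the inputs are the means of the three 10-run batches, and `repro_tol_pct` (how closely the revert-verify batch must match the baseline) is a parameter this report does not specify.

```c
#include <assert.h>

/* Relative change of value against reference, in percent. */
static double pct_change(double reference, double value) {
    assert(reference > 0.0);
    return (value - reference) / reference * 100.0;
}

/* Returns 1 only if (a) the revert-verify batch reproduces the baseline
 * within repro_tol_pct and (b) the change gains at least min_gain_pct. */
static int layout_stable_accept(double baseline, double changed,
                                double reverted, double min_gain_pct,
                                double repro_tol_pct) {
    double drift = pct_change(baseline, reverted);
    if (drift < 0.0) drift = -drift;
    if (drift > repro_tol_pct) return 0;  /* measurement not reproducible */
    return pct_change(baseline, changed) >= min_gain_pct;
}
```

Checking reproducibility first means a noisy measurement session is rejected before the gain is even considered, which keeps a lucky batch from passing the ≥1.0% bar.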

### 5. Alternative: Investigate Other Hot Gates

Instead of MID v3 gates (which are dead), profile to find:
- Tiny allocator gates that ARE executed
- Free path gates with measurable cost
- Size class routing decisions in hot paths

---

## Quantitative Summary

| Phase | Target Gate(s) | ASM Present? | Executed? | Change | Result | Verdict |
|-------|----------------|--------------|-----------|--------|--------|---------|
| Phase 21 | `tiny_header_mode()` | No (optimized away) | No | N/A | N/A | Skipped |
| Phase 40 | `tiny_header_mode()` | No | No | Constantization | -2.47% | NO-GO |
| Phase 41 Step 2 | Condition reorder | N/A | N/A | N/A | Skipped | N/A |
| Phase 41 Step 3 | `mid_v3_enabled()`, `mid_v3_debug_enabled()` | Yes (debug only) | **No** (dead code) | Constantization | **-2.02%** | **NO-GO** |

**Phase 41 Final Performance**: **55.97M ops/s** (baseline, no changes adopted)

---

## Conclusion

Phase 41 successfully demonstrated the **ASM-first gate audit methodology** and confirmed its value. However, it also revealed a critical limitation:

> **ASM presence ≠ Performance impact**

The gate we targeted (`mid_v3_debug_enabled()`) existed in assembly but was **never executed**: every call sits behind a guard that is false at runtime while MID v3 is OFF. Attempting to eliminate these unreached calls via BENCH_MINIMAL constantization caused a **-2.02% regression**, which we attribute to layout tax.

**Key Takeaway**:
- ✅ ASM inspection prevents wasting time on already-optimized gates (like Phase 21's `tiny_header_mode()`)
- ❌ But ASM inspection alone is insufficient - need **runtime profiling** to distinguish executed vs. dead code
- ⚠️ **Layout tax is a first-class optimization enemy** - structural changes risk unpredictable regressions

**Phase 42 Direction**:
1. Add `perf record/report` step to methodology
2. Target only gates that appear in runtime profiles
3. Accept dead code in ASM as zero-cost (don't fix what isn't broken)
4. Require ≥1.0% gain to overcome layout noise

**Phase 41 Verdict**: **NO-GO** - Revert all changes, baseline remains **FAST v3 = 55.97M ops/s**
Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement

## Summary

Completed Phase 54-60 optimization work:

**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset

**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY

**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc

**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized

## Key Metrics

- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes

## Files Added/Modified

New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h

Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py

Documentation: Phase 40-60 analysis documents

## Design Decisions

1. Profile separation (core/bench_profile.h):
   - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
   - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)

2. Box Theory compliance:
   - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
   - Single conversion points maintained
   - No physical deletions (compile-out only)

3. Lessons learned:
   - SSOT effective only where redundancy exists (Phase 60 showed limits)
   - Branch prediction extremely effective (~0 cycles for well-predicted branches)
   - Early-exit pattern valuable even when seemingly redundant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
2025-12-17 06:24:01 +09:00