# Phase 41: ASM-First Gate Audit and Optimization - Results

**Date**: 2025-12-16
**Baseline**: FAST v3 = 55.97M ops/s (mean), 56.03M ops/s (median)
**Target**: +0.5% over the baseline mean (≈56.25M+ ops/s) for GO
**Result**: **NO-GO** (-2.02% regression)

---

## Methodology: ASM-First Approach

Following Phase 40's lesson (where `tiny_header_mode()` was already optimized away by Phase 21), this phase implemented a strict **ASM inspection FIRST** methodology:

1. **Baseline measurement** before any code changes
2. **ASM inspection** to verify gates actually exist in assembly
3. **Optimization only if gates found** in hot paths
4. **Incremental testing** with proper A/B comparison

---

## Step 0: Baseline Measurement

**Command**: `make perf_fast`

### 10-Run Results (Baseline):
```
Run 1:  56.62M ops/s
Run 2:  55.62M ops/s
Run 3:  56.62M ops/s
Run 4:  56.62M ops/s
Run 5:  55.79M ops/s
Run 6:  55.42M ops/s
Run 7:  55.89M ops/s
Run 8:  56.16M ops/s
Run 9:  54.79M ops/s
Run 10: 56.17M ops/s
```

**Baseline Statistics**:
- **Mean**: 55.97M ops/s
- **Median**: 56.03M ops/s
- **Range**: 54.79M - 56.62M ops/s

---

## Step 1: ASM Inspection Results

### Target Gates (from Phase 40 preparation):
1. `mid_v3_enabled()` in `core/box/mid_hotbox_v3_env_box.h`
2. `mid_v3_debug_enabled()` in `core/box/mid_hotbox_v3_env_box.h`

### Inspection Command:
```bash
objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"
```

### Findings:

#### `mid_v3_debug_enabled()`: ✅ **FOUND in assembly**
- **Call count**: 19+ occurrences in disassembly
- **Function location**: `0x10630 <mid_v3_debug_enabled.lto_priv.0>`
- **Call sites identified** (line numbers are `grep -n` positions in the `objdump` output, not source lines):
  - Line 685: `call 10630 <mid_v3_debug_enabled.lto_priv.0>`
  - Line 705: `call 10630 <mid_v3_debug_enabled.lto_priv.0>`
  - Line 933: `call 10630 <mid_v3_debug_enabled.lto_priv.0>`
  - Lines 9378, 9385, 9403, 10890, 31533, 31540, 31554, 31867, 32748, 32755, 32774, 33047, etc.

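#### `mid_v3_enabled()`: ❌ **NOT FOUND in assembly**
- **No symbol or call sites remain**: the function is inlined at every call site, leaving at most a test of its cached `g_enable` flag
- MID v3 is OFF by default (`g_enable` resolves to `0` unless `HAKMEM_MID_V3_ENABLED` is set), so the guarded blocks are never entered at runtime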

### Call Site Analysis:

**Source locations of `mid_v3_debug_enabled()`**:

1. **Alloc path** (`core/box/hak_alloc_api.inc.h`):
   - Line 84: Inside `if (mid_v3_enabled() && size >= 257 && size <= 768)` block
   - Line 95: Inside same block, after class selection
   - Line 106: Inside same block, after successful allocation

2. **Free path** (`core/box/hak_free_api.inc.h`):
   - Line 252: Inside `if (lk.kind == REGION_KIND_MID_V3)` block (SSOT path)
   - Line 273: Inside same block (legacy path)

3. **Mid-hotbox v3 implementation** (`core/mid_hotbox_v3.c`):
   - Multiple debug logging calls (lines 149, 158, 258, 270, 401, 423, 464, 507, 545)

### Key Insight:

`mid_v3_debug_enabled()` appears in assembly because it is called INSIDE blocks guarded by `mid_v3_enabled()`, and the compiler must still emit those blocks: the guard's value comes from an environment variable and is only known at runtime. Since `mid_v3_enabled()` returns 0 (OFF by default), these debug gates are NEVER actually executed at runtime. Their call sites are present in the binary but are effectively dead code.

**Pattern observed**:
```c
// In hot paths:
if (mid_v3_enabled() && ...) {  // Outer guard - inlined; always false at runtime by default
    // ...
    if (mid_v3_debug_enabled() && ...) {  // Inner debug gate - still in ASM!
        fprintf(stderr, ...);
    }
    // ...
}
```
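
To make the "present in the binary but never executed" distinction concrete, here is a minimal, self-contained sketch of the same nesting with a call counter on the inner gate. All `demo_*` names and the `DEMO_*` environment variables are hypothetical stand-ins rather than hakmem code, and `__builtin_expect` / `__attribute__((noinline))` assume GCC or Clang.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical counter: how often the inner debug gate actually runs. */
static unsigned long g_debug_gate_calls = 0;

/* Lazily cached env gate with the same shape as mid_v3_enabled().
 * Its value is only known at run time, so the guarded block must be emitted. */
static inline int demo_enabled(void) {
    static int g_enable = -1;
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("DEMO_MID_V3_ENABLED");  /* illustrative name */
        g_enable = (e && *e && *e != '0') ? 1 : 0;      /* default OFF */
    }
    return g_enable;
}

/* Kept out of line so that it has its own symbol and visible call sites. */
__attribute__((noinline)) static int demo_debug_enabled(void) {
    static int g_debug = -1;
    g_debug_gate_calls++;  /* counts real executions, not ASM occurrences */
    if (__builtin_expect(g_debug == -1, 0)) {
        const char* e = getenv("DEMO_MID_V3_DEBUG");    /* illustrative name */
        g_debug = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_debug;
}

/* Mimics the alloc-path nesting: outer gate first, debug gate inside it. */
static void* demo_alloc(size_t size) {
    if (demo_enabled() && size >= 257 && size <= 768) {
        if (demo_debug_enabled()) {
            fprintf(stderr, "[demo] mid path size=%zu\n", size);
        }
    }
    return malloc(size);
}

/* Runs n allocations in the 257..768 range and returns the number of
 * times the inner debug gate was executed. */
static unsigned long demo_count_debug_gate_calls(size_t n) {
    g_debug_gate_calls = 0;
    for (size_t i = 0; i < n; i++) {
        void* p = demo_alloc(257 + (i % 512));
        assert(p != NULL);
        free(p);
    }
    return g_debug_gate_calls;
}
```

With both environment variables unset, `demo_count_debug_gate_calls(1000)` returns 0, even though a disassembly of such a binary would be expected to still contain a call to `demo_debug_enabled`.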

---

## Step 2: Condition Reordering

**Status**: **SKIPPED** - Not applicable

**Reason**: All `mid_v3_debug_enabled()` calls are already inside `mid_v3_enabled()` guards. There are no opportunities for condition reordering to skip gate calls, because the outer gate (`mid_v3_enabled()`) is already at the top of the conditional chain.

---

## Step 3: BENCH_MINIMAL Constantization

### Implementation:

Modified `core/box/mid_hotbox_v3_env_box.h` to add compile-time constant returns for `HAKMEM_BENCH_MINIMAL`:

```c
#include <stdlib.h>  // getenv
#include "../hakmem_build_flags.h"

static inline int mid_v3_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    static int g_enable = -1;
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_ENABLED");
        if (e && *e) {
            g_enable = (*e != '0') ? 1 : 0;
        } else {
            g_enable = 0;  // default OFF
        }
    }
    return g_enable;
#endif
}

static inline int mid_v3_debug_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    static int g_debug = -1;
    if (__builtin_expect(g_debug == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_DEBUG");
        if (e && *e) {
            g_debug = (*e != '0') ? 1 : 0;
        } else {
            g_debug = 0;
        }
    }
    return g_debug;
#endif
}
```

### ASM Verification After Step 3:

```bash
objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"
```

**Result**: ✅ **Both gates ELIMINATED from assembly**
- No `mid_v3_debug_enabled` function in disassembly
- No call sites remaining
- Compiler successfully dead-code eliminated all MID v3 related code

### Performance Results (Step 3):

**Command**: `make perf_fast` (after Step 3 changes)

### 10-Run Results (Step 3):
```
Run 1:  54.60M ops/s
Run 2:  54.35M ops/s
Run 3:  54.11M ops/s
Run 4:  54.60M ops/s
Run 5:  54.84M ops/s
Run 6:  54.79M ops/s
Run 7:  54.53M ops/s
Run 8:  54.56M ops/s
Run 9:  55.96M ops/s
Run 10: 56.08M ops/s
```

**Step 3 Statistics**:
- **Mean**: 54.84M ops/s
- **Median**: 54.60M ops/s
- **Range**: 54.11M - 56.08M ops/s

### Comparison vs Baseline:

| Metric | Baseline | Step 3 | Delta | Percent |
|--------|----------|--------|-------|---------|
| Mean   | 55.97M   | 54.84M | -1.13M | **-2.02%** |
| Median | 56.03M   | 54.60M | -1.43M | **-2.55%** |

**Verdict**: **NO-GO** (-2.02% regression)
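
The statistics in the table can be re-derived from the raw runs. The following is a minimal sketch of such a check; the helper names (`run_mean`, `run_median`, `pct_delta`) and the 64-sample cap are assumptions made for illustration and are not part of the benchmark harness.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* The ten baseline and Step 3 results listed above, in M ops/s. */
static const double k_baseline[10] = {56.62, 55.62, 56.62, 56.62, 55.79,
                                      55.42, 55.89, 56.16, 54.79, 56.17};
static const double k_step3[10]    = {54.60, 54.35, 54.11, 54.60, 54.84,
                                      54.79, 54.53, 54.56, 55.96, 56.08};

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void* a, const void* b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples. */
static double run_mean(const double* v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Median of up to 64 samples; works on a copy so the input stays unsorted. */
static double run_median(const double* v, size_t n) {
    double tmp[64];
    assert(n > 0 && n <= 64);
    memcpy(tmp, v, n * sizeof tmp[0]);
    qsort(tmp, n, sizeof tmp[0], cmp_double);
    return (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}

/* Relative change of candidate versus baseline, in percent. */
static double pct_delta(double baseline, double candidate) {
    return 100.0 * (candidate - baseline) / baseline;
}
```

Applied to the two arrays, `run_mean` gives roughly 55.97 and 54.84, and `pct_delta` on those means gives about -2.0%. The last digit of the median delta can differ slightly depending on whether the medians are rounded before the percentage is taken.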

---

## Root Cause Analysis: Layout Tax

### Why did constantization hurt performance?

**Hypothesis**: **Code layout tax** (same issue as Phase 40)

1. **Before Step 3**:
   - `mid_v3_debug_enabled()` exists as an out-of-line function; `mid_v3_enabled()` is inlined at its call sites
   - Call sites reference the debug gate but are never executed at runtime
   - Hot path code layout is stable

2. **After Step 3**:
   - Both gates return compile-time constant `0`
   - Compiler inlines these and eliminates entire MID v3 blocks
   - Hot path code is re-laid out by compiler (different basic block arrangement)
   - **I-cache locality changes** → performance regression

### Precedent: Phase 40 Results

Phase 40 attempted to constantize `tiny_header_mode()`:
- **Result**: -2.47% regression
- **Cause**: Layout tax from code elimination
- **Lesson**: Removing already-optimized-away code can hurt more than help

### Why layout tax occurs:

Modern CPUs are extremely sensitive to:
- **Branch predictor state** (different code layout → different prediction patterns)
- **I-cache line alignment** (moving hot loops can cause cache line splits)
- **μop cache behavior** (LSD/DSB interactions change with layout)
- **TLB pressure** (code page mapping changes)

Even though we eliminated dead code, the **side effect of code relayout** outweighed the benefit of removing a few dead function calls.

---

## Final Decision: REVERT Step 3

**Action**: Reverted all changes to `core/box/mid_hotbox_v3_env_box.h`

```bash
git checkout core/box/mid_hotbox_v3_env_box.h
```

**Reason**: -2.02% regression is unacceptable for eliminating dead code that was never executed anyway.

---

## Lessons Learned

### 1. ASM-First Methodology Works

✅ Successfully identified that:
- `mid_v3_enabled()` was already optimized away
- `mid_v3_debug_enabled()` existed in ASM but was never executed (inside guards that are false by default)

### 2. Dead Code != Performance Impact

❌ **Counterintuitive finding**: Removing dead code can **hurt** performance due to layout tax

- The dead `mid_v3_debug_enabled()` calls were never executed
- But removing them caused code relayout → -2.02% regression
- **Lesson**: Leave dead code alone if it's already not executed

### 3. Layout Tax is Real and Significant

Both Phase 40 and Phase 41 hit layout tax:
- Phase 40: `tiny_header_mode()` constantization → -2.47%
- Phase 41: `mid_v3_*()` constantization → -2.02%

**Pattern**: Structural changes to inline functions → unpredictable layout effects

### 4. When to Stop Optimizing

**Stop criteria**:
1. If gate is already optimized away in ASM → Don't touch it
2. If gate appears in ASM but is never executed → **Still don't touch it** (layout risk)
3. Only optimize if gate is **executed frequently** in hot paths

### 5. ASM Inspection is Necessary but Not Sufficient

- ✅ ASM inspection told us gates exist
- ❌ ASM inspection alone didn't tell us that these call sites are never reached at runtime (their guards are false by default)
- ✅ **Need runtime profiling** (e.g., `perf record`) to confirm execution frequency

---

## Recommendations for Phase 42+

### 1. Add Runtime Profiling Step

**Before optimizing any gate**, use `perf` to verify it's actually executed:

```bash
# Profile hot functions
perf record -g -F 999 ./bench_random_mixed_hakmem_minimal
perf report --no-children --sort comm,dso,symbol

# Check if mid_v3_debug_enabled appears in profile
perf report | grep mid_v3
```

**Decision criteria**:
- If function appears in `perf report` → Worth optimizing
- If function is in ASM but NOT in `perf report` → Dead code, leave alone

### 2. Focus on Actually-Executed Gates

**Priority list** (requires profiling validation):
1. Gates that appear in `perf report` top 50 functions
2. Gates called in tight loops (identified from the per-instruction sample percentages in `perf annotate`)
3. Gates with measurable CPU time (>0.1% in profile)
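
The stop criteria from "Lessons Learned" and the 0.1% profile threshold above can be folded into one triage rule. This is a sketch under stated assumptions: `gate_triage`, `gate_action_t`, and the idea of passing the gate's share of profile samples as a percentage are hypothetical, not existing hakmem or `perf` APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Possible outcomes of auditing one gate (hypothetical names). */
typedef enum {
    GATE_SKIP_OPTIMIZED_AWAY,  /* not present in ASM: nothing to do */
    GATE_SKIP_NOT_EXECUTED,    /* in ASM but not measurable in the profile */
    GATE_CANDIDATE             /* in ASM and above the profile threshold */
} gate_action_t;

/* in_asm:      the gate (symbol or call site) shows up in the disassembly.
 * profile_pct: the gate's share of samples in the runtime profile, in
 *              percent (0.0 if it does not appear at all). */
static gate_action_t gate_triage(bool in_asm, double profile_pct) {
    assert(profile_pct >= 0.0 && profile_pct <= 100.0);
    if (!in_asm) {
        return GATE_SKIP_OPTIMIZED_AWAY;  /* stop criterion 1 */
    }
    if (profile_pct <= 0.1) {
        return GATE_SKIP_NOT_EXECUTED;    /* stop criterion 2: layout risk */
    }
    return GATE_CANDIDATE;                /* executed with measurable cost */
}
```

Under this rule, Phase 41's `mid_v3_debug_enabled()` (in ASM, absent from the runtime path) maps to `GATE_SKIP_NOT_EXECUTED`.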

### 3. Accept Dead Code in ASM

**Philosophy shift**:
- Old: "If it's in ASM, optimize it"
- New: "If it's in ASM but not executed, ignore it"

Dead code that's never executed has **zero runtime cost**. Removing it risks layout tax.

### 4. Test Layout Stability

Before committing any structural change:
1. Run 3× 10-run benchmarks (baseline, change, revert-verify)
2. Check if results are reproducible
3. Accept only if gain is **≥1.0%** (to overcome layout noise)
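
One possible way to encode this acceptance rule is sketched below. The function name `layout_stable_go` and the `noise_pct` tolerance for the revert-verify run are assumptions introduced for illustration. Only the ≥1.0% gain threshold comes from the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Arithmetic mean of n benchmark results (any consistent unit). */
static double mean_of(const double* v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Absolute difference of x from ref, in percent of ref. */
static double abs_pct_diff(double ref, double x) {
    double d = 100.0 * (x - ref) / ref;
    return d < 0.0 ? -d : d;
}

/* GO only if (a) the revert-verify runs land back within noise_pct of the
 * baseline mean, i.e. the measurement is reproducible, and (b) the change
 * beats the baseline mean by at least min_gain_pct.
 * noise_pct is an assumed tolerance, not a value from this report. */
static bool layout_stable_go(const double* baseline, const double* change,
                             const double* revert, size_t n,
                             double min_gain_pct, double noise_pct) {
    double b = mean_of(baseline, n);
    double gain_pct = 100.0 * (mean_of(change, n) - b) / b;
    bool reproducible = abs_pct_diff(b, mean_of(revert, n)) <= noise_pct;
    return reproducible && gain_pct >= min_gain_pct;
}
```

Called with `min_gain_pct = 1.0`, a change measuring about -2% against baseline (as in Step 3) is rejected regardless of the `noise_pct` chosen.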

### 5. Alternative: Investigate Other Hot Gates

Instead of MID v3 gates (which are dead), profile to find:
- Tiny allocator gates that ARE executed
- Free path gates with measurable cost
- Size class routing decisions in hot paths

---

## Quantitative Summary

| Phase | Target Gate(s) | ASM Present? | Executed? | Change | Result | Verdict |
|-------|----------------|--------------|-----------|--------|--------|---------|
| Phase 21 | `tiny_header_mode()` | No (optimized away) | No | N/A | N/A | Skipped |
| Phase 40 | `tiny_header_mode()` | No | No | Constantization | -2.47% | NO-GO |
| Phase 41 Step 2 | Condition reorder | N/A | N/A | N/A | Skipped | N/A |
| Phase 41 Step 3 | `mid_v3_enabled()`, `mid_v3_debug_enabled()` | Yes (debug only) | **No** (dead code) | Constantization | **-2.02%** | **NO-GO** |

**Phase 41 Final Performance**: **55.97M ops/s** (baseline, no changes adopted)

---

## Conclusion

Phase 41 successfully demonstrated the **ASM-first gate audit methodology** and confirmed its value. However, it also revealed a critical limitation:

> **ASM presence ≠ Performance impact**

The gate we targeted (`mid_v3_debug_enabled()`) existed in assembly but was **never executed**: every call site sits inside an `if (mid_v3_enabled())` guard that is false at runtime by default. Attempting to eliminate this dead code via BENCH_MINIMAL constantization caused a **-2.02% regression**, which we attribute to layout tax.

**Key Takeaway**:
- ✅ ASM inspection prevents wasting time on already-optimized gates (like Phase 21's `tiny_header_mode()`)
- ❌ But ASM inspection alone is insufficient - need **runtime profiling** to distinguish executed vs. dead code
- ⚠️ **Layout tax is a first-class optimization enemy** - structural changes risk unpredictable regressions

**Phase 42 Direction**:
1. Add `perf record/report` step to methodology
2. Target only gates that appear in runtime profiles
3. Accept dead code in ASM as zero-cost (don't fix what isn't broken)
4. Require ≥1.0% gain to overcome layout noise

**Phase 41 Verdict**: **NO-GO** - Revert all changes, baseline remains **FAST v3 = 55.97M ops/s**