# Phase 18 v2: Hot Text Isolation — BENCH_MINIMAL Implementation Instructions

## Status

- Phase 18 v1 (layout + sections) → NO-GO (I-cache regression)
- Phase 18 v2 (instruction removal) → next phase
- Strategy: remove stats/ENV overhead at compile time
- Expected: +10-20% throughput, -30-40% instructions

Ref: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md`

---

## 0. Goal / Success Criteria

**GO threshold** (strict):
- Throughput: **+5% minimum** (+8% preferred)
- Instructions: **-15% minimum** (clear proof of concept)
- I-cache: automatic improvement from the smaller footprint

**NEUTRAL**:
- Throughput: within ±3% (marginal)
- Instructions: -5% to -15% (incomplete optimization)

**NO-GO**:
- Throughput: -2% or worse (regression)
- Instructions: less than a -5% reduction (failed to reduce overhead)

If instructions do not drop by at least 15%, abandon this phase (the allocator is not the bottleneck).

---

## 1. Implementation Steps

### 1.1 Add Makefile knob

**File**: `Makefile`

After the Phase 18 v1 section (around line 140), add:

```makefile
# Phase 18 v2: BENCH_MINIMAL (remove instrumentation for benchmark builds)
BENCH_MINIMAL ?= 0
ifeq ($(BENCH_MINIMAL),1)
CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
CFLAGS_SHARED += -DHAKMEM_BENCH_MINIMAL=1
# Note: both bench binaries and the shared lib will disable instrumentation.
# Mainly impacts bench_* binaries (where BENCH_MINIMAL is intentionally enabled).
endif
```

**Location**: after the `HOT_TEXT_GC_SECTIONS` section (line ~145).

### 1.2 Update front_fastlane_stats_box.h

**File**: `core/box/front_fastlane_stats_box.h`

Find the `FRONT_FASTLANE_STAT_INC` macro (currently defined as an atomic increment).
Replace the entire macro section:

```c
// Before:
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { \
        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
        } \
    } while (0)

// After:
#if HAKMEM_BENCH_MINIMAL
#define FRONT_FASTLANE_STAT_INC(stat) do { (void)0; } while (0)
#else
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { \
        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
        } \
    } while (0)
#endif
```

**Rationale**: stats collection becomes a no-op in BENCH_MINIMAL and is compiled away entirely.

### 1.3 Update malloc_tiny_fast.h free stats

**File**: `core/front/malloc_tiny_fast.h`

Search for `free_tiny_fast_mono_stat_inc_*` calls (probably in the `free_tiny_fast` function) and wrap each call:

```c
// Before:
free_tiny_fast_mono_stat_inc_hit();

// After:
#if !HAKMEM_BENCH_MINIMAL
free_tiny_fast_mono_stat_inc_hit();
#endif
```

Search pattern: `free_tiny_fast_mono_stat_` (should find ~5-10 calls).

**Rationale**: as with malloc, the free-path stats become optional in BENCH_MINIMAL.

### 1.4 Update hakmem_env_snapshot_box.h

**File**: `core/box/hakmem_env_snapshot_box.h`

Find the function `hakmem_env_snapshot_enabled()` and replace its implementation:

```c
// Before:
static inline bool hakmem_env_snapshot_enabled(void) {
    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
}

// After:
#if HAKMEM_BENCH_MINIMAL
// In bench mode, the snapshot is always enabled (one-time cost, compile-away benefit)
static inline bool hakmem_env_snapshot_enabled(void) {
    return 1;
}
#else
// Normal mode: runtime check
static inline bool hakmem_env_snapshot_enabled(void) {
    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
}
#endif
```

**Rationale**: ENV checks become compile-time constants in BENCH_MINIMAL, enabling better optimization.
### 1.5 (Optional) Update debug logging

**File**: any file with `hakmem_tls_trace_*` or `hakmem_diag_*` calls

Only if needed (lower priority). Wrap verbose logging:

```c
#if !HAKMEM_BENCH_MINIMAL
hakmem_tls_trace_alloc(ptr, size);
#endif
```

---

## 2. Build & Verify

### 2.1 Baseline build (BENCH_MINIMAL=0)

```sh
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem
```

Expected: ~653K (same as before).

### 2.2 Optimized build (BENCH_MINIMAL=1)

```sh
make clean
make -j BENCH_MINIMAL=1 bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem
```

Expected: roughly the same size.

**Note**: binary size may not change noticeably. The guards remove instructions inside hot functions at compile time; they do not remove whole sections at link time, so do not use `ls -lh` as the success metric — use the perf counters in section 3.

### 2.3 Verify compilation

Both builds must complete without errors. Check for mistakes in the conditional code:

```sh
# Errors like "undeclared identifier" mean a conditional guard is wrong
# (e.g., a symbol is defined in one #if branch but used outside it).
```

---

## 3. A/B Test Execution

### 3.1 Baseline run (BENCH_MINIMAL=0)

```sh
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```

Record:
- mean throughput
- stdev
- min/max

Run perf stat (note: `L1-icache-load-misses` is included in the event list so the I-cache figure below is actually measured):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,L1-icache-load-misses,dTLB-load-misses,minor-faults -- \
    env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
    ./bench_random_mixed_hakmem 200000000 400 1
```

Record:
- cycles
- instructions
- L1-icache-load-misses
- branch-miss %

### 3.2 Optimized run (BENCH_MINIMAL=1)

```sh
make clean
make -j BENCH_MINIMAL=1 bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```

Record the same metrics as the baseline, and run the same `perf stat` command.

### 3.3 Optional: system ceiling check

```sh
./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput"
```

(Already measured in Phase 17: ~85M ops/s.)

---

## 4. Analysis & GO/NO-GO Decision

### Compute deltas

```
Delta_throughput   = (optimized_mean - baseline_mean) / baseline_mean * 100
Delta_instructions = (optimized_instructions - baseline_instructions) / baseline_instructions * 100
Delta_icache       = (optimized_icache - baseline_icache) / baseline_icache * 100
```

### Decision logic

**GO** (ALL must hold):
- Delta_throughput ≥ +5%
- Delta_instructions ≤ -15%
- Delta_icache ≤ -10% (should follow the instruction reduction)

**NEUTRAL** (criteria marginal):
- Delta_throughput within ±3%
- Delta_instructions in [-15%, -5%]
- variance increase < 50%

**NO-GO** (any critical criterion missed):
- Delta_throughput < -2%
- Delta_instructions > -5% (failed to reduce overhead)
- variance increase > 100%

---

## 5. Reporting (required artifacts)

Create the file `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_AB_TEST_RESULTS.md` and include:

```markdown
# Phase 18 v2: BENCH_MINIMAL — A/B Test Results

| Metric | Baseline | Optimized | Delta |
|--------|----------|-----------|-------|
| Throughput (mean) | X.XXM | X.XXM | +/-X.XX% |
| Throughput (σ) | X.XXM | X.XXM | % |
| Instructions | X.XB | X.XB | +/-X.X% |
| I-cache misses | XXXK | XXXK | +/-X.X% |
| Cycles | X.XB | X.XB | +/-X.X% |

## Verdict

GO / NEUTRAL / NO-GO with reasoning
```

Update `CURRENT_TASK.md` (Phase 18 v2 status + next steps).

---

## 6. Important Notes

- Test the default (`BENCH_MINIMAL=0`) build first.
- Both BENCH_MINIMAL=0 and BENCH_MINIMAL=1 must compile successfully.
- If instructions drop by less than 15%, the optimization is incomplete (check whether the compiler was already eliminating this overhead).
- This is NOT a research-only knob; it is a valid bench build mode (safe to enable in CI).

---

## 7. If GO: Next Steps

1. **Promote BENCH_MINIMAL=1** as the default for bench builds.
2. **Prepare Phase 19** (next optimization frontier). Possible targets: SIMD prefetch, lock-free structures, allocation batching.
3. **Document learnings** about allocator overhead vs. memory latency.

---

## 8. If NO-GO: Lessons

- The allocator is memory-bound (IPC = 2.30, constant across phases).
- Instruction-count reduction does not always yield throughput gains when memory latency dominates.
- Architectural changes (cache-friendly layout, batching) are needed, rather than micro-optimizations.