# Phase 18 v2: Hot Text Isolation — BENCH_MINIMAL Design

## Context (Phase 18 v1 failed)

Phase 18 v1 attempted layout optimization using section splitting + GC (`-ffunction-sections -fdata-sections -Wl,--gc-sections`).

**Result**: CATASTROPHIC I-cache regression (+91%).

**Root cause**: Section-based splitting without explicit hot symbol ordering destroyed code locality.

**Lesson**: Layout tweaks are too fragile; **instruction count reduction is the direct path**.

---

## 0. Goal (High Impact)

Reduce **instruction footprint** per allocation/free by removing non-essential code paths at compile-time:
- Stats collection (counter increments on every operation)
- Environment variable checks (TLS lookups)
- Debug logging (conditional output)

**Expected impact**:
- Instruction count: -30-40% (benchmark workload)
- I-cache misses: automatic improvement from smaller working set
- Throughput: +10-20% (Phase 17 showed the system binary executing 48% fewer instructions; throughput should improve as that gap closes)

**GO Criteria** (strict):
- Throughput: +5% minimum, +8% preferred
- Instructions: -15% minimum (clear, measurable reduction)
- I-cache: should improve proportionally to instruction reduction

If instructions do not drop meaningfully, abandon this phase.

---

## 1. Strategy: BENCH_MINIMAL Build Mode

### 1.1 What to Remove

**Category A: Stats collection** (highest frequency, low value for bench)
```c
FRONT_FASTLANE_STAT_INC(malloc_total);      // ← 1 per malloc
FRONT_FASTLANE_STAT_INC(malloc_hit);        // ← 1 per malloc
FRONT_FASTLANE_STAT_INC(malloc_fallback_*); // ← conditional
free_tiny_fast_mono_stat_inc_hit();          // ← 1 per free
```

**Category B: Environment variable checks** (per-operation cost if not short-circuited)
```c
if (hakmem_env_snapshot_enabled()) { ... }   // TLS + branch
if (front_fastlane_enabled()) { ... }        // TLS + branch
```

**Category C: Debug logging / verbose output** (rarely needed in bench)
```c
hakmem_tls_trace_alloc();   // log path
hakmem_diag_record_fallback();
```

### 1.2 Compile-Time Gate

New build mode:
```
HAKMEM_BENCH_MINIMAL=0/1  (default 0, opt-in)
```

Activated via Makefile knob:
```makefile
BENCH_MINIMAL ?= 0
ifeq ($(BENCH_MINIMAL),1)
  CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
  # ... applies to both bench binaries and shared lib
endif
```

**Important**: This is **not** a research-only knob. It is a behavior-preserving build mode (it only disables instrumentation), intended for benchmark builds and off by default.
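
Because the guards are written as `#if !HAKMEM_BENCH_MINIMAL`, a translation unit built without the `-D` flag silently treats the macro as 0. A small configuration header can make that default explicit and reject unexpected values at compile time. The following is a minimal sketch under that assumption; `demo_build_mode_name` is a hypothetical helper, not an existing hakmem function.

```c
#include <assert.h>   /* static_assert (C11) */

/* Default to full instrumentation when the Makefile knob is not set. */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

/* Reject values other than 0/1, e.g. a stray -DHAKMEM_BENCH_MINIMAL=2. */
static_assert(HAKMEM_BENCH_MINIMAL == 0 || HAKMEM_BENCH_MINIMAL == 1,
              "HAKMEM_BENCH_MINIMAL must be 0 or 1");

/* Hypothetical helper: lets a bench banner report which mode was built. */
static inline const char *demo_build_mode_name(void) {
    return HAKMEM_BENCH_MINIMAL ? "bench-minimal" : "full";
}
```

Printing the mode name at benchmark start would make it harder to compare numbers from mismatched builds by accident.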

### 1.3 Safeguards

1. **Conditional compilation markers**:
   - Wrap stats inside `#if !HAKMEM_BENCH_MINIMAL`
   - Wrap ENV checks inside `#if !HAKMEM_BENCH_MINIMAL`
   - Ensure the code is still correct when `HAKMEM_BENCH_MINIMAL=0` (normal operation)

2. **No behavioral changes**:
   - Fast paths must remain identical (stats are just instrumentation)
   - Slow paths (fallback, cold) can be simplified (less instrumentation)

3. **Rollback**: Single build knob, reversible

---

## 2. Implementation Plan

### Phase 2.1: Stats removal (Priority 1)

**File**: `core/box/front_fastlane_stats_box.h`

```c
#if HAKMEM_BENCH_MINIMAL
// Bench: stats compile to a no-op
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { (void)0; } while (0)
#else
// Normal: relaxed atomic increment of the named counter
#define FRONT_FASTLANE_STAT_INC(stat) \
    atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed)
#endif
```

**File**: `core/front/malloc_tiny_fast.h`

Search for all `free_tiny_fast_mono_stat_*` calls and wrap:
```c
#if !HAKMEM_BENCH_MINIMAL
free_tiny_fast_mono_stat_inc_hit();
#endif
```

**Expected savings**: ~20-30 instructions/op (counter increments + memory sync)
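
The stats gating in this phase can be exercised in isolation to check that the fast-path result is the same in both modes and only the counters differ. The sketch below uses hypothetical stand-ins (`demo_stats_t`, `DEMO_STAT_INC`, `demo_size_class`) rather than the real stats box, and the size-class arithmetic is illustrative only.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0   /* default: instrumentation stays on */
#endif

/* Hypothetical counter block, standing in for the real stats struct. */
typedef struct {
    atomic_uint_fast64_t malloc_total;
    atomic_uint_fast64_t malloc_hit;
} demo_stats_t;

static demo_stats_t g_demo_stats;   /* static storage: zero-initialized */

#if HAKMEM_BENCH_MINIMAL
#define DEMO_STAT_INC(field) do { (void)0; } while (0)
#else
#define DEMO_STAT_INC(field) \
    atomic_fetch_add_explicit(&g_demo_stats.field, 1, memory_order_relaxed)
#endif

/* Hypothetical fast path: maps a request size to a 16-byte size class. */
static inline size_t demo_size_class(size_t size) {
    DEMO_STAT_INC(malloc_total);
    size_t cls = (size + 15) / 16;   /* core logic, identical in both modes */
    DEMO_STAT_INC(malloc_hit);
    return cls;
}

/* Reads the counter; stays 0 when the build is bench-minimal. */
static inline uint64_t demo_malloc_total(void) {
    return atomic_load_explicit(&g_demo_stats.malloc_total,
                                memory_order_relaxed);
}
```

Building this once with `-DHAKMEM_BENCH_MINIMAL=1` and once without should give the same return values from `demo_size_class`, while `demo_malloc_total` only advances in the default build.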

### Phase 2.2: Environment variable check removal (Priority 1)

**File**: `core/box/hakmem_env_snapshot_box.h`

```c
#if !HAKMEM_BENCH_MINIMAL
// Normal: read the runtime ENV snapshot flag
// (static inline so the header can be included from several translation units)
static inline bool hakmem_env_snapshot_enabled(void) {
    return g_env_snapshot_enabled;
}
#else
// Bench: Always enable snapshot (one-time cost)
static inline bool hakmem_env_snapshot_enabled(void) {
    return true;  // ← compile-time constant, branch eliminated by optimizer
}
#endif
```

**File**: `core/bench_profile.h`

Similar pattern for profile-related checks.

**Expected savings**: ~5-10 instructions/op (TLS lookups + branches)
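
For the profile-related checks, the same pattern might look like the sketch below. Everything here is an assumption for illustration: `demo_profile_enabled`, `demo_pick_path`, and the `DEMO_PROFILE_ENABLED` variable are hypothetical and do not reflect the actual contents of `core/bench_profile.h`. The normal build caches the environment lookup in a thread-local flag, and the bench build replaces the whole gate with a constant.

```c
#include <stdbool.h>
#include <stdlib.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

#if HAKMEM_BENCH_MINIMAL
/* Bench: constant gate, so callers' branches can be folded away. */
static inline bool demo_profile_enabled(void) {
    return true;
}
#else
/* Normal: one getenv() per thread, then a TLS load + branch per call. */
static inline bool demo_profile_enabled(void) {
    static _Thread_local int cached = -1;   /* -1 = not read yet */
    if (cached < 0) {
        const char *v = getenv("DEMO_PROFILE_ENABLED");  /* hypothetical */
        cached = (v == NULL || v[0] != '0') ? 1 : 0;     /* default: on */
    }
    return cached != 0;
}
#endif

/* Hypothetical caller: returns 1 for the profiled path, 0 otherwise. */
static inline int demo_pick_path(void) {
    return demo_profile_enabled() ? 1 : 0;
}
```

Note the trade-off: in the bench build the environment variable is ignored entirely, so any run that relies on turning the feature off at runtime needs the default build.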

### Phase 2.3: Debug logging removal (Priority 2, if needed)

**File**: trace/logging functions

Apply conditional compilation to verbose paths (many are already guarded by `HAKMEM_BUILD_DEBUG`).

**Expected savings**: ~0-5 instructions/op (rare paths, low impact for bench)
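
One common way to compile out logging without letting the disabled code rot is to keep the call behind `if (0)`, so the compiler still type-checks the format string and arguments but emits no code for them. This is a sketch of that general technique, not the existing hakmem trace API; `DEMO_TRACE` and `demo_fallback` are hypothetical names.

```c
#include <stdio.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

#if HAKMEM_BENCH_MINIMAL
/* Dead branch: arguments are still parsed and type-checked,
 * but never evaluated, and the optimizer drops the call. */
#define DEMO_TRACE(...) \
    do { if (0) fprintf(stderr, __VA_ARGS__); } while (0)
#else
#define DEMO_TRACE(...) \
    do { fprintf(stderr, __VA_ARGS__); } while (0)
#endif

/* Hypothetical cold path that logs and then continues. */
static inline int demo_fallback(int size_class) {
    DEMO_TRACE("fallback: class=%d\n", size_class);
    return size_class + 1;
}
```

Since the arguments are not evaluated in the bench build, trace calls should not carry side effects that the surrounding code depends on.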

---

## 3. Risks / Mitigations

### Risk A: Benchmark no longer representative

If we remove instrumentation, does the bench still measure the allocator?

**Mitigation**: BENCH_MINIMAL disables stats (instrumentation), not core logic.
- Fast paths remain identical.
- Only instrumentation overhead is removed.
- This is similar to how production binaries disable debug tracing.

### Risk B: Build regression

Conditional compilation could hide bugs.

**Mitigation**:
- Ensure `BENCH_MINIMAL=0` (default) is always tested first.
- Test both modes in CI if available.
- Manual verification that code is syntactically correct in both modes.

### Risk C: Instruction count doesn't drop

If removing stats + ENV checks doesn't clearly drop instructions, it means:
- Compiler/LTO already optimizes these away in RELEASE builds
- The overhead is elsewhere (memory access patterns, branch misprediction)

**Mitigation**: Run `perf stat` with `BENCH_MINIMAL=1` to confirm the instruction reduction. If the reduction is smaller than 10%, reconsider the strategy.

---

## 4. Expected Impact

**Conservative** (stats only):
- Throughput: +5-8%
- Instructions: -20-30%
- I-cache: improvement follows from instruction reduction

**Ambitious** (stats + ENV + debug):
- Throughput: +10-20%
- Instructions: -30-40%
- I-cache: -20-30% (proportional to instruction reduction)

**System binary ceiling**: 85M ops/s
- Baseline: 48M ops/s
- Target: 52-55M ops/s (via +8-15%)
- Ambitious case (+10-20%): 53-58M ops/s
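
The ops/s figures come from scaling the 48M ops/s baseline by a relative gain. A minimal sketch of that arithmetic, with hypothetical helper names, also shows how much of the distance to the 85M ops/s ceiling a given result would cover:

```c
/* Projected throughput after a relative gain (same unit as the baseline). */
static inline double project_ops(double baseline, double gain_pct) {
    return baseline * (1.0 + gain_pct / 100.0);
}

/* Share of the baseline-to-ceiling gap covered by `reached`, in percent. */
static inline double gap_closed_pct(double baseline, double ceiling,
                                    double reached) {
    return (reached - baseline) * 100.0 / (ceiling - baseline);
}
```

For example, `project_ops(48.0, 5.0)` is 50.4 and `project_ops(48.0, 15.0)` is 55.2, and reaching 55.2M ops/s would close roughly 19% of the gap to the system binary.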

---

## 5. Box Theory Compliance

**Box**: BenchMinimalBox
- **Boundary**: `HAKMEM_BENCH_MINIMAL=0/1` (compile-time)
- **Scope**: Instrumentation removal only (no algorithm changes)
- **Rollback**: Single build knob (default OFF, backward compatible)
- **Observability**: Perf stat before/after (instructions, I-cache, throughput)

---

## 6. Success Criteria

### GO (Proceed):
- Throughput: **+5% or more**
- Instructions: **-15% or more** (smoking gun for success)
- I-cache: should improve proportionally

### NEUTRAL (keep as research box):
- Throughput: ±3% (within noise)
- Instructions: -5% to -15% (marginal)

### NO-GO (freeze):
- Throughput: regression worse than -2%, or no improvement
- Instructions: reduction smaller than 5% (failed optimization objective)
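
The three outcomes can be encoded as a small decision function so that benchmark scripts apply the thresholds consistently. The sketch below rests on assumptions the criteria leave open: GO requires both the throughput and the instruction threshold, NO-GO is checked next, and anything left over counts as NEUTRAL. The names `phase18_verdict` and `pct_change` are hypothetical.

```c
typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_NO_GO } verdict_t;

/* Relative change in percent; negative means `after` is smaller. */
static inline double pct_change(double before, double after) {
    return (after - before) * 100.0 / before;
}

/* throughput_pct: positive = faster. instr_pct: negative = fewer instructions. */
static inline verdict_t phase18_verdict(double throughput_pct,
                                        double instr_pct) {
    /* GO: at least +5% throughput AND at least a 15% instruction drop. */
    if (throughput_pct >= 5.0 && instr_pct <= -15.0) {
        return VERDICT_GO;
    }
    /* NO-GO: throughput regressed past -2%, or instructions fell by < 5%. */
    if (throughput_pct < -2.0 || instr_pct > -5.0) {
        return VERDICT_NO_GO;
    }
    /* Everything in between is treated as NEUTRAL (assumption). */
    return VERDICT_NEUTRAL;
}
```

Feeding it `pct_change` values computed from `perf stat` runs before and after the change keeps the sign conventions in one place.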

---

## 7. Files to Modify

1. **`core/box/hot_text_attrs_box.h`** (update to add BENCH_MINIMAL support)
   - Already created in Phase 18 v1

2. **`core/box/front_fastlane_stats_box.h`** (stats conditional)
   - Update macro to be no-op when BENCH_MINIMAL=1

3. **`core/front/malloc_tiny_fast.h`** (free stats wrapper)
   - Wrap free_tiny_fast_mono_stat_* calls

4. **`core/box/hakmem_env_snapshot_box.h`** (ENV check simplification)
   - Make hakmem_env_snapshot_enabled() constant when BENCH_MINIMAL=1

5. **`core/bench_profile.h`** (profile checks)
   - Simplify profile-related checks

6. **`Makefile`**
   - Add `BENCH_MINIMAL ?= 0` knob
   - Apply `-DHAKMEM_BENCH_MINIMAL=1` when enabled

---

## 8. Notes

- **Production safety**: BENCH_MINIMAL is DISABLED by default. Production builds use full instrumentation.
- **Bench-only**: This mode is intended for benchmark binary builds, not shipped libraries.
- **Phase 17 learning**: The system binary executed 48% fewer instructions than hakmem. If BENCH_MINIMAL achieves a -30% to -40% instruction reduction, we should see +10-20% throughput.

---

## 9. Next Steps After Phase 18 v2

### If GO (+5%+):
- Promote BENCH_MINIMAL=1 as the new baseline for bench builds
- Update CURRENT_TASK with Phase 18 v2 victory
- Prepare Phase 19 (next optimization frontier)

### If NEUTRAL:
- Investigate why instructions didn't drop (compiler already optimized?)
- Consider aggressive options: SIMD, allocation batching, lock-free

### If NO-GO:
- Acknowledge that the Phase 17 +74% gap is primarily memory-bound (IPC=2.30)
- Shift focus to broader architectural changes