hakmem/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md

# Phase 18 v2: Hot Text Isolation — BENCH_MINIMAL Design

## Context (Phase 18 v1 failed)

Phase 18 v1 attempted layout optimization using section splitting + GC (`-ffunction-sections -fdata-sections -Wl,--gc-sections`).

**Result**: CATASTROPHIC I-cache regression (I-cache misses +91%).

**Root cause**: Section-based splitting without explicit hot symbol ordering destroyed code locality.

**Lesson**: Layout tweaks are too fragile; **instruction count reduction is the direct path**.

---

## 0. Goal (High Impact)

Reduce **instruction footprint** per allocation/free by removing non-essential code paths at compile-time:
- Stats collection (counter increments on every operation)
- Environment variable checks (TLS lookups)
- Debug logging (conditional output)

**Expected impact**:
- Instruction count: -30-40% (benchmark workload)
- I-cache misses: automatic improvement from smaller working set
- Throughput: +10-20% (Phase 17 found the system binary executing 48% fewer instructions than hakmem; that gap should close proportionally)

**GO Criteria** (strict):
- Throughput: +5% minimum, +8% preferred
- Instructions: -15% minimum (clear, measurable reduction)
- I-cache: should improve proportionally to instruction reduction

If instructions do not drop meaningfully, abandon this phase.

---

## 1. Strategy: BENCH_MINIMAL Build Mode

### 1.1 What to Remove

**Category A: Stats collection** (highest frequency, low value for bench)
```c
FRONT_FASTLANE_STAT_INC(malloc_total);      // ← 1 per malloc
FRONT_FASTLANE_STAT_INC(malloc_hit);        // ← 1 per malloc
FRONT_FASTLANE_STAT_INC(malloc_fallback_*); // ← conditional
free_tiny_fast_mono_stat_inc_hit();          // ← 1 per free
```

**Category B: Environment variable checks** (per-operation cost if not short-circuited)
```c
if (hakmem_env_snapshot_enabled()) { ... }   // TLS + branch
if (front_fastlane_enabled()) { ... }        // TLS + branch
```

**Category C: Debug logging / verbose output** (rarely needed in bench)
```c
hakmem_tls_trace_alloc();   // log path
hakmem_diag_record_fallback();
```

### 1.2 Compile-Time Gate

New build mode:
```
HAKMEM_BENCH_MINIMAL=0/1  (default 0, opt-in)
```

Activated via Makefile knob:
```makefile
BENCH_MINIMAL ?= 0
ifeq ($(BENCH_MINIMAL),1)
  CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
  # ... applies to both bench binaries and shared lib
endif
```

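**Important**: This is **not** a research-only knob. It only disables instrumentation and pins ENV gates to constants, so the resulting build stays functionally safe. It remains opt-in (default 0) and is intended for benchmark builds (see section 8).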
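On the C side, the gate is easier to reason about if the macro always has a value, so that `#if !HAKMEM_BENCH_MINIMAL` never depends on an undefined identifier. The following is a minimal sketch of such a config fragment, not hakmem's actual header; the helper `hakmem_bench_minimal_mode()` is a hypothetical name added for illustration.

```c
#include <stdbool.h>

/* Hypothetical config fragment: default the gate to 0 when the build
 * system did not pass -DHAKMEM_BENCH_MINIMAL=1. */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

/* Reject values other than 0/1 at compile time. */
#if HAKMEM_BENCH_MINIMAL != 0 && HAKMEM_BENCH_MINIMAL != 1
#error "HAKMEM_BENCH_MINIMAL must be 0 or 1"
#endif

/* Illustrative helper (not part of hakmem): lets a bench harness or a
 * test report which mode the binary was built in. */
static inline bool hakmem_bench_minimal_mode(void) {
    return HAKMEM_BENCH_MINIMAL != 0;
}
```

With this in place, every later `#if !HAKMEM_BENCH_MINIMAL` sees an explicit 0 or 1, and a typo such as `-DHAKMEM_BENCH_MINIMAL=2` fails the build instead of silently selecting a mode.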

### 1.3 Safeguards

1. **Conditional compilation markers**:
   - Wrap stats inside `#if !HAKMEM_BENCH_MINIMAL`
   - Wrap ENV checks inside `#if !HAKMEM_BENCH_MINIMAL`
   - Ensure code is still correct when `HAKMEM_BENCH_MINIMAL=0` (normal operation, instrumentation compiled in)

2. **No behavioral changes**:
   - Fast paths must remain identical (stats are just instrumentation)
   - Slow paths (fallback, cold) can be simplified (less instrumentation)

3. **Rollback**: Single build knob, reversible

---

## 2. Implementation Plan

### Phase 2.1: Stats removal (Priority 1)

**File**: `core/box/front_fastlane_stats_box.h`

```c
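#include <stdatomic.h>  // atomic_fetch_add_explicit, memory_order_relaxed

#if !HAKMEM_BENCH_MINIMAL
// Normal build: count the event with a relaxed atomic increment
#define FRONT_FASTLANE_STAT_INC(stat) \
    atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed)
#else
// Bench build: ← becomes no-op
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { (void)0; } while(0)
#endif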
```

**File**: `core/front/malloc_tiny_fast.h`

Search for all `free_tiny_fast_mono_stat_*` calls and wrap:
```c
#if !HAKMEM_BENCH_MINIMAL
free_tiny_fast_mono_stat_inc_hit();
#endif
```

**Expected savings**: ~20-30 instructions/op (counter increments + memory sync)
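To make the intended effect concrete, here is a self-contained sketch of a gated counter. The struct `demo_stats_t`, the global `g_demo_stats`, the macro `DEMO_STAT_INC`, and the function `demo_malloc_path()` are stand-ins invented for this example; only the macro shape follows the plan above, and the real stats layout may differ.

```c
#include <stdatomic.h>
#include <stdint.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0   /* default: instrumentation enabled */
#endif

/* Stand-in for the real stats storage (hypothetical layout). */
typedef struct {
    _Atomic uint64_t malloc_total;
    _Atomic uint64_t malloc_hit;
} demo_stats_t;

static demo_stats_t g_demo_stats;

#if !HAKMEM_BENCH_MINIMAL
/* Normal build: one relaxed atomic add per event. */
#define DEMO_STAT_INC(stat) \
    ((void)atomic_fetch_add_explicit(&g_demo_stats.stat, 1, memory_order_relaxed))
#else
/* Bench build: expands to nothing, so no instructions are emitted. */
#define DEMO_STAT_INC(stat) ((void)0)
#endif

/* Stand-in for a hot path; the counters are its only side effect here. */
static inline void demo_malloc_path(void) {
    DEMO_STAT_INC(malloc_total);
    DEMO_STAT_INC(malloc_hit);
}

/* Read back one counter for inspection. */
static inline uint64_t demo_malloc_total(void) {
    return atomic_load_explicit(&g_demo_stats.malloc_total, memory_order_relaxed);
}
```

Calling `demo_malloc_path()` twice leaves `demo_malloc_total()` at 2 in a normal build and at 0 when compiled with `-DHAKMEM_BENCH_MINIMAL=1`. Both macro variants are expressions cast to `void`, so they behave the same at every call site.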

### Phase 2.2: Environment variable check removal (Priority 1)

**File**: `core/box/hakmem_env_snapshot_box.h`

```c
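#include <stdbool.h>

// static inline in both modes so the header can be included from several
// translation units without duplicate-symbol errors
#if !HAKMEM_BENCH_MINIMAL
// Normal: Check ENV at runtime
static inline bool hakmem_env_snapshot_enabled(void) {
    return g_env_snapshot_enabled;
}
#else
// Bench: Always enable snapshot (one-time cost)
static inline bool hakmem_env_snapshot_enabled(void) {
    return true;  // ← compile-time constant, branch eliminated by optimizer
}
#endif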
```

**File**: `core/bench_profile.h`

Similar pattern for profile-related checks.

**Expected savings**: ~5-10 instructions/op (TLS lookups + branches)
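The sketch below illustrates how pinning the check to a constant affects a caller. It is an illustration under stated assumptions: `g_demo_snapshot_enabled`, `demo_snapshot_enabled()`, and `demo_pick_path()` are hypothetical names, and a plain static variable stands in for whatever TLS or global state the real check reads.

```c
#include <stdbool.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

/* Stand-in for the cached ENV decision (hypothetical; set once at init). */
static bool g_demo_snapshot_enabled = false;

#if !HAKMEM_BENCH_MINIMAL
static inline bool demo_snapshot_enabled(void) {
    return g_demo_snapshot_enabled;   /* runtime load, branch at call site */
}
#else
static inline bool demo_snapshot_enabled(void) {
    return true;                      /* constant: call-site branch can fold */
}
#endif

/* Caller shape from section 1.1: 1 = snapshot path, 0 = legacy path. */
static inline int demo_pick_path(void) {
    if (demo_snapshot_enabled()) {
        return 1;
    }
    return 0;
}
```

One consequence worth keeping in mind: in the bench build the ENV toggle can no longer switch the snapshot path off, so A/B runs that rely on that toggle need the normal build.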

### Phase 2.3: Debug logging removal (Priority 2, if needed)

**File**: trace/logging functions

Conditional compilation for verbose paths (already often guarded by `HAKMEM_BUILD_DEBUG`).

**Expected savings**: ~0-5 instructions/op (rare paths, low impact for bench)
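A trace call can be gated the same way. The following is a sketch only: `DEMO_TRACE`, `demo_trace_hook()`, and `demo_round_up16()` are hypothetical names, and hakmem's real trace helpers may be structured differently.

```c
#include <stddef.h>
#include <stdio.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

static int g_demo_trace_calls = 0;   /* counts how often the hook ran */

/* Hypothetical trace hook: writes one line to stderr. */
static inline int demo_trace_hook(const char *what, unsigned long size) {
    g_demo_trace_calls++;
    return fprintf(stderr, "[trace] %s size=%lu\n", what, size);
}

#if !HAKMEM_BENCH_MINIMAL
#define DEMO_TRACE(what, size) \
    ((void)demo_trace_hook((what), (unsigned long)(size)))
#else
/* Bench build: arguments are not evaluated, so they must be side-effect free. */
#define DEMO_TRACE(what, size) ((void)0)
#endif

/* Example path: the result is identical with or without tracing. */
static inline size_t demo_round_up16(size_t size) {
    DEMO_TRACE("round_up16", size);
    return (size + 15u) & ~(size_t)15u;
}
```

The function returns the same value in both modes; only the stderr output and the hook counter differ.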

---

## 3. Risks / Mitigations

### Risk A: Benchmark no longer representative

If we remove instrumentation, does the bench still measure the allocator?

**Mitigation**: BENCH_MINIMAL disables stats (instrumentation), not core logic.
- Fast paths remain identical.
- Only instrumentation overhead is removed.
- This is similar to how production binaries disable debug tracing.

### Risk B: Build regression

Conditional compilation could hide bugs.

**Mitigation**:
- Ensure `BENCH_MINIMAL=0` (default) is always tested first.
- Test both modes in CI if available.
- Manually verify that the code compiles in both modes.
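One complementary technique, which is not part of the plan above and is offered only as an option, is to gate with an ordinary `if` on the macro's constant value instead of `#if`. Both arms are then parsed and type-checked in every build, which reduces the chance that one mode silently stops compiling. The names `g_demo_free_hits` and `demo_free_stat_inc_hit()` are hypothetical.

```c
#include <stdint.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

static uint64_t g_demo_free_hits;   /* hypothetical counter */

static inline void demo_free_stat_inc_hit(void) {
    /* Constant condition: the compiler sees both arms in every build.
     * Optimizing builds typically drop the dead arm; at -O0 the check
     * may remain as a real branch. */
    if (!HAKMEM_BENCH_MINIMAL) {
        g_demo_free_hits++;
    }
}
```

The trade-off is that removal now depends on the optimizer, whereas `#if` guarantees the code is absent. For this phase, where instruction count is the metric, `#if` on the hot path with the `if` form on colder paths is one possible split.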

### Risk C: Instruction count doesn't drop

If removing stats + ENV checks doesn't clearly drop instructions, it means:
- Compiler/LTO already optimizes these away in RELEASE builds
- The overhead is elsewhere (memory access patterns, branch misprediction)

**Mitigation**: Run `perf stat` with `BENCH_MINIMAL=0` and `BENCH_MINIMAL=1` and compare instruction counts. If the reduction is smaller than 10%, reconsider the strategy.

---

## 4. Expected Impact

**Conservative** (stats only):
- Throughput: +5-8%
- Instructions: -20-30%
- I-cache: improvement follows from instruction reduction

**Ambitious** (stats + ENV + debug):
- Throughput: +10-20%
- Instructions: -30-40%
- I-cache misses: -20-30% (proportional to instruction reduction)

**System binary ceiling**: 85M ops/s
- Baseline: 48M ops/s
- Target: 50.4-55.2M ops/s (via +5-15%)
- Would be 52.8-57.6M ops/s with ambitious impact (+10-20%)
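The projected figures follow from applying a percentage gain to the 48M ops/s baseline. A small helper makes that arithmetic easy to re-check; `demo_project_mops()` is a hypothetical name used only for this illustration.

```c
/* Projected throughput in M ops/s for a given baseline and gain.
 * gain_pct is a percentage, e.g. 5.0 for +5%. */
static inline double demo_project_mops(double baseline_mops, double gain_pct) {
    return baseline_mops * (1.0 + gain_pct / 100.0);
}
```

For the 48M ops/s baseline this gives 50.4M at +5%, 52.8M at +10%, 55.2M at +15%, and 57.6M at +20%.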

---

## 5. Box Theory Compliance

**Box**: BenchMinimalBox
- **Boundary**: `HAKMEM_BENCH_MINIMAL=0/1` (compile-time)
- **Scope**: Instrumentation removal only (no algorithm changes)
- **Rollback**: Single build knob (default OFF, backward compatible)
- **Observability**: Perf stat before/after (instructions, I-cache, throughput)

---

## 6. Success Criteria

### GO (Proceed):
- Throughput: **+5% or more**
- Instructions: **-15% or more** (smoking gun for success)
- I-cache: should improve proportionally

### NEUTRAL (keep as research box):
- Throughput: ±3% (within noise)
- Instructions: -5% to -15% (marginal)

### NO-GO (freeze):
- Throughput: regression worse than -2%, or no improvement
- Instructions: reduction smaller than 5% (failed optimization objective)
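The three bands can be written down as a small decision function so that A/B results are judged the same way each time. This is one possible reading of the criteria, not a definitive rule: GO requires both thresholds, NO-GO triggers on a throughput regression worse than 2% or an instruction reduction under 5%, and every other combination (including cases the bands do not cover, such as +4% throughput) is treated as NEUTRAL. All names are hypothetical.

```c
typedef enum { DEMO_GO, DEMO_NEUTRAL, DEMO_NO_GO } demo_verdict_t;

/* Percentage reduction between two counts (positive = fewer after). */
static inline double demo_drop_pct(double before, double after) {
    return (before - after) / before * 100.0;
}

/* throughput_pct: throughput change in percent (positive = faster).
 * instr_drop_pct: instruction reduction in percent (positive = fewer). */
static inline demo_verdict_t demo_classify(double throughput_pct,
                                           double instr_drop_pct) {
    if (throughput_pct >= 5.0 && instr_drop_pct >= 15.0) {
        return DEMO_GO;
    }
    if (throughput_pct < -2.0 || instr_drop_pct < 5.0) {
        return DEMO_NO_GO;
    }
    return DEMO_NEUTRAL;   /* assumption: uncovered cases land here */
}
```

For example, feeding the instruction counts from two `perf stat` runs through `demo_drop_pct()` and the measured throughput change into `demo_classify()` yields the verdict for the run.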

---

## 7. Files to Modify

1. **`core/box/hot_text_attrs_box.h`** (update to add BENCH_MINIMAL support)
   - Already created in Phase 18 v1

2. **`core/box/front_fastlane_stats_box.h`** (stats conditional)
   - Update macro to be no-op when BENCH_MINIMAL=1

3. **`core/front/malloc_tiny_fast.h`** (free stats wrapper)
   - Wrap free_tiny_fast_mono_stat_* calls

4. **`core/box/hakmem_env_snapshot_box.h`** (ENV check simplification)
   - Make hakmem_env_snapshot_enabled() constant when BENCH_MINIMAL=1

5. **`core/bench_profile.h`** (profile checks)
   - Simplify profile-related checks

6. **`Makefile`**
   - Add `BENCH_MINIMAL ?= 0` knob
   - Apply `-DHAKMEM_BENCH_MINIMAL=1` when enabled

---

## 8. Notes

- **Production safety**: BENCH_MINIMAL is DISABLED by default. Production builds use full instrumentation.
- **Bench-only**: This mode is intended for benchmark binary builds, not shipped libraries.
- **Phase 17 learning**: The system binary executed 48% fewer instructions than hakmem. If BENCH_MINIMAL achieves -30% to -40%, we should see +10-20% throughput.

---

## 9. Next Steps After Phase 18 v2

### If GO (+5%+):
- Promote BENCH_MINIMAL=1 as the new baseline for bench builds
- Update CURRENT_TASK with Phase 18 v2 victory
- Prepare Phase 19 (next optimization frontier)

### If NEUTRAL:
- Investigate why instructions didn't drop (compiler already optimized?)
- Consider aggressive options: SIMD, allocation batching, lock-free

### If NO-GO:
- Acknowledge that Phase 17 +74% gap is primarily memory-bound (IPC=2.30)
- Shift focus to broader architectural changes

---

Phase 18 v2: BENCH_MINIMAL design + instructions (instruction removal strategy)

## Phase 18 v2: Next Phase Direction

After Phase 18 v1 failure (layout optimization caused I-cache regression), shift to instruction count reduction via compile-time removal:
- Stats collection (FRONT_FASTLANE_STAT_INC → no-op)
- Environment checks (runtime lookup → constant)
- Debug logging (conditional compilation)

Expected impact: Instructions -30-40%, Throughput +10-20%

## Success Criteria (STRICT)

GO (must have ALL):
- Throughput: +5% minimum (+8% preferred)
- Instructions: -15% minimum (smoking gun)
- I-cache: automatic improvement from smaller footprint

NEUTRAL: throughput ±3%, instructions -5% to -15%

NO-GO: throughput < -2%, instructions < -5%

Key: If instructions do not drop -15%+, allocator is not the bottleneck and this phase should be abandoned.

## Implementation Strategy

1. Makefile knob: BENCH_MINIMAL=0/1 (default OFF, production-safe)
2. Conditional removal:
   - Stats: #if !HAKMEM_BENCH_MINIMAL
   - ENV checks: constant propagation
   - Debug: conditional includes
3. A/B test with perf stat (must measure instruction reduction)

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md (detailed design)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md (step-by-step)

Modified:
- CURRENT_TASK.md (Phase 18 v1/v2 status)

## Key Learning from Phase 18 v1 Failure

Layout optimization is extremely fragile without strong ordering guarantees. Section splitting alone (without symbol ordering, PGO, or linker script) destroyed code locality and increased I-cache misses 91%. Switching to direct instruction removal is safer and more predictable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

2025-12-15 05:55:22 +09:00
