Phase 18 v2: BENCH_MINIMAL design + instructions (instruction removal strategy)

## Phase 18 v2: Next Phase Direction

After Phase 18 v1 failure (layout optimization caused I-cache regression), shift to instruction count reduction via compile-time removal:
- Stats collection (FRONT_FASTLANE_STAT_INC → no-op)
- Environment checks (runtime lookup → constant)
- Debug logging (conditional compilation)

Expected impact: Instructions -30-40%, Throughput +10-20%

## Success Criteria (STRICT)

GO (must have ALL):
- Throughput: +5% minimum (+8% preferred)
- Instructions: -15% minimum (smoking gun)
- I-cache: automatic improvement from smaller footprint

NEUTRAL: throughput ±3%, instructions -5% to -15%

NO-GO: throughput < -2%, instructions < -5%

Key: If instructions do not drop -15%+, allocator is not the bottleneck and this phase should be abandoned.

## Implementation Strategy

1. Makefile knob: BENCH_MINIMAL=0/1 (default OFF, production-safe)
2. Conditional removal:
   - Stats: #if !HAKMEM_BENCH_MINIMAL
   - ENV checks: constant propagation
   - Debug: conditional includes
3. A/B test with perf stat (must measure instruction reduction)

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md (detailed design)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md (step-by-step)

Modified:
- CURRENT_TASK.md (Phase 18 v1/v2 status)

## Key Learning from Phase 18 v1 Failure

Layout optimization is extremely fragile without strong ordering guarantees. Section splitting alone (without symbol ordering, PGO, or linker script) destroyed code locality and increased I-cache misses 91%. Switching to direct instruction removal is safer and more predictable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-15 05:55:22 +09:00
parent b1912d6587
commit ad346f7885
3 changed files with 637 additions and 23 deletions
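
The commit message describes turning stats increments into no-ops and runtime environment lookups into constants. A minimal C sketch of that gating pattern follows. The `demo_*` functions, `DEMO_STAT_INC`, and the globals are hypothetical stand-ins chosen for illustration, not the identifiers used in the hakmem sources, and the real macros may be structured differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0   /* default OFF, mirroring the Makefile knob */
#endif

static_assert(HAKMEM_BENCH_MINIMAL == 0 || HAKMEM_BENCH_MINIMAL == 1,
              "BENCH_MINIMAL is a 0/1 knob");

/* Hypothetical stand-ins for the real stats struct and runtime switch. */
struct demo_stats { atomic_ulong malloc_total; };
struct demo_stats g_demo_stats;          /* zero-initialized */
atomic_bool g_demo_snapshot_enabled;     /* zero-initialized: false */

#if HAKMEM_BENCH_MINIMAL
/* Bench mode: the increment disappears and the check is a constant. */
#define DEMO_STAT_INC(stat) do { (void)0; } while (0)
static inline bool demo_env_snapshot_enabled(void) { return true; }
#else
/* Normal mode: relaxed atomic increment and a runtime flag load. */
#define DEMO_STAT_INC(stat) \
    atomic_fetch_add_explicit(&g_demo_stats.stat, 1, memory_order_relaxed)
static inline bool demo_env_snapshot_enabled(void) {
    return atomic_load_explicit(&g_demo_snapshot_enabled,
                                memory_order_relaxed);
}
#endif

/* Simulated hot path: one stats increment plus one environment check. */
int demo_alloc_path(void) {
    DEMO_STAT_INC(malloc_total);
    return demo_env_snapshot_enabled() ? 1 : 0;
}

/* Read back the counter so both build modes can be compared. */
unsigned long demo_malloc_total(void) {
    return atomic_load_explicit(&g_demo_stats.malloc_total,
                                memory_order_relaxed);
}
```

Building the same file with `-DHAKMEM_BENCH_MINIMAL=1` leaves the counter untouched and lets the optimizer fold the branch in `demo_alloc_path`, which is the effect the A/B test is meant to measure.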
--- a/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md
+++ b/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md
@@ -0,0 +1,274 @@
+# Phase 18 v2: Hot Text Isolation — BENCH_MINIMAL Design
+
+## Context (Phase 18 v1 failed)
+
+Phase 18 v1 attempted layout optimization using section splitting + GC (`-ffunction-sections -fdata-sections -Wl,--gc-sections`).
+
+**Result**: CATASTROPHIC I-cache regression (+91%).
+
+**Root cause**: Section-based splitting without explicit hot symbol ordering destroyed code locality.
+
+**Lesson**: Layout tweaks are too fragile; **instruction count reduction is the direct path**.
+
+---
+
+## 0. Goal (High Impact)
+
+Reduce **instruction footprint** per allocation/free by removing non-essential code paths at compile-time:
+- Stats collection (counter increments on every operation)
+- Environment variable checks (TLS lookups)
+- Debug logging (conditional output)
+
+**Expected impact**:
+- Instruction count: -30-40% (benchmark workload)
+- I-cache misses: automatic improvement from smaller working set
+- Throughput: +10-20% (the -48% instruction gap measured in Phase 17 should close proportionally)
+
+**GO Criteria** (strict):
+- Throughput: +5% minimum, +8% preferred
+- Instructions: -15% minimum (clear, measurable reduction)
+- I-cache: should improve proportionally to instruction reduction
+
+If instructions do not drop meaningfully, abandon this phase.
+
+---
+
+## 1. Strategy: BENCH_MINIMAL Build Mode
+
+### 1.1 What to Remove
+
+**Category A: Stats collection** (highest frequency, low value for bench)
+```c
+FRONT_FASTLANE_STAT_INC(malloc_total);      // ← 1 per malloc
+FRONT_FASTLANE_STAT_INC(malloc_hit);        // ← 1 per malloc
+FRONT_FASTLANE_STAT_INC(malloc_fallback_*); // ← conditional
+free_tiny_fast_mono_stat_inc_hit();          // ← 1 per free
+```
+
+**Category B: Environment variable checks** (per-operation cost when not short-circuited)
+```c
+if (hakmem_env_snapshot_enabled()) { ... }   // TLS + branch
+if (front_fastlane_enabled()) { ... }        // TLS + branch
+```
+
+**Category C: Debug logging / verbose output** (rarely needed in bench)
+```c
+hakmem_tls_trace_alloc();   // log path
+hakmem_diag_record_fallback();
+```
+
+### 1.2 Compile-Time Gate
+
+New build mode:
+```
+HAKMEM_BENCH_MINIMAL=0/1  (default 0, opt-in)
+```
+
+Activated via Makefile knob:
+```makefile
+BENCH_MINIMAL ?= 0
+ifeq ($(BENCH_MINIMAL),1)
+  CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
+  # ... applies to both bench binaries and shared lib
+endif
+```
+
+**Important**: This is **not** a research-only knob. It's a build mode that's safe to enable (it only disables instrumentation) and stays OFF by default.
+
+### 1.3 Safeguards
+
+1. **Conditional compilation markers**:
+   - Wrap stats inside `#if !HAKMEM_BENCH_MINIMAL`
+   - Wrap ENV checks inside `#if !HAKMEM_BENCH_MINIMAL`
+   - Ensure code is still correct when markers are OFF (normal operation)
+
+2. **No behavioral changes**:
+   - Fast paths must remain identical (stats are just instrumentation)
+   - Slow paths (fallback, cold) can be simplified (less instrumentation)
+
+3. **Rollback**: Single build knob, reversible
+
+---
+
+## 2. Implementation Plan
+
+### Phase 2.1: Stats removal (Priority 1)
+
+**File**: `core/box/front_fastlane_stats_box.h`
+
+```c
+#if HAKMEM_BENCH_MINIMAL
+#define FRONT_FASTLANE_STAT_INC(stat) \
+    do { (void)0; } while(0)  // ← becomes no-op
+#else
+#define FRONT_FASTLANE_STAT_INC(stat) \
+    atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed)
+#endif
+```
+
+**File**: `core/front/malloc_tiny_fast.h`
+
+Search for all `free_tiny_fast_mono_stat_*` calls and wrap:
+```c
+#if !HAKMEM_BENCH_MINIMAL
+free_tiny_fast_mono_stat_inc_hit();
+#endif
+```
+
+**Expected savings**: ~20-30 instructions/op (counter increments + memory sync)
+
+### Phase 2.2: Environment variable check removal (Priority 1)
+
+**File**: `core/box/hakmem_env_snapshot_box.h`
+
+```c
+#if !HAKMEM_BENCH_MINIMAL
+// Normal: check the snapshot flag at runtime
+static inline bool hakmem_env_snapshot_enabled(void) {
+    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
+}
+#else
+// Bench: Always enable snapshot (one-time cost)
+static inline bool hakmem_env_snapshot_enabled(void) {
+    return 1;  // ← compile-time constant, eliminated by optimizer
+}
+#endif
+```
+
+**File**: `core/bench_profile.h`
+
+Similar pattern for profile-related checks.
+
+**Expected savings**: ~5-10 instructions/op (TLS lookups + branch prediction)
+
+### Phase 2.3: Debug logging removal (Priority 2, if needed)
+
+**File**: trace/logging functions
+
+Conditional compilation for verbose paths (already often guarded by `HAKMEM_BUILD_DEBUG`).
+
+**Expected savings**: ~0-5 instructions/op (rare paths, low impact for bench)
+
+---
+
+## 3. Risks / Mitigations
+
+### Risk A: Benchmark no longer representative
+
+If we remove instrumentation, does the bench still measure the allocator?
+
+**Mitigation**: BENCH_MINIMAL disables stats (instrumentation), not core logic.
+- Fast paths remain identical.
+- Only instrumentation overhead is removed.
+- This is similar to how production binaries disable debug tracing.
+
+### Risk B: Build regression
+
+Conditional compilation could hide bugs.
+
+**Mitigation**:
+- Ensure `BENCH_MINIMAL=0` (default) is always tested first.
+- Test both modes in CI if available.
+- Manual verification that code is syntactically correct in both modes.
+
+### Risk C: Instruction count doesn't drop
+
+If removing stats + ENV checks doesn't clearly drop instructions, it means:
+- Compiler/LTO already optimizes these away in RELEASE builds
+- The overhead is elsewhere (memory access patterns, branch misprediction)
+
+**Mitigation**: Run perf stat with `BENCH_MINIMAL=1` to confirm instruction reduction. If the reduction is smaller than 10%, reconsider strategy.
+
+---
+
+## 4. Expected Impact
+
+**Conservative** (stats only):
+- Throughput: +5-8%
+- Instructions: -20-30%
+- I-cache: improvement follows from instruction reduction
+
+**Ambitious** (stats + ENV + debug):
+- Throughput: +10-20%
+- Instructions: -30-40%
+- I-cache: -20-30% (proportional to instruction reduction)
+
+**System binary ceiling**: 85M ops/s
+- Baseline: 48M ops/s
+- Target: 50.4-55.2M ops/s (+5% to +15% over baseline)
+- Ambitious: 52.8-57.6M ops/s (+10% to +20% over baseline)
+
+---
+
+## 5. Box Theory Compliance
+
+**Box**: BenchMinimalBox
+- **Boundary**: `HAKMEM_BENCH_MINIMAL=0/1` (compile-time)
+- **Scope**: Instrumentation removal only (no algorithm changes)
+- **Rollback**: Single build knob (default OFF, backward compatible)
+- **Observability**: Perf stat before/after (instructions, I-cache, throughput)
+
+---
+
+## 6. Success Criteria
+
+### GO (Proceed):
+- Throughput: **+5% or more**
+- Instructions: **-15% or more** (smoking gun for success)
+- I-cache: should improve proportionally
+
+### NEUTRAL (keep as research box):
+- Throughput: ±3% (within noise)
+- Instructions: -5% to -15% (marginal)
+
+### NO-GO (freeze):
+- Throughput: below -2% (regression)
+- Instructions: reduction smaller than 5% (delta > -5%; failed optimization objective)
+
+---
+
+## 7. Files to Modify
+
+1. **`core/box/hot_text_attrs_box.h`** (update to add BENCH_MINIMAL support)
+   - Already created in Phase 18 v1
+
+2. **`core/box/front_fastlane_stats_box.h`** (stats conditional)
+   - Update macro to be no-op when BENCH_MINIMAL=1
+
+3. **`core/front/malloc_tiny_fast.h`** (free stats wrapper)
+   - Wrap free_tiny_fast_mono_stat_* calls
+
+4. **`core/box/hakmem_env_snapshot_box.h`** (ENV check simplification)
+   - Make hakmem_env_snapshot_enabled() constant when BENCH_MINIMAL=1
+
+5. **`core/bench_profile.h`** (profile checks)
+   - Simplify profile-related checks
+
+6. **`Makefile`**
+   - Add `BENCH_MINIMAL ?= 0` knob
+   - Apply `-DHAKMEM_BENCH_MINIMAL=1` when enabled
+
+---
+
+## 8. Notes
+
+- **Production safety**: BENCH_MINIMAL is DISABLED by default. Production builds use full instrumentation.
+- **Bench-only**: This mode is intended for benchmark binary builds, not shipped libraries.
+- **Phase 17 learning**: The system binary executed 48% fewer instructions than hakmem. If BENCH_MINIMAL achieves -30% to -40%, we should see +10-20% throughput.
+
+---
+
+## 9. Next Steps After Phase 18 v2
+
+### If GO (+5%+):
+- Promote BENCH_MINIMAL=1 as the new baseline for bench builds
+- Update CURRENT_TASK with Phase 18 v2 victory
+- Prepare Phase 19 (next optimization frontier)
+
+### If NEUTRAL:
+- Investigate why instructions didn't drop (compiler already optimized?)
+- Consider aggressive options: SIMD, allocation batching, lock-free
+
+### If NO-GO:
+- Acknowledge that Phase 17 +74% gap is primarily memory-bound (IPC=2.30)
+- Shift focus to broader architectural changes
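
The throughput figures in section 4 of the design follow from simple percentage arithmetic on the measured baseline. The sketch below shows that arithmetic in C, assuming the 48M ops/s baseline and 85M ops/s system ceiling quoted there. Both function names are hypothetical helpers introduced for illustration and do not exist in the repository.

```c
#include <assert.h>

/* Throughput after a relative gain of gain_pct percent.
 * Units are whatever the caller uses (here: millions of ops/s). */
double project_throughput(double baseline_mops, double gain_pct) {
    assert(baseline_mops > 0.0);
    return baseline_mops * (1.0 + gain_pct / 100.0);
}

/* Fraction (0..1) of the baseline-to-ceiling gap that a projected
 * throughput would close. */
double gap_closed_fraction(double baseline_mops, double ceiling_mops,
                           double projected_mops) {
    assert(ceiling_mops > baseline_mops);
    return (projected_mops - baseline_mops) / (ceiling_mops - baseline_mops);
}
```

Calling `project_throughput(48.0, 5.0)` and `project_throughput(48.0, 20.0)` gives the range between the GO threshold and the ambitious case, and `gap_closed_fraction(48.0, 85.0, ...)` puts either value in proportion to the remaining gap to the system allocator.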
--- a/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md
@@ -0,0 +1,315 @@
+# Phase 18 v2: Hot Text Isolation — BENCH_MINIMAL Implementation Instructions
+
+## Status
+
+- Phase 18 v1 (layout + sections) → NO-GO (I-cache regression)
+- Phase 18 v2 (instruction removal) → Next phase
+- Strategy: Remove stats/ENV overhead at compile-time
+- Expected: +10-20% throughput, -30-40% instructions
+
+Ref: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md`
+
+---
+
+## 0. Goal / Success Criteria
+
+**GO threshold** (STRICT):
+- Throughput: **+5% minimum** (+8% preferred)
+- Instructions: **-15% minimum** (clear proof of concept)
+- I-cache: automatic improvement from smaller footprint
+
+**NEUTRAL**:
+- Throughput: ±3% (marginal)
+- Instructions: -5% to -15% (incomplete optimization)
+
+**NO-GO**:
+- Throughput: below -2% (regression)
+- Instructions: reduction smaller than 5% (delta > -5%; failed to reduce overhead)
+
+If instructions do not drop by at least 15%, abandon this phase (allocator is not the bottleneck).
+
+---
+
+## 1. Implementation Steps
+
+### 1.1 Add Makefile knob
+
+**File**: `Makefile`
+
+After Phase 18 v1 section (around line 140), add:
+
+```makefile
+# Phase 18 v2: BENCH_MINIMAL (remove instrumentation for benchmark builds)
+BENCH_MINIMAL ?= 0
+ifeq ($(BENCH_MINIMAL),1)
+  CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
+  CFLAGS_SHARED += -DHAKMEM_BENCH_MINIMAL=1
+  # Note: Both bench and shared lib will disable instrumentation
+  # Mainly impacts bench_* binaries (where BENCH_MINIMAL is intentionally enabled)
+endif
+```
+
+**Location**: After `HOT_TEXT_GC_SECTIONS` section (line ~145)
+
+### 1.2 Update front_fastlane_stats_box.h
+
+**File**: `core/box/front_fastlane_stats_box.h`
+
+Find the `FRONT_FASTLANE_STAT_INC` macro (currently defined as atomic increment).
+
+Replace entire macro section:
+
+```c
+// Before:
+#define FRONT_FASTLANE_STAT_INC(stat) \
+    do { \
+        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
+            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
+        } \
+    } while(0)
+
+// After:
+#if HAKMEM_BENCH_MINIMAL
+#define FRONT_FASTLANE_STAT_INC(stat) do { (void)0; } while(0)
+#else
+#define FRONT_FASTLANE_STAT_INC(stat) \
+    do { \
+        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
+            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
+        } \
+    } while(0)
+#endif
+```
+
+**Rationale**: Stats collection becomes no-op in BENCH_MINIMAL, compiled away entirely.
+
+### 1.3 Update malloc_tiny_fast.h free stats
+
+**File**: `core/front/malloc_tiny_fast.h`
+
+Search for `free_tiny_fast_mono_stat_inc_*` calls (probably in free_tiny_fast function).
+
+Wrap each call:
+
+```c
+// Before:
+free_tiny_fast_mono_stat_inc_hit();
+
+// After:
+#if !HAKMEM_BENCH_MINIMAL
+free_tiny_fast_mono_stat_inc_hit();
+#endif
+```
+
+Search pattern: `free_tiny_fast_mono_stat_` (should find ~5-10 calls)
+
+**Rationale**: Similar to malloc, free stats become optional in BENCH_MINIMAL.
+
+### 1.4 Update hakmem_env_snapshot_box.h
+
+**File**: `core/box/hakmem_env_snapshot_box.h`
+
+Find function `hakmem_env_snapshot_enabled()`.
+
+Replace implementation:
+
+```c
+// Before:
+static inline bool hakmem_env_snapshot_enabled(void) {
+    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
+}
+
+// After:
+#if HAKMEM_BENCH_MINIMAL
+// In bench mode, snapshot is always enabled (one-time cost, compile-away benefit)
+static inline bool hakmem_env_snapshot_enabled(void) {
+    return 1;
+}
+#else
+// Normal mode: runtime check
+static inline bool hakmem_env_snapshot_enabled(void) {
+    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
+}
+#endif
+```
+
+**Rationale**: ENV checks become compile-time constants in BENCH_MINIMAL, enabling better optimization.
+
+### 1.5 (Optional) Update debug logging
+
+**File**: any file with `hakmem_tls_trace_*` or `hakmem_diag_*` calls
+
+Only if needed (lower priority). Wrap verbose logging:
+
+```c
+#if !HAKMEM_BENCH_MINIMAL
+hakmem_tls_trace_alloc(ptr, size);
+#endif
+```
+
+---
+
+## 2. Build & Verify
+
+### 2.1 Baseline build (BENCH_MINIMAL=0)
+
+```sh
+make clean
+make -j bench_random_mixed_hakmem bench_random_mixed_system
+ls -lh bench_random_mixed_hakmem
+```
+
+Expected: ~653K (same as before)
+
+### 2.2 Optimized build (BENCH_MINIMAL=1)
+
+```sh
+make clean
+make -j BENCH_MINIMAL=1 bench_random_mixed_hakmem bench_random_mixed_system
+ls -lh bench_random_mixed_hakmem
+```
+
+Expected: similar size, possibly slightly smaller (instrumentation is compiled out by the preprocessor)
+
+**Note**: Binary size may not change noticeably (removal happens at compile time, not through linker section GC).
+
+### 2.3 Verify compilation
+
+Both builds should complete without errors.
+
+Check for syntax errors in conditional code:
+```sh
+# If you see errors like "undeclared identifier", it means conditional guards are wrong
+```
+
+---
+
+## 3. A/B Test Execution
+
+### 3.1 Baseline run (BENCH_MINIMAL=0)
+
+```sh
+make clean
+make -j bench_random_mixed_hakmem bench_random_mixed_system
+scripts/run_mixed_10_cleanenv.sh
+```
+
+Record:
+- Mean throughput
+- Stdev
+- Min/Max
+
+Run perf stat:
+
+```sh
+perf stat -e cycles,instructions,branches,branch-misses,cache-misses,L1-icache-load-misses,dTLB-load-misses,minor-faults -- \
+  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
+  ./bench_random_mixed_hakmem 200000000 400 1
+```
+
+Record:
+- cycles
+- instructions
+- L1-icache-load-misses
+- branch-misses %
+
+### 3.2 Optimized run (BENCH_MINIMAL=1)
+
+```sh
+make clean
+make -j BENCH_MINIMAL=1 bench_random_mixed_hakmem bench_random_mixed_system
+scripts/run_mixed_10_cleanenv.sh
+```
+
+Same recording as baseline.
+
+Perf stat (same command).
+
+### 3.3 Optional: system ceiling check
+
+```sh
+./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput"
+```
+
+(Already measured in Phase 17: ~85M ops/s)
+
+---
+
+## 4. Analysis & GO/NO-GO Decision
+
+### Compute deltas
+
+```
+Delta_throughput = (optimized_mean - baseline_mean) / baseline_mean * 100
+Delta_instructions = (optimized_instructions - baseline_instructions) / baseline_instructions * 100
+Delta_icache = (optimized_icache - baseline_icache) / baseline_icache * 100
+```
+
+### Decision logic
+
+**GO** (if ALL true):
+- Delta_throughput ≥ +5%
+- Delta_instructions ≤ -15%
+- Delta_icache ≤ -10% (should follow instruction reduction)
+
+**NEUTRAL** (if some criteria marginal):
+- Delta_throughput ∈ ±3%
+- Delta_instructions ∈ [-15%, -5%]
+- Variance increase < 50%
+
+**NO-GO** (if any critical missed):
+- Delta_throughput < -2%
+- Delta_instructions > -5% (failed to reduce overhead)
+- Variance increase > 100%
+
+---
+
+## 5. Reporting (required artifacts)
+
+Create file: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_AB_TEST_RESULTS.md`
+
+Include:
+
+```markdown
+# Phase 18 v2: BENCH_MINIMAL — A/B Test Results
+
+| Metric | Baseline | Optimized | Delta |
+|--------|----------|-----------|-------|
+| Throughput (mean) | X.XXM | X.XXM | +/-X.XX% |
+| Throughput (σ) | X.XXM | X.XXM | % |
+| Instructions | X.XB | X.XB | +/-X.X% |
+| I-cache misses | XXXK | XXXK | +/-X.X% |
+| Cycles | X.XB | X.XB | +/-X.X% |
+
+## Verdict
+
+GO / NEUTRAL / NO-GO with reasoning
+```
+
+Update: `CURRENT_TASK.md` (Phase 18 v2 status + next)
+
+---
+
+## 6. Important Notes
+
+- Ensure `BENCH_MINIMAL=0` (default) is tested first
+- Both BENCH_MINIMAL=0 and BENCH_MINIMAL=1 must compile successfully
+- If the instruction reduction is smaller than 15%, the optimization is incomplete (check whether the compiler was already eliminating this code)
+- This is NOT a research-only knob; it's a valid bench build mode (safe to enable in CI)
+
+---
+
+## 7. If GO: Next Steps
+
+1. **Promote BENCH_MINIMAL=1** as default for bench builds
+2. **Prepare Phase 19** (next optimization frontier)
+   - Possible targets: SIMD prefetch, lock-free structures, allocation batching
+3. **Document learnings** about allocator overhead vs memory latency
+
+---
+
+## 8. If NO-GO: Lessons
+
+- Allocator is memory-bound (IPC=2.30 constant across phases)
+- Instruction count reduction doesn't always yield throughput gains if memory latency dominates
+- Need architectural changes (cache-friendly layout, batching) rather than micro-optimizations
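
The delta formulas and decision logic in section 4 of the instructions can be sketched as a small C classifier. This is one possible reading of the rules, with assumptions labeled: NO-GO is checked first because any single critical miss triggers it, and "variance increase" is interpreted here as the percent change in throughput stdev. The `run_t` layout, `pct_delta`, and `classify` are hypothetical names, not part of the repository.

```c
#include <assert.h>

typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_NO_GO } verdict_t;

/* Measurements for one build configuration (hypothetical layout). */
typedef struct {
    double throughput;     /* mean ops/s from the 10-run script */
    double stdev;          /* stdev of throughput across runs */
    double instructions;   /* perf stat: instructions */
    double icache_misses;  /* perf stat: L1 I-cache load misses */
} run_t;

/* Percent change of optimized relative to baseline
 * (negative means a reduction). */
double pct_delta(double baseline, double optimized) {
    assert(baseline > 0.0);
    return (optimized - baseline) / baseline * 100.0;
}

verdict_t classify(const run_t *base, const run_t *opt) {
    double d_thr = pct_delta(base->throughput, opt->throughput);
    double d_ins = pct_delta(base->instructions, opt->instructions);
    double d_ic  = pct_delta(base->icache_misses, opt->icache_misses);
    double d_var = pct_delta(base->stdev, opt->stdev);

    /* NO-GO if any critical criterion is missed. */
    if (d_thr < -2.0 || d_ins > -5.0 || d_var > 100.0)
        return VERDICT_NO_GO;

    /* GO only if all three thresholds are met. */
    if (d_thr >= 5.0 && d_ins <= -15.0 && d_ic <= -10.0)
        return VERDICT_GO;

    /* Everything in between is treated as marginal. */
    return VERDICT_NEUTRAL;
}
```

Encoding the thresholds this way keeps the sign convention explicit: an instruction delta of -20% passes the `<= -15.0` test, while -3% fails the `> -5.0` NO-GO guard.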