Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added

## Phase 16 v1: Front FastLane Alloc LEGACY Direct (NEUTRAL, +0.62%)

Target: Reduce alloc-side fixed costs by adding a LEGACY direct path to the FastLane entry, mirroring the winning free-side pattern from Phase 9/10.

Result: +0.62% on Mixed (below the +1.0% GO threshold) → NEUTRAL; frozen as a research box (default OFF).

Critical issue: The initial implementation crashed (segfault) for C4-C7.
Root cause: unified_cache_refill() incompatibility.
Safety fix: Limited to C0-C3 only (matching the existing dualhot pattern).

Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)

ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)

Verdict: Research box frozen. The Phase 14-16 plateau confirms that dispatch/routing optimization ROI is exhausted after the Phase 6 FastLane collapse.

---

## Phase 17: FORCE_LIBC Gap Validation (Case B confirmed)

Purpose: Validate the "system malloc is faster" observation using same-binary A/B testing, to separate the allocator logic difference from the binary layout penalty.

Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)

Result: **Case B confirmed**: the allocator difference is negligible and the layout penalty dominates.

Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem

Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)

Root cause: The 30x binary size difference causes I-cache thrashing. Code bloat outweighs algorithmic efficiency here.

Conclusion: Phase 12's "system malloc 1.6x faster" was real, but misattributed. The gap is layout/I-cache, NOT allocator algorithm.

Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)

Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm optimization)

---

## Phase 18: Hot Text Isolation (design added)

Purpose: Reduce I-cache misses and instruction footprint via layout control (binary optimization, not allocator algorithm changes).

Strategy (v1 → v2 progression):

v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates the approach)

v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (the main bet)
- Method: Conditional compilation removes stats/ENV/debug from the hot path
- Expected: Instruction count -30-40% → significant I-cache improvement

Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)

Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)

Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 is NEUTRAL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 05:25:47 +09:00
parent 87fa27518c
commit f8e7cf05b4
14 changed files with 1292 additions and 5 deletions
--- a/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_AB_TEST_RESULTS.md
+++ b/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_AB_TEST_RESULTS.md
@@ -0,0 +1,208 @@
+# Phase 16: Front FastLane Alloc LEGACY Direct v1 — A/B Test Results
+
+**Date**: 2025-12-15
+**Status**: NEUTRAL (+0.62%)
+
+---
+
+## Executive Summary
+
+Phase 16 v1 attempted to reduce alloc-side fixed costs by adding a LEGACY direct path to the FastLane entry point, bypassing route/policy overhead for LEGACY allocations. The optimization mirrored the winning free-side pattern (Phase 9/10).
+
+**Result**: +0.62% on Mixed (NEUTRAL), below +1.0% GO threshold.
+
+**Critical Issue Discovered**: Initial implementation caused segmentation fault for classes C4-C7. Root cause: `unified_cache_refill()` incompatibility. **Safety fix applied**: Limited optimization to C0-C3 only (matching existing dualhot pattern).
+
+**Verdict**: NEUTRAL — freeze as research box (default OFF).
+
+---
+
+## A/B Test Results
+
+### Mixed (16-1024B, 10-run clean env)
+
+**Baseline** (ENV=0):
+- Mean: 47,510,791 ops/s
+- Median: 47,606,360 ops/s
+- Runs: 48151673, 47596179, 47735208, 47903499, 46674576, 47977105, 47236265, 47481537, 46735322, 47616542
+
+**Optimized** (ENV=1):
+- Mean: 47,803,890 ops/s
+- Median: 47,901,551 ops/s
+- Runs: 47401229, 47908200, 48158776, 48126240, 47477867, 47894902, 47644796, 48191059, 47930512, 47305320
+
+**Delta**:
+- Mean: **+0.62%**
+- Median: **+0.62%**
+
+**Verdict**: NEUTRAL (below +1.0% GO threshold)
+
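The mean, median, and delta figures above can be reproduced from the raw run lists. The following is a minimal sketch, not the project's actual tooling: the helper names (`ab_mean`, `ab_median`, `ab_delta_pct`, `ab_verdict`) are hypothetical, and the symmetric ±threshold rule is taken from the Phase 16 design document (GO at +1.0% or better, NO-GO at -1.0% or worse, NEUTRAL in between).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort() over doubles. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n throughput samples (ops/s). */
double ab_mean(const double *runs, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += runs[i];
    return sum / (double)n;
}

/* Median of n samples. Note: sorts the caller's array in place. */
double ab_median(double *runs, size_t n) {
    assert(n > 0);
    qsort(runs, n, sizeof runs[0], cmp_double);
    return (n % 2) ? runs[n / 2] : 0.5 * (runs[n / 2 - 1] + runs[n / 2]);
}

/* Relative change of optimized over baseline, in percent. */
double ab_delta_pct(double baseline, double optimized) {
    return (optimized / baseline - 1.0) * 100.0;
}

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_verdict_t;

/* Symmetric rule: >= +threshold is GO, <= -threshold is NO-GO, else NEUTRAL. */
ab_verdict_t ab_verdict(double delta_pct, double threshold_pct) {
    if (delta_pct >= threshold_pct) return AB_GO;
    if (delta_pct <= -threshold_pct) return AB_NO_GO;
    return AB_NEUTRAL;
}
```

Feeding the ten baseline and ten optimized Mixed runs through `ab_mean` and `ab_delta_pct` gives a delta of roughly +0.62%, which `ab_verdict(delta, 1.0)` classifies as `AB_NEUTRAL`.
+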
+---
+
+### C6-heavy Regression Check (5-run)
+
+**Baseline** (ENV=0):
+- Mean: 21,134,240 ops/s
+- Median: 21,186,983 ops/s
+- Runs: 21186983, 21327420, 20807950, 21112023, 21236823
+
+**Optimized** (ENV=1):
+- Mean: 21,147,197 ops/s
+- Median: 21,139,301 ops/s
+- Runs: 21358869, 21209299, 20992077, 21139301, 21036438
+
+**Delta**:
+- Mean: **+0.06%**
+- Median: **-0.23%**
+
+**Verdict**: PASS (no significant regression)
+
+---
+
+## Implementation Summary
+
+### Files Modified
+
+1. **`core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c}`** (new)
+   - L0 ENV gate for LEGACY direct feature
+   - ENV: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0, opt-in)
+   - API: `front_fastlane_alloc_legacy_direct_enabled()`, `front_fastlane_alloc_legacy_direct_env_refresh_from_env()`
+
+2. **`core/box/front_fastlane_box.h`**
+   - Added LEGACY direct early-exit in `front_fastlane_try_malloc()` (lines 93-119)
+   - **SAFETY CONSTRAINT**: Limited to C0-C3 only due to refill incompatibility for C4-C7
+   - Direct conditions: ENV enabled + static route ready + LEGACY route confirmed
+   - Direct path: `tiny_hot_alloc_fast()` → `tiny_cold_refill_and_alloc()` → fallback to `malloc_tiny_fast_for_class()`
+
+3. **`core/bench_profile.h`**
+   - Added `front_fastlane_alloc_legacy_direct_env_refresh_from_env()` to refresh sync group
+
+4. **`Makefile`**
+   - Added `front_fastlane_alloc_legacy_direct_env_box.o` to OBJS_BASE, SHARED_OBJS, TINY_BENCH_OBJS_BASE
+
+---
+
+## Critical Bug & Fix
+
+### Issue: Segmentation Fault (Exit Code 139)
+
+**Symptom**: Benchmark crashed with ENV=1 during larger workloads (20M iterations).
+
+**Root Cause**:
+- Crash occurred in `unified_cache_refill()` → `tiny_next_read()` (intrusive pointer read)
+- Initial implementation attempted to use direct path for ALL classes (C0-C7)
+- Classes C4-C7 triggered incompatibility with `unified_cache_refill()` logic
+- Existing dualhot code (Phase ALLOC-TINY-FAST-DUALHOT-2) only operates on C0-C3
+
+**Backtrace**:
+```
+#0  0x0000555555564d89 in tiny_next_read.lto_priv.5.lto_priv ()
+#1  0x00007ffff7b00318 in ?? ()
+#2  0x0000555555557f29 in unified_cache_refill ()
+```
+
+**Fix Applied**:
+- Limited LEGACY direct path to C0-C3 only (line 96 of front_fastlane_box.h)
+- Added safety comment explaining constraint
+- Matches existing proven pattern from dualhot implementation
+
+**Code Change**:
+```c
+// Before (CRASHED):
+if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled(), 0)) {
+
+// After (SAFE):
+if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled() && (unsigned)class_idx <= 3u, 0)) {
+```
+
+---
+
+## Analysis
+
+### Why +0.62% is Below Threshold
+
+1. **Limited Scope**: Optimization only applies to C0-C3 due to safety constraint
+   - C4-C7 continue using full route/policy path
+   - Mixed benchmark uses all size classes (16-1024B = C0-C5 primarily)
+
+2. **Existing Optimizations**: dualhot (Phase ALLOC-TINY-FAST-DUALHOT-2) already optimizes C0-C3
+   - LEGACY direct overlaps with dualhot coverage
+   - Marginal benefit when dualhot is disabled, but default config has dualhot enabled in some profiles
+
+3. **Route Overhead Not Dominant**: After Phase 6 FastLane collapse, route/policy fixed costs are already minimized
+   - Phase 14-15 (cache shape) also showed NEUTRAL results
+   - Suggests current bottleneck is not in dispatch layers
+
+### Root Cause of Limited Benefit
+
+The optimization targets the same problem space as existing dualhot but with different enablement conditions:
+- **dualhot**: Always enabled for C0-C3, no route check
+- **LEGACY direct**: ENV-gated, requires static route confirmation
+
+When both are active, LEGACY direct provides minimal incremental value.
+
+---
+
+## Recommendations
+
+1. **Freeze as Research Box** (default OFF)
+   - ENV remains opt-in: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0`
+   - No preset promotion
+   - Keep code for potential future use if dualhot is disabled
+
+2. **Investigate C4-C7 Refill Issue**
+   - Root cause: Why does `unified_cache_refill()` fail for C4-C7 in this path?
+   - Possible causes:
+     - LIFO mode interaction (Phase 15)
+     - Cache state assumptions in refill logic
+     - Intrusive pointer corruption
+   - **Action**: Debug under controlled conditions before expanding to C4-C7
+
+3. **Shift Focus Away from Dispatch Layers**
+   - Phase 14, 15, 16 all showed NEUTRAL results
+   - Phase 6 FastLane already collapsed major dispatch overhead
+   - **Next direction**: Investigate cache miss costs, memory layout, or backend allocation
+
+4. **Consider Dualhot/LEGACY Direct Consolidation**
+   - If LEGACY direct is kept, evaluate merging with dualhot logic
+   - Avoid code duplication and overlap
+
+---
+
+## Comparison with Recent Phases
+
+| Phase | Target | Delta (Mixed) | Verdict |
+|-------|--------|---------------|---------|
+| Phase 10 | Free LEGACY direct | +1.89% | **GO** |
+| Phase 13 v1 | C7 preserve header | -0.40% | NEUTRAL (freeze) |
+| Phase 14 v1 | tcache intrusive | +0.20% | NEUTRAL (freeze) |
+| Phase 14 v2 | tcache hot integration | +0.08% | NEUTRAL (freeze) |
+| Phase 15 v1 | UnifiedCache FIFO→LIFO | -0.70% | NEUTRAL (freeze) |
+| **Phase 16 v1** | **Alloc LEGACY direct** | **+0.62%** | **NEUTRAL (freeze)** |
+
+**Pattern**: Post-Phase-10 optimizations consistently show NEUTRAL results. Major gains came from earlier phases (FastLane collapse +11.13%, Free DeDup +5.18%, etc.). Current bottleneck likely not in dispatch/routing layers.
+
+---
+
+## Files Changed
+
+- `core/box/front_fastlane_alloc_legacy_direct_env_box.h` (new)
+- `core/box/front_fastlane_alloc_legacy_direct_env_box.c` (new)
+- `core/box/front_fastlane_box.h` (modified)
+- `core/bench_profile.h` (modified)
+- `Makefile` (modified)
+
+---
+
+## ENV Variables
+
+- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default: 0, opt-in)
+
+---
+
+## Next Steps
+
+1. **Freeze Phase 16** with default OFF
+2. **Commit with verdict**: "Phase 16 v1: NEUTRAL (+0.62%), research box"
+3. **Update CURRENT_TASK.md** with Phase 16 summary
+4. **Shift optimization focus** based on profiling/analysis (away from dispatch layers)
--- a/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md
+++ b/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md
@@ -0,0 +1,133 @@
+# Phase 16: Front FastLane Alloc LEGACY Direct v1 (turning the alloc-side “second-stage hot path” into a monolithic early-exit)
+
+**Date**: 2025-12-15  
+**Status**: DESIGN (Phase 16 kickoff)
+
+---
+
+## 0. Executive Summary (one page)
+
+The Phase 14-15 line of work (pointer-chase / cache-shape) came out **NEUTRAL** and is frozen.
+The next step is not “cache shape” but a return to **cutting the fixed costs of instruction count and branching**.
+
+Since Phase 6 consolidated it into FastLane, the current `malloc()` almost always takes this path:
+
+```
+malloc() → front_fastlane_try_malloc(size) → malloc_tiny_fast_for_class(size, class_idx)
+```
+
+However, `malloc_tiny_fast_for_class()` pays fixed costs **even on the LEGACY route**:
+the ULTRA/C7 early branches, route_kind determination, ENV cfg reads, dispatch shape, and so on.
+The free side (Phase 9/10/6-2) won by moving to a “monolithic early-exit”,
+so applying the same winning pattern to the alloc side, **sending LEGACY straight through at the FastLane entry**, should have high ROI.
+
+Phase 16 keeps Box Theory intact and adds a single “LEGACY direct” path to FastLane alloc:
+
+- **On hit**: `tiny_hot_alloc_fast(class_idx)` → return immediately (route/policy is never touched)
+- **On miss**: `tiny_cold_refill_and_alloc(class_idx)` (existing cold boundary)
+- **When uncertain**: fall back to the existing `malloc_tiny_fast_for_class()` (single boundary)
+
+---
+
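+## 1. Current State (why)
+
+- Phase 6 (Front FastLane) collapsed wrapper→gate→policy→route and cut entry fixed costs substantially.
+- As a result, the remaining alloc-side cost tends to concentrate in **branching, ENV reads, and route determination inside `malloc_tiny_fast_for_class()`**.
+- Phases 14/15 changed the “shape of UnifiedCache” and Mixed did not move → **cache shape is not currently dominant**.
+
+Phase 16 therefore cuts **route/policy fixed costs** without changing cache internals.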
+
+---
+
+## 2. Proposal (Phase 16 v1)
+
+### 2.1 Boxes to Add (Box Theory)
+
+```
+L0: front_fastlane_alloc_legacy_direct_env_box   (ENV gate / rollback)
+  ↓
+L1: front_fastlane_try_malloc()                  (LEGACY direct early-exit)
+  ↓
+L2: malloc_tiny_fast_for_class()                 (existing: route/policy/ULTRA/MID/V7)
+  ↓
+L3: tiny_front_hot_box / tiny_front_cold_box     (existing: unified cache / refill)
+```
+
+**There is exactly one boundary**:
+- “direct conditions not met / direct path failed” → drop to `malloc_tiny_fast_for_class()`.
+
+### 2.2 ENV
+
+- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0, opt-in)
+
+Start as opt-in for A/B.
+If GO, consider preset promotion (staged, starting with MIXED only).
+
+### 2.3 Direct Conditions (Fail-Fast)
+
+The alloc direct path is restricted to cases where the route can be **asserted with certainty**:
+
+Required conditions (recommended):
+- FastLane is enabled (existing)
+- `size <= tiny_get_max_size()` (existing)
+- `class_idx` is valid (existing)
+- The class is included in `front_fastlane_class_mask` (existing)
+- `tiny_static_route_ready_fast()` is true (do not use direct when it is false, e.g. due to Learner interlock)
+- `tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY` (LEGACY asserted)
+
+Then:
+- `tiny_hot_alloc_fast(class_idx)` → return on hit
+- On miss, call `tiny_cold_refill_and_alloc(class_idx)` (existing cold boundary)
+- Only if that also returns NULL, fall back to `malloc_tiny_fast_for_class()` (safety first)
+
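The direct conditions and the hit / miss / fallback order can be modelled in isolation. This is a sketch under stated assumptions, not the code in `front_fastlane_box.h`: every `demo_*` name is hypothetical, the real helpers are replaced by plain fields in a state struct, and the `class_idx <= 3` check reflects the C0-C3 restriction reported in the Phase 16 A/B results rather than anything in this design.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real route kinds and outcomes. */
typedef enum { DEMO_ROUTE_LEGACY, DEMO_ROUTE_OTHER } demo_route_t;
typedef enum { DEMO_PATH_HOT, DEMO_PATH_COLD, DEMO_PATH_FALLBACK } demo_path_t;

typedef struct {
    int env_enabled;          /* models front_fastlane_alloc_legacy_direct_enabled() */
    int route_ready;          /* models tiny_static_route_ready_fast() */
    demo_route_t route_kind;  /* models tiny_static_route_get_kind_fast(class_idx) */
    void *hot_result;         /* models tiny_hot_alloc_fast(); NULL means miss */
    void *cold_result;        /* models tiny_cold_refill_and_alloc(); NULL means failure */
    void *fallback_result;    /* models malloc_tiny_fast_for_class() */
} demo_state_t;

/* Returns the pointer that would be handed out and reports which path served it. */
void *demo_try_malloc(const demo_state_t *s, int class_idx, demo_path_t *path_out) {
    int direct_ok = s->env_enabled
                 && (unsigned)class_idx <= 3u          /* C0-C3 only (post-crash restriction) */
                 && s->route_ready
                 && s->route_kind == DEMO_ROUTE_LEGACY;
    if (direct_ok) {
        if (s->hot_result != NULL)  { *path_out = DEMO_PATH_HOT;  return s->hot_result; }
        if (s->cold_result != NULL) { *path_out = DEMO_PATH_COLD; return s->cold_result; }
    }
    /* Single boundary: anything not served directly goes to the existing path. */
    *path_out = DEMO_PATH_FALLBACK;
    return s->fallback_result;
}
```

The direct check is evaluated once; every failure mode (gate off, route not ready, non-LEGACY route, out-of-range class, or both direct attempts returning NULL) funnels into the same fallback call.
+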
+---
+
+## 3. Observability (minimal)
+
+Always-on logging in Release builds is prohibited.
+If needed, expose the following only under `HAKMEM_DEBUG_COUNTERS=1`:
+
+- `front_fastlane_alloc_legacy_direct_hit`
+- `front_fastlane_alloc_legacy_direct_miss`
+- `front_fastlane_alloc_legacy_direct_fallback`
+
+(Confine atomics to the stats box; do not place atomics on the hot side.)
+
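One way to keep counters off the Release hot path is to hide them behind macros that expand to nothing unless the debug switch is set. The sketch below assumes `HAKMEM_DEBUG_COUNTERS` is a compile-time macro (the text does not say whether it is a build flag or a runtime variable), and all `demo_*` / `DEMO_*` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef HAKMEM_DEBUG_COUNTERS
#define HAKMEM_DEBUG_COUNTERS 0   /* assumption: treated as a compile-time switch here */
#endif

#if HAKMEM_DEBUG_COUNTERS
#include <stdatomic.h>
/* "Stats box": the only place that owns atomics. */
static _Atomic uint64_t demo_direct_hit;
static _Atomic uint64_t demo_direct_miss;
#define DEMO_STAT_INC(c)  ((void)atomic_fetch_add_explicit(&(c), 1, memory_order_relaxed))
#define DEMO_STAT_READ(c) ((uint64_t)atomic_load_explicit(&(c), memory_order_relaxed))
#else
/* Release build: the macros drop their argument, so no atomics reach the hot path. */
#define DEMO_STAT_INC(c)  ((void)0)
#define DEMO_STAT_READ(c) ((uint64_t)0)
#endif

/* Hypothetical hot-path stand-in: records hit/miss and returns the cached pointer. */
void *demo_direct_alloc(void *cached) {
    if (cached != NULL) {
        DEMO_STAT_INC(demo_direct_hit);
        return cached;
    }
    DEMO_STAT_INC(demo_direct_miss);
    return NULL;
}

uint64_t demo_direct_hits(void)   { return DEMO_STAT_READ(demo_direct_hit); }
uint64_t demo_direct_misses(void) { return DEMO_STAT_READ(demo_direct_miss); }
```

Building with `-DHAKMEM_DEBUG_COUNTERS=1` turns the counters on; without it, the readers always return 0 and the increments compile away.
+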
+---
+
+## 4. A/B Measurement (same binary)
+
+GO/NO-GO (Mixed 10-run, clean env):
+- GO: mean +1.0% or better
+- NO-GO: mean -1.0% or worse (immediate rollback / freeze)
+- NEUTRAL: within ±1.0% (research box freeze)
+
+Targets:
+- `scripts/run_mixed_10_cleanenv.sh`
+- Plus a C6-heavy 5-run (confirm no regression)
+
+---
+
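+## 5. Risks and Mitigations
+
+### Risk 1: The “LEGACY asserted” assumption breaks and allocations are misrouted
+
+Mitigation:
+- Make `tiny_static_route_ready_fast()` a required condition (expected to be false while the Learner is active)
+- Always check route_kind (do not rely on the mask alone)
+- On failure, always fall back to the existing path
+
+### Risk 2: The direct path covers too little to show an effect
+
+Mitigation:
+- First make the “LEGACY ratio” on Mixed visible via stats (debug counters only)
+- If it does not help, freeze (same handling as Phase 14/15)
+
+### Risk 3: The added branch backfires (a repeat of Phase 11)
+
+Mitigation:
+- Evaluate the direct check **exactly once, inside FastLane** (do not add call-site helpers)
+- When the direct check is false, call the existing `malloc_tiny_fast_for_class()` unchanged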
+
--- a/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md
@@ -0,0 +1,124 @@
+# Phase 16: Front FastLane Alloc LEGACY Direct v1 — Next Instructions
+
+Design: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md`
+
+---
+
+## 0. Status / Why now
+
+- Phase 14-15 (tcache / FIFO→LIFO) were **NEUTRAL** → frozen (default OFF)
+- The next target is not “cache shape” but **reducing alloc-side route/policy fixed costs**
+- On the free side, the “monolithic early-exit + dedup” of Phase 9/10/6-2 was the winning pattern → apply the same pattern to the alloc side
+
+---
+
+## 1. GO Criteria
+
+Mixed 10-run (clean env):
+- **GO**: mean +1.0% or better
+- **NO-GO**: mean -1.0% or worse (immediate rollback / freeze)
+- **NEUTRAL**: within ±1.0% → research box freeze
+
+Additional gate (required):
+- In an environment where `tiny_static_route_ready_fast()` is true, LEGACY direct is actually being taken (ideally confirmed via debug counters)
+
+---
+
+## 2. Box Diagram (single boundary)
+
+```
+L0: front_fastlane_alloc_legacy_direct_env_box   (ENV gate / refresh)
+  ↓
+L1: front_fastlane_box.h                         (early-exit inside try_malloc)
+  ↓
+L2: malloc_tiny_fast_for_class()                 (existing path)
+```
+
+The boundary is fixed at exactly one point: **“direct conditions not met / direct returned NULL → malloc_tiny_fast_for_class”**.
+
+---
+
+## 3. Patch Order (small, stacked)
+
+### Patch 1: L0 ENV gate box (reversible)
+
+New:
+- `core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c}`
+
+ENV:
+- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0)
+
+API (example):
+- `front_fastlane_alloc_legacy_direct_enabled() -> int`
+- `front_fastlane_alloc_legacy_direct_env_refresh_from_env()`
+
+Requirements:
+- No `getenv()` on the hot path (cached)
+- Provide a refresh function so the gate can be re-synced after `bench_profile` calls `putenv()` (Phase 8/13/14/15 pattern)
+
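A cached ENV gate with an explicit refresh hook might look like the following. This is only a sketch of the pattern the requirements describe: the `demo_gate_*` names are hypothetical, the parsing rule (only a leading `1` enables the gate) is an assumption, and the cache is deliberately unsynchronized, which assumes it is first populated before worker threads start.

```c
#include <stdlib.h>

/* Cached gate value: -1 = not read yet, 0 = off, 1 = on. */
static int demo_gate_cached = -1;

/* Parse an ENV value: a leading '1' enables; unset or anything else is OFF (default 0). */
int demo_gate_parse(const char *value) {
    return (value != NULL && value[0] == '1') ? 1 : 0;
}

/* Re-read the variable; meant to be called after the bench profile has run putenv(). */
void demo_gate_refresh_from_env(void) {
    demo_gate_cached =
        demo_gate_parse(getenv("HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT"));
}

/* Hot-path query: one load and compare once warm; getenv() runs only on first use. */
int demo_gate_enabled(void) {
    if (demo_gate_cached < 0) demo_gate_refresh_from_env();
    return demo_gate_cached;
}
```

Without the refresh hook, a value set by `putenv()` after the first query would never be observed, which is the ENV leakage the sync step in Patch 3 is meant to prevent.
+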
+### Patch 2: Integration point (exactly one, in FastLane alloc)
+
+Target:
+- `core/box/front_fastlane_box.h`
+
+Change:
+- After the class mask check in `front_fastlane_try_malloc()`, add the following “direct path”
+
+Direct conditions (Fail-Fast):
+1. `front_fastlane_alloc_legacy_direct_enabled() == 1`
+2. `tiny_static_route_ready_fast()` is true (direct is forbidden when it is false, e.g. due to Learner interlock)
+3. `tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY` (LEGACY asserted)
+
+Direct body:
+- `void* p = tiny_hot_alloc_fast(class_idx);`
+- `if (p) return p;`
+- `p = tiny_cold_refill_and_alloc(class_idx);`
+- `if (p) return p;`
+- Only on failure, fall back to `malloc_tiny_fast_for_class(size, class_idx)` (safe side)
+
+Notes:
+- Prioritize “no additional call-site helpers” (lesson from Phase 11)
+- Only **LEGACY** goes direct (ULTRA/MID/V7 are left to the existing path)
+
+### Patch 3: bench_profile sync (prevent ENV leakage)
+
+Target:
+- `core/bench_profile.h`
+
+Change:
+- Add `front_fastlane_alloc_legacy_direct_env_refresh_from_env();` to the refresh group under `#ifdef USE_HAKMEM`
+
+---
+
+## 4. A/B (same binary)
+
+Baseline:
+```sh
+HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 scripts/run_mixed_10_cleanenv.sh
+```
+
+Optimized:
+```sh
+HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=1 scripts/run_mixed_10_cleanenv.sh
+```
+
+Additional (regression check):
+```sh
+HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 ./bench_mid_large_mt_hakmem 1 20000000 400 1
+HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=1 HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 ./bench_mid_large_mt_hakmem 1 20000000 400 1
+```
+
+---
+
+## 5. Health Check
+
+```sh
+scripts/verify_health_profiles.sh
+```
+
+---
+
+## 6. Rollback
+
+- `export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0`
+
--- a/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md
+++ b/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md
@@ -0,0 +1,89 @@
+# Phase 17: FORCE_LIBC Gap Validation v1 — A/B Test Results
+
+**Date**: 2025-12-15  
+**Verdict**: ✅ **Case B confirmed** — **Layout / I-cache penalty dominates**
+
+---
+
+## Executive Summary
+
+Phase 17 validated the “system malloc is faster than hakmem” observation while avoiding the classic layout/LTO trap by running a **same-binary A/B**:
+
+- Same binary (`bench_random_mixed_hakmem`) with `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator logic delta is negligible**.
+- The large performance gap appears only when comparing to the tiny `bench_random_mixed_system` binary.
+
+Conclusion: The dominant gap is **binary text size + layout → I-cache thrash + instruction footprint**, not allocator algorithm efficiency.
+
+---
+
+## Measurement Setup
+
+Workload:
+- `bench_random_mixed_*` (Mixed 16–1024B), working set `WS=400`
+- Clean ENV baseline via `scripts/run_mixed_10_cleanenv.sh`
+
+Two comparisons:
+1) **Same-binary toggle** (allocator logic delta)
+2) **System binary** (layout penalty delta)
+
+---
+
+## Results
+
+### 1) Same-binary A/B (allocator delta)
+
+Binary: `bench_random_mixed_hakmem`  
+Toggle: `HAKMEM_FORCE_LIBC_ALLOC=0/1`
+
+| Mode | Throughput (ops/s) | Delta |
+|------|---------------------|-------|
+| hakmem (`FORCE_LIBC=0`) | 48.12M | — |
+| libc  (`FORCE_LIBC=1`) | 48.31M | **+0.39%** |
+
+Interpretation: allocator logic delta is ~noise-level in this experiment context.
+
+### 2) System binary (layout penalty)
+
+Binary: `bench_random_mixed_system`
+
+| Mode | Throughput (ops/s) | Delta vs libc-in-hakmem-binary |
+|------|---------------------|--------------------------------|
+| system malloc | 83.85M | **+73.57%** |
+
+Total observed gap vs hakmem (`FORCE_LIBC=0`): about +74%.
+
+---
+
+## Perf Stat (200M iterations) — Smoking Gun
+
+| Metric | hakmem binary | system binary | Delta |
+|--------|---------------|---------------|-------|
+| I-cache misses | 153K | 68K | **-55%** |
+| Cycles | 17.9B | 10.2B | **-43%** |
+| Instructions | 41.3B | 21.5B | **-48%** |
+| Binary size | 653K | 21K | **-97%** |
+
+Interpretation:
+- The system binary executes roughly **half the instructions**, with **far fewer I-cache misses**.
+- The 30× text footprint difference strongly correlates with the gap.
+
+---
+
+## Conclusion
+
+Phase 12’s “system malloc is 1.6× faster” observation was real, but the root cause was misattributed:
+
+- ❌ Not primarily allocator algorithm differences
+- ✅ **Text/layout + I-cache locality + instruction footprint**
+
+This shifts the optimization frontier:
+- Stop chasing more routing/dispatch micro-opt (Phase 14–16 plateau)
+- Focus on **Hot Text Isolation / layout control**
+
+---
+
+## Next
+
+Proceed to:
+- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`
+
--- a/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md
@@ -0,0 +1,130 @@
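+# Phase 17: FORCE_LIBC Gap Validation (same-binary A/B) Next Instructions
+
+## Status (premises)
+
+- Phases 14–16 are **NEUTRAL / research box freeze** (the dispatch / cache-shape / pointer-chase line of work has plateaued)
+- Phase 16 v1 (FastLane alloc LEGACY direct) is **NEUTRAL (+0.62%)** and **limited to C0–C3** (C4–C7 segfault, so they are restricted for safety)
+- Phase 12 observed that “system malloc is faster than hakmem”, but **cross-binary comparisons are easily distorted by layout/LTO differences**
+
+The goal of Phase 17 is to A/B `hakmem` vs `libc malloc` **within the same binary** and to record, as the single source of truth (SSOT), what the gap actually consists of (allocator difference or binary difference).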
+
+---
+
+## 0. Goals (Deliverables)
+
+1) **Same-binary A/B**: using `bench_random_mixed_hakmem`
+- A: `HAKMEM_FORCE_LIBC_ALLOC=0` (hakmem)
+- B: `HAKMEM_FORCE_LIBC_ALLOC=1` (libc)
+
+2) **Breakdown against a separate binary** (optional)
+- Also measure `bench_random_mixed_system` (a small binary) and compare it with `libc-in-hakmem-binary` to estimate the **layout penalty**
+
+3) **Decide the next main area of work** (a direction decision, not GO/NO-GO)
+
+---
+
+## 1. Procedure (reproducibility first)
+
+### 1.1 Build (pinned to the same commit)
+
+```sh
+make -j bench_random_mixed_hakmem bench_random_mixed_system
+```
+
+### 1.2 Clean ENV (pin the Phase 14–16 research knobs)
+
+Recommended: use `scripts/run_mixed_10_cleanenv.sh` (prevents ENV leakage).
+
+Additionally, set the following explicitly (to guarantee Phase 16 is OFF):
+
+```sh
+export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0
+```
+
+### 1.3 Same-binary A/B (the main measurement)
+
+**A: hakmem**
+
+```sh
+HAKMEM_FORCE_LIBC_ALLOC=0 scripts/run_mixed_10_cleanenv.sh
+```
+
+**B: libc (same binary)**
+
+```sh
+HAKMEM_FORCE_LIBC_ALLOC=1 scripts/run_mixed_10_cleanenv.sh
+```
+
+Record:
+- mean / median / stdev (10-run)
+- Min/Max
+
+### 1.4 Optional: system binary baseline (layout penalty estimate)
+
+```sh
+for i in $(seq 1 10); do
+  echo "=== Run ${i}/10 (system bin) ==="
+  ./bench_random_mixed_system "${ITERS:-20000000}" "${WS:-400}" 1 2>&1 | rg "Throughput" || true
+done
+```
+
+Interpretation:
+- `system bin` is much faster than `FORCE_LIBC` → the **layout/text size penalty** dominates
+- `FORCE_LIBC` is much faster than `hakmem` → the **allocator logic difference** dominates
+
+---
+
+## 2. Decision (direction branches)
+
+### Case A: `FORCE_LIBC` is **+20% or more** faster than hakmem
+
+Conclusion: the bulk of the gap is on the allocator-logic side (instruction count / fixed costs).
+
+Next focus (recommended):
+- **Phase 18: Free FastPath Gate Consolidation**
+  - Consolidate the ENV gate / TLS probe inside `free_tiny_fast()` into a single check at the FastLane entry
+  - Goal: keep the “monolithic early-exit” winning pattern while cutting per-call gate fixed costs
+  - Box boundary: a single point, `front_fastlane_try_free()` → `free_tiny_fast_with_snapshot()`
+  - Reversible: `HAKMEM_FREE_TINY_FAST_SNAPSHOT=0/1`
+
+### Case B: `FORCE_LIBC` is within **±5%** of hakmem
+
+Conclusion: the allocator difference is small, and Phase 12's “system malloc 1.6x” is most likely caused by something else (binary difference / measurement setup).
+
+Next focus (recommended):
+- **Phase 18: Hot Text Isolation / Layout Control**
+  - Move cold code into a separate TU with `__attribute__((cold,noinline))`
+  - If possible, stabilize the I-cache via link order (fixed ordering of hot functions)
+  - Run the A/B within the same binary using `HAKMEM_LAYOUT_MODE=0/1` (switching section/attribute only)
+
+### Case C: `FORCE_LIBC` is faster than hakmem, but the gap to `system bin` is also large
+
+Conclusion: there is **both** an allocator difference and a layout penalty.
+
+Next focus:
+- First cut the **layout penalty** (Phase 18 Hot Text Isolation)
+- Then move on to **gate consolidation** (Phase 19)
+
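The three cases can be written down as a small decision function. This is an illustrative sketch only: the function and enum names are hypothetical, the text leaves a band between ±5% and +20% unclassified (returned here as `GAP_UNDECIDED`), and the threshold for a “large” gap to the system binary is not specified, so it is passed in as an explicit parameter.

```c
typedef enum { GAP_CASE_A, GAP_CASE_B, GAP_CASE_C, GAP_UNDECIDED } gap_case_t;

/* Percent by which `fast` exceeds `slow` (negative if it is actually slower). */
double gap_pct(double slow, double fast) {
    return (fast / slow - 1.0) * 100.0;
}

/*
 * hakmem / force_libc: throughput of the same binary with
 *                      HAKMEM_FORCE_LIBC_ALLOC=0 / 1.
 * system_bin:          throughput of the separate system binary.
 * layout_gap_large_pct: ASSUMPTION, the text only calls this gap "large".
 */
gap_case_t classify_gap(double hakmem, double force_libc, double system_bin,
                        double layout_gap_large_pct) {
    double alloc_delta  = gap_pct(hakmem, force_libc);      /* allocator logic */
    double layout_delta = gap_pct(force_libc, system_bin);  /* layout penalty  */

    if (alloc_delta >= -5.0 && alloc_delta <= 5.0) return GAP_CASE_B;
    if (alloc_delta > 5.0 && layout_delta >= layout_gap_large_pct) return GAP_CASE_C;
    if (alloc_delta >= 20.0) return GAP_CASE_A;
    return GAP_UNDECIDED;
}
```

With throughputs of 48.12M (hakmem), 48.31M (libc in the same binary), and 83.85M (system binary), the allocator delta is about +0.39% and the layout delta about +73.57%, so the function returns `GAP_CASE_B`.
+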
+---
+
+## 3. Observability (minimal)
+
+- Save the raw throughput of the 10 runs (the output log of `scripts/run_mixed_10_cleanenv.sh` is sufficient)
+- Additionally, take a single `perf stat` run (200M iters, 1-run):
+
+```sh
+perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
+  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FORCE_LIBC_ALLOC=0 \
+  ./bench_random_mixed_hakmem 200000000 400 1
+```
+
+Take one more run with the same command and `HAKMEM_FORCE_LIBC_ALLOC=1`.
+
+---
+
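+## 4. Key Rules (Box Theory)
+
+- Run the A/B on the **same binary** (avoid misjudgment caused by layout/LTO differences)
+- Every new optimization must have an ENV gate (reversible) and a single boundary
+- When in doubt, prefer “Fail-Fast with fallback” (consistency over speed)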
+
--- a/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md
+++ b/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md
@@ -0,0 +1,135 @@
+# Phase 18: Hot Text Isolation v1 — Design
+
+## 0. Context (from Phase 17)
+
+Phase 17 established **Case B**:
+- Same-binary `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator delta is negligible**.
+- The large gap appears vs the tiny `bench_random_mixed_system` binary.
+
+Signal:
+- I-cache misses / instructions / cycles are far worse in the hakmem-linked binary.
+- Binary size (`~653K`) vs system (`~21K`) correlates with the throughput gap.
+
+Ref: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`
+
+---
+
+## 1. Goal
+
+Reduce **hot-path instruction footprint** and improve **I-cache locality** in the hakmem-linked binary, without changing allocator algorithms.
+
+Primary success metric:
+- Mixed (16–1024B) throughput improvement, with accompanying reductions in:
+  - `iTLB/icache misses` (or “I-cache misses” counter used in Phase 17)
+  - total instructions executed per 200M iters
+
+---
+
+## 2. Non-goals
+
+- No allocator algorithm redesign.
+- No behavioral changes to safety/Fail-Fast semantics (only layout/placement changes).
+- No “delete code = faster” experiments (Phase 17 showed layout dominates; deletions confound results).
+
+---
+
+## 3. Box Theory framing
+
+This is a “build/layout box”:
+- **Box**: HotTextIsolationBox (compile-time layout controls + annotations)
+- **Boundary**: build flag / TU split (no runtime overhead)
+- **Rollback**: single Makefile knob (`HOT_TEXT_ISOLATION=0/1`) or `-DHAKMEM_HOT_TEXT_ISOLATION=0/1`
+- **Observability**: perf stat + binary size (no always-on logs)
+
+---
+
+## 4. Design: v1 tactics (low-risk)
+
+### 4.1 Hot/Cold attributes SSOT
+
+Introduce a single header defining attributes:
+- `HAK_HOT_FN` → `__attribute__((hot))` (and optionally `.text.hak_hot`)
+- `HAK_COLD_FN` → `__attribute__((cold,noinline))` (and optionally `.text.hak_cold`)
+
+Activated only when `HAKMEM_HOT_TEXT_ISOLATION=1`.
+
+Why:
+- Makes “what is hot/cold” explicit and consistent (SSOT).
+- Lets us annotate a small set of functions without scattering ad-hoc attributes.
+
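A single attribute header along these lines could serve as the SSOT. This is a sketch, not the contents of the planned `hot_text_attrs_box.h`: the macro names come from the design, but the compiler guard, the omission of the optional `.text.hak_hot` / `.text.hak_cold` section placement, and the two `demo_*` functions are assumptions made for illustration.

```c
#ifndef HAKMEM_HOT_TEXT_ISOLATION
#define HAKMEM_HOT_TEXT_ISOLATION 0   /* default OFF: macros expand to nothing */
#endif

#if HAKMEM_HOT_TEXT_ISOLATION && (defined(__GNUC__) || defined(__clang__))
#define HAK_HOT_FN  __attribute__((hot))
#define HAK_COLD_FN __attribute__((cold, noinline))
#else
#define HAK_HOT_FN
#define HAK_COLD_FN
#endif

/* Rare path (hypothetical): kept out of line so it does not dilute hot text. */
HAK_COLD_FN int demo_slow_path(int x) {
    return x * 2;
}

/* Common path (hypothetical): straight-line "try, then return". */
HAK_HOT_FN int demo_fast_path(int x) {
    if (x >= 0) return x + 1;
    return demo_slow_path(x);
}
```

Because both macros are empty when the knob is 0, the OFF build is unchanged and only the ON build differs in placement hints, which keeps the A/B comparison about layout rather than behavior.
+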
+### 4.2 Translation-unit split for wrappers
+
+Move wrapper definitions out of `core/hakmem.c` into a dedicated TU:
+- `core/hak_wrappers_box.c` includes `core/box/hak_wrappers.inc.h`
+
+Why:
+- Prevents wrapper text from being interleaved with unrelated code in the same TU.
+- Improves the linker’s ability to cluster hot code.
+- Enables future link-order experiments (symbol ordering files) without touching allocator logic.
+
+### 4.3 Cold code isolation
+
+Ensure rarely-hit helpers stay cold/out-of-line:
+- wrapper diagnostics (`wrapper_record_fallback`, ptr trace dumps, verbose logging)
+- “slow fallback” paths (`malloc_cold`, `free_cold`)
+
+Principle:
+- Hot path must remain a straight-line “try → return” shape.
+- Anything that allocates/logs/diagnoses is cold and must not be inlined into hot wrappers.
+
+### 4.4 Optional: section GC for bench builds
+
+For bench binaries only:
+- add `-ffunction-sections -fdata-sections`
+- link with `-Wl,--gc-sections`
+
+Why:
+- Drops truly-unused text and reduces overall text pressure.
+- Helps the linker keep hot text denser.
+
+This is optional because it is toolchain-sensitive; measure before promoting.
+
+---
+
+## 5. Risks / mitigations
+
+### Risk A: layout tweaks regress throughput
+
+Mitigation:
+- A/B using the same workload + perf stat counters (Phase 17 set).
+- If regression: keep as research-only (build knob default OFF).
+
+### Risk B: Toolchain sensitivity (ld vs lld, LTO interactions)
+
+Mitigation:
+- Keep v1 minimal (TU split + attributes first).
+- Only enable `--gc-sections` if it’s stable in the current toolchain.
+
+---
+
+## 6. Expected impact
+
+Conservative:
+- +3–10% throughput improvement on Mixed by reducing instruction footprint and I-cache misses.
+
+Stretch goal:
+- Bring “hakmem-linked + FORCE_LIBC” closer to `bench_random_mixed_system` ceiling by minimizing wrapper text working-set.
+
+---
+
+## 7. v2 Extension (if v1 is NEUTRAL): BENCH_MINIMAL compile-out
+
+Phase 17 shows the hakmem-linked binary executes ~2x instructions vs the tiny system binary. If v1 (TU split/attributes) is NEUTRAL, the next likely lever is **not placement-only**, but **removing per-call fixed costs** from the hot path by compiling them out in a bench-only build.
+
+Concept:
+- Introduce `HAKMEM_BENCH_MINIMAL=1` build mode (Makefile knob)
+- In this mode:
+  - “promoted defaults” are treated as compile-time constants (FastLane ON, snapshots ON, etc.)
+  - ENV gates become compile-time (no TLS/env probing in hot path)
+  - Hot counters/stats macros compile out completely
+
+Why this still fits Box Theory:
+- It is a **build box** (reversible by knob), not an algorithm rewrite
+- Boundaries remain: hot path stays Fail-Fast; cold fallback remains intact
+- Observability shifts to `perf stat` (no always-on logging)
+
+Expected impact:
+- If instruction footprint is truly dominant, this is the first place to see **double-digit gains** (+10–20%).
--- a/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md
@@ -0,0 +1,165 @@
+# Phase 18: Hot Text Isolation v1 — Next Instructions
+
+## Status
+
+- Phase 17 confirms **Case B**: allocator logic delta is negligible; gap is **layout/I-cache**.
+- Next: reduce instruction footprint + improve I-cache locality via **Hot Text Isolation**.
+
+Refs:
+- Phase 17 results: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`
+- Phase 18 design: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`
+
+---
+
+## 0. Goal / Success Criteria
+
+Primary (v1 is expected to be “low risk, small effect”):
+- Mixed (16–1024B) throughput of **+2%** or better is GO (a realistic bar for layout work)
+
+Secondary (must move in the right direction):
+- I-cache misses reduced (guideline: **-10%** or better)
+- Total instructions reduced (guideline: **-5%** or better)
+
+If throughput is NEUTRAL but counters improve significantly, keep as research box and iterate once.
+
+---
+
+## 1. Patch Plan (small, reversible)
+
+### Patch 1: Hot/Cold attribute SSOT (L0 Box)
+
+Add:
+- `core/box/hot_text_attrs_box.h`
+
+Defines:
+- `HAK_HOT_FN`, `HAK_COLD_FN` (no-op when `HAKMEM_HOT_TEXT_ISOLATION=0`)
+
+Usage:
+- annotate only a short, high-impact list first:
+  - wrappers: `malloc/free/calloc/realloc`
+  - FastLane entry helpers (if non-inline)
+  - cold helpers: `malloc_cold/free_cold`, wrapper diagnostics
+
+Rollback: build knob off.
+
+### Patch 2: Wrapper TU split (L1 Box boundary)
+
+Move wrapper definitions out of `core/hakmem.c`:
+- new: `core/hak_wrappers_box.c`
+  - `#include "box/hak_wrappers.inc.h"`
+- remove wrapper include from `core/hakmem.c`
+
+Rationale:
+- Prevents wrapper text from being interleaved with unrelated code in one TU.
+- Sets up link-order clustering.
+
+Rollback: restore include in `core/hakmem.c` and drop new TU.
+
+### Patch 3 (optional): bench-only section GC
+
+Makefile knob:
+- `HOT_TEXT_ISOLATION=0/1`
+
+When `=1`, add for bench builds:
+- `-DHAKMEM_HOT_TEXT_ISOLATION=1`
+- `-ffunction-sections -fdata-sections`
+- `LDFLAGS += -Wl,--gc-sections`
+
+Notes:
+- Keep it bench-only first (do not touch shared lib build until proven stable).
+- If toolchain rejects `--gc-sections` or results are unstable → skip this patch.
+
+---
+
+## 2. A/B Procedure (required)
+
+### 2.1 Baseline build (OFF)
+
+```sh
+make clean
+make -j bench_random_mixed_hakmem bench_random_mixed_system
+ls -lh bench_random_mixed_hakmem bench_random_mixed_system
+scripts/run_mixed_10_cleanenv.sh
+```
+
+Perf stat (1 run, 200M iters):
+
+```sh
+perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
+  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
+  ./bench_random_mixed_hakmem 200000000 400 1
+```
+
+### 2.2 Optimized build (ON)
+
+```sh
+make clean
+make -j HOT_TEXT_ISOLATION=1 bench_random_mixed_hakmem bench_random_mixed_system
+ls -lh bench_random_mixed_hakmem bench_random_mixed_system
+scripts/run_mixed_10_cleanenv.sh
+```
+
+Perf stat (same command):
+
+```sh
+perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
+  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
+  ./bench_random_mixed_hakmem 200000000 400 1
+```
+
+### 2.3 System ceiling check (optional)
+
+```sh
+./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput" || true
+```
+
+---
+
+## 3. GO/NO-GO Decision
+
+- **GO**: Mixed 10-run mean of **+2%** or better and no health regressions
+- **NEUTRAL**: within ±2% → keep as research box, iterate once (more cold isolation or better clustering)
+- **NO-GO**: **-2%** or worse → rollback and freeze
+
+Health profiles:
+
+```sh
+scripts/verify_health_profiles.sh
+```
+
+---
+
+## 4. Reporting (required artifacts)
+
+Create:
+- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md`
+  - throughput A/B (10-run)
+  - binary sizes
+  - perf stat table (cycles/instructions/I-cache)
+  - conclusion (GO/NEUTRAL/NO-GO)
+
+Update:
+- `CURRENT_TASK.md` (Phase 18 status + next)
+
+---
+
+## 5. Notes / guardrails
+
+- This phase intentionally compares **different binaries** (layout is the target), but keep the environment clean (`env -i`, fixed profile, same machine).
+- Avoid “delete code” experiments; only isolate/cold/cluster.
+- Keep “cold” truly cold: no allocations, no logging, no TLS-heavy helpers.
+
+---
+
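+## 6. If v1 is NEUTRAL: proceed immediately to Phase 18 v2 (BENCH_MINIMAL)
+
+To directly cut the “2x instructions” seen in Phase 17, layout alone is likely not enough; the fixed costs of **ENV/stats/debug code mixed into the hot path** probably need to be compiled out.
+
+Next move (bench-only binary / reversible):
+
+- With `HAKMEM_BENCH_MINIMAL=1` (Makefile knob):
+  - Pin the “always-ON paths” of FastLane / wrapper and turn ENV gates into compile-time constants
+  - Compile out hot counters completely
+  - Observability via `perf stat` only (no always-on logging)
+
+Expected: +10–20% (if instruction footprint really dominates, this is where throughput should move substantially)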
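+
Turning a runtime gate into a compile-time constant can be sketched as follows. This is an assumption-laden illustration, not the planned implementation: `demo_fastlane_enabled`, `demo_entry`, and the `DEMO_FASTLANE` variable are made-up names, and only the `HAKMEM_BENCH_MINIMAL` knob comes from the text.

```c
#include <stdlib.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

#if HAKMEM_BENCH_MINIMAL
/* Bench-only build: the promoted default is baked in, so the branch can fold away. */
static inline int demo_fastlane_enabled(void) { return 1; }
#else
/* Regular build: runtime gate, read once and cached. DEMO_FASTLANE is a made-up name. */
static inline int demo_fastlane_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        const char *v = getenv("DEMO_FASTLANE");
        cached = (v == NULL || v[0] != '0');   /* default ON unless set to 0 */
    }
    return cached;
}
#endif

/* Hypothetical entry point: the fast lane adds 1, the slow lane subtracts 1. */
int demo_entry(int x) {
    if (demo_fastlane_enabled()) return x + 1;
    return x - 1;
}
```

In the `HAKMEM_BENCH_MINIMAL=1` build the gate is a constant, so the compiler is free to drop both the cached-flag load and the slow-lane branch from `demo_entry`.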