Phase 74-1/74-2: UnifiedCache LOCALIZE optimization (P1 frozen, NEUTRAL -0.87%)

Phase 74-1 (ENV-gated LOCALIZE):
- Result: +0.50% (NEUTRAL)
- Runtime branch overhead caused instructions/branches to increase
- Diagnosed: Branch tax dominates intended optimization

Phase 74-2 (compile-time LOCALIZE):
- Result: -0.87% (NEUTRAL, P1 frozen)
- Removed runtime branch → instructions -0.6%, branches -2.3% ✓
- But cache-misses +86% (register pressure/spill) → net loss
- Conclusion: LOCALIZE itself works, but is fragile to cache effects

Key finding:
- Dependency chain reduction (LOCALIZE) has low ROI due to cache-miss sensitivity
- P1 (LOCALIZE) frozen at default OFF
- Next: Phase 74-3 (P0: FASTAPI) - move branches outside hot loop

Files:
- core/hakmem_build_flags.h: HAKMEM_TINY_UC_LOCALIZE_COMPILED flag
- core/box/tiny_unified_cache_hitpath_env_box.h: ENV gate (frozen)
- core/front/tiny_unified_cache.h: compile-time #if blocks
- docs/analysis/PHASE74_*: Design, instructions, results
- CURRENT_TASK.md: P1 frozen, P0 next instructions

Also includes:
- Phase 69 refill tuning results (archived docs)
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 69 baseline update
- PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md: Route banner docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-18 07:47:44 +09:00
parent e4baa1894f
commit e9b97e9d8e
14 changed files with 840 additions and 210 deletions
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@ -1,236 +1,89 @@
 # CURRENT_TASK (Rolling, SSOT)
-## 0) Current source of truth
+## 0) Current source of truth (SSOT)
- **Performance comparison reference**: FAST PGO build (`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`) ✓ **promoted in Phase 69** (Warm Pool Size=16)
+- **Performance comparison reference**: FAST PGO build (`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`) + **WarmPool=16** (promoted after the Phase 69 Strong GO)
 - **Safety/compatibility reference**: Standard build (`make bench_random_mixed_hakmem`)
 - **Observation reference**: OBSERVE build (`make perf_observe`)
- **Scorecard**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` (M1 achieved and exceeded: 51.77% vs 50% target; +3.23pp remaining to M2)
+- **Scorecard (targets / current values)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md`
- **Measurement reference (Mixed 10-run)**: `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`, `HAKMEM_WARM_POOL_SIZE=16` by default)
+  - Current baseline (FAST v3 + PGO, Phase 69): **62.63M ops/s = 51.77% of mimalloc**
   - Next target: **M2 = 55%** (**+3.23pp** remaining)
 - **Mixed 10-run SSOT**: `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`, `HAKMEM_WARM_POOL_SIZE=16` by default)
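The baseline figures above mix two units: percentage points (pp) of mimalloc throughput and relative throughput change (%). The following is a minimal sketch of the conversion, assuming only the numbers quoted in this section (51.77% current, 55% M2 target). The function names are illustrative and are not hakmem APIs.

```c
/* Illustrative helpers (hypothetical names, not part of hakmem). */

/* Gap to a target ratio in percentage points, e.g. 55.0 - 51.77. */
static double gap_pp(double current_pct, double target_pct) {
    return target_pct - current_pct;
}

/* Relative throughput gain (in %) needed to move from current_pct to
 * target_pct of the reference allocator, assuming the reference
 * allocator's own throughput stays fixed. */
static double required_gain_pct(double current_pct, double target_pct) {
    return (target_pct / current_pct - 1.0) * 100.0;
}
```

With these figures, `gap_pp(51.77, 55.0)` is 3.23 and `required_gain_pct(51.77, 55.0)` is roughly 6.2, so closing the +3.23pp gap corresponds to a relative speedup of about 6.2%.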
-## 1) Current status (key points)
+## 1) Don't get lost (routes / observation)
- Phase 64 (backend prune / DCE): **NO-GO** (-4.05%) → caused by layout tax
+Minimal procedure for avoiding "optimizing a path that is never executed".
 - Phase 63 (FAST_PROFILE_FIXED): kept as a **research build** (FAST gates fixed at compile time)
 - Phase 65 (Hot Symbol Ordering): **BLOCKED** (unfair/infeasible under GCC+LTO constraints) → `docs/analysis/PHASE65_HOT_SYMBOL_ORDERING_1_RESULTS.md`
 - Phase 66 (PGO, GCC+LTO): **GO** ✓
  - Verification: 3 independent runs, +3.0% mean, all > +2.89%, variance < ±1%
  - Baseline: `bench_random_mixed_hakmem_minimal_pgo` = 60.89M ops/s = 50.32% (initial PGO)
 - Phase 68 (PGO training-set optimization): **GO, promotion complete** ✓
  - Verification: +1.19% vs Phase 66 over 10 runs (GO: exceeds the +1.0% threshold)
  - Baseline (upgraded): `bench_random_mixed_hakmem_minimal_pgo` = 61.614M ops/s = **50.93%** (exceeds the 50% target by +0.93pp)
 - Phase 69 (Refill tuning: Warm Pool Size optimization): **Strong GO, promotion complete** ✓✓✓
  - Verification: +3.26% vs Phase 68 over 10 runs (Strong GO: exceeds the +3.0% threshold)
  - New baseline: `bench_random_mixed_hakmem_minimal_pgo` (upgraded) = 62.63M ops/s = **51.77%** (exceeds M1 by +1.77pp; +3.23pp remaining to M2)
-## 2) Next instructions (Active)
+- **Route Banner (eliminates route misidentification)**: `HAKMEM_ROUTE_BANNER=1`
  - Output: Route assignments (backend route kind) + cache config (`unified_cache_enabled` / `warm_pool_max_per_class`)
 - **SSOT for refill observation**: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
  - With WS=400 (Mixed SSOT), misses are negligible → `unified_cache_refill()` optimization is **frozen (zero ROI)**
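The Phase 70 notes in this file state two observation gates: trust the counters only when `Unified-STATS total_allocs == total_frees`, and freeze refill work when `Unified-STATS miss < 1000` on the Mixed SSOT workload. A minimal sketch of that decision, assuming the three counters have already been parsed from the OBSERVE output; the type and function names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical names; only the two rules come from the Phase 70 notes. */
typedef enum {
    REFILL_GATE_STATS_UNRELIABLE,  /* allocs != frees: fix observation first */
    REFILL_GATE_FROZEN,            /* miss below threshold: zero ROI */
    REFILL_GATE_WORTH_TUNING       /* misses are frequent enough to matter */
} refill_gate_t;

#define REFILL_GATE_MISS_THRESHOLD 1000u  /* the "miss < 1000" rule */

static refill_gate_t refill_gate(uint64_t total_allocs,
                                 uint64_t total_frees,
                                 uint64_t total_miss) {
    /* Reliability gate comes first: inconsistent stats decide nothing. */
    if (total_allocs != total_frees) return REFILL_GATE_STATS_UNRELIABLE;
    if (total_miss < REFILL_GATE_MISS_THRESHOLD) return REFILL_GATE_FROZEN;
    return REFILL_GATE_WORTH_TUNING;
}
```

Checking consistency before the miss count keeps a broken counter from being read as "refill is cold".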
-**Phase 68: PGO training-set optimization** ✅ **Done**
+## 2) Recent conclusions (key points only)
- ✓ seed/WS diversification: WS (3→5 patterns), seed (1→3 patterns)
+- **Phase 69 (WarmPool sweep)**: `HAKMEM_WARM_POOL_SIZE=16` is a **Strong GO (+3.26%)** and has been promoted to baseline.
- ✓ 10-run verification: +1.19% vs Phase 66 (exceeds the +1.0% GO threshold)
+  - Design: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
- ✓ Baseline promoted: 61.614M ops/s = 50.93% (exceeds the 50% M1 target by +0.93pp)
+  - Results: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
- ✓ Scorecard and CURRENT_TASK updated
+- **Phase 70 (observation SSOT)**: Made statistics visible and established prerequisite gates. Under the WS=400 SSOT, refill is cold.
 ---
 **Phase 67a: Layout tax forensics (minimal changes)** ✅ **Done, ready for production use**
 - ✓ `scripts/box/layout_tax_forensics_box.sh` new (measurement harness)
  - 10-run throughput comparison of Baseline vs Treatment
  - Automatic perf stat collection (cycles, IPC, branches, branch-misses, cache-misses, iTLB/dTLB)
  - Binary metadata (size, section layout)
 - ✓ `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md` new (diagnostic guide)
  - Decision rule: GO (+1% or more) / NEUTRAL (within ±1%) / NO-GO (-1% or less)
  - "Symptom → candidate cause" mapping table
    * IPC drop of 3% or more → I-cache miss / code layout dispersal
    * branch-misses up 10% or more → branch prediction penalty
    * dTLB-misses up 100% or more → data layout fragmentation
  - Phase 64 case study (root cause of the -4.05%: IPC 2.05 → 1.98)
  - Operating guidelines
 **Usage example**:
 ```bash
 ./scripts/box/layout_tax_forensics_box.sh \
    ./bench_random_mixed_hakmem_minimal_pgo \
    ./bench_random_mixed_hakmem_fast_pruned  # or Phase 64 attempt
 ```
 Outcome: when a "removal-type" change comes back NO-GO, we can diagnose which metric regressed **in a single pass** → future link-out / large deletions can be stopped in advance
 ---
 **Phase 69: Cut "refill frequency × fixed tax" (shortest path to M2)**
 **Phase 69-0: Parameter sweep design memo** ✅ **Done**
 - ✓ `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md` created
 - ✓ Tunable parameters identified:
  - `HAKMEM_TINY_REFILL_COUNT_MID` / `HAKMEM_TINY_REFILL_COUNT_HOT` (the actual refill amount, ENV-only)
  - Unified Cache C5-C7 capacity (128 → 256/512)
  - Warm Pool size (12 → 16/24)
 - ✓ Sweep plan drafted (single-parameter → combined optimization)
 - ✓ Risk assessment and decision criteria defined
 **Phase 69-1: Sweep execution** ✅ **Done**
 - ✓ Baseline (Phase 68 PGO): 60.65M ops/s (10-run mean)
 - ✓ Warm Pool Size sweep:
  - Size=16: **62.63M ops/s (+3.26%, Strong GO)** ✓✓✓ **Winner**
  - Size=24: 62.37M ops/s (+2.84%, GO)
 - ✓ Unified Cache C5-C7 sweep:
  - Cache=256: 61.92M ops/s (+2.09%, GO)
  - Cache=512: 61.80M ops/s (+1.89%, GO)
 - ✓ Combined optimization check:
  - Warm=16 + Cache=256: 62.35M ops/s (+2.81%, non-additive)
 - ✓ The "Refill Batch Size sweep" is invalid (knob not wired):
  - `TINY_REFILL_BATCH_SIZE` has no call sites in the current Tiny front, so it does not function as a performance knob
  - Reference: `docs/analysis/PHASE69_REFILL_TUNING_3C_REFILL_BATCH_KNOB_AUDIT.md`
 - **Results**: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
 - **Winning setting**: **Warm Pool Size=16 (ENV-only, +3.26%, Strong GO)**
 **Phase 69-2: Apply the winning setting to the baseline** ✅ **Done**
 - ✓ Added `HAKMEM_WARM_POOL_SIZE=16` as a default in `scripts/run_mixed_10_cleanenv.sh`
 - ✓ Added `bench_setenv_default("HAKMEM_WARM_POOL_SIZE","16")` to the `MIXED_TINYV3_C7_SAFE` preset in `core/bench_profile.h`
 - ✓ Added the new baseline to `PERFORMANCE_TARGETS_SCORECARD.md`:
  - Phase 69 baseline: 62.63M ops/s = 51.77% of mimalloc
  - M1 (50%) achievement: **EXCEEDED** (+1.77pp above target)
  - M2 (55%) progress: Gap reduced to +3.23pp
 - ✓ Rollback: `HAKMEM_WARM_POOL_SIZE=12` or unset the ENV variable
 **New baseline**: 62.63M ops/s = **51.77%** of mimalloc (+3.26% over Phase 68; +3.23pp remaining to M2)
 ---
 **Phase 69-3 (next candidate): refill amount (ENV-only) sweep OR another sweep**
 - **Option A (recommended)**: ENV sweep of refill count (no code changes)
  - Sweep `HAKMEM_TINY_REFILL_COUNT_MID` (C4–C7) over 64/96/128/160…
  - Sweep `HAKMEM_TINY_REFILL_COUNT_HOT` (C0–C3) likewise (note: it interacts with WarmPool/UnifiedCache)
  - Decision: by 10-run mean, GO (+1.0%) / Strong GO (+3.0%) / NO-GO (-1.0%)
 - **Option B**: fine sweep of Unified Cache (ENV-only)
  - Sweep C5/C6/C7 over 192/256/320… (the 256/512 points in Phase 69-1 were coarse)
  - Isolate the cause of the non-additivity with WarmPool=16
 - **Option C**: introduce a new compile-time knob (deferred)
  - `TINY_REFILL_BATCH_SIZE` is not wired, so do not pursue it as-is
  - If needed, write a separate SSOT and implement it (Phase 70+)
 - **Option D**: optimization in a different direction (shortest path to M2: 55%)
  - Remaining gap: +3.23pp (51.77% → 55%)
  - Phase 67b (boundary inline/unroll tuning)
  - Optimize the top 50 hot functions
  - Re-tune the PGO profile
 ---
 **Phase 67b (follow-up / fallback): boundary inline/unroll tuning**
 - **Caution**: high layout tax risk (see Phase 64)
 - **Prerequisite**: confirming that the top 50 are actually executed is mandatory
 - Recommended to defer as a fallback in case Phase 69 does not pan out
 **Phase 70（観測の前提固め）: Refill/WarmPool 最適化の Step 0 を SSOT 化**
 - 目的: **“経路が踏まれていない最適化”** を防ぐ（Phase 40/41/64 の layout tax 前例）
 - 注意: `Route assignments: LEGACY` は「Unified Cache 未使用」を意味しない（backend route kind）
  - SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
-  - Mixed SSOT（WS=400）で `unified_cache_refill()` / WarmPool pop が有意に起きているかを **OBSERVE で確定**してから Phase 70 を進める
+- **Phase 71/73（WarmPool=16 の勝ち筋確定）**: 勝ち筋は **instruction/branch の微減**（perf stat で確定）。
 - ✅ Phase 70-1: Route Banner 実装（経路誤認の根絶）
  - ENV: `HAKMEM_ROUTE_BANNER=1`
  - 出力: Route assignments（backend route kind）+ cache config（unified_cache / warm_pool_max_per_class）
 - ✅ Phase 70-3: OBSERVE 統計の整合性 SSOT（“見えてないだけ”事故の根絶）
  - `Unified-STATS total_allocs == total_frees` を確認してから議論する（統計の信頼性ゲート）
 - ✅ Phase 70-2: Refill 最適化の扱い確定（SSOT）
  - Mixed SSOT（WS=400）で `Unified-STATS miss < 1000` なら **Refill 最適化は凍結（ROIゼロ）**
  - 現状の実測: miss は極小（例: total miss=5）→ refill最適化は SSOT workload では ROI なし
  - 詳細: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
 - **Phase 72（ENV knob ROI枯れ）**: WarmPool=16 を超える ENV-only 勝ち筋なし → **構造（コード）で攻める段階**。
---
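+## 3) Operating rules (Box Theory + layout tax countermeasures)
-**Phase 73: Pin down the "win source" of WarmPool=16 with perf** ✅ **Done, paradox resolved**
+- Always stack changes as **a box + a single boundary + revertible via ENV** (fail-fast, minimal visualization).
 - A/B tests should, as a rule, **toggle ENV on the same binary** (comparing different binaries mixes in layout effects).
 - "Faster by deleting" is off the table (link-out / large deletions easily flip sign due to layout tax) → prefer **compile-out**.
  - Diagnosis: `scripts/box/layout_tax_forensics_box.sh` / `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md`
- Background: WarmPool=16 improves throughput/CV, yet visible counters such as Unified/WarmPool are nearly identical → most likely a **"per-operation cost difference"** (TLB/LLC/frequency/placement)
+## 4) Next instructions (Active)
 - Goal: use **perf stat** to reduce the WarmPool=12 vs 16 difference to "what decreased", and commit to the next structural optimization (Phase 72)
 - Method: **same binary + cleanenv + alternating runs** (avoids layout tax / environment drift)
  - A: `HAKMEM_WARM_POOL_SIZE=12`
  - B: `HAKMEM_WARM_POOL_SIZE=16`
  - events: `cycles,instructions,branches,branch-misses,cache-misses,LLC-load-misses,iTLB-load-misses,dTLB-load-misses,page-faults`
-**Results** (paradox):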
+### Phase 74 (structural): Shorten the UnifiedCache hit path ✅ **P1 (LOCALIZE) frozen**
 - ✅ Throughput: +0.91% (46.52M → 46.95M ops/s)
 - ✅ **instructions**: -0.38% (-17.4M instructions) ← **PRIMARY WIN SOURCE**
 - ✅ **branches**: -0.30% (-3.7M branches) ← **SECONDARY WIN SOURCE**
 - ⚠️ **dTLB-load-misses**: +29.06% (28,792 → 37,158) ← **WORSE**
 - ⚠️ **cache-misses**: +17.80% (458K → 540K) ← **WORSE**
 - ✓ page-faults: -0.21% (negligible)
-**Phase 71 hypothesis (REJECTED)**:
+**Premises**:
- Prediction: "TLB/cache efficiency improvement from memory layout"
+- Under the WS=400 SSOT, UnifiedCache misses are negligible → refill optimization has zero ROI.
- Measured: TLB/cache metrics both **DEGRADED**
+- WarmPool=16 wins through a slight instruction/branch reduction → shortening the hit path is the direct approach.
-**Phase 73 conclusion**:
+**Phase 74-1: LOCALIZE (ENV-gated)** ✅ **Done (NEUTRAL +0.50%)**
- Win source: **Control-flow optimization (instruction/branch count reduction)**
+- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1`
- Mechanism: WarmPool=16 selects a shorter code path → 17.4M fewer instructions
+- Runtime branch overhead **increased** instructions/branches (+0.7%/+0.4%)
- Trade-off: +4MB RSS → worse TLB/cache, but instruction savings dominate
+- Verdict: **NEUTRAL (+0.50%)**
 - Net benefit: ~8.2M cycles saved (instruction/branch) >> ~4.2M cycles lost (TLB/cache)
-**Details**: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md` Phase 73 section
+**Phase 74-2: LOCALIZE (compile-time gate)** ✅ **Done (NEUTRAL -0.87%)**
 - Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
 - Removing the runtime branch **improved** instructions/branches (-0.6%/-2.3%) ✓
 - However, **cache-misses rose +86%** (register pressure / spill) → throughput **-0.87%**
 - Isolation succeeded: **LOCALIZE itself wins, but the gain is offset by the cache-miss increase**
 - Verdict: **NEUTRAL (-0.87%)** → **P1 (LOCALIZE) frozen**
-**Phase 72 (structural): amplify the WarmPool=16 win source (after the Phase 73 results are in)**
+**Conclusion**:
 - P1 (LOCALIZE) is frozen at default OFF (low ROI from dependency-chain reduction)
 - Next: proceed to **Phase 74-3 (P0: FASTAPI)**
- Prerequisite: start only after Phase 73 pins down the "win source" numerically (tinkering on guesswork repeats Phases 40/41/64)
+**Phase 74-3: P0 (FASTAPI)** 🟡 **Next instructions**
 - Phase 73 conclusion: **the instruction/branch reduction dominates** (TLB/cache actually got worse) → the essence is that "WarmPool=16 steers execution onto a shorter path"
-**Phase 72-0 (SSOT): identify "which functions got shorter" before making structural changes**
+**Goal**: move the `unified_cache_enabled()` / `lazy-init` / `stats` checks **out of the hot loop**
- A/B stays WarmPool=12 vs 16 (same binary, cleanenv)
+**Approach**:
- Take perf record on **instructions/branches rather than cycles** (because the cause is the instruction/branch reduction)
+- Add the `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
-  - `perf record -e instructions:u -c 100000 -- ./bench_random_mixed_hakmem_observe 20000000 400 1`
+- Precondition: the caller guarantees "valid/enabled/no-stats"
-  - `perf record -e branches:u -c 100000 -- ./bench_random_mixed_hakmem_observe 20000000 400 1`
+- Fail-fast: fall back to the slow path on any unexpected state (single boundary)
- Goal: determine the **top 3 functions whose instruction share / branch share dropped** under WarmPool=16 (e.g. `shared_pool_acquire_slab`, `unified_cache_refill`, `warm_pool_do_prefill`, `superslab_refill`)
+- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
-**Phase 72-1 (structural): touch only the identified functions (single box boundary)** ✅ **Cancelled (zero ROI)**
+**Expected**: +1-2% via branch reduction (a different axis from P1)
- perf record result: `unified_cache_push` shows -0.86% branches (largest reduction)
+**Verdict criteria**:
- Original plan: optimize the Unified Cache FULL drain
+- **GO**: +1.0% or more
- **Reason for cancellation**: `full=0` in every class (no FULL events occur) → zero ROI
+- **NEUTRAL**: within ±1.0% (freeze, move on)
 - **NO-GO**: -1.0% or less (revert immediately)
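The verdict thresholds recur throughout this file: GO at +1.0% or more, NO-GO at -1.0% or less, NEUTRAL in between, plus the +3.0% Strong GO tier used in Phase 69. A minimal sketch of that rule applied to two 10-run means follows; the enum and function names are hypothetical and the repository's scripts may compute this differently.

```c
/* Hypothetical names; thresholds are the ones quoted in this file. */
typedef enum {
    VERDICT_NO_GO,
    VERDICT_NEUTRAL,
    VERDICT_GO,
    VERDICT_STRONG_GO
} verdict_t;

/* Relative change of treatment vs baseline, in percent. */
static double pct_delta(double baseline_ops, double treatment_ops) {
    return (treatment_ops - baseline_ops) / baseline_ops * 100.0;
}

static verdict_t classify_delta(double delta_pct) {
    if (delta_pct >= 3.0)  return VERDICT_STRONG_GO;  /* Phase 69 tier */
    if (delta_pct >= 1.0)  return VERDICT_GO;
    if (delta_pct <= -1.0) return VERDICT_NO_GO;
    return VERDICT_NEUTRAL;                           /* freeze, move on */
}
```

Under this rule both Phase 74 P1 results (+0.50% and -0.87%) land in NEUTRAL, which matches the freeze decision recorded above.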
-**Phase 72-2: additional WarmPool sweep** ✅ **Done (ROI exhausted)**
+**References**:
 - Design: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
 - Instructions: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
 - Results (P1): `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md`
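Phase 74-3 is only planned in this commit, so the following is a speculative sketch of the general idea (validate state once at a single boundary, then run the loop on a check-free variant), not the design in the referenced documents. Every name here (`sketch_cache_t`, `sketch_push_fast`, `g_sketch_enabled`, and so on) is hypothetical, and the ring layout is a simplified stand-in for the real cache.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_CAP 16  /* ring capacity, must be a power of two */

typedef struct {
    uint16_t head, tail, mask;
    int      ready;               /* lazy-init flag */
    void*    slots[SKETCH_CAP];
} sketch_cache_t;

static int g_sketch_enabled = 1;  /* stands in for the enabled/ENV gate */

/* Fast variant: no enabled / lazy-init / stats checks.
 * Precondition (caller-guaranteed): cache is enabled and initialized. */
static inline int sketch_push_fast(sketch_cache_t* c, void* p) {
    uint16_t next_tail = (uint16_t)((c->tail + 1) & c->mask);
    if (next_tail == c->head) return 0;   /* full */
    c->slots[c->tail] = p;
    c->tail = next_tail;
    return 1;
}

/* Slow variant: performs every check on each call (the single boundary). */
static int sketch_push_slow(sketch_cache_t* c, void* p) {
    if (!g_sketch_enabled) return 0;
    if (!c->ready) {
        c->head = c->tail = 0;
        c->mask = SKETCH_CAP - 1;
        c->ready = 1;
    }
    return sketch_push_fast(c, p);
}

/* Batch caller: one slow call validates state, the loop then stays on
 * the fast variant. Returns the number of pointers pushed. */
static int sketch_push_many(sketch_cache_t* c, void** ptrs, int n) {
    int pushed;
    if (n <= 0) return 0;
    if (!sketch_push_slow(c, ptrs[0])) return 0;
    pushed = 1;
    while (pushed < n && sketch_push_fast(c, ptrs[pushed])) pushed++;
    return pushed;
}
```

The point of the split is that the per-iteration cost of the loop no longer includes the gate branches; whether that pays off in hakmem is exactly what the Phase 74-3 measurement is meant to decide.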
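- Goal: check whether any setting other than WarmPool=16 wins
+## 5) Archive
 - Baseline: WarmPool=16 = 56.23M ops/s (10-run)
 - Results:
  - WarmPool=20: 56.13M ops/s (**-0.18%**, NO-GO)
  - WarmPool=24: 56.30M ops/s (**+0.12%**, within noise)
  - WarmPool=32: 56.07M ops/s (**-0.28%**, NO-GO)
 - **Verdict**: all candidates within ±0.5% → **Phase 72 closed (ENV knob ROI exhausted)**
 ---
 **Phase 72 summary**:
 - **Confirmed**: WarmPool=16 is the optimum (established in Phase 69, reconfirmed in Phase 72)
 - **Confirmed**: no room left for further optimization via ENV knobs
 - **Win source**: instruction/branch reduction dominates (confirmed in Phase 73)
 - **Next step**: structural (code) changes are required
 **Note**: do not delete research boxes now (there is strong precedent for link-out/deletion causing layout tax, so keeping them compiled out is the right call)
 ---
 **Phase 74 (next candidate): optimization via structural changes**
 - **Premise**: ENV knob ROI exhausted → code changes required
 - **Candidate A**: reduce branches in `unified_cache_push` (largest contribution confirmed in Phase 72-0)
 - **Candidate B**: strengthen inlining on the hot path (layout tax risk; forensics required)
 - **Candidate C**: re-tune the PGO profile (retrain assuming WarmPool=16)
 - **Decision criteria**: +1.0% → GO; below +0.5% → NO-GO
 ## 3) Archive
 - Detailed log: `CURRENT_TASK_ARCHIVE_20251210.md`
- Snapshot before the latest cleanup: `docs/analysis/CURRENT_TASK_ARCHIVE.md`
+- Snapshot before cleanup: `docs/analysis/CURRENT_TASK_ARCHIVE.md`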
--- a/core/box/tiny_unified_cache_hitpath_env_box.h
+++ b/core/box/tiny_unified_cache_hitpath_env_box.h
@ -0,0 +1,32 @@
 // tiny_unified_cache_hitpath_env_box.h - Phase 74: ENV gate for hit-path LOCALIZE
 //
 // Purpose: ENV-gated toggle for unified_cache_push/pop LOCALIZE optimization
 // Design: lazy-init pattern to avoid hot-path getenv overhead
 //
 // ENV: HAKMEM_TINY_UC_LOCALIZE=0/1 (default 0, OFF)
 //
 // Box Theory:
 //   L0: ENV gate (this file)
 //   L1: LOCALIZE implementation (in tiny_unified_cache.h)
 #ifndef HAK_BOX_TINY_UNIFIED_CACHE_HITPATH_ENV_BOX_H
 #define HAK_BOX_TINY_UNIFIED_CACHE_HITPATH_ENV_BOX_H
 #include <stdlib.h>
 // ============================================================================
 // Phase 74: LOCALIZE ENV Gate (lazy-init, cached)
 // ============================================================================
 // Check if LOCALIZE optimization is enabled
 // Uses lazy-init pattern: getenv called once, then cached
 static inline int tiny_uc_localize_enabled(void) {
    static int g_enabled = -1;  // -1 = uninitialized
    if (__builtin_expect(g_enabled == -1, 0)) {
        const char* e = getenv("HAKMEM_TINY_UC_LOCALIZE");
        g_enabled = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_enabled;
 }
 #endif // HAK_BOX_TINY_UNIFIED_CACHE_HITPATH_ENV_BOX_H
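The truthiness rule in the gate above is easy to misread, so here it is isolated as a pure function, together with a variant of the lazy-init pattern that takes the variable name and cache slot as parameters. Both helpers are hypothetical illustrations and are not part of the header.

```c
#include <stdlib.h>

/* Same rule as the gate above: unset, empty, or a value whose FIRST
 * character is '0' means OFF; anything else means ON. */
static int demo_env_flag_parse(const char* e) {
    return (e && *e && *e != '0') ? 1 : 0;
}

/* Lazy-init wrapper (hypothetical helper): the caller owns the cache
 * slot and initializes it to -1; getenv runs at most once per slot. */
static int demo_env_flag_cached(const char* name, int* slot) {
    if (*slot == -1) *slot = demo_env_flag_parse(getenv(name));
    return *slot;
}
```

Two consequences of the rule are worth keeping in mind when scripting A/B runs: a value such as `01` parses as OFF because only the first character is inspected, and strings such as `off` or `false` parse as ON.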
--- a/core/front/tiny_unified_cache.h
+++ b/core/front/tiny_unified_cache.h
@ -31,6 +31,7 @@
 #include "../box/ptr_type_box.h"    // Phantom pointer types (BASE/USER)
 #include "../box/tiny_front_config_box.h"  // Phase 8-Step1: Config macros
 #include "../box/tiny_tcache_box.h"  // Phase 14 v1: Intrusive LIFO tcache
 #include "../box/tiny_unified_cache_hitpath_env_box.h"  // Phase 74: LOCALIZE ENV gate
 // ============================================================================
 // Phase 3 C2 Patch 3: Bounds Check Compile-out
@ -247,6 +248,30 @@ static inline int unified_cache_push(int class_idx, hak_base_ptr_t base) {
    }
    #endif
    // Phase 74-2: LOCALIZE optimization (compile-time gate, no runtime branch)
 #if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    // LOCALIZE: Load head/tail/mask once into locals to avoid reload dependency chains
    uint16_t head = cache->head;
    uint16_t tail = cache->tail;
    uint16_t mask = cache->mask;
    uint16_t next_tail = (tail + 1) & mask;
    if (__builtin_expect(next_tail == head, 0)) {
 #if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED
        g_unified_cache_full[class_idx]++;
 #endif
        return 0;  // Full
    }
    cache->slots[tail] = base_raw;
    cache->tail = next_tail;
 #if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED
    g_unified_cache_push[class_idx]++;
 #endif
    return 1;  // SUCCESS (LOCALIZE path)
 #else
    // Default path: Original implementation
    uint16_t next_tail = (cache->tail + 1) & cache->mask;
    // Full check (leave 1 slot empty to distinguish full/empty)
@ -266,6 +291,7 @@ static inline int unified_cache_push(int class_idx, hak_base_ptr_t base) {
 #endif
    return 1;  // SUCCESS (2-3 cache misses total)
 #endif  // HAKMEM_TINY_UC_LOCALIZE_COMPILED
 }
 // ============================================================================
@ -316,6 +342,37 @@ static inline hak_base_ptr_t unified_cache_pop_or_refill(int class_idx) {
    }
 #endif
    // Phase 74-2: LOCALIZE optimization (compile-time gate, no runtime branch)
 #if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    // LOCALIZE: Load head/tail/mask once into locals to avoid reload dependency chains
    uint16_t head = cache->head;
    uint16_t tail = cache->tail;
    uint16_t mask = cache->mask;
    if (__builtin_expect(head != tail, 1)) {
        void* base = cache->slots[head];
        cache->head = (head + 1) & mask;
 #if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED
        g_unified_cache_hit[class_idx]++;
 #endif
 #if HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
        if (__builtin_expect(unified_cache_measure_check(), 0)) {
            atomic_fetch_add_explicit(&g_unified_cache_hits_global,
                                      1, memory_order_relaxed);
            atomic_fetch_add_explicit(&g_unified_cache_hits_by_class[class_idx],
                                      1, memory_order_relaxed);
        }
 #endif
        return HAK_BASE_FROM_RAW(base);  // Hit! (LOCALIZE path)
    }
    // Cache miss → Batch refill from SuperSlab
 #if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED
    g_unified_cache_miss[class_idx]++;
 #endif
    return unified_cache_refill(class_idx);
 #else
    // Default path: Original implementation
    // Tcache miss/disabled/compiled-out → try pop from array cache (fast path)
    if (__builtin_expect(cache->head != cache->tail, 1)) {
        void* base = cache->slots[cache->head];  // 1 cache miss (array access)
@ -341,6 +398,7 @@ static inline hak_base_ptr_t unified_cache_pop_or_refill(int class_idx) {
    g_unified_cache_miss[class_idx]++;
 #endif
    return unified_cache_refill(class_idx);  // Refill + return first block (BASE)
 #endif  // HAKMEM_TINY_UC_LOCALIZE_COMPILED
 }
 #endif // HAK_FRONT_TINY_UNIFIED_CACHE_H
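To see the LOCALIZE transformation without the surrounding allocator machinery, here is a self-contained ring buffer with both variants behind one compile-time switch. `DEMO_UC_LOCALIZE`, `demo_cache_t`, and the function names are hypothetical stand-ins; the real code uses `HAKMEM_TINY_UC_LOCALIZE_COMPILED` and has stats and measurement hooks that are omitted here.

```c
#include <stdint.h>
#include <stddef.h>

#ifndef DEMO_UC_LOCALIZE   /* stand-in for the real build flag */
#  define DEMO_UC_LOCALIZE 1
#endif

#define DEMO_CAP 8         /* must be a power of two */

typedef struct {
    uint16_t head, tail, mask;
    void*    slots[DEMO_CAP];
} demo_cache_t;

static inline void demo_cache_init(demo_cache_t* c) {
    c->head = 0;
    c->tail = 0;
    c->mask = DEMO_CAP - 1;
}

static inline int demo_cache_push(demo_cache_t* c, void* p) {
#if DEMO_UC_LOCALIZE
    /* Load each field once into a local, then work on the locals. */
    uint16_t head = c->head, tail = c->tail, mask = c->mask;
    uint16_t next_tail = (uint16_t)((tail + 1) & mask);
    if (next_tail == head) return 0;      /* full (one slot kept empty) */
    c->slots[tail] = p;
    c->tail = next_tail;
#else
    /* Original shape: fields are re-read through the pointer. */
    uint16_t next_tail = (uint16_t)((c->tail + 1) & c->mask);
    if (next_tail == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = next_tail;
#endif
    return 1;
}

static inline void* demo_cache_pop(demo_cache_t* c) {
#if DEMO_UC_LOCALIZE
    uint16_t head = c->head, tail = c->tail, mask = c->mask;
    if (head == tail) return NULL;        /* empty */
    void* p = c->slots[head];
    c->head = (uint16_t)((head + 1) & mask);
    return p;
#else
    if (c->head == c->tail) return NULL;
    void* p = c->slots[c->head];
    c->head = (uint16_t)((c->head + 1) & c->mask);
    return p;
#endif
}
```

Both variants are behaviorally identical; they differ only in how many loads the compiler is asked to keep live in registers, which is consistent with the mixed result reported for Phase 74-2 (fewer instructions and branches, but more cache-misses).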
--- a/core/hakmem_build_flags.h
+++ b/core/hakmem_build_flags.h
@ -434,6 +434,18 @@
 #  define HAKMEM_ALLOC_GATE_CLS_MIS_COMPILED 0
 #endif
 // ------------------------------------------------------------
 // Phase 74: UnifiedCache LOCALIZE (Compile-time hit-path optimization)
 // ------------------------------------------------------------
 // LOCALIZE: Load head/tail/mask once into locals to avoid reload dependency chains
 // When =1: Always use localize version (no runtime branch, maximum DCE)
 // When =0: Use original implementation (default, backward compatible)
 // Build: make EXTRA_CFLAGS="-DHAKMEM_TINY_UC_LOCALIZE_COMPILED=1" [target]
 // Expected impact: +0.5-1.5% via dependency chain reduction
 #ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
 #  define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0
 #endif
 // ------------------------------------------------------------
 // Helper enum (for documentation / logging)
 // ------------------------------------------------------------
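Because the flag above is consumed by `#if`, a mistyped value such as `-DHAKMEM_TINY_UC_LOCALIZE_COMPILED=2` would silently select the LOCALIZE path. A minimal sketch of a guard that could sit next to the default definition; the `_Static_assert` and the accessor function are suggestions and are not in the commit.

```c
/* Default mirrors the block above: OFF unless the build overrides it. */
#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
#  define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0
#endif

/* Hypothetical guard: reject anything other than 0 or 1 at compile time. */
_Static_assert(HAKMEM_TINY_UC_LOCALIZE_COMPILED == 0 ||
               HAKMEM_TINY_UC_LOCALIZE_COMPILED == 1,
               "HAKMEM_TINY_UC_LOCALIZE_COMPILED must be 0 or 1");

/* Hypothetical accessor, e.g. for printing the build configuration. */
static int demo_uc_localize_compiled(void) {
    return HAKMEM_TINY_UC_LOCALIZE_COMPILED;
}
```

Logging the value from such an accessor at startup would make it easy to confirm which path a benchmark binary was built with.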
--- a/docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md
+++ b/docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md
@ -11,7 +11,7 @@
 Comparisons against mimalloc use the **FAST build** (Standard includes fixed tax, so it is not a fair comparison).
-## Current snapshot (2025-12-17, Phase 68 PGO, new baseline)
+## Current snapshot (2025-12-18, Phase 69 PGO + WarmPool=16, current baseline)
 Measurement conditions (reference for reproduction):
 - Mixed: `scripts/run_mixed_10_cleanenv.sh`（`ITERS=20000000 WS=400`）
--- a/docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md
+++ b/docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md
@ -0,0 +1,197 @@
 # Phase 69-1: Refill Tuning Parameter Sweeps - Results
 **Date**: 2025-12-17
 **Baseline**: Phase 68 PGO (`bench_random_mixed_hakmem_minimal_pgo`)
 **Benchmark**: `scripts/run_mixed_10_cleanenv.sh` (RUNS=10)
 **Goal**: Find +3-6% optimization for M2 milestone (55% of mimalloc)
 ---
 ## Executive Summary
 **Winner Identified**: **Warm Pool Size=16** achieves **+3.26% (Strong GO)** with ENV-only change.
 - **No code changes required** - Deploy via `HAKMEM_WARM_POOL_SIZE=16` environment variable
 - **Exceeds the Strong GO threshold** (+3.0%)
 - **Single strongest improvement** among all tested parameters
 - **Combined optimizations are non-additive** - Warm Pool Size=16 alone outperforms combinations
 ⚠️ **Important correction (2025-12 audit)**:
 The previously reported “Refill Batch Size sweep” based on `TINY_REFILL_BATCH_SIZE` was **not measuring a real knob**.
 That macro currently has **zero call sites** (it is defined but not referenced in the active Tiny front path), so any
 observed deltas were **layout/drift noise**, not an algorithmic effect.
 ---
 ## Full Sweep Results
 ### Baseline (Phase 68 PGO)
 | Metric | Value |
 |--------|-------|
 | **Mean** | 60.65M ops/s |
 | **Median** | 60.68M ops/s |
 | **CV** | 1.68% |
 | **% of mimalloc** | 50.93% |
 **Runs**: 10
 **Binary**: `bench_random_mixed_hakmem_minimal_pgo` (PGO optimized)
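The tables in this document report mean, median, and CV over 10 runs. For readers reproducing them from raw logs, here is a minimal sketch of those three statistics. It assumes CV means population standard deviation divided by mean; the benchmark script may use the sample (n-1) form instead. All names are hypothetical, and the small Newton square root exists only to avoid a libm link dependency.

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_RUNS 64   /* callers must pass 1 <= n <= DEMO_MAX_RUNS */

static int demo_cmp_double(const void* a, const void* b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

static double demo_mean(const double* xs, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += xs[i];
    return s / n;
}

static double demo_median(const double* xs, int n) {
    double tmp[DEMO_MAX_RUNS];
    memcpy(tmp, xs, (size_t)n * sizeof(double));   /* sort a copy */
    qsort(tmp, (size_t)n, sizeof(double), demo_cmp_double);
    return (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}

/* Newton iteration for sqrt(v), v >= 0. */
static double demo_sqrt(double v) {
    if (v <= 0.0) return 0.0;
    double r = v > 1.0 ? v : 1.0;
    for (int i = 0; i < 60; i++) r = 0.5 * (r + v / r);
    return r;
}

/* CV as a fraction (multiply by 100 for percent). Assumption:
 * population form; use (n - 1) in the divisor for the sample form. */
static double demo_cv(const double* xs, int n) {
    double m = demo_mean(xs, n), ss = 0.0;
    for (int i = 0; i < n; i++) ss += (xs[i] - m) * (xs[i] - m);
    return demo_sqrt(ss / n) / m;
}
```

Reporting the median next to the mean, as the tables here do, helps spot runs where one or two outliers pull the mean.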
 ---
 ### 1. Warm Pool Size Sweep (ENV-only, no recompile)
 **Parameter**: `HAKMEM_WARM_POOL_SIZE` (default: 12 SuperSlabs/class)
 | Size | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
 |------|----------------|------------------|----|-----------:|----------|
 | **16** | **62.63** | **63.38** | 2.43% | **+3.26%** | **Strong GO** ✓✓✓ |
 | 24 | 62.37 | 62.35 | 1.99% | +2.84% | GO ✓ |
 **Winner**: **Size=16 (+3.26%)**
 **Analysis**:
 - Size=16 exceeds +3.0% Strong GO threshold
 - Size=24 shows diminishing returns (+2.84% vs +3.26%)
 - Size=16 is the better of the two sizes tested; this sweep did not measure why (cache hit rate vs memory overhead is a hypothesis, not a measured result)
 **Command Used**:
 ```bash
 # Size=16
 HAKMEM_WARM_POOL_SIZE=16 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
 # Size=24
 HAKMEM_WARM_POOL_SIZE=24 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
 ```
 ---
 ### 2. Unified Cache C5-C7 Sweep (ENV-only, no recompile)
 **Parameter**: `HAKMEM_TINY_UNIFIED_C5`, `HAKMEM_TINY_UNIFIED_C6`, `HAKMEM_TINY_UNIFIED_C7` (default: 128 slots)
 | Cache Size | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
 |------------|----------------|------------------|----|-----------:|----------|
 | **256** | **61.92** | **61.70** | 1.49% | **+2.09%** | **GO** ✓ |
 | 512 | 61.80 | 62.00 | 1.21% | +1.89% | GO ✓ |
 **Winner**: **Cache=256 (+2.09%)**
 **Analysis**:
 - Cache=256 shows +2.09% improvement (GO threshold)
 - Cache=512 shows diminishing returns (+1.89% vs +2.09%)
 - Larger caches provide marginal gains while increasing memory overhead
 - Lower CV (1.49%) indicates stable performance
 **Command Used**:
 ```bash
 # Cache=256
 HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
 # Cache=512
 HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
 ```
 ---
 ### 3. Combined Optimization Check
 **Configuration**: Warm Pool Size=16 + Unified Cache C5-C7=256
 | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
 |----------------|------------------|----|-----------:|----------|
 | 62.35 | 62.32 | 1.91% | +2.81% | GO (non-additive) |
 **Analysis**:
 - Combined result (+2.81%) is **LESS than** Warm Pool Size=16 alone (+3.26%)
 - **Non-additive behavior** indicates parameters are not orthogonal
 - **Likely explanation**: Warm pool optimization reduces unified cache miss rate, making cache capacity increase redundant
 - **Recommendation**: Use Warm Pool Size=16 alone for maximum benefit
 **Command Used**:
 ```bash
 HAKMEM_WARM_POOL_SIZE=16 HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
 ```
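To put a number on "non-additive", one can compare the measured combined gain with what two independent gains would give if they compounded. A minimal sketch using only the percentages from the tables above; the helper names are hypothetical.

```c
/* Expected combined gain (in %) if two relative gains were independent
 * and compounded multiplicatively. */
static double demo_compound_gain_pct(double gain_a_pct, double gain_b_pct) {
    return ((1.0 + gain_a_pct / 100.0) * (1.0 + gain_b_pct / 100.0) - 1.0)
           * 100.0;
}

/* How far the measured combined gain falls short of that expectation. */
static double demo_interaction_shortfall_pct(double gain_a_pct,
                                             double gain_b_pct,
                                             double measured_pct) {
    return demo_compound_gain_pct(gain_a_pct, gain_b_pct) - measured_pct;
}
```

For +3.26% (Warm Pool) and +2.09% (Cache=256) the independent expectation is about +5.4%, so the measured +2.81% falls roughly 2.6 points short. Given CVs of around 2% in these runs, part of that shortfall may be noise rather than interaction.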
 ---
 ### 4. Refill Batch Size Sweep (invalid — macro not wired)
 The `TINY_REFILL_BATCH_SIZE` macro is currently **define-only**:
 ```bash
 rg -n "TINY_REFILL_BATCH_SIZE" core
 # -> core/hakmem_tiny_config.h only
 ```
 So we do **not** treat it as a tuning parameter until it is actually connected to refill logic.
 If we want to tune refill frequency, use the real knobs:
 - `HAKMEM_TINY_REFILL_COUNT_HOT`
 - `HAKMEM_TINY_REFILL_COUNT_MID`
 - `HAKMEM_TINY_REFILL_COUNT` / `HAKMEM_TINY_REFILL_COUNT_C{0..7}`
 ---
 ## Recommendations
 ### Phase 69-2 (Baseline Promotion)
 **Primary Recommendation**: **Deploy Warm Pool Size=16 (ENV-only)**
 **Rationale**:
 1. **Strongest single improvement** (+3.26%, Strong GO)
 2. **No code changes required** - Zero risk of layout tax
 3. **Immediate deployment** via environment variable
 4. **Exceeds the Strong GO threshold** (+3.0%)
 **Deployment**:
 ```bash
 # Add to PGO training environment and benchmark scripts
 export HAKMEM_WARM_POOL_SIZE=16
 ```
 ---
 ### Secondary Options (for Phase 69-3+)
 **Option A: Warm Pool Size=16 + Refill Batch=32** (not actionable as written: `TINY_REFILL_BATCH_SIZE` is not wired, see section 4)
 - **Combined potential**: Unknown (requires testing, may be non-additive like unified cache)
 - **Complexity**: Requires PGO rebuild for Batch=32
 - **Risk**: Layout tax from code change
 **Option B: Warm Pool Size=16 alone (recommended)**
 - **Gain**: +3.26% measured (10-run mean)
 - **Complexity**: ENV-only, zero code changes
 - **Risk**: None (reversible via ENV)
 ---
 ## Raw Data Files
 All 10-run logs saved to:
 - `/tmp/phase69_baseline.log` - Phase 68 PGO baseline
 - `/tmp/phase69_warm16.log` - Warm Pool Size=16
 - `/tmp/phase69_warm24.log` - Warm Pool Size=24
 - `/tmp/phase69_cache256.log` - Unified Cache C5-C7=256
 - `/tmp/phase69_cache512.log` - Unified Cache C5-C7=512
 - `/tmp/phase69_combined.log` - Combined (Warm=16 + Cache=256)
 - `/tmp/phase69_batch32.log` - Refill Batch=32
 ---
 ## Next Steps
 **Awaiting User Instructions for Phase 69-2**:
 1. Confirm Warm Pool Size=16 as baseline promotion candidate
 2. Decide whether to:
   - Update ENV defaults in `hakmem_tiny_config.h` (preferred for SSOT)
   - Document as recommended ENV setting in README/docs
   - Add to PGO training scripts
 3. Re-run `make pgo-fast-full` with `HAKMEM_WARM_POOL_SIZE=16` in training environment
 4. Update `PERFORMANCE_TARGETS_SCORECARD.md` with new baseline (projected: 62.63M ops/s, ~52.6% of mimalloc)
 ---
 **Phase 69-1 Status**: ✅ **COMPLETE**
 **Winner**: **Warm Pool Size=16 (+3.26%, Strong GO, ENV-only)**
--- a/docs/analysis/PHASE69_REFILL_TUNING_3A_BUILD_FAILURE_TRIAGE_BATCH64.md
+++ b/docs/analysis/PHASE69_REFILL_TUNING_3A_BUILD_FAILURE_TRIAGE_BATCH64.md
@ -0,0 +1,46 @@
 # Phase 69-3A: Refill Batch=64 build failure triage — Root cause & fix
 ## Symptom
 `make pgo-fast-build` (profile-use) fails to link with undefined `__gcov_*` symbols, e.g.:
 - `__gcov_init`, `__gcov_exit`
 - `__gcov_merge_add`, `__gcov_merge_topn`
 - `__gcov_time_profiler_counter`
 This appeared when trying to evaluate `Refill Batch Size=64`.
 ## Root cause (actual)
 The failure is **not** “compiler limit due to batch=64”.
 It is a **stale object mixing** problem:
 - Some benchmark `.o` files were built in the profile-gen step (`-fprofile-generate`) and **were not removed by `make clean`**.
 - In the profile-use step (`-fprofile-use`), those stale instrumented `.o` files were reused and linked without `-fprofile-generate` → libgcov was not pulled in.
 - Result: unresolved `__gcov_*` symbols at link time.
 In other words: **instrumented bench object reused in non-instrumented link**.
 ## Fix (minimal, safe)
 Strengthen `make clean` to remove benchmark objects/binaries that were previously omitted, including:
 - `bench_random_mixed_hakmem.o`
 - `bench_tiny_hot_hakmem.o`
 - related bench variants (`*_system`, `*_mi`, `*_hakx`, `*_minimal*`, etc.)
 This preserves toolchain fairness (GCC + LTO) and prevents cross-step contamination in PGO workflows.
 ## Verification
 After the fix, the Phase 66 PGO pipeline builds successfully again:
 ```sh
 make pgo-fast-profile pgo-fast-collect pgo-fast-build
 ```
 ## Notes
 - This fix is **layout-neutral**: it only affects build hygiene (artifact cleanup).
 - This also hardens other workflows where flags change across builds (PGO / FAST targets).
 - Follow-up audit note (2025-12): `TINY_REFILL_BATCH_SIZE` is currently define-only (no call sites), so the “batch=64”
  performance experiment itself was not measuring a real knob; however the build hygiene fix remains valid and important.
--- a/docs/analysis/PHASE69_REFILL_TUNING_3B_REFILL_BATCH_PGO_SWEEP_RESULTS.md
+++ b/docs/analysis/PHASE69_REFILL_TUNING_3B_REFILL_BATCH_PGO_SWEEP_RESULTS.md
@ -0,0 +1,45 @@
 # Phase 69-3B: Refill Batch Size sweep (PGO, warm_pool=16) — Results
 ⚠️ **INVALID (2025-12 audit)**: `TINY_REFILL_BATCH_SIZE` is currently **not wired** into the active Tiny front path
 (it has zero call sites; define-only in `core/hakmem_tiny_config.h`). Any observed deltas in this file should be treated
 as **layout/drift noise**, not an algorithmic effect. This document is kept only as an experiment record.
 ## Context
 Phase 69-2 promoted the ENV-only winner:
 - `HAKMEM_WARM_POOL_SIZE=16`
 This phase explores compile-time refill batch size (`TINY_REFILL_BATCH_SIZE`) under the current PGO workflow:
 - `make pgo-fast-full` (GCC + LTO preserved)
 - Training uses cleanenv-aligned workloads (`scripts/box/pgo_fast_profile_config.sh`)
 ## Build hygiene prerequisite
 Batch=64 originally “failed to build” due to stale profile-gen bench objects being reused in profile-use links.
 That issue is fixed by strengthening `make clean` (see `docs/analysis/PHASE69_REFILL_TUNING_3A_BUILD_FAILURE_TRIAGE_BATCH64.md`).
 ## Measurement (Mixed 10-run)
 All results are from the same host session, using:
 - `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo`
 - `RUNS=10 scripts/run_mixed_10_cleanenv.sh`
 | Batch | Mean (M ops/s) | Median (M ops/s) | CV |
 |------:|----------------:|-----------------:|---:|
 | 16    | 61.30           | 61.64            | 1.50% |
 | 32    | 60.73           | 61.17            | 2.19% |
 | 48    | 61.94           | 62.54            | 1.53% |
 | 64    | 61.51           | 61.81            | 1.56% |
 ## Decision
 - **Batch=48** is the best of the tested set in this session (about +1.0% mean vs the batch=16 baseline).
 - **Batch=32** regresses in this session (note: it was previously GO under a different baseline).
 - **Batch=64** builds successfully after the hygiene fix, but is not the best performer here.
 ## Next steps (Phase 69-3C)
 If we want to pursue M2 (55%) via this path:
 1. Promote **batch=48** as a research candidate with a dedicated Phase tag (compile-time change + PGO rebuild).
 2. Re-run the sweep at another time window to confirm ordering (layout/drift sensitivity).
 3. If stable, promote batch=48 into the FAST baseline build path.
--- a/docs/analysis/PHASE69_REFILL_TUNING_3C_REFILL_BATCH_KNOB_AUDIT.md
+++ b/docs/analysis/PHASE69_REFILL_TUNING_3C_REFILL_BATCH_KNOB_AUDIT.md
@ -0,0 +1,47 @@
 # Phase 69-3C: Refill Batch “knob” audit — `TINY_REFILL_BATCH_SIZE` is not wired
 ## Summary
 The Phase 69 “Refill Batch Size sweep” was based on `TINY_REFILL_BATCH_SIZE` in `core/hakmem_tiny_config.h`, but an audit
 shows this macro currently has **zero call sites** in the active Tiny front path. As a result, any measured deltas from
 editing this macro are **not algorithmic**; they are attributable to layout/drift/noise.
 ## Evidence
 ### 1) Zero call sites
 ```sh
 rg -n "TINY_REFILL_BATCH_SIZE" core
 ```
 Result: only `core/hakmem_tiny_config.h` (define-only).
 ### 2) PGO binaries unchanged when toggling the macro
 We rebuilt the full PGO pipeline twice (`make pgo-fast-full`) after changing the macro (batch16 vs batch48) and found the
 resulting binaries were bit-identical (same size + same SHA256).
 This confirms the macro does not affect the compiled hot path today.
 ## Action taken
 - Restored `TINY_REFILL_BATCH_SIZE` to `16` and added an explicit “not wired” note in `core/hakmem_tiny_config.h`.
 - Marked the “Refill Batch Size sweep” section in Phase 69 docs as invalid.
 ## What to tune instead (real knobs)
 To tune refill frequency/amount without rebuilding:
 - `HAKMEM_TINY_REFILL_COUNT_HOT` (C0–C3)
 - `HAKMEM_TINY_REFILL_COUNT_MID` (C4–C7)
 - `HAKMEM_TINY_REFILL_COUNT` / `HAKMEM_TINY_REFILL_COUNT_C{0..7}`
 Defaults are set in `core/hakmem_tiny_init.inc` and can be overridden via ENV.
 ## Optional future work (if we still want a compile-time knob)
 If we want a compile-time “refill batch size” knob, we need to wire it into a single SSOT:
 - either by feeding it into the refill-count defaults (`g_refill_count_*`), or
 - by introducing a dedicated build flag that the refill logic consumes directly.
 Until then, do not run Phase 69 sweeps based on `TINY_REFILL_BATCH_SIZE`.
--- a/docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md
+++ b/docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md
@ -12,6 +12,13 @@
 Before implementing any refill/WarmPool changes, execute this sequence:
 0.  **Route Banner (optional but recommended)**:
    ```bash
    HAKMEM_ROUTE_BANNER=1 ./bench_random_mixed_hakmem_observe ...
    ```
    - Prints the route assignments (backend route kind) and the cache config (`unified_cache_enabled` / `warm_pool_max_per_class`) exactly once.
    - Prevents misreadings such as "Route=LEGACY means Unified Cache is unused" (even under LEGACY, Unified Cache is still used at the alloc/free front).
 1.  **Build with Stats**:
    ```bash
    make bench_random_mixed_hakmem_observe EXTRA_CFLAGS='-DHAKMEM_UNIFIED_CACHE_STATS_COMPILED=1'
@ -20,7 +27,7 @@ Before implementing any refill/WarmPool changes, execute this sequence:
 2.  **Run with Stats**:
    ```bash
-    HAKMEM_WARM_POOL_STATS=1 ./bench_random_mixed_hakmem_observe 20000000 400 1
+    HAKMEM_ROUTE_BANNER=1 HAKMEM_WARM_POOL_STATS=1 ./bench_random_mixed_hakmem_observe 20000000 400 1
    ```
 3.  **Check Output**:
--- a/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md
+++ b/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md
@ -0,0 +1,116 @@
 # Phase 74: UnifiedCache hit-path structural optimization (WS=400 SSOT)
 **Status**: 🟡 DRAFT (design SSOT / next instructions)
 ## 0) Background (why now)
 - Current baseline (Phase 69): `bench_random_mixed_hakmem_minimal_pgo` = **62.63M ops/s = 51.77% of mimalloc** (`HAKMEM_WARM_POOL_SIZE=16`)
 - Phase 70 (observability SSOT) established that at WS=400 (Mixed SSOT) **UnifiedCache misses are negligible**.
  - Speeding up `unified_cache_refill()` / WarmPool-pop therefore has **almost zero ROI** (refill optimization is frozen).
  - SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
 - Phase 73 (perf stat) established that the WarmPool=16 win is dominated by a **small reduction in instructions/branches**.
  - So "shortening the hit-path" remains the most promising direction for the next step as well.
  - Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
 The aim of this phase is to **structurally move branches and loads that need not be executed out of the UnifiedCache hit-path (push/pop)**.
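 ## 1) Goals / Non-goals
 **Goals**
 - Target **+1 to 3%** (per single change) on the WS=400 SSOT workload (accumulating toward M2=55%).
 - Avoid "optimizations on paths that are never executed" (respect the Phase 70 SSOT).
 **Non-goals**
 - Optimizing `unified_cache_refill()` (misses are negligible, so there is no ROI under the SSOT).
 - DCE via link-out / large deletions (there are many precedents of the sign flipping due to layout tax).
 - Changing the route kind and thereby the workload (first, keep the SSOT workload intact).
 ## 2) Box Theory (box layout)
 ### Box responsibilities
 L0: **EnvGateBox**
 - `HAKMEM_TINY_UC_*` toggles (default OFF, revertible at any time).
 L1: **TinyUnifiedCacheHitPathBox (NEW / research box)**
 - Shortens **only the hit-path** of `unified_cache_push/pop` (refill/overflow/registry are untouched).
 - One conversion point (boundary): the "fast→fallback" transition happens exactly once inside `unified_cache_push/pop`.
 ### Visibility (minimal)
 - Only two counters, `uc_hitpath_fast_hits` / `uc_hitpath_fast_fallbacks` (if needed).
 - Otherwise, `perf stat` (instructions/branches) is the source of truth.
 ## 3) Concrete proposals (by priority)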
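 ### P1 (low risk): copy fields into local variables to pin down reloads / dependency chains
 Aim:
 - Suppress reloads of `cache->head/tail/mask/capacity` etc. and **shorten the dependency chain**.
 Design:
 - Inside `unified_cache_push()` / `unified_cache_pop_or_refill()`:
  - **copy into locals**, e.g. `uint16_t head = cache->head;`
  - compute `next = (x + 1) & mask` **exactly once**
  - batch stores such as `cache->tail = next;` at the end
 Rollout:
 - ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
 - Method: ON/OFF within the same binary (to minimize layout tax, the branch is limited to one at entry)
 Risk:
 - Higher register pressure may make it slower instead → A/B is mandatory.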
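The P1 idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `TinyUnifiedCache` code: `DemoCache`, `demo_push_localized`, and `demo_pop_localized` are hypothetical names, and the ring is assumed to have a power-of-two capacity with one slot kept empty to distinguish full from empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in; the real TinyUnifiedCache layout may differ. */
typedef struct {
    void   **slots;  /* ring storage; capacity is a power of two */
    uint16_t head;   /* next index to pop */
    uint16_t tail;   /* next index to push */
    uint16_t mask;   /* capacity - 1 */
} DemoCache;

/* Push: read head/tail/mask once, compute next once, store tail last.
 * Returns 1 on success, 0 when the ring is full. */
static inline int demo_push_localized(DemoCache *cache, void *base)
{
    const uint16_t head = cache->head;
    const uint16_t tail = cache->tail;
    const uint16_t mask = cache->mask;
    assert(((mask + 1) & mask) == 0);            /* mask must be 2^n - 1 */

    const uint16_t next = (uint16_t)((tail + 1) & mask);
    if (next == head) return 0;                  /* full: caller falls back */

    cache->slots[tail] = base;
    cache->tail = next;                          /* single store at the end */
    return 1;
}

/* Pop: same pattern. Returns NULL when the ring is empty. */
static inline void *demo_pop_localized(DemoCache *cache)
{
    const uint16_t head = cache->head;
    const uint16_t tail = cache->tail;
    const uint16_t mask = cache->mask;

    if (head == tail) return NULL;               /* empty: caller refills */

    void *base = cache->slots[head];
    cache->head = (uint16_t)((head + 1) & mask);
    return base;
}
```

Whether the compiler keeps the three locals in registers or spills them depends on the surrounding inlined code, which is why the risk item above calls for an A/B measurement rather than assuming a win.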
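 ### P0 (medium risk / medium ROI): Fast-API (move enable checks / stats out of the hit-path)
 Aim:
 - Push the "almost invariant checks" remaining in the hit-path **out to the caller**, making `push/pop` straight-line code.
 Design:
 - Add a **minimal API** such as `unified_cache_push_fast(TinyUnifiedCache* cache, void* base)`
  - Precondition: the caller guarantees "enabled / initialized / stats OFF"
  - Fall back to the existing `unified_cache_push()` only on failure (single boundary)
 Rollout:
 - ENV: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0)
 - Fail-fast: if the mode changes midway, go to the “safe fallback” (abort is acceptable for bench use)
 Risk:
 - More call sites shift the layout → GO threshold is +1.0% (strict).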
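A rough sketch of the Fast-API shape described above, assuming a simplified ring cache. All names here (`DemoCache`, `demo_push`, `demo_push_fast`, `demo_push_many`, the `enabled` / `stats_on` fields) are hypothetical and only illustrate the split between a checked push and a precondition-based fast push with a single fallback point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    void        **slots;
    uint16_t      head, tail, mask;
    int           enabled;       /* near-invariant after init */
    int           stats_on;      /* near-invariant after init */
    unsigned long stat_pushes;
} DemoCache;

/* Existing-style push: re-checks near-invariant state on every call. */
static int demo_push(DemoCache *c, void *base)
{
    if (!c->enabled || c->slots == NULL) return 0;
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;
    c->slots[c->tail] = base;
    c->tail = next;
    if (c->stats_on) c->stat_pushes++;
    return 1;
}

/* Fast push. Precondition (guaranteed by the caller):
 * enabled, initialized, stats off. */
static inline int demo_push_fast(DemoCache *c, void *base)
{
    assert(c->enabled && c->slots != NULL && !c->stats_on);
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return demo_push(c, base);  /* the single fast->fallback boundary */
    c->slots[c->tail] = base;
    c->tail = next;
    return 1;
}

/* Caller: evaluate the near-invariant checks once, outside the loop. */
static size_t demo_push_many(DemoCache *c, void **items, size_t n)
{
    size_t pushed = 0;
    if (c->enabled && c->slots != NULL && !c->stats_on) {
        for (size_t i = 0; i < n; i++) pushed += (size_t)demo_push_fast(c, items[i]);
    } else {
        for (size_t i = 0; i < n; i++) pushed += (size_t)demo_push(c, items[i]);
    }
    return pushed;
}
```

The duplicated loop in `demo_push_many` is the kind of extra call site the risk item refers to: it removes per-call checks but adds code, so layout effects need to be measured.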
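 ### P2 (high risk / high-ROI candidate): place slots directly in TLS for hot classes only (fewer pointer chases)
 Aim:
 - Eliminate the load of `cache->slots` (pointer chasing) on the hit-path.
 Design:
 - Move "hot classes only" of `TinyUnifiedCache` into a separate structure and place `slots[]` directly inside TLS.
  - Candidates: the small-capacity classes C4/C5/C6/C7 (the 2048 slots of C2/C3 are too heavy to inline)
 Risk:
 - A larger TLS footprint can worsen dTLB/cache behavior (a big win if it works, but NO-GO is possible).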
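To make the P2 layout difference concrete, here is a hedged sketch contrasting a pointer-based cache with an inline-slots variant held in thread-local storage. The class count, slot count, and every name (`DemoCache`, `DemoHotCache`, `g_demo_hot`, `demo_hot_push`, `demo_hot_pop`) are assumptions chosen for illustration; the real capacities and structures are not stated in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HOT_CLASSES 4    /* hypothetical: e.g. four hot classes */
#define DEMO_HOT_SLOTS   64   /* hypothetical capacity, power of two */

/* Pointer-based layout: a hit must load `slots` before it can index it. */
typedef struct {
    void   **slots;
    uint16_t head, tail, mask;
} DemoCache;

/* Inline layout: slots[] lives inside the TLS object, so a hit indexes it
 * at a fixed offset without the extra pointer load. */
typedef struct {
    uint16_t head, tail;
    void    *slots[DEMO_HOT_SLOTS];
} DemoHotCache;

static _Thread_local DemoHotCache g_demo_hot[DEMO_HOT_CLASSES];

static inline int demo_hot_push(unsigned hot_idx, void *base)
{
    assert(hot_idx < DEMO_HOT_CLASSES);
    DemoHotCache *c = &g_demo_hot[hot_idx];
    uint16_t next = (uint16_t)((c->tail + 1) & (DEMO_HOT_SLOTS - 1));
    if (next == c->head) return 0;               /* full */
    c->slots[c->tail] = base;
    c->tail = next;
    return 1;
}

static inline void *demo_hot_pop(unsigned hot_idx)
{
    assert(hot_idx < DEMO_HOT_CLASSES);
    DemoHotCache *c = &g_demo_hot[hot_idx];
    if (c->head == c->tail) return NULL;         /* empty */
    void *base = c->slots[c->head];
    c->head = (uint16_t)((c->head + 1) & (DEMO_HOT_SLOTS - 1));
    return base;
}
```

With these illustrative sizes the TLS array holds 4 x 64 pointers per thread; scaling the slot count up multiplies that footprint, which is the dTLB/cache risk the section names.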
 ## 4) A/B (SSOT)
 ### 4.1 Benchmark conditions (fixed)
 - `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`)
 - `HAKMEM_WARM_POOL_SIZE=16` (baseline)
 ### 4.2 GO/NO-GO
 - **GO**: +1.0% or more
 - **NEUTRAL**: within ±1.0% (research box freeze)
 - **NO-GO**: -1.0% or less (immediate revert)
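The thresholds above amount to a three-way classification of the relative throughput delta. A small sketch, assuming throughput is compared as mean M ops/s; `DemoVerdict`, `demo_delta_pct`, and `demo_judge` are hypothetical helper names, not part of the benchmark scripts.

```c
#include <assert.h>

typedef enum { DEMO_NO_GO = -1, DEMO_NEUTRAL = 0, DEMO_GO = 1 } DemoVerdict;

/* Relative change of treatment vs baseline, in percent. */
static double demo_delta_pct(double baseline_mops, double treatment_mops)
{
    assert(baseline_mops > 0.0);
    return (treatment_mops - baseline_mops) / baseline_mops * 100.0;
}

/* GO at +1.0% or more, NO-GO at -1.0% or less, NEUTRAL in between. */
static DemoVerdict demo_judge(double baseline_mops, double treatment_mops)
{
    const double d = demo_delta_pct(baseline_mops, treatment_mops);
    if (d >= 1.0)  return DEMO_GO;
    if (d <= -1.0) return DEMO_NO_GO;
    return DEMO_NEUTRAL;
}
```

For example, a baseline of 100.0 M ops/s against a treatment of 100.5 M ops/s is a +0.5% delta and lands in the NEUTRAL band.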
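 ### 4.3 Additional mandatory checks (Phase 73 lesson)
 - `perf stat`: `instructions`, `branches`, `branch-misses` (because the winning path is instruction/branch reduction)
 - `cache-misses`, `iTLB-load-misses`, `dTLB-load-misses` (layout tax detection)
 ## 5) Near-term implementation order (recommended)
 1. Land **P1 (LOCALIZE)** as a small change and A/B it (fastest way to confirm the winning path)
 2. If it wins, add **P0 (FASTAPI)** (move more branches out)
 3. If that is still not enough, try **P2 (inline slots hot)** as a research box
 ## 6) Exit conditions (when to stop)
 - If “unified_cache_push/pop” drops out of the Top 50 in `perf` on the WS=400 SSOT, abandon this line of work (Phase 42 lesson).
 - After three consecutive NEUTRAL/NO-GO results, move on to the next structure (a different layer), since the layout tax risk grows.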
--- a/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md
@ -0,0 +1,75 @@
 # Phase 74-1: UnifiedCache hit-path “LOCALIZE” implementation instructions
 **Status**: 🟡 READY
 ## Goal
 At WS=400 (Mixed SSOT) almost only the hit-path is executed, so **shorten the dependency chain (reloads)** in `unified_cache_push/pop` to cut instructions/branches.
 - Design SSOT: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
 - Observability SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md` (refill optimization is frozen)
 ## Principles (Box Theory)
 - L0: add an ENV gate box (default OFF, revertible at any time)
 - L1: changes confined to the inside of `unified_cache_push/pop` (single boundary)
 - Keep visibility minimal (perf stat is the source of truth by default)
 - Fail-fast: when in doubt, fall back
 ## Step 0: Baseline check (SSOT)
 ```bash
 scripts/run_mixed_10_cleanenv.sh
 ```
 ## Step 1: ENV gate (L0 box)
 New file:
 - `core/box/tiny_unified_cache_hitpath_env_box.h` (example)
 ENV:
 - `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
 Requirements:
 - Do not call getenv on the hot path (use the existing lazy-init pattern, or fix the value with a build flag)
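One common way to satisfy the "no getenv on the hot path" requirement is to cache the parsed value in a static variable on first use. The sketch below assumes that shape; `g_demo_uc_localize`, `demo_uc_localize_init`, and `demo_uc_localize_enabled` are hypothetical names, and the project's existing lazy-init pattern may differ (for example in how it handles concurrent first calls, which this sketch does not synchronize).

```c
#include <assert.h>
#include <stdlib.h>

/* -1 = not read yet, 0 = off, 1 = on. Hypothetical cache variable. */
static int g_demo_uc_localize = -1;

/* Slow part: runs getenv once and caches the parsed result. */
static int demo_uc_localize_init(void)
{
    const char *e = getenv("HAKMEM_TINY_UC_LOCALIZE");
    int v = (e != NULL && e[0] == '1') ? 1 : 0;   /* unset or "0" means OFF */
    g_demo_uc_localize = v;
    return v;
}

/* Hot-path check: after the first call this is one load and one compare. */
static inline int demo_uc_localize_enabled(void)
{
    int v = g_demo_uc_localize;
    if (v < 0) v = demo_uc_localize_init();
    assert(v == 0 || v == 1);
    return v;
}
```

Even with the value cached, each call still costs a load and a conditional branch, so a gate like this is not free on a very short hit-path.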
 ## Step 2: LOCALIZE implementation (L1 box)
 Target:
 - `unified_cache_push()` / `unified_cache_pop_or_refill()` in `core/front/tiny_unified_cache.h`
 Policy:
 - Copy `cache->head/tail/mask/capacity` into locals to **prevent reloads**
 - Batch stores at the end (e.g. `cache->tail = next_tail;`)
 - Do not change semantics (preserve the meaning of capacity/ordering/stats/overflow)
 Rollout pattern (example):
 - Under `if (!tiny_uc_localize_enabled())`, run the existing implementation unchanged
 - Call the localized version only when `enabled`
 ## Step 3: A/B (same binary)
 ```bash
 HAKMEM_TINY_UC_LOCALIZE=0 scripts/run_mixed_10_cleanenv.sh
 HAKMEM_TINY_UC_LOCALIZE=1 scripts/run_mixed_10_cleanenv.sh
 ```
 Additionally (mandatory, since the winning path is instructions/branches):
 ```bash
 perf stat -e cycles,instructions,branches,branch-misses,cache-misses,iTLB-load-misses,dTLB-load-misses -- \
  ./bench_random_mixed_hakmem_minimal_pgo 20000000 400 1
 ```
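 ## Verdict
 - **GO**: +1.0% or more
 - **NEUTRAL**: within ±1.0% (research box freeze)
 - **NO-GO**: -1.0% or less (immediate revert)
 Triage for NO-GO:
 - Use `scripts/box/layout_tax_forensics_box.sh` (classifies layout tax / IPC drop / TLB degradation)
 ## Step 4: Promotion policy
 - Even on a first GO, **do not make it default ON** (first confirm reproducibility with three independent re-measurements)
 - If all three are GO, consider promoting it into `scripts/run_mixed_10_cleanenv.sh` / `core/bench_profile.h`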
--- a/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md
+++ b/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md
@ -0,0 +1,140 @@
 # Phase 74: UnifiedCache hit-path structural optimization - Results
 **Status**: 🔴 P1 (LOCALIZE) FROZEN (NEUTRAL -0.87%)
 ## Summary
 Phase 74 investigated **unified_cache_push/pop** hit-path optimizations to achieve +1-3% via instruction/branch reduction (the Phase 73 lesson).
 **P1 (LOCALIZE)** attempted to reduce dependency chains by loading `head/tail/mask` into locals, but was **frozen at NEUTRAL (-0.87%)** due to cache-miss increase.
 ---
 ## Phase 74-1: LOCALIZE (ENV-gated, runtime branch)
 **Goal**: Load `head/tail/mask` once into locals to avoid reload dependency chains.
 **Implementation**:
 - ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
 - Runtime branch at entry: `if (tiny_uc_localize_enabled()) { ... }`
 **Results** (10-run A/B):
 | Metric | LOCALIZE=0 | LOCALIZE=1 | Delta |
 |--------|------------|------------|-------|
 | throughput | 57.43 M ops/s | 57.72 M ops/s | **+0.50%** |
 | instructions | 4,583M | 4,615M | **+0.7%** |
 | branches | 1,276M | 1,281M | **+0.4%** |
 | cache-misses | 560K | 461K | -17.7% |
 **Diagnosis**: Runtime branch overhead dominated. Instructions/branches **increased** despite LOCALIZE intent.
 **Judgment**: **NEUTRAL (+0.50%, ±1.0% threshold)** → Proceed to Phase 74-2 (compile-time gate).
 ---
 ## Phase 74-2: LOCALIZE (compile-time gate, no runtime branch)
 **Goal**: Eliminate the runtime branch to isolate the performance of LOCALIZE itself.
 **Implementation**:
 - Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
 - Compile-time gate: `#if HAKMEM_TINY_UC_LOCALIZE_COMPILED` (no runtime branch)
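The compile-time gate can be pictured as below. This is a sketch, assuming a simplified ring cache: only the flag name `HAKMEM_TINY_UC_LOCALIZE_COMPILED` comes from this document, while `DemoCache` and `demo_push` are hypothetical and the real `#if` blocks in `core/front/tiny_unified_cache.h` may be organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
#define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0   /* default OFF */
#endif

typedef struct { void **slots; uint16_t head, tail, mask; } DemoCache;

/* Only one of the two bodies is compiled, so the selected variant carries
 * no runtime gate check at all. Returns 1 on success, 0 when full. */
static inline int demo_push(DemoCache *cache, void *base)
{
    assert(cache != NULL && cache->slots != NULL);
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    /* Treatment: fields copied into locals once. */
    const uint16_t head = cache->head, tail = cache->tail, mask = cache->mask;
    const uint16_t next = (uint16_t)((tail + 1) & mask);
    if (next == head) return 0;
    cache->slots[tail] = base;
    cache->tail = next;
#else
    /* Baseline: fields accessed through the pointer each time. */
    if ((uint16_t)((cache->tail + 1) & cache->mask) == cache->head) return 0;
    cache->slots[cache->tail] = base;
    cache->tail = (uint16_t)((cache->tail + 1) & cache->mask);
#endif
    return 1;
}
```

The trade-off versus the ENV gate is that A/B now needs two separately built binaries (for instance by defining the macro to 1 on the compiler command line), so layout differences between the builds become part of what is measured.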
 **Results** (10-run A/B via `layout_tax_forensics_box.sh`):
 | Metric | Baseline (=0) | Treatment (=1) | Delta |
 |--------|---------------|----------------|-------|
 | **throughput** | 58.90 M ops/s | 58.39 M ops/s | **-0.87%** |
 | cycles | 1,553M | 1,548M | -0.3% |
 | **instructions** | 2,748M | 2,733M | **-0.6%** |
 | **branches** | 632M | 617M | **-2.3%** |
 | **cache-misses** | 707K | 1,316K | **+86%** |
 | dTLB-load-misses | 46K | 33K | -28% |
 **Analysis**:
 1. **Runtime branch overhead removed** → instructions/branches improved (-0.6%/-2.3%) ✓
 2. **LOCALIZE itself is effective** → dependency chain reduction confirmed ✓
 3. **But cache-misses +86%** → register pressure / spill / worse access pattern
 4. **Net result: -0.87%** → cache-miss increase dominates instruction/branch savings
 **Phase 74-1 vs 74-2 comparison**:
 - 74-1 (runtime branch): instructions +0.7%, branches +0.4% → **branch overhead loses**
 - 74-2 (compile-time): instructions -0.6%, branches -2.3% → **LOCALIZE itself wins**
 - But cache-misses +86% cancels out → **total NEUTRAL**
 **Judgment**: **NEUTRAL (-0.87%, below +1.0% GO threshold)** → **P1 FROZEN**
 ---
 ## Root Cause (Phase 74-2)
 **Why cache-misses increased (+86%)**:
 1. **Register pressure hypothesis**: Loading `head/tail/mask` into locals increases live registers
   - Compiler may spill to stack → more memory traffic
   - `cache->slots[head]` may lose prefetch opportunity
 2. **Access pattern change**: `cache->head` direct load may benefit from compiler optimizations
   - Storing to local breaks dependency tracking?
   - Memory alias analysis degraded?
 **Evidence**:
 - dTLB-misses decreased (-28%) → data layout not the issue
 - L1-dcache-load-misses similar → not a TLB/page issue
 - cache-misses (+86%) is the PRIMARY BLOCKER
 ---
 ## Lessons Learned
 1. **Runtime branch tax is real**: Phase 74-1 showed +0.7% instruction increase from ENV gate
 2. **LOCALIZE itself works**: Phase 74-2 confirmed -2.3% branches once the runtime branch was removed
 3. **Register pressure matters**: Even when instruction count drops, cache behavior can dominate
 4. **This optimization path has low ROI**: Dependency chain reduction is fragile to cache effects
 **Conclusion**: P1 (LOCALIZE) frozen. Move to **P0 (FASTAPI)** (different approach: move branches outside hot loop).
 ---
 ## P1 (LOCALIZE) - Frozen State
 **Files**:
 - `core/hakmem_build_flags.h`: `HAKMEM_TINY_UC_LOCALIZE_COMPILED` (default 0)
 - `core/box/tiny_unified_cache_hitpath_env_box.h`: ENV gate (unused after 74-2)
 - `core/front/tiny_unified_cache.h`: compile-time `#if` blocks
 **Default behavior**: LOCALIZE=0 (original implementation)
 **Rollback**: No action needed (default OFF)
 ---
 ## Next Steps
 **Phase 74-3: P0 (FASTAPI)**
 **Goal**: Move `unified_cache_enabled()` / `lazy-init` / `stats` checks **outside** hot loop.
 **Approach**:
 - Create `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
 - Assume: "valid/enabled/no-stats" at caller side
 - Fail-fast: fallback to slow path on unexpected state
 - ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
 **Expected benefit**: +1-2% via branch reduction (different axis than P1)
 **GO threshold**: +1.0% (strict, structural change)
 ---
 ## Artifacts
 - **Design**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
 - **Instructions**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
 - **Results**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md` (this file)
 - **Forensics output**: `./results/layout_tax_forensics/` (Phase 74-2 perf data)
 ---
 ## Timeline
 - Phase 74-1: ENV-gated LOCALIZE → **NEUTRAL (+0.50%)**
 - Phase 74-2: Compile-time LOCALIZE → **NEUTRAL (-0.87%)** → **P1 FROZEN**
 - Phase 74-3: P0 (FASTAPI) → (next)
--- a/hakmem.d
+++ b/hakmem.d
@ -103,6 +103,7 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \
 core/box/../front/../box/../hakmem_tiny_config.h \
 core/box/../front/../box/../tiny_nextptr.h \
 core/box/../front/../box/tiny_tcache_env_box.h \
 core/box/../front/../box/tiny_unified_cache_hitpath_env_box.h \
 core/box/../front/../tiny_region_id.h core/box/../front/../hakmem_tiny.h \
 core/box/../front/../box/tiny_env_box.h \
 core/box/../front/../box/tiny_front_hot_box.h \
@ -361,6 +362,7 @@ core/box/../front/../box/tiny_tcache_box.h:
 core/box/../front/../box/../hakmem_tiny_config.h:
 core/box/../front/../box/../tiny_nextptr.h:
 core/box/../front/../box/tiny_tcache_env_box.h:
 core/box/../front/../box/tiny_unified_cache_hitpath_env_box.h:
 core/box/../front/../tiny_region_id.h:
 core/box/../front/../hakmem_tiny.h:
 core/box/../front/../box/tiny_env_box.h: