diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md index e091896d..3bcdb12b 100644 --- a/CURRENT_TASK.md +++ b/CURRENT_TASK.md @@ -1,236 +1,89 @@ # CURRENT_TASK(Rolling, SSOT) -## 0) 今の「正」 +## 0) 今の「正」(SSOT) -- **性能比較の正**: FAST PGO build(`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`)✓ **Phase 69 昇格済み** (Warm Pool Size=16) +- **性能比較の正**: FAST PGO build(`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`)+ **WarmPool=16**(Phase 69 強GOで昇格済み) - **安全・互換の正**: Standard build(`make bench_random_mixed_hakmem`) - **観測の正**: OBSERVE build(`make perf_observe`) -- **スコアカード**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md`(M1 達成・超過: 51.77% vs 50% target、M2 まで残り +3.23pp) -- **計測の正(Mixed 10-run)**: `scripts/run_mixed_10_cleanenv.sh`(`ITERS=20000000 WS=400`、`HAKMEM_WARM_POOL_SIZE=16` デフォルト) +- **スコアカード(目標/現在値)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` + - Current baseline(FAST v3 + PGO, Phase 69): **62.63M ops/s = 51.77% of mimalloc** + - 次の目標: **M2 = 55%**(残り **+3.23pp**) +- **Mixed 10-run SSOT**: `scripts/run_mixed_10_cleanenv.sh`(`ITERS=20000000 WS=400`、`HAKMEM_WARM_POOL_SIZE=16` デフォルト) -## 1) 現状(要点) +## 1) 迷子防止(経路/観測) -- Phase 64(backend prune / DCE): **NO-GO**(-4.05%) → layout tax 由来 -- Phase 63(FAST_PROFILE_FIXED): **研究用ビルド**として保持(FAST の gate を compile-time 固定) -- Phase 65(Hot Symbol Ordering): **BLOCKED**(GCC+LTO の制約で不公平/不可能)→ `docs/analysis/PHASE65_HOT_SYMBOL_ORDERING_1_RESULTS.md` -- Phase 66(PGO, GCC+LTO): **GO** ✓ - - 検証: 3回独立実行で +3.0% mean, all >+2.89%, 分散 <±1% - - Baseline: `bench_random_mixed_hakmem_minimal_pgo` = 60.89M ops/s = 50.32% (initial PGO) -- Phase 68(PGO training set 最適化): **GO & 昇格完了** ✓ - - 検証: 10-run で +1.19% vs Phase 66 (GO: +1.0% threshold超過) - - Baseline (upgraded): `bench_random_mixed_hakmem_minimal_pgo` = 61.614M ops/s = **50.93%** (50% target 超過、+0.93pp) -- Phase 69(Refill tuning: Warm Pool Size 最適化): **強GO & 昇格完了** ✓✓✓ - - 検証: 10-run で +3.26% vs Phase 68 (強GO: +3.0% threshold超過) - - 新 baseline: 
`bench_random_mixed_hakmem_minimal_pgo` (upgraded) = 62.63M ops/s = **51.77%** (M1 超過、+1.77pp、M2 まで残り +3.23pp) +“経路が踏まれていない最適化” を防ぐための最小手順。 -## 2) 次の指示書(Active) +- **Route Banner(経路の誤認を潰す)**: `HAKMEM_ROUTE_BANNER=1` + - 出力: Route assignments(backend route kind)+ cache config(`unified_cache_enabled` / `warm_pool_max_per_class`) +- **Refill観測のSSOT**: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md` + - WS=400(Mixed SSOT)では miss が極小 → `unified_cache_refill()` 最適化は **凍結(ROIゼロ)** -**Phase 68: PGO training set 最適化** ✅ **完了** +## 2) 直近の結論(要点だけ) -- ✓ seed/WS diversification: WS (3→5パターン), seed (1→3パターン) -- ✓ 10-run 検証: +1.19% vs Phase 66 (GO threshold +1.0% 超過) -- ✓ Baseline 昇格: 61.614M ops/s = 50.93% (M1 target 50% を +0.93pp 超過) -- ✓ スコアカード・CURRENT_TASK 更新完了 - ---- - -**Phase 67a: Layout Tax 法医学(変更最小)** ✅ **完了・実運用可能** - -- ✓ `scripts/box/layout_tax_forensics_box.sh` 新規(測定ハーネス) - - Baseline vs Treatment の 10-run throughput 比較 - - perf stat 自動収集(cycles, IPC, branches, branch-misses, cache-misses, iTLB/dTLB) - - Binary metadata(サイズ、セクション構成) - -- ✓ `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md` 新規(診断ガイド) - - 判定ルール: GO (+1% 以上) / NEUTRAL (±1%) / NO-GO (-1% 以下) - - "症状→原因候補" マッピング表 - * IPC 低下 3%↑ → I-cache miss / code layout dispersal - * branch-misses ↑10%↑ → branch prediction penalty - * dTLB-misses ↑100%↑ → data layout fragmentation - - Phase 64 case study(-4.05% の root cause: IPC 2.05 → 1.98) - - 運用ガイドライン - -**使用例**: -```bash -./scripts/box/layout_tax_forensics_box.sh \ - ./bench_random_mixed_hakmem_minimal_pgo \ - ./bench_random_mixed_hakmem_fast_pruned # or Phase 64 attempt -``` - -成果: 「削る系」NO-GO が出た時に、どの指標が悪化しているかを **1回で診断可能** → 以後の link-out/大削除を事前に止められる - ---- - -**Phase 69: "refill頻度×固定税" を削る(M2への最短距離)** - -**Phase 69-0: パラメータ sweep 設計メモ** ✅ **完了** - -- ✓ `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md` 作成 -- ✓ Tunable parameters 特定: - - `HAKMEM_TINY_REFILL_COUNT_MID` / `HAKMEM_TINY_REFILL_COUNT_HOT`(refill 量の実体, ENV-only) - - Unified Cache C5-C7 
capacity (128 → 256/512) - - Warm Pool size (12 → 16/24) -- ✓ Sweep 計画立案(single-parameter → combined optimization) -- ✓ Risk assessment & 判定基準定義 - -**Phase 69-1: Sweep 実行** ✅ **完了** - -- ✓ Baseline (Phase 68 PGO): 60.65M ops/s (10-run mean) -- ✓ Warm Pool Size sweep: - - Size=16: **62.63M ops/s (+3.26%, 強GO)** ✓✓✓ **Winner** - - Size=24: 62.37M ops/s (+2.84%, GO) -- ✓ Unified Cache C5-C7 sweep: - - Cache=256: 61.92M ops/s (+2.09%, GO) - - Cache=512: 61.80M ops/s (+1.89%, GO) -- ✓ Combined optimization check: - - Warm=16 + Cache=256: 62.35M ops/s (+2.81%, non-additive) -- ✓ “Refill Batch Size sweep” は無効(knob 未接続): - - `TINY_REFILL_BATCH_SIZE` は現行 Tiny front に call site が無く、性能 knob として成立していない - - 参照: `docs/analysis/PHASE69_REFILL_TUNING_3C_REFILL_BATCH_KNOB_AUDIT.md` -- **結果**: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md` -- **勝ち設定**: **Warm Pool Size=16 (ENV-only, +3.26%, 強GO)** - -**Phase 69-2: 勝ち設定を baseline に反映** ✅ **完了** - -- ✓ `scripts/run_mixed_10_cleanenv.sh` に `HAKMEM_WARM_POOL_SIZE=16` デフォルト追加 -- ✓ `core/bench_profile.h` の `MIXED_TINYV3_C7_SAFE` preset に `bench_setenv_default("HAKMEM_WARM_POOL_SIZE","16")` 追加 -- ✓ `PERFORMANCE_TARGETS_SCORECARD.md` に新 baseline 追加: - - Phase 69 baseline: 62.63M ops/s = 51.77% of mimalloc - - M1 (50%) achievement: **EXCEEDED** (+1.77pp above target) - - M2 (55%) progress: Gap reduced to +3.23pp -- ✓ Rollback: `HAKMEM_WARM_POOL_SIZE=12` or ENV 変数削除 - -**新 baseline**: 62.63M ops/s = mimalloc の **51.77%** (Phase 68 から +3.26%、M2 まで残り +3.23pp) - ---- - -**Phase 69-3(次候補): refill 量(ENV-only)sweep OR 次の sweep** - -- **選択肢 A(推奨)**: Refill count の ENV sweep(コード変更なし) - - `HAKMEM_TINY_REFILL_COUNT_MID`(C4–C7)を 64/96/128/160… で sweep - - `HAKMEM_TINY_REFILL_COUNT_HOT`(C0–C3)も同様に sweep(ただし WarmPool/UnifiedCache と相互作用あり) - - 判定: 10-run mean で GO(+1.0%) / 強GO(+3.0%) / NO-GO(-1.0%) - -- **選択肢 B**: Unified Cache の fine sweep(ENV-only) - - C5/C6/C7 を 192/256/320… などで sweep(Phase 69-1 の 256/512 は coarse) - - WarmPool=16 との非加算性を “原因切り分け” 
する - -- **選択肢 C**: compile-time knob の新設(後回し) - - `TINY_REFILL_BATCH_SIZE` は未接続なので、そのまま追わない - - 必要なら別途 SSOT を作って実装する(Phase 70+) - -- **選択肢 D**: 別方向の最適化(M2: 55% への最短距離) - - 残り gap: +3.23pp (51.77% → 55%) - - Phase 67b(境界 inline/unroll チューニング) - - Top 50 hot functions の最適化 - - PGO profile の再調整 - ---- - -**Phase 67b(後続・保険): 境界inline/unrollチューニング** -- **注意**: layout tax リスク高い(Phase 64 reference) -- **前提**: Top 50 実行確認が必須 -- Phase 69 が外れた時の保険として後回し推奨 - ---- - -**Phase 70(観測の前提固め): Refill/WarmPool 最適化の Step 0 を SSOT 化** - -- 目的: **“経路が踏まれていない最適化”** を防ぐ(Phase 40/41/64 の layout tax 前例) -- 注意: `Route assignments: LEGACY` は「Unified Cache 未使用」を意味しない(backend route kind) -- SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md` - - Mixed SSOT(WS=400)で `unified_cache_refill()` / WarmPool pop が有意に起きているかを **OBSERVE で確定**してから Phase 70 を進める -- ✅ Phase 70-1: Route Banner 実装(経路誤認の根絶) - - ENV: `HAKMEM_ROUTE_BANNER=1` - - 出力: Route assignments(backend route kind)+ cache config(unified_cache / warm_pool_max_per_class) -- ✅ Phase 70-3: OBSERVE 統計の整合性 SSOT(“見えてないだけ”事故の根絶) - - `Unified-STATS total_allocs == total_frees` を確認してから議論する(統計の信頼性ゲート) -- ✅ Phase 70-2: Refill 最適化の扱い確定(SSOT) - - Mixed SSOT(WS=400)で `Unified-STATS miss < 1000` なら **Refill 最適化は凍結(ROIゼロ)** - - 現状の実測: miss は極小(例: total miss=5)→ refill最適化は SSOT workload では ROI なし +- **Phase 69(WarmPool sweep)**: `HAKMEM_WARM_POOL_SIZE=16` が **強GO(+3.26%)**、baseline 昇格済み。 + - 設計: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md` + - 結果: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md` +- **Phase 70(観測SSOT)**: 統計の見える化/前提ゲート確立。WS=400 SSOT では refill は冷たい。 + - SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md` +- **Phase 71/73(WarmPool=16 の勝ち筋確定)**: 勝ち筋は **instruction/branch の微減**(perf stat で確定)。 - 詳細: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md` +- **Phase 72(ENV knob ROI枯れ)**: WarmPool=16 を超える ENV-only 勝ち筋なし → **構造(コード)で攻める段階**。 ---- +## 3) 運用ルール(Box Theory + layout tax 対策) -**Phase 73: WarmPool=16 の "勝ち筋" を 
perf で確定** ✅ **完了・パラドックス解決** +- 変更は必ず **箱 + 境界1箇所 + ENVで戻せる** で積む(Fail-fast、最小可視化)。 +- A/B は **同一バイナリでENVトグル**が原則(別バイナリ比較は layout が混ざる)。 +- “削除して速い” は封印(link-out/大削除は layout tax で符号反転しやすい)→ **compile-out** を優先。 + - 診断: `scripts/box/layout_tax_forensics_box.sh` / `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md` -- 背景: WarmPool=16 は throughput/CV を改善するが、Unified/WarmPool 等の可視カウンタはほぼ同一 → **「1回あたりのコスト差」**(TLB/LLC/周波数/配置)の可能性が高い -- 目的: WarmPool=12 vs 16 の差分を **perf stat** で "何が減ったか" に落とし、次の構造最適化(Phase 72)を決め打ちする -- 方式: **同一バイナリ + cleanenv + 交互実行**(layout tax/環境ドリフトを避ける) - - A: `HAKMEM_WARM_POOL_SIZE=12` - - B: `HAKMEM_WARM_POOL_SIZE=16` - - events: `cycles,instructions,branches,branch-misses,cache-misses,LLC-load-misses,iTLB-load-misses,dTLB-load-misses,page-faults` +## 4) 次の指示書(Active) -**結果**(パラドックス): -- ✅ Throughput: +0.91% (46.52M → 46.95M ops/s) -- ✅ **instructions**: -0.38% (-17.4M instructions) ← **PRIMARY WIN SOURCE** -- ✅ **branches**: -0.30% (-3.7M branches) ← **SECONDARY WIN SOURCE** -- ⚠️ **dTLB-load-misses**: +29.06% (28,792 → 37,158) ← **WORSE** -- ⚠️ **cache-misses**: +17.80% (458K → 540K) ← **WORSE** -- ✓ page-faults: -0.21% (negligible) +### Phase 74(構造): UnifiedCache hit-path を短くする ✅ **P1 (LOCALIZE) 凍結** -**Phase 71 仮説(REJECTED)**: -- 予測: "TLB/cache efficiency improvement from memory layout" -- 実測: TLB/cache metrics both **DEGRADED** +**前提**: +- WS=400 SSOT では UnifiedCache miss が極小 → refill最適化は ROIゼロ。 +- WarmPool=16 の勝ちは instruction/branch 微減 → hit-path を短くするのが正攻法。 -**Phase 73 確定**: -- 勝ち筋: **Control-flow optimization (instruction/branch count reduction)** -- 機構: WarmPool=16 がより短い code path を選択 → 17.4M instructions 削減 -- Trade-off: +4MB RSS → worse TLB/cache, but instruction savings dominate -- Net benefit: ~8.2M cycles saved (instruction/branch) >> ~4.2M cycles lost (TLB/cache) +**Phase 74-1: LOCALIZE (ENV-gated)** ✅ **完了 (NEUTRAL +0.50%)** +- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` +- Runtime branch overhead で instructions/branches **増加** 
(+0.7%/+0.4%) +- 判定: **NEUTRAL (+0.50%)** -**詳細**: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md` Phase 73 section +**Phase 74-2: LOCALIZE (compile-time gate)** ✅ **完了 (NEUTRAL -0.87%)** +- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0) +- Runtime branch 削除 → instructions/branches **改善** (-0.6%/-2.3%) ✓ +- しかし **cache-misses +86%** (register pressure / spill) → throughput **-0.87%** +- 切り分け成功: **LOCALIZE本体は勝ち、cache-miss 増加で相殺** +- 判定: **NEUTRAL (-0.87%)** → **P1 (LOCALIZE) 凍結** -**Phase 72(構造): WarmPool=16 の勝ち筋を増幅(Phase 73 結果が出てから)** +**結論**: +- P1 (LOCALIZE) は default OFF で凍結(dependency chain 削減の ROI 低い) +- 次: **Phase 74-3 (P0: FASTAPI)** へ進む -- 前提: Phase 73 で “勝ち筋” を数値で確定してから着手(推測で弄ると Phase 40/41/64 の再発) -- Phase 73 の結論: **instruction/branch 減が支配的**(TLB/cache はむしろ悪化)→「WarmPool=16 が “短い経路” を踏ませている」ことが本質 +**Phase 74-3: P0 (FASTAPI)** 🟡 **次の指示書** -**Phase 72-0(SSOT): “どの関数が短くなったか” を特定してから構造に入る** +**Goal**: `unified_cache_enabled()` / `lazy-init` / `stats` 判定を **hot loop の外へ追い出す** -- A/B は WarmPool=12 vs 16 のまま(同一バイナリ・cleanenv) -- perf record を **cycles ではなく instruction/branch で取る**(原因が instruction/branch 減だから) - - `perf record -e instructions:u -c 100000 -- ./bench_random_mixed_hakmem_observe 20000000 400 1` - - `perf record -e branches:u -c 100000 -- ./bench_random_mixed_hakmem_observe 20000000 400 1` -- 目的: WarmPool=16 で **instruction share / branch share が減った関数 top 3** を確定(例: `shared_pool_acquire_slab`, `unified_cache_refill`, `warm_pool_do_prefill`, `superslab_refill` 等) +**Approach**: +- `unified_cache_push_fast()` / `unified_cache_pop_fast()` API 追加 +- 前提: "valid/enabled/no-stats" を caller 側で保証 +- Fail-fast: 想定外の状態なら slow path へ fallback(境界1箇所) +- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box) -**Phase 72-1(構造): 特定した関数にだけ手を入れる(箱の境界 1 箇所化)** ✅ **キャンセル(ROIゼロ)** +**Expected**: +1-2% via branch reduction (P1 と異なる軸) -- perf record 結果: `unified_cache_push` が -0.86% branches(最大削減) -- 当初計画: Unified Cache の FULL drain 最適化 -- 
**キャンセル理由**: 全クラスで `full=0`(FULL イベントが発生していない)→ ROI ゼロ +**判定**: +- **GO**: +1.0% 以上 +- **NEUTRAL**: ±1.0%(freeze、次へ) +- **NO-GO**: -1.0% 以下(即 revert) -**Phase 72-2: WarmPool 追加 sweep** ✅ **完了(ROI枯れ)** +**参考**: +- 設計: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md` +- 指示書: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md` +- 結果 (P1): `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md` -- 目的: WarmPool=16 以外に勝者がいるか確認 -- Baseline: WarmPool=16 = 56.23M ops/s (10-run) -- 結果: - - WarmPool=20: 56.13M ops/s (**-0.18%**, NO-GO) - - WarmPool=24: 56.30M ops/s (**+0.12%**, 誤差範囲) - - WarmPool=32: 56.07M ops/s (**-0.28%**, NO-GO) -- **判定**: 全候補が ±0.5% 以内 → **Phase 72 終了(ENV knob ROI 枯れ)** - ---- - -**Phase 72 総括**: -- **確定**: WarmPool=16 が最適値(Phase 69 で確定、Phase 72 で再確認) -- **確定**: ENV knob による追加最適化の余地なし -- **勝ち筋**: instruction/branch 削減が支配的(Phase 73 で確定) -- **次のステップ**: 構造変更(コード変更)が必要 - -**注記**: 研究箱の削除は今やらない(link-out/削除が layout tax を起こす前例が強いので、compile-out維持が正解) - ---- - -**Phase 74(次候補): 構造変更による最適化** - -- **前提**: ENV knob ROI 枯れ → コード変更が必要 -- **候補 A**: `unified_cache_push` の branch 削減(Phase 72-0 で最大寄与確認済み) -- **候補 B**: hot path の inline 強化(layout tax リスクあり、要 forensics) -- **候補 C**: PGO profile 再調整(WarmPool=16 前提で retrain) -- **判定基準**: +1.0% → GO、+0.5% 未満 → NO-GO - -## 3) アーカイブ +## 5) アーカイブ - 詳細ログ: `CURRENT_TASK_ARCHIVE_20251210.md` -- 直近整理前スナップショット: `docs/analysis/CURRENT_TASK_ARCHIVE.md` +- 整理前スナップショット: `docs/analysis/CURRENT_TASK_ARCHIVE.md` diff --git a/core/box/tiny_unified_cache_hitpath_env_box.h b/core/box/tiny_unified_cache_hitpath_env_box.h new file mode 100644 index 00000000..dabc5f07 --- /dev/null +++ b/core/box/tiny_unified_cache_hitpath_env_box.h @@ -0,0 +1,32 @@ +// tiny_unified_cache_hitpath_env_box.h - Phase 74: ENV gate for hit-path LOCALIZE +// +// Purpose: ENV-gated toggle for unified_cache_push/pop LOCALIZE optimization +// Design: lazy-init pattern to avoid hot-path getenv overhead 
+// +// ENV: HAKMEM_TINY_UC_LOCALIZE=0/1 (default 0, OFF) +// +// Box Theory: +// L0: ENV gate (this file) +// L1: LOCALIZE implementation (in tiny_unified_cache.h) + +#ifndef HAK_BOX_TINY_UNIFIED_CACHE_HITPATH_ENV_BOX_H +#define HAK_BOX_TINY_UNIFIED_CACHE_HITPATH_ENV_BOX_H + +#include <stdlib.h> + +// ============================================================================ +// Phase 74: LOCALIZE ENV Gate (lazy-init, cached) +// ============================================================================ + +// Check if LOCALIZE optimization is enabled +// Uses lazy-init pattern: getenv called once, then cached +static inline int tiny_uc_localize_enabled(void) { + static int g_enabled = -1; // -1 = uninitialized + if (__builtin_expect(g_enabled == -1, 0)) { + const char* e = getenv("HAKMEM_TINY_UC_LOCALIZE"); + g_enabled = (e && *e && *e != '0') ? 1 : 0; + } + return g_enabled; +} + +#endif // HAK_BOX_TINY_UNIFIED_CACHE_HITPATH_ENV_BOX_H diff --git a/core/front/tiny_unified_cache.h b/core/front/tiny_unified_cache.h index 65adb40f..fb32b116 100644 --- a/core/front/tiny_unified_cache.h +++ b/core/front/tiny_unified_cache.h @@ -31,6 +31,7 @@ #include "../box/ptr_type_box.h" // Phantom pointer types (BASE/USER) #include "../box/tiny_front_config_box.h" // Phase 8-Step1: Config macros #include "../box/tiny_tcache_box.h" // Phase 14 v1: Intrusive LIFO tcache +#include "../box/tiny_unified_cache_hitpath_env_box.h" // Phase 74: LOCALIZE ENV gate // ============================================================================ // Phase 3 C2 Patch 3: Bounds Check Compile-out @@ -247,6 +248,30 @@ static inline int unified_cache_push(int class_idx, hak_base_ptr_t base) { } #endif + // Phase 74-2: LOCALIZE optimization (compile-time gate, no runtime branch) +#if HAKMEM_TINY_UC_LOCALIZE_COMPILED + // LOCALIZE: Load head/tail/mask once into locals to avoid reload dependency chains + uint16_t head = cache->head; + uint16_t tail = cache->tail; + uint16_t mask = cache->mask; + uint16_t 
next_tail = (tail + 1) & mask; + + if (__builtin_expect(next_tail == head, 0)) { +#if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED + g_unified_cache_full[class_idx]++; +#endif + return 0; // Full + } + + cache->slots[tail] = base_raw; + cache->tail = next_tail; + +#if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED + g_unified_cache_push[class_idx]++; +#endif + return 1; // SUCCESS (LOCALIZE path) +#else + // Default path: Original implementation uint16_t next_tail = (cache->tail + 1) & cache->mask; // Full check (leave 1 slot empty to distinguish full/empty) @@ -266,6 +291,7 @@ static inline int unified_cache_push(int class_idx, hak_base_ptr_t base) { #endif return 1; // SUCCESS (2-3 cache misses total) +#endif // HAKMEM_TINY_UC_LOCALIZE_COMPILED } // ============================================================================ @@ -316,6 +342,37 @@ static inline hak_base_ptr_t unified_cache_pop_or_refill(int class_idx) { } #endif + // Phase 74-2: LOCALIZE optimization (compile-time gate, no runtime branch) +#if HAKMEM_TINY_UC_LOCALIZE_COMPILED + // LOCALIZE: Load head/tail/mask once into locals to avoid reload dependency chains + uint16_t head = cache->head; + uint16_t tail = cache->tail; + uint16_t mask = cache->mask; + + if (__builtin_expect(head != tail, 1)) { + void* base = cache->slots[head]; + cache->head = (head + 1) & mask; +#if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED + g_unified_cache_hit[class_idx]++; +#endif +#if HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED + if (__builtin_expect(unified_cache_measure_check(), 0)) { + atomic_fetch_add_explicit(&g_unified_cache_hits_global, + 1, memory_order_relaxed); + atomic_fetch_add_explicit(&g_unified_cache_hits_by_class[class_idx], + 1, memory_order_relaxed); + } +#endif + return HAK_BASE_FROM_RAW(base); // Hit! 
(LOCALIZE path) + } + + // Cache miss → Batch refill from SuperSlab +#if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED + g_unified_cache_miss[class_idx]++; +#endif + return unified_cache_refill(class_idx); +#else + // Default path: Original implementation // Tcache miss/disabled/compiled-out → try pop from array cache (fast path) if (__builtin_expect(cache->head != cache->tail, 1)) { void* base = cache->slots[cache->head]; // 1 cache miss (array access) @@ -341,6 +398,7 @@ static inline hak_base_ptr_t unified_cache_pop_or_refill(int class_idx) { g_unified_cache_miss[class_idx]++; #endif return unified_cache_refill(class_idx); // Refill + return first block (BASE) +#endif // HAKMEM_TINY_UC_LOCALIZE_COMPILED } #endif // HAK_FRONT_TINY_UNIFIED_CACHE_H diff --git a/core/hakmem_build_flags.h b/core/hakmem_build_flags.h index 0974e169..7cecb5af 100644 --- a/core/hakmem_build_flags.h +++ b/core/hakmem_build_flags.h @@ -434,6 +434,18 @@ # define HAKMEM_ALLOC_GATE_CLS_MIS_COMPILED 0 #endif +// ------------------------------------------------------------ +// Phase 74: UnifiedCache LOCALIZE (Compile-time hit-path optimization) +// ------------------------------------------------------------ +// LOCALIZE: Load head/tail/mask once into locals to avoid reload dependency chains +// When =1: Always use localize version (no runtime branch, maximum DCE) +// When =0: Use original implementation (default, backward compatible) +// Build: make EXTRA_CFLAGS="-DHAKMEM_TINY_UC_LOCALIZE_COMPILED=1" [target] +// Expected impact: +0.5-1.5% via dependency chain reduction +#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED +# define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0 +#endif + // ------------------------------------------------------------ // Helper enum (for documentation / logging) // ------------------------------------------------------------ diff --git a/docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md b/docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md index bdeec393..6a169a6e 100644 --- 
a/docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md +++ b/docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md @@ -11,7 +11,7 @@ mimalloc との比較は **FAST build** で行う(Standard は fixed tax を含むため公平でない)。 -## Current snapshot(2025-12-17, Phase 68 PGO — 新 baseline) +## Current snapshot(2025-12-18, Phase 69 PGO + WarmPool=16 — 現行 baseline) 計測条件(再現の正): - Mixed: `scripts/run_mixed_10_cleanenv.sh`(`ITERS=20000000 WS=400`) diff --git a/docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md b/docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md new file mode 100644 index 00000000..cd07ea40 --- /dev/null +++ b/docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md @@ -0,0 +1,197 @@ +# Phase 69-1: Refill Tuning Parameter Sweeps - Results + +**Date**: 2025-12-17 +**Baseline**: Phase 68 PGO (`bench_random_mixed_hakmem_minimal_pgo`) +**Benchmark**: `scripts/run_mixed_10_cleanenv.sh` (RUNS=10) +**Goal**: Find +3-6% optimization for M2 milestone (55% of mimalloc) + +--- + +## Executive Summary + +**Winner Identified**: **Warm Pool Size=16** achieves **+3.26% (Strong GO)** with ENV-only change. + +- **No code changes required** - Deploy via `HAKMEM_WARM_POOL_SIZE=16` environment variable +- **Exceeds M2 threshold** (+3.0% Strong GO criterion) +- **Single strongest improvement** among all tested parameters +- **Combined optimizations are non-additive** - Warm Pool Size=16 alone outperforms combinations + +⚠️ **Important correction (2025-12 audit)**: +The previously reported “Refill Batch Size sweep” based on `TINY_REFILL_BATCH_SIZE` was **not measuring a real knob**. +That macro currently has **zero call sites** (it is defined but not referenced in the active Tiny front path), so any +observed deltas were **layout/drift noise**, not an algorithmic effect. 
+ +--- + +## Full Sweep Results + +### Baseline (Phase 68 PGO) + +| Metric | Value | +|--------|-------| +| **Mean** | 60.65M ops/s | +| **Median** | 60.68M ops/s | +| **CV** | 1.68% | +| **% of mimalloc** | 50.93% | + +**Runs**: 10 +**Binary**: `bench_random_mixed_hakmem_minimal_pgo` (PGO optimized) + +--- + +### 1. Warm Pool Size Sweep (ENV-only, no recompile) + +**Parameter**: `HAKMEM_WARM_POOL_SIZE` (default: 12 SuperSlabs/class) + +| Size | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision | +|------|----------------|------------------|----|-----------:|----------| +| **16** | **62.63** | **63.38** | 2.43% | **+3.26%** | **Strong GO** ✓✓✓ | +| 24 | 62.37 | 62.35 | 1.99% | +2.84% | GO ✓ | + +**Winner**: **Size=16 (+3.26%)** + +**Analysis**: +- Size=16 exceeds +3.0% Strong GO threshold +- Size=24 shows diminishing returns (+2.84% vs +3.26%) +- Optimal sweet spot at Size=16 balances cache hit rate vs memory overhead + +**Command Used**: +```bash +# Size=16 +HAKMEM_WARM_POOL_SIZE=16 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh + +# Size=24 +HAKMEM_WARM_POOL_SIZE=24 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh +``` + +--- + +### 2. 
Unified Cache C5-C7 Sweep (ENV-only, no recompile) + +**Parameter**: `HAKMEM_TINY_UNIFIED_C5`, `HAKMEM_TINY_UNIFIED_C6`, `HAKMEM_TINY_UNIFIED_C7` (default: 128 slots) + +| Cache Size | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision | +|------------|----------------|------------------|----|-----------:|----------| +| **256** | **61.92** | **61.70** | 1.49% | **+2.09%** | **GO** ✓ | +| 512 | 61.80 | 62.00 | 1.21% | +1.89% | GO ✓ | + +**Winner**: **Cache=256 (+2.09%)** + +**Analysis**: +- Cache=256 shows +2.09% improvement (GO threshold) +- Cache=512 shows diminishing returns (+1.89% vs +2.09%) +- Larger caches provide marginal gains while increasing memory overhead +- Lower CV (1.49%) indicates stable performance + +**Command Used**: +```bash +# Cache=256 +HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh + +# Cache=512 +HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh +``` + +--- + +### 3. 
Combined Optimization Check + +**Configuration**: Warm Pool Size=16 + Unified Cache C5-C7=256 + +| Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision | +|----------------|------------------|----|-----------:|----------| +| 62.35 | 62.32 | 1.91% | +2.81% | GO (non-additive) | + +**Analysis**: +- Combined result (+2.81%) is **LESS than** Warm Pool Size=16 alone (+3.26%) +- **Non-additive behavior** indicates parameters are not orthogonal +- **Likely explanation**: Warm pool optimization reduces unified cache miss rate, making cache capacity increase redundant +- **Recommendation**: Use Warm Pool Size=16 alone for maximum benefit + +**Command Used**: +```bash +HAKMEM_WARM_POOL_SIZE=16 HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh +``` + +--- + +### 4. Refill Batch Size Sweep (invalid — macro not wired) + +The `TINY_REFILL_BATCH_SIZE` macro is currently **define-only**: + +```bash +rg -n "TINY_REFILL_BATCH_SIZE" core +# -> core/hakmem_tiny_config.h only +``` + +So we do **not** treat it as a tuning parameter until it is actually connected to refill logic. + +If we want to tune refill frequency, use the real knobs: +- `HAKMEM_TINY_REFILL_COUNT_HOT` +- `HAKMEM_TINY_REFILL_COUNT_MID` +- `HAKMEM_TINY_REFILL_COUNT` / `HAKMEM_TINY_REFILL_COUNT_C{0..7}` + +--- + +## Recommendations + +### Phase 69-2 (Baseline Promotion) + +**Primary Recommendation**: **Deploy Warm Pool Size=16 (ENV-only)** + +**Rationale**: +1. **Strongest single improvement** (+3.26%, Strong GO) +2. **No code changes required** - Zero risk of layout tax +3. **Immediate deployment** via environment variable +4. 
**Exceeds M2 threshold** (+3.0% Strong GO criterion) + +**Deployment**: +```bash +# Add to PGO training environment and benchmark scripts +export HAKMEM_WARM_POOL_SIZE=16 +``` + +--- + +### Secondary Options (for Phase 69-3+) + +**Option A: Warm Pool Size=16 + Refill Batch=32** +- **Combined potential**: Unknown (requires testing, may be non-additive like unified cache) +- **Complexity**: Requires PGO rebuild for Batch=32 +- **Risk**: Layout tax from code change + +**Option B: Warm Pool Size=16 alone (recommended)** +- **Gain**: +3.26% guaranteed +- **Complexity**: ENV-only, zero code changes +- **Risk**: None (reversible via ENV) + +--- + +## Raw Data Files + +All 10-run logs saved to: +- `/tmp/phase69_baseline.log` - Phase 68 PGO baseline +- `/tmp/phase69_warm16.log` - Warm Pool Size=16 +- `/tmp/phase69_warm24.log` - Warm Pool Size=24 +- `/tmp/phase69_cache256.log` - Unified Cache C5-C7=256 +- `/tmp/phase69_cache512.log` - Unified Cache C5-C7=512 +- `/tmp/phase69_combined.log` - Combined (Warm=16 + Cache=256) +- `/tmp/phase69_batch32.log` - Refill Batch=32 + +--- + +## Next Steps + +**Awaiting User Instructions for Phase 69-2**: +1. Confirm Warm Pool Size=16 as baseline promotion candidate +2. Decide whether to: + - Update ENV defaults in `hakmem_tiny_config.h` (preferred for SSOT) + - Document as recommended ENV setting in README/docs + - Add to PGO training scripts +3. Re-run `make pgo-fast-full` with `HAKMEM_WARM_POOL_SIZE=16` in training environment +4. 
Update `PERFORMANCE_TARGETS_SCORECARD.md` with new baseline (projected: 62.63M ops/s, ~52.6% of mimalloc) + +--- + +**Phase 69-1 Status**: ✅ **COMPLETE** +**Winner**: **Warm Pool Size=16 (+3.26%, Strong GO, ENV-only)** diff --git a/docs/analysis/PHASE69_REFILL_TUNING_3A_BUILD_FAILURE_TRIAGE_BATCH64.md b/docs/analysis/PHASE69_REFILL_TUNING_3A_BUILD_FAILURE_TRIAGE_BATCH64.md new file mode 100644 index 00000000..1e2df033 --- /dev/null +++ b/docs/analysis/PHASE69_REFILL_TUNING_3A_BUILD_FAILURE_TRIAGE_BATCH64.md @@ -0,0 +1,46 @@ +# Phase 69-3A: Refill Batch=64 build failure triage — Root cause & fix + +## Symptom + +`make pgo-fast-build` (profile-use) fails to link with undefined `__gcov_*` symbols, e.g.: + +- `__gcov_init`, `__gcov_exit` +- `__gcov_merge_add`, `__gcov_merge_topn` +- `__gcov_time_profiler_counter` + +This appeared when trying to evaluate `Refill Batch Size=64`. + +## Root cause (actual) + +The failure is **not** “compiler limit due to batch=64”. + +It is a **stale object mixing** problem: +- Some benchmark `.o` files were built in the profile-gen step (`-fprofile-generate`) and **were not removed by `make clean`**. +- In the profile-use step (`-fprofile-use`), those stale instrumented `.o` files were reused and linked without `-fprofile-generate` → libgcov was not pulled in. +- Result: unresolved `__gcov_*` symbols at link time. + +In other words: **instrumented bench object reused in non-instrumented link**. + +## Fix (minimal, safe) + +Strengthen `make clean` to remove benchmark objects/binaries that were previously omitted, including: +- `bench_random_mixed_hakmem.o` +- `bench_tiny_hot_hakmem.o` +- related bench variants (`*_system`, `*_mi`, `*_hakx`, `*_minimal*`, etc.) + +This preserves toolchain fairness (GCC + LTO) and prevents cross-step contamination in PGO workflows. 
+ +## Verification + +After the fix, the Phase 66 PGO pipeline builds successfully again: + +```sh +make pgo-fast-profile pgo-fast-collect pgo-fast-build +``` + +## Notes + +- This fix is **layout-neutral**: it only affects build hygiene (artifact cleanup). +- This also hardens other workflows where flags change across builds (PGO / FAST targets). +- Follow-up audit note (2025-12): `TINY_REFILL_BATCH_SIZE` is currently define-only (no call sites), so the “batch=64” + performance experiment itself was not measuring a real knob; however the build hygiene fix remains valid and important. diff --git a/docs/analysis/PHASE69_REFILL_TUNING_3B_REFILL_BATCH_PGO_SWEEP_RESULTS.md b/docs/analysis/PHASE69_REFILL_TUNING_3B_REFILL_BATCH_PGO_SWEEP_RESULTS.md new file mode 100644 index 00000000..949ab965 --- /dev/null +++ b/docs/analysis/PHASE69_REFILL_TUNING_3B_REFILL_BATCH_PGO_SWEEP_RESULTS.md @@ -0,0 +1,45 @@ +# Phase 69-3B: Refill Batch Size sweep (PGO, warm_pool=16) — Results + +⚠️ **INVALID (2025-12 audit)**: `TINY_REFILL_BATCH_SIZE` is currently **not wired** into the active Tiny front path +(it has zero call sites; define-only in `core/hakmem_tiny_config.h`). Any observed deltas in this file should be treated +as **layout/drift noise**, not an algorithmic effect. This document is kept only as an experiment record. + +## Context + +Phase 69-2 promoted the ENV-only winner: +- `HAKMEM_WARM_POOL_SIZE=16` + +This phase explores compile-time refill batch size (`TINY_REFILL_BATCH_SIZE`) under the current PGO workflow: +- `make pgo-fast-full` (GCC + LTO preserved) +- Training uses cleanenv-aligned workloads (`scripts/box/pgo_fast_profile_config.sh`) + +## Build hygiene prerequisite + +Batch=64 originally “failed to build” due to stale profile-gen bench objects being reused in profile-use links. +That issue is fixed by strengthening `make clean` (see `docs/analysis/PHASE69_REFILL_TUNING_3A_BUILD_FAILURE_TRIAGE_BATCH64.md`). 
+ +## Measurement (Mixed 10-run) + +All results are from the same host session, using: +- `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo` +- `RUNS=10 scripts/run_mixed_10_cleanenv.sh` + +| Batch | Mean (M ops/s) | Median (M ops/s) | CV | +|------:|----------------:|-----------------:|---:| +| 16 | 61.30 | 61.64 | 1.50% | +| 32 | 60.73 | 61.17 | 2.19% | +| 48 | 61.94 | 62.54 | 1.53% | +| 64 | 61.51 | 61.81 | 1.56% | + +## Decision + +- **Batch=48** is the best of the tested set in this session (+~1.0% vs batch=16 baseline). +- **Batch=32** regresses in this session (note: previously was GO under a different baseline). +- **Batch=64** builds successfully after the hygiene fix, but is not the best performer here. + +## Next steps (Phase 69-3C) + +If we want to pursue M2 (55%) via this path: +1. Promote **batch=48** as a research candidate with a dedicated Phase tag (compile-time change + PGO rebuild). +2. Re-run the sweep at another time window to confirm ordering (layout/drift sensitivity). +3. If stable, promote batch=48 into the FAST baseline build path. diff --git a/docs/analysis/PHASE69_REFILL_TUNING_3C_REFILL_BATCH_KNOB_AUDIT.md b/docs/analysis/PHASE69_REFILL_TUNING_3C_REFILL_BATCH_KNOB_AUDIT.md new file mode 100644 index 00000000..49a5e817 --- /dev/null +++ b/docs/analysis/PHASE69_REFILL_TUNING_3C_REFILL_BATCH_KNOB_AUDIT.md @@ -0,0 +1,47 @@ +# Phase 69-3C: Refill Batch “knob” audit — `TINY_REFILL_BATCH_SIZE` is not wired + +## Summary + +The Phase 69 “Refill Batch Size sweep” was based on `TINY_REFILL_BATCH_SIZE` in `core/hakmem_tiny_config.h`, but an audit +shows this macro currently has **zero call sites** in the active Tiny front path. As a result, any measured deltas from +editing this macro are **not algorithmic**; they are attributable to layout/drift/noise. + +## Evidence + +### 1) Zero call sites + +```sh +rg -n "TINY_REFILL_BATCH_SIZE" core +``` + +Result: only `core/hakmem_tiny_config.h` (define-only). 
+ +### 2) PGO binaries unchanged when toggling the macro + +We rebuilt the full PGO pipeline twice (`make pgo-fast-full`) after changing the macro (batch16 vs batch48) and found the +resulting binaries were bit-identical (same size + same SHA256). + +This confirms the macro does not affect the compiled hot path today. + +## Action taken + +- Restored `TINY_REFILL_BATCH_SIZE` to `16` and added an explicit “not wired” note in `core/hakmem_tiny_config.h`. +- Marked the “Refill Batch Size sweep” section in Phase 69 docs as invalid. + +## What to tune instead (real knobs) + +To tune refill frequency/amount without rebuilding: +- `HAKMEM_TINY_REFILL_COUNT_HOT` (C0–C3) +- `HAKMEM_TINY_REFILL_COUNT_MID` (C4–C7) +- `HAKMEM_TINY_REFILL_COUNT` / `HAKMEM_TINY_REFILL_COUNT_C{0..7}` + +Defaults are set in `core/hakmem_tiny_init.inc` and can be overridden via ENV. + +## Optional future work (if we still want a compile-time knob) + +If we want a compile-time “refill batch size” knob, we need to wire it into a single SSOT: +- either by feeding it into the refill-count defaults (`g_refill_count_*`), or +- by introducing a dedicated build flag that the refill logic consumes directly. + +Until then, do not run Phase 69 sweeps based on `TINY_REFILL_BATCH_SIZE`. + diff --git a/docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md b/docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md index 2c10d4fc..5dd0e3c7 100644 --- a/docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md +++ b/docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md @@ -12,6 +12,13 @@ Before implementing any refill/WarmPool changes, execute this sequence: +0. **Route Banner (optional but recommended)**: + ```bash + HAKMEM_ROUTE_BANNER=1 ./bench_random_mixed_hakmem_observe ... + ``` + - Prints the route assignments (backend route kind) and the cache config (`unified_cache_enabled` / `warm_pool_max_per_class`) exactly once. + - Prevents misreadings such as “Route=LEGACY means Unified Cache is unused” (even with LEGACY, Unified Cache is used at the alloc/free front). + 1.
**Build with Stats**: ```bash make bench_random_mixed_hakmem_observe EXTRA_CFLAGS='-DHAKMEM_UNIFIED_CACHE_STATS_COMPILED=1' @@ -20,7 +27,7 @@ Before implementing any refill/WarmPool changes, execute this sequence: 2. **Run with Stats**: ```bash - HAKMEM_WARM_POOL_STATS=1 ./bench_random_mixed_hakmem_observe 20000000 400 1 + HAKMEM_ROUTE_BANNER=1 HAKMEM_WARM_POOL_STATS=1 ./bench_random_mixed_hakmem_observe 20000000 400 1 ``` 3. **Check Output**: diff --git a/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md b/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md new file mode 100644 index 00000000..aaf7e04a --- /dev/null +++ b/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md @@ -0,0 +1,116 @@ +# Phase 74: UnifiedCache hit-path structural optimization (WS=400 SSOT) + +**Status**: 🟡 DRAFT (design SSOT / next instruction sheet) + +## 0) Background (why now) + +- Current baseline (Phase 69): `bench_random_mixed_hakmem_minimal_pgo` = **62.63M ops/s = 51.77% of mimalloc** (`HAKMEM_WARM_POOL_SIZE=16`) +- Phase 70 (observation SSOT) established that at WS=400 (Mixed SSOT) **UnifiedCache misses are negligible**. + - Speeding up `unified_cache_refill()` / WarmPool-pop yields **almost zero ROI** (refill optimization is frozen) + - SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md` +- Phase 73 (perf stat) established that the WarmPool=16 win is dominated by a **slight reduction in instructions/branches**. + - So “shortening the hit-path” remains the most promising direction for the next step as well. + - Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md` + +The aim of this phase is to **structurally push branches/loads that need not be executed out of the UnifiedCache hit-path (push/pop)**. + +## 1) Goals / Non-goals + +**Goals** +- Target **+1 to 3%** (per change) on the WS=400 SSOT workload (accumulating toward M2=55%). +- Avoid “optimizing a path that is never exercised” (adhere to the Phase 70 SSOT). + +**Non-goals** +- Optimizing `unified_cache_refill()` (misses are negligible, so there is no ROI under the SSOT). +- DCE via link-out / large deletions (there are many precedents where layout tax flipped the sign of the result). +- Changing the route kind and thereby turning this into a different workload (first, keep the SSOT workload intact). + +## 2) Box Theory (box decomposition) + +### Box responsibilities + +L0: **EnvGateBox** +- Toggles for `HAKMEM_TINY_UC_*` (default OFF, revertible at any time). + +L1: **TinyUnifiedCacheHitPathBox (NEW / research box)** +- 
`unified_cache_push/pop`: shorten **only the hit-path** (refill/overflow/registry are left untouched). +- Single conversion point (boundary): perform “fast→fallback” exactly once inside `unified_cache_push/pop`. + +### Observability (minimal) +- Only two counters, `uc_hitpath_fast_hits` / `uc_hitpath_fast_fallbacks` (if needed). +- Otherwise, `perf stat` (instructions/branches) is the source of truth. + +## 3) Concrete proposals (in priority order) + +### P1 (low risk): Use local variables to pin down reloads / dependency chains + +Aim: +- Suppress reloads of `cache->head/tail/mask/capacity` etc. and **shorten the dependency chain**. + +Design: +- Inside `unified_cache_push()` / `unified_cache_pop_or_refill()`: + - **Copy fields into locals**, e.g. `uint16_t head = cache->head;` + - Compute `next = (x + 1) & mask` **exactly once** + - Group stores such as `cache->tail = next;` at the end + +Rollout: +- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0) +- Method: ON/OFF within the same binary (to minimize layout tax, the branch is limited to a single check at entry) + +Risks: +- Higher register pressure may make it slower instead → A/B required. + +### P0 (medium risk / medium ROI): Fast-API (move the enable check / stats out) + +Aim: +- **Hoist** the “nearly invariant checks” remaining in the hit-path **to the caller**, making `push/pop` straight-line code. + +Design: +- Add a **minimal API** such as `unified_cache_push_fast(TinyUnifiedCache* cache, void* base)` + - Precondition: the caller guarantees “enabled / initialized / stats OFF” + - Fall back to the existing `unified_cache_push()` only on failure (single boundary) + +Rollout: +- ENV: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0) +- Fail-fast: if the mode changes midway, go to the “safe fallback” (abort is also acceptable for bench use) + +Risks: +- More call sites shift the layout → GO threshold is +1.0% (strict). + +### P2 (high risk / high-ROI candidate): Place slots directly in TLS for hot classes only (fewer pointer chases) + +Aim: +- Eliminate the load of `cache->slots` (pointer chase) in the hit-path. + +Design: +- Move “hot classes only” of `TinyUnifiedCache` into a separate structure, with `slots[]` placed directly in TLS. + - Candidates: the small-capacity C4/C5/C6/C7 (the 2048-slot C2/C3 are too heavy to place directly) + +Risks: +- A larger TLS footprint may worsen dTLB/cache behavior (a big win if it works, but NO-GO is possible). + +## 4) A/B (SSOT) + +### 4.1 Benchmark conditions (fixed) +- `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`) +- `HAKMEM_WARM_POOL_SIZE=16` (baseline) + +### 4.2 GO/NO-GO +- **GO**: +1.0% or more +- **NEUTRAL**: within ±1.0% (research box freeze) +- **NO-GO**: -1.0% or less (revert immediately) + +### 4.3 Additional metrics that must always be checked (Phase 73 lesson) +- `perf stat`: `instructions`, `branches`, `branch-misses` (because the path to winning is instruction/branch reduction) +- `cache-misses`, `iTLB-load-misses`, 
`dTLB-load-misses` (layout tax detection) + +## 5) Near-term implementation order (recommended) + +1. Land **P1 (LOCALIZE)** as a small change and A/B it (quickest way to confirm the path to winning) +2. If it wins, add **P0 (FASTAPI)** (move more branches out) +3. If that is still not enough, try **P2 (inline slots hot)** as a research box + +## 6) Exit criteria (when to stop) + +- If “unified_cache_push/pop” falls out of the Top 50 in `perf` on the WS=400 SSOT, withdraw from this line of work (lesson from Phase 42). +- After three consecutive NEUTRAL/NO-GO results, move on to the next structure (a different layer), because the layout tax risk grows. diff --git a/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md b/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md new file mode 100644 index 00000000..28ef2a9a --- /dev/null +++ b/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md @@ -0,0 +1,75 @@ +# Phase 74-1: UnifiedCache hit-path “LOCALIZE” implementation instructions + +**Status**: 🟡 READY + +## Purpose + +Because WS=400 (Mixed SSOT) exercises almost nothing but the hit-path, **shorten the dependency chain (reloads)** in `unified_cache_push/pop` to cut instructions/branches. + +- Design SSOT: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md` +- Observation SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md` (refill optimization is frozen) + +## Principles (Box Theory) + +- L0: Add an ENV gate box (default OFF, revertible at any time) +- L1: Changes confined to the inside of `unified_cache_push/pop` (single boundary) +- Keep observability minimal (`perf stat` is the primary source of truth) +- Fail-fast: when in doubt, fall back + +## Step 0: Confirm baseline (SSOT) + +```bash +scripts/run_mixed_10_cleanenv.sh +``` + +## Step 1: ENV gate (L0 box) + +New file: +- `core/box/tiny_unified_cache_hitpath_env_box.h` (example) + +ENV: +- `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0) + +Requirements: +- Do not call getenv on the hot path (use the existing lazy-init pattern, or fix the value with a build flag) + +## Step 2: LOCALIZE implementation (L1 box) + +Target: +- `unified_cache_push()` / `unified_cache_pop_or_refill()` in `core/front/tiny_unified_cache.h` + +Approach: +- Copy `cache->head/tail/mask/capacity` into locals to **prevent reloads** +- Group stores at the end (e.g. `cache->tail = next_tail;`) +- Do not change the semantics (preserve the meaning of capacity/ordering/stats/overflow) + +Rollout pattern (example): +- When `!tiny_uc_localize_enabled()`, go through the existing implementation unchanged +- Call the localized version only when enabled + +## Step 3: A/B (same binary) + +```bash 
+HAKMEM_TINY_UC_LOCALIZE=0 scripts/run_mixed_10_cleanenv.sh +HAKMEM_TINY_UC_LOCALIZE=1 scripts/run_mixed_10_cleanenv.sh +``` + +In addition (required, since the path to winning is instructions/branches): +```bash +perf stat -e cycles,instructions,branches,branch-misses,cache-misses,iTLB-load-misses,dTLB-load-misses -- \ + ./bench_random_mixed_hakmem_minimal_pgo 20000000 400 1 +``` + +## Verdict + +- **GO**: +1.0% or more +- **NEUTRAL**: within ±1.0% (research box freeze) +- **NO-GO**: -1.0% or less (revert immediately) + +NO-GO triage: +- Use `scripts/box/layout_tax_forensics_box.sh` (classifies layout tax / IPC drop / TLB degradation) + +## Step 4: Promotion policy + +- Even on a first GO, **do not make it default ON** (first confirm reproducibility with three independent re-measurements) +- If all three are GO, consider promoting into `scripts/run_mixed_10_cleanenv.sh` / `core/bench_profile.h` diff --git a/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md b/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md new file mode 100644 index 00000000..5d3dca6a --- /dev/null +++ b/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md @@ -0,0 +1,140 @@ +# Phase 74: UnifiedCache hit-path structural optimization - Results + +**Status**: 🔴 P1 (LOCALIZE) FROZEN (NEUTRAL -0.87%) + +## Summary + +Phase 74 investigated **unified_cache_push/pop** hit-path optimizations to achieve +1-3% via instruction/branch reduction (the Phase 73 lesson). + +**P1 (LOCALIZE)** attempted to reduce dependency chains by loading `head/tail/mask` into locals, but was **frozen at NEUTRAL (-0.87%)** due to cache-miss increase. + +--- + +## Phase 74-1: LOCALIZE (ENV-gated, runtime branch) + +**Goal**: Load `head/tail/mask` once into locals to avoid reload dependency chains. + +**Implementation**: +- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0) +- Runtime branch at entry: `if (tiny_uc_localize_enabled()) { ... 
}` + +**Results** (10-run A/B): +| Metric | LOCALIZE=0 | LOCALIZE=1 | Delta | +|--------|------------|------------|-------| +| throughput | 57.43 M ops/s | 57.72 M ops/s | **+0.50%** | +| instructions | 4,583M | 4,615M | **+0.7%** | +| branches | 1,276M | 1,281M | **+0.4%** | +| cache-misses | 560K | 461K | -17.7% | + +**Diagnosis**: Runtime branch overhead dominated. Instructions/branches **increased** despite LOCALIZE intent. + +**Judgment**: **NEUTRAL (+0.50%, within the ±1.0% threshold)** → Proceed to Phase 74-2 (compile-time gate). + +--- + +## Phase 74-2: LOCALIZE (compile-time gate, no runtime branch) + +**Goal**: Eliminate the runtime branch to isolate the performance of LOCALIZE itself. + +**Implementation**: +- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0) +- Compile-time gate: `#if HAKMEM_TINY_UC_LOCALIZE_COMPILED` (no runtime branch) + +**Results** (10-run A/B via `layout_tax_forensics_box.sh`): +| Metric | Baseline (=0) | Treatment (=1) | Delta | +|--------|---------------|----------------|-------| +| **throughput** | 58.90 M ops/s | 58.39 M ops/s | **-0.87%** | +| cycles | 1,553M | 1,548M | -0.3% | +| **instructions** | 2,748M | 2,733M | **-0.6%** | +| **branches** | 632M | 617M | **-2.3%** | +| **cache-misses** | 707K | 1,316K | **+86%** | +| dTLB-load-misses | 46K | 33K | -28% | + +**Analysis**: +1. **Runtime branch overhead removed** → instructions/branches improved (-0.6%/-2.3%) ✓ +2. **LOCALIZE itself is effective** → dependency chain reduction confirmed ✓ +3. **But cache-misses +86%** → register pressure / spill / worse access pattern +4.
**Net result: -0.87%** → cache-miss increase dominates instruction/branch savings + +**Phase 74-1 vs 74-2 comparison**: +- 74-1 (runtime branch): instructions +0.7%, branches +0.4% → **branch overhead loses** +- 74-2 (compile-time): instructions -0.6%, branches -2.3% → **LOCALIZE itself wins** +- But cache-misses +86% cancels out → **total NEUTRAL** + +**Judgment**: **NEUTRAL (-0.87%, below +1.0% GO threshold)** → **P1 FROZEN** + +--- + +## Root Cause (Phase 74-2) + +**Why cache-misses increased (+86%)**: + +1. **Register pressure hypothesis**: Loading `head/tail/mask` into locals increases live registers + - Compiler may spill to stack → more memory traffic + - `cache->slots[head]` may lose prefetch opportunity +2. **Access pattern change**: `cache->head` direct load may benefit from compiler optimizations + - Storing to local breaks dependency tracking? + - Memory alias analysis degraded? + +**Evidence**: +- dTLB-misses decreased (-28%) → data layout not the issue +- L1-dcache-load-misses similar → not a TLB/page issue +- cache-misses (+86%) is the PRIMARY BLOCKER + +--- + +## Lessons Learned + +1. **Runtime branch tax is real**: Phase 74-1 showed +0.7% instruction increase from ENV gate +2. **LOCALIZE itself works**: Phase 74-2 confirmed -2.3% branches when branch removed +3. **Register pressure matters**: Even when instruction count drops, cache behavior can dominate +4. **This optimization path has low ROI**: Dependency chain reduction is fragile to cache effects + +**Conclusion**: P1 (LOCALIZE) frozen. Move to **P0 (FASTAPI)** (different approach: move branches outside hot loop).
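
For readers unfamiliar with the LOCALIZE idea evaluated above, here is a minimal C sketch of what "loading `head/tail/mask` into locals" can look like on a power-of-two ring buffer. It is only an illustration under stated assumptions: `DemoCache`, `demo_push`, and `demo_pop` are hypothetical names, the field layout and the one-slot-empty "full" convention are assumed, and none of this is taken from the real `TinyUnifiedCache` code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ring buffer; NOT the real TinyUnifiedCache layout. */
typedef struct {
    void   **slots; /* capacity == mask + 1, a power of two */
    uint16_t head;  /* next index to pop */
    uint16_t tail;  /* next index to push */
    uint16_t mask;  /* capacity - 1 */
} DemoCache;

/* Push with fields "localized": each field is loaded once into a local,
 * the next index is computed once, and the store to cache->tail is last.
 * Returns 1 on success, 0 when full (a caller would take a slow path). */
static int demo_push(DemoCache *cache, void *p) {
    assert(cache != NULL && cache->slots != NULL);
    uint16_t head = cache->head;
    uint16_t tail = cache->tail;
    uint16_t mask = cache->mask;
    uint16_t next = (uint16_t)((tail + 1) & mask);
    if (next == head) {
        return 0; /* full under the assumed one-slot-empty convention */
    }
    cache->slots[tail] = p;
    cache->tail = next; /* single store at the end */
    return 1;
}

/* Pop with the same pattern. Returns NULL when empty
 * (a caller would refill in that case). */
static void *demo_pop(DemoCache *cache) {
    assert(cache != NULL && cache->slots != NULL);
    uint16_t head = cache->head;
    uint16_t tail = cache->tail;
    uint16_t mask = cache->mask;
    if (head == tail) {
        return NULL; /* empty */
    }
    void *p = cache->slots[head];
    cache->head = (uint16_t)((head + 1) & mask); /* single store at the end */
    return p;
}
```

Whether a compiler emits different code for this form than for direct `cache->head` accesses depends on aliasing and register allocation, which is consistent with the mixed results recorded above; the sketch shows the source-level pattern only, not its performance.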
+ +--- + +## P1 (LOCALIZE) - Frozen State + +**Files**: +- `core/hakmem_build_flags.h`: `HAKMEM_TINY_UC_LOCALIZE_COMPILED` (default 0) +- `core/box/tiny_unified_cache_hitpath_env_box.h`: ENV gate (unused after 74-2) +- `core/front/tiny_unified_cache.h`: compile-time `#if` blocks + +**Default behavior**: LOCALIZE=0 (original implementation) +**Rollback**: No action needed (default OFF) + +--- + +## Next Steps + +**Phase 74-3: P0 (FASTAPI)** + +**Goal**: Move `unified_cache_enabled()` / `lazy-init` / `stats` checks **outside** hot loop. + +**Approach**: +- Create `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs +- Assume: "valid/enabled/no-stats" at caller side +- Fail-fast: fallback to slow path on unexpected state +- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box) + +**Expected benefit**: +1-2% via branch reduction (different axis than P1) + +**GO threshold**: +1.0% (strict, structural change) + +--- + +## Artifacts + +- **Design**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md` +- **Instructions**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md` +- **Results**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md` (this file) +- **Forensics output**: `./results/layout_tax_forensics/` (Phase 74-2 perf data) + +--- + +## Timeline + +- Phase 74-1: ENV-gated LOCALIZE → **NEUTRAL (+0.50%)** +- Phase 74-2: Compile-time LOCALIZE → **NEUTRAL (-0.87%)** → **P1 FROZEN** +- Phase 74-3: P0 (FASTAPI) → (next) diff --git a/hakmem.d b/hakmem.d index 3480d525..5c64f7fa 100644 --- a/hakmem.d +++ b/hakmem.d @@ -103,6 +103,7 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \ core/box/../front/../box/../hakmem_tiny_config.h \ core/box/../front/../box/../tiny_nextptr.h \ core/box/../front/../box/tiny_tcache_env_box.h \ + core/box/../front/../box/tiny_unified_cache_hitpath_env_box.h \ core/box/../front/../tiny_region_id.h core/box/../front/../hakmem_tiny.h \ 
core/box/../front/../box/tiny_env_box.h \ core/box/../front/../box/tiny_front_hot_box.h \ @@ -361,6 +362,7 @@ core/box/../front/../box/tiny_tcache_box.h: core/box/../front/../box/../hakmem_tiny_config.h: core/box/../front/../box/../tiny_nextptr.h: core/box/../front/../box/tiny_tcache_env_box.h: +core/box/../front/../box/tiny_unified_cache_hitpath_env_box.h: core/box/../front/../tiny_region_id.h: core/box/../front/../hakmem_tiny.h: core/box/../front/../box/tiny_env_box.h: