Phase 74-1/74-2: UnifiedCache LOCALIZE optimization (P1 frozen, NEUTRAL -0.87%)
Phase 74-1 (ENV-gated LOCALIZE):
- Result: +0.50% (NEUTRAL)
- Runtime branch overhead caused instructions/branches to increase
- Diagnosed: Branch tax dominates intended optimization

Phase 74-2 (compile-time LOCALIZE):
- Result: -0.87% (NEUTRAL, P1 frozen)
- Removed runtime branch → instructions -0.6%, branches -2.3% ✓
- But cache-misses +86% (register pressure/spill) → net loss
- Conclusion: LOCALIZE itself works, but is fragile to cache effects

Key finding:
- Dependency chain reduction (LOCALIZE) has low ROI due to cache-miss sensitivity
- P1 (LOCALIZE) frozen at default OFF
- Next: Phase 74-3 (P0: FASTAPI) - move branches outside hot loop

Files:
- core/hakmem_build_flags.h: HAKMEM_TINY_UC_LOCALIZE_COMPILED flag
- core/box/tiny_unified_cache_hitpath_env_box.h: ENV gate (frozen)
- core/front/tiny_unified_cache.h: compile-time #if blocks
- docs/analysis/PHASE74_*: Design, instructions, results
- CURRENT_TASK.md: P1 frozen, P0 next instructions

Also includes:
- Phase 69 refill tuning results (archived docs)
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 69 baseline update
- PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md: Route banner docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
267
CURRENT_TASK.md
@ -1,236 +1,89 @
 # CURRENT_TASK (Rolling, SSOT)

-## 0) Current source of truth
+## 0) Current source of truth (SSOT)

-- **Performance comparison**: FAST PGO build (`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`) ✓ **promoted in Phase 69** (Warm Pool Size=16)
+- **Performance comparison**: FAST PGO build (`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`) + **WarmPool=16** (promoted on the Phase 69 Strong GO)
 - **Safety/compatibility**: Standard build (`make bench_random_mixed_hakmem`)
 - **Observation**: OBSERVE build (`make perf_observe`)
-- **Scorecard**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` (M1 reached and exceeded: 51.77% vs 50% target; +3.23pp remaining to M2)
-- **Measurement (Mixed 10-run)**: `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`, `HAKMEM_WARM_POOL_SIZE=16` by default)
+- **Scorecard (targets/current values)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md`
+  - Current baseline (FAST v3 + PGO, Phase 69): **62.63M ops/s = 51.77% of mimalloc**
+  - Next target: **M2 = 55%** (**+3.23pp** remaining)
+- **Mixed 10-run SSOT**: `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`, `HAKMEM_WARM_POOL_SIZE=16` by default)

-## 1) Current status (key points)
+## 1) Avoiding getting lost (routes/observation)

-- Phase 64 (backend prune / DCE): **NO-GO** (-4.05%) → caused by layout tax
-- Phase 63 (FAST_PROFILE_FIXED): kept as a **research build** (FAST gates fixed at compile time)
-- Phase 65 (Hot Symbol Ordering): **BLOCKED** (unfair/impossible under GCC+LTO constraints) → `docs/analysis/PHASE65_HOT_SYMBOL_ORDERING_1_RESULTS.md`
-- Phase 66 (PGO, GCC+LTO): **GO** ✓
-  - Verification: 3 independent runs, +3.0% mean, all > +2.89%, variance < ±1%
-  - Baseline: `bench_random_mixed_hakmem_minimal_pgo` = 60.89M ops/s = 50.32% (initial PGO)
-- Phase 68 (PGO training set optimization): **GO & promotion complete** ✓
-  - Verification: 10-run, +1.19% vs Phase 66 (exceeds the +1.0% GO threshold)
-  - Baseline (upgraded): `bench_random_mixed_hakmem_minimal_pgo` = 61.614M ops/s = **50.93%** (exceeds the 50% target by +0.93pp)
-- Phase 69 (Refill tuning: Warm Pool Size optimization): **Strong GO & promotion complete** ✓✓✓
-  - Verification: 10-run, +3.26% vs Phase 68 (exceeds the +3.0% Strong GO threshold)
-  - New baseline: `bench_random_mixed_hakmem_minimal_pgo` (upgraded) = 62.63M ops/s = **51.77%** (exceeds M1 by +1.77pp; +3.23pp remaining to M2)
+Minimal procedure for preventing "optimizations on a path that is never taken".

-## 2) Next instructions (Active)
+- **Route Banner (eliminates route misidentification)**: `HAKMEM_ROUTE_BANNER=1`
+  - Output: Route assignments (backend route kind) + cache config (`unified_cache_enabled` / `warm_pool_max_per_class`)
+- **SSOT for refill observation**: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
+  - With WS=400 (Mixed SSOT) misses are negligible → `unified_cache_refill()` optimization is **frozen (zero ROI)**
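
The freeze decision for refill optimization rests on two checks recorded in these notes: the OBSERVE counters are only trusted when `total_allocs == total_frees`, and refill work is frozen when `miss < 1000` on the Mixed SSOT workload. A minimal C sketch of that gate follows. The struct `uc_stats_t`, the enum, and the function `refill_opt_verdict` are hypothetical names chosen for illustration; they are not taken from the hakmem sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical snapshot of the Unified-STATS counters (field names are illustrative). */
typedef struct {
    uint64_t total_allocs;
    uint64_t total_frees;
    uint64_t miss;
} uc_stats_t;

typedef enum {
    REFILL_STATS_UNRELIABLE = 0, /* counters do not balance: fix observation first */
    REFILL_FROZEN = 1,           /* misses negligible: zero ROI on this workload   */
    REFILL_WORTH_TUNING = 2      /* the refill path is actually exercised          */
} refill_verdict_t;

/* Threshold quoted in the notes: miss < 1000 on the Mixed SSOT (WS=400). */
#define REFILL_MISS_FREEZE_THRESHOLD 1000u

static refill_verdict_t refill_opt_verdict(const uc_stats_t *s) {
    assert(s != NULL);
    /* Reliability gate: do not reason about misses if the stats are inconsistent. */
    if (s->total_allocs != s->total_frees) return REFILL_STATS_UNRELIABLE;
    if (s->miss < REFILL_MISS_FREEZE_THRESHOLD) return REFILL_FROZEN;
    return REFILL_WORTH_TUNING;
}
```

With the measurement quoted in the notes (total miss=5 and balanced counters), this sketch would return `REFILL_FROZEN`.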

-**Phase 68: PGO training set optimization** ✅ **Done**
+## 2) Latest conclusions (key points only)

-- ✓ seed/WS diversification: WS (3 → 5 patterns), seed (1 → 3 patterns)
-- ✓ 10-run verification: +1.19% vs Phase 66 (exceeds the +1.0% GO threshold)
-- ✓ Baseline promoted: 61.614M ops/s = 50.93% (exceeds the 50% M1 target by +0.93pp)
-- ✓ Scorecard and CURRENT_TASK updated
+- **Phase 69 (WarmPool sweep)**: `HAKMEM_WARM_POOL_SIZE=16` is a **Strong GO (+3.26%)**; promoted to baseline.
+  - Design: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
+  - Results: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
+- **Phase 70 (observation SSOT)**: stats made visible and prerequisite gates established. Under the WS=400 SSOT, refill is cold.

----

-**Phase 67a: Layout tax forensics (minimal changes)** ✅ **Done, ready for operational use**

-- ✓ New `scripts/box/layout_tax_forensics_box.sh` (measurement harness)
-  - 10-run throughput comparison of Baseline vs Treatment
-  - Automatic perf stat collection (cycles, IPC, branches, branch-misses, cache-misses, iTLB/dTLB)
-  - Binary metadata (size, section layout)

-- ✓ New `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md` (diagnostic guide)
-  - Decision rules: GO (+1% or more) / NEUTRAL (±1%) / NO-GO (-1% or worse)
-  - "Symptom → candidate cause" mapping table
-    * IPC drop of 3% or more → I-cache miss / code layout dispersal
-    * branch-misses up 10% or more → branch prediction penalty
-    * dTLB-misses up 100% or more → data layout fragmentation
-  - Phase 64 case study (root cause of -4.05%: IPC 2.05 → 1.98)
-  - Operational guidelines

-**Usage example**:
-```bash
-./scripts/box/layout_tax_forensics_box.sh \
-  ./bench_random_mixed_hakmem_minimal_pgo \
-  ./bench_random_mixed_hakmem_fast_pruned # or Phase 64 attempt
-```

-Outcome: when a "removal-type" change comes back NO-GO, the degraded metric can be **diagnosed in a single run** → future link-out / large deletions can be stopped in advance

----

-**Phase 69: Cut "refill frequency × fixed tax" (shortest path to M2)**

-**Phase 69-0: Parameter sweep design memo** ✅ **Done**

-- ✓ Created `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
-- ✓ Identified tunable parameters:
-  - `HAKMEM_TINY_REFILL_COUNT_MID` / `HAKMEM_TINY_REFILL_COUNT_HOT` (the actual refill-amount knobs, ENV-only)
-  - Unified Cache C5-C7 capacity (128 → 256/512)
-  - Warm Pool size (12 → 16/24)
-- ✓ Drafted the sweep plan (single-parameter → combined optimization)
-- ✓ Defined risk assessment & decision criteria

-**Phase 69-1: Sweep execution** ✅ **Done**

-- ✓ Baseline (Phase 68 PGO): 60.65M ops/s (10-run mean)
-- ✓ Warm Pool Size sweep:
-  - Size=16: **62.63M ops/s (+3.26%, Strong GO)** ✓✓✓ **Winner**
-  - Size=24: 62.37M ops/s (+2.84%, GO)
-- ✓ Unified Cache C5-C7 sweep:
-  - Cache=256: 61.92M ops/s (+2.09%, GO)
-  - Cache=512: 61.80M ops/s (+1.89%, GO)
-- ✓ Combined optimization check:
-  - Warm=16 + Cache=256: 62.35M ops/s (+2.81%, non-additive)
-- ✓ The "Refill Batch Size sweep" is invalid (knob not wired):
-  - `TINY_REFILL_BATCH_SIZE` has no call site in the current Tiny front, so it does not function as a performance knob
-  - Reference: `docs/analysis/PHASE69_REFILL_TUNING_3C_REFILL_BATCH_KNOB_AUDIT.md`
-- **Results**: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
-- **Winning setting**: **Warm Pool Size=16 (ENV-only, +3.26%, Strong GO)**

-**Phase 69-2: Apply the winning setting to the baseline** ✅ **Done**

-- ✓ Added the `HAKMEM_WARM_POOL_SIZE=16` default to `scripts/run_mixed_10_cleanenv.sh`
-- ✓ Added `bench_setenv_default("HAKMEM_WARM_POOL_SIZE","16")` to the `MIXED_TINYV3_C7_SAFE` preset in `core/bench_profile.h`
-- ✓ Added the new baseline to `PERFORMANCE_TARGETS_SCORECARD.md`:
-  - Phase 69 baseline: 62.63M ops/s = 51.77% of mimalloc
-  - M1 (50%) achievement: **EXCEEDED** (+1.77pp above target)
-  - M2 (55%) progress: Gap reduced to +3.23pp
-- ✓ Rollback: set `HAKMEM_WARM_POOL_SIZE=12` or remove the ENV variable

-**New baseline**: 62.63M ops/s = **51.77%** of mimalloc (+3.26% over Phase 68; +3.23pp remaining to M2)

----

-**Phase 69-3 (next candidate): refill amount (ENV-only) sweep OR another sweep**

-- **Option A (recommended)**: ENV sweep of the refill count (no code changes)
-  - Sweep `HAKMEM_TINY_REFILL_COUNT_MID` (C4–C7) over 64/96/128/160…
-  - Sweep `HAKMEM_TINY_REFILL_COUNT_HOT` (C0–C3) likewise (note: it interacts with WarmPool/UnifiedCache)
-  - Decision: on the 10-run mean, GO (+1.0%) / Strong GO (+3.0%) / NO-GO (-1.0%)

-- **Option B**: Fine sweep of the Unified Cache (ENV-only)
-  - Sweep C5/C6/C7 over 192/256/320… (the 256/512 steps in Phase 69-1 were coarse)
-  - Isolate the cause of the non-additivity with WarmPool=16

-- **Option C**: Introduce a new compile-time knob (deferred)
-  - `TINY_REFILL_BATCH_SIZE` is not wired, so do not pursue it as-is
-  - If needed, write a separate SSOT and implement it (Phase 70+)

-- **Option D**: Optimization in a different direction (shortest path to M2: 55%)
-  - Remaining gap: +3.23pp (51.77% → 55%)
-  - Phase 67b (boundary inline/unroll tuning)
-  - Optimization of the top 50 hot functions
-  - Re-tuning of the PGO profile

----

-**Phase 67b (follow-up / fallback): boundary inline/unroll tuning**
-- **Caution**: high layout tax risk (see Phase 64)
-- **Prerequisite**: verifying that the top 50 are actually executed is mandatory
-- Recommended to defer as a fallback in case Phase 69 misses

----

-**Phase 70 (firming up observation prerequisites): make Step 0 of Refill/WarmPool optimization an SSOT**

-- Goal: prevent **"optimizations on a path that is never taken"** (precedent: the Phase 40/41/64 layout tax)
-- Note: `Route assignments: LEGACY` does not mean "Unified Cache unused" (it is the backend route kind)
 - SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
-- Proceed with Phase 70 only after **confirming with OBSERVE** whether `unified_cache_refill()` / WarmPool pop occur to a significant degree under the Mixed SSOT (WS=400)
-- ✅ Phase 70-1: Route Banner implemented (eradicates route misidentification)
-  - ENV: `HAKMEM_ROUTE_BANNER=1`
-  - Output: Route assignments (backend route kind) + cache config (unified_cache / warm_pool_max_per_class)
-- ✅ Phase 70-3: SSOT for OBSERVE stats consistency (eradicates "it just was not visible" accidents)
-  - Confirm `Unified-STATS total_allocs == total_frees` before any discussion (stats reliability gate)
-- ✅ Phase 70-2: Handling of refill optimization decided (SSOT)
-  - If `Unified-STATS miss < 1000` under the Mixed SSOT (WS=400), **refill optimization is frozen (zero ROI)**
-  - Current measurement: misses are negligible (e.g. total miss=5) → refill optimization has no ROI on the SSOT workload
+- **Phase 71/73 (WarmPool=16 win source confirmed)**: the win comes from a **slight reduction in instructions/branches** (confirmed with perf stat).
 - Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
+- **Phase 72 (ENV knob ROI exhausted)**: no ENV-only win beyond WarmPool=16 → **time to attack via structure (code)**.

----
+## 3) Operating rules (Box Theory + layout tax countermeasures)

-**Phase 73: Pin down the WarmPool=16 "win source" with perf** ✅ **Done, paradox resolved**
+- Always stack changes as **a box + a single boundary + revertible via ENV** (fail-fast, minimal observability).
+- A/B is, as a rule, an **ENV toggle on the same binary** (comparing different binaries mixes in layout effects).
+- "Faster by deleting" is off the table (link-out / large deletions easily flip sign through layout tax) → prefer **compile-out**.
+- Diagnosis: `scripts/box/layout_tax_forensics_box.sh` / `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md`

-- Background: WarmPool=16 improves throughput/CV, but visible counters such as Unified/WarmPool are nearly identical → most likely a **"per-operation cost difference"** (TLB/LLC/frequency/placement)
-- Goal: reduce the WarmPool=12 vs 16 difference to "what decreased" using **perf stat**, and commit to the next structural optimization (Phase 72)
-- Method: **same binary + cleanenv + alternating runs** (avoids layout tax / environment drift)
-  - A: `HAKMEM_WARM_POOL_SIZE=12`
-  - B: `HAKMEM_WARM_POOL_SIZE=16`
-  - events: `cycles,instructions,branches,branch-misses,cache-misses,LLC-load-misses,iTLB-load-misses,dTLB-load-misses,page-faults`
+## 4) Next instructions (Active)

-**Results** (paradox):
-- ✅ Throughput: +0.91% (46.52M → 46.95M ops/s)
-- ✅ **instructions**: -0.38% (-17.4M instructions) ← **PRIMARY WIN SOURCE**
-- ✅ **branches**: -0.30% (-3.7M branches) ← **SECONDARY WIN SOURCE**
-- ⚠️ **dTLB-load-misses**: +29.06% (28,792 → 37,158) ← **WORSE**
-- ⚠️ **cache-misses**: +17.80% (458K → 540K) ← **WORSE**
-- ✓ page-faults: -0.21% (negligible)
+### Phase 74 (structural): shorten the UnifiedCache hit-path ✅ **P1 (LOCALIZE) frozen**

-**Phase 71 hypothesis (REJECTED)**:
-- Prediction: "TLB/cache efficiency improvement from memory layout"
-- Measured: TLB/cache metrics both **DEGRADED**
+**Premises**:
+- Under the WS=400 SSOT, UnifiedCache misses are negligible → refill optimization has zero ROI.
+- The WarmPool=16 win is a slight instruction/branch reduction → shortening the hit-path is the direct approach.

-**Phase 73 conclusion**:
-- Win source: **Control-flow optimization (instruction/branch count reduction)**
-- Mechanism: WarmPool=16 selects a shorter code path → 17.4M fewer instructions
-- Trade-off: +4MB RSS → worse TLB/cache, but instruction savings dominate
-- Net benefit: ~8.2M cycles saved (instruction/branch) >> ~4.2M cycles lost (TLB/cache)
+**Phase 74-1: LOCALIZE (ENV-gated)** ✅ **Done (NEUTRAL +0.50%)**
+- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1`
+- Runtime branch overhead **increased** instructions/branches (+0.7%/+0.4%)
+- Verdict: **NEUTRAL (+0.50%)**

-**Details**: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`, Phase 73 section
+**Phase 74-2: LOCALIZE (compile-time gate)** ✅ **Done (NEUTRAL -0.87%)**
+- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
+- Removing the runtime branch **improved** instructions/branches (-0.6%/-2.3%) ✓
+- However, **cache-misses +86%** (register pressure / spill) → throughput **-0.87%**
+- Isolation succeeded: **LOCALIZE itself wins, but the gain is offset by the cache-miss increase**
+- Verdict: **NEUTRAL (-0.87%)** → **P1 (LOCALIZE) frozen**

-**Phase 72 (structural): amplify the WarmPool=16 win source (after the Phase 73 results are in)**
+**Conclusion**:
+- P1 (LOCALIZE) is frozen at default OFF (low ROI from dependency chain reduction)
+- Next: proceed to **Phase 74-3 (P0: FASTAPI)**

-- Prerequisite: start only after Phase 73 has confirmed the "win source" numerically (tweaking on guesswork repeats Phase 40/41/64)
-- Phase 73 conclusion: **the instruction/branch reduction dominates** (TLB/cache actually got worse) → the essence is that WarmPool=16 makes execution take a "shorter path"
+**Phase 74-3: P0 (FASTAPI)** 🟡 **Next instructions**

-**Phase 72-0 (SSOT): identify "which functions got shorter" before going structural**
+**Goal**: push the `unified_cache_enabled()` / `lazy-init` / `stats` checks **out of the hot loop**

-- A/B stays WarmPool=12 vs 16 (same binary, cleanenv)
-- Take perf record **on instructions/branches rather than cycles** (because the cause is the instruction/branch reduction)
-  - `perf record -e instructions:u -c 100000 -- ./bench_random_mixed_hakmem_observe 20000000 400 1`
-  - `perf record -e branches:u -c 100000 -- ./bench_random_mixed_hakmem_observe 20000000 400 1`
-- Goal: determine the **top 3 functions whose instruction share / branch share dropped** with WarmPool=16 (e.g. `shared_pool_acquire_slab`, `unified_cache_refill`, `warm_pool_do_prefill`, `superslab_refill`)
+**Approach**:
+- Add the `unified_cache_push_fast()` / `unified_cache_pop_fast()` API
+- Precondition: the caller guarantees "valid/enabled/no-stats"
+- Fail-fast: on an unexpected state, fall back to the slow path (single boundary)
+- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)

-**Phase 72-1 (structural): touch only the identified functions (single box boundary)** ✅ **Cancelled (zero ROI)**
+**Expected**: +1-2% via branch reduction (a different axis from P1)

-- perf record result: `unified_cache_push` shows -0.86% branches (largest reduction)
-- Original plan: optimize the Unified Cache FULL drain
-- **Reason for cancellation**: `full=0` for every class (no FULL events occur) → zero ROI
+**Verdict**:
+- **GO**: +1.0% or more
+- **NEUTRAL**: ±1.0% (freeze, move on)
+- **NO-GO**: -1.0% or worse (revert immediately)

-**Phase 72-2: additional WarmPool sweep** ✅ **Done (ROI exhausted)**
+**References**:
+- Design: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
+- Instructions: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
+- Results (P1): `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md`
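
The Phase 74-3 approach (validate state once in the caller, then call a check-free fast function, with one fallback boundary to the slow path) can be sketched as below. This is an illustration of the idea only, under stated assumptions: `demo_cache_t`, `demo_push_fast`, `demo_push_slow`, and `demo_push_many` are hypothetical names, and the cache is simplified to a counted array rather than the project's ring buffer. It does not show the actual `unified_cache_push_fast()` API, which the notes only plan.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CAP 8u  /* hypothetical capacity */

typedef struct {
    void    *slots[DEMO_CAP];
    unsigned count;        /* number of cached pointers */
    int      enabled;      /* stands in for unified_cache_enabled() */
    int      initialized;  /* stands in for the lazy-init check */
} demo_cache_t;

/* Slow-path helper: performs the lazy initialization. */
static void demo_lazy_init(demo_cache_t *c) {
    c->count = 0;
    c->initialized = 1;
}

/* Fast variant: the caller has already guaranteed enabled + initialized,
 * so only the capacity check remains in the hot path. */
static inline int demo_push_fast(demo_cache_t *c, void *p) {
    if (c->count == DEMO_CAP) return 0;  /* full */
    c->slots[c->count++] = p;
    return 1;
}

/* Slow variant: re-validates state on every call. */
static int demo_push_slow(demo_cache_t *c, void *p) {
    if (!c->enabled) return 0;
    if (!c->initialized) demo_lazy_init(c);
    return demo_push_fast(c, p);
}

/* Pushes up to n pointers and returns how many were accepted.
 * The state checks run once, outside the loop; the else-branch is the
 * single fallback boundary that defers to the slow variant. */
static size_t demo_push_many(demo_cache_t *c, void **ptrs, size_t n) {
    assert(c != NULL && ptrs != NULL);
    size_t pushed = 0;
    if (c->enabled && c->initialized) {
        while (pushed < n && demo_push_fast(c, ptrs[pushed])) pushed++;
    } else {
        while (pushed < n && demo_push_slow(c, ptrs[pushed])) pushed++;
    }
    return pushed;
}
```

The design choice illustrated here is that the per-call branches move to a per-batch decision; whether that yields the expected +1-2% in the real allocator is exactly what the planned A/B run is meant to establish.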

-- Goal: check whether any setting other than WarmPool=16 wins
-- Baseline: WarmPool=16 = 56.23M ops/s (10-run)
-- Results:
-  - WarmPool=20: 56.13M ops/s (**-0.18%**, NO-GO)
-  - WarmPool=24: 56.30M ops/s (**+0.12%**, within noise)
-  - WarmPool=32: 56.07M ops/s (**-0.28%**, NO-GO)
-- **Verdict**: all candidates within ±0.5% → **Phase 72 closed (ENV knob ROI exhausted)**
+## 5) Archive

----

-**Phase 72 summary**:
-- **Confirmed**: WarmPool=16 is the optimal value (established in Phase 69, reconfirmed in Phase 72)
-- **Confirmed**: no room left for further optimization via ENV knobs
-- **Win source**: instruction/branch reduction dominates (confirmed in Phase 73)
-- **Next step**: structural changes (code changes) are required

-**Note**: do not delete research boxes now (there is strong precedent for link-out/deletion causing layout tax, so keeping compile-out is the right call)

----

-**Phase 74 (next candidate): optimization through structural changes**

-- **Premise**: ENV knob ROI exhausted → code changes required
-- **Candidate A**: branch reduction in `unified_cache_push` (largest contribution confirmed in Phase 72-0)
-- **Candidate B**: stronger inlining of the hot path (layout tax risk, forensics required)
-- **Candidate C**: PGO profile re-tuning (retrain assuming WarmPool=16)
-- **Decision criteria**: +1.0% → GO, below +0.5% → NO-GO

-## 3) Archive

 - Detailed log: `CURRENT_TASK_ARCHIVE_20251210.md`
-- Most recent pre-cleanup snapshot: `docs/analysis/CURRENT_TASK_ARCHIVE.md`
+- Pre-cleanup snapshot: `docs/analysis/CURRENT_TASK_ARCHIVE.md`
32
core/box/tiny_unified_cache_hitpath_env_box.h
Normal file
@ -0,0 +1,32 @
+// tiny_unified_cache_hitpath_env_box.h - Phase 74: ENV gate for hit-path LOCALIZE
+//
+// Purpose: ENV-gated toggle for unified_cache_push/pop LOCALIZE optimization
+// Design: lazy-init pattern to avoid hot-path getenv overhead
+//
+// ENV: HAKMEM_TINY_UC_LOCALIZE=0/1 (default 0, OFF)
+//
+// Box Theory:
+// L0: ENV gate (this file)
+// L1: LOCALIZE implementation (in tiny_unified_cache.h)
+
+#ifndef HAK_BOX_TINY_UNIFIED_CACHE_HITPATH_ENV_BOX_H
+#define HAK_BOX_TINY_UNIFIED_CACHE_HITPATH_ENV_BOX_H
+
+#include <stdlib.h>
+
+// ============================================================================
+// Phase 74: LOCALIZE ENV Gate (lazy-init, cached)
+// ============================================================================
+
+// Check if LOCALIZE optimization is enabled
+// Uses lazy-init pattern: getenv called once, then cached
+static inline int tiny_uc_localize_enabled(void) {
+    static int g_enabled = -1; // -1 = uninitialized
+    if (__builtin_expect(g_enabled == -1, 0)) {
+        const char* e = getenv("HAKMEM_TINY_UC_LOCALIZE");
+        g_enabled = (e && *e && *e != '0') ? 1 : 0;
+    }
+    return g_enabled;
+}
+
+#endif // HAK_BOX_TINY_UNIFIED_CACHE_HITPATH_ENV_BOX_H
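
The header above combines two things: a truthiness rule for the variable's value and a cache so that `getenv` runs only on the first call. A small standalone sketch that separates the two may make the rule easier to check in isolation. The names `DEMO_ENV_NAME`, `demo_parse_flag`, and `demo_gate_enabled` are hypothetical and exist only for this illustration; `__builtin_expect` is a GCC/Clang builtin, as in the original header.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical variable name, used only by this sketch. */
#define DEMO_ENV_NAME "DEMO_UC_LOCALIZE"

/* Same truthiness rule as the header above:
 * unset, empty, or a value starting with '0' means OFF; anything else means ON. */
static inline int demo_parse_flag(const char *value) {
    return (value && *value && *value != '0') ? 1 : 0;
}

/* Lazy-init gate: getenv() runs on the first call only;
 * later calls return the cached result even if the environment changes. */
static inline int demo_gate_enabled(void) {
    static int cached = -1;  /* -1 = not yet read */
    if (__builtin_expect(cached == -1, 0)) {
        cached = demo_parse_flag(getenv(DEMO_ENV_NAME));
    }
    assert(cached == 0 || cached == 1);
    return cached;
}
```

Note that this sketch, like any plain `static int` cache, ignores thread safety on the first call. The commit's own measurement also suggests that even a well-predicted gate branch in the hit path is not free, which is why Phase 74-2 moved the decision to compile time.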
@ -31,6 +31,7 @@
 #include "../box/ptr_type_box.h" // Phantom pointer types (BASE/USER)
 #include "../box/tiny_front_config_box.h" // Phase 8-Step1: Config macros
 #include "../box/tiny_tcache_box.h" // Phase 14 v1: Intrusive LIFO tcache
+#include "../box/tiny_unified_cache_hitpath_env_box.h" // Phase 74: LOCALIZE ENV gate

 // ============================================================================
 // Phase 3 C2 Patch 3: Bounds Check Compile-out
@ -247,6 +248,30 @@ static inline int unified_cache_push(int class_idx, hak_base_ptr_t base) {
     }
 #endif

+    // Phase 74-2: LOCALIZE optimization (compile-time gate, no runtime branch)
+#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
+    // LOCALIZE: Load head/tail/mask once into locals to avoid reload dependency chains
+    uint16_t head = cache->head;
+    uint16_t tail = cache->tail;
+    uint16_t mask = cache->mask;
+    uint16_t next_tail = (tail + 1) & mask;
+
+    if (__builtin_expect(next_tail == head, 0)) {
+#if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED
+        g_unified_cache_full[class_idx]++;
+#endif
+        return 0; // Full
+    }
+
+    cache->slots[tail] = base_raw;
+    cache->tail = next_tail;
+
+#if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED
+    g_unified_cache_push[class_idx]++;
+#endif
+    return 1; // SUCCESS (LOCALIZE path)
+#else
+    // Default path: Original implementation
     uint16_t next_tail = (cache->tail + 1) & cache->mask;

     // Full check (leave 1 slot empty to distinguish full/empty)
@ -266,6 +291,7 @@ static inline int unified_cache_push(int class_idx, hak_base_ptr_t base) {
 #endif

     return 1; // SUCCESS (2-3 cache misses total)
+#endif // HAKMEM_TINY_UC_LOCALIZE_COMPILED
 }

 // ============================================================================
@ -316,6 +342,37 @@ static inline hak_base_ptr_t unified_cache_pop_or_refill(int class_idx) {
     }
 #endif

+    // Phase 74-2: LOCALIZE optimization (compile-time gate, no runtime branch)
+#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
+    // LOCALIZE: Load head/tail/mask once into locals to avoid reload dependency chains
+    uint16_t head = cache->head;
+    uint16_t tail = cache->tail;
+    uint16_t mask = cache->mask;
+
+    if (__builtin_expect(head != tail, 1)) {
+        void* base = cache->slots[head];
+        cache->head = (head + 1) & mask;
+#if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED
+        g_unified_cache_hit[class_idx]++;
+#endif
+#if HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
+        if (__builtin_expect(unified_cache_measure_check(), 0)) {
+            atomic_fetch_add_explicit(&g_unified_cache_hits_global,
+                                      1, memory_order_relaxed);
+            atomic_fetch_add_explicit(&g_unified_cache_hits_by_class[class_idx],
+                                      1, memory_order_relaxed);
+        }
+#endif
+        return HAK_BASE_FROM_RAW(base); // Hit! (LOCALIZE path)
+    }
+
+    // Cache miss → Batch refill from SuperSlab
+#if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED
+    g_unified_cache_miss[class_idx]++;
+#endif
+    return unified_cache_refill(class_idx);
+#else
+    // Default path: Original implementation
     // Tcache miss/disabled/compiled-out → try pop from array cache (fast path)
     if (__builtin_expect(cache->head != cache->tail, 1)) {
         void* base = cache->slots[cache->head]; // 1 cache miss (array access)
@ -341,6 +398,7 @@ static inline hak_base_ptr_t unified_cache_pop_or_refill(int class_idx) {
     g_unified_cache_miss[class_idx]++;
 #endif
     return unified_cache_refill(class_idx); // Refill + return first block (BASE)
+#endif // HAKMEM_TINY_UC_LOCALIZE_COMPILED
 }

 #endif // HAK_FRONT_TINY_UNIFIED_CACHE_H
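
The push and pop hunks above both rely on the same ring-buffer invariant: capacity is a power of two, indices wrap with `& mask`, and one slot is always left empty so that `head == tail` means empty and `next_tail == head` means full. The following standalone sketch reproduces that shape with head/tail/mask read once into locals. `ring_t`, `RING_CAP`, and the function names are hypothetical; the real cache additionally carries stats counters and a refill path that are omitted here.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define RING_CAP 4u  /* hypothetical capacity; must be a power of two */

typedef struct {
    void    *slots[RING_CAP];
    uint16_t head;  /* next slot to pop  */
    uint16_t tail;  /* next slot to push */
    uint16_t mask;  /* RING_CAP - 1      */
} ring_t;

static void ring_init(ring_t *r) {
    assert((RING_CAP & (RING_CAP - 1u)) == 0u);  /* power-of-two check */
    r->head = 0;
    r->tail = 0;
    r->mask = (uint16_t)(RING_CAP - 1u);
}

/* Push with head/tail/mask loaded once into locals (the LOCALIZE shape). */
static int ring_push(ring_t *r, void *p) {
    uint16_t head = r->head, tail = r->tail, mask = r->mask;
    uint16_t next_tail = (uint16_t)((tail + 1) & mask);
    if (next_tail == head) return 0;  /* full: one slot stays empty */
    r->slots[tail] = p;
    r->tail = next_tail;
    return 1;
}

/* Pop returns NULL when the ring is empty (the real code refills instead). */
static void *ring_pop(ring_t *r) {
    uint16_t head = r->head, tail = r->tail, mask = r->mask;
    if (head == tail) return NULL;
    void *p = r->slots[head];
    r->head = (uint16_t)((head + 1) & mask);
    return p;
}
```

A ring with `RING_CAP` slots therefore holds at most `RING_CAP - 1` pointers. Whether the local copies help or hurt is workload- and compiler-dependent; the commit reports fewer instructions but more cache misses for this shape.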
@ -434,6 +434,18 @@
 # define HAKMEM_ALLOC_GATE_CLS_MIS_COMPILED 0
 #endif

+// ------------------------------------------------------------
+// Phase 74: UnifiedCache LOCALIZE (Compile-time hit-path optimization)
+// ------------------------------------------------------------
+// LOCALIZE: Load head/tail/mask once into locals to avoid reload dependency chains
+// When =1: Always use localize version (no runtime branch, maximum DCE)
+// When =0: Use original implementation (default, backward compatible)
+// Build: make EXTRA_CFLAGS="-DHAKMEM_TINY_UC_LOCALIZE_COMPILED=1" [target]
+// Expected impact: +0.5-1.5% via dependency chain reduction
+#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
+# define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0
+#endif
+
 // ------------------------------------------------------------
 // Helper enum (for documentation / logging)
 // ------------------------------------------------------------
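
The flag above follows a default-to-0 pattern: the build can override it with `-D`, and the `#if` selects one variant at compile time so no runtime branch remains. A minimal sketch of the same pattern, using a hypothetical flag `DEMO_LOCALIZE_COMPILED` and hypothetical helper names, is shown below; both variants compute the same value, which is the property an A/B build of this kind depends on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag; mirrors the #ifndef / #define default-to-0 pattern above. */
#ifndef DEMO_LOCALIZE_COMPILED
#  define DEMO_LOCALIZE_COMPILED 0
#endif

typedef struct { unsigned head, tail, mask; } demo_ring_idx_t;

/* Number of occupied slots in a power-of-two ring.
 * Only one of the two bodies is compiled; the other is removed entirely. */
static inline unsigned demo_ring_count(const demo_ring_idx_t *r) {
    assert(r != NULL);
#if DEMO_LOCALIZE_COMPILED
    unsigned head = r->head, tail = r->tail, mask = r->mask;
    return (tail - head) & mask;
#else
    return (r->tail - r->head) & r->mask;
#endif
}

/* Lets logs or banners report which variant was compiled in. */
static inline const char *demo_variant_name(void) {
    return DEMO_LOCALIZE_COMPILED ? "localize" : "default";
}
```

Compiling this sketch with `-DDEMO_LOCALIZE_COMPILED=1` would select the localized body. Note that, unlike an ENV toggle, the two variants live in different binaries, so a comparison between them can also pick up code-layout effects.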
@ -11,7 +11,7 @@

 Comparisons against mimalloc are made with the **FAST build** (Standard includes fixed tax, so the comparison would not be fair).

-## Current snapshot (2025-12-17, Phase 68 PGO, new baseline)
+## Current snapshot (2025-12-18, Phase 69 PGO + WarmPool=16, current baseline)

 Measurement conditions (source of truth for reproduction):
 - Mixed: `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`)
197
docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md
Normal file
@ -0,0 +1,197 @
+# Phase 69-1: Refill Tuning Parameter Sweeps - Results
+
+**Date**: 2025-12-17
+**Baseline**: Phase 68 PGO (`bench_random_mixed_hakmem_minimal_pgo`)
+**Benchmark**: `scripts/run_mixed_10_cleanenv.sh` (RUNS=10)
+**Goal**: Find +3-6% optimization for M2 milestone (55% of mimalloc)
+
+---
+
+## Executive Summary
+
+**Winner Identified**: **Warm Pool Size=16** achieves **+3.26% (Strong GO)** with ENV-only change.
+
+- **No code changes required** - Deploy via `HAKMEM_WARM_POOL_SIZE=16` environment variable
+- **Exceeds the Strong GO threshold** (+3.0%)
+- **Single strongest improvement** among all tested parameters
+- **Combined optimizations are non-additive** - Warm Pool Size=16 alone outperforms combinations
+
+⚠️ **Important correction (2025-12 audit)**:
+The previously reported “Refill Batch Size sweep” based on `TINY_REFILL_BATCH_SIZE` was **not measuring a real knob**.
+That macro currently has **zero call sites** (it is defined but not referenced in the active Tiny front path), so any
+observed deltas were **layout/drift noise**, not an algorithmic effect.
+
+---
+
+## Full Sweep Results
+
+### Baseline (Phase 68 PGO)
+
+| Metric | Value |
+|--------|-------|
+| **Mean** | 60.65M ops/s |
+| **Median** | 60.68M ops/s |
+| **CV** | 1.68% |
+| **% of mimalloc** | 50.93% |
+
+**Runs**: 10
+**Binary**: `bench_random_mixed_hakmem_minimal_pgo` (PGO optimized)
+
+---
+
+### 1. Warm Pool Size Sweep (ENV-only, no recompile)
+
+**Parameter**: `HAKMEM_WARM_POOL_SIZE` (default: 12 SuperSlabs/class)
+
+| Size | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
+|------|----------------|------------------|----|-----------:|----------|
+| **16** | **62.63** | **63.38** | 2.43% | **+3.26%** | **Strong GO** ✓✓✓ |
+| 24 | 62.37 | 62.35 | 1.99% | +2.84% | GO ✓ |
+
+**Winner**: **Size=16 (+3.26%)**
+
+**Analysis**:
+- Size=16 exceeds +3.0% Strong GO threshold
+- Size=24 shows diminishing returns (+2.84% vs +3.26%)
+- Optimal sweet spot at Size=16 balances cache hit rate vs memory overhead
+
+**Command Used**:
+```bash
+# Size=16
+HAKMEM_WARM_POOL_SIZE=16 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
+
+# Size=24
+HAKMEM_WARM_POOL_SIZE=24 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
+```
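
The "vs Baseline" column and the Decision column can be reproduced from the 10-run means. The sketch below assumes the thresholds stated elsewhere in this commit (Strong GO at +3.0%, GO at +1.0%, NO-GO at -1.0%, NEUTRAL in between); the type and function names are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>

typedef enum {
    VERDICT_NO_GO,
    VERDICT_NEUTRAL,
    VERDICT_GO,
    VERDICT_STRONG_GO
} verdict_t;

/* Percent change of a candidate mean relative to the baseline mean. */
static double delta_percent(double baseline_mops, double candidate_mops) {
    assert(baseline_mops > 0.0);
    return (candidate_mops / baseline_mops - 1.0) * 100.0;
}

/* Maps a percent delta onto the decision labels used in these reports. */
static verdict_t classify(double delta_pct) {
    if (delta_pct >= 3.0)  return VERDICT_STRONG_GO;
    if (delta_pct >= 1.0)  return VERDICT_GO;
    if (delta_pct <= -1.0) return VERDICT_NO_GO;
    return VERDICT_NEUTRAL;
}
```

For example, `delta_percent(60.65, 62.63)` evaluates to roughly +3.26, which `classify` maps to `VERDICT_STRONG_GO`, matching the Size=16 row. Deltas recomputed from the rounded means in the tables can differ from the reported figures in the last digit.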
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 2. Unified Cache C5-C7 Sweep (ENV-only, no recompile)
|
||||||
|
|
||||||
|
**Parameter**: `HAKMEM_TINY_UNIFIED_C5`, `HAKMEM_TINY_UNIFIED_C6`, `HAKMEM_TINY_UNIFIED_C7` (default: 128 slots)
|
||||||
|
|
||||||
|
| Cache Size | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
|
||||||
|
|------------|----------------|------------------|----|-----------:|----------|
|
||||||
|
| **256** | **61.92** | **61.70** | 1.49% | **+2.09%** | **GO** ✓ |
|
||||||
|
| 512 | 61.80 | 62.00 | 1.21% | +1.89% | GO ✓ |
|
||||||
|
|
||||||
|
**Winner**: **Cache=256 (+2.09%)**
|
||||||
|
|
||||||
|
**Analysis**:
|
||||||
|
- Cache=256 shows +2.09% improvement (GO threshold)
|
||||||
|
- Cache=512 shows diminishing returns (+1.89% vs +2.09%)
|
||||||
|
- Larger caches provide marginal gains while increasing memory overhead
|
||||||
|
- Lower CV (1.49%) indicates stable performance
|
||||||
|
|
||||||
|
**Command Used**:
|
||||||
|
```bash
|
||||||
|
# Cache=256
|
||||||
|
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
|
||||||
|
|
||||||
|
# Cache=512
|
||||||
|
HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 3. Combined Optimization Check
|
||||||
|
|
||||||
|
**Configuration**: Warm Pool Size=16 + Unified Cache C5-C7=256
|
||||||
|
|
||||||
|
| Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
|
||||||
|
|----------------|------------------|----|-----------:|----------|
|
||||||
|
| 62.35 | 62.32 | 1.91% | +2.81% | GO (non-additive) |
|
||||||
|
|
||||||
|
**Analysis**:
|
||||||
|
- Combined result (+2.81%) is **LESS than** Warm Pool Size=16 alone (+3.26%)
|
||||||
|
- **Non-additive behavior** indicates parameters are not orthogonal
|
||||||
|
- **Likely explanation**: Warm pool optimization reduces unified cache miss rate, making cache capacity increase redundant
|
||||||
|
- **Recommendation**: Use Warm Pool Size=16 alone for maximum benefit
|
||||||
|
|
||||||
|
**Command Used**:
|
||||||
|
```bash
HAKMEM_WARM_POOL_SIZE=16 HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 4. Refill Batch Size Sweep (invalid — macro not wired)
|
||||||
|
|
||||||
|
The `TINY_REFILL_BATCH_SIZE` macro is currently **define-only**:
|
||||||
|
|
||||||
|
```bash
rg -n "TINY_REFILL_BATCH_SIZE" core
# -> core/hakmem_tiny_config.h only
```
|
||||||
|
|
||||||
|
So we do **not** treat it as a tuning parameter until it is actually connected to refill logic.
|
||||||
|
|
||||||
|
If we want to tune refill frequency, use the real knobs:
|
||||||
|
- `HAKMEM_TINY_REFILL_COUNT_HOT`
|
||||||
|
- `HAKMEM_TINY_REFILL_COUNT_MID`
|
||||||
|
- `HAKMEM_TINY_REFILL_COUNT` / `HAKMEM_TINY_REFILL_COUNT_C{0..7}`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recommendations
|
||||||
|
|
||||||
|
### Phase 69-2 (Baseline Promotion)
|
||||||
|
|
||||||
|
**Primary Recommendation**: **Deploy Warm Pool Size=16 (ENV-only)**
|
||||||
|
|
||||||
|
**Rationale**:
|
||||||
|
1. **Strongest single improvement** (+3.26%, Strong GO)
|
||||||
|
2. **No code changes required** - Zero risk of layout tax
|
||||||
|
3. **Immediate deployment** via environment variable
|
||||||
|
4. **Exceeds M2 threshold** (+3.0% Strong GO criterion)
|
||||||
|
|
||||||
|
**Deployment**:
|
||||||
|
```bash
# Add to PGO training environment and benchmark scripts
export HAKMEM_WARM_POOL_SIZE=16
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Secondary Options (for Phase 69-3+)
|
||||||
|
|
||||||
|
**Option A: Warm Pool Size=16 + Refill Batch=32**
|
||||||
|
- **Combined potential**: Unknown (requires testing, may be non-additive like unified cache)
|
||||||
|
- **Complexity**: Requires PGO rebuild for Batch=32
|
||||||
|
- **Risk**: Layout tax from code change
|
||||||
|
|
||||||
|
**Option B: Warm Pool Size=16 alone (recommended)**
|
||||||
|
- **Gain**: +3.26% guaranteed
|
||||||
|
- **Complexity**: ENV-only, zero code changes
|
||||||
|
- **Risk**: None (reversible via ENV)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Raw Data Files
|
||||||
|
|
||||||
|
All 10-run logs saved to:
|
||||||
|
- `/tmp/phase69_baseline.log` - Phase 68 PGO baseline
|
||||||
|
- `/tmp/phase69_warm16.log` - Warm Pool Size=16
|
||||||
|
- `/tmp/phase69_warm24.log` - Warm Pool Size=24
|
||||||
|
- `/tmp/phase69_cache256.log` - Unified Cache C5-C7=256
|
||||||
|
- `/tmp/phase69_cache512.log` - Unified Cache C5-C7=512
|
||||||
|
- `/tmp/phase69_combined.log` - Combined (Warm=16 + Cache=256)
|
||||||
|
- `/tmp/phase69_batch32.log` - Refill Batch=32
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Next Steps
|
||||||
|
|
||||||
|
**Awaiting User Instructions for Phase 69-2**:
|
||||||
|
1. Confirm Warm Pool Size=16 as baseline promotion candidate
|
||||||
|
2. Decide whether to:
|
||||||
|
- Update ENV defaults in `hakmem_tiny_config.h` (preferred for SSOT)
|
||||||
|
- Document as recommended ENV setting in README/docs
|
||||||
|
- Add to PGO training scripts
|
||||||
|
3. Re-run `make pgo-fast-full` with `HAKMEM_WARM_POOL_SIZE=16` in training environment
|
||||||
|
4. Update `PERFORMANCE_TARGETS_SCORECARD.md` with new baseline (projected: 62.63M ops/s, ~52.6% of mimalloc)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Phase 69-1 Status**: ✅ **COMPLETE**
|
||||||
|
**Winner**: **Warm Pool Size=16 (+3.26%, Strong GO, ENV-only)**
|
||||||
@ -0,0 +1,46 @@
|
|||||||
|
# Phase 69-3A: Refill Batch=64 build failure triage — Root cause & fix
|
||||||
|
|
||||||
|
## Symptom
|
||||||
|
|
||||||
|
`make pgo-fast-build` (profile-use) fails to link with undefined `__gcov_*` symbols, e.g.:
|
||||||
|
|
||||||
|
- `__gcov_init`, `__gcov_exit`
|
||||||
|
- `__gcov_merge_add`, `__gcov_merge_topn`
|
||||||
|
- `__gcov_time_profiler_counter`
|
||||||
|
|
||||||
|
This appeared when trying to evaluate `Refill Batch Size=64`.
|
||||||
|
|
||||||
|
## Root cause (actual)
|
||||||
|
|
||||||
|
The failure is **not** “compiler limit due to batch=64”.
|
||||||
|
|
||||||
|
It is a **stale object mixing** problem:
|
||||||
|
- Some benchmark `.o` files were built in the profile-gen step (`-fprofile-generate`) and **were not removed by `make clean`**.
|
||||||
|
- In the profile-use step (`-fprofile-use`), those stale instrumented `.o` files were reused and linked without `-fprofile-generate` → libgcov was not pulled in.
|
||||||
|
- Result: unresolved `__gcov_*` symbols at link time.
|
||||||
|
|
||||||
|
In other words: **instrumented bench object reused in non-instrumented link**.
|
||||||
|
|
||||||
|
## Fix (minimal, safe)
|
||||||
|
|
||||||
|
Strengthen `make clean` to remove benchmark objects/binaries that were previously omitted, including:
|
||||||
|
- `bench_random_mixed_hakmem.o`
|
||||||
|
- `bench_tiny_hot_hakmem.o`
|
||||||
|
- related bench variants (`*_system`, `*_mi`, `*_hakx`, `*_minimal*`, etc.)
|
||||||
|
|
||||||
|
This preserves toolchain fairness (GCC + LTO) and prevents cross-step contamination in PGO workflows.
|
||||||
|
|
||||||
|
## Verification
|
||||||
|
|
||||||
|
After the fix, the Phase 66 PGO pipeline builds successfully again:
|
||||||
|
|
||||||
|
```sh
make pgo-fast-profile pgo-fast-collect pgo-fast-build
```
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
- This fix is **layout-neutral**: it only affects build hygiene (artifact cleanup).
|
||||||
|
- This also hardens other workflows where flags change across builds (PGO / FAST targets).
|
||||||
|
- Follow-up audit note (2025-12): `TINY_REFILL_BATCH_SIZE` is currently define-only (no call sites), so the “batch=64” performance experiment itself was not measuring a real knob; however the build hygiene fix remains valid and important.
|
||||||
@ -0,0 +1,45 @@
|
|||||||
|
# Phase 69-3B: Refill Batch Size sweep (PGO, warm_pool=16) — Results
|
||||||
|
|
||||||
|
⚠️ **INVALID (2025-12 audit)**: `TINY_REFILL_BATCH_SIZE` is currently **not wired** into the active Tiny front path (it has zero call sites; define-only in `core/hakmem_tiny_config.h`). Any observed deltas in this file should be treated as **layout/drift noise**, not an algorithmic effect. This document is kept only as an experiment record.
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
Phase 69-2 promoted the ENV-only winner:
|
||||||
|
- `HAKMEM_WARM_POOL_SIZE=16`
|
||||||
|
|
||||||
|
This phase explores compile-time refill batch size (`TINY_REFILL_BATCH_SIZE`) under the current PGO workflow:
|
||||||
|
- `make pgo-fast-full` (GCC + LTO preserved)
|
||||||
|
- Training uses cleanenv-aligned workloads (`scripts/box/pgo_fast_profile_config.sh`)
|
||||||
|
|
||||||
|
## Build hygiene prerequisite
|
||||||
|
|
||||||
|
Batch=64 originally “failed to build” due to stale profile-gen bench objects being reused in profile-use links.
|
||||||
|
That issue is fixed by strengthening `make clean` (see `docs/analysis/PHASE69_REFILL_TUNING_3A_BUILD_FAILURE_TRIAGE_BATCH64.md`).
|
||||||
|
|
||||||
|
## Measurement (Mixed 10-run)
|
||||||
|
|
||||||
|
All results are from the same host session, using:
|
||||||
|
- `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo`
|
||||||
|
- `RUNS=10 scripts/run_mixed_10_cleanenv.sh`
|
||||||
|
|
||||||
|
| Batch | Mean (M ops/s) | Median (M ops/s) | CV |
|------:|----------------:|-----------------:|---:|
| 16 | 61.30 | 61.64 | 1.50% |
| 32 | 60.73 | 61.17 | 2.19% |
| 48 | 61.94 | 62.54 | 1.53% |
| 64 | 61.51 | 61.81 | 1.56% |
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
- **Batch=48** is the best of the tested set in this session (+~1.0% vs batch=16 baseline).
|
||||||
|
- **Batch=32** regresses in this session (note: it was previously GO under a different baseline).
|
||||||
|
- **Batch=64** builds successfully after the hygiene fix, but is not the best performer here.
|
||||||
|
|
||||||
|
## Next steps (Phase 69-3C)
|
||||||
|
|
||||||
|
If we want to pursue M2 (55%) via this path:
|
||||||
|
1. Promote **batch=48** as a research candidate with a dedicated Phase tag (compile-time change + PGO rebuild).
|
||||||
|
2. Re-run the sweep at another time window to confirm ordering (layout/drift sensitivity).
|
||||||
|
3. If stable, promote batch=48 into the FAST baseline build path.
|
||||||
@ -0,0 +1,47 @@
|
|||||||
|
# Phase 69-3C: Refill Batch “knob” audit — `TINY_REFILL_BATCH_SIZE` is not wired
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
The Phase 69 “Refill Batch Size sweep” was based on `TINY_REFILL_BATCH_SIZE` in `core/hakmem_tiny_config.h`, but an audit shows this macro currently has **zero call sites** in the active Tiny front path. As a result, any measured deltas from editing this macro are **not algorithmic**; they are attributable to layout/drift/noise.
|
||||||
|
|
||||||
|
## Evidence
|
||||||
|
|
||||||
|
### 1) Zero call sites
|
||||||
|
|
||||||
|
```sh
rg -n "TINY_REFILL_BATCH_SIZE" core
```
|
||||||
|
|
||||||
|
Result: only `core/hakmem_tiny_config.h` (define-only).
|
||||||
|
|
||||||
|
### 2) PGO binaries unchanged when toggling the macro
|
||||||
|
|
||||||
|
We rebuilt the full PGO pipeline twice (`make pgo-fast-full`) after changing the macro (batch16 vs batch48) and found the resulting binaries were bit-identical (same size + same SHA256).
|
||||||
|
|
||||||
|
This confirms the macro does not affect the compiled hot path today.
|
||||||
|
|
||||||
|
## Action taken
|
||||||
|
|
||||||
|
- Restored `TINY_REFILL_BATCH_SIZE` to `16` and added an explicit “not wired” note in `core/hakmem_tiny_config.h`.
|
||||||
|
- Marked the “Refill Batch Size sweep” section in Phase 69 docs as invalid.
|
||||||
|
|
||||||
|
## What to tune instead (real knobs)
|
||||||
|
|
||||||
|
To tune refill frequency/amount without rebuilding:
|
||||||
|
- `HAKMEM_TINY_REFILL_COUNT_HOT` (C0–C3)
|
||||||
|
- `HAKMEM_TINY_REFILL_COUNT_MID` (C4–C7)
|
||||||
|
- `HAKMEM_TINY_REFILL_COUNT` / `HAKMEM_TINY_REFILL_COUNT_C{0..7}`
|
||||||
|
|
||||||
|
Defaults are set in `core/hakmem_tiny_init.inc` and can be overridden via ENV.
|
||||||
|
|
||||||
|
## Optional future work (if we still want a compile-time knob)
|
||||||
|
|
||||||
|
If we want a compile-time “refill batch size” knob, we need to wire it into a single SSOT:
|
||||||
|
- either by feeding it into the refill-count defaults (`g_refill_count_*`), or
|
||||||
|
- by introducing a dedicated build flag that the refill logic consumes directly.
|
||||||
|
|
||||||
|
Until then, do not run Phase 69 sweeps based on `TINY_REFILL_BATCH_SIZE`.
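
As a rough illustration of the first option (feeding the macro into the refill-count defaults), the sketch below shows one way a compile-time default and an ENV override could share a single resolution point. It is only a sketch under assumptions: `g_refill_count_default`, `refill_count_resolve`, and the upper bound of 4096 are hypothetical and do not come from the hakmem sources.

```c
#include <assert.h>
#include <stdlib.h>

/* Compile-time knob. In the real tree this macro is define-only today. */
#ifndef TINY_REFILL_BATCH_SIZE
#define TINY_REFILL_BATCH_SIZE 16
#endif

/* Hypothetical stand-in for the g_refill_count_* defaults. */
static int g_refill_count_default = TINY_REFILL_BATCH_SIZE;

/* Resolve the effective refill count.
 * A valid ENV value wins; otherwise the compile-time default is used.
 * The 4096 upper bound is an arbitrary assumption for this sketch. */
static int refill_count_resolve(const char *env_value) {
    assert(g_refill_count_default > 0);
    if (env_value != NULL && *env_value != '\0') {
        char *end = NULL;
        long v = strtol(env_value, &end, 10);
        if (end != env_value && *end == '\0' && v > 0 && v <= 4096) {
            return (int)v;
        }
    }
    return g_refill_count_default;
}
```

A caller would pass `getenv("HAKMEM_TINY_REFILL_COUNT")` once at init time, so the macro becomes a real default rather than a dead define.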
|
||||||
|
|
||||||
@ -12,6 +12,13 @@
|
|||||||
|
|
||||||
Before implementing any refill/WarmPool changes, execute this sequence:
|
Before implementing any refill/WarmPool changes, execute this sequence:
|
||||||
|
|
||||||
|
0. **Route Banner (optional but recommended)**:
```bash
HAKMEM_ROUTE_BANNER=1 ./bench_random_mixed_hakmem_observe ...
```
- Prints the route assignments (backend route kind) and the cache config (`unified_cache_enabled` / `warm_pool_max_per_class`) exactly once.
- Prevents misreadings such as "Route=LEGACY = Unified Cache unused" (even with LEGACY, Unified Cache is still used at the alloc/free front).
|
||||||
|
|
||||||
1. **Build with Stats**:
|
1. **Build with Stats**:
|
||||||
```bash
|
```bash
|
||||||
make bench_random_mixed_hakmem_observe EXTRA_CFLAGS='-DHAKMEM_UNIFIED_CACHE_STATS_COMPILED=1'
|
make bench_random_mixed_hakmem_observe EXTRA_CFLAGS='-DHAKMEM_UNIFIED_CACHE_STATS_COMPILED=1'
|
||||||
@ -20,7 +27,7 @@ Before implementing any refill/WarmPool changes, execute this sequence:
|
|||||||
|
|
||||||
2. **Run with Stats**:
|
2. **Run with Stats**:
|
||||||
```bash
|
```bash
|
||||||
HAKMEM_WARM_POOL_STATS=1 ./bench_random_mixed_hakmem_observe 20000000 400 1
|
HAKMEM_ROUTE_BANNER=1 HAKMEM_WARM_POOL_STATS=1 ./bench_random_mixed_hakmem_observe 20000000 400 1
|
||||||
```
|
```
|
||||||
|
|
||||||
3. **Check Output**:
|
3. **Check Output**:
|
||||||
|
|||||||
@ -0,0 +1,116 @@
|
|||||||
|
# Phase 74: UnifiedCache hit-path structural optimization (WS=400 SSOT)
|
||||||
|
|
||||||
|
**Status**: 🟡 DRAFT (design SSOT / next instruction sheet)

## 0) Background (why now)

- Current baseline (Phase 69): `bench_random_mixed_hakmem_minimal_pgo` = **62.63M ops/s = 51.77% of mimalloc** (`HAKMEM_WARM_POOL_SIZE=16`)
- Phase 70 (observability SSOT) established that under WS=400 (Mixed SSOT), **UnifiedCache misses are negligible**.
  - Speeding up `unified_cache_refill()` / WarmPool-pop therefore has **almost zero ROI** (refill optimization is frozen)
  - SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- Phase 73 (perf stat) established that the WarmPool=16 win is dominated by a **slight reduction in instructions/branches**.
  - So the most promising next direction is again to "shorten the hit-path".
  - Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`

The aim of this phase is to **structurally push branches/loads that need not be executed out of the UnifiedCache hit-path (push/pop)**.
|
||||||
|
|
||||||
|
## 1) Goals / Non-goals

**Goals**
- Aim for **+1 to 3%** (per change) on the WS=400 SSOT workload (accumulating toward M2=55%).
- Avoid "optimizing a path that is never executed" (respect the Phase 70 SSOT).

**Non-goals**
- Optimizing `unified_cache_refill()` (misses are negligible, so there is no ROI under the SSOT).
- DCE via link-out / large deletions (there are many precedents where layout tax flipped the sign).
- Changing the route kind and thereby turning this into a different workload (first, keep the SSOT workload intact).
|
||||||
|
|
||||||
|
## 2) Box Theory (box layout)

### Box responsibilities

L0: **EnvGateBox**
- Toggles for `HAKMEM_TINY_UC_*` (default OFF, revertible at any time).

L1: **TinyUnifiedCacheHitPathBox (NEW / research box)**
- Shortens **only the hit-path** of `unified_cache_push/pop` (refill/overflow/registry are untouched).
- One conversion point (boundary): inside `unified_cache_push/pop`, "fast→fallback" happens exactly once.

### Visibility (minimal)
- Only two counters, `uc_hitpath_fast_hits` / `uc_hitpath_fast_fallbacks` (if needed at all).
- Otherwise, `perf stat` (instructions/branches) is the source of truth.
|
||||||
|
|
||||||
|
## 3) Concrete proposals (in priority order)

### P1 (low risk): copy fields into local variables to pin down reloads / dependency chains

Aim:
- Suppress reloads of `cache->head/tail/mask/capacity` etc. and **shorten the dependency chain**.

Design:
- Inside `unified_cache_push()` / `unified_cache_pop_or_refill()`:
  - **Copy fields into locals**, e.g. `uint16_t head = cache->head;`
  - Compute `next = (x + 1) & mask` **exactly once**
  - Group stores such as `cache->tail = next;` at the end
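
The P1 design can be sketched roughly as follows. This is a minimal illustration, not the project's code: `UcRing`, `uc_ring_init`, and the `uc_*_localized` functions are hypothetical stand-ins for `TinyUnifiedCache` and its push/pop, and the sketch assumes a power-of-two ring in which push advances `tail` and pop advances `head`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified ring layout (the real TinyUnifiedCache may differ). */
typedef struct {
    void   **slots;  /* ring storage, capacity = mask + 1 */
    uint16_t head;   /* next index to pop  */
    uint16_t tail;   /* next index to push */
    uint16_t mask;   /* capacity - 1 (capacity is a power of two) */
} UcRing;

static inline void uc_ring_init(UcRing *cache, void **storage, uint16_t capacity) {
    assert(capacity >= 2 && (capacity & (capacity - 1)) == 0);
    cache->slots = storage;
    cache->head = 0;
    cache->tail = 0;
    cache->mask = (uint16_t)(capacity - 1);
}

/* Push with fields localized: load once, compute `next` once, store last. */
static inline int uc_push_localized(UcRing *cache, void *base) {
    const uint16_t head = cache->head;
    const uint16_t tail = cache->tail;
    const uint16_t mask = cache->mask;
    const uint16_t next = (uint16_t)((tail + 1) & mask);
    if (next == head) return 0;   /* full: caller takes the existing slow path */
    cache->slots[tail] = base;
    cache->tail = next;           /* single index store at the end */
    return 1;
}

/* Pop with the same pattern; NULL means "empty, caller would refill". */
static inline void *uc_pop_localized(UcRing *cache) {
    const uint16_t head = cache->head;
    const uint16_t tail = cache->tail;
    const uint16_t mask = cache->mask;
    if (head == tail) return NULL;
    void *base = cache->slots[head];
    cache->head = (uint16_t)((head + 1) & mask);
    return base;
}
```

Note that three extra live locals per call is exactly where the register-pressure risk listed for P1 comes from, so a sketch like this still needs the A/B measurement.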
|
||||||
|
|
||||||
|
Rollout:
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
- Method: ON/OFF within the same binary (to minimize layout tax, the branch is limited to a single check at entry)

Risk:
- Higher register pressure may make it slower instead → A/B is mandatory.
|
||||||
|
|
||||||
|
### P0 (medium risk / medium ROI): Fast-API (push the enable check / stats out of the hit-path)

Aim:
- **Push the "almost invariant checks" that remain in the hit-path out to the caller**, making `push/pop` straight-line code.

Design:
- Add a **minimal API** such as `unified_cache_push_fast(TinyUnifiedCache* cache, void* base)`
  - Precondition: the caller guarantees "enabled / initialized / stats OFF"
  - Fall back to the existing `unified_cache_push()` only on failure (single boundary)
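
A hedged sketch of the P0 idea: the invariant checks are done once by the caller, and the fast push falls back to the full push only when it cannot complete. All names here (`UcRing`, `uc_push_slow`, `uc_push_fast`, `uc_push_many`, `g_uc_enabled`, `g_slow_calls`) are hypothetical; the real `unified_cache_push()` and its checks are not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified stand-in for TinyUnifiedCache. */
typedef struct {
    void   **slots;
    uint16_t head, tail, mask;
} UcRing;

static int g_uc_enabled = 1;        /* hypothetical "enabled" state */
static unsigned long g_slow_calls;  /* demo-only counter of slow-path entries */

/* Stand-in for the existing full push: enable/init checks live here. */
static int uc_push_slow(UcRing *cache, void *base) {
    g_slow_calls++;
    if (!g_uc_enabled || cache->slots == NULL) return 0;
    uint16_t next = (uint16_t)((cache->tail + 1) & cache->mask);
    if (next == cache->head) return 0;   /* full (overflow handling omitted) */
    cache->slots[cache->tail] = base;
    cache->tail = next;
    return 1;
}

/* Fast push: the caller guarantees "enabled / initialized / stats OFF". */
static inline int uc_push_fast(UcRing *cache, void *base) {
    uint16_t tail = cache->tail;
    uint16_t next = (uint16_t)((tail + 1) & cache->mask);
    if (next == cache->head) return uc_push_slow(cache, base); /* single boundary */
    cache->slots[tail] = base;
    cache->tail = next;
    return 1;
}

/* Call site: preconditions are checked once, outside the loop. */
static size_t uc_push_many(UcRing *cache, void **items, size_t n) {
    assert(cache != NULL);
    size_t pushed = 0;
    if (g_uc_enabled && cache->slots != NULL) {
        for (size_t i = 0; i < n; i++) pushed += (size_t)uc_push_fast(cache, items[i]);
    } else {
        for (size_t i = 0; i < n; i++) pushed += (size_t)uc_push_slow(cache, items[i]);
    }
    return pushed;
}
```

The design choice illustrated is that the hit-path contains only the full check; everything that rarely changes is evaluated by the caller.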
|
||||||
|
|
||||||
|
Rollout:
- ENV: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0)
- Fail-fast: if the mode changes midway, go to the "safe fallback" (for bench use, abort is also acceptable)

Risk:
- More call sites shift the layout → the GO threshold is +1.0% (strict).
|
||||||
|
|
||||||
|
### P2 (high risk / high-ROI candidate): place slots inline in TLS for hot classes only (fewer pointer chases)

Aim:
- Eliminate the load of `cache->slots` (pointer chase) in the hit-path.

Design:
- Move "hot classes only" out of `TinyUnifiedCache` into a separate structure and place `slots[]` inline in TLS.
- Candidates: the small-capacity classes C4/C5/C6/C7 (the 2048 slots of C2/C3 are too heavy to inline)

Risk:
- A larger TLS footprint can worsen dTLB/cache behavior (a big win if it works, but NO-GO is possible).
|
||||||
|
|
||||||
|
## 4) A/B (SSOT)

### 4.1 Benchmark conditions (fixed)
- `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`)
- `HAKMEM_WARM_POOL_SIZE=16` (baseline)

### 4.2 GO/NO-GO
- **GO**: +1.0% or more
- **NEUTRAL**: within ±1.0% (research box freeze)
- **NO-GO**: -1.0% or less (immediate revert)
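
The thresholds can be written down as a small helper. This is a sketch only; `Verdict`, `delta_percent`, and `classify` are hypothetical names, and the inputs are assumed to be mean throughputs from the baseline and treatment runs.

```c
#include <assert.h>

typedef enum { VERDICT_NO_GO, VERDICT_NEUTRAL, VERDICT_GO } Verdict;

/* Relative change of treatment vs. baseline, in percent. */
static double delta_percent(double baseline, double treatment) {
    assert(baseline > 0.0);
    return (treatment - baseline) / baseline * 100.0;
}

/* GO at +1.0% or more, NO-GO at -1.0% or less, NEUTRAL in between. */
static Verdict classify(double delta_pct) {
    if (delta_pct >= 1.0)  return VERDICT_GO;
    if (delta_pct <= -1.0) return VERDICT_NO_GO;
    return VERDICT_NEUTRAL;
}
```

For example, `classify(delta_percent(base_mops, treat_mops))` yields the verdict for one A/B pair; the perf-counter checks in 4.3 remain a separate, manual step.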
|
||||||
|
|
||||||
|
### 4.3 Additional mandatory checks (lesson from Phase 73)
- `perf stat`: `instructions`, `branches`, `branch-misses` (because the winning pattern is instruction/branch reduction)
- `cache-misses`, `iTLB-load-misses`, `dTLB-load-misses` (to detect layout tax)

## 5) Near-term implementation order (recommended)

1. Land **P1 (LOCALIZE)** as a small change and A/B it (fastest way to confirm the winning pattern)
2. If it wins, add **P0 (FASTAPI)** (push more branches out)
3. If that is still not enough, try **P2 (inline slots hot)** as a research box

## 6) Exit criteria (when to stop)

- If "unified_cache_push/pop" drops out of the Top 50 in `perf` under the WS=400 SSOT, abandon this line of work (lesson from Phase 42).
- After three consecutive NEUTRAL/NO-GO results, move on to the next structure (a different layer), because the risk of layout tax grows.
|
||||||
@ -0,0 +1,75 @@
|
|||||||
|
# Phase 74-1: UnifiedCache hit-path “LOCALIZE” implementation instructions

**Status**: 🟡 READY

## Purpose

Under WS=400 (Mixed SSOT) almost only the hit-path is executed, so **shorten the dependency chain (reloads)** of `unified_cache_push/pop` to cut instructions/branches.

- Design SSOT: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- Observability SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md` (refill optimization is frozen)

## Principles (Box Theory)

- L0: add an ENV gate box (default OFF, revertible at any time)
- L1: changes confined to `unified_cache_push/pop` (single boundary)
- Keep visibility minimal (perf stat is the source of truth)
- Fail-fast: when in doubt, fall back

## Step 0: Confirm the baseline (SSOT)

```bash
scripts/run_mixed_10_cleanenv.sh
```

## Step 1: ENV gate (L0 box)

New file:
- `core/box/tiny_unified_cache_hitpath_env_box.h` (example)

ENV:
- `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)

Requirement:
- Do not call getenv on the hot path (use the existing lazy-init pattern, or fix the value via a build flag)
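
One possible shape of such a lazy-init gate is sketched below. It is an assumption-laden illustration, not the contents of the actual env box header: `g_uc_localize_state`, `uc_parse_gate`, and `uc_localize_init` are hypothetical, and the unsynchronized cache variable assumes that a benign race on first use is acceptable.

```c
#include <assert.h>
#include <stdlib.h>

/* -1 = not read yet, 0 = OFF, 1 = ON (hypothetical cache variable). */
static int g_uc_localize_state = -1;

/* Only the exact string "1" enables the gate; unset or anything else is OFF. */
static int uc_parse_gate(const char *value) {
    return (value != NULL && value[0] == '1' && value[1] == '\0') ? 1 : 0;
}

/* Cold path: the only place that calls getenv. */
static int uc_localize_init(void) {
    g_uc_localize_state = uc_parse_gate(getenv("HAKMEM_TINY_UC_LOCALIZE"));
    return g_uc_localize_state;
}

/* Hot path: one load and one well-predicted branch after the first call. */
static inline int tiny_uc_localize_enabled(void) {
    int s = g_uc_localize_state;
    if (s < 0) s = uc_localize_init();
    assert(s == 0 || s == 1);
    return s;
}
```

Even this cached check is a load plus a branch on every call, which is why fixing the value with a build flag is listed as the alternative.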
|
||||||
|
|
||||||
|
## Step 2: LOCALIZE implementation (L1 box)

Target:
- `unified_cache_push()` / `unified_cache_pop_or_refill()` in `core/front/tiny_unified_cache.h`

Policy:
- Copy `cache->head/tail/mask/capacity` into locals to **prevent reloads**
- Group stores at the end (e.g. `cache->tail = next_tail;`)
- Do not change the semantics (preserve the meaning of capacity / ordering / stats / overflow)

Rollout pattern (example):
- When `if (!tiny_uc_localize_enabled())` holds, run the existing implementation unchanged
- Call the localized version only when `enabled`
|
||||||
|
|
||||||
|
## Step 3: A/B (same binary)

```bash
HAKMEM_TINY_UC_LOCALIZE=0 scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_UC_LOCALIZE=1 scripts/run_mixed_10_cleanenv.sh
```

Additionally (mandatory, because the winning pattern is instructions/branches):
```bash
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,iTLB-load-misses,dTLB-load-misses -- \
  ./bench_random_mixed_hakmem_minimal_pgo 20000000 400 1
```

## Verdict

- **GO**: +1.0% or more
- **NEUTRAL**: within ±1.0% (research box freeze)
- **NO-GO**: -1.0% or less (immediate revert)

Triage for NO-GO:
- Use `scripts/box/layout_tax_forensics_box.sh` (classifies layout tax / IPC drop / TLB degradation)

## Step 4: Promotion policy

- Even on a first GO, **do not make it default ON** (first confirm reproducibility with 3 independent re-measurements)
- If all 3 are GO, consider promoting it into `scripts/run_mixed_10_cleanenv.sh` / `core/bench_profile.h`
|
||||||
@ -0,0 +1,140 @@
|
|||||||
|
# Phase 74: UnifiedCache hit-path structural optimization - Results
|
||||||
|
|
||||||
|
**Status**: 🔴 P1 (LOCALIZE) FROZEN (NEUTRAL -0.87%)
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
Phase 74 investigated **unified_cache_push/pop** hit-path optimizations to achieve +1-3% via instruction/branch reduction (the lesson from Phase 73).
|
||||||
|
|
||||||
|
**P1 (LOCALIZE)** attempted to reduce dependency chains by loading `head/tail/mask` into locals, but was **frozen at NEUTRAL (-0.87%)** due to cache-miss increase.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 74-1: LOCALIZE (ENV-gated, runtime branch)
|
||||||
|
|
||||||
|
**Goal**: Load `head/tail/mask` once into locals to avoid reload dependency chains.
|
||||||
|
|
||||||
|
**Implementation**:
|
||||||
|
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
|
||||||
|
- Runtime branch at entry: `if (tiny_uc_localize_enabled()) { ... }`
|
||||||
|
|
||||||
|
**Results** (10-run A/B):
|
||||||
|
| Metric | LOCALIZE=0 | LOCALIZE=1 | Delta |
|--------|------------|------------|-------|
| throughput | 57.43 M ops/s | 57.72 M ops/s | **+0.50%** |
| instructions | 4,583M | 4,615M | **+0.7%** |
| branches | 1,276M | 1,281M | **+0.4%** |
| cache-misses | 560K | 461K | -17.7% |
|
||||||
|
|
||||||
|
**Diagnosis**: Runtime branch overhead dominated. Instructions/branches **increased** despite LOCALIZE intent.
|
||||||
|
|
||||||
|
**Judgment**: **NEUTRAL (+0.50%, ±1.0% threshold)** → Proceed to Phase 74-2 (compile-time gate).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 74-2: LOCALIZE (compile-time gate, no runtime branch)
|
||||||
|
|
||||||
|
**Goal**: Eliminate the runtime branch to isolate the performance of LOCALIZE itself.
|
||||||
|
|
||||||
|
**Implementation**:
|
||||||
|
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
|
||||||
|
- Compile-time gate: `#if HAKMEM_TINY_UC_LOCALIZE_COMPILED` (no runtime branch)
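
To make the difference from the ENV gate concrete, here is a minimal sketch of a compile-time gate: the preprocessor selects one variant, so no runtime check remains in the function. The `UcIdx` struct and `uc_is_full` helper are hypothetical; only the flag name comes from this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
#define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0  /* default OFF, as stated above */
#endif

/* Hypothetical index-only view of the cache. */
typedef struct { uint16_t head, tail, mask; } UcIdx;

/* Exactly one of the two bodies is compiled; both compute the same result. */
static inline int uc_is_full(const UcIdx *c) {
    assert(c != NULL);
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    const uint16_t head = c->head, tail = c->tail, mask = c->mask;
    return (uint16_t)((tail + 1) & mask) == head;
#else
    return (uint16_t)((c->tail + 1) & c->mask) == c->head;
#endif
}
```

Building with `-DHAKMEM_TINY_UC_LOCALIZE_COMPILED=1` would select the localized body; the trade-off is that the A/B then compares two different binaries, so layout effects have to be checked separately.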
|
||||||
|
|
||||||
|
**Results** (10-run A/B via `layout_tax_forensics_box.sh`):
|
||||||
|
| Metric | Baseline (=0) | Treatment (=1) | Delta |
|--------|---------------|----------------|-------|
| **throughput** | 58.90 M ops/s | 58.39 M ops/s | **-0.87%** |
| cycles | 1,553M | 1,548M | -0.3% |
| **instructions** | 2,748M | 2,733M | **-0.6%** |
| **branches** | 632M | 617M | **-2.3%** |
| **cache-misses** | 707K | 1,316K | **+86%** |
| dTLB-load-misses | 46K | 33K | -28% |
|
||||||
|
|
||||||
|
**Analysis**:
|
||||||
|
1. **Runtime branch overhead removed** → instructions/branches improved (-0.6%/-2.3%) ✓
2. **LOCALIZE itself is effective** → dependency chain reduction confirmed ✓
3. **But cache-misses +86%** → suspected register pressure / spill / worse access pattern
4. **Net result: -0.87%** → cache-miss increase dominates instruction/branch savings
|
||||||
|
|
||||||
|
**Phase 74-1 vs 74-2 comparison**:
|
||||||
|
- 74-1 (runtime branch): instructions +0.7%, branches +0.4% → **branch overhead loses**
- 74-2 (compile-time): instructions -0.6%, branches -2.3% → **LOCALIZE itself wins**
- But cache-misses +86% cancels out → **total NEUTRAL**
|
||||||
|
|
||||||
|
**Judgment**: **NEUTRAL (-0.87%, below +1.0% GO threshold)** → **P1 FROZEN**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Root Cause (Phase 74-2)
|
||||||
|
|
||||||
|
**Why cache-misses increased (+86%)**:
|
||||||
|
|
||||||
|
1. **Register pressure hypothesis**: Loading `head/tail/mask` into locals increases live registers
|
||||||
|
- Compiler may spill to stack → more memory traffic
|
||||||
|
- `cache->slots[head]` may lose prefetch opportunity
|
||||||
|
2. **Access pattern change**: `cache->head` direct load may benefit from compiler optimizations
|
||||||
|
- Storing to local breaks dependency tracking?
|
||||||
|
- Memory alias analysis degraded?
|
||||||
|
|
||||||
|
**Evidence**:
|
||||||
|
- dTLB-misses decreased (-28%) → data layout not the issue
|
||||||
|
- L1-dcache-load-misses similar → not a TLB/page issue
|
||||||
|
- cache-misses (+86%) is the PRIMARY BLOCKER
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Lessons Learned
|
||||||
|
|
||||||
|
1. **Runtime branch tax is real**: Phase 74-1 showed +0.7% instruction increase from ENV gate
|
||||||
|
2. **LOCALIZE itself works**: Phase 74-2 confirmed -2.3% branches once the runtime branch was removed
|
||||||
|
3. **Register pressure matters**: Even when instruction count drops, cache behavior can dominate
|
||||||
|
4. **This optimization path has low ROI**: Dependency chain reduction is fragile to cache effects
|
||||||
|
|
||||||
|
**Conclusion**: P1 (LOCALIZE) frozen. Move to **P0 (FASTAPI)** (different approach: move branches outside hot loop).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## P1 (LOCALIZE) - Frozen State
|
||||||
|
|
||||||
|
**Files**:
|
||||||
|
- `core/hakmem_build_flags.h`: `HAKMEM_TINY_UC_LOCALIZE_COMPILED` (default 0)
|
||||||
|
- `core/box/tiny_unified_cache_hitpath_env_box.h`: ENV gate (unused after 74-2)
|
||||||
|
- `core/front/tiny_unified_cache.h`: compile-time `#if` blocks
|
||||||
|
|
||||||
|
**Default behavior**: LOCALIZE=0 (original implementation)
|
||||||
|
**Rollback**: No action needed (default OFF)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Next Steps
|
||||||
|
|
||||||
|
**Phase 74-3: P0 (FASTAPI)**
|
||||||
|
|
||||||
|
**Goal**: Move `unified_cache_enabled()` / `lazy-init` / `stats` checks **outside** hot loop.
|
||||||
|
|
||||||
|
**Approach**:
|
||||||
|
- Create `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
|
||||||
|
- Assume: "valid/enabled/no-stats" at caller side
|
||||||
|
- Fail-fast: fallback to slow path on unexpected state
|
||||||
|
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
|
||||||
|
|
||||||
|
**Expected benefit**: +1-2% via branch reduction (different axis than P1)
|
||||||
|
|
||||||
|
**GO threshold**: +1.0% (strict, structural change)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Artifacts
|
||||||
|
|
||||||
|
- **Design**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
|
||||||
|
- **Instructions**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
|
||||||
|
- **Results**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md` (this file)
|
||||||
|
- **Forensics output**: `./results/layout_tax_forensics/` (Phase 74-2 perf data)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Timeline
|
||||||
|
|
||||||
|
- Phase 74-1: ENV-gated LOCALIZE → **NEUTRAL (+0.50%)**
|
||||||
|
- Phase 74-2: Compile-time LOCALIZE → **NEUTRAL (-0.87%)** → **P1 FROZEN**
|
||||||
|
- Phase 74-3: P0 (FASTAPI) → (next)
|
||||||
2
hakmem.d
@ -103,6 +103,7 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \
|
|||||||
core/box/../front/../box/../hakmem_tiny_config.h \
|
core/box/../front/../box/../hakmem_tiny_config.h \
|
||||||
core/box/../front/../box/../tiny_nextptr.h \
|
core/box/../front/../box/../tiny_nextptr.h \
|
||||||
core/box/../front/../box/tiny_tcache_env_box.h \
|
core/box/../front/../box/tiny_tcache_env_box.h \
|
||||||
|
core/box/../front/../box/tiny_unified_cache_hitpath_env_box.h \
|
||||||
core/box/../front/../tiny_region_id.h core/box/../front/../hakmem_tiny.h \
|
core/box/../front/../tiny_region_id.h core/box/../front/../hakmem_tiny.h \
|
||||||
core/box/../front/../box/tiny_env_box.h \
|
core/box/../front/../box/tiny_env_box.h \
|
||||||
core/box/../front/../box/tiny_front_hot_box.h \
|
core/box/../front/../box/tiny_front_hot_box.h \
|
||||||
@ -361,6 +362,7 @@ core/box/../front/../box/tiny_tcache_box.h:
|
|||||||
core/box/../front/../box/../hakmem_tiny_config.h:
|
core/box/../front/../box/../hakmem_tiny_config.h:
|
||||||
core/box/../front/../box/../tiny_nextptr.h:
|
core/box/../front/../box/../tiny_nextptr.h:
|
||||||
core/box/../front/../box/tiny_tcache_env_box.h:
|
core/box/../front/../box/tiny_tcache_env_box.h:
|
||||||
|
core/box/../front/../box/tiny_unified_cache_hitpath_env_box.h:
|
||||||
core/box/../front/../tiny_region_id.h:
|
core/box/../front/../tiny_region_id.h:
|
||||||
core/box/../front/../hakmem_tiny.h:
|
core/box/../front/../hakmem_tiny.h:
|
||||||
core/box/../front/../box/tiny_env_box.h:
|
core/box/../front/../box/tiny_env_box.h:
|
||||||
|
|||||||