CURRENT_TASK(Rolling, SSOT)
0) Current source of truth (SSOT)
- Source of truth for performance comparisons: FAST PGO build (`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`) + WarmPool=16
  - Phase 75 (C5/C6 inline slots) has been promoted to the presets.
  - Phase 75-4 ran the FAST PGO rebase and confirmed +3.16% (GO) with C5+C6=ON. However, the FAST PGO baseline itself appears to have regressed substantially since Phase 69, so the PGO profile must be regenerated in Phase 75-5.
- Source of truth for safety/compatibility: Standard build (`make bench_random_mixed_hakmem`)
- Source of truth for observation: OBSERVE build (`make perf_observe`)
- Scorecard (targets / current values): `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md`
  - FAST baseline (SSOT): `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` is authoritative (Phase 69: 62.63M ops/s = 51.77% of mimalloc).
  - Phase 75 measurement (Standard): A/B +5.41% confirmed on `bench_random_mixed_hakmem` (Phase 75-3 4-point matrix).
  - Phase 75 measurement (FAST PGO): A/B +3.16% confirmed on `bench_random_mixed_hakmem_minimal_pgo` (Phase 75-4 4-point matrix).
  - Next target: M2 = 55% (the gap is judged against the FAST baseline).
- Mixed 10-run SSOT (harness): `scripts/run_mixed_10_cleanenv.sh`
  - Default: `BENCH_BIN=./bench_random_mixed_hakmem` (Standard)
  - For FAST PGO, set `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo` explicitly.
  - Defaults: `ITERS=20000000 WS=400`, `HAKMEM_WARM_POOL_SIZE=16`, `HAKMEM_TINY_C5_INLINE_SLOTS=1`, `HAKMEM_TINY_C6_INLINE_SLOTS=1`
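As a worked example of the scorecard arithmetic, the sketch below derives the mimalloc throughput implied by the Phase 69 baseline (62.63M ops/s = 51.77%) and the throughput needed to reach M2 = 55%. The function names are illustrative and not part of the repository; the only inputs are the figures quoted above.

```c
#include <assert.h>

/* Throughput of the reference allocator implied by "ours = ratio * reference". */
static double implied_reference_mops(double ours_mops, double ratio) {
    assert(ratio > 0.0);
    return ours_mops / ratio;
}

/* Throughput needed to reach target_ratio of the reference allocator. */
static double target_mops(double ours_mops, double ratio, double target_ratio) {
    return implied_reference_mops(ours_mops, ratio) * target_ratio;
}

/* Relative gain (percent) still required to move from ratio to target_ratio. */
static double required_gain_pct(double ratio, double target_ratio) {
    assert(ratio > 0.0);
    return (target_ratio / ratio - 1.0) * 100.0;
}
```

With the quoted figures this gives roughly 120.98M ops/s for mimalloc, an M2 target of about 66.54M ops/s, and a remaining gap of about +6.2% over the Phase 69 FAST baseline.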
1) Avoiding wrong turns (routes / observation)
Minimal procedure to prevent "optimizations whose code path is never exercised".
- Route Banner (rules out route misidentification): `HAKMEM_ROUTE_BANNER=1`
  - Output: route assignments (backend route kind) + cache config (`unified_cache_enabled` / `warm_pool_max_per_class`)
- SSOT for refill observation: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
  - At WS=400 (Mixed SSOT) misses are negligible → `unified_cache_refill()` optimization is frozen (zero ROI).
2) Latest conclusions (key points only)
- Phase 69 (WarmPool sweep): `HAKMEM_WARM_POOL_SIZE=16` is a strong GO (+3.26%) and has been promoted to the baseline.
  - Design: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
  - Results: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
- Phase 70 (observation SSOT): made the statistics visible and established the prerequisite gates. Under the WS=400 SSOT, refill is cold.
  - SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- Phase 71/73 (why WarmPool=16 wins): the gain comes from a slight reduction in instructions/branches (confirmed with perf stat).
  - Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
- Phase 72 (ENV-knob ROI exhausted): no ENV-only win beyond WarmPool=16 → the next step is structural (code) changes.
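3) Operating rules (Box Theory + layout-tax countermeasures)
- Every change is stacked as a box + a single boundary + an ENV switch that reverts it (fail-fast, minimal observability).
- As a rule, A/B tests toggle an ENV variable on the same binary (comparing separate binaries mixes in layout effects).
- "Faster after deleting code" is off-limits (link-out or large deletions easily flip sign because of layout tax) → prefer compile-out.
  - Diagnosis: `scripts/box/layout_tax_forensics_box.sh` / `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md`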
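The "box + single boundary + ENV revert" rule can be illustrated with a minimal sketch. The knob name `HAKMEM_EXAMPLE_BOX` and all function names are hypothetical and exist only for this example; the actual boxes in the repository may be structured differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical knob name, used only in this sketch. */
#define EXAMPLE_BOX_ENV "HAKMEM_EXAMPLE_BOX"

/* -1 = not read yet, 0 = OFF, 1 = ON. Cached so the hot path avoids getenv(). */
static int g_example_box_enabled = -1;

/* Only the exact string "1" turns the box ON; unset or anything else is OFF. */
static int env_flag_is_on(const char *value) {
    return value != NULL && strcmp(value, "1") == 0;
}

/* The single boundary: every caller asks this one function. */
static int example_box_enabled(void) {
    if (g_example_box_enabled < 0) {
        g_example_box_enabled = env_flag_is_on(getenv(EXAMPLE_BOX_ENV));
    }
    return g_example_box_enabled;
}

/* Returns 1 when the boxed (new) path is taken, 0 for the legacy path.
 * The legacy path stays intact, so setting the ENV to 0 reverts the change. */
static int example_route(void) {
    return example_box_enabled() ? 1 : 0;
}
```

Because both paths live in one binary, an A/B run only has to flip the environment variable, which keeps the code layout identical between the two arms.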
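4) Next instructions (Active)
Phase 74 (structural): shorten the UnifiedCache hit path ✅ P1 (LOCALIZE) frozen
Prerequisites:
- Under the WS=400 SSOT, UnifiedCache misses are negligible → refill optimization has zero ROI.
- WarmPool=16 wins through a slight reduction in instructions/branches → shortening the hit path is the right line of attack.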
Phase 74-1: LOCALIZE (ENV-gated) ✅ Done (NEUTRAL +0.50%)
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1`
- Runtime branch overhead increased instructions/branches (+0.7%/+0.4%)
- Verdict: NEUTRAL (+0.50%)
Phase 74-2: LOCALIZE (compile-time gate) ✅ Done (NEUTRAL -0.87%)
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Runtime branch removed → instructions/branches improved (-0.6%/-2.3%) ✓
- However, cache-misses rose by +86% (register pressure / spill) → throughput -0.87%
- Isolation succeeded: LOCALIZE itself is a win, but the cache-miss increase cancels it out
- Verdict: NEUTRAL (-0.87%) → P1 (LOCALIZE) frozen
Conclusion:
- P1 (LOCALIZE) is frozen at default OFF (dependency-chain reduction has low ROI)
- Next: proceed to Phase 74-3 (P0: FASTAPI)
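The difference between the Phase 74-1 runtime gate and the Phase 74-2 compile-time gate can be sketched as follows. Only the flag name `HAKMEM_TINY_UC_LOCALIZE_COMPILED` comes from these notes; the function bodies are placeholders that report which variant ran and do not reflect the actual LOCALIZE code.

```c
#include <assert.h>

/* Build flag from the notes; defaults to 0 when not passed via -D. */
#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
#define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0
#endif

/* Placeholder bodies: they only report which variant ran. */
static int hit_path_legacy(void)    { return 0; }
static int hit_path_localized(void) { return 1; }

/* Runtime gate (Phase 74-1 style): one extra branch on every call. */
static int hit_path_runtime_gated(int localize_enabled) {
    return localize_enabled ? hit_path_localized() : hit_path_legacy();
}

/* Compile-time gate (Phase 74-2 style): the preprocessor selects one body,
 * so no branch on the flag remains in the generated code. */
static int hit_path_compile_gated(void) {
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    return hit_path_localized();
#else
    return hit_path_legacy();
#endif
}
```

The compile-time variant removes the branch but requires two binaries for an A/B comparison, which is exactly the situation where layout effects can leak into the measurement.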
Phase 74-3: P0 (FASTAPI) ✅ Done (NEUTRAL +0.32%)
Goal: move the unified_cache_enabled() / lazy-init / stats checks out of the hot loop
Approach:
- Add the `unified_cache_push_fast()` / `unified_cache_pop_fast()` API
- Precondition: the caller guarantees "valid/enabled/no-stats"
- Fail-fast: fall back to the slow path on any unexpected state (single boundary)
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
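The FASTAPI idea, hoisting the state checks out of the hot loop and leaving only the capacity check on the fast path, might look like the sketch below. The `ex_` types and functions are hypothetical stand-ins, not the real `unified_cache_push_fast()` implementation, and the capacity of 8 is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define EX_CACHE_CAP 8  /* illustrative capacity */

typedef struct {
    void  *slots[EX_CACHE_CAP];
    size_t count;
    int    initialized;  /* lazy-init state */
    int    enabled;      /* result of the "enabled" check */
    int    stats_on;     /* whether statistics are collected */
    size_t push_calls;   /* stats counter, touched only on the slow path */
} ex_cache_t;

static void ex_cache_init(ex_cache_t *c, int enabled, int stats_on) {
    c->count = 0; c->push_calls = 0;
    c->enabled = enabled; c->stats_on = stats_on;
    c->initialized = 1;
}

/* Slow path: lazy-init, enabled check and stats on every call. */
static int ex_cache_push_slow(ex_cache_t *c, void *p) {
    if (!c->initialized) ex_cache_init(c, 1, 0);
    if (!c->enabled) return 0;
    if (c->stats_on) c->push_calls++;
    if (c->count == EX_CACHE_CAP) return 0;
    c->slots[c->count++] = p;
    return 1;
}

/* Caller-side precondition, evaluated once outside the hot loop. */
static int ex_cache_fast_ok(const ex_cache_t *c) {
    return c->initialized && c->enabled && !c->stats_on;
}

/* Fast path: assumes ex_cache_fast_ok() held. Returning 0 tells the
 * caller to fall back to the slow path (the single boundary). */
static inline int ex_cache_push_fast(ex_cache_t *c, void *p) {
    if (c->count == EX_CACHE_CAP) return 0;
    c->slots[c->count++] = p;
    return 1;
}

/* Push n pointers with the state checks hoisted out of the loop. */
static size_t ex_push_many(ex_cache_t *c, void **ptrs, size_t n) {
    size_t pushed = 0;
    int fast = ex_cache_fast_ok(c);
    for (size_t i = 0; i < n; i++) {
        int ok = fast ? ex_cache_push_fast(c, ptrs[i]) : 0;
        if (!ok) ok = ex_cache_push_slow(c, ptrs[i]);
        pushed += (size_t)ok;
    }
    return pushed;
}
```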
Results (10-run Mixed SSOT, WS=400):
- Throughput: +0.32% (NEUTRAL, below +1.0% GO threshold)
- cache-misses: -16.31% (positive signal, insufficient throughput gain)
Verdict: NEUTRAL (+0.32%) → P0 (FASTAPI) frozen
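The verdicts in this log follow a simple threshold rule, which can be sketched as a small classifier. The +1.0% GO threshold is taken from the results above; the symmetric -1.0% NO-GO threshold is an assumption made for this example and is not stated in these notes.

```c
#include <assert.h>
#include <string.h>

typedef enum { VERDICT_NO_GO, VERDICT_NEUTRAL, VERDICT_GO } verdict_t;

/* Classify a throughput delta given in percent. */
static verdict_t classify_delta(double delta_pct) {
    if (delta_pct >= 1.0)  return VERDICT_GO;     /* GO threshold from the notes */
    if (delta_pct <= -1.0) return VERDICT_NO_GO;  /* assumed threshold */
    return VERDICT_NEUTRAL;
}

static const char *verdict_name(verdict_t v) {
    switch (v) {
    case VERDICT_GO:      return "GO";
    case VERDICT_NO_GO:   return "NO-GO";
    case VERDICT_NEUTRAL: return "NEUTRAL";
    }
    return "UNKNOWN";
}
```

Under this rule the three Phase 74 results (+0.50%, -0.87%, +0.32%) all classify as NEUTRAL, matching the verdicts recorded above.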
References:
- Design: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- Instructions: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- Results (P1/P0): `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md`
Phase 75 (structural): Hot-class Inline Slots (P2) ✅ Done (Standard A/B)
Goal: statistical analysis of C4-C7 → decide on a targeted optimization strategy
Prerequisites (Phase 74 learnings):
- UnifiedCache hit-path optimization has low ROI ← register pressure / cache-miss effects
- Next axis: exploit per-class characteristics → branch elimination via TLS-direct inline slots
Phase 75-0: Per-Class Analysis ✅ Done
Per-class Unified-STATS (Mixed SSOT, WS=400, HAKMEM_MEASURE_UNIFIED_CACHE=1):
| Class | Capacity | Occupied | Hits | Pushes | Total Ops | Hit % | % of C4-C7 |
|---|---|---|---|---|---|---|---|
| C6 | 128 | 127 | 2,750,854 | 2,750,855 | 5,501,709 | 100% | 57.2% |
| C5 | 128 | 127 | 1,373,604 | 1,373,605 | 2,747,209 | 100% | 28.5% |
| C4 | 64 | 63 | 687,563 | 687,564 | 1,375,127 | 100% | 14.3% |
| C7 | ? | ? | ? | ? | ? | ? | ? |
Key findings:
- C6 dominates: 57.2% of operations (2.75M hits)
- All classes show a 100% hit rate (refill inactive in SSOT)
- Cache occupancy near-capacity (98-99%)
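The share and occupancy columns can be reproduced from the raw counts in the table. The sketch below uses hypothetical helper names; note that the C7 row has no data, so the "% of C4-C7" figures are effectively computed over the three rows that do.

```c
#include <assert.h>
#include <stddef.h>

/* Share of class idx, in percent of the summed total ops over all n classes. */
static double class_share_pct(const unsigned long long *total_ops,
                              size_t n, size_t idx) {
    unsigned long long sum = 0;
    for (size_t i = 0; i < n; i++) sum += total_ops[i];
    assert(idx < n && sum > 0);
    return 100.0 * (double)total_ops[idx] / (double)sum;
}

/* Occupancy in percent of capacity. */
static double occupancy_pct(unsigned occupied, unsigned capacity) {
    assert(capacity > 0);
    return 100.0 * (double)occupied / (double)capacity;
}
```

Feeding in the Total Ops column (C6, C5, C4) yields about 57.2%, 28.5% and 14.3%, and 127/128 and 63/64 give the 98-99% occupancy quoted in the findings.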
Phase 75-1: C6-only Inline Slots ✅ Done (GO +2.87%)
Approach: Modular box theory design with single decision point at TLS init
Implementation (5 new boxes + test script):
- ENV gate box: `HAKMEM_TINY_C6_INLINE_SLOTS=0/1` (lazy-init, default OFF)
- TLS extension: 128-slot ring buffer (1KB per thread, zero overhead when OFF)
- Fast-path API: `c6_inline_push()` / `c6_inline_pop()` (always_inline, 1-2 cycles)
- Integration: Minimal (2 boundary points: alloc/free for C6 class only)
- Backward compatible: Legacy code intact, fail-fast to unified_cache
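A 128-slot thread-local ring buffer with push/pop fast paths could be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's `c6_inline_push()` / `c6_inline_pop()`: the `ex_` names, the FIFO order and the head/tail/count layout are all choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 128 slots * 8-byte pointers = 1 KB of slot storage on 64-bit targets. */
#define EX_C6_SLOTS 128u
#define EX_C6_MASK  (EX_C6_SLOTS - 1u)  /* valid because 128 is a power of two */

typedef struct {
    void    *slots[EX_C6_SLOTS];
    uint32_t head;   /* next index to pop from */
    uint32_t tail;   /* next index to push to  */
    uint32_t count;  /* number of occupied slots */
} ex_c6_ring_t;

/* One ring per thread; zero-initialized, so it starts out empty. */
static _Thread_local ex_c6_ring_t g_ex_c6_ring;

/* Returns 1 on success, 0 when full (caller falls back to the unified cache). */
static inline int ex_c6_inline_push(ex_c6_ring_t *r, void *p) {
    if (r->count == EX_C6_SLOTS) return 0;
    r->slots[r->tail] = p;
    r->tail = (r->tail + 1u) & EX_C6_MASK;
    r->count++;
    return 1;
}

/* Returns a cached pointer, or NULL when empty (caller falls back). */
static inline void *ex_c6_inline_pop(ex_c6_ring_t *r) {
    if (r->count == 0) return NULL;
    void *p = r->slots[r->head];
    r->head = (r->head + 1u) & EX_C6_MASK;
    r->count--;
    return p;
}
```

A free path would call `ex_c6_inline_push(&g_ex_c6_ring, ptr)` and an alloc path `ex_c6_inline_pop(&g_ex_c6_ring)`, each falling through to the existing cache when the call reports full or empty.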
Results (10-run Mixed SSOT, WS=400):
- Baseline (C6 inline OFF): 44.24 M ops/s
- Treatment (C6 inline ON): 45.51 M ops/s
- Delta: +1.27 M ops/s (+2.87%)
Decision: ✅ GO (exceeds +1.0% strict threshold)
Mechanism: Branch elimination on unified_cache for C6 (57.2% of C4-C7 ops)
References:
- Per-class analysis: `docs/analysis/PHASE75_PERCLASS_ANALYSIS_0_SSOT.md`
- Results: `docs/analysis/PHASE75_C6_INLINE_SLOTS_1_RESULTS.md`
Phase 75-2: C5 Inline Slots ✅ Done (GO +1.10%)
Goal: C5-only isolated measurement (28.5% of C4-C7) for individual contribution
Approach: Replicate C6 pattern with careful isolation
- Add C5 ring buffer (128 slots, 1KB TLS)
- ENV gate: `HAKMEM_TINY_C5_INLINE_SLOTS=0/1` (default OFF)
- Test strategy: C5-only (baseline C5=OFF+C6=ON, treatment C5=ON+C6=ON)
- Integration: alloc/free boundary points (C5 FIRST, then C6, then unified_cache)
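The lookup order at the alloc/free boundary (C5 first, then C6, then unified_cache) can be expressed as a small routing function. This is a sketch only: the class indices 5 and 6 and the `ex_` names are assumptions made for illustration.

```c
#include <assert.h>

typedef enum {
    EX_ROUTE_C5_INLINE,
    EX_ROUTE_C6_INLINE,
    EX_ROUTE_UNIFIED_CACHE
} ex_route_t;

/* Decide which tier handles a request for class_idx, given the two ENV gates. */
static ex_route_t ex_route_for(int class_idx, int c5_on, int c6_on) {
    if (c5_on && class_idx == 5) return EX_ROUTE_C5_INLINE;  /* C5 first */
    if (c6_on && class_idx == 6) return EX_ROUTE_C6_INLINE;  /* then C6  */
    return EX_ROUTE_UNIFIED_CACHE;                           /* legacy path */
}
```

With C5=OFF and C6=ON (the Phase 75-2 baseline), class 5 requests still reach the unified cache, which is what isolates the C5 contribution in the treatment run.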
Results (10-run Mixed SSOT, WS=400):
- Baseline (C5=OFF, C6=ON): 44.26 M ops/s (σ=0.37)
- Treatment (C5=ON, C6=ON): 44.74 M ops/s (σ=0.54)
- Delta: +0.49 M ops/s (+1.10%)
Decision: ✅ GO (C5 individual contribution validated)
Cumulative Performance:
- Phase 75-1 (C6): +2.87%
- Phase 75-2 (C5 isolated): +1.10%
- Combined potential: ~+3.97% (if additive)
References:
- Implementation details: `docs/analysis/PHASE75_2_C5_INLINE_SLOTS_IMPLEMENTATION.md`
Phase 75-3: C5+C6 Interaction Test (4-Point Matrix A/B) ✅ Done (STRONG GO +5.41%)
Goal: Comprehensive interaction test + final promotion decision
Approach: 4-point matrix A/B test (single binary, ENV-only configuration)
- Point A (C5=0, C6=0): Baseline
- Point B (C5=1, C6=0): C5 solo
- Point C (C5=0, C6=1): C6 solo
- Point D (C5=1, C6=1): C5+C6 combined
Results (10-run per point, Mixed SSOT, WS=400):
- Point A (baseline): 42.36 M ops/s
- Point B (C5 solo): 43.54 M ops/s (+2.79% vs A)
- Point C (C6 solo): 44.25 M ops/s (+4.46% vs A)
- Point D (C5+C6): 44.65 M ops/s (+5.41% vs A) [MAIN TARGET]
Additivity Analysis:
- Expected additive (B+C-A): 45.43 M ops/s
- Actual (D): 44.65 M ops/s
- Sub-additivity: 1.72% (near-perfect additivity, minimal negative interaction)
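The additivity figures follow directly from the four matrix points. In the sketch below (helper names are illustrative), sub-additivity is taken relative to the expected throughput, which is the definition that reproduces the 1.72% value.

```c
#include <assert.h>

/* Expected throughput if the solo gains of B and C simply add on top of A. */
static double expected_additive(double a, double b, double c) {
    return b + c - a;
}

/* Shortfall of the measured combined result versus the additive expectation,
 * in percent of the expectation. */
static double sub_additivity_pct(double expected, double actual) {
    assert(expected > 0.0);
    return (expected - actual) / expected * 100.0;
}

/* Relative throughput change of a treatment versus a baseline, in percent. */
static double delta_pct(double baseline, double treatment) {
    assert(baseline > 0.0);
    return (treatment / baseline - 1.0) * 100.0;
}
```

With A = 42.36, B = 43.54, C = 44.25 and D = 44.65 M ops/s, the expectation is 45.43 M ops/s, the shortfall is about 1.72%, and D vs A is about +5.41%.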
Perf Stat Validation (D vs A):
- Instructions: -6.1% (function call elimination confirmed)
- Branches: -6.1% (matches instruction reduction)
- Cache-misses: -31.5% (improved locality, NOT +86% like Phase 74-2)
- Throughput: +5.41% (net positive)
Decision: ✅ STRONG GO (+5.41%)
- D vs A: +5.41% >> 3.0% threshold
- Sub-additivity: 1.72% << 20% acceptable
- Phase 73 thesis validated: instructions/branches DOWN, throughput UP
Promotion Completed:
- `core/bench_profile.h`: Added C5+C6 defaults to `bench_apply_mixed_tinyv3_c7_common()`
- `scripts/run_mixed_10_cleanenv.sh`: Added C5+C6 ENV defaults
- C5+C6 inline slots now promoted to preset defaults for MIXED_TINYV3_C7_SAFE
Phase 75 Complete: C5+C6 inline slots (129-256B) deliver a proven +5.41% gain on the Standard binary (bench_random_mixed_hakmem).
- Before updating the FAST PGO baseline (scorecard), re-measure the same A/B (C5/C6 OFF/ON) under identical conditions with `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo`.
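One way a preset can supply defaults while still letting an A/B run override them is POSIX `setenv()` with `overwrite = 0`. The sketch below shows that pattern; whether `bench_apply_mixed_tinyv3_c7_common()` actually works this way is an assumption, and the `ex_` function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Set a default only if the variable is not already present in the environment. */
static int ex_setenv_default(const char *name, const char *value) {
    return setenv(name, value, 0);  /* overwrite = 0 keeps an existing value */
}

/* Apply the promoted C5+C6 defaults without clobbering explicit settings. */
static void ex_apply_c5_c6_defaults(void) {
    ex_setenv_default("HAKMEM_TINY_C5_INLINE_SLOTS", "1");
    ex_setenv_default("HAKMEM_TINY_C6_INLINE_SLOTS", "1");
}
```

With this pattern, a run started with `HAKMEM_TINY_C5_INLINE_SLOTS=0` keeps the explicit OFF value, so the same binary can still serve as the baseline arm of the 4-point matrix.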
Phase 75-4 (FAST PGO rebase) ✅ Done
- Result: +3.16% (GO) (4-point matrix, after excluding outliers)
- Details: `docs/analysis/PHASE75_4_FAST_PGO_REBASE_RESULTS.md`
- Important: compared with the Phase 69 FAST baseline (62.63M), the current FAST PGO baseline appears to be substantially lower (PGO profile staleness / training mismatch / build drift)
Phase 75-5 (PGO regeneration) 🟥 Next Active (HIGH PRIORITY)
Objective:
- Regenerate the PGO training for the current code, including the C5/C6 inline slots, and recover a Phase 69-class FAST baseline.
Procedure (outline):
- Run PGO training with C5/C6=ON (always set `HAKMEM_TINY_C5_INLINE_SLOTS=1` / `HAKMEM_TINY_C6_INLINE_SLOTS=1` during training)
- Regenerate `bench_random_mixed_hakmem_minimal_pgo` with `make pgo-fast-full`
- Re-measure the baseline with a 10-run, then re-measure Points A/D from Phase 75-4
- If layout tax / drift is suspected, classify the cause with `scripts/box/layout_tax_forensics_box.sh`
References:
- 4-point matrix results: `docs/analysis/PHASE75_3_C5_C6_INTERACTION_RESULTS.md`
- Test script: `scripts/phase75_3_matrix_test.sh`
5) Archive
- Detailed log: `CURRENT_TASK_ARCHIVE_20251210.md`
- Pre-cleanup snapshot: `docs/analysis/CURRENT_TASK_ARCHIVE.md`