CURRENT_TASK (Rolling, SSOT)
SSOT (current source of truth)
- Performance SSOT: scripts/run_mixed_10_cleanenv.sh (WS=400, RUNS=10, size range forced to 16..1040, *_ONLY forced OFF)
- Route verification: scripts/run_mixed_observe_ssot.sh (OBSERVE only; not used for throughput comparison)
- Build modes: docs/analysis/SSOT_BUILD_MODES.md
- External comparison (quick): docs/analysis/PHASE92_TCMALLOC_GAP_TRIAGE_SSOT.md (same binary under LD_PRELOAD + hakmem_force_libc isolation)
Phase 87-88 (closed: NO-GO)
Status: ✅ OBSERVE verified + ❌ Phase 88 NO-GO
Phase 87: Inline Slots Verification
Initial Finding (Wrong): Standard binary showed PUSH TOTAL/POP TOTAL = 0
- Root Cause: ENV drift (HAKMEM_BENCH_MIN_SIZE/MAX_SIZE leaked in from the caller's environment)
- Fix: scripts/run_mixed_10_cleanenv.sh now pins the size range (MIN=16, MAX=1040) and forces HAKMEM_BENCH_C5_ONLY=0, HAKMEM_BENCH_C6_ONLY=0, HAKMEM_BENCH_C7_ONLY=0
Corrected Finding (OBSERVE binary) - 20M ops Mixed SSOT WS=400:
PUSH TOTAL: C4=687,564 C5=1,373,605 C6=2,750,862 TOTAL=4,812,031 ✓
POP TOTAL: C4=687,564 C5=1,373,605 C6=2,750,862 TOTAL=4,812,031 ✓
PUSH FULL: 0 (0.00%)
POP EMPTY: 168 (0.003%)
JUDGMENT: ✓ [C] LEGACY used AND C4/C5/C6 INLINE SLOTS ACTIVE → Ready for Phase 88/89
Phase 88: Batch Drain Optimization
Overflow Analysis:
- POP EMPTY rate: 168 / 4,812,031 = 0.003% ← negligible
- PUSH FULL rate: 0 / 4,812,031 = 0% ← never occurs
- Decision: batching will not move throughput (overflow almost never happens)
Phase 88 Decision: NO-GO (frozen)
- Rationale: at a 0.003% overflow rate, the layout tax risk outweighs the expected gain
- Infrastructure: keep the observation telemetry (allows re-verification if WS or capacity changes later)
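The overflow arithmetic above can be reproduced with a few lines. This is a minimal sketch: the counter values are copied from the Phase 87 OBSERVE output, while the helper name rate_pct is hypothetical and not part of the hakmem telemetry box.

```python
# Counter values copied from the Phase 87 OBSERVE run in this document.
PUSH_TOTAL = 4_812_031
PUSH_FULL = 0
POP_EMPTY = 168


def rate_pct(events: int, total: int) -> float:
    """Return `events` as a percentage of `total` (0.0 when total is zero)."""
    return 100.0 * events / total if total else 0.0


push_full_pct = rate_pct(PUSH_FULL, PUSH_TOTAL)
pop_empty_pct = rate_pct(POP_EMPTY, PUSH_TOTAL)

# Three decimals matches the precision quoted in the notes.
print(f"PUSH FULL: {push_full_pct:.3f}%  POP EMPTY: {pop_empty_pct:.3f}%")
```

With both rates this close to zero, a batch drain would almost never run, which is the basis for the NO-GO.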
Artifacts Created:
- Telemetry box: core/box/tiny_inline_slots_overflow_stats_box.h/c
- Phase 87 results: docs/analysis/PHASE87_OBSERVATION_RESULTS.md
- SSOT hardening: scripts/run_mixed_10_cleanenv.sh, scripts/run_mixed_observe_ssot.sh
- ENV drift prevention document: docs/analysis/BENCH_REPRODUCIBILITY_SSOT.md
Key Learning:
- Confirming that a path is actually exercised requires the OBSERVE binary plus total counters
- Keep observation and performance measurement separate (avoids telemetry overhead)
- ENV drift (MIN/MAX size, CLASS_ONLY) is the main cause of route changes
Follow-up Fix (SSOT hardening):
- scripts/run_mixed_10_cleanenv.sh now forces HAKMEM_BENCH_MIN_SIZE=16 / HAKMEM_BENCH_MAX_SIZE=1040 and disables HAKMEM_BENCH_C{5,6,7}_ONLY to prevent path drift.
- New pre-flight helper: scripts/run_mixed_observe_ssot.sh (Route Banner + OBSERVE, single run).
- Overflow stats compile gating fixed (see above).
Phase 89 (complete: Bottleneck Analysis & Optimization Roadmap)
Status: ✅ SSOT Measurement Complete + 3 Optimization Candidates Identified
4-Step SSOT Procedure Completion
Step 1: OBSERVE Binary Preflight
- Binary: bench_random_mixed_hakmem_observe (with telemetry enabled)
- Inline slots verification: ✓ PUSH TOTAL = 4.81M, POP EMPTY = 0.003% (confirmed active & healthy)
- Throughput (with telemetry): 51.52M ops/s
Step 2: Standard 10-run Baseline
- Binary: bench_random_mixed_hakmem (clean, no telemetry)
- 10-run SSOT results: 51.36M ops/s (CV: 0.7%, very stable)
- Range: 50.74M - 51.73M
- Decision: This is the baseline for bottleneck analysis
Step 3: FAST PGO 10-run Comparison
- Binary: bench_random_mixed_hakmem_minimal_pgo (PGO optimized)
- 10-run SSOT results: 54.16M ops/s (CV: 1.5%, acceptable)
- Range: 52.89M - 55.13M
- Performance Gap: 54.16M - 51.36M = 2.80M ops/s (+5.45%)
- This represents the optimization ceiling with the current PGO profile
Step 4: Results Captured
- Git SHA: e4c5f0535 (master branch)
- Timestamp: 2025-12-18 23:06:01
- System: AMD Ryzen 7 5825U, 16 logical CPUs, 6.8.0-87-generic kernel
- Files: docs/analysis/PHASE89_SSOT_MEASUREMENT.md
Perf Analysis & Top Bottleneck Identification
Profile Run: 40M operations (0.78s), 833 perf samples
Top Functions by CPU Time:
- free - 27.40% (hottest)
- main - 26.30% (benchmark loop, not optimizable)
- malloc - 20.36% (hot)
- malloc.cold - 10.65% (cold path, avoid optimizing)
- free.cold - 5.59% (cold path, avoid optimizing)
- tiny_region_id_write_header - 2.98% (hot, inlining candidate)
malloc + free combined = 47.76% of CPU time (already Phase 9/10/78-1/80-1 optimized)
Top 3 Optimization Candidates (Ranked by Priority)
| Candidate | Priority | Recommendation | Expected Gain | Risk | Effort |
|---|---|---|---|---|---|
| tiny_region_id_write_header always_inline | HIGH | PURSUE | +1-2% | LOW | 1-2h |
| malloc/free branch reduction | MEDIUM | DEFER | +2-3% | MEDIUM | 20-40h |
| Cold-path optimization | LOW | AVOID | +1% | HIGH | 10-20h |
Candidate 1: tiny_region_id_write_header always_inline (2.98% CPU)
- Current: Selective inlining from core/region_id_v6.c
- Proposal: Force always_inline for hot-path call sites
- Layout Impact: MINIMAL (no code bulk, maintains I-cache discipline)
- Recommendation: YES - PURSUE
- Estimated timeline: Phase 90
- Implementation: 1-2 lines, add an __attribute__((always_inline)) wrapper
Candidate 2: malloc/free branch reduction (47.76% CPU)
- Current: Phase 9/10/78-1/80-1/83-1 already optimized
- Observation: 56.4M branch-misses (branch prediction pressure)
- Proposal: Pre-compute routing tables (like Phase 85 approach)
- Risk: Code bloat, potential layout tax regression (Phase 85 was NO-GO)
- Recommendation: DEFER
- Wait for workload characteristics that justify complexity
- Current gains saturation point reached
Phase 91 (closed: NEUTRAL / frozen)
Status: ⚪ NEUTRAL (C6 IFL: +0.38% / 10-run) → kept with default OFF
- Goal: replace the C6 inline slots FIFO with an intrusive LIFO to cut the fixed tax
- Results (SSOT 10-run):
  - Control (HAKMEM_TINY_C6_INLINE_SLOTS_IFL=0): mean 52.05M
  - Treatment (HAKMEM_TINY_C6_INLINE_SLOTS_IFL=1): mean 52.25M
  - Δ +0.38% (below the +1.0% GO threshold)
- Verdict: frozen (research box)
  - No regression, but the ROI is too small to extend the change to C5/C4
Phase 92 (scheduled to start)
Status: 🔍 planning the next phase
Goal: quickly classify the cause of the tcmalloc performance gap (hakmem: 52M vs tcmalloc: 58M, -12.8%)
Planned:
- Case A: small vs large object isolation test (C6-only vs C7-only)
- Case B: Inline Slots vs Unified Cache isolation test
- Case C: LIFO vs FIFO comparison
- Case D: pool size sensitivity test
Duration: 1-2h (quick triage)
Output: identify the primary bottleneck → select the next candidate
References:
- Triage Plan: docs/analysis/PHASE92_TCMALLOC_GAP_TRIAGE_SSOT.md
Candidate 3: Cold-path de-duplication (16.24% CPU)
- Current: malloc.cold (10.65%) + free.cold (5.59%) explicitly separated
- Rationale: Separation improves hot-path I-cache utilization
- Recommendation: AVOID
- Aligns with the user's "avoid layout tax" principle
- Optimizing cold paths would ADD code to hot path (violates design)
Key Performance Insights
FAST PGO vs Standard (+5.45%) breakdown:
- PGO branch prediction optimization: ~3%
- Code layout optimization: ~2%
- Inlining decisions: ~0.5%
Conclusion: Standard build limited by branch prediction pressure; further gains require architectural tradeoffs.
Inline Slots Health: Working perfectly - 0.003% overflow rate confirms no bottleneck
References & Artifacts
- SSOT Measurement: docs/analysis/PHASE89_SSOT_MEASUREMENT.md
- Bottleneck Analysis: docs/analysis/PHASE89_BOTTLENECK_ANALYSIS.md
- Perf Stats: docs/analysis/PHASE89_PERF_STAT.txt
- Scripts: scripts/run_mixed_10_cleanenv.sh, scripts/run_mixed_observe_ssot.sh
Phase 86 (closed: NO-GO)
Status: ❌ NO-GO (+0.25% improvement, threshold: +1.0%)
A/B Test (10-run SSOT):
- Control: 51,750,467 ops/s (CV: 2.26%)
- Treatment: 51,881,055 ops/s (CV: 2.32%)
- Delta: +0.25% (mean), -0.15% (median)
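The delta and the GO check used throughout these notes can be sketched as follows. The +1.0% threshold is the one quoted in this document. The function names are hypothetical, and treating a delta exactly at the threshold as GO is an assumption, since the notes only say "exceeds" or "below".

```python
GO_THRESHOLD_PCT = 1.0  # GO threshold quoted in these notes


def delta_pct(control: float, treatment: float) -> float:
    """Relative throughput change of treatment vs control, in percent."""
    return 100.0 * (treatment - control) / control


def meets_go(control: float, treatment: float,
             threshold_pct: float = GO_THRESHOLD_PCT) -> bool:
    """True when the mean delta reaches the GO threshold (assumed inclusive)."""
    return delta_pct(control, treatment) >= threshold_pct


# Phase 86 A/B means (ops/s) from the list above.
phase86 = delta_pct(51_750_467, 51_881_055)
print(f"Phase 86 delta: {phase86:+.2f}%  GO: {meets_go(51_750_467, 51_881_055)}")

# Phase 89 Standard vs FAST PGO means (M ops/s), same formula.
pgo_gap = delta_pct(51.36, 54.16)
print(f"FAST PGO vs Standard: {pgo_gap:+.2f}%")
```

Both control CVs here are above 2%, so a +0.25% mean delta is well inside run-to-run noise, which is consistent with the NO-GO verdict.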
Summary: Free path legacy mask (mask-only) optimization for LEGACY classes.
- Design: Bitset mask + direct call (avoids Phase 85's indirect call problems)
- Implementation: Correct (0x7f mask computed, C0-C6 optimized)
- Root cause: Competing Phase 9/10 optimizations (+1.89%) already capture most benefit
- Conclusion: Free path optimization layer has reached practical ceiling
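The "bitset mask + direct call" idea can be pictured with a small sketch. Only the 0x7f value and the C0-C6 range come from the notes above. The function and constant names are hypothetical, and the real implementation is C code in the free path, not Python.

```python
NUM_LEGACY_CLASSES = 7  # C0..C6, per the "C0-C6 optimized" note


def build_legacy_mask(classes) -> int:
    """Set one bit per size class that takes the LEGACY route."""
    mask = 0
    for class_idx in classes:
        mask |= 1 << class_idx
    return mask


LEGACY_MASK = build_legacy_mask(range(NUM_LEGACY_CLASSES))


def is_legacy(class_idx: int) -> bool:
    """Single shift-and-test; no table lookup or indirect call is needed."""
    return bool((LEGACY_MASK >> class_idx) & 1)


print(hex(LEGACY_MASK))            # 0x7f
print(is_legacy(6), is_legacy(7))  # True False
```

Because the test is one shift and one AND, the caller can branch straight to the legacy free function, which is what avoids the indirect call that hurt Phase 85.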
0) Current source of truth (SSOT)
- Current SSOT (Phase 89 capture / Git SHA: e4c5f0535):
  - Standard (./bench_random_mixed_hakmem) 10-run mean: 51.36M ops/s (CV ~0.7%)
  - FAST PGO minimal (./bench_random_mixed_hakmem_minimal_pgo) 10-run mean: 54.16M ops/s (CV ~1.5%, +5.45% vs Standard)
  - OBSERVE (./bench_random_mixed_hakmem_observe): 51.52M ops/s (includes telemetry; not the reference for performance comparison)
  - SSOT capture: docs/analysis/PHASE89_SSOT_MEASUREMENT.md
- Reference for optimization decisions: same-binary A/B (ENV toggle) = scripts/run_mixed_10_cleanenv.sh
- Reference for mimalloc/tcmalloc comparison: reference only (separate binary / LD_PRELOAD) = docs/analysis/ALLOCATOR_COMPARISON_SSOT.md
- Scorecard (reference for targets and current values): docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md (Phase 89 SSOT already reflected as the current snapshot)
  - Phase 66/68/69 (60M-62M range) are historical (do not compare them directly with current HEAD; take a rebase measurement first if a comparison is needed)
- Next phase (design review): docs/analysis/PHASE90_STRUCTURAL_REVIEW_AND_GAP_TRIAGE_SSOT.md
- Mixed 10-run SSOT (harness): scripts/run_mixed_10_cleanenv.sh
  - Default: BENCH_BIN=./bench_random_mixed_hakmem (Standard)
  - For FAST PGO, set BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo explicitly
  - Defaults: ITERS=20000000 WS=400, HAKMEM_WARM_POOL_SIZE=16, HAKMEM_TINY_C4_INLINE_SLOTS=1, HAKMEM_TINY_C5_INLINE_SLOTS=1, HAKMEM_TINY_C6_INLINE_SLOTS=1, HAKMEM_TINY_INLINE_SLOTS_FIXED=1, HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1
  - Forced OFF by cleanenv (leak prevention): HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0 (Phase 83-1 NO-GO / research)
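The "clean environment" idea behind the harness can be sketched as below. The knob names and values are the ones listed in this document. Dropping every inherited HAKMEM_* variable before applying the forced set is an assumption about how one might prevent leaks, not a description of what scripts/run_mixed_10_cleanenv.sh actually does. build_clean_env is a hypothetical helper name.

```python
import os

# Values taken from this document; the real script may set more knobs.
FORCED_ENV = {
    "HAKMEM_PROFILE": "MIXED_TINYV3_C7_SAFE",
    "HAKMEM_BENCH_MIN_SIZE": "16",
    "HAKMEM_BENCH_MAX_SIZE": "1040",
    "HAKMEM_BENCH_C5_ONLY": "0",
    "HAKMEM_BENCH_C6_ONLY": "0",
    "HAKMEM_BENCH_C7_ONLY": "0",
    "HAKMEM_WARM_POOL_SIZE": "16",
    "HAKMEM_TINY_C4_INLINE_SLOTS": "1",
    "HAKMEM_TINY_C5_INLINE_SLOTS": "1",
    "HAKMEM_TINY_C6_INLINE_SLOTS": "1",
    "HAKMEM_TINY_INLINE_SLOTS_FIXED": "1",
    "HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH": "1",
    "HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED": "0",
}


def build_clean_env(inherited):
    """Drop inherited HAKMEM_* variables (assumption), then apply forced values."""
    env = {k: v for k, v in inherited.items() if not k.startswith("HAKMEM_")}
    env.update(FORCED_ENV)
    return env


# The resulting dict could be passed as env= to subprocess.run for the benchmark.
env = build_clean_env(os.environ)
print(sorted(k for k in env if k.startswith("HAKMEM_")))
```

Starting from an allow-list rather than patching individual variables means a stale knob left over from an earlier experiment cannot silently change the route.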
0a) Preventing flip-flopping (minimum SSOT rules)
- Always set HAKMEM_PROFILE explicitly for hakmem (when it is unset the route changes and the numbers easily break).
  - Recommended: HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE (Speed-first)
- Use a different runner for each comparison purpose:
  - hakmem SSOT (optimization decisions): scripts/run_mixed_10_cleanenv.sh
  - allocator reference (quick): scripts/run_allocator_quick_matrix.sh
  - allocator reference (minimizes layout differences): scripts/run_allocator_preload_matrix.sh
- Keep reproduction logs (the minimum needed when chasing a few percent):
  - scripts/bench_ssot_capture.sh
  - HAKMEM_BENCH_ENV_LOG=1 (records CPU governor/EPP/freq)
- External consultation (paste-in packet): docs/analysis/FREE_PATH_REVIEW_PACKET_CHATGPT.md (generated by scripts/make_chatgpt_pro_packet_free_path.sh)
0b) Allocator comparison (reference)
- Allocator comparison (system/jemalloc/mimalloc/tcmalloc) is reference only (separate binaries / LD_PRELOAD → includes layout differences).
  - SSOT: docs/analysis/ALLOCATOR_COMPARISON_SSOT.md
  - Quick (Random Mixed 10-run): scripts/run_allocator_quick_matrix.sh
    - Important: for hakmem, set HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE explicitly and run through scripts/run_mixed_10_cleanenv.sh (an omitted PROFILE breaks the numbers).
  - Same-binary (recommended, minimizes layout differences): scripts/run_allocator_preload_matrix.sh
    - Pins bench_random_mixed_system and swaps the allocator via LD_PRELOAD.
    - Note: the route differs from hakmem's linked benchmark (bench_random_mixed_hakmem*), because LD_PRELOAD goes through a drop-in wrapper.
  - Scenario CSV (small-scale reference): scripts/bench_allocators_compare.sh
1) Staying on track (routes/observation)
A minimal procedure for preventing "optimizations whose path is never exercised".
- Route Banner (eliminates route misidentification): HAKMEM_ROUTE_BANNER=1
  - Output: Route assignments (backend route kind) + cache config (unified_cache_enabled / warm_pool_max_per_class)
- Refill observation SSOT: docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md
  - Under WS=400 (Mixed SSOT) misses are negligible → unified_cache_refill() optimization is frozen (zero ROI)
2) Recent conclusions (key points only)
- Phase 69 (WarmPool sweep): HAKMEM_WARM_POOL_SIZE=16 is a strong GO (+3.26%) and has been promoted to baseline.
  - Design: docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md
  - Results: docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md
- Phase 70 (observation SSOT): made the statistics visible and established the prerequisite gates. Under the WS=400 SSOT, refill is cold.
  - SSOT: docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md
- Phase 71/73 (why WarmPool=16 wins): the win comes from a slight reduction in instructions/branches (confirmed with perf stat).
  - Details: docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md
- Phase 72 (ENV knob ROI exhausted): no ENV-only win beyond WarmPool=16 → time to attack the structure (code).
- Phase 78-1 (structural): fixed the per-op ENV gate that enables Inline Slots; GO (+2.31%) in same-binary A/B.
  - Results: docs/analysis/PHASE78_1_INLINE_SLOTS_FIXED_MODE_RESULTS.md
- Phase 80-1 (structural): converted the Inline Slots if-chain to switch dispatch; GO (+1.65%) in same-binary A/B.
  - Results: docs/analysis/PHASE80_INLINE_SLOTS_SWITCH_DISPATCH_1_RESULTS.md
- Phase 83-1 (structural): fixed the per-op ENV gate of switch dispatch (Phase 78-1 pattern applied); NO-GO (+0.32%, branch reduction negligible) in same-binary A/B.
  - Results: docs/analysis/PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md
  - Cause: the lazy-init pattern was already optimized (per-op overhead minimal) → the ROI of fixed mode is negligible
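2a) Next overall direction (design order, SSOT)
Goal: even while mimalloc/tcmalloc remain far ahead, aim for +5-10% without breaking Box Theory (single boundary, reversible, minimal instrumentation, fail-fast).
Priority order (taking the core ideas of Google/TCMalloc as reference):
- Batch the ThreadCache overflow (top priority)
  - When the inline slots (C4/C5/C6) are full, drain the overflow to the colder tier in a batch instead of one item at a time
  - Pin the conversion point to a single place (flush/drain)
- Batch push/pop on the Central/Shared side (next)
  - Batch the merges into shared/remote to reduce the number of lock/atomic operations
- Memory return / footprint policy (operations axis)
  - Establish the win conditions for Balanced/Lean (syscall/RSS drift/tail) as SSOT, and push only as far as speed does not drop
Important: this is currently the stage of deciding the core of the design. Implement only after measurements confirm that overflow happens often enough.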
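2b) Next work (on hold)
Wait until the task the user delegated to another agent (Claude Code) has finished. Checks to start once it is done (the two needed soonest):
- Measure the inline slots overflow rate (FULL/overflow counts and ratios for C4/C5/C6)
- Quantify the cost of the overflow destination (perf stat / perf report for the functions reached on overflow)
Once both are available, proceed to Phase 86 (Overflow batch design).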
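3) Operating rules (Box Theory + layout tax countermeasures)
- Always land changes as a box + a single boundary + an ENV switch to revert (fail-fast, minimal instrumentation).
- A/B is in principle an ENV toggle on the same binary (comparing separate binaries mixes in layout effects).
- SSOT operation (flip-flop prevention): docs/analysis/PHASE75_6_SSOT_POLICY_FAST_PGO_VS_STANDARD.md
- "Delete it and it gets faster" is off-limits (link-out or large deletions easily flip the sign through layout tax) → prefer compile-out.
  - Diagnostics: scripts/box/layout_tax_forensics_box.sh / docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md
- Research box inventory SSOT: docs/analysis/RESEARCH_BOXES_SSOT.md
  - Knob list: scripts/list_hakmem_knobs.sh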
5) Handling of research boxes (freeze policy)
- Phase 79-1 (C2 local cache): HAKMEM_TINY_C2_LOCAL_CACHE=0/1
  - Result: +0.57% (NO-GO, below the +1.0% threshold) → research box freeze
  - Default OFF under SSOT/cleanenv (scripts/run_mixed_10_cleanenv.sh forces 0)
  - Not physically deleted (avoids layout tax risk)
  - Phase 82 (hardening): the C2 local cache is fully excluded from the hot path (even with the environment variable set, the alloc/free hot path never touches it)
    - Record: docs/analysis/PHASE82_C2_LOCAL_CACHE_HOTPATH_EXCLUSION.md
- Phase 85 (Free path commit-once, LEGACY-only): HAKMEM_FREE_PATH_COMMIT_ONCE=0/1
  - Result: NO-GO (-0.86%) → research box freeze (default OFF)
  - Reason: its effect overlaps with Phase 10 (MONO LEGACY DIRECT), and it added indirect-call and layout tax on top
  - Record: docs/analysis/PHASE85_FREE_PATH_COMMIT_ONCE_RESULTS.md
4) Next instructions (Active)
Phase 74 (structural): shorten the UnifiedCache hit-path ✅ P1 (LOCALIZE) frozen
Prerequisites:
- Under the WS=400 SSOT, UnifiedCache misses are negligible → refill optimization has zero ROI.
- The WarmPool=16 win comes from a slight instruction/branch reduction → shortening the hit-path is the right approach.
Phase 74-1: LOCALIZE (ENV-gated) ✅ complete (NEUTRAL +0.50%)
- ENV: HAKMEM_TINY_UC_LOCALIZE=0/1
- Runtime branch overhead increased instructions/branches (+0.7%/+0.4%)
- Verdict: NEUTRAL (+0.50%)
Phase 74-2: LOCALIZE (compile-time gate) ✅ complete (NEUTRAL -0.87%)
- Build flag: HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1 (default 0)
- Runtime branch removed → instructions/branches improved (-0.6%/-2.3%) ✓
- However, cache-misses rose +86% (register pressure / spill) → throughput -0.87%
- Isolation succeeded: LOCALIZE itself is a win, but the cache-miss increase cancels it out
- Verdict: NEUTRAL (-0.87%) → P1 (LOCALIZE) frozen
Conclusion:
- P1 (LOCALIZE) is frozen with default OFF (the ROI of dependency chain reduction is low)
- Next: proceed to Phase 74-3 (P0: FASTAPI)
Phase 74-3: P0 (FASTAPI) ✅ complete (NEUTRAL +0.32%)
Goal: move the unified_cache_enabled() / lazy-init / stats checks out of the hot loop
Approach:
- Add the unified_cache_push_fast() / unified_cache_pop_fast() API
- Precondition: the caller guarantees "valid/enabled/no-stats"
- Fail-fast: fall back to the slow path on any unexpected state (single boundary)
- ENV gate: HAKMEM_TINY_UC_FASTAPI=0/1 (default 0, research box)
Results (10-run Mixed SSOT, WS=400):
- Throughput: +0.32% (NEUTRAL, below +1.0% GO threshold)
- cache-misses: -16.31% (positive signal, insufficient throughput gain)
Verdict: NEUTRAL (+0.32%) → P0 (FASTAPI) frozen
References:
- Design: docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md
- Instructions: docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md
- Results (P1/P0): docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md
Phase 75 (structural): Hot-class Inline Slots (P2) ✅ complete (Standard A/B)
Goal: statistical analysis of C4-C7 → decide on a targeted optimization strategy
Prerequisites (Phase 74 learnings):
- UnifiedCache hit-path optimization has low ROI ← register pressure / cache-miss effects
- Next axis: exploit per-class characteristics → branch elimination with TLS-direct inline slots
Phase 75-0: Per-Class Analysis ✅ complete
Per-class Unified-STATS (Mixed SSOT, WS=400, HAKMEM_MEASURE_UNIFIED_CACHE=1):
| Class | Capacity | Occupied | Hits | Pushes | Total Ops | Hit % | % of C4-C7 |
|---|---|---|---|---|---|---|---|
| C6 | 128 | 127 | 2,750,854 | 2,750,855 | 5,501,709 | 100% | 57.2% |
| C5 | 128 | 127 | 1,373,604 | 1,373,605 | 2,747,209 | 100% | 28.5% |
| C4 | 64 | 63 | 687,563 | 687,564 | 1,375,127 | 100% | 14.3% |
| C7 | ? | ? | ? | ? | ? | ? | ? |
Key findings:
- C6 dominates: 57.2% of operations (2.75M hits)
- 全クラス 100% hit rate (refill inactive in SSOT)
- Cache occupancy near-capacity (98-99%)
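The "% of C4-C7" column can be recomputed from the Total Ops column. This is a small sketch using only the three classes with known values; C7 is left out because its row is unknown in the table above. class_shares is a hypothetical helper name.

```python
# Total Ops per class, copied from the table above (C7 is unknown there).
TOTAL_OPS = {"C6": 5_501_709, "C5": 2_747_209, "C4": 1_375_127}


def class_shares(total_ops):
    """Return each class's share of all listed operations, in percent."""
    grand_total = sum(total_ops.values())
    return {cls: 100.0 * ops / grand_total for cls, ops in total_ops.items()}


shares = class_shares(TOTAL_OPS)
for cls, pct in shares.items():
    print(f"{cls}: {pct:.1f}%")
```

The ranking by share is what drives the order of the later phases: C6 first, then C5, then C4.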
Phase 75-1: C6-only Inline Slots ✅ complete (GO +2.87%)
Approach: Modular box theory design with single decision point at TLS init
Implementation (5 new boxes + test script):
- ENV gate box: HAKMEM_TINY_C6_INLINE_SLOTS=0/1 (lazy-init, default OFF)
- TLS extension: 128-slot ring buffer (1KB per thread, zero overhead when OFF)
- Fast-path API: c6_inline_push() / c6_inline_pop() (always_inline, 1-2 cycles)
- Integration: Minimal (2 boundary points: alloc/free for C6 class only)
- Backward compatible: Legacy code intact, fail-fast to unified_cache
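The shape of a per-class inline-slot ring, together with the overflow counters reported in Phase 87, can be modelled roughly as follows. This is an illustrative Python model, not the C implementation: the class name, field names and counter semantics (totals count successful operations only) are assumptions. The 128-slot capacity comes from the list above, and 128 pointers of 8 bytes would account for the quoted 1KB on a 64-bit target.

```python
class InlineSlots:
    """Fixed-capacity FIFO ring with overflow counters (illustrative model)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = 0    # index of the next item to pop
        self.count = 0   # number of occupied slots
        self.push_total = 0
        self.push_full = 0
        self.pop_total = 0
        self.pop_empty = 0

    def push(self, ptr) -> bool:
        """Store ptr; return False when full so the caller can fall back."""
        if self.count == self.capacity:
            self.push_full += 1
            return False
        tail = (self.head + self.count) % self.capacity
        self.slots[tail] = ptr
        self.count += 1
        self.push_total += 1
        return True

    def pop(self):
        """Return the oldest ptr, or None when empty so the caller can fall back."""
        if self.count == 0:
            self.pop_empty += 1
            return None
        ptr = self.slots[self.head]
        self.head = (self.head + 1) % self.capacity
        self.count -= 1
        self.pop_total += 1
        return ptr


c6 = InlineSlots(128)
c6.push(0x1000)
print(hex(c6.pop()), c6.pop(), c6.pop_empty)
```

A False from push or a None from pop is the point where the real code would fail over to unified_cache, so the PUSH FULL and POP EMPTY counters measure exactly how often that fallback boundary is crossed.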
Results (10-run Mixed SSOT, WS=400):
- Baseline (C6 inline OFF): 44.24 M ops/s
- Treatment (C6 inline ON): 45.51 M ops/s
- Delta: +1.27 M ops/s (+2.87%)
Decision: ✅ GO (exceeds +1.0% strict threshold)
Mechanism: Branch elimination on unified_cache for C6 (57.2% of C4-C7 ops)
References:
- Per-class analysis: docs/analysis/PHASE75_PERCLASS_ANALYSIS_0_SSOT.md
- Results: docs/analysis/PHASE75_C6_INLINE_SLOTS_1_RESULTS.md
Phase 75-2: C5 Inline Slots ✅ complete (GO +1.10%)
Goal: C5-only isolated measurement (28.5% of C4-C7) for individual contribution
Approach: Replicate C6 pattern with careful isolation
- Add C5 ring buffer (128 slots, 1KB TLS)
- ENV gate: HAKMEM_TINY_C5_INLINE_SLOTS=0/1 (default OFF)
- Test strategy: C5-only (baseline C5=OFF+C6=ON, treatment C5=ON+C6=ON)
- Integration: alloc/free boundary points (C5 FIRST, then C6, then unified_cache)
Results (10-run Mixed SSOT, WS=400):
- Baseline (C5=OFF, C6=ON): 44.26 M ops/s (σ=0.37)
- Treatment (C5=ON, C6=ON): 44.74 M ops/s (σ=0.54)
- Delta: +0.49 M ops/s (+1.10%)
Decision: ✅ GO (C5 individual contribution validated)
Cumulative Performance:
- Phase 75-1 (C6): +2.87%
- Phase 75-2 (C5 isolated): +1.10%
- Combined potential: ~+3.97% (if additive)
References:
- Implementation details: docs/analysis/PHASE75_2_C5_INLINE_SLOTS_IMPLEMENTATION.md
Phase 75-3: C5+C6 Interaction Test (4-Point Matrix A/B) ✅ complete (STRONG GO +5.41%)
Goal: Comprehensive interaction test + final promotion decision
Approach: 4-point matrix A/B test (single binary, ENV-only configuration)
- Point A (C5=0, C6=0): Baseline
- Point B (C5=1, C6=0): C5 solo
- Point C (C5=0, C6=1): C6 solo
- Point D (C5=1, C6=1): C5+C6 combined
Results (10-run per point, Mixed SSOT, WS=400):
- Point A (baseline): 42.36 M ops/s
- Point B (C5 solo): 43.54 M ops/s (+2.79% vs A)
- Point C (C6 solo): 44.25 M ops/s (+4.46% vs A)
- Point D (C5+C6): 44.65 M ops/s (+5.41% vs A) [MAIN TARGET]
Additivity Analysis:
- Expected additive (B+C-A): 45.43 M ops/s
- Actual (D): 44.65 M ops/s
- Sub-additivity: 1.72% (near-perfect additivity, minimal negative interaction)
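The additivity check for a 4-point matrix can be written down directly. This sketch uses the throughput means from this document; the function name and the sign convention (positive = sub-additive, negative = super-additive) are choices made here to match how the notes report the numbers.

```python
def additivity(a: float, b: float, c: float, d: float):
    """Return (expected additive throughput, shortfall of D vs expected in %).

    a: both features OFF, b and c: one feature ON each, d: both ON.
    A positive shortfall means sub-additive; a negative one means super-additive.
    """
    expected = b + c - a
    shortfall_pct = 100.0 * (expected - d) / expected
    return expected, shortfall_pct


# Phase 75-3 points A, B, C, D (M ops/s).
expected_75_3, shortfall_75_3 = additivity(42.36, 43.54, 44.25, 44.65)
print(f"expected {expected_75_3:.2f}M, sub-additivity {shortfall_75_3:.2f}%")

# Phase 76-2 points A, B, C, D (M ops/s), same formula.
expected_76_2, shortfall_76_2 = additivity(49.48, 49.44, 52.27, 52.97)
print(f"expected {expected_76_2:.2f}M, sub-additivity {shortfall_76_2:.2f}%")
```

Applying the same function to the Phase 76-2 matrix yields a negative value, which is what that phase reports as super-additive behaviour.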
Perf Stat Validation (D vs A):
- Instructions: -6.1% (function call elimination confirmed)
- Branches: -6.1% (matches instruction reduction)
- Cache-misses: -31.5% (improved locality, NOT +86% like Phase 74-2)
- Throughput: +5.41% (net positive)
Decision: ✅ STRONG GO (+5.41%)
- D vs A: +5.41% >> 3.0% threshold
- Sub-additivity: 1.72% << 20% acceptable
- Phase 73 thesis validated: instructions/branches DOWN, throughput UP
Promotion Completed:
- core/bench_profile.h: Added C5+C6 defaults to bench_apply_mixed_tinyv3_c7_common()
- scripts/run_mixed_10_cleanenv.sh: Added C5+C6 ENV defaults
- C5+C6 inline slots now promoted to preset defaults for MIXED_TINYV3_C7_SAFE
Phase 75 Complete: C5+C6 inline slots (129-256B) deliver a proven +5.41% gain on the Standard binary (bench_random_mixed_hakmem).
- Before updating the FAST PGO baseline (scorecard), re-measure the A/B (C5/C6 OFF/ON) under the same conditions with BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo.
Phase 75-4 (FAST PGO rebase) ✅ complete
- Result: +3.16% (GO) (4-point matrix, after outlier exclusion)
- Details: docs/analysis/PHASE75_4_FAST_PGO_REBASE_RESULTS.md
- Important: compared with the Phase 69 FAST baseline (62.63M), the current FAST PGO baseline appears to be much lower (suspected PGO profile staleness / training mismatch / build drift)
Phase 75-5 (PGO regeneration) ✅ complete (NO-GO on hypothesis, code bloat root cause identified)
Goal:
- Regenerate the PGO training against the current code, which includes the C5/C6 inline slots, and recover a Phase 69-class FAST baseline.
Results:
- PGO profile regeneration had only a limited effect (+0.3%)
- The root cause is code bloat (+13KB, +3.1%), not PGO profile mismatch
- Code bloat triggers layout tax: IPC collapse (-7.22%) and a branch-miss spike (+19.4%) → net -12% regression
Forensics findings (scripts/box/layout_tax_forensics_box.sh):
- Text size: +13KB (+3.1%)
- IPC: 1.80 → 1.67 (-7.22%)
- Branch-misses: +19.4%
- Cache-misses: +5.7%
Decision:
- FAST PGO is sensitive to code bloat → Track A/B discipline established
- Track A: Standard binary で implementation decisions (SSOT for GO/NO-GO)
- Track B: FAST PGO で mimalloc ratio tracking (periodic rebase, not single-point decisions)
References:
- Detailed results: docs/analysis/PHASE75_5_PGO_REGENERATION_RESULTS.md
- Instructions: docs/analysis/PHASE75_5_PGO_REGENERATION_NEXT_INSTRUCTIONS.md
Phase 76 (structural, continued): C4-C7 Remaining Classes ✅ Phase 76-1 complete (GO +1.73%)
Prerequisites (Phase 75 complete):
- C5+C6 inline slots: +5.41% proven (Standard), +3.16% (FAST PGO)
- Code bloat sensitivity identified → Track A/B discipline established
- Remaining C4-C7 coverage: C4 (14.29%), C7 (0%)
Phase 76-0: C7 Statistics Analysis ✅ complete (NO-GO for C7 P2)
Approach: OBSERVE run to measure C7 allocation patterns in Mixed SSOT
Results: C7 = 0% of operations in the Mixed SSOT workload
Decision: NO-GO for C7 P2 optimization → proceed to C4
References:
- Results: docs/analysis/PHASE76_0_C7_STATISTICS_ANALYSIS.md
Phase 76-1: C4 Inline Slots ✅ complete (GO +1.73%)
Goal: Complete C4-C6 inline slots trilogy, targeting remaining 14.29% of C4-C7 operations
Implementation (modular box pattern):
- ENV gate: HAKMEM_TINY_C4_INLINE_SLOTS=0/1 (default OFF → ON after promotion)
- TLS ring: 64 slots, 512B per thread (lighter than C5/C6's 1KB)
- Fast-path API: c4_inline_push() / c4_inline_pop() (always_inline)
- Integration: C4 FIRST → C5 → C6 → unified_cache (alloc/free cascade)
Results (10-run Mixed SSOT, WS=400):
- Baseline (C4=OFF, C5=ON, C6=ON): 52.42 M ops/s
- Treatment (C4=ON, C5=ON, C6=ON): 53.33 M ops/s
- Delta: +0.91 M ops/s (+1.73%)
Decision: ✅ GO (exceeds +1.0% threshold)
Promotion Completed:
- core/bench_profile.h: Added C4 default to bench_apply_mixed_tinyv3_c7_common()
- scripts/run_mixed_10_cleanenv.sh: Added HAKMEM_TINY_C4_INLINE_SLOTS=1 default
- C4 inline slots now promoted to preset defaults alongside C5+C6
Coverage Summary (C4-C7 complete):
- C6: 57.17% (Phase 75-1, +2.87%)
- C5: 28.55% (Phase 75-2, +1.10%)
- C4: 14.29% (Phase 76-1, +1.73%)
- C7: 0.00% (Phase 76-0, NO-GO)
- Combined C4-C6: 100% of C4-C7 operations
Estimated Cumulative Gain: +7-8% (C4+C5+C6 combined, assumes near-perfect additivity like Phase 75-3)
References:
- Results: docs/analysis/PHASE76_1_C4_INLINE_SLOTS_RESULTS.md
- C4 box files: core/box/tiny_c4_inline_slots_*.h, core/front/tiny_c4_inline_slots.h, core/tiny_c4_inline_slots.c
Phase 76-2: C4+C5+C6 Comprehensive 4-Point Matrix ✅ complete (STRONG GO +7.05%, super-additive)
Goal: Validate cumulative C4+C5+C6 interaction and establish SSOT baseline for next optimization axis
Results (4-point matrix, 10-run each):
- Point A (all OFF): 49.48 M ops/s (baseline)
- Point B (C4 only): 49.44 M ops/s (-0.08%, context-dependent regression)
- Point C (C5+C6 only): 52.27 M ops/s (+5.63% vs A)
- Point D (all ON): 52.97 M ops/s (+7.05% vs A) ✅ STRONG GO
Critical Discovery:
- C4 shows -0.08% regression in isolation (C5/C6 OFF)
- C4 shows +1.27% gain in context (with C5+C6 ON)
- Super-additivity: Actual D (+7.05%) exceeds expected additive (+5.56%)
- Implication: Per-class optimizations are context-dependent, not independently additive
Sub-additivity Analysis:
- Expected additive: 52.23 M ops/s (B + C - A)
- Actual: 52.97 M ops/s
- Sub-additivity: -1.42% (negative, i.e. super-additive) ✓
Decision: ✅ STRONG GO
- D vs A: +7.05% >> +3.0% threshold
- Super-additive behavior confirms synergistic gains
- C4+C5+C6 locked to SSOT defaults
References:
- Detailed results: docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md
🟩 Complete: C4-C7 Inline Slots Optimization Stack
Per-class Coverage Summary (Final):
- C6 (57.17%): +2.87% (Phase 75-1)
- C5 (28.55%): +1.10% (Phase 75-2)
- C4 (14.29%): +1.27% in context (Phase 76-1/76-2)
- C7 (0.00%): NO-GO (Phase 76-0)
- Combined C4-C6: +7.05% (Phase 76-2 super-additive)
Status: ✅ C4-C7 Optimization Complete (100% coverage, SSOT locked)
🟥 Next Active (Phase 77+)
Options:
Option A: FAST PGO Periodic Tracking (Track B discipline)
- Regenerate PGO profile with C4+C5+C6=ON if code bloat accumulates
- Monitor mimalloc ratio progress (secondary metric)
- Not a decision point per se, but periodic maintenance
Option B: Phase 77 (Alternative Optimization Axis)
- Explore beyond per-class inline slots
- Candidates:
- Allocation fast-path optimization (call elimination)
- Metadata/page lookup (table optimization)
- C3/C2 class strategies
- Warm pool tuning (beyond Phase 69's WarmPool=16)
Recommendation: proceed with Option B (Phase 77+)
- C4-C7 optimizations are exhausted and locked
- Ready to explore new optimization axes
- Baseline is now +7.05% stronger than Phase 75-3
References:
- Full C4-C7 analysis: docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md
- Phase 75-3 reference (C5+C6): docs/analysis/PHASE75_3_C5_C6_INTERACTION_RESULTS.md
5) Archive
- Detailed log: CURRENT_TASK_ARCHIVE_20251210.md
- Pre-cleanup snapshot: docs/analysis/CURRENT_TASK_ARCHIVE.md