# CURRENT_TASK (Rolling, SSOT)

## SSOT (current source of truth)

- **Performance SSOT**: `scripts/run_mixed_10_cleanenv.sh` (WS=400, RUNS=10, size range forced to 16..1040, `*_ONLY` forced OFF)
- **Path verification**: `scripts/run_mixed_observe_ssot.sh` (OBSERVE only; not used for throughput comparison)
- **Build modes**: `docs/analysis/SSOT_BUILD_MODES.md`
- **External comparison (quick)**: `docs/analysis/PHASE92_TCMALLOC_GAP_TRIAGE_SSOT.md` (LD_PRELOAD on the same binary + `hakmem_force_libc` isolation)

## Phase 87-88 (closed: NO-GO)

**Status**: ✅ **OBSERVE verified** + ❌ **Phase 88 NO-GO**

### Phase 87: Inline Slots Verification

**Initial Finding (wrong)**: the standard binary showed PUSH TOTAL / POP TOTAL = 0.
- **Root cause**: ENV drift (`HAKMEM_BENCH_MIN_SIZE/MAX_SIZE` leaked in from the environment)
- Fix: `scripts/run_mixed_10_cleanenv.sh` now pins the size range (MIN=16, MAX=1040)
- `HAKMEM_BENCH_C5_ONLY=0`, `HAKMEM_BENCH_C6_ONLY=0`, `HAKMEM_BENCH_C7_ONLY=0` are forced

**Corrected Finding (OBSERVE binary)** — 20M ops Mixed SSOT, WS=400:

```
PUSH TOTAL: C4=687,564  C5=1,373,605  C6=2,750,862  TOTAL=4,812,031 ✓
POP TOTAL:  C4=687,564  C5=1,373,605  C6=2,750,862  TOTAL=4,812,031 ✓
PUSH FULL:  0 (0.00%)
POP EMPTY:  168 (0.003%)
JUDGMENT: ✓ [C] LEGACY used AND C4/C5/C6 INLINE SLOTS ACTIVE → Ready for Phase 88/89
```

### Phase 88: Batch Drain Optimization

**Overflow Analysis**:
- POP EMPTY rate: 168 / 4,812,031 = **0.003%** ← negligible
- PUSH FULL rate: 0 / 4,812,031 = **0%** ← never happens
- **Decision**: batching cannot move throughput, because overflow essentially never occurs

**Phase 88 Decision**: **NO-GO (frozen)**
- Rationale: at a 0.003% overflow rate, the layout-tax risk outweighs the expected gain
- Infrastructure: the observation telemetry stays in place (re-verification is possible when WS/capacity changes)

**Artifacts Created**:
- Telemetry box: `core/box/tiny_inline_slots_overflow_stats_box.h/c` (counter sketch below)
- Phase 87 results: `docs/analysis/PHASE87_OBSERVATION_RESULTS.md`
- SSOT hardening: `scripts/run_mixed_10_cleanenv.sh`, `scripts/run_mixed_observe_ssot.sh`
- ENV-drift prevention doc: `docs/analysis/BENCH_REPRODUCIBILITY_SSOT.md`

**Key Learning**:
- Confirming "is this path actually taken?" requires the **OBSERVE binary + total counters**
- Keep observation and performance measurement separate (avoid telemetry overhead)
- ENV drift (MIN/MAX size, CLASS_ONLY) is the main way the exercised path silently changes

**Follow-up Fix (SSOT hardening)**:
- `scripts/run_mixed_10_cleanenv.sh` now forces `HAKMEM_BENCH_MIN_SIZE=16` / `HAKMEM_BENCH_MAX_SIZE=1040` and disables `HAKMEM_BENCH_C{5,6,7}_ONLY` to prevent path drift.
- New pre-flight helper: `scripts/run_mixed_observe_ssot.sh` (Route Banner + OBSERVE, single run).
- Overflow stats compile gating fixed (see above).
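For reference, the judgment above hinges on exactly this kind of total counter. Below is a minimal sketch of the counter shape; the names (`tiny_overflow_stats_t`, `stats_on_push`, ...) are illustrative assumptions, not the actual API of `core/box/tiny_inline_slots_overflow_stats_box.h/c`. Counters like this exist only in the OBSERVE build, which is why the standard binary remains the performance SSOT.

```c
/* Illustrative sketch of overflow telemetry counters (not the real box API).
 * Compiled only into the OBSERVE binary so the standard benchmark stays
 * telemetry-free. */
#include <stdint.h>
#include <stdio.h>

enum { TINY_CLASS_COUNT = 8 };               /* C0..C7 */

typedef struct {
    uint64_t push_total[TINY_CLASS_COUNT];
    uint64_t push_full[TINY_CLASS_COUNT];    /* inline slots were full on push  */
    uint64_t pop_total[TINY_CLASS_COUNT];
    uint64_t pop_empty[TINY_CLASS_COUNT];    /* inline slots were empty on pop  */
} tiny_overflow_stats_t;

static tiny_overflow_stats_t g_stats;

static inline void stats_on_push(int cls, int was_full) {
    g_stats.push_total[cls]++;
    g_stats.push_full[cls] += (uint64_t)(was_full != 0);
}

static inline void stats_on_pop(int cls, int was_empty) {
    g_stats.pop_total[cls]++;
    g_stats.pop_empty[cls] += (uint64_t)(was_empty != 0);
}

static void stats_dump(void) {
    uint64_t total = 0, empty = 0;
    for (int c = 4; c <= 6; c++) {           /* C4/C5/C6 inline slot classes */
        total += g_stats.pop_total[c];
        empty += g_stats.pop_empty[c];
    }
    /* e.g. Phase 87: 168 / 4,812,031 = 0.003% POP EMPTY */
    fprintf(stderr, "POP EMPTY: %llu / %llu (%.3f%%)\n",
            (unsigned long long)empty, (unsigned long long)total,
            total ? 100.0 * (double)empty / (double)total : 0.0);
}
```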
---

## Phase 89 (complete: Bottleneck Analysis & Optimization Roadmap)

**Status**: ✅ **SSOT Measurement Complete** + **3 Optimization Candidates Identified**

### 4-Step SSOT Procedure Completion

**Step 1: OBSERVE Binary Preflight**
- Binary: `bench_random_mixed_hakmem_observe` (telemetry enabled)
- Inline slots verification: ✓ PUSH TOTAL = 4.81M, POP EMPTY = 0.003% (confirmed active & healthy)
- Throughput (with telemetry): 51.52M ops/s

**Step 2: Standard 10-run Baseline**
- Binary: `bench_random_mixed_hakmem` (clean, no telemetry)
- 10-run SSOT result: **51.36M ops/s** (CV: 0.7%, very stable)
- Range: 50.74M – 51.73M
- **Decision**: this is the baseline for bottleneck analysis

**Step 3: FAST PGO 10-run Comparison**
- Binary: `bench_random_mixed_hakmem_minimal_pgo` (PGO-optimized)
- 10-run SSOT result: **54.16M ops/s** (CV: 1.5%, acceptable)
- Range: 52.89M – 55.13M
- **Performance Gap**: 54.16M − 51.36M = **2.80M ops/s (+5.45%)**
- This represents the optimization ceiling with the current PGO profile

**Step 4: Results Captured**
- Git SHA: e4c5f0535 (master branch)
- Timestamp: 2025-12-18 23:06:01
- System: AMD Ryzen 5825U, 16 cores, 6.8.0-87-generic kernel
- Files: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`

### Perf Analysis & Top Bottleneck Identification

**Profile Run**: 40M operations (0.78s), 833 perf samples

**Top Functions by CPU Time**:
1. **free** — 27.40% (hottest single function)
2. main — 26.30% (benchmark loop, not optimizable)
3. **malloc** — 20.36% (hot)
4. malloc.cold — 10.65% (cold path, avoid optimizing)
5. free.cold — 5.59% (cold path, avoid optimizing)
6. **tiny_region_id_write_header** — 2.98% (hot, inlining candidate)

**malloc + free combined = 47.76% of CPU time** (already optimized in Phase 9/10/78-1/80-1)

### Top 3 Optimization Candidates (Ranked by Priority)

| Candidate | Priority | Recommendation | Expected Gain | Risk | Effort |
|-----------|----------|----------------|---------------|------|--------|
| **tiny_region_id_write_header always_inline** | **HIGH** | **PURSUE** | +1-2% | LOW | 1-2h |
| malloc/free branch reduction | MEDIUM | DEFER | +2-3% | MEDIUM | 20-40h |
| Cold-path optimization | LOW | **AVOID** | +1% | HIGH | 10-20h |

**Candidate 1: tiny_region_id_write_header always_inline (2.98% CPU)**
- Current: selective inlining from `core/region_id_v6.c`
- Proposal: force `always_inline` for hot-path call sites (sketch below)
- **Layout Impact**: MINIMAL (no code bulk, maintains I-cache discipline)
- **Recommendation**: YES — PURSUE
- Estimated timeline: Phase 90
- Implementation: 1-2 lines, add an `__attribute__((always_inline))` wrapper
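A minimal sketch of what Candidate 1 amounts to. The function name, header layout, and placement below are illustrative assumptions only, not the actual `core/region_id_v6.c` code; the point is simply to move the tiny hot-path body into a forced-inline helper while leaving cold call sites untouched.

```c
/* Illustrative only: the real header layout and symbol names differ. */
#include <stdint.h>

/* Hypothetical layout: region id stored in the word just before the block. */
static inline __attribute__((always_inline))
void tiny_region_id_write_header_inline(void *block, uint32_t region_id) {
    uint32_t *hdr = (uint32_t *)block - 1;   /* illustrative placement only */
    *hdr = region_id;
}

/* Cold / non-hot call sites keep calling the existing out-of-line function,
 * so code size and layout elsewhere stay unchanged (no layout tax). */
```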
**Candidate 2: malloc/free branch reduction (47.76% CPU)**
- Current: already optimized in Phase 9/10/78-1/80-1/83-1
- Observation: 56.4M branch-misses (branch-prediction pressure)
- Proposal: pre-compute routing tables (Phase 85-style approach)
- **Risk**: code bloat, potential layout-tax regression (Phase 85 was NO-GO)
- **Recommendation**: DEFER — wait for workload characteristics that justify the complexity
- Current gains have reached a saturation point

**Candidate 3: Cold-path de-duplication (16.24% CPU)**
- Current: malloc.cold (10.65%) + free.cold (5.59%) are explicitly separated
- Rationale: the separation improves hot-path I-cache utilization
- **Recommendation**: AVOID — aligns with the "avoid layout tax" principle
- Optimizing cold paths would ADD code to the hot path (violates the design)

### Key Performance Insights

**FAST PGO vs Standard (+5.45%) breakdown**:
- PGO branch-prediction optimization: ~3%
- Code-layout optimization: ~2%
- Inlining decisions: ~0.5%

**Conclusion**: the Standard build is limited by branch-prediction pressure; further gains require architectural tradeoffs.

**Inline Slots Health**: working as intended — the 0.003% overflow rate confirms this is not a bottleneck.

### References & Artifacts

- SSOT Measurement: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`
- Bottleneck Analysis: `docs/analysis/PHASE89_BOTTLENECK_ANALYSIS.md`
- Perf Stats: `docs/analysis/PHASE89_PERF_STAT.txt`
- Scripts: `scripts/run_mixed_10_cleanenv.sh`, `scripts/run_mixed_observe_ssot.sh`

---

## Phase 91 (closed: NEUTRAL / frozen)

**Status**: ⚪ **NEUTRAL** (C6 IFL: +0.38% / 10-run) → kept with default OFF

- Goal: replace the C6 inline-slots FIFO with an intrusive LIFO to cut fixed tax (see the sketch below for the intended shape of the change)
- Results (SSOT 10-run):
  - Control (`HAKMEM_TINY_C6_INLINE_SLOTS_IFL=0`): mean 52.05M
  - Treatment (`HAKMEM_TINY_C6_INLINE_SLOTS_IFL=1`): mean 52.25M
  - Δ **+0.38%** (below the +1.0% GO threshold)
- Verdict: **frozen (research box)** — no regression, but the ROI is too small to extend to C5/C4
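To make the Phase 91 change concrete, here is a minimal sketch of the intrusive-LIFO shape it evaluated against the FIFO ring; all types and names are illustrative, not the real `HAKMEM_TINY_C6_INLINE_SLOTS_IFL` boxes. The intrusive LIFO stores the "next" link inside the freed block itself, so push/pop reduce to a head-pointer swap with no ring index bookkeeping.

```c
/* Illustrative comparison only; real box internals differ. */
#include <stddef.h>

/* FIFO ring (Phase 75-style inline slots): index arithmetic + capacity check.
 * Shown only for contrast with the intrusive LIFO below. */
typedef struct { void *slot[128]; unsigned head, tail, count; } c6_ring_t;

/* Intrusive LIFO: the freed block's first word doubles as the next pointer. */
typedef struct { void *top; unsigned count; } c6_lifo_t;

static inline int lifo_push(c6_lifo_t *l, void *block, unsigned cap) {
    if (l->count >= cap) return 0;           /* full → fall back to unified cache */
    *(void **)block = l->top;                /* link lives inside the block      */
    l->top = block;
    l->count++;
    return 1;
}

static inline void *lifo_pop(c6_lifo_t *l) {
    void *block = l->top;
    if (!block) return NULL;                 /* empty → fall back */
    l->top = *(void **)block;
    l->count--;
    return block;
}
```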
---

## Phase 92 (planned)

**Status**: 🔍 **next phase, in planning**

**Goal**: quickly classify the cause of the tcmalloc performance gap (hakmem: 52M vs tcmalloc: 58M, -12.8%)

**Planned work**:
1. Case A: small vs large object split test (C6-only vs C7-only)
2. Case B: Inline Slots vs Unified Cache split test
3. Case C: LIFO vs FIFO comparison
4. Case D: pool-size sensitivity test

**Duration**: 1-2h (quick triage)
**Output**: identify the primary bottleneck → pick the next candidate

**References**:
- Triage Plan: `docs/analysis/PHASE92_TCMALLOC_GAP_TRIAGE_SSOT.md`

---

## Phase 86 (closed: NO-GO)

**Status**: ❌ NO-GO (+0.25% improvement, threshold: +1.0%)

**A/B Test (10-run SSOT)**:
- Control: 51,750,467 ops/s (CV: 2.26%)
- Treatment: 51,881,055 ops/s (CV: 2.32%)
- Delta: +0.25% (mean), -0.15% (median)

**Summary**: free-path legacy-mask (mask-only) optimization for LEGACY classes.
- Design: bitset mask + direct call (avoids Phase 85's indirect-call problems)
- Implementation: correct (0x7f mask computed, C0-C6 optimized)
- Root cause: the competing Phase 9/10 optimizations (+1.89%) already capture most of the benefit
- Conclusion: the free-path optimization layer has reached its practical ceiling

---

## 0) Current source of truth (SSOT)

- **Current SSOT (Phase 89 capture / Git SHA: e4c5f0535)**:
  - Standard (`./bench_random_mixed_hakmem`) 10-run mean: **51.36M ops/s** (CV ~0.7%)
  - FAST PGO minimal (`./bench_random_mixed_hakmem_minimal_pgo`) 10-run mean: **54.16M ops/s** (CV ~1.5%, +5.45% vs Standard)
  - OBSERVE (`./bench_random_mixed_hakmem_observe`): 51.52M ops/s (includes telemetry; not a performance-comparison reference)
  - SSOT capture: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`
- **Source of truth for optimization decisions**: same-binary A/B (ENV toggles) = `scripts/run_mixed_10_cleanenv.sh`
- **Source of truth for mimalloc/tcmalloc comparisons**: reference runs (separate binaries / LD_PRELOAD) = `docs/analysis/ALLOCATOR_COMPARISON_SSOT.md`
- **Scorecard (targets / current values)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` (already reflects the Phase 89 SSOT as the current snapshot)
  - Phase 66/68/69 (60M-62M range) are **historical** (do not compare directly against the current HEAD; rebase first if a comparison is needed)
- **Next phase (design review)**: `docs/analysis/PHASE90_STRUCTURAL_REVIEW_AND_GAP_TRIAGE_SSOT.md`
- **Mixed 10-run SSOT (harness)**: `scripts/run_mixed_10_cleanenv.sh`
  - Default `BENCH_BIN=./bench_random_mixed_hakmem` (Standard)
  - For FAST PGO, set `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo` explicitly
  - Defaults: `ITERS=20000000 WS=400`, `HAKMEM_WARM_POOL_SIZE=16`, `HAKMEM_TINY_C4_INLINE_SLOTS=1`, `HAKMEM_TINY_C5_INLINE_SLOTS=1`, `HAKMEM_TINY_C6_INLINE_SLOTS=1`, `HAKMEM_TINY_INLINE_SLOTS_FIXED=1`, `HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1`
  - Pinned OFF by cleanenv (leak prevention): `HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0` (Phase 83-1 NO-GO / research)

## 0a) Anti-churn (minimum SSOT rules)

- **Always set `HAKMEM_PROFILE` explicitly for hakmem** (without it the route changes and the numbers easily fall apart).
  - Recommended: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` (speed-first)
- Use different runners for different purposes:
  - hakmem SSOT (optimization decisions): `scripts/run_mixed_10_cleanenv.sh`
  - allocator reference (quick): `scripts/run_allocator_quick_matrix.sh`
  - allocator reference (minimize layout differences): `scripts/run_allocator_preload_matrix.sh`
- Keep reproduction logs (the bare minimum when chasing a few percent):
  - `scripts/bench_ssot_capture.sh`
  - `HAKMEM_BENCH_ENV_LOG=1` (records CPU governor/EPP/freq)
- External consultation (paste packet): `docs/analysis/FREE_PATH_REVIEW_PACKET_CHATGPT.md` (generated by `scripts/make_chatgpt_pro_packet_free_path.sh`)

## 0b) Allocator comparison (reference)

- Allocator comparisons (system/jemalloc/mimalloc/tcmalloc) are **reference only** (separate binaries / LD_PRELOAD → layout differences included).
- SSOT: `docs/analysis/ALLOCATOR_COMPARISON_SSOT.md`
- **Quick (Random Mixed 10-run)**: `scripts/run_allocator_quick_matrix.sh`
  - **Important**: run hakmem with `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` set explicitly, via `scripts/run_mixed_10_cleanenv.sh` (a missing PROFILE corrupts the numbers).
- **Same-binary (recommended, minimizes layout differences)**: `scripts/run_allocator_preload_matrix.sh`
  - Fixes `bench_random_mixed_system` and swaps allocators via `LD_PRELOAD`.
  - Note: this path differs from hakmem's **linked benchmarks** (`bench_random_mixed_hakmem*`); LD_PRELOAD goes through the drop-in wrapper, so the two are not interchangeable.
- **Scenario CSV (small-scale reference)**: `scripts/bench_allocators_compare.sh`

## 1) Path sanity (routes / observation)

Minimum procedure to avoid "optimizing a path that is never exercised".

- **Route Banner (kills route misidentification)**: `HAKMEM_ROUTE_BANNER=1`
  - Output: route assignments (backend route kind) + cache config (`unified_cache_enabled` / `warm_pool_max_per_class`)
- **Refill observability SSOT**: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
  - At WS=400 (Mixed SSOT) misses are negligible → `unified_cache_refill()` optimization is **frozen (zero ROI)**

## 2) Recent conclusions (essentials only)

- **Phase 69 (WarmPool sweep)**: `HAKMEM_WARM_POOL_SIZE=16` is a **strong GO (+3.26%)**; already promoted to baseline.
  - Design: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
  - Results: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
- **Phase 70 (observability SSOT)**: statistics made visible / prerequisite gates established. Under the WS=400 SSOT, refill stays cold.
  - SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- **Phase 71/73 (why WarmPool=16 wins, confirmed)**: the win comes from a **slight reduction in instructions/branches** (confirmed via perf stat).
  - Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
- **Phase 72 (ENV-knob ROI exhausted)**: no ENV-only win beyond WarmPool=16 → time to attack **structure (code)**.
- **Phase 78-1 (structural)**: pinned the per-op ENV gate for enabling inline slots; same-binary A/B **GO (+2.31%)**.
  - Results: `docs/analysis/PHASE78_1_INLINE_SLOTS_FIXED_MODE_RESULTS.md`
- **Phase 80-1 (structural)**: converted the inline-slots if-chain to switch dispatch; same-binary A/B **GO (+1.65%)**.
  - Results: `docs/analysis/PHASE80_INLINE_SLOTS_SWITCH_DISPATCH_1_RESULTS.md`
- **Phase 83-1 (structural)**: pinned the per-op ENV gate for switch dispatch (Phase 78-1 pattern); same-binary A/B **NO-GO (+0.32%, branch reduction negligible)**.
  - Results: `docs/analysis/PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md`
  - Cause: the lazy-init pattern was already well optimized (minimal per-op overhead) → fixed mode has near-zero ROI

## 2a) Next major direction (design order, SSOT)

Goal: even with mimalloc/tcmalloc being very strong, aim for **+5-10%** without breaking Box Theory (single boundary, revertible, minimal visibility, fail-fast).

Priority order (taking cues from the core of Google's TCMalloc design):

1. **Batch the ThreadCache overflow (top priority)** — when the inline slots (C4/C5/C6) fill up, drain the overflow in a batch instead of one element at a time; keep the conversion point in one place (flush/drain). See the sketch after this list.
2. **Batch push/pop on the Central/Shared side (second)** — batch the consolidation into shared/remote to reduce lock/atomic counts.
3. **Memory return / footprint policy (operational axis)** — turn the Balanced/Lean wins (syscalls / RSS drift / tail) into SSOT, pushing only as far as speed allows.

Important: we are still deciding the core design. Implement only after measurement confirms that overflow is frequent enough.
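A minimal sketch of the batched-overflow idea from item 1 above, under the assumption of a hypothetical bulk-push boundary (`unified_cache_push_many` below is a stand-in, not an existing API). Note that Phase 88 later froze this direction because the measured overflow rate was only 0.003%.

```c
/* Illustrative sketch: spill half the ring in one call when it fills, so the
 * boundary crossing is paid once per batch instead of per block. */
#include <stddef.h>

#define C6_SLOTS 128

typedef struct { void *slot[C6_SLOTS]; unsigned count; } c6_inline_ring_t;

/* Placeholder for a real shared-cache bulk push (assumption, not actual API). */
static void unified_cache_push_many(int cls, void **blocks, unsigned n) {
    (void)cls; (void)blocks; (void)n;        /* stub for the sketch */
}

static void c6_overflow_drain_batch(c6_inline_ring_t *r, int cls) {
    unsigned spill = r->count / 2;           /* drain half, keep half warm */
    if (spill == 0) return;
    unified_cache_push_many(cls, &r->slot[r->count - spill], spill);
    r->count -= spill;
}

static int c6_push_or_drain(c6_inline_ring_t *r, int cls, void *block) {
    if (r->count == C6_SLOTS)                /* full → pay the boundary once */
        c6_overflow_drain_batch(r, cls);
    r->slot[r->count++] = block;
    return 1;
}
```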
## 2b) Next task (on hold)

Waiting until the job the user delegated to another agent (Claude Code) is finished.

Checks to run once it completes (the minimum two needed):
- **Measure the inline-slots overflow rate** (FULL/overflow counts and ratios for C4/C5/C6)
- **Quantify the cost of the overflow target** (perf stat / perf report of the functions hit on overflow)

Once both are available, proceed to Phase 86 (overflow batch design).

## 3) Operating rules (Box Theory + layout-tax countermeasures)

- Every change ships as a **box + single boundary + ENV-revertible** (fail-fast, minimal visibility).
- A/B tests use **ENV toggles on the same binary** as a rule (comparing different binaries mixes in layout effects).
- SSOT policy (anti-churn): `docs/analysis/PHASE75_6_SSOT_POLICY_FAST_PGO_VS_STANDARD.md`
- "Delete it and it gets faster" is off the table (link-out / large deletions easily flip sign due to layout tax) → prefer **compile-out**.
  - Diagnostics: `scripts/box/layout_tax_forensics_box.sh` / `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md`
- Research-box inventory SSOT: `docs/analysis/RESEARCH_BOXES_SSOT.md`
- Knob list: `scripts/list_hakmem_knobs.sh`

## 5) Research boxes (freeze policy)

- **Phase 79-1 (C2 local cache)**: `HAKMEM_TINY_C2_LOCAL_CACHE=0/1`
  - Result: +0.57% (NO-GO, below the +1.0% threshold) → **research box, frozen**
  - **Default OFF** in SSOT/cleanenv (`scripts/run_mixed_10_cleanenv.sh` forces `0`)
  - Not physically deleted (avoids layout-tax risk)
- **Phase 82 (hardening)**: C2 local cache fully excluded from the hot path (even with the ENV set, the alloc/free hot path never touches it)
  - Record: `docs/analysis/PHASE82_C2_LOCAL_CACHE_HOTPATH_EXCLUSION.md`
- **Phase 85 (free-path commit-once, LEGACY-only)**: `HAKMEM_FREE_PATH_COMMIT_ONCE=0/1`
  - Result: **NO-GO (-0.86%)** → **research box, frozen (default OFF)**
  - Reason: overlaps with Phase 10 (MONO LEGACY DIRECT) and added indirect-call/placement tax on top
  - Record: `docs/analysis/PHASE85_FREE_PATH_COMMIT_ONCE_RESULTS.md`

## 4) Next instructions (Active)

### Phase 74 (structural): shorten the UnifiedCache hit path ✅ **P1 (LOCALIZE) frozen**

**Premises**:
- Under the WS=400 SSOT, UnifiedCache misses are negligible → refill optimization has zero ROI.
- WarmPool=16 wins via small instruction/branch reductions → shortening the hit path is the sound approach.

**Phase 74-1: LOCALIZE (ENV-gated)** ✅ **done (NEUTRAL +0.50%)**
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1`
- Runtime branch overhead **increases** instructions/branches (+0.7%/+0.4%)
- Verdict: **NEUTRAL (+0.50%)**

**Phase 74-2: LOCALIZE (compile-time gate)** ✅ **done (NEUTRAL -0.87%)**
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Removing the runtime branch **improves** instructions/branches (-0.6%/-2.3%) ✓
- But **cache-misses +86%** (register pressure / spills) → throughput **-0.87%**
- Successful isolation: **LOCALIZE itself wins; the cache-miss increase cancels it out**
- Verdict: **NEUTRAL (-0.87%)** → **P1 (LOCALIZE) frozen**

**Conclusion**:
- P1 (LOCALIZE) frozen with default OFF (low ROI for dependency-chain reduction)
- Next: proceed to **Phase 74-3 (P0: FASTAPI)**

**Phase 74-3: P0 (FASTAPI)** ✅ **done (NEUTRAL +0.32%)**

**Goal**: move the `unified_cache_enabled()` / lazy-init / stats checks **out of the hot loop**

**Approach** (see the sketch after this subsection):
- Add `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
- Precondition: the caller guarantees "valid/enabled/no-stats"
- Fail-fast: fall back to the slow path on any unexpected state (single boundary)
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)

**Results** (10-run Mixed SSOT, WS=400):
- Throughput: **+0.32%** (NEUTRAL, below the +1.0% GO threshold)
- cache-misses: **-16.31%** (positive signal, but not enough throughput gain)

**Verdict**: **NEUTRAL (+0.32%)** → **P0 (FASTAPI) frozen**

**References**:
- Design: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- Instructions: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- Results (P1/P0): `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md`
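A minimal sketch of the fast/slow split behind Phase 74-3. The struct layout and exact signatures of `unified_cache_push_fast()` / `unified_cache_pop_fast()` below are assumptions for illustration, not the shipped API; the point is that the fast API drops the enabled/lazy-init/stats checks and fails fast into the existing slow path.

```c
/* Illustrative sketch of the FASTAPI boundary (not the real API shape). */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    void   **slots;       /* ring storage, NULL until lazily initialized */
    unsigned count;
    unsigned capacity;
} unified_cache_t;

/* Existing slow path: does the enabled/lazy-init/stats checks itself (stub). */
static bool unified_cache_push_slow(unified_cache_t *uc, void *p) {
    (void)uc; (void)p;
    return false;
}

/* Fast path: no enabled/stats checks; the single fallback is the boundary. */
static inline bool unified_cache_push_fast(unified_cache_t *uc, void *p) {
    if (uc->slots == NULL || uc->count >= uc->capacity)
        return unified_cache_push_slow(uc, p);   /* fail-fast boundary */
    uc->slots[uc->count++] = p;
    return true;
}

static inline void *unified_cache_pop_fast(unified_cache_t *uc) {
    if (uc->slots == NULL || uc->count == 0)
        return NULL;                             /* caller falls back */
    return uc->slots[--uc->count];
}
```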
---

## Phase 75 (structural): Hot-class Inline Slots (P2) ✅ **done (Standard A/B)**

**Goal**: statistical analysis of C4-C7 → decide the targeted-optimization strategy

**Premises** (Phase 74 learnings):
- The ROI of UnifiedCache hit-path optimization is low ← register pressure / cache-miss effects
- Next axis: **exploit per-class characteristics** → TLS-direct inline slots for branch elimination

**Phase 75-0: Per-Class Analysis** ✅ **done**

Per-class Unified-STATS (Mixed SSOT, WS=400, HAKMEM_MEASURE_UNIFIED_CACHE=1):

| Class | Capacity | Occupied | Hits | Pushes | Total Ops | Hit % | % of C4-C7 |
|-------|----------|----------|-----------|-----------|---------------|-------|------------|
| C6 | 128 | 127 | 2,750,854 | 2,750,855 | **5,501,709** | 100% | **57.2%** |
| C5 | 128 | 127 | 1,373,604 | 1,373,605 | **2,747,209** | 100% | **28.5%** |
| C4 | 64 | 63 | 687,563 | 687,564 | **1,375,127** | 100% | **14.3%** |
| C7 | ? | ? | ? | ? | **?** | ? | **?** |

**Key findings**:
1. C6 dominates: 57.2% of operations (2.75M hits)
2. 100% hit rate in every class (refill inactive in the SSOT)
3. Cache occupancy near capacity (98-99%)

**Phase 75-1: C6-only Inline Slots** ✅ **done (GO +2.87%)**

**Approach**: modular Box Theory design with a single decision point at TLS init

**Implementation** (5 new boxes + test script; see the sketch at the end of this subsection):
- ENV gate box: `HAKMEM_TINY_C6_INLINE_SLOTS=0/1` (lazy-init, default OFF)
- TLS extension: 128-slot ring buffer (1KB per thread, zero overhead when OFF)
- Fast-path API: `c6_inline_push()` / `c6_inline_pop()` (always_inline, 1-2 cycles)
- Integration: minimal (2 boundary points: alloc/free for the C6 class only)
- Backward compatible: legacy code intact, fail-fast to unified_cache

**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C6 inline OFF): **44.24 M ops/s**
- Treatment (C6 inline ON): **45.51 M ops/s**
- Delta: **+1.27 M ops/s (+2.87%)**

**Decision**: ✅ **GO** (exceeds the strict +1.0% threshold)

**Mechanism**: branch elimination on unified_cache for C6 (57.2% of C4-C7 ops)

**References**:
- Per-class analysis: `docs/analysis/PHASE75_PERCLASS_ANALYSIS_0_SSOT.md`
- Results: `docs/analysis/PHASE75_C6_INLINE_SLOTS_1_RESULTS.md`
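A minimal sketch of the C6 inline-slots shape described above: one thread-local 128-slot FIFO ring, gated once at TLS init, with fail-fast back to unified_cache when the ring is full or empty. The struct and field names below are illustrative assumptions; the real implementation is split across the boxes listed in the Implementation bullets.

```c
/* Illustrative TLS ring only; real box internals differ. */
#include <stdbool.h>
#include <stddef.h>

#define C6_INLINE_SLOTS 128                   /* power of two */

typedef struct {
    void    *slot[C6_INLINE_SLOTS];           /* 128 x 8B = 1KB per thread */
    unsigned head, tail, count;
    bool     enabled;                          /* decided once at TLS init (ENV gate) */
} c6_inline_tls_t;

static _Thread_local c6_inline_tls_t g_c6_tls;

/* Free path: returns true if the block was absorbed by the inline slots. */
static inline bool c6_inline_push(void *block) {
    c6_inline_tls_t *t = &g_c6_tls;
    if (!t->enabled || t->count == C6_INLINE_SLOTS)
        return false;                          /* fail-fast → unified_cache path */
    t->slot[t->tail] = block;
    t->tail = (t->tail + 1) & (C6_INLINE_SLOTS - 1);
    t->count++;
    return true;
}

/* Alloc path: returns a cached block, or NULL to fall back. */
static inline void *c6_inline_pop(void) {
    c6_inline_tls_t *t = &g_c6_tls;
    if (!t->enabled || t->count == 0)
        return NULL;                           /* fail-fast → unified_cache path */
    void *block = t->slot[t->head];
    t->head = (t->head + 1) & (C6_INLINE_SLOTS - 1);
    t->count--;
    return block;
}
```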
---

**Phase 75-2: C5 Inline Slots** ✅ **done (GO +1.10%)**

**Goal**: C5-only isolated measurement (28.5% of C4-C7) to establish its individual contribution

**Approach**: replicate the C6 pattern with careful isolation
- Add a C5 ring buffer (128 slots, 1KB TLS)
- ENV gate: `HAKMEM_TINY_C5_INLINE_SLOTS=0/1` (default OFF)
- Test strategy: C5-only (baseline C5=OFF+C6=ON, treatment C5=ON+C6=ON)
- Integration: alloc/free boundary points (C5 FIRST, then C6, then unified_cache)

**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C5=OFF, C6=ON): **44.26 M ops/s** (σ=0.37)
- Treatment (C5=ON, C6=ON): **44.74 M ops/s** (σ=0.54)
- Delta: **+0.49 M ops/s (+1.10%)**

**Decision**: ✅ **GO** (C5's individual contribution validated)

**Cumulative Performance**:
- Phase 75-1 (C6): +2.87%
- Phase 75-2 (C5 isolated): +1.10%
- Combined potential: ~+3.97% (if additive)

**References**:
- Implementation details: `docs/analysis/PHASE75_2_C5_INLINE_SLOTS_IMPLEMENTATION.md`

---

**Phase 75-3: C5+C6 Interaction Test (4-Point Matrix A/B)** ✅ **done (STRONG GO +5.41%)**

**Goal**: comprehensive interaction test + final promotion decision

**Approach**: 4-point matrix A/B test (single binary, ENV-only configuration)
- Point A (C5=0, C6=0): baseline
- Point B (C5=1, C6=0): C5 solo
- Point C (C5=0, C6=1): C6 solo
- Point D (C5=1, C6=1): C5+C6 combined

**Results** (10 runs per point, Mixed SSOT, WS=400):
- **Point A (baseline)**: 42.36 M ops/s
- **Point B (C5 solo)**: 43.54 M ops/s (+2.79% vs A)
- **Point C (C6 solo)**: 44.25 M ops/s (+4.46% vs A)
- **Point D (C5+C6)**: 44.65 M ops/s (+5.41% vs A) **[MAIN TARGET]**

**Additivity Analysis**:
- Expected additive (B+C-A): 45.43 M ops/s
- Actual (D): 44.65 M ops/s
- Sub-additivity: **1.72%** (near-perfect additivity, minimal negative interaction)

**Perf Stat Validation (D vs A)**:
- Instructions: -6.1% (function-call elimination confirmed)
- Branches: -6.1% (matches the instruction reduction)
- Cache-misses: -31.5% (improved locality, NOT the +86% seen in Phase 74-2)
- Throughput: +5.41% (net positive)

**Decision**: ✅ **STRONG GO (+5.41%)**
- D vs A: +5.41% >> the 3.0% threshold
- Sub-additivity: 1.72% << the 20% acceptable bound
- Phase 73 thesis validated: instructions/branches DOWN, throughput UP

**Promotion Completed**:
1. `core/bench_profile.h`: added the C5+C6 defaults to `bench_apply_mixed_tinyv3_c7_common()`
2. `scripts/run_mixed_10_cleanenv.sh`: added the C5+C6 ENV defaults
3. C5+C6 inline slots are now **promoted to preset defaults** for MIXED_TINYV3_C7_SAFE

**Phase 75 complete**: the C5+C6 inline slots (129-256B) deliver a proven +5.41% gain **on the Standard binary** (`bench_random_mixed_hakmem`).
- Before updating the FAST PGO baseline (scorecard), re-measure the same A/B (C5/C6 OFF/ON) under identical conditions with `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo`.

### Phase 75-4 (FAST PGO rebase) ✅ done

- Result: **+3.16% (GO)** (4-point matrix, after outlier exclusion)
- Details: `docs/analysis/PHASE75_4_FAST_PGO_REBASE_RESULTS.md`
- Important: compared with the Phase 69 FAST baseline (62.63M), the current FAST PGO baseline appears **substantially lower** (suspected PGO profile staleness / training mismatch / build drift)

### Phase 75-5 (PGO regeneration) ✅ done (NO-GO on the hypothesis; code bloat identified as the root cause)

Goal:
- Regenerate PGO training against the current code (including the C5/C6 inline slots) and recover a Phase 69-class FAST baseline.

Results:
- The effect of PGO profile regeneration is **limited** (only +0.3%)
- The root cause is **not a PGO profile mismatch but code bloat** (+13KB, +3.1%)
- The code bloat induces layout tax: IPC collapse (-7.22%), branch-miss spike (+19.4%) → net -12% regression

**Forensics findings** (`scripts/box/layout_tax_forensics_box.sh`):
- Text size: +13KB (+3.1%)
- IPC: 1.80 → 1.67 (-7.22%)
- Branch-misses: +19.4%
- Cache-misses: +5.7%

**Decision**:
- FAST PGO is sensitive to code bloat → **Track A/B discipline established**
  - Track A: Standard binary for implementation decisions (SSOT for GO/NO-GO)
  - Track B: FAST PGO for mimalloc-ratio tracking (periodic rebase, not single-point decisions)

**References**:
- Detailed results: `docs/analysis/PHASE75_5_PGO_REGENERATION_RESULTS.md`
- Instructions: `docs/analysis/PHASE75_5_PGO_REGENERATION_NEXT_INSTRUCTIONS.md`

---

### Phase 76 (structural, continued): C4-C7 Remaining Classes ✅ **Phase 76-1 done (GO +1.73%)**

**Premises** (Phase 75 complete):
- C5+C6 inline slots: +5.41% proven (Standard), +3.16% (FAST PGO)
- Code-bloat sensitivity identified → Track A/B discipline established
- Remaining C4-C7 coverage: C4 (14.29%), C7 (0%)

**Phase 76-0: C7 Statistics Analysis** ✅ **done (NO-GO for C7 P2)**

**Approach**: OBSERVE run to measure C7 allocation patterns in the Mixed SSOT

**Results**: C7 = **0% of operations** in the Mixed SSOT workload

**Decision**: NO-GO for C7 P2 optimization → proceed to C4

**References**:
- Results: `docs/analysis/PHASE76_0_C7_STATISTICS_ANALYSIS.md`

**Phase 76-1: C4 Inline Slots** ✅ **done (GO +1.73%)**

**Goal**: complete the C4-C6 inline-slots trilogy, targeting the remaining 14.29% of C4-C7 operations

**Implementation** (modular box pattern):
- ENV gate: `HAKMEM_TINY_C4_INLINE_SLOTS=0/1` (default OFF → ON after promotion)
- TLS ring: 64 slots, 512B per thread (lighter than C5/C6's 1KB)
- Fast-path API: `c4_inline_push()` / `c4_inline_pop()` (always_inline)
- Integration: C4 FIRST → C5 → C6 → unified_cache (alloc/free cascade; see the sketch at the end of this subsection)

**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C4=OFF, C5=ON, C6=ON): **52.42 M ops/s**
- Treatment (C4=ON, C5=ON, C6=ON): **53.33 M ops/s**
- Delta: **+0.91 M ops/s (+1.73%)**

**Decision**: ✅ **GO** (exceeds the +1.0% threshold)

**Promotion Completed**:
1. `core/bench_profile.h`: added the C4 default to `bench_apply_mixed_tinyv3_c7_common()`
2. `scripts/run_mixed_10_cleanenv.sh`: added the `HAKMEM_TINY_C4_INLINE_SLOTS=1` default
3. C4 inline slots are now **promoted to preset defaults** alongside C5+C6

**Coverage Summary (C4-C7 complete)**:
- C6: 57.17% (Phase 75-1, +2.87%)
- C5: 28.55% (Phase 75-2, +1.10%)
- **C4: 14.29% (Phase 76-1, +1.73%)**
- C7: 0.00% (Phase 76-0, NO-GO)
- **Combined C4-C6: 100% of C4-C7 operations**

**Estimated Cumulative Gain**: +7-8% (C4+C5+C6 combined, assuming near-perfect additivity as in Phase 75-3)

**References**:
- Results: `docs/analysis/PHASE76_1_C4_INLINE_SLOTS_RESULTS.md`
- C4 box files: `core/box/tiny_c4_inline_slots_*.h`, `core/front/tiny_c4_inline_slots.h`, `core/tiny_c4_inline_slots.c`
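A minimal sketch of the alloc-side cascade from the Integration bullet above (C4 FIRST → C5 → C6 → unified_cache). The helper names are the ones used in this document, but the stubs, signatures, and class-index mapping below are assumptions for illustration, not the actual integration code.

```c
/* Illustrative cascade only; real call sites and signatures differ. */
#include <stddef.h>

/* Per-class inline-slot pops (stubs standing in for the real always_inline APIs). */
static void *c4_inline_pop(void) { return NULL; }
static void *c5_inline_pop(void) { return NULL; }
static void *c6_inline_pop(void) { return NULL; }

/* Existing shared-cache pop (stub standing in for the real boundary). */
static void *unified_cache_pop(int cls) { (void)cls; return NULL; }

/* Hot alloc path for tiny classes: try the thread-local inline slots first,
 * fall through to the unified cache only when the matching ring is empty. */
static void *tiny_alloc_cascade(int cls) {
    void *p = NULL;
    switch (cls) {
    case 4: p = c4_inline_pop(); break;
    case 5: p = c5_inline_pop(); break;
    case 6: p = c6_inline_pop(); break;
    default: break;                      /* other classes skip inline slots */
    }
    if (p == NULL)
        p = unified_cache_pop(cls);      /* single fail-fast boundary */
    return p;
}
```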
---

**Phase 76-2: C4+C5+C6 Comprehensive 4-Point Matrix** ✅ **done (STRONG GO +7.05%, super-additive)**

**Goal**: validate the cumulative C4+C5+C6 interaction and establish the SSOT baseline for the next optimization axis

**Results** (4-point matrix, 10 runs each):
- Point A (all OFF): 49.48 M ops/s (baseline)
- Point B (C4 only): 49.44 M ops/s (-0.08%, context-dependent regression)
- Point C (C5+C6 only): 52.27 M ops/s (+5.63% vs A)
- Point D (all ON): **52.97 M ops/s (+7.05% vs A)** ✅ **STRONG GO**

**Critical Discovery**:
- C4 shows a **-0.08% regression in isolation** (C5/C6 OFF)
- C4 shows a **+1.27% gain in context** (with C5+C6 ON)
- **Super-additivity**: actual D (+7.05%) exceeds the expected additive estimate (+5.56%)
- **Implication**: per-class optimizations are **context-dependent**, not independently additive

**Sub-additivity Analysis**:
- Expected additive (B + C − A): 52.23 M ops/s
- Actual: 52.97 M ops/s
- Sub-additivity: **-1.42% (i.e., super-additive)** ✓

**Decision**: ✅ **STRONG GO**
- D vs A: +7.05% >> the +3.0% threshold
- Super-additive behavior confirms synergistic gains
- C4+C5+C6 locked into the SSOT defaults

**References**:
- Detailed results: `docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md`

---

### 🟩 Done: C4-C7 Inline Slots Optimization Stack

**Per-class Coverage Summary (final)**:
- C6 (57.17%): +2.87% (Phase 75-1)
- C5 (28.55%): +1.10% (Phase 75-2)
- C4 (14.29%): +1.27% in context (Phase 76-1/76-2)
- C7 (0.00%): NO-GO (Phase 76-0)
- **Combined C4-C6: +7.05% (Phase 76-2, super-additive)**

**Status**: ✅ **C4-C7 Optimization Complete** (100% coverage, SSOT locked)

---

### 🟥 Next Active (Phase 77+)

**Options**:

**Option A: FAST PGO Periodic Tracking** (Track B discipline)
- Regenerate the PGO profile with C4+C5+C6=ON if code bloat accumulates
- Monitor mimalloc-ratio progress (secondary metric)
- Not a decision point per se, but periodic maintenance

**Option B: Phase 77 (alternative optimization axis)**
- Explore beyond per-class inline slots
- Candidates:
  - Allocation fast-path optimization (call elimination)
  - Metadata/page lookup (table optimization)
  - C3/C2 class strategies
  - Warm-pool tuning (beyond Phase 69's WarmPool=16)

**Recommendation**: **proceed with Option B** (Phase 77+)
- C4-C7 optimizations are exhausted and locked
- Ready to explore new optimization axes
- The baseline is now +7.05% stronger than at Phase 75-3

**References**:
- Full C4-C7 analysis: `docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md`
- Phase 75-3 reference (C5+C6): `docs/analysis/PHASE75_3_C5_C6_INTERACTION_RESULTS.md`

## 5) Archive

- Detailed logs: `CURRENT_TASK_ARCHIVE_20251210.md`
- Pre-cleanup snapshot: `docs/analysis/CURRENT_TASK_ARCHIVE.md`