hakmem/CURRENT_TASK.md
2025-12-19 03:45:01 +09:00

# CURRENT_TASK (Rolling, SSOT)
## SSOT (current source of truth)
- **Performance SSOT**: `scripts/run_mixed_10_cleanenv.sh` (WS=400, RUNS=10, size range forced to 16..1040, `*_ONLY` forced OFF)
- **Route check**: `scripts/run_mixed_observe_ssot.sh` (OBSERVE only; do not use it for throughput comparison)
- **Build modes**: `docs/analysis/SSOT_BUILD_MODES.md`
- **External comparison (short run)**: `docs/analysis/PHASE92_TCMALLOC_GAP_TRIAGE_SSOT.md` (same binary via LD_PRELOAD + `hakmem_force_libc` isolation)
## Phase 87-88 (closed: NO-GO)
**Status**: ✅ **OBSERVE verified** + ❌ **Phase 88 NO-GO**
### Phase 87: Inline Slots Verification
**Initial Finding (Wrong)**: Standard binary showed PUSH TOTAL/POP TOTAL = 0
- **Root Cause**: ENV drift (`HAKMEM_BENCH_MIN_SIZE/MAX_SIZE` were not pinned)
- Fix: `scripts/run_mixed_10_cleanenv.sh` now pins the size range (MIN=16, MAX=1040)
- Forces `HAKMEM_BENCH_C5_ONLY=0`, `HAKMEM_BENCH_C6_ONLY=0`, `HAKMEM_BENCH_C7_ONLY=0`
**Corrected Finding (OBSERVE binary)** - 20M ops Mixed SSOT WS=400:
```
PUSH TOTAL: C4=687,564 C5=1,373,605 C6=2,750,862 TOTAL=4,812,031 ✓
POP TOTAL: C4=687,564 C5=1,373,605 C6=2,750,862 TOTAL=4,812,031 ✓
PUSH FULL: 0 (0.00%)
POP EMPTY: 168 (0.003%)
JUDGMENT: ✓ [C] LEGACY used AND C4/C5/C6 INLINE SLOTS ACTIVE → Ready for Phase 88/89
```
### Phase 88: Batch Drain Optimization
**Overflow Analysis**:
- POP EMPTY rate: 168 / 4,812,031 = **0.003%** (negligible)
- PUSH FULL rate: 0 / 4,812,031 = **0%** (never occurs)
- **Decision**: Batching would not move throughput (overflow almost never happens)
**Phase 88 Decision**: **NO-GO (frozen)**
- Rationale: at a 0.003% overflow rate, the layout-tax risk exceeds the expected gain
- Infrastructure: keep the observation telemetry (allows re-verification if WS or capacity changes later)
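The overflow rates above follow directly from the OBSERVE counters. A minimal sketch of that arithmetic, using only the counter values printed in the Phase 87 output (the helper name `rate_percent` is illustrative, not part of hakmem):

```python
# Counters copied from the Phase 87 OBSERVE output (20M ops, Mixed SSOT, WS=400).
push_total = {"C4": 687_564, "C5": 1_373_605, "C6": 2_750_862}
pop_empty = 168
push_full = 0


def rate_percent(events: int, total: int) -> float:
    """Return events/total as a percentage; 0.0 when total is zero."""
    return 100.0 * events / total if total else 0.0


total_ops = sum(push_total.values())
pop_empty_rate = rate_percent(pop_empty, total_ops)
push_full_rate = rate_percent(push_full, total_ops)

print(f"TOTAL          = {total_ops:,}")
print(f"POP EMPTY rate = {pop_empty_rate:.3f}%")
print(f"PUSH FULL rate = {push_full_rate:.3f}%")
```

With these inputs the total is 4,812,031 and the POP EMPTY rate rounds to 0.003%, matching the figures used for the NO-GO decision.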
**Artifacts Created**:
- Telemetry box: `core/box/tiny_inline_slots_overflow_stats_box.h/c`
- Phase 87 results: `docs/analysis/PHASE87_OBSERVATION_RESULTS.md`
- SSOT hardening: `scripts/run_mixed_10_cleanenv.sh`, `scripts/run_mixed_observe_ssot.sh`
- ENV drift prevention doc: `docs/analysis/BENCH_REPRODUCIBILITY_SSOT.md`
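**Key Learning**:
- Confirming that a path is actually exercised requires the **OBSERVE binary + total counters**
- Keep observation and performance measurement separate (avoids telemetry overhead)
- ENV drift (MIN/MAX size, CLASS_ONLY) is the main factor that changes the route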
**Follow-up Fix (SSOT hardening)**:
- `scripts/run_mixed_10_cleanenv.sh` now forces `HAKMEM_BENCH_MIN_SIZE=16` / `HAKMEM_BENCH_MAX_SIZE=1040` and disables `HAKMEM_BENCH_C{5,6,7}_ONLY` to prevent path drift.
- New pre-flight helper: `scripts/run_mixed_observe_ssot.sh` (Route Banner + OBSERVE, single run).
- Overflow stats compile gating fixed (see above).
---
## Phase 89 (complete): Bottleneck Analysis & Optimization Roadmap
**Status**: ✅ **SSOT Measurement Complete** + **3 Optimization Candidates Identified**
### 4-Step SSOT Procedure Completion
**Step 1: OBSERVE Binary Preflight**
- Binary: `bench_random_mixed_hakmem_observe` (with telemetry enabled)
- Inline slots verification: ✓ PUSH TOTAL = 4.81M, POP EMPTY = 0.003% (confirmed active & healthy)
- Throughput (with telemetry): 51.52M ops/s
**Step 2: Standard 10-run Baseline**
- Binary: `bench_random_mixed_hakmem` (clean, no telemetry)
- 10-run SSOT results: **51.36M ops/s** (CV: 0.7%, very stable)
- Range: 50.74M - 51.73M
- **Decision**: This is baseline for bottleneck analysis
**Step 3: FAST PGO 10-run Comparison**
- Binary: `bench_random_mixed_hakmem_minimal_pgo` (PGO optimized)
- 10-run SSOT results: **54.16M ops/s** (CV: 1.5%, acceptable)
- Range: 52.89M - 55.13M
- **Performance Gap**: 54.16M - 51.36M = **2.80M ops/s (+5.45%)**
- This represents the optimization ceiling with current PGO profile
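One quick sanity check on the Step 2 / Step 3 comparison is whether the two 10-run ranges overlap at all. The sketch below uses only the means and ranges quoted above; `RunSummary` and the helper functions are illustrative names, not part of the benchmark scripts.

```python
from typing import NamedTuple


class RunSummary(NamedTuple):
    """10-run summary in M ops/s, copied from Steps 2 and 3."""
    mean: float
    low: float
    high: float


standard = RunSummary(mean=51.36, low=50.74, high=51.73)
fast_pgo = RunSummary(mean=54.16, low=52.89, high=55.13)


def ranges_overlap(a: RunSummary, b: RunSummary) -> bool:
    """True when the [low, high] intervals of the two summaries intersect."""
    return a.low <= b.high and b.low <= a.high


def gain_percent(base: RunSummary, other: RunSummary) -> float:
    """Relative difference of the means, in percent of the base mean."""
    return 100.0 * (other.mean - base.mean) / base.mean


print(f"gap  = {fast_pgo.mean - standard.mean:.2f}M ops/s")
print(f"gain = {gain_percent(standard, fast_pgo):+.2f}%")
print(f"ranges overlap: {ranges_overlap(standard, fast_pgo)}")
```

The slowest FAST PGO run (52.89M) is still above the fastest Standard run (51.73M), so the +5.45% gap sits outside the observed run-to-run spread of either binary.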
**Step 4: Results Captured**
- Git SHA: e4c5f0535 (master branch)
- Timestamp: 2025-12-18 23:06:01
- System: AMD Ryzen 7 5825U, 16 logical CPUs, 6.8.0-87-generic kernel
- Files: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`
### Perf Analysis & Top Bottleneck Identification
**Profile Run**: 40M operations (0.78s), 833 perf samples
**Top Functions by CPU Time**:
1. **free** - 27.40% (hottest)
2. main - 26.30% (benchmark loop, not optimizable)
3. **malloc** - 20.36% (hot)
4. malloc.cold - 10.65% (cold path, avoid optimizing)
5. free.cold - 5.59% (cold path, avoid optimizing)
6. **tiny_region_id_write_header** - 2.98% (hot, inlining candidate)
**malloc + free combined = 47.76% of CPU time** (already Phase 9/10/78-1/80-1 optimized)
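The combined figures used in this section can be reproduced by grouping the per-function shares. This is a small sketch over the six functions listed above; the group names are labels chosen here for illustration.

```python
# CPU shares (percent of samples) from the perf report above.
shares = {
    "free": 27.40,
    "main": 26.30,
    "malloc": 20.36,
    "malloc.cold": 10.65,
    "free.cold": 5.59,
    "tiny_region_id_write_header": 2.98,
}

# Illustrative grouping; the names are not taken from any tool output.
groups = {
    "hot allocator paths": ["malloc", "free"],
    "cold paths": ["malloc.cold", "free.cold"],
    "benchmark harness": ["main"],
    "header write": ["tiny_region_id_write_header"],
}

totals = {
    name: round(sum(shares[f] for f in funcs), 2)
    for name, funcs in groups.items()
}
listed_total = round(sum(shares.values()), 2)

for name, pct in totals.items():
    print(f"{name:22s} {pct:6.2f}%")
print(f"{'listed total':22s} {listed_total:6.2f}%")
```

The hot allocator paths sum to 47.76% and the cold paths to 16.24%; the six listed functions cover 93.28% of the samples.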
### Top 3 Optimization Candidates (Ranked by Priority)
| Candidate | Priority | Recommendation | Expected Gain | Risk | Effort |
|-----------|----------|-----------------|----------------|------|--------|
| **tiny_region_id_write_header always_inline** | **HIGH** | **PURSUE** | +1-2% | LOW | 1-2h |
| malloc/free branch reduction | MEDIUM | DEFER | +2-3% | MEDIUM | 20-40h |
| Cold-path optimization | LOW | **AVOID** | +1% | HIGH | 10-20h |
**Candidate 1: tiny_region_id_write_header always_inline (2.98% CPU)**
- Current: Selective inlining from `core/region_id_v6.c`
- Proposal: Force `always_inline` for hot-path call sites
- **Layout Impact**: MINIMAL (no code bulk, maintains I-cache discipline)
- **Recommendation**: YES - PURSUE
- Estimated timeline: Phase 90
- Implementation: 1-2 lines, add `__attribute__((always_inline))` wrapper
**Candidate 2: malloc/free branch reduction (47.76% CPU)**
- Current: Phase 9/10/78-1/80-1/83-1 already optimized
- Observation: 56.4M branch-misses (branch prediction pressure)
- Proposal: Pre-compute routing tables (like Phase 85 approach)
- **Risk**: Code bloat, potential layout tax regression (Phase 85 was NO-GO)
- **Recommendation**: DEFER
- Wait for workload characteristics that justify complexity
- Current gains saturation point reached
**Candidate 3: Cold-path de-duplication (16.24% CPU)**
- Current: malloc.cold (10.65%) + free.cold (5.59%) explicitly separated
- Rationale: Separation improves hot-path I-cache utilization
- **Recommendation**: AVOID
- Aligns with the user's "avoid layout tax" principle
- Optimizing cold paths would ADD code to hot path (violates design)
### Key Performance Insights
**FAST PGO vs Standard (+5.45%) breakdown**:
- PGO branch prediction optimization: ~3%
- Code layout optimization: ~2%
- Inlining decisions: ~0.5%
**Conclusion**: Standard build limited by branch prediction pressure; further gains require architectural tradeoffs.
**Inline Slots Health**: Working perfectly - 0.003% overflow rate confirms no bottleneck
### References & Artifacts
- SSOT Measurement: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`
- Bottleneck Analysis: `docs/analysis/PHASE89_BOTTLENECK_ANALYSIS.md`
- Perf Stats: `docs/analysis/PHASE89_PERF_STAT.txt`
- Scripts: `scripts/run_mixed_10_cleanenv.sh`, `scripts/run_mixed_observe_ssot.sh`
---
## Phase 91 (closed: NEUTRAL / frozen)
**Status**: ⚪ **NEUTRAL** (C6 IFL: +0.38% / 10-run) → kept with default OFF
- Goal: replace the C6 inline slots FIFO with an intrusive LIFO to trim the fixed tax
- Results (SSOT 10-run):
- Control (`HAKMEM_TINY_C6_INLINE_SLOTS_IFL=0`): mean 52.05M
- Treatment (`HAKMEM_TINY_C6_INLINE_SLOTS_IFL=1`): mean 52.25M
- Δ **+0.38%** (below the +1.0% GO threshold)
- Verdict: **frozen (research box)**
- No regression, but the ROI is too small to extend to C5/C4
---
## Phase 92 (scheduled to start)
**Status**: 🔍 **Planning the next phase**
**Goal**: Classify the cause of the tcmalloc performance gap (hakmem: 52M vs tcmalloc: 58M, -12.8%) in a short time
**Planned steps**:
1. Case A: small vs large object isolation test (C6-only vs C7-only)
2. Case B: Inline Slots vs Unified Cache isolation test
3. Case C: LIFO vs FIFO comparison
4. Case D: pool size sensitivity test
**Duration**: 1-2h (short triage)
**Output**: identify the primary bottleneck → select the next candidate
**References**:
- Triage Plan: `docs/analysis/PHASE92_TCMALLOC_GAP_TRIAGE_SSOT.md`
---
## Phase 86 (closed: NO-GO)
**Status**: ❌ NO-GO (+0.25% improvement, threshold: +1.0%)
**A/B Test (10-run SSOT)**:
- Control: 51,750,467 ops/s (CV: 2.26%)
- Treatment: 51,881,055 ops/s (CV: 2.32%)
- Delta: +0.25% (mean), -0.15% (median)
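The A/B judgment can be sketched as a comparison of the mean delta against the +1.0% GO threshold, with the run-to-run CV as a noise reference. This is an illustration under stated assumptions: the function names are hypothetical, and treating exactly +1.0% as passing is a choice made here, not something this document specifies.

```python
GO_THRESHOLD_PCT = 1.0  # GO threshold quoted in this document


def delta_percent(control: float, treatment: float) -> float:
    """Relative change of treatment vs control, in percent."""
    return 100.0 * (treatment - control) / control


def meets_go_threshold(delta_pct: float,
                       threshold_pct: float = GO_THRESHOLD_PCT) -> bool:
    """Assumption: a delta exactly at the threshold counts as passing."""
    return delta_pct >= threshold_pct


# Phase 86 A/B means (ops/s) and CVs (percent) from the list above.
control_mean, control_cv = 51_750_467, 2.26
treatment_mean, treatment_cv = 51_881_055, 2.32

delta = delta_percent(control_mean, treatment_mean)
noise = max(control_cv, treatment_cv)

print(f"delta = {delta:+.2f}% (GO threshold {GO_THRESHOLD_PCT:+.1f}%)")
print(f"meets GO threshold: {meets_go_threshold(delta)}")
print(f"delta below run-to-run CV ({noise:.2f}%): {abs(delta) < noise}")
```

Here the +0.25% delta is both below the threshold and well inside the ~2.3% CV, which is consistent with the mean and median disagreeing in sign.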
**Summary**: Free path legacy mask (mask-only) optimization for LEGACY classes.
- Design: Bitset mask + direct call (avoids Phase 85's indirect call problems)
- Implementation: Correct (0x7f mask computed, C0-C6 optimized)
- Root cause: Competing Phase 9/10 optimizations (+1.89%) already capture most benefit
- Conclusion: Free path optimization layer has reached practical ceiling
---
## 0) Current source of truth (SSOT)
- **Current SSOT (Phase 89 capture / Git SHA: e4c5f0535)**:
- Standard (`./bench_random_mixed_hakmem`) 10-run mean: **51.36M ops/s** (CV ~0.7%)
- FAST PGO minimal (`./bench_random_mixed_hakmem_minimal_pgo`) 10-run mean: **54.16M ops/s** (CV ~1.5% / +5.45% vs Standard)
- OBSERVE (`./bench_random_mixed_hakmem_observe`): 51.52M ops/s (includes telemetry; not the reference for performance comparison)
- SSOT capture: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`
- **Authority for optimization decisions**: same-binary A/B (ENV toggles) → `scripts/run_mixed_10_cleanenv.sh`
- **Authority for mimalloc/tcmalloc reference**: reference runs (separate binary / LD_PRELOAD) → `docs/analysis/ALLOCATOR_COMPARISON_SSOT.md`
- **Scorecard (authority for targets / current values)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` (Phase 89 SSOT already reflected as the current snapshot)
- Phase 66/68/69 (60M-62M range) are **historical** (do not compare them directly against current HEAD; take a rebase measurement if a comparison is needed)
- **Next phase (design review)**: `docs/analysis/PHASE90_STRUCTURAL_REVIEW_AND_GAP_TRIAGE_SSOT.md`
- **Mixed 10-run SSOT (harness)**: `scripts/run_mixed_10_cleanenv.sh`
- Default: `BENCH_BIN=./bench_random_mixed_hakmem` (Standard)
- For FAST PGO, set `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo` explicitly
- Defaults: `ITERS=20000000 WS=400`, `HAKMEM_WARM_POOL_SIZE=16`, `HAKMEM_TINY_C4_INLINE_SLOTS=1`, `HAKMEM_TINY_C5_INLINE_SLOTS=1`, `HAKMEM_TINY_C6_INLINE_SLOTS=1`, `HAKMEM_TINY_INLINE_SLOTS_FIXED=1`, `HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1`
- Pinned OFF by cleanenv (to prevent leaks from the calling shell): `HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0` (Phase 83-1 NO-GO / research)
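To illustrate what "pinning" the environment protects against, here is a hypothetical pre-flight check in Python. It is not how `scripts/run_mixed_10_cleanenv.sh` is implemented; it only collects the variable values documented in this file (the harness defaults above plus the size-range and `*_ONLY` settings from the Phase 87 follow-up) and reports any that a calling shell has set differently. The function names are assumptions made for this sketch.

```python
from typing import Dict, Mapping, Tuple

# Values documented in this file for scripts/run_mixed_10_cleanenv.sh.
SSOT_ENV: Dict[str, str] = {
    "ITERS": "20000000",
    "WS": "400",
    "HAKMEM_BENCH_MIN_SIZE": "16",
    "HAKMEM_BENCH_MAX_SIZE": "1040",
    "HAKMEM_BENCH_C5_ONLY": "0",
    "HAKMEM_BENCH_C6_ONLY": "0",
    "HAKMEM_BENCH_C7_ONLY": "0",
    "HAKMEM_WARM_POOL_SIZE": "16",
    "HAKMEM_TINY_C4_INLINE_SLOTS": "1",
    "HAKMEM_TINY_C5_INLINE_SLOTS": "1",
    "HAKMEM_TINY_C6_INLINE_SLOTS": "1",
    "HAKMEM_TINY_INLINE_SLOTS_FIXED": "1",
    "HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH": "1",
    "HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED": "0",
}


def find_env_drift(env: Mapping[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {name: (expected, actual)} for pinned variables set differently."""
    return {
        name: (want, env[name])
        for name, want in SSOT_ENV.items()
        if name in env and env[name] != want
    }


def pinned_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Copy env and overwrite every pinned variable with its SSOT value."""
    merged = dict(env)
    merged.update(SSOT_ENV)
    return merged


# Synthetic example: a shell that still carries an old size limit.
stale = {"PATH": "/usr/bin", "HAKMEM_BENCH_MAX_SIZE": "512"}
print(find_env_drift(stale))
print(find_env_drift(pinned_env(stale)))
```

The first call reports `HAKMEM_BENCH_MAX_SIZE` as drifted (expected `1040`, found `512`); after pinning, the drift report is empty. Passing `os.environ` instead of the synthetic dict would check the real shell.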
## 0a) Preventing flip-flopping (minimal SSOT rules)
- **Always set `HAKMEM_PROFILE` explicitly for hakmem** (when it is unset the route changes and the numbers easily break down).
- Recommended: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` (Speed-first)
- Use a separate runner for each comparison purpose:
- hakmem SSOT (optimization decisions): `scripts/run_mixed_10_cleanenv.sh`
- allocator reference (short run): `scripts/run_allocator_quick_matrix.sh`
- allocator reference (minimizes layout differences): `scripts/run_allocator_preload_matrix.sh`
- Keep reproduction logs (the minimum when chasing a few percent):
- `scripts/bench_ssot_capture.sh`
- `HAKMEM_BENCH_ENV_LOG=1` (records CPU governor/EPP/freq)
- External consultation (paste-in packet): `docs/analysis/FREE_PATH_REVIEW_PACKET_CHATGPT.md` (generated by `scripts/make_chatgpt_pro_packet_free_path.sh`)
## 0b) Allocator comparison (reference)
- Allocator comparison (system/jemalloc/mimalloc/tcmalloc) is **reference only** (separate binary / LD_PRELOAD → includes layout differences)
- SSOT: `docs/analysis/ALLOCATOR_COMPARISON_SSOT.md`
- **Quick (Random Mixed 10-run)**: `scripts/run_allocator_quick_matrix.sh`
- **Important**: set `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` explicitly for hakmem and run it through `scripts/run_mixed_10_cleanenv.sh` (a missing PROFILE corrupts the numbers)
- **Same-binary (recommended; minimizes layout differences)**: `scripts/run_allocator_preload_matrix.sh`
- Pin `bench_random_mixed_system` and swap the allocator via `LD_PRELOAD`.
- Note: this takes a different path from hakmem's **linked benchmark** (`bench_random_mixed_hakmem*`); LD_PRELOAD is a drop-in wrapper, so the two are not the same thing.
- **Scenario CSV (small-scale reference)**: `scripts/bench_allocators_compare.sh`
## 1) Staying on track (route / observation)
Minimal procedure to avoid "optimizing a path that is never exercised".
- **Route Banner (eliminates route misidentification)**: `HAKMEM_ROUTE_BANNER=1`
- Output: route assignments (backend route kind) + cache config (`unified_cache_enabled` / `warm_pool_max_per_class`)
- **SSOT for refill observation**: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- With WS=400 (Mixed SSOT), misses are negligible → `unified_cache_refill()` optimization is **frozen (zero ROI)**
## 2) Recent conclusions (key points only)
- **Phase 69 (WarmPool sweep)**: `HAKMEM_WARM_POOL_SIZE=16` was a **strong GO (+3.26%)** and has been promoted to baseline.
- Design: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
- Results: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
- **Phase 70 (observation SSOT)**: established stats visibility and prerequisite gates. Refill is cold under the WS=400 SSOT.
- SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- **Phase 71/73 (WarmPool=16 win mechanism confirmed)**: the win comes from a **slight reduction in instructions/branches** (confirmed with perf stat).
- Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
- **Phase 72 (ENV knob ROI exhausted)**: no ENV-only win beyond WarmPool=16 → **time to attack through structure (code)**
- **Phase 78-1 (structural)**: fixed the per-op ENV gate for Inline Slots enable; **GO (+2.31%)** in same-binary A/B
- Results: `docs/analysis/PHASE78_1_INLINE_SLOTS_FIXED_MODE_RESULTS.md`
- **Phase 80-1 (structural)**: converted the Inline Slots if-chain to switch dispatch; **GO (+1.65%)** in same-binary A/B
- Results: `docs/analysis/PHASE80_INLINE_SLOTS_SWITCH_DISPATCH_1_RESULTS.md`
- **Phase 83-1 (structural)**: fixed the per-op ENV gate for switch dispatch (applying the Phase 78-1 pattern); **NO-GO (+0.32%, branch reduction negligible)** in same-binary A/B
- Results: `docs/analysis/PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md`
- Cause: the lazy-init pattern was already optimized (per-op overhead minimal) → the ROI of fixed mode is tiny
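## 2a) Next major direction (design order, SSOT)
Goal: even when "mimalloc/tcmalloc are too strong", aim for **+5-10%** without breaking Box Theory (one boundary, reversible, minimal visibility, fail-fast).
Priority order (taking cues from the core of Google/TCMalloc):
1. **Batch ThreadCache overflow (top priority)**
   - When the inline slots (C4/C5/C6) fill up, cool the overflow "in batches" rather than "one at a time"
   - Fix the conversion point at one place (flush/drain)
2. **Batch push/pop on the Central/Shared side (next)**
   - Batch the merge into shared/remote to reduce the number of lock/atomic operations
3. **Memory return / footprint policy (operational axis)**
   - Turn the Balanced/Lean win conditions (syscall/RSS drift/tail) into SSOT while pushing only as far as speed is not lost

Important: this is currently the stage of deciding the "core of the design". Implement only after measurements confirm that **overflow is frequent enough**.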
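## 2b) Next work (on hold)
Wait until the processing that the user delegated to another agent (Claude Code) is complete.
Checks to start once it completes (the two needed on the shortest path):
- **Measure the inline slots overflow rate** (FULL/overflow counts and ratios for C4/C5/C6)
- **Quantify the cost of the overflow destination** (perf stat / perf report of the functions hit on overflow)

Once both are in hand, proceed to Phase 86 (Overflow batch design).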
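## 3) Operating rules (Box Theory + layout tax countermeasures)
- Always stack changes as **a box + one boundary + revertible via ENV** (fail-fast, minimal visibility).
- A/B is in principle **same binary with ENV toggles** (comparing separate binaries mixes in layout effects).
- SSOT operation (flip-flop prevention): `docs/analysis/PHASE75_6_SSOT_POLICY_FAST_PGO_VS_STANDARD.md`
- "Faster by deleting" is off-limits (link-out / large deletions tend to flip sign through layout tax) → prefer **compile-out**.
- Diagnosis: `scripts/box/layout_tax_forensics_box.sh` / `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md`
- Research box inventory (SSOT): `docs/analysis/RESEARCH_BOXES_SSOT.md`
- Knob list: `scripts/list_hakmem_knobs.sh`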
## 5) Handling research boxes (freeze policy)
- **Phase 79-1 (C2 local cache)**: `HAKMEM_TINY_C2_LOCAL_CACHE=0/1`
- Result: +0.57% (NO-GO, below the +1.0% threshold) → **research box freeze**
- **default OFF** under SSOT/cleanenv (`scripts/run_mixed_10_cleanenv.sh` forces `0`)
- No physical deletion (avoids layout tax risk)
- **Phase 82 (hardening)**: C2 local cache is completely excluded from the hot path (even with the environment variable set, the alloc/free hot paths never touch it)
- Record: `docs/analysis/PHASE82_C2_LOCAL_CACHE_HOTPATH_EXCLUSION.md`
- **Phase 85 (Free path commit-once, LEGACY-only)**: `HAKMEM_FREE_PATH_COMMIT_ONCE=0/1`
- Result: **NO-GO (-0.86%)** → **research box freeze (default OFF)**
- Reason: its effect overlaps with Phase 10 (MONO LEGACY DIRECT), and it additionally increased the indirect-call / placement tax
- Record: `docs/analysis/PHASE85_FREE_PATH_COMMIT_ONCE_RESULTS.md`
## 4) Next instructions (Active)
### Phase 74 (structural): shorten the UnifiedCache hit-path ✅ **P1 (LOCALIZE) frozen**
**Premises**:
- Under the WS=400 SSOT, UnifiedCache misses are negligible → refill optimization has zero ROI.
- The WarmPool=16 win came from a slight instruction/branch reduction → shortening the hit-path is the direct approach.
**Phase 74-1: LOCALIZE (ENV-gated)** ✅ **complete (NEUTRAL +0.50%)**
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1`
- Runtime branch overhead **increased** instructions/branches (+0.7%/+0.4%)
- Verdict: **NEUTRAL (+0.50%)**
**Phase 74-2: LOCALIZE (compile-time gate)** ✅ **complete (NEUTRAL -0.87%)**
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Runtime branch removed → instructions/branches **improved** (-0.6%/-2.3%) ✓
- However, **cache-misses +86%** (register pressure / spill) → throughput **-0.87%**
- Isolation succeeded: **LOCALIZE itself wins, but the cache-miss increase cancels it out**
- Verdict: **NEUTRAL (-0.87%)** → **P1 (LOCALIZE) frozen**
**Conclusion**:
- P1 (LOCALIZE) is frozen with default OFF (low ROI from dependency-chain reduction)
- Next: proceed to **Phase 74-3 (P0: FASTAPI)**
**Phase 74-3: P0 (FASTAPI)** ✅ **complete (NEUTRAL +0.32%)**
**Goal**: move the `unified_cache_enabled()` / `lazy-init` / `stats` checks **out of the hot loop**
**Approach**:
- Add the `unified_cache_push_fast()` / `unified_cache_pop_fast()` API
- Precondition: the caller guarantees "valid/enabled/no-stats"
- Fail-fast: fall back to the slow path on an unexpected state (one boundary)
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
**Results** (10-run Mixed SSOT, WS=400):
- Throughput: **+0.32%** (NEUTRAL, below +1.0% GO threshold)
- cache-misses: **-16.31%** (positive signal, insufficient throughput gain)
**Verdict**: **NEUTRAL (+0.32%)** → **P0 (FASTAPI) frozen**
**References**:
- Design: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- Instructions: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- Results (P1/P0): `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md`
---
## Phase 75 (structural): Hot-class Inline Slots (P2) ✅ **complete (Standard A/B)**
**Goal**: statistical analysis of C4-C7 → decide the targeted optimization strategy
**Premises** (Phase 74 learnings):
- UnifiedCache hit-path optimization has low ROI ← register pressure / cache-miss effects
- Next axis: **exploit per-class characteristics** → branch elimination with TLS-direct inline slots
**Phase 75-0: Per-Class Analysis** ✅ **complete**
Per-class Unified-STATS (Mixed SSOT, WS=400, HAKMEM_MEASURE_UNIFIED_CACHE=1):
| Class | Capacity | Occupied | Hits | Pushes | Total Ops | Hit % | % of C4-C7 |
|-------|----------|----------|------|--------|-----------|-------|-----------|
| C6 | 128 | 127 | 2,750,854 | 2,750,855 | **5,501,709** | 100% | **57.2%** |
| C5 | 128 | 127 | 1,373,604 | 1,373,605 | **2,747,209** | 100% | **28.5%** |
| C4 | 64 | 63 | 687,563 | 687,564 | **1,375,127** | 100% | **14.3%** |
| C7 | ? | ? | ? | ? | **?** | ? | **?** |
**Key findings**:
1. C6 dominates overwhelmingly: 57.2% of operations (2.75M hits)
2. 全クラス 100% hit rate (refill inactive in SSOT)
3. Cache occupancy near-capacity (98-99%)
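The share and occupancy columns can be recomputed from the raw hit/push counts in the table. A minimal sketch using only the C4-C6 rows (C7 is left out because its row has no values):

```python
# (hits, pushes) and (occupied, capacity) per class, from the table above.
counts = {
    "C6": (2_750_854, 2_750_855),
    "C5": (1_373_604, 1_373_605),
    "C4": (687_563, 687_564),
}
occupancy = {"C6": (127, 128), "C5": (127, 128), "C4": (63, 64)}

total_ops = {cls: hits + pushes for cls, (hits, pushes) in counts.items()}
grand_total = sum(total_ops.values())

share_pct = {cls: 100.0 * ops / grand_total for cls, ops in total_ops.items()}
occupied_pct = {cls: 100.0 * used / cap for cls, (used, cap) in occupancy.items()}

for cls in counts:
    print(f"{cls}: ops={total_ops[cls]:>9,}  share={share_pct[cls]:5.2f}%  "
          f"occupied={occupied_pct[cls]:5.2f}%")
```

This reproduces the 57.2% / 28.5% / 14.3% split and shows occupancy between roughly 98.4% (C4) and 99.2% (C5, C6).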
**Phase 75-1: C6-only Inline Slots** ✅ **complete (GO +2.87%)**
**Approach**: Modular box theory design with single decision point at TLS init
**Implementation** (5 new boxes + test script):
- ENV gate box: `HAKMEM_TINY_C6_INLINE_SLOTS=0/1` (lazy-init, default OFF)
- TLS extension: 128-slot ring buffer (1KB per thread, zero overhead when OFF)
- Fast-path API: `c6_inline_push()` / `c6_inline_pop()` (always_inline, 1-2 cycles)
- Integration: Minimal (2 boundary points: alloc/free for C6 class only)
- Backward compatible: Legacy code intact, fail-fast to unified_cache
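As a behavioural illustration only, the following Python model shows a fixed-capacity FIFO ring with the four counters reported by the OBSERVE binary (PUSH TOTAL, POP TOTAL, PUSH FULL, POP EMPTY). It is not the C implementation: the class name, the counter semantics (totals count successful operations) and the fallback convention (`False` / `None` meaning "take the unified_cache path") are assumptions made for this sketch.

```python
from typing import List, Optional


class InlineSlotRing:
    """Toy model of a per-thread inline-slot ring (FIFO, fixed capacity)."""

    def __init__(self, capacity: int = 128) -> None:
        self._slots: List[Optional[int]] = [None] * capacity
        self._capacity = capacity
        self._head = 0    # index of the next slot to pop
        self._count = 0   # number of occupied slots
        self.push_total = 0
        self.pop_total = 0
        self.push_full = 0
        self.pop_empty = 0

    def push(self, block: int) -> bool:
        """Store a freed block; False means the caller must fall back."""
        if self._count == self._capacity:
            self.push_full += 1
            return False
        tail = (self._head + self._count) % self._capacity
        self._slots[tail] = block
        self._count += 1
        self.push_total += 1
        return True

    def pop(self) -> Optional[int]:
        """Return the oldest cached block, or None when the ring is empty."""
        if self._count == 0:
            self.pop_empty += 1
            return None
        block = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % self._capacity
        self._count -= 1
        self.pop_total += 1
        return block


ring = InlineSlotRing(capacity=4)
first = ring.pop()                                       # empty ring: miss
accepted = [ring.push(addr) for addr in range(100, 105)]  # fifth push overflows
drained = [ring.pop() for _ in range(4)]
print(first, accepted, drained)
print(ring.push_total, ring.pop_total, ring.push_full, ring.pop_empty)
```

With a capacity of 4, the fifth push is rejected and counted as PUSH FULL, while the initial pop on the empty ring is counted as POP EMPTY; these are the two events whose rates decide whether overflow handling is worth optimizing.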
**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C6 inline OFF): **44.24 M ops/s**
- Treatment (C6 inline ON): **45.51 M ops/s**
- Delta: **+1.27 M ops/s (+2.87%)**
**Decision**: ✅ **GO** (exceeds +1.0% strict threshold)
**Mechanism**: Branch elimination on unified_cache for C6 (57.2% of C4-C7 ops)
**References**:
- Per-class analysis: `docs/analysis/PHASE75_PERCLASS_ANALYSIS_0_SSOT.md`
- Results: `docs/analysis/PHASE75_C6_INLINE_SLOTS_1_RESULTS.md`
---
**Phase 75-2: C5 Inline Slots** ✅ **complete (GO +1.10%)**
**Goal**: C5-only isolated measurement (28.5% of C4-C7) for individual contribution
**Approach**: Replicate C6 pattern with careful isolation
- Add C5 ring buffer (128 slots, 1KB TLS)
- ENV gate: `HAKMEM_TINY_C5_INLINE_SLOTS=0/1` (default OFF)
- Test strategy: C5-only (baseline C5=OFF+C6=ON, treatment C5=ON+C6=ON)
- Integration: alloc/free boundary points (C5 FIRST, then C6, then unified_cache)
**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C5=OFF, C6=ON): **44.26 M ops/s** (σ=0.37)
- Treatment (C5=ON, C6=ON): **44.74 M ops/s** (σ=0.54)
- Delta: **+0.49 M ops/s (+1.10%)**
**Decision**: ✅ **GO** (C5 individual contribution validated)
**Cumulative Performance**:
- Phase 75-1 (C6): +2.87%
- Phase 75-2 (C5 isolated): +1.10%
- Combined potential: ~+3.97% (if additive)
**References**:
- Implementation details: `docs/analysis/PHASE75_2_C5_INLINE_SLOTS_IMPLEMENTATION.md`
---
**Phase 75-3: C5+C6 Interaction Test (4-Point Matrix A/B)** ✅ **complete (STRONG GO +5.41%)**
**Goal**: Comprehensive interaction test + final promotion decision
**Approach**: 4-point matrix A/B test (single binary, ENV-only configuration)
- Point A (C5=0, C6=0): Baseline
- Point B (C5=1, C6=0): C5 solo
- Point C (C5=0, C6=1): C6 solo
- Point D (C5=1, C6=1): C5+C6 combined
**Results** (10-run per point, Mixed SSOT, WS=400):
- **Point A (baseline)**: 42.36 M ops/s
- **Point B (C5 solo)**: 43.54 M ops/s (+2.79% vs A)
- **Point C (C6 solo)**: 44.25 M ops/s (+4.46% vs A)
- **Point D (C5+C6)**: 44.65 M ops/s (+5.41% vs A) **[MAIN TARGET]**
**Additivity Analysis**:
- Expected additive (B+C-A): 45.43 M ops/s
- Actual (D): 44.65 M ops/s
- Sub-additivity: **1.72%** (near-perfect additivity, minimal negative interaction)
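The additivity figures follow from the four matrix points. A short sketch of the calculation, assuming sub-additivity is defined as the shortfall of D relative to the additive expectation B + C - A (which is what the numbers above imply); the function name is illustrative:

```python
from typing import Tuple


def additivity(a: float, b: float, c: float, d: float) -> Tuple[float, float]:
    """Return (expected additive throughput, sub-additivity in percent).

    A positive percentage means D fell short of B + C - A (sub-additive);
    a negative one means D exceeded it (super-additive).
    """
    expected = b + c - a
    return expected, 100.0 * (expected - d) / expected


# Phase 75-3 matrix points in M ops/s.
A, B, C, D = 42.36, 43.54, 44.25, 44.65

expected, sub_pct = additivity(A, B, C, D)
for label, value in (("B", B), ("C", C), ("D", D)):
    print(f"{label} vs A: {100.0 * (value - A) / A:+.2f}%")
print(f"expected additive = {expected:.2f} M ops/s")
print(f"sub-additivity    = {sub_pct:.2f}%")
```

For these points the expected additive throughput is 45.43 M ops/s and the shortfall is 1.72%, well under the 20% limit used in the decision.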
**Perf Stat Validation (D vs A)**:
- Instructions: -6.1% (function call elimination confirmed)
- Branches: -6.1% (matches instruction reduction)
- Cache-misses: -31.5% (improved locality, NOT +86% like Phase 74-2)
- Throughput: +5.41% (net positive)
**Decision**: ✅ **STRONG GO (+5.41%)**
- D vs A: +5.41% >> 3.0% threshold
- Sub-additivity: 1.72% << 20% acceptable
- Phase 73 thesis validated: instructions/branches DOWN, throughput UP
**Promotion Completed**:
1. `core/bench_profile.h`: Added C5+C6 defaults to `bench_apply_mixed_tinyv3_c7_common()`
2. `scripts/run_mixed_10_cleanenv.sh`: Added C5+C6 ENV defaults
3. C5+C6 inline slots now **promoted to preset defaults** for MIXED_TINYV3_C7_SAFE
**Phase 75 Complete**: C5+C6 inline slots (129-256B) deliver +5.41% proven gain **on the Standard binary** (`bench_random_mixed_hakmem`).
- Before updating the FAST PGO baseline (scorecard), re-measure **the same-condition A/B (C5/C6 OFF/ON)** with `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo`
### Phase 75-4 (FAST PGO rebase) ✅ complete
- Result: **+3.16% (GO)** (4-point matrix, after excluding outliers)
- Details: `docs/analysis/PHASE75_4_FAST_PGO_REBASE_RESULTS.md`
- Important: the **current FAST PGO baseline is suspected to be much lower** than the Phase 69 FAST baseline (62.63M) (PGO profile staleness / training mismatch / build drift)
### Phase 75-5 (PGO regeneration) ✅ complete (NO-GO on hypothesis, code bloat root cause identified)
Goal:
- Regenerate PGO training against the current code, which includes the C5/C6 inline slots, and recover a Phase 69-class FAST baseline

Results:
- PGO profile regeneration had only a **limited** effect (+0.3%)
- Root cause: **code bloat, not a PGO profile mismatch** (+13KB, +3.1%)
- Code bloat triggers layout tax → IPC collapse (-7.22%), branch-miss spike (+19.4%) → net -12% regression
**Forensics findings** (`scripts/box/layout_tax_forensics_box.sh`):
- Text size: +13KB (+3.1%)
- IPC: 1.80 → 1.67 (-7.22%)
- Branch-misses: +19.4%
- Cache-misses: +5.7%
**Decision**:
- FAST PGO is sensitive to code bloat → **Track A/B discipline established**
- Track A: Standard binary → implementation decisions (SSOT for GO/NO-GO)
- Track B: FAST PGO → mimalloc ratio tracking (periodic rebase, not single-point decisions)
**References**:
- Detailed results: `docs/analysis/PHASE75_5_PGO_REGENERATION_RESULTS.md`
- Instructions: `docs/analysis/PHASE75_5_PGO_REGENERATION_NEXT_INSTRUCTIONS.md`
---
### Phase 76 (structural, continued): C4-C7 Remaining Classes ✅ **Phase 76-1 complete (GO +1.73%)**
**Premises** (Phase 75 complete):
- C5+C6 inline slots: +5.41% proven (Standard), +3.16% (FAST PGO)
- Code bloat sensitivity identified → Track A/B discipline established
- Remaining C4-C7 coverage: C4 (14.29%), C7 (0%)
**Phase 76-0: C7 Statistics Analysis** ✅ **complete (NO-GO for C7 P2)**
**Approach**: OBSERVE run to measure C7 allocation patterns in Mixed SSOT
**Results**: C7 = **0% operations** in Mixed SSOT workload
**Decision**: NO-GO for C7 P2 optimization → proceed to C4
**References**:
- Results: `docs/analysis/PHASE76_0_C7_STATISTICS_ANALYSIS.md`
**Phase 76-1: C4 Inline Slots** ✅ **complete (GO +1.73%)**
**Goal**: Complete C4-C6 inline slots trilogy, targeting remaining 14.29% of C4-C7 operations
**Implementation** (modular box pattern):
- ENV gate: `HAKMEM_TINY_C4_INLINE_SLOTS=0/1` (default OFF → ON after promotion)
- TLS ring: 64 slots, 512B per thread (lighter than C5/C6's 1KB)
- Fast-path API: `c4_inline_push()` / `c4_inline_pop()` (always_inline)
- Integration: C4 FIRST → C5 → C6 → unified_cache (alloc/free cascade)
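The per-thread TLS cost of the three rings can be checked with a little arithmetic. The sketch assumes each slot holds one 8-byte pointer (a 64-bit build); this slot size is inferred from the "128 slots = 1KB" and "64 slots = 512B" figures rather than stated in this document, and head/tail bookkeeping is ignored.

```python
POINTER_BYTES = 8  # assumption: one 64-bit pointer per slot

# Slot counts per class as listed in Phases 75-1, 75-2 and 76-1.
ring_slots = {"C4": 64, "C5": 128, "C6": 128}

ring_bytes = {cls: slots * POINTER_BYTES for cls, slots in ring_slots.items()}
total_bytes = sum(ring_bytes.values())

for cls, size in ring_bytes.items():
    print(f"{cls}: {ring_slots[cls]:3d} slots -> {size:4d} B")
print(f"total per thread: {total_bytes} B ({total_bytes / 1024:.1f} KiB)")
```

Under that assumption the three rings add up to 2,560 bytes (2.5 KiB) of slot storage per thread.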
**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C4=OFF, C5=ON, C6=ON): **52.42 M ops/s**
- Treatment (C4=ON, C5=ON, C6=ON): **53.33 M ops/s**
- Delta: **+0.91 M ops/s (+1.73%)**
**Decision**: **GO** (exceeds +1.0% threshold)
**Promotion Completed**:
1. `core/bench_profile.h`: Added C4 default to `bench_apply_mixed_tinyv3_c7_common()`
2. `scripts/run_mixed_10_cleanenv.sh`: Added `HAKMEM_TINY_C4_INLINE_SLOTS=1` default
3. C4 inline slots now **promoted to preset defaults** alongside C5+C6
**Coverage Summary (C4-C7 complete)**:
- C6: 57.17% (Phase 75-1, +2.87%)
- C5: 28.55% (Phase 75-2, +1.10%)
- **C4: 14.29% (Phase 76-1, +1.73%)**
- C7: 0.00% (Phase 76-0, NO-GO)
- **Combined C4-C6: 100% of C4-C7 operations**
**Estimated Cumulative Gain**: +7-8% (C4+C5+C6 combined, assumes near-perfect additivity like Phase 75-3)
**References**:
- Results: `docs/analysis/PHASE76_1_C4_INLINE_SLOTS_RESULTS.md`
- C4 box files: `core/box/tiny_c4_inline_slots_*.h`, `core/front/tiny_c4_inline_slots.h`, `core/tiny_c4_inline_slots.c`
---
**Phase 76-2: C4+C5+C6 Comprehensive 4-Point Matrix** ✅ **complete (STRONG GO +7.05%, super-additive)**
**Goal**: Validate cumulative C4+C5+C6 interaction and establish SSOT baseline for next optimization axis
**Results** (4-point matrix, 10-run each):
- Point A (all OFF): 49.48 M ops/s (baseline)
- Point B (C4 only): 49.44 M ops/s (-0.08%, context-dependent regression)
- Point C (C5+C6 only): 52.27 M ops/s (+5.63% vs A)
- Point D (all ON): **52.97 M ops/s (+7.05% vs A)** **STRONG GO**
**Critical Discovery**:
- C4 shows **-0.08% regression in isolation** (C5/C6 OFF)
- C4 shows **+1.27% gain in context** (with C5+C6 ON)
- **Super-additivity**: Actual D (+7.05%) exceeds expected additive (+5.56%)
- **Implication**: Per-class optimizations are **context-dependent**, not independently additive
**Sub-additivity Analysis**:
- Expected additive: 52.23 M ops/s (B + C - A)
- Actual: 52.97 M ops/s
- Sub-additivity: **-1.42%** (negative, i.e. super-additive: D exceeds the additive expectation)
**Decision**: **STRONG GO**
- D vs A: +7.05% >> +3.0% threshold
- Super-additive behavior confirms synergistic gains
- C4+C5+C6 locked to SSOT defaults
**References**:
- Detailed results: `docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md`
---
### 🟩 Complete: C4-C7 Inline Slots Optimization Stack
**Per-class Coverage Summary (Final)**:
- C6 (57.17%): +2.87% (Phase 75-1)
- C5 (28.55%): +1.10% (Phase 75-2)
- C4 (14.29%): +1.27% in context (Phase 76-1/76-2)
- C7 (0.00%): NO-GO (Phase 76-0)
- **Combined C4-C6: +7.05% (Phase 76-2 super-additive)**
**Status**: ✅ **C4-C7 Optimization Complete** (100% coverage, SSOT locked)
---
### 🟥 Next Active (Phase 77+)
**Options**:
**Option A: FAST PGO Periodic Tracking** (Track B discipline)
- Regenerate PGO profile with C4+C5+C6=ON if code bloat accumulates
- Monitor mimalloc ratio progress (secondary metric)
- Not a decision point per se, but periodic maintenance
**Option B: Phase 77 (Alternative Optimization Axis)**
- Explore beyond per-class inline slots
- Candidates:
- Allocation fast-path optimization (call elimination)
- Metadata/page lookup (table optimization)
- C3/C2 class strategies
- Warm pool tuning (beyond Phase 69's WarmPool=16)
**Recommendation**: **proceed to Option B** (Phase 77+)
- C4-C7 optimizations are exhausted and locked
- Ready to explore new optimization axes
- Baseline is now +7.05% stronger than Phase 75-3
**References**:
- Full C4-C7 analysis: `docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md`
- Phase 75-3 reference (C5+C6): `docs/analysis/PHASE75_3_C5_C6_INTERACTION_RESULTS.md`
## 6) Archive
- Detailed log: `CURRENT_TASK_ARCHIVE_20251210.md`
- Pre-cleanup snapshot: `docs/analysis/CURRENT_TASK_ARCHIVE.md`