# CURRENT_TASK (Rolling, SSOT)

## Phase 86 (closed: NO-GO)

**Status**: ❌ NO-GO (+0.25% improvement, threshold: +1.0%)

**A/B Test (10-run SSOT)**:
- Control: 51,750,467 ops/s (CV: 2.26%)
- Treatment: 51,881,055 ops/s (CV: 2.32%)
- Delta: +0.25% (mean), -0.15% (median)
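
The mean delta and the verdict follow directly from the two throughput figures. A minimal sketch of that arithmetic, assuming the +1.0% GO threshold stated in the status line; `ab_delta_pct` and `verdict` are hypothetical helper names, not part of the hakmem scripts, and the binary GO/NO-GO split ignores the NEUTRAL label used elsewhere in this log:

```python
def ab_delta_pct(control_ops: float, treatment_ops: float) -> float:
    """Relative throughput change of treatment vs control, in percent."""
    return (treatment_ops - control_ops) / control_ops * 100.0


def verdict(delta_pct: float, go_threshold_pct: float = 1.0) -> str:
    """GO when the gain reaches the threshold, otherwise NO-GO (illustrative rule)."""
    return "GO" if delta_pct >= go_threshold_pct else "NO-GO"


# Mean throughput values from the Phase 86 10-run A/B.
control_mean = 51_750_467
treatment_mean = 51_881_055

delta = ab_delta_pct(control_mean, treatment_mean)
print(f"delta = {delta:+.2f}% -> {verdict(delta)}")
```

With CVs above 2% on both arms, a +0.25% mean delta sits well inside run-to-run noise, which is consistent with the median pointing the other way.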

**Summary**: Free-path legacy mask (mask-only) optimization for LEGACY classes.
- Design: Bitset mask + direct call (avoids Phase 85's indirect-call problems)
- Implementation: Correct (0x7f mask computed, C0-C6 optimized)
- Root cause: Competing Phase 9/10 optimizations (+1.89%) already capture most of the benefit
- Conclusion: The free-path optimization layer has reached its practical ceiling
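
To illustrate why the mask for C0-C6 is 0x7f, here is a small sketch of a per-class bitset. It assumes, from the "C0-C6 optimized" note, that class indices 0 through 6 are the LEGACY classes; `build_class_mask` and `in_mask` are hypothetical names and do not reflect the actual C implementation:

```python
def build_class_mask(class_indices):
    """Set one bit per class index."""
    mask = 0
    for idx in class_indices:
        mask |= 1 << idx
    return mask


def in_mask(mask: int, class_idx: int) -> bool:
    """Single shift-and-test; no table lookup or indirect call is involved."""
    return bool((mask >> class_idx) & 1)


# Assumption: C0..C6 are the LEGACY classes, C7 is not.
LEGACY_MASK = build_class_mask(range(7))
print(hex(LEGACY_MASK), in_mask(LEGACY_MASK, 6), in_mask(LEGACY_MASK, 7))
```

Seven consecutive low bits give `0b1111111`, which is 0x7f.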

---

## 0) Current source of truth (SSOT)

- **Canonical for performance comparison**: FAST PGO build (`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`) + **WarmPool=16**
  - Phase 75 (C5/C6 inline slots) has been promoted to the presets
  - Phase 75-4 performed the FAST PGO rebase and confirmed **C5+C6=ON at +3.16% (GO)** (however, the **FAST PGO baseline itself is suspected to have regressed substantially since Phase 69** → PGO regeneration is needed in Phase 75-5)
- **Canonical for safety/compatibility**: Standard build (`make bench_random_mixed_hakmem`)
- **Canonical for observation**: OBSERVE build (`make perf_observe`)
- **Scorecard (targets / current values)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md`
- **FAST baseline (SSOT)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` is authoritative (Phase 69: 62.63M ops/s = 51.77% of mimalloc)
- **Phase 75 measurement (Standard)**: **A/B +5.41%** confirmed with `bench_random_mixed_hakmem` (Phase 75-3 4-point matrix)
- **Phase 75 measurement (FAST PGO)**: **A/B +3.16%** confirmed with `bench_random_mixed_hakmem_minimal_pgo` (Phase 75-4 4-point matrix)
- Next target: **M2 = 55%** (judge the gap against the FAST baseline)
- **Mixed 10-run SSOT (harness)**: `scripts/run_mixed_10_cleanenv.sh`
  - Default: `BENCH_BIN=./bench_random_mixed_hakmem` (Standard)
  - For FAST PGO, set `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo` explicitly
  - Defaults: `ITERS=20000000 WS=400`, `HAKMEM_WARM_POOL_SIZE=16`, `HAKMEM_TINY_C4_INLINE_SLOTS=1`, `HAKMEM_TINY_C5_INLINE_SLOTS=1`, `HAKMEM_TINY_C6_INLINE_SLOTS=1`, `HAKMEM_TINY_INLINE_SLOTS_FIXED=1`, `HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1`
  - Forced OFF by cleanenv (to prevent leakage): `HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0` (Phase 83-1 NO-GO / research)
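
The idea of a "clean environment" can be sketched as follows. This is only an illustration of the principle (drop inherited `HAKMEM_*` knobs, then pin the listed defaults); how `scripts/run_mixed_10_cleanenv.sh` actually does this is not shown in this document, and `build_clean_env` is a hypothetical helper:

```python
import os

# Pinned values copied from the harness defaults listed in section 0.
SSOT_DEFAULTS = {
    "ITERS": "20000000",
    "WS": "400",
    "HAKMEM_WARM_POOL_SIZE": "16",
    "HAKMEM_TINY_C4_INLINE_SLOTS": "1",
    "HAKMEM_TINY_C5_INLINE_SLOTS": "1",
    "HAKMEM_TINY_C6_INLINE_SLOTS": "1",
    "HAKMEM_TINY_INLINE_SLOTS_FIXED": "1",
    "HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH": "1",
    # Research knob that the harness keeps OFF.
    "HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED": "0",
}


def build_clean_env(base_env, overrides=None):
    """Drop inherited HAKMEM_* variables, then apply pinned defaults and overrides."""
    env = {k: v for k, v in base_env.items() if not k.startswith("HAKMEM_")}
    env.update(SSOT_DEFAULTS)
    env.update(overrides or {})
    return env


env = build_clean_env(
    os.environ,
    {
        "HAKMEM_PROFILE": "MIXED_TINYV3_C7_SAFE",
        "BENCH_BIN": "./bench_random_mixed_hakmem",
    },
)
print(sorted(k for k in env if k.startswith("HAKMEM_")))
```

Filtering first and pinning second means a stray knob exported in the calling shell cannot leak into a measurement.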

## 0a) Preventing flip-flopping (minimum SSOT rules)

- **Always set `HAKMEM_PROFILE` explicitly for hakmem** (when it is unset the route changes and the numbers easily become meaningless).
  - Recommended: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` (Speed-first)
- Use a different runner depending on the purpose of the comparison:
  - hakmem SSOT (optimization decisions): `scripts/run_mixed_10_cleanenv.sh`
  - allocator reference (quick): `scripts/run_allocator_quick_matrix.sh`
  - allocator reference (minimizes layout differences): `scripts/run_allocator_preload_matrix.sh`
- Keep reproduction logs (the minimum when chasing a few percent):
  - `scripts/bench_ssot_capture.sh`
  - `HAKMEM_BENCH_ENV_LOG=1` (records CPU governor/EPP/freq)
- External consultation (paste-in packet): `docs/analysis/FREE_PATH_REVIEW_PACKET_CHATGPT.md` (generated by `scripts/make_chatgpt_pro_packet_free_path.sh`)

## 0b) Allocator comparison (reference)

- Allocator comparisons (system/jemalloc/mimalloc/tcmalloc) are **reference only** (separate binaries / LD_PRELOAD, so layout differences are included).
  - SSOT: `docs/analysis/ALLOCATOR_COMPARISON_SSOT.md`
- **Quick (Random Mixed 10-run)**: `scripts/run_allocator_quick_matrix.sh`
  - **Important**: for hakmem, set `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` explicitly and run through `scripts/run_mixed_10_cleanenv.sh` (a missing PROFILE corrupts the numbers).
- **Same-binary (recommended, minimizes layout differences)**: `scripts/run_allocator_preload_matrix.sh`
  - Pins `bench_random_mixed_system` and swaps the allocator via `LD_PRELOAD`.
  - Note: this path differs from hakmem's **linked benchmark** (`bench_random_mixed_hakmem*`); LD_PRELOAD is a drop-in wrapper, so the two are not equivalent.
- **Scenario CSV (small-scale reference)**: `scripts/bench_allocators_compare.sh`

## 1) Avoiding getting lost (routes / observation)

The minimum procedure for preventing "optimizations on a path that is never exercised".

- **Route Banner (eliminates route misidentification)**: `HAKMEM_ROUTE_BANNER=1`
  - Output: Route assignments (backend route kind) + cache config (`unified_cache_enabled` / `warm_pool_max_per_class`)
- **Refill observation SSOT**: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
  - At WS=400 (Mixed SSOT) misses are negligible → `unified_cache_refill()` optimization is **frozen (zero ROI)**

## 2) Recent conclusions (key points only)

- **Phase 69 (WarmPool sweep)**: `HAKMEM_WARM_POOL_SIZE=16` is a **strong GO (+3.26%)** and has been promoted to the baseline.
  - Design: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
  - Results: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
- **Phase 70 (observation SSOT)**: Made the statistics visible and established prerequisite gates. Under the WS=400 SSOT, refill is cold.
  - SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- **Phase 71/73 (WarmPool=16 win mechanism confirmed)**: The win comes from a **slight reduction in instructions/branches** (confirmed with perf stat).
  - Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
- **Phase 72 (ENV knob ROI exhausted)**: No ENV-only win beyond WarmPool=16 → **time to attack through structure (code)**.
- **Phase 78-1 (structural)**: Fixed the per-op ENV gate for Inline Slots enable; same-binary A/B gave **GO (+2.31%)**.
  - Results: `docs/analysis/PHASE78_1_INLINE_SLOTS_FIXED_MODE_RESULTS.md`
- **Phase 80-1 (structural)**: Converted the Inline Slots if-chain to switch dispatch; same-binary A/B gave **GO (+1.65%)**.
  - Results: `docs/analysis/PHASE80_INLINE_SLOTS_SWITCH_DISPATCH_1_RESULTS.md`
- **Phase 83-1 (structural)**: Fixed the per-op ENV gate for switch dispatch (applying the Phase 78-1 pattern); same-binary A/B gave **NO-GO (+0.32%, branch reduction negligible)**.
  - Results: `docs/analysis/PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md`
  - Cause: the lazy-init pattern is already optimized (per-op overhead minimal) → the ROI of fixed mode is tiny
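
The Phase 83-1 explanation rests on how a lazy-init ENV gate behaves: the environment is consulted once, and every later call only tests a cached flag. A minimal sketch of that pattern, assuming a simple "1 means enabled" convention; `LazyEnvGate` is a hypothetical model and not the project's C code:

```python
import os


class LazyEnvGate:
    """Reads an environment variable on first use, then serves the cached value."""

    def __init__(self, name, default="0", environ=None):
        self._name = name
        self._default = default
        self._environ = os.environ if environ is None else environ
        self._cached = None
        self.env_reads = 0  # instrumentation for the demo only

    def enabled(self) -> bool:
        if self._cached is None:  # taken only on the first call
            self.env_reads += 1
            self._cached = self._environ.get(self._name, self._default) == "1"
        return self._cached


gate = LazyEnvGate(
    "HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH",
    environ={"HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH": "1"},
)
hits = sum(1 for _ in range(1000) if gate.enabled())
print(hits, gate.env_reads)
```

After the first call the per-operation cost is one well-predicted check, so replacing it with a fixed value has little left to remove.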

## 3) Operating rules (Box Theory + layout tax countermeasures)

- Always land changes as **a box + a single boundary + revertible via ENV** (fail-fast, minimal observability).
- A/B tests should in principle **toggle ENV on the same binary** (comparing different binaries mixes in layout effects).
- SSOT operation (preventing flip-flopping): `docs/analysis/PHASE75_6_SSOT_POLICY_FAST_PGO_VS_STANDARD.md`
- "Delete it and it gets faster" is off limits (link-out / large deletions easily flip sign because of layout tax) → prefer **compile-out**.
  - Diagnosis: `scripts/box/layout_tax_forensics_box.sh` / `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md`
- Research box inventory SSOT: `docs/analysis/RESEARCH_BOXES_SSOT.md`
  - Knob list: `scripts/list_hakmem_knobs.sh`

## 3a) Handling research boxes (freeze policy)

- **Phase 79-1 (C2 local cache)**: `HAKMEM_TINY_C2_LOCAL_CACHE=0/1`
  - Result: +0.57% (NO-GO, below the +1.0% threshold) → **research box freeze**
  - **Default OFF** under SSOT/cleanenv (`scripts/run_mixed_10_cleanenv.sh` forces `0`)
  - No physical deletion (avoids layout tax risk)
- **Phase 82 (hardening)**: C2 local cache is fully excluded from the hot path (even with the environment variable set, the alloc/free hot path never touches it)
  - Record: `docs/analysis/PHASE82_C2_LOCAL_CACHE_HOTPATH_EXCLUSION.md`
- **Phase 85 (Free path commit-once, LEGACY-only)**: `HAKMEM_FREE_PATH_COMMIT_ONCE=0/1`
  - Result: **NO-GO (-0.86%)** → **research box freeze (default OFF)**
  - Reason: its effect overlaps with Phase 10 (MONO LEGACY DIRECT), and it added indirect-call and layout tax on top
  - Record: `docs/analysis/PHASE85_FREE_PATH_COMMIT_ONCE_RESULTS.md`

## 4) Next instructions (Active)

### Phase 74 (structural): Shorten the UnifiedCache hit path ✅ **P1 (LOCALIZE) frozen**

**Premises**:
- Under the WS=400 SSOT, UnifiedCache misses are negligible → refill optimization has zero ROI.
- The WarmPool=16 win comes from slightly fewer instructions/branches → shortening the hit path is the right approach.

**Phase 74-1: LOCALIZE (ENV-gated)** ✅ **Done (NEUTRAL +0.50%)**
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1`
- Runtime branch overhead **increased** instructions/branches (+0.7%/+0.4%)
- Verdict: **NEUTRAL (+0.50%)**

**Phase 74-2: LOCALIZE (compile-time gate)** ✅ **Done (NEUTRAL -0.87%)**
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Removing the runtime branch **improved** instructions/branches (-0.6%/-2.3%) ✓
- However, **cache-misses +86%** (register pressure / spill) → throughput **-0.87%**
- Isolation succeeded: **LOCALIZE itself is a win, but it is offset by the cache-miss increase**
- Verdict: **NEUTRAL (-0.87%)** → **P1 (LOCALIZE) frozen**

**Conclusion**:
- P1 (LOCALIZE) is frozen with default OFF (low ROI from dependency-chain reduction)
- Next: proceed to **Phase 74-3 (P0: FASTAPI)**

**Phase 74-3: P0 (FASTAPI)** ✅ **Done (NEUTRAL +0.32%)**

**Goal**: Move the `unified_cache_enabled()` / `lazy-init` / `stats` checks **out of the hot loop**

**Approach**:
- Add the `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
- Precondition: the caller guarantees "valid/enabled/no-stats"
- Fail-fast: fall back to the slow path on unexpected state (single boundary)
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
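
The approach above can be sketched as a model in which the "enabled / no-stats" conditions are decided once at a single boundary and the per-element call skips them. This is an illustration only: `UnifiedCacheModel` and `free_many` are hypothetical, and `push_fast` borrows just the name of `unified_cache_push_fast()` without reflecting its real signature:

```python
class UnifiedCacheModel:
    """Toy cache with a general push and a precondition-based fast push."""

    def __init__(self, capacity, enabled=True, stats=False):
        self.capacity = capacity
        self.enabled = enabled
        self.stats = stats
        self.slots = []
        self.push_count = 0

    def push(self, ptr) -> bool:
        """General path: re-checks every condition on each call."""
        if not self.enabled or len(self.slots) >= self.capacity:
            return False
        self.slots.append(ptr)
        if self.stats:
            self.push_count += 1
        return True

    def push_fast(self, ptr) -> bool:
        """Fast path: the caller has already guaranteed enabled and no-stats."""
        if len(self.slots) >= self.capacity:
            return False
        self.slots.append(ptr)
        return True


def free_many(cache, ptrs) -> int:
    """Single boundary: pick the path once, outside the loop."""
    use_fast = cache.enabled and not cache.stats
    push = cache.push_fast if use_fast else cache.push
    return sum(1 for p in ptrs if push(p))


print(free_many(UnifiedCacheModel(capacity=4), range(6)))
```

When the preconditions do not hold (for example, stats collection is on), the loop simply uses the general path, which corresponds to the fail-fast fallback in the list above.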

**Results** (10-run Mixed SSOT, WS=400):
- Throughput: **+0.32%** (NEUTRAL, below the +1.0% GO threshold)
- cache-misses: **-16.31%** (positive signal, insufficient throughput gain)

**Verdict**: **NEUTRAL (+0.32%)** → **P0 (FASTAPI) frozen**

**References**:
- Design: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- Instructions: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- Results (P1/P0): `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md`

---

## Phase 75 (structural): Hot-class Inline Slots (P2) ✅ **Done (Standard A/B)**

**Goal**: Statistical analysis of C4-C7 → decide the targeted optimization strategy

**Premises** (Phase 74 learnings):
- UnifiedCache hit-path optimization has low ROI ← register pressure / cache-miss effects
- Next axis: **exploit per-class characteristics** → branch elimination via TLS-direct inline slots

**Phase 75-0: Per-Class Analysis** ✅ **Done**

Per-class Unified-STATS (Mixed SSOT, WS=400, HAKMEM_MEASURE_UNIFIED_CACHE=1):

| Class | Capacity | Occupied | Hits | Pushes | Total Ops | Hit % | % of C4-C7 |
|-------|----------|----------|------|--------|-----------|-------|------------|
| C6 | 128 | 127 | 2,750,854 | 2,750,855 | **5,501,709** | 100% | **57.2%** |
| C5 | 128 | 127 | 1,373,604 | 1,373,605 | **2,747,209** | 100% | **28.5%** |
| C4 | 64 | 63 | 687,563 | 687,564 | **1,375,127** | 100% | **14.3%** |
| C7 | ? | ? | ? | ? | **?** | ? | **?** |
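
The "% of C4-C7" column can be reproduced from the Total Ops column. A short sketch, assuming C7 contributes no operations (its row is unknown in the table):

```python
# Total Ops per class, copied from the table; C7 is omitted (unknown in the table).
total_ops = {"C6": 5_501_709, "C5": 2_747_209, "C4": 1_375_127}

grand_total = sum(total_ops.values())
share_pct = {cls: ops / grand_total * 100.0 for cls, ops in total_ops.items()}

for cls, pct in share_pct.items():
    print(f"{cls}: {pct:.1f}% of C4-C7 operations")
```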

**Key findings**:
1. C6 dominates: 57.2% of operations (2.75M hits)
2. All classes: 100% hit rate (refill inactive in SSOT)
3. Cache occupancy is near capacity (98-99%)

**Phase 75-1: C6-only Inline Slots** ✅ **Done (GO +2.87%)**

**Approach**: Modular box theory design with a single decision point at TLS init

**Implementation** (5 new boxes + test script):
- ENV gate box: `HAKMEM_TINY_C6_INLINE_SLOTS=0/1` (lazy-init, default OFF)
- TLS extension: 128-slot ring buffer (1KB per thread, zero overhead when OFF)
- Fast-path API: `c6_inline_push()` / `c6_inline_pop()` (always_inline, 1-2 cycles)
- Integration: Minimal (2 boundary points: alloc/free for C6 class only)
- Backward compatible: Legacy code intact, fail-fast to unified_cache
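
The ring-buffer idea behind the inline slots can be sketched as a fixed-capacity FIFO that reports failure when full or empty so the caller can fall back to the shared cache. This is a behavioral model only: `InlineSlotsRing`, `free_c6`, and `alloc_c6` are hypothetical names, the actual layout and ordering of the C TLS ring are not described in this document, and the 1KB figure is merely consistent with 128 slots of 8-byte pointers:

```python
class InlineSlotsRing:
    """Fixed-capacity ring of pointers; push/pop never allocate."""

    def __init__(self, capacity=128):
        self._slots = [None] * capacity
        self._head = 0   # index of the next slot to pop
        self._tail = 0   # index of the next slot to push
        self._count = 0

    def push(self, ptr) -> bool:
        if self._count == len(self._slots):
            return False  # full: caller falls back
        self._slots[self._tail] = ptr
        self._tail = (self._tail + 1) % len(self._slots)
        self._count += 1
        return True

    def pop(self):
        if self._count == 0:
            return None  # empty: caller falls back
        ptr = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return ptr


def free_c6(ring, fallback, ptr):
    """Free boundary: inline ring first, shared cache (a list here) otherwise."""
    if not ring.push(ptr):
        fallback.append(ptr)


def alloc_c6(ring, fallback):
    """Alloc boundary: inline ring first, shared cache otherwise."""
    ptr = ring.pop()
    if ptr is None and fallback:
        ptr = fallback.pop()
    return ptr


ring, shared = InlineSlotsRing(capacity=2), []
for p in (0x10, 0x20, 0x30):
    free_c6(ring, shared, p)
print([alloc_c6(ring, shared) for _ in range(4)])
```

Keeping exactly one fallback point on each of alloc and free matches the "2 boundary points" integration described above.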

**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C6 inline OFF): **44.24 M ops/s**
- Treatment (C6 inline ON): **45.51 M ops/s**
- Delta: **+1.27 M ops/s (+2.87%)**

**Decision**: ✅ **GO** (exceeds the +1.0% strict threshold)

**Mechanism**: Branch elimination on unified_cache for C6 (57.2% of C4-C7 ops)

**References**:
- Per-class analysis: `docs/analysis/PHASE75_PERCLASS_ANALYSIS_0_SSOT.md`
- Results: `docs/analysis/PHASE75_C6_INLINE_SLOTS_1_RESULTS.md`

---

**Phase 75-2: C5 Inline Slots** ✅ **Done (GO +1.10%)**

**Goal**: C5-only isolated measurement (28.5% of C4-C7) for individual contribution

**Approach**: Replicate the C6 pattern with careful isolation
- Add a C5 ring buffer (128 slots, 1KB TLS)
- ENV gate: `HAKMEM_TINY_C5_INLINE_SLOTS=0/1` (default OFF)
- Test strategy: C5-only (baseline C5=OFF+C6=ON, treatment C5=ON+C6=ON)
- Integration: alloc/free boundary points (C5 FIRST, then C6, then unified_cache)

**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C5=OFF, C6=ON): **44.26 M ops/s** (σ=0.37)
- Treatment (C5=ON, C6=ON): **44.74 M ops/s** (σ=0.54)
- Delta: **+0.49 M ops/s (+1.10%)**

**Decision**: ✅ **GO** (C5 individual contribution validated)

**Cumulative Performance**:
- Phase 75-1 (C6): +2.87%
- Phase 75-2 (C5 isolated): +1.10%
- Combined potential: ~+3.97% (if additive)

**References**:
- Implementation details: `docs/analysis/PHASE75_2_C5_INLINE_SLOTS_IMPLEMENTATION.md`

---

**Phase 75-3: C5+C6 Interaction Test (4-Point Matrix A/B)** ✅ **Done (STRONG GO +5.41%)**

**Goal**: Comprehensive interaction test + final promotion decision

**Approach**: 4-point matrix A/B test (single binary, ENV-only configuration)
- Point A (C5=0, C6=0): Baseline
- Point B (C5=1, C6=0): C5 solo
- Point C (C5=0, C6=1): C6 solo
- Point D (C5=1, C6=1): C5+C6 combined

**Results** (10-run per point, Mixed SSOT, WS=400):
- **Point A (baseline)**: 42.36 M ops/s
- **Point B (C5 solo)**: 43.54 M ops/s (+2.79% vs A)
- **Point C (C6 solo)**: 44.25 M ops/s (+4.46% vs A)
- **Point D (C5+C6)**: 44.65 M ops/s (+5.41% vs A) **[MAIN TARGET]**

**Additivity Analysis**:
- Expected additive (B+C-A): 45.43 M ops/s
- Actual (D): 44.65 M ops/s
- Sub-additivity: **1.72%** (near-perfect additivity, minimal negative interaction)
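
The additivity figures follow from the four matrix points. A minimal sketch of the calculation, where `additivity` is a hypothetical helper and sub-additivity is taken as the shortfall of D relative to the additive expectation B + C - A:

```python
def additivity(a: float, b: float, c: float, d: float):
    """Return (expected additive throughput, sub-additivity in percent).

    A positive percentage means D fell short of B + C - A (sub-additive);
    a negative one means D exceeded it (super-additive).
    """
    expected = b + c - a
    sub_additivity_pct = (expected - d) / expected * 100.0
    return expected, sub_additivity_pct


# Phase 75-3 points in M ops/s: A (baseline), B (C5 solo), C (C6 solo), D (C5+C6).
expected, sub = additivity(42.36, 43.54, 44.25, 44.65)
print(f"expected = {expected:.2f} M ops/s, sub-additivity = {sub:.2f}%")
```

The same function applies to any later 4-point matrix; only the four throughput values change.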

**Perf Stat Validation (D vs A)**:
- Instructions: -6.1% (function call elimination confirmed)
- Branches: -6.1% (matches instruction reduction)
- Cache-misses: -31.5% (improved locality, NOT +86% like Phase 74-2)
- Throughput: +5.41% (net positive)

**Decision**: ✅ **STRONG GO (+5.41%)**
- D vs A: +5.41% >> 3.0% threshold
- Sub-additivity: 1.72% << 20% acceptable
- Phase 73 thesis validated: instructions/branches DOWN, throughput UP

**Promotion Completed**:
1. `core/bench_profile.h`: Added C5+C6 defaults to `bench_apply_mixed_tinyv3_c7_common()`
2. `scripts/run_mixed_10_cleanenv.sh`: Added C5+C6 ENV defaults
3. C5+C6 inline slots now **promoted to preset defaults** for MIXED_TINYV3_C7_SAFE

**Phase 75 Complete**: C5+C6 inline slots (129-256B) deliver a proven +5.41% gain **on the Standard binary** (`bench_random_mixed_hakmem`).
- Before updating the FAST PGO baseline (scorecard), re-measure **the same-condition A/B (C5/C6 OFF/ON)** with `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo`.

### Phase 75-4 (FAST PGO rebase) ✅ Done

- Result: **+3.16% (GO)** (4-point matrix, after excluding outliers)
- Details: `docs/analysis/PHASE75_4_FAST_PGO_REBASE_RESULTS.md`
- Important: the **current FAST PGO baseline is suspected to be much lower** than the Phase 69 FAST baseline (62.63M) (PGO profile staleness / training mismatch / build drift)

### Phase 75-5 (PGO regeneration) ✅ Done (NO-GO on hypothesis, code bloat root cause identified)

Purpose:
- Regenerate PGO training against the current code, which includes the C5/C6 inline slots, and recover a Phase 69-class FAST baseline.

Results:
- PGO profile regeneration had only a **limited** effect (+0.3%)
- The root cause is **code bloat** (+13KB, +3.1%), **not a PGO profile mismatch**
- Code bloat induces layout tax, causing IPC collapse (-7.22%) and a branch-miss spike (+19.4%) → net -12% regression

**Forensics findings** (`scripts/box/layout_tax_forensics_box.sh`):
- Text size: +13KB (+3.1%)
- IPC: 1.80 → 1.67 (-7.22%)
- Branch-misses: +19.4%
- Cache-misses: +5.7%

**Decision**:
- FAST PGO is sensitive to code bloat → **Track A/B discipline established**
  - Track A: implementation decisions on the Standard binary (SSOT for GO/NO-GO)
  - Track B: mimalloc ratio tracking on FAST PGO (periodic rebase, not single-point decisions)

**References**:
- Detailed results: `docs/analysis/PHASE75_5_PGO_REGENERATION_RESULTS.md`
- Instructions: `docs/analysis/PHASE75_5_PGO_REGENERATION_NEXT_INSTRUCTIONS.md`

---

### Phase 76 (structural, continued): C4-C7 Remaining Classes ✅ **Phase 76-1 done (GO +1.73%)**

**Premises** (Phase 75 complete):
- C5+C6 inline slots: +5.41% proven (Standard), +3.16% (FAST PGO)
- Code bloat sensitivity identified → Track A/B discipline established
- Remaining C4-C7 coverage: C4 (14.29%), C7 (0%)

**Phase 76-0: C7 Statistics Analysis** ✅ **Done (NO-GO for C7 P2)**

**Approach**: OBSERVE run to measure C7 allocation patterns in Mixed SSOT

**Results**: C7 = **0% operations** in Mixed SSOT workload

**Decision**: NO-GO for C7 P2 optimization → proceed to C4

**References**:
- Results: `docs/analysis/PHASE76_0_C7_STATISTICS_ANALYSIS.md`

**Phase 76-1: C4 Inline Slots** ✅ **Done (GO +1.73%)**

**Goal**: Complete the C4-C6 inline slots trilogy, targeting the remaining 14.29% of C4-C7 operations

**Implementation** (modular box pattern):
- ENV gate: `HAKMEM_TINY_C4_INLINE_SLOTS=0/1` (default OFF → ON after promotion)
- TLS ring: 64 slots, 512B per thread (lighter than C5/C6's 1KB)
- Fast-path API: `c4_inline_push()` / `c4_inline_pop()` (always_inline)
- Integration: C4 FIRST → C5 → C6 → unified_cache (alloc/free cascade)
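
The alloc/free cascade in the last bullet can be modeled as an ordered check: the class's inline ring is tried first in the order C4, C5, C6, and anything else (other classes, a disabled gate, or a full ring) lands in the shared cache. A sketch of the free side, assuming the slot counts quoted in this document (64 for C4, 128 for C5 and C6); `make_rings` and `free_block` are hypothetical names:

```python
from collections import deque

INLINE_ORDER = (4, 5, 6)              # check order from the text: C4, then C5, then C6
CAPACITY = {4: 64, 5: 128, 6: 128}    # slot counts quoted in this document


def make_rings(enabled):
    """Create one ring per class whose (hypothetical) gate is enabled."""
    return {c: deque() for c in INLINE_ORDER if enabled.get(c)}


def free_block(class_idx, ptr, rings, unified_cache) -> str:
    """Return which layer accepted the pointer."""
    for c in INLINE_ORDER:
        if class_idx == c and c in rings and len(rings[c]) < CAPACITY[c]:
            rings[c].append(ptr)
            return f"C{c}-inline"
    unified_cache.append(ptr)          # fallback for everything else
    return "unified_cache"


rings, shared = make_rings({4: True, 5: True, 6: True}), []
print(free_block(4, 0x1000, rings, shared), free_block(3, 0x2000, rings, shared))
```

In this model, turning a gate OFF simply removes that ring, so the same pointer flows to the shared cache without any other code change.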

**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C4=OFF, C5=ON, C6=ON): **52.42 M ops/s**
- Treatment (C4=ON, C5=ON, C6=ON): **53.33 M ops/s**
- Delta: **+0.91 M ops/s (+1.73%)**

**Decision**: ✅ **GO** (exceeds +1.0% threshold)

**Promotion Completed**:
1. `core/bench_profile.h`: Added C4 default to `bench_apply_mixed_tinyv3_c7_common()`
2. `scripts/run_mixed_10_cleanenv.sh`: Added `HAKMEM_TINY_C4_INLINE_SLOTS=1` default
3. C4 inline slots now **promoted to preset defaults** alongside C5+C6

**Coverage Summary (C4-C7 complete)**:
- C6: 57.17% (Phase 75-1, +2.87%)
- C5: 28.55% (Phase 75-2, +1.10%)
- **C4: 14.29% (Phase 76-1, +1.73%)**
- C7: 0.00% (Phase 76-0, NO-GO)
- **Combined C4-C6: 100% of C4-C7 operations**

**Estimated Cumulative Gain**: +7-8% (C4+C5+C6 combined, assumes near-perfect additivity like Phase 75-3)

**References**:
- Results: `docs/analysis/PHASE76_1_C4_INLINE_SLOTS_RESULTS.md`
- C4 box files: `core/box/tiny_c4_inline_slots_*.h`, `core/front/tiny_c4_inline_slots.h`, `core/tiny_c4_inline_slots.c`

---

**Phase 76-2: C4+C5+C6 Comprehensive 4-Point Matrix** ✅ **Done (STRONG GO +7.05%, super-additive)**

**Goal**: Validate the cumulative C4+C5+C6 interaction and establish the SSOT baseline for the next optimization axis

**Results** (4-point matrix, 10-run each):
- Point A (all OFF): 49.48 M ops/s (baseline)
- Point B (C4 only): 49.44 M ops/s (-0.08%, context-dependent regression)
- Point C (C5+C6 only): 52.27 M ops/s (+5.63% vs A)
- Point D (all ON): **52.97 M ops/s (+7.05% vs A)** ✅ **STRONG GO**

**Critical Discovery**:
- C4 shows **-0.08% regression in isolation** (C5/C6 OFF)
- C4 shows **+1.27% gain in context** (with C5+C6 ON)
- **Super-additivity**: Actual D (+7.05%) exceeds the expected additive result (+5.56%)
- **Implication**: Per-class optimizations are **context-dependent**, not independently additive

**Additivity Analysis**:
- Expected additive: 52.23 M ops/s (B + C - A)
- Actual: 52.97 M ops/s
- Sub-additivity: **-1.42%** (negative, i.e. super-additive) ✓

**Decision**: ✅ **STRONG GO**
- D vs A: +7.05% >> +3.0% threshold
- Super-additive behavior confirms synergistic gains
- C4+C5+C6 locked to SSOT defaults

**References**:
- Detailed results: `docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md`

---

### 🟩 Done: C4-C7 Inline Slots Optimization Stack

**Per-class Coverage Summary (Final)**:
- C6 (57.17%): +2.87% (Phase 75-1)
- C5 (28.55%): +1.10% (Phase 75-2)
- C4 (14.29%): +1.27% in context (Phase 76-1/76-2)
- C7 (0.00%): NO-GO (Phase 76-0)
- **Combined C4-C6: +7.05% (Phase 76-2 super-additive)**

**Status**: ✅ **C4-C7 Optimization Complete** (100% coverage, SSOT locked)

---

### 🟥 Next Active (Phase 77+)

**Options**:

**Option A: FAST PGO Periodic Tracking** (Track B discipline)
- Regenerate the PGO profile with C4+C5+C6=ON if code bloat accumulates
- Monitor mimalloc ratio progress (secondary metric)
- Not a decision point per se, but periodic maintenance

**Option B: Phase 77 (Alternative Optimization Axis)**
- Explore beyond per-class inline slots
- Candidates:
  - Allocation fast-path optimization (call elimination)
  - Metadata/page lookup (table optimization)
  - C3/C2 class strategies
  - Warm pool tuning (beyond Phase 69's WarmPool=16)

**Recommendation**: **Proceed with Option B** (Phase 77+)
- C4-C7 optimizations are exhausted and locked
- Ready to explore new optimization axes
- Baseline is now +7.05% stronger than the all-OFF configuration (Phase 76-2 Point A)

**References**:
- Full C4-C7 analysis: `docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md`
- Phase 75-3 reference (C5+C6): `docs/analysis/PHASE75_3_C5_C6_INTERACTION_RESULTS.md`

## 5) Archive

- Detailed log: `CURRENT_TASK_ARCHIVE_20251210.md`
- Pre-cleanup snapshot: `docs/analysis/CURRENT_TASK_ARCHIVE.md`