# CURRENT_TASK (Rolling, SSOT)

## 0) Current sources of truth

- **Canonical for performance comparison**: FAST PGO build (`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`) ✓ **Promoted in Phase 69** (Warm Pool Size=16)
- **Canonical for safety and compatibility**: Standard build (`make bench_random_mixed_hakmem`)
- **Canonical for observation**: OBSERVE build (`make perf_observe`)
- **Scorecard**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` (M1 achieved and exceeded: 51.77% vs the 50% target; +3.23pp remaining to M2)
- **Canonical measurement (Mixed 10-run)**: `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`, with `HAKMEM_WARM_POOL_SIZE=16` as the default)
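
The scorecard percentages are ratios against mimalloc throughput, and the milestone gaps are plain differences in percentage points. A minimal sketch of both calculations, assuming only `awk` is available; the function names `ratio_pct` and `pp_gap` are illustrative and are not part of the repository:

```bash
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names, not taken from the hakmem repo).

# ratio_pct HAKMEM_OPS MIMALLOC_OPS -> hakmem throughput as a percentage of mimalloc
ratio_pct() {
  awk -v h="$1" -v m="$2" 'BEGIN { printf "%.2f\n", h / m * 100 }'
}

# pp_gap CURRENT_PCT TARGET_PCT -> signed gap in percentage points (target - current)
pp_gap() {
  awk -v cur="$1" -v target="$2" 'BEGIN { printf "%+.2f\n", target - cur }'
}

ratio_pct 59.184 120.466   # Phase 59 throughput figures recorded in this file
pp_gap 51.77 50            # M1: a negative value means the target is exceeded
pp_gap 51.77 55            # M2: remaining gap
```

With the figures in this file, the last two calls reproduce the 1.77pp margin over M1 and the +3.23pp still missing for M2.
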

## 1) Current status (key points)

- Phase 64 (backend prune / DCE): **NO-GO** (-4.05%) → caused by layout tax
- Phase 63 (FAST_PROFILE_FIXED): kept as a **research build** (fixes the FAST gates at compile time)
- Phase 65 (Hot Symbol Ordering): **BLOCKED** (unfair or impossible under GCC+LTO constraints) → `docs/analysis/PHASE65_HOT_SYMBOL_ORDERING_1_RESULTS.md`
- Phase 66 (PGO, GCC+LTO): **GO** ✓
  - Verification: 3 independent runs, +3.0% mean, all >+2.89%, variance <±1%
  - Baseline: `bench_random_mixed_hakmem_minimal_pgo` = 60.89M ops/s = 50.32% (initial PGO)
- Phase 68 (PGO training set optimization): **GO, promotion complete** ✓
  - Verification: +1.19% vs Phase 66 over 10 runs (exceeds the +1.0% GO threshold)
  - Baseline (upgraded): `bench_random_mixed_hakmem_minimal_pgo` = 61.614M ops/s = **50.93%** (exceeds the 50% target by +0.93pp)
- Phase 69 (Refill tuning: Warm Pool Size optimization): **Strong GO, promotion complete** ✓✓✓
  - Verification: +3.26% vs Phase 68 over 10 runs (exceeds the +3.0% strong-GO threshold)
  - New baseline: `bench_random_mixed_hakmem_minimal_pgo` (upgraded) = 62.63M ops/s = **51.77%** (M1 exceeded by +1.77pp; +3.23pp remaining to M2)
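
The relative gains quoted above can be rechecked from the recorded throughputs. The sketch below uses a hypothetical helper `pct_delta`; note that the +3.26% for Phase 69 appears to be measured against the Phase 68 baseline as re-measured in Phase 69-1 (60.65M ops/s), not against the 61.614M ops/s figure recorded when Phase 68 was promoted:

```bash
#!/usr/bin/env bash
# pct_delta BASE NEW -> signed relative change of NEW over BASE, in percent.
# (Illustrative helper; the name is an assumption, not a repo script.)
pct_delta() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%+.2f\n", (new - base) / base * 100 }'
}

pct_delta 60.89 61.614   # Phase 66 baseline -> Phase 68 baseline
pct_delta 60.65 62.63    # Phase 68 as re-measured in Phase 69-1 -> Phase 69
pct_delta 61.614 62.63   # Phase 68 promotion-time figure -> Phase 69
```

The first two calls reproduce the +1.19% and +3.26% quoted in this section; the third gives a smaller number, which is why the choice of baseline run matters when comparing phases.
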

**Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement**

**Summary**

Completed Phase 54-60 optimization work:

**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**

- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset

**Phase 57: 60-min soak finalization**

- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY

**Phase 59: 50% recovery baseline rebase**

- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc

**Phase 60: Alloc pass-down SSOT (NO-GO)**

- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized

**Key Metrics**

- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes

**Files Added/Modified**

New boxes:

- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h

Scripts:

- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py

Documentation: Phase 40-60 analysis documents

**Design Decisions**

1. Profile separation (core/bench_profile.h):
   - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
   - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)
2. Box Theory compliance:
   - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
   - Single conversion points maintained
   - No physical deletions (compile-out only)
3. Lessons learned:
   - SSOT effective only where redundancy exists (Phase 60 showed limits)
   - Branch prediction extremely effective (~0 cycles for well-predicted branches)
   - Early-exit pattern valuable even when seemingly redundant

## 2) Next instructions (Active)

**Phase 68: PGO training set optimization** ✅ **Done**

- ✓ Seed/WS diversification: WS (3 → 5 patterns), seed (1 → 3 patterns)
- ✓ 10-run verification: +1.19% vs Phase 66 (exceeds the +1.0% GO threshold)
- ✓ Baseline promotion: 61.614M ops/s = 50.93% (exceeds the 50% M1 target by +0.93pp)
- ✓ Scorecard and CURRENT_TASK updated

---

**Phase 67a: Layout-tax forensics (minimal changes)** ✅ **Done, ready for operational use**

- ✓ New `scripts/box/layout_tax_forensics_box.sh` (measurement harness)
  - 10-run throughput comparison of Baseline vs Treatment
  - Automatic perf stat collection (cycles, IPC, branches, branch-misses, cache-misses, iTLB/dTLB)
  - Binary metadata (size, section layout)

- ✓ New `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md` (diagnostic guide)
  - Decision rule: GO (+1% or more) / NEUTRAL (within ±1%) / NO-GO (-1% or less)
  - "Symptom → candidate cause" mapping table
    * IPC drop of 3% or more → I-cache miss / code layout dispersal
    * branch-misses up 10% or more → branch prediction penalty
    * dTLB-misses up 100% or more → data layout fragmentation
  - Phase 64 case study (root cause of the -4.05%: IPC 2.05 → 1.98)
  - Operational guidelines
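
The decision rule above can be sketched as a small classifier. This is an illustration under stated assumptions, not the harness's actual code: the function name `verdict` is hypothetical, and since the guide leaves the exact ±1% boundary open, the sketch treats exactly +1% as GO and exactly -1% as NO-GO:

```bash
#!/usr/bin/env bash
# verdict DELTA_PCT -> GO / NEUTRAL / NO-GO using the ±1% rule.
# DELTA_PCT is the throughput change of treatment vs baseline, in percent.
verdict() {
  awk -v d="$1" 'BEGIN {
    if (d >= 1.0)       print "GO"
    else if (d <= -1.0) print "NO-GO"
    else                print "NEUTRAL"
  }'
}

verdict -4.05   # Phase 64 (backend prune / DCE)
verdict 1.19    # Phase 68 vs Phase 66
```
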

**Usage example**:

```bash
./scripts/box/layout_tax_forensics_box.sh \
  ./bench_random_mixed_hakmem_minimal_pgo \
  ./bench_random_mixed_hakmem_fast_pruned  # or Phase 64 attempt
```
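
Once the harness has produced baseline and treatment counters, the symptom table from the diagnostic guide could be applied one metric at a time. The sketch below is an assumption about how such a check might look; the function name `symptom` and its argument order are hypothetical, and only the three thresholds come from the guide:

```bash
#!/usr/bin/env bash
# symptom METRIC BASE NEW -> candidate cause if the metric crosses its threshold.
# Thresholds: IPC down >= 3%, branch-misses up >= 10%, dTLB-misses up >= 100%.
symptom() {
  awk -v m="$1" -v base="$2" -v new="$3" 'BEGIN {
    d = (new - base) / base * 100   # relative change in percent
    if (m == "IPC" && d <= -3)                print "I-cache miss / code layout dispersal"
    else if (m == "branch-misses" && d >= 10) print "branch prediction penalty"
    else if (m == "dTLB-misses" && d >= 100)  print "data layout fragmentation"
    else                                      print "no flagged symptom"
  }'
}

symptom IPC 2.05 1.98   # Phase 64 case study values
```

For the Phase 64 case study the IPC drop is about 3.4%, so the first rule fires.
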

Outcome: when a "trimming" change comes back NO-GO, a **single run** now shows which metric regressed, so later link-out or large-deletion attempts can be stopped in advance.

---

**Phase 69: Cut "refill frequency × fixed tax" (shortest path to M2)**

**Phase 69-0: Parameter sweep design memo** ✅ **Done**

- ✓ Created `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
- ✓ Identified the tunable parameters:
  - `HAKMEM_TINY_REFILL_COUNT_MID` / `HAKMEM_TINY_REFILL_COUNT_HOT` (the actual refill amount, ENV-only)
  - Unified Cache C5-C7 capacity (128 → 256/512)
  - Warm Pool size (12 → 16/24)
- ✓ Drafted the sweep plan (single-parameter → combined optimization)
- ✓ Defined the risk assessment and decision criteria
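
A single-parameter ENV sweep of the kind planned above could be driven by a small loop. This is a sketch only: `sweep_env` is a hypothetical helper, and it assumes the benchmark runner honors an environment variable supplied by the caller and prints its result on stdout:

```bash
#!/usr/bin/env bash
# sweep_env VAR "v1 v2 ..." CMD [ARGS...]
# Runs CMD once per value with VAR set only for that run and prints
# "value<TAB>output" so the results can be tabulated afterwards.
sweep_env() {
  local var="$1" values="$2"
  shift 2
  local v
  for v in $values; do   # unquoted on purpose: split the value list on spaces
    printf '%s\t%s\n' "$v" "$(env "$var=$v" "$@")"
  done
}

# Example (not executed here; assumes the runner script is present):
#   sweep_env HAKMEM_WARM_POOL_SIZE "12 16 24" ./scripts/run_mixed_10_cleanenv.sh
```

Keeping the variable scoped to each child process avoids leaking one sweep point into the next run.
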

**Phase 69-1: Sweep execution** ✅ **Done**

- ✓ Baseline (Phase 68 PGO): 60.65M ops/s (10-run mean)
- ✓ Warm Pool Size sweep:
  - Size=16: **62.63M ops/s (+3.26%, strong GO)** ✓✓✓ **Winner**
  - Size=24: 62.37M ops/s (+2.84%, GO)
- ✓ Unified Cache C5-C7 sweep:
  - Cache=256: 61.92M ops/s (+2.09%, GO)
  - Cache=512: 61.80M ops/s (+1.89%, GO)
- ✓ Combined optimization check:
  - Warm=16 + Cache=256: 62.35M ops/s (+2.81%, non-additive)
- ✓ The "Refill Batch Size sweep" is void (knob not wired up):
  - `TINY_REFILL_BATCH_SIZE` has no call site in the current Tiny front, so it does not function as a performance knob
  - Reference: `docs/analysis/PHASE69_REFILL_TUNING_3C_REFILL_BATCH_KNOB_AUDIT.md`
- **Results**: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
- **Winning setting**: **Warm Pool Size=16 (ENV-only, +3.26%, strong GO)**
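
The sweep results above can be reduced to two questions: which single setting wins, and how far the combined setting falls short of a purely additive expectation. A rough sketch with hypothetical helper names; summing percentage gains is itself only an approximation of "additive":

```bash
#!/usr/bin/env bash
# best_setting: read "label<TAB>delta_percent" lines, print the label with the largest delta.
best_setting() {
  awk -F'\t' 'NR == 1 || $2 + 0 > best { best = $2 + 0; label = $1 } END { print label }'
}

# interaction A_PCT B_PCT COMBINED_PCT: compare the observed combined gain
# with the sum of the two individual gains.
interaction() {
  awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN {
    printf "expected_if_additive=%.2f observed=%.2f shortfall=%.2f\n", a + b, c, a + b - c
  }'
}

printf 'warm16\t3.26\nwarm24\t2.84\ncache256\t2.09\ncache512\t1.89\nwarm16+cache256\t2.81\n' | best_setting
interaction 3.26 2.09 2.81   # Warm=16, Cache=256, and both together
```

The combined run lands below even the best single setting, which is consistent with the "non-additive" label in the results.
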

**Phase 69-2: Apply the winning setting to the baseline** ✅ **Done**

- ✓ Added `HAKMEM_WARM_POOL_SIZE=16` as the default in `scripts/run_mixed_10_cleanenv.sh`
- ✓ Added `bench_setenv_default("HAKMEM_WARM_POOL_SIZE","16")` to the `MIXED_TINYV3_C7_SAFE` preset in `core/bench_profile.h`
- ✓ Added the new baseline to `PERFORMANCE_TARGETS_SCORECARD.md`:
  - Phase 69 baseline: 62.63M ops/s = 51.77% of mimalloc
  - M1 (50%) achievement: **EXCEEDED** (+1.77pp above target)
  - M2 (55%) progress: Gap reduced to +3.23pp
- ✓ Rollback: set `HAKMEM_WARM_POOL_SIZE=12` or remove the ENV variable

**New baseline**: 62.63M ops/s = **51.77%** of mimalloc (+3.26% over Phase 68; +3.23pp remaining to M2)
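
The "default plus rollback" behavior described above relies on a default that applies only when the caller has not set the variable. The sketch below shows that pattern in plain shell; it is an assumption about how the runner script and `bench_setenv_default` behave, not a copy of either, and the function name is hypothetical:

```bash
#!/usr/bin/env bash
# Apply the Phase 69 default only if HAKMEM_WARM_POOL_SIZE is unset.
# ${VAR=word} assigns word when VAR is unset and leaves an existing value alone.
apply_warm_pool_default() {
  : "${HAKMEM_WARM_POOL_SIZE=16}"
  export HAKMEM_WARM_POOL_SIZE
}

unset HAKMEM_WARM_POOL_SIZE
apply_warm_pool_default
echo "default run:  HAKMEM_WARM_POOL_SIZE=$HAKMEM_WARM_POOL_SIZE"

HAKMEM_WARM_POOL_SIZE=12   # rollback: the caller pins the previous value
apply_warm_pool_default
echo "rollback run: HAKMEM_WARM_POOL_SIZE=$HAKMEM_WARM_POOL_SIZE"
```

With this shape, rolling back needs no code change: exporting `HAKMEM_WARM_POOL_SIZE=12` before the run is enough.
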

---

**Phase 69-3 (next candidate): refill amount (ENV-only) sweep OR another sweep**

- **Option A (recommended)**: ENV sweep of the refill count (no code changes)
  - Sweep `HAKMEM_TINY_REFILL_COUNT_MID` (C4–C7) over 64/96/128/160…
  - Sweep `HAKMEM_TINY_REFILL_COUNT_HOT` (C0–C3) the same way (note that it interacts with WarmPool/UnifiedCache)
  - Decision: on the 10-run mean, GO (+1.0%) / strong GO (+3.0%) / NO-GO (-1.0%)

- **Option B**: fine sweep of the Unified Cache (ENV-only)
  - Sweep C5/C6/C7 over 192/256/320… (the 256/512 points in Phase 69-1 were coarse)
  - Isolate the cause of the non-additivity with WarmPool=16

- **Option C**: add a new compile-time knob (deferred)
  - `TINY_REFILL_BATCH_SIZE` is not wired up, so do not pursue it as-is
  - If needed, write a separate SSOT and implement it (Phase 70+)

- **Option D**: optimization in a different direction (shortest path to M2: 55%)
  - Remaining gap: +3.23pp (51.77% → 55%)
  - Phase 67b (boundary inline/unroll tuning)
  - Optimization of the top 50 hot functions
  - Re-tuning the PGO profile
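
The Option A decision is taken on a 10-run mean, so a sweep point needs a mean, a spread, and a threshold check. A minimal sketch, assuming one throughput value per line as input; `mean_cv` and `judge` are hypothetical names, CV here uses the population standard deviation, and the NEUTRAL band is borrowed from the Phase 67a rule:

```bash
#!/usr/bin/env bash
# mean_cv: read one throughput per line, print "<mean> <cv_percent>".
mean_cv() {
  awk '{ s += $1; ss += $1 * $1; n++ }
       END {
         if (n == 0) exit 1
         m = s / n
         var = ss / n - m * m          # population variance
         if (var < 0) var = 0          # guard against rounding noise
         printf "%.3f %.2f\n", m, sqrt(var) / m * 100
       }'
}

# judge BASE_MEAN NEW_MEAN: print the relative change and its verdict.
judge() {
  awk -v base="$1" -v new="$2" 'BEGIN {
    d = (new - base) / base * 100
    if (d >= 3.0)       v = "strong GO"
    else if (d >= 1.0)  v = "GO"
    else if (d <= -1.0) v = "NO-GO"
    else                v = "NEUTRAL"
    printf "%+.2f%% %s\n", d, v
  }'
}

judge 60.65 62.63   # Phase 69-1: Warm Pool Size=16
judge 60.65 62.37   # Phase 69-1: Warm Pool Size=24
```
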

---

**Phase 67b (follow-up / fallback): boundary inline/unroll tuning**

- **Caution**: high layout-tax risk (see Phase 64)
- **Prerequisite**: a Top 50 execution check is mandatory
- Recommended to defer, as a fallback in case Phase 69 misses

---

**Phase 70 (firming up observation prerequisites): make Step 0 of Refill/WarmPool optimization an SSOT**

- Goal: prevent **"optimizing a path that is never taken"** (layout-tax precedents in Phase 40/41/64)
- Note: `Route assignments: LEGACY` does not mean "Unified Cache unused" (it is the backend route kind)
- SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- Before proceeding with Phase 70, **confirm with OBSERVE** whether `unified_cache_refill()` / WarmPool pop occur at a meaningful rate under the Mixed SSOT (WS=400)
- ✅ Phase 70-1: Route Banner implemented (eliminates route misidentification)
  - ENV: `HAKMEM_ROUTE_BANNER=1`
  - Output: Route assignments (backend route kind) + cache config (unified_cache / warm_pool_max_per_class)
- ✅ Phase 70-3: SSOT for OBSERVE statistics consistency (eliminates "it just wasn't visible" accidents)
  - Confirm `Unified-STATS total_allocs == total_frees` before any discussion (a reliability gate for the statistics)
- ✅ Phase 70-2: Treatment of Refill optimization settled (SSOT)
  - If `Unified-STATS miss < 1000` under the Mixed SSOT (WS=400), **Refill optimization is frozen (zero ROI)**
  - Current measurement: misses are tiny (e.g. total miss=5) → Refill optimization has no ROI on the SSOT workload
  - Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
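
The two Phase 70 gates above (statistics reliability first, then the miss threshold) can be sketched as one check. The function name `refill_gate` and its output labels are hypothetical; the inputs are assumed to be integers already extracted from the `Unified-STATS` output, whose exact format is not shown in this file:

```bash
#!/usr/bin/env bash
# refill_gate TOTAL_ALLOCS TOTAL_FREES TOTAL_MISS
#   UNRELIABLE -> allocs != frees, so the statistics should not be discussed yet
#   FREEZE     -> miss < 1000, Refill optimization has no ROI on this workload
#   PROCEED    -> misses are frequent enough to justify Refill work
refill_gate() {
  local allocs="$1" frees="$2" miss="$3"
  if [ "$allocs" -ne "$frees" ]; then
    echo "UNRELIABLE"
  elif [ "$miss" -lt 1000 ]; then
    echo "FREEZE"
  else
    echo "PROCEED"
  fi
}

# Only miss=5 comes from this file; the alloc/free counts are placeholders.
refill_gate 1000000 1000000 5
```
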

---

**Phase 73: Pin down the "winning mechanism" of WarmPool=16 with perf** ✅ **Done, paradox resolved**

- Background: WarmPool=16 improves throughput/CV, yet the visible counters (Unified/WarmPool, etc.) are nearly identical → most likely a **"per-operation cost difference"** (TLB/LLC/frequency/placement)
- Goal: use **perf stat** to reduce the WarmPool=12 vs 16 difference to "what decreased", and commit to the next structural optimization (Phase 72)
- Method: **same binary + cleanenv + interleaved runs** (avoids layout tax and environment drift)
  - A: `HAKMEM_WARM_POOL_SIZE=12`
  - B: `HAKMEM_WARM_POOL_SIZE=16`
  - events: `cycles,instructions,branches,branch-misses,cache-misses,LLC-load-misses,iTLB-load-misses,dTLB-load-misses,page-faults`
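
The interleaved A/B method above can be sketched as a schedule generator plus a runner. This is an illustration under assumptions: the helper names and the `perf_<arm>_<n>.csv` output names are hypothetical, "cleanenv" is approximated with `env -i`, and the benchmark command is supplied by the caller:

```bash
#!/usr/bin/env bash
EVENTS="cycles,instructions,branches,branch-misses,cache-misses,LLC-load-misses,iTLB-load-misses,dTLB-load-misses,page-faults"

# ab_schedule PAIRS: print an interleaved A,B,A,B... schedule as "arm:index:setting".
ab_schedule() {
  local pairs="$1" i
  for ((i = 1; i <= pairs; i++)); do
    printf 'A:%d:HAKMEM_WARM_POOL_SIZE=12\n' "$i"
    printf 'B:%d:HAKMEM_WARM_POOL_SIZE=16\n' "$i"
  done
}

# run_ab PAIRS CMD [ARGS...]: run the same binary under both settings, alternating,
# writing perf stat counters as CSV to perf_<arm>_<index>.csv.
run_ab() {
  local pairs="$1"
  shift
  local arm i setting
  while IFS=: read -r arm i setting; do
    env -i PATH="$PATH" "$setting" \
      perf stat -x, -e "$EVENTS" -o "perf_${arm}_${i}.csv" -- "$@" </dev/null
  done < <(ab_schedule "$pairs")
}

ab_schedule 2   # prints the run order only; run_ab is not invoked here
```

Alternating the arms, rather than running all of A and then all of B, spreads thermal and frequency drift evenly across both settings.
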

**Results** (paradox):

- ✅ Throughput: +0.91% (46.52M → 46.95M ops/s)
- ✅ **instructions**: -0.38% (-17.4M instructions) ← **PRIMARY WIN SOURCE**
- ✅ **branches**: -0.30% (-3.7M branches) ← **SECONDARY WIN SOURCE**
- ⚠️ **dTLB-load-misses**: +29.06% (28,792 → 37,158) ← **WORSE**
- ⚠️ **cache-misses**: +17.80% (458K → 540K) ← **WORSE**
- ✓ page-faults: -0.21% (negligible)

**Phase 71 hypothesis (REJECTED)**:

- Prediction: "TLB/cache efficiency improvement from memory layout"
- Measured: TLB/cache metrics both **DEGRADED**

**Phase 73 conclusion**:

- Win source: **Control-flow optimization (instruction/branch count reduction)**
- Mechanism: WarmPool=16 selects a shorter code path → 17.4M fewer instructions
- Trade-off: +4MB RSS → worse TLB/cache, but instruction savings dominate
- Net benefit: ~8.2M cycles saved (instruction/branch) >> ~4.2M cycles lost (TLB/cache)
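
The net-benefit line above is a simple balance of estimated cycles. A small sketch of that arithmetic, using a hypothetical helper `net_cycles` and the approximate figures quoted in this file (both inputs are in millions of cycles):

```bash
#!/usr/bin/env bash
# net_cycles SAVED_M LOST_M: net saving and saved/lost ratio, inputs in millions of cycles.
net_cycles() {
  awk -v saved="$1" -v lost="$2" 'BEGIN {
    printf "net=%.1fM ratio=%.2f\n", saved - lost, saved / lost
  }'
}

net_cycles 8.2 4.2   # instruction/branch savings vs TLB/cache losses
```

On these estimates the savings are roughly twice the losses, leaving about 4M cycles of net gain, which is in line with the sub-1% throughput change measured under perf.
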

**Details**: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`, Phase 73 section

**Phase 72 (structural): amplify the WarmPool=16 win (after the Phase 73 results are in)**

- Prerequisite: start only after Phase 73 has pinned down the win numerically (tinkering on guesses would repeat Phase 40/41/64)
- Phase 73 conclusion: **the instruction/branch reduction is dominant** (TLB/cache actually got worse) → the essence is that WarmPool=16 makes execution take a "shorter path"

**Phase 72-0 (SSOT): identify "which functions got shorter" before making structural changes**

- Keep the A/B as WarmPool=12 vs 16 (same binary, cleanenv)
- Take perf record on **instructions/branches rather than cycles** (because the cause is the instruction/branch reduction)
  - `perf record -e instructions:u -c 100000 -- ./bench_random_mixed_hakmem_observe 20000000 400 1`
  - `perf record -e branches:u -c 100000 -- ./bench_random_mixed_hakmem_observe 20000000 400 1`
- Goal: determine the **top 3 functions whose instruction share / branch share decreased** under WarmPool=16 (e.g. `shared_pool_acquire_slab`, `unified_cache_refill`, `warm_pool_do_prefill`, `superslab_refill`, etc.)
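
Finding the "top 3 functions whose share decreased" amounts to diffing two per-symbol share tables. The sketch below assumes each profile has already been reduced to lines of `<share_percent> <symbol>` (for example by post-processing `perf report --stdio` output, which is not shown here); `top_drops` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# top_drops A_FILE B_FILE
# Each file holds "<share_percent> <symbol>" lines (A = WarmPool=12, B = WarmPool=16).
# Prints up to 3 symbols whose share fell the most from A to B as "<symbol> <delta>".
top_drops() {
  awk 'NR == FNR { a[$2] = $1; next }          # first file: remember A shares
       ($2 in a) {                              # second file: compare with A
         d = $1 - a[$2]
         if (d < 0) printf "%s %.2f\n", $2, d
       }' "$1" "$2" |
    sort -k2,2n | head -n 3                     # most negative delta first
}

# Example (not executed here; the file names are placeholders):
#   top_drops shares_warm12.txt shares_warm16.txt
```

Symbols that appear in only one profile are ignored by this sketch, so a function that disappears entirely under WarmPool=16 would need separate handling.
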
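
**Phase 72-1 (structural): touch only the identified functions (consolidate the box boundary into one place)**

- If `shared_pool_acquire_slab` is the main cause: a design that reduces "scan/lock/mmap" (fix the warm prefill boundary at a single place)
- If `unified_cache_refill` is the main cause: move "refill preparation/validation" to the boundary side and keep the hot side straight-line
- Note: the goal is not to "reduce misses" but to make execution **take a "shorter path" for the same misses** (the lesson of Phase 73)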

**Note**: do not delete research boxes now (there is strong precedent for link-out/deletion causing layout tax, so keeping them compiled out is the right call)

## 3) Archive

- Detailed log: `CURRENT_TASK_ARCHIVE_20251210.md`
- Snapshot from just before the latest cleanup: `docs/analysis/CURRENT_TASK_ARCHIVE.md`