# CURRENT_TASK (Rolling, SSOT)

## 0) Current source of truth (SSOT)

- **Reference for performance comparison**: FAST PGO build (`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`) + **WarmPool=16**
  - Phase 75 (C5/C6 inline slots) has been promoted to the presets
  - Phase 75-4 performed the FAST PGO rebase and confirmed **C5+C6=ON at +3.16% (GO)**. However, the **FAST PGO baseline itself is suspected to have regressed substantially since Phase 69**, so Phase 75-5 must regenerate the PGO profile
- **Reference for safety/compatibility**: Standard build (`make bench_random_mixed_hakmem`)
- **Reference for observation**: OBSERVE build (`make perf_observe`)
- **Scorecard (targets/current values)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md`
- **FAST baseline (SSOT)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` is authoritative (Phase 69: 62.63M ops/s = 51.77% of mimalloc)
- **Phase 75 measurement (Standard)**: `bench_random_mixed_hakmem` confirmed **A/B +5.41%** (Phase 75-3 4-point matrix)
- **Phase 75 measurement (FAST PGO)**: `bench_random_mixed_hakmem_minimal_pgo` confirmed **A/B +3.16%** (Phase 75-4 4-point matrix)
- Next target: **M2 = 55%** (judge the gap against the FAST baseline)
- **Mixed 10-run SSOT (harness)**: `scripts/run_mixed_10_cleanenv.sh`
  - Default: `BENCH_BIN=./bench_random_mixed_hakmem` (Standard)
  - For FAST PGO, set `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo` explicitly
  - Defaults: `ITERS=20000000 WS=400`, `HAKMEM_WARM_POOL_SIZE=16`, `HAKMEM_TINY_C5_INLINE_SLOTS=1`, `HAKMEM_TINY_C6_INLINE_SLOTS=1`
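The size of the remaining gap to M2 follows from the Phase 69 figures. A minimal sketch of that arithmetic, assuming the 55% target is measured against the same mimalloc throughput as the 51.77% figure; the implied mimalloc number is derived here, not read from the scorecard:

```python
# Figures from the FAST baseline entry (Phase 69).
FAST_BASELINE_MOPS = 62.63   # M ops/s
BASELINE_RATIO = 0.5177      # fraction of mimalloc throughput
M2_RATIO = 0.55              # ASSUMPTION: same mimalloc reference as the baseline ratio

# Implied mimalloc throughput (derived, not a measured value).
mimalloc_mops = FAST_BASELINE_MOPS / BASELINE_RATIO

# Throughput needed for M2 and the relative gain over the FAST baseline.
m2_target_mops = mimalloc_mops * M2_RATIO
required_gain_pct = (M2_RATIO / BASELINE_RATIO - 1.0) * 100.0

print(f"implied mimalloc: {mimalloc_mops:.2f} M ops/s")
print(f"M2 target:        {m2_target_mops:.2f} M ops/s")
print(f"required gain:    {required_gain_pct:+.2f}%")
```

Under that assumption, M2 corresponds to roughly 66.5M ops/s, or about +6.2% over the Phase 69 FAST baseline.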
## 1) Staying on track (routes/observation)

A minimal procedure to avoid "optimizing a path that is never exercised".

- **Route Banner (rules out route misidentification)**: `HAKMEM_ROUTE_BANNER=1`
  - Output: route assignments (backend route kind) + cache config (`unified_cache_enabled` / `warm_pool_max_per_class`)
- **Refill observation SSOT**: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
  - At WS=400 (Mixed SSOT) misses are negligible, so `unified_cache_refill()` optimization is **frozen (zero ROI)**
## 2) Recent conclusions (key points only)

- **Phase 69 (WarmPool sweep)**: `HAKMEM_WARM_POOL_SIZE=16` was a **strong GO (+3.26%)** and has been promoted to the baseline.
  - Design: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
  - Results: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
- **Phase 70 (observation SSOT)**: Made the statistics visible and established the prerequisite gates. Under the WS=400 SSOT, refill is cold.
  - SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- **Phase 71/73 (winning mechanism of WarmPool=16 confirmed)**: The gain comes from a **slight reduction in instructions/branches** (confirmed with perf stat).
  - Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
- **Phase 72 (ENV knob ROI exhausted)**: No ENV-only win beyond WarmPool=16, so the work moves to **structural (code) changes**.
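## 3) Operating rules (Box Theory + layout tax countermeasures)

- Every change must be stacked as **a box + a single boundary point + revertible via ENV** (fail-fast, minimal visibility).
- A/B tests should, as a rule, **toggle ENV on the same binary** (comparing separate binaries mixes in layout effects).
- "Faster after deleting code" is off-limits (link-out and large deletions easily flip the sign through layout tax); prefer **compile-out**.
- Diagnostics: `scripts/box/layout_tax_forensics_box.sh` / `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md`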
## 4) Next instructions (Active)

### Phase 74 (structure): shorten the UnifiedCache hit-path ✅ **P1 (LOCALIZE) frozen**

**Premises**:

- Under the WS=400 SSOT, UnifiedCache misses are negligible, so refill optimization has zero ROI.
- The WarmPool=16 win came from a slight instruction/branch reduction, so shortening the hit-path is the direct approach.

**Phase 74-1: LOCALIZE (ENV-gated)**: **Done (NEUTRAL +0.50%)**

- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1`
- Runtime branch overhead **increased** instructions/branches (+0.7%/+0.4%)
- Verdict: **NEUTRAL (+0.50%)**

**Phase 74-2: LOCALIZE (compile-time gate)**: **Done (NEUTRAL -0.87%)**

- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Runtime branch removed → instructions/branches **improved** (-0.6%/-2.3%) ✓
- However, **cache-misses rose +86%** (register pressure / spill) → throughput **-0.87%**
- Isolation succeeded: **LOCALIZE itself is a win, but the cache-miss increase cancels it out**
- Verdict: **NEUTRAL (-0.87%)** → **P1 (LOCALIZE) frozen**

**Conclusion**:

- P1 (LOCALIZE) is frozen at default OFF (low ROI from dependency chain reduction)
- Next: proceed to **Phase 74-3 (P0: FASTAPI)**

**Phase 74-3: P0 (FASTAPI)**: **Done (NEUTRAL +0.32%)**

**Goal**: Move the `unified_cache_enabled()` / `lazy-init` / `stats` checks **out of the hot loop**

**Approach**:

- Add the `unified_cache_push_fast()` / `unified_cache_pop_fast()` API
- Precondition: the caller guarantees "valid/enabled/no-stats"
- Fail-fast: fall back to the slow path on any unexpected state (single boundary point)
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)

**Results** (10-run Mixed SSOT, WS=400):

- Throughput: **+0.32%** (NEUTRAL, below +1.0% GO threshold)
- cache-misses: **-16.31%** (positive signal, insufficient throughput gain)

**Verdict**: **NEUTRAL (+0.32%)** → **P0 (FASTAPI) frozen**
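The three Phase 74 verdicts can be reproduced with a small classifier. This is only a sketch: the +1.0% GO threshold is stated in the results, while the symmetric -1.0% NO-GO bound and the `classify` helper are assumptions made for illustration.

```python
GO_THRESHOLD_PCT = 1.0       # stated: "+1.0% GO threshold"
NO_GO_THRESHOLD_PCT = -1.0   # ASSUMPTION: symmetric lower bound, not stated in this file


def classify(delta_pct: float) -> str:
    """Map a throughput delta (in percent) to a verdict label."""
    if delta_pct > GO_THRESHOLD_PCT:
        return "GO"
    if delta_pct < NO_GO_THRESHOLD_PCT:
        return "NO-GO"
    return "NEUTRAL"


# Throughput deltas reported for Phase 74-1, 74-2 and 74-3.
phase74 = {"74-1 LOCALIZE (ENV)": 0.50, "74-2 LOCALIZE (compiled)": -0.87, "74-3 FASTAPI": 0.32}
for name, delta in phase74.items():
    print(f"{name}: {delta:+.2f}% -> {classify(delta)}")
```

All three land in the NEUTRAL band, which matches the decision to freeze both P1 and P0.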

**References**:

- Design: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- Instructions: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- Results (P1/P0): `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md`

---
## Phase 75 (structure): Hot-class Inline Slots (P2) ✅ **Done (Standard A/B)**

**Goal**: Statistical analysis of C4-C7 → decide on a targeted optimization strategy

**Premises** (Phase 74 learnings):

- UnifiedCache hit-path optimization has low ROI, because of register pressure / cache-miss effects
- Next axis: **exploit per-class characteristics** → branch elimination via TLS-direct inline slots

**Phase 75-0: Per-Class Analysis**: **Done**

Per-class Unified-STATS (Mixed SSOT, WS=400, `HAKMEM_MEASURE_UNIFIED_CACHE=1`):

| Class | Capacity | Occupied | Hits | Pushes | Total Ops | Hit % | % of C4-C7 |
|-------|----------|----------|------|--------|-----------|-------|-----------|
| C6 | 128 | 127 | 2,750,854 | 2,750,855 | **5,501,709** | 100% | **57.2%** |
| C5 | 128 | 127 | 1,373,604 | 1,373,605 | **2,747,209** | 100% | **28.5%** |
| C4 | 64 | 63 | 687,563 | 687,564 | **1,375,127** | 100% | **14.3%** |
| C7 | ? | ? | ? | ? | **?** | ? | **?** |

**Key findings**:

1. C6 dominates: 57.2% of operations (2.75M hits)
2. All classes have a 100% hit rate (refill inactive in SSOT)
3. Cache occupancy near-capacity (98-99%)
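The share and occupancy figures can be recomputed from the raw counters. A minimal sketch using only the table values; since C7 has no data, the shares here are taken over C4-C6 alone, which is the reading that reproduces the reported percentages.

```python
# (capacity, occupied, hits, pushes) per class, copied from the table.
STATS = {
    "C6": (128, 127, 2_750_854, 2_750_855),
    "C5": (128, 127, 1_373_604, 1_373_605),
    "C4": (64, 63, 687_563, 687_564),
}

# Total operations per class = hits + pushes.
total_ops = {cls: hits + pushes for cls, (_, _, hits, pushes) in STATS.items()}
grand_total = sum(total_ops.values())

# Share of all measured operations, and cache occupancy, in percent.
share_pct = {cls: ops / grand_total * 100.0 for cls, ops in total_ops.items()}
occupancy_pct = {cls: occ / cap * 100.0 for cls, (cap, occ, _, _) in STATS.items()}

for cls in STATS:
    print(f"{cls}: ops={total_ops[cls]:,} share={share_pct[cls]:.1f}% "
          f"occupancy={occupancy_pct[cls]:.1f}%")
```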

**Phase 75-1: C6-only Inline Slots**: **Done (GO +2.87%)**

**Approach**: Modular box theory design with single decision point at TLS init

**Implementation** (5 new boxes + test script):

- ENV gate box: `HAKMEM_TINY_C6_INLINE_SLOTS=0/1` (lazy-init, default OFF)
- TLS extension: 128-slot ring buffer (1KB per thread, zero overhead when OFF)
- Fast-path API: `c6_inline_push()` / `c6_inline_pop()` (always_inline, 1-2 cycles)
- Integration: Minimal (2 boundary points: alloc/free for C6 class only)
- Backward compatible: Legacy code intact, fail-fast to unified_cache

**Results** (10-run Mixed SSOT, WS=400):

- Baseline (C6 inline OFF): **44.24 M ops/s**
- Treatment (C6 inline ON): **45.51 M ops/s**
- Delta: **+1.27 M ops/s (+2.87%)**

**Decision**: ✅ **GO** (exceeds +1.0% strict threshold)

**Mechanism**: Branch elimination on unified_cache for C6 (57.2% of C4-C7 ops)
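The control flow of an inline-slot box with fallback can be modeled in a few lines. This is a Python model of the idea only, not the C implementation: the `InlineSlots` class, the FIFO ordering, the 8-byte slot size, and the list standing in for unified_cache are all assumptions made for illustration.

```python
from typing import List, Optional

SLOT_BYTES = 8  # ASSUMPTION: one 64-bit pointer per slot, so 128 slots = 1 KiB


class InlineSlots:
    """Fixed-capacity ring of block addresses for one size class."""

    def __init__(self, capacity: int = 128) -> None:
        self._slots = [0] * capacity
        self._head = 0    # next index to pop
        self._tail = 0    # next index to push
        self._count = 0

    def push(self, ptr: int) -> bool:
        if self._count == len(self._slots):
            return False  # full: caller must fall back
        self._slots[self._tail] = ptr
        self._tail = (self._tail + 1) % len(self._slots)
        self._count += 1
        return True

    def pop(self) -> Optional[int]:
        if self._count == 0:
            return None   # empty: caller must fall back
        ptr = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return ptr


def free_block(ptr: int, inline: Optional[InlineSlots], unified: List[int]) -> str:
    """Free-side boundary: inline slots first, then the fallback cache."""
    if inline is not None and inline.push(ptr):
        return "inline"
    unified.append(ptr)
    return "unified"


def alloc_block(inline: Optional[InlineSlots], unified: List[int]) -> Optional[int]:
    """Alloc-side boundary: inline slots first, then the fallback cache."""
    if inline is not None:
        ptr = inline.pop()
        if ptr is not None:
            return ptr
    return unified.pop() if unified else None


print(128 * SLOT_BYTES)  # TLS footprint in bytes for a 128-slot ring
```

Passing `inline=None` models the ENV gate being OFF: every operation goes straight to the fallback, so the legacy path is unchanged.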

**References**:

- Per-class analysis: `docs/analysis/PHASE75_PERCLASS_ANALYSIS_0_SSOT.md`
- Results: `docs/analysis/PHASE75_C6_INLINE_SLOTS_1_RESULTS.md`

---
**Phase 75-2: C5 Inline Slots**: **Done (GO +1.10%)**

**Goal**: C5-only isolated measurement (28.5% of C4-C7) for individual contribution

**Approach**: Replicate C6 pattern with careful isolation

- Add C5 ring buffer (128 slots, 1KB TLS)
- ENV gate: `HAKMEM_TINY_C5_INLINE_SLOTS=0/1` (default OFF)
- Test strategy: C5-only (baseline C5=OFF+C6=ON, treatment C5=ON+C6=ON)
- Integration: alloc/free boundary points (C5 FIRST, then C6, then unified_cache)

**Results** (10-run Mixed SSOT, WS=400):

- Baseline (C5=OFF, C6=ON): **44.26 M ops/s** (σ=0.37)
- Treatment (C5=ON, C6=ON): **44.74 M ops/s** (σ=0.54)
- Delta: **+0.49 M ops/s (+1.10%)**

**Decision**: ✅ **GO** (C5 individual contribution validated)

**Cumulative Performance**:

- Phase 75-1 (C6): +2.87%
- Phase 75-2 (C5 isolated): +1.10%
- Combined potential: ~+3.97% (if additive)

**References**:

- Implementation details: `docs/analysis/PHASE75_2_C5_INLINE_SLOTS_IMPLEMENTATION.md`

---
**Phase 75-3: C5+C6 Interaction Test (4-Point Matrix A/B)**: **Done (STRONG GO +5.41%)**

**Goal**: Comprehensive interaction test + final promotion decision

**Approach**: 4-point matrix A/B test (single binary, ENV-only configuration)

- Point A (C5=0, C6=0): Baseline
- Point B (C5=1, C6=0): C5 solo
- Point C (C5=0, C6=1): C6 solo
- Point D (C5=1, C6=1): C5+C6 combined
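The four points are simply the Cartesian product of the two ENV knobs. A minimal sketch of how the matrix could be enumerated; the `matrix_points` helper is hypothetical, and this is not a description of how `scripts/phase75_3_matrix_test.sh` is actually written.

```python
from itertools import product

C5_KNOB = "HAKMEM_TINY_C5_INLINE_SLOTS"
C6_KNOB = "HAKMEM_TINY_C6_INLINE_SLOTS"

# Point labels keyed by (C5, C6), as defined in the list above.
LABELS = {(0, 0): "A", (1, 0): "B", (0, 1): "C", (1, 1): "D"}


def matrix_points() -> dict:
    """Return {label: ENV overrides} for every C5/C6 combination."""
    points = {}
    for c5, c6 in product((0, 1), repeat=2):
        points[LABELS[(c5, c6)]] = {C5_KNOB: str(c5), C6_KNOB: str(c6)}
    return points


for label, env in sorted(matrix_points().items()):
    settings = " ".join(f"{key}={value}" for key, value in env.items())
    print(f"Point {label}: {settings}")
```

Because only ENV values differ between points, all four runs use the same binary and layout effects stay constant across the matrix.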

**Results** (10-run per point, Mixed SSOT, WS=400):

- **Point A (baseline)**: 42.36 M ops/s
- **Point B (C5 solo)**: 43.54 M ops/s (+2.79% vs A)
- **Point C (C6 solo)**: 44.25 M ops/s (+4.46% vs A)
- **Point D (C5+C6)**: 44.65 M ops/s (+5.41% vs A) **[MAIN TARGET]**

**Additivity Analysis**:

- Expected additive (B+C-A): 45.43 M ops/s
- Actual (D): 44.65 M ops/s
- Sub-additivity: **1.72%** (near-perfect additivity, minimal negative interaction)
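The additivity numbers can be checked directly from the four measured points. In this sketch, sub-additivity is taken as the shortfall of D relative to the expected additive throughput; that definition is inferred from the reported figures, which it reproduces.

```python
# Mean throughput per matrix point, in M ops/s (from the results above).
POINTS = {"A": 42.36, "B": 43.54, "C": 44.25, "D": 44.65}


def gain_pct(value: float, base: float) -> float:
    """Relative gain of value over base, in percent."""
    return (value / base - 1.0) * 100.0


# If the C5 and C6 gains were perfectly additive, D would equal B + C - A.
expected = POINTS["B"] + POINTS["C"] - POINTS["A"]

# Shortfall of the measured D against that expectation, in percent.
sub_additivity_pct = (expected - POINTS["D"]) / expected * 100.0

for label in ("B", "C", "D"):
    print(f"{label} vs A: {gain_pct(POINTS[label], POINTS['A']):+.2f}%")
print(f"expected additive: {expected:.2f} M ops/s")
print(f"sub-additivity:    {sub_additivity_pct:.2f}%")
```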

**Perf Stat Validation (D vs A)**:

- Instructions: -6.1% (function call elimination confirmed)
- Branches: -6.1% (matches instruction reduction)
- Cache-misses: -31.5% (improved locality, NOT +86% like Phase 74-2)
- Throughput: +5.41% (net positive)

**Decision**: ✅ **STRONG GO (+5.41%)**

- D vs A: +5.41% >> 3.0% threshold
- Sub-additivity: 1.72% << 20% acceptable
- Phase 73 thesis validated: instructions/branches DOWN, throughput UP
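The two numeric criteria of this decision can be written as a single gate. A sketch only: the `strong_go` helper is hypothetical, and the strict inequalities are an assumption about how the 3.0% and 20% limits are applied.

```python
def strong_go(d_vs_a_pct: float, sub_additivity_pct: float,
              gain_threshold_pct: float = 3.0,
              sub_additivity_limit_pct: float = 20.0) -> bool:
    """True when the combined gain clears the threshold and interaction stays small."""
    return (d_vs_a_pct > gain_threshold_pct
            and sub_additivity_pct < sub_additivity_limit_pct)


# Phase 75-3 values: D vs A = +5.41%, sub-additivity = 1.72%.
print(strong_go(5.41, 1.72))
```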

**Promotion Completed**:

1. `core/bench_profile.h`: Added C5+C6 defaults to `bench_apply_mixed_tinyv3_c7_common()`
2. `scripts/run_mixed_10_cleanenv.sh`: Added C5+C6 ENV defaults
3. C5+C6 inline slots now **promoted to preset defaults** for MIXED_TINYV3_C7_SAFE

**Phase 75 Complete**: C5+C6 inline slots (129-256B) deliver a proven +5.41% gain **on the Standard binary** (`bench_random_mixed_hakmem`).

- Before updating the FAST PGO baseline (scorecard), re-measure the **same-condition A/B (C5/C6 OFF/ON)** with `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo`

### Phase 75-4 (FAST PGO rebase) ✅ Done

- Result: **+3.16% (GO)** (4-point matrix, after excluding outliers)
- Details: `docs/analysis/PHASE75_4_FAST_PGO_REBASE_RESULTS.md`
- Important: compared with the Phase 69 FAST baseline (62.63M), the **current FAST PGO baseline is suspected to be substantially lower** (PGO profile staleness / training mismatch / build drift)

### Phase 75-5 (PGO regeneration) 🟥 Next active (HIGH PRIORITY)

Purpose:

- Regenerate the PGO training against the current code, which includes the C5/C6 inline slots, and recover a Phase 69-class FAST baseline

Procedure outline:

1. Run PGO training with C5/C6=ON as the premise (always set `HAKMEM_TINY_C5_INLINE_SLOTS=1` / `HAKMEM_TINY_C6_INLINE_SLOTS=1` during training)
2. Regenerate `bench_random_mixed_hakmem_minimal_pgo` with `make pgo-fast-full`
3. Re-measure the 10-run baseline, then re-measure Phase 75-4 Points A/D
4. If layout tax / drift is suspected, classify the cause with `scripts/box/layout_tax_forensics_box.sh`
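The procedure can be laid out as a dry-run plan before anything is executed. This sketch only prints the commands. It assumes that `make pgo-fast-full` passes exported ENV through to its training run, that the harness honors C5/C6 overrides from the environment, and that the forensics script can be invoked without arguments; none of these is confirmed here.

```python
FAST_PGO_BIN = "./bench_random_mixed_hakmem_minimal_pgo"
HARNESS = "scripts/run_mixed_10_cleanenv.sh"
C5 = "HAKMEM_TINY_C5_INLINE_SLOTS"
C6 = "HAKMEM_TINY_C6_INLINE_SLOTS"


def inline_slots(on: bool) -> dict:
    """ENV overrides that switch both inline-slot boxes on or off."""
    value = "1" if on else "0"
    return {C5: value, C6: value}


def regeneration_plan() -> list:
    """Ordered steps mirroring the procedure outline (nothing is executed)."""
    return [
        {"step": "1-2: train and rebuild", "env": inline_slots(True),
         "argv": ["make", "pgo-fast-full"]},
        {"step": "3: Point A (C5/C6 OFF)",
         "env": {"BENCH_BIN": FAST_PGO_BIN, **inline_slots(False)}, "argv": [HARNESS]},
        {"step": "3: Point D (C5/C6 ON)",
         "env": {"BENCH_BIN": FAST_PGO_BIN, **inline_slots(True)}, "argv": [HARNESS]},
        {"step": "4: only if drift is suspected", "env": {},
         "argv": ["scripts/box/layout_tax_forensics_box.sh"]},
    ]


def as_shell(step: dict) -> str:
    """Render one step as a shell-style 'KEY=VALUE ... command' line."""
    assignments = [f"{key}={value}" for key, value in step["env"].items()]
    return " ".join(assignments + step["argv"])


for step in regeneration_plan():
    print(f'{step["step"]}: {as_shell(step)}')
```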

**References**:

- 4-point matrix results: `docs/analysis/PHASE75_3_C5_C6_INTERACTION_RESULTS.md`
- Test script: `scripts/phase75_3_matrix_test.sh`

## 5) Archive

- Detailed log: `CURRENT_TASK_ARCHIVE_20251210.md`
- Pre-cleanup snapshot: `docs/analysis/CURRENT_TASK_ARCHIVE.md`