hakmem/CURRENT_TASK.md

# CURRENT_TASK（Rolling, SSOT）

## 0) 今の「正」（SSOT）

- **性能比較の正**: FAST PGO build（`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`）＋ **WarmPool=16**
  - Phase 75（C5/C6 inline slots）は presets に昇格済み
  - Phase 75-4 で FAST PGO rebase を実施し **C5+C6=ON が +3.16% (GO)** を確認（ただし **FAST PGO baseline 自体が Phase 69 から大きく後退**している疑い → Phase 75-5 で PGO 再生成が必要）
- **安全・互換の正**: Standard build（`make bench_random_mixed_hakmem`）
- **観測の正**: OBSERVE build（`make perf_observe`）
- **スコアカード（目標/現在値）**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md`
  - **FAST baseline（SSOT）**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` を正とする（Phase 69: 62.63M ops/s = 51.77% of mimalloc）
  - **Phase 75 の計測（Standard）**: `bench_random_mixed_hakmem` で **A/B +5.41%** を確認（Phase 75-3 4-point matrix）
  - **Phase 75 の計測（FAST PGO）**: `bench_random_mixed_hakmem_minimal_pgo` で **A/B +3.16%** を確認（Phase 75-4 4-point matrix）
  - 次の目標: **M2 = 55%**（gap は FAST baseline を基準に判断する）
- **Mixed 10-run SSOT（ハーネス）**: `scripts/run_mixed_10_cleanenv.sh`
  - デフォルト `BENCH_BIN=./bench_random_mixed_hakmem`（Standard）
  - FAST PGO は `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo` を明示する
  - 既定: `ITERS=20000000 WS=400`、`HAKMEM_WARM_POOL_SIZE=16`、`HAKMEM_TINY_C5_INLINE_SLOTS=1`、`HAKMEM_TINY_C6_INLINE_SLOTS=1`

## 1) 迷子防止（経路/観測）

“経路が踏まれていない最適化” を防ぐための最小手順。

- **Route Banner（経路の誤認を潰す）**: `HAKMEM_ROUTE_BANNER=1`
  - 出力: Route assignments（backend route kind）+ cache config（`unified_cache_enabled` / `warm_pool_max_per_class`）
- **Refill観測のSSOT**: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
  - WS=400（Mixed SSOT）では miss が極小 → `unified_cache_refill()` 最適化は **凍結（ROIゼロ）**

## 2) 直近の結論（要点だけ）

- **Phase 69（WarmPool sweep）**: `HAKMEM_WARM_POOL_SIZE=16` が **強GO（+3.26%）**、baseline 昇格済み。
  - 設計: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
  - 結果: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
- **Phase 70（観測SSOT）**: 統計の見える化/前提ゲート確立。WS=400 SSOT では refill は冷たい。
  - SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- **Phase 71/73（WarmPool=16 の勝ち筋確定）**: 勝ち筋は **instruction/branch の微減**（perf stat で確定）。
  - 詳細: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
- **Phase 72（ENV knob ROI枯れ）**: WarmPool=16 を超える ENV-only 勝ち筋なし → **構造（コード）で攻める段階**。

## 3) 運用ルール（Box Theory + layout tax 対策）

- 変更は必ず **箱 + 境界1箇所 + ENVで戻せる** で積む（Fail-fast、最小可視化）。
- A/B は **同一バイナリでENVトグル**が原則（別バイナリ比較は layout が混ざる）。
- “削除して速い” は封印（link-out/大削除は layout tax で符号反転しやすい）→ **compile-out** を優先。
  - 診断: `scripts/box/layout_tax_forensics_box.sh` / `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md`

## 4) 次の指示書（Active）

### Phase 74（構造）: UnifiedCache hit-path を短くする ✅ **P1 (LOCALIZE) 凍結**

**前提**:
- WS=400 SSOT では UnifiedCache miss が極小 → refill最適化は ROIゼロ。
- WarmPool=16 の勝ちは instruction/branch 微減 → hit-path を短くするのが正攻法。

**Phase 74-1: LOCALIZE (ENV-gated)** ✅ **完了 (NEUTRAL +0.50%)**
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1`
- Runtime branch overhead で instructions/branches **増加** (+0.7%/+0.4%)
- 判定: **NEUTRAL (+0.50%)**

**Phase 74-2: LOCALIZE (compile-time gate)** ✅ **完了 (NEUTRAL -0.87%)**
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Runtime branch 削除 → instructions/branches **改善** (-0.6%/-2.3%) ✓
- しかし **cache-misses +86%** (register pressure / spill) → throughput **-0.87%**
- 切り分け成功: **LOCALIZE本体は勝ち、cache-miss 増加で相殺**
- 判定: **NEUTRAL (-0.87%)** → **P1 (LOCALIZE) 凍結**

**結論**:
- P1 (LOCALIZE) は default OFF で凍結（dependency chain 削減の ROI 低い）
- 次: **Phase 74-3 (P0: FASTAPI)** へ進む

**Phase 74-3: P0 (FASTAPI)** ✅ **完了 (NEUTRAL +0.32%)**

**Goal**: `unified_cache_enabled()` / `lazy-init` / `stats` 判定を **hot loop の外へ追い出す**

**Approach**:
- `unified_cache_push_fast()` / `unified_cache_pop_fast()` API 追加
- 前提: "valid/enabled/no-stats" を caller 側で保証
- Fail-fast: 想定外の状態なら slow path へ fallback（境界1箇所）
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)

**Results** (10-run Mixed SSOT, WS=400):
- Throughput: **+0.32%** (NEUTRAL, below +1.0% GO threshold)
- cache-misses: **-16.31%** (positive signal, insufficient throughput gain)

**判定**: **NEUTRAL (+0.32%)** → **P0 (FASTAPI) 凍結**

**参考**:
- 設計: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- 指示書: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- 結果 (P1/P0): `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md`

---

## Phase 75（構造）: Hot-class Inline Slots (P2) ✅ **完了（Standard A/B）**

**Goal**: C4-C7 の統計分析 → targeted optimization 戦略決定

**前提** (Phase 74 learnings):
- UnifiedCache hit-path optimization の ROI が低い ← register pressure / cache-miss effects
- 次の軸: **per-class 特性を活用** → TLS-direct inline slots で branch elimination

**Phase 75-0: Per-Class Analysis** ✅ **完了**

Per-class Unified-STATS (Mixed SSOT, WS=400, HAKMEM_MEASURE_UNIFIED_CACHE=1):

| Class | Capacity | Occupied | Hits | Pushes | Total Ops | Hit % | % of C4-C7 |
|-------|----------|----------|------|--------|-----------|-------|-----------|
| C6 | 128 | 127 | 2,750,854 | 2,750,855 | **5,501,709** | 100% | **57.2%** |
| C5 | 128 | 127 | 1,373,604 | 1,373,605 | **2,747,209** | 100% | **28.5%** |
| C4 | 64 | 63 | 687,563 | 687,564 | **1,375,127** | 100% | **14.3%** |
| C7 | ? | ? | ? | ? | **?** | ? | **?** |

**Key findings**:
1. C6 圧倒的支配: 57.2% の操作 (2.75M hits)
2. 全クラス 100% hit rate (refill inactive in SSOT)
3. Cache occupancy near-capacity (98-99%)

**Phase 75-1: C6-only Inline Slots** ✅ **完了 (GO +2.87%)**

**Approach**: Modular box theory design with single decision point at TLS init

**Implementation** (5 new boxes + test script):
- ENV gate box: `HAKMEM_TINY_C6_INLINE_SLOTS=0/1` (lazy-init, default OFF)
- TLS extension: 128-slot ring buffer (1KB per thread, zero overhead when OFF)
- Fast-path API: `c6_inline_push()` / `c6_inline_pop()` (always_inline, 1-2 cycles)
- Integration: Minimal (2 boundary points: alloc/free for C6 class only)
- Backward compatible: Legacy code intact, fail-fast to unified_cache

**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C6 inline OFF): **44.24 M ops/s**
- Treatment (C6 inline ON): **45.51 M ops/s**
- Delta: **+1.27 M ops/s (+2.87%)**

**Decision**: ✅ **GO** (exceeds +1.0% strict threshold)

**Mechanism**: Branch elimination on unified_cache for C6 (57.2% of C4-C7 ops)

**参考**:
- Per-class分析: `docs/analysis/PHASE75_PERCLASS_ANALYSIS_0_SSOT.md`
- 結果: `docs/analysis/PHASE75_C6_INLINE_SLOTS_1_RESULTS.md`

---

**Phase 75-2: C5 Inline Slots** ✅ **完了 (GO +1.10%)**

**Goal**: C5-only isolated measurement (28.5% of C4-C7) for individual contribution

**Approach**: Replicate C6 pattern with careful isolation
- Add C5 ring buffer (128 slots, 1KB TLS)
- ENV gate: `HAKMEM_TINY_C5_INLINE_SLOTS=0/1` (default OFF)
- Test strategy: C5-only (baseline C5=OFF+C6=ON, treatment C5=ON+C6=ON)
- Integration: alloc/free boundary points (C5 FIRST, then C6, then unified_cache)

**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C5=OFF, C6=ON): **44.26 M ops/s** (σ=0.37)
- Treatment (C5=ON, C6=ON): **44.74 M ops/s** (σ=0.54)
- Delta: **+0.49 M ops/s (+1.10%)**

**Decision**: ✅ **GO** (C5 individual contribution validated)

**Cumulative Performance**:
- Phase 75-1 (C6): +2.87%
- Phase 75-2 (C5 isolated): +1.10%
- Combined potential: ~+3.97% (if additive)

**参考**:
- 実装詳細: `docs/analysis/PHASE75_2_C5_INLINE_SLOTS_IMPLEMENTATION.md`

---

**Phase 75-3: C5+C6 Interaction Test (4-Point Matrix A/B)** ✅ **完了 (STRONG GO +5.41%)**

**Goal**: Comprehensive interaction test + final promotion decision

**Approach**: 4-point matrix A/B test (single binary, ENV-only configuration)
- Point A (C5=0, C6=0): Baseline
- Point B (C5=1, C6=0): C5 solo
- Point C (C5=0, C6=1): C6 solo
- Point D (C5=1, C6=1): C5+C6 combined

**Results** (10-run per point, Mixed SSOT, WS=400):
- **Point A (baseline)**: 42.36 M ops/s
- **Point B (C5 solo)**: 43.54 M ops/s (+2.79% vs A)
- **Point C (C6 solo)**: 44.25 M ops/s (+4.46% vs A)
- **Point D (C5+C6)**: 44.65 M ops/s (+5.41% vs A) **[MAIN TARGET]**

**Additivity Analysis**:
- Expected additive (B+C-A): 45.43 M ops/s
- Actual (D): 44.65 M ops/s
- Sub-additivity: **1.72%** (near-perfect additivity, minimal negative interaction)

**Perf Stat Validation (D vs A)**:
- Instructions: -6.1% (function call elimination confirmed)
- Branches: -6.1% (matches instruction reduction)
- Cache-misses: -31.5% (improved locality, NOT +86% like Phase 74-2)
- Throughput: +5.41% (net positive)

**Decision**: ✅ **STRONG GO (+5.41%)**
- D vs A: +5.41% >> 3.0% threshold
- Sub-additivity: 1.72% << 20% acceptable
- Phase 73 thesis validated: instructions/branches DOWN, throughput UP

**Promotion Completed**:
1. `core/bench_profile.h`: Added C5+C6 defaults to `bench_apply_mixed_tinyv3_c7_common()`
2. `scripts/run_mixed_10_cleanenv.sh`: Added C5+C6 ENV defaults
3. C5+C6 inline slots now **promoted to preset defaults** for MIXED_TINYV3_C7_SAFE

**Phase 75 Complete**: C5+C6 inline slots (129-256B) deliver +5.41% proven gain **on Standard binary**（`bench_random_mixed_hakmem`）。
- FAST PGO baseline（スコアカード）を更新する前に、`BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo` で **同条件の A/B（C5/C6 OFF/ON）** を再計測すること。

### Phase 75-4（FAST PGO rebase）✅ 完了

- 結果: **+3.16% (GO)**（4-point matrix、outlier 除外後）
- 詳細: `docs/analysis/PHASE75_4_FAST_PGO_REBASE_RESULTS.md`
- 重要: Phase 69 の FAST baseline (62.63M) と比較して **現行 FAST PGO baseline が大きく低い**疑い（PGO profile staleness / training mismatch / build drift）

### Phase 75-5（PGO 再生成）🟥 次のActive（HIGH PRIORITY）

目的:
- C5/C6 inline slots を含む現行コードに対して PGO training を再生成し、Phase 69 クラスの FAST baseline を取り戻す。

手順（骨子）:
1. PGO training を “C5/C6=ON” 前提で回す（training 時に `HAKMEM_TINY_C5_INLINE_SLOTS=1` / `HAKMEM_TINY_C6_INLINE_SLOTS=1` を必ず設定）
2. `make pgo-fast-full` で `bench_random_mixed_hakmem_minimal_pgo` を再生成
3. 10-run で baseline を再測定し、Phase 75-4 の Point A/D を再計測
4. Layout tax / drift の疑いが出たら `scripts/box/layout_tax_forensics_box.sh` で原因分類

**参考**:
- 4-point matrix 結果: `docs/analysis/PHASE75_3_C5_C6_INTERACTION_RESULTS.md`
- Test script: `scripts/phase75_3_matrix_test.sh`

## 5) アーカイブ

- 詳細ログ: `CURRENT_TASK_ARCHIVE_20251210.md`
- 整理前スナップショット: `docs/analysis/CURRENT_TASK_ARCHIVE.md`