# CURRENT_TASK (Rolling, SSOT)
## SSOT (current source of truth)
- **Performance SSOT**: `scripts/run_mixed_10_cleanenv.sh` (WS=400, RUNS=10, size range forced to 16..1040, *_ONLY forced OFF)
- **Route verification**: `scripts/run_mixed_observe_ssot.sh` (OBSERVE only; not used for throughput comparison)
- **Build modes**: `docs/analysis/SSOT_BUILD_MODES.md`
- **External comparison (short)**: `docs/analysis/PHASE92_TCMALLOC_GAP_TRIAGE_SSOT.md` (LD_PRELOAD on the same binary + hakmem_force_libc isolation)
## Phase 87-88 (Closed): NO-GO
**Status**: ✅ **OBSERVE verified** + ❌ **Phase 88 NO-GO**
### Phase 87: Inline Slots Verification
**Initial Finding (Wrong)**: Standard binary showed PUSH TOTAL/POP TOTAL = 0
- **Root Cause**: ENV drift (stray `HAKMEM_BENCH_MIN_SIZE/MAX_SIZE`)
- Fix: `scripts/run_mixed_10_cleanenv.sh` now pins the size range (MIN=16, MAX=1040)
- `HAKMEM_BENCH_C5_ONLY=0`, `HAKMEM_BENCH_C6_ONLY=0`, `HAKMEM_BENCH_C7_ONLY=0` forced
**Corrected Finding (OBSERVE binary)** - 20M ops Mixed SSOT WS=400:
```
PUSH TOTAL: C4=687,564 C5=1,373,605 C6=2,750,862 TOTAL=4,812,031 ✓
POP TOTAL: C4=687,564 C5=1,373,605 C6=2,750,862 TOTAL=4,812,031 ✓
PUSH FULL: 0 (0.00%)
POP EMPTY: 168 (0.003%)
JUDGMENT: ✓ [C] LEGACY used AND C4/C5/C6 INLINE SLOTS ACTIVE → Ready for Phase 88/89
```
### Phase 88: Batch Drain Optimization
**Overflow Analysis**:
- POP EMPTY rate: 168 / 4,812,031 = **0.003%** ← negligible
- PUSH FULL rate: 0 / 4,812,031 = **0%** ← never occurs
- **Decision**: batching cannot move throughput (overflow is essentially absent)
**Phase 88 Decision**: **NO-GO (frozen)**
- Rationale: at a 0.003% overflow rate, the layout-tax risk outweighs the expected gain
- Infrastructure: the observation telemetry stays (re-verify if WS/capacity change later)
**Artifacts Created**:
- Telemetry box: `core/box/tiny_inline_slots_overflow_stats_box.h/c`
- Phase 87 results: `docs/analysis/PHASE87_OBSERVATION_RESULTS.md`
- SSOT hardening: `scripts/run_mixed_10_cleanenv.sh`, `scripts/run_mixed_observe_ssot.sh`
- ENV drift prevention doc: `docs/analysis/BENCH_REPRODUCIBILITY_SSOT.md`
**Key Learning**:
- Confirming that a path is actually taken requires an **OBSERVE binary + total counters**
- Keep observation and performance measurement separate (avoid telemetry overhead)
- ENV drift (MIN/MAX size, CLASS_ONLY) is the main way routes silently change
**Follow-up Fix (SSOT hardening)**:
- `scripts/run_mixed_10_cleanenv.sh` now forces `HAKMEM_BENCH_MIN_SIZE=16` / `HAKMEM_BENCH_MAX_SIZE=1040` and disables `HAKMEM_BENCH_C{5,6,7}_ONLY` to prevent path drift.
- New pre-flight helper: `scripts/run_mixed_observe_ssot.sh` (Route Banner + OBSERVE, single run).
- Overflow stats compile gating fixed (see above).
---
## Phase 89 (Complete): Bottleneck Analysis & Optimization Roadmap
**Status**: ✅ **SSOT Measurement Complete** + **3 Optimization Candidates Identified**
### 4-Step SSOT Procedure Completion
**Step 1: OBSERVE Binary Preflight**
- Binary: `bench_random_mixed_hakmem_observe` (with telemetry enabled)
- Inline slots verification: ✓ PUSH TOTAL = 4.81M, POP EMPTY = 0.003% (confirmed active & healthy)
- Throughput (with telemetry): 51.52M ops/s
**Step 2: Standard 10-run Baseline**
- Binary: `bench_random_mixed_hakmem` (clean, no telemetry)
- 10-run SSOT results: **51.36M ops/s** (CV: 0.7%, very stable)
- Range: 50.74M - 51.73M
- **Decision**: This is baseline for bottleneck analysis
**Step 3: FAST PGO 10-run Comparison**
- Binary: `bench_random_mixed_hakmem_minimal_pgo` (PGO optimized)
- 10-run SSOT results: **54.16M ops/s** (CV: 1.5%, acceptable)
- Range: 52.89M - 55.13M
- **Performance Gap**: 54.16M - 51.36M = **2.80M ops/s (+5.45%)**
- This represents the optimization ceiling with current PGO profile
**Step 4: Results Captured**
- Git SHA: e4c5f0535 (master branch)
- Timestamp: 2025-12-18 23:06:01
- System: AMD Ryzen 5825U, 16 cores, 6.8.0-87-generic kernel
- Files: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`
### Perf Analysis & Top Bottleneck Identification
**Profile Run**: 40M operations (0.78s), 833 perf samples
**Top Functions by CPU Time**:
1. **free** - 27.40% (hottest)
2. main - 26.30% (benchmark loop, not optimizable)
3. **malloc** - 20.36% (hottest)
4. malloc.cold - 10.65% (cold path, avoid optimizing)
5. free.cold - 5.59% (cold path, avoid optimizing)
6. **tiny_region_id_write_header** - 2.98% (hot, inlining candidate)
**malloc + free combined = 47.76% of CPU time** (already Phase 9/10/78-1/80-1 optimized)
### Top 3 Optimization Candidates (Ranked by Priority)
| Candidate | Priority | Recommendation | Expected Gain | Risk | Effort |
|-----------|----------|-----------------|----------------|------|--------|
| **tiny_region_id_write_header always_inline** | **HIGH** | **PURSUE** | +1-2% | LOW | 1-2h |
| malloc/free branch reduction | MEDIUM | DEFER | +2-3% | MEDIUM | 20-40h |
| Cold-path optimization | LOW | **AVOID** | +1% | HIGH | 10-20h |
**Candidate 1: tiny_region_id_write_header always_inline (2.98% CPU)**
- Current: Selective inlining from `core/region_id_v6.c`
- Proposal: Force `always_inline` for hot-path call sites
- **Layout Impact**: MINIMAL (no code bulk, maintains I-cache discipline)
- **Recommendation**: YES - PURSUE
- Estimated timeline: Phase 90
- Implementation: 1-2 lines, add `__attribute__((always_inline))` wrapper
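For concreteness, a minimal sketch of what a forced-inline wrapper could look like; the wrapper name and body below are illustrative, not the actual `core/region_id_v6.c` layout:

```c
#include <stdint.h>

/* Hypothetical sketch for Candidate 1. The real header write lives in
 * core/region_id_v6.c; the body here is illustrative only. */
static inline __attribute__((always_inline))
void tiny_region_id_write_header_hot(void *block, uint8_t class_id)
{
    /* Single store of the class id; forcing inlining removes the call
     * overhead that showed up as 2.98% CPU in the Phase 89 profile. */
    ((uint8_t *)block)[0] = class_id;
}
```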
**Candidate 2: malloc/free branch reduction (47.76% CPU)**
- Current: Phase 9/10/78-1/80-1/83-1 already optimized
- Observation: 56.4M branch-misses (branch prediction pressure)
- Proposal: Pre-compute routing tables (like Phase 85 approach)
- **Risk**: Code bloat, potential layout tax regression (Phase 85 was NO-GO)
- **Recommendation**: DEFER
- Wait for workload characteristics that justify complexity
- Current gains have reached a saturation point
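For reference only (this candidate is deferred), a pre-computed routing table in this spirit could look roughly like the following; the enum, class count, and routing policy are assumptions, not hakmem's actual model:

```c
/* Hypothetical sketch of a precomputed routing table (Candidate 2, deferred).
 * One table lookup replaces a chain of branches; values are illustrative. */
typedef enum { ROUTE_INLINE_SLOTS, ROUTE_UNIFIED_CACHE, ROUTE_LEGACY } route_kind_t;

static route_kind_t g_route_by_class[8];      /* filled once at init */

static void route_table_init(void)
{
    for (int c = 0; c < 8; c++)
        g_route_by_class[c] = (c >= 4 && c <= 6) ? ROUTE_INLINE_SLOTS
                                                 : ROUTE_UNIFIED_CACHE;
}

static inline route_kind_t route_for_class(int class_id)
{
    return g_route_by_class[class_id & 7];    /* branch-free dispatch key */
}
```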
---
## Phase 91 (Closed: NEUTRAL / frozen)
**Status**: ⚪ **NEUTRAL** (C6 IFL: +0.38% / 10-run) → kept with default OFF
- Goal: replace the C6 inline slots FIFO with an intrusive LIFO to cut the fixed tax
- Results (SSOT 10-run):
- Control (`HAKMEM_TINY_C6_INLINE_SLOTS_IFL=0`): mean 52.05M
- Treatment (`HAKMEM_TINY_C6_INLINE_SLOTS_IFL=1`): mean 52.25M
- Δ **+0.38%** (below the +1.0% GO threshold)
- Verdict: **frozen (research box)**
- No regression, but the ROI is too small to roll out to C5/C4
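For the record, the intrusive-LIFO idea behind this phase can be sketched as below, assuming every C6 block can hold a next pointer; the names are illustrative and the box stays default OFF:

```c
/* Hypothetical sketch of the Phase 91 idea: an intrusive LIFO free list that
 * stores the "next" pointer inside the freed block itself, so push/pop avoid
 * the ring-buffer index bookkeeping. Names are illustrative. */
static __thread void *c6_lifo_head;          /* per-thread head of the intrusive list */

static inline void c6_lifo_push(void *block)
{
    *(void **)block = c6_lifo_head;          /* link previous head into the block */
    c6_lifo_head = block;
}

static inline void *c6_lifo_pop(void)
{
    void *block = c6_lifo_head;
    if (block)
        c6_lifo_head = *(void **)block;      /* unlink */
    return block;                            /* NULL means "fall back to the next tier" */
}
```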
---
## Phase 92 (Planned)
**Status**: 🔍 **Next phase in planning**
**Goal**: quickly triage the tcmalloc performance gap (hakmem: 52M vs tcmalloc: 58M, -12.8%) into root causes
**Planned work**:
1. Case A: small vs large object isolation test (C6-only vs C7-only)
2. Case B: Inline Slots vs Unified Cache isolation test
3. Case C: LIFO vs FIFO comparison
4. Case D: pool size sensitivity test
**Duration**: 1-2h (short triage)
**Output**: identify the primary bottleneck → select the next candidate
**References**:
- Triage Plan: `docs/analysis/PHASE92_TCMALLOC_GAP_TRIAGE_SSOT.md`
---
**Candidate 3: Cold-path de-duplication (16.24% CPU)**
- Current: malloc.cold (10.65%) + free.cold (5.59%) explicitly separated
- Rationale: Separation improves hot-path I-cache utilization
- **Recommendation**: AVOID
- Aligns with the "avoid layout tax" principle
- Optimizing cold paths would ADD code to hot path (violates design)
### Key Performance Insights
**FAST PGO vs Standard (+5.45%) breakdown**:
- PGO branch prediction optimization: ~3%
- Code layout optimization: ~2%
- Inlining decisions: ~0.5%
**Conclusion**: Standard build limited by branch prediction pressure; further gains require architectural tradeoffs.
**Inline Slots Health**: Working perfectly - 0.003% overflow rate confirms no bottleneck
### References & Artifacts
- SSOT Measurement: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`
- Bottleneck Analysis: `docs/analysis/PHASE89_BOTTLENECK_ANALYSIS.md`
- Perf Stats: `docs/analysis/PHASE89_PERF_STAT.txt`
- Scripts: `scripts/run_mixed_10_cleanenv.sh`, `scripts/run_mixed_observe_ssot.sh`
---
## Phase 86 (Closed): NO-GO
**Status**: ❌ NO-GO (+0.25% improvement, threshold: +1.0%)
**A/B Test (10-run SSOT)**:
- Control: 51,750,467 ops/s (CV: 2.26%)
- Treatment: 51,881,055 ops/s (CV: 2.32%)
- Delta: +0.25% (mean), -0.15% (median)
**Summary**: Free path legacy mask (mask-only) optimization for LEGACY classes.
- Design: Bitset mask + direct call (avoids Phase 85's indirect call problems)
- Implementation: Correct (0x7f mask computed, C0-C6 optimized)
- Root cause: Competing Phase 9/10 optimizations (+1.89%) already capture most benefit
- Conclusion: Free path optimization layer has reached practical ceiling
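A minimal sketch of the mask-only shape, assuming the legacy free call's signature shown below (the signature and helper names are assumptions; only the 0x7f mask and the direct-call design come from this phase):

```c
#include <stdint.h>

/* Minimal sketch of the Phase 86 mask-only check (research box, default OFF).
 * The mask value 0x7f marks C0-C6 as LEGACY; helper names are illustrative. */
extern void tiny_legacy_fallback_free_base_with_env(void *base, int class_id); /* assumed signature */

static uint8_t g_legacy_class_mask;          /* refreshed once from ENV; 0x7f when enabled */

static inline int free_path_try_legacy_mask(void *base, int class_id)
{
    if (!(g_legacy_class_mask & (1u << class_id)))
        return 0;                            /* not LEGACY: caller keeps going */
    tiny_legacy_fallback_free_base_with_env(base, class_id);   /* direct call, no function pointer */
    return 1;                                /* handled */
}
```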
---
## 0) Current source of truth (SSOT)
- **Current SSOT (Phase 89 capture / Git SHA: e4c5f0535)**:
- Standard (`./bench_random_mixed_hakmem`) 10-run mean: **51.36M ops/s** (CV ~0.7%)
- FAST PGO minimal (`./bench_random_mixed_hakmem_minimal_pgo`) 10-run mean: **54.16M ops/s** (CV ~1.5%, +5.45% vs Standard)
- OBSERVE (`./bench_random_mixed_hakmem_observe`): 51.52M ops/s (telemetry included; not the reference for performance comparison)
- SSOT capture: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`
- **Source of truth for optimization decisions**: same-binary A/B via ENV toggles, `scripts/run_mixed_10_cleanenv.sh`
- **Source of truth for mimalloc/tcmalloc comparison**: reference only (separate binary / LD_PRELOAD), `docs/analysis/ALLOCATOR_COMPARISON_SSOT.md`
- **Scorecard (targets / current values)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` (already reflects the Phase 89 SSOT as the current snapshot)
- Phase 66/68/69 (the 60M-62M range) is **historical** (do not compare directly against the current HEAD; rebase first if a comparison is needed)
- **Next phase (design review)**: `docs/analysis/PHASE90_STRUCTURAL_REVIEW_AND_GAP_TRIAGE_SSOT.md`
- **Mixed 10-run SSOT harness**: `scripts/run_mixed_10_cleanenv.sh`
- Default `BENCH_BIN=./bench_random_mixed_hakmem` (Standard)
- For FAST PGO, set `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo` explicitly
- Defaults: `ITERS=20000000 WS=400`, `HAKMEM_WARM_POOL_SIZE=16`, `HAKMEM_TINY_C4_INLINE_SLOTS=1`, `HAKMEM_TINY_C5_INLINE_SLOTS=1`, `HAKMEM_TINY_C6_INLINE_SLOTS=1`, `HAKMEM_TINY_INLINE_SLOTS_FIXED=1`, `HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1`
- Pinned OFF by cleanenv (prevents drift): `HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0` (Phase 83-1 NO-GO / research)
## 0a) Anti-churn rules (minimum SSOT discipline)
- **Always set `HAKMEM_PROFILE` explicitly for hakmem** (if left unset, the route changes and the numbers easily fall apart).
- Recommended: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` (speed-first)
- Use a separate runner per purpose:
- hakmem SSOT (optimization decisions): `scripts/run_mixed_10_cleanenv.sh`
- allocator reference (short runs): `scripts/run_allocator_quick_matrix.sh`
- allocator reference (minimized layout differences): `scripts/run_allocator_preload_matrix.sh`
- Keep reproducibility logs (the bare minimum when chasing single-digit percent):
- `scripts/bench_ssot_capture.sh`
- `HAKMEM_BENCH_ENV_LOG=1` (records CPU governor/EPP/freq)
- External consultation (paste packet): `docs/analysis/FREE_PATH_REVIEW_PACKET_CHATGPT.md` (generated by `scripts/make_chatgpt_pro_packet_free_path.sh`)
## 0b) Allocator comparison (reference)
- Allocator comparison (system/jemalloc/mimalloc/tcmalloc) is **reference only** (separate binaries / LD_PRELOAD → includes layout differences)
- SSOT: `docs/analysis/ALLOCATOR_COMPARISON_SSOT.md`
- **Quick (Random Mixed 10-run)**: `scripts/run_allocator_quick_matrix.sh`
- **Important**: run hakmem with `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` set explicitly, via `scripts/run_mixed_10_cleanenv.sh` (a missing PROFILE corrupts the numbers)
- **Same-binary (recommended; minimizes layout differences)**: `scripts/run_allocator_preload_matrix.sh`
- Keep `bench_random_mixed_system` fixed and swap allocators via `LD_PRELOAD`.
- Note: this takes a different path from hakmem's **linked benchmarks** (`bench_random_mixed_hakmem*`); LD_PRELOAD goes through the drop-in wrapper, so the two are not interchangeable.
- **Scenario CSV (small-scale reference)**: `scripts/bench_allocators_compare.sh`
## 1) Avoiding getting lost (routes / observation)
Minimum procedure to avoid "optimizing a path that is never taken".
- **Route Banner (kills route misidentification)**: `HAKMEM_ROUTE_BANNER=1`
- Output: route assignments (backend route kind) + cache config (`unified_cache_enabled` / `warm_pool_max_per_class`)
- **Refill observability SSOT**: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- At WS=400 (Mixed SSOT) misses are negligible → `unified_cache_refill()` optimization is **frozen (zero ROI)**
## 2) Recent conclusions (key points only)
- **Phase 69 (WarmPool sweep)**: `HAKMEM_WARM_POOL_SIZE=16` is a **strong GO (+3.26%)**, already promoted to baseline.
- Design: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
- Results: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
- **Phase 70 (observability SSOT)**: statistics made visible, precondition gates established. At WS=400 SSOT, refill stays cold.
- SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- **Phase 71/73 (WarmPool=16 win confirmed)**: the win comes from a **small reduction in instructions/branches** (confirmed via perf stat).
- Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
- **Phase 72 (ENV knob ROI exhausted)**: no ENV-only win beyond WarmPool=16 → time to attack **structure (code)**
- **Phase 78-1 (structural)**: froze the per-op ENV gate for Inline Slots enable; same-binary A/B: **GO (+2.31%)**
- Results: `docs/analysis/PHASE78_1_INLINE_SLOTS_FIXED_MODE_RESULTS.md`
- **Phase 80-1 (structural)**: converted the Inline Slots if-chain to switch dispatch; same-binary A/B: **GO (+1.65%)**
- Results: `docs/analysis/PHASE80_INLINE_SLOTS_SWITCH_DISPATCH_1_RESULTS.md`
- **Phase 83-1 (structural)**: froze the per-op ENV gate for switch dispatch (Phase 78-1 pattern); same-binary A/B: **NO-GO (+0.32%, branch reduction negligible)**
- Results: `docs/analysis/PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md`
- Cause: the lazy-init pattern is already optimized (per-op overhead minimal) → fixed mode has almost no ROI
## 2a) Next major direction (design order, SSOT)
Goal: even while "mimalloc/tcmalloc are too strong", aim for **+5-10%** without breaking Box Theory (single boundary, reversible, minimal visibility, fail-fast).
Priority order (taking cues from the core of Google/TCMalloc):
1. **Batch the ThreadCache overflow (top priority)**
- When the inline slots (C4/C5/C6) fill up, cool the overflow in batches rather than one block at a time (see the sketch after this list)
- Keep the conversion point in one place (flush/drain)
2. **Batched push/pop on the Central/Shared side (second)**
- Batch the consolidation into shared/remote to cut the number of lock/atomic operations
3. **Memory return / footprint policy (operational axis)**
- Turn the Balanced/Lean wins (syscall/RSS drift/tail) into SSOT while pushing only as far as speed allows
Important: we are still fixing the design core. Implement only after **measurements confirm the overflow frequency is high enough**.
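The sketch referenced in item 1: on a full C4/C5/C6 ring, drain a batch in one flush call instead of spilling blocks one at a time. The ring layout, capacity, and the batch sink `unified_cache_push_many()` are assumptions for illustration:

```c
#include <stdint.h>

/* Hypothetical sketch of direction 1: when an inline-slot ring is full, drain a
 * batch in one call. The structures and the sink below are illustrative only. */
#define C6_RING_CAP 128u

typedef struct {
    void    *slots[C6_RING_CAP];
    uint32_t count;
} c6_ring_t;

extern void unified_cache_push_many(int class_id, void **blocks, uint32_t n);  /* assumed batch sink */

static void c6_ring_drain_half(c6_ring_t *ring, int class_id)
{
    uint32_t batch = ring->count / 2;                 /* keep the hot half resident */
    unified_cache_push_many(class_id, ring->slots + (ring->count - batch), batch);
    ring->count -= batch;                             /* single boundary point: this flush */
}
```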
## 2b) Next work (on hold)
Waiting for the job delegated to another agent (Claude Code) to finish.
Checks to start once it completes (the two minimum items):
- **Measure the inline slots overflow rate** (FULL/overflow counts and ratios for C4/C5/C6)
- **Quantify the cost of the overflow target** (perf stat / perf report on the functions hit on overflow)
Once both are in, proceed to Phase 86 (overflow batch design).
## 3) Operating rules (Box Theory + layout tax countermeasures)
- Every change ships as **a box + a single boundary + ENV-revertible** (fail-fast, minimal visibility)
- A/B tests use **the same binary with ENV toggles** as a rule (separate-binary comparisons mix in layout effects).
- SSOT operating policy (anti-churn): `docs/analysis/PHASE75_6_SSOT_POLICY_FAST_PGO_VS_STANDARD.md`
- "Delete it and it gets faster" is banned (link-out / large deletions easily flip sign via layout tax) → prefer **compile-out**.
- Diagnostics: `scripts/box/layout_tax_forensics_box.sh` / `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md`
- Research box inventory (SSOT): `docs/analysis/RESEARCH_BOXES_SSOT.md`
- Knob list: `scripts/list_hakmem_knobs.sh`
## 5) Research box handling (freeze policy)
- **Phase 79-1 (C2 local cache)**: `HAKMEM_TINY_C2_LOCAL_CACHE=0/1`
- Results: +0.57% (NO-GO, below the +1.0% threshold) → **research box freeze**
- **Default OFF** in SSOT/cleanenv (`scripts/run_mixed_10_cleanenv.sh` forces `0`)
- Not physically deleted (avoids layout-tax risk)
- **Phase 82 (hardening)**: C2 local cache fully excluded from the hot path (even with the env var set, the alloc/free hot paths never touch it)
- Record: `docs/analysis/PHASE82_C2_LOCAL_CACHE_HOTPATH_EXCLUSION.md`
- **Phase 85 (free path commit-once, LEGACY-only)**: `HAKMEM_FREE_PATH_COMMIT_ONCE=0/1`
- Results: **NO-GO (-0.86%)****research box freeze (default OFF)**
- Reason: overlaps with Phase 10 (MONO LEGACY DIRECT) and adds indirect-call / layout tax on top
- Record: `docs/analysis/PHASE85_FREE_PATH_COMMIT_ONCE_RESULTS.md`
## 4) Next instructions (Active)
### Phase 74 (structural): shorten the UnifiedCache hit path ✅ **P1 (LOCALIZE) frozen**
**Premises**:
- At WS=400 SSOT, UnifiedCache misses are negligible → refill optimization has zero ROI.
- The WarmPool=16 win came from slightly fewer instructions/branches → shortening the hit path is the straightforward play.
**Phase 74-1: LOCALIZE (ENV-gated)** ✅ **Done (NEUTRAL +0.50%)**
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1`
- Runtime branch overhead **increased** instructions/branches (+0.7%/+0.4%)
- Verdict: **NEUTRAL (+0.50%)**
**Phase 74-2: LOCALIZE (compile-time gate)** ✅ **Done (NEUTRAL -0.87%)**
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Removing the runtime branch **improved** instructions/branches (-0.6%/-2.3%) ✓
- But **cache-misses +86%** (register pressure / spills) → throughput **-0.87%**
- Isolation succeeded: **LOCALIZE itself wins, but the cache-miss increase cancels it out**
- Verdict: **NEUTRAL (-0.87%)****P1 (LOCALIZE) frozen**
**Conclusion**:
- P1 (LOCALIZE) frozen with default OFF (low ROI from dependency-chain reduction)
- Next: proceed to **Phase 74-3 (P0: FASTAPI)**
**Phase 74-3: P0 (FASTAPI)** ✅ **Done (NEUTRAL +0.32%)**
**Goal**: hoist the `unified_cache_enabled()` / lazy-init / stats checks **out of the hot loop**
**Approach** (see the sketch after this list):
- Add `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
- Precondition: the caller guarantees "valid/enabled/no-stats"
- Fail-fast: fall back to the slow path on any unexpected state (single boundary)
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
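A sketch of the intended calling pattern; only `unified_cache_enabled()` and the `_fast` names come from this section, while the signatures and the slow-path entry are assumptions:

```c
extern int  unified_cache_enabled(void);                 /* existing gate (name from this doc) */
extern int  unified_cache_push_fast(int cls, void *p);   /* assumed signature */
extern void unified_cache_push_slow(int cls, void *p);   /* hypothetical slow-path entry */

/* Caller-side pattern: evaluate the gate once, outside the hot loop. */
static void free_many_hot(int cls, void **blocks, int n)
{
    const int uc_ready = unified_cache_enabled();        /* hoisted: checked once, not per op */
    for (int i = 0; i < n; i++) {
        if (uc_ready && unified_cache_push_fast(cls, blocks[i]))
            continue;                                    /* fast path: no per-op ceremony */
        unified_cache_push_slow(cls, blocks[i]);         /* fail-fast fallback (single boundary) */
    }
}
```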
**Results** (10-run Mixed SSOT, WS=400):
- Throughput: **+0.32%** (NEUTRAL, below +1.0% GO threshold)
- cache-misses: **-16.31%** (positive signal, insufficient throughput gain)
**Verdict**: **NEUTRAL (+0.32%)****P0 (FASTAPI) frozen**
**References**:
- Design: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- Instructions: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- Results (P1/P0): `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md`
---
## Phase 75 (structural): Hot-class Inline Slots (P2) ✅ **Done (Standard A/B)**
**Goal**: statistical analysis of C4-C7 → decide a targeted optimization strategy
**Premises** (Phase 74 learnings):
- UnifiedCache hit-path optimization has low ROI ← register pressure / cache-miss effects
- Next axis: **exploit per-class characteristics** → branch elimination via TLS-direct inline slots
**Phase 75-0: Per-Class Analysis** ✅ **Done**
Per-class Unified-STATS (Mixed SSOT, WS=400, HAKMEM_MEASURE_UNIFIED_CACHE=1):
| Class | Capacity | Occupied | Hits | Pushes | Total Ops | Hit % | % of C4-C7 |
|-------|----------|----------|------|--------|-----------|-------|-----------|
| C6 | 128 | 127 | 2,750,854 | 2,750,855 | **5,501,709** | 100% | **57.2%** |
| C5 | 128 | 127 | 1,373,604 | 1,373,605 | **2,747,209** | 100% | **28.5%** |
| C4 | 64 | 63 | 687,563 | 687,564 | **1,375,127** | 100% | **14.3%** |
| C7 | ? | ? | ? | ? | **?** | ? | **?** |
**Key findings**:
1. C6 dominates: 57.2% of operations (2.75M hits)
2. All classes at 100% hit rate (refill inactive in SSOT)
3. Cache occupancy near-capacity (98-99%)
**Phase 75-1: C6-only Inline Slots** ✅ **Done (GO +2.87%)**
**Approach**: Modular box theory design with single decision point at TLS init
**Implementation** (5 new boxes + test script):
- ENV gate box: `HAKMEM_TINY_C6_INLINE_SLOTS=0/1` (lazy-init, default OFF)
- TLS extension: 128-slot ring buffer (1KB per thread, zero overhead when OFF)
- Fast-path API: `c6_inline_push()` / `c6_inline_pop()` (always_inline, 1-2 cycles)
- Integration: Minimal (2 boundary points: alloc/free for C6 class only)
- Backward compatible: Legacy code intact, fail-fast to unified_cache
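A sketch of that shape, assuming a power-of-two ring with masked indices; the real box files use different field names and additionally honor the ENV gate:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the Phase 75-1 shape: a per-thread 128-slot ring (~1KB of TLS) with
 * 1-2 cycle push/pop. Field names and the full/empty convention are illustrative. */
#define C6_INLINE_CAP 128u

static __thread struct {
    void    *slots[C6_INLINE_CAP];
    uint32_t head, tail;                     /* ring indices, masked by CAP-1 */
} t_c6_inline;

static inline __attribute__((always_inline)) int c6_inline_push(void *p)
{
    uint32_t next = (t_c6_inline.head + 1) & (C6_INLINE_CAP - 1);
    if (next == t_c6_inline.tail)
        return 0;                            /* full: fail fast to unified_cache */
    t_c6_inline.slots[t_c6_inline.head] = p;
    t_c6_inline.head = next;
    return 1;
}

static inline __attribute__((always_inline)) void *c6_inline_pop(void)
{
    if (t_c6_inline.tail == t_c6_inline.head)
        return NULL;                         /* empty: fail fast to unified_cache */
    void *p = t_c6_inline.slots[t_c6_inline.tail];
    t_c6_inline.tail = (t_c6_inline.tail + 1) & (C6_INLINE_CAP - 1);
    return p;
}
```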
**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C6 inline OFF): **44.24 M ops/s**
- Treatment (C6 inline ON): **45.51 M ops/s**
- Delta: **+1.27 M ops/s (+2.87%)**
**Decision**: ✅ **GO** (exceeds +1.0% strict threshold)
**Mechanism**: Branch elimination on unified_cache for C6 (57.2% of C4-C7 ops)
**References**:
- Per-class analysis: `docs/analysis/PHASE75_PERCLASS_ANALYSIS_0_SSOT.md`
- Results: `docs/analysis/PHASE75_C6_INLINE_SLOTS_1_RESULTS.md`
---
**Phase 75-2: C5 Inline Slots** ✅ **Done (GO +1.10%)**
**Goal**: C5-only isolated measurement (28.5% of C4-C7) for individual contribution
**Approach**: Replicate C6 pattern with careful isolation
- Add C5 ring buffer (128 slots, 1KB TLS)
- ENV gate: `HAKMEM_TINY_C5_INLINE_SLOTS=0/1` (default OFF)
- Test strategy: C5-only (baseline C5=OFF+C6=ON, treatment C5=ON+C6=ON)
- Integration: alloc/free boundary points (C5 FIRST, then C6, then unified_cache)
**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C5=OFF, C6=ON): **44.26 M ops/s** (σ=0.37)
- Treatment (C5=ON, C6=ON): **44.74 M ops/s** (σ=0.54)
- Delta: **+0.49 M ops/s (+1.10%)**
**Decision**: ✅ **GO** (C5 individual contribution validated)
**Cumulative Performance**:
- Phase 75-1 (C6): +2.87%
- Phase 75-2 (C5 isolated): +1.10%
- Combined potential: ~+3.97% (if additive)
**References**:
- Implementation details: `docs/analysis/PHASE75_2_C5_INLINE_SLOTS_IMPLEMENTATION.md`
---
**Phase 75-3: C5+C6 Interaction Test (4-Point Matrix A/B)** ✅ **Done (STRONG GO +5.41%)**
**Goal**: Comprehensive interaction test + final promotion decision
**Approach**: 4-point matrix A/B test (single binary, ENV-only configuration)
- Point A (C5=0, C6=0): Baseline
- Point B (C5=1, C6=0): C5 solo
- Point C (C5=0, C6=1): C6 solo
- Point D (C5=1, C6=1): C5+C6 combined
**Results** (10-run per point, Mixed SSOT, WS=400):
- **Point A (baseline)**: 42.36 M ops/s
- **Point B (C5 solo)**: 43.54 M ops/s (+2.79% vs A)
- **Point C (C6 solo)**: 44.25 M ops/s (+4.46% vs A)
- **Point D (C5+C6)**: 44.65 M ops/s (+5.41% vs A) **[MAIN TARGET]**
**Additivity Analysis**:
- Expected additive (B+C-A): 45.43 M ops/s
- Actual (D): 44.65 M ops/s
- Sub-additivity: **1.72%** (near-perfect additivity, minimal negative interaction)
**Perf Stat Validation (D vs A)**:
- Instructions: -6.1% (function call elimination confirmed)
- Branches: -6.1% (matches instruction reduction)
- Cache-misses: -31.5% (improved locality, NOT +86% like Phase 74-2)
- Throughput: +5.41% (net positive)
**Decision**: ✅ **STRONG GO (+5.41%)**
- D vs A: +5.41% >> 3.0% threshold
- Sub-additivity: 1.72% << 20% acceptable
- Phase 73 thesis validated: instructions/branches DOWN, throughput UP
**Promotion Completed**:
1. `core/bench_profile.h`: Added C5+C6 defaults to `bench_apply_mixed_tinyv3_c7_common()`
2. `scripts/run_mixed_10_cleanenv.sh`: Added C5+C6 ENV defaults
3. C5+C6 inline slots now **promoted to preset defaults** for MIXED_TINYV3_C7_SAFE
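A hedged sketch of what the preset promotion plausibly amounts to (the function name comes from this list; the `setenv`-based mechanics and overwrite behavior are assumptions):

```c
#include <stdlib.h>

/* Hypothetical sketch of the preset promotion: defaults are seeded without
 * clobbering values the user set explicitly (overwrite=0). */
static void bench_apply_mixed_tinyv3_c7_common(void)
{
    setenv("HAKMEM_TINY_C5_INLINE_SLOTS", "1", 0);   /* Phase 75-3 promotion */
    setenv("HAKMEM_TINY_C6_INLINE_SLOTS", "1", 0);
    /* Phase 76-1 later adds HAKMEM_TINY_C4_INLINE_SLOTS=1 here as well. */
}
```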
**Phase 75 Complete**: C5+C6 inline slots (129-256B) deliver a proven +5.41% gain **on the Standard binary** (`bench_random_mixed_hakmem`).
- Before updating the FAST PGO baseline (scorecard), re-measure with `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo` under **the same A/B conditions (C5/C6 OFF/ON)**.
### Phase 75-4 (FAST PGO rebase) ✅ Done
- Results: **+3.16% (GO)** (4-point matrix, after outlier exclusion)
- Details: `docs/analysis/PHASE75_4_FAST_PGO_REBASE_RESULTS.md`
- Important: compared with the Phase 69 FAST baseline (62.63M), the **current FAST PGO baseline looks suspiciously low** (PGO profile staleness / training mismatch / build drift)
### Phase 75-5 (PGO regeneration) ✅ Done (NO-GO on the hypothesis; code bloat identified as root cause)
Goal:
- Regenerate PGO training against the current code (including C5/C6 inline slots) and recover a Phase 69-class FAST baseline.
Results:
- PGO profile regeneration had a **limited effect** (only +0.3%)
- Root cause is **code bloat, not PGO profile mismatch** (+13KB, +3.1%)
- The code bloat triggers a layout tax: IPC collapse (-7.22%) and a branch-miss spike (+19.4%) → net -12% regression
**Forensics findings** (`scripts/box/layout_tax_forensics_box.sh`):
- Text size: +13KB (+3.1%)
- IPC: 1.80 → 1.67 (-7.22%)
- Branch-misses: +19.4%
- Cache-misses: +5.7%
**Decision**:
- FAST PGO is sensitive to code bloat → **Track A/B discipline established**
- Track A: Standard binary で implementation decisions (SSOT for GO/NO-GO)
- Track B: FAST PGO で mimalloc ratio tracking (periodic rebase, not single-point decisions)
**References**:
- Detailed results: `docs/analysis/PHASE75_5_PGO_REGENERATION_RESULTS.md`
- Instructions: `docs/analysis/PHASE75_5_PGO_REGENERATION_NEXT_INSTRUCTIONS.md`
---
### Phase 76 (structural, continued): C4-C7 Remaining Classes ✅ **Phase 76-1 Done (GO +1.73%)**
**Premises** (Phase 75 complete):
- C5+C6 inline slots: +5.41% proven (Standard), +3.16% (FAST PGO)
- Code bloat sensitivity identified → Track A/B discipline established
- Remaining C4-C7 coverage: C4 (14.29%), C7 (0%)
**Phase 76-0: C7 Statistics Analysis** ✅ **Done (NO-GO for C7 P2)**
**Approach**: OBSERVE run to measure C7 allocation patterns in Mixed SSOT
**Results**: C7 = **0% operations** in Mixed SSOT workload
**Decision**: NO-GO for C7 P2 optimization → proceed to C4
**References**:
- Results: `docs/analysis/PHASE76_0_C7_STATISTICS_ANALYSIS.md`
**Phase 76-1: C4 Inline Slots** ✅ **Done (GO +1.73%)**
**Goal**: Complete C4-C6 inline slots trilogy, targeting remaining 14.29% of C4-C7 operations
**Implementation** (modular box pattern):
- ENV gate: `HAKMEM_TINY_C4_INLINE_SLOTS=0/1` (default OFF → ON after promotion)
- TLS ring: 64 slots, 512B per thread (lighter than C5/C6's 1KB)
- Fast-path API: `c4_inline_push()` / `c4_inline_pop()` (always_inline)
- Integration: C4 FIRST → C5 → C6 → unified_cache (alloc/free cascade)
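A sketch of that integration order, read as per-class dispatch (Phase 80-1 switch) with the shared cache as the fail-fast fallback; the pop helpers' signatures and the fallback name are assumptions:

```c
#include <stddef.h>

/* Sketch of the alloc-side order described above. Only the c4/c5/c6 helper
 * names mirror this doc; signatures and the fallback are illustrative. */
extern void *c4_inline_pop(void);
extern void *c5_inline_pop(void);
extern void *c6_inline_pop(void);
extern void *unified_cache_pop(int cls);        /* hypothetical shared-tier fallback */

static inline void *tiny_alloc_from_caches(int cls)
{
    void *p = NULL;
    switch (cls) {                              /* Phase 80-1 switch dispatch */
    case 4: p = c4_inline_pop(); break;         /* C4 checked first */
    case 5: p = c5_inline_pop(); break;
    case 6: p = c6_inline_pop(); break;
    default: break;
    }
    return p ? p : unified_cache_pop(cls);      /* fail-fast to the shared tier */
}
```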
**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C4=OFF, C5=ON, C6=ON): **52.42 M ops/s**
- Treatment (C4=ON, C5=ON, C6=ON): **53.33 M ops/s**
- Delta: **+0.91 M ops/s (+1.73%)**
**Decision**: ✅ **GO** (exceeds +1.0% threshold)
**Promotion Completed**:
1. `core/bench_profile.h`: Added C4 default to `bench_apply_mixed_tinyv3_c7_common()`
2. `scripts/run_mixed_10_cleanenv.sh`: Added `HAKMEM_TINY_C4_INLINE_SLOTS=1` default
3. C4 inline slots now **promoted to preset defaults** alongside C5+C6
**Coverage Summary (C4-C7 complete)**:
- C6: 57.17% (Phase 75-1, +2.87%)
- C5: 28.55% (Phase 75-2, +1.10%)
- **C4: 14.29% (Phase 76-1, +1.73%)**
- C7: 0.00% (Phase 76-0, NO-GO)
- **Combined C4-C6: 100% of C4-C7 operations**
**Estimated Cumulative Gain**: +7-8% (C4+C5+C6 combined, assumes near-perfect additivity like Phase 75-3)
**References**:
- Results: `docs/analysis/PHASE76_1_C4_INLINE_SLOTS_RESULTS.md`
- C4 box files: `core/box/tiny_c4_inline_slots_*.h`, `core/front/tiny_c4_inline_slots.h`, `core/tiny_c4_inline_slots.c`
---
**Phase 76-2: C4+C5+C6 Comprehensive 4-Point Matrix** ✅ **Done (STRONG GO +7.05%, super-additive)**
**Goal**: Validate cumulative C4+C5+C6 interaction and establish SSOT baseline for next optimization axis
**Results** (4-point matrix, 10-run each):
- Point A (all OFF): 49.48 M ops/s (baseline)
- Point B (C4 only): 49.44 M ops/s (-0.08%, context-dependent regression)
- Point C (C5+C6 only): 52.27 M ops/s (+5.63% vs A)
- Point D (all ON): **52.97 M ops/s (+7.05% vs A)****STRONG GO**
**Critical Discovery**:
- C4 shows **-0.08% regression in isolation** (C5/C6 OFF)
- C4 shows **+1.27% gain in context** (with C5+C6 ON)
- **Super-additivity**: Actual D (+7.05%) exceeds expected additive (+5.56%)
- **Implication**: Per-class optimizations are **context-dependent**, not independently additive
**Additivity Analysis**:
- Expected additive: 52.23 M ops/s (B + C - A)
- Actual: 52.97 M ops/s
- Deviation: **+1.42% above the additive expectation (super-additive)** ✓
**Decision**: ✅ **STRONG GO**
- D vs A: +7.05% >> +3.0% threshold
- Super-additive behavior confirms synergistic gains
- C4+C5+C6 locked to SSOT defaults
**References**:
- Detailed results: `docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md`
---
### 🟩 Done: C4-C7 Inline Slots Optimization Stack
**Per-class Coverage Summary (Final)**:
- C6 (57.17%): +2.87% (Phase 75-1)
- C5 (28.55%): +1.10% (Phase 75-2)
- C4 (14.29%): +1.27% in context (Phase 76-1/76-2)
- C7 (0.00%): NO-GO (Phase 76-0)
- **Combined C4-C6: +7.05% (Phase 76-2 super-additive)**
**Status**: ✅ **C4-C7 Optimization Complete** (100% coverage, SSOT locked)
---
### 🟥 Next Active (Phase 77+)
**Options**:
**Option A: FAST PGO Periodic Tracking** (Track B discipline)
- Regenerate PGO profile with C4+C5+C6=ON if code bloat accumulates
- Monitor mimalloc ratio progress (secondary metric)
- Not a decision point per se, but periodic maintenance
**Option B: Phase 77 (Alternative Optimization Axis)**
- Explore beyond per-class inline slots
- Candidates:
- Allocation fast-path optimization (call elimination)
- Metadata/page lookup (table optimization)
- C3/C2 class strategies
- Warm pool tuning (beyond Phase 69's WarmPool=16)
**Recommendation**: **proceed with Option B** (Phase 77+)
- C4-C7 optimizations are exhausted and locked
- Ready to explore new optimization axes
- Baseline is now +7.05% stronger than Phase 75-3
**References**:
- Full C4-C7 analysis: `docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md`
- Phase 75-3 reference (C5+C6): `docs/analysis/PHASE75_3_C5_C6_INTERACTION_RESULTS.md`
## 5) Archive
- Detailed log: `CURRENT_TASK_ARCHIVE_20251210.md`
- Pre-cleanup snapshot: `docs/analysis/CURRENT_TASK_ARCHIVE.md`