hakmem/CURRENT_TASK.md

# CURRENT_TASK (Rolling, SSOT)

## 0) Current sources of truth

- **Canonical for performance comparison**: FAST PGO build (`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`) ✓ **promoted in Phase 69** (Warm Pool Size=16)
- **Canonical for safety and compatibility**: Standard build (`make bench_random_mixed_hakmem`)
- **Canonical for observation**: OBSERVE build (`make perf_observe`)
- **Scorecard**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` (M1 achieved and exceeded: 51.77% vs. the 50% target; +3.23pp remaining to M2)
- **Canonical measurement (Mixed 10-run)**: `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`, `HAKMEM_WARM_POOL_SIZE=16` by default)

## 1) Current status (key points)

- Phase 64 (backend prune / DCE): **NO-GO** (-4.05%) → caused by layout tax
- Phase 63 (FAST_PROFILE_FIXED): kept as a **research build** (fixes the FAST gates at compile time)
- Phase 65 (Hot Symbol Ordering): **BLOCKED** (unfair or impossible under GCC+LTO constraints) → `docs/analysis/PHASE65_HOT_SYMBOL_ORDERING_1_RESULTS.md`
- Phase 66 (PGO, GCC+LTO): **GO** ✓
  - Verification: 3 independent runs, +3.0% mean, all >+2.89%, variance <±1%
  - Baseline: `bench_random_mixed_hakmem_minimal_pgo` = 60.89M ops/s = 50.32% (initial PGO)
- Phase 68 (PGO training set optimization): **GO, promoted** ✓
  - Verification: +1.19% vs. Phase 66 over 10 runs (GO: exceeds the +1.0% threshold)
  - Baseline (upgraded): `bench_random_mixed_hakmem_minimal_pgo` = 61.614M ops/s = **50.93%** (exceeds the 50% target by +0.93pp)
- Phase 69 (Refill tuning: Warm Pool Size optimization): **Strong GO, promoted** ✓✓✓
  - Verification: +3.26% vs. Phase 68 over 10 runs (strong GO: exceeds the +3.0% threshold)
  - New baseline: `bench_random_mixed_hakmem_minimal_pgo` (upgraded) = 62.63M ops/s = **51.77%** (exceeds M1 by +1.77pp; +3.23pp remaining to M2)

## 2) Next instructions (Active)

**Phase 68: PGO training set optimization** ✅ **Done**

- ✓ seed/WS diversification: WS (3→5 patterns), seed (1→3 patterns)
- ✓ 10-run verification: +1.19% vs. Phase 66 (exceeds the +1.0% GO threshold)
- ✓ Baseline promotion: 61.614M ops/s = 50.93% (exceeds the 50% M1 target by +0.93pp)
- ✓ Scorecard and CURRENT_TASK updated

---

**Phase 67a: Layout tax forensics (minimal changes)** ✅ **Done, ready for routine use**

- ✓ New: `scripts/box/layout_tax_forensics_box.sh` (measurement harness)
  - 10-run throughput comparison of Baseline vs. Treatment
  - Automatic perf stat collection (cycles, IPC, branches, branch-misses, cache-misses, iTLB/dTLB)
  - Binary metadata (size, section layout)

- ✓ New: `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md` (diagnostic guide)
  - Verdict rules: GO (+1% or more) / NEUTRAL (within ±1%) / NO-GO (-1% or less)
  - "Symptom → candidate cause" mapping table
    * IPC drop of 3% or more → I-cache miss / code layout dispersal
    * branch-misses up 10% or more → branch prediction penalty
    * dTLB-misses up 100% or more → data layout fragmentation
  - Phase 64 case study (root cause of the -4.05%: IPC 2.05 → 1.98)
  - Operational guidelines

**Usage example**:
```bash
./scripts/box/layout_tax_forensics_box.sh \
    ./bench_random_mixed_hakmem_minimal_pgo \
    ./bench_random_mixed_hakmem_fast_pruned  # or Phase 64 attempt
```

Outcome: when a pruning-type change comes back NO-GO, **a single run** now shows which metric regressed, so later link-out or large-deletion work can be stopped up front.
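
The "symptom → candidate cause" rules above can be sketched as a small helper. This is an illustrative sketch only, not the contents of `layout_tax_forensics_box.sh`: the function name `diagnose_layout_tax` and its six positional arguments are hypothetical. Only the thresholds (3%, 10%, 100%) and the Phase 64 IPC values come from this file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: flag layout-tax symptoms from baseline/treatment counters.
# Args: ipc_base ipc_treat brmiss_base brmiss_treat dtlb_base dtlb_treat
diagnose_layout_tax() {
    awk -v ib="$1" -v it="$2" -v bb="$3" -v bt="$4" -v db="$5" -v dt="$6" 'BEGIN {
        # Each rule compares the relative change against its threshold.
        if ((ib - it) / ib * 100 >= 3)   print "IPC drop >= 3%: I-cache miss / code layout dispersal"
        if ((bt - bb) / bb * 100 >= 10)  print "branch-misses up >= 10%: branch prediction penalty"
        if ((dt - db) / db * 100 >= 100) print "dTLB-misses up >= 100%: data layout fragmentation"
    }'
}

# Phase 64 IPC values from the case study. The other four counters are
# placeholders set equal to each other, so only the IPC rule can fire.
diagnose_layout_tax 2.05 1.98 1 1 1 1
```

With the Phase 64 IPC values (2.05 → 1.98, a drop of about 3.4%), only the first rule fires.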

---

**Phase 69: Cut "refill frequency × fixed tax" (shortest path to M2)**

**Phase 69-0: Parameter sweep design memo** ✅ **Done**

- ✓ Created `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
- ✓ Identified the tunable parameters:
  - `HAKMEM_TINY_REFILL_COUNT_MID` / `HAKMEM_TINY_REFILL_COUNT_HOT` (the knobs that actually set the refill amount, ENV-only)
  - Unified Cache C5-C7 capacity (128 → 256/512)
  - Warm Pool size (12 → 16/24)
- ✓ Drafted the sweep plan (single-parameter → combined optimization)
- ✓ Defined the risk assessment and verdict criteria

**Phase 69-1: Sweep execution** ✅ **Done**

- ✓ Baseline (Phase 68 PGO): 60.65M ops/s (10-run mean)
- ✓ Warm Pool Size sweep:
  - Size=16: **62.63M ops/s (+3.26%, strong GO)** ✓✓✓ **Winner**
  - Size=24: 62.37M ops/s (+2.84%, GO)
- ✓ Unified Cache C5-C7 sweep:
  - Cache=256: 61.92M ops/s (+2.09%, GO)
  - Cache=512: 61.80M ops/s (+1.89%, GO)
- ✓ Combined optimization check:
  - Warm=16 + Cache=256: 62.35M ops/s (+2.81%, non-additive)
- ✓ The "Refill Batch Size sweep" is invalid (knob not wired up):
  - `TINY_REFILL_BATCH_SIZE` has no call site in the current Tiny front, so it does not function as a performance knob
  - Reference: `docs/analysis/PHASE69_REFILL_TUNING_3C_REFILL_BATCH_KNOB_AUDIT.md`
- **Results**: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
- **Winning setting**: **Warm Pool Size=16 (ENV-only, +3.26%, strong GO)**
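
The sweep verdicts follow from the percent change against the 60.65M ops/s baseline and the thresholds used in this file (GO at +1.0%, strong GO at +3.0%, NO-GO at -1.0%). A minimal sketch of that arithmetic, assuming hypothetical helper names `pct_delta` and `verdict` (they are not part of the repository scripts):

```bash
#!/usr/bin/env bash
# Percent change of a treatment mean relative to a baseline mean (two decimals).
pct_delta() {
    awk -v base="$1" -v treat="$2" 'BEGIN { printf "%.2f\n", (treat - base) / base * 100 }'
}

# Map a percent delta to a verdict using the thresholds quoted in this file.
verdict() {
    awk -v d="$1" 'BEGIN {
        d = d + 0                      # force numeric comparison
        if (d >= 3.0)       print "STRONG-GO"
        else if (d >= 1.0)  print "GO"
        else if (d <= -1.0) print "NO-GO"
        else                print "NEUTRAL"
    }'
}

# Warm Pool Size sweep means from the list above (16, then 24).
for mean in 62.63 62.37; do
    d="$(pct_delta 60.65 "$mean")"
    echo "${mean}M ops/s: ${d}% -> $(verdict "$d")"
done
```

Recomputing from the rounded means listed here can differ from the recorded percentages in the last digit, presumably because the recorded values were derived from unrounded data.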

**Phase 69-2: Apply the winning setting to the baseline** ✅ **Done**

- ✓ Added `HAKMEM_WARM_POOL_SIZE=16` as the default in `scripts/run_mixed_10_cleanenv.sh`
- ✓ Added `bench_setenv_default("HAKMEM_WARM_POOL_SIZE","16")` to the `MIXED_TINYV3_C7_SAFE` preset in `core/bench_profile.h`
- ✓ Added the new baseline to `PERFORMANCE_TARGETS_SCORECARD.md`:
  - Phase 69 baseline: 62.63M ops/s = 51.77% of mimalloc
  - M1 (50%) achievement: **EXCEEDED** (+1.77pp above target)
  - M2 (55%) progress: Gap reduced to +3.23pp
- ✓ Rollback: set `HAKMEM_WARM_POOL_SIZE=12` or remove the ENV variable

**New baseline**: 62.63M ops/s = **51.77%** of mimalloc (+3.26% over Phase 68; +3.23pp remaining to M2)
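
To make the remaining gap concrete, the percentage-point gap can be converted into a throughput target. The sketch below assumes the mimalloc reference throughput stays fixed, which this file does not guarantee, and uses a hypothetical helper name `milestone_gap`.

```bash
#!/usr/bin/env bash
# Gap to a milestone in percentage points, plus the throughput that would reach
# it if the mimalloc reference stayed fixed (an assumption, not a measurement).
# Args: current_Mops current_ratio_pct target_ratio_pct
milestone_gap() {
    awk -v ops="$1" -v ratio="$2" -v target="$3" 'BEGIN {
        printf "gap=%.2fpp required=%.2fM ops/s\n", target - ratio, ops * target / ratio
    }'
}

milestone_gap 62.63 51.77 55   # prints: gap=3.23pp required=66.54M ops/s
```

Under that assumption, M2 corresponds to roughly 66.5M ops/s on the Mixed 10-run benchmark.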

---

**Phase 69-3 (next candidate): refill-amount (ENV-only) sweep, or another sweep**

- **Option A (recommended)**: ENV sweep of the refill count (no code changes)
  - Sweep `HAKMEM_TINY_REFILL_COUNT_MID` (C4–C7) over 64/96/128/160…
  - Sweep `HAKMEM_TINY_REFILL_COUNT_HOT` (C0–C3) the same way (note that it interacts with WarmPool/UnifiedCache)
  - Verdict on the 10-run mean: GO (+1.0%) / strong GO (+3.0%) / NO-GO (-1.0%)

- **Option B**: Fine sweep of the Unified Cache (ENV-only)
  - Sweep C5/C6/C7 over 192/256/320… (the 256/512 points from Phase 69-1 were coarse)
  - Isolate the cause of the non-additivity with WarmPool=16

- **Option C**: Add a new compile-time knob (deferred)
  - `TINY_REFILL_BATCH_SIZE` is not wired up, so do not pursue it as-is
  - If needed, write a separate SSOT and implement it (Phase 70+)

- **Option D**: Optimize in a different direction (shortest path to M2: 55%)
  - Remaining gap: +3.23pp (51.77% → 55%)
  - Phase 67b (boundary inline/unroll tuning)
  - Optimize the top 50 hot functions
  - Re-tune the PGO profile

---

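**Phase 67b (follow-up, fallback): boundary inline/unroll tuning**
- **Caution**: high layout tax risk (see Phase 64)
- **Prerequisite**: the Top 50 execution check is mandatory
- Recommended to defer this as a fallback in case Phase 69 misses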

---

**Phase 70 (firming up observation prerequisites): make Step 0 of Refill/WarmPool optimization an SSOT**

- Goal: prevent **"optimizing a path that is never exercised"** (precedent: the layout tax in Phases 40/41/64)
- Note: `Route assignments: LEGACY` does not mean "Unified Cache unused" (it is the backend route kind)
- SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
  - On the Mixed SSOT (WS=400), **confirm with OBSERVE** that `unified_cache_refill()` / WarmPool pop occur at a meaningful rate before proceeding with Phase 70
- ✅ Phase 70-1: Route Banner implemented (eliminates route misidentification)
  - ENV: `HAKMEM_ROUTE_BANNER=1`
  - Output: Route assignments (backend route kind) + cache config (unified_cache / warm_pool_max_per_class)
- ✅ Phase 70-3: SSOT for OBSERVE statistics consistency (eliminates "it just wasn't visible" incidents)
  - Confirm `Unified-STATS total_allocs == total_frees` before any discussion (statistics reliability gate)
- ✅ Phase 70-2: Handling of Refill optimization settled (SSOT)
  - On the Mixed SSOT (WS=400), if `Unified-STATS miss < 1000`, **Refill optimization is frozen (zero ROI)**
  - Current measurement: misses are negligible (e.g. total miss=5) → refill optimization has no ROI on the SSOT workload
  - Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
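
The two Phase 70 gates (statistics reliability first, then the miss threshold) can be sketched as a single decision function. The function name `refill_gate` and its argument order are hypothetical, and the counters are passed in as plain numbers because the exact `Unified-STATS` output format is not shown in this file.

```bash
#!/usr/bin/env bash
# Hypothetical gate combining Phase 70-3 and Phase 70-2.
# Args: total_allocs total_frees total_miss
refill_gate() {
    local allocs="$1" frees="$2" miss="$3"
    if [ "$allocs" -ne "$frees" ]; then
        echo "STATS-UNRELIABLE"   # Phase 70-3: fix the statistics before discussing
    elif [ "$miss" -lt 1000 ]; then
        echo "FREEZE"             # Phase 70-2: refill optimization has zero ROI
    else
        echo "PROCEED"
    fi
}

# miss=5 is the measured example above; the alloc/free totals are placeholders.
refill_gate 1000000 1000000 5   # prints: FREEZE
```

Checking the reliability gate first reflects the ordering stated above: the miss count is only meaningful once the alloc/free totals agree.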

---

**Phase 73: Pin down the source of WarmPool=16's win with perf** ✅ **Done, paradox resolved**

- Background: WarmPool=16 improves throughput/CV, but visible counters such as Unified/WarmPool are nearly identical → most likely a **"per-operation cost difference"** (TLB/LLC/frequency/placement)
- Goal: use **perf stat** to reduce the WarmPool=12 vs. 16 difference to "what went down", then commit to the next structural optimization (Phase 72)
- Method: **same binary + cleanenv + alternating runs** (avoids layout tax and environment drift)
  - A: `HAKMEM_WARM_POOL_SIZE=12`
  - B: `HAKMEM_WARM_POOL_SIZE=16`
  - events: `cycles,instructions,branches,branch-misses,cache-misses,LLC-load-misses,iTLB-load-misses,dTLB-load-misses,page-faults`
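
The alternating A/B method can be sketched as a generator that prints one `perf stat` command per run, switching between the two settings within every round. Several details are assumptions rather than the project's actual procedure: the function name `ab_perf_commands`, the use of `env -i` with a minimal `PATH` to stand in for "cleanenv", the output file names, and the benchmark arguments (borrowed from the `perf record` commands later in this file).

```bash
#!/usr/bin/env bash
# Event list copied from the method description above.
EVENTS="cycles,instructions,branches,branch-misses,cache-misses,LLC-load-misses,iTLB-load-misses,dTLB-load-misses,page-faults"

# Print interleaved A/B commands: A (12) then B (16) in each round, same binary.
# Args: benchmark_binary rounds
ab_perf_commands() {
    local bin="$1" rounds="$2" i=1 size
    while [ "$i" -le "$rounds" ]; do
        for size in 12 16; do
            echo "env -i PATH=/usr/bin:/bin HAKMEM_WARM_POOL_SIZE=$size perf stat -x, -e $EVENTS -o perf_wp${size}_run${i}.csv -- $bin 20000000 400 1"
        done
        i=$((i + 1))
    done
}

# Print the plan only; pipe the output to `sh` to actually execute it.
ab_perf_commands ./bench_random_mixed_hakmem_observe 2
```

Generating the commands separately from running them keeps the run order reviewable, which matters when the aim is to rule out environment drift between A and B.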

**Results** (the paradox):
- ✅ Throughput: +0.91% (46.52M → 46.95M ops/s)
- ✅ **instructions**: -0.38% (-17.4M instructions) ← **PRIMARY WIN SOURCE**
- ✅ **branches**: -0.30% (-3.7M branches) ← **SECONDARY WIN SOURCE**
- ⚠️ **dTLB-load-misses**: +29.06% (28,792 → 37,158) ← **WORSE**
- ⚠️ **cache-misses**: +17.80% (458K → 540K) ← **WORSE**
- ✓ page-faults: -0.21% (negligible)

**Phase 71 hypothesis (REJECTED)**:
- Prediction: "TLB/cache efficiency improvement from memory layout"
- Measured: TLB/cache metrics both **DEGRADED**

**Phase 73 conclusion**:
- Source of the win: **Control-flow optimization (instruction/branch count reduction)**
- Mechanism: WarmPool=16 takes a shorter code path → 17.4M fewer instructions
- Trade-off: +4MB RSS → worse TLB/cache, but instruction savings dominate
- Net benefit: ~8.2M cycles saved (instruction/branch) >> ~4.2M cycles lost (TLB/cache)

**Details**: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`, Phase 73 section

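**Phase 72 (structural): amplify the source of WarmPool=16's win (only after the Phase 73 results are in)**

- Prerequisite: start only after Phase 73 has pinned down the source of the win with numbers (tweaking on guesswork repeats Phases 40/41/64)
- Phase 73 conclusion: **the instruction/branch reduction dominates** (TLB/cache actually got worse) → the essence is that WarmPool=16 makes execution take a "shorter path"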

**Phase 72-0 (SSOT): identify "which functions got shorter" before touching structure**

- Keep the A/B as WarmPool=12 vs. 16 (same binary, cleanenv)
- Take perf record on **instructions/branches rather than cycles** (because the cause is the instruction/branch reduction)
  - `perf record -e instructions:u -c 100000 -- ./bench_random_mixed_hakmem_observe 20000000 400 1`
  - `perf record -e branches:u -c 100000 -- ./bench_random_mixed_hakmem_observe 20000000 400 1`
- Goal: determine the **top 3 functions whose instruction share / branch share dropped** under WarmPool=16 (e.g. `shared_pool_acquire_slab`, `unified_cache_refill`, `warm_pool_do_prefill`, `superslab_refill`)

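**Phase 72-1 (structural): change only the identified functions (reduce the box boundary to a single place)**

- If `shared_pool_acquire_slab` is the main cause: design to reduce "scan/lock/mmap" (fix the warm prefill boundary to a single place)
- If `unified_cache_refill` is the main cause: move "refill preparation/validation" to the boundary side and straighten the hot side
- Note: the goal is not to "reduce misses" but to **take a "shorter path" for the same misses** (the lesson of Phase 73)

**Note**: do not delete research boxes now (there is strong precedent for link-out/deletion causing layout tax, so keeping them compiled out is the right call)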

## 3) Archive

- Detailed log: `CURRENT_TASK_ARCHIVE_20251210.md`
- Snapshot from before the latest cleanup: `docs/analysis/CURRENT_TASK_ARCHIVE.md`
-												Phase 68: PGO training set diversification (seed/WS expansion)

Changes:
- scripts/box/pgo_fast_profile_config.sh: Expanded WS patterns (3→5) and seeds (1→3)
  for reduced overfitting and better production workload representativeness
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 68 baseline promoted (61.614M = 50.93%)
- CURRENT_TASK.md: Phase 68 marked complete, Phase 67a (layout tax forensics) set Active

Results:
- 10-run verification: +1.19% vs Phase 66 baseline (GO, >+1.0% threshold)
- M1 milestone: 50.93% of mimalloc (target 50%, exceeded by +0.93pp)
- Stability: 10-run mean/median with <2.1% CV

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 21:08:17 +09:00
+								# CURRENT_TASK（Rolling, SSOT）
-												Phase TLS-UNIFY-3: C6 intrusive freelist implementation (完成)

Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS

Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-12 16:26:42 +09:00
-												Phase 68: PGO training set diversification (seed/WS expansion)

Changes:
- scripts/box/pgo_fast_profile_config.sh: Expanded WS patterns (3→5) and seeds (1→3)
  for reduced overfitting and better production workload representativeness
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 68 baseline promoted (61.614M = 50.93%)
- CURRENT_TASK.md: Phase 68 marked complete, Phase 67a (layout tax forensics) set Active

Results:
- 10-run verification: +1.19% vs Phase 66 baseline (GO, >+1.0% threshold)
- M1 milestone: 50.93% of mimalloc (target 50%, exceeded by +0.93pp)
- Stability: 10-run mean/median with <2.1% CV

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 21:08:17 +09:00
+								## 0) 今の「正」
-												Phase 24-26: Hot path atomic telemetry prune (+2.00% cumulative)

Summary:
- Phase 24 (alloc stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL (code cleanliness)
- Total: 11 atomics compiled-out, +2.00% improvement

Phase 24: OBSERVE tax prune (tiny_class_stats_box.h)
- Added HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)
- Wrapped 5 stats functions: uc_miss, warm_hit, shared_lock, tls_carve_*
- Result: +0.93% (baseline 56.675M vs compiled-in 56.151M ops/s)

Phase 25: Tiny free stats prune (tiny_superslab_free.inc.h)
- Added HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)
- Wrapped g_free_ss_enter atomic in free hot path
- Result: +1.07% (baseline 57.017M vs compiled-in 56.415M ops/s)

Phase 26: Hot path diagnostic atomics prune
- Added 5 compile gates for low-frequency error counters:
  - HAKMEM_TINY_C7_FREE_COUNT_COMPILED
  - HAKMEM_TINY_HDR_MISMATCH_LOG_COMPILED
  - HAKMEM_TINY_HDR_META_MISMATCH_COMPILED
  - HAKMEM_TINY_METRIC_BAD_CLASS_COMPILED
  - HAKMEM_TINY_HDR_META_FAST_COMPILED
- Result: -0.33% NEUTRAL (within noise, kept for cleanliness)

Alignment with mimalloc principles:
- "No atomics on hot path" - telemetry moved to compile-time opt-in
- Fixed per-op tax elimination
- Production builds: maximum performance (atomics compiled-out)
- Research builds: full diagnostics (COMPILED=1)

Generated with Claude Code
https://claude.com/claude-code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-16 05:35:11 +09:00
-												Phase 69: Refill tuning completion (Warm Pool Size=16 optimized)

- Promoted Warm Pool Size=16 as the new baseline (+3.26% gain).
- Updated PERFORMANCE_TARGETS_SCORECARD.md with Phase 69 results.
- Updated scripts/run_mixed_10_cleanenv.sh and core/bench_profile.h to use HAKMEM_WARM_POOL_SIZE=16 by default.
- Clarified that TINY_REFILL_BATCH_SIZE is not currently connected.

											
										
										
											2025-12-18 01:55:27 +09:00
+								- **性能比較の正**: FAST PGO build（`make pgo-fast-full` → `bench_random_mixed_hakmem_minimal_pgo`）✓ **Phase 69 昇格済み** (Warm Pool Size=16)
-												Phase 35-39: FAST build optimization complete (+7.13% cumulative)

Phase 35-A: BENCH_MINIMAL gate function elimination (GO +4.39%)
- tiny_front_v3_enabled() → constant true
- tiny_metadata_cache_enabled() → constant 0
- learner_v7_enabled() → constant false
- small_learner_v2_enabled() → constant false

Phase 36: Policy snapshot init-once (GO +0.71%)
- small_policy_v7_snapshot() version check skip in BENCH_MINIMAL
- TLS cache for policy snapshot

Phase 37: Standard TLS cache (NO-GO -0.07%)
- TLS cache for Standard build attempted
- Runtime gate overhead negates benefit

Phase 38: FAST/OBSERVE/Standard workflow established
- make perf_fast, make perf_observe targets
- Scorecard and documentation updates

Phase 39: Hot path gate constantization (GO +1.98%)
- front_gate_unified_enabled() → constant 1
- alloc_dualhot_enabled() → constant 0
- g_bench_fast_front, g_v3_enabled blocks → compile-out
- free_dispatch_stats_enabled() → constant false

Results:
- FAST v3: 56.04M ops/s (47.4% of mimalloc)
- Standard: 53.50M ops/s (45.3% of mimalloc)
- M1 target (50%): 5.5% remaining

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-16 15:01:56 +09:00
+								- **安全・互換の正**: Standard build（`make bench_random_mixed_hakmem`）
 								- **観測の正**: OBSERVE build（`make perf_observe`）
-												Phase 69: Refill tuning completion (Warm Pool Size=16 optimized)

- Promoted Warm Pool Size=16 as the new baseline (+3.26% gain).
- Updated PERFORMANCE_TARGETS_SCORECARD.md with Phase 69 results.
- Updated scripts/run_mixed_10_cleanenv.sh and core/bench_profile.h to use HAKMEM_WARM_POOL_SIZE=16 by default.
- Clarified that TINY_REFILL_BATCH_SIZE is not currently connected.

											
										
										
											2025-12-18 01:55:27 +09:00
+								- **スコアカード**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md`（M1 達成・超過: 51.77% vs 50% target、M2 まで残り +3.23pp）
 								- **計測の正（Mixed 10-run）**: `scripts/run_mixed_10_cleanenv.sh`（`ITERS=20000000 WS=400`、`HAKMEM_WARM_POOL_SIZE=16` デフォルト）
-												Phase 24-26: Hot path atomic telemetry prune (+2.00% cumulative)

Summary:
- Phase 24 (alloc stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL (code cleanliness)
- Total: 11 atomics compiled-out, +2.00% improvement

Phase 24: OBSERVE tax prune (tiny_class_stats_box.h)
- Added HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)
- Wrapped 5 stats functions: uc_miss, warm_hit, shared_lock, tls_carve_*
- Result: +0.93% (baseline 56.675M vs compiled-in 56.151M ops/s)

Phase 25: Tiny free stats prune (tiny_superslab_free.inc.h)
- Added HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)
- Wrapped g_free_ss_enter atomic in free hot path
- Result: +1.07% (baseline 57.017M vs compiled-in 56.415M ops/s)

Phase 26: Hot path diagnostic atomics prune
- Added 5 compile gates for low-frequency error counters:
  - HAKMEM_TINY_C7_FREE_COUNT_COMPILED
  - HAKMEM_TINY_HDR_MISMATCH_LOG_COMPILED
  - HAKMEM_TINY_HDR_META_MISMATCH_COMPILED
  - HAKMEM_TINY_METRIC_BAD_CLASS_COMPILED
  - HAKMEM_TINY_HDR_META_FAST_COMPILED
- Result: -0.33% NEUTRAL (within noise, kept for cleanliness)

Alignment with mimalloc principles:
- "No atomics on hot path" - telemetry moved to compile-time opt-in
- Fixed per-op tax elimination
- Production builds: maximum performance (atomics compiled-out)
- Research builds: full diagnostics (COMPILED=1)

Generated with Claude Code
https://claude.com/claude-code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-16 05:35:11 +09:00
-												Phase 68: PGO training set diversification (seed/WS expansion)

Changes:
- scripts/box/pgo_fast_profile_config.sh: Expanded WS patterns (3→5) and seeds (1→3)
  for reduced overfitting and better production workload representativeness
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 68 baseline promoted (61.614M = 50.93%)
- CURRENT_TASK.md: Phase 68 marked complete, Phase 67a (layout tax forensics) set Active

Results:
- 10-run verification: +1.19% vs Phase 66 baseline (GO, >+1.0% threshold)
- M1 milestone: 50.93% of mimalloc (target 50%, exceeded by +0.93pp)
- Stability: 10-run mean/median with <2.1% CV

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 21:08:17 +09:00
+								## 1) 現状（要点）
-												Phase 24-26: Hot path atomic telemetry prune (+2.00% cumulative)

Summary:
- Phase 24 (alloc stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL (code cleanliness)
- Total: 11 atomics compiled-out, +2.00% improvement

Phase 24: OBSERVE tax prune (tiny_class_stats_box.h)
- Added HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)
- Wrapped 5 stats functions: uc_miss, warm_hit, shared_lock, tls_carve_*
- Result: +0.93% (baseline 56.675M vs compiled-in 56.151M ops/s)

Phase 25: Tiny free stats prune (tiny_superslab_free.inc.h)
- Added HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)
- Wrapped g_free_ss_enter atomic in free hot path
- Result: +1.07% (baseline 57.017M vs compiled-in 56.415M ops/s)

Phase 26: Hot path diagnostic atomics prune
- Added 5 compile gates for low-frequency error counters:
  - HAKMEM_TINY_C7_FREE_COUNT_COMPILED
  - HAKMEM_TINY_HDR_MISMATCH_LOG_COMPILED
  - HAKMEM_TINY_HDR_META_MISMATCH_COMPILED
  - HAKMEM_TINY_METRIC_BAD_CLASS_COMPILED
  - HAKMEM_TINY_HDR_META_FAST_COMPILED
- Result: -0.33% NEUTRAL (within noise, kept for cleanliness)

Alignment with mimalloc principles:
- "No atomics on hot path" - telemetry moved to compile-time opt-in
- Fixed per-op tax elimination
- Production builds: maximum performance (atomics compiled-out)
- Research builds: full diagnostics (COMPILED=1)

Generated with Claude Code
https://claude.com/claude-code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-16 05:35:11 +09:00
-												Phase 68: PGO training set diversification (seed/WS expansion)

Changes:
- scripts/box/pgo_fast_profile_config.sh: Expanded WS patterns (3→5) and seeds (1→3)
  for reduced overfitting and better production workload representativeness
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 68 baseline promoted (61.614M = 50.93%)
- CURRENT_TASK.md: Phase 68 marked complete, Phase 67a (layout tax forensics) set Active

Results:
- 10-run verification: +1.19% vs Phase 66 baseline (GO, >+1.0% threshold)
- M1 milestone: 50.93% of mimalloc (target 50%, exceeded by +0.93pp)
- Stability: 10-run mean/median with <2.1% CV

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 21:08:17 +09:00
+								- Phase 64（backend prune / DCE）: **NO-GO**（-4.05%） → layout tax 由来
 								- Phase 63（FAST_PROFILE_FIXED）: **研究用ビルド**として保持（FAST の gate を compile-time 固定）
 								- Phase 65（Hot Symbol Ordering）: **BLOCKED**（GCC+LTO の制約で不公平/不可能）→ `docs/analysis/PHASE65_HOT_SYMBOL_ORDERING_1_RESULTS.md`
 								- Phase 66（PGO, GCC+LTO）: **GO** ✓
 								  - 検証: 3回独立実行で +3.0% mean, all >+2.89%, 分散 <±1%
 								  - Baseline: `bench_random_mixed_hakmem_minimal_pgo` = 60.89M ops/s = 50.32% (initial PGO)
 								- Phase 68（PGO training set 最適化）: **GO & 昇格完了** ✓
 								  - 検証: 10-run で +1.19% vs Phase 66 (GO: +1.0% threshold超過)
-												Phase 69: Refill tuning completion (Warm Pool Size=16 optimized)

- Promoted Warm Pool Size=16 as the new baseline (+3.26% gain).
- Updated PERFORMANCE_TARGETS_SCORECARD.md with Phase 69 results.
- Updated scripts/run_mixed_10_cleanenv.sh and core/bench_profile.h to use HAKMEM_WARM_POOL_SIZE=16 by default.
- Clarified that TINY_REFILL_BATCH_SIZE is not currently connected.

											
										
										
											2025-12-18 01:55:27 +09:00
+								  - Baseline (upgraded): `bench_random_mixed_hakmem_minimal_pgo` = 61.614M ops/s = **50.93%** (50% target 超過、+0.93pp)
 								- Phase 69（Refill tuning: Warm Pool Size 最適化）: **強GO & 昇格完了** ✓✓✓
 								  - 検証: 10-run で +3.26% vs Phase 68 (強GO: +3.0% threshold超過)
 								  - 新 baseline: `bench_random_mixed_hakmem_minimal_pgo` (upgraded) = 62.63M ops/s = **51.77%** (M1 超過、+1.77pp、M2 まで残り +3.23pp)
-												Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement

## Summary

Completed Phase 54-60 optimization work:

**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset

**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY

**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc

**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized

## Key Metrics

- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes

## Files Added/Modified

New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h

Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py

Documentation: Phase 40-60 analysis documents

## Design Decisions

1. Profile separation (core/bench_profile.h):
   - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
   - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)

2. Box Theory compliance:
   - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
   - Single conversion points maintained
   - No physical deletions (compile-out only)

3. Lessons learned:
   - SSOT effective only where redundancy exists (Phase 60 showed limits)
   - Branch prediction extremely effective (~0 cycles for well-predicted branches)
   - Early-exit pattern valuable even when seemingly redundant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 06:24:01 +09:00
-												Phase 68: PGO training set diversification (seed/WS expansion)

Changes:
- scripts/box/pgo_fast_profile_config.sh: Expanded WS patterns (3→5) and seeds (1→3)
  for reduced overfitting and better production workload representativeness
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 68 baseline promoted (61.614M = 50.93%)
- CURRENT_TASK.md: Phase 68 marked complete, Phase 67a (layout tax forensics) set Active

Results:
- 10-run verification: +1.19% vs Phase 66 baseline (GO, >+1.0% threshold)
- M1 milestone: 50.93% of mimalloc (target 50%, exceeded by +0.93pp)
- Stability: 10-run mean/median with <2.1% CV

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 21:08:17 +09:00
+								## 2) 次の指示書（Active）
-												Phase v7-4: Policy Box 導入 (L3 層の明確化とフロント芯の作り直し)

- SmallPolicyV7 Box: L3 Policy layer に配置、route 決定を一元化
- Route kind enum: SMALL_ROUTE_ULTRA / V7 / MID_V3 / LEGACY
- ENV priority (fixed): ULTRA > v7 > MID_v3 > LEGACY
- Frontend integration: v7 routing を Policy Box 経由に変更 (段階移行)
- Legacy compatibility: 既存の tiny_route_env_box.h は併用維持

Box Theory layer structure:
- L0: ULTRA (C4-C7, FROZEN)
- L1: SmallObject v7 (research box)
- L1': MID_v3 / LEGACY (fallback)
- L2: Segment / RegionId
- L3: Policy / Stats / Learner ← Policy Box added here

Frontend now follows clean "size→class→route_kind→switch" pattern.
ENV variables read once at Policy init, not scattered across frontend.

Future: ULTRA/MID_v3/LEGACY consolidation, Learner integration, flexible priority.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-12 03:50:58 +09:00
-												Phase 68: PGO training set diversification (seed/WS expansion)

Changes:
- scripts/box/pgo_fast_profile_config.sh: Expanded WS patterns (3→5) and seeds (1→3)
  for reduced overfitting and better production workload representativeness
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 68 baseline promoted (61.614M = 50.93%)
- CURRENT_TASK.md: Phase 68 marked complete, Phase 67a (layout tax forensics) set Active

Results:
- 10-run verification: +1.19% vs Phase 66 baseline (GO, >+1.0% threshold)
- M1 milestone: 50.93% of mimalloc (target 50%, exceeded by +0.93pp)
- Stability: 10-run mean/median with <2.1% CV

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 21:08:17 +09:00
+								**Phase 68: PGO training set 最適化** ✅ **完了**
-												Phase 24-26: Hot path atomic telemetry prune (+2.00% cumulative)

Summary:
- Phase 24 (alloc stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL (code cleanliness)
- Total: 11 atomics compiled-out, +2.00% improvement

Phase 24: OBSERVE tax prune (tiny_class_stats_box.h)
- Added HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)
- Wrapped 5 stats functions: uc_miss, warm_hit, shared_lock, tls_carve_*
- Result: +0.93% (baseline 56.675M vs compiled-in 56.151M ops/s)

Phase 25: Tiny free stats prune (tiny_superslab_free.inc.h)
- Added HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)
- Wrapped g_free_ss_enter atomic in free hot path
- Result: +1.07% (baseline 57.017M vs compiled-in 56.415M ops/s)

Phase 26: Hot path diagnostic atomics prune
- Added 5 compile gates for low-frequency error counters:
  - HAKMEM_TINY_C7_FREE_COUNT_COMPILED
  - HAKMEM_TINY_HDR_MISMATCH_LOG_COMPILED
  - HAKMEM_TINY_HDR_META_MISMATCH_COMPILED
  - HAKMEM_TINY_METRIC_BAD_CLASS_COMPILED
  - HAKMEM_TINY_HDR_META_FAST_COMPILED
- Result: -0.33% NEUTRAL (within noise, kept for cleanliness)

Alignment with mimalloc principles:
- "No atomics on hot path" - telemetry moved to compile-time opt-in
- Fixed per-op tax elimination
- Production builds: maximum performance (atomics compiled-out)
- Research builds: full diagnostics (COMPILED=1)

Generated with Claude Code
https://claude.com/claude-code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-16 05:35:11 +09:00
-												Phase 68: PGO training set diversification (seed/WS expansion)

Changes:
- scripts/box/pgo_fast_profile_config.sh: Expanded WS patterns (3→5) and seeds (1→3)
  for reduced overfitting and better production workload representativeness
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 68 baseline promoted (61.614M = 50.93%)
- CURRENT_TASK.md: Phase 68 marked complete, Phase 67a (layout tax forensics) set Active

Results:
- 10-run verification: +1.19% vs Phase 66 baseline (GO, >+1.0% threshold)
- M1 milestone: 50.93% of mimalloc (target 50%, exceeded by +0.93pp)
- Stability: 10-run mean/median with <2.1% CV

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 21:08:17 +09:00
+								- ✓ seed/WS diversification: WS (3→5パターン), seed (1→3パターン)
 								- ✓ 10-run 検証: +1.19% vs Phase 66 (GO threshold +1.0% 超過)
 								- ✓ Baseline 昇格: 61.614M ops/s = 50.93% (M1 target 50% を +0.93pp 超過)
 								- ✓ スコアカード・CURRENT_TASK 更新完了
-												Phase 24-26: Hot path atomic telemetry prune (+2.00% cumulative)

Summary:
- Phase 24 (alloc stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL (code cleanliness)
- Total: 11 atomics compiled-out, +2.00% improvement

Phase 24: OBSERVE tax prune (tiny_class_stats_box.h)
- Added HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)
- Wrapped 5 stats functions: uc_miss, warm_hit, shared_lock, tls_carve_*
- Result: +0.93% (baseline 56.675M vs compiled-in 56.151M ops/s)

Phase 25: Tiny free stats prune (tiny_superslab_free.inc.h)
- Added HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)
- Wrapped g_free_ss_enter atomic in free hot path
- Result: +1.07% (baseline 57.017M vs compiled-in 56.415M ops/s)

Phase 26: Hot path diagnostic atomics prune
- Added 5 compile gates for low-frequency error counters:
  - HAKMEM_TINY_C7_FREE_COUNT_COMPILED
  - HAKMEM_TINY_HDR_MISMATCH_LOG_COMPILED
  - HAKMEM_TINY_HDR_META_MISMATCH_COMPILED
  - HAKMEM_TINY_METRIC_BAD_CLASS_COMPILED
  - HAKMEM_TINY_HDR_META_FAST_COMPILED
- Result: -0.33% NEUTRAL (within noise, kept for cleanliness)

Alignment with mimalloc principles:
- "No atomics on hot path" - telemetry moved to compile-time opt-in
- Fixed per-op tax elimination
- Production builds: maximum performance (atomics compiled-out)
- Research builds: full diagnostics (COMPILED=1)

Generated with Claude Code
https://claude.com/claude-code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-16 05:35:11 +09:00
-												Phase 68: PGO training set diversification (seed/WS expansion)

Changes:
- scripts/box/pgo_fast_profile_config.sh: Expanded WS patterns (3→5) and seeds (1→3)
  for reduced overfitting and better production workload representativeness
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 68 baseline promoted (61.614M = 50.93%)
- CURRENT_TASK.md: Phase 68 marked complete, Phase 67a (layout tax forensics) set Active

Results:
- 10-run verification: +1.19% vs Phase 66 baseline (GO, >+1.0% threshold)
- M1 milestone: 50.93% of mimalloc (target 50%, exceeded by +0.93pp)
- Stability: 10-run mean/median with <2.1% CV

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 21:08:17 +09:00
---
Phase 67a: Layout tax forensics foundation (SSOT + measurement box)

Changes:
- scripts/box/layout_tax_forensics_box.sh: New measurement harness
  * Baseline vs treatment 10-run throughput comparison
  * Automated perf stat collection (cycles, IPC, branches, misses, TLB)
  * Binary metadata (size, section info)
  * Output to results/layout_tax_forensics/

- docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md: Diagnostic reference
  * Decision tree: GO/NEUTRAL/NO-GO classification
  * Symptom→root-cause mapping (IPC/branch-miss/dTLB/cache-miss)
  * Phase 64 case study analysis (IPC 2.05→1.98)
  * Operational guidelines for Phase 67b+ optimizations

- CURRENT_TASK.md: Phase 67a marked complete, operational

Outcome:
- Layout tax diagnosis now reproducible in single measurement pass
- Enables fast GO/NO-GO decision for future code removal/reordering attempts
- Foundation for M2 (55% target) structural exploration without regression risk

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 21:09:42 +09:00
**Phase 67a: Layout Tax Forensics (minimal changes)** ✅ **Complete, ready for operational use**
Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement

## Summary

Completed Phase 54-60 optimization work:

**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset

**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY

**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc

**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized

## Key Metrics

- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes

## Files Added/Modified

New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h

Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py

Documentation: Phase 40-60 analysis documents

## Design Decisions

1. Profile separation (core/bench_profile.h):
   - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
   - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)

2. Box Theory compliance:
   - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
   - Single conversion points maintained
   - No physical deletions (compile-out only)

3. Lessons learned:
   - SSOT effective only where redundancy exists (Phase 60 showed limits)
   - Branch prediction extremely effective (~0 cycles for well-predicted branches)
   - Early-exit pattern valuable even when seemingly redundant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 06:24:01 +09:00
- ✓ `scripts/box/layout_tax_forensics_box.sh` added (measurement harness)
  - 10-run throughput comparison of Baseline vs Treatment
  - Automated perf stat collection (cycles, IPC, branches, branch-misses, cache-misses, iTLB/dTLB)
  - Binary metadata (size, section layout)
- ✓ `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md` added (diagnostic guide)
  - Decision rule: GO (+1% or more) / NEUTRAL (within ±1%) / NO-GO (-1% or less)
  - "Symptom → candidate cause" mapping table
    * IPC drop of 3% or more → I-cache miss / code layout dispersal
    * branch-misses up by 10% or more → branch prediction penalty
    * dTLB-misses up by 100% or more → data layout fragmentation
  - Phase 64 case study (root cause of the -4.05%: IPC 2.05 → 1.98)
  - Operational guidelines
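The decision rule and the symptom mapping above can be sketched in a few lines of Python. This is only an illustration, not the logic of `layout_tax_forensics_box.sh`: the function names are hypothetical, a delta of exactly ±1.0% is assumed to count as GO / NO-GO, and the branch-miss and dTLB deltas passed for Phase 64 are placeholder zeros because this section does not report them.

```python
def pct_change(baseline: float, treatment: float) -> float:
    """Relative change of treatment versus baseline, in percent."""
    return (treatment - baseline) / baseline * 100.0


def classify_delta(delta_pct: float) -> str:
    """GO / NEUTRAL / NO-GO for a throughput delta given in percent."""
    if delta_pct >= 1.0:
        return "GO"
    if delta_pct <= -1.0:
        return "NO-GO"
    return "NEUTRAL"


def candidate_causes(ipc_pct: float, branch_miss_pct: float, dtlb_miss_pct: float) -> list:
    """Map perf-stat deltas (percent) to the candidate causes in the table."""
    causes = []
    if ipc_pct <= -3.0:
        causes.append("I-cache miss / code layout dispersal")
    if branch_miss_pct >= 10.0:
        causes.append("branch prediction penalty")
    if dtlb_miss_pct >= 100.0:
        causes.append("data layout fragmentation")
    return causes


# Phase 64 case study: throughput -4.05%, IPC 2.05 -> 1.98.
verdict = classify_delta(-4.05)
ipc_drop = pct_change(2.05, 1.98)
# Placeholder zeros: branch-miss / dTLB deltas are not given in this section.
causes = candidate_causes(ipc_drop, branch_miss_pct=0.0, dtlb_miss_pct=0.0)
print(verdict, f"{ipc_drop:.2f}%", causes)
```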
**Usage example**:
```bash
./scripts/box/layout_tax_forensics_box.sh \
    ./bench_random_mixed_hakmem_minimal_pgo \
    ./bench_random_mixed_hakmem_fast_pruned  # or Phase 64 attempt
```
Outcome: when a pruning change comes back NO-GO, the regressing metric can be **diagnosed in a single run** → later link-out or large deletions can be stopped before they land

---
Phase 69-0: Refill tuning design memo (parameter sweep plan)

Changes:
- docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md: New design document
  * Identified 3 tunable parameters: refill batch size, unified cache C5-C7 capacity, warm pool size
  * Sweep plan: single-parameter isolation → combined optimization
  * Expected gain: +3-6% (shortest path to M2: 55% target)
  * Risk assessment and decision criteria (GO/Strong GO/NO-GO thresholds)

- CURRENT_TASK.md: Phase 69-0 marked complete, Phase 69-1 (sweep execution) set Active

Key Parameters Identified:
1. TINY_REFILL_BATCH_SIZE: 16 → 32/64 (expected +1-3%)
2. Unified Cache C5-C7: 128 → 256/512 slots (expected +1-2%)
3. Warm Pool: 12 → 16/24 SuperSlabs (expected +0.5-1%)

Strategy:
- ENV-only sweeps first (warm pool, cache capacity) - no recompile
- Batch size sweep requires PGO rebuild - highest expected gain
- Combined optimization targets +3-6% additive gain

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 21:22:21 +09:00
**Phase 69: Trim "refill frequency × fixed tax" (shortest path to M2)**

**Phase 69-0: Parameter sweep design memo** ✅ **Complete**
- ✓ Created `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
- ✓ Tunable parameters identified:
Phase 69: Refill tuning completion (Warm Pool Size=16 optimized)

- Promoted Warm Pool Size=16 as the new baseline (+3.26% gain).
- Updated PERFORMANCE_TARGETS_SCORECARD.md with Phase 69 results.
- Updated scripts/run_mixed_10_cleanenv.sh and core/bench_profile.h to use HAKMEM_WARM_POOL_SIZE=16 by default.
- Clarified that TINY_REFILL_BATCH_SIZE is not currently connected.

											
										
										
											2025-12-18 01:55:27 +09:00
  - `HAKMEM_TINY_REFILL_COUNT_MID` / `HAKMEM_TINY_REFILL_COUNT_HOT` (the knobs that actually control refill volume, ENV-only)
  - Unified Cache C5-C7 capacity (128 → 256/512)
  - Warm Pool size (12 → 16/24)
- ✓ Sweep plan drafted (single-parameter → combined optimization)
- ✓ Risk assessment and decision criteria defined
**Phase 69-1: Sweep execution** ✅ **Complete**
- ✓ Baseline (Phase 68 PGO): 60.65M ops/s (10-run mean)
- ✓ Warm Pool Size sweep:
  - Size=16: **62.63M ops/s (+3.26%, Strong GO)** ✓✓✓ **Winner**
  - Size=24: 62.37M ops/s (+2.84%, GO)
- ✓ Unified Cache C5-C7 sweep:
  - Cache=256: 61.92M ops/s (+2.09%, GO)
  - Cache=512: 61.80M ops/s (+1.89%, GO)
- ✓ Combined optimization check:
  - Warm=16 + Cache=256: 62.35M ops/s (+2.81%, non-additive)
- ✓ The "Refill Batch Size sweep" is void (knob not connected):
  - `TINY_REFILL_BATCH_SIZE` has no call site in the current Tiny front, so it does not function as a performance knob
  - Reference: `docs/analysis/PHASE69_REFILL_TUNING_3C_REFILL_BATCH_KNOB_AUDIT.md`
- **Results**: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
- **Winning setting**: **Warm Pool Size=16 (ENV-only, +3.26%, Strong GO)**
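As a cross-check, the sweep verdicts can be recomputed from the throughput figures listed above. The sketch below assumes the thresholds used elsewhere in this document (GO at +1.0%, Strong GO at +3.0%, NO-GO at -1.0%); the dictionary keys and helper names are illustrative only. Because the inputs are already rounded to two decimals, a recomputed gain can differ from the listed one in the last digit.

```python
BASELINE = 60.65  # M ops/s, Phase 68 PGO 10-run mean

# Throughput per configuration (M ops/s), copied from the list above.
SWEEP = {
    "warm=16": 62.63,
    "warm=24": 62.37,
    "cache=256": 61.92,
    "cache=512": 61.80,
    "warm=16+cache=256": 62.35,
}


def gain_pct(mops: float) -> float:
    """Gain over the baseline, in percent."""
    return (mops - BASELINE) / BASELINE * 100.0


def verdict(pct: float) -> str:
    """Map a gain in percent to the GO ladder (thresholds assumed as above)."""
    if pct >= 3.0:
        return "STRONG GO"
    if pct >= 1.0:
        return "GO"
    if pct <= -1.0:
        return "NO-GO"
    return "NEUTRAL"


for name, mops in SWEEP.items():
    pct = gain_pct(mops)
    print(f"{name:20s} {pct:+.2f}% {verdict(pct)}")

winner = max(SWEEP, key=SWEEP.get)

# Non-additivity: the sum of the single-knob gains exceeds the combined gain.
additive_estimate = gain_pct(SWEEP["warm=16"]) + gain_pct(SWEEP["cache=256"])
combined = gain_pct(SWEEP["warm=16+cache=256"])
print(f"winner={winner}, additive={additive_estimate:.2f}%, combined={combined:.2f}%")
```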
**Phase 69-2: Apply the winning setting to the baseline** ✅ **Complete**
- ✓ Added `HAKMEM_WARM_POOL_SIZE=16` as the default in `scripts/run_mixed_10_cleanenv.sh`
- ✓ Added `bench_setenv_default("HAKMEM_WARM_POOL_SIZE","16")` to the `MIXED_TINYV3_C7_SAFE` preset in `core/bench_profile.h`
- ✓ Added the new baseline to `PERFORMANCE_TARGETS_SCORECARD.md`:
  - Phase 69 baseline: 62.63M ops/s = 51.77% of mimalloc
  - M1 (50%) achievement: **EXCEEDED** (+1.77pp above target)
  - M2 (55%) progress: Gap reduced to +3.23pp
- ✓ Rollback: set `HAKMEM_WARM_POOL_SIZE=12`, or unset the ENV variable
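The rollback relies on the preset only supplying a default. The sketch below models that behaviour with a plain dictionary. It assumes, from the name alone, that `bench_setenv_default` leaves an already-set variable untouched; the Python helper `setenv_default` is a hypothetical stand-in and not the C implementation. The last lines re-derive the milestone gaps from the 51.77% ratio.

```python
def setenv_default(env: dict, key: str, value: str) -> dict:
    """Set key only when the caller has not already provided a value."""
    env.setdefault(key, value)
    return env


# No override: the preset default (16) applies.
fresh = setenv_default({}, "HAKMEM_WARM_POOL_SIZE", "16")

# Rollback: an explicit 12 from the caller wins over the preset default.
rollback = setenv_default({"HAKMEM_WARM_POOL_SIZE": "12"}, "HAKMEM_WARM_POOL_SIZE", "16")

print(fresh["HAKMEM_WARM_POOL_SIZE"], rollback["HAKMEM_WARM_POOL_SIZE"])

# Milestone arithmetic in percentage points.
ratio = 51.77
m1_excess = round(ratio - 50.0, 2)  # margin above the M1 target
m2_gap = round(55.0 - ratio, 2)     # distance remaining to the M2 target
print(m1_excess, m2_gap)
```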
**New baseline**: 62.63M ops/s = **51.77%** of mimalloc (+3.26% over Phase 68, +3.23pp remaining to M2)

---

**Phase 69-3 (next candidate): refill-count (ENV-only) sweep, or another sweep**
- **Option A (recommended)**: ENV sweep of the refill count (no code changes)
  - Sweep `HAKMEM_TINY_REFILL_COUNT_MID` (C4–C7) over 64/96/128/160…
  - Sweep `HAKMEM_TINY_REFILL_COUNT_HOT` (C0–C3) the same way (note that it interacts with WarmPool/UnifiedCache)
  - Decision on the 10-run mean: GO (+1.0%) / Strong GO (+3.0%) / NO-GO (-1.0%)
- **Option B**: Fine sweep of the Unified Cache (ENV-only)
  - Sweep C5/C6/C7 over 192/256/320… (the 256/512 points in Phase 69-1 were coarse)
  - Isolate the cause of the non-additivity with WarmPool=16
- **Option C**: Add a new compile-time knob (deferred)
  - `TINY_REFILL_BATCH_SIZE` is not connected, so do not pursue it as is
  - If needed, write a separate SSOT and implement it (Phase 70+)
- **Option D**: Optimization in a different direction (shortest path to M2: 55%)
  - Remaining gap: +3.23pp (51.77% → 55%)
  - Phase 67b (boundary inline/unroll tuning)
  - Optimize the top 50 hot functions
  - Re-tune the PGO profile
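Option A amounts to a small matrix of environment settings. The sketch below only builds and prints the command lines and does not run anything. It uses just the four values written out above, since the continuation of the list is not specified. It also assumes that `scripts/run_mixed_10_cleanenv.sh` honours variables set by the caller, which this document does not confirm.

```python
SWEEP_VALUES = (64, 96, 128, 160)  # only the values listed explicitly above


def sweep_envs(knob: str, values, base: dict):
    """Yield one environment dict per sweep point; base is never mutated."""
    for value in values:
        env = dict(base)
        env[knob] = str(value)
        yield env


base = {"HAKMEM_WARM_POOL_SIZE": "16"}  # Phase 69 baseline setting
plans = list(sweep_envs("HAKMEM_TINY_REFILL_COUNT_MID", SWEEP_VALUES, base))

for env in plans:
    prefix = " ".join(f"{key}={val}" for key, val in sorted(env.items()))
    # Assumption: the runner script picks these variables up from the caller.
    print(f"{prefix} ./scripts/run_mixed_10_cleanenv.sh")
```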
---

**Phase 67b (follow-up, fallback): boundary inline/unroll tuning**
- **Caution**: high layout tax risk (see Phase 64)
- **Prerequisite**: the Top 50 execution check is mandatory
- Recommended to defer this and keep it as a fallback in case Phase 69 does not pay off
Phase 62: C7 ULTRA Hotpath Optimization - Planning & Profiling Analysis

Complete planning for Phase 62 based on runtime profiling of Phase 59b baseline.

Key Findings (200M ops Mixed benchmark):
- tiny_c7_ultra_alloc: 5.18% (new primary target, 5x larger than Phase 61)
- tiny_region_id_write_header: 3.82% (reconfirmed, Phase 61 showed 2.32%)
- Allocation-specific hot path: 12.37% (C7 + header + cache)

Phase 62 Recommendation: Option A (C7 ULTRA Inline + IPC Analysis)
- Expected gain: +1-3% (higher absolute margin than Phases 46A/61)
- Risk level: Medium (layout tax precedent from Phase 46A -0.68%, Phase 43 -1.18%)
- Approach: Deep profiling → ASM inspection → A/B test with ENV gate

Alternative Options:
- Option B: tiny_region_id_write_header (3.82%, higher risk)
- Option C: Algorithmic redesign (post-50% milestone)

Box Theory Compliance:
- Single conversion point: tiny_c7_ultra_alloc() boundary
- Reversible: ENV gate HAKMEM_TINY_C7_ULTRA_INLINE_OPT (0/1)
- No side effects: Pure dependency chain reordering

Timeline: Single phase, 4-6 hours (profile + ASM + test)

Documentation:
- PHASE62_NEXT_TARGET_ANALYSIS.md: Complete planning document with profiling data
- CURRENT_TASK.md: Updated next phase guidance

Profiling tools prepared:
- perf record with extended events (cycles, cache-misses, branch-misses)
- ASM inspection methodology documented
- A/B test threshold: ±0.5% (micro-scale)

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 16:27:06 +09:00
Phase 70: Defined observability prerequisites SSOT

- Added docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md to clarify that refill/warmpool optimizations require confirmed cache misses to be measurable.
- Updated CURRENT_TASK.md to point to this prerequisite.

											
										
										
											2025-12-18 03:44:51 +09:00
---

**Phase 70 (solidify the observation prerequisites): make Step 0 of Refill/WarmPool optimization an SSOT**
- Goal: prevent **"optimizing a path that is never exercised"** (layout tax precedents: Phase 40/41/64)
- Note: `Route assignments: LEGACY` does not mean "Unified Cache unused" (it is the backend route kind)
- SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
  - On the Mixed SSOT (WS=400), **confirm with OBSERVE** that `unified_cache_refill()` / WarmPool pop occur at a meaningful rate before proceeding with Phase 70
CURRENT_TASK: Phase 70-73 complete, Phase 72 plan (post-instruction-reduction)

Updated with:
- Phase 70-1/2/3: Route Banner + consistency checks + refill optimization freeze
- Phase 73: Hardware profiling paradox resolved (instruction reduction wins despite worse TLB/cache)
- Phase 72-0: Function-level perf record plan (identify which functions reduced instructions)
- Phase 72-1: Structure optimization targeting identified hot functions

Key insight: WarmPool=16 selects shorter code paths, not memory hierarchy optimization.
Next action: Phase 72-0 confirmed unified_cache_push as primary target.

											
										
										
											2025-12-18 05:55:47 +09:00
- ✅ Phase 70-1: Route Banner implemented (eliminates route misidentification)
  - ENV: `HAKMEM_ROUTE_BANNER=1`
  - Output: Route assignments (backend route kind) + cache config (unified_cache / warm_pool_max_per_class)
- ✅ Phase 70-3: SSOT for OBSERVE statistics consistency (eliminates "it just was not visible" incidents)
  - Confirm `Unified-STATS total_allocs == total_frees` before any discussion (a reliability gate for the statistics)
- ✅ Phase 70-2: Handling of refill optimization settled (SSOT)
  - On the Mixed SSOT (WS=400), if `Unified-STATS miss < 1000`, **refill optimization is frozen (zero ROI)**
  - Current measurement: misses are tiny (e.g. total miss=5) → refill optimization has no ROI on the SSOT workload
  - Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
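The two gates above (statistics consistency first, then the miss threshold) compose into one small decision function. This is a sketch under assumptions: the function name and return labels are hypothetical, and the alloc/free totals in the example are made-up illustration values. Only `miss=5` and the threshold of 1000 come from the text.

```python
def refill_optimization_gate(total_allocs: int, total_frees: int, miss: int,
                             miss_threshold: int = 1000) -> str:
    """Decide whether refill optimization is worth discussing at all."""
    # Gate 1 (Phase 70-3): unbalanced totals mean the stats cannot be trusted.
    if total_allocs != total_frees:
        return "UNTRUSTED_STATS"
    # Gate 2 (Phase 70-2): too few misses means the refill path is barely taken.
    if miss < miss_threshold:
        return "FROZEN"
    return "PROCEED"


# Illustration values for allocs/frees; miss=5 is the example from the text.
decision = refill_optimization_gate(total_allocs=1_000_000, total_frees=1_000_000, miss=5)
print(decision)
```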
---

**Phase 73: Pin down the "winning factor" of WarmPool=16 with perf** ✅ **Complete, paradox resolved**
- Background: WarmPool=16 improves throughput/CV, yet visible counters such as Unified/WarmPool are nearly identical → most likely a **"per-operation cost difference"** (TLB/LLC/frequency/placement)
- Goal: reduce the WarmPool=12 vs 16 difference to "what decreased" using **perf stat**, and commit to the next structural optimization (Phase 72)
- Method: **same binary + cleanenv + alternating runs** (avoids layout tax and environment drift)
  - A: `HAKMEM_WARM_POOL_SIZE=12`
  - B: `HAKMEM_WARM_POOL_SIZE=16`
  - events: `cycles,instructions,branches,branch-misses,cache-misses,LLC-load-misses,iTLB-load-misses,dTLB-load-misses,page-faults`
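The alternating A/B method can be laid out as a run schedule. The sketch below only constructs the command lines and executes nothing. The benchmark binary and its arguments are an assumption borrowed from the Phase 72-0 `perf record` lines, since the Phase 73 notes do not name them, and the clean-environment step is not modelled.

```python
EVENTS = (
    "cycles,instructions,branches,branch-misses,cache-misses,"
    "LLC-load-misses,iTLB-load-misses,dTLB-load-misses,page-faults"
)
# Assumption: binary and arguments taken from the Phase 72-0 commands.
BENCH = ["./bench_random_mixed_hakmem_observe", "20000000", "400", "1"]


def alternating_schedule(pairs: int):
    """Return (label, argv) tuples in A, B, A, B, ... order."""
    schedule = []
    for _ in range(pairs):
        for label, size in (("A", "12"), ("B", "16")):
            argv = ["env", f"HAKMEM_WARM_POOL_SIZE={size}",
                    "perf", "stat", "-e", EVENTS, "--", *BENCH]
            schedule.append((label, argv))
    return schedule


schedule = alternating_schedule(pairs=2)
for label, argv in schedule:
    print(label, " ".join(argv))
```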
**Results** (the paradox):
- ✅ Throughput: +0.91% (46.52M → 46.95M ops/s)
- ✅ **instructions**: -0.38% (-17.4M instructions) ← **PRIMARY WIN SOURCE**
- ✅ **branches**: -0.30% (-3.7M branches) ← **SECONDARY WIN SOURCE**
- ⚠️ **dTLB-load-misses**: +29.06% (28,792 → 37,158) ← **WORSE**
- ⚠️ **cache-misses**: +17.80% (458K → 540K) ← **WORSE**
- ✓ page-faults: -0.21% (negligible)

**Phase 71 hypothesis (REJECTED)**:
- Prediction: "TLB/cache efficiency improvement from memory layout"
- Measured: TLB/cache metrics both **DEGRADED**

**Phase 73 conclusion**:
- Winning factor: **Control-flow optimization (instruction/branch count reduction)**
- Mechanism: WarmPool=16 selects a shorter code path → 17.4M fewer instructions
- Trade-off: +4MB RSS → worse TLB/cache, but instruction savings dominate
- Net benefit: ~8.2M cycles saved (instruction/branch) >> ~4.2M cycles lost (TLB/cache)

**Details**: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`, Phase 73 section
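The deltas and the net-cycle estimate can be re-derived from the figures quoted above. The dTLB figure reproduces exactly. Deltas recomputed from values that were already rounded (throughput, and cache-misses quoted in K) can differ from the listed percentages in the last digit. The helper name is illustrative.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


dtlb = pct_change(28_792, 37_158)      # raw counts, so this matches +29.06%
throughput = pct_change(46.52, 46.95)  # inputs are rounded to 2 decimals

# Net effect in millions of cycles: savings minus losses, both from the text.
net_cycles_m = 8.2 - 4.2

print(f"dTLB-load-misses: {dtlb:+.2f}%")
print(f"throughput (from rounded inputs): {throughput:+.2f}%")
print(f"net cycles saved: ~{net_cycles_m:.1f}M")
```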
**Phase 72 (structural): amplify the winning factor of WarmPool=16 (after the Phase 73 results are in)**
- Prerequisite: start only after Phase 73 has pinned down the "winning factor" numerically (tinkering on guesswork repeats Phase 40/41/64)
- Phase 73 conclusion: **the instruction/branch reduction dominates** (TLB/cache actually got worse) → the essence is that WarmPool=16 steers execution onto a "shorter path"

**Phase 72-0 (SSOT): identify "which functions got shorter" before making structural changes**
- Keep the A/B as WarmPool=12 vs 16 (same binary, cleanenv)
- Take perf record on **instructions/branches rather than cycles** (because the cause is the instruction/branch reduction)
  - `perf record -e instructions:u -c 100000 -- ./bench_random_mixed_hakmem_observe 20000000 400 1`
  - `perf record -e branches:u -c 100000 -- ./bench_random_mixed_hakmem_observe 20000000 400 1`
- Goal: determine the **top 3 functions whose instruction share / branch share dropped** under WarmPool=16 (e.g. `shared_pool_acquire_slab`, `unified_cache_refill`, `warm_pool_do_prefill`, `superslab_refill`)
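Once per-function shares have been extracted from the two profiles, ranking the reductions is a simple diff. The sketch below shows that step only. The share values are invented for illustration and are not measurements, and extracting shares from `perf report` output is left out.

```python
def top_share_drops(share_a: dict, share_b: dict, n: int = 3) -> list:
    """Rank functions by how much their share fell from run A to run B."""
    drops = {func: share_a[func] - share_b.get(func, 0.0) for func in share_a}
    ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)
    # Keep only genuine reductions, rounded for display.
    return [(func, round(drop, 2)) for func, drop in ranked[:n] if drop > 0]


# HYPOTHETICAL instruction shares in percent (A: WarmPool=12, B: WarmPool=16).
share_12 = {
    "shared_pool_acquire_slab": 2.0,
    "unified_cache_refill": 1.5,
    "warm_pool_do_prefill": 0.8,
    "superslab_refill": 0.6,
}
share_16 = {
    "shared_pool_acquire_slab": 1.2,
    "unified_cache_refill": 1.4,
    "warm_pool_do_prefill": 0.3,
    "superslab_refill": 0.6,
}

top3 = top_share_drops(share_12, share_16)
for func, drop in top3:
    print(f"{func}: -{drop}pp")
```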
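**Phase 72-1 (structural): touch only the identified functions (a single box boundary)**
- If `shared_pool_acquire_slab` is the main cause: design to reduce "scan/lock/mmap" (fix the warm prefill boundary to one place)
- If `unified_cache_refill` is the main cause: move "refill preparation/validation" to the boundary side and straighten the hot side
- Note: the goal is not to "reduce misses" but to **take a "shorter path" for the same misses** (the Phase 73 lesson)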
**Note**: do not delete research boxes now (there is strong precedent for link-out/deletion causing layout tax, so keeping them compiled out is the right call)
## 3) Archive
Phase 62A: C7 ULTRA Alloc Dependency Chain Trim - NEUTRAL (-0.71%)

Implemented C7 ULTRA allocation hotpath optimization attempt as per Phase 62A instructions.

Objective: Reduce dependency chain in tiny_c7_ultra_alloc() by:
1. Eliminating per-call tiny_front_v3_c7_ultra_header_light_enabled() checks
2. Using TLS headers_initialized flag set during refill
3. Reducing branch count and register pressure

Implementation:
- New ENV box: core/box/c7_ultra_alloc_depchain_opt_box.h
- HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0/1 gate (default OFF)
- Modified tiny_c7_ultra_alloc() with optimized path
- Preserved original path for compatibility

Results (Mixed benchmark, 10-run):
- Baseline (OPT=0): 59.300 M ops/s (CV 1.98%)
- Treatment (OPT=1): 58.879 M ops/s (CV 1.83%)
- Delta: -0.71% (NEUTRAL, within ±1.0% threshold but negative)
- Status: NEUTRAL → Research box (default OFF)

Root Cause Analysis:
1. LTO optimization already inlines header_light function (call cost = 0)
2. TLS access (memory load + offset) not cheaper than function call
3. Layout tax from code addition (I-cache disruption pattern from Phases 43/46A/47)
4. 5.18% stack % is not optimizable hotspot (already well-optimized)

Key Lessons:
- LTO-optimized function calls can be cheaper than TLS field access
- Micro-optimizations on already-optimized paths show diminishing/negative returns
- 48.34% gap to mimalloc is likely algorithmic, not micro-architectural
- Layout tax remains consistent pattern across attempted micro-optimizations

Decision:
- NEUTRAL verdict → kept as research box with ENV gate (default OFF)
- Not adopted as production default
- Next phases: Option B (production readiness pivot) likely higher ROI than further micro-opts

Box Theory Compliance: ✅ Compliant (single point, reversible, clear boundary)
Performance Compliance: ❌ No (-0.71% regression)

Documentation:
- PHASE62A_C7_ULTRA_DEPCHAIN_OPT_RESULTS.md: Full A/B test analysis
- CURRENT_TASK.md: Updated with results and next phase options

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-17 16:34:03 +09:00
- Detailed log: `CURRENT_TASK_ARCHIVE_20251210.md`
- Snapshot taken just before the latest cleanup: `docs/analysis/CURRENT_TASK_ARCHIVE.md`