Phase 74-1/74-2: UnifiedCache LOCALIZE optimization (P1 frozen, NEUTRAL -0.87%)

Phase 74-1 (ENV-gated LOCALIZE):
- Result: +0.50% (NEUTRAL)
- Runtime branch overhead caused instructions/branches to increase
- Diagnosed: Branch tax dominates intended optimization

Phase 74-2 (compile-time LOCALIZE):
- Result: -0.87% (NEUTRAL, P1 frozen)
- Removed runtime branch → instructions -0.6%, branches -2.3% ✓
- But cache-misses +86% (register pressure/spill) → net loss
- Conclusion: the LOCALIZE change itself works, but it is fragile to cache effects

Key finding:
- Dependency chain reduction (LOCALIZE) has low ROI due to cache-miss sensitivity
- P1 (LOCALIZE) frozen at default OFF
- Next: Phase 74-3 (P0: FASTAPI) - move branches outside hot loop

Files:
- core/hakmem_build_flags.h: HAKMEM_TINY_UC_LOCALIZE_COMPILED flag
- core/box/tiny_unified_cache_hitpath_env_box.h: ENV gate (frozen)
- core/front/tiny_unified_cache.h: compile-time #if blocks
- docs/analysis/PHASE74_*: Design, instructions, results
- CURRENT_TASK.md: P1 frozen, P0 next instructions

Also includes:
- Phase 69 refill tuning results (archived docs)
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 69 baseline update
- PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md: Route banner docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Moe Charm (CI)
2025-12-18 07:47:44 +09:00
parent e4baa1894f
commit e9b97e9d8e
14 changed files with 840 additions and 210 deletions


@ -11,7 +11,7 @@
Comparison with mimalloc is done on the **FAST build** (Standard includes the fixed tax, so the comparison would be unfair).
## Current snapshot (2025-12-17, Phase 68 PGO) — new baseline
## Current snapshot (2025-12-18, Phase 69 PGO + WarmPool=16) — current baseline
Measurement conditions (canonical for reproduction):
- Mixed: `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`)


@ -0,0 +1,197 @@
# Phase 69-1: Refill Tuning Parameter Sweeps - Results
**Date**: 2025-12-17
**Baseline**: Phase 68 PGO (`bench_random_mixed_hakmem_minimal_pgo`)
**Benchmark**: `scripts/run_mixed_10_cleanenv.sh` (RUNS=10)
**Goal**: Find +3-6% optimization for M2 milestone (55% of mimalloc)
---
## Executive Summary
**Winner Identified**: **Warm Pool Size=16** achieves **+3.26% (Strong GO)** with ENV-only change.
- **No code changes required** - Deploy via `HAKMEM_WARM_POOL_SIZE=16` environment variable
- **Exceeds M2 threshold** (+3.0% Strong GO criterion)
- **Single strongest improvement** among all tested parameters
- **Combined optimizations are non-additive** - Warm Pool Size=16 alone outperforms combinations
⚠️ **Important correction (2025-12 audit)**:
The previously reported “Refill Batch Size sweep” based on `TINY_REFILL_BATCH_SIZE` was **not measuring a real knob**.
That macro currently has **zero call sites** (it is defined but not referenced in the active Tiny front path), so any
observed deltas were **layout/drift noise**, not an algorithmic effect.
---
## Full Sweep Results
### Baseline (Phase 68 PGO)
| Metric | Value |
|--------|-------|
| **Mean** | 60.65M ops/s |
| **Median** | 60.68M ops/s |
| **CV** | 1.68% |
| **% of mimalloc** | 50.93% |
**Runs**: 10
**Binary**: `bench_random_mixed_hakmem_minimal_pgo` (PGO optimized)
---
### 1. Warm Pool Size Sweep (ENV-only, no recompile)
**Parameter**: `HAKMEM_WARM_POOL_SIZE` (default: 12 SuperSlabs/class)
| Size | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
|------|----------------|------------------|----|-----------:|----------|
| **16** | **62.63** | **63.38** | 2.43% | **+3.26%** | **Strong GO** ✓✓✓ |
| 24 | 62.37 | 62.35 | 1.99% | +2.84% | GO ✓ |
**Winner**: **Size=16 (+3.26%)**
**Analysis**:
- Size=16 exceeds +3.0% Strong GO threshold
- Size=24 shows diminishing returns (+2.84% vs +3.26%)
- Optimal sweet spot at Size=16 balances cache hit rate vs memory overhead
**Command Used**:
```bash
# Size=16
HAKMEM_WARM_POOL_SIZE=16 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
# Size=24
HAKMEM_WARM_POOL_SIZE=24 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
```
---
### 2. Unified Cache C5-C7 Sweep (ENV-only, no recompile)
**Parameter**: `HAKMEM_TINY_UNIFIED_C5`, `HAKMEM_TINY_UNIFIED_C6`, `HAKMEM_TINY_UNIFIED_C7` (default: 128 slots)
| Cache Size | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
|------------|----------------|------------------|----|-----------:|----------|
| **256** | **61.92** | **61.70** | 1.49% | **+2.09%** | **GO** ✓ |
| 512 | 61.80 | 62.00 | 1.21% | +1.89% | GO ✓ |
**Winner**: **Cache=256 (+2.09%)**
**Analysis**:
- Cache=256 shows +2.09% improvement (GO threshold)
- Cache=512 shows diminishing returns (+1.89% vs +2.09%)
- Larger caches provide marginal gains while increasing memory overhead
- Lower CV (1.49%) indicates stable performance
**Command Used**:
```bash
# Cache=256
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
# Cache=512
HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
```
---
### 3. Combined Optimization Check
**Configuration**: Warm Pool Size=16 + Unified Cache C5-C7=256
| Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
|----------------|------------------|----|-----------:|----------|
| 62.35 | 62.32 | 1.91% | +2.81% | GO (non-additive) |
**Analysis**:
- Combined result (+2.81%) is **LESS than** Warm Pool Size=16 alone (+3.26%)
- **Non-additive behavior** indicates parameters are not orthogonal
- **Likely explanation**: Warm pool optimization reduces unified cache miss rate, making cache capacity increase redundant
- **Recommendation**: Use Warm Pool Size=16 alone for maximum benefit
**Command Used**:
```bash
HAKMEM_WARM_POOL_SIZE=16 HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
```
---
### 4. Refill Batch Size Sweep (invalid — macro not wired)
The `TINY_REFILL_BATCH_SIZE` macro is currently **define-only**:
```bash
rg -n "TINY_REFILL_BATCH_SIZE" core
# -> core/hakmem_tiny_config.h only
```
So we do **not** treat it as a tuning parameter until it is actually connected to refill logic.
If we want to tune refill frequency, use the real knobs:
- `HAKMEM_TINY_REFILL_COUNT_HOT`
- `HAKMEM_TINY_REFILL_COUNT_MID`
- `HAKMEM_TINY_REFILL_COUNT` / `HAKMEM_TINY_REFILL_COUNT_C{0..7}`
---
## Recommendations
### Phase 69-2 (Baseline Promotion)
**Primary Recommendation**: **Deploy Warm Pool Size=16 (ENV-only)**
**Rationale**:
1. **Strongest single improvement** (+3.26%, Strong GO)
2. **No code changes required** - Zero risk of layout tax
3. **Immediate deployment** via environment variable
4. **Exceeds M2 threshold** (+3.0% Strong GO criterion)
**Deployment**:
```bash
# Add to PGO training environment and benchmark scripts
export HAKMEM_WARM_POOL_SIZE=16
```
---
### Secondary Options (for Phase 69-3+)
**Option A: Warm Pool Size=16 + Refill Batch=32**
- **Combined potential**: Unknown (requires testing, may be non-additive like unified cache)
- **Complexity**: Requires PGO rebuild for Batch=32
- **Risk**: Layout tax from code change
**Option B: Warm Pool Size=16 alone (recommended)**
- **Gain**: +3.26% guaranteed
- **Complexity**: ENV-only, zero code changes
- **Risk**: None (reversible via ENV)
---
## Raw Data Files
All 10-run logs saved to:
- `/tmp/phase69_baseline.log` - Phase 68 PGO baseline
- `/tmp/phase69_warm16.log` - Warm Pool Size=16
- `/tmp/phase69_warm24.log` - Warm Pool Size=24
- `/tmp/phase69_cache256.log` - Unified Cache C5-C7=256
- `/tmp/phase69_cache512.log` - Unified Cache C5-C7=512
- `/tmp/phase69_combined.log` - Combined (Warm=16 + Cache=256)
- `/tmp/phase69_batch32.log` - Refill Batch=32
---
## Next Steps
**Awaiting User Instructions for Phase 69-2**:
1. Confirm Warm Pool Size=16 as baseline promotion candidate
2. Decide whether to:
- Update ENV defaults in `hakmem_tiny_config.h` (preferred for SSOT)
- Document as recommended ENV setting in README/docs
- Add to PGO training scripts
3. Re-run `make pgo-fast-full` with `HAKMEM_WARM_POOL_SIZE=16` in training environment
4. Update `PERFORMANCE_TARGETS_SCORECARD.md` with new baseline (projected: 62.63M ops/s, ~52.6% of mimalloc)
---
**Phase 69-1 Status**: ✅ **COMPLETE**
**Winner**: **Warm Pool Size=16 (+3.26%, Strong GO, ENV-only)**


@ -0,0 +1,46 @@
# Phase 69-3A: Refill Batch=64 build failure triage — Root cause & fix
## Symptom
`make pgo-fast-build` (profile-use) fails to link with undefined `__gcov_*` symbols, e.g.:
- `__gcov_init`, `__gcov_exit`
- `__gcov_merge_add`, `__gcov_merge_topn`
- `__gcov_time_profiler_counter`
This appeared when trying to evaluate `Refill Batch Size=64`.
## Root cause (actual)
The failure is **not** “compiler limit due to batch=64”.
It is a **stale object mixing** problem:
- Some benchmark `.o` files were built in the profile-gen step (`-fprofile-generate`) and **were not removed by `make clean`**.
- In the profile-use step (`-fprofile-use`), those stale instrumented `.o` files were reused and linked without `-fprofile-generate` → libgcov was not pulled in.
- Result: unresolved `__gcov_*` symbols at link time.
In other words: **instrumented bench object reused in non-instrumented link**.
## Fix (minimal, safe)
Strengthen `make clean` to remove benchmark objects/binaries that were previously omitted, including:
- `bench_random_mixed_hakmem.o`
- `bench_tiny_hot_hakmem.o`
- related bench variants (`*_system`, `*_mi`, `*_hakx`, `*_minimal*`, etc.)
This preserves toolchain fairness (GCC + LTO) and prevents cross-step contamination in PGO workflows.
## Verification
After the fix, the Phase 66 PGO pipeline builds successfully again:
```sh
make pgo-fast-profile pgo-fast-collect pgo-fast-build
```
## Notes
- This fix is **layout-neutral**: it only affects build hygiene (artifact cleanup).
- This also hardens other workflows where flags change across builds (PGO / FAST targets).
- Follow-up audit note (2025-12): `TINY_REFILL_BATCH_SIZE` is currently define-only (no call sites), so the “batch=64”
performance experiment itself was not measuring a real knob; however the build hygiene fix remains valid and important.


@ -0,0 +1,45 @@
# Phase 69-3B: Refill Batch Size sweep (PGO, warm_pool=16) — Results
⚠️ **INVALID (2025-12 audit)**: `TINY_REFILL_BATCH_SIZE` is currently **not wired** into the active Tiny front path
(it has zero call sites; define-only in `core/hakmem_tiny_config.h`). Any observed deltas in this file should be treated
as **layout/drift noise**, not an algorithmic effect. This document is kept only as an experiment record.
## Context
Phase 69-2 promoted the ENV-only winner:
- `HAKMEM_WARM_POOL_SIZE=16`
This phase explores compile-time refill batch size (`TINY_REFILL_BATCH_SIZE`) under the current PGO workflow:
- `make pgo-fast-full` (GCC + LTO preserved)
- Training uses cleanenv-aligned workloads (`scripts/box/pgo_fast_profile_config.sh`)
## Build hygiene prerequisite
Batch=64 originally “failed to build” due to stale profile-gen bench objects being reused in profile-use links.
That issue is fixed by strengthening `make clean` (see `docs/analysis/PHASE69_REFILL_TUNING_3A_BUILD_FAILURE_TRIAGE_BATCH64.md`).
## Measurement (Mixed 10-run)
All results are from the same host session, using:
- `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo`
- `RUNS=10 scripts/run_mixed_10_cleanenv.sh`
| Batch | Mean (M ops/s) | Median (M ops/s) | CV |
|------:|----------------:|-----------------:|---:|
| 16 | 61.30 | 61.64 | 1.50% |
| 32 | 60.73 | 61.17 | 2.19% |
| 48 | 61.94 | 62.54 | 1.53% |
| 64 | 61.51 | 61.81 | 1.56% |
## Decision
- **Batch=48** is the best of the tested set in this session (+~1.0% vs batch=16 baseline).
- **Batch=32** regresses in this session (note: previously was GO under a different baseline).
- **Batch=64** builds successfully after the hygiene fix, but is not the best performer here.
## Next steps (Phase 69-3C)
If we want to pursue M2 (55%) via this path:
1. Promote **batch=48** as a research candidate with a dedicated Phase tag (compile-time change + PGO rebuild).
2. Re-run the sweep at another time window to confirm ordering (layout/drift sensitivity).
3. If stable, promote batch=48 into the FAST baseline build path.


@ -0,0 +1,47 @@
# Phase 69-3C: Refill Batch “knob” audit — `TINY_REFILL_BATCH_SIZE` is not wired
## Summary
The Phase 69 “Refill Batch Size sweep” was based on `TINY_REFILL_BATCH_SIZE` in `core/hakmem_tiny_config.h`, but an audit
shows this macro currently has **zero call sites** in the active Tiny front path. As a result, any measured deltas from
editing this macro are **not algorithmic**; they are attributable to layout/drift/noise.
## Evidence
### 1) Zero call sites
```sh
rg -n "TINY_REFILL_BATCH_SIZE" core
```
Result: only `core/hakmem_tiny_config.h` (define-only).
### 2) PGO binaries unchanged when toggling the macro
We rebuilt the full PGO pipeline twice (`make pgo-fast-full`) after changing the macro (batch16 vs batch48) and found the
resulting binaries were bit-identical (same size + same SHA256).
This confirms the macro does not affect the compiled hot path today.
## Action taken
- Restored `TINY_REFILL_BATCH_SIZE` to `16` and added an explicit “not wired” note in `core/hakmem_tiny_config.h`.
- Marked the “Refill Batch Size sweep” section in Phase 69 docs as invalid.
## What to tune instead (real knobs)
To tune refill frequency/amount without rebuilding:
- `HAKMEM_TINY_REFILL_COUNT_HOT` (C0..C3)
- `HAKMEM_TINY_REFILL_COUNT_MID` (C4..C7)
- `HAKMEM_TINY_REFILL_COUNT` / `HAKMEM_TINY_REFILL_COUNT_C{0..7}`
Defaults are set in `core/hakmem_tiny_init.inc` and can be overridden via ENV.
## Optional future work (if we still want a compile-time knob)
If we want a compile-time “refill batch size” knob, we need to wire it into a single SSOT:
- either by feeding it into the refill-count defaults (`g_refill_count_*`), or
- by introducing a dedicated build flag that the refill logic consumes directly.
Until then, do not run Phase 69 sweeps based on `TINY_REFILL_BATCH_SIZE`.
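The first wiring option could look roughly like the following sketch. This is hypothetical: the `g_refill_count` array shape and the init function are stand-ins for whatever `core/hakmem_tiny_init.inc` actually defines, shown only to illustrate "the macro seeds the runtime defaults, ENV overrides come after".

```c
#include <stddef.h>

/* Hypothetical single-SSOT wiring: the compile-time batch size seeds the
 * per-class refill-count defaults. Names below are assumptions, not the
 * real contents of core/hakmem_tiny_init.inc. */
#ifndef TINY_REFILL_BATCH_SIZE
#define TINY_REFILL_BATCH_SIZE 16
#endif

static int g_refill_count[8];   /* stand-in for g_refill_count_* */

static void tiny_refill_init_defaults(void) {
    for (int c = 0; c < 8; c++)
        g_refill_count[c] = TINY_REFILL_BATCH_SIZE;  /* single SSOT seed */
    /* ENV overrides (HAKMEM_TINY_REFILL_COUNT_*) would be applied here,
     * after the compile-time default is in place. */
}
```

With this shape, editing `TINY_REFILL_BATCH_SIZE` would actually change the compiled defaults, so a sweep over it would measure a real knob again.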


@ -12,6 +12,13 @@
Before implementing any refill/WarmPool changes, execute this sequence:
0. **Route Banner (optional but recommended)**:
```bash
HAKMEM_ROUTE_BANNER=1 ./bench_random_mixed_hakmem_observe ...
```
- Prints the route assignments (backend route kind) and the cache config (`unified_cache_enabled` / `warm_pool_max_per_class`) exactly once.
- Prevents misreadings such as "Route=LEGACY means the Unified Cache is unused" (even on LEGACY, the Unified Cache is used on the alloc/free front).
1. **Build with Stats**:
```bash
make bench_random_mixed_hakmem_observe EXTRA_CFLAGS='-DHAKMEM_UNIFIED_CACHE_STATS_COMPILED=1'
@ -20,7 +27,7 @@ Before implementing any refill/WarmPool changes, execute this sequence:
2. **Run with Stats**:
```bash
HAKMEM_WARM_POOL_STATS=1 ./bench_random_mixed_hakmem_observe 20000000 400 1
HAKMEM_ROUTE_BANNER=1 HAKMEM_WARM_POOL_STATS=1 ./bench_random_mixed_hakmem_observe 20000000 400 1
```
3. **Check Output**:


@ -0,0 +1,116 @@
# Phase 74: UnifiedCache hit-path structural optimization (WS=400 SSOT)
**Status**: 🟡 DRAFT (design SSOT / next instructions)
## 0) Background (why this, why now)
- Current baseline (Phase 69): `bench_random_mixed_hakmem_minimal_pgo` = **62.63M ops/s = 51.77% of mimalloc** (`HAKMEM_WARM_POOL_SIZE=16`).
- The Phase 70 observability SSOT established that on WS=400 (Mixed SSOT) **UnifiedCache misses are negligible**.
- Therefore speeding up `unified_cache_refill()` / WarmPool-pop has **near-zero ROI** (refill optimization is frozen).
- SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- Phase 73 (perf stat) established that the WarmPool=16 win is dominated by a **small reduction in instructions/branches**.
- So the most promising next direction is, again, to shorten the hit path.
- Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
The aim of this phase is to **structurally move branches/loads that need not be taken out of the UnifiedCache hit path (push/pop)**.
## 1) Goals / non-goals
**Goals**
- Target **+1-3%** (single change) on the WS=400 SSOT workload (stacking toward M2 = 55%).
- Avoid "optimizations on paths that are never taken" (respect the Phase 70 SSOT).
**Non-goals**
- Optimizing `unified_cache_refill()` (misses are negligible, so there is no ROI on the SSOT workload).
- DCE via link-out / large deletions (layout tax has flipped the sign of results before).
- Changing the route kind to target a different workload (do not break the SSOT workload first).
## 2) Box Theory (box layout)
### Box responsibilities
L0: **EnvGateBox**
- `HAKMEM_TINY_UC_*` toggles (default OFF, always revertible)
L1: **TinyUnifiedCacheHitPathBox (NEW / research box)**
- Shortens **only the hit path** of `unified_cache_push/pop` (does not touch refill/overflow/registry).
- Exactly one conversion point (boundary): a single "fast→fallback" transition inside `unified_cache_push/pop`.
### Observability (minimal)
- Only the two counters `uc_hitpath_fast_hits` / `uc_hitpath_fast_fallbacks` (and only if needed).
- Beyond that, treat `perf stat` (instructions/branches) as the source of truth.
## 3) Concrete proposals (in priority order)
### P1 (low risk): localize into variables to pin down reloads / dependency chains
Aim:
- Suppress reloads of `cache->head/tail/mask/capacity` etc. and **shorten the dependency chain**.
Design:
- Inside `unified_cache_push()` / `unified_cache_pop_or_refill()`:
  - **drop fields into locals**, e.g. `uint16_t head = cache->head;`
  - compute the `next = (x + 1) & mask` arithmetic **exactly once**
  - batch stores such as `cache->tail = next;` at the end
Rollout:
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
- Method: ON/OFF within the same binary (to minimize layout tax, restrict the branch to a single one at the entry)
Risk:
- Higher register pressure could make it slower instead → A/B is mandatory.
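The P1 transformation can be sketched on a simplified ring buffer. This is a stand-in, not the real `TinyUnifiedCache` (field names and the capacity are assumptions); it only shows the load-once / compute-once / store-last shape:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for TinyUnifiedCache; capacity = mask + 1. */
typedef struct {
    uint16_t head, tail, mask;
    void    *slots[256];
} uc_ring_t;

/* Baseline style: each cache->field use is a fresh load, so the compiler
 * must assume the slot store may alias them and re-read after it. */
static int uc_push_baseline(uc_ring_t *cache, void *p) {
    uint16_t next = (uint16_t)((cache->tail + 1) & cache->mask);
    if (next == cache->head) return 0;       /* full */
    cache->slots[cache->tail] = p;
    cache->tail = next;
    return 1;
}

/* LOCALIZE style: load head/tail/mask once into locals, compute the wrap
 * arithmetic once, and write back with a single store at the end. */
static int uc_push_localized(uc_ring_t *cache, void *p) {
    uint16_t head = cache->head;
    uint16_t tail = cache->tail;
    uint16_t mask = cache->mask;
    uint16_t next = (uint16_t)((tail + 1) & mask);
    if (next == head) return 0;              /* full */
    cache->slots[tail] = p;
    cache->tail = next;                      /* single store, last */
    return 1;
}
```

Both functions are semantically identical; only the dependency structure differs, which is exactly what the A/B is meant to isolate.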
### P0 (medium risk / medium ROI): Fast-API (move enable checks / stats out of the path)
Aim:
- **Push out to the call sites** the "almost invariant" checks that remain in the hit path, straightening `push/pop`.
Design:
- Add a **minimal API** such as `unified_cache_push_fast(TinyUnifiedCache* cache, void* base)`.
  - Precondition: the caller guarantees "enabled / initialized / stats OFF".
  - Only on failure, fall back to the existing `unified_cache_push()` (single boundary).
Rollout:
- ENV: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0)
- Fail-fast: if the mode changes mid-run, take the "safe fallback" (for bench use, abort would also be acceptable).
Risk:
- More call sites shift the layout → GO threshold is +1.0% (strict).
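A minimal sketch of the P0 split, again on a hypothetical struct (field names and the stats hook are assumptions; only `unified_cache_push_fast`-style naming comes from the plan):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shapes mirroring the plan, not the real headers. */
typedef struct {
    uint16_t head, tail, mask;
    void    *slots[256];
    bool     enabled;       /* "almost invariant" state */
    bool     stats_on;
} uc_t;

/* Existing-style push: re-checks enable/stats state on every call. */
static int uc_push_full(uc_t *c, void *p) {
    if (!c->enabled) return 0;
    if (c->stats_on) { /* ...count the event... */ }
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}

/* Fast API: the caller has already guaranteed enabled && !stats_on, so
 * the hot body is straight-line; anything unusual (here: ring full)
 * drops to the full path through a single boundary. */
static inline int uc_push_fast(uc_t *c, void *p) {
    uint16_t tail = c->tail;
    uint16_t next = (uint16_t)((tail + 1) & c->mask);
    if (next == c->head) return uc_push_full(c, p);  /* single boundary */
    c->slots[tail] = p;
    c->tail = next;
    return 1;
}
```

The point is that the enable/stats branches move from every push to the one place that establishes the precondition.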
### P2 (high risk / high-ROI candidate): place slots directly in TLS for hot classes only (cut pointer chasing)
Aim:
- Remove the `cache->slots` load (pointer chase) from the hit path.
Design:
- Split only the "hot classes" of `TinyUnifiedCache` into a separate structure with `slots[]` placed directly in TLS.
- Candidates: the small-capacity C4/C5/C6/C7 (the 2048-slot C2/C3 are too heavy to inline).
Risk:
- The larger TLS footprint could hurt dTLB/cache (a big win if it works, but NO-GO is possible).
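A sketch of the P2 idea under stated assumptions (the structure, the 4-class split, and the 128-slot capacity are illustrative; the real hot classes and capacities would come from the Tiny config):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical: hot classes get a small slots[] embedded directly in a
 * TLS struct, so the hit path indexes an array member instead of first
 * loading a cache->slots pointer (one less dependent load). */
enum { HOT_CLASSES = 4, HOT_CAP = 128 };   /* e.g. C4..C7; illustrative */

typedef struct {
    uint16_t head[HOT_CLASSES];
    uint16_t tail[HOT_CLASSES];
    void    *slots[HOT_CLASSES][HOT_CAP];  /* direct, no pointer chase */
} uc_hot_tls_t;

static _Thread_local uc_hot_tls_t g_uc_hot;  /* name is an assumption */

static int uc_hot_push(int cls, void *p) {
    uint16_t tail = g_uc_hot.tail[cls];
    uint16_t next = (uint16_t)((tail + 1) & (HOT_CAP - 1));
    if (next == g_uc_hot.head[cls]) return 0;   /* full: caller falls back */
    g_uc_hot.slots[cls][tail] = p;
    g_uc_hot.tail[cls] = next;
    return 1;
}
```

The trade-off named above is visible here: the TLS block grows by `HOT_CLASSES * HOT_CAP * sizeof(void*)` bytes, which is exactly the dTLB/cache risk.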
## 4) A/B (SSOT)
### 4.1 Bench conditions (fixed)
- `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`)
- `HAKMEM_WARM_POOL_SIZE=16` (baseline)
### 4.2 GO/NO-GO
- **GO**: +1.0% or better
- **NEUTRAL**: within ±1.0% (freeze as a research box)
- **NO-GO**: -1.0% or worse (revert immediately)
### 4.3 Always also check (Phase 73 lesson)
- `perf stat`: `instructions`, `branches`, `branch-misses` (the winning mechanism is instruction/branch reduction)
- `cache-misses`, `iTLB-load-misses`, `dTLB-load-misses` (layout-tax detection)
## 5) Recommended implementation order
1. Land **P1 (LOCALIZE)** as a small change and A/B it (fastest way to confirm the winning mechanism).
2. If it wins, add **P0 (FASTAPI)** (move more branches out).
3. If that is still not enough, try **P2 (inline slots hot)** as a research box.
## 6) Exit criteria (when to stop)
- If "unified_cache_push/pop" drops out of the perf Top 50 on the WS=400 SSOT, retreat from this line of work (Phase 42 lesson).
- After three consecutive NEUTRAL/NO-GO results, move on to the next structural layer (the layout-tax risk keeps growing).


@ -0,0 +1,75 @@
# Phase 74-1: UnifiedCache hit-path "LOCALIZE" implementation instructions
**Status**: 🟡 READY
## Goal
On WS=400 (Mixed SSOT) almost only the hit path is exercised, so **shorten the dependency chains (reloads)** in `unified_cache_push/pop` to cut instructions/branches.
- Design SSOT: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- Observability SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md` (refill optimization is frozen)
## Principles (Box Theory)
- L0: add an ENV-gate box (default OFF, always revertible)
- L1: change confined to `unified_cache_push/pop` (single boundary)
- Minimal observability (treat perf stat as the source of truth)
- Fail-fast: when in doubt, fall back
## Step 0: Confirm the baseline (SSOT)
```bash
scripts/run_mixed_10_cleanenv.sh
```
## Step 1: ENV gate (L0 box)
New file:
- `core/box/tiny_unified_cache_hitpath_env_box.h` (example)
ENV:
- `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
Requirement:
- Never call getenv on the hot path (use the existing lazy-init pattern, or pin it with a build flag)
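The "no getenv on the hot path" requirement is the usual lazy-init cache. A minimal sketch (the helper name matches this doc's usage; `__builtin_expect` assumes GCC/Clang, which the PGO workflow already uses):

```c
#include <stdlib.h>
#include <string.h>

/* Lazy-init ENV gate: getenv() runs at most once, on the first call,
 * never repeatedly on the hot path. After that the result is a cached
 * integer load plus a predictable branch. */
static inline int tiny_uc_localize_enabled(void) {
    static int cached = -1;                  /* -1 = not yet read */
    if (__builtin_expect(cached < 0, 0)) {
        const char *v = getenv("HAKMEM_TINY_UC_LOCALIZE");
        cached = (v && strcmp(v, "1") == 0) ? 1 : 0;
    }
    return cached;
}
```

Note this is exactly the runtime branch that Phase 74-2 later replaces with a compile-time gate, because even a cached check has a cost at this call frequency.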
## Step 2: LOCALIZE implementation (L1 box)
Target:
- `core/front/tiny_unified_cache.h` (`unified_cache_push()` / `unified_cache_pop_or_refill()`)
Policy:
- Drop `cache->head/tail/mask/capacity` into locals to **prevent reloads**
- Batch stores at the end (e.g. `cache->tail = next_tail;`)
- Do not change semantics (preserve the meaning of capacity/ordering/stats/overflow)
Rollout pattern (example):
- When `!tiny_uc_localize_enabled()`, run the existing implementation unchanged
- Only when enabled, call the localized version
## Step 3: A/B (same binary)
```bash
HAKMEM_TINY_UC_LOCALIZE=0 scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_UC_LOCALIZE=1 scripts/run_mixed_10_cleanenv.sh
```
Additionally (mandatory, since the winning mechanism is instructions/branches):
```bash
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,iTLB-load-misses,dTLB-load-misses -- \
./bench_random_mixed_hakmem_minimal_pgo 20000000 400 1
```
## Judgment
- **GO**: +1.0% or better
- **NEUTRAL**: within ±1.0% (freeze as a research box)
- **NO-GO**: -1.0% or worse (revert immediately)
NO-GO triage:
- Use `scripts/box/layout_tax_forensics_box.sh` (classifies layout tax / IPC drop / TLB degradation)
## Step 4: Promotion policy
- Even on a first GO, do **not** turn it ON by default (first confirm reproducibility with 3 independent re-measurements)
- If all 3 are GO, consider promoting it into `scripts/run_mixed_10_cleanenv.sh` / `core/bench_profile.h`


@ -0,0 +1,140 @@
# Phase 74: UnifiedCache hit-path structural optimization - Results
**Status**: 🔴 P1 (LOCALIZE) FROZEN (NEUTRAL -0.87%)
## Summary
Phase 74 investigated **unified_cache_push/pop** hit-path optimizations aiming for +1-3% via instruction/branch reduction (Phase 73 lesson).
**P1 (LOCALIZE)** attempted to reduce dependency chains by loading `head/tail/mask` into locals, but was **frozen at NEUTRAL (-0.87%)** due to cache-miss increase.
---
## Phase 74-1: LOCALIZE (ENV-gated, runtime branch)
**Goal**: Load `head/tail/mask` once into locals to avoid reload dependency chains.
**Implementation**:
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
- Runtime branch at entry: `if (tiny_uc_localize_enabled()) { ... }`
**Results** (10-run A/B):
| Metric | LOCALIZE=0 | LOCALIZE=1 | Delta |
|--------|------------|------------|-------|
| throughput | 57.43 M ops/s | 57.72 M ops/s | **+0.50%** |
| instructions | 4,583M | 4,615M | **+0.7%** |
| branches | 1,276M | 1,281M | **+0.4%** |
| cache-misses | 560K | 461K | -17.7% |
**Diagnosis**: Runtime branch overhead dominated. Instructions/branches **increased** despite LOCALIZE intent.
**Judgment**: **NEUTRAL (+0.50%, ±1.0% threshold)** → Proceed to Phase 74-2 (compile-time gate).
---
## Phase 74-2: LOCALIZE (compile-time gate, no runtime branch)
**Goal**: Eliminate the runtime branch to isolate the performance of the LOCALIZE change itself.
**Implementation**:
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Compile-time gate: `#if HAKMEM_TINY_UC_LOCALIZE_COMPILED` (no runtime branch)
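The compile-time gate replaces the runtime check with an `#if`, so exactly one body is compiled and the hit path carries no enable branch at all. A sketch (only the flag name `HAKMEM_TINY_UC_LOCALIZE_COMPILED` is from this phase; the struct and function are simplified stand-ins):

```c
#include <stdint.h>
#include <stddef.h>

#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
#define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0   /* default OFF, per the phase */
#endif

/* Simplified stand-in ring; not the real TinyUnifiedCache layout. */
typedef struct { uint16_t head, tail, mask; void *slots[256]; } uc_t;

static int uc_push(uc_t *c, void *p) {
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    /* Localized body: fields loaded once, single store at the end. */
    uint16_t tail = c->tail;
    uint16_t next = (uint16_t)((tail + 1) & c->mask);
    if (next == c->head) return 0;
    c->slots[tail] = p;
    c->tail = next;
#else
    /* Original-style body, selected at compile time (no runtime branch). */
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = next;
#endif
    return 1;
}
```

This is why 74-2 cleanly separates the branch tax (gone) from the LOCALIZE effect itself (visible in the instructions/branches deltas below).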
**Results** (10-run A/B via `layout_tax_forensics_box.sh`):
| Metric | Baseline (=0) | Treatment (=1) | Delta |
|--------|---------------|----------------|-------|
| **throughput** | 58.90 M ops/s | 58.39 M ops/s | **-0.87%** |
| cycles | 1,553M | 1,548M | -0.3% |
| **instructions** | 2,748M | 2,733M | **-0.6%** |
| **branches** | 632M | 617M | **-2.3%** |
| **cache-misses** | 707K | 1,316K | **+86%** |
| dTLB-load-misses | 46K | 33K | -28% |
**Analysis**:
1. **Runtime branch overhead removed** → instructions/branches improved (-0.6%/-2.3%) ✓
2. **The LOCALIZE change itself is effective** → dependency chain reduction confirmed ✓
3. **But cache-misses +86%** → register pressure / spill / worse access pattern
4. **Net result: -0.87%** → cache-miss increase dominates instruction/branch savings
**Phase 74-1 vs 74-2 comparison**:
- 74-1 (runtime branch): instructions +0.7%, branches +0.4% → **branch overhead loses**
- 74-2 (compile-time): instructions -0.6%, branches -2.3% → **the LOCALIZE change itself wins**
- But cache-misses +86% cancels out → **total NEUTRAL**
**Judgment**: **NEUTRAL (-0.87%, below +1.0% GO threshold)****P1 FROZEN**
---
## Root Cause (Phase 74-2)
**Why cache-misses increased (+86%)**:
1. **Register pressure hypothesis**: Loading `head/tail/mask` into locals increases live registers
- Compiler may spill to stack → more memory traffic
- `cache->slots[head]` may lose prefetch opportunity
2. **Access pattern change**: `cache->head` direct load may benefit from compiler optimizations
- Storing to local breaks dependency tracking?
- Memory alias analysis degraded?
**Evidence**:
- dTLB-misses decreased (-28%) → not a TLB/page issue
- L1-dcache-load-misses similar → first-level data layout not the issue
- cache-misses (+86%) is the PRIMARY BLOCKER
---
## Lessons Learned
1. **Runtime branch tax is real**: Phase 74-1 showed +0.7% instruction increase from ENV gate
2. **The LOCALIZE change itself works**: Phase 74-2 confirmed -2.3% branches once the runtime branch was removed
3. **Register pressure matters**: Even when instruction count drops, cache behavior can dominate
4. **This optimization path has low ROI**: Dependency chain reduction is fragile to cache effects
**Conclusion**: P1 (LOCALIZE) frozen. Move to **P0 (FASTAPI)** (different approach: move branches outside hot loop).
---
## P1 (LOCALIZE) - Frozen State
**Files**:
- `core/hakmem_build_flags.h`: `HAKMEM_TINY_UC_LOCALIZE_COMPILED` (default 0)
- `core/box/tiny_unified_cache_hitpath_env_box.h`: ENV gate (unused after 74-2)
- `core/front/tiny_unified_cache.h`: compile-time `#if` blocks
**Default behavior**: LOCALIZE=0 (original implementation)
**Rollback**: No action needed (default OFF)
---
## Next Steps
**Phase 74-3: P0 (FASTAPI)**
**Goal**: Move `unified_cache_enabled()` / `lazy-init` / `stats` checks **outside** hot loop.
**Approach**:
- Create `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
- Assume: "valid/enabled/no-stats" at caller side
- Fail-fast: fallback to slow path on unexpected state
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
**Expected benefit**: +1-2% via branch reduction (different axis than P1)
**GO threshold**: +1.0% (strict, structural change)
---
## Artifacts
- **Design**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- **Instructions**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- **Results**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md` (this file)
- **Forensics output**: `./results/layout_tax_forensics/` (Phase 74-2 perf data)
---
## Timeline
- Phase 74-1: ENV-gated LOCALIZE → **NEUTRAL (+0.50%)**
- Phase 74-2: Compile-time LOCALIZE → **NEUTRAL (-0.87%)****P1 FROZEN**
- Phase 74-3: P0 (FASTAPI) → (next)