Phase 74-1/74-2: UnifiedCache LOCALIZE optimization (P1 frozen, NEUTRAL -0.87%)
Phase 74-1 (ENV-gated LOCALIZE):
- Result: +0.50% (NEUTRAL)
- Runtime branch overhead caused instructions/branches to increase
- Diagnosed: branch tax dominates the intended optimization

Phase 74-2 (compile-time LOCALIZE):
- Result: -0.87% (NEUTRAL, P1 frozen)
- Removed runtime branch → instructions -0.6%, branches -2.3% ✓
- But cache-misses +86% (register pressure/spill) → net loss
- Conclusion: the LOCALIZE change itself works, but is fragile to cache effects

Key finding:
- Dependency-chain reduction (LOCALIZE) has low ROI due to cache-miss sensitivity
- P1 (LOCALIZE) frozen at default OFF
- Next: Phase 74-3 (P0: FASTAPI) - move branches outside the hot loop

Files:
- core/hakmem_build_flags.h: HAKMEM_TINY_UC_LOCALIZE_COMPILED flag
- core/box/tiny_unified_cache_hitpath_env_box.h: ENV gate (frozen)
- core/front/tiny_unified_cache.h: compile-time #if blocks
- docs/analysis/PHASE74_*: design, instructions, results
- CURRENT_TASK.md: P1 frozen, P0 next instructions

Also includes:
- Phase 69 refill tuning results (archived docs)
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 69 baseline update
- PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md: Route banner docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@@ -11,7 +11,7 @@

Comparisons against mimalloc are done with the **FAST build** (Standard includes a fixed tax, so it is not a fair comparison).

## Current snapshot (2025-12-17, Phase 68 PGO — new baseline)
## Current snapshot (2025-12-18, Phase 69 PGO + WarmPool=16 — current baseline)

Measurement conditions (canonical for reproduction):

- Mixed: `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`)
docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md (new file, 197 lines)
@@ -0,0 +1,197 @@
# Phase 69-1: Refill Tuning Parameter Sweeps - Results

**Date**: 2025-12-17
**Baseline**: Phase 68 PGO (`bench_random_mixed_hakmem_minimal_pgo`)
**Benchmark**: `scripts/run_mixed_10_cleanenv.sh` (RUNS=10)
**Goal**: Find +3-6% optimization for M2 milestone (55% of mimalloc)

---

## Executive Summary

**Winner Identified**: **Warm Pool Size=16** achieves **+3.26% (Strong GO)** with ENV-only change.

- **No code changes required** - Deploy via `HAKMEM_WARM_POOL_SIZE=16` environment variable
- **Exceeds M2 threshold** (+3.0% Strong GO criterion)
- **Single strongest improvement** among all tested parameters
- **Combined optimizations are non-additive** - Warm Pool Size=16 alone outperforms combinations

⚠️ **Important correction (2025-12 audit)**:
The previously reported “Refill Batch Size sweep” based on `TINY_REFILL_BATCH_SIZE` was **not measuring a real knob**.
That macro currently has **zero call sites** (it is defined but not referenced in the active Tiny front path), so any
observed deltas were **layout/drift noise**, not an algorithmic effect.

---
## Full Sweep Results

### Baseline (Phase 68 PGO)

| Metric | Value |
|--------|-------|
| **Mean** | 60.65M ops/s |
| **Median** | 60.68M ops/s |
| **CV** | 1.68% |
| **% of mimalloc** | 50.93% |

**Runs**: 10
**Binary**: `bench_random_mixed_hakmem_minimal_pgo` (PGO optimized)

---
### 1. Warm Pool Size Sweep (ENV-only, no recompile)

**Parameter**: `HAKMEM_WARM_POOL_SIZE` (default: 12 SuperSlabs/class)

| Size | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
|------|----------------|------------------|------|------------:|----------|
| **16** | **62.63** | **63.38** | 2.43% | **+3.26%** | **Strong GO** ✓✓✓ |
| 24 | 62.37 | 62.35 | 1.99% | +2.84% | GO ✓ |

**Winner**: **Size=16 (+3.26%)**

**Analysis**:
- Size=16 exceeds +3.0% Strong GO threshold
- Size=24 shows diminishing returns (+2.84% vs +3.26%)
- Optimal sweet spot at Size=16 balances cache hit rate vs memory overhead

**Command Used**:
```bash
# Size=16
HAKMEM_WARM_POOL_SIZE=16 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh

# Size=24
HAKMEM_WARM_POOL_SIZE=24 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
```

---
### 2. Unified Cache C5-C7 Sweep (ENV-only, no recompile)

**Parameter**: `HAKMEM_TINY_UNIFIED_C5`, `HAKMEM_TINY_UNIFIED_C6`, `HAKMEM_TINY_UNIFIED_C7` (default: 128 slots)

| Cache Size | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
|------------|----------------|------------------|------|------------:|----------|
| **256** | **61.92** | **61.70** | 1.49% | **+2.09%** | **GO** ✓ |
| 512 | 61.80 | 62.00 | 1.21% | +1.89% | GO ✓ |

**Winner**: **Cache=256 (+2.09%)**

**Analysis**:
- Cache=256 shows +2.09% improvement (GO threshold)
- Cache=512 shows diminishing returns (+1.89% vs +2.09%)
- Larger caches provide marginal gains while increasing memory overhead
- Lower CV (1.49%) indicates stable performance

**Command Used**:
```bash
# Cache=256
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh

# Cache=512
HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
```

---
### 3. Combined Optimization Check

**Configuration**: Warm Pool Size=16 + Unified Cache C5-C7=256

| Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
|----------------|------------------|------|------------:|----------|
| 62.35 | 62.32 | 1.91% | +2.81% | GO (non-additive) |

**Analysis**:
- Combined result (+2.81%) is **LESS than** Warm Pool Size=16 alone (+3.26%)
- **Non-additive behavior** indicates parameters are not orthogonal
- **Likely explanation**: Warm pool optimization reduces unified cache miss rate, making cache capacity increase redundant
- **Recommendation**: Use Warm Pool Size=16 alone for maximum benefit

**Command Used**:
```bash
HAKMEM_WARM_POOL_SIZE=16 HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
```

---
### 4. Refill Batch Size Sweep (invalid — macro not wired)

The `TINY_REFILL_BATCH_SIZE` macro is currently **define-only**:

```bash
rg -n "TINY_REFILL_BATCH_SIZE" core
# -> core/hakmem_tiny_config.h only
```

So we do **not** treat it as a tuning parameter until it is actually connected to refill logic.

If we want to tune refill frequency, use the real knobs:
- `HAKMEM_TINY_REFILL_COUNT_HOT`
- `HAKMEM_TINY_REFILL_COUNT_MID`
- `HAKMEM_TINY_REFILL_COUNT` / `HAKMEM_TINY_REFILL_COUNT_C{0..7}`

---
## Recommendations

### Phase 69-2 (Baseline Promotion)

**Primary Recommendation**: **Deploy Warm Pool Size=16 (ENV-only)**

**Rationale**:
1. **Strongest single improvement** (+3.26%, Strong GO)
2. **No code changes required** - Zero risk of layout tax
3. **Immediate deployment** via environment variable
4. **Exceeds M2 threshold** (+3.0% Strong GO criterion)

**Deployment**:
```bash
# Add to PGO training environment and benchmark scripts
export HAKMEM_WARM_POOL_SIZE=16
```

---
### Secondary Options (for Phase 69-3+)

**Option A: Warm Pool Size=16 + Refill Batch=32**
- **Combined potential**: Unknown (requires testing, may be non-additive like unified cache)
- **Complexity**: Requires PGO rebuild for Batch=32
- **Risk**: Layout tax from code change

**Option B: Warm Pool Size=16 alone (recommended)**
- **Gain**: +3.26% guaranteed
- **Complexity**: ENV-only, zero code changes
- **Risk**: None (reversible via ENV)

---
## Raw Data Files

All 10-run logs saved to:
- `/tmp/phase69_baseline.log` - Phase 68 PGO baseline
- `/tmp/phase69_warm16.log` - Warm Pool Size=16
- `/tmp/phase69_warm24.log` - Warm Pool Size=24
- `/tmp/phase69_cache256.log` - Unified Cache C5-C7=256
- `/tmp/phase69_cache512.log` - Unified Cache C5-C7=512
- `/tmp/phase69_combined.log` - Combined (Warm=16 + Cache=256)
- `/tmp/phase69_batch32.log` - Refill Batch=32

---
## Next Steps

**Awaiting User Instructions for Phase 69-2**:
1. Confirm Warm Pool Size=16 as baseline promotion candidate
2. Decide whether to:
   - Update ENV defaults in `hakmem_tiny_config.h` (preferred for SSOT)
   - Document as recommended ENV setting in README/docs
   - Add to PGO training scripts
3. Re-run `make pgo-fast-full` with `HAKMEM_WARM_POOL_SIZE=16` in training environment
4. Update `PERFORMANCE_TARGETS_SCORECARD.md` with new baseline (projected: 62.63M ops/s, ~52.6% of mimalloc)

---

**Phase 69-1 Status**: ✅ **COMPLETE**
**Winner**: **Warm Pool Size=16 (+3.26%, Strong GO, ENV-only)**
@@ -0,0 +1,46 @@

# Phase 69-3A: Refill Batch=64 build failure triage — Root cause & fix

## Symptom

`make pgo-fast-build` (profile-use) fails to link with undefined `__gcov_*` symbols, e.g.:

- `__gcov_init`, `__gcov_exit`
- `__gcov_merge_add`, `__gcov_merge_topn`
- `__gcov_time_profiler_counter`

This appeared when trying to evaluate `Refill Batch Size=64`.

## Root cause (actual)

The failure is **not** a “compiler limit due to batch=64”.

It is a **stale object mixing** problem:
- Some benchmark `.o` files were built in the profile-gen step (`-fprofile-generate`) and **were not removed by `make clean`**.
- In the profile-use step (`-fprofile-use`), those stale instrumented `.o` files were reused and linked without `-fprofile-generate`, so libgcov was not pulled in.
- Result: unresolved `__gcov_*` symbols at link time.

In other words: an **instrumented bench object was reused in a non-instrumented link**.
## Fix (minimal, safe)

Strengthen `make clean` to remove benchmark objects/binaries that were previously omitted, including:
- `bench_random_mixed_hakmem.o`
- `bench_tiny_hot_hakmem.o`
- related bench variants (`*_system`, `*_mi`, `*_hakx`, `*_minimal*`, etc.)

This preserves toolchain fairness (GCC + LTO) and prevents cross-step contamination in PGO workflows.

## Verification

After the fix, the Phase 66 PGO pipeline builds successfully again:

```sh
make pgo-fast-profile pgo-fast-collect pgo-fast-build
```

## Notes

- This fix is **layout-neutral**: it only affects build hygiene (artifact cleanup).
- This also hardens other workflows where flags change across builds (PGO / FAST targets).
- Follow-up audit note (2025-12): `TINY_REFILL_BATCH_SIZE` is currently define-only (no call sites), so the “batch=64”
  performance experiment itself was not measuring a real knob; however the build hygiene fix remains valid and important.
@@ -0,0 +1,45 @@

# Phase 69-3B: Refill Batch Size sweep (PGO, warm_pool=16) — Results

⚠️ **INVALID (2025-12 audit)**: `TINY_REFILL_BATCH_SIZE` is currently **not wired** into the active Tiny front path
(it has zero call sites; define-only in `core/hakmem_tiny_config.h`). Any observed deltas in this file should be treated
as **layout/drift noise**, not an algorithmic effect. This document is kept only as an experiment record.

## Context

Phase 69-2 promoted the ENV-only winner:
- `HAKMEM_WARM_POOL_SIZE=16`

This phase explores compile-time refill batch size (`TINY_REFILL_BATCH_SIZE`) under the current PGO workflow:
- `make pgo-fast-full` (GCC + LTO preserved)
- Training uses cleanenv-aligned workloads (`scripts/box/pgo_fast_profile_config.sh`)

## Build hygiene prerequisite

Batch=64 originally “failed to build” due to stale profile-gen bench objects being reused in profile-use links.
That issue is fixed by strengthening `make clean` (see `docs/analysis/PHASE69_REFILL_TUNING_3A_BUILD_FAILURE_TRIAGE_BATCH64.md`).
## Measurement (Mixed 10-run)

All results are from the same host session, using:
- `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo`
- `RUNS=10 scripts/run_mixed_10_cleanenv.sh`

| Batch | Mean (M ops/s) | Median (M ops/s) | CV |
|------:|----------------:|-----------------:|------:|
| 16 | 61.30 | 61.64 | 1.50% |
| 32 | 60.73 | 61.17 | 2.19% |
| 48 | 61.94 | 62.54 | 1.53% |
| 64 | 61.51 | 61.81 | 1.56% |

## Decision

- **Batch=48** is the best of the tested set in this session (roughly +1.0% vs the batch=16 baseline).
- **Batch=32** regresses in this session (note: it was previously GO under a different baseline).
- **Batch=64** builds successfully after the hygiene fix, but is not the best performer here.

## Next steps (Phase 69-3C)

If we want to pursue M2 (55%) via this path:
1. Promote **batch=48** as a research candidate with a dedicated Phase tag (compile-time change + PGO rebuild).
2. Re-run the sweep at another time window to confirm ordering (layout/drift sensitivity).
3. If stable, promote batch=48 into the FAST baseline build path.
@@ -0,0 +1,47 @@

# Phase 69-3C: Refill Batch “knob” audit — `TINY_REFILL_BATCH_SIZE` is not wired

## Summary

The Phase 69 “Refill Batch Size sweep” was based on `TINY_REFILL_BATCH_SIZE` in `core/hakmem_tiny_config.h`, but an audit
shows this macro currently has **zero call sites** in the active Tiny front path. As a result, any measured deltas from
editing this macro are **not algorithmic**; they are attributable to layout/drift/noise.

## Evidence

### 1) Zero call sites

```sh
rg -n "TINY_REFILL_BATCH_SIZE" core
```

Result: only `core/hakmem_tiny_config.h` (define-only).

### 2) PGO binaries unchanged when toggling the macro

We rebuilt the full PGO pipeline twice (`make pgo-fast-full`) after changing the macro (batch16 vs batch48) and found the
resulting binaries were bit-identical (same size + same SHA256).

This confirms the macro does not affect the compiled hot path today.
## Action taken

- Restored `TINY_REFILL_BATCH_SIZE` to `16` and added an explicit “not wired” note in `core/hakmem_tiny_config.h`.
- Marked the “Refill Batch Size sweep” section in Phase 69 docs as invalid.

## What to tune instead (real knobs)

To tune refill frequency/amount without rebuilding:
- `HAKMEM_TINY_REFILL_COUNT_HOT` (C0–C3)
- `HAKMEM_TINY_REFILL_COUNT_MID` (C4–C7)
- `HAKMEM_TINY_REFILL_COUNT` / `HAKMEM_TINY_REFILL_COUNT_C{0..7}`

Defaults are set in `core/hakmem_tiny_init.inc` and can be overridden via ENV.
## Optional future work (if we still want a compile-time knob)

If we want a compile-time “refill batch size” knob, we need to wire it into a single SSOT:
- either by feeding it into the refill-count defaults (`g_refill_count_*`), as sketched below, or
- by introducing a dedicated build flag that the refill logic consumes directly.

Until then, do not run Phase 69 sweeps based on `TINY_REFILL_BATCH_SIZE`.
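For illustration only, a minimal sketch of the first option (feeding the macro into the refill-count defaults). `g_refill_count_default` and `tiny_refill_init_defaults()` are hypothetical names invented for this sketch; only `TINY_REFILL_BATCH_SIZE` and the `HAKMEM_TINY_REFILL_COUNT_*` ENV knobs come from the docs above.

```c
#ifndef TINY_REFILL_BATCH_SIZE
#define TINY_REFILL_BATCH_SIZE 16   /* the currently define-only macro */
#endif

/* Hypothetical per-class refill-count defaults (C0..C7). */
static int g_refill_count_default[8];

static void tiny_refill_init_defaults(void)
{
    for (int cls = 0; cls < 8; cls++) {
        /* The macro becomes the single source of truth for the default;
         * ENV overrides (HAKMEM_TINY_REFILL_COUNT_*) would still apply on top. */
        g_refill_count_default[cls] = TINY_REFILL_BATCH_SIZE;
    }
}
```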
@@ -12,6 +12,13 @@

Before implementing any refill/WarmPool changes, execute this sequence:

0. **Route Banner (optional but recommended)**:
   ```bash
   HAKMEM_ROUTE_BANNER=1 ./bench_random_mixed_hakmem_observe ...
   ```
   - Prints the route assignments (backend route kind) and the cache config (`unified_cache_enabled` / `warm_pool_max_per_class`) exactly once.
   - Prevents misreadings such as "Route=LEGACY means the Unified Cache is unused" (even on the LEGACY route, the Unified Cache is used on the alloc/free front).
1. **Build with Stats**:
   ```bash
   make bench_random_mixed_hakmem_observe EXTRA_CFLAGS='-DHAKMEM_UNIFIED_CACHE_STATS_COMPILED=1'
   ```

@@ -20,7 +27,7 @@ Before implementing any refill/WarmPool changes, execute this sequence:

2. **Run with Stats**:
   ```bash
   HAKMEM_WARM_POOL_STATS=1 ./bench_random_mixed_hakmem_observe 20000000 400 1
   HAKMEM_ROUTE_BANNER=1 HAKMEM_WARM_POOL_STATS=1 ./bench_random_mixed_hakmem_observe 20000000 400 1
   ```

3. **Check Output**:
@@ -0,0 +1,116 @@

# Phase 74: UnifiedCache hit-path structural optimization (WS=400 SSOT)

**Status**: 🟡 DRAFT (design SSOT / instructions for the next step)

## 0) Background (why this, why now)

- Current baseline (Phase 69): `bench_random_mixed_hakmem_minimal_pgo` = **62.63M ops/s = 51.77% of mimalloc** (`HAKMEM_WARM_POOL_SIZE=16`)
- Phase 70 (observability SSOT) established that at WS=400 (Mixed SSOT) **UnifiedCache misses are vanishingly rare**.
  - Making `unified_cache_refill()` / WarmPool-pop faster therefore has **near-zero ROI** (refill optimization is frozen)
  - SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- Phase 73 (perf stat) established that the WarmPool=16 win is dominated by a **small reduction in instructions/branches**.
  - In other words, "shorten the hit-path" remains the most promising direction.
  - Details: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`

The aim of this phase is to **structurally move branches/loads that do not need to be executed out of the UnifiedCache hit-path (push/pop)**.

## 1) Goals / Non-goals

**Goals**
- Target **+1-3%** from a single change on the WS=400 SSOT workload (accumulating toward M2 = 55%).
- Avoid "optimizations on paths that are never exercised" (respect the Phase 70 SSOT).

**Non-goals**
- Optimizing `unified_cache_refill()` (misses are minimal, so there is no ROI on the SSOT workload).
- DCE via link-out / large deletions (many precedents of the sign flipping due to layout tax).
- Changing the route kind into a different workload (do not break the SSOT workload first).

## 2) Box Theory (box layout)

### Box responsibilities

L0: **EnvGateBox**
- `HAKMEM_TINY_UC_*` toggles (default OFF, always revertible).

L1: **TinyUnifiedCacheHitPathBox (NEW / research box)**
- **Shortens only the hit-path** of `unified_cache_push/pop` (refill/overflow/registry are untouched).
- Exactly one conversion point (boundary): the "fast→fallback" handoff happens once inside `unified_cache_push/pop`.

### Observability (minimal)
- Only two counters, `uc_hitpath_fast_hits` / `uc_hitpath_fast_fallbacks` (and only if needed).
- Beyond that, `perf stat` (instructions/branches) is the source of truth.
## 3) Concrete proposals (in priority order)

### P1 (low risk): localize variables to pin down reloads / dependency chains

Intent:
- Suppress repeated loads of `cache->head/tail/mask/capacity` etc. and **shorten the dependency chain**.

Design (a sketch follows this list):
- Inside `unified_cache_push()` / `unified_cache_pop_or_refill()`:
  - **drop fields into locals**, e.g. `uint16_t head = cache->head;`
  - compute the `next = (x + 1) & mask` arithmetic **exactly once**
  - batch stores such as `cache->tail = next;` at the end
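A minimal illustrative sketch of the localized push, assuming a simple ring-buffer layout; the struct shape, field names, and full/overflow handling are assumptions, not the actual `core/front/tiny_unified_cache.h` code:

```c
#include <stdint.h>

/* Assumed shape for illustration only; the real TinyUnifiedCache layout differs. */
typedef struct {
    uint16_t head, tail, mask;
    void*    slots[];
} TinyUnifiedCache;

static inline int unified_cache_push_localized(TinyUnifiedCache* cache, void* base)
{
    /* Hot fields are loaded into locals once, so the compiler never reloads them. */
    uint16_t head = cache->head;
    uint16_t tail = cache->tail;
    uint16_t mask = cache->mask;

    uint16_t next = (uint16_t)((head + 1) & mask);   /* index arithmetic done exactly once */
    if (next == tail) {
        return 0;                  /* full: caller falls back to the overflow path */
    }
    cache->slots[head] = base;     /* single store into the ring */
    cache->head = next;            /* write-back batched at the end */
    return 1;
}
```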
Rollout:
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
- Mechanism: ON/OFF within the same binary (to minimize layout tax, the branch is limited to a single check at the entry)

Risk:
- Increased register pressure could make it slower instead → A/B is mandatory.
### P0 (medium risk / medium ROI): Fast-API (move the enable check / stats out)

Intent:
- **Push the nearly-invariant checks** that remain in the hit-path **out to the caller**, so that `push/pop` becomes straight-line code.

Design (a sketch follows this list):
- Add a **minimal API** such as `unified_cache_push_fast(TinyUnifiedCache* cache, void* base)`
  - Precondition: the caller guarantees "enabled / initialized / stats OFF"
  - Only on failure does it fall back to the existing `unified_cache_push()` (single boundary)
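A minimal sketch of what such an API could look like, reusing the ring-buffer assumptions from the P1 sketch above; the signature of the existing `unified_cache_push()` shown here is also an assumption:

```c
/* Assumed prototype of the existing full-featured path (enable check, stats, overflow). */
int unified_cache_push(TinyUnifiedCache* cache, void* base);

static inline void unified_cache_push_fast(TinyUnifiedCache* cache, void* base)
{
    /* Caller has already guaranteed: cache is enabled, initialized, stats OFF. */
    uint16_t head = cache->head;
    uint16_t next = (uint16_t)((head + 1) & cache->mask);
    if (next != cache->tail) {              /* common case: straight-line hit path */
        cache->slots[head] = base;
        cache->head = next;
        return;
    }
    (void)unified_cache_push(cache, base);  /* single boundary: fall back to the full path */
}
```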
Rollout:
- ENV: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0)
- Fail-fast: if the mode changes mid-run, drop to a "safe fallback" (for bench-only use, abort is also acceptable)

Risk:
- More call sites means the layout can move → the GO threshold is +1.0% (strict).
### P2 (high risk / high-ROI candidate): place slots directly in TLS for hot classes only (less pointer chasing)

Intent:
- Remove the `cache->slots` load (pointer chase) from the hit-path.

Design (a sketch follows this list):
- Split the "hot classes only" part of `TinyUnifiedCache` into a separate structure and place `slots[]` directly inside TLS.
- Candidate targets: the small-capacity C4/C5/C6/C7 (the 2048-slot C2/C3 are too heavy to inline)
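An illustrative sketch of the TLS-resident layout only; the capacity, class range, and names are assumptions:

```c
#include <stdint.h>

#define UC_HOT_CAP 128   /* assumed capacity for the hot classes (e.g. C4-C7) */

/* Hypothetical hot-class cache whose slots live directly in TLS,
 * removing the cache->slots indirection from the hit path. */
typedef struct {
    uint16_t head, tail;
    void*    slots[UC_HOT_CAP];
} TinyUnifiedCacheHot;

static __thread TinyUnifiedCacheHot g_uc_hot[4];   /* one per hot class C4..C7 (assumption) */
```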
Risk:
- The larger TLS footprint can hurt dTLB/cache behavior (a big win if it works, but NO-GO is also possible).
## 4) A/B (SSOT)

### 4.1 Bench conditions (fixed)
- `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`)
- `HAKMEM_WARM_POOL_SIZE=16` (baseline)

### 4.2 GO/NO-GO
- **GO**: +1.0% or better
- **NEUTRAL**: within ±1.0% (freeze the research box)
- **NO-GO**: -1.0% or worse (revert immediately)

### 4.3 Always also check (Phase 73 lesson)
- `perf stat`: `instructions`, `branches`, `branch-misses` (the winning pattern is instruction/branch reduction)
- `cache-misses`, `iTLB-load-misses`, `dTLB-load-misses` (layout-tax detection)

## 5) Recommended implementation order

1. Land **P1 (LOCALIZE)** as a small change and A/B it (fastest way to confirm the winning pattern)
2. If it wins, add **P0 (FASTAPI)** (move more branches out)
3. If that is still not enough, try **P2 (inline slots for hot classes)** as a research box

## 6) Exit criteria (when to stop)

- If `unified_cache_push/pop` falls out of the perf Top 50 on the WS=400 SSOT, abandon this line of work (Phase 42 lesson).
- If NEUTRAL/NO-GO occurs three times in a row, move on to a different structure (another layer), because the layout-tax risk keeps growing.
@@ -0,0 +1,75 @@

# Phase 74-1: UnifiedCache hit-path “LOCALIZE” implementation instructions

**Status**: 🟡 READY

## Goal

At WS=400 (Mixed SSOT) almost only the hit-path is exercised, so **shorten the dependency chains (reloads)** in `unified_cache_push/pop` to cut instructions/branches.

- Design SSOT: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- Observability SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md` (refill optimization is frozen)

## Principles (Box Theory)

- L0: add an ENV gate box (default OFF, always revertible)
- L1: a change confined to `unified_cache_push/pop` (single boundary)
- Minimal observability (perf stat is the source of truth)
- Fail-fast: when in doubt, fall back

## Step 0: Baseline check (SSOT)

```bash
scripts/run_mixed_10_cleanenv.sh
```
## Step 1: ENV gate (L0 box)

New file:
- `core/box/tiny_unified_cache_hitpath_env_box.h` (example)

ENV:
- `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)

Requirement (see the sketch below):
- never hit getenv on the hot path (reuse the existing lazy-init pattern, or pin the value with a build flag)
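A minimal sketch of the lazy-init gate, assuming the usual "resolve getenv once, cache the result" pattern; the real box header may differ:

```c
#include <stdlib.h>

/* Sketch only: resolves HAKMEM_TINY_UC_LOCALIZE exactly once, never on the hot path.
 * A benign race on first use is acceptable because the computed result is idempotent. */
static inline int tiny_uc_localize_enabled(void)
{
    static int g_cached = -1;                            /* -1 = not resolved yet */
    if (__builtin_expect(g_cached < 0, 0)) {
        const char* v = getenv("HAKMEM_TINY_UC_LOCALIZE");
        g_cached = (v != NULL && v[0] == '1') ? 1 : 0;   /* default 0 (OFF) */
    }
    return g_cached;
}
```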
## Step 2: LOCALIZE implementation (L1 box)

Target:
- `unified_cache_push()` / `unified_cache_pop_or_refill()` in `core/front/tiny_unified_cache.h`

Approach:
- drop `cache->head/tail/mask/capacity` into locals to **avoid reloads**
- batch the stores at the end (e.g. `cache->tail = next_tail;`)
- do not change the semantics (capacity/ordering/stats/overflow keep their meaning)

Adoption pattern (example):
- when `!tiny_uc_localize_enabled()`, go through the existing implementation unchanged
- only when enabled, call the localized version
## Step 3: A/B (same binary)

```bash
HAKMEM_TINY_UC_LOCALIZE=0 scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_UC_LOCALIZE=1 scripts/run_mixed_10_cleanenv.sh
```

In addition (mandatory, because the winning pattern is instructions/branches):
```bash
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,iTLB-load-misses,dTLB-load-misses -- \
  ./bench_random_mixed_hakmem_minimal_pgo 20000000 400 1
```

## Judgment

- **GO**: +1.0% or better
- **NEUTRAL**: within ±1.0% (freeze the research box)
- **NO-GO**: -1.0% or worse (revert immediately)

Triage on NO-GO:
- use `scripts/box/layout_tax_forensics_box.sh` (classifies layout tax / IPC drop / TLB degradation)

## Step 4: Promotion policy

- Even on a first GO, **do not turn it ON by default** (first confirm reproducibility with 3 independent re-measurements)
- If all 3 runs are GO, consider promoting it into `scripts/run_mixed_10_cleanenv.sh` / `core/bench_profile.h`
@@ -0,0 +1,140 @@

# Phase 74: UnifiedCache hit-path structural optimization - Results

**Status**: 🔴 P1 (LOCALIZE) FROZEN (NEUTRAL -0.87%)

## Summary

Phase 74 investigated **unified_cache_push/pop** hit-path optimizations aiming for +1-3% via instruction/branch reduction (the Phase 73 lesson).

**P1 (LOCALIZE)** attempted to reduce dependency chains by loading `head/tail/mask` into locals, but was **frozen at NEUTRAL (-0.87%)** due to a cache-miss increase.

---
## Phase 74-1: LOCALIZE (ENV-gated, runtime branch)

**Goal**: Load `head/tail/mask` once into locals to avoid reload dependency chains.

**Implementation** (see the sketch below):
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
- Runtime branch at entry: `if (tiny_uc_localize_enabled()) { ... }`
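For illustration, the 74-1 shape amounts to a per-call dispatch at the entry of the hit path; `unified_cache_push_localized()` and `unified_cache_push_plain()` are hypothetical names for the two variants:

```c
static inline int unified_cache_push(TinyUnifiedCache* cache, void* base)
{
    /* The ENV gate is a runtime branch that every push/pop pays;
     * this is the "branch tax" diagnosed below. */
    if (tiny_uc_localize_enabled()) {
        return unified_cache_push_localized(cache, base);
    }
    return unified_cache_push_plain(cache, base);
}
```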
**Results** (10-run A/B):

| Metric | LOCALIZE=0 | LOCALIZE=1 | Delta |
|--------|------------|------------|-------|
| throughput | 57.43 M ops/s | 57.72 M ops/s | **+0.50%** |
| instructions | 4,583M | 4,615M | **+0.7%** |
| branches | 1,276M | 1,281M | **+0.4%** |
| cache-misses | 560K | 461K | -17.7% |

**Diagnosis**: Runtime branch overhead dominated. Instructions/branches **increased** despite the LOCALIZE intent.

**Judgment**: **NEUTRAL (+0.50%, ±1.0% threshold)** → Proceed to Phase 74-2 (compile-time gate).

---
## Phase 74-2: LOCALIZE (compile-time gate, no runtime branch)

**Goal**: Eliminate the runtime branch to isolate the performance of the LOCALIZE change itself.

**Implementation** (see the sketch below):
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Compile-time gate: `#if HAKMEM_TINY_UC_LOCALIZE_COMPILED` (no runtime branch)
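Schematically, the dispatch moves from run time to compile time; the helper names are hypothetical, only the flag name matches `core/hakmem_build_flags.h`:

```c
static inline int unified_cache_push(TinyUnifiedCache* cache, void* base)
{
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    return unified_cache_push_localized(cache, base);   /* Phase 74-2 variant, no gate branch */
#else
    return unified_cache_push_plain(cache, base);        /* original hit path (the default) */
#endif
}
```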
**Results** (10-run A/B via `layout_tax_forensics_box.sh`):

| Metric | Baseline (=0) | Treatment (=1) | Delta |
|--------|---------------|----------------|-------|
| **throughput** | 58.90 M ops/s | 58.39 M ops/s | **-0.87%** |
| cycles | 1,553M | 1,548M | -0.3% |
| **instructions** | 2,748M | 2,733M | **-0.6%** |
| **branches** | 632M | 617M | **-2.3%** |
| **cache-misses** | 707K | 1,316K | **+86%** |
| dTLB-load-misses | 46K | 33K | -28% |

**Analysis**:
1. **Runtime branch overhead removed** → instructions/branches improved (-0.6%/-2.3%) ✓
2. **The LOCALIZE change itself is effective** → dependency chain reduction confirmed ✓
3. **But cache-misses +86%** → register pressure / spill / worse access pattern
4. **Net result: -0.87%** → the cache-miss increase dominates the instruction/branch savings

**Phase 74-1 vs 74-2 comparison**:
- 74-1 (runtime branch): instructions +0.7%, branches +0.4% → **the branch overhead loses**
- 74-2 (compile-time): instructions -0.6%, branches -2.3% → **the LOCALIZE change itself wins**
- But cache-misses +86% cancels it out → **net NEUTRAL**

**Judgment**: **NEUTRAL (-0.87%, below the +1.0% GO threshold)** → **P1 FROZEN**

---
## Root Cause (Phase 74-2)

**Why cache-misses increased (+86%)**:

1. **Register pressure hypothesis**: loading `head/tail/mask` into locals increases the number of live registers
   - the compiler may spill to the stack → more memory traffic
   - `cache->slots[head]` may lose a prefetch opportunity
2. **Access-pattern change**: the direct `cache->head` load may have benefited from compiler optimizations
   - does storing into a local break the compiler's dependency tracking?
   - is memory alias analysis degraded?

**Evidence**:
- dTLB-misses decreased (-28%) → data layout is not the issue
- L1-dcache-load-misses are similar → not a TLB/page issue
- cache-misses (+86%) is the PRIMARY BLOCKER

---
## Lessons Learned

1. **Runtime branch tax is real**: Phase 74-1 showed a +0.7% instruction increase from the ENV gate
2. **The LOCALIZE change itself works**: Phase 74-2 confirmed -2.3% branches once the branch was removed
3. **Register pressure matters**: even when the instruction count drops, cache behavior can dominate
4. **This optimization path has low ROI**: dependency-chain reduction is fragile to cache effects

**Conclusion**: P1 (LOCALIZE) frozen. Move to **P0 (FASTAPI)** (a different approach: move branches outside the hot loop).

---
## P1 (LOCALIZE) - Frozen State

**Files**:
- `core/hakmem_build_flags.h`: `HAKMEM_TINY_UC_LOCALIZE_COMPILED` (default 0)
- `core/box/tiny_unified_cache_hitpath_env_box.h`: ENV gate (unused after 74-2)
- `core/front/tiny_unified_cache.h`: compile-time `#if` blocks

**Default behavior**: LOCALIZE=0 (original implementation)
**Rollback**: No action needed (default OFF)

---
## Next Steps

**Phase 74-3: P0 (FASTAPI)**

**Goal**: Move the `unified_cache_enabled()` / `lazy-init` / `stats` checks **outside** the hot loop.

**Approach** (a caller-side sketch follows this list):
- Create `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
- Assume "valid/enabled/no-stats" at the caller side
- Fail-fast: fall back to the slow path on unexpected state
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
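A hypothetical caller-side sketch under those assumptions; every name except `unified_cache_push_fast()` is invented for illustration and is not the real hakmem code:

```c
static inline void tiny_free_front(void* base)
{
    /* Hypothetical accessor; the precondition "valid / enabled / no stats"
     * is established once at init, not re-checked on every call. */
    TinyUnifiedCache* cache = tiny_uc_for_current_thread();
    if (__builtin_expect(cache != NULL, 1)) {
        unified_cache_push_fast(cache, base);   /* straight-line hot path */
        return;
    }
    tiny_free_slow_path(base);                  /* fail-fast: unexpected state goes to the slow path */
}
```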
**Expected benefit**: +1-2% via branch reduction (different axis than P1)

**GO threshold**: +1.0% (strict, structural change)

---
## Artifacts

- **Design**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- **Instructions**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- **Results**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md` (this file)
- **Forensics output**: `./results/layout_tax_forensics/` (Phase 74-2 perf data)

---
## Timeline

- Phase 74-1: ENV-gated LOCALIZE → **NEUTRAL (+0.50%)**
- Phase 74-2: Compile-time LOCALIZE → **NEUTRAL (-0.87%)** → **P1 FROZEN**
- Phase 74-3: P0 (FASTAPI) → (next)