Key changes:

- Phase 83-1: Switch dispatch fixed mode (`tiny_inline_slots_switch_dispatch_fixed_box`): NO-GO (marginal +0.32%, branch reduction negligible)
  - Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns
- Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M):
  - tcmalloc: 115.26M (92.33% of mimalloc)
  - jemalloc: 97.39M (77.96% of mimalloc)
  - system: 85.20M (68.24% of mimalloc)
  - mimalloc: 124.82M (baseline)
- hakmem PROFILE correction: `scripts/run_mixed_10_cleanenv.sh` + `run_allocator_quick_matrix.sh`
  - PROFILE explicitly set to `MIXED_TINYV3_C7_SAFE` for hakmem measurements
  - Result: baseline stabilized to 55.53M (44.46% of mimalloc)
  - Previous unstable measurement (35.57M) was due to profile leak
- Documentation:
  - `PERFORMANCE_TARGETS_SCORECARD.md`: Reference allocators + M1/M2 milestone status
  - `PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md`: Phase 83-1 analysis (NO-GO)
  - `ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md`: Quick comparison procedure
  - `ALLOCATOR_COMPARISON_SSOT.md`: Detailed SSOT methodology
- M2 milestone status: 44.46% (target 55%, gap -10.54pp), structural improvements needed

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

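# Performance Targets (numeric targets for tracking mimalloc)

Purpose: pin down where hakmem can win, covering not only speed but also **syscalls / memory stability / long-run stability**.

## Operating policy (fixed in Phase 38)

**The FAST build is the canonical basis for comparison**:
- **FAST**: pure performance measurement (gate functions constant-folded, diagnostic counters OFF)
- **Standard**: safety/compatibility baseline (ENV gates enabled, used for mainline releases)
- **OBSERVE**: behavior observation and debugging (diagnostic counters ON)

Comparisons against mimalloc use the **FAST build** (Standard carries a fixed tax, so that comparison would not be fair).
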
## Current snapshot (2025-12-18, Phase 69 PGO + WarmPool=16; current baseline)

Measurement conditions (canonical for reproduction):
- Mixed: `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`)
- 10-run mean/median
- Git: master (Phase 68 PGO, seed/WS diversified profile)
- **Baseline binary**: `bench_random_mixed_hakmem_minimal_pgo` (Phase 68 upgraded)
- **Stability**: Phase 66: 3 iterations, +3.0% mean, variance <±1% | Phase 68: 10-run, +1.19% vs Phase 66 (GO)

Note:
- Phase 75 introduced C5/C6 inline slots and promoted them into presets. Phase 75 A/B results were recorded on the Standard binary (`./bench_random_mixed_hakmem`).
- FAST PGO SSOT baselines/ratios should only be updated after re-running A/B with `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo`.

### hakmem Build Variants (same binary layout)

| Build | Mean (M ops/s) | Median (M ops/s) | vs mimalloc | Notes |
|-------|----------------|------------------|-------------|-------|
| FAST v3 | 58.478 | 58.876 | 48.34% | Old baseline (Phase 59b rebase). The canonical performance baseline has since moved up to Phase 66 PGO |
| FAST v3 + PGO | 59.80 | 60.25 | 49.41% | Phase 47: NEUTRAL (+0.27% mean, +1.02% median, research box) |
| **FAST v3 + PGO (Phase 66)** | **60.89** | **61.35** | **50.32%** | **GO: +3.0% mean (verified 3 times, stable within <±1%)**. Phase 66 PGO initial baseline |
| **FAST v3 + PGO (Phase 68)** | **61.614** | **61.924** | **50.93%** | **GO: +1.19% vs Phase 66** ✓ (seed/WS diversification) |
| **FAST v3 + PGO (Phase 69)** | **62.63** | **63.38** | **51.77%** | **Strong GO: +3.26% vs Phase 68** ✓✓✓ (Warm Pool Size=16, ENV-only) → **promoted to the new FAST baseline** ✓ |
| FAST v3 + PGO + Phase 75 (C5+C6 ON) [Point D] | **55.51** | - | **45.70%** | Phase 75-4 FAST PGO rebase (C5+C6 inline slots): +3.16% vs Point A ✓ **[REBASE URGENT]** |
| Standard | 53.50 | - | 44.21% | Safety/compatibility baseline (measured before Phase 48, needs rebase) |
| OBSERVE | TBD | - | - | Diagnostic counters ON |

Supplementary notes:
- Phase 63: `make bench_random_mixed_hakmem_fast_fixed` (`HAKMEM_FAST_PROFILE_FIXED=1`) is a research build (not listed in the SSOT unless it reaches GO). Results: `docs/analysis/PHASE63_FAST_PROFILE_FIXED_BUILD_RESULTS.md`.

**FAST vs Standard delta: +10.6%** (the Standard figure was measured before Phase 48; the ratio is adjusted for the mimalloc baseline change)

**Phase 59b Notes:**
- **Profile Change**: Switched from `MIXED_TINYV3_C7_BALANCED` to `MIXED_TINYV3_C7_SAFE` (Speed-first) as canonical default
- **Rationale**: Phase 57 60-min soak showed Speed-first wins on all metrics (lower CV, better tail latency)
- **Stability**: CV 2.52% (hakmem) vs 0.90% (mimalloc) in Phase 59b
- **vs Phase 59**: Ratio change (49.13% → 48.34%) due to mimalloc variance (+1.59%), hakmem stable
- **Recommended Profile**: `MIXED_TINYV3_C7_SAFE` (Speed-first default)

### Reference allocators (separate binaries, layout differences apply)

| allocator | mean (M ops/s) | median (M ops/s) | ratio vs mimalloc (mean) | CV |
|----------|-----------------|------------------|--------------------------|-----|
| **mimalloc (separate)** | **124.82** | 124.71 | **100%** | 1.10% |
| **tcmalloc (LD_PRELOAD)** | **115.26** | 115.51 | **92.33%** | 1.22% |
| **jemalloc (LD_PRELOAD)** | **97.39** | 97.88 | **77.96%** | 1.29% |
| **system (separate)** | **85.20** | 85.40 | **68.24%** | 1.98% |
| libc (same binary) | 76.26 | 76.66 | 63.30% | (old) |

Notes:
- **Phase 59b rebase**: mimalloc updated (120.466M → 120.979M, +0.43% variation)
- **2025-12-18 Update (corrected)**: tcmalloc/jemalloc/system measurements completed (10-run Random Mixed, WS=400, ITERS=20M, SEED=1)
  - tcmalloc: 115.26M ops/s (92.33% of mimalloc) ✓
  - jemalloc: 97.39M ops/s (77.96% of mimalloc)
  - system: 85.20M ops/s (68.24% of mimalloc)
  - mimalloc: 124.82M ops/s (baseline)
  - Measurement script: `scripts/run_allocator_quick_matrix.sh` (hakmem via run_mixed_10_cleanenv.sh)
  - **Fix**: hakmem measurements now set HAKMEM_PROFILE explicitly → back within the SSOT range
- `system/mimalloc/jemalloc/tcmalloc` are measured as separate binaries, so they are a **reference that includes layout (text size/I-cache) differences**
- `tcmalloc (LD_PRELOAD)` is installed from gperftools (`/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so`)
- `libc (same binary)` uses `HAKMEM_FORCE_LIBC_ALLOC=1` and serves as a rough comparison on the same layout (measured before Phase 48)
- **Use the FAST build for mimalloc comparisons** (the Standard gate overhead is a hakmem-specific tax)
- Comparison procedure (SSOT): `docs/analysis/ALLOCATOR_COMPARISON_SSOT.md`
- **Same-binary comparison (minimizes layout differences)**: `scripts/run_allocator_preload_matrix.sh` (fixed `bench_random_mixed_system` + swapped `LD_PRELOAD`)
  - Caution: this path differs from hakmem's SSOT (`bench_random_mixed_hakmem*`) (drop-in wrapper reference)

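
As a worked check of the "ratio vs mimalloc" column, the sketch below recomputes each ratio from the 10-run means listed above, plus the corrected hakmem mean of 55.53M ops/s. The helper name `ratio_pct` is hypothetical, not part of the repository. Because the inputs are already rounded to two decimals, the recomputed values can differ from the table by a few hundredths of a percentage point.

```python
# Means in M ops/s, copied from this scorecard (10-run, WS=400, ITERS=20M).
MEANS_MOPS = {
    "mimalloc": 124.82,
    "tcmalloc": 115.26,
    "jemalloc": 97.39,
    "system": 85.20,
    "hakmem": 55.53,
}


def ratio_pct(mean_mops: float, baseline_mops: float) -> float:
    """Return a throughput as a percentage of the baseline throughput."""
    return 100.0 * mean_mops / baseline_mops


if __name__ == "__main__":
    baseline = MEANS_MOPS["mimalloc"]
    for name, mean_mops in MEANS_MOPS.items():
        print(f"{name:9s} {mean_mops:7.2f} M ops/s  "
              f"{ratio_pct(mean_mops, baseline):6.2f}% of mimalloc")
```

Keeping the ratio as a pure function of two means makes it easy to re-derive every row whenever the mimalloc baseline is re-measured.
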
## Allocator Comparison (bench_allocators_compare.sh, small-scale reference)

Caution:
- This is a **small-scale reference** based on `--scenario mixed` of `bench_allocators_*` (a simple 8B..1MB mix).
- It is **not the same thing** as the Mixed 16–1024B SSOT (`scripts/run_mixed_10_cleanenv.sh`), so do not conflate it with the FAST baseline or the milestones.

Run (example):
```bash
make bench
JEMALLOC_SO=/path/to/libjemalloc.so.2 \
TCMALLOC_SO=/path/to/libtcmalloc.so \
scripts/bench_allocators_compare.sh --scenario mixed --iterations 50
```

Results (2025-12-18, mixed, iterations=50):

| allocator | ops/sec (M) | vs mimalloc (Phase 69 ref) | vs system | soft_pf | RSS (MB) |
|----------|--------------|----------------------------|-----------|---------|----------|
| tcmalloc (LD_PRELOAD) | 34.56 | 28.6% | 11.2x | 3,842 | 21.5 |
| jemalloc (LD_PRELOAD) | 24.33 | 20.1% | 7.9x | 143 | 3.8 |
| hakmem (linked) | 16.85 | 13.9% | 5.4x | 4,701 | 46.5 |
| system (linked) | 3.09 | 2.6% | 1.0x | 68,590 | 19.6 |

Supplementary notes:
- `soft_pf`/`RSS` come from `getrusage()` (on Linux, `ru_maxrss` is in KB).

## Allocator Comparison (Random Mixed, 10-run, WS=400, reference)

Caution:
- Separate-binary comparisons include layout tax.
- To **prioritize a same-binary comparison (LD_PRELOAD)**, use `scripts/run_allocator_preload_matrix.sh`.

## 1) Speed (relative targets)

Premise: compare hakmem vs mimalloc on the **FAST build** (Standard includes gate overhead, so that comparison would be unfair).

Recommended milestones (Mixed 16–1024B, FAST build):

| Milestone | Target | Current (2025-12-18, corrected) | Status |
|-----------|--------|-----------------------------------|--------|
| M1 | **50%** of mimalloc | 44.46% | 🟡 **Not met** (measured after the PROFILE fix) |
| M2 | **55%** of mimalloc | 44.46% | 🔴 **Not met** (Gap: -10.54pp) |
| M3 | **60%** of mimalloc | - | 🔴 Not met (structural changes required) |
| M4 | **65–70%** of mimalloc | - | 🔴 Not met (structural changes required) |

**Current status:** hakmem (FAST PGO) (2025-12-18) = 55.53M ops/s = 44.46% of mimalloc (Random Mixed, WS=400, ITERS=20M, 10-run)

⚠️ **Important**: The Phase 69 baseline (62.63M = 51.77%) may reflect older measurement conditions. The new baseline after the explicit PROFILE fix is 44.46% (M1 not met).

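
The milestone gaps are plain percentage-point differences between the current ratio and each target. A minimal sketch, assuming M4 is checked against the lower bound of its 65–70% range; the names `MILESTONES_PCT`, `gap_pp`, and `milestone_report` are hypothetical:

```python
# Milestone targets as a percentage of mimalloc throughput.
# Assumption: M4 (65-70%) is represented by its lower bound.
MILESTONES_PCT = {"M1": 50.0, "M2": 55.0, "M3": 60.0, "M4": 65.0}


def gap_pp(current_pct: float, target_pct: float) -> float:
    """Signed gap in percentage points (negative means below target)."""
    return current_pct - target_pct


def milestone_report(current_pct: float) -> dict:
    """Map each milestone to (gap in pp rounded to 2 decimals, reached?)."""
    return {
        name: (round(gap_pp(current_pct, target), 2), current_pct >= target)
        for name, target in MILESTONES_PCT.items()
    }


if __name__ == "__main__":
    for name, (gap, reached) in milestone_report(44.46).items():
        status = "reached" if reached else "not met"
        print(f"{name}: {gap:+.2f}pp ({status})")
```

With the corrected ratio of 44.46% this gives -10.54pp for M2; with the older Phase 69 ratio of 51.77% it gives +1.77pp for M1 and -3.23pp for M2, matching the figures quoted in this section.
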
**Phase 68 PGO promotion (Phase 66 → Phase 68 upgrade):**
- Phase 66 baseline: 60.89M ops/s = 50.32% (+3.0% mean, 3-run stable)
- Phase 68 baseline: 61.614M ops/s = 50.93% (+1.19% vs Phase 66, 10-run verified)
- Profile change: seed/WS diversification (WS: 3 values → 5, seed: 1 value → 3)
- M1 (50%) achievement: **EXCEEDED** (+0.93pp above target, vs +0.32pp in Phase 66)

**M1 Achievement Analysis:**
- Phase 66: margin over 50%: +0.32pp (EXCEEDED target, first time above 50%)
- Phase 68: margin over 50%: +0.93pp (further improved via seed/WS diversification)
- Production perspective: 50.93% clears the 50.00% target, and the result was verified over 10 runs
- Stability advantage: Phase 66 (3-run <±1%) → Phase 68 (10-run +1.19%, improved reproducibility)
- **Verdict**: M1 **EXCEEDED** (+0.93pp); next phases to be considered toward M2 (55%)

**Phase 68 Benefits Over Phase 66:**
- Reduced PGO overfitting via seed/WS diversification
- +1.19% improvement from better profile representation
- More representative of production workload variance
- Higher confidence in baseline stability

**Phase 69 PGO promotion (Phase 68 → Phase 69 upgrade):**
- Phase 68 baseline: 61.614M ops/s = 50.93% (+1.19% vs Phase 66, 10-run verified)
- Phase 69 baseline: 62.63M ops/s = 51.77% (+3.26% vs Phase 68, 10-run verified)
- Parameter change: Warm Pool Size 12 → 16 (ENV-only, zero code changes)
- M1 (50%) achievement: **EXCEEDED** (+1.77pp above target, vs +0.93pp in Phase 68)
- M2 (55%) progress: remaining gap reduced to 3.23pp (from 4.07pp in Phase 68)

**Phase 69 Benefits Over Phase 68:**
- +3.26% improvement from warm pool optimization (strong-GO threshold exceeded)
- ENV-only change (zero layout tax risk, fully reversible)
- Reduced registry O(N) scan overhead via larger warm pool
- Non-additive with other optimizations (Warm Pool Size=16 alone is optimal)
- Single strongest parameter improvement in refill tuning sweep

**Phase 69 Implementation:**
- Warm Pool Size: 12 → 16 SuperSlabs/class
- ENV variable: `HAKMEM_WARM_POOL_SIZE=16` (default in MIXED_TINYV3_C7_SAFE preset)
- Rollback: Set `HAKMEM_WARM_POOL_SIZE=12` or remove ENV variable
- Results: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`

**Phase 75-4: FAST PGO Rebase (C5+C6 Inline Slots Validation): CRITICAL FINDING**

Phase 75-3 validated the C5+C6 inline slots optimization on the Standard binary (+5.41%). Phase 75-4 rebased this onto the FAST PGO baseline to update the SSOT:

**4-Point Matrix (FAST PGO, Mixed SSOT):**

| Point | Config | Throughput | Delta vs A |
|-------|--------|-----------|-----------|
| A | C5=0, C6=0 | 53.81 M ops/s | baseline |
| B | C5=1, C6=0 | 53.03 M ops/s | -1.45% |
| C | C5=0, C6=1 | 54.17 M ops/s | +0.67% |
| **D** | **C5=1, C6=1** | **55.51 M ops/s** | **+3.16%** |

**Decision**: ✅ **GO** (Point D exceeds the +3.0% ideal threshold by 0.16pp)

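
The "Delta vs A" column can be reproduced from the four throughputs. The sketch below also compares Point D with the sum of the single-flag deltas (B + C); the large difference suggests, without proving it, that the two settings interact rather than add. The names `POINTS_MOPS`, `delta_pct`, and `interaction_pp` are hypothetical.

```python
# Throughputs (M ops/s) from the Phase 75-4 4-point matrix.
POINTS_MOPS = {"A": 53.81, "B": 53.03, "C": 54.17, "D": 55.51}


def delta_pct(value: float, baseline: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return 100.0 * (value / baseline - 1.0)


# Delta of every point against Point A (C5=0, C6=0).
deltas = {name: delta_pct(v, POINTS_MOPS["A"]) for name, v in POINTS_MOPS.items()}

# How far D is from the naive "B + C" expectation, in percentage points.
interaction_pp = deltas["D"] - (deltas["B"] + deltas["C"])

if __name__ == "__main__":
    for name, d in deltas.items():
        print(f"Point {name}: {d:+.2f}% vs A")
    print(f"D minus (B + C): {interaction_pp:+.2f}pp")
```
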
**⚠️ CRITICAL FINDING: PGO Profile Staleness**

- **Phase 69 FAST baseline**: 62.63 M ops/s
- **Phase 75-4 Point A (FAST PGO baseline)**: 53.81 M ops/s
- **Regression**: -14.09% (not explained by Phase 75 additions)
- **Root cause hypothesis**: PGO profile trained pre-Phase 69 (likely Phase 68 or earlier) with C5=0, C6=0 configuration
- **Impact**: FAST PGO captures only 58.4% of Standard's +5.41% gain (3.16% vs 5.41%)

**Recommended Actions (Priority Order):**

1. **IMMEDIATE - UPDATE SSOT**: Phase 75 C5+C6 inline slots confirmed working (+3.16% on FAST PGO)
   - Promote to core/bench_profile.h (already done for Standard, now FAST PGO validated)
   - Update this scorecard: Phase 75 baseline = 55.51 M ops/s (Point D, with C5+C6 ON)

2. **HIGH PRIORITY - PHASE 75-5 (PGO Profile Regeneration)**
   - Regenerate PGO profile with C5=1, C6=1 training configuration
   - Expected gain: unknown (likely positive if the training profile matches the actual hot path, but not guaranteed)
   - Estimated recovery: treat any number as a hypothesis until re-measured (do not assume a return to Phase 69 levels)
   - Root cause analysis: Investigate 14% gap vs Phase 69 (layout, code bloat, or profile mismatch)

**Documentation:**
- Phase 75-4 results: `docs/analysis/PHASE75_4_FAST_PGO_REBASE_RESULTS.md`
- Next: Phase 75-5 (PGO regeneration) required before next optimization phase

**Impact on M2 Milestone:**
- Phase 69 FAST baseline: 62.63 M ops/s (51.77% of mimalloc, 3.23pp short of M2)
- Phase 75-4 Point A (baseline): 53.81 M ops/s (44.35% of mimalloc, 10.65pp short of M2)
- Phase 75-4 Point D (C5+C6): 55.51 M ops/s (45.70% of mimalloc, 9.30pp short of M2)
- **Status**: Phase 75 optimization proven, but PGO profile regression masks true progress

Note: the reference values for `mimalloc/system/jemalloc` shift with environment drift, so re-baseline them periodically.
- Phase 48 complete: `docs/analysis/PHASE48_REBASE_ALLOCATORS_AND_STABILITY_SUITE_RESULTS.md`
- Phase 59 complete: `docs/analysis/PHASE59_50PERCENT_RECOVERY_BASELINE_REBASE_RESULTS.md`

## 2) Syscall budget (OS churn)

Ideal for the Tiny hot path:
- In steady state (after warmup), **mmap/munmap/madvise = 0** (or "nearly 0")

Guideline (acceptable):
- The `mmap+munmap+madvise` total is **at most 1 call per 1e8 ops** (= 1e-8 / op)

Current (Phase 48 rebase):
- `HAKMEM_SS_OS_STATS=1` (Mixed, `iters=200000000 ws=400`):
  - `[SS_OS_STATS] alloc=9 free=10 madvise=9 madvise_enomem=0 madvise_other=0 madvise_disabled=0 mmap_total=9 fallback_mmap=0 huge_alloc=0 huge_fail=0`
  - **Total syscalls (mmap+madvise): 18 / 200M ops = 9e-8 / op**
  - **Status: EXCELLENT** (within 10x of ideal, NO steady-state churn)

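
The per-op figure above can be derived directly from the `[SS_OS_STATS]` line. The following is a minimal sketch that assumes the line keeps the `key=value` integer format shown in this section; `parse_ss_os_stats` and `syscalls_per_op` are hypothetical helper names, not existing project tools.

```python
import re

# Sample line copied from the Phase 48 rebase measurement in this section.
SAMPLE_LINE = (
    "[SS_OS_STATS] alloc=9 free=10 madvise=9 madvise_enomem=0 "
    "madvise_other=0 madvise_disabled=0 mmap_total=9 fallback_mmap=0 "
    "huge_alloc=0 huge_fail=0"
)


def parse_ss_os_stats(line: str) -> dict:
    """Extract every key=value integer pair from an [SS_OS_STATS] line."""
    if "[SS_OS_STATS]" not in line:
        raise ValueError("not an [SS_OS_STATS] line")
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def syscalls_per_op(stats: dict, ops: int) -> float:
    """mmap + madvise calls per operation, as counted in this scorecard."""
    return (stats["mmap_total"] + stats["madvise"]) / ops


if __name__ == "__main__":
    stats = parse_ss_os_stats(SAMPLE_LINE)
    rate = syscalls_per_op(stats, 200_000_000)
    print(f"{rate:.1e} syscalls/op, {rate / 1e-8:.1f}x the 1e-8 guideline")
```

For the sample line this yields 18 calls over 200M ops, which is 9e-8 per op, or 9 times the 1e-8 guideline.
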
How to observe (either one):
- Internal: the `[SS_OS_STATS]` line printed with `HAKMEM_SS_OS_STATS=1` (madvise/disabled, etc.)
- External: `perf stat` syscall events or `strace -c` (a short run, looking only at the counts)

**Phase 48 confirmation:**
- mmap/madvise do not keep growing after warmup (stable)
- Confirms with numbers one of the ways to win against mimalloc other than speed

## 3) Memory stability (RSS / fragmentation)

Minimum requirements (Mixed / fixed-ws soak):
- RSS does **not increase monotonically over time**
- RSS drift stays **within +5%** over a 1-hour soak (guideline)

**Current (Phase 51 - 5min single-process soak):**

| Allocator | First RSS (MB) | Last RSS (MB) | Peak RSS (MB) | RSS Drift | Status |
|-----------|----------------|---------------|---------------|-----------|--------|
| hakmem FAST | 32.88 | 32.88 | 32.88 | +0.00% | EXCELLENT |
| mimalloc | 1.88 | 1.88 | 1.88 | +0.00% | EXCELLENT |
| system malloc | 1.88 | 1.88 | 1.88 | +0.00% | EXCELLENT |

**Phase 51 details (single-process soak):**
- Test duration: 5 minutes (300 seconds)
- Epoch size: 5 seconds
- Samples: 60 epochs per allocator
- Results: `docs/analysis/PHASE51_SINGLE_PROCESS_SOAK_AND_TAIL_PLAN_RESULTS.md`
- Script: `scripts/soak_mixed_single_process.sh`
- **All allocators show ZERO drift** - excellent memory discipline
- Note: hakmem's higher base RSS (33 MB vs 2 MB) is a **design trade-off** (Phase 53 triage)
- **Key difference from Phase 50**: Single process with persistent allocator state (simulates long-running servers)
- Optional: when targeting RSS <10MB with Memory-Lean mode (opt-in, Phase 54), treat the Phase 55 validation matrix as canonical:
  - `docs/analysis/PHASE55_MEMORY_LEAN_MODE_VALIDATION_MATRIX.md`

**Balanced mode (Phase 55, LEAN+OFF):**
- `HAKMEM_SS_MEM_LEAN=1` + `HAKMEM_SS_MEM_LEAN_DECOMMIT=OFF`
- Effect: RSS does not drop (stays at ≈33MB), while prewarm suppression may slightly improve throughput/stability
- Next: `docs/analysis/PHASE56_PROMOTE_LEAN_OFF_PREWARM_SUPPRESSION_NEXT_INSTRUCTIONS.md`

**Phase 53 RSS Tax Triage:**

| Component | Memory (MB) | % of Total | Source |
|-----------|-------------|------------|--------|
| Tiny metadata | 0.04 | 0.1% | TLS caches, warm pool, page box |
| SuperSlab backend | ~20-25 | 60-75% | Persistent slabs for fast allocation |
| Benchmark working set | ~5-8 | 15-25% | Live objects (WS=400) |
| OS overhead | ~2-5 | 6-15% | Page tables, heap metadata |
| **Total RSS** | **32.88** | **100%** | Measured peak |

**Root Cause (Phase 53):**
- **NOT bench warmup**: RSS unchanged by prefault setting (32.88 MB → 33.12 MB)
- **IS allocator design**: Speed-first strategy with persistent superslabs
- **Trade-off**: +10x syscall efficiency, -17x memory efficiency vs mimalloc
- **Verdict**: **ACCEPTABLE** for speed-first strategy (documented design choice)

**Results**: `docs/analysis/PHASE53_RSS_TAX_TRIAGE_RESULTS.md`

**RSS Tax Target:**
- **Current**: 32.88 MB (FAST build, speed-first)
- **Target**: <35 MB (maintain speed-first design)
- **Alternative**: <10 MB (if memory-lean mode implemented, Phase 54+)
- **Status**: ACCEPTABLE (documented trade-off, zero drift, predictable)

**Phase 55: Memory-Lean Mode (PRODUCTION-READY):**

Memory-Lean mode provides **opt-in memory control** without performance penalty. Winner: **LEAN+OFF** (prewarm suppression only).

| Mode | Config | Throughput vs Baseline | RSS (MB) | Syscalls/op | Status |
|------|--------|------------------------|----------|-------------|--------|
| **Speed-first (default)** | `LEAN=0` | baseline (56.2M ops/s) | 32.75 | 1e-8 | Production |
| **Balanced (opt-in)** | `LEAN=1 DECOMMIT=OFF` | **+1.2%** (56.8M ops/s) | 32.88 | 1.25e-7 | Production |

**Key Results (30-min test, WS=400):**
- **Throughput**: +1.2% faster than baseline (56.8M vs 56.2M ops/s)
- **RSS**: 32.88 MB (stable, 0% drift)
- **Stability**: CV 5.41% (better than baseline 5.52%)
- **Syscalls**: 1.25e-7/op (8x under budget <1e-6/op)
- **No decommit overhead**: Prewarm suppression only, zero syscall tax

**Use Cases:**
- **Speed-first (default)**: `HAKMEM_SS_MEM_LEAN=0` (full prewarm enabled)
- **Balanced (opt-in)**: `HAKMEM_SS_MEM_LEAN=1 HAKMEM_SS_MEM_LEAN_DECOMMIT=OFF` (prewarm suppression only)

**Why LEAN+OFF is production-ready:**
1. Faster than baseline (+1.2%, no compromise)
2. Zero decommit syscall overhead (lean_decommit=0)
3. Perfect RSS stability (0% drift, better CV than baseline)
4. Simplest lean mode (no policy complexity)
5. Opt-in safety (`LEAN=0` disables all lean behavior)

**Results**: `docs/analysis/PHASE55_MEMORY_LEAN_MODE_VALIDATION_MATRIX.md`

**Phase 56: Promote LEAN+OFF as "Balanced Mode" (DEFAULT):**

Phase 56 promotes LEAN+OFF as the production-recommended "Balanced mode" by setting it as the default in the `MIXED_TINYV3_C7_SAFE` benchmark profile.
Phase 57 later shows Speed-first wins on 60-min + tail; default handling is revisited in Phase 58 (profile split).

**Profile Comparison (10-run validation, Phase 56):**

| Profile | Config | Mean (M ops/s) | CV | RSS (MB) | Syscalls/op | Use Case |
|---------|--------|---------------|-----|----------|-------------|----------|
| **Speed-first** | `LEAN=0` | 59.12 (Phase 55) | 0.48% | 33.00 | 5.00e-08 | Latency-critical, full prewarm |
| **Balanced** | `LEAN=1 DECOMMIT=OFF` | 59.84 (FAST), 60.48 (Standard) | 2.21% (FAST), 0.81% (Standard) | ~30 MB | 5.00e-08 | Prewarm suppression only |

**Phase 56 Validation Results (10-run):**
- **FAST build**: 59.84 M ops/s (mean), 60.36 M ops/s (median), CV 2.21%
- **Standard build**: 60.48 M ops/s (mean), 60.66 M ops/s (median), CV 0.81%
- **vs Phase 55 baseline**: +1.2% throughput gain confirmed (59.84 / 59.12 = 1.012)
- **Syscalls**: Zero overhead (5.00e-08/op, identical to baseline)

**Implementation:**
- Phase 56 added LEAN+OFF defaults to `MIXED_TINYV3_C7_SAFE` (historical).
- Phase 58 split presets: `MIXED_TINYV3_C7_SAFE` (Speed-first) + `MIXED_TINYV3_C7_BALANCED` (LEAN+OFF).

**Verdict**: **GO (production-ready)**. Balanced mode is faster, more stable, and has zero syscall overhead vs Speed-first.

**Rollback**: Remove 3 lines from `core/bench_profile.h` or set `HAKMEM_SS_MEM_LEAN=0` at runtime.

**Results**: `docs/analysis/PHASE56_PROMOTE_LEAN_OFF_RESULTS.md`
**Implementation**: `docs/analysis/PHASE56_PROMOTE_LEAN_OFF_IMPLEMENTATION.md`

**Phase 57: Balanced Mode 60-min Soak + Syscalls (FINAL VALIDATION):**

Phase 57 performed final validation of Balanced mode with 60-minute soak tests, high-resolution tail proxy, and syscall budget verification.

**60-min Soak Results (DURATION_SEC=3600, EPOCH_SEC=10, 360 epochs):**

| Mode | Mean TP (M ops/s) | CV | RSS (MB) | RSS Drift | Syscalls/op | Status |
|------|-------------------|-----|----------|-----------|-------------|--------|
| **Balanced** | 58.93 | 5.38% | 33.00 | 0.00% | 1.25e-7 | Production |
| **Speed-first** | 60.74 | 1.58% | 32.75 | 0.00% | 1.25e-7 | Production |

**Key Results:**
- **RSS Drift**: 0.00% for both modes (perfect stability over 60 minutes)
- **Throughput Drift**: 0.00% for both modes (no degradation)
- **CV (60-min)**: Balanced 5.38%, Speed-first 1.58% (both acceptable for production)
- **Syscalls**: Identical budget (1.25e-7/op, 8× below the <1e-6 target)
- **DSO guard**: Active in both modes (madvise_disabled=1, correct)

**10-min Tail Proxy Results (DURATION_SEC=600, EPOCH_SEC=1, 600 epochs):**

| Mode | Mean TP (M ops/s) | CV | p99 Latency (ns/op) | p99.9 Latency (ns/op) |
|------|-------------------|-----|---------------------|------------------------|
| **Balanced** | 53.11 | 2.18% | 20.78 | 21.24 |
| **Speed-first** | 53.62 | 0.71% | 19.14 | 19.35 |

**Tail Analysis:**
- Balanced: CV 2.18% (excellent for production), p99 +8.6% higher latency
- Speed-first: CV 0.71% (exceptional stability), lower tail latency
- Both: Zero RSS drift, no performance degradation

**Syscall Budget (200M ops, HAKMEM_SS_OS_STATS=1):**

| Mode | Total syscalls | Syscalls/op | madvise_disabled | lean_decommit |
|------|----------------|-------------|------------------|---------------|
| Balanced | 25 | 1.25e-7 | 1 (DSO guard active) | 0 (not triggered) |
| Speed-first | 25 | 1.25e-7 | 1 (DSO guard active) | 0 (not triggered) |

**Observations:**
- Identical syscall behavior across modes
- No runaway madvise/mmap (stable counts)
- lean_decommit=0: LEAN policy not triggered in WS=400 workload (expected)
- DSO guard functioning correctly in both modes

**Trade-off Summary:**

Balanced vs Speed-first:
- **Throughput**: -3.0% (60-min mean: 58.93M vs 60.74M ops/s)
- **Latency p99**: +8.6% (10-min: 20.78 vs 19.14 ns/op)
- **Stability**: +3.8pp CV (60-min: 5.38% vs 1.58%)
- **Memory**: +0.76% RSS (33.00 vs 32.75 MB)
- **Syscalls**: Identical (1.25e-7/op)

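
The trade-off figures above are relative differences of Balanced against Speed-first, except for CV, which is an absolute difference in percentage points. A short sketch of that arithmetic, using the Phase 57 numbers; `rel_diff_pct` and the two dictionaries are hypothetical names introduced for illustration:

```python
# Phase 57 measurements: 60-min soak (tp, cv, rss) and 10-min tail proxy (p99).
BALANCED = {"tp_mops": 58.93, "p99_ns": 20.78, "cv_pct": 5.38, "rss_mb": 33.00}
SPEED_FIRST = {"tp_mops": 60.74, "p99_ns": 19.14, "cv_pct": 1.58, "rss_mb": 32.75}


def rel_diff_pct(value: float, reference: float) -> float:
    """Relative difference of `value` against `reference`, in percent."""
    return 100.0 * (value / reference - 1.0)


if __name__ == "__main__":
    for key in ("tp_mops", "p99_ns", "rss_mb"):
        diff = rel_diff_pct(BALANCED[key], SPEED_FIRST[key])
        print(f"{key}: {diff:+.2f}%")
    # CV values are already percentages, so compare them in percentage points.
    cv_gap = BALANCED["cv_pct"] - SPEED_FIRST["cv_pct"]
    print(f"cv_pct: {cv_gap:+.2f}pp")
```
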
**Verdict**: **GO (production-ready)**. Both modes stable, zero drift, user choice preserved.

**Use Cases:**
- **Speed-first** (default): `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE`
- **Balanced** (opt-in): `HAKMEM_PROFILE=MIXED_TINYV3_C7_BALANCED` (sets `LEAN=1 DECOMMIT=OFF`)

**Phase 58: Profile Split (Speed-first default + Balanced opt-in):**
- `MIXED_TINYV3_C7_SAFE`: Speed-first default (does not set `HAKMEM_SS_MEM_LEAN`)
- `MIXED_TINYV3_C7_BALANCED`: Balanced opt-in preset (sets `LEAN=1 DECOMMIT=OFF`)

**Results**: `docs/analysis/PHASE57_BALANCED_MODE_60MIN_SOAK_AND_SYSCALLS_RESULTS.md`

**Phase 50 details (multi-process soak):**
- Test duration: 5 minutes (300 seconds)
- Step size: 20M operations per sample
- Samples: hakmem=742, mimalloc=1523, system=1093
- Results: `docs/analysis/PHASE50_OPERATIONAL_EDGE_STABILITY_SUITE_RESULTS.md`
- Script: `scripts/soak_mixed_rss.sh`
- **All allocators show ZERO drift** - excellent memory discipline
- **Key difference from Phase 51**: Separate process per sample (simulates batch jobs)

**Tools:**

```bash
# 5-min soak (Phase 50 - quick validation)
BENCH_BIN=./bench_random_mixed_hakmem_minimal \
DURATION_SEC=300 STEP_ITERS=20000000 WS=400 \
scripts/soak_mixed_rss.sh > soak_fast_5min.csv

# Analysis (CSV to metrics)
python3 analyze_soak.py  # Calculates drift/CV/peak RSS
```

**Target:**
- RSS drift: < +5% (5-min soak: PASS, 60-min: TBD)
- Throughput drift: > -5% (5-min soak: PASS, 60-min: TBD)

**Next steps (Phase 51+):**
- Extend to 30-60 min soak for long-term validation
- Compare mimalloc RSS behavior (currently only hakmem measured)

## 4) Long-run stability (performance and consistency)

Minimum requirements:
- ops/s does **not drop by 5% or more** over a 30–60 minute soak
- CV (coefficient of variation) stays within **~1–2%** (consistent with current practice)

**Current (Phase 51 - 5min single-process soak):**

| Allocator | Mean TP (M ops/s) | First 5 avg | Last 5 avg | TP Drift | CV | Status |
|-----------|-------------------|-------------|------------|----------|----|----|
| hakmem FAST | 59.95 | 59.45 | 60.17 | +1.20% | **0.50%** | EXCELLENT |
| mimalloc | 122.38 | 122.61 | 122.03 | -0.47% | 0.39% | EXCELLENT |
| system malloc | 85.31 | 84.99 | 85.32 | +0.38% | 0.42% | EXCELLENT |

**Phase 51 details (single-process soak):**
- **All allocators show minimal drift** (<1.5%) - highly stable performance
- **CV values are exceptional** (0.39%-0.50%) - **3-5× better than Phase 50 multi-process**
- **hakmem CV: 0.50%** - best stability in single-process mode, 3× better than Phase 50
- No performance degradation over 5 minutes
- Results: `docs/analysis/PHASE51_SINGLE_PROCESS_SOAK_AND_TAIL_PLAN_RESULTS.md`
- Script: `scripts/soak_mixed_single_process.sh` (epoch-based, persistent allocator state)
- **Key improvement**: Single-process mode eliminates cold-start variance (superior for long-run stability measurement)

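
The "TP Drift" and "CV" columns can be computed from per-epoch throughput samples. The sketch below is one plausible way to do so, not the project's `analyze_soak.py`: it assumes drift is the relative change between the average of the first five and last five epochs, and that CV uses the population standard deviation. All function names are hypothetical.

```python
from statistics import mean, pstdev


def drift_pct(first_avg: float, last_avg: float) -> float:
    """Relative change from the first-window average to the last-window average."""
    return 100.0 * (last_avg / first_avg - 1.0)


def edge_drift_pct(samples: list, window: int = 5) -> float:
    """Drift between the mean of the first and last `window` samples."""
    if len(samples) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    return drift_pct(mean(samples[:window]), mean(samples[-window:]))


def cv_pct(samples: list) -> float:
    """Coefficient of variation in percent (population std dev / mean)."""
    return 100.0 * pstdev(samples) / mean(samples)


if __name__ == "__main__":
    # First-5 / last-5 averages (M ops/s) from the Phase 51 table.
    for name, first, last in [
        ("hakmem FAST", 59.45, 60.17),
        ("mimalloc", 122.61, 122.03),
        ("system malloc", 84.99, 85.32),
    ]:
        print(f"{name}: {drift_pct(first, last):+.2f}% drift")
```

Recomputing from the rounded averages in the table gives values within a few hundredths of a percentage point of the listed drift figures.
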
**Phase 50 details (multi-process soak):**
- **All allocators show positive drift** (+0.8% to +0.9%) - likely CPU warmup effect
- **CV values are good** (1.5%-2.1%) - consistent but higher due to cold-start variance
- hakmem CV (1.49%) slightly better than mimalloc (1.60%)
- Results: `docs/analysis/PHASE50_OPERATIONAL_EDGE_STABILITY_SUITE_RESULTS.md`
- Script: `scripts/soak_mixed_rss.sh` (separate process per sample)

**Comparison to short-run (Phase 48 rebase):**
- Mixed 10-run: CV = 1.22% (mean 59.15M / min 58.12M / max 60.02M)
- 5-min multi-process soak (Phase 50): CV = 1.49% (mean 59.65M)
- 5-min single-process soak (Phase 51): CV = 0.50% (mean 59.95M)
- **Consistency: Single-process soak provides best stability measurement (3× lower CV)**

**Tools:**

```bash
# Run 5-min soak
BENCH_BIN=./bench_random_mixed_hakmem_minimal \
DURATION_SEC=300 STEP_ITERS=20000000 WS=400 \
scripts/soak_mixed_rss.sh > soak_fast_5min.csv

# Analyze with Python
python3 analyze_soak.py  # Calculates mean, drift, CV automatically
```

**Target:**
- Throughput drift: > -5% (5-min: PASS +0.94%, 60-min: TBD)
- CV: < 2% (5-min: PASS 1.49%, 60-min: TBD)

**Next steps (Phase 51+):**
- Extend to 30-60 min soak for long-term validation
- Confirm no monotonic drift (throughput should not decay over time)

## 5) Tail Latency (p99/p999)

**Status:** COMPLETE - Phase 52 (Throughput Proxy Method)

**Objective:** Measure tail latency using epoch throughput distribution as a proxy

**Method:** Use 1-second epoch throughput variance as a proxy for per-operation latency distribution
- Rationale: Epochs with lower throughput indicate periods of higher latency
- Advantage: Zero observer effect, measurement-only approach
- Implementation: 5-minute soak with 1-second epochs, calculate percentiles
- Note: Throughput tail is the *low* side (p1/p0.1). Latency percentiles must be computed from per-epoch latency values (not inverted percentiles).
- Tool: `scripts/analyze_epoch_tail_csv.py`

**Current Results (Phase 52 - Tail Latency Proxy):**

### Throughput Distribution (ops/sec)

| Metric | hakmem FAST | mimalloc | system malloc |
|--------|-------------|----------|---------------|
| **p50** | 47,887,721 | 98,738,326 | 69,562,115 |
| **p90** | 58,629,195 | 99,580,629 | 69,931,575 |
| **p99** | 59,174,766 | 110,702,822 | 70,165,415 |
| **p999** | 59,567,912 | 111,190,037 | 70,308,452 |
| **Mean** | 50,174,657 | 99,084,977 | 69,447,599 |
| **Std Dev** | 4,461,290 | 2,455,894 | 522,021 |

### Latency Proxy (ns/op)

Calculated as `1 / throughput * 1e9`:

| Metric | hakmem FAST | mimalloc | system malloc |
|--------|-------------|----------|---------------|
| **p50** | 20.88 ns | 10.13 ns | 14.38 ns |
| **p90** | 21.12 ns | 10.24 ns | 14.50 ns |
| **p99** | 21.33 ns | 10.43 ns | 14.80 ns |
| **p999** | 21.57 ns | 10.47 ns | 15.07 ns |

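
The latency proxy converts each epoch's throughput to ns/op first and only then takes percentiles, so that the latency p99 reflects the slowest epochs rather than the fastest ones. The sketch below illustrates that order of operations; it uses a nearest-rank percentile as an assumption, and `scripts/analyze_epoch_tail_csv.py` may use a different definition. All function names are hypothetical.

```python
import math


def ns_per_op(ops_per_sec: float) -> float:
    """Latency proxy for one epoch: 1 / throughput * 1e9."""
    return 1e9 / ops_per_sec


def percentile_nearest_rank(values: list, p: float) -> float:
    """Nearest-rank percentile (p in (0, 100]) of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def latency_percentiles(epoch_ops_per_sec: list, ps=(50, 90, 99)) -> dict:
    """Convert per-epoch throughput to ns/op, then take latency percentiles."""
    latencies = [ns_per_op(t) for t in epoch_ops_per_sec]
    return {p: percentile_nearest_rank(latencies, p) for p in ps}


if __name__ == "__main__":
    # Median hakmem FAST epoch throughput from the table above.
    print(f"{ns_per_op(47_887_721):.2f} ns/op")
    # Toy epochs: one slow epoch dominates the latency tail.
    print(latency_percentiles([100e6, 100e6, 100e6, 50e6], ps=(50, 99)))
```

In the toy data the single 50M ops/s epoch becomes the 20 ns/op latency p99, while the median stays at 10 ns/op.
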
### Tail Consistency Metrics

**Standard Deviation as % of Mean (lower = more consistent):**
- hakmem FAST: **7.98%** (highest variability)
- mimalloc: 2.28% (good consistency)
- system malloc: 0.77% (best consistency)

**p99/p50 Ratio (lower = better tail):**
- hakmem FAST: 1.024 (2.4% tail slowdown)
- mimalloc: 1.030 (3.0% tail slowdown)
- system malloc: 1.029 (2.9% tail slowdown)

**p999/p50 Ratio:**
- hakmem FAST: 1.033 (3.3% tail slowdown)
- mimalloc: 1.034 (3.4% tail slowdown)
- system malloc: 1.048 (4.8% tail slowdown)

### Analysis

**Key Findings:**
1. **hakmem has highest throughput variance**: 4.46M ops/sec std dev (7.98% of mean)
   - About 3.5× worse than mimalloc (2.28%)
   - About 10× worse than system malloc (0.77%)
2. **mimalloc has best absolute performance AND good tail behavior**:
   - 2× faster than hakmem at all percentiles
   - Moderate variance (2.28% std dev)
3. **system malloc has rock-solid consistency**:
   - Lowest variance (0.77% std dev)
   - Very tight p99/p999 spread
4. **hakmem's tail problem is variance, not worst-case**:
   - Absolute p99 latency (21.33 ns) is reasonable
   - But relative std dev is roughly 3.5-10× that of the competitors
   - Suggests optimization opportunities in cache warmth, metadata layout

**Test Configuration:**
- Duration: 5 minutes (300 seconds)
- Epoch length: 1 second
- Workload: Mixed (WS=400)
- Process model: Single process (persistent allocator state)
- Script: `scripts/soak_mixed_single_process.sh`
- Results: `docs/analysis/PHASE52_TAIL_LATENCY_PROXY_RESULTS.md`

**Target:**
- Std dev as % of mean: < 3% (Current: 7.98%, Goal: match mimalloc's 2.28%)
- p99/p50 ratio: < 1.05 (Current: 1.024, Status: GOOD)
- **Priority**: Reduce variance rather than chasing p999 specifically

**Next steps:**
- Phase 53: RSS Tax Triage (understand memory overhead sources)
- Future phases: Target variance reduction (TLS cache optimization, metadata locality)

## 6) Decision rules (operations)

- Runtime changes (ENV only): GO threshold +1.0% (Mixed 10-run mean)
- Build-level changes (compile-out type): GO threshold +0.5% (allowing for layout jitter)

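
The two thresholds can be written as a small lookup. This is only a sketch: this section defines the GO thresholds but not how NEUTRAL and NO-GO are separated below them, so the function returns a plain boolean, and treating a delta exactly at the threshold as GO is an assumption. The names `GO_THRESHOLD_PCT` and `is_go` are hypothetical.

```python
# GO thresholds on the Mixed 10-run mean delta, in percent.
GO_THRESHOLD_PCT = {
    "runtime": 1.0,  # ENV-only changes
    "build": 0.5,    # build-level (compile-out) changes
}


def is_go(delta_pct: float, change_kind: str) -> bool:
    """Return True when the measured delta reaches the GO threshold.

    Assumption: a delta exactly equal to the threshold counts as GO.
    Raises KeyError for an unknown change kind.
    """
    return delta_pct >= GO_THRESHOLD_PCT[change_kind]


if __name__ == "__main__":
    # Phase 69 (ENV-only warm pool change) reported +3.26%.
    print(is_go(3.26, "runtime"))
    # Phase 47 PGO was recorded as NEUTRAL at +0.27% mean.
    print(is_go(0.27, "runtime"))
```
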
## 6) Build Variants (FAST / Standard / OBSERVE): Phase 38 operations

### Three build types

| Build | Binary | Purpose | Characteristics |
|-------|--------|---------|-----------------|
| **FAST** | `bench_random_mixed_hakmem_minimal` | Pure performance measurement | Gate functions made constant, diagnostics OFF |
| **Standard** | `bench_random_mixed_hakmem` | Safety/compatibility baseline | ENV gates enabled, for mainline release |
| **OBSERVE** | `bench_random_mixed_hakmem_observe` | Behavior observation | Diagnostic counters ON, for perf analysis |

### Operating rules (finalized in Phase 38)

1. **Evaluate performance with the FAST build** (the reference for mimalloc comparison)
2. **Standard is the safety baseline** (gate overhead is accepted; compatibility of mainline features takes priority)
3. **OBSERVE is for debugging** (not used for performance evaluation; emits diagnostic output)

### FAST build history

| Version | Mean (ops/s) | Delta | Change |
|---------|--------------|-------|--------|
| FAST v1 | 54,557,938 | baseline | Phase 35-A: gate functions made constant |
| FAST v2 | 54,943,734 | +0.71% | Phase 36: policy snapshot init-once |
| **FAST v3** | 56,040,000 | +1.98% | Phase 39: hot-path gates made constant |

**Made constant in FAST v3:**
- `tiny_front_v3_enabled()` → always `true`
- `tiny_metadata_cache_enabled()` → always `0`
- `small_policy_v7_snapshot()` → version check skipped, init-once TLS cache
- `learner_v7_enabled()` → always `false`
- `small_learner_v2_enabled()` → always `false`
- `front_gate_unified_enabled()` → always `1` (Phase 39)
- `alloc_dualhot_enabled()` → always `0` (Phase 39)
- `g_bench_fast_front` block → compile-out (Phase 39)
- `g_v3_enabled` block → compile-out (Phase 39)
- `free_dispatch_stats_enabled()` → always `false` (Phase 39)

### Usage (Phase 38 workflow)

**Recommended: use the automation targets**

```bash
# FAST 10-run performance evaluation (the reference for mimalloc comparison)
make perf_fast

# OBSERVE health check (syscall/diagnostics verification)
make perf_observe

# Run both
make perf_all
```

**Manual execution (when individual control is needed)**

```bash
# Build only the FAST build
make bench_random_mixed_hakmem_minimal

# Build only the Standard build
make bench_random_mixed_hakmem

# Build only the OBSERVE build
make bench_random_mixed_hakmem_observe

# 10-run execution (with any binary)
scripts/run_mixed_10_cleanenv.sh
```

### Phase 37 lesson (limits of optimizing Standard)

The attempt to speed up the Standard build (TLS cache) was NO-GO (-0.07%):
- A runtime gate (lazy-init) always carries overhead
- A compile-time constant (BENCH_MINIMAL) is the only fix
- **Conclusion:** keep Standard as the safety baseline and evaluate performance with FAST

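The difference between the two gate styles can be sketched as follows. This is an illustration only, not the project's code: `DEMO_BENCH_MINIMAL`, `DEMO_GATE`, `demo_parse_gate` and `demo_gate_enabled` are hypothetical names, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical build flag standing in for the BENCH_MINIMAL define. */
#ifndef DEMO_BENCH_MINIMAL
#define DEMO_BENCH_MINIMAL 0
#endif

/* Interprets an ENV value: only a leading '1' enables the gate. */
static inline int demo_parse_gate(const char *v) {
    return (v != NULL && v[0] == '1') ? 1 : 0;
}

#if DEMO_BENCH_MINIMAL
/* FAST-style gate: a constant the compiler can fold, so callers keep no branch. */
static inline int demo_gate_enabled(void) { return 1; }
#else
/* Standard-style gate: reads the ENV once, then tests the cache on every call. */
static inline int demo_gate_enabled(void) {
    static int cached = -1;            /* -1 means "not initialized yet" */
    if (cached < 0) {
        cached = demo_parse_gate(getenv("DEMO_GATE"));
    }
    assert(cached == 0 || cached == 1);
    return cached;
}
#endif
```

Even after initialization, the lazy-init variant still pays a load and a compare on each call, whereas the constant variant lets the compiler remove the check and any dead branch behind it.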
### Phase 39 completed (FAST v3)

The following gate functions were made constant in Phase 39:

**malloc path (done):**
| Gate | File | FAST v3 value | Status |
|------|------|---------------|--------|
| `front_gate_unified_enabled()` | malloc_tiny_fast.h | fixed 1 | ✅ GO |
| `alloc_dualhot_enabled()` | malloc_tiny_fast.h | fixed 0 | ✅ GO |

**free path (done):**
| Gate | File | FAST v3 value | Status |
|------|------|---------------|--------|
| `g_bench_fast_front` | hak_free_api.inc.h | compile-out | ✅ GO |
| `g_v3_enabled` | hak_free_api.inc.h | compile-out | ✅ GO |
| `g_free_dispatch_ssot` | hak_free_api.inc.h | lazy-init kept | On hold |

**stats (done):**
| Gate | File | FAST v3 value | Status |
|------|------|---------------|--------|
| `free_dispatch_stats_enabled()` | free_dispatch_stats_box.h | fixed false | ✅ GO |

**Phase 39 result:** +1.98% (GO)

### Phase 47: FAST+PGO research box (NEUTRAL, on hold)

Phase 47 tested a compile-time fixed front config (`HAKMEM_TINY_FRONT_PGO=1`):

**Results:**
- Mean: +0.27% (below the +0.5% threshold)
- Median: +1.02% (positive signal)
- Verdict: **NEUTRAL** (kept as a research box; not adopted into the FAST standard)

**Reasons:**
- Mean falls below the GO threshold (+0.5%)
- Treatment variance is 2× baseline (a sign of layout tax)
- Median is positive, but it diverges widely from the mean

**Kept as a research box:**
- Makefile target: `bench_random_mixed_hakmem_fast_pgo`
- Leaves open the option of combining it with other optimizations later
- Details: `docs/analysis/PHASE47_FAST_FRONT_PGO_MODE_RESULTS.md`

### Phase 60: Alloc Pass-Down SSOT (NO-GO, research box)

Phase 60 implemented a Single Source of Truth (SSOT) pattern for the allocation path: the ENV snapshot, route kind, and the C7 ULTRA and DUALHOT flags are computed once at the entry point and passed down.

**A/B Test Results (Mixed 10-run):**
- **Baseline (SSOT=0)**: 60.05M ops/s (CV: 1.00%)
- **Treatment (SSOT=1)**: 59.77M ops/s (CV: 1.55%)
- **Delta**: -0.46% (**NO-GO**)

**Root Cause:**
1. The added branch check `if (alloc_passdown_ssot_enabled())` costs overhead on every allocation
2. The original path already has early exits (C7 ULTRA, DUALHOT) that avoid expensive computations
3. SSOT forces upfront computation, negating the benefit of those early exits
4. Struct pass-down introduces ABI overhead (register pressure, stack spills)

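The trade-off behind root causes 2 and 3 can be illustrated with a toy model. This is a sketch under stated assumptions, not the allocator's code: every `demo_` name is hypothetical, the class rules are invented, and a counter stands in for the real cost of each flag computation.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how many flag computations ran; a stand-in for real cost. */
static int g_flag_evals = 0;

static bool demo_is_c7_ultra(int cls) { g_flag_evals++; return cls == 7; }
static bool demo_is_dualhot(int cls)  { g_flag_evals++; return cls <= 1; }
static int  demo_route_kind(int cls)  { g_flag_evals++; return cls % 3; }

/* Early-exit shape: a flag is computed only if earlier checks fall through. */
static int demo_alloc_early_exit(int cls) {
    if (demo_is_c7_ultra(cls)) return 100;
    if (demo_is_dualhot(cls))  return 200;
    return 300 + demo_route_kind(cls);
}

/* Context struct that the pass-down shape fills in at the entry point. */
typedef struct {
    bool c7_ultra;
    bool dualhot;
    int  route_kind;
} demo_alloc_ctx;

/* Pass-down shape: everything is computed upfront, then dispatched. */
static int demo_alloc_ssot(int cls) {
    demo_alloc_ctx ctx = {
        demo_is_c7_ultra(cls),
        demo_is_dualhot(cls),
        demo_route_kind(cls),
    };
    if (ctx.c7_ultra) return 100;
    if (ctx.dualhot)  return 200;
    return 300 + ctx.route_kind;
}
```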
**Comparison with Free-Side Phase 19-6C:**
- Free-side SSOT: +1.5% (GO) - many redundant computations across multiple paths
- Alloc-side SSOT: -0.46% (NO-GO) - efficient early exits already in place

**Kept as Research Box:**
- ENV gate: `HAKMEM_ALLOC_PASSDOWN_SSOT=0` (default OFF)
- Files: `core/box/alloc_passdown_ssot_env_box.h`, `core/front/malloc_tiny_fast.h`
- Rollback: Build without `-DHAKMEM_ALLOC_PASSDOWN_SSOT=1`

**Lessons Learned:**
- SSOT pattern works when there are **many redundant computations** (Free-side)
- SSOT fails when the original path has **efficient early exits** (Alloc-side)
- Even a single branch check can introduce measurable overhead in hot paths
- Upfront computation negates the benefits of lazy evaluation

**Documentation:**
- Design: `docs/analysis/PHASE60_ALLOC_PASSDOWN_SSOT_DESIGN_AND_INSTRUCTIONS.md`
- Results: `docs/analysis/PHASE60_ALLOC_PASSDOWN_SSOT_RESULTS.md`
- Implementation: `docs/analysis/PHASE60_ALLOC_PASSDOWN_SSOT_IMPLEMENTATION.md`

**Next Steps:**
- Focus on Top 50 hot functions: `tiny_region_id_write_header` (3.50%), `unified_cache_push` (1.21%)
- Investigate branch reduction in hot paths
- Consider PGO or direct dispatch for common class indices

### Phase 61: C7 ULTRA Header-Light (NEUTRAL, research box)

Phase 61 tested skipping the header write in the C7 ULTRA alloc hit path to reduce instruction count.

**A/B Test Results (Mixed 10-run, Speed-first):**
- **Baseline (HEADER_LIGHT=0)**: 59.54M ops/s (CV: 1.53%)
- **Treatment (HEADER_LIGHT=1)**: 59.73M ops/s (CV: 2.66%)
- **Delta**: +0.31% (**NEUTRAL**)

**Runtime Profiling (perf record):**
- `tiny_region_id_write_header`: 2.32% (hotspot confirmed)
- `tiny_c7_ultra_alloc`: 1.90% (in top 10)
- Combined target overhead: ~4.22%

**Root Cause of Low Gain:**
1. The header write is a smaller hotspot than expected (2.32% vs 4.56% in Phase 42)
2. The Mixed workload dilutes C7-specific optimizations
3. Treatment has higher variance (CV 2.66% vs 1.53%)
4. Header-light mode adds a branch in the hot path (`if (header_light)`)
5. The refill phase still writes headers (cold path overhead)

**Implementation Status:**
- Pre-existing implementation discovered during analysis
- ENV gate: `HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT=0` (default OFF)
- Location: `core/tiny_c7_ultra.c:39-51`, `core/box/tiny_front_v3_env_box.h:145-152`
- Rollback: ENV gate already OFF by default (safe)

**Kept as Research Box:**
- Available for future C7-heavy workloads (>50% C7 allocations)
- May combine with other C7 optimizations (batch refill, SIMD header write)
- Requires IPC/cache-miss profiling (not just cycle count)

**Documentation:**
- Results: `docs/analysis/PHASE61_C7_ULTRA_HEADER_LIGHT_RESULTS.md`
- Implementation: `docs/analysis/PHASE61_C7_ULTRA_HEADER_LIGHT_IMPLEMENTATION.md`

**Lessons Learned:**
- Micro-optimizations need precise profiling (IPC and cache misses, not just cycles)
- A Mixed workload may not show the benefits of class-specific optimizations
- Instruction count reduction doesn't always translate to a performance gain
- Higher variance (CV) suggests instability or additional noise