Performance Targets (numeric goals for tracking mimalloc)
Purpose: lock in the winning formula not just for speed, but also for syscalls / memory stability / long-run stability.
Operating policy (finalized in Phase 38)
The FAST build is the canonical basis for comparison:
- FAST: pure performance measurement (gate functions constant-folded, diagnostic counters OFF)
- Standard: safety/compatibility baseline (ENV gates enabled, for mainline releases)
- OBSERVE: behavior observation and debugging (diagnostic counters ON)
Comparisons against mimalloc use the FAST build (Standard carries a fixed tax, so the comparison would be unfair).
Current snapshot (2025-12-18, Phase 69 PGO + WarmPool=16 — current baseline)
Measurement conditions (canonical for reproduction):
- Mixed: scripts/run_mixed_10_cleanenv.sh (ITERS=20000000 WS=400), 10-run mean/median
- Git: master (Phase 68 PGO, seed/WS diversified profile)
- Baseline binary: bench_random_mixed_hakmem_minimal_pgo (Phase 68 upgraded)
- Stability: Phase 66: 3 iterations, +3.0% mean, variance <±1% | Phase 68: 10-run, +1.19% vs Phase 66 (GO)
hakmem Build Variants (same binary layout)
| Build | Mean (M ops/s) | Median (M ops/s) | vs mimalloc | Notes |
|---|---|---|---|---|
| FAST v3 | 58.478 | 58.876 | 48.34% | Former baseline (Phase 59b rebase); superseded as the canonical performance reference by Phase 66 PGO |
| FAST v3 + PGO | 59.80 | 60.25 | 49.41% | Phase 47: NEUTRAL (+0.27% mean, +1.02% median, research box) |
| FAST v3 + PGO (Phase 66) | 60.89 | 61.35 | 50.32% | GO: +3.0% mean (verified 3 times, stable <±1%). Phase 66 PGO initial baseline |
| FAST v3 + PGO (Phase 68) | 61.614 | 61.924 | 50.93% | GO: +1.19% vs Phase 66 ✓ (seed/WS diversification) |
| FAST v3 + PGO (Phase 69) | 62.63 | 63.38 | 51.77% | Strong GO: +3.26% vs Phase 68 ✓✓✓ (Warm Pool Size=16, ENV-only) → promoted as the new FAST baseline ✓ |
| Standard | 53.50 | - | 44.21% | Safety/compatibility baseline (measured before Phase 48; needs rebase) |
| OBSERVE | TBD | - | - | Diagnostic counters ON |
Notes:
- Phase 63: make bench_random_mixed_hakmem_fast_fixed (HAKMEM_FAST_PROFILE_FIXED=1) is a research build (not recorded in the SSOT unless it reaches GO). Results: docs/analysis/PHASE63_FAST_PROFILE_FIXED_BUILD_RESULTS.md
- FAST vs Standard delta: +10.6% (Standard side measured before Phase 48; ratio adjusted after the mimalloc baseline change)
Phase 59b Notes:
- Profile Change: Switched from MIXED_TINYV3_C7_BALANCED to MIXED_TINYV3_C7_SAFE (Speed-first) as the canonical default
- Rationale: Phase 57 60-min soak showed Speed-first wins on all metrics (lower CV, better tail latency)
- Stability: CV 2.52% (hakmem) vs 0.90% (mimalloc) in Phase 59b
- vs Phase 59: Ratio change (49.13% → 48.34%) due to mimalloc variance (+1.59%), hakmem stable
- Recommended Profile: MIXED_TINYV3_C7_SAFE (Speed-first default)
Reference allocators (separate binaries; layout differences apply)
| allocator | mean (M ops/s) | median (M ops/s) | ratio vs mimalloc (mean) | CV |
|---|---|---|---|---|
| mimalloc (separate) | 120.979 | 120.967 | 100% | 0.90% |
| jemalloc (LD_PRELOAD) | 96.06 | 97.00 | 79.73% | 2.93% |
| system (separate) | 85.10 | 85.24 | 70.65% | 1.01% |
| libc (same binary) | 76.26 | 76.66 | 63.30% | (old) |
Notes:
- Phase 59b rebase: mimalloc updated (120.466M → 120.979M, +0.43% variation)
- system/mimalloc/jemalloc are measured as separate binaries, so these references include layout (text size / I-cache) differences
- libc (same binary) runs with HAKMEM_FORCE_LIBC_ALLOC=1, giving a rough same-layout comparison (measured before Phase 48)
- Use the FAST build for mimalloc comparisons (Standard's gate overhead is a hakmem-specific tax)
- jemalloc first measurement: 79.73% of mimalloc (Phase 59 baseline; a strong competitor, 9% faster than system)
1) Speed (relative targets)
Premise: compare hakmem vs mimalloc on the FAST build (Standard includes gate overhead, which would be unfair).
Recommended milestones (Mixed 16–1024B, FAST build):
| Milestone | Target | Current (FAST v3 + PGO Phase 69) | Status |
|---|---|---|---|
| M1 | 50% of mimalloc | 51.77% | 🟢 EXCEEDED (Phase 69, Warm Pool Size=16, ENV-only) |
| M2 | 55% of mimalloc | - | 🔴 Not reached (+3.23pp to go; ongoing in Phase 69+) |
| M3 | 60% of mimalloc | - | 🔴 Not reached (structural changes required) |
| M4 | 65–70% of mimalloc | - | 🔴 Not reached (structural changes required) |
Current: FAST v3 + PGO (Phase 69) = 62.63M ops/s = 51.77% of mimalloc (Warm Pool Size=16, ENV-only, 10-run verified)
Phase 68 PGO promotion (Phase 66 → Phase 68 upgrade):
- Phase 66 baseline: 60.89M ops/s = 50.32% (+3.0% mean, 3-run stable)
- Phase 68 baseline: 61.614M ops/s = 50.93% (+1.19% vs Phase 66, 10-run verified)
- Profile change: seed/WS diversification (WS: 3 variants → 5; seeds: 1 → 3)
- M1 (50%) achievement: EXCEEDED (+0.93pp above target, vs +0.32pp in Phase 66)
M1 Achievement Analysis:
- Phase 66: Gap to 50%: +0.32% (EXCEEDED target, first time above 50%)
- Phase 68: Gap to 50%: +0.93% (further improved via seed/WS diversification)
- Production perspective: 50.93% clears the 50.00% target by a statistically robust margin
- Stability advantage: Phase 66 (3-run <±1%) → Phase 68 (10-run +1.19%, improved reproducibility)
- Verdict: M1 EXCEEDED (+0.93pp); next phases will target M2 (55%)
Phase 68 Benefits Over Phase 66:
- Reduced PGO overfitting via seed/WS diversification
- +1.19% improvement from better profile representation
- More representative of production workload variance
- Higher confidence in baseline stability
Phase 69 PGO promotion (Phase 68 → Phase 69 upgrade):
- Phase 68 baseline: 61.614M ops/s = 50.93% (+1.19% vs Phase 66, 10-run verified)
- Phase 69 baseline: 62.63M ops/s = 51.77% (+3.26% vs Phase 68, 10-run verified)
- Parameter change: Warm Pool Size 12 → 16 (ENV-only, zero code changes)
- M1 (50%) achievement: EXCEEDED (+1.77pp above target, vs +0.93pp in Phase 68)
- M2 (55%) progress: Gap reduced to +3.23pp (from +4.07pp in Phase 68)
Phase 69 Benefits Over Phase 68:
- +3.26% improvement from warm pool optimization (strong-GO threshold exceeded)
- ENV-only change (zero layout tax risk, fully reversible)
- Reduced registry O(N) scan overhead via larger warm pool
- Non-additive with other optimizations (Warm Pool Size=16 alone is optimal)
- Single strongest parameter improvement in refill tuning sweep
Phase 69 Implementation:
- Warm Pool Size: 12 → 16 SuperSlabs/class
- ENV variable: HAKMEM_WARM_POOL_SIZE=16 (default in the MIXED_TINYV3_C7_SAFE preset; see the sketch below)
- Rollback: set HAKMEM_WARM_POOL_SIZE=12 or remove the ENV variable
- Results: docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md
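For reference, a minimal sketch of how an ENV-only knob like this can be read once and cached; the helper name and clamp bounds are assumptions, not hakmem's actual code:

```c
/* Sketch: init-once read of the Phase 69 warm-pool knob.
 * warm_pool_capacity() and the clamp bounds are illustrative,
 * not hakmem's actual parsing code. Thread-safety is elided. */
#include <stdlib.h>

static int warm_pool_capacity(void) {
    static int cached = -1;                 /* parse the ENV once */
    if (cached < 0) {
        const char *s = getenv("HAKMEM_WARM_POOL_SIZE");
        int v = s ? atoi(s) : 16;           /* Phase 69 default: 16 SuperSlabs/class */
        if (v < 1 || v > 64) v = 16;        /* assumed sanity bounds */
        cached = v;
    }
    return cached;
}
```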
Note: the mimalloc/system/jemalloc reference values drift with the environment; re-baseline them periodically.
- Phase 48 complete: docs/analysis/PHASE48_REBASE_ALLOCATORS_AND_STABILITY_SUITE_RESULTS.md
- Phase 59 complete: docs/analysis/PHASE59_50PERCENT_RECOVERY_BASELINE_REBASE_RESULTS.md
2) Syscall budget (OS churn)
Ideal for the Tiny hot path:
- steady-state (after warmup): mmap/munmap/madvise = 0 (or "nearly 0")
Guideline (acceptable):
- mmap+munmap+madvise total at most 1 per 1e8 ops (= 1e-8 / op)
Current (Phase 48 rebase):
- HAKMEM_SS_OS_STATS=1 (Mixed, iters=200000000 ws=400): [SS_OS_STATS] alloc=9 free=10 madvise=9 madvise_enomem=0 madvise_other=0 madvise_disabled=0 mmap_total=9 fallback_mmap=0 huge_alloc=0 huge_fail=0
- Total syscalls (mmap+madvise): 18 / 200M ops = 9e-8 / op (budget arithmetic sketched below)
- Status: EXCELLENT (within 10x of ideal, NO steady-state churn)
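The budget arithmetic is a single division; a tiny self-contained check using the counter values quoted above:

```c
/* Self-contained check of the budget arithmetic quoted above:
 * 9 mmap + 9 madvise over 200M ops = 9e-8 syscalls/op. */
#include <stdio.h>

int main(void) {
    unsigned long mmap_total = 9, madvise = 9;     /* from [SS_OS_STATS] */
    unsigned long ops = 200000000UL;               /* 200M ops */
    double per_op = (double)(mmap_total + madvise) / (double)ops;
    printf("syscalls/op = %.1e\n", per_op);        /* prints 9.0e-08 */
    /* ideal <= 1e-8/op; the EXCELLENT verdict allows within 10x of ideal */
    return per_op <= 1e-7 ? 0 : 1;
}
```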
How to observe (either):
- Internal: the [SS_OS_STATS] line from HAKMEM_SS_OS_STATS=1 (madvise/disabled counters, etc.)
- External: syscall events via perf stat, or strace -c (short runs, counts only)
Phase 48 confirmation:
- mmap/madvise do not keep growing after warmup (stable)
- Confirms numerically one of the non-speed winning angles over mimalloc
3) Memory stability (RSS / fragmentation)
Minimum requirements (Mixed soak, fixed ws):
- RSS does not increase monotonically over time
- RSS drift within +5% over a 1-hour soak (guideline; measured as sketched below)
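The soak scripts do this in Python; purely as an illustration, a C sketch of the same drift check, assuming Linux's /proc/self/status format:

```c
/* Illustrative Linux-only helpers (not the project's scripts):
 * sample VmRSS from /proc/self/status, then compute drift. */
#include <stdio.h>
#include <string.h>

static long rss_kb(void) {
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    if (!f) return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {   /* "VmRSS:  12345 kB" */
            sscanf(line + 6, "%ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}

/* drift = (last - first) / first; target here: <= +0.05 over 1 hour */
static double rss_drift(long first_kb, long last_kb) {
    return (double)(last_kb - first_kb) / (double)first_kb;
}
```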
Current (Phase 51 - 5min single-process soak):
| Allocator | First RSS (MB) | Last RSS (MB) | Peak RSS (MB) | RSS Drift | Status |
|---|---|---|---|---|---|
| hakmem FAST | 32.88 | 32.88 | 32.88 | +0.00% | EXCELLENT |
| mimalloc | 1.88 | 1.88 | 1.88 | +0.00% | EXCELLENT |
| system malloc | 1.88 | 1.88 | 1.88 | +0.00% | EXCELLENT |
Phase 51 details (single-process soak):
- Test duration: 5 minutes (300 seconds)
- Epoch size: 5 seconds
- Samples: 60 epochs per allocator
- Results: docs/analysis/PHASE51_SINGLE_PROCESS_SOAK_AND_TAIL_PLAN_RESULTS.md
- Script: scripts/soak_mixed_single_process.sh
- All allocators show ZERO drift - excellent memory discipline
- Note: hakmem's higher base RSS (33 MB vs 2 MB) is a design trade-off (Phase 53 triage)
- Key difference from Phase 50: Single process with persistent allocator state (simulates long-running servers)
- Optional: when targeting RSS <10MB via Memory-Lean mode (opt-in, Phase 54), the Phase 55 validation matrix is canonical: docs/analysis/PHASE55_MEMORY_LEAN_MODE_VALIDATION_MATRIX.md
Balanced mode (Phase 55, LEAN+OFF):
- Config: HAKMEM_SS_MEM_LEAN=1 + HAKMEM_SS_MEM_LEAN_DECOMMIT=OFF
- Effect: RSS does not drop (stays ≈33MB), but prewarm suppression can slightly improve throughput/stability
- Next: docs/analysis/PHASE56_PROMOTE_LEAN_OFF_PREWARM_SUPPRESSION_NEXT_INSTRUCTIONS.md
Phase 53 RSS Tax Triage:
| Component | Memory (MB) | % of Total | Source |
|---|---|---|---|
| Tiny metadata | 0.04 | 0.1% | TLS caches, warm pool, page box |
| SuperSlab backend | ~20-25 | 60-75% | Persistent slabs for fast allocation |
| Benchmark working set | ~5-8 | 15-25% | Live objects (WS=400) |
| OS overhead | ~2-5 | 6-15% | Page tables, heap metadata |
| Total RSS | 32.88 | 100% | Measured peak |
Root Cause (Phase 53):
- NOT bench warmup: RSS essentially unchanged by the prefault setting (32.88 MB → 33.12 MB)
- IS allocator design: Speed-first strategy with persistent superslabs
- Trade-off: +10x syscall efficiency, -17x memory efficiency vs mimalloc
- Verdict: ACCEPTABLE for speed-first strategy (documented design choice)
Results: docs/analysis/PHASE53_RSS_TAX_TRIAGE_RESULTS.md
RSS Tax Target:
- Current: 32.88 MB (FAST build, speed-first)
- Target: <35 MB (maintain speed-first design)
- Alternative: <10 MB (if memory-lean mode implemented, Phase 54+)
- Status: ACCEPTABLE (documented trade-off, zero drift, predictable)
Phase 55: Memory-Lean Mode (PRODUCTION-READY):
Memory-Lean mode provides opt-in memory control without performance penalty. Winner: LEAN+OFF (prewarm suppression only).
| Mode | Config | Throughput vs Baseline | RSS (MB) | Syscalls/op | Status |
|---|---|---|---|---|---|
| Speed-first (default) | LEAN=0 | baseline (56.2M ops/s) | 32.75 | 1e-8 | Production |
| Balanced (opt-in) | LEAN=1 DECOMMIT=OFF | +1.2% (56.8M ops/s) | 32.88 | 1.25e-7 | Production |
Key Results (30-min test, WS=400):
- Throughput: +1.2% faster than baseline (56.8M vs 56.2M ops/s)
- RSS: 32.88 MB (stable, 0% drift)
- Stability: CV 5.41% (better than baseline 5.52%)
- Syscalls: 1.25e-7/op (8x under budget <1e-6/op)
- No decommit overhead: Prewarm suppression only, zero syscall tax
Use Cases:
- Speed-first (default): HAKMEM_SS_MEM_LEAN=0 (full prewarm enabled)
- Balanced (opt-in): HAKMEM_SS_MEM_LEAN=1 HAKMEM_SS_MEM_LEAN_DECOMMIT=OFF (prewarm suppression only)
Why LEAN+OFF is production-ready:
- Faster than baseline (+1.2%, no compromise)
- Zero decommit syscall overhead (lean_decommit=0)
- Perfect RSS stability (0% drift, better CV than baseline)
- Simplest lean mode (no policy complexity)
- Opt-in safety (LEAN=0 disables all lean behavior)
Results: docs/analysis/PHASE55_MEMORY_LEAN_MODE_VALIDATION_MATRIX.md
Phase 56: Promote LEAN+OFF as "Balanced Mode" (DEFAULT):
Phase 56 promotes LEAN+OFF as the production-recommended "Balanced mode" by setting it as the default in MIXED_TINYV3_C7_SAFE benchmark profile.
Phase 57 later shows Speed-first wins on 60-min + tail; default handling is revisited in Phase 58 (profile split).
Profile Comparison (10-run validation, Phase 56):
| Profile | Config | Mean (M ops/s) | CV | RSS (MB) | Syscalls/op | Use Case |
|---|---|---|---|---|---|---|
| Speed-first | LEAN=0 | 59.12 (Phase 55) | 0.48% | 33.00 | 5.00e-08 | Latency-critical, full prewarm |
| Balanced | LEAN=1 DECOMMIT=OFF | 59.84 (FAST), 60.48 (Standard) | 2.21% (FAST), 0.81% (Standard) | ~30 MB | 5.00e-08 | Prewarm suppression only |
Phase 56 Validation Results (10-run):
- FAST build: 59.84 M ops/s (mean), 60.36 M ops/s (median), CV 2.21%
- Standard build: 60.48 M ops/s (mean), 60.66 M ops/s (median), CV 0.81%
- vs Phase 55 baseline: +1.2% throughput gain confirmed (59.84 / 59.12 = 1.012)
- Syscalls: Zero overhead (5.00e-08/op, identical to baseline)
Implementation:
- Phase 56 added LEAN+OFF defaults to MIXED_TINYV3_C7_SAFE (historical).
- Phase 58 split the presets: MIXED_TINYV3_C7_SAFE (Speed-first) + MIXED_TINYV3_C7_BALANCED (LEAN+OFF).
Verdict: GO (production-ready) — Balanced mode is faster, more stable, and has zero syscall overhead vs Speed-first.
Rollback: Remove 3 lines from core/bench_profile.h or set HAKMEM_SS_MEM_LEAN=0 at runtime.
Results: docs/analysis/PHASE56_PROMOTE_LEAN_OFF_RESULTS.md
Implementation: docs/analysis/PHASE56_PROMOTE_LEAN_OFF_IMPLEMENTATION.md
Phase 57: Balanced Mode 60-min Soak + Syscalls (FINAL VALIDATION):
Phase 57 performed final validation of Balanced mode with 60-minute soak tests, high-resolution tail proxy, and syscall budget verification.
60-min Soak Results (DURATION_SEC=3600, EPOCH_SEC=10, 360 epochs):
| Mode | Mean TP (M ops/s) | CV | RSS (MB) | RSS Drift | Syscalls/op | Status |
|---|---|---|---|---|---|---|
| Balanced | 58.93 | 5.38% | 33.00 | 0.00% | 1.25e-7 | Production |
| Speed-first | 60.74 | 1.58% | 32.75 | 0.00% | 1.25e-7 | Production |
Key Results:
- RSS Drift: 0.00% for both modes (perfect stability over 60 minutes)
- Throughput Drift: 0.00% for both modes (no degradation)
- CV (60-min): Balanced 5.38%, Speed-first 1.58% (both acceptable for production)
- Syscalls: Identical budget (1.25e-7/op, 8× below the <1e-6 target)
- DSO guard: Active in both modes (madvise_disabled=1, correct)
10-min Tail Proxy Results (DURATION_SEC=600, EPOCH_SEC=1, 600 epochs):
| Mode | Mean TP (M ops/s) | CV | p99 Latency (ns/op) | p99.9 Latency (ns/op) |
|---|---|---|---|---|
| Balanced | 53.11 | 2.18% | 20.78 | 21.24 |
| Speed-first | 53.62 | 0.71% | 19.14 | 19.35 |
Tail Analysis:
- Balanced: CV 2.18% (excellent for production), p99 +8.6% higher latency
- Speed-first: CV 0.71% (exceptional stability), lower tail latency
- Both: Zero RSS drift, no performance degradation
Syscall Budget (200M ops, HAKMEM_SS_OS_STATS=1):
| Mode | Total syscalls | Syscalls/op | madvise_disabled | lean_decommit |
|---|---|---|---|---|
| Balanced | 25 | 1.25e-7 | 1 (DSO guard active) | 0 (not triggered) |
| Speed-first | 25 | 1.25e-7 | 1 (DSO guard active) | 0 (not triggered) |
Observations:
- Identical syscall behavior across modes
- No runaway madvise/mmap (stable counts)
- lean_decommit=0: LEAN policy not triggered in WS=400 workload (expected)
- DSO guard functioning correctly in both modes
Trade-off Summary:
Balanced vs Speed-first:
- Throughput: -3.0% (60-min mean: 58.93M vs 60.74M ops/s)
- Latency p99: +8.6% (10-min: 20.78 vs 19.14 ns/op)
- Stability: +3.8pp CV (60-min: 5.38% vs 1.58%)
- Memory: +0.76% RSS (33.00 vs 32.75 MB)
- Syscalls: Identical (1.25e-7/op)
Verdict: GO (production-ready) — Both modes stable, zero drift, user choice preserved.
Use Cases:
- Speed-first (default): HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
- Balanced (opt-in): HAKMEM_PROFILE=MIXED_TINYV3_C7_BALANCED (sets LEAN=1 DECOMMIT=OFF)
Phase 58: Profile Split (Speed-first default + Balanced opt-in):
- MIXED_TINYV3_C7_SAFE: Speed-first default (does not set HAKMEM_SS_MEM_LEAN)
- MIXED_TINYV3_C7_BALANCED: Balanced opt-in preset (sets LEAN=1 DECOMMIT=OFF)
Results: docs/analysis/PHASE57_BALANCED_MODE_60MIN_SOAK_AND_SYSCALLS_RESULTS.md
Phase 50 details (multi-process soak):
- Test duration: 5 minutes (300 seconds)
- Step size: 20M operations per sample
- Samples: hakmem=742, mimalloc=1523, system=1093
- Results: docs/analysis/PHASE50_OPERATIONAL_EDGE_STABILITY_SUITE_RESULTS.md
- Script: scripts/soak_mixed_rss.sh
- All allocators show ZERO drift - excellent memory discipline
- Key difference from Phase 51: Separate process per sample (simulates batch jobs)
Tools:
# 5-min soak (Phase 50 - quick validation)
BENCH_BIN=./bench_random_mixed_hakmem_minimal \
DURATION_SEC=300 STEP_ITERS=20000000 WS=400 \
scripts/soak_mixed_rss.sh > soak_fast_5min.csv
# Analysis (CSV to metrics)
python3 analyze_soak.py # Calculates drift/CV/peak RSS
Target:
- RSS drift: < +5% (5-min soak: PASS, 60-min: TBD)
- Throughput drift: > -5% (5-min soak: PASS, 60-min: TBD)
Next steps (Phase 51+):
- Extend to 30-60 min soak for long-term validation
- Compare mimalloc RSS behavior (currently only hakmem measured)
4) Long-run stability (performance consistency)
Minimum requirements:
- ops/s does not drop by more than 5% over a 30–60 min soak
- CV (coefficient of variation) stays within ~1–2% (consistent with current practice)
Current (Phase 51 - 5min single-process soak):
| Allocator | Mean TP (M ops/s) | First 5 avg | Last 5 avg | TP Drift | CV | Status |
|---|---|---|---|---|---|---|
| hakmem FAST | 59.95 | 59.45 | 60.17 | +1.20% | 0.50% | EXCELLENT |
| mimalloc | 122.38 | 122.61 | 122.03 | -0.47% | 0.39% | EXCELLENT |
| system malloc | 85.31 | 84.99 | 85.32 | +0.38% | 0.42% | EXCELLENT |
Phase 51 details (single-process soak):
- All allocators show minimal drift (<1.5%) - highly stable performance
- CV values are exceptional (0.39%-0.50%) - 3-5× better than Phase 50 multi-process
- hakmem CV: 0.50% - best stability in single-process mode, 3× better than Phase 50
- No performance degradation over 5 minutes
- Results: docs/analysis/PHASE51_SINGLE_PROCESS_SOAK_AND_TAIL_PLAN_RESULTS.md
- Script: scripts/soak_mixed_single_process.sh (epoch-based, persistent allocator state)
- Key improvement: single-process mode eliminates cold-start variance (superior for long-run stability measurement)
Phase 50 details (multi-process soak):
- All allocators show positive drift (+0.8% to +0.9%) - likely CPU warmup effect
- CV values are good (1.5%-2.1%) - consistent but higher due to cold-start variance
- hakmem CV (1.49%) slightly better than mimalloc (1.60%)
- Results: docs/analysis/PHASE50_OPERATIONAL_EDGE_STABILITY_SUITE_RESULTS.md
- Script: scripts/soak_mixed_rss.sh (separate process per sample)
Comparison to short-run (Phase 48 rebase):
- Mixed 10-run: CV = 1.22% (mean 59.15M / min 58.12M / max 60.02M)
- 5-min multi-process soak (Phase 50): CV = 1.49% (mean 59.65M)
- 5-min single-process soak (Phase 51): CV = 0.50% (mean 59.95M)
- Consistency: Single-process soak provides best stability measurement (3× lower CV)
Tools:
# Run 5-min soak
BENCH_BIN=./bench_random_mixed_hakmem_minimal \
DURATION_SEC=300 STEP_ITERS=20000000 WS=400 \
scripts/soak_mixed_rss.sh > soak_fast_5min.csv
# Analyze with Python
python3 analyze_soak.py # Calculates mean, drift, CV automatically
Target:
- Throughput drift: > -5% (5-min: PASS +0.94%, 60-min: TBD)
- CV: < 2% (5-min: PASS 1.49%, 60-min: TBD)
Next steps (Phase 51+):
- Extend to 30-60 min soak for long-term validation
- Confirm no monotonic drift (throughput should not decay over time)
5) Tail Latency (p99/p999)
Status: COMPLETE - Phase 52 (Throughput Proxy Method)
Objective: Measure tail latency using epoch throughput distribution as a proxy
Method: Use 1-second epoch throughput variance as a proxy for per-operation latency distribution
- Rationale: Epochs with lower throughput indicate periods of higher latency
- Advantage: Zero observer effect, measurement-only approach
- Implementation: 5-minute soak with 1-second epochs, calculate percentiles
- Note: Throughput tail is the low side (p1/p0.1). Latency percentiles must be computed from per-epoch latency values (not inverted percentiles).
- Tool: scripts/analyze_epoch_tail_csv.py (proxy math sketched below)
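Under those assumptions, a minimal C sketch of the proxy math: convert each epoch's throughput to a latency value (1e9 / ops_per_sec), then take percentiles over those per-epoch latencies directly, never by inverting throughput percentiles. The epoch data below is placeholder, not measured:

```c
/* Placeholder epoch data; the real inputs come from the soak CSV. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double percentile(const double *sorted, int n, double p) {
    int idx = (int)(p * (n - 1) + 0.5);   /* nearest-rank; fine for a sketch */
    return sorted[idx];
}

int main(void) {
    enum { N = 600 };                      /* 600 one-second epochs (10-min run) */
    double lat[N];
    for (int i = 0; i < N; i++) {
        double ops_per_sec = 5.0e7 + 1.0e6 * (i % 10);  /* placeholder throughput */
        lat[i] = 1e9 / ops_per_sec;                     /* per-epoch latency proxy, ns/op */
    }
    qsort(lat, N, sizeof lat[0], cmp_double);           /* percentiles over latencies, */
    printf("p50=%.2f p99=%.2f p99.9=%.2f ns/op\n",      /* not inverted throughput     */
           percentile(lat, N, 0.50), percentile(lat, N, 0.99), percentile(lat, N, 0.999));
    return 0;
}
```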
Current Results (Phase 52 - Tail Latency Proxy):
Throughput Distribution (ops/sec)
| Metric | hakmem FAST | mimalloc | system malloc |
|---|---|---|---|
| p50 | 47,887,721 | 98,738,326 | 69,562,115 |
| p90 | 58,629,195 | 99,580,629 | 69,931,575 |
| p99 | 59,174,766 | 110,702,822 | 70,165,415 |
| p999 | 59,567,912 | 111,190,037 | 70,308,452 |
| Mean | 50,174,657 | 99,084,977 | 69,447,599 |
| Std Dev | 4,461,290 | 2,455,894 | 522,021 |
Latency Proxy (ns/op)
Calculated as 1 / throughput * 1e9:
| Metric | hakmem FAST | mimalloc | system malloc |
|---|---|---|---|
| p50 | 20.88 ns | 10.13 ns | 14.38 ns |
| p90 | 21.12 ns | 10.24 ns | 14.50 ns |
| p99 | 21.33 ns | 10.43 ns | 14.80 ns |
| p999 | 21.57 ns | 10.47 ns | 15.07 ns |
Tail Consistency Metrics
Standard Deviation as % of Mean (lower = more consistent):
- hakmem FAST: 7.98% (highest variability)
- mimalloc: 2.28% (good consistency)
- system malloc: 0.77% (best consistency)
p99/p50 Ratio (lower = better tail):
- hakmem FAST: 1.024 (2.4% tail slowdown)
- mimalloc: 1.030 (3.0% tail slowdown)
- system malloc: 1.029 (2.9% tail slowdown)
p999/p50 Ratio:
- hakmem FAST: 1.033 (3.3% tail slowdown)
- mimalloc: 1.034 (3.4% tail slowdown)
- system malloc: 1.048 (4.8% tail slowdown)
Analysis
Key Findings:
- hakmem has highest throughput variance: 4.46M ops/sec std dev (7.98% of mean)
- 2× worse than mimalloc (2.28%)
- 10× worse than system malloc (0.77%)
- mimalloc has best absolute performance AND good tail behavior:
- 2× faster than hakmem at all percentiles
- Moderate variance (2.28% std dev)
- system malloc has rock-solid consistency:
- Lowest variance (0.77% std dev)
- Very tight p99/p999 spread
- hakmem's tail problem is variance, not worst-case:
- Absolute p99 latency (21.33 ns) is reasonable
- But 2-3× higher variance than competitors
- Suggests optimization opportunities in cache warmth, metadata layout
Test Configuration:
- Duration: 5 minutes (300 seconds)
- Epoch length: 1 second
- Workload: Mixed (WS=400)
- Process model: Single process (persistent allocator state)
- Script: scripts/soak_mixed_single_process.sh
- Results: docs/analysis/PHASE52_TAIL_LATENCY_PROXY_RESULTS.md
Target:
- Std dev as % of mean: < 3% (Current: 7.98%, Goal: match mimalloc's 2.28%)
- p99/p50 ratio: < 1.05 (Current: 1.024, Status: GOOD)
- Priority: Reduce variance rather than chasing p999 specifically
Next steps:
- Phase 53: RSS Tax Triage (understand memory overhead sources)
- Future phases: Target variance reduction (TLS cache optimization, metadata locality)
6) Decision rules (operations)
- runtime changes (ENV only): GO threshold +1.0% (Mixed 10-run mean)
- build-level changes (compile-out family): GO threshold +0.5% (allows for layout jitter); see the sketch below
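As a sketch, the rule reduces to one comparison; only the GO side is encoded here, since the NEUTRAL/NO-GO boundary is judged case by case in the phase docs:

```c
/* Only the GO side of the decision rule is encoded. */
static int is_go(double baseline_mops, double treatment_mops, int env_only_change) {
    double delta_pct = (treatment_mops - baseline_mops) / baseline_mops * 100.0;
    double threshold = env_only_change ? 1.0 : 0.5;   /* +1.0% ENV-only, +0.5% build-level */
    return delta_pct >= threshold;
}
```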
7) Build Variants (FAST / Standard / OBSERVE) — Phase 38 operations
The three builds
| Build | Binary | Purpose | Characteristics |
|---|---|---|---|
| FAST | bench_random_mixed_hakmem_minimal | pure performance measurement | gate functions constant-folded, diagnostics OFF |
| Standard | bench_random_mixed_hakmem | safety/compatibility baseline | ENV gates enabled, for mainline releases |
| OBSERVE | bench_random_mixed_hakmem_observe | behavior observation | diagnostic counters ON, for perf analysis |
Operating rules (finalized in Phase 38)
- Performance evaluation uses the FAST build (canonical for mimalloc comparison)
- Standard is the safety baseline (gate overhead tolerated; mainline feature compatibility comes first)
- OBSERVE is for debugging (not used for performance evaluation; emits diagnostics)
FAST build history
| Version | Mean (ops/s) | Delta | Change |
|---|---|---|---|
| FAST v1 | 54,557,938 | baseline | Phase 35-A: gate function constant-folding |
| FAST v2 | 54,943,734 | +0.71% | Phase 36: policy snapshot init-once |
| FAST v3 | 56,040,000 | +1.98% | Phase 39: hot-path gate constant-folding |
Constant-folded in FAST v3 (pattern sketched below):
- tiny_front_v3_enabled() → always true
- tiny_metadata_cache_enabled() → always 0
- small_policy_v7_snapshot() → version check skipped, init-once TLS cache
- learner_v7_enabled() → always false
- small_learner_v2_enabled() → always false
- front_gate_unified_enabled() → always 1 (Phase 39)
- alloc_dualhot_enabled() → always 0 (Phase 39)
- g_bench_fast_front block → compiled out (Phase 39)
- g_v3_enabled block → compiled out (Phase 39)
- free_dispatch_stats_enabled() → always false (Phase 39)
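The pattern behind this list, as a hedged sketch: under the FAST (BENCH_MINIMAL) build the gate becomes a compile-time constant that the optimizer folds away, otherwise it remains a lazy-init ENV gate. Only the gate name comes from the list above; the body and the HAKMEM_FRONT_GATE_UNIFIED variable name are illustrative:

```c
/* Sketch of the FAST-build gate pattern; body and ENV name are assumed. */
#include <stdlib.h>

static inline int front_gate_unified_enabled(void) {
#ifdef BENCH_MINIMAL
    return 1;                        /* FAST v3: fixed 1; the branch folds away */
#else
    static int cached = -1;          /* Standard: lazy-init ENV gate (the runtime tax) */
    if (cached < 0) {
        const char *s = getenv("HAKMEM_FRONT_GATE_UNIFIED");
        cached = (s == NULL) ? 1 : (atoi(s) != 0);
    }
    return cached;
#endif
}
```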
Usage (Phase 38 workflow)
Recommended: use the automation targets
# FAST 10-run performance evaluation (canonical for mimalloc comparison)
make perf_fast
# OBSERVE health check (syscalls / diagnostics)
make perf_observe
# Run both
make perf_all
Manual runs (when individual control is needed)
# Build the FAST build only
make bench_random_mixed_hakmem_minimal
# Build the Standard build only
make bench_random_mixed_hakmem
# Build the OBSERVE build only
make bench_random_mixed_hakmem_observe
# 10-run (with any binary)
scripts/run_mixed_10_cleanenv.sh
Phase 37 lesson (limits of optimizing Standard)
The attempt to speed up the Standard build (TLS cache) was NO-GO (-0.07%):
- A runtime gate (lazy-init) always carries overhead
- A compile-time constant (BENCH_MINIMAL) is the only real fix
- Conclusion: keep Standard as the safety baseline; evaluate performance on FAST
Phase 39 completed (FAST v3)
The following gate functions were constant-folded in Phase 39:
malloc path (done):
| Gate | File | FAST v3 value | Status |
|---|---|---|---|
| front_gate_unified_enabled() | malloc_tiny_fast.h | fixed 1 | ✅ GO |
| alloc_dualhot_enabled() | malloc_tiny_fast.h | fixed 0 | ✅ GO |
free path (done):
| Gate | File | FAST v3 value | Status |
|---|---|---|---|
| g_bench_fast_front | hak_free_api.inc.h | compiled out | ✅ GO |
| g_v3_enabled | hak_free_api.inc.h | compiled out | ✅ GO |
| g_free_dispatch_ssot | hak_free_api.inc.h | lazy-init kept | deferred |
stats (done):
| Gate | File | FAST v3 value | Status |
|---|---|---|---|
| free_dispatch_stats_enabled() | free_dispatch_stats_box.h | fixed false | ✅ GO |
Phase 39 result: +1.98% (GO)
Phase 47: FAST+PGO research box (NEUTRAL, on hold)
Phase 47 tested a compile-time fixed front config (HAKMEM_TINY_FRONT_PGO=1):
Results:
- Mean: +0.27% (below the +0.5% threshold)
- Median: +1.02% (positive signal)
- Verdict: NEUTRAL (kept as a research box; not adopted into the FAST standard)
Reasons:
- Mean falls below the GO threshold (+0.5%)
- Treatment variance is 2× baseline (a sign of layout tax)
- Median is positive but diverges substantially from the mean
Kept as a research box:
- Makefile target: bench_random_mixed_hakmem_fast_pgo
- Leaves open combining it with other optimizations later
- Details: docs/analysis/PHASE47_FAST_FRONT_PGO_MODE_RESULTS.md
Phase 60: Alloc Pass-Down SSOT (NO-GO, research box)
Phase 60 implemented a Single Source of Truth (SSOT) pattern for the allocation path, computing ENV snapshot, route kind, C7 ULTRA, and DUALHOT flags once at the entry point and passing them down.
A/B Test Results (Mixed 10-run):
- Baseline (SSOT=0): 60.05M ops/s (CV: 1.00%)
- Treatment (SSOT=1): 59.77M ops/s (CV: 1.55%)
- Delta: -0.46% (NO-GO)
Root Cause:
- The added branch check if (alloc_passdown_ssot_enabled()) introduces overhead
- The original path already has early exits (C7 ULTRA, DUALHOT) that avoid expensive computations
- SSOT forces upfront computation, negating the benefit of early exits
- Struct pass-down introduces ABI overhead (register pressure, stack spills); see the sketch below
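A sketch of the tested shape, assuming illustrative types and placeholder routing logic (not hakmem's actual ABI): the entry point computes every flag up front and passes the struct down by value, which is precisely where the early-exit savings disappear and the register/stack cost appears.

```c
/* Illustrative types and placeholder routing logic, not hakmem's ABI. */
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int route_kind;   /* tiny/small/large route decision */
    int c7_ultra;     /* C7 ULTRA fast-path flag */
    int dualhot;      /* DUALHOT flag */
} alloc_route_ssot;

static void *alloc_backend(size_t size, alloc_route_ssot route) {
    (void)route;                       /* stand-in for the real dispatch */
    return malloc(size);
}

void *hak_alloc_ssot_sketch(size_t size) {
    alloc_route_ssot route;
    route.route_kind = size > 1024;    /* placeholder classification */
    route.c7_ultra   = size <= 128;    /* computed even when an early exit */
    route.dualhot    = 0;              /* would have skipped it entirely   */
    return alloc_backend(size, route); /* by-value pass-down: register/stack cost */
}
```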
Comparison with Free-Side Phase 19-6C:
- Free-side SSOT: +1.5% (GO) - many redundant computations across multiple paths
- Alloc-side SSOT: -0.46% (NO-GO) - efficient early exits already in place
Kept as Research Box:
- ENV gate: HAKMEM_ALLOC_PASSDOWN_SSOT=0 (default OFF)
- Files: core/box/alloc_passdown_ssot_env_box.h, core/front/malloc_tiny_fast.h
- Rollback: build without -DHAKMEM_ALLOC_PASSDOWN_SSOT=1
Lessons Learned:
- SSOT pattern works when there are many redundant computations (Free-side)
- SSOT fails when the original path has efficient early exits (Alloc-side)
- Even a single branch check can introduce measurable overhead in hot paths
- Upfront computation negates the benefits of lazy evaluation
Documentation:
- Design: docs/analysis/PHASE60_ALLOC_PASSDOWN_SSOT_DESIGN_AND_INSTRUCTIONS.md
- Results: docs/analysis/PHASE60_ALLOC_PASSDOWN_SSOT_RESULTS.md
- Implementation: docs/analysis/PHASE60_ALLOC_PASSDOWN_SSOT_IMPLEMENTATION.md
Next Steps:
- Focus on the Top 50 hot functions: tiny_region_id_write_header (3.50%), unified_cache_push (1.21%)
- Investigate branch reduction in hot paths
- Consider PGO or direct dispatch for common class indices
Phase 61: C7 ULTRA Header-Light (NEUTRAL, research box)
Phase 61 tested skipping header write in C7 ULTRA alloc hit path to reduce instruction count.
A/B Test Results (Mixed 10-run, Speed-first):
- Baseline (HEADER_LIGHT=0): 59.54M ops/s (CV: 1.53%)
- Treatment (HEADER_LIGHT=1): 59.73M ops/s (CV: 2.66%)
- Delta: +0.31% (NEUTRAL)
Runtime Profiling (perf record):
- tiny_region_id_write_header: 2.32% (hotspot confirmed)
- tiny_c7_ultra_alloc: 1.90% (in top 10)
- Combined target overhead: ~4.22%
Root Cause of Low Gain:
- Header write is smaller hotspot than expected (2.32% vs 4.56% in Phase 42)
- Mixed workload dilutes C7-specific optimizations
- Treatment has higher variance (CV 2.66% vs 1.53%)
- Header-light mode adds a branch in the hot path (if (header_light)); see the sketch below
- Refill phase still writes headers (cold path overhead)
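A sketch of the measured gate shape; the helper and the assumed signature are illustrative (the real code lives in core/tiny_c7_ultra.c:39-51):

```c
/* The header-write hotspot is real (2.32% above); its signature here
 * and c7_ultra_hit() are assumed for illustration. */
#include <stdint.h>

extern void tiny_region_id_write_header(void *p, uint32_t region_id); /* assumed signature */

static inline void *c7_ultra_hit(void *slot, uint32_t region_id, int header_light) {
    if (!header_light)                                /* this branch IS the hot-path cost */
        tiny_region_id_write_header(slot, region_id); /* default: write the header */
    return slot;                                      /* header_light==1 skips the write */
}
```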
Implementation Status:
- Pre-existing implementation discovered during analysis
- ENV gate: HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT=0 (default OFF)
- Location: core/tiny_c7_ultra.c:39-51, core/box/tiny_front_v3_env_box.h:145-152
- Rollback: ENV gate already OFF by default (safe)
Kept as Research Box:
- Available for future C7-heavy workloads (>50% C7 allocations)
- May combine with other C7 optimizations (batch refill, SIMD header write)
- Requires IPC/cache-miss profiling (not just cycle count)
Documentation:
- Results: docs/analysis/PHASE61_C7_ULTRA_HEADER_LIGHT_RESULTS.md
- Implementation: docs/analysis/PHASE61_C7_ULTRA_HEADER_LIGHT_IMPLEMENTATION.md
Lessons Learned:
- Micro-optimizations need precise profiling (IPC, cache misses, not just cycles)
- Mixed workload may not show benefits of class-specific optimizations
- Instruction count reduction doesn't always translate to performance gain
- Higher variance (CV) suggests instability or additional noise