Files
hakmem/docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md
Moe Charm (CI) 89a9212700 Phase 83-1 + Allocator Comparison: Switch dispatch fixed (NO-GO +0.32%), PROFILE correction, SCORECARD update
Key changes:
- Phase 83-1: Switch dispatch fixed mode (tiny_inline_slots_switch_dispatch_fixed_box) - NO-GO (marginal +0.32%, branch reduction negligible)
  Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns

- Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M):
  tcmalloc: 115.26M (92.33% of mimalloc)
  jemalloc: 97.39M (77.96% of mimalloc)
  system: 85.20M (68.24% of mimalloc)
  mimalloc: 124.82M (baseline)

- hakmem PROFILE correction: scripts/run_mixed_10_cleanenv.sh + run_allocator_quick_matrix.sh
  PROFILE explicitly set to MIXED_TINYV3_C7_SAFE for hakmem measurements
  Result: baseline stabilized to 55.53M (44.46% of mimalloc)
  Previous unstable measurement (35.57M) was due to profile leak

- Documentation:
  * PERFORMANCE_TARGETS_SCORECARD.md: Reference allocators + M1/M2 milestone status
  * PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md: Phase 83-1 analysis (NO-GO)
  * ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md: Quick comparison procedure
  * ALLOCATOR_COMPARISON_SSOT.md: Detailed SSOT methodology

- M2 milestone status: 44.46% (target 55%, gap -10.54pp) - structural improvements needed

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-18 18:50:00 +09:00

37 KiB
Raw Blame History

Performance Targetsmimalloc 追跡の"数値目標"

目的: 速さだけでなく syscall / メモリ安定性 / 長時間安定性を含めて「勝ち筋」を固定する。

運用方針Phase 38 確定)

比較基準は FAST build を正とする:

  • FAST: 純粋な性能計測gate function 定数化、診断カウンタ OFF
  • Standard: 安全・互換の基準ENV gate 有効、本線リリース用)
  • OBSERVE: 挙動観測・デバッグ(診断カウンタ ON

mimalloc との比較は FAST build で行うStandard は fixed tax を含むため公平でない)。

Current snapshot2025-12-18, Phase 69 PGO + WarmPool=16 — 現行 baseline

計測条件(再現の正):

  • Mixed: scripts/run_mixed_10_cleanenv.shITERS=20000000 WS=400
  • 10-run mean/median
  • Git: master (Phase 68 PGO, seed/WS diversified profile)
  • Baseline binary: bench_random_mixed_hakmem_minimal_pgo (Phase 68 upgraded)
  • Stability: Phase 66: 3 iterations, +3.0% mean, variance <±1% | Phase 68: 10-run, +1.19% vs Phase 66 (GO)

Note:

  • Phase 75 introduced C5/C6 inline slots and promoted them into presets. Phase 75 A/B results were recorded on the Standard binary (./bench_random_mixed_hakmem).
  • FAST PGO SSOT baselines/ratios should only be updated after re-running A/B with BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo.

hakmem Build Variants同一バイナリレイアウト

Build Mean (M ops/s) Median (M ops/s) vs mimalloc 備考
FAST v3 58.478 58.876 48.34% 旧 baselinePhase 59b rebase。性能評価の正から昇格 → Phase 66 PGO へ
FAST v3 + PGO 59.80 60.25 49.41% Phase 47: NEUTRAL (+0.27% mean, +1.02% median, research box)
FAST v3 + PGO (Phase 66) 60.89 61.35 50.32% GO: +3.0% mean (3回検証済み、安定 <±1%)。Phase 66 PGO initial baseline
FAST v3 + PGO (Phase 68) 61.614 61.924 50.93% GO: +1.19% vs Phase 66 ✓ (seed/WS diversification)
FAST v3 + PGO (Phase 69) 62.63 63.38 51.77% 強GO: +3.26% vs Phase 68 ✓✓✓ (Warm Pool Size=16, ENV-only) → 昇格済み 新 FAST baseline
FAST v3 + PGO + Phase 75 (C5+C6 ON) [Point D] 55.51 - 45.70% Phase 75-4 FAST PGO rebase (C5+C6 inline slots): +3.16% vs Point A ✓ [REBASE URGENT]
Standard 53.50 - 44.21% 安全・互換基準Phase 48 前計測、要 rebase
OBSERVE TBD - - 診断カウンタ ON

補足:

  • Phase 63: make bench_random_mixed_hakmem_fast_fixedHAKMEM_FAST_PROFILE_FIXED=1)は research buildGO 未達時は SSOT に載せない)。結果は docs/analysis/PHASE63_FAST_PROFILE_FIXED_BUILD_RESULTS.md

FAST vs Standard delta: +10.6%Standard 側は Phase 48 前計測、mimalloc baseline 変更で ratio 調整)

Phase 59b Notes:

  • Profile Change: Switched from MIXED_TINYV3_C7_BALANCED to MIXED_TINYV3_C7_SAFE (Speed-first) as canonical default
  • Rationale: Phase 57 60-min soak showed Speed-first wins on all metrics (lower CV, better tail latency)
  • Stability: CV 2.52% (hakmem) vs 0.90% (mimalloc) in Phase 59b
  • vs Phase 59: Ratio change (49.13% → 48.34%) due to mimalloc variance (+1.59%), hakmem stable
  • Recommended Profile: MIXED_TINYV3_C7_SAFE (Speed-first default)

Reference allocators別バイナリ、layout 差あり)

allocator mean (M ops/s) median (M ops/s) ratio vs mimalloc (mean) CV
mimalloc (separate) 124.82 124.71 100% 1.10%
tcmalloc (LD_PRELOAD) 115.26 115.51 92.33% 1.22%
jemalloc (LD_PRELOAD) 97.39 97.88 77.96% 1.29%
system (separate) 85.20 85.40 68.24% 1.98%
libc (same binary) 76.26 76.66 63.30% (old)

Notes:

  • Phase 59b rebase: mimalloc updated (120.466M → 120.979M, +0.43% variation)
  • 2025-12-18 Update (corrected): tcmalloc/jemalloc/system 計測完了 (10-run Random Mixed, WS=400, ITERS=20M, SEED=1)
    • tcmalloc: 115.26M ops/s (92.33% of mimalloc) ✓
    • jemalloc: 97.39M ops/s (77.96% of mimalloc)
    • system: 85.20M ops/s (68.24% of mimalloc)
    • mimalloc: 124.82M ops/s (baseline)
    • 計測スクリプト: scripts/run_allocator_quick_matrix.sh (hakmem via run_mixed_10_cleanenv.sh)
    • 修正: hakmem 計測が HAKMEM_PROFILE を明示するように修正 → SSOT レンジ復帰
  • system/mimalloc/jemalloc/tcmalloc は別バイナリ計測のため layouttext size/I-cache差分を含む reference
  • tcmalloc (LD_PRELOAD) は gperftools から install /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so
  • libc (same binary)HAKMEM_FORCE_LIBC_ALLOC=1 により、同一レイアウト上での比較の目安Phase 48 前計測)
  • mimalloc 比較は FAST build を使用することStandard の gate overhead は hakmem 固有の税)
  • 比較手順SSOT: docs/analysis/ALLOCATOR_COMPARISON_SSOT.md
  • 同一バイナリ比較layout差を最小化: scripts/run_allocator_preload_matrix.shbench_random_mixed_system 固定 + LD_PRELOAD 差し替え)
    • 注意: hakmem の SSOTbench_random_mixed_hakmem*とは経路が異なるdrop-in wrapper reference

Allocator Comparisonbench_allocators_compare.sh, small-scale reference

注意:

  • これは bench_allocators_*--scenario mixed8B..1MB の簡易混合)による small-scale reference
  • Mixed 161024B SSOTscripts/run_mixed_10_cleanenv.sh)とは 別物なので、FAST baseline/マイルストーンとは混同しない。

実行(例):

make bench
JEMALLOC_SO=/path/to/libjemalloc.so.2 \
TCMALLOC_SO=/path/to/libtcmalloc.so \
scripts/bench_allocators_compare.sh --scenario mixed --iterations 50

結果2025-12-18, mixed, iterations=50:

allocator ops/sec (M) vs mimalloc (Phase 69 ref) vs system soft_pf RSS (MB)
tcmalloc (LD_PRELOAD) 34.56 28.6% 11.2x 3,842 21.5
jemalloc (LD_PRELOAD) 24.33 20.1% 7.9x 143 3.8
hakmem (linked) 16.85 13.9% 5.4x 4,701 46.5
system (linked) 3.09 2.6% 1.0x 68,590 19.6

補足:

  • soft_pf/RSSgetrusage() 由来Linux の ru_maxrss は KB

Allocator ComparisonRandom Mixed, 10-run, WS=400, reference

注意:

  • 別バイナリ比較は layout tax が混ざる。
  • 同一バイナリ比較LD_PRELOADを優先したい場合は scripts/run_allocator_preload_matrix.sh を使う。

1) Speed相対目標

前提: FAST build で hakmem vs mimalloc を比較するStandard は gate overhead を含むため不公平)。

推奨マイルストーンMixed 161024B, FAST build

Milestone Target Current (2025-12-18, corrected) Status
M1 mimalloc の 50% 44.46% 🟡 未達 (PROFILE 修正後の計測)
M2 mimalloc の 55% 44.46% 🔴 未達 (Gap: -10.54pp)
M3 mimalloc の 60% - 🔴 未達(構造改造必要)
M4 mimalloc の 6570% - 🔴 未達(構造改造必要)

現状: hakmem (FAST PGO) (2025-12-18) = 55.53M ops/s = mimalloc の 44.46%Random Mixed, WS=400, ITERS=20M, 10-run

⚠️ 重要: Phase 69 baseline (62.63M = 51.77%) は古い計測条件の可能性。PROFILE 明示修正後の新 baseline は 44.46%M1 未達)。

Phase 68 PGO 昇格Phase 66 → Phase 68 upgrade:

  • Phase 66 baseline: 60.89M ops/s = 50.32% (+3.0% mean, 3-run stable)
  • Phase 68 baseline: 61.614M ops/s = 50.93% (+1.19% vs Phase 66, 10-run verified)
  • Profile change: seed/WS diversification (WS 3種 → 5種, seed 1種 → 3種)
  • M1 (50%) achievement: EXCEEDED (+0.93pp above target, vs +0.32pp in Phase 66)

M1 Achievement Analysis:

  • Phase 66: Gap to 50%: +0.32% (EXCEEDED target, first time above 50%)
  • Phase 68: Gap to 50%: +0.93% (further improved via seed/WS diversification)
  • Production perspective: 50.93% vs 50.00% is robustly statistically achieved
  • Stability advantage: Phase 66 (3-run <±1%) → Phase 68 (10-run +1.19%, improved reproducibility)
  • Verdict: M1 EXCEEDED (+0.93pp), M2 (55%) に向けて次フェーズ検討

Phase 68 Benefits Over Phase 66:

  • Reduced PGO overfitting via seed/WS diversification
  • +1.19% improvement from better profile representation
  • More representative of production workload variance
  • Higher confidence in baseline stability

Phase 69 PGO 昇格Phase 68 → Phase 69 upgrade:

  • Phase 68 baseline: 61.614M ops/s = 50.93% (+1.19% vs Phase 66, 10-run verified)
  • Phase 69 baseline: 62.63M ops/s = 51.77% (+3.26% vs Phase 68, 10-run verified)
  • Parameter change: Warm Pool Size 12 → 16 (ENV-only, zero code changes)
  • M1 (50%) achievement: EXCEEDED (+1.77pp above target, vs +0.93pp in Phase 68)
  • M2 (55%) progress: Gap reduced to +3.23pp (from +4.07pp in Phase 68)

Phase 69 Benefits Over Phase 68:

  • +3.26% improvement from warm pool optimization (強GO threshold exceeded)
  • ENV-only change (zero layout tax risk, fully reversible)
  • Reduced registry O(N) scan overhead via larger warm pool
  • Non-additive with other optimizations (Warm Pool Size=16 alone is optimal)
  • Single strongest parameter improvement in refill tuning sweep

Phase 69 Implementation:

  • Warm Pool Size: 12 → 16 SuperSlabs/class
  • ENV variable: HAKMEM_WARM_POOL_SIZE=16 (default in MIXED_TINYV3_C7_SAFE preset)
  • Rollback: Set HAKMEM_WARM_POOL_SIZE=12 or remove ENV variable
  • Results: docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md

Phase 75-4: FAST PGO Rebase (C5+C6 Inline Slots Validation) — CRITICAL FINDING

Phase 75-3 validated C5+C6 inline slots optimization on Standard binary (+5.41%). Phase 75-4 rebased this onto FAST PGO baseline to update SSOT:

4-Point Matrix (FAST PGO, Mixed SSOT):

Point Config Throughput Delta vs A
A C5=0, C6=0 53.81 M ops/s baseline
B C5=1, C6=0 53.03 M ops/s -1.45%
C C5=0, C6=1 54.17 M ops/s +0.67%
D C5=1, C6=1 55.51 M ops/s +3.16%

Decision: GO (Point D exceeds +3.0% ideal threshold by +0.16%)

⚠️ CRITICAL FINDING: PGO Profile Staleness

  • Phase 69 FAST baseline: 62.63 M ops/s
  • Phase 75-4 Point A (FAST PGO baseline): 53.81 M ops/s
  • Regression: -14.09% (not explained by Phase 75 additions)
  • Root cause hypothesis: PGO profile trained pre-Phase 69 (likely Phase 68 or earlier) with C5=0, C6=0 configuration
  • Impact: FAST PGO captures only 58.4% of Standard's +5.41% gain (3.16% vs 5.41%)

Recommended Actions (Priority Order):

  1. IMMEDIATE - UPDATE SSOT: Phase 75 C5+C6 inline slots confirmed working (+3.16% on FAST PGO)

    • Promote to core/bench_profile.h (already done for Standard, now FAST PGO validated)
    • Update this scorecard: Phase 75 baseline = 55.51 M ops/s (Point D, with C5+C6 ON)
  2. HIGH PRIORITY - PHASE 75-5 (PGO Profile Regeneration)

    • Regenerate PGO profile with C5=1, C6=1 training configuration
    • Expected gain: unknown (likely positive if the training profile matches the actual hot path, but not guaranteed)
    • Estimated recovery: treat any number as a hypothesis until re-measured (do not assume a return to Phase 69 levels)
    • Root cause analysis: Investigate 14% gap vs Phase 69 (layout, code bloat, or profile mismatch)

Documentation:

  • Phase 75-4 results: docs/analysis/PHASE75_4_FAST_PGO_REBASE_RESULTS.md
  • Next: Phase 75-5 (PGO regeneration) required before next optimization phase

Impact on M2 Milestone:

  • Phase 69 FAST baseline: 62.63 M ops/s (51.77% of mimalloc, +3.23pp to M2)
  • Phase 75-4 Point A (baseline): 53.81 M ops/s (44.35% of mimalloc, +10.65pp to M2)
  • Phase 75-4 Point D (C5+C6): 55.51 M ops/s (45.70% of mimalloc, +9.30pp to M2)
  • Status: Phase 75 optimization proven, but PGO profile regression masks true progress

※注意: mimalloc/system/jemalloc の参照値は環境ドリフトでズレるため、定期的に再ベースラインする。

  • Phase 48 完了: docs/analysis/PHASE48_REBASE_ALLOCATORS_AND_STABILITY_SUITE_RESULTS.md
  • Phase 59 完了: docs/analysis/PHASE59_50PERCENT_RECOVERY_BASELINE_REBASE_RESULTS.md

2) Syscall budgetOS churn

Tiny hot path の理想:

  • steady-statewarmup 後)で mmap/munmap/madvise = 0(または "ほぼ 0"

目安(許容):

  • mmap+munmap+madvise 合計が 1e8 ops あたり 1 回以下= 1e-8 / op

Current (Phase 48 rebase):

  • HAKMEM_SS_OS_STATS=1Mixed, iters=200000000 ws=400:
    • [SS_OS_STATS] alloc=9 free=10 madvise=9 madvise_enomem=0 madvise_other=0 madvise_disabled=0 mmap_total=9 fallback_mmap=0 huge_alloc=0 huge_fail=0
    • Total syscalls (mmap+madvise): 18 / 200M ops = 9e-8 / op
    • Status: EXCELLENT (within 10x of ideal, NO steady-state churn)

観測方法(どちらか):

  • 内部: HAKMEM_SS_OS_STATS=1[SS_OS_STATS]madvise/disabled 等)
  • 外部: perf stat の syscall events か strace -c(短い実行で回数だけ見る)

Phase 48 confirmation:

  • warmup 後に mmap/madvise が増え続けていないstable
  • mimalloc に対する「速さ以外の勝ち筋」の 1 つを数値で確認

3) Memory stabilityRSS / fragmentation

最低条件Mixed / ws 固定の soak

  • RSS が 時間とともに単調増加しない
  • 1時間の soak で RSS drift が +5% 以内(目安)

Current (Phase 51 - 5min single-process soak):

Allocator First RSS (MB) Last RSS (MB) Peak RSS (MB) RSS Drift Status
hakmem FAST 32.88 32.88 32.88 +0.00% EXCELLENT
mimalloc 1.88 1.88 1.88 +0.00% EXCELLENT
system malloc 1.88 1.88 1.88 +0.00% EXCELLENT

Phase 51 details (single-process soak):

  • Test duration: 5 minutes (300 seconds)
  • Epoch size: 5 seconds
  • Samples: 60 epochs per allocator
  • Results: docs/analysis/PHASE51_SINGLE_PROCESS_SOAK_AND_TAIL_PLAN_RESULTS.md
  • Script: scripts/soak_mixed_single_process.sh
  • All allocators show ZERO drift - excellent memory discipline
  • Note: hakmem's higher base RSS (33 MB vs 2 MB) is a design trade-off (Phase 53 triage)
  • Key difference from Phase 50: Single process with persistent allocator state (simulates long-running servers)
  • Optional: Memory-Lean modeopt-in, Phase 54で RSS <10MB を狙う場合は Phase 55 の検証マトリクスを正とする:
    • docs/analysis/PHASE55_MEMORY_LEAN_MODE_VALIDATION_MATRIX.md

Balanced modePhase 55, LEAN+OFF:

  • HAKMEM_SS_MEM_LEAN=1 + HAKMEM_SS_MEM_LEAN_DECOMMIT=OFF
  • 効果: RSS は下がらない≈33MB のまま一方で、prewarm 抑制により throughput/stability が微改善し得る
  • 次: docs/analysis/PHASE56_PROMOTE_LEAN_OFF_PREWARM_SUPPRESSION_NEXT_INSTRUCTIONS.md

Phase 53 RSS Tax Triage:

Component Memory (MB) % of Total Source
Tiny metadata 0.04 0.1% TLS caches, warm pool, page box
SuperSlab backend ~20-25 60-75% Persistent slabs for fast allocation
Benchmark working set ~5-8 15-25% Live objects (WS=400)
OS overhead ~2-5 6-15% Page tables, heap metadata
Total RSS 32.88 100% Measured peak

Root Cause (Phase 53):

  • NOT bench warmup: RSS unchanged by prefault setting (32.88 MB → 33.12 MB)
  • IS allocator design: Speed-first strategy with persistent superslabs
  • Trade-off: +10x syscall efficiency, -17x memory efficiency vs mimalloc
  • Verdict: ACCEPTABLE for speed-first strategy (documented design choice)

Results: docs/analysis/PHASE53_RSS_TAX_TRIAGE_RESULTS.md

RSS Tax Target:

  • Current: 32.88 MB (FAST build, speed-first)
  • Target: <35 MB (maintain speed-first design)
  • Alternative: <10 MB (if memory-lean mode implemented, Phase 54+)
  • Status: ACCEPTABLE (documented trade-off, zero drift, predictable)

Phase 55: Memory-Lean Mode (PRODUCTION-READY):

Memory-Lean mode provides opt-in memory control without performance penalty. Winner: LEAN+OFF (prewarm suppression only).

Mode Config Throughput vs Baseline RSS (MB) Syscalls/op Status
Speed-first (default) LEAN=0 baseline (56.2M ops/s) 32.75 1e-8 Production
Balanced (opt-in) LEAN=1 DECOMMIT=OFF +1.2% (56.8M ops/s) 32.88 1.25e-7 Production

Key Results (30-min test, WS=400):

  • Throughput: +1.2% faster than baseline (56.8M vs 56.2M ops/s)
  • RSS: 32.88 MB (stable, 0% drift)
  • Stability: CV 5.41% (better than baseline 5.52%)
  • Syscalls: 1.25e-7/op (8x under budget <1e-6/op)
  • No decommit overhead: Prewarm suppression only, zero syscall tax

Use Cases:

  • Speed-first (default): HAKMEM_SS_MEM_LEAN=0 (full prewarm enabled)
  • Balanced (opt-in): HAKMEM_SS_MEM_LEAN=1 HAKMEM_SS_MEM_LEAN_DECOMMIT=OFF (prewarm suppression only)

Why LEAN+OFF is production-ready:

  1. Faster than baseline (+1.2%, no compromise)
  2. Zero decommit syscall overhead (lean_decommit=0)
  3. Perfect RSS stability (0% drift, better CV than baseline)
  4. Simplest lean mode (no policy complexity)
  5. Opt-in safety (LEAN=0 disables all lean behavior)

Results: docs/analysis/PHASE55_MEMORY_LEAN_MODE_VALIDATION_MATRIX.md

Phase 56: Promote LEAN+OFF as "Balanced Mode" (DEFAULT):

Phase 56 promotes LEAN+OFF as the production-recommended "Balanced mode" by setting it as the default in MIXED_TINYV3_C7_SAFE benchmark profile. Phase 57 later shows Speed-first wins on 60-min + tail; default handling is revisited in Phase 58 (profile split).

Profile Comparison (10-run validation, Phase 56):

Profile Config Mean (M ops/s) CV RSS (MB) Syscalls/op Use Case
Speed-first LEAN=0 59.12 (Phase 55) 0.48% 33.00 5.00e-08 Latency-critical, full prewarm
Balanced LEAN=1 DECOMMIT=OFF 59.84 (FAST), 60.48 (Standard) 2.21% (FAST), 0.81% (Standard) ~30 MB 5.00e-08 Prewarm suppression only

Phase 56 Validation Results (10-run):

  • FAST build: 59.84 M ops/s (mean), 60.36 M ops/s (median), CV 2.21%
  • Standard build: 60.48 M ops/s (mean), 60.66 M ops/s (median), CV 0.81%
  • vs Phase 55 baseline: +1.2% throughput gain confirmed (59.84 / 59.12 = 1.012)
  • Syscalls: Zero overhead (5.00e-08/op, identical to baseline)

Implementation:

  • Phase 56 added LEAN+OFF defaults to MIXED_TINYV3_C7_SAFE (historical).
  • Phase 58 split presets: MIXED_TINYV3_C7_SAFE (Speed-first) + MIXED_TINYV3_C7_BALANCED (LEAN+OFF).

Verdict: GO (production-ready) — Balanced mode is faster, more stable, and has zero syscall overhead vs Speed-first.

Rollback: Remove 3 lines from core/bench_profile.h or set HAKMEM_SS_MEM_LEAN=0 at runtime.

Results: docs/analysis/PHASE56_PROMOTE_LEAN_OFF_RESULTS.md Implementation: docs/analysis/PHASE56_PROMOTE_LEAN_OFF_IMPLEMENTATION.md

Phase 57: Balanced Mode 60-min Soak + Syscalls (FINAL VALIDATION):

Phase 57 performed final validation of Balanced mode with 60-minute soak tests, high-resolution tail proxy, and syscall budget verification.

60-min Soak Results (DURATION_SEC=3600, EPOCH_SEC=10, 360 epochs):

Mode Mean TP (M ops/s) CV RSS (MB) RSS Drift Syscalls/op Status
Balanced 58.93 5.38% 33.00 0.00% 1.25e-7 Production
Speed-first 60.74 1.58% 32.75 0.00% 1.25e-7 Production

Key Results:

  • RSS Drift: 0.00% for both modes (perfect stability over 60 minutes)
  • Throughput Drift: 0.00% for both modes (no degradation)
  • CV (60-min): Balanced 5.38%, Speed-first 1.58% (both acceptable for production)
  • Syscalls: Identical budget (1.25e-7/op, 800× below <1e-6 target)
  • DSO guard: Active in both modes (madvise_disabled=1, correct)

10-min Tail Proxy Results (DURATION_SEC=600, EPOCH_SEC=1, 600 epochs):

Mode Mean TP (M ops/s) CV p99 Latency (ns/op) p99.9 Latency (ns/op)
Balanced 53.11 2.18% 20.78 21.24
Speed-first 53.62 0.71% 19.14 19.35

Tail Analysis:

  • Balanced: CV 2.18% (excellent for production), p99 +8.6% higher latency
  • Speed-first: CV 0.71% (exceptional stability), lower tail latency
  • Both: Zero RSS drift, no performance degradation

Syscall Budget (200M ops, HAKMEM_SS_OS_STATS=1):

Mode Total syscalls Syscalls/op madvise_disabled lean_decommit
Balanced 25 1.25e-7 1 (DSO guard active) 0 (not triggered)
Speed-first 25 1.25e-7 1 (DSO guard active) 0 (not triggered)

Observations:

  • Identical syscall behavior across modes
  • No runaway madvise/mmap (stable counts)
  • lean_decommit=0: LEAN policy not triggered in WS=400 workload (expected)
  • DSO guard functioning correctly in both modes

Trade-off Summary:

Balanced vs Speed-first:

  • Throughput: -3.0% (60-min mean: 58.93M vs 60.74M ops/s)
  • Latency p99: +8.6% (10-min: 20.78 vs 19.14 ns/op)
  • Stability: +3.8pp CV (60-min: 5.38% vs 1.58%)
  • Memory: +0.76% RSS (33.00 vs 32.75 MB)
  • Syscalls: Identical (1.25e-7/op)

Verdict: GO (production-ready) — Both modes stable, zero drift, user choice preserved.

Use Cases:

  • Speed-first (default): HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
  • Balanced (opt-in): HAKMEM_PROFILE=MIXED_TINYV3_C7_BALANCED (sets LEAN=1 DECOMMIT=OFF)

Phase 58: Profile Split (Speed-first default + Balanced opt-in):

  • MIXED_TINYV3_C7_SAFE: Speed-first default (does not set HAKMEM_SS_MEM_LEAN)
  • MIXED_TINYV3_C7_BALANCED: Balanced opt-in preset (sets LEAN=1 DECOMMIT=OFF)

Results: docs/analysis/PHASE57_BALANCED_MODE_60MIN_SOAK_AND_SYSCALLS_RESULTS.md

Phase 50 details (multi-process soak):

  • Test duration: 5 minutes (300 seconds)
  • Step size: 20M operations per sample
  • Samples: hakmem=742, mimalloc=1523, system=1093
  • Results: docs/analysis/PHASE50_OPERATIONAL_EDGE_STABILITY_SUITE_RESULTS.md
  • Script: scripts/soak_mixed_rss.sh
  • All allocators show ZERO drift - excellent memory discipline
  • Key difference from Phase 51: Separate process per sample (simulates batch jobs)

Tools:

# 5-min soak (Phase 50 - quick validation)
BENCH_BIN=./bench_random_mixed_hakmem_minimal \
  DURATION_SEC=300 STEP_ITERS=20000000 WS=400 \
  scripts/soak_mixed_rss.sh > soak_fast_5min.csv

# Analysis (CSV to metrics)
python3 analyze_soak.py  # Calculates drift/CV/peak RSS

Target:

  • RSS drift: < +5% (5-min soak: PASS, 60-min: TBD)
  • Throughput drift: > -5% (5-min soak: PASS, 60-min: TBD)

Next steps (Phase 51+):

  • Extend to 30-60 min soak for long-term validation
  • Compare mimalloc RSS behavior (currently only hakmem measured)

4) Long-run stability性能・一貫性

最低条件:

  • 3060 分の soak で ops/s が -5% 以上落ちない
  • CV変動係数~12% に収まる(現状の運用と整合)

Current (Phase 51 - 5min single-process soak):

Allocator Mean TP (M ops/s) First 5 avg Last 5 avg TP Drift CV Status
hakmem FAST 59.95 59.45 60.17 +1.20% 0.50% EXCELLENT
mimalloc 122.38 122.61 122.03 -0.47% 0.39% EXCELLENT
system malloc 85.31 84.99 85.32 +0.38% 0.42% EXCELLENT

Phase 51 details (single-process soak):

  • All allocators show minimal drift (<1.5%) - highly stable performance
  • CV values are exceptional (0.39%-0.50%) - 3-5× better than Phase 50 multi-process
  • hakmem CV: 0.50% - best stability in single-process mode, 3× better than Phase 50
  • No performance degradation over 5 minutes
  • Results: docs/analysis/PHASE51_SINGLE_PROCESS_SOAK_AND_TAIL_PLAN_RESULTS.md
  • Script: scripts/soak_mixed_single_process.sh (epoch-based, persistent allocator state)
  • Key improvement: Single-process mode eliminates cold-start variance (superior for long-run stability measurement)

Phase 50 details (multi-process soak):

  • All allocators show positive drift (+0.8% to +0.9%) - likely CPU warmup effect
  • CV values are good (1.5%-2.1%) - consistent but higher due to cold-start variance
  • hakmem CV (1.49%) slightly better than mimalloc (1.60%)
  • Results: docs/analysis/PHASE50_OPERATIONAL_EDGE_STABILITY_SUITE_RESULTS.md
  • Script: scripts/soak_mixed_rss.sh (separate process per sample)

Comparison to short-run (Phase 48 rebase):

  • Mixed 10-run: CV = 1.22%mean 59.15M / min 58.12M / max 60.02M
  • 5-min multi-process soak (Phase 50): CV = 1.49%mean 59.65M
  • 5-min single-process soak (Phase 51): CV = 0.50%mean 59.95M
  • Consistency: Single-process soak provides best stability measurement (3× lower CV)

Tools:

# Run 5-min soak
BENCH_BIN=./bench_random_mixed_hakmem_minimal \
  DURATION_SEC=300 STEP_ITERS=20000000 WS=400 \
  scripts/soak_mixed_rss.sh > soak_fast_5min.csv

# Analyze with Python
python3 analyze_soak.py  # Calculates mean, drift, CV automatically

Target:

  • Throughput drift: > -5% (5-min: PASS +0.94%, 60-min: TBD)
  • CV: < 2% (5-min: PASS 1.49%, 60-min: TBD)

Next steps (Phase 51+):

  • Extend to 30-60 min soak for long-term validation
  • Confirm no monotonic drift (throughput should not decay over time)

5) Tail Latencyp99/p999

Status: COMPLETE - Phase 52 (Throughput Proxy Method)

Objective: Measure tail latency using epoch throughput distribution as a proxy

Method: Use 1-second epoch throughput variance as a proxy for per-operation latency distribution

  • Rationale: Epochs with lower throughput indicate periods of higher latency
  • Advantage: Zero observer effect, measurement-only approach
  • Implementation: 5-minute soak with 1-second epochs, calculate percentiles
    • Note: Throughput tail is the low side (p1/p0.1). Latency percentiles must be computed from per-epoch latency values (not inverted percentiles).
    • Tool: scripts/analyze_epoch_tail_csv.py

Current Results (Phase 52 - Tail Latency Proxy):

Throughput Distribution (ops/sec)

Metric hakmem FAST mimalloc system malloc
p50 47,887,721 98,738,326 69,562,115
p90 58,629,195 99,580,629 69,931,575
p99 59,174,766 110,702,822 70,165,415
p999 59,567,912 111,190,037 70,308,452
Mean 50,174,657 99,084,977 69,447,599
Std Dev 4,461,290 2,455,894 522,021

Latency Proxy (ns/op)

Calculated as 1 / throughput * 1e9:

Metric hakmem FAST mimalloc system malloc
p50 20.88 ns 10.13 ns 14.38 ns
p90 21.12 ns 10.24 ns 14.50 ns
p99 21.33 ns 10.43 ns 14.80 ns
p999 21.57 ns 10.47 ns 15.07 ns

Tail Consistency Metrics

Standard Deviation as % of Mean (lower = more consistent):

  • hakmem FAST: 7.98% (highest variability)
  • mimalloc: 2.28% (good consistency)
  • system malloc: 0.77% (best consistency)

p99/p50 Ratio (lower = better tail):

  • hakmem FAST: 1.024 (2.4% tail slowdown)
  • mimalloc: 1.030 (3.0% tail slowdown)
  • system malloc: 1.029 (2.9% tail slowdown)

p999/p50 Ratio:

  • hakmem FAST: 1.033 (3.3% tail slowdown)
  • mimalloc: 1.034 (3.4% tail slowdown)
  • system malloc: 1.048 (4.8% tail slowdown)

Analysis

Key Findings:

  1. hakmem has highest throughput variance: 4.46M ops/sec std dev (7.98% of mean)
    • 2× worse than mimalloc (2.28%)
    • 10× worse than system malloc (0.77%)
  2. mimalloc has best absolute performance AND good tail behavior:
    • 2× faster than hakmem at all percentiles
    • Moderate variance (2.28% std dev)
  3. system malloc has rock-solid consistency:
    • Lowest variance (0.77% std dev)
    • Very tight p99/p999 spread
  4. hakmem's tail problem is variance, not worst-case:
    • Absolute p99 latency (21.33 ns) is reasonable
    • But 2-3× higher variance than competitors
    • Suggests optimization opportunities in cache warmth, metadata layout

Test Configuration:

  • Duration: 5 minutes (300 seconds)
  • Epoch length: 1 second
  • Workload: Mixed (WS=400)
  • Process model: Single process (persistent allocator state)
  • Script: scripts/soak_mixed_single_process.sh
  • Results: docs/analysis/PHASE52_TAIL_LATENCY_PROXY_RESULTS.md

Target:

  • Std dev as % of mean: < 3% (Current: 7.98%, Goal: match mimalloc's 2.28%)
  • p99/p50 ratio: < 1.05 (Current: 1.024, Status: GOOD)
  • Priority: Reduce variance rather than chasing p999 specifically

Next steps:

  • Phase 53: RSS Tax Triage (understand memory overhead sources)
  • Future phases: Target variance reduction (TLS cache optimization, metadata locality)

6) 判定ルール(運用)

  • runtime 変更ENVのみ: GO 閾値 +1.0%Mixed 10-run mean
  • build-level 変更compile-out 系): GO 閾値 +0.5%layout の揺れを考慮)

6) Build VariantsFAST / Standard / OBSERVE— Phase 38 運用

3種類のビルド

Build Binary 目的 特徴
FAST bench_random_mixed_hakmem_minimal 純粋な性能計測 gate function 定数化、診断 OFF
Standard bench_random_mixed_hakmem 安全・互換基準 ENV gate 有効、本線リリース用
OBSERVE bench_random_mixed_hakmem_observe 挙動観測 診断カウンタ ON、perf 分析用

運用ルールPhase 38 確定)

  1. 性能評価は FAST build で行うmimalloc 比較の正)
  2. Standard は安全基準gate overhead は許容、本線機能の互換性優先)
  3. OBSERVE はデバッグ用(性能評価には使わない、診断出力あり)

FAST build 履歴

Version Mean (ops/s) Delta 変更内容
FAST v1 54,557,938 baseline Phase 35-A: gate function 定数化
FAST v2 54,943,734 +0.71% Phase 36: policy snapshot init-once
FAST v3 56,040,000 +1.98% Phase 39: hot path gate 定数化

FAST v3 で定数化されたもの:

  • tiny_front_v3_enabled() → 常に true
  • tiny_metadata_cache_enabled() → 常に 0
  • small_policy_v7_snapshot() → version check スキップ、init-once TLS cache
  • learner_v7_enabled() → 常に false
  • small_learner_v2_enabled() → 常に false
  • front_gate_unified_enabled() → 常に 1Phase 39
  • alloc_dualhot_enabled() → 常に 0Phase 39
  • g_bench_fast_front block → compile-outPhase 39
  • g_v3_enabled block → compile-outPhase 39
  • free_dispatch_stats_enabled() → 常に falsePhase 39

使い方Phase 38 ワークフロー)

推奨: 自動化ターゲットを使用

# FAST 10-run 性能評価mimalloc 比較の正)
make perf_fast

# OBSERVE health checksyscall/診断確認)
make perf_observe

# 両方実行
make perf_all

手動実行(個別制御が必要な場合)

# FAST build のみビルド
make bench_random_mixed_hakmem_minimal

# Standard build のみビルド
make bench_random_mixed_hakmem

# OBSERVE build のみビルド
make bench_random_mixed_hakmem_observe

# 10-run 実行(任意の binary で)
scripts/run_mixed_10_cleanenv.sh

Phase 37 教訓Standard 最適化の限界)

Standard build を速くする試みTLS cacheは NO-GO (-0.07%):

  • Runtime gate (lazy-init) は必ず overhead を持つ
  • Compile-time constant (BENCH_MINIMAL) が唯一の解
  • 結論: Standard は安全基準として維持、性能は FAST で評価

Phase 39 実施済みFAST v3

以下の gate function は Phase 39 で定数化済み:

malloc path実施済み:

Gate File FAST v3 値 Status
front_gate_unified_enabled() malloc_tiny_fast.h 固定 1 GO
alloc_dualhot_enabled() malloc_tiny_fast.h 固定 0 GO

free path実施済み:

Gate File FAST v3 値 Status
g_bench_fast_front hak_free_api.inc.h compile-out GO
g_v3_enabled hak_free_api.inc.h compile-out GO
g_free_dispatch_ssot hak_free_api.inc.h lazy-init 維持 保留

stats実施済み:

Gate File FAST v3 値 Status
free_dispatch_stats_enabled() free_dispatch_stats_box.h 固定 false GO

Phase 39 結果: +1.98%GO

Phase 47: FAST+PGO research boxNEUTRAL, 保留)

Phase 47 で compile-time fixed front config (HAKMEM_TINY_FRONT_PGO=1) を試験:

結果:

  • Mean: +0.27%(閾値 +0.5% 未達)
  • Median: +1.02%positive signal
  • 判定: NEUTRAL研究ボックスとして保持、FAST 標準には採用せず)

理由:

  • Mean が GO 閾値(+0.5%)を下回る
  • Treatment 分散が 2× baselinelayout tax の兆候)
  • Median は positive だが、mean との乖離が大きい

Research box として保持:

  • Makefile ターゲット: bench_random_mixed_hakmem_fast_pgo
  • 将来的に他の最適化と組み合わせる可能性を残す
  • 詳細: docs/analysis/PHASE47_FAST_FRONT_PGO_MODE_RESULTS.md

Phase 60: Alloc Pass-Down SSOT (NO-GO, research box)

Phase 60 implemented a Single Source of Truth (SSOT) pattern for the allocation path, computing ENV snapshot, route kind, C7 ULTRA, and DUALHOT flags once at the entry point and passing them down.

A/B Test Results (Mixed 10-run):

  • Baseline (SSOT=0): 60.05M ops/s (CV: 1.00%)
  • Treatment (SSOT=1): 59.77M ops/s (CV: 1.55%)
  • Delta: -0.46% (NO-GO)

Root Cause:

  1. Added branch check if (alloc_passdown_ssot_enabled()) overhead
  2. Original path already has early exits (C7 ULTRA, DUALHOT) that avoid expensive computations
  3. SSOT forces upfront computation, negating the benefit of early exits
  4. Struct pass-down introduces ABI overhead (register pressure, stack spills)

Comparison with Free-Side Phase 19-6C:

  • Free-side SSOT: +1.5% (GO) - many redundant computations across multiple paths
  • Alloc-side SSOT: -0.46% (NO-GO) - efficient early exits already in place

Kept as Research Box:

  • ENV gate: HAKMEM_ALLOC_PASSDOWN_SSOT=0 (default OFF)
  • Files: core/box/alloc_passdown_ssot_env_box.h, core/front/malloc_tiny_fast.h
  • Rollback: Build without -DHAKMEM_ALLOC_PASSDOWN_SSOT=1

Lessons Learned:

  • SSOT pattern works when there are many redundant computations (Free-side)
  • SSOT fails when the original path has efficient early exits (Alloc-side)
  • Even a single branch check can introduce measurable overhead in hot paths
  • Upfront computation negates the benefits of lazy evaluation

Documentation:

  • Design: docs/analysis/PHASE60_ALLOC_PASSDOWN_SSOT_DESIGN_AND_INSTRUCTIONS.md
  • Results: docs/analysis/PHASE60_ALLOC_PASSDOWN_SSOT_RESULTS.md
  • Implementation: docs/analysis/PHASE60_ALLOC_PASSDOWN_SSOT_IMPLEMENTATION.md

Next Steps:

  • Focus on Top 50 hot functions: tiny_region_id_write_header (3.50%), unified_cache_push (1.21%)
  • Investigate branch reduction in hot paths
  • Consider PGO or direct dispatch for common class indices

Phase 61: C7 ULTRA Header-Light (NEUTRAL, research box)

Phase 61 tested skipping header write in C7 ULTRA alloc hit path to reduce instruction count.

A/B Test Results (Mixed 10-run, Speed-first):

  • Baseline (HEADER_LIGHT=0): 59.54M ops/s (CV: 1.53%)
  • Treatment (HEADER_LIGHT=1): 59.73M ops/s (CV: 2.66%)
  • Delta: +0.31% (NEUTRAL)

Runtime Profiling (perf record):

  • tiny_region_id_write_header: 2.32% (hotspot confirmed)
  • tiny_c7_ultra_alloc: 1.90% (in top 10)
  • Combined target overhead: ~4.22%

Root Cause of Low Gain:

  1. Header write is smaller hotspot than expected (2.32% vs 4.56% in Phase 42)
  2. Mixed workload dilutes C7-specific optimizations
  3. Treatment has higher variance (CV 2.66% vs 1.53%)
  4. Header-light mode adds branch in hot path (if (header_light))
  5. Refill phase still writes headers (cold path overhead)

Implementation Status:

  • Pre-existing implementation discovered during analysis
  • ENV gate: HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT=0 (default OFF)
  • Location: core/tiny_c7_ultra.c:39-51, core/box/tiny_front_v3_env_box.h:145-152
  • Rollback: ENV gate already OFF by default (safe)

Kept as Research Box:

  • Available for future C7-heavy workloads (>50% C7 allocations)
  • May combine with other C7 optimizations (batch refill, SIMD header write)
  • Requires IPC/cache-miss profiling (not just cycle count)

Documentation:

  • Results: docs/analysis/PHASE61_C7_ULTRA_HEADER_LIGHT_RESULTS.md
  • Implementation: docs/analysis/PHASE61_C7_ULTRA_HEADER_LIGHT_IMPLEMENTATION.md

Lessons Learned:

  • Micro-optimizations need precise profiling (IPC, cache misses, not just cycles)
  • Mixed workload may not show benefits of class-specific optimizations
  • Instruction count reduction doesn't always translate to performance gain
  • Higher variance (CV) suggests instability or additional noise