Performance Targets (numeric goals for tracking mimalloc)
Purpose: lock in the winning formula not just for speed, but also for syscalls / memory stability / long-run stability.
Operating policy (finalized in Phase 38)
The FAST build is the canonical basis for comparison:
- FAST: pure performance measurement (gate functions constant-folded, diagnostic counters OFF)
- Standard: safety/compatibility baseline (ENV gates enabled, for mainline releases)
- OBSERVE: behavior observation and debugging (diagnostic counters ON)
Comparisons against mimalloc use the FAST build (Standard carries a fixed tax, so the comparison would be unfair).
Current snapshot (2025-12-18, Phase 69 PGO + WarmPool=16 — current baseline)
Measurement conditions (canonical for reproduction):
- Mixed: scripts/run_mixed_10_cleanenv.sh (ITERS=20000000 WS=400), 10-run mean/median
- Git: master (Phase 68 PGO, seed/WS diversified profile)
- Baseline binary: bench_random_mixed_hakmem_minimal_pgo (Phase 68 upgraded)
- Stability: Phase 66: 3 iterations, +3.0% mean, variance <±1% | Phase 68: 10-run, +1.19% vs Phase 66 (GO)
hakmem Build Variants (same binary layout)
| Build | Mean (M ops/s) | Median (M ops/s) | vs mimalloc | Notes |
|---|---|---|---|---|
| FAST v3 | 58.478 | 58.876 | 48.34% | Former baseline (Phase 59b rebase); superseded as the canonical performance reference by Phase 66 PGO |
| FAST v3 + PGO | 59.80 | 60.25 | 49.41% | Phase 47: NEUTRAL (+0.27% mean, +1.02% median, research box) |
| FAST v3 + PGO (Phase 66) | 60.89 | 61.35 | 50.32% | GO: +3.0% mean (verified 3 times, stable <±1%). Phase 66 PGO initial baseline |
| FAST v3 + PGO (Phase 68) | 61.614 | 61.924 | 50.93% | GO: +1.19% vs Phase 66 ✓ (seed/WS diversification) |
| FAST v3 + PGO (Phase 69) | 62.63 | 63.38 | 51.77% | Strong GO: +3.26% vs Phase 68 ✓✓✓ (Warm Pool Size=16, ENV-only) → promoted as the new FAST baseline ✓ |
| Standard | 53.50 | - | 44.21% | Safety/compatibility baseline (measured before Phase 48; needs rebase) |
| OBSERVE | TBD | - | - | Diagnostic counters ON |
Notes:
- Phase 63: make bench_random_mixed_hakmem_fast_fixed (HAKMEM_FAST_PROFILE_FIXED=1) is a research build (not recorded in the SSOT unless it reaches GO). Results: docs/analysis/PHASE63_FAST_PROFILE_FIXED_BUILD_RESULTS.md
- FAST vs Standard delta: +10.6% (Standard side measured before Phase 48; ratio adjusted after the mimalloc baseline change)
Phase 59b Notes:
- Profile Change: Switched from MIXED_TINYV3_C7_BALANCED to MIXED_TINYV3_C7_SAFE (Speed-first) as the canonical default
- Rationale: Phase 57 60-min soak showed Speed-first wins on all metrics (lower CV, better tail latency)
- Stability: CV 2.52% (hakmem) vs 0.90% (mimalloc) in Phase 59b
- vs Phase 59: Ratio change (49.13% → 48.34%) due to mimalloc variance (+1.59%), hakmem stable
- Recommended Profile: MIXED_TINYV3_C7_SAFE (Speed-first default)
Reference allocators (separate binaries; layout differences apply)
| allocator | mean (M ops/s) | median (M ops/s) | ratio vs mimalloc (mean) | CV |
|---|---|---|---|---|
| mimalloc (separate) | 120.979 | 120.967 | 100% | 0.90% |
| jemalloc (LD_PRELOAD) | 96.06 | 97.00 | 79.73% | 2.93% |
| system (separate) | 85.10 | 85.24 | 70.65% | 1.01% |
| libc (same binary) | 76.26 | 76.66 | 63.30% | (old) |
Notes:
- Phase 59b rebase: mimalloc updated (120.466M → 120.979M, +0.43% variation)
- system/mimalloc/jemalloc are measured as separate binaries, so these references include layout (text size / I-cache) differences
- libc (same binary) runs with HAKMEM_FORCE_LIBC_ALLOC=1, giving a rough same-layout comparison (measured before Phase 48)
- Use the FAST build for mimalloc comparisons (Standard's gate overhead is a hakmem-specific tax)
- jemalloc first measurement: 79.73% of mimalloc (Phase 59 baseline; a strong competitor, 9% faster than system)
1) Speed (relative targets)
Premise: compare hakmem vs mimalloc on the FAST build (Standard includes gate overhead, which would be unfair).
Recommended milestones (Mixed 16–1024B, FAST build):
| Milestone | Target | Current (FAST v3 + PGO Phase 69) | Status |
|---|---|---|---|
| M1 | 50% of mimalloc | 51.77% | 🟢 EXCEEDED (Phase 69, Warm Pool Size=16, ENV-only) |
| M2 | 55% of mimalloc | - | 🔴 Not reached (+3.23pp to go; ongoing in Phase 69+) |
| M3 | 60% of mimalloc | - | 🔴 Not reached (structural changes required) |
| M4 | 65–70% of mimalloc | - | 🔴 Not reached (structural changes required) |
Current: FAST v3 + PGO (Phase 69) = 62.63M ops/s = 51.77% of mimalloc (Warm Pool Size=16, ENV-only, 10-run verified)
Phase 68 PGO promotion (Phase 66 → Phase 68 upgrade):
- Phase 66 baseline: 60.89M ops/s = 50.32% (+3.0% mean, 3-run stable)
- Phase 68 baseline: 61.614M ops/s = 50.93% (+1.19% vs Phase 66, 10-run verified)
- Profile change: seed/WS diversification (WS: 3 variants → 5; seeds: 1 → 3)
- M1 (50%) achievement: EXCEEDED (+0.93pp above target, vs +0.32pp in Phase 66)
M1 Achievement Analysis:
- Phase 66: Gap to 50%: +0.32% (EXCEEDED target, first time above 50%)
- Phase 68: Gap to 50%: +0.93% (further improved via seed/WS diversification)
- Production perspective: 50.93% clears the 50.00% target by a statistically robust margin
- Stability advantage: Phase 66 (3-run <±1%) → Phase 68 (10-run +1.19%, improved reproducibility)
- Verdict: M1 EXCEEDED (+0.93pp); next phases will target M2 (55%)
Phase 68 Benefits Over Phase 66:
- Reduced PGO overfitting via seed/WS diversification
- +1.19% improvement from better profile representation
- More representative of production workload variance
- Higher confidence in baseline stability
Phase 69 PGO promotion (Phase 68 → Phase 69 upgrade):
- Phase 68 baseline: 61.614M ops/s = 50.93% (+1.19% vs Phase 66, 10-run verified)
- Phase 69 baseline: 62.63M ops/s = 51.77% (+3.26% vs Phase 68, 10-run verified)
- Parameter change: Warm Pool Size 12 → 16 (ENV-only, zero code changes)
- M1 (50%) achievement: EXCEEDED (+1.77pp above target, vs +0.93pp in Phase 68)
- M2 (55%) progress: Gap reduced to +3.23pp (from +4.07pp in Phase 68)
Phase 69 Benefits Over Phase 68:
- +3.26% improvement from warm pool optimization (strong-GO threshold exceeded)
- ENV-only change (zero layout tax risk, fully reversible)
- Reduced registry O(N) scan overhead via larger warm pool
- Non-additive with other optimizations (Warm Pool Size=16 alone is optimal)
- Single strongest parameter improvement in refill tuning sweep
Phase 69 Implementation:
- Warm Pool Size: 12 → 16 SuperSlabs/class
- ENV variable: HAKMEM_WARM_POOL_SIZE=16 (default in the MIXED_TINYV3_C7_SAFE preset; see the sketch below)
- Rollback: set HAKMEM_WARM_POOL_SIZE=12 or remove the ENV variable
- Results: docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md
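For reference, a minimal sketch of how an ENV-only knob like this can be read once and cached; the helper name and clamp bounds are assumptions, not hakmem's actual code:

```c
/* Sketch: init-once read of the Phase 69 warm-pool knob.
 * warm_pool_capacity() and the clamp bounds are illustrative,
 * not hakmem's actual parsing code. Thread-safety is elided. */
#include <stdlib.h>

static int warm_pool_capacity(void) {
    static int cached = -1;                 /* parse the ENV once */
    if (cached < 0) {
        const char *s = getenv("HAKMEM_WARM_POOL_SIZE");
        int v = s ? atoi(s) : 16;           /* Phase 69 default: 16 SuperSlabs/class */
        if (v < 1 || v > 64) v = 16;        /* assumed sanity bounds */
        cached = v;
    }
    return cached;
}
```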
Note: the mimalloc/system/jemalloc reference values drift with the environment; re-baseline them periodically.
- Phase 48 complete: docs/analysis/PHASE48_REBASE_ALLOCATORS_AND_STABILITY_SUITE_RESULTS.md
- Phase 59 complete: docs/analysis/PHASE59_50PERCENT_RECOVERY_BASELINE_REBASE_RESULTS.md
2) Syscall budget (OS churn)
Ideal for the Tiny hot path:
- steady-state (after warmup): mmap/munmap/madvise = 0 (or "nearly 0")
Guideline (acceptable):
- mmap+munmap+madvise total at most 1 per 1e8 ops (= 1e-8 / op)
Current (Phase 48 rebase):
- HAKMEM_SS_OS_STATS=1 (Mixed, iters=200000000 ws=400): [SS_OS_STATS] alloc=9 free=10 madvise=9 madvise_enomem=0 madvise_other=0 madvise_disabled=0 mmap_total=9 fallback_mmap=0 huge_alloc=0 huge_fail=0
- Total syscalls (mmap+madvise): 18 / 200M ops = 9e-8 / op (budget arithmetic sketched below)
- Status: EXCELLENT (within 10x of ideal, NO steady-state churn)
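The budget arithmetic is a single division; a tiny self-contained check using the counter values quoted above:

```c
/* Self-contained check of the budget arithmetic quoted above:
 * 9 mmap + 9 madvise over 200M ops = 9e-8 syscalls/op. */
#include <stdio.h>

int main(void) {
    unsigned long mmap_total = 9, madvise = 9;     /* from [SS_OS_STATS] */
    unsigned long ops = 200000000UL;               /* 200M ops */
    double per_op = (double)(mmap_total + madvise) / (double)ops;
    printf("syscalls/op = %.1e\n", per_op);        /* prints 9.0e-08 */
    /* ideal <= 1e-8/op; the EXCELLENT verdict allows within 10x of ideal */
    return per_op <= 1e-7 ? 0 : 1;
}
```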
How to observe (either):
- Internal: the [SS_OS_STATS] line from HAKMEM_SS_OS_STATS=1 (madvise/disabled counters, etc.)
- External: syscall events via perf stat, or strace -c (short runs, counts only)
Phase 48 confirmation:
- mmap/madvise do not keep growing after warmup (stable)
- Confirms numerically one of the non-speed winning angles over mimalloc
3) Memory stability (RSS / fragmentation)
Minimum requirements (Mixed soak, fixed ws):
- RSS does not increase monotonically over time
- RSS drift within +5% over a 1-hour soak (guideline; measured as sketched below)
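The soak scripts do this in Python; purely as an illustration, a C sketch of the same drift check, assuming Linux's /proc/self/status format:

```c
/* Illustrative Linux-only helpers (not the project's scripts):
 * sample VmRSS from /proc/self/status, then compute drift. */
#include <stdio.h>
#include <string.h>

static long rss_kb(void) {
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    if (!f) return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {   /* "VmRSS:  12345 kB" */
            sscanf(line + 6, "%ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}

/* drift = (last - first) / first; target here: <= +0.05 over 1 hour */
static double rss_drift(long first_kb, long last_kb) {
    return (double)(last_kb - first_kb) / (double)first_kb;
}
```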
Current (Phase 51 - 5min single-process soak):
| Allocator | First RSS (MB) | Last RSS (MB) | Peak RSS (MB) | RSS Drift | Status |
|---|---|---|---|---|---|
| hakmem FAST | 32.88 | 32.88 | 32.88 | +0.00% | EXCELLENT |
| mimalloc | 1.88 | 1.88 | 1.88 | +0.00% | EXCELLENT |
| system malloc | 1.88 | 1.88 | 1.88 | +0.00% | EXCELLENT |
Phase 51 details (single-process soak):
- Test duration: 5 minutes (300 seconds)
- Epoch size: 5 seconds
- Samples: 60 epochs per allocator
- Results: docs/analysis/PHASE51_SINGLE_PROCESS_SOAK_AND_TAIL_PLAN_RESULTS.md
- Script: scripts/soak_mixed_single_process.sh
- All allocators show ZERO drift - excellent memory discipline
- Note: hakmem's higher base RSS (33 MB vs 2 MB) is a design trade-off (Phase 53 triage)
- Key difference from Phase 50: Single process with persistent allocator state (simulates long-running servers)
- Optional: when targeting RSS <10MB via Memory-Lean mode (opt-in, Phase 54), the Phase 55 validation matrix is canonical: docs/analysis/PHASE55_MEMORY_LEAN_MODE_VALIDATION_MATRIX.md
Balanced mode (Phase 55, LEAN+OFF):
- Config: HAKMEM_SS_MEM_LEAN=1 + HAKMEM_SS_MEM_LEAN_DECOMMIT=OFF
- Effect: RSS does not drop (stays ≈33MB), but prewarm suppression can slightly improve throughput/stability
- Next: docs/analysis/PHASE56_PROMOTE_LEAN_OFF_PREWARM_SUPPRESSION_NEXT_INSTRUCTIONS.md
Phase 53 RSS Tax Triage:
| Component | Memory (MB) | % of Total | Source |
|---|---|---|---|
| Tiny metadata | 0.04 | 0.1% | TLS caches, warm pool, page box |
| SuperSlab backend | ~20-25 | 60-75% | Persistent slabs for fast allocation |
| Benchmark working set | ~5-8 | 15-25% | Live objects (WS=400) |
| OS overhead | ~2-5 | 6-15% | Page tables, heap metadata |
| Total RSS | 32.88 | 100% | Measured peak |
Root Cause (Phase 53):
- NOT bench warmup: RSS essentially unchanged by the prefault setting (32.88 MB → 33.12 MB)
- IS allocator design: Speed-first strategy with persistent superslabs
- Trade-off: +10x syscall efficiency, -17x memory efficiency vs mimalloc
- Verdict: ACCEPTABLE for speed-first strategy (documented design choice)
Results: docs/analysis/PHASE53_RSS_TAX_TRIAGE_RESULTS.md
RSS Tax Target:
- Current: 32.88 MB (FAST build, speed-first)
- Target: <35 MB (maintain speed-first design)
- Alternative: <10 MB (if memory-lean mode implemented, Phase 54+)
- Status: ACCEPTABLE (documented trade-off, zero drift, predictable)
Phase 55: Memory-Lean Mode (PRODUCTION-READY):
Memory-Lean mode provides opt-in memory control without performance penalty. Winner: LEAN+OFF (prewarm suppression only).
| Mode | Config | Throughput vs Baseline | RSS (MB) | Syscalls/op | Status |
|---|---|---|---|---|---|
| Speed-first (default) | LEAN=0 | baseline (56.2M ops/s) | 32.75 | 1e-8 | Production |
| Balanced (opt-in) | LEAN=1 DECOMMIT=OFF | +1.2% (56.8M ops/s) | 32.88 | 1.25e-7 | Production |
Key Results (30-min test, WS=400):
- Throughput: +1.2% faster than baseline (56.8M vs 56.2M ops/s)
- RSS: 32.88 MB (stable, 0% drift)
- Stability: CV 5.41% (better than baseline 5.52%)
- Syscalls: 1.25e-7/op (8x under budget <1e-6/op)
- No decommit overhead: Prewarm suppression only, zero syscall tax
Use Cases:
- Speed-first (default): HAKMEM_SS_MEM_LEAN=0 (full prewarm enabled)
- Balanced (opt-in): HAKMEM_SS_MEM_LEAN=1 HAKMEM_SS_MEM_LEAN_DECOMMIT=OFF (prewarm suppression only)
Why LEAN+OFF is production-ready:
- Faster than baseline (+1.2%, no compromise)
- Zero decommit syscall overhead (lean_decommit=0)
- Perfect RSS stability (0% drift, better CV than baseline)
- Simplest lean mode (no policy complexity)
- Opt-in safety (LEAN=0 disables all lean behavior)
Results: docs/analysis/PHASE55_MEMORY_LEAN_MODE_VALIDATION_MATRIX.md
Phase 56: Promote LEAN+OFF as "Balanced Mode" (DEFAULT):
Phase 56 promotes LEAN+OFF as the production-recommended "Balanced mode" by setting it as the default in MIXED_TINYV3_C7_SAFE benchmark profile.
Phase 57 later shows Speed-first wins on 60-min + tail; default handling is revisited in Phase 58 (profile split).
Profile Comparison (10-run validation, Phase 56):
| Profile | Config | Mean (M ops/s) | CV | RSS (MB) | Syscalls/op | Use Case |
|---|---|---|---|---|---|---|
| Speed-first | LEAN=0 | 59.12 (Phase 55) | 0.48% | 33.00 | 5.00e-08 | Latency-critical, full prewarm |
| Balanced | LEAN=1 DECOMMIT=OFF | 59.84 (FAST), 60.48 (Standard) | 2.21% (FAST), 0.81% (Standard) | ~30 MB | 5.00e-08 | Prewarm suppression only |
Phase 56 Validation Results (10-run):
- FAST build: 59.84 M ops/s (mean), 60.36 M ops/s (median), CV 2.21%
- Standard build: 60.48 M ops/s (mean), 60.66 M ops/s (median), CV 0.81%
- vs Phase 55 baseline: +1.2% throughput gain confirmed (59.84 / 59.12 = 1.012)
- Syscalls: Zero overhead (5.00e-08/op, identical to baseline)
Implementation:
- Phase 56 added LEAN+OFF defaults to MIXED_TINYV3_C7_SAFE (historical).
- Phase 58 split the presets: MIXED_TINYV3_C7_SAFE (Speed-first) + MIXED_TINYV3_C7_BALANCED (LEAN+OFF).
Verdict: GO (production-ready) — Balanced mode is faster, more stable, and has zero syscall overhead vs Speed-first.
Rollback: Remove 3 lines from core/bench_profile.h or set HAKMEM_SS_MEM_LEAN=0 at runtime.
Results: docs/analysis/PHASE56_PROMOTE_LEAN_OFF_RESULTS.md
Implementation: docs/analysis/PHASE56_PROMOTE_LEAN_OFF_IMPLEMENTATION.md
Phase 57: Balanced Mode 60-min Soak + Syscalls (FINAL VALIDATION):
Phase 57 performed final validation of Balanced mode with 60-minute soak tests, high-resolution tail proxy, and syscall budget verification.
60-min Soak Results (DURATION_SEC=3600, EPOCH_SEC=10, 360 epochs):
| Mode | Mean TP (M ops/s) | CV | RSS (MB) | RSS Drift | Syscalls/op | Status |
|---|---|---|---|---|---|---|
| Balanced | 58.93 | 5.38% | 33.00 | 0.00% | 1.25e-7 | Production |
| Speed-first | 60.74 | 1.58% | 32.75 | 0.00% | 1.25e-7 | Production |
Key Results:
- RSS Drift: 0.00% for both modes (perfect stability over 60 minutes)
- Throughput Drift: 0.00% for both modes (no degradation)
- CV (60-min): Balanced 5.38%, Speed-first 1.58% (both acceptable for production)
- Syscalls: Identical budget (1.25e-7/op, 8× below the <1e-6 target)
- DSO guard: Active in both modes (madvise_disabled=1, correct)
10-min Tail Proxy Results (DURATION_SEC=600, EPOCH_SEC=1, 600 epochs):
| Mode | Mean TP (M ops/s) | CV | p99 Latency (ns/op) | p99.9 Latency (ns/op) |
|---|---|---|---|---|
| Balanced | 53.11 | 2.18% | 20.78 | 21.24 |
| Speed-first | 53.62 | 0.71% | 19.14 | 19.35 |
Tail Analysis:
- Balanced: CV 2.18% (excellent for production), p99 +8.6% higher latency
- Speed-first: CV 0.71% (exceptional stability), lower tail latency
- Both: Zero RSS drift, no performance degradation
Syscall Budget (200M ops, HAKMEM_SS_OS_STATS=1):
| Mode | Total syscalls | Syscalls/op | madvise_disabled | lean_decommit |
|---|---|---|---|---|
| Balanced | 25 | 1.25e-7 | 1 (DSO guard active) | 0 (not triggered) |
| Speed-first | 25 | 1.25e-7 | 1 (DSO guard active) | 0 (not triggered) |
Observations:
- Identical syscall behavior across modes
- No runaway madvise/mmap (stable counts)
- lean_decommit=0: LEAN policy not triggered in WS=400 workload (expected)
- DSO guard functioning correctly in both modes
Trade-off Summary:
Balanced vs Speed-first:
- Throughput: -3.0% (60-min mean: 58.93M vs 60.74M ops/s)
- Latency p99: +8.6% (10-min: 20.78 vs 19.14 ns/op)
- Stability: +3.8pp CV (60-min: 5.38% vs 1.58%)
- Memory: +0.76% RSS (33.00 vs 32.75 MB)
- Syscalls: Identical (1.25e-7/op)
Verdict: GO (production-ready) — Both modes stable, zero drift, user choice preserved.
Use Cases:
- Speed-first (default): HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
- Balanced (opt-in): HAKMEM_PROFILE=MIXED_TINYV3_C7_BALANCED (sets LEAN=1 DECOMMIT=OFF)
Phase 58: Profile Split (Speed-first default + Balanced opt-in):
- MIXED_TINYV3_C7_SAFE: Speed-first default (does not set HAKMEM_SS_MEM_LEAN)
- MIXED_TINYV3_C7_BALANCED: Balanced opt-in preset (sets LEAN=1 DECOMMIT=OFF)
Results: docs/analysis/PHASE57_BALANCED_MODE_60MIN_SOAK_AND_SYSCALLS_RESULTS.md
Phase 50 details (multi-process soak):
- Test duration: 5 minutes (300 seconds)
- Step size: 20M operations per sample
- Samples: hakmem=742, mimalloc=1523, system=1093
- Results: docs/analysis/PHASE50_OPERATIONAL_EDGE_STABILITY_SUITE_RESULTS.md
- Script: scripts/soak_mixed_rss.sh
- All allocators show ZERO drift - excellent memory discipline
- Key difference from Phase 51: Separate process per sample (simulates batch jobs)
Tools:
# 5-min soak (Phase 50 - quick validation)
BENCH_BIN=./bench_random_mixed_hakmem_minimal \
DURATION_SEC=300 STEP_ITERS=20000000 WS=400 \
scripts/soak_mixed_rss.sh > soak_fast_5min.csv
# Analysis (CSV to metrics)
python3 analyze_soak.py # Calculates drift/CV/peak RSS
Target:
- RSS drift: < +5% (5-min soak: PASS, 60-min: TBD)
- Throughput drift: > -5% (5-min soak: PASS, 60-min: TBD)
Next steps (Phase 51+):
- Extend to 30-60 min soak for long-term validation
- Compare mimalloc RSS behavior (currently only hakmem measured)
4) Long-run stability (performance consistency)
Minimum requirements:
- ops/s does not drop by more than 5% over a 30–60 min soak
- CV (coefficient of variation) stays within ~1–2% (consistent with current practice)
Current (Phase 51 - 5min single-process soak):
| Allocator | Mean TP (M ops/s) | First 5 avg | Last 5 avg | TP Drift | CV | Status |
|---|---|---|---|---|---|---|
| hakmem FAST | 59.95 | 59.45 | 60.17 | +1.20% | 0.50% | EXCELLENT |
| mimalloc | 122.38 | 122.61 | 122.03 | -0.47% | 0.39% | EXCELLENT |
| system malloc | 85.31 | 84.99 | 85.32 | +0.38% | 0.42% | EXCELLENT |
Phase 51 details (single-process soak):
- All allocators show minimal drift (<1.5%) - highly stable performance
- CV values are exceptional (0.39%-0.50%) - 3-5× better than Phase 50 multi-process
- hakmem CV: 0.50% - best stability in single-process mode, 3× better than Phase 50
- No performance degradation over 5 minutes
- Results: docs/analysis/PHASE51_SINGLE_PROCESS_SOAK_AND_TAIL_PLAN_RESULTS.md
- Script: scripts/soak_mixed_single_process.sh (epoch-based, persistent allocator state)
- Key improvement: single-process mode eliminates cold-start variance (superior for long-run stability measurement)
Phase 50 details (multi-process soak):
- All allocators show positive drift (+0.8% to +0.9%) - likely CPU warmup effect
- CV values are good (1.5%-2.1%) - consistent but higher due to cold-start variance
- hakmem CV (1.49%) slightly better than mimalloc (1.60%)
- Results: docs/analysis/PHASE50_OPERATIONAL_EDGE_STABILITY_SUITE_RESULTS.md
- Script: scripts/soak_mixed_rss.sh (separate process per sample)
Comparison to short-run (Phase 48 rebase):
- Mixed 10-run: CV = 1.22% (mean 59.15M / min 58.12M / max 60.02M)
- 5-min multi-process soak (Phase 50): CV = 1.49% (mean 59.65M)
- 5-min single-process soak (Phase 51): CV = 0.50% (mean 59.95M)
- Consistency: Single-process soak provides best stability measurement (3× lower CV)
Tools:
# Run 5-min soak
BENCH_BIN=./bench_random_mixed_hakmem_minimal \
DURATION_SEC=300 STEP_ITERS=20000000 WS=400 \
scripts/soak_mixed_rss.sh > soak_fast_5min.csv
# Analyze with Python
python3 analyze_soak.py # Calculates mean, drift, CV automatically
Target:
- Throughput drift: > -5% (5-min: PASS +0.94%, 60-min: TBD)
- CV: < 2% (5-min: PASS 1.49%, 60-min: TBD)
Next steps (Phase 51+):
- Extend to 30-60 min soak for long-term validation
- Confirm no monotonic drift (throughput should not decay over time)
5) Tail Latency (p99/p999)
Status: COMPLETE - Phase 52 (Throughput Proxy Method)
Objective: Measure tail latency using epoch throughput distribution as a proxy
Method: Use 1-second epoch throughput variance as a proxy for per-operation latency distribution
- Rationale: Epochs with lower throughput indicate periods of higher latency
- Advantage: Zero observer effect, measurement-only approach
- Implementation: 5-minute soak with 1-second epochs, calculate percentiles
- Note: Throughput tail is the low side (p1/p0.1). Latency percentiles must be computed from per-epoch latency values (not inverted percentiles).
- Tool: scripts/analyze_epoch_tail_csv.py (proxy math sketched below)
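Under those assumptions, a minimal C sketch of the proxy math: convert each epoch's throughput to a latency value (1e9 / ops_per_sec), then take percentiles over those per-epoch latencies directly, never by inverting throughput percentiles. The epoch data below is placeholder, not measured:

```c
/* Placeholder epoch data; the real inputs come from the soak CSV. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double percentile(const double *sorted, int n, double p) {
    int idx = (int)(p * (n - 1) + 0.5);   /* nearest-rank; fine for a sketch */
    return sorted[idx];
}

int main(void) {
    enum { N = 600 };                      /* 600 one-second epochs (10-min run) */
    double lat[N];
    for (int i = 0; i < N; i++) {
        double ops_per_sec = 5.0e7 + 1.0e6 * (i % 10);  /* placeholder throughput */
        lat[i] = 1e9 / ops_per_sec;                     /* per-epoch latency proxy, ns/op */
    }
    qsort(lat, N, sizeof lat[0], cmp_double);           /* percentiles over latencies, */
    printf("p50=%.2f p99=%.2f p99.9=%.2f ns/op\n",      /* not inverted throughput     */
           percentile(lat, N, 0.50), percentile(lat, N, 0.99), percentile(lat, N, 0.999));
    return 0;
}
```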
Current Results (Phase 52 - Tail Latency Proxy):
Throughput Distribution (ops/sec)
| Metric | hakmem FAST | mimalloc | system malloc |
|---|---|---|---|
| p50 | 47,887,721 | 98,738,326 | 69,562,115 |
| p90 | 58,629,195 | 99,580,629 | 69,931,575 |
| p99 | 59,174,766 | 110,702,822 | 70,165,415 |
| p999 | 59,567,912 | 111,190,037 | 70,308,452 |
| Mean | 50,174,657 | 99,084,977 | 69,447,599 |
| Std Dev | 4,461,290 | 2,455,894 | 522,021 |
Latency Proxy (ns/op)
Calculated as 1 / throughput * 1e9:
| Metric | hakmem FAST | mimalloc | system malloc |
|---|---|---|---|
| p50 | 20.88 ns | 10.13 ns | 14.38 ns |
| p90 | 21.12 ns | 10.24 ns | 14.50 ns |
| p99 | 21.33 ns | 10.43 ns | 14.80 ns |
| p999 | 21.57 ns | 10.47 ns | 15.07 ns |
Tail Consistency Metrics
Standard Deviation as % of Mean (lower = more consistent):
- hakmem FAST: 7.98% (highest variability)
- mimalloc: 2.28% (good consistency)
- system malloc: 0.77% (best consistency)
p99/p50 Ratio (lower = better tail):
- hakmem FAST: 1.024 (2.4% tail slowdown)
- mimalloc: 1.030 (3.0% tail slowdown)
- system malloc: 1.029 (2.9% tail slowdown)
p999/p50 Ratio:
- hakmem FAST: 1.033 (3.3% tail slowdown)
- mimalloc: 1.034 (3.4% tail slowdown)
- system malloc: 1.048 (4.8% tail slowdown)
Analysis
Key Findings:
- hakmem has highest throughput variance: 4.46M ops/sec std dev (7.98% of mean)
- 2× worse than mimalloc (2.28%)
- 10× worse than system malloc (0.77%)
- mimalloc has best absolute performance AND good tail behavior:
- 2× faster than hakmem at all percentiles
- Moderate variance (2.28% std dev)
- system malloc has rock-solid consistency:
- Lowest variance (0.77% std dev)
- Very tight p99/p999 spread
- hakmem's tail problem is variance, not worst-case:
- Absolute p99 latency (21.33 ns) is reasonable
- But 2-3× higher variance than competitors
- Suggests optimization opportunities in cache warmth, metadata layout
Test Configuration:
- Duration: 5 minutes (300 seconds)
- Epoch length: 1 second
- Workload: Mixed (WS=400)
- Process model: Single process (persistent allocator state)
- Script: scripts/soak_mixed_single_process.sh
- Results: docs/analysis/PHASE52_TAIL_LATENCY_PROXY_RESULTS.md
Target:
- Std dev as % of mean: < 3% (Current: 7.98%, Goal: match mimalloc's 2.28%)
- p99/p50 ratio: < 1.05 (Current: 1.024, Status: GOOD)
- Priority: Reduce variance rather than chasing p999 specifically
Next steps:
- Phase 53: RSS Tax Triage (understand memory overhead sources)
- Future phases: Target variance reduction (TLS cache optimization, metadata locality)
6) Decision rules (operations)
- runtime changes (ENV only): GO threshold +1.0% (Mixed 10-run mean)
- build-level changes (compile-out family): GO threshold +0.5% (allows for layout jitter); see the sketch below
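As a sketch, the rule reduces to one comparison; only the GO side is encoded here, since the NEUTRAL/NO-GO boundary is judged case by case in the phase docs:

```c
/* Only the GO side of the decision rule is encoded. */
static int is_go(double baseline_mops, double treatment_mops, int env_only_change) {
    double delta_pct = (treatment_mops - baseline_mops) / baseline_mops * 100.0;
    double threshold = env_only_change ? 1.0 : 0.5;   /* +1.0% ENV-only, +0.5% build-level */
    return delta_pct >= threshold;
}
```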
7) Build Variants (FAST / Standard / OBSERVE) — Phase 38 operations
The three builds
| Build | Binary | Purpose | Characteristics |
|---|---|---|---|
| FAST | bench_random_mixed_hakmem_minimal | pure performance measurement | gate functions constant-folded, diagnostics OFF |
| Standard | bench_random_mixed_hakmem | safety/compatibility baseline | ENV gates enabled, for mainline releases |
| OBSERVE | bench_random_mixed_hakmem_observe | behavior observation | diagnostic counters ON, for perf analysis |
Operating rules (finalized in Phase 38)
- Performance evaluation uses the FAST build (canonical for mimalloc comparison)
- Standard is the safety baseline (gate overhead tolerated; mainline feature compatibility comes first)
- OBSERVE is for debugging (not used for performance evaluation; emits diagnostics)
FAST build history
| Version | Mean (ops/s) | Delta | Change |
|---|---|---|---|
| FAST v1 | 54,557,938 | baseline | Phase 35-A: gate function constant-folding |
| FAST v2 | 54,943,734 | +0.71% | Phase 36: policy snapshot init-once |
| FAST v3 | 56,040,000 | +1.98% | Phase 39: hot-path gate constant-folding |
Constant-folded in FAST v3 (pattern sketched below):
- tiny_front_v3_enabled() → always true
- tiny_metadata_cache_enabled() → always 0
- small_policy_v7_snapshot() → version check skipped, init-once TLS cache
- learner_v7_enabled() → always false
- small_learner_v2_enabled() → always false
- front_gate_unified_enabled() → always 1 (Phase 39)
- alloc_dualhot_enabled() → always 0 (Phase 39)
- g_bench_fast_front block → compiled out (Phase 39)
- g_v3_enabled block → compiled out (Phase 39)
- free_dispatch_stats_enabled() → always false (Phase 39)
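The pattern behind this list, as a hedged sketch: under the FAST (BENCH_MINIMAL) build the gate becomes a compile-time constant that the optimizer folds away, otherwise it remains a lazy-init ENV gate. Only the gate name comes from the list above; the body and the HAKMEM_FRONT_GATE_UNIFIED variable name are illustrative:

```c
/* Sketch of the FAST-build gate pattern; body and ENV name are assumed. */
#include <stdlib.h>

static inline int front_gate_unified_enabled(void) {
#ifdef BENCH_MINIMAL
    return 1;                        /* FAST v3: fixed 1; the branch folds away */
#else
    static int cached = -1;          /* Standard: lazy-init ENV gate (the runtime tax) */
    if (cached < 0) {
        const char *s = getenv("HAKMEM_FRONT_GATE_UNIFIED");
        cached = (s == NULL) ? 1 : (atoi(s) != 0);
    }
    return cached;
#endif
}
```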
Usage (Phase 38 workflow)
Recommended: use the automation targets
# FAST 10-run performance evaluation (canonical for mimalloc comparison)
make perf_fast
# OBSERVE health check (syscalls / diagnostics)
make perf_observe
# Run both
make perf_all
Manual runs (when individual control is needed)
# Build the FAST build only
make bench_random_mixed_hakmem_minimal
# Build the Standard build only
make bench_random_mixed_hakmem
# Build the OBSERVE build only
make bench_random_mixed_hakmem_observe
# 10-run (with any binary)
scripts/run_mixed_10_cleanenv.sh
Phase 37 lesson (limits of optimizing Standard)
The attempt to speed up the Standard build (TLS cache) was NO-GO (-0.07%):
- A runtime gate (lazy-init) always carries overhead
- A compile-time constant (BENCH_MINIMAL) is the only real fix
- Conclusion: keep Standard as the safety baseline; evaluate performance on FAST
Phase 39 completed (FAST v3)
The following gate functions were constant-folded in Phase 39:
malloc path (done):
| Gate | File | FAST v3 value | Status |
|---|---|---|---|
| front_gate_unified_enabled() | malloc_tiny_fast.h | fixed 1 | ✅ GO |
| alloc_dualhot_enabled() | malloc_tiny_fast.h | fixed 0 | ✅ GO |
free path (done):
| Gate | File | FAST v3 value | Status |
|---|---|---|---|
| g_bench_fast_front | hak_free_api.inc.h | compiled out | ✅ GO |
| g_v3_enabled | hak_free_api.inc.h | compiled out | ✅ GO |
| g_free_dispatch_ssot | hak_free_api.inc.h | lazy-init kept | deferred |
stats (done):
| Gate | File | FAST v3 value | Status |
|---|---|---|---|
| free_dispatch_stats_enabled() | free_dispatch_stats_box.h | fixed false | ✅ GO |
Phase 39 result: +1.98% (GO)
Phase 47: FAST+PGO research box (NEUTRAL, on hold)
Phase 47 tested a compile-time fixed front config (HAKMEM_TINY_FRONT_PGO=1):
Results:
- Mean: +0.27% (below the +0.5% threshold)
- Median: +1.02% (positive signal)
- Verdict: NEUTRAL (kept as a research box; not adopted into the FAST standard)
Reasons:
- Mean falls below the GO threshold (+0.5%)
- Treatment variance is 2× baseline (a sign of layout tax)
- Median is positive but diverges substantially from the mean
Kept as a research box:
- Makefile target: bench_random_mixed_hakmem_fast_pgo
- Leaves open combining it with other optimizations later
- Details: docs/analysis/PHASE47_FAST_FRONT_PGO_MODE_RESULTS.md
Phase 60: Alloc Pass-Down SSOT (NO-GO, research box)
Phase 60 implemented a Single Source of Truth (SSOT) pattern for the allocation path, computing ENV snapshot, route kind, C7 ULTRA, and DUALHOT flags once at the entry point and passing them down.
A/B Test Results (Mixed 10-run):
- Baseline (SSOT=0): 60.05M ops/s (CV: 1.00%)
- Treatment (SSOT=1): 59.77M ops/s (CV: 1.55%)
- Delta: -0.46% (NO-GO)
Root Cause:
- The added branch check if (alloc_passdown_ssot_enabled()) introduces overhead
- The original path already has early exits (C7 ULTRA, DUALHOT) that avoid expensive computations
- SSOT forces upfront computation, negating the benefit of early exits
- Struct pass-down introduces ABI overhead (register pressure, stack spills); see the sketch below
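A sketch of the tested shape, assuming illustrative types and placeholder routing logic (not hakmem's actual ABI): the entry point computes every flag up front and passes the struct down by value, which is precisely where the early-exit savings disappear and the register/stack cost appears.

```c
/* Illustrative types and placeholder routing logic, not hakmem's ABI. */
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int route_kind;   /* tiny/small/large route decision */
    int c7_ultra;     /* C7 ULTRA fast-path flag */
    int dualhot;      /* DUALHOT flag */
} alloc_route_ssot;

static void *alloc_backend(size_t size, alloc_route_ssot route) {
    (void)route;                       /* stand-in for the real dispatch */
    return malloc(size);
}

void *hak_alloc_ssot_sketch(size_t size) {
    alloc_route_ssot route;
    route.route_kind = size > 1024;    /* placeholder classification */
    route.c7_ultra   = size <= 128;    /* computed even when an early exit */
    route.dualhot    = 0;              /* would have skipped it entirely   */
    return alloc_backend(size, route); /* by-value pass-down: register/stack cost */
}
```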
Comparison with Free-Side Phase 19-6C:
- Free-side SSOT: +1.5% (GO) - many redundant computations across multiple paths
- Alloc-side SSOT: -0.46% (NO-GO) - efficient early exits already in place
Kept as Research Box:
- ENV gate: HAKMEM_ALLOC_PASSDOWN_SSOT=0 (default OFF)
- Files: core/box/alloc_passdown_ssot_env_box.h, core/front/malloc_tiny_fast.h
- Rollback: build without -DHAKMEM_ALLOC_PASSDOWN_SSOT=1
Lessons Learned:
- SSOT pattern works when there are many redundant computations (Free-side)
- SSOT fails when the original path has efficient early exits (Alloc-side)
- Even a single branch check can introduce measurable overhead in hot paths
- Upfront computation negates the benefits of lazy evaluation
Documentation:
- Design: docs/analysis/PHASE60_ALLOC_PASSDOWN_SSOT_DESIGN_AND_INSTRUCTIONS.md
- Results: docs/analysis/PHASE60_ALLOC_PASSDOWN_SSOT_RESULTS.md
- Implementation: docs/analysis/PHASE60_ALLOC_PASSDOWN_SSOT_IMPLEMENTATION.md
Next Steps:
- Focus on the Top 50 hot functions: tiny_region_id_write_header (3.50%), unified_cache_push (1.21%)
- Investigate branch reduction in hot paths
- Consider PGO or direct dispatch for common class indices
Phase 61: C7 ULTRA Header-Light (NEUTRAL, research box)
Phase 61 tested skipping header write in C7 ULTRA alloc hit path to reduce instruction count.
A/B Test Results (Mixed 10-run, Speed-first):
- Baseline (HEADER_LIGHT=0): 59.54M ops/s (CV: 1.53%)
- Treatment (HEADER_LIGHT=1): 59.73M ops/s (CV: 2.66%)
- Delta: +0.31% (NEUTRAL)
Runtime Profiling (perf record):
- tiny_region_id_write_header: 2.32% (hotspot confirmed)
- tiny_c7_ultra_alloc: 1.90% (in top 10)
- Combined target overhead: ~4.22%
Root Cause of Low Gain:
- Header write is smaller hotspot than expected (2.32% vs 4.56% in Phase 42)
- Mixed workload dilutes C7-specific optimizations
- Treatment has higher variance (CV 2.66% vs 1.53%)
- Header-light mode adds a branch in the hot path (if (header_light)); see the sketch below
- Refill phase still writes headers (cold path overhead)
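A sketch of the measured gate shape; the helper and the assumed signature are illustrative (the real code lives in core/tiny_c7_ultra.c:39-51):

```c
/* The header-write hotspot is real (2.32% above); its signature here
 * and c7_ultra_hit() are assumed for illustration. */
#include <stdint.h>

extern void tiny_region_id_write_header(void *p, uint32_t region_id); /* assumed signature */

static inline void *c7_ultra_hit(void *slot, uint32_t region_id, int header_light) {
    if (!header_light)                                /* this branch IS the hot-path cost */
        tiny_region_id_write_header(slot, region_id); /* default: write the header */
    return slot;                                      /* header_light==1 skips the write */
}
```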
Implementation Status:
- Pre-existing implementation discovered during analysis
- ENV gate: HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT=0 (default OFF)
- Location: core/tiny_c7_ultra.c:39-51, core/box/tiny_front_v3_env_box.h:145-152
- Rollback: ENV gate already OFF by default (safe)
Kept as Research Box:
- Available for future C7-heavy workloads (>50% C7 allocations)
- May combine with other C7 optimizations (batch refill, SIMD header write)
- Requires IPC/cache-miss profiling (not just cycle count)
Documentation:
- Results: docs/analysis/PHASE61_C7_ULTRA_HEADER_LIGHT_RESULTS.md
- Implementation: docs/analysis/PHASE61_C7_ULTRA_HEADER_LIGHT_IMPLEMENTATION.md
Lessons Learned:
- Micro-optimizations need precise profiling (IPC, cache misses, not just cycles)
- Mixed workload may not show benefits of class-specific optimizations
- Instruction count reduction doesn't always translate to performance gain
- Higher variance (CV) suggests instability or additional noise