hakmem/docs/benchmarks/BENCH_RESULTS_2025_10_29.md
Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00

Bench Results — 2025-10-29
Summary
- TinyHot (direct link, triad): HAKMEM reaches ~240–246 M ops/s at 8–128B; System malloc ~315–330 M; mimalloc ~555–630 M.
- RandomMixed (direct link, ws=200/400/800, 100k cycles): HAKMEM ~24.8–25.3 M; System ~26.0–26.5 M; mimalloc ~26.6–27.0 M.
- Comprehensive pair (direct link): HAKMEM ~235–246 M across small tests; mimalloc ~900–980 M. HAKMEM mixed: ~234.5 M; mimalloc mixed: ~876.5 M.
Key CSVs
- TinyHot triad: bench_results/tiny_hot_triad_20251029_112655/results.csv
- TinyHot triad (Minimal Front build): bench_results/tiny_hot_triad_20251029_112934/results.csv
- RandomMixed matrix: bench_results/random_mixed_20251029_112713/results.csv
- Comprehensive pair (HAKMEM vs mimalloc): bench_results/comp_pair_20251029_112732/summary.csv
- Mixed quick sweep: bench_results/sweep_mixed_quick_20251029_112832/results.csv
- TinyHot triad (post-refine 12:42): bench_results/tiny_hot_triad_20251029_124209/results.csv
- TinyHot triad (post-PGO 13:14): bench_results/tiny_hot_triad_20251029_131457/results.csv
- perf stat (post-PGO 13:14): bench_results/perf_hot_triad_20251029_1314{22,57}/hakmem_s{32,64}_b100_c50000.perf.csv
- TinyHot triad (14:06): bench_results/tiny_hot_triad_20251029_140637/results.csv
- RandomMixed matrix (14:06): bench_results/random_mixed_20251029_140651/results.csv
- Bench-fastpath PGO triad (14:50): bench_results/tiny_hot_triad_20251029_145020/results.csv
- Bench-fastpath sweep (r8/r12/r16, 15:08): bench_results/tiny_benchfast_sweep_20251029_150802/
- Bench SLL-only + warmup + PGO (15:25): bench_results/tiny_hot_triad_20251029_152510/results.csv
- Bench SLL-only tuned (REFILL32=12, WARMUP32=192, 15:27): bench_results/tiny_hot_triad_20251029_152738/results.csv
Notable Findings
- TinyHot gap: HAKMEM trails System by ~70–80 M (a few M better than before) and trails mimalloc by ~2.3–2.5× at 32/64B, batch=100.
- Minimal Front build trims front tiers but gives only micro gains on this box (~+0–3 M). Instruction count remains the limiter.
- RandomMixed: HAKMEM is 1.0–2.0 M behind System/mimalloc; L1 misses don't dominate, so extra instructions/branches in the back-path are the likely cause.
- Bench-fastpath (bench-only straight-line path + PGO): up to 358.4M at 32B/b100/30k, exceeding System's 312.6M. The 8–24B range also reaches 310–350M.
- Refill A/B (r8/r12/r16): at 32B, r16≈267.4M and r8≈266.7M are nearly tied; at 64B, r12≈266.8M is best (non-PGO, compared individually).
- Bench SLL-only + warmup + PGO: above 400M at 8–24B; 32B/b100 ranges from 388.7M to 429.2M (depending on parameters and PGO).
  - Representative: 32B/b100=429.18M (System=312.55M, mimalloc=588.31M)
- USDT is unavailable on the current kernel (WSL); scripts auto-fallback to PMU. The overview summary is PMU-only.
RandomMixed Update (13:38)
- Preset: rmax=96, rmaxh=192, spill_hyst=16 (recommended)
- ws=200: H=24.65/24.75M, S=25.91/25.65M, mi=26.48/26.50M
- ws=400: H=24.89/24.86M, S=25.68/25.99M, mi=26.59/26.73M
- ws=800: H=25.00/24.59M, S=25.85/25.98M, mi=26.61/26.62M
- CSV: bench_results/random_mixed_20251029_133834/results.csv
- Summary: RandomMixed is now close to System (gap ~3–5%); the gap to mimalloc is ~6–9%. HAKMEM has been catching up steadily.
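The gap percentages in the summary can be checked from the per-working-set numbers listed above. The sketch below is a minimal illustration, assuming the two values given per allocator are two runs that can be averaged (the source does not say what they are); the helper names `gap_pct` and `gaps` are hypothetical and not part of the repository's scripts.

```python
# Throughput in M ops/s, copied from the 13:38 RandomMixed update.
# Assumption: the two values per allocator are two runs, averaged here.
RESULTS = {
    200: {"hakmem": (24.65, 24.75), "system": (25.91, 25.65), "mimalloc": (26.48, 26.50)},
    400: {"hakmem": (24.89, 24.86), "system": (25.68, 25.99), "mimalloc": (26.59, 26.73)},
    800: {"hakmem": (25.00, 24.59), "system": (25.85, 25.98), "mimalloc": (26.61, 26.62)},
}


def mean(values):
    return sum(values) / len(values)


def gap_pct(ours, theirs):
    """Percent by which `ours` trails `theirs` (positive means slower)."""
    return (theirs - ours) / theirs * 100.0


def gaps(results):
    """Return {ws: {"system": pct, "mimalloc": pct}} for HAKMEM's deficit."""
    out = {}
    for ws, row in results.items():
        hakmem = mean(row["hakmem"])
        out[ws] = {
            "system": gap_pct(hakmem, mean(row["system"])),
            "mimalloc": gap_pct(hakmem, mean(row["mimalloc"])),
        }
    return out


if __name__ == "__main__":
    for ws, g in gaps(RESULTS).items():
        print(f"ws={ws}: vs System {g['system']:.1f}%, vs mimalloc {g['mimalloc']:.1f}%")
```

Under that averaging assumption, every working set falls inside the quoted ranges.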
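Post-PGO Update (13:14)
- TinyHot (80k cycles, hakmem only, batch=100): 8B=245.58M, 16B=245.86M, 32B=240.81M, 64B=242.31M
- Trend: eliminating getenv calls on the free side, reducing SLL branches, and removing statistics branches yields a small gain of a few M at each size (an improvement that is still within run-to-run environmental variation).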
Quick A/B (RandomMixed) — Best Preset Observed
- rmax=96, rmaxh=192, spill_hyst=16 at ws=400, seed=42, cycles=60k:
  - HAKMEM: 26.06 M; System: 27.36 M; mimalloc: 27.84 M
  - See: bench_results/sweep_mixed_quick_20251029_112832/results.csv
Recommended Presets (direct-link)
- TinyHot: HAKMEM_TINY_TLS_SLL=1, HAKMEM_TINY_MAG_CAP=128 (A/B against 512 for 64B), HAKMEM_TINY_REMOTE_DRAIN_TRYRATE=0
- TinyHot (bench-only): -DHAKMEM_TINY_BENCH_FASTPATH=1 (≤64B), apply PGO, and A/B the refill size starting from 32B=16, 64B=12
- TinyHot (bench-only, SLL-only; recommended):
  - Build: -DHAKMEM_TINY_BENCH_FASTPATH=1 -DHAKMEM_TINY_BENCH_SLL_ONLY=1 -DHAKMEM_TINY_BENCH_TINY_CLASSES=3
  - Warmup (fills the SLL on first use only): 8=64, 16=96, 32=160–192, 64=192 (A/B)
  - Refill (per class): REFILL32=12 works well (for 64B, A/B within the default 8–12 range)
  - PGO: collect profiles on 8/16/32/64 (batch=100, cycles=60k), then optimize
- Mixed: HAKMEM_TINY_REFILL_MAX=96, HAKMEM_TINY_REFILL_MAX_HOT=192, HAKMEM_TINY_SPILL_HYST=16 (near the best observed on this box)
- Stat sampling (optional): build with -DHAKMEM_TINY_STAT_SAMPLING and set, for example, HAKMEM_TINY_STAT_RATE_LG=14 at run time (flush once every 2^14 events)
- 8/16 specialization (optional): to A/B 16B only, set HAKMEM_TINY_SPECIALIZE_MASK=0x02 (results on this box depend on conditions; leaving it at the default OFF is recommended)
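The runtime presets above are plain environment variables, so they can be layered over the current environment before launching a benchmark. The following is a minimal sketch: the variable names and values come from the list above, while the `PRESETS` table, `preset_env`, and `command_line` are hypothetical helpers. It also assumes the repository's scripts pass the environment through to the benchmark binary, which the source does not state.

```python
import os
import shlex

# Runtime presets copied from the list above (values are strings, as in a shell).
PRESETS = {
    "tinyhot": {
        "HAKMEM_TINY_TLS_SLL": "1",
        "HAKMEM_TINY_MAG_CAP": "128",  # 512 is the A/B candidate for 64B
        "HAKMEM_TINY_REMOTE_DRAIN_TRYRATE": "0",
    },
    "mixed": {
        "HAKMEM_TINY_REFILL_MAX": "96",
        "HAKMEM_TINY_REFILL_MAX_HOT": "192",
        "HAKMEM_TINY_SPILL_HYST": "16",
    },
}


def preset_env(name, base=None):
    """Return a copy of `base` (default: os.environ) with the preset applied."""
    env = dict(os.environ if base is None else base)
    env.update(PRESETS[name])
    return env


def command_line(name, argv):
    """Format an equivalent `env VAR=... cmd` line for logging or copy-paste."""
    assigns = [f"{key}={value}" for key, value in sorted(PRESETS[name].items())]
    return " ".join(["env", *assigns, *(shlex.quote(arg) for arg in argv)])


if __name__ == "__main__":
    argv = ["bash", "scripts/run_random_mixed_matrix.sh", "100000"]
    print(command_line("mixed", argv))
    # To actually run it: subprocess.run(argv, env=preset_env("mixed"), check=True)
```

Keeping the presets in one table makes an A/B run a one-key change, and the printed `env ...` line records exactly which settings produced a given CSV.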
What Changed Since 10/28
- Targeted remote-drain queue implemented; BG remote scan replaced with per-class target list (off by default; env-tunable).
- Background spill queue integrated (off by default); spill hysteresis and batch lower-bound added.
- Minimal/Strict Front compile-time gates wired; size-specialized 32/64B mag-pop path (bench A/B) in place.
- Scripts for triad/mixed/pair and PMU overview are stable and saving CSVs under bench_results/…
Next Steps (perf focus)
- TinyHot: further reduce insns/op in the first 3 tiers.
  - Keep front simple: SLL → small TLS mag pop → regular mag. Avoid fast-path writes; sample/flush counters at low frequency only.
  - Consider 32/64B size-specialized inline pops + PGO (use pgo-hot-profile/build) and re-measure perf stat.
- Mixed: fewer refills and narrower back-path work per cycle.
  - Sweep larger REFILL_MAX(HOT) and refine SPILL_HYST; class-specific tables for hot classes.
  - Keep BG_REMOTE off on this box; prefer targeted queue only when needed.
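Closing the TinyHot Gap (supplement)
- Minimize writes rigorously: the TLS mag-pop updates only top. Continue and strengthen the low-frequency flush of statistics/owner data (already implemented).
- Always-inline size specialization + PGO: limit it to 16/32/64B so the instruction sequence is fixed (leaving 8B off is recommended on this box).
- Small magazine (8/16/32B, A/B): a small 128-element magazine improves L1 residency and reduces transitions between the SLL and the regular magazine.
- Move the wrapper check out of the entry path: re-entry short-circuits on the wrapper side, and the non-wrapper path becomes branch-free and as short as possible.
- (Mid-term) ABA resistance for the Treiber stack: replace the remote/spill queues with a DCAS on pointer + generation counter (MT stability/efficiency).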
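The low-frequency statistics flush mentioned above (and the HAKMEM_TINY_STAT_RATE_LG=14 setting, which flushes once every 2^14 events) can be illustrated with a small model. This is only a sketch of one common way to do sampled flushing, written in Python for readability; it is not the allocator's C code, and the class name `SampledStat` and its fields are hypothetical.

```python
class SampledStat:
    """Tally events locally; publish to a shared total once every 2**rate_lg events."""

    def __init__(self, rate_lg):
        self.period = 1 << rate_lg
        self.mask = self.period - 1
        self.pending = 0   # thread-local tally, cheap to bump on the hot path
        self.shared = 0    # stands in for the shared statistic (an atomic in real code)
        self.flushes = 0   # how many times the expensive shared write happened

    def record(self):
        self.pending += 1
        # A power-of-two period lets the check be a single AND instead of a modulo.
        if (self.pending & self.mask) == 0:
            self.flush()

    def flush(self):
        self.shared += self.pending
        self.pending = 0
        self.flushes += 1


if __name__ == "__main__":
    stat = SampledStat(rate_lg=14)
    for _ in range(40_000):
        stat.record()
    print(stat.flushes, stat.shared, stat.pending)
```

The trade-off is that the shared total lags by up to one period of events, in exchange for touching shared state only once per 2^rate_lg operations.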
How to Reproduce
- TinyHot triad: SKIP_BUILD=1 bash scripts/run_tiny_hot_triad.sh 80000
- RandomMixed: bash scripts/run_random_mixed_matrix.sh 100000
- Mixed quick sweep: bash scripts/sweep_mixed_quick.sh 60000
- Comprehensive pair: bash scripts/run_comprehensive_pair.sh
- PMU overview (falls back from USDT): PERF_BIN=$(command -v perf) bash scripts/run_usdt_overview.sh 40000; then python3 scripts/parse_usdt_stat.py bench_results/usdt_YYYYMMDD_HHMMSS
Environment Notes
- WSL kernel (5.15.167.4-microsoft-standard-WSL2) blocks perf sdt:… USDT; use PMU-only on this machine. For USDT, use a native Linux kernel with tracefs + proper perf tools.
Addendum — PGO + 32/64B specialization A/B (perf)
- Build: make pgo-hot-profile && make pgo-hot-build (Strict Front)
- perf stat (32B, batch=100, 50k cycles)
  - Baseline (spec=OFF): cycles=239,571,393; instructions=1,734,394,667
  - Specialize (spec=ON): cycles=235,875,647; instructions=1,693,762,017
  - Delta: cycles −1.5%, instructions −2.3%
- perf stat (64B, batch=100, 50k cycles)
  - Baseline (spec=OFF): cycles=237,616,584; instructions=1,733,704,932
  - Specialize (spec=ON): cycles=233,434,688; instructions=1,693,469,923
  - Delta: cycles −1.8%, instructions −2.3%
- Throughput (TinyHot triad, 60k cycles, hakmem only)
  - 32B batch=100: 239.00 → 239.72 M ops/s (+0.3%)
  - 64B batch=100: 241.76 → 244.20 M ops/s (+1.0%)
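Notes: On top of PGO + Strict Front, the 32/64B specialization cuts the instruction count by about 2%. Measured throughput improves only slightly. The next step is to keep minimizing front-end writes and tuning refill frequency to lower insns/op further.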
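The deltas in this addendum follow directly from the raw counters. The sketch below recomputes them as signed percent changes (negative means fewer cycles or instructions with specialization on); the counter values are copied from the perf stat lines above, and the names `PERF`, `THROUGHPUT`, and `pct_change` are illustrative only.

```python
# (cycles, instructions) from perf stat, batch=100, 50k cycles.
PERF = {
    32: {"off": (239_571_393, 1_734_394_667), "on": (235_875_647, 1_693_762_017)},
    64: {"off": (237_616_584, 1_733_704_932), "on": (233_434_688, 1_693_469_923)},
}

# M ops/s before -> after specialization (TinyHot triad, 60k cycles).
THROUGHPUT = {32: (239.00, 239.72), 64: (241.76, 244.20)}


def pct_change(before, after):
    """Signed percent change from `before` to `after`."""
    return (after - before) / before * 100.0


def perf_deltas(perf):
    """Return {size: (cycles_pct, instructions_pct)} for spec=ON vs spec=OFF."""
    out = {}
    for size, runs in perf.items():
        (c0, i0), (c1, i1) = runs["off"], runs["on"]
        out[size] = (pct_change(c0, c1), pct_change(i0, i1))
    return out


if __name__ == "__main__":
    for size, (cycles, insns) in perf_deltas(PERF).items():
        before, after = THROUGHPUT[size]
        print(f"{size}B: cycles {cycles:+.1f}%, instructions {insns:+.1f}%, "
              f"throughput {pct_change(before, after):+.1f}%")
```

The roughly 2% instruction reduction translates into a smaller throughput gain, consistent with the notes above.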