hakmem/docs/benchmarks/BENCH_RESULTS_2025_10_29.md

Bench Results — 2025-10-29

Summary

  • TinyHot (direct-link, triad): HAKMEM is ~240–246 M ops/s at 8–128B; System malloc ~315–330 M; mimalloc ~555–630 M.
  • RandomMixed (direct-link, ws=200/400/800, 100k cycles): HAKMEM ~24.8–25.3 M; System ~26.0–26.5 M; mimalloc ~26.6–27.0 M.
  • Comprehensive pair (direct-link): HAKMEM ~235–246 M across small tests; mimalloc ~900–980 M. HAKMEM mixed: ~234.5 M; mimalloc mixed: ~876.5 M.

Key CSVs

  • TinyHot triad: bench_results/tiny_hot_triad_20251029_112655/results.csv
  • TinyHot triad (Minimal Front build): bench_results/tiny_hot_triad_20251029_112934/results.csv
  • RandomMixed matrix: bench_results/random_mixed_20251029_112713/results.csv
  • Comprehensive pair (HAKMEM vs mimalloc): bench_results/comp_pair_20251029_112732/summary.csv
  • Mixed quick sweep: bench_results/sweep_mixed_quick_20251029_112832/results.csv
  • TinyHot triad (post-refine 12:42): bench_results/tiny_hot_triad_20251029_124209/results.csv
  • TinyHot triad (post-PGO 13:14): bench_results/tiny_hot_triad_20251029_131457/results.csv
  • perf stat (post-PGO 13:14): bench_results/perf_hot_triad_20251029_1314{22,57}/hakmem_s{32,64}_b100_c50000.perf.csv
  • TinyHot triad (14:06): bench_results/tiny_hot_triad_20251029_140637/results.csv
  • RandomMixed matrix (14:06): bench_results/random_mixed_20251029_140651/results.csv
  • Bench-fastpath PGO triad (14:50): bench_results/tiny_hot_triad_20251029_145020/results.csv
  • Bench-fastpath sweep (r8/r12/r16, 15:08): bench_results/tiny_benchfast_sweep_20251029_150802/
  • Bench SLL-only + warmup + PGO (15:25): bench_results/tiny_hot_triad_20251029_152510/results.csv
  • Bench SLL-only tuned (REFILL32=12, WARMUP32=192, 15:27): bench_results/tiny_hot_triad_20251029_152738/results.csv

Notable Findings

  • TinyHot gap: HAKMEM trails System by 70–80 M (a few M better than before) and trails mimalloc by 2.3–2.5× at 32/64B, batch=100.
  • Minimal Front build trims front tiers but gives only micro gains on this box (~+0–3 M). Instruction count remains the limiter.
  • RandomMixed: HAKMEM is 1.0–2.0 M behind System/mimalloc; L1 misses don't dominate, so extra instructions/branches in the back-path are the likely causes.
  • Bench-fastpath (bench-only straight-line path + PGO): peaks at 358.4 M at 32B/b100/30k, exceeding System (312.6 M). The 8–24B band also reaches 310–350 M.
  • Refill A/B (r8/r12/r16): at 32B, r16≈267.4 M and r8≈266.7 M are nearly tied; at 64B, r12≈266.8 M is best (non-PGO, individual comparisons).
  • Bench SLL-only + warmup + PGO: above 400 M at 8–24B; 32B/b100 falls in the 388.7–429.2 M range (depending on parameters/PGO).
    • Representative: 32B/b100=429.18 M (System=312.55 M, mimalloc=588.31 M)
  • USDT is unavailable on the current kernel (WSL); scripts auto-fallback to PMU. Overview summary is PMU-only.

RandomMixed Update (13:38)

  • Preset: rmax=96, rmaxh=192, spill_hyst=16 (recommended)
  • ws=200: H=24.65/24.75M, S=25.91/25.65M, mi=26.48/26.50M
  • ws=400: H=24.89/24.86M, S=25.68/25.99M, mi=26.59/26.73M
  • ws=800: H=25.00/24.59M, S=25.85/25.98M, mi=26.61/26.62M
  • CSV: bench_results/random_mixed_20251029_133834/results.csv
  • Summary: RandomMixed is now close behind System (3–5%), and the gap to mimalloc is 6–9%. HAKMEM has been steadily catching up.
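
The percentage gaps in the summary can be recomputed from the ws=200/400/800 figures above. The following is a minimal sketch, assuming the two values listed per allocator are paired measurements (the source does not say what distinguishes them); the helper names `gap_pct` and `gaps` are illustrative only.

```python
# Throughput in M ops/s, copied from the ws=200/400/800 lines above.
# Assumption: the two numbers per allocator are paired measurements.
RESULTS = {
    200: {"hakmem": (24.65, 24.75), "system": (25.91, 25.65), "mimalloc": (26.48, 26.50)},
    400: {"hakmem": (24.89, 24.86), "system": (25.68, 25.99), "mimalloc": (26.59, 26.73)},
    800: {"hakmem": (25.00, 24.59), "system": (25.85, 25.98), "mimalloc": (26.61, 26.62)},
}


def gap_pct(ours, theirs):
    """Percent by which `ours` trails `theirs`."""
    return (theirs - ours) / theirs * 100.0


def gaps(baseline):
    """All HAKMEM-vs-baseline gaps across working-set sizes and pairs."""
    out = []
    for row in RESULTS.values():
        for h, b in zip(row["hakmem"], row[baseline]):
            out.append(gap_pct(h, b))
    return out


if __name__ == "__main__":
    for name in ("system", "mimalloc"):
        g = gaps(name)
        print(f"HAKMEM vs {name}: {min(g):.1f}% to {max(g):.1f}% behind")
```

Computing the gap per pair rather than from averages keeps the spread visible, which matters here because the differences between allocators are only a few percent.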

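Post-PGO Update (13:14)

  • TinyHot (80k cycles, hakmem only, batch=100): 8B=245.58M, 16B=245.86M, 32B=240.81M, 64B=242.31M
  • Trend: eliminating getenv calls on the free path, reducing SLL branches, and removing statistics branches yields a slight gain of a few M at each size (an improvement that is still within run-to-run noise).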

Quick A/B (RandomMixed) — Best Preset Observed

  • rmax=96, rmaxh=192, spill_hyst=16 at ws=400, seed=42, cycles=60k:
    • HAKMEM: 26.06 M; System: 27.36 M; mimalloc: 27.84 M
  • See: bench_results/sweep_mixed_quick_20251029_112832/results.csv

Recommended Presets (direct-link)

  • TinyHot: HAKMEM_TINY_TLS_SLL=1, HAKMEM_TINY_MAG_CAP=128 (for 64B, A/B against 512), HAKMEM_TINY_REMOTE_DRAIN_TRYRATE=0
  • TinyHot (bench-only): -DHAKMEM_TINY_BENCH_FASTPATH=1 (≤64B), apply PGO, and A/B the refill count starting from 32B=16, 64B=12
  • TinyHot (bench-only, SLL-only; recommended):
    • Build: -DHAKMEM_TINY_BENCH_FASTPATH=1 -DHAKMEM_TINY_BENCH_SLL_ONLY=1 -DHAKMEM_TINY_BENCH_TINY_CLASSES=3
    • Warmup (fill the SLL on the first call only): 8=64, 16=96, 32=160–192, 64=192 (A/B)
    • Refill (per class): REFILL32=12 works well (for 64B, A/B from the default 8 up to 12)
    • PGO: collect profiles at 8/16/32/64B (batch=100, cycles=60k), then optimize
  • Mixed: HAKMEM_TINY_REFILL_MAX=96, HAKMEM_TINY_REFILL_MAX_HOT=192, HAKMEM_TINY_SPILL_HYST=16 (near the best on this box)
  • Statistics sampling (optional): build with -DHAKMEM_TINY_STAT_SAMPLING; at run time set, for example, HAKMEM_TINY_STAT_RATE_LG=14 (flush once every 2^14 events)
  • 8/16B specialization (optional): to A/B only 16B, use HAKMEM_TINY_SPECIALIZE_MASK=0x02 (results on this box depend on the situation; leaving it at the default OFF is recommended)
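
The 2^14 figure in the statistics-sampling preset can be read as a power-of-two mask on a local counter. The sketch below is an illustration in Python, not HAKMEM's actual code: the class and its fields are hypothetical, and it only assumes that a rate of 14 means "flush once every 2^14 events", as the preset states.

```python
class SampledCounter:
    """Hypothetical model: count locally, flush once every 2**rate_lg events."""

    def __init__(self, rate_lg):
        self.mask = (1 << rate_lg) - 1  # e.g. rate_lg=14 -> 0x3FFF
        self.local = 0                  # cheap, thread-local style count
        self.flushed_total = 0          # stands in for the shared statistic
        self.flushes = 0

    def record(self):
        self.local += 1
        # Only one event in 2**rate_lg pays for the shared write.
        if (self.local & self.mask) == 0:
            self.flushed_total += self.local
            self.local = 0
            self.flushes += 1


if __name__ == "__main__":
    c = SampledCounter(rate_lg=14)
    for _ in range(100_000):
        c.record()
    print(c.flushes, c.flushed_total, c.local)
```

A mask test avoids a division on the hot path, which is in line with the report's goal of keeping writes and extra instructions out of the front tiers.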

What Changed Since 10/28

  • Targeted remote-drain queue implemented; BG remote scan replaced with per-class target list (off by default; env-tunable).
  • Background spill queue integrated (off by default); spill hysteresis and batch lower-bound added.
  • Minimal/Strict Front compile-time gates wired; size-specialized 32/64B mag-pop path (bench A/B) in place.
  • Scripts for triad/mixed/pair and PMU overview are stable and saving CSVs under bench_results/…

Next Steps (perf focus)

  • TinyHot: further reduce insns/op in the first 3 tiers.
    • Keep front simple: SLL → small TLS mag pop → regular mag. Avoid fastpath writes; sample/flush counters at low frequency only.
    • Consider 32/64B size-specialized inline pops + PGO (use pgo-hot-profile/build) and re-measure perf stat.
  • Mixed: fewer refills and narrower back-path work per cycle.
    • Sweep larger REFILL_MAX(HOT) and refine SPILL_HYST; class-specific tables for hot classes.
    • Keep BG_REMOTE off on this box; prefer targeted queue only when needed.

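Supplement: Narrowing the TinyHot Gap

  • Minimize writes rigorously: the TLS mag-pop updates only top. Statistics/owner are flushed at low frequency (already implemented); keep strengthening this.
  • Always-inline size specialization + PGO: limit it to 16/32/64B and fix the instruction sequence (for 8B, off is recommended on this box).
  • Small magazine (8/16/32B, A/B): use a small 128-entry magazine to improve L1 residency and reduce SLL/regular-magazine transitions.
  • Move the wrapper check out of the entry path: re-entry short-circuits on the wrapper side, and the non-wrapper path is kept branch-free and as short as possible.
  • (Mid-term) ABA resistance for the Treiber stack: replace the remote/spill queues with a DCAS on pointer + generation counter (MT stability/efficiency).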
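
To make the pointer + generation counter idea concrete, here is a single-threaded Python simulation of the ABA scenario on a Treiber-style stack. It is a sketch under stated assumptions, not the HAKMEM queue: `Node`, `TaggedStack`, and `cas_head` are hypothetical names, and the simulated compare-and-swap stands in for a hardware double-width CAS that a real C implementation would need.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


class TaggedStack:
    """Treiber-style stack whose head is a (node, generation) pair."""

    def __init__(self):
        self.head = (None, 0)

    def cas_head(self, expected, new_node):
        """Simulated double-width CAS: both node and generation must match."""
        node, gen = self.head
        if node is expected[0] and gen == expected[1]:
            self.head = (new_node, gen + 1)  # every success bumps the generation
            return True
        return False

    def push(self, node):
        while True:
            snap = self.head
            node.next = snap[0]
            if self.cas_head(snap, node):
                return

    def pop(self):
        while True:
            snap = self.head
            if snap[0] is None:
                return None
            if self.cas_head(snap, snap[0].next):
                return snap[0]


def demo_aba():
    s = TaggedStack()
    a, b = Node("A"), Node("B")
    s.push(b)
    s.push(a)                      # stack: A -> B
    stale = s.head                 # "thread 1" reads head and next, then stalls
    stale_next = stale[0].next
    s.pop()
    s.pop()                        # "thread 2" pops A and B ...
    s.push(a)                      # ... and pushes A back
    same_pointer = s.head[0] is stale[0]
    swapped = s.cas_head(stale, stale_next)  # thread 1 resumes with stale data
    return same_pointer, swapped


if __name__ == "__main__":
    print(demo_aba())
```

In the demo the head node is the same object as in the stale snapshot, so a pointer-only comparison would accept the swap and reinstall the already-popped node B. The generation field is what makes the stale attempt fail.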

How to Reproduce

  • TinyHot triad: SKIP_BUILD=1 bash scripts/run_tiny_hot_triad.sh 80000
  • RandomMixed: bash scripts/run_random_mixed_matrix.sh 100000
  • Mixed quick sweep: bash scripts/sweep_mixed_quick.sh 60000
  • Comprehensive pair: bash scripts/run_comprehensive_pair.sh
  • PMU overview (falls back from USDT): PERF_BIN=$(command -v perf) bash scripts/run_usdt_overview.sh 40000; then python3 scripts/parse_usdt_stat.py bench_results/usdt_YYYYMMDD_HHMMSS
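
For runs that should use the recommended Mixed preset, the environment variables can be attached to the RandomMixed command programmatically. This is a small sketch, assuming the script passes inherited environment variables through to the benchmark binary (not confirmed by this report); `MIXED_PRESET` and `mixed_run` are hypothetical helper names.

```python
import os

# Values taken from the "Mixed" entry under Recommended Presets.
MIXED_PRESET = {
    "HAKMEM_TINY_REFILL_MAX": "96",
    "HAKMEM_TINY_REFILL_MAX_HOT": "192",
    "HAKMEM_TINY_SPILL_HYST": "16",
}


def mixed_run(cycles=100000, base_env=None):
    """Return (command, environment) for a RandomMixed run with the preset."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(MIXED_PRESET)
    cmd = ["bash", "scripts/run_random_mixed_matrix.sh", str(cycles)]
    return cmd, env


if __name__ == "__main__":
    cmd, env = mixed_run()
    # To execute from the repository root:
    #   subprocess.run(cmd, env=env, check=True)
    print(" ".join(f"{k}={v}" for k, v in MIXED_PRESET.items()), " ".join(cmd))
```

Returning the command and environment instead of running them keeps the helper easy to inspect and lets the caller decide where and when to launch the benchmark.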

Environment Notes

  • WSL kernel (5.15.167.4-microsoft-standard-WSL2) blocks perf sdt:… USDT; use PMU-only on this machine. For USDT, use a native Linux kernel with tracefs + proper perf tools.

Addendum — PGO + 32/64B specialization A/B (perf)

  • Build: make pgo-hot-profile && make pgo-hot-build (Strict Front)
  • perf stat (32B, batch=100, 50k cycles)
    • Baseline (spec=OFF): cycles=239,571,393; instructions=1,734,394,667
    • Specialize (spec=ON): cycles=235,875,647; instructions=1,693,762,017
    • Delta: cycles -1.5%, instructions -2.3%
  • perf stat (64B, batch=100, 50k cycles)
    • Baseline (spec=OFF): cycles=237,616,584; instructions=1,733,704,932
    • Specialize (spec=ON): cycles=233,434,688; instructions=1,693,469,923
    • Delta: cycles -1.8%, instructions -2.3%
  • Throughput (TinyHot triad, 60k cycles, hakmem only)
    • 32B batch=100: 239.00 → 239.72 M ops/s (+0.3%)
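    • 64B batch=100: 241.76 → 244.20 M ops/s (+1.0%)
  • Notes: on top of PGO + Strict Front, the 32/64B specialization cuts the instruction count by about 2%, and observed throughput improves only slightly. Next, layer on write minimization in the front tiers and refill-frequency tuning to reduce insns/op further.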
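
The deltas in this addendum follow directly from the raw perf stat counts and throughput figures. The snippet below recomputes them as a cross-check; the data is copied from the bullets above, and `pct_change` is just an illustrative helper.

```python
# (baseline spec=OFF, specialized spec=ON), copied from the perf stat bullets.
PERF = {
    "32B": {"cycles": (239_571_393, 235_875_647),
            "instructions": (1_734_394_667, 1_693_762_017)},
    "64B": {"cycles": (237_616_584, 233_434_688),
            "instructions": (1_733_704_932, 1_693_469_923)},
}

# TinyHot triad throughput in M ops/s (before, after).
THROUGHPUT = {"32B": (239.00, 239.72), "64B": (241.76, 244.20)}


def pct_change(before, after):
    """Signed percent change from `before` to `after`."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for size, counters in PERF.items():
        for name, (off, on) in counters.items():
            print(f"{size} {name}: {pct_change(off, on):+.1f}%")
    for size, (off, on) in THROUGHPUT.items():
        print(f"{size} throughput: {pct_change(off, on):+.1f}%")
```

The instruction count falls by more than the cycle count, and throughput moves by about one percent or less, which matches the note that the specialization helps only modestly.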