Bench Results: 2025-10-29

Summary
- Tiny‑Hot (direct link, triad): HAKMEM is ~240–246 M ops/s at 8–128B; System malloc ~315–330 M; mimalloc ~555–630 M.
- Random‑Mixed (direct link, ws=200/400/800, 100k cycles): HAKMEM ~24.8–25.3 M; System ~26.0–26.5 M; mimalloc ~26.6–27.0 M.
- Comprehensive pair (direct link): HAKMEM ~235–246 M across small tests; mimalloc ~900–980 M. HAKMEM mixed: ~234.5 M; mimalloc mixed: ~876.5 M.

Key CSVs
- Tiny‑Hot triad: bench_results/tiny_hot_triad_20251029_112655/results.csv
- Tiny‑Hot triad (Minimal Front build): bench_results/tiny_hot_triad_20251029_112934/results.csv
- Random‑Mixed matrix: bench_results/random_mixed_20251029_112713/results.csv
- Comprehensive pair (HAKMEM vs mimalloc): bench_results/comp_pair_20251029_112732/summary.csv
- Mixed quick sweep: bench_results/sweep_mixed_quick_20251029_112832/results.csv
- Tiny‑Hot triad (post‑refine 12:42): bench_results/tiny_hot_triad_20251029_124209/results.csv
- Tiny‑Hot triad (post‑PGO 13:14): bench_results/tiny_hot_triad_20251029_131457/results.csv
- perf stat (post‑PGO 13:14): bench_results/perf_hot_triad_20251029_1314{22,57}/hakmem_s{32,64}_b100_c50000.perf.csv
- Tiny‑Hot triad (14:06): bench_results/tiny_hot_triad_20251029_140637/results.csv
- Random‑Mixed matrix (14:06): bench_results/random_mixed_20251029_140651/results.csv
- Bench‑fastpath PGO triad (14:50): bench_results/tiny_hot_triad_20251029_145020/results.csv
- Bench‑fastpath sweep (r8/r12/r16, 15:08): bench_results/tiny_benchfast_sweep_20251029_150802/
- Bench SLL‑only + warmup + PGO (15:25): bench_results/tiny_hot_triad_20251029_152510/results.csv
- Bench SLL‑only tuned (REFILL32=12, WARMUP32=192, 15:27): bench_results/tiny_hot_triad_20251029_152738/results.csv

Notable Findings
- Tiny‑Hot gap: HAKMEM trails System by ~70–80 M ops/s (a few M better than before) and trails mimalloc by ~2.3–2.5× at 32/64B, batch=100.
- Minimal Front build trims front tiers but gives only micro gains on this box (~+0–3 M). Instruction count remains the limiter.
- Random‑Mixed: HAKMEM is 1.0–2.0 M behind System/mimalloc. L1 misses don't dominate; extra instructions and branches in the back path are the likely cause.
- Bench‑fastpath (bench‑only straight‑line path + PGO): peaks at 358.4M at 32B/b100/30k (above System at 312.6M). The 8–24B range also reaches 310–350M.
  - Refill A/B (r8/r12/r16): at 32B, r16≈267.4M and r8≈266.7M are nearly tied; at 64B, r12≈266.8M is best (individual non‑PGO comparisons).
- Bench SLL‑only + warmup + PGO: above 400M at 8–24B; 32B/b100 ranges from 388.7M to 429.2M depending on parameters and PGO.
  - Representative: 32B/b100=429.18M (System=312.55M, mimalloc=588.31M)
- USDT is unavailable on the current kernel (WSL); scripts auto‑fallback to PMU. Overview summary is PMU‑only.

Random‑Mixed Update (13:38)
- Preset: rmax=96, rmaxh=192, spill_hyst=16 (recommended)
- ws=200: H=24.65/24.75M, S=25.91/25.65M, mi=26.48/26.50M
- ws=400: H=24.89/24.86M, S=25.68/25.99M, mi=26.59/26.73M
- ws=800: H=25.00/24.59M, S=25.85/25.98M, mi=26.61/26.62M
- CSV: bench_results/random_mixed_20251029_133834/results.csv
- Summary: Random‑Mixed is now close to System (gap ~3–5%); the gap to mimalloc is ~6–9%. HAKMEM is consistently catching up.

Post‑PGO Update (13:14)
- Tiny‑Hot (80k cycles, hakmem only, batch=100): 8B=245.58M, 16B=245.86M, 32B=240.81M, 64B=242.31M
- Trend: removing getenv calls from the free path, cutting SLL branches, and eliminating statistics branches gave a small gain of a few M at each size (an improvement, but within run‑to‑run variance).

Quick A/B (Random‑Mixed): Best Preset Observed
- rmax=96, rmaxh=192, spill_hyst=16 at ws=400, seed=42, cycles=60k:
  - HAKMEM: 26.06 M; System: 27.36 M; mimalloc: 27.84 M
- See: bench_results/sweep_mixed_quick_20251029_112832/results.csv

Recommended Presets (direct‑link)
- Tiny‑Hot: HAKMEM_TINY_TLS_SLL=1, HAKMEM_TINY_MAG_CAP=128 (A/B 512 for 64B), HAKMEM_TINY_REMOTE_DRAIN_TRYRATE=0
- Tiny‑Hot (bench‑only): -DHAKMEM_TINY_BENCH_FASTPATH=1 (≤64B), apply PGO, and A/B the refill count starting from 32B=16, 64B=12
- Tiny‑Hot (bench‑only, SLL‑only recommended):
  - Build: -DHAKMEM_TINY_BENCH_FASTPATH=1 -DHAKMEM_TINY_BENCH_SLL_ONLY=1 -DHAKMEM_TINY_BENCH_TINY_CLASSES=3
  - Warmup (fills the SLL on first use only): 8=64, 16=96, 32=160–192, 64=192 (A/B)
  - Refill (per class): REFILL32=12 works well (for 64B, A/B from the default 8 up to 12)
  - PGO: collect profiles at 8/16/32/64B (batch=100, cycles=60k), then build the optimized binary
- Mixed: HAKMEM_TINY_REFILL_MAX=96, HAKMEM_TINY_REFILL_MAX_HOT=192, HAKMEM_TINY_SPILL_HYST=16 (near the best observed on this box)
- Statistics sampling (optional): build with -DHAKMEM_TINY_STAT_SAMPLING; at run time set e.g. HAKMEM_TINY_STAT_RATE_LG=14 (flush once every 2^14 events)
- 8/16B specialization (optional): to A/B only 16B, set HAKMEM_TINY_SPECIALIZE_MASK=0x02 (results on this box vary; leaving it at the default OFF is recommended)

What Changed Since 10/28
- Targeted remote‑drain queue implemented; BG remote scan replaced with per‑class target list (off by default; env‑tunable).
- Background spill queue integrated (off by default); spill hysteresis and batch lower‑bound added.
- Minimal/Strict Front compile‑time gates wired; size‑specialized 32/64B mag‑pop path (bench A/B) in place.
- Scripts for triad/mixed/pair and PMU overview are stable and saving CSVs under bench_results/…

Next Steps (perf focus)
- Tiny‑Hot: further reduce insns/op in the first 3 tiers.
  - Keep front simple: SLL → small TLS mag pop → regular mag. Avoid fast‑path writes; sample/flush counters at low frequency only.
  - Consider 32/64B size‑specialized inline pops + PGO (use pgo-hot-profile/build) and re‑measure perf stat.
- Mixed: fewer refills and narrower back‑path work per cycle.
  - Sweep larger REFILL_MAX(HOT) and refine SPILL_HYST; class‑specific tables for hot classes.
  - Keep BG_REMOTE off on this box; prefer targeted queue only when needed.

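
The "sample/flush counters at low frequency only" idea, together with the HAKMEM_TINY_STAT_RATE_LG=14 setting, can be illustrated with a small Python model. This is only a sketch of the general technique and not HAKMEM's actual C implementation: the class name SampledCounter, its fields, and the mask-based flush condition are all assumptions made for illustration.

```python
class SampledCounter:
    """Hypothetical model: count events privately, publish in batches.

    With rate_lg=14 the shared total is touched once per 2**14 events,
    so the hot path avoids a shared write on almost every call.
    """

    def __init__(self, rate_lg: int = 14) -> None:
        self.mask = (1 << rate_lg) - 1  # low rate_lg bits; zero means "time to flush"
        self.local = 0         # private event count (stands in for a TLS counter)
        self.pending = 0       # events not yet published
        self.shared_total = 0  # stands in for the shared statistics slot
        self.flushes = 0       # how many times the shared slot was written

    def record(self) -> None:
        # Hot path: two private increments and one mask test.
        self.local += 1
        self.pending += 1
        if (self.local & self.mask) == 0:
            self.flush()

    def flush(self) -> None:
        # Cold path: publish the batch to the shared slot.
        self.shared_total += self.pending
        self.pending = 0
        self.flushes += 1


if __name__ == "__main__":
    counter = SampledCounter(rate_lg=4)  # flush every 16 events for the demo
    for _ in range(40):
        counter.record()
    print(counter.shared_total, counter.pending, counter.flushes)  # 32 8 2
```

The trade-off in this model is that the shared total lags by up to 2^rate_lg − 1 events until the next flush, which is usually acceptable for statistics.
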
Toward Closing the Tiny‑Hot Gap (Supplement)
- Minimize writes rigorously: the TLS mag‑pop updates only top. Statistics and owner fields are flushed at low frequency (already in place; keep strengthening).
- Always‑inline size specialization + PGO: restrict to 16/32/64B and pin the instruction sequence (8B is best left off on this box).
- Small magazine (8/16/32B) A/B: use a small 128‑entry magazine to improve L1 residency and reduce transitions between the SLL and the regular magazine.
- Move the wrapper check out of the entry path: re‑entrant calls are short‑circuited on the wrapper side, and the non‑wrapper path is kept branch‑free and as short as possible.
- (Mid‑term) ABA resistance for the Treiber stack: replace the remote/spill queues with a DCAS (double‑width CAS) on a pointer + generation counter pair, for MT stability and efficiency.

How to Reproduce
- Tiny‑Hot triad: SKIP_BUILD=1 bash scripts/run_tiny_hot_triad.sh 80000
- Random‑Mixed: bash scripts/run_random_mixed_matrix.sh 100000
- Mixed quick sweep: bash scripts/sweep_mixed_quick.sh 60000
- Comprehensive pair: bash scripts/run_comprehensive_pair.sh
- PMU overview (falls back from USDT): PERF_BIN=$(command -v perf) bash scripts/run_usdt_overview.sh 40000; then python3 scripts/parse_usdt_stat.py bench_results/usdt_YYYYMMDD_HHMMSS

Environment Notes
- WSL kernel (5.15.167.4‑microsoft‑standard‑WSL2) blocks perf sdt:… USDT; use PMU‑only on this machine. For USDT, use a native Linux kernel with tracefs + proper perf tools.

Addendum: PGO + 32/64B specialization A/B (perf)
- Build: make pgo-hot-profile && make pgo-hot-build (Strict Front)
- perf stat (32B, batch=100, 50k cycles)
  - Baseline (spec=OFF): cycles=239,571,393; instructions=1,734,394,667
  - Specialize (spec=ON): cycles=235,875,647; instructions=1,693,762,017
  - Delta: cycles −1.5%, instructions −2.3%
- perf stat (64B, batch=100, 50k cycles)
  - Baseline (spec=OFF): cycles=237,616,584; instructions=1,733,704,932
  - Specialize (spec=ON): cycles=233,434,688; instructions=1,693,469,923
  - Delta: cycles −1.8%, instructions −2.3%
- Throughput (Tiny‑Hot triad, 60k cycles, hakmem only)
  - 32B batch=100: 239.00 → 239.72 M ops/s (+0.3%)
  - 64B batch=100: 241.76 → 244.20 M ops/s (+1.0%)

Notes: On top of PGO + Strict Front, 32/64B specialization cuts the instruction count by about 2%. Observed throughput improves only slightly. Next, keep minimizing front‑path writes and tuning refill frequency to drive insns/op down further.
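
The deltas in the addendum are plain relative changes, (after − before) / before. A minimal Python sketch that recomputes them from the perf stat and throughput figures quoted above; the helper name pct_change and the MEASUREMENTS table layout are illustrative assumptions, not part of the repo's scripts.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (spec=OFF, spec=ON) pairs copied from the addendum.
MEASUREMENTS = {
    "32B cycles": (239_571_393, 235_875_647),
    "32B instructions": (1_734_394_667, 1_693_762_017),
    "64B cycles": (237_616_584, 233_434_688),
    "64B instructions": (1_733_704_932, 1_693_469_923),
    "32B throughput (M ops/s)": (239.00, 239.72),
    "64B throughput (M ops/s)": (241.76, 244.20),
}

if __name__ == "__main__":
    for name, (off, on) in MEASUREMENTS.items():
        print(f"{name}: {pct_change(off, on):+.1f}%")
```

Rounded to one decimal place, this gives −1.5% and −2.3% for 32B cycles and instructions, −1.8% and −2.3% for 64B, and +0.3% and +1.0% for throughput, matching the figures listed.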