# HAKMEM Build Guide

## Quick Start

### Normal Build

```bash
make bench_comprehensive_hakmem
./bench_comprehensive_hakmem
```

**Expected**: 200-220M ops/sec

### PGO-Optimized Build (Recommended)

```bash
./build_pgo.sh
./bench_comprehensive_hakmem
```

**Expected**: 300-350M ops/sec (+50-75% faster!)

### Shared Library (LD_PRELOAD) PGO Build

```bash
# Step 1: Collect a profile using the instrumented shared library
make pgo-profile-shared

# Step 2: Build the PGO-optimized shared library
make pgo-build-shared

# Run (preload into the system-malloc benchmark binary)
HAKMEM_WRAP_TINY=1 LD_PRELOAD=./libhakmem.so ./bench_comprehensive_system
```

**Expected**: Faster than the normal build even as a shared library (results vary by environment)

---

## PGO Build Script

### Usage

```bash
./build_pgo.sh [command]
```

### Commands

| Command | Description | When to Use |
|---------|-------------|-------------|
| `all` | Full PGO build (default) | First time, or after code changes |
| `clean` | Clean previous builds | Before rebuilding |
| `profile` | Build instrumented + collect profile | Step 1: Profile collection |
| `build` | Build optimized using profile | Step 2: After profile exists |

### Example Workflow

#### Full automated build (recommended)

```bash
./build_pgo.sh
```

#### Manual step-by-step

```bash
# Step 1: Collect profile
./build_pgo.sh profile

# Step 2: Build optimized
./build_pgo.sh build
```

---

## Performance Comparison

| Build Type | 128B Long-lived | Best Result | Use Case |
|------------|----------------|-------------|----------|
| **Normal** | 210M ops/s | 222M ops/s | Debug, development |
| **PGO** | 314M ops/s | 342M ops/s | Production, benchmarks |

Latest (Phase 9.3+ tiny fast-path):

- Direct (PGO): 400M+ ops/s confirmed (`bench_comprehensive_hakmem`)
- System malloc baseline: ~410M ops/s (environment-dependent)

## What is PGO?

**Profile-Guided Optimization (PGO)** is a compiler optimization technique:

1. **Phase 1 (Profile)**: Build an instrumented binary and run a representative workload
2. **Phase 2 (Optimize)**: Rebuild with the profile data so the compiler can optimize hot paths

**Benefits**:

- Better branch prediction
- Improved code layout (hot paths together)
- Inlining decisions based on actual usage
- +50-75% performance improvement (in the benchmarks above)

## Requirements

- GCC with LTO/PGO support (gcc 7+)
- ~2 minutes for a full PGO build
- 200MB disk space for profile data (`*.gcda` files)

## Troubleshooting

### "Profile data not generated"

```bash
# Make sure you run the instrumented binary
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem
```

### "No profile data found"

```bash
# Run profile step first
./build_pgo.sh profile
```

### Clean start

```bash
./build_pgo.sh clean
./build_pgo.sh all
```

### Benchmark Tips (Tiny)

- Enable Tiny: `export HAKMEM_WRAP_TINY=1`
- Observation and learning features are OFF by default (to avoid overhead). Set the environment variables only when you want them ON.
  - Learning: `HAKMEM_LEARN=1`
  - By default, `HAKMEM_SITE_RULES` and `HAKMEM_PROF` are unset (OFF)

### Tiny Modes and FLINT (Front End + Deferred Intelligence)

- Ultra Tiny (SLL-only, experimental)
  - Enable: `HAKMEM_TINY_ULTRA=1`
  - Validation on/off: `HAKMEM_TINY_ULTRA_VALIDATE=0/1` (0 recommended when measuring performance)
  - Parameters (per-class overrides):
    - `HAKMEM_TINY_ULTRA_BATCH_C{0..7}=N`
    - `HAKMEM_TINY_ULTRA_SLL_CAP_C{0..7}=N`
  - Visualization: `bash scripts/run_ultra_debug_sweep.sh 60000 200`
- FLINT (Fast Lightweight INTelligence, experimental)
  - FRONT (ultra-lightweight FastCache): `HAKMEM_TINY_FRONTEND=1`
  - INT (deferred intelligence: event aggregation + background thread): `HAKMEM_INT_ENGINE=1`
  - Note: Keeps the hot path minimal and makes learning asynchronous. Currently experimental (A/B comparison against Ultra and the normal mode is recommended)
- TinyQuickSlot (minimal front end): `HAKMEM_TINY_QUICK=1`
  - A 6-entry stack occupying 64B per class. A hit touches only one cache line before returning.
  - On a miss, a small refill (SLL→Quick or Magazine→Quick) preserves locality.

### Scripts (with CSV Output)

- Direct-link comprehensive comparison (HAKMEM vs mimalloc): `bash scripts/run_comprehensive_pair.sh`
- Tiny triad (HAKMEM/System/mimalloc): `bash scripts/run_tiny_hot_triad.sh 80000`
- Random mixed triad: `bash scripts/run_random_mixed_matrix.sh 120000`
- Ultra visualization: `bash scripts/run_ultra_debug_sweep.sh 60000 200`
- Ultra parameter sweep: `bash scripts/sweep_ultra_params.sh 40000 150`

### Fast Build Target (Experimental)

```bash
make bench_fast
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem
```

Note: This build strips unwind tables and similar metadata. Using it together with PGO is recommended.

---

## Advanced: Manual PGO Build

If you prefer to use the Makefile directly:

```bash
# Step 1: Profile collection
make pgo-profile

# Step 2: Optimized build
make pgo-build

# Run benchmark
./bench_comprehensive_hakmem
```

---

## Phase 8.4 Achievement

```
🏆 342M ops/s NEW RECORD! (+8.2% vs Step 3d baseline)

Top 5 Results (PGO Build):
1. 64B FIFO:        342M ops/s 🥇
2. 64B Interleaved: 342M ops/s 🥈
3. 64B Long-lived:  342M ops/s 🥉
4. 32B Long-lived:  341M ops/s
5. 128B FIFO:       341M ops/s
```

**Design**: Zero hot-path overhead ACE observer

- Removed all ACE counters from alloc/free paths (600M+ operations)
- Background Learner thread observation (1-second interval)
- Registry-based scan using existing `meta->used` field

---

Generated with [Claude Code](https://claude.com/claude-code) - Phase 8.4
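
### Appendix: Guarding the optimize step (sketch)

The "No profile data found" case in Troubleshooting can be avoided by checking for profile data before running the optimize step. The following is a minimal sketch, not part of `build_pgo.sh`: the function names `has_profile_data` and `pgo_build_guarded` are hypothetical, and it assumes the `*.gcda` files are written somewhere under the directory you pass in (the current directory by default), which may not match your build layout.

```bash
#!/usr/bin/env bash
# Sketch: run the optimize step only when GCC profile data is present.
# Assumption: *.gcda files live somewhere under the given directory.

# has_profile_data DIR
# Succeeds if at least one *.gcda file exists under DIR (default: ".").
has_profile_data() {
    local dir="${1:-.}"
    [ -n "$(find "$dir" -type f -name '*.gcda' 2>/dev/null | head -n 1)" ]
}

# pgo_build_guarded DIR
# Hypothetical wrapper around the documented build_pgo.sh commands:
# collects a profile first if none is found, then builds the optimized binary.
pgo_build_guarded() {
    local dir="${1:-.}"
    if ! has_profile_data "$dir"; then
        echo "No *.gcda files under $dir; collecting a profile first" >&2
        ./build_pgo.sh profile || return 1
    fi
    ./build_pgo.sh build
}
```

To use it, source the file from the repository root and call `pgo_build_guarded`. The check is deliberately loose: it only confirms that some profile data exists, not that it is fresh, so run `./build_pgo.sh clean` first after code changes.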