# HAKMEM Build Guide

## Quick Start

### Normal Build

```bash
make bench_comprehensive_hakmem
./bench_comprehensive_hakmem
```

**Expected**: 200-220M ops/sec

### PGO-Optimized Build (Recommended)

```bash
./build_pgo.sh
./bench_comprehensive_hakmem
```

**Expected**: 300-350M ops/sec (about 50-75% faster than the normal build)

### Shared Library (LD_PRELOAD) PGO Build

```bash
# Step 1: Collect a profile using the instrumented shared library
make pgo-profile-shared

# Step 2: Build the PGO-optimized shared library
make pgo-build-shared

# Run: preload the library into the system-allocator benchmark
HAKMEM_WRAP_TINY=1 LD_PRELOAD=./libhakmem.so ./bench_comprehensive_system
```

**Expected**: faster than the normal build even as a shared library (results vary by environment)
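
With glibc, a mistyped `LD_PRELOAD` path does not stop the program: the dynamic loader prints a warning and the benchmark runs on the system allocator, so the numbers quietly measure the wrong thing. A minimal guard could look like the sketch below. The helper name `run_preloaded` is hypothetical and not part of the HAKMEM scripts.

```bash
# Hypothetical helper: refuse to run when the library to preload is missing.
# $1 = path to the shared library, remaining arguments = command to run.
run_preloaded() {
  lib=$1
  shift
  if [ ! -f "$lib" ]; then
    echo "error: $lib not found; build it first" >&2
    return 1
  fi
  # HAKMEM_WRAP_TINY=1 mirrors the invocation used in this guide.
  HAKMEM_WRAP_TINY=1 LD_PRELOAD="$lib" "$@"
}

# Usage: run_preloaded ./libhakmem.so ./bench_comprehensive_system
```
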

---

## PGO Build Script

### Usage

```bash
./build_pgo.sh [command]
```

### Commands

| Command | Description | When to Use |
|---------|-------------|-------------|
| `all` | Full PGO build (default) | First time, or after code changes |
| `clean` | Clean previous builds | Before rebuilding |
| `profile` | Build instrumented + collect profile | Step 1: Profile collection |
| `build` | Build optimized using profile | Step 2: After profile exists |

### Example Workflow

#### Full automated build (recommended)

```bash
./build_pgo.sh
```

#### Manual step-by-step

```bash
# Step 1: Collect profile
./build_pgo.sh profile

# Step 2: Build optimized
./build_pgo.sh build
```

---

## Performance Comparison

| Build Type | 128B Long-lived | Best Result | Use Case |
|------------|----------------|-------------|----------|
| **Normal** | 210M ops/s | 222M ops/s | Debug, development |
| **PGO** | 314M ops/s | 342M ops/s | Production, benchmarks |

Latest (Phase 9.3+ tiny fast-path):
- Direct link (PGO): 400M+ ops/s confirmed (`bench_comprehensive_hakmem`)
- System malloc baseline: ~410M ops/s (environment-dependent)
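
The relative gain between two builds can be recomputed from any pair of throughput figures. The sketch below uses `awk` for the floating-point division; the helper name `speedup_pct` is hypothetical.

```bash
# Hypothetical helper: percentage gain of a new throughput over a baseline.
# Both arguments use the same unit (here: M ops/s).
speedup_pct() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

speedup_pct 210 314   # 128B long-lived, Normal vs PGO: prints 49.5
speedup_pct 222 342   # best result, Normal vs PGO: prints 54.1
```

Running the same helper on your own measurements gives a like-for-like comparison on your hardware.
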

## What is PGO?

**Profile-Guided Optimization (PGO)** is a compiler optimization technique that works in two phases:

1. **Phase 1 (Profile)**: Build an instrumented binary and run a representative workload.
2. **Phase 2 (Optimize)**: Rebuild with the profile data so the compiler can optimize the hot paths.

**Benefits**:
- Better branch prediction
- Improved code layout (hot paths placed together)
- Inlining decisions based on actual usage
- Throughput gains of roughly 50-75% on the benchmarks in this guide
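
With GCC, the two phases correspond to the `-fprofile-generate` and `-fprofile-use` options. The sketch below is a generic illustration on a throwaway C file. It does not reproduce the flags used by HAKMEM's Makefile or `build_pgo.sh`, and the file names and the function name `pgo_demo` are made up for the example.

```bash
# Generic two-phase PGO illustration with GCC (not HAKMEM's actual build).
pgo_demo() {
  workdir=$(mktemp -d) || return 1
  (
    cd "$workdir" || exit 1
    # Throwaway program with a branch whose bias the profile records.
    cat > demo.c <<'EOF'
#include <stdio.h>
int main(void) {
    long long sum = 0;
    for (long long i = 0; i < 1000000; i++)
        sum += (i % 3 == 0) ? i : 1;
    printf("%lld\n", sum);
    return 0;
}
EOF
    # Phase 1: instrumented build; the run writes demo.gcda next to demo.o.
    gcc -O2 -fprofile-generate -c demo.c -o demo.o || exit 1
    gcc -fprofile-generate demo.o -o demo || exit 1
    ./demo > /dev/null || exit 1
    # Phase 2: rebuild the same object name so GCC picks up demo.gcda.
    gcc -O2 -fprofile-use -c demo.c -o demo.o || exit 1
    gcc demo.o -o demo || exit 1
    ./demo
  )
  status=$?
  rm -rf "$workdir"
  return $status
}

# Only run the demo when gcc is available.
if command -v gcc > /dev/null 2>&1; then pgo_demo; fi
```

Both phases compile to the same object file name because GCC derives the profile file name from the object file; changing the output name between phases makes the second build miss the profile.
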

## Requirements

- GCC with LTO/PGO support (gcc 7+)
- ~2 minutes for a full PGO build
- 200MB of disk space for profile data (`*.gcda` files)
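
A quick pre-flight check for the compiler requirement can be sketched as follows. `gcc -dumpversion` is a standard GCC option; the helper name `gcc_major_at_least` is hypothetical.

```bash
# Hypothetical helper: succeed if a GCC version string has major >= minimum.
# $1 = version string such as "11.4.0" or "7", $2 = minimum major version.
gcc_major_at_least() {
  major=${1%%.*}          # "11.4.0" -> "11", "7" -> "7"
  [ "$major" -ge "$2" ]
}

if command -v gcc > /dev/null 2>&1; then
  if gcc_major_at_least "$(gcc -dumpversion)" 7; then
    echo "gcc is new enough for the PGO build"
  else
    echo "gcc 7 or newer is required" >&2
  fi
fi
```
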

## Troubleshooting

### "Profile data not generated"

```bash
# Make sure you run the instrumented binary
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem
```
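
GCC's instrumentation writes `*.gcda` files when the instrumented program exits normally, so counting them is a quick way to confirm that the profiling run produced data. A small sketch, with the hypothetical helper name `count_gcda`:

```bash
# Hypothetical helper: count GCC profile data files under a directory.
count_gcda() {
  find "${1:-.}" -name '*.gcda' -type f | wc -l | tr -d ' '
}

n=$(count_gcda .)
if [ "$n" -eq 0 ]; then
  echo "no .gcda files found: run the instrumented benchmark first" >&2
else
  echo "$n profile data file(s) found"
fi
```

If the count stays at zero after a run, check that the binary was built by the profile step and that it was not killed before it could exit.
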

### "No profile data found"

```bash
# Run profile step first
./build_pgo.sh profile
```

### Clean start

```bash
./build_pgo.sh clean
./build_pgo.sh all
```

### Benchmark Tips (Tiny)

- Enable Tiny: `export HAKMEM_WRAP_TINY=1`
- Observation and learning features are off by default to avoid overhead. Set their environment variables only when you want them on.
  - Learning: `HAKMEM_LEARN=1`
  - By default, `HAKMEM_SITE_RULES` and `HAKMEM_PROF` are unset (off).
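
Because these switches are plain environment variables, an A/B comparison can set them per command instead of exporting them, which keeps the shell session unchanged between runs. The sketch below uses the hypothetical helper `run_ab` and relies on `env -u`, which is available in GNU coreutils and on BSD/macOS but is not required by POSIX.

```bash
# Hypothetical helper: run a command once without and once with VAR=1.
# $1 = variable name, remaining arguments = command to run.
run_ab() {
  var=$1
  shift
  echo "== $var unset =="
  env -u "$var" "$@"
  echo "== $var=1 =="
  env "$var=1" "$@"
}

# Usage: run_ab HAKMEM_LEARN ./bench_comprehensive_hakmem
```
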

### Tiny Modes and FLINT (Front End + Deferred Intelligence)

- Ultra Tiny (SLL-only, experimental)
  - Enable: `HAKMEM_TINY_ULTRA=1`
  - Validation on/off: `HAKMEM_TINY_ULTRA_VALIDATE=0/1` (0 recommended when measuring performance)
  - Parameters (per-class overrides):
    - `HAKMEM_TINY_ULTRA_BATCH_C{0..7}=N`
    - `HAKMEM_TINY_ULTRA_SLL_CAP_C{0..7}=N`
  - Visualization: `bash scripts/run_ultra_debug_sweep.sh 60000 200`
- FLINT (Fast Lightweight INTelligence, experimental)
  - FRONT (ultra-lightweight FastCache): `HAKMEM_TINY_FRONTEND=1`
  - INT (deferred intelligence: event aggregation + background thread): `HAKMEM_INT_ENGINE=1`
  - Note: keeps the hot path minimal and makes learning asynchronous. Currently experimental; an A/B comparison against Ultra and the normal mode is recommended.
- TinyQuickSlot (minimal front end): `HAKMEM_TINY_QUICK=1`
  - A 6-entry stack in 64 B per class. A hit touches only a single cache line before returning.
  - On a miss, a small refill (SLL → Quick or Magazine → Quick) preserves locality.
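
The per-class overrides follow a regular naming pattern, so a loop can set all eight classes at once. The values below (16 and 64) are placeholders chosen for illustration, not recommended settings.

```bash
# Set the same Ultra Tiny overrides for classes 0..7.
# 16 and 64 are placeholder values, not tuned recommendations.
for c in 0 1 2 3 4 5 6 7; do
  export "HAKMEM_TINY_ULTRA_BATCH_C${c}=16"
  export "HAKMEM_TINY_ULTRA_SLL_CAP_C${c}=64"
done

# Show what was set.
env | grep '^HAKMEM_TINY_ULTRA_' | sort
```

To tune a single class, export only that class's variable and leave the others unset.
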

### Scripts (with CSV Output)

- Direct-link comprehensive comparison (HAKMEM vs mimalloc): `bash scripts/run_comprehensive_pair.sh`
- Tiny triad (HAKMEM/System/mimalloc): `bash scripts/run_tiny_hot_triad.sh 80000`
- Random mixed triad: `bash scripts/run_random_mixed_matrix.sh 120000`
- Ultra visualization: `bash scripts/run_ultra_debug_sweep.sh 60000 200`
- Ultra parameter sweep: `bash scripts/sweep_ultra_params.sh 40000 150`

### Fast Build Target (Experimental)

```bash
make bench_fast
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem
```

Note: this build trims unwind tables and similar data. Using it together with PGO is recommended.

---

## Advanced: Manual PGO Build

If you prefer to use the Makefile directly:

```bash
# Step 1: Profile collection
make pgo-profile

# Step 2: Optimized build
make pgo-build

# Run benchmark
./bench_comprehensive_hakmem
```

---

## Phase 8.4 Achievement

```
🏆 342M ops/s NEW RECORD! (+8.2% vs Step 3d baseline)

Top 5 Results (PGO Build):
1. 64B FIFO: 342M ops/s 🥇
2. 64B Interleaved: 342M ops/s 🥈
3. 64B Long-lived: 342M ops/s 🥉
4. 32B Long-lived: 341M ops/s
5. 128B FIFO: 341M ops/s
```

**Design**: Zero hot-path overhead ACE observer
- Removed all ACE counters from alloc/free paths (600M+ operations)
- Background Learner thread observation (1-second interval)
- Registry-based scan using existing `meta->used` field

---

Generated with [Claude Code](https://claude.com/claude-code) - Phase 8.4