# HAKMEM Build Guide

## Quick Start

### Normal Build
```bash
make bench_comprehensive_hakmem
./bench_comprehensive_hakmem
```
**Expected**: 200-220M ops/sec

### PGO-Optimized Build (Recommended)
```bash
./build_pgo.sh
./bench_comprehensive_hakmem
```
**Expected**: 300-350M ops/sec (roughly 50-75% faster than the normal build)

### Shared Library (LD_PRELOAD) PGO Build
```bash
# Step 1: Collect a profile using the instrumented shared library
make pgo-profile-shared

# Step 2: Build the PGO-optimized shared library
make pgo-build-shared

# Run (swap in the system-malloc benchmark binary)
HAKMEM_WRAP_TINY=1 LD_PRELOAD=./libhakmem.so ./bench_comprehensive_system
```
Expected: the shared library is also faster than a normal build (results vary by environment).

---

## PGO Build Script

### Usage

```bash
./build_pgo.sh [command]
```

### Commands

| Command | Description | When to Use |
|---------|-------------|-------------|
| `all` | Full PGO build (default) | First time, or after code changes |
| `clean` | Clean previous builds | Before rebuilding |
| `profile` | Build instrumented + collect profile | Step 1: Profile collection |
| `build` | Build optimized using profile | Step 2: After profile exists |

### Example Workflow

#### Full automated build (recommended)
```bash
./build_pgo.sh
```

#### Manual step-by-step
```bash
# Step 1: Collect profile
./build_pgo.sh profile

# Step 2: Build optimized
./build_pgo.sh build
```

---

## Performance Comparison

| Build Type | 128B Long-lived | Best Result | Use Case |
|------------|----------------|-------------|----------|
| **Normal** | 210M ops/s | 222M ops/s | Debug, development |
| **PGO** | 314M ops/s | 342M ops/s | Production, benchmarks |

Latest (Phase 9.3+ tiny fast-path):
- Direct (PGO): 400M+ ops/s confirmed (bench_comprehensive_hakmem)
- System malloc baseline: ~410M ops/s (environment-dependent)

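
The relative gain of the PGO build can be read off the table by dividing the two throughput figures. The helper below is a minimal sketch of that arithmetic; the function name `speedup_pct` is hypothetical and not part of the HAKMEM scripts. For the two columns of the table it gives roughly 49.5% and 54.1%.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of HAKMEM): percentage gain of a new
# throughput figure over a baseline, both given in M ops/s.
speedup_pct() {
    local baseline=$1 new=$2
    awk -v b="$baseline" -v n="$new" 'BEGIN { printf "%.1f", (n - b) / b * 100 }'
}

speedup_pct 210 314   # 128B long-lived: Normal vs PGO
echo
speedup_pct 222 342   # best result: Normal vs PGO
echo
```

The same helper can be applied to your own measurements, since absolute numbers depend on the machine.
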
## What is PGO?

**Profile-Guided Optimization (PGO)** is a compiler optimization technique:

1. **Phase 1 (Profile)**: Build instrumented binary, run representative workload
2. **Phase 2 (Optimize)**: Rebuild with profile data, compiler optimizes hot paths

**Benefits**:
- Better branch prediction
- Improved code layout (hot paths together)
- Inlining decisions based on actual usage
- +50-75% performance improvement

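
With GCC, the two phases correspond to the `-fprofile-generate` and `-fprofile-use` options. The sketch below only illustrates that generic mechanism: the function `pgo_cflags`, the `-O2` base level, and the file `app.c` are assumptions made for illustration, and the flags actually used by `build_pgo.sh` and the Makefile may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the GCC flags for each PGO phase.
# -fprofile-generate and -fprofile-use are standard GCC options; the -O2
# base level is an assumption, not taken from the HAKMEM Makefile.
pgo_cflags() {
    case "$1" in
        profile) echo "-O2 -fprofile-generate" ;;  # Phase 1: instrumented build
        use)     echo "-O2 -fprofile-use" ;;       # Phase 2: optimized rebuild
        *)       echo "usage: pgo_cflags profile|use" >&2; return 1 ;;
    esac
}

# Print the commands for a generic two-phase flow on a hypothetical app.c.
echo "gcc $(pgo_cflags profile) app.c -o app   # then run ./app on a representative workload"
echo "gcc $(pgo_cflags use) app.c -o app       # rebuild using the collected .gcda data"
```

The script only prints the commands, so it is safe to run anywhere; the profile run between the two builds is what produces the data the second build consumes.
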
## Requirements

- GCC with LTO/PGO support (gcc 7+)
- ~2 minutes for full PGO build
- 200MB disk space for profile data (*.gcda files)

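
A quick way to check the compiler requirement before starting a build is to compare the major version reported by `gcc -dumpversion` against 7. This is a minimal sketch; the helper name `gcc_major_ok` is hypothetical and the check is not part of `build_pgo.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if a GCC version string has major version >= 7.
gcc_major_ok() {
    local version=$1
    local major=${version%%.*}      # "11.4.0" -> "11", "11" -> "11"
    case "$major" in
        ''|*[!0-9]*) return 1 ;;    # empty or non-numeric input
    esac
    [ "$major" -ge 7 ]
}

# Usage: check the compiler on PATH, if there is one.
if command -v gcc >/dev/null 2>&1; then
    if gcc_major_ok "$(gcc -dumpversion)"; then
        echo "gcc is new enough for the PGO build"
    else
        echo "gcc is older than 7; the PGO build may not work"
    fi
fi
```
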
## Troubleshooting

### "Profile data not generated"
```bash
# Make sure you run the instrumented binary
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem
```

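
To confirm whether the instrumented run actually wrote profile data, you can count the `*.gcda` files. The sketch below assumes they end up somewhere under the current directory, which depends on how the build is configured; the helper name `count_gcda` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count GCC profile data files (*.gcda) under a directory.
count_gcda() {
    local dir=${1:-.}
    find "$dir" -type f -name '*.gcda' | wc -l | tr -d ' '
}

n=$(count_gcda .)
if [ "$n" -eq 0 ]; then
    echo "no .gcda files found: run the instrumented binary first"
else
    echo "found $n .gcda file(s)"
fi
```
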
### "No profile data found"
```bash
# Run profile step first
./build_pgo.sh profile
```

### Clean start
```bash
./build_pgo.sh clean
./build_pgo.sh all
```

### Benchmark Tips (for Tiny)
- Enable Tiny: `export HAKMEM_WRAP_TINY=1`
- Observation and learning features are OFF by default to avoid overhead. Set the environment variables only when you want them ON.
  - Learning: `HAKMEM_LEARN=1`
  - By default, `HAKMEM_SITE_RULES` and `HAKMEM_PROF` are unset (OFF)

### Tiny Modes and FLINT (Front End + Deferred Intelligence)
- Ultra Tiny (SLL-only, experimental)
  - Enable: `HAKMEM_TINY_ULTRA=1`
  - Validation ON/OFF: `HAKMEM_TINY_ULTRA_VALIDATE=0/1` (0 is recommended when measuring performance)
  - Parameters (per-class overrides):
    - `HAKMEM_TINY_ULTRA_BATCH_C{0..7}=N`
    - `HAKMEM_TINY_ULTRA_SLL_CAP_C{0..7}=N`
  - Visualization: `bash scripts/run_ultra_debug_sweep.sh 60000 200`
- FLINT (Fast Lightweight INTelligence, experimental)
  - FRONT (ultra-lightweight FastCache): `HAKMEM_TINY_FRONTEND=1`
  - INT (deferred intelligence: event aggregation + background thread): `HAKMEM_INT_ENGINE=1`
  - Note: the hot path is minimized and learning runs asynchronously. Currently experimental; A/B comparison against Ultra and the normal mode is recommended.
- TinyQuickSlot (minimal front end): `HAKMEM_TINY_QUICK=1`
  - A 6-entry stack in 64B per class. On a hit, only one cache line is touched before returning.
  - On a miss, small refills (SLL→Quick, Magazine→Quick) preserve locality.

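
The per-class parameters can be set in a loop rather than one by one. This is a minimal sketch, assuming `C{0..7}` in the variable names stands for class indices 0 through 7; the values 64 and 256 are placeholders, not recommended settings.

```bash
#!/usr/bin/env bash
# Set the same per-class override for all eight tiny classes (C0..C7).
# The values 64 and 256 are placeholders, not recommended settings.
for c in 0 1 2 3 4 5 6 7; do
    export "HAKMEM_TINY_ULTRA_BATCH_C${c}=64"
    export "HAKMEM_TINY_ULTRA_SLL_CAP_C${c}=256"
done

# Show what was set.
env | grep '^HAKMEM_TINY_ULTRA_' | sort
```

Because the variables are exported, any benchmark started from the same shell afterwards inherits them.
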
### Scripts (with CSV output)
- Direct-link comprehensive comparison (HAKMEM vs mimalloc): `bash scripts/run_comprehensive_pair.sh`
- Tiny triad (HAKMEM/System/mimalloc): `bash scripts/run_tiny_hot_triad.sh 80000`
- Random mixed triad: `bash scripts/run_random_mixed_matrix.sh 120000`
- Ultra visualization: `bash scripts/run_ultra_debug_sweep.sh 60000 200`
- Ultra parameter sweep: `bash scripts/sweep_ultra_params.sh 40000 150`

### Fast Build Target (experimental)
```bash
make bench_fast
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem
```
Note: this build strips unwind tables and similar metadata. Combining it with PGO is recommended.

---

## Advanced: Manual PGO Build

If you prefer to use the Makefile directly:

```bash
# Step 1: Profile collection
make pgo-profile

# Step 2: Optimized build
make pgo-build

# Run benchmark
./bench_comprehensive_hakmem
```

---

## Phase 8.4 Achievement

```
🏆 342M ops/s NEW RECORD! (+8.2% vs Step 3d baseline)

Top 5 Results (PGO Build):
1. 64B FIFO: 342M ops/s 🥇
2. 64B Interleaved: 342M ops/s 🥈
3. 64B Long-lived: 342M ops/s 🥉
4. 32B Long-lived: 341M ops/s
5. 128B FIFO: 341M ops/s
```

**Design**: Zero hot-path overhead ACE observer
- Removed all ACE counters from alloc/free paths (600M+ operations)
- Background Learner thread observation (1-second interval)
- Registry-based scan using existing `meta->used` field

---

Generated with [Claude Code](https://claude.com/claude-code) - Phase 8.4