# HAKMEM Build Guide

## Quick Start

### Normal Build
```bash
make bench_comprehensive_hakmem
./bench_comprehensive_hakmem
```
**Expected**: 200-220M ops/sec

### PGO-Optimized Build (Recommended)
```bash
./build_pgo.sh
./bench_comprehensive_hakmem
```
**Expected**: 300-350M ops/sec (roughly 50% faster than the normal build)

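### Shared Library (LD_PRELOAD) PGO Build
```bash
# Step 1: Collect a profile with the instrumented shared library
make pgo-profile-shared

# Step 2: Build the PGO-optimized shared library
make pgo-build-shared

# Run: preload the library into the system-malloc benchmark binary
HAKMEM_WRAP_TINY=1 LD_PRELOAD=./libhakmem.so ./bench_comprehensive_system
```
**Expected**: faster than the normal build even as a shared library (results vary by environment)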
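
When an `LD_PRELOAD` object cannot be loaded, glibc's dynamic loader prints an error and runs the program anyway, so a missing `libhakmem.so` can quietly turn the run into a system malloc benchmark. The sketch below guards against that. The helper name `run_preloaded` is hypothetical and not part of the HAKMEM scripts.

```bash
# Hypothetical helper: refuse to run if the preload library is missing.
run_preloaded() {
  local lib="$1"; shift
  if [ ! -f "$lib" ]; then
    echo "error: $lib not found; build it with 'make pgo-build-shared' first" >&2
    return 1
  fi
  # Same environment as the command above, applied to whatever command follows.
  HAKMEM_WRAP_TINY=1 LD_PRELOAD="$lib" "$@"
}

# Example (commented out so the snippet has no side effects):
# run_preloaded ./libhakmem.so ./bench_comprehensive_system
```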

---

## PGO Build Script

### Usage

```bash
./build_pgo.sh [command]
```

### Commands

| Command | Description | When to Use |
|---------|-------------|-------------|
| `all` | Full PGO build (default) | First time, or after code changes |
| `clean` | Clean previous builds | Before rebuilding |
| `profile` | Build instrumented + collect profile | Step 1: Profile collection |
| `build` | Build optimized using profile | Step 2: After profile exists |

### Example Workflow

#### Full automated build (recommended)
```bash
./build_pgo.sh
```

#### Manual step-by-step
```bash
# Step 1: Collect profile
./build_pgo.sh profile

# Step 2: Build optimized
./build_pgo.sh build
```

---

## Performance Comparison

| Build Type | 128B Long-lived | Best Result | Use Case |
|------------|----------------|-------------|----------|
| **Normal** | 210M ops/s | 222M ops/s | Debug, development |
| **PGO** | 314M ops/s | 342M ops/s | Production, benchmarks |

Latest (Phase 9.3+ tiny fast-path):
- Direct (PGO): 400M+ ops/s confirmed (`bench_comprehensive_hakmem`)
- System malloc baseline: ~410M ops/s (environment-dependent)
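
The relative gain in the table can be checked with a little arithmetic. This is a minimal sketch; `speedup_pct` is a hypothetical helper name, and the inputs are the Normal and PGO figures from the table above.

```bash
# Percentage improvement of NEW over BASE, to one decimal place.
speedup_pct() {
  LC_ALL=C awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

speedup_pct 210 314   # 128B Long-lived: prints 49.5
speedup_pct 222 342   # Best result:     prints 54.1
```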

## What is PGO?

**Profile-Guided Optimization (PGO)** is a compiler optimization technique:

1. **Phase 1 (Profile)**: Build instrumented binary, run representative workload
2. **Phase 2 (Optimize)**: Rebuild with profile data, compiler optimizes hot paths

**Benefits**:
- Better branch prediction
- Improved code layout (hot paths together)
- Inlining decisions based on actual usage
- Roughly +50% throughput on the HAKMEM benchmarks (see Performance Comparison)
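
The two phases can be tried on a toy program with GCC's `-fprofile-generate` and `-fprofile-use` flags. This is only an illustrative sketch: `demo.c` and the `pgo_demo` function are hypothetical, and the flags `build_pgo.sh` actually passes may differ. Compiling to an object file first keeps the profile file name (`demo.gcda`) the same in both phases.

```bash
# Hypothetical two-phase PGO demo; runs in a subshell so the caller's cwd is untouched.
pgo_demo() (
  mkdir -p "$1" && cd "$1" || exit 1
  cat > demo.c <<'EOF'
#include <stdio.h>
int main(void) {
    long sum = 0;
    for (long i = 0; i < 1000000; i++) {
        if (i % 3 == 0) sum += i;   /* branch whose bias the profile records */
    }
    printf("%ld\n", sum);
    return 0;
}
EOF
  # Phase 1: build instrumented, then run a workload; exiting writes demo.gcda
  gcc -O2 -fprofile-generate -c demo.c -o demo.o || exit 1
  gcc -fprofile-generate demo.o -o demo_instr || exit 1
  ./demo_instr > /dev/null || exit 1
  # Phase 2: recompile the same source using the collected profile
  gcc -O2 -fprofile-use -c demo.c -o demo.o || exit 1
  gcc demo.o -o demo_pgo
)

# Example (commented out): pgo_demo /tmp/pgo_demo
```

The profile is only useful if the Phase 1 run resembles the real workload, which is why the HAKMEM build profiles with its own benchmark.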

## Requirements

- GCC with LTO/PGO support (gcc 7+)
- ~2 minutes for full PGO build
- 200MB disk space for profile data (`*.gcda` files)
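
A quick preflight check for these requirements might look like the sketch below. The function names are hypothetical; `gcc -dumpversion` and `df -Pk` are standard, and the thresholds are the ones listed above.

```bash
# True if a version string such as "7.5.0" or "11" has major version >= 7.
gcc_version_ok() {
  local major="${1%%.*}"
  case "$major" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$major" -ge 7 ]
}

# Free space, in MB, on the filesystem holding the given directory.
free_mb() {
  df -Pk "${1:-.}" | awk 'NR == 2 { print int($4 / 1024) }'
}

if command -v gcc >/dev/null 2>&1 && gcc_version_ok "$(gcc -dumpversion)"; then
  echo "gcc: OK"
else
  echo "gcc: missing or older than 7" >&2
fi
if [ "$(free_mb .)" -ge 200 ]; then
  echo "disk: OK"
else
  echo "disk: less than 200MB free" >&2
fi
```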

## Troubleshooting

### "Profile data not generated"
```bash
# Make sure you run the instrumented binary
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem
```

### "No profile data found"
```bash
# Run profile step first
./build_pgo.sh profile
```
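
To see whether a profile already exists before running the `build` step, you can look for `*.gcda` files. This sketch assumes the profile data lands somewhere under the directory you pass in; `has_profile_data` is a hypothetical helper, not part of `build_pgo.sh`.

```bash
# True if at least one .gcda file exists under the given directory (default: .).
has_profile_data() {
  local dir="${1:-.}"
  [ -n "$(find "$dir" -type f -name '*.gcda' 2>/dev/null | head -n 1)" ]
}

# Example (commented out): collect a profile only when none is present.
# has_profile_data . || ./build_pgo.sh profile
```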

### Clean start
```bash
./build_pgo.sh clean
./build_pgo.sh all
```

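### Benchmark Tips (Tiny)
- Enable Tiny: `export HAKMEM_WRAP_TINY=1`
- Observation and learning features are off by default to avoid overhead. Set their environment variables only when you want them on.
  - Learning: `HAKMEM_LEARN=1`
  - By default, `HAKMEM_SITE_RULES` and `HAKMEM_PROF` are unset (off).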
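
If your shell may have leftover settings from an earlier experiment, one way to get a clean measurement is to clear the optional variables for a single run. This is a sketch; `run_tiny_bench` is a hypothetical wrapper name.

```bash
# Run a command with Tiny enabled and the observation/learning variables unset.
run_tiny_bench() {
  env -u HAKMEM_LEARN -u HAKMEM_SITE_RULES -u HAKMEM_PROF \
      HAKMEM_WRAP_TINY=1 "$@"
}

# Example (commented out):
# run_tiny_bench ./bench_comprehensive_hakmem
```

Using `env -u` affects only the child process, so the variables in your interactive shell are left as they were.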

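### Tiny Modes and FLINT (Front End + Deferred Intelligence)
- Ultra Tiny (SLL-only, experimental)
  - Enable: `HAKMEM_TINY_ULTRA=1`
  - Validation on/off: `HAKMEM_TINY_ULTRA_VALIDATE=0/1` (`0` recommended when measuring performance)
  - Parameters (per-class overrides):
    - `HAKMEM_TINY_ULTRA_BATCH_C{0..7}=N`
    - `HAKMEM_TINY_ULTRA_SLL_CAP_C{0..7}=N`
  - Visualization: `bash scripts/run_ultra_debug_sweep.sh 60000 200`
- FLINT (Fast Lightweight INTelligence, experimental)
  - FRONT (ultra-lightweight FastCache): `HAKMEM_TINY_FRONTEND=1`
  - INT (deferred intelligence: event aggregation + background thread): `HAKMEM_INT_ENGINE=1`
  - Note: this keeps the hot path minimal and makes learning asynchronous. It is currently experimental; A/B comparison against Ultra and the normal mode is recommended.
  - TinyQuickSlot (minimal front end): `HAKMEM_TINY_QUICK=1`
    - A 6-entry stack at 64B per class. A hit touches only one cache line before returning.
    - On a miss, a small refill (SLL→Quick, Magazine→Quick) preserves locality.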
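
The `C{0..7}` notation in the per-class parameters stands for one variable per size class, `C0` through `C7`. The sketch below sets the same value for all eight classes. The helper name and the value `16` are placeholders chosen for illustration, not recommended settings.

```bash
# Export PREFIX_C0 .. PREFIX_C7 with the same value.
set_ultra_per_class() {
  local prefix="$1" value="$2" c
  for c in 0 1 2 3 4 5 6 7; do
    export "${prefix}_C${c}=${value}"
  done
}

# Example (commented out; 16 is an arbitrary placeholder value):
# set_ultra_per_class HAKMEM_TINY_ULTRA_BATCH 16
# HAKMEM_WRAP_TINY=1 HAKMEM_TINY_ULTRA=1 HAKMEM_TINY_ULTRA_VALIDATE=0 ./bench_comprehensive_hakmem
```

To tune a single class, set that one variable directly instead, for example `export HAKMEM_TINY_ULTRA_SLL_CAP_C3=N`.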

### Scripts (with CSV output)
- Direct-link comprehensive comparison (HAKMEM vs mimalloc): `bash scripts/run_comprehensive_pair.sh`
- Tiny triad (HAKMEM/System/mimalloc): `bash scripts/run_tiny_hot_triad.sh 80000`
- Random mixed triad: `bash scripts/run_random_mixed_matrix.sh 120000`
- Ultra visualization: `bash scripts/run_ultra_debug_sweep.sh 60000 200`
- Ultra parameter sweep: `bash scripts/sweep_ultra_params.sh 40000 150`

### Fast Build Target (Experimental)
```bash
make bench_fast
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem
```
Note: this build trims unwind tables and similar data. Combining it with PGO is recommended.

---

## Advanced: Manual PGO Build

If you prefer to invoke the Makefile targets directly:

```bash
# Step 1: Profile collection
make pgo-profile

# Step 2: Optimized build
make pgo-build

# Run benchmark
./bench_comprehensive_hakmem
```

---

## Phase 8.4 Achievement

```
🏆 342M ops/s NEW RECORD! (+8.2% vs Step 3d baseline)

Top 5 Results (PGO Build):
1. 64B FIFO:        342M ops/s 🥇
2. 64B Interleaved: 342M ops/s 🥈
3. 64B Long-lived:  342M ops/s 🥉
4. 32B Long-lived:  341M ops/s
5. 128B FIFO:       341M ops/s
```

**Design**: Zero hot-path overhead ACE observer
- Removed all ACE counters from alloc/free paths (600M+ operations)
- Background Learner thread observation (1-second interval)
- Registry-based scan using existing `meta->used` field

---

Generated with [Claude Code](https://claude.com/claude-code) - Phase 8.4