hakmem/docs/archive/PHASE4_IMPROVEMENT_ROADMAP.md

# Phase 4 Improvement Roadmap

## Overview

Phase 4 introduced a 3.6% performance regression, so we will recover it in stages.

**Current state**:
- Phase 3: 391 M ops/sec
- Phase 4: 373-380 M ops/sec
- **Regression**: -3.6%

**Targets**:
- Phase 4.1 (Quick Win): 385-390 M ops/sec (+1-2%)
- Phase 4.2 (Gating): 390-395 M ops/sec (back to Phase 3 level)
- Phase 4.3 (Batching): 395-400 M ops/sec (above Phase 3)
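
The percentages above are relative changes against the Phase 3 baseline. A minimal sketch of that arithmetic follows; `pct_change` and `print_phase4_regression` are illustrative names, not part of hakmem. It shows that the 373-380 M ops/sec range corresponds to roughly -4.6% to -2.8%, with the quoted -3.6% falling inside that range.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of `value` against `baseline`, in percent.
 * Hypothetical helper for illustration only. */
static double pct_change(double baseline, double value) {
    assert(baseline > 0.0);
    return (value - baseline) / baseline * 100.0;
}

/* Print both ends of the Phase 4 regression range from the overview. */
static void print_phase4_regression(void) {
    const double phase3 = 391.0;  /* M ops/sec, Phase 3 baseline */
    printf("low end : %.1f%%\n", pct_change(phase3, 373.0));
    printf("high end: %.1f%%\n", pct_change(phase3, 380.0));
}
```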

---

## Phase 4.1: Quick Win (Option A+B)

### Goals
- Implementation time: **5-10 minutes**
- Expected gain: **+1-2%** (385-390 M ops/sec)
- Risk: **low**

### Implementation

#### Option A: Eliminate redundant memory accesses

**Before**:
```c
// owner->class_idx is read three times (twice in the condition, once for stats)
int is_tls_active = (owner == g_tls_active_slab_a[owner->class_idx] ||
                      owner == g_tls_active_slab_b[owner->class_idx]);

if (is_tls_active && !mini_mag_is_full(&owner->mini_mag)) {
    mini_mag_push(&owner->mini_mag, it.ptr);
    stats_record_free(owner->class_idx);
    continue;
}
```

**After**:
```c
// Read once and reuse
uint8_t cidx = owner->class_idx;
TinySlab* tls_a = g_tls_active_slab_a[cidx];
TinySlab* tls_b = g_tls_active_slab_b[cidx];

if ((owner == tls_a || owner == tls_b) &&
    !mini_mag_is_full(&owner->mini_mag)) {
    mini_mag_push(&owner->mini_mag, it.ptr);
    stats_record_free(cidx);
    continue;
}
```

**Improvements**:
- `owner->class_idx`: 3 reads → 1 read
- `g_tls_active_slab_a/b[cidx]`: the loads can be hoisted out of the loop when `cidx` is loop-invariant

#### Option B: Branch prediction hint

```c
// When spilling from the TLS Magazine, items most likely return to a TLS-active slab
if (__builtin_expect((owner == tls_a || owner == tls_b) &&
                     !mini_mag_is_full(&owner->mini_mag), 1)) {
    mini_mag_push(&owner->mini_mag, it.ptr);
    stats_record_free(cidx);
    continue;
}
```

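**Improvements**:
- Reduces branch mispredictions
- Hints the expected branch direction to the compiler (`__builtin_expect` influences code layout; it does not program the CPU's branch predictor directly)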
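
A common way to keep such hints readable is to wrap `__builtin_expect` in `LIKELY`/`UNLIKELY` macros. The sketch below assumes GCC or Clang, since the builtin is a compiler extension. `DemoStack` and `push_if_room` are hypothetical stand-ins for the mini-magazine and its push; they are not hakmem names.

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang extension; fall back to a plain truth value elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (!!(x))
#  define UNLIKELY(x) (!!(x))
#endif

/* Hypothetical fixed-size stack standing in for a mini-magazine. */
typedef struct {
    void*  items[4];
    size_t count;
} DemoStack;

/* Returns 1 if the pointer was pushed, 0 if the stack was full. */
static int push_if_room(DemoStack* s, void* p) {
    assert(s != NULL);
    if (LIKELY(s->count < sizeof s->items / sizeof s->items[0])) {
        s->items[s->count++] = p;  /* expected (hot) path */
        return 1;
    }
    return 0;                      /* rare path: caller spills elsewhere */
}
```

The `!!(x)` normalizes any nonzero value to 1, so the hint compares against the same value the condition actually produces.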

### Where to change

**File**: `hakmem_tiny.c`
**Function**: `hak_tiny_free_with_slab()`
**Lines**: 890-922

### Verification

```bash
# Build after implementing Phase 4.1
make clean && make bench_tiny

# Run the benchmark (3 times)
./bench_tiny 2>&1 | tail -5
./bench_tiny 2>&1 | tail -5
./bench_tiny 2>&1 | tail -5

# Expected: 385-390 M ops/sec
```

### Success criteria

- **Minimum**: 380 M ops/sec (no worse than now)
- **Target**: 385-390 M ops/sec (+1-2%)
- **Ideal**: 390+ M ops/sec (Phase 3 level)
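
Because the benchmark is run three times, it helps to reduce the runs to one number before comparing against these tiers. One possible sketch takes the median of three runs and maps it onto the criteria above. The function names and the choice of the median are assumptions; the roadmap does not prescribe either.

```c
#include <assert.h>

typedef enum { TIER_FAIL, TIER_MINIMUM, TIER_TARGET, TIER_IDEAL } Tier;

/* Median of three throughput samples (M ops/sec). */
static double median3(double a, double b, double c) {
    if (a > b) { double t = a; a = b; b = t; }
    if (b > c) { double t = b; b = c; c = t; }
    if (a > b) { double t = a; a = b; b = t; }
    return b;  /* after the three swaps, a <= b <= c */
}

/* Map a Phase 4.1 result onto the tiers listed above. */
static Tier phase41_tier(double mops) {
    assert(mops >= 0.0);
    if (mops >= 390.0) return TIER_IDEAL;
    if (mops >= 385.0) return TIER_TARGET;
    if (mops >= 380.0) return TIER_MINIMUM;
    return TIER_FAIL;
}
```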

---

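## Phase 4.2: High-water gate (Option E-1)

### Goals
- Implementation time: **10-20 minutes**
- Expected gain: **+2-5%** (390-395 M ops/sec)
- Risk: **low to medium**

### Implementation

#### High-water gate logic

**Concept**:
> When the TLS Magazine is at high water (≥75%), skip Phase 4 entirely.
> Rationale: the next alloc will be served from TLS, so pushing into the mini-mag is wasted work.

**Implementation**: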
```c
// Add at the top of hak_tiny_free_with_slab()
void hak_tiny_free_with_slab(...) {
    // ... existing preamble ...

    // Phase 4.2: high-water gate
    int tls_occ = mag->count;        // current TLS Magazine occupancy
    int tls_cap = TLS_MAG_CAPACITY;  // 2048

    if (tls_occ >= (tls_cap * 3 / 4)) {
        // High water: Phase 4 is disabled
        // Write every item straight to the bitmap (existing logic)
        for (int i = 0; i < mag->count; i++) {
            PoolItem it = mag->items[i];
            TinySlab* owner = hak_tiny_owner_slab(it.ptr);
            if (!owner) continue;

            // Spill to the bitmap (reuses existing logic)
            // ... bitmap operations ...
        }

        // Statistics
        g_tiny_pool.phase4_gate_skip[class_idx]++;
        return;
    }

    // Low water: run Phase 4 (existing logic)
    for (int i = 0; i < mag->count; i++) {
        PoolItem it = mag->items[i];
        TinySlab* owner = hak_tiny_owner_slab(it.ptr);
        if (!owner) continue;

        // Logic optimized in Phase 4.1
        uint8_t cidx = owner->class_idx;
        TinySlab* tls_a = g_tls_active_slab_a[cidx];
        TinySlab* tls_b = g_tls_active_slab_b[cidx];

        if (__builtin_expect((owner == tls_a || owner == tls_b) &&
                             !mini_mag_is_full(&owner->mini_mag), 1)) {
            mini_mag_push(&owner->mini_mag, it.ptr);
            stats_record_free(cidx);
            g_tiny_pool.phase4_mini_push[cidx]++;
            continue;
        }

        // Spill to the bitmap
        // ... bitmap operations ...
        g_tiny_pool.phase4_bitmap_spill[cidx]++;
    }

    g_tiny_pool.phase4_spill_count[class_idx]++;
}
```

#### Constants

```c
// hakmem_tiny.h or hakmem_tiny.c
#define TLS_MAG_CAPACITY 2048
#define TLS_MAG_HIGH_WATER (TLS_MAG_CAPACITY * 3 / 4)  // 1536
#define TLS_MAG_LOW_WATER (TLS_MAG_CAPACITY / 4)        // 512
```
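
To make the thresholds concrete, the gate check can be written as a small predicate. This sketch repeats the constants so that it compiles on its own; `tls_mag_is_high_water` is a hypothetical name, and the compile-time checks assume a C11 compiler.

```c
#include <assert.h>  /* assert, static_assert (C11) */

#define TLS_MAG_CAPACITY   2048
#define TLS_MAG_HIGH_WATER (TLS_MAG_CAPACITY * 3 / 4)  /* 1536 */
#define TLS_MAG_LOW_WATER  (TLS_MAG_CAPACITY / 4)      /* 512 */

/* Compile-time check that the comments match the arithmetic. */
static_assert(TLS_MAG_HIGH_WATER == 1536, "high-water mark should be 75%");
static_assert(TLS_MAG_LOW_WATER == 512, "low-water mark should be 25%");

/* Nonzero when the magazine occupancy is at or above the 75% mark. */
static int tls_mag_is_high_water(int count) {
    assert(count >= 0 && count <= TLS_MAG_CAPACITY);
    return count >= TLS_MAG_HIGH_WATER;
}
```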

#### Additional statistics

```c
// hakmem_tiny.h: add to the TinyPool struct
typedef struct {
    // Existing
    uint64_t alloc_count[TINY_NUM_CLASSES];
    uint64_t free_count[TINY_NUM_CLASSES];
    uint64_t slab_count[TINY_NUM_CLASSES];

    // Phase 4 instrumentation (new)
    uint64_t phase4_spill_count[TINY_NUM_CLASSES];    // Phase 4 decisions (low-water path taken)
    uint64_t phase4_mini_push[TINY_NUM_CLASSES];      // successful mini-mag pushes
    uint64_t phase4_bitmap_spill[TINY_NUM_CLASSES];   // items spilled to the bitmap
    uint64_t phase4_gate_skip[TINY_NUM_CLASSES];      // high-water gate skips
} TinyPool;
```

```c
// hakmem_tiny.c: initialization
void hak_tiny_init(void) {
    // ... existing initialization ...

    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        g_tiny_pool.phase4_spill_count[i] = 0;
        g_tiny_pool.phase4_mini_push[i] = 0;
        g_tiny_pool.phase4_bitmap_spill[i] = 0;
        g_tiny_pool.phase4_gate_skip[i] = 0;
    }
}
```

### Verification

```bash
# Build after implementing Phase 4.2
make clean && make bench_tiny

# Run the benchmark (3 times)
./bench_tiny 2>&1 | tail -5
./bench_tiny 2>&1 | tail -5
./bench_tiny 2>&1 | tail -5

# Check statistics (after implementation)
# Add the phase4 statistics to hak_tiny_print_stats()
./test_mf2 2>&1 | grep -A 10 "Phase 4"

# Expected: 390-395 M ops/sec (Phase 3 level)
```

### Success criteria

- **Minimum**: 385 M ops/sec (holds the Phase 4.1 level)
- **Target**: 390-395 M ops/sec (back to Phase 3 level)
- **Ideal**: 395+ M ops/sec (above Phase 3)

### Revert decision

If throughput is still below 385 M ops/sec after Phase 4.2:
- **Revert all of Phase 4**
- Return to Phase 3 (391 M ops/sec)
- Consider the pull-based approach

---

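## Phase 4.3: Per-slab batching (Option E-2)

### Goals
- Implementation time: **30-40 minutes**
- Expected gain: **+2-5%** (395-400 M ops/sec)
- Risk: **medium to high** (complex implementation)

### Implementation

#### Per-slab grouping

**Concept**:
> Group the 256 spilled items by slab.
> The is_tls_active check drops sharply, from 256 evaluations to one per slab (1-8).

**Data structure**: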
```c
#define SLAB_BUCKETS 32  // bucket count for linear probing (must be a power of two for the index mask)

typedef struct {
    TinySlab* owner;
    void* ptrs[256];
    int count;
} SlabBucket;
```
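
One thing worth checking before adopting this layout is its stack footprint, since each bucket embeds 256 pointers. The sketch below repeats the definitions with `TinySlab` left as an opaque type (an assumption so the snippet compiles standalone) and computes the size of a full bucket array. `slab_buckets_footprint` is a hypothetical helper.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>

#define SLAB_BUCKETS 32

/* The index mask `& (SLAB_BUCKETS - 1)` only works for powers of two. */
static_assert((SLAB_BUCKETS & (SLAB_BUCKETS - 1)) == 0,
              "SLAB_BUCKETS must be a power of two");

typedef struct TinySlab TinySlab;  /* opaque here; the real type lives in hakmem */

typedef struct {
    TinySlab* owner;
    void*     ptrs[256];
    int       count;
} SlabBucket;

/* Bytes needed for `SlabBucket buckets[SLAB_BUCKETS]`. */
static size_t slab_buckets_footprint(void) {
    return sizeof(SlabBucket) * SLAB_BUCKETS;
}
```

On a typical 64-bit (LP64) target this comes to 66,048 bytes, about 64.5 KiB, and a `= {0}` initializer zeroes all of it on every call.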

**Implementation**:
```c
void hak_tiny_free_with_slab_batched(...) {
    // Phase 4.2: high-water gate
    int tls_occ = mag->count;
    if (tls_occ >= TLS_MAG_HIGH_WATER) {
        fast_spill_all_to_bitmap(mag);
        return;
    }

    // Phase 4.3: per-slab batching
    // Note: each SlabBucket holds 256 pointers, so on a 64-bit target this
    // array is roughly 64 KiB of stack that is zeroed on every call.
    SlabBucket buckets[SLAB_BUCKETS] = {0};

    // 1st pass: group items by owning slab
    for (int i = 0; i < mag->count; i++) {
        PoolItem it = mag->items[i];
        TinySlab* owner = hak_tiny_owner_slab(it.ptr);
        if (!owner) continue;

        // Linear-probing hash; stop after SLAB_BUCKETS probes so that a
        // full table cannot loop forever
        size_t hash = ((uintptr_t)owner >> 6) & (SLAB_BUCKETS - 1);
        int probes = 0;
        while (buckets[hash].owner && buckets[hash].owner != owner &&
               ++probes < SLAB_BUCKETS) {
            hash = (hash + 1) & (SLAB_BUCKETS - 1);
        }

        SlabBucket* bk = &buckets[hash];
        if ((bk->owner && bk->owner != owner) || bk->count >= 256) {
            // Table full or bucket full: spill this item straight to the bitmap
            // ... bitmap operations ...
            g_tiny_pool.phase4_bitmap_spill[owner->class_idx]++;
            continue;
        }
        bk->owner = owner;
        bk->ptrs[bk->count++] = it.ptr;
    }

    // 2nd pass: process each slab (the TLS-active check runs once per slab)
    for (int b = 0; b < SLAB_BUCKETS; b++) {
        if (!buckets[b].owner) continue;

        TinySlab* slab = buckets[b].owner;
        uint8_t cidx = slab->class_idx;
        TinySlab* tls_a = g_tls_active_slab_a[cidx];
        TinySlab* tls_b = g_tls_active_slab_b[cidx];

        int is_tls_active = (slab == tls_a || slab == tls_b);
        int room = mini_capacity(&slab->mini_mag) - mini_count(&slab->mini_mag);
        int take = is_tls_active ? min(room, buckets[b].count) : 0;

        // Push into the mini-mag as a batch
        for (int i = 0; i < take; i++) {
            mini_mag_push(&slab->mini_mag, buckets[b].ptrs[i]);
            stats_record_free(cidx);
            g_tiny_pool.phase4_mini_push[cidx]++;
        }

        // Spill the remainder to the bitmap as a batch
        for (int i = take; i < buckets[b].count; i++) {
            // ... bitmap operations ...
            g_tiny_pool.phase4_bitmap_spill[cidx]++;
        }
    }

    g_tiny_pool.phase4_spill_count[class_idx]++;
}
```
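
The grouping pass above can be tried in isolation. The sketch below is a scaled-down stand-in: 8 buckets of 4 entries instead of 32 of 256, with all `demo_`/`Demo` names hypothetical. It groups pointers by owner with linear probing and reports how many items did not fit, so the caller can send those down the bitmap path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BUCKETS    8   /* power of two, like SLAB_BUCKETS */
#define DEMO_BUCKET_CAP 4   /* stands in for the 256-entry ptrs[] */

typedef struct {
    const void* owner;                 /* grouping key (slab address) */
    void*       ptrs[DEMO_BUCKET_CAP];
    int         count;
} DemoBucket;

/* Find the bucket for `owner`, claiming an empty slot if needed.
 * Returns NULL when every slot already belongs to another owner. */
static DemoBucket* demo_find_bucket(DemoBucket* tab, const void* owner) {
    assert(owner != NULL);
    size_t h = ((uintptr_t)owner >> 6) & (DEMO_BUCKETS - 1);
    for (int probes = 0; probes < DEMO_BUCKETS; probes++) {
        DemoBucket* b = &tab[h];
        if (b->owner == NULL) b->owner = owner;  /* claim empty slot */
        if (b->owner == owner) return b;
        h = (h + 1) & (DEMO_BUCKETS - 1);        /* linear probe */
    }
    return NULL;
}

/* Group `n` (owner, ptr) pairs into a freshly cleared table.
 * Returns the number of items that could not be grouped
 * (table full or bucket full) and must take the slow path. */
static int demo_group(DemoBucket* tab, const void* const* owners,
                      void* const* ptrs, int n) {
    int overflow = 0;
    memset(tab, 0, sizeof(DemoBucket) * DEMO_BUCKETS);
    for (int i = 0; i < n; i++) {
        DemoBucket* b = demo_find_bucket(tab, owners[i]);
        if (b == NULL || b->count >= DEMO_BUCKET_CAP) {
            overflow++;
            continue;
        }
        b->ptrs[b->count++] = ptrs[i];
    }
    return overflow;
}
```

Bounding the probe loop by the table size is what keeps a spill touching more distinct slabs than there are buckets from spinning forever.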

#### Helper functions

```c
// Capacity and current count of a mini-magazine
static inline int mini_capacity(PageMiniMag* mag) {
    return mag->capacity;
}

static inline int mini_count(PageMiniMag* mag) {
    return mag->count;
}

// min macro (evaluates each argument twice, so pass only side-effect-free expressions)
#define min(a, b) ((a) < (b) ? (a) : (b))
```

### Verification

```bash
# Build after implementing Phase 4.3
make clean && make bench_tiny

# Run the benchmark (5 times)
for i in {1..5}; do
    ./bench_tiny 2>&1 | tail -5
done

# Check statistics
./test_mf2 2>&1 | grep -A 10 "Phase 4"

# Expected: 395-400 M ops/sec (above Phase 3)
```

### Success criteria

- **Minimum**: 390 M ops/sec (holds the Phase 4.2 level)
- **Target**: 395-400 M ops/sec (above Phase 3)
- **Ideal**: 400+ M ops/sec (roughly 5% over the current 380 M ops/sec)

---

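## Phase 4.4: Pull-based inversion (future)

### Goals
- Implementation time: **1-2 hours**
- Expected gain: **addresses the root cause**
- Risk: **high** (architectural change)

### Concept

**Current (push model)**:
- The free side (spill) pushes items back into the mini-mag
- Every spilled item pays the overhead
- The benefit only appears on the allocation side, and is not guaranteed

**Proposed (pull model)**:
- The allocation side pulls from the mini-mag only when needed
- Zero overhead on the free side
- Allocation latency increases slightly (trade-off)

### Where to change

**File**: `hakmem_tiny.c`
**Function**: `hak_tiny_alloc()`
**Where**: immediately before the bitmap scan

### Implementation sketch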

```c
void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0) return NULL;

    // 1. TLS Magazine (fast path)
    if (!mini_mag_is_empty(&g_tls_mag[class_idx])) {
        return mini_mag_pop(&g_tls_mag[class_idx]);
    }

    // 2. TLS Active Slabs (medium path)
    TinySlab* tls = g_tls_active_slab_a[class_idx];
    if (!(tls && tls->free_count > 0)) {
        tls = g_tls_active_slab_b[class_idx];
    }

    if (tls && tls->free_count > 0) {
        // Phase 4.4: Pull from page mini-mag
        if (!mini_mag_is_empty(&tls->mini_mag)) {
            void* p = mini_mag_pop(&tls->mini_mag);
            if (p) {
                stats_record_alloc(class_idx);
                return p;
            }
        }

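        // Phase 4.4 (alternative): refill the TLS Magazine from the page mini-mag in a batch.
        // As written, this runs only if the single pop above returned NULL on a
        // non-empty mini-mag, so treat the two blocks as alternatives.
        if (!mini_mag_is_empty(&tls->mini_mag)) {
            int pulled = mini_pull_batch(&tls->mini_mag, &g_tls_mag[class_idx], 16);
            if (pulled > 0) {
                void* p = mini_mag_pop(&g_tls_mag[class_idx]);
                if (p) {
                    stats_record_alloc(class_idx);
                    return p;
                }
            }
        }

        // Fallback: bitmap scan (existing logic)
        // ...
    }

    // 3. Global pool (existing logic)
    // ...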
}
```
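
`mini_pull_batch()` is not defined anywhere in this document. A minimal sketch of what such a helper might look like, assuming both magazines are simple LIFO arrays. The `DemoMag` type, its fields, and `demo_pull_batch` are invented for illustration and will differ from hakmem's real `PageMiniMag` and TLS magazine.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAG_CAP 32

/* Hypothetical LIFO magazine: items[0..count) are valid. */
typedef struct {
    void* items[DEMO_MAG_CAP];
    int   count;
} DemoMag;

/* Move up to `max` pointers from `src` to `dst`.
 * Stops early when `src` runs empty or `dst` is full.
 * Returns the number of pointers moved. */
static int demo_pull_batch(DemoMag* src, DemoMag* dst, int max) {
    assert(src != NULL && dst != NULL && src != dst);
    int moved = 0;
    while (moved < max && src->count > 0 && dst->count < DEMO_MAG_CAP) {
        dst->items[dst->count++] = src->items[--src->count];
        moved++;
    }
    return moved;
}
```

Returning the count lets the caller distinguish "nothing to pull" from a successful refill, which is how the `pulled > 0` check in the listing above uses it.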

### Free-side changes

```c
void hak_tiny_free_with_slab(...) {
    // Phase 4.4: remove the push-model logic
    // Write every item straight to the bitmap (simpler)

    for (int i = 0; i < mag->count; i++) {
        PoolItem it = mag->items[i];
        TinySlab* owner = hak_tiny_owner_slab(it.ptr);
        if (!owner) continue;

        // Spill to the bitmap (existing logic)
        // ... bitmap operations ...
    }
}
```

### Trade-off

**Advantages**:
- Free latency is stable (no overhead)
- The allocation side is in control (pull only when needed)

**Disadvantages**:
- Allocation latency increases slightly (cost of pulling from the mini-mag)
- More complex to implement

---

## Measurement and diagnostics

### Statistics output

```c
#include <inttypes.h>  // PRIu64
#include <stdio.h>

void hak_tiny_print_phase4_stats(void) {
    printf("========================================\n");
    printf("Phase 4 Statistics\n");
    printf("========================================\n");

    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        uint64_t spill = g_tiny_pool.phase4_spill_count[i];
        uint64_t mini = g_tiny_pool.phase4_mini_push[i];
        uint64_t bitmap = g_tiny_pool.phase4_bitmap_spill[i];
        uint64_t gate = g_tiny_pool.phase4_gate_skip[i];

        // Gated spills return before phase4_spill_count is incremented,
        // so the total number of spill events is spill + gate.
        uint64_t events = spill + gate;
        uint64_t items = mini + bitmap;
        if (events == 0) continue;

        // Guard the divisions: a class may have been gated every time.
        double mini_ratio = items ? (double)mini / items * 100 : 0.0;
        double bitmap_ratio = items ? 100 - mini_ratio : 0.0;
        double gate_ratio = (double)gate / events * 100;

        printf("Class %d (%zu B):\n", i, g_tiny_class_sizes[i]);
        printf("  Spill count:     %" PRIu64 "\n", spill);
        printf("  Mini-mag push:   %" PRIu64 " (%.1f%%)\n", mini, mini_ratio);
        printf("  Bitmap spill:    %" PRIu64 " (%.1f%%)\n", bitmap, bitmap_ratio);
        printf("  Gate skip:       %" PRIu64 " (%.1f%%)\n", gate, gate_ratio);
        printf("\n");
    }
}
```
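
Percentages like the ones printed above need care when a denominator can be zero, for example for a class that was only ever gated. One way to factor that guard out is a small helper; `safe_pct` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* part / whole as a percentage; 0.0 when whole is zero. */
static double safe_pct(uint64_t part, uint64_t whole) {
    assert(part <= whole);
    if (whole == 0) return 0.0;
    return (double)part / (double)whole * 100.0;
}
```

With the counters above, the gate ratio would be `safe_pct(gate, spill + gate)`, assuming `phase4_spill_count` counts only non-gated spills as in the Phase 4.2 listing.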

### Benchmark comparison

```bash
# Phase 3 (baseline)
git checkout <phase3-commit>
make clean && make bench_tiny
./bench_tiny > results_phase3.txt

# Phase 4.1
git checkout <phase4.1-commit>
make clean && make bench_tiny
./bench_tiny > results_phase4_1.txt

# Phase 4.2
git checkout <phase4.2-commit>
make clean && make bench_tiny
./bench_tiny > results_phase4_2.txt

# Phase 4.3
git checkout <phase4.3-commit>
make clean && make bench_tiny
./bench_tiny > results_phase4_3.txt

# Compare each Phase 4 result against the baseline (diff compares two files at a time)
for f in results_phase4_*.txt; do
    diff results_phase3.txt "$f"
done
```

---

## Success criteria summary

| Phase | Implementation time | Expected throughput | Risk | Required |
|-------|---------|---------|-------|-----|
| 4.1 (A+B) | 5-10 min | 385-390 M ops/sec | Low | ✅ Yes |
| 4.2 (gate) | 10-20 min | 390-395 M ops/sec | Low to medium | ✅ Yes |
| 4.3 (batching) | 30-40 min | 395-400 M ops/sec | Medium to high | ⚠️ Conditional |
| 4.4 (pull) | 1-2 hours | Addresses root cause | High | ❌ Future |

**Revert condition**:
- Still < 385 M ops/sec after Phase 4.2 → revert all of Phase 4

**Continue condition**:
- >= 390 M ops/sec at Phase 4.2 → proceed to Phase 4.3
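
The two rules leave a gap between 385 and 390 M ops/sec, where Phase 4 is kept but Phase 4.3 is not started. A sketch of the decision as code follows. The enum and function names are illustrative, and treating the gap as "hold at Phase 4.2" is an interpretation of the rules rather than something the roadmap states.

```c
#include <assert.h>

typedef enum {
    DECISION_REVERT_PHASE4,   /* below 385: go back to Phase 3 */
    DECISION_HOLD_AT_4_2,     /* 385 up to 390: keep 4.2, do not start 4.3 */
    DECISION_PROCEED_TO_4_3   /* 390 or more */
} Phase42Decision;

/* Decide the next step from the Phase 4.2 throughput (M ops/sec). */
static Phase42Decision phase42_decide(double mops) {
    assert(mops >= 0.0);
    if (mops < 385.0)  return DECISION_REVERT_PHASE4;
    if (mops >= 390.0) return DECISION_PROCEED_TO_4_3;
    return DECISION_HOLD_AT_4_2;
}
```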

---

## Timeline

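**Day 1 (today)**:
1. ✅ Prepare documentation (done)
2. 🔄 Implement Phase 4.1 (5-10 min)
3. 🔄 Verify Phase 4.1 (5 min)
4. 🔄 Implement Phase 4.2 (10-20 min)
5. 🔄 Verify Phase 4.2 (5 min)
6. ✅ Commit
7. 📊 Summarize results

**Day 2 (conditional)**:
- Only if Phase 4.2 succeeds
- Implement and verify Phase 4.3

**Future**:
- Phase 4.4 (pull model) to be considered separately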

---

## References

- PHASE4_REGRESSION_ANALYSIS.md (detailed analysis)
- ChatGPT Pro advice (2025-10-26)