hakmem/docs/archive/PHASE4_IMPROVEMENT_ROADMAP.md

# Phase 4 Improvement Roadmap

## Overview

Phase 4 introduced a 3.6% performance regression. This roadmap recovers the lost throughput in stages.

**Current state**:
- Phase 3: 391 M ops/sec
- Phase 4: 373-380 M ops/sec
- **Regression**: -3.6%

**Targets**:
- Phase 4.1 (Quick Win): 385-390 M ops/sec (+1-2%)
- Phase 4.2 (Gating): 390-395 M ops/sec (back to the Phase 3 level)
- Phase 4.3 (Batching): 395-400 M ops/sec (above Phase 3)
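
The percentages in this roadmap are relative throughput changes. As a minimal sketch of that arithmetic (the helper name `pct_change` is hypothetical and not part of hakmem):

```c
#include <assert.h>

/* Relative change of `value` against `baseline`, in percent.
 * A negative result means a regression. */
static double pct_change(double baseline, double value) {
    assert(baseline > 0.0);
    return (value - baseline) / baseline * 100.0;
}
```

With the figures above, `pct_change(391, 373)` is about -4.6 and `pct_change(391, 380)` is about -2.8, so the quoted -3.6% falls inside the measured Phase 4 range.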

---

## Phase 4.1: Quick Win (Option A+B)

### Goals
- Implementation time: **5-10 minutes**
- Expected gain: **+1-2%** (385-390 M ops/sec)
- Risk: **Low**

### Changes

#### Option A: Remove redundant memory accesses

**Before**:
```c
// owner->class_idx is read twice here and once more in stats_record_free()
int is_tls_active = (owner == g_tls_active_slab_a[owner->class_idx] ||
                      owner == g_tls_active_slab_b[owner->class_idx]);

if (is_tls_active && !mini_mag_is_full(&owner->mini_mag)) {
    mini_mag_push(&owner->mini_mag, it.ptr);
    stats_record_free(owner->class_idx);
    continue;
}
```

**After**:
```c
// Read once and reuse
uint8_t cidx = owner->class_idx;
TinySlab* tls_a = g_tls_active_slab_a[cidx];
TinySlab* tls_b = g_tls_active_slab_b[cidx];

if ((owner == tls_a || owner == tls_b) &&
    !mini_mag_is_full(&owner->mini_mag)) {
    mini_mag_push(&owner->mini_mag, it.ptr);
    stats_record_free(cidx);
    continue;
}
```

**Improvements**:
- `owner->class_idx`: 3 reads → 1 read
- `g_tls_active_slab_a/b[cidx]`: the lookups can be hoisted out of the loop, provided `cidx` is the same for every item in the loop (for example, when the magazine being spilled holds a single size class)
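
The hoisting mentioned in the last bullet could look roughly like the sketch below. It assumes all items in one spill share a class index; `Slab`, `NUM_CLASSES`, `g_active_a/b`, and `count_tls_active` are reduced stand-ins invented for illustration, and the real tables are thread-local rather than plain statics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CLASSES 8  /* hypothetical; stands in for TINY_NUM_CLASSES */

typedef struct {
    uint8_t class_idx;
} Slab;  /* reduced stand-in for TinySlab */

static Slab* g_active_a[NUM_CLASSES];
static Slab* g_active_b[NUM_CLASSES];

/* Counts how many owners match a TLS-active slab for one class.
 * The two table lookups happen once, before the loop. */
static size_t count_tls_active(Slab* const* owners, size_t n, uint8_t cidx) {
    assert(cidx < NUM_CLASSES);
    Slab* const tls_a = g_active_a[cidx];  /* hoisted */
    Slab* const tls_b = g_active_b[cidx];  /* hoisted */
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        Slab* owner = owners[i];
        if (owner == NULL) continue;
        if (owner == tls_a || owner == tls_b) hits++;
    }
    return hits;
}
```

If items from different classes can appear in the same loop, the per-item form shown in the **After** block is the one that remains correct.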

#### Option B: Branch prediction hint

```c
// When spilling from the TLS Magazine, items are likely to return to a TLS-active slab
if (__builtin_expect((owner == tls_a || owner == tls_b) &&
                     !mini_mag_is_full(&owner->mini_mag), 1)) {
    mini_mag_push(&owner->mini_mag, it.ptr);
    stats_record_free(cidx);
    continue;
}
```

**Improvements**:
- Fewer branch mispredictions
- Tells the compiler which path is hot, so it can lay out the code in favor of the CPU's branch predictor

### Where to change

**File**: `hakmem_tiny.c`
**Function**: `hak_tiny_free_with_slab()`
**Lines**: 890-922

### Verification

```bash
# Build with the Phase 4.1 changes
make clean && make bench_tiny

# Run the benchmark (3 times)
./bench_tiny 2>&1 | tail -5
./bench_tiny 2>&1 | tail -5
./bench_tiny 2>&1 | tail -5

# Expected result: 385-390 M ops/sec
```
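
Three runs rarely agree exactly, so one way to reduce them to a single number for the success criteria is the median. This is only a sketch: `median_mops` is a hypothetical helper, and the roadmap itself does not say how the three runs should be combined.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void* a, const void* b) {
    double x = *(const double*)a;
    double y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Sorts `runs` in place and returns the median throughput. */
static double median_mops(double* runs, size_t n) {
    assert(runs != NULL && n > 0);
    qsort(runs, n, sizeof runs[0], cmp_double);
    return (n % 2) ? runs[n / 2]
                   : (runs[n / 2 - 1] + runs[n / 2]) / 2.0;
}
```

The median is less sensitive to a single noisy run than the mean, which matters when the expected gain is only 1-2%.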

### Success criteria

- **Minimum**: 380 M ops/sec (no worse than today)
- **Target**: 385-390 M ops/sec (+1-2%)
- **Ideal**: 390+ M ops/sec (Phase 3 level)

---

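## Phase 4.2: High-water gate (Option E-1)

### Goals
- Implementation time: **10-20 minutes**
- Expected gain: **+2-5%** (390-395 M ops/sec)
- Risk: **Low to medium**

### Changes

#### High-water gate logic

**Concept**:
> When the TLS Magazine is at its high-water mark (≥75%), skip Phase 4 entirely.
> Rationale: the next allocations will be served from the TLS Magazine, so pushing into the mini-mag is wasted work.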

**Implementation**:
```c
// Add near the top of hak_tiny_free_with_slab()
void hak_tiny_free_with_slab(...) {
    // ... existing preprocessing ...

    // Phase 4.2: High-water gate
    int tls_occ = mag->count;  // Current TLS Magazine occupancy
    int tls_cap = TLS_MAG_CAPACITY;  // 2048

    if (tls_occ >= (tls_cap * 3 / 4)) {
        // High-water: Phase 4 disabled
        // Write every item straight to the bitmap (existing logic)
        for (int i = 0; i < mag->count; i++) {
            PoolItem it = mag->items[i];
            TinySlab* owner = hak_tiny_owner_slab(it.ptr);
            if (!owner) continue;

            // Spill to bitmap (reuses existing logic)
            // ... bitmap operations ...
        }

        // Statistics
        g_tiny_pool.phase4_gate_skip[class_idx]++;
        return;
    }

    // Low-water: run Phase 4 (existing logic)
    for (int i = 0; i < mag->count; i++) {
        PoolItem it = mag->items[i];
        TinySlab* owner = hak_tiny_owner_slab(it.ptr);
        if (!owner) continue;

        // Logic optimized in Phase 4.1
        uint8_t cidx = owner->class_idx;
        TinySlab* tls_a = g_tls_active_slab_a[cidx];
        TinySlab* tls_b = g_tls_active_slab_b[cidx];

        if (__builtin_expect((owner == tls_a || owner == tls_b) &&
                             !mini_mag_is_full(&owner->mini_mag), 1)) {
            mini_mag_push(&owner->mini_mag, it.ptr);
            stats_record_free(cidx);
            g_tiny_pool.phase4_mini_push[cidx]++;
            continue;
        }

        // Spill to bitmap
        // ... bitmap operations ...
        g_tiny_pool.phase4_bitmap_spill[cidx]++;
    }

    g_tiny_pool.phase4_spill_count[class_idx]++;
}
```

#### Constant definitions

```c
// In hakmem_tiny.h or hakmem_tiny.c
#define TLS_MAG_CAPACITY 2048
#define TLS_MAG_HIGH_WATER (TLS_MAG_CAPACITY * 3 / 4)  // 1536
#define TLS_MAG_LOW_WATER (TLS_MAG_CAPACITY / 4)        // 512
```
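
The values in the comments (1536 and 512) can be pinned down at compile time, so a later change to the capacity cannot silently invalidate them. The sketch below repeats the three constants from this section and adds C11 `_Static_assert` checks; the helper `tls_mag_at_high_water` is a hypothetical name introduced for illustration.

```c
#include <assert.h>

#define TLS_MAG_CAPACITY   2048
#define TLS_MAG_HIGH_WATER (TLS_MAG_CAPACITY * 3 / 4)
#define TLS_MAG_LOW_WATER  (TLS_MAG_CAPACITY / 4)

/* Compile-time checks (C11): the build fails if the marks drift. */
_Static_assert(TLS_MAG_HIGH_WATER == 1536, "high-water mark should be 75% of capacity");
_Static_assert(TLS_MAG_LOW_WATER == 512, "low-water mark should be 25% of capacity");
_Static_assert(TLS_MAG_LOW_WATER < TLS_MAG_HIGH_WATER, "watermarks must be ordered");

/* Hypothetical helper: 1 when the magazine is at or above the high-water mark. */
static inline int tls_mag_at_high_water(int count) {
    return count >= TLS_MAG_HIGH_WATER;
}
```

Using the named constant in the gate, instead of recomputing `tls_cap * 3 / 4` inline, keeps the threshold in one place.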

#### Additional statistics

```c
// hakmem_tiny.h: add to the TinyPool struct
typedef struct {
    // Existing
    uint64_t alloc_count[TINY_NUM_CLASSES];
    uint64_t free_count[TINY_NUM_CLASSES];
    uint64_t slab_count[TINY_NUM_CLASSES];

    // Phase 4 measurement (new)
    uint64_t phase4_spill_count[TINY_NUM_CLASSES];    // Phase 4 decision passes
    uint64_t phase4_mini_push[TINY_NUM_CLASSES];      // Successful mini-mag pushes
    uint64_t phase4_bitmap_spill[TINY_NUM_CLASSES];   // Bitmap spills
    uint64_t phase4_gate_skip[TINY_NUM_CLASSES];      // High-water skips
} TinyPool;
```

```c
// hakmem_tiny.c: initialization
void hak_tiny_init(void) {
    // ... existing initialization ...

    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        g_tiny_pool.phase4_spill_count[i] = 0;
        g_tiny_pool.phase4_mini_push[i] = 0;
        g_tiny_pool.phase4_bitmap_spill[i] = 0;
        g_tiny_pool.phase4_gate_skip[i] = 0;
    }
}
```
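
Because these counters are incremented on the free path that is being tuned, it may be worth making them removable at build time so they do not distort the measurement. The following is a sketch under that assumption; the flag `HAK_PHASE4_STATS`, the macro `PHASE4_STAT_INC`, and the `demo_*` names are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#ifndef HAK_PHASE4_STATS       /* hypothetical build flag */
#define HAK_PHASE4_STATS 1
#endif

#if HAK_PHASE4_STATS
#  define PHASE4_STAT_INC(counter) ((void)++(counter))
#else
#  define PHASE4_STAT_INC(counter) ((void)0)
#endif

/* Objects with static storage duration start out zeroed in C. */
static uint64_t demo_mini_push[8];

static void demo_record_push(int cidx) {
    assert(cidx >= 0 && cidx < 8);
    PHASE4_STAT_INC(demo_mini_push[cidx]);
}
```

Building the benchmark once with `-DHAK_PHASE4_STATS=0` would show how much of the measured cost comes from the counters themselves.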

### Verification

```bash
# Build with the Phase 4.2 changes
make clean && make bench_tiny

# Run the benchmark (3 times)
./bench_tiny 2>&1 | tail -5
./bench_tiny 2>&1 | tail -5
./bench_tiny 2>&1 | tail -5

# Check statistics (after implementation)
# Add the phase4 statistics to hak_tiny_print_stats()
./test_mf2 2>&1 | grep -A 10 "Phase 4"

# Expected result: 390-395 M ops/sec (Phase 3 level)
```

### Success criteria

- **Minimum**: 385 M ops/sec (Phase 4.1 level held)
- **Target**: 390-395 M ops/sec (back to the Phase 3 level)
- **Ideal**: 395+ M ops/sec (above Phase 3)

### Revert decision

If throughput is still below 385 M ops/sec after Phase 4.2:
- **Revert all of Phase 4**
- Return to Phase 3 (391 M ops/sec)
- Consider the pull-based approach

---

## Phase 4.3: Per-slab batching (Option E-2)

### Goals
- Implementation time: **30-40 minutes**
- Expected gain: **+2-5%** (395-400 M ops/sec)
- Risk: **Medium to high** (complex implementation)

### Changes

#### Per-slab grouping

**Concept**:
> Group the 256 spilled items by slab.
> The is_tls_active check drops from 256 times to once per slab (1-8 times).

**Data structure**:
```c
#define SLAB_BUCKETS 32  // Number of buckets for linear probing

typedef struct {
    TinySlab* owner;
    void* ptrs[256];
    int count;
} SlabBucket;
```
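
Before placing an array of these buckets on the stack, it helps to know how large it is. The sketch below repeats the structure with an opaque `TinySlab` and computes the table size; `slab_bucket_table_bytes` is a hypothetical helper for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SLAB_BUCKETS 32

typedef struct TinySlab TinySlab;  /* opaque here; only pointers are stored */

typedef struct {
    TinySlab* owner;
    void* ptrs[256];
    int count;
} SlabBucket;

/* Bytes needed for one `SlabBucket buckets[SLAB_BUCKETS]` array. */
static size_t slab_bucket_table_bytes(void) {
    return sizeof(SlabBucket) * SLAB_BUCKETS;
}
```

On a typical 64-bit target each bucket is 2064 bytes (8 + 2048 + 4, plus 4 bytes of padding), so the table is 66,048 bytes, about 64.5 KiB. A `= {0}` initializer clears all of it on every call, which is a cost to weigh against the expected gain from batching.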

**Implementation**:
```c
void hak_tiny_free_with_slab_batched(...) {
    // Phase 4.2: High-water gate
    int tls_occ = mag->count;
    if (tls_occ >= TLS_MAG_HIGH_WATER) {
        fast_spill_all_to_bitmap(mag);
        return;
    }

    // Phase 4.3: Per-slab batching
    SlabBucket buckets[SLAB_BUCKETS] = {0};

    // 1st pass: group by slab
    // NOTE: as written, this assumes fewer than SLAB_BUCKETS distinct slabs
    // and at most 256 items per slab; otherwise the probe loop never ends or
    // ptrs[] overflows, so bounds checks are needed if that cannot be guaranteed.
    for (int i = 0; i < mag->count; i++) {
        PoolItem it = mag->items[i];
        TinySlab* owner = hak_tiny_owner_slab(it.ptr);
        if (!owner) continue;

        // Linear probing hash
        size_t hash = ((uintptr_t)owner >> 6) & (SLAB_BUCKETS - 1);
        while (buckets[hash].owner && buckets[hash].owner != owner) {
            hash = (hash + 1) & (SLAB_BUCKETS - 1);
        }

        if (!buckets[hash].owner) {
            buckets[hash].owner = owner;
        }
        buckets[hash].ptrs[buckets[hash].count++] = it.ptr;
    }

    // 2nd pass: process per slab (the check runs once per slab)
    for (int b = 0; b < SLAB_BUCKETS; b++) {
        if (!buckets[b].owner) continue;

        TinySlab* slab = buckets[b].owner;
        uint8_t cidx = slab->class_idx;
        TinySlab* tls_a = g_tls_active_slab_a[cidx];
        TinySlab* tls_b = g_tls_active_slab_b[cidx];

        int is_tls_active = (slab == tls_a || slab == tls_b);
        int room = mini_capacity(&slab->mini_mag) - mini_count(&slab->mini_mag);
        int take = is_tls_active ? min(room, buckets[b].count) : 0;

        // Batch push into the mini-mag
        for (int i = 0; i < take; i++) {
            mini_mag_push(&slab->mini_mag, buckets[b].ptrs[i]);
            stats_record_free(cidx);
            g_tiny_pool.phase4_mini_push[cidx]++;
        }

        // Batch spill the remainder to the bitmap
        for (int i = take; i < buckets[b].count; i++) {
            // ... bitmap operations ...
            g_tiny_pool.phase4_bitmap_spill[cidx]++;
        }
    }

    g_tiny_pool.phase4_spill_count[class_idx]++;
}
```
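
The first pass above probes until it finds a slot and appends without checking the bucket size. A bounded variant could look like the sketch below, which reports failure so the caller can take a fallback path such as spilling that item directly to the bitmap. The names `Bucket`, `bucket_find`, and `bucket_add`, and the small sizes, are assumptions chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUCKETS     8   /* power of two; kept small for illustration */
#define BUCKET_CAP  4

typedef struct {
    const void* owner;
    void* ptrs[BUCKET_CAP];
    int count;
} Bucket;

/* Returns the bucket for `owner`, or NULL if the table has no free slot
 * or the owner's bucket is already full. Each slot is probed at most once. */
static Bucket* bucket_find(Bucket* table, const void* owner) {
    assert(table != NULL && owner != NULL);
    size_t h = ((uintptr_t)owner >> 6) & (BUCKETS - 1);
    for (int probes = 0; probes < BUCKETS; probes++) {
        Bucket* b = &table[h];
        if (b->owner == NULL) b->owner = owner;  /* claim an empty slot */
        if (b->owner == owner) return (b->count < BUCKET_CAP) ? b : NULL;
        h = (h + 1) & (BUCKETS - 1);
    }
    return NULL;  /* table full: more distinct owners than buckets */
}

/* Adds ptr to its owner's bucket; returns 0 when the caller must use
 * the fallback path instead. */
static int bucket_add(Bucket* table, const void* owner, void* ptr) {
    Bucket* b = bucket_find(table, owner);
    if (b == NULL) return 0;
    b->ptrs[b->count++] = ptr;
    return 1;
}
```

The extra comparison per probe is cheap next to the cost of an out-of-bounds write or an endless loop in the allocator's free path.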

#### Helper functions

```c
// Mini-magazine capacity and current count
static inline int mini_capacity(PageMiniMag* mag) {
    return mag->capacity;
}

static inline int mini_count(PageMiniMag* mag) {
    return mag->count;
}

// min macro (evaluates one argument twice; avoid arguments with side effects)
#define min(a, b) ((a) < (b) ? (a) : (b))
```
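
Since both call-site arguments are `int`, an inline function is a possible alternative to the `min` macro. This is a sketch; `min_int` is a hypothetical name.

```c
#include <assert.h>

/* Evaluates each argument exactly once, unlike a function-like macro. */
static inline int min_int(int a, int b) {
    return (a < b) ? a : b;
}
```

A function also avoids clashing with other headers that define their own `min` macro.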

### Verification

```bash
# Build with the Phase 4.3 changes
make clean && make bench_tiny

# Run the benchmark (5 times)
for i in {1..5}; do
    ./bench_tiny 2>&1 | tail -5
done

# Check statistics
./test_mf2 2>&1 | grep -A 10 "Phase 4"

# Expected result: 395-400 M ops/sec (above Phase 3)
```

### Success criteria

- **Minimum**: 390 M ops/sec (Phase 4.2 level held)
- **Target**: 395-400 M ops/sec (above Phase 3)
- **Ideal**: 400+ M ops/sec (5% improvement)

---

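## Phase 4.4: Pull-based inversion (future)

### Goals
- Implementation time: **1-2 hours**
- Expected gain: **Fixes the root cause**
- Risk: **High** (architecture change)

### Concept

**Current (push model)**:
- The free side (spill) pushes items back into the mini-mag
- Every spilled item pays the overhead
- The benefit shows up on the allocation side (and is uncertain)

**Proposed (pull model)**:
- The allocation side pulls from the mini-mag only when needed
- No mini-mag overhead on the free side
- Allocation latency increases slightly (trade-off)

### Where to change

**File**: `hakmem_tiny.c`
**Function**: `hak_tiny_alloc()`
**Timing**: Immediately before the bitmap scan

### Implementation sketch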

```c
void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0) return NULL;

    // 1. TLS Magazine (fast path)
    if (!mini_mag_is_empty(&g_tls_mag[class_idx])) {
        return mini_mag_pop(&g_tls_mag[class_idx]);
    }

    // 2. TLS Active Slabs (medium path)
    TinySlab* tls = g_tls_active_slab_a[class_idx];
    if (!(tls && tls->free_count > 0)) {
        tls = g_tls_active_slab_b[class_idx];
    }

    if (tls && tls->free_count > 0) {
        // Phase 4.4: Pull from page mini-mag
        if (!mini_mag_is_empty(&tls->mini_mag)) {
            void* p = mini_mag_pop(&tls->mini_mag);
            if (p) {
                stats_record_alloc(class_idx);
                return p;
            }
        }

        // Phase 4.4: Refill TLS Magazine from page mini-mag
        if (!mini_mag_is_empty(&tls->mini_mag)) {
            int pulled = mini_pull_batch(&tls->mini_mag, &g_tls_mag[class_idx], 16);
            if (pulled > 0) {
                void* p = mini_mag_pop(&g_tls_mag[class_idx]);
                if (p) {
                    stats_record_alloc(class_idx);
                    return p;
                }
            }
        }

        // Fallback: Bitmap scan (existing logic)
        // ...
    }

    // 3. Global pool (existing logic)
    // ...
}
```
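
The sketch above calls `mini_pull_batch()`, which is not defined in this document. One plausible shape for it is shown below, assuming both magazines are simple LIFO arrays; `Mag`, `MAG_CAP`, and `mag_pull_batch` are stand-ins invented for this example and do not reflect the real `PageMiniMag` or TLS Magazine layout.

```c
#include <assert.h>
#include <stddef.h>

#define MAG_CAP 32  /* hypothetical capacity for the sketch */

typedef struct {
    void* items[MAG_CAP];
    int count;
} Mag;  /* stand-in for both the page mini-magazine and the TLS magazine */

/* Moves up to max_items pointers from src to dst; returns how many moved.
 * The batch is limited by what src holds and by the room left in dst. */
static int mag_pull_batch(Mag* src, Mag* dst, int max_items) {
    assert(src != NULL && dst != NULL && src != dst);
    int n = max_items;
    if (n > src->count) n = src->count;
    if (n > MAG_CAP - dst->count) n = MAG_CAP - dst->count;
    if (n < 0) n = 0;
    for (int i = 0; i < n; i++) {
        dst->items[dst->count++] = src->items[--src->count];
    }
    return n;
}
```

Pulling a batch (16 in the sketch above) amortizes the slower path over several subsequent allocations, which then hit the TLS Magazine fast path.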

### Free-side changes

```c
void hak_tiny_free_with_slab(...) {
    // Phase 4.4: remove the push-model logic
    // Write every item straight to the bitmap (simpler)

    for (int i = 0; i < mag->count; i++) {
        PoolItem it = mag->items[i];
        TinySlab* owner = hak_tiny_owner_slab(it.ptr);
        if (!owner) continue;

        // Spill to bitmap (existing logic)
        // ... bitmap operations ...
    }
}
```

### Trade-off

**Pros**:
- Free latency is stable (no mini-mag overhead)
- The allocation side is in control (pulls only when needed)

**Cons**:
- Allocation latency increases slightly (cost of pulling from the mini-mag)
- Complex to implement

---

## Measurement and diagnostics

### Statistics output

```c
// Requires <stdio.h> and <inttypes.h> (for PRIu64).
void hak_tiny_print_phase4_stats(void) {
    printf("========================================\n");
    printf("Phase 4 Statistics\n");
    printf("========================================\n");

    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        uint64_t spill = g_tiny_pool.phase4_spill_count[i];
        uint64_t mini = g_tiny_pool.phase4_mini_push[i];
        uint64_t bitmap = g_tiny_pool.phase4_bitmap_spill[i];
        uint64_t gate = g_tiny_pool.phase4_gate_skip[i];

        // phase4_spill_count is incremented only after the gate check,
        // so the total number of spill passes is spill + gate.
        uint64_t passes = spill + gate;
        if (passes == 0) continue;

        // Guard against 0/0 when no item was pushed or spilled.
        uint64_t items = mini + bitmap;
        double mini_ratio = items ? (double)mini / items * 100 : 0.0;
        double bitmap_ratio = items ? 100 - mini_ratio : 0.0;
        double gate_ratio = (double)gate / passes * 100;

        printf("Class %d (%zu B):\n", i, g_tiny_class_sizes[i]);
        printf("  Spill count:     %" PRIu64 "\n", spill);
        printf("  Mini-mag push:   %" PRIu64 " (%.1f%%)\n", mini, mini_ratio);
        printf("  Bitmap spill:    %" PRIu64 " (%.1f%%)\n", bitmap, bitmap_ratio);
        printf("  Gate skip:       %" PRIu64 " (%.1f%%)\n", gate, gate_ratio);
        printf("\n");
    }
}
```

### Benchmark comparison

```bash
# Phase 3 (baseline)
git checkout <phase3-commit>
make clean && make bench_tiny
./bench_tiny > results_phase3.txt

# Phase 4.1
git checkout <phase4.1-commit>
make clean && make bench_tiny
./bench_tiny > results_phase4_1.txt

# Phase 4.2
git checkout <phase4.2-commit>
make clean && make bench_tiny
./bench_tiny > results_phase4_2.txt

# Phase 4.3
git checkout <phase4.3-commit>
make clean && make bench_tiny
./bench_tiny > results_phase4_3.txt

# Compare (diff takes two files, so compare each Phase 4 result against Phase 3)
for f in results_phase4_*.txt; do
    diff results_phase3.txt "$f"
done
```

---

## Success criteria summary

| Phase | Implementation time | Expected throughput | Risk | Required |
|-------|---------|---------|-------|-----|
| 4.1 (A+B) | 5-10 min | 385-390 M ops/sec | Low | ✅ Yes |
| 4.2 (Gate) | 10-20 min | 390-395 M ops/sec | Low to medium | ✅ Yes |
| 4.3 (Batching) | 30-40 min | 395-400 M ops/sec | Medium to high | ⚠️ Conditional |
| 4.4 (Pull) | 1-2 hours | Root-cause fix | High | ❌ Future |

**Revert condition**:
- Still < 385 M ops/sec after Phase 4.2 → revert all of Phase 4

**Continue condition**:
- >= 390 M ops/sec at Phase 4.2 → proceed to Phase 4.3
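
The two conditions can be written down as a small decision function. Note that the roadmap does not say what happens between 385 and 390 M ops/sec, so the sketch returns a separate "undecided" value for that band instead of guessing; `Phase4Decision` and `phase4_decide` are hypothetical names.

```c
#include <assert.h>

typedef enum {
    PHASE4_REVERT,    /* below 385 M ops/sec after Phase 4.2 */
    PHASE4_UNDECIDED, /* 385 up to (not including) 390: not covered by the roadmap */
    PHASE4_PROCEED    /* 390 M ops/sec or more: continue to Phase 4.3 */
} Phase4Decision;

/* Maps the throughput measured after Phase 4.2 to the next step. */
static Phase4Decision phase4_decide(double mops_after_4_2) {
    assert(mops_after_4_2 >= 0.0);
    if (mops_after_4_2 < 385.0) return PHASE4_REVERT;
    if (mops_after_4_2 >= 390.0) return PHASE4_PROCEED;
    return PHASE4_UNDECIDED;
}
```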

---

## Timeline

**Day 1 (today)**:
1. ✅ Documentation (done)
2. 🔄 Implement Phase 4.1 (5-10 min)
3. 🔄 Verify Phase 4.1 (5 min)
4. 🔄 Implement Phase 4.2 (10-20 min)
5. 🔄 Verify Phase 4.2 (5 min)
6. ✅ Commit
7. 📊 Summarize results

**Day 2 (conditional)**:
- Only if Phase 4.2 succeeds
- Implement and verify Phase 4.3

**Future**:
- Phase 4.4 (pull model) to be considered separately

---

## References

- PHASE4_REGRESSION_ANALYSIS.md (detailed analysis)
- ChatGPT Pro advice (2025-10-26)