hakmem/docs/analysis/PHASE4_REGRESSION_ANALYSIS.md

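# Phase 4 Performance Regression: Root-Cause Analysis and Improvement Strategy

## Executive Summary

**Phase 4 results**:
- Phase 3: 391 M ops/sec
- Phase 4: 373-380 M ops/sec
- **Regression**: -3.6%

**Root cause**:
> Paying up front on free (push model) loses on spill-heavy workloads. Switch to taking items only when needed (pull model).

**Fixes (in priority order)**:
1. **Option E**: Gating + batching (structural improvement)
2. **Option D**: Trade-off measurement (empirical validation)
3. **Option A+B**: Micro-optimizations (quick win)
4. **Pull-model inversion**: Fundamental architecture change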

---

## What Phase 4 Implemented

### Goal
When spilling from the TLS Magazine to a slab, return items preferentially to the mini-magazine if the owning slab is TLS-active, in order to **speed up the next allocation**.

### Implementation (hakmem_tiny.c:890-922)

```c
// Phase 4: TLS Magazine spill logic (inside hak_tiny_free_with_slab)
// `it` is the magazine item being spilled in this iteration
// (its declaration is elided in this excerpt).
for (int i = 0; i < mag->count; i++) {
    TinySlab* owner = hak_tiny_owner_slab(it.ptr);

    // Check added by Phase 4 (this is where the overhead comes from)
    int is_tls_active = (owner == g_tls_active_slab_a[owner->class_idx] ||
                          owner == g_tls_active_slab_b[owner->class_idx]);

    if (is_tls_active && !mini_mag_is_full(&owner->mini_mag)) {
        // Fast path: return the item to the mini-magazine (bitmap untouched)
        mini_mag_push(&owner->mini_mag, it.ptr);
        stats_record_free(owner->class_idx);
        continue;
    }

    // Slow path: write directly to the bitmap (pre-existing logic)
    // ... bitmap operations ...
}
```

### Design Intent

**Trade-off**:
- **Free path**: adds a small overhead (the is_tls_active check)
- **Alloc path**: gets faster because it can take from the mini-magazine (avoids the bitmap scan)

**Expected scenario**:
- Spills are rare (the TLS Magazine seldom fills up)
- If the mini-magazine holds items, the next allocation is fast (5-6ns → 1-2ns)

---

## Problem Analysis

### Overhead Breakdown

**Cost paid for every item**:
```c
int is_tls_active = (owner == g_tls_active_slab_a[owner->class_idx] ||
                      owner == g_tls_active_slab_b[owner->class_idx]);
```

1. `owner->class_idx` memory access × **2**
2. `g_tls_active_slab_a[...]` TLS access
3. `g_tls_active_slab_b[...]` TLS access
4. Pointer comparison × 2
5. `mini_mag_is_full()` check

**Estimated cost**: about 2-3 ns per item

### Benchmark Characteristics (bench_tiny)

**Workload**:
- Repeat "100 allocs → 100 frees" 10M times
- TLS Magazine capacity: 2048 items
- Spill trigger: Magazine is full (2048 items)
- Spill size: 256 items

**Spill frequency**:
- 100 allocs × 10M = 1B allocations
- Number of spills: 1B / 2048 ≈ 488k spills
- Total spilled items: 488k × 256 ≈ 125M items

**Total Phase 4 cost**:
- 125M items × 2.5 ns = **312.5 ms overhead**
- Total time: ~5.3 sec
- Overhead ratio: 312.5 / 5300 = **5.9%**
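
The estimate above can be reproduced with a few lines of C. This is only a back-of-envelope sketch: the struct and function names are hypothetical (they do not exist in hakmem), and the inputs are the figures quoted in this section, including the assumed midpoint of 2.5 ns per item.

```c
#include <stdint.h>

/* Back-of-envelope model of the Phase 4 free-path overhead.
 * Hypothetical helper, not part of hakmem. */
typedef struct {
    uint64_t spills;        /* number of spill events            */
    uint64_t spill_items;   /* items processed across all spills */
    double   overhead_ms;   /* added free-path time, in ms       */
    double   overhead_pct;  /* share of the total run time, in % */
} Phase4CostEstimate;

static Phase4CostEstimate phase4_estimate_cost(uint64_t total_allocs,
                                               uint64_t mag_capacity,
                                               uint64_t spill_size,
                                               double   ns_per_item,
                                               double   total_time_ms)
{
    Phase4CostEstimate e;
    e.spills       = total_allocs / mag_capacity;       /* one spill per full magazine */
    e.spill_items  = e.spills * spill_size;
    e.overhead_ms  = (double)e.spill_items * ns_per_item / 1e6;  /* ns -> ms */
    e.overhead_pct = e.overhead_ms / total_time_ms * 100.0;
    return e;
}
```

Calling `phase4_estimate_cost(1000000000ULL, 2048, 256, 2.5, 5300.0)` corresponds to the bench_tiny numbers in this section. Varying `ns_per_item` between 2.0 and 3.0 shows how sensitive the estimate is to the per-item cost guess.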

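**Benefit from Phase 4**:
- While the TLS Magazine is at high water (≥75%), allocations from the mini-magazine **never happen**
- → **Zero benefit; only the cost shows up**

### The Fundamental Design Mistake

> **Preparing the speed-up at free time (push model) becomes an up-front cost on spill-heavy workloads (bench_tiny) and tends to lose.**

**Problems**:
1. **Spills are frequent**: 488k spills in bench_tiny
2. **TLS Magazine is at high water**: the next alloc is served from TLS (no need for the mini-mag)
3. **Up-front cost**: every spilled item pays the overhead
4. **No benefit**: allocations from the mini-mag never occur

**The right approach**:
- **Pull model**: take from the mini-mag on the allocation side, only when needed
- **Gating**: skip Phase 4 while the TLS Magazine is at high water
- **Batching**: decide per slab (not per item)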

---

## Advice from ChatGPT Pro

### 1. Improvements to Implement First

#### **Option E: Gating + batching** (most important, new proposal)

**E-1: High-water gate**
```c
// Decide once, before the spill starts
int tls_occ = tls_mag_occupancy();
if (tls_occ >= TLS_MAG_HIGH_WATER) {
    // Write everything straight to the bitmap (Phase 4 disabled)
    fast_spill_all_to_bitmap(mag);
    return;
}
```

**Effect**:
- Skips Phase 4 entirely while the TLS Magazine is at high water (≥75%)
- **Eliminates** the wasted work in the "the next alloc will come from TLS anyway" case

**E-2: Per-slab batch**
```c
// Group the 256 spilled items by slab (32 buckets, linear probing).
// is_tls_active checks drop from 256 to one per slab (typically 1-8).
// Assumes a spill touches at most BUCKETS distinct slabs; otherwise the
// probe loop below would not terminate and needs a fallback.

Bucket bk[BUCKETS] = {0};

// 1st pass: grouping
for (int i = 0; i < mag->count; ++i) {
    TinySlab* owner = hak_tiny_owner_slab(mag->items[i]);
    size_t h = ((uintptr_t)owner >> 6) & (BUCKETS-1);
    while (bk[h].owner && bk[h].owner != owner) h = (h+1) & (BUCKETS-1);
    if (!bk[h].owner) bk[h].owner = owner;
    bk[h].ptrs[bk[h].n++] = mag->items[i];
}

// 2nd pass: process per slab (one check per slab)
for (int b = 0; b < BUCKETS; ++b) if (bk[b].owner) {
    TinySlab* s = bk[b].owner;
    uint8_t cidx = s->class_idx;
    TinySlab* tls_a = g_tls_active_slab_a[cidx];
    TinySlab* tls_b = g_tls_active_slab_b[cidx];

    int is_tls_active = (s == tls_a || s == tls_b);
    int room = mini_capacity(&s->mini_mag) - mini_count(&s->mini_mag);
    int take = is_tls_active ? min(room, bk[b].n) : 0;

    // Push to the mini-magazine in one batch
    for (int i = 0; i < take; ++i) mini_push_bulk(&s->mini_mag, bk[b].ptrs[i]);

    // The remainder goes to the bitmap (ideally OR-ed in one word at a time)
    for (int i = take; i < bk[b].n; ++i) bitmap_set_free(s, bk[b].ptrs[i]);
}
```

**Effect**:
- `is_tls_active` checks: 256 → **one per slab (1-8)**
- `mini_mag_is_full()`: 256 calls → **replaced by a single room computation**
- Per-item loop work (loads/compares/branches) drops **by an order of magnitude**

**Expected effect**: removes the main cause of the 3.6% regression at the root
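
The grouping pass of E-2 can be tried in isolation. The sketch below is a self-contained toy, not hakmem code: `ToySlab`, `ToyItem`, `Bucket`, and `group_by_slab` are hypothetical stand-ins, and the item carries its owner directly instead of calling `hak_tiny_owner_slab()`. Unlike the excerpt above, it bounds the probe loop and reports failure when more than `BUCKETS` distinct slabs appear.

```c
#include <stddef.h>
#include <stdint.h>

#define BUCKETS   32    /* power of two, as in the E-2 sketch */
#define MAX_SPILL 256   /* spill size quoted in this document  */

typedef struct { int id; } ToySlab;                      /* stand-in for TinySlab */
typedef struct { ToySlab *owner; void *ptr; } ToyItem;   /* item + its owning slab */

typedef struct {
    ToySlab *owner;            /* NULL means the bucket is empty */
    int      n;                /* number of pointers collected   */
    void    *ptrs[MAX_SPILL];
} Bucket;

/* Groups `count` items by owning slab using open addressing with linear
 * probing. Returns the number of distinct slabs, or -1 if the input does
 * not fit (too many items or more than BUCKETS distinct slabs). */
static int group_by_slab(const ToyItem *items, int count, Bucket bk[BUCKETS])
{
    if (count > MAX_SPILL) return -1;

    int distinct = 0;
    for (int b = 0; b < BUCKETS; ++b) { bk[b].owner = NULL; bk[b].n = 0; }

    for (int i = 0; i < count; ++i) {
        ToySlab *owner = items[i].owner;
        size_t h = ((uintptr_t)owner >> 6) & (BUCKETS - 1);
        int probes = 0;
        while (bk[h].owner && bk[h].owner != owner) {
            if (++probes == BUCKETS) return -1;   /* table full: caller must fall back */
            h = (h + 1) & (BUCKETS - 1);
        }
        if (!bk[h].owner) { bk[h].owner = owner; ++distinct; }
        bk[h].ptrs[bk[h].n++] = items[i].ptr;
    }
    return distinct;
}
```

After grouping, the TLS-active check and the `room` computation run once per non-empty bucket rather than once per item. Note that a `Bucket[BUCKETS]` table of this shape is roughly 64 KB, so a real implementation would probably keep it in static or thread-local storage rather than on the stack.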

---

#### **Option D: Trade-off measurement** (required)

**Metrics to measure**:

**Free-side cost**:
- `cost_check_per_item`: average cost of the is_tls_active check (ns)
- `spill_items_per_sec`: spilled items per second

**Allocation-side benefit**:
- `mini_hit_ratio`: fraction of items pushed by Phase 4 that are actually consumed from the mini-mag
- `delta_alloc_ns`: ns saved by going bitmap → mini-mag (~3-4ns)

**Break-even calculation**:
```
benefit/sec = mini_hit_ratio × delta_alloc_ns × alloc_from_mini_per_sec
cost/sec    = cost_check_per_item × spill_items_per_sec

Enable Phase 4 only while benefit - cost > 0
```

**Simplified version**:
```c
// Both values are fractions in [0.0, 1.0]
if (mini_hit_ratio < 0.10 || tls_occupancy > 0.75) {
    // Pause Phase 4 temporarily
}
```
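
The break-even rule can be written as a small predicate. This is a sketch under the assumption that the five metrics are sampled elsewhere and passed in; `Phase4Metrics` and both functions are hypothetical names, and the formula is exactly the one stated above.

```c
#include <stdbool.h>

/* Inputs to the Option D break-even rule (hypothetical container). */
typedef struct {
    double mini_hit_ratio;           /* fraction in [0.0, 1.0]            */
    double delta_alloc_ns;           /* ns saved per mini-mag allocation  */
    double alloc_from_mini_per_sec;  /* allocations served by mini-mag    */
    double cost_check_per_item_ns;   /* ns spent per spilled item         */
    double spill_items_per_sec;      /* spilled items per second          */
} Phase4Metrics;

/* benefit/sec - cost/sec, in ns of CPU time per second of run time. */
static double phase4_net_benefit_ns_per_sec(const Phase4Metrics *m)
{
    double benefit = m->mini_hit_ratio * m->delta_alloc_ns * m->alloc_from_mini_per_sec;
    double cost    = m->cost_check_per_item_ns * m->spill_items_per_sec;
    return benefit - cost;
}

/* Phase 4 pays off only while the net benefit is positive. */
static bool phase4_should_enable(const Phase4Metrics *m)
{
    return phase4_net_benefit_ns_per_sec(m) > 0.0;
}
```

With a hit ratio of zero, as argued for bench_tiny, the benefit term vanishes and the predicate is false for any positive spill rate.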

---

#### **Option A+B: Micro-optimizations** (low cost, apply immediately)

**Option A**: Remove the duplicate memory access
```c
// Before: owner->class_idx is read twice
int is_tls_active = (owner == g_tls_active_slab_a[owner->class_idx] ||
                      owner == g_tls_active_slab_b[owner->class_idx]);

// After: read once and reuse
uint8_t cidx = owner->class_idx;
TinySlab* tls_a = g_tls_active_slab_a[cidx];
TinySlab* tls_b = g_tls_active_slab_b[cidx];

if ((owner == tls_a || owner == tls_b) &&
    !mini_mag_is_full(&owner->mini_mag)) {
    // ...
}
```

**Option B**: Branch prediction hint
```c
if (__builtin_expect((owner == tls_a || owner == tls_b) &&
                     !mini_mag_is_full(&owner->mini_mag), 1)) {
    // Fast path - likely taken
}
```

**Expected effect**: +1-2% (not enough to erase the regression)

---

#### **Option C: Locality caching** (situational)

```c
TinySlab* last_owner = NULL;
int last_is_tls = 0;

for (...) {
    TinySlab* owner = hak_tiny_owner_slab(it.ptr);

    int is_tls_active;
    if (owner == last_owner) {
        is_tls_active = last_is_tls;  // Cached!
    } else {
        uint8_t cidx = owner->class_idx;
        is_tls_active = (owner == g_tls_active_slab_a[cidx] ||
                          owner == g_tls_active_slab_b[cidx]);
        last_owner = owner;
        last_is_tls = is_tls_active;
    }

    if (is_tls_active && !mini_mag_is_full(&owner->mini_mag)) {
        // ...
    }
}
```

**Expected effect**: 2-3% when locality is high (subsumed naturally by Option E)

---

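### 2. Overlooked Optimization Techniques

#### **Inverting to a pull model** (fundamental fix)

**Current (push model)**:
- The free side (spill) pushes items back to the mini-mag "in advance"
- Every spilled item pays the overhead
- The benefit materializes on the allocation side, but sometimes it never does

**Improved (pull model)**:
```c
// In alloc_slow(), just before falling through to the bitmap
TinySlab* s = g_tls_active_slab_a[class_idx];
if (s && !mini_mag_is_empty(&s->mini_mag)) {
    int pulled = mini_pull_batch(&s->mini_mag, tls_mag, PULL_BATCH);
    if (pulled > 0) return tls_mag_pop();
}
```

**Effect**:
- The is_tls_active check can be removed from the free side **entirely**
- Free latency is reliably protected
- The allocation side takes items only when needed (no up-front overhead)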
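
To make the pull direction concrete, here is a toy model with two fixed-size LIFO stacks. It is a sketch only: `ToyMag`, `toy_pull_batch`, and `toy_alloc` are hypothetical and merely mimic the roles of `mini_pull_batch()` and `tls_mag_pop()` in the excerpt above, whose real signatures are not shown in this document.

```c
#include <stddef.h>

#define TOY_CAP 64

/* One stack type stands in for both the mini-magazine and the TLS magazine. */
typedef struct {
    void *items[TOY_CAP];
    int   count;
} ToyMag;

/* Moves up to `max_batch` pointers from `src` to `dst`, limited by what
 * `src` holds and by the free room in `dst`. Returns how many were moved. */
static int toy_pull_batch(ToyMag *src, ToyMag *dst, int max_batch)
{
    int room = TOY_CAP - dst->count;
    int n = src->count < max_batch ? src->count : max_batch;
    if (n > room) n = room;
    for (int i = 0; i < n; ++i)
        dst->items[dst->count++] = src->items[--src->count];
    return n;
}

/* Allocation path: serve from the TLS magazine; if it is empty, pull a
 * batch from the mini-magazine first. NULL means "fall through to the
 * bitmap scan" in the real allocator. */
static void *toy_alloc(ToyMag *tls_mag, ToyMag *mini_mag, int pull_batch)
{
    if (tls_mag->count == 0 && mini_mag->count > 0)
        toy_pull_batch(mini_mag, tls_mag, pull_batch);
    if (tls_mag->count > 0)
        return tls_mag->items[--tls_mag->count];
    return NULL;
}
```

The free path does not appear here at all, which is the point of the inversion: the transfer cost is paid only on an allocation that would otherwise have gone to the bitmap.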

---

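#### **Two-level bitmap + word-wide operations**

**Current**:
- Set/clear one bit at a time

**Improved**:
```c
// Summary bitmap (2nd level): bit set of non-empty words
uint64_t bm_top;      // each bit represents one word (64 items); requires N <= 64
uint64_t bm_word[N];  // the actual bitmap

// On spill: OR in a whole word at a time.
// block_idx = first block of the current 64-item chunk, free_mask = bits of
// the blocks being freed in that word (both computed by the caller, elided).
for (int i = 0; i < group_count; i += 64) {
    int word_idx = block_idx / 64;
    bm_word[word_idx] |= free_mask;  // single OR per word
    if (bm_word[word_idx]) bm_top |= (1ULL << word_idx);
}
```

**Effect**:
- No scanning of empty words
- Better cache efficiency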
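
A minimal two-level bitmap might look like the following. This is an illustrative sketch, not the hakmem layout: `TwoLevelBitmap`, `bm_mark_free_mask`, and `bm_alloc` are hypothetical names, a set bit means "free", and `__builtin_ctzll` is the GCC/Clang count-trailing-zeros builtin (the document already relies on GCC builtins via `__builtin_expect`).

```c
#include <stdint.h>

#define BM_WORDS 64   /* 64 words x 64 bits = up to 4096 blocks */

typedef struct {
    uint64_t top;             /* bit w set <=> word[w] has at least one free block */
    uint64_t word[BM_WORDS];  /* bit b of word[w] set <=> block w*64+b is free     */
} TwoLevelBitmap;

/* Marks every block selected by `free_mask` in word `word_idx` as free with
 * a single OR, then updates the summary level.
 * The caller must pass 0 <= word_idx < BM_WORDS. */
static void bm_mark_free_mask(TwoLevelBitmap *bm, int word_idx, uint64_t free_mask)
{
    bm->word[word_idx] |= free_mask;
    if (bm->word[word_idx]) bm->top |= 1ULL << word_idx;
}

/* Takes the lowest-indexed free block and marks it used.
 * Returns the block index, or -1 if nothing is free. */
static int bm_alloc(TwoLevelBitmap *bm)
{
    if (bm->top == 0) return -1;                 /* no non-empty word at all */
    int w = __builtin_ctzll(bm->top);            /* first non-empty word     */
    int b = __builtin_ctzll(bm->word[w]);        /* first free bit in it     */
    bm->word[w] &= bm->word[w] - 1;              /* clear that lowest bit    */
    if (bm->word[w] == 0) bm->top &= ~(1ULL << w);
    return w * 64 + b;
}
```

The allocation side never touches an empty word: one `ctz` on the summary picks the word, a second picks the bit.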

---

#### **Reading the available capacity up front**

```c
// Before: mini_mag_is_full() is called every time
if (!mini_mag_is_full(&owner->mini_mag)) {
    mini_mag_push(...);
}

// After: compute room once
int room = mini_capacity(&s->mini_mag) - mini_count(&s->mini_mag);
if (room == 0) {
    // Skip Phase 4 (nothing is pushed to the mini-magazine)
}
int take = min(room, group_count);
for (int i = 0; i < take; ++i) {
    mini_mag_push(...);  // no is_full check needed
}
```

---

#### **High/low-water two-stage control**

```c
int tls_occ = tls_mag_occupancy();

if (tls_occ >= HIGH_WATER) {
    // Skip Phase 4 entirely
} else if (tls_occ <= LOW_WATER) {
    // Apply Phase 4 aggressively
} else {
    // Middle range: per-slab batch only (no fine-grained checks)
}
```
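
The three-way decision can be factored into a pure function that is easy to unit-test. This is a sketch with assumed thresholds: the 75% high-water mark comes from this document, while the 25% low-water mark and the names `Phase4Mode` and `phase4_select_mode` are assumptions made for illustration.

```c
/* Which Phase 4 strategy to use for the upcoming spill (hypothetical enum). */
typedef enum {
    PHASE4_SKIP,        /* high water: write everything to the bitmap   */
    PHASE4_BATCH_ONLY,  /* middle range: per-slab batch, no item checks */
    PHASE4_AGGRESSIVE   /* low water: push to the mini-magazine eagerly */
} Phase4Mode;

/* Selects the mode from the TLS magazine occupancy.
 * High water = 75% of capacity (from this document),
 * low water  = 25% of capacity (assumed value). */
static Phase4Mode phase4_select_mode(int tls_occ, int capacity)
{
    int high_water = capacity * 3 / 4;
    int low_water  = capacity / 4;

    if (tls_occ >= high_water) return PHASE4_SKIP;
    if (tls_occ <= low_water)  return PHASE4_AGGRESSIVE;
    return PHASE4_BATCH_ONLY;
}
```

With a capacity of 2048 the thresholds are 1536 and 512. A spill that is triggered only when the magazine is completely full would therefore always land in `PHASE4_SKIP`, which is worth keeping in mind when choosing where occupancy is sampled.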

---

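### 3. Validity of the Design Decision

#### In general

> "Adding a small cost on free to make alloc faster" is **conditionally valid**

**Conditions under which it works**:
1. The free path rarely hits its slow case (spills are rare)
2. Alloc actually benefits (high hit ratio)
3. Up-front cost < deferred benefit

#### Why it failed on bench_tiny

- ❌ Spills are frequent (488k spills)
- ❌ The TLS Magazine is at high water (hit ratio is zero)
- ❌ Up-front cost > deferred benefit (only the cost shows up)

#### Prospects on real-world workloads

**Favorable scenarios**:
- Burst allocation (many allocs in a short time → a quiet period → many frees)
- The TLS Magazine is at low water (allocations from the mini-mag do occur)
- Spills are rare (the cost is amortized)

**Unfavorable scenarios**:
- Steady state (allocs and frees occur evenly)
- The TLS Magazine is always at high water
- Spills are frequent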

---

## Implementation Plan

### Phase 4.1: Quick Win (Option A+B)

**Goal**: recover +1-2% in about 5 minutes of work

**Implementation**:
```c
// Modify hakmem_tiny.c:890-922
uint8_t cidx = owner->class_idx;  // read only once
TinySlab* tls_a = g_tls_active_slab_a[cidx];
TinySlab* tls_b = g_tls_active_slab_b[cidx];

if (__builtin_expect((owner == tls_a || owner == tls_b) &&
                     !mini_mag_is_full(&owner->mini_mag), 1)) {
    mini_mag_push(&owner->mini_mag, it.ptr);
    stats_record_free(cidx);
    continue;
}
```

**Verification**:
```bash
make bench_tiny && ./bench_tiny
# Expected: 380 → 385-390 M ops/sec
```

---

### Phase 4.2: High-water Gate (Option E-1)

**Goal**: structural improvement in 10-20 minutes

**Implementation**:
```c
// Add at the top of hak_tiny_free_with_slab()
int tls_occ = mag->count;  // number of items held in the TLS Magazine
if (tls_occ >= TLS_MAG_HIGH_WATER) {
    // Phase 4 disabled: write everything straight to the bitmap
    for (int i = 0; i < mag->count; i++) {
        TinySlab* owner = hak_tiny_owner_slab(mag->items[i]);
        // ... existing bitmap spill logic ...
    }
    return;
}

// Run Phase 4 only when tls_occ < HIGH_WATER
// ... existing Phase 4 logic ...
```

**Constant**:
```c
#define TLS_MAG_HIGH_WATER (TLS_MAG_CAPACITY * 3 / 4)  // 75%
```

**Verification**:
```bash
make bench_tiny && ./bench_tiny
# Expected: 385 → 390-395 M ops/sec (back to the Phase 3 level)
```

---

### Phase 4.3: Per-slab Batch (Option E-2)

**Goal**: root fix in 30-40 minutes

**Implementation**: see the E-2 code example above

**Verification**:
```bash
make bench_tiny && ./bench_tiny
# Expected: 390 → 395-400 M ops/sec (exceeds Phase 3)
```

---

### Phase 4.4: Pull-model Inversion (future)

**Goal**: fundamental architecture change

**Where**: just before the bitmap scan in `hak_tiny_alloc()`

**Verification**: evaluate with real-world benchmarks

---

## Measurement Framework

### Statistics to Add

```c
// hakmem_tiny.h
typedef struct {
    // Existing
    uint64_t alloc_count[TINY_NUM_CLASSES];
    uint64_t free_count[TINY_NUM_CLASSES];
    uint64_t slab_count[TINY_NUM_CLASSES];

    // For Phase 4 measurement
    uint64_t phase4_spill_count[TINY_NUM_CLASSES];    // times Phase 4 ran
    uint64_t phase4_mini_push[TINY_NUM_CLASSES];      // items pushed to the mini-mag
    uint64_t phase4_bitmap_spill[TINY_NUM_CLASSES];   // items spilled to the bitmap
    uint64_t phase4_gate_skip[TINY_NUM_CLASSES];      // times skipped by the high-water gate
} TinyPool;
```

### Break-even Calculation

```c
void hak_tiny_print_phase4_stats(void) {
    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        uint64_t total_spill = g_tiny_pool.phase4_spill_count[i];
        uint64_t mini_push = g_tiny_pool.phase4_mini_push[i];
        uint64_t gate_skip = g_tiny_pool.phase4_gate_skip[i];

        if (total_spill == 0) continue;  // no spills for this class: avoid dividing by zero

        // Note: mini_push counts items while phase4_spill_count counts runs,
        // so mini_ratio can exceed 100% unless both use the same unit.
        double mini_ratio = (double)mini_push / total_spill;
        double gate_ratio = (double)gate_skip / total_spill;

        printf("Class %d: mini_ratio=%.2f%%, gate_ratio=%.2f%%\n",
               i, mini_ratio * 100, gate_ratio * 100);
    }
}
```

---

## Conclusion

### Priorities

1. **Short-term**: Option A+B → high-water gate
2. **Mid-term**: Per-slab batch
3. **Long-term**: Pull-model inversion

### Success Criteria

- Phase 4.1 (A+B): 385-390 M ops/sec (+1-2%)
- Phase 4.2 (gate): 390-395 M ops/sec (back to the Phase 3 level)
- Phase 4.3 (batch): 395-400 M ops/sec (exceeds Phase 3)

### Revert Decision

If throughput does not return to the Phase 3 level (391 M ops/sec) even after Phase 4.2 (gate) is implemented:
- Revert Phase 4 entirely
- Consider the pull-model approach

---

## References

- ChatGPT Pro advice (2025-10-26)
- HYBRID_IMPLEMENTATION_DESIGN.md
- TINY_POOL_OPTIMIZATION_ROADMAP.md