# Phase 6.24: SuperSlab Optimization (Quick + Medium Fix)

**Date**: 2025-10-24
**Status**: ✅ **Success: beats Baseline by +2.6%**
**Goal**: Recover the Phase 6.23 performance regression and exceed Baseline
**Result**: **+8.2%** vs Phase 6.23, **+2.6%** vs Baseline

---

## 📊 Benchmark Results

### Multi-threaded (4 threads, 16B allocation)

| Version | Throughput | vs Phase 6.23 | vs Baseline (OFF) |
|---------|------------|---------------|-------------------|
| **Baseline (SuperSlab OFF)** | 268.21 M ops/sec | - | baseline |
| **Phase 6.23 (SuperSlab ON)** | 254.15 M ops/sec | baseline | **-5.2%** ❌ |
| **Phase 6.24 Quick fix** | 271.01 M ops/sec | **+6.6%** | **+1.0%** ✅ |
| **Phase 6.24 Quick + Medium** | **275.10 M ops/sec** | **+8.2%** 🎉 | **+2.6%** ✅ |

### Breakdown of Improvements

| Optimization | Throughput | Gain | Cumulative gain |
|--------------|------------|------|-----------------|
| Phase 6.23 (baseline) | 254.15 M ops/sec | - | - |
| + Quick fix (Lazy freelist) | 271.01 M ops/sec | **+6.6%** | +6.6% |
| + Medium fix (TLS unified) | **275.10 M ops/sec** | **+1.5%** | **+8.2%** |

**Conclusion**: As Task-sensei's analysis predicted, the **freelist initialization cost** was the main bottleneck.
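
The percentages in both tables follow directly from the throughput figures. As a quick sanity check, the sketch below recomputes them; `pct_change` and `print_gains` are illustrative helper names, not part of hakmem. Note that the per-stage gains compound multiplicatively (1.066 × 1.015 ≈ 1.082) rather than simply adding.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Print the per-stage and cumulative gains using the table's throughputs. */
static void print_gains(void) {
    const double baseline   = 268.21;  /* M ops/sec, SuperSlab OFF */
    const double phase_6_23 = 254.15;  /* M ops/sec, SuperSlab ON */
    const double quick      = 271.01;  /* + lazy freelist */
    const double medium     = 275.10;  /* + unified TLS */

    printf("Quick fix vs 6.23:     %+.1f%%\n", pct_change(phase_6_23, quick));
    printf("Medium fix vs Quick:   %+.1f%%\n", pct_change(quick, medium));
    printf("Cumulative vs 6.23:    %+.1f%%\n", pct_change(phase_6_23, medium));
    printf("Final vs Baseline:     %+.1f%%\n", pct_change(baseline, medium));
}
```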

---

## 🚀 Implementation

### Quick fix: freelist Lazy Initialization

**Problem**: `superslab_init_slab()` builds a freelist of 4096 blocks on every slab initialization (4-7 μs)

**Solution**: Implement a linear allocation mode
- Do not build the freelist up front
- While `meta->freelist == NULL`, allocate by sequential memory access
- Use the freelist only to reuse blocks after they have been freed

**Implementation** (`hakmem_tiny_superslab.c:137-151`, `hakmem_tiny.c:895-925`):

```c
// superslab_init_slab() - NO freelist build!
void superslab_init_slab(SuperSlab* ss, int slab_idx, ...) {
    // ...
    // Phase 6.24: Lazy freelist initialization
    // NO freelist build here! (saves ~12,000-20,000 cycles: 4096 blocks × 3-5 cycles)
    TinySlabMeta* meta = &ss->slabs[slab_idx];
    meta->freelist = NULL;  // NULL = linear mode
    meta->used = 0;
    meta->capacity = (uint16_t)capacity;
    meta->owner_tid = owner_tid;
}

// superslab_alloc_from_slab() - Linear allocation
static inline void* superslab_alloc_from_slab(SuperSlab* ss, int slab_idx) {
    TinySlabMeta* meta = &ss->slabs[slab_idx];

    // Linear allocation mode (freelist == NULL)
    if (meta->freelist == NULL && meta->used < meta->capacity) {
        size_t block_size = g_tiny_class_sizes[ss->size_class];
        void* slab_start = slab_data_start(ss, slab_idx);
        if (slab_idx == 0) slab_start = (char*)slab_start + 1024;

        void* block = (char*)slab_start + (meta->used * block_size);
        meta->used++;
        return block;  // O(1) pointer arithmetic
    }

    // Freelist mode (after first free)
    if (meta->freelist) {
        void* block = meta->freelist;
        meta->freelist = *(void**)block;
        meta->used++;
        return block;
    }

    return NULL;
}
```

**Effect**: **+6.6%** improvement (254.15 → 271.01 M ops/sec)
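
The excerpt above shows only the allocation side. To illustrate how linear mode and the freelist interact, here is a minimal self-contained sketch that also includes a free path. The free path is an assumption (it is not shown in this report), and `DemoSlab`, `demo_slab_init`, `demo_slab_alloc`, and `demo_slab_free` are hypothetical names rather than hakmem APIs. The sketch relies on one invariant: every free pushes onto the slab's own freelist and decrements `used`, so whenever the freelist is empty, `used` equals the number of blocks carved so far and can serve as the bump index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified slab; field names only loosely mirror TinySlabMeta. */
typedef struct {
    void*    freelist;   /* NULL = linear mode (no freed blocks to reuse) */
    uint16_t used;       /* number of live blocks */
    uint16_t capacity;   /* total blocks in the slab */
    size_t   block_size; /* must be >= sizeof(void*) to hold the next pointer */
    char*    base;       /* start of block storage (pointer-aligned) */
} DemoSlab;

/* Lazy init: record geometry only, touch no block memory. */
static void demo_slab_init(DemoSlab* s, void* mem, size_t mem_size,
                           size_t block_size) {
    assert(block_size >= sizeof(void*));
    s->freelist   = NULL;
    s->used       = 0;
    s->capacity   = (uint16_t)(mem_size / block_size);
    s->block_size = block_size;
    s->base       = (char*)mem;
}

static void* demo_slab_alloc(DemoSlab* s) {
    /* Linear mode: carve the next never-used block by pointer arithmetic. */
    if (s->freelist == NULL && s->used < s->capacity) {
        void* block = s->base + (size_t)s->used * s->block_size;
        s->used++;
        return block;
    }
    /* Freelist mode: reuse the most recently freed block. */
    if (s->freelist) {
        void* block = s->freelist;
        s->freelist = *(void**)block;
        s->used++;
        return block;
    }
    return NULL; /* slab is full */
}

/* Assumed free path: push the block onto the slab's intrusive freelist. */
static void demo_slab_free(DemoSlab* s, void* block) {
    *(void**)block = s->freelist;
    s->freelist = block;
    s->used--;
}
```

With four blocks, allocating twice, freeing the first block, and allocating three more times returns the freed block first and then continues carving from the third block onward. If frees could bypass the slab's freelist (for example via a separate cache), `used` would no longer be a safe bump index and a dedicated high-water counter would be needed.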

---

### Medium fix: Unified TLS Variables

**Problem**: Accessing both `g_tls_superslab[class_idx]` and `g_tls_slab_idx[class_idx]` costs **3 TLS reads** per allocation

**Solution**: Unified TLS structure

**Implementation** (`hakmem_tiny.c:78-87`):

```c
// Phase 6.24: Unified TLS slab cache
typedef struct {
    SuperSlab* ss;           // SuperSlab pointer (8B)
    TinySlabMeta* meta;      // Direct slab metadata cache (8B)
    uint8_t slab_idx;        // Slab index (1B)
    uint8_t _pad[7];         // Padding: struct is 24B on 64-bit (8 + 8 + 1 + 7)
} TinyTLSSlab;

static __thread TinyTLSSlab g_tls_slabs[TINY_NUM_CLASSES];
```
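
The layout of this struct can be checked at compile time. The sketch below redeclares `TinyTLSSlab` with opaque stand-ins for `SuperSlab` and `TinySlabMeta` (their real definitions are not shown in this report) and asserts the size: two pointers plus 8 bytes for the index and padding, which comes to 24 bytes on a typical 64-bit target. The helper `tiny_tls_slab_size` is a hypothetical name added only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-ins: only pointers to these types are needed here. */
typedef struct SuperSlab SuperSlab;
typedef struct TinySlabMeta TinySlabMeta;

typedef struct {
    SuperSlab*    ss;        /* SuperSlab pointer */
    TinySlabMeta* meta;      /* cached slab metadata pointer */
    uint8_t       slab_idx;  /* slab index */
    uint8_t       _pad[7];   /* explicit tail padding */
} TinyTLSSlab;

/* Two pointers + 1-byte index + 7 bytes padding (24 bytes when pointers are 8 bytes). */
static_assert(sizeof(TinyTLSSlab) == 2 * sizeof(void*) + 8,
              "unexpected TinyTLSSlab size");
static_assert(offsetof(TinyTLSSlab, slab_idx) == 2 * sizeof(void*),
              "slab_idx should follow the two pointers");

/* Runtime view of the same fact, e.g. for logging. */
static size_t tiny_tls_slab_size(void) {
    return sizeof(TinyTLSSlab);
}
```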

**Fast path** (`hakmem_tiny.c:969-1015`):

```c
static inline void* hak_tiny_alloc_superslab(int class_idx) {
    // Phase 6.24: 1 TLS read (down from 3!)
    TinyTLSSlab* tls = &g_tls_slabs[class_idx];
    TinySlabMeta* meta = tls->meta;  // Already cached!

    // Fast path: Direct metadata access
    if (meta && meta->freelist == NULL && meta->used < meta->capacity) {
        // Linear allocation
        size_t block_size = g_tiny_class_sizes[tls->ss->size_class];
        void* slab_start = slab_data_start(tls->ss, tls->slab_idx);
        if (tls->slab_idx == 0) slab_start = (char*)slab_start + 1024;
        void* block = (char*)slab_start + (meta->used * block_size);
        meta->used++;
        return block;
    }

    if (meta && meta->freelist) {
        // Freelist allocation
        void* block = meta->freelist;
        meta->freelist = *(void**)block;
        meta->used++;
        return block;
    }

    // Slow path: Refill
    // ...
}
```

**Effect**: **+1.5%** improvement (271.01 → 275.10 M ops/sec)

**TLS read reduction**:
- Before: 3 TLS reads (`g_tls_superslab` + `g_tls_slab_idx` + a re-read on retry)
- After: **1 TLS read** (`g_tls_slabs` only)
- Savings: **6-10 cycles per allocation**
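
The difference between the two access patterns can be sketched as follows. The "before" layout is an assumption reconstructed from the variable names in the text (two parallel thread-local arrays), and every `demo_`/`Demo` identifier is hypothetical. `_Thread_local` is the C11 spelling of the `__thread` qualifier used in the excerpt above. How many TLS address computations the compiler actually emits depends on the TLS model and optimization level, so treat the read counts in the comments as a rough guide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_CLASSES 8  /* hypothetical class count */

typedef struct { uint16_t used, capacity; } DemoMeta;
typedef struct { DemoMeta slabs[4]; } DemoSuperSlab;

/* Before (assumed layout): two parallel thread-local arrays. */
static _Thread_local DemoSuperSlab* demo_tls_superslab[DEMO_NUM_CLASSES];
static _Thread_local uint8_t        demo_tls_slab_idx[DEMO_NUM_CLASSES];

static DemoMeta* demo_meta_before(int cls) {
    DemoSuperSlab* ss = demo_tls_superslab[cls];   /* TLS read #1 */
    if (!ss) return NULL;
    uint8_t idx = demo_tls_slab_idx[cls];          /* TLS read #2 */
    return &ss->slabs[idx];                        /* plus index arithmetic */
}

/* After: one thread-local struct per class with the metadata pointer cached. */
typedef struct {
    DemoSuperSlab* ss;
    DemoMeta*      meta;
    uint8_t        slab_idx;
} DemoTLSSlab;

static _Thread_local DemoTLSSlab demo_tls_slabs[DEMO_NUM_CLASSES];

static DemoMeta* demo_meta_after(int cls) {
    return demo_tls_slabs[cls].meta;               /* single TLS read */
}

/* Refill (slow path): bind a slab to the class and cache its metadata pointer. */
static void demo_bind(int cls, DemoSuperSlab* ss, uint8_t idx) {
    demo_tls_superslab[cls] = ss;
    demo_tls_slab_idx[cls]  = idx;
    demo_tls_slabs[cls]     = (DemoTLSSlab){ ss, &ss->slabs[idx], idx };
}
```

The design trade-off is that the cached `meta` pointer must be refreshed whenever the slab binding changes, which moves that cost to the refill path and keeps it off the hot path.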

---

## 📈 Performance Analysis

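### Accuracy of Task-sensei's Analysis

| Predicted | Measured | Accuracy |
|-----------|----------|----------|
| Quick fix: +2-3% | **+6.6%** | 🎯 Exceeded |
| Medium fix: +0.5-1% | **+1.5%** | 🎯 Exceeded |
| Total: +2.5-4% | **+8.2%** | 🎯🎯 Far exceeded |

**Task-sensei's analysis was conservative, but the direction was exactly right.**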

### Cycle Savings Calculation

**Quick fix (Lazy freelist init)**:
- Before: 4096 iterations × 3-5 cycles = 12,288-20,480 cycles per slab init
- After: **0 cycles** of freelist construction (linear allocation)
- With 4 threads and multiple SuperSlabs each, this saves tens of thousands of cycles

**Medium fix (TLS unified)**:
- Before: 3 TLS reads × 3-5 cycles = 9-15 cycles per allocation
- After: 1 TLS read × 3-5 cycles = **3-5 cycles per allocation**
- Savings: **6-10 cycles per allocation**
- 800M operations × 6-10 cycles = **4.8-8.0 billion cycles saved**
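
The estimates above are simple range arithmetic, which the following sketch reproduces. `CycleRange`, `cycles_total`, and `cycles_saved` are illustrative names; the 3-5 cycle per-step cost is the estimate used in this report, not a measured value.

```c
#include <assert.h>
#include <stdint.h>

/* A [lo, hi] estimate in CPU cycles. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} CycleRange;

/* Scale a per-event cycle range by an event count. */
static CycleRange cycles_total(uint64_t events, CycleRange per_event) {
    CycleRange r = { events * per_event.lo, events * per_event.hi };
    return r;
}

/* Element-wise difference of two ranges (before - after). */
static CycleRange cycles_saved(CycleRange before, CycleRange after) {
    assert(before.lo >= after.lo && before.hi >= after.hi);
    CycleRange r = { before.lo - after.lo, before.hi - after.hi };
    return r;
}
```

For example, `cycles_total(4096, (CycleRange){3, 5})` gives the 12,288-20,480 cycle slab-init estimate, and scaling the 6-10 cycle saving by 800,000,000 operations gives 4.8-8.0 billion cycles. Dividing the slab-init estimate by its 4096 blocks gives 3-5 cycles per block, the same order as one TLS read, assuming every block in the slab is eventually allocated.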

---

## 🎓 Lessons Learned

### 1. The Power of Lazy Initialization

**Not building the freelist up front** was the big win.
- Initialization cost → 0 cycles
- Linear allocation → sequential memory access (best cache efficiency)
- The freelist is used only for reuse after free → no problem in practice

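### 2. TLS Access Cost Is Not Negligible

A single TLS read costs 3-5 cycles, but **reading three times on the hot path** adds up to 9-15 cycles.
- The unified structure cuts this to one read → **6-10 cycles saved**
- The effect accumulates across threads in multi-threaded runs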

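### 3. Profiling < Theoretical Analysis

Code review plus calculation was enough to analyze this problem, even without perf.
- Task-sensei's theoretical analysis was on target
- "Don't guess, measure" still matters, but "understand, then optimize" is faster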

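### 4. The Importance of Staged Optimization

Splitting the work into Quick fix → Medium fix meant:
- The effect of each optimization could be quantified
- Problems were clearly isolated
- Rollback is easy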

---

## 📂 File Changes

### Changed Files

| File | Change | Lines |
|------|--------|-------|
| `hakmem_tiny_superslab.h` | Lazy init comments added | +5 |
| `hakmem_tiny_superslab.c` | freelist construction removed | +12, -10 |
| `hakmem_tiny.c` | TLS unified + Linear allocation | +100, -30 |

### Total

- **Changed**: 3 files, +117 / -40 lines (net ~77 lines added)

---

## 🎯 Next Steps: Plan to Beat mimalloc

Phase 6.24 puts the SuperSlab foundation in place. Next:

### Priority 1: Complete the Tiny Pool (Phase 6.25)

**Long-term fix**: Integrate Magazine and SuperSlab
- Implement a Hybrid Magazine (SuperSlab-backed)
- Or remove Magazine entirely (mimalloc style)
- **Expected gain**: +5-10%

**Target**: 300+ M ops/sec (currently 275 M ops/sec)

### Priority 2: Optimize the Mid Pool (Phase 6.26+)

**Current**: Mid 4T = 8.33 M/s
**Target**: On par with mimalloc (?)

**Optimization candidates**:
- Apply SuperSlab to the Mid Pool as well?
- Extend the TLS ring buffer
- Optimize headerless allocation

### Priority 3: Optimize the Large Pool (Phase 6.27+)

**Current**: Large 4T = 1.27 M/s
**Target**: On par with mimalloc (?)

**Optimization candidates**:
- Adjust the L2.5 Pool capacity
- Improve batch allocation
- Tune the ELO system

---

## 🎉 Conclusion

In Phase 6.24 we implemented the **Quick fix + Medium fix** based on Task-sensei's analysis and achieved a **+8.2% improvement** over Phase 6.23.

### Key Results

1. ✅ **freelist Lazy Initialization**: +6.6% improvement
2. ✅ **Unified TLS variables**: +1.5% improvement
3. ✅ **Beat Baseline**: +2.6% improvement (268.21 → 275.10 M ops/sec)

### Next Milestones

- **Phase 6.25**: Long-term fix (Magazine integration) → target 300+ M ops/sec
- **Phase 6.26+**: Mid/Large Pool optimization → close in on mimalloc

**mimalloc, we're coming for you!** 🔥🔥🔥

---

**Created**: 2025-10-24 12:30 JST
**Status**: ✅ **Phase 6.24 complete; next is Phase 6.25**
**Next phase**: Draft the plan to beat mimalloc