
# HAKMEM Development History

## Phase 5-B-Simple: Dual Free Lists + Magazine Unification (2025-11-02~03) ❌

### Goals
- Dual Free Lists (mimalloc): +10-15%
- Magazine unification: +3-5%
- Expected total: +15-23% (16.53 → 19.1-20.3 M ops/sec)

### Implementation

#### 1. TinyUnifiedMag definition (hakmem_tiny.c:590-603)
```c
typedef struct {
    void* slots[256];   // Large capacity for better hit rate
    uint16_t top;       // 0..256
    uint16_t cap;       // =256 (adjustable per class)
} TinyUnifiedMag;

static int g_unified_mag_enable = 1;
static uint16_t g_unified_mag_cap[TINY_NUM_CLASSES] = {
    64, 64, 64, 64,      // Classes 0-3 (hot): 64 slots
    32, 32, 16, 16       // Classes 4-7 (cold): smaller capacity
};
static __thread TinyUnifiedMag g_tls_unified_mag[TINY_NUM_CLASSES];
```
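
As a rough illustration of how such a magazine is used, the sketch below adds push and pop helpers on top of the struct above. The helper names `mag_pop` and `mag_push` and the `MAG_SLOTS` constant are hypothetical and do not come from the HAKMEM sources; the real code may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAG_SLOTS 256  /* hypothetical name for the slots[256] bound */

typedef struct {
    void*    slots[MAG_SLOTS];
    uint16_t top;   /* number of cached pointers */
    uint16_t cap;   /* effective per-class capacity, <= MAG_SLOTS */
} TinyUnifiedMag;

/* Hypothetical helper: pop a cached block, or NULL when the magazine is empty. */
static inline void* mag_pop(TinyUnifiedMag* mag) {
    return mag->top > 0 ? mag->slots[--mag->top] : NULL;
}

/* Hypothetical helper: push a freed block.
 * Returns 0 when the magazine is full so the caller can fall back. */
static inline int mag_push(TinyUnifiedMag* mag, void* ptr) {
    assert(mag->cap <= MAG_SLOTS);
    if (mag->top >= mag->cap) return 0;
    mag->slots[mag->top++] = ptr;
    return 1;
}
```

Both operations touch only thread-local state, which is why the magazine hit path needs no atomics or locks.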

#### 2. Dual Free Lists added (hakmem_tiny.h:147-151)
```c
// Phase 5-B: Dual Free Lists (mimalloc-inspired optimization)
void* local_free;               // Local free list (same-thread, no atomic)
atomic_uintptr_t thread_free;   // Remote free list (cross-thread, atomic)
```

#### 3. hak_tiny_alloc() rewrite (hakmem_tiny_alloc.inc:159-180)
- Reduced from 48 lines to 8 lines
- Reduced from 3-4 branches to 1 branch
```c
if (__builtin_expect(g_unified_mag_enable, 1)) {
    TinyUnifiedMag* mag = &g_tls_unified_mag[class_idx];
    if (__builtin_expect(mag->top > 0, 1)) {
        void* ptr = mag->slots[--mag->top];
        HAK_RET_ALLOC(class_idx, ptr);
    }
    // Fast path - try local_free from TLS active slabs (no atomic!)
    TinySlab* slab = g_tls_active_slab_a[class_idx];
    if (!slab) slab = g_tls_active_slab_b[class_idx];
    if (slab && slab->local_free) {
        void* ptr = slab->local_free;
        slab->local_free = *(void**)ptr;
        HAK_RET_ALLOC(class_idx, ptr);
    }
}
```

#### 4. Free path split (hakmem_tiny_free.inc)
- Same-thread: local_free (no atomic) - lines 216-230
- Remote-thread: thread_free (atomic CAS) - lines 468-484
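
The two free paths can be sketched roughly as follows. This is a minimal illustration, not the code at the cited line ranges: the struct `TinySlabLists` and the function names are hypothetical, and only the two field names `local_free` and `thread_free` are taken from the source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Only the two list heads from the entry above; all other slab fields are omitted. */
typedef struct {
    void*            local_free;   /* owner thread only, plain loads/stores */
    atomic_uintptr_t thread_free;  /* pushed to by other threads via CAS */
} TinySlabLists;

/* Same-thread free: link the block in front of local_free. No atomics. */
static inline void slab_free_local(TinySlabLists* s, void* ptr) {
    assert(ptr != NULL);
    *(void**)ptr = s->local_free;
    s->local_free = ptr;
}

/* Remote free: lock-free push onto thread_free with a CAS retry loop. */
static inline void slab_free_remote(TinySlabLists* s, void* ptr) {
    assert(ptr != NULL);
    uintptr_t head = atomic_load_explicit(&s->thread_free, memory_order_relaxed);
    do {
        *(void**)ptr = (void*)head;  /* next = current head */
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->thread_free, &head, (uintptr_t)ptr,
                 memory_order_release, memory_order_relaxed));
}
```

The first word of each freed block is reused as the next pointer, so neither list needs extra memory per block.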

#### 5. Migration logic (hakmem_tiny_slow.inc:12-76)
- local_free → Magazine (batch 32 items)
- thread_free → local_free → Magazine
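
The two migration steps might look like the sketch below. It assumes the remote list is taken in one atomic exchange and that a batch stops at 32 items or when the magazine is full; the type and function names (`Mag`, `SlabLists`, `drain_thread_free`, `migrate_to_mag`) are hypothetical and not taken from `hakmem_tiny_slow.inc`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

enum { MAG_SLOTS = 256, MIGRATE_BATCH = 32 };

typedef struct { void* slots[MAG_SLOTS]; uint16_t top, cap; } Mag;
typedef struct { void* local_free; atomic_uintptr_t thread_free; } SlabLists;

/* Step 1 (thread_free -> local_free): take the whole remote list with one
 * atomic exchange, then splice it in front of local_free. */
static void drain_thread_free(SlabLists* s) {
    void* head = (void*)atomic_exchange_explicit(
        &s->thread_free, (uintptr_t)0, memory_order_acquire);
    if (!head) return;
    void* tail = head;
    while (*(void**)tail) tail = *(void**)tail;  /* pointer chase to the tail */
    *(void**)tail = s->local_free;
    s->local_free = head;
}

/* Step 2 (local_free -> Magazine): move up to MIGRATE_BATCH blocks.
 * Returns the number of blocks moved. */
static int migrate_to_mag(SlabLists* s, Mag* mag) {
    assert(mag->cap <= MAG_SLOTS);
    int moved = 0;
    while (moved < MIGRATE_BATCH && mag->top < mag->cap && s->local_free) {
        void* ptr = s->local_free;
        s->local_free = *(void**)ptr;
        mag->slots[mag->top++] = ptr;
        moved++;
    }
    return moved;
}
```

Both steps walk a linked list one node at a time, so their cost grows with the batch size.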

#### 6. Magazine refill from SuperSlab (hakmem_tiny_slow.inc:78-107)
- Batch allocate 8-64 blocks
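
A batch refill of this kind could be sketched as below. The SuperSlab internals are not described in this entry, so the sketch substitutes a hypothetical bump-pointer region (`BumpRegion`) for the real SuperSlab; `mag_refill` and the clamp constants are likewise illustrative names, with only the 8-64 range taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAG_SLOTS = 256, REFILL_MIN = 8, REFILL_MAX = 64 };

typedef struct { void* slots[MAG_SLOTS]; uint16_t top, cap; } Mag;

/* Hypothetical stand-in for a SuperSlab: a bump pointer over equal-size blocks. */
typedef struct { uint8_t* next; uint8_t* end; size_t block_size; } BumpRegion;

/* Carve up to `want` blocks (clamped to 8..64) into the magazine.
 * Stops early when the magazine is full or the region is exhausted.
 * Returns the number of blocks added. */
static int mag_refill(Mag* mag, BumpRegion* r, int want) {
    assert(mag->cap <= MAG_SLOTS && r->block_size > 0);
    if (want < REFILL_MIN) want = REFILL_MIN;
    if (want > REFILL_MAX) want = REFILL_MAX;
    int n = 0;
    while (n < want && mag->top < mag->cap &&
           (size_t)(r->end - r->next) >= r->block_size) {
        mag->slots[mag->top++] = r->next;
        r->next += r->block_size;
        n++;
    }
    return n;
}
```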

### Benchmark results 💥

#### Initial (Magazine cap=256)
- bench_random_mixed: 16.51 M ops/sec (baseline: 16.53, -0.12%)

#### After Dual Free Lists (Magazine cap=256)
- bench_random_mixed: 16.35 M ops/sec (-1.1% vs baseline)

#### After local_free fast path (Magazine cap=256)
- bench_random_mixed: 16.42 M ops/sec (-0.67% vs baseline)

#### After capacity optimization (Magazine cap=64)
- bench_random_mixed: 16.36 M ops/sec (-1.0% vs baseline)

#### Final evaluation (Magazine cap=64)
**Single-threaded (bench_tiny_hot, 64B):**
- System allocator: **169.49 M ops/sec**
- HAKMEM Phase 5-B: **49.91 M ops/sec**
- **Regression: -71%** (3.4x slower!)

**Multi-threaded (bench_mid_large_mt, 2 threads, 8-32KB):**
- System allocator: **11.51 M ops/sec**
- HAKMEM Phase 5-B: **7.44 M ops/sec**
- **Regression: -35%**
- ⚠️ NOTE: Tests 8-32KB allocations (outside Tiny range)
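
For reference, the regression percentages and the slowdown factor above follow from simple ratios. The helper names below are illustrative only and are not part of the benchmark harness.

```c
#include <assert.h>

/* Relative change of `measured` against `reference`, in percent.
 * Negative values mean `measured` is slower. */
static inline double regression_pct(double measured, double reference) {
    assert(reference > 0.0);
    return (measured / reference - 1.0) * 100.0;
}

/* How many times slower `measured` is than `reference`. */
static inline double slowdown_factor(double measured, double reference) {
    assert(measured > 0.0);
    return reference / measured;
}
```

With the numbers above, `regression_pct(49.91, 169.49)` is roughly -70.6 (reported as -71%), `slowdown_factor(49.91, 169.49)` is roughly 3.4, and `regression_pct(7.44, 11.51)` is roughly -35.4.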

### Root cause analysis 🔍

#### 1. Mistuned Magazine capacity
- **Problem**: 64 slots is too small for ST workloads
- **Detail**: with batch=100, one in every two times falls through to the slow path
- **Cause**: compares poorly with the system allocator's tcache (7+ entries per size)
- **Perf analysis**: `hak_tiny_alloc_slow` accounts for 4.25% (too high)

#### 2. Migration logic overhead
- **Problem**: free list → Magazine migration on the slow path is expensive
- **Detail**: batch migration (32 items) happens frequently
- **Cause**: accumulated cost of pointer chasing + atomic operations
- **Perf analysis**: `pthread_mutex_lock` accounts for 3.40% (even though the run is single-threaded!)

#### 3. Dual Free Lists miscalculation
- **Problem**: zero benefit in ST; if anything, pure overhead
- **Detail**: remote_free never occurs in ST
- **Cause**: only the memory overhead of the dual structures remains
- **Lesson**: an MT-only optimization was applied to ST

#### 4. Problems with the Unified Magazine
- **Problem**: unification gained simplicity but lost performance
- **Detail**: the old combination of HotMag (128 slots) + Fast + Quick was faster
- **Cause**: simplification ≠ speedup
- **Lesson**: complexity reduction does not necessarily yield a performance improvement

### Lessons learned 📚

#### ✅ Good Ideas
1. **Magazine unification itself is a good idea** (reducing complexity is the right direction)
2. **Dual Free Lists are proven in mimalloc** (but in MT environments)
3. **The idea behind the migration logic** (consolidating free lists into the Magazine)

#### ❌ Bad Execution
1. **Capacity tuning was inadequate** (64 slots → 128+ needed)
2. **Dual Free Lists are MT-only** (should not have been introduced for ST)
3. **Migration logic is too heavy** (needs a smaller batch size or lazy migration)
4. **Benchmark mismatch** (an MT optimization was evaluated on ST)

#### 🎯 Next Time
1. **Design ST and MT separately** (conditional compilation or runtime switch)
2. **Err toward larger capacity** (128-256 slots for hot classes)
3. **Make migration lighter** (lazy migration, smaller batch size)
4. **Choose the benchmark first** (align it with the direction of the optimization)
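
One way to keep ST and MT apart through conditional compilation is sketched below. The `HAK_TINY_MT` flag, the `SlabLists` type, and `slab_free` are hypothetical names chosen for illustration; this is a sketch of the idea, not a description of an existing HAKMEM build option.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build flag: compile with -DHAK_TINY_MT=1 to include the remote list. */
#ifndef HAK_TINY_MT
#define HAK_TINY_MT 0
#endif

typedef struct {
    void* local_free;              /* always present */
#if HAK_TINY_MT
    atomic_uintptr_t thread_free;  /* only exists in MT builds */
#endif
} SlabLists;

static inline void slab_free(SlabLists* s, void* ptr, int same_thread) {
#if HAK_TINY_MT
    if (!same_thread) {
        /* MT build: lock-free push onto the remote list. */
        uintptr_t head = atomic_load_explicit(&s->thread_free, memory_order_relaxed);
        do {
            *(void**)ptr = (void*)head;
        } while (!atomic_compare_exchange_weak_explicit(
                     &s->thread_free, &head, (uintptr_t)ptr,
                     memory_order_release, memory_order_relaxed));
        return;
    }
#else
    /* ST build: every free is a same-thread free by construction. */
    assert(same_thread);
    (void)same_thread;
#endif
    *(void**)ptr = s->local_free;
    s->local_free = ptr;
}
```

In the ST build the remote list, its memory, and its branch disappear entirely, so the dual-structure overhead described above is not paid.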

### Related commits
- 4672d54: refactor(tiny): expose class locks for module sharing
- 6593935: refactor(tiny): move magazine init functions
- 1b232e1: refactor(tiny): move magazine capacity helpers
- 0f1e5ac: refactor(tiny): extract magazine data structures
- 85a00a0: refactor(core): organize source files into core/ directory

### Next step candidates
1. **Phase 5-B-v2**: Magazine unification only (no Dual Free Lists, capacity 128-256)
2. **Phase 6 series**: move on to L25/SuperSlab optimization
3. **Rollback**: return to baseline and try a different approach

---

## Phase 5-A: Direct Page Cache (2025-11-01) ❌

### Goals
- O(1) slab lookup via a direct cache: +15-20%

### Implementation
- Global `slabs_direct[129]` as an O(1) direct page cache
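
The entry does not say how `slabs_direct[129]` is indexed. The sketch below assumes one entry per 8-byte size step for sizes 0 to 1024 bytes, which gives 129 entries; that mapping and the helper names `direct_index` and `direct_lookup` are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

enum { DIRECT_SLOTS = 129, DIRECT_MAX_SIZE = 1024 };

typedef struct TinySlab TinySlab;  /* opaque in this sketch */

/* Global table, as in Phase 5-A: shared by every thread. */
static TinySlab* slabs_direct[DIRECT_SLOTS];

/* Assumed mapping: round the size up to 8-byte words, 0..1024 -> index 0..128. */
static inline size_t direct_index(size_t size) {
    assert(size <= DIRECT_MAX_SIZE);
    return (size + 7) / 8;
}

/* O(1) lookup: one array load, NULL when no slab is cached or size is out of range. */
static inline TinySlab* direct_lookup(size_t size) {
    return size <= DIRECT_MAX_SIZE ? slabs_direct[direct_index(size)] : NULL;
}
```

Because the table is global, every thread reads and writes the same cache lines. Declaring it `static __thread` instead would be the TLS variant.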

### Benchmark results 💥
- bench_random_mixed: 15.25-16.04 M ops/sec (baseline: 16.53)
- **Regression: -3% to -7.7%** (expected +15-20% → actual -3% to -7.7%)

### Root cause
- Contention from the global cache
- Cache pollution
- False sharing

### Lessons learned
- Avoid global structures (TLS should be the default)
- A Magazine-based approach is more effective than a direct cache

---

## Phase 4-A1: HotMag capacity tuning (2025-10-31) ❌

### Goals
- Increase HotMag capacity to improve the hit rate

### Results
- No performance improvement

### Lessons learned
- Capacity alone has little effect
- The structural problems need to be solved

---

## Phase 3: Remote drain optimization (2025-10-30) ❌

### Goals
- Optimize remote drain

### Results
- No performance improvement

### Lessons learned
- Remote drain was not the bottleneck

---

## Phase 2+1: Magazine + Registry optimizations (2025-10-29) ✅

### Goals
- Magazine capacity tuning
- Registry optimization

### Results
- **Success**: performance improvement achieved

### Lessons learned
- The Magazine-based approach is effective
- O(1) lookup is sufficient for the Registry