hakmem/benchmarks/results/TINY_PERFORMANCE_ANALYSIS.md

# Tiny Allocator Performance Analysis Report

## 📉 Current Problems

### Benchmark Results (2025-11-02)
```
HAKMEM Tiny: 52.59 M ops/sec (average)
System (glibc): 135.94 M ops/sec (average)
Difference: -61.3% (38.7% of System)
```

**Slower than System in every pattern:**
- Sequential LIFO: -69.2%
- Sequential FIFO: -69.4%
- Random Free: -58.9%
- Interleaved: -67.2%
- Long/Short-lived: -68.6%

---

## 🔍 Root Causes

### 1. The Fast Path Is Too Complex

**System tcache (glibc):**
```c
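// Only 3-4 instructions on a hit! (Simplified illustration, not verbatim glibc source.)
void* tcache_get(size_t sz) {
    tcache_entry *e = &tcache->entries[tc_idx(sz)];
    if (e->count > 0) {
        void *ret = e->list;
        e->list = *(void **)ret;  // Singly linked list pop: next pointer is stored in the free block
        e->count--;
        return ret;
    }
    return NULL;  // Fall back to the arena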
}
```

**HAKMEM Tiny (`core/hakmem_tiny_alloc.inc:76-214`):**
1. Initialization check (lines 77-83)
2. Wrapper check (lines 84-101)
3. Size → class conversion (lines 103-109)
4. [ifdef] BENCH_FASTPATH (lines 111-157)
   - SLL (singly linked list) check
   - Magazine check
   - Refill handling
5. HotMag check (lines 159-172)
   - HotMag pop
   - Conditional refill
6. Hot alloc (lines 174-199)
   - Per-class functions selected via switch-case
7. Fast tier (lines 201-207)
8. Slow path (lines 209-213)

→ **Dozens of branches** + multiple function calls

**Cost of branch misprediction:**
- Recent CPUs: 15-20 cycles per miss
- HAKMEM: 5-10 branches → potentially 75-200 cycles if every branch mispredicts
- System tcache: 1-2 branches → 15-40 cycles
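The estimates above are a simple product of branch count and per-miss penalty. A minimal sketch of that arithmetic, assuming the worst case where every branch mispredicts (the `penalty_range` type and `worst_case_penalty` function are hypothetical names used only for illustration):

```c
#include <assert.h>

/* Worst-case misprediction penalty range, in cycles. */
typedef struct {
    unsigned min_cycles;
    unsigned max_cycles;
} penalty_range;

/* Assumes every branch mispredicts. The branch counts and the per-miss
 * cost (15-20 cycles in the text) are inputs, not measurements. */
static penalty_range worst_case_penalty(unsigned min_branches, unsigned max_branches,
                                        unsigned min_cost, unsigned max_cost) {
    assert(min_branches <= max_branches && min_cost <= max_cost);
    penalty_range r = { min_branches * min_cost, max_branches * max_cost };
    return r;
}
```

With 5-10 branches at 15-20 cycles each this gives 75-200 cycles, and with 1-2 branches it gives 15-40 cycles. Real code rarely mispredicts every branch, so these are upper bounds rather than expected costs.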

---

### 2. Too Many Magazine Layers

**Current structure (4-5 cache layers in front of the slab):**
```
HotMag (128 slots, class 0-2)
  ↓ miss
Hot Alloc (class-specific functions)
  ↓ miss
Fast Tier  
  ↓ miss
Magazine (TinyTLSMag)
  ↓ miss
TLS List
  ↓ miss
Slab (bitmap-based)
  ↓ miss
SuperSlab
```

**System tcache (1 layer):**
```
tcache (7 entries per size)
  ↓ miss
Arena (ptmalloc bins)
```

**Problems:**
- Each layer adds branch and function-call overhead
- Cache locality degrades
- The complexity hinders optimization
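To make the per-layer overhead concrete, the following sketch models a lookup that falls through a chain of tiers and counts how many were consulted. It is only a model: the `tier_pop_fn` type, `tiered_alloc`, and the stub tiers are hypothetical and do not correspond to the actual HAKMEM functions.

```c
#include <assert.h>
#include <stddef.h>

/* A tier either returns a block or NULL on a miss. */
typedef void *(*tier_pop_fn)(int class_idx);

/* Walk the tiers in order; *probes reports how many tiers were consulted. */
static void *tiered_alloc(const tier_pop_fn *tiers, size_t n_tiers,
                          int class_idx, size_t *probes) {
    assert(tiers != NULL && probes != NULL);
    *probes = 0;
    for (size_t i = 0; i < n_tiers; i++) {
        ++*probes;                        /* one call and one branch per tier */
        void *p = tiers[i](class_idx);
        if (p != NULL) return p;
    }
    return NULL;                          /* every tier missed */
}

/* Hypothetical stand-ins for a tier that misses and a tier that hits. */
static void *tier_miss(int class_idx) { (void)class_idx; return NULL; }
static void *tier_hit(int class_idx) {
    static char block[64];
    (void)class_idx;
    return block;
}
```

A request that misses four tiers before hitting the fifth pays for five calls and five branches, while a single-tier design pays for one.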

---

### 3. Refill Is Mixed into the Fast Path

**Lines 160-172: HotMag refill on the fast path**
```c
if (g_hotmag_enable && class_idx <= 2 && g_fast_head[class_idx] == NULL) {
    hotmag_init_if_needed(class_idx);
    TinyHotMag* hm = &g_tls_hot_mag[class_idx];
    void* hotmag_ptr = hotmag_pop(class_idx);
    if (hotmag_ptr == NULL) {
        if (hotmag_try_refill(class_idx, hm) > 0) {  // ← Refill on fast path!
            hotmag_ptr = hotmag_pop(class_idx);
        }
    }
    ...
}
```

**Problems:**
- Refill should happen on the slow path
- The fast path should be a pure pop
- System tcache keeps refill completely separate
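One possible way to achieve that separation is to keep the pop inline and move refill into an out-of-line function marked cold. The sketch below assumes GCC or Clang (`__thread`, `__builtin_expect`, and `__attribute__((noinline, cold))`); all `demo_` names, the batch size, and the small per-thread arena standing in for a slab are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CLASSES 4    /* hypothetical number of size classes */
#define DEMO_REFILL  8    /* hypothetical refill batch size */
#define DEMO_BLOCK   64   /* hypothetical block size in bytes */
#define DEMO_ARENA   32   /* hypothetical per-thread arena capacity, in blocks */

/* A free block stores the link to the next free block in its own storage. */
typedef union demo_block {
    union demo_block *next;
    unsigned char payload[DEMO_BLOCK];
} demo_block;

static __thread demo_block *demo_head[DEMO_CLASSES];  /* free-list heads */
static __thread demo_block  demo_arena[DEMO_ARENA];   /* stand-in for a slab */
static __thread size_t      demo_arena_used;
static __thread unsigned    demo_refills;             /* slow-path entries */

/* Slow path: out of line and marked cold so it stays off the hot path. */
__attribute__((noinline, cold))
static void *demo_alloc_slow(int class_idx) {
    demo_refills++;
    for (int i = 0; i < DEMO_REFILL && demo_arena_used < DEMO_ARENA; i++) {
        demo_block *blk = &demo_arena[demo_arena_used++];
        blk->next = demo_head[class_idx];             /* push onto the list */
        demo_head[class_idx] = blk;
    }
    demo_block *p = demo_head[class_idx];
    if (p != NULL) demo_head[class_idx] = p->next;
    return p;   /* NULL once the arena is exhausted */
}

/* Fast path: a pure pop (load head, test, store next). */
static inline void *demo_alloc(int class_idx) {
    assert(class_idx >= 0 && class_idx < DEMO_CLASSES);
    demo_block *p = demo_head[class_idx];
    if (__builtin_expect(p != NULL, 1)) {
        demo_head[class_idx] = p->next;
        return p;
    }
    return demo_alloc_slow(class_idx);
}
```

After the first call triggers a refill, the following calls are served by the pop alone until the list runs dry again.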

---

### 4. Bitmap-based Slab Management

**HAKMEM:**
```c
int block_idx = hak_tiny_find_free_block(tls);  // Bitmap scan
if (block_idx >= 0) {
    hak_tiny_set_used(tls, block_idx);
    ...
}
```

**System tcache/arena:**
```c
void *ret = bin->list;  // Free list pop (O(1))
bin->list = *(void **)ret;  // Next pointer is stored in the free block itself
```

**Problems:**
- Bitmap scan: O(n) worst case
- Free list: always O(1)
- Bitmaps resist fragmentation well but lose on speed
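The difference can be illustrated with a side-by-side sketch. The bitmap layout (a set bit means "used") and the names `bitmap_claim_first_free`, `free_node`, and `freelist_pop` are assumptions made for illustration, not the actual `hak_tiny_find_free_block` implementation; `__builtin_ctzll` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find and claim the first free block in a bitmap where a set bit means
 * "used". The scan visits one 64-bit word at a time, so the cost grows
 * with the number of words in the slab's bitmap. */
static int bitmap_claim_first_free(uint64_t *words, size_t n_words) {
    assert(words != NULL);
    for (size_t w = 0; w < n_words; w++) {
        uint64_t free_bits = ~words[w];
        if (free_bits != 0) {
            int bit = __builtin_ctzll(free_bits);   /* lowest free bit */
            words[w] |= (uint64_t)1 << bit;         /* mark as used */
            return (int)(w * 64) + bit;
        }
    }
    return -1;   /* slab full */
}

typedef struct free_node { struct free_node *next; } free_node;

/* Intrusive free list: a pop is one load and one store, whatever the slab size. */
static void *freelist_pop(free_node **head) {
    free_node *n = *head;
    if (n != NULL) *head = n->next;
    return n;
}
```

The bitmap scan skips full words quickly, but it still has a loop whose trip count depends on occupancy, whereas the free-list pop has none.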

---

## 🎯 Improvement Options

### Option A: Ultra-Simple Fast Path (tcache-style) ⭐⭐⭐⭐⭐

**Goal:** Speed on par with System tcache

**Design:**
```c
// Thread-local cache: one free-list head per size class
static __thread void* g_tls_tcache[TINY_NUM_CLASSES];

void* hak_tiny_alloc(size_t size) {
    int class_idx = size_to_class_inline(size);  // Inlined
    if (class_idx < 0) return NULL;
    
    // Ultra-fast path: load the head, test it, store the next pointer
    void** head_ptr = &g_tls_tcache[class_idx];
    void* ptr = *head_ptr;
    if (ptr) {
        *head_ptr = *(void**)ptr;  // Pop from free list
        return ptr;
    }
    
    // Slow path: Refill from SuperSlab
    return hak_tiny_alloc_slow_refill(size, class_idx);
}
```
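The design depends on `size_to_class_inline`, which this report does not show. A minimal sketch of what such a mapping could look like, assuming eight power-of-two classes from 8 to 1024 bytes (the class layout, the constants, and the `sketch_` names are assumptions, not the actual HAKMEM definitions; `__builtin_clzll` is a GCC/Clang builtin):

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MIN_SHIFT   3   /* smallest class = 8 bytes (assumed) */
#define SKETCH_NUM_CLASSES 8   /* 8, 16, ..., 1024 bytes (assumed) */

/* Block size served by a class. */
static inline size_t sketch_class_size(int class_idx) {
    assert(class_idx >= 0 && class_idx < SKETCH_NUM_CLASSES);
    return (size_t)1 << (SKETCH_MIN_SHIFT + class_idx);
}

/* Map a request size to the smallest class that fits, or -1 if too large. */
static inline int sketch_size_to_class(size_t size) {
    if (size > sketch_class_size(SKETCH_NUM_CLASSES - 1)) return -1;
    if (size <= sketch_class_size(0)) return 0;
    /* ceil(log2(size)) computed from the leading zeros of size - 1 */
    int width = (int)(sizeof(unsigned long long) * 8);
    int bits = width - __builtin_clzll((unsigned long long)(size - 1));
    return bits - SKETCH_MIN_SHIFT;
}
```

For example, 9 through 16 bytes map to class 1 and 1024 bytes maps to class 7. With a mapping like this the class lookup costs a compare and a count-leading-zeros instruction, which keeps it cheap enough to inline.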

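**Pros:**
- Fast path: only 3-4 instructions
- Branches: only two (class check + list check)
- Speed comparable to System tcache can be expected

**Cons:**
- The elaborate Magazine-layer optimizations go to waste
- Requires a major refactoring

**Implementation time:** 1-2 weeks

**Probability of success:** ⭐⭐⭐⭐ (80%)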
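Option A's design shows only the allocation side. A matching free path would presumably push the block back onto the same thread-local list. The sketch below assumes every block is at least pointer-sized and pointer-aligned, and that the caller already knows the size class; the `sketch_` names and the class count are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CLASSES 8   /* stand-in for TINY_NUM_CLASSES (value assumed) */

static __thread void *sketch_tcache[SKETCH_CLASSES];

/* Push: store the old head in the block's first word, then make it the head. */
static inline void sketch_tiny_free(void *ptr, int class_idx) {
    assert(ptr != NULL && class_idx >= 0 && class_idx < SKETCH_CLASSES);
    *(void **)ptr = sketch_tcache[class_idx];
    sketch_tcache[class_idx] = ptr;
}

/* Pop: mirrors the fast path of the design; NULL is where it would refill. */
static inline void *sketch_tiny_pop(int class_idx) {
    assert(class_idx >= 0 && class_idx < SKETCH_CLASSES);
    void *ptr = sketch_tcache[class_idx];
    if (ptr != NULL) sketch_tcache[class_idx] = *(void **)ptr;
    return ptr;
}
```

The list is LIFO, so the most recently freed block is handed out first, which tends to keep reused memory warm in cache. How the free path would recover the size class from a bare pointer is left open here.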

---

### Option B: Gradual Reduction of Magazine Layers ⭐⭐⭐

**Goal:** Reduce complexity while preserving the existing investment

**Stage 1:** Remove HotMag + Hot Alloc (two layers fewer)
```c
void* hak_tiny_alloc(size_t size) {
    int class_idx = size_to_class_inline(size);
    if (class_idx < 0) return NULL;
    
    // Fast path: TLS Magazine only
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        return mag->items[--mag->top].ptr;
    }
    
    // Slow path
    return hak_tiny_alloc_slow(size, class_idx);
}
```

**Stage 2:** Replace the Magazine with a free list
```c
// Replace Magazine with Free List
static __thread void* g_tls_free_list[TINY_NUM_CLASSES];
```
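During the switch from Stage 1 to Stage 2, blocks already cached in a magazine would need to move onto the new free list. A minimal sketch of such a drain, assuming a magazine shaped like the `items[].ptr` / `top` accesses in Stage 1 (the `sketch_mag` layout, its capacity, and the function name are hypothetical, not the real `TinyTLSMag` definition):

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAG_CAP 16              /* hypothetical magazine capacity */

typedef struct { void *ptr; } sketch_mag_item;

typedef struct {
    sketch_mag_item items[SKETCH_MAG_CAP];
    int top;                           /* number of cached blocks */
} sketch_mag;

/* Move every block cached in the magazine onto an intrusive free list.
 * Each block must be at least pointer-sized. Returns the number moved. */
static int sketch_mag_drain_to_list(sketch_mag *mag, void **head) {
    assert(mag != NULL && head != NULL);
    int moved = 0;
    while (mag->top > 0) {
        void *blk = mag->items[--mag->top].ptr;   /* pop from the array stack */
        *(void **)blk = *head;                    /* link to the old head */
        *head = blk;
        moved++;
    }
    return moved;
}
```

The magazine keeps pointers in a separate array, while the free list threads them through the blocks themselves, so the list needs no side storage but touches each block's memory on push and pop.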

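**Pros:**
- Can be improved incrementally
- Low risk

**Cons:**
- May end up identical to Option A in the end
- The half-finished state lingers

**Implementation time:** 2-3 weeks

**Probability of success:** ⭐⭐⭐ (60%)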

---

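### Option C: Hybrid - tcache-style Tiny + Current Mid-Large ⭐⭐⭐⭐

**Goal:** Different strategies for Tiny and Mid-Large

**Tiny (≤1KB):**
- Ultra-simple design in the style of System tcache
- Free-list based
- Target: 80-90% of System

**Mid-Large (8KB-32MB):**
- Keep and strengthen the current SuperSlab/L25
- Target: 150-200% of System

**Pros:**
- A design optimized for each size range
- Preserves the Mid-Large strength (+171%!)
- Eliminates the Tiny weakness

**Cons:**
- The codebase becomes more complex
- Loss of uniformity

**Implementation time:** 2-3 weeks

**Probability of success:** ⭐⭐⭐⭐ (75%)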
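In a hybrid design the top-level entry point would reduce to a size check that routes each request to one of two back ends. The sketch below uses the 1KB Tiny limit from the text and sends everything larger to the existing path; the report does not say how 1KB-8KB requests are handled, so that routing is an assumption, as are the `sketch_` and `stub_` names.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_TINY_MAX ((size_t)1024)   /* "Tiny (<=1KB)" from the text */

typedef void *(*sketch_alloc_fn)(size_t size);

typedef struct {
    sketch_alloc_fn tiny;    /* tcache-style path */
    sketch_alloc_fn other;   /* existing path for everything larger */
} sketch_backends;

/* Route by size: one compare decides which allocator handles the request. */
static void *sketch_hybrid_alloc(const sketch_backends *b, size_t size) {
    assert(b != NULL && b->tiny != NULL && b->other != NULL);
    return (size <= SKETCH_TINY_MAX) ? b->tiny(size) : b->other(size);
}

/* Stub back ends that only count calls, for demonstration. */
static int stub_tiny_calls, stub_other_calls;
static void *stub_tiny(size_t n)  { (void)n; stub_tiny_calls++;  return NULL; }
static void *stub_other(size_t n) { (void)n; stub_other_calls++; return NULL; }
```

In a real build the two paths would more likely be direct calls than function pointers, so the compiler can inline the Tiny fast path; the indirection here only makes the routing observable.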

---

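## 📝 Recommended Approach

**Short term (1-2 weeks):** Option A (Ultra-Simple Fast Path)
- Simplest and most effective
- Speed comparable to System tcache can be expected
- Easy to roll back on failure

**Medium term (1 month):** Option C (Hybrid)
- Fixes the Tiny weakness while keeping the Mid-Large strength
- Overall performance on par with mimalloc becomes a realistic goal

**Long term (3-6 months):** Integration with the learning layer
- Simplifying Tiny makes the learning layer easier to introduce
- Integration with ACE (Adaptive Compression Engine)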

---

## Next Steps

1. **Prototype Option A** (1 week)
   - Create as a new file, `core/hakmem_tiny_simple.c`
   - Compare benchmarks

2. **Evaluate the results**
   - Target: at least 80% of System (≈108.8 M ops/sec)
   - If met, merge into mainline

3. **Mid-Large optimization** (in parallel)
   - Merge HAKX into mainline
   - Optimize L25
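The evaluation step comes down to comparing measured throughput against a fraction of the System baseline. A minimal sketch of that check, using the figures from the benchmark table (the function names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Throughput relative to a baseline, e.g. about 0.387 for 52.59 vs 135.94. */
static double relative_throughput(double measured, double baseline) {
    assert(baseline > 0.0);
    return measured / baseline;
}

/* True when measured throughput reaches the given fraction of the baseline. */
static bool meets_target(double measured, double baseline, double fraction) {
    return relative_throughput(measured, baseline) >= fraction;
}
```

With a baseline of 135.94 M ops/sec, the 80% threshold is 0.8 × 135.94 ≈ 108.75 M ops/sec, so the current 52.59 M ops/sec is well short of it.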