# Tiny Allocator Performance Analysis Report

## 📉 Current Problems

### Benchmark results (2025-11-02)

```
HAKMEM Tiny:     52.59 M ops/sec (average)
System (glibc): 135.94 M ops/sec (average)
Difference:     -61.3% (38.7% of System)
```

**HAKMEM Tiny is slower in every pattern:**
- Sequential LIFO: -69.2%
- Sequential FIFO: -69.4%
- Random Free: -58.9%
- Interleaved: -67.2%
- Long/Short-lived: -68.6%
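
The summary percentages follow directly from the two averages. A minimal sketch of that arithmetic is shown below; `relative_pct` and `gap_pct` are hypothetical helper names, not part of HAKMEM. With the averages above, `relative_pct(52.59, 135.94)` comes to about 38.7 and `gap_pct(52.59, 135.94)` to about -61.3.

```c
#include <assert.h>

/* Throughput of `measured` as a percentage of `baseline`
 * (hypothetical helper, for illustration only). */
static double relative_pct(double measured, double baseline) {
    assert(baseline > 0.0);
    return 100.0 * measured / baseline;
}

/* Signed gap versus the baseline; negative means slower. */
static double gap_pct(double measured, double baseline) {
    return relative_pct(measured, baseline) - 100.0;
}
```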

---

## 🔍 Root Causes

### 1. The fast path is too complex

**System tcache (glibc):**
```c
// Simplified sketch of the glibc tcache hit path (not the verbatim glibc
// source; `tcache` and `tc_idx()` are glibc internals). Only a handful of
// instructions run when the bin is non-empty.
typedef struct tcache_entry { struct tcache_entry *next; } tcache_entry;
typedef struct { tcache_entry *list; unsigned count; } tcache_bin;

void *tcache_get(size_t sz) {
    tcache_bin *bin = &tcache->entries[tc_idx(sz)];
    if (bin->count > 0) {
        tcache_entry *ret = bin->list;
        bin->list = ret->next;  // Singly linked list pop
        bin->count--;
        return ret;
    }
    return NULL;  // Fall back to the arena
}
```

**HAKMEM Tiny (`core/hakmem_tiny_alloc.inc:76-214`):**
1. Initialization check (lines 77-83)
2. Wrapper check (lines 84-101)
3. Size → class conversion (lines 103-109)
4. [ifdef] BENCH_FASTPATH (lines 111-157)
   - SLL (singly linked list) check
   - Magazine check
   - Refill handling
5. HotMag check (lines 159-172)
   - HotMag pop
   - Conditional refill
6. Hot alloc (lines 174-199)
   - Switch-case dispatch to class-specific functions
7. Fast tier (lines 201-207)
8. Slow path (lines 209-213)

→ **Dozens of branches** plus multiple function calls

**Cost of branch misprediction:**
- Recent CPUs: 15-20 cycles/miss
- HAKMEM: 5-10 branches → potentially 75-200 cycles if every branch mispredicts
- System tcache: 1-2 branches → 15-40 cycles

---

### 2. Too many magazine layers

**Current structure (4-5 cache layers ahead of Slab/SuperSlab):**
```
HotMag (128 slots, class 0-2)
  ↓ miss
Hot Alloc (class-specific functions)
  ↓ miss
Fast Tier
  ↓ miss
Magazine (TinyTLSMag)
  ↓ miss
TLS List
  ↓ miss
Slab (bitmap-based)
  ↓ miss
SuperSlab
```

**System tcache (1 layer):**
```
tcache (7 entries per size)
  ↓ miss
Arena (ptmalloc bins)
```

**Problems:**
- Every layer adds branch and function-call overhead
- Cache locality degrades
- The complexity gets in the way of optimization
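
To make the per-layer overhead concrete, the cascade can be modeled as a loop over tier lookups. This is a hedged model, not HAKMEM code: `tier_pop_fn`, `cascade_alloc`, the probe counter, and the two demo tiers are hypothetical, and the real tiers are hard-coded calls rather than a function-pointer table. The point is only that an allocation served by the N-th tier pays N branch-plus-call pairs first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: each tier is a function returning a block or NULL. */
typedef void *(*tier_pop_fn)(int class_idx);

/* Walk the tiers in order and return the first hit.
 * *probes receives the number of tiers consulted, i.e. the number of
 * branch + call pairs paid before the allocation succeeds or fails. */
static void *cascade_alloc(const tier_pop_fn *tiers, size_t n_tiers,
                           int class_idx, size_t *probes) {
    assert(tiers != NULL && probes != NULL);
    *probes = 0;
    for (size_t i = 0; i < n_tiers; i++) {
        (*probes)++;
        void *p = tiers[i](class_idx);
        if (p != NULL) {
            return p;
        }
    }
    return NULL; /* every tier missed */
}

/* Two stand-in tiers for demonstration: one always misses, one always hits. */
static char demo_block[64];
static void *tier_miss(int class_idx) { (void)class_idx; return NULL; }
static void *tier_hit(int class_idx) { (void)class_idx; return demo_block; }
```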

---

### 3. Refill is mixed into the fast path

**Lines 160-172: HotMag refill on the fast path**
```c
if (g_hotmag_enable && class_idx <= 2 && g_fast_head[class_idx] == NULL) {
    hotmag_init_if_needed(class_idx);
    TinyHotMag* hm = &g_tls_hot_mag[class_idx];
    void* hotmag_ptr = hotmag_pop(class_idx);
    if (hotmag_ptr == NULL) {
        if (hotmag_try_refill(class_idx, hm) > 0) {  // ← Refill on the fast path!
            hotmag_ptr = hotmag_pop(class_idx);
        }
    }
    ...
}
```

**Problems:**
- Refill belongs on the slow path
- The fast path should be a pure pop
- System tcache keeps refill completely separate
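
One way to keep refill off the fast path is to make the hot function a pure pop and move everything else into a separate cold, non-inlined function. The sketch below assumes GCC or Clang (`__thread`, `__builtin_expect`, `__attribute__((noinline, cold))`). `fast_pop`, `slow_refill`, `block_t`, and the thread-local backing pool are hypothetical stand-ins, not HAKMEM's actual functions or data structures.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CLASSES 3    /* assumption: small demo value */
#define BLOCK_SIZE 64    /* assumption: one block size for every class */
#define REFILL_COUNT 8   /* assumption: blocks linked in per refill */
#define POOL_BLOCKS 64   /* assumption: size of the demo backing pool */

/* A free block stores its "next" link in its own first bytes. */
typedef union block { union block *next; unsigned char bytes[BLOCK_SIZE]; } block_t;

static __thread block_t *tls_head[NUM_CLASSES];  /* per-class free-list heads */
static __thread block_t pool[POOL_BLOCKS];       /* stands in for the slab layer */
static __thread size_t pool_next;

/* Cold path: link up to REFILL_COUNT fresh blocks into the class list,
 * then hand out one of them. Returns NULL when the pool is exhausted. */
static __attribute__((noinline, cold)) void *slow_refill(int cls) {
    for (int i = 0; i < REFILL_COUNT && pool_next < POOL_BLOCKS; i++) {
        block_t *fresh = &pool[pool_next++];
        fresh->next = tls_head[cls];
        tls_head[cls] = fresh;
    }
    block_t *b = tls_head[cls];
    if (b == NULL) return NULL;
    tls_head[cls] = b->next;
    return b;
}

/* Hot path: load the head, take one well-predicted branch, store the
 * new head. No refill logic is inlined here (the assert compiles away
 * with NDEBUG). */
static inline void *fast_pop(int cls) {
    assert(cls >= 0 && cls < NUM_CLASSES);
    block_t *b = tls_head[cls];
    if (__builtin_expect(b != NULL, 1)) {
        tls_head[cls] = b->next;
        return b;
    }
    return slow_refill(cls);
}
```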

---

### 4. Bitmap-based Slab Management

**HAKMEM:**
```c
int block_idx = hak_tiny_find_free_block(tls);  // Bitmap scan
if (block_idx >= 0) {
    hak_tiny_set_used(tls, block_idx);
    ...
}
```

**System tcache/arena:**
```c
tcache_entry *ret = bin->list;  // Free list pop (O(1))
bin->list = ret->next;
```

**Problems:**
- Bitmap scan: O(n) in the worst case
- Free list: always O(1)
- A bitmap is more robust against fragmentation but slower
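
The cost gap can be seen in a minimal sketch of a bitmap scan. This is not HAKMEM's `hak_tiny_find_free_block`: `bitmap_find_free`, `bitmap_set_used`, and the one-bit-per-block layout (1 = used) are assumptions, and `__builtin_ctzll` requires GCC or Clang. Even when a whole 64-bit word is examined per step, the loop still grows with the number of blocks, whereas a free-list pop only touches the list head.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first free block (bit == 0), or -1 if all
 * n_blocks are used. Assumed layout: bit (i % 64) of word (i / 64)
 * describes block i. */
static int bitmap_find_free(const uint64_t *words, size_t n_blocks) {
    assert(words != NULL);
    size_t n_words = (n_blocks + 63) / 64;
    for (size_t w = 0; w < n_words; w++) {
        uint64_t free_bits = ~words[w];  /* 1 = free */
        if (free_bits != 0) {
            size_t idx = w * 64 + (size_t)__builtin_ctzll(free_bits);
            /* Bits past n_blocks in the last word are padding. */
            return idx < n_blocks ? (int)idx : -1;
        }
    }
    return -1;
}

/* Mark block idx as used. */
static void bitmap_set_used(uint64_t *words, size_t idx) {
    words[idx / 64] |= (uint64_t)1 << (idx % 64);
}
```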

---

## 🎯 Proposed Improvements

### Option A: Ultra-Simple Fast Path (tcache-style) ⭐⭐⭐⭐⭐

**Goal:** Match the speed of System tcache

**Design:**
```c
// Thread-local cache: one free-list head per size class
static __thread void* g_tls_tcache[TINY_NUM_CLASSES];

void* hak_tiny_alloc(size_t size) {
    int class_idx = size_to_class_inline(size);  // Inlined
    if (class_idx < 0) return NULL;

    // Ultra-fast path: load the head, test it, pop
    void** head_ptr = &g_tls_tcache[class_idx];
    void* ptr = *head_ptr;
    if (ptr) {
        *head_ptr = *(void**)ptr;  // Pop from free list
        return ptr;
    }

    // Slow path: Refill from SuperSlab
    return hak_tiny_alloc_slow_refill(size, class_idx);
}
```
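
The design shows only the allocation side. The matching free path would be a push onto the same list, storing the next pointer inside the freed block itself. The following is a minimal sketch, assuming the caller already knows the block's size class (how HAKMEM would recover the class from a pointer is not covered here) and that every block is at least pointer-sized and pointer-aligned. `demo_free`, `demo_alloc`, `demo_heads`, and `DEMO_NUM_CLASSES` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUM_CLASSES 8  /* assumption: stands in for TINY_NUM_CLASSES */

static __thread void *demo_heads[DEMO_NUM_CLASSES];

/* Free: push the block onto its class list. The first word of the
 * freed block is reused as the "next" link (intrusive free list). */
static inline void demo_free(void *ptr, int class_idx) {
    assert(ptr != NULL);
    assert(class_idx >= 0 && class_idx < DEMO_NUM_CLASSES);
    *(void **)ptr = demo_heads[class_idx];
    demo_heads[class_idx] = ptr;
}

/* Alloc: pop the most recently freed block, or return NULL when the
 * list is empty (a real allocator would take its slow path here). */
static inline void *demo_alloc(int class_idx) {
    assert(class_idx >= 0 && class_idx < DEMO_NUM_CLASSES);
    void *ptr = demo_heads[class_idx];
    if (ptr != NULL) {
        demo_heads[class_idx] = *(void **)ptr;
    }
    return ptr;
}
```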

**Merits:**
- Fast path: only 3-4 instructions
- Branches: only two (class check + list check)
- Speed on par with System tcache can be expected
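
The fast path is only as short as `size_to_class_inline` allows. This report does not show that function, so the following is just one plausible shape: it assumes eight power-of-two classes from 8 bytes to 1 KiB, treats a size of 0 as out of range, and uses the GCC/Clang builtin `__builtin_clzl`. The real HAKMEM class table may differ, and `sketch_size_to_class` is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_MIN_SHIFT 3    /* assumption: smallest class is 8 bytes */
#define SKETCH_MAX_SIZE 1024  /* assumption: Tiny covers sizes up to 1 KiB */

/* Map a request size to a class index 0..7 (8, 16, ..., 1024 bytes),
 * or -1 if the size is outside the assumed Tiny range. */
static inline int sketch_size_to_class(size_t size) {
    if (size == 0 || size > SKETCH_MAX_SIZE) return -1;
    if (size <= 8) return 0;
    /* For size >= 2, the bit width of (size - 1) equals ceil(log2(size)). */
    unsigned long v = (unsigned long)(size - 1);
    assert(v != 0);  /* __builtin_clzl(0) is undefined */
    int ceil_log2 = (int)(sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(v);
    return ceil_log2 - SKETCH_MIN_SHIFT;
}
```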

**Drawbacks:**
- The elaborate optimizations in the magazine layers go to waste
- Requires a major refactoring

**Implementation time:** 1-2 weeks

**Probability of success:** ⭐⭐⭐⭐ (80%)

---

### Option B: Gradual reduction of the magazine layers ⭐⭐⭐

**Goal:** Reduce complexity while preserving the existing investment

**Stage 1:** Remove HotMag + Hot Alloc (two layers fewer)
```c
void* hak_tiny_alloc(size_t size) {
    int class_idx = size_to_class_inline(size);
    if (class_idx < 0) return NULL;

    // Fast path: TLS magazine only
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        return mag->items[--mag->top].ptr;
    }

    // Slow path
    return hak_tiny_alloc_slow(size, class_idx);
}
```
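
Stage 1 relies on an array-backed magazine whose definition is not shown in this report. The sketch below is a guess at the minimal shape implied by `mag->items[--mag->top].ptr`; `SketchMag`, `SketchMagItem`, `SKETCH_MAG_CAP`, and the helper names are hypothetical. Unlike an intrusive free list, a pop here reads only the magazine array and never touches the block's own memory.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAG_CAP 128  /* assumption: capacity per magazine */

typedef struct { void *ptr; } SketchMagItem;

typedef struct {
    SketchMagItem items[SKETCH_MAG_CAP];
    int top;  /* number of cached blocks; items[top - 1] is popped next */
} SketchMag;

/* Pop the most recently pushed block, or NULL if the magazine is empty. */
static inline void *sketch_mag_pop(SketchMag *mag) {
    assert(mag != NULL);
    if (mag->top > 0) {
        return mag->items[--mag->top].ptr;
    }
    return NULL;
}

/* Push a freed block. Returns 0 when the magazine is full so that the
 * caller can spill the block to a lower layer, 1 on success. */
static inline int sketch_mag_push(SketchMag *mag, void *ptr) {
    assert(mag != NULL);
    if (mag->top >= SKETCH_MAG_CAP) {
        return 0;
    }
    mag->items[mag->top++].ptr = ptr;
    return 1;
}
```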

**Stage 2:** Replace the magazine with a free list
```c
// Replace Magazine with Free List
static __thread void* g_tls_free_list[TINY_NUM_CLASSES];
```

**Merits:**
- Can be improved incrementally
- Low risk

**Drawbacks:**
- May end up identical to Option A in the end
- The code stays in a half-finished state for a while

**Implementation time:** 2-3 weeks

**Probability of success:** ⭐⭐⭐ (60%)

---

### Option C: Hybrid - tcache-style Tiny + current design for Mid-Large ⭐⭐⭐⭐

**Goal:** Use different strategies for Tiny and Mid-Large

**Tiny (≤1KB):**
- Ultra-simple design in the style of System tcache
- Free-list based
- Target: 80-90% of System

**Mid-Large (8KB-32MB):**
- Keep and strengthen the current SuperSlab/L25
- Target: 150-200% of System
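
The hybrid split amounts to a size-based dispatch in front of two independent allocators. Below is a minimal sketch of that routing using the size ranges stated above. The enum, macro, and function names are hypothetical, and because this report does not say how sizes between 1KB and 8KB are handled, they are routed to a separate `ROUTE_OTHER` case here.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ROUTE_TINY, ROUTE_MID_LARGE, ROUTE_OTHER } alloc_route;

#define TINY_MAX_BYTES      ((size_t)1024)              /* Tiny: <= 1KB */
#define MID_LARGE_MIN_BYTES ((size_t)8 * 1024)          /* Mid-Large: 8KB .. */
#define MID_LARGE_MAX_BYTES ((size_t)32 * 1024 * 1024)  /* .. 32MB */

static_assert(TINY_MAX_BYTES < MID_LARGE_MIN_BYTES, "size ranges must not overlap");

/* Decide which allocator family should serve a request of `size` bytes. */
static inline alloc_route route_for_size(size_t size) {
    if (size <= TINY_MAX_BYTES) {
        return ROUTE_TINY;
    }
    if (size >= MID_LARGE_MIN_BYTES && size <= MID_LARGE_MAX_BYTES) {
        return ROUTE_MID_LARGE;
    }
    return ROUTE_OTHER;  /* not specified in this report */
}
```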

**Merits:**
- A design suited to each size range
- Keeps the Mid-Large advantage (+171%!)
- Eliminates the Tiny weakness

**Drawbacks:**
- The codebase becomes more complex
- Loses uniformity

**Implementation time:** 2-3 weeks

**Probability of success:** ⭐⭐⭐⭐ (75%)

---

## 📝 Recommended Approach

**Short term (1-2 weeks):** Option A (Ultra-Simple Fast Path)
- Simplest and most effective
- Speed on par with System tcache can be expected
- Easy to roll back if it fails

**Medium term (1 month):** Option C (Hybrid)
- Fixes the Tiny weakness while keeping the Mid-Large strength
- Overall performance on par with mimalloc becomes a realistic target

**Long term (3-6 months):** Integration with the learning layer
- Simplifying Tiny makes the learning layer easier to introduce
- Integration with ACE (Adaptive Compression Engine)

---

## Next Steps

1. **Prototype Option A** (1 week)
   - Create it as a new file, `core/hakmem_tiny_simple.c`
   - Compare benchmarks

2. **Evaluate the results**
   - Target: at least 80% of System (≈108.8 M ops/sec)
   - If achieved, merge into mainline

3. **Mid-Large optimization** (in parallel)
   - Integrate HAKX into mainline
   - L25 optimization
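
For the benchmark comparison in step 1, a harness along the following lines could be used. It is a rough sketch, not the benchmark that produced the numbers in this report: `bench_lifo`, `alloc_fn`, `free_fn`, and `BATCH` are hypothetical, the allocator under test is passed in as function pointers, and `clock()` measures CPU time at coarse resolution, so the round count needs to be large before the result means anything. A call such as `bench_lifo(malloc, free, 1000000, 64)` would measure the system allocator with 64-byte blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

typedef void *(*alloc_fn)(size_t);
typedef void (*free_fn)(void *);

#define BATCH 64  /* assumption: blocks held live per round */

/* Sequential LIFO pattern: allocate BATCH blocks, free them in reverse
 * order, repeat. Returns throughput in M ops/sec (alloc and free each
 * count as one op), or 0.0 if the run was too short to time. */
static double bench_lifo(alloc_fn do_alloc, free_fn do_free,
                         size_t rounds, size_t size) {
    assert(do_alloc != NULL && do_free != NULL && size > 0);
    void *live[BATCH];
    clock_t start = clock();
    for (size_t r = 0; r < rounds; r++) {
        for (int i = 0; i < BATCH; i++) {
            live[i] = do_alloc(size);
            if (live[i] != NULL) {
                /* Touch the block so the alloc/free pair is not optimized away. */
                *(volatile char *)live[i] = 1;
            }
        }
        for (int i = BATCH - 1; i >= 0; i--) {
            do_free(live[i]);
        }
    }
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    double ops = 2.0 * (double)rounds * BATCH;
    return secs > 0.0 ? ops / secs / 1e6 : 0.0;
}
```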