# Phase 6.22-23: SuperSlab + Per-thread Queues Integrated Implementation

**Date**: 2025-10-24
**Goal**: Speed up the Tiny Pool based on the mimalloc analysis (targeting +25-35%)
**Duration**: 2-3 days (estimated)
**Status**: 🚀 **Implementation started**

---

## 🎯 Goals

### Performance Targets

| Benchmark | Current | Target | Improvement |
|-----------|---------|--------|-------------|
| **Tiny 1T** | 21.0 M/s | **27-30 M/s** | **+25-35%** |
| **Tiny 4T** | 53.5 M/s | **65-72 M/s** | **+22-35%** |
| **vs mimalloc 1T** | 62% | **85-90%** | **+23-28pp** |
| **vs mimalloc 4T** | 70% | **85-95%** | **+15-25pp** |

### Implementation Goals

1. ✅ **Introduce SuperSlab** (2MB aligned)
   - Remove the registry hash
   - Page lookup with a single AND operation (guaranteed O(1))
   - Expected gain: **+10-15%**

2. ✅ **Per-thread Page Queues**
   - Remove the global freelist + mutex
   - Move to TLS per-thread queues
   - Reduce lock contention
   - Expected gain: **+15-20%**

3. ✅ **Direct Access Table** (bonus)
   - Faster size → class mapping
   - Expected gain: **+5-10%**

---

## 🧠 mimalloc Analysis (by Task-sensei)

### mimalloc's Fast Path (7 instructions)

1. **Get the TLS heap**: a single `__thread` pointer (1 load)
2. **Bin lookup**: direct access via `pages_free_direct[wsize]` (1 load)
3. **Freelist pop**: `page->free` → `block->next` (2 loads)
4. **Update the used counter**: `page->used++` (1 inc)
5. **Return**: return the block pointer

### Key Differences from hakmem

| Item | mimalloc | hakmem (Current) | Gap |
|------|----------|------------------|-----|
| **Page lookup** | Segment mask (one AND) | Registry hash | **O(1), but incurs cache misses** |
| **Slab management** | Per-thread queues | Global lists + lock | **Lock contention** |
| **Size → class** | Direct array | Lookup table | **One extra load** |
| **Block index** | Shift | Division | **Division cost** |

### The Biggest Difference: Segment-Aligned Allocation

```c
// mimalloc (simplified pseudocode): very fast page lookup (a single AND)
#define MI_SEGMENT_MASK (MI_SEGMENT_SIZE - 1)  // 32MB - 1
mi_segment_t* segment = (mi_segment_t*)((uintptr_t)p & ~MI_SEGMENT_MASK);
mi_page_t* page = &segment->pages[offset];     // offset: page index derived from p's position in the segment

// hakmem (Current): registry hash (cache misses + collision risk)
uint32_t h = hash(page_base);
TinySlabDesc* d = registry[h];  // O(1), but the cache misses hurt in practice
```

---

## 🏗️ Implementation Design

### 1. SuperSlab Structure (2MB aligned)

#### Design Philosophy

- **Alignment**: 2MB (0x200000) aligned allocation
- **Size**: 2MB per SuperSlab
- **Capacity**: 2MB ÷ 64KB slab = 32 slabs per SuperSlab
- **Metadata**: SuperSlab header (64B, cache-line aligned) followed by 32 per-slab entries, stored at the base of the SuperSlab (so it occupies the first bytes of slab 0)

#### Metadata Structure

```c
#include <stddef.h>  // offsetof
#include <stdint.h>

// Per-slab metadata (16B per entry on 64-bit targets)
typedef struct TinySlabMeta {
    void*    freelist;   // Per-slab freelist head
    uint16_t used;       // Blocks used
    uint16_t capacity;   // Total blocks
    uint32_t owner_tid;  // Owning thread ID (checked on the free fast path)
} TinySlabMeta;

// SuperSlab: 2MB aligned container for 32 x 64KB slabs
typedef struct SuperSlab {
    // Header (padded to 64B, cache-line aligned)
    uint64_t magic;          // Magic value identifying a hakmem SuperSlab
    uint8_t  size_class;     // 0-7 (8-64B)
    uint8_t  active_slabs;   // Number of active slabs (0-32)
    uint16_t _pad;
    uint32_t slab_bitmap;    // 32-bit bitmap (1=active, 0=free)
    char     _pad2[64 - 16]; // Pad the 16B of header fields to 64B

    // Per-slab metadata (32 entries), placed after the header
    TinySlabMeta slabs[32];
} __attribute__((aligned(64))) SuperSlab;

// Compile-time checks (SUPERSLAB_SIZE is defined with the lookup helpers)
_Static_assert(offsetof(SuperSlab, slabs) == 64, "SuperSlab header must be 64B");
_Static_assert(sizeof(SuperSlab) == 64 + 32 * sizeof(TinySlabMeta), "unexpected SuperSlab layout");
_Static_assert(SUPERSLAB_SIZE == 2*1024*1024, "2MB alignment required");
```
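
The `slab_bitmap` field lends itself to a find-first-free scan when a slab has to be activated. The following is a minimal sketch, assuming the bit convention above (1=active, 0=free). The helper names `bitmap_claim_slab` and `bitmap_release_slab` are hypothetical and not part of hakmem, `__builtin_ctz` is a GCC/Clang builtin, and the code is not thread-safe as written.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: claim the lowest free slab (bit == 0) in a 32-slab bitmap.
// Returns the slab index (0-31), or -1 if all 32 slabs are already active.
static inline int bitmap_claim_slab(uint32_t* bitmap) {
    uint32_t free_bits = ~*bitmap;        // after inversion, 1 = free
    if (free_bits == 0) return -1;        // SuperSlab is full
    int idx = __builtin_ctz(free_bits);   // index of the lowest set bit
    *bitmap |= (uint32_t)1 << idx;        // mark the slab active
    return idx;
}

// Hypothetical helper: mark a slab as free again.
static inline void bitmap_release_slab(uint32_t* bitmap, int idx) {
    assert(idx >= 0 && idx < 32);
    *bitmap &= ~((uint32_t)1 << idx);
}
```

Because released slabs are reused lowest-index first, active slabs tend to cluster at the start of the SuperSlab, which may help locality.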

#### Pointer → SuperSlab Lookup (mimalloc style)

```c
#define SUPERSLAB_SIZE (2 * 1024 * 1024)  // 2MB
#define SUPERSLAB_MASK (SUPERSLAB_SIZE - 1)
#define SLAB_SIZE (64 * 1024)  // 64KB

// O(1) SuperSlab lookup (a single AND)
static inline SuperSlab* ptr_to_superslab(void* p) {
    // Widen the mask before inverting so the upper address bits are preserved
    return (SuperSlab*)((uintptr_t)p & ~(uintptr_t)SUPERSLAB_MASK);
}

// Slab index computation (shift)
static inline int ptr_to_slab_index(void* p) {
    uintptr_t offset = (uintptr_t)p & SUPERSLAB_MASK;
    return (int)(offset >> 16);  // ÷ 64KB (2^16)
}
```
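
To make the mask and shift arithmetic concrete, here is a small standalone sketch that works on plain integer addresses. The `DEMO_`/`demo_` names are illustrative only and mirror the helpers above without depending on the `SuperSlab` type.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SUPERSLAB_SIZE ((uintptr_t)2 * 1024 * 1024)  // 2MB
#define DEMO_SUPERSLAB_MASK (DEMO_SUPERSLAB_SIZE - 1)
#define DEMO_SLAB_SHIFT     16                             // 64KB = 2^16

// Base address of the 2MB-aligned SuperSlab that contains addr.
static inline uintptr_t demo_superslab_base(uintptr_t addr) {
    return addr & ~DEMO_SUPERSLAB_MASK;
}

// Index (0-31) of the 64KB slab that contains addr.
static inline int demo_slab_index(uintptr_t addr) {
    return (int)((addr & DEMO_SUPERSLAB_MASK) >> DEMO_SLAB_SHIFT);
}
```

For example, the address `0x40234560` maps to SuperSlab base `0x40200000`; its offset within the SuperSlab is `0x34560`, so it falls in slab index 3.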

#### Allocation Flow

```c
void* tiny_alloc_fast(size_t size) {
    // Caller guarantees 1 <= size <= 64. Round up to the next 8B class;
    // SIZE_TO_CLASS must cover indices 0..8.
    int class_idx = SIZE_TO_CLASS[(size + 7) >> 3];  // Direct access
    TinyTLS* tls = get_tls();

    // Per-thread active slab
    TinySlab* active = tls->active_slab[class_idx];
    if (!active || !active->freelist) {
        active = refill_slab(class_idx);       // Take from the TLS queue
        if (!active) return NULL;              // Refill failed
        tls->active_slab[class_idx] = active;  // Remember it for the next call
    }

    // Pop from freelist
    void* block = active->freelist;
    active->freelist = *(void**)block;
    active->used++;

    return block;
}
```
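
The pop above relies on an intrusive freelist: each free block stores the pointer to the next free block in its own first word. As a rough sketch of how such a list could be threaded through a fresh slab, assuming a simplified stand-in type (`DemoSlab`, `demo_slab_init`, and `demo_slab_pop` are hypothetical names, not hakmem's):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for the per-slab metadata used by the fast path.
typedef struct DemoSlab {
    void*    freelist;  // head of the intrusive free list
    uint16_t used;      // blocks currently handed out
    uint16_t capacity;  // total blocks in the slab
} DemoSlab;

// Thread a free list through `mem`: each free block's first word points to the next.
// block_size must be at least sizeof(void*) and a multiple of it.
static void demo_slab_init(DemoSlab* s, void* mem, size_t mem_size, size_t block_size) {
    assert(block_size >= sizeof(void*) && block_size % sizeof(void*) == 0);
    size_t n = mem_size / block_size;
    assert(n > 0 && n <= UINT16_MAX);
    char* base = (char*)mem;
    for (size_t i = 0; i < n; i++) {
        void* next = (i + 1 < n) ? (void*)(base + (i + 1) * block_size) : NULL;
        *(void**)(base + i * block_size) = next;
    }
    s->freelist = base;
    s->used = 0;
    s->capacity = (uint16_t)n;
}

// Pop one block, or return NULL when the slab is exhausted.
static void* demo_slab_pop(DemoSlab* s) {
    void* block = s->freelist;
    if (!block) return NULL;
    s->freelist = *(void**)block;  // the next pointer lives inside the block
    s->used++;
    return block;
}
```

Since the smallest size class is 8 bytes, every block is large enough to hold one pointer on a 64-bit target, so the list needs no memory beyond the slab itself.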

#### Free Flow (Fast path: same-thread)

```c
void tiny_free_fast(void* p) {
    // SuperSlab lookup (a single AND)
    SuperSlab* ss = ptr_to_superslab(p);
    int slab_idx = ptr_to_slab_index(p);

    // Fetch the slab metadata
    TinySlabMeta* meta = &ss->slabs[slab_idx];

    // Same-thread check (compare against the TLS thread ID)
    if (meta->owner_tid == get_tid()) {
        // Fast path: push onto the slab's freelist
        *(void**)p = meta->freelist;
        meta->freelist = p;
        meta->used--;
    } else {
        // Slow path: Remote free (lock or lock-free queue)
        remote_free(ss, slab_idx, p);
    }
}
```
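
The slow path leaves the choice between a lock and a lock-free queue open. One possible lock-free shape is a push-only stack that remote threads update with a CAS loop and the owner drains with a single atomic exchange. This is only a sketch assuming C11 `<stdatomic.h>`; `RemoteList`, `remote_push`, and `remote_drain` are hypothetical names and not hakmem's actual `remote_free` implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

// Hypothetical per-slab remote-free list (multiple producers, single consumer).
typedef struct RemoteList {
    _Atomic(void*) head;
} RemoteList;

// Called by non-owner threads: push block p with a CAS loop.
static void remote_push(RemoteList* rl, void* p) {
    void* old = atomic_load_explicit(&rl->head, memory_order_relaxed);
    do {
        *(void**)p = old;  // link the block to the current head
    } while (!atomic_compare_exchange_weak_explicit(
                 &rl->head, &old, p,
                 memory_order_release, memory_order_relaxed));
}

// Called by the owner thread: detach the whole chain in one atomic exchange.
// The returned chain is linked through each block's first word.
static void* remote_drain(RemoteList* rl) {
    return atomic_exchange_explicit(&rl->head, (void*)NULL, memory_order_acquire);
}
```

Because blocks are never popped one at a time with CAS (the owner always takes the entire chain), this shape sidesteps the usual ABA hazard of lock-free stacks. The owner can then splice the drained chain into the slab's local freelist.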

---

### 2. Per-thread Page Queues

#### Design Philosophy

- **Remove the global freelist**: reduces lock contention
- **TLS queues**: each thread keeps its own slab queues
- **Refill policy**: when the TLS queue is empty, allocate a new SuperSlab; stealing from other threads' queues is the last resort

#### TLS Structure

```c
typedef struct TinyTLS {
    // Active slabs (1 per size class)
    TinySlab* active_slab[TINY_NUM_CLASSES];

    // Per-class slab queues (LIFO)
    struct {
        TinySlab* head;
        int count;
    } slab_queue[TINY_NUM_CLASSES];

    // Statistics
    uint64_t allocs[TINY_NUM_CLASSES];
    uint64_t frees[TINY_NUM_CLASSES];

} __attribute__((aligned(64))) TinyTLS;

static __thread TinyTLS* t_tiny_tls = NULL;
```
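
Since `t_tiny_tls` starts out as `NULL`, something has to populate it on a thread's first allocation. A lazy initializer is one way to do that. The sketch below uses a reduced stand-in struct and `calloc` purely for illustration: `DemoTLS`, `t_demo_tls`, and `demo_get_tls` are hypothetical, and an allocator that replaces `malloc` would presumably need to obtain this memory another way (for example via `mmap`) to avoid recursing into itself.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_NUM_CLASSES 8  // assumption: 8 size classes (8-64B)

typedef struct DemoTLS {
    void* active_slab[DEMO_NUM_CLASSES];
} DemoTLS;

static __thread DemoTLS* t_demo_tls = NULL;

// Hypothetical lazy initializer: the first call on each thread allocates a
// zeroed DemoTLS. Returns NULL if that allocation fails.
static inline DemoTLS* demo_get_tls(void) {
    DemoTLS* tls = t_demo_tls;
    if (__builtin_expect(tls == NULL, 0)) {  // slow path, taken once per thread
        tls = calloc(1, sizeof *tls);
        t_demo_tls = tls;
    }
    return tls;
}
```

After the first call, the common case is one TLS load plus a well-predicted branch. Releasing the structure at thread exit is not shown here.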

#### Refill Policy

```c
TinySlab* refill_slab(int class_idx) {
    TinyTLS* tls = get_tls();

    // 1. Try TLS queue
    if (tls->slab_queue[class_idx].head) {
        return pop_slab_queue(tls, class_idx);
    }

    // 2. Allocate new SuperSlab (2MB aligned)
    SuperSlab* ss = allocate_superslab(class_idx);
    if (ss) {
        // Initialize first slab
        // NOTE: assumes TinySlab is the type of the entries in ss->slabs[]
        return &ss->slabs[0];
    }

    // 3. Steal from other threads (last resort)
    return steal_slab(class_idx);
}
```
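
`pop_slab_queue` is not defined in this document. As a sketch of what the per-class LIFO queue operations might look like, assuming each slab descriptor carries an intrusive `next` link (`QueuedSlab`, `DemoSlabQueue`, `demo_queue_push`, and `demo_queue_pop` are all hypothetical names):

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical slab descriptor with an intrusive link for the per-thread queue.
typedef struct QueuedSlab {
    struct QueuedSlab* next;
} QueuedSlab;

typedef struct DemoSlabQueue {
    QueuedSlab* head;
    int         count;
} DemoSlabQueue;

// Push a slab on top of the LIFO queue.
static inline void demo_queue_push(DemoSlabQueue* q, QueuedSlab* s) {
    s->next = q->head;
    q->head = s;
    q->count++;
}

// Pop the most recently pushed slab, or return NULL if the queue is empty.
static inline QueuedSlab* demo_queue_pop(DemoSlabQueue* q) {
    QueuedSlab* s = q->head;
    if (!s) return NULL;
    q->head = s->next;
    s->next = NULL;
    q->count--;
    return s;
}
```

LIFO order means the most recently used slab is handed out first, which is the one most likely to still be in cache. No locking is needed as long as only the owning thread touches its queue.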

---

### 3. Direct Access Table (Bonus Optimization)

#### Speeding up Size → Class

```c
// Before: Lookup table (1 extra load)
// Index = (size + 7) >> 3, so 9 entries are needed to cover sizes up to 64
static const uint8_t SIZE_TO_CLASS[9] = { 0, 0, 1, 2, 3, 4, 5, 6, 7 };
int class_idx = SIZE_TO_CLASS[(size + 7) >> 3];

// After: direct computation (no LUT)
// Size classes: 8, 16, 24, 32, 40, 48, 56, 64 bytes
static inline int size_to_class_direct(size_t size) {
    // Assume size is already rounded up to an 8B boundary (8 <= size <= 64)
    return (int)((size >> 3) - 1);  // 8→0, 16→1, 24→2, ...
}
```
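
When swapping the table for arithmetic, it is worth checking that both mappings agree for every tiny size, including sizes that are not multiples of 8. A minimal self-check sketch follows; the `demo_` functions are illustrative, and here the direct version does the round-up itself rather than assuming a pre-rounded size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// 9 entries so that index (size + 7) >> 3 is valid for every size in 1..64.
static const uint8_t DEMO_SIZE_TO_CLASS[9] = { 0, 0, 1, 2, 3, 4, 5, 6, 7 };

static inline int demo_class_lut(size_t size) {
    assert(size >= 1 && size <= 64);
    return DEMO_SIZE_TO_CLASS[(size + 7) >> 3];
}

// Direct computation: round up to a multiple of 8, then subtract one.
static inline int demo_class_direct(size_t size) {
    assert(size >= 1 && size <= 64);
    return (int)(((size + 7) >> 3) - 1);
}

// Returns 1 if both mappings agree for every tiny size, 0 otherwise.
static int demo_classes_agree(void) {
    for (size_t size = 1; size <= 64; size++) {
        if (demo_class_lut(size) != demo_class_direct(size)) return 0;
    }
    return 1;
}
```

Both versions map 1-8 bytes to class 0, 9-16 bytes to class 1, and so on up to 57-64 bytes for class 7.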

---

## 📊 Implementation Roadmap

### Phase 1: SuperSlab Foundation (Day 1)

- [ ] Create `hakmem_tiny_superslab.h`
  - SuperSlab struct definition
  - Inline functions (`ptr_to_superslab`, `ptr_to_slab_index`)

- [ ] Create `hakmem_tiny_superslab.c`
  - SuperSlab allocation (mmap 2MB aligned)
  - SuperSlab initialization
  - Bitmap management

- [ ] Modify `hakmem_tiny.c`
  - Remove the registry
  - Migrate to the SuperSlab scheme
  - Use `ptr_to_superslab()`

### Phase 2: Per-thread Queues (Day 1-2)

- [ ] Extend the TinyTLS struct
  - Add `slab_queue`
  - `active_slab` array

- [ ] Remove the global freelist
  - Remove `g_tiny_freelist[]`
  - Remove `g_tiny_locks[]`

- [ ] Implement the refill policy
  - Take from the TLS queue
  - Allocate a new SuperSlab
  - Steal mechanism (optional)

### Phase 3: Optimization + Verification (Day 2-3)

- [ ] Direct Access Table
  - Optimize `SIZE_TO_CLASS`

- [ ] Block Index Shift
  - Replace division with shift

- [ ] Build + test
  - Verify that it compiles
  - Basic functional tests

- [ ] Benchmark
  - Measure Tiny 1T/4T
  - Compare against Phase 6.21

---

## 🎯 Success Criteria

### Must Have

- ✅ SuperSlab works (2MB aligned allocation)
- ✅ Registry removed
- ✅ Per-thread queues work
- ✅ Global lock removed
- ✅ Tiny 1T: **+20%** or more
- ✅ Tiny 4T: **+15%** or more

### Should Have

- ✅ Tiny 1T: **+25-30%**
- ✅ Tiny 4T: **+20-30%**
- ✅ vs mimalloc 1T: **85%** or more
- ✅ vs mimalloc 4T: **85%** or more

### Nice to Have

- ✅ Direct Access Table implemented
- ✅ Block Index Shift implemented
- ✅ Steal mechanism implemented

---

## 🚧 Risk Management

### Risk 1: 2MB Alignment Failure

**Mitigation**: `posix_memalign`, or `mmap` with `MAP_ALIGNED_SUPER` (a FreeBSD-specific flag; on Linux, over-allocate with `mmap` and trim to a 2MB boundary)
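
As a sketch of the `posix_memalign` option, the following requests a 2MB block on a 2MB boundary and verifies the alignment. The `demo_` names are hypothetical. Note the assumption that the libc allocator is still available underneath: if hakmem interposes `malloc` itself, this route may not be usable.

```c
#define _POSIX_C_SOURCE 200112L  // expose posix_memalign under strict -std=c11
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_SUPERSLAB_SIZE ((size_t)2 * 1024 * 1024)  // 2MB

// Returns a 2MB-aligned, 2MB-sized block, or NULL on failure. Release with free().
static void* demo_alloc_superslab_memalign(void) {
    void* p = NULL;
    // posix_memalign returns 0 on success and an error number otherwise.
    if (posix_memalign(&p, DEMO_SUPERSLAB_SIZE, DEMO_SUPERSLAB_SIZE) != 0) {
        return NULL;
    }
    return p;
}

// Returns nonzero if p lies on a 2MB boundary.
static int demo_is_superslab_aligned(const void* p) {
    return ((uintptr_t)p & (DEMO_SUPERSLAB_SIZE - 1)) == 0;
}
```

Checking the alignment once at allocation time is cheap insurance, because every later `ptr_to_superslab()` call silently depends on it.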

### Risk 2: Increased Memory Overhead

**Impact**: 2MB alignment → up to 2MB wasted

**Mitigation**: Monitor SuperSlab utilization
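
Utilization can be derived directly from the per-SuperSlab `slab_bitmap`. A minimal sketch, assuming the monitoring code can see an array of bitmaps (how they are collected is left open); the `demo_` names are hypothetical and `__builtin_popcount` is a GCC/Clang builtin:

```c
#include <assert.h>
#include <stdint.h>

// Number of active slabs in one SuperSlab bitmap (1 = active).
static inline int demo_active_slabs(uint32_t slab_bitmap) {
    return __builtin_popcount(slab_bitmap);
}

// Utilization in percent (0-100, rounded down) across n SuperSlab bitmaps.
static int demo_utilization_percent(const uint32_t* bitmaps, int n) {
    if (n <= 0) return 0;
    int active = 0;
    for (int i = 0; i < n; i++) {
        active += demo_active_slabs(bitmaps[i]);
    }
    return (active * 100) / (n * 32);
}
```

For instance, one full SuperSlab plus one half-full SuperSlab gives 48 active slabs out of 64, or 75%.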

### Risk 3: Per-thread Queue Exhaustion

**Mitigation**: Steal mechanism + fallback to a global pool

---

## 📝 Implementation Notes

### mmap 2MB Aligned Allocation

```c
#include <stdint.h>
#include <sys/mman.h>

void* allocate_superslab_aligned(void) {
    void* ptr = MAP_FAILED;
#ifdef MAP_ALIGNED_SUPER  // FreeBSD-specific; not defined on Linux
    ptr = mmap(NULL, SUPERSLAB_SIZE, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_ALIGNED_SUPER, -1, 0);
#endif
    if (ptr == MAP_FAILED) {
        // Fallback: manual alignment (over-allocate 2x, then trim)
        char* raw = mmap(NULL, SUPERSLAB_SIZE * 2, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED) return NULL;
        char* base = (char*)(((uintptr_t)raw + SUPERSLAB_MASK) & ~(uintptr_t)SUPERSLAB_MASK);
        if (base > raw) munmap(raw, (size_t)(base - raw));                     // leading slack
        munmap(base + SUPERSLAB_SIZE, (size_t)(raw + SUPERSLAB_SIZE - base));  // trailing slack
        ptr = base;
    }
    return ptr;
}
```

### Compatibility with Existing Code

- The `hakmem_tiny.h` API does not change
- `hak_tiny_alloc()` and `hak_tiny_free()` remain compatible
- Only the internal implementation changes

---

## 🎓 Learning from mimalloc

### Key Takeaways

1. **Memory is cheap, computation is expensive**
   - Accept up to 2MB of padding to reduce computation cost

2. **Cache locality > Algorithm complexity**
   - An O(1) hash lookup loses to direct access with good locality

3. **Lock-free > Fine-grained locking**
   - Per-thread queues remove locks from the fast path

4. **Alignment is power**
   - Power-of-2 alignment enables AND/shift optimizations

---

**Created**: 2025-10-24 11:35 JST
**Status**: 🚀 **Ready to implement**
**Next action**: Start Phase 1 implementation (SuperSlab foundation)