hakmem/docs/status/PHASE_6.22_PLAN_2025_10_24.md

# Phase 6.22-23: SuperSlab + Per-thread Queues Integrated Implementation

**Date**: 2025-10-24
**Goal**: Speed up the Tiny Pool based on the mimalloc analysis (targeting +25-35%)
**Duration**: 2-3 days (estimated)
**Status**: 🚀 **Implementation started**

---

## 🎯 Goals

### Performance Targets

| Benchmark | Current | Target | Improvement |
|-----------|---------|--------|-------------|
| **Tiny 1T** | 21.0 M/s | **27-30 M/s** | **+25-35%** |
| **Tiny 4T** | 53.5 M/s | **65-72 M/s** | **+22-35%** |
| **vs mimalloc 1T** | 62% | **85-90%** | **+23-28pp** |
| **vs mimalloc 4T** | 70% | **85-95%** | **+15-25pp** |

### Implementation Goals

1. ✅ **Introduce SuperSlab** (2MB aligned)
   - Remove the registry hash
   - Page lookup via a single AND operation (O(1) guaranteed)
   - Expected gain: **+10-15%**

2. ✅ **Per-thread Page Queues**
   - Remove the global freelist + mutex
   - Migrate to TLS per-thread queues
   - Reduce lock contention
   - Expected gain: **+15-20%**

3. ✅ **Direct Access Table** (bonus)
   - Faster size → class mapping
   - Expected gain: **+5-10%**

---

## 🧠 mimalloc Analysis Results (by Task-sensei)

### mimalloc's Fast Path (7 instructions)

1. **Get the TLS heap**: a single `__thread` pointer (1 load)
2. **Bin lookup**: direct access to `pages_free_direct[wsize]` (1 load)
3. **Freelist pop**: `page->free` → `block->next` (2 loads)
4. **Update the used counter**: `page->used++` (1 inc)
5. **Return**: return the block pointer
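
The five steps above can be sketched in C roughly as follows. This is an illustrative sketch, not mimalloc's actual source: the `Heap`, `Page`, and `Block` types, the `sketch_alloc_fast` name, and the bin count are assumptions made for the example. Only the field names `pages_free_direct`, `free`, `next`, and `used` follow the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BINS 9  /* hypothetical: covers word sizes 0..8 */

typedef struct Block { struct Block* next; } Block;
static_assert(sizeof(Block) == sizeof(void*), "a free block only stores its next pointer");

typedef struct Page {
    Block*   free;   /* head of this page's freelist */
    uint32_t used;   /* blocks currently handed out */
} Page;

typedef struct Heap {
    Page* pages_free_direct[SKETCH_BINS];  /* indexed by size in words */
} Heap;

/* Step 1: one thread-local pointer to the current heap. */
static __thread Heap* t_heap = NULL;

/* Steps 1-5 of the fast path; returns NULL when a slow path would be needed. */
static inline void* sketch_alloc_fast(size_t size) {
    Heap* heap = t_heap;                               /* 1 load */
    size_t wsize = (size + sizeof(void*) - 1) / sizeof(void*);
    if (!heap || wsize >= SKETCH_BINS) return NULL;
    Page* page = heap->pages_free_direct[wsize];       /* 1 load */
    if (!page || !page->free) return NULL;
    Block* block = page->free;                         /* load page->free */
    page->free = block->next;                          /* load block->next */
    page->used++;                                      /* 1 inc */
    return block;
}
```

The NULL checks stand in for the branch to a slow path; the instruction count quoted in the heading refers to the case where every check passes.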

### Key Differences from hakmem

| Item | mimalloc | hakmem (Current) | Gap |
|------|----------|------------------|-----|
| **Page lookup** | Segment mask (one AND) | Registry hash | **O(1), but cache misses** |
| **Slab management** | Per-thread queues | Global lists + lock | **Lock contention** |
| **Size → class** | Direct array | Lookup table | **One extra load** |
| **Block index** | Shift | Division | **Division cost** |

### Biggest Difference: Segment-Aligned Allocation

```c
// mimalloc: very fast page lookup (a single AND operation); simplified
#define MI_SEGMENT_MASK (MI_SEGMENT_SIZE - 1)  // 32MB - 1
segment = (mi_segment_t*)((uintptr_t)p & ~MI_SEGMENT_MASK);
page = segment->pages[offset];

// hakmem (Current): registry hash (cache misses + collision risk)
uint32_t h = hash(page_base);
TinySlabDesc* d = registry[h];  // O(1), but the cache miss hurts in practice
```

---

## 🏗️ Implementation Design

### 1. SuperSlab Structure (2MB aligned)

#### Design Rationale

- **Alignment**: 2MB (0x200000) aligned allocation
- **Size**: 2MB per SuperSlab
- **Capacity**: 2MB ÷ 64KB slab = 32 slabs per SuperSlab
- **Metadata**: SuperSlab header (cache-line aligned; 16B of fixed fields plus 32 × 16B per-slab entries)

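#### Metadata Structure

```c
#include <stdint.h>

// Per-slab metadata (16B on LP64); owner_tid is the field read by the free fast path
typedef struct TinySlabMeta {
    void*    freelist;       // Per-slab freelist head
    uint16_t used;           // Blocks in use
    uint16_t capacity;       // Total blocks
    uint32_t owner_tid;      // Owning thread ID
} TinySlabMeta;

// SuperSlab: 2MB-aligned container for 32 x 64KB slabs
typedef struct SuperSlab {
    // Fixed header fields (16B)
    uint64_t magic;              // Magic value identifying a SuperSlab (HAKMEM_SUPERSLAB)
    uint8_t  size_class;         // 0-7 (8-64B)
    uint8_t  active_slabs;       // Number of active slabs (0-32)
    uint16_t _pad;
    uint32_t slab_bitmap;        // 32-bit bitmap (1=active, 0=free)

    // Per-slab metadata (32 entries x 16B = 512B)
    TinySlabMeta slabs[32];

    // Pad 16B + 512B = 528B up to the next multiple of 64B (576B)
    char _pad2[576 - 16 - 32*16];
} __attribute__((aligned(64))) SuperSlab;

// Compile-time checks (LP64 layout assumed; SUPERSLAB_SIZE is defined with the lookup helpers)
_Static_assert(sizeof(TinySlabMeta) == 16, "per-slab metadata must be 16B");
_Static_assert(sizeof(SuperSlab) % 64 == 0, "SuperSlab header must be a multiple of 64B");
_Static_assert(SUPERSLAB_SIZE == 2*1024*1024, "2MB alignment required");
```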

#### Pointer → SuperSlab Lookup (mimalloc style)

```c
#include <stdint.h>

#define SUPERSLAB_SIZE    (2 * 1024 * 1024)  // 2MB
#define SUPERSLAB_MASK    ((uintptr_t)SUPERSLAB_SIZE - 1)
#define SLAB_SIZE         (64 * 1024)        // 64KB

// O(1) SuperSlab lookup (a single AND operation)
static inline SuperSlab* ptr_to_superslab(void* p) {
    return (SuperSlab*)((uintptr_t)p & ~SUPERSLAB_MASK);
}

// Slab index computation (shift instead of division)
static inline int ptr_to_slab_index(void* p) {
    uintptr_t offset = (uintptr_t)p & SUPERSLAB_MASK;
    return (int)(offset >> 16);  // ÷ 64KB (2^16)
}
```
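
To make the mask arithmetic concrete, here is a small self-contained check. It repeats the constants so that it compiles on its own; `demo_lookup`, `demo_run`, `SLAB_SHIFT`, and the use of C11 `aligned_alloc` as a stand-in for the real 2MB-aligned mapping are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SUPERSLAB_SIZE  (2 * 1024 * 1024)                /* 2MB */
#define SUPERSLAB_MASK  ((uintptr_t)SUPERSLAB_SIZE - 1)
#define SLAB_SHIFT      16                               /* 64KB = 2^16 */

static_assert((SUPERSLAB_SIZE >> SLAB_SHIFT) == 32, "32 slabs per SuperSlab");

/* Returns the slab index of p and stores its SuperSlab base in *base_out. */
static int demo_lookup(const void* p, uintptr_t* base_out) {
    uintptr_t addr = (uintptr_t)p;
    *base_out = addr & ~SUPERSLAB_MASK;                   /* one AND */
    return (int)((addr & SUPERSLAB_MASK) >> SLAB_SHIFT);  /* one AND + one shift */
}

/* Allocates one 2MB-aligned region and looks up a pointer inside slab 5.
 * Returns 1 on success, 0 on a wrong result, -1 if allocation failed. */
static int demo_run(void) {
    char* ss = aligned_alloc(SUPERSLAB_SIZE, SUPERSLAB_SIZE);
    if (!ss) return -1;
    uintptr_t base = 0;
    int idx = demo_lookup(ss + 5 * 65536 + 123, &base);
    int ok = (base == (uintptr_t)ss) && (idx == 5);
    free(ss);
    return ok;
}
```

For example, the address `0x40212345` splits into base `0x40200000` and offset `0x12345`, which falls in slab 1. The lookup never touches memory, which is the point of the alignment requirement.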

#### Allocation Flow

```c
void* tiny_alloc_fast(size_t size) {
    // Direct access, rounding up to the 8B boundary
    // (assumes 1 <= size <= 64 and a 9-entry SIZE_TO_CLASS table)
    int class_idx = SIZE_TO_CLASS[(size + 7) >> 3];
    TinyTLS* tls = get_tls();

    // Per-thread active slab
    TinySlab* active = tls->active_slab[class_idx];
    if (!active || !active->freelist) {
        active = refill_slab(class_idx);  // Fetch from the TLS queue
        if (!active) return NULL;         // Refill failed (out of memory)
        tls->active_slab[class_idx] = active;
    }

    // Pop from freelist
    void* block = active->freelist;
    active->freelist = *(void**)block;
    active->used++;

    return block;
}
```

#### Free Flow (Fast path: Same-thread)

```c
void tiny_free_fast(void* p) {
    // SuperSlab lookup (a single AND)
    SuperSlab* ss = ptr_to_superslab(p);
    int slab_idx = ptr_to_slab_index(p);

    // Fetch the slab metadata
    TinySlabMeta* meta = &ss->slabs[slab_idx];

    // Same-thread check (compare against the calling thread's ID)
    if (meta->owner_tid == get_tid()) {
        // Fast path: Push to TLS freelist
        *(void**)p = meta->freelist;
        meta->freelist = p;
        meta->used--;
    } else {
        // Slow path: Remote free (lock or lock-free queue)
        remote_free(ss, slab_idx, p);
    }
}
```
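
`remote_free` is left undefined above. One possible shape, assuming the lock-free option is chosen, is a per-slab multi-producer stack that remote threads push onto and that the owning thread drains in a single step. The `RemoteList` type and both function names are hypothetical, and the sketch relies on C11 `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-slab list of blocks freed by non-owner threads. */
typedef struct RemoteList {
    _Atomic(void*) head;
} RemoteList;

/* Called by any thread: push block p with a CAS loop (Treiber-style push). */
static void remote_list_push(RemoteList* rl, void* p) {
    assert(p != NULL);
    void* old = atomic_load_explicit(&rl->head, memory_order_relaxed);
    do {
        *(void**)p = old;  /* link the block to the head we last observed */
    } while (!atomic_compare_exchange_weak_explicit(
                 &rl->head, &old, p,
                 memory_order_release, memory_order_relaxed));
}

/* Called by the owning thread only: detach the whole chain in one exchange. */
static void* remote_list_drain(RemoteList* rl) {
    return atomic_exchange_explicit(&rl->head, (void*)NULL, memory_order_acquire);
}
```

Because the owner takes the entire chain at once instead of popping single nodes, the usual ABA hazard of lock-free pops does not arise. The owner would then splice the drained chain into `meta->freelist` and adjust `meta->used` on its own thread.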

---

### 2. Per-thread Page Queues

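#### Design Rationale

- **Drop the global freelist**: reduces lock contention
- **TLS queues**: each thread keeps its own slab queue
- **Refill policy**: empty queue → allocate a new SuperSlab, then steal from another thread's queue as a last resort

#### TLS Structure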

```c
typedef struct TinyTLS {
    // Active slabs (1 per size class)
    TinySlab* active_slab[TINY_NUM_CLASSES];

    // Per-class slab queues (LIFO)
    struct {
        TinySlab* head;
        int       count;
    } slab_queue[TINY_NUM_CLASSES];

    // Statistics
    uint64_t allocs[TINY_NUM_CLASSES];
    uint64_t frees[TINY_NUM_CLASSES];

} __attribute__((aligned(64))) TinyTLS;

static __thread TinyTLS* t_tiny_tls = NULL;
```
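
`get_tls()` is used throughout but not shown. A minimal sketch follows, assuming the TLS block lives in static thread-local storage so that first use never calls back into the allocator. `DemoTLS` and `DEMO_NUM_CLASSES` are hypothetical, trimmed stand-ins for `TinyTLS` and `TINY_NUM_CLASSES` so that the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_CLASSES 8  /* assumed: one class per 8B step up to 64B */

typedef struct DemoTLS {
    void*    active_slab[DEMO_NUM_CLASSES];
    uint64_t allocs[DEMO_NUM_CLASSES];
} __attribute__((aligned(64))) DemoTLS;

static_assert(sizeof(DemoTLS) % 64 == 0, "TLS block should fill whole cache lines");

static __thread DemoTLS  t_demo_storage;      /* zero-initialized per thread */
static __thread DemoTLS* t_demo_tls = NULL;

/* Lazily binds the pointer to this thread's storage; no heap allocation. */
static inline DemoTLS* demo_get_tls(void) {
    if (!t_demo_tls) t_demo_tls = &t_demo_storage;
    return t_demo_tls;
}
```

Keeping the `NULL`-initialized pointer, as in the declaration above, leaves room to switch to a separately mapped TLS block later without changing callers.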

#### Refill Policy

```c
TinySlab* refill_slab(int class_idx) {
    TinyTLS* tls = get_tls();

    // 1. Try TLS queue
    if (tls->slab_queue[class_idx].head) {
        return pop_slab_queue(tls, class_idx);
    }

    // 2. Allocate new SuperSlab (2MB aligned)
    SuperSlab* ss = allocate_superslab(class_idx);
    if (ss) {
        // Initialize first slab
        return &ss->slabs[0];
    }

    // 3. Steal from other threads (last resort)
    return steal_slab(class_idx);
}
```
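
`pop_slab_queue` is referenced above without a definition. A minimal LIFO sketch follows, assuming each slab descriptor carries an intrusive `next` link (a field the structs above do not show). `DemoSlab`, `DemoQueue`, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoSlab {
    struct DemoSlab* next;   /* assumed intrusive link between queued slabs */
    void*            freelist;
} DemoSlab;

typedef struct DemoQueue {
    DemoSlab* head;
    int       count;
} DemoQueue;

/* Push a slab on top of the queue (most recently used slab first). */
static inline void demo_push_slab(DemoQueue* q, DemoSlab* s) {
    assert(s != NULL);
    s->next = q->head;
    q->head = s;
    q->count++;
}

/* Pop the most recently pushed slab, or NULL when the queue is empty. */
static inline DemoSlab* demo_pop_slab(DemoQueue* q) {
    DemoSlab* s = q->head;
    if (!s) return NULL;
    q->head = s->next;
    s->next = NULL;
    q->count--;
    return s;
}
```

Since only the owning thread touches its own queue, neither function needs locks or atomics. LIFO order also tends to hand back the slab whose metadata is most likely still in cache.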

---

### 3. Direct Access Table (Bonus Optimization)

#### Faster Size → Class Mapping

```c
// Before: Lookup table (1 extra load), indexed by the size rounded up to 8B
static const uint8_t SIZE_TO_CLASS[9] = { 0, 0, 1, 2, 3, 4, 5, 6, 7 };
int class_idx = SIZE_TO_CLASS[(size + 7) >> 3];  // valid for size <= 64

// After: direct computation (no LUT)
// Size classes: 8, 16, 24, 32, 40, 48, 56, 64 bytes
static inline int size_to_class_direct(size_t size) {
    // Rounds up to the 8B boundary; valid for 1 <= size <= 64
    return (int)((size + 7) >> 3) - 1;  // 8→0, 16→1, 24→2, ...
}
```
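
As a quick sanity check on the mapping, the sketch below adds explicit range handling and the inverse mapping (class → block size). `demo_size_to_class` and `demo_class_to_size` are hypothetical names; the 8-byte spacing and the 8-64B range come from the size-class list above.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TINY_MAX 64

/* Maps 1..64 bytes to classes 0..7; returns -1 outside the Tiny range. */
static inline int demo_size_to_class(size_t size) {
    if (size == 0 || size > DEMO_TINY_MAX) return -1;
    return (int)((size + 7) >> 3) - 1;
}

/* Inverse mapping: block size served by a class (0 -> 8B, 7 -> 64B). */
static inline size_t demo_class_to_size(int class_idx) {
    assert(class_idx >= 0 && class_idx < 8);
    return ((size_t)class_idx + 1) << 3;
}
```

A 17-byte request lands in class 2 and is served from 24-byte blocks, so the round trip `demo_class_to_size(demo_size_to_class(n))` always yields a block at least `n` bytes large within the Tiny range.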

---

## 📊 Implementation Roadmap

### Phase 1: SuperSlab Foundation (Day 1)

- [ ] Create `hakmem_tiny_superslab.h`
  - SuperSlab struct definition
  - Inline functions (ptr_to_superslab, ptr_to_slab_index)

- [ ] Create `hakmem_tiny_superslab.c`
  - SuperSlab allocation (mmap 2MB aligned)
  - SuperSlab initialization
  - Bitmap management

- [ ] Modify `hakmem_tiny.c`
  - Remove the registry
  - Migrate to the SuperSlab scheme
  - Use ptr_to_superslab()

### Phase 2: Per-thread Queues (Days 1-2)

- [ ] Extend the TinyTLS struct
  - Add slab_queue
  - active_slab array

- [ ] Remove the global freelist
  - Remove g_tiny_freelist[]
  - Remove g_tiny_locks[]

- [ ] Implement the refill policy
  - Fetch from the TLS queue
  - Allocate a new SuperSlab
  - Steal mechanism (optional)

### Phase 3: Optimization + Verification (Days 2-3)

- [ ] Direct Access Table
  - Optimize SIZE_TO_CLASS

- [ ] Block Index Shift
  - Replace division with shift

- [ ] Build + test
  - Confirm it compiles
  - Basic functional tests

- [ ] Benchmark
  - Measure Tiny 1T/4T
  - Compare against Phase 6.21

---

## 🎯 Success Criteria

### Must Have

- ✅ SuperSlab working (2MB aligned allocation)
- ✅ Registry removed
- ✅ Per-thread queues working
- ✅ Global lock removed
- ✅ Tiny 1T: **+20%** or more
- ✅ Tiny 4T: **+15%** or more

### Should Have

- ✅ Tiny 1T: **+25-30%**
- ✅ Tiny 4T: **+20-30%**
- ✅ vs mimalloc 1T: **85%** or higher
- ✅ vs mimalloc 4T: **85%** or higher

### Nice to Have

- ✅ Direct Access Table implemented
- ✅ Block Index Shift implemented
- ✅ Steal mechanism implemented

---

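## 🚧 Risk Management

### Risk 1: 2MB Alignment Failure

**Mitigation**: `posix_memalign`, or `mmap` with `MAP_ALIGNED_SUPER` (a FreeBSD-specific flag; on Linux, over-allocate and trim to a 2MB boundary)

### Risk 2: Increased Memory Overhead

**Impact**: 2MB alignment → up to 2MB wasted
**Mitigation**: Monitor SuperSlab utilization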
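
Utilization monitoring could start from the `slab_bitmap` field in the SuperSlab header. A minimal sketch, assuming one bit per active slab as described there; the function name is hypothetical, and `__builtin_popcount` is a GCC/Clang builtin rather than standard C.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SLABS_PER_SUPERSLAB 32

/* Percentage (0-100, rounded down) of slabs marked active in the bitmap. */
static inline unsigned demo_superslab_utilization(uint32_t slab_bitmap) {
    unsigned active = (unsigned)__builtin_popcount(slab_bitmap);
    assert(active <= DEMO_SLABS_PER_SUPERSLAB);
    return active * 100u / DEMO_SLABS_PER_SUPERSLAB;
}
```

A SuperSlab with only one active slab reports 3%, which makes mostly idle 2MB regions easy to spot in a periodic stats dump.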

### Risk 3: Per-thread Queue Exhaustion

**Mitigation**: Steal mechanism + fallback to a global pool

---

## 📝 Implementation Notes

### mmap 2MB Aligned Allocation

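```c
#include <stdint.h>
#include <sys/mman.h>

void* allocate_superslab_aligned(void) {
#ifdef MAP_ALIGNED_SUPER  // FreeBSD only; not defined on Linux
    void* ptr = mmap(NULL, SUPERSLAB_SIZE,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_ALIGNED_SUPER,
                     -1, 0);
    if (ptr != MAP_FAILED) return ptr;
#endif
    // Fallback: manual alignment (map 2x, then trim to a 2MB-aligned window)
    char* raw = mmap(NULL, SUPERSLAB_SIZE * 2, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;
    char* aligned = (char*)(((uintptr_t)raw + SUPERSLAB_MASK) & ~(uintptr_t)SUPERSLAB_MASK);
    size_t head = (size_t)(aligned - raw);
    if (head) munmap(raw, head);                              // unaligned prefix
    munmap(aligned + SUPERSLAB_SIZE, SUPERSLAB_SIZE - head);  // unused suffix
    return aligned;
}
```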

### Compatibility with Existing Code

- The `hakmem_tiny.h` API stays unchanged
- `hak_tiny_alloc()` and `hak_tiny_free()` remain compatible
- Only the internal implementation changes

---

## 🎓 Learning from mimalloc

### Key Takeaways

1. **Memory is cheap, computation is expensive**
   - Accept up to 2MB of padding to cut computation cost

2. **Cache locality > Algorithm complexity**
   - Hash O(1) < Direct access with good locality

3. **Lock-free > Fine-grained locking**
   - Go lock-free with per-thread queues

4. **Alignment is power**
   - Power-of-2 alignment enables AND/shift optimizations

---

**Created**: 2025-10-24 11:35 JST
**Status**: 🚀 **Ready to implement**
**Next action**: Start Phase 1 implementation (SuperSlab foundation)