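Phase 6.22-23: SuperSlab + Per-thread Queues Integrated Implementation
Date: 2025-10-24
Goal: Speed up the Tiny Pool based on the mimalloc analysis results (targeting +25-35%)
Duration: 2-3 days (estimated)
Status: 🚀 Implementation started
🎯 Goals
Performance Targets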
| Benchmark | Current | Target | Improvement |
|---|---|---|---|
| Tiny 1T | 21.0 M/s | 27-30 M/s | +25-35% |
| Tiny 4T | 53.5 M/s | 65-72 M/s | +22-35% |
| vs mimalloc 1T | 62% | 85-90% | +23-28pp |
| vs mimalloc 4T | 70% | 85-95% | +15-25pp |
Implementation Goals
- ✅ Adopt the SuperSlab scheme (2MB aligned)
  - Remove the registry hash
  - Page lookup with a single AND operation (O(1) guaranteed)
  - Expected gain: +10-15%
- ✅ Per-thread Page Queues
  - Remove the global freelist + mutex
  - Move to TLS per-thread queues
  - Reduce lock contention
  - Expected gain: +15-20%
- ✅ Direct Access Table (bonus)
  - Faster size → class mapping
  - Expected gain: +5-10%
🧠 mimalloc Analysis Results (by Task-sensei)
mimalloc's Fast Path (7 instructions)
- Get the TLS heap: a single __thread pointer (1 load)
- Bin lookup: direct access to pages_free_direct[wsize] (1 load)
- Freelist pop: page->free → block->next (2 loads)
- Update the used counter: page->used++ (1 inc)
- Return: return the block pointer
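The five steps above can be sketched in C as follows. This is a simplified illustration under stated assumptions, not mimalloc's actual source: the types `block_t`, `page_t`, and `heap_t`, the bound `WSIZE_MAX`, and the function `fast_alloc` are hypothetical names chosen only to mirror the fields mentioned in the list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types: NOT mimalloc's real definitions. */
typedef struct block_s { struct block_s* next; } block_t;

typedef struct page_s {
    block_t* free;   /* head of the page-local free list */
    unsigned used;   /* number of blocks handed out */
} page_t;

#define WSIZE_MAX 128  /* assumed upper bound on word-size bins */

typedef struct heap_s {
    page_t* pages_free_direct[WSIZE_MAX + 1];  /* one page per word size */
} heap_t;

/* A free block only needs room for its own link. */
static_assert(sizeof(block_t) == sizeof(void*), "intrusive link is one word");

static _Thread_local heap_t* t_heap;  /* step 1: a single TLS pointer */

/* Returns NULL when the fast path cannot serve the request
 * (a real allocator would fall through to a slow path here). */
static void* fast_alloc(size_t size) {
    size_t wsize = (size + sizeof(void*) - 1) / sizeof(void*);
    if (t_heap == NULL || wsize > WSIZE_MAX) return NULL;
    page_t* page = t_heap->pages_free_direct[wsize];  /* step 2: bin lookup */
    if (page == NULL || page->free == NULL) return NULL;
    block_t* block = page->free;                      /* step 3: freelist pop */
    page->free = block->next;
    page->used++;                                     /* step 4: used counter */
    return block;                                     /* step 5: return */
}
```

The NULL checks are extra branches that the instruction count in the list does not include; they are kept here so the sketch is safe to run in isolation.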
Key Differences from hakmem
| Item | mimalloc | hakmem (Current) | Gap |
|---|---|---|---|
| Page lookup | Segment mask (one AND) | Registry hash | O(1), but incurs cache misses |
| Slab management | Per-thread queues | Global lists + lock | Lock contention |
| Size → class | Direct array | Lookup table | One extra load |
| Block index | Shift | Division | Division cost |
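To make the "Block index" row concrete, here is a minimal sketch comparing the two ways of computing a block's index inside a slab. The function names are hypothetical. Note the limitation: a pure shift only works when the block size is a power of two (8, 16, 32, 64); for classes such as 24 or 48 bytes some other technique (division, or multiplication by a precomputed reciprocal) would still be needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Division-based index: works for any non-zero block size. */
static inline uint32_t block_index_div(uintptr_t slab_base, const void* p,
                                       uint32_t block_size) {
    assert(block_size != 0);
    return (uint32_t)(((uintptr_t)p - slab_base) / block_size);
}

/* Shift-based index: valid only when block_size == 1u << shift. */
static inline uint32_t block_index_shift(uintptr_t slab_base, const void* p,
                                         unsigned shift) {
    return (uint32_t)(((uintptr_t)p - slab_base) >> shift);
}
```

For a 16-byte class, `block_index_shift(base, p, 4)` and `block_index_div(base, p, 16)` give the same result for every pointer inside the slab.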
Biggest Difference: Segment-Aligned Allocation
// mimalloc (simplified): very fast page lookup (a single AND)
#define MI_SEGMENT_MASK (MI_SEGMENT_SIZE - 1) // 32MB - 1
segment = (mi_segment_t*)((uintptr_t)p & ~MI_SEGMENT_MASK);
page = segment->pages[offset];
// hakmem (Current): registry hash (cache misses + collision risk)
uint32_t h = hash(page_base);
TinySlabDesc* d = registry[h]; // O(1), but the cache misses hurt in practice
🏗️ Implementation Design
1. SuperSlab Structure (2MB aligned)
Design Rationale
- Alignment: 2MB (0x200000) aligned allocation
- Size: 2MB per SuperSlab
- Capacity: 2MB ÷ 64KB slab = 32 slabs per SuperSlab
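- Metadata: SuperSlab header at the base of each SuperSlab, cache-line (64B) aligned. The header does not fit in a single 64B line: on a 64-bit target the fixed fields take 16B and the 32 per-slab entries take 32 × 16B = 512B, so the total is 528B, padded to 576B by the alignment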
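The sizing arithmetic above can be checked at compile time. The sketch below assumes the eight size classes are 8, 16, ..., 64 bytes and ignores any space taken by metadata; `SLABS_PER_SUPERSLAB` and `blocks_per_slab` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SUPERSLAB_SIZE ((size_t)2 * 1024 * 1024)  /* 2MB */
#define SLAB_SIZE      ((size_t)64 * 1024)        /* 64KB */
#define SLABS_PER_SUPERSLAB (SUPERSLAB_SIZE / SLAB_SIZE)

static_assert((SUPERSLAB_SIZE & (SUPERSLAB_SIZE - 1)) == 0,
              "SuperSlab size must be a power of two for AND masking");
static_assert(SLABS_PER_SUPERSLAB == 32, "2MB / 64KB = 32 slabs");

/* Blocks that fit in one slab for size class c (block size 8 * (c + 1)),
 * ignoring any space reserved for metadata. */
static inline size_t blocks_per_slab(int class_idx) {
    assert(class_idx >= 0 && class_idx < 8);
    return SLAB_SIZE / ((size_t)8 * (size_t)(class_idx + 1));
}
```

The largest count is 8192 blocks (the 8-byte class), which fits in a 16-bit counter; the 64-byte class holds 1024 blocks per slab.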
Metadata Structure
// SuperSlab: 2MB aligned container for 32 x 64KB slabs
#include <stdint.h>

// Per-slab metadata (16B on 64-bit targets)
typedef struct TinySlabMeta {
    void* freelist;      // Per-slab freelist head
    uint16_t used;       // Blocks used
    uint16_t capacity;   // Total blocks
    uint32_t owner_tid;  // Owning thread ID, read by the free fast path (width is an assumption)
} TinySlabMeta;

typedef struct SuperSlab {
    // Header fields (16B)
    uint64_t magic;        // Magic tag (0xHAKMEM_SUPERSLAB is a placeholder, not a valid hex literal)
    uint8_t size_class;    // 0-7 (8-64B)
    uint8_t active_slabs;  // Number of active slabs (0-32)
    uint16_t _pad;
    uint32_t slab_bitmap;  // 32-bit bitmap (1=active, 0=free)
    // Per-slab metadata (32 entries, 512B)
    TinySlabMeta slabs[32];
    // No explicit tail padding: aligned(64) rounds the 528B of fields up to 576B
} __attribute__((aligned(64))) SuperSlab;

// Compile-time checks (SUPERSLAB_SIZE must be defined before this point)
_Static_assert(sizeof(SuperSlab) % 64 == 0, "SuperSlab metadata must be a multiple of 64B");
_Static_assert(SUPERSLAB_SIZE == 2*1024*1024, "2MB alignment required");
Pointer → SuperSlab Lookup (mimalloc style)
#define SUPERSLAB_SIZE (2 * 1024 * 1024) // 2MB
#define SUPERSLAB_MASK ((uintptr_t)SUPERSLAB_SIZE - 1)
#define SLAB_SIZE (64 * 1024) // 64KB
// O(1) SuperSlab lookup (a single AND)
static inline SuperSlab* ptr_to_superslab(void* p) {
    return (SuperSlab*)((uintptr_t)p & ~SUPERSLAB_MASK);
}
// Slab index computation (shift)
static inline int ptr_to_slab_index(void* p) {
    uintptr_t offset = (uintptr_t)p & SUPERSLAB_MASK;
    return (int)(offset >> 16); // ÷ 64KB (2^16)
}
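As a quick sanity check of the masking scheme, the following sketch obtains a 2MB-aligned region with C11 `aligned_alloc` and verifies that an interior pointer maps back to the region's base and to the expected slab index. The helper names (`superslab_base`, `slab_index`, `check_lookup`) are hypothetical, and `aligned_alloc` stands in for whatever aligned allocation the real implementation uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SUPERSLAB_SIZE ((uintptr_t)2 * 1024 * 1024)
#define SUPERSLAB_MASK (SUPERSLAB_SIZE - 1)
#define SLAB_SHIFT 16  /* 64KB slabs */

/* Returns the 2MB-aligned base address of the region containing p. */
static inline uintptr_t superslab_base(const void* p) {
    return (uintptr_t)p & ~SUPERSLAB_MASK;
}

/* Returns the index (0-31) of the 64KB slab containing p. */
static inline int slab_index(const void* p) {
    return (int)(((uintptr_t)p & SUPERSLAB_MASK) >> SLAB_SHIFT);
}

/* Checks the lookup for one interior offset; returns 1 on success. */
static int check_lookup(size_t offset) {
    assert(offset < SUPERSLAB_SIZE);
    char* base = aligned_alloc(SUPERSLAB_SIZE, SUPERSLAB_SIZE);
    if (base == NULL) return 0;
    const char* p = base + offset;
    int ok = superslab_base(p) == (uintptr_t)base &&
             slab_index(p) == (int)(offset >> SLAB_SHIFT);
    free(base);
    return ok;
}
```

The lookup only works because every SuperSlab starts on a 2MB boundary; a pointer into memory that was not allocated this way would be mapped to an unrelated address.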
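Allocation Flow
// Precondition: 1 <= size <= 64 (larger requests take a different path)
void* tiny_alloc_fast(size_t size) {
    // Round up to the next 8B step; SIZE_TO_CLASS needs 9 entries (indices 0-8)
    int class_idx = SIZE_TO_CLASS[(size + 7) >> 3]; // Direct access
    TinyTLS* tls = get_tls();
    // Per-thread active slab
    TinySlab* active = tls->active_slab[class_idx];
    if (!active || !active->freelist) {
        active = refill_slab(class_idx); // Fetch from the TLS queue
        if (!active) return NULL;        // Refill failed (out of memory)
        tls->active_slab[class_idx] = active; // Cache it (harmless if refill_slab already did)
    }
    // Pop from freelist
    void* block = active->freelist;
    active->freelist = *(void**)block;
    active->used++;
    return block;
}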
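The pop in the allocation flow relies on an intrusive free list: each free block stores the pointer to the next free block in its own first word. A minimal sketch of how such a list could be threaded through a fresh slab and then consumed is shown below; `SlabMeta`, `slab_init_freelist`, and `slab_pop` are hypothetical names, and the real initialization code may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    void*    freelist;
    uint16_t used;
    uint16_t capacity;
} SlabMeta;  /* hypothetical stand-in for the per-slab metadata */

/* Threads an intrusive singly linked free list through `mem`.
 * `mem` must be aligned for pointer storage. */
static void slab_init_freelist(SlabMeta* m, void* mem, size_t mem_size,
                               size_t block_size) {
    assert(block_size >= sizeof(void*) && block_size % sizeof(void*) == 0);
    size_t n = mem_size / block_size;
    assert(n <= UINT16_MAX);
    char* base = mem;
    void* head = NULL;
    for (size_t i = n; i-- > 0;) {   /* build back to front */
        void* block = base + i * block_size;
        *(void**)block = head;
        head = block;
    }
    m->freelist = head;
    m->used = 0;
    m->capacity = (uint16_t)n;
}

/* Pops one block, or returns NULL when the slab is exhausted. */
static void* slab_pop(SlabMeta* m) {
    void* block = m->freelist;
    if (block == NULL) return NULL;
    m->freelist = *(void**)block;
    m->used++;
    return block;
}
```

Building the list back to front means blocks are handed out in ascending address order, which tends to be friendlier to the cache and the hardware prefetcher than a reversed list.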
Free Flow (fast path: same-thread)
void tiny_free_fast(void* p) {
    // SuperSlab lookup (a single AND)
    SuperSlab* ss = ptr_to_superslab(p);
    int slab_idx = ptr_to_slab_index(p);
    // Fetch the slab metadata
    TinySlabMeta* meta = &ss->slabs[slab_idx];
    // Same-thread check (compare against the TLS thread ID);
    // requires an owner_tid field in the per-slab metadata
    if (meta->owner_tid == get_tid()) {
        // Fast path: Push to TLS freelist
        *(void**)p = meta->freelist;
        meta->freelist = p;
        meta->used--;
    } else {
        // Slow path: Remote free (lock or lock-free queue)
        remote_free(ss, slab_idx, p);
    }
}
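The slow path above leaves `remote_free` open ("lock or lock-free queue"). One possible lock-free variant is a push-only stack built on C11 atomics, which the owning thread later drains in a single exchange. The sketch below is an assumption about how this could look, not the project's implementation; `RemoteList`, `remote_push`, and `remote_drain` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    _Atomic(void*) remote_head;  /* blocks freed by non-owner threads */
} RemoteList;  /* hypothetical per-slab remote-free list */

/* Called by any thread: lock-free push onto the remote list. */
static void remote_push(RemoteList* r, void* block) {
    assert(block != NULL);
    void* old = atomic_load_explicit(&r->remote_head, memory_order_relaxed);
    do {
        *(void**)block = old;  /* link before publishing */
    } while (!atomic_compare_exchange_weak_explicit(
        &r->remote_head, &old, block,
        memory_order_release, memory_order_relaxed));
}

/* Called by the owner thread only: detach the whole list at once.
 * The returned blocks are linked through their first word. */
static void* remote_drain(RemoteList* r) {
    return atomic_exchange_explicit(&r->remote_head, NULL,
                                    memory_order_acquire);
}
```

Because blocks are only ever pushed individually and removed all at once, this design sidesteps the ABA problem that a lock-free single-element pop would have to handle. The owner can splice the drained chain onto its local freelist the next time the slab runs empty.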
2. Per-thread Page Queues
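Design Rationale
- Global freelist removed: reduces lock contention
- TLS queues: each thread keeps its own slab queue
- Refill policy: empty queue → allocate a new SuperSlab, then steal from another thread's queue as a last resort
TLS Structure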
typedef struct TinyTLS {
    // Active slabs (1 per size class)
    TinySlab* active_slab[TINY_NUM_CLASSES];
    // Per-class slab queues (LIFO)
    struct {
        TinySlab* head;
        int count;
    } slab_queue[TINY_NUM_CLASSES];
    // Statistics
    uint64_t allocs[TINY_NUM_CLASSES];
    uint64_t frees[TINY_NUM_CLASSES];
} __attribute__((aligned(64))) TinyTLS;
static __thread TinyTLS* t_tiny_tls = NULL;
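Since the thread-local pointer starts out as NULL, `get_tls()` has to create the structure on first use. A minimal sketch of that lazy initialization follows. It is an assumption, not the project's code: it uses `calloc` for brevity (a real allocator that replaces `malloc` would need `mmap` or a bootstrap arena to avoid recursion), it omits the 64-byte alignment attribute because `calloc` does not guarantee it, and it omits cleanup at thread exit. `tls_count_alloc` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TINY_NUM_CLASSES 8

typedef struct TinySlab TinySlab;  /* opaque in this sketch */

typedef struct {
    TinySlab* active_slab[TINY_NUM_CLASSES];
    struct { TinySlab* head; int count; } slab_queue[TINY_NUM_CLASSES];
    uint64_t allocs[TINY_NUM_CLASSES];
    uint64_t frees[TINY_NUM_CLASSES];
} TinyTLS;

static _Thread_local TinyTLS* t_tiny_tls = NULL;

/* Returns this thread's TinyTLS, creating it zero-initialized on first use.
 * Returns NULL if the allocation fails. */
static TinyTLS* get_tls(void) {
    TinyTLS* tls = t_tiny_tls;
    if (tls == NULL) {
        tls = calloc(1, sizeof *tls);
        t_tiny_tls = tls;
    }
    return tls;
}

/* Records one allocation in the per-thread statistics. */
static void tls_count_alloc(TinyTLS* tls, int class_idx) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    tls->allocs[class_idx]++;
}
```

After the first call, the cost on the hot path is one thread-local load and one well-predicted branch.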
Refill Policy
TinySlab* refill_slab(int class_idx) {
    TinyTLS* tls = get_tls();
    // 1. Try TLS queue
    if (tls->slab_queue[class_idx].head) {
        return pop_slab_queue(tls, class_idx);
    }
    // 2. Allocate new SuperSlab (2MB aligned)
    SuperSlab* ss = allocate_superslab(class_idx);
    if (ss) {
        // Initialize first slab
        // (assumes TinySlab is the per-slab metadata type stored in ss->slabs)
        return &ss->slabs[0];
    }
    // 3. Steal from other threads (last resort); may return NULL
    return steal_slab(class_idx);
}
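Step 1 of the refill policy depends on `pop_slab_queue`, which is not shown. Assuming each slab carries an intrusive `next` link (an assumption; the structures above do not show one), the per-class LIFO queue could be sketched like this. `push_slab_queue` is a hypothetical counterpart added for completeness.

```c
#include <assert.h>
#include <stddef.h>

#define TINY_NUM_CLASSES 8

typedef struct TinySlab {
    struct TinySlab* next;  /* assumed intrusive link used by the queue */
} TinySlab;

typedef struct {
    struct { TinySlab* head; int count; } slab_queue[TINY_NUM_CLASSES];
} TinyTLS;  /* reduced to the queue fields for this sketch */

/* Pushes a slab onto the front of the calling thread's queue. */
static void push_slab_queue(TinyTLS* tls, int class_idx, TinySlab* slab) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    slab->next = tls->slab_queue[class_idx].head;
    tls->slab_queue[class_idx].head = slab;
    tls->slab_queue[class_idx].count++;
}

/* Pops the most recently pushed slab, or returns NULL if the queue is empty. */
static TinySlab* pop_slab_queue(TinyTLS* tls, int class_idx) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    TinySlab* slab = tls->slab_queue[class_idx].head;
    if (slab == NULL) return NULL;
    tls->slab_queue[class_idx].head = slab->next;
    tls->slab_queue[class_idx].count--;
    slab->next = NULL;
    return slab;
}
```

No synchronization is needed because only the owning thread touches its own queue; LIFO order returns the most recently used slab first, which is the one most likely to still be in cache.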
3. Direct Access Table (bonus optimization)
Speeding Up Size → Class
// Before: Lookup table (1 extra load)
// Index is the size rounded up to 8B steps: (size + 7) >> 3, i.e. 0-8 for sizes 0-64
static const uint8_t SIZE_TO_CLASS[9] = { 0, 0, 1, 2, 3, 4, 5, 6, 7 };
int class_idx = SIZE_TO_CLASS[(size + 7) >> 3];
// After: Direct computation (no LUT)
// Size classes: 8, 16, 24, 32, 40, 48, 56, 64 bytes
static inline int size_to_class_direct(size_t size) {
    // Assume size is already rounded up to 8B boundary (8 <= size <= 64)
    return (int)((size >> 3) - 1); // 8→0, 16→1, 24→2, ...
}
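Off-by-one errors are easy to make in this mapping, so it is worth checking every size from 1 to 64 exhaustively. The sketch below folds the round-up into the computation and verifies that each size lands in the smallest class that fits; `size_to_class`, `class_to_size`, and `check_all_sizes` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Size classes: 8, 16, ..., 64 bytes (class c holds blocks of 8 * (c + 1)). */
static inline int size_to_class(size_t size) {
    assert(size >= 1 && size <= 64);
    return (int)((size + 7) >> 3) - 1;  /* round up to 8B, then 8→0, 16→1, ... */
}

static inline size_t class_to_size(int class_idx) {
    return (size_t)8 * (size_t)(class_idx + 1);
}

/* Returns 1 if every size in 1..64 maps to the smallest class that fits. */
static int check_all_sizes(void) {
    for (size_t s = 1; s <= 64; s++) {
        int c = size_to_class(s);
        if (c < 0 || c > 7) return 0;
        if (class_to_size(c) < s) return 0;                /* block too small */
        if (c > 0 && class_to_size(c - 1) >= s) return 0;  /* not the tightest */
    }
    return 1;
}
```

A check like this catches the two typical mistakes: indexing with `size >> 3` without rounding up (9 bytes would land in the 8-byte class) and a table with too few entries for the 64-byte case.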
📊 Implementation Roadmap
Phase 1: SuperSlab Foundation (Day 1)
- Create hakmem_tiny_superslab.h
  - SuperSlab struct definition
  - Inline functions (ptr_to_superslab, ptr_to_slab_index)
- Create hakmem_tiny_superslab.c
  - SuperSlab allocation (mmap 2MB aligned)
  - SuperSlab initialization
  - Bitmap management
- Modify hakmem_tiny.c
  - Remove the registry
  - Migrate to the SuperSlab scheme
  - Use ptr_to_superslab()
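For the bitmap management item, the 32-bit `slab_bitmap` (1 = active, 0 = free) needs three operations: find a free slab, mark it active, and mark it free again. A portable sketch with hypothetical function names is shown below; on GCC or Clang the search loop could be replaced by `__builtin_ctz` on the inverted bitmap.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the index of the lowest clear bit (a free slab),
 * or -1 if all 32 slabs are active. */
static int bitmap_find_free(uint32_t bitmap) {
    uint32_t free_bits = ~bitmap;
    if (free_bits == 0) return -1;
    int idx = 0;
    while ((free_bits & 1u) == 0) {
        free_bits >>= 1;
        idx++;
    }
    return idx;
}

/* Returns the bitmap with slab idx marked active. */
static inline uint32_t bitmap_set(uint32_t bitmap, int idx) {
    assert(idx >= 0 && idx < 32);
    return bitmap | (UINT32_C(1) << idx);
}

/* Returns the bitmap with slab idx marked free. */
static inline uint32_t bitmap_clear(uint32_t bitmap, int idx) {
    assert(idx >= 0 && idx < 32);
    return bitmap & ~(UINT32_C(1) << idx);
}
```

If the SuperSlab header itself occupies the start of slab 0, that slab's bit would have to be reserved or its usable capacity reduced; the sketch leaves that decision to the caller.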
Phase 2: Per-thread Queues (Days 1-2)
- Extend the TinyTLS struct
  - Add slab_queue
  - active_slab array
- Remove the global freelist
  - Remove g_tiny_freelist[]
  - Remove g_tiny_locks[]
- Implement the refill policy
  - Fetch from the TLS queue
  - Allocate a new SuperSlab
  - Steal mechanism (optional)
Phase 3: Optimization + Verification (Days 2-3)
- Direct Access Table
  - Optimize SIZE_TO_CLASS
- Block Index Shift
  - Change division → shift
- Build + test
  - Confirm it compiles
  - Basic functional tests
- Benchmark
  - Measure Tiny 1T/4T
  - Compare against Phase 6.21
🎯 Success Criteria
Must Have
- ✅ SuperSlab working (2MB aligned allocation)
- ✅ Registry removal complete
- ✅ Per-thread queues working
- ✅ Global lock removal complete
- ✅ Tiny 1T: +20% or more
- ✅ Tiny 4T: +15% or more
Should Have
- ✅ Tiny 1T: +25-30%
- ✅ Tiny 4T: +20-30%
- ✅ vs mimalloc 1T: 85% or more
- ✅ vs mimalloc 4T: 85% or more
Nice to Have
- ✅ Direct Access Table implemented
- ✅ Block Index Shift implemented
- ✅ Steal mechanism implemented
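🚧 Risk Management
Risk 1: 2MB Alignment Failure
Mitigation: posix_memalign, or mmap with MAP_ALIGNED_SUPER (a FreeBSD-specific flag; on Linux, over-allocate and trim to a 2MB boundary)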
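The `posix_memalign` option can be sketched in a few lines. This is only an illustration: `superslab_alloc_memalign` is a hypothetical name, and an allocator that itself replaces `malloc` could not call `posix_memalign` without recursing into its own code, so in that setting the mmap-based approach would be the one to use.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SUPERSLAB_SIZE ((size_t)2 * 1024 * 1024)

/* Returns a 2MB-aligned, 2MB-sized region, or NULL on failure.
 * Release it with free(). */
static void* superslab_alloc_memalign(void) {
    void* p = NULL;
    if (posix_memalign(&p, SUPERSLAB_SIZE, SUPERSLAB_SIZE) != 0) return NULL;
    assert(((uintptr_t)p & (SUPERSLAB_SIZE - 1)) == 0);
    return p;
}
```

`posix_memalign` requires the alignment to be a power of two and a multiple of `sizeof(void*)`, both of which hold for 2MB.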
Risk 2: Increased Memory Overhead
Impact: 2MB alignment → up to 2MB wasted
Mitigation: monitor SuperSlab utilization
Risk 3: Per-thread Queue Exhaustion
Mitigation: steal mechanism + fallback to a global pool
📝 Implementation Notes
mmap 2MB Aligned Allocation
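void* allocate_superslab_aligned(void) {
    void* ptr = MAP_FAILED;
#ifdef MAP_ALIGNED_SUPER // FreeBSD only; not defined on Linux
    ptr = mmap(NULL, SUPERSLAB_SIZE,
               PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_ALIGNED_SUPER,
               -1, 0);
#endif
    if (ptr == MAP_FAILED) {
        // Fallback: manual alignment (over-allocate 2x, then trim head and tail)
        char* raw = mmap(NULL, SUPERSLAB_SIZE * 2, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED) return NULL;
        char* base = (char*)(((uintptr_t)raw + SUPERSLAB_MASK) & ~(uintptr_t)SUPERSLAB_MASK);
        size_t head = (size_t)(base - raw);
        if (head) munmap(raw, head);                           // unaligned prefix
        munmap(base + SUPERSLAB_SIZE, SUPERSLAB_SIZE - head);  // unused suffix
        ptr = base;
    }
    return ptr;
}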
Compatibility with Existing Code
- The hakmem_tiny.h API is unchanged
- hak_tiny_alloc() and hak_tiny_free() remain compatible
- Only the internal implementation changes
🎓 Learning from mimalloc
Key Takeaways
- Memory is cheap, computation is expensive
  - Accept up to 2MB of padding to cut computation cost
- Cache locality > Algorithm complexity
  - Hash O(1) < Direct access with good locality
- Lock-free > Fine-grained locking
  - Go lock-free with per-thread queues
- Alignment is power
  - Power-of-2 alignment enables AND/shift optimizations
Created: 2025-10-24 11:35 JST
Status: 🚀 Ready to implement
Next action: Start Phase 1 implementation (SuperSlab foundation)