# Mid Range MT Optimization Design (2025-11-01)
## 📋 Overview
**Goal**: Bring multi-threaded performance in the 8-32KB size range up to mimalloc's level
**Current performance**:
- hakmem: 46-47 M ops/s (mid_large_mt bench)
- mimalloc: 122 M ops/s
- Gap: **-62%** (2.6x slower) ❌
**Target performance**:
- **100-120 M ops/s** (80-100% of mimalloc)
- A 2.2-2.6x improvement
**Implementation window**: 1 week
---
## 🎯 Design Philosophy
### Where the Hybrid Strategy Fits
```
≤1KB   (Tiny Pool)  → static optimization (P0 done, no learning needed)
8-32KB (Mid Range)  → mimalloc-style per-thread segments (this design) ← 🔥 here
≥64KB  (Large Pool) → learning-based ELO strategy selection
```
**Why mimalloc-style only for 8-32KB?**
1. **MT performance is the top priority**: lock contention is the biggest bottleneck here
2. **Learning pays off little**: the size range is narrow, so the gap between strategies is small
3. **Proven in mimalloc**: per-thread heaps are a validated design
4. **No conflict with the learning layer**: sizes ≥64KB keep using the learning layer as-is
### mimalloc's Core Idea
**Free List Sharding**:
```c
// mimalloc's key insight:
// "Each thread owns an independent heap segment"
// → lock-free, cache locality, scalability
thread_local Segment* my_segment; // independent per thread

void* alloc(size_t size) {
    // Fetch directly from TLS (no lock needed!)
    return segment_alloc(my_segment, size);
}
```
---
## 🏗️ Architecture
### Overall Structure
```
┌─────────────────────────────────────────────────────────┐
│ malloc(size) - 8KB ≤ size ≤ 32KB                        │
└───────────────────────┬─────────────────────────────────┘
            ┌──────────────────────────┐
            │ size_to_class(size)      │ ← size class decision
            │   8KB  → class 0         │
            │   16KB → class 1         │
            │   32KB → class 2         │
            └──────────┬───────────────┘
       ┌───────────────────────────────────────┐
       │ g_mid_segments[class_idx]             │ ← __thread TLS
       │                                       │
       │ MidThreadSegment {                    │
       │   void* free_list;   ← Free objects   │
       │   void* current;     ← Bump pointer   │
       │   void* end;         ← Segment end    │
       │   size_t chunk_size; ← 64KB           │
       │   uint32_t used;                      │
       │   uint32_t capacity;                  │
       │ }                                     │
       └───────────┬───────────────────────────┘
           ├─── fast path: pop from free_list
           │    (the common case)
           ├─── bump path: bump from current
           │    (when free_list is empty)
           └─── slow path: allocate a new segment
                (when current == end)
```
### Size Class Definitions
```c
// Three classes: 8KB, 16KB, 32KB
#define MID_SIZE_CLASS_8K  0  // 8KB
#define MID_SIZE_CLASS_16K 1  // 16KB
#define MID_SIZE_CLASS_32K 2  // 32KB
#define MID_NUM_CLASSES    3

// Size → class conversion
static inline int mid_size_to_class(size_t size) {
    if (size <= 8192)  return MID_SIZE_CLASS_8K;
    if (size <= 16384) return MID_SIZE_CLASS_16K;
    return MID_SIZE_CLASS_32K;
}

// Class → block size
static inline size_t mid_class_to_size(int class_idx) {
    static const size_t sizes[MID_NUM_CLASSES] = {
        8192,  // 8KB
        16384, // 16KB
        32768  // 32KB
    };
    return sizes[class_idx];
}
```
---
## 📊 Data Structures
### MidThreadSegment (Core Data Structure)
```c
// core/hakmem_mid_mt.h
typedef struct MidThreadSegment {
    // === Fast Path (L1 cache line 0) ===
    void* free_list;       // Free objects linked list (NULL if empty)
    void* current;         // Bump allocation pointer
    void* end;             // End of current chunk
    uint32_t used_count;   // Number of allocated blocks
    // === Metadata (L1 cache line 1) ===
    void* chunk_base;      // Base address of current chunk
    size_t chunk_size;     // Size of chunk (64KB)
    size_t block_size;     // Size of each block (8KB/16KB/32KB)
    uint32_t capacity;     // Total blocks in chunk
    // === Statistics (L1 cache line 2) ===
    uint64_t alloc_count;  // Total allocations
    uint64_t free_count;   // Total frees
    uint32_t refill_count; // Number of chunk refills
    uint32_t padding;
} __attribute__((aligned(64))) MidThreadSegment;

// TLS: each thread owns its own independent segments
extern __thread MidThreadSegment g_mid_segments[MID_NUM_CLASSES];
```
**Design points**:
1. **64-byte alignment**: placed on L1 cache line boundaries (avoids false sharing; a compile-time check is sketched below)
2. **Fast path first**: `free_list`, `current`, and `end` lead the struct
3. **Statistics separated**: counters not needed on the hot path go at the end
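As a sanity check on points 1-2, a few `_Static_assert`s can pin the layout down at compile time. A minimal sketch, assuming the struct definition above is in scope:

```c
#include <stddef.h>  // offsetof

// Compile-time layout checks for MidThreadSegment (sketch).
_Static_assert(offsetof(MidThreadSegment, free_list) == 0,
               "free_list must lead the struct (fast path first)");
_Static_assert(offsetof(MidThreadSegment, used_count) < 64,
               "fast-path fields must fit in cache line 0");
_Static_assert(sizeof(MidThreadSegment) % 64 == 0,
               "aligned(64) should round the struct to whole cache lines");
```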
### Segment Registry (Recovering the Size on free)
```c
// Registry that lets free(ptr) recover the allocation size
typedef struct MidSegmentRegistry {
    void* base;        // Segment base address
    size_t block_size; // Block size (8KB/16KB/32KB)
    int class_idx;     // Size class index
} MidSegmentRegistry;

// Global registry (lock-protected)
typedef struct MidGlobalRegistry {
    MidSegmentRegistry* entries; // Dynamic array
    uint32_t count;              // Number of entries
    uint32_t capacity;           // Array capacity
    pthread_mutex_t lock;        // Registry lock
} MidGlobalRegistry;

extern MidGlobalRegistry g_mid_registry;

// Registry operations
void mid_registry_add(void* base, size_t block_size, int class_idx);
int mid_registry_lookup(void* ptr, size_t* out_block_size, int* out_class_idx);
void mid_registry_remove(void* base);
```
**Design points**:
1. **Minimal lock scope**: lock only around registry operations; alloc/free stay lock-free via TLS
2. **O(log N) lookup**: entries sorted by address, binary search (an insertion sketch follows this list)
3. **Lazy removal**: remove only when a segment becomes completely empty (infrequent)
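A minimal sketch of `mid_registry_add` that keeps `entries` address-sorted so the binary search in `registry_find_segment` stays valid. It assumes array growth is handled elsewhere; the linear shift is acceptable at registry frequencies:

```c
void mid_registry_add(void* base, size_t block_size, int class_idx) {
    pthread_mutex_lock(&g_mid_registry.lock);
    // Shift larger entries right to keep the array address-sorted.
    // (Assumes entries[] already has room; growth handled elsewhere.)
    uint32_t i = g_mid_registry.count;
    while (i > 0 && g_mid_registry.entries[i - 1].base > base) {
        g_mid_registry.entries[i] = g_mid_registry.entries[i - 1];
        i--;
    }
    g_mid_registry.entries[i] = (MidSegmentRegistry){
        .base = base, .block_size = block_size, .class_idx = class_idx
    };
    g_mid_registry.count++;
    pthread_mutex_unlock(&g_mid_registry.lock);
}
```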
---
## 🚀 API Specification
### Allocation API
```c
// core/hakmem_mid_mt.h
/**
* Mid Range MT allocation (8-32KB)
*
* @param size Allocation size (must be 8KB ≤ size ≤ 32KB)
* @return Allocated pointer (aligned to block_size), or NULL on failure
*
* Thread-safety: Lock-free (uses TLS)
* Performance: O(1) fast path, O(1) amortized
*/
void* mid_mt_alloc(size_t size);
/**
* Mid Range MT free
*
* @param ptr Pointer to free (must be from mid_mt_alloc)
* @param size Original allocation size (for class lookup)
*
* Thread-safety: Lock-free if freeing to own thread's segment
* Atomic CAS if remote free (cross-thread)
* Performance: O(1) local free, O(1) remote free
*/
void mid_mt_free(void* ptr, size_t size);
/**
* Initialize Mid Range MT allocator
*
* Call once at startup (thread-safe)
*/
void mid_mt_init(void);
/**
* Cleanup thread-local segments (called on thread exit)
*/
void mid_mt_thread_exit(void);
```
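A hypothetical caller, just to show the intended call pattern (the size passed to `mid_mt_free` must match the original request, per the contract above):

```c
#include "hakmem_mid_mt.h"

void example(void) {
    mid_mt_init();                 // once at startup (thread-safe per the spec)
    void* p = mid_mt_alloc(16384); // 16KB → MID_SIZE_CLASS_16K
    if (p) {
        mid_mt_free(p, 16384);     // caller supplies the original size
    }
    mid_mt_thread_exit();          // on thread shutdown
}
```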
### Internal Helpers
```c
// Segment operations
static void* segment_alloc(MidThreadSegment* seg, int class_idx);
static void segment_free_local(MidThreadSegment* seg, void* ptr);
static void segment_free_remote(void* ptr, size_t block_size, int class_idx);
static bool segment_refill(MidThreadSegment* seg, int class_idx);

// Chunk management
static void* chunk_allocate(size_t chunk_size);
static void chunk_deallocate(void* chunk, size_t chunk_size);

// Registry
static void registry_register(void* base, size_t block_size, int class_idx);
static bool registry_find_segment(void* ptr, size_t* block_size, int* class_idx);
```
---
## 🔍 Implementation Details
### Fast Path: segment_alloc
```c
// core/hakmem_mid_mt.c
// Takes class_idx so the refill path knows which class to fill.
static inline void* segment_alloc(MidThreadSegment* seg, int class_idx) {
    // === Path 1: Free list (fastest, ~4-5 instructions) ===
    void* p = seg->free_list;
    if (likely(p != NULL)) {
        seg->free_list = *(void**)p; // Pop from free list
        seg->used_count++;
        return p;
    }
    // === Path 2: Bump allocation (fast, ~6-8 instructions) ===
    p = seg->current;
    void* next = (uint8_t*)p + seg->block_size;
    if (likely(next <= seg->end)) {
        seg->current = next;
        seg->used_count++;
        return p;
    }
    // === Path 3: Refill (slow, called ~once per 64KB) ===
    if (!segment_refill(seg, class_idx)) {
        return NULL; // OOM
    }
    // Retry after refill
    p = seg->current;
    seg->current = (uint8_t*)p + seg->block_size;
    seg->used_count++;
    return p;
}
```
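The `likely()` hint above assumes the usual GCC/Clang builtin wrappers; if hakmem does not already define them, a common form is:

```c
// Branch-prediction hints (common GCC/Clang idiom).
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
```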
**Performance prediction**:
- Path 1 (free list): ~99% of cases, ~4-5 instructions
- Path 2 (bump): ~0.9% of cases, ~6-8 instructions
- Path 3 (refill): ~0.1% of cases, ~1000 instructions (low amortized cost)
### Segment Refill
```c
static bool segment_refill(MidThreadSegment* seg, int class_idx) {
    size_t block_size = mid_class_to_size(class_idx);
    size_t chunk_size = 64 * 1024; // 64KB chunks

    // Allocate new chunk via mmap
    void* chunk = chunk_allocate(chunk_size);
    if (!chunk) return false;

    // Register chunk in global registry (for free path)
    registry_register(chunk, block_size, class_idx);

    // Setup segment
    seg->chunk_base = chunk;
    seg->chunk_size = chunk_size;
    seg->block_size = block_size;
    seg->current = chunk;
    seg->end = (uint8_t*)chunk + chunk_size;
    seg->capacity = chunk_size / block_size;
    seg->refill_count++;
    return true;
}
```
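`chunk_allocate` is only declared in this document; a minimal mmap-backed sketch consistent with the refill path above (no chunk caching layer yet, which Phase 2 could add):

```c
#include <sys/mman.h>

// Allocate a chunk straight from the OS (sketch).
static void* chunk_allocate(size_t chunk_size) {
    void* p = mmap(NULL, chunk_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

// Return a chunk to the OS.
static void chunk_deallocate(void* chunk, size_t chunk_size) {
    munmap(chunk, chunk_size);
}
```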
### Free Path: mid_mt_free
```c
void mid_mt_free(void* ptr, size_t size) {
    if (!ptr) return;
    int class_idx = mid_size_to_class(size);

    // === Fast path: Local free (own thread's segment) ===
    MidThreadSegment* seg = &g_mid_segments[class_idx];
    // Check if ptr belongs to current segment
    if (likely(ptr >= seg->chunk_base && ptr < seg->end)) {
        segment_free_local(seg, ptr);
        return;
    }

    // === Slow path: Remote free (cross-thread) ===
    // Need to find the owning segment via registry
    size_t block_size;
    int owner_class;
    if (registry_find_segment(ptr, &block_size, &owner_class)) {
        segment_free_remote(ptr, block_size, owner_class);
    } else {
        // Not from mid_mt allocator, fallback
        // (Should not happen if routing is correct)
    }
}

static inline void segment_free_local(MidThreadSegment* seg, void* ptr) {
    // Push to free list (lock-free, local)
    *(void**)ptr = seg->free_list;
    seg->free_list = ptr;
    seg->used_count--;
}

static void segment_free_remote(void* ptr, size_t block_size, int class_idx) {
    // Remote free: we don't know which thread owns this segment.
    // Strategy: use atomic CAS to push to a "remote free list" that the
    // owner thread drains periodically.
    // For simplicity in Phase 1: fall back to the global pool.
    // TODO Phase 2: implement the per-segment remote free list.
    // For now, just leak (will implement properly in Phase 2).
}
```
### Registry: Binary Search Lookup
```c
static bool registry_find_segment(void* ptr, size_t* block_size, int* class_idx) {
    pthread_mutex_lock(&g_mid_registry.lock);

    // Binary search for segment containing ptr
    int left = 0;
    int right = g_mid_registry.count - 1;
    bool found = false;
    while (left <= right) {
        int mid = left + (right - left) / 2;
        MidSegmentRegistry* entry = &g_mid_registry.entries[mid];
        void* seg_end = (uint8_t*)entry->base + 64 * 1024;
        if (ptr < entry->base) {
            right = mid - 1;
        } else if (ptr >= seg_end) {
            left = mid + 1;
        } else {
            // Found!
            *block_size = entry->block_size;
            *class_idx = entry->class_idx;
            found = true;
            break;
        }
    }

    pthread_mutex_unlock(&g_mid_registry.lock);
    return found;
}
```
---
## 🔒 Thread Safety
### Lock-Free Region (Fast Path)
```
Allocation (own thread):
  g_mid_segments[class_idx] ← __thread TLS, fully lock-free
  └─ free_list, current, end ← accessed by a single thread only
Free (local, own thread):
  seg->free_list push ← local operation, no lock needed
```
### Locked Region (Low Frequency)
```
Registry operations:
  g_mid_registry ← global, protected by pthread_mutex
  ├─ registry_register() ← on refill (~0.1% of operations)
  ├─ registry_find()     ← on remote free (10-20%)
  └─ registry_remove()   ← on segment return (~0.01%)
Chunk allocation:
  mmap() ← system call (~0.1% of operations)
```
### Remote Free (Phase 2)
```c
// Phase 1: remote free is unimplemented (leaks memory)
// Phase 2: per-segment remote free list, as below
typedef struct MidThreadSegment {
    // ...existing fields...
    // Remote free list (accessed by other threads)
    _Atomic(void*) remote_free_head; // Atomic Treiber stack
    uint32_t remote_free_count;      // Approximate count
} MidThreadSegment;

// Remote free (from another thread)
void segment_free_remote(MidThreadSegment* seg, void* ptr) {
    // Atomic push to remote_free_head
    void* old_head;
    do {
        old_head = atomic_load_explicit(&seg->remote_free_head, memory_order_acquire);
        *(void**)ptr = old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        &seg->remote_free_head, &old_head, ptr,
        memory_order_release, memory_order_acquire));
}

// Drain remote frees (by owner thread, periodically)
void segment_drain_remote(MidThreadSegment* seg) {
    void* remote_head = atomic_exchange_explicit(
        &seg->remote_free_head, NULL, memory_order_acquire);
    // Merge into the local free_list
    if (remote_head) {
        void* tail = remote_head;
        while (*(void**)tail != NULL) {
            tail = *(void**)tail;
        }
        *(void**)tail = seg->free_list;
        seg->free_list = remote_head;
    }
}
```
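One plausible place to trigger the drain is the owner's slow path, just before paying for a refill. A sketch under that assumption (`segment_alloc_with_drain` is a hypothetical wrapper, not part of the API above):

```c
static void* segment_alloc_with_drain(MidThreadSegment* seg, int class_idx) {
    // Both local sources exhausted → reclaim remote frees before mmap'ing.
    if (seg->free_list == NULL && seg->current == seg->end) {
        segment_drain_remote(seg); // may repopulate free_list for Path 1
    }
    return segment_alloc(seg, class_idx);
}
```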
---
## 🧪 Test Plan
### Unit Tests
```c
// test_mid_mt_basic.c
void test_single_thread_alloc_free() {
    // Single thread, simple alloc/free
    void* p1 = mid_mt_alloc(8192);
    assert(p1 != NULL);
    mid_mt_free(p1, 8192);
}

void test_multiple_size_classes() {
    // Use all three size classes at once
    void* p8k  = mid_mt_alloc(8192);
    void* p16k = mid_mt_alloc(16384);
    void* p32k = mid_mt_alloc(32768);
    assert(p8k && p16k && p32k);
    mid_mt_free(p8k, 8192);
    mid_mt_free(p16k, 16384);
    mid_mt_free(p32k, 32768);
}

void test_segment_refill() {
    // Exhaust a segment → exercise refill
    // 64KB chunk / 8KB blocks = 8 blocks, so 10 allocs force a refill
    void* ptrs[10];
    for (int i = 0; i < 10; i++) {
        ptrs[i] = mid_mt_alloc(8192);
        assert(ptrs[i] != NULL);
    }
    for (int i = 0; i < 10; i++) {
        mid_mt_free(ptrs[i], 8192);
    }
}
```
### Multi-threaded Tests
```c
// test_mid_mt_multithread.c
void* thread_worker(void* arg) {
    int thread_id = *(int*)arg;
    // Each thread allocs/frees independently
    for (int i = 0; i < 100000; i++) {
        void* p = mid_mt_alloc(8192);
        assert(p != NULL);
        // Write to memory (check no overlap)
        *(int*)p = thread_id;
        mid_mt_free(p, 8192);
    }
    return NULL;
}

void test_concurrent_alloc() {
    pthread_t threads[4];
    int thread_ids[4] = {0, 1, 2, 3};
    for (int i = 0; i < 4; i++) {
        pthread_create(&threads[i], NULL, thread_worker, &thread_ids[i]);
    }
    for (int i = 0; i < 4; i++) {
        pthread_join(threads[i], NULL);
    }
}
```
### Benchmark Tests
```bash
# Existing benchmark: bench_mid_large_mt
./bench_mid_large_mt_hakx    # new implementation

# Comparisons
./bench_mid_large_mt_mi      # mimalloc
./bench_mid_large_mt_system  # glibc

# Targets:
# Before: 46-47 M ops/s
# After:  100-120 M ops/s (mimalloc: 122 M ops/s)
```
---
## 📈 Performance Prediction
### Theoretical Performance
**Fast path (free_list hit)**:
```asm
; segment_alloc - fast path (~4-5 instructions)
mov rax, [seg->free_list]    ; Load free_list head
test rax, rax                ; NULL check
je .bump_path                ; Branch if empty
mov rdx, [rax]               ; Load next pointer
mov [seg->free_list], rdx    ; Update free_list
inc dword [seg->used_count]  ; Increment used count
ret                          ; Return rax
```
**Instructions per allocation**: ~5 (vs mimalloc's ~4-6)
**IPC**: 3.5-4.0 (modern CPU, good branch prediction)
**Theoretical peak**:
- 4 GHz CPU × IPC 3.5 / 5 instructions = **2.8 billion ops/s** per core
- But limited by memory latency (~100ns) → **~10M ops/s** is realistic
**Multi-threaded (2 threads)**:
- Linear scaling (no lock contention)
- **2 × 10M = 20M ops/s** (the benchmark shows higher numbers thanks to cache reuse)
### Empirical Prediction
**bench_mid_large_mt** (8KB, 16KB, 32KB mixed, 2 threads):
- Working set: ~100-200 objects per thread
- Free list hit rate: ~90-95% (high reuse)
- Expected: **100-120 M ops/s**
**Rationale**:
1. mimalloc: 122 M ops/s (proven)
2. Our design: nearly the same architecture
3. Overhead: registry lookup (on remote free) → perhaps -10-20%
4. But: remote frees are rare (each thread has an independent working set)
**Phase 1 target**: **100 M ops/s or more** (≥82% of mimalloc)
---
## 🚧 Implementation Milestones
### Day 1-2: Core Implementation
**Files to create**:
- [ ] `core/hakmem_mid_mt.h` - API definitions, data structures
- [ ] `core/hakmem_mid_mt.c` - implementation
**Work items**:
- [ ] `MidThreadSegment` data structure
- [ ] TLS variable `g_mid_segments[3]`
- [ ] `mid_mt_alloc()` - free_list + bump path only
- [ ] `mid_mt_free()` - local free only (remote unimplemented)
- [ ] `segment_refill()` - mmap chunk allocation
- [ ] Basic unit tests
**Debugging**:
- [ ] Verify behavior with single-threaded tests
- [ ] Check for memory leaks with Valgrind
### Day 3-4: Integration
**Routing integration** (`core/hakmem.c`):
```c
void* malloc(size_t size) {
    if (size <= 1024) {
        return hak_tiny_alloc(size);
    }
    // NEW: Mid Range MT
    if (size <= 32768) {
        return mid_mt_alloc(size);
    }
    // Existing: L2 Pool / L2.5 Pool
    return hak_pool_alloc(size);
}

void free(void* ptr) {
    if (!ptr) return;
    // Size lookup (need to determine which allocator)
    size_t size = get_allocation_size(ptr); // How?
    if (size <= 1024) {
        hak_tiny_free(ptr);
    } else if (size <= 32768) {
        mid_mt_free(ptr, size);
    } else {
        hak_pool_free(ptr);
    }
}
```
**Problem**: `free(ptr)` does not know the allocation size
**Solutions**:
1. Recover the size via the registry (this design)
2. Or: embed a header in each block (memory overhead; sketched below)
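For comparison, a sketch of solution 2 (`header_malloc`/`header_free` are hypothetical names). Note the header pushes an exact 8KB request into the 16KB class, which is one reason this design prefers the registry:

```c
typedef struct MidHeader { size_t size; } MidHeader;

static void* header_malloc(size_t size) {
    // One extra word per block records the size for free().
    MidHeader* h = mid_mt_alloc(size + sizeof(MidHeader));
    if (!h) return NULL;
    h->size = size;
    return h + 1;                  // user pointer just past the header
}

static void header_free(void* ptr) {
    MidHeader* h = (MidHeader*)ptr - 1;
    mid_mt_free(h, h->size + sizeof(MidHeader));
}
```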
**Work items**:
- [ ] Implement the registry (binary search lookup)
- [ ] Integrate the routing logic
- [ ] Verify behavior with multi-threaded tests
### Day 5-7: Benchmarking and Optimization
**Benchmarks**:
- [ ] Measure with `bench_mid_large_mt`
- [ ] Compare against mimalloc
- [ ] Profile with `perf record`
**Optimization**:
- [ ] Inspect hot-path assembly (is it inlined?)
- [ ] Branch prediction hints (likely/unlikely)
- [ ] Alignment tuning (64-byte aligned)
- [ ] Cache line analysis
**Goal check**:
- [ ] Reached 100 M ops/s or more?
- [ ] At least 80% of mimalloc?
---
## 🤔 Risks and Mitigations
### Risk 1: Registry Lock Overhead
**Problem**: a registry lookup on every `free()` → lock contention?
**Mitigations**:
1. **Fast path**: first check whether the pointer belongs to the current segment (the common case)
2. **Registry**: touched only on remote frees (rare)
3. **Read-write lock**: lookups can run concurrently (see the sketch after this list)
4. **Phase 2**: per-segment metadata makes the registry unnecessary
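A sketch of mitigation 3, swapping the mutex for a `pthread_rwlock_t` so concurrent frees can search in parallel (assumes the registry gains an rwlock; the search is the same binary search as `registry_find_segment`, and writers would take the write lock):

```c
static pthread_rwlock_t g_registry_rwlock = PTHREAD_RWLOCK_INITIALIZER;

static bool registry_find_segment_rw(void* ptr, size_t* block_size, int* class_idx) {
    bool found = false;
    pthread_rwlock_rdlock(&g_registry_rwlock); // readers don't block each other
    int left = 0, right = (int)g_mid_registry.count - 1;
    while (left <= right) {
        int mid = left + (right - left) / 2;
        MidSegmentRegistry* e = &g_mid_registry.entries[mid];
        void* seg_end = (uint8_t*)e->base + 64 * 1024;
        if (ptr < e->base)       right = mid - 1;
        else if (ptr >= seg_end) left = mid + 1;
        else {
            *block_size = e->block_size;
            *class_idx  = e->class_idx;
            found = true;
            break;
        }
    }
    pthread_rwlock_unlock(&g_registry_rwlock);
    return found;
}
```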
### Risk 2: Remote Free Unimplemented
**Problem**: Phase 1 has no remote free → memory leak?
**Mitigations**:
1. **Benchmark characteristics**: `bench_mid_large_mt` keeps threads independent → few remote frees
2. **Implement in Phase 2**: atomic remote free list
3. **Short-term risk**: acceptable (Phase 1 is validated via benchmarks only)
### Risk 3: Chunk Size Not Yet Tuned
**Problem**: is a 64KB chunk optimal?
**Mitigations**:
1. **Phase 1**: fixed at 64KB (mimalloc does the same)
2. **Phase 2**: make it adjustable via an environment variable (sketched after this list)
3. **Tuning**: adjust based on benchmark results
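A sketch of mitigation 2 (`HAKMEM_MID_CHUNK_SIZE` is a hypothetical variable name, not an existing knob):

```c
#include <stdlib.h>

// Read the chunk size from the environment, falling back to the
// Phase 1 default of 64KB (hypothetical knob, sketch only).
static size_t mid_chunk_size(void) {
    const char* s = getenv("HAKMEM_MID_CHUNK_SIZE");
    size_t v = s ? (size_t)strtoull(s, NULL, 10) : 0;
    return v ? v : 64 * 1024;
}
```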
---
## 📚 References
1. **mimalloc paper**: "mimalloc: Free List Sharding in Action"
   - https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action/
2. **tcmalloc design**: "TCMalloc: Thread-Caching Malloc"
   - https://google.github.io/tcmalloc/design.html
3. **jemalloc documentation**: "jemalloc Memory Allocator"
   - http://jemalloc.net/
4. **hakmem existing code**:
   - `core/hakmem_pool.c` - existing L2 Pool (reference)
   - `core/hakmem_tiny.c` - Tiny Pool TLS design
---
## ✅ Completion Criteria
Definition of done for Phase 1:
1. **Functionality**: 8-32KB alloc/free works (single- and multi-threaded)
2. **Performance**: **100 M ops/s or more** on `bench_mid_large_mt`
3. **Correctness**: Valgrind clean, unit tests pass
4. **Integration**: works transparently through `core/hakmem.c` routing
5. **Documentation**: implementation completion report written
---
**Created**: 2025-11-01
**Planned implementation**: 2025-11-02 to 2025-11-08
**Review**: evaluate against benchmark results after Phase 1 completes
**Next step**: Phase 2 (ChatGPT Pro P1-P2) or Remote Free implementation