# Mid Range MT Optimization Design Document (2025-11-01)

## 📋 Overview

**Goal**: Bring multi-threaded performance in the 8-32KB size range up to mimalloc's level.

**Current performance**:
- hakmem: 46-47 M ops/s (mid_large_mt benchmark)
- mimalloc: 122 M ops/s
- Gap: **-62%** (2.6x slower) ❌

**Target performance**:
- **100-120 M ops/s** (80-100% of mimalloc)
- A 2.2-2.6x improvement

**Implementation period**: 1 week

---

## 🎯 Design Philosophy

### Where This Fits in the Hybrid Strategy

```
≤1KB   (Tiny Pool)   → static optimization (P0 done, no learning needed)
8-32KB (Mid Range)   → mimalloc-style per-thread segments (this design) ← 🔥 here
≥64KB  (Large Pool)  → learning-based (ELO strategy selection)
```

**Why mimalloc-style for 8-32KB only?**

1. ✅ **MT performance comes first**: lock contention is the largest bottleneck
2. ✅ **Learning pays off little**: the size range is narrow, so strategies differ only slightly
3. ✅ **Proven in mimalloc**: per-thread heaps have a track record
4. ✅ **No conflict with the learning layer**: the learning layer keeps working unchanged for 64KB and above

### The Core Idea of mimalloc

**Free List Sharding**:
```c
// mimalloc's key insight:
// "each thread owns an independent heap segment"
// → lock-free, cache locality, scalability

thread_local Segment* my_segment;  // independent per thread

void* alloc(size_t size) {
    // Fetch directly from TLS (no lock needed!)
    return segment_alloc(my_segment, size);
}
```

---

## 🏗️ Architecture

### Overall Structure

```
┌─────────────────────────────────────────────────────────┐
│  malloc(size) - 8KB ≤ size ≤ 32KB                       │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
           ┌──────────────────────────┐
           │  mid_size_to_class(size) │ ← determine size class
           │  8KB  → class 0          │
           │  16KB → class 1          │
           │  32KB → class 2          │
           └──────────┬───────────────┘
                      │
                      ▼
    ┌───────────────────────────────────────┐
    │  g_mid_segments[class_idx]            │ ← __thread TLS
    │                                       │
    │  MidThreadSegment {                   │
    │    void* free_list;     ← free objects│
    │    void* current;       ← bump pointer│
    │    void* end;           ← segment end │
    │    size_t chunk_size;   ← 64KB        │
    │    uint32_t used_count;               │
    │    uint32_t capacity;                 │
    │  }                                    │
    └───────────┬───────────────────────────┘
                │
                ├─── fast path: pop from free_list
                │    (most cases)
                │
                ├─── bump path: bump from current
                │    (when free_list is empty)
                │
                └─── slow path: allocate a new segment
                     (when current == end)
```

### Size Class Definitions

```c
// Three classes: 8KB, 16KB, 32KB
#define MID_SIZE_CLASS_8K   0   // 8KB
#define MID_SIZE_CLASS_16K  1   // 16KB
#define MID_SIZE_CLASS_32K  2   // 32KB
#define MID_NUM_CLASSES     3

// Size → class (caller guarantees size ≤ 32KB)
static inline int mid_size_to_class(size_t size) {
    if (size <= 8192)  return MID_SIZE_CLASS_8K;
    if (size <= 16384) return MID_SIZE_CLASS_16K;
    return MID_SIZE_CLASS_32K;
}

// Class → block size (class_idx must be in [0, MID_NUM_CLASSES))
static inline size_t mid_class_to_size(int class_idx) {
    static const size_t sizes[MID_NUM_CLASSES] = {
        8192,   // 8KB
        16384,  // 16KB
        32768   // 32KB
    };
    return sizes[class_idx];
}
```

---

## 📊 Data Structures

### MidThreadSegment (core data structure)

```c
// core/hakmem_mid_mt.h

typedef struct MidThreadSegment {
    // === Fast path (hot fields, start of cache line 0) ===
    void*    free_list;     // Free objects linked list (NULL if empty)
    void*    current;       // Bump allocation pointer
    void*    end;           // End of current chunk
    uint32_t used_count;    // Number of allocated blocks

    // === Metadata (shares cache line 0 with the hot fields on LP64) ===
    void*    chunk_base;    // Base address of current chunk
    size_t   chunk_size;    // Size of chunk (64KB)
    size_t   block_size;    // Size of each block (8KB/16KB/32KB)
    uint32_t capacity;      // Total blocks in chunk

    // === Statistics (cache line 1 on LP64) ===
    uint64_t alloc_count;   // Total allocations
    uint64_t free_count;    // Total frees
    uint32_t refill_count;  // Number of chunk refills
    uint32_t padding;

} __attribute__((aligned(64))) MidThreadSegment;

// TLS: each thread owns its own set of segments
extern __thread MidThreadSegment g_mid_segments[MID_NUM_CLASSES];
```

**Design points**:
1. **64-byte alignment**: each segment starts on an L1 cache-line boundary (avoids false sharing)
2. **Fast path first**: `free_list`, `current`, and `end` sit at the front of the struct
3. **Statistics kept apart**: counters that the hot path does not need are placed at the end

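
The field placement described above can be pinned down with compile-time checks. The following is a minimal sketch, assuming GCC or Clang on an LP64 target (8-byte pointers and `size_t`); `MidThreadSegmentLayout` is a hypothetical local copy of the struct, renamed so that it cannot clash with the real definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local copy of MidThreadSegment, used only for layout checks. */
typedef struct MidThreadSegmentLayout {
    void*    free_list;
    void*    current;
    void*    end;
    uint32_t used_count;

    void*    chunk_base;
    size_t   chunk_size;
    size_t   block_size;
    uint32_t capacity;

    uint64_t alloc_count;
    uint64_t free_count;
    uint32_t refill_count;
    uint32_t padding;
} __attribute__((aligned(64))) MidThreadSegmentLayout;

/* The hot fields start at offset 0 and end inside the first 64 bytes. */
static_assert(offsetof(MidThreadSegmentLayout, free_list) == 0,
              "free_list must be the first field");
static_assert(offsetof(MidThreadSegmentLayout, used_count) + sizeof(uint32_t) <= 64,
              "hot fields must fit in one cache line");

/* On LP64 the statistics begin exactly at the second cache line. */
static_assert(offsetof(MidThreadSegmentLayout, alloc_count) == 64,
              "statistics should start on cache line 1");

/* The aligned(64) attribute rounds the size up to whole cache lines. */
static_assert(sizeof(MidThreadSegmentLayout) % 64 == 0,
              "struct size must be a multiple of the cache line size");
```

If a field is later added to the hot group, a failing assertion flags the layout change at build time instead of showing up as a benchmark regression.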
### Segment Registry (for recovering the size on free)

```c
// Registry that lets free(ptr) find out the block size

typedef struct MidSegmentRegistry {
    void*  base;        // Segment base address
    size_t block_size;  // Block size (8KB/16KB/32KB)
    int    class_idx;   // Size class index
} MidSegmentRegistry;

// Global registry (protected by a lock)
typedef struct MidGlobalRegistry {
    MidSegmentRegistry* entries;   // Dynamic array
    uint32_t            count;     // Number of entries
    uint32_t            capacity;  // Array capacity
    pthread_mutex_t     lock;      // Registry lock
} MidGlobalRegistry;

extern MidGlobalRegistry g_mid_registry;

// Registry operations
void mid_registry_add(void* base, size_t block_size, int class_idx);
int  mid_registry_lookup(void* ptr, size_t* out_block_size, int* out_class_idx);
void mid_registry_remove(void* base);
```

**Design points**:
1. **Minimal lock scope**: the lock is taken only for registry operations; alloc/free on the TLS segment stay lock-free
2. **O(log N) lookup**: entries are kept sorted by address and searched with binary search
3. **Lazy removal**: an entry is removed only when its segment becomes completely empty (infrequent)

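
Keeping the entries sorted by address is what makes the binary search possible, so every insertion has to preserve that order. The following is a minimal sketch of such an insertion under stated assumptions: `RegEntry`, `Reg`, and `reg_insert_sorted` are hypothetical stand-ins for the registry types above, the caller is assumed to already hold the registry lock, and the growth policy (doubling from 16 entries) is an illustrative choice.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for MidSegmentRegistry / MidGlobalRegistry. */
typedef struct { void* base; size_t block_size; int class_idx; } RegEntry;
typedef struct { RegEntry* entries; uint32_t count; uint32_t capacity; } Reg;

/* Insert an entry while keeping `entries` sorted by base address.
 * The caller is assumed to hold the registry lock.
 * Returns 0 on success, -1 if the array could not be grown. */
static int reg_insert_sorted(Reg* r, void* base, size_t block_size, int class_idx) {
    assert(r != NULL && base != NULL);

    if (r->count == r->capacity) {
        uint32_t new_cap = r->capacity ? r->capacity * 2 : 16;
        RegEntry* grown = realloc(r->entries, (size_t)new_cap * sizeof(RegEntry));
        if (grown == NULL) return -1;
        r->entries  = grown;
        r->capacity = new_cap;
    }

    /* Lower bound: first index whose base is not below the new base. */
    uint32_t lo = 0, hi = r->count;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if ((uintptr_t)r->entries[mid].base < (uintptr_t)base) lo = mid + 1;
        else hi = mid;
    }

    /* Shift the tail up by one slot and place the new entry at `lo`. */
    memmove(&r->entries[lo + 1], &r->entries[lo],
            (size_t)(r->count - lo) * sizeof(RegEntry));
    r->entries[lo] = (RegEntry){ base, block_size, class_idx };
    r->count++;
    return 0;
}
```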
---

## 🚀 API Specification

### Allocation API

```c
// core/hakmem_mid_mt.h

/**
 * Mid Range MT allocation (8-32KB)
 *
 * @param size Allocation size (must be 8KB ≤ size ≤ 32KB)
 * @return Allocated pointer (at a block_size multiple from the chunk base),
 *         or NULL on failure
 *
 * Thread-safety: Lock-free on the fast path (uses TLS);
 *                a refill takes the registry lock
 * Performance: O(1) fast path, O(1) amortized
 */
void* mid_mt_alloc(size_t size);

/**
 * Mid Range MT free
 *
 * @param ptr Pointer to free (must be from mid_mt_alloc)
 * @param size Original allocation size (for class lookup)
 *
 * Thread-safety: Lock-free if freeing to own thread's segment
 *                Atomic CAS if remote free (cross-thread)
 * Performance: O(1) local free, O(1) remote free
 */
void mid_mt_free(void* ptr, size_t size);

/**
 * Initialize Mid Range MT allocator
 *
 * Call once at startup (thread-safe)
 */
void mid_mt_init(void);

/**
 * Cleanup thread-local segments (called on thread exit)
 */
void mid_mt_thread_exit(void);
```

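
The API above states that `mid_mt_thread_exit()` runs on thread exit but does not say how it gets invoked. One common POSIX approach is a thread-specific-data key whose destructor runs when a thread terminates. The following is a minimal sketch of that approach, not the project's actual mechanism: `demo_thread_exit`, `register_thread_cleanup`, and `demo_worker` are hypothetical names, and the counter only stands in for the real cleanup work.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static pthread_key_t  g_exit_key;
static pthread_once_t g_exit_once = PTHREAD_ONCE_INIT;
static atomic_int     g_cleanup_calls;   /* demo counter only */

/* Stand-in for mid_mt_thread_exit(); the real function would release
 * the exiting thread's chunks. */
static void demo_thread_exit(void) {
    atomic_fetch_add(&g_cleanup_calls, 1);
}

/* Destructor invoked by the pthread runtime when a thread exits. */
static void exit_key_destructor(void* value) {
    (void)value;
    demo_thread_exit();
}

static void exit_key_create(void) {
    int rc = pthread_key_create(&g_exit_key, exit_key_destructor);
    assert(rc == 0);
    (void)rc;
}

/* Call once per thread, for example on that thread's first refill. */
static void register_thread_cleanup(void) {
    pthread_once(&g_exit_once, exit_key_create);
    /* Destructors run only for non-NULL values, so store any marker. */
    pthread_setspecific(g_exit_key, (void*)1);
}

/* Example worker: registers the cleanup hook and then exits. */
static void* demo_worker(void* arg) {
    (void)arg;
    register_thread_cleanup();
    return NULL;
}
```

Registering lazily from the slow path keeps the fast path free of any extra check, since a thread that never refills has nothing to clean up.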
### Internal Helpers

```c
// Segment operations
static void* segment_alloc(MidThreadSegment* seg, size_t size);
static void  segment_free_local(MidThreadSegment* seg, void* ptr);
static void  segment_free_remote(void* ptr, size_t block_size, int class_idx);
static bool  segment_refill(MidThreadSegment* seg, int class_idx);

// Chunk management
static void* chunk_allocate(size_t chunk_size);
static void  chunk_deallocate(void* chunk, size_t chunk_size);

// Registry
static void registry_register(void* base, size_t block_size, int class_idx);
static bool registry_find_segment(void* ptr, size_t* block_size, int* class_idx);
```

---

## 🔍 Implementation Details

### Fast Path: segment_alloc

```c
// core/hakmem_mid_mt.c
// likely() is assumed to wrap __builtin_expect.

static inline void* segment_alloc(MidThreadSegment* seg, size_t size) {
    // === Path 1: Free list (fastest, ~4-5 instructions) ===
    void* p = seg->free_list;
    if (likely(p != NULL)) {
        seg->free_list = *(void**)p;  // Pop from free list
        seg->used_count++;
        return p;
    }

    // === Path 2: Bump allocation (fast, ~6-8 instructions) ===
    // A zero-initialized TLS segment has current == NULL and block_size == 0,
    // so it must skip this path and go straight to the refill.
    p = seg->current;
    if (likely(p != NULL)) {
        void* next = (uint8_t*)p + seg->block_size;
        if (likely(next <= seg->end)) {
            seg->current = next;
            seg->used_count++;
            return p;
        }
    }

    // === Path 3: Refill (slow, called ~once per 64KB) ===
    int class_idx = mid_size_to_class(size);
    if (!segment_refill(seg, class_idx)) {
        return NULL;  // OOM
    }

    // Retry after refill
    p = seg->current;
    seg->current = (uint8_t*)p + seg->block_size;
    seg->used_count++;
    return p;
}
```

**Performance prediction** (estimates, not yet measured):
- Path 1 (free list): 99% of cases, ~4-5 instructions
- Path 2 (bump): 0.9% of cases, ~6-8 instructions
- Path 3 (refill): 0.1% of cases, ~1000 instructions (low amortized cost)

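
As a rough cross-check of the path mix above, the expected instruction count per allocation can be worked out from the document's own estimates, taking the upper bound of each range (5, 8, and 1000 instructions):

$$
E[\text{instructions}] = 0.99 \times 5 + 0.009 \times 8 + 0.001 \times 1000 = 4.95 + 0.072 + 1.0 \approx 6.0
$$

Under these assumed figures the refill path, although taken only 0.1% of the time, contributes about one instruction per allocation on average, so the amortized cost stays close to that of the free-list path.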
### Segment Refill

```c
static bool segment_refill(MidThreadSegment* seg, int class_idx) {
    size_t block_size = mid_class_to_size(class_idx);
    size_t chunk_size = 64 * 1024;  // 64KB chunks

    // Allocate new chunk via mmap
    void* chunk = chunk_allocate(chunk_size);
    if (!chunk) return false;

    // Register chunk in global registry (for free path)
    registry_register(chunk, block_size, class_idx);

    // Setup segment.
    // Note: blocks from the previous chunk no longer match the local
    // range check in mid_mt_free() and are freed via the registry path.
    seg->chunk_base = chunk;
    seg->chunk_size = chunk_size;
    seg->block_size = block_size;
    seg->current = chunk;
    seg->end = (uint8_t*)chunk + chunk_size;
    seg->capacity = chunk_size / block_size;
    seg->refill_count++;

    return true;
}
```

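
The refill above relies on `chunk_allocate()` obtaining memory via `mmap`, but its body is not shown. The following is a minimal sketch of what such a helper could look like on Linux or another POSIX system; the `_sketch` suffixes mark these as hypothetical versions, not the project's implementation. Note that `mmap` guarantees only page alignment, so a 64KB chunk is not necessarily aligned to 64KB.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map an anonymous, zero-filled, page-aligned region of chunk_size bytes.
 * Returns NULL on failure. */
static void* chunk_allocate_sketch(size_t chunk_size) {
    assert(chunk_size > 0);
    void* p = mmap(NULL, chunk_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Return the region to the OS; a NULL chunk is ignored. */
static void chunk_deallocate_sketch(void* chunk, size_t chunk_size) {
    if (chunk != NULL) {
        munmap(chunk, chunk_size);
    }
}
```

If block addresses must be multiples of the block size, one option is to over-allocate and trim the mapping to an aligned base, at the cost of extra `munmap` calls.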
### Free Path: mid_mt_free

```c
void mid_mt_free(void* ptr, size_t size) {
    if (!ptr) return;

    int class_idx = mid_size_to_class(size);

    // === Fast path: Local free (own thread's segment) ===
    MidThreadSegment* seg = &g_mid_segments[class_idx];

    // Check if ptr belongs to current segment
    if (likely(ptr >= seg->chunk_base && ptr < seg->end)) {
        segment_free_local(seg, ptr);
        return;
    }

    // === Slow path: Remote free (cross-thread) ===
    // Need to find the owning segment via registry
    size_t block_size;
    int owner_class;

    if (registry_find_segment(ptr, &block_size, &owner_class)) {
        segment_free_remote(ptr, block_size, owner_class);
    } else {
        // Not from mid_mt allocator, fallback
        // (Should not happen if routing is correct)
    }
}

static inline void segment_free_local(MidThreadSegment* seg, void* ptr) {
    // Push to free list (lock-free, local)
    *(void**)ptr = seg->free_list;
    seg->free_list = ptr;
    seg->used_count--;
}

static void segment_free_remote(void* ptr, size_t block_size, int class_idx) {
    // Remote free: we don't know which thread owns this segment.
    // Intended strategy: use atomic CAS to push to a "remote free list"
    // that the owner thread drains periodically.
    //
    // Phase 1: no fallback is implemented yet, so the block is leaked.
    // TODO Phase 2: implement a per-segment remote free list.
    (void)ptr;
    (void)block_size;
    (void)class_idx;
}
```

### Registry: Binary Search Lookup

```c
static bool registry_find_segment(void* ptr, size_t* block_size, int* class_idx) {
    pthread_mutex_lock(&g_mid_registry.lock);

    // Binary search for segment containing ptr
    int left = 0;
    int right = (int)g_mid_registry.count - 1;  // -1 when the registry is empty
    bool found = false;

    while (left <= right) {
        int mid = left + (right - left) / 2;
        MidSegmentRegistry* entry = &g_mid_registry.entries[mid];

        void* seg_end = (uint8_t*)entry->base + 64 * 1024;

        if (ptr < entry->base) {
            right = mid - 1;
        } else if (ptr >= seg_end) {
            left = mid + 1;
        } else {
            // Found!
            *block_size = entry->block_size;
            *class_idx = entry->class_idx;
            found = true;
            break;
        }
    }

    pthread_mutex_unlock(&g_mid_registry.lock);
    return found;
}
```

---

## 🔒 Thread Safety

### Lock-Free Region (fast path)

```
Allocation (own thread):
  g_mid_segments[class_idx]     ← __thread TLS, fully lock-free
    └─ free_list, current, end  ← accessed by a single thread only

Free (local, own thread):
  seg->free_list push           ← local operation, no lock needed
```

### Locked Region (low frequency)

```
Registry operations:
  g_mid_registry                ← global, protected by pthread_mutex
    ├─ registry_register()      ← on refill (~0.1% of allocations)
    ├─ registry_find_segment()  ← on remote free (10-20%?)
    └─ registry_remove()        ← when a segment is returned (~0.01%)

Chunk allocation:
  mmap()                        ← system call (~0.1% of allocations)
```

### Remote Free (implemented in Phase 2)

```c
// Phase 1: remote free is not implemented (memory leak)
// Phase 2: implement a per-segment remote free list

typedef struct MidThreadSegment {
    // ...existing fields...

    // Remote free list (accessed by other threads)
    _Atomic(void*) remote_free_head;  // Atomic Treiber stack
    uint32_t remote_free_count;       // Approximate count
} MidThreadSegment;

// Remote free (from other thread)
void segment_free_remote(MidThreadSegment* seg, void* ptr) {
    // Atomic push to remote_free_head
    void* old_head = atomic_load_explicit(&seg->remote_free_head,
                                          memory_order_relaxed);
    do {
        // On CAS failure old_head is refreshed with the current head
        *(void**)ptr = old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &seg->remote_free_head, &old_head, ptr,
                 memory_order_release, memory_order_relaxed));
}

// Drain remote frees (by owner thread, periodically)
void segment_drain_remote(MidThreadSegment* seg) {
    void* remote_head = atomic_exchange_explicit(
        &seg->remote_free_head, NULL, memory_order_acquire);

    // Merge to local free_list
    if (remote_head) {
        void* tail = remote_head;
        while (*(void**)tail != NULL) {
            tail = *(void**)tail;
        }
        *(void**)tail = seg->free_list;
        seg->free_list = remote_head;
    }
}
```

---

## 🧪 Test Plan

### Unit Tests

```c
// test_mid_mt_basic.c
void test_single_thread_alloc_free(void) {
    // Single thread, simple alloc/free
    void* p1 = mid_mt_alloc(8192);
    assert(p1 != NULL);
    mid_mt_free(p1, 8192);
}

void test_multiple_size_classes(void) {
    // Use all three size classes at the same time
    void* p8k = mid_mt_alloc(8192);
    void* p16k = mid_mt_alloc(16384);
    void* p32k = mid_mt_alloc(32768);
    assert(p8k && p16k && p32k);
    mid_mt_free(p8k, 8192);
    mid_mt_free(p16k, 16384);
    mid_mt_free(p32k, 32768);
}

void test_segment_refill(void) {
    // Exhaust a segment to exercise the refill path
    // 64KB chunk → 8KB blocks = 8 blocks, so 10 allocations force one refill
    void* ptrs[10];
    for (int i = 0; i < 10; i++) {
        ptrs[i] = mid_mt_alloc(8192);
        assert(ptrs[i] != NULL);
    }
    for (int i = 0; i < 10; i++) {
        mid_mt_free(ptrs[i], 8192);
    }
}
```

### Multi-threaded Tests

```c
// test_mid_mt_multithread.c
void* thread_worker(void* arg) {
    int thread_id = *(int*)arg;

    // Each thread allocates and frees independently
    for (int i = 0; i < 100000; i++) {
        void* p = mid_mt_alloc(8192);
        assert(p != NULL);

        // Write to memory (check no overlap)
        *(int*)p = thread_id;

        mid_mt_free(p, 8192);
    }
    return NULL;
}

void test_concurrent_alloc(void) {
    pthread_t threads[4];
    int thread_ids[4] = {0, 1, 2, 3};

    for (int i = 0; i < 4; i++) {
        pthread_create(&threads[i], NULL, thread_worker, &thread_ids[i]);
    }

    for (int i = 0; i < 4; i++) {
        pthread_join(threads[i], NULL);
    }
}
```

### Benchmark Tests

```bash
# Existing benchmark: bench_mid_large_mt
./bench_mid_large_mt_hakx      # new implementation

# Comparison
./bench_mid_large_mt_mi        # mimalloc
./bench_mid_large_mt_system    # glibc

# Target:
# Before: 46-47 M ops/s
# After:  100-120 M ops/s (mimalloc: 122 M ops/s)
```

---

## 📈 Performance Prediction

### Theoretical Performance

**Fast path (free_list hit)**:
```asm
; segment_alloc - fast path (pseudo-assembly; 7 instructions as listed,
; including the counter update and ret)
mov rax, [seg->free_list]    ; Load free_list head
test rax, rax                ; NULL check
je .bump_path                ; Branch if empty
mov rdx, [rax]               ; Load next pointer
mov [seg->free_list], rdx    ; Update free_list
inc dword [seg->used_count]  ; Increment used count
ret                          ; Return rax
```

**Instructions per allocation**: ~5 (vs mimalloc ~4-6)

**IPC**: 3.5-4.0 (modern CPU, good branch prediction)

**Theoretical peak**:
- 4 GHz CPU / 5 instructions × IPC 3.5 = **2.8 billion ops/s** per core
- If every allocation instead paid a full memory-latency miss (~100 ns), throughput would drop to **~10M ops/s**

**Multi-threaded (2 threads)**:
- Linear scaling (no lock contention)
- **2 × 10M = 20M ops/s** in the all-miss case (the benchmark runs higher because blocks are reused while still in cache)

### Expected Measured Results

**bench_mid_large_mt** (8KB, 16KB, 32KB mixed, 2 threads):
- Working set: ~100-200 objects per thread
- Free list hit rate: ~90-95% (high reuse rate)
- Expected: **100-120 M ops/s**

**Rationale**:
1. mimalloc: 122 M ops/s (proven)
2. Our design: nearly the same architecture
3. Overhead: registry lookup (on remote free) → -10-20%?
4. But: remote frees are rare (each thread has an independent working set)

**Phase 1 target**: **100 M ops/s or more** (at least 82% of mimalloc)

---

## 🚧 Implementation Milestones

### Day 1-2: Basic Implementation

**Files to create**:
- [ ] `core/hakmem_mid_mt.h` - API definitions, data structures
- [ ] `core/hakmem_mid_mt.c` - implementation

**Implementation items**:
- [ ] `MidThreadSegment` data structure
- [ ] TLS variable `g_mid_segments[3]`
- [ ] `mid_mt_alloc()` - free_list + bump path only
- [ ] `mid_mt_free()` - local free only (remote free not implemented)
- [ ] `segment_refill()` - mmap chunk allocation
- [ ] Basic unit tests

**Debugging**:
- [ ] Verify behavior with the single-threaded test
- [ ] Check for memory leaks with Valgrind

### Day 3-4: Integration

**Routing integration** (`core/hakmem.c`):
```c
void* malloc(size_t size) {
    if (size <= 1024) {
        return hak_tiny_alloc(size);
    }

    // NEW: Mid Range MT (mid_mt_alloc requires 8KB ≤ size ≤ 32KB)
    if (size >= 8192 && size <= 32768) {
        return mid_mt_alloc(size);
    }

    // Existing: L2 Pool / L2.5 Pool
    return hak_pool_alloc(size);
}

void free(void* ptr) {
    if (!ptr) return;

    // Size lookup (need to determine which allocator)
    size_t size = get_allocation_size(ptr);  // How?

    if (size <= 1024) {
        hak_tiny_free(ptr);
    } else if (size >= 8192 && size <= 32768) {
        mid_mt_free(ptr, size);
    } else {
        hak_pool_free(ptr);
    }
}
```

**Open issue**: `free(ptr)` does not know the allocation size

**Possible solutions**:
1. Recover the size through the registry (this design)
2. Alternatively, embed the size in a header (costs memory per block)

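
Because `free(ptr)` arrives without a size, a lock-free first step is to test the pointer against the current chunk of each of the calling thread's segments, and only fall back to the registry when none matches. The following is a minimal sketch of that check under stated assumptions: `DemoSegment`, `demo_segments`, `demo_find_local_class`, and `demo_free_local` are hypothetical names, and the struct carries only the fields the check needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_CLASSES 3

/* Hypothetical reduced segment: only the fields the range check needs. */
typedef struct {
    void* free_list;
    void* chunk_base;
    void* end;
} DemoSegment;

static _Thread_local DemoSegment demo_segments[DEMO_NUM_CLASSES];

/* Return the class whose current chunk contains ptr, or -1 if none does.
 * Only the calling thread's TLS is read, so no lock is taken. */
static int demo_find_local_class(const void* ptr) {
    uintptr_t p = (uintptr_t)ptr;
    for (int i = 0; i < DEMO_NUM_CLASSES; i++) {
        const DemoSegment* s = &demo_segments[i];
        if (s->chunk_base != NULL &&
            p >= (uintptr_t)s->chunk_base && p < (uintptr_t)s->end) {
            return i;
        }
    }
    return -1;
}

/* Size-less local free. Returns true if the block was pushed onto this
 * thread's free list; on false the caller would consult the registry. */
static bool demo_free_local(void* ptr) {
    assert(ptr != NULL);
    int cls = demo_find_local_class(ptr);
    if (cls < 0) return false;
    DemoSegment* s = &demo_segments[cls];
    *(void**)ptr = s->free_list;   /* link the block in front of the list */
    s->free_list = ptr;
    return true;
}
```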
**Implementation items**:
- [ ] Registry implementation (binary search lookup)
- [ ] Routing logic integration
- [ ] Verify behavior with the multi-threaded test

### Day 5-7: Benchmarking and Optimization

**Benchmarking**:
- [ ] Measure with `bench_mid_large_mt`
- [ ] Compare against mimalloc
- [ ] Profile with `perf record`

**Optimization**:
- [ ] Inspect the hot-path assembly (is it actually inlined?)
- [ ] Branch prediction tuning (likely/unlikely)
- [ ] Alignment adjustment (64-byte aligned)
- [ ] Cache line analysis

**Goal check**:
- [ ] Reached 100 M ops/s or more?
- [ ] At least 80% of mimalloc?

---

## 🤔 Risks and Mitigations

### Risk 1: Registry Lock Overhead

**Problem**: a registry lookup on every `free()` could cause lock contention.

**Mitigations**:
1. **Fast path**: first check whether the pointer belongs to the current segment (the common case)
2. **Registry**: consulted only on remote free (rare)
3. **Read-write lock**: lookups can proceed concurrently
4. **Phase 2**: per-segment metadata makes the registry unnecessary

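
The read-write lock mitigation can be sketched with POSIX `pthread_rwlock_t`: lookups take the lock in shared mode, while registration and removal would take it exclusively with `pthread_rwlock_wrlock`. This is a minimal sketch, assuming entries sorted by base address and a uniform chunk size; `RwEntry`, `RwRegistry`, and `rw_registry_find` are hypothetical names, not the project's API.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { void* base; size_t block_size; int class_idx; } RwEntry;

typedef struct {
    RwEntry*         entries;     /* sorted by base address */
    uint32_t         count;
    size_t           chunk_size;  /* assumed identical for every entry */
    pthread_rwlock_t lock;
} RwRegistry;

/* Find the chunk containing ptr. Readers share the lock, so concurrent
 * lookups do not block one another. */
static bool rw_registry_find(RwRegistry* r, const void* ptr,
                             size_t* block_size, int* class_idx) {
    assert(r != NULL && block_size != NULL && class_idx != NULL);
    bool found = false;
    uintptr_t p = (uintptr_t)ptr;

    pthread_rwlock_rdlock(&r->lock);
    uint32_t lo = 0, hi = r->count;           /* half-open range [lo, hi) */
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        uintptr_t base = (uintptr_t)r->entries[mid].base;
        if (p < base) {
            hi = mid;
        } else if (p >= base + r->chunk_size) {
            lo = mid + 1;
        } else {
            *block_size = r->entries[mid].block_size;
            *class_idx  = r->entries[mid].class_idx;
            found = true;
            break;
        }
    }
    pthread_rwlock_unlock(&r->lock);
    return found;
}
```

Whether this beats a plain mutex depends on the workload: read-write locks carry more bookkeeping per acquisition, so the gain shows up only when many threads look up concurrently.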
### Risk 2: Remote Free Not Implemented

**Problem**: Phase 1 does not implement remote free, so remotely freed blocks leak.

**Mitigations**:
1. **Benchmark characteristics**: in `bench_mid_large_mt` each thread works independently, so remote frees are rare
2. **Implement in Phase 2**: atomic remote free list
3. **Short-term risk**: accepted (Phase 1 is validated with the benchmark only)

### Risk 3: Chunk Size Not Tuned

**Problem**: is a 64KB chunk the right size?

**Mitigations**:
1. **Phase 1**: fixed at 64KB (similar to mimalloc)
2. **Phase 2**: make it adjustable through an environment variable
3. **Tuning**: adjust based on benchmark results

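
Making the chunk size adjustable through an environment variable mostly comes down to validating the value before it is used. The following is a minimal sketch of such a parser; the variable name `HAKMEM_MID_CHUNK_KB`, the accepted range, and the power-of-two rule are all assumptions chosen for illustration, not settings defined by this design.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_DEFAULT_CHUNK_KB 64UL
#define DEMO_MIN_CHUNK_KB     64UL    /* holds at least two 32KB blocks */
#define DEMO_MAX_CHUNK_KB     4096UL  /* arbitrary upper bound for the sketch */

static_assert(DEMO_MIN_CHUNK_KB <= DEMO_DEFAULT_CHUNK_KB,
              "default must lie inside the accepted range");

/* Parse a chunk size given in KB. Falls back to the default when the text
 * is missing, malformed, out of range, or not a power of two. */
static size_t demo_parse_chunk_size(const char* text) {
    const size_t fallback = DEMO_DEFAULT_CHUNK_KB * 1024;
    if (text == NULL || *text == '\0') return fallback;

    char* endp = NULL;
    errno = 0;
    unsigned long kb = strtoul(text, &endp, 10);
    if (errno != 0 || *endp != '\0') return fallback;
    if (kb < DEMO_MIN_CHUNK_KB || kb > DEMO_MAX_CHUNK_KB) return fallback;
    if ((kb & (kb - 1)) != 0) return fallback;   /* not a power of two */

    return (size_t)kb * 1024;
}

/* Read the setting once at startup. The variable name is hypothetical. */
static size_t demo_chunk_size_from_env(void) {
    return demo_parse_chunk_size(getenv("HAKMEM_MID_CHUNK_KB"));
}
```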
---

## 📚 References

1. **mimalloc paper**: "mimalloc: Free List Sharding in Action"
   - https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action/

2. **tcmalloc design**: "TCMalloc: Thread-Caching Malloc"
   - https://google.github.io/tcmalloc/design.html

3. **jemalloc documentation**: "jemalloc Memory Allocator"
   - http://jemalloc.net/

4. **Existing hakmem implementation**:
   - `core/hakmem_pool.c` - existing L2 Pool (for reference)
   - `core/hakmem_tiny.c` - Tiny Pool TLS design

---

## ✅ Completion Criteria

Phase 1 is considered complete when:

1. ✅ **Functionality**: alloc/free for 8-32KB works (single-threaded and multi-threaded)
2. ✅ **Performance**: **100 M ops/s or more** on `bench_mid_large_mt`
3. ✅ **Correctness**: Valgrind clean, unit tests pass
4. ✅ **Integration**: works transparently through the `core/hakmem.c` routing
5. ✅ **Documentation**: implementation completion report written

---

**Created**: 2025-11-01
**Planned implementation**: 2025-11-02 to 2025-11-08
**Review**: evaluate against benchmark results once Phase 1 is complete
**Next step**: Phase 2 (ChatGPT Pro P1-P2) or remote free implementation