# Operation Beat mimalloc: Phase 7 Battle Plan

**Date**: 2025-10-24
**Goal**: Raise hakmem to **60-75%** of mimalloc's throughput
**Current**: **46.7%** (13.78 M/s vs 29.50 M/s, Mid 4T)
**Gap**: **15.72 M/s** (53.3%)

---

## 🎯 Strategic Goals

### Primary Goal: Universal Performance

**Target**: **60-75%** of mimalloc on the larson benchmark (Mid 4T)
- Current: 13.78 M/s (**46.7%**)
- Target: 17.70-22.13 M/s (**60-75%**)
- **Required improvement**: +3.92-8.35 M/s (**+28-61%**)

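The target range follows directly from the two measured throughputs. A minimal sketch of that arithmetic, assuming only the figures quoted above (the helper names `target_throughput` and `required_gain_pct` are illustrative, not part of hakmem):

```c
#include <assert.h>
#include <stdio.h>

#define HAKMEM_MOPS   13.78   /* measured hakmem, Mid 4T (M ops/s) */
#define MIMALLOC_MOPS 29.50   /* measured mimalloc, Mid 4T (M ops/s) */

/* Throughput hakmem must reach to hit `ratio` of mimalloc. */
static double target_throughput(double ratio) {
    assert(ratio > 0.0);
    return MIMALLOC_MOPS * ratio;
}

/* Relative improvement over the current hakmem figure, in percent. */
static double required_gain_pct(double ratio) {
    return (target_throughput(ratio) / HAKMEM_MOPS - 1.0) * 100.0;
}

/* Print both ends of the 60-75% target band. */
static void print_targets(void) {
    printf("60%%: %.2f M/s (+%.1f%%)\n",
           target_throughput(0.60), required_gain_pct(0.60));
    printf("75%%: %.2f M/s (+%.1f%%)\n",
           target_throughput(0.75), required_gain_pct(0.75));
}
```
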
### Secondary Goal: Superiority on Specialized Workloads

**Target**: Play to hakmem's design strengths
- **Burst allocation**: 110-120% of mimalloc
- **Locality-aware**: 110-130% of mimalloc
- **Producer-Consumer**: 105-115% of mimalloc

---

## 📊 Current-State Analysis (Task-sensei's Diagnosis)

### The 4 Major Bottlenecks

| Rank | Bottleneck | Cost (cycles) | Impact | Fix |
|------|------------|---------------|--------|-----|
| **#1** | **Lock Contention** (56 mutexes) | 56 | **50%** | MF1: Lock-free |
| **#2** | **Hash Lookups** (page descriptors) | 30 | **25%** | MF2: Pointer arithmetic |
| **#3** | **Excess Branching** (7-10 branches) | 15 | **15%** | MF3: Simplify path |
| **#4** | **Metadata Overhead** (inline headers) | 10 | **10%** | Long-term |

**Total overhead**: ~110 cycles/allocation vs ~5 cycles for mimalloc

### Problems in the hakmem Architecture

**Over-engineered**:
- **7-layer TLS caching**: Ring → LIFO → Pages × 2 → TC → Freelist → Remote
- **56 mutexes**: 7 classes × 8 shards (37.5% contention rate @ 4T)
- **Hash table**: 2-4 lookups per allocation (10-20 cycles each + cache miss)

**Advantages of the mimalloc Architecture**:
- **Only 2 layers**: Thread-local heap → Per-page freelist
- **0 locks**: Lock-free CAS only
- **Pointer arithmetic**: 3-4 instructions, no hash

---

## 🚀 3-Stage Roadmap

### Phase 7.1: Quick Fixes (8 hours) → +5-10%

**Goal**: Quick wins from low-risk optimizations

#### QF1: Reduce Trylock Probes (2 hours)

**Current**: `g_trylock_probes = 3` (trylock is attempted up to 3 times)

**Problem**: Every probe costs:
- Hash calculation (shard index)
- Mutex trylock (50-200 cycles if contended)
- Branch

**Fix**: Reduce probes to **1**

```c
// Before
for (int probe = 0; probe < g_trylock_probes; ++probe) {  // 3 iterations
    int s = (shard_idx + probe) & (POOL_NUM_SHARDS - 1);
    if (pthread_mutex_trylock(lock) == 0) { ... }  // lock: mutex of shard s
}

// After
int s = shard_idx;  // No loop
if (pthread_mutex_trylock(lock) == 0) { ... }      // lock: mutex of shard s
```

**Expected gain**: +2-3% (less trylock overhead)

**File**: `hakmem_pool.c` (~10 LOC)

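The single-probe variant can be sketched as a self-contained function. This is a minimal sketch, not hakmem's actual code: `Shard`, `g_shards`, `shards_init`, and `shard_try_pop` are hypothetical stand-ins, and only `pthread_mutex_trylock` and `POOL_NUM_SHARDS` come from the plan above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define POOL_NUM_SHARDS 8   /* power of two, as in the plan */

/* Hypothetical stand-ins for hakmem's real structures. */
typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;
typedef struct {
    pthread_mutex_t lock;
    PoolBlock*      head;
} Shard;

static Shard g_shards[POOL_NUM_SHARDS];

static void shards_init(void) {
    for (int i = 0; i < POOL_NUM_SHARDS; ++i) {
        pthread_mutex_init(&g_shards[i].lock, NULL);
        g_shards[i].head = NULL;
    }
}

/* Single-probe pop: try the home shard exactly once. On contention or an
 * empty list, return NULL so the caller falls through to its slow path
 * instead of probing neighbouring shards. */
static PoolBlock* shard_try_pop(int shard_idx) {
    assert(shard_idx >= 0 && shard_idx < POOL_NUM_SHARDS);
    Shard* s = &g_shards[shard_idx];
    PoolBlock* block = NULL;

    if (pthread_mutex_trylock(&s->lock) == 0) {
        block = s->head;
        if (block) s->head = block->next;
        pthread_mutex_unlock(&s->lock);
    }
    return block;
}
```

The trade-off is that a contended home shard now always ends in the slow path, so the gain depends on how often the second and third probes used to succeed.
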
#### QF2: Merge Ring + LIFO (4 hours)

**Current**: Two-stage design: Ring (32 slots) + LIFO (256 blocks)

**Problem**:
- Two checks (Ring → LIFO)
- Complex spill logic

**Fix**: Single unified cache (256 slots)

```c
// Before
typedef struct {
    PoolTLSRing ring;  // 32 slots
    PoolTLSBin bin;    // LIFO 256 blocks
} TLSCache;

// After
typedef struct {
    PoolBlock* items[256];  // Unified cache (larger)
    int top;                // Number of cached blocks
} TLSCache;
```

**Expected gain**: +1-2% (fewer branches, better cache locality)

**File**: `hakmem_pool.c` (~50 LOC)

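With a single array and a `top` index, push and pop reduce to one bounds check each. A minimal sketch of the two operations, assuming the unified layout above (the function names and `TLS_CACHE_CAP` are illustrative, and `PoolBlock` is a stand-in for the real type):

```c
#include <assert.h>
#include <stddef.h>

#define TLS_CACHE_CAP 256

typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;  /* stand-in */

typedef struct {
    PoolBlock* items[TLS_CACHE_CAP];
    int        top;   /* number of cached blocks; items[top-1] is hottest */
} TLSCache;

/* Pop the most recently freed block, or NULL when the cache is empty. */
static inline PoolBlock* tls_cache_pop(TLSCache* c) {
    assert(c != NULL);
    return (c->top > 0) ? c->items[--c->top] : NULL;
}

/* Push a freed block; returns 0 when full so the caller can spill. */
static inline int tls_cache_push(TLSCache* c, PoolBlock* b) {
    assert(c != NULL && b != NULL);
    if (c->top == TLS_CACHE_CAP) return 0;
    c->items[c->top++] = b;
    return 1;
}
```

Because the most recently freed block is returned first, the structure behaves as a LIFO stack, which tends to hand back cache-warm memory. In practice one instance would live in thread-local storage per size class.
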
#### QF3: Skip Header Writes (2 hours)

**Current**: The fast path writes the header on every allocation

```c
mid_set_header(hdr, size, site_id);  // 5-10 cycles
```

**Fix**: A block popped from the Ring already has a valid header → skip the write

```c
// Only write header when refilling from freelist
if (from_freelist) {
    mid_set_header(hdr, size, site_id);
}
```

**Expected gain**: +1-2% (less header write overhead)

**File**: `hakmem_pool.c` (~20 LOC)

---

### Phase 7.2: Medium Fixes (20-30 hours) → +25-35%

**Goal**: Resolve the architectural bottlenecks

#### MF1: Lock-Free Freelist ⭐⭐⭐ (12 hours)

**Goal**: Replace the 56 mutexes with atomic CAS

**Current**:
```c
static PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
static PoolBlock* freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```

**After**:
```c
static atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```

**Implementation**:

1. **Lock-free pop**:
```c
PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head, new_head;
    PoolBlock* block;

    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_acquire);
        if (old_head == 0) return NULL;  // Empty

        block = (PoolBlock*)old_head;
        new_head = (uintptr_t)block->next;
    } while (!atomic_compare_exchange_weak_explicit(
                 head_ptr, &old_head, new_head,
                 memory_order_release, memory_order_relaxed));

    return block;
}
```

2. **Lock-free push**:
```c
void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;

    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 head_ptr, &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_relaxed));
}
```

3. **Batch push** (for refill):
```c
void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                  PoolBlock* head, PoolBlock* tail) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;

    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 head_ptr, &old_head, (uintptr_t)head,
                 memory_order_release, memory_order_relaxed));
}
```

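**ABA Problem Countermeasures**:
- **Push is ABA-safe, pop is not**: if a popping thread reads `head = A` and `A->next = B`, and other threads pop A and B and push A back before the CAS runs, the CAS still succeeds and installs B as head even though B is in use, which corrupts the freelist
- **Version counter**: pair the head with a generation tag (128-bit CAS, or a tag packed into unused pointer bits)
- **Pop-all**: detach the whole list with one atomic exchange instead of popping single blocks
- **Epoch-based reclamation**: Defer block reuse (if needed)
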
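One way to sidestep ABA without a wider CAS is to never pop a single block from the shared list. The sketch below is an assumption about how that could look rather than the plan's chosen design; `stack_push` and `stack_pop_all` are hypothetical names and `PoolBlock` is a stand-in type.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;  /* stand-in */

/* Push stays a plain CAS loop: it never dereferences the old head, so a
 * recycled head pointer cannot mislead it. */
static void stack_push(atomic_uintptr_t* head, PoolBlock* block) {
    assert(block != NULL);
    uintptr_t old_head = atomic_load_explicit(head, memory_order_relaxed);
    do {
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 head, &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_relaxed));
}

/* Detach the entire list in one atomic exchange. No CAS compares a stale
 * head and no `next` field is read while the block is still shared, so
 * the ABA window of the single-block pop does not exist here. */
static PoolBlock* stack_pop_all(atomic_uintptr_t* head) {
    return (PoolBlock*)atomic_exchange_explicit(head, (uintptr_t)0,
                                                memory_order_acquire);
}
```

The caller would keep the detached chain in its TLS cache and return any surplus with a batch push, which fits the refill pattern already described above.
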
**Expected gain**: **+15-25%** (lock contention eliminated entirely)

**Files**:
- `hakmem_pool.c`: ~100 LOC (replace mutex calls with CAS)

**Risk**: Medium
- Memory ordering bugs
- ABA problem (affects the single-block pop; must be handled explicitly)

**Testing**:
- ThreadSanitizer (TSan)
- Stress test (16+ threads, 60s)
- Correctness test (no lost blocks, no double-free)

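The correctness check (no lost blocks, no double-free) can be sketched as a small harness: several threads push disjoint blocks concurrently, then the list is drained and every block must appear exactly once. Thread and block counts here are illustrative and smaller than the 16+ threads / 60 s target, and all names are hypothetical. Build with `-pthread`, and add `-fsanitize=thread` to run the same harness under TSan.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define N_THREADS 8
#define N_BLOCKS  10000   /* blocks pushed per thread */

typedef struct PoolBlock { struct PoolBlock* next; int seen; } PoolBlock;

static atomic_uintptr_t g_head;
static PoolBlock*       g_blocks;

/* Same CAS push as in the implementation section. */
static void push(PoolBlock* b) {
    uintptr_t old = atomic_load_explicit(&g_head, memory_order_relaxed);
    do {
        b->next = (PoolBlock*)old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_head, &old, (uintptr_t)b,
                 memory_order_release, memory_order_relaxed));
}

/* Each producer pushes its own disjoint slice of g_blocks. */
static void* producer(void* arg) {
    size_t tid = (size_t)(uintptr_t)arg;
    for (size_t i = 0; i < N_BLOCKS; ++i)
        push(&g_blocks[tid * N_BLOCKS + i]);
    return NULL;
}

/* Returns the number of blocks recovered; asserts on duplicates. */
static size_t run_stress(void) {
    pthread_t th[N_THREADS];
    g_blocks = calloc((size_t)N_THREADS * N_BLOCKS, sizeof *g_blocks);
    assert(g_blocks != NULL);
    atomic_store(&g_head, 0);

    for (size_t t = 0; t < N_THREADS; ++t)
        pthread_create(&th[t], NULL, producer, (void*)(uintptr_t)t);
    for (size_t t = 0; t < N_THREADS; ++t)
        pthread_join(th[t], NULL);

    size_t count = 0;
    PoolBlock* b = (PoolBlock*)atomic_exchange(&g_head, 0);
    for (; b; b = b->next) {
        assert(b->seen == 0);   /* block linked twice: double insert */
        b->seen = 1;
        ++count;
    }
    free(g_blocks);
    return count;
}
```
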
#### MF2: Pointer Arithmetic Page Lookup (8-10 hours)

**Goal**: Replace the hash table lookup with pointer arithmetic

**Current**: `mid_desc_lookup(ptr)` → Hash table (10-20 cycles + mutex)

**mimalloc approach**:
```c
// Segment-aligned memory (4 MiB alignment)
// Segment: [Page Descriptors 128KB][Pages 3968KB]

// Derive page descriptor from block pointer
uintptr_t seg_addr = (uintptr_t)ptr & ~(4*1024*1024 - 1);  // Mask
size_t offset = (uintptr_t)ptr - seg_addr;
size_t page_idx = offset / PAGE_SIZE;
PageDesc* desc = &((Segment*)seg_addr)->descriptors[page_idx];
```

**Cost**: 3-4 instructions (mask, subtract, divide, array index)

**Implementation**:

1. **Segment allocation**:
```c
#define SEGMENT_SIZE (4 * 1024 * 1024)
#define SEGMENT_ALIGNMENT SEGMENT_SIZE

// MAP_ALIGNED is a BSD flag (FreeBSD/NetBSD); Linux needs manual alignment
void* segment = mmap(NULL, SEGMENT_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_ALIGNED(22),  // 4MiB aligned
                     -1, 0);
```

2. **Page descriptor array**:
```c
typedef struct {
    union {
        PageDesc descriptors[64];    // One per 64KB slot (64 × 64KB = 4MiB)
        char     reserved[128*1024]; // Header occupies slots 0-1
    };
    char pages[62][64*1024];         // Data pages: slots 2-63 (3968KB)
} Segment;                           // sizeof(Segment) == 4MiB

Segment* seg = (Segment*)segment;
```

3. **Fast lookup**:
```c
static inline PageDesc* page_desc_from_ptr(void* ptr) {
    uintptr_t seg_addr = (uintptr_t)ptr & ~(SEGMENT_SIZE - 1);
    Segment* seg = (Segment*)seg_addr;
    size_t offset = (uintptr_t)ptr - seg_addr;
    size_t page_idx = offset / (64*1024);  // Slot index within the segment
    return &seg->descriptors[page_idx];
}
```

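Where `MAP_ALIGNED` is not available, alignment can be obtained by over-allocating and trimming. The following is a minimal sketch under that assumption; `segment_alloc_aligned` and `segment_of` are hypothetical names, and the only system calls used are `mmap` and `munmap`.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define SEGMENT_SIZE ((size_t)4 * 1024 * 1024)

/* Reserve twice the segment size, then unmap the unaligned head and the
 * unused tail so exactly one 4 MiB-aligned segment remains mapped. */
static void* segment_alloc_aligned(void) {
    size_t span = 2 * SEGMENT_SIZE;
    char* raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;

    uintptr_t base    = (uintptr_t)raw;
    uintptr_t aligned = (base + SEGMENT_SIZE - 1) & ~(uintptr_t)(SEGMENT_SIZE - 1);
    size_t head = (size_t)(aligned - base);
    size_t tail = span - head - SEGMENT_SIZE;

    if (head) munmap(raw, head);
    if (tail) munmap((char*)aligned + SEGMENT_SIZE, tail);
    return (void*)aligned;
}

/* Recover the segment base from any interior pointer with a single mask. */
static inline void* segment_of(const void* ptr) {
    assert(ptr != NULL);
    return (void*)((uintptr_t)ptr & ~(uintptr_t)(SEGMENT_SIZE - 1));
}
```

The extra address space is only reserved briefly and never touched, so the cost is two additional `munmap` calls per segment rather than extra resident memory.
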
**Expected gain**: **+10-15%** (hash lookup eliminated)

**Files**:
- `hakmem_pool.c`: ~150 LOC (segment allocator)
- `hakmem_pool.h`: Data structure changes

**Risk**: Medium
- Requires memory layout redesign
- Backward compatibility (migrate existing pages?)

#### MF3: Simplify the Allocation Path (8-10 hours)

**Goal**: Reduce 7 layers → 3 layers

**Current path**:
1. TLS Ring (32)
2. TLS LIFO (256)
3. Active Page A
4. Active Page B
5. Transfer Cache
6. Shard Freelist (mutex)
7. Remote Stack

**Simplified path**:
1. **TLS Cache** (unified 256-slot cache)
2. **Per-shard Freelist** (lock-free)
3. **Remote Stack** (lock-free)

**Changes**:
- Remove: Active Pages (bump-run)
- Remove: Transfer Cache
- Keep: TLS Cache (unified), Freelist, Remote

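The three-tier path can be sketched end to end as follows. This is an illustration under assumptions, not the planned implementation: it refills the TLS cache by detaching a whole shared list at once, and every name (`pool_alloc`, `cache_refill`, `push_chain`) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_CACHE_CAP 256

typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;
typedef struct { PoolBlock* items[TLS_CACHE_CAP]; int top; } TLSCache;

/* Move a detached chain into the cache; returns the part that did not fit. */
static PoolBlock* cache_refill(TLSCache* c, PoolBlock* chain) {
    while (chain && c->top < TLS_CACHE_CAP) {
        PoolBlock* next = chain->next;
        c->items[c->top++] = chain;
        chain = next;
    }
    return chain;
}

/* Return a whole chain to a shared list with one CAS loop. */
static void push_chain(atomic_uintptr_t* head, PoolBlock* first) {
    PoolBlock* tail = first;
    while (tail->next) tail = tail->next;
    uintptr_t old = atomic_load_explicit(head, memory_order_relaxed);
    do {
        tail->next = (PoolBlock*)old;
    } while (!atomic_compare_exchange_weak_explicit(
                 head, &old, (uintptr_t)first,
                 memory_order_release, memory_order_relaxed));
}

/* Tier 1: TLS cache. Tier 2: shard freelist. Tier 3: remote stack.
 * Returns NULL when all three are empty (caller maps a fresh page). */
static PoolBlock* pool_alloc(TLSCache* c, atomic_uintptr_t* shard,
                             atomic_uintptr_t* remote) {
    assert(c && shard && remote);
    if (c->top > 0) return c->items[--c->top];          /* tier 1 */

    atomic_uintptr_t* tiers[2] = { shard, remote };     /* tiers 2 and 3 */
    for (int i = 0; i < 2; ++i) {
        PoolBlock* chain = (PoolBlock*)atomic_exchange_explicit(
            tiers[i], (uintptr_t)0, memory_order_acquire);
        if (!chain) continue;
        PoolBlock* rest = cache_refill(c, chain);
        if (rest) push_chain(tiers[i], rest);           /* give back overflow */
        return c->items[--c->top];
    }
    return NULL;
}
```

The fast path is a single compare and an array read; the two shared tiers are only touched when the TLS cache is empty.
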
**Rationale**:
- Active Pages: complex but low payoff (the TLS cache is sufficient)
- Transfer Cache: owner-awareness is nice, but the overhead is high

**Expected gain**: **+5-8%** (fewer branches, less complexity)

**Files**:
- `hakmem_pool.c`: ~200 LOC (refactor allocation path)

**Risk**: Low
- Code deletion is safer than addition
- Verify the absence of regressions through testing

---

### Phase 7.3: Moonshot (60 hours) → +50-70%

**Goal**: Port mimalloc's architecture wholesale

#### MS1: Per-Page Sharding (60 hours)

**Goal**: Global sharded freelists → Per-page freelists (mimalloc style)

**Current**: 7 classes × 8 shards = **56 global freelists**

**mimalloc**: Thousands of **per-page freelists**

**Architecture**:
```c
// Thread-local heap
typedef struct {
    Page* pages[SIZE_BINS];  // Per-size-class page lists
} Heap;

// Per-page freelist
typedef struct Page {
    void* free;                     // Freelist head (in-page)
    void* local_free;               // Same-thread frees
    atomic_uintptr_t xthread_free;  // Cross-thread frees
    int used;
    int capacity;
    struct Page* next;              // Next page in heap
} Page;
```

**Allocation**:
```c
// Simplified sketch of the mimalloc-style fast path
void* mi_malloc(size_t size) {
    Heap* heap = mi_get_heap();
    int bin = size_to_bin(size);
    Page* page = heap->pages[bin];

    if (page->free) {
        void* block = page->free;
        page->free = *(void**)block;
        return block;
    }

    return mi_page_malloc(heap, page, size);  // Slow path
}
```

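The free side of this design splits into an owner-thread path and a cross-thread path, with the owner folding both lists back into `free` on the slow path. The sketch below is modeled loosely on the `Page` layout above and is not mimalloc's actual code; the function names are hypothetical, and blocks are assumed to be at least pointer-sized and pointer-aligned.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Page {
    void*            free;          /* allocation list (owner only) */
    void*            local_free;    /* frees by the owning thread */
    atomic_uintptr_t xthread_free;  /* frees by other threads */
    int              used;
} Page;

/* Owner-thread free: plain stores, no atomics needed. */
static void page_free_local(Page* page, void* block) {
    assert(page && block);
    *(void**)block = page->local_free;
    page->local_free = block;
    page->used--;
}

/* Cross-thread free: CAS push onto the page's atomic list. */
static void page_free_xthread(Page* page, void* block) {
    assert(page && block);
    uintptr_t old = atomic_load_explicit(&page->xthread_free,
                                         memory_order_relaxed);
    do {
        *(void**)block = (void*)old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &page->xthread_free, &old, (uintptr_t)block,
                 memory_order_release, memory_order_relaxed));
}

/* Slow-path collect (owner only): when `free` is empty, adopt both lists. */
static void page_collect(Page* page) {
    assert(page && page->free == NULL);
    void* x = (void*)atomic_exchange_explicit(&page->xthread_free,
                                              (uintptr_t)0,
                                              memory_order_acquire);
    while (x) {                      /* fold remote frees into local_free */
        void* next = *(void**)x;
        *(void**)x = page->local_free;
        page->local_free = x;
        page->used--;
        x = next;
    }
    page->free = page->local_free;
    page->local_free = NULL;
}
```

Only the cross-thread list ever sees an atomic operation, which is why the allocation fast path shown above can stay free of atomics.
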
**Expected gain**: **+50-70%** (roughly on par with mimalloc)

**Files**:
- `hakmem_pool.c`: Complete rewrite (~500 LOC)
- `hakmem_pool.h`: New data structures

**Risk**: High
- Architectural overhaul
- 60 hours implementation + testing
- Backward compatibility concerns

**Recommendation**: **Skip if Phase 7.2 delivers enough improvement**

---

## 🎯 Recommended Implementation Order

### Option A: MF1 First (High ROI, Medium Risk)

**Week 1**: Implement MF1 (Lock-Free Freelist)
- Day 1-2: Implement lock-free primitives
- Day 3: Integration & unit testing
- Day 4: Benchmark & validation

**Expected result**: 13.78 → **15.8-17.2 M/s** (+15-25%)

**Pros**:
- Resolves the largest bottleneck
- Standalone fix
- Proven technique

**Cons**:
- ABA problem risk
- Memory ordering bugs

### Option B: Quick Fixes First (Low Risk, Low ROI)

**Week 1**: Implement QF1+QF2+QF3
- Day 1: QF1 (fewer trylocks) + QF2 (ring merge)
- Day 2: QF3 (header skip) + testing
- Day 3: Benchmark

**Expected result**: 13.78 → **14.5-15.2 M/s** (+5-10%)

**Pros**:
- Low risk
- Quick wins
- Groundwork for MF1

**Cons**:
- Limited gain

### Option C: Hybrid (Balanced)

**Week 1**: Quick Fixes (8h)
**Week 2**: MF1 (12h)
**Week 3**: MF2 or MF3

**Expected result**: 13.78 → **18-20 M/s** (+30-45%)

**Pros**:
- Spreads the risk
- Incremental progress

**Cons**:
- Takes longer

---

## 📊 Projected Performance

### Cumulative Gains

| Phase | Changes | Expected Gain | Cumulative | vs mimalloc |
|-------|---------|---------------|------------|-------------|
| **Current** | Quick wins | baseline | 13.78 M/s | **46.7%** |
| + QF1-3 | Trylock + Ring + Header | +5-10% | 14.5-15.2 M/s | 49-51% |
| + MF1 | Lock-free Freelist | +15-25% | 17.2-19.0 M/s | **58-64%** ✅ |
| + MF2 | Pointer Arithmetic | +10-15% | 19.9-22.8 M/s | **67-77%** ✅ |
| + MF3 | Path Simplification | +5-8% | 21.9-25.4 M/s | **74-86%** 🎉 |
| + MS1 | Per-Page Sharding | +50-70% | 27.0-31.0 M/s | **91-105%** 🚀 |

**Target achievement**:
- **60% target**: Nearly reached with MF1 (58-64%)
- **75% target**: Reached with MF2 (67-77%)
- **100% target**: Possible with MS1 (91-105%)

---

## ⚠️ Risks & Mitigation

### MF1: Lock-Free Freelist

**Risks**:
1. **ABA problem**: Block A popped, freed, reallocated, pushed back while CAS in-flight
2. **Memory ordering bugs**: Incorrect acquire/release semantics
3. **Livelock**: The CAS retry loop fails to converge

**Mitigation**:
1. **ABA**: Affects the single-block pop (a stale `next` can be installed as head); use a version counter, pop-all via atomic exchange, or epoch-based reclamation
2. **Memory ordering**: Extensive testing with TSan, code review
3. **Livelock**: Add exponential backoff after N retries

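The backoff mitigation can be sketched as a push that spins briefly, then waits for an exponentially growing interval before retrying. The thresholds `SPIN_RETRIES` and `MAX_BACKOFF` are hypothetical tuning values, and `sched_yield` (POSIX) is used once the cap is reached.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SPIN_RETRIES 4      /* hypothetical: failed CASes before backing off */
#define MAX_BACKOFF  1024   /* hypothetical: cap on busy-wait iterations */

typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;  /* stand-in */

static void push_with_backoff(atomic_uintptr_t* head, PoolBlock* block) {
    assert(head && block);
    unsigned attempts = 0, delay = 1;
    uintptr_t old = atomic_load_explicit(head, memory_order_relaxed);

    for (;;) {
        block->next = (PoolBlock*)old;
        /* On failure the CAS refreshes `old` with the current head. */
        if (atomic_compare_exchange_weak_explicit(
                head, &old, (uintptr_t)block,
                memory_order_release, memory_order_relaxed))
            return;

        if (++attempts > SPIN_RETRIES) {
            for (volatile unsigned i = 0; i < delay; ++i) { }  /* busy wait */
            if (delay < MAX_BACKOFF) delay *= 2;   /* exponential growth */
            else sched_yield();                    /* let other threads run */
        }
    }
}
```

A CAS loop like this always lets some thread make progress, so the practical concern is starvation and wasted cycles under heavy contention rather than a true livelock; the backoff mainly reduces cache-line ping-pong.
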
### MF2: Pointer Arithmetic

**Risks**:
1. **Alignment failures**: Environments where mmap does not guarantee 4MiB alignment
2. **Backward compatibility**: Existing pages cannot be migrated
3. **Memory fragmentation**: Allocating in segment units → waste

**Mitigation**:
1. **Alignment**: Use `MAP_ALIGNED` or manual alignment (fallback to hash table if failed)
2. **Backward compat**: Gradual migration, keeping the hash table as a fallback
3. **Fragmentation**: 4MiB segments are acceptable (proven by mimalloc)

### MF3: Path Simplification

**Risks**:
1. **Performance regression**: Could removing Active Pages slow down burst patterns?
2. **Code complexity**: The refactor may introduce new bugs

**Mitigation**:
1. **Regression**: Verify with benchmarks, roll back if problems appear
2. **Complexity**: Incremental refactor, extensive testing

---

## 🎓 Success Criteria

### Primary Metrics

| Metric | Current | Target (60%) | Target (75%) | Measurement |
|--------|---------|--------------|--------------|-------------|
| **Mid 4T Throughput** | 13.78 M/s | 17.70 M/s | 22.13 M/s | Larson 10s |
| **vs mimalloc** | 46.7% | **60%** | **75%** | Ratio |

### Secondary Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| **Memory footprint** | <50 MB | RSS baseline |
| **Fragmentation** | <10% | (RSS - user) / user |
| **Regression (Tiny)** | <5% | Larson Tiny 4T |
| **Regression (Large)** | <5% | Larson Large 4T |

### Specialized Workloads

| Workload | Target | Test |
|----------|--------|------|
| **Burst allocation** | >mimalloc | Custom benchmark |
| **Locality-aware** | >mimalloc | Site-based pattern |
| **Producer-Consumer** | >mimalloc | Multi-threaded queue |

---

## 📅 Timeline Estimate

### Conservative (Full Implementation)

| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| **1** | QF1-3 | Quick fixes | 8 |
| **2** | MF1 | Lock-free freelist | 12 |
| **3** | MF2 | Pointer arithmetic | 10 |
| **4** | MF3 | Path simplification | 10 |
| **5-6** | MS1 | Per-page sharding (optional) | 60 |
| **Total** | | | **40-100 hours** |

### Aggressive (60% Target Only)

| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| **1** | MF1 | Lock-free freelist | 12 |
| **Total** | | | **12 hours** |

**Recommendation**: **Implement MF1 only → Benchmark → Decide**

---

## 🎬 Conclusion

**Current Status**: hakmem is at **46.7%** of mimalloc (Phase 6.25-6.27 failed)

**Root Cause**: Lock contention (50%) + Hash lookups (25%) + Branching (15%)

**Battle Plan**:
- **Phase 7.1**: Quick Fixes → +5-10% (8 hours)
- **Phase 7.2**: Medium Fixes → +25-35% (20-30 hours)
  - **MF1 (Lock-Free)**: +15-25% ⭐⭐⭐
  - **MF2 (Pointer Arithmetic)**: +10-15%
  - **MF3 (Simplify Path)**: +5-8%
- **Phase 7.3**: Moonshot → +50-70% (60 hours, optional)

**Recommended Action**: **Implement MF1 (Lock-Free Freelist) right away!**

**Expected Outcome**: **58-64% of mimalloc** (60% target reached!)

**Let's go take down mimalloc!** 🔥🔥🔥

---

**Created**: 2025-10-24 14:30 JST
**Status**: ✅ **Battle Plan complete, ready to implement MF1**
**Next action**: Begin MF1 (Lock-Free Freelist) implementation