# Defeating mimalloc: Phase 7 Battle Plan
**Date**: 2025-10-24
**Goal**: lift hakmem to **60-75%** of mimalloc
**Current**: **46.7%** (13.78 M/s vs 29.50 M/s, Mid 4T)
**Gap**: **15.72 M/s** (53.3%)
---
## 🎯 Strategic Goals
### Primary Goal: Universal Performance
**Target**: **60-75%** of mimalloc on the larson benchmark (Mid 4T)
- Current: 13.78 M/s (**46.7%**)
- Target: 17.70-22.13 M/s (**60-75%**)
- **Required improvement**: +3.92-8.35 M/s (**+28-61%**)
### Secondary Goal: Superiority on Specialized Workloads
**Target**: play to hakmem's design philosophy
- **Burst allocation**: 110-120% of mimalloc
- **Locality-aware**: 110-130% of mimalloc
- **Producer-Consumer**: 105-115% of mimalloc
---
## 📊 Current-State Analysis (Diagnosis by Task)
### The 4 Major Bottlenecks
| Rank | Bottleneck | Cost (cycles) | Impact | Fix |
|------|------------|---------------|--------|-----|
| **#1** | **Lock Contention** (56 mutexes) | 56 | **50%** | MF1: Lock-free |
| **#2** | **Hash Lookups** (page descriptors) | 30 | **25%** | MF2: Pointer arithmetic |
| **#3** | **Excess Branching** (7-10 branches) | 15 | **15%** | MF3: Simplify path |
| **#4** | **Metadata Overhead** (inline headers) | 10 | **10%** | Long-term |
**Total overhead**: ~110 cycles/allocation vs mimalloc ~5 cycles
### Problems with the hakmem Architecture
**Over-engineered**:
- **7-layer TLS caching**: Ring → LIFO → Pages × 2 → TC → Freelist → Remote
- **56 mutexes**: 7 classes × 8 shards (37.5% contention rate @ 4T)
- **Hash table**: 2-4 lookups per allocation (10-20 cycles each + cache miss)
**Advantages of the mimalloc Architecture**:
- **Only 2 layers**: Thread-local heap → Per-page freelist
- **0 locks**: Lock-free CAS only
- **Pointer arithmetic**: 3-4 instructions, no hash
---
## 🚀 Three-Stage Roadmap
### Phase 7.1: Quick Fixes (8 hours) → +5-10%
**Goal**: quick wins from low-risk optimizations
#### QF1: Reduce Trylock Probes (2 hours)
**Current**: `g_trylock_probes = 3` (up to 3 trylock attempts)
**Problem**: every probe incurs:
- Hash calculation (shard index)
- Mutex trylock (50-200 cycles if contended)
- Branch
**Fix**: reduce probes to **1**
```c
// Before
for (int probe = 0; probe < g_trylock_probes; ++probe) { // 3 iterations
    int s = (shard_idx + probe) & (POOL_NUM_SHARDS - 1);
    if (pthread_mutex_trylock(lock) == 0) { ... }
}

// After
int s = shard_idx; // no probing loop
if (pthread_mutex_trylock(lock) == 0) { ... }
```
**Expected gain**: +2-3% (less trylock overhead)
**File**: `hakmem_pool.c` (~10 LOC)
#### QF2: Unify Ring + LIFO (4 hours)
**Current**: two tiers, Ring (32 slots) + LIFO (256 blocks)
**Problem**:
- Two checks (Ring → LIFO)
- Complex spill logic
**Fix**: Single unified cache (256 slots ring)
```c
// Before
typedef struct {
    PoolTLSRing ring; // 32 slots
    PoolTLSBin  bin;  // LIFO, 256 blocks
} TLSCache;

// After
typedef struct {
    PoolBlock* items[256]; // unified ring (larger)
    int top;
} TLSCache;
```
**Expected gain**: +1-2% (fewer branches, better cache locality)
**File**: `hakmem_pool.c` (~50 LOC)
#### QF3: Skip Header Writes (2 hours)
**Current**: the fast path writes the header on every allocation
```c
mid_set_header(hdr, size, site_id); // 5-10 cycles
```
**Fix**: a block popped from the Ring already has a valid header → skip the write
```c
// Only write the header when refilling from the freelist
if (from_freelist) {
    mid_set_header(hdr, size, site_id);
}
```
**Expected gain**: +1-2% (less header-write overhead)
**File**: `hakmem_pool.c` (~20 LOC)
---
### Phase 7.2: Medium Fixes (20-30 hours) → +25-35%
**Goal**: resolve the architectural bottlenecks
#### MF1: Lock-Free Freelist ⭐⭐⭐ (12 hours)
**Goal**: replace the 56 mutexes with atomic CAS
**Current**:
```c
static PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
static PoolBlock* freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```
**After**:
```c
static atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```
**Implementation**:
1. **Lock-free pop**:
```c
PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head, new_head;
    PoolBlock* block;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_acquire);
        if (old_head == 0) return NULL; // empty
        block = (PoolBlock*)old_head;
        new_head = (uintptr_t)block->next;
    } while (!atomic_compare_exchange_weak_explicit(
                 head_ptr, &old_head, new_head,
                 memory_order_release, memory_order_relaxed));
    return block;
}
```
2. **Lock-free push**:
```c
void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 head_ptr, &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_relaxed));
}
```
3. **Batch push** (for refill):
```c
void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                  PoolBlock* head, PoolBlock* tail) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 head_ptr, &old_head, (uintptr_t)head,
                 memory_order_release, memory_order_relaxed));
}
```
**ABA problem countermeasures**:
- **Not critical for freelist**: ABA still results in valid freelist state
- **Alternative**: Add version counter (128-bit CAS) if issues arise
- **Epoch-based reclamation**: Defer block reuse (if needed)
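If a version counter is ever needed without committing to a 128-bit CAS, one portable option is to pack a generation tag into the unused high bits of a 64-bit head word (user-space pointers fit in 48 bits on current x86-64/AArch64). A minimal sketch under that assumption, with hypothetical names:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;

// Pack a 16-bit generation tag into bits 48-63 of the head word; bumping
// the tag on every pop makes a recycled address distinguishable from the
// original, defeating ABA for up to 2^16 in-flight generations.
#define TAG_SHIFT 48
#define PTR_MASK  ((UINT64_C(1) << TAG_SHIFT) - 1)

static _Atomic uint64_t g_head; // 0 == empty list with tag 0

static uint64_t pack(PoolBlock* p, uint64_t tag) {
    return ((uint64_t)(uintptr_t)p & PTR_MASK) | (tag << TAG_SHIFT);
}

static PoolBlock* tagged_pop(void) {
    uint64_t old = atomic_load_explicit(&g_head, memory_order_acquire);
    for (;;) {
        PoolBlock* blk = (PoolBlock*)(uintptr_t)(old & PTR_MASK);
        if (!blk) return NULL;
        uint64_t tag = (old >> TAG_SHIFT) + 1; // generation bump on pop
        if (atomic_compare_exchange_weak_explicit(
                &g_head, &old, pack(blk->next, tag),
                memory_order_acquire, memory_order_acquire))
            return blk;
    }
}

static void tagged_push(PoolBlock* blk) {
    uint64_t old = atomic_load_explicit(&g_head, memory_order_relaxed);
    for (;;) {
        blk->next = (PoolBlock*)(uintptr_t)(old & PTR_MASK);
        uint64_t tag = old >> TAG_SHIFT; // push keeps the tag; only pop bumps
        if (atomic_compare_exchange_weak_explicit(
                &g_head, &old, pack(blk, tag),
                memory_order_release, memory_order_relaxed))
            return;
    }
}
```

This trades a hard architectural assumption (48-bit pointers) for full lock-freedom on a plain 64-bit CAS, so it should stay behind a fallback if hakmem targets other address layouts.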
**Expected gain**: **+15-25%** (eliminates lock contention entirely)
**Files**:
- `hakmem_pool.c`: ~100 LOC (replace mutex calls with CAS)
**Risk**: Medium
- Memory ordering bugs
- ABA problem (low risk)
**Testing**:
- ThreadSanitizer (TSan)
- Stress test (16+ threads, 60s)
- Correctness test (no lost blocks, no double-free)
#### MF2: Pointer Arithmetic Page Lookup (8-10 hours)
**Goal**: replace the hash-table lookup with pointer arithmetic
**Current**: `mid_desc_lookup(ptr)` → Hash table (10-20 cycles + mutex)
**mimalloc approach**:
```c
// Segment-aligned memory (4 MiB alignment)
// Segment layout: [page descriptors 128 KB][pages 3968 KB]
// Derive the page descriptor from a block pointer:
uintptr_t seg_addr = (uintptr_t)ptr & ~(uintptr_t)(4*1024*1024 - 1); // mask to segment base
Segment* segment = (Segment*)seg_addr;
size_t offset = (uintptr_t)ptr - seg_addr;
size_t page_idx = offset / PAGE_SIZE;
PageDesc* desc = &segment->descriptors[page_idx];
```
**Cost**: 3-4 instructions (mask, subtract, shift, array index; the power-of-two division compiles to a shift)
**Implementation**:
1. **Segment allocation**:
```c
#define SEGMENT_SIZE      (4 * 1024 * 1024)
#define SEGMENT_ALIGNMENT SEGMENT_SIZE

// NOTE: MAP_ALIGNED(n) exists only on the BSDs; Linux needs manual alignment.
void* segment = mmap(NULL, SEGMENT_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_ALIGNED(22), // 4 MiB aligned
                     -1, 0);
```
2. **Page descriptor array**:
```c
typedef struct {
    PageDesc descriptors[64];      // 128 KB of metadata (assumes 2 KB PageDesc);
                                   // entries 0-1 map the header itself
    char     pages[62][64 * 1024]; // 62 usable pages = 3968 KB
} Segment;                         // total: exactly 4 MiB
Segment* seg = (Segment*)segment;
```
3. **Fast lookup**:
```c
static inline PageDesc* page_desc_from_ptr(void* ptr) {
    uintptr_t seg_addr = (uintptr_t)ptr & ~(uintptr_t)(SEGMENT_SIZE - 1);
    Segment* seg = (Segment*)seg_addr;
    size_t offset = (uintptr_t)ptr - seg_addr;
    size_t page_idx = offset / (64 * 1024);
    return &seg->descriptors[page_idx];
}
```
**Expected gain**: **+10-15%** (eliminates hash lookups)
**Files**:
- `hakmem_pool.c`: ~150 LOC (segment allocator)
- `hakmem_pool.h`: Data structure changes
**Risk**: Medium
- Requires memory layout redesign
- Backward compatibility (migrate existing pages?)
#### MF3: Simplify the Allocation Path (8-10 hours)
**Goal**: cut 7 layers down to 3
**Current path**:
1. TLS Ring (32)
2. TLS LIFO (256)
3. Active Page A
4. Active Page B
5. Transfer Cache
6. Shard Freelist (mutex)
7. Remote Stack
**Simplified path**:
1. **TLS Cache** (unified 256-slot ring)
2. **Per-shard Freelist** (lock-free)
3. **Remote Stack** (lock-free)
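Sketched in code, the three remaining layers might compose like this (names hypothetical; the lock-free pop mirrors MF1):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;

typedef struct { PoolBlock* items[256]; int top; } TLSCache;

static _Thread_local TLSCache g_tls;  // Layer 1: unified TLS ring
static atomic_uintptr_t g_freelist;   // Layer 2: lock-free shard freelist
static atomic_uintptr_t g_remote;     // Layer 3: cross-thread remote stack

// Generic lock-free stack pop shared by layers 2 and 3.
static PoolBlock* stack_pop(atomic_uintptr_t* head) {
    uintptr_t old = atomic_load_explicit(head, memory_order_acquire);
    PoolBlock* blk;
    do {
        if (old == 0) return NULL;
        blk = (PoolBlock*)old;
    } while (!atomic_compare_exchange_weak_explicit(
                 head, &old, (uintptr_t)blk->next,
                 memory_order_acquire, memory_order_acquire));
    return blk;
}

// Simplified 3-layer allocation path: TLS cache, then freelist, then remote.
static void* pool_alloc_fast(void) {
    if (g_tls.top > 0) return g_tls.items[--g_tls.top];
    PoolBlock* b = stack_pop(&g_freelist);
    if (b) return b;
    return stack_pop(&g_remote);
}
```

The fast path is then one predictable branch before memory is returned, which is the point of the simplification.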
**Changes**:
- Remove: Active Pages (bump-run)
- Remove: Transfer Cache
- Keep: TLS Cache (unified), Freelist, Remote
**Rationale**:
- Active Pages: complex but low payoff (the Ring suffices)
- Transfer Cache: owner-awareness is nice, but the overhead is large
**Expected gain**: **+5-8%** (fewer branches, less complexity)
**Files**:
- `hakmem_pool.c`: ~200 LOC (refactor allocation path)
**Risk**: Low
- Code deletion is safer than addition
- Verify against regressions via testing
---
### Phase 7.3: Moonshot (60 hours) → +50-70%
**Goal**: fully port mimalloc's architecture
#### MS1: Per-Page Sharding (60 hours)
**Goal**: Global sharded freelists → Per-page freelists (mimalloc style)
**Current**: 7 classes × 8 shards = **56 global freelists**
**mimalloc**: Thousands of **per-page freelists**
**Architecture**:
```c
// Thread-local heap
typedef struct {
    Page* pages[SIZE_BINS]; // per-size-class page lists
} Heap;

// Per-page freelist
typedef struct Page {
    void* free;                    // freelist head (in-page)
    void* local_free;              // same-thread frees
    atomic_uintptr_t xthread_free; // cross-thread frees
    int used;
    int capacity;
    struct Page* next;             // next page in heap
} Page;
```
**Allocation**:
```c
void* mi_malloc(size_t size) {
    Heap* heap = mi_get_heap();
    int bin = size_to_bin(size);
    Page* page = heap->pages[bin];
    if (page && page->free) {      // guard: the bin may have no page yet
        void* block = page->free;
        page->free = *(void**)block;
        return block;
    }
    return mi_page_malloc(heap, page, size); // slow path
}
```
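The free path is where the per-page split pays off; a hedged sketch of how same-thread and cross-thread frees could be separated (field names follow the `Page` struct above, function names hypothetical):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef struct Page {
    void* free;                     // allocation freelist (owner only)
    void* local_free;               // same-thread frees (owner only)
    atomic_uintptr_t xthread_free;  // cross-thread frees (CAS-pushed)
    int used;
} Page;

// Same-thread frees touch only plain fields; cross-thread frees CAS-push
// onto xthread_free so the owning thread never takes a lock.
static void page_free_block(Page* page, void* block, bool same_thread) {
    if (same_thread) {
        *(void**)block = page->local_free;
        page->local_free = block;
        page->used--;
    } else {
        uintptr_t old = atomic_load_explicit(&page->xthread_free,
                                             memory_order_relaxed);
        do {
            *(void**)block = (void*)old;
        } while (!atomic_compare_exchange_weak_explicit(
                     &page->xthread_free, &old, (uintptr_t)block,
                     memory_order_release, memory_order_relaxed));
        // `used` is reconciled by the owner in page_collect().
    }
}

// The owner drains cross-thread frees with a single atomic swap.
static void page_collect(Page* page) {
    uintptr_t list = atomic_exchange_explicit(&page->xthread_free, 0,
                                              memory_order_acquire);
    while (list) {
        void* block = (void*)list;
        list = (uintptr_t)*(void**)block;
        *(void**)block = page->local_free;
        page->local_free = block;
        page->used--;
    }
}
```

The key design choice is that only cross-thread frees pay for atomics; the owner's path stays plain loads and stores.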
**Expected gain**: **+50-70%** (roughly on par with mimalloc)
**Files**:
- `hakmem_pool.c`: Complete rewrite (~500 LOC)
- `hakmem_pool.h`: New data structures
**Risk**: High
- Architectural overhaul
- 60 hours implementation + testing
- Backward compatibility concerns
**Recommendation**: **skip if Phase 7.2 already delivers enough**
---
## 🎯 Recommended Implementation Order
### Option A: MF1 First (high ROI, medium risk)
**Week 1**: implement MF1 (Lock-Free Freelist)
- Day 1-2: implement the lock-free primitives
- Day 3: integration & unit testing
- Day 4: benchmark & validation
**Expected result**: 13.78 → **15.8-17.2 M/s** (+15-25%)
**Pros**:
- Resolves the biggest bottleneck
- Standalone fix
- Proven technique
**Cons**:
- ABA problem risk
- Memory ordering bugs
### Option B: Quick Fixes First (low risk, low ROI)
**Week 1**: implement QF1+QF2+QF3
- Day 1: QF1 (fewer trylock probes) + QF2 (unified ring)
- Day 2: QF3 (header skip) + testing
- Day 3: benchmark
**Expected result**: 13.78 → **14.5-15.2 M/s** (+5-10%)
**Pros**:
- Low risk
- Quick wins
- Groundwork for MF1
**Cons**:
- Limited impact
### Option C: Hybrid (balanced)
**Week 1**: Quick Fixes (8h)
**Week 2**: MF1 (12h)
**Week 3**: MF2 or MF3
**Expected result**: 13.78 → **18-20 M/s** (+30-45%)
**Pros**:
- Spreads the risk
- Incremental progress
**Cons**:
- Takes longer
---
## 📊 Projected Performance
### Cumulative Gains
| Phase | Changes | Expected Gain | Cumulative | vs mimalloc |
|-------|---------|---------------|------------|-------------|
| **Current** | — | baseline | 13.78 M/s | **46.7%** |
| + QF1-3 | Trylock + Ring + Header | +5-10% | 14.5-15.2 M/s | 49-51% |
| + MF1 | Lock-free Freelist | +15-25% | 17.2-19.0 M/s | **58-64%** ✅ |
| + MF2 | Pointer Arithmetic | +10-15% | 19.9-22.8 M/s | **67-77%** ✅ |
| + MF3 | Path Simplification | +5-8% | 21.9-25.4 M/s | **74-86%** 🎉 |
| + MS1 | Per-Page Sharding | +50-70% | 27.0-31.0 M/s | **91-105%** 🚀 |
**Target achievement**:
- **60% target**: nearly reached with MF1 (58-64%)
- **75% target**: reached with MF2 (67-77%)
- **100% target**: within reach with MS1 (91-105%)
---
## ⚠️ Risks & Mitigation
### MF1: Lock-Free Freelist
**Risks**:
1. **ABA problem**: Block A popped, freed, reallocated, pushed back while CAS in-flight
2. **Memory ordering bugs**: Incorrect acquire/release semantics
3. **Livelock**: the CAS retry loop may fail to converge
**Mitigation**:
1. **ABA**: Not critical for freelist (ABA still valid), use epoch-based reclamation if needed
2. **Memory ordering**: Extensive testing with TSan, code review
3. **Livelock**: Add exponential backoff after N retries
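The backoff could be folded straight into the retry loop; a sketch with arbitrary spin/cap constants (a real build might use a pause instruction instead of the volatile spin):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>
#include <sched.h>

typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;

// Pop with capped exponential backoff: each failed CAS doubles a short
// spin-wait, and once the cap is hit the thread yields instead of
// hammering the cache line.
static PoolBlock* freelist_pop_backoff(atomic_uintptr_t* head_ptr) {
    unsigned delay = 1;
    uintptr_t old = atomic_load_explicit(head_ptr, memory_order_acquire);
    for (;;) {
        if (old == 0) return NULL;
        PoolBlock* blk = (PoolBlock*)old;
        if (atomic_compare_exchange_weak_explicit(
                head_ptr, &old, (uintptr_t)blk->next,
                memory_order_acquire, memory_order_acquire))
            return blk;
        for (volatile unsigned i = 0; i < delay; ++i) { } // spin-wait
        if (delay < 1024) delay <<= 1;
        else sched_yield();
    }
}
```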
### MF2: Pointer Arithmetic
**Risks**:
1. **Alignment failures**: environments where mmap does not guarantee 4 MiB alignment
2. **Backward compatibility**: existing pages cannot be migrated
3. **Memory fragmentation**: allocating whole segments → waste
**Mitigation**:
1. **Alignment**: Use `MAP_ALIGNED` or manual alignment (fallback to hash table if failed)
2. **Backward compat**: migrate gradually; keep the hash table as a fallback
3. **Fragmentation**: 4 MiB segments are an acceptable cost (proven by mimalloc)
### MF3: Path Simplification
**Risks**:
1. **Performance regression**: could removing Active Pages slow down burst patterns?
2. **Code complexity**: the refactor may introduce new bugs
**Mitigation**:
1. **Regression**: verify with benchmarks; roll back if problems appear
2. **Complexity**: Incremental refactor, extensive testing
---
## 🎓 Success Criteria
### Primary Metrics
| Metric | Current | Target (60%) | Target (75%) | Measurement |
|--------|---------|--------------|--------------|-------------|
| **Mid 4T Throughput** | 13.78 M/s | 17.70 M/s | 22.13 M/s | Larson 10s |
| **vs mimalloc** | 46.7% | **60%** | **75%** | Ratio |
### Secondary Metrics
| Metric | Target | Measurement |
|--------|--------|-------------|
| **Memory footprint** | <50 MB | RSS baseline |
| **Fragmentation** | <10% | (RSS - user) / user |
| **Regression (Tiny)** | <5% | Larson Tiny 4T |
| **Regression (Large)** | <5% | Larson Large 4T |
### Specialized Workloads
| Workload | Target | Test |
|----------|--------|------|
| **Burst allocation** | >mimalloc | Custom benchmark |
| **Locality-aware** | >mimalloc | Site-based pattern |
| **Producer-Consumer** | >mimalloc | Multi-threaded queue |
---
## 📅 Timeline Estimate
### Conservative (full implementation)
| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| **1** | QF1-3 | Quick fixes | 8 |
| **2** | MF1 | Lock-free freelist | 12 |
| **3** | MF2 | Pointer arithmetic | 10 |
| **4** | MF3 | Path simplification | 10 |
| **5-6** | MS1 | Per-page sharding (optional) | 60 |
| **Total** | | | **40-100 hours** |
### Aggressive (60% target only)
| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| **1** | MF1 | Lock-free freelist | 12 |
| **Total** | | | **12 hours** |
**Recommendation**: **implement MF1 only → benchmark → decide**
---
## 🎬 Conclusion
**Current Status**: hakmem sits at **46.7%** of mimalloc (Phases 6.25-6.27 failed)
**Root Cause**: Lock contention (50%) + Hash lookups (25%) + Branching (15%)
**Battle Plan**:
- **Phase 7.1**: Quick Fixes → +5-10% (8 hours)
- **Phase 7.2**: Medium Fixes → +25-35% (20-30 hours)
- **MF1 (Lock-Free)**: +15-25% ⭐⭐⭐
- **MF2 (Pointer Arithmetic)**: +10-15%
- **MF3 (Simplify Path)**: +5-8%
- **Phase 7.3**: Moonshot → +50-70% (60 hours, optional)
**Recommended Action**: **implement MF1 (Lock-Free Freelist) now!**
**Expected Outcome**: **58-64% of mimalloc** (hits the 60% target!)
**Let's goooo, time to take mimalloc down!** 🔥🔥🔥
---
**Created**: 2025-10-24 14:30 JST
**Status**: ✅ **Battle plan complete; ready to implement MF1**
**Next action**: start implementing MF1 (Lock-Free Freelist)