# Operation: Take Down mimalloc — Phase 7 Battle Plan

**Date**: 2025-10-24
**Goal**: raise hakmem to **60-75%** of mimalloc
**Current**: **46.7%** (13.78 M/s vs 29.50 M/s, Mid 4T)
**Gap**: **15.72 M/s** (53.3%)

---

## 🎯 Strategic Goals

### Primary Goal: Universal Performance

**Target**: **60-75%** of mimalloc on the larson benchmark (Mid 4T)

- Current: 13.78 M/s (**46.7%**)
- Target: 17.70-22.13 M/s (**60-75%**)
- **Required improvement**: +3.92-8.35 M/s (**+28-61%**)

### Secondary Goal: Superiority on Specialized Workloads

**Target**: play to hakmem's design strengths

- **Burst allocation**: 110-120% of mimalloc
- **Locality-aware**: 110-130% of mimalloc
- **Producer-Consumer**: 105-115% of mimalloc

---

## 📊 Current-State Analysis (diagnosis by Task)

### The Four Big Bottlenecks

| Rank | Bottleneck | Cost (cycles) | Impact | Fix |
|------|------------|---------------|--------|-----|
| **#1** | **Lock Contention** (56 mutexes) | 56 | **50%** | MF1: Lock-free |
| **#2** | **Hash Lookups** (page descriptors) | 30 | **25%** | MF2: Pointer arithmetic |
| **#3** | **Excess Branching** (7-10 branches) | 15 | **15%** | MF3: Simplify path |
| **#4** | **Metadata Overhead** (inline headers) | 10 | **10%** | Long-term |

**Total overhead**: ~110 cycles/allocation vs ~5 cycles for mimalloc

### Problems with the hakmem Architecture

**Over-engineered**:

- **7-layer TLS caching**: Ring → LIFO → Pages × 2 → TC → Freelist → Remote
- **56 mutexes**: 7 classes × 8 shards (37.5% contention rate @ 4T)
- **Hash table**: 2-4 lookups per allocation (10-20 cycles each, plus cache misses)

**Advantages of the mimalloc architecture**:

- **Only 2 layers**: thread-local heap → per-page freelist
- **0 locks**: lock-free CAS only
- **Pointer arithmetic**: 3-4 instructions, no hash

---

## 🚀 Three-Stage Roadmap

### Phase 7.1: Quick Fixes (8 hours) → +5-10%

**Goal**: quick wins from low-risk optimizations

#### QF1: Reduce Trylock Probes (2 hours)

**Current**: `g_trylock_probes = 3` (up to 3 trylock attempts)

**Problem**: every probe costs:

- a hash calculation (shard index)
- a mutex trylock (50-200 cycles if contended)
- a branch

**Fix**: reduce probes to **1**

```c
// Before
for (int probe = 0; probe < g_trylock_probes; ++probe) {  // 3 iterations
    int s = (shard_idx + probe) & (POOL_NUM_SHARDS - 1);
    if (pthread_mutex_trylock(lock) == 0) { ... }
}

// After
int s = shard_idx;  // No loop
if (pthread_mutex_trylock(lock) == 0) { ... }
```

**Expected gain**: +2-3% (less trylock overhead)
**File**: `hakmem_pool.c` (~10 LOC)

#### QF2: Merge Ring + LIFO (4 hours)

**Current**: two tiers, Ring (32 slots) + LIFO (256 blocks)

**Problems**:

- two checks (Ring → LIFO)
- complex spill logic

**Fix**: a single unified cache (256-slot ring)

```c
// Before
typedef struct {
    PoolTLSRing ring;  // 32 slots
    PoolTLSBin  bin;   // LIFO, 256 blocks
} TLSCache;

// After
typedef struct {
    PoolBlock* items[256];  // Unified ring (larger)
    int top;
} TLSCache;
```

**Expected gain**: +1-2% (fewer branches, better cache locality)
**File**: `hakmem_pool.c` (~50 LOC)

#### QF3: Skip Header Writes (2 hours)

**Current**: the fast path writes the header on every allocation

```c
mid_set_header(hdr, size, site_id);  // 5-10 cycles
```

**Fix**: blocks popped from the ring already carry a valid header → skip the write

```c
// Only write header when refilling from freelist
if (from_freelist) {
    mid_set_header(hdr, size, site_id);
}
```

**Expected gain**: +1-2% (less header-write overhead)
**File**: `hakmem_pool.c` (~20 LOC)

---

### Phase 7.2: Medium Fixes (20-30 hours) → +25-35%

**Goal**: resolve architectural bottlenecks

#### MF1: Lock-Free Freelist ⭐⭐⭐ (12 hours)

**Goal**: replace the 56 mutexes with atomic CAS

**Current**:

```c
static PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
static PoolBlock*  freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```

**After**:

```c
static atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```

**Implementation**:

1. **Lock-free pop**:
```c
PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head, new_head;
    PoolBlock* block;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_acquire);
        if (old_head == 0) return NULL;  // Empty
        block = (PoolBlock*)old_head;
        new_head = (uintptr_t)block->next;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, new_head,
        memory_order_release, memory_order_relaxed));
    return block;
}
```

2. **Lock-free push**:

```c
void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, (uintptr_t)block,
        memory_order_release, memory_order_relaxed));
}
```

3. **Batch push** (for refill):

```c
void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                  PoolBlock* head, PoolBlock* tail) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, (uintptr_t)head,
        memory_order_release, memory_order_relaxed));
}
```

**ABA mitigation**:

- **Not critical for a freelist**: an ABA race still leaves the freelist in a valid state
- **Alternative**: add a version counter (128-bit CAS) if issues arise
- **Epoch-based reclamation**: defer block reuse (if needed)

**Expected gain**: **+15-25%** (lock contention eliminated entirely)

**Files**:

- `hakmem_pool.c`: ~100 LOC (replace mutex calls with CAS)

**Risk**: Medium

- memory-ordering bugs
- ABA problem (low risk)

**Testing**:

- ThreadSanitizer (TSan)
- stress test (16+ threads, 60 s)
- correctness test (no lost blocks, no double-free)
#### MF2: Pointer-Arithmetic Page Lookup (8-10 hours)

**Goal**: replace the hash-table lookup with pointer arithmetic

**Current**: `mid_desc_lookup(ptr)` → hash table (10-20 cycles + mutex)

**mimalloc approach**:

```c
// Segment-aligned memory (4 MiB alignment)
// Segment layout: [Page Descriptors 128KB][Pages 3968KB]

// Derive the page descriptor from a block pointer
uintptr_t segment = (uintptr_t)ptr & ~(4*1024*1024 - 1);  // Mask
size_t offset     = (uintptr_t)ptr - segment;
size_t page_idx   = offset / PAGE_SIZE;
PageDesc* desc    = &((Segment*)segment)->descriptors[page_idx];
```

**Cost**: 3-4 instructions (mask, subtract, divide, array index)

**Implementation**:

1. **Segment allocation**:

```c
#define SEGMENT_SIZE      (4 * 1024 * 1024)
#define SEGMENT_ALIGNMENT SEGMENT_SIZE

void* segment = mmap(NULL, SEGMENT_SIZE,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_ALIGNED(22),  // 4 MiB aligned
                     -1, 0);
```

2. **Page descriptor array**:

```c
typedef struct {
    PageDesc descriptors[64];     // metadata area (first 128 KB of the segment)
    char     pages[62][64*1024];  // 62 usable 64 KiB pages (3968 KB)
} Segment;

Segment* seg = (Segment*)segment;
```

3. **Fast lookup**:

```c
static inline PageDesc* page_desc_from_ptr(void* ptr) {
    uintptr_t seg_addr = (uintptr_t)ptr & ~(SEGMENT_SIZE - 1);
    Segment* seg = (Segment*)seg_addr;
    size_t offset   = (uintptr_t)ptr - seg_addr;
    size_t page_idx = offset / (64*1024);
    return &seg->descriptors[page_idx];
}
```

**Expected gain**: **+10-15%** (hash lookup eliminated)

**Files**:

- `hakmem_pool.c`: ~150 LOC (segment allocator)
- `hakmem_pool.h`: data-structure changes

**Risk**: Medium

- requires a memory-layout redesign
- backward compatibility (migrate existing pages?)

#### MF3: Simplify the Allocation Path (8-10 hours)

**Goal**: cut 7 layers down to 3

**Current path**:

1. TLS Ring (32)
2. TLS LIFO (256)
3. Active Page A
4. Active Page B
5. Transfer Cache
6. Shard Freelist (mutex)
7. Remote Stack

**Simplified path**:

1. **TLS Cache** (unified 256-slot ring)
2. **Per-shard Freelist** (lock-free)
3. **Remote Stack** (lock-free)
**Changes**:

- Remove: Active Pages (bump-run)
- Remove: Transfer Cache
- Keep: TLS Cache (unified), Freelist, Remote

**Rationale**:

- Active Pages: complex but low payoff (the ring suffices)
- Transfer Cache: owner-awareness is nice, but the overhead is large

**Expected gain**: **+5-8%** (fewer branches, less complexity)

**Files**:

- `hakmem_pool.c`: ~200 LOC (refactor the allocation path)

**Risk**: Low

- deleting code is safer than adding it
- check for regressions in testing

---

### Phase 7.3: Moonshot (60 hours) → +50-70%

**Goal**: port the mimalloc architecture wholesale

#### MS1: Per-Page Sharding (60 hours)

**Goal**: global sharded freelists → per-page freelists (mimalloc style)

**Current**: 7 classes × 8 shards = **56 global freelists**
**mimalloc**: thousands of **per-page freelists**

**Architecture**:

```c
// Thread-local heap
typedef struct {
    Page* pages[SIZE_BINS];  // Per-size-class page lists
} Heap;

// Per-page freelist
typedef struct Page {
    void* free;                     // Freelist head (in-page)
    void* local_free;               // Same-thread frees
    atomic_uintptr_t xthread_free;  // Cross-thread frees
    int used;
    int capacity;
    struct Page* next;              // Next page in heap
} Page;
```

**Allocation**:

```c
void* mi_malloc(size_t size) {
    Heap* heap = mi_get_heap();
    int bin = size_to_bin(size);
    Page* page = heap->pages[bin];
    if (page->free) {
        void* block = page->free;
        page->free = *(void**)block;
        return block;
    }
    return mi_page_malloc(heap, page, size);  // Slow path
}
```

**Expected gain**: **+50-70%** (roughly on par with mimalloc)

**Files**:

- `hakmem_pool.c`: complete rewrite (~500 LOC)
- `hakmem_pool.h`: new data structures

**Risk**: High

- architectural overhaul
- 60 hours of implementation plus testing
- backward-compatibility concerns

**Recommendation**: **skip this if Phase 7.2 delivers enough**

---

## 🎯 Recommended Implementation Order

### Option A: MF1 first (high ROI, medium risk)

**Week 1**: implement MF1 (Lock-Free Freelist)

- Day 1-2: implement the lock-free primitives
- Day 3: integration & unit testing
- Day 4: benchmark & validation

**Expected result**: 13.78 → **15.8-17.2 M/s** (+15-25%)

**Pros**:

- fixes the biggest bottleneck
- Standalone fix
- Proven technique

**Cons**:

- ABA-problem risk
- memory-ordering bugs

### Option B: Quick Fixes first (low risk, low ROI)

**Week 1**: implement QF1+QF2+QF3

- Day 1: QF1 (fewer trylocks) + QF2 (unified ring)
- Day 2: QF3 (header skip) + testing
- Day 3: benchmark

**Expected result**: 13.78 → **14.5-15.2 M/s** (+5-10%)

**Pros**:

- Low risk
- Quick wins
- groundwork for MF1

**Cons**:

- limited impact

### Option C: Hybrid (balanced)

**Week 1**: Quick Fixes (8 h)
**Week 2**: MF1 (12 h)
**Week 3**: MF2 or MF3

**Expected result**: 13.78 → **18-20 M/s** (+30-45%)

**Pros**:

- spreads the risk
- Incremental progress

**Cons**:

- takes longer

---

## 📊 Projected Performance

### Cumulative Gains

| Phase | Changes | Expected Gain | Cumulative | vs mimalloc |
|-------|---------|---------------|------------|-------------|
| **Current** | — | baseline | 13.78 M/s | **46.7%** |
| + QF1-3 | Trylock + Ring + Header | +5-10% | 14.5-15.2 M/s | 49-51% |
| + MF1 | Lock-free Freelist | +15-25% | 17.2-19.0 M/s | **58-64%** ✅ |
| + MF2 | Pointer Arithmetic | +10-15% | 19.9-22.8 M/s | **67-77%** ✅ |
| + MF3 | Path Simplification | +5-8% | 21.9-25.4 M/s | **74-86%** 🎉 |
| + MS1 | Per-Page Sharding | +50-70% | 27.0-31.0 M/s | **91-105%** 🚀 |

**Target achievement**:

- **60% target**: nearly reached with MF1 (58-64%)
- **75% target**: reached with MF2 (67-77%)
- **100% target**: possible with MS1 (91-105%)

---

## ⚠️ Risks & Mitigation

### MF1: Lock-Free Freelist

**Risks**:

1. **ABA problem**: block A is popped, freed, reallocated, and pushed back while a CAS is in flight
2. **Memory-ordering bugs**: incorrect acquire/release semantics
3. **Livelock**: the CAS retry loop fails to converge

**Mitigation**:

1. **ABA**: not critical for a freelist (the result is still a valid list); use epoch-based reclamation if needed
2. **Memory ordering**: extensive testing with TSan, plus code review
3. **Livelock**: add exponential backoff after N retries

### MF2: Pointer Arithmetic

**Risks**:

1. **Alignment failures**: environments where mmap does not guarantee 4 MiB alignment
2. **Backward compatibility**: existing pages cannot be migrated
3. **Memory fragmentation**: allocating whole segments wastes space
**Mitigation**:

1. **Alignment**: use `MAP_ALIGNED` or manual alignment (fall back to the hash table on failure)
2. **Backward compat**: migrate gradually; keep the hash table as a fallback
3. **Fragmentation**: 4 MiB segments are an acceptable cost (proven in practice by mimalloc)

### MF3: Path Simplification

**Risks**:

1. **Performance regression**: does removing Active Pages slow down burst patterns?
2. **Code complexity**: the refactor could introduce new bugs

**Mitigation**:

1. **Regression**: verify with benchmarks; roll back if problems appear
2. **Complexity**: incremental refactor, extensive testing

---

## 🎓 Success Criteria

### Primary Metrics

| Metric | Current | Target (60%) | Target (75%) | Measurement |
|--------|---------|--------------|--------------|-------------|
| **Mid 4T Throughput** | 13.78 M/s | 17.70 M/s | 22.13 M/s | Larson 10 s |
| **vs mimalloc** | 46.7% | **60%** | **75%** | Ratio |

### Secondary Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| **Memory footprint** | <50 MB | RSS baseline |
| **Fragmentation** | <10% | (RSS - user) / user |
| **Regression (Tiny)** | <5% | Larson Tiny 4T |
| **Regression (Large)** | <5% | Larson Large 4T |

### Specialized Workloads

| Workload | Target | Test |
|----------|--------|------|
| **Burst allocation** | >mimalloc | Custom benchmark |
| **Locality-aware** | >mimalloc | Site-based pattern |
| **Producer-Consumer** | >mimalloc | Multi-threaded queue |

---

## 📅 Timeline Estimate

### Conservative (full implementation)

| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| **1** | QF1-3 | Quick fixes | 8 |
| **2** | MF1 | Lock-free freelist | 12 |
| **3** | MF2 | Pointer arithmetic | 10 |
| **4** | MF3 | Path simplification | 10 |
| **5-6** | MS1 | Per-page sharding (optional) | 60 |
| **Total** | | | **40-100 hours** |

### Aggressive (60% target only)

| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| **1** | MF1 | Lock-free freelist | 12 |
| **Total** | | | **12 hours** |

**Recommendation**: **implement MF1 only → benchmark → then decide**
---

## 🎬 Conclusion

**Current status**: hakmem sits at **46.7%** of mimalloc (Phases 6.25-6.27 failed)

**Root cause**: lock contention (50%) + hash lookups (25%) + branching (15%)

**Battle Plan**:

- **Phase 7.1**: Quick Fixes → +5-10% (8 hours)
- **Phase 7.2**: Medium Fixes → +25-35% (20-30 hours)
  - **MF1 (Lock-Free)**: +15-25% ⭐⭐⭐
  - **MF2 (Pointer Arithmetic)**: +10-15%
  - **MF3 (Simplify Path)**: +5-8%
- **Phase 7.3**: Moonshot → +50-70% (60 hours, optional)

**Recommended action**: **implement MF1 (Lock-Free Freelist) right away!**

**Expected outcome**: **58-64% of mimalloc** (60% target reached!)

**Let's take down mimalloc!** 🔥🔥🔥

---

**Created**: 2025-10-24 14:30 JST
**Status**: ✅ **Battle plan complete; ready to implement MF1**
**Next action**: start implementing MF1 (Lock-Free Freelist)