# Defeating mimalloc: Phase 7 Battle Plan
**Date**: 2025-10-24
**Goal**: lift hakmem to **60-75%** of mimalloc
**Current**: **46.7%** (13.78 M/s vs 29.50 M/s, Mid 4T)
**Gap**: **15.72 M/s** (53.3%)
---
## 🎯 Strategic Goals
### Primary Goal: Universal Performance
**Target**: **60-75%** of mimalloc on the larson benchmark (Mid 4T)
- Current: 13.78 M/s (**46.7%**)
- Target: 17.70-22.13 M/s (**60-75%**)
- **Required improvement**: +3.92-8.35 M/s (**+28-61%**)
### Secondary Goal: Superiority on Specialized Workloads
**Target**: play to hakmem's design philosophy
- **Burst allocation**: 110-120% of mimalloc
- **Locality-aware**: 110-130% of mimalloc
- **Producer-Consumer**: 105-115% of mimalloc
---
## 📊 Current-State Analysis (Diagnosis by Task)
### The 4 Major Bottlenecks
| Rank | Bottleneck | Cost (cycles) | Impact | Fix |
|------|------------|---------------|--------|-----|
| **#1** | **Lock Contention** (56 mutexes) | 56 | **50%** | MF1: Lock-free |
| **#2** | **Hash Lookups** (page descriptors) | 30 | **25%** | MF2: Pointer arithmetic |
| **#3** | **Excess Branching** (7-10 branches) | 15 | **15%** | MF3: Simplify path |
| **#4** | **Metadata Overhead** (inline headers) | 10 | **10%** | Long-term |
**Total overhead**: ~110 cycles/allocation vs mimalloc ~5 cycles
### Problems with the hakmem Architecture
**Over-engineered**:
- **7-layer TLS caching**: Ring → LIFO → Pages × 2 → TC → Freelist → Remote
- **56 mutexes**: 7 classes × 8 shards (37.5% contention rate @ 4T)
- **Hash table**: 2-4 lookups per allocation (10-20 cycles each + cache miss)
**Advantages of the mimalloc Architecture**:
- **Only 2 layers**: Thread-local heap → Per-page freelist
- **0 locks**: Lock-free CAS only
- **Pointer arithmetic**: 3-4 instructions, no hash
---
## 🚀 Three-Stage Roadmap
### Phase 7.1: Quick Fixes (8 hours) → +5-10%
**Goal**: quick wins from low-risk optimizations
#### QF1: Reduce Trylock Probes (2 hours)
**Current**: `g_trylock_probes = 3` (up to 3 trylock attempts)
**Problem**: every probe incurs:
- Hash calculation (shard index)
- Mutex trylock (50-200 cycles if contended)
- Branch
**Fix**: reduce probes to **1**
```c
// Before
for (int probe = 0; probe < g_trylock_probes; ++probe) { // 3 iterations
    int s = (shard_idx + probe) & (POOL_NUM_SHARDS - 1);
    if (pthread_mutex_trylock(lock) == 0) { ... }
}

// After
int s = shard_idx; // no probing loop
if (pthread_mutex_trylock(lock) == 0) { ... }
```
**Expected gain**: +2-3% (less trylock overhead)
**File**: `hakmem_pool.c` (~10 LOC)
#### QF2: Unify Ring + LIFO (4 hours)
**Current**: two tiers, Ring (32 slots) + LIFO (256 blocks)
**Problem**:
- Two checks (Ring → LIFO)
- Complex spill logic
**Fix**: Single unified cache (256 slots ring)
```c
// Before
typedef struct {
    PoolTLSRing ring; // 32 slots
    PoolTLSBin  bin;  // LIFO, 256 blocks
} TLSCache;

// After
typedef struct {
    PoolBlock* items[256]; // unified ring (larger)
    int top;
} TLSCache;
```
**Expected gain**: +1-2% (fewer branches, better cache locality)
**File**: `hakmem_pool.c` (~50 LOC)
#### QF3: Skip Header Writes (2 hours)
**Current**: the fast path writes the header on every allocation
```c
mid_set_header(hdr, size, site_id); // 5-10 cycles
```
**Fix**: a block popped from the Ring already has a valid header → skip the write
```c
// Only write the header when refilling from the freelist
if (from_freelist) {
    mid_set_header(hdr, size, site_id);
}
```
**Expected gain**: +1-2% (less header-write overhead)
**File**: `hakmem_pool.c` (~20 LOC)
---
### Phase 7.2: Medium Fixes (20-30 hours) → +25-35%
**Goal**: resolve the architectural bottlenecks
#### MF1: Lock-Free Freelist ⭐⭐⭐ (12 hours)
**Goal**: replace the 56 mutexes with atomic CAS
**Current**:
```c
static PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
static PoolBlock* freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```
**After**:
```c
static atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```
**Implementation**:
1. **Lock-free pop**:
```c
PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head, new_head;
    PoolBlock* block;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_acquire);
        if (old_head == 0) return NULL; // empty
        block = (PoolBlock*)old_head;
        new_head = (uintptr_t)block->next;
    } while (!atomic_compare_exchange_weak_explicit(
                 head_ptr, &old_head, new_head,
                 memory_order_release, memory_order_relaxed));
    return block;
}
```
2. **Lock-free push**:
```c
void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 head_ptr, &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_relaxed));
}
```
3. **Batch push** (for refill):
```c
void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                  PoolBlock* head, PoolBlock* tail) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 head_ptr, &old_head, (uintptr_t)head,
                 memory_order_release, memory_order_relaxed));
}
```
**ABA problem countermeasures**:
- **Not critical for freelist**: ABA still results in valid freelist state
- **Alternative**: Add version counter (128-bit CAS) if issues arise
- **Epoch-based reclamation**: Defer block reuse (if needed)
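If a version counter is ever needed without committing to a 128-bit CAS, one portable option is to pack a generation tag into the unused high bits of a 64-bit head word (user-space pointers fit in 48 bits on current x86-64/AArch64). A minimal sketch under that assumption, with hypothetical names:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;

// Pack a 16-bit generation tag into bits 48-63 of the head word; bumping
// the tag on every pop makes a recycled address distinguishable from the
// original, defeating ABA for up to 2^16 in-flight generations.
#define TAG_SHIFT 48
#define PTR_MASK  ((UINT64_C(1) << TAG_SHIFT) - 1)

static _Atomic uint64_t g_head; // 0 == empty list with tag 0

static uint64_t pack(PoolBlock* p, uint64_t tag) {
    return ((uint64_t)(uintptr_t)p & PTR_MASK) | (tag << TAG_SHIFT);
}

static PoolBlock* tagged_pop(void) {
    uint64_t old = atomic_load_explicit(&g_head, memory_order_acquire);
    for (;;) {
        PoolBlock* blk = (PoolBlock*)(uintptr_t)(old & PTR_MASK);
        if (!blk) return NULL;
        uint64_t tag = (old >> TAG_SHIFT) + 1; // generation bump on pop
        if (atomic_compare_exchange_weak_explicit(
                &g_head, &old, pack(blk->next, tag),
                memory_order_acquire, memory_order_acquire))
            return blk;
    }
}

static void tagged_push(PoolBlock* blk) {
    uint64_t old = atomic_load_explicit(&g_head, memory_order_relaxed);
    for (;;) {
        blk->next = (PoolBlock*)(uintptr_t)(old & PTR_MASK);
        uint64_t tag = old >> TAG_SHIFT; // push keeps the tag; only pop bumps
        if (atomic_compare_exchange_weak_explicit(
                &g_head, &old, pack(blk, tag),
                memory_order_release, memory_order_relaxed))
            return;
    }
}
```

This trades a hard architectural assumption (48-bit pointers) for full lock-freedom on a plain 64-bit CAS, so it should stay behind a fallback if hakmem targets other address layouts.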
**Expected gain**: **+15-25%** (eliminates lock contention entirely)
**Files**:
- `hakmem_pool.c`: ~100 LOC (replace mutex calls with CAS)
**Risk**: Medium
- Memory ordering bugs
- ABA problem (low risk)
**Testing**:
- ThreadSanitizer (TSan)
- Stress test (16+ threads, 60s)
- Correctness test (no lost blocks, no double-free)
#### MF2: Pointer Arithmetic Page Lookup (8-10 hours)
**Goal**: replace the hash-table lookup with pointer arithmetic
**Current**: `mid_desc_lookup(ptr)` → Hash table (10-20 cycles + mutex)
**mimalloc approach**:
```c
// Segment-aligned memory (4 MiB alignment)
// Segment layout: [page descriptors 128 KB][pages 3968 KB]
// Derive the page descriptor from a block pointer:
uintptr_t seg_addr = (uintptr_t)ptr & ~(uintptr_t)(4*1024*1024 - 1); // mask to segment base
Segment* segment = (Segment*)seg_addr;
size_t offset = (uintptr_t)ptr - seg_addr;
size_t page_idx = offset / PAGE_SIZE;
PageDesc* desc = &segment->descriptors[page_idx];
```
**Cost**: 3-4 instructions (mask, subtract, shift, array index; the power-of-two division compiles to a shift)
**Implementation**:
1. **Segment allocation**:
```c
#define SEGMENT_SIZE      (4 * 1024 * 1024)
#define SEGMENT_ALIGNMENT SEGMENT_SIZE

// NOTE: MAP_ALIGNED(n) exists only on the BSDs; Linux needs manual alignment.
void* segment = mmap(NULL, SEGMENT_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_ALIGNED(22), // 4 MiB aligned
                     -1, 0);
```
2. **Page descriptor array**:
```c
typedef struct {
    PageDesc descriptors[64];      // 128 KB of metadata (assumes 2 KB PageDesc);
                                   // entries 0-1 map the header itself
    char     pages[62][64 * 1024]; // 62 usable pages = 3968 KB
} Segment;                         // total: exactly 4 MiB
Segment* seg = (Segment*)segment;
```
3. **Fast lookup**:
```c
static inline PageDesc* page_desc_from_ptr(void* ptr) {
    uintptr_t seg_addr = (uintptr_t)ptr & ~(uintptr_t)(SEGMENT_SIZE - 1);
    Segment* seg = (Segment*)seg_addr;
    size_t offset = (uintptr_t)ptr - seg_addr;
    size_t page_idx = offset / (64 * 1024);
    return &seg->descriptors[page_idx];
}
```
**Expected gain**: **+10-15%** (eliminates hash lookups)
**Files**:
- `hakmem_pool.c`: ~150 LOC (segment allocator)
- `hakmem_pool.h`: Data structure changes
**Risk**: Medium
- Requires memory layout redesign
- Backward compatibility (migrate existing pages?)
#### MF3: Simplify the Allocation Path (8-10 hours)
**Goal**: cut 7 layers down to 3
**Current path**:
1. TLS Ring (32)
2. TLS LIFO (256)
3. Active Page A
4. Active Page B
5. Transfer Cache
6. Shard Freelist (mutex)
7. Remote Stack
**Simplified path**:
1. **TLS Cache** (unified 256-slot ring)
2. **Per-shard Freelist** (lock-free)
3. **Remote Stack** (lock-free)
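Sketched in code, the three remaining layers might compose like this (names hypothetical; the lock-free pop mirrors MF1):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;

typedef struct { PoolBlock* items[256]; int top; } TLSCache;

static _Thread_local TLSCache g_tls;  // Layer 1: unified TLS ring
static atomic_uintptr_t g_freelist;   // Layer 2: lock-free shard freelist
static atomic_uintptr_t g_remote;     // Layer 3: cross-thread remote stack

// Generic lock-free stack pop shared by layers 2 and 3.
static PoolBlock* stack_pop(atomic_uintptr_t* head) {
    uintptr_t old = atomic_load_explicit(head, memory_order_acquire);
    PoolBlock* blk;
    do {
        if (old == 0) return NULL;
        blk = (PoolBlock*)old;
    } while (!atomic_compare_exchange_weak_explicit(
                 head, &old, (uintptr_t)blk->next,
                 memory_order_acquire, memory_order_acquire));
    return blk;
}

// Simplified 3-layer allocation path: TLS cache, then freelist, then remote.
static void* pool_alloc_fast(void) {
    if (g_tls.top > 0) return g_tls.items[--g_tls.top];
    PoolBlock* b = stack_pop(&g_freelist);
    if (b) return b;
    return stack_pop(&g_remote);
}
```

The fast path is then one predictable branch before memory is returned, which is the point of the simplification.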
**Changes**:
- Remove: Active Pages (bump-run)
- Remove: Transfer Cache
- Keep: TLS Cache (unified), Freelist, Remote
**Rationale**:
- Active Pages: complex but low payoff (the Ring suffices)
- Transfer Cache: owner-awareness is nice, but the overhead is large
**Expected gain**: **+5-8%** (fewer branches, less complexity)
**Files**:
- `hakmem_pool.c`: ~200 LOC (refactor allocation path)
**Risk**: Low
- Code deletion is safer than addition
- Verify against regressions via testing
---
### Phase 7.3: Moonshot (60 hours) → +50-70%
**Goal**: fully port mimalloc's architecture
#### MS1: Per-Page Sharding (60 hours)
**Goal**: Global sharded freelists → Per-page freelists (mimalloc style)
**Current**: 7 classes × 8 shards = **56 global freelists**
**mimalloc**: Thousands of **per-page freelists**
**Architecture**:
```c
// Thread-local heap
typedef struct {
    Page* pages[SIZE_BINS]; // per-size-class page lists
} Heap;

// Per-page freelist
typedef struct Page {
    void* free;                    // freelist head (in-page)
    void* local_free;              // same-thread frees
    atomic_uintptr_t xthread_free; // cross-thread frees
    int used;
    int capacity;
    struct Page* next;             // next page in heap
} Page;
```
**Allocation**:
```c
void* mi_malloc(size_t size) {
    Heap* heap = mi_get_heap();
    int bin = size_to_bin(size);
    Page* page = heap->pages[bin];
    if (page && page->free) {      // guard: the bin may have no page yet
        void* block = page->free;
        page->free = *(void**)block;
        return block;
    }
    return mi_page_malloc(heap, page, size); // slow path
}
```
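The free path is where the per-page split pays off; a hedged sketch of how same-thread and cross-thread frees could be separated (field names follow the `Page` struct above, function names hypothetical):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef struct Page {
    void* free;                     // allocation freelist (owner only)
    void* local_free;               // same-thread frees (owner only)
    atomic_uintptr_t xthread_free;  // cross-thread frees (CAS-pushed)
    int used;
} Page;

// Same-thread frees touch only plain fields; cross-thread frees CAS-push
// onto xthread_free so the owning thread never takes a lock.
static void page_free_block(Page* page, void* block, bool same_thread) {
    if (same_thread) {
        *(void**)block = page->local_free;
        page->local_free = block;
        page->used--;
    } else {
        uintptr_t old = atomic_load_explicit(&page->xthread_free,
                                             memory_order_relaxed);
        do {
            *(void**)block = (void*)old;
        } while (!atomic_compare_exchange_weak_explicit(
                     &page->xthread_free, &old, (uintptr_t)block,
                     memory_order_release, memory_order_relaxed));
        // `used` is reconciled by the owner in page_collect().
    }
}

// The owner drains cross-thread frees with a single atomic swap.
static void page_collect(Page* page) {
    uintptr_t list = atomic_exchange_explicit(&page->xthread_free, 0,
                                              memory_order_acquire);
    while (list) {
        void* block = (void*)list;
        list = (uintptr_t)*(void**)block;
        *(void**)block = page->local_free;
        page->local_free = block;
        page->used--;
    }
}
```

The key design choice is that only cross-thread frees pay for atomics; the owner's path stays plain loads and stores.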
**Expected gain**: **+50-70%** (roughly on par with mimalloc)
**Files**:
- `hakmem_pool.c`: Complete rewrite (~500 LOC)
- `hakmem_pool.h`: New data structures
**Risk**: High
- Architectural overhaul
- 60 hours implementation + testing
- Backward compatibility concerns
**Recommendation**: **skip if Phase 7.2 already delivers enough**
---
## 🎯 Recommended Implementation Order
### Option A: MF1 First (high ROI, medium risk)
**Week 1**: implement MF1 (Lock-Free Freelist)
- Day 1-2: implement the lock-free primitives
- Day 3: integration & unit testing
- Day 4: benchmark & validation
**Expected result**: 13.78 → **15.8-17.2 M/s** (+15-25%)
**Pros**:
- Resolves the biggest bottleneck
- Standalone fix
- Proven technique
**Cons**:
- ABA problem risk
- Memory ordering bugs
### Option B: Quick Fixes First (low risk, low ROI)
**Week 1**: implement QF1+QF2+QF3
- Day 1: QF1 (fewer trylock probes) + QF2 (unified ring)
- Day 2: QF3 (header skip) + testing
- Day 3: benchmark
**Expected result**: 13.78 → **14.5-15.2 M/s** (+5-10%)
**Pros**:
- Low risk
- Quick wins
- Groundwork for MF1
**Cons**:
- Limited impact
### Option C: Hybrid (balanced)
**Week 1**: Quick Fixes (8h)
**Week 2**: MF1 (12h)
**Week 3**: MF2 or MF3
**Expected result**: 13.78 → **18-20 M/s** (+30-45%)
**Pros**:
- Spreads the risk
- Incremental progress
**Cons**:
- Takes longer
---
## 📊 Projected Performance
### Cumulative Gains
| Phase | Changes | Expected Gain | Cumulative | vs mimalloc |
|-------|---------|---------------|------------|-------------|
| **Current** | — | baseline | 13.78 M/s | **46.7%** |
| + QF1-3 | Trylock + Ring + Header | +5-10% | 14.5-15.2 M/s | 49-51% |
| + MF1 | Lock-free Freelist | +15-25% | 17.2-19.0 M/s | **58-64%** ✅ |
| + MF2 | Pointer Arithmetic | +10-15% | 19.9-22.8 M/s | **67-77%** ✅ |
| + MF3 | Path Simplification | +5-8% | 21.9-25.4 M/s | **74-86%** 🎉 |
| + MS1 | Per-Page Sharding | +50-70% | 27.0-31.0 M/s | **91-105%** 🚀 |
**Target achievement**:
- **60% target**: nearly reached with MF1 (58-64%)
- **75% target**: reached with MF2 (67-77%)
- **100% target**: within reach with MS1 (91-105%)
---
## ⚠️ Risks & Mitigation
### MF1: Lock-Free Freelist
**Risks**:
1. **ABA problem**: Block A popped, freed, reallocated, pushed back while CAS in-flight
2. **Memory ordering bugs**: Incorrect acquire/release semantics
3. **Livelock**: the CAS retry loop may fail to converge
**Mitigation**:
1. **ABA**: Not critical for freelist (ABA still valid), use epoch-based reclamation if needed
2. **Memory ordering**: Extensive testing with TSan, code review
3. **Livelock**: Add exponential backoff after N retries
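The backoff could be folded straight into the retry loop; a sketch with arbitrary spin/cap constants (a real build might use a pause instruction instead of the volatile spin):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>
#include <sched.h>

typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;

// Pop with capped exponential backoff: each failed CAS doubles a short
// spin-wait, and once the cap is hit the thread yields instead of
// hammering the cache line.
static PoolBlock* freelist_pop_backoff(atomic_uintptr_t* head_ptr) {
    unsigned delay = 1;
    uintptr_t old = atomic_load_explicit(head_ptr, memory_order_acquire);
    for (;;) {
        if (old == 0) return NULL;
        PoolBlock* blk = (PoolBlock*)old;
        if (atomic_compare_exchange_weak_explicit(
                head_ptr, &old, (uintptr_t)blk->next,
                memory_order_acquire, memory_order_acquire))
            return blk;
        for (volatile unsigned i = 0; i < delay; ++i) { } // spin-wait
        if (delay < 1024) delay <<= 1;
        else sched_yield();
    }
}
```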
### MF2: Pointer Arithmetic
**Risks**:
1. **Alignment failures**: environments where mmap does not guarantee 4 MiB alignment
2. **Backward compatibility**: existing pages cannot be migrated
3. **Memory fragmentation**: allocating whole segments → waste
**Mitigation**:
1. **Alignment**: Use `MAP_ALIGNED` or manual alignment (fallback to hash table if failed)
2. **Backward compat**: migrate gradually; keep the hash table as a fallback
3. **Fragmentation**: 4 MiB segments are an acceptable cost (proven by mimalloc)
### MF3: Path Simplification
**Risks**:
1. **Performance regression**: could removing Active Pages slow down burst patterns?
2. **Code complexity**: the refactor may introduce new bugs
**Mitigation**:
1. **Regression**: verify with benchmarks; roll back if problems appear
2. **Complexity**: Incremental refactor, extensive testing
---
## 🎓 Success Criteria
### Primary Metrics
| Metric | Current | Target (60%) | Target (75%) | Measurement |
|--------|---------|--------------|--------------|-------------|
| **Mid 4T Throughput** | 13.78 M/s | 17.70 M/s | 22.13 M/s | Larson 10s |
| **vs mimalloc** | 46.7% | **60%** | **75%** | Ratio |
### Secondary Metrics
| Metric | Target | Measurement |
|--------|--------|-------------|
| **Memory footprint** | <50 MB | RSS baseline |
| **Fragmentation** | <10% | (RSS - user) / user |
| **Regression (Tiny)** | <5% | Larson Tiny 4T |
| **Regression (Large)** | <5% | Larson Large 4T |
### Specialized Workloads
| Workload | Target | Test |
|----------|--------|------|
| **Burst allocation** | >mimalloc | Custom benchmark |
| **Locality-aware** | >mimalloc | Site-based pattern |
| **Producer-Consumer** | >mimalloc | Multi-threaded queue |
---
## 📅 Timeline Estimate
### Conservative (full implementation)
| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| **1** | QF1-3 | Quick fixes | 8 |
| **2** | MF1 | Lock-free freelist | 12 |
| **3** | MF2 | Pointer arithmetic | 10 |
| **4** | MF3 | Path simplification | 10 |
| **5-6** | MS1 | Per-page sharding (optional) | 60 |
| **Total** | | | **40-100 hours** |
### Aggressive (60% target only)
| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| **1** | MF1 | Lock-free freelist | 12 |
| **Total** | | | **12 hours** |
**Recommendation**: **implement MF1 only → benchmark → decide**
---
## 🎬 Conclusion
**Current Status**: hakmem sits at **46.7%** of mimalloc (Phases 6.25-6.27 failed)
**Root Cause**: Lock contention (50%) + Hash lookups (25%) + Branching (15%)
**Battle Plan**:
- **Phase 7.1**: Quick Fixes → +5-10% (8 hours)
- **Phase 7.2**: Medium Fixes → +25-35% (20-30 hours)
- **MF1 (Lock-Free)**: +15-25% ⭐⭐⭐
- **MF2 (Pointer Arithmetic)**: +10-15%
- **MF3 (Simplify Path)**: +5-8%
- **Phase 7.3**: Moonshot → +50-70% (60 hours, optional)
**Recommended Action**: **implement MF1 (Lock-Free Freelist) now!**
**Expected Outcome**: **58-64% of mimalloc** (hits the 60% target!)
**Let's goooo, time to take mimalloc down!** 🔥🔥🔥
---
**Created**: 2025-10-24 14:30 JST
**Status**: ✅ **Battle plan complete; ready to implement MF1**
**Next action**: start implementing MF1 (Lock-Free Freelist)