# Operation: Take Down mimalloc — Phase 7 Battle Plan

**Date**: 2025-10-24
**Goal**: raise hakmem to **60-75%** of mimalloc
**Current**: **46.7%** (13.78 M/s vs 29.50 M/s, Mid 4T)
**Gap**: **15.72 M/s** (53.3%)

---

## 🎯 Strategic Goals

### Primary Goal: Universal Performance

**Target**: **60-75%** of mimalloc on the larson benchmark (Mid 4T)

- Current: 13.78 M/s (**46.7%**)
- Target: 17.70-22.13 M/s (**60-75%**)
- **Required improvement**: +3.92-8.35 M/s (**+28-61%**)

### Secondary Goal: Superiority on Specialized Workloads

**Target**: play to hakmem's design strengths

- **Burst allocation**: 110-120% of mimalloc
- **Locality-aware**: 110-130% of mimalloc
- **Producer-Consumer**: 105-115% of mimalloc

---

## 📊 Current-State Analysis (diagnosis by Task)

### The Four Big Bottlenecks

| Rank | Bottleneck | Cost (cycles) | Impact | Fix |
|------|------------|---------------|--------|-----|
| **#1** | **Lock Contention** (56 mutexes) | 56 | **50%** | MF1: Lock-free |
| **#2** | **Hash Lookups** (page descriptors) | 30 | **25%** | MF2: Pointer arithmetic |
| **#3** | **Excess Branching** (7-10 branches) | 15 | **15%** | MF3: Simplify path |
| **#4** | **Metadata Overhead** (inline headers) | 10 | **10%** | Long-term |

**Total overhead**: ~110 cycles/allocation vs ~5 cycles for mimalloc

### Problems with the hakmem Architecture

**Over-engineered**:

- **7-layer TLS caching**: Ring → LIFO → Pages × 2 → TC → Freelist → Remote
- **56 mutexes**: 7 classes × 8 shards (37.5% contention rate @ 4T)
- **Hash table**: 2-4 lookups per allocation (10-20 cycles each, plus cache misses)

**Advantages of the mimalloc architecture**:

- **Only 2 layers**: thread-local heap → per-page freelist
- **0 locks**: lock-free CAS only
- **Pointer arithmetic**: 3-4 instructions, no hash

---

## 🚀 Three-Stage Roadmap

### Phase 7.1: Quick Fixes (8 hours) → +5-10%

**Goal**: quick wins from low-risk optimizations

#### QF1: Reduce Trylock Probes (2 hours)

**Current**: `g_trylock_probes = 3` (up to 3 trylock attempts)

**Problem**: every probe costs:

- a hash calculation (shard index)
- a mutex trylock (50-200 cycles if contended)
- a branch

**Fix**: reduce probes to **1**

```c
// Before
for (int probe = 0; probe < g_trylock_probes; ++probe) {  // 3 iterations
    int s = (shard_idx + probe) & (POOL_NUM_SHARDS - 1);
    if (pthread_mutex_trylock(lock) == 0) { ... }
}

// After
int s = shard_idx;  // No loop
if (pthread_mutex_trylock(lock) == 0) { ... }
```

**Expected gain**: +2-3% (less trylock overhead)
**File**: `hakmem_pool.c` (~10 LOC)

#### QF2: Merge Ring + LIFO (4 hours)

**Current**: two tiers, Ring (32 slots) + LIFO (256 blocks)

**Problems**:

- two checks (Ring → LIFO)
- complex spill logic

**Fix**: a single unified cache (256-slot ring)

```c
// Before
typedef struct {
    PoolTLSRing ring;  // 32 slots
    PoolTLSBin  bin;   // LIFO, 256 blocks
} TLSCache;

// After
typedef struct {
    PoolBlock* items[256];  // Unified ring (larger)
    int top;
} TLSCache;
```

**Expected gain**: +1-2% (fewer branches, better cache locality)
**File**: `hakmem_pool.c` (~50 LOC)

#### QF3: Skip Header Writes (2 hours)

**Current**: the fast path writes the header on every allocation

```c
mid_set_header(hdr, size, site_id);  // 5-10 cycles
```

**Fix**: blocks popped from the ring already carry a valid header → skip the write

```c
// Only write header when refilling from freelist
if (from_freelist) {
    mid_set_header(hdr, size, site_id);
}
```

**Expected gain**: +1-2% (less header-write overhead)
**File**: `hakmem_pool.c` (~20 LOC)

---

### Phase 7.2: Medium Fixes (20-30 hours) → +25-35%

**Goal**: resolve architectural bottlenecks

#### MF1: Lock-Free Freelist ⭐⭐⭐ (12 hours)

**Goal**: replace the 56 mutexes with atomic CAS

**Current**:

```c
static PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
static PoolBlock*  freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```

**After**:

```c
static atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```

**Implementation**:

1. **Lock-free pop**:
```c
PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head, new_head;
    PoolBlock* block;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_acquire);
        if (old_head == 0) return NULL;  // Empty
        block = (PoolBlock*)old_head;
        new_head = (uintptr_t)block->next;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, new_head,
        memory_order_release, memory_order_relaxed));
    return block;
}
```

2. **Lock-free push**:

```c
void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, (uintptr_t)block,
        memory_order_release, memory_order_relaxed));
}
```

3. **Batch push** (for refill):

```c
void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                  PoolBlock* head, PoolBlock* tail) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, (uintptr_t)head,
        memory_order_release, memory_order_relaxed));
}
```

**ABA mitigation**:

- **Not critical for a freelist**: an ABA race still leaves the freelist in a valid state
- **Alternative**: add a version counter (128-bit CAS) if issues arise
- **Epoch-based reclamation**: defer block reuse (if needed)

**Expected gain**: **+15-25%** (lock contention eliminated entirely)

**Files**:

- `hakmem_pool.c`: ~100 LOC (replace mutex calls with CAS)

**Risk**: Medium

- memory-ordering bugs
- ABA problem (low risk)

**Testing**:

- ThreadSanitizer (TSan)
- stress test (16+ threads, 60 s)
- correctness test (no lost blocks, no double-free)
#### MF2: Pointer-Arithmetic Page Lookup (8-10 hours)

**Goal**: replace the hash-table lookup with pointer arithmetic

**Current**: `mid_desc_lookup(ptr)` → hash table (10-20 cycles + mutex)

**mimalloc approach**:

```c
// Segment-aligned memory (4 MiB alignment)
// Segment layout: [Page Descriptors 128KB][Pages 3968KB]

// Derive the page descriptor from a block pointer
uintptr_t segment = (uintptr_t)ptr & ~(4*1024*1024 - 1);  // Mask
size_t offset     = (uintptr_t)ptr - segment;
size_t page_idx   = offset / PAGE_SIZE;
PageDesc* desc    = &((Segment*)segment)->descriptors[page_idx];
```

**Cost**: 3-4 instructions (mask, subtract, divide, array index)

**Implementation**:

1. **Segment allocation**:

```c
#define SEGMENT_SIZE      (4 * 1024 * 1024)
#define SEGMENT_ALIGNMENT SEGMENT_SIZE

void* segment = mmap(NULL, SEGMENT_SIZE,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_ALIGNED(22),  // 4 MiB aligned
                     -1, 0);
```

2. **Page descriptor array**:

```c
typedef struct {
    PageDesc descriptors[64];     // metadata area (first 128 KB of the segment)
    char     pages[62][64*1024];  // 62 usable 64 KiB pages (3968 KB)
} Segment;

Segment* seg = (Segment*)segment;
```

3. **Fast lookup**:

```c
static inline PageDesc* page_desc_from_ptr(void* ptr) {
    uintptr_t seg_addr = (uintptr_t)ptr & ~(SEGMENT_SIZE - 1);
    Segment* seg = (Segment*)seg_addr;
    size_t offset   = (uintptr_t)ptr - seg_addr;
    size_t page_idx = offset / (64*1024);
    return &seg->descriptors[page_idx];
}
```

**Expected gain**: **+10-15%** (hash lookup eliminated)

**Files**:

- `hakmem_pool.c`: ~150 LOC (segment allocator)
- `hakmem_pool.h`: data-structure changes

**Risk**: Medium

- requires a memory-layout redesign
- backward compatibility (migrate existing pages?)

#### MF3: Simplify the Allocation Path (8-10 hours)

**Goal**: cut 7 layers down to 3

**Current path**:

1. TLS Ring (32)
2. TLS LIFO (256)
3. Active Page A
4. Active Page B
5. Transfer Cache
6. Shard Freelist (mutex)
7. Remote Stack

**Simplified path**:

1. **TLS Cache** (unified 256-slot ring)
2. **Per-shard Freelist** (lock-free)
3. **Remote Stack** (lock-free)
**Changes**:

- Remove: Active Pages (bump-run)
- Remove: Transfer Cache
- Keep: TLS Cache (unified), Freelist, Remote

**Rationale**:

- Active Pages: complex but low payoff (the ring suffices)
- Transfer Cache: owner-awareness is nice, but the overhead is large

**Expected gain**: **+5-8%** (fewer branches, less complexity)

**Files**:

- `hakmem_pool.c`: ~200 LOC (refactor the allocation path)

**Risk**: Low

- deleting code is safer than adding it
- check for regressions in testing

---

### Phase 7.3: Moonshot (60 hours) → +50-70%

**Goal**: port the mimalloc architecture wholesale

#### MS1: Per-Page Sharding (60 hours)

**Goal**: global sharded freelists → per-page freelists (mimalloc style)

**Current**: 7 classes × 8 shards = **56 global freelists**
**mimalloc**: thousands of **per-page freelists**

**Architecture**:

```c
// Thread-local heap
typedef struct {
    Page* pages[SIZE_BINS];  // Per-size-class page lists
} Heap;

// Per-page freelist
typedef struct Page {
    void* free;                     // Freelist head (in-page)
    void* local_free;               // Same-thread frees
    atomic_uintptr_t xthread_free;  // Cross-thread frees
    int used;
    int capacity;
    struct Page* next;              // Next page in heap
} Page;
```

**Allocation**:

```c
void* mi_malloc(size_t size) {
    Heap* heap = mi_get_heap();
    int bin = size_to_bin(size);
    Page* page = heap->pages[bin];
    if (page->free) {
        void* block = page->free;
        page->free = *(void**)block;
        return block;
    }
    return mi_page_malloc(heap, page, size);  // Slow path
}
```

**Expected gain**: **+50-70%** (roughly on par with mimalloc)

**Files**:

- `hakmem_pool.c`: complete rewrite (~500 LOC)
- `hakmem_pool.h`: new data structures

**Risk**: High

- architectural overhaul
- 60 hours of implementation plus testing
- backward-compatibility concerns

**Recommendation**: **skip this if Phase 7.2 delivers enough**

---

## 🎯 Recommended Implementation Order

### Option A: MF1 first (high ROI, medium risk)

**Week 1**: implement MF1 (Lock-Free Freelist)

- Day 1-2: implement the lock-free primitives
- Day 3: integration & unit testing
- Day 4: benchmark & validation

**Expected result**: 13.78 → **15.8-17.2 M/s** (+15-25%)

**Pros**:

- fixes the biggest bottleneck
- Standalone fix
- Proven technique

**Cons**:

- ABA-problem risk
- memory-ordering bugs

### Option B: Quick Fixes first (low risk, low ROI)

**Week 1**: implement QF1+QF2+QF3

- Day 1: QF1 (fewer trylocks) + QF2 (unified ring)
- Day 2: QF3 (header skip) + testing
- Day 3: benchmark

**Expected result**: 13.78 → **14.5-15.2 M/s** (+5-10%)

**Pros**:

- Low risk
- Quick wins
- groundwork for MF1

**Cons**:

- limited impact

### Option C: Hybrid (balanced)

**Week 1**: Quick Fixes (8 h)
**Week 2**: MF1 (12 h)
**Week 3**: MF2 or MF3

**Expected result**: 13.78 → **18-20 M/s** (+30-45%)

**Pros**:

- spreads the risk
- Incremental progress

**Cons**:

- takes longer

---

## 📊 Projected Performance

### Cumulative Gains

| Phase | Changes | Expected Gain | Cumulative | vs mimalloc |
|-------|---------|---------------|------------|-------------|
| **Current** | — | baseline | 13.78 M/s | **46.7%** |
| + QF1-3 | Trylock + Ring + Header | +5-10% | 14.5-15.2 M/s | 49-51% |
| + MF1 | Lock-free Freelist | +15-25% | 17.2-19.0 M/s | **58-64%** ✅ |
| + MF2 | Pointer Arithmetic | +10-15% | 19.9-22.8 M/s | **67-77%** ✅ |
| + MF3 | Path Simplification | +5-8% | 21.9-25.4 M/s | **74-86%** 🎉 |
| + MS1 | Per-Page Sharding | +50-70% | 27.0-31.0 M/s | **91-105%** 🚀 |

**Target achievement**:

- **60% target**: nearly reached with MF1 (58-64%)
- **75% target**: reached with MF2 (67-77%)
- **100% target**: possible with MS1 (91-105%)

---

## ⚠️ Risks & Mitigation

### MF1: Lock-Free Freelist

**Risks**:

1. **ABA problem**: block A is popped, freed, reallocated, and pushed back while a CAS is in flight
2. **Memory-ordering bugs**: incorrect acquire/release semantics
3. **Livelock**: the CAS retry loop fails to converge

**Mitigation**:

1. **ABA**: not critical for a freelist (the result is still a valid list); use epoch-based reclamation if needed
2. **Memory ordering**: extensive testing with TSan, plus code review
3. **Livelock**: add exponential backoff after N retries

### MF2: Pointer Arithmetic

**Risks**:

1. **Alignment failures**: environments where mmap does not guarantee 4 MiB alignment
2. **Backward compatibility**: existing pages cannot be migrated
3. **Memory fragmentation**: allocating whole segments wastes space
**Mitigation**:

1. **Alignment**: use `MAP_ALIGNED` or manual alignment (fall back to the hash table on failure)
2. **Backward compat**: migrate gradually; keep the hash table as a fallback
3. **Fragmentation**: 4 MiB segments are an acceptable cost (proven in practice by mimalloc)

### MF3: Path Simplification

**Risks**:

1. **Performance regression**: does removing Active Pages slow down burst patterns?
2. **Code complexity**: the refactor could introduce new bugs

**Mitigation**:

1. **Regression**: verify with benchmarks; roll back if problems appear
2. **Complexity**: incremental refactor, extensive testing

---

## 🎓 Success Criteria

### Primary Metrics

| Metric | Current | Target (60%) | Target (75%) | Measurement |
|--------|---------|--------------|--------------|-------------|
| **Mid 4T Throughput** | 13.78 M/s | 17.70 M/s | 22.13 M/s | Larson 10 s |
| **vs mimalloc** | 46.7% | **60%** | **75%** | Ratio |

### Secondary Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| **Memory footprint** | <50 MB | RSS baseline |
| **Fragmentation** | <10% | (RSS - user) / user |
| **Regression (Tiny)** | <5% | Larson Tiny 4T |
| **Regression (Large)** | <5% | Larson Large 4T |

### Specialized Workloads

| Workload | Target | Test |
|----------|--------|------|
| **Burst allocation** | >mimalloc | Custom benchmark |
| **Locality-aware** | >mimalloc | Site-based pattern |
| **Producer-Consumer** | >mimalloc | Multi-threaded queue |

---

## 📅 Timeline Estimate

### Conservative (full implementation)

| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| **1** | QF1-3 | Quick fixes | 8 |
| **2** | MF1 | Lock-free freelist | 12 |
| **3** | MF2 | Pointer arithmetic | 10 |
| **4** | MF3 | Path simplification | 10 |
| **5-6** | MS1 | Per-page sharding (optional) | 60 |
| **Total** | | | **40-100 hours** |

### Aggressive (60% target only)

| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| **1** | MF1 | Lock-free freelist | 12 |
| **Total** | | | **12 hours** |

**Recommendation**: **implement MF1 only → benchmark → then decide**
---

## 🎬 Conclusion

**Current status**: hakmem sits at **46.7%** of mimalloc (Phases 6.25-6.27 failed)

**Root cause**: lock contention (50%) + hash lookups (25%) + branching (15%)

**Battle Plan**:

- **Phase 7.1**: Quick Fixes → +5-10% (8 hours)
- **Phase 7.2**: Medium Fixes → +25-35% (20-30 hours)
  - **MF1 (Lock-Free)**: +15-25% ⭐⭐⭐
  - **MF2 (Pointer Arithmetic)**: +10-15%
  - **MF3 (Simplify Path)**: +5-8%
- **Phase 7.3**: Moonshot → +50-70% (60 hours, optional)

**Recommended action**: **implement MF1 (Lock-Free Freelist) right away!**

**Expected outcome**: **58-64% of mimalloc** (60% target reached!)

**Let's take down mimalloc!** 🔥🔥🔥

---

**Created**: 2025-10-24 14:30 JST
**Status**: ✅ **Battle plan complete; ready to implement MF1**
**Next action**: start implementing MF1 (Lock-Free Freelist)