Operation Defeat mimalloc (Phase 7 Battle Plan)

Date: 2025-10-24. Goal: raise hakmem to 60-75% of mimalloc. Current: 46.7% (13.78 M/s vs 29.50 M/s, Mid 4T). Gap: 15.72 M/s (53.3%).


🎯 Strategic Goals

Primary Goal: Universal Performance

Target: 60-75% of mimalloc on the larson benchmark (Mid 4T)

  • Current: 13.78 M/s (46.7%)
  • Target: 17.70-22.13 M/s (60-75%)
  • Required improvement: +3.92-8.35 M/s (+28-61%)

Secondary Goal: Superiority on Specialized Workloads

Target: play to hakmem's design strengths

  • Burst allocation: 110-120% of mimalloc
  • Locality-aware: 110-130% of mimalloc
  • Producer-Consumer: 105-115% of mimalloc

📊 Current-State Analysis (diagnosis by the Task agent)

The 4 Major Bottlenecks

| Rank | Bottleneck | Cost (cycles) | Impact | Fix |
|------|------------|---------------|--------|-----|
| #1 | Lock Contention (56 mutexes) | 56 | 50% | MF1: Lock-free |
| #2 | Hash Lookups (page descriptors) | 30 | 25% | MF2: Pointer arithmetic |
| #3 | Excess Branching (7-10 branches) | 15 | 15% | MF3: Simplify path |
| #4 | Metadata Overhead (inline headers) | 10 | 10% | Long-term |

Total overhead: ~110 cycles/allocation vs mimalloc ~5 cycles

Problems with the hakmem Architecture

Over-engineered:

  • 7-layer TLS caching: Ring → LIFO → Pages × 2 → TC → Freelist → Remote
  • 56 mutexes: 7 classes × 8 shards (37.5% contention rate @ 4T)
  • Hash table: 2-4 lookups per allocation (10-20 cycles each, plus cache misses)

Strengths of the mimalloc Architecture:

  • Only 2 layers: Thread-local heap → Per-page freelist
  • 0 locks: Lock-free CAS only
  • Pointer arithmetic: 3-4 instructions, no hash

🚀 Three-Stage Roadmap

Phase 7.1: Quick Fixes (8 hours) → +5-10%

Goal: quick wins from low-risk optimizations

QF1: Reduce Trylock Probes (2 hours)

Current: g_trylock_probes = 3 (trylock is attempted up to 3 times)

Problem: each probe incurs:

  • Hash calculation (shard index)
  • Mutex trylock (50-200 cycles if contended)
  • Branch

Fix: reduce the probe count to 1

// Before
for (int probe = 0; probe < g_trylock_probes; ++probe) { // 3 iterations
    int s = (shard_idx + probe) & (POOL_NUM_SHARDS - 1);
    if (pthread_mutex_trylock(lock) == 0) { ... }
}

// After
int s = shard_idx;  // No loop
if (pthread_mutex_trylock(lock) == 0) { ... }

Expected gain: +2-3% (less trylock overhead)

File: hakmem_pool.c (~10 LOC)

QF2: Merge Ring + LIFO (4 hours)

Current: a two-tier setup of Ring (32 slots) + LIFO (256 blocks)

Problem:

  • Two checks on every allocation (Ring → LIFO)
  • Complex spill logic

Fix: a single unified cache (256-slot ring)

// Before
typedef struct {
    PoolTLSRing ring;   // 32 slots
    PoolTLSBin bin;     // LIFO 256 blocks
} TLSCache;

// After
typedef struct {
    PoolBlock* items[256];  // Unified ring (larger)
    int top;
} TLSCache;
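
For illustration, a minimal sketch of how the unified cache could be driven, assuming the TLSCache layout above (these helpers are hypothetical names, not existing hakmem APIs):

// Pop/push on the unified 256-slot cache; on miss or overflow the caller
// falls through to the shard freelist (not shown here).
static inline PoolBlock* tls_cache_pop(TLSCache* c) {
    if (c->top == 0) return NULL;        // empty -> refill from freelist
    return c->items[--c->top];
}

static inline int tls_cache_push(TLSCache* c, PoolBlock* b) {
    if (c->top == 256) return 0;         // full -> spill to freelist
    c->items[c->top++] = b;
    return 1;
}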

Expected gain: +1-2% (fewer branches, better cache locality)

File: hakmem_pool.c (~50 LOC)

QF3: Skip Header Writes (2 hours)

Current: the fast path writes the header on every allocation

mid_set_header(hdr, size, site_id);  // 5-10 cycles

Fix: a block popped from the Ring already has a valid header → skip the write

// Only write header when refilling from freelist
if (from_freelist) {
    mid_set_header(hdr, size, site_id);
}

Expected gain: +1-2% (less header-write overhead)

File: hakmem_pool.c (~20 LOC)


Phase 7.2: Medium Fixes (20-30 hours) → +25-35%

Goal: resolve the architectural bottlenecks

MF1: Lock-Free Freelist (12 hours)

Goal: replace the 56 mutexes with atomic CAS

Current:

static PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
static PoolBlock* freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

After:

static atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

Implementation:

  1. Lock-free pop:
PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head, new_head;
    PoolBlock* block;

    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_acquire);
        if (old_head == 0) return NULL;  // Empty

        block = (PoolBlock*)old_head;
        new_head = (uintptr_t)block->next;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, new_head,
        memory_order_release, memory_order_relaxed));

    return block;
}
  2. Lock-free push:
void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;

    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, (uintptr_t)block,
        memory_order_release, memory_order_relaxed));
}
  3. Batch push (for refill):
void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                    PoolBlock* head, PoolBlock* tail) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;

    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, (uintptr_t)head,
        memory_order_release, memory_order_relaxed));
}

Handling the ABA problem:

  • Not critical for freelist: ABA still results in valid freelist state
  • Alternative: Add version counter (128-bit CAS) if issues arise
  • Epoch-based reclamation: Defer block reuse (if needed)
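
Should ABA ever show up in practice, a version-tagged head is the standard remedy. A minimal sketch, assuming the target supports a 16-byte compare-exchange (x86-64 cmpxchg16b, typically requiring -mcx16; otherwise the compiler may fall back to a lock via libatomic). Names here are hypothetical, not current hakmem code:

// Head pointer paired with a version counter; the version changes on every
// successful pop, so a recycled pointer can no longer satisfy a stale CAS.
typedef struct {
    PoolBlock* ptr;
    uintptr_t  ver;
} TaggedHead;

static _Atomic TaggedHead tagged_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

PoolBlock* freelist_pop_tagged(int class_idx, int shard_idx) {
    _Atomic TaggedHead* h = &tagged_head[class_idx][shard_idx];
    TaggedHead old_head = atomic_load_explicit(h, memory_order_acquire);
    TaggedHead new_head;
    do {
        if (old_head.ptr == NULL) return NULL;   // empty
        new_head.ptr = old_head.ptr->next;
        new_head.ver = old_head.ver + 1;         // bump version on pop
    } while (!atomic_compare_exchange_weak_explicit(
        h, &old_head, new_head,
        memory_order_acq_rel, memory_order_acquire));
    return old_head.ptr;
}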

Expected gain: +15-25% (lock contention eliminated entirely)

Files:

  • hakmem_pool.c: ~100 LOC (replace mutex calls with CAS)

Risk: Medium

  • Memory ordering bugs
  • ABA problem (low risk)

Testing:

  • ThreadSanitizer (TSan)
  • Stress test (16+ threads, 60s)
  • Correctness test (no lost blocks, no double-free)

MF2: Pointer Arithmetic Page Lookup (8-10 hours)

Goal: replace hash-table lookups with pointer arithmetic

Current: mid_desc_lookup(ptr) → Hash table (10-20 cycles + mutex)

mimalloc approach:

// Segment-aligned memory (4 MiB alignment)
Segment: [Page Descriptors 128KB][Pages 3968KB]

// Derive page descriptor from block pointer
uintptr_t seg_addr = (uintptr_t)ptr & ~(uintptr_t)(4*1024*1024 - 1);  // mask to 4 MiB boundary
Segment* segment = (Segment*)seg_addr;
size_t offset = (uintptr_t)ptr - seg_addr;
size_t page_idx = offset / PAGE_SIZE;
PageDesc* desc = &segment->descriptors[page_idx];

Cost: 3-4 instructions (mask, subtract, divide, array index)

Implementation:

  1. Segment allocation:
#define SEGMENT_SIZE (4 * 1024 * 1024)
#define SEGMENT_ALIGNMENT SEGMENT_SIZE

void* segment = mmap(NULL, SEGMENT_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_ALIGNED(22),  // 4 MiB aligned (MAP_ALIGNED is a BSD extension; see Risks for the Linux fallback)
                      -1, 0);
  2. Page descriptor array:
typedef struct {
    PageDesc descriptors[64];  // one descriptor per page (~128 KB)
    char pages[64][64*1024];   // 64 × 64 KiB pages (the descriptor area uses ~128 KB of the segment, leaving ~62 usable pages)
} Segment;

Segment* seg = (Segment*)segment;
  3. Fast lookup:
static inline PageDesc* page_desc_from_ptr(void* ptr) {
    uintptr_t seg_addr = (uintptr_t)ptr & ~(SEGMENT_SIZE - 1);
    Segment* seg = (Segment*)seg_addr;
    size_t offset = (uintptr_t)ptr - seg_addr;
    size_t page_idx = offset / (64*1024);
    return &seg->descriptors[page_idx];
}

Expected gain: +10-15% (hash lookup eliminated)

Files:

  • hakmem_pool.c: ~150 LOC (segment allocator)
  • hakmem_pool.h: Data structure changes

Risk: Medium

  • Requires memory layout redesign
  • Backward compatibility (migrate existing pages?)

MF3: Simplify the Allocation Path (8-10 hours)

Goal: cut the path from 7 layers down to 3

Current path:

  1. TLS Ring (32)
  2. TLS LIFO (256)
  3. Active Page A
  4. Active Page B
  5. Transfer Cache
  6. Shard Freelist (mutex)
  7. Remote Stack

Simplified path:

  1. TLS Cache (unified 256-slot ring)
  2. Per-shard Freelist (lock-free)
  3. Remote Stack (lock-free)

Changes:

  • Remove: Active Pages (bump-run)
  • Remove: Transfer Cache
  • Keep: TLS Cache (unified), Freelist, Remote

Rationale:

  • Active Pages: complex but low payoff (the Ring is sufficient)
  • Transfer Cache: owner-awareness is nice, but the overhead is too high
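
As a sketch of the simplified fast path described above (all helper names here are hypothetical placeholders, not the current hakmem API; the freelist call assumes MF1's lock-free version):

// 3-level path: TLS cache -> lock-free shard freelist -> remote stack.
void* pool_alloc_simplified(size_t size) {
    int cls = pool_size_class(size);              // hypothetical helper
    TLSCache* cache = pool_tls_cache(cls);        // hypothetical helper

    PoolBlock* b = tls_cache_pop(cache);          // Level 1: TLS cache
    if (b == NULL)
        b = freelist_pop_lockfree(cls, pool_shard_of_thread());  // Level 2
    if (b == NULL)
        b = remote_stack_drain(cls);              // Level 3: reclaim cross-thread frees
    if (b == NULL)
        return pool_alloc_slow(size);             // page/OS refill path
    return pool_block_to_user(b);                 // hypothetical header/offset handling
}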

Expected gain: +5-8% (fewer branches, less complexity)

Files:

  • hakmem_pool.c: ~200 LOC (refactor allocation path)

Risk: Low

  • Code deletion is safer than addition
  • Verify there are no regressions through testing

Phase 7.3: Moonshot (60 hours) → +50-70%

Goal: fully port the mimalloc architecture

MS1: Per-Page Sharding (60 hours)

Goal: Global sharded freelists → Per-page freelists (mimalloc style)

Current: 7 classes × 8 shards = 56 global freelists

mimalloc: Thousands of per-page freelists

Architecture:

// Thread-local heap
typedef struct {
    Page* pages[SIZE_BINS];  // Per-size-class page lists
} Heap;

// Per-page freelist
typedef struct Page {
    void* free;              // Freelist head (in-page)
    void* local_free;        // Same-thread frees
    atomic_uintptr_t xthread_free;  // Cross-thread frees
    int used;
    int capacity;
    struct Page* next;       // Next page in heap
} Page;

Allocation:

void* mi_malloc(size_t size) {
    Heap* heap = mi_get_heap();
    int bin = size_to_bin(size);
    Page* page = heap->pages[bin];

    if (page->free) {
        void* block = page->free;
        page->free = *(void**)block;
        return block;
    }

    return mi_page_malloc(heap, page, size);  // Slow path
}
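
The free path is the other half of this design; below is a hedged sketch of the mimalloc-style split between same-thread and cross-thread frees, using the Page fields above (the lookup and ownership helpers are assumed names, with the lookup standing in for the MF2-style pointer-arithmetic descriptor lookup):

void mi_free_sketch(void* ptr) {
    Page* page = page_from_ptr(ptr);              // assumed MF2-style lookup

    if (page_is_owned_by_this_thread(page)) {     // assumed helper
        // Same-thread free: plain pointer push, no atomics needed.
        *(void**)ptr = page->local_free;
        page->local_free = ptr;
        page->used--;
    } else {
        // Cross-thread free: lock-free push onto xthread_free; the owning
        // thread collects this list later on its slow path.
        uintptr_t old_head;
        do {
            old_head = atomic_load_explicit(&page->xthread_free, memory_order_relaxed);
            *(void**)ptr = (void*)old_head;
        } while (!atomic_compare_exchange_weak_explicit(
            &page->xthread_free, &old_head, (uintptr_t)ptr,
            memory_order_release, memory_order_relaxed));
    }
}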

Expected gain: +50-70% (roughly on par with mimalloc)

Files:

  • hakmem_pool.c: Complete rewrite (~500 LOC)
  • hakmem_pool.h: New data structures

Risk: High

  • Architectural overhaul
  • 60 hours implementation + testing
  • Backward compatibility concerns

Recommendation: skip this if Phase 7.2 already delivers enough improvement


🎯 Recommended Implementation Order

Option A: MF1 First (high ROI, medium risk)

Week 1: implement MF1 (Lock-Free Freelist)

  • Day 1-2: implement the lock-free primitives
  • Day 3: Integration & unit testing
  • Day 4: Benchmark & validation

Expected result: 13.78 → 15.8-17.2 M/s (+15-25%)

Pros:

  • Fixes the single largest bottleneck
  • Standalone fix
  • Proven technique

Cons:

  • Risk of ABA problems
  • Memory ordering bugs

Option B: Quick Fixes First (low risk, low ROI)

Week 1: implement QF1+QF2+QF3

  • Day 1: QF1 (trylock probe reduction) + QF2 (ring merge)
  • Day 2: QF3 (Header skip) + testing
  • Day 3: Benchmark

Expected result: 13.78 → 14.5-15.2 M/s (+5-10%)

Pros:

  • Low risk
  • Quick wins
  • Groundwork for MF1

Cons:

  • Limited impact

Option C: Hybrid (balanced)

Week 1: Quick Fixes (8h) Week 2: MF1 (12h) Week 3: MF2 or MF3

Expected result: 13.78 → 18-20 M/s (+30-45%)

Pros:

  • Spreads the risk
  • Incremental progress

Cons:

  • Takes longer

📊 Projected Performance

Cumulative Gains

| Phase | Changes | Expected Gain | Cumulative | vs mimalloc |
|-------|---------|---------------|------------|-------------|
| Current | - | baseline | 13.78 M/s | 46.7% |
| + QF1-3 | Trylock + Ring + Header | +5-10% | 14.5-15.2 M/s | 49-51% |
| + MF1 | Lock-free Freelist | +15-25% | 17.2-19.0 M/s | 58-64% |
| + MF2 | Pointer Arithmetic | +10-15% | 19.9-22.8 M/s | 67-77% |
| + MF3 | Path Simplification | +5-8% | 21.9-25.4 M/s | 74-86% 🎉 |
| + MS1 | Per-Page Sharding | +50-70% | 27.0-31.0 M/s | 91-105% 🚀 |

Target achievement:

  • 60% target: essentially reached with MF1 (58-64%)
  • 75% target: reached with MF2 (67-77%)
  • 100% target: possible with MS1 (91-105%)

⚠️ Risks & Mitigation

MF1: Lock-Free Freelist

Risks:

  1. ABA problem: Block A popped, freed, reallocated, pushed back while CAS in-flight
  2. Memory ordering bugs: Incorrect acquire/release semantics
  3. Livelock: the CAS retry loop may fail to converge under heavy contention

Mitigation:

  1. ABA: Not critical for freelist (ABA still valid), use epoch-based reclamation if needed
  2. Memory ordering: Extensive testing with TSan, code review
  3. Livelock: Add exponential backoff after N retries
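
A minimal sketch of the backoff idea, to be called from the CAS retry loops after each failed attempt (the thresholds and pause primitive are assumptions; _mm_pause is x86-specific):

#include <immintrin.h>   // _mm_pause(); use a no-op or sched_yield() on other ISAs

// Exponential backoff: no delay for the first few retries, then an
// exponentially growing (but capped) number of pause instructions.
static inline void cas_backoff(int attempt) {
    if (attempt < 4) return;
    int spins = 1 << (attempt < 10 ? attempt : 10);
    for (int i = 0; i < spins; ++i) _mm_pause();
}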

MF2: Pointer Arithmetic

Risks:

  1. Alignment failures: environments where mmap does not guarantee 4 MiB alignment
  2. Backward compatibility: existing pages cannot be migrated
  3. Memory fragmentation: memory is reserved in whole segments → waste

Mitigation:

  1. Alignment: use MAP_ALIGNED or manual alignment (fall back to the hash table if it fails)
  2. Backward compat: gradual migration; keep the hash table as a fallback
  3. Fragmentation: 4 MiB segments are an acceptable cost (proven in practice by mimalloc)
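
For the manual-alignment fallback, a common Linux-friendly approach is to over-allocate and trim; a sketch with error handling and the hash-table fallback omitted:

#include <stdint.h>
#include <sys/mman.h>

// Map 2x the segment size, then unmap the unaligned head and tail so the
// remaining SEGMENT_SIZE bytes start on a SEGMENT_SIZE boundary.
static void* segment_mmap_aligned(size_t size) {
    size_t span = size * 2;
    char* raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;

    uintptr_t base = ((uintptr_t)raw + (size - 1)) & ~(uintptr_t)(size - 1);
    size_t head = base - (uintptr_t)raw;
    size_t tail = span - head - size;
    if (head) munmap(raw, head);
    if (tail) munmap((char*)base + size, tail);
    return (void*)base;
}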

MF3: Path Simplification

Risks:

  1. Performance regression: could removing the Active Pages slow down burst patterns?
  2. Code complexity: the refactor may introduce new bugs

Mitigation:

  1. Regression: verify with benchmarks; roll back if problems appear
  2. Complexity: Incremental refactor, extensive testing

🎓 Success Criteria

Primary Metrics

| Metric | Current | Target (60%) | Target (75%) | Measurement |
|--------|---------|--------------|--------------|-------------|
| Mid 4T Throughput | 13.78 M/s | 17.70 M/s | 22.13 M/s | Larson 10s |
| vs mimalloc | 46.7% | 60% | 75% | Ratio |

Secondary Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Memory footprint | <50 MB | RSS baseline |
| Fragmentation | <10% | (RSS - user) / user |
| Regression (Tiny) | <5% | Larson Tiny 4T |
| Regression (Large) | <5% | Larson Large 4T |

Specialized Workloads

| Workload | Target | Test |
|----------|--------|------|
| Burst allocation | >mimalloc | Custom benchmark |
| Locality-aware | >mimalloc | Site-based pattern |
| Producer-Consumer | >mimalloc | Multi-threaded queue |

📅 Timeline Estimate

Conservative (full implementation)

| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| 1 | QF1-3 | Quick fixes | 8 |
| 2 | MF1 | Lock-free freelist | 12 |
| 3 | MF2 | Pointer arithmetic | 10 |
| 4 | MF3 | Path simplification | 10 |
| 5-6 | MS1 | Per-page sharding (optional) | 60 |
| Total | | | 40-100 |

Aggressive (60% target only)

| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| 1 | MF1 | Lock-free freelist | 12 |
| Total | | | 12 |

Recommendation: implement only MF1 → benchmark → then decide


🎬 Conclusion

Current Status: hakmem is at 46.7% of mimalloc (the Phase 6.25-6.27 attempts failed to close the gap)

Root Cause: Lock contention (50%) + Hash lookups (25%) + Branching (15%)

Battle Plan:

  • Phase 7.1: Quick Fixes → +5-10% (8 hours)
  • Phase 7.2: Medium Fixes → +25-35% (20-30 hours)
    • MF1 (Lock-Free): +15-25%
    • MF2 (Pointer Arithmetic): +10-15%
    • MF3 (Simplify Path): +5-8%
  • Phase 7.3: Moonshot → +50-70% (60 hours, optional)

Recommended Action: implement MF1 (Lock-Free Freelist) right away!

Expected Outcome: 58-64% of mimalloc (60% target achieved!)

Let's go, we're taking down mimalloc! 🔥🔥🔥


Created: 2025-10-24 14:30 JST. Status: Battle Plan complete, ready to start MF1. Next action: begin implementing MF1 (Lock-Free Freelist).