Operation Defeat mimalloc (Phase 7 Battle Plan)

Date: 2025-10-24. Goal: raise hakmem to 60-75% of mimalloc. Current: 46.7% (13.78 M/s vs 29.50 M/s, Mid 4T). Gap: 15.72 M/s (53.3%).


🎯 Strategic Goals

Primary Goal: Universal Performance

Target: 60-75% of mimalloc on the larson benchmark (Mid 4T)

  • Current: 13.78 M/s (46.7%)
  • Target: 17.70-22.13 M/s (60-75%)
  • Required improvement: +3.92-8.35 M/s (+28-61%)

Secondary Goal: Superiority on Specialized Workloads

Target: play to hakmem's design strengths

  • Burst allocation: 110-120% of mimalloc
  • Locality-aware: 110-130% of mimalloc
  • Producer-Consumer: 105-115% of mimalloc

📊 Current-State Analysis (diagnosis by the Task agent)

The 4 Major Bottlenecks

| Rank | Bottleneck | Cost (cycles) | Impact | Fix |
|------|------------|---------------|--------|-----|
| #1 | Lock Contention (56 mutexes) | 56 | 50% | MF1: Lock-free |
| #2 | Hash Lookups (page descriptors) | 30 | 25% | MF2: Pointer arithmetic |
| #3 | Excess Branching (7-10 branches) | 15 | 15% | MF3: Simplify path |
| #4 | Metadata Overhead (inline headers) | 10 | 10% | Long-term |

Total overhead: ~110 cycles/allocation vs mimalloc ~5 cycles

Problems with the hakmem Architecture

Over-engineered:

  • 7-layer TLS caching: Ring → LIFO → Pages × 2 → TC → Freelist → Remote
  • 56 mutexes: 7 classes × 8 shards (37.5% contention rate @ 4T)
  • Hash table: 2-4 lookups per allocation (10-20 cycles each, plus cache misses)

Strengths of the mimalloc Architecture:

  • Only 2 layers: Thread-local heap → Per-page freelist
  • 0 locks: Lock-free CAS only
  • Pointer arithmetic: 3-4 instructions, no hash

🚀 Three-Stage Roadmap

Phase 7.1: Quick Fixes (8 hours) → +5-10%

Goal: quick wins from low-risk optimizations

QF1: Reduce Trylock Probes (2 hours)

Current: g_trylock_probes = 3 (trylock is attempted up to 3 times)

Problem: each probe incurs:

  • Hash calculation (shard index)
  • Mutex trylock (50-200 cycles if contended)
  • Branch

Fix: reduce the probe count to 1

// Before
for (int probe = 0; probe < g_trylock_probes; ++probe) { // 3 iterations
    int s = (shard_idx + probe) & (POOL_NUM_SHARDS - 1);
    if (pthread_mutex_trylock(lock) == 0) { ... }
}

// After
int s = shard_idx;  // No loop
if (pthread_mutex_trylock(lock) == 0) { ... }

Expected gain: +2-3% (less trylock overhead)

File: hakmem_pool.c (~10 LOC)

QF2: Merge Ring + LIFO (4 hours)

Current: a two-tier setup of Ring (32 slots) + LIFO (256 blocks)

Problem:

  • Two checks on every allocation (Ring → LIFO)
  • Complex spill logic

Fix: a single unified cache (256-slot ring)

// Before
typedef struct {
    PoolTLSRing ring;   // 32 slots
    PoolTLSBin bin;     // LIFO 256 blocks
} TLSCache;

// After
typedef struct {
    PoolBlock* items[256];  // Unified ring (larger)
    int top;
} TLSCache;
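
For illustration, a minimal sketch of how the unified cache could be driven, assuming the TLSCache layout above (these helpers are hypothetical names, not existing hakmem APIs):

// Pop/push on the unified 256-slot cache; on miss or overflow the caller
// falls through to the shard freelist (not shown here).
static inline PoolBlock* tls_cache_pop(TLSCache* c) {
    if (c->top == 0) return NULL;        // empty -> refill from freelist
    return c->items[--c->top];
}

static inline int tls_cache_push(TLSCache* c, PoolBlock* b) {
    if (c->top == 256) return 0;         // full -> spill to freelist
    c->items[c->top++] = b;
    return 1;
}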

Expected gain: +1-2% (fewer branches, better cache locality)

File: hakmem_pool.c (~50 LOC)

QF3: Skip Header Writes (2 hours)

Current: the fast path writes the header on every allocation

mid_set_header(hdr, size, site_id);  // 5-10 cycles

Fix: a block popped from the Ring already has a valid header → skip the write

// Only write header when refilling from freelist
if (from_freelist) {
    mid_set_header(hdr, size, site_id);
}

Expected gain: +1-2% (less header-write overhead)

File: hakmem_pool.c (~20 LOC)


Phase 7.2: Medium Fixes (20-30 hours) → +25-35%

Goal: resolve the architectural bottlenecks

MF1: Lock-Free Freelist (12 hours)

Goal: replace the 56 mutexes with atomic CAS

Current:

static PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
static PoolBlock* freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

After:

static atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

Implementation:

  1. Lock-free pop:
PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head, new_head;
    PoolBlock* block;

    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_acquire);
        if (old_head == 0) return NULL;  // Empty

        block = (PoolBlock*)old_head;
        new_head = (uintptr_t)block->next;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, new_head,
        memory_order_release, memory_order_relaxed));

    return block;
}
  2. Lock-free push:
void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;

    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, (uintptr_t)block,
        memory_order_release, memory_order_relaxed));
}
  3. Batch push (for refill):
void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                    PoolBlock* head, PoolBlock* tail) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;

    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, (uintptr_t)head,
        memory_order_release, memory_order_relaxed));
}

Handling the ABA problem:

  • Not critical for freelist: ABA still results in valid freelist state
  • Alternative: Add version counter (128-bit CAS) if issues arise
  • Epoch-based reclamation: Defer block reuse (if needed)
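
Should ABA ever show up in practice, a version-tagged head is the standard remedy. A minimal sketch, assuming the target supports a 16-byte compare-exchange (x86-64 cmpxchg16b, typically requiring -mcx16; otherwise the compiler may fall back to a lock via libatomic). Names here are hypothetical, not current hakmem code:

// Head pointer paired with a version counter; the version changes on every
// successful pop, so a recycled pointer can no longer satisfy a stale CAS.
typedef struct {
    PoolBlock* ptr;
    uintptr_t  ver;
} TaggedHead;

static _Atomic TaggedHead tagged_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

PoolBlock* freelist_pop_tagged(int class_idx, int shard_idx) {
    _Atomic TaggedHead* h = &tagged_head[class_idx][shard_idx];
    TaggedHead old_head = atomic_load_explicit(h, memory_order_acquire);
    TaggedHead new_head;
    do {
        if (old_head.ptr == NULL) return NULL;   // empty
        new_head.ptr = old_head.ptr->next;
        new_head.ver = old_head.ver + 1;         // bump version on pop
    } while (!atomic_compare_exchange_weak_explicit(
        h, &old_head, new_head,
        memory_order_acq_rel, memory_order_acquire));
    return old_head.ptr;
}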

Expected gain: +15-25% (lock contention eliminated entirely)

Files:

  • hakmem_pool.c: ~100 LOC (replace mutex calls with CAS)

Risk: Medium

  • Memory ordering bugs
  • ABA problem (low risk)

Testing:

  • ThreadSanitizer (TSan)
  • Stress test (16+ threads, 60s)
  • Correctness test (no lost blocks, no double-free)

MF2: Pointer Arithmetic Page Lookup (8-10 hours)

Goal: replace hash-table lookups with pointer arithmetic

Current: mid_desc_lookup(ptr) → Hash table (10-20 cycles + mutex)

mimalloc approach:

// Segment-aligned memory (4 MiB alignment)
Segment: [Page Descriptors 128KB][Pages 3968KB]

// Derive page descriptor from block pointer
uintptr_t seg_addr = (uintptr_t)ptr & ~(uintptr_t)(4*1024*1024 - 1);  // mask to 4 MiB boundary
Segment* segment = (Segment*)seg_addr;
size_t offset = (uintptr_t)ptr - seg_addr;
size_t page_idx = offset / PAGE_SIZE;
PageDesc* desc = &segment->descriptors[page_idx];

Cost: 3-4 instructions (mask, subtract, divide, array index)

Implementation:

  1. Segment allocation:
#define SEGMENT_SIZE (4 * 1024 * 1024)
#define SEGMENT_ALIGNMENT SEGMENT_SIZE

void* segment = mmap(NULL, SEGMENT_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_ALIGNED(22),  // 4 MiB aligned (MAP_ALIGNED is a BSD extension; see Risks for the Linux fallback)
                      -1, 0);
  2. Page descriptor array:
typedef struct {
    PageDesc descriptors[64];  // one descriptor per page (~128 KB)
    char pages[64][64*1024];   // 64 × 64 KiB pages (the descriptor area uses ~128 KB of the segment, leaving ~62 usable pages)
} Segment;

Segment* seg = (Segment*)segment;
  3. Fast lookup:
static inline PageDesc* page_desc_from_ptr(void* ptr) {
    uintptr_t seg_addr = (uintptr_t)ptr & ~(SEGMENT_SIZE - 1);
    Segment* seg = (Segment*)seg_addr;
    size_t offset = (uintptr_t)ptr - seg_addr;
    size_t page_idx = offset / (64*1024);
    return &seg->descriptors[page_idx];
}

Expected gain: +10-15% (hash lookup eliminated)

Files:

  • hakmem_pool.c: ~150 LOC (segment allocator)
  • hakmem_pool.h: Data structure changes

Risk: Medium

  • Requires memory layout redesign
  • Backward compatibility (migrate existing pages?)

MF3: Simplify the Allocation Path (8-10 hours)

Goal: cut the path from 7 layers down to 3

Current path:

  1. TLS Ring (32)
  2. TLS LIFO (256)
  3. Active Page A
  4. Active Page B
  5. Transfer Cache
  6. Shard Freelist (mutex)
  7. Remote Stack

Simplified path:

  1. TLS Cache (unified 256-slot ring)
  2. Per-shard Freelist (lock-free)
  3. Remote Stack (lock-free)

Changes:

  • Remove: Active Pages (bump-run)
  • Remove: Transfer Cache
  • Keep: TLS Cache (unified), Freelist, Remote

Rationale:

  • Active Pages: complex but low payoff (the Ring is sufficient)
  • Transfer Cache: owner-awareness is nice, but the overhead is too high
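
As a sketch of the simplified fast path described above (all helper names here are hypothetical placeholders, not the current hakmem API; the freelist call assumes MF1's lock-free version):

// 3-level path: TLS cache -> lock-free shard freelist -> remote stack.
void* pool_alloc_simplified(size_t size) {
    int cls = pool_size_class(size);              // hypothetical helper
    TLSCache* cache = pool_tls_cache(cls);        // hypothetical helper

    PoolBlock* b = tls_cache_pop(cache);          // Level 1: TLS cache
    if (b == NULL)
        b = freelist_pop_lockfree(cls, pool_shard_of_thread());  // Level 2
    if (b == NULL)
        b = remote_stack_drain(cls);              // Level 3: reclaim cross-thread frees
    if (b == NULL)
        return pool_alloc_slow(size);             // page/OS refill path
    return pool_block_to_user(b);                 // hypothetical header/offset handling
}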

Expected gain: +5-8% (fewer branches, less complexity)

Files:

  • hakmem_pool.c: ~200 LOC (refactor allocation path)

Risk: Low

  • Code deletion is safer than addition
  • Verify there are no regressions through testing

Phase 7.3: Moonshot (60 hours) → +50-70%

Goal: fully port the mimalloc architecture

MS1: Per-Page Sharding (60 hours)

Goal: Global sharded freelists → Per-page freelists (mimalloc style)

Current: 7 classes × 8 shards = 56 global freelists

mimalloc: Thousands of per-page freelists

Architecture:

// Thread-local heap
typedef struct {
    Page* pages[SIZE_BINS];  // Per-size-class page lists
} Heap;

// Per-page freelist
typedef struct Page {
    void* free;              // Freelist head (in-page)
    void* local_free;        // Same-thread frees
    atomic_uintptr_t xthread_free;  // Cross-thread frees
    int used;
    int capacity;
    struct Page* next;       // Next page in heap
} Page;

Allocation:

void* mi_malloc(size_t size) {
    Heap* heap = mi_get_heap();
    int bin = size_to_bin(size);
    Page* page = heap->pages[bin];

    if (page->free) {
        void* block = page->free;
        page->free = *(void**)block;
        return block;
    }

    return mi_page_malloc(heap, page, size);  // Slow path
}
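
The free path is the other half of this design; below is a hedged sketch of the mimalloc-style split between same-thread and cross-thread frees, using the Page fields above (the lookup and ownership helpers are assumed names, with the lookup standing in for the MF2-style pointer-arithmetic descriptor lookup):

void mi_free_sketch(void* ptr) {
    Page* page = page_from_ptr(ptr);              // assumed MF2-style lookup

    if (page_is_owned_by_this_thread(page)) {     // assumed helper
        // Same-thread free: plain pointer push, no atomics needed.
        *(void**)ptr = page->local_free;
        page->local_free = ptr;
        page->used--;
    } else {
        // Cross-thread free: lock-free push onto xthread_free; the owning
        // thread collects this list later on its slow path.
        uintptr_t old_head;
        do {
            old_head = atomic_load_explicit(&page->xthread_free, memory_order_relaxed);
            *(void**)ptr = (void*)old_head;
        } while (!atomic_compare_exchange_weak_explicit(
            &page->xthread_free, &old_head, (uintptr_t)ptr,
            memory_order_release, memory_order_relaxed));
    }
}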

Expected gain: +50-70% (roughly on par with mimalloc)

Files:

  • hakmem_pool.c: Complete rewrite (~500 LOC)
  • hakmem_pool.h: New data structures

Risk: High

  • Architectural overhaul
  • 60 hours implementation + testing
  • Backward compatibility concerns

Recommendation: skip this if Phase 7.2 already delivers enough improvement


🎯 Recommended Implementation Order

Option A: MF1 First (high ROI, medium risk)

Week 1: implement MF1 (Lock-Free Freelist)

  • Day 1-2: implement the lock-free primitives
  • Day 3: Integration & unit testing
  • Day 4: Benchmark & validation

Expected result: 13.78 → 15.8-17.2 M/s (+15-25%)

Pros:

  • Fixes the single largest bottleneck
  • Standalone fix
  • Proven technique

Cons:

  • Risk of ABA problems
  • Memory ordering bugs

Option B: Quick Fixes First (low risk, low ROI)

Week 1: implement QF1+QF2+QF3

  • Day 1: QF1 (trylock probe reduction) + QF2 (ring merge)
  • Day 2: QF3 (Header skip) + testing
  • Day 3: Benchmark

Expected result: 13.78 → 14.5-15.2 M/s (+5-10%)

Pros:

  • Low risk
  • Quick wins
  • Groundwork for MF1

Cons:

  • Limited impact

Option C: Hybrid (balanced)

Week 1: Quick Fixes (8h) Week 2: MF1 (12h) Week 3: MF2 or MF3

Expected result: 13.78 → 18-20 M/s (+30-45%)

Pros:

  • Spreads the risk
  • Incremental progress

Cons:

  • Takes longer

📊 Projected Performance

Cumulative Gains

| Phase | Changes | Expected Gain | Cumulative | vs mimalloc |
|-------|---------|---------------|------------|-------------|
| Current | - | baseline | 13.78 M/s | 46.7% |
| + QF1-3 | Trylock + Ring + Header | +5-10% | 14.5-15.2 M/s | 49-51% |
| + MF1 | Lock-free Freelist | +15-25% | 17.2-19.0 M/s | 58-64% |
| + MF2 | Pointer Arithmetic | +10-15% | 19.9-22.8 M/s | 67-77% |
| + MF3 | Path Simplification | +5-8% | 21.9-25.4 M/s | 74-86% 🎉 |
| + MS1 | Per-Page Sharding | +50-70% | 27.0-31.0 M/s | 91-105% 🚀 |

Target achievement:

  • 60% target: essentially reached with MF1 (58-64%)
  • 75% target: reached with MF2 (67-77%)
  • 100% target: possible with MS1 (91-105%)

⚠️ Risks & Mitigation

MF1: Lock-Free Freelist

Risks:

  1. ABA problem: Block A popped, freed, reallocated, pushed back while CAS in-flight
  2. Memory ordering bugs: Incorrect acquire/release semantics
  3. Livelock: the CAS retry loop may fail to converge under heavy contention

Mitigation:

  1. ABA: Not critical for freelist (ABA still valid), use epoch-based reclamation if needed
  2. Memory ordering: Extensive testing with TSan, code review
  3. Livelock: Add exponential backoff after N retries
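
A minimal sketch of the backoff idea, to be called from the CAS retry loops after each failed attempt (the thresholds and pause primitive are assumptions; _mm_pause is x86-specific):

#include <immintrin.h>   // _mm_pause(); use a no-op or sched_yield() on other ISAs

// Exponential backoff: no delay for the first few retries, then an
// exponentially growing (but capped) number of pause instructions.
static inline void cas_backoff(int attempt) {
    if (attempt < 4) return;
    int spins = 1 << (attempt < 10 ? attempt : 10);
    for (int i = 0; i < spins; ++i) _mm_pause();
}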

MF2: Pointer Arithmetic

Risks:

  1. Alignment failures: environments where mmap does not guarantee 4 MiB alignment
  2. Backward compatibility: existing pages cannot be migrated
  3. Memory fragmentation: memory is reserved in whole segments → waste

Mitigation:

  1. Alignment: use MAP_ALIGNED or manual alignment (fall back to the hash table if it fails)
  2. Backward compat: gradual migration; keep the hash table as a fallback
  3. Fragmentation: 4 MiB segments are an acceptable cost (proven in practice by mimalloc)
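
For the manual-alignment fallback, a common Linux-friendly approach is to over-allocate and trim; a sketch with error handling and the hash-table fallback omitted:

#include <stdint.h>
#include <sys/mman.h>

// Map 2x the segment size, then unmap the unaligned head and tail so the
// remaining SEGMENT_SIZE bytes start on a SEGMENT_SIZE boundary.
static void* segment_mmap_aligned(size_t size) {
    size_t span = size * 2;
    char* raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;

    uintptr_t base = ((uintptr_t)raw + (size - 1)) & ~(uintptr_t)(size - 1);
    size_t head = base - (uintptr_t)raw;
    size_t tail = span - head - size;
    if (head) munmap(raw, head);
    if (tail) munmap((char*)base + size, tail);
    return (void*)base;
}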

MF3: Path Simplification

Risks:

  1. Performance regression: could removing the Active Pages slow down burst patterns?
  2. Code complexity: the refactor may introduce new bugs

Mitigation:

  1. Regression: verify with benchmarks; roll back if problems appear
  2. Complexity: Incremental refactor, extensive testing

🎓 Success Criteria

Primary Metrics

| Metric | Current | Target (60%) | Target (75%) | Measurement |
|--------|---------|--------------|--------------|-------------|
| Mid 4T Throughput | 13.78 M/s | 17.70 M/s | 22.13 M/s | Larson 10s |
| vs mimalloc | 46.7% | 60% | 75% | Ratio |

Secondary Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Memory footprint | <50 MB | RSS baseline |
| Fragmentation | <10% | (RSS - user) / user |
| Regression (Tiny) | <5% | Larson Tiny 4T |
| Regression (Large) | <5% | Larson Large 4T |

Specialized Workloads

| Workload | Target | Test |
|----------|--------|------|
| Burst allocation | >mimalloc | Custom benchmark |
| Locality-aware | >mimalloc | Site-based pattern |
| Producer-Consumer | >mimalloc | Multi-threaded queue |

📅 Timeline Estimate

Conservative (full implementation)

| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| 1 | QF1-3 | Quick fixes | 8 |
| 2 | MF1 | Lock-free freelist | 12 |
| 3 | MF2 | Pointer arithmetic | 10 |
| 4 | MF3 | Path simplification | 10 |
| 5-6 | MS1 | Per-page sharding (optional) | 60 |
| Total | | | 40-100 |

Aggressive (60% target only)

| Week | Phase | Tasks | Hours |
|------|-------|-------|-------|
| 1 | MF1 | Lock-free freelist | 12 |
| Total | | | 12 |

Recommendation: implement only MF1 → benchmark → then decide


🎬 Conclusion

Current Status: hakmem is at 46.7% of mimalloc (the Phase 6.25-6.27 attempts failed to close the gap)

Root Cause: Lock contention (50%) + Hash lookups (25%) + Branching (15%)

Battle Plan:

  • Phase 7.1: Quick Fixes → +5-10% (8 hours)
  • Phase 7.2: Medium Fixes → +25-35% (20-30 hours)
    • MF1 (Lock-Free): +15-25%
    • MF2 (Pointer Arithmetic): +10-15%
    • MF3 (Simplify Path): +5-8%
  • Phase 7.3: Moonshot → +50-70% (60 hours, optional)

Recommended Action: implement MF1 (Lock-Free Freelist) right away!

Expected Outcome: 58-64% of mimalloc (60% target achieved!)

Let's go, we're taking down mimalloc! 🔥🔥🔥


Created: 2025-10-24 14:30 JST. Status: Battle Plan complete, ready to start MF1. Next action: begin implementing MF1 (Lock-Free Freelist).