Operation Beat mimalloc: Phase 7 Battle Plan
Date: 2025-10-24
Goal: raise hakmem to 60-75% of mimalloc
Current: 46.7% (13.78 M/s vs 29.50 M/s, Mid 4T)
Gap: 15.72 M/s (53.3%)
🎯 Strategic Goals
Primary Goal: Universal Performance
Target: 60-75% of mimalloc on the larson benchmark (Mid 4T)
- Current: 13.78 M/s (46.7%)
- Target: 17.70-22.13 M/s (60-75%)
- Required improvement: +3.92-8.35 M/s (+28-61%)
Secondary Goal: Superiority on Specialized Workloads
Target: play to hakmem's design strengths
- Burst allocation: 110-120% of mimalloc
- Locality-aware: 110-130% of mimalloc
- Producer-Consumer: 105-115% of mimalloc
📊 Current Analysis (diagnosis by the Task agent)
The Four Major Bottlenecks
| Rank | Bottleneck | Cost (cycles) | Impact | Fix |
|---|---|---|---|---|
| #1 | Lock Contention (56 mutexes) | 56 | 50% | MF1: Lock-free |
| #2 | Hash Lookups (page descriptors) | 30 | 25% | MF2: Pointer arithmetic |
| #3 | Excess Branching (7-10 branches) | 15 | 15% | MF3: Simplify path |
| #4 | Metadata Overhead (inline headers) | 10 | 10% | Long-term |
Total overhead: ~110 cycles/allocation vs mimalloc ~5 cycles
Problems with the hakmem architecture
Over-engineered:
- 7-layer TLS caching: Ring → LIFO → Pages × 2 → TC → Freelist → Remote
- 56 mutexes: 7 classes × 8 shards (37.5% contention rate @ 4T)
- Hash table: 2-4 lookups per allocation (10-20 cycles each, plus cache misses)
Advantages of the mimalloc architecture:
- Only 2 layers: Thread-local heap → Per-page freelist
- 0 locks: lock-free CAS only
- Pointer arithmetic: 3-4 instructions, no hash
🚀 Three-Stage Roadmap
Phase 7.1: Quick Fixes (8 hours) → +5-10%
Goal: quick wins from low-risk optimizations
QF1: Reduce Trylock Probes (2 hours)
Current: g_trylock_probes = 3 (up to 3 trylock attempts)
Problem: every probe incurs:
- Hash calculation (shard index)
- Mutex trylock (50-200 cycles if contended)
- A branch
Fix: reduce probes to 1
```c
// Before
for (int probe = 0; probe < g_trylock_probes; ++probe) {  // 3 iterations
    int s = (shard_idx + probe) & (POOL_NUM_SHARDS - 1);
    if (pthread_mutex_trylock(lock) == 0) { ... }
}

// After
int s = shard_idx;  // No loop
if (pthread_mutex_trylock(lock) == 0) { ... }
```
Expected gain: +2-3% (less trylock overhead)
File: hakmem_pool.c (~10 LOC)
QF2: Unify Ring + LIFO (4 hours)
Current: a two-tier setup, Ring (32 slots) + LIFO (256 blocks)
Problems:
- Two checks (Ring → LIFO)
- Complex spill logic
Fix: a single unified cache (a 256-slot ring)
```c
// Before
typedef struct {
    PoolTLSRing ring;  // 32 slots
    PoolTLSBin  bin;   // LIFO 256 blocks
} TLSCache;

// After
typedef struct {
    PoolBlock* items[256];  // Unified ring (larger)
    int top;
} TLSCache;
```
Expected gain: +1-2% (fewer branches, better cache locality)
File: hakmem_pool.c (~50 LOC)
QF3: Skip Header Writes (2 hours)
Current: the fast path writes the header on every allocation

```c
mid_set_header(hdr, size, site_id);  // 5-10 cycles
```

Fix: blocks popped from the Ring already carry a valid header → skip the write
```c
// Only write the header when refilling from the freelist
if (from_freelist) {
    mid_set_header(hdr, size, site_id);
}
```
Expected gain: +1-2% (less header-write overhead)
File: hakmem_pool.c (~20 LOC)
Phase 7.2: Medium Fixes (20-30 hours) → +25-35%
Goal: resolve the architectural bottlenecks
MF1: Lock-Free Freelist ⭐⭐⭐ (12 hours)
Goal: replace the 56 mutexes with atomic CAS
Current:

```c
static PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
static PoolBlock*  freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```

After:

```c
static atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```
Implementation:
- Lock-free pop:

```c
PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head, new_head;
    PoolBlock* block;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_acquire);
        if (old_head == 0) return NULL;  // Empty
        block = (PoolBlock*)old_head;
        new_head = (uintptr_t)block->next;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, new_head,
        memory_order_release, memory_order_relaxed));
    return block;
}
```

- Lock-free push:

```c
void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, (uintptr_t)block,
        memory_order_release, memory_order_relaxed));
}
```

- Batch push (for refill):

```c
void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                  PoolBlock* head, PoolBlock* tail) {
    atomic_uintptr_t* head_ptr = &freelist_head[class_idx][shard_idx];
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(head_ptr, memory_order_relaxed);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        head_ptr, &old_head, (uintptr_t)head,
        memory_order_release, memory_order_relaxed));
}
```
Mitigating the ABA problem:
- Not critical for freelist: ABA still results in valid freelist state
- Alternative: Add version counter (128-bit CAS) if issues arise
- Epoch-based reclamation: Defer block reuse (if needed)
Expected gain: +15-25% (lock contention eliminated entirely)
Files:
hakmem_pool.c: ~100 LOC (replace mutex calls with CAS)
Risk: Medium
- Memory ordering bugs
- ABA problem (low risk)
Testing:
- ThreadSanitizer (TSan)
- Stress test (16+ threads, 60s)
- Correctness test (no lost blocks, no double-free)
MF2: Pointer Arithmetic Page Lookup (8-10 hours)
Goal: replace the hash-table lookup with pointer arithmetic
Current: mid_desc_lookup(ptr) → hash table (10-20 cycles + mutex)
mimalloc approach:

```c
// Segment-aligned memory (4 MiB alignment)
// Segment layout: [Page Descriptors 128 KiB][Pages 3968 KiB]

// Derive the page descriptor from a block pointer
uintptr_t segment = (uintptr_t)ptr & ~(uintptr_t)(4*1024*1024 - 1); // Mask
size_t offset   = (uintptr_t)ptr - segment;
size_t page_idx = offset / PAGE_SIZE;
PageDesc* desc  = &((Segment*)segment)->descriptors[page_idx]; // cast: segment is a raw address
```

Cost: 3-4 instructions (mask, subtract, divide, array index)
Implementation:
- Segment allocation:

```c
#define SEGMENT_SIZE      (4 * 1024 * 1024)
#define SEGMENT_ALIGNMENT SEGMENT_SIZE

// Note: MAP_ALIGNED(n) is BSD-specific; on Linux use the manual-alignment
// fallback described under Risks.
void* segment = mmap(NULL, SEGMENT_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_ALIGNED(22), // 4 MiB aligned
                     -1, 0);
```
- Page descriptor array:

```c
typedef struct {
    PageDesc descriptors[64];    // metadata for each 64 KiB page slot
    char     pages[64][64*1024]; // 64 × 64 KiB = 4 MiB of page space
} Segment;

Segment* seg = (Segment*)segment;
```

(As written the struct slightly exceeds 4 MiB; in practice the descriptor array occupies the head of the segment, so the first page slots hold metadata rather than user pages.)
- Fast lookup:

```c
static inline PageDesc* page_desc_from_ptr(void* ptr) {
    uintptr_t seg_addr = (uintptr_t)ptr & ~(uintptr_t)(SEGMENT_SIZE - 1);
    Segment* seg = (Segment*)seg_addr;
    size_t offset   = (uintptr_t)ptr - seg_addr;
    size_t page_idx = offset / (64*1024);
    return &seg->descriptors[page_idx];
}
```
Expected gain: +10-15% (hash lookup eliminated)
Files:
- hakmem_pool.c: ~150 LOC (segment allocator)
- hakmem_pool.h: data structure changes
Risk: Medium
- Requires memory layout redesign
- Backward compatibility (migrate existing pages?)
MF3: Simplify the Allocation Path (8-10 hours)
Goal: reduce 7 layers to 3
Current path:
- TLS Ring (32)
- TLS LIFO (256)
- Active Page A
- Active Page B
- Transfer Cache
- Shard Freelist (mutex)
- Remote Stack
Simplified path:
- TLS Cache (unified 256-slot ring)
- Per-shard Freelist (lock-free)
- Remote Stack (lock-free)
Changes:
- Remove: Active Pages (bump-run)
- Remove: Transfer Cache
- Keep: TLS Cache (unified), Freelist, Remote
Rationale:
- Active Pages: complex but low payoff (the Ring suffices)
- Transfer Cache: owner-awareness is nice, but the overhead is high
Expected gain: +5-8% (fewer branches, less complexity)
Files:
hakmem_pool.c: ~200 LOC (refactor allocation path)
Risk: Low
- Code deletion is safer than addition
- Confirm no regressions in testing
Phase 7.3: Moonshot (60 hours) → +50-70%
Goal: port mimalloc's architecture wholesale
MS1: Per-Page Sharding (60 hours)
Goal: Global sharded freelists → Per-page freelists (mimalloc style)
Current: 7 classes × 8 shards = 56 global freelists
mimalloc: Thousands of per-page freelists
Architecture:

```c
// Thread-local heap
typedef struct {
    Page* pages[SIZE_BINS];  // Per-size-class page lists
} Heap;

// Per-page freelist
typedef struct Page {
    void* free;                     // Freelist head (in-page)
    void* local_free;               // Same-thread frees
    atomic_uintptr_t xthread_free;  // Cross-thread frees
    int used;
    int capacity;
    struct Page* next;              // Next page in heap
} Page;
```
Allocation:

```c
void* mi_malloc(size_t size) {
    Heap* heap = mi_get_heap();
    int bin = size_to_bin(size);
    Page* page = heap->pages[bin];
    if (page->free) {
        void* block = page->free;
        page->free = *(void**)block;
        return block;
    }
    return mi_page_malloc(heap, page, size);  // Slow path
}
```
Expected gain: +50-70% (roughly on par with mimalloc)
Files:
- hakmem_pool.c: complete rewrite (~500 LOC)
- hakmem_pool.h: new data structures
Risk: High
- Architectural overhaul
- 60 hours implementation + testing
- Backward compatibility concerns
Recommendation: skip this if Phase 7.2 delivers enough improvement
🎯 Recommended Implementation Order
Option A: MF1 First (high ROI, medium risk)
Week 1: implement MF1 (Lock-Free Freelist)
- Day 1-2: implement the lock-free primitives
- Day 3: integration & unit testing
- Day 4: benchmark & validation
Expected result: 13.78 → 15.8-17.2 M/s (+15-25%)
Pros:
- Fixes the biggest bottleneck
- Standalone fix
- Proven technique
Cons:
- ABA problem risk
- Memory ordering bugs
Option B: Quick Fixes First (low risk, low ROI)
Week 1: implement QF1+QF2+QF3
- Day 1: QF1 (trylock reduction) + QF2 (ring unification)
- Day 2: QF3 (header skip) + testing
- Day 3: benchmark
Expected result: 13.78 → 14.5-15.2 M/s (+5-10%)
Pros:
- Low risk
- Quick wins
- Groundwork for MF1
Cons:
- Limited impact
Option C: Hybrid (balanced)
Week 1: Quick Fixes (8h)
Week 2: MF1 (12h)
Week 3: MF2 or MF3
Expected result: 13.78 → 18-20 M/s (+30-45%)
Pros:
- Spreads the risk
- Incremental progress
Cons:
- Takes longer
📊 Projected Performance
Cumulative Gains
| Phase | Changes | Expected Gain | Cumulative | vs mimalloc |
|---|---|---|---|---|
| Current | - | baseline | 13.78 M/s | 46.7% |
| + QF1-3 | Trylock + Ring + Header | +5-10% | 14.5-15.2 M/s | 49-51% |
| + MF1 | Lock-free Freelist | +15-25% | 17.2-19.0 M/s | 58-64% ✅ |
| + MF2 | Pointer Arithmetic | +10-15% | 19.9-22.8 M/s | 67-77% ✅ |
| + MF3 | Path Simplification | +5-8% | 21.9-25.4 M/s | 74-86% 🎉 |
| + MS1 | Per-Page Sharding | +50-70% | 27.0-31.0 M/s | 91-105% 🚀 |
Target achievement:
- 60% target: essentially reached with MF1 (58-64%)
- 75% target: reached with MF2 (67-77%)
- 100% target: possible with MS1 (91-105%)
⚠️ Risks & Mitigation
MF1: Lock-Free Freelist
Risks:
- ABA problem: Block A popped, freed, reallocated, pushed back while CAS in-flight
- Memory ordering bugs: Incorrect acquire/release semantics
- Livelock: the CAS retry loop never converges
Mitigation:
- ABA: Not critical for freelist (ABA still valid), use epoch-based reclamation if needed
- Memory ordering: Extensive testing with TSan, code review
- Livelock: Add exponential backoff after N retries
MF2: Pointer Arithmetic
Risks:
- Alignment failures: environments where mmap does not guarantee 4 MiB alignment
- Backward compatibility: existing pages cannot be migrated
- Memory fragmentation: allocating whole segments wastes memory
Mitigation:
- Alignment: use MAP_ALIGNED or manual alignment (fall back to the hash table on failure)
- Backward compat: migrate gradually; keep the hash table as a fallback
- Fragmentation: 4 MiB segments are an acceptable cost (proven by mimalloc)
MF3: Path Simplification
Risks:
- Performance regression: could removing Active Pages slow down burst patterns?
- Code complexity: the refactor may introduce new bugs
Mitigation:
- Regression: verify via benchmarks; roll back if problems appear
- Complexity: Incremental refactor, extensive testing
🎓 Success Criteria
Primary Metrics
| Metric | Current | Target (60%) | Target (75%) | Measurement |
|---|---|---|---|---|
| Mid 4T Throughput | 13.78 M/s | 17.70 M/s | 22.13 M/s | Larson 10s |
| vs mimalloc | 46.7% | 60% | 75% | Ratio |
Secondary Metrics
| Metric | Target | Measurement |
|---|---|---|
| Memory footprint | <50 MB | RSS baseline |
| Fragmentation | <10% | (RSS - user) / user |
| Regression (Tiny) | <5% | Larson Tiny 4T |
| Regression (Large) | <5% | Larson Large 4T |
Specialized Workloads
| Workload | Target | Test |
|---|---|---|
| Burst allocation | >mimalloc | Custom benchmark |
| Locality-aware | >mimalloc | Site-based pattern |
| Producer-Consumer | >mimalloc | Multi-threaded queue |
📅 Timeline Estimate
Conservative (full implementation)
| Week | Phase | Tasks | Hours |
|---|---|---|---|
| 1 | QF1-3 | Quick fixes | 8 |
| 2 | MF1 | Lock-free freelist | 12 |
| 3 | MF2 | Pointer arithmetic | 10 |
| 4 | MF3 | Path simplification | 10 |
| 5-6 | MS1 | Per-page sharding (optional) | 60 |
| Total | | | 40-100 |
Aggressive (60% target only)
| Week | Phase | Tasks | Hours |
|---|---|---|---|
| 1 | MF1 | Lock-free freelist | 12 |
| Total | | | 12 |
Recommendation: implement MF1 only → benchmark → then decide
🎬 Conclusion
Current Status: hakmem sits at 46.7% of mimalloc (Phases 6.25-6.27 failed to close the gap)
Root Cause: Lock contention (50%) + Hash lookups (25%) + Branching (15%)
Battle Plan:
- Phase 7.1: Quick Fixes → +5-10% (8 hours)
- Phase 7.2: Medium Fixes → +25-35% (20-30 hours)
- MF1 (Lock-Free): +15-25% ⭐⭐⭐
- MF2 (Pointer Arithmetic): +10-15%
- MF3 (Simplify Path): +5-8%
- Phase 7.3: Moonshot → +50-70% (60 hours, optional)
Recommended Action: implement MF1 (Lock-Free Freelist) right away!
Expected Outcome: 58-64% of mimalloc (60% target achieved!)
Let's go, time to take down mimalloc! 🔥🔥🔥
Created: 2025-10-24 14:30 JST
Status: ✅ Battle plan complete; ready to implement MF1
Next action: start implementing MF1 (Lock-Free Freelist)