Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
6.8 KiB
mimalloc Performance Analysis - Key Findings
The 47% Gap Explained
HAKMEM: 16.53 M ops/sec mimalloc: 24.21 M ops/sec Gap: +7.68 M ops/sec (47% faster)
Top 3 Performance Secrets
1. Direct Page Cache (O(1) Lookup) - Impact: 15-20%
mimalloc:
// Single array index - O(1)
page = heap->pages_free_direct[size / 8];
HAKMEM:
// Binary search through 32 bins - O(log n)
size_class = find_size_class(size); // ~5 comparisons
page = heap->size_classes[size_class];
Savings: ~10 cycles per allocation
2. Dual Free Lists (Local/Remote Split) - Impact: 10-15%
mimalloc:
typedef struct mi_page_s {
mi_block_t* free; // Hot allocation path
mi_block_t* local_free; // Local frees (no atomic!)
_Atomic(mi_thread_free_t) xthread_free; // Remote frees
} mi_page_t;
Why it's faster:
- Local frees go to
local_free(no atomic ops!) - Migration to
freeis batched (pointer swap) - Better cache locality (separate alloc/free lists)
HAKMEM: Single free list with atomic updates
3. Zero-Cost Optimizations - Impact: 5-8%
Branch hints:
if mi_likely(size <= 1024) { // Fast path
return fast_alloc(size);
}
Bit-packed flags:
if (page->flags.full_aligned == 0) { // Single comparison
// Fast path: not full, no aligned blocks
}
Lazy updates:
// Only collect remote frees when needed
if (page->free == NULL) {
collect_remote_frees(page);
}
The Hot Path Breakdown
mimalloc (3 layers, ~20 cycles)
// Layer 0: TLS heap (2 cycles)
heap = mi_prim_get_default_heap();
// Layer 1: Direct page cache (3 cycles)
page = heap->pages_free_direct[size / 8];
// Layer 2: Pop from free list (5 cycles)
block = page->free;
if (block) {
page->free = block->next;
page->used++;
return block;
}
// Layer 3: Generic fallback (slow path)
return _mi_malloc_generic(heap, size, zero, 0);
Total fast path: ~20 cycles
HAKMEM Tiny Current (3 layers, ~30-35 cycles)
// Layer 0: TLS heap (3 cycles)
heap = tls_heap;
// Layer 1: Binary search size class (~5 cycles)
size_class = find_size_class(size); // 3-5 comparisons
// Layer 2: Get page (3 cycles)
page = heap->size_classes[size_class];
// Layer 3: Pop with atomic (~15 cycles with lock prefix)
block = page->freelist;
if (block) {
lock_xadd(&page->used, 1); // 10+ cycles!
page->freelist = block->next;
return block;
}
Total fast path: ~30-35 cycles (with atomic), ~20-25 cycles (without atomic)
Key Insight: Linked Lists Are Optimal!
mimalloc proves that intrusive linked lists are the right data structure for mixed alloc/free workloads.
The performance comes from:
- O(1) page lookup (not from avoiding lists)
- Cache-friendly separation (local vs remote)
- Minimal atomic ops (batching)
- Predictable branches (hints)
Your Phase 3 finding was correct: Linked lists are optimal. The gap comes from micro-optimizations, not data structure choice.
Actionable Recommendations
Phase 1: Direct Page Cache (+15-20%)
Effort: 1-2 days | Risk: Low
// Add to hakmem_heap_t:
hakmem_page_t* pages_direct[129]; // 1032 bytes
// In malloc hot path:
if (size <= 1024) {
page = heap->pages_direct[size / 8];
if (page && page->free_list) {
return pop_block(page);
}
}
Phase 2: Dual Free Lists (+10-15%)
Effort: 3-5 days | Risk: Medium
// Split free list:
typedef struct hakmem_page_s {
hakmem_block_t* free; // Allocation path
hakmem_block_t* local_free; // Local frees (no atomic!)
_Atomic(hakmem_block_t*) thread_free; // Remote frees
} hakmem_page_t;
// In free:
if (is_local_thread(page)) {
block->next = page->local_free;
page->local_free = block; // No atomic!
}
// Migrate when needed:
if (!page->free && page->local_free) {
page->free = page->local_free; // Just swap!
page->local_free = NULL;
}
Phase 3: Branch Hints + Flags (+5-8%)
Effort: 1-2 days | Risk: Low
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
// Bit-pack flags:
union page_flags {
uint8_t combined;
struct {
uint8_t is_full : 1;
uint8_t has_remote : 1;
} bits;
};
// Single comparison:
if (page->flags.combined == 0) {
// Fast path
}
Expected Results
| Phase | Improvement | Cumulative M ops/sec | % of Gap Closed |
|---|---|---|---|
| Baseline | - | 16.53 | 0% |
| Phase 1 | +15-20% | 19.20 | 35% |
| Phase 2 | +10-15% | 22.30 | 75% |
| Phase 3 | +5-8% | 24.00 | 95% |
Final: 16.53 → 24.00 M ops/sec (close the 47% gap to within ~1%)
What Doesn't Matter
❌ Prefetch instructions - Hardware prefetcher is good enough ❌ Hand-written assembly - Compiler optimizes well ❌ Magazine architecture - Direct page cache is simpler ❌ Complex encoding - Simple XOR-rotate is sufficient ❌ Bump allocation - Linked lists are fine for mixed workloads
Validation Strategy
-
Benchmark Phase 1 (direct cache)
- Expect: +2-3 M ops/sec (12-18%)
- If achieved: Proceed to Phase 2
- If not: Profile and debug
-
Benchmark Phase 2 (dual lists)
- Expect: +2-3 M ops/sec additional (10-15%)
- If achieved: Proceed to Phase 3
- If not: Analyze cache behavior
-
Benchmark Phase 3 (branch hints + flags)
- Expect: +1-2 M ops/sec additional (5-8%)
- Final target: 23-24 M ops/sec
Code References (mimalloc source)
Must-Read Files
/src/alloc.c:200- Entry point (mi_malloc)/src/alloc.c:48-59- Hot path (_mi_page_malloc)/include/mimalloc/internal.h:388- Direct cache (_mi_heap_get_free_small_page)/src/alloc.c:593-608- Fast free (mi_free)/src/page.c:217-248- Dual list migration (_mi_page_free_collect)
Key Data Structures
/include/mimalloc/types.h:447- Heap structure (mi_heap_s)/include/mimalloc/types.h:283- Page structure (mi_page_s)/include/mimalloc/types.h:212- Block structure (mi_block_s)/include/mimalloc/types.h:228- Bit-packed flags (mi_page_flags_s)
Summary
mimalloc's advantage is not from avoiding linked lists or using bump allocation.
The 47% gap comes from 8 cumulative micro-optimizations:
- Direct page cache (O(1) vs O(log n))
- Dual free lists (cache-friendly)
- Lazy metadata updates (batching)
- Zero-cost encoding (security for free)
- Branch hints (CPU-friendly)
- Bit-packed flags (fewer comparisons)
- Aggressive inlining (smaller hot path)
- Minimal atomics (local-first free)
Each optimization is small (1-20%), but they multiply to create the 47% gap.
Good news: All techniques are portable to HAKMEM without major architectural changes!
Next Action: Implement Phase 1 (direct page cache) and measure the impact on bench_random_mixed.