# mimalloc Performance Analysis - Key Findings

## The 47% Gap Explained

**HAKMEM:** 16.53 M ops/sec

**mimalloc:** 24.21 M ops/sec

**Gap:** +7.68 M ops/sec (47% faster)

---

## Top 3 Performance Secrets

### 1. Direct Page Cache (O(1) Lookup) - **Impact: 15-20%**

**mimalloc:**
```c
// Single array index - O(1); (size + 7) / 8 rounds up to the word-size class
page = heap->pages_free_direct[(size + 7) / 8];
```

**HAKMEM:**
```c
// Binary search through 32 bins - O(log n)
size_class = find_size_class(size); // ~5 comparisons
page = heap->size_classes[size_class];
```

**Savings:** ~10 cycles per allocation

---

### 2. Dual Free Lists (Local/Remote Split) - **Impact: 10-15%**

**mimalloc:**
```c
typedef struct mi_page_s {
  mi_block_t* free;                        // Hot allocation path
  mi_block_t* local_free;                  // Local frees (no atomic!)
  _Atomic(mi_thread_free_t) xthread_free;  // Remote frees
} mi_page_t;
```

**Why it's faster:**
- Local frees go to `local_free` (no atomic ops!)
- Migration to `free` is batched (pointer swap)
- Better cache locality (separate alloc/free lists)

**HAKMEM:** Single free list with atomic updates

---

### 3. Zero-Cost Optimizations - **Impact: 5-8%**

**Branch hints:**
```c
if mi_likely(size <= 1024) { // mi_likely expands to __builtin_expect - fast path
  return fast_alloc(size);
}
```

**Bit-packed flags:**
```c
if (page->flags.full_aligned == 0) { // Single comparison
  // Fast path: not full, no aligned blocks
}
```

**Lazy updates:**
```c
// Only collect remote frees when the local list runs dry
if (page->free == NULL) {
  collect_remote_frees(page);
}
```

---

## The Hot Path Breakdown

### mimalloc (3 fast-path layers, ~20 cycles)

```c
// Layer 0: TLS heap (2 cycles)
heap = mi_prim_get_default_heap();

// Layer 1: Direct page cache (3 cycles)
page = heap->pages_free_direct[(size + 7) / 8];

// Layer 2: Pop from free list (5 cycles)
block = page->free;
if (block) {
  page->free = block->next;
  page->used++;
  return block;
}

// Layer 3: Generic fallback (slow path)
return _mi_malloc_generic(heap, size, zero, 0);
```

**Total fast path: ~20 cycles**

### HAKMEM Tiny Current (4 fast-path layers, ~30-35 cycles)

```c
// Layer 0: TLS heap (3 cycles)
heap = tls_heap;

// Layer 1: Binary search for the size class (~5 cycles)
size_class = find_size_class(size); // 3-5 comparisons

// Layer 2: Get page (3 cycles)
page = heap->size_classes[size_class];

// Layer 3: Pop with atomic (~15 cycles with lock prefix)
block = page->freelist;
if (block) {
  lock_xadd(&page->used, 1); // 10+ cycles!
  page->freelist = block->next;
  return block;
}
```

**Total fast path: ~30-35 cycles (with atomic), ~20-25 cycles (without atomic)**

---

## Key Insight: Linked Lists Are Optimal!

mimalloc proves that **intrusive linked lists** are the right data structure for mixed alloc/free workloads.

The performance comes from:
1. **O(1) page lookup** (not from avoiding lists)
2. **Cache-friendly separation** (local vs remote)
3. **Minimal atomic ops** (batching)
4. **Predictable branches** (hints)

**Your Phase 3 finding was correct:** Linked lists are optimal. The gap comes from **micro-optimizations**, not data structure choice.
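
The intrusive pattern stores the `next` pointer inside the freed block's own bytes, so the list costs zero extra memory and no metadata allocation. A minimal sketch (type names are illustrative):

```c
// Intrusive free list: the link lives inside the freed block itself.
typedef struct block_s { struct block_s* next; } block_t;

static inline void push_block(block_t** list, void* p) {
  block_t* b = (block_t*)p;  // reuse the block's first word as the link
  b->next = *list;
  *list = b;
}

static inline void* pop_block(block_t** list) {
  block_t* b = *list;
  if (b) *list = b->next;    // O(1), no extra memory touched
  return b;
}
```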

---

## Actionable Recommendations

### Phase 1: Direct Page Cache (+15-20%)
**Effort:** 1-2 days | **Risk:** Low

```c
// Add to hakmem_heap_t:
hakmem_page_t* pages_direct[129]; // 129 pointers = 1032 bytes

// In the malloc hot path:
if (size <= 1024) {
  page = heap->pages_direct[(size + 7) / 8]; // round up to the 8-byte class
  if (page && page->free_list) {
    return pop_block(page);
  }
}
```
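
One maintenance detail Phase 1 needs: the cache must be refreshed whenever a page fills up or a fresh page is adopted for a class. A hedged sketch, assuming one slot per 8-byte step (HAKMEM's 32 coarser bins would need a bin-to-slot fan-out); `pages_direct_update` is a hypothetical helper:

```c
// Repoint the direct-cache slot for a class; pass NULL when the page
// fills so the hot path falls through to the slow path.
static inline void pages_direct_update(hakmem_heap_t* heap,
                                       size_t block_size,
                                       hakmem_page_t* page) {
  heap->pages_direct[(block_size + 7) / 8] = page;
}
```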

### Phase 2: Dual Free Lists (+10-15%)
**Effort:** 3-5 days | **Risk:** Medium

```c
// Split the free list:
typedef struct hakmem_page_s {
  hakmem_block_t* free;                    // Allocation path
  hakmem_block_t* local_free;              // Local frees (no atomic!)
  _Atomic(hakmem_block_t*) thread_free;    // Remote frees
} hakmem_page_t;

// In free:
if (is_local_thread(page)) {
  block->next = page->local_free;
  page->local_free = block; // No atomic!
}

// Migrate when needed:
if (!page->free && page->local_free) {
  page->free = page->local_free; // Just swap!
  page->local_free = NULL;
}
```
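
The remote side of Phase 2 is the mirror image: another thread pushes onto `thread_free` with a CAS (as sketched earlier), and the owner drains it in a single exchange. A hedged sketch, assuming the struct above and an intrusive `next` pointer in `hakmem_block_t`:

```c
// Owner-side drain: grab the whole remote list in one atomic exchange,
// then splice it onto the allocation list.
#include <stdatomic.h>

static void collect_remote_frees(hakmem_page_t* page) {
  hakmem_block_t* list = atomic_exchange_explicit(
      &page->thread_free, NULL, memory_order_acquire);
  while (list) {
    hakmem_block_t* next = list->next;
    list->next = page->free;  // push each block onto the alloc list
    page->free = list;
    list = next;
  }
}
```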

### Phase 3: Branch Hints + Flags (+5-8%)
**Effort:** 1-2 days | **Risk:** Low

```c
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

// Bit-pack flags:
union page_flags {
  uint8_t combined;
  struct {
    uint8_t is_full    : 1;
    uint8_t has_remote : 1;
  } bits;
};

// Both flags tested in a single, well-predicted comparison:
if (likely(page->flags.combined == 0)) {
  // Fast path
}
```

---

## Expected Results

| Phase | Improvement | Cumulative M ops/sec | % of Gap Closed |
|-------|-------------|----------------------|-----------------|
| Baseline | - | 16.53 | 0% |
| Phase 1 | +15-20% | 19.20 | 35% |
| Phase 2 | +10-15% | 22.30 | 75% |
| Phase 3 | +5-8% | 24.00 | 95% |

**Final:** 16.53 → 24.00 M ops/sec (closes the 47% gap to within ~1%)
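
Note how the cumulative column compounds multiplicatively rather than additively: taking roughly the upper end of each range, 16.53 × 1.16 × 1.16 × 1.08 ≈ 24.0 M ops/sec. Each multiplier is modest on its own, but the product closes almost the entire gap.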

---

## What Doesn't Matter

❌ **Prefetch instructions** - Hardware prefetcher is good enough
❌ **Hand-written assembly** - Compiler optimizes well
❌ **Magazine architecture** - Direct page cache is simpler
❌ **Complex encoding** - Simple XOR-rotate is sufficient (sketched below)
❌ **Bump allocation** - Linked lists are fine for mixed workloads
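
A minimal sketch of the XOR-rotate idea for hardening free-list pointers; the key handling and rotation count are illustrative, not mimalloc's exact scheme:

```c
// XOR-rotate free-list pointer encoding; decode(encode(p)) == p.
#include <stdint.h>

#define ROT 13  // rotation count - illustrative, any 1..63 works on 64-bit

static inline uintptr_t rotl(uintptr_t x, unsigned r) {
  return (x << r) | (x >> (8 * sizeof(uintptr_t) - r));
}
static inline uintptr_t rotr(uintptr_t x, unsigned r) {
  return (x >> r) | (x << (8 * sizeof(uintptr_t) - r));
}

// Encode the next-pointer before storing it inside a freed block.
static inline uintptr_t encode_ptr(uintptr_t p, uintptr_t key) {
  return rotl(p ^ key, ROT);
}
// Decode when popping from the free list; a corrupted entry decodes
// to a wild pointer instead of a silently usable one.
static inline uintptr_t decode_ptr(uintptr_t e, uintptr_t key) {
  return rotr(e, ROT) ^ key;
}
```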

---

## Validation Strategy

1. **Benchmark Phase 1** (direct cache)
   - Expect: +2-3 M ops/sec (12-18%)
   - If achieved: Proceed to Phase 2
   - If not: Profile and debug

2. **Benchmark Phase 2** (dual lists)
   - Expect: +2-3 M ops/sec additional (10-15%)
   - If achieved: Proceed to Phase 3
   - If not: Analyze cache behavior

3. **Benchmark Phase 3** (branch hints + flags)
   - Expect: +1-2 M ops/sec additional (5-8%)
   - Final target: 23-24 M ops/sec
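
For a quick sanity check of the M ops/sec numbers outside the full suite, a throughput loop along these lines works; the slot count, size range, and use of plain `malloc`/`free` symbols are assumptions standing in for `bench_random_mixed`:

```c
// Minimal mixed alloc/free throughput harness (one free + one malloc per step).
// Build: cc -O2 bench_sketch.c; link or LD_PRELOAD the allocator under test.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static inline unsigned lcg(unsigned* s) {        // cheap PRNG so RNG cost
  return *s = *s * 1664525u + 1013904223u;       // doesn't dominate timing
}

int main(void) {
  enum { N = 10 * 1000 * 1000, SLOTS = 1024 };
  static void* slots[SLOTS];                     // zero-initialized
  unsigned seed = 12345u;
  struct timespec t0, t1;

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < N; i++) {
    unsigned r = lcg(&seed);
    unsigned s = r % SLOTS;
    free(slots[s]);                              // free(NULL) is a no-op
    slots[s] = malloc(16 + (r >> 10) % 1009);    // sizes spread over tiny classes
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  printf("%.2f M alloc/free pairs per sec\n", N / sec / 1e6);
  for (int s = 0; s < SLOTS; s++) free(slots[s]);
  return 0;
}
```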

---

## Code References (mimalloc source)

### Must-Read Files
1. `/src/alloc.c:200` - Entry point (`mi_malloc`)
2. `/src/alloc.c:48-59` - Hot path (`_mi_page_malloc`)
3. `/include/mimalloc/internal.h:388` - Direct cache (`_mi_heap_get_free_small_page`)
4. `/src/alloc.c:593-608` - Fast free (`mi_free`)
5. `/src/page.c:217-248` - Dual-list migration (`_mi_page_free_collect`)

### Key Data Structures
1. `/include/mimalloc/types.h:447` - Heap structure (`mi_heap_s`)
2. `/include/mimalloc/types.h:283` - Page structure (`mi_page_s`)
3. `/include/mimalloc/types.h:212` - Block structure (`mi_block_s`)
4. `/include/mimalloc/types.h:228` - Bit-packed flags (`mi_page_flags_s`)

---

## Summary

mimalloc's advantage is **not** from avoiding linked lists or using bump allocation.

The 47% gap comes from **8 cumulative micro-optimizations**:
1. Direct page cache (O(1) vs O(log n))
2. Dual free lists (cache-friendly)
3. Lazy metadata updates (batching)
4. Zero-cost encoding (security for free)
5. Branch hints (CPU-friendly)
6. Bit-packed flags (fewer comparisons)
7. Aggressive inlining (smaller hot path)
8. Minimal atomics (local-first free)

Each optimization is **small** (1-20% on its own), but together they **compound** into the 47% gap.

**Good news:** All of these techniques are portable to HAKMEM without major architectural changes!

---

**Next Action:** Implement Phase 1 (the direct page cache) and measure the impact on `bench_random_mixed`.