# mimalloc Performance Analysis - Key Findings
## The 47% Gap Explained
**HAKMEM:** 16.53 M ops/sec
**mimalloc:** 24.21 M ops/sec
**Gap:** +7.68 M ops/sec (47% faster)
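In per-operation terms, that is roughly 1 / 16.53 M ≈ 60.5 ns per op for HAKMEM versus 1 / 24.21 M ≈ 41.3 ns per op for mimalloc, a 24.21 / 16.53 ≈ 1.47x throughput ratio.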
---
## Top 3 Performance Secrets
### 1. Direct Page Cache (O(1) Lookup) - **Impact: 15-20%**
**mimalloc:**
```c
// Single array index - O(1)
page = heap->pages_free_direct[(size + 7) / 8];  // round size up to its word-size index
```
**HAKMEM:**
```c
// Binary search through 32 bins - O(log n)
size_class = find_size_class(size); // ~5 comparisons
page = heap->size_classes[size_class];
```
**Savings:** ~10 cycles per allocation
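To make the lookup difference concrete, here is a minimal, self-contained sketch contrasting the two strategies. The table contents and names (`class_limit`, `find_size_class`, `pages_direct`, `direct_lookup`) are illustrative, not the actual HAKMEM or mimalloc code:
```c
#include <stdio.h>
#include <stddef.h>

#define NUM_CLASSES 32

/* Illustrative size-class table: upper bound (bytes) of each of 32 bins. */
static const size_t class_limit[NUM_CLASSES] = {
    8,   16,   24,   32,   48,   64,   80,   96,
  112,  128,  160,  192,  224,  256,  320,  384,
  448,  512,  576,  640,  704,  768,  832,  896,
  960, 1024, 1152, 1280, 1408, 1536, 1792, 2048,
};

/* O(log n): binary search for the smallest class that fits `size`
   (about 5 comparisons for 32 bins). */
static int find_size_class(size_t size) {
  int lo = 0, hi = NUM_CLASSES - 1;
  while (lo < hi) {
    int mid = (lo + hi) / 2;
    if (class_limit[mid] < size) lo = mid + 1;
    else                         hi = mid;
  }
  return lo;
}

/* O(1): direct table indexed by word count -- one shift, one load. */
static void* pages_direct[129];  /* one slot per 8-byte step up to 1024 bytes */

static void* direct_lookup(size_t size) {
  return pages_direct[(size + 7) / 8];  /* round up to the word-size index */
}

int main(void) {
  size_t size = 100;
  printf("binary search: class %d, direct index: %zu, cached page: %p\n",
         find_size_class(size), (size + 7) / 8, direct_lookup(size));
  return 0;
}
```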
---
### 2. Dual Free Lists (Local/Remote Split) - **Impact: 10-15%**
**mimalloc:**
```c
typedef struct mi_page_s {
  mi_block_t* free;                        // Hot allocation path
  mi_block_t* local_free;                  // Local frees (no atomic!)
  _Atomic(mi_thread_free_t) xthread_free;  // Remote frees
} mi_page_t;
```
**Why it's faster:**
- Local frees go to `local_free` (no atomic ops!)
- Migration to `free` is batched (pointer swap)
- Better cache locality (separate alloc/free lists)
**HAKMEM:** Single free list with atomic updates
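A minimal sketch of the contrast, using C11 atomics (the types and function names are illustrative, not HAKMEM's actual code): a single shared free list pays a compare-and-swap on every free, while the dual-list scheme turns same-thread frees into two plain stores and reserves the atomic list for remote frees.
```c
#include <stdatomic.h>

typedef struct block_s { struct block_s* next; } block_t;

/* Single shared free list: every free is an atomic CAS (plus possible retries). */
typedef struct { _Atomic(block_t*) freelist; } shared_page_t;

static void shared_free(shared_page_t* page, block_t* block) {
  block_t* head = atomic_load_explicit(&page->freelist, memory_order_relaxed);
  do {
    block->next = head;
  } while (!atomic_compare_exchange_weak_explicit(
             &page->freelist, &head, block,
             memory_order_release, memory_order_relaxed));
}

/* Dual lists: the owning thread frees with two plain stores; only
   cross-thread frees touch the atomic thread_free list. */
typedef struct {
  block_t* free;                  /* consumed by malloc */
  block_t* local_free;            /* same-thread frees, no atomics */
  _Atomic(block_t*) thread_free;  /* remote frees */
} dual_page_t;

static void dual_free_local(dual_page_t* page, block_t* block) {
  block->next = page->local_free; /* plain store */
  page->local_free = block;       /* plain store */
}

static void dual_free_remote(dual_page_t* page, block_t* block) {
  block_t* head = atomic_load_explicit(&page->thread_free, memory_order_relaxed);
  do {
    block->next = head;
  } while (!atomic_compare_exchange_weak_explicit(
             &page->thread_free, &head, block,
             memory_order_release, memory_order_relaxed));
}
```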
---
### 3. Zero-Cost Optimizations - **Impact: 5-8%**
**Branch hints:**
```c
if mi_likely(size <= 1024) { // Fast path
  return fast_alloc(size);
}
```
**Bit-packed flags:**
```c
if (page->flags.full_aligned == 0) { // Single comparison
  // Fast path: not full, no aligned blocks
}
```
**Lazy updates:**
```c
// Only collect remote frees when needed
if (page->free == NULL) {
  collect_remote_frees(page);
}
```
---
## The Hot Path Breakdown
### mimalloc (3 layers, ~20 cycles)
```c
// Layer 0: TLS heap (2 cycles)
heap = mi_prim_get_default_heap();
// Layer 1: Direct page cache (3 cycles)
page = heap->pages_free_direct[(size + 7) / 8];
// Layer 2: Pop from free list (5 cycles)
block = page->free;
if (block) {
  page->free = block->next;
  page->used++;
  return block;
}
// Layer 3: Generic fallback (slow path)
return _mi_malloc_generic(heap, size, zero, 0);
```
**Total fast path: ~20 cycles**
### HAKMEM Tiny Current (3 layers, ~30-35 cycles)
```c
// Layer 0: TLS heap (3 cycles)
heap = tls_heap;
// Layer 1: Binary search size class (~5 cycles)
size_class = find_size_class(size); // 3-5 comparisons
// Layer 2: Get page (3 cycles)
page = heap->size_classes[size_class];
// Layer 3: Pop with atomic (~15 cycles with lock prefix)
block = page->freelist;
if (block) {
  lock_xadd(&page->used, 1); // 10+ cycles!
  page->freelist = block->next;
  return block;
}
```
**Total fast path: ~30-35 cycles (with atomic), ~20-25 cycles (without atomic)**
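Taken at face value, ~30-35 cycles versus ~20 cycles is already a 1.5-1.75x difference on the allocation fast path alone, which is in the same ballpark as the measured 1.47x overall gap once the shared benchmark overhead and the free path are factored in.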
---
## Key Insight: Linked Lists Are Optimal!
mimalloc proves that **intrusive linked lists** are the right data structure for mixed alloc/free workloads.
The performance comes from:
1. **O(1) page lookup** (not from avoiding lists)
2. **Cache-friendly separation** (local vs remote)
3. **Minimal atomic ops** (batching)
4. **Predictable branches** (hints)
**Your Phase 3 finding was correct:** Linked lists are optimal. The gap comes from **micro-optimizations**, not data structure choice.
---
## Actionable Recommendations
### Phase 1: Direct Page Cache (+15-20%)
**Effort:** 1-2 days | **Risk:** Low
```c
// Add to hakmem_heap_t:
hakmem_page_t* pages_direct[129]; // one slot per 8-byte step up to 1024 (1032 bytes)

// In malloc hot path:
if (size <= 1024) {
  page = heap->pages_direct[(size + 7) / 8]; // round up so e.g. 9 B maps to the 16 B class
  if (page && page->freelist) {
    return pop_block(page);
  }
}
```
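One practical detail the snippet leaves out is keeping the table in sync as pages fill up and are retired. A minimal sketch of that bookkeeping, using hypothetical stand-in types and a `class_size` table that are not existing HAKMEM code:
```c
#include <stddef.h>

typedef struct hakmem_page_s hakmem_page_t;  /* stand-in, illustrative only */

#define NUM_SMALL_CLASSES 32

typedef struct hakmem_heap_s {
  size_t         class_size[NUM_SMALL_CLASSES]; /* block size per class, multiple of 8 */
  hakmem_page_t* pages_direct[129];             /* slot i serves requests of <= i*8 bytes */
} hakmem_heap_t;

/* Point every direct slot served by size class `c` at `page`.
   Pass the class's next usable page, or NULL, when `page` fills up or is retired. */
static void heap_set_direct_pages(hakmem_heap_t* heap, int c, hakmem_page_t* page) {
  size_t bsize = heap->class_size[c];
  if (bsize > 1024) return;                     /* only small sizes are cached */
  size_t lo = (c == 0) ? 0 : heap->class_size[c - 1] / 8 + 1;
  for (size_t i = lo; i <= bsize / 8; i++) {
    heap->pages_direct[i] = page;
  }
}
```
The hot path keeps the single NULL check it already has; the extra work only happens on the comparatively rare page transitions.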
### Phase 2: Dual Free Lists (+10-15%)
**Effort:** 3-5 days | **Risk:** Medium
```c
// Split free list:
typedef struct hakmem_page_s {
  hakmem_block_t* free;                  // Allocation path
  hakmem_block_t* local_free;            // Local frees (no atomic!)
  _Atomic(hakmem_block_t*) thread_free;  // Remote frees
} hakmem_page_t;

// In free:
if (is_local_thread(page)) {
  block->next = page->local_free;
  page->local_free = block; // No atomic!
}

// Migrate when needed:
if (!page->free && page->local_free) {
  page->free = page->local_free; // Just swap!
  page->local_free = NULL;
}
```
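The snippet above covers the local path; the owner-side drain of `thread_free` (the counterpart of mimalloc's `_mi_page_free_collect`, referenced below) can be sketched as follows. This is an assumed design, not existing HAKMEM code; remote threads are presumed to push onto `thread_free` with a CAS, as in the earlier dual-list sketch.
```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct hakmem_block_s { struct hakmem_block_s* next; } hakmem_block_t;

typedef struct hakmem_page_s {
  hakmem_block_t* free;                  /* allocation path */
  hakmem_block_t* local_free;            /* same-thread frees */
  _Atomic(hakmem_block_t*) thread_free;  /* cross-thread frees */
} hakmem_page_t;

/* Called by the owning thread only when page->free is empty.
   A single atomic exchange grabs every pending remote free at once. */
static void page_collect_frees(hakmem_page_t* page) {
  /* 1) batch in the remote frees */
  hakmem_block_t* remote = atomic_exchange_explicit(
      &page->thread_free, NULL, memory_order_acquire);
  if (remote != NULL) {
    hakmem_block_t* tail = remote;
    while (tail->next != NULL) tail = tail->next;  /* find end of remote chain */
    tail->next = page->local_free;                 /* splice local list after it */
    page->local_free = remote;
  }
  /* 2) then the cheap local swap shown above */
  page->free = page->local_free;
  page->local_free = NULL;
}
```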
### Phase 3: Branch Hints + Flags (+5-8%)
**Effort:** 1-2 days | **Risk:** Low
```c
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
// Bit-pack flags:
union page_flags {
  uint8_t combined;
  struct {
    uint8_t is_full : 1;
    uint8_t has_remote : 1;
  } bits;
};

// Single comparison:
if (page->flags.combined == 0) {
  // Fast path
}
```
---
## Expected Results
| Phase | Improvement | Cumulative M ops/sec | % of Gap Closed |
|-------|-------------|----------------------|-----------------|
| Baseline | - | 16.53 | 0% |
| Phase 1 | +15-20% | 19.20 | 35% |
| Phase 2 | +10-15% | 22.30 | 75% |
| Phase 3 | +5-8% | 24.00 | 95% |
**Final:** 16.53 → 24.00 M ops/sec (close the 47% gap to within ~1%)
---
## What Doesn't Matter
- **Prefetch instructions** - Hardware prefetcher is good enough
- **Hand-written assembly** - Compiler optimizes well
- **Magazine architecture** - Direct page cache is simpler
- **Complex encoding** - Simple XOR-rotate is sufficient
- **Bump allocation** - Linked lists are fine for mixed workloads
---
## Validation Strategy
1. **Benchmark Phase 1** (direct cache) on the mixed alloc/free workload (a harness sketch follows this list)
   - Expect: +2-3 M ops/sec (12-18%)
   - If achieved: Proceed to Phase 2
   - If not: Profile and debug
2. **Benchmark Phase 2** (dual lists)
   - Expect: +2-3 M ops/sec additional (10-15%)
   - If achieved: Proceed to Phase 3
   - If not: Analyze cache behavior
3. **Benchmark Phase 3** (branch hints + flags)
   - Expect: +1-2 M ops/sec additional (5-8%)
   - Final target: 23-24 M ops/sec
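For reference, a minimal standalone harness in the spirit of the mixed benchmark (this is not the project's actual `bench_random_mixed`; the slot count, size range, and alloc/free mix are assumptions):
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLOTS 4096                 /* live allocations kept around */
#define OPS   (50 * 1000 * 1000)   /* total malloc/free operations */

int main(void) {
  static void* slot[SLOTS];        /* zero-initialized */
  srand(12345);                    /* fixed seed for repeatable runs */

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);

  for (long i = 0; i < OPS; i++) {
    int s = rand() % SLOTS;
    if (slot[s] != NULL) {         /* slot occupied: free it */
      free(slot[s]);
      slot[s] = NULL;
    } else {                       /* slot empty: allocate 8..1024 bytes */
      slot[s] = malloc(8 + (size_t)(rand() % 1017));
    }
  }

  clock_gettime(CLOCK_MONOTONIC, &t1);
  for (int s = 0; s < SLOTS; s++) free(slot[s]);

  double secs = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf("%.2f M ops/sec\n", OPS / secs / 1e6);
  return 0;
}
```
Running it against HAKMEM and mimalloc in turn (e.g. via `LD_PRELOAD`) and comparing the printed rate gives the per-phase checkpoints above.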
---
## Code References (mimalloc source)
### Must-Read Files
1. `/src/alloc.c:200` - Entry point (`mi_malloc`)
2. `/src/alloc.c:48-59` - Hot path (`_mi_page_malloc`)
3. `/include/mimalloc/internal.h:388` - Direct cache (`_mi_heap_get_free_small_page`)
4. `/src/alloc.c:593-608` - Fast free (`mi_free`)
5. `/src/page.c:217-248` - Dual list migration (`_mi_page_free_collect`)
### Key Data Structures
1. `/include/mimalloc/types.h:447` - Heap structure (`mi_heap_s`)
2. `/include/mimalloc/types.h:283` - Page structure (`mi_page_s`)
3. `/include/mimalloc/types.h:212` - Block structure (`mi_block_s`)
4. `/include/mimalloc/types.h:228` - Bit-packed flags (`mi_page_flags_s`)
---
## Summary
mimalloc's advantage is **not** from avoiding linked lists or using bump allocation.
The 47% gap comes from **8 cumulative micro-optimizations**:
1. Direct page cache (O(1) vs O(log n))
2. Dual free lists (cache-friendly)
3. Lazy metadata updates (batching)
4. Zero-cost encoding (security for free)
5. Branch hints (CPU-friendly)
6. Bit-packed flags (fewer comparisons)
7. Aggressive inlining (smaller hot path)
8. Minimal atomics (local-first free)
Each optimization is **small** (1-20%), but they **multiply** to create the 47% gap.
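As a rough sanity check, the three phased estimates compound multiplicatively to about 1.175 × 1.125 × 1.065 ≈ 1.41 at their midpoints and 1.20 × 1.15 × 1.08 ≈ 1.49 at their upper ends, which straddles the measured 1.47x.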
**Good news:** All techniques are portable to HAKMEM without major architectural changes!
---
**Next Action:** Implement Phase 1 (direct page cache) and measure the impact on `bench_random_mixed`.