Files

Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History

Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-05 12:31:14 +09:00

6.8 KiB

Raw Blame History

mimalloc Performance Analysis - Key Findings

The 47% Gap Explained

HAKMEM: 16.53 M ops/sec mimalloc: 24.21 M ops/sec Gap: +7.68 M ops/sec (47% faster)

Top 3 Performance Secrets

1. Direct Page Cache (O(1) Lookup) - Impact: 15-20%

mimalloc:

// Single array index - O(1)
page = heap->pages_free_direct[size / 8];

HAKMEM:

// Binary search through 32 bins - O(log n)
size_class = find_size_class(size);  // ~5 comparisons
page = heap->size_classes[size_class];

Savings: ~10 cycles per allocation

2. Dual Free Lists (Local/Remote Split) - Impact: 10-15%

mimalloc:

typedef struct mi_page_s {
  mi_block_t* free;        // Hot allocation path
  mi_block_t* local_free;  // Local frees (no atomic!)
  _Atomic(mi_thread_free_t) xthread_free;  // Remote frees
} mi_page_t;

Why it's faster:

Local frees go to local_free (no atomic ops!)
Migration to free is batched (pointer swap)
Better cache locality (separate alloc/free lists)

HAKMEM: Single free list with atomic updates

3. Zero-Cost Optimizations - Impact: 5-8%

Branch hints:

if mi_likely(size <= 1024) {      // Fast path
  return fast_alloc(size);
}

Bit-packed flags:

if (page->flags.full_aligned == 0) {  // Single comparison
  // Fast path: not full, no aligned blocks
}

Lazy updates:

// Only collect remote frees when needed
if (page->free == NULL) {
  collect_remote_frees(page);
}

The Hot Path Breakdown

mimalloc (3 layers, ~20 cycles)

// Layer 0: TLS heap (2 cycles)
heap = mi_prim_get_default_heap();

// Layer 1: Direct page cache (3 cycles)
page = heap->pages_free_direct[size / 8];

// Layer 2: Pop from free list (5 cycles)
block = page->free;
if (block) {
  page->free = block->next;
  page->used++;
  return block;
}

// Layer 3: Generic fallback (slow path)
return _mi_malloc_generic(heap, size, zero, 0);

Total fast path: ~20 cycles

HAKMEM Tiny Current (3 layers, ~30-35 cycles)

// Layer 0: TLS heap (3 cycles)
heap = tls_heap;

// Layer 1: Binary search size class (~5 cycles)
size_class = find_size_class(size);  // 3-5 comparisons

// Layer 2: Get page (3 cycles)
page = heap->size_classes[size_class];

// Layer 3: Pop with atomic (~15 cycles with lock prefix)
block = page->freelist;
if (block) {
  lock_xadd(&page->used, 1);  // 10+ cycles!
  page->freelist = block->next;
  return block;
}

Total fast path: ~30-35 cycles (with atomic), ~20-25 cycles (without atomic)

Key Insight: Linked Lists Are Optimal!

mimalloc proves that intrusive linked lists are the right data structure for mixed alloc/free workloads.

The performance comes from:

O(1) page lookup (not from avoiding lists)
Cache-friendly separation (local vs remote)
Minimal atomic ops (batching)
Predictable branches (hints)

Your Phase 3 finding was correct: Linked lists are optimal. The gap comes from micro-optimizations, not data structure choice.

Actionable Recommendations

Phase 1: Direct Page Cache (+15-20%)

Effort: 1-2 days | Risk: Low

// Add to hakmem_heap_t:
hakmem_page_t* pages_direct[129];  // 1032 bytes

// In malloc hot path:
if (size <= 1024) {
  page = heap->pages_direct[size / 8];
  if (page && page->free_list) {
    return pop_block(page);
  }
}

Phase 2: Dual Free Lists (+10-15%)

Effort: 3-5 days | Risk: Medium

// Split free list:
typedef struct hakmem_page_s {
  hakmem_block_t* free;        // Allocation path
  hakmem_block_t* local_free;  // Local frees (no atomic!)
  _Atomic(hakmem_block_t*) thread_free;  // Remote frees
} hakmem_page_t;

// In free:
if (is_local_thread(page)) {
  block->next = page->local_free;
  page->local_free = block;  // No atomic!
}

// Migrate when needed:
if (!page->free && page->local_free) {
  page->free = page->local_free;  // Just swap!
  page->local_free = NULL;
}

Phase 3: Branch Hints + Flags (+5-8%)

Effort: 1-2 days | Risk: Low

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

// Bit-pack flags:
union page_flags {
  uint8_t combined;
  struct {
    uint8_t is_full : 1;
    uint8_t has_remote : 1;
  } bits;
};

// Single comparison:
if (page->flags.combined == 0) {
  // Fast path
}

Expected Results

Phase	Improvement	Cumulative M ops/sec	% of Gap Closed
Baseline	-	16.53	0%
Phase 1	+15-20%	19.20	35%
Phase 2	+10-15%	22.30	75%
Phase 3	+5-8%	24.00	95%

Final: 16.53 → 24.00 M ops/sec (close the 47% gap to within ~1%)

What Doesn't Matter

❌ Prefetch instructions - Hardware prefetcher is good enough ❌ Hand-written assembly - Compiler optimizes well ❌ Magazine architecture - Direct page cache is simpler ❌ Complex encoding - Simple XOR-rotate is sufficient ❌ Bump allocation - Linked lists are fine for mixed workloads

Validation Strategy

Benchmark Phase 1 (direct cache)
- Expect: +2-3 M ops/sec (12-18%)
- If achieved: Proceed to Phase 2
- If not: Profile and debug
Benchmark Phase 2 (dual lists)
- Expect: +2-3 M ops/sec additional (10-15%)
- If achieved: Proceed to Phase 3
- If not: Analyze cache behavior
Benchmark Phase 3 (branch hints + flags)
- Expect: +1-2 M ops/sec additional (5-8%)
- Final target: 23-24 M ops/sec

Code References (mimalloc source)

Must-Read Files

/src/alloc.c:200 - Entry point (mi_malloc)
/src/alloc.c:48-59 - Hot path (_mi_page_malloc)
/include/mimalloc/internal.h:388 - Direct cache (_mi_heap_get_free_small_page)
/src/alloc.c:593-608 - Fast free (mi_free)
/src/page.c:217-248 - Dual list migration (_mi_page_free_collect)

Key Data Structures

/include/mimalloc/types.h:447 - Heap structure (mi_heap_s)
/include/mimalloc/types.h:283 - Page structure (mi_page_s)
/include/mimalloc/types.h:212 - Block structure (mi_block_s)
/include/mimalloc/types.h:228 - Bit-packed flags (mi_page_flags_s)

Summary

mimalloc's advantage is not from avoiding linked lists or using bump allocation.

The 47% gap comes from 8 cumulative micro-optimizations:

Direct page cache (O(1) vs O(log n))
Dual free lists (cache-friendly)
Lazy metadata updates (batching)
Zero-cost encoding (security for free)
Branch hints (CPU-friendly)
Bit-packed flags (fewer comparisons)
Aggressive inlining (smaller hot path)
Minimal atomics (local-first free)

Each optimization is small (1-20%), but they multiply to create the 47% gap.

Good news: All techniques are portable to HAKMEM without major architectural changes!

Next Action: Implement Phase 1 (direct page cache) and measure the impact on bench_random_mixed.

6.8 KiB Raw Blame History