# Comprehensive Analysis: mimalloc's 14ns/op Small Allocation Optimization

## Executive Summary

mimalloc achieves **14 ns/op** for small allocations (8-64 bytes) compared to hakmem's **83 ns/op** on the same sizes, a **5.9x performance advantage**. This analysis reveals the concrete architectural decisions and optimizations that enable this performance.

**Key Finding**: The 5.9x gap is NOT due to a single optimization but rather a **coherent system design** built around three core principles:

1. Thread-local storage with zero contention
2. LIFO free list with an intrusive next-pointer (zero metadata overhead)
3. Bump allocation for sequential packing

---
## Part 1: How mimalloc Handles Small Allocations (8-64 Bytes)

### Data Structure Architecture

**mimalloc's Object Model** (for sizes ≤64B):

```
Thread-Local Heap Structure:
┌─────────────────────────────────────────────┐
│ mi_heap_t (Thread-Local)                    │
├─────────────────────────────────────────────┤
│ pages[0..127] (128 size classes)            │
│   ├─ Size class 0: 8 bytes                  │
│   ├─ Size class 1: 16 bytes                 │
│   ├─ Size class 2: 32 bytes                 │
│   ├─ Size class 3: 64 bytes                 │
│   └─ ...                                    │
│                                             │
│ Each page contains:                         │
│   ├─ free (void*)        ← LIFO stack head  │
│   ├─ local_free (void*)  ← owner-thread     │
│   ├─ block_size (size_t)                    │
│   └─ [8K of objects packed sequentially]    │
└─────────────────────────────────────────────┘
```

**Key Design Choices**:

1. **Size Classes**: 128 classes (not 8 like hakmem Tiny Pool)
   - Fine-granularity classes reduce internal fragmentation
   - 8B → 16B → 24B → 32B → ... → 128B → ... → 1KB
   - Allows requests like 24B to fit exactly (vs hakmem's 32B class)

2. **Page Size**: 8KB per page (small but not tiny)
   - Fits in L1 cache easily (typical: 32-64KB per core)
   - Sequential access pattern: excellent prefetch locality
   - Low fragmentation within page
3. **LIFO Free List** (not FIFO or segregated), simplified:

```c
// Allocation
void* mi_malloc(size_t size) {
    mi_page_t* page = mi_get_page(size_class);
    void* p = page->free;       // 1 read: list head
    page->free = *(void**)p;    // 1 read (next ptr) + 1 write
    return p;
}

// Free (simplified: mi_page_of locates the owning page from the pointer)
void mi_free(void* p) {
    mi_page_t* page = mi_page_of(p);
    void** pnext = (void**)p;
    *pnext = page->free;        // store next pointer in the block itself
    page->free = p;             // push onto the LIFO head
}
```

**Why LIFO?**
- **Cache locality**: Just-freed block reused immediately (still in cache)
- **Zero metadata**: Next pointer stored IN the free block itself
- **Minimal instructions**: 3-4 pointer ops vs bitmap scanning

### Data Structure: Intrusive Next-Pointer

**mimalloc's brilliant trick**: Free blocks store the next pointer **inside themselves**

```
Free block layout:
┌─────────────────┐
│ next_ptr (8B)   │ ← Overlaid with block content!
│                 │   (free blocks contain garbage anyway)
└─────────────────┘

Allocated block layout:
┌─────────────────┐
│ block contents  │ ← User data (8-64 bytes for small allocs)
│ no metadata     │   (metadata stored in page header, not block)
└─────────────────┘
```

**Comparison to hakmem**:

| Aspect | mimalloc | hakmem |
|--------|----------|--------|
| Metadata location | In free block (intrusive) | Separate bitmap + page header |
| Per-block overhead | 0 bytes (when allocated) | 0 bytes (bitmap), but needs lookup |
| Pointer storage | Uses 8 bytes of free block | Not stored (bitmap index) |
| Free list traversal | O(1) per block | O(1) with bitmap scan |

---
## Part 2: The Fast Path for Small Allocations

### mimalloc's Hot Path (14 ns)

```c
// Simplified mimalloc fast path for size <= 64 bytes
static inline void* mi_malloc_small(size_t size) {
    mi_heap_t* heap = mi_get_default_heap();  // (1) Load TLS      [2 ns]
    int cls = mi_size_to_class(size);         // (2) Classify size [3 ns]
    mi_page_t* page = heap->pages[cls];       // (3) Index array   [1 ns]

    void* p = page->free;                     // (4) Load free     [3 ns]
    if (mi_likely(p != NULL)) {               // (5) Branch        [1 ns]
        page->free = *(void**)p;              // (6) Update free   [3 ns]
        return p;                             // (7) Return        [1 ns]
    }
    // Slow path (refill from OS) - not taken in steady state
    return mi_malloc_slow(size);
}
```

**Instruction Breakdown** (x86-64, illustrative):

```assembly
; (1) Load TLS (__thread variable, segment-relative)
mov rax, fs:[0x30]        ; 2 cycles (TLS access)

; (2) Size classification (branchless)
lea rcx, [size - 1]
bsr rcx, rcx              ; 1 cycle
shl rcx, 3                ; 1 cycle

; (3) Array indexing
mov r8, [rax + rcx]       ; 2 cycles (page from array)

; (4-6) Free list operations
mov rax, [r8]             ; 2 cycles (load free)
test rax, rax             ; 1 cycle
jz slow_path              ; 1 cycle

mov r10, [rax]            ; 2 cycles (load next)
mov [r8], r10             ; 2 cycles (update free)
ret                       ; 2 cycles

TOTAL: ~16 cycles of instructions; with real cache and load
latencies this comes out at ~14 ns on a 3.6 GHz CPU
```

### hakmem's Current Path (83 ns)

From the Tiny Pool code examined (excerpted and simplified):

```c
// hakmem fast path
void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);  // [5 ns] if-based classification

    // TLS magazine access (with capacity checks)
    tiny_mag_init_if_needed(class_idx);            // [20 ns] initialization overhead
    TinyTLSMag* mag = &g_tls_mags[class_idx];      // [2 ns] TLS access

    if (mag->top > 0) {
        void* p = mag->items[--mag->top].ptr;      // [5 ns] array access
        // ... statistics updates [10+ ns]
        return p;                                  // [10 ns] return path
    }

    // TLS active slab fallback
    TinySlab* tls = g_tls_active_slab_a[class_idx];
    if (tls && tls->free_count > 0) {
        int block_idx = hak_tiny_find_free_block(tls);  // [20 ns] bitmap scan
        if (block_idx >= 0) {
            hak_tiny_set_used(tls, block_idx);          // [10 ns] bitmap update
            void* p = tiny_block_at(tls, block_idx);    // [3 ns] pointer calculation
                                                        // (helper name illustrative)
            return p;                                   // [10 ns] return
        }
    }

    // Worst case: lock, find free slab, scan, update
    pthread_mutex_lock(lock);                      // [100+ ns!] if contended
    // ... rest of slow path
}
```

**Critical Bottlenecks in hakmem**:

1. **Branching**: 4+ branches (magazine check, active slab A check, active slab B check)
   - Each mispredict = 15-20 cycle penalty
   - mimalloc: 1 branch

2. **Bitmap scanning**: `hak_tiny_find_free_block()` uses a summary bitmap
   - Even with optimization: 10-20 ns for summary word scan + secondary bitmap
   - mimalloc: 0 ns (free list head is directly available)

3. **Statistics updates**: Sampled counter XORing

   ```c
   t_tiny_rng ^= t_tiny_rng << 13;  // Per-thread xorshift RNG for sampling
   t_tiny_rng ^= t_tiny_rng >> 17;
   t_tiny_rng ^= t_tiny_rng << 5;
   if ((t_tiny_rng & ((1u<<g_tiny_count_sample_exp)-1u)) == 0u)
       g_tiny_pool.alloc_count[class_idx]++;
   ```

   - Cost: 15-20 ns even when sampled
   - mimalloc: No per-allocation overhead (stats collected via counters)

4. **Global state access**: Registry lookup for ownership
   - Even hash O(1) requires: hash compute + table lookup + validation
   - mimalloc: Thread-local only = L1 cache hit

---
## Part 3: How Free List Works in mimalloc

### LIFO Free List Design

**Free List Structure** (three allocations, one free):

```
Step 1: Initial state (all free)
page->free → [block1] → [block2] → [block3] → NULL

Step 2: Alloc block1
page->free → [block2] → [block3] → NULL

Step 3: Alloc block2
page->free → [block3] → NULL

Step 4: Free block2
page->free → [block2*] → [block3] → NULL
             (*: now points to block3)

Step 5: Alloc block2 (reused immediately!)
page->free → [block3] → NULL
             (block2 back in use, cache still hot!)
```
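
The sequence above can be reproduced in a few lines of C. The sketch below is illustrative only (buffer size, block size, and all names are ours, not mimalloc's); it threads an intrusive LIFO list through a static buffer and asserts that a just-freed block is the next one handed out:

```c
#include <assert.h>
#include <stdio.h>

#define BLOCK_SIZE 16
#define NUM_BLOCKS 3

static unsigned char buffer[NUM_BLOCKS * BLOCK_SIZE];
static void* free_head = NULL;

static void init_free_list(void) {
    // Thread the LIFO list through the blocks themselves (intrusive):
    // the first 8 bytes of each free block hold the next pointer.
    for (int i = NUM_BLOCKS - 1; i >= 0; i--) {
        void* block = &buffer[i * BLOCK_SIZE];
        *(void**)block = free_head;
        free_head = block;
    }
}

static void* alloc_block(void) {
    void* p = free_head;
    if (p) free_head = *(void**)p;  // pop: follow the intrusive next pointer
    return p;
}

static void free_block(void* p) {
    *(void**)p = free_head;         // push: store next pointer in the block
    free_head = p;
}

int main(void) {
    init_free_list();
    void* b1 = alloc_block();       // Step 2: alloc block1
    void* b2 = alloc_block();       // Step 3: alloc block2
    free_block(b2);                 // Step 4: free block2
    void* b3 = alloc_block();       // Step 5: the freed block comes right back
    assert(b3 == b2);               // LIFO returns the just-freed block
    printf("block1=%p block2=%p reused=%p\n", b1, b2, b3);
    return 0;
}
```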
### Why LIFO Over FIFO?

**LIFO Advantages**:
1. **Perfect cache locality**: Just-freed block still in L1/L2
2. **Working set locality**: Keeps hot blocks near the top of the list
3. **CPU prefetch friendly**: Sequential access patterns
4. **Minimum instructions**: 1 pointer load = 1 prefetch

**FIFO Problems**:
- Freed block added to tail, not reused until all others consumed
- Cold blocks promoted: cache misses increase
- O(n) linked-list tail append: not viable

**Segregated Sizes (hakmem approach)**:
- Separate freelist per exact size class
- Good for small allocations (blocks are small)
- mimalloc also uses this for allocation (128 classes)
- Difference: mimalloc per-thread, hakmem global + TLS magazine layer

---

## Part 4: Thread-Local Storage Implementation

### mimalloc's TLS Architecture

```c
// Global TLS variable (one per thread)
__thread mi_heap_t* mi_heap;

// Access pattern (VERY FAST):
static inline mi_heap_t* mi_get_thread_heap(void) {
    return mi_heap;  // Direct TLS access, no indirection
}

// Size classes (128 total):
typedef struct {
    mi_page_t* pages[MI_SMALL_CLASS_COUNT];  // 128 entries
    mi_page_t* pages_normal[MI_MEDIUM_CLASS_COUNT];
    // ...
} mi_heap_t;
```

**Key Properties**:

1. **Zero locks** on the hot path
   - Allocation: No locks (thread-local pages)
   - Free (local): No locks (owner thread)
   - Free (remote): Lock-free stack (MPSC)

2. **TLS access speed**:
   - x86-64 TLS via segment register (FS on Linux): **2 cycles** (0.5 ns @ 4GHz)
   - vs hakmem: 2-5 cycles (TLS + magazine lookup + validation)

3. **Per-thread heap isolation**:
   - Each thread has its own pages[128]
   - No contention between threads
   - Cache effects isolated per-core
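
As a rough illustration of why property 2 is cheap: with the default (initial-exec/local-exec) TLS model on x86-64 Linux, a `__thread` read compiles to a single segment-relative load. The snippet below is our own demo, not mimalloc code:

```c
// tls_demo.c — inspect the generated code with: gcc -O2 -S tls_demo.c
__thread void* my_heap;   // one instance per thread

void* get_heap(void) {
    // Typically compiles to a single: mov rax, QWORD PTR fs:my_heap@tpoff
    return my_heap;
}
```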
### hakmem's TLS Implementation

```c
// TLS Magazine (from code):
static __thread TinyTLSMag g_tls_mags[TINY_NUM_CLASSES];
static __thread TinySlab* g_tls_active_slab_a[TINY_NUM_CLASSES];
static __thread TinySlab* g_tls_active_slab_b[TINY_NUM_CLASSES];

// Multi-layer cache:
// 1. Magazine (pre-allocated list)
// 2. Active slab A (current allocating slab)
// 3. Active slab B (secondary slab)
// 4. Global free list (protected by mutex)
```

**Layers of Indirection**:
1. Size → class (branch-heavy)
2. Class → magazine (TLS read)
3. Magazine top > 0 check (branch)
4. Magazine item (array access)
5. If mag empty: slab A check (branch)
6. If slab A full: slab B check (branch)
7. If slab B full: global list (LOCK + search)

**Total overhead vs mimalloc**:
- mimalloc: 1 TLS read + 1 array index + 1 branch
- hakmem: 3+ TLS reads + 2+ branches + potential lock + potential bitmap scan

---
## Part 5: Micro-Optimizations in mimalloc

### 1. Branchless Size Classification

**mimalloc's approach** (simplified):

```c
static inline int mi_size_to_class(size_t size) {
    // Conceptually a chain of comparisons:
    //   if (size <= 8)  return 0;
    //   if (size <= 16) return 1;
    //   if (size <= 24) return 2;
    //   if (size <= 32) return 3;
    //   ... 128 classes total
    //
    // In practice: bit scanning plus a lookup table (branch-light)
    int bits = __builtin_clzll(size - 1);
    return mi_class_lookup[bits];
}
```

**hakmem's approach**:

```c
// Similar but with more branches early
if (size == 0 || size > TINY_MAX_SIZE) return -1;
if (size <= 8) return 0;
if (size <= 16) return 1;
// ... sequential if-chain
```

**Difference**:
- mimalloc: Table lookup + bit scanning = 3-5 ns, very predictable
- hakmem: If-chain = 2-10 ns depending on branch prediction

### 2. Intrusive Linked Lists (Zero Metadata)

**mimalloc Free Block**:

```
In-memory representation:
┌─────────────────────────────────┐
│ [next pointer: 8B]              │ ← Overlaid with user data area
│ [block data: 8-64B]             │
└─────────────────────────────────┘

When freed, the block itself stores the next pointer.
When allocated, that space is user data (metadata not needed).
```

**hakmem Bitmap Approach**:

```
In-memory representation:
┌─────────────────────────────────┐
│ Page Header:                    │
│  - bitmap[128 words] (1024B)    │ ← Separate from blocks
│  - summary[2 words] (16B)       │
├─────────────────────────────────┤
│ Block 1 [8B]                    │ ← No metadata in block
│ Block 2 [8B]                    │
│ ...                             │
│ Block 8192 [8B]                 │
└─────────────────────────────────┘

Lookup: bitmap[block_idx/64] & (1 << (block_idx%64))
```

**Overhead Comparison**:

| Metric | mimalloc | hakmem |
|--------|----------|--------|
| Metadata per block | 0 bytes (intrusive) | 1 bit (in bitmap) |
| Metadata storage | In free blocks | Page header (1KB/page) |
| Lookup cost | 3 instructions (follow pointer) | 5 instructions (bit extraction) |
| Cache impact | Next pointer loads from the freed block | Bitmap in page header (separate cache line) |
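
For contrast, here is a minimal sketch of the bitmap-side lookup (our own simplification, not hakmem's actual code). A summary word marks which bitmap words still contain a free bit, and `__builtin_ctzll` finds the first one; the caller is assumed to keep `summary` and `bitmap` consistent:

```c
#include <stdint.h>

#define BITMAP_WORDS 128  // 128 x 64 bits = 8192 blocks per page, as above

typedef struct {
    uint64_t summary[2];            // bit i set => bitmap word i has a free bit
    uint64_t bitmap[BITMAP_WORDS];  // bit set => block is in use
} PageHeader;

// Returns the index of a free block, or -1 if the page is full.
static int find_free_block(const PageHeader* h) {
    for (int s = 0; s < 2; s++) {
        uint64_t sum = h->summary[s];
        if (sum == 0) continue;                 // these 64 words are all full
        int w = s * 64 + __builtin_ctzll(sum);  // first word with a free bit
        uint64_t free_bits = ~h->bitmap[w];     // invert: set bits are free
        return w * 64 + __builtin_ctzll(free_bits);
    }
    return -1;
}
```

Even in this best case the lookup is a loop, two bit scans, an inversion, and index arithmetic — versus a single pointer load for the intrusive list.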
### 3. Bump Allocation Within Page

**mimalloc's initialization**:

```c
// When a new page is created:
mi_page_t* page = mi_page_new();
char* bump = page->blocks;
char* end  = page->blocks + page->capacity;

// Build the free list by traversing the page sequentially:
void* head = NULL;
for (char* p = bump; p < end; p += page->block_size) {
    *(void**)p = head;
    head = p;
}
page->free = head;
```

**Benefits**:
1. Sequential access during initialization: prefetch-friendly
2. Free list naturally encodes page layout
3. Allocation locality: sequential blocks packed together

**hakmem's equivalent**:

```c
// No explicit bump allocation.
// Instead: bitmap initialized all to 0 (free).
// Allocation: linear scan of bitmap for first zero bit.

// Difference: the summary bitmap helps, but allocation still requires:
// 1. Find summary word with a free bit  [10 ns]
// 2. Find bit within word               [5 ns]
// 3. Calculate block pointer            [2 ns]
```

### 4. Batch Decommit (Eager Unmapping)

**mimalloc's strategy** (conceptual):

```c
// When a page becomes completely free:
mi_page_reset(page);     // Mark all blocks free
mi_decommit_page(page);  // madvise(MADV_FREE/DONTNEED)
mi_free_page(page);      // Return to OS if needed
```

**Benefits**:
- Free memory returned to the OS quickly
- Prevents page creep
- RSS stays low

**hakmem's equivalent**:

```c
// L2 Pool uses:
atomic_store(&d->pending_dn, 0);  // Mark for DONTNEED
// Background thread or lazy unmapping
// Difference: lazy vs eager (mimalloc is more aggressive)
```
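
On Linux, the decommit step boils down to `madvise`. A minimal sketch of the general pattern (not mimalloc's actual code; `MADV_FREE` requires Linux 4.5+, hence the fallback):

```c
#include <stddef.h>
#include <sys/mman.h>

// Return a fully-free page's memory to the OS while keeping the
// virtual address range reserved for reuse.
static void decommit_region(void* addr, size_t len) {
#ifdef MADV_FREE
    // Lazy: the kernel reclaims the pages only under memory pressure.
    if (madvise(addr, len, MADV_FREE) == 0) return;
#endif
    // Eager fallback: pages are dropped immediately; the next touch
    // faults in a zero-filled page.
    madvise(addr, len, MADV_DONTNEED);
}
```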
---
## Part 6: Lock-Free Remote Free Handling

### mimalloc's MPSC Stack for Remote Frees

**Design** (simplified, C11 atomics):

```c
typedef struct {
    // ... other fields
    atomic_uintptr_t free_queue;  // Lock-free stack (remote frees)
    atomic_uintptr_t free_local;  // Owner-thread only
} mi_page_t;

// Remote free (from a different thread)
void mi_free_remote(void* p, mi_page_t* page) {
    uintptr_t old_head = atomic_load_explicit(&page->free_queue,
                                              memory_order_acquire);
    do {
        *(uintptr_t*)p = old_head;  // store next pointer in the block
    } while (!atomic_compare_exchange_weak_explicit(
        &page->free_queue, &old_head, (uintptr_t)p,
        memory_order_release, memory_order_acquire));
}

// Owner drains the queue back into its free list
void mi_free_drain(mi_page_t* page) {
    uintptr_t queue = atomic_exchange_explicit(&page->free_queue, 0,
                                               memory_order_acquire);
    while (queue) {
        void* p = (void*)queue;
        queue = *(uintptr_t*)p;
        *(void**)p = page->free;  // push onto the local free list
        page->free = p;
    }
}
```

**Comparison to hakmem**:

hakmem uses a similar pattern (from `hakmem_tiny.c`):

```c
// MPSC remote-free stack (lock-free)
atomic_uintptr_t remote_head;

// Push onto remote stack
static inline void tiny_remote_push(TinySlab* slab, void* ptr) {
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&slab->remote_head, memory_order_acquire);
        *((uintptr_t*)ptr) = old_head;
    } while (!atomic_compare_exchange_weak_explicit(...));
    atomic_fetch_add_explicit(&slab->remote_count, 1u, memory_order_relaxed);
}

// Owner drains
static void tiny_remote_drain_owner(TinySlab* slab) {
    uintptr_t head = atomic_exchange_explicit(&slab->remote_head, NULL, ...);
    while (head) {
        void* p = (void*)head;
        head = *((uintptr_t*)p);
        // Free block to slab
    }
}
```

**Similarity**: Both use an MPSC lock-free stack. ✅
**Difference**: hakmem drains less frequently (threshold-based)

---
## Part 7: Why hakmem's Tiny Pool Is 5.9x Slower

### Root Cause Analysis

**The Gap Components** (estimated per-allocation cost in hakmem):

| Component | mimalloc | hakmem | Cost (ns) |
|-----------|----------|--------|-----------|
| TLS access | 1 read | 2-3 reads | 2 |
| Size classification | Table + BSR | If-chain | 3 |
| Array indexing | Direct [cls] | Magazine lookup | 2 |
| Free list check | 1 branch | 3-4 branches | 15 |
| Free block load | 1 read | Bitmap scan | 20 |
| Free list update | 1 write | Bitmap write | 3 |
| Statistics overhead | 0 ns | Sampled XOR | 10 |
| Return path | Direct | Checked return | 5 |
| **TOTAL** | **14 ns** | **~60 ns** | **60 (+46 vs mimalloc)** |

**But the measured figure is 83 ns/op (+69 ns)** — about 23 ns more than the estimate above.

**Missing components** (likely):
- Branch misprediction penalties: +10-15 ns
- TLB/cache misses: +5-10 ns
- Magazine initialization (first call): +5 ns

### Architectural Differences

**mimalloc Philosophy**:
- "Fast path should be < 20 ns"
- "Optimize for allocation, not bookkeeping"
- "Use hardware features (TLS, atomic ops)"

**hakmem Philosophy** (Tiny Pool):
- "Multi-layer cache for flexibility"
- "Bookkeeping for diagnostics"
- "Global visibility for learning"

---
## Part 8: Micro-Optimizations Applicable to hakmem

### 1. Remove Conditional Branches in Fast Path

**Current** (hakmem):

```c
if (mag->top > 0) {
    void* p = mag->items[--mag->top].ptr;
    // ... 10+ ns of overhead
    return p;
}
if (tls && tls->free_count > 0) {  // Branch 2
    // ... 20+ ns
    return p;
}
```

**Optimized** (single exit path, cmov-friendly):

```c
// Structure the layers so the compiler can use conditional moves where
// profitable and the common case takes one predictable branch.
void* p = NULL;
if (mag->top > 0) {
    mag->top--;
    p = mag->items[mag->top].ptr;
}
if (!p && tls_a && tls_a->free_count > 0) {
    // Try next layer
}
return p;  // Single exit path
```

**Benefit**: Fewer mispredicted branches (15-20 cycle penalty each)
**Estimated gain**: 10-15 ns

### 2. Use Lookup Table for Size Classification

**Current** (hakmem):

```c
if (size <= 8) return 0;
if (size <= 16) return 1;
if (size <= 32) return 2;
if (size <= 64) return 3;
// ... 8 if statements
```

**Optimized**:

```c
// Indexed by size (0..64); GNU range-designator initializers keep it compact.
static const uint8_t size_to_class_lut[65] = {
    [0 ... 8]   = 0,  // 1-8 bytes   → class 0
    [9 ... 16]  = 1,  // 9-16 bytes  → class 1
    [17 ... 32] = 2,  // 17-32 bytes → class 2
    [33 ... 64] = 3,  // 33-64 bytes → class 3
};

static inline int hak_tiny_size_to_class_fast(size_t size) {
    if (size > TINY_MAX_SIZE) return -1;
    return size_to_class_lut[size];
}
```

**Benefit**: A single indexed load instead of a chain of compares
**Estimated gain**: 3-5 ns

### 3. Combine TLS Reads into Single Structure

**Current** (hakmem):

```c
TinyTLSMag* mag = &g_tls_mags[class_idx];           // Read 1
TinySlab* slab_a = g_tls_active_slab_a[class_idx];  // Read 2
TinySlab* slab_b = g_tls_active_slab_b[class_idx];  // Read 3
```

**Optimized**:

```c
// Single TLS structure, 64B-aligned so one cache line covers all three:
typedef struct {
    TinyTLSMag mag;    // inline magazine (value, not a pointer)
    TinySlab* slab_a;  // current allocating slab
    TinySlab* slab_b;  // secondary slab
} TinyTLSCache;

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];

// Single TLS address computation reaches all three layers:
TinyTLSCache* cache = &g_tls_cache[class_idx];
```

**Benefit**: Fewer TLS accesses, better cache locality
**Estimated gain**: 2-3 ns

### 4. Inline the Fast Path

**Current** (hakmem):

```c
void* hak_tiny_alloc(size_t size) {
    // ... multiple function calls on the hot path
    tiny_mag_init_if_needed(class_idx);
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        // ...
    }
}
```

**Optimized**:

```c
// likely() maps to GCC/Clang's __builtin_expect
#define HAK_LIKELY(x) __builtin_expect(!!(x), 1)

static inline __attribute__((always_inline))
void* hak_tiny_alloc_fast(size_t size) {
    int class_idx = size_to_class_lut[size];
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (HAK_LIKELY(mag->top > 0)) {
        return mag->items[--mag->top].ptr;
    }
    // Miss: fall through to the slow path (separate, non-inlined function)
    return hak_tiny_alloc_slow(size);
}
```

**Benefit**: Better instruction cache use, fewer function-call overheads
**Estimated gain**: 5-10 ns

### 5. Use Hardware Prefetching Hints

**Current** (hakmem):

```c
// No explicit prefetching
void* p = mag->items[--mag->top].ptr;
```

**Optimized**:

```c
// Prefetch the block that the *next* allocation will return
void* p = mag->items[--mag->top].ptr;
if (mag->top > 0) {
    __builtin_prefetch(mag->items[mag->top - 1].ptr, 0, 3);
}
return p;
```

**Benefit**: Reduces L1→L2 latency on the subsequent allocation
**Estimated gain**: 1-2 ns (cumulative benefit)

### 6. Remove Statistics Overhead from Critical Path

**Current** (hakmem):

```c
void* p = mag->items[--mag->top].ptr;
t_tiny_rng ^= t_tiny_rng << 13;  // xorshift RNG for sampling: ~3 ns overhead
t_tiny_rng ^= t_tiny_rng >> 17;
t_tiny_rng ^= t_tiny_rng << 5;
if ((t_tiny_rng & ((1u<<g_tiny_count_sample_exp)-1u)) == 0u)
    g_tiny_pool.alloc_count[class_idx]++;
return p;
```

**Optimized**:

```c
// Move statistics to lazy accumulation (see the sketch below)
void* p = mag->items[--mag->top].ptr;
// Count increments deferred to a per-100-allocations bulk update
return p;
```

**Benefit**: Eliminates the sampled-counter XOR work from the allocation path
**Estimated gain**: 10-15 ns
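
One way to realize the deferred bulk update is a plain thread-local counter flushed to a global atomic every N events. This is our own sketch (names and the flush threshold are illustrative, not hakmem code):

```c
#include <stdatomic.h>

#define STATS_FLUSH_INTERVAL 100

static _Atomic unsigned long g_alloc_count;     // global, touched rarely
static __thread unsigned long t_alloc_pending;  // thread-local, touched always

static inline void stats_count_alloc(void) {
    // Hot path: one thread-local increment; no RNG, no atomics.
    if (++t_alloc_pending >= STATS_FLUSH_INTERVAL) {
        atomic_fetch_add_explicit(&g_alloc_count, t_alloc_pending,
                                  memory_order_relaxed);
        t_alloc_pending = 0;
    }
}
```

The global counter lags by at most `STATS_FLUSH_INTERVAL - 1` events per thread, which is acceptable for diagnostics.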
### 7. Segregate Fast/Slow Paths into Separate Code Sections

**Current**: Mixed hot/cold code in a single function

**Optimized**:

```c
// hakmem_tiny_fast.c (hot path only, separate compilation unit)
void* hak_tiny_alloc_fast(size_t size) {
    // Minimal code; branch to the slow path only on a miss
}

// hakmem_tiny_slow.c (cold path; __attribute__((cold)) keeps it out
// of the hot instruction-cache footprint)
__attribute__((noinline, cold))
void* hak_tiny_alloc_slow(size_t size) {
    // Lock acquisition, bitmap scanning, refill, etc.
}
```

**Benefit**: Better instruction cache use, fewer CPU front-end stalls
**Estimated gain**: 2-5 ns

---

## Summary: Total Potential Improvement

### Optimizations Impact Table

| Optimization | Estimated Gain | Cumulative |
|--------------|----------------|------------|
| 1. Branch elimination | +10-15 ns | 10-15 ns |
| 2. Lookup table classification | +3-5 ns | 13-20 ns |
| 3. Combined TLS reads | +2-3 ns | 15-23 ns |
| 4. Inline fast path | +5-10 ns | 20-33 ns |
| 5. Prefetching | +1-2 ns | 21-35 ns |
| 6. Remove stats overhead | +10-15 ns | **31-50 ns** |
| 7. Code layout | +2-5 ns | **33-55 ns** |

**Current Performance**: 83 ns/op
**Estimated After Optimizations**: 28-50 ns/op
**Gap to mimalloc (14 ns)**: Still 2-3.5x slower

### Why the Remaining Gap?

**Fundamental architectural differences**:

1. **Data Structure**: Bitmap vs free list
   - Bitmap requires bit extraction [5 ns minimum]
   - Free list requires one pointer load [3 ns]
   - **Irreducible difference: +2 ns**

2. **Global State Complexity**:
   - hakmem: Multi-layer cache (magazine + slab A/B + global)
   - mimalloc: Single layer (free list)
   - Even optimized, hakmem needs validation → +5 ns

3. **Thread Ownership Tracking**:
   - hakmem tracks page ownership (for correctness/diagnostics)
   - mimalloc: Implicit (pages are thread-local)
   - **Overhead: +3-5 ns**

4. **Remote Free Handling**:
   - hakmem: MPSC queue + drain logic (similar to mimalloc)
   - Difference: frequency of drains and integration with the alloc path
   - **Overhead: +2-3 ns if a drain happens during alloc**

---
## Conclusions and Recommendations

### What mimalloc Does Better

1. **Architectural simplicity**: 1 fast path, 1 slow path
2. **Data structure elegance**: Intrusive lists eliminate per-block metadata
3. **TLS-centric design**: Zero contention, L1-cache-optimized
4. **Maturity**: 10+ years of optimization (vs hakmem's research PoC)

### What hakmem Could Adopt

**High-Impact** (10-20 ns gain):
1. Branchless classification table (+3-5 ns)
2. Remove statistics from critical path (+10-15 ns)
3. Inline fast path (+5-10 ns)

**Medium-Impact** (2-5 ns gain):
1. Combined TLS reads (+2-3 ns)
2. Hardware prefetching (+1-2 ns)
3. Code layout optimization (+2-5 ns)

**Low-Impact** (<2 ns gain):
1. Micro-optimizations in pointer arithmetic
2. Compiler tuning flags (-march=native, -mtune=native)

### Fundamental Limits

Even with all optimizations, hakmem Tiny Pool cannot reach <30 ns/op because:

1. **Bitmap lookup** is inherently slower than a free list (bit extraction vs pointer dereference)
2. **Multi-layer cache** has validation overhead (mimalloc has implicit ownership)
3. **Remote free tracking** adds per-allocation state checks

**Recommendation**: Accept that hakmem serves a different purpose (research, learning) and focus on:
- Demonstrating the trade-offs (performance vs flexibility)
- Optimizing what's changeable (fast-path overhead)
- Documenting the architecture clearly

---

## Appendix: Code References

### Key Files Analyzed

**hakmem source**:
- `/home/tomoaki/git/hakmem/hakmem_tiny.h` (lines 1-260)
- `/home/tomoaki/git/hakmem/hakmem_tiny.c` (lines 1-750+)
- `/home/tomoaki/git/hakmem/hakmem_pool.c` (lines 1-150+)

**Performance data**:
- `/home/tomoaki/git/hakmem/BENCHMARK_RESULTS_CODE_CLEANUP.md` (83 ns for 8-64B)
- `/home/tomoaki/git/hakmem/ALLOCATION_MODEL_COMPARISON.md` (14 ns for mimalloc)

**mimalloc benchmarks**:
- `/home/tomoaki/git/hakmem/docs/benchmarks/20251023_052815_SUITE/tiny_mimalloc_T*.log`

---
## References

1. Daan Leijen, Benjamin Zorn, Leonardo de Moura. *Mimalloc: Free List Sharding in Action*. Microsoft Research.
2. Jason Evans. *A Scalable Concurrent malloc(3) Implementation for FreeBSD*.
3. Emery Berger et al. *Hoard: A Scalable Memory Allocator for Multithreaded Applications*.
4. hakmem Benchmarks — internal project benchmarks.
5. Intel and AMD x86-64 optimization manuals.