# Bottleneck Analysis Report: hakmem Tiny Pool Allocator

**Date**: 2025-10-26
**Target**: hakmem bitmap-based allocator
**Baseline**: mimalloc (industry standard)
**Analyzed by**: Deep code analysis + performance modeling

---

## Executive Summary

### Top 3 Bottlenecks with Estimated Impact

1. **TLS Magazine Hierarchy Overhead** (HIGH: ~3-5 ns per allocation)
   - 3-tier indirection: TLS Magazine → TLS Active Slab → Mini-Magazine → Bitmap
   - Each tier adds cache-miss risk and branching overhead
   - Expected speedup: 30-40% if collapsed to 2 tiers

2. **Two-Tier Bitmap Traversal** (HIGH: ~4-6 ns on bitmap path)
   - Summary bitmap scan + main bitmap scan + hint_word update
   - Cache-friendly but computationally expensive (2× CTZ, 2× bitmap updates)
   - Expected speedup: 20-30% if bypassed more often via better caching

3. **Registry Lookup on Free Path** (MEDIUM: ~2-4 ns per free)
   - Hash computation + linear probe + validation on every cross-slab free
   - Could be eliminated with mimalloc-style pointer arithmetic
   - Expected speedup: 15-25% on free-heavy workloads

### Performance Gap Analysis

**Random Free Pattern** (bitmap's best case):
- hakmem: 68 M ops/sec (14.7 ns/op)
- mimalloc: 176 M ops/sec (5.7 ns/op)
- **Gap**: 2.6× slower (9 ns difference)

**Sequential LIFO Pattern** (free-list's best case):
- hakmem: 102 M ops/sec (9.8 ns/op)
- mimalloc: 942 M ops/sec (1.1 ns/op)
- **Gap**: 9.2× slower (8.7 ns difference)

**Key Insight**: Even on favorable patterns (random free), we are 2.6× slower. The bottleneck is therefore NOT just the bitmap, but the allocation architecture as a whole.

### Expected Total Speedup

- Conservative: 2.0-2.5× (partially close the 2.6× gap)
- Optimistic: 3.0-4.0× (with aggressive optimizations)
- Realistic target: 2.5× (~170 M ops/sec on random, ~250 M ops/sec on LIFO)

---

## Critical Path Analysis

### Allocation Fast Path Walkthrough

Tracing the exact execution path for `hak_tiny_alloc(16)` with step-by-step cycle estimates:
```c
// hakmem_tiny.c:557 - Entry point
void* hak_tiny_alloc(size_t size) {
    // Line 558: Initialization check
    if (!g_tiny_initialized) hak_tiny_init();      // BRANCH: ~1 cycle (predicted taken once)

    // Line 561-562: Wrapper context check
    extern int hak_in_wrapper(void);
    if (!g_wrap_tiny_enabled && hak_in_wrapper())  // BRANCH: ~1 cycle
        return NULL;

    // Line 565: Size to class conversion
    int class_idx = hak_tiny_size_to_class(size);  // INLINE: ~2 cycles (branch chain)
    if (class_idx < 0) return NULL;                // BRANCH: ~1 cycle

    // Line 569-576: SuperSlab path (disabled by default)
    if (g_use_superslab) { /* ... */ }             // BRANCH: ~1 cycle (not taken)

    // Line 650-651: TLS Magazine initialization check
    tiny_mag_init_if_needed(class_idx);            // INLINE: ~3 cycles (conditional init)
    TinyTLSMag* mag = &g_tls_mags[class_idx];      // TLS ACCESS: ~2 cycles

    // Line 666-670: TLS Magazine fast path (BEST CASE)
    if (mag->top > 0) {                            // LOAD + BRANCH: ~2 cycles
        void* p = mag->items[--mag->top].ptr;      // LOAD + DEC + STORE: ~3 cycles
        stats_record_alloc(class_idx);             // INLINE: ~1 cycle (TLS increment)
        return p;                                  // RETURN: ~1 cycle
    }
    // TOTAL FAST PATH: ~18 cycles (~6 ns @ 3 GHz)

    // Line 673-674: TLS Active Slab lookup (MEDIUM PATH)
    TinySlab* tls = g_tls_active_slab_a[class_idx];   // TLS ACCESS: ~2 cycles
    if (!(tls && tls->free_count > 0))                // LOAD + BRANCH: ~3 cycles
        tls = g_tls_active_slab_b[class_idx];         // TLS ACCESS: ~2 cycles (if taken)

    if (tls && tls->free_count > 0) {                 // BRANCH: ~1 cycle
        // Line 677-679: Remote drain check
        if (atomic_load(&tls->remote_count) >= thresh || rand() & mask) {
            tiny_remote_drain_owner(tls);             // RARE: ~50-200 cycles (if taken)
        }

        // Line 682-688: Mini-magazine fast path
        if (!mini_mag_is_empty(&tls->mini_mag)) {     // LOAD + BRANCH: ~2 cycles
            void* p = mini_mag_pop(&tls->mini_mag);   // INLINE: ~4 cycles (LIFO pop)
            if (p) {
                stats_record_alloc(class_idx);        // INLINE: ~1 cycle
                return p;                             // RETURN: ~1 cycle
            }
        }
        // MINI-MAG PATH: ~30 cycles (~10 ns)

        // Line 691-700: Batch refill from bitmap
        if (tls->free_count > 0 && mini_mag_is_empty(&tls->mini_mag)) {
            int refilled = batch_refill_from_bitmap(tls, &tls->mini_mag, 16);
            // REFILL COST: ~48 ns for 16 items = ~3 ns/item amortized
            if (refilled > 0) {
                void* p = mini_mag_pop(&tls->mini_mag);
                if (p) { stats_record_alloc(class_idx); return p; }
            }
        }
        // REFILL PATH: ~50 cycles (~17 ns) for batch + ~10 ns for next alloc

        // Line 703-713: Bitmap scan fallback
        if (tls->free_count > 0) {
            int block_idx = hak_tiny_find_free_block(tls);      // BITMAP SCAN: ~15-20 cycles
            if (block_idx >= 0) {
                hak_tiny_set_used(tls, block_idx);              // BITMAP UPDATE: ~10 cycles
                tls->free_count--;                              // STORE: ~1 cycle
                void* p = (char*)tls->base + (block_idx * bs);  // COMPUTE: ~3 cycles
                stats_record_alloc(class_idx);                  // INLINE: ~1 cycle
                return p;                                       // RETURN: ~1 cycle
            }
        }
        // BITMAP PATH: ~50 cycles (~17 ns)
    }

    // Line 717-718: Lock and refill from global pool (SLOW PATH)
    pthread_mutex_lock(lock);   // LOCK: ~30-100 cycles (contended)
    // ... slow path: 200-1000 cycles (rare) ...
}
```

### Cycle Count Summary

| Path                 | Cycles | Latency (ns) | Frequency | Notes                  |
|----------------------|--------|--------------|-----------|------------------------|
| **TLS Magazine Hit** | ~18    | ~6 ns        | 60-80%    | Best case (cache hit)  |
| **Mini-Mag Hit**     | ~30    | ~10 ns       | 10-20%    | Good case (slab-local) |
| **Batch Refill**     | ~50    | ~17 ns       | 5-10%     | Amortized 3 ns/item    |
| **Bitmap Scan**      | ~50    | ~17 ns       | 5-10%     | Worst case before lock |
| **Global Lock Path** | ~300   | ~100 ns      | <5%       | Very rare (refill)     |

**Weighted Average**: 0.7×6 + 0.15×10 + 0.1×17 + 0.05×100 ≈ **12 ns/op** (theoretical)

**Measured Actual**: 9.8-14.7 ns/op (consistent with the model)

### Comparison with mimalloc's Approach

mimalloc achieves **1.1 ns/op** on the LIFO pattern by:

1. **No TLS Magazine Layer**: Direct access to the thread-local page free-list
2. **Intrusive Free-List**: 1 load + 1 store (~2 cycles) vs our ~18 cycles
3. **2MB Alignment**: O(1) pointer→slab via bit-masking (no registry lookup)
4. **No Bitmap**: Free-list only (trades random-access resistance for speed)
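For reference, the intrusive free-list pop behind that 1-load-plus-1-store figure boils down to the following. This is a minimal sketch of the general technique with hypothetical `Page`/`Block` types, not mimalloc's actual code:

```c
#include <stddef.h>

/* Sketch of an intrusive free-list fast path (hypothetical types).
 * Each free block stores the pointer to the next free block in its own
 * first 8 bytes, so there is no separate magazine array to maintain. */
typedef struct Block { struct Block* next; } Block;
typedef struct Page  { Block* free; } Page;

static inline void* page_alloc_fast(Page* page) {
    Block* b = page->free;       // 1 load
    if (!b) return NULL;         // miss: fall back to a slower path
    page->free = b->next;        // 1 load + 1 store
    return b;
}

static inline void page_free_fast(Page* page, void* p) {
    Block* b = (Block*)p;
    b->next = page->free;        // push onto the same LIFO
    page->free = b;
}
```

Compared with the hakmem fast path above, there is no per-class TLS array walk and no copy of the pointer into a magazine slot; the free block itself is the list node.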
**hakmem's Architecture**:

```
Allocation Request
    ↓
TLS Magazine (2048 items)        ← 1st tier: ~6 ns (cache hit)
    ↓ (miss)
TLS Active Slab (2 per class)    ← 2nd tier: lookup cost
    ↓
Mini-Magazine (16-32 items)      ← 3rd tier: ~10 ns (LIFO pop)
    ↓ (miss)
Batch Refill (16 items)          ← 4th tier: ~3 ns amortized
    ↓ (miss)
Bitmap Scan (two-tier)           ← 5th tier: ~17 ns (expensive)
    ↓ (miss)
Global Lock + Slab Allocation    ← 6th tier: ~100+ ns (rare)
```

**mimalloc's Architecture**:

```
Allocation Request
    ↓
Thread-Local Page Free-List      ← 1st tier: ~1 ns (1 load + 1 store)
    ↓ (miss)
Thread-Local Page Queue          ← 2nd tier: ~5 ns (page switch)
    ↓ (miss)
Global Segment Allocation        ← 3rd tier: ~50 ns (rare)
```

**Key Difference**: mimalloc has 3 tiers; hakmem has 6. Each tier adds ~2-3 ns of overhead.

---

## Bottleneck #1: TLS Magazine Hierarchy Overhead

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 650-714 (allocation fast path)
- **Impact**: HIGH (affects 100% of allocations)

### Code Analysis

```c
// Line 650-651: 1st tier - TLS Magazine
tiny_mag_init_if_needed(class_idx);          // ~3 cycles (conditional check)
TinyTLSMag* mag = &g_tls_mags[class_idx];    // ~2 cycles (TLS base + offset)

// Line 666-670: TLS Magazine lookup
if (mag->top > 0) {                          // ~2 cycles (load + branch)
    void* p = mag->items[--mag->top].ptr;    // ~3 cycles (array access + decrement)
    stats_record_alloc(class_idx);           // ~1 cycle (TLS increment)
    return p;                                // ~1 cycle
}
// TOTAL: ~12 cycles for cache hit (BEST CASE)

// Line 673-674: 2nd tier - TLS Active Slab lookup
TinySlab* tls = g_tls_active_slab_a[class_idx];  // ~2 cycles (TLS access)
if (!(tls && tls->free_count > 0))               // ~3 cycles (2 loads + branch)
    tls = g_tls_active_slab_b[class_idx];        // ~2 cycles (if miss)

// Line 682-688: 3rd tier - Mini-Magazine
if (!mini_mag_is_empty(&tls->mini_mag)) {        // ~2 cycles (load slab->mini_mag.count)
    void* p = mini_mag_pop(&tls->mini_mag);      // ~4 cycles (LIFO pop: 2 loads + 1 store)
    if (p) { stats_record_alloc(class_idx); return p; }
}
// TOTAL: ~13 cycles for mini-mag hit (MEDIUM CASE)
```

### Why It's Slow

1. **Multiple TLS Accesses**: Each tier requires a TLS base lookup + offset calculation
   - `g_tls_mags[class_idx]` → TLS read #1
   - `g_tls_active_slab_a[class_idx]` → TLS read #2
   - `g_tls_active_slab_b[class_idx]` → TLS read #3 (conditional)
   - **Cost**: 2-3 cycles each × 3 = 6-9 cycles of overhead

2. **Cache Line Fragmentation**: TLS variables live in separate arrays
   - `g_tls_mags[8]` = ~128 KB total (2048 items × 8 classes × 8 bytes)
   - `g_tls_active_slab_a[8]` = 64 bytes
   - `g_tls_active_slab_b[8]` = 64 bytes
   - **Cost**: Spans many cache lines → potential cache misses

3. **Branch Misprediction**: The multi-tier fallback creates a branch chain
   - Magazine empty? → Check active slab A
   - Slab A empty? → Check active slab B
   - Mini-mag empty? → Refill from bitmap
   - **Cost**: Each mispredicted branch = 10-20 cycles of penalty

4. **Redundant Metadata**: Magazine items store `{void* ptr}` separately from slab pointers
   - Magazine items: 8 bytes per pointer (2048 × 8 = 16 KB per class)
   - Slab pointers: 8 bytes × 2 per class (16 bytes)
   - **Cost**: Memory overhead reduces cache efficiency
### Optimization: Unified TLS Cache Structure

**Before** (current):

```c
// Separate TLS arrays (fragmented in memory); each TinyTLSMag holds
// items[TINY_TLS_MAG_CAP] plus a top index
static __thread TinyTLSMag g_tls_mags[TINY_NUM_CLASSES];
static __thread TinySlab*  g_tls_active_slab_a[TINY_NUM_CLASSES];
static __thread TinySlab*  g_tls_active_slab_b[TINY_NUM_CLASSES];
```

**After** (proposed):

```c
// Unified per-class TLS structure (cache-line aligned)
typedef struct __attribute__((aligned(64))) {
    // Hot fields first: magazine storage + counters
    void*     mag_items[32];       // Reduced from 2048 to 32 (still effective)
    uint16_t  mag_top;             // Current magazine count
    uint16_t  mag_cap;             // Magazine capacity
    uint32_t  _pad0;

    // Warm fields
    TinySlab*    active_slab;      // Primary active slab (no A/B split)
    PageMiniMag* mini_mag;         // Direct pointer to the slab's mini-mag
    uint64_t     last_refill_tsc;  // For adaptive refill timing

    // Cold fields
    uint64_t stats_alloc_batch;    // Batched statistics
    uint64_t stats_free_batch;
} TinyTLSCache;

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];
```

**Benefits**:

1. Single TLS access: `g_tls_cache[class_idx]` instead of 3 separate lookups
2. Cache-line aligned: hot fields packed at the front of the structure
3. Reduced magazine size: 32 items (not 2048) saves ~15.75 KB per class
4. Direct mini-mag pointer: no slab→mini_mag indirection

**Expected Speedup**: 30-40% (reduce the fast path from ~12 cycles to ~7 cycles)

**Risk**: MEDIUM
- Requires refactoring TLS access patterns throughout the codebase
- The smaller magazine may increase refill frequency (trade-off)
- Needs careful testing to ensure no regression on multi-threaded workloads
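A sketch of the resulting allocation fast path, assuming the proposed `TinyTLSCache` layout above (illustrative only; the field and helper names follow the proposal, not the current code):

```c
// Illustrative fast path with the unified TLS cache: one TLS read,
// one bounds check, one array pop. Refill details are omitted.
static inline void* tiny_alloc_fast(int class_idx) {
    TinyTLSCache* c = &g_tls_cache[class_idx];   // single TLS access
    if (c->mag_top > 0) {
        return c->mag_items[--c->mag_top];       // ~3-4 cycles total
    }
    // Miss: refill c->mag_items directly from c->active_slab's mini-mag
    // or bitmap (existing batch_refill_from_bitmap path), then retry.
    return NULL;  // caller falls through to the refill/slow path
}
```

The A/B active-slab probe and the separate mini-mag check disappear from the hit path; they only run inside the refill routine.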
---

## Bottleneck #2: Two-Tier Bitmap Traversal

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.h`
- **Lines**: 235-269 (`hak_tiny_find_free_block`)
- **Impact**: HIGH (affects 5-15% of allocations, but expensive when hit)

### Code Analysis

```c
// Line 235-269: Two-tier bitmap scan
static inline int hak_tiny_find_free_block(TinySlab* slab) {
    const int bw = g_tiny_bitmap_words[slab->class_idx];  // Bitmap words
    const int sw = slab->summary_words;                   // Summary words
    if (bw <= 0 || sw <= 0) return -1;

    int start_word = slab->hint_word % bw;   // Hint optimization
    int start_sw = start_word / 64;          // Summary word index
    int start_sb = start_word % 64;          // Summary bit offset

    // Line 244-267: Summary bitmap scan (outer loop)
    for (int k = 0; k < sw; k++) {                  // ~sw iterations (1-128)
        int idx = start_sw + k;
        if (idx >= sw) idx -= sw;                   // Wrap-around

        uint64_t bits = slab->summary[idx];         // LOAD: ~2 cycles

        // Mask optimization (skip processed bits)
        if (k == 0) {
            bits &= (~0ULL) << start_sb;            // BITWISE: ~1 cycle
        }
        if (idx == sw - 1 && (bw % 64) != 0) {
            uint64_t mask = (bw % 64) == 64 ? ~0ULL
                                            : ((1ULL << (bw % 64)) - 1ULL);
            bits &= mask;                           // BITWISE: ~1 cycle
        }

        if (bits == 0) continue;                    // BRANCH: ~1 cycle (often taken)

        int woff = __builtin_ctzll(bits);           // CTZ #1: ~3 cycles
        int word_idx = idx * 64 + woff;             // COMPUTE: ~2 cycles
        if (word_idx >= bw) continue;               // BRANCH: ~1 cycle

        // Line 261-266: Main bitmap scan (inner)
        uint64_t used = slab->bitmap[word_idx];     // LOAD: ~2 cycles (cache-miss risk)
        uint64_t free_bits = ~used;                 // BITWISE: ~1 cycle
        if (free_bits == 0) continue;               // BRANCH: ~1 cycle (rare)

        int bit_idx = __builtin_ctzll(free_bits);   // CTZ #2: ~3 cycles
        slab->hint_word = (uint16_t)((word_idx + 1) % bw);   // UPDATE HINT: ~2 cycles
        return word_idx * 64 + bit_idx;             // RETURN: ~1 cycle
    }
    return -1;
}
// TYPICAL COST: 15-20 cycles (1-2 summary iterations, 1 main bitmap access)
// WORST CASE: 50-100 cycles (many summary words scanned, cache misses)
```

### Why It's Slow

1. **Two-Level Indirection**: Summary → bitmap → block
   - Summary scan: find a word with free bits (~5-10 cycles)
   - Main bitmap scan: find a bit within that word (~5 cycles)
   - **Cost**: 2× CTZ operations, 2× memory loads

2. **Cache Miss Risk**: The bitmap can be up to 1 KB (128 words × 8 bytes)
   - Class 0 (8B): 128 words = 1024 bytes
   - Class 1 (16B): 64 words = 512 bytes
   - Class 2 (32B): 32 words = 256 bytes
   - **Cost**: The bitmap may not stay resident in L1 (32 KB) → L2 access (~10-20 cycles)

3. **Hint Word State**: Must be updated on every allocation
   - Read hint_word (~1 cycle)
   - Compute the new hint (~2 cycles)
   - Write hint_word (~1 cycle)
   - **Cost**: ~4 cycles per allocation (not amortized)

4. **Branch-Heavy Loop**: Multiple branches per iteration
   - `if (bits == 0) continue;` (often taken when the bitmap is sparse)
   - `if (word_idx >= bw) continue;` (rare safety check)
   - `if (free_bits == 0) continue;` (rare but costly)
   - **Cost**: Branch misprediction = 10-20 cycles each

### Optimization #1: Increase Mini-Magazine Capacity

**Rationale**: Avoid the bitmap scan by keeping more items in the mini-magazine

**Current**:
```c
// Line 344: Mini-magazine capacity
uint16_t mag_capacity = (class_idx <= 3) ? 32 : 16;
```

**Proposed**:
```c
// Increase capacity to reduce bitmap scan frequency
uint16_t mag_capacity = (class_idx <= 3) ? 64 : 32;
```

**Benefits**:
- Fewer bitmap scans (amortized over 64 items instead of 32)
- Better temporal locality (more items cached)

**Costs**:
- +256 bytes of memory per slab (64 × 8-byte pointers)
- Slightly higher refill cost (64 items vs 32)

**Expected Speedup**: 10-15% (cuts bitmap scan frequency roughly in half)

**Risk**: LOW (simple parameter change, no logic changes)

### Optimization #2: Cache-Aware Bitmap Layout

**Rationale**: Keep the bitmap on the same cache lines as the slab metadata for hot classes

**Current**:
```c
// Separate bitmap allocation (may be cache-cold)
slab->bitmap = (uint64_t*)hkm_libc_calloc(bitmap_size, sizeof(uint64_t));
```

**Proposed**:
```c
// Embed small bitmaps directly in the slab structure
typedef struct TinySlab {
    // ... existing fields ...

    // Embedded bitmap for bitmaps of ≤256 bytes (≤32 words)
    union {
        uint64_t* bitmap_ptr;        // Large bitmaps: heap-allocated
        uint64_t  bitmap_embed[32];  // Small bitmaps: embedded (256 bytes)
    };
    uint8_t bitmap_embedded;         // Flag: 1=embedded, 0=heap
} TinySlab;
```

**Benefits**:
- Classes with ≤32 bitmap words: the bitmap fits in the embedded 256 bytes
- Bitmap and slab metadata share adjacent cache lines (no pointer chase)
- No separate heap allocation for those classes

**Expected Speedup**: 5-10% (fewer cache misses on bitmap access)

**Risk**: MEDIUM (requires refactoring the bitmap access logic)
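Access sites would then go through a small accessor instead of touching `slab->bitmap` directly. A sketch under the proposed layout (the flag and field names are the ones introduced above):

```c
// Sketch: single accessor that hides whether the bitmap is embedded or
// heap-allocated, so hak_tiny_find_free_block / set_used / set_free only
// need a one-line change each.
static inline uint64_t* tiny_slab_bitmap(TinySlab* slab) {
    return slab->bitmap_embedded ? slab->bitmap_embed : slab->bitmap_ptr;
}
```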
### Optimization #3: Lazy Summary Bitmap Update

**Rationale**: The summary-bitmap update is expensive on the free path

**Current**:
```c
// Line 199-213: Summary update on every set_used/set_free
static inline void hak_tiny_set_used(TinySlab* slab, int block_idx) {
    // ... bitmap update ...

    // Update summary (EXPENSIVE)
    int sum_word = word_idx / 64;
    int sum_bit  = word_idx % 64;
    uint64_t has_free = ~v;
    if (has_free != 0) {
        slab->summary[sum_word] |= (1ULL << sum_bit);    // WRITE
    } else {
        slab->summary[sum_word] &= ~(1ULL << sum_bit);   // WRITE
    }
}
```

**Proposed**:
```c
// Lazy summary update (rebuild only when scanning)
static inline void hak_tiny_set_used(TinySlab* slab, int block_idx) {
    // ... bitmap update ...
    // NO SUMMARY UPDATE (deferred)
}

static inline int hak_tiny_find_free_block(TinySlab* slab) {
    // Rebuild the summary if stale (rare)
    if (slab->summary_stale) {
        rebuild_summary_bitmap(slab);   // O(N) but rare
        slab->summary_stale = 0;
    }
    // ... existing scan logic ...
}
```

**Benefits**:
- Eliminates the summary update on ~95% of operations (free path)
- The summary rebuild cost is amortized over many allocations

**Expected Speedup**: 15-20% on free-heavy workloads

**Risk**: MEDIUM (requires careful stale-bit management)
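The proposal above does not show who sets `summary_stale`; the natural place is the free-side bitmap update. A sketch, assuming a `summary_stale` flag is added to `TinySlab`:

```c
// Sketch: mark the summary stale whenever a "used" bit is cleared without
// maintaining the summary, so the next scan knows to rebuild it first.
static inline void hak_tiny_set_free(TinySlab* slab, int block_idx) {
    int word_idx = block_idx / 64;
    int bit_idx  = block_idx % 64;
    slab->bitmap[word_idx] &= ~(1ULL << bit_idx);   // clear "used" bit
    slab->summary_stale = 1;                        // defer the summary update
}
```

The allocation side (`set_used`) can stay lazy without a flag: a summary bit that over-reports free words is already tolerated by the scan's `if (free_bits == 0) continue;` branch.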
---

## Bottleneck #3: Registry Lookup on Free Path

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 1102-1118 (`hak_tiny_free`)
- **Impact**: MEDIUM (affects cross-slab frees, ~30-50% of frees)

### Code Analysis

```c
// Line 1102-1118: Free path with registry lookup
void hak_tiny_free(void* ptr) {
    if (!ptr || !g_tiny_initialized) return;

    // Line 1106-1111: SuperSlab fast path (disabled by default)
    SuperSlab* ss = ptr_to_superslab(ptr);        // BITWISE: ~2 cycles
    if (ss && ss->magic == SUPERSLAB_MAGIC) {     // LOAD + BRANCH: ~3 cycles
        hak_tiny_free_superslab(ptr, ss);         // FAST PATH: ~5 ns
        return;
    }

    // Line 1114: Registry lookup (EXPENSIVE)
    TinySlab* slab = hak_tiny_owner_slab(ptr);    // LOOKUP: ~10-30 cycles
    if (!slab) return;

    hak_tiny_free_with_slab(ptr, slab);           // FREE: ~50-200 cycles
}

// hakmem_tiny.c:395-440 - Registry lookup implementation
TinySlab* hak_tiny_owner_slab(void* ptr) {
    if (!ptr || !g_tiny_initialized) return NULL;

    if (g_use_registry) {
        // O(1) hash table lookup
        uintptr_t slab_base = (uintptr_t)ptr & ~(TINY_SLAB_SIZE - 1);  // BITWISE: ~2 cycles
        TinySlab* slab = registry_lookup(slab_base);   // FUNCTION CALL: ~20-50 cycles
        if (!slab) return NULL;

        // Validation (bounds check)
        uintptr_t start = (uintptr_t)slab->base;
        uintptr_t end = start + TINY_SLAB_SIZE;
        if ((uintptr_t)ptr < start || (uintptr_t)ptr >= end) {
            return NULL;  // False positive
        }
        return slab;
    } else {
        // O(N) linear search (fallback)
        for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
            pthread_mutex_lock(lock);   // LOCK: ~30-100 cycles
            // Search free slabs
            for (TinySlab* slab = g_tiny_pool.free_slabs[class_idx]; slab; slab = slab->next) {
                // ... bounds check ...
            }
            pthread_mutex_unlock(lock);
        }
        return NULL;
    }
}

// Line 268-288: Registry lookup (hash table linear probe)
static TinySlab* registry_lookup(uintptr_t slab_base) {
    int hash = registry_hash(slab_base);                   // HASH: ~5 cycles

    for (int i = 0; i < SLAB_REGISTRY_MAX_PROBE; i++) {    // Up to 8 probes
        int idx = (hash + i) & SLAB_REGISTRY_MASK;         // BITWISE: ~2 cycles
        SlabRegistryEntry* entry = &g_slab_registry[idx];  // LOAD: ~2 cycles

        if (entry->slab_base == slab_base) {               // LOAD + BRANCH: ~3 cycles
            TinySlab* owner = entry->owner;                // LOAD: ~2 cycles
            return owner;
        }
        if (entry->slab_base == 0) {                       // LOAD + BRANCH: ~2 cycles
            return NULL;  // Empty slot
        }
    }
    return NULL;
}
// TYPICAL COST: 20-30 cycles (1-2 probes, cache hit)
// WORST CASE: 50-100 cycles (8 probes, cache miss on registry array)
```

### Why It's Slow

1. **Hash Computation**: The hash itself is cheap
   ```c
   static inline int registry_hash(uintptr_t slab_base) {
       return (slab_base >> 16) & SLAB_REGISTRY_MASK;  // Simple, but...
   }
   ```
   - Shift + mask = 2 cycles (acceptable)
   - **BUT**: linear probing on collision adds 10-30 cycles

2. **Linear Probing**: Up to 8 probes on collision
   - Each probe: load + compare + branch (3 cycles × 8 = 24 cycles worst case)
   - Registry size: 1024 entries (8 KB array)
   - **Cost**: May span multiple cache lines → cache miss (10-20 cycles of penalty)

3. **Validation Overhead**: Bounds check after the lookup
   - Load slab->base (2 cycles)
   - Compute the end address (1 cycle)
   - Compare twice (2 cycles)
   - **Cost**: 5 cycles per free (not amortized)

4. **Global Shared State**: The registry is shared across all threads
   - No cache-line alignment (false-sharing risk)
   - Lock-free reads → potential ABA issues
   - **Cost**: Atomic-load penalties (~5-10 cycles vs a normal load)

### Optimization #1: Enable SuperSlab by Default

**Rationale**: SuperSlab gives O(1) pointer→slab via 2MB alignment (mimalloc-style)

**Current**:
```c
// Line 81: SuperSlab disabled by default
static int g_use_superslab = 0;  // Runtime toggle
```

**Proposed**:
```c
// Enable SuperSlab by default
static int g_use_superslab = 1;  // Always on
```

**Benefits**:
- Eliminates the registry lookup entirely: `ptr & ~0x1FFFFF` (one AND operation)
- SuperSlab free path: ~5 ns (vs ~10-30 ns on the registry path)
- Better cache locality (2MB-aligned pages)

**Costs**:
- 2MB of address space per SuperSlab (not physical memory, thanks to lazy allocation)
- Slightly higher memory overhead (metadata at the SuperSlab level)

**Expected Speedup**: 20-30% on free-heavy workloads

**Risk**: LOW (SuperSlab already implemented and tested in Phase 6.23)
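For illustration, the pointer→SuperSlab mapping that replaces the registry lookup is just an alignment mask plus a magic check. A sketch assuming 2MB-aligned SuperSlabs and the `magic` field used in the free path above; the real `ptr_to_superslab` may differ:

```c
#include <stdint.h>

#define SUPERSLAB_SIZE (2u * 1024 * 1024)   // assumed 2MB alignment

// Sketch: round the pointer down to its 2MB region and treat the start of
// that region as the SuperSlab header. One AND, one load, one compare.
static inline SuperSlab* ptr_to_superslab_sketch(void* p) {
    SuperSlab* ss = (SuperSlab*)((uintptr_t)p & ~((uintptr_t)SUPERSLAB_SIZE - 1));
    return (ss && ss->magic == SUPERSLAB_MAGIC) ? ss : NULL;
}
```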
### Optimization #2: Cache Last Freed Slab

**Rationale**: Temporal locality — the next free is likely to come from the same slab

**Proposed**:
```c
// Per-thread cache of the last freed slab
static __thread TinySlab* t_last_freed_slab[TINY_NUM_CLASSES] = {NULL};

void hak_tiny_free(void* ptr) {
    if (!ptr) return;

    // Try the cached slab first (likely hit)
    int class_idx = guess_class_from_size(ptr);   // Heuristic
    TinySlab* slab = t_last_freed_slab[class_idx];

    // Validate that the pointer lies within this slab
    if (slab && ptr_in_slab_range(ptr, slab)) {
        hak_tiny_free_with_slab(ptr, slab);        // FAST PATH: ~5 ns
        return;
    }

    // Fallback to registry lookup (rare)
    slab = hak_tiny_owner_slab(ptr);
    if (slab) {
        t_last_freed_slab[slab->class_idx] = slab;  // Update cache
        hak_tiny_free_with_slab(ptr, slab);
    }
}
```

**Benefits**:
- 80-90% cache hit rate (temporal locality)
- Fast path: 2 loads + 2 compares (~5 cycles) vs a registry lookup (20-30 cycles)

**Expected Speedup**: 15-20% on free-heavy workloads

**Risk**: MEDIUM (requires a heuristic for guessing class_idx; may mispredict)
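The `ptr_in_slab_range` check used in the proposal above is not defined in the source; a plausible one-line implementation (an assumption) is:

```c
// Assumed helper: true if ptr falls inside slab's block area.
// TINY_SLAB_SIZE and TinySlab::base are the fields used in the existing code.
static inline int ptr_in_slab_range(const void* ptr, const TinySlab* slab) {
    uintptr_t off = (uintptr_t)ptr - (uintptr_t)slab->base;
    return off < TINY_SLAB_SIZE;   // unsigned wrap makes this a single compare
}
```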
---

## Bottleneck #4: Statistics Collection Overhead

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny_stats.h`
- **Lines**: 59-73 (`stats_record_alloc`, `stats_record_free`)
- **Impact**: LOW (already optimized via TLS batching, but still ~0.5 ns per op)

### Code Analysis

```c
// Line 59-62: Allocation statistics (inline)
static inline void stats_record_alloc(int class_idx) __attribute__((always_inline));
static inline void stats_record_alloc(int class_idx) {
    t_alloc_batch[class_idx]++;   // TLS INCREMENT: ~0.5-1 cycle
}

// Line 70-73: Free statistics (inline)
static inline void stats_record_free(int class_idx) __attribute__((always_inline));
static inline void stats_record_free(int class_idx) {
    t_free_batch[class_idx]++;    // TLS INCREMENT: ~0.5-1 cycle
}
```

### Why It's (Slightly) Slow

1. **TLS Access Overhead**: Even TLS has a cost
   - TLS base register: %fs on x86-64 (implicit)
   - Offset calculation: `[%fs + class_idx*4]`
   - **Cost**: ~0.5 cycles (not zero)

2. **Cache Line Pollution**: TLS counters compete for L1 cache
   - `t_alloc_batch[8]` = 32 bytes
   - `t_free_batch[8]` = 32 bytes
   - **Cost**: 64 bytes of L1 cache (one cache line)

3. **Compiler Optimization Barriers**: `always_inline` limits optimization
   - Forces inlining (good)
   - But the increment cannot be hoisted out of hot loops (bad)
   - **Cost**: An increment inside the hot loop instead of once outside it

### Optimization: Compile-Time Statistics Toggle

**Rationale**: Production builds don't need exact counts

**Proposed**:
```c
#ifdef HAKMEM_ENABLE_STATS
  #define STATS_RECORD_ALLOC(cls) (t_alloc_batch[(cls)]++)
  #define STATS_RECORD_FREE(cls)  (t_free_batch[(cls)]++)
#else
  #define STATS_RECORD_ALLOC(cls) ((void)0)
  #define STATS_RECORD_FREE(cls)  ((void)0)
#endif
```

**Benefits**:
- Zero overhead when stats are disabled
- The compiler can optimize away the dead code

**Expected Speedup**: 3-5% (small but measurable)

**Risk**: VERY LOW (compile-time flag, no runtime impact)

---

## Bottleneck #5: Magazine Spill/Refill Lock Contention

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 880-939 (magazine spill under the class lock)
- **Impact**: MEDIUM (affects 5-10% of frees, when the magazine is full)

### Code Analysis

```c
// Line 880-939: Magazine spill (class lock held)
if (mag->top < cap) {
    // Fast path: push to magazine (no lock)
    mag->items[mag->top].ptr = ptr;
    mag->top++;
    return;
}

// Spill half under the class lock
pthread_mutex_t* lock = &g_tiny_class_locks[class_idx].m;
pthread_mutex_lock(lock);                       // LOCK: ~30-100 cycles (contended)

int spill = cap / 2;                            // Spill 1024 items (for 2048 cap)
for (int i = 0; i < spill && mag->top > 0; i++) {
    TinyMagItem it = mag->items[--mag->top];
    TinySlab* owner = hak_tiny_owner_slab(it.ptr);   // LOOKUP: ~20-30 cycles × 1024
    if (!owner) continue;

    // Phase 4.1: Try mini-magazine push (avoid the bitmap)
    if ((owner == tls_a || owner == tls_b) && !mini_mag_is_full(&owner->mini_mag)) {
        mini_mag_push(&owner->mini_mag, it.ptr);     // FAST: ~4 cycles
        continue;
    }

    // Slow path: bitmap update
    size_t bs = g_tiny_class_sizes[owner->class_idx];
    int idx = ((uintptr_t)it.ptr - (uintptr_t)owner->base) / bs;   // DIV: ~10 cycles
    if (hak_tiny_is_used(owner, idx)) {
        hak_tiny_set_free(owner, idx);               // BITMAP: ~10 cycles
        owner->free_count++;
        // ... list management ...
    }
}
pthread_mutex_unlock(lock);

// TOTAL SPILL COST: ~50,000-100,000 cycles (1024 items × 50-100 cycles/item)
// Amortized: ~50-100 cycles (~17-33 ns) per free when a spill occurs every ~1000 frees
```

### Why It's Slow

1. **Lock Hold Time**: The lock is held for the entire spill (1024 items)
   - Blocks other threads from taking the class lock
   - A spill takes ~17-33 µs at 3 GHz → other threads stall
   - **Cost**: Contention penalty on multi-threaded workloads

2. **Registry Lookup in a Loop**: 1024 lookups under the lock
   - `hak_tiny_owner_slab(it.ptr)` is called 1024 times
   - Each lookup: 20-30 cycles
   - **Cost**: 20,000-30,000 cycles just for lookups

3. **Division in the Hot Loop**: The block-index calculation uses division
   - `int idx = ((uintptr_t)it.ptr - (uintptr_t)owner->base) / bs;`
   - Division is ~10 cycles on modern CPUs (not fully pipelined)
   - **Cost**: ~10,000 cycles for 1024 divisions

4. **Large Spill Batch**: 1024 items is too large
   - Amortizes the lock cost well (good)
   - But increases lock hold time (bad)
   - The trade-off is not tuned

### Optimization #1: Reduce Spill Batch Size

**Rationale**: Smaller batches = shorter lock hold time = less contention

**Current**:
```c
int spill = cap / 2;   // 1024 items for a 2048 cap
```

**Proposed**:
```c
int spill = 128;       // Fixed batch size (not cap-dependent)
```

**Benefits**:
- Shorter lock hold time: roughly 1/8 of the current spill (~2-4 µs vs ~17-33 µs)
- Better multi-thread responsiveness

**Costs**:
- More frequent spills (8× more frequent)
- Slightly higher total lock overhead

**Expected Speedup**: 10-15% on multi-threaded workloads

**Risk**: LOW (simple parameter change)

### Optimization #2: Lock-Free Spill Stack

**Rationale**: Avoid the lock entirely on the spill path

**Proposed**:
```c
// Per-class global spill stack (lock-free MPSC)
static atomic_uintptr_t g_spill_stack[TINY_NUM_CLASSES];

void magazine_spill_lockfree(int class_idx, void* ptr) {
    // Push onto the lock-free stack
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&g_spill_stack[class_idx], memory_order_acquire);
        *((uintptr_t*)ptr) = old_head;   // Intrusive next-pointer in the freed block
    } while (!atomic_compare_exchange_weak_explicit(&g_spill_stack[class_idx],
                                                    &old_head, (uintptr_t)ptr,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

// A background thread (or periodic hook) drains the spill stack
void background_drain_spill_stack(void) {
    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        uintptr_t head = atomic_exchange_explicit(&g_spill_stack[i], 0, memory_order_acq_rel);
        if (!head) continue;

        pthread_mutex_lock(&g_tiny_class_locks[i].m);
        // ... drain to bitmap ...
        pthread_mutex_unlock(&g_tiny_class_locks[i].m);
    }
}
```

**Benefits**:
- Zero lock contention on the spill path
- Fast atomic CAS (~5-10 cycles)

**Costs**:
- Requires a background thread or a periodic drain
- Slightly more complex memory management

**Expected Speedup**: 20-30% on multi-threaded workloads

**Risk**: HIGH (requires careful design of the background drain mechanism)
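One way to get the "periodic drain" mentioned in the costs above without a dedicated thread (an assumption, not part of the proposal) is to piggyback on the existing slow path, which already takes the class lock:

```c
// Sketch: opportunistic drain from the allocation slow path. Because the
// slow path already holds (or is about to take) the class lock, draining
// here adds no extra lock acquisitions in the common case.
static void tiny_slow_path_drain(int class_idx) {
    uintptr_t head = atomic_exchange_explicit(&g_spill_stack[class_idx], 0,
                                              memory_order_acq_rel);
    while (head) {
        void* p = (void*)head;
        head = *((uintptr_t*)p);   // follow the intrusive next-pointer
        // return p to its owning slab's bitmap/mini-mag under the class lock
        // (same logic as the locked section of background_drain_spill_stack)
    }
}
```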
---

## Bottleneck #6: Branch Misprediction in Size Class Lookup

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.h`
- **Lines**: 159-182 (`hak_tiny_size_to_class`)
- **Impact**: LOW (only 1-2 ns per allocation, but called on every allocation)

### Code Analysis

```c
// Line 159-182: Size to class lookup (branch chain)
static inline int hak_tiny_size_to_class(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return -1;   // BRANCH: ~1 cycle

    // Branch chain (8 branches for 8 classes)
    if (size <= 8)    return 0;    // BRANCH: ~1 cycle
    if (size <= 16)   return 1;    // BRANCH: ~1 cycle
    if (size <= 32)   return 2;    // BRANCH: ~1 cycle
    if (size <= 64)   return 3;    // BRANCH: ~1 cycle
    if (size <= 128)  return 4;    // BRANCH: ~1 cycle
    if (size <= 256)  return 5;    // BRANCH: ~1 cycle
    if (size <= 512)  return 6;    // BRANCH: ~1 cycle
    return 7;                      // size <= 1024
}
// TYPICAL COST: 3-5 cycles (3-4 branches taken)
// WORST CASE: 8 cycles (all branches checked)
```

### Why It's (Slightly) Slow

1. **Unpredictable Size Distribution**: The branch predictor cannot learn the pattern
   - Real-world allocation sizes are quasi-random
   - Size 16 is the most common (~33%), but the rest vary
   - **Cost**: ~20-30% branch misprediction rate (~10 cycles of penalty each)

2. **Sequential Dependency**: Each branch depends on the previous one
   - The CPU cannot evaluate the branches in parallel
   - They must be resolved in order
   - **Cost**: No instruction-level parallelism (ILP)

### Optimization: Branchless Lookup Table

**Rationale**: Use a lookup table for common sizes and CLZ (count leading zeros) for the rest, giving O(1) class lookup

**Proposed**:
```c
// Lookup table for size → class (branchless for sizes ≤ 128).
// Index 0 is unused (size == 0 is rejected above); sizes 1-8 map to class 0,
// matching the branch-chain version.
static const uint8_t g_size_to_class_table[129] = {
    0,                                              // size 0 (unreachable)
    0, 0, 0, 0, 0, 0, 0, 0,                         // sizes 1-8    → class 0
    1, 1, 1, 1, 1, 1, 1, 1,                         // sizes 9-16   → class 1
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // sizes 17-32  → class 2
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // sizes 33-64  → class 3
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, // sizes 65-128 → class 4
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
};

static inline int hak_tiny_size_to_class(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return -1;

    // Fast path: direct table lookup for small sizes
    if (size <= 128) {
        return g_size_to_class_table[size];   // LOAD: ~2 cycles (L1 cache)
    }

    // Larger sizes: CLZ-based
    //   size 129-256  → class 5
    //   size 257-512  → class 6
    //   size 513-1024 → class 7
    return 61 - __builtin_clzll((unsigned long long)(size - 1));
}
// TYPICAL COST: 2-3 cycles (table lookup, no data-dependent branches)
```

**Benefits**:
- Branchless for common sizes (8-128B covers 80%+ of allocations)
- The table fits in L1 cache (129 bytes ≈ 2 cache lines)
- Predictable performance (no branch misprediction)

**Expected Speedup**: 2-3% (reduce ~5 cycles to 2-3 cycles)

**Risk**: VERY LOW (the table is static, no runtime overhead)
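A quick exhaustive self-check catches any table or CLZ mistake before the change ships. This sketch assumes `TINY_MAX_SIZE` is 1024 and that the original branch chain is kept around under a hypothetical name `hak_tiny_size_to_class_ref`:

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical test helper: compare the table/CLZ version against the
// original branch-chain implementation for every legal size.
static void test_size_to_class(void) {
    for (size_t size = 1; size <= 1024; size++) {
        assert(hak_tiny_size_to_class(size) == hak_tiny_size_to_class_ref(size));
    }
}
```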
---

## Bottleneck #7: Remote Free Drain Overhead

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 146-184 (`tiny_remote_drain_locked`)
- **Impact**: LOW (only affects cross-thread frees, ~10-20% of workloads)

### Code Analysis

```c
// Line 146-184: Remote free drain (under the class lock)
static void tiny_remote_drain_locked(TinySlab* slab) {
    uintptr_t head = atomic_exchange(&slab->remote_head, NULL, memory_order_acq_rel);  // ATOMIC: ~10 cycles
    unsigned drained = 0;

    while (head) {                                     // LOOP: variable iterations
        void* p = (void*)head;
        head = *((uintptr_t*)p);                       // LOAD NEXT: ~2 cycles

        // Calculate block index
        size_t block_size = g_tiny_class_sizes[slab->class_idx];   // LOAD: ~2 cycles
        uintptr_t offset = (uintptr_t)p - (uintptr_t)slab->base;   // SUBTRACT: ~1 cycle
        int block_idx = offset / block_size;                       // DIVIDE: ~10 cycles

        // Skip if already free (idempotent)
        if (!hak_tiny_is_used(slab, block_idx)) continue;          // BITMAP CHECK: ~5 cycles

        hak_tiny_set_free(slab, block_idx);                        // BITMAP UPDATE: ~10 cycles

        int was_full = (slab->free_count == 0);                    // LOAD: ~1 cycle
        slab->free_count++;                                        // INCREMENT: ~1 cycle

        if (was_full) {
            move_to_free_list(slab->class_idx, slab);              // LIST UPDATE: ~20-50 cycles (rare)
        }
        if (slab->free_count == slab->total_count) {
            // ... slab release logic ... (rare)
            release_slab(slab);                                    // EXPENSIVE: ~1000 cycles (very rare)
            break;
        }

        g_tiny_pool.free_count[slab->class_idx]++;                 // GLOBAL INCREMENT: ~1 cycle
        drained++;
    }

    if (drained)
        atomic_fetch_sub(&slab->remote_count, drained, memory_order_relaxed);  // ATOMIC: ~10 cycles
}
// TYPICAL COST: 50-100 cycles per drained block (moderate)
// WORST CASE: 1000+ cycles (slab release)
```

### Why It's Slow

1. **Division in the Loop**: The block-index calculation uses division
   - `int block_idx = offset / block_size;`
   - Division is ~10 cycles (even on modern CPUs)
   - **Cost**: 10 cycles × N remote frees

2. **Atomic Operations**: 2 atomic ops per drain (exchange + fetch_sub)
   - `atomic_exchange` at the start (~10 cycles)
   - `atomic_fetch_sub` at the end (~10 cycles)
   - **Cost**: ~20 cycles of overhead (per drain, not per block, but still notable)

3. **Bitmap Update**: Same as the allocation path
   - `hak_tiny_set_free` updates both the bitmap and the summary
   - **Cost**: ~10 cycles per block

### Optimization: Multiplication-Based Division

**Rationale**: Replace the division with multiplication by a precomputed reciprocal

**Current**:
```c
int block_idx = offset / block_size;   // DIVIDE: ~10 cycles
```

**Proposed**:
```c
// Pre-computed reciprocals: (1ULL << 48) / block_size.
// Then: block_idx = (offset * reciprocal) >> 48 (exact for power-of-2 block sizes)
static const uint64_t g_tiny_block_reciprocals[TINY_NUM_CLASSES] = {
    (1ULL << 48) / 8,      // Class 0: 8B
    (1ULL << 48) / 16,     // Class 1: 16B
    (1ULL << 48) / 32,     // Class 2: 32B
    (1ULL << 48) / 64,     // Class 3: 64B
    (1ULL << 48) / 128,    // Class 4: 128B
    (1ULL << 48) / 256,    // Class 5: 256B
    (1ULL << 48) / 512,    // Class 6: 512B
    (1ULL << 48) / 1024,   // Class 7: 1024B
};

// Fast division using multiplication
int block_idx = (int)((offset * g_tiny_block_reciprocals[slab->class_idx]) >> 48);  // MUL + SHIFT: ~3 cycles
```

**Benefits**:
- Reduce ~10 cycles to ~3 cycles per division
- Saves ~7 cycles per remote free

**Expected Speedup**: 5-10% on cross-thread workloads

**Risk**: VERY LOW (a well-known compiler optimization, applied manually)
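Wrapping the multiply-shift in a small helper makes it easy to sanity-check against plain division. The helper name is illustrative; exactness holds because all block sizes are powers of two and offsets stay below the 64 KB tiny-slab size implied by the bitmap sizes above:

```c
#include <assert.h>
#include <stdint.h>

// Illustrative helper around the multiply-shift trick above.
static inline int tiny_block_index_fast(uintptr_t offset, int class_idx) {
    return (int)((offset * g_tiny_block_reciprocals[class_idx]) >> 48);
}

// Sanity check: exact match with plain division for every block start in a
// slab (assumes a 64 KB tiny slab).
static void test_block_index_fast(void) {
    static const size_t sizes[8] = { 8, 16, 32, 64, 128, 256, 512, 1024 };
    for (int c = 0; c < 8; c++) {
        for (uintptr_t off = 0; off < 64 * 1024; off += sizes[c]) {
            assert(tiny_block_index_fast(off, c) == (int)(off / sizes[c]));
        }
    }
}
```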
---

## Profiling Plan

### perf Commands to Run

```bash
# 1. CPU cycle breakdown (identify hotspots)
perf record -e cycles:u -g ./bench_comprehensive
perf report --stdio --no-children | head -100 > perf_cycles.txt

# 2. Cache miss analysis (L1d, L1i, LLC)
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,\
L1-icache-loads,L1-icache-load-misses,LLC-loads,LLC-load-misses \
    ./bench_comprehensive

# 3. Branch misprediction rate
perf stat -e cycles,instructions,branches,branch-misses \
    ./bench_comprehensive

# 4. TLB miss analysis (address translation overhead)
perf stat -e cycles,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
    ./bench_comprehensive

# 5. Function-level profiling (annotated source)
perf record -e cycles:u --call-graph dwarf ./bench_comprehensive
perf report --stdio --sort symbol --percent-limit 1

# 6. Memory bandwidth utilization
perf stat -e cycles,mem_load_retired.l1_hit,mem_load_retired.l1_miss,\
mem_load_retired.l2_hit,mem_load_retired.l3_hit,mem_load_retired.l3_miss \
    ./bench_comprehensive

# 7. Allocation-specific hotspots (focus on hak_tiny_*)
perf record -e cycles:u --call-graph dwarf ./bench_comprehensive
perf report --stdio | grep "hak_tiny"
```

### Expected Hotspots to Validate

Based on the code analysis, we expect to see:

1. **hak_tiny_find_free_block** (15-25% of cycles)
   - Two-tier bitmap scan
   - CTZ operations
   - Cache misses on large bitmaps

2. **hak_tiny_set_used / hak_tiny_set_free** (10-15% of cycles)
   - Bitmap updates
   - Summary bitmap updates
   - Write-heavy (cache-line bouncing)

3. **hak_tiny_owner_slab** (10-20% of cycles on the free path)
   - Registry lookup
   - Hash computation
   - Linear probing

4. **tiny_mag_init_if_needed** (5-10% of cycles)
   - TLS access
   - Conditional initialization

5. **stats_record_alloc / stats_record_free** (3-5% of cycles)
   - TLS counter increments
   - Cache line pollution

### Validation Criteria

**Cache Miss Rates**:
- L1d miss rate: < 5% (good), 5-10% (acceptable), > 10% (poor)
- LLC miss rate: < 1% (good), 1-3% (acceptable), > 3% (poor)

**Branch Misprediction**:
- Misprediction rate: < 2% (good), 2-5% (acceptable), > 5% (poor)
- Expected: 3-4% (due to unpredictable size classes)

**IPC (Instructions Per Cycle)**:
- IPC: > 2.0 (good), 1.5-2.0 (acceptable), < 1.5 (poor)
- Expected: 1.5-1.8 (memory-bound, not compute-bound)

**Function Time Distribution**:
- hak_tiny_alloc: 40-60% (hot path)
- hak_tiny_free: 20-30% (warm path)
- hak_tiny_find_free_block: 10-20% (expensive when hit)
- Other: < 10%

---

## Optimization Roadmap

### Quick Wins (< 1 hour, Low Risk)

1. **Enable SuperSlab by Default** (Bottleneck #3)
   - Change: `g_use_superslab = 1;`
   - Impact: 20-30% speedup on the free path
   - Risk: VERY LOW (already implemented)
   - Effort: 5 minutes

2. **Disable Statistics in Production** (Bottleneck #4)
   - Change: wrap the counters in `#ifdef HAKMEM_ENABLE_STATS` guards
   - Impact: 3-5% speedup
   - Risk: VERY LOW (compile-time flag)
   - Effort: 15 minutes

3. **Increase Mini-Magazine Capacity** (Bottleneck #2)
   - Change: `mag_capacity = 64` (was 32)
   - Impact: 10-15% speedup (fewer bitmap scans)
   - Risk: LOW (slight memory increase)
   - Effort: 5 minutes

4. **Branchless Size Class Lookup** (Bottleneck #6)
   - Change: use a lookup table for common sizes
   - Impact: 2-3% speedup
   - Risk: VERY LOW (static table)
   - Effort: 30 minutes

**Total Expected Speedup: 35-53%** (conservative: 1.4-1.5×)
### Medium Effort (1-4 hours, Medium Risk)

5. **Unified TLS Cache Structure** (Bottleneck #1)
   - Change: merge the TLS arrays into a single cache-aligned struct
   - Impact: 30-40% speedup on the fast path
   - Risk: MEDIUM (requires refactoring)
   - Effort: 3-4 hours

6. **Reduce Magazine Spill Batch** (Bottleneck #5)
   - Change: `spill = 128` (was 1024)
   - Impact: 10-15% speedup on multi-threaded workloads
   - Risk: LOW (parameter tuning)
   - Effort: 30 minutes

7. **Cache-Aware Bitmap Layout** (Bottleneck #2)
   - Change: embed small bitmaps in the slab structure
   - Impact: 5-10% speedup
   - Risk: MEDIUM (requires struct changes)
   - Effort: 2-3 hours

8. **Multiplication-Based Division** (Bottleneck #7)
   - Change: replace division with mul+shift
   - Impact: 5-10% speedup on remote frees
   - Risk: VERY LOW (well-known optimization)
   - Effort: 1 hour

**Total Expected Speedup: 50-75%** (conservative: 1.5-1.8×)

### Major Refactors (> 4 hours, High Risk)

9. **Lock-Free Spill Stack** (Bottleneck #5)
   - Change: use an atomic MPSC stack for the magazine spill
   - Impact: 20-30% speedup on multi-threaded workloads
   - Risk: HIGH (complex concurrency)
   - Effort: 8-12 hours

10. **Lazy Summary Bitmap Update** (Bottleneck #2)
    - Change: rebuild the summary only when scanning
    - Impact: 15-20% speedup on free-heavy workloads
    - Risk: MEDIUM (requires careful staleness tracking)
    - Effort: 4-6 hours

11. **Collapse TLS Magazine Tiers** (Bottleneck #1)
    - Change: merge the magazine + mini-mag into a single LIFO
    - Impact: 40-50% speedup (eliminates tier overhead)
    - Risk: HIGH (major architectural change)
    - Effort: 12-16 hours

12. **Full mimalloc-Style Rewrite** (All Bottlenecks)
    - Change: replace the bitmap with an intrusive free-list
    - Impact: 5-9× speedup (match mimalloc)
    - Risk: VERY HIGH (complete redesign)
    - Effort: 40+ hours

**Total Expected Speedup: 75-150%** (optimistic: 1.8-2.5×)

---

## Risk Assessment Summary

### Low Risk Optimizations (safe to implement immediately)
- SuperSlab enable
- Statistics compile-time toggle
- Mini-mag capacity increase
- Branchless size lookup
- Multiplication-based division
- Magazine spill batch reduction

**Expected: 1.4-1.6× speedup, 2-3 hours of effort**

### Medium Risk Optimizations (test thoroughly)
- Unified TLS cache structure
- Cache-aware bitmap layout
- Lazy summary update

**Expected: 1.6-2.0× speedup, 6-10 hours of effort**

### High Risk Optimizations (prototype first)
- Lock-free spill stack
- Magazine tier collapse
- Full mimalloc rewrite

**Expected: 2.0-9.0× speedup, 20-60 hours of effort**

---

## Estimated Speedup Summary

### Conservative Target (Low + Medium optimizations)
- **Random pattern**: 68 M ops/sec → **140 M ops/sec** (2.0× speedup)
- **LIFO pattern**: 102 M ops/sec → **200 M ops/sec** (2.0× speedup)
- **Gap to mimalloc**: 2.6× → **1.3×** (closes ~50% of the gap)

### Optimistic Target (all optimizations)
- **Random pattern**: 68 M ops/sec → **170 M ops/sec** (2.5× speedup)
- **LIFO pattern**: 102 M ops/sec → **450 M ops/sec** (4.4× speedup)
- **Gap to mimalloc**: 2.6× → **~1.0×** (match on random, ~2× behind on LIFO)

---

## Conclusion

The hakmem allocator's 2.6× gap to mimalloc on favorable patterns (random free) is primarily due to:

1. **Architectural overhead**: a 6-tier allocation hierarchy vs mimalloc's 3 tiers
2. **Bitmap traversal cost**: the two-tier scan adds 15-20 cycles even when optimized
3. **Registry lookup overhead**: the hash-table lookup adds 20-30 cycles on the free path

**Quick wins** (1-3 hours of effort) can achieve a **1.4-1.6× speedup**. **Medium effort** (~10 hours) can achieve **1.8-2.0×**. A **full mimalloc-style rewrite** (40+ hours) is needed to match mimalloc's 1.1 ns/op.

**Recommendation**: Implement the quick wins first (SuperSlab + stats disable + branchless lookup), measure the results with `perf`, then decide whether the medium-effort optimizations are worth the added complexity.