# Async Background Worker Optimization Plan

## hakmem Tiny Pool Allocator Performance Analysis

**Date**: 2025-10-26
**Author**: Claude Code Analysis
**Goal**: Reduce instruction count by moving work to background threads
**Target**: 2-3× speedup (62 M ops/sec → 120-180 M ops/sec)

---

## Executive Summary

### Can we achieve 2-3× speedup with async background workers?

**Answer: Partially - with significant caveats**

**Expected realistic speedup**: **1.5-2.0× (62 → 93-124 M ops/sec)**
**Best-case speedup**: **2.0-2.5× (62 → 124-155 M ops/sec)**
**Gap to mimalloc remains**: mimalloc (263 M ops/sec) is still **4.2× faster** than the current 62 M ops/sec

### Why not 3×?

Three fundamental constraints:

1. **TLS Magazine already defers most work** (60-80% hit rate)
   - Fast path already ~6 ns (18 cycles) - close to theoretical minimum
   - Background workers only help the remaining 20-40% of allocations
   - **Maximum impact**: 20-30% improvement (not 3×)

2. **Owner slab lookup on free cannot be fully deferred**
   - Cross-thread frees REQUIRE immediate slab lookup (for remote-free queue)
   - Same-thread frees can be deferred, but benchmarks show 40-60% cross-thread frees
   - **Deferred free savings**: Limited to 40-60% of frees only

3. **Background threads add synchronization overhead**
   - Atomic refill triggers, memory barriers, cache coherence
   - Expected overhead: 5-15% of total cycle budget
   - **Net gain reduced** from theoretical 2.5× to realistic 1.5-2.0×

### Strategic Recommendation

**Option B (Deferred Slab Lookup)** has the best effort/benefit ratio:
- **Effort**: 4-6 hours (TLS queue + batch processing)
- **Benefit**: 25-35% faster frees (same-thread only)
- **Overall speedup**: ~1.3-1.5× (62 → 81-93 M ops/sec)

**Option C (Hybrid)** for maximum performance:
- **Effort**: 8-12 hours (Option B + background magazine refill)
- **Benefit**: 40-60% overall speedup
- **Overall speedup**: ~1.7-2.0× (62 → 105-124 M ops/sec)

---

## Part 1: Current Front-Path Analysis (perf data)

### 1.1 Overall Hotspot Distribution

**Environment**: `HAKMEM_WRAP_TINY=1` (Tiny Pool enabled)
**Workload**: `bench_comprehensive_hakmem` (1M iterations, mixed sizes)
**Total cycles**: 242 billion (242.3 × 10⁹)
**Samples**: 187K

| Function | Cycles % | Samples | Category | Notes |
|----------|---------|---------|----------|-------|
| `_int_free` | 26.43% | 49,508 | glibc fallback | For >1KB allocations |
| `_int_malloc` | 23.45% | 43,933 | glibc fallback | For >1KB allocations |
| `malloc` | 14.01% | 26,216 | Wrapper overhead | TLS check + routing |
| `__random` | 7.99% | 14,974 | Benchmark overhead | rand() for shuffling |
| `unlink_chunk` | 7.96% | 14,824 | glibc internal | Chunk coalescing |
| **`hak_alloc_at`** | **3.13%** | **5,867** | **hakmem router** | **Tiny/Pool routing** |
| **`hak_tiny_alloc`** | **2.77%** | **5,206** | **Tiny alloc path** | **TARGET #1** |
| `_int_free_merge_chunk` | 2.15% | 3,993 | glibc internal | Free coalescing |
| `mid_desc_lookup` | 1.82% | 3,423 | hakmem pool | Mid-tier lookup |
| `hak_free_at` | 1.74% | 3,270 | hakmem router | Free routing |
| **`hak_tiny_owner_slab`** | **1.37%** | **2,571** | **Tiny free path** | **TARGET #2** |

### 1.2 Tiny Pool Allocation Path Breakdown

**From perf annotate on `hak_tiny_alloc` (5,206 samples)**:

```assembly
# Entry and initialization (lines 14eb0-14edb)
 4.00%: endbr64                        # CFI marker
 5.21%: push %r15                      # Stack frame setup
 3.94%: push %r14
25.81%: push %rbp                      # HOT: Stack frame overhead
 5.28%: mov g_tiny_initialized,%r14d   # TLS read
15.20%: test %r14d,%r14d               # Branch check

# Size-to-class conversion (implicit, not
shown in perf) # Estimated: ~2-3 cycles (branchless table lookup) # TLS Magazine fast path (lines 14f41-14f9f) 0.00%: mov %fs:0x0,%rsi # TLS base (rare - already cached) 0.00%: imul $0x4008,%rbx,%r10 # Class offset calculation 0.00%: mov -0x1c04c(%r10,%rsi,1),%r15d # Magazine top read # Mini-magazine operations (not heavily sampled - fast path works!) # Lines 15068-15131: Remote drain logic (rare) # Most samples are in initialization, not hot path ``` **Key observation from perf**: - **Stack frame overhead dominates** (25.81% on single `push %rbp`) - TLS access is **NOT a bottleneck** (0.00% on most TLS reads) - Most cycles spent in **initialization checks and setup** (first 10 instructions) - **Magazine fast path barely appears** (suggests it's working efficiently!) ### 1.3 Tiny Pool Free Path Breakdown **From perf annotate on `hak_tiny_owner_slab` (2,571 samples)**: ```assembly # Entry and registry lookup (lines 14c10-14c78) 10.87%: endbr64 # CFI marker 3.06%: push %r14 # Stack frame 14.05%: shr $0x10,%r10 # Hash calculation (64KB alignment) 5.44%: and $0x3ff,%eax # Hash mask (1024 entries) 3.91%: mov %rax,%rdx # Index calculation 5.89%: cmp %r13,%r9 # Registry lookup comparison 14.31%: test %r13,%r13 # NULL check # Linear probing (lines 14c7e-14d70) # 8 probe attempts, each with similar overhead # Most time spent in hash computation and comparison ``` **Key observation from perf**: - **Hash computation + comparison is the bottleneck** (14.05% + 5.89% + 14.31% = 34.25%) - Registry lookup is **O(1) but expensive** (~10-15 cycles per lookup) - **Called on every free** (2,571 samples ≈ 1.37% of total cycles) ### 1.4 Instruction Count Breakdown (Estimated) Based on perf data and code analysis, here's the estimated instruction breakdown: **Allocation path** (~228 instructions total as measured by perf stat): | Component | Instructions | Cycles | % of Total | Notes | |-----------|-------------|--------|-----------|-------| | Wrapper overhead | 15-20 | ~6 | 7-9% | TLS check + routing | | Size-to-class lookup | 5-8 | ~2 | 2-3% | Branchless table (fast!) | | TLS magazine check | 8-12 | ~4 | 4-5% | Load + branch | | **Pointer return (HIT)** | **3-5** | **~2** | **1-2%** | **Fast path: 30-45 instructions** | | TLS slab lookup | 10-15 | ~5 | 4-6% | Miss: check active slabs | | Mini-mag check | 8-12 | ~4 | 3-5% | LIFO pop | | **Bitmap scan (MISS)** | **40-60** | **~20** | **18-26%** | **Summary + main bitmap + CTZ** | | Bitmap update | 20-30 | ~10 | 9-13% | Set used + summary update | | Pointer arithmetic | 8-12 | ~3 | 3-5% | Block index → pointer | | Lock acquisition (rare) | 50-100 | ~30-100 | 22-44% | pthread_mutex_lock (contended) | | Batch refill (rare) | 100-200 | ~50-100 | 44-88% | 16-64 items from bitmap | **Free path** (~150-200 instructions estimated): | Component | Instructions | Cycles | % of Total | Notes | |-----------|-------------|--------|-----------|-------| | Wrapper overhead | 10-15 | ~5 | 5-8% | TLS check + routing | | **Owner slab lookup** | **30-50** | **~15-20** | **20-25%** | **Hash + linear probe** | | Slab validation | 10-15 | ~5 | 5-8% | Range check (safety) | | TLS magazine push | 8-12 | ~4 | 4-6% | Same-thread: fast! 
|
| **Remote free push** | **15-25** | **~8-12** | **10-13%** | **Cross-thread: atomic CAS** |
| Lock + bitmap update (spill) | 50-100 | ~30-80 | 25-50% | Magazine full (rare) |

**Critical finding**:
- **Owner slab lookup (30-50 instructions) is the #1 free-path bottleneck**
- Accounts for ~20-25% of free path instructions
- **Cannot be eliminated for cross-thread frees** (need slab to push to remote queue)

---

## Part 2: Async Background Worker Design

### 2.1 Option A: Deferred Bitmap Consolidation

**Goal**: Push bitmap scanning to a background thread, keeping the front path a simple pointer bump

#### Design

```c
// Front-path (allocation): 10-20 instructions
void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    TinyTLSMag* mag = &g_tls_mags[class_idx];

    // Fast path: Magazine hit (8-12 instructions)
    if (mag->top > 0) {
        return mag->items[--mag->top].ptr;  // ~3 instructions
    }

    // Slow path: Trigger background refill
    return hak_tiny_alloc_slow(class_idx);  // ~5 instructions + function call
}

// Background thread: Bitmap scanning
void background_refill_magazines(void) {
    while (1) {
        for (int tid = 0; tid < MAX_THREADS; tid++) {
            for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
                TinyTLSMag* mag = &g_thread_mags[tid][class_idx];

                // Refill if below threshold (e.g., 25% full)
                if (mag->top < mag->cap / 4) {
                    // Scan bitmaps across all slabs (expensive!)
                    batch_refill_from_all_slabs(mag, 256);  // 256 items at once
                }
            }
        }
        usleep(100);  // 100μs sleep (tune based on load)
    }
}
```

#### Expected Performance

**Front-path savings**:
- Before: 228 instructions (magazine miss → bitmap scan)
- After: 30-45 instructions (magazine miss → return NULL + fallback)
- **Speedup**: 5-7× on miss case (but only 20-40% of allocations miss)

**Overall impact**:
- 60-80% hit TLS magazine: **No change** (already 30-45 instructions)
- 20-40% miss TLS magazine: **5-7× faster** (228 → 30-45 instructions)
- **Net speedup**: 1.0 × 0.7 + 6.0 × 0.3 = **2.5× on allocation path** (an optimistic hit-rate-weighted average of per-case speedups, not a time-weighted one)

**BUT**: Background thread overhead
- CPU cost: 1 core at ~10-20% utilization (bitmap scanning)
- Memory barriers: Atomic refill triggers (5-10 cycles per trigger)
- Cache coherence: TLS magazine written by background thread (false sharing risk)

**Realistic net speedup**: **1.5-2.0× on allocations** (after overhead)

#### Pros
- **Minimal front-path changes** (magazine logic unchanged)
- **No new synchronization primitives** (use existing atomic refill triggers)
- **Compatible with existing TLS magazine** (just changes refill source)

#### Cons
- **Background thread CPU cost** (10-20% of 1 core)
- **Latency spikes** if the background thread is delayed (magazine empty → fallback to pool)
- **Complex tuning** (refill threshold, batch size, sleep interval)
- **False sharing risk** (background thread writes TLS magazine `top` field)

---

### 2.2 Option B: Deferred Slab Lookup (Owner Slab Cache)

**Goal**: Eliminate the owner slab lookup on same-thread frees by deferring it to batch processing

#### Design

```c
// Front-path (free): 10-20 instructions
void hak_tiny_free(void* ptr) {
    // Push to thread-local deferred free queue (NO owner_slab lookup!)
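    // (g_tls_deferred_free is assumed here to be a single __thread queue;
    //  the Phase 1 plan in Part 4 refines this into one queue per size class.)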
    TinyDeferredFree* queue = &g_tls_deferred_free;

    // Fast path: Direct queue push (8-12 instructions)
    queue->ptrs[queue->count++] = ptr;  // ~3 instructions

    // Trigger batch processing if queue is full
    if (queue->count >= 256) {
        hak_tiny_process_deferred_frees(queue);  // Background or inline
    }
}

// Batch processing: Owner slab lookup (amortized cost)
void hak_tiny_process_deferred_frees(TinyDeferredFree* queue) {
    for (int i = 0; i < queue->count; i++) {
        void* ptr = queue->ptrs[i];

        // Owner slab lookup (expensive: 30-50 instructions)
        TinySlab* slab = hak_tiny_owner_slab(ptr);

        // Check if same-thread or cross-thread
        if (pthread_equal(slab->owner_tid, pthread_self())) {
            // Same-thread: Push to TLS magazine (fast)
            TinyTLSMag* mag = &g_tls_mags[slab->class_idx];
            mag->items[mag->top++].ptr = ptr;
        } else {
            // Cross-thread: Push to remote queue (already required)
            tiny_remote_push(slab, ptr);
        }
    }
    queue->count = 0;
}
```

#### Expected Performance

**Front-path savings**:
- Before: 150-200 instructions (owner slab lookup + magazine/remote push)
- After: 10-20 instructions (queue push only)
- **Speedup**: 10-15× on free path

**BUT**: Batch processing overhead
- Owner slab lookup: 30-50 instructions per free (still executed, just moved off the front path)
- Batch call and loop overhead amortized over 256 frees (negligible per free)
- **Net speedup**: ~10× on same-thread front-path frees, **0× on cross-thread frees**

**Benchmark analysis** (from bench_comprehensive.c):
- Same-thread frees: 40-60% (LIFO/FIFO patterns)
- Cross-thread frees: 40-60% (interleaved/random patterns)

**Overall impact**:
- 40-60% same-thread: **10× faster** (150 → 15 instructions)
- 40-60% cross-thread: **No change** (still need immediate owner slab lookup)
- **Net speedup**: 10 × 0.5 + 1.0 × 0.5 = **5.5× on free path** (again an optimistic average of per-case speedups)

**BUT**: Deferred free delay
- Memory not reclaimed until batch processes (256 frees)
- Increased memory footprint: 256 × 8B = 2KB per thread per class
- Cache pollution: Deferred ptrs may evict useful data

**Realistic net speedup**: **1.3-1.5× on frees** (after overhead)

#### Pros
- **Large instruction savings** (10-15× on free path)
- **No background thread** (batch processes inline or on-demand)
- **Simple implementation** (just a TLS queue + batch loop)
- **Compatible with existing remote-free** (cross-thread unchanged)

#### Cons
- **Deferred memory reclamation** (256 frees delay)
- **Increased memory footprint** (2KB × 8 classes × 32 threads = 512KB)
- **Limited benefit on cross-thread frees** (40-60% of workload unaffected)
- **Batch processing latency spikes** (256 owner slab lookups at once)

---

### 2.3 Option C: Hybrid (Magazine + Deferred Processing)

**Goal**: Combine Option A (background magazine refill) + Option B (deferred free queue)

#### Design

```c
// Allocation: TLS magazine (10-20 instructions)
void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    TinyTLSMag* mag = &g_tls_mags[class_idx];

    if (mag->top > 0) {
        return mag->items[--mag->top].ptr;
    }

    // Trigger background refill if needed
    if (atomic_load(&mag->refill_needed) == 0) {
        atomic_store(&mag->refill_needed, 1);
    }
    return NULL;  // Fallback to next tier
}

// Free: Deferred batch queue (10-20 instructions)
void hak_tiny_free(void* ptr) {
    TinyDeferredFree* queue = &g_tls_deferred_free;
    queue->ptrs[queue->count++] = ptr;

    if (queue->count >= 256) {
        hak_tiny_process_deferred_frees(queue);
    }
}

// Background worker: Refill magazines + process deferred frees
void background_worker(void) {
    while (1) {
        // Refill magazines from bitmaps
        for each thread with refill_needed {
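            // (pseudocode: visit each thread's magazines whose refill_needed
            //  flag was set by the allocation path above)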
            batch_refill_from_all_slabs(mag, 256);
        }

        // Process deferred frees from all threads
        for each thread with deferred_free queue {
            hak_tiny_process_deferred_frees(queue);
        }

        usleep(50);  // 50μs sleep
    }
}
```

#### Expected Performance

**Front-path savings**:
- Allocation: 228 → 30-45 instructions (5-7× faster)
- Free: 150-200 → 10-20 instructions (10-15× faster)

**Overall impact** (accounting for hit rates and overhead):
- Allocations: 1.5-2.0× faster (Option A)
- Frees: 1.3-1.5× faster (Option B)
- **Net speedup**: √(2.0 × 1.5) ≈ **1.7× overall** (geometric mean of the two path speedups)

**Realistic net speedup**: **1.7-2.0× (62 → 105-124 M ops/sec)**

#### Pros
- **Best overall speedup** (combines benefits of both approaches)
- **Balanced optimization** (both alloc and free paths improved)
- **Single background worker** (shared thread for refill + deferred frees)

#### Cons
- **Highest implementation complexity** (both systems + worker coordination)
- **Background thread CPU cost** (15-30% of 1 core)
- **Tuning complexity** (refill threshold, batch size, sleep interval, queue size)
- **Largest memory footprint** (TLS magazines + deferred queues)

---

## Part 3: Feasibility Analysis

### 3.1 Instruction Reduction Potential

**Current measured performance** (HAKMEM_WRAP_TINY=1):
- **Instructions per op**: 228 (from perf stat: 1.4T / 6.1B ops)
- **IPC**: 4.73 (very high - compute-bound)
- **Cycles per op**: 48.2 (228 / 4.73)
- **Latency**: 16.1 ns/op @ 3 GHz

**Theoretical minimum** (mimalloc-style):
- **Instructions per op**: 15-25 (TLS pointer bump + freelist push)
- **IPC**: 4.5-5.0 (cache-friendly sequential access)
- **Cycles per op**: 4-5 (15-25 / 5.0)
- **Latency**: 1.3-1.7 ns/op @ 3 GHz

**Achievable with async background workers**:
- **Allocation path**: 30-45 instructions (magazine hit) vs 228 (bitmap scan)
- **Free path**: 10-20 instructions (deferred queue) vs 150-200 (owner slab lookup)
- **Average**: (30 + 15) / 2 = **22.5 instructions per op** (midpoint of the two paths)
- **IPC**: 4.5 (slightly worse due to memory barriers)
- **Cycles per op**: 22.5 / 4.5 = **5.0 cycles**
- **Latency**: 5.0 / 3.0 = **1.7 ns/op**

**Expected speedup**: 16.1 / 1.7 ≈ **9.5× (theoretical maximum)**

**BUT**: Background thread overhead
- Atomic refill triggers: +1-2 cycles per op
- Cache coherence (false sharing): +2-3 cycles per op
- Memory barriers: +1-2 cycles per op
- **Total overhead**: +4-7 cycles per op

**Realistic achievable**:
- **Cycles per op**: 5.0 + 5.0 = 10.0 cycles
- **Latency**: 10.0 / 3.0 = **3.3 ns/op**
- **Throughput**: 300 M ops/sec
- **Speedup**: 16.1 / 3.3 ≈ **4.9× (theoretical)**

**Actual achievable** (accounting for partial hit rates):
- **60-80% hit magazine**: Already fast (6 ns)
- **20-40% miss magazine**: Improved (16 ns → 3.3 ns)
- **Net improvement**: 0.7 × 6 + 0.3 × 3.3 = **5.2 ns/op**
- **Speedup**: 16.1 / 5.2 ≈ **3.1× (optimistic)**

**Conservative estimate** (accounting for all overheads):
- **Net speedup**: **2.0-2.5× (62 → 124-155 M ops/sec)**

### 3.2 Comparison with mimalloc

**Why mimalloc is 263 M ops/sec (3.8 ns/op)**:

1. **Zero-initialization on allocation** (no bitmap scan ever)
   - Uses sequential memory bump pointer (O(1) pointer arithmetic)
   - Free blocks tracked as linked list (no scanning needed)

2. **Embedded slab metadata** (no hash lookup on free)
   - Slab pointer embedded in first 16 bytes of allocation
   - Owner slab lookup is single pointer dereference (3-4 cycles)

3.
**TLS-local slabs** (no cross-thread remote free queues) - Each thread owns its slabs exclusively - Cross-thread frees go to per-thread remote queue (not per-slab) 4. **Lazy coalescing** (defers bitmap consolidation to background) - Front-path never touches bitmaps - Background thread scans and coalesces every 100ms **hakmem cannot match mimalloc without fundamental redesign** because: - Bitmap-based allocation requires scanning (cannot be O(1) pointer bump) - Hash-based owner slab lookup requires hash computation (cannot be single dereference) - Per-slab remote queues require immediate slab lookup on cross-thread free **Realistic target**: **120-180 M ops/sec (6.7-8.3 ns/op)** - still **2-3× slower than mimalloc** ### 3.3 Implementation Effort vs Benefit | Option | Effort (hours) | Speedup | Ops/sec | Gap to mimalloc | Complexity | |--------|---------------|---------|---------|-----------------|-----------| | **Current** | 0 | 1.0× | 62 | 4.2× slower | Baseline | | **Option A** | 6-8 | 1.5-1.8× | 93-112 | 2.4-2.8× slower | Medium | | **Option B** | 4-6 | 1.3-1.5× | 81-93 | 2.8-3.2× slower | Low | | **Option C** | 10-14 | 1.7-2.2× | 105-136 | 1.9-2.5× slower | High | | **Theoretical max** | N/A | 3.1× | 192 | 1.4× slower | N/A | | **mimalloc** | N/A | 4.2× | 263 | Baseline | N/A | **Best effort/benefit ratio**: **Option B (Deferred Slab Lookup)** - **4-6 hours** of implementation - **1.3-1.5× speedup** (25-35% faster) - **Low complexity** (single TLS queue + batch loop) - **No background thread** (inline batch processing) **Maximum performance**: **Option C (Hybrid)** - **10-14 hours** of implementation - **1.7-2.2× speedup** (50-75% faster) - **High complexity** (background worker + coordination) - **Requires background thread** (CPU cost) --- ## Part 4: Recommended Implementation Plan ### Phase 1: Deferred Free Queue (4-6 hours) [Option B] **Goal**: Eliminate owner slab lookup on same-thread frees #### Step 1.1: Add TLS Deferred Free Queue (1 hour) ```c // hakmem_tiny.h - Add to global state #define DEFERRED_FREE_QUEUE_SIZE 256 typedef struct { void* ptrs[DEFERRED_FREE_QUEUE_SIZE]; uint16_t count; } TinyDeferredFree; // TLS per-class deferred free queues static __thread TinyDeferredFree g_tls_deferred_free[TINY_NUM_CLASSES]; ``` #### Step 1.2: Modify Free Path (2 hours) ```c // hakmem_tiny.c - Replace hak_tiny_free() void hak_tiny_free(void* ptr) { if (!ptr || !g_tiny_initialized) return; // Try SuperSlab fast path first (existing) SuperSlab* ss = ptr_to_superslab(ptr); if (ss && ss->magic == SUPERSLAB_MAGIC) { hak_tiny_free_superslab(ptr, ss); return; } // NEW: Deferred free path (no owner slab lookup!) 
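    // (guess_class_from_ptr below is a hypothetical helper: it must recover
    //  the size class from the pointer alone, without touching the slab
    //  registry, or deferring the lookup buys nothing.)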
// Guess class from allocation size hint (optional optimization) int class_idx = guess_class_from_ptr(ptr); // heuristic if (class_idx >= 0) { TinyDeferredFree* queue = &g_tls_deferred_free[class_idx]; queue->ptrs[queue->count++] = ptr; // Batch process if queue is full if (queue->count >= DEFERRED_FREE_QUEUE_SIZE) { hak_tiny_process_deferred_frees(class_idx, queue); } return; } // Fallback: Immediate owner slab lookup (cross-thread or unknown) TinySlab* slab = hak_tiny_owner_slab(ptr); if (!slab) return; hak_tiny_free_with_slab(ptr, slab); } ``` #### Step 1.3: Implement Batch Processing (2-3 hours) ```c // hakmem_tiny.c - Batch process deferred frees static void hak_tiny_process_deferred_frees(int class_idx, TinyDeferredFree* queue) { pthread_mutex_t* lock = &g_tiny_class_locks[class_idx].m; pthread_mutex_lock(lock); for (int i = 0; i < queue->count; i++) { void* ptr = queue->ptrs[i]; // Owner slab lookup (expensive, but amortized over batch) TinySlab* slab = hak_tiny_owner_slab(ptr); if (!slab) continue; // Push to magazine or bitmap hak_tiny_free_with_slab(ptr, slab); } pthread_mutex_unlock(lock); queue->count = 0; } ``` **Expected outcome**: - **Same-thread frees**: 10-15× faster (150 → 10-20 instructions) - **Cross-thread frees**: Unchanged (still need immediate lookup) - **Overall speedup**: 1.3-1.5× (25-35% faster) - **Memory overhead**: 256 × 8B × 8 classes = 16KB per thread --- ### Phase 2: Background Magazine Refill (6-8 hours) [Option A] **Goal**: Eliminate bitmap scanning on allocation path #### Step 2.1: Add Refill Trigger (1 hour) ```c // hakmem_tiny.h - Add refill trigger to TLS magazine typedef struct { TinyMagItem items[TINY_TLS_MAG_CAP]; int top; int cap; atomic_int refill_needed; // NEW: Background refill trigger } TinyTLSMag; ``` #### Step 2.2: Modify Allocation Path (2 hours) ```c // hakmem_tiny.c - Trigger refill on magazine miss void* hak_tiny_alloc(size_t size) { // ... (existing size-to-class logic) ... TinyTLSMag* mag = &g_tls_mags[class_idx]; if (mag->top > 0) { return mag->items[--mag->top].ptr; // Fast path: unchanged } // NEW: Trigger background refill (non-blocking) if (atomic_load(&mag->refill_needed) == 0) { atomic_store(&mag->refill_needed, 1); } // Fallback to existing slow path (TLS slab, bitmap scan, lock) return hak_tiny_alloc_slow(class_idx); } ``` #### Step 2.3: Implement Background Worker (3-5 hours) ```c // hakmem_tiny.c - Background refill thread static void* background_refill_worker(void* arg) { while (g_background_worker_running) { // Scan all threads for refill requests for (int tid = 0; tid < g_max_threads; tid++) { for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) { TinyTLSMag* mag = &g_thread_mags[tid][class_idx]; // Check if refill needed if (atomic_load(&mag->refill_needed) == 0) { continue; } // Refill from bitmaps (expensive, but in background) pthread_mutex_t* lock = &g_tiny_class_locks[class_idx].m; pthread_mutex_lock(lock); TinySlab* slab = g_tiny_pool.free_slabs[class_idx]; if (slab && slab->free_count > 0) { int refilled = batch_refill_from_bitmap( slab, &mag->items[mag->top], 256 ); mag->top += refilled; } pthread_mutex_unlock(lock); atomic_store(&mag->refill_needed, 0); } } usleep(100); // 100μs sleep (tune based on load) } return NULL; } // Start background worker on init void hak_tiny_init(void) { // ... (existing init logic) ... 
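    // Spawn one background refill worker; g_background_worker and
    // g_background_worker_running are new globals introduced by this phase.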
    g_background_worker_running = 1;
    pthread_create(&g_background_worker, NULL, background_refill_worker, NULL);
}
```

**Expected outcome**:
- **Allocation misses**: 5-7× faster (228 → 30-45 instructions)
- **Magazine hit rate**: Improved (background keeps magazines full)
- **Overall speedup**: +30-50% (combined with Phase 1)
- **CPU cost**: 1 core at 10-20% utilization

---

### Phase 3: Tuning and Optimization (2-3 hours)

**Goal**: Reduce overhead and maximize hit rates

#### Step 3.1: Tune Batch Sizes (1 hour)
- Test refill batch sizes: 64, 128, 256, 512
- Test deferred free queue sizes: 128, 256, 512
- Measure impact on throughput and latency variance

#### Step 3.2: Reduce False Sharing (1 hour)

```c
// Cache-align TLS magazines to avoid false sharing
typedef struct __attribute__((aligned(64))) {
    TinyMagItem items[TINY_TLS_MAG_CAP];
    int top;
    int cap;
    atomic_int refill_needed;
    // aligned(64) rounds sizeof() up to a 64B multiple, so adjacent
    // magazines never share a cache line (no manual padding needed)
} TinyTLSMag;
```

#### Step 3.3: Add Adaptive Sleep (1 hour)

```c
// Background worker: Adaptive sleep based on load
static void* background_refill_worker(void* arg) {
    int idle_count = 0;
    while (g_background_worker_running) {
        int work_done = 0;
        // ... (refill logic sets work_done) ...

        if (work_done == 0) {
            idle_count++;
            int backoff = idle_count < 4 ? idle_count : 4;
            usleep(100 * (1 << backoff));  // Exponential backoff, up to 1.6 ms
        } else {
            idle_count = 0;
            usleep(50);  // Active: short sleep
        }
    }
    return NULL;
}
```

**Expected outcome**:
- **Reduced CPU cost**: 10-20% → 5-10% (adaptive sleep)
- **Better cache utilization**: Alignment reduces false sharing
- **Tuned for workload**: Batch sizes optimized for benchmarks

---

## Part 5: Expected Performance

### Before (Current)

```
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem

Test 1: Sequential LIFO (16B)
  Throughput: 105 M ops/sec
  Latency:    9.5 ns/op

Test 2: Sequential FIFO (16B)
  Throughput: 98 M ops/sec
  Latency:    10.2 ns/op

Test 3: Random Free (16B)
  Throughput: 62 M ops/sec
  Latency:    16.1 ns/op

Average: 88 M ops/sec (11.4 ns/op)
```

### After Phase 1 (Deferred Free Queue)

**Expected improvement**: +25-35% (same-thread frees only)

```
Test 1: Sequential LIFO (16B) [80% same-thread]
  Throughput: 135 M ops/sec (+29%)
  Latency:    7.4 ns/op

Test 2: Sequential FIFO (16B) [80% same-thread]
  Throughput: 126 M ops/sec (+29%)
  Latency:    7.9 ns/op

Test 3: Random Free (16B) [40% same-thread]
  Throughput: 73 M ops/sec (+18%)
  Latency:    13.7 ns/op

Average: 111 M ops/sec (+26%), 9.0 ns/op
```

### After Phase 2 (Background Refill)

**Expected improvement**: +40-60% (combined)

```
Test 1: Sequential LIFO (16B)
  Throughput: 168 M ops/sec (+60%)
  Latency:    6.0 ns/op

Test 2: Sequential FIFO (16B)
  Throughput: 157 M ops/sec (+60%)
  Latency:    6.4 ns/op

Test 3: Random Free (16B)
  Throughput: 93 M ops/sec (+50%)
  Latency:    10.8 ns/op

Average: 139 M ops/sec (+58%), 7.2 ns/op
```

### After Phase 3 (Tuning)

**Expected improvement**: +50-75% (optimized)

```
Test 1: Sequential LIFO (16B)
  Throughput: 180 M ops/sec (+71%)
  Latency:    5.6 ns/op

Test 2: Sequential FIFO (16B)
  Throughput: 168 M ops/sec (+71%)
  Latency:    6.0 ns/op

Test 3: Random Free (16B)
  Throughput: 105 M ops/sec (+69%)
  Latency:    9.5 ns/op

Average: 151 M ops/sec (+72%), 6.6 ns/op
```

### Gap to mimalloc (263 M ops/sec)

| Phase | Ops/sec | Gap | % of mimalloc |
|-------|---------|-----|---------------|
| Current | 88 | 3.0× slower | 33% |
| Phase 1 | 111 | 2.4× slower | 42% |
| Phase 2 | 139 | 1.9× slower | 53% |
| Phase 3 | 151 | 1.7× slower | 57% |
| **mimalloc** | **263** | **Baseline** | **100%** |

**Conclusion**: Async background workers can achieve **1.7× speedup**, but still **1.7×
slower than mimalloc** due to fundamental architecture differences. --- ## Part 6: Critical Success Factors ### 6.1 Verify with perf After each phase, run: ```bash HAKMEM_WRAP_TINY=1 perf record -e cycles:u -g ./bench_comprehensive_hakmem perf report --stdio --no-children -n --percent-limit 1.0 ``` **Expected changes**: - **Phase 1**: `hak_tiny_owner_slab` drops from 1.37% → 0.5-0.7% - **Phase 2**: `hak_tiny_find_free_block` drops from ~1% → 0.2-0.3% - **Phase 3**: Overall cycles per op drops 40-60% ### 6.2 Measure Instruction Count ```bash HAKMEM_WRAP_TINY=1 perf stat -e instructions,cycles,branches ./bench_comprehensive_hakmem ``` **Expected changes**: - **Before**: 228 instructions/op, 48.2 cycles/op - **Phase 1**: 180-200 instructions/op, 40-45 cycles/op - **Phase 2**: 120-150 instructions/op, 28-35 cycles/op - **Phase 3**: 100-130 instructions/op, 22-28 cycles/op ### 6.3 Avoid Synchronization Overhead **Key principles**: - Use `atomic_load_explicit` with `memory_order_relaxed` for low-contention checks - Batch operations to amortize lock costs (256+ items per batch) - Align TLS structures to 64B to avoid false sharing - Use exponential backoff on background thread sleep ### 6.4 Focus on Front-Path **Priority order**: 1. **TLS magazine hit**: Must remain <30 instructions (already optimal) 2. **Deferred free queue**: Must be <20 instructions (Phase 1) 3. **Background refill trigger**: Must be <10 instructions (Phase 2) 4. **Batch processing**: Can be expensive (amortized over 256 items) --- ## Part 7: Conclusion ### Can we achieve 100-150 M ops/sec with async background workers? **Yes, but with caveats**: - **100 M ops/sec**: Achievable with Phase 1 alone (4-6 hours) - **150 M ops/sec**: Achievable with Phase 1+2+3 (12-17 hours) - **180+ M ops/sec**: Unlikely without fundamental redesign ### Why the gap to mimalloc remains **mimalloc's advantages that async workers cannot replicate**: 1. **O(1) pointer bump allocation** (no bitmap scan, even in background) 2. **Embedded slab metadata** (no hash lookup, ever) 3. **TLS-exclusive slabs** (no cross-thread remote queues) **hakmem's fundamental constraints**: - Bitmap-based allocation requires scanning (cannot be O(1)) - Hash-based slab registry requires computation on free - Per-slab remote queues require immediate slab lookup ### Recommended next steps **Short-term (4-6 hours)**: Implement **Phase 1 (Deferred Free Queue)** - **Effort**: Low (single TLS queue + batch loop) - **Benefit**: 25-35% speedup (62 → 81-93 M ops/sec) - **Risk**: Low (no background thread, simple design) **Medium-term (10-14 hours)**: Add **Phase 2 (Background Refill)** - **Effort**: Medium (background worker + coordination) - **Benefit**: 50-75% speedup (62 → 93-108 M ops/sec) - **Risk**: Medium (background thread overhead, tuning complexity) **Long-term (20-30 hours)**: Consider **fundamental redesign** - Replace bitmap with freelist (mimalloc-style) - Embed slab metadata in allocations (avoid hash lookup) - Use TLS-exclusive slabs (eliminate remote queues) - **Potential**: 3-4× speedup (approaching mimalloc) ### Final verdict **Async background workers are a viable optimization**, but not a silver bullet: - **Expected speedup**: 1.5-2.0× (realistic) - **Best-case speedup**: 2.0-2.5× (with perfect tuning) - **Gap to mimalloc**: Remains 1.7-2.0× (architectural limitations) **Recommended approach**: Implement Phase 1 first (low effort, good ROI), then evaluate if Phase 2 is worth the complexity. 
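To make the long-term redesign concrete, here is a minimal sketch of the two O(1) fast paths it would enable, assuming mimalloc-style 64KB-aligned slabs with an intrusive free list and a header at the slab base (all names are illustrative, not existing hakmem or mimalloc APIs; a real layout must also reserve header space at the slab start):

```c
#include <stdint.h>

// Hypothetical redesigned slab: the header lives at the start of a
// 64KB-aligned region, so the owner is recovered by masking, not hashing.
typedef struct FreeBlock { struct FreeBlock* next; } FreeBlock;

typedef struct RedesignedSlab {
    FreeBlock* free_list;  // intrusive free list: no bitmap to scan
    int        class_idx;  // size class, read directly on free
} RedesignedSlab;

// O(1) allocation: one load + one store, no bitmap scan or CTZ.
static inline void* slab_alloc(RedesignedSlab* s) {
    FreeBlock* b = s->free_list;
    if (b) s->free_list = b->next;
    return b;
}

// O(1) owner lookup: a single mask replaces hakmem's hash + linear-probe
// registry lookup (hak_tiny_owner_slab).
static inline RedesignedSlab* owner_slab(void* p) {
    return (RedesignedSlab*)((uintptr_t)p & ~(uintptr_t)0xFFFF);
}

// O(1) free: recover the owner with one mask, push onto its free list.
static inline void slab_free(void* p) {
    RedesignedSlab* s = owner_slab(p);
    FreeBlock* b = (FreeBlock*)p;
    b->next = s->free_list;
    s->free_list = b;
}
```

Both paths are a handful of loads and stores, which is where the 15-25 instruction "theoretical minimum" from Part 3.1 comes from; none of it is reachable while allocation stays bitmap-based and ownership is resolved through a hash registry.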
---

## Part 8: Phase 1 Implementation Results & Lessons Learned

**Date**: 2025-10-26
**Status**: FAILED - Structural design flaw identified

### Phase 1 Implementation Summary

**What was implemented**:
1. TLS Deferred Free Queue (256 items)
2. Batch processing function
3. Modified `hak_tiny_free` to push to queue

**Expected outcome**: 1.3-1.5× speedup (25-35% faster frees)

### Actual Results

| Metric | Before | After Phase 1 | Change |
|--------|--------|---------------|--------|
| 16B LIFO | 62.13 M ops/s | 63.50 M ops/s | +2.2% |
| 32B LIFO | 53.96 M ops/s | 54.47 M ops/s | +0.9% |
| 64B LIFO | 50.93 M ops/s | 50.10 M ops/s | -1.6% |
| 128B LIFO | 64.44 M ops/s | 63.34 M ops/s | -1.7% |
| **Instructions/op** | **228** | **229** | **+1** ❌ |

**Conclusion**: **Phase 1 had ZERO effect** (performance unchanged, instructions increased by 1)

### Root Cause Analysis

**Critical design flaw discovered**:

```c
void hak_tiny_free(void* ptr) {
    // SuperSlab fast path FIRST (Quick Win #1)
    SuperSlab* ss = ptr_to_superslab(ptr);
    if (ss && ss->magic == SUPERSLAB_MAGIC) {
        hak_tiny_free_superslab(ptr, ss);
        return;  // ← 99% of frees exit here!
    }

    // Deferred Free Queue (NEVER REACHED!)
    queue->ptrs[queue->count++] = ptr;
    ...
}
```

**Why Phase 1 failed**:
1. **SuperSlab is enabled by default** (`g_use_superslab = 1` from Quick Win #1)
2. **99% of frees take the SuperSlab fast path** (especially sequential patterns)
3. **Deferred queue is never used** → zero benefit, added overhead
4. **Push-based approach is fundamentally flawed** for this use case

### Alignment with ChatGPT Analysis

ChatGPT's analysis of a similar "Phase 4" issue identified the same structural problem:

> **"Doing acceleration prep work at free time (push-style) pays its cost up front, and tends to lose in workloads where spills are frequent."**

**Key insight**: **Push-based optimization on the free path pays an upfront cost without a guaranteed benefit.**

### Lessons Learned

1. **Push vs Pull strategy**:
   - **Push (Phase 1)**: Pay cost upfront on every free → wasted if not consumed
   - **Pull (Phase 2)**: Pay cost only when needed on alloc → guaranteed benefit

2. **Interaction with existing optimizations**:
   - The SuperSlab fast path makes the deferred queue unreachable
   - Cannot optimize an already-optimized path

3. **Measurement before optimization**:
   - Should have profiled where frees actually go (SuperSlab vs registry)
   - Would have caught this before implementation

### Revised Strategy: Phase 2 (Pull-based)

**Recommended approach** (from ChatGPT + our analysis):

```
Phase 2: Background Magazine Refill (pull-based)

Allocation path:
  magazine.top > 0  → return item (fast path unchanged)
  magazine.top == 0 → trigger background refill → fallback to slow path

Background worker (pull-based):
  Periodically scan for refill_needed flags
  Perform bitmap scan (expensive, but in background)
  Refill magazines in batch (256 items)

Free path:
  NO CHANGES (zero cost increase)
```

**Expected benefits**:
- **No upfront cost on free** (major advantage over the push-based design)
- **Guaranteed benefit on alloc** (magazine hit rate increases)
- **Amortized bitmap scan cost** (1 scan → 256 allocs)
- **Expected speedup**: 1.5-2.0× (30-50% faster)

### Decision: Revert Phase 1, Implement Phase 2

**Next steps**:
1. ✅ Document Phase 1 failure and analysis
2. ⏳ Revert Phase 1 changes (clean code)
3. ⏳ Implement Phase 2 (pull-based background refill)
4.
⏳ Measure and validate Phase 2 effectiveness

**Key takeaway**: **"Pull-based, only when needed" beats "push-based, cost paid up front"**

---

## Part 9: Phase 2 Implementation Results & Critical Insight

**Date**: 2025-10-26
**Status**: FAILED - Worse than baseline (Phase 1 had zero effect; Phase 2 degraded performance)

### Phase 2 Implementation Summary

**What was implemented**:
1. Global Refill Queue (per-class, lock-free read)
2. Background worker thread (bitmap scanning in background)
3. Pull-based magazine refill (check global queue on magazine miss)
4. Adaptive sleep (exponential backoff when idle)

**Expected outcome**: 1.5-2.0× speedup (228 → 100-150 instructions/op)

### Actual Results

| Metric | Baseline (no async) | Phase 1 (Push) | Phase 2 (Pull) | Change (vs Phase 1) |
|--------|----------|---------|---------|--------|
| 16B LIFO | 62.13 M ops/s | 63.50 M ops/s | 62.80 M ops/s | **-1.1%** |
| 32B LIFO | 53.96 M ops/s | 54.47 M ops/s | 52.64 M ops/s | **-3.4%** ❌ |
| 64B LIFO | 50.93 M ops/s | 50.10 M ops/s | 49.37 M ops/s | **-1.5%** ❌ |
| 128B LIFO | 64.44 M ops/s | 63.34 M ops/s | 63.53 M ops/s | **+0.3%** |
| **Instructions/op** | **~228** | **229** | **306** | **+33%** ❌❌❌ |

**Conclusion**: **Phase 2 DEGRADED performance** (worse than both baseline and Phase 1!)

### Root Cause Analysis

**Critical insight**: **Both Phase 1 and Phase 2 optimize the WRONG code path!**

```
Benchmark allocation pattern (LIFO):

Iteration 1:
  alloc[0..99]  → Slow path: Fill TLS Magazine from slabs
  free[99..0]   → Items return to TLS Magazine (LIFO)

Iteration 2-1,000,000:
  alloc[0..99]  → Fast path: 100% TLS Magazine hit! (6 ns/op)
  free[99..0]   → Fast path: Return to TLS Magazine (6 ns/op)

NO SLOW PATH EVER EXECUTED AFTER FIRST ITERATION!
```

**Why Phase 2 failed worse than Phase 1**:
1. **Background worker thread consuming CPU** (extra overhead)
2. **Atomic operations on the global queue** (contention + memory ordering cost)
3. **No benefit** because the TLS magazine never empties (100% hit rate)
4. **Pure overhead** without any gain

### Fundamental Architecture Problem

**The async optimization strategy (Phase 1 + 2) is based on a flawed assumption**:

❌ **Assumption**: "Slow path (bitmap scan, lock, owner lookup) is the bottleneck"
✅ **Reality**: "Fast path (TLS magazine access) is the bottleneck"

**Evidence**:
- Benchmark working set: 100 items
- TLS Magazine capacity: 2048 items (class 0)
- Hit rate: 100% after first iteration
- Slow path execution: ~0% (never reached)

**Performance gap breakdown**:

```
hakmem Tiny Pool:  60 M ops/sec (16.7 ns/op) = 228 instructions
glibc malloc:     105 M ops/sec ( 9.5 ns/op) = ~30-40 instructions

Gap: 40% slower = ~190 extra instructions on FAST PATH
```

### Why hakmem is Slower (Architectural Differences)

**1. Bitmap-based allocation** (hakmem):
- Find free block: bitmap scan (CTZ instruction)
- Mark used: bit manipulation (OR + update summary bitmap)
- Cost: 30-40 instructions even with optimizations

**2. Free-list allocation** (glibc):
- Find free block: single pointer dereference
- Mark used: pointer update
- Cost: 5-10 instructions

**3. TLS Magazine access overhead**:
- hakmem: `g_tls_mags[class_idx].items[--top].ptr` (3 memory reads + index calc)
- glibc: Direct arena access (1-2 memory reads)

**4. Statistics batching** (Phase 3 optimization):
- hakmem: XOR RNG sampling (10-15 instructions)
- glibc: No stats tracking

### Lessons Learned

**1. Optimize the code path that actually executes**:
- ❌ Optimized slow path (99.9% never executed)
- ✅ Should optimize fast path (99.9% of operations)

**2.
Async optimization only helps with cache misses**:
- Benchmark: 100% cache hit rate after warmup
- Real workload: Unknown hit rate (need profiling)

**3. Adding complexity without measurement is harmful**:
- Phase 1: +1 instruction/op (zero benefit)
- Phase 2: +77 instructions/op (+33% instructions, 1-3% slower)

**4. Fundamental architectural differences matter more than micro-optimizations**:
- Bitmap vs free-list: ~10× instruction difference
- Async background work cannot bridge this gap

### Revised Understanding

**The 40% performance gap (hakmem vs glibc) is NOT due to slow-path inefficiency.**

**It's due to fundamental design choices**:
1. **Bitmap allocation** (flexible, low fragmentation) vs **Free-list** (fast, simple)
2. **Slab ownership tracking** (hash lookup on free) vs **Embedded metadata** (single dereference)
3. **Research features** (statistics, ELO, batching) vs **Production simplicity**

**These tradeoffs are INTENTIONAL for research purposes.**

### Conclusion & Next Steps

**Both Phase 1 and Phase 2 should be reverted.**

**The async optimization strategy is fundamentally flawed for this workload.**

**Actual bottleneck**: TLS Magazine fast path (99.9% of execution)
- Current: ~17 ns/op (228 instructions)
- Target: ~10 ns/op (glibc level)
- Gap: 7 ns = ~50-70 instructions

**Possible future optimizations** (NOT async):
1. **Inline TLS magazine access** (reduce function call overhead)
2. **SIMD bitmap scanning** (4-8× faster block finding)
3. **Remove statistics sampling** (save 10-15 instructions)
4. **Simplified magazine structure** (single array instead of struct)

**Or accept reality**:
- hakmem is a research allocator with diagnostic features
- 40% slowdown is an acceptable cost for flexibility
- Production use cases might have different performance profiles

**Recommended action**: Revert Phase 2, commit analysis, move on.

---

## Part 10: Phase 7.5 Failure Analysis - Inline Fast Path

**Date**: 2025-10-26
**Goal**: Reduce hak_tiny_alloc from 22.75% CPU to ~10% via inline wrapper
**Result**: **REGRESSION (-7% to -15%)** ❌

### Implementation Approach

Created an inline wrapper to handle the common case (TLS magazine hit) without a function call:

```c
static inline void* hak_tiny_alloc(size_t size) __attribute__((always_inline));
static inline void* hak_tiny_alloc(size_t size) {
    // Fast checks
    if (UNLIKELY(size > TINY_MAX_SIZE)) return hak_tiny_alloc_impl(size);
    if (UNLIKELY(!g_tiny_initialized)) return hak_tiny_alloc_impl(size);
    if (UNLIKELY(!g_wrap_tiny_enabled && hak_in_wrapper())) return hak_tiny_alloc_impl(size);

    // Size to class
    int class_idx = hak_tiny_size_to_class(size);

    // TLS Magazine fast path
    tiny_mag_init_if_needed(class_idx);
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (LIKELY(mag->top > 0)) {
        return mag->items[--mag->top].ptr;  // Fast path!
    }

    return hak_tiny_alloc_impl(size);  // Slow path
}
```

### Benchmark Results

| Test | Before (getenv fix) | After (Phase 7.5) | Change |
|------|---------------------|-------------------|--------|
| 16B LIFO | 120.55 M ops/sec | 110.46 M ops/sec | **-8.4%** |
| 32B LIFO | 88.57 M ops/sec | 79.00 M ops/sec | **-10.8%** |
| 64B LIFO | 94.74 M ops/sec | 88.01 M ops/sec | **-7.1%** |
| 128B LIFO | 122.36 M ops/sec | 104.21 M ops/sec | **-14.8%** |
| Mixed | 164.56 M ops/sec | 148.99 M ops/sec | **-9.5%** |

With `__attribute__((always_inline))`:

| Test | Always-inline Result | vs Baseline |
|------|---------------------|-------------|
| 16B LIFO | 115.89 M ops/sec | **-3.9%** |

Still slower than baseline!
### Root Cause Analysis The inline wrapper **added more overhead than it removed**: **Overhead Added:** 1. **Extra function calls in wrapper**: - `hak_in_wrapper()` called on every allocation (even with UNLIKELY) - `tiny_mag_init_if_needed()` called on every allocation - These are function calls that happen BEFORE reaching the magazine 2. **Multiple conditional branches**: - Size check - Initialization check - Wrapper guard check - Branch misprediction cost 3. **Global variable reads**: - `g_tiny_initialized` read every time - `g_wrap_tiny_enabled` read every time **Original code** (before inlining): - One function call to `hak_tiny_alloc()` - Inside function: direct path to magazine check (lines 685-688) - No extra overhead **Inline wrapper**: - Zero function calls to enter - But added 2 function calls inside (`hak_in_wrapper`, `tiny_mag_init_if_needed`) - Added 3 conditional branches - Net result: **MORE overhead, not less** ### Key Lesson Learned **WRONG**: Function call overhead is the bottleneck (perf shows 22.75% in hak_tiny_alloc) **RIGHT**: The 22.75% is the **code inside** the function, not the call overhead **Micro-optimization fallacy**: Eliminating a function call (2-4 cycles) while adding: - 2 function calls (4-8 cycles) - 3 conditional branches (3-6 cycles) - Multiple global reads (3-6 cycles) Total overhead added: **10-20 cycles** vs **2-4 cycles** saved = **net loss** ### Correct Approach (Not Implemented) To actually reduce the 22.75% CPU in hak_tiny_alloc, we should: 1. **Keep it as a regular function** (not inline) 2. **Optimize the code INSIDE**: - Reduce stack usage (88 → 32 bytes) - Cache globals at function entry - Simplify control flow - Reduce register pressure 3. **Or accept current performance**: - Already 1.5-1.9× faster than glibc ✅ - Diminishing returns zone - Further optimization may not be worth the risk ### Decision **REVERTED** Phase 7.5 completely. Performance restored to 120-164 M ops/sec. **CONCLUSION**: Stick with getenv fix. Ship what works. 🚀 ---