# Tiny Pool Optimization Roadmap

Quick reference for implementing mimalloc-style optimizations in hakmem's Tiny Pool.

**Current Performance**: 83 ns/op for 8-64B allocations
**Target Performance**: 30-50 ns/op (realistic with optimizations)
**Gap to mimalloc**: Still 2-3.5x slower (fundamental architecture difference)

---

## Quick Wins (10-20 ns improvement)

### 1. Lookup Table Size Classification

**Effort**: 30 minutes | **Gain**: 3-5 ns

Replace the if-chain with a table lookup:

```c
// Before
static inline int hak_tiny_size_to_class(size_t size) {
    if (size <= 8)  return 0;
    if (size <= 16) return 1;
    if (size <= 32) return 2;
    // ... sequential if-chain
}

// After (entries aligned with the if-chain boundaries: <=8 -> 0, <=16 -> 1, <=32 -> 2, ...)
static const uint8_t g_size_to_class[65] = {
    0,                              // 0
    0,0,0,0,0,0,0,0,                // 1-8
    1,1,1,1,1,1,1,1,                // 9-16
    2,2,2,2,2,2,2,2,
    2,2,2,2,2,2,2,2,                // 17-32
    3,3,3,3,3,3,3,3,                // 33-40
    // ... continue to 64
};

static inline int hak_tiny_size_to_class_fast(size_t size) {
    return likely(size <= 64) ? g_size_to_class[size] : -1;
}
```

**Implementation**:
1. File: `hakmem_tiny.h`
2. Add static const array after line 36 (after `g_tiny_blocks_per_slab`)
3. Update `hak_tiny_size_to_class()` to use the table
4. Add `__builtin_expect()` for the fast path (a `likely()` macro sketch appears after #3 below)

---

### 2. Remove Statistics from Critical Path

**Effort**: 1 hour | **Gain**: 10-15 ns

Move sampled counter updates to separate tracking:

```c
// Before (hot path)
void* p = mag->items[--mag->top].ptr;
t_tiny_rng ^= t_tiny_rng << 13;   // xorshift RNG: ~3 ns overhead
t_tiny_rng ^= t_tiny_rng >> 17;
t_tiny_rng ^= t_tiny_rng << 5;
if ((t_tiny_rng & ((1u << TINY_SAMPLE_SHIFT) - 1)) == 0) {
    g_tiny_pool.alloc_count[class_idx]++;   // sampled stats update
}
return p;

// After (hot path)
void* p = mag->items[--mag->top].ptr;
// Stats update deferred (see lazy_counter_update below)
return p;

// New: Lazy counter accumulation (cold path)
static void hak_tiny_lazy_counter_update(int class_idx) {
    if (++g_tls_alloc_counter[class_idx] >= 100) {
        g_tiny_pool.alloc_count[class_idx] += g_tls_alloc_counter[class_idx];
        g_tls_alloc_counter[class_idx] = 0;
    }
}
```

**Implementation**:
1. File: `hakmem_tiny.c`
2. Remove sampled XOR code from `hak_tiny_alloc()` lines 656-659
3. Replace with a simple per-thread counter
4. Call `hak_tiny_lazy_counter_update()` in the slow path only
5. Update `hak_tiny_get_stats()` to account for the lazy counters (a flush sketch appears after #3 below)

---

### 3. Inline Fast Path

**Effort**: 1 hour | **Gain**: 5-10 ns

Create a separate inlined fast-path function:

```c
// New file: hakmem_tiny_alloc_fast.h
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Magazine fast path only (no TLS active slab, no locks)
    if (size > TINY_MAX_SIZE) return NULL;

    int class_idx = g_size_to_class[size];
    TinyTLSMag* mag = &g_tls_mags[class_idx];

    if (likely(mag->top > 0)) {
        return mag->items[--mag->top].ptr;
    }

    // Fall through to slow path
    extern void* hak_tiny_alloc_slow(size_t size);
    return hak_tiny_alloc_slow(size);
}
```

**Implementation**:
1. Create: `hakmem_tiny_alloc_fast.h`
2. Move the pure magazine fast path here
3. Declare as `__attribute__((always_inline))`
4. Include from `hakmem.h` for the public API
5. Keep `hakmem_tiny.c` for the slow path
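Both the table classifier in #1 and the inlined fast path in #3 rely on a `likely()` hint, which is not standard C. A minimal sketch, assuming GCC or Clang; if hakmem already defines these macros, reuse the existing ones:

```c
// Sketch only: branch-prediction hints via __builtin_expect (GCC/Clang).
// Falls back to a no-op on other compilers.
#if defined(__GNUC__) || defined(__clang__)
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define likely(x)   (x)
#  define unlikely(x) (x)
#endif
```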
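Step 5 of #2 asks `hak_tiny_get_stats()` to account for the deferred counters. A hedged sketch of a per-thread flush helper, assuming `g_tls_alloc_counter` is the `__thread` array introduced above and `TINY_NUM_CLASSES` bounds the size classes (locking and cross-thread aggregation are out of scope here):

```c
// Sketch only: fold this thread's pending lazy counters into the global
// stats before hak_tiny_get_stats() reads g_tiny_pool.alloc_count.
// Assumes: static __thread uint32_t g_tls_alloc_counter[TINY_NUM_CLASSES];
static void hak_tiny_flush_lazy_counters(void) {
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        if (g_tls_alloc_counter[c] > 0) {
            g_tiny_pool.alloc_count[c] += g_tls_alloc_counter[c];
            g_tls_alloc_counter[c] = 0;
        }
    }
}
// Note: this only flushes the calling thread; other threads' counters
// stay pending until they next hit the slow path.
```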
---

## Medium Effort (2-5 ns improvement each)

### 4. Combine TLS Reads into Single Structure

**Effort**: 2 hours | **Gain**: 2-3 ns

```c
// Before
TinyTLSMag* mag    = &g_tls_mags[class_idx];
TinySlab*   slab_a = &g_tls_active_slab_a[class_idx];
TinySlab*   slab_b = &g_tls_active_slab_b[class_idx];
// 3 separate TLS reads + prefetch misses

// After
typedef struct {
    TinyTLSMag mag;      // All magazine data
    TinySlab*  slab_a;   // Pointers
    TinySlab*  slab_b;
} TinyTLSCache;

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];

TinyTLSCache* cache = &g_tls_cache[class_idx];  // 1 TLS read
// All 3 data structures prefetched together
```

**Benefits**:
- Single TLS read instead of 3
- All related data on the same cache line
- Better prefetcher behavior

---

### 5. Hardware Prefetching Hints

**Effort**: 30 minutes | **Gain**: 1-2 ns (cumulative)

```c
// In hot path loop (e.g., bench_tiny_mt.c)
void* p = mag->items[--mag->top].ptr;
if (mag->top > 0) {
    // Prefetch the block the next pop will hand out
    __builtin_prefetch(mag->items[mag->top - 1].ptr, 0, 3);  // 0 = read, 3 = keep in L1
}
use_allocation(p);
```

---

### 6. Branchless Fallback Logic

**Effort**: 1.5 hours | **Gain**: 10-15 ns

Use conditional moves instead of branches:

```c
// Before (2+ branches, high misprediction rate)
if (mag->top > 0) {
    return mag->items[--mag->top].ptr;
}
if (slab_a && slab_a->free_count > 0) {
    // ... allocation from slab
}
// Fall through to slow path

// After (compiler can lower the NULL checks to conditional moves)
void* p = NULL;
if (mag->top > 0) {
    p = mag->items[--mag->top].ptr;
}
// If still NULL, the slab_a handler gets a chance
if (!p && slab_a && slab_a->free_count > 0) {
    // ... allocation
    p = result;
}
return p != NULL ? p : hak_tiny_alloc_slow(size);
```

---

## Advanced Optimizations (2-5 ns improvement)

### 7. Code Layout (Hot/Cold Separation)

**Effort**: 2 hours | **Gain**: 2-5 ns

Use compiler attributes to place the fast path in a hot section:

```c
// In hakmem_tiny_alloc_fast.h
__attribute__((section(".text.hot")))
__attribute__((aligned(64)))
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Fast path only
}

// In hakmem_tiny.c
__attribute__((section(".text.cold")))
void* hak_tiny_alloc_slow(size_t size) {
    // Slow path: locks, scanning, etc.
}
```

**Benefits**:
- Hot path packed in contiguous instruction cache lines
- Fewer I-cache misses
- Better CPU prefetching
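If maintaining explicit section names proves awkward, GCC and Clang also provide `hot`/`cold` function attributes that let the compiler choose placement itself; a minimal sketch mirroring #7 (an alternative worth evaluating, not part of the current plan):

```c
// Sketch only (GCC/Clang): attribute-based alternative to naming
// .text.hot/.text.cold by hand. Cold-marked functions are grouped into a
// separate text subsection, giving a similar I-cache benefit.

// In hakmem_tiny_alloc_fast.h
__attribute__((hot))
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Fast path only (as in #3)
}

// In hakmem_tiny.c
__attribute__((cold, noinline))
void* hak_tiny_alloc_slow(size_t size) {
    // Slow path: locks, scanning, etc.
}
```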
---

## Implementation Priority (Time Investment vs Gain)

| Priority | Optimization | Effort | Gain | Ratio |
|----------|--------------|--------|------|-------|
| **P0** | Lookup table classification | 30min | 3-5ns | 10x ROI |
| **P1** | Remove stats overhead | 1hr | 10-15ns | 15x ROI |
| **P2** | Inline fast path | 1hr | 5-10ns | 7x ROI |
| P3 | Branch elimination | 1.5hr | 10-15ns | 7x ROI |
| P4 | Combined TLS reads | 2hr | 2-3ns | 1.5x ROI |
| P5 | Code layout | 2hr | 2-5ns | 2x ROI |
| P6 | Prefetching hints | 30min | 1-2ns | 3x ROI |

---

## Testing Strategy

After each optimization:

```bash
# Rebuild
make clean && make -j4

# Single-threaded benchmark
./bench_tiny --iterations=1000000 --threads=1
# Should show improvement in latency_ns

# Multi-threaded verification
./bench_tiny --iterations=100000 --threads=4
# Should maintain thread-safety and hit rate

# Compare against baseline
./docs/bench_compare.sh baseline optimized
```

---

## Expected Results Timeline

**Phase 1** (P0+P1, ~2 hours):
- Current: 83 ns/op
- Expected: 65-70 ns/op
- Gain: ~18% improvement

**Phase 2** (P0+P1+P2, ~3 hours):
- Expected: 55-65 ns/op
- Gain: ~25-30% improvement

**Phase 3** (P0+P1+P2+P3, ~4.5 hours):
- Expected: 45-55 ns/op
- Gain: ~40-45% improvement
- Still 3-4x slower than mimalloc (fundamental difference)

**Why still slower than mimalloc (14 ns)?** Irreducible architectural gaps:
1. Bitmap lookup [5 ns] vs free list head [1 ns]
2. Magazine validation [3-5 ns] vs implicit ownership [0 ns]
3. Thread ownership tracking [2-3 ns] vs per-thread pages [0 ns]

**Total irreducible gap: 10-13 ns**

---

## Code Files to Modify

| File | Change | Priority |
|------|--------|----------|
| `hakmem_tiny.h` | Add size_to_class LUT | P0 |
| `hakmem_tiny.c` | Remove stats from hot path | P1 |
| `hakmem_tiny_alloc_fast.h` | NEW - Inlined fast path | P2 |
| `hakmem_tiny.c` | Branchless fallback | P3 |
| `hakmem_tiny.h` | Combine TLS structure | P4 |
| `bench_tiny.c` | Add prefetch hints | P6 |
| `Makefile` | Hot/cold sections (-ffunction-sections) | P5 |

---

## Rollback Plan

Each optimization is independent and can be reverted:

```bash
# If an optimization causes a regression:
git diff                  # See changes
git checkout -- <file>    # Revert a single file
./bench_tiny              # Re-verify
```

All optimizations preserve semantics - no behavior changes.

---

## Success Criteria

- [ ] P0 Implementation: 80-82 ns/op (no regression)
- [ ] P1 Implementation: 68-72 ns/op (+15% from current)
- [ ] P2 Implementation: 60-65 ns/op (+22% from current)
- [ ] P3 Implementation: 50-55 ns/op (+35% from current)
- [ ] All changes compile with `-O3 -march=native`
- [ ] All benchmarks pass thread-safety verification
- [ ] No regressions on medium/large allocations (L2 pool)

---

**Last Updated**: 2025-10-26
**Status**: Ready for implementation
**Owner**: [Team]
**Target**: Implement P0-P2 in next sprint