# Tiny Pool Optimization Strategy

**Date**: 2025-10-26
**Goal**: Reduce the 5.9x performance gap with mimalloc for small allocations (8-64 bytes)
**Current**: 83 ns/op (hakmem) vs 14 ns/op (mimalloc)
**Target**: 50-55 ns/op (35-40% improvement, ~3.5x gap remaining)
**Status**: Ready for implementation

---

## Executive Summary

### The 5.9x Gap: Root Cause Analysis

Based on comprehensive mimalloc analysis (see ANALYSIS_SUMMARY.md), the performance gap stems from **architectural differences**, not bugs:

| Component | mimalloc | hakmem | Impact |
|-----------|----------|--------|--------|
| **Primary data structure** | LIFO free list (intrusive) | Bitmap + magazine | +20 ns |
| **State location** | Thread-local only | Thread-local + global | +10 ns |
| **Cache validation** | Implicit (per-thread pages) | Explicit (ownership tracking) | +5 ns |
| **Statistics overhead** | Batched/deferred | Per-allocation sampled | +10 ns |
| **Control flow** | 1 branch | 3-4 branches | +5 ns |

**Total measured gap**: 69 ns (83 - 14)

### Why We Can't Match mimalloc's 14 ns

**Irreducible architectural gaps** (10-13 ns total):

1. **Bitmap lookup** [5 ns]: Find-first-set + bit extraction vs a single pointer read (contrasted in the sketch below)
2. **Magazine validation** [3-5 ns]: Ownership tracking for diagnostics vs implicit ownership
3. **Statistics integration** [2-3 ns]: Per-class stats require bookkeeping vs atomic counters
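To make the first of these gaps concrete, the sketch below contrasts the two fast paths side by side. Both functions are illustrative only: the names (`freelist_pop`, `bitmap_pop`) and layouts are assumptions for exposition, not the actual mimalloc or hakmem internals.

```c
#include <stddef.h>
#include <stdint.h>

// mimalloc-style intrusive LIFO free list: one dependent load plus one store.
// The "next" pointer lives inside the free block itself, so no side table is needed.
static inline void* freelist_pop(void** head) {
    void* block = *head;
    if (block) {
        *head = *(void**)block;  // follow the intrusive next pointer
    }
    return block;
}

// Bitmap-style slab pop: find-first-set, clear the bit, then compute the block
// address. Each step extends the dependency chain the CPU must wait on.
static inline void* bitmap_pop(uint64_t* bitmap, char* slab_base, size_t block_size) {
    if (*bitmap == 0) return NULL;        // no free block in this word
    int idx = __builtin_ctzll(*bitmap);   // find-first-set
    *bitmap &= *bitmap - 1;               // clear the lowest set bit
    return slab_base + (size_t)idx * block_size;
}
```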
### What We Can Realistically Achieve

**Addressable overhead** (30-35 ns):

- P0: Lookup table classification → +3-5 ns
- P1: Remove stats from hot path → +10-15 ns
- P2: Inline fast path → +5-10 ns
- P3: Branch elimination → +10-15 ns

**Expected result**: 83 ns → 50-55 ns (35-40% improvement)

---

## Strategic Approach: Three-Phase Implementation

### Phase I: Quick Wins (P0 + P1) - 90 minutes

**Target**: 83 ns → 65-70 ns (~20% improvement)
**ROI**: Highest impact per time invested

**Why Start Here**:
- P0 and P1 are independent (no code conflicts)
- Combined gain: 13-20 ns
- Low risk (simple, localized changes)
- Immediate validation possible

### Phase II: Fast Path Optimization (P2) - 60 minutes

**Target**: 65-70 ns → 55-60 ns (~30% cumulative improvement)
**ROI**: High impact, moderate complexity

**Why Second**:
- Depends on P1 completion (stats removed from hot path)
- Creates the foundation for P3 (branch elimination)
- More complex, but well documented in the roadmap

### Phase III: Advanced Optimization (P3) - 90 minutes

**Target**: 55-60 ns → 50-55 ns (~40% cumulative improvement)
**ROI**: Moderate, requires careful testing

**Why Last**:
- Most complex (branchless logic)
- Highest risk of subtle bugs
- Marginal improvement (diminishing returns)

---

## Detailed Implementation Plan

### P0: Lookup Table Size Classification

**File**: `hakmem_tiny.h`
**Effort**: 30 minutes
**Gain**: +3-5 ns
**Risk**: Very Low

#### Current Implementation (If-Chain)

```c
static inline int hak_tiny_size_to_class(size_t size) {
    if (size <= 8)    return 0;
    if (size <= 16)   return 1;
    if (size <= 32)   return 2;
    if (size <= 64)   return 3;
    if (size <= 128)  return 4;
    if (size <= 256)  return 5;
    if (size <= 512)  return 6;
    if (size <= 1024) return 7;
    return -1;
}
// Branches: 8 (worst case), avg 4 mispredictions
// Cost: 5-8 ns
```

#### Target Implementation (LUT)

```c
// Add after line 36 in hakmem_tiny.h (after g_tiny_blocks_per_slab)
static const uint8_t g_tiny_size_to_class[1025] = {
    // 0-8: class 0
    0,0,0,0,0,0,0,0,0,
    // 9-16: class 1
    1,1,1,1,1,1,1,1,
    // 17-32: class 2
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    // 33-64: class 3
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    // 65-128: class 4
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    // 129-256: class 5
    // ... (continue pattern for all 1025 entries)
    // 513-1024: class 7
};

static inline int hak_tiny_size_to_class_fast(size_t size) {
    // Fast path: direct table lookup
    if (__builtin_expect(size <= 1024, 1)) {
        return g_tiny_size_to_class[size];
    }
    // Slow path: out of Tiny Pool range
    return -1;
}
// Branches: 1 (predictable)
// Cost: 0.5-1 ns (L1 cache hit)
```

#### Implementation Steps

1. Add the `g_tiny_size_to_class[1025]` table to `hakmem_tiny.h` (after line 36)
2. Replace `hak_tiny_size_to_class()` with `hak_tiny_size_to_class_fast()`
3. Update all call sites in `hakmem_tiny.c`
4. Compile and verify correctness

#### Testing

```bash
# Build and verify correctness
make clean && make -j4

# Unit test: verify classification accuracy
./test_tiny_size_class  # (create if needed; see the sketch below)

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 83ns → 78-80ns
```
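Since `test_tiny_size_class` does not exist yet, here is a minimal sketch of it: the original if-chain is kept as a reference oracle and cross-checked against the LUT for every size, including out-of-range values. The oracle's name is invented for the test; everything else comes from the snippets above.

```c
// Sketch of test_tiny_size_class: cross-check the P0 LUT against the
// original if-chain, kept here as a reference oracle.
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include "hakmem_tiny.h"  // provides hak_tiny_size_to_class_fast()

static int reference_size_to_class(size_t size) {  // the original if-chain
    if (size <= 8)    return 0;
    if (size <= 16)   return 1;
    if (size <= 32)   return 2;
    if (size <= 64)   return 3;
    if (size <= 128)  return 4;
    if (size <= 256)  return 5;
    if (size <= 512)  return 6;
    if (size <= 1024) return 7;
    return -1;
}

int main(void) {
    // Probe past 1024 as well, to confirm the out-of-range path returns -1.
    for (size_t size = 0; size <= 2048; size++) {
        assert(hak_tiny_size_to_class_fast(size) == reference_size_to_class(size));
    }
    printf("test_tiny_size_class: all 1025 in-range entries match\n");
    return 0;
}
```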
#### Rollback

```bash
git diff hakmem_tiny.h      # Verify changes
git checkout hakmem_tiny.h  # If regression detected
```

---

### P1: Remove Statistics from Critical Path

**File**: `hakmem_tiny.c`
**Effort**: 60 minutes
**Gain**: +10-15 ns
**Risk**: Low (statistics semantics preserved)

#### Current Implementation (Per-Allocation Sampling)

```c
// hakmem_tiny.c:656-659 (hot path)
void* p = mag->items[--mag->top].ptr;

// Sampled counter update (XOR-based PRNG)
t_tiny_rng ^= t_tiny_rng << 13;  // 3 ns overhead
t_tiny_rng ^= t_tiny_rng >> 17;
t_tiny_rng ^= t_tiny_rng << 5;
if ((t_tiny_rng & ((1u << 4) - 1)) == 0) {     // fires for ~1/16 of allocations
    g_tiny_pool.alloc_count[class_idx] += 16;  // scale the sample back up
}
return p;
```

#### Target Implementation (Lazy Batched Counters)

```c
// Hot path: magazine pop with no statistics at all
void* p = mag->items[--mag->top].ptr;  // NO STATS HERE!
return p;
// Stats overhead: 0 ns

// Cold path: Lazy accumulation (called during magazine refill)
static void hak_tiny_lazy_counter_update(int class_idx) {
    // Accumulate every 100 allocations
    if (++t_tiny_alloc_counter[class_idx] >= 100) {
        g_tiny_pool.alloc_count[class_idx] += t_tiny_alloc_counter[class_idx];
        t_tiny_alloc_counter[class_idx] = 0;
    }
}

// Call from slow path (magazine refill function)
void hak_tiny_refill_magazine(...) {
    // ... existing refill logic ...
    hak_tiny_lazy_counter_update(class_idx);
}
```

#### Implementation Steps

1. Add a `t_tiny_alloc_counter[TINY_NUM_CLASSES]` TLS array
2. Remove the XOR PRNG code from the `hak_tiny_alloc()` hot path (lines 656-659)
3. Add the `hak_tiny_lazy_counter_update()` function
4. Call the lazy update in the slow path (magazine refill, slab allocation)
5. Update `hak_tiny_get_stats()` to flush pending TLS counters (see the sketch below)
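A sketch of step 5 follows, intended for `hakmem_tiny.c`. This document never shows `hak_tiny_get_stats()`'s actual signature, so the `TinyStats` struct below is a placeholder assumption; the flush logic is the point.

```c
// Placeholder stats struct -- an assumption, not the real hakmem API.
typedef struct {
    uint64_t alloc_count[TINY_NUM_CLASSES];
} TinyStats;

void hak_tiny_get_stats(TinyStats* out) {
    // Flush the calling thread's pending TLS counters first, so its recent
    // allocations are visible. Other threads' residues (fewer than 100 per
    // class) flush on their next magazine refill, so reported totals may lag.
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        if (t_tiny_alloc_counter[c] > 0) {
            // Use whatever synchronization the existing global counters use.
            g_tiny_pool.alloc_count[c] += t_tiny_alloc_counter[c];
            t_tiny_alloc_counter[c] = 0;
        }
        out->alloc_count[c] = g_tiny_pool.alloc_count[c];
    }
}
```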
#### Statistics Accuracy Trade-off

- **Before**: Sampled (1/16 allocations counted)
- **After**: Batched (accumulated every 100 allocations)
- **Impact**: Both are approximations, but batching is both faster and more accurate: sampling scales each 1/16 hit back up by 16x and therefore carries sampling variance, while batching counts every allocation exactly and merely lags the true total by fewer than 100 events per thread per class.

#### Testing

```bash
# Build
make clean && make -j4

# Functional test: verify stats still work
./test_mf2  # Should show non-zero alloc_count

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 78-80ns → 63-70ns
```

#### Rollback

```bash
git diff hakmem_tiny.c
git checkout hakmem_tiny.c  # If stats broken
```

---

### P2: Inline Fast Path

**Files**: `hakmem_tiny_alloc_fast.h` (new), `hakmem.h`, `hakmem_tiny.c`
**Effort**: 60 minutes
**Gain**: +5-10 ns
**Risk**: Moderate (function splitting, ABI considerations)

#### Current Implementation (Unified Function)

```c
// hakmem_tiny.c: Single function for fast + slow path
void* hak_tiny_alloc(size_t size) {
    // Size classification
    int class_idx = hak_tiny_size_to_class(size);

    // Magazine check
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        return mag->items[--mag->top].ptr;  // Fast path
    }

    // TLS active slab A
    TinySlab* slab_a = g_tls_active_slab_a[class_idx];
    if (slab_a && slab_a->free_count > 0) {
        // ... bitmap scan ...
    }

    // TLS active slab B
    // ...

    // Global pool (slow path with locks)
    // ...
}
// Problem: the compiler can't inline this due to its size/complexity
// Call overhead: 5-10 ns
```

#### Target Implementation (Split Fast/Slow)

```c
// New file: hakmem_tiny_alloc_fast.h
#ifndef HAKMEM_TINY_ALLOC_FAST_H
#define HAKMEM_TINY_ALLOC_FAST_H

#include "hakmem_tiny.h"

// External slow path declaration
extern void* hak_tiny_alloc_slow(size_t size, int class_idx);

// Inlined fast path (magazine-only)
__attribute__((always_inline))
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Fast bounds check
    if (__builtin_expect(size > TINY_MAX_SIZE, 0)) {
        return NULL;  // Not Tiny Pool range
    }

    // O(1) size classification (uses P0 LUT)
    int class_idx = g_tiny_size_to_class[size];

    // TLS magazine check (fast path only)
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (__builtin_expect(mag->top > 0, 1)) {
        return mag->items[--mag->top].ptr;
    }

    // Fall through to slow path
    return hak_tiny_alloc_slow(size, class_idx);
}

#endif
```

```c
// hakmem_tiny.c: Slow path only
void* hak_tiny_alloc_slow(size_t size, int class_idx) {
    // TLS active slab A
    TinySlab* slab_a = g_tls_active_slab_a[class_idx];
    if (slab_a && slab_a->free_count > 0) {
        // ... bitmap scan ...
    }

    // TLS active slab B
    // ...

    // Global pool (locks)
    // ...
}
```

```c
// hakmem.h: Update public API
#include "hakmem_tiny_alloc_fast.h"

// Use inlined fast path for tiny allocations
static inline void* hak_alloc_at(size_t size, void* site) {
    if (size <= TINY_MAX_SIZE) {
        return hak_tiny_alloc_hot(size);
    }
    // ... L2/L2.5 pools ...
}
```
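Note that `hak_tiny_alloc_hot()` indexes the P0 LUT only after the `size > TINY_MAX_SIZE` bound check, so the table length and the bound must stay in sync. A small compile-time guard makes that coupling explicit; this sketch assumes `TINY_MAX_SIZE` is 1024 (as the 1025-entry table implies) and would sit next to the table in `hakmem_tiny.h`.

```c
// Compile-time coupling check -- assumes TINY_MAX_SIZE is 1024, matching the
// 1025-entry LUT added in P0.
_Static_assert(TINY_MAX_SIZE == 1024,
               "P0 LUT covers sizes 0..1024; resize g_tiny_size_to_class if this changes");
_Static_assert(sizeof(g_tiny_size_to_class) == (size_t)TINY_MAX_SIZE + 1,
               "g_tiny_size_to_class must have exactly TINY_MAX_SIZE + 1 entries");
```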
#### Benefits

- **Zero call overhead**: the compiler inlines the hot path directly into the caller
- **Better register allocation**: the hot path uses minimal stack
- **Improved branch prediction**: the fast path is kept separate from cold code

#### Implementation Steps

1. Create `hakmem_tiny_alloc_fast.h`
2. Move the magazine-only logic to `hak_tiny_alloc_hot()`
3. Rename `hak_tiny_alloc()` → `hak_tiny_alloc_slow()`
4. Update `hakmem.h` to include the new header
5. Update all call sites

#### Testing

```bash
# Build
make clean && make -j4

# Functional test: verify all paths work
./test_mf2
./test_mf2_warmup

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 63-70ns → 55-65ns
```

#### Rollback

```bash
git rm hakmem_tiny_alloc_fast.h
git checkout hakmem.h hakmem_tiny.c
```

---

## Testing Strategy

### Phase-by-Phase Validation

#### After P0 (Lookup Table)

```bash
# Correctness
./test_tiny_size_class  # Verify all 1025 entries correct

# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 83ns → 78-80ns

# No regressions on other sizes
./bench_allocators_hakmem --scenario json  # L2.5 64KB
./bench_allocators_hakmem --scenario mir   # L2.5 256KB
```

#### After P1 (Remove Stats)

```bash
# Correctness: stats still work
./test_mf2  # hak_tiny_get_stats() should show non-zero counts

# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 78-80ns → 63-70ns

# Multi-threaded: verify TLS counters work (see the sketch below)
./bench_tiny_mt --iterations=100000 --threads=4
# Should see a reasonable alloc_count
```
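A sketch of that multi-threaded counter check follows. It reuses the `TinyStats` placeholder from the P1 sketch, calls the pre-P2 entry point `hak_tiny_alloc()`, and deliberately never frees, since this document does not name the free entry point; the test leaks roughly 6 MB, which is acceptable for a one-shot check.

```c
// Multi-threaded TLS-counter check (sketch; TinyStats is the P1 placeholder).
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include "hakmem_tiny.h"

#define NTHREADS 4
#define ALLOCS_PER_THREAD 100000

static void* worker(void* arg) {
    (void)arg;
    for (int i = 0; i < ALLOCS_PER_THREAD; i++) {
        (void)hak_tiny_alloc(16);  // 16 bytes -> class 1, per the P0 table
    }
    return NULL;
}

int main(void) {
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(tids[i], NULL);

    TinyStats stats;
    hak_tiny_get_stats(&stats);

    // Batched counters may lag by up to 99 events per thread per class, so
    // allow that much slack rather than demanding an exact match.
    uint64_t expected = (uint64_t)NTHREADS * ALLOCS_PER_THREAD;
    printf("class 1 alloc_count: %llu (expected within %d of %llu)\n",
           (unsigned long long)stats.alloc_count[1],
           NTHREADS * 99,
           (unsigned long long)expected);
    return 0;
}
```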
#### After P2 (Inline Fast Path)

```bash
# Correctness: all paths work
./test_mf2
./test_mf2_warmup

# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 63-70ns → 55-65ns

# Code inspection: verify inlining
objdump -d bench_tiny | grep -A 20 'hak_alloc_at'
# Should show inline expansion, not a call instruction
```

### Comprehensive Validation

```bash
# After all of P0-P2 are complete
make clean && make -j4

# Full benchmark suite
./bench_tiny --iterations=1000000 --threads=1     # Single-threaded
./bench_tiny_mt --iterations=100000 --threads=4   # Multi-threaded
./bench_allocators_hakmem --scenario json         # L2.5 64KB
./bench_allocators_hakmem --scenario mir          # L2.5 256KB

# Git commit
git add .
git commit -m "Tiny Pool optimization P0-P2: 83ns → 55-65ns

- P0: Lookup table size classification (+3-5ns)
- P1: Remove statistics from hot path (+10-15ns)
- P2: Inline fast path (+5-10ns)

Cumulative improvement: ~30-35%
Remaining gap to mimalloc (14ns): 3.5-4x (irreducible)"
```

---

## Risk Mitigation

### Low-Risk Approach

1. **Incremental changes**: one optimization at a time
2. **Validation at each step**: benchmark + test after each of P0, P1, P2
3. **Git commits**: a separate commit for each optimization
4. **Rollback ready**: `git checkout` if a regression is detected

### Potential Issues and Mitigation

#### Issue 1: LUT Size (1 KB)

- **Risk**: L1 cache pressure
- **Mitigation**: 1 KB fits comfortably in L1 (32 KB is typical), so the impact is negligible
- **Fallback**: use a smaller LUT (65 entries; round sizes up to the next power of two)

#### Issue 2: Inlining Bloat

- **Risk**: code size increase from inlining
- **Mitigation**: only the magazine path (~10 instructions) is inlined
- **Fallback**: use `__attribute__((hot))` instead of `always_inline`

#### Issue 3: Stats Accuracy

- **Risk**: batched counters could be less accurate than sampled ones
- **Mitigation**: batching is actually MORE accurate (no sampling variance)
- **Fallback**: keep the TLS counters but flush them more frequently

---

## Success Criteria

### Performance Targets

| Optimization | Current | Target | Achieved? |
|--------------|---------|--------|-----------|
| **Baseline** | 83 ns/op | - | ✓ (established) |
| **P0 Complete** | 83 ns | 78-80 ns | [ ] |
| **P1 Complete** | 78-80 ns | 63-70 ns | [ ] |
| **P2 Complete** | 63-70 ns | 55-65 ns | [ ] |
| **P0-P2 Combined** | 83 ns | **50-55 ns** | [ ] |

**Stretch Goal**: 50 ns/op (38% improvement, 3.5x gap to mimalloc)

### Functional Requirements

- [ ] All existing tests pass
- [ ] No regressions on L2/L2.5 pools
- [ ] Statistics still functional (batched counters work)
- [ ] Multi-threaded safety maintained
- [ ] Zero hard page faults (memory reuse preserved)

### Code Quality

- [ ] Clean compilation with `-Wall -Wextra`
- [ ] No compiler warnings
- [ ] Documented trade-offs (stats batching, LUT size)
- [ ] Git commit messages reference this strategy doc

---

## Timeline and Effort

### Phase I: Quick Wins (P0 + P1)

- **P0 Implementation**: 30 minutes
- **P0 Testing**: 15 minutes
- **P1 Implementation**: 60 minutes
- **P1 Testing**: 15 minutes
- **Total Phase I**: **2 hours**

### Phase II: Fast Path (P2)

- **P2 Implementation**: 60 minutes
- **P2 Testing**: 15 minutes
- **Total Phase II**: **1.25 hours**

### Phase III: Validation

- **Comprehensive testing**: 30 minutes
- **Documentation update**: 15 minutes
- **Git commit**: 15 minutes
- **Total Phase III**: **1 hour**

### Grand Total: 4.25 hours

**Realistic estimate with contingency**: **5-6 hours**

---

## Beyond P0-P2: Future Optimization (Optional)

### P3: Branch Elimination (Not Included in This Strategy)

**Effort**: 90 minutes
**Gain**: +10-15 ns
**Risk**: High (complex; subtle bugs are possible)

**Why Deferred**:
- Diminishing returns (a ~30% improvement is already achieved by P0-P2)
- The complexity-versus-gain trade-off is unfavorable
- Can be addressed in a future iteration

### Alternative: NEXT_STEPS.md Approach

After P0-P2 are complete, consider the NEXT_STEPS.md optimizations:

- MPSC opportunistic drain during the alloc slow path
- Immediate full→free slab promotion after drain
- Adaptive magazine capacity per site

**These may yield better ROI than P3.**

---

## Comparison to Alternatives

### Alternative 1: Implement NEXT_STEPS.md First

**Pros**: Novel features (adaptive magazines, ELO learning)
**Cons**: Higher complexity, uncertain performance gain
**Why Not**: The mimalloc analysis shows that stats overhead costs 10-15 ns; addressing a known bottleneck first is safer.

### Alternative 2: Copy mimalloc's Free-List Architecture

**Pros**: Would match mimalloc's 14 ns performance
**Cons**: Abandons the bitmap approach; loses diagnostics and ownership tracking
**Why Not**: Violates hakmem's research goals (flexible architecture).

### Alternative 3: Do Nothing

**Pros**: Zero effort
**Cons**: The 5.9x gap remains, and nothing is learned
**Why Not**: P0-P2 are low-risk, high-ROI quick wins.

---

## Conclusion

This strategy prioritizes **high-impact, low-risk optimizations** to achieve a **30-40% performance improvement** in 4-6 hours of work.

**Key Principles**:

1. **Incremental validation**: test after each step
2. **Focus on the hot path**: remove overhead from the critical path
3. **Preserve semantics**: no behavior changes, only optimization
4. **Accept trade-offs**: 50-55 ns is excellent; chasing 14 ns would abandon the research goals

**Next Steps**:

1. Review and approve this strategy
2. Implement P0 (Lookup Table) - 30 minutes
3. Validate P0 - 15 minutes
4. Implement P1 (Remove Stats) - 60 minutes
5. Validate P1 - 15 minutes
6. Implement P2 (Inline Fast Path) - 60 minutes
7. Validate P2 - 15 minutes
8. Comprehensive testing and documentation - 1 hour

**Expected Outcome**: Tiny Pool performance improves from 83 ns to 50-55 ns, closing roughly 40% of the gap with mimalloc while preserving hakmem's research architecture.

---

**Last Updated**: 2025-10-26
**Status**: Ready for implementation
**Approval**: Pending user confirmation
**Implementation Start**: After approval