# mimalloc Optimization Implementation Roadmap

## Closing the 47% Performance Gap

**Current:** 16.53 M ops/sec
**Target:** 24.00 M ops/sec (+45%)
**Strategy:** Three-phase implementation with incremental validation

---

## Phase 1: Direct Page Cache ⚡ **HIGH PRIORITY**

**Target:** +2.5-3.3 M ops/sec (15-20% improvement)
**Effort:** 1-2 days
**Risk:** Low
**Dependencies:** None

### Implementation Steps

#### Step 1.1: Add Direct Cache to Heap Structure

**File:** `core/hakmem_tiny.h`

```c
#define HAKMEM_DIRECT_PAGES 129  // Indices 0..128 cover sizes up to 1024 bytes

typedef struct hakmem_tiny_heap_s {
    // Existing fields...
    hakmem_tiny_class_t size_classes[32];

    // NEW: Direct page cache
    hakmem_tiny_page_t* pages_direct[HAKMEM_DIRECT_PAGES];

    // Existing fields...
} hakmem_tiny_heap_t;
```

**Memory cost:** 129 × 8 = 1,032 bytes per heap (acceptable)

#### Step 1.2: Initialize Direct Cache

**File:** `core/hakmem_tiny.c`

```c
void hakmem_tiny_heap_init(hakmem_tiny_heap_t* heap) {
    // Existing initialization...

    // Initialize direct cache
    for (size_t i = 0; i < HAKMEM_DIRECT_PAGES; i++) {
        heap->pages_direct[i] = NULL;
    }

    // Populate from existing size classes
    hakmem_tiny_rebuild_direct_cache(heap);
}
```

#### Step 1.3: Cache Update Function

**File:** `core/hakmem_tiny.c`

```c
static inline void hakmem_tiny_update_direct_cache(
    hakmem_tiny_heap_t* heap,
    hakmem_tiny_page_t* page,
    size_t block_size)
{
    if (block_size > 1024) return;  // Only cache small sizes

    size_t idx = (block_size + 7) / 8;  // Round up to word size
    if (idx < HAKMEM_DIRECT_PAGES) {
        heap->pages_direct[idx] = page;
    }
}

// Call this whenever a page is added/removed from a size class
```

#### Step 1.4: Fast Path Using Direct Cache

**File:** `core/hakmem_tiny.c`

```c
static inline void* hakmem_tiny_malloc_direct(
    hakmem_tiny_heap_t* heap, size_t size)
{
    // Fast path: direct cache lookup
    if (size <= 1024) {
        size_t idx = (size + 7) / 8;
        hakmem_tiny_page_t* page = heap->pages_direct[idx];

        if (page && page->free_list) {
            // Pop from free list
            hakmem_block_t* block = page->free_list;
            page->free_list = block->next;
            page->used++;
            return block;
        }
    }

    // Fallback to existing generic path
    return hakmem_tiny_malloc_generic(heap, size);
}

// Update main malloc to call this:
void* hakmem_malloc(size_t size) {
    if (size <= HAKMEM_TINY_MAX) {
        return hakmem_tiny_malloc_direct(tls_heap, size);
    }
    // ... existing large allocation path
}
```

### Validation

**Benchmark command:**

```bash
./bench_random_mixed_hakx
```

**Expected output:**

```
Before: 16.53 M ops/sec
After:  19.00-20.00 M ops/sec (+15-20%)
```

**If target not met:**

1. Profile with `perf record -e cycles,cache-misses ./bench_random_mixed_hakx`
2. Check direct cache hit rate
3. Verify cache is being updated correctly
4. Check for branch mispredictions

---

## Phase 2: Dual Free Lists 🚀 **MEDIUM PRIORITY**

**Target:** +2.0-3.3 M ops/sec additional (10-15% improvement)
**Effort:** 3-5 days
**Risk:** Medium (structural changes)
**Dependencies:** Phase 1 complete

### Implementation Steps

#### Step 2.1: Modify Page Structure

**File:** `core/hakmem_tiny.h`

```c
typedef struct hakmem_tiny_page_s {
    // Existing fields...
    uint32_t block_size;
    uint32_t capacity;

    // OLD: Single free list
    // hakmem_block_t* free_list;

    // NEW: Three separate free lists
    hakmem_block_t* free;            // Hot allocation path
    hakmem_block_t* local_free;      // Local frees (no atomic!)
    _Atomic(uintptr_t) thread_free;  // Remote frees + flags (lower 2 bits)

    uint32_t used;
    // ... other fields
} hakmem_tiny_page_t;
```

**Note:** `thread_free` encodes both pointer and flags in its lower 2 bits (aligned blocks allow this)

#### Step 2.2: Update Free Path

**File:** `core/hakmem_tiny.c`

```c
void hakmem_tiny_free(void* ptr) {
    hakmem_tiny_page_t* page = hakmem_tiny_ptr_to_page(ptr);
    hakmem_block_t* block = (hakmem_block_t*)ptr;

    // Fast path: local thread owns this page
    if (hakmem_tiny_is_local_page(page)) {
        // Add to local_free (no atomic!)
        block->next = page->local_free;
        page->local_free = block;
        page->used--;

        // Retire page if fully free
        if (page->used == 0) {
            hakmem_tiny_page_retire(page);
        }
        return;
    }

    // Slow path: remote free (atomic)
    hakmem_tiny_free_remote(page, block);
}
```

#### Step 2.3: Migration Logic

**File:** `core/hakmem_tiny.c`

```c
static void hakmem_tiny_collect_frees(hakmem_tiny_page_t* page) {
    // Step 1: Collect remote frees (atomic)
    // NOTE: exchanging with 0 also clears the flag bits; any flag that must
    // persist across a collect has to be re-set afterwards.
    uintptr_t tfree = atomic_exchange(&page->thread_free, 0);
    hakmem_block_t* remote_list = (hakmem_block_t*)(tfree & ~(uintptr_t)0x3);

    if (remote_list) {
        // Append to local_free
        hakmem_block_t* tail = remote_list;
        while (tail->next) tail = tail->next;
        tail->next = page->local_free;
        page->local_free = remote_list;
    }

    // Step 2: Migrate local_free to free
    if (page->local_free && !page->free) {
        page->free = page->local_free;
        page->local_free = NULL;
    }
}

// Call this in the allocation path when the free list is empty
void* hakmem_tiny_malloc_direct(hakmem_tiny_heap_t* heap, size_t size) {
    // ... direct cache lookup
    hakmem_tiny_page_t* page = heap->pages_direct[idx];

    if (page) {
        // Try to allocate from free list
        hakmem_block_t* block = page->free;
        if (block) {
            page->free = block->next;
            page->used++;
            return block;
        }

        // Free list empty - collect and retry
        hakmem_tiny_collect_frees(page);
        block = page->free;
        if (block) {
            page->free = block->next;
            page->used++;
            return block;
        }
    }

    // Fallback
    return hakmem_tiny_malloc_generic(heap, size);
}
```

### Validation

**Benchmark command:**

```bash
./bench_random_mixed_hakx
```

**Expected output:**

```
After Phase 1: 19.00-20.00 M ops/sec
After Phase 2: 21.50-23.00 M ops/sec (+10-15% additional)
```

**Key metrics to track:**

1. Atomic operation count (should drop significantly)
2. Cache miss rate (should improve)
3. Free path latency (should be faster)

**If target not met:**

1. Profile atomic operations: `perf record -e cpu-cycles,instructions,cache-references,cache-misses ./bench_random_mixed_hakx`
2. Check remote free percentage
3.
   Verify migration is happening correctly
4. Analyze cache line bouncing

---

## Phase 3: Branch Hints + Bit-Packed Flags 🎯 **LOW PRIORITY**

**Target:** +1.0-2.0 M ops/sec additional (5-8% improvement)
**Effort:** 1-2 days
**Risk:** Low
**Dependencies:** Phase 2 complete

### Implementation Steps

#### Step 3.1: Add Branch Hint Macros

**File:** `core/hakmem_config.h`

```c
#if defined(__GNUC__) || defined(__clang__)
#define hakmem_likely(x)   __builtin_expect(!!(x), 1)
#define hakmem_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define hakmem_likely(x)   (x)
#define hakmem_unlikely(x) (x)
#endif
```

#### Step 3.2: Add Branch Hints to Hot Path

**File:** `core/hakmem_tiny.c`

```c
void* hakmem_tiny_malloc_direct(hakmem_tiny_heap_t* heap, size_t size) {
    // Fast path hint
    if (hakmem_likely(size <= 1024)) {
        size_t idx = (size + 7) / 8;
        hakmem_tiny_page_t* page = heap->pages_direct[idx];

        if (hakmem_likely(page != NULL)) {
            hakmem_block_t* block = page->free;
            if (hakmem_likely(block != NULL)) {
                page->free = block->next;
                page->used++;
                return block;
            }

            // Slow path within fast path
            hakmem_tiny_collect_frees(page);
            block = page->free;
            if (hakmem_likely(block != NULL)) {
                page->free = block->next;
                page->used++;
                return block;
            }
        }
    }

    // Fallback (unlikely)
    return hakmem_tiny_malloc_generic(heap, size);
}

void hakmem_tiny_free(void* ptr) {
    if (hakmem_unlikely(ptr == NULL)) return;

    hakmem_tiny_page_t* page = hakmem_tiny_ptr_to_page(ptr);
    hakmem_block_t* block = (hakmem_block_t*)ptr;

    // Local free is likely
    if (hakmem_likely(hakmem_tiny_is_local_page(page))) {
        block->next = page->local_free;
        page->local_free = block;
        page->used--;

        // Rarely fully free
        if (hakmem_unlikely(page->used == 0)) {
            hakmem_tiny_page_retire(page);
        }
        return;
    }

    // Remote free is unlikely
    hakmem_tiny_free_remote(page, block);
}
```

#### Step 3.3: Bit-Pack Page Flags

**File:** `core/hakmem_tiny.h`

```c
typedef union hakmem_page_flags_u {
    uint8_t combined;  // For fast check
    struct {
        uint8_t is_full : 1;
        uint8_t
            has_remote_frees : 1;
        uint8_t is_retired : 1;
        uint8_t unused : 5;
    } bits;
} hakmem_page_flags_t;

typedef struct hakmem_tiny_page_s {
    // ... other fields
    hakmem_page_flags_t flags;
    // ...
} hakmem_tiny_page_t;
```

**Usage:**

```c
// Single comparison instead of multiple
if (hakmem_likely(page->flags.combined == 0)) {
    // Fast path: not full, no remote frees, not retired
    // ... 3-instruction free
}
```

### Validation

**Benchmark command:**

```bash
./bench_random_mixed_hakx
```

**Expected output:**

```
After Phase 2: 21.50-23.00 M ops/sec
After Phase 3: 23.00-24.50 M ops/sec (+5-8% additional)
```

**Key metrics:**

1. Branch misprediction rate (should decrease)
2. Instruction count (should decrease slightly)
3. Code size (should decrease due to better branch layout)

---

## Testing Strategy

### Unit Tests

**File:** `test_hakmem_phases.c`

```c
// Phase 1: Direct cache correctness
void test_direct_cache() {
    hakmem_tiny_heap_t* heap = hakmem_tiny_heap_create();

    // Allocate various sizes
    void* p8  = hakmem_malloc(8);
    void* p16 = hakmem_malloc(16);
    void* p32 = hakmem_malloc(32);

    // Verify direct cache is populated
    assert(heap->pages_direct[1] != NULL);  // 8 bytes
    assert(heap->pages_direct[2] != NULL);  // 16 bytes
    assert(heap->pages_direct[4] != NULL);  // 32 bytes

    // Free and verify cache is updated
    hakmem_free(p8);
    assert(heap->pages_direct[1]->free != NULL);

    hakmem_tiny_heap_destroy(heap);
}

// Phase 2: Dual free lists
void test_dual_free_lists() {
    hakmem_tiny_heap_t* heap = hakmem_tiny_heap_create();

    void* keep = hakmem_malloc(64);  // Keeps page->used > 0 so the free below
                                     // does not retire the page
    void* p = hakmem_malloc(64);
    hakmem_tiny_page_t* page = hakmem_tiny_ptr_to_page(p);

    // Local free goes to local_free
    hakmem_free(p);
    assert(page->local_free != NULL);
    assert(page->free == NULL || page->free != p);

    // Allocating again triggers migration
    // (assumes page->free is empty here so the allocation calls
    // hakmem_tiny_collect_frees)
    void* p2 = hakmem_malloc(64);
    assert(page->local_free == NULL);  // Migrated

    hakmem_free(p2);
    hakmem_free(keep);
    hakmem_tiny_heap_destroy(heap);
}

// Phase 3: Branch hints (no functional change)
void test_branch_hints() {
    // Just verify compilation and no regression
    for (int i = 0; i < 10000; i++) {
        void* p = hakmem_malloc(64);
        hakmem_free(p);
    }
}
```

### Benchmark Suite

**Run after each phase:**

```bash
# Core benchmark
./bench_random_mixed_hakx

# Stress tests
./bench_mid_large_hakx
./bench_tiny_hot_hakx
./bench_fragment_stress_hakx

# Multi-threaded
./bench_mid_large_mt_hakx
```

### Validation Checklist

**Phase 1:**

- [ ] Direct cache correctly populated
- [ ] Cache hit rate > 95% for small allocations
- [ ] Performance gain: 15-20%
- [ ] No memory leaks
- [ ] All existing tests pass

**Phase 2:**

- [ ] Local frees go to local_free
- [ ] Remote frees go to thread_free
- [ ] Migration works correctly
- [ ] Atomic operation count reduced by 80%+
- [ ] Performance gain: 10-15% additional
- [ ] Thread-safety maintained
- [ ] All existing tests pass

**Phase 3:**

- [ ] Branch hints compile correctly
- [ ] Bit-packed flags work as expected
- [ ] Performance gain: 5-8% additional
- [ ] Code size reduced or unchanged
- [ ] All existing tests pass

---

## Rollback Plan

### Phase 1 Rollback

If Phase 1 doesn't meet targets:

```c
// #define HAKMEM_USE_DIRECT_CACHE 1  // Comment out

void* hakmem_malloc(size_t size) {
#ifdef HAKMEM_USE_DIRECT_CACHE
    return hakmem_tiny_malloc_direct(tls_heap, size);
#else
    return hakmem_tiny_malloc_generic(tls_heap, size);  // Old path
#endif
}
```

### Phase 2 Rollback

If Phase 2 causes issues:

```c
// Revert to single free list
typedef struct hakmem_tiny_page_s {
#ifdef HAKMEM_USE_DUAL_LISTS
    hakmem_block_t* free;
    hakmem_block_t* local_free;
    _Atomic(uintptr_t) thread_free;
#else
    hakmem_block_t* free_list;  // Old single list
#endif
    // ...
} hakmem_tiny_page_t;
```

---

## Success Criteria

### Minimum Acceptable Performance

- **Phase 1:** +10% (18.18 M ops/sec)
- **Phase 2:** +20% cumulative (19.84 M ops/sec)
- **Phase 3:** +35% cumulative (22.32 M ops/sec)

### Target Performance

- **Phase 1:** +15% (19.01 M ops/sec)
- **Phase 2:** +27% cumulative (21.00 M ops/sec)
- **Phase 3:** +40% cumulative (23.14 M ops/sec)

### Stretch Goal

- **Phase 3:** +45% cumulative (24.00 M ops/sec) - **Match mimalloc!**

---

## Timeline

### Conservative Estimate

- **Week 1:** Phase 1 implementation + validation
- **Week 2:** Phase 2 implementation
- **Week 3:** Phase 2 validation + debugging
- **Week 4:** Phase 3 implementation + final validation

**Total: 4 weeks**

### Aggressive Estimate

- **Day 1-2:** Phase 1 implementation + validation
- **Day 3-6:** Phase 2 implementation + validation
- **Day 7-8:** Phase 3 implementation + validation

**Total: 8 days**

---

## Risk Mitigation

### Technical Risks

1. **Cache coherency issues** (Phase 2)
   - Mitigation: Extensive multi-threaded testing
   - Fallback: Keep atomic operations on critical path
2. **Memory overhead** (Phase 1)
   - Mitigation: Monitor RSS increase
   - Fallback: Reduce HAKMEM_DIRECT_PAGES to 65 (512 bytes)
3. **Correctness bugs** (Phase 2)
   - Mitigation: Extensive unit tests, ASAN/TSAN builds
   - Fallback: Revert to single free list

### Performance Risks

1. **Phase 1 underperforms** (<10%)
   - Action: Profile cache hit rate
   - Fix: Adjust cache update logic
2. **Phase 2 adds latency** (cache bouncing)
   - Action: Profile cache misses
   - Fix: Adjust migration threshold
3. **Phase 3 no improvement** (compiler already optimized)
   - Action: Check assembly output
   - Fix: Skip phase or use PGO

---

## Monitoring

### Key Metrics to Track

1. **Operations/sec** (primary metric)
2. **Latency percentiles** (p50, p95, p99)
3. **Memory usage** (RSS)
4. **Cache miss rate**
5. **Branch misprediction rate**
6.
   **Atomic operation count**

### Profiling Commands

```bash
# Basic profiling
perf record -e cycles,instructions,cache-misses ./bench_random_mixed_hakx
perf report

# Cache analysis
perf record -e cache-references,cache-misses,L1-dcache-load-misses ./bench_random_mixed_hakx

# Branch analysis
perf record -e branch-misses,branches ./bench_random_mixed_hakx

# ASAN/TSAN builds
CC=clang CFLAGS="-fsanitize=address" make
CC=clang CFLAGS="-fsanitize=thread" make
```

---

## Next Steps

1. **Implement Phase 1** (direct page cache)
2. **Benchmark and validate** (target: +15-20%)
3. **If successful:** Proceed to Phase 2
4. **If not:** Debug and iterate

**Start now with Phase 1 - it's low-risk and high-reward!**
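---

## Appendix: Remote-Free Push Sketch

The Phase 2 free path defers cross-thread frees to `hakmem_tiny_free_remote`, which the roadmap references but never spells out. Below is a minimal, self-contained sketch of that push under the stated encoding (list pointer plus flags in the low 2 bits of `thread_free`). The reduced struct definitions and the name `HAKMEM_TF_FLAG_MASK` are illustrative stand-ins, not the actual `core/hakmem_tiny.h` declarations.

```c
// Sketch of a lock-free remote-free push, assuming blocks are aligned to
// at least 4 bytes so the low 2 bits of the head word are free for flags.
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

typedef struct hakmem_block_s {
    struct hakmem_block_s* next;
} hakmem_block_t;

typedef struct hakmem_tiny_page_s {
    _Atomic(uintptr_t) thread_free;  // remote free list + 2 flag bits
} hakmem_tiny_page_t;

#define HAKMEM_TF_FLAG_MASK ((uintptr_t)0x3)

// Push one block onto page->thread_free, preserving the flag bits.
// A CAS loop is required: concurrent remote freers race on the head,
// and every push must carry the old flags forward unchanged.
static void hakmem_tiny_free_remote(hakmem_tiny_page_t* page,
                                    hakmem_block_t* block) {
    uintptr_t old = atomic_load_explicit(&page->thread_free,
                                         memory_order_relaxed);
    uintptr_t desired;
    do {
        // Link to the current head with the flag bits stripped off.
        block->next = (hakmem_block_t*)(old & ~HAKMEM_TF_FLAG_MASK);
        // New head keeps the old flag bits.
        desired = (uintptr_t)block | (old & HAKMEM_TF_FLAG_MASK);
    } while (!atomic_compare_exchange_weak_explicit(
        &page->thread_free, &old, desired,
        memory_order_release, memory_order_relaxed));
}
```

The `memory_order_release` on the successful exchange pairs with the acquire side of the `atomic_exchange` in `hakmem_tiny_collect_frees`, so the owning thread observes fully linked blocks when it claims the list.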