# Hybrid Bitmap+Mini-Magazine Implementation Design

**Date**: 2025-10-26
**Goal**: Clean, modular implementation of Hybrid approach
**Target**: 83ns → 40-50ns (40-52% improvement)
**Philosophy**: "Operation Squeaky Clean" (綺麗綺麗大作戦) - Clean code from the start

---

## Current Structure Analysis

### Existing Implementation (Phase 6.X)

**Already Implemented** ✅:
1. **Two-Tier Bitmap**: `summary` + `bitmap` (lines 103-106 in hakmem_tiny.h)
2. **TLS Magazine**: 2048 items (lines 54-70 in hakmem_tiny.c)
3. **TLS Active Slabs**: A/B slabs (lines 72-73)
4. **Remote Free MPSC**: Lock-free remote free (lines 126-138)
5. **Hint Word**: Scan start position (line 102)

**Current Hot Path** (lines 656-661):
```c
if (mag->top > 0) {
    void* p = mag->items[--mag->top].ptr;
    t_tiny_rng ^= t_tiny_rng << 13;  // Statistics XOR (3 ns)
    t_tiny_rng ^= t_tiny_rng >> 17;
    t_tiny_rng ^= t_tiny_rng << 5;
    if ((t_tiny_rng & ...) == 0) g_tiny_pool.alloc_count[class_idx]++;
    return p;
}
```

**Bottlenecks Identified**:
1. ❌ Statistics in hot path: +10-15ns (lines 658-659, 677-678, 793-794)
2. ❌ Bitmap scan on TLS slab: +5-6ns (lines 671, 776)
3. ❌ Multiple TLS reads: +2-3ns (mag, slab_a, slab_b)

**Current Performance**: 83 ns/op

---

## Clean Modular Design

### Module 1: Page Mini-Magazine (Data Plane)

**Purpose**: O(1) LIFO free-list for fast allocation

**Location**: New file `hakmem_tiny_mini_mag.h`

```c
// ============================================================================
// Page Mini-Magazine: Fast LIFO Cache (Data Plane)
// ============================================================================

// Intrusive LIFO block
typedef struct MiniMagBlock {
    struct MiniMagBlock* next;  // 8 bytes in free block
} MiniMagBlock;

// Mini-magazine metadata (embedded in TinySlab)
typedef struct {
    MiniMagBlock* head;   // LIFO stack head
    uint16_t count;       // Current item count
    uint16_t capacity;    // Max items (16-32)
} PageMiniMag;

// Fast path: Pop from mini-magazine (1-2 ns)
static inline void* mini_mag_pop(PageMiniMag* mag) {
    MiniMagBlock* b = mag->head;
    if (!b) return NULL;
    mag->head = b->next;
    mag->count--;
    return (void*)b;
}

// Fast path: Push to mini-magazine (1-2 ns)
static inline int mini_mag_push(PageMiniMag* mag, void* ptr) {
    if (mag->count >= mag->capacity) return 0;  // Full
    MiniMagBlock* b = (MiniMagBlock*)ptr;
    b->next = mag->head;
    mag->head = b;
    mag->count++;
    return 1;
}
```

**Characteristics**:
- ✅ Zero per-block metadata overhead (the next-pointer is stored inside the free block)
- ✅ O(1) pop/push (1-2 ns)
- ✅ Cache-friendly (LIFO = temporal locality)
- ✅ No locks (owner-only access)

---

### Module 2: Batch Refill from Bitmap (Control Plane)

**Purpose**: Batch refill mini-magazine from two-tier bitmap

**Location**: New file `hakmem_tiny_batch_refill.h`

```c
// ============================================================================
// Batch Refill: Two-Tier Bitmap → Mini-Magazine (Control Plane)
// ============================================================================

// Refill mini-magazine from bitmap (batch of N items)
// Returns number of items refilled
static inline int batch_refill_from_bitmap(
    TinySlab* slab,
    PageMiniMag* mag,
    int want
) {
    if (want <= 0 || mag->count >= mag->capacity) return 0;

    size_t block_size = g_tiny_class_sizes[slab->class_idx];
    int got = 0;

    // Two-tier bitmap scan (using existing summary).
    // The capacity check comes before a block is claimed, so a block is never
    // marked used in the bitmap without also landing in the magazine.
    while (got < want && mag->count < mag->capacity && slab->free_count > 0) {
        int block_idx = hak_tiny_find_free_block(slab);
        if (block_idx < 0) break;

        // Mark as used in bitmap
        hak_tiny_set_used(slab, block_idx);
        slab->free_count--;

        // Calculate block pointer
        void* ptr = (char*)slab->base + ((size_t)block_idx * block_size);

        // Push to mini-magazine (cannot fail: capacity was checked above)
        mini_mag_push(mag, ptr);
        got++;
    }

    return got;
}

// Spill mini-magazine back to bitmap (on slab eviction)
static inline void batch_spill_to_bitmap(
    TinySlab* slab,
    PageMiniMag* mag
) {
    size_t block_size = g_tiny_class_sizes[slab->class_idx];

    while (mag->count > 0) {
        void* ptr = mini_mag_pop(mag);
        if (!ptr) break;

        // Calculate block index
        uintptr_t offset = (uintptr_t)ptr - (uintptr_t)slab->base;
        int block_idx = (int)(offset / block_size);

        // Mark as free in bitmap
        hak_tiny_set_free(slab, block_idx);
        slab->free_count++;
    }
}
```

**Characteristics**:
- ✅ Batch processing (amortized cost: 3ns/item)
- ✅ Uses existing two-tier bitmap
- ✅ Preserves bitmap consistency
- ✅ Minimal code (~40 lines)

---

### Module 3: Integrated TinySlab Structure

**Purpose**: Add mini-magazine to existing TinySlab

**Location**: `hakmem_tiny.h` (modification)

```c
// Modified TinySlab structure
typedef struct TinySlab {
    void* base;                 // Base address (64KB aligned)
    uint64_t* bitmap;           // Free block bitmap (dynamic size)
    uint16_t free_count;        // Number of free blocks
    uint16_t total_count;       // Total blocks in slab
    uint8_t class_idx;          // Size class index (0-7)
    uint8_t _padding[3];
    struct TinySlab* next;      // Next slab in list

    // MPSC remote-free stack head (lock-free)
    atomic_uintptr_t remote_head;
    atomic_uint remote_count;
    pthread_t owner_tid;

    // Two-tier bitmap (existing)
    uint16_t hint_word;
    uint8_t summary_words;
    uint8_t _pad_sum[1];
    uint64_t* summary;

    // NEW: Page Mini-Magazine (16-32 items)
    PageMiniMag mini_mag;       // 16 bytes (aligned)
} TinySlab;
```

**Changes**:
- ✅ Additive only (no breaking changes)
- ✅ Cache-line aligned (64 bytes)
- ✅ Backward compatible

---

### Module 4: Statistics Batching (Out-of-Band)

**Purpose**: Remove statistics from hot path

**Location**: New file `hakmem_tiny_stats.h`

```c
// ============================================================================
// Statistics: Batched TLS Counters (Out-of-Band)
// ============================================================================

// Per-thread batch counters (flushed periodically)
static __thread uint32_t t_alloc_batch[TINY_NUM_CLASSES];
static __thread uint32_t t_free_batch[TINY_NUM_CLASSES];

// Fast path: Increment TLS counter only (0.5 ns)
static inline void stats_record_alloc(int class_idx) {
    t_alloc_batch[class_idx]++;
}

static inline void stats_record_free(int class_idx) {
    t_free_batch[class_idx]++;
}

// Cold path: Flush batch to global counters
static inline void stats_flush_if_needed(int class_idx) {
    // Flush once at least 256 events have accumulated. Testing with >= and
    // adding the real count (rather than matching 0xFF and adding 256) avoids
    // an off-by-one and still works when this is only called from the slow path.
    if (t_alloc_batch[class_idx] >= 256) {
        g_tiny_pool.alloc_count[class_idx] += t_alloc_batch[class_idx];
        t_alloc_batch[class_idx] = 0;
    }
    if (t_free_batch[class_idx] >= 256) {
        g_tiny_pool.free_count[class_idx] += t_free_batch[class_idx];
        t_free_batch[class_idx] = 0;
    }
}
```

**Characteristics**:
- ✅ Hot path: 0.5 ns (simple increment)
- ✅ Cold path: Batch flush (about every 256 ops)
- ✅ Accuracy: 99.6% (vs 93.75% with sampling)
- ✅ No XOR RNG overhead

---

## Implementation Phases

### Phase 1: Add Mini-Magazine to TinySlab (2-3 hours)

**Goal**: Enable page-level fast path

**Files to modify**:
1. `hakmem_tiny.h`: Add `PageMiniMag` to `TinySlab`
2. Create `hakmem_tiny_mini_mag.h`: Mini-magazine operations
3. Create `hakmem_tiny_batch_refill.h`: Batch refill/spill

**Changes**:
```c
// hakmem_tiny.h (line 107, after summary)
typedef struct TinySlab {
    // ... existing fields ...
    uint64_t* summary;

    // NEW: Page Mini-Magazine
    PageMiniMag mini_mag;
} TinySlab;

// Initialize mini-magazine on slab creation (hakmem_tiny.c:allocate_new_slab)
slab->mini_mag.head = NULL;
slab->mini_mag.count = 0;
slab->mini_mag.capacity = 16;  // Start with 16 items
```

**Testing**:
```bash
# Compile
make clean && make -j4

# Unit test: Mini-magazine operations
# (create test_mini_mag.c)
./test_mini_mag

# Integration test: Verify no regressions
./bench_tiny --iterations=100000 --threads=1
```

**Expected**: No performance change (feature not used yet)

---

### Phase 2: Integrate Mini-Magazine into Hot Path (3-4 hours)

**Goal**: Use mini-magazine in allocation fast path

**Files to modify**:
1. `hakmem_tiny.c`: Modify `hak_tiny_alloc()` to use mini-mag

**New hot path logic**:
```c
void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0) return NULL;

    // 1. TLS Magazine (existing fast path)
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        void* p = mag->items[--mag->top].ptr;
        stats_record_alloc(class_idx);  // NEW: Batched
        return p;
    }

    // 2. TLS Active Slab Mini-Magazine (NEW: lock-free)
    TinySlab* tls = g_tls_active_slab_a[class_idx];
    if (tls && tls->mini_mag.count > 0) {
        void* p = mini_mag_pop(&tls->mini_mag);
        if (p) {
            stats_record_alloc(class_idx);
            return p;
        }
    }

    // 3. Refill mini-magazine from bitmap (medium path)
    if (tls && tls->free_count > 0) {
        int got = batch_refill_from_bitmap(tls, &tls->mini_mag, 16);
        if (got > 0) {
            void* p = mini_mag_pop(&tls->mini_mag);
            if (p) {
                stats_record_alloc(class_idx);
                return p;
            }
        }
    }

    // 4. Slow path (existing global pool logic)
    return tiny_alloc_slow_path(class_idx);
}
```

**Testing**:
```bash
# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 83ns → 55-65ns

# Multi-threaded test
./bench_tiny_mt --iterations=100000 --threads=4
# Expected: No regressions

# Correctness test
./test_mf2
./test_mf2_warmup
```

**Expected**: 83ns → 55-65ns (22-34% improvement)

---

### Phase 3: Remove Statistics from Hot Path (1 hour)

**Goal**: Eliminate XOR RNG overhead

**Files to modify**:
1. Create `hakmem_tiny_stats.h`: Batched statistics
2. `hakmem_tiny.c`: Replace XOR RNG with batched counters

**Changes**:
```c
// Remove lines 658-659, 677-678, 793-794 (XOR RNG)
// Replace with:
stats_record_alloc(class_idx);

// Add periodic flush (in slow path)
stats_flush_if_needed(class_idx);
```

**Testing**:
```bash
# Verify statistics still work
./test_mf2
# Then inspect hak_tiny_get_stats(...): should show reasonable counts

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 55-65ns → 45-55ns
```

**Expected**: 55-65ns → 45-55ns (about 10ns improvement)

---

### Phase 4: TLS Magazine Integration (1-2 hours)

**Goal**: Optimize TLS Magazine refill from mini-magazines

**Files to modify**:
1. `hakmem_tiny.c`: Refill TLS Magazine from slab mini-magazines

**Changes**:
```c
// When TLS Magazine is low, refill from multiple slab mini-magazines
static void refill_tls_magazine(int class_idx) {
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    int room = mag->cap - mag->top;
    if (room <= 0) return;

    // Try TLS active slab A
    TinySlab* tls_a = g_tls_active_slab_a[class_idx];
    if (tls_a) {
        while (room > 0 && tls_a->mini_mag.count > 0) {
            void* p = mini_mag_pop(&tls_a->mini_mag);
            if (!p) break;
            mag->items[mag->top++].ptr = p;
            room--;
        }

        // If mini-mag empty, refill from bitmap
        if (tls_a->mini_mag.count == 0 && tls_a->free_count > 0) {
            batch_refill_from_bitmap(tls_a, &tls_a->mini_mag, 16);
        }
    }

    // Try TLS active slab B (if still room)
    if (room > 0) {
        TinySlab* tls_b = g_tls_active_slab_b[class_idx];
        // ... similar logic ...
    }
}
```

**Testing**:
```bash
# Full benchmark suite
./bench_tiny --iterations=1000000 --threads=1
./bench_tiny_mt --iterations=100000 --threads=4
./bench_allocators_hakmem --scenario json
./bench_allocators_hakmem --scenario mir

# Expected: 45-55ns → 40-50ns
```

**Expected**: 45-55ns → 40-50ns (+5-10ns improvement)

---

## Code Organization (Squeaky Clean)

### File Structure

```
hakmem/
├── hakmem_tiny.h              # Main header (modified)
├── hakmem_tiny.c              # Main implementation (modified)
├── hakmem_tiny_mini_mag.h     # NEW: Mini-magazine operations
├── hakmem_tiny_batch_refill.h # NEW: Batch refill/spill
├── hakmem_tiny_stats.h        # NEW: Batched statistics
└── hakmem_tiny_superslab.h    # Existing (unchanged)
```

### Module Dependencies

```
hakmem_tiny.c
  ├── includes hakmem_tiny.h
  ├── includes hakmem_tiny_mini_mag.h (Phase 1)
  ├── includes hakmem_tiny_batch_refill.h (Phase 2)
  └── includes hakmem_tiny_stats.h (Phase 3)
```

### Coding Standards

**Naming Convention**:
```c
// Prefix: mini_mag_*, batch_*, stats_*
mini_mag_pop()
mini_mag_push()
batch_refill_from_bitmap()
batch_spill_to_bitmap()
stats_record_alloc()
stats_flush_if_needed()
```

**Inline Policy**:
```c
// Hot path: always inline
static inline void* mini_mag_pop(...) __attribute__((always_inline));

// Medium path: let compiler decide
static inline int batch_refill_from_bitmap(...);

// Cold path: never inline (returns void*, as used by hak_tiny_alloc)
static void* tiny_alloc_slow_path(...) __attribute__((noinline));
```

**Alignment**:
```c
// Cache-line aligned structures
// (note: aligned(64) also rounds sizeof up to a multiple of 64 bytes)
typedef struct __attribute__((aligned(64))) {
    // ...
} PageMiniMag;
```

**Comments**:
```c
// Module-level comment (box)
// ============================================================================
// Page Mini-Magazine: Fast LIFO Cache (Data Plane)
// ============================================================================

// Function-level comment (purpose + cost)
// Fast path: Pop from mini-magazine (1-2 ns)
static inline void* mini_mag_pop(PageMiniMag* mag) {
    // ...
}
```

---

## Testing Strategy

### Unit Tests

**test_mini_mag.c**:
```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include "hakmem_tiny_mini_mag.h"

// Test mini-magazine operations
void test_push_pop() {
    PageMiniMag mag = {.head = NULL, .count = 0, .capacity = 16};
    void* ptrs[16];

    // Push 16 items
    for (int i = 0; i < 16; i++) {
        ptrs[i] = malloc(64);
        assert(mini_mag_push(&mag, ptrs[i]) == 1);
    }
    assert(mag.count == 16);

    // Push when full (should fail)
    void* extra = malloc(64);
    assert(mini_mag_push(&mag, extra) == 0);
    free(extra);

    // Pop 16 items (LIFO order)
    for (int i = 15; i >= 0; i--) {
        void* p = mini_mag_pop(&mag);
        assert(p == ptrs[i]);  // LIFO
        free(p);               // Release the test block
    }
    assert(mag.count == 0);

    // Pop when empty (should return NULL)
    assert(mini_mag_pop(&mag) == NULL);
}
```

**test_batch_refill.c**:
```c
// Test batch refill from bitmap
void test_refill() {
    TinySlab* slab = allocate_new_slab(0);  // 8B class
    assert(slab->free_count == 8192);

    // Refill 16 items
    int got = batch_refill_from_bitmap(slab, &slab->mini_mag, 16);
    assert(got == 16);
    assert(slab->mini_mag.count == 16);
    assert(slab->free_count == 8192 - 16);

    // Verify items lie inside the slab
    for (int i = 0; i < 16; i++) {
        char* p = (char*)mini_mag_pop(&slab->mini_mag);
        assert(p >= (char*)slab->base && p < (char*)slab->base + TINY_SLAB_SIZE);
    }
}
```

### Integration Tests
**Existing tests should pass**:
```bash
./test_mf2
./test_mf2_warmup
./bench_tiny --iterations=100000 --threads=1
./bench_tiny_mt --iterations=100000 --threads=4
```

### Performance Tests

**Before/After comparison**:
```bash
# Baseline (before optimization)
./bench_tiny --iterations=1000000 --threads=1 > baseline.txt

# After Phase 1 (mini-magazine added but not used)
./bench_tiny --iterations=1000000 --threads=1 > phase1.txt
diff baseline.txt phase1.txt  # Timings should match within run-to-run noise

# After Phase 2 (mini-magazine integrated)
./bench_tiny --iterations=1000000 --threads=1 > phase2.txt
# Expected: 83ns → 55-65ns

# After Phase 3 (statistics batched)
./bench_tiny --iterations=1000000 --threads=1 > phase3.txt
# Expected: 55-65ns → 45-55ns

# After Phase 4 (TLS integration)
./bench_tiny --iterations=1000000 --threads=1 > phase4.txt
# Expected: 45-55ns → 40-50ns
```

---

## Success Criteria

### Performance Targets

| Phase | Target | Pass Criteria |
|-------|--------|---------------|
| **Baseline** | 83 ns/op | Current performance |
| **Phase 1** | 83 ns/op | No regression |
| **Phase 2** | 55-65 ns/op | 22-34% improvement vs. baseline |
| **Phase 3** | 45-55 ns/op | 34-46% improvement vs. baseline |
| **Phase 4** | 40-50 ns/op | **40-52% improvement vs. baseline** ✅ |

### Functional Requirements

- [ ] All existing tests pass
- [ ] No memory leaks (valgrind clean)
- [ ] Thread-safe (helgrind clean)
- [ ] Statistics accurate (within 1% of actual)
- [ ] No regressions on L2/L2.5 pools

### Code Quality

- [ ] Clean compilation (`-Wall -Wextra -Werror`)
- [ ] Modular design (separate .h files)
- [ ] Inline hints appropriate
- [ ] Cache-line aligned critical structures
- [ ] Comments explain "why" not "what"

---

## Rollback Plan

Each phase can be reverted on its own, newest first, since later phases build on earlier ones:

```bash
# Revert Phase 4
git revert

# Revert Phase 3
git revert

# Revert Phase 2
git revert

# Revert Phase 1
git revert
```

All changes are backward-compatible (additive only).

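
For reference, the improvement figures under Performance Targets are relative to the 83 ns/op baseline. The following is a minimal sketch of that arithmetic and is not part of the hakmem code base: `improvement_pct`, `PhaseTarget`, `print_phase_improvements`, and the `IMPROVEMENT_DEMO_MAIN` macro are hypothetical names introduced here for illustration, and the latency ranges are copied from the phase targets.

```c
#include <assert.h>
#include <stdio.h>

/* Relative latency improvement in percent:
 * (baseline - after) / baseline * 100. */
double improvement_pct(double baseline_ns, double after_ns) {
    assert(baseline_ns > 0.0);
    return (baseline_ns - after_ns) / baseline_ns * 100.0;
}

/* One target row: a latency range in ns/op (hypothetical helper type). */
typedef struct {
    const char* phase;
    double slow_ns;  /* pessimistic end of the target range */
    double fast_ns;  /* optimistic end of the target range */
} PhaseTarget;

/* Print the improvement range of each phase relative to the baseline. */
void print_phase_improvements(double baseline_ns,
                              const PhaseTarget* targets, int n) {
    for (int i = 0; i < n; i++) {
        printf("%s: %.0f-%.0f ns/op is %.1f%% to %.1f%% below %.0f ns/op\n",
               targets[i].phase,
               targets[i].fast_ns, targets[i].slow_ns,
               improvement_pct(baseline_ns, targets[i].slow_ns),
               improvement_pct(baseline_ns, targets[i].fast_ns),
               baseline_ns);
    }
}

#ifdef IMPROVEMENT_DEMO_MAIN
int main(void) {
    /* Ranges taken from the per-phase targets in this document. */
    const PhaseTarget targets[] = {
        {"Phase 2", 65.0, 55.0},
        {"Phase 3", 55.0, 45.0},
        {"Phase 4", 50.0, 40.0},
    };
    print_phase_improvements(83.0, targets, 3);
    return 0;
}
#endif
```

Compiled with `-DIMPROVEMENT_DEMO_MAIN`, this prints one line per phase. For example, the Phase 4 target of 40-50 ns/op works out to roughly 39.8% to 51.8% below the 83 ns/op baseline.
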

---

**Last Updated**: 2025-10-26
**Status**: Design complete, ready for implementation
**Next**: Begin Phase 1 - Add Mini-Magazine to TinySlab