# Phase 6.25-6.27: Implementation Plan - Catching Up with mimalloc

**Date**: 2025-10-24
**Status**: 📋 Planning
**Target**: Reach 60-75% of mimalloc performance for Mid Pool

---

## 📊 Current Baseline (Phase 6.21 Results)

### Performance vs mimalloc

| Workload | Threads | hakmem | mimalloc | Ratio | Gap |
|----------|---------|--------|----------|-------|-----|
| **Mid** | 1T | 4.0 M/s | 14.6 M/s | **28%** | -72% |
| **Mid** | 4T | 13.8 M/s | 29.5 M/s | **47%** | -53% |
| Tiny | 1T | 19.4 M/s | 32.6 M/s | 59% | -41% |
| Tiny | 4T | 48.0 M/s | 65.7 M/s | 73% | -27% |
| Large | 1T | 0.6 M/s | 2.1 M/s | 29% | -71% |

**Key Insights**:
- ✅ **Phase 6.25 Quick Wins achieved +37.8% for Mid 4T** (10.0 → 13.8 M/s)
- ❌ Mid Pool still significantly behind mimalloc (28% 1T, 47% 4T)
- 🎯 Target: 60-75% of mimalloc = **8.8-11.0 M/s (1T), 17.7-22.1 M/s (4T)**

### Current Mid Pool Architecture

```
┌─────────────────────────────────────────────────┐
│ TLS Fast Path (Lock-Free)                       │
├─────────────────────────────────────────────────┤
│ 1. TLS Ring Buffer (RING_CAP=32)                │
│    - LIFO cache for recently freed blocks       │
│    - Per-class, per-thread                      │
│    - Phase 6.25: 16→32 increased hit rate       │
│                                                 │
│ 2. TLS Active Pages (x2: page_a, page_b)        │
│    - Bump-run allocation (no per-block links)   │
│    - Owner-thread private (lock-free)           │
│    - 64KB pages, split on-demand                │
├─────────────────────────────────────────────────┤
│ Shared State (Lock-Based)                       │
├─────────────────────────────────────────────────┤
│ 3. Per-class Freelist (64 shards)               │
│    - Mutex-protected per (class, shard)         │
│    - Site-based sharding (reduce contention)    │
│    - Refill on demand via refill_freelist()     │
│                                                 │
│ 4. Remote Stack (MPSC, lock-free push)          │
│    - Cross-thread free target                   │
│    - Drained into freelist under lock           │
│                                                 │
│ 5. Transfer Cache (TC, Phase 6.20)              │
│    - Per-thread inbox (atomic CAS)              │
│    - Owner-aware routing                        │
│    - Drain trigger: ring->top < 2               │
└─────────────────────────────────────────────────┘

Refill Flow (Current):
  Ring empty → Check Active Pages → Lock Shard → Pop freelist
  → Drain remote → Shard steal (if CAP reached) → **refill_freelist()**

Refill Implementation:
  - Allocates **1 page** (64KB) via mmap
  - Splits into blocks, links into freelist
  - ACE bundle factor: 1-4 pages (adaptive)
```

### Bottlenecks Identified

**From Phase 6.20 Analysis**:

1. **Refill Latency** (Primary)
   - Single-page refill: 1 mmap syscall per refill
   - Freelist rebuilding overhead (linking blocks)
   - Mutex hold time during refill (~100-150 cycles)
   - **Impact**: ~40% of alloc time in Mid 1T

2. **Lock Contention** (Secondary)
   - 64 shards × 7 classes = 448 mutexes
   - Even with sharding, 4T shows contention
   - Trylock success rate: ~60-70% (Phase 6.25 data)
   - **Impact**: ~25% of alloc time in Mid 4T

3. **CAP/W_MAX Sub-optimal** (Tertiary)
   - Static configuration (no runtime adaptation)
   - W_MAX=1.60 (Mid), 1.30 (Large) → some fallback to L1
   - CAP={64,64,64,32,16} → conservative, low hit rate
   - **Impact**: ~10-15% missed pool opportunities

---

## 🎯 Phase 6.25 Proper: Refill Batching

### Goal

**Reduce refill latency by allocating multiple pages at once**

**Target**: Mid 1T: +10-15% (4.0 → 4.5-5.0 M/s)

### Problem Statement

Current `refill_freelist()` allocates **1 page per call**:
- 1 mmap syscall (~200-300 cycles)
- 1 page split + freelist rebuild (~100-150 cycles)
- Held under mutex lock (blocks other threads)
- Amortized cost per block: **HIGH** for small classes (e.g., 2KB = 32 blocks/page)

**Opportunity**: Allocate **2-4 pages in batch** to amortize costs:
- mmap overhead: 300 cycles → 75-150 cycles/page (batched)
- Freelist rebuild: done in parallel or optimized
- Fill multiple TLS page slots + Ring buffer aggressively

### Implementation Approach

#### 1.
**Create `alloc_tls_page_batch()` Function**

**Location**: `hakmem_pool.c` (after `alloc_tls_page()`, line ~486)

**Signature**:
```c
// Allocate multiple pages in batch and distribute to TLS structures
// Returns: number of pages successfully allocated (0-batch_size)
static int alloc_tls_page_batch(int class_idx, int batch_size,
                                PoolTLSPage* slots[], int num_slots,
                                PoolTLSRing* ring, PoolTLSBin* bin);
```

**Pseudocode**:
```c
static int alloc_tls_page_batch(int class_idx, int batch_size,
                                PoolTLSPage* slots[], int num_slots,
                                PoolTLSRing* ring, PoolTLSBin* bin) {
    size_t user_size = g_class_sizes[class_idx];
    size_t block_size = HEADER_SIZE + user_size;
    int blocks_per_page = POOL_PAGE_SIZE / block_size;
    if (blocks_per_page <= 0) return 0;

    int allocated = 0;

    // Allocate pages in batch (strategy: multiple mmaps or single large mmap)
    // Option A: Multiple mmaps (simpler, compatible with existing infra)
    for (int i = 0; i < batch_size; i++) {
        void* page = mmap(NULL, POOL_PAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) break;  // mmap signals failure with MAP_FAILED, not NULL

        // Prefault (Phase 6.25 quick win)
        for (size_t j = 0; j < POOL_PAGE_SIZE; j += 4096) {
            ((volatile char*)page)[j] = 0;
        }

        // Register page descriptor (once per page)
        mid_desc_register(page, class_idx, (uint64_t)(uintptr_t)pthread_self());

        // Strategy: Fill TLS slots first, then fill Ring/LIFO
        if (allocated < num_slots && slots[allocated]) {
            // Assign to TLS active page slot (bump-run init)
            PoolTLSPage* ap = slots[allocated];
            ap->page = page;
            ap->bump = (char*)page;
            ap->end = (char*)page + POOL_PAGE_SIZE;
            ap->count = blocks_per_page;
        } else {
            // Fill Ring + LIFO from this page
            char* bump = (char*)page;
            char* end = (char*)page + POOL_PAGE_SIZE;
            for (int k = 0; k < blocks_per_page; k++) {
                PoolBlock* b = (PoolBlock*)(void*)bump;
                // Try Ring first, then LIFO
                if (ring && ring->top < POOL_TLS_RING_CAP) {
                    ring->items[ring->top++] = b;
                } else if (bin) {
                    b->next = bin->lo_head;
                    bin->lo_head = b;
                    bin->lo_count++;
                }
                bump += block_size;
                if (bump >= end) break;
            }
        }

        allocated++;
        g_pool.total_pages_allocated++;
        g_pool.pages_by_class[class_idx]++;
        g_pool.total_bytes_allocated += POOL_PAGE_SIZE;
    }

    if (allocated > 0) {
        g_pool.refills[class_idx]++;
    }
    return allocated;
}
```

#### 2. Modify Refill Call Sites

**Location**: `hakmem_pool.c:931` (inside `hak_pool_try_alloc`, refill path)

**Before**:
```c
if (alloc_tls_page(class_idx, tap)) {
    // ... use newly allocated page
}
```

**After**:
```c
// Determine batch size from env var (default 2-4)
int batch = g_pool_refill_batch_size;  // new global config
if (batch < 1) batch = 1;
if (batch > 4) batch = 4;

// Prepare slot array (up to 2 TLS slots)
PoolTLSPage* slots[2] = {NULL, NULL};
int num_slots = 0;
if (g_tls_active_page_a[class_idx].page == NULL ||
    g_tls_active_page_a[class_idx].count == 0) {
    slots[num_slots++] = &g_tls_active_page_a[class_idx];
}
if (g_tls_active_page_b[class_idx].page == NULL ||
    g_tls_active_page_b[class_idx].count == 0) {
    slots[num_slots++] = &g_tls_active_page_b[class_idx];
}

// Call batch allocator
int allocated = alloc_tls_page_batch(class_idx, batch, slots, num_slots,
                                     &g_tls_bin[class_idx].ring,
                                     &g_tls_bin[class_idx]);
if (allocated > 0) {
    pthread_mutex_unlock(lock);
    // Use ring or active page as usual
    // ...
}
```

#### 3. Add Environment Variable

**Global Config** (add to `hakmem_pool.c` globals, ~line 316):
```c
static int g_pool_refill_batch_size = 2;  // env: HAKMEM_POOL_REFILL_BATCH (1-4)
```

**Init** (add to `hak_pool_init()`, ~line 716):
```c
const char* e_batch = getenv("HAKMEM_POOL_REFILL_BATCH");
if (e_batch) {
    int v = atoi(e_batch);
    if (v >= 1 && v <= 4) g_pool_refill_batch_size = v;
}
```

#### 4.
**Extend TLS Active Page Slots (Optional)**

**Current**: 2 slots (page_a, page_b)
**Proposal**: Add page_c, page_d for batch_size=4 (if beneficial)

**Trade-off**:
- ✅ Pro: More TLS-local inventory, fewer shared accesses
- ❌ Con: Increased TLS memory footprint (~256 bytes/class)

**Recommendation**: Start with 2 slots, measure, then extend if needed.

---

### File Changes Required

| File | Function | Change Type | Est. LOC |
|------|----------|-------------|----------|
| `hakmem_pool.c` | `alloc_tls_page_batch()` | **New function** | +80 |
| `hakmem_pool.c` | `hak_pool_try_alloc()` | Modify refill path | +30 |
| `hakmem_pool.c` | Globals | Add `g_pool_refill_batch_size` | +1 |
| `hakmem_pool.c` | `hak_pool_init()` | Parse env var | +5 |
| `hakmem_pool.h` | (none) | No public API change | 0 |
| **Total** | | | **~116 LOC** |

---

### Testing Strategy

#### Unit Test
```bash
# Test batch allocation works
HAKMEM_POOL_REFILL_BATCH=4 ./test_pool_refill

# Verify TLS slots filled correctly
# Check Ring buffer populated
# Check no memory leaks
```

#### Benchmark Test
```bash
# Baseline (batch=1, current behavior)
HAKMEM_POOL_REFILL_BATCH=1 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh

# Batch=2 (conservative)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh

# Batch=4 (aggressive)
HAKMEM_POOL_REFILL_BATCH=4 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh

# Expected: +10-15% on Mid 1T (4.0 → 4.5-5.0 M/s)
```

#### Failure Modes to Watch

1. **Memory bloat**: Batch too large → excessive pre-allocation
   - **Monitor**: RSS growth, pages_allocated counter
   - **Mitigation**: Cap batch_size at 4, respect CAP limits
2. **Ring overflow**: Batch fills Ring, blocks get lost
   - **Monitor**: Ring underflow counter (should decrease)
   - **Mitigation**: Properly route overflow to LIFO
3.
**TLS slot contention**: Multiple threads allocating same class
   - **Monitor**: Active page descriptor conflicts
   - **Mitigation**: Per-thread ownership (already enforced)

---

### Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Memory bloat (over-allocation) | Medium | High | Cap at batch=4, respect CAP limits |
| Complexity (harder to debug) | Low | Medium | Extensive logging, unit tests |
| Backward compat (existing workloads) | Low | Low | Default batch=2 (conservative) |
| Regression (slower than 1-page) | Low | Medium | A/B test, fallback to batch=1 |

**Rollback Plan**: Set `HAKMEM_POOL_REFILL_BATCH=1` to restore original behavior (zero code change).

---

### Estimated Time

- **Implementation**: 3-4 hours
  - Core function: 2 hours
  - Integration: 1 hour
  - Testing: 1 hour
- **Benchmarking**: 2 hours
  - Run suite 3x (batch=1,2,4)
  - Analyze results
- **Total**: **5-6 hours**

---

## 🔓 Phase 6.26: Lock-Free Refill

### Goal

**Eliminate lock contention on freelist access**

**Target**: Mid 4T: +15-20% (13.8 → 16-18 M/s)

### Problem Statement

Current freelist uses **per-shard mutexes** (`pthread_mutex_t`):
- 64 shards × 7 classes = **448 mutexes**
- Contention on hot shards (4T workload)
- Trylock success rate: ~60-70% (Phase 6.25 data)
- Each lock/unlock: ~20-40 cycles overhead

**Opportunity**: Replace mutex with **lock-free stack** (CAS-based):
- Atomic compare-and-swap: ~10-15 cycles
- No blocking (always forward progress)
- Better scalability under contention

### Implementation Approach

#### 1. Replace Freelist Mutex with Atomic Head

**Current Structure** (`hakmem_pool.c:276-280`):
```c
static struct {
    PoolBlock* freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // ...
} g_pool;
```

**New Structure**:
```c
static struct {
    // Lock-free freelist head (atomic pointer)
    atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // Lock-free counter (for non-empty bitmap update)
    atomic_uint freelist_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // Keep nonempty_mask (atomic already)
    atomic_uint_fast64_t nonempty_mask[POOL_NUM_CLASSES];
    // Remote stack (already lock-free)
    atomic_uintptr_t remote_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    atomic_uint remote_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // ... (rest unchanged)
} g_pool;
```

#### 2. Implement Lock-Free Push/Pop

**Lock-Free Pop** (replace mutex-based pop):
```c
// Pop block from lock-free freelist
// Returns: block pointer, or NULL if empty
static inline PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    uintptr_t old_head;
    PoolBlock* block;
    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                        memory_order_acquire);
        if (!old_head) {
            return NULL;  // Empty
        }
        block = (PoolBlock*)old_head;
        // Try CAS: freelist_head = block->next
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)block->next,
                 memory_order_release, memory_order_acquire));

    // Update count
    unsigned old_count = atomic_fetch_sub_explicit(
        &g_pool.freelist_count[class_idx][shard_idx], 1, memory_order_relaxed);
    // Clear nonempty bit if now empty
    if (old_count <= 1) {
        clear_nonempty_bit(class_idx, shard_idx);
    }
    return block;
}
```

**Lock-Free Push** (for refill path):
```c
// Push block onto lock-free freelist
static inline void freelist_push_lockfree(int class_idx, int shard_idx,
                                          PoolBlock* block) {
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                        memory_order_acquire);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_acquire));

    // Update count and nonempty bit
    atomic_fetch_add_explicit(&g_pool.freelist_count[class_idx][shard_idx], 1,
                              memory_order_relaxed);
    set_nonempty_bit(class_idx, shard_idx);
}
```

**Lock-Free Batch Push** (for refill, optimization):
```c
// Push multiple blocks atomically (amortize CAS overhead)
static inline void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                                PoolBlock* head, PoolBlock* tail,
                                                int count) {
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                        memory_order_acquire);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)head,
                 memory_order_release, memory_order_acquire));

    atomic_fetch_add_explicit(&g_pool.freelist_count[class_idx][shard_idx], count,
                              memory_order_relaxed);
    set_nonempty_bit(class_idx, shard_idx);
}
```

#### 3. Refill Path Integration

**Modify `refill_freelist()`** (now lock-free):
```c
static int refill_freelist(int class_idx, int shard_idx) {
    // ... (allocate page, split into blocks)

    // OLD: lock → push to freelist → unlock
    // pthread_mutex_lock(lock);
    // block->next = g_pool.freelist[class_idx][shard_idx];
    // g_pool.freelist[class_idx][shard_idx] = freelist_head;
    // pthread_mutex_unlock(lock);

    // NEW: lock-free batch push
    PoolBlock* tail = freelist_head;
    int count = blocks_per_page;
    while (tail->next) {
        tail = tail->next;
    }
    freelist_push_batch_lockfree(class_idx, shard_idx, freelist_head, tail, count);
    return 1;
}
```

#### 4.
**Remote Stack Drain (Lock-Free)**

**Current**: `drain_remote_locked()` called under mutex
**New**: Drain into local list, then batch-push lock-free

```c
// Drain remote stack into freelist (lock-free)
static inline void drain_remote_lockfree(int class_idx, int shard_idx) {
    // Atomically swap remote head to NULL (unchanged)
    uintptr_t head = atomic_exchange_explicit(&g_pool.remote_head[class_idx][shard_idx],
                                              (uintptr_t)0, memory_order_acq_rel);
    if (!head) return;

    // Count blocks
    int count = 0;
    PoolBlock* tail = (PoolBlock*)head;
    while (tail->next) {
        tail = tail->next;
        count++;
    }
    count++;  // Include head

    // Batch push to freelist (lock-free)
    freelist_push_batch_lockfree(class_idx, shard_idx, (PoolBlock*)head, tail, count);

    // Update remote count
    atomic_fetch_sub_explicit(&g_pool.remote_count[class_idx][shard_idx], count,
                              memory_order_relaxed);
}
```

#### 5. Fallback Strategy (Optional)

For **rare contention** cases (e.g., CAS spin > 100 iterations):
- Option A: Keep spinning (acceptable for short lists)
- Option B: Fallback to mutex (hybrid approach)
- Option C: Backoff + retry (exponential backoff)

**Recommendation**: Start with Option A (pure lock-free), measure, add backoff if needed.

---

### File Changes Required

| File | Function | Change Type | Est. LOC |
|------|----------|-------------|----------|
| `hakmem_pool.c` | Globals | Replace mutexes with atomics | +10/-10 |
| `hakmem_pool.c` | `freelist_pop_lockfree()` | **New function** | +30 |
| `hakmem_pool.c` | `freelist_push_lockfree()` | **New function** | +20 |
| `hakmem_pool.c` | `freelist_push_batch_lockfree()` | **New function** | +25 |
| `hakmem_pool.c` | `drain_remote_lockfree()` | Rewrite (lock-free) | +25/-20 |
| `hakmem_pool.c` | `refill_freelist()` | Modify (use batch push) | +10/-15 |
| `hakmem_pool.c` | `hak_pool_try_alloc()` | Replace lock/unlock with pop | +5/-10 |
| `hakmem_pool.c` | `hak_pool_free()` | Lock-free path | +10/-10 |
| `hakmem_pool.c` | `hak_pool_init()` | Init atomics (not mutexes) | +5/-5 |
| **Total** | | | **~140 LOC (net ~100)** |

---

### Testing Strategy

#### Correctness Test
```bash
# Single-threaded (no contention, pure correctness)
THREADS=1 ./test_pool_lockfree

# Multi-threaded stress test (high contention)
THREADS=16 DURATION=60 ./test_pool_lockfree_stress

# Check for:
# - No memory leaks (valgrind)
# - No double-free (AddressSanitizer)
# - No lost blocks (counter invariants)
```

#### Performance Test
```bash
# Baseline (Phase 6.25, with batching)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh

# Lock-free (Phase 6.26)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh

# Expected: +15-20% on Mid 4T (13.8 → 16-18 M/s)
```

#### Contention Analysis
```bash
# Measure CAS retry rate
# Add instrumentation:
#   atomic_uint_fast64_t cas_retries;
#   atomic_uint_fast64_t cas_attempts;
# Print ratio at shutdown
# Target: <5% retry rate under 4T load
```

---

### Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| ABA problem (block reuse) | Low | Critical | Use epoch-based reclamation or hazard pointers |
| CAS livelock (high contention) | Medium | High | Add exponential backoff after N retries |
| Memory ordering bugs (subtle races) | Medium | Critical | Extensive testing, TSan, formal verification |
| Performance regression (1T) | Low | Low | Single-thread has no contention, minimal overhead |

**ABA Problem**:
- **Scenario**: Block A popped, freed, reallocated, pushed back while another thread's CAS is in-flight
- **Solution**: Not critical for freelist (ABA still results in valid freelist state)
- **Alternative**: Add version counter (128-bit CAS) if issues arise

**Rollback Plan**: Keep mutexes in code (ifdef'd out), revert via compile flag if needed.

---

### Estimated Time

- **Implementation**: 5-6 hours
  - Lock-free primitives: 2 hours
  - Integration: 2 hours
  - Testing: 2 hours
- **Debugging**: 2-3 hours (race conditions, TSan)
- **Benchmarking**: 2 hours
- **Total**: **9-11 hours**

---

## 🧠 Phase 6.27: Learner Integration

### Goal

**Dynamic optimization of CAP and W_MAX based on runtime behavior**

**Target**: +5-10% across all workloads via adaptive tuning

### Problem Statement

Current policy is **static** (set at init):
- `CAP = {64,64,64,32,16,32,32}` (conservative)
- `W_MAX_MID = 1.60`, `W_MAX_LARGE = 1.30`
- No adaptation to workload characteristics

**Opportunity**: Use **existing learner infrastructure** to:
1. Collect size distribution stats
2. Adjust `mid_cap[]` dynamically based on hit rate
3. Adjust `w_max_mid` based on fragmentation vs hit rate trade-off

**Learner Already Exists**: `hakmem_learner.c` (~585 LOC)
- Background thread (1 sec polling)
- Hit rate monitoring
- UCB1 for W_MAX exploration (Canary deployment)
- Budget enforcement + Water-filling

**Integration Work**: Minimal (learner already supports Mid Pool tuning)

---

### Implementation Approach

#### 1. Enable Learner for Mid Pool

**Already Implemented** (`hakmem_learner.c:239-272`):
```c
// Adjust Mid caps by hit rate vs target (delta over window) with dwell
int mid_classes = 5;
if (cur->mid_dyn1_bytes != 0 && cur->mid_dyn2_bytes != 0) mid_classes = 7;
// ...
for (int i = 0; i < mid_classes; i++) {
    uint64_t dh = mid_hits[i] - prev_mid_hits[i];
    uint64_t dm = mid_misses[i] - prev_mid_misses[i];
    // ...
    if (hit < (tgt_mid - eps)) {
        cap += step_mid;   // Increase CAP
    } else if (hit > (tgt_mid + eps)) {
        cap -= step_mid;   // Decrease CAP
    }
    // ...
}
```

**Action**: Just enable via env var!
```bash
HAKMEM_LEARN=1 \
HAKMEM_TARGET_HIT_MID=0.65 \
HAKMEM_CAP_STEP_MID=8 \
HAKMEM_CAP_MIN_MID=16 \
HAKMEM_CAP_MAX_MID=512 \
./your_app
```

#### 2. W_MAX Learning (Optional, Risky)

**Already Implemented** (`hakmem_learner.c:388-499`):
- UCB1 multi-armed bandit
- Canary deployment (safe exploration)
- Rollback if performance regresses

**Candidates** (for Mid Pool):
```
W_MAX_MID candidates: [1.40, 1.50, 1.60, 1.70]
Default: 1.60 (current)
Exploration: Try 1.50 (tighter, less waste) or 1.70 (looser, higher hit)
```

**Enable**:
```bash
HAKMEM_LEARN=1 \
HAKMEM_WMAX_LEARN=1 \
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7 \
HAKMEM_WMAX_CANARY=1 \
./your_app
```

**Recommendation**: Start with CAP tuning only, add W_MAX later (more risk).

#### 3. Size Distribution Integration (Already Exists)

**Histogram** (`hakmem_size_hist.c`):
- 1KB granularity bins (0-64KB tracked)
- Per-allocation sampling
- Reset after learner snapshot

**DYN1 Auto-Assignment** (already implemented):
```bash
HAKMEM_LEARN=1 \
HAKMEM_DYN1_AUTO=1 \
HAKMEM_CAP_MID_DYN1=64 \
./your_app
```

**Effect**: Automatically finds peak size in 2-32KB range, assigns DYN1 class.

#### 4.
**New: ACE Stats Integration**

**Current ACE** (`hakmem_ace.c`):
- Records size decisions (original_size → rounded_size → pool)
- Tracks L1 fallback rate (miss → malloc)
- Not integrated with learner

**Proposal**: Add ACE stats to learner score function

**Modify Learner Score** (`hakmem_learner.c:414`):
```c
// OLD: simple hit-based score
double score = (double)(ace.mid_hit + ace.large_hit)
             - (double)(ace.mid_miss + ace.large_miss)
             - 2.0 * (double)ace.l1_fallback;

// NEW: add fragmentation penalty
extern uint64_t hak_ace_get_total_waste(void);  // sum of (rounded - original)
uint64_t waste = hak_ace_get_total_waste();
double frag_penalty = (double)waste / 1e6;  // normalize to MB
double score = (double)(ace.mid_hit + ace.large_hit)
             - (double)(ace.mid_miss + ace.large_miss)
             - 2.0 * (double)ace.l1_fallback
             - 0.5 * frag_penalty;  // penalize waste
```

**Benefit**: Balance hit rate vs fragmentation (W_MAX tuning).

---

### File Changes Required

| File | Function | Change Type | Est. LOC |
|------|----------|-------------|----------|
| `hakmem_learner.c` | Learner (already exists) | Enable via env | 0 |
| `hakmem_ace.c` | `hak_ace_get_total_waste()` | **New function** | +15 |
| `hakmem_learner.c` | `learner_main()` | Add frag penalty to score | +10 |
| `hakmem_policy.c` | (none) | Learner publishes dynamically | 0 |
| **Total** | | | **~25 LOC** |

---

### Testing Strategy

#### Baseline Test (Learner Off)
```bash
# Static policy (current)
RUNTIME=60 THREADS=1,4 ./scripts/run_bench_suite.sh
# Record: Mid 1T, Mid 4T throughput
```

#### Learner Test (CAP Tuning)
```bash
# Enable learner with aggressive targets
HAKMEM_LEARN=1 \
HAKMEM_TARGET_HIT_MID=0.75 \
HAKMEM_CAP_STEP_MID=8 \
HAKMEM_CAP_MAX_MID=512 \
HAKMEM_LEARN_WINDOW_MS=2000 \
RUNTIME=60 THREADS=1,4 ./scripts/run_bench_suite.sh

# Expected: CAP increases to ~128-256 (hit 75% target)
# Expected: +5-10% throughput improvement
```

#### W_MAX Learning Test (Optional)
```bash
HAKMEM_LEARN=1 \
HAKMEM_WMAX_LEARN=1 \
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7 \
HAKMEM_WMAX_CANARY=1 \
HAKMEM_WMAX_TRIAL_SEC=5 \
RUNTIME=120 THREADS=1,4 ./scripts/run_bench_suite.sh

# Monitor stderr for learner logs:
#   "[Learner] W_MAX mid canary start: 1.50"
#   "[Learner] W_MAX mid canary adopt"           (success)
# or
#   "[Learner] W_MAX mid canary revert to 1.60"  (failure)
```

#### Regression Test
```bash
# Check learner doesn't hurt stable workloads
# Run with learning OFF, then ON, compare variance
# Target: <5% variance, no regressions
```

---

### Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Over-tuning (oscillation) | Medium | Medium | Increase dwell time (3→5 sec) |
| Under-tuning (no effect) | Medium | Low | Lower target hit rate (0.75→0.65) |
| W_MAX instability (fragmentation spike) | Medium | High | Use Canary, revert on regression |
| Low-traffic workload (insufficient samples) | High | Low | Set min_samples=256, skip learning if below |

**Rollback Plan**: Set `HAKMEM_LEARN=0` (default, no learner overhead).
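The `hak_ace_get_total_waste()` accessor proposed for `hakmem_ace.c` is small enough to sketch. This is a minimal sketch, not the actual ACE code: the recording hook `hak_ace_record_waste()` and its name are assumptions (the real call site would be the ACE size-decision path); only the accessor name comes from this plan.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

/* Cumulative (rounded_size - original_size) bytes across all allocations.
 * Relaxed ordering is enough: the learner only needs an approximate,
 * monotonically increasing total once per window. */
static _Atomic uint64_t g_ace_total_waste;

/* Hypothetical hook: would be called from the ACE size-decision path. */
static inline void hak_ace_record_waste(size_t original_size, size_t rounded_size) {
    if (rounded_size > original_size) {
        atomic_fetch_add_explicit(&g_ace_total_waste,
                                  (uint64_t)(rounded_size - original_size),
                                  memory_order_relaxed);
    }
}

/* Learner-side accessor: total rounding waste so far, in bytes. */
uint64_t hak_ace_get_total_waste(void) {
    return atomic_load_explicit(&g_ace_total_waste, memory_order_relaxed);
}
```

The learner then converts the returned byte count to MB (`/ 1e6`) and subtracts `0.5 ×` that value from the score, as in the modified score function above.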
---

### Estimated Time

- **Implementation**: 1-2 hours
  - ACE waste tracking: 1 hour
  - Learner score update: 30 min
  - Testing: 30 min
- **Validation**: 3-4 hours
  - Run suite with/without learner
  - Analyze CAP convergence
  - W_MAX exploration (if enabled)
- **Total**: **4-6 hours**

---

## 📊 Expected Performance Improvements

### Cumulative Gains (Stacked)

| Phase | Change | Mid 1T | Mid 4T | Rationale |
|-------|--------|--------|--------|-----------|
| **Baseline (6.21)** | Current | 4.0 M/s (28%) | 13.8 M/s (47%) | Post-quick-wins |
| **6.25 (Batch)** | Refill 2-4 pages | +10-15% | +5-8% | Amortize syscall, 1T bottleneck |
| | | **4.5-5.0 M/s** | **14.5-15.2 M/s** | |
| **6.26 (Lock-Free)** | CAS freelist | +2-5% | +15-20% | Eliminate 4T contention |
| | | **4.6-5.2 M/s** | **17.0-18.2 M/s** | |
| **6.27 (Learner)** | Dynamic CAP/W_MAX | +5-10% | +5-10% | Adaptive tuning |
| | | **5.0-5.7 M/s** | **18.0-20.0 M/s** | |
| **Target (60-75%)** | vs mimalloc 14.6M / 29.5M | **8.8-11.0 M/s** | **17.7-22.1 M/s** | |
| **Achieved?** | | ❌ **35-39%** | ✅ **61-68%** | 1T still short, 4T on target! |

### Gap Analysis

**1T Performance**:
- Current: 4.0 M/s (28% of mimalloc)
- Post-6.27: 5.0-5.7 M/s (35-39% of mimalloc)
- **Gap to 75% target (11.0 M/s)**: Still need **+5.3-6.0 M/s** (~+93-120%)

**Remaining Bottlenecks (1T)**:
1. Single-threaded path still lock-bound (TLS caching gives no relative benefit)
2. mimalloc's per-thread heaps eliminate ALL shared state
3. Bump allocation (mimalloc) vs freelist (hakmem)
4.
Header overhead (32 bytes per alloc in hakmem)

**4T Performance**:
- Current: 13.8 M/s (47% of mimalloc)
- Post-6.27: 18.0-20.0 M/s (61-68% of mimalloc)
- **✅ Target achieved!** (60-75% range)

---

### Follow-Up Phases (Post-6.27)

**Phase 6.28: Header Elimination** (if 1T is still below target)
- Remove AllocHeader for Mid Pool (use page descriptors only)
- Saves 32 bytes per allocation (~8-15% memory)
- Saves header write on alloc hot path (~30-50 cycles)
- **Estimated Gain**: +15-20% (1T)

**Phase 6.29: Bump Allocation** (major refactor)
- Replace freelist with bump allocator (mimalloc-style)
- Per-thread arenas, no shared state at all
- **Estimated Gain**: +50-100% (1T), brings to mimalloc parity
- **Risk**: High complexity, long implementation (~2-3 weeks)

---

## 🗓️ Priority-Ordered Task List

### Phase 6.25: Refill Batching (Target: Week 1)

1. ☐ **Implement `alloc_tls_page_batch()` function** (2 hours)
   - [ ] Write batch mmap loop
   - [ ] Distribute pages to TLS slots
   - [ ] Fill Ring/LIFO from overflow pages
   - [ ] Add page descriptor registration
2. ☐ **Integrate batch refill into `hak_pool_try_alloc()`** (1 hour)
   - [ ] Replace `alloc_tls_page()` call with batch version
   - [ ] Prepare slot array logic
   - [ ] Handle partial allocation (< batch_size)
3. ☐ **Add environment variable support** (30 min)
   - [ ] Add `g_pool_refill_batch_size` global
   - [ ] Parse `HAKMEM_POOL_REFILL_BATCH` in init
   - [ ] Validate range (1-4)
4. ☐ **Unit testing** (1 hour)
   - [ ] Test batch=1,2,4 correctness
   - [ ] Verify TLS slots filled
   - [ ] Check Ring population
   - [ ] Valgrind (no leaks)
5. ☐ **Benchmark validation** (2 hours)
   - [ ] Run suite with batch=1 (baseline)
   - [ ] Run suite with batch=2,4
   - [ ] Analyze throughput delta
   - [ ] **Target**: +10-15% Mid 1T

**Total Estimate**: 6-7 hours

---

### Phase 6.26: Lock-Free Refill (Target: Week 2)

6.
☐ **Replace mutex with atomic freelist** (2 hours)
   - [ ] Change `PoolBlock* freelist[]` → `atomic_uintptr_t freelist_head[]`
   - [ ] Add `atomic_uint freelist_count[]`
   - [ ] Remove `PaddedMutex freelist_locks[]`
7. ☐ **Implement lock-free primitives** (2 hours)
   - [ ] Write `freelist_pop_lockfree()`
   - [ ] Write `freelist_push_lockfree()`
   - [ ] Write `freelist_push_batch_lockfree()`
8. ☐ **Rewrite drain functions** (1 hour)
   - [ ] `drain_remote_lockfree()` (no mutex)
   - [ ] Count blocks in remote stack
   - [ ] Batch push to freelist
9. ☐ **Integrate into alloc/free paths** (1 hour)
   - [ ] Replace lock/pop/unlock with `freelist_pop_lockfree()`
   - [ ] Update refill to use batch push
   - [ ] Update free to use lock-free push
10. ☐ **Testing (critical for lock-free)** (3 hours)
    - [ ] Single-thread correctness test
    - [ ] Multi-thread stress test (16T, 60 sec)
    - [ ] TSan (ThreadSanitizer) run
    - [ ] Check counter invariants (no lost blocks)
11. ☐ **Benchmark validation** (2 hours)
    - [ ] Run suite with lock-free (4T focus)
    - [ ] Compare to Phase 6.25 baseline
    - [ ] Measure CAS retry rate
    - [ ] **Target**: +15-20% Mid 4T

**Total Estimate**: 11-12 hours

---

### Phase 6.27: Learner Integration (Target: Week 2, parallel)

12. ☐ **Add ACE waste tracking** (1 hour)
    - [ ] Implement `hak_ace_get_total_waste()` in `hakmem_ace.c`
    - [ ] Track cumulative (rounded - original) per allocation
    - [ ] Atomic counter for thread safety
13. ☐ **Update learner score function** (30 min)
    - [ ] Add fragmentation penalty term
    - [ ] Weight: -0.5 × (waste_MB)
    - [ ] Test score computation
14. ☐ **Validation testing** (3 hours)
    - [ ] Baseline run (learner OFF)
    - [ ] CAP tuning run (learner ON, W_MAX fixed)
    - [ ] W_MAX learning run (Canary enabled)
    - [ ] Compare throughput, check convergence
15.
☐ **Documentation** (1 hour)
    - [ ] Update ENV_VARS.md with learner params
    - [ ] Document recommended settings
    - [ ] Add troubleshooting guide (oscillation, no effect)

**Total Estimate**: 5-6 hours

---

### Post-Implementation (Week 3)

16. ☐ **Comprehensive benchmarking** (4 hours)
    - [ ] Full suite (tiny, mid, large) with all phases enabled
    - [ ] Head-to-head vs mimalloc (1T, 4T, 8T)
    - [ ] Memory profiling (RSS, fragmentation)
    - [ ] Generate performance report
17. ☐ **Code review & cleanup** (2 hours)
    - [ ] Remove debug printfs
    - [ ] Add comments to complex sections
    - [ ] Update copyright/phase headers
    - [ ] Check for code duplication
18. ☐ **Documentation updates** (2 hours)
    - [ ] Update INDEX.md with new phases
    - [ ] Write PHASE_6.25_6.27_RESULTS.md
    - [ ] Update README.md benchmarks section

**Total Estimate**: 8 hours

---

## 📈 Success Metrics

### Primary Metrics

| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| **Mid 1T Throughput** | 4.0 M/s | 5.0-5.7 M/s | Larson benchmark, 10s |
| **Mid 4T Throughput** | 13.8 M/s | 18.0-20.0 M/s | Larson benchmark, 10s |
| **Mid 1T vs mimalloc** | 28% | 35-39% | Ratio of throughputs |
| **Mid 4T vs mimalloc** | 47% | 61-68% | Ratio of throughputs |

### Secondary Metrics

| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| Refill frequency (1T) | ~1000/sec | ~250-500/sec | Counter delta |
| Lock contention (4T) | ~40% wait | <10% wait | Trylock success rate |
| Hit rate (Mid Pool) | ~60% | 70-80% | hits / (hits + misses) |
| Memory footprint | 22 MB | <30 MB | RSS baseline |

### Regression Thresholds

| Scenario | Threshold | Action |
|----------|-----------|--------|
| Tiny Pool 4T | <2% regression | Acceptable |
| Large Pool | <5% regression | Acceptable |
| Memory bloat | >40 MB baseline | Reduce CAP or batch |
| Crash/hang in stress test | Any occurrence | Block release, debug |

---

## 🎬 Conclusion

This implementation plan provides a **systematic
path** to improve hakmem's Mid Pool performance from **47% to 61-68% of mimalloc** for multi-threaded workloads (4T), bringing it into the target range of 60-75%.

**Key Insights**:
1. **Phase 6.25 (Batching)**: Low risk, medium reward, tackles the 1T bottleneck
2. **Phase 6.26 (Lock-Free)**: Medium risk, high reward, critical for 4T scaling
3. **Phase 6.27 (Learner)**: Low risk, low-medium reward, adaptive optimization

**Recommendation**:
- Implement 6.25 and 6.27 in **parallel** (independent, ~12 hours total)
- Tackle 6.26 **after** 6.25 is validated (builds on batch refill, ~12 hours)
- **Total time**: ~24-30 hours (3-4 days of focused work)

**Next Steps**:
1. Review this plan with the team
2. Set up a benchmarking pipeline (automated, reproducible)
3. Implement Phase 6.25 (highest priority)
4. Measure, iterate, document

**Open Questions**:
- Should we extend TLS slots from 2 to 4? (Test in 6.25)
- Is W_MAX learning worth the risk? (Test in 6.27 with Canary)
- After 6.27, pursue header elimination (Phase 6.28) or accept the 1T gap?

---

**Document Version**: 1.0
**Last Updated**: 2025-10-24
**Author**: Claude (Sonnet 4.5)
**Status**: Ready for Implementation