
# Phase 6.25-6.27: Implementation Plan - Catching Up with mimalloc
**Date**: 2025-10-24
**Status**: 📋 Planning
**Target**: Reach 60-75% of mimalloc performance for Mid Pool
---
## 📊 Current Baseline (Phase 6.21 Results)
### Performance vs mimalloc
| Workload | Threads | hakmem | mimalloc | Ratio | Gap |
|----------|---------|--------|----------|-------|-----|
| **Mid** | 1T | 4.0 M/s | 14.6 M/s | **28%** | -72% |
| **Mid** | 4T | 13.8 M/s | 29.5 M/s | **47%** | -53% |
| Tiny | 1T | 19.4 M/s | 32.6 M/s | 59% | -41% |
| Tiny | 4T | 48.0 M/s | 65.7 M/s | 73% | -27% |
| Large | 1T | 0.6 M/s | 2.1 M/s | 29% | -71% |
**Key Insights**:
- ✅ **Phase 6.25 Quick Wins achieved +37.8% for Mid 4T** (10.0 → 13.8 M/s)
- ❌ Mid Pool still significantly behind mimalloc (28% 1T, 47% 4T)
- 🎯 Target: 60-75% of mimalloc = **8.8-11.0 M/s (1T), 17.7-22.1 M/s (4T)**
### Current Mid Pool Architecture
```
┌─────────────────────────────────────────────────────────┐
│ TLS Fast Path (Lock-Free) │
├─────────────────────────────────────────────────────────┤
│ 1. TLS Ring Buffer (RING_CAP=32) │
│ - LIFO cache for recently freed blocks │
│ - Per-class, per-thread │
│ - Phase 6.25: 16→32 increased hit rate │
│ │
│ 2. TLS Active Pages (x2: page_a, page_b) │
│ - Bump-run allocation (no per-block links) │
│ - Owner-thread private (lock-free) │
│ - 64KB pages, split on-demand │
├─────────────────────────────────────────────────────────┤
│ Shared State (Lock-Based) │
├─────────────────────────────────────────────────────────┤
│ 3. Per-class Freelist (64 shards) │
│ - Mutex-protected per (class, shard) │
│ - Site-based sharding (reduce contention) │
│ - Refill on demand via refill_freelist() │
│ │
│ 4. Remote Stack (MPSC, lock-free push) │
│ - Cross-thread free target │
│ - Drained into freelist under lock │
│ │
│ 5. Transfer Cache (TC, Phase 6.20) │
│ - Per-thread inbox (atomic CAS) │
│ - Owner-aware routing │
│ - Drain trigger: ring->top < 2 │
└─────────────────────────────────────────────────────────┘
Refill Flow (Current):
Ring empty → Check Active Pages → Lock Shard → Pop freelist
→ Drain remote → Shard steal (if CAP reached) → **refill_freelist()**
Refill Implementation:
- Allocates **1 page** (64KB) via mmap
- Splits into blocks, links into freelist
- ACE bundle factor: 1-4 pages (adaptive)
```
### Bottlenecks Identified
**From Phase 6.20 Analysis**:
1. **Refill Latency** (Primary)
- Single-page refill: 1 mmap syscall per refill
- Freelist rebuilding overhead (linking blocks)
- Mutex hold time during refill (~100-150 cycles)
- **Impact**: ~40% of alloc time in Mid 1T
2. **Lock Contention** (Secondary)
- 64 shards × 7 classes = 448 mutexes
- Even with sharding, 4T shows contention
- Trylock success rate: ~60-70% (Phase 6.25 data)
- **Impact**: ~25% of alloc time in Mid 4T
3. **CAP/W_MAX Sub-optimal** (Tertiary)
- Static configuration (no runtime adaptation)
- W_MAX=1.60 (Mid), 1.30 (Large) → some fallback to L1
- CAP={64,64,64,32,16} → conservative, low hit rate
- **Impact**: ~10-15% missed pool opportunities
---
## 🎯 Phase 6.25 Core: Refill Batching
### Goal
**Reduce refill latency by allocating multiple pages at once**
**Target**: Mid 1T: +10-15% (4.0 → 4.5-5.0 M/s)
### Problem Statement
Current `refill_freelist()` allocates **1 page per call**:
- 1 mmap syscall (~200-300 cycles)
- 1 page split + freelist rebuild (~100-150 cycles)
- Held under mutex lock (blocks other threads)
- Amortized cost per block: **HIGH** for small classes (e.g., 2KB = 32 blocks/page)
**Opportunity**: Allocate **2-4 pages in batch** to amortize costs:
- mmap overhead: 300 cycles → 75-150 cycles/page (batched)
- Freelist rebuild: done in parallel or optimized
- Fill multiple TLS page slots + Ring buffer aggressively
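The per-page amortization claimed above only holds if the syscall cost is paid once per batch (i.e., the single-large-mmap strategy); a back-of-envelope sketch using the cycle estimates from this section:
```c
// Amortization sanity check (sketch). Assumption not in the source:
// the batch is served by ONE mmap, so the syscall cost is paid once.
#include <stdio.h>

int main(void) {
    const double mmap_cycles  = 300.0;  // per syscall (upper estimate above)
    const double split_cycles = 150.0;  // per-page split + freelist setup
    const int blocks_per_page = 32;     // 64KB page / 2KB class
    for (int batch = 1; batch <= 4; batch *= 2) {
        double per_block = (mmap_cycles + batch * split_cycles)
                         / (batch * blocks_per_page);
        printf("batch=%d: ~%.1f cycles/block\n", batch, per_block);
    }
    return 0;
}
// batch=1: ~14.1, batch=2: ~9.4, batch=4: ~7.0 cycles/block
```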
### Implementation Approach
#### 1. Create `alloc_tls_page_batch()` Function
**Location**: `hakmem_pool.c` (after `alloc_tls_page()`, line ~486)
**Signature**:
```c
// Allocate multiple pages in batch and distribute to TLS structures
// Returns: number of pages successfully allocated (0-batch_size)
static int alloc_tls_page_batch(int class_idx, int batch_size,
                                PoolTLSPage* slots[], int num_slots,
                                PoolTLSRing* ring, PoolTLSBin* bin);
```
**Pseudocode**:
```c
static int alloc_tls_page_batch(int class_idx, int batch_size,
                                PoolTLSPage* slots[], int num_slots,
                                PoolTLSRing* ring, PoolTLSBin* bin) {
    size_t user_size = g_class_sizes[class_idx];
    size_t block_size = HEADER_SIZE + user_size;
    int blocks_per_page = POOL_PAGE_SIZE / block_size;
    if (blocks_per_page <= 0) return 0;
    int allocated = 0;
    // Allocate pages in batch (strategy: multiple mmaps or single large mmap)
    // Option A: Multiple mmaps (simpler, compatible with existing infra)
    for (int i = 0; i < batch_size; i++) {
        void* page = mmap(NULL, POOL_PAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) break;  // mmap returns MAP_FAILED, not NULL
        // Prefault (Phase 6.25 quick win)
        for (size_t j = 0; j < POOL_PAGE_SIZE; j += 4096) {
            ((volatile char*)page)[j] = 0;
        }
        // Strategy: Fill TLS slots first, then fill Ring/LIFO
        if (allocated < num_slots && slots[allocated]) {
            // Assign to TLS active page slot (bump-run init)
            PoolTLSPage* ap = slots[allocated];
            ap->page = page;
            ap->bump = (char*)page;
            ap->end = (char*)page + POOL_PAGE_SIZE;
            ap->count = blocks_per_page;
            // Register page descriptor
            mid_desc_register(page, class_idx, (uint64_t)(uintptr_t)pthread_self());
        } else {
            // Fill Ring + LIFO from this page
            char* bump = (char*)page;
            char* end = (char*)page + POOL_PAGE_SIZE;
            for (int k = 0; k < blocks_per_page; k++) {
                PoolBlock* b = (PoolBlock*)(void*)bump;
                // Try Ring first, then LIFO
                if (ring && ring->top < POOL_TLS_RING_CAP) {
                    ring->items[ring->top++] = b;
                } else if (bin) {
                    b->next = bin->lo_head;
                    bin->lo_head = b;
                    bin->lo_count++;
                }
                bump += block_size;
                if (bump >= end) break;
            }
            mid_desc_register(page, class_idx, (uint64_t)(uintptr_t)pthread_self());
        }
        allocated++;
        g_pool.total_pages_allocated++;
        g_pool.pages_by_class[class_idx]++;
        g_pool.total_bytes_allocated += POOL_PAGE_SIZE;
    }
    if (allocated > 0) {
        g_pool.refills[class_idx]++;
    }
    return allocated;
}
```
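For comparison, a minimal sketch of the single-large-mmap variant mentioned in the comment above (call it Option B). This is an assumption, not existing hakmem code; `alloc_page_run` is a hypothetical helper, and `MAP_POPULATE` (Linux-specific) stands in for the manual prefault loop:
```c
#include <stddef.h>
#include <sys/mman.h>

// Option B sketch: one mmap for the whole batch, carved into pages.
// One syscall per batch instead of one per page.
static void* alloc_page_run(int batch_size) {
    size_t len = (size_t)batch_size * POOL_PAGE_SIZE;
    // MAP_POPULATE prefaults the whole run up front (Linux-specific);
    // portable code would keep the per-4KB touch loop instead.
    void* run = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return (run == MAP_FAILED) ? NULL : run;
}
// Page i is then run + (size_t)i * POOL_PAGE_SIZE.
```
The trade-off: pages from one run can only be returned to the OS together (one `munmap` of the whole run), which complicates page-level recycling; that is likely why Option A is called out as more compatible with the existing infrastructure.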
#### 2. Modify Refill Call Sites
**Location**: `hakmem_pool.c:931` (inside `hak_pool_try_alloc`, refill path)
**Before**:
```c
if (alloc_tls_page(class_idx, tap)) {
    // ... use newly allocated page
}
```
**After**:
```c
// Determine batch size from env var (default 2-4)
int batch = g_pool_refill_batch_size;  // new global config
if (batch < 1) batch = 1;
if (batch > 4) batch = 4;
// Prepare slot array (up to 2 TLS slots)
PoolTLSPage* slots[2] = {NULL, NULL};
int num_slots = 0;
if (g_tls_active_page_a[class_idx].page == NULL || g_tls_active_page_a[class_idx].count == 0) {
    slots[num_slots++] = &g_tls_active_page_a[class_idx];
}
if (g_tls_active_page_b[class_idx].page == NULL || g_tls_active_page_b[class_idx].count == 0) {
    slots[num_slots++] = &g_tls_active_page_b[class_idx];
}
// Call batch allocator
int allocated = alloc_tls_page_batch(class_idx, batch, slots, num_slots,
                                     &g_tls_bin[class_idx].ring,
                                     &g_tls_bin[class_idx]);
if (allocated > 0) {
    pthread_mutex_unlock(lock);
    // Use ring or active page as usual
    // ...
}
```
#### 3. Add Environment Variable
**Global Config** (add to `hakmem_pool.c` globals, ~line 316):
```c
static int g_pool_refill_batch_size = 2; // env: HAKMEM_POOL_REFILL_BATCH (1-4)
```
**Init** (add to `hak_pool_init()`, ~line 716):
```c
const char* e_batch = getenv("HAKMEM_POOL_REFILL_BATCH");
if (e_batch) {
    int v = atoi(e_batch);
    if (v >= 1 && v <= 4) g_pool_refill_batch_size = v;
}
```
#### 4. Extend TLS Active Page Slots (Optional)
**Current**: 2 slots (page_a, page_b)
**Proposal**: Add page_c, page_d for batch_size=4 (if beneficial)
**Trade-off**:
- ✅ Pro: More TLS-local inventory, fewer shared accesses
- ❌ Con: Increased TLS memory footprint (~256 bytes/class)
**Recommendation**: Start with 2 slots, measure, then extend if needed.
---
### File Changes Required
| File | Function | Change Type | Est. LOC |
|------|----------|-------------|----------|
| `hakmem_pool.c` | `alloc_tls_page_batch()` | **New function** | +80 |
| `hakmem_pool.c` | `hak_pool_try_alloc()` | Modify refill path | +30 |
| `hakmem_pool.c` | Globals | Add `g_pool_refill_batch_size` | +1 |
| `hakmem_pool.c` | `hak_pool_init()` | Parse env var | +5 |
| `hakmem_pool.h` | (none) | No public API change | 0 |
| **Total** | | | **~116 LOC** |
---
### Testing Strategy
#### Unit Test
```bash
# Test batch allocation works
HAKMEM_POOL_REFILL_BATCH=4 ./test_pool_refill
# Verify TLS slots filled correctly
# Check Ring buffer populated
# Check no memory leaks
```
#### Benchmark Test
```bash
# Baseline (batch=1, current behavior)
HAKMEM_POOL_REFILL_BATCH=1 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh
# Batch=2 (conservative)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh
# Batch=4 (aggressive)
HAKMEM_POOL_REFILL_BATCH=4 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh
# Expected: +10-15% on Mid 1T (4.0 → 4.5-5.0 M/s)
```
#### Failure Modes to Watch
1. **Memory bloat**: Batch too large → excessive pre-allocation
- **Monitor**: RSS growth, pages_allocated counter
- **Mitigation**: Cap batch_size at 4, respect CAP limits
2. **Ring overflow**: Batch fills Ring, blocks get lost
- **Monitor**: Ring underflow counter (should decrease)
- **Mitigation**: Properly route overflow to LIFO
3. **TLS slot contention**: Multiple threads allocating same class
- **Monitor**: Active page descriptor conflicts
- **Mitigation**: Per-thread ownership (already enforced)
---
### Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Memory bloat (over-allocation) | Medium | High | Cap at batch=4, respect CAP limits |
| Complexity (harder to debug) | Low | Medium | Extensive logging, unit tests |
| Backward compat (existing workloads) | Low | Low | Default batch=2 (conservative) |
| Regression (slower than 1-page) | Low | Medium | A/B test, fallback to batch=1 |
**Rollback Plan**: Set `HAKMEM_POOL_REFILL_BATCH=1` to restore original behavior (zero code change).
---
### Estimated Time
- **Implementation**: 3-4 hours
- Core function: 2 hours
- Integration: 1 hour
- Testing: 1 hour
- **Benchmarking**: 2 hours
- Run suite 3x (batch=1,2,4)
- Analyze results
- **Total**: **5-6 hours**
---
## 🔓 Phase 6.26: Lock-Free Refill
### Goal
**Eliminate lock contention on freelist access**
**Target**: Mid 4T: +15-20% (13.8 → 16-18 M/s)
### Problem Statement
Current freelist uses **per-shard mutexes** (`pthread_mutex_t`):
- 64 shards × 7 classes = **448 mutexes**
- Contention on hot shards (4T workload)
- Trylock success rate: ~60-70% (Phase 6.25 data)
- Each lock/unlock: ~20-40 cycles overhead
**Opportunity**: Replace mutex with **lock-free stack** (CAS-based):
- Atomic compare-and-swap: ~10-15 cycles
- No blocking (always forward progress)
- Better scalability under contention
### Implementation Approach
#### 1. Replace Freelist Mutex with Atomic Head
**Current Structure** (`hakmem_pool.c:276-280`):
```c
static struct {
    PoolBlock* freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // ...
} g_pool;
```
**New Structure**:
```c
static struct {
    // Lock-free freelist head (atomic pointer)
    atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // Lock-free counter (for non-empty bitmap update)
    atomic_uint freelist_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // Keep nonempty_mask (atomic already)
    atomic_uint_fast64_t nonempty_mask[POOL_NUM_CLASSES];
    // Remote stack (already lock-free)
    atomic_uintptr_t remote_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    atomic_uint remote_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // ... (rest unchanged)
} g_pool;
```
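The primitives below rely on `set_nonempty_bit()` / `clear_nonempty_bit()`, which the plan does not spell out; a minimal sketch, assuming `nonempty_mask[class]` keeps one bit per shard (POOL_NUM_SHARDS ≤ 64):
```c
#include <stdatomic.h>

static inline void set_nonempty_bit(int class_idx, int shard_idx) {
    atomic_fetch_or_explicit(&g_pool.nonempty_mask[class_idx],
                             (uint_fast64_t)1 << shard_idx,
                             memory_order_release);
}

static inline void clear_nonempty_bit(int class_idx, int shard_idx) {
    // Benign race: a concurrent push may re-set the bit right after this;
    // consumers must treat the mask as a hint, not ground truth.
    atomic_fetch_and_explicit(&g_pool.nonempty_mask[class_idx],
                              ~((uint_fast64_t)1 << shard_idx),
                              memory_order_release);
}
```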
#### 2. Implement Lock-Free Push/Pop
**Lock-Free Pop** (replace mutex-based pop):
```c
// Pop block from lock-free freelist
// Returns: block pointer, or NULL if empty
static inline PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    uintptr_t old_head;
    PoolBlock* block;
    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                        memory_order_acquire);
        if (!old_head) {
            return NULL;  // Empty
        }
        block = (PoolBlock*)old_head;
        // Try CAS: freelist_head = block->next
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)block->next,
                 memory_order_release, memory_order_acquire));
    // Update count
    unsigned old_count = atomic_fetch_sub_explicit(
        &g_pool.freelist_count[class_idx][shard_idx], 1, memory_order_relaxed);
    // Clear nonempty bit if now empty
    if (old_count <= 1) {
        clear_nonempty_bit(class_idx, shard_idx);
    }
    return block;
}
```
**Lock-Free Push** (for refill path):
```c
// Push block onto lock-free freelist
static inline void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                        memory_order_acquire);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_acquire));
    // Update count and nonempty bit
    atomic_fetch_add_explicit(&g_pool.freelist_count[class_idx][shard_idx], 1,
                              memory_order_relaxed);
    set_nonempty_bit(class_idx, shard_idx);
}
```
**Lock-Free Batch Push** (for refill, optimization):
```c
// Push multiple blocks atomically (amortize CAS overhead)
static inline void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                                PoolBlock* head, PoolBlock* tail,
                                                int count) {
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                        memory_order_acquire);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)head,
                 memory_order_release, memory_order_acquire));
    atomic_fetch_add_explicit(&g_pool.freelist_count[class_idx][shard_idx], count,
                              memory_order_relaxed);
    set_nonempty_bit(class_idx, shard_idx);
}
```
#### 3. Refill Path Integration
**Modify `refill_freelist()`** (now lock-free):
```c
static int refill_freelist(int class_idx, int shard_idx) {
    // ... (allocate page, split into blocks)
    // OLD: lock → push to freelist → unlock
    //   pthread_mutex_lock(lock);
    //   block->next = g_pool.freelist[class_idx][shard_idx];
    //   g_pool.freelist[class_idx][shard_idx] = freelist_head;
    //   pthread_mutex_unlock(lock);
    // NEW: lock-free batch push
    PoolBlock* tail = freelist_head;
    int count = blocks_per_page;
    while (tail->next) {
        tail = tail->next;
    }
    freelist_push_batch_lockfree(class_idx, shard_idx, freelist_head, tail, count);
    return 1;
}
```
#### 4. Remote Stack Drain (Lock-Free)
**Current**: `drain_remote_locked()` called under mutex
**New**: Drain into local list, then batch-push lock-free
```c
// Drain remote stack into freelist (lock-free)
static inline void drain_remote_lockfree(int class_idx, int shard_idx) {
    // Atomically swap remote head to NULL (unchanged)
    uintptr_t head = atomic_exchange_explicit(&g_pool.remote_head[class_idx][shard_idx],
                                              (uintptr_t)0, memory_order_acq_rel);
    if (!head) return;
    // Count blocks
    int count = 0;
    PoolBlock* tail = (PoolBlock*)head;
    while (tail->next) {
        tail = tail->next;
        count++;
    }
    count++;  // Include head
    // Batch push to freelist (lock-free)
    freelist_push_batch_lockfree(class_idx, shard_idx, (PoolBlock*)head, tail, count);
    // Update remote count
    atomic_fetch_sub_explicit(&g_pool.remote_count[class_idx][shard_idx], count,
                              memory_order_relaxed);
}
```
#### 5. Fallback Strategy (Optional)
For **rare contention** cases (e.g., CAS spin > 100 iterations):
- Option A: Keep spinning (acceptable for short lists)
- Option B: Fallback to mutex (hybrid approach)
- Option C: Backoff + retry (exponential backoff)
**Recommendation**: Start with Option A (pure lock-free), measure, add backoff if needed.
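If Option C is ever needed, a minimal sketch of the pop loop with bounded exponential backoff; `cpu_relax()` is a hypothetical wrapper around the CPU pause hint (e.g. `_mm_pause()` on x86), and the count/nonempty updates from `freelist_pop_lockfree()` are omitted for brevity:
```c
// Option C sketch: pause between failed CAS attempts, doubling the
// wait window up to a cap to de-synchronize contending threads.
static inline PoolBlock* freelist_pop_backoff(int class_idx, int shard_idx) {
    int spins = 1;
    for (;;) {
        uintptr_t old_head = atomic_load_explicit(
            &g_pool.freelist_head[class_idx][shard_idx], memory_order_acquire);
        if (!old_head) return NULL;
        PoolBlock* block = (PoolBlock*)old_head;
        if (atomic_compare_exchange_weak_explicit(
                &g_pool.freelist_head[class_idx][shard_idx],
                &old_head, (uintptr_t)block->next,
                memory_order_acq_rel, memory_order_acquire))
            return block;  // count/nonempty updates omitted here
        for (int i = 0; i < spins; i++) cpu_relax();  // back off
        if (spins < 64) spins <<= 1;                  // cap the window
    }
}
```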
---
### File Changes Required
| File | Function | Change Type | Est. LOC |
|------|----------|-------------|----------|
| `hakmem_pool.c` | Globals | Replace mutexes with atomics | +10/-10 |
| `hakmem_pool.c` | `freelist_pop_lockfree()` | **New function** | +30 |
| `hakmem_pool.c` | `freelist_push_lockfree()` | **New function** | +20 |
| `hakmem_pool.c` | `freelist_push_batch_lockfree()` | **New function** | +25 |
| `hakmem_pool.c` | `drain_remote_lockfree()` | Rewrite (lock-free) | +25/-20 |
| `hakmem_pool.c` | `refill_freelist()` | Modify (use batch push) | +10/-15 |
| `hakmem_pool.c` | `hak_pool_try_alloc()` | Replace lock/unlock with pop | +5/-10 |
| `hakmem_pool.c` | `hak_pool_free()` | Lock-free path | +10/-10 |
| `hakmem_pool.c` | `hak_pool_init()` | Init atomics (not mutexes) | +5/-5 |
| **Total** | | | **~140 LOC (net ~100)** |
---
### Testing Strategy
#### Correctness Test
```bash
# Single-threaded (no contention, pure correctness)
THREADS=1 ./test_pool_lockfree
# Multi-threaded stress test (high contention)
THREADS=16 DURATION=60 ./test_pool_lockfree_stress
# Check for:
# - No memory leaks (valgrind)
# - No double-free (AddressSanitizer)
# - No lost blocks (counter invariants)
```
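A minimal counter-invariant sketch along these lines, assuming the lock-free primitives above and a hypothetical harness: `T` threads each push `K` pre-allocated `PoolBlock`s into one shard, then pop until the global quota is met. Lost or duplicated blocks show up as a hang or a count that diverges from `T * K`:
```c
#include <pthread.h>
#include <stdatomic.h>

#define T 8
#define K 100000

static atomic_int g_popped;

static void* worker(void* arg) {
    PoolBlock* blocks = (PoolBlock*)arg;  // this thread's K preallocated nodes
    for (int i = 0; i < K; i++)
        freelist_push_lockfree(0, 0, &blocks[i]);
    // Pop until all T*K blocks are accounted for; NULL just means "retry".
    while (atomic_load_explicit(&g_popped, memory_order_relaxed) < T * K) {
        if (freelist_pop_lockfree(0, 0))
            atomic_fetch_add_explicit(&g_popped, 1, memory_order_relaxed);
    }
    return NULL;
}
```
`main()` would allocate `T` arrays of `K` blocks, start the threads, join them, and assert `g_popped == T * K`; running it under TSan covers the memory-ordering checks listed above.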
#### Performance Test
```bash
# Baseline (Phase 6.25, with batching)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh
# Lock-free (Phase 6.26)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh
# Expected: +15-20% on Mid 4T (13.8 → 16-18 M/s)
```
#### Contention Analysis
```bash
# Measure CAS retry rate
# Add instrumentation:
# atomic_uint_fast64_t cas_retries;
# atomic_uint_fast64_t cas_attempts;
# Print ratio at shutdown
# Target: <5% retry rate under 4T load
```
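A minimal instrumentation sketch matching the plan above (relaxed counters, so the overhead on the CAS path stays negligible):
```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static atomic_uint_fast64_t cas_attempts;
static atomic_uint_fast64_t cas_retries;

// Call once per CAS attempt; 'retried' is nonzero for every failed CAS.
static inline void cas_stat(int retried) {
    atomic_fetch_add_explicit(&cas_attempts, 1, memory_order_relaxed);
    if (retried)
        atomic_fetch_add_explicit(&cas_retries, 1, memory_order_relaxed);
}

static void cas_stat_report(void) {  // e.g., registered via atexit()
    uint64_t a = atomic_load(&cas_attempts), r = atomic_load(&cas_retries);
    fprintf(stderr, "[pool] CAS retry rate: %.2f%% (%llu/%llu)\n",
            a ? 100.0 * r / a : 0.0,
            (unsigned long long)r, (unsigned long long)a);
}
```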
---
### Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| ABA problem (block reuse) | Low | Critical | Use epoch-based reclamation or hazard pointers |
| CAS livelock (high contention) | Medium | High | Add exponential backoff after N retries |
| Memory ordering bugs (subtle races) | Medium | Critical | Extensive testing, TSan, formal verification |
| Performance regression (1T) | Low | Low | Single-thread has no contention, minimal overhead |
**ABA Problem**:
- **Scenario**: Block A is popped, freed, reallocated, and pushed back while another thread's CAS is in-flight; the stale CAS then succeeds and installs A's old `next` pointer
- **Assessment**: Dereferencing stays safe (pool pages are not unmapped mid-run), but a stale `next` can drop or resurrect blocks, so this is not fully benign; watch the counter invariants in stress tests
- **Alternative**: Add version counter (128-bit CAS) if issues arise (sketch below)
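If the version-counter route is taken, a minimal sketch of a 16-byte tagged head; this is an assumption, not current hakmem code, and on x86-64 it needs `-mcx16` (and possibly libatomic) for a lock-free 16-byte CAS:
```c
#include <stdatomic.h>
#include <stdint.h>

// Tagged head: pointer + monotonically increasing tag defeats ABA,
// because a recycled pointer never matches with the old tag.
typedef struct {
    PoolBlock* ptr;
    uintptr_t  tag;   // incremented on every successful pop
} TaggedHead;         // 16 bytes; 16-byte alignment for cmpxchg16b

static _Alignas(16) _Atomic TaggedHead g_head;  // per (class, shard) in practice

static PoolBlock* tagged_pop(void) {
    TaggedHead old = atomic_load_explicit(&g_head, memory_order_acquire);
    TaggedHead nh;
    do {
        if (!old.ptr) return NULL;
        nh.ptr = old.ptr->next;   // safe: pool pages stay mapped
        nh.tag = old.tag + 1;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_head, &old, nh,
                 memory_order_acq_rel, memory_order_acquire));
    return old.ptr;
}
```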
**Rollback Plan**: Keep mutexes in code (ifdef'd out), revert via compile flag if needed.
---
### Estimated Time
- **Implementation**: 5-6 hours
- Lock-free primitives: 2 hours
- Integration: 2 hours
- Testing: 2 hours
- **Debugging**: 2-3 hours (race conditions, TSan)
- **Benchmarking**: 2 hours
- **Total**: **9-11 hours**
---
## 🧠 Phase 6.27: Learner Integration
### Goal
**Dynamic optimization of CAP and W_MAX based on runtime behavior**
**Target**: +5-10% across all workloads via adaptive tuning
### Problem Statement
Current policy is **static** (set at init):
- `CAP = {64,64,64,32,16,32,32}` (conservative)
- `W_MAX_MID = 1.60`, `W_MAX_LARGE = 1.30`
- No adaptation to workload characteristics
**Opportunity**: Use **existing learner infrastructure** to:
1. Collect size distribution stats
2. Adjust `mid_cap[]` dynamically based on hit rate
3. Adjust `w_max_mid` based on fragmentation vs hit rate trade-off
**Learner Already Exists**: `hakmem_learner.c` (~585 LOC)
- Background thread (1 sec polling)
- Hit rate monitoring
- UCB1 for W_MAX exploration (Canary deployment)
- Budget enforcement + Water-filling
**Integration Work**: Minimal (learner already supports Mid Pool tuning)
---
### Implementation Approach
#### 1. Enable Learner for Mid Pool
**Already Implemented** (`hakmem_learner.c:239-272`):
```c
// Adjust Mid caps by hit rate vs target (delta over window) with dwell
int mid_classes = 5;
if (cur->mid_dyn1_bytes != 0 && cur->mid_dyn2_bytes != 0) mid_classes = 7;
// ...
for (int i = 0; i < mid_classes; i++) {
    uint64_t dh = mid_hits[i] - prev_mid_hits[i];
    uint64_t dm = mid_misses[i] - prev_mid_misses[i];
    // ...
    if (hit < (tgt_mid - eps)) {
        cap += step_mid;   // Increase CAP
    } else if (hit > (tgt_mid + eps)) {
        cap -= step_mid;   // Decrease CAP
    }
    // ...
}
```
**Action**: Just enable via env var!
```bash
HAKMEM_LEARN=1 \
HAKMEM_TARGET_HIT_MID=0.65 \
HAKMEM_CAP_STEP_MID=8 \
HAKMEM_CAP_MIN_MID=16 \
HAKMEM_CAP_MAX_MID=512 \
./your_app
```
#### 2. W_MAX Learning (Optional, Risky)
**Already Implemented** (`hakmem_learner.c:388-499`):
- UCB1 multi-armed bandit
- Canary deployment (safe exploration)
- Rollback if performance regresses
**Candidates** (for Mid Pool):
```
W_MAX_MID candidates: [1.40, 1.50, 1.60, 1.70]
Default: 1.60 (current)
Exploration: Try 1.50 (tighter, less waste) or 1.70 (looser, higher hit)
```
**Enable**:
```bash
HAKMEM_LEARN=1 \
HAKMEM_WMAX_LEARN=1 \
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7 \
HAKMEM_WMAX_CANARY=1 \
./your_app
```
**Recommendation**: Start with CAP tuning only, add W_MAX later (more risk).
#### 3. Size Distribution Integration (Already Exists)
**Histogram** (`hakmem_size_hist.c`):
- 1KB granularity bins (0-64KB tracked)
- Per-allocation sampling
- Reset after learner snapshot
**DYN1 Auto-Assignment** (already implemented):
```bash
HAKMEM_LEARN=1 \
HAKMEM_DYN1_AUTO=1 \
HAKMEM_CAP_MID_DYN1=64 \
./your_app
```
**Effect**: Automatically finds peak size in 2-32KB range, assigns DYN1 class.
#### 4. New: ACE Stats Integration
**Current ACE** (`hakmem_ace.c`):
- Records size decisions (original_size → rounded_size → pool)
- Tracks L1 fallback rate (miss → malloc)
- Not integrated with learner
**Proposal**: Add ACE stats to learner score function
**Modify Learner Score** (`hakmem_learner.c:414`):
```c
// OLD: simple hit-based score
double score = (double)(ace.mid_hit + ace.large_hit)
             - (double)(ace.mid_miss + ace.large_miss)
             - 2.0 * (double)ace.l1_fallback;

// NEW: add fragmentation penalty
extern uint64_t hak_ace_get_total_waste(void);  // sum of (rounded - original)
uint64_t waste = hak_ace_get_total_waste();
double frag_penalty = (double)waste / 1e6;      // normalize to MB
double score = (double)(ace.mid_hit + ace.large_hit)
             - (double)(ace.mid_miss + ace.large_miss)
             - 2.0 * (double)ace.l1_fallback
             - 0.5 * frag_penalty;               // penalize waste
```
**Benefit**: Balance hit rate vs fragmentation (W_MAX tuning).
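A minimal sketch of the proposed `hak_ace_get_total_waste()` counter; the record-hook name is hypothetical and would be wired into the existing size-decision path in `hakmem_ace.c`:
```c
// hakmem_ace.c (sketch): cumulative rounding waste, thread-safe.
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_fast64_t g_ace_total_waste;  // bytes of (rounded - original)

// Call from the existing size-decision record path (hook name hypothetical).
static inline void ace_record_waste(size_t original, size_t rounded) {
    if (rounded > original)
        atomic_fetch_add_explicit(&g_ace_total_waste,
                                  (uint_fast64_t)(rounded - original),
                                  memory_order_relaxed);
}

uint64_t hak_ace_get_total_waste(void) {
    return (uint64_t)atomic_load_explicit(&g_ace_total_waste,
                                          memory_order_relaxed);
}
```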
---
### File Changes Required
| File | Function | Change Type | Est. LOC |
|------|----------|-------------|----------|
| `hakmem_learner.c` | Learner (already exists) | Enable via env | 0 |
| `hakmem_ace.c` | `hak_ace_get_total_waste()` | **New function** | +15 |
| `hakmem_learner.c` | `learner_main()` | Add frag penalty to score | +10 |
| `hakmem_policy.c` | (none) | Learner publishes dynamically | 0 |
| **Total** | | | **~25 LOC** |
---
### Testing Strategy
#### Baseline Test (Learner Off)
```bash
# Static policy (current)
RUNTIME=60 THREADS=1,4 ./scripts/run_bench_suite.sh
# Record: Mid 1T, Mid 4T throughput
```
#### Learner Test (CAP Tuning)
```bash
# Enable learner with aggressive targets
HAKMEM_LEARN=1 \
HAKMEM_TARGET_HIT_MID=0.75 \
HAKMEM_CAP_STEP_MID=8 \
HAKMEM_CAP_MAX_MID=512 \
HAKMEM_LEARN_WINDOW_MS=2000 \
RUNTIME=60 THREADS=1,4 ./scripts/run_bench_suite.sh
# Expected: CAP increases to ~128-256 (hit 75% target)
# Expected: +5-10% throughput improvement
```
#### W_MAX Learning Test (Optional)
```bash
HAKMEM_LEARN=1 \
HAKMEM_WMAX_LEARN=1 \
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7 \
HAKMEM_WMAX_CANARY=1 \
HAKMEM_WMAX_TRIAL_SEC=5 \
RUNTIME=120 THREADS=1,4 ./scripts/run_bench_suite.sh
# Monitor stderr for learner logs:
# "[Learner] W_MAX mid canary start: 1.50"
# "[Learner] W_MAX mid canary adopt" (success)
# or
# "[Learner] W_MAX mid canary revert to 1.60" (failure)
```
#### Regression Test
```bash
# Check learner doesn't hurt stable workloads
# Run with learning OFF, then ON, compare variance
# Target: <5% variance, no regressions
```
---
### Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Over-tuning (oscillation) | Medium | Medium | Increase dwell time (3→5 sec) |
| Under-tuning (no effect) | Medium | Low | Lower target hit rate (0.75→0.65) |
| W_MAX instability (fragmentation spike) | Medium | High | Use Canary, revert on regression |
| Low-traffic workload (insufficient samples) | High | Low | Set min_samples=256, skip learning if below |
**Rollback Plan**: Set `HAKMEM_LEARN=0` (default, no learner overhead).
---
### Estimated Time
- **Implementation**: 1-2 hours
- ACE waste tracking: 1 hour
- Learner score update: 30 min
- Testing: 30 min
- **Validation**: 3-4 hours
- Run suite with/without learner
- Analyze CAP convergence
- W_MAX exploration (if enabled)
- **Total**: **4-6 hours**
---
## 📊 Expected Performance Improvements
### Cumulative Gains (Stacked)
| Phase | Change | Mid 1T | Mid 4T | Rationale |
|-------|--------|--------|--------|-----------|
| **Baseline (6.21)** | Current | 4.0 M/s (28%) | 13.8 M/s (47%) | Post-quick-wins |
| **6.25 (Batch)** | Refill 2-4 pages | +10-15% | +5-8% | Amortize syscall, 1T bottleneck |
| | | **4.5-5.0 M/s** | **14.5-15.2 M/s** | |
| **6.26 (Lock-Free)** | CAS freelist | +2-5% | +15-20% | Eliminate 4T contention |
| | | **4.6-5.2 M/s** | **17.0-18.2 M/s** | |
| **6.27 (Learner)** | Dynamic CAP/W_MAX | +5-10% | +5-10% | Adaptive tuning |
| | | **5.0-5.7 M/s** | **18.0-20.0 M/s** | |
| **Target (60-75%)** | vs mimalloc 14.6M / 29.5M | **8.8-11.0 M/s** | **17.7-22.1 M/s** | |
| **Achieved?** | | ❌ **35-39%** | ✅ **61-68%** | 1T still short, 4T on target! |
### Gap Analysis
**1T Performance**:
- Current: 4.0 M/s (28% of mimalloc)
- Post-6.27: 5.0-5.7 M/s (35-39% of mimalloc)
- **Gap to 60%**: Still need **+3.1-3.8 M/s** (and **+5.3-6.0 M/s** to reach 75%)
**Remaining Bottlenecks (1T)**:
1. Single-threaded inherently lock-bound (no TLS benefit)
2. mimalloc's per-thread heaps eliminate ALL shared state
3. Bump allocation (mimalloc) vs freelist (hakmem)
4. Header overhead (32 bytes per alloc in hakmem)
**4T Performance**:
- Current: 13.8 M/s (47% of mimalloc)
- Post-6.27: 18.0-20.0 M/s (61-68% of mimalloc)
- **✅ Target achieved!** (60-75% range)
---
### Follow-Up Phases (Post-6.27)
**Phase 6.28: Header Elimination** (if 1T is still below target)
- Remove AllocHeader for Mid Pool (use page descriptors only)
- Saves 32 bytes per allocation (~8-15% memory)
- Saves header write on alloc hot path (~30-50 cycles)
- **Estimated Gain**: +15-20% (1T)
**Phase 6.29: Bump Allocation** (major refactor)
- Replace freelist with bump allocator (mimalloc-style)
- Per-thread arenas, no shared state at all
- **Estimated Gain**: +50-100% (1T), brings to mimalloc parity
- **Risk**: High complexity, long implementation (~2-3 weeks)
---
## 🗓️ Priority-Ordered Task List
### Phase 6.25: Refill Batching (Target: Week 1)
1. **Implement `alloc_tls_page_batch()` function** (2 hours)
- [ ] Write batch mmap loop
- [ ] Distribute pages to TLS slots
- [ ] Fill Ring/LIFO from overflow pages
- [ ] Add page descriptors registration
2. **Integrate batch refill into `hak_pool_try_alloc()`** (1 hour)
- [ ] Replace `alloc_tls_page()` call with batch version
- [ ] Prepare slot array logic
- [ ] Handle partial allocation (< batch_size)
3. **Add environment variable support** (30 min)
- [ ] Add `g_pool_refill_batch_size` global
- [ ] Parse `HAKMEM_POOL_REFILL_BATCH` in init
- [ ] Validate range (1-4)
4. **Unit testing** (1 hour)
- [ ] Test batch=1,2,4 correctness
- [ ] Verify TLS slots filled
- [ ] Check Ring population
- [ ] Valgrind (no leaks)
5. **Benchmark validation** (2 hours)
- [ ] Run suite with batch=1 (baseline)
- [ ] Run suite with batch=2,4
- [ ] Analyze throughput delta
- [ ] **Target**: +10-15% Mid 1T
**Total Estimate**: 6-7 hours
---
### Phase 6.26: Lock-Free Refill (Target: Week 2)
6. **Replace mutex with atomic freelist** (2 hours)
- [ ] Change `PoolBlock* freelist[]` → `atomic_uintptr_t freelist_head[]`
- [ ] Add `atomic_uint freelist_count[]`
- [ ] Remove `PaddedMutex freelist_locks[]`
7. **Implement lock-free primitives** (2 hours)
- [ ] Write `freelist_pop_lockfree()`
- [ ] Write `freelist_push_lockfree()`
- [ ] Write `freelist_push_batch_lockfree()`
8. **Rewrite drain functions** (1 hour)
- [ ] `drain_remote_lockfree()` (no mutex)
- [ ] Count blocks in remote stack
- [ ] Batch push to freelist
9. **Integrate into alloc/free paths** (1 hour)
- [ ] Replace lock/pop/unlock with `freelist_pop_lockfree()`
- [ ] Update refill to use batch push
- [ ] Update free to use lock-free push
10. **Testing (critical for lock-free)** (3 hours)
- [ ] Single-thread correctness test
- [ ] Multi-thread stress test (16T, 60 sec)
- [ ] TSan (ThreadSanitizer) run
- [ ] Check counter invariants (no lost blocks)
11. **Benchmark validation** (2 hours)
- [ ] Run suite with lock-free (4T focus)
- [ ] Compare to Phase 6.25 baseline
- [ ] Measure CAS retry rate
- [ ] **Target**: +15-20% Mid 4T
**Total Estimate**: 11-12 hours
---
### Phase 6.27: Learner Integration (Target: Week 2, parallel)
12. **Add ACE waste tracking** (1 hour)
- [ ] Implement `hak_ace_get_total_waste()` in `hakmem_ace.c`
- [ ] Track cumulative (rounded - original) per allocation
- [ ] Atomic counter for thread safety
13. **Update learner score function** (30 min)
- [ ] Add fragmentation penalty term
- [ ] Weight: -0.5 × (waste_MB)
- [ ] Test score computation
14. **Validation testing** (3 hours)
- [ ] Baseline run (learner OFF)
- [ ] CAP tuning run (learner ON, W_MAX fixed)
- [ ] W_MAX learning run (Canary enabled)
- [ ] Compare throughput, check convergence
15. **Documentation** (1 hour)
- [ ] Update ENV_VARS.md with learner params
- [ ] Document recommended settings
- [ ] Add troubleshooting guide (oscillation, no effect)
**Total Estimate**: 5-6 hours
---
### Post-Implementation (Week 3)
16. **Comprehensive benchmarking** (4 hours)
- [ ] Full suite (tiny, mid, large) with all phases enabled
- [ ] Head-to-head vs mimalloc (1T, 4T, 8T)
- [ ] Memory profiling (RSS, fragmentation)
- [ ] Generate performance report
17. **Code review & cleanup** (2 hours)
- [ ] Remove debug printfs
- [ ] Add comments to complex sections
- [ ] Update copyright/phase headers
- [ ] Check for code duplication
18. **Documentation updates** (2 hours)
- [ ] Update INDEX.md with new phases
- [ ] Write PHASE_6.25_6.27_RESULTS.md
- [ ] Update README.md benchmarks section
**Total Estimate**: 8 hours
---
## 📈 Success Metrics
### Primary Metrics
| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| **Mid 1T Throughput** | 4.0 M/s | 5.0-5.7 M/s | Larson benchmark, 10s |
| **Mid 4T Throughput** | 13.8 M/s | 18.0-20.0 M/s | Larson benchmark, 10s |
| **Mid 1T vs mimalloc** | 28% | 35-39% | Ratio of throughputs |
| **Mid 4T vs mimalloc** | 47% | 61-68% | Ratio of throughputs |
### Secondary Metrics
| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| Refill frequency (1T) | ~1000/sec | ~250-500/sec | Counter delta |
| Lock contention (4T) | ~40% wait | <10% wait | Trylock success rate |
| Hit rate (Mid Pool) | ~60% | 70-80% | hits / (hits + misses) |
| Memory footprint | 22 MB | <30 MB | RSS baseline |
### Regression Thresholds
| Scenario | Threshold | Action |
|----------|-----------|--------|
| Tiny Pool 4T | <2% regression | Acceptable |
| Large Pool | <5% regression | Acceptable |
| Memory bloat | >40 MB baseline | Reduce CAP or batch |
| Crash/hang in stress test | Any occurrence | Block release, debug |
---
## 🎬 Conclusion
This implementation plan provides a **systematic path** to improve hakmem's Mid Pool performance from **47% to 61-68% of mimalloc** for multi-threaded workloads (4T), bringing it into the target range of 60-75%.
**Key Insights**:
1. **Phase 6.25 (Batching)**: Low risk, medium reward, tackles 1T bottleneck
2. **Phase 6.26 (Lock-Free)**: Medium risk, high reward, critical for 4T scaling
3. **Phase 6.27 (Learner)**: Low risk, low-medium reward, adaptive optimization
**Recommendation**:
- Implement 6.25 and 6.27 in **parallel** (independent, ~12 hours total)
- Tackle 6.26 **after** 6.25 validated (builds on batch refill, ~12 hours)
- **Total time**: ~24-30 hours (3-4 days focused work)
**Next Steps**:
1. Review this plan with team
2. Set up benchmarking pipeline (automated, reproducible)
3. Implement Phase 6.25 (highest priority)
4. Measure, iterate, document
**Open Questions**:
- Should we extend TLS slots from 2 to 4? (Test in 6.25)
- Is W_MAX learning worth the risk? (Test in 6.27 with Canary)
- After 6.27, pursue header elimination (Phase 6.28) or accept 1T gap?
---
**Document Version**: 1.0
**Last Updated**: 2025-10-24
**Author**: Claude (Sonnet 4.5)
**Status**: Ready for Implementation