Phase 6.25-6.27: Implementation Plan - Catching Up with mimalloc

Date: 2025-10-24
Status: 📋 Planning
Target: Reach 60-75% of mimalloc performance for Mid Pool


📊 Current Baseline (Phase 6.21 Results)

Performance vs mimalloc

Workload  Threads  hakmem    mimalloc  Ratio  Gap
Mid       1T       4.0 M/s   14.6 M/s  28%    -72%
Mid       4T       13.8 M/s  29.5 M/s  47%    -53%
Tiny      1T       19.4 M/s  32.6 M/s  59%    -41%
Tiny      4T       48.0 M/s  65.7 M/s  73%    -27%
Large     1T       0.6 M/s   2.1 M/s   29%    -71%

Key Insights:

  • Phase 6.25 Quick Wins achieved +37.8% for Mid 4T (10.0 → 13.8 M/s)
  • Mid Pool still significantly behind mimalloc (28% 1T, 47% 4T)
  • 🎯 Target: 60-75% of mimalloc = 8.8-11.0 M/s (1T), 17.7-22.1 M/s (4T)

Current Mid Pool Architecture

┌─────────────────────────────────────────────────────────┐
│ TLS Fast Path (Lock-Free)                               │
├─────────────────────────────────────────────────────────┤
│ 1. TLS Ring Buffer (RING_CAP=32)                        │
│    - LIFO cache for recently freed blocks               │
│    - Per-class, per-thread                              │
│    - Phase 6.25: 16→32 increased hit rate               │
│                                                          │
│ 2. TLS Active Pages (x2: page_a, page_b)                │
│    - Bump-run allocation (no per-block links)           │
│    - Owner-thread private (lock-free)                   │
│    - 64KB pages, split on-demand                        │
├─────────────────────────────────────────────────────────┤
│ Shared State (Lock-Based)                               │
├─────────────────────────────────────────────────────────┤
│ 3. Per-class Freelist (64 shards)                       │
│    - Mutex-protected per (class, shard)                 │
│    - Site-based sharding (reduce contention)            │
│    - Refill on demand via refill_freelist()             │
│                                                          │
│ 4. Remote Stack (MPSC, lock-free push)                  │
│    - Cross-thread free target                           │
│    - Drained into freelist under lock                   │
│                                                          │
│ 5. Transfer Cache (TC, Phase 6.20)                      │
│    - Per-thread inbox (atomic CAS)                      │
│    - Owner-aware routing                                │
│    - Drain trigger: ring->top < 2                       │
└─────────────────────────────────────────────────────────┘

Refill Flow (Current):
  Ring empty → Check Active Pages → Lock Shard → Pop freelist
  → Drain remote → Shard steal (if CAP reached) → refill_freelist()

Refill Implementation:
  - Allocates 1 page (64KB) via mmap
  - Splits into blocks, links into freelist
  - ACE bundle factor: 1-4 pages (adaptive)

Bottlenecks Identified

From Phase 6.20 Analysis:

  1. Refill Latency (Primary)

    • Single-page refill: 1 mmap syscall per refill
    • Freelist rebuilding overhead (linking blocks)
    • Mutex hold time during refill (~100-150 cycles)
    • Impact: ~40% of alloc time in Mid 1T
  2. Lock Contention (Secondary)

    • 64 shards × 7 classes = 448 mutexes
    • Even with sharding, 4T shows contention
    • Trylock success rate: ~60-70% (Phase 6.25 data)
    • Impact: ~25% of alloc time in Mid 4T
  3. CAP/W_MAX Sub-optimal (Tertiary)

    • Static configuration (no runtime adaptation)
    • W_MAX=1.60 (Mid), 1.30 (Large) → some fallback to L1
    • CAP={64,64,64,32,16} → conservative, low hit rate
    • Impact: ~10-15% missed pool opportunities

🎯 Phase 6.25 Core: Refill Batching

Goal

Reduce refill latency by allocating multiple pages at once

Target: Mid 1T: +10-15% (4.0 → 4.5-5.0 M/s)

Problem Statement

Current refill_freelist() allocates 1 page per call:

  • 1 mmap syscall (~200-300 cycles)
  • 1 page split + freelist rebuild (~100-150 cycles)
  • Held under mutex lock (blocks other threads)
  • Amortized cost per block: HIGH for small classes (e.g., 2KB = 32 blocks/page)

Opportunity: Allocate 2-4 pages in batch to amortize costs:

  • mmap overhead: 300 cycles → 75-150 cycles/page (batched)
  • Freelist rebuild: done in parallel or optimized
  • Fill multiple TLS page slots + Ring buffer aggressively

Implementation Approach

1. Create alloc_tls_page_batch() Function

Location: hakmem_pool.c (after alloc_tls_page(), line ~486)

Signature:

// Allocate multiple pages in batch and distribute to TLS structures
// Returns: number of pages successfully allocated (0-batch_size)
static int alloc_tls_page_batch(int class_idx, int batch_size,
                                 PoolTLSPage* slots[], int num_slots,
                                 PoolTLSRing* ring, PoolTLSBin* bin);

Pseudocode:

static int alloc_tls_page_batch(int class_idx, int batch_size,
                                 PoolTLSPage* slots[], int num_slots,
                                 PoolTLSRing* ring, PoolTLSBin* bin) {
    size_t user_size = g_class_sizes[class_idx];
    size_t block_size = HEADER_SIZE + user_size;
    int blocks_per_page = POOL_PAGE_SIZE / block_size;
    if (blocks_per_page <= 0) return 0;

    int allocated = 0;

    // Allocate pages in batch (strategy: multiple mmaps or single large mmap)
    // Option A: Multiple mmaps (simpler, compatible with existing infra)
    for (int i = 0; i < batch_size; i++) {
        void* page = mmap(NULL, POOL_PAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) break;  // mmap reports failure as MAP_FAILED, not NULL

        // Prefault (Phase 6.25 quick win)
        for (size_t j = 0; j < POOL_PAGE_SIZE; j += 4096) {
            ((volatile char*)page)[j] = 0;
        }

        // Strategy: Fill TLS slots first, then fill Ring/LIFO
        if (allocated < num_slots && slots[allocated]) {
            // Assign to TLS active page slot (bump-run init)
            PoolTLSPage* ap = slots[allocated];
            ap->page = page;
            ap->bump = (char*)page;
            ap->end = (char*)page + POOL_PAGE_SIZE;
            ap->count = blocks_per_page;

            // Register page descriptor
            mid_desc_register(page, class_idx, (uint64_t)(uintptr_t)pthread_self());
        } else {
            // Fill Ring + LIFO from this page
            char* bump = (char*)page;
            char* end = (char*)page + POOL_PAGE_SIZE;

            for (int k = 0; k < blocks_per_page; k++) {
                PoolBlock* b = (PoolBlock*)(void*)bump;

                // Try Ring first, then LIFO
                if (ring && ring->top < POOL_TLS_RING_CAP) {
                    ring->items[ring->top++] = b;
                } else if (bin) {
                    b->next = bin->lo_head;
                    bin->lo_head = b;
                    bin->lo_count++;
                }

                bump += block_size;
                if (bump >= end) break;
            }

            mid_desc_register(page, class_idx, (uint64_t)(uintptr_t)pthread_self());
        }

        allocated++;
        g_pool.total_pages_allocated++;
        g_pool.pages_by_class[class_idx]++;
        g_pool.total_bytes_allocated += POOL_PAGE_SIZE;
    }

    if (allocated > 0) {
        g_pool.refills[class_idx]++;
    }

    return allocated;
}
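For comparison, a minimal sketch of Option B (one large mmap carved into pages). This is an assumption-level sketch, not part of the plan above: it reuses the names from the pseudocode, trades batch_size syscalls for one, and needs a partial-failure path that unmaps any unused tail.

// Option B sketch: single mmap spanning batch_size pages, then carve.
size_t span = (size_t)batch_size * POOL_PAGE_SIZE;
char* base = mmap(NULL, span, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (base == MAP_FAILED) return 0;

for (int i = 0; i < batch_size; i++) {
    void* page = base + (size_t)i * POOL_PAGE_SIZE;
    // ... prefault and distribute to TLS slots / Ring / LIFO exactly as in Option A ...
}
// Note: if distribution stops early, munmap the untouched tail pages
// so RSS does not grow beyond what Option A would allocate.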

2. Modify Refill Call Sites

Location: hakmem_pool.c:931 (inside hak_pool_try_alloc, refill path)

Before:

if (alloc_tls_page(class_idx, tap)) {
    // ... use newly allocated page
}

After:

// Determine batch size from env var (default 2-4)
int batch = g_pool_refill_batch_size;  // new global config
if (batch < 1) batch = 1;
if (batch > 4) batch = 4;

// Prepare slot array (up to 2 TLS slots)
PoolTLSPage* slots[2] = {NULL, NULL};
int num_slots = 0;

if (g_tls_active_page_a[class_idx].page == NULL || g_tls_active_page_a[class_idx].count == 0) {
    slots[num_slots++] = &g_tls_active_page_a[class_idx];
}
if (g_tls_active_page_b[class_idx].page == NULL || g_tls_active_page_b[class_idx].count == 0) {
    slots[num_slots++] = &g_tls_active_page_b[class_idx];
}

// Call batch allocator
int allocated = alloc_tls_page_batch(class_idx, batch, slots, num_slots,
                                      &g_tls_bin[class_idx].ring,
                                      &g_tls_bin[class_idx]);

if (allocated > 0) {
    pthread_mutex_unlock(lock);
    // Use ring or active page as usual
    // ...
}
// allocated == 0: fall through to the existing single-page refill / L1 fallback

3. Add Environment Variable

Global Config (add to hakmem_pool.c globals, ~line 316):

static int g_pool_refill_batch_size = 2;  // env: HAKMEM_POOL_REFILL_BATCH (1-4)

Init (add to hak_pool_init(), ~line 716):

const char* e_batch = getenv("HAKMEM_POOL_REFILL_BATCH");
if (e_batch) {
    int v = atoi(e_batch);
    if (v >= 1 && v <= 4) g_pool_refill_batch_size = v;
}

4. Extend TLS Active Page Slots (Optional)

Current: 2 slots (page_a, page_b)
Proposal: Add page_c, page_d for batch_size=4 (if beneficial)

Trade-off:

  • Pro: More TLS-local inventory, fewer shared accesses
  • Con: Increased TLS memory footprint (~256 bytes/class)

Recommendation: Start with 2 slots, measure, then extend if needed.


File Changes Required

File           Function                Change Type                   Est. LOC
hakmem_pool.c  alloc_tls_page_batch()  New function                  +80
hakmem_pool.c  hak_pool_try_alloc()    Modify refill path            +30
hakmem_pool.c  Globals                 Add g_pool_refill_batch_size  +1
hakmem_pool.c  hak_pool_init()         Parse env var                 +5
hakmem_pool.h  (none)                  No public API change          0
Total                                                                ~116 LOC

Testing Strategy

Unit Test

# Test batch allocation works
HAKMEM_POOL_REFILL_BATCH=4 ./test_pool_refill

# Verify TLS slots filled correctly
# Check Ring buffer populated
# Check no memory leaks
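A minimal sketch of what test_pool_refill could assert. The entry points hak_alloc()/hak_free() are placeholder names, not the real hakmem API; adjust to the actual public interface:

// Hypothetical smoke test for batch refill (placeholder API names).
#include <assert.h>
#include <stddef.h>

extern void* hak_alloc(size_t size);   // assumed allocator entry point
extern void  hak_free(void* ptr);      // assumed free entry point

int main(void) {
    // Harness sets HAKMEM_POOL_REFILL_BATCH=4 before exec.
    enum { N = 4096, SZ = 2048 };      // 2KB class: 32 blocks per 64KB page
    static void* ptrs[N];

    for (int i = 0; i < N; i++) {
        ptrs[i] = hak_alloc(SZ);
        assert(ptrs[i] != NULL);       // batch refill must keep feeding blocks
    }
    for (int i = 0; i < N; i++) hak_free(ptrs[i]);

    // Second pass: blocks should now come from Ring/LIFO, not fresh pages;
    // compare pages_allocated counters before/after if exposed.
    for (int i = 0; i < N; i++) {
        void* p = hak_alloc(SZ);
        assert(p != NULL);
        hak_free(p);
    }
    return 0;
}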

Benchmark Test

# Baseline (batch=1, current behavior)
HAKMEM_POOL_REFILL_BATCH=1 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh

# Batch=2 (conservative)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh

# Batch=4 (aggressive)
HAKMEM_POOL_REFILL_BATCH=4 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh

# Expected: +10-15% on Mid 1T (4.0 → 4.5-5.0 M/s)

Failure Modes to Watch

  1. Memory bloat: Batch too large → excessive pre-allocation

    • Monitor: RSS growth, pages_allocated counter
    • Mitigation: Cap batch_size at 4, respect CAP limits
  2. Ring overflow: Batch fills Ring, blocks get lost

    • Monitor: Ring underflow counter (should decrease)
    • Mitigation: Properly route overflow to LIFO
  3. TLS slot contention: Multiple threads allocating same class

    • Monitor: Active page descriptor conflicts
    • Mitigation: Per-thread ownership (already enforced)

Risk Assessment

Risk                                  Likelihood  Impact  Mitigation
Memory bloat (over-allocation)        Medium      High    Cap at batch=4, respect CAP limits
Complexity (harder to debug)          Low         Medium  Extensive logging, unit tests
Backward compat (existing workloads)  Low         Low     Default batch=2 (conservative)
Regression (slower than 1-page)       Low         Medium  A/B test, fallback to batch=1

Rollback Plan: Set HAKMEM_POOL_REFILL_BATCH=1 to restore original behavior (zero code change).


Estimated Time

  • Implementation: 3-4 hours
    • Core function: 2 hours
    • Integration: 1 hour
    • Testing: 1 hour
  • Benchmarking: 2 hours
    • Run suite 3x (batch=1,2,4)
    • Analyze results
  • Total: 5-6 hours

🔓 Phase 6.26: Lock-Free Refill

Goal

Eliminate lock contention on freelist access

Target: Mid 4T: +15-20% (13.8 → 16-18 M/s)

Problem Statement

Current freelist uses per-shard mutexes (pthread_mutex_t):

  • 64 shards × 7 classes = 448 mutexes
  • Contention on hot shards (4T workload)
  • Trylock success rate: ~60-70% (Phase 6.25 data)
  • Each lock/unlock: ~20-40 cycles overhead

Opportunity: Replace mutex with lock-free stack (CAS-based):

  • Atomic compare-and-swap: ~10-15 cycles
  • No blocking (always forward progress)
  • Better scalability under contention

Implementation Approach

1. Replace Freelist Mutex with Atomic Head

Current Structure (hakmem_pool.c:276-280):

static struct {
    PoolBlock* freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // ...
} g_pool;

New Structure:

static struct {
    // Lock-free freelist head (atomic pointer)
    atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

    // Lock-free counter (for non-empty bitmap update)
    atomic_uint freelist_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

    // Keep nonempty_mask (atomic already)
    atomic_uint_fast64_t nonempty_mask[POOL_NUM_CLASSES];

    // Remote stack (already lock-free)
    atomic_uintptr_t remote_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    atomic_uint remote_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

    // ... (rest unchanged)
} g_pool;
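With the mutexes gone, hak_pool_init() only needs explicit atomic stores (init runs single-threaded); a minimal sketch assuming the structure above:

// Sketch: initialize atomics where pthread_mutex_init used to run.
for (int c = 0; c < POOL_NUM_CLASSES; c++) {
    for (int s = 0; s < POOL_NUM_SHARDS; s++) {
        atomic_store_explicit(&g_pool.freelist_head[c][s], (uintptr_t)0,
                              memory_order_relaxed);
        atomic_store_explicit(&g_pool.freelist_count[c][s], 0u,
                              memory_order_relaxed);
    }
    atomic_store_explicit(&g_pool.nonempty_mask[c], 0, memory_order_relaxed);
}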

2. Implement Lock-Free Push/Pop

Lock-Free Pop (replace mutex-based pop):

// Pop block from lock-free freelist
// Returns: block pointer, or NULL if empty
static inline PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    uintptr_t old_head;
    PoolBlock* block;

    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                         memory_order_acquire);
        if (!old_head) {
            return NULL;  // Empty
        }

        block = (PoolBlock*)old_head;
        // Try CAS: freelist_head = block->next
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)block->next,
                 memory_order_release, memory_order_acquire));

    // Update count
    unsigned old_count = atomic_fetch_sub_explicit(
        &g_pool.freelist_count[class_idx][shard_idx], 1, memory_order_relaxed);

    // Clear nonempty bit if now empty
    if (old_count <= 1) {
        clear_nonempty_bit(class_idx, shard_idx);
    }

    return block;
}

Lock-Free Push (for refill path):

// Push block onto lock-free freelist
static inline void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
    uintptr_t old_head;

    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                         memory_order_acquire);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_acquire));

    // Update count and nonempty bit
    atomic_fetch_add_explicit(&g_pool.freelist_count[class_idx][shard_idx], 1,
                               memory_order_relaxed);
    set_nonempty_bit(class_idx, shard_idx);
}

Lock-Free Batch Push (for refill, optimization):

// Push multiple blocks atomically (amortize CAS overhead)
static inline void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                                 PoolBlock* head, PoolBlock* tail,
                                                 int count) {
    uintptr_t old_head;

    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                         memory_order_acquire);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)head,
                 memory_order_release, memory_order_acquire));

    atomic_fetch_add_explicit(&g_pool.freelist_count[class_idx][shard_idx], count,
                               memory_order_relaxed);
    set_nonempty_bit(class_idx, shard_idx);
}

3. Refill Path Integration

Modify refill_freelist() (now lock-free):

static int refill_freelist(int class_idx, int shard_idx) {
    // ... (allocate page, split into blocks)

    // OLD: lock → push to freelist → unlock
    // pthread_mutex_lock(lock);
    // block->next = g_pool.freelist[class_idx][shard_idx];
    // g_pool.freelist[class_idx][shard_idx] = freelist_head;
    // pthread_mutex_unlock(lock);

    // NEW: lock-free batch push
    PoolBlock* tail = freelist_head;
    int count = blocks_per_page;
    while (tail->next) {
        tail = tail->next;
    }

    freelist_push_batch_lockfree(class_idx, shard_idx, freelist_head, tail, count);

    return 1;
}
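The tail walk above is O(blocks_per_page) on every refill. Since the split loop builds the chain anyway, it can record the tail as it links; a sketch (the split-loop variables page and block_size are assumed from the existing refill code):

// Sketch: track head/tail/count while splitting the page, so the
// batch push needs no extra list walk.
PoolBlock* head = NULL;
PoolBlock* tail = NULL;
int count = 0;
for (char* p = (char*)page; p + block_size <= (char*)page + POOL_PAGE_SIZE;
     p += block_size) {
    PoolBlock* b = (PoolBlock*)(void*)p;
    b->next = head;                    // link in reverse order
    head = b;
    if (!tail) tail = b;               // first linked block is the tail
    count++;
}
if (head)
    freelist_push_batch_lockfree(class_idx, shard_idx, head, tail, count);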

4. Remote Stack Drain (Lock-Free)

Current: drain_remote_locked() called under mutex
New: Drain into local list, then batch-push lock-free

// Drain remote stack into freelist (lock-free)
static inline void drain_remote_lockfree(int class_idx, int shard_idx) {
    // Atomically swap remote head to NULL (unchanged)
    uintptr_t head = atomic_exchange_explicit(&g_pool.remote_head[class_idx][shard_idx],
                                               (uintptr_t)0, memory_order_acq_rel);
    if (!head) return;

    // Count blocks
    int count = 0;
    PoolBlock* tail = (PoolBlock*)head;
    while (tail->next) {
        tail = tail->next;
        count++;
    }
    count++;  // Include head

    // Batch push to freelist (lock-free)
    freelist_push_batch_lockfree(class_idx, shard_idx, (PoolBlock*)head, tail, count);

    // Update remote count
    atomic_fetch_sub_explicit(&g_pool.remote_count[class_idx][shard_idx], count,
                               memory_order_relaxed);
}

5. Fallback Strategy (Optional)

For rare contention cases (e.g., CAS spin > 100 iterations):

  • Option A: Keep spinning (acceptable for short lists)
  • Option B: Fallback to mutex (hybrid approach)
  • Option C: Backoff + retry (exponential backoff)

Recommendation: Start with Option A (pure lock-free), measure, add backoff if needed.
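If Option C becomes necessary, a bounded exponential backoff can wrap the push CAS; a sketch with placeholder tuning knobs (the 64-spin cap and _mm_pause are x86 assumptions; use a yield/nop on other architectures):

#include <immintrin.h>  // _mm_pause (x86)

// Sketch: lock-free push with exponential backoff on CAS failure.
static inline void freelist_push_backoff(atomic_uintptr_t* head_slot,
                                         PoolBlock* block) {
    int spins = 1;
    uintptr_t old_head = atomic_load_explicit(head_slot, memory_order_acquire);
    for (;;) {
        block->next = (PoolBlock*)old_head;
        if (atomic_compare_exchange_weak_explicit(
                head_slot, &old_head, (uintptr_t)block,
                memory_order_release, memory_order_acquire))
            return;
        // CAS failed: old_head now holds the fresh head. Back off, retry.
        for (int i = 0; i < spins; i++) _mm_pause();
        if (spins < 64) spins <<= 1;   // exponential, capped
    }
}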


File Changes Required

File           Function                        Change Type                   Est. LOC
hakmem_pool.c  Globals                         Replace mutexes with atomics  +10/-10
hakmem_pool.c  freelist_pop_lockfree()         New function                  +30
hakmem_pool.c  freelist_push_lockfree()        New function                  +20
hakmem_pool.c  freelist_push_batch_lockfree()  New function                  +25
hakmem_pool.c  drain_remote_lockfree()         Rewrite (lock-free)           +25/-20
hakmem_pool.c  refill_freelist()               Modify (use batch push)       +10/-15
hakmem_pool.c  hak_pool_try_alloc()            Replace lock/unlock with pop  +5/-10
hakmem_pool.c  hak_pool_free()                 Lock-free path                +10/-10
hakmem_pool.c  hak_pool_init()                 Init atomics (not mutexes)    +5/-5
Total                                                                        ~140 LOC (net ~100)

Testing Strategy

Correctness Test

# Single-threaded (no contention, pure correctness)
THREADS=1 ./test_pool_lockfree

# Multi-threaded stress test (high contention)
THREADS=16 DURATION=60 ./test_pool_lockfree_stress

# Check for:
# - No memory leaks (valgrind)
# - No double-free (AddressSanitizer)
# - No lost blocks (counter invariants)

Performance Test

# Baseline (Phase 6.25, with batching)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh

# Lock-free (Phase 6.26): same command, run against the lock-free build
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh

# Expected: +15-20% on Mid 4T (13.8 → 16-18 M/s)

Contention Analysis

# Measure CAS retry rate
# Add instrumentation:
#   atomic_uint_fast64_t cas_retries;
#   atomic_uint_fast64_t cas_attempts;
# Print ratio at shutdown

# Target: <5% retry rate under 4T load
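A sketch of that instrumentation with relaxed global counters (names are placeholders; gate behind a debug compile flag so release builds pay nothing):

#include <stdatomic.h>
#include <stdio.h>
#include <inttypes.h>

static atomic_uint_fast64_t g_cas_attempts;  // incremented per CAS issued
static atomic_uint_fast64_t g_cas_retries;   // incremented per failed CAS

// In each CAS loop body:
//   atomic_fetch_add_explicit(&g_cas_attempts, 1, memory_order_relaxed);
//   on failure: atomic_fetch_add_explicit(&g_cas_retries, 1, memory_order_relaxed);

static void report_cas_stats(void) {  // e.g., registered via atexit()
    uint64_t a = atomic_load_explicit(&g_cas_attempts, memory_order_relaxed);
    uint64_t r = atomic_load_explicit(&g_cas_retries, memory_order_relaxed);
    fprintf(stderr, "[pool] CAS retry rate: %.2f%% (%" PRIu64 "/%" PRIu64 ")\n",
            a ? 100.0 * (double)r / (double)a : 0.0, r, a);
}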

Risk Assessment

Risk                                 Likelihood  Impact    Mitigation
ABA problem (block reuse)            Low         Critical  Use epoch-based reclamation or hazard pointers
CAS livelock (high contention)       Medium      High      Add exponential backoff after N retries
Memory ordering bugs (subtle races)  Medium      Critical  Extensive testing, TSan, formal verification
Performance regression (1T)          Low         Low       Single-thread has no contention, minimal overhead

ABA Problem:

  • Scenario: Block A is popped, freed, reallocated, and pushed back while another thread's CAS is still in flight; that thread's cached A->next is then stale
  • Assessment: treated as low risk here because pool pages are never unmapped (a stale read cannot fault), but a CAS that succeeds with a stale next can still splice an in-use block into the freelist, so the stress tests above must cover this case
  • Alternative: Add a version counter (128-bit CAS) if issues arise; see the sketch below
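Should ABA show up in stress tests, the version-counter alternative pairs the head with a generation count in one 16-byte CAS; a sketch assuming GCC/Clang on x86-64 (-mcx16 for cmpxchg16b), not a drop-in replacement:

// Sketch: ABA-safe pop using a {pointer, generation} pair.
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    PoolBlock* ptr;
    uint64_t   gen;                     // bumped on every successful pop
} VersionedHead;

static _Atomic VersionedHead g_vhead;   // one per (class, shard) in practice

static PoolBlock* pop_versioned(void) {
    VersionedHead old = atomic_load(&g_vhead);
    for (;;) {
        if (!old.ptr) return NULL;
        VersionedHead next = { old.ptr->next, old.gen + 1 };
        if (atomic_compare_exchange_weak(&g_vhead, &old, next))
            return old.ptr;             // stale heads fail on gen mismatch
    }
}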

Rollback Plan: Keep mutexes in code (ifdef'd out), revert via compile flag if needed.


Estimated Time

  • Implementation: 5-6 hours
    • Lock-free primitives: 2 hours
    • Integration: 2 hours
    • Testing: 2 hours
  • Debugging: 2-3 hours (race conditions, TSan)
  • Benchmarking: 2 hours
  • Total: 9-11 hours

🧠 Phase 6.27: Learner Integration

Goal

Dynamic optimization of CAP and W_MAX based on runtime behavior

Target: +5-10% across all workloads via adaptive tuning

Problem Statement

Current policy is static (set at init):

  • CAP = {64,64,64,32,16,32,32} (conservative)
  • W_MAX_MID = 1.60, W_MAX_LARGE = 1.30
  • No adaptation to workload characteristics

Opportunity: Use existing learner infrastructure to:

  1. Collect size distribution stats
  2. Adjust mid_cap[] dynamically based on hit rate
  3. Adjust w_max_mid based on fragmentation vs hit rate trade-off

Learner Already Exists: hakmem_learner.c (~585 LOC)

  • Background thread (1 sec polling)
  • Hit rate monitoring
  • UCB1 for W_MAX exploration (Canary deployment)
  • Budget enforcement + Water-filling

Integration Work: Minimal (learner already supports Mid Pool tuning)


Implementation Approach

1. Enable Learner for Mid Pool

Already Implemented (hakmem_learner.c:239-272):

// Adjust Mid caps by hit rate vs target (delta over window) with dwell
int mid_classes = 5;
if (cur->mid_dyn1_bytes != 0 && cur->mid_dyn2_bytes != 0) mid_classes = 7;
// ...
for (int i = 0; i < mid_classes; i++) {
    uint64_t dh = mid_hits[i] - prev_mid_hits[i];
    uint64_t dm = mid_misses[i] - prev_mid_misses[i];
    // ...
    if (hit < (tgt_mid - eps)) {
        cap += step_mid;  // Increase CAP
    } else if (hit > (tgt_mid + eps)) {
        cap -= step_mid;  // Decrease CAP
    }
    // ...
}

Action: Just enable via env var!

HAKMEM_LEARN=1 \
HAKMEM_TARGET_HIT_MID=0.65 \
HAKMEM_CAP_STEP_MID=8 \
HAKMEM_CAP_MIN_MID=16 \
HAKMEM_CAP_MAX_MID=512 \
./your_app

2. W_MAX Learning (Optional, Risky)

Already Implemented (hakmem_learner.c:388-499):

  • UCB1 multi-armed bandit
  • Canary deployment (safe exploration)
  • Rollback if performance regresses

Candidates (for Mid Pool):

W_MAX_MID candidates: [1.40, 1.50, 1.60, 1.70]
Default: 1.60 (current)
Exploration: Try 1.50 (tighter, less waste) or 1.70 (looser, higher hit)

Enable:

HAKMEM_LEARN=1 \
HAKMEM_WMAX_LEARN=1 \
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7 \
HAKMEM_WMAX_CANARY=1 \
./your_app

Recommendation: Start with CAP tuning only, add W_MAX later (more risk).

3. Size Distribution Integration (Already Exists)

Histogram (hakmem_size_hist.c):

  • 1KB granularity bins (0-64KB tracked)
  • Per-allocation sampling
  • Reset after learner snapshot

DYN1 Auto-Assignment (already implemented):

HAKMEM_LEARN=1 \
HAKMEM_DYN1_AUTO=1 \
HAKMEM_CAP_MID_DYN1=64 \
./your_app

Effect: Automatically finds peak size in 2-32KB range, assigns DYN1 class.

4. New: ACE Stats Integration

Current ACE (hakmem_ace.c):

  • Records size decisions (original_size → rounded_size → pool)
  • Tracks L1 fallback rate (miss → malloc)
  • Not integrated with learner

Proposal: Add ACE stats to learner score function

Modify Learner Score (hakmem_learner.c:414):

// OLD: simple hit-based score
double score = (double)(ace.mid_hit + ace.large_hit)
             - (double)(ace.mid_miss + ace.large_miss)
             - 2.0 * (double)ace.l1_fallback;

// NEW: add fragmentation penalty
extern uint64_t hak_ace_get_total_waste(void);  // sum of (rounded - original)
uint64_t waste = hak_ace_get_total_waste();
double frag_penalty = (double)waste / 1e6;  // normalize to MB

double score = (double)(ace.mid_hit + ace.large_hit)
             - (double)(ace.mid_miss + ace.large_miss)
             - 2.0 * (double)ace.l1_fallback
             - 0.5 * frag_penalty;  // penalize waste

Benefit: Balance hit rate vs fragmentation (W_MAX tuning).
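The waste counter behind hak_ace_get_total_waste() can be one relaxed atomic in hakmem_ace.c; a sketch of the estimated +15 LOC (hak_ace_note_waste() is a hypothetical hook name, to be called wherever ACE records a size decision):

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

static atomic_uint_fast64_t g_ace_total_waste;  // cumulative (rounded - original) bytes

// Hypothetical hook: call from ACE's size-decision path.
static inline void hak_ace_note_waste(size_t original, size_t rounded) {
    if (rounded > original) {
        atomic_fetch_add_explicit(&g_ace_total_waste,
                                  (uint_fast64_t)(rounded - original),
                                  memory_order_relaxed);
    }
}

uint64_t hak_ace_get_total_waste(void) {
    return (uint64_t)atomic_load_explicit(&g_ace_total_waste,
                                          memory_order_relaxed);
}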


File Changes Required

File              Function                   Change Type                    Est. LOC
hakmem_learner.c  Learner (already exists)   Enable via env                 0
hakmem_ace.c      hak_ace_get_total_waste()  New function                   +15
hakmem_learner.c  learner_main()             Add frag penalty to score      +10
hakmem_policy.c   (none)                     Learner publishes dynamically  0
Total                                                                       ~25 LOC

Testing Strategy

Baseline Test (Learner Off)

# Static policy (current)
RUNTIME=60 THREADS=1,4 ./scripts/run_bench_suite.sh
# Record: Mid 1T, Mid 4T throughput

Learner Test (CAP Tuning)

# Enable learner with aggressive targets
HAKMEM_LEARN=1 \
HAKMEM_TARGET_HIT_MID=0.75 \
HAKMEM_CAP_STEP_MID=8 \
HAKMEM_CAP_MAX_MID=512 \
HAKMEM_LEARN_WINDOW_MS=2000 \
RUNTIME=60 THREADS=1,4 ./scripts/run_bench_suite.sh

# Expected: CAP increases to ~128-256 (hit 75% target)
# Expected: +5-10% throughput improvement

W_MAX Learning Test (Optional)

HAKMEM_LEARN=1 \
HAKMEM_WMAX_LEARN=1 \
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7 \
HAKMEM_WMAX_CANARY=1 \
HAKMEM_WMAX_TRIAL_SEC=5 \
RUNTIME=120 THREADS=1,4 ./scripts/run_bench_suite.sh

# Monitor stderr for learner logs:
# "[Learner] W_MAX mid canary start: 1.50"
# "[Learner] W_MAX mid canary adopt" (success)
# or
# "[Learner] W_MAX mid canary revert to 1.60" (failure)

Regression Test

# Check learner doesn't hurt stable workloads
# Run with learning OFF, then ON, compare variance
# Target: <5% variance, no regressions

Risk Assessment

Risk                                         Likelihood  Impact  Mitigation
Over-tuning (oscillation)                    Medium      Medium  Increase dwell time (3→5 sec)
Under-tuning (no effect)                     Medium      Low     Lower target hit rate (0.75→0.65)
W_MAX instability (fragmentation spike)      Medium      High    Use Canary, revert on regression
Low-traffic workload (insufficient samples)  High        Low     Set min_samples=256, skip learning if below

Rollback Plan: Set HAKMEM_LEARN=0 (default, no learner overhead).


Estimated Time

  • Implementation: 1-2 hours
    • ACE waste tracking: 1 hour
    • Learner score update: 30 min
    • Testing: 30 min
  • Validation: 3-4 hours
    • Run suite with/without learner
    • Analyze CAP convergence
    • W_MAX exploration (if enabled)
  • Total: 4-6 hours

📊 Expected Performance Improvements

Cumulative Gains (Stacked)

Phase             Change             Mid 1T                      Mid 4T                       Rationale
Baseline (6.21)   Current            4.0 M/s (28%)               13.8 M/s (47%)               Post-quick-wins
6.25 (Batch)      Refill 2-4 pages   +10-15% → 4.5-5.0 M/s       +5-8% → 14.5-15.2 M/s        Amortize syscall, 1T bottleneck
6.26 (Lock-Free)  CAS freelist       +2-5% → 4.6-5.2 M/s         +15-20% → 17.0-18.2 M/s      Eliminate 4T contention
6.27 (Learner)    Dynamic CAP/W_MAX  +5-10% → 5.0-5.7 M/s        +5-10% → 18.0-20.0 M/s       Adaptive tuning
Target (60-75%)   vs mimalloc        8.8-11.0 M/s (of 14.6 M/s)  17.7-22.1 M/s (of 29.5 M/s)
Achieved?                            35-39% (still short)        61-68% (on target!)

Gap Analysis

1T Performance:

  • Current: 4.0 M/s (28% of mimalloc)
  • Post-6.27: 5.0-5.7 M/s (35-39% of mimalloc)
  • Gap to the 60-75% target (8.8-11.0 M/s): still need +3.1-6.0 M/s (+55-120%)

Remaining Bottlenecks (1T):

  1. Single-threaded refills still traverse the shared freelist/lock path (TLS caching yields less benefit at 1T)
  2. mimalloc's per-thread heaps eliminate ALL shared state
  3. Bump allocation (mimalloc) vs freelist (hakmem)
  4. Header overhead (32 bytes per alloc in hakmem)

4T Performance:

  • Current: 13.8 M/s (47% of mimalloc)
  • Post-6.27: 18.0-20.0 M/s (61-68% of mimalloc)
  • Target achieved! (60-75% range)

Follow-Up Phases (Post-6.27)

Phase 6.28: Header Elimination (if 1T is still below target)

  • Remove AllocHeader for Mid Pool (use page descriptors only)
  • Saves 32 bytes per allocation (~8-15% memory)
  • Saves header write on alloc hot path (~30-50 cycles)
  • Estimated Gain: +15-20% (1T)

Phase 6.29: Bump Allocation (major refactor)

  • Replace freelist with bump allocator (mimalloc-style)
  • Per-thread arenas, no shared state at all
  • Estimated Gain: +50-100% (1T), brings to mimalloc parity
  • Risk: High complexity, long implementation (~2-3 weeks)

🗓️ Priority-Ordered Task List

Phase 6.25: Refill Batching (Target: Week 1)

  1. Implement alloc_tls_page_batch() function (2 hours)

    • Write batch mmap loop
    • Distribute pages to TLS slots
    • Fill Ring/LIFO from overflow pages
    • Add page descriptors registration
  2. Integrate batch refill into hak_pool_try_alloc() (1 hour)

    • Replace alloc_tls_page() call with batch version
    • Prepare slot array logic
    • Handle partial allocation (< batch_size)
  3. Add environment variable support (30 min)

    • Add g_pool_refill_batch_size global
    • Parse HAKMEM_POOL_REFILL_BATCH in init
    • Validate range (1-4)
  4. Unit testing (1 hour)

    • Test batch=1,2,4 correctness
    • Verify TLS slots filled
    • Check Ring population
    • Valgrind (no leaks)
  5. Benchmark validation (2 hours)

    • Run suite with batch=1 (baseline)
    • Run suite with batch=2,4
    • Analyze throughput delta
    • Target: +10-15% Mid 1T

Total Estimate: 6-7 hours


Phase 6.26: Lock-Free Refill (Target: Week 2)

  1. Replace mutex with atomic freelist (2 hours)

    • Change PoolBlock* freelist[] → atomic_uintptr_t freelist_head[]
    • Add atomic_uint freelist_count[]
    • Remove PaddedMutex freelist_locks[]
  2. Implement lock-free primitives (2 hours)

    • Write freelist_pop_lockfree()
    • Write freelist_push_lockfree()
    • Write freelist_push_batch_lockfree()
  3. Rewrite drain functions (1 hour)

    • drain_remote_lockfree() (no mutex)
    • Count blocks in remote stack
    • Batch push to freelist
  4. Integrate into alloc/free paths (1 hour)

    • Replace lock/pop/unlock with freelist_pop_lockfree()
    • Update refill to use batch push
    • Update free to use lock-free push
  5. Testing (critical for lock-free) (3 hours)

    • Single-thread correctness test
    • Multi-thread stress test (16T, 60 sec)
    • TSan (ThreadSanitizer) run
    • Check counter invariants (no lost blocks)
  6. Benchmark validation (2 hours)

    • Run suite with lock-free (4T focus)
    • Compare to Phase 6.25 baseline
    • Measure CAS retry rate
    • Target: +15-20% Mid 4T

Total Estimate: 11-12 hours


Phase 6.27: Learner Integration (Target: Week 2, parallel)

  1. Add ACE waste tracking (1 hour)

    • Implement hak_ace_get_total_waste() in hakmem_ace.c
    • Track cumulative (rounded - original) per allocation
    • Atomic counter for thread safety
  2. Update learner score function (30 min)

    • Add fragmentation penalty term
    • Weight: -0.5 × (waste_MB)
    • Test score computation
  3. Validation testing (3 hours)

    • Baseline run (learner OFF)
    • CAP tuning run (learner ON, W_MAX fixed)
    • W_MAX learning run (Canary enabled)
    • Compare throughput, check convergence
  4. Documentation (1 hour)

    • Update ENV_VARS.md with learner params
    • Document recommended settings
    • Add troubleshooting guide (oscillation, no effect)

Total Estimate: 5-6 hours


Post-Implementation (Week 3)

  1. Comprehensive benchmarking (4 hours)

    • Full suite (tiny, mid, large) with all phases enabled
    • Head-to-head vs mimalloc (1T, 4T, 8T)
    • Memory profiling (RSS, fragmentation)
    • Generate performance report
  2. Code review & cleanup (2 hours)

    • Remove debug printfs
    • Add comments to complex sections
    • Update copyright/phase headers
    • Check for code duplication
  3. Documentation updates (2 hours)

    • Update INDEX.md with new phases
    • Write PHASE_6.25_6.27_RESULTS.md
    • Update README.md benchmarks section

Total Estimate: 8 hours


📈 Success Metrics

Primary Metrics

Metric              Current   Target         Measurement
Mid 1T Throughput   4.0 M/s   5.0-5.7 M/s    Larson benchmark, 10s
Mid 4T Throughput   13.8 M/s  18.0-20.0 M/s  Larson benchmark, 10s
Mid 1T vs mimalloc  28%       35-39%         Ratio of throughputs
Mid 4T vs mimalloc  47%       61-68%         Ratio of throughputs

Secondary Metrics

Metric                 Current    Target        Measurement
Refill frequency (1T)  ~1000/sec  ~250-500/sec  Counter delta
Lock contention (4T)   ~40% wait  <10% wait     Trylock success rate
Hit rate (Mid Pool)    ~60%       70-80%        hits / (hits + misses)
Memory footprint       22 MB      <30 MB        RSS baseline

Regression Thresholds

Scenario                   Threshold        Action
Tiny Pool 4T               <2% regression   Acceptable
Large Pool                 <5% regression   Acceptable
Memory bloat               >40 MB baseline  Reduce CAP or batch
Crash/hang in stress test  Any occurrence   Block release, debug

🎬 Conclusion

This implementation plan provides a systematic path to improve hakmem's Mid Pool performance from 47% to 61-68% of mimalloc for multi-threaded workloads (4T), bringing it into the target range of 60-75%.

Key Insights:

  1. Phase 6.25 (Batching): Low risk, medium reward, tackles 1T bottleneck
  2. Phase 6.26 (Lock-Free): Medium risk, high reward, critical for 4T scaling
  3. Phase 6.27 (Learner): Low risk, low-medium reward, adaptive optimization

Recommendation:

  • Implement 6.25 and 6.27 in parallel (independent, ~12 hours total)
  • Tackle 6.26 after 6.25 validated (builds on batch refill, ~12 hours)
  • Total time: ~24-30 hours (3-4 days focused work)

Next Steps:

  1. Review this plan with team
  2. Set up benchmarking pipeline (automated, reproducible)
  3. Implement Phase 6.25 (highest priority)
  4. Measure, iterate, document

Open Questions:

  • Should we extend TLS slots from 2 to 4? (Test in 6.25)
  • Is W_MAX learning worth the risk? (Test in 6.27 with Canary)
  • After 6.27, pursue header elimination (Phase 6.28) or accept 1T gap?

Document Version: 1.0
Last Updated: 2025-10-24
Author: Claude (Sonnet 4.5)
Status: Ready for Implementation