Phase 6.25-6.27: Implementation Plan - Catching Up with mimalloc

Date: 2025-10-24
Status: 📋 Planning
Target: Reach 60-75% of mimalloc performance for Mid Pool


📊 Current Baseline (Phase 6.21 Results)

Performance vs mimalloc

Workload  Threads  hakmem    mimalloc  Ratio  Gap
Mid       1T       4.0 M/s   14.6 M/s  28%    -72%
Mid       4T       13.8 M/s  29.5 M/s  47%    -53%
Tiny      1T       19.4 M/s  32.6 M/s  59%    -41%
Tiny      4T       48.0 M/s  65.7 M/s  73%    -27%
Large     1T       0.6 M/s   2.1 M/s   29%    -71%

Key Insights:

  • Phase 6.25 Quick Wins achieved +37.8% for Mid 4T (10.0 → 13.8 M/s)
  • Mid Pool still significantly behind mimalloc (28% 1T, 47% 4T)
  • 🎯 Target: 60-75% of mimalloc = 8.8-11.0 M/s (1T), 17.7-22.1 M/s (4T)

Current Mid Pool Architecture

┌─────────────────────────────────────────────────────────┐
│ TLS Fast Path (Lock-Free)                               │
├─────────────────────────────────────────────────────────┤
│ 1. TLS Ring Buffer (RING_CAP=32)                        │
│    - LIFO cache for recently freed blocks               │
│    - Per-class, per-thread                              │
│    - Phase 6.25: 16→32 increased hit rate               │
│                                                          │
│ 2. TLS Active Pages (x2: page_a, page_b)                │
│    - Bump-run allocation (no per-block links)           │
│    - Owner-thread private (lock-free)                   │
│    - 64KB pages, split on-demand                        │
├─────────────────────────────────────────────────────────┤
│ Shared State (Lock-Based)                               │
├─────────────────────────────────────────────────────────┤
│ 3. Per-class Freelist (64 shards)                       │
│    - Mutex-protected per (class, shard)                 │
│    - Site-based sharding (reduce contention)            │
│    - Refill on demand via refill_freelist()             │
│                                                          │
│ 4. Remote Stack (MPSC, lock-free push)                  │
│    - Cross-thread free target                           │
│    - Drained into freelist under lock                   │
│                                                          │
│ 5. Transfer Cache (TC, Phase 6.20)                      │
│    - Per-thread inbox (atomic CAS)                      │
│    - Owner-aware routing                                │
│    - Drain trigger: ring->top < 2                       │
└─────────────────────────────────────────────────────────┘

Refill Flow (Current):
  Ring empty → Check Active Pages → Lock Shard → Pop freelist
  → Drain remote → Shard steal (if CAP reached) → refill_freelist()

Refill Implementation:
  - Allocates 1 page (64KB) via mmap
  - Splits into blocks, links into freelist
  - ACE bundle factor: 1-4 pages (adaptive)

Bottlenecks Identified

From Phase 6.20 Analysis:

  1. Refill Latency (Primary)

    • Single-page refill: 1 mmap syscall per refill
    • Freelist rebuilding overhead (linking blocks)
    • Mutex hold time during refill (~100-150 cycles)
    • Impact: ~40% of alloc time in Mid 1T
  2. Lock Contention (Secondary)

    • 64 shards × 7 classes = 448 mutexes
    • Even with sharding, 4T shows contention
    • Trylock success rate: ~60-70% (Phase 6.25 data)
    • Impact: ~25% of alloc time in Mid 4T
  3. CAP/W_MAX Sub-optimal (Tertiary)

    • Static configuration (no runtime adaptation)
    • W_MAX=1.60 (Mid), 1.30 (Large) → some fallback to L1
    • CAP={64,64,64,32,16} → conservative, low hit rate
    • Impact: ~10-15% missed pool opportunities

🎯 Phase 6.25 Core: Refill Batching

Goal

Reduce refill latency by allocating multiple pages at once

Target: Mid 1T: +10-15% (4.0 → 4.5-5.0 M/s)

Problem Statement

Current refill_freelist() allocates 1 page per call:

  • 1 mmap syscall (~200-300 cycles)
  • 1 page split + freelist rebuild (~100-150 cycles)
  • Held under mutex lock (blocks other threads)
  • Amortized cost per block: HIGH for small classes (e.g., 2KB = 32 blocks/page)

Opportunity: Allocate 2-4 pages in batch to amortize costs:

  • mmap overhead: 300 cycles → 75-150 cycles/page (batched)
  • Freelist rebuild: done in parallel or optimized
  • Fill multiple TLS page slots + Ring buffer aggressively

Implementation Approach

1. Create alloc_tls_page_batch() Function

Location: hakmem_pool.c (after alloc_tls_page(), line ~486)

Signature:

// Allocate multiple pages in batch and distribute to TLS structures
// Returns: number of pages successfully allocated (0-batch_size)
static int alloc_tls_page_batch(int class_idx, int batch_size,
                                 PoolTLSPage* slots[], int num_slots,
                                 PoolTLSRing* ring, PoolTLSBin* bin);

Pseudocode:

static int alloc_tls_page_batch(int class_idx, int batch_size,
                                 PoolTLSPage* slots[], int num_slots,
                                 PoolTLSRing* ring, PoolTLSBin* bin) {
    size_t user_size = g_class_sizes[class_idx];
    size_t block_size = HEADER_SIZE + user_size;
    int blocks_per_page = POOL_PAGE_SIZE / block_size;
    if (blocks_per_page <= 0) return 0;

    int allocated = 0;

    // Allocate pages in batch (strategy: multiple mmaps or single large mmap)
    // Option A: Multiple mmaps (simpler, compatible with existing infra)
    for (int i = 0; i < batch_size; i++) {
        void* page = mmap(NULL, POOL_PAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) break;  // mmap reports failure as MAP_FAILED, not NULL

        // Prefault (Phase 6.25 quick win)
        for (size_t j = 0; j < POOL_PAGE_SIZE; j += 4096) {
            ((volatile char*)page)[j] = 0;
        }

        // Strategy: Fill TLS slots first, then fill Ring/LIFO
        if (allocated < num_slots && slots[allocated]) {
            // Assign to TLS active page slot (bump-run init)
            PoolTLSPage* ap = slots[allocated];
            ap->page = page;
            ap->bump = (char*)page;
            ap->end = (char*)page + POOL_PAGE_SIZE;
            ap->count = blocks_per_page;

            // Register page descriptor
            mid_desc_register(page, class_idx, (uint64_t)(uintptr_t)pthread_self());
        } else {
            // Fill Ring + LIFO from this page
            char* bump = (char*)page;
            char* end = (char*)page + POOL_PAGE_SIZE;

            for (int k = 0; k < blocks_per_page; k++) {
                PoolBlock* b = (PoolBlock*)(void*)bump;

                // Try Ring first, then LIFO
                if (ring && ring->top < POOL_TLS_RING_CAP) {
                    ring->items[ring->top++] = b;
                } else if (bin) {
                    b->next = bin->lo_head;
                    bin->lo_head = b;
                    bin->lo_count++;
                }

                bump += block_size;
                if (bump >= end) break;
            }

            mid_desc_register(page, class_idx, (uint64_t)(uintptr_t)pthread_self());
        }

        allocated++;
        g_pool.total_pages_allocated++;
        g_pool.pages_by_class[class_idx]++;
        g_pool.total_bytes_allocated += POOL_PAGE_SIZE;
    }

    if (allocated > 0) {
        g_pool.refills[class_idx]++;
    }

    return allocated;
}
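For comparison, a minimal sketch of Option B (one large mmap carved into pages). This is an assumption-level sketch, not part of the plan above: it reuses the names from the pseudocode, trades batch_size syscalls for one, and needs a partial-failure path that unmaps any unused tail.

// Option B sketch: single mmap spanning batch_size pages, then carve.
size_t span = (size_t)batch_size * POOL_PAGE_SIZE;
char* base = mmap(NULL, span, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (base == MAP_FAILED) return 0;

for (int i = 0; i < batch_size; i++) {
    void* page = base + (size_t)i * POOL_PAGE_SIZE;
    // ... prefault and distribute to TLS slots / Ring / LIFO exactly as in Option A ...
}
// Note: if distribution stops early, munmap the untouched tail pages
// so RSS does not grow beyond what Option A would allocate.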

2. Modify Refill Call Sites

Location: hakmem_pool.c:931 (inside hak_pool_try_alloc, refill path)

Before:

if (alloc_tls_page(class_idx, tap)) {
    // ... use newly allocated page
}

After:

// Determine batch size from env var (default 2-4)
int batch = g_pool_refill_batch_size;  // new global config
if (batch < 1) batch = 1;
if (batch > 4) batch = 4;

// Prepare slot array (up to 2 TLS slots)
PoolTLSPage* slots[2] = {NULL, NULL};
int num_slots = 0;

if (g_tls_active_page_a[class_idx].page == NULL || g_tls_active_page_a[class_idx].count == 0) {
    slots[num_slots++] = &g_tls_active_page_a[class_idx];
}
if (g_tls_active_page_b[class_idx].page == NULL || g_tls_active_page_b[class_idx].count == 0) {
    slots[num_slots++] = &g_tls_active_page_b[class_idx];
}

// Call batch allocator
int allocated = alloc_tls_page_batch(class_idx, batch, slots, num_slots,
                                      &g_tls_bin[class_idx].ring,
                                      &g_tls_bin[class_idx]);

if (allocated > 0) {
    pthread_mutex_unlock(lock);
    // Use ring or active page as usual
    // ...
}
// allocated == 0: fall through to the existing single-page refill / L1 fallback

3. Add Environment Variable

Global Config (add to hakmem_pool.c globals, ~line 316):

static int g_pool_refill_batch_size = 2;  // env: HAKMEM_POOL_REFILL_BATCH (1-4)

Init (add to hak_pool_init(), ~line 716):

const char* e_batch = getenv("HAKMEM_POOL_REFILL_BATCH");
if (e_batch) {
    int v = atoi(e_batch);
    if (v >= 1 && v <= 4) g_pool_refill_batch_size = v;
}

4. Extend TLS Active Page Slots (Optional)

Current: 2 slots (page_a, page_b)
Proposal: Add page_c, page_d for batch_size=4 (if beneficial)

Trade-off:

  • Pro: More TLS-local inventory, fewer shared accesses
  • Con: Increased TLS memory footprint (~256 bytes/class)

Recommendation: Start with 2 slots, measure, then extend if needed.


File Changes Required

File           Function                Change Type                   Est. LOC
hakmem_pool.c  alloc_tls_page_batch()  New function                  +80
hakmem_pool.c  hak_pool_try_alloc()    Modify refill path            +30
hakmem_pool.c  Globals                 Add g_pool_refill_batch_size  +1
hakmem_pool.c  hak_pool_init()         Parse env var                 +5
hakmem_pool.h  (none)                  No public API change          0
Total                                                                ~116 LOC

Testing Strategy

Unit Test

# Test batch allocation works
HAKMEM_POOL_REFILL_BATCH=4 ./test_pool_refill

# Verify TLS slots filled correctly
# Check Ring buffer populated
# Check no memory leaks
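A minimal sketch of what test_pool_refill could assert. The entry points hak_alloc()/hak_free() are placeholder names, not the real hakmem API; adjust to the actual public interface:

// Hypothetical smoke test for batch refill (placeholder API names).
#include <assert.h>
#include <stddef.h>

extern void* hak_alloc(size_t size);   // assumed allocator entry point
extern void  hak_free(void* ptr);      // assumed free entry point

int main(void) {
    // Harness sets HAKMEM_POOL_REFILL_BATCH=4 before exec.
    enum { N = 4096, SZ = 2048 };      // 2KB class: 32 blocks per 64KB page
    static void* ptrs[N];

    for (int i = 0; i < N; i++) {
        ptrs[i] = hak_alloc(SZ);
        assert(ptrs[i] != NULL);       // batch refill must keep feeding blocks
    }
    for (int i = 0; i < N; i++) hak_free(ptrs[i]);

    // Second pass: blocks should now come from Ring/LIFO, not fresh pages;
    // compare pages_allocated counters before/after if exposed.
    for (int i = 0; i < N; i++) {
        void* p = hak_alloc(SZ);
        assert(p != NULL);
        hak_free(p);
    }
    return 0;
}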

Benchmark Test

# Baseline (batch=1, current behavior)
HAKMEM_POOL_REFILL_BATCH=1 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh

# Batch=2 (conservative)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh

# Batch=4 (aggressive)
HAKMEM_POOL_REFILL_BATCH=4 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh

# Expected: +10-15% on Mid 1T (4.0 → 4.5-5.0 M/s)

Failure Modes to Watch

  1. Memory bloat: Batch too large → excessive pre-allocation

    • Monitor: RSS growth, pages_allocated counter
    • Mitigation: Cap batch_size at 4, respect CAP limits
  2. Ring overflow: Batch fills Ring, blocks get lost

    • Monitor: Ring underflow counter (should decrease)
    • Mitigation: Properly route overflow to LIFO
  3. TLS slot contention: Multiple threads allocating same class

    • Monitor: Active page descriptor conflicts
    • Mitigation: Per-thread ownership (already enforced)

Risk Assessment

Risk                                  Likelihood  Impact  Mitigation
Memory bloat (over-allocation)        Medium      High    Cap at batch=4, respect CAP limits
Complexity (harder to debug)          Low         Medium  Extensive logging, unit tests
Backward compat (existing workloads)  Low         Low     Default batch=2 (conservative)
Regression (slower than 1-page)       Low         Medium  A/B test, fallback to batch=1

Rollback Plan: Set HAKMEM_POOL_REFILL_BATCH=1 to restore original behavior (zero code change).


Estimated Time

  • Implementation: 3-4 hours
    • Core function: 2 hours
    • Integration: 1 hour
    • Testing: 1 hour
  • Benchmarking: 2 hours
    • Run suite 3x (batch=1,2,4)
    • Analyze results
  • Total: 5-6 hours

🔓 Phase 6.26: Lock-Free Refill

Goal

Eliminate lock contention on freelist access

Target: Mid 4T: +15-20% (13.8 → 16-18 M/s)

Problem Statement

Current freelist uses per-shard mutexes (pthread_mutex_t):

  • 64 shards × 7 classes = 448 mutexes
  • Contention on hot shards (4T workload)
  • Trylock success rate: ~60-70% (Phase 6.25 data)
  • Each lock/unlock: ~20-40 cycles overhead

Opportunity: Replace mutex with lock-free stack (CAS-based):

  • Atomic compare-and-swap: ~10-15 cycles
  • No blocking (always forward progress)
  • Better scalability under contention

Implementation Approach

1. Replace Freelist Mutex with Atomic Head

Current Structure (hakmem_pool.c:276-280):

static struct {
    PoolBlock* freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // ...
} g_pool;

New Structure:

static struct {
    // Lock-free freelist head (atomic pointer)
    atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

    // Lock-free counter (for non-empty bitmap update)
    atomic_uint freelist_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

    // Keep nonempty_mask (atomic already)
    atomic_uint_fast64_t nonempty_mask[POOL_NUM_CLASSES];

    // Remote stack (already lock-free)
    atomic_uintptr_t remote_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    atomic_uint remote_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

    // ... (rest unchanged)
} g_pool;
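With the mutexes gone, hak_pool_init() only needs explicit atomic stores (init runs single-threaded); a minimal sketch assuming the structure above:

// Sketch: initialize atomics where pthread_mutex_init used to run.
for (int c = 0; c < POOL_NUM_CLASSES; c++) {
    for (int s = 0; s < POOL_NUM_SHARDS; s++) {
        atomic_store_explicit(&g_pool.freelist_head[c][s], (uintptr_t)0,
                              memory_order_relaxed);
        atomic_store_explicit(&g_pool.freelist_count[c][s], 0u,
                              memory_order_relaxed);
    }
    atomic_store_explicit(&g_pool.nonempty_mask[c], 0, memory_order_relaxed);
}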

2. Implement Lock-Free Push/Pop

Lock-Free Pop (replace mutex-based pop):

// Pop block from lock-free freelist
// Returns: block pointer, or NULL if empty
static inline PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    uintptr_t old_head;
    PoolBlock* block;

    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                         memory_order_acquire);
        if (!old_head) {
            return NULL;  // Empty
        }

        block = (PoolBlock*)old_head;
        // Try CAS: freelist_head = block->next
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)block->next,
                 memory_order_release, memory_order_acquire));

    // Update count
    unsigned old_count = atomic_fetch_sub_explicit(
        &g_pool.freelist_count[class_idx][shard_idx], 1, memory_order_relaxed);

    // Clear nonempty bit if now empty
    if (old_count <= 1) {
        clear_nonempty_bit(class_idx, shard_idx);
    }

    return block;
}

Lock-Free Push (for refill path):

// Push block onto lock-free freelist
static inline void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
    uintptr_t old_head;

    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                         memory_order_acquire);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_acquire));

    // Update count and nonempty bit
    atomic_fetch_add_explicit(&g_pool.freelist_count[class_idx][shard_idx], 1,
                               memory_order_relaxed);
    set_nonempty_bit(class_idx, shard_idx);
}

Lock-Free Batch Push (for refill, optimization):

// Push multiple blocks atomically (amortize CAS overhead)
static inline void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                                 PoolBlock* head, PoolBlock* tail,
                                                 int count) {
    uintptr_t old_head;

    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                         memory_order_acquire);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)head,
                 memory_order_release, memory_order_acquire));

    atomic_fetch_add_explicit(&g_pool.freelist_count[class_idx][shard_idx], count,
                               memory_order_relaxed);
    set_nonempty_bit(class_idx, shard_idx);
}

3. Refill Path Integration

Modify refill_freelist() (now lock-free):

static int refill_freelist(int class_idx, int shard_idx) {
    // ... (allocate page, split into blocks)

    // OLD: lock → push to freelist → unlock
    // pthread_mutex_lock(lock);
    // block->next = g_pool.freelist[class_idx][shard_idx];
    // g_pool.freelist[class_idx][shard_idx] = freelist_head;
    // pthread_mutex_unlock(lock);

    // NEW: lock-free batch push
    PoolBlock* tail = freelist_head;
    int count = blocks_per_page;
    while (tail->next) {
        tail = tail->next;
    }

    freelist_push_batch_lockfree(class_idx, shard_idx, freelist_head, tail, count);

    return 1;
}
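The tail walk above is O(blocks_per_page) on every refill. Since the split loop builds the chain anyway, it can record the tail as it links; a sketch (the split-loop variables page and block_size are assumed from the existing refill code):

// Sketch: track head/tail/count while splitting the page, so the
// batch push needs no extra list walk.
PoolBlock* head = NULL;
PoolBlock* tail = NULL;
int count = 0;
for (char* p = (char*)page; p + block_size <= (char*)page + POOL_PAGE_SIZE;
     p += block_size) {
    PoolBlock* b = (PoolBlock*)(void*)p;
    b->next = head;                    // link in reverse order
    head = b;
    if (!tail) tail = b;               // first linked block is the tail
    count++;
}
if (head)
    freelist_push_batch_lockfree(class_idx, shard_idx, head, tail, count);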

4. Remote Stack Drain (Lock-Free)

Current: drain_remote_locked() called under mutex
New: Drain into local list, then batch-push lock-free

// Drain remote stack into freelist (lock-free)
static inline void drain_remote_lockfree(int class_idx, int shard_idx) {
    // Atomically swap remote head to NULL (unchanged)
    uintptr_t head = atomic_exchange_explicit(&g_pool.remote_head[class_idx][shard_idx],
                                               (uintptr_t)0, memory_order_acq_rel);
    if (!head) return;

    // Count blocks
    int count = 0;
    PoolBlock* tail = (PoolBlock*)head;
    while (tail->next) {
        tail = tail->next;
        count++;
    }
    count++;  // Include head

    // Batch push to freelist (lock-free)
    freelist_push_batch_lockfree(class_idx, shard_idx, (PoolBlock*)head, tail, count);

    // Update remote count
    atomic_fetch_sub_explicit(&g_pool.remote_count[class_idx][shard_idx], count,
                               memory_order_relaxed);
}

5. Fallback Strategy (Optional)

For rare contention cases (e.g., CAS spin > 100 iterations):

  • Option A: Keep spinning (acceptable for short lists)
  • Option B: Fallback to mutex (hybrid approach)
  • Option C: Backoff + retry (exponential backoff)

Recommendation: Start with Option A (pure lock-free), measure, add backoff if needed.
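If Option C becomes necessary, a bounded exponential backoff can wrap the push CAS; a sketch with placeholder tuning knobs (the 64-spin cap and _mm_pause are x86 assumptions; use a yield/nop on other architectures):

#include <immintrin.h>  // _mm_pause (x86)

// Sketch: lock-free push with exponential backoff on CAS failure.
static inline void freelist_push_backoff(atomic_uintptr_t* head_slot,
                                         PoolBlock* block) {
    int spins = 1;
    uintptr_t old_head = atomic_load_explicit(head_slot, memory_order_acquire);
    for (;;) {
        block->next = (PoolBlock*)old_head;
        if (atomic_compare_exchange_weak_explicit(
                head_slot, &old_head, (uintptr_t)block,
                memory_order_release, memory_order_acquire))
            return;
        // CAS failed: old_head now holds the fresh head. Back off, retry.
        for (int i = 0; i < spins; i++) _mm_pause();
        if (spins < 64) spins <<= 1;   // exponential, capped
    }
}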


File Changes Required

File           Function                        Change Type                   Est. LOC
hakmem_pool.c  Globals                         Replace mutexes with atomics  +10/-10
hakmem_pool.c  freelist_pop_lockfree()         New function                  +30
hakmem_pool.c  freelist_push_lockfree()        New function                  +20
hakmem_pool.c  freelist_push_batch_lockfree()  New function                  +25
hakmem_pool.c  drain_remote_lockfree()         Rewrite (lock-free)           +25/-20
hakmem_pool.c  refill_freelist()               Modify (use batch push)       +10/-15
hakmem_pool.c  hak_pool_try_alloc()            Replace lock/unlock with pop  +5/-10
hakmem_pool.c  hak_pool_free()                 Lock-free path                +10/-10
hakmem_pool.c  hak_pool_init()                 Init atomics (not mutexes)    +5/-5
Total                                                                        ~140 LOC (net ~100)

Testing Strategy

Correctness Test

# Single-threaded (no contention, pure correctness)
THREADS=1 ./test_pool_lockfree

# Multi-threaded stress test (high contention)
THREADS=16 DURATION=60 ./test_pool_lockfree_stress

# Check for:
# - No memory leaks (valgrind)
# - No double-free (AddressSanitizer)
# - No lost blocks (counter invariants)

Performance Test

# Baseline (Phase 6.25, with batching)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh

# Lock-free (Phase 6.26): same command, run against the lock-free build
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh

# Expected: +15-20% on Mid 4T (13.8 → 16-18 M/s)

Contention Analysis

# Measure CAS retry rate
# Add instrumentation:
#   atomic_uint_fast64_t cas_retries;
#   atomic_uint_fast64_t cas_attempts;
# Print ratio at shutdown

# Target: <5% retry rate under 4T load
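A sketch of that instrumentation with relaxed global counters (names are placeholders; gate behind a debug compile flag so release builds pay nothing):

#include <stdatomic.h>
#include <stdio.h>
#include <inttypes.h>

static atomic_uint_fast64_t g_cas_attempts;  // incremented per CAS issued
static atomic_uint_fast64_t g_cas_retries;   // incremented per failed CAS

// In each CAS loop body:
//   atomic_fetch_add_explicit(&g_cas_attempts, 1, memory_order_relaxed);
//   on failure: atomic_fetch_add_explicit(&g_cas_retries, 1, memory_order_relaxed);

static void report_cas_stats(void) {  // e.g., registered via atexit()
    uint64_t a = atomic_load_explicit(&g_cas_attempts, memory_order_relaxed);
    uint64_t r = atomic_load_explicit(&g_cas_retries, memory_order_relaxed);
    fprintf(stderr, "[pool] CAS retry rate: %.2f%% (%" PRIu64 "/%" PRIu64 ")\n",
            a ? 100.0 * (double)r / (double)a : 0.0, r, a);
}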

Risk Assessment

Risk                                 Likelihood  Impact    Mitigation
ABA problem (block reuse)            Low         Critical  Use epoch-based reclamation or hazard pointers
CAS livelock (high contention)       Medium      High      Add exponential backoff after N retries
Memory ordering bugs (subtle races)  Medium      Critical  Extensive testing, TSan, formal verification
Performance regression (1T)          Low         Low       Single-thread has no contention, minimal overhead

ABA Problem:

  • Scenario: Block A is popped, freed, reallocated, and pushed back while another thread's CAS is still in flight; that thread's cached A->next is then stale
  • Assessment: treated as low risk here because pool pages are never unmapped (a stale read cannot fault), but a CAS that succeeds with a stale next can still splice an in-use block into the freelist, so the stress tests above must cover this case
  • Alternative: Add a version counter (128-bit CAS) if issues arise; see the sketch below
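Should ABA show up in stress tests, the version-counter alternative pairs the head with a generation count in one 16-byte CAS; a sketch assuming GCC/Clang on x86-64 (-mcx16 for cmpxchg16b), not a drop-in replacement:

// Sketch: ABA-safe pop using a {pointer, generation} pair.
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    PoolBlock* ptr;
    uint64_t   gen;                     // bumped on every successful pop
} VersionedHead;

static _Atomic VersionedHead g_vhead;   // one per (class, shard) in practice

static PoolBlock* pop_versioned(void) {
    VersionedHead old = atomic_load(&g_vhead);
    for (;;) {
        if (!old.ptr) return NULL;
        VersionedHead next = { old.ptr->next, old.gen + 1 };
        if (atomic_compare_exchange_weak(&g_vhead, &old, next))
            return old.ptr;             // stale heads fail on gen mismatch
    }
}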

Rollback Plan: Keep mutexes in code (ifdef'd out), revert via compile flag if needed.


Estimated Time

  • Implementation: 5-6 hours
    • Lock-free primitives: 2 hours
    • Integration: 2 hours
    • Testing: 2 hours
  • Debugging: 2-3 hours (race conditions, TSan)
  • Benchmarking: 2 hours
  • Total: 9-11 hours

🧠 Phase 6.27: Learner Integration

Goal

Dynamic optimization of CAP and W_MAX based on runtime behavior

Target: +5-10% across all workloads via adaptive tuning

Problem Statement

Current policy is static (set at init):

  • CAP = {64,64,64,32,16,32,32} (conservative)
  • W_MAX_MID = 1.60, W_MAX_LARGE = 1.30
  • No adaptation to workload characteristics

Opportunity: Use existing learner infrastructure to:

  1. Collect size distribution stats
  2. Adjust mid_cap[] dynamically based on hit rate
  3. Adjust w_max_mid based on fragmentation vs hit rate trade-off

Learner Already Exists: hakmem_learner.c (~585 LOC)

  • Background thread (1 sec polling)
  • Hit rate monitoring
  • UCB1 for W_MAX exploration (Canary deployment)
  • Budget enforcement + Water-filling

Integration Work: Minimal (learner already supports Mid Pool tuning)


Implementation Approach

1. Enable Learner for Mid Pool

Already Implemented (hakmem_learner.c:239-272):

// Adjust Mid caps by hit rate vs target (delta over window) with dwell
int mid_classes = 5;
if (cur->mid_dyn1_bytes != 0 && cur->mid_dyn2_bytes != 0) mid_classes = 7;
// ...
for (int i = 0; i < mid_classes; i++) {
    uint64_t dh = mid_hits[i] - prev_mid_hits[i];
    uint64_t dm = mid_misses[i] - prev_mid_misses[i];
    // ...
    if (hit < (tgt_mid - eps)) {
        cap += step_mid;  // Increase CAP
    } else if (hit > (tgt_mid + eps)) {
        cap -= step_mid;  // Decrease CAP
    }
    // ...
}

Action: Just enable via env var!

HAKMEM_LEARN=1 \
HAKMEM_TARGET_HIT_MID=0.65 \
HAKMEM_CAP_STEP_MID=8 \
HAKMEM_CAP_MIN_MID=16 \
HAKMEM_CAP_MAX_MID=512 \
./your_app

2. W_MAX Learning (Optional, Risky)

Already Implemented (hakmem_learner.c:388-499):

  • UCB1 multi-armed bandit
  • Canary deployment (safe exploration)
  • Rollback if performance regresses

Candidates (for Mid Pool):

W_MAX_MID candidates: [1.40, 1.50, 1.60, 1.70]
Default: 1.60 (current)
Exploration: Try 1.50 (tighter, less waste) or 1.70 (looser, higher hit)

Enable:

HAKMEM_LEARN=1 \
HAKMEM_WMAX_LEARN=1 \
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7 \
HAKMEM_WMAX_CANARY=1 \
./your_app

Recommendation: Start with CAP tuning only, add W_MAX later (more risk).

3. Size Distribution Integration (Already Exists)

Histogram (hakmem_size_hist.c):

  • 1KB granularity bins (0-64KB tracked)
  • Per-allocation sampling
  • Reset after learner snapshot

DYN1 Auto-Assignment (already implemented):

HAKMEM_LEARN=1 \
HAKMEM_DYN1_AUTO=1 \
HAKMEM_CAP_MID_DYN1=64 \
./your_app

Effect: Automatically finds peak size in 2-32KB range, assigns DYN1 class.

4. New: ACE Stats Integration

Current ACE (hakmem_ace.c):

  • Records size decisions (original_size → rounded_size → pool)
  • Tracks L1 fallback rate (miss → malloc)
  • Not integrated with learner

Proposal: Add ACE stats to learner score function

Modify Learner Score (hakmem_learner.c:414):

// OLD: simple hit-based score
double score = (double)(ace.mid_hit + ace.large_hit)
             - (double)(ace.mid_miss + ace.large_miss)
             - 2.0 * (double)ace.l1_fallback;

// NEW: add fragmentation penalty
extern uint64_t hak_ace_get_total_waste(void);  // sum of (rounded - original)
uint64_t waste = hak_ace_get_total_waste();
double frag_penalty = (double)waste / 1e6;  // normalize to MB

double score = (double)(ace.mid_hit + ace.large_hit)
             - (double)(ace.mid_miss + ace.large_miss)
             - 2.0 * (double)ace.l1_fallback
             - 0.5 * frag_penalty;  // penalize waste

Benefit: Balance hit rate vs fragmentation (W_MAX tuning).
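The waste counter behind hak_ace_get_total_waste() can be one relaxed atomic in hakmem_ace.c; a sketch of the estimated +15 LOC (hak_ace_note_waste() is a hypothetical hook name, to be called wherever ACE records a size decision):

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

static atomic_uint_fast64_t g_ace_total_waste;  // cumulative (rounded - original) bytes

// Hypothetical hook: call from ACE's size-decision path.
static inline void hak_ace_note_waste(size_t original, size_t rounded) {
    if (rounded > original) {
        atomic_fetch_add_explicit(&g_ace_total_waste,
                                  (uint_fast64_t)(rounded - original),
                                  memory_order_relaxed);
    }
}

uint64_t hak_ace_get_total_waste(void) {
    return (uint64_t)atomic_load_explicit(&g_ace_total_waste,
                                          memory_order_relaxed);
}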


File Changes Required

File              Function                   Change Type                    Est. LOC
hakmem_learner.c  Learner (already exists)   Enable via env                 0
hakmem_ace.c      hak_ace_get_total_waste()  New function                   +15
hakmem_learner.c  learner_main()             Add frag penalty to score      +10
hakmem_policy.c   (none)                     Learner publishes dynamically  0
Total                                                                       ~25 LOC

Testing Strategy

Baseline Test (Learner Off)

# Static policy (current)
RUNTIME=60 THREADS=1,4 ./scripts/run_bench_suite.sh
# Record: Mid 1T, Mid 4T throughput

Learner Test (CAP Tuning)

# Enable learner with aggressive targets
HAKMEM_LEARN=1 \
HAKMEM_TARGET_HIT_MID=0.75 \
HAKMEM_CAP_STEP_MID=8 \
HAKMEM_CAP_MAX_MID=512 \
HAKMEM_LEARN_WINDOW_MS=2000 \
RUNTIME=60 THREADS=1,4 ./scripts/run_bench_suite.sh

# Expected: CAP increases to ~128-256 (hit 75% target)
# Expected: +5-10% throughput improvement

W_MAX Learning Test (Optional)

HAKMEM_LEARN=1 \
HAKMEM_WMAX_LEARN=1 \
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7 \
HAKMEM_WMAX_CANARY=1 \
HAKMEM_WMAX_TRIAL_SEC=5 \
RUNTIME=120 THREADS=1,4 ./scripts/run_bench_suite.sh

# Monitor stderr for learner logs:
# "[Learner] W_MAX mid canary start: 1.50"
# "[Learner] W_MAX mid canary adopt" (success)
# or
# "[Learner] W_MAX mid canary revert to 1.60" (failure)

Regression Test

# Check learner doesn't hurt stable workloads
# Run with learning OFF, then ON, compare variance
# Target: <5% variance, no regressions

Risk Assessment

Risk                                         Likelihood  Impact  Mitigation
Over-tuning (oscillation)                    Medium      Medium  Increase dwell time (3→5 sec)
Under-tuning (no effect)                     Medium      Low     Lower target hit rate (0.75→0.65)
W_MAX instability (fragmentation spike)      Medium      High    Use Canary, revert on regression
Low-traffic workload (insufficient samples)  High        Low     Set min_samples=256, skip learning if below

Rollback Plan: Set HAKMEM_LEARN=0 (default, no learner overhead).


Estimated Time

  • Implementation: 1-2 hours
    • ACE waste tracking: 1 hour
    • Learner score update: 30 min
    • Testing: 30 min
  • Validation: 3-4 hours
    • Run suite with/without learner
    • Analyze CAP convergence
    • W_MAX exploration (if enabled)
  • Total: 4-6 hours

📊 Expected Performance Improvements

Cumulative Gains (Stacked)

Phase             Change             Mid 1T                      Mid 4T                       Rationale
Baseline (6.21)   Current            4.0 M/s (28%)               13.8 M/s (47%)               Post-quick-wins
6.25 (Batch)      Refill 2-4 pages   +10-15% → 4.5-5.0 M/s       +5-8% → 14.5-15.2 M/s        Amortize syscall, 1T bottleneck
6.26 (Lock-Free)  CAS freelist       +2-5% → 4.6-5.2 M/s         +15-20% → 17.0-18.2 M/s      Eliminate 4T contention
6.27 (Learner)    Dynamic CAP/W_MAX  +5-10% → 5.0-5.7 M/s        +5-10% → 18.0-20.0 M/s       Adaptive tuning
Target (60-75%)   vs mimalloc        8.8-11.0 M/s (of 14.6 M/s)  17.7-22.1 M/s (of 29.5 M/s)
Achieved?                            35-39% (still short)        61-68% (on target!)

Gap Analysis

1T Performance:

  • Current: 4.0 M/s (28% of mimalloc)
  • Post-6.27: 5.0-5.7 M/s (35-39% of mimalloc)
  • Gap to the 60-75% target (8.8-11.0 M/s): still need +3.1-6.0 M/s (+55-120%)

Remaining Bottlenecks (1T):

  1. Single-threaded refills still traverse the shared freelist/lock path (TLS caching yields less benefit at 1T)
  2. mimalloc's per-thread heaps eliminate ALL shared state
  3. Bump allocation (mimalloc) vs freelist (hakmem)
  4. Header overhead (32 bytes per alloc in hakmem)

4T Performance:

  • Current: 13.8 M/s (47% of mimalloc)
  • Post-6.27: 18.0-20.0 M/s (61-68% of mimalloc)
  • Target achieved! (60-75% range)

Follow-Up Phases (Post-6.27)

Phase 6.28: Header Elimination (if 1T is still below target)

  • Remove AllocHeader for Mid Pool (use page descriptors only)
  • Saves 32 bytes per allocation (~8-15% memory)
  • Saves header write on alloc hot path (~30-50 cycles)
  • Estimated Gain: +15-20% (1T)

Phase 6.29: Bump Allocation (major refactor)

  • Replace freelist with bump allocator (mimalloc-style)
  • Per-thread arenas, no shared state at all
  • Estimated Gain: +50-100% (1T), brings to mimalloc parity
  • Risk: High complexity, long implementation (~2-3 weeks)

🗓️ Priority-Ordered Task List

Phase 6.25: Refill Batching (Target: Week 1)

  1. Implement alloc_tls_page_batch() function (2 hours)

    • Write batch mmap loop
    • Distribute pages to TLS slots
    • Fill Ring/LIFO from overflow pages
    • Add page descriptors registration
  2. Integrate batch refill into hak_pool_try_alloc() (1 hour)

    • Replace alloc_tls_page() call with batch version
    • Prepare slot array logic
    • Handle partial allocation (< batch_size)
  3. Add environment variable support (30 min)

    • Add g_pool_refill_batch_size global
    • Parse HAKMEM_POOL_REFILL_BATCH in init
    • Validate range (1-4)
  4. Unit testing (1 hour)

    • Test batch=1,2,4 correctness
    • Verify TLS slots filled
    • Check Ring population
    • Valgrind (no leaks)
  5. Benchmark validation (2 hours)

    • Run suite with batch=1 (baseline)
    • Run suite with batch=2,4
    • Analyze throughput delta
    • Target: +10-15% Mid 1T

Total Estimate: 6-7 hours


Phase 6.26: Lock-Free Refill (Target: Week 2)

  1. Replace mutex with atomic freelist (2 hours)

    • Change PoolBlock* freelist[] → atomic_uintptr_t freelist_head[]
    • Add atomic_uint freelist_count[]
    • Remove PaddedMutex freelist_locks[]
  2. Implement lock-free primitives (2 hours)

    • Write freelist_pop_lockfree()
    • Write freelist_push_lockfree()
    • Write freelist_push_batch_lockfree()
  3. Rewrite drain functions (1 hour)

    • drain_remote_lockfree() (no mutex)
    • Count blocks in remote stack
    • Batch push to freelist
  4. Integrate into alloc/free paths (1 hour)

    • Replace lock/pop/unlock with freelist_pop_lockfree()
    • Update refill to use batch push
    • Update free to use lock-free push
  5. Testing (critical for lock-free) (3 hours)

    • Single-thread correctness test
    • Multi-thread stress test (16T, 60 sec)
    • TSan (ThreadSanitizer) run
    • Check counter invariants (no lost blocks)
  6. Benchmark validation (2 hours)

    • Run suite with lock-free (4T focus)
    • Compare to Phase 6.25 baseline
    • Measure CAS retry rate
    • Target: +15-20% Mid 4T

Total Estimate: 11-12 hours


Phase 6.27: Learner Integration (Target: Week 2, parallel)

  1. Add ACE waste tracking (1 hour)

    • Implement hak_ace_get_total_waste() in hakmem_ace.c
    • Track cumulative (rounded - original) per allocation
    • Atomic counter for thread safety
  2. Update learner score function (30 min)

    • Add fragmentation penalty term
    • Weight: -0.5 × (waste_MB)
    • Test score computation
  3. Validation testing (3 hours)

    • Baseline run (learner OFF)
    • CAP tuning run (learner ON, W_MAX fixed)
    • W_MAX learning run (Canary enabled)
    • Compare throughput, check convergence
  4. Documentation (1 hour)

    • Update ENV_VARS.md with learner params
    • Document recommended settings
    • Add troubleshooting guide (oscillation, no effect)

Total Estimate: 5-6 hours


Post-Implementation (Week 3)

  1. Comprehensive benchmarking (4 hours)

    • Full suite (tiny, mid, large) with all phases enabled
    • Head-to-head vs mimalloc (1T, 4T, 8T)
    • Memory profiling (RSS, fragmentation)
    • Generate performance report
  2. Code review & cleanup (2 hours)

    • Remove debug printfs
    • Add comments to complex sections
    • Update copyright/phase headers
    • Check for code duplication
  3. Documentation updates (2 hours)

    • Update INDEX.md with new phases
    • Write PHASE_6.25_6.27_RESULTS.md
    • Update README.md benchmarks section

Total Estimate: 8 hours


📈 Success Metrics

Primary Metrics

Metric              Current   Target         Measurement
Mid 1T Throughput   4.0 M/s   5.0-5.7 M/s    Larson benchmark, 10s
Mid 4T Throughput   13.8 M/s  18.0-20.0 M/s  Larson benchmark, 10s
Mid 1T vs mimalloc  28%       35-39%         Ratio of throughputs
Mid 4T vs mimalloc  47%       61-68%         Ratio of throughputs

Secondary Metrics

Metric                 Current    Target        Measurement
Refill frequency (1T)  ~1000/sec  ~250-500/sec  Counter delta
Lock contention (4T)   ~40% wait  <10% wait     Trylock success rate
Hit rate (Mid Pool)    ~60%       70-80%        hits / (hits + misses)
Memory footprint       22 MB      <30 MB        RSS baseline

Regression Thresholds

Scenario                   Threshold        Action
Tiny Pool 4T               <2% regression   Acceptable
Large Pool                 <5% regression   Acceptable
Memory bloat               >40 MB baseline  Reduce CAP or batch
Crash/hang in stress test  Any occurrence   Block release, debug

🎬 Conclusion

This implementation plan provides a systematic path to improve hakmem's Mid Pool performance from 47% to 61-68% of mimalloc for multi-threaded workloads (4T), bringing it into the target range of 60-75%.

Key Insights:

  1. Phase 6.25 (Batching): Low risk, medium reward, tackles 1T bottleneck
  2. Phase 6.26 (Lock-Free): Medium risk, high reward, critical for 4T scaling
  3. Phase 6.27 (Learner): Low risk, low-medium reward, adaptive optimization

Recommendation:

  • Implement 6.25 and 6.27 in parallel (independent, ~12 hours total)
  • Tackle 6.26 after 6.25 validated (builds on batch refill, ~12 hours)
  • Total time: ~24-30 hours (3-4 days focused work)

Next Steps:

  1. Review this plan with team
  2. Set up benchmarking pipeline (automated, reproducible)
  3. Implement Phase 6.25 (highest priority)
  4. Measure, iterate, document

Open Questions:

  • Should we extend TLS slots from 2 to 4? (Test in 6.25)
  • Is W_MAX learning worth the risk? (Test in 6.27 with Canary)
  • After 6.27, pursue header elimination (Phase 6.28) or accept 1T gap?

Document Version: 1.0
Last Updated: 2025-10-24
Author: Claude (Sonnet 4.5)
Status: Ready for Implementation