Phase 6.25-6.27: Implementation Plan - Catching Up with mimalloc
Date: 2025-10-24 Status: 📋 Planning Target: Reach 60-75% of mimalloc performance for Mid Pool
📊 Current Baseline (Phase 6.21 Results)
Performance vs mimalloc
| Workload | Threads | hakmem | mimalloc | Ratio | Gap |
|---|---|---|---|---|---|
| Mid | 1T | 4.0 M/s | 14.6 M/s | 28% | -72% |
| Mid | 4T | 13.8 M/s | 29.5 M/s | 47% | -53% |
| Tiny | 1T | 19.4 M/s | 32.6 M/s | 59% | -41% |
| Tiny | 4T | 48.0 M/s | 65.7 M/s | 73% | -27% |
| Large | 1T | 0.6 M/s | 2.1 M/s | 29% | -71% |
Key Insights:
- ✅ Phase 6.25 Quick Wins achieved +37.8% for Mid 4T (10.0 → 13.8 M/s)
- ❌ Mid Pool still significantly behind mimalloc (28% 1T, 47% 4T)
- 🎯 Target: 60-75% of mimalloc = 8.8-11.0 M/s (1T), 17.7-22.1 M/s (4T)
Current Mid Pool Architecture
┌─────────────────────────────────────────────────────────┐
│ TLS Fast Path (Lock-Free) │
├─────────────────────────────────────────────────────────┤
│ 1. TLS Ring Buffer (RING_CAP=32) │
│ - LIFO cache for recently freed blocks │
│ - Per-class, per-thread │
│ - Phase 6.25: 16→32 increased hit rate │
│ │
│ 2. TLS Active Pages (x2: page_a, page_b) │
│ - Bump-run allocation (no per-block links) │
│ - Owner-thread private (lock-free) │
│ - 64KB pages, split on-demand │
├─────────────────────────────────────────────────────────┤
│ Shared State (Lock-Based) │
├─────────────────────────────────────────────────────────┤
│ 3. Per-class Freelist (64 shards) │
│ - Mutex-protected per (class, shard) │
│ - Site-based sharding (reduce contention) │
│ - Refill on demand via refill_freelist() │
│ │
│ 4. Remote Stack (MPSC, lock-free push) │
│ - Cross-thread free target │
│ - Drained into freelist under lock │
│ │
│ 5. Transfer Cache (TC, Phase 6.20) │
│ - Per-thread inbox (atomic CAS) │
│ - Owner-aware routing │
│ - Drain trigger: ring->top < 2 │
└─────────────────────────────────────────────────────────┘
Refill Flow (Current):
Ring empty → Check Active Pages → Lock Shard → Pop freelist
→ Drain remote → Shard steal (if CAP reached) → **refill_freelist()**
Refill Implementation:
- Allocates **1 page** (64KB) via mmap
- Splits into blocks, links into freelist
- ACE bundle factor: 1-4 pages (adaptive)
Bottlenecks Identified
From Phase 6.20 Analysis:
- Refill Latency (Primary)
  - Single-page refill: 1 mmap syscall per refill
  - Freelist rebuilding overhead (linking blocks)
  - Mutex hold time during refill (~100-150 cycles)
  - Impact: ~40% of alloc time in Mid 1T
- Lock Contention (Secondary)
  - 64 shards × 7 classes = 448 mutexes
  - Even with sharding, 4T shows contention
  - Trylock success rate: ~60-70% (Phase 6.25 data)
  - Impact: ~25% of alloc time in Mid 4T
- CAP/W_MAX Sub-optimal (Tertiary)
  - Static configuration (no runtime adaptation)
  - W_MAX=1.60 (Mid), 1.30 (Large) → some fallback to L1
  - CAP={64,64,64,32,16} → conservative, low hit rate
  - Impact: ~10-15% missed pool opportunities
🎯 Phase 6.25 Core: Refill Batching
Goal
Reduce refill latency by allocating multiple pages at once
Target: Mid 1T: +10-15% (4.0 → 4.5-5.0 M/s)
Problem Statement
Current refill_freelist() allocates 1 page per call:
- 1 mmap syscall (~200-300 cycles)
- 1 page split + freelist rebuild (~100-150 cycles)
- Held under mutex lock (blocks other threads)
- Amortized cost per block: HIGH for small classes (e.g., 2KB = 32 blocks/page)
Opportunity: Allocate 2-4 pages in batch to amortize costs:
- mmap overhead: 300 cycles → 75-150 cycles/page (batched)
- Freelist rebuild: done in parallel or optimized
- Fill multiple TLS page slots + Ring buffer aggressively
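A rough back-of-envelope using the cycle estimates above (estimates, not measurements): for the 2KB class (32 blocks per 64KB page), a single-page refill costs about (300 + 150) / 32 ≈ 14 cycles per block, while a batched refill at ~75-150 cycles/page of amortized mmap plus ~150 cycles/page of rebuild comes to (225-300) / 32 ≈ 7-9 cycles per block, roughly halving the amortized refill cost.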
Implementation Approach
1. Create alloc_tls_page_batch() Function
Location: hakmem_pool.c (after alloc_tls_page(), line ~486)
Signature:
// Allocate multiple pages in batch and distribute to TLS structures
// Returns: number of pages successfully allocated (0-batch_size)
static int alloc_tls_page_batch(int class_idx, int batch_size,
PoolTLSPage* slots[], int num_slots,
PoolTLSRing* ring, PoolTLSBin* bin);
Pseudocode:
static int alloc_tls_page_batch(int class_idx, int batch_size,
PoolTLSPage* slots[], int num_slots,
PoolTLSRing* ring, PoolTLSBin* bin) {
size_t user_size = g_class_sizes[class_idx];
size_t block_size = HEADER_SIZE + user_size;
int blocks_per_page = POOL_PAGE_SIZE / block_size;
if (blocks_per_page <= 0) return 0;
int allocated = 0;
// Allocate pages in batch (strategy: multiple mmaps or single large mmap)
// Option A: Multiple mmaps (simpler, compatible with existing infra)
for (int i = 0; i < batch_size; i++) {
void* page = mmap(NULL, POOL_PAGE_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (page == MAP_FAILED) break; // mmap reports failure as MAP_FAILED, not NULL
// Prefault (Phase 6.25 quick win)
for (size_t j = 0; j < POOL_PAGE_SIZE; j += 4096) {
((volatile char*)page)[j] = 0;
}
// Strategy: Fill TLS slots first, then fill Ring/LIFO
if (allocated < num_slots && slots[allocated]) {
// Assign to TLS active page slot (bump-run init)
PoolTLSPage* ap = slots[allocated];
ap->page = page;
ap->bump = (char*)page;
ap->end = (char*)page + POOL_PAGE_SIZE;
ap->count = blocks_per_page;
// Register page descriptor
mid_desc_register(page, class_idx, (uint64_t)(uintptr_t)pthread_self());
} else {
// Fill Ring + LIFO from this page
char* bump = (char*)page;
char* end = (char*)page + POOL_PAGE_SIZE;
for (int k = 0; k < blocks_per_page; k++) {
PoolBlock* b = (PoolBlock*)(void*)bump;
// Try Ring first, then LIFO
if (ring && ring->top < POOL_TLS_RING_CAP) {
ring->items[ring->top++] = b;
} else if (bin) {
b->next = bin->lo_head;
bin->lo_head = b;
bin->lo_count++;
}
bump += block_size;
if (bump >= end) break;
}
mid_desc_register(page, class_idx, (uint64_t)(uintptr_t)pthread_self());
}
allocated++;
g_pool.total_pages_allocated++;
g_pool.pages_by_class[class_idx]++;
g_pool.total_bytes_allocated += POOL_PAGE_SIZE;
}
if (allocated > 0) {
g_pool.refills[class_idx]++;
}
return allocated;
}
2. Modify Refill Call Sites
Location: hakmem_pool.c:931 (inside hak_pool_try_alloc, refill path)
Before:
if (alloc_tls_page(class_idx, tap)) {
// ... use newly allocated page
}
After:
// Determine batch size from env var (default 2, clamped to 1-4)
int batch = g_pool_refill_batch_size; // new global config
if (batch < 1) batch = 1;
if (batch > 4) batch = 4;
// Prepare slot array (up to 2 TLS slots)
PoolTLSPage* slots[2] = {NULL, NULL};
int num_slots = 0;
if (g_tls_active_page_a[class_idx].page == NULL || g_tls_active_page_a[class_idx].count == 0) {
slots[num_slots++] = &g_tls_active_page_a[class_idx];
}
if (g_tls_active_page_b[class_idx].page == NULL || g_tls_active_page_b[class_idx].count == 0) {
slots[num_slots++] = &g_tls_active_page_b[class_idx];
}
// Call batch allocator
int allocated = alloc_tls_page_batch(class_idx, batch, slots, num_slots,
&g_tls_bin[class_idx].ring,
&g_tls_bin[class_idx]);
if (allocated > 0) {
pthread_mutex_unlock(lock);
// Use ring or active page as usual
// ...
}
3. Add Environment Variable
Global Config (add to hakmem_pool.c globals, ~line 316):
static int g_pool_refill_batch_size = 2; // env: HAKMEM_POOL_REFILL_BATCH (1-4)
Init (add to hak_pool_init(), ~line 716):
const char* e_batch = getenv("HAKMEM_POOL_REFILL_BATCH");
if (e_batch) {
int v = atoi(e_batch);
if (v >= 1 && v <= 4) g_pool_refill_batch_size = v;
}
4. Extend TLS Active Page Slots (Optional)
Current: 2 slots (page_a, page_b). Proposal: add page_c and page_d for batch_size=4 (if beneficial).
Trade-off:
- ✅ Pro: More TLS-local inventory, fewer shared accesses
- ❌ Con: Increased TLS memory footprint (~256 bytes/class)
Recommendation: Start with 2 slots, measure, then extend if needed.
File Changes Required
| File | Function | Change Type | Est. LOC |
|---|---|---|---|
| hakmem_pool.c | alloc_tls_page_batch() | New function | +80 |
| hakmem_pool.c | hak_pool_try_alloc() | Modify refill path | +30 |
| hakmem_pool.c | Globals | Add g_pool_refill_batch_size | +1 |
| hakmem_pool.c | hak_pool_init() | Parse env var | +5 |
| hakmem_pool.h | (none) | No public API change | 0 |
| Total | | | ~116 LOC |
Testing Strategy
Unit Test
# Test batch allocation works
HAKMEM_POOL_REFILL_BATCH=4 ./test_pool_refill
# Verify TLS slots filled correctly
# Check Ring buffer populated
# Check no memory leaks
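A minimal harness sketch for the test_pool_refill binary invoked above, assuming the allocator interposes malloc/free; the allocation size and count are illustrative:

```c
// test_pool_refill.c - exercise the Mid Pool refill path (sketch).
// Allocating many 2KB blocks forces repeated refills in the 2KB class.
#include <stdlib.h>
#include <assert.h>

int main(void) {
    enum { N = 4096, SZ = 2048 };
    static void* ptrs[N];
    for (int i = 0; i < N; i++) {
        ptrs[i] = malloc(SZ);      // routed through the hakmem Mid Pool
        assert(ptrs[i] != NULL);
    }
    for (int i = 0; i < N; i++) free(ptrs[i]);
    return 0;                      // run under valgrind for the leak check
}
```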
Benchmark Test
# Baseline (batch=1, current behavior)
HAKMEM_POOL_REFILL_BATCH=1 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh
# Batch=2 (conservative)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh
# Batch=4 (aggressive)
HAKMEM_POOL_REFILL_BATCH=4 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh
# Expected: +10-15% on Mid 1T (4.0 → 4.5-5.0 M/s)
Failure Modes to Watch
- Memory bloat: Batch too large → excessive pre-allocation
  - Monitor: RSS growth, pages_allocated counter
  - Mitigation: Cap batch_size at 4, respect CAP limits
- Ring overflow: Batch fills Ring, blocks get lost
  - Monitor: Ring underflow counter (should decrease)
  - Mitigation: Properly route overflow to LIFO
- TLS slot contention: Multiple threads allocating same class
  - Monitor: Active page descriptor conflicts
  - Mitigation: Per-thread ownership (already enforced)
Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Memory bloat (over-allocation) | Medium | High | Cap at batch=4, respect CAP limits |
| Complexity (harder to debug) | Low | Medium | Extensive logging, unit tests |
| Backward compat (existing workloads) | Low | Low | Default batch=2 (conservative) |
| Regression (slower than 1-page) | Low | Medium | A/B test, fallback to batch=1 |
Rollback Plan: Set HAKMEM_POOL_REFILL_BATCH=1 to restore original behavior (zero code change).
Estimated Time
- Implementation: 3-4 hours
- Core function: 2 hours
- Integration: 1 hour
- Testing: 1 hour
- Benchmarking: 2 hours
- Run suite 3x (batch=1,2,4)
- Analyze results
- Total: 5-6 hours
🔓 Phase 6.26: Lock-Free Refill
Goal
Eliminate lock contention on freelist access
Target: Mid 4T: +15-20% (13.8 → 16-18 M/s)
Problem Statement
Current freelist uses per-shard mutexes (pthread_mutex_t):
- 64 shards × 7 classes = 448 mutexes
- Contention on hot shards (4T workload)
- Trylock success rate: ~60-70% (Phase 6.25 data)
- Each lock/unlock: ~20-40 cycles overhead
Opportunity: Replace mutex with lock-free stack (CAS-based):
- Atomic compare-and-swap: ~10-15 cycles
- No blocking (always forward progress)
- Better scalability under contention
Implementation Approach
1. Replace Freelist Mutex with Atomic Head
Current Structure (hakmem_pool.c:276-280):
static struct {
PoolBlock* freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
// ...
} g_pool;
New Structure:
static struct {
// Lock-free freelist head (atomic pointer)
atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
// Lock-free counter (for non-empty bitmap update)
atomic_uint freelist_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
// Keep nonempty_mask (atomic already)
atomic_uint_fast64_t nonempty_mask[POOL_NUM_CLASSES];
// Remote stack (already lock-free)
atomic_uintptr_t remote_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
atomic_uint remote_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
// ... (rest unchanged)
} g_pool;
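The matching hak_pool_init() change (see the file-changes table below) is mechanical; a minimal sketch, assuming the structure above:

```c
// Initialize atomic freelist heads instead of pthread_mutex_init per shard.
for (int c = 0; c < POOL_NUM_CLASSES; c++) {
    for (int s = 0; s < POOL_NUM_SHARDS; s++) {
        atomic_store_explicit(&g_pool.freelist_head[c][s], (uintptr_t)0,
                              memory_order_relaxed);
        atomic_store_explicit(&g_pool.freelist_count[c][s], 0u,
                              memory_order_relaxed);
    }
    atomic_store_explicit(&g_pool.nonempty_mask[c], 0, memory_order_relaxed);
}
```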
2. Implement Lock-Free Push/Pop
Lock-Free Pop (replace mutex-based pop):
// Pop block from lock-free freelist
// Returns: block pointer, or NULL if empty
static inline PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
uintptr_t old_head;
PoolBlock* block;
do {
old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
memory_order_acquire);
if (!old_head) {
return NULL; // Empty
}
block = (PoolBlock*)old_head;
// Try CAS: freelist_head = block->next
} while (!atomic_compare_exchange_weak_explicit(
&g_pool.freelist_head[class_idx][shard_idx],
&old_head, (uintptr_t)block->next,
memory_order_release, memory_order_acquire));
// Update count
unsigned old_count = atomic_fetch_sub_explicit(
&g_pool.freelist_count[class_idx][shard_idx], 1, memory_order_relaxed);
// Clear nonempty bit if now empty
if (old_count <= 1) {
clear_nonempty_bit(class_idx, shard_idx);
}
return block;
}
Lock-Free Push (for refill path):
// Push block onto lock-free freelist
static inline void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
uintptr_t old_head;
do {
old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
memory_order_acquire);
block->next = (PoolBlock*)old_head;
} while (!atomic_compare_exchange_weak_explicit(
&g_pool.freelist_head[class_idx][shard_idx],
&old_head, (uintptr_t)block,
memory_order_release, memory_order_acquire));
// Update count and nonempty bit
atomic_fetch_add_explicit(&g_pool.freelist_count[class_idx][shard_idx], 1,
memory_order_relaxed);
set_nonempty_bit(class_idx, shard_idx);
}
Lock-Free Batch Push (for refill, optimization):
// Push multiple blocks atomically (amortize CAS overhead)
static inline void freelist_push_batch_lockfree(int class_idx, int shard_idx,
PoolBlock* head, PoolBlock* tail,
int count) {
uintptr_t old_head;
do {
old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
memory_order_acquire);
tail->next = (PoolBlock*)old_head;
} while (!atomic_compare_exchange_weak_explicit(
&g_pool.freelist_head[class_idx][shard_idx],
&old_head, (uintptr_t)head,
memory_order_release, memory_order_acquire));
atomic_fetch_add_explicit(&g_pool.freelist_count[class_idx][shard_idx], count,
memory_order_relaxed);
set_nonempty_bit(class_idx, shard_idx);
}
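The snippets above call set_nonempty_bit() / clear_nonempty_bit() without showing them; minimal sketches, assuming nonempty_mask[class] keeps one bit per shard (64 shards fit the uint_fast64_t mask):

```c
static inline void set_nonempty_bit(int class_idx, int shard_idx) {
    atomic_fetch_or_explicit(&g_pool.nonempty_mask[class_idx],
                             (uint_fast64_t)1 << shard_idx,
                             memory_order_release);
}

static inline void clear_nonempty_bit(int class_idx, int shard_idx) {
    atomic_fetch_and_explicit(&g_pool.nonempty_mask[class_idx],
                              ~((uint_fast64_t)1 << shard_idx),
                              memory_order_release);
}
```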
3. Refill Path Integration
Modify refill_freelist() (now lock-free):
static int refill_freelist(int class_idx, int shard_idx) {
// ... (allocate page, split into blocks)
// OLD: lock → push to freelist → unlock
// pthread_mutex_lock(lock);
// block->next = g_pool.freelist[class_idx][shard_idx];
// g_pool.freelist[class_idx][shard_idx] = freelist_head;
// pthread_mutex_unlock(lock);
// NEW: lock-free batch push
PoolBlock* tail = freelist_head;
int count = blocks_per_page;
while (tail->next) {
tail = tail->next;
}
freelist_push_batch_lockfree(class_idx, shard_idx, freelist_head, tail, count);
return 1;
}
4. Remote Stack Drain (Lock-Free)
Current: drain_remote_locked() called under mutex
New: Drain into local list, then batch-push lock-free
// Drain remote stack into freelist (lock-free)
static inline void drain_remote_lockfree(int class_idx, int shard_idx) {
// Atomically swap remote head to NULL (unchanged)
uintptr_t head = atomic_exchange_explicit(&g_pool.remote_head[class_idx][shard_idx],
(uintptr_t)0, memory_order_acq_rel);
if (!head) return;
// Count blocks
int count = 0;
PoolBlock* tail = (PoolBlock*)head;
while (tail->next) {
tail = tail->next;
count++;
}
count++; // Include head
// Batch push to freelist (lock-free)
freelist_push_batch_lockfree(class_idx, shard_idx, (PoolBlock*)head, tail, count);
// Update remote count
atomic_fetch_sub_explicit(&g_pool.remote_count[class_idx][shard_idx], count,
memory_order_relaxed);
}
5. Fallback Strategy (Optional)
For rare contention cases (e.g., CAS spin > 100 iterations):
- Option A: Keep spinning (acceptable for short lists)
- Option B: Fallback to mutex (hybrid approach)
- Option C: Backoff + retry (exponential backoff)
Recommendation: Start with Option A (pure lock-free), measure, add backoff if needed.
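If Option C is adopted later, a minimal backoff sketch (assumes x86 for _mm_pause(); other ISAs would need a relaxed-spin substitute):

```c
#include <immintrin.h>  // _mm_pause() (x86)

// Bounded exponential backoff between failed CAS attempts.
static inline void cas_backoff(unsigned attempt) {
    unsigned spins = 1u << (attempt < 10 ? attempt : 10);  // cap at 1024
    while (spins--) _mm_pause();  // ease pipeline pressure while spinning
}

// Usage inside a CAS loop:
//   unsigned tries = 0;
//   while (!atomic_compare_exchange_weak_explicit(...)) cas_backoff(tries++);
```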
File Changes Required
| File | Function | Change Type | Est. LOC |
|---|---|---|---|
| hakmem_pool.c | Globals | Replace mutexes with atomics | +10/-10 |
| hakmem_pool.c | freelist_pop_lockfree() | New function | +30 |
| hakmem_pool.c | freelist_push_lockfree() | New function | +20 |
| hakmem_pool.c | freelist_push_batch_lockfree() | New function | +25 |
| hakmem_pool.c | drain_remote_lockfree() | Rewrite (lock-free) | +25/-20 |
| hakmem_pool.c | refill_freelist() | Modify (use batch push) | +10/-15 |
| hakmem_pool.c | hak_pool_try_alloc() | Replace lock/unlock with pop | +5/-10 |
| hakmem_pool.c | hak_pool_free() | Lock-free path | +10/-10 |
| hakmem_pool.c | hak_pool_init() | Init atomics (not mutexes) | +5/-5 |
| Total | | | ~140 LOC (net ~100) |
Testing Strategy
Correctness Test
# Single-threaded (no contention, pure correctness)
THREADS=1 ./test_pool_lockfree
# Multi-threaded stress test (high contention)
THREADS=16 DURATION=60 ./test_pool_lockfree_stress
# Check for:
# - No memory leaks (valgrind)
# - No double-free (AddressSanitizer)
# - No lost blocks (counter invariants)
Performance Test
# Baseline (Phase 6.25, with batching)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh
# Lock-free (Phase 6.26; same env, rebuilt with the lock-free freelist)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh
# Expected: +15-20% on Mid 4T (13.8 → 16-18 M/s)
Contention Analysis
# Measure CAS retry rate
# Add instrumentation:
# atomic_uint_fast64_t cas_retries;
# atomic_uint_fast64_t cas_attempts;
# Print ratio at shutdown
# Target: <5% retry rate under 4T load
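A sketch of that instrumentation; the counter names are assumptions, not existing fields:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint64_t g_cas_attempts = 0;  // one per CAS-loop iteration
static _Atomic uint64_t g_cas_retries  = 0;  // iterations after the first

// Report retry rate at shutdown, e.g., registered via atexit().
static void report_cas_stats(void) {
    uint64_t a = atomic_load(&g_cas_attempts);
    uint64_t r = atomic_load(&g_cas_retries);
    if (a) fprintf(stderr, "[pool] CAS retry rate: %.2f%%\n",
                   100.0 * (double)r / (double)a);
}
```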
Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| ABA problem (block reuse) | Low | Critical | Use epoch-based reclamation or hazard pointers |
| CAS livelock (high contention) | Medium | High | Add exponential backoff after N retries |
| Memory ordering bugs (subtle races) | Medium | Critical | Extensive testing, TSan, formal verification |
| Performance regression (1T) | Low | Low | Single-thread has no contention, minimal overhead |
ABA Problem:
- Scenario: Block A popped, freed, reallocated, pushed back while another thread's CAS is in-flight
- Solution: often tolerable for a freelist, but note that a stale block->next can be spliced in if a block is popped, reallocated, and repushed while a CAS is in flight; verify under TSan/stress before relying on this
- Alternative: Add version counter (128-bit CAS) if issues arise
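For reference, a hedged sketch of that version-counter alternative (assumes a platform with 16-byte CAS, e.g., x86-64 cmpxchg16b via -mcx16, possibly linking libatomic; all names are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

// Pointer + version travel together; any intervening pop/push bumps ver,
// so a stale next pointer can never be installed (defeats ABA).
typedef struct { PoolBlock* ptr; uintptr_t ver; } TaggedHead;

static _Atomic TaggedHead g_tagged_head; // one per (class, shard) in practice

static PoolBlock* tagged_pop(void) {
    TaggedHead old = atomic_load(&g_tagged_head);
    TaggedHead upd;
    do {
        if (!old.ptr) return NULL;
        upd.ptr = old.ptr->next;  // safe here: pool pages are never unmapped
        upd.ver = old.ver + 1;
    } while (!atomic_compare_exchange_weak(&g_tagged_head, &old, upd));
    return old.ptr;
}
```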
Rollback Plan: Keep mutexes in code (ifdef'd out), revert via compile flag if needed.
Estimated Time
- Implementation: 5-6 hours
- Lock-free primitives: 2 hours
- Integration: 2 hours
- Testing: 2 hours
- Debugging: 2-3 hours (race conditions, TSan)
- Benchmarking: 2 hours
- Total: 9-11 hours
🧠 Phase 6.27: Learner Integration
Goal
Dynamic optimization of CAP and W_MAX based on runtime behavior
Target: +5-10% across all workloads via adaptive tuning
Problem Statement
Current policy is static (set at init):
- CAP = {64,64,64,32,16,32,32} (conservative)
- W_MAX_MID = 1.60, W_MAX_LARGE = 1.30
- No adaptation to workload characteristics
Opportunity: Use existing learner infrastructure to:
- Collect size distribution stats
- Adjust mid_cap[] dynamically based on hit rate
- Adjust w_max_mid based on the fragmentation vs hit rate trade-off
Learner Already Exists: hakmem_learner.c (~585 LOC)
- Background thread (1 sec polling)
- Hit rate monitoring
- UCB1 for W_MAX exploration (Canary deployment)
- Budget enforcement + Water-filling
Integration Work: Minimal (learner already supports Mid Pool tuning)
Implementation Approach
1. Enable Learner for Mid Pool
Already Implemented (hakmem_learner.c:239-272):
// Adjust Mid caps by hit rate vs target (delta over window) with dwell
int mid_classes = 5;
if (cur->mid_dyn1_bytes != 0 && cur->mid_dyn2_bytes != 0) mid_classes = 7;
// ...
for (int i = 0; i < mid_classes; i++) {
uint64_t dh = mid_hits[i] - prev_mid_hits[i];
uint64_t dm = mid_misses[i] - prev_mid_misses[i];
// ...
if (hit < (tgt_mid - eps)) {
cap += step_mid; // Increase CAP
} else if (hit > (tgt_mid + eps)) {
cap -= step_mid; // Decrease CAP
}
// ...
}
Action: Just enable via env var!
HAKMEM_LEARN=1 \
HAKMEM_TARGET_HIT_MID=0.65 \
HAKMEM_CAP_STEP_MID=8 \
HAKMEM_CAP_MIN_MID=16 \
HAKMEM_CAP_MAX_MID=512 \
./your_app
2. W_MAX Learning (Optional, Risky)
Already Implemented (hakmem_learner.c:388-499):
- UCB1 multi-armed bandit
- Canary deployment (safe exploration)
- Rollback if performance regresses
Candidates (for Mid Pool):
W_MAX_MID candidates: [1.40, 1.50, 1.60, 1.70]
Default: 1.60 (current)
Exploration: Try 1.50 (tighter, less waste) or 1.70 (looser, higher hit)
Enable:
HAKMEM_LEARN=1 \
HAKMEM_WMAX_LEARN=1 \
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7 \
HAKMEM_WMAX_CANARY=1 \
./your_app
Recommendation: Start with CAP tuning only, add W_MAX later (more risk).
3. Size Distribution Integration (Already Exists)
Histogram (hakmem_size_hist.c):
- 1KB granularity bins (0-64KB tracked)
- Per-allocation sampling
- Reset after learner snapshot
DYN1 Auto-Assignment (already implemented):
HAKMEM_LEARN=1 \
HAKMEM_DYN1_AUTO=1 \
HAKMEM_CAP_MID_DYN1=64 \
./your_app
Effect: Automatically finds peak size in 2-32KB range, assigns DYN1 class.
4. New: ACE Stats Integration
Current ACE (hakmem_ace.c):
- Records size decisions (original_size → rounded_size → pool)
- Tracks L1 fallback rate (miss → malloc)
- Not integrated with learner
Proposal: Add ACE stats to learner score function
Modify Learner Score (hakmem_learner.c:414):
// OLD: simple hit-based score
double score = (double)(ace.mid_hit + ace.large_hit)
- (double)(ace.mid_miss + ace.large_miss)
- 2.0 * (double)ace.l1_fallback;
// NEW: add fragmentation penalty
extern uint64_t hak_ace_get_total_waste(void); // sum of (rounded - original)
uint64_t waste = hak_ace_get_total_waste();
double frag_penalty = (double)waste / 1e6; // normalize to MB
double score = (double)(ace.mid_hit + ace.large_hit)
- (double)(ace.mid_miss + ace.large_miss)
- 2.0 * (double)ace.l1_fallback
- 0.5 * frag_penalty; // penalize waste
Benefit: Balance hit rate vs fragmentation (W_MAX tuning).
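A minimal sketch of the proposed hak_ace_get_total_waste() (the function name comes from this plan; a single rounding chokepoint where both sizes are visible is an assumption about hakmem_ace.c):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

static _Atomic uint64_t g_ace_total_waste = 0;

// Call wherever ACE rounds a request: rounded >= original by construction.
static inline void hak_ace_note_waste(size_t original, size_t rounded) {
    atomic_fetch_add_explicit(&g_ace_total_waste,
                              (uint64_t)(rounded - original),
                              memory_order_relaxed);
}

uint64_t hak_ace_get_total_waste(void) {
    return atomic_load_explicit(&g_ace_total_waste, memory_order_relaxed);
}
```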
File Changes Required
| File | Function | Change Type | Est. LOC |
|---|---|---|---|
| hakmem_learner.c | Learner (already exists) | Enable via env | 0 |
| hakmem_ace.c | hak_ace_get_total_waste() | New function | +15 |
| hakmem_learner.c | learner_main() | Add frag penalty to score | +10 |
| hakmem_policy.c | (none) | Learner publishes dynamically | 0 |
| Total | | | ~25 LOC |
Testing Strategy
Baseline Test (Learner Off)
# Static policy (current)
RUNTIME=60 THREADS=1,4 ./scripts/run_bench_suite.sh
# Record: Mid 1T, Mid 4T throughput
Learner Test (CAP Tuning)
# Enable learner with aggressive targets
HAKMEM_LEARN=1 \
HAKMEM_TARGET_HIT_MID=0.75 \
HAKMEM_CAP_STEP_MID=8 \
HAKMEM_CAP_MAX_MID=512 \
HAKMEM_LEARN_WINDOW_MS=2000 \
RUNTIME=60 THREADS=1,4 ./scripts/run_bench_suite.sh
# Expected: CAP increases to ~128-256 (hit 75% target)
# Expected: +5-10% throughput improvement
W_MAX Learning Test (Optional)
HAKMEM_LEARN=1 \
HAKMEM_WMAX_LEARN=1 \
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7 \
HAKMEM_WMAX_CANARY=1 \
HAKMEM_WMAX_TRIAL_SEC=5 \
RUNTIME=120 THREADS=1,4 ./scripts/run_bench_suite.sh
# Monitor stderr for learner logs:
# "[Learner] W_MAX mid canary start: 1.50"
# "[Learner] W_MAX mid canary adopt" (success)
# or
# "[Learner] W_MAX mid canary revert to 1.60" (failure)
Regression Test
# Check learner doesn't hurt stable workloads
# Run with learning OFF, then ON, compare variance
# Target: <5% variance, no regressions
Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Over-tuning (oscillation) | Medium | Medium | Increase dwell time (3→5 sec) |
| Under-tuning (no effect) | Medium | Low | Lower target hit rate (0.75→0.65) |
| W_MAX instability (fragmentation spike) | Medium | High | Use Canary, revert on regression |
| Low-traffic workload (insufficient samples) | High | Low | Set min_samples=256, skip learning if below |
Rollback Plan: Set HAKMEM_LEARN=0 (default, no learner overhead).
Estimated Time
- Implementation: 1-2 hours
- ACE waste tracking: 1 hour
- Learner score update: 30 min
- Testing: 30 min
- Validation: 3-4 hours
- Run suite with/without learner
- Analyze CAP convergence
- W_MAX exploration (if enabled)
- Total: 4-6 hours
📊 Expected Performance Improvements
Cumulative Gains (Stacked)
| Phase | Change | Mid 1T | Mid 4T | Rationale |
|---|---|---|---|---|
| Baseline (6.21) | Current | 4.0 M/s (28%) | 13.8 M/s (47%) | Post-quick-wins |
| 6.25 (Batch) | Refill 2-4 pages | +10-15% → 4.5-5.0 M/s | +5-8% → 14.5-15.2 M/s | Amortize syscall; 1T bottleneck |
| 6.26 (Lock-Free) | CAS freelist | +2-5% → 4.6-5.2 M/s | +15-20% → 17.0-18.2 M/s | Eliminate 4T contention |
| 6.27 (Learner) | Dynamic CAP/W_MAX | +5-10% → 5.0-5.7 M/s | +5-10% → 18.0-20.0 M/s | Adaptive tuning |
| Target (60-75%) | vs mimalloc 14.6 / 29.5 M/s | 8.8-11.0 M/s | 17.7-22.1 M/s | |
| Achieved? | | ❌ 35-39% | ✅ 61-68% | 1T still short, 4T on target! |
Gap Analysis
1T Performance:
- Current: 4.0 M/s (28% of mimalloc)
- Post-6.27: 5.0-5.7 M/s (35-39% of mimalloc)
- Gap to 60% (8.8 M/s): still need +3.1-3.8 M/s (~+55-75%)
Remaining Bottlenecks (1T):
- Single-threaded path still pays uncontended lock and refill overhead (sharding and TLS do not remove it)
- mimalloc's per-thread heaps eliminate ALL shared state
- Bump allocation (mimalloc) vs freelist (hakmem)
- Header overhead (32 bytes per alloc in hakmem)
4T Performance:
- Current: 13.8 M/s (47% of mimalloc)
- Post-6.27: 18.0-20.0 M/s (61-68% of mimalloc)
- ✅ Target achieved! (60-75% range)
Follow-Up Phases (Post-6.27)
Phase 6.28: Header Elimination (if 1T is still below target)
- Remove AllocHeader for Mid Pool (use page descriptors only)
- Saves 32 bytes per allocation (~8-15% memory)
- Saves header write on alloc hot path (~30-50 cycles)
- Estimated Gain: +15-20% (1T)
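A hypothetical sketch of that lookup path, reusing the page-descriptor table already touched by mid_desc_register() above; mid_desc_lookup() and PageDesc are assumed names:

```c
// Recover the size class from the page descriptor instead of a per-block
// AllocHeader (sketch; assumes pool pages are 64KB-aligned).
typedef struct { int class_idx; uint64_t owner; } PageDesc;
extern PageDesc* mid_desc_lookup(void* page); // assumed counterpart to register

static inline int mid_class_of(void* p) {
    uintptr_t page = (uintptr_t)p & ~((uintptr_t)POOL_PAGE_SIZE - 1);
    PageDesc* d = mid_desc_lookup((void*)page);
    return d ? d->class_idx : -1;  // -1: not a Mid Pool block
}
```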
Phase 6.29: Bump Allocation (major refactor)
- Replace freelist with bump allocator (mimalloc-style)
- Per-thread arenas, no shared state at all
- Estimated Gain: +50-100% (1T), brings to mimalloc parity
- Risk: High complexity, long implementation (~2-3 weeks)
🗓️ Priority-Ordered Task List
Phase 6.25: Refill Batching (Target: Week 1)
- ☐ Implement alloc_tls_page_batch() function (2 hours)
  - Write batch mmap loop
  - Distribute pages to TLS slots
  - Fill Ring/LIFO from overflow pages
  - Add page descriptor registration
- ☐ Integrate batch refill into hak_pool_try_alloc() (1 hour)
  - Replace alloc_tls_page() call with batch version
  - Prepare slot array logic
  - Handle partial allocation (< batch_size)
- ☐ Add environment variable support (30 min)
  - Add g_pool_refill_batch_size global
  - Parse HAKMEM_POOL_REFILL_BATCH in init
  - Validate range (1-4)
- ☐ Unit testing (1 hour)
  - Test batch=1,2,4 correctness
  - Verify TLS slots filled
  - Check Ring population
  - Valgrind (no leaks)
- ☐ Benchmark validation (2 hours)
  - Run suite with batch=1 (baseline)
  - Run suite with batch=2,4
  - Analyze throughput delta
  - Target: +10-15% Mid 1T
Total Estimate: 6-7 hours
Phase 6.26: Lock-Free Refill (Target: Week 2)
- ☐ Replace mutex with atomic freelist (2 hours)
  - Change PoolBlock* freelist[] → atomic_uintptr_t freelist_head[]
  - Add atomic_uint freelist_count[]
  - Remove PaddedMutex freelist_locks[]
- ☐ Implement lock-free primitives (2 hours)
  - Write freelist_pop_lockfree()
  - Write freelist_push_lockfree()
  - Write freelist_push_batch_lockfree()
- ☐ Rewrite drain functions (1 hour)
  - drain_remote_lockfree() (no mutex)
  - Count blocks in remote stack
  - Batch push to freelist
- ☐ Integrate into alloc/free paths (1 hour)
  - Replace lock/pop/unlock with freelist_pop_lockfree()
  - Update refill to use batch push
  - Update free to use lock-free push
- ☐ Testing (critical for lock-free) (3 hours)
  - Single-thread correctness test
  - Multi-thread stress test (16T, 60 sec)
  - TSan (ThreadSanitizer) run
  - Check counter invariants (no lost blocks)
- ☐ Benchmark validation (2 hours)
  - Run suite with lock-free (4T focus)
  - Compare to Phase 6.25 baseline
  - Measure CAS retry rate
  - Target: +15-20% Mid 4T
Total Estimate: 11-12 hours
Phase 6.27: Learner Integration (Target: Week 2, parallel)
- ☐ Add ACE waste tracking (1 hour)
  - Implement hak_ace_get_total_waste() in hakmem_ace.c
  - Track cumulative (rounded - original) per allocation
  - Atomic counter for thread safety
- ☐ Update learner score function (30 min)
  - Add fragmentation penalty term
  - Weight: -0.5 × (waste_MB)
  - Test score computation
- ☐ Validation testing (3 hours)
  - Baseline run (learner OFF)
  - CAP tuning run (learner ON, W_MAX fixed)
  - W_MAX learning run (Canary enabled)
  - Compare throughput, check convergence
- ☐ Documentation (1 hour)
  - Update ENV_VARS.md with learner params
  - Document recommended settings
  - Add troubleshooting guide (oscillation, no effect)
Total Estimate: 5-6 hours
Post-Implementation (Week 3)
- ☐ Comprehensive benchmarking (4 hours)
  - Full suite (tiny, mid, large) with all phases enabled
  - Head-to-head vs mimalloc (1T, 4T, 8T)
  - Memory profiling (RSS, fragmentation)
  - Generate performance report
- ☐ Code review & cleanup (2 hours)
  - Remove debug printfs
  - Add comments to complex sections
  - Update copyright/phase headers
  - Check for code duplication
- ☐ Documentation updates (2 hours)
  - Update INDEX.md with new phases
  - Write PHASE_6.25_6.27_RESULTS.md
  - Update README.md benchmarks section
Total Estimate: 8 hours
📈 Success Metrics
Primary Metrics
| Metric | Current | Target | Measurement |
|---|---|---|---|
| Mid 1T Throughput | 4.0 M/s | 5.0-5.7 M/s | Larson benchmark, 10s |
| Mid 4T Throughput | 13.8 M/s | 18.0-20.0 M/s | Larson benchmark, 10s |
| Mid 1T vs mimalloc | 28% | 35-39% | Ratio of throughputs |
| Mid 4T vs mimalloc | 47% | 61-68% | Ratio of throughputs |
Secondary Metrics
| Metric | Current | Target | Measurement |
|---|---|---|---|
| Refill frequency (1T) | ~1000/sec | ~250-500/sec | Counter delta |
| Lock contention (4T) | ~40% wait | <10% wait | Trylock success rate |
| Hit rate (Mid Pool) | ~60% | 70-80% | hits / (hits + misses) |
| Memory footprint | 22 MB | <30 MB | RSS baseline |
Regression Thresholds
| Scenario | Threshold | Action |
|---|---|---|
| Tiny Pool 4T | <2% regression | Acceptable |
| Large Pool | <5% regression | Acceptable |
| Memory bloat | >40 MB baseline | Reduce CAP or batch |
| Crash/hang in stress test | Any occurrence | Block release, debug |
🎬 Conclusion
This implementation plan provides a systematic path to improve hakmem's Mid Pool performance from 47% to 61-68% of mimalloc for multi-threaded workloads (4T), bringing it into the target range of 60-75%.
Key Insights:
- Phase 6.25 (Batching): Low risk, medium reward, tackles 1T bottleneck
- Phase 6.26 (Lock-Free): Medium risk, high reward, critical for 4T scaling
- Phase 6.27 (Learner): Low risk, low-medium reward, adaptive optimization
Recommendation:
- Implement 6.25 and 6.27 in parallel (independent, ~12 hours total)
- Tackle 6.26 after 6.25 validated (builds on batch refill, ~12 hours)
- Total time: ~24-30 hours (3-4 days focused work)
Next Steps:
- Review this plan with team
- Set up benchmarking pipeline (automated, reproducible)
- Implement Phase 6.25 (highest priority)
- Measure, iterate, document
Open Questions:
- Should we extend TLS slots from 2 to 4? (Test in 6.25)
- Is W_MAX learning worth the risk? (Test in 6.27 with Canary)
- After 6.27, pursue header elimination (Phase 6.28) or accept 1T gap?
Document Version: 1.0 Last Updated: 2025-10-24 Author: Claude (Sonnet 4.5) Status: Ready for Implementation