# Phase 6.25-6.27: Implementation Plan - Catching Up with mimalloc
**Date**: 2025-10-24
**Status**: 📋 Planning
**Target**: Reach 60-75% of mimalloc performance for Mid Pool
---
## 📊 Current Baseline (Phase 6.21 Results)
### Performance vs mimalloc
| Workload | Threads | hakmem | mimalloc | Ratio | Gap |
|----------|---------|--------|----------|-------|-----|
| **Mid** | 1T | 4.0 M/s | 14.6 M/s | **28%** | -72% |
| **Mid** | 4T | 13.8 M/s | 29.5 M/s | **47%** | -53% |
| Tiny | 1T | 19.4 M/s | 32.6 M/s | 59% | -41% |
| Tiny | 4T | 48.0 M/s | 65.7 M/s | 73% | -27% |
| Large | 1T | 0.6 M/s | 2.1 M/s | 29% | -71% |
**Key Insights**:
- ✅ **Phase 6.25 Quick Wins achieved +37.8% for Mid 4T** (10.0 → 13.8 M/s)
- ❌ Mid Pool still significantly behind mimalloc (28% 1T, 47% 4T)
- 🎯 Target: 60-75% of mimalloc = **8.8-11.0 M/s (1T), 17.7-22.1 M/s (4T)**
### Current Mid Pool Architecture
```
┌─────────────────────────────────────────────────────────┐
│ TLS Fast Path (Lock-Free) │
├─────────────────────────────────────────────────────────┤
│ 1. TLS Ring Buffer (RING_CAP=32) │
│ - LIFO cache for recently freed blocks │
│ - Per-class, per-thread │
│ - Phase 6.25: 16→32 increased hit rate │
│ │
│ 2. TLS Active Pages (x2: page_a, page_b) │
│ - Bump-run allocation (no per-block links) │
│ - Owner-thread private (lock-free) │
│ - 64KB pages, split on-demand │
├─────────────────────────────────────────────────────────┤
│ Shared State (Lock-Based) │
├─────────────────────────────────────────────────────────┤
│ 3. Per-class Freelist (64 shards) │
│ - Mutex-protected per (class, shard) │
│ - Site-based sharding (reduce contention) │
│ - Refill on demand via refill_freelist() │
│ │
│ 4. Remote Stack (MPSC, lock-free push) │
│ - Cross-thread free target │
│ - Drained into freelist under lock │
│ │
│ 5. Transfer Cache (TC, Phase 6.20) │
│ - Per-thread inbox (atomic CAS) │
│ - Owner-aware routing │
│ - Drain trigger: ring->top < 2 │
└─────────────────────────────────────────────────────────┘
Refill Flow (Current):
Ring empty → Check Active Pages → Lock Shard → Pop freelist
→ Drain remote → Shard steal (if CAP reached) → **refill_freelist()**
Refill Implementation:
- Allocates **1 page** (64KB) via mmap
- Splits into blocks, links into freelist
- ACE bundle factor: 1-4 pages (adaptive)
```
### Bottlenecks Identified
**From Phase 6.20 Analysis**:
1. **Refill Latency** (Primary)
- Single-page refill: 1 mmap syscall per refill
- Freelist rebuilding overhead (linking blocks)
- Mutex hold time during refill (~100-150 cycles)
- **Impact**: ~40% of alloc time in Mid 1T
2. **Lock Contention** (Secondary)
- 64 shards × 7 classes = 448 mutexes
- Even with sharding, 4T shows contention
- Trylock success rate: ~60-70% (Phase 6.25 data)
- **Impact**: ~25% of alloc time in Mid 4T
3. **CAP/W_MAX Sub-optimal** (Tertiary)
- Static configuration (no runtime adaptation)
- W_MAX=1.60 (Mid), 1.30 (Large) → some fallback to L1
- CAP={64,64,64,32,16} → conservative, low hit rate
- **Impact**: ~10-15% missed pool opportunities
---
## 🎯 Phase 6.25 Core: Refill Batching
### Goal
**Reduce refill latency by allocating multiple pages at once**
**Target**: Mid 1T: +10-15% (4.0 → 4.5-5.0 M/s)
### Problem Statement
Current `refill_freelist()` allocates **1 page per call**:
- 1 mmap syscall (~200-300 cycles)
- 1 page split + freelist rebuild (~100-150 cycles)
- Held under mutex lock (blocks other threads)
- Amortized cost per block: **HIGH** for small classes (e.g., 2KB = 32 blocks/page)
**Opportunity**: Allocate **2-4 pages in batch** to amortize costs:
- mmap overhead: 300 cycles → 75-150 cycles/page (batched)
- Freelist rebuild: done in parallel or optimized
- Fill multiple TLS page slots + Ring buffer aggressively
### Implementation Approach
#### 1. Create `alloc_tls_page_batch()` Function
**Location**: `hakmem_pool.c` (after `alloc_tls_page()`, line ~486)
**Signature**:
```c
// Allocate multiple pages in batch and distribute to TLS structures
// Returns: number of pages successfully allocated (0-batch_size)
static int alloc_tls_page_batch(int class_idx, int batch_size,
                                PoolTLSPage* slots[], int num_slots,
                                PoolTLSRing* ring, PoolTLSBin* bin);
```
**Pseudocode**:
```c
static int alloc_tls_page_batch(int class_idx, int batch_size,
                                PoolTLSPage* slots[], int num_slots,
                                PoolTLSRing* ring, PoolTLSBin* bin) {
    size_t user_size = g_class_sizes[class_idx];
    size_t block_size = HEADER_SIZE + user_size;
    int blocks_per_page = (int)(POOL_PAGE_SIZE / block_size);
    if (blocks_per_page <= 0) return 0;

    int allocated = 0;
    // Allocate pages in batch (strategy: multiple mmaps or single large mmap)
    // Option A: Multiple mmaps (simpler, compatible with existing infra)
    for (int i = 0; i < batch_size; i++) {
        void* page = mmap(NULL, POOL_PAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) break;  // mmap signals failure with MAP_FAILED, not NULL
        // Prefault (Phase 6.25 quick win)
        for (size_t j = 0; j < POOL_PAGE_SIZE; j += 4096) {
            ((volatile char*)page)[j] = 0;
        }
        // Strategy: fill TLS slots first, then fill Ring/LIFO
        if (allocated < num_slots && slots[allocated]) {
            // Assign to TLS active page slot (bump-run init)
            PoolTLSPage* ap = slots[allocated];
            ap->page = page;
            ap->bump = (char*)page;
            ap->end = (char*)page + POOL_PAGE_SIZE;
            ap->count = blocks_per_page;
            // Register page descriptor
            mid_desc_register(page, class_idx, (uint64_t)(uintptr_t)pthread_self());
        } else {
            // Fill Ring + LIFO from this page
            char* bump = (char*)page;
            char* end = (char*)page + POOL_PAGE_SIZE;
            for (int k = 0; k < blocks_per_page; k++) {
                PoolBlock* b = (PoolBlock*)(void*)bump;
                // Try Ring first, then LIFO
                if (ring && ring->top < POOL_TLS_RING_CAP) {
                    ring->items[ring->top++] = b;
                } else if (bin) {
                    b->next = bin->lo_head;
                    bin->lo_head = b;
                    bin->lo_count++;
                }
                bump += block_size;
                if (bump >= end) break;
            }
            mid_desc_register(page, class_idx, (uint64_t)(uintptr_t)pthread_self());
        }
        allocated++;
        g_pool.total_pages_allocated++;
        g_pool.pages_by_class[class_idx]++;
        g_pool.total_bytes_allocated += POOL_PAGE_SIZE;
    }
    if (allocated > 0) {
        g_pool.refills[class_idx]++;
    }
    return allocated;
}
```
#### 2. Modify Refill Call Sites
**Location**: `hakmem_pool.c:931` (inside `hak_pool_try_alloc`, refill path)
**Before**:
```c
if (alloc_tls_page(class_idx, tap)) {
    // ... use newly allocated page
}
```
**After**:
```c
// Determine batch size from env var (default 2-4)
int batch = g_pool_refill_batch_size;  // new global config
if (batch < 1) batch = 1;
if (batch > 4) batch = 4;

// Prepare slot array (up to 2 TLS slots)
PoolTLSPage* slots[2] = {NULL, NULL};
int num_slots = 0;
if (g_tls_active_page_a[class_idx].page == NULL || g_tls_active_page_a[class_idx].count == 0) {
    slots[num_slots++] = &g_tls_active_page_a[class_idx];
}
if (g_tls_active_page_b[class_idx].page == NULL || g_tls_active_page_b[class_idx].count == 0) {
    slots[num_slots++] = &g_tls_active_page_b[class_idx];
}

// Call batch allocator
int allocated = alloc_tls_page_batch(class_idx, batch, slots, num_slots,
                                     &g_tls_bin[class_idx].ring,
                                     &g_tls_bin[class_idx]);
if (allocated > 0) {
    pthread_mutex_unlock(lock);
    // Use ring or active page as usual
    // ...
}
```
#### 3. Add Environment Variable
**Global Config** (add to `hakmem_pool.c` globals, ~line 316):
```c
static int g_pool_refill_batch_size = 2; // env: HAKMEM_POOL_REFILL_BATCH (1-4)
```
**Init** (add to `hak_pool_init()`, ~line 716):
```c
const char* e_batch = getenv("HAKMEM_POOL_REFILL_BATCH");
if (e_batch) {
    int v = atoi(e_batch);
    if (v >= 1 && v <= 4) g_pool_refill_batch_size = v;
}
```
#### 4. Extend TLS Active Page Slots (Optional)
**Current**: 2 slots (page_a, page_b)
**Proposal**: Add page_c, page_d for batch_size=4 (if beneficial)
**Trade-off**:
- ✅ Pro: More TLS-local inventory, fewer shared accesses
- ❌ Con: Increased TLS memory footprint (~256 bytes/class)
**Recommendation**: Start with 2 slots, measure, then extend if needed.
---
### File Changes Required
| File | Function | Change Type | Est. LOC |
|------|----------|-------------|----------|
| `hakmem_pool.c` | `alloc_tls_page_batch()` | **New function** | +80 |
| `hakmem_pool.c` | `hak_pool_try_alloc()` | Modify refill path | +30 |
| `hakmem_pool.c` | Globals | Add `g_pool_refill_batch_size` | +1 |
| `hakmem_pool.c` | `hak_pool_init()` | Parse env var | +5 |
| `hakmem_pool.h` | (none) | No public API change | 0 |
| **Total** | | | **~116 LOC** |
---
### Testing Strategy
#### Unit Test
```bash
# Test batch allocation works
HAKMEM_POOL_REFILL_BATCH=4 ./test_pool_refill
# Verify TLS slots filled correctly
# Check Ring buffer populated
# Check no memory leaks
```
#### Benchmark Test
```bash
# Baseline (batch=1, current behavior)
HAKMEM_POOL_REFILL_BATCH=1 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh
# Batch=2 (conservative)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh
# Batch=4 (aggressive)
HAKMEM_POOL_REFILL_BATCH=4 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh
# Expected: +10-15% on Mid 1T (4.0 → 4.5-5.0 M/s)
```
#### Failure Modes to Watch
1. **Memory bloat**: Batch too large → excessive pre-allocation
- **Monitor**: RSS growth, pages_allocated counter
- **Mitigation**: Cap batch_size at 4, respect CAP limits
2. **Ring overflow**: Batch fills Ring, blocks get lost
- **Monitor**: Ring underflow counter (should decrease)
- **Mitigation**: Properly route overflow to LIFO
3. **TLS slot contention**: Multiple threads allocating same class
- **Monitor**: Active page descriptor conflicts
- **Mitigation**: Per-thread ownership (already enforced)
---
### Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Memory bloat (over-allocation) | Medium | High | Cap at batch=4, respect CAP limits |
| Complexity (harder to debug) | Low | Medium | Extensive logging, unit tests |
| Backward compat (existing workloads) | Low | Low | Default batch=2 (conservative) |
| Regression (slower than 1-page) | Low | Medium | A/B test, fallback to batch=1 |
**Rollback Plan**: Set `HAKMEM_POOL_REFILL_BATCH=1` to restore original behavior (zero code change).
---
### Estimated Time
- **Implementation**: 3-4 hours
- Core function: 2 hours
- Integration: 1 hour
- Testing: 1 hour
- **Benchmarking**: 2 hours
- Run suite 3x (batch=1,2,4)
- Analyze results
- **Total**: **5-6 hours**
---
## 🔓 Phase 6.26: Lock-Free Refill
### Goal
**Eliminate lock contention on freelist access**
**Target**: Mid 4T: +15-20% (13.8 → 16-18 M/s)
### Problem Statement
Current freelist uses **per-shard mutexes** (`pthread_mutex_t`):
- 64 shards × 7 classes = **448 mutexes**
- Contention on hot shards (4T workload)
- Trylock success rate: ~60-70% (Phase 6.25 data)
- Each lock/unlock: ~20-40 cycles overhead
**Opportunity**: Replace mutex with **lock-free stack** (CAS-based):
- Atomic compare-and-swap: ~10-15 cycles
- No blocking (always forward progress)
- Better scalability under contention
### Implementation Approach
#### 1. Replace Freelist Mutex with Atomic Head
**Current Structure** (`hakmem_pool.c:276-280`):
```c
static struct {
    PoolBlock* freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // ...
} g_pool;
```
**New Structure**:
```c
static struct {
    // Lock-free freelist head (atomic pointer)
    atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // Lock-free counter (for non-empty bitmap update)
    atomic_uint freelist_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // Keep nonempty_mask (atomic already)
    atomic_uint_fast64_t nonempty_mask[POOL_NUM_CLASSES];
    // Remote stack (already lock-free)
    atomic_uintptr_t remote_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    atomic_uint remote_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
    // ... (rest unchanged)
} g_pool;
```
#### 2. Implement Lock-Free Push/Pop
**Lock-Free Pop** (replace mutex-based pop):
```c
// Pop block from lock-free freelist
// Returns: block pointer, or NULL if empty
static inline PoolBlock* freelist_pop_lockfree(int class_idx, int shard_idx) {
    uintptr_t old_head;
    PoolBlock* block;
    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                        memory_order_acquire);
        if (!old_head) {
            return NULL;  // Empty
        }
        block = (PoolBlock*)old_head;
        // Try CAS: freelist_head = block->next
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)block->next,
                 memory_order_acq_rel, memory_order_acquire));
    // Update count
    unsigned old_count = atomic_fetch_sub_explicit(
        &g_pool.freelist_count[class_idx][shard_idx], 1, memory_order_relaxed);
    // Clear nonempty bit if now empty (old_count is the pre-decrement value)
    if (old_count <= 1) {
        clear_nonempty_bit(class_idx, shard_idx);
    }
    return block;
}
```
**Lock-Free Push** (for refill path):
```c
// Push block onto lock-free freelist
static inline void freelist_push_lockfree(int class_idx, int shard_idx, PoolBlock* block) {
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                        memory_order_acquire);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_acquire));
    // Update count and nonempty bit
    atomic_fetch_add_explicit(&g_pool.freelist_count[class_idx][shard_idx], 1,
                              memory_order_relaxed);
    set_nonempty_bit(class_idx, shard_idx);
}
```
**Lock-Free Batch Push** (for refill, optimization):
```c
// Push multiple blocks atomically (amortize CAS overhead)
static inline void freelist_push_batch_lockfree(int class_idx, int shard_idx,
                                                PoolBlock* head, PoolBlock* tail,
                                                int count) {
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&g_pool.freelist_head[class_idx][shard_idx],
                                        memory_order_acquire);
        tail->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_pool.freelist_head[class_idx][shard_idx],
                 &old_head, (uintptr_t)head,
                 memory_order_release, memory_order_acquire));
    atomic_fetch_add_explicit(&g_pool.freelist_count[class_idx][shard_idx], count,
                              memory_order_relaxed);
    set_nonempty_bit(class_idx, shard_idx);
}
```
#### 3. Refill Path Integration
**Modify `refill_freelist()`** (now lock-free):
```c
static int refill_freelist(int class_idx, int shard_idx) {
    // ... (allocate page, split into blocks)

    // OLD: lock → push to freelist → unlock
    //   pthread_mutex_lock(lock);
    //   block->next = g_pool.freelist[class_idx][shard_idx];
    //   g_pool.freelist[class_idx][shard_idx] = freelist_head;
    //   pthread_mutex_unlock(lock);

    // NEW: lock-free batch push
    // (In the real code, track the tail while splitting the page
    //  instead of re-walking the whole list here.)
    PoolBlock* tail = freelist_head;
    int count = blocks_per_page;
    while (tail->next) {
        tail = tail->next;
    }
    freelist_push_batch_lockfree(class_idx, shard_idx, freelist_head, tail, count);
    return 1;
}
```
#### 4. Remote Stack Drain (Lock-Free)
**Current**: `drain_remote_locked()` called under mutex
**New**: Drain into local list, then batch-push lock-free
```c
// Drain remote stack into freelist (lock-free)
static inline void drain_remote_lockfree(int class_idx, int shard_idx) {
    // Atomically swap remote head to NULL (unchanged)
    uintptr_t head = atomic_exchange_explicit(&g_pool.remote_head[class_idx][shard_idx],
                                              (uintptr_t)0, memory_order_acq_rel);
    if (!head) return;

    // Count blocks and find the tail
    int count = 0;
    PoolBlock* tail = (PoolBlock*)head;
    while (tail->next) {
        tail = tail->next;
        count++;
    }
    count++;  // Include head

    // Batch push to freelist (lock-free)
    freelist_push_batch_lockfree(class_idx, shard_idx, (PoolBlock*)head, tail, count);

    // Update remote count
    atomic_fetch_sub_explicit(&g_pool.remote_count[class_idx][shard_idx], count,
                              memory_order_relaxed);
}
```
#### 5. Fallback Strategy (Optional)
For **rare contention** cases (e.g., CAS spin > 100 iterations):
- Option A: Keep spinning (acceptable for short lists)
- Option B: Fallback to mutex (hybrid approach)
- Option C: Backoff + retry (exponential backoff)
**Recommendation**: Start with Option A (pure lock-free), measure, add backoff if needed.
---
### File Changes Required
| File | Function | Change Type | Est. LOC |
|------|----------|-------------|----------|
| `hakmem_pool.c` | Globals | Replace mutexes with atomics | +10/-10 |
| `hakmem_pool.c` | `freelist_pop_lockfree()` | **New function** | +30 |
| `hakmem_pool.c` | `freelist_push_lockfree()` | **New function** | +20 |
| `hakmem_pool.c` | `freelist_push_batch_lockfree()` | **New function** | +25 |
| `hakmem_pool.c` | `drain_remote_lockfree()` | Rewrite (lock-free) | +25/-20 |
| `hakmem_pool.c` | `refill_freelist()` | Modify (use batch push) | +10/-15 |
| `hakmem_pool.c` | `hak_pool_try_alloc()` | Replace lock/unlock with pop | +5/-10 |
| `hakmem_pool.c` | `hak_pool_free()` | Lock-free path | +10/-10 |
| `hakmem_pool.c` | `hak_pool_init()` | Init atomics (not mutexes) | +5/-5 |
| **Total** | | | **~140 LOC (net ~100)** |
---
### Testing Strategy
#### Correctness Test
```bash
# Single-threaded (no contention, pure correctness)
THREADS=1 ./test_pool_lockfree
# Multi-threaded stress test (high contention)
THREADS=16 DURATION=60 ./test_pool_lockfree_stress
# Check for:
# - No memory leaks (valgrind)
# - No double-free (AddressSanitizer)
# - No lost blocks (counter invariants)
```
#### Performance Test
```bash
# Baseline (Phase 6.25, with batching)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh
# Lock-free (Phase 6.26; same command, binary rebuilt with the lock-free freelist)
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh
# Expected: +15-20% on Mid 4T (13.8 → 16-18 M/s)
```
#### Contention Analysis
```bash
# Measure CAS retry rate
# Add instrumentation:
# atomic_uint_fast64_t cas_retries;
# atomic_uint_fast64_t cas_attempts;
# Print ratio at shutdown
# Target: <5% retry rate under 4T load
```
---
### Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| ABA problem (block reuse) | Low | Critical | Use epoch-based reclamation or hazard pointers |
| CAS livelock (high contention) | Medium | High | Add exponential backoff after N retries |
| Memory ordering bugs (subtle races) | Medium | Critical | Extensive testing, TSan, formal verification |
| Performance regression (1T) | Low | Low | Single-thread has no contention, minimal overhead |
**ABA Problem**:
- **Scenario**: Block A popped, freed, reallocated, pushed back while another thread's CAS is in-flight
- **Risk**: The stale CAS can install an outdated `next` pointer and silently drop part of the list, so this is not automatically benign (the risk table above rates it Critical)
- **Mitigation**: Add a version counter (tagged pointer or 128-bit CAS), or use epoch-based reclamation / hazard pointers
**Rollback Plan**: Keep mutexes in code (ifdef'd out), revert via compile flag if needed.
---
### Estimated Time
- **Implementation**: 5-6 hours
- Lock-free primitives: 2 hours
- Integration: 2 hours
- Testing: 2 hours
- **Debugging**: 2-3 hours (race conditions, TSan)
- **Benchmarking**: 2 hours
- **Total**: **9-11 hours**
---
## 🧠 Phase 6.27: Learner Integration
### Goal
**Dynamic optimization of CAP and W_MAX based on runtime behavior**
**Target**: +5-10% across all workloads via adaptive tuning
### Problem Statement
Current policy is **static** (set at init):
- `CAP = {64,64,64,32,16,32,32}` (conservative)
- `W_MAX_MID = 1.60`, `W_MAX_LARGE = 1.30`
- No adaptation to workload characteristics
**Opportunity**: Use **existing learner infrastructure** to:
1. Collect size distribution stats
2. Adjust `mid_cap[]` dynamically based on hit rate
3. Adjust `w_max_mid` based on fragmentation vs hit rate trade-off
**Learner Already Exists**: `hakmem_learner.c` (~585 LOC)
- Background thread (1 sec polling)
- Hit rate monitoring
- UCB1 for W_MAX exploration (Canary deployment)
- Budget enforcement + Water-filling
**Integration Work**: Minimal (learner already supports Mid Pool tuning)
---
### Implementation Approach
#### 1. Enable Learner for Mid Pool
**Already Implemented** (`hakmem_learner.c:239-272`):
```c
// Adjust Mid caps by hit rate vs target (delta over window) with dwell
int mid_classes = 5;
if (cur->mid_dyn1_bytes != 0 && cur->mid_dyn2_bytes != 0) mid_classes = 7;
// ...
for (int i = 0; i < mid_classes; i++) {
    uint64_t dh = mid_hits[i] - prev_mid_hits[i];
    uint64_t dm = mid_misses[i] - prev_mid_misses[i];
    // ...
    if (hit < (tgt_mid - eps)) {
        cap += step_mid;   // Increase CAP
    } else if (hit > (tgt_mid + eps)) {
        cap -= step_mid;   // Decrease CAP
    }
    // ...
}
```
**Action**: Just enable via env var!
```bash
HAKMEM_LEARN=1 \
HAKMEM_TARGET_HIT_MID=0.65 \
HAKMEM_CAP_STEP_MID=8 \
HAKMEM_CAP_MIN_MID=16 \
HAKMEM_CAP_MAX_MID=512 \
./your_app
```
#### 2. W_MAX Learning (Optional, Risky)
**Already Implemented** (`hakmem_learner.c:388-499`):
- UCB1 multi-armed bandit
- Canary deployment (safe exploration)
- Rollback if performance regresses
**Candidates** (for Mid Pool):
```
W_MAX_MID candidates: [1.40, 1.50, 1.60, 1.70]
Default: 1.60 (current)
Exploration: Try 1.50 (tighter, less waste) or 1.70 (looser, higher hit)
```
**Enable**:
```bash
HAKMEM_LEARN=1 \
HAKMEM_WMAX_LEARN=1 \
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7 \
HAKMEM_WMAX_CANARY=1 \
./your_app
```
**Recommendation**: Start with CAP tuning only, add W_MAX later (more risk).
#### 3. Size Distribution Integration (Already Exists)
**Histogram** (`hakmem_size_hist.c`):
- 1KB granularity bins (0-64KB tracked)
- Per-allocation sampling
- Reset after learner snapshot
**DYN1 Auto-Assignment** (already implemented):
```bash
HAKMEM_LEARN=1 \
HAKMEM_DYN1_AUTO=1 \
HAKMEM_CAP_MID_DYN1=64 \
./your_app
```
**Effect**: Automatically finds peak size in 2-32KB range, assigns DYN1 class.
#### 4. New: ACE Stats Integration
**Current ACE** (`hakmem_ace.c`):
- Records size decisions (original_size → rounded_size → pool)
- Tracks L1 fallback rate (miss → malloc)
- Not integrated with learner
**Proposal**: Add ACE stats to learner score function
**Modify Learner Score** (`hakmem_learner.c:414`):
```c
// OLD: simple hit-based score
double score = (double)(ace.mid_hit + ace.large_hit)
             - (double)(ace.mid_miss + ace.large_miss)
             - 2.0 * (double)ace.l1_fallback;

// NEW: add fragmentation penalty
extern uint64_t hak_ace_get_total_waste(void);  // sum of (rounded - original)
uint64_t waste = hak_ace_get_total_waste();
double frag_penalty = (double)waste / 1e6;      // normalize to MB
double score = (double)(ace.mid_hit + ace.large_hit)
             - (double)(ace.mid_miss + ace.large_miss)
             - 2.0 * (double)ace.l1_fallback
             - 0.5 * frag_penalty;              // penalize waste
```
**Benefit**: Balance hit rate vs fragmentation (W_MAX tuning).
---
### File Changes Required
| File | Function | Change Type | Est. LOC |
|------|----------|-------------|----------|
| `hakmem_learner.c` | Learner (already exists) | Enable via env | 0 |
| `hakmem_ace.c` | `hak_ace_get_total_waste()` | **New function** | +15 |
| `hakmem_learner.c` | `learner_main()` | Add frag penalty to score | +10 |
| `hakmem_policy.c` | (none) | Learner publishes dynamically | 0 |
| **Total** | | | **~25 LOC** |
---
### Testing Strategy
#### Baseline Test (Learner Off)
```bash
# Static policy (current)
RUNTIME=60 THREADS=1,4 ./scripts/run_bench_suite.sh
# Record: Mid 1T, Mid 4T throughput
```
#### Learner Test (CAP Tuning)
```bash
# Enable learner with aggressive targets
HAKMEM_LEARN=1 \
HAKMEM_TARGET_HIT_MID=0.75 \
HAKMEM_CAP_STEP_MID=8 \
HAKMEM_CAP_MAX_MID=512 \
HAKMEM_LEARN_WINDOW_MS=2000 \
RUNTIME=60 THREADS=1,4 ./scripts/run_bench_suite.sh
# Expected: CAP increases to ~128-256 (hit 75% target)
# Expected: +5-10% throughput improvement
```
#### W_MAX Learning Test (Optional)
```bash
HAKMEM_LEARN=1 \
HAKMEM_WMAX_LEARN=1 \
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7 \
HAKMEM_WMAX_CANARY=1 \
HAKMEM_WMAX_TRIAL_SEC=5 \
RUNTIME=120 THREADS=1,4 ./scripts/run_bench_suite.sh
# Monitor stderr for learner logs:
# "[Learner] W_MAX mid canary start: 1.50"
# "[Learner] W_MAX mid canary adopt" (success)
# or
# "[Learner] W_MAX mid canary revert to 1.60" (failure)
```
#### Regression Test
```bash
# Check learner doesn't hurt stable workloads
# Run with learning OFF, then ON, compare variance
# Target: <5% variance, no regressions
```
---
### Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Over-tuning (oscillation) | Medium | Medium | Increase dwell time (3→5 sec) |
| Under-tuning (no effect) | Medium | Low | Lower target hit rate (0.75→0.65) |
| W_MAX instability (fragmentation spike) | Medium | High | Use Canary, revert on regression |
| Low-traffic workload (insufficient samples) | High | Low | Set min_samples=256, skip learning if below |
**Rollback Plan**: Set `HAKMEM_LEARN=0` (default, no learner overhead).
---
### Estimated Time
- **Implementation**: 1-2 hours
- ACE waste tracking: 1 hour
- Learner score update: 30 min
- Testing: 30 min
- **Validation**: 3-4 hours
- Run suite with/without learner
- Analyze CAP convergence
- W_MAX exploration (if enabled)
- **Total**: **4-6 hours**
---
## 📊 Expected Performance Improvements
### Cumulative Gains (Stacked)
| Phase | Change | Mid 1T | Mid 4T | Rationale |
|-------|--------|--------|--------|-----------|
| **Baseline (6.21)** | Current | 4.0 M/s (28%) | 13.8 M/s (47%) | Post-quick-wins |
| **6.25 (Batch)** | Refill 2-4 pages | +10-15% | +5-8% | Amortize syscall, 1T bottleneck |
| | | **4.5-5.0 M/s** | **14.5-15.2 M/s** | |
| **6.26 (Lock-Free)** | CAS freelist | +2-5% | +15-20% | Eliminate 4T contention |
| | | **4.6-5.2 M/s** | **17.0-18.2 M/s** | |
| **6.27 (Learner)** | Dynamic CAP/W_MAX | +5-10% | +5-10% | Adaptive tuning |
| | | **5.0-5.7 M/s** | **18.0-20.0 M/s** | |
| **Target (60-75%)** | vs mimalloc 14.6M / 29.5M | **8.8-11.0 M/s** | **17.7-22.1 M/s** | |
| **Achieved?** | | ❌ **35-39%** | ✅ **61-68%** | 1T still short, 4T on target! |
### Gap Analysis
**1T Performance**:
- Current: 4.0 M/s (28% of mimalloc)
- Post-6.27: 5.0-5.7 M/s (35-39% of mimalloc)
- **Gap to 60%**: still need **+3.1-3.8 M/s** (+55-75% over the post-6.27 estimate); reaching the 75% mark (11.0 M/s) would need +5.3-6.0 M/s
**Remaining Bottlenecks (1T)**:
1. Lock/unlock overhead is still paid at 1T even without contention (TLS sharding brings no benefit there)
2. mimalloc's per-thread heaps eliminate ALL shared state
3. Bump allocation (mimalloc) vs freelist (hakmem)
4. Header overhead (32 bytes per alloc in hakmem)
**4T Performance**:
- Current: 13.8 M/s (47% of mimalloc)
- Post-6.27: 18.0-20.0 M/s (61-68% of mimalloc)
- **✅ Target achieved!** (60-75% range)
---
### Follow-Up Phases (Post-6.27)
**Phase 6.28: Header Elimination** (if 1T is still below target)
- Remove AllocHeader for Mid Pool (use page descriptors only)
- Saves 32 bytes per allocation (~8-15% memory)
- Saves header write on alloc hot path (~30-50 cycles)
- **Estimated Gain**: +15-20% (1T)
**Phase 6.29: Bump Allocation** (major refactor)
- Replace freelist with bump allocator (mimalloc-style)
- Per-thread arenas, no shared state at all
- **Estimated Gain**: +50-100% (1T), brings to mimalloc parity
- **Risk**: High complexity, long implementation (~2-3 weeks)
---
## 🗓️ Priority-Ordered Task List
### Phase 6.25: Refill Batching (Target: Week 1)
1. **Implement `alloc_tls_page_batch()` function** (2 hours)
- [ ] Write batch mmap loop
- [ ] Distribute pages to TLS slots
- [ ] Fill Ring/LIFO from overflow pages
- [ ] Add page descriptors registration
2. **Integrate batch refill into `hak_pool_try_alloc()`** (1 hour)
- [ ] Replace `alloc_tls_page()` call with batch version
- [ ] Prepare slot array logic
- [ ] Handle partial allocation (< batch_size)
3. **Add environment variable support** (30 min)
- [ ] Add `g_pool_refill_batch_size` global
- [ ] Parse `HAKMEM_POOL_REFILL_BATCH` in init
- [ ] Validate range (1-4)
4. **Unit testing** (1 hour)
- [ ] Test batch=1,2,4 correctness
- [ ] Verify TLS slots filled
- [ ] Check Ring population
- [ ] Valgrind (no leaks)
5. **Benchmark validation** (2 hours)
- [ ] Run suite with batch=1 (baseline)
- [ ] Run suite with batch=2,4
- [ ] Analyze throughput delta
- [ ] **Target**: +10-15% Mid 1T
**Total Estimate**: 6-7 hours
---
### Phase 6.26: Lock-Free Refill (Target: Week 2)
6. **Replace mutex with atomic freelist** (2 hours)
- [ ] Change `PoolBlock* freelist[]` → `atomic_uintptr_t freelist_head[]`
- [ ] Add `atomic_uint freelist_count[]`
- [ ] Remove `PaddedMutex freelist_locks[]`
7. **Implement lock-free primitives** (2 hours)
- [ ] Write `freelist_pop_lockfree()`
- [ ] Write `freelist_push_lockfree()`
- [ ] Write `freelist_push_batch_lockfree()`
8. **Rewrite drain functions** (1 hour)
- [ ] `drain_remote_lockfree()` (no mutex)
- [ ] Count blocks in remote stack
- [ ] Batch push to freelist
9. **Integrate into alloc/free paths** (1 hour)
- [ ] Replace lock/pop/unlock with `freelist_pop_lockfree()`
- [ ] Update refill to use batch push
- [ ] Update free to use lock-free push
10. **Testing (critical for lock-free)** (3 hours)
- [ ] Single-thread correctness test
- [ ] Multi-thread stress test (16T, 60 sec)
- [ ] TSan (ThreadSanitizer) run
- [ ] Check counter invariants (no lost blocks)
11. **Benchmark validation** (2 hours)
- [ ] Run suite with lock-free (4T focus)
- [ ] Compare to Phase 6.25 baseline
- [ ] Measure CAS retry rate
- [ ] **Target**: +15-20% Mid 4T
**Total Estimate**: 11-12 hours
---
### Phase 6.27: Learner Integration (Target: Week 2, parallel)
12. **Add ACE waste tracking** (1 hour)
- [ ] Implement `hak_ace_get_total_waste()` in `hakmem_ace.c`
- [ ] Track cumulative (rounded - original) per allocation
- [ ] Atomic counter for thread safety
13. **Update learner score function** (30 min)
- [ ] Add fragmentation penalty term
- [ ] Weight: -0.5 × (waste_MB)
- [ ] Test score computation
14. **Validation testing** (3 hours)
- [ ] Baseline run (learner OFF)
- [ ] CAP tuning run (learner ON, W_MAX fixed)
- [ ] W_MAX learning run (Canary enabled)
- [ ] Compare throughput, check convergence
15. **Documentation** (1 hour)
- [ ] Update ENV_VARS.md with learner params
- [ ] Document recommended settings
- [ ] Add troubleshooting guide (oscillation, no effect)
**Total Estimate**: 5-6 hours
---
### Post-Implementation (Week 3)
16. **Comprehensive benchmarking** (4 hours)
- [ ] Full suite (tiny, mid, large) with all phases enabled
- [ ] Head-to-head vs mimalloc (1T, 4T, 8T)
- [ ] Memory profiling (RSS, fragmentation)
- [ ] Generate performance report
17. **Code review & cleanup** (2 hours)
- [ ] Remove debug printfs
- [ ] Add comments to complex sections
- [ ] Update copyright/phase headers
- [ ] Check for code duplication
18. **Documentation updates** (2 hours)
- [ ] Update INDEX.md with new phases
- [ ] Write PHASE_6.25_6.27_RESULTS.md
- [ ] Update README.md benchmarks section
**Total Estimate**: 8 hours
---
## 📈 Success Metrics
### Primary Metrics
| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| **Mid 1T Throughput** | 4.0 M/s | 5.0-5.7 M/s | Larson benchmark, 10s |
| **Mid 4T Throughput** | 13.8 M/s | 18.0-20.0 M/s | Larson benchmark, 10s |
| **Mid 1T vs mimalloc** | 28% | 35-39% | Ratio of throughputs |
| **Mid 4T vs mimalloc** | 47% | 61-68% | Ratio of throughputs |
### Secondary Metrics
| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| Refill frequency (1T) | ~1000/sec | ~250-500/sec | Counter delta |
| Lock contention (4T) | ~40% wait | <10% wait | Trylock success rate |
| Hit rate (Mid Pool) | ~60% | 70-80% | hits / (hits + misses) |
| Memory footprint | 22 MB | <30 MB | RSS baseline |
### Regression Thresholds
| Scenario | Threshold | Action |
|----------|-----------|--------|
| Tiny Pool 4T | <2% regression | Acceptable |
| Large Pool | <5% regression | Acceptable |
| Memory bloat | >40 MB baseline | Reduce CAP or batch |
| Crash/hang in stress test | Any occurrence | Block release, debug |
---
## 🎬 Conclusion
This implementation plan provides a **systematic path** to improve hakmem's Mid Pool performance from **47% to 61-68% of mimalloc** for multi-threaded workloads (4T), bringing it into the target range of 60-75%.
**Key Insights**:
1. **Phase 6.25 (Batching)**: Low risk, medium reward, tackles 1T bottleneck
2. **Phase 6.26 (Lock-Free)**: Medium risk, high reward, critical for 4T scaling
3. **Phase 6.27 (Learner)**: Low risk, low-medium reward, adaptive optimization
**Recommendation**:
- Implement 6.25 and 6.27 in **parallel** (independent, ~12 hours total)
- Tackle 6.26 **after** 6.25 validated (builds on batch refill, ~12 hours)
- **Total time**: ~24-30 hours (3-4 days focused work)
**Next Steps**:
1. Review this plan with team
2. Set up benchmarking pipeline (automated, reproducible)
3. Implement Phase 6.25 (highest priority)
4. Measure, iterate, document
**Open Questions**:
- Should we extend TLS slots from 2 to 4? (Test in 6.25)
- Is W_MAX learning worth the risk? (Test in 6.27 with Canary)
- After 6.27, pursue header elimination (Phase 6.28) or accept 1T gap?
---
**Document Version**: 1.0
**Last Updated**: 2025-10-24
**Author**: Claude (Sonnet 4.5)
**Status**: Ready for Implementation