Phase 6.25-6.27: Quick Reference Guide
Target: Improve Mid Pool from 47% to 61-68% of mimalloc (4T)
📋 Implementation Checklist
Phase 6.25: Refill Batching (~6 hours)
Goal: Reduce refill latency by allocating 2-4 pages at once
# New function in hakmem_pool.c
static int alloc_tls_page_batch(
int class_idx, int batch_size,
PoolTLSPage* slots[], int num_slots,
PoolTLSRing* ring, PoolTLSBin* bin
);
# New env var
HAKMEM_POOL_REFILL_BATCH=2 # Default (conservative)
HAKMEM_POOL_REFILL_BATCH=4 # Aggressive
Files: hakmem_pool.c (~116 LOC)
Expected: +10-15% (Mid 1T)
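The env var handling for this phase could look roughly like the sketch below. It is not taken from hakmem_pool.c: the helper names and the policy (fall back to the default of 2 on unparsable input, clamp out-of-range values to 1-4) are assumptions chosen to match the range given in this guide.

```c
#include <errno.h>
#include <stdlib.h>

/* Range and default as documented in this guide; names are hypothetical. */
enum { SKETCH_BATCH_MIN = 1, SKETCH_BATCH_MAX = 4, SKETCH_BATCH_DEFAULT = 2 };

/* Parse a HAKMEM_POOL_REFILL_BATCH-style string.
 * Assumed policy: unset or non-numeric -> default, out of range -> clamp. */
static int sketch_parse_refill_batch(const char* text) {
    if (text == NULL || *text == '\0') return SKETCH_BATCH_DEFAULT;

    char* end = NULL;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0') return SKETCH_BATCH_DEFAULT;

    if (v < SKETCH_BATCH_MIN) return SKETCH_BATCH_MIN;
    if (v > SKETCH_BATCH_MAX) return SKETCH_BATCH_MAX;
    return (int)v;
}

/* Read the setting from the process environment (e.g. once at pool init). */
static int sketch_refill_batch_from_env(void) {
    return sketch_parse_refill_batch(getenv("HAKMEM_POOL_REFILL_BATCH"));
}
```

Clamping instead of rejecting keeps a mistyped value such as 8 from silently disabling batching.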
Phase 6.26: Lock-Free Refill (~11 hours)
Goal: Replace mutex with atomic CAS on freelist
# Replace in hakmem_pool.c
- PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
+ atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
+ atomic_uint freelist_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
# New functions
freelist_pop_lockfree()
freelist_push_lockfree()
freelist_push_batch_lockfree()
drain_remote_lockfree()
Files: hakmem_pool.c (~140 LOC, net ~100)
Expected: +15-20% (Mid 4T)
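As a rough illustration of replacing the mutex with CAS on an atomic_uintptr_t head, here is a minimal Treiber-stack sketch using C11 atomics. The type and function names are hypothetical, not the planned freelist_pop_lockfree() / freelist_push_lockfree() signatures. The pop side deliberately ignores the ABA problem, which a real allocator would need to address (for example with tagged pointers).

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free block: the link lives in the block's first word. */
typedef struct SketchBlock { struct SketchBlock* next; } SketchBlock;

/* Push one block; release ordering publishes b->next to later poppers. */
static void sketch_freelist_push(atomic_uintptr_t* head, atomic_uint* count,
                                 SketchBlock* b) {
    uintptr_t old = atomic_load_explicit(head, memory_order_relaxed);
    do {
        b->next = (SketchBlock*)old;  /* relink on every retry */
    } while (!atomic_compare_exchange_weak_explicit(
        head, &old, (uintptr_t)b,
        memory_order_release, memory_order_relaxed));
    atomic_fetch_add_explicit(count, 1, memory_order_relaxed);
}

/* Pop one block, or return NULL if the list is empty.
 * NOTE: not ABA-safe; this only shows the CAS loop shape. */
static SketchBlock* sketch_freelist_pop(atomic_uintptr_t* head,
                                        atomic_uint* count) {
    uintptr_t old = atomic_load_explicit(head, memory_order_acquire);
    while (old != 0) {
        SketchBlock* next = ((SketchBlock*)old)->next;
        if (atomic_compare_exchange_weak_explicit(
                head, &old, (uintptr_t)next,
                memory_order_acquire, memory_order_acquire)) {
            atomic_fetch_sub_explicit(count, 1, memory_order_relaxed);
            return (SketchBlock*)old;
        }
        /* On failure, old was reloaded with the current head; retry. */
    }
    return NULL;
}
```

The counter is updated separately from the head, so it is only approximately consistent under contention; that is usually acceptable for statistics but not for correctness decisions.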
Phase 6.27: Learner Integration (~5 hours)
Goal: Dynamic CAP/W_MAX tuning based on runtime stats
# Enable existing learner
HAKMEM_LEARN=1
HAKMEM_TARGET_HIT_MID=0.65
HAKMEM_CAP_STEP_MID=8
HAKMEM_CAP_MAX_MID=512
# Optional: W_MAX learning (risky)
HAKMEM_WMAX_LEARN=1
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7
HAKMEM_WMAX_CANARY=1 # Safe exploration
Files: hakmem_ace.c (+15 LOC), hakmem_learner.c (+10 LOC)
Expected: +5-10% (all workloads)
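One way to picture the CAP tuning loop is a bounded step controller: raise CAP when the hit rate is below the target band, lower it when above, and clamp to the configured limits. The sketch below is an assumption about the learner's shape, not code from hakmem_learner.c; the struct, its field names, and the symmetric band are hypothetical.

```c
/* Hypothetical policy, with fields named after the env vars in this guide. */
typedef struct {
    double target_hit;  /* HAKMEM_TARGET_HIT_MID, e.g. 0.65 */
    double band;        /* assumed tolerance around the target, e.g. 0.03 */
    int cap_step;       /* HAKMEM_CAP_STEP_MID */
    int cap_min;        /* HAKMEM_CAP_MIN_MID */
    int cap_max;        /* HAKMEM_CAP_MAX_MID */
} SketchCapPolicy;

/* Return the next CAP for one class given the hit rate of the last window. */
static int sketch_next_cap(const SketchCapPolicy* p, int cap, double hit_rate) {
    if (hit_rate < p->target_hit - p->band) {
        cap += p->cap_step;          /* too many misses: grow the cache */
    } else if (hit_rate > p->target_hit + p->band) {
        cap -= p->cap_step;          /* comfortably above target: shrink */
    }                                /* inside the band: leave CAP alone */

    if (cap < p->cap_min) cap = p->cap_min;
    if (cap > p->cap_max) cap = p->cap_max;
    return cap;
}
```

With the Phase 6.27 settings above (step 8, max 512), a class at CAP 64 with a 0.50 hit rate would move to 72 on the next update under this model.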
🚀 Quick Start (Implementation Order)
Week 1: Batching + Learner (Parallel)
Day 1-2: Phase 6.25
# 1. Implement batch function
cd /home/tomoaki/git/hakmem
vim hakmem_pool.c # Add alloc_tls_page_batch() after line 486
# 2. Integrate into alloc path
vim hakmem_pool.c # Modify line 931 (refill call site)
# 3. Add env var
vim hakmem_pool.c # Add global + parse in hak_pool_init()
# 4. Test
make clean && make
HAKMEM_POOL_REFILL_BATCH=2 ./test_pool_basic
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh
Day 2-3: Phase 6.27
# 1. Add ACE waste tracking
vim hakmem_ace.c # Add hak_ace_get_total_waste()
# 2. Update learner score
vim hakmem_learner.c # Line 414, add frag penalty
# 3. Test
HAKMEM_LEARN=1 HAKMEM_TARGET_HIT_MID=0.70 RUNTIME=60 THREADS=1,4 \
./scripts/run_bench_suite.sh
Week 2: Lock-Free
Day 1-3: Phase 6.26
# 1. Replace data structures
vim hakmem_pool.c # Line 276-280, atomics
# 2. Implement lock-free ops
vim hakmem_pool.c # Add the new lock-free functions
# 3. Integrate
vim hakmem_pool.c # Replace lock/unlock with CAS
# 4. Test (CRITICAL: TSan)
make clean && make CFLAGS="-fsanitize=thread"
THREADS=16 DURATION=60 ./test_pool_lockfree_stress
# 5. Benchmark
RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh
📊 Expected Results
| Phase | Mid 1T | Mid 4T | vs mimalloc (1T) | vs mimalloc (4T) |
|---|---|---|---|---|
| Baseline (6.21) | 4.0 M/s | 13.8 M/s | 28% | 47% |
| + 6.25 (Batch) | 4.5 M/s | 14.5 M/s | 31% | 49% |
| + 6.26 (Lock-Free) | 4.6 M/s | 17.0 M/s | 32% | 58% |
| + 6.27 (Learner) | 5.0 M/s | 18.5 M/s | 34% | 63% ✅ |
| Target (60-75%) | 8.8-11.0 M/s | 17.7-22.1 M/s | 60-75% | 60-75% |
✅ 4T target projected to be met (63%, inside the 61-68% range)
❌ 1T projected to stay short at 34% (needs Phase 6.28: header elimination)
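The percentage columns can be reproduced from the throughput columns. The sketch below assumes a 4T mimalloc reference of about 29.5 M/s, which is inferred from the target row (17.7 M/s at 60%) and is not a measured value stated in this guide; the function name is hypothetical.

```c
/* Throughput as a rounded percentage of a reference allocator's throughput. */
static int sketch_pct_of_reference(double ops_per_sec,
                                   double reference_ops_per_sec) {
    return (int)(100.0 * ops_per_sec / reference_ops_per_sec + 0.5);
}
```

For example, sketch_pct_of_reference(18.5, 29.5) yields 63, matching the projected 4T figure after Phase 6.27.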
🧪 Testing Commands
Correctness Tests
# Unit test (per phase)
./test_pool_refill_batch # Phase 6.25
./test_pool_lockfree # Phase 6.26
./test_pool_learner # Phase 6.27
# Memory safety
valgrind --leak-check=full ./test_pool_refill_batch
make clean && make CFLAGS="-fsanitize=address"
./test_pool_lockfree
# Thread safety (Phase 6.26 CRITICAL)
make clean && make CFLAGS="-fsanitize=thread"
THREADS=16 DURATION=60 ./test_pool_lockfree_stress
Performance Tests
# Quick test (3 sec)
RUNTIME=3 THREADS=1,4 ./scripts/run_bench_suite.sh
# Full test (10 sec, production)
RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh
# Stress test (60 sec, stability)
RUNTIME=60 THREADS=1,4,8 ./scripts/run_bench_suite.sh
# Head-to-head comparison
./scripts/head_to_head_large.sh # vs mimalloc
A/B Testing
# Baseline (batch=1, no learner)
HAKMEM_POOL_REFILL_BATCH=1 HAKMEM_LEARN=0 \
RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
> baseline.txt
# Phase 6.25 (batch=2)
HAKMEM_POOL_REFILL_BATCH=2 HAKMEM_LEARN=0 \
RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
> phase_6_25.txt
# Phase 6.27 (learner)
HAKMEM_POOL_REFILL_BATCH=2 HAKMEM_LEARN=1 HAKMEM_TARGET_HIT_MID=0.65 \
RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
> phase_6_27.txt
# Compare
grep "Throughput" baseline.txt phase_6_25.txt phase_6_27.txt
🔧 Troubleshooting
Phase 6.25 Issues
Symptom: No performance improvement
- Check: g_pool_refill_batch_size value (should be 2-4)
- Check: Pages allocated counter (should increase in batches)
- Check: Ring buffer filling (should hit RING_CAP more often)
Symptom: Memory bloat
- Reduce: HAKMEM_POOL_REFILL_BATCH=2 (from 4)
- Check: Respect CAP limits in batch allocator
- Check: No memory leaks (valgrind)
Phase 6.26 Issues
Symptom: Crash/hang
- Run: ThreadSanitizer (TSan) to find races
- Check: CAS loop cannot spin forever (add a retry limit)
- Check: Memory ordering (acquire/release)
Symptom: Slower than mutex version
- Check: CAS retry rate (should be <5%)
- Check: Single-thread overhead (should be minimal)
- Add: Exponential backoff after N retries
Symptom: Lost blocks (counter mismatch)
- Check: Batch push count matches list length
- Check: No concurrent modification during CAS
- Add: Invariant checks (debug build)
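For the lost-blocks symptom, a debug-build invariant on the batch push catches a count that disagrees with the chain before it corrupts freelist_count. The sketch below is illustrative only: the names are hypothetical and it is not the planned freelist_push_batch_lockfree() implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct BatchNode { struct BatchNode* next; } BatchNode;

/* Count nodes from first up to and including last (stops early at NULL). */
static unsigned sketch_chain_length(const BatchNode* first,
                                    const BatchNode* last) {
    unsigned n = 0;
    for (const BatchNode* p = first; p != NULL; p = p->next) {
        n++;
        if (p == last) break;
    }
    return n;
}

/* Splice a pre-linked chain [first..last] of n nodes onto the list head. */
static void sketch_push_batch(atomic_uintptr_t* head, atomic_uint* count,
                              BatchNode* first, BatchNode* last, unsigned n) {
    /* Debug invariant: caller's count must match the chain (off with NDEBUG). */
    assert(sketch_chain_length(first, last) == n);

    uintptr_t old = atomic_load_explicit(head, memory_order_relaxed);
    do {
        last->next = (BatchNode*)old;  /* tail points at the current head */
    } while (!atomic_compare_exchange_weak_explicit(
        head, &old, (uintptr_t)first,
        memory_order_release, memory_order_relaxed));
    atomic_fetch_add_explicit(count, n, memory_order_relaxed);
}
```

The chain must be private to the calling thread until the CAS succeeds; walking it for the invariant is safe only under that condition.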
Phase 6.27 Issues
Symptom: CAP oscillation
- Increase: HAKMEM_CAP_DWELL_SEC_MID=5 (from 3)
- Increase: HAKMEM_LEARN_MIN_SAMPLES=512 (from 256)
- Widen: Target band (0.65 ± 0.03 → 0.65 ± 0.05)
Symptom: No CAP changes
- Check: Hit rate is outside the target band (a change needs a delta greater than ±3%)
- Check: Sufficient samples (≥256 per window)
- Lower: HAKMEM_TARGET_HIT_MID=0.60 (from 0.65)
Symptom: W_MAX instability
- Enable: HAKMEM_WMAX_CANARY=1 (safe exploration)
- Increase: HAKMEM_WMAX_TRIAL_SEC=10 (from 5)
- Narrow: Candidate range (1.4-1.7 → 1.5-1.6)
📝 Environment Variables Reference
Phase 6.25: Batching
| Variable | Default | Range | Description |
|---|---|---|---|
| HAKMEM_POOL_REFILL_BATCH | 2 | 1-4 | Pages per refill (1=baseline) |
Phase 6.26: Lock-Free
(No new env vars, pure implementation change)
Phase 6.27: Learner
| Variable | Default | Range | Description |
|---|---|---|---|
| HAKMEM_LEARN | 0 | 0-1 | Enable learner (0=off, 1=on) |
| HAKMEM_TARGET_HIT_MID | 0.65 | 0.5-0.9 | Target hit rate for Mid Pool |
| HAKMEM_CAP_STEP_MID | 4 | 1-16 | CAP increment/decrement size |
| HAKMEM_CAP_MIN_MID | 8 | 4-64 | Minimum CAP per class |
| HAKMEM_CAP_MAX_MID | 2048 | 128-4096 | Maximum CAP per class |
| HAKMEM_CAP_DWELL_SEC_MID | 3 | 1-10 | Stability period (sec) |
| HAKMEM_LEARN_WINDOW_MS | 1000 | 500-5000 | Sampling interval (ms) |
| HAKMEM_LEARN_MIN_SAMPLES | 256 | 64-1024 | Min samples to trigger update |
W_MAX Learning (Optional):
| Variable | Default | Range | Description |
|---|---|---|---|
| HAKMEM_WMAX_LEARN | 0 | 0-1 | Enable W_MAX exploration |
| HAKMEM_WMAX_CANDIDATES_MID | 1.4,1.6,... | CSV list | W_MAX values to try |
| HAKMEM_WMAX_CANARY | 1 | 0-1 | Safe exploration (1=on) |
| HAKMEM_WMAX_TRIAL_SEC | 5 | 3-15 | Canary trial duration (sec) |
| HAKMEM_WMAX_ADOPT_PCT | 0.01 | 0.005-0.05 | Adoption threshold (1%) |
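How the adoption threshold is applied is not spelled out in this guide. One plausible reading, shown here purely as an assumption, is that a canary trial's W_MAX candidate is adopted only if its score beats the current setting by at least HAKMEM_WMAX_ADOPT_PCT. The function name and the notion of a scalar "score" are hypothetical.

```c
/* Assumed rule: adopt the candidate only if it improves the score by at
 * least adopt_pct (0.01 = 1%) relative to the current baseline. */
static int sketch_should_adopt(double baseline_score, double trial_score,
                               double adopt_pct) {
    return trial_score >= baseline_score * (1.0 + adopt_pct);
}
```

Under this reading, a trial scoring 102 against a baseline of 100 would be adopted at the default 1% threshold, while a trial scoring 100.5 would not.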
🎯 Success Criteria
Must-Have (Release Blockers)
- ✅ No crashes in 60-sec stress test (16T)
- ✅ No memory leaks (valgrind clean)
- ✅ No data races (TSan clean)
- ✅ Mid 4T: ≥17.0 M/s (≥58% of mimalloc)
Should-Have (Quality Bar)
- ✅ Mid 1T: ≥4.5 M/s (≥31% of mimalloc)
- ✅ Memory footprint: ≤30 MB baseline
- ✅ No regression on Tiny/Large (<5%)
Nice-to-Have (Stretch Goals)
- ✅ Mid 4T: ≥18.5 M/s (≥63% of mimalloc) ← TARGET
- ✅ Learner converges in <60 sec
- ✅ W_MAX learning finds better value
📚 Related Documents
- Full Plan: PHASE_6.25_6.27_IMPLEMENTATION_PLAN.md (this directory)
- Previous Results: PHASE_6.21_RESULTS_2025_10_24.md
- Env Vars: ../specs/ENV_VARS.md
- Benchmarks: ../benchmarks/README.md
Last Updated: 2025-10-24
Status: Ready for Implementation