Phase 6.25-6.27: Quick Reference Guide

Target: Raise Mid Pool throughput from 47% to 61-68% of mimalloc at 4 threads (4T)


📋 Implementation Checklist

Phase 6.25: Refill Batching (~6 hours)

Goal: Reduce refill latency by allocating 2-4 pages at once

/* New function in hakmem_pool.c */
static int alloc_tls_page_batch(
    int class_idx, int batch_size,
    PoolTLSPage* slots[], int num_slots,
    PoolTLSRing* ring, PoolTLSBin* bin
);

# New env var
HAKMEM_POOL_REFILL_BATCH=2  # Default (conservative)
HAKMEM_POOL_REFILL_BATCH=4  # Aggressive

Files: hakmem_pool.c (~116 LOC)

Expected: +10-15% (Mid 1T)
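
The env var handling for this phase can be sketched as follows. This is a minimal sketch, not the hakmem code: the helper names `parse_refill_batch` and `read_refill_batch_from_env` are hypothetical, and the only facts taken from this guide are the variable name, its default of 2, and its 1-4 range. Falling back to the default on malformed input is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Range and default documented for HAKMEM_POOL_REFILL_BATCH in this guide. */
#define REFILL_BATCH_MIN     1
#define REFILL_BATCH_MAX     4
#define REFILL_BATCH_DEFAULT 2

/* Hypothetical helper: convert the raw env string into a batch size.
 * NULL, empty, non-numeric, or trailing-garbage input yields the default
 * (an assumed policy); numeric input is clamped to the documented range. */
static int parse_refill_batch(const char *text)
{
    if (text == NULL || *text == '\0')
        return REFILL_BATCH_DEFAULT;

    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0')
        return REFILL_BATCH_DEFAULT;

    if (value < REFILL_BATCH_MIN) return REFILL_BATCH_MIN;
    if (value > REFILL_BATCH_MAX) return REFILL_BATCH_MAX;
    return (int)value;
}

/* Hypothetical init hook: read the variable once at startup. */
static int read_refill_batch_from_env(void)
{
    int batch = parse_refill_batch(getenv("HAKMEM_POOL_REFILL_BATCH"));
    assert(batch >= REFILL_BATCH_MIN && batch <= REFILL_BATCH_MAX);
    return batch;
}
```

Clamping rather than rejecting out-of-range values keeps a typo such as `HAKMEM_POOL_REFILL_BATCH=40` from disabling batching or over-allocating.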


Phase 6.26: Lock-Free Refill (~11 hours)

Goal: Replace mutex with atomic CAS on freelist

# Replace in hakmem_pool.c
- PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
+ atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
+ atomic_uint freelist_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

# New functions
freelist_pop_lockfree()
freelist_push_lockfree()
freelist_push_batch_lockfree()
drain_remote_lockfree()

Files: hakmem_pool.c (~140 LOC, net ~100)

Expected: +15-20% (Mid 4T)
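
A CAS-based freelist of the kind this phase describes might look like the sketch below. It assumes a block stores its `next` link in its first word, and it uses hypothetical names (`FreeBlock`, `LockFreeList`, `sketch_push`, `sketch_pop`) rather than the real hakmem signatures. A plain CAS pop like this one is exposed to the ABA problem when blocks can be popped and pushed back concurrently; a production version usually adds a version tag or similar protection, which is omitted here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed block layout: the first word of a free block links to the next. */
typedef struct FreeBlock {
    struct FreeBlock *next;
} FreeBlock;

/* One shard's list, mirroring the freelist_head / freelist_count pair. */
typedef struct {
    atomic_uintptr_t head;
    atomic_uint      count;
} LockFreeList;

static void sketch_init(LockFreeList *list)
{
    atomic_init(&list->head, (uintptr_t)0);
    atomic_init(&list->count, 0u);
}

/* Push one block: link it to the current head, then publish it with CAS. */
static void sketch_push(LockFreeList *list, FreeBlock *block)
{
    assert(block != NULL);
    uintptr_t old_head = atomic_load_explicit(&list->head, memory_order_relaxed);
    do {
        block->next = (FreeBlock *)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &list->head, &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_relaxed));
    atomic_fetch_add_explicit(&list->count, 1u, memory_order_relaxed);
}

/* Pop one block, or return NULL when the list is empty.
 * NOTE: not ABA-safe; see the caveat in the text above. */
static FreeBlock *sketch_pop(LockFreeList *list)
{
    uintptr_t old_head = atomic_load_explicit(&list->head, memory_order_acquire);
    while (old_head != 0) {
        FreeBlock *top  = (FreeBlock *)old_head;
        uintptr_t  next = (uintptr_t)top->next;
        if (atomic_compare_exchange_weak_explicit(
                &list->head, &old_head, next,
                memory_order_acquire, memory_order_acquire)) {
            atomic_fetch_sub_explicit(&list->count, 1u, memory_order_relaxed);
            return top;
        }
        /* CAS failure reloaded old_head; retry with the new value. */
    }
    return NULL;
}
```

The counter is updated after the CAS with relaxed ordering, so it can briefly disagree with the list; treat it as a statistic, not as a synchronization point.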


Phase 6.27: Learner Integration (~5 hours)

Goal: Dynamic CAP/W_MAX tuning based on runtime stats

# Enable existing learner
HAKMEM_LEARN=1
HAKMEM_TARGET_HIT_MID=0.65
HAKMEM_CAP_STEP_MID=8
HAKMEM_CAP_MAX_MID=512

# Optional: W_MAX learning (risky)
HAKMEM_WMAX_LEARN=1
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7
HAKMEM_WMAX_CANARY=1  # Safe exploration

Files: hakmem_ace.c (+15 LOC), hakmem_learner.c (+10 LOC)

Expected: +5-10% (all workloads)
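
The CAP tuning idea can be illustrated with a small step function. This is a sketch under assumptions, not the learner in hakmem_learner.c: the `CapPolicy` struct and `next_cap` are hypothetical, and the direction of adjustment (raise CAP when the hit rate is below the band, lower it when above) is an assumed policy. The field comments name the env vars they would correspond to.

```c
#include <assert.h>

/* Hypothetical policy bundle; fields map to the env vars named in comments. */
typedef struct {
    double target_hit;  /* HAKMEM_TARGET_HIT_MID */
    double band;        /* half-width of the dead band, e.g. 0.03 */
    int    step;        /* HAKMEM_CAP_STEP_MID */
    int    cap_min;     /* HAKMEM_CAP_MIN_MID */
    int    cap_max;     /* HAKMEM_CAP_MAX_MID */
} CapPolicy;

/* Return the CAP to use for the next window.
 * Assumed policy: below the band -> grow, above the band -> shrink,
 * inside the band -> hold. The result is clamped to [cap_min, cap_max]. */
static int next_cap(const CapPolicy *p, int cap, double hit_rate)
{
    assert(p->cap_min <= p->cap_max && p->step > 0);

    if (hit_rate < p->target_hit - p->band)
        cap += p->step;
    else if (hit_rate > p->target_hit + p->band)
        cap -= p->step;

    if (cap < p->cap_min) cap = p->cap_min;
    if (cap > p->cap_max) cap = p->cap_max;
    return cap;
}
```

With a target of 0.65 and a band of 0.03, a window that measures a 0.50 hit rate grows CAP by one step, while a window at 0.66 leaves it unchanged.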


🚀 Quick Start (Implementation Order)

Week 1: Batching + Learner (Parallel)

Day 1-2: Phase 6.25

# 1. Implement batch function
cd /home/tomoaki/git/hakmem
vim hakmem_pool.c  # Add alloc_tls_page_batch() after line 486

# 2. Integrate into alloc path
vim hakmem_pool.c  # Modify line 931 (refill call site)

# 3. Add env var
vim hakmem_pool.c  # Add global + parse in hak_pool_init()

# 4. Test
make clean && make
HAKMEM_POOL_REFILL_BATCH=2 ./test_pool_basic
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh

Day 2-3: Phase 6.27

# 1. Add ACE waste tracking
vim hakmem_ace.c  # Add hak_ace_get_total_waste()

# 2. Update learner score
vim hakmem_learner.c  # Line 414, add frag penalty

# 3. Test
HAKMEM_LEARN=1 HAKMEM_TARGET_HIT_MID=0.70 RUNTIME=60 THREADS=1,4 \
  ./scripts/run_bench_suite.sh

Week 2: Lock-Free

Day 1-3: Phase 6.26

# 1. Replace data structures
vim hakmem_pool.c  # Line 276-280, atomics

# 2. Implement lock-free ops
vim hakmem_pool.c  # Add the 4 new lock-free functions (pop, push, batch push, remote drain)

# 3. Integrate
vim hakmem_pool.c  # Replace lock/unlock with CAS

# 4. Test (CRITICAL: TSan)
make clean && make CFLAGS="-fsanitize=thread"
THREADS=16 DURATION=60 ./test_pool_lockfree_stress

# 5. Benchmark
RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh

📊 Expected Results

| Phase | Mid 1T | Mid 4T | vs mimalloc (1T) | vs mimalloc (4T) |
|---|---|---|---|---|
| Baseline (6.21) | 4.0 M/s | 13.8 M/s | 28% | 47% |
| + 6.25 (Batch) | 4.5 M/s | 14.5 M/s | 31% | 49% |
| + 6.26 (Lock-Free) | 4.6 M/s | 17.0 M/s | 32% | 58% |
| + 6.27 (Learner) | 5.0 M/s | 18.5 M/s | 34% | 63% |
| Target (60-75%) | 8.8-11.0 M/s | 17.7-22.1 M/s | 60-75% | 60-75% |

Projected: the 4T target is met (63%, inside the 61-68% range). 1T remains well short of the 60% target and needs Phase 6.28 (header elimination).


🧪 Testing Commands

Correctness Tests

# Unit test (per phase)
./test_pool_refill_batch  # Phase 6.25
./test_pool_lockfree      # Phase 6.26
./test_pool_learner       # Phase 6.27

# Memory safety
valgrind --leak-check=full ./test_pool_refill_batch
make clean && make CFLAGS="-fsanitize=address"
./test_pool_lockfree

# Thread safety (Phase 6.26 CRITICAL)
make clean && make CFLAGS="-fsanitize=thread"
THREADS=16 DURATION=60 ./test_pool_lockfree_stress

Performance Tests

# Quick test (3 sec)
RUNTIME=3 THREADS=1,4 ./scripts/run_bench_suite.sh

# Full test (10 sec, production)
RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh

# Stress test (60 sec, stability)
RUNTIME=60 THREADS=1,4,8 ./scripts/run_bench_suite.sh

# Head-to-head comparison
./scripts/head_to_head_large.sh  # vs mimalloc

A/B Testing

# Baseline (batch=1, no learner)
HAKMEM_POOL_REFILL_BATCH=1 HAKMEM_LEARN=0 \
  RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
  > baseline.txt

# Phase 6.25 (batch=2)
HAKMEM_POOL_REFILL_BATCH=2 HAKMEM_LEARN=0 \
  RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
  > phase_6_25.txt

# Phase 6.27 (learner)
HAKMEM_POOL_REFILL_BATCH=2 HAKMEM_LEARN=1 HAKMEM_TARGET_HIT_MID=0.65 \
  RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
  > phase_6_27.txt

# Compare
grep "Throughput" baseline.txt phase_6_25.txt phase_6_27.txt

🔧 Troubleshooting

Phase 6.25 Issues

Symptom: No performance improvement

  • Check: g_pool_refill_batch_size value (should be 2-4)
  • Check: Pages allocated counter (should increase in batches)
  • Check: Ring buffer filling (should hit RING_CAP more often)

Symptom: Memory bloat

  • Reduce: HAKMEM_POOL_REFILL_BATCH=2 (from 4)
  • Check: Respect CAP limits in batch allocator
  • Check: No memory leaks (valgrind)
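
The "respect CAP limits" check can be sketched as a clamp applied before each batch refill. The function name and its parameters are hypothetical; how hakmem actually tracks pages in use per class is not shown in this guide.

```c
#include <assert.h>

/* Hypothetical helper: how many pages a batch refill may allocate without
 * pushing the class past its CAP. Returns 0 when the class is already at
 * or over CAP, otherwise the smaller of the request and the headroom. */
static int clamp_batch_to_cap(int requested, int cap, int pages_in_use)
{
    assert(requested >= 0 && cap >= 0 && pages_in_use >= 0);

    int headroom = cap - pages_in_use;
    if (headroom <= 0)
        return 0;
    return requested < headroom ? requested : headroom;
}
```

If a batch of 4 is requested with only 2 pages of headroom, the refill allocates 2, so batching cannot by itself inflate the footprint beyond CAP.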

Phase 6.26 Issues

Symptom: Crash/hang

  • Run: ThreadSanitizer (TSan) to find races
  • Check: CAS loop cannot spin forever (add a retry limit)
  • Check: Memory ordering (acquire/release)

Symptom: Slower than mutex version

  • Check: CAS retry rate (should be <5%)
  • Check: Single-thread overhead (should be minimal)
  • Add: Exponential backoff after N retries
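
The retry limit, retry-rate measurement, and exponential backoff mentioned above can be combined in one push routine. This is a sketch under assumptions: the constants, the `CasStats` struct, and `bounded_push` are hypothetical, and the busy-wait in `spin_pause` stands in for whatever pause primitive the platform offers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_RETRIES   64    /* hypothetical retry limit */
#define SKETCH_BACKOFF_AFTER 4     /* start backing off after N failures */
#define SKETCH_MAX_SPINS     1024  /* cap on the backoff length */

typedef struct Node { struct Node *next; } Node;

/* Counters for checking the CAS retry rate. */
typedef struct { unsigned attempts; unsigned retries; } CasStats;

/* Crude busy-wait; a real build would likely use a CPU pause hint. */
static void spin_pause(unsigned spins)
{
    for (volatile unsigned i = 0; i < spins; i++) { }
}

/* Push with a bounded number of CAS attempts. Returns false when the
 * limit is hit so the caller can fall back to a slower path. */
static bool bounded_push(atomic_uintptr_t *head, Node *node, CasStats *stats)
{
    assert(head != NULL && node != NULL && stats != NULL);
    unsigned  spins    = 1;
    uintptr_t old_head = atomic_load_explicit(head, memory_order_relaxed);

    for (unsigned retry = 0; retry <= SKETCH_MAX_RETRIES; retry++) {
        stats->attempts++;
        node->next = (Node *)old_head;
        if (atomic_compare_exchange_weak_explicit(
                head, &old_head, (uintptr_t)node,
                memory_order_release, memory_order_relaxed))
            return true;

        stats->retries++;
        if (retry >= SKETCH_BACKOFF_AFTER) {
            spin_pause(spins);               /* back off under contention */
            if (spins < SKETCH_MAX_SPINS)
                spins *= 2;                  /* exponential growth, capped */
        }
    }
    return false;
}

/* Fraction of CAS attempts that failed and had to be retried. */
static double retry_rate(const CasStats *s)
{
    return s->attempts ? (double)s->retries / (double)s->attempts : 0.0;
}
```

Keeping `CasStats` per thread avoids adding a second contended cache line just to measure contention.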

Symptom: Lost blocks (counter mismatch)

  • Check: Batch push count matches list length
  • Check: No concurrent modification during CAS
  • Add: Invariant checks (debug build)
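
A debug-build invariant for the batch push count could be sketched as below. The `Block` type, `batch_is_consistent`, and the `CHECK_BATCH` macro are hypothetical names; the check confirms that walking from the first block reaches the last block in exactly the claimed number of steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Block { struct Block *next; } Block;

/* Return true when the chain first..last contains exactly `claimed` blocks.
 * The walk is bounded by `claimed`, so a corrupted (cyclic) chain cannot
 * make the check itself loop forever. */
static bool batch_is_consistent(const Block *first, const Block *last,
                                unsigned claimed)
{
    if (first == NULL || last == NULL)
        return claimed == 0 && first == NULL && last == NULL;

    unsigned     seen = 1;
    const Block *cur  = first;
    while (cur != last && seen < claimed && cur->next != NULL) {
        cur = cur->next;
        seen++;
    }
    return cur == last && seen == claimed;
}

/* Compiled out when NDEBUG is defined, so release builds pay nothing. */
#define CHECK_BATCH(first, last, count) \
    assert(batch_is_consistent((first), (last), (count)))
```

Run the check on the private chain before it is published with CAS; after publication other threads may modify the list and the walk is no longer safe.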

Phase 6.27 Issues

Symptom: CAP oscillation

  • Increase: HAKMEM_CAP_DWELL_SEC_MID=5 (from 3)
  • Increase: HAKMEM_LEARN_MIN_SAMPLES=512 (from 256)
  • Widen: Target band (0.65 ± 0.03 → 0.65 ± 0.05)

Symptom: No CAP changes

  • Check: Hit rate is outside the target band (a CAP change needs a delta of more than ±3%)
  • Check: Sufficient samples (≥256 per window)
  • Lower: HAKMEM_TARGET_HIT_MID=0.60 (from 0.65)
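
Both the oscillation and the no-change symptoms come down to the conditions that gate a CAP update. The sketch below shows one way those conditions could be combined; the `LearnGate` struct and `cap_change_allowed` are hypothetical, and the actual order and form of the checks in the learner are not documented here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical gate settings; fields map to the env vars named in comments. */
typedef struct {
    double   dwell_sec;    /* HAKMEM_CAP_DWELL_SEC_MID */
    unsigned min_samples;  /* HAKMEM_LEARN_MIN_SAMPLES */
    double   target_hit;   /* HAKMEM_TARGET_HIT_MID */
    double   band;         /* half-width of the target band, e.g. 0.03 */
} LearnGate;

/* Decide whether the learner may change CAP in this window.
 * All three conditions must hold: enough samples, the dwell period has
 * elapsed since the last change, and the hit rate is outside the band. */
static bool cap_change_allowed(const LearnGate *g, double now_sec,
                               double last_change_sec, unsigned samples,
                               double hit_rate)
{
    assert(g->dwell_sec >= 0.0 && g->band >= 0.0);

    if (samples < g->min_samples)
        return false;                       /* window too small to trust */
    if (now_sec - last_change_sec < g->dwell_sec)
        return false;                       /* still in the stability period */

    double delta = hit_rate - g->target_hit;
    if (delta < 0.0)
        delta = -delta;
    return delta > g->band;                 /* only act outside the band */
}
```

Raising the dwell time, the sample minimum, or the band makes the gate stricter and damps oscillation; if CAP never changes, logging which of the three conditions rejects each window shows what to relax.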

Symptom: W_MAX instability

  • Enable: HAKMEM_WMAX_CANARY=1 (safe exploration)
  • Increase: HAKMEM_WMAX_TRIAL_SEC=10 (from 5)
  • Narrow: Candidate range (1.4-1.7 → 1.5-1.6)

📝 Environment Variables Reference

Phase 6.25: Batching

| Variable | Default | Range | Description |
|---|---|---|---|
| HAKMEM_POOL_REFILL_BATCH | 2 | 1-4 | Pages per refill (1=baseline) |

Phase 6.26: Lock-Free

(No new env vars, pure implementation change)

Phase 6.27: Learner

| Variable | Default | Range | Description |
|---|---|---|---|
| HAKMEM_LEARN | 0 | 0-1 | Enable learner (0=off, 1=on) |
| HAKMEM_TARGET_HIT_MID | 0.65 | 0.5-0.9 | Target hit rate for Mid Pool |
| HAKMEM_CAP_STEP_MID | 4 | 1-16 | CAP increment/decrement size |
| HAKMEM_CAP_MIN_MID | 8 | 4-64 | Minimum CAP per class |
| HAKMEM_CAP_MAX_MID | 2048 | 128-4096 | Maximum CAP per class |
| HAKMEM_CAP_DWELL_SEC_MID | 3 | 1-10 | Stability period (sec) |
| HAKMEM_LEARN_WINDOW_MS | 1000 | 500-5000 | Sampling interval (ms) |
| HAKMEM_LEARN_MIN_SAMPLES | 256 | 64-1024 | Min samples to trigger update |

W_MAX Learning (Optional):

| Variable | Default | Range | Description |
|---|---|---|---|
| HAKMEM_WMAX_LEARN | 0 | 0-1 | Enable W_MAX exploration |
| HAKMEM_WMAX_CANDIDATES_MID | 1.4,1.6,... | CSV list | W_MAX values to try |
| HAKMEM_WMAX_CANARY | 1 | 0-1 | Safe exploration (1=on) |
| HAKMEM_WMAX_TRIAL_SEC | 5 | 3-15 | Canary trial duration (sec) |
| HAKMEM_WMAX_ADOPT_PCT | 0.01 | 0.005-0.05 | Adoption threshold (0.01 = 1%) |

🎯 Success Criteria

Must-Have (Release Blockers)

  • No crashes in 60-sec stress test (16T)
  • No memory leaks (valgrind clean)
  • No data races (TSan clean)
  • Mid 4T: ≥17.0 M/s (≥58% of mimalloc)

Should-Have (Quality Bar)

  • Mid 1T: ≥4.5 M/s (≥31% of mimalloc)
  • Memory footprint: ≤30 MB baseline
  • No regression on Tiny/Large (<5%)

Nice-to-Have (Stretch Goals)

  • Mid 4T: ≥18.5 M/s (≥63% of mimalloc) ← TARGET
  • Learner converges in <60 sec
  • W_MAX learning finds better value

Related Documents

  • Full Plan: PHASE_6.25_6.27_IMPLEMENTATION_PLAN.md (this directory)
  • Previous Results: PHASE_6.21_RESULTS_2025_10_24.md
  • Env Vars: ../specs/ENV_VARS.md
  • Benchmarks: ../benchmarks/README.md

Last Updated: 2025-10-24
Status: Ready for Implementation