# Phase 6.25-6.27: Quick Reference Guide

**Target**: Improve Mid Pool throughput from 47% to 61-68% of mimalloc (4T)

---

## 📋 Implementation Checklist

### Phase 6.25: Refill Batching (~6 hours)

**Goal**: Reduce refill latency by allocating 2-4 pages at once

```c
/* New function in hakmem_pool.c */
static int alloc_tls_page_batch(
    int class_idx, int batch_size,
    PoolTLSPage* slots[], int num_slots,
    PoolTLSRing* ring, PoolTLSBin* bin
);
```

```bash
# New env var
HAKMEM_POOL_REFILL_BATCH=2  # Default (conservative)
HAKMEM_POOL_REFILL_BATCH=4  # Aggressive
```

**Files**: `hakmem_pool.c` (~116 LOC)

**Expected**: +10-15% (Mid 1T)
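
The batching idea can be sketched in isolation. The following is a minimal toy model, not the planned `alloc_tls_page_batch()`: the `Toy*` types, the 64 KiB page size, and the ring capacity of 16 are all assumptions made for illustration. It allocates several pages per refill, fills a fixed-size ring first, and spills the remaining blocks into an overflow list.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 65536   /* assumed page size, not taken from hakmem */
#define TOY_RING_CAP  16      /* assumed ring capacity, not taken from hakmem */

typedef struct ToyBlock { struct ToyBlock *next; } ToyBlock;

typedef struct {
    void *items[TOY_RING_CAP];
    int   top;                /* number of blocks currently in the ring */
} ToyRing;

typedef struct {
    ToyBlock *head;           /* overflow LIFO list for blocks the ring cannot hold */
    int       count;
} ToyBin;

/* Allocate up to batch_size pages in one refill, carve each page into
 * block_size-byte blocks, fill the ring first and spill the rest into the bin.
 * Returns the number of pages actually allocated. */
static int toy_refill_batch(ToyRing *ring, ToyBin *bin,
                            size_t block_size, int batch_size)
{
    int pages = 0;

    /* Each block must be able to hold an aligned ToyBlock link. */
    if (block_size < sizeof(ToyBlock) || block_size > TOY_PAGE_SIZE ||
        block_size % sizeof(void *) != 0)
        return 0;

    for (int p = 0; p < batch_size; p++) {
        char *page = malloc(TOY_PAGE_SIZE);
        if (page == NULL)
            break;            /* keep whatever pages were already carved */
        pages++;

        size_t nblocks = TOY_PAGE_SIZE / block_size;
        for (size_t i = 0; i < nblocks; i++) {
            void *blk = page + i * block_size;
            if (ring->top < TOY_RING_CAP) {
                ring->items[ring->top++] = blk;
            } else {
                ToyBlock *b = blk;
                b->next = bin->head;
                bin->head = b;
                bin->count++;
            }
        }
    }
    return pages;
}
```

With 8 KiB blocks, a batch of 2 pages yields exactly 16 blocks and fills the toy ring; a batch of 4 leaves 16 blocks in the overflow list, which is where a real implementation would need to respect CAP limits.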

---

### Phase 6.26: Lock-Free Refill (~11 hours)

**Goal**: Replace mutex with atomic CAS on freelist

```diff
# Replace in hakmem_pool.c
- PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
+ atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
+ atomic_uint freelist_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

# New functions
freelist_pop_lockfree()
freelist_push_lockfree()
freelist_push_batch_lockfree()
drain_remote_lockfree()
```

**Files**: `hakmem_pool.c` (~140 LOC, net ~100)

**Expected**: +15-20% (Mid 4T)
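
As a rough illustration of the CAS-based freelist, here is a minimal Treiber-stack sketch using C11 atomics. The `LfFreelist` and `FlNode` types and the `lf_*` names are hypothetical stand-ins for one `freelist_head`/`freelist_count` slot; they are not the planned hakmem functions. The sketch does not address the ABA problem, which a production pop path has to handle (for example with tagged pointers or by restricting who may pop).

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct FlNode { struct FlNode *next; } FlNode;

typedef struct {
    atomic_uintptr_t head;   /* stands in for one freelist_head[class][shard] slot */
    atomic_uint      count;  /* stands in for one freelist_count[class][shard] slot */
} LfFreelist;

static void lf_init(LfFreelist *fl)
{
    atomic_init(&fl->head, (uintptr_t)0);
    atomic_init(&fl->count, 0u);
}

/* Push a pre-linked chain first..last (n nodes) with a single CAS. */
static void lf_push_batch(LfFreelist *fl, FlNode *first, FlNode *last, unsigned n)
{
    uintptr_t old = atomic_load_explicit(&fl->head, memory_order_relaxed);
    do {
        last->next = (FlNode *)old;          /* re-link on every retry */
    } while (!atomic_compare_exchange_weak_explicit(
                 &fl->head, &old, (uintptr_t)first,
                 memory_order_release, memory_order_relaxed));
    atomic_fetch_add_explicit(&fl->count, n, memory_order_relaxed);
}

static void lf_push(LfFreelist *fl, FlNode *node)
{
    lf_push_batch(fl, node, node, 1);
}

/* Pop one node, or return NULL when the list is empty.
 * NOTE: vulnerable to ABA if nodes can be popped and re-pushed concurrently. */
static FlNode *lf_pop(LfFreelist *fl)
{
    uintptr_t old = atomic_load_explicit(&fl->head, memory_order_acquire);
    while (old != 0) {
        FlNode *next = ((FlNode *)old)->next;
        if (atomic_compare_exchange_weak_explicit(
                &fl->head, &old, (uintptr_t)next,
                memory_order_acquire, memory_order_acquire)) {
            atomic_fetch_sub_explicit(&fl->count, 1u, memory_order_relaxed);
            return (FlNode *)old;
        }
        /* CAS failure reloaded `old`; loop and retry. */
    }
    return NULL;
}
```

The counter is updated after the CAS, so it can briefly disagree with the list length under concurrency; it is only exact when the list is quiescent.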

---

### Phase 6.27: Learner Integration (~5 hours)

**Goal**: Dynamic CAP/W_MAX tuning based on runtime stats

```bash
# Enable existing learner
HAKMEM_LEARN=1
HAKMEM_TARGET_HIT_MID=0.65
HAKMEM_CAP_STEP_MID=8
HAKMEM_CAP_MAX_MID=512

# Optional: W_MAX learning (risky)
HAKMEM_WMAX_LEARN=1
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7
HAKMEM_WMAX_CANARY=1  # Safe exploration
```

**Files**: `hakmem_ace.c` (+15 LOC), `hakmem_learner.c` (+10 LOC)

**Expected**: +5-10% (all workloads)
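
To make the tuning loop concrete, the sketch below shows one plausible CAP update rule. It is an assumption, not the existing learner: the `CapPolicy` struct and `cap_next()` are hypothetical, and the direction of adjustment (raise CAP when the hit rate is below the band, lower it when above) is a guess at the intended behaviour. The field comments name the env vars each value would come from.

```c
typedef struct {
    double   target_hit;   /* HAKMEM_TARGET_HIT_MID */
    double   band;         /* dead-band half-width; 0.03 assumed from the ±3% figure */
    int      step;         /* HAKMEM_CAP_STEP_MID */
    int      cap_min;      /* HAKMEM_CAP_MIN_MID */
    int      cap_max;      /* HAKMEM_CAP_MAX_MID */
    unsigned min_samples;  /* HAKMEM_LEARN_MIN_SAMPLES */
} CapPolicy;

/* Return the CAP to use for the next window, given this window's counters. */
static int cap_next(const CapPolicy *p, int cap, unsigned hits, unsigned total)
{
    if (total < p->min_samples)
        return cap;                       /* too few samples: leave CAP alone */

    double hit_rate = (double)hits / (double)total;

    if (hit_rate < p->target_hit - p->band)
        cap += p->step;                   /* assumed: more capacity raises hit rate */
    else if (hit_rate > p->target_hit + p->band)
        cap -= p->step;                   /* assumed: reclaim memory when over target */

    if (cap < p->cap_min) cap = p->cap_min;
    if (cap > p->cap_max) cap = p->cap_max;
    return cap;
}
```

A hit rate inside the band leaves CAP unchanged, which is the usual reason a learner appears to "do nothing" on a stable workload.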

---

## 🚀 Quick Start (Implementation Order)

### Week 1: Batching + Learner (Parallel)

**Day 1-2: Phase 6.25**
```bash
# 1. Implement batch function
cd /home/tomoaki/git/hakmem
vim hakmem_pool.c  # Add alloc_tls_page_batch() after line 486

# 2. Integrate into alloc path
vim hakmem_pool.c  # Modify line 931 (refill call site)

# 3. Add env var
vim hakmem_pool.c  # Add global + parse in hak_pool_init()

# 4. Test
make clean && make
HAKMEM_POOL_REFILL_BATCH=2 ./test_pool_basic
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh
```

**Day 2-3: Phase 6.27**
```bash
# 1. Add ACE waste tracking
vim hakmem_ace.c  # Add hak_ace_get_total_waste()

# 2. Update learner score
vim hakmem_learner.c  # Line 414, add frag penalty

# 3. Test
HAKMEM_LEARN=1 HAKMEM_TARGET_HIT_MID=0.70 RUNTIME=60 THREADS=1,4 \
  ./scripts/run_bench_suite.sh
```

### Week 2: Lock-Free

**Day 1-3: Phase 6.26**
```bash
# 1. Replace data structures
vim hakmem_pool.c  # Line 276-280, atomics

# 2. Implement lock-free ops
vim hakmem_pool.c  # Add the 4 new lock-free functions

# 3. Integrate
vim hakmem_pool.c  # Replace lock/unlock with CAS

# 4. Test (CRITICAL: TSan)
make clean && make CFLAGS="-fsanitize=thread"
THREADS=16 DURATION=60 ./test_pool_lockfree_stress

# 5. Benchmark
RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh
```

---

## 📊 Expected Results

| Phase | Mid 1T | Mid 4T | vs mimalloc (1T) | vs mimalloc (4T) |
|-------|--------|--------|------------------|------------------|
| Baseline (6.21) | 4.0 M/s | 13.8 M/s | 28% | 47% |
| + 6.25 (Batch) | 4.5 M/s | 14.5 M/s | 31% | 49% |
| + 6.26 (Lock-Free) | 4.6 M/s | 17.0 M/s | 32% | 58% |
| + 6.27 (Learner) | 5.0 M/s | 18.5 M/s | 34% | **63%** ✅ |
| **Target (60-75%)** | 8.8-11.0 M/s | 17.7-22.1 M/s | 60-75% | 60-75% |

- ✅ **4T target projected to be met** (63%, inside the 61-68% range)
- ❌ **1T still projected short** (34% vs. the 60% target; needs Phase 6.28: header elimination)
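
The 4T percentages in the table can be re-derived with a few lines of arithmetic. The sketch below assumes the mimalloc 4T reference is the value implied by the target row (17.7 M/s listed as 60%, so 17.7 / 0.60 = 29.5 M/s); that reference is back-computed here, not a measured number from this document. `pct_of()` and `kMimalloc4T` are illustrative names.

```c
/* Implied mimalloc 4T throughput in M/s, back-computed from the target row. */
static const double kMimalloc4T = 17.7 / 0.60;

/* Throughput as a whole-number percentage of a reference, rounded to nearest. */
static int pct_of(double ours_mps, double ref_mps)
{
    return (int)(100.0 * ours_mps / ref_mps + 0.5);
}
```

For example, `pct_of(18.5, kMimalloc4T)` reproduces the 63% projected after Phase 6.27, and `pct_of(13.8, kMimalloc4T)` reproduces the 47% baseline.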

---

## 🧪 Testing Commands

### Correctness Tests

```bash
# Unit test (per phase)
./test_pool_refill_batch  # Phase 6.25
./test_pool_lockfree      # Phase 6.26
./test_pool_learner       # Phase 6.27

# Memory safety
valgrind --leak-check=full ./test_pool_refill_batch
make clean && make CFLAGS="-fsanitize=address"
./test_pool_lockfree

# Thread safety (Phase 6.26 CRITICAL)
make clean && make CFLAGS="-fsanitize=thread"
THREADS=16 DURATION=60 ./test_pool_lockfree_stress
```

### Performance Tests

```bash
# Quick test (3 sec)
RUNTIME=3 THREADS=1,4 ./scripts/run_bench_suite.sh

# Full test (10 sec, production)
RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh

# Stress test (60 sec, stability)
RUNTIME=60 THREADS=1,4,8 ./scripts/run_bench_suite.sh

# Head-to-head comparison
./scripts/head_to_head_large.sh  # vs mimalloc
```

### A/B Testing

```bash
# Baseline (batch=1, no learner)
HAKMEM_POOL_REFILL_BATCH=1 HAKMEM_LEARN=0 \
  RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
  > baseline.txt

# Phase 6.25 (batch=2)
HAKMEM_POOL_REFILL_BATCH=2 HAKMEM_LEARN=0 \
  RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
  > phase_6_25.txt

# Phase 6.27 (learner)
HAKMEM_POOL_REFILL_BATCH=2 HAKMEM_LEARN=1 HAKMEM_TARGET_HIT_MID=0.65 \
  RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
  > phase_6_27.txt

# Compare
grep "Throughput" baseline.txt phase_6_25.txt phase_6_27.txt
```

---

## 🔧 Troubleshooting

### Phase 6.25 Issues

**Symptom**: No performance improvement
- Check: `g_pool_refill_batch_size` value (should be 2-4)
- Check: Pages allocated counter (should increase in batches)
- Check: Ring buffer filling (should hit RING_CAP more often)

**Symptom**: Memory bloat
- Reduce: `HAKMEM_POOL_REFILL_BATCH=2` (from 4)
- Check: Respect CAP limits in batch allocator
- Check: No memory leaks (valgrind)

### Phase 6.26 Issues

**Symptom**: Crash/hang
- Run: ThreadSanitizer (TSan) to find races
- Check: CAS loop cannot spin forever (add a retry limit)
- Check: Memory ordering (acquire/release)

**Symptom**: Slower than mutex version
- Check: CAS retry rate (should be <5%)
- Check: Single-thread overhead (should be minimal)
- Add: Exponential backoff after N retries
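
The retry-rate check and the backoff suggestion can be combined in one place. This is a sketch under assumptions: `pop_with_backoff()`, `CasStats`, and the thresholds (`BK_SPIN_LIMIT`, `BK_MAX_RETRIES`) are hypothetical, and `sched_yield()` is used as a simple POSIX backoff primitive. The stats struct is meant to be per-thread so that counting adds no contention of its own.

```c
#include <sched.h>       /* sched_yield (POSIX) */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct BkNode { struct BkNode *next; } BkNode;

typedef struct {
    unsigned long attempts;  /* CAS attempts made by this thread */
    unsigned long retries;   /* CAS attempts that failed */
} CasStats;

#define BK_SPIN_LIMIT   8    /* assumed: failed tries before backing off */
#define BK_MAX_RETRIES  64   /* assumed: give up and let the caller fall back */

/* Pop with a bounded retry count and exponential backoff.
 * Returns NULL when the list is empty OR the retry limit is reached. */
static BkNode *pop_with_backoff(atomic_uintptr_t *head, CasStats *st)
{
    unsigned delay = 1;
    uintptr_t old = atomic_load_explicit(head, memory_order_acquire);

    for (unsigned tries = 0; old != 0 && tries < BK_MAX_RETRIES; tries++) {
        BkNode *next = ((BkNode *)old)->next;
        st->attempts++;
        if (atomic_compare_exchange_weak_explicit(
                head, &old, (uintptr_t)next,
                memory_order_acquire, memory_order_acquire))
            return (BkNode *)old;

        st->retries++;
        if (tries >= BK_SPIN_LIMIT) {
            for (unsigned i = 0; i < delay; i++)
                sched_yield();           /* let the contending thread finish */
            if (delay < 64)
                delay *= 2;              /* exponential growth, capped */
        }
    }
    return NULL;
}

/* Fraction of CAS attempts that had to be retried (compare against 0.05). */
static double cas_retry_rate(const CasStats *st)
{
    return st->attempts ? (double)st->retries / (double)st->attempts : 0.0;
}
```

Because a NULL return is ambiguous, a caller would typically treat it as "take the slow path" rather than "the pool is empty".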

**Symptom**: Lost blocks (counter mismatch)
- Check: Batch push count matches list length
- Check: No concurrent modification during CAS
- Add: Invariant checks (debug build)
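
One possible debug-build invariant for the counter mismatch is to walk the list and compare its length with the counter. The helper below is a hypothetical sketch (`freelist_count_matches()` and `InvNode` are not hakmem names), and it is only meaningful while no other thread is mutating the list, for example at shutdown or inside a stop-the-world test hook.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct InvNode { struct InvNode *next; } InvNode;

/* Returns 1 when the list length equals the counter, 0 otherwise.
 * max_walk bounds the traversal so a corrupted (cyclic) list cannot hang
 * the check. Call only while the list is quiescent. */
static int freelist_count_matches(atomic_uintptr_t *head, atomic_uint *count,
                                  unsigned max_walk)
{
    unsigned n = 0;
    InvNode *cur = (InvNode *)atomic_load_explicit(head, memory_order_acquire);

    while (cur != NULL) {
        if (++n > max_walk)
            return 0;                    /* cycle or runaway list */
        cur = cur->next;
    }
    return n == atomic_load_explicit(count, memory_order_relaxed);
}
```

In a debug build this could be wrapped in `assert()` after each batch push in a single-threaded test.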

### Phase 6.27 Issues

**Symptom**: CAP oscillation
- Increase: `HAKMEM_CAP_DWELL_SEC_MID=5` (from 3)
- Increase: `HAKMEM_LEARN_MIN_SAMPLES=512` (from 256)
- Widen: Target band (0.65 ± 0.03 → 0.65 ± 0.05)
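
The dwell setting damps oscillation by refusing to change CAP again until a minimum time has passed since the last change. A minimal sketch of such a gate follows; `DwellGate` and `dwell_allows_change()` are hypothetical names, and the real learner may track this differently.

```c
typedef struct {
    long last_change_sec;  /* time of the last accepted CAP change */
    long dwell_sec;        /* HAKMEM_CAP_DWELL_SEC_MID */
} DwellGate;

/* Returns 1 and records the time if a CAP change is allowed now, else 0. */
static int dwell_allows_change(DwellGate *g, long now_sec)
{
    if (now_sec - g->last_change_sec < g->dwell_sec)
        return 0;                        /* still inside the stability period */
    g->last_change_sec = now_sec;
    return 1;
}
```

Raising the dwell from 3 to 5 seconds in this model simply lowers the maximum rate of CAP changes from one per 3 seconds to one per 5 seconds.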

**Symptom**: No CAP changes
- Check: Hit rate is actually outside the target band (a change needs a delta of more than ±3%)
- Check: Sufficient samples (≥256 per window)
- Lower: `HAKMEM_TARGET_HIT_MID=0.60` (from 0.65)

**Symptom**: W_MAX instability
- Enable: `HAKMEM_WMAX_CANARY=1` (safe exploration)
- Increase: `HAKMEM_WMAX_TRIAL_SEC=10` (from 5)
- Narrow: Candidate range (1.4-1.7 → 1.5-1.6)

---

## 📝 Environment Variables Reference

### Phase 6.25: Batching

| Variable | Default | Range | Description |
|----------|---------|-------|-------------|
| `HAKMEM_POOL_REFILL_BATCH` | 2 | 1-4 | Pages per refill (1=baseline) |
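
Parsing this variable in `hak_pool_init()` might look like the sketch below. The helper name `parse_refill_batch()` is hypothetical, and clamping out-of-range values into 1-4 (rather than rejecting them) is an assumption; unparsable input falls back to the default of 2.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a HAKMEM_POOL_REFILL_BATCH value.
 * NULL, empty, or non-numeric input yields the default (2);
 * numeric input is clamped into the documented 1-4 range. */
static int parse_refill_batch(const char *s)
{
    if (s == NULL || *s == '\0')
        return 2;

    char *end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0')
        return 2;                        /* overflow or trailing garbage */

    if (v < 1) v = 1;
    if (v > 4) v = 4;
    return (int)v;
}
```

A call site would pass the environment directly, for example `g_pool_refill_batch_size = parse_refill_batch(getenv("HAKMEM_POOL_REFILL_BATCH"));`.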

### Phase 6.26: Lock-Free

(No new env vars, pure implementation change)

### Phase 6.27: Learner

| Variable | Default | Range | Description |
|----------|---------|-------|-------------|
| `HAKMEM_LEARN` | 0 | 0-1 | Enable learner (0=off, 1=on) |
| `HAKMEM_TARGET_HIT_MID` | 0.65 | 0.5-0.9 | Target hit rate for Mid Pool |
| `HAKMEM_CAP_STEP_MID` | 4 | 1-16 | CAP increment/decrement size |
| `HAKMEM_CAP_MIN_MID` | 8 | 4-64 | Minimum CAP per class |
| `HAKMEM_CAP_MAX_MID` | 2048 | 128-4096 | Maximum CAP per class |
| `HAKMEM_CAP_DWELL_SEC_MID` | 3 | 1-10 | Stability period (sec) |
| `HAKMEM_LEARN_WINDOW_MS` | 1000 | 500-5000 | Sampling interval (ms) |
| `HAKMEM_LEARN_MIN_SAMPLES` | 256 | 64-1024 | Min samples to trigger update |

**W_MAX Learning** (Optional):

| Variable | Default | Range | Description |
|----------|---------|-------|-------------|
| `HAKMEM_WMAX_LEARN` | 0 | 0-1 | Enable W_MAX exploration |
| `HAKMEM_WMAX_CANDIDATES_MID` | 1.4,1.6,... | CSV list | W_MAX values to try |
| `HAKMEM_WMAX_CANARY` | 1 | 0-1 | Safe exploration (1=on) |
| `HAKMEM_WMAX_TRIAL_SEC` | 5 | 3-15 | Canary trial duration |
| `HAKMEM_WMAX_ADOPT_PCT` | 0.01 | 0.005-0.05 | Adoption threshold (1%) |
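
The adoption threshold can be read as "switch W_MAX only if a canary trial beats the current score by at least this fraction". The sketch below encodes that reading; it is an assumption about the semantics, and `wmax_pick()` along with its score inputs are hypothetical. Scores are taken to be "higher is better" (for example throughput).

```c
/* Pick the W_MAX to use after a round of canary trials.
 * cands[i] is a candidate value and scores[i] its trial score.
 * Returns the best candidate whose relative gain over current_score is at
 * least adopt_pct (e.g. 0.01 for 1%), or `current` if none qualifies. */
static double wmax_pick(double current, double current_score,
                        const double *cands, const double *scores, int n,
                        double adopt_pct)
{
    double best = current;
    double best_score = current_score;

    if (current_score <= 0.0)
        return current;                  /* no valid baseline to compare with */

    for (int i = 0; i < n; i++) {
        double gain = (scores[i] - current_score) / current_score;
        if (gain >= adopt_pct && scores[i] > best_score) {
            best = cands[i];
            best_score = scores[i];
        }
    }
    return best;
}
```

With candidates `1.4,1.5,1.6,1.7` and a 1% threshold, a candidate that improves the score by only 0.5% is ignored, which keeps measurement noise from flipping W_MAX back and forth.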

---

## 🎯 Success Criteria

### Must-Have (Release Blockers)

- ✅ No crashes in 60-sec stress test (16T)
- ✅ No memory leaks (valgrind clean)
- ✅ No data races (TSan clean)
- ✅ Mid 4T: ≥17.0 M/s (≥58% of mimalloc)

### Should-Have (Quality Bar)

- ✅ Mid 1T: ≥4.5 M/s (≥31% of mimalloc)
- ✅ Memory footprint: ≤30 MB baseline
- ✅ No regression on Tiny/Large (<5%)

### Nice-to-Have (Stretch Goals)

- ✅ Mid 4T: ≥18.5 M/s (≥63% of mimalloc) ← **TARGET**
- ✅ Learner converges in <60 sec
- ✅ W_MAX learning finds better value

---

## 📚 Related Documents

- **Full Plan**: `PHASE_6.25_6.27_IMPLEMENTATION_PLAN.md` (this directory)
- **Previous Results**: `PHASE_6.21_RESULTS_2025_10_24.md`
- **Env Vars**: `../specs/ENV_VARS.md`
- **Benchmarks**: `../benchmarks/README.md`

---

**Last Updated**: 2025-10-24
**Status**: Ready for Implementation