# Phase 6.25-6.27: Quick Reference Guide
**Target**: Improve Mid Pool throughput from 47% to 61-68% of mimalloc (4T)

---
## 📋 Implementation Checklist
### Phase 6.25: Refill Batching (~6 hours)
**Goal**: Reduce refill latency by allocating 2-4 pages at once
```c
/* New function in hakmem_pool.c */
static int alloc_tls_page_batch(
    int class_idx, int batch_size,
    PoolTLSPage* slots[], int num_slots,
    PoolTLSRing* ring, PoolTLSBin* bin
);

/* New env var (set in the shell):
 *   HAKMEM_POOL_REFILL_BATCH=2   default (conservative)
 *   HAKMEM_POOL_REFILL_BATCH=4   aggressive
 */
```
**Files**: `hakmem_pool.c` (~116 LOC)

**Expected**: +10-15% (Mid 1T)

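
A minimal sketch of how the new env var could be parsed, assuming out-of-range values are clamped to the documented 1-4 range and unparsable values fall back to the default of 2. `parse_refill_batch()` and `refill_batch_from_env()` are hypothetical names, not functions from `hakmem_pool.c`.

```c
#include <stdlib.h>

/* Hypothetical helper: turn the text of HAKMEM_POOL_REFILL_BATCH into a
 * batch size. Assumption: default 2, clamped to the documented range 1-4. */
int parse_refill_batch(const char* text) {
    const int def = 2, lo = 1, hi = 4;
    if (text == NULL || *text == '\0') return def;   /* unset or empty */
    char* end = NULL;
    long v = strtol(text, &end, 10);
    if (end == text || *end != '\0') return def;     /* not a clean integer */
    if (v < lo) return lo;
    if (v > hi) return hi;
    return (int)v;
}

/* Assumed call site: read the variable once during pool initialization. */
int refill_batch_from_env(void) {
    return parse_refill_batch(getenv("HAKMEM_POOL_REFILL_BATCH"));
}
```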
---
### Phase 6.26: Lock-Free Refill (~11 hours)
**Goal**: Replace mutex with atomic CAS on freelist
```diff
# Replace in hakmem_pool.c
- PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
+ atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
+ atomic_uint freelist_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
# New functions
  freelist_pop_lockfree()
  freelist_push_lockfree()
  freelist_push_batch_lockfree()
  drain_remote_lockfree()
```
**Files**: `hakmem_pool.c` (~140 LOC, net ~100)

**Expected**: +15-20% (Mid 4T)

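
The atomic head/count pair can be sketched with C11 atomics as below. This is an illustration under assumptions, not the planned implementation: the `*_sketch` names, the `FreeBlock` layout, and the single-shard `LockFreeList` struct are all hypothetical. It shows push, batch push, and a drain that detaches the whole list with one exchange. A single-block pop via CAS is deliberately left out, because a naive version is exposed to the ABA problem and needs extra protection such as a tagged pointer.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the first word of a free block links to the next. */
typedef struct FreeBlock { struct FreeBlock* next; } FreeBlock;

/* One shard only; the plan indexes [class][shard] arrays of these fields. */
typedef struct {
    atomic_uintptr_t head;   /* top of the stack, 0 when empty */
    atomic_uint      count;  /* approximate length, updated after the CAS */
} LockFreeList;

void freelist_init_sketch(LockFreeList* l) {
    atomic_init(&l->head, (uintptr_t)0);
    atomic_init(&l->count, 0u);
}

/* Push the pre-linked chain first..last (n blocks) with one CAS loop. */
void freelist_push_batch_sketch(LockFreeList* l, FreeBlock* first,
                                FreeBlock* last, unsigned n) {
    uintptr_t old = atomic_load_explicit(&l->head, memory_order_relaxed);
    do {
        last->next = (FreeBlock*)old;  /* link chain tail to current head */
    } while (!atomic_compare_exchange_weak_explicit(
                 &l->head, &old, (uintptr_t)first,
                 memory_order_release, memory_order_relaxed));
    atomic_fetch_add_explicit(&l->count, n, memory_order_relaxed);
}

void freelist_push_sketch(LockFreeList* l, FreeBlock* b) {
    freelist_push_batch_sketch(l, b, b, 1);
}

/* Detach the whole list in one exchange; the caller then owns every block.
 * The counter is adjusted separately, so it can lag under concurrency. */
FreeBlock* freelist_drain_sketch(LockFreeList* l) {
    FreeBlock* chain = (FreeBlock*)atomic_exchange_explicit(
        &l->head, (uintptr_t)0, memory_order_acquire);
    unsigned n = 0;
    for (FreeBlock* p = chain; p != NULL; p = p->next) n++;
    atomic_fetch_sub_explicit(&l->count, n, memory_order_relaxed);
    return chain;
}
```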
---
### Phase 6.27: Learner Integration (~5 hours)
**Goal**: Dynamic CAP/W_MAX tuning based on runtime stats
```bash
# Enable existing learner
HAKMEM_LEARN=1
HAKMEM_TARGET_HIT_MID=0.65
HAKMEM_CAP_STEP_MID=8
HAKMEM_CAP_MAX_MID=512
# Optional: W_MAX learning (risky)
HAKMEM_WMAX_LEARN=1
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7
HAKMEM_WMAX_CANARY=1 # Safe exploration
```
**Files**: `hakmem_ace.c` (+15 LOC), `hakmem_learner.c` (+10 LOC)

**Expected**: +5-10% (all workloads)

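
One way to picture the CAP tuning is a dead-band step controller. The sketch below assumes that a hit rate below the target band raises CAP by one step, a hit rate above it lowers CAP by one step, and the result is clamped to the min/max limits. The control rule, the `CapTuning` struct, and `cap_next()` are assumptions for illustration and may differ from what `hakmem_learner.c` does.

```c
/* Hypothetical parameters mirroring the HAKMEM_* variables. */
typedef struct {
    double target_hit;  /* HAKMEM_TARGET_HIT_MID */
    double band;        /* assumed dead band around the target, e.g. 0.03 */
    int    step;        /* HAKMEM_CAP_STEP_MID */
    int    cap_min;     /* HAKMEM_CAP_MIN_MID */
    int    cap_max;     /* HAKMEM_CAP_MAX_MID */
} CapTuning;

/* Return the next CAP for one class given the hit rate of the last window. */
int cap_next(int cap, double hit_rate, const CapTuning* t) {
    int next = cap;
    if (hit_rate < t->target_hit - t->band) {
        next = cap + t->step;   /* too many misses: allow more pages */
    } else if (hit_rate > t->target_hit + t->band) {
        next = cap - t->step;   /* comfortably above target: give memory back */
    }
    if (next < t->cap_min) next = t->cap_min;
    if (next > t->cap_max) next = t->cap_max;
    return next;
}
```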
---
## 🚀 Quick Start (Implementation Order)
### Week 1: Batching + Learner (Parallel)
**Day 1-2: Phase 6.25**
```bash
# 1. Implement batch function
cd /home/tomoaki/git/hakmem
vim hakmem_pool.c # Add alloc_tls_page_batch() after line 486
# 2. Integrate into alloc path
vim hakmem_pool.c # Modify line 931 (refill call site)
# 3. Add env var
vim hakmem_pool.c # Add global + parse in hak_pool_init()
# 4. Test
make clean && make
HAKMEM_POOL_REFILL_BATCH=2 ./test_pool_basic
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh
```
**Day 2-3: Phase 6.27**
```bash
# 1. Add ACE waste tracking
vim hakmem_ace.c # Add hak_ace_get_total_waste()
# 2. Update learner score
vim hakmem_learner.c # Line 414, add frag penalty
# 3. Test
HAKMEM_LEARN=1 HAKMEM_TARGET_HIT_MID=0.70 RUNTIME=60 THREADS=1,4 \
./scripts/run_bench_suite.sh
```
### Week 2: Lock-Free
**Day 1-3: Phase 6.26**
```bash
# 1. Replace data structures
vim hakmem_pool.c # Line 276-280, atomics
# 2. Implement lock-free ops
vim hakmem_pool.c # Add the new lock-free functions (pop, push, push_batch, drain_remote)
# 3. Integrate
vim hakmem_pool.c # Replace lock/unlock with CAS
# 4. Test (CRITICAL: TSan)
make clean && make CFLAGS="-fsanitize=thread"
THREADS=16 DURATION=60 ./test_pool_lockfree_stress
# 5. Benchmark
RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh
```
---
## 📊 Expected Results
| Phase | Mid 1T | Mid 4T | vs mimalloc (1T) | vs mimalloc (4T) |
|-------|--------|--------|------------------|------------------|
| Baseline (6.21) | 4.0 M/s | 13.8 M/s | 28% | 47% |
| + 6.25 (Batch) | 4.5 M/s | 14.5 M/s | 31% | 49% |
| + 6.26 (Lock-Free) | 4.6 M/s | 17.0 M/s | 32% | 58% |
| + 6.27 (Learner) | 5.0 M/s | 18.5 M/s | 34% | **63%** ✅ |
| **Target (60-75%)** | 8.8-11.0 M/s | 17.7-22.1 M/s | 60-75% | 60-75% |

**4T target projected to be met**: 63%, inside the 61-68% range.

**1T still short**: 34% against a 60% floor; needs Phase 6.28 (header elimination).

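
The percentage columns follow from dividing each throughput by the mimalloc figure. As a hedged check, the sketch below derives the implied mimalloc 4T throughput from the baseline row (13.8 M/s at 47%, roughly 29.4 M/s) and recomputes the other rows from it. The mimalloc number is inferred from the table, not measured here, and both function names are hypothetical.

```c
/* Reference throughput implied by one measured point, e.g. 13.8 M/s at 0.47. */
double implied_reference(double ours_mops, double fraction) {
    return ours_mops / fraction;
}

/* Percentage of the reference, rounded to the nearest whole percent. */
int pct_of_reference(double ours_mops, double ref_mops) {
    return (int)(100.0 * ours_mops / ref_mops + 0.5);
}
```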
---
## 🧪 Testing Commands
### Correctness Tests
```bash
# Unit test (per phase)
./test_pool_refill_batch # Phase 6.25
./test_pool_lockfree # Phase 6.26
./test_pool_learner # Phase 6.27
# Memory safety
valgrind --leak-check=full ./test_pool_refill_batch
make clean && make CFLAGS="-fsanitize=address"
./test_pool_lockfree
# Thread safety (Phase 6.26 CRITICAL)
make clean && make CFLAGS="-fsanitize=thread"
THREADS=16 DURATION=60 ./test_pool_lockfree_stress
```
### Performance Tests
```bash
# Quick test (3 sec)
RUNTIME=3 THREADS=1,4 ./scripts/run_bench_suite.sh
# Full test (10 sec, production)
RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh
# Stress test (60 sec, stability)
RUNTIME=60 THREADS=1,4,8 ./scripts/run_bench_suite.sh
# Head-to-head comparison
./scripts/head_to_head_large.sh # vs mimalloc
```
### A/B Testing
```bash
# Baseline (batch=1, no learner)
HAKMEM_POOL_REFILL_BATCH=1 HAKMEM_LEARN=0 \
RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
> baseline.txt
# Phase 6.25 (batch=2)
HAKMEM_POOL_REFILL_BATCH=2 HAKMEM_LEARN=0 \
RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
> phase_6_25.txt
# Phase 6.27 (learner)
HAKMEM_POOL_REFILL_BATCH=2 HAKMEM_LEARN=1 HAKMEM_TARGET_HIT_MID=0.65 \
RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
> phase_6_27.txt
# Compare
grep "Throughput" baseline.txt phase_6_25.txt phase_6_27.txt
```
---
## 🔧 Troubleshooting
### Phase 6.25 Issues
**Symptom**: No performance improvement
- Check: `g_pool_refill_batch_size` value (should be 2-4)
- Check: Pages-allocated counter (should increase in batches)
- Check: Ring buffer fill level (should hit RING_CAP more often)

**Symptom**: Memory bloat
- Reduce: `HAKMEM_POOL_REFILL_BATCH=2` (from 4)
- Check: Batch allocator respects CAP limits
- Check: No memory leaks (valgrind)
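
The CAP check for the batch allocator can be sketched as a clamp on the requested batch size. This assumes CAP is a per-class page limit and that the allocator tracks pages in use; `refill_batch_allowed()` is a hypothetical helper, not an existing function.

```c
/* Hypothetical helper: number of pages a batched refill may allocate
 * without pushing the class past its CAP. */
int refill_batch_allowed(int requested, int pages_in_use, int cap) {
    int headroom = cap - pages_in_use;
    if (headroom <= 0) return 0;                  /* already at or over CAP */
    return requested < headroom ? requested : headroom;
}
```

With a batch of 4 and only 2 pages of headroom left, the refill would allocate 2 pages instead of overshooting the limit.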
### Phase 6.26 Issues
**Symptom**: Crash/hang
- Run: ThreadSanitizer (TSan) to find races
- Check: CAS loop cannot spin forever (add a retry limit)
- Check: Memory ordering (acquire/release)

**Symptom**: Slower than mutex version
- Check: CAS retry rate (should be <5%)
- Check: Single-thread overhead (should be minimal)
- Add: Exponential backoff after N retries

**Symptom**: Lost blocks (counter mismatch)
- Check: Batch push count matches list length
- Check: No concurrent modification during CAS
- Add: Invariant checks (debug build)
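
The retry limit, backoff, and retry-rate checks can be combined in one push routine. The following is a sketch under assumptions: `push_bounded_sketch()` and `Node` are hypothetical, the backoff is a plain spin that doubles after each failed CAS, and a caller is assumed to have a fallback path when the limit is hit.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct Node { struct Node* next; } Node;

/* Push with a retry cap. Returns true on success. *retries receives the
 * number of failed CAS attempts so the caller can track the retry rate. */
bool push_bounded_sketch(atomic_uintptr_t* head, Node* n,
                         unsigned max_retries, unsigned* retries) {
    unsigned failed = 0;
    unsigned spins = 1;
    uintptr_t old = atomic_load_explicit(head, memory_order_relaxed);
    for (;;) {
        n->next = (Node*)old;  /* a failed CAS refreshes old with the current head */
        if (atomic_compare_exchange_weak_explicit(
                head, &old, (uintptr_t)n,
                memory_order_release, memory_order_relaxed)) {
            *retries = failed;
            return true;
        }
        if (++failed > max_retries) {
            *retries = failed;
            return false;      /* give up; caller takes its fallback path */
        }
        for (volatile unsigned i = 0; i < spins; i++) { }  /* crude backoff */
        if (spins < 1024) spins *= 2;                      /* exponential, capped */
    }
}
```

Summing `*retries` across calls and dividing by the number of pushes gives a retry rate that can be compared against the <5% guideline.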
### Phase 6.27 Issues
**Symptom**: CAP oscillation
- Increase: `HAKMEM_CAP_DWELL_SEC_MID=5` (from 3)
- Increase: `HAKMEM_LEARN_MIN_SAMPLES=512` (from 256)
- Widen: Target band (0.65 ± 0.03 → 0.65 ± 0.05)

**Symptom**: No CAP changes
- Check: Hit rate out of target band (needs >±3% delta)
- Check: Sufficient samples (≥256 per window)
- Lower: `HAKMEM_TARGET_HIT_MID=0.60` (from 0.65)

**Symptom**: W_MAX instability
- Enable: `HAKMEM_WMAX_CANARY=1` (safe exploration)
- Increase: `HAKMEM_WMAX_TRIAL_SEC=10` (from 5)
- Narrow: Candidate range (1.4-1.7 → 1.5-1.6)
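
The three conditions that gate a CAP change (enough samples, dwell time elapsed, hit rate outside the band) can be written as simple predicates. Both functions are hypothetical illustrations of the checks listed for this phase, not the learner's actual code.

```c
#include <stdbool.h>

/* True when the hit rate lies outside target ± band. */
bool outside_band(double hit, double target, double band) {
    return hit < target - band || hit > target + band;
}

/* Allow a CAP change only when the window has enough samples and the
 * previous change has had dwell_sec seconds to settle. */
bool learner_may_update(unsigned samples, unsigned min_samples,
                        double sec_since_change, double dwell_sec) {
    if (samples < min_samples) return false;
    if (sec_since_change < dwell_sec) return false;
    return true;
}
```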
---
## 📝 Environment Variables Reference
### Phase 6.25: Batching
| Variable | Default | Range | Description |
|----------|---------|-------|-------------|
| `HAKMEM_POOL_REFILL_BATCH` | 2 | 1-4 | Pages per refill (1=baseline) |
### Phase 6.26: Lock-Free
(No new env vars, pure implementation change)
### Phase 6.27: Learner
| Variable | Default | Range | Description |
|----------|---------|-------|-------------|
| `HAKMEM_LEARN` | 0 | 0-1 | Enable learner (0=off, 1=on) |
| `HAKMEM_TARGET_HIT_MID` | 0.65 | 0.5-0.9 | Target hit rate for Mid Pool |
| `HAKMEM_CAP_STEP_MID` | 4 | 1-16 | CAP increment/decrement size |
| `HAKMEM_CAP_MIN_MID` | 8 | 4-64 | Minimum CAP per class |
| `HAKMEM_CAP_MAX_MID` | 2048 | 128-4096 | Maximum CAP per class |
| `HAKMEM_CAP_DWELL_SEC_MID` | 3 | 1-10 | Stability period (sec) |
| `HAKMEM_LEARN_WINDOW_MS` | 1000 | 500-5000 | Sampling interval (ms) |
| `HAKMEM_LEARN_MIN_SAMPLES` | 256 | 64-1024 | Min samples to trigger update |

**W_MAX Learning** (Optional):

| Variable | Default | Range | Description |
|----------|---------|-------|-------------|
| `HAKMEM_WMAX_LEARN` | 0 | 0-1 | Enable W_MAX exploration |
| `HAKMEM_WMAX_CANDIDATES_MID` | 1.4,1.6,... | CSV list | W_MAX values to try |
| `HAKMEM_WMAX_CANARY` | 1 | 0-1 | Safe exploration (1=on) |
| `HAKMEM_WMAX_TRIAL_SEC` | 5 | 3-15 | Canary trial duration |
| `HAKMEM_WMAX_ADOPT_PCT` | 0.01 | 0.005-0.05 | Adoption threshold (1%) |
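
A CSV-valued variable such as `HAKMEM_WMAX_CANDIDATES_MID` could be parsed as sketched below. `parse_wmax_candidates()` is a hypothetical helper; stopping at the first malformed entry is an assumption about the desired behavior.

```c
#include <stdlib.h>

/* Parse a comma-separated list of doubles such as "1.4,1.5,1.6,1.7".
 * Returns the number of values stored (at most max_out) and stops at the
 * first entry that is not a number. */
int parse_wmax_candidates(const char* csv, double* out, int max_out) {
    int n = 0;
    const char* p = csv;
    while (p != NULL && *p != '\0' && n < max_out) {
        char* end = NULL;
        double v = strtod(p, &end);
        if (end == p) break;       /* no number at this position */
        out[n++] = v;
        if (*end != ',') break;    /* end of string or unexpected character */
        p = end + 1;               /* skip the comma */
    }
    return n;
}
```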
---
## 🎯 Success Criteria
### Must-Have (Release Blockers)
- ✅ No crashes in 60-sec stress test (16T)
- ✅ No memory leaks (valgrind clean)
- ✅ No data races (TSan clean)
- ✅ Mid 4T: ≥17.0 M/s (≥58% of mimalloc)
### Should-Have (Quality Bar)
- ✅ Mid 1T: ≥4.5 M/s (≥31% of mimalloc)
- ✅ Memory footprint: ≤30 MB baseline
- ✅ No regression on Tiny/Large (<5%)
### Nice-to-Have (Stretch Goals)
- ✅ Mid 4T: ≥18.5 M/s (≥63% of mimalloc) ← **TARGET**
- ✅ Learner converges in <60 sec
- ✅ W_MAX learning finds better value
---
## 📚 Related Documents
- **Full Plan**: `PHASE_6.25_6.27_IMPLEMENTATION_PLAN.md` (this directory)
- **Previous Results**: `PHASE_6.21_RESULTS_2025_10_24.md`
- **Env Vars**: `../specs/ENV_VARS.md`
- **Benchmarks**: `../benchmarks/README.md`
---
**Last Updated**: 2025-10-24

**Status**: Ready for Implementation