Phase 6.25-6.27: Quick Reference Guide

Target: Improve Mid Pool from 47% to 61-68% of mimalloc (4T)

📋 Implementation Checklist

Phase 6.25: Refill Batching (~6 hours)

Goal: Reduce refill latency by allocating 2-4 pages at once

# New function in hakmem_pool.c
static int alloc_tls_page_batch(
    int class_idx, int batch_size,
    PoolTLSPage* slots[], int num_slots,
    PoolTLSRing* ring, PoolTLSBin* bin
);

# New env var
HAKMEM_POOL_REFILL_BATCH=2  # Default (conservative)
HAKMEM_POOL_REFILL_BATCH=4  # Aggressive

Files: hakmem_pool.c (~116 LOC)

Expected: +10-15% (Mid 1T)
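The env var handling for the batch size can be sketched as follows. This is a minimal sketch, not the project's code: the helper names `parse_refill_batch` and `refill_batch_from_env` are hypothetical, and clamping out-of-range values to 1-4 (instead of rejecting them) is an assumption based on the range given in the env var reference.

```c
#include <stdlib.h>

/* Assumed limits, taken from the documented range 1-4 and default 2. */
enum { REFILL_BATCH_MIN = 1, REFILL_BATCH_MAX = 4, REFILL_BATCH_DEFAULT = 2 };

/* Hypothetical helper: turn a HAKMEM_POOL_REFILL_BATCH-style string
 * into a batch size. Unset, empty, or non-numeric input falls back to
 * the default; numeric input is clamped into [1, 4]. */
int parse_refill_batch(const char *text)
{
    if (text == NULL || *text == '\0')
        return REFILL_BATCH_DEFAULT;

    char *end = NULL;
    long value = strtol(text, &end, 10);
    if (end == text || *end != '\0')
        return REFILL_BATCH_DEFAULT;   /* not a plain integer */

    if (value < REFILL_BATCH_MIN) return REFILL_BATCH_MIN;
    if (value > REFILL_BATCH_MAX) return REFILL_BATCH_MAX;
    return (int)value;
}

/* Hypothetical init-time wrapper: read the setting once from the
 * environment, e.g. from hak_pool_init(). */
int refill_batch_from_env(void)
{
    return parse_refill_batch(getenv("HAKMEM_POOL_REFILL_BATCH"));
}
```

With this policy, a value of 1 reproduces the baseline (one page per refill), which keeps the A/B comparison a pure env var switch.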

Phase 6.26: Lock-Free Refill (~11 hours)

Goal: Replace mutex with atomic CAS on freelist

# Replace in hakmem_pool.c
- PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
+ atomic_uintptr_t freelist_head[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
+ atomic_uint freelist_count[POOL_NUM_CLASSES][POOL_NUM_SHARDS];

# New functions
freelist_pop_lockfree()
freelist_push_lockfree()
freelist_push_batch_lockfree()
drain_remote_lockfree()

Files: hakmem_pool.c (~140 LOC, net ~100)

Expected: +15-20% (Mid 4T)
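The head/count pair above maps naturally onto a CAS-based LIFO (Treiber stack). The sketch below illustrates that technique only; the names `FreeBlock`, `LockFreeList`, `lf_init`, `lf_push`, and `lf_pop` are hypothetical and are not the signatures of `freelist_pop_lockfree()` or `freelist_push_lockfree()`. It also assumes a block is never recycled while another thread is mid-pop; a plain pointer CAS like this is exposed to the ABA problem otherwise, so a real implementation may need a version tag or similar guard.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-block header: the first word links to the next block. */
typedef struct FreeBlock {
    struct FreeBlock *next;
} FreeBlock;

/* One shard's freelist, mirroring the freelist_head/freelist_count pair. */
typedef struct {
    atomic_uintptr_t head;   /* address of the top block, 0 when empty */
    atomic_uint count;       /* approximate length, for stats and CAP checks */
} LockFreeList;

void lf_init(LockFreeList *list)
{
    atomic_init(&list->head, 0);
    atomic_init(&list->count, 0);
}

void lf_push(LockFreeList *list, FreeBlock *block)
{
    uintptr_t old_head = atomic_load_explicit(&list->head, memory_order_relaxed);
    do {
        block->next = (FreeBlock *)old_head;
        /* release: block->next must be visible before the new head is */
    } while (!atomic_compare_exchange_weak_explicit(
                 &list->head, &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_relaxed));
    atomic_fetch_add_explicit(&list->count, 1, memory_order_relaxed);
}

FreeBlock *lf_pop(LockFreeList *list)
{
    uintptr_t old_head = atomic_load_explicit(&list->head, memory_order_acquire);
    while (old_head != 0) {
        FreeBlock *top = (FreeBlock *)old_head;
        uintptr_t next = (uintptr_t)top->next;
        if (atomic_compare_exchange_weak_explicit(
                &list->head, &old_head, next,
                memory_order_acquire, memory_order_acquire)) {
            atomic_fetch_sub_explicit(&list->count, 1, memory_order_relaxed);
            return top;
        }
        /* A failed CAS reloaded old_head; retry with the fresh value. */
    }
    return NULL;   /* list was empty */
}
```

The counter is updated after the CAS, so it can briefly disagree with the list under contention; treating it as a hint instead of an exact length keeps the fast path to one CAS.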

Phase 6.27: Learner Integration (~5 hours)

Goal: Dynamic CAP/W_MAX tuning based on runtime stats

# Enable existing learner
HAKMEM_LEARN=1
HAKMEM_TARGET_HIT_MID=0.65
HAKMEM_CAP_STEP_MID=8
HAKMEM_CAP_MAX_MID=512

# Optional: W_MAX learning (risky)
HAKMEM_WMAX_LEARN=1
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.5,1.6,1.7
HAKMEM_WMAX_CANARY=1  # Safe exploration

Files: hakmem_ace.c (+15 LOC), hakmem_learner.c (+10 LOC)

Expected: +5-10% (all workloads)
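The CAP tuning idea can be illustrated with a single update step: grow CAP when the hit rate falls below the target band, shrink it when the hit rate is above the band, and clamp to the configured limits. This is a sketch under stated assumptions, not the learner's actual logic: `CapPolicy` and `cap_next` are hypothetical names, and the symmetric dead band around the target is an assumption drawn from the "0.65 ± 0.03" wording in the troubleshooting notes.

```c
/* Hypothetical policy bundle; field names are assumptions that loosely
 * follow the HAKMEM_*_MID env vars. */
typedef struct {
    double target_hit;  /* e.g. HAKMEM_TARGET_HIT_MID = 0.65 */
    double band;        /* half-width of the dead band, e.g. 0.03 */
    int step;           /* e.g. HAKMEM_CAP_STEP_MID */
    int cap_min;        /* e.g. HAKMEM_CAP_MIN_MID */
    int cap_max;        /* e.g. HAKMEM_CAP_MAX_MID */
} CapPolicy;

/* Return the CAP to use for the next window given the observed hit rate. */
int cap_next(int cap, double hit_rate, const CapPolicy *policy)
{
    int next = cap;

    if (hit_rate < policy->target_hit - policy->band)
        next = cap + policy->step;   /* too many misses: cache more pages */
    else if (hit_rate > policy->target_hit + policy->band)
        next = cap - policy->step;   /* above the band: give memory back */
    /* inside the band: leave CAP unchanged to avoid oscillation */

    if (next < policy->cap_min) next = policy->cap_min;
    if (next > policy->cap_max) next = policy->cap_max;
    return next;
}
```

For example, with target 0.65, band 0.03, and step 8, a window with a 0.50 hit rate moves CAP from 64 to 72, while a 0.66 hit rate leaves it at 64.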

🚀 Quick Start (Implementation Order)

Week 1: Batching + Learner (Parallel)

Day 1-2: Phase 6.25

# 1. Implement batch function
cd /home/tomoaki/git/hakmem
vim hakmem_pool.c  # Add alloc_tls_page_batch() after line 486

# 2. Integrate into alloc path
vim hakmem_pool.c  # Modify line 931 (refill call site)

# 3. Add env var
vim hakmem_pool.c  # Add global + parse in hak_pool_init()

# 4. Test
make clean && make
HAKMEM_POOL_REFILL_BATCH=2 ./test_pool_basic
HAKMEM_POOL_REFILL_BATCH=2 RUNTIME=10 THREADS=1 ./scripts/run_bench_suite.sh

Day 2-3: Phase 6.27

# 1. Add ACE waste tracking
vim hakmem_ace.c  # Add hak_ace_get_total_waste()

# 2. Update learner score
vim hakmem_learner.c  # Line 414, add frag penalty

# 3. Test
HAKMEM_LEARN=1 HAKMEM_TARGET_HIT_MID=0.70 RUNTIME=60 THREADS=1,4 \
  ./scripts/run_bench_suite.sh

Week 2: Lock-Free

Day 1-3: Phase 6.26

# 1. Replace data structures
vim hakmem_pool.c  # Lines 276-280: replace the mutex array with atomics

# 2. Implement lock-free ops
vim hakmem_pool.c  # Add the 4 new functions (pop, push, push_batch, drain_remote)

# 3. Integrate
vim hakmem_pool.c  # Replace lock/unlock with CAS

# 4. Test (CRITICAL: TSan)
make clean && make CFLAGS="-fsanitize=thread"
THREADS=16 DURATION=60 ./test_pool_lockfree_stress

# 5. Benchmark
RUNTIME=10 THREADS=4 ./scripts/run_bench_suite.sh

📊 Expected Results

Phase	Mid 1T	Mid 4T	vs mimalloc (1T)	vs mimalloc (4T)
Baseline (6.21)	4.0 M/s	13.8 M/s	28%	47%
+ 6.25 (Batch)	4.5 M/s	14.5 M/s	31%	49%
+ 6.26 (Lock-Free)	4.6 M/s	17.0 M/s	32%	58%
+ 6.27 (Learner)	5.0 M/s	18.5 M/s	34%	63% ✅
Target (60-75%)	8.8-11.0 M/s	17.7-22.1 M/s	60-75%	60-75%

✅ 4T target projected to be met (63%, within the 61-68% range)
❌ 1T still projected to fall short (needs Phase 6.28: header elimination)

🧪 Testing Commands

Correctness Tests

# Unit test (per phase)
./test_pool_refill_batch  # Phase 6.25
./test_pool_lockfree      # Phase 6.26
./test_pool_learner       # Phase 6.27

# Memory safety
valgrind --leak-check=full ./test_pool_refill_batch
make clean && make CFLAGS="-fsanitize=address"
./test_pool_lockfree

# Thread safety (Phase 6.26 CRITICAL)
make clean && make CFLAGS="-fsanitize=thread"
THREADS=16 DURATION=60 ./test_pool_lockfree_stress

Performance Tests

# Quick test (3 sec)
RUNTIME=3 THREADS=1,4 ./scripts/run_bench_suite.sh

# Full test (10 sec, production)
RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh

# Stress test (60 sec, stability)
RUNTIME=60 THREADS=1,4,8 ./scripts/run_bench_suite.sh

# Head-to-head comparison
./scripts/head_to_head_large.sh  # vs mimalloc

A/B Testing

# Baseline (batch=1, no learner)
HAKMEM_POOL_REFILL_BATCH=1 HAKMEM_LEARN=0 \
  RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
  > baseline.txt

# Phase 6.25 (batch=2)
HAKMEM_POOL_REFILL_BATCH=2 HAKMEM_LEARN=0 \
  RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
  > phase_6_25.txt

# Phase 6.27 (learner)
HAKMEM_POOL_REFILL_BATCH=2 HAKMEM_LEARN=1 HAKMEM_TARGET_HIT_MID=0.65 \
  RUNTIME=10 THREADS=1,4 ./scripts/run_bench_suite.sh \
  > phase_6_27.txt

# Compare
grep "Throughput" baseline.txt phase_6_25.txt phase_6_27.txt

🔧 Troubleshooting

Phase 6.25 Issues

Symptom: No performance improvement

Check: g_pool_refill_batch_size value (should be 2-4)
Check: Pages allocated counter (should increase in batches)
Check: Ring buffer filling (should hit RING_CAP more often)

Symptom: Memory bloat

Reduce: HAKMEM_POOL_REFILL_BATCH=2 (from 4)
Check: Respect CAP limits in batch allocator
Check: No memory leaks (valgrind)

Phase 6.26 Issues

Symptom: Crash/hang

Run: ThreadSanitizer (TSan) to find races
Check: CAS loop cannot spin forever (add a retry limit)
Check: Memory ordering (acquire/release)

Symptom: Slower than mutex version

Check: CAS retry rate (should be <5%)
Check: Single-thread overhead (should be minimal)
Add: Exponential backoff after N retries
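The retry-rate check and the backoff suggestion above can be sketched as two small helpers. All names and constants here (`BACKOFF_FREE_RETRIES`, `BACKOFF_MAX_SPINS`, `backoff_spins`, `backoff_wait`, `cas_retry_pct`) are hypothetical; the values 4 and 1024 are placeholders to be tuned by measurement, not numbers from the project.

```c
#include <stdint.h>

/* Assumed tuning constants: retry immediately for the first few
 * failures, then double the wait up to a fixed ceiling. */
enum { BACKOFF_FREE_RETRIES = 4, BACKOFF_MAX_SPINS = 1024 };

/* Number of spin iterations to wait before CAS attempt `attempt`
 * (0-based). Returns 0 for the first BACKOFF_FREE_RETRIES attempts,
 * then 1, 2, 4, ... capped at BACKOFF_MAX_SPINS. */
uint32_t backoff_spins(uint32_t attempt)
{
    if (attempt < BACKOFF_FREE_RETRIES)
        return 0;
    uint32_t shift = attempt - BACKOFF_FREE_RETRIES;
    if (shift >= 10)                 /* 1 << 10 == BACKOFF_MAX_SPINS */
        return BACKOFF_MAX_SPINS;
    return (uint32_t)1 << shift;
}

/* Busy-wait for the computed number of iterations. The volatile
 * counter keeps the compiler from deleting the loop. */
void backoff_wait(uint32_t attempt)
{
    volatile uint32_t spins = backoff_spins(attempt);
    while (spins > 0)
        spins--;
}

/* CAS retry rate as a percentage of all attempts, for comparison
 * against the <5% guideline. */
double cas_retry_pct(uint64_t attempts, uint64_t retries)
{
    if (attempts == 0)
        return 0.0;
    return 100.0 * (double)retries / (double)attempts;
}
```

Calling `backoff_wait(attempt)` at the top of each failed CAS iteration keeps the uncontended path free of extra work, since the first few attempts wait zero iterations.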

Symptom: Lost blocks (counter mismatch)

Check: Batch push count matches list length
Check: No concurrent modification during CAS
Add: Invariant checks (debug build)
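One way to catch a count mismatch early is to verify the chain length before splicing a batch onto the shared list. The sketch below assumes a batch is passed as a first/last pair plus a count; `FreeBlock`, `chain_length`, and `push_batch` are hypothetical names, not the signature of `freelist_push_batch_lockfree()`. The `assert` compiles away when `NDEBUG` is defined, so the walk only costs time in debug builds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-block header: the first word links to the next block. */
typedef struct FreeBlock {
    struct FreeBlock *next;
} FreeBlock;

/* Count blocks from `first` through `last` inclusive. Returns 0 if the
 * chain ends before reaching `last`. `last->next` is never read. */
size_t chain_length(const FreeBlock *first, const FreeBlock *last)
{
    size_t n = 0;
    for (const FreeBlock *b = first; b != NULL; b = b->next) {
        n++;
        if (b == last)
            return n;
    }
    return 0;
}

/* Splice a pre-linked chain of `n` blocks onto the list with one CAS. */
void push_batch(atomic_uintptr_t *head, atomic_uint *count,
                FreeBlock *first, FreeBlock *last, unsigned n)
{
    /* Debug-build invariant: the caller's count matches the chain. */
    assert(chain_length(first, last) == n);

    uintptr_t old_head = atomic_load_explicit(head, memory_order_relaxed);
    do {
        last->next = (FreeBlock *)old_head;   /* tail points at old list */
    } while (!atomic_compare_exchange_weak_explicit(
                 head, &old_head, (uintptr_t)first,
                 memory_order_release, memory_order_relaxed));

    atomic_fetch_add_explicit(count, n, memory_order_relaxed);
}
```

Because the whole chain is linked privately before the CAS, other threads either see none of the batch or all of it, which is what makes a single counter increment of `n` safe.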

Phase 6.27 Issues

Symptom: CAP oscillation

Increase: HAKMEM_CAP_DWELL_SEC_MID=5 (from 3)
Increase: HAKMEM_LEARN_MIN_SAMPLES=512 (from 256)
Widen: Target band (0.65 ± 0.03 → 0.65 ± 0.05)

Symptom: No CAP changes

Check: Hit rate is outside the target band (a delta of more than ±3% is needed)
Check: Sufficient samples (≥256 per window)
Lower: HAKMEM_TARGET_HIT_MID=0.60 (from 0.65)

Symptom: W_MAX instability

Enable: HAKMEM_WMAX_CANARY=1 (safe exploration)
Increase: HAKMEM_WMAX_TRIAL_SEC=10 (from 5)
Narrow: Candidate range (1.4-1.7 → 1.5-1.6)

📝 Environment Variables Reference

Phase 6.25: Batching

Variable	Default	Range	Description
`HAKMEM_POOL_REFILL_BATCH`	2	1-4	Pages per refill (1=baseline)

Phase 6.26: Lock-Free

(No new env vars, pure implementation change)

Phase 6.27: Learner

Variable	Default	Range	Description
`HAKMEM_LEARN`	0	0-1	Enable learner (0=off, 1=on)
`HAKMEM_TARGET_HIT_MID`	0.65	0.5-0.9	Target hit rate for Mid Pool
`HAKMEM_CAP_STEP_MID`	4	1-16	CAP increment/decrement size
`HAKMEM_CAP_MIN_MID`	8	4-64	Minimum CAP per class
`HAKMEM_CAP_MAX_MID`	2048	128-4096	Maximum CAP per class
`HAKMEM_CAP_DWELL_SEC_MID`	3	1-10	Stability period (sec)
`HAKMEM_LEARN_WINDOW_MS`	1000	500-5000	Sampling interval (ms)
`HAKMEM_LEARN_MIN_SAMPLES`	256	64-1024	Min samples to trigger update

W_MAX Learning (Optional):

Variable	Default	Range	Description
`HAKMEM_WMAX_LEARN`	0	0-1	Enable W_MAX exploration
`HAKMEM_WMAX_CANDIDATES_MID`	1.4,1.6,...	CSV list	W_MAX values to try
`HAKMEM_WMAX_CANARY`	1	0-1	Safe exploration (1=on)
`HAKMEM_WMAX_TRIAL_SEC`	5	3-15	Canary trial duration
`HAKMEM_WMAX_ADOPT_PCT`	0.01	0.005-0.05	Adoption threshold (1%)
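The candidate list and adoption threshold can be illustrated with two small helpers. This is a sketch under assumptions, not the learner's code: `parse_wmax_candidates` and `wmax_should_adopt` are hypothetical names, and reading `HAKMEM_WMAX_ADOPT_PCT` as a minimum relative throughput gain over the current setting is an interpretation of "Adoption threshold (1%)" that the source does not spell out.

```c
#include <stddef.h>
#include <stdlib.h>

/* Parse a CSV list such as "1.4,1.5,1.6,1.7" into out[].
 * Returns the number of values stored (at most max_out). Parsing
 * stops at the first field that is not a number. */
size_t parse_wmax_candidates(const char *csv, double *out, size_t max_out)
{
    size_t n = 0;
    const char *p = csv;

    while (p != NULL && *p != '\0' && n < max_out) {
        char *end = NULL;
        double value = strtod(p, &end);
        if (end == p)
            break;              /* no number at this position */
        out[n++] = value;
        if (*end != ',')
            break;              /* end of list or unexpected character */
        p = end + 1;
    }
    return n;
}

/* Assumed adoption rule: accept the trial W_MAX only if its measured
 * score beats the baseline by at least adopt_pct (0.01 means 1%). */
int wmax_should_adopt(double baseline, double trial, double adopt_pct)
{
    if (baseline <= 0.0)
        return 0;               /* no valid baseline to compare against */
    return (trial - baseline) / baseline >= adopt_pct;
}
```

Under this reading, a canary trial that scores 102 against a baseline of 100 would be adopted at the default 0.01 threshold, while one that scores 100.5 would not.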

🎯 Success Criteria

Must-Have (Release Blockers)

✅ No crashes in 60-sec stress test (16T)
✅ No memory leaks (valgrind clean)
✅ No data races (TSan clean)
✅ Mid 4T: ≥17.0 M/s (≥58% of mimalloc)

Should-Have (Quality Bar)

✅ Mid 1T: ≥4.5 M/s (≥31% of mimalloc)
✅ Memory footprint: ≤30 MB baseline
✅ No regression on Tiny/Large (<5%)

Nice-to-Have (Stretch Goals)

✅ Mid 4T: ≥18.5 M/s (≥63% of mimalloc) ← TARGET
✅ Learner converges in <60 sec
✅ W_MAX learning finds better value

Full Plan: PHASE_6.25_6.27_IMPLEMENTATION_PLAN.md (this directory)
Previous Results: PHASE_6.21_RESULTS_2025_10_24.md
Env Vars: ../specs/ENV_VARS.md
Benchmarks: ../benchmarks/README.md

Last Updated: 2025-10-24
Status: Ready for Implementation