Phase 6.15: Quick Reference Card
Full Details: See PHASE_6.15_PLAN.md (1008 lines)
📊 The Problem
Current State: hakmem is THREAD-UNSAFE
1-thread: 15.1M ops/sec ✅ Excellent
4-thread: 3.3M ops/sec ❌ -78% collapse!
Root Cause: the codebase has no locking at all (grep pthread_mutex *.c → 0 results)
🎯 The Solution (3 Steps)
| Step | What | Time | Expected Result |
|---|---|---|---|
| 1 | Fix docs | 1h | Clarity on 67.9M issue |
| 2 | P0 Safety Lock | 2-3h | 4T = 13-15M (safe, no scaling) |
| 3 | TLS Performance | 8-10h | 4T = 15-20M (+381% proven) |
📋 Step-by-Step Execution
Day 1 Morning: Step 1 (1 hour)
cd apps/experiments/hakmem-poc
# 1. Edit PHASE_6.14_COMPLETION_REPORT.md
# Add section explaining 67.9M measurement issue
# Add thread safety warning
# 2. Edit CURRENT_TASK.md
# Move Phase 6.14 to completed
# Add Phase 6.15 as current focus
# 3. Verify
grep "67.9M\|Thread Safety" PHASE_6.14_COMPLETION_REPORT.md
grep "Phase 6.15" CURRENT_TASK.md
Day 1 Afternoon: Step 2 - P0 Safety Lock (2-3 hours)
Implementation (30 min)
File: hakmem.c
// After line 22: Add pthread.h
#include <pthread.h>
// After line 58: Add global lock
static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;
#define HAKMEM_LOCK() pthread_mutex_lock(&g_hakmem_lock)
#define HAKMEM_UNLOCK() pthread_mutex_unlock(&g_hakmem_lock)
// Wrap hak_alloc_at (find ~line 300-400)
void* hak_alloc_at(size_t size, uintptr_t site_id) {
    HAKMEM_LOCK();
    void* ptr = hak_alloc_at_internal(size, site_id); // Rename old function
    HAKMEM_UNLOCK();
    return ptr;
}
// Wrap hak_free_at
void hak_free_at(void* ptr, uintptr_t site_id) {
    if (!ptr) return;
    HAKMEM_LOCK();
    hak_free_at_internal(ptr, site_id); // Rename old function
    HAKMEM_UNLOCK();
}
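The wrapping pattern above can be tried in isolation before touching hakmem.c. The following is a minimal sketch, not the hakmem implementation: `toy_alloc_internal`, `toy_free_internal`, and `toy_stress` are hypothetical stand-ins for the renamed `*_internal` functions and the larson run, and the allocator is a trivial fixed-size free list.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 64  /* hypothetical fixed block size for the toy allocator */

/* Global lock, mirroring g_hakmem_lock in the P0 plan. */
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

/* Unsynchronized allocator state: a LIFO free list plus a live-block counter. */
typedef struct Block { struct Block* next; } Block;
static Block* g_free_list = NULL;
static long g_live_blocks = 0;

/* Stand-in for hak_alloc_at_internal: NOT thread-safe on its own. */
static void* toy_alloc_internal(void) {
    Block* b = g_free_list;
    if (b) g_free_list = b->next;
    else b = malloc(BLOCK_SIZE);
    if (b) g_live_blocks++;
    return b;
}

/* Stand-in for hak_free_at_internal: NOT thread-safe on its own. */
static void toy_free_internal(void* p) {
    Block* b = p;
    b->next = g_free_list;
    g_free_list = b;
    g_live_blocks--;
}

/* Public entry points: every call is serialized by the global lock. */
void* toy_alloc(void) {
    pthread_mutex_lock(&g_lock);
    void* p = toy_alloc_internal();
    pthread_mutex_unlock(&g_lock);
    return p;
}

void toy_free(void* p) {
    if (!p) return;
    pthread_mutex_lock(&g_lock);
    toy_free_internal(p);
    pthread_mutex_unlock(&g_lock);
}

static void* worker(void* arg) {
    long iters = *(long*)arg;
    for (long i = 0; i < iters; i++) {
        void* p = toy_alloc();
        toy_free(p);
    }
    return NULL;
}

/* Runs nthreads alloc/free loops; returns the live-block count afterwards.
   With correct locking, every alloc is matched by a free, so the result is 0. */
long toy_stress(int nthreads, long iters) {
    pthread_t tids[64];
    assert(nthreads > 0 && nthreads <= 64);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, worker, &iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    pthread_mutex_lock(&g_lock);
    long live = g_live_blocks;
    pthread_mutex_unlock(&g_lock);
    return live;
}
```

Calling `toy_stress(4, 50000)` should return 0. Removing the lock/unlock pairs from `toy_alloc` and `toy_free` typically makes the counter drift or the free list corrupt, which is the same class of failure as the 4-thread collapse described in this card.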
Testing (1.5 hours)
# Build
make clean && make shared
# Test 1: larson 1T/4T (30 min)
cd /tmp/mimalloc-bench/bench/larson
# 1-thread
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 1
# Expected: 13-15M ops/sec
# 4-thread
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 4
# Expected: 13-15M ops/sec (same as 1T, no crashes!)
# Test 2: Helgrind (20 min)
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
valgrind --tool=helgrind \
./larson 0 8 1024 1000 1 12345 4
# Expected: ERROR SUMMARY: 0 errors
# Test 3: Stability (10 min)
for i in {1..10}; do
    LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
    ./larson 0 8 1024 10000 1 12345 4 || exit 1
done
# Expected: 10/10 runs succeed
Documentation (15 min)
Create PHASE_6.15_P0_RESULTS.md with benchmark results.
Day 2: Step 3 - P1 Tiny Pool TLS (2 hours)
File: hakmem_tiny.c
Pattern (copy from hakmem_l25_pool.c:26):
// Add TLS cache
static __thread TinySlab* tls_tiny_cache[TINY_NUM_CLASSES] = {NULL};
// TLS fast path in hak_tiny_alloc()
TinySlab* slab = tls_tiny_cache[class_idx];
if (slab && slab->free_count > 0) {
    // Fast path: no lock needed
    return alloc_from_slab(slab, class_idx);
}
// TLS miss: refill from global (locked)
HAKMEM_LOCK();
// ... refill logic ...
HAKMEM_UNLOCK();
Test: larson 4T → expect 12-15M ops/sec
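To make the fast-path/refill split concrete, here is a simplified sketch. It assumes a per-thread free list of blocks rather than the cached `TinySlab*` used in the pattern above, and `NUM_CLASSES`, `REFILL_BATCH`, `tls_refill`, `tls_alloc`, and `tls_free` are hypothetical names, not hakmem identifiers.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define NUM_CLASSES 4   /* hypothetical; stands in for TINY_NUM_CLASSES */
#define REFILL_BATCH 8  /* blocks moved per refill (assumed tuning value) */

typedef struct Node { struct Node* next; } Node;

static pthread_mutex_t g_pool_lock = PTHREAD_MUTEX_INITIALIZER;
static Node* g_pool[NUM_CLASSES];             /* shared, guarded by g_pool_lock */
static __thread Node* tls_cache[NUM_CLASSES]; /* per-thread, touched without a lock */

/* Block size for a class: 16, 32, 64, 128 bytes in this sketch. */
static size_t class_size(int class_idx) { return (size_t)16 << class_idx; }

/* Slow path: under the lock, pull up to REFILL_BATCH blocks from the shared
   pool into this thread's cache, falling back to malloc when the pool is empty. */
static void tls_refill(int class_idx) {
    pthread_mutex_lock(&g_pool_lock);
    for (int i = 0; i < REFILL_BATCH; i++) {
        Node* n = g_pool[class_idx];
        if (n) g_pool[class_idx] = n->next;
        else n = malloc(class_size(class_idx));
        if (!n) break;
        n->next = tls_cache[class_idx];
        tls_cache[class_idx] = n;
    }
    pthread_mutex_unlock(&g_pool_lock);
}

void* tls_alloc(int class_idx) {
    assert(class_idx >= 0 && class_idx < NUM_CLASSES);
    Node* n = tls_cache[class_idx];
    if (!n) {
        /* TLS miss: one locked refill, then retry from the local cache. */
        tls_refill(class_idx);
        n = tls_cache[class_idx];
        if (!n) return NULL;
    }
    /* Fast path: pop from the thread-local list, no lock taken. */
    tls_cache[class_idx] = n->next;
    return n;
}

void tls_free(void* p, int class_idx) {
    if (!p) return;
    assert(class_idx >= 0 && class_idx < NUM_CLASSES);
    /* Push back onto the freeing thread's own cache, again without a lock. */
    Node* n = p;
    n->next = tls_cache[class_idx];
    tls_cache[class_idx] = n;
}
```

The lock is taken once per `REFILL_BATCH` allocations instead of once per allocation, which is where the expected 4T gain comes from. This sketch never returns blocks to the shared pool, so a thread that only frees would grow its cache without bound; a real implementation needs an eviction path.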
Day 3-4: P2 L2 Pool TLS (3 hours)
File: hakmem_pool.c
Same pattern as Tiny Pool (above)
Test: larson 4T → expect 15-18M ops/sec
Day 5: P3 L2.5 Pool TLS (3 hours)
File: hakmem_l25_pool.c
Existing: Line 26 already has __thread L25Block* tls_l25_cache[5];
Add: Refill/eviction logic in alloc/free functions
Test: larson 4T → expect 18-22M ops/sec
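The refill/eviction logic is not spelled out in this card, so the following is only one possible shape for it. It assumes a per-class cap (`TLS_CAP`) and an evict-half policy; `Blk`, `l25_cache_free`, `l25_cache_alloc`, and the counters are hypothetical names chosen for illustration, not the contents of hakmem_l25_pool.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define L25_CLASSES 5  /* matches the [5] in tls_l25_cache */
#define TLS_CAP 4      /* assumed per-class cap on thread-local blocks */

typedef struct Blk { struct Blk* next; } Blk;

static pthread_mutex_t g_l25_lock = PTHREAD_MUTEX_INITIALIZER;
static Blk* g_l25_free[L25_CLASSES];          /* shared, guarded by g_l25_lock */
static size_t g_l25_count[L25_CLASSES];
static __thread Blk* tls_l25[L25_CLASSES];    /* per-thread cache */
static __thread size_t tls_l25_count[L25_CLASSES];

/* Eviction: move half of this thread's blocks to the shared list in one locked batch. */
static void l25_evict(int c) {
    size_t to_move = tls_l25_count[c] / 2;
    pthread_mutex_lock(&g_l25_lock);
    while (to_move-- > 0) {
        Blk* b = tls_l25[c];
        tls_l25[c] = b->next;
        tls_l25_count[c]--;
        b->next = g_l25_free[c];
        g_l25_free[c] = b;
        g_l25_count[c]++;
    }
    pthread_mutex_unlock(&g_l25_lock);
}

/* Free: push locally; evict only when the cache exceeds TLS_CAP. */
void l25_cache_free(void* p, int c) {
    if (!p) return;
    assert(c >= 0 && c < L25_CLASSES);
    Blk* b = p;
    b->next = tls_l25[c];
    tls_l25[c] = b;
    if (++tls_l25_count[c] > TLS_CAP) l25_evict(c);
}

/* Alloc: pop locally if possible, otherwise take one block from the shared list.
   A NULL result means the caller must obtain fresh memory elsewhere. */
void* l25_cache_alloc(int c) {
    assert(c >= 0 && c < L25_CLASSES);
    Blk* b = tls_l25[c];
    if (b) {
        tls_l25[c] = b->next;
        tls_l25_count[c]--;
        return b;
    }
    pthread_mutex_lock(&g_l25_lock);
    b = g_l25_free[c];
    if (b) {
        g_l25_free[c] = b->next;
        g_l25_count[c]--;
    }
    pthread_mutex_unlock(&g_l25_lock);
    return b;
}

size_t l25_tls_count(int c) { return tls_l25_count[c]; }

size_t l25_global_count(int c) {
    pthread_mutex_lock(&g_l25_lock);
    size_t n = g_l25_count[c];
    pthread_mutex_unlock(&g_l25_lock);
    return n;
}
```

With `TLS_CAP` set to 4, freeing five blocks of one class from a single thread leaves three in the thread-local cache and two on the shared list. Evicting in batches keeps the lock off the common free path while still letting other threads reuse memory freed here.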
📊 Performance Roadmap
Before P0: 1T = 15.1M 4T = 3.3M (-78%) ← UNSAFE
After P0: 1T = 13-15M 4T = 13-15M (+294-355%) ← SAFE, no scaling
After P1: 1T = 13-15M 4T = 12-15M (+264-355%) ← 95% TLS hit
After P2: 1T = 13-15M 4T = 15-18M (+355-445%) ← 90% TLS hit
After P3: 1T = 13-15M 4T = 18-22M (+445-567%) ← Full TLS
Phase 6.13 Validation:
1T = 17.8M 4T = 15.9M (+381%) ✅ PROVEN
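The percentages in the roadmap are all relative changes against a baseline throughput (3.3M for the 4T figures, 15.1M for the collapse). A small helper, with the hypothetical name `pct_change`, makes the arithmetic easy to recheck:

```c
#include <assert.h>

/* Percent change of `after` relative to `before`.
   Inputs are throughputs in M ops/sec; the unit cancels out. */
double pct_change(double before, double after) {
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}
```

For example, `pct_change(3.3, 15.9)` is roughly 381.8, matching the +381% Phase 6.13 figure, and `pct_change(15.1, 3.3)` is roughly -78.1, matching the -78% collapse.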
✅ Success Criteria
P0 (Minimum):
- ✅ 4T ≥ 13M ops/sec
- ✅ Helgrind: 0 data races
- ✅ 10/10 stability runs
P0+P1+P2 (Target):
- ✅ 4T ≥ 15M ops/sec
- ✅ TLS hit rate ≥ 90%
- ✅ No 1T regression (≤15%)
All Phases (Stretch):
- ✅ 4T ≥ 18M ops/sec
- ✅ 16T ≥ 11.6M ops/sec
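The TLS hit-rate criterion needs a counter to check against. This card does not say how hakmem measures it, so the sketch below is an assumption: a plain hit/miss tally (`TlsStats` and its functions are hypothetical names) that could be kept per thread and summed at exit.

```c
#include <assert.h>
#include <stddef.h>

/* Hit/miss tally for a TLS cache; intended to live in thread-local storage. */
typedef struct {
    unsigned long hits;
    unsigned long misses;
} TlsStats;

/* Record one allocation: hit != 0 means it was served from the TLS cache. */
void tls_stats_record(TlsStats* s, int hit) {
    assert(s != NULL);
    if (hit) s->hits++;
    else s->misses++;
}

/* Hit rate in percent; 0.0 when nothing has been recorded yet. */
double tls_stats_hit_rate(const TlsStats* s) {
    assert(s != NULL);
    unsigned long total = s->hits + s->misses;
    return total ? 100.0 * (double)s->hits / (double)total : 0.0;
}

/* Returns 1 when the measured hit rate reaches target_pct (e.g. 90.0). */
int tls_stats_meets_target(const TlsStats* s, double target_pct) {
    return tls_stats_hit_rate(s) >= target_pct;
}
```

Nine hits and one miss give a rate of 90%, which just meets the target above.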
🚨 Critical Findings
1. 67.9M ops/sec = Measurement Error
   - Actual: 15.1M (1T), 3.3M (4T)
   - Fix: Update Phase 6.14 report
2. 4-thread collapse = Thread-unsafe
   - NOT a feature, NOT expected
   - Zero pthread_mutex in codebase
   - Fix: P0 global lock (30 min)
3. TLS is validated (+381%)
   - Phase 6.13 proved 4T = 15.9M ops/sec
   - NOT the cause of Phase 6.11.5 regression
   - Real culprit: Slab Registry (Phase 6.12.1)
📁 Document Map
PHASE_6.15_PLAN.md - Full implementation guide (1008 lines)
PHASE_6.15_SUMMARY.md - Executive summary (152 lines)
PHASE_6.15_QUICK_REF.md - Quick reference card (YOU ARE HERE)
THREAD_SAFETY_SOLUTION.md - Complete analysis (Option A/B/C)
PHASE_6.13_INITIAL_RESULTS.md - TLS validation proof
PHASE_6.14_COMPLETION_REPORT.md - Thread issue discovery
🔧 Common Commands
# Build hakmem
cd apps/experiments/hakmem-poc
make clean && make shared
# larson benchmark (4-thread)
cd /tmp/mimalloc-bench/bench/larson
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 4
# Helgrind race detection
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
valgrind --tool=helgrind \
./larson 0 8 1024 1000 1 12345 4
# Check pthread usage
grep -n "pthread" apps/experiments/hakmem-poc/*.c
📞 Need Help?
- Detailed steps: See PHASE_6.15_PLAN.md
- Technical analysis: See THREAD_SAFETY_SOLUTION.md
- Validation proof: See PHASE_6.13_INITIAL_RESULTS.md
Status: ✅ Ready to execute
Total Time: 12-13 hours (6 days)
Expected ROI: 6-15x improvement (3.3M → 20-50M ops/sec)