# Phase 6.15: Quick Reference Card

**Full Details**: See [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md) (1008 lines)

---

## 📊 **The Problem**

```
Current State: hakmem is THREAD-UNSAFE

1-thread: 15.1M ops/sec  ✅ Excellent
4-thread:  3.3M ops/sec  ❌ -78% collapse!

Root Cause: grep pthread_mutex *.c → 0 results
```

---

## 🎯 **The Solution (3 Steps)**

| Step | What | Time | Expected Result |
|------|------|------|-----------------|
| **1** | Fix docs | 1h | Clarity on 67.9M issue |
| **2** | P0 Safety Lock | 2-3h | 4T = 13-15M (safe, no scaling) |
| **3** | TLS Performance | 8-10h | 4T = 15-20M (+381% proven) |

---

## 📋 **Step-by-Step Execution**

### **Day 1 Morning: Step 1 (1 hour)**

```bash
cd apps/experiments/hakmem-poc

# 1. Edit PHASE_6.14_COMPLETION_REPORT.md
#    Add section explaining 67.9M measurement issue
#    Add thread safety warning

# 2. Edit CURRENT_TASK.md
#    Move Phase 6.14 to completed
#    Add Phase 6.15 as current focus

# 3. Verify
grep "67.9M\|Thread Safety" PHASE_6.14_COMPLETION_REPORT.md
grep "Phase 6.15" CURRENT_TASK.md
```

---

### **Day 1 Afternoon: Step 2 - P0 Safety Lock (2-3 hours)**

#### **Implementation (30 min)**

**File**: `hakmem.c`

```c
// After line 22: Add pthread.h
#include <pthread.h>

// After line 58: Add global lock
static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;
#define HAKMEM_LOCK()   pthread_mutex_lock(&g_hakmem_lock)
#define HAKMEM_UNLOCK() pthread_mutex_unlock(&g_hakmem_lock)

// Wrap hak_alloc_at (find ~line 300-400)
// The existing implementation is renamed to hak_alloc_at_internal.
void* hak_alloc_at(size_t size, uintptr_t site_id) {
    HAKMEM_LOCK();
    void* ptr = hak_alloc_at_internal(size, site_id);  // Rename old function
    HAKMEM_UNLOCK();
    return ptr;
}

// Wrap hak_free_at
// The existing implementation is renamed to hak_free_at_internal.
void hak_free_at(void* ptr, uintptr_t site_id) {
    if (!ptr) return;  // NULL needs no lock
    HAKMEM_LOCK();
    hak_free_at_internal(ptr, site_id);  // Rename old function
    HAKMEM_UNLOCK();
}
```

#### **Testing (1.5 hours)**

```bash
# Build
make clean && make shared

# Test 1: larson 1T/4T (30 min)
cd /tmp/mimalloc-bench/bench/larson

# 1-thread
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  ./larson 0 8 1024 10000 1 12345 1
# Expected: 13-15M ops/sec

# 4-thread
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  ./larson 0 8 1024 10000 1 12345 4
# Expected: 13-15M ops/sec (same as 1T, no crashes!)

# Test 2: Helgrind (20 min)
# LD_PRELOAD must precede valgrind; placed after it, valgrind would treat
# the assignment as the name of the program to run.
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  valgrind --tool=helgrind \
  ./larson 0 8 1024 1000 1 12345 4
# Expected: ERROR SUMMARY: 0 errors

# Test 3: Stability (10 min)
for i in {1..10}; do
  LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
    ./larson 0 8 1024 10000 1 12345 4 || exit 1
done
# Expected: 10/10 runs succeed
```

#### **Documentation (15 min)**

Create `PHASE_6.15_P0_RESULTS.md` with benchmark results.

---

### **Day 2: Step 3 - P1 Tiny Pool TLS (2 hours)**

**File**: `hakmem_tiny.c`

**Pattern** (copy from `hakmem_l25_pool.c:26`):

```c
// Add TLS cache (one slab pointer per size class, per thread)
static __thread TinySlab* tls_tiny_cache[TINY_NUM_CLASSES] = {NULL};

// TLS fast path in hak_tiny_alloc()
TinySlab* slab = tls_tiny_cache[class_idx];
if (slab && slab->free_count > 0) {
    // Fast path: no lock needed
    return alloc_from_slab(slab, class_idx);
}

// TLS miss: refill from global (locked)
HAKMEM_LOCK();
// ... refill logic ...
HAKMEM_UNLOCK();
```

**Test**: larson 4T → expect 12-15M ops/sec

---

### **Day 3-4: P2 L2 Pool TLS (3 hours)**

**File**: `hakmem_pool.c`

**Same pattern** as Tiny Pool (above)

**Test**: larson 4T → expect 15-18M ops/sec

---

### **Day 5: P3 L2.5 Pool TLS (3 hours)**

**File**: `hakmem_l25_pool.c`

**Existing**: Line 26 already has `__thread L25Block* tls_l25_cache[5];`

**Add**: Refill/eviction logic in alloc/free functions

**Test**: larson 4T → expect 18-22M ops/sec

---

## 📊 **Performance Roadmap**

```
Before P0:  1T = 15.1M    4T = 3.3M    (-78%)       ← UNSAFE
After P0:   1T = 13-15M   4T = 13-15M  (+294-355%)  ← SAFE, no scaling
After P1:   1T = 13-15M   4T = 12-15M  (+264-355%)  ← 95% TLS hit
After P2:   1T = 13-15M   4T = 15-18M  (+355-445%)  ← 90% TLS hit
After P3:   1T = 13-15M   4T = 18-22M  (+445-567%)  ← Full TLS

Phase 6.13 Validation:
            1T = 17.8M    4T = 15.9M   (+381%)      ✅ PROVEN
```

All 4T percentages are relative to the unsafe 3.3M ops/sec baseline.

---

## ✅ **Success Criteria**

**P0 (Minimum)**:
- ✅ 4T ≥ 13M ops/sec
- ✅ Helgrind: 0 data races
- ✅ 10/10 stability runs

**P0+P1+P2 (Target)**:
- ✅ 4T ≥ 15M ops/sec
- ✅ TLS hit rate ≥ 90%
- ✅ No 1T regression (≤15%)

**All Phases (Stretch)**:
- ✅ 4T ≥ 18M ops/sec
- ✅ 16T ≥ 11.6M ops/sec

---

## 🚨 **Critical Findings**

1. **67.9M ops/sec = Measurement Error**
   - Actual: 15.1M (1T), 3.3M (4T)
   - Fix: Update Phase 6.14 report

2. **4-thread collapse = Thread-unsafe**
   - NOT a feature, NOT expected
   - Zero `pthread_mutex` in codebase
   - Fix: P0 global lock (30 min)

3. **TLS is validated (+381%)**
   - Phase 6.13 proved 4T = 15.9M ops/sec
   - NOT the cause of Phase 6.11.5 regression
   - Real culprit: Slab Registry (Phase 6.12.1)

---

## 📁 **Document Map**

```
PHASE_6.15_PLAN.md              - Full implementation guide (1008 lines)
PHASE_6.15_SUMMARY.md           - Executive summary (152 lines)
PHASE_6.15_QUICK_REF.md         - Quick reference card (YOU ARE HERE)
THREAD_SAFETY_SOLUTION.md       - Complete analysis (Option A/B/C)
PHASE_6.13_INITIAL_RESULTS.md   - TLS validation proof
PHASE_6.14_COMPLETION_REPORT.md - Thread issue discovery
```

---

## 🔧 **Common Commands**

```bash
# Build hakmem
cd apps/experiments/hakmem-poc
make clean && make shared

# larson benchmark (4-thread)
cd /tmp/mimalloc-bench/bench/larson
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  ./larson 0 8 1024 10000 1 12345 4

# Helgrind race detection (LD_PRELOAD goes before valgrind)
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  valgrind --tool=helgrind \
  ./larson 0 8 1024 1000 1 12345 4

# Check pthread usage
grep -n "pthread" apps/experiments/hakmem-poc/*.c
```

---

## 📞 **Need Help?**

- **Detailed steps**: See [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md)
- **Technical analysis**: See [THREAD_SAFETY_SOLUTION.md](THREAD_SAFETY_SOLUTION.md)
- **Validation proof**: See [PHASE_6.13_INITIAL_RESULTS.md](PHASE_6.13_INITIAL_RESULTS.md)

---

**Status**: ✅ Ready to execute
**Total Time**: 12-13 hours (6 days)
**Expected ROI**: 6-15x improvement (3.3M → 20-50M ops/sec)
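
---

The P0 wrapper pattern from Step 2 can be tried in isolation before touching `hakmem.c`. The following is a minimal sketch, not hakmem code: only the `g_hakmem_lock`, `HAKMEM_LOCK()`, and `HAKMEM_UNLOCK()` names come from this card. `demo_alloc`, `demo_free`, `demo_worker`, `run_demo_threads`, `DEMO_ITERS`, and the two counters are hypothetical stand-ins, and `malloc`/`free` take the place of the renamed `*_internal` functions.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Same global-lock macros as the P0 step. */
static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;
#define HAKMEM_LOCK()   pthread_mutex_lock(&g_hakmem_lock)
#define HAKMEM_UNLOCK() pthread_mutex_unlock(&g_hakmem_lock)

/* Hypothetical shared state standing in for hakmem's global structures.
 * Both counters are only touched while g_hakmem_lock is held. */
static long g_alloc_calls = 0;  /* number of demo_alloc() calls */
static long g_live_blocks = 0;  /* blocks allocated and not yet freed */

#define DEMO_ITERS 10000  /* alloc/free pairs per worker thread (assumption) */

/* Hypothetical wrapper: lock, run the unlocked internals, unlock. */
static void* demo_alloc(size_t size) {
    HAKMEM_LOCK();
    void* ptr = malloc(size);  /* stand-in for hak_alloc_at_internal() */
    g_alloc_calls++;
    if (ptr) g_live_blocks++;
    HAKMEM_UNLOCK();
    return ptr;
}

static void demo_free(void* ptr) {
    if (!ptr) return;  /* NULL needs no lock */
    HAKMEM_LOCK();
    free(ptr);         /* stand-in for hak_free_at_internal() */
    g_live_blocks--;
    HAKMEM_UNLOCK();
}

/* Each worker performs DEMO_ITERS alloc/free pairs through the wrappers. */
static void* demo_worker(void* arg) {
    (void)arg;
    for (int i = 0; i < DEMO_ITERS; i++) {
        demo_free(demo_alloc(64));
    }
    return NULL;
}

/* Starts nthreads workers (1..16), joins them, and returns the total number
 * of demo_alloc() calls recorded so far. */
static long run_demo_threads(int nthreads) {
    pthread_t tids[16];
    assert(nthreads > 0 && nthreads <= 16);
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tids[i], NULL, demo_worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
    }
    return g_alloc_calls;  /* safe to read: all workers have been joined */
}
```

Calling `run_demo_threads(4)` from a small `main` and building with `cc -pthread` gives a program that Helgrind can check the same way as larson. If the lock macros are removed, the counter updates become data races, which mirrors the unlocked state described in The Problem.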