# Phase 6.15: Multi-threaded Safety + TLS Performance - Implementation Plan

**Date**: 2025-10-22
**Status**: 📋 **Planning Complete**
**Total Time**: 12-13 hours (3 weeks)

---

## 📊 **Executive Summary**

### **Current Problem**

hakmem allocator is **completely thread-unsafe** with catastrophic multi-threaded performance:

| Threads | Performance (ops/sec) | vs 1-thread |
|---------|----------------------|-------------|
| **1-thread** | 15.1M ops/sec | baseline |
| **4-thread** | 3.3M ops/sec | **-78% slower** ❌ |

**Root Cause**: Zero thread synchronization primitives in current codebase (no `pthread_mutex` anywhere)

### **Solution Strategy**

**3-Stage Gradual Implementation**:

1. **Step 1**: Document updates (1 hour) - Fix 67.9M measurement issue, create Phase 6.15 plan
2. **Step 2**: P0 Safety Lock (30 min + testing) - Ensure correctness with minimal changes
3. **Step 3**: TLS Performance (8-10 hours) - Achieve 4T = 15M ops/sec (+381% validated)

**Expected Outcome**:
- **Minimum Success** (P0): 4T = 1T performance (safe, no scalability)
- **Target Success** (P0+P1): 4T = 12-15M ops/sec (+264-355%)
- **Validated** (Phase 6.13): 4T = **15.9M ops/sec** (+381%) ✅ **ALREADY PROVEN**

---

## 🎯 **Step 1: Documentation Updates** (1 hour)

### **Task 1.1: Fix Phase 6.14 Completion Report** (15 minutes)

**File**: `apps/experiments/hakmem-poc/PHASE_6.14_COMPLETION_REPORT.md`

**Current Problem**:
- Report focuses on Registry ON/OFF toggle
- No mention of 67.9M ops/sec measurement issue
- Misleading performance claims

**Required Changes**:

1. **Add Executive Summary Section** (after line 9):
```markdown
## ⚠️ **Important Note: 67.9M Performance Measurement**

**Issue**: Earlier reports mentioned 67.9M ops/sec performance
**Status**: ❌ **NOT REPRODUCIBLE** - Likely measurement error

**Actual Achievements**:
- ✅ Registry ON/OFF toggle implementation (Pattern 2)
- ✅ O(N) proven 2.9-13.7x faster than O(1) for Small-N (8-32 slabs)
- ✅ Default: `g_use_registry = 0` (O(N) Sequential Access)

**Performance Reality**:
- 1-thread: 15.3M ops/sec (O(N), validated)
- 4-thread: **3.3M ops/sec** (THREAD-UNSAFE, requires Phase 6.15 fix)
```

2. **Update Section Title** (line 9):
```markdown
## 📊 **Executive Summary: Registry Toggle + Thread Safety Issue**
```

3. **Add Thread Safety Warning** (after line 158):
```markdown
---

## 🚨 **Critical Discovery: Thread Safety Issue**

### **Problem**

Phase 6.14 testing revealed **catastrophic multi-threaded performance collapse**:

| Threads | Performance | vs 1-thread |
|---------|-------------|-------------|
| 1-thread | 15.3M ops/sec | baseline |
| 4-thread | **3.3M ops/sec** | **-78%** ❌ |

**Root Cause**: `grep pthread_mutex *.c` → **0 results** (no locks!)

**Impact**: All global structures are race-condition prone:
- `g_tiny_pool.free_slabs[]` - Concurrent access without locks
- `g_l25_pool.freelist[]` - Multiple threads modifying same freelist
- `g_slab_registry[]` - Hash table corruption
- `g_whale_cache` - Ring buffer race conditions

### **Solution**

**Phase 6.15**: Multi-threaded Safety + TLS Performance
- **P0** (30 min): Global safety lock (correctness first)
- **P1** (2 hours): Tiny Pool TLS (95%+ lock avoidance)
- **P2** (3 hours): L2 Pool TLS (full coverage)
- **P3** (3 hours): L2.5 Pool TLS expansion

**Expected Results**:
- P0: 4T = 13-15M ops/sec (safe, no scalability)
- P0+P1: 4T = 12-15M ops/sec (+264-355%)
- **Validated**: 4T = **15.9M ops/sec** (+381% vs 3.3M baseline) ✅

**Reference**: `THREAD_SAFETY_SOLUTION.md` - Complete analysis
```

**Estimated Time**: 15 minutes

---

### **Task 1.2: Create Phase 6.15 Plan Document** (30 minutes)

**File**: `apps/experiments/hakmem-poc/PHASE_6.15_PLAN.md` (THIS FILE)

**Contents**: ✅ **Already created** (this document)

**Sections**:
1. Executive Summary
2. Step 1: Documentation Updates (detailed)
3. Step 2: P0 Safety Lock (implementation + testing)
4. Step 3: Multi-threaded Performance (P1-P3 breakdown)
5. Implementation Checklist
6. Risk Assessment
7. Success Criteria

**Estimated Time**: 30 minutes (already completed)

---

### **Task 1.3: Update CURRENT_TASK.md** (10 minutes)

**File**: `apps/experiments/hakmem-poc/CURRENT_TASK.md`

**Required Changes**:

1. **Update Current Status** (after line 30):
```markdown
## 🎯 **Current Focus: Phase 6.15 Multi-threaded Safety** (2025-10-22)

### **Immediate Priority: Thread Safety Fix**

⚠️ **Problem Discovered**: hakmem is completely thread-unsafe
- 4-thread performance: **3.3M ops/sec** (-78% vs 1-thread 15.1M)
- Root cause: Zero synchronization primitives (no `pthread_mutex`)

**Solution in Progress**: Phase 6.15 (3-stage implementation)
1. ✅ **Step 1**: Documentation updates (1 hour) ← IN PROGRESS
2. ⏳ **Step 2**: P0 Safety Lock (30 min + testing)
3. ⏳ **Step 3**: TLS Performance (P1-P3, 8-10 hours)

**Expected Outcome**: 4T = 15.9M ops/sec (validated in Phase 6.13)

**Planning Document**: [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md)
```

2. **Move Phase 6.14 to Completed Section** (after line 296):
````markdown
## ✅ Phase 6.14 Complete! (2025-10-22)

**Implementation complete**: Registry ON/OFF toggle implementation + Thread Safety Issue discovered

**✅ Completed work**:
1. **Pattern 2 implementation**: ON/OFF toggle via the `HAKMEM_USE_REGISTRY` environment variable
2. **O(N) vs O(1) verification**: demonstrated that O(N) is 2.9-13.7x faster
3. **Default setting**: `g_use_registry = 0` (O(N) Sequential Access)

**🚨 Critical Discovery**: 4-thread performance collapse (-78%)
- Cause: no locks on any global state
- Fix: planned for Phase 6.15

**📊 Measured results**:
```
1-thread: 15.3M ops/sec (O(N), Registry OFF)
4-thread: 3.3M ops/sec (-78% ← THREAD-UNSAFE) ❌
```

**Detailed documents**:
- [PHASE_6.14_COMPLETION_REPORT.md](PHASE_6.14_COMPLETION_REPORT.md) - Pattern 2 implementation
- [THREAD_SAFETY_SOLUTION.md](THREAD_SAFETY_SOLUTION.md) - complete analysis
- [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md) - fix plan

**Implementation time**: 34 minutes (as planned) ⚡
````

**Estimated Time**: 10 minutes

---

### **Task 1.4: Update README (if needed)** (5 minutes)

**File**: `apps/experiments/hakmem-poc/README.md` (if exists)

**Check if exists**:
```bash
ls -la apps/experiments/hakmem-poc/README.md
```

**If exists, add warning**:
```markdown
## ⚠️ **Current Status: Thread Safety in Development**

**Known Issue**: hakmem is currently thread-unsafe
- **Single-threaded**: 15.1M ops/sec ✅ Excellent
- **Multi-threaded**: 3.3M ops/sec (4T) ❌ Requires fix

**Fix in Progress**: Phase 6.15 Multi-threaded Safety
- Expected completion: 2025-10-24 (2-3 days)
- Target performance: 15-20M ops/sec at 4 threads

**Do NOT use in multi-threaded applications until Phase 6.15 is complete.**
```

**Estimated Time**: 5 minutes (or skip if README doesn't exist)

---

### **Task 1.5: Verification** (5 minutes)

**Checklist**:
- [ ] PHASE_6.14_COMPLETION_REPORT.md updated (67.9M issue documented)
- [ ] PHASE_6.15_PLAN.md created (this document)
- [ ] CURRENT_TASK.md updated (Phase 6.15 status)
- [ ] README.md updated (if exists)

**Verification Commands**:
```bash
cd apps/experiments/hakmem-poc

# Check files exist
ls -la PHASE_6.14_COMPLETION_REPORT.md
ls -la PHASE_6.15_PLAN.md
ls -la CURRENT_TASK.md

# Grep for keywords
grep -n "67.9M\|Thread Safety\|Phase 6.15" PHASE_6.14_COMPLETION_REPORT.md
grep -n "Phase 6.15\|Thread Safety" CURRENT_TASK.md
```

**Estimated Time**: 5 minutes

---

## ⏱️ **Step 1 Total Time: 1 hour 5 minutes**

---

## 🔐 **Step 2: P0 Safety Lock Implementation** (2-3 hours)

### **Goal**

Ensure **correctness** with minimal code changes. No performance improvement expected (4T ≈ 1T).

### **Success Criteria**

- ✅ 1-thread: 13-15M ops/sec (lock overhead 0-15% acceptable)
- ✅ 4-thread: 13-15M ops/sec (no scalability, but SAFE)
- ✅ Helgrind: 0 data races
- ✅ Stability: 10 consecutive runs without crash

---

### **Task 2.1: Implementation** (30 minutes)

#### **File**: `apps/experiments/hakmem-poc/hakmem.c`

**Changes Required**:

1. **Add pthread.h include** (after line 22):
```c
#include <pthread.h>  // Phase 6.15 P0: Thread Safety
```

2. **Add global lock** (after line 58):
```c
// ============================================================================
// Phase 6.15 P0: Thread Safety - Global Lock
// ============================================================================
// Global lock for all allocator operations
// Purpose: Ensure correctness in multi-threaded environment
// Performance: 4T ≈ 1T (no scalability, safety first)
// Will be replaced by TLS in P1-P3 (95%+ lock avoidance)
static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;

// Lock/unlock helpers (for debugging and future instrumentation)
#define HAKMEM_LOCK()   pthread_mutex_lock(&g_hakmem_lock)
#define HAKMEM_UNLOCK() pthread_mutex_unlock(&g_hakmem_lock)
```

3. **Wrap hak_alloc_at()** (find the function, approximately line 300-400):
```c
void* hak_alloc_at(size_t size, uintptr_t site_id) {
    // Phase 6.15 P0: Global lock (safety first)
    HAKMEM_LOCK();

    // Existing implementation
    void* ptr = hak_alloc_at_internal(size, site_id);

    HAKMEM_UNLOCK();
    return ptr;
}

// Rename old hak_alloc_at to hak_alloc_at_internal
static void* hak_alloc_at_internal(size_t size, uintptr_t site_id) {
    // ... existing code (no changes) ...
}
```

4. **Wrap hak_free_at()** (find the function):
```c
void hak_free_at(void* ptr, uintptr_t site_id) {
    if (!ptr) return;

    // Phase 6.15 P0: Global lock (safety first)
    HAKMEM_LOCK();

    // Existing implementation
    hak_free_at_internal(ptr, site_id);

    HAKMEM_UNLOCK();
}

// Rename old hak_free_at to hak_free_at_internal
static void hak_free_at_internal(void* ptr, uintptr_t site_id) {
    // ... existing code (no changes) ...
}
```

5. **Protect hak_init()** (find initialization function; see the sketch after this list):
```c
void hak_init(void) {
    // Phase 6.15 P0: No lock needed (called once before any threads spawn)
    // But add atomic check for safety

    // ... existing init code ...
}
```

**Estimated Time**: 30 minutes

---
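Item 5 leaves the "atomic check" unspecified. One low-risk way to make `hak_init()` safe against concurrent first use is `pthread_once`; this is a sketch only, and the `hak_init_internal` name is hypothetical (the existing init body would move into it unchanged):

```c
// Sketch: one-time init guard for hak_init() (Phase 6.15 P0)
#include <pthread.h>

static pthread_once_t g_hakmem_init_once = PTHREAD_ONCE_INIT;

// Hypothetical name: the existing hak_init() body moves here unchanged
static void hak_init_internal(void) {
    // ... existing init code ...
}

void hak_init(void) {
    // pthread_once guarantees hak_init_internal runs exactly once,
    // even if several threads race into the first allocation
    pthread_once(&g_hakmem_init_once, hak_init_internal);
}
```

With this guard in place, lazy initialization from `hak_alloc_at()`/`hak_free_at()` stays safe even if the process spawns threads before the first allocation.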
### **Task 2.2: Build & Smoke Test** (15 minutes)

**Commands**:
```bash
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc

# Clean build
make clean
make bench_allocators

# Smoke test (single-threaded)
./bench_allocators --allocator hakmem-baseline --scenario json

# Expected: ~300-350ns (slight overhead acceptable)
```

**Success Criteria**:
- ✅ Build succeeds (no compilation errors)
- ✅ No crashes on single-threaded test
- ✅ Performance: 13-15M ops/sec (within 0-15% of Phase 6.14)

**Estimated Time**: 15 minutes

---

### **Task 2.3: Multi-threaded Validation** (1 hour)

#### **Test 1: larson Benchmark** (30 minutes)

**Setup**:
```bash
# Build shared library (if not already done)
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
make clean && make shared

# Verify library
ls -lh libhakmem.so
nm -D libhakmem.so | grep -E "malloc|free|calloc|realloc"
```

**Benchmark Execution**:
```bash
cd /tmp/mimalloc-bench/bench/larson

# 1-thread baseline
./larson 0 8 1024 10000 1 12345 1

# 1-thread with hakmem P0
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  ./larson 0 8 1024 10000 1 12345 1

# Expected: 13-15M ops/sec (lock overhead 0-15%)
```

```bash
# 4-thread baseline
./larson 0 8 1024 10000 1 12345 4

# 4-thread with hakmem P0
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  ./larson 0 8 1024 10000 1 12345 4

# Expected: 13-15M ops/sec (same as 1T, no scalability)
# Critical: NO CRASHES, NO DATA CORRUPTION
```

**Success Criteria**:
- ✅ 1T: 13-15M ops/sec (within 15% of Phase 6.14)
- ✅ 4T: 13-15M ops/sec (no scalability expected)
- ✅ 4T: NO crashes, NO segfaults
- ✅ 4T: NO data corruption (verify checksum if larson supports it)

**Estimated Time**: 30 minutes

---

#### **Test 2: Helgrind Race Detection** (20 minutes)

**Purpose**: Verify all data races are eliminated

**Commands**:
```bash
cd /tmp/mimalloc-bench/bench/larson

# Install valgrind (if not installed)
sudo apt-get install -y valgrind

# Run Helgrind on 4-thread test (LD_PRELOAD must be set in the environment,
# not passed as an argument to valgrind)
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  valgrind --tool=helgrind \
  --read-var-info=yes \
  ./larson 0 8 1024 1000 1 12345 4

# Note: Reduced iterations (1000 instead of 10000) for faster run

# Expected output:
# ERROR SUMMARY: 0 errors from 0 contexts (suppressed: X from Y)
```

**Success Criteria**:
- ✅ ERROR SUMMARY: **0 errors** (zero data races)
- ✅ No warnings about unprotected reads/writes
- ⚠️ NOTE: Helgrind may show false positives from libc. Ignore them if they are NOT in hakmem code.

**Estimated Time**: 20 minutes

---
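In addition to larson, a tiny standalone harness run under `LD_PRELOAD` can catch gross corruption or deadlocks in seconds before moving on to Test 3. This is an optional supplementary check, not part of mimalloc-bench; the thread count, iteration count, and size range below are arbitrary choices:

```c
// stress.c - minimal multi-threaded malloc/free smoke test (sketch)
// Build: gcc -O2 -pthread stress.c -o stress
// Run:   LD_PRELOAD=./libhakmem.so ./stress
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define THREADS 4
#define ITERS   1000000
#define SLOTS   64

static void* worker(void* arg) {
    unsigned seed = (unsigned)(size_t)arg;
    void* slots[SLOTS] = {0};
    for (int i = 0; i < ITERS; i++) {
        int s = rand_r(&seed) % SLOTS;
        free(slots[s]);                           // free previous block (or NULL)
        size_t sz = 8 + (rand_r(&seed) % 1024);   // 8-1031 bytes: Tiny + small L2 range
        slots[s] = malloc(sz);
        if (slots[s]) memset(slots[s], 0xAB, sz); // touch memory to surface corruption
    }
    for (int s = 0; s < SLOTS; s++) free(slots[s]);
    return NULL;
}

int main(void) {
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void*)(size_t)(i + 1));
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    puts("stress: OK (no crash)");
    return 0;
}
```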
#### **Test 3: Stability Test** (10 minutes)

**Purpose**: Ensure no crashes over 10 consecutive runs

**Commands**:
```bash
cd /tmp/mimalloc-bench/bench/larson

# 10 consecutive 4-thread runs
for i in {1..10}; do
  echo "Run $i/10..."
  LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
    ./larson 0 8 1024 10000 1 12345 4 || { echo "FAILED at run $i"; exit 1; }
done
echo "✅ All 10 runs succeeded!"
```

**Success Criteria**:
- ✅ 10/10 runs complete without crashes
- ✅ Performance stable across runs (variance < 10%)

**Estimated Time**: 10 minutes

---

### **Task 2.4: Document Results** (15 minutes)

**Create**: `apps/experiments/hakmem-poc/PHASE_6.15_P0_RESULTS.md`

**Template**:
````markdown
# Phase 6.15 P0: Safety Lock Implementation - Results

**Date**: 2025-10-22
**Status**: ✅ **COMPLETED** (Correctness achieved)
**Implementation Time**: X minutes

---

## 📊 **Benchmark Results**

### **larson (mimalloc-bench)**

| Threads | Before P0 (UNSAFE) | After P0 (SAFE) | Change |
|---------|-------------------|-----------------|--------|
| 1-thread | 15.1M ops/sec | X.XM ops/sec | ±X% |
| 4-thread | 3.3M ops/sec | X.XM ops/sec | +XXX% ✅ |

**Performance Summary**:
- 1-thread overhead: X% (lock overhead, acceptable)
- 4-thread improvement: +XXX% (from -78% to safe)
- 4-thread scalability: X.Xx (4T / 1T, expected ~1.0)

---

## ✅ **Success Criteria Met**

- ✅ 1T performance: X.XM ops/sec (within 15% of Phase 6.14)
- ✅ 4T performance: X.XM ops/sec (safe, no scalability)
- ✅ Helgrind: **0 data races** detected
- ✅ Stability: **10/10 runs** without crashes

---

## 🔧 **Implementation Details**

**Files Modified**:
- `hakmem.c` - Added global lock + wrapper functions

**Lines Changed**:
- +20 lines (pthread.h, global lock, HAKMEM_LOCK/UNLOCK macros)
- +10 lines (hak_alloc_at wrapper)
- +10 lines (hak_free_at wrapper)
- **Total**: ~40 lines

**Pattern**:
```c
void* hak_alloc_at(size_t size, uintptr_t site_id) {
    HAKMEM_LOCK();
    void* ptr = hak_alloc_at_internal(size, site_id);
    HAKMEM_UNLOCK();
    return ptr;
}
```

---

## 🎯 **Next Steps**

**Phase 6.15 P1**: Tiny Pool TLS (2 hours)
- Expected: 4T = 12-15M ops/sec (+100-150%)
- TLS hit rate: 95%+
- Lock avoidance: 95%+

**Start Date**: 2025-10-XX
````

**Estimated Time**: 15 minutes

---

### **Step 2 Total Time: 2-3 hours**

---

## 🚀 **Step 3: Multi-threaded Performance (P1-P3)** (8-10 hours)

### **Overview**

**Goal**: Achieve near-ideal scalability (4T ≈ 4x 1T) using Thread-Local Storage (TLS)

**Validation**: Phase 6.13 already proved TLS works
- 1-thread: 17.8M ops/sec (+123% vs system)
- 4-thread: 15.9M ops/sec (+147% vs system)

**Strategy**: Expand existing L2.5 TLS to Tiny Pool and L2 Pool

---
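The per-phase targets below (and the ≥90% figure in the Success Criteria) are stated in terms of TLS hit rate, so it helps to be able to measure it. A minimal sketch of per-thread counters for instrumented builds; the counter names and the `hak_tls_stats_report()` helper are illustrative, not existing APIs:

```c
// Sketch: TLS hit-rate instrumentation (debug/instrumented builds only)
#include <stdint.h>
#include <stdio.h>

static __thread uint64_t tls_hits   = 0;   // allocations served from the TLS cache
static __thread uint64_t tls_misses = 0;   // allocations that fell back to the global pool

// Increment tls_hits at the top of each TLS fast path,
// and tls_misses right before taking HAKMEM_LOCK() on a refill.

static void hak_tls_stats_report(const char* tag) {
    uint64_t total = tls_hits + tls_misses;
    if (total == 0) return;
    fprintf(stderr, "[hakmem][%s] TLS hit rate: %.1f%% (%llu hits / %llu ops)\n",
            tag, 100.0 * (double)tls_hits / (double)total,
            (unsigned long long)tls_hits, (unsigned long long)total);
}
```

Calling the report helper at the end of a benchmark run (or from a thread-exit hook) is enough to confirm whether P1-P3 actually reach the 90-95% range.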
### **Phase 6.15 P1: Tiny Pool TLS** (2 hours)

**Goal**: Thread-local cache for ≤1KB allocations (8 size classes)

**Existing Reference**: `hakmem_l25_pool.c:26` (TLS pattern already implemented)

#### **Implementation**

**File**: `apps/experiments/hakmem-poc/hakmem_tiny.c`

**Changes**:

1. **Add TLS cache** (after line 12):
```c
// Phase 6.15 P1: Thread-Local Storage for Tiny Pool
// Pattern: Same as L2.5 Pool (hakmem_l25_pool.c:26)
// Purpose: Reduce global freelist contention (50 cycles → 10 cycles)
// Hit rate expected: 95%+
static __thread TinySlab* tls_tiny_cache[TINY_NUM_CLASSES] = {NULL};
static __thread int tls_tiny_initialized = 0;
```

2. **TLS initialization** (new function):
```c
// Initialize TLS cache for current thread
static void hak_tiny_tls_init(void) {
    if (tls_tiny_initialized) return;

    // Initialize all size classes to NULL
    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        tls_tiny_cache[i] = NULL;
    }
    tls_tiny_initialized = 1;
}
```

3. **Modify hak_tiny_alloc** (existing function):
```c
void* hak_tiny_alloc(size_t size, uintptr_t site_id) {
    // Phase 6.15 P1: TLS fast path
    if (!tls_tiny_initialized) {
        hak_tiny_tls_init();
    }

    int class_idx = hak_tiny_get_class_index(size);

    // TLS hit check (no lock needed)
    TinySlab* slab = tls_tiny_cache[class_idx];
    if (slab && slab->free_count > 0) {
        // Fast path: Allocate from TLS cache
        return hak_tiny_alloc_from_slab(slab, class_idx);
    }

    // TLS miss: Refill from global freelist (locked)
    HAKMEM_LOCK();

    // Try to get a slab from global freelist
    slab = g_tiny_pool.free_slabs[class_idx];
    if (slab) {
        // Move slab to TLS cache
        g_tiny_pool.free_slabs[class_idx] = slab->next;
        tls_tiny_cache[class_idx] = slab;
        slab->next = NULL;  // Detach from freelist
    } else {
        // Allocate new slab (existing logic)
        slab = allocate_new_slab(class_idx);
        if (!slab) {
            HAKMEM_UNLOCK();
            return NULL;
        }
        tls_tiny_cache[class_idx] = slab;
    }

    HAKMEM_UNLOCK();

    // Allocate from newly cached slab
    return hak_tiny_alloc_from_slab(slab, class_idx);
}
```

4. **Modify hak_tiny_free** (existing function):
```c
void hak_tiny_free(void* ptr, uintptr_t site_id) {
    if (!ptr) return;

    // Find owner slab (O(N) or O(1) depending on g_use_registry)
    TinySlab* slab = hak_tiny_owner_slab(ptr);
    if (!slab) {
        fprintf(stderr, "[Tiny] ERROR: Invalid pointer!\n");
        return;
    }

    int class_idx = slab->size_class;

    // Free block in slab
    hak_tiny_free_in_slab(slab, ptr, class_idx);

    // Check if slab is now empty
    if (slab->free_count == slab->total_count) {
        // Phase 6.15 P1: Return empty slab to global freelist
        // First, remove from TLS cache if it's there
        if (tls_tiny_cache[class_idx] == slab) {
            tls_tiny_cache[class_idx] = NULL;
        }

        // Return to global freelist (locked)
        HAKMEM_LOCK();
        slab->next = g_tiny_pool.free_slabs[class_idx];
        g_tiny_pool.free_slabs[class_idx] = slab;
        HAKMEM_UNLOCK();
    }
}
```

**Expected Performance**:
- TLS hit rate: 95%+
- Lock contention: 5% (only on TLS miss)
- 4T performance: 12-15M ops/sec (+264-355% vs 3.3M baseline)

**Implementation Time**: 2 hours

---

### **Phase 6.15 P2: L2 Pool TLS** (3 hours)

**Goal**: Thread-local cache for 2-32KB allocations (5 size classes)

**Pattern**: Same as Tiny Pool TLS (above)

#### **Implementation**

**File**: `apps/experiments/hakmem-poc/hakmem_pool.c`

**Changes**: (Similar structure to Tiny Pool TLS; see the sketch after this list)

1. Add `static __thread L2Block* tls_l2_cache[L2_NUM_CLASSES];`
2. Implement TLS fast path in `hak_pool_alloc()`
3. Implement TLS refill logic (global freelist → TLS cache)
4. Implement TLS return logic (empty slabs → global freelist)

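Unlike P1, no code is written out for P2, so here is a sketch of what the `hak_pool_alloc()` fast path in items 1-3 could look like. It assumes `L2Block` carries a `next` link and that helpers for class lookup, block-to-user conversion, the global freelist head, and new-block allocation exist under some name; all identifiers other than `hak_pool_alloc`, `tls_l2_cache`, `L2_NUM_CLASSES`, and `HAKMEM_LOCK`/`HAKMEM_UNLOCK` are placeholders:

```c
// Phase 6.15 P2 sketch: TLS fast path for the L2 pool (2-32KB classes)
static __thread L2Block* tls_l2_cache[L2_NUM_CLASSES] = {NULL};

void* hak_pool_alloc(size_t size, uintptr_t site_id) {
    int class_idx = hak_pool_get_class_index(size);      // placeholder helper

    // Fast path: pop one cached block from TLS, no lock taken
    L2Block* block = tls_l2_cache[class_idx];
    if (block) {
        tls_l2_cache[class_idx] = block->next;
        return hak_pool_block_to_user(block);             // placeholder helper
    }

    // Slow path: refill from the global freelist under the P0 lock
    HAKMEM_LOCK();
    block = g_l2_pool.freelist[class_idx];                // placeholder global freelist
    if (block) {
        g_l2_pool.freelist[class_idx] = block->next;
    } else {
        block = hak_pool_allocate_new_block(class_idx, site_id);  // placeholder
    }
    HAKMEM_UNLOCK();

    return block ? hak_pool_block_to_user(block) : NULL;
}
```

The free path (item 4) mirrors the Tiny Pool version: push the block onto `tls_l2_cache[class_idx]` without locking, and only take `HAKMEM_LOCK()` when returning surplus blocks to the global freelist.
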

**Expected Performance**:
- TLS hit rate: 90%+
- Cumulative 4T performance: 15-18M ops/sec

**Implementation Time**: 3 hours

---

### **Phase 6.15 P3: L2.5 Pool TLS Expansion** (3 hours)

**Goal**: Expand existing L2.5 TLS to full implementation

**Current State**: `hakmem_l25_pool.c:26` already has TLS declaration:
```c
__thread L25Block* tls_l25_cache[L25_NUM_CLASSES] = {NULL};
```

**Missing**: TLS refill/eviction logic (currently only used in fast path)

#### **Implementation**

**File**: `apps/experiments/hakmem-poc/hakmem_l25_pool.c`

**Changes**:

1. **Implement TLS refill** (in `hak_l25_pool_alloc`):
```c
// Existing TLS check (line ~230)
L25Block* block = tls_l25_cache[class_idx];
if (block) {
    tls_l25_cache[class_idx] = NULL;  // Pop from TLS
    // ... existing header rewrite ...
    return user_ptr;
}

// NEW: TLS refill from global freelist
HAKMEM_LOCK();

int shard_idx = (site_id >> 4) & (L25_NUM_SHARDS - 1);

// Check non-empty bitmap
if (!(g_l25_pool.nonempty_mask[class_idx] & (1ULL << shard_idx))) {
    // Empty freelist, allocate new bundle
    // ... existing logic ...
} else {
    // Pop from global freelist
    block = g_l25_pool.freelist[class_idx][shard_idx];
    g_l25_pool.freelist[class_idx][shard_idx] = block->next;

    // Update bitmap if freelist is now empty
    if (!g_l25_pool.freelist[class_idx][shard_idx]) {
        g_l25_pool.nonempty_mask[class_idx] &= ~(1ULL << shard_idx);
    }

    // Move to TLS cache
    tls_l25_cache[class_idx] = block;
}

HAKMEM_UNLOCK();

// Allocate from TLS cache
block = tls_l25_cache[class_idx];
tls_l25_cache[class_idx] = NULL;
// ... existing header rewrite ...
return user_ptr;
```

2. **Implement TLS eviction** (in `hak_l25_pool_free`):
```c
// Existing logic to add to freelist
L25Block* block = (L25Block*)hdr;

// Phase 6.15 P3: Add to TLS cache first (if empty)
if (!tls_l25_cache[class_idx]) {
    tls_l25_cache[class_idx] = block;
    block->next = NULL;
    return;  // No need to lock
}

// TLS cache full, return to global freelist (locked)
HAKMEM_LOCK();
block->next = g_l25_pool.freelist[class_idx][shard_idx];
g_l25_pool.freelist[class_idx][shard_idx] = block;

// Update bitmap
g_l25_pool.nonempty_mask[class_idx] |= (1ULL << shard_idx);
HAKMEM_UNLOCK();
```

**Expected Performance**:
- TLS hit rate: 95%+
- Cumulative 4T performance: 18-22M ops/sec (+445-567%)

**Implementation Time**: 3 hours

---
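Each TLS phase is meant to be individually disableable (see the Rollback Strategy in the Risk Assessment below). A sketch of how the P1 guard could be wired, combining the per-phase `#ifdef` with the `HAKMEM_TLS_TINY` environment variable named in the risk table; the exact flag wiring is an assumption, not existing code:

```c
// Sketch: rollback guard for P1 (compile-time #ifdef + runtime env switch)
#include <stdlib.h>

static int g_tls_tiny_enabled = 0;   // decided once during hak_init()

static void hak_tls_flags_init(void) {
#ifdef HAKMEM_TLS_PHASE1
    const char* e = getenv("HAKMEM_TLS_TINY");
    g_tls_tiny_enabled = (e == NULL || e[0] != '0');  // default ON, "0" disables
#endif
}

// In hak_tiny_alloc():
//   if (g_tls_tiny_enabled) { /* TLS fast path */ }
//   else                    { /* fall through to the P0 locked path */ }
```

Reading the flag once from `hak_init()` (which runs before worker threads per Task 2.1) avoids adding any synchronization to the hot path.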
## 📋 **Implementation Checklist**

### **Step 1: Documentation** (1 hour) ✅
- [ ] Task 1.1: Fix PHASE_6.14_COMPLETION_REPORT.md (15 min)
- [ ] Task 1.2: Create PHASE_6.15_PLAN.md (30 min) ← THIS DOCUMENT
- [ ] Task 1.3: Update CURRENT_TASK.md (10 min)
- [ ] Task 1.4: Update README.md if exists (5 min)
- [ ] Task 1.5: Verification (5 min)

### **Step 2: P0 Safety Lock** (2-3 hours)
- [ ] Task 2.1: Implementation (30 min)
  - [ ] Add pthread.h include
  - [ ] Add g_hakmem_lock + HAKMEM_LOCK/UNLOCK macros
  - [ ] Wrap hak_alloc_at() with lock
  - [ ] Wrap hak_free_at() with lock
- [ ] Task 2.2: Build & Smoke Test (15 min)
  - [ ] `make clean && make bench_allocators`
  - [ ] Single-threaded test (json scenario)
  - [ ] Verify: 13-15M ops/sec
- [ ] Task 2.3: Multi-threaded Validation (1 hour)
  - [ ] Test 1: larson 1T/4T (30 min)
  - [ ] Test 2: Helgrind race detection (20 min)
  - [ ] Test 3: Stability test 10 runs (10 min)
- [ ] Task 2.4: Document Results (15 min)
  - [ ] Create PHASE_6.15_P0_RESULTS.md

### **Step 3: TLS Performance** (8-10 hours)
- [ ] **P1: Tiny Pool TLS** (2 hours)
  - [ ] Add `tls_tiny_cache[]` declaration
  - [ ] Implement `hak_tiny_tls_init()`
  - [ ] Modify `hak_tiny_alloc()` (TLS fast path)
  - [ ] Modify `hak_tiny_free()` (TLS eviction)
  - [ ] Test: larson 4T → 12-15M ops/sec
  - [ ] Document: PHASE_6.15_P1_RESULTS.md
- [ ] **P2: L2 Pool TLS** (3 hours)
  - [ ] Add `tls_l2_cache[]` declaration
  - [ ] Implement TLS fast path in `hak_pool_alloc()`
  - [ ] Implement TLS refill logic
  - [ ] Implement TLS eviction logic
  - [ ] Test: larson 4T → 15-18M ops/sec
  - [ ] Document: PHASE_6.15_P2_RESULTS.md
- [ ] **P3: L2.5 Pool TLS Expansion** (3 hours)
  - [ ] Implement TLS refill in `hak_l25_pool_alloc()`
  - [ ] Implement TLS eviction in `hak_l25_pool_free()`
  - [ ] Test: larson 4T → 18-22M ops/sec
  - [ ] Document: PHASE_6.15_P3_RESULTS.md
- [ ] **Final Validation** (1 hour)
  - [ ] larson 1T/4T/16T full validation
  - [ ] Internal benchmarks (json/mir/vm)
  - [ ] Helgrind final check
  - [ ] Create PHASE_6.15_COMPLETION_REPORT.md

---

## ⚠️ **Risk Assessment**

| Phase | Risk Level | Failure Mode | Mitigation |
|-------|-----------|--------------|------------|
| **P0 (Safety Lock)** | **ZERO** | Worst case: slow but safe | N/A |
| **P1 (Tiny TLS)** | **LOW** | TLS miss overhead | Feature flag `HAKMEM_TLS_TINY` |
| **P2 (L2 TLS)** | **LOW** | Memory overhead (TLS×threads) | Monitor RSS |
| **P3 (L2.5 TLS)** | **LOW** | Existing code 50% done | Incremental |

**Rollback Strategy**:
- Every phase has `#ifdef HAKMEM_TLS_PHASEX`
- Can disable individual TLS layers if issues found
- P0 Safety Lock ensures correctness even if TLS disabled

---

## 🎯 **Success Criteria**

### **Minimum Success** (P0 only)
- ✅ 4T ≥ 13M ops/sec (safe, from 3.3M)
- ✅ Zero race conditions (Helgrind)
- ✅ 10/10 stability runs

### **Target Success** (P0 + P1 + P2)
- ✅ 4T ≥ 15M ops/sec (+355% vs 3.3M baseline)
- ✅ TLS hit rate ≥ 90%
- ✅ No single-threaded regression (≤15% overhead)

### **Stretch Goal** (All Phases)
- ✅ 4T ≥ 18M ops/sec (+445%)
- ✅ 16T ≥ 11.6M ops/sec (match system allocator)
- ✅ Scalable up to 32 threads

### **Validated** (Phase 6.13 Proof)
- ✅ **ALREADY ACHIEVED**: 4T = **15.9M ops/sec** (+381%) ✅

---

## 📊 **Expected Timeline**

### **Week 1: Foundation** (Day 1-2)
- **Day 1 AM** (1 hour): Step 1 - Documentation updates
- **Day 1 PM** (2-3 hours): Step 2 - P0 Safety Lock
- **Day 2** (2 hours): Step 3 - P1 Tiny Pool TLS

**Milestone**: 4T = 12-15M ops/sec (+264-355%)

### **Week 2: Expansion** (Day 3-5)
- **Day 3-4** (3 hours): Step 3 - P2 L2 Pool TLS
- **Day 5** (3 hours): Step 3 - P3 L2.5 Pool TLS

**Milestone**: 4T = 18-22M ops/sec (+445-567%)

### **Week 3: Validation** (Day 6)
- **Day 6** (1 hour): Final validation + completion report

**Milestone**: ✅ **Phase 6.15 Complete**

---

## 🔬 **Technical References**

### **Existing TLS Implementation**

**File**: `apps/experiments/hakmem-poc/hakmem_l25_pool.c:26`
```c
__thread L25Block* tls_l25_cache[L25_NUM_CLASSES] = {NULL};
```

**Pattern**: Per-thread cache for each size class (L1 cache hit)

### **Phase 6.13 Validation**

**File**: `apps/experiments/hakmem-poc/PHASE_6.13_INITIAL_RESULTS.md`

**Results**:
- 1-thread: 17.8M ops/sec (+123% vs system)
- 4-thread: 15.9M ops/sec (+147% vs system)
- **Proof**: TLS works and provides massive benefit

### **Thread Safety Analysis**

**File**: `apps/experiments/hakmem-poc/THREAD_SAFETY_SOLUTION.md`

**Key Insights**:
- mimalloc/jemalloc both use TLS as primary approach
- TLS hit rate: 95%+ (industry standard)
- Lock contention: 5% (only on TLS miss/refill)

---
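Combined with the cycle estimates quoted in the P1 TLS comment (≈10 cycles for a TLS hit vs. ≈50 cycles for the locked global path), these ratios give a rough per-operation cost of about 0.95 × 10 + 0.05 × 50 ≈ 12 cycles, i.e. the global lock all but vanishes from the average allocation even though it still backs every refill. This is a back-of-the-envelope estimate, not a measurement.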

---

## 📝 **Implementation Notes**

### **Why 3 Stages?**

1. **Step 1 (Docs)**: Ensure clarity on what went wrong (67.9M issue) and what's being fixed
2. **Step 2 (P0)**: Prove correctness FIRST (no crashes, no data races)
3. **Step 3 (P1-P3)**: Optimize for performance (TLS) with safety already guaranteed

### **Why Not Skip P0?**

- **Risk mitigation**: If TLS fails, we still have a working thread-safe allocator
- **Debugging**: Easier to debug TLS issues against a known-working locked baseline
- **Validation**: P0 proves the global lock pattern is correct

### **Why TLS Over Lock-free?**

- **Phase 6.14 proved**: Sequential O(N) is 2.9-13.7x faster than Random O(1) Hash
- **Implication**: A lock-free atomic hash would be SLOWER than TLS
- **Industry standard**: mimalloc/jemalloc use TLS, not lock-free structures
- **Proven**: Phase 6.13 validated +123-147% improvement with TLS

---

## 🚀 **Next Steps After Phase 6.15**

### **Phase 6.17: 16-Thread Scalability** (Optional, 4 hours)

**Current Issue**: 16T = 7.6M ops/sec (-34.8% vs system 11.6M)

**Investigation**:
1. Profile global lock contention (perf, helgrind)
2. Measure Whale cache hit rate by thread count
3. Analyze shard distribution (hash collision?)
4. Optimize TLS cache refill (batch refill to reduce global access)

**Target**: 16T ≥ 11.6M ops/sec (match or beat system)

---

## 📚 **Related Documents**

- [THREAD_SAFETY_SOLUTION.md](THREAD_SAFETY_SOLUTION.md) - Complete analysis (Option A/B/C comparison)
- [PHASE_6.13_INITIAL_RESULTS.md](PHASE_6.13_INITIAL_RESULTS.md) - TLS validation proof
- [PHASE_6.14_COMPLETION_REPORT.md](PHASE_6.14_COMPLETION_REPORT.md) - Registry toggle + thread issue discovery
- [CURRENT_TASK.md](CURRENT_TASK.md) - Overall project status

---

**Total Time Investment**: 12-13 hours
**Expected ROI**: **4-7x improvement** (3.3M → 15-22M ops/sec)
**Risk**: Low (feature flags + proven design)
**Validation**: Phase 6.13 already proves TLS works (**+147%** at 4 threads)

---

**Implementation by**: Claude + ChatGPT collaborative development
**Planning Date**: 2025-10-22
**Status**: ✅ **Ready to Execute**