# Fix Implementation Guide: Remove Unsafe Drain Operations **Date**: 2025-11-04 **Target**: Eliminate concurrent freelist corruption **Approach**: Remove Fix #1 and Fix #2, keep Fix #3, fix refill path ownership ordering --- ## Changes Required ### Change 1: Remove Fix #1 (superslab_refill Priority 1 drain) **File**: `core/hakmem_tiny_free.inc` **Lines**: 615-621 **Action**: Comment out or delete **Before**: ```c // Priority 1: Reuse slabs with freelist (already freed blocks) int tls_cap = ss_slabs_capacity(tls->ss); for (int i = 0; i < tls_cap; i++) { // BUGFIX: Drain remote frees before checking freelist (fixes FAST_CAP=0 SEGV) int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i], memory_order_acquire) != 0); if (has_remote) { ss_remote_drain_to_freelist(tls->ss, i); // ❌ REMOVE THIS } if (tls->ss->slabs[i].freelist) { // ... rest of logic } } ``` **After**: ```c // Priority 1: Reuse slabs with freelist (already freed blocks) int tls_cap = ss_slabs_capacity(tls->ss); for (int i = 0; i < tls_cap; i++) { // REMOVED: Unsafe drain without ownership check (caused concurrent freelist corruption) // Remote draining is now handled only in paths where ownership is guaranteed: // 1. Mailbox path (tiny_refill.h:100-106) - claims ownership BEFORE draining // 2. Sticky/hot/bench paths (tiny_refill.h) - claims ownership BEFORE draining if (tls->ss->slabs[i].freelist) { // ... rest of logic (unchanged) } } ``` --- ### Change 2: Remove Fix #2 (hak_tiny_alloc_superslab drain) **File**: `core/hakmem_tiny_free.inc` **Lines**: 729-767 (entire block) **Action**: Comment out or delete **Before**: ```c static inline void* hak_tiny_alloc_superslab(int class_idx) { tiny_debug_ring_record(TINY_RING_EVENT_ALLOC_ENTER, 0x01, (void*)(uintptr_t)class_idx, 0); TinyTLSSlab* tls = &g_tls_slabs[class_idx]; TinySlabMeta* meta = tls->meta; // BUGFIX: Drain ALL slabs' remote queues BEFORE any allocation attempt (fixes FAST_CAP=0 SEGV) // [... 40 lines of drain logic ...] // Fast path: Direct metadata access if (meta && meta->freelist == NULL && meta->used < meta->capacity && tls->slab_base) { // ... } ``` **After**: ```c static inline void* hak_tiny_alloc_superslab(int class_idx) { tiny_debug_ring_record(TINY_RING_EVENT_ALLOC_ENTER, 0x01, (void*)(uintptr_t)class_idx, 0); TinyTLSSlab* tls = &g_tls_slabs[class_idx]; TinySlabMeta* meta = tls->meta; // REMOVED Fix #2: Unsafe drain of ALL slabs without ownership check // This caused concurrent freelist corruption when multiple threads operated on the same SuperSlab. // Remote draining is now handled exclusively in ownership-safe paths (Mailbox, refill with bind). // Fast path: Direct metadata access (unchanged) if (meta && meta->freelist == NULL && meta->used < meta->capacity && tls->slab_base) { // ... } ``` **Specific lines to remove**: 729-767 (the entire `if (tls->ss && meta)` block with drain loop) --- ### Change 3: Fix Sticky Ring Path (claim ownership BEFORE drain) **File**: `core/tiny_refill.h` **Lines**: 46-51 **Action**: Reorder operations **Before**: ```c if (lm->freelist || has_remote) { if (!lm->freelist && has_remote) ss_remote_drain_to_freelist(last_ss, li); // ❌ Drain BEFORE ownership if (lm->freelist) { tiny_tls_bind_slab(tls, last_ss, li); ss_owner_cas(lm, tiny_self_u32()); // ← Ownership AFTER drain return last_ss; } } ``` **After**: ```c if (lm->freelist || has_remote) { // ✅ BUGFIX: Claim ownership BEFORE draining (prevents concurrent freelist modification) tiny_tls_bind_slab(tls, last_ss, li); ss_owner_cas(lm, tiny_self_u32()); // NOW safe to drain - we own the slab if (!lm->freelist && has_remote) { ss_remote_drain_to_freelist(last_ss, li); } if (lm->freelist) { return last_ss; } } ``` --- ### Change 4: Fix Hot Slot Path (claim ownership BEFORE drain) **File**: `core/tiny_refill.h` **Lines**: 64-66 **Action**: Reorder operations **Before**: ```c TinySlabMeta* m = &hss->slabs[hidx]; if (!m->freelist && atomic_load_explicit(&hss->remote_heads[hidx], memory_order_acquire) != 0) ss_remote_drain_to_freelist(hss, hidx); // ❌ Drain BEFORE ownership if (m->freelist) { tiny_tls_bind_slab(tls, hss, hidx); ss_owner_cas(m, tiny_self_u32()); // ← Ownership AFTER drain tiny_sticky_save(class_idx, hss, (uint8_t)hidx); return hss; } ``` **After**: ```c TinySlabMeta* m = &hss->slabs[hidx]; // ✅ BUGFIX: Claim ownership BEFORE draining tiny_tls_bind_slab(tls, hss, hidx); ss_owner_cas(m, tiny_self_u32()); // NOW safe to drain - we own the slab if (!m->freelist && atomic_load_explicit(&hss->remote_heads[hidx], memory_order_acquire) != 0) { ss_remote_drain_to_freelist(hss, hidx); } if (m->freelist) { tiny_sticky_save(class_idx, hss, (uint8_t)hidx); return hss; } ``` --- ### Change 5: Fix Bench Path (claim ownership BEFORE drain) **File**: `core/tiny_refill.h` **Lines**: 79-81 **Action**: Reorder operations **Before**: ```c TinySlabMeta* m = &bss->slabs[bidx]; if (!m->freelist && atomic_load_explicit(&bss->remote_heads[bidx], memory_order_acquire) != 0) ss_remote_drain_to_freelist(bss, bidx); // ❌ Drain BEFORE ownership if (m->freelist) { tiny_tls_bind_slab(tls, bss, bidx); ss_owner_cas(m, tiny_self_u32()); // ← Ownership AFTER drain tiny_sticky_save(class_idx, bss, (uint8_t)bidx); return bss; } ``` **After**: ```c TinySlabMeta* m = &bss->slabs[bidx]; // ✅ BUGFIX: Claim ownership BEFORE draining tiny_tls_bind_slab(tls, bss, bidx); ss_owner_cas(m, tiny_self_u32()); // NOW safe to drain - we own the slab if (!m->freelist && atomic_load_explicit(&bss->remote_heads[bidx], memory_order_acquire) != 0) { ss_remote_drain_to_freelist(bss, bidx); } if (m->freelist) { tiny_sticky_save(class_idx, bss, (uint8_t)bidx); return bss; } ``` --- ### Change 6: Fix mmap_gate Path (claim ownership BEFORE drain) **File**: `core/tiny_mmap_gate.h` **Lines**: 56-58 **Action**: Reorder operations **Before**: ```c TinySlabMeta* m = &cand->slabs[s]; int has_remote = (atomic_load_explicit(&cand->remote_heads[s], memory_order_acquire) != 0); if (m->freelist || has_remote) { if (!m->freelist && has_remote) ss_remote_drain_to_freelist(cand, s); // ❌ Drain BEFORE ownership if (m->freelist) { tiny_tls_bind_slab(tls, cand, s); ss_owner_cas(m, tiny_self_u32()); // ← Ownership AFTER drain return cand; } } ``` **After**: ```c TinySlabMeta* m = &cand->slabs[s]; int has_remote = (atomic_load_explicit(&cand->remote_heads[s], memory_order_acquire) != 0); if (m->freelist || has_remote) { // ✅ BUGFIX: Claim ownership BEFORE draining tiny_tls_bind_slab(tls, cand, s); ss_owner_cas(m, tiny_self_u32()); // NOW safe to drain - we own the slab if (!m->freelist && has_remote) { ss_remote_drain_to_freelist(cand, s); } if (m->freelist) { return cand; } } ``` --- ## Testing Plan ### Test 1: Baseline (Current Crashes) ```bash # Build with current code (before fixes) make clean && make -s larson_hakmem # Run repro mode (should crash around 4000 events) HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 4 ``` **Expected**: Crash at ~4000 events with `fault_addr=0x6261` --- ### Test 2: Apply Fix (Remove Fix #1 and Fix #2 ONLY) ```bash # Apply Changes 1 and 2 (comment out Fix #1 and Fix #2) # Rebuild make clean && make -s larson_hakmem # Run repro mode HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 10 ``` **Expected**: - If crashes stop → Fix #1/#2 were the main culprits ✅ - If crashes continue → Need to apply Changes 3-6 --- ### Test 3: Apply All Fixes (Changes 1-6) ```bash # Apply all changes # Rebuild make clean && make -s larson_hakmem # Run extended test HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 20 ``` **Expected**: NO crashes, stable execution for full 20 seconds --- ### Test 4: Guard Mode (Maximum Stress) ```bash # Rebuild with all fixes make clean && make -s larson_hakmem # Run guard mode (stricter checks) HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh guard 30 20 ``` **Expected**: NO crashes, reaches 30+ seconds --- ## Verification Checklist After applying fixes, verify: - [ ] Fix #1 code (hakmem_tiny_free.inc:615-621) commented out or deleted - [ ] Fix #2 code (hakmem_tiny_free.inc:729-767) commented out or deleted - [ ] Fix #3 (tiny_refill.h:100-106) unchanged (already correct) - [ ] Sticky path (tiny_refill.h:46-51) reordered: ownership BEFORE drain - [ ] Hot slot path (tiny_refill.h:64-66) reordered: ownership BEFORE drain - [ ] Bench path (tiny_refill.h:79-81) reordered: ownership BEFORE drain - [ ] mmap_gate path (tiny_mmap_gate.h:56-58) reordered: ownership BEFORE drain - [ ] All changes compile without errors - [ ] Benchmark runs without crashes for 30+ seconds --- ## Expected Results ### Before Fixes | Test | Duration | Events | Result | |------|----------|--------|--------| | repro mode | ~4 sec | ~4012 | ❌ CRASH at fault_addr=0x6261 | | guard mode | ~2 sec | ~2137 | ❌ CRASH at fault_addr=0x6261 | ### After Fixes (Changes 1-2 only) | Test | Duration | Events | Result | |------|----------|--------|--------| | repro mode | 10+ sec | 10000+ | ✅ NO CRASH or ⚠️ occasional crash | | guard mode | 10+ sec | 10000+ | ✅ NO CRASH or ⚠️ occasional crash | ### After All Fixes (Changes 1-6) | Test | Duration | Events | Result | |------|----------|--------|--------| | repro mode | 20+ sec | 20000+ | ✅ NO CRASH | | guard mode | 30+ sec | 30000+ | ✅ NO CRASH | --- ## Rollback Plan If fixes cause new issues: 1. **Revert Changes 3-6** (keep Changes 1-2): - Restore original sticky/hot/bench/mmap_gate paths - This removes Fix #1/#2 but keeps old refill ordering - Test again 2. **Revert All Changes**: ```bash git checkout core/hakmem_tiny_free.inc git checkout core/tiny_refill.h git checkout core/tiny_mmap_gate.h make clean && make ``` 3. **Try Alternative**: Option B from ULTRATHINK_ANALYSIS.md (add ownership checks instead of removing) --- ## Additional Debugging (If Crashes Persist) If crashes continue after all fixes: 1. **Enable ownership assertion**: ```c // In hakmem_tiny_superslab.h:345, add at top of ss_remote_drain_to_freelist: #ifdef HAKMEM_DEBUG_OWNERSHIP TinySlabMeta* m = &ss->slabs[slab_idx]; uint32_t owner = m->owner_tid; uint32_t self = tiny_self_u32(); if (owner != 0 && owner != self) { fprintf(stderr, "[OWNERSHIP ERROR] Thread %u draining slab %d owned by %u!\n", self, slab_idx, owner); abort(); } #endif ``` 2. **Rebuild with debug flag**: ```bash make clean CFLAGS="-DHAKMEM_DEBUG_OWNERSHIP=1" make -s larson_hakmem HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 10 ``` 3. **Check for other unsafe drain sites**: ```bash grep -n "ss_remote_drain_to_freelist" core/*.{c,inc,h} | grep -v "^//" ``` --- **END OF IMPLEMENTATION GUIDE**