# Ultra-Deep Analysis: Remaining Bugs in Remote Drain System

**Date**: 2025-11-04
**Status**: 🔴 **CRITICAL RACE CONDITION IDENTIFIED**
**Scope**: Multi-threaded freelist corruption via concurrent `ss_remote_drain_to_freelist()` calls

---

## Executive Summary

**Root Cause Found**: Concurrent draining of the same slab from multiple threads WITHOUT ownership synchronization.

The crash at `fault_addr=0x6261` is caused by freelist chain corruption when multiple threads simultaneously call `ss_remote_drain_to_freelist()` on the same slab without exclusive ownership. The truncated pointer (0x6261) is a symptom of concurrent modification of the freelist links.

**Impact**:
- Fix #1, Fix #2, and multiple paths in `tiny_refill.h` all drain without ownership
- ANY two threads operating on the same slab can race and corrupt the freelist
- Explains why crashes still occur after 4012 events (the race is timing-dependent)

---

## 1. The Freelist Corruption Mechanism

### 1.1 How `ss_remote_drain_to_freelist()` Works

```c
// hakmem_tiny_superslab.h:345-365
static inline void ss_remote_drain_to_freelist(SuperSlab* ss, int slab_idx) {
    _Atomic(uintptr_t)* head = &ss->remote_heads[slab_idx];
    uintptr_t p = atomic_exchange_explicit(head, (uintptr_t)NULL, memory_order_acq_rel);
    if (p == 0) return;

    TinySlabMeta* meta = &ss->slabs[slab_idx];
    uint32_t drained = 0;
    while (p != 0) {
        void* node = (void*)p;
        uintptr_t next = (uintptr_t)(*(void**)node); // ← Read next pointer
        *(void**)node = meta->freelist;              // ← CRITICAL: Write freelist pointer
        meta->freelist = node;                       // ← CRITICAL: Update freelist head
        p = next;
        drained++;
    }

    // Reset remote count after full drain
    atomic_store_explicit(&ss->remote_counts[slab_idx], 0u, memory_order_relaxed);
}
```

**KEY OBSERVATION**: The while loop modifies `meta->freelist` **WITHOUT any atomic protection**.
### 1.2 Race Condition Scenario

**Setup**:
- Slab 4 of SuperSlab X has `remote_heads[4] != 0` (pending remote frees)
- Thread A (T1) and Thread B (T2) both want to drain slab 4
- Neither thread owns slab 4

**Timeline**:

| Time | Thread A (Fix #2 path) | Thread B (Sticky refill path) | Result |
|------|------------------------|-------------------------------|--------|
| T0 | Enters `hak_tiny_alloc_superslab()` | Enters `tiny_refill_try_fast()` sticky ring | |
| T1 | Loops through all slabs, reaches i=4 | Finds slab 4 in sticky ring | |
| T2 | Sees `remote_heads[4] != 0` | Sees `has_remote != 0` | |
| T3 | Calls `ss_remote_drain_to_freelist(ss, 4)` | Calls `ss_remote_drain_to_freelist(ss, 4)` | **RACE!** |
| T4 | `atomic_exchange(&remote_heads[4], NULL)` → gets list A | `atomic_exchange(&remote_heads[4], NULL)` → gets NULL | T2 returns early (p==0) |
| T5 | Enters while loop, modifies `meta->freelist` | - | Safe (only T1 draining) |

Even if T2 enters the drain **before** T1 completes the atomic_exchange, the outcome is the same:

| Time | Thread A | Thread B | Result |
|------|----------|----------|--------|
| T3 | Calls `ss_remote_drain_to_freelist(ss, 4)` | Calls `ss_remote_drain_to_freelist(ss, 4)` | Concurrent entry |
| T4 | `p = atomic_exchange(&remote_heads[4], NULL)` → gets list A | `p = atomic_exchange(&remote_heads[4], NULL)` → gets NULL | T2 exits safely |
| T5 | `while (p != 0)` - starts draining | - | Only T1 draining |

**HOWEVER**, the real race is **not** in the atomic_exchange (which is atomic), but in the **while loop**:

**Actual Race** (Fix #1 vs Fix #3):

| Time | Thread A (Fix #1: `superslab_refill`) | Thread B (Fix #3: Mailbox path) | Result |
|------|----------------------------------------|----------------------------------|--------|
| T0 | Enters `superslab_refill()` for class 4 | Enters `tiny_refill_try_fast()` Mailbox path | |
| T1 | Reaches Priority 1 loop (line 614-621) | Fetches slab entry from mailbox | |
| T2 | Iterates i=0..tls_cap-1, reaches i=5 | Validates slab 5 | |
| T3 | Sees `remote_heads[5] != 0` | Calls `tiny_tls_bind_slab(tls, mss, 5)` | |
| T4 | Calls `ss_remote_drain_to_freelist(ss, 5)` | Calls `ss_owner_cas(m, self)` - claims ownership | |
| T5 | `p = atomic_exchange(&remote_heads[5], NULL)` → gets list A | Sees `remote_heads[5] != 0` (race!) | **BOTH see remote != 0** |
| T6 | Enters while loop: `next = *(void**)node` | Calls `ss_remote_drain_to_freelist(mss, 5)` | |
| T7 | `*(void**)node = meta->freelist` | `p = atomic_exchange(&remote_heads[5], NULL)` → gets NULL | T2 returns (p==0) |
| T8 | `meta->freelist = node` | - | Only T1 draining now |

**Wait, this scenario is also safe!** The atomic_exchange ensures only ONE thread gets the remote list.

### 1.3 The REAL Race: Concurrent Modification of `meta->freelist`

The actual problem is **not** in the atomic_exchange, but in the broken assumption that only the owner thread modifies `meta->freelist`.

**The Bug**: Fix #1 and Fix #2 drain slabs that might be **owned by another thread**.

**Scenario**:

| Time | Thread A (Owner of slab 5) | Thread B (Fix #2: drains ALL slabs) | Result |
|------|----------------------------|--------------------------------------|--------|
| T0 | Owns slab 5, allocating from freelist | Enters `hak_tiny_alloc_superslab()` for class X | |
| T1 | Reads `ptr = meta->freelist` | Loops through ALL slabs, reaches i=5 | |
| T2 | Reads `next_ptr = *(void**)ptr` (pop) | Sees `remote_heads[5] != 0` | |
| T3 | - | Calls `ss_remote_drain_to_freelist(ss, 5)` | **No ownership check!** |
| T4 | - | `p = atomic_exchange(&remote_heads[5], NULL)` → gets list | |
| T5 | **Writes**: `meta->freelist = next_ptr` | **Reads**: `old_head = meta->freelist` | **RACE on meta->freelist!** |
| T6 | - | **Writes**: `*(void**)node = old_head` | |
| T7 | - | **Writes**: `meta->freelist = node` | **Freelist corruption!** |

**Result**:
- Thread A's write to `meta->freelist` at T5 is **overwritten** by Thread B at T7
- Thread A's popped pointer is **lost** from the freelist
- Or worse: a partial/torn write, leaving a truncated pointer (0x6261)

---

## 2. All Unsafe Call Sites

### 2.1 Category: UNSAFE (No Ownership Check Before Drain)

| File | Line | Context | Path | Risk |
|------|------|---------|------|------|
| `hakmem_tiny_free.inc` | 620 | **Fix #1** `superslab_refill()` Priority 1 | Alloc slow path | 🔴 **HIGH** |
| `hakmem_tiny_free.inc` | 756 | **Fix #2** `hak_tiny_alloc_superslab()` | Alloc fast path | 🔴 **HIGH** |
| `tiny_refill.h` | 47 | Sticky ring refill | Alloc refill path | 🟡 **MEDIUM** |
| `tiny_refill.h` | 65 | Hot slot refill | Alloc refill path | 🟡 **MEDIUM** |
| `tiny_refill.h` | 80 | Bench refill | Alloc refill path | 🟡 **MEDIUM** |
| `tiny_mmap_gate.h` | 57 | mmap gate sweep | Alloc refill path | 🟡 **MEDIUM** |
| `hakmem_tiny_superslab.h` | 376 | `ss_remote_drain_light()` | Background drain | 🟠 **LOW** (unused?) |
| `hakmem_tiny.c` | 652 | Old drain path | Legacy code | 🟠 **LOW** (unused?) |

### 2.2 Category: SAFE (Ownership Claimed BEFORE Drain)

| File | Line | Context | Protection |
|------|------|---------|-----------|
| `tiny_refill.h` | 100-105 | **Fix #3** Mailbox path | ✅ `tiny_tls_bind_slab()` + `ss_owner_cas()` BEFORE drain |

### 2.3 Category: PROBABLY SAFE (Special Cases)

| File | Line | Context | Why Safe? |
|------|------|---------|-----------|
| `hakmem_tiny_free.inc` | 592 | `superslab_refill()` adopt path | Just adopted; concurrent access unlikely |

---
## 3. Why Fix #3 is Correct (and Others Are Not)

### 3.1 Fix #3: Mailbox Path (CORRECT)

```c
// tiny_refill.h:96-106
// BUGFIX: Claim ownership BEFORE draining remote queue (fixes FAST_CAP=0 SEGV)
tiny_tls_bind_slab(tls, mss, midx);   // Bind to TLS
ss_owner_cas(m, tiny_self_u32());     // ✅ CLAIM OWNERSHIP FIRST

// NOW safe to drain - we're the owner
if (atomic_load_explicit(&mss->remote_heads[midx], memory_order_acquire) != 0) {
    ss_remote_drain_to_freelist(mss, midx); // ✅ Safe: we own the slab
}
```

**Why this works**:
- `ss_owner_cas()` sets `m->owner_tid = self` (line 385-386 of hakmem_tiny_superslab.h)
- Only the owner thread should modify `meta->freelist` directly
- Other threads must use `ss_remote_push()` to add to the remote queue
- By claiming ownership BEFORE draining, we ensure exclusive access to `meta->freelist`

### 3.2 Fix #1 and Fix #2 (INCORRECT)

```c
// hakmem_tiny_free.inc:614-621 (Fix #1)
for (int i = 0; i < tls_cap; i++) {
    int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i],
                                           memory_order_acquire) != 0);
    if (has_remote) {
        ss_remote_drain_to_freelist(tls->ss, i); // ❌ NO OWNERSHIP CHECK!
    }
}
```

```c
// hakmem_tiny_free.inc:749-757 (Fix #2)
for (int i = 0; i < tls_cap; i++) {
    uintptr_t remote_val = atomic_load_explicit(&tls->ss->remote_heads[i],
                                                memory_order_acquire);
    if (remote_val != 0) {
        ss_remote_drain_to_freelist(tls->ss, i); // ❌ NO OWNERSHIP CHECK!
    }
}
```

**Why this is broken**:
- Drains ALL slabs in the SuperSlab (i=0..tls_cap-1)
- Does NOT check `m->owner_tid` before draining
- Can drain slabs owned by OTHER threads
- Concurrent modification of `meta->freelist` → corruption

### 3.3 Other Unsafe Paths

**Sticky Ring** (tiny_refill.h:47):
```c
if (!lm->freelist && has_remote)
    ss_remote_drain_to_freelist(last_ss, li); // ❌ Drain BEFORE ownership
if (lm->freelist) {
    tiny_tls_bind_slab(tls, last_ss, li);
    ss_owner_cas(lm, tiny_self_u32());        // ← Ownership AFTER drain
    return last_ss;
}
```

**Hot Slot** (tiny_refill.h:65):
```c
if (!m->freelist &&
    atomic_load_explicit(&hss->remote_heads[hidx], memory_order_acquire) != 0)
    ss_remote_drain_to_freelist(hss, hidx);   // ❌ Drain BEFORE ownership
if (m->freelist) {
    tiny_tls_bind_slab(tls, hss, hidx);
    ss_owner_cas(m, tiny_self_u32());         // ← Ownership AFTER drain
```

**Same pattern**: drain first, claim ownership later → race window!

---

## 4. Explaining the `fault_addr=0x6261` Pattern

### 4.1 Observed Pattern

```
rip=0x00005e3b94a28ece fault_addr=0x0000000000006261
```

Previous analysis found pointers like `0x7a1ad5a06261` → truncated to `0x6261` (lower 16 bits).

### 4.2 Probable Cause: Partial Write During Race

**Scenario**:
1. Thread A reads `ptr = meta->freelist` → `0x7a1ad5a06261`
2. Thread B concurrently drains and modifies `meta->freelist`
3. Thread A dereferences `ptr`, but the pointer was partially overwritten
4. Result: segmentation fault at `0x6261` (incomplete pointer)

Other possible contributors:
- CPU store buffer reordering
- A non-atomic 64-bit write torn on some architectures
- Cache coherency effects

**Bottom line**: Concurrent writes to `meta->freelist` without synchronization → undefined behavior.

---

## 5. Recommended Fixes

### 5.1 Option A: Remove Fix #1 and Fix #2 (SAFEST)

**Rationale**:
- Fix #3 (Mailbox) already drains safely with ownership
- Fix #1 and Fix #2 are redundant AND unsafe
- The sticky/hot/bench paths need fixing separately

**Changes**:
1. **Delete Fix #1** (hakmem_tiny_free.inc:615-621):

```c
// REMOVE THIS LOOP:
for (int i = 0; i < tls_cap; i++) {
    int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i],
                                           memory_order_acquire) != 0);
    if (has_remote) {
        ss_remote_drain_to_freelist(tls->ss, i);
    }
}
```

2. **Delete Fix #2** (hakmem_tiny_free.inc:729-767):

```c
// REMOVE THIS ENTIRE BLOCK (lines 729-767)
```

3. **Keep Fix #3** (tiny_refill.h:96-106) - it's correct!

**Expected Impact**:
- Eliminates the main source of concurrent drain races
- May still crash if the sticky/hot/bench paths race with each other
- But the frequency should drop dramatically

### 5.2 Option B: Add Ownership Check to Fix #1 and Fix #2

**Changes**:
```c
// Fix #1: hakmem_tiny_free.inc:615-621
for (int i = 0; i < tls_cap; i++) {
    TinySlabMeta* m = &tls->ss->slabs[i];
    // ONLY drain if we own this slab
    if (m->owner_tid == tiny_self_u32()) {
        int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i],
                                               memory_order_acquire) != 0);
        if (has_remote) {
            ss_remote_drain_to_freelist(tls->ss, i);
        }
    }
}
```

**Problem**:
- Still racy! `owner_tid` can change between the check and the drain
- Needs a proper lock or ownership-transfer protocol
- More complex and error-prone

### 5.3 Option C: Fix Sticky/Hot/Bench Paths (CORRECT ORDER)

**Changes**:
```c
// Sticky ring (tiny_refill.h:46-51)
if (lm->freelist || has_remote) {
    // ✅ Claim ownership FIRST
    tiny_tls_bind_slab(tls, last_ss, li);
    ss_owner_cas(lm, tiny_self_u32());
    // NOW safe to drain
    if (!lm->freelist && has_remote) {
        ss_remote_drain_to_freelist(last_ss, li);
    }
    if (lm->freelist) {
        return last_ss;
    }
}
```

Apply the same pattern to the hot slot (line 65) and bench (line 80) paths.

### 5.4 RECOMMENDED: Combine Option A + Option C

1. **Remove Fix #1 and Fix #2** (eliminate the main race sources)
2. **Fix the sticky/hot/bench paths** (claim ownership before drain)
3. **Keep Fix #3** (already correct)

**Verification**:
```bash
# After applying fixes, rebuild and test
make clean && make -s larson_hakmem HAKMEM_TINY_SS_ADOPT=1
scripts/run_larson_claude.sh repro 30 10
# Expected: NO crashes, or at least far fewer
```

---

## 6. Next Steps

### 6.1 Immediate Actions

1. **Apply Option A**: Remove Fix #1 and Fix #2
   - Comment out lines 615-621 in hakmem_tiny_free.inc
   - Comment out lines 729-767 in hakmem_tiny_free.inc
   - Rebuild and test
2. **Interpret the results**:
   - If crashes stop → Fix #1/#2 were the main culprits
   - If crashes continue → the sticky/hot/bench paths need fixing (Option C)
3. **Apply Option C** (if needed):
   - Modify tiny_refill.h lines 46-51, 64-66, 78-81
   - Claim ownership BEFORE draining
   - Rebuild and test

### 6.2 Long-Term Improvements

1. **Add an Ownership Assertion**:

```c
static inline void ss_remote_drain_to_freelist(SuperSlab* ss, int slab_idx) {
#ifdef HAKMEM_DEBUG_OWNERSHIP
    TinySlabMeta* m = &ss->slabs[slab_idx];
    uint32_t owner = m->owner_tid;
    uint32_t self  = tiny_self_u32();
    if (owner != 0 && owner != self) {
        fprintf(stderr, "[OWNERSHIP ERROR] Thread %u draining slab owned by %u!\n",
                self, owner);
        abort();
    }
#endif
    // ... rest of function
}
```

2. **Add Debug Counters**:
   - Count concurrent drain attempts
   - Track ownership violations
   - Dump statistics on crash
3. **Consider a Lock-Free Alternative**:
   - Use CAS-based freelist updates
   - Or: don't drain at all; CAS-pop from the remote queue directly
   - Or: an ownership-transfer protocol (expensive)

---

## 7. Conclusion

**Root Cause**: Concurrent `ss_remote_drain_to_freelist()` calls without exclusive ownership.

**Main Culprits**: Fix #1 and Fix #2 drain all slabs without ownership checks.

**Secondary Issues**: The sticky/hot/bench paths drain before claiming ownership.

**Solution**: Remove Fix #1/#2, fix the sticky/hot/bench ordering, keep Fix #3.
**Confidence**: 🟢 **HIGH** - This explains all observed symptoms:
- Crashes at `fault_addr=0x6261` (freelist corruption)
- Timing-dependent failures (race condition)
- Improvement from Fix #3 (correct ownership protocol)
- Remaining crashes (Fix #1/#2 still racing)

---

**END OF ULTRA-DEEP ANALYSIS**