# Ultra-Deep Analysis Summary: Root Cause Found

**Date**: 2025-11-04

**Status**: 🎯 **ROOT CAUSE IDENTIFIED**

---

## TL;DR

**The Bug**: Fix #1 and Fix #2 drain slabs WITHOUT checking ownership, causing concurrent modification of `meta->freelist` when multiple threads operate on the same SuperSlab.

**The Fix**: Remove Fix #1 and Fix #2, reorder sticky/hot/bench paths to claim ownership BEFORE draining.

**Confidence**: 🟢 **95%** - Explains all symptoms: crashes at 0x6261, timing-dependent failures, partial improvements from Fix #3.

---

## The Race Condition

### What Fix #1 and Fix #2 Do (WRONG)

```c
// Fix #1 (superslab_refill) and Fix #2 (hak_tiny_alloc_superslab)
for (int i = 0; i < tls_cap; i++) {  // Loop through ALL slabs
    if (remote_heads[i] != 0) {
        ss_remote_drain_to_freelist(ss, i);  // ❌ NO ownership check!
    }
}
```

**Problem**: Drains ALL slabs in the SuperSlab, including slabs **owned by other threads**.

### The Race

| Thread A (owns slab 5) | Thread B (Fix #2, no ownership) |
|------------------------|----------------------------------|
| `ptr = meta->freelist` | Loops through all slabs, i=5 |
| `meta->freelist = *(void**)ptr` | Calls `ss_remote_drain_to_freelist(ss, 5)` |
| (allocating from freelist) | `node_next = meta->freelist` ← **RACE!** |
| | `meta->freelist = node` ← **Overwrites A's update!** |

**Result**: Freelist corruption, crash at `fault_addr=0x6261` (truncated pointer).

---

## Why Fix #3 is Correct

```c
// Fix #3 (Mailbox path in tiny_refill.h)
tiny_tls_bind_slab(tls, mss, midx);  // Bind to TLS
ss_owner_cas(m, tiny_self_u32());    // ✅ CLAIM OWNERSHIP FIRST

// NOW safe to drain - we're the owner
if (remote_heads[midx] != 0) {
    ss_remote_drain_to_freelist(mss, midx);  // ✅ Safe: we own it
}
```

**Key difference**: Claims ownership (`owner_tid = self`) BEFORE draining.

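The ownership-before-drain rule can be illustrated in isolation. The sketch below is not the hakmem implementation: `demo_slab`, `demo_remote_push`, `demo_try_own`, and `demo_drain_if_owner` are hypothetical stand-ins, and it assumes that ownership is a thread ID claimed by compare-and-swap (0 meaning unowned), that remote frees go onto an atomic stack, and that `freelist` is a plain pointer only the owner may touch. The real `ss_owner_cas` and `ss_remote_drain_to_freelist` may differ in both signature and semantics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for per-slab metadata. */
typedef struct demo_slab {
    _Atomic uint32_t owner_tid;   /* 0 means unowned (assumption) */
    _Atomic(void *)  remote_head; /* lock-free stack of remotely freed blocks */
    void            *freelist;    /* non-atomic: only the owner may touch it */
} demo_slab;

/* Remote free: any thread may push a block onto the remote stack. */
static void demo_remote_push(demo_slab *s, void *block) {
    void *old = atomic_load_explicit(&s->remote_head, memory_order_relaxed);
    do {
        *(void **)block = old; /* link the block to the current head */
    } while (!atomic_compare_exchange_weak_explicit(
        &s->remote_head, &old, block,
        memory_order_release, memory_order_relaxed));
}

/* Claim the slab; succeeds if it was unowned or is already ours. */
static bool demo_try_own(demo_slab *s, uint32_t self) {
    uint32_t expected = 0;
    if (atomic_compare_exchange_strong_explicit(
            &s->owner_tid, &expected, self,
            memory_order_acq_rel, memory_order_acquire)) {
        return true;
    }
    return expected == self; /* on failure, expected holds the current owner */
}

/* Drain the remote stack into the freelist, but only as the owner.
 * Returns the number of blocks moved, or -1 if another thread owns the slab. */
static int demo_drain_if_owner(demo_slab *s, uint32_t self) {
    if (!demo_try_own(s, self)) {
        return -1; /* not ours: leave freelist untouched */
    }
    /* Detach the whole remote stack in one atomic step. */
    void *node = atomic_exchange_explicit(&s->remote_head, NULL,
                                          memory_order_acquire);
    int moved = 0;
    while (node != NULL) {
        void *next = *(void **)node;
        *(void **)node = s->freelist; /* safe: no other thread writes freelist */
        s->freelist = node;
        node = next;
        moved++;
    }
    return moved;
}
```

In this sketch a thread that loses the ownership check gets `-1` and never reads or writes `freelist`, which is the property the Fix #1 and Fix #2 loops lack. The remote stack itself stays safe to push to from any thread because it is only ever modified through atomic operations.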

---

## All Unsafe Call Sites

| Location | Path / Fix | Risk | Solution |
|----------|------------|------|----------|
| `hakmem_tiny_free.inc:620` | **Fix #1** | 🔴 HIGH | ❌ DELETE |
| `hakmem_tiny_free.inc:756` | **Fix #2** | 🔴 HIGH | ❌ DELETE |
| `tiny_refill.h:47` | Sticky | 🟡 MEDIUM | ✅ Reorder: ownership → drain |
| `tiny_refill.h:65` | Hot slot | 🟡 MEDIUM | ✅ Reorder: ownership → drain |
| `tiny_refill.h:80` | Bench | 🟡 MEDIUM | ✅ Reorder: ownership → drain |
| `tiny_mmap_gate.h:57` | mmap_gate | 🟡 MEDIUM | ✅ Reorder: ownership → drain |
| `tiny_refill.h:105` | **Fix #3** | ✅ SAFE | ✅ Keep as-is |

---

## The Fix (3 Steps)

### Step 1: Remove Fix #1 (Priority: HIGH)

**File**: `core/hakmem_tiny_free.inc`

**Lines**: 615-621

Comment out this block:

```c
// UNSAFE: Drains all slabs without ownership check
for (int i = 0; i < tls_cap; i++) {
    int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i],
                                           memory_order_acquire) != 0);
    if (has_remote) {
        ss_remote_drain_to_freelist(tls->ss, i);  // ❌ DELETE
    }
}
```

### Step 2: Remove Fix #2 (Priority: HIGH)

**File**: `core/hakmem_tiny_free.inc`

**Lines**: 729-767 (entire block)

Comment out the entire Fix #2 block (about 40 lines, starting with "BUGFIX: Drain ALL slabs...").


### Step 3: Fix Refill Paths (Priority: MEDIUM)

**Files**: `core/tiny_refill.h`, `core/tiny_mmap_gate.h`

**Pattern** (apply to sticky/hot/bench/mmap_gate):

```c
// BEFORE (WRONG):
if (!m->freelist && has_remote) ss_remote_drain_to_freelist(ss, idx);  // ❌ Drain first
if (m->freelist) {
    tiny_tls_bind_slab(tls, ss, idx);  // ← Ownership after
    ss_owner_cas(m, self);
    return ss;
}

// AFTER (CORRECT):
tiny_tls_bind_slab(tls, ss, idx);  // ✅ Ownership first
ss_owner_cas(m, self);
if (!m->freelist && has_remote) {
    ss_remote_drain_to_freelist(ss, idx);  // ← Drain after
}
if (m->freelist) {
    return ss;
}
```

---

## Test Plan

### Test 1: Remove Fix #1 and Fix #2 Only

```bash
# Apply Step 1 and Step 2 (comment out Fix #1 and Fix #2)
make clean && make -s larson_hakmem
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 10
```

**Expected**:

- ✅ **If crashes stop**: Fix #1/#2 were the main culprits (DONE!)
- ⚠️ **If crashes continue**: Need Step 3 (refill path fixes)

### Test 2: Apply All Fixes (Step 1-3)

```bash
# Apply all fixes
make clean && make -s larson_hakmem
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 20
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh guard 30 20
```

**Expected**: NO crashes, stable for 20+ seconds.

---

## Why This Explains Everything

1. **Crashes at `fault_addr=0x6261`**: Freelist corruption from concurrent writes
2. **Timing-dependent**: Race depends on thread scheduling
3. **Improvement from 500 → 4012 events**: Fix #3 reduced races, but Fix #1/#2 still race
4. **Guard mode vs repro mode**: Different timing → different race frequency

---

## Detailed Documentation

- **Full Analysis**: `/mnt/workdisk/public_share/hakmem/ULTRATHINK_ANALYSIS.md`
- **Implementation Guide**: `/mnt/workdisk/public_share/hakmem/FIX_IMPLEMENTATION_GUIDE.md`
- **This Summary**: `/mnt/workdisk/public_share/hakmem/ULTRATHINK_SUMMARY.md`

---

## Next Action

1. Apply **Step 1 and Step 2** (remove Fix #1 and Fix #2)
2. Rebuild and test (repro mode, 30 threads, 10 seconds)
3. If crashes persist, apply **Step 3** (fix refill paths)
4. Report results

**Estimated time**: 15 minutes to apply fixes + 5 minutes testing = 20 minutes total.

---

**END OF SUMMARY**