hakmem/docs/analysis/ULTRATHINK_ANALYSIS.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (comparable, within run-to-run variance; fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 13:14:18 +09:00


# Ultra-Deep Analysis: Remaining Bugs in Remote Drain System
**Date**: 2025-11-04
**Status**: 🔴 **CRITICAL RACE CONDITION IDENTIFIED**
**Scope**: Multi-threaded freelist corruption via concurrent `ss_remote_drain_to_freelist()` calls
---
## Executive Summary
**Root Cause Found**: **Concurrent draining of the same slab from multiple threads WITHOUT ownership synchronization**
The crash at `fault_addr=0x6261` is caused by freelist chain corruption when multiple threads simultaneously call `ss_remote_drain_to_freelist()` on the same slab without exclusive ownership. The pointer truncation (0x6261) is a symptom of concurrent modification to the freelist links.
**Impact**:
- Fix #1, Fix #2, and multiple paths in `tiny_refill.h` all drain without ownership
- ANY two threads operating on the same slab can race and corrupt the freelist
- Explains why crashes still occur after 4012 events (race is timing-dependent)
---
## 1. The Freelist Corruption Mechanism
### 1.1 How `ss_remote_drain_to_freelist()` Works
```c
// hakmem_tiny_superslab.h:345-365
static inline void ss_remote_drain_to_freelist(SuperSlab* ss, int slab_idx) {
    _Atomic(uintptr_t)* head = &ss->remote_heads[slab_idx];
    uintptr_t p = atomic_exchange_explicit(head, (uintptr_t)NULL, memory_order_acq_rel);
    if (p == 0) return;
    TinySlabMeta* meta = &ss->slabs[slab_idx];
    uint32_t drained = 0;
    while (p != 0) {
        void* node = (void*)p;
        uintptr_t next = (uintptr_t)(*(void**)node); // ← Read next pointer
        *(void**)node = meta->freelist;              // ← CRITICAL: Write freelist pointer
        meta->freelist = node;                       // ← CRITICAL: Update freelist head
        p = next;
        drained++;
    }
    // Reset remote count after full drain
    atomic_store_explicit(&ss->remote_counts[slab_idx], 0u, memory_order_relaxed);
}
```
**KEY OBSERVATION**: The while loop modifies `meta->freelist` **WITHOUT any atomic protection**.
### 1.2 Race Condition Scenario
**Setup**:
- Slab 4 of SuperSlab X has `remote_heads[4] != 0` (pending remote frees)
- Thread A (T1) and Thread B (T2) both want to drain slab 4
- Neither thread owns slab 4
**Timeline**:
| Time | Thread A (Fix #2 path) | Thread B (Sticky refill path) | Result |
|------|------------------------|-------------------------------|--------|
| T0 | Enters `hak_tiny_alloc_superslab()` | Enters `tiny_refill_try_fast()` sticky ring | |
| T1 | Loops through all slabs, reaches i=4 | Finds slab 4 in sticky ring | |
| T2 | Sees `remote_heads[4] != 0` | Sees `has_remote != 0` | |
| T3 | Calls `ss_remote_drain_to_freelist(ss, 4)` | Calls `ss_remote_drain_to_freelist(ss, 4)` | **RACE!** |
| T4 | `atomic_exchange(&remote_heads[4], NULL)` → gets list A | `atomic_exchange(&remote_heads[4], NULL)` → gets NULL | T2 returns early (p==0) |
| T5 | Enters while loop, modifies `meta->freelist` | - | Safe (only T1 draining) |
**BUT**, if T2 enters the drain **BEFORE** T1 completes the atomic_exchange:
| Time | Thread A | Thread B | Result |
|------|----------|----------|--------|
| T3 | Calls `ss_remote_drain_to_freelist(ss, 4)` | Calls `ss_remote_drain_to_freelist(ss, 4)` | **RACE!** |
| T4 | `p = atomic_exchange(&remote_heads[4], NULL)` → gets list A | `p = atomic_exchange(&remote_heads[4], NULL)` → gets NULL | T2 safe exit |
| T5 | `while (p != 0)` - starts draining | - | Only T1 draining |
**HOWEVER**, the REAL race is **NOT** in the atomic_exchange (which is atomic), but in the **while loop**:
**Actual Race** (Fix #1 vs Fix #3):
| Time | Thread A (Fix #1: `superslab_refill`) | Thread B (Fix #3: Mailbox path) | Result |
|------|----------------------------------------|----------------------------------|--------|
| T0 | Enters `superslab_refill()` for class 4 | Enters `tiny_refill_try_fast()` Mailbox path | |
| T1 | Reaches Priority 1 loop (line 614-621) | Fetches slab entry from mailbox | |
| T2 | Iterates i=0..tls_cap-1, reaches i=5 | Validates slab 5 | |
| T3 | Sees `remote_heads[5] != 0` | Calls `tiny_tls_bind_slab(tls, mss, 5)` | |
| T4 | Calls `ss_remote_drain_to_freelist(ss, 5)` | Calls `ss_owner_cas(m, self)` - Claims ownership | |
| T5 | `p = atomic_exchange(&remote_heads[5], NULL)` → gets list A | Sees `remote_heads[5] != 0` (race!) | **BOTH see remote!=0** |
| T6 | Enters while loop: `next = *(void**)node` | Calls `ss_remote_drain_to_freelist(mss, 5)` | |
| T7 | `*(void**)node = meta->freelist` | `p = atomic_exchange(&remote_heads[5], NULL)` → gets NULL | T2 returns (p==0) |
| T8 | `meta->freelist = node` | - | Only T1 draining now |
**Wait, this scenario is also safe!** The atomic_exchange ensures only ONE thread gets the remote list.
### 1.3 The REAL Race: Concurrent Modification of `meta->freelist`
The actual problem is **NOT** in the atomic_exchange, but in the assumption that only the owner thread should modify `meta->freelist`.
**The Bug**: Fix #1 and Fix #2 drain slabs that might be **owned by another thread**.
**Scenario**:
| Time | Thread A (Owner of slab 5) | Thread B (Fix #2: drains ALL slabs) | Result |
|------|----------------------------|--------------------------------------|--------|
| T0 | Owns slab 5, allocating from freelist | Enters `hak_tiny_alloc_superslab()` for class X | |
| T1 | Reads `ptr = meta->freelist` | Loops through ALL slabs, reaches i=5 | |
| T2 | Reads `meta->freelist = *(void**)ptr` (pop) | Sees `remote_heads[5] != 0` | |
| T3 | - | Calls `ss_remote_drain_to_freelist(ss, 5)` | **NO ownership check!** |
| T4 | - | `p = atomic_exchange(&remote_heads[5], NULL)` → gets list | |
| T5 | **Writes**: `meta->freelist = next_ptr` | **Reads**: `old_head = meta->freelist` | **RACE on meta->freelist!** |
| T6 | - | **Writes**: `*(void**)node = old_head` | |
| T7 | - | **Writes**: `meta->freelist = node` | **Freelist corruption!** |
**Result**:
- Thread A's write to `meta->freelist` at T5 is **overwritten** by Thread B at T7
- Thread A's popped pointer is **lost** from the freelist
- Or worse: partial write, leading to truncated pointer (0x6261)
---
## 2. All Unsafe Call Sites
### 2.1 Category: UNSAFE (No Ownership Check Before Drain)
| File | Line | Context | Path | Risk |
|------|------|---------|------|------|
| `hakmem_tiny_free.inc` | 620 | **Fix #1** `superslab_refill()` Priority 1 | Alloc slow path | 🔴 **HIGH** |
| `hakmem_tiny_free.inc` | 756 | **Fix #2** `hak_tiny_alloc_superslab()` | Alloc fast path | 🔴 **HIGH** |
| `tiny_refill.h` | 47 | Sticky ring refill | Alloc refill path | 🟡 **MEDIUM** |
| `tiny_refill.h` | 65 | Hot slot refill | Alloc refill path | 🟡 **MEDIUM** |
| `tiny_refill.h` | 80 | Bench refill | Alloc refill path | 🟡 **MEDIUM** |
| `tiny_mmap_gate.h` | 57 | mmap gate sweep | Alloc refill path | 🟡 **MEDIUM** |
| `hakmem_tiny_superslab.h` | 376 | `ss_remote_drain_light()` | Background drain | 🟠 **LOW** (unused?) |
| `hakmem_tiny.c` | 652 | Old drain path | Legacy code | 🟠 **LOW** (unused?) |
### 2.2 Category: SAFE (Ownership Claimed BEFORE Drain)
| File | Line | Context | Protection |
|------|------|---------|-----------|
| `tiny_refill.h` | 100-105 | **Fix #3** Mailbox path | ✅ `tiny_tls_bind_slab()` + `ss_owner_cas()` BEFORE drain |
### 2.3 Category: PROBABLY SAFE (Special Cases)
| File | Line | Context | Why Safe? |
|------|------|---------|-----------|
| `hakmem_tiny_free.inc` | 592 | `superslab_refill()` adopt path | Just adopted, unlikely concurrent access |
---
## 3. Why Fix #3 is Correct (and Others Are Not)
### 3.1 Fix #3: Mailbox Path (CORRECT)
```c
// tiny_refill.h:96-106
// BUGFIX: Claim ownership BEFORE draining remote queue (fixes FAST_CAP=0 SEGV)
tiny_tls_bind_slab(tls, mss, midx);  // Bind to TLS
ss_owner_cas(m, tiny_self_u32());    // ✅ CLAIM OWNERSHIP FIRST
// NOW safe to drain - we're the owner
if (atomic_load_explicit(&mss->remote_heads[midx], memory_order_acquire) != 0) {
    ss_remote_drain_to_freelist(mss, midx);  // ✅ Safe: we own the slab
}
```
**Why this works**:
- `ss_owner_cas()` sets `m->owner_tid = self` (lines 385-386 of hakmem_tiny_superslab.h)
- Only the owner thread should modify `meta->freelist` directly
- Other threads must use `ss_remote_push()` to add to remote queue
- By claiming ownership BEFORE draining, we ensure exclusive access to `meta->freelist`
### 3.2 Fix #1 and Fix #2 (INCORRECT)
```c
// hakmem_tiny_free.inc:614-621 (Fix #1)
for (int i = 0; i < tls_cap; i++) {
    int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i], memory_order_acquire) != 0);
    if (has_remote) {
        ss_remote_drain_to_freelist(tls->ss, i); // ❌ NO OWNERSHIP CHECK!
    }
}
```
```c
// hakmem_tiny_free.inc:749-757 (Fix #2)
for (int i = 0; i < tls_cap; i++) {
    uintptr_t remote_val = atomic_load_explicit(&tls->ss->remote_heads[i], memory_order_acquire);
    if (remote_val != 0) {
        ss_remote_drain_to_freelist(tls->ss, i); // ❌ NO OWNERSHIP CHECK!
    }
}
```
**Why this is broken**:
- Drains ALL slabs in the SuperSlab (i=0..tls_cap-1)
- Does NOT check `m->owner_tid` before draining
- Can drain slabs owned by OTHER threads
- Concurrent modification of `meta->freelist` → corruption
### 3.3 Other Unsafe Paths
**Sticky Ring** (tiny_refill.h:47):
```c
if (!lm->freelist && has_remote) ss_remote_drain_to_freelist(last_ss, li); // ❌ Drain BEFORE ownership
if (lm->freelist) {
    tiny_tls_bind_slab(tls, last_ss, li);
    ss_owner_cas(lm, tiny_self_u32()); // ← Ownership AFTER drain
    return last_ss;
}
```
**Hot Slot** (tiny_refill.h:65):
```c
if (!m->freelist && atomic_load_explicit(&hss->remote_heads[hidx], memory_order_acquire) != 0)
    ss_remote_drain_to_freelist(hss, hidx); // ❌ Drain BEFORE ownership
if (m->freelist) {
    tiny_tls_bind_slab(tls, hss, hidx);
    ss_owner_cas(m, tiny_self_u32()); // ← Ownership AFTER drain
    // ... (excerpt truncated)
```
**Same pattern**: Drain first, claim ownership later → Race window!
---
## 4. Explaining the `fault_addr=0x6261` Pattern
### 4.1 Observed Pattern
```
rip=0x00005e3b94a28ece
fault_addr=0x0000000000006261
```
Previous analysis found pointers like `0x7a1ad5a06261` → truncated to `0x6261` (lower 16 bits).
### 4.2 Probable Cause: Partial Write During Race
**Scenario**:
1. Thread A: Reads `ptr = meta->freelist` → `0x7a1ad5a06261`
2. Thread B: Concurrently drains, modifies `meta->freelist`
3. Thread A: Tries to dereference `ptr`, but pointer was partially overwritten
4. Result: Segmentation fault at `0x6261` (incomplete pointer)
**OR**:
- CPU store buffer reordering
- Non-atomic 64-bit write on some architectures
- Cache coherency issue
**Bottom line**: Concurrent writes to `meta->freelist` without synchronization → undefined behavior.
---
## 5. Recommended Fixes
### 5.1 Option A: Remove Fix #1 and Fix #2 (SAFEST)
**Rationale**:
- Fix #3 (Mailbox) already drains safely with ownership
- Fix #1 and Fix #2 are redundant AND unsafe
- The sticky/hot/bench paths need fixing separately
**Changes**:
1. **Delete Fix #1** (hakmem_tiny_free.inc:615-621):
```c
// REMOVE THIS LOOP:
for (int i = 0; i < tls_cap; i++) {
    int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i], memory_order_acquire) != 0);
    if (has_remote) {
        ss_remote_drain_to_freelist(tls->ss, i);
    }
}
```
2. **Delete Fix #2** (hakmem_tiny_free.inc:729-767):
```c
// REMOVE THIS ENTIRE BLOCK (lines 729-767)
```
3. **Keep Fix #3** (tiny_refill.h:96-106) - it's correct!
**Expected Impact**:
- Eliminates the main source of concurrent drain races
- May still crash if sticky/hot/bench paths race with each other
- But frequency should drop dramatically
### 5.2 Option B: Add Ownership Check to Fix #1 and Fix #2
**Changes**:
```c
// Fix #1: hakmem_tiny_free.inc:615-621
for (int i = 0; i < tls_cap; i++) {
    TinySlabMeta* m = &tls->ss->slabs[i];
    // ONLY drain if we own this slab
    if (m->owner_tid == tiny_self_u32()) {
        int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i], memory_order_acquire) != 0);
        if (has_remote) {
            ss_remote_drain_to_freelist(tls->ss, i);
        }
    }
}
```
**Problem**:
- Still racy! `owner_tid` can change between the check and the drain
- Needs proper locking or ownership transfer protocol
- More complex, error-prone
### 5.3 Option C: Fix Sticky/Hot/Bench Paths (CORRECT ORDER)
**Changes**:
```c
// Sticky ring (tiny_refill.h:46-51)
if (lm->freelist || has_remote) {
    // ✅ Claim ownership FIRST
    tiny_tls_bind_slab(tls, last_ss, li);
    ss_owner_cas(lm, tiny_self_u32());
    // NOW safe to drain
    if (!lm->freelist && has_remote) {
        ss_remote_drain_to_freelist(last_ss, li);
    }
    if (lm->freelist) {
        return last_ss;
    }
}
```
Apply the same pattern to the hot slot (line 65) and bench (line 80) paths.
### 5.4 RECOMMENDED: Combine Option A + Option C
1. **Remove Fix #1 and Fix #2** (eliminate main race sources)
2. **Fix sticky/hot/bench paths** (claim ownership before drain)
3. **Keep Fix #3** (already correct)
**Verification**:
```bash
# After applying fixes, rebuild and test
make clean && make -s larson_hakmem
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 10
# Expected: NO crashes, or at least much fewer crashes
```
---
## 6. Next Steps
### 6.1 Immediate Actions
1. **Apply Option A**: Remove Fix #1 and Fix #2
- Comment out lines 615-621 in hakmem_tiny_free.inc
- Comment out lines 729-767 in hakmem_tiny_free.inc
- Rebuild and test
2. **Test Results**:
- If crashes stop → Fix #1/#2 were the main culprits
- If crashes continue → Sticky/hot/bench paths need fixing (Option C)
3. **Apply Option C** (if needed):
- Modify tiny_refill.h lines 46-51, 64-66, 78-81
- Claim ownership BEFORE draining
- Rebuild and test
### 6.2 Long-Term Improvements
1. **Add Ownership Assertion**:
```c
static inline void ss_remote_drain_to_freelist(SuperSlab* ss, int slab_idx) {
#ifdef HAKMEM_DEBUG_OWNERSHIP
    TinySlabMeta* m = &ss->slabs[slab_idx];
    uint32_t owner = m->owner_tid;
    uint32_t self = tiny_self_u32();
    if (owner != 0 && owner != self) {
        fprintf(stderr, "[OWNERSHIP ERROR] Thread %u draining slab owned by %u!\n", self, owner);
        abort();
    }
#endif
    // ... rest of function
}
```
2. **Add Debug Counters**:
- Count concurrent drain attempts
- Track ownership violations
- Dump statistics on crash
3. **Consider Lock-Free Alternative**:
- Use CAS-based freelist updates
- Or: Don't drain at all, just CAS-pop from remote queue directly
- Or: Ownership transfer protocol (expensive)
---
## 7. Conclusion
**Root Cause**: Concurrent `ss_remote_drain_to_freelist()` calls without exclusive ownership.
**Main Culprits**: Fix #1 and Fix #2 drain all slabs without ownership checks.
**Secondary Issues**: Sticky/hot/bench paths drain before claiming ownership.
**Solution**: Remove Fix #1/#2, fix sticky/hot/bench order, keep Fix #3.
**Confidence**: 🟢 **HIGH** - This explains all observed symptoms:
- Crashes at `fault_addr=0x6261` (freelist corruption)
- Timing-dependent failures (race condition)
- Improvements from Fix #3 (correct ownership protocol)
- Remaining crashes (Fix #1/#2 still racing)
---
**END OF ULTRA-DEEP ANALYSIS**