Larson double-free investigation: Enhanced diagnostics + Remove buggy drain pushback

**Problem**: Larson benchmark crashes with TLS_SLL_DUP (double-free), 100% crash rate in debug

**Root Cause**: TLS drain pushback code (commit c2f104618) created duplicates by
pushing pointers back to TLS SLL while they were still in the linked list chain.

**Diagnostic Enhancements** (ChatGPT + Claude collaboration):
1. **Callsite Tracking**: Track file:line for each TLS SLL push (debug only)
   - Arrays: g_tls_sll_push_file[], g_tls_sll_push_line[]
   - Macro: tls_sll_push() auto-records __FILE__, __LINE__

2. **Enhanced Duplicate Detection**:
   - Scan depth: 64 → 256 nodes (deep duplicate detection)
   - Error message shows BOTH current and previous push locations
   - Calls ptr_trace_dump_now() for detailed analysis

3. **Evidence Captured**:
   - Both duplicate pushes from same line (221)
   - Pointer at position 11 in TLS SLL (count=18, scanned=11)
   - Confirms pointer allocated without being popped from TLS SLL
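
A minimal sketch of how callsite tracking and the duplicate scan could fit together. All `sketch_*` names are hypothetical, the arrays are assumed to hold one last-push site per size class, and plain statics stand in for thread-local storage; the real `tls_sll_push()` and `g_tls_sll_push_file[]` may be laid out differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_NUM_CLASSES 8    /* hypothetical number of size classes */
#define SKETCH_SCAN_LIMIT  256  /* scan depth named in this commit */

typedef struct sketch_node { struct sketch_node *next; } sketch_node;
typedef struct { sketch_node *head; unsigned count; } sketch_sll;

/* The real lists are thread-local; plain statics keep the sketch simple. */
static sketch_sll  sketch_list[SKETCH_NUM_CLASSES];
static const char *sketch_push_file[SKETCH_NUM_CLASSES]; /* last push site (assumed per class) */
static int         sketch_push_line[SKETCH_NUM_CLASSES];

/* Return the 1-based position of ptr in the list, or 0 if it is not found
 * within the first SKETCH_SCAN_LIMIT nodes. */
static unsigned sketch_find(int cls, const void *ptr) {
    unsigned pos = 0;
    for (sketch_node *n = sketch_list[cls].head;
         n != NULL && pos < SKETCH_SCAN_LIMIT; n = n->next) {
        pos++;
        if ((const void *)n == ptr) return pos;
    }
    return 0;
}

/* Push ptr unless it is already linked; returns 1 on success, 0 on duplicate. */
static int sketch_push_at(int cls, void *ptr, const char *file, int line) {
    assert(cls >= 0 && cls < SKETCH_NUM_CLASSES);
    unsigned pos = sketch_find(cls, ptr);
    if (pos != 0) {
        /* Report both the current callsite and the last recorded one. */
        const char *prev = sketch_push_file[cls] ? sketch_push_file[cls] : "?";
        fprintf(stderr, "[SKETCH_DUP] cls=%d ptr=%p count=%u scanned=%u "
                        "now=%s:%d prev=%s:%d\n",
                cls, ptr, sketch_list[cls].count, pos,
                file, line, prev, sketch_push_line[cls]);
        return 0;
    }
    sketch_node *n = (sketch_node *)ptr;
    n->next = sketch_list[cls].head;
    sketch_list[cls].head = n;
    sketch_list[cls].count++;
    sketch_push_file[cls] = file;   /* remember where this push came from */
    sketch_push_line[cls] = line;
    return 1;
}

/* Wrapper macro that records the caller's location automatically. */
#define sketch_push(cls, ptr) sketch_push_at((cls), (ptr), __FILE__, __LINE__)
```

Because the macro expands at the call site, `__FILE__` and `__LINE__` identify the caller rather than the list implementation, which is what makes a report such as "both pushes from the same line" possible.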

**Fix**:
- **core/box/tls_sll_drain_box.h**: Remove pushback code entirely
  - Old: Push back to TLS SLL on validation failure → duplicates!
  - New: Skip pointer (accept rare leak) to avoid duplicates
  - Rationale: SuperSlab lookup failures are transient/rare
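
The "skip instead of push back" policy can be sketched as follows. This is an illustration under assumptions, not the code in `tls_sll_drain_box.h`: the `drain_*` names are hypothetical, and `drain_sample_validate` is a stand-in for the SuperSlab lookup and magic check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct drain_node { struct drain_node *next; } drain_node;
typedef struct { drain_node *head; unsigned count; } drain_sll;
typedef struct { unsigned freed; unsigned skipped; } drain_stats;

/* Stand-in for the SuperSlab lookup and magic check (hypothetical). */
typedef bool (*drain_validate_fn)(const void *ptr);

static const void *drain_bad_ptr;  /* hypothetical: pointer whose lookup fails */
static bool drain_sample_validate(const void *ptr) {
    return ptr != drain_bad_ptr;
}

static void drain_push(drain_sll *l, drain_node *n) {
    assert(l != NULL && n != NULL);
    n->next = l->head;
    l->head = n;
    l->count++;
}

static drain_node *drain_pop(drain_sll *l) {
    assert(l != NULL);
    drain_node *n = l->head;
    if (n != NULL) {
        l->head = n->next;
        l->count--;
        n->next = NULL;
    }
    return n;
}

/* Drain up to `batch` nodes. A node that fails validation is skipped
 * (leaked) and never pushed back, so the list cannot gain a duplicate. */
static drain_stats drain_batch(drain_sll *l, unsigned batch, drain_validate_fn ok) {
    drain_stats st = {0, 0};
    for (unsigned i = 0; i < batch; i++) {
        drain_node *n = drain_pop(l);
        if (n == NULL) break;       /* list is empty */
        if (!ok(n)) {
            st.skipped++;           /* accept a rare leak */
            continue;
        }
        st.freed++;                 /* real code would return the block to its slab */
    }
    return st;
}
```

The trade-off is the one stated above: a skipped block is lost, but the list is only ever modified by popping, so a failed lookup cannot corrupt it.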

**Status**: Fix implemented, ready for testing

**Updated**:
- LARSON_DOUBLE_FREE_INVESTIGATION.md: Root cause confirmed
Moe Charm (CI)
2025-11-27 07:30:32 +09:00
parent c2f104618f
commit 8553894171
3 changed files with 83 additions and 45 deletions


@@ -85,40 +85,38 @@ File: `core/box/tls_sll_box.h:381`
 **Impact**: Enables precise root cause identification.
-## Next Steps
+## Root Cause CONFIRMED (2025-11-27)
-### Priority 1: Verify Metadata Consistency
+### TLS Drain Pushback Bug Creates Duplicates!
-Add assertions to check:
-1. Pointer is in ONLY ONE location at a time:
-- TLS SLL
-- Slab freelist
-- Not both!
+**File**: `core/box/tls_sll_drain_box.h:148-162`
-2. `meta->used` count matches reality
-### Priority 2: Fix TLS Drain Leak
-File: `core/box/tls_sll_drain_box.h:154`
-**Current (buggy)**:
+**Buggy Fix (commit c2f104618)**:
 ```c
 if (!ss || ss->magic != SUPERSLAB_MAGIC) {
-continue; // ← LEAK!
+// CRITICAL BUG: Creates duplicates!
+tiny_next_write(class_idx, base, g_tls_sll[class_idx].head);
+g_tls_sll[class_idx].head = base; // ← Pushes to position 0
+g_tls_sll[class_idx].count++; // ← But pointer ALREADY at position 11!
+break;
 }
 ```
-**Proposed Fix**:
-```c
-if (!ss || ss->magic != SUPERSLAB_MAGIC) {
-// Option A: Push back to TLS SLL (retry later)
-tls_sll_push(class_idx, base, UINT32_MAX);
-break; // Stop draining this class for now
+**Scenario**:
+1. TLS SLL has pointer at position 11 (count=18)
+2. Drain loop pops pointer from TLS SLL (now count=17, but pointer still in chain at position 10)
+3. SuperSlab lookup fails (transient state)
+4. Pushback adds pointer to position 0 → **NOW AT TWO POSITIONS** (0 and 10)
+5. Allocation pops from position 0
+6. User frees → tries to push → **duplicate detected at position 10**
-// Option B: Force to remote queue (if lookup fails, assume remote)
-// (requires more analysis)
-}
+**Evidence**:
 ```
+[TLS_SLL_DUP] cls=1 ptr=0x... count=18 scanned=11
+```
+Pointer found at position 11 during duplicate scan!
+**Correct Fix**: DON'T push back when already in TLS SLL. Just **stop draining** when validation fails.
 ### Priority 3: Enhanced Tracing