**Problem**: Larson benchmark crashes with TLS_SLL_DUP (double-free), 100% crash rate in debug
**Root Cause**: TLS drain pushback code (commit c2f104618) created duplicates by
pushing pointers back to TLS SLL while they were still in the linked list chain.
**Diagnostic Enhancements** (ChatGPT + Claude collaboration):
1. **Callsite Tracking**: Track file:line for each TLS SLL push (debug only)
- Arrays: g_tls_sll_push_file[], g_tls_sll_push_line[]
- Macro: tls_sll_push() auto-records __FILE__, __LINE__
2. **Enhanced Duplicate Detection**:
- Scan depth: 64 → 256 nodes (deep duplicate detection)
- Error message shows BOTH current and previous push locations
- Calls ptr_trace_dump_now() for detailed analysis
3. **Evidence Captured**:
- Both duplicate pushes from same line (221)
- Pointer at position 11 in TLS SLL (count=18, scanned=11)
- Confirms pointer allocated without being popped from TLS SLL
**Fix**:
- **core/box/tls_sll_drain_box.h**: Remove pushback code entirely
- Old: Push back to TLS SLL on validation failure → duplicates!
- New: Skip pointer (accept rare leak) to avoid duplicates
- Rationale: SuperSlab lookup failures are transient/rare
**Status**: Fix implemented, ready for testing
**Updated**:
- LARSON_DOUBLE_FREE_INVESTIGATION.md: Root cause confirmed
# Larson Double-Free Investigation Report

## Date: 2025-11-27

## Summary

The Larson benchmark crashes with a TLS_SLL_PUSH_DUP error (double-free detection). The investigation points to a possible metadata inconsistency that lets the same pointer be allocated twice without an intervening free.

## Symptoms

```
[TLS_SLL_PUSH_DUP] cls=1 ptr=0x76e109240430
last_push_from=hak_tiny_free_fast_v2
last_pop_from=(null) ← Never popped from TLS SLL!
where=hak_tiny_free_fast_v2
```

**Key Observation**: The pointer was pushed to the TLS SLL and never popped, yet it is being freed again.

## Root Cause Analysis

### Eliminated Hypotheses

1. ❌ **Larson Benchmark Bug**: ChatGPT analyzed larson.cpp - no double-free logic found
2. ❌ **Cross-Thread Free**: LARSON_FIX=1 doesn't prevent the crash
3. ❌ **Stale Header**: Fixed freelist header write (commit e4868bf23) but crash persists

### Current Leading Hypothesis: Metadata Inconsistency

**Scenario**:
1. User: `free(P)` → P pushed to TLS SLL (count++)
2. **Without pop**: P is somehow reallocated from the slab freelist or a carve
3. User: `p2 = malloc()` → Returns P (same address!)
4. User: `free(p2)` → Tries to push P to TLS SLL again
5. Duplicate detection: P already in TLS SLL → ABORT

**This requires**:
- `meta->used` count mismatch
- P in both TLS SLL AND slab freelist simultaneously
- Synchronization failure between TLS SLL and slab metadata

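
Step 5 of the scenario can be illustrated with a minimal sketch of a push that scans the list for the pointer before linking it. The `demo_` names, the node layout, and the `max_scan` parameter are assumptions made for illustration; this is not the hakmem code, which (per the notes above) aborts on a duplicate rather than returning a status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical intrusive free-list node: the freed block stores the link. */
typedef struct demo_node { struct demo_node *next; } demo_node;

typedef struct {
    demo_node *head;
    unsigned count;
} demo_sll;

/* Push p unless it is found within the first max_scan nodes.
 * Returns false on a duplicate, which is the double-free signal. */
static bool demo_sll_push(demo_sll *l, demo_node *p, unsigned max_scan)
{
    assert(l != NULL && p != NULL);
    unsigned scanned = 0;
    for (demo_node *n = l->head; n != NULL && scanned < max_scan;
         n = n->next, scanned++) {
        if (n == p)
            return false; /* p is still on the list: refuse the second push */
    }
    p->next = l->head;
    l->head = p;
    l->count++;
    return true;
}
```

A bounded scan keeps the debug check cheap, at the cost of missing duplicates that sit deeper than `max_scan` nodes.
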
## Evidence

### TLS SLL Pop Logic (Suspicious)
File: `core/box/tls_sll_box.h:570-572`
```c
if (g_tls_sll[class_idx].count > 0) {
    g_tls_sll[class_idx].count--;  // Conditional decrement!
}
```
If `count` somehow reaches 0 while nodes remain, the pop still updates `head` but skips the decrement, leaving `count` out of sync with the actual list length.

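
To make the concern concrete, here is a small sketch of a pop with the same guarded decrement, plus a helper that walks the list to compare its real length with `count`. All `drift_` names and the struct layout are hypothetical and only model `head` and `count`.

```c
#include <assert.h>
#include <stddef.h>

typedef struct drift_node { struct drift_node *next; } drift_node;

typedef struct {
    drift_node *head;
    unsigned count;
} drift_sll;

/* Pop the head node. The decrement is guarded, mirroring the snippet above. */
static drift_node *drift_pop(drift_sll *l)
{
    assert(l != NULL);
    drift_node *n = l->head;
    if (n == NULL)
        return NULL;
    l->head = n->next;
    if (l->count > 0)
        l->count--; /* skipped when count is already 0 */
    return n;
}

/* Walk the list to get its real length, for comparison with count. */
static unsigned drift_length(const drift_sll *l)
{
    unsigned len = 0;
    for (const drift_node *n = l->head; n != NULL; n = n->next)
        len++;
    return len;
}
```

The guard prevents an unsigned underflow, but it also lets an existing mismatch pass silently; a debug build could instead assert that `count` is non-zero whenever `head` is non-null.
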
### TLS Drain Leak (Memory Leak Bug)
File: `core/box/tls_sll_drain_box.h:148-154`
```c
SuperSlab* ss = hak_super_lookup(base);
if (!ss || ss->magic != SUPERSLAB_MAGIC) {
    fprintf(stderr, "[TLS_SLL_DRAIN] SKIP: ...\n");
    continue;  // ← Pointer DROPPED without returning to freelist!
}
```
**Critical**: If the SuperSlab lookup fails, the pointer is popped but never returned to the freelist → memory leak.

## Fixes Implemented

### 1. Freelist Header Write (commit e4868bf23)
File: `core/tiny_superslab_alloc.inc.h:159-169`

**Problem**: The freelist allocation path didn't write headers
```c
// OLD (buggy)
return block;  // Returns BASE without header

// NEW (fixed)
void* user = tiny_region_id_write_header(block, meta->class_idx);
return user;
```

**Impact**: Prevents stale headers, but does not fix the double-free.

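
The idea behind the header write can be sketched as follows. The layout here (one header byte at BASE holding a magic nibble plus the class index, user data at BASE + 1) and all `hdr_demo_` names are assumptions for illustration only; the actual layout used by `tiny_region_id_write_header` may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout for illustration only: one header byte at BASE,
 * user data starting at BASE + 1. */
enum { HDR_DEMO_SIZE = 1, HDR_DEMO_MAGIC = 0xA0 };

/* Stamp the class index into the header and return the user pointer. */
static void *hdr_demo_write(void *base, unsigned class_idx)
{
    assert(base != NULL && class_idx < 16);
    uint8_t *b = (uint8_t *)base;
    b[0] = (uint8_t)(HDR_DEMO_MAGIC | class_idx);
    return b + HDR_DEMO_SIZE;
}

/* Recover the class index from a user pointer, or -1 if the header
 * does not carry the expected magic (for example, a stale header). */
static int hdr_demo_class_of(const void *user)
{
    const uint8_t *b = (const uint8_t *)user - HDR_DEMO_SIZE;
    if ((b[0] & 0xF0) != HDR_DEMO_MAGIC)
        return -1;
    return b[0] & 0x0F;
}
```

Under this assumed layout, a block returned without the write keeps whatever byte was last stored at BASE, so a later free could read the wrong class index.
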
### 2. Abort on Duplicate (commit e4868bf23)
File: `core/box/tls_sll_box.h:381`

**Change**: `return true` → `abort()` for diagnostic backtrace

**Impact**: Enables precise root cause identification.

## Root Cause CONFIRMED (2025-11-27)

### TLS Drain Pushback Bug Creates Duplicates!

**File**: `core/box/tls_sll_drain_box.h:148-162`

**Buggy Fix (commit c2f104618)**:
```c
if (!ss || ss->magic != SUPERSLAB_MAGIC) {
    // CRITICAL BUG: Creates duplicates!
    tiny_next_write(class_idx, base, g_tls_sll[class_idx].head);
    g_tls_sll[class_idx].head = base;  // ← Pushes to position 0
    g_tls_sll[class_idx].count++;      // ← But pointer ALREADY at position 11!
    break;
}
```

**Scenario**:
1. TLS SLL has the pointer at position 11 (count=18)
2. Drain loop pops the pointer from TLS SLL (now count=17, but the pointer is still in the chain at position 10)
3. SuperSlab lookup fails (transient state)
4. Pushback adds the pointer to position 0 → **NOW AT TWO POSITIONS** (0 and 10)
5. Allocation pops from position 0
6. User frees → tries to push → **duplicate detected at position 10**

**Evidence**:
```
[TLS_SLL_DUP] cls=1 ptr=0x... count=18 scanned=11
```
Pointer found at position 11 during duplicate scan!

**Correct Fix**: DON'T push the pointer back when it is already in the TLS SLL. Just **stop draining** when validation fails.

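
A drain loop without pushback might look like the sketch below. It assumes the popped pointer is simply dropped when validation fails (accepting a rare leak) and that draining stops at that point. The `drain_` names, the `lookup_ok` flag standing in for the SuperSlab lookup and magic check, and the `sink` list standing in for the slab freelist are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct drain_node {
    struct drain_node *next;
    bool lookup_ok; /* stand-in for "SuperSlab lookup and magic check passed" */
} drain_node;

typedef struct {
    drain_node *head;
    unsigned count;
} drain_sll;

/* Drain up to budget nodes from tls into sink. On the first validation
 * failure the popped node is dropped, never pushed back, and draining
 * stops. Returns the number of nodes moved to sink. */
static unsigned drain_demo(drain_sll *tls, drain_sll *sink, unsigned budget)
{
    assert(tls != NULL && sink != NULL);
    unsigned drained = 0;
    while (drained < budget && tls->head != NULL) {
        drain_node *n = tls->head;
        tls->head = n->next;      /* pop: n is unlinked from the TLS list */
        if (tls->count > 0)
            tls->count--;
        if (!n->lookup_ok)
            break;                /* no pushback: accept a rare leak */
        n->next = sink->head;     /* hand the node to the freelist stand-in */
        sink->head = n;
        sink->count++;
        drained++;
    }
    return drained;
}
```

Because the failing node is never relinked, it cannot end up at two positions in the TLS list; the trade-off is that it is lost until the process exits.
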
### Priority 3: Enhanced Tracing

Add debug logging to track pointer lifecycle:
1. Malloc: "P allocated from source X"
2. Free: "P freed to TLS SLL"
3. Pop: "P popped from TLS SLL"
4. Drain: "P drained to freelist"
5. Re-alloc: "P reallocated from freelist"

ENV: `HAKMEM_TINY_PTR_TRACE=<address>` to track specific pointer.

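
One possible shape for such a tracer is sketched below: parse the address once, then log lifecycle events only for the matching pointer. The `trace_demo_` helpers and the `[PTR_TRACE]` log tag are hypothetical and do not describe the existing hakmem tracing code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse an address such as "0x76e109240430".
 * Returns 0 when the string is NULL, empty, or not a valid number. */
static uintptr_t trace_demo_parse(const char *s)
{
    if (s == NULL || *s == '\0')
        return 0;
    char *end = NULL;
    unsigned long long v = strtoull(s, &end, 0); /* base 0 accepts a 0x prefix */
    if (end == s || *end != '\0')
        return 0;
    return (uintptr_t)v;
}

/* Log one lifecycle event if ptr matches the traced address.
 * Returns 1 when a line was written, 0 otherwise. */
static int trace_demo_event(uintptr_t traced, const void *ptr, const char *event)
{
    assert(event != NULL);
    if (traced == 0 || (uintptr_t)ptr != traced)
        return 0;
    fprintf(stderr, "[PTR_TRACE] %p %s\n", (void *)ptr, event);
    return 1;
}
```

At startup the traced address could be read with `trace_demo_parse(getenv("HAKMEM_TINY_PTR_TRACE"))`; an unset variable yields 0, which disables logging.
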
## Crash Rate

**Before fixes**: 47% (14/30 runs crash)
**After header fix**: 100% (still crashes, just faster detection)

## References

- Commit: e4868bf23 "Larson crash investigation: Add freelist header write + abort()"
- Previous investigation: Task agent Phase 2 (identified TLS_SLL_PUSH_DUP pattern)
- Larson benchmark analysis: ChatGPT confirmed no user-code bug