Files
hakmem/docs/analysis/LARSON_DOUBLE_FREE_INVESTIGATION.md

144 lines
4.0 KiB
Markdown
Raw Normal View History

# Larson Double-Free Investigation Report
## Date: 2025-11-27
## Summary
Larson benchmark crashes with TLS_SLL_PUSH_DUP error (double-free detection). Investigation reveals potential metadata inconsistency causing same pointer to be allocated twice without proper free.
## Symptoms
```
[TLS_SLL_PUSH_DUP] cls=1 ptr=0x76e109240430
last_push_from=hak_tiny_free_fast_v2
last_pop_from=(null) ← Never popped from TLS SLL!
where=hak_tiny_free_fast_v2
```
**Key Observation**: Pointer was pushed to TLS SLL but never popped, yet being freed again.
## Root Cause Analysis
### Eliminated Hypotheses
1.**Larson Benchmark Bug**: ChatGPT analyzed larson.cpp - no double-free logic found
2.**Cross-Thread Free**: LARSON_FIX=1 doesn't prevent the crash
3.**Stale Header**: Fixed freelist header write (commit e4868bf23) but crash persists
### Current Leading Hypothesis: Metadata Inconsistency
**Scenario**:
1. User: `free(P)` → P pushed to TLS SLL (count++)
2. **Without pop**: P somehow reallocated from slab freelist or carve
3. User: `p2 = malloc()` → Returns P (same address!)
4. User: `free(p2)` → Tries to push P to TLS SLL again
5. Duplicate detection: P already in TLS SLL → ABORT
**This requires**:
- `meta->used` count mismatch
- P in both TLS SLL AND slab freelist simultaneously
- Synchronization failure between TLS SLL and slab metadata
## Evidence
### TLS SLL Pop Logic (Suspicious)
File: `core/box/tls_sll_box.h:570-572`
```c
if (g_tls_sll[class_idx].count > 0) {
g_tls_sll[class_idx].count--; // Conditional decrement!
}
```
If count somehow becomes 0, head is updated but count doesn't decrement!
### TLS Drain Leak (Memory Leak Bug)
File: `core/box/tls_sll_drain_box.h:148-154`
```c
SuperSlab* ss = hak_super_lookup(base);
if (!ss || ss->magic != SUPERSLAB_MAGIC) {
fprintf(stderr, "[TLS_SLL_DRAIN] SKIP: ...\n");
continue; // ← Pointer DROPPED without returning to freelist!
}
```
**Critical**: If SuperSlab lookup fails, pointer is popped but never returned → memory leak.
## Fixes Implemented
### 1. Freelist Header Write (commit e4868bf23)
File: `core/tiny_superslab_alloc.inc.h:159-169`
**Problem**: Freelist allocation path didn't write headers
```c
// OLD (buggy)
return block; // Returns BASE without header
// NEW (fixed)
void* user = tiny_region_id_write_header(block, meta->class_idx);
return user;
```
**Impact**: Prevents stale headers, but doesn't fix double-free.
### 2. Abort on Duplicate (commit e4868bf23)
File: `core/box/tls_sll_box.h:381`
**Change**: `return true``abort()` for diagnostic backtrace
**Impact**: Enables precise root cause identification.
## Next Steps
### Priority 1: Verify Metadata Consistency
Add assertions to check:
1. Pointer is in ONLY ONE location at a time:
- TLS SLL
- Slab freelist
- Not both!
2. `meta->used` count matches reality
### Priority 2: Fix TLS Drain Leak
File: `core/box/tls_sll_drain_box.h:154`
**Current (buggy)**:
```c
if (!ss || ss->magic != SUPERSLAB_MAGIC) {
continue; // ← LEAK!
}
```
**Proposed Fix**:
```c
if (!ss || ss->magic != SUPERSLAB_MAGIC) {
// Option A: Push back to TLS SLL (retry later)
tls_sll_push(class_idx, base, UINT32_MAX);
break; // Stop draining this class for now
// Option B: Force to remote queue (if lookup fails, assume remote)
// (requires more analysis)
}
```
### Priority 3: Enhanced Tracing
Add debug logging to track pointer lifecycle:
1. Malloc: "P allocated from source X"
2. Free: "P freed to TLS SLL"
3. Pop: "P popped from TLS SLL"
4. Drain: "P drained to freelist"
5. Re-alloc: "P reallocated from freelist"
ENV: `HAKMEM_TINY_PTR_TRACE=<address>` to track specific pointer.
## Crash Rate
**Before fixes**: 47% (14/30 runs crash)
**After header fix**: 100% (still crashes, just faster detection)
## References
- Commit: e4868bf23 "Larson crash investigation: Add freelist header write + abort()"
- Previous investigation: Task agent Phase 2 (identified TLS_SLL_PUSH_DUP pattern)
- Larson benchmark analysis: ChatGPT confirmed no user-code bug