Files
hakmem/docs/analysis/LARSON_DOUBLE_FREE_INVESTIGATION.md
Moe Charm (CI) c2f104618f Fix critical TLS drain memory leak causing potential double-free
## Root Cause

TLS drain was dropping pointers when SuperSlab lookup or slab_idx validation failed:
- Pop pointer from TLS SLL
- Lookup/validation fails
- continue → LEAK! Pointer never returned to any freelist

## Impact

Memory leak + potential double allocation:
1. Pointer P popped but leaked
2. Same address P reallocated from carve/other source
3. User frees P again → duplicate detection → ABORT

## Fix

**Before (BUGGY)**:
```c
if (!ss || invalid_slab_idx) {
    continue;  // ← LEAK!
}
```

**After (FIXED)**:
```c
if (!ss || invalid_slab_idx) {
    // Push back to TLS SLL head (retry later)
    tiny_next_write(class_idx, base, g_tls_sll[class_idx].head);
    g_tls_sll[class_idx].head = base;
    g_tls_sll[class_idx].count++;
    break;  // Stop draining to avoid infinite retry
}
```

## Files Changed

- core/box/tls_sll_drain_box.h: Fix 2 leak sites (SS lookup + slab_idx validation)
- docs/analysis/LARSON_DOUBLE_FREE_INVESTIGATION.md: Investigation report

## Related

- Larson double-free investigation (47% crash rate)
- Commit e4868bf23: Freelist header write + abort() on duplicate
- ChatGPT analysis: Larson benchmark code is correct (no user bug)
2025-11-27 06:49:38 +09:00

4.0 KiB

Larson Double-Free Investigation Report

Date: 2025-11-27

Summary

Larson benchmark crashes with TLS_SLL_PUSH_DUP error (double-free detection). Investigation reveals potential metadata inconsistency causing same pointer to be allocated twice without proper free.

Symptoms

[TLS_SLL_PUSH_DUP] cls=1 ptr=0x76e109240430
last_push_from=hak_tiny_free_fast_v2
last_pop_from=(null)  ← Never popped from TLS SLL!
where=hak_tiny_free_fast_v2

Key Observation: Pointer was pushed to TLS SLL but never popped, yet being freed again.

Root Cause Analysis

Eliminated Hypotheses

  1. Larson Benchmark Bug: ChatGPT analyzed larson.cpp - no double-free logic found
  2. Cross-Thread Free: LARSON_FIX=1 doesn't prevent the crash
  3. Stale Header: Fixed freelist header write (commit e4868bf23) but crash persists

Current Leading Hypothesis: Metadata Inconsistency

Scenario:

  1. User: free(P) → P pushed to TLS SLL (count++)
  2. Without pop: P somehow reallocated from slab freelist or carve
  3. User: p2 = malloc() → Returns P (same address!)
  4. User: free(p2) → Tries to push P to TLS SLL again
  5. Duplicate detection: P already in TLS SLL → ABORT

This requires:

  • meta->used count mismatch
  • P in both TLS SLL AND slab freelist simultaneously
  • Synchronization failure between TLS SLL and slab metadata

Evidence

TLS SLL Pop Logic (Suspicious)

File: core/box/tls_sll_box.h:570-572

if (g_tls_sll[class_idx].count > 0) {
    g_tls_sll[class_idx].count--;  // Conditional decrement!
}

If count somehow becomes 0, head is updated but count doesn't decrement!

TLS Drain Leak (Memory Leak Bug)

File: core/box/tls_sll_drain_box.h:148-154

SuperSlab* ss = hak_super_lookup(base);
if (!ss || ss->magic != SUPERSLAB_MAGIC) {
    fprintf(stderr, "[TLS_SLL_DRAIN] SKIP: ...\n");
    continue;  // ← Pointer DROPPED without returning to freelist!
}

Critical: If SuperSlab lookup fails, pointer is popped but never returned → memory leak.

Fixes Implemented

1. Freelist Header Write (commit e4868bf23)

File: core/tiny_superslab_alloc.inc.h:159-169

Problem: Freelist allocation path didn't write headers

// OLD (buggy)
return block;  // Returns BASE without header

// NEW (fixed)
void* user = tiny_region_id_write_header(block, meta->class_idx);
return user;

Impact: Prevents stale headers, but doesn't fix double-free.

2. Abort on Duplicate (commit e4868bf23)

File: core/box/tls_sll_box.h:381

Change: return trueabort() for diagnostic backtrace

Impact: Enables precise root cause identification.

Next Steps

Priority 1: Verify Metadata Consistency

Add assertions to check:

  1. Pointer is in ONLY ONE location at a time:

    • TLS SLL
    • Slab freelist
    • Not both!
  2. meta->used count matches reality

Priority 2: Fix TLS Drain Leak

File: core/box/tls_sll_drain_box.h:154

Current (buggy):

if (!ss || ss->magic != SUPERSLAB_MAGIC) {
    continue;  // ← LEAK!
}

Proposed Fix:

if (!ss || ss->magic != SUPERSLAB_MAGIC) {
    // Option A: Push back to TLS SLL (retry later)
    tls_sll_push(class_idx, base, UINT32_MAX);
    break;  // Stop draining this class for now

    // Option B: Force to remote queue (if lookup fails, assume remote)
    // (requires more analysis)
}

Priority 3: Enhanced Tracing

Add debug logging to track pointer lifecycle:

  1. Malloc: "P allocated from source X"
  2. Free: "P freed to TLS SLL"
  3. Pop: "P popped from TLS SLL"
  4. Drain: "P drained to freelist"
  5. Re-alloc: "P reallocated from freelist"

ENV: HAKMEM_TINY_PTR_TRACE=<address> to track specific pointer.

Crash Rate

Before fixes: 47% (14/30 runs crash) After header fix: 100% (still crashes, just faster detection)

References

  • Commit: e4868bf23 "Larson crash investigation: Add freelist header write + abort()"
  • Previous investigation: Task agent Phase 2 (identified TLS_SLL_PUSH_DUP pattern)
  • Larson benchmark analysis: ChatGPT confirmed no user-code bug