# bench_fixed_size_hakmem Crash Report: workset=64 Race Condition

**Date**: 2025-11-15
**Status**: 🔴 **ROOT CAUSE IDENTIFIED** - Race condition in Stage 1 (lock-free freelist reuse)

---

## Executive Summary

`bench_fixed_size_hakmem` crashes with SEGV when `workset=64` and `iterations >= 2150`:

```bash
# Works fine:
./out/release/bench_fixed_size_hakmem 10000 16 60   # OK
./out/release/bench_fixed_size_hakmem 2100 16 64    # OK

# Crashes:
./out/release/bench_fixed_size_hakmem 2150 16 64    # SEGV
./out/release/bench_fixed_size_hakmem 10000 16 64   # SEGV
```

**Root Cause**: NULL pointer dereference in `shared_pool_acquire_slab()` Stage 1, caused by a race between:
- Thread A releasing a SuperSlab (sets `sp_meta->ss = NULL`, frees the memory)
- Thread B reusing a slot from the freelist (loads the stale `sp_meta`, whose `ss` is now NULL)

---

## Crash Details

### Stack Trace

```
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005a12b89a770b in shared_pool_acquire_slab.constprop ()

Crashing instruction:
=> or %r15d,0x14(%r14)

Register state:
r14 = 0x0 (NULL pointer!)
```

**Disassembly context** (line 572 in `hakmem_shared_pool.c`):

```asm
0x5a12b89a770b: or %r15d,0x14(%r14)   ; Tries to access ss->slab_bitmap (offset 0x14)
                                      ; r14 = ss = NULL → SEGV
```

### Debug Log Output

```
[SP_ACQUIRE_STAGE2_LOCKFREE] class=2 claimed UNUSED slot (ss=0x791110200000 slab=31)
[SP_ACQUIRE_STAGE3] class=2 new SuperSlab (ss=0x79110fe00000 from_lru=0)
[SP_ACQUIRE_STAGE1_LOCKFREE] class=2 reusing EMPTY slot (ss=(nil) slab=0)   ← CRASH HERE
```

**Smoking gun**: the last line shows that Stage 1 got `ss=(nil)` but still tried to use it.
---

## Root Cause Analysis

### The Race Condition

**File**: `core/hakmem_shared_pool.c`
**Function**: `shared_pool_acquire_slab()` (lines 514-738)

**Race Timeline**:

| Time | Thread A (Releasing Slab) | Thread B (Acquiring Slab) |
|------|---------------------------|---------------------------|
| T0 | `shared_pool_release_slab(ss, idx)` called | - |
| T1 | Line 840: `sp_freelist_push_lockfree(class, meta, idx)` | - |
| | (Slot pushed to freelist, ss still valid) | - |
| T2 | Line 850: Detects `active_slots == 0` | - |
| T3 | Line 862: `atomic_store(&meta->ss, NULL)` | - |
| T4 | Line 870: `superslab_free(ss)` (memory freed) | - |
| T5 | - | `shared_pool_acquire_slab(class, ...)` called |
| T6 | - | Line 548: `sp_freelist_pop_lockfree()` **pops stale meta** |
| T7 | - | Line 564: `ss = atomic_load(&meta->ss)` **ss = NULL!** |
| T8 | - | Lines 566-569: Debug log shows `ss=(nil)` |
| T9 | - | Line 572: `ss->slab_bitmap \|= ...` **SEGV!** |

### Vulnerable Code Path

**Stage 1 (Lock-Free Freelist Reuse)** in `shared_pool_acquire_slab()`:

```c
// Lines 548-592 (hakmem_shared_pool.c)
if (sp_freelist_pop_lockfree(class_idx, &reuse_meta, &reuse_slot_idx)) {
    // ...
    pthread_mutex_lock(&g_shared_pool.alloc_lock);

    // Activate slot under mutex (slot state transition requires protection)
    if (sp_slot_mark_active(reuse_meta, reuse_slot_idx, class_idx) == 0) {
        // ⚠️ BUG: Load ss atomically, but NO NULL CHECK!
        SuperSlab* ss = atomic_load_explicit(&reuse_meta->ss, memory_order_relaxed);

        if (dbg_acquire == 1) {
            fprintf(stderr, "[SP_ACQUIRE_STAGE1_LOCKFREE] class=%d reusing EMPTY slot (ss=%p slab=%d)\n",
                    class_idx, (void*)ss, reuse_slot_idx);
        }

        // ❌ CRASH HERE: ss can be NULL if the SuperSlab was freed after push but before pop
        ss->slab_bitmap |= (1u << reuse_slot_idx);  // Line 572: NULL dereference!
        // ...
    }
}
```

**Why the NULL check is missing:**

The code assumes:
1. If `sp_freelist_pop_lockfree()` returns true → slot is valid
2.
   If `sp_slot_mark_active()` succeeds → the SuperSlab must still exist

**But this is wrong** because:
1. The slot was pushed to the freelist while the SuperSlab was still valid (line 840)
2. The SuperSlab was freed AFTER the push but BEFORE the pop (lines 862-870)
3. The freelist node contains a stale `sp_meta` pointer whose `ss` is now NULL

### Why Stage 2 Doesn't Crash

**Stage 2 (Lock-Free UNUSED Slot Claiming)** has proper NULL handling:

```c
// Lines 613-622 (hakmem_shared_pool.c)
int claimed_idx = sp_slot_claim_lockfree(meta, class_idx);
if (claimed_idx >= 0) {
    SuperSlab* ss = atomic_load_explicit(&meta->ss, memory_order_acquire);
    if (!ss) {
        // ✅ CORRECT: Skip if SuperSlab was freed
        continue;
    }
    // ... safe to use ss
}
```

This check was added in a previous RACE FIX but **was not applied to Stage 1**.

---

## Why workset=64 Specifically?

The crash is **NOT** specific to workset=64, but rather to **total operations × drain frequency**:

### Crash Threshold Analysis

| workset | iterations | Total Ops | Crash? | Drain Cycles (÷2048) |
|---------|-----------|-----------|--------|---------------------|
| 60 | 10000 | 600,000 | ❌ OK | 293 |
| 64 | 2100 | 134,400 | ❌ OK | 66 |
| 64 | 2150 | 137,600 | ✅ CRASH | 67 |
| 64 | 10000 | 640,000 | ✅ CRASH | 313 |

**Pattern**: The crash appears around **2150 iterations** (137,600 ops, ~67 drain cycles).

**Why this threshold?**

1. **TLS SLL drain interval** = 2048 (default)
2. At ~2150 iterations:
   - The first major drain cycle completes (~67 drains)
   - Many slabs are released to the shared pool
   - The freelist accumulates many freed slots
   - Some SuperSlabs become completely empty → freed
   - The race window opens: the freelist holds slots whose SuperSlabs are freed
3.
   **workset=64** amplifies the issue:
   - A larger working set means more concurrent allocations
   - More slabs active → more slabs released during drain
   - Higher probability of hitting the race window

---

## Reproduction

### Minimal Repro

```bash
cd /mnt/workdisk/public_share/hakmem

# Crash reliably:
./out/release/bench_fixed_size_hakmem 2150 16 64

# Debug logging (shows ss=(nil)):
HAKMEM_SS_ACQUIRE_DEBUG=1 ./out/release/bench_fixed_size_hakmem 2150 16 64
```

**Expected Output** (last lines before crash):

```
[SP_ACQUIRE_STAGE2_LOCKFREE] class=2 claimed UNUSED slot (ss=0x... slab=31)
[SP_ACQUIRE_STAGE3] class=2 new SuperSlab (ss=0x... from_lru=0)
[SP_ACQUIRE_STAGE1_LOCKFREE] class=2 reusing EMPTY slot (ss=(nil) slab=0)
Segmentation fault (core dumped)
```

### Testing Boundaries

```bash
# Find exact crash threshold:
for i in {2100..2200..10}; do
  ./out/release/bench_fixed_size_hakmem $i 16 64 >/dev/null 2>&1 \
    && echo "iters=$i: OK" \
    || echo "iters=$i: CRASH"
done

# Output:
# iters=2100: OK
# iters=2110: OK
# ...
# iters=2140: OK
# iters=2150: CRASH   ← First crash
```

---

## Recommended Fix

**File**: `core/hakmem_shared_pool.c`
**Function**: `shared_pool_acquire_slab()`
**Lines**: 562-592 (Stage 1)

### Patch (Minimal, 5 lines)

```diff
--- a/core/hakmem_shared_pool.c
+++ b/core/hakmem_shared_pool.c
@@ -561,6 +561,12 @@ shared_pool_acquire_slab(int class_idx, SuperSlab** ss_out, int* slab_idx_out)
     // Activate slot under mutex (slot state transition requires protection)
     if (sp_slot_mark_active(reuse_meta, reuse_slot_idx, class_idx) == 0) {
         // RACE FIX: Load SuperSlab pointer atomically (consistency)
         SuperSlab* ss = atomic_load_explicit(&reuse_meta->ss, memory_order_relaxed);
+        // RACE FIX: Check if SuperSlab was freed between push and pop
+        if (!ss) {
+            // SuperSlab freed after slot was pushed to freelist - skip and fall through
+            pthread_mutex_unlock(&g_shared_pool.alloc_lock);
+            goto stage2_fallback;  // Try Stage 2 (UNUSED slots) or Stage 3 (new SS)
+        }
         if (dbg_acquire == 1) {
             fprintf(stderr, "[SP_ACQUIRE_STAGE1_LOCKFREE] class=%d reusing EMPTY slot (ss=%p slab=%d)\n",
@@ -598,6 +604,7 @@ shared_pool_acquire_slab(int class_idx, SuperSlab** ss_out, int* slab_idx_out)
         pthread_mutex_unlock(&g_shared_pool.alloc_lock);
     }

+stage2_fallback:
     // ========== Stage 2 (Lock-Free): Try to claim UNUSED slots ==========
```

### Alternative Fix (No goto, +10 lines)

If `goto` is undesirable, wrap Stages 2+3 in a helper function or use a flag:

```c
// After line 564:
SuperSlab* ss = atomic_load_explicit(&reuse_meta->ss, memory_order_relaxed);
if (!ss) {
    // SuperSlab was freed - release lock and continue to Stage 2
    if (g_lock_stats_enabled == 1) {
        atomic_fetch_add(&g_lock_release_count, 1);
    }
    pthread_mutex_unlock(&g_shared_pool.alloc_lock);
    // Fall through to Stage 2 below (no goto needed)
} else {
    // ... existing code (lines 566-591)
}
```

---

## Verification Plan

### Test Cases

```bash
# 1.
# Original crash case (must pass after fix):
./out/release/bench_fixed_size_hakmem 2150 16 64
./out/release/bench_fixed_size_hakmem 10000 16 64

# 2. Boundary cases (all must pass):
./out/release/bench_fixed_size_hakmem 2100 16 64
./out/release/bench_fixed_size_hakmem 3000 16 64
./out/release/bench_fixed_size_hakmem 10000 16 128

# 3. Other size classes (regression test):
./out/release/bench_fixed_size_hakmem 10000 256 128
./out/release/bench_fixed_size_hakmem 10000 1024 128

# 4. Stress test (100K iterations, various worksets):
for ws in 32 64 96 128 192 256; do
  echo "Testing workset=$ws..."
  ./out/release/bench_fixed_size_hakmem 100000 16 $ws || echo "FAIL: workset=$ws"
done
```

### Debug Validation

After applying the fix, verify with debug logging:

```bash
HAKMEM_SS_ACQUIRE_DEBUG=1 ./out/release/bench_fixed_size_hakmem 2150 16 64 2>&1 | \
  grep "ss=(nil)"

# Expected: No output (no NULL ss should reach Stage 1 activation)
```

---

## Impact Assessment

### Severity: **CRITICAL (P0)**

- **Reliability**: Crash in production workloads with high allocation churn
- **Frequency**: Deterministic after ~2150 iterations (workload-dependent)
- **Scope**: Affects all allocations using the shared pool (Phase 12+)

### Affected Components

1. **Shared SuperSlab Pool** (`core/hakmem_shared_pool.c`)
   - Stage 1 lock-free freelist reuse path
2. **TLS SLL Drain** (indirectly)
   - Triggers the slab releases that populate the freelist
3. **All benchmarks using fixed worksets**
   - `bench_fixed_size_hakmem`
   - Potentially `bench_random_mixed_hakmem` with high churn

### Pre-Existing or Phase 13-B?

**Pre-existing bug** in the Phase 12 shared pool implementation. **Not caused by Phase 13-B changes** (TinyHeapV2 supply hook):

- The crash reproduces with `HAKMEM_TINY_HEAP_V2=0` (HeapV2 disabled)
- The root cause is in the Stage 1 freelist logic (lines 562-592)
- Phase 13-B only added a supply hook in `tiny_free_fast_v2.inc.h` (a separate code path)

---

## Related Issues

### Similar Bugs Fixed Previously

1.
   **Stage 2 NULL check** (lines 618-622):
   - Added in a previous RACE FIX commit
   - Comment: "SuperSlab was freed between claiming and loading"
   - **Same pattern, but Stage 1 was missed!**
2. **sp_meta->ss NULL store** (line 862):
   - Added in RACE FIX: "Set meta->ss to NULL BEFORE unlocking mutex"
   - Correctly prevents Stage 2 from accessing the freed SuperSlab
   - **But the Stage 1 freelist can still hold stale pointers**

### Design Flaw: Freelist Lifetime Management

The root issue is **decoupled lifetimes**:

- Freelist nodes live in a global pool (`g_free_node_pool`, never freed)
- SuperSlabs are dynamically freed (line 870: `superslab_free(ss)`)
- There is no mechanism to invalidate freelist nodes when a SuperSlab is freed

**Potential long-term fixes** (beyond this patch):

1. **Generation counter** in `SharedSSMeta`:
   - Increment on each SuperSlab allocation/free
   - Each freelist node stores the generation number observed at push time
   - The pop path checks whether the generation still matches (stale node → skip)
2. **Lazy freelist cleanup**:
   - Before freeing a SuperSlab, scan the freelist and remove matching nodes
   - Requires lock-free list traversal or a fallback to the mutex
3.
   **Reference counting** on `SharedSSMeta`:
   - Increment when pushing to the freelist
   - Decrement when popping or when freeing the SuperSlab
   - Only free the SuperSlab when the refcount reaches 0

---

## Files Involved

### Primary Bug Location

- `/mnt/workdisk/public_share/hakmem/core/hakmem_shared_pool.c`
  - Lines 562-592: Stage 1 (lock-free freelist reuse) - **MISSING NULL CHECK**
  - Lines 618-622: Stage 2 (lock-free unused claiming) - **HAS NULL CHECK** ✅
  - Line 840: `sp_freelist_push_lockfree()` - pushes the slot to the freelist
  - Line 862: Sets `sp_meta->ss = NULL` before freeing the SuperSlab
  - Line 870: `superslab_free(ss)` - frees the SuperSlab memory

### Related Files (Context)

- `/mnt/workdisk/public_share/hakmem/benchmarks/src/fixed/bench_fixed_size.c`
  - Benchmark that triggers the crash (workset=64 pattern)
- `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_drain_box.h`
  - TLS SLL drain interval (2048) - affects when slabs are released
- `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_free.inc.h`
  - Lines 234-235: Calls `shared_pool_release_slab()` when a slab is empty

---

## Summary

### What Happened

1. **workset=64, iterations=2150** creates high allocation churn
2. After ~67 drain cycles, many slabs are released to the shared pool
3. Some SuperSlabs become completely empty → freed
4. The freelist contains slots whose SuperSlabs are already freed (`ss = NULL`)
5. Stage 1 pops a stale slot, loads `ss = NULL`, and crashes on the dereference

### Why It Wasn't Caught Earlier

1. **Low iteration count** in normal testing (< 2000 iterations)
2. **Stage 2 already has a NULL check** - it was assumed Stage 1 was also safe
3.
   **The race window is small** - it only opens when:
   - The freelist is non-empty (needs prior releases)
   - A SuperSlab becomes completely empty (all slots freed)
   - Another thread pops before the SuperSlab is reallocated

### The Fix

Add a NULL check in Stage 1 after loading `ss`, matching Stage 2's pattern:

```c
SuperSlab* ss = atomic_load_explicit(&reuse_meta->ss, memory_order_relaxed);
if (!ss) {
    // SuperSlab freed - skip and fall through to Stage 2/3
    pthread_mutex_unlock(&g_shared_pool.alloc_lock);
    goto stage2_fallback;  // or return and retry
}
```

**Impact**: Minimal overhead (one NULL check per Stage 1 hit); fixes a critical crash.

---

## Action Items

- [ ] Apply the minimal NULL check patch to `shared_pool_acquire_slab()` Stage 1
- [ ] Rebuild and test the crash cases (workset=64, iterations=2150/10000)
- [ ] Run the stress test (100K iterations, worksets 32-256)
- [ ] Verify with debug logging (no `ss=(nil)` in Stage 1)
- [ ] Consider a long-term fix (generation counter or refcounting)
- [ ] Update `CURRENT_TASK.md` with the fix status

---

**Report End**