Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
413 lines
11 KiB
Markdown
413 lines
11 KiB
Markdown
# Fix Implementation Guide: Remove Unsafe Drain Operations
|
|
|
|
**Date**: 2025-11-04
|
|
**Target**: Eliminate concurrent freelist corruption
|
|
**Approach**: Remove Fix #1 and Fix #2, keep Fix #3, fix refill path ownership ordering
|
|
|
|
---
|
|
|
|
## Changes Required
|
|
|
|
### Change 1: Remove Fix #1 (superslab_refill Priority 1 drain)
|
|
|
|
**File**: `core/hakmem_tiny_free.inc`
|
|
**Lines**: 615-621
|
|
**Action**: Comment out or delete
|
|
|
|
**Before**:
|
|
```c
|
|
// Priority 1: Reuse slabs with freelist (already freed blocks)
|
|
int tls_cap = ss_slabs_capacity(tls->ss);
|
|
for (int i = 0; i < tls_cap; i++) {
|
|
// BUGFIX: Drain remote frees before checking freelist (fixes FAST_CAP=0 SEGV)
|
|
int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i], memory_order_acquire) != 0);
|
|
if (has_remote) {
|
|
ss_remote_drain_to_freelist(tls->ss, i); // ❌ REMOVE THIS
|
|
}
|
|
|
|
if (tls->ss->slabs[i].freelist) {
|
|
// ... rest of logic
|
|
}
|
|
}
|
|
```
|
|
|
|
**After**:
|
|
```c
|
|
// Priority 1: Reuse slabs with freelist (already freed blocks)
|
|
int tls_cap = ss_slabs_capacity(tls->ss);
|
|
for (int i = 0; i < tls_cap; i++) {
|
|
// REMOVED: Unsafe drain without ownership check (caused concurrent freelist corruption)
|
|
// Remote draining is now handled only in paths where ownership is guaranteed:
|
|
// 1. Mailbox path (tiny_refill.h:100-106) - claims ownership BEFORE draining
|
|
// 2. Sticky/hot/bench paths (tiny_refill.h) - claims ownership BEFORE draining
|
|
|
|
if (tls->ss->slabs[i].freelist) {
|
|
// ... rest of logic (unchanged)
|
|
}
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
### Change 2: Remove Fix #2 (hak_tiny_alloc_superslab drain)
|
|
|
|
**File**: `core/hakmem_tiny_free.inc`
|
|
**Lines**: 729-767 (entire block)
|
|
**Action**: Comment out or delete
|
|
|
|
**Before**:
|
|
```c
|
|
static inline void* hak_tiny_alloc_superslab(int class_idx) {
|
|
tiny_debug_ring_record(TINY_RING_EVENT_ALLOC_ENTER, 0x01, (void*)(uintptr_t)class_idx, 0);
|
|
TinyTLSSlab* tls = &g_tls_slabs[class_idx];
|
|
TinySlabMeta* meta = tls->meta;
|
|
|
|
// BUGFIX: Drain ALL slabs' remote queues BEFORE any allocation attempt (fixes FAST_CAP=0 SEGV)
|
|
// [... 40 lines of drain logic ...]
|
|
|
|
// Fast path: Direct metadata access
|
|
if (meta && meta->freelist == NULL && meta->used < meta->capacity && tls->slab_base) {
|
|
// ...
|
|
}
|
|
```
|
|
|
|
**After**:
|
|
```c
|
|
static inline void* hak_tiny_alloc_superslab(int class_idx) {
|
|
tiny_debug_ring_record(TINY_RING_EVENT_ALLOC_ENTER, 0x01, (void*)(uintptr_t)class_idx, 0);
|
|
TinyTLSSlab* tls = &g_tls_slabs[class_idx];
|
|
TinySlabMeta* meta = tls->meta;
|
|
|
|
// REMOVED Fix #2: Unsafe drain of ALL slabs without ownership check
|
|
// This caused concurrent freelist corruption when multiple threads operated on the same SuperSlab.
|
|
// Remote draining is now handled exclusively in ownership-safe paths (Mailbox, refill with bind).
|
|
|
|
// Fast path: Direct metadata access (unchanged)
|
|
if (meta && meta->freelist == NULL && meta->used < meta->capacity && tls->slab_base) {
|
|
// ...
|
|
}
|
|
```
|
|
|
|
**Specific lines to remove**: 729-767 (the entire `if (tls->ss && meta)` block with drain loop)
|
|
|
|
---
|
|
|
|
### Change 3: Fix Sticky Ring Path (claim ownership BEFORE drain)
|
|
|
|
**File**: `core/tiny_refill.h`
|
|
**Lines**: 46-51
|
|
**Action**: Reorder operations
|
|
|
|
**Before**:
|
|
```c
|
|
if (lm->freelist || has_remote) {
|
|
if (!lm->freelist && has_remote) ss_remote_drain_to_freelist(last_ss, li); // ❌ Drain BEFORE ownership
|
|
if (lm->freelist) {
|
|
tiny_tls_bind_slab(tls, last_ss, li);
|
|
ss_owner_cas(lm, tiny_self_u32()); // ← Ownership AFTER drain
|
|
return last_ss;
|
|
}
|
|
}
|
|
```
|
|
|
|
**After**:
|
|
```c
|
|
if (lm->freelist || has_remote) {
|
|
// ✅ BUGFIX: Claim ownership BEFORE draining (prevents concurrent freelist modification)
|
|
tiny_tls_bind_slab(tls, last_ss, li);
|
|
ss_owner_cas(lm, tiny_self_u32());
|
|
|
|
// NOW safe to drain - we own the slab
|
|
if (!lm->freelist && has_remote) {
|
|
ss_remote_drain_to_freelist(last_ss, li);
|
|
}
|
|
|
|
if (lm->freelist) {
|
|
return last_ss;
|
|
}
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
### Change 4: Fix Hot Slot Path (claim ownership BEFORE drain)
|
|
|
|
**File**: `core/tiny_refill.h`
|
|
**Lines**: 64-66
|
|
**Action**: Reorder operations
|
|
|
|
**Before**:
|
|
```c
|
|
TinySlabMeta* m = &hss->slabs[hidx];
|
|
if (!m->freelist && atomic_load_explicit(&hss->remote_heads[hidx], memory_order_acquire) != 0)
|
|
ss_remote_drain_to_freelist(hss, hidx); // ❌ Drain BEFORE ownership
|
|
if (m->freelist) {
|
|
tiny_tls_bind_slab(tls, hss, hidx);
|
|
ss_owner_cas(m, tiny_self_u32()); // ← Ownership AFTER drain
|
|
tiny_sticky_save(class_idx, hss, (uint8_t)hidx);
|
|
return hss;
|
|
}
|
|
```
|
|
|
|
**After**:
|
|
```c
|
|
TinySlabMeta* m = &hss->slabs[hidx];
|
|
|
|
// ✅ BUGFIX: Claim ownership BEFORE draining
|
|
tiny_tls_bind_slab(tls, hss, hidx);
|
|
ss_owner_cas(m, tiny_self_u32());
|
|
|
|
// NOW safe to drain - we own the slab
|
|
if (!m->freelist && atomic_load_explicit(&hss->remote_heads[hidx], memory_order_acquire) != 0) {
|
|
ss_remote_drain_to_freelist(hss, hidx);
|
|
}
|
|
|
|
if (m->freelist) {
|
|
tiny_sticky_save(class_idx, hss, (uint8_t)hidx);
|
|
return hss;
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
### Change 5: Fix Bench Path (claim ownership BEFORE drain)
|
|
|
|
**File**: `core/tiny_refill.h`
|
|
**Lines**: 79-81
|
|
**Action**: Reorder operations
|
|
|
|
**Before**:
|
|
```c
|
|
TinySlabMeta* m = &bss->slabs[bidx];
|
|
if (!m->freelist && atomic_load_explicit(&bss->remote_heads[bidx], memory_order_acquire) != 0)
|
|
ss_remote_drain_to_freelist(bss, bidx); // ❌ Drain BEFORE ownership
|
|
if (m->freelist) {
|
|
tiny_tls_bind_slab(tls, bss, bidx);
|
|
ss_owner_cas(m, tiny_self_u32()); // ← Ownership AFTER drain
|
|
tiny_sticky_save(class_idx, bss, (uint8_t)bidx);
|
|
return bss;
|
|
}
|
|
```
|
|
|
|
**After**:
|
|
```c
|
|
TinySlabMeta* m = &bss->slabs[bidx];
|
|
|
|
// ✅ BUGFIX: Claim ownership BEFORE draining
|
|
tiny_tls_bind_slab(tls, bss, bidx);
|
|
ss_owner_cas(m, tiny_self_u32());
|
|
|
|
// NOW safe to drain - we own the slab
|
|
if (!m->freelist && atomic_load_explicit(&bss->remote_heads[bidx], memory_order_acquire) != 0) {
|
|
ss_remote_drain_to_freelist(bss, bidx);
|
|
}
|
|
|
|
if (m->freelist) {
|
|
tiny_sticky_save(class_idx, bss, (uint8_t)bidx);
|
|
return bss;
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
### Change 6: Fix mmap_gate Path (claim ownership BEFORE drain)
|
|
|
|
**File**: `core/tiny_mmap_gate.h`
|
|
**Lines**: 56-58
|
|
**Action**: Reorder operations
|
|
|
|
**Before**:
|
|
```c
|
|
TinySlabMeta* m = &cand->slabs[s];
|
|
int has_remote = (atomic_load_explicit(&cand->remote_heads[s], memory_order_acquire) != 0);
|
|
if (m->freelist || has_remote) {
|
|
if (!m->freelist && has_remote) ss_remote_drain_to_freelist(cand, s); // ❌ Drain BEFORE ownership
|
|
if (m->freelist) {
|
|
tiny_tls_bind_slab(tls, cand, s);
|
|
ss_owner_cas(m, tiny_self_u32()); // ← Ownership AFTER drain
|
|
return cand;
|
|
}
|
|
}
|
|
```
|
|
|
|
**After**:
|
|
```c
|
|
TinySlabMeta* m = &cand->slabs[s];
|
|
int has_remote = (atomic_load_explicit(&cand->remote_heads[s], memory_order_acquire) != 0);
|
|
if (m->freelist || has_remote) {
|
|
// ✅ BUGFIX: Claim ownership BEFORE draining
|
|
tiny_tls_bind_slab(tls, cand, s);
|
|
ss_owner_cas(m, tiny_self_u32());
|
|
|
|
// NOW safe to drain - we own the slab
|
|
if (!m->freelist && has_remote) {
|
|
ss_remote_drain_to_freelist(cand, s);
|
|
}
|
|
|
|
if (m->freelist) {
|
|
return cand;
|
|
}
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Testing Plan
|
|
|
|
### Test 1: Baseline (Current Crashes)
|
|
|
|
```bash
|
|
# Build with current code (before fixes)
|
|
make clean && make -s larson_hakmem
|
|
|
|
# Run repro mode (should crash around 4000 events)
|
|
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 4
|
|
```
|
|
|
|
**Expected**: Crash at ~4000 events with `fault_addr=0x6261`
|
|
|
|
---
|
|
|
|
### Test 2: Apply Fix (Remove Fix #1 and Fix #2 ONLY)
|
|
|
|
```bash
|
|
# Apply Changes 1 and 2 (comment out Fix #1 and Fix #2)
|
|
# Rebuild
|
|
make clean && make -s larson_hakmem
|
|
|
|
# Run repro mode
|
|
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 10
|
|
```
|
|
|
|
**Expected**:
|
|
- If crashes stop → Fix #1/#2 were the main culprits ✅
|
|
- If crashes continue → Need to apply Changes 3-6
|
|
|
|
---
|
|
|
|
### Test 3: Apply All Fixes (Changes 1-6)
|
|
|
|
```bash
|
|
# Apply all changes
|
|
# Rebuild
|
|
make clean && make -s larson_hakmem
|
|
|
|
# Run extended test
|
|
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 20
|
|
```
|
|
|
|
**Expected**: NO crashes, stable execution for full 20 seconds
|
|
|
|
---
|
|
|
|
### Test 4: Guard Mode (Maximum Stress)
|
|
|
|
```bash
|
|
# Rebuild with all fixes
|
|
make clean && make -s larson_hakmem
|
|
|
|
# Run guard mode (stricter checks)
|
|
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh guard 30 20
|
|
```
|
|
|
|
**Expected**: NO crashes, reaches 30+ seconds
|
|
|
|
---
|
|
|
|
## Verification Checklist
|
|
|
|
After applying fixes, verify:
|
|
|
|
- [ ] Fix #1 code (hakmem_tiny_free.inc:615-621) commented out or deleted
|
|
- [ ] Fix #2 code (hakmem_tiny_free.inc:729-767) commented out or deleted
|
|
- [ ] Fix #3 (tiny_refill.h:100-106) unchanged (already correct)
|
|
- [ ] Sticky path (tiny_refill.h:46-51) reordered: ownership BEFORE drain
|
|
- [ ] Hot slot path (tiny_refill.h:64-66) reordered: ownership BEFORE drain
|
|
- [ ] Bench path (tiny_refill.h:79-81) reordered: ownership BEFORE drain
|
|
- [ ] mmap_gate path (tiny_mmap_gate.h:56-58) reordered: ownership BEFORE drain
|
|
- [ ] All changes compile without errors
|
|
- [ ] Benchmark runs without crashes for 30+ seconds
|
|
|
|
---
|
|
|
|
## Expected Results
|
|
|
|
### Before Fixes
|
|
|
|
| Test | Duration | Events | Result |
|
|
|------|----------|--------|--------|
|
|
| repro mode | ~4 sec | ~4012 | ❌ CRASH at fault_addr=0x6261 |
|
|
| guard mode | ~2 sec | ~2137 | ❌ CRASH at fault_addr=0x6261 |
|
|
|
|
### After Fixes (Changes 1-2 only)
|
|
|
|
| Test | Duration | Events | Result |
|
|
|------|----------|--------|--------|
|
|
| repro mode | 10+ sec | 10000+ | ✅ NO CRASH or ⚠️ occasional crash |
|
|
| guard mode | 10+ sec | 10000+ | ✅ NO CRASH or ⚠️ occasional crash |
|
|
|
|
### After All Fixes (Changes 1-6)
|
|
|
|
| Test | Duration | Events | Result |
|
|
|------|----------|--------|--------|
|
|
| repro mode | 20+ sec | 20000+ | ✅ NO CRASH |
|
|
| guard mode | 30+ sec | 30000+ | ✅ NO CRASH |
|
|
|
|
---
|
|
|
|
## Rollback Plan
|
|
|
|
If fixes cause new issues:
|
|
|
|
1. **Revert Changes 3-6** (keep Changes 1-2):
|
|
- Restore original sticky/hot/bench/mmap_gate paths
|
|
- This removes Fix #1/#2 but keeps old refill ordering
|
|
- Test again
|
|
|
|
2. **Revert All Changes**:
|
|
```bash
|
|
git checkout core/hakmem_tiny_free.inc
|
|
git checkout core/tiny_refill.h
|
|
git checkout core/tiny_mmap_gate.h
|
|
make clean && make
|
|
```
|
|
|
|
3. **Try Alternative**: Option B from ULTRATHINK_ANALYSIS.md (add ownership checks instead of removing)
|
|
|
|
---
|
|
|
|
## Additional Debugging (If Crashes Persist)
|
|
|
|
If crashes continue after all fixes:
|
|
|
|
1. **Enable ownership assertion**:
|
|
```c
|
|
// In hakmem_tiny_superslab.h:345, add at top of ss_remote_drain_to_freelist:
|
|
#ifdef HAKMEM_DEBUG_OWNERSHIP
|
|
TinySlabMeta* m = &ss->slabs[slab_idx];
|
|
uint32_t owner = m->owner_tid;
|
|
uint32_t self = tiny_self_u32();
|
|
if (owner != 0 && owner != self) {
|
|
fprintf(stderr, "[OWNERSHIP ERROR] Thread %u draining slab %d owned by %u!\n",
|
|
self, slab_idx, owner);
|
|
abort();
|
|
}
|
|
#endif
|
|
```
|
|
|
|
2. **Rebuild with debug flag**:
|
|
```bash
|
|
make clean
|
|
CFLAGS="-DHAKMEM_DEBUG_OWNERSHIP=1" make -s larson_hakmem
|
|
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 10
|
|
```
|
|
|
|
3. **Check for other unsafe drain sites**:
|
|
```bash
|
|
grep -n "ss_remote_drain_to_freelist" core/*.{c,inc,h} | grep -v "^//"
|
|
```
|
|
|
|
---
|
|
|
|
**END OF IMPLEMENTATION GUIDE**
|