Files
hakmem/docs/design/FIX_IMPLEMENTATION_GUIDE.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 13:14:18 +09:00

413 lines
11 KiB
Markdown

# Fix Implementation Guide: Remove Unsafe Drain Operations
**Date**: 2025-11-04
**Target**: Eliminate concurrent freelist corruption
**Approach**: Remove Fix #1 and Fix #2, keep Fix #3, fix refill path ownership ordering
---
## Changes Required
### Change 1: Remove Fix #1 (superslab_refill Priority 1 drain)
**File**: `core/hakmem_tiny_free.inc`
**Lines**: 615-621
**Action**: Comment out or delete
**Before**:
```c
// Priority 1: Reuse slabs with freelist (already freed blocks)
int tls_cap = ss_slabs_capacity(tls->ss);
for (int i = 0; i < tls_cap; i++) {
// BUGFIX: Drain remote frees before checking freelist (fixes FAST_CAP=0 SEGV)
int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i], memory_order_acquire) != 0);
if (has_remote) {
ss_remote_drain_to_freelist(tls->ss, i); // ❌ REMOVE THIS
}
if (tls->ss->slabs[i].freelist) {
// ... rest of logic
}
}
```
**After**:
```c
// Priority 1: Reuse slabs with freelist (already freed blocks)
int tls_cap = ss_slabs_capacity(tls->ss);
for (int i = 0; i < tls_cap; i++) {
// REMOVED: Unsafe drain without ownership check (caused concurrent freelist corruption)
// Remote draining is now handled only in paths where ownership is guaranteed:
// 1. Mailbox path (tiny_refill.h:100-106) - claims ownership BEFORE draining
// 2. Sticky/hot/bench paths (tiny_refill.h) - claims ownership BEFORE draining
if (tls->ss->slabs[i].freelist) {
// ... rest of logic (unchanged)
}
}
```
---
### Change 2: Remove Fix #2 (hak_tiny_alloc_superslab drain)
**File**: `core/hakmem_tiny_free.inc`
**Lines**: 729-767 (entire block)
**Action**: Comment out or delete
**Before**:
```c
static inline void* hak_tiny_alloc_superslab(int class_idx) {
tiny_debug_ring_record(TINY_RING_EVENT_ALLOC_ENTER, 0x01, (void*)(uintptr_t)class_idx, 0);
TinyTLSSlab* tls = &g_tls_slabs[class_idx];
TinySlabMeta* meta = tls->meta;
// BUGFIX: Drain ALL slabs' remote queues BEFORE any allocation attempt (fixes FAST_CAP=0 SEGV)
// [... 40 lines of drain logic ...]
// Fast path: Direct metadata access
if (meta && meta->freelist == NULL && meta->used < meta->capacity && tls->slab_base) {
// ...
}
```
**After**:
```c
static inline void* hak_tiny_alloc_superslab(int class_idx) {
tiny_debug_ring_record(TINY_RING_EVENT_ALLOC_ENTER, 0x01, (void*)(uintptr_t)class_idx, 0);
TinyTLSSlab* tls = &g_tls_slabs[class_idx];
TinySlabMeta* meta = tls->meta;
// REMOVED Fix #2: Unsafe drain of ALL slabs without ownership check
// This caused concurrent freelist corruption when multiple threads operated on the same SuperSlab.
// Remote draining is now handled exclusively in ownership-safe paths (Mailbox, refill with bind).
// Fast path: Direct metadata access (unchanged)
if (meta && meta->freelist == NULL && meta->used < meta->capacity && tls->slab_base) {
// ...
}
```
**Specific lines to remove**: 729-767 (the entire `if (tls->ss && meta)` block with drain loop)
---
### Change 3: Fix Sticky Ring Path (claim ownership BEFORE drain)
**File**: `core/tiny_refill.h`
**Lines**: 46-51
**Action**: Reorder operations
**Before**:
```c
if (lm->freelist || has_remote) {
if (!lm->freelist && has_remote) ss_remote_drain_to_freelist(last_ss, li); // ❌ Drain BEFORE ownership
if (lm->freelist) {
tiny_tls_bind_slab(tls, last_ss, li);
ss_owner_cas(lm, tiny_self_u32()); // ← Ownership AFTER drain
return last_ss;
}
}
```
**After**:
```c
if (lm->freelist || has_remote) {
// ✅ BUGFIX: Claim ownership BEFORE draining (prevents concurrent freelist modification)
tiny_tls_bind_slab(tls, last_ss, li);
ss_owner_cas(lm, tiny_self_u32());
// NOW safe to drain - we own the slab
if (!lm->freelist && has_remote) {
ss_remote_drain_to_freelist(last_ss, li);
}
if (lm->freelist) {
return last_ss;
}
}
```
---
### Change 4: Fix Hot Slot Path (claim ownership BEFORE drain)
**File**: `core/tiny_refill.h`
**Lines**: 64-66
**Action**: Reorder operations
**Before**:
```c
TinySlabMeta* m = &hss->slabs[hidx];
if (!m->freelist && atomic_load_explicit(&hss->remote_heads[hidx], memory_order_acquire) != 0)
ss_remote_drain_to_freelist(hss, hidx); // ❌ Drain BEFORE ownership
if (m->freelist) {
tiny_tls_bind_slab(tls, hss, hidx);
ss_owner_cas(m, tiny_self_u32()); // ← Ownership AFTER drain
tiny_sticky_save(class_idx, hss, (uint8_t)hidx);
return hss;
}
```
**After**:
```c
TinySlabMeta* m = &hss->slabs[hidx];
// ✅ BUGFIX: Claim ownership BEFORE draining
tiny_tls_bind_slab(tls, hss, hidx);
ss_owner_cas(m, tiny_self_u32());
// NOW safe to drain - we own the slab
if (!m->freelist && atomic_load_explicit(&hss->remote_heads[hidx], memory_order_acquire) != 0) {
ss_remote_drain_to_freelist(hss, hidx);
}
if (m->freelist) {
tiny_sticky_save(class_idx, hss, (uint8_t)hidx);
return hss;
}
```
---
### Change 5: Fix Bench Path (claim ownership BEFORE drain)
**File**: `core/tiny_refill.h`
**Lines**: 79-81
**Action**: Reorder operations
**Before**:
```c
TinySlabMeta* m = &bss->slabs[bidx];
if (!m->freelist && atomic_load_explicit(&bss->remote_heads[bidx], memory_order_acquire) != 0)
ss_remote_drain_to_freelist(bss, bidx); // ❌ Drain BEFORE ownership
if (m->freelist) {
tiny_tls_bind_slab(tls, bss, bidx);
ss_owner_cas(m, tiny_self_u32()); // ← Ownership AFTER drain
tiny_sticky_save(class_idx, bss, (uint8_t)bidx);
return bss;
}
```
**After**:
```c
TinySlabMeta* m = &bss->slabs[bidx];
// ✅ BUGFIX: Claim ownership BEFORE draining
tiny_tls_bind_slab(tls, bss, bidx);
ss_owner_cas(m, tiny_self_u32());
// NOW safe to drain - we own the slab
if (!m->freelist && atomic_load_explicit(&bss->remote_heads[bidx], memory_order_acquire) != 0) {
ss_remote_drain_to_freelist(bss, bidx);
}
if (m->freelist) {
tiny_sticky_save(class_idx, bss, (uint8_t)bidx);
return bss;
}
```
---
### Change 6: Fix mmap_gate Path (claim ownership BEFORE drain)
**File**: `core/tiny_mmap_gate.h`
**Lines**: 56-58
**Action**: Reorder operations
**Before**:
```c
TinySlabMeta* m = &cand->slabs[s];
int has_remote = (atomic_load_explicit(&cand->remote_heads[s], memory_order_acquire) != 0);
if (m->freelist || has_remote) {
if (!m->freelist && has_remote) ss_remote_drain_to_freelist(cand, s); // ❌ Drain BEFORE ownership
if (m->freelist) {
tiny_tls_bind_slab(tls, cand, s);
ss_owner_cas(m, tiny_self_u32()); // ← Ownership AFTER drain
return cand;
}
}
```
**After**:
```c
TinySlabMeta* m = &cand->slabs[s];
int has_remote = (atomic_load_explicit(&cand->remote_heads[s], memory_order_acquire) != 0);
if (m->freelist || has_remote) {
// ✅ BUGFIX: Claim ownership BEFORE draining
tiny_tls_bind_slab(tls, cand, s);
ss_owner_cas(m, tiny_self_u32());
// NOW safe to drain - we own the slab
if (!m->freelist && has_remote) {
ss_remote_drain_to_freelist(cand, s);
}
if (m->freelist) {
return cand;
}
}
```
---
## Testing Plan
### Test 1: Baseline (Current Crashes)
```bash
# Build with current code (before fixes)
make clean && make -s larson_hakmem
# Run repro mode (should crash around 4000 events)
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 4
```
**Expected**: Crash at ~4000 events with `fault_addr=0x6261`
---
### Test 2: Apply Fix (Remove Fix #1 and Fix #2 ONLY)
```bash
# Apply Changes 1 and 2 (comment out Fix #1 and Fix #2)
# Rebuild
make clean && make -s larson_hakmem
# Run repro mode
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 10
```
**Expected**:
- If crashes stop → Fix #1/#2 were the main culprits ✅
- If crashes continue → Need to apply Changes 3-6
---
### Test 3: Apply All Fixes (Changes 1-6)
```bash
# Apply all changes
# Rebuild
make clean && make -s larson_hakmem
# Run extended test
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 20
```
**Expected**: NO crashes, stable execution for full 20 seconds
---
### Test 4: Guard Mode (Maximum Stress)
```bash
# Rebuild with all fixes
make clean && make -s larson_hakmem
# Run guard mode (stricter checks)
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh guard 30 20
```
**Expected**: NO crashes, reaches 30+ seconds
---
## Verification Checklist
After applying fixes, verify:
- [ ] Fix #1 code (hakmem_tiny_free.inc:615-621) commented out or deleted
- [ ] Fix #2 code (hakmem_tiny_free.inc:729-767) commented out or deleted
- [ ] Fix #3 (tiny_refill.h:100-106) unchanged (already correct)
- [ ] Sticky path (tiny_refill.h:46-51) reordered: ownership BEFORE drain
- [ ] Hot slot path (tiny_refill.h:64-66) reordered: ownership BEFORE drain
- [ ] Bench path (tiny_refill.h:79-81) reordered: ownership BEFORE drain
- [ ] mmap_gate path (tiny_mmap_gate.h:56-58) reordered: ownership BEFORE drain
- [ ] All changes compile without errors
- [ ] Benchmark runs without crashes for 30+ seconds
---
## Expected Results
### Before Fixes
| Test | Duration | Events | Result |
|------|----------|--------|--------|
| repro mode | ~4 sec | ~4012 | ❌ CRASH at fault_addr=0x6261 |
| guard mode | ~2 sec | ~2137 | ❌ CRASH at fault_addr=0x6261 |
### After Fixes (Changes 1-2 only)
| Test | Duration | Events | Result |
|------|----------|--------|--------|
| repro mode | 10+ sec | 10000+ | ✅ NO CRASH or ⚠️ occasional crash |
| guard mode | 10+ sec | 10000+ | ✅ NO CRASH or ⚠️ occasional crash |
### After All Fixes (Changes 1-6)
| Test | Duration | Events | Result |
|------|----------|--------|--------|
| repro mode | 20+ sec | 20000+ | ✅ NO CRASH |
| guard mode | 30+ sec | 30000+ | ✅ NO CRASH |
---
## Rollback Plan
If fixes cause new issues:
1. **Revert Changes 3-6** (keep Changes 1-2):
- Restore original sticky/hot/bench/mmap_gate paths
- This removes Fix #1/#2 but keeps old refill ordering
- Test again
2. **Revert All Changes**:
```bash
git checkout core/hakmem_tiny_free.inc
git checkout core/tiny_refill.h
git checkout core/tiny_mmap_gate.h
make clean && make
```
3. **Try Alternative**: Option B from ULTRATHINK_ANALYSIS.md (add ownership checks instead of removing)
---
## Additional Debugging (If Crashes Persist)
If crashes continue after all fixes:
1. **Enable ownership assertion**:
```c
// In hakmem_tiny_superslab.h:345, add at top of ss_remote_drain_to_freelist:
#ifdef HAKMEM_DEBUG_OWNERSHIP
TinySlabMeta* m = &ss->slabs[slab_idx];
uint32_t owner = m->owner_tid;
uint32_t self = tiny_self_u32();
if (owner != 0 && owner != self) {
fprintf(stderr, "[OWNERSHIP ERROR] Thread %u draining slab %d owned by %u!\n",
self, slab_idx, owner);
abort();
}
#endif
```
2. **Rebuild with debug flag**:
```bash
make clean
CFLAGS="-DHAKMEM_DEBUG_OWNERSHIP=1" make -s larson_hakmem
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 10
```
3. **Check for other unsafe drain sites**:
```bash
grep -n "ss_remote_drain_to_freelist" core/*.{c,inc,h} | grep -v "^//"
```
---
**END OF IMPLEMENTATION GUIDE**