Files
hakmem/docs/design/FIX_IMPLEMENTATION_GUIDE.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 13:14:18 +09:00

11 KiB

Fix Implementation Guide: Remove Unsafe Drain Operations

Date: 2025-11-04 Target: Eliminate concurrent freelist corruption Approach: Remove Fix #1 and Fix #2, keep Fix #3, fix refill path ownership ordering


Changes Required

Change 1: Remove Fix #1 (superslab_refill Priority 1 drain)

File: core/hakmem_tiny_free.inc Lines: 615-621 Action: Comment out or delete

Before:

// Priority 1: Reuse slabs with freelist (already freed blocks)
int tls_cap = ss_slabs_capacity(tls->ss);
for (int i = 0; i < tls_cap; i++) {
    // BUGFIX: Drain remote frees before checking freelist (fixes FAST_CAP=0 SEGV)
    int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i], memory_order_acquire) != 0);
    if (has_remote) {
        ss_remote_drain_to_freelist(tls->ss, i);  // ❌ REMOVE THIS
    }

    if (tls->ss->slabs[i].freelist) {
        // ... rest of logic
    }
}

After:

// Priority 1: Reuse slabs with freelist (already freed blocks)
int tls_cap = ss_slabs_capacity(tls->ss);
for (int i = 0; i < tls_cap; i++) {
    // REMOVED: Unsafe drain without ownership check (caused concurrent freelist corruption)
    // Remote draining is now handled only in paths where ownership is guaranteed:
    // 1. Mailbox path (tiny_refill.h:100-106) - claims ownership BEFORE draining
    // 2. Sticky/hot/bench paths (tiny_refill.h) - claims ownership BEFORE draining

    if (tls->ss->slabs[i].freelist) {
        // ... rest of logic (unchanged)
    }
}

Change 2: Remove Fix #2 (hak_tiny_alloc_superslab drain)

File: core/hakmem_tiny_free.inc Lines: 729-767 (entire block) Action: Comment out or delete

Before:

static inline void* hak_tiny_alloc_superslab(int class_idx) {
    tiny_debug_ring_record(TINY_RING_EVENT_ALLOC_ENTER, 0x01, (void*)(uintptr_t)class_idx, 0);
    TinyTLSSlab* tls = &g_tls_slabs[class_idx];
    TinySlabMeta* meta = tls->meta;

    // BUGFIX: Drain ALL slabs' remote queues BEFORE any allocation attempt (fixes FAST_CAP=0 SEGV)
    // [... 40 lines of drain logic ...]

    // Fast path: Direct metadata access
    if (meta && meta->freelist == NULL && meta->used < meta->capacity && tls->slab_base) {
        // ...
    }

After:

static inline void* hak_tiny_alloc_superslab(int class_idx) {
    tiny_debug_ring_record(TINY_RING_EVENT_ALLOC_ENTER, 0x01, (void*)(uintptr_t)class_idx, 0);
    TinyTLSSlab* tls = &g_tls_slabs[class_idx];
    TinySlabMeta* meta = tls->meta;

    // REMOVED Fix #2: Unsafe drain of ALL slabs without ownership check
    // This caused concurrent freelist corruption when multiple threads operated on the same SuperSlab.
    // Remote draining is now handled exclusively in ownership-safe paths (Mailbox, refill with bind).

    // Fast path: Direct metadata access (unchanged)
    if (meta && meta->freelist == NULL && meta->used < meta->capacity && tls->slab_base) {
        // ...
    }

Specific lines to remove: 729-767 (the entire if (tls->ss && meta) block with drain loop)


Change 3: Fix Sticky Ring Path (claim ownership BEFORE drain)

File: core/tiny_refill.h Lines: 46-51 Action: Reorder operations

Before:

if (lm->freelist || has_remote) {
    if (!lm->freelist && has_remote) ss_remote_drain_to_freelist(last_ss, li);  // ❌ Drain BEFORE ownership
    if (lm->freelist) {
        tiny_tls_bind_slab(tls, last_ss, li);
        ss_owner_cas(lm, tiny_self_u32());  // ← Ownership AFTER drain
        return last_ss;
    }
}

After:

if (lm->freelist || has_remote) {
    // ✅ BUGFIX: Claim ownership BEFORE draining (prevents concurrent freelist modification)
    tiny_tls_bind_slab(tls, last_ss, li);
    ss_owner_cas(lm, tiny_self_u32());

    // NOW safe to drain - we own the slab
    if (!lm->freelist && has_remote) {
        ss_remote_drain_to_freelist(last_ss, li);
    }

    if (lm->freelist) {
        return last_ss;
    }
}

Change 4: Fix Hot Slot Path (claim ownership BEFORE drain)

File: core/tiny_refill.h Lines: 64-66 Action: Reorder operations

Before:

TinySlabMeta* m = &hss->slabs[hidx];
if (!m->freelist && atomic_load_explicit(&hss->remote_heads[hidx], memory_order_acquire) != 0)
    ss_remote_drain_to_freelist(hss, hidx);  // ❌ Drain BEFORE ownership
if (m->freelist) {
    tiny_tls_bind_slab(tls, hss, hidx);
    ss_owner_cas(m, tiny_self_u32());  // ← Ownership AFTER drain
    tiny_sticky_save(class_idx, hss, (uint8_t)hidx);
    return hss;
}

After:

TinySlabMeta* m = &hss->slabs[hidx];

// ✅ BUGFIX: Claim ownership BEFORE draining
tiny_tls_bind_slab(tls, hss, hidx);
ss_owner_cas(m, tiny_self_u32());

// NOW safe to drain - we own the slab
if (!m->freelist && atomic_load_explicit(&hss->remote_heads[hidx], memory_order_acquire) != 0) {
    ss_remote_drain_to_freelist(hss, hidx);
}

if (m->freelist) {
    tiny_sticky_save(class_idx, hss, (uint8_t)hidx);
    return hss;
}

Change 5: Fix Bench Path (claim ownership BEFORE drain)

File: core/tiny_refill.h Lines: 79-81 Action: Reorder operations

Before:

TinySlabMeta* m = &bss->slabs[bidx];
if (!m->freelist && atomic_load_explicit(&bss->remote_heads[bidx], memory_order_acquire) != 0)
    ss_remote_drain_to_freelist(bss, bidx);  // ❌ Drain BEFORE ownership
if (m->freelist) {
    tiny_tls_bind_slab(tls, bss, bidx);
    ss_owner_cas(m, tiny_self_u32());  // ← Ownership AFTER drain
    tiny_sticky_save(class_idx, bss, (uint8_t)bidx);
    return bss;
}

After:

TinySlabMeta* m = &bss->slabs[bidx];

// ✅ BUGFIX: Claim ownership BEFORE draining
tiny_tls_bind_slab(tls, bss, bidx);
ss_owner_cas(m, tiny_self_u32());

// NOW safe to drain - we own the slab
if (!m->freelist && atomic_load_explicit(&bss->remote_heads[bidx], memory_order_acquire) != 0) {
    ss_remote_drain_to_freelist(bss, bidx);
}

if (m->freelist) {
    tiny_sticky_save(class_idx, bss, (uint8_t)bidx);
    return bss;
}

Change 6: Fix mmap_gate Path (claim ownership BEFORE drain)

File: core/tiny_mmap_gate.h Lines: 56-58 Action: Reorder operations

Before:

TinySlabMeta* m = &cand->slabs[s];
int has_remote = (atomic_load_explicit(&cand->remote_heads[s], memory_order_acquire) != 0);
if (m->freelist || has_remote) {
    if (!m->freelist && has_remote) ss_remote_drain_to_freelist(cand, s);  // ❌ Drain BEFORE ownership
    if (m->freelist) {
        tiny_tls_bind_slab(tls, cand, s);
        ss_owner_cas(m, tiny_self_u32());  // ← Ownership AFTER drain
        return cand;
    }
}

After:

TinySlabMeta* m = &cand->slabs[s];
int has_remote = (atomic_load_explicit(&cand->remote_heads[s], memory_order_acquire) != 0);
if (m->freelist || has_remote) {
    // ✅ BUGFIX: Claim ownership BEFORE draining
    tiny_tls_bind_slab(tls, cand, s);
    ss_owner_cas(m, tiny_self_u32());

    // NOW safe to drain - we own the slab
    if (!m->freelist && has_remote) {
        ss_remote_drain_to_freelist(cand, s);
    }

    if (m->freelist) {
        return cand;
    }
}

Testing Plan

Test 1: Baseline (Current Crashes)

# Build with current code (before fixes)
make clean && make -s larson_hakmem

# Run repro mode (should crash around 4000 events)
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 4

Expected: Crash at ~4000 events with fault_addr=0x6261


Test 2: Apply Fix (Remove Fix #1 and Fix #2 ONLY)

# Apply Changes 1 and 2 (comment out Fix #1 and Fix #2)
# Rebuild
make clean && make -s larson_hakmem

# Run repro mode
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 10

Expected:

  • If crashes stop → Fix #1/#2 were the main culprits
  • If crashes continue → Need to apply Changes 3-6

Test 3: Apply All Fixes (Changes 1-6)

# Apply all changes
# Rebuild
make clean && make -s larson_hakmem

# Run extended test
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 20

Expected: NO crashes, stable execution for full 20 seconds


Test 4: Guard Mode (Maximum Stress)

# Rebuild with all fixes
make clean && make -s larson_hakmem

# Run guard mode (stricter checks)
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh guard 30 20

Expected: NO crashes, reaches 30+ seconds


Verification Checklist

After applying fixes, verify:

  • Fix #1 code (hakmem_tiny_free.inc:615-621) commented out or deleted
  • Fix #2 code (hakmem_tiny_free.inc:729-767) commented out or deleted
  • Fix #3 (tiny_refill.h:100-106) unchanged (already correct)
  • Sticky path (tiny_refill.h:46-51) reordered: ownership BEFORE drain
  • Hot slot path (tiny_refill.h:64-66) reordered: ownership BEFORE drain
  • Bench path (tiny_refill.h:79-81) reordered: ownership BEFORE drain
  • mmap_gate path (tiny_mmap_gate.h:56-58) reordered: ownership BEFORE drain
  • All changes compile without errors
  • Benchmark runs without crashes for 30+ seconds

Expected Results

Before Fixes

Test Duration Events Result
repro mode ~4 sec ~4012 CRASH at fault_addr=0x6261
guard mode ~2 sec ~2137 CRASH at fault_addr=0x6261

After Fixes (Changes 1-2 only)

Test Duration Events Result
repro mode 10+ sec 10000+ NO CRASH or ⚠️ occasional crash
guard mode 10+ sec 10000+ NO CRASH or ⚠️ occasional crash

After All Fixes (Changes 1-6)

Test Duration Events Result
repro mode 20+ sec 20000+ NO CRASH
guard mode 30+ sec 30000+ NO CRASH

Rollback Plan

If fixes cause new issues:

  1. Revert Changes 3-6 (keep Changes 1-2):

    • Restore original sticky/hot/bench/mmap_gate paths
    • This removes Fix #1/#2 but keeps old refill ordering
    • Test again
  2. Revert All Changes:

    git checkout core/hakmem_tiny_free.inc
    git checkout core/tiny_refill.h
    git checkout core/tiny_mmap_gate.h
    make clean && make
    
  3. Try Alternative: Option B from ULTRATHINK_ANALYSIS.md (add ownership checks instead of removing)


Additional Debugging (If Crashes Persist)

If crashes continue after all fixes:

  1. Enable ownership assertion:

    // In hakmem_tiny_superslab.h:345, add at top of ss_remote_drain_to_freelist:
    #ifdef HAKMEM_DEBUG_OWNERSHIP
    TinySlabMeta* m = &ss->slabs[slab_idx];
    uint32_t owner = m->owner_tid;
    uint32_t self = tiny_self_u32();
    if (owner != 0 && owner != self) {
        fprintf(stderr, "[OWNERSHIP ERROR] Thread %u draining slab %d owned by %u!\n",
                self, slab_idx, owner);
        abort();
    }
    #endif
    
  2. Rebuild with debug flag:

    make clean
    CFLAGS="-DHAKMEM_DEBUG_OWNERSHIP=1" make -s larson_hakmem
    HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 10
    
  3. Check for other unsafe drain sites:

    grep -n "ss_remote_drain_to_freelist" core/*.{c,inc,h} | grep -v "^//"
    

END OF IMPLEMENTATION GUIDE