hakmem/docs/benchmarks/LARSON_DIAGNOSTIC_PATCH.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51.0M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (within run-to-run variance; fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

2025-11-26 13:14:18 +09:00


# Larson Race Condition Diagnostic Patch

Purpose: Confirm the freelist race-condition hypothesis before implementing the full fix.

## Quick Diagnostic (5 minutes)

Add logging to core/front/tiny_unified_cache.c to detect concurrent freelist access:

### Patch: Add Thread ID Logging

```diff
--- a/core/front/tiny_unified_cache.c
+++ b/core/front/tiny_unified_cache.c
@@ -8,6 +8,7 @@
 #include "../box/pagefault_telemetry_box.h"  // Phase 24: Box PageFaultTelemetry (Tiny page touch stats)
 #include <stdlib.h>
 #include <string.h>
+#include <pthread.h>

 // Phase 23-E: Forward declarations
 extern __thread TinyTLSSlab g_tls_slabs[TINY_NUM_CLASSES];  // From hakmem_tiny_superslab.c
@@ -166,8 +167,22 @@ void* unified_cache_refill(int class_idx) {
                        : tiny_slab_base_for_geometry(tls->ss, tls->slab_idx);

     while (produced < room) {
         if (m->freelist) {
+            // DIAGNOSTIC: Log thread + freelist state
+            static _Atomic uint64_t g_diag_count = 0;
+            uint64_t diag_n = atomic_fetch_add_explicit(&g_diag_count, 1, memory_order_relaxed);
+            if (diag_n < 100) {  // First 100 pops only
+                fprintf(stderr, "[FREELIST_POP] T%lu cls=%d ss=%p slab=%d freelist=%p owner=%u\n",
+                        (unsigned long)pthread_self(),
+                        class_idx,
+                        (void*)tls->ss,
+                        tls->slab_idx,
+                        m->freelist,
+                        (unsigned)m->owner_tid_low);
+                fflush(stderr);
+            }
+
             // Freelist pop
             void* p = m->freelist;
             m->freelist = tiny_next_read(class_idx, p);
```

### Build and Run

```bash
./build.sh larson_hakmem 2>&1 | tail -5

# Run with 4 threads (known to crash)
./out/release/larson_hakmem 4 4 500 10000 1000 12345 1 2>&1 | tee larson_diag.log

# Analyze results
grep FREELIST_POP larson_diag.log | head -50
```

### Expected Output (Race Confirmed)

If the race exists, you'll see:

```
[FREELIST_POP] T140737353857856 cls=6 ss=0x76f899260800 slab=3 freelist=0x76f899261000 owner=42
[FREELIST_POP] T140737345465088 cls=6 ss=0x76f899260800 slab=3 freelist=0x76f899261000 owner=42
                                                          ^^^^ SAME SS+SLAB+FREELIST ^^^^
```

Key evidence:

- Different thread IDs (T140737353857856 vs T140737345465088)
- Same SuperSlab pointer (ss=0x76f899260800)
- Same slab index (slab=3)
- Same freelist head (freelist=0x76f899261000)
- Race confirmed: two threads popping from the same freelist simultaneously

## Quick Workaround (30 minutes)

Force thread affinity by rejecting cross-thread access:

```diff
--- a/core/front/tiny_unified_cache.c
+++ b/core/front/tiny_unified_cache.c
@@ -137,6 +137,21 @@ void* unified_cache_refill(int class_idx) {
 void* unified_cache_refill(int class_idx) {
     TinyTLSSlab* tls = &g_tls_slabs[class_idx];

+    // WORKAROUND: Ensure slab ownership (thread affinity)
+    if (tls->meta) {
+        uint8_t my_tid_low = (uint8_t)pthread_self();
+
+        // If slab has no owner, claim it
+        if (tls->meta->owner_tid_low == 0) {
+            tls->meta->owner_tid_low = my_tid_low;
+        }
+        // If slab owned by different thread, force refill to get new slab
+        else if (tls->meta->owner_tid_low != my_tid_low) {
+            tls->ss = NULL;  // Trigger superslab_refill
+        }
+    }
+
     // Step 1: Ensure SuperSlab available
     if (!tls->ss) {
         if (!superslab_refill(class_idx)) return NULL;
```

### Test Workaround

```bash
./build.sh larson_hakmem 2>&1 | tail -5

# Test with 4, 8, 10 threads
for threads in 4 8 10; do
    echo "Testing $threads threads..."
    timeout 30 ./out/release/larson_hakmem $threads $threads 500 10000 1000 12345 1
    echo "Exit code: $?"
done
```

Expected: Larson completes without SEGV (it may run slower due to more refills).


## Proper Fix Preview (Option 1: Atomic Freelist)

### Step 1: Update TinySlabMeta

```diff
--- a/core/superslab/superslab_types.h
+++ b/core/superslab/superslab_types.h
@@ -10,8 +10,8 @@
 // TinySlabMeta: per-slab metadata embedded in SuperSlab
 typedef struct TinySlabMeta {
-    void*    freelist;       // NULL = bump-only, non-NULL = freelist head
-    uint16_t used;           // blocks currently allocated from this slab
+    _Atomic uintptr_t freelist;  // Atomic freelist head (was: void*)
+    _Atomic uint16_t used;       // Atomic used count (was: uint16_t)
     uint16_t capacity;       // total blocks this slab can hold
     uint8_t  class_idx;      // owning tiny class (Phase 12: per-slab)
     uint8_t  carved;         // carve/owner flags
```

### Step 2: Update Freelist Operations

Note: the CAS expected-value argument must have the atomic's value type (uintptr_t), so the pop tracks the head as a uintptr_t rather than a void*.

```diff
--- a/core/front/tiny_unified_cache.c
+++ b/core/front/tiny_unified_cache.c
@@ -168,9 +168,20 @@ void* unified_cache_refill(int class_idx) {

     while (produced < room) {
-        if (m->freelist) {
-            void* p = m->freelist;
-            m->freelist = tiny_next_read(class_idx, p);
+        // Atomic freelist pop (lock-free)
+        uintptr_t head = atomic_load_explicit(&m->freelist, memory_order_acquire);
+        void* p = NULL;
+        while (head != 0) {
+            void* next = tiny_next_read(class_idx, (void*)head);
+
+            // CAS: only succeeds if the freelist head is unchanged
+            if (atomic_compare_exchange_weak_explicit(
+                    &m->freelist, &head, (uintptr_t)next,
+                    memory_order_release, memory_order_acquire)) {
+                p = (void*)head;  // Successfully popped block
+                break;
+            }
+            // CAS failed → head now holds the current value; retry
+        }
+        if (p) {

             // PageFaultTelemetry: record page touch for this BASE
             pagefault_telemetry_touch(class_idx, p);
@@ -180,7 +191,7 @@ void* unified_cache_refill(int class_idx) {
             *(uint8_t*)p = (uint8_t)(0xa0 | (class_idx & 0x0f));
             #endif

-            m->used++;
+            atomic_fetch_add_explicit(&m->used, 1, memory_order_relaxed);
             out[produced++] = p;

         } else if (m->carved < m->capacity) {
```

### Step 3: Update All Access Sites

Files requiring atomic conversion (estimated 20 high-priority sites):

1. core/front/tiny_unified_cache.c - freelist pop (done above)
2. core/tiny_superslab_free.inc.h - freelist push (same-thread free)
3. core/tiny_superslab_alloc.inc.h - freelist allocation
4. core/box/carve_push_box.c - batch operations
5. core/slab_handle.h - freelist traversal

Grep pattern to find sites:

```bash
grep -rn "->freelist" core/ --include="*.c" --include="*.h" | grep -v "\.d:" | grep -v "//" | wc -l
# Result: 87 sites (audit required)
```

## Testing Checklist

### Phase 1: Basic Functionality

- Single-threaded: bench_random_mixed_hakmem 10000 256 42
- C7 specific: bench_random_mixed_hakmem 10000 1024 42
- Fixed size: bench_fixed_size_hakmem 10000 1024 128

### Phase 2: Multi-Threading

- 2 threads: larson_hakmem 2 2 100 1000 100 12345 1
- 4 threads: larson_hakmem 4 4 500 10000 1000 12345 1
- 8 threads: larson_hakmem 8 8 500 10000 1000 12345 1
- 10 threads: larson_hakmem 10 10 500 10000 1000 12345 1 (original params)

### Phase 3: Stress Test

```bash
# 100 iterations with random parameters
for i in {1..100}; do
    threads=$((RANDOM % 16 + 2))
    ./out/release/larson_hakmem $threads $threads 500 10000 1000 $RANDOM 1 || {
        echo "FAILED at iteration $i with $threads threads"
        exit 1
    }
done
echo "✅ All 100 iterations passed"
```

### Phase 4: Performance Regression

```bash
# Before fix
./out/release/larson_hakmem 2 2 100 1000 100 12345 1 | grep "Throughput ="
# Expected: ~24.6M ops/s

# After fix (should be similar; lock-free CAS is fast)
./out/release/larson_hakmem 2 2 100 1000 100 12345 1 | grep "Throughput ="
# Target: >= 20M ops/s (< 20% regression acceptable)
```

## Timeline Estimate

| Task | Time | Priority |
|------|------|----------|
| Apply diagnostic patch | 5 min | P0 |
| Verify race with logs | 10 min | P0 |
| Apply workaround patch | 30 min | P1 |
| Test workaround | 30 min | P1 |
| Implement atomic fix | 2-3 hrs | P2 |
| Audit all access sites | 3-4 hrs | P2 |
| Comprehensive testing | 1 hr | P2 |
| Total (full fix) | 7-9 hrs | - |
| Total (workaround only) | 1-2 hrs | - |

## Decision Matrix

Use the workaround if:

- You need Larson working ASAP (< 2 hours)
- You can tolerate a slight performance regression (~10-15%)
- You want minimal code changes (< 20 lines)

Use the atomic fix if:

- You need a production-quality solution
- Performance is critical (lock-free is optimal here)
- You have time for a thorough audit (7-9 hours)

Use a per-slab mutex if:

- You want guaranteed correctness
- Performance is less critical than safety
- You prefer simple, auditable code

## Recommendation

- Immediate (today): apply the workaround patch to unblock Larson testing
- Short-term (this week): implement the atomic fix with a careful audit
- Long-term (next release): consider an architectural fix (slab affinity) for optimal performance


## Contact for Questions

See LARSON_CRASH_ROOT_CAUSE_REPORT.md for detailed analysis.