hakmem/docs/status/P0_BUG_STATUS.md
P0 SEGV Bug - Current Status & Next Steps

Last Update: 2025-11-12

🐛 Bug Summary

Symptom: SEGV crash at iterations 28,440 and 38,985 (deterministic with seed 42)
Pattern: Corrupted address 0x7fff00008000 in TLS SLL chain
Root Cause: STALE NEXT POINTERS in carved chains


🎁 Box Theory Implementation (Completed)

Box 3 (Pointer Conversion Box)

  • File: core/box/ptr_conversion_box.h (267 lines)
  • Role: BASE ↔ USER pointer conversion
  • API:
    • ptr_base_to_user(base, class_idx) - C0-C6: base+1, C7: base
    • ptr_user_to_base(user, class_idx) - C0-C6: user-1, C7: user
  • Status: Committed (1713 lines added total)
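
The conversion rule above can be sketched in a few lines of C. This is a minimal illustration, not the contents of ptr_conversion_box.h: the `sketch_` names are hypothetical, and the one-byte offset for C0-C6 is assumed to correspond to a per-block header, which the document implies but does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: class 7 is the only class whose user pointer equals its base. */
enum { SKETCH_HEADERLESS_CLASS = 7 };

/* Offset between BASE and USER pointers: 1 byte for C0-C6, 0 for C7. */
static inline size_t sketch_header_size(int class_idx) {
    assert(class_idx >= 0 && class_idx <= 7);
    return (class_idx == SKETCH_HEADERLESS_CLASS) ? 0u : 1u;
}

/* BASE -> USER: skip the assumed header byte. */
static inline void *sketch_base_to_user(void *base, int class_idx) {
    return (uint8_t *)base + sketch_header_size(class_idx);
}

/* USER -> BASE: step back over the assumed header byte. */
static inline void *sketch_user_to_base(void *user, int class_idx) {
    return (uint8_t *)user - sketch_header_size(class_idx);
}
```

Keeping both directions behind one offset helper means a round trip (base to user and back) returns the original pointer for every class.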

Box E (Expansion Box)

  • File: core/box/superslab_expansion_box.h/c
  • Role: SuperSlab expansion with TLS state guarantee
  • Function: expansion_expand_with_tls_guarantee() - binds slab 0 immediately after expansion
  • Status: Committed

Box I (Integrity Box) - 703 lines!

  • File: core/box/integrity_box.h (267 lines) + integrity_box.c (436 lines)
  • Role: Comprehensive integrity verification system
  • Priority ALPHA: five slab metadata invariant checks
    1. carved <= capacity
    2. used <= carved
    3. used <= capacity
    4. free_count == (carved - used)
    5. capacity <= 512
  • Functions:
    • integrity_validate_slab_metadata() - metadata validation
    • validate_ptr_range() - pointer range validation (null-page, kernel-space, 0xa2/0xcc/0xdd/0xfe patterns)
  • Status: Committed

Box TLS-SLL (target of this fix)

  • File: core/box/tls_sll_box.h
  • Role: TLS Single-Linked List management (C7-safe)
  • API:
    • tls_sll_push() - Push to SLL (C7 rejected)
    • tls_sll_pop() - Pop from SLL (returns base pointer)
    • tls_sll_splice() - Batch push
  • Findings in this session:
    • Fix #1: Clear next in tls_sll_pop (for C0-C6, at base+1)
    • But: the tail of the carved chain is not NULL-terminated (Fix #2 required)
  • Status: ⚠️ Fix #1 applied, Fix #2 not yet applied

Other Boxes (existing)

  • Front Gate Box: core/box/front_gate_box.h/c + front_gate_classifier.c
  • Free Local/Remote/Publish Box: core/box/free_local_box.c, free_remote_box.c, free_publish_box.c
  • Mailbox Box: core/box/mailbox_box.h/c

Commit Info:

  • Commit: "Add Box I (Integrity), Box E (Expansion)..."
  • Files: 23 files changed, 1713 insertions(+), 56 deletions(-)
  • Date: Recent (before P0 debug session)

🔍 Investigation History

Completed Investigations

  1. Valgrind (O0 build): 0 errors, 29K iterations passed
    • Conclusion: Bug is optimization-dependent (-O3 triggers it)

  2. Task Agent GDB Analysis:
    • Found crash location: tls_sll_pop line 169
    • Hypothesis: use-after-allocate (next pointer at base+1 is user memory)

  3. Box I, E, 3 Implementation: 703 lines of integrity checks
    • All checks passed before crash
    • Validation didn't catch the bug

🛠️ Fixes Applied (Partial Success)

Fix #1: Clear next pointer in tls_sll_pop (INCOMPLETE)

File: core/box/tls_sll_box.h:254-262

Change:

// OLD (WRONG): Only cleared for C7
if (__builtin_expect(class_idx == 7, 0)) {
    *(void**)base = NULL;
}

// NEW: Clear for C0-C6 too
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (class_idx == 7) {
        *(void**)base = NULL;  // C7: clear at base (offset 0)
    } else {
        *(void**)((uint8_t*)base + 1) = NULL;  // C0-C6: clear at base+1 (offset 1)
    }
#else
    *(void**)base = NULL;
#endif

Result:

  • Passed 29K iterations (previous crash point)
  • Still crashes at 38,985 iterations
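
The effect of Fix #1 can be reproduced in isolation with a small intrusive list. This is a sketch under stated assumptions, not the code in tls_sll_box.h: the `sketch_` helpers are hypothetical, the list handles one class at a time, and the C7 rejection in tls_sll_push() is omitted. It uses memcpy for the link because base+1 is not pointer-aligned, and dereferencing a misaligned `void**` is undefined behavior in ISO C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the link lives at offset 1 for C0-C6 and offset 0 for C7. */
static size_t sketch_next_offset(int class_idx) {
    assert(class_idx >= 0 && class_idx <= 7);
    return (class_idx == 7) ? 0u : 1u;
}

/* Read the link stored inside a free block (alignment-safe). */
static void *sketch_load_next(void *base, size_t off) {
    void *next;
    memcpy(&next, (uint8_t *)base + off, sizeof next);
    return next;
}

/* Write the link stored inside a free block (alignment-safe). */
static void sketch_store_next(void *base, size_t off, void *next) {
    memcpy((uint8_t *)base + off, &next, sizeof next);
}

/* Push a block onto the list headed by *head. */
static void sketch_push(void **head, void *base, int class_idx) {
    sketch_store_next(base, sketch_next_offset(class_idx), *head);
    *head = base;
}

/* Pop the head block and clear its link so no stale pointer
 * survives in the block that is handed out. */
static void *sketch_pop(void **head, int class_idx) {
    void *base = *head;
    if (base == NULL) return NULL;
    size_t off = sketch_next_offset(class_idx);
    *head = sketch_load_next(base, off);
    sketch_store_next(base, off, NULL);
    return base;
}
```

After a pop, the returned block carries a NULL link for every class, which is the property the old C7-only branch failed to provide for C0-C6.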

🚨 NEW DISCOVERY: Root Cause Found!

Fix #2: NULL-terminate carved chain tail (NOT YET APPLIED)

File: core/tiny_refill_opt.h:229-234

BUG: Tail block's next pointer is NOT NULL-terminated!

// Current code (BUGGY):
for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;  // Links blocks 0→1, 1→2, ...
    cursor = next;
}
void* tail = (void*)cursor;  // tail = last block
// ❌ BUG: tail's next pointer is NEVER set to NULL!
// It contains GARBAGE from previous allocation!

IMPACT:

  1. Chain is carved: head → block1 → block2 → ... → tail → [GARBAGE]
  2. Chain spliced to TLS SLL
  3. Later, tls_sll_pop traverses the chain
  4. Reads garbage next pointer → SEGV at 0x7fff00008000

FIX (add after line 233):

for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;
    cursor = next;
}
void* tail = (void*)cursor;

// ✅ FIX: NULL-terminate the tail
*(void**)((uint8_t*)tail + next_offset) = NULL;
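
A standalone sketch of the same carve, with the tail terminated, shows why the fix matters when the backing memory holds leftover bytes. The function names and signature are hypothetical and do not reproduce trc_linear_carve(); links are written with memcpy so the sketch stays valid for an unaligned next_offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Links `batch` blocks of `stride` bytes starting at `start`.
 * Returns the head and stores the last block in *tail_out. */
static void *sketch_linear_carve(uint8_t *start, size_t stride, uint32_t batch,
                                 size_t next_offset, void **tail_out) {
    assert(batch >= 1 && stride >= next_offset + sizeof(void *));
    uint8_t *cursor = start;
    for (uint32_t i = 1; i < batch; i++) {
        void *link = cursor + stride;
        memcpy(cursor + next_offset, &link, sizeof link); /* block i-1 -> block i */
        cursor += stride;
    }
    void *end = NULL;
    memcpy(cursor + next_offset, &end, sizeof end);       /* terminate the tail */
    *tail_out = cursor;
    return start;
}

/* Counts blocks in a chain. Returns 0 if more than `limit` blocks are
 * seen, which guards against cycles only: a garbage link could still
 * fault before the limit is reached. */
static uint32_t sketch_chain_length(void *head, size_t next_offset, uint32_t limit) {
    uint32_t n = 0;
    void *p = head;
    while (p != NULL) {
        if (++n > limit) return 0;
        void *next;
        memcpy(&next, (uint8_t *)p + next_offset, sizeof next);
        p = next;
    }
    return n;
}
```

Filling the buffer with a poison byte such as 0xCC before carving, then checking that the chain length equals the batch size, is one way to confirm the tail link was actually written rather than inherited from earlier contents.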

🚨 CURRENT STATUS (2025-11-12 UPDATED)

Fixes Applied:

  1. Fix #1: Clear next pointer in tls_sll_pop (C0-C6 at base+1)
  2. Fix #2: NULL-terminate tail in trc_linear_carve()
  3. Fix #3: Clean rebuild with HEADER_CLASSIDX=1
  4. Fix #4: Increase canary check frequency (1000 → 100 ops)
  5. Fix #5: Add bounds check to tls_sll_push()

Test Results:

  • Still crashes at iteration 28,410 (call 14269)
  • Canaries: NOT corrupted (corruption is immediate)
  • Bounds check: NOT triggered (class_idx is valid)
  • Task agent finding: External corruption of g_tls_sll_head[0]

Analysis:

  • Fix #1 and Fix #2 ARE working correctly (Task agent verified)
  • Corruption happens IMMEDIATELY before crash (canaries at 100-op interval miss it)
  • class_idx is valid [0-7] when corruption happens (bounds check doesn't trigger)
  • Crash is deterministic at call 14269
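
Because a 100-op canary interval misses corruption that occurs immediately before the crash, one possible diagnostic is a cheap plausibility check on the list head at every call. The sketch below is an assumption-laden illustration, not an existing HAKMEM API: it presumes the caller knows the address range and stride of the region the head should point into, and `sketch_head_plausible` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if `head` is NULL or points at a block boundary inside
 * [region, region + region_size). Anything else is treated as corrupt. */
static bool sketch_head_plausible(const void *head, const uint8_t *region,
                                  size_t region_size, size_t stride) {
    assert(stride > 0);
    if (head == NULL) return true;              /* empty list is valid */
    uintptr_t p = (uintptr_t)head;
    uintptr_t lo = (uintptr_t)region;
    if (p < lo || p - lo >= region_size) return false; /* outside the region */
    return (p - lo) % stride == 0;              /* must sit on a block boundary */
}
```

A check of this shape run on entry to push and pop would flag a head such as 0x7fff00008000 at the first call after it is written, provided that address lies outside the expected region, which narrows the search to the operations between two consecutive calls.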

📋 Next Steps (NEEDS USER INPUT)

Option A: Deep GDB Investigation (SLOW)

  • Set hardware watchpoint on g_tls_sll_head[0]
  • Run to call 14250, then watch for corruption
  • Time: 1-2 hours, may not work with optimization

Option B: Disable Optimizations (DIAGNOSTIC)

  • Rebuild with -O0 to see if bug disappears
  • If so, likely compiler optimization bug or UB
  • Time: 10 minutes

Option C: Simplified Stress Test (QUICK)

  • Disable P0 batch optimization temporarily
  • Disable SFC temporarily
  • Test with simpler code path
  • Time: 20 minutes

After Fix Verified

  1. Commit P0 fix:
    • Fix #1: Clear next in tls_sll_pop
    • Fix #2: NULL-terminate in trc_linear_carve
    • Box I/E/3 validation infrastructure
    • Double-free detection

  2. Update CLAUDE.md with findings

  3. Performance benchmark (release build)


🎯 Expected Outcome

After applying Fix #2, the allocator should:

  • Pass 100K iterations without crash
  • Pass 1M iterations without crash
  • Maintain performance (~2.7M ops/s for 256B)

📝 Lessons Learned

  1. Stale pointers are dangerous: Always NULL-terminate linked lists
  2. Optimization exposes bugs: an -O0 build can mask stale or uninitialized memory that only faults under -O3
  3. Multiple fixes needed: Fix #1 alone was insufficient
  4. Chain integrity: Carved chains MUST be properly terminated

🔧 Build Flags (CRITICAL)

MUST use these flags:

HEADER_CLASSIDX=1
AGGRESSIVE_INLINE=1
PREWARM_TLS=1

Why: HAKMEM_TINY_HEADER_CLASSIDX=1 is required for Fix #1 to execute!

Use build.sh to ensure correct flags:

./build.sh bench_random_mixed_hakmem