## Changes ### 1. core/page_arena.c - Removed init failure message (lines 25-27) - error is handled by returning early - All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks ### 2. core/hakmem.c - Wrapped SIGSEGV handler init message (line 72) - CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs ### 3. core/hakmem_shared_pool.c - Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE: - Node pool exhaustion warning (line 252) - SP_META_CAPACITY_ERROR warning (line 421) - SP_FIX_GEOMETRY debug logging (line 745) - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865) - SP_ACQUIRE_STAGE0_L0 debug logging (line 803) - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922) - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996) - SP_ACQUIRE_STAGE3 debug logging (line 1116) - SP_SLOT_RELEASE debug logging (line 1245) - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305) - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316) - Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized ## Performance Validation Before: 51M ops/s (with debug fprintf overhead) After: 49.1M ops/s (consistent performance, fprintf removed from hot paths) ## Build & Test ```bash ./build.sh larson_hakmem ./out/release/larson_hakmem 1 5 1 1000 100 10000 42 # Result: 49.1M ops/s ``` Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
7.3 KiB
P0 SEGV Bug - Current Status & Next Steps
Last Update: 2025-11-12
🐛 Bug Summary
Symptom: SEGV crash at iterations 28,440 and 38,985 (deterministic with seed 42)
Pattern: Corrupted address 0x7fff00008000 in TLS SLL chain
Root Cause: STALE NEXT POINTERS in carved chains
🎁 Box Theory Implementation (完了済み)
✅ Box 3 (Pointer Conversion Box)
- File:
core/box/ptr_conversion_box.h(267 lines) - 役割: BASE ↔ USER pointer conversion
- API:
ptr_base_to_user(base, class_idx)- C0-C6: base+1, C7: baseptr_user_to_base(user, class_idx)- C0-C6: user-1, C7: user
- Status: ✅ Committed (1713 lines added total)
✅ Box E (Expansion Box)
- File:
core/box/superslab_expansion_box.h/c - 役割: SuperSlab expansion with TLS state guarantee
- 機能:
expansion_expand_with_tls_guarantee()- Expand後に slab 0 を即座にバインド - Status: ✅ Committed
✅ Box I (Integrity Box) - 703 lines!
- File:
core/box/integrity_box.h(267行) +integrity_box.c(436行) - 役割: Comprehensive integrity verification system
- Priority ALPHA: 5つの Slab Metadata 不変条件チェック
carved <= capacityused <= carvedused <= capacityfree_count == (carved - used)capacity <= 512
- 機能:
integrity_validate_slab_metadata()- メタデータ検証validate_ptr_range()- ポインタ範囲検証(null-page, kernel-space, 0xa2/0xcc/0xdd/0xfe パターン)
- Status: ✅ Committed
✅ Box TLS-SLL (今回の修正対象)
- File:
core/box/tls_sll_box.h - 役割: TLS Single-Linked List management (C7-safe)
- API:
tls_sll_push()- Push to SLL (C7 rejected)tls_sll_pop()- Pop from SLL (returns base pointer)tls_sll_splice()- Batch push
- 今回の発見:
- Fix #1:
tls_sll_popで next をクリア(C0-C6 は base+1 で) - But: carved chain の tail が NULL 終端されていない(Fix #2 必要)
- Fix #1:
- Status: ⚠️ Fix #1 適用済み、Fix #2 未適用
✅ その他のBox (既存)
- Front Gate Box:
core/box/front_gate_box.h/c+front_gate_classifier.c - Free Local/Remote/Publish Box:
core/box/free_local_box.c,free_remote_box.c,free_publish_box.c - Mailbox Box:
core/box/mailbox_box.h/c
Commit Info:
- Commit: "Add Box I (Integrity), Box E (Expansion)..."
- Files: 23 files changed, 1713 insertions(+), 56 deletions(-)
- Date: Recent (before P0 debug session)
🔍 Investigation History
✅ Completed Investigations
-
Valgrind (O0 build): 0 errors, 29K iterations passed
- Conclusion: Bug is optimization-dependent (-O3 triggers it)
-
Task Agent GDB Analysis:
- Found crash location:
tls_sll_popline 169 - Hypothesis: use-after-allocate (next pointer at base+1 is user memory)
- Found crash location:
-
Box I, E, 3 Implementation: 703 lines of integrity checks
- All checks passed before crash
- Validation didn't catch the bug
🛠️ Fixes Applied (Partial Success)
Fix #1: Clear next pointer in tls_sll_pop ✅ (INCOMPLETE)
File: core/box/tls_sll_box.h:254-262
Change:
// OLD (WRONG): Only cleared for C7
if (__builtin_expect(class_idx == 7, 0)) {
*(void**)base = NULL;
}
// NEW: Clear for C0-C6 too
#if HAKMEM_TINY_HEADER_CLASSIDX
if (class_idx == 7) {
*(void**)base = NULL; // C7: clear at base (offset 0)
} else {
*(void**)((uint8_t*)base + 1) = NULL; // C0-C6: clear at base+1 (offset 1)
}
#else
*(void**)base = NULL;
#endif
Result:
- ✅ Passed 29K iterations (previous crash point)
- ❌ Still crashes at 38,985 iterations
🚨 NEW DISCOVERY: Root Cause Found!
Fix #2: NULL-terminate carved chain tail (NOT YET APPLIED)
File: core/tiny_refill_opt.h:229-234
BUG: Tail block's next pointer is NOT NULL-terminated!
// Current code (BUGGY):
for (uint32_t i = 1; i < batch; i++) {
uint8_t* next = cursor + stride;
*(void**)(cursor + next_offset) = (void*)next; // Links blocks 0→1, 1→2, ...
cursor = next;
}
void* tail = (void*)cursor; // tail = last block
// ❌ BUG: tail's next pointer is NEVER set to NULL!
// It contains GARBAGE from previous allocation!
IMPACT:
- Chain is carved:
head → block1 → block2 → ... → tail → [GARBAGE] - Chain spliced to TLS SLL
- Later,
tls_sll_poptraverses the chain - Reads garbage
nextpointer → SEGV at0x7fff00008000
FIX (add after line 233):
for (uint32_t i = 1; i < batch; i++) {
uint8_t* next = cursor + stride;
*(void**)(cursor + next_offset) = (void*)next;
cursor = next;
}
void* tail = (void*)cursor;
// ✅ FIX: NULL-terminate the tail
*(void**)((uint8_t*)tail + next_offset) = NULL;
🚨 CURRENT STATUS (2025-11-12 UPDATED)
Fixes Applied:
- ✅ Fix #1: Clear next pointer in
tls_sll_pop(C0-C6 at base+1) - ✅ Fix #2: NULL-terminate tail in
trc_linear_carve() - ✅ Fix #3: Clean rebuild with
HEADER_CLASSIDX=1 - ✅ Fix #4: Increase canary check frequency (1000 → 100 ops)
- ✅ Fix #5: Add bounds check to
tls_sll_push()
Test Results:
- ❌ Still crashes at iteration 28,410 (call 14269)
- Canaries: NOT corrupted (corruption is immediate)
- Bounds check: NOT triggered (class_idx is valid)
- Task agent finding: External corruption of
g_tls_sll_head[0]
Analysis:
- Fix #1 and Fix #2 ARE working correctly (Task agent verified)
- Corruption happens IMMEDIATELY before crash (canaries at 100-op interval miss it)
- class_idx is valid [0-7] when corruption happens (bounds check doesn't trigger)
- Crash is deterministic at call 14269
📋 Next Steps (NEEDS USER INPUT)
Option A: Deep GDB Investigation (SLOW)
- Set hardware watchpoint on
g_tls_sll_head[0] - Run to call 14250, then watch for corruption
- Time: 1-2 hours, may not work with optimization
Option B: Disable Optimizations (DIAGNOSTIC)
- Rebuild with
-O0to see if bug disappears - If so, likely compiler optimization bug or UB
- Time: 10 minutes
Option C: Simplified Stress Test (QUICK)
- Disable P0 batch optimization temporarily
- Disable SFC temporarily
- Test with simpler code path
- Time: 20 minutes
After Fix Verified
-
Commit P0 fix:
- Fix #1: Clear next in
tls_sll_pop - Fix #2: NULL-terminate in
trc_linear_carve - Box I/E/3 validation infrastructure
- Double-free detection
- Fix #1: Clear next in
-
Update CLAUDE.md with findings
-
Performance benchmark (release build)
🎯 Expected Outcome
After applying Fix #2, the allocator should:
- ✅ Pass 100K iterations without crash
- ✅ Pass 1M iterations without crash
- ✅ Maintain performance (~2.7M ops/s for 256B)
📝 Lessons Learned
- Stale pointers are dangerous: Always NULL-terminate linked lists
- Optimization exposes bugs:
-O3can hide initialization in debug builds - Multiple fixes needed: Fix #1 alone was insufficient
- Chain integrity: Carved chains MUST be properly terminated
🔧 Build Flags (CRITICAL)
MUST use these flags:
HEADER_CLASSIDX=1
AGGRESSIVE_INLINE=1
PREWARM_TLS=1
Why: HAKMEM_TINY_HEADER_CLASSIDX=1 is required for Fix #1 to execute!
Use build.sh to ensure correct flags:
./build.sh bench_random_mixed_hakmem