# P0 SEGV Bug - Current Status & Next Steps

**Last Update**: 2025-11-12

## 🐛 Bug Summary

**Symptom**: SEGV crash at iterations 28,440 and 38,985 (deterministic with seed 42)
**Pattern**: Corrupted address `0x7fff00008000` in TLS SLL chain
**Root Cause**: **STALE NEXT POINTERS** in carved chains

---

## 🎁 Box Theory Implementation (Completed)

### ✅ **Box 3** (Pointer Conversion Box)
- **File**: `core/box/ptr_conversion_box.h` (267 lines)
- **Role**: BASE ↔ USER pointer conversion
- **API**:
  - `ptr_base_to_user(base, class_idx)` - C0-C6: base+1, C7: base
  - `ptr_user_to_base(user, class_idx)` - C0-C6: user-1, C7: user
- **Status**: ✅ Committed (1713 lines added total)

### ✅ **Box E** (Expansion Box)
- **File**: `core/box/superslab_expansion_box.h/c`
- **Role**: SuperSlab expansion with TLS state guarantee
- **Function**: `expansion_expand_with_tls_guarantee()` - binds slab 0 immediately after expansion
- **Status**: ✅ Committed

### ✅ **Box I** (Integrity Box) - **703 lines!**
- **File**: `core/box/integrity_box.h` (267 lines) + `integrity_box.c` (436 lines)
- **Role**: Comprehensive integrity verification system
- **Priority ALPHA**: Five slab metadata invariant checks
  1. `carved <= capacity`
  2. `used <= carved`
  3. `used <= capacity`
  4. `free_count == (carved - used)`
  5. `capacity <= 512`
- **Functions**:
  - `integrity_validate_slab_metadata()` - metadata validation
  - `validate_ptr_range()` - pointer range validation (null-page, kernel-space, 0xa2/0xcc/0xdd/0xfe patterns)
- **Status**: ✅ Committed

### ✅ **Box TLS-SLL** (target of this fix)
- **File**: `core/box/tls_sll_box.h`
- **Role**: TLS Single-Linked List management (C7-safe)
- **API**:
  - `tls_sll_push()` - Push to SLL (C7 rejected)
  - `tls_sll_pop()` - Pop from SLL (returns base pointer)
  - `tls_sll_splice()` - Batch push
- **Findings this session**:
  - Fix #1: Clear next in `tls_sll_pop` (at base+1 for C0-C6)
  - However, the tail of the carved chain is not NULL-terminated (Fix #2 needed)
- **Status**: ⚠️ Fix #1 applied; Fix #2 not yet applied at the time of this entry (applied later, see Current Status)

### ✅ **Other Boxes** (existing)
- **Front Gate Box**: `core/box/front_gate_box.h/c` + `front_gate_classifier.c`
- **Free Local/Remote/Publish Box**: `core/box/free_local_box.c`, `free_remote_box.c`, `free_publish_box.c`
- **Mailbox Box**: `core/box/mailbox_box.h/c`

**Commit Info**:
- Commit: "Add Box I (Integrity), Box E (Expansion)..."
- Files: 23 files changed, 1713 insertions(+), 56 deletions(-)
- Date: Recent (before P0 debug session)

---

## 🔍 Investigation History

### ✅ Completed Investigations

1. **Valgrind (O0 build)**: 0 errors, 29K iterations passed
   - Conclusion: Bug is optimization-dependent (`-O3` triggers it)
2. **Task Agent GDB Analysis**:
   - Found crash location: `tls_sll_pop` line 169
   - Hypothesis: use-after-allocate (next pointer at base+1 is user memory)
3. **Box I, E, 3 Implementation**: 703 lines of integrity checks
   - All checks passed before crash
   - Validation didn't catch the bug

---

## 🛠️ Fixes Applied (Partial Success)

### Fix #1: Clear next pointer in `tls_sll_pop` ✅ (INCOMPLETE)

**File**: `core/box/tls_sll_box.h:254-262`

**Change**:
```c
// OLD (WRONG): Only cleared for C7
if (__builtin_expect(class_idx == 7, 0)) {
    *(void**)base = NULL;
}

// NEW: Clear for C0-C6 too
#if HAKMEM_TINY_HEADER_CLASSIDX
if (class_idx == 7) {
    *(void**)base = NULL;                  // C7: clear at base (offset 0)
} else {
    *(void**)((uint8_t*)base + 1) = NULL;  // C0-C6: clear at base+1 (offset 1)
}
#else
*(void**)base = NULL;
#endif
```

**Result**:
- ✅ Passed 29K iterations (previous crash point)
- ❌ **Still crashes at 38,985 iterations**

---

## 🚨 NEW DISCOVERY: Root Cause Found!

### Fix #2: NULL-terminate carved chain tail (not applied at time of discovery; applied later, see Current Status)

**File**: `core/tiny_refill_opt.h:229-234`

**BUG**: Tail block's next pointer is NOT NULL-terminated!

```c
// Current code (BUGGY):
for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;  // Links blocks 0→1, 1→2, ...
    cursor = next;
}
void* tail = (void*)cursor;  // tail = last block
// ❌ BUG: tail's next pointer is NEVER set to NULL!
// It contains GARBAGE from previous allocation!
```

**IMPACT**:
1. Chain is carved: `head → block1 → block2 → ... → tail → [GARBAGE]`
2. Chain is spliced into the TLS SLL
3. Later, `tls_sll_pop` traverses the chain
4. It reads the garbage `next` pointer → SEGV at `0x7fff00008000`

**FIX** (add after line 233):
```c
for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;
    cursor = next;
}
void* tail = (void*)cursor;

// ✅ FIX: NULL-terminate the tail
*(void**)((uint8_t*)tail + next_offset) = NULL;
```

---

## 🚨 CURRENT STATUS (2025-11-12 UPDATED)

### Fixes Applied:
1. ✅ **Fix #1**: Clear next pointer in `tls_sll_pop` (C0-C6 at base+1)
2. ✅ **Fix #2**: NULL-terminate tail in `trc_linear_carve()`
3. ✅ **Fix #3**: Clean rebuild with `HEADER_CLASSIDX=1`
4. ✅ **Fix #4**: Increase canary check frequency (every 1000 ops → every 100 ops)
5. ✅ **Fix #5**: Add bounds check to `tls_sll_push()`

### Test Results:
- ❌ **Still crashes at iteration 28,410 (call 14269)**
- Canaries: NOT corrupted (corruption is immediate)
- Bounds check: NOT triggered (class_idx is valid)
- Task agent finding: External corruption of `g_tls_sll_head[0]`

### Analysis:
- Fix #1 and Fix #2 ARE working correctly (Task agent verified)
- Corruption happens IMMEDIATELY before the crash (canary checks at a 100-op interval miss it)
- class_idx is valid [0-7] when corruption happens (bounds check doesn't trigger)
- Crash is deterministic at call 14269

## 📋 Next Steps (NEEDS USER INPUT)

### Option A: Deep GDB Investigation (SLOW)
- Set hardware watchpoint on `g_tls_sll_head[0]`
- Run to call 14250, then watch for corruption
- Time: 1-2 hours, may not work with optimization

### Option B: Disable Optimizations (DIAGNOSTIC)
- Rebuild with `-O0` to see if the bug disappears
- If so, likely a compiler optimization bug or UB
- Time: 10 minutes

### Option C: Simplified Stress Test (QUICK)
- Disable P0 batch optimization temporarily
- Disable SFC temporarily
- Test with simpler code path
- Time: 20 minutes

### After Fix Verified

1. **Commit P0 fix**:
   - Fix #1: Clear next in `tls_sll_pop`
   - Fix #2: NULL-terminate in `trc_linear_carve`
   - Box I/E/3 validation infrastructure
   - Double-free detection
2. **Update CLAUDE.md** with findings
3. **Performance benchmark** (release build)

---

## 🎯 Expected Outcome

Once the crash is fully fixed (Fix #2 alone did not achieve this; see Current Status), the allocator should:
- ✅ Pass 100K iterations without crash
- ✅ Pass 1M iterations without crash
- ✅ Maintain performance (~2.7M ops/s for 256B)

---

## 📝 Lessons Learned

1. **Stale pointers are dangerous**: Always NULL-terminate linked lists
2. **Optimization exposes bugs**: The `-O0` build ran clean under Valgrind while the `-O3` build crashed, so a clean debug run does not prove correctness
3. **Multiple fixes needed**: Fix #1 alone was insufficient
4. **Chain integrity**: Carved chains MUST be properly terminated

---

## 🔧 Build Flags (CRITICAL)

**MUST use these flags**:
```bash
HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1
```

**Why**: `HAKMEM_TINY_HEADER_CLASSIDX=1` is required for Fix #1 to execute!

**Use build.sh** to ensure correct flags:
```bash
./build.sh bench_random_mixed_hakmem
```
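
---

## 🧪 Sketch: Checking Carved-Chain Termination

The Fix #2 carve loop and the "chain integrity" lesson can be exercised in isolation. The following is a minimal sketch, not the code in `core/tiny_refill_opt.h`: every `sketch_*` name, the region-bounded walker, and the use of `memcpy` for the next-pointer store are assumptions made for illustration. `memcpy` is swapped in for the `*(void**)` cast so that a store at `next_offset = 1` stays well-defined on any target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write a next pointer at a byte offset inside a block.
 * memcpy avoids an unaligned pointer store when next_offset is 1. */
static void sketch_store_next(uint8_t *block, size_t next_offset, void *next) {
    memcpy(block + next_offset, &next, sizeof next);
}

/* Hypothetical helper: read the next pointer back from a block. */
static void *sketch_load_next(const uint8_t *block, size_t next_offset) {
    void *next;
    memcpy(&next, block + next_offset, sizeof next);
    return next;
}

/* Carve `batch` blocks of `stride` bytes out of `region`, link them in
 * address order, and NULL-terminate the tail (the Fix #2 step).
 * Returns the head and writes the last block to *tail_out. */
static void *sketch_linear_carve(uint8_t *region, size_t stride,
                                 size_t next_offset, uint32_t batch,
                                 void **tail_out) {
    assert(batch >= 1 && stride >= next_offset + sizeof(void *));
    uint8_t *cursor = region;
    for (uint32_t i = 1; i < batch; i++) {
        uint8_t *next = cursor + stride;
        sketch_store_next(cursor, next_offset, next);
        cursor = next;
    }
    sketch_store_next(cursor, next_offset, NULL);
    *tail_out = cursor;
    return region;
}

/* Walk the chain from `head`. Returns the node count, or -1 if a node lies
 * outside [region, region + region_size) or more than `limit` nodes are
 * found (unterminated or cyclic chain). */
static int sketch_chain_length(void *head, size_t next_offset,
                               const uint8_t *region, size_t region_size,
                               uint32_t limit) {
    uintptr_t lo = (uintptr_t)region;
    uintptr_t hi = lo + region_size;
    int count = 0;
    for (void *p = head; p != NULL; p = sketch_load_next(p, next_offset)) {
        uintptr_t a = (uintptr_t)p;
        if (a < lo || a + next_offset + sizeof(void *) > hi) {
            return -1;  /* stale or foreign pointer */
        }
        if ((uint32_t)count == limit) {
            return -1;  /* ran past the expected length */
        }
        count++;
    }
    return count;
}
```

A walker like this could run right after a carve and before the splice in a debug build: with the tail terminated it returns `batch`, and with a stale tail pointer it returns -1 instead of letting a later pop dereference garbage. It would not catch the external corruption of `g_tls_sll_head[0]` described under Current Status, since that happens after the chain is already in the list.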