Files
hakmem/P0_BUG_STATUS.md

242 lines
7.3 KiB
Markdown
Raw Normal View History

Fix #16: Resolve double BASE→USER conversion causing header corruption 🎯 ROOT CAUSE: Internal allocation helpers were prematurely converting BASE → USER pointers before returning to caller. The caller then applied HAK_RET_ALLOC/tiny_region_id_write_header which performed ANOTHER BASE→USER conversion, resulting in double offset (BASE+2) and header written at wrong location. 📦 BOX THEORY SOLUTION: Establish clean pointer conversion boundary at tiny_region_id_write_header, making it the single source of truth for BASE → USER conversion. 🔧 CHANGES: - Fix #16: Remove premature BASE→USER conversions (6 locations) * core/tiny_alloc_fast.inc.h (3 fixes) * core/hakmem_tiny_refill.inc.h (2 fixes) * core/hakmem_tiny_fastcache.inc.h (1 fix) - Fix #12: Add header validation in tls_sll_pop (detect corruption) - Fix #14: Defense-in-depth header restoration in tls_sll_splice - Fix #15: USER pointer detection (for debugging) - Fix #13: Bump window header restoration - Fix #2, #6, #7, #8: Various header restoration & NULL termination 🧪 TEST RESULTS: 100% SUCCESS - 10K-500K iterations: All passed - 8 seeds × 100K: All passed (42,123,456,789,999,314,271,161) - Performance: ~630K ops/s average (stable) - Header corruption: ZERO 📋 FIXES SUMMARY: Fix #1-8: Initial header restoration & chain fixes (chatgpt-san) Fix #9-10: USER pointer auto-fix (later disabled) Fix #12: Validation system (caught corruption at call 14209) Fix #13: Bump window header writes Fix #14: Splice defense-in-depth Fix #15: USER pointer detection (debugging tool) Fix #16: Double conversion fix (FINAL SOLUTION) ✅ 🎓 LESSONS LEARNED: 1. Validation catches bugs early (Fix #12 was critical) 2. Class-specific inline logging reveals patterns (Option C) 3. Box Theory provides clean architectural boundaries 4. Multiple investigation approaches (Task/chatgpt-san collaboration) 📄 DOCUMENTATION: - P0_BUG_STATUS.md: Complete bug tracking timeline - C2_CORRUPTION_ROOT_CAUSE_FINAL.md: Detailed root cause analysis - FINAL_ANALYSIS_C2_CORRUPTION.md: Investigation methodology 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Task Agent <task@anthropic.com> Co-Authored-By: ChatGPT <chatgpt@openai.com>
2025-11-12 10:33:57 +09:00
# P0 SEGV Bug - Current Status & Next Steps
**Last Update**: 2025-11-12
## 🐛 Bug Summary
**Symptom**: SEGV crash at iterations 28,440 and 38,985 (deterministic with seed 42)
**Pattern**: Corrupted address `0x7fff00008000` in TLS SLL chain
**Root Cause**: **STALE NEXT POINTERS** in carved chains
---
## 🎁 Box Theory Implementation (完了済み)
### ✅ **Box 3** (Pointer Conversion Box)
- **File**: `core/box/ptr_conversion_box.h` (267 lines)
- **役割**: BASE ↔ USER pointer conversion
- **API**:
- `ptr_base_to_user(base, class_idx)` - C0-C6: base+1, C7: base
- `ptr_user_to_base(user, class_idx)` - C0-C6: user-1, C7: user
- **Status**: ✅ Committed (1713 lines added total)
### ✅ **Box E** (Expansion Box)
- **File**: `core/box/superslab_expansion_box.h/c`
- **役割**: SuperSlab expansion with TLS state guarantee
- **機能**: `expansion_expand_with_tls_guarantee()` - Expand後に slab 0 を即座にバインド
- **Status**: ✅ Committed
### ✅ **Box I** (Integrity Box) - **703 lines!**
- **File**: `core/box/integrity_box.h` (267行) + `integrity_box.c` (436行)
- **役割**: Comprehensive integrity verification system
- **Priority ALPHA**: 5つの Slab Metadata 不変条件チェック
1. `carved <= capacity`
2. `used <= carved`
3. `used <= capacity`
4. `free_count == (carved - used)`
5. `capacity <= 512`
- **機能**:
- `integrity_validate_slab_metadata()` - メタデータ検証
- `validate_ptr_range()` - ポインタ範囲検証null-page, kernel-space, 0xa2/0xcc/0xdd/0xfe パターン)
- **Status**: ✅ Committed
### ✅ **Box TLS-SLL** (今回の修正対象)
- **File**: `core/box/tls_sll_box.h`
- **役割**: TLS Single-Linked List management (C7-safe)
- **API**:
- `tls_sll_push()` - Push to SLL (C7 rejected)
- `tls_sll_pop()` - Pop from SLL (returns base pointer)
- `tls_sll_splice()` - Batch push
- **今回の発見**:
- Fix #1: `tls_sll_pop` で next をクリアC0-C6 は base+1 で)
- But: carved chain の tail が NULL 終端されていないFix #2 必要)
- **Status**: ⚠️ Fix #1 適用済み、Fix #2 未適用
### ✅ **その他のBox** (既存)
- **Front Gate Box**: `core/box/front_gate_box.h/c` + `front_gate_classifier.c`
- **Free Local/Remote/Publish Box**: `core/box/free_local_box.c`, `free_remote_box.c`, `free_publish_box.c`
- **Mailbox Box**: `core/box/mailbox_box.h/c`
**Commit Info**:
- Commit: "Add Box I (Integrity), Box E (Expansion)..."
- Files: 23 files changed, 1713 insertions(+), 56 deletions(-)
- Date: Recent (before P0 debug session)
---
## 🔍 Investigation History
### ✅ Completed Investigations
1. **Valgrind (O0 build)**: 0 errors, 29K iterations passed
- Conclusion: Bug is optimization-dependent (-O3 triggers it)
2. **Task Agent GDB Analysis**:
- Found crash location: `tls_sll_pop` line 169
- Hypothesis: use-after-allocate (next pointer at base+1 is user memory)
3. **Box I, E, 3 Implementation**: 703 lines of integrity checks
- All checks passed before crash
- Validation didn't catch the bug
---
## 🛠️ Fixes Applied (Partial Success)
### Fix #1: Clear next pointer in `tls_sll_pop` ✅ (INCOMPLETE)
**File**: `core/box/tls_sll_box.h:254-262`
**Change**:
```c
// OLD (WRONG): Only cleared for C7
if (__builtin_expect(class_idx == 7, 0)) {
*(void**)base = NULL;
}
// NEW: Clear for C0-C6 too
#if HAKMEM_TINY_HEADER_CLASSIDX
if (class_idx == 7) {
*(void**)base = NULL; // C7: clear at base (offset 0)
} else {
*(void**)((uint8_t*)base + 1) = NULL; // C0-C6: clear at base+1 (offset 1)
}
#else
*(void**)base = NULL;
#endif
```
**Result**:
- ✅ Passed 29K iterations (previous crash point)
-**Still crashes at 38,985 iterations**
---
## 🚨 NEW DISCOVERY: Root Cause Found!
### Fix #2: NULL-terminate carved chain tail (NOT YET APPLIED)
**File**: `core/tiny_refill_opt.h:229-234`
**BUG**: Tail block's next pointer is NOT NULL-terminated!
```c
// Current code (BUGGY):
for (uint32_t i = 1; i < batch; i++) {
uint8_t* next = cursor + stride;
*(void**)(cursor + next_offset) = (void*)next; // Links blocks 0→1, 1→2, ...
cursor = next;
}
void* tail = (void*)cursor; // tail = last block
// ❌ BUG: tail's next pointer is NEVER set to NULL!
// It contains GARBAGE from previous allocation!
```
**IMPACT**:
1. Chain is carved: `head → block1 → block2 → ... → tail → [GARBAGE]`
2. Chain spliced to TLS SLL
3. Later, `tls_sll_pop` traverses the chain
4. Reads garbage `next` pointer → SEGV at `0x7fff00008000`
**FIX** (add after line 233):
```c
for (uint32_t i = 1; i < batch; i++) {
uint8_t* next = cursor + stride;
*(void**)(cursor + next_offset) = (void*)next;
cursor = next;
}
void* tail = (void*)cursor;
// ✅ FIX: NULL-terminate the tail
*(void**)((uint8_t*)tail + next_offset) = NULL;
```
---
## 🚨 CURRENT STATUS (2025-11-12 UPDATED)
### Fixes Applied:
1.**Fix #1**: Clear next pointer in `tls_sll_pop` (C0-C6 at base+1)
2.**Fix #2**: NULL-terminate tail in `trc_linear_carve()`
3.**Fix #3**: Clean rebuild with `HEADER_CLASSIDX=1`
4.**Fix #4**: Increase canary check frequency (1000 → 100 ops)
5.**Fix #5**: Add bounds check to `tls_sll_push()`
### Test Results:
-**Still crashes at iteration 28,410 (call 14269)**
- Canaries: NOT corrupted (corruption is immediate)
- Bounds check: NOT triggered (class_idx is valid)
- Task agent finding: External corruption of `g_tls_sll_head[0]`
### Analysis:
- Fix #1 and Fix #2 ARE working correctly (Task agent verified)
- Corruption happens IMMEDIATELY before crash (canaries at 100-op interval miss it)
- class_idx is valid [0-7] when corruption happens (bounds check doesn't trigger)
- Crash is deterministic at call 14269
## 📋 Next Steps (NEEDS USER INPUT)
### Option A: Deep GDB Investigation (SLOW)
- Set hardware watchpoint on `g_tls_sll_head[0]`
- Run to call 14250, then watch for corruption
- Time: 1-2 hours, may not work with optimization
### Option B: Disable Optimizations (DIAGNOSTIC)
- Rebuild with `-O0` to see if bug disappears
- If so, likely compiler optimization bug or UB
- Time: 10 minutes
### Option C: Simplified Stress Test (QUICK)
- Disable P0 batch optimization temporarily
- Disable SFC temporarily
- Test with simpler code path
- Time: 20 minutes
### After Fix Verified
4. **Commit P0 fix**:
- Fix #1: Clear next in `tls_sll_pop`
- Fix #2: NULL-terminate in `trc_linear_carve`
- Box I/E/3 validation infrastructure
- Double-free detection
5. **Update CLAUDE.md** with findings
6. **Performance benchmark** (release build)
---
## 🎯 Expected Outcome
After applying Fix #2, the allocator should:
- ✅ Pass 100K iterations without crash
- ✅ Pass 1M iterations without crash
- ✅ Maintain performance (~2.7M ops/s for 256B)
---
## 📝 Lessons Learned
1. **Stale pointers are dangerous**: Always NULL-terminate linked lists
2. **Optimization exposes bugs**: `-O3` can hide initialization in debug builds
3. **Multiple fixes needed**: Fix #1 alone was insufficient
4. **Chain integrity**: Carved chains MUST be properly terminated
---
## 🔧 Build Flags (CRITICAL)
**MUST use these flags**:
```bash
HEADER_CLASSIDX=1
AGGRESSIVE_INLINE=1
PREWARM_TLS=1
```
**Why**: `HAKMEM_TINY_HEADER_CLASSIDX=1` is required for Fix #1 to execute!
**Use build.sh** to ensure correct flags:
```bash
./build.sh bench_random_mixed_hakmem
```