hakmem/P0_BUG_STATUS.md
Moe Charm (CI) 84dbd97fe9 Fix #16: Resolve double BASE→USER conversion causing header corruption
🎯 ROOT CAUSE: Internal allocation helpers were prematurely converting
BASE → USER pointers before returning to caller. The caller then applied
HAK_RET_ALLOC/tiny_region_id_write_header which performed ANOTHER BASE→USER
conversion, resulting in double offset (BASE+2) and header written at
wrong location.

📦 BOX THEORY SOLUTION: Establish clean pointer conversion boundary at
tiny_region_id_write_header, making it the single source of truth for
BASE → USER conversion.

🔧 CHANGES:
- Fix #16: Remove premature BASE→USER conversions (6 locations)
  * core/tiny_alloc_fast.inc.h (3 fixes)
  * core/hakmem_tiny_refill.inc.h (2 fixes)
  * core/hakmem_tiny_fastcache.inc.h (1 fix)

- Fix #12: Add header validation in tls_sll_pop (detect corruption)
- Fix #14: Defense-in-depth header restoration in tls_sll_splice
- Fix #15: USER pointer detection (for debugging)
- Fix #13: Bump window header restoration
- Fix #2, #6, #7, #8: Various header restoration & NULL termination

🧪 TEST RESULTS: 100% SUCCESS
- 10K-500K iterations: All passed
- 8 seeds × 100K: All passed (42,123,456,789,999,314,271,161)
- Performance: ~630K ops/s average (stable)
- Header corruption: ZERO

📋 FIXES SUMMARY:
Fix #1-8:   Initial header restoration & chain fixes (chatgpt-san)
Fix #9-10:  USER pointer auto-fix (later disabled)
Fix #12:    Validation system (caught corruption at call 14209)
Fix #13:    Bump window header writes
Fix #14:    Splice defense-in-depth
Fix #15:    USER pointer detection (debugging tool)
Fix #16:    Double conversion fix (FINAL SOLUTION) 

🎓 LESSONS LEARNED:
1. Validation catches bugs early (Fix #12 was critical)
2. Class-specific inline logging reveals patterns (Option C)
3. Box Theory provides clean architectural boundaries
4. Multiple investigation approaches (Task/chatgpt-san collaboration)

📄 DOCUMENTATION:
- P0_BUG_STATUS.md: Complete bug tracking timeline
- C2_CORRUPTION_ROOT_CAUSE_FINAL.md: Detailed root cause analysis
- FINAL_ANALYSIS_C2_CORRUPTION.md: Investigation methodology

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Task Agent <task@anthropic.com>
Co-Authored-By: ChatGPT <chatgpt@openai.com>
2025-11-12 10:33:57 +09:00


# P0 SEGV Bug - Current Status & Next Steps
**Last Update**: 2025-11-12
## 🐛 Bug Summary
**Symptom**: SEGV crash at iterations 28,440 and 38,985 (deterministic with seed 42)
**Pattern**: Corrupted address `0x7fff00008000` in TLS SLL chain
**Root Cause**: **STALE NEXT POINTERS** in carved chains
---
## 🎁 Box Theory Implementation (Completed)
### ✅ **Box 3** (Pointer Conversion Box)
- **File**: `core/box/ptr_conversion_box.h` (267 lines)
- **Role**: BASE ↔ USER pointer conversion
- **API**:
  - `ptr_base_to_user(base, class_idx)` - C0-C6: base+1, C7: base
  - `ptr_user_to_base(user, class_idx)` - C0-C6: user-1, C7: user
- **Status**: ✅ Committed (1713 lines added total)
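
The conversion rule listed above can be sketched as follows. This is a minimal illustration, not the contents of `ptr_conversion_box.h`: the `sketch_*` names are hypothetical, and the only facts assumed are the ones stated in the API list (a 1-byte offset for C0-C6, no offset for C7).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: per the API list, classes 0-6 place the user
 * pointer one byte past BASE, while class 7 uses BASE directly. */
static inline size_t sketch_header_size(int class_idx) {
    return (class_idx == 7) ? 0u : 1u;
}

/* BASE -> USER: skip the header, if the class has one. */
static inline void* sketch_base_to_user(void* base, int class_idx) {
    return (uint8_t*)base + sketch_header_size(class_idx);
}

/* USER -> BASE: step back over the header, if the class has one. */
static inline void* sketch_user_to_base(void* user, int class_idx) {
    return (uint8_t*)user - sketch_header_size(class_idx);
}
```

Applying `sketch_base_to_user` twice to a C0-C6 block yields BASE+2, which is the double offset that Fix #16 in the commit message describes. Keeping the conversion in one place makes that mistake easier to spot.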
### ✅ **Box E** (Expansion Box)
- **File**: `core/box/superslab_expansion_box.h/c`
- **Role**: SuperSlab expansion with TLS state guarantee
- **Function**: `expansion_expand_with_tls_guarantee()` - binds slab 0 immediately after expansion
- **Status**: ✅ Committed
### ✅ **Box I** (Integrity Box) - **703 lines!**
- **File**: `core/box/integrity_box.h` (267 lines) + `integrity_box.c` (436 lines)
- **Role**: Comprehensive integrity verification system
- **Priority ALPHA**: five slab metadata invariant checks
  1. `carved <= capacity`
  2. `used <= carved`
  3. `used <= capacity`
  4. `free_count == (carved - used)`
  5. `capacity <= 512`
- **Functions**:
  - `integrity_validate_slab_metadata()` - metadata validation
  - `validate_ptr_range()` - pointer range validation (null-page, kernel-space, 0xa2/0xcc/0xdd/0xfe patterns)
- **Status**: ✅ Committed
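
The five Priority ALPHA invariants translate directly into a checker. The sketch below is illustrative only: the struct layout and the `Sketch`/`sketch_` names are assumptions, and the real `integrity_validate_slab_metadata()` may take different arguments and report failures differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical metadata layout; field names follow the invariant list. */
typedef struct {
    uint32_t capacity;   /* total blocks the slab can hold */
    uint32_t carved;     /* blocks carved out so far */
    uint32_t used;       /* blocks currently handed to callers */
    uint32_t free_count; /* carved blocks sitting on the free list */
} SketchSlabMeta;

/* Returns true only if all five listed invariants hold. */
static bool sketch_validate_slab_metadata(const SketchSlabMeta* m) {
    if (m->carved > m->capacity) return false;               /* 1 */
    if (m->used > m->carved) return false;                   /* 2 */
    if (m->used > m->capacity) return false;                 /* 3 */
    /* Safe unsigned subtraction: check 2 guarantees used <= carved. */
    if (m->free_count != m->carved - m->used) return false;  /* 4 */
    if (m->capacity > 512) return false;                     /* 5 */
    return true;
}
```

Invariant 3 follows from 1 and 2, so it never fails on its own here. It is kept so the sketch matches the list one-to-one.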
### ✅ **Box TLS-SLL** (target of this fix)
- **File**: `core/box/tls_sll_box.h`
- **Role**: TLS Single-Linked List management (C7-safe)
- **API**:
  - `tls_sll_push()` - Push to SLL (C7 rejected)
  - `tls_sll_pop()` - Pop from SLL (returns base pointer)
  - `tls_sll_splice()` - Batch push
- **Findings in this session**:
  - Fix #1: Clear next in `tls_sll_pop` (at base+1 for C0-C6)
  - But: the tail of the carved chain is not NULL-terminated (Fix #2 needed)
- **Status**: ⚠️ Fix #1 applied, Fix #2 not yet applied
### ✅ **Other Boxes** (existing)
- **Front Gate Box**: `core/box/front_gate_box.h/c` + `front_gate_classifier.c`
- **Free Local/Remote/Publish Box**: `core/box/free_local_box.c`, `free_remote_box.c`, `free_publish_box.c`
- **Mailbox Box**: `core/box/mailbox_box.h/c`
**Commit Info**:
- Commit: "Add Box I (Integrity), Box E (Expansion)..."
- Files: 23 files changed, 1713 insertions(+), 56 deletions(-)
- Date: Recent (before P0 debug session)
---
## 🔍 Investigation History
### ✅ Completed Investigations
1. **Valgrind (O0 build)**: 0 errors, 29K iterations passed
- Conclusion: Bug is optimization-dependent (-O3 triggers it)
2. **Task Agent GDB Analysis**:
- Found crash location: `tls_sll_pop` line 169
- Hypothesis: use-after-allocate (next pointer at base+1 is user memory)
3. **Box I, E, 3 Implementation**: 703 lines of integrity checks
- All checks passed before crash
- Validation didn't catch the bug
---
## 🛠️ Fixes Applied (Partial Success)
### Fix #1: Clear next pointer in `tls_sll_pop` ✅ (INCOMPLETE)
**File**: `core/box/tls_sll_box.h:254-262`
**Change**:
```c
// OLD (WRONG): Only cleared for C7
if (__builtin_expect(class_idx == 7, 0)) {
    *(void**)base = NULL;
}
// NEW: Clear for C0-C6 too
#if HAKMEM_TINY_HEADER_CLASSIDX
if (class_idx == 7) {
    *(void**)base = NULL;                  // C7: clear at base (offset 0)
} else {
    *(void**)((uint8_t*)base + 1) = NULL;  // C0-C6: clear at base+1 (offset 1)
}
#else
*(void**)base = NULL;
#endif
```
**Result**:
- ✅ Passed 29K iterations (previous crash point)
- ❌ **Still crashes at 38,985 iterations**
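
For context, a pop that clears the popped block's next field might look like the sketch below. It is not the code in `tls_sll_box.h`: the `sketch_*` names and the explicit `head` parameter are assumptions made for illustration. It also uses `memcpy` rather than a direct `*(void**)` store, because base+1 is generally not pointer-aligned and `memcpy` sidesteps that question in portable C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of the next field inside a free block: 0 for C7, 1 for C0-C6,
 * matching the offsets used in Fix #1. */
static inline size_t sketch_next_offset(int class_idx) {
    return (class_idx == 7) ? 0u : 1u;
}

/* Read the next pointer stored at base+off. */
static void* sketch_load_next(void* base, size_t off) {
    void* next;
    memcpy(&next, (uint8_t*)base + off, sizeof next);
    return next;
}

/* Write the next pointer at base+off. */
static void sketch_store_next(void* base, size_t off, void* next) {
    memcpy((uint8_t*)base + off, &next, sizeof next);
}

/* Pop the head block and clear its next field so the caller never
 * sees a stale link. Returns a BASE pointer, or NULL if empty. */
static void* sketch_sll_pop(void** head, int class_idx) {
    void* base = *head;
    if (base == NULL) return NULL;
    size_t off = sketch_next_offset(class_idx);
    *head = sketch_load_next(base, off);
    sketch_store_next(base, off, NULL);
    return base;
}
```

Clearing on pop only protects blocks that pass through this function. A chain built elsewhere still needs its own tail terminated.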
---
## 🚨 NEW DISCOVERY: Root Cause Found!
### Fix #2: NULL-terminate carved chain tail (NOT YET APPLIED)
**File**: `core/tiny_refill_opt.h:229-234`
**BUG**: Tail block's next pointer is NOT NULL-terminated!
```c
// Current code (BUGGY):
for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;  // Links blocks 0→1, 1→2, ...
    cursor = next;
}
void* tail = (void*)cursor;  // tail = last block
// ❌ BUG: tail's next pointer is NEVER set to NULL!
// It contains GARBAGE from previous allocation!
```
**IMPACT**:
1. Chain is carved: `head → block1 → block2 → ... → tail → [GARBAGE]`
2. Chain spliced to TLS SLL
3. Later, `tls_sll_pop` traverses the chain
4. Reads garbage `next` pointer → SEGV at `0x7fff00008000`
**FIX** (add after line 233):
```c
for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;
    cursor = next;
}
void* tail = (void*)cursor;
// ✅ FIX: NULL-terminate the tail
*(void**)((uint8_t*)tail + next_offset) = NULL;
```
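
To see the effect of the fix in isolation, the carve loop can be wrapped in a small standalone function and the resulting chain walked to its end. This is a hedged sketch: `sketch_linear_carve` and `sketch_chain_length` are hypothetical names, not the signature of `trc_linear_carve()`, and `memcpy` replaces the direct pointer stores so the example stays well-defined at unaligned offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Carve `batch` (>= 1) blocks of `stride` bytes starting at `start`,
 * linking each block to the next via a pointer stored at `next_offset`.
 * Returns the head and reports the tail through `tail_out`. */
static void* sketch_linear_carve(uint8_t* start, size_t stride, uint32_t batch,
                                 size_t next_offset, void** tail_out) {
    uint8_t* cursor = start;
    for (uint32_t i = 1; i < batch; i++) {
        void* next = cursor + stride;
        memcpy(cursor + next_offset, &next, sizeof next);
        cursor = (uint8_t*)next;
    }
    /* The fix: terminate the chain so stale bytes are never followed. */
    void* end = NULL;
    memcpy(cursor + next_offset, &end, sizeof end);
    *tail_out = cursor;
    return start;
}

/* Walk the chain until NULL and count the blocks visited. */
static uint32_t sketch_chain_length(void* head, size_t next_offset) {
    uint32_t n = 0;
    while (head != NULL) {
        void* next;
        memcpy(&next, (uint8_t*)head + next_offset, sizeof next);
        head = next;
        n++;
    }
    return n;
}
```

Pre-filling the backing buffer with a non-zero pattern (for example `memset(slab, 0xCC, ...)`) before carving makes a missing terminator visible: without the final store, the walk would follow the pattern bytes as a pointer.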
---
## 🚨 CURRENT STATUS (2025-11-12 UPDATED)
### Fixes Applied:
1. ✅ **Fix #1**: Clear next pointer in `tls_sll_pop` (C0-C6 at base+1)
2. ✅ **Fix #2**: NULL-terminate tail in `trc_linear_carve()`
3. ✅ **Fix #3**: Clean rebuild with `HEADER_CLASSIDX=1`
4. ✅ **Fix #4**: Increase canary check frequency (1000 → 100 ops)
5. ✅ **Fix #5**: Add bounds check to `tls_sll_push()`
### Test Results:
- ❌ **Still crashes at iteration 28,410 (call 14269)**
- Canaries: NOT corrupted (corruption is immediate)
- Bounds check: NOT triggered (class_idx is valid)
- Task agent finding: External corruption of `g_tls_sll_head[0]`
### Analysis:
- Fix #1 and Fix #2 ARE working correctly (Task agent verified)
- Corruption happens IMMEDIATELY before crash (canaries at 100-op interval miss it)
- class_idx is valid [0-7] when corruption happens (bounds check doesn't trigger)
- Crash is deterministic at call 14269
## 📋 Next Steps (NEEDS USER INPUT)
### Option A: Deep GDB Investigation (SLOW)
- Set hardware watchpoint on `g_tls_sll_head[0]`
- Run to call 14250, then watch for corruption
- Time: 1-2 hours, may not work with optimization
### Option B: Disable Optimizations (DIAGNOSTIC)
- Rebuild with `-O0` to see if bug disappears
- If so, likely compiler optimization bug or UB
- Time: 10 minutes
### Option C: Simplified Stress Test (QUICK)
- Disable P0 batch optimization temporarily
- Disable SFC temporarily
- Test with simpler code path
- Time: 20 minutes
### After Fix Verified
4. **Commit P0 fix**:
- Fix #1: Clear next in `tls_sll_pop`
- Fix #2: NULL-terminate in `trc_linear_carve`
- Box I/E/3 validation infrastructure
- Double-free detection
5. **Update CLAUDE.md** with findings
6. **Performance benchmark** (release build)
---
## 🎯 Expected Outcome
After applying Fix #2, the allocator should:
- ✅ Pass 100K iterations without crash
- ✅ Pass 1M iterations without crash
- ✅ Maintain performance (~2.7M ops/s for 256B)
---
## 📝 Lessons Learned
1. **Stale pointers are dangerous**: Always NULL-terminate linked lists
2. **Optimization exposes bugs**: Missing initialization can stay hidden in `-O0` debug builds and only surface at `-O3`
3. **Multiple fixes needed**: Fix #1 alone was insufficient
4. **Chain integrity**: Carved chains MUST be properly terminated
---
## 🔧 Build Flags (CRITICAL)
**MUST use these flags**:
```bash
HEADER_CLASSIDX=1
AGGRESSIVE_INLINE=1
PREWARM_TLS=1
```
**Why**: `HAKMEM_TINY_HEADER_CLASSIDX=1` is required for Fix #1 to execute!
**Use build.sh** to ensure correct flags:
```bash
./build.sh bench_random_mixed_hakmem
```