# P0 SEGV Bug - Current Status & Next Steps

**Last Update**: 2025-11-12

## 🐛 Bug Summary

**Symptom**: SEGV crash at iterations 28,440 and 38,985 (deterministic with seed 42)
**Pattern**: Corrupted address `0x7fff00008000` in TLS SLL chain
**Root Cause (suspected)**: **STALE NEXT POINTERS** in carved chains. The crash persists after Fix #1 and Fix #2 (see Current Status), so this is not yet confirmed as the only cause.

---

## 🎁 Box Theory Implementation (Completed)

### ✅ **Box 3** (Pointer Conversion Box)
- **File**: `core/box/ptr_conversion_box.h` (267 lines)
- **Role**: BASE ↔ USER pointer conversion
- **API**:
  - `ptr_base_to_user(base, class_idx)` - C0-C6: base+1, C7: base
  - `ptr_user_to_base(user, class_idx)` - C0-C6: user-1, C7: user
- **Status**: ✅ Committed (1713 lines added total)

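The conversion rule above can be sketched as follows. This is a minimal illustration, not the contents of `ptr_conversion_box.h`: the `sketch_` names are hypothetical, and reading the one-byte offset as a small header in front of the user region for classes 0-6 is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: class 7 is the class whose user pointer equals base. */
enum { SKETCH_HEADERLESS_CLASS = 7 };

/* BASE -> USER: classes 0-6 return base+1, class 7 returns base unchanged. */
static inline void *sketch_base_to_user(void *base, int class_idx) {
    assert(class_idx >= 0 && class_idx <= 7);
    if (class_idx == SKETCH_HEADERLESS_CLASS) {
        return base;
    }
    return (void *)((uint8_t *)base + 1);
}

/* USER -> BASE: exact inverse of sketch_base_to_user for the same class. */
static inline void *sketch_user_to_base(void *user, int class_idx) {
    assert(class_idx >= 0 && class_idx <= 7);
    if (class_idx == SKETCH_HEADERLESS_CLASS) {
        return user;
    }
    return (void *)((uint8_t *)user - 1);
}
```

Keeping both directions in one place means a caller never hand-writes `+1` or `-1`, which is where a class 7 special case is easiest to forget.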
### ✅ **Box E** (Expansion Box)
- **File**: `core/box/superslab_expansion_box.h/c`
- **Role**: SuperSlab expansion with TLS state guarantee
- **Function**: `expansion_expand_with_tls_guarantee()` - binds slab 0 immediately after expansion
- **Status**: ✅ Committed

### ✅ **Box I** (Integrity Box) - **703 lines!**
- **File**: `core/box/integrity_box.h` (267 lines) + `integrity_box.c` (436 lines)
- **Role**: Comprehensive integrity verification system
- **Priority ALPHA**: 5 slab metadata invariant checks
  1. `carved <= capacity`
  2. `used <= carved`
  3. `used <= capacity`
  4. `free_count == (carved - used)`
  5. `capacity <= 512`
- **Functions**:
  - `integrity_validate_slab_metadata()` - metadata validation
  - `validate_ptr_range()` - pointer range validation (null-page, kernel-space, 0xa2/0xcc/0xdd/0xfe patterns)
- **Status**: ✅ Committed

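The five invariants translate directly into code. The sketch below assumes a hypothetical metadata struct whose field names simply mirror the invariants; the real layout and the real signature of `integrity_validate_slab_metadata()` are not shown in this document.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical slab metadata; only the fields named by the invariants. */
typedef struct {
    uint32_t capacity;   /* total blocks the slab can hold */
    uint32_t carved;     /* blocks carved so far */
    uint32_t used;       /* blocks currently handed out */
    uint32_t free_count; /* carved blocks that are free again */
} sketch_slab_meta;

/* Returns 0 if all invariants hold, otherwise the number (1-5) of the
 * first violated invariant, in the order listed above. Invariant 3 is
 * implied by 1 and 2, so it can never be the first one to fail here. */
static int sketch_first_violation(const sketch_slab_meta *m) {
    if (m->carved > m->capacity) return 1;
    if (m->used > m->carved) return 2;
    if (m->used > m->capacity) return 3;
    if (m->free_count != m->carved - m->used) return 4;
    if (m->capacity > 512) return 5;
    return 0;
}
```

Returning the index of the failing invariant, rather than a plain boolean, makes a failure report point at the specific counter that drifted.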
### ✅ **Box TLS-SLL** (target of this fix)
- **File**: `core/box/tls_sll_box.h`
- **Role**: TLS Singly-Linked List management (C7-safe)
- **API**:
  - `tls_sll_push()` - Push to SLL (C7 rejected)
  - `tls_sll_pop()` - Pop from SLL (returns base pointer)
  - `tls_sll_splice()` - Batch push
- **Findings in this session**:
  - Fix #1: clear `next` in `tls_sll_pop` (at base+1 for C0-C6)
  - But: the carved chain's tail is not NULL-terminated (Fix #2 needed)
- **Status**: ⚠️ Fix #1 applied, Fix #2 not yet applied

### ✅ **Other Boxes** (existing)
- **Front Gate Box**: `core/box/front_gate_box.h/c` + `front_gate_classifier.c`
- **Free Local/Remote/Publish Box**: `core/box/free_local_box.c`, `free_remote_box.c`, `free_publish_box.c`
- **Mailbox Box**: `core/box/mailbox_box.h/c`

**Commit Info**:
- Commit: "Add Box I (Integrity), Box E (Expansion)..."
- Files: 23 files changed, 1713 insertions(+), 56 deletions(-)
- Date: Recent (before P0 debug session)

---

## 🔍 Investigation History

### ✅ Completed Investigations

1. **Valgrind (O0 build)**: 0 errors, 29K iterations passed
   - Conclusion: Bug is optimization-dependent (-O3 triggers it)

2. **Task Agent GDB Analysis**:
   - Found crash location: `tls_sll_pop` line 169
   - Hypothesis: use-after-allocate (next pointer at base+1 is user memory)

3. **Box I, E, 3 Implementation**: 703 lines of integrity checks
   - All checks passed before crash
   - Validation didn't catch the bug

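One possible reason the validation stayed silent: the corrupted value `0x7fff00008000` looks like an ordinary user-space address, so checks of the kind listed for `validate_ptr_range()` would accept it. The sketch below illustrates that reading only. The thresholds are assumptions (a 4 KiB null page, the x86-64 Linux user/kernel split, and "poison" meaning all eight bytes equal one of the listed fill bytes), and every `sketch_` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    SKETCH_PTR_OK = 0,
    SKETCH_PTR_NULL_PAGE,
    SKETCH_PTR_POISON,
    SKETCH_PTR_KERNEL
} sketch_ptr_verdict;

/* Classifies an address using range and fill-pattern checks only. */
static sketch_ptr_verdict sketch_classify_ptr(uint64_t addr) {
    static const uint8_t poison[] = {0xa2, 0xcc, 0xdd, 0xfe};

    if (addr < 4096u) {
        return SKETCH_PTR_NULL_PAGE;          /* assumed 4 KiB null page */
    }
    for (size_t i = 0; i < sizeof poison; i++) {
        /* Replicate the fill byte into all eight bytes of the word. */
        if (addr == 0x0101010101010101ULL * poison[i]) {
            return SKETCH_PTR_POISON;
        }
    }
    if (addr >= 0x0000800000000000ULL) {
        return SKETCH_PTR_KERNEL;             /* assumed x86-64 split */
    }
    return SKETCH_PTR_OK;
}
```

Under these assumptions `sketch_classify_ptr(0x7fff00008000ULL)` yields `SKETCH_PTR_OK`. Range checks can reject obviously bad values, but they cannot tell a stale yet plausible pointer from a valid one.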
---

## 🛠️ Fixes Applied (Partial Success)

### Fix #1: Clear next pointer in `tls_sll_pop` ✅ (INCOMPLETE)

**File**: `core/box/tls_sll_box.h:254-262`

**Change**:
```c
// OLD (WRONG): Only cleared for C7
if (__builtin_expect(class_idx == 7, 0)) {
    *(void**)base = NULL;
}

// NEW: Clear for C0-C6 too
#if HAKMEM_TINY_HEADER_CLASSIDX
if (class_idx == 7) {
    *(void**)base = NULL;  // C7: clear at base (offset 0)
} else {
    *(void**)((uint8_t*)base + 1) = NULL;  // C0-C6: clear at base+1 (offset 1)
}
#else
*(void**)base = NULL;
#endif
```

**Result**:
- ✅ Passed 29K iterations (previous crash point)
- ❌ **Still crashes at 38,985 iterations**

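The idea behind Fix #1 can be tried in isolation with a small intrusive list whose link lives at a configurable byte offset inside each block. This is a sketch under stated assumptions, not the code in `tls_sll_box.h`: the `sketch_` helpers are hypothetical, and `memcpy` is used instead of the direct `*(void**)` store because a pointer-sized store at offset 1 is misaligned, which standard C does not define.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads the link stored `off` bytes into the block. */
static void *sketch_load_next(const void *base, size_t off) {
    void *next;
    memcpy(&next, (const uint8_t *)base + off, sizeof next);
    return next;
}

/* Writes the link stored `off` bytes into the block. */
static void sketch_store_next(void *base, size_t off, void *next) {
    memcpy((uint8_t *)base + off, &next, sizeof next);
}

/* Pushes a block onto the list headed by *head. */
static void sketch_sll_push(void **head, void *base, size_t off) {
    sketch_store_next(base, off, *head);
    *head = base;
}

/* Pops the head block and clears its link before returning it, so no
 * stale next pointer survives in memory that is about to be reused. */
static void *sketch_sll_pop(void **head, size_t off) {
    void *base = *head;
    if (base == NULL) {
        return NULL;
    }
    *head = sketch_load_next(base, off);
    sketch_store_next(base, off, NULL);
    return base;
}
```

With offset 1 (the C0-C6 case), pushing blocks `a` then `b` and popping once returns `b`, leaves `a` as the head, and leaves the link slot inside `b` set to NULL.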
---

## 🚨 NEW DISCOVERY: Root Cause Found!

### Fix #2: NULL-terminate carved chain tail (NOT YET APPLIED)

**File**: `core/tiny_refill_opt.h:229-234`

**BUG**: Tail block's next pointer is NOT NULL-terminated!

```c
// Current code (BUGGY):
for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;  // Links blocks 0→1, 1→2, ...
    cursor = next;
}
void* tail = (void*)cursor;  // tail = last block
// ❌ BUG: tail's next pointer is NEVER set to NULL!
// It contains GARBAGE from previous allocation!
```

**IMPACT**:
1. Chain is carved: `head → block1 → block2 → ... → tail → [GARBAGE]`
2. Chain spliced to TLS SLL
3. Later, `tls_sll_pop` traverses the chain
4. Reads garbage `next` pointer → SEGV at `0x7fff00008000`

**FIX** (add after line 233):
```c
for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;
    cursor = next;
}
void* tail = (void*)cursor;

// ✅ FIX: NULL-terminate the tail
*(void**)((uint8_t*)tail + next_offset) = NULL;
```

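A standalone version of the carve loop makes the effect of the terminator easy to check. The sketch below is not `trc_linear_carve()` itself: the function names and signatures are hypothetical, `batch >= 1` is assumed, and `memcpy` stands in for the direct pointer stores so that an offset of 1 stays well-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Carves `batch` blocks of `stride` bytes starting at `start`, linking
 * block i to block i+1 at `next_offset`, then NULL-terminates the tail.
 * Assumes batch >= 1. Returns the head; the tail goes to *tail_out. */
static void *sketch_linear_carve(uint8_t *start, size_t stride,
                                 size_t next_offset, uint32_t batch,
                                 void **tail_out) {
    uint8_t *cursor = start;
    for (uint32_t i = 1; i < batch; i++) {
        void *next = cursor + stride;
        memcpy(cursor + next_offset, &next, sizeof next);
        cursor = (uint8_t *)next;
    }
    void *end = NULL;
    memcpy(cursor + next_offset, &end, sizeof end); /* the Fix #2 step */
    *tail_out = cursor;
    return start;
}

/* Follows links until NULL, giving up after `max` blocks. */
static uint32_t sketch_chain_length(const void *head, size_t next_offset,
                                    uint32_t max) {
    uint32_t n = 0;
    while (head != NULL && n < max) {
        void *next;
        memcpy(&next, (const uint8_t *)head + next_offset, sizeof next);
        head = next;
        n++;
    }
    return n;
}
```

Pre-filling the backing memory with a junk byte such as `0xdd` before carving simulates a recycled slab. With the terminator in place, a 4-block carve walks as exactly 4 blocks; without it, the walk would follow whatever bytes were left in the tail.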
---

## 🚨 CURRENT STATUS (2025-11-12 UPDATED)

### Fixes Applied:
1. ✅ **Fix #1**: Clear next pointer in `tls_sll_pop` (C0-C6 at base+1)
2. ✅ **Fix #2**: NULL-terminate tail in `trc_linear_carve()`
3. ✅ **Fix #3**: Clean rebuild with `HEADER_CLASSIDX=1`
4. ✅ **Fix #4**: Increase canary check frequency (1000 → 100 ops)
5. ✅ **Fix #5**: Add bounds check to `tls_sll_push()`

### Test Results:
- ❌ **Still crashes at iteration 28,410 (call 14269)**
- Canaries: NOT corrupted (corruption is immediate)
- Bounds check: NOT triggered (class_idx is valid)
- Task agent finding: External corruption of `g_tls_sll_head[0]`

### Analysis:
- Fix #1 and Fix #2 ARE working correctly (Task agent verified)
- Corruption happens IMMEDIATELY before crash (canaries at 100-op interval miss it)
- class_idx is valid [0-7] when corruption happens (bounds check doesn't trigger)
- Crash is deterministic at call 14269

## 📋 Next Steps (NEEDS USER INPUT)

### Option A: Deep GDB Investigation (SLOW)
- Set hardware watchpoint on `g_tls_sll_head[0]`
- Run to call 14250, then watch for corruption
- Time: 1-2 hours, may not work with optimization

### Option B: Disable Optimizations (DIAGNOSTIC)
- Rebuild with `-O0` to see if bug disappears
- If so, likely compiler optimization bug or UB
- Time: 10 minutes

### Option C: Simplified Stress Test (QUICK)
- Disable P0 batch optimization temporarily
- Disable SFC temporarily
- Test with simpler code path
- Time: 20 minutes

### After Fix Verified

4. **Commit P0 fix**:
   - Fix #1: Clear next in `tls_sll_pop`
   - Fix #2: NULL-terminate in `trc_linear_carve`
   - Box I/E/3 validation infrastructure
   - Double-free detection

5. **Update CLAUDE.md** with findings

6. **Performance benchmark** (release build)

---

## 🎯 Expected Outcome

After applying Fix #2, the allocator should:
- ✅ Pass 100K iterations without crash
- ✅ Pass 1M iterations without crash
- ✅ Maintain performance (~2.7M ops/s for 256B)

---

## 📝 Lessons Learned

1. **Stale pointers are dangerous**: Always NULL-terminate linked lists
2. **Optimization exposes bugs**: Debug builds (`-O0`) can mask missing initialization that `-O3` exposes
3. **Multiple fixes needed**: Fix #1 alone was insufficient
4. **Chain integrity**: Carved chains MUST be properly terminated

---

## 🔧 Build Flags (CRITICAL)

**MUST use these flags**:
```bash
HEADER_CLASSIDX=1
AGGRESSIVE_INLINE=1
PREWARM_TLS=1
```

**Why**: `HAKMEM_TINY_HEADER_CLASSIDX=1` is required for Fix #1 to execute!

**Use build.sh** to ensure correct flags:
```bash
./build.sh bench_random_mixed_hakmem
```