# P0 SEGV Bug - Current Status & Next Steps
**Last Update**: 2025-11-12
## 🐛 Bug Summary
**Symptom**: SEGV crash at iterations 28,440 and 38,985 (deterministic with seed 42)
**Pattern**: Corrupted address `0x7fff00008000` in TLS SLL chain
**Root Cause**: **STALE NEXT POINTERS** in carved chains
---
## 🎁 Box Theory Implementation (Completed)
### ✅ **Box 3** (Pointer Conversion Box)
- **File**: `core/box/ptr_conversion_box.h` (267 lines)
- **Role**: BASE ↔ USER pointer conversion
- **API**:
  - `ptr_base_to_user(base, class_idx)` - C0-C6: base+1, C7: base
  - `ptr_user_to_base(user, class_idx)` - C0-C6: user-1, C7: user
- **Status**: ✅ Committed (1713 lines added total)
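
The conversion rule listed in the API above can be sketched as follows. This is a minimal illustration, not the contents of `ptr_conversion_box.h`: the `sketch_` names and the `SKETCH_HEADERLESS_CLASS` constant are hypothetical, and the only behavior taken from this document is that classes C0-C6 use an offset of one byte while C7 uses none.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: this document treats class 7 as the class
 * whose user pointer equals its base pointer. */
enum { SKETCH_HEADERLESS_CLASS = 7 };

/* BASE -> USER: classes 0-6 are offset by one byte, class 7 is not. */
static inline void *sketch_base_to_user(void *base, int class_idx) {
    if (class_idx == SKETCH_HEADERLESS_CLASS) {
        return base;
    }
    return (void *)((uint8_t *)base + 1);
}

/* USER -> BASE: exact inverse of sketch_base_to_user. */
static inline void *sketch_user_to_base(void *user, int class_idx) {
    if (class_idx == SKETCH_HEADERLESS_CLASS) {
        return user;
    }
    return (void *)((uint8_t *)user - 1);
}
```

Keeping both directions in one place means a round trip (`user_to_base(base_to_user(p, c), c)`) should return `p` for every class, which is an easy property to assert in a debug build.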
### ✅ **Box E** (Expansion Box)
- **File**: `core/box/superslab_expansion_box.h/c`
- **Role**: SuperSlab expansion with TLS state guarantee
- **Function**: `expansion_expand_with_tls_guarantee()` - binds slab 0 immediately after expansion
- **Status**: ✅ Committed
### ✅ **Box I** (Integrity Box) - **703 lines!**
- **File**: `core/box/integrity_box.h` (267 lines) + `integrity_box.c` (436 lines)
- **Role**: Comprehensive integrity verification system
- **Priority ALPHA**: Five slab metadata invariant checks
  1. `carved <= capacity`
  2. `used <= carved`
  3. `used <= capacity`
  4. `free_count == (carved - used)`
  5. `capacity <= 512`
- **Functions**:
  - `integrity_validate_slab_metadata()` - metadata validation
  - `validate_ptr_range()` - pointer range validation (null page, kernel space, 0xa2/0xcc/0xdd/0xfe patterns)
- **Status**: ✅ Committed
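
The five Priority ALPHA invariants can be expressed as one small checker. The sketch below is an assumption-laden illustration: the `sketch_slab_meta` struct and its field widths are hypothetical and may not match the real metadata layout, and the return convention (0 for success, otherwise the number of the first violated invariant) is chosen here for readability rather than taken from `integrity_validate_slab_metadata()`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical metadata layout; the real struct in hakmem may differ. */
typedef struct {
    uint32_t carved;
    uint32_t used;
    uint32_t capacity;
    uint32_t free_count;
} sketch_slab_meta;

enum { SKETCH_MAX_CAPACITY = 512 };

/* Returns 0 when all five invariants hold, otherwise the number (1-5)
 * of the first invariant that is violated. */
static int sketch_validate_slab_meta(const sketch_slab_meta *m) {
    if (m->carved > m->capacity)               return 1;
    if (m->used > m->carved)                   return 2;
    if (m->used > m->capacity)                 return 3;
    if (m->free_count != m->carved - m->used)  return 4;
    if (m->capacity > SKETCH_MAX_CAPACITY)     return 5;
    return 0;
}
```

Invariant 3 follows from invariants 1 and 2, so in this ordering it can never be the first failure. Note also that every check reads metadata only; none of them inspects the `next` pointers stored inside blocks.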
### ✅ **Box TLS-SLL** (target of this fix)
- **File**: `core/box/tls_sll_box.h`
- **Role**: TLS Singly-Linked List management (C7-safe)
- **API**:
  - `tls_sll_push()` - Push to SLL (C7 rejected)
  - `tls_sll_pop()` - Pop from SLL (returns base pointer)
  - `tls_sll_splice()` - Batch push
- **Findings in this session**:
  - Fix #1: Clear `next` in `tls_sll_pop` (for C0-C6, at base+1)
  - However, the tail of the carved chain is not NULL-terminated (Fix #2 required)
- **Status**: ⚠️ Fix #1 applied, Fix #2 not yet applied
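
To make the push/pop behavior concrete, here is a minimal sketch of an intrusive singly-linked list whose `next` link lives inside the free block at a configurable offset. Everything named `sketch_` is hypothetical and is not the API of `tls_sll_box.h`; the list is a plain struct rather than thread-local state, and `memcpy` is used for the link so that an offset of 1 does not cause a misaligned access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read or write the link stored inside a free block at byte offset `off`. */
static void *sketch_load_next(void *base, size_t off) {
    void *next;
    memcpy(&next, (uint8_t *)base + off, sizeof next);
    return next;
}

static void sketch_store_next(void *base, size_t off, void *next) {
    memcpy((uint8_t *)base + off, &next, sizeof next);
}

/* Hypothetical list head; the real allocator keeps this per thread. */
typedef struct {
    void    *head;
    uint32_t count;
} sketch_sll;

static void sketch_sll_push(sketch_sll *list, void *base, size_t off) {
    sketch_store_next(base, off, list->head);
    list->head = base;
    list->count++;
}

/* Pops the head block and clears its link, so that a stale pointer
 * does not survive in memory that is about to be handed out. */
static void *sketch_sll_pop(sketch_sll *list, size_t off) {
    void *base = list->head;
    if (base == NULL) {
        return NULL;
    }
    list->head = sketch_load_next(base, off);
    sketch_store_next(base, off, NULL);
    list->count--;
    return base;
}
```

Clearing the link on pop only protects blocks that leave the list through `pop`. A chain that enters the list through a batch splice still depends on whoever built the chain to terminate it, which is the gap Fix #2 addresses.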
### ✅ **Other Boxes** (existing)
- **Front Gate Box**: `core/box/front_gate_box.h/c` + `front_gate_classifier.c`
- **Free Local/Remote/Publish Box**: `core/box/free_local_box.c`, `free_remote_box.c`, `free_publish_box.c`
- **Mailbox Box**: `core/box/mailbox_box.h/c`
**Commit Info**:
- Commit: "Add Box I (Integrity), Box E (Expansion)..."
- Files: 23 files changed, 1713 insertions(+), 56 deletions(-)
- Date: Recent (before P0 debug session)
---
## 🔍 Investigation History
### ✅ Completed Investigations
1. **Valgrind (O0 build)**: 0 errors, 29K iterations passed
- Conclusion: Bug is optimization-dependent (-O3 triggers it)
2. **Task Agent GDB Analysis**:
- Found crash location: `tls_sll_pop` line 169
- Hypothesis: use-after-allocate (next pointer at base+1 is user memory)
3. **Box I, E, 3 Implementation**: 703 lines of integrity checks
- All checks passed before crash
- Validation didn't catch the bug
---
## 🛠️ Fixes Applied (Partial Success)
### Fix #1: Clear next pointer in `tls_sll_pop` ✅ (INCOMPLETE)
**File**: `core/box/tls_sll_box.h:254-262`
**Change**:
```c
// OLD (WRONG): Only cleared for C7
if (__builtin_expect(class_idx == 7, 0)) {
    *(void**)base = NULL;
}
// NEW: Clear for C0-C6 too
#if HAKMEM_TINY_HEADER_CLASSIDX
if (class_idx == 7) {
    *(void**)base = NULL;                  // C7: clear at base (offset 0)
} else {
    *(void**)((uint8_t*)base + 1) = NULL;  // C0-C6: clear at base+1 (offset 1)
}
#else
*(void**)base = NULL;
#endif
```
**Result**:
- ✅ Passed 29K iterations (previous crash point)
- ❌ **Still crashes at 38,985 iterations**
---
## 🚨 NEW DISCOVERY: Root Cause Found!
### Fix #2: NULL-terminate carved chain tail (NOT YET APPLIED)
**File**: `core/tiny_refill_opt.h:229-234`
**BUG**: Tail block's next pointer is NOT NULL-terminated!
```c
// Current code (BUGGY):
for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;  // Links blocks 0→1, 1→2, ...
    cursor = next;
}
void* tail = (void*)cursor;  // tail = last block
// ❌ BUG: tail's next pointer is NEVER set to NULL!
// It contains GARBAGE from previous allocation!
```
**IMPACT**:
1. Chain is carved: `head → block1 → block2 → ... → tail → [GARBAGE]`
2. Chain spliced to TLS SLL
3. Later, `tls_sll_pop` traverses the chain
4. Reads garbage `next` pointer → SEGV at `0x7fff00008000`
**FIX** (add after line 233):
```c
for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;
    cursor = next;
}
void* tail = (void*)cursor;
// ✅ FIX: NULL-terminate the tail
*(void**)((uint8_t*)tail + next_offset) = NULL;
```
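
The effect of the missing terminator can be reproduced in isolation. The sketch below is a standalone illustration under stated assumptions, not the code of `trc_linear_carve()`: the function names, the `terminate` switch, and the bounded walker are all hypothetical. Links are stored as integers via `memcpy` so that stale bytes are compared against the region bounds and never dereferenced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store or load a link as an integer at `next_offset` inside a block. */
static void sketch_store_link(uint8_t *block, size_t next_offset, uintptr_t link) {
    memcpy(block + next_offset, &link, sizeof link);
}

static uintptr_t sketch_load_link(const uint8_t *block, size_t next_offset) {
    uintptr_t link;
    memcpy(&link, block + next_offset, sizeof link);
    return link;
}

/* Carves `batch` blocks of `stride` bytes from `region` into a chain and
 * returns the tail. With terminate == 0 the tail keeps whatever bytes were
 * already in memory, which reproduces the bug described above. */
static uint8_t *sketch_carve(uint8_t *region, uint32_t batch, size_t stride,
                             size_t next_offset, int terminate) {
    uint8_t *cursor = region;
    for (uint32_t i = 1; i < batch; i++) {
        uint8_t *next = cursor + stride;
        sketch_store_link(cursor, next_offset, (uintptr_t)next);
        cursor = next;
    }
    if (terminate) {
        sketch_store_link(cursor, next_offset, 0);  /* 0 stands in for NULL */
    }
    return cursor;
}

/* Walks the chain from the start of `region` for at most `limit` blocks.
 * Returns the number of blocks visited, or -1 as soon as a link points
 * outside the region. */
static int sketch_chain_length(const uint8_t *region, size_t region_size,
                               size_t next_offset, int limit) {
    uintptr_t lo = (uintptr_t)region;
    uintptr_t link = lo;
    int visited = 0;
    while (link != 0 && visited < limit) {
        if (link < lo || link - lo >= region_size) {
            return -1;
        }
        visited++;
        link = sketch_load_link(region + (link - lo), next_offset);
    }
    return visited;
}
```

Filling the region with a recognizable byte such as `0xA5` before carving stands in for memory left over from a previous allocation. With `terminate == 0` the walker reports `-1` after the last block; with `terminate == 1` it reports exactly `batch` blocks.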
---
## 🚨 CURRENT STATUS (2025-11-12 UPDATED)
### Fixes Applied:
1. ✅ **Fix #1**: Clear next pointer in `tls_sll_pop` (C0-C6 at base+1)
2. ✅ **Fix #2**: NULL-terminate tail in `trc_linear_carve()`
3. ✅ **Fix #3**: Clean rebuild with `HEADER_CLASSIDX=1`
4. ✅ **Fix #4**: Increase canary check frequency (1000 → 100 ops)
5. ✅ **Fix #5**: Add bounds check to `tls_sll_push()`
### Test Results:
- ❌ **Still crashes at iteration 28,410 (call 14269)**
- Canaries: NOT corrupted (corruption is immediate)
- Bounds check: NOT triggered (class_idx is valid)
- Task agent finding: External corruption of `g_tls_sll_head[0]`
### Analysis:
- Fix #1 and Fix #2 ARE working correctly (Task agent verified)
- Corruption happens IMMEDIATELY before crash (canaries at 100-op interval miss it)
- class_idx is valid [0-7] when corruption happens (bounds check doesn't trigger)
- Crash is deterministic at call 14269
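
Because an interval-based canary misses corruption that occurs right before the crash, one possible complement is a cheap check of the list head on every push and pop. The sketch below is a hypothetical debugging aid, not existing hakmem code: the function names, the slab base and size parameters, the assumption that blocks start at the slab base, and the x86-64 Linux address-space constants are all assumptions. It also illustrates why a generic range check alone may not help here: `0x7fff00008000` lies below the 2^47 user-space limit, so only a check against the owning slab rejects it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86-64 Linux layout: first page unmapped, user space below 2^47. */
#define SKETCH_NULL_PAGE_END  UINT64_C(0x1000)
#define SKETCH_USER_SPACE_END UINT64_C(0x0000800000000000)

/* Generic plausibility check: rejects null-page and kernel-space addresses. */
static bool sketch_ptr_plausible(uint64_t addr) {
    return addr >= SKETCH_NULL_PAGE_END && addr < SKETCH_USER_SPACE_END;
}

/* Stricter check for a list head: NULL (empty list) is accepted, any other
 * value must lie inside the owning slab and on a block boundary. */
static bool sketch_head_ok(uint64_t head, uint64_t slab_base,
                           uint64_t slab_size, uint64_t stride) {
    if (head == 0) {
        return true;
    }
    if (head < slab_base || head - slab_base >= slab_size) {
        return false;
    }
    return (head - slab_base) % stride == 0;
}
```

Calling a check like `sketch_head_ok` at the entry and exit of each list operation would narrow the corruption window to a single call, at the cost of a few comparisons per operation in debug builds.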
## 📋 Next Steps (NEEDS USER INPUT)
### Option A: Deep GDB Investigation (SLOW)
- Set hardware watchpoint on `g_tls_sll_head[0]`
- Run to call 14250, then watch for corruption
- Time: 1-2 hours, may not work with optimization
### Option B: Disable Optimizations (DIAGNOSTIC)
- Rebuild with `-O0` to see if bug disappears
- If so, likely compiler optimization bug or UB
- Time: 10 minutes
### Option C: Simplified Stress Test (QUICK)
- Disable P0 batch optimization temporarily
- Disable SFC temporarily
- Test with simpler code path
- Time: 20 minutes
### After Fix Verified
4. **Commit P0 fix**:
- Fix #1: Clear next in `tls_sll_pop`
- Fix #2: NULL-terminate in `trc_linear_carve`
- Box I/E/3 validation infrastructure
- Double-free detection
5. **Update CLAUDE.md** with findings
6. **Performance benchmark** (release build)
---
## 🎯 Expected Outcome
After applying Fix #2, the allocator should:
- ✅ Pass 100K iterations without crash
- ✅ Pass 1M iterations without crash
- ✅ Maintain performance (~2.7M ops/s for 256B)
---
## 📝 Lessons Learned
1. **Stale pointers are dangerous**: Always NULL-terminate linked lists
2. **Optimization exposes bugs**: The `-O0` build ran clean under Valgrind, while the `-O3` build triggered the crash
3. **Multiple fixes needed**: Fix #1 alone was insufficient
4. **Chain integrity**: Carved chains MUST be properly terminated
---
## 🔧 Build Flags (CRITICAL)
**MUST use these flags**:
```bash
HEADER_CLASSIDX=1
AGGRESSIVE_INLINE=1
PREWARM_TLS=1
```
**Why**: `HAKMEM_TINY_HEADER_CLASSIDX=1` is required for Fix #1 to execute!
**Use build.sh** to ensure correct flags:
```bash
./build.sh bench_random_mixed_hakmem
```