## Changes ### 1. core/page_arena.c - Removed init failure message (lines 25-27) - error is handled by returning early - All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks ### 2. core/hakmem.c - Wrapped SIGSEGV handler init message (line 72) - CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs ### 3. core/hakmem_shared_pool.c - Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE: - Node pool exhaustion warning (line 252) - SP_META_CAPACITY_ERROR warning (line 421) - SP_FIX_GEOMETRY debug logging (line 745) - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865) - SP_ACQUIRE_STAGE0_L0 debug logging (line 803) - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922) - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996) - SP_ACQUIRE_STAGE3 debug logging (line 1116) - SP_SLOT_RELEASE debug logging (line 1245) - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305) - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316) - Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized ## Performance Validation Before: 51M ops/s (with debug fprintf overhead) After: 49.1M ops/s (consistent performance, fprintf removed from hot paths) ## Build & Test ```bash ./build.sh larson_hakmem ./out/release/larson_hakmem 1 5 1 1000 100 10000 42 # Result: 49.1M ops/s ``` Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
242 lines
7.3 KiB
Markdown
242 lines
7.3 KiB
Markdown
# P0 SEGV Bug - Current Status & Next Steps
|
||
|
||
**Last Update**: 2025-11-12
|
||
|
||
## 🐛 Bug Summary
|
||
|
||
**Symptom**: SEGV crash at iterations 28,440 and 38,985 (deterministic with seed 42)
|
||
**Pattern**: Corrupted address `0x7fff00008000` in TLS SLL chain
|
||
**Root Cause**: **STALE NEXT POINTERS** in carved chains
|
||
|
||
---
|
||
|
||
## 🎁 Box Theory Implementation (完了済み)
|
||
|
||
### ✅ **Box 3** (Pointer Conversion Box)
|
||
- **File**: `core/box/ptr_conversion_box.h` (267 lines)
|
||
- **役割**: BASE ↔ USER pointer conversion
|
||
- **API**:
|
||
- `ptr_base_to_user(base, class_idx)` - C0-C6: base+1, C7: base
|
||
- `ptr_user_to_base(user, class_idx)` - C0-C6: user-1, C7: user
|
||
- **Status**: ✅ Committed (1713 lines added total)
|
||
|
||
### ✅ **Box E** (Expansion Box)
|
||
- **File**: `core/box/superslab_expansion_box.h/c`
|
||
- **役割**: SuperSlab expansion with TLS state guarantee
|
||
- **機能**: `expansion_expand_with_tls_guarantee()` - Expand後に slab 0 を即座にバインド
|
||
- **Status**: ✅ Committed
|
||
|
||
### ✅ **Box I** (Integrity Box) - **703 lines!**
|
||
- **File**: `core/box/integrity_box.h` (267行) + `integrity_box.c` (436行)
|
||
- **役割**: Comprehensive integrity verification system
|
||
- **Priority ALPHA**: 5つの Slab Metadata 不変条件チェック
|
||
1. `carved <= capacity`
|
||
2. `used <= carved`
|
||
3. `used <= capacity`
|
||
4. `free_count == (carved - used)`
|
||
5. `capacity <= 512`
|
||
- **機能**:
|
||
- `integrity_validate_slab_metadata()` - メタデータ検証
|
||
- `validate_ptr_range()` - ポインタ範囲検証(null-page, kernel-space, 0xa2/0xcc/0xdd/0xfe パターン)
|
||
- **Status**: ✅ Committed
|
||
|
||
### ✅ **Box TLS-SLL** (今回の修正対象)
|
||
- **File**: `core/box/tls_sll_box.h`
|
||
- **役割**: TLS Single-Linked List management (C7-safe)
|
||
- **API**:
|
||
- `tls_sll_push()` - Push to SLL (C7 rejected)
|
||
- `tls_sll_pop()` - Pop from SLL (returns base pointer)
|
||
- `tls_sll_splice()` - Batch push
|
||
- **今回の発見**:
|
||
- Fix #1: `tls_sll_pop` で next をクリア(C0-C6 は base+1 で)
|
||
- But: carved chain の tail が NULL 終端されていない(Fix #2 必要)
|
||
- **Status**: ⚠️ Fix #1 適用済み、Fix #2 未適用
|
||
|
||
### ✅ **その他のBox** (既存)
|
||
- **Front Gate Box**: `core/box/front_gate_box.h/c` + `front_gate_classifier.c`
|
||
- **Free Local/Remote/Publish Box**: `core/box/free_local_box.c`, `free_remote_box.c`, `free_publish_box.c`
|
||
- **Mailbox Box**: `core/box/mailbox_box.h/c`
|
||
|
||
**Commit Info**:
|
||
- Commit: "Add Box I (Integrity), Box E (Expansion)..."
|
||
- Files: 23 files changed, 1713 insertions(+), 56 deletions(-)
|
||
- Date: Recent (before P0 debug session)
|
||
|
||
---
|
||
|
||
## 🔍 Investigation History
|
||
|
||
### ✅ Completed Investigations
|
||
|
||
1. **Valgrind (O0 build)**: 0 errors, 29K iterations passed
|
||
- Conclusion: Bug is optimization-dependent (-O3 triggers it)
|
||
|
||
2. **Task Agent GDB Analysis**:
|
||
- Found crash location: `tls_sll_pop` line 169
|
||
- Hypothesis: use-after-allocate (next pointer at base+1 is user memory)
|
||
|
||
3. **Box I, E, 3 Implementation**: 703 lines of integrity checks
|
||
- All checks passed before crash
|
||
- Validation didn't catch the bug
|
||
|
||
---
|
||
|
||
## 🛠️ Fixes Applied (Partial Success)
|
||
|
||
### Fix #1: Clear next pointer in `tls_sll_pop` ✅ (INCOMPLETE)
|
||
|
||
**File**: `core/box/tls_sll_box.h:254-262`
|
||
|
||
**Change**:
|
||
```c
|
||
// OLD (WRONG): Only cleared for C7
|
||
if (__builtin_expect(class_idx == 7, 0)) {
|
||
*(void**)base = NULL;
|
||
}
|
||
|
||
// NEW: Clear for C0-C6 too
|
||
#if HAKMEM_TINY_HEADER_CLASSIDX
|
||
if (class_idx == 7) {
|
||
*(void**)base = NULL; // C7: clear at base (offset 0)
|
||
} else {
|
||
*(void**)((uint8_t*)base + 1) = NULL; // C0-C6: clear at base+1 (offset 1)
|
||
}
|
||
#else
|
||
*(void**)base = NULL;
|
||
#endif
|
||
```
|
||
|
||
**Result**:
|
||
- ✅ Passed 29K iterations (previous crash point)
|
||
- ❌ **Still crashes at 38,985 iterations**
|
||
|
||
---
|
||
|
||
## 🚨 NEW DISCOVERY: Root Cause Found!
|
||
|
||
### Fix #2: NULL-terminate carved chain tail (NOT YET APPLIED)
|
||
|
||
**File**: `core/tiny_refill_opt.h:229-234`
|
||
|
||
**BUG**: Tail block's next pointer is NOT NULL-terminated!
|
||
|
||
```c
|
||
// Current code (BUGGY):
|
||
for (uint32_t i = 1; i < batch; i++) {
|
||
uint8_t* next = cursor + stride;
|
||
*(void**)(cursor + next_offset) = (void*)next; // Links blocks 0→1, 1→2, ...
|
||
cursor = next;
|
||
}
|
||
void* tail = (void*)cursor; // tail = last block
|
||
// ❌ BUG: tail's next pointer is NEVER set to NULL!
|
||
// It contains GARBAGE from previous allocation!
|
||
```
|
||
|
||
**IMPACT**:
|
||
1. Chain is carved: `head → block1 → block2 → ... → tail → [GARBAGE]`
|
||
2. Chain spliced to TLS SLL
|
||
3. Later, `tls_sll_pop` traverses the chain
|
||
4. Reads garbage `next` pointer → SEGV at `0x7fff00008000`
|
||
|
||
**FIX** (add after line 233):
|
||
```c
|
||
for (uint32_t i = 1; i < batch; i++) {
|
||
uint8_t* next = cursor + stride;
|
||
*(void**)(cursor + next_offset) = (void*)next;
|
||
cursor = next;
|
||
}
|
||
void* tail = (void*)cursor;
|
||
|
||
// ✅ FIX: NULL-terminate the tail
|
||
*(void**)((uint8_t*)tail + next_offset) = NULL;
|
||
```
|
||
|
||
---
|
||
|
||
## 🚨 CURRENT STATUS (2025-11-12 UPDATED)
|
||
|
||
### Fixes Applied:
|
||
1. ✅ **Fix #1**: Clear next pointer in `tls_sll_pop` (C0-C6 at base+1)
|
||
2. ✅ **Fix #2**: NULL-terminate tail in `trc_linear_carve()`
|
||
3. ✅ **Fix #3**: Clean rebuild with `HEADER_CLASSIDX=1`
|
||
4. ✅ **Fix #4**: Increase canary check frequency (1000 → 100 ops)
|
||
5. ✅ **Fix #5**: Add bounds check to `tls_sll_push()`
|
||
|
||
### Test Results:
|
||
- ❌ **Still crashes at iteration 28,410 (call 14269)**
|
||
- Canaries: NOT corrupted (corruption is immediate)
|
||
- Bounds check: NOT triggered (class_idx is valid)
|
||
- Task agent finding: External corruption of `g_tls_sll_head[0]`
|
||
|
||
### Analysis:
|
||
- Fix #1 and Fix #2 ARE working correctly (Task agent verified)
|
||
- Corruption happens IMMEDIATELY before crash (canaries at 100-op interval miss it)
|
||
- class_idx is valid [0-7] when corruption happens (bounds check doesn't trigger)
|
||
- Crash is deterministic at call 14269
|
||
|
||
## 📋 Next Steps (NEEDS USER INPUT)
|
||
|
||
### Option A: Deep GDB Investigation (SLOW)
|
||
- Set hardware watchpoint on `g_tls_sll_head[0]`
|
||
- Run to call 14250, then watch for corruption
|
||
- Time: 1-2 hours, may not work with optimization
|
||
|
||
### Option B: Disable Optimizations (DIAGNOSTIC)
|
||
- Rebuild with `-O0` to see if bug disappears
|
||
- If so, likely compiler optimization bug or UB
|
||
- Time: 10 minutes
|
||
|
||
### Option C: Simplified Stress Test (QUICK)
|
||
- Disable P0 batch optimization temporarily
|
||
- Disable SFC temporarily
|
||
- Test with simpler code path
|
||
- Time: 20 minutes
|
||
|
||
### After Fix Verified
|
||
|
||
4. **Commit P0 fix**:
|
||
- Fix #1: Clear next in `tls_sll_pop`
|
||
- Fix #2: NULL-terminate in `trc_linear_carve`
|
||
- Box I/E/3 validation infrastructure
|
||
- Double-free detection
|
||
|
||
5. **Update CLAUDE.md** with findings
|
||
|
||
6. **Performance benchmark** (release build)
|
||
|
||
---
|
||
|
||
## 🎯 Expected Outcome
|
||
|
||
After applying Fix #2, the allocator should:
|
||
- ✅ Pass 100K iterations without crash
|
||
- ✅ Pass 1M iterations without crash
|
||
- ✅ Maintain performance (~2.7M ops/s for 256B)
|
||
|
||
---
|
||
|
||
## 📝 Lessons Learned
|
||
|
||
1. **Stale pointers are dangerous**: Always NULL-terminate linked lists
|
||
2. **Optimization exposes bugs**: `-O3` can hide initialization in debug builds
|
||
3. **Multiple fixes needed**: Fix #1 alone was insufficient
|
||
4. **Chain integrity**: Carved chains MUST be properly terminated
|
||
|
||
---
|
||
|
||
## 🔧 Build Flags (CRITICAL)
|
||
|
||
**MUST use these flags**:
|
||
```bash
|
||
HEADER_CLASSIDX=1
|
||
AGGRESSIVE_INLINE=1
|
||
PREWARM_TLS=1
|
||
```
|
||
|
||
**Why**: `HAKMEM_TINY_HEADER_CLASSIDX=1` is required for Fix #1 to execute!
|
||
|
||
**Use build.sh** to ensure correct flags:
|
||
```bash
|
||
./build.sh bench_random_mixed_hakmem
|
||
```
|