hakmem/docs/status/P0_BUG_STATUS.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 13:14:18 +09:00


# P0 SEGV Bug - Current Status & Next Steps
**Last Update**: 2025-11-12
## 🐛 Bug Summary
**Symptom**: SEGV crash at iterations 28,440 and 38,985 (deterministic with seed 42)
**Pattern**: Corrupted address `0x7fff00008000` in TLS SLL chain
**Root Cause**: **STALE NEXT POINTERS** in carved chains

---
## 🎁 Box Theory Implementation (Completed)
### ✅ **Box 3** (Pointer Conversion Box)
- **File**: `core/box/ptr_conversion_box.h` (267 lines)
- **Role**: BASE ↔ USER pointer conversion
- **API**:
  - `ptr_base_to_user(base, class_idx)` - C0-C6: base+1, C7: base
  - `ptr_user_to_base(user, class_idx)` - C0-C6: user-1, C7: user
- **Status**: ✅ Committed (1713 lines added in total across the commit)
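
The conversion rule above can be sketched in a few lines. This is a minimal illustration, assuming the one-byte offset for C0-C6 is a per-block header and that C7 is the only headerless class. The `sketch_` names are hypothetical and are not the contents of `ptr_conversion_box.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the only headerless class in this sketch. */
enum { SKETCH_HEADERLESS_CLASS = 7 };

/* Bytes between BASE and USER: 1 for C0-C6, 0 for C7. */
static inline size_t sketch_user_offset(int class_idx) {
    return (class_idx == SKETCH_HEADERLESS_CLASS) ? 0u : 1u;
}

/* BASE -> USER: skip the assumed one-byte header where there is one. */
static inline void *sketch_base_to_user(void *base, int class_idx) {
    return (uint8_t *)base + sketch_user_offset(class_idx);
}

/* USER -> BASE: exact inverse of sketch_base_to_user(). */
static inline void *sketch_user_to_base(void *user, int class_idx) {
    return (uint8_t *)user - sketch_user_offset(class_idx);
}
```

Routing both directions through one offset helper keeps the two conversions inverse to each other by construction.
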
### ✅ **Box E** (Expansion Box)
- **File**: `core/box/superslab_expansion_box.h/c`
- **Role**: SuperSlab expansion with TLS state guarantee
- **Function**: `expansion_expand_with_tls_guarantee()` - binds slab 0 immediately after expansion
- **Status**: ✅ Committed
### ✅ **Box I** (Integrity Box) - **703 lines!**
- **File**: `core/box/integrity_box.h` (267 lines) + `integrity_box.c` (436 lines)
- **Role**: Comprehensive integrity verification system
- **Priority ALPHA**: five slab metadata invariant checks
  1. `carved <= capacity`
  2. `used <= carved`
  3. `used <= capacity`
  4. `free_count == (carved - used)`
  5. `capacity <= 512`
- **Functions**:
  - `integrity_validate_slab_metadata()` - metadata validation
  - `validate_ptr_range()` - pointer range validation (null-page, kernel-space, and 0xa2/0xcc/0xdd/0xfe patterns)
- **Status**: ✅ Committed
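
The five Priority ALPHA invariants translate directly into a checker. The sketch below assumes a hypothetical `sketch_slab_meta` struct with four counters. The real slab metadata layout and the signature of `integrity_validate_slab_metadata()` are not shown in this document, so treat every name here as illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical metadata layout; the real slab struct is not shown in this document. */
typedef struct {
    uint32_t capacity;    /* total blocks the slab can hold */
    uint32_t carved;      /* blocks carved out of the slab so far */
    uint32_t used;        /* blocks currently handed out */
    uint32_t free_count;  /* carved blocks that are currently free */
} sketch_slab_meta;

enum { SKETCH_MAX_CAPACITY = 512 };

/* Returns 0 if all five invariants hold, otherwise the number (1-5)
 * of the first violated invariant, in the order listed above. */
static int sketch_validate_slab_meta(const sketch_slab_meta *m) {
    if (m->carved > m->capacity) return 1;
    if (m->used > m->carved) return 2;
    /* Implied by invariants 1 and 2; kept only to mirror the list. */
    if (m->used > m->capacity) return 3;
    if (m->free_count != m->carved - m->used) return 4;
    if (m->capacity > SKETCH_MAX_CAPACITY) return 5;
    return 0;
}
```

Checks of this kind only inspect counters. A stale `next` pointer stored inside a block changes none of them, which fits the observation that all integrity checks passed before the crash.
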
### ✅ **Box TLS-SLL** (target of the current fix)
- **File**: `core/box/tls_sll_box.h`
- **Role**: TLS singly linked list (SLL) management (C7-safe)
- **API**:
  - `tls_sll_push()` - Push to SLL (C7 rejected)
  - `tls_sll_pop()` - Pop from SLL (returns base pointer)
  - `tls_sll_splice()` - Batch push
- **Findings in this session**:
  - Fix #1: clear `next` in `tls_sll_pop` (at base+1 for C0-C6)
  - However, the tail of the carved chain is not NULL-terminated (Fix #2 is needed)
- **Status**: ⚠️ Fix #1 applied, Fix #2 not yet applied
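
The push/pop behaviour described above, including the cleared `next` on pop, can be modelled with a small intrusive list. This is a sketch under stated assumptions, not the code in `tls_sll_box.h`: the `sketch_` names and the list struct are hypothetical, the link word is assumed to live at base+1 for C0-C6, and the stores go through `memcpy` rather than the pointer cast used in the snippets in this document. `tls_sll_splice()` is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the link word of a free C0-C6 block lives at base+1 (see Fix #1). */
enum { SKETCH_NEXT_OFFSET = 1, SKETCH_C7 = 7 };

/* Hypothetical per-class list state; the real heads are thread-local globals. */
typedef struct {
    void  *head;
    size_t count;
} sketch_sll;

/* memcpy keeps the access well-defined even though base+1 is not pointer-aligned. */
static void sketch_store_next(void *base, void *next) {
    memcpy((uint8_t *)base + SKETCH_NEXT_OFFSET, &next, sizeof next);
}

static void *sketch_load_next(const void *base) {
    void *next;
    memcpy(&next, (const uint8_t *)base + SKETCH_NEXT_OFFSET, sizeof next);
    return next;
}

/* Push a block; class indices outside C0-C6 (including C7) are rejected. */
static bool sketch_sll_push(sketch_sll *list, int class_idx, void *base) {
    if (class_idx < 0 || class_idx >= SKETCH_C7) return false;
    sketch_store_next(base, list->head);
    list->head = base;
    list->count++;
    return true;
}

/* Pop the head block and clear its link so no stale pointer leaves the list. */
static void *sketch_sll_pop(sketch_sll *list) {
    void *base = list->head;
    if (base == NULL) return NULL;
    list->head = sketch_load_next(base);
    sketch_store_next(base, NULL);
    list->count--;
    return base;
}
```

Each block passed to the sketch must be at least `1 + sizeof(void *)` bytes so that the link word fits.
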
### ✅ **Other Boxes** (existing)
- **Front Gate Box**: `core/box/front_gate_box.h/c` + `front_gate_classifier.c`
- **Free Local/Remote/Publish Box**: `core/box/free_local_box.c`, `free_remote_box.c`, `free_publish_box.c`
- **Mailbox Box**: `core/box/mailbox_box.h/c`

**Commit Info**:
- Commit: "Add Box I (Integrity), Box E (Expansion)..."
- Files: 23 files changed, 1713 insertions(+), 56 deletions(-)
- Date: Recent (before P0 debug session)
---
## 🔍 Investigation History
### ✅ Completed Investigations
1. **Valgrind (O0 build)**: 0 errors, 29K iterations passed
   - Conclusion: Bug is optimization-dependent (-O3 triggers it)
2. **Task Agent GDB Analysis**:
   - Found crash location: `tls_sll_pop` line 169
   - Hypothesis: use-after-allocate (next pointer at base+1 is user memory)
3. **Box I, E, 3 Implementation**: 703 lines of integrity checks
   - All checks passed before crash
   - Validation didn't catch the bug
---
## 🛠️ Fixes Applied (Partial Success)
### Fix #1: Clear next pointer in `tls_sll_pop` ✅ (INCOMPLETE)
**File**: `core/box/tls_sll_box.h:254-262`
**Change**:
```c
// OLD (WRONG): Only cleared for C7
if (__builtin_expect(class_idx == 7, 0)) {
    *(void**)base = NULL;
}

// NEW: Clear for C0-C6 too
#if HAKMEM_TINY_HEADER_CLASSIDX
if (class_idx == 7) {
    *(void**)base = NULL;  // C7: clear at base (offset 0)
} else {
    *(void**)((uint8_t*)base + 1) = NULL;  // C0-C6: clear at base+1 (offset 1)
}
#else
*(void**)base = NULL;
#endif
```
**Result**:
- ✅ Passed 29K iterations (previous crash point)
- ❌ **Still crashes at 38,985 iterations**
---
## 🚨 NEW DISCOVERY: Root Cause Found!
### Fix #2: NULL-terminate carved chain tail (NOT YET APPLIED)
**File**: `core/tiny_refill_opt.h:229-234`
**BUG**: Tail block's next pointer is NOT NULL-terminated!
```c
// Current code (BUGGY):
for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;  // Links blocks 0→1, 1→2, ...
    cursor = next;
}
void* tail = (void*)cursor;  // tail = last block
// ❌ BUG: tail's next pointer is NEVER set to NULL!
// It contains GARBAGE from previous allocation!
```
**IMPACT**:
1. Chain is carved: `head → block1 → block2 → ... → tail → [GARBAGE]`
2. Chain spliced to TLS SLL
3. Later, `tls_sll_pop` traverses the chain
4. Reads garbage `next` pointer → SEGV at `0x7fff00008000`

**FIX** (add after line 233):
```c
for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;
    cursor = next;
}
void* tail = (void*)cursor;
// ✅ FIX: NULL-terminate the tail
*(void**)((uint8_t*)tail + next_offset) = NULL;
```
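
To see the effect of the terminator in isolation, the carve loop can be reproduced in a self-contained form. The sketch assumes a hypothetical `sketch_linear_carve()` that takes the start address, stride, link offset, and batch size as parameters (the real `trc_linear_carve()` signature is not shown here), and it uses `memcpy` for the link stores instead of the pointer cast above. The bounded walker is an illustrative helper, not part of the allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Link `batch` blocks of `stride` bytes starting at `start` and return the tail.
 * The link word of each block sits at `next_offset`. Requires batch >= 1. */
static void *sketch_linear_carve(uint8_t *start, size_t stride,
                                 size_t next_offset, uint32_t batch) {
    uint8_t *cursor = start;
    for (uint32_t i = 1; i < batch; i++) {
        uint8_t *next = cursor + stride;
        memcpy(cursor + next_offset, &next, sizeof next);  /* block i-1 -> block i */
        cursor = next;
    }
    void *end = NULL;
    memcpy(cursor + next_offset, &end, sizeof end);  /* the Fix #2 step: terminate the tail */
    return cursor;
}

/* Walk the chain until NULL. Returns the number of blocks visited,
 * or limit + 1 if the chain did not end within `limit` blocks. */
static size_t sketch_chain_length(const uint8_t *head, size_t next_offset, size_t limit) {
    size_t n = 0;
    const uint8_t *p = head;
    while (p != NULL) {
        if (n == limit) return limit + 1;  /* longer than expected: missing terminator? */
        n++;
        memcpy(&p, p + next_offset, sizeof p);
    }
    return n;
}
```

A walker with a length limit, as in `sketch_chain_length()`, is one way to assert chain integrity right after carving without dereferencing past the expected tail.
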
---
## 🚨 CURRENT STATUS (2025-11-12 UPDATED)
### Fixes Applied:
1. ✅ **Fix #1**: Clear next pointer in `tls_sll_pop` (C0-C6 at base+1)
2. ✅ **Fix #2**: NULL-terminate tail in `trc_linear_carve()`
3. ✅ **Fix #3**: Clean rebuild with `HEADER_CLASSIDX=1`
4. ✅ **Fix #4**: Increase canary check frequency (1000 → 100 ops)
5. ✅ **Fix #5**: Add bounds check to `tls_sll_push()`
### Test Results:
- ❌ **Still crashes at iteration 28,410 (call 14269)**
- Canaries: NOT corrupted (corruption is immediate)
- Bounds check: NOT triggered (class_idx is valid)
- Task agent finding: External corruption of `g_tls_sll_head[0]`
### Analysis:
- Fix #1 and Fix #2 ARE working correctly (Task agent verified)
- Corruption happens IMMEDIATELY before crash (canaries at 100-op interval miss it)
- class_idx is valid [0-7] when corruption happens (bounds check doesn't trigger)
- Crash is deterministic at call 14269
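
Why a periodic canary check can miss this kind of corruption can be illustrated with a small model. Everything in the sketch is an assumption: the guard-word layout, the `sketch_` names, and the canary value are hypothetical, and only the 100-operation interval is taken from Fix #4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CANARY UINT64_C(0xC0FFEE1234ABCDEF)
enum { SKETCH_NUM_CLASSES = 8, SKETCH_CHECK_INTERVAL = 100 };

/* Hypothetical layout: guard words on both sides of the per-class heads. */
typedef struct {
    uint64_t canary_before;
    void    *head[SKETCH_NUM_CLASSES];
    uint64_t canary_after;
    uint64_t ops;  /* operations seen so far */
} sketch_tls_state;

static void sketch_state_init(sketch_tls_state *s) {
    s->canary_before = SKETCH_CANARY;
    for (int i = 0; i < SKETCH_NUM_CLASSES; i++) s->head[i] = NULL;
    s->canary_after = SKETCH_CANARY;
    s->ops = 0;
}

static bool sketch_canaries_intact(const sketch_tls_state *s) {
    return s->canary_before == SKETCH_CANARY && s->canary_after == SKETCH_CANARY;
}

/* Call once per allocator operation.
 * Returns false only when a scheduled check runs and fails. */
static bool sketch_tick(sketch_tls_state *s) {
    s->ops++;
    if (s->ops % SKETCH_CHECK_INTERVAL != 0) return true;  /* no check on this op */
    return sketch_canaries_intact(s);
}
```

In this model a write that lands directly on `head[0]` leaves both guard words untouched, so the check never fires, and an overrun that does hit a guard word goes unnoticed until the next multiple of 100 operations. Both effects match the reported symptoms, but the model does not identify which write corrupts `g_tls_sll_head[0]`.
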
## 📋 Next Steps (NEEDS USER INPUT)
### Option A: Deep GDB Investigation (SLOW)
- Set hardware watchpoint on `g_tls_sll_head[0]`
- Run to call 14250, then watch for corruption
- Time: 1-2 hours, may not work with optimization
### Option B: Disable Optimizations (DIAGNOSTIC)
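- Rebuild with `-O0` to see if the bug disappears (the earlier Valgrind `-O0` run passed 29K iterations with 0 errors)
- If it does, the likely cause is a compiler optimization bug or undefined behavior (UB)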
- Time: 10 minutes
### Option C: Simplified Stress Test (QUICK)
- Disable P0 batch optimization temporarily
- Disable SFC temporarily
- Test with simpler code path
- Time: 20 minutes
### After Fix Verified
4. **Commit P0 fix**:
   - Fix #1: Clear next in `tls_sll_pop`
   - Fix #2: NULL-terminate in `trc_linear_carve`
   - Box I/E/3 validation infrastructure
   - Double-free detection
5. **Update CLAUDE.md** with findings
6. **Performance benchmark** (release build)
---
## 🎯 Expected Outcome
After applying Fix #2, the allocator should:
- ✅ Pass 100K iterations without crash
- ✅ Pass 1M iterations without crash
- ✅ Maintain performance (~2.7M ops/s for 256B)
---
## 📝 Lessons Learned
1. **Stale pointers are dangerous**: Always NULL-terminate linked lists
2. **Optimization exposes bugs**: the crash reproduces at `-O3` but not in the `-O0` Valgrind run, so a clean debug build does not prove the code is correct
3. **Multiple fixes needed**: Fix #1 alone was insufficient
4. **Chain integrity**: Carved chains MUST be properly terminated
---
## 🔧 Build Flags (CRITICAL)
**MUST use these flags**:
```bash
HEADER_CLASSIDX=1
AGGRESSIVE_INLINE=1
PREWARM_TLS=1
```
**Why**: The base+1 clear added by Fix #1 sits inside `#if HAKMEM_TINY_HEADER_CLASSIDX`, so it is only compiled in when `HAKMEM_TINY_HEADER_CLASSIDX=1`!
**Use build.sh** to ensure correct flags:
```bash
./build.sh bench_random_mixed_hakmem
```