hakmem/P0_BUG_STATUS.md
Moe Charm (CI) 84dbd97fe9 Fix #16: Resolve double BASE→USER conversion causing header corruption
🎯 ROOT CAUSE: Internal allocation helpers were prematurely converting
BASE → USER pointers before returning to caller. The caller then applied
HAK_RET_ALLOC/tiny_region_id_write_header which performed ANOTHER BASE→USER
conversion, resulting in double offset (BASE+2) and header written at
wrong location.

📦 BOX THEORY SOLUTION: Establish clean pointer conversion boundary at
tiny_region_id_write_header, making it the single source of truth for
BASE → USER conversion.

🔧 CHANGES:
- Fix #16: Remove premature BASE→USER conversions (6 locations)
  * core/tiny_alloc_fast.inc.h (3 fixes)
  * core/hakmem_tiny_refill.inc.h (2 fixes)
  * core/hakmem_tiny_fastcache.inc.h (1 fix)

- Fix #12: Add header validation in tls_sll_pop (detect corruption)
- Fix #14: Defense-in-depth header restoration in tls_sll_splice
- Fix #15: USER pointer detection (for debugging)
- Fix #13: Bump window header restoration
- Fix #2, #6, #7, #8: Various header restoration & NULL termination

🧪 TEST RESULTS: 100% SUCCESS
- 10K-500K iterations: All passed
- 8 seeds × 100K: All passed (42,123,456,789,999,314,271,161)
- Performance: ~630K ops/s average (stable)
- Header corruption: ZERO

📋 FIXES SUMMARY:
Fix #1-8:   Initial header restoration & chain fixes (chatgpt-san)
Fix #9-10:  USER pointer auto-fix (later disabled)
Fix #12:    Validation system (caught corruption at call 14209)
Fix #13:    Bump window header writes
Fix #14:    Splice defense-in-depth
Fix #15:    USER pointer detection (debugging tool)
Fix #16:    Double conversion fix (FINAL SOLUTION) 

🎓 LESSONS LEARNED:
1. Validation catches bugs early (Fix #12 was critical)
2. Class-specific inline logging reveals patterns (Option C)
3. Box Theory provides clean architectural boundaries
4. Multiple investigation approaches (Task/chatgpt-san collaboration)

📄 DOCUMENTATION:
- P0_BUG_STATUS.md: Complete bug tracking timeline
- C2_CORRUPTION_ROOT_CAUSE_FINAL.md: Detailed root cause analysis
- FINAL_ANALYSIS_C2_CORRUPTION.md: Investigation methodology

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Task Agent <task@anthropic.com>
Co-Authored-By: ChatGPT <chatgpt@openai.com>
2025-11-12 10:33:57 +09:00


P0 SEGV Bug - Current Status & Next Steps

Last Update: 2025-11-12

🐛 Bug Summary

Symptom: SEGV crash at iterations 28,440 and 38,985 (deterministic with seed 42)
Pattern: Corrupted address 0x7fff00008000 in TLS SLL chain
Root Cause: STALE NEXT POINTERS in carved chains


🎁 Box Theory Implementation (Completed)

Box 3 (Pointer Conversion Box)

  • File: core/box/ptr_conversion_box.h (267 lines)
  • Role: BASE ↔ USER pointer conversion
  • API:
    • ptr_base_to_user(base, class_idx) - C0-C6: base+1, C7: base
    • ptr_user_to_base(user, class_idx) - C0-C6: user-1, C7: user
  • Status: Committed (1713 lines added total)
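
As a rough illustration of the conversion rule listed above (C0-C6 offset by one header byte, C7 unchanged), the two helpers could look like the sketch below. The function names come from the API list; the parameter types, the TINY_HEADERLESS_CLASS constant, and the bodies are assumptions, not a copy of ptr_conversion_box.h.

```c
#include <stdint.h>

/* Hypothetical constant: the one class that carries no 1-byte header. */
#define TINY_HEADERLESS_CLASS 7

/* BASE -> USER: classes 0-6 skip the 1-byte header, class 7 is unchanged. */
static inline void* ptr_base_to_user(void* base, int class_idx) {
    if (class_idx == TINY_HEADERLESS_CLASS) return base;
    return (uint8_t*)base + 1;
}

/* USER -> BASE: exact inverse of ptr_base_to_user for the same class. */
static inline void* ptr_user_to_base(void* user, int class_idx) {
    if (class_idx == TINY_HEADERLESS_CLASS) return user;
    return (uint8_t*)user - 1;
}
```

Applying ptr_base_to_user twice to the same C0-C6 pointer yields BASE+2, which is the kind of double offset described in the Fix #16 commit message; keeping the conversion in a single place is what prevents it.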

Box E (Expansion Box)

  • File: core/box/superslab_expansion_box.h/c
  • Role: SuperSlab expansion with TLS state guarantee
  • Function: expansion_expand_with_tls_guarantee() - binds slab 0 immediately after expansion
  • Status: Committed

Box I (Integrity Box) - 703 lines!

  • File: core/box/integrity_box.h (267 lines) + integrity_box.c (436 lines)
  • Role: Comprehensive integrity verification system
  • Priority ALPHA: five slab metadata invariant checks
    1. carved <= capacity
    2. used <= carved
    3. used <= capacity
    4. free_count == (carved - used)
    5. capacity <= 512
  • Functions:
    • integrity_validate_slab_metadata() - metadata validation
    • validate_ptr_range() - pointer range validation (null-page, kernel-space, 0xa2/0xcc/0xdd/0xfe patterns)
  • Status: Committed
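
The five Priority ALPHA invariants can be expressed as a single check function. The sketch below is only an illustration: the struct layout, the type name slab_meta_sketch_t, and the function slab_meta_first_violation are hypothetical and are not taken from integrity_box.c.

```c
#include <stdint.h>

/* Hypothetical metadata layout; field names mirror the invariants above. */
typedef struct {
    uint32_t capacity;
    uint32_t carved;
    uint32_t used;
    uint32_t free_count;
} slab_meta_sketch_t;

#define SLAB_MAX_CAPACITY 512u

/* Returns 0 if all five invariants hold, otherwise the number (1-5)
 * of the first invariant that is violated. */
static int slab_meta_first_violation(const slab_meta_sketch_t* m) {
    if (m->carved > m->capacity)               return 1;
    if (m->used > m->carved)                   return 2;
    if (m->used > m->capacity)                 return 3;
    if (m->free_count != m->carved - m->used)  return 4;
    if (m->capacity > SLAB_MAX_CAPACITY)       return 5;
    return 0;
}
```

Invariant 3 follows from invariants 1 and 2, so this ordering can never report it first; it is kept here only to mirror the list.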

Box TLS-SLL (target of this fix)

  • File: core/box/tls_sll_box.h
  • Role: TLS singly-linked list management (C7-safe)
  • API:
    • tls_sll_push() - Push to SLL (C7 rejected)
    • tls_sll_pop() - Pop from SLL (returns base pointer)
    • tls_sll_splice() - Batch push
  • Findings in this session:
    • Fix #1: Clear next in tls_sll_pop (at base+1 for C0-C6)
    • But: the tail of the carved chain is not NULL-terminated (Fix #2 required)
  • Status: ⚠️ Fix #1 applied, Fix #2 not yet applied

Other Boxes (existing)

  • Front Gate Box: core/box/front_gate_box.h/c + front_gate_classifier.c
  • Free Local/Remote/Publish Box: core/box/free_local_box.c, free_remote_box.c, free_publish_box.c
  • Mailbox Box: core/box/mailbox_box.h/c

Commit Info:

  • Commit: "Add Box I (Integrity), Box E (Expansion)..."
  • Files: 23 files changed, 1713 insertions(+), 56 deletions(-)
  • Date: Recent (before P0 debug session)

🔍 Investigation History

Completed Investigations

  1. Valgrind (O0 build): 0 errors, 29K iterations passed

    • Conclusion: Bug is optimization-dependent (-O3 triggers it)
  2. Task Agent GDB Analysis:

    • Found crash location: tls_sll_pop line 169
    • Hypothesis: use-after-allocate (next pointer at base+1 is user memory)
  3. Box I, E, 3 Implementation: 703 lines of integrity checks

    • All checks passed before crash
    • Validation didn't catch the bug

🛠️ Fixes Applied (Partial Success)

Fix #1: Clear next pointer in tls_sll_pop (INCOMPLETE)

File: core/box/tls_sll_box.h:254-262

Change:

// OLD (WRONG): Only cleared for C7
if (__builtin_expect(class_idx == 7, 0)) {
    *(void**)base = NULL;
}

// NEW: Clear for C0-C6 too
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (class_idx == 7) {
        *(void**)base = NULL;  // C7: clear at base (offset 0)
    } else {
        *(void**)((uint8_t*)base + 1) = NULL;  // C0-C6: clear at base+1 (offset 1)
    }
#else
    *(void**)base = NULL;
#endif

Result:

  • Passed 29K iterations (previous crash point)
  • Still crashes at 38,985 iterations
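
To make the intent of Fix #1 concrete, here is a minimal, self-contained pop that clears the link of the block it hands out. It assumes the layout described above (next pointer at offset 1 for C0-C6, offset 0 for C7); the helper names sketch_next_offset, sketch_load_next, sketch_store_next, and sketch_pop are hypothetical, and the single head pointer stands in for the per-class TLS heads. The sketch uses memcpy because a pointer stored at base+1 is not naturally aligned; that is a choice made for this example, not a statement about how tls_sll_box.h does it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: next pointer at offset 1 for C0-C6, offset 0 for C7. */
static size_t sketch_next_offset(int class_idx) {
    return class_idx == 7 ? 0 : 1;
}

static void* sketch_load_next(void* base, int class_idx) {
    void* next;
    memcpy(&next, (uint8_t*)base + sketch_next_offset(class_idx), sizeof next);
    return next;
}

static void sketch_store_next(void* base, int class_idx, void* next) {
    memcpy((uint8_t*)base + sketch_next_offset(class_idx), &next, sizeof next);
}

/* Pops the head block and clears its link so no stale next pointer
 * survives inside memory that is about to be handed to the caller. */
static void* sketch_pop(void** head, int class_idx) {
    void* base = *head;
    if (base == NULL) return NULL;
    *head = sketch_load_next(base, class_idx);
    sketch_store_next(base, class_idx, NULL);
    return base;
}
```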

🚨 NEW DISCOVERY: Root Cause Found!

Fix #2: NULL-terminate carved chain tail (NOT YET APPLIED)

File: core/tiny_refill_opt.h:229-234

BUG: Tail block's next pointer is NOT NULL-terminated!

// Current code (BUGGY):
for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;  // Links blocks 0→1, 1→2, ...
    cursor = next;
}
void* tail = (void*)cursor;  // tail = last block
// ❌ BUG: tail's next pointer is NEVER set to NULL!
// It contains GARBAGE from previous allocation!

IMPACT:

  1. Chain is carved: head → block1 → block2 → ... → tail → [GARBAGE]
  2. Chain spliced to TLS SLL
  3. Later, tls_sll_pop traverses the chain
  4. Reads garbage next pointer → SEGV at 0x7fff00008000

FIX (add after line 233):

for (uint32_t i = 1; i < batch; i++) {
    uint8_t* next = cursor + stride;
    *(void**)(cursor + next_offset) = (void*)next;
    cursor = next;
}
void* tail = (void*)cursor;

// ✅ FIX: NULL-terminate the tail
*(void**)((uint8_t*)tail + next_offset) = NULL;
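
The effect of the missing terminator can be reproduced outside the allocator. The sketch below carves a chain out of a buffer that is assumed to hold leftover bytes, terminates the tail, and then walks the chain. carve_chain_sketch and chain_length_sketch are hypothetical stand-ins, not the real trc_linear_carve(), and memcpy is used for the link stores only to keep the example free of unaligned accesses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Links `batch` blocks (batch >= 1) of `stride` bytes starting at `start`,
 * storing each next pointer at `next_offset` inside the block.
 * Returns the tail block. */
static void* carve_chain_sketch(uint8_t* start, size_t stride,
                                size_t next_offset, uint32_t batch) {
    uint8_t* cursor = start;
    for (uint32_t i = 1; i < batch; i++) {
        uint8_t* next = cursor + stride;
        memcpy(cursor + next_offset, &next, sizeof next);
        cursor = next;
    }
    /* The step Fix #2 adds: overwrite whatever the tail's link slot held. */
    uint8_t* end = NULL;
    memcpy(cursor + next_offset, &end, sizeof end);
    return cursor;
}

/* Follows next pointers until NULL and counts the blocks visited. */
static uint32_t chain_length_sketch(uint8_t* head, size_t next_offset) {
    uint32_t n = 0;
    while (head != NULL) {
        uint8_t* next;
        memcpy(&next, head + next_offset, sizeof next);
        head = next;
        n++;
    }
    return n;
}
```

If the region is pre-filled with a non-zero pattern and the terminating store is removed, the walk follows the stale bytes in the tail's link slot instead of stopping, which matches the failure mode described in the IMPACT list.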

🚨 CURRENT STATUS (2025-11-12 UPDATED)

Fixes Applied:

  1. Fix #1: Clear next pointer in tls_sll_pop (C0-C6 at base+1)
  2. Fix #2: NULL-terminate tail in trc_linear_carve()
  3. Fix #3: Clean rebuild with HEADER_CLASSIDX=1
  4. Fix #4: Increase canary check frequency (1000 → 100 ops)
  5. Fix #5: Add bounds check to tls_sll_push()

Test Results:

  • Still crashes at iteration 28,410 (call 14269)
  • Canaries: NOT corrupted (corruption is immediate)
  • Bounds check: NOT triggered (class_idx is valid)
  • Task agent finding: External corruption of g_tls_sll_head[0]

Analysis:

  • Fix #1 and Fix #2 ARE working correctly (Task agent verified)
  • Corruption happens IMMEDIATELY before crash (canaries at 100-op interval miss it)
  • class_idx is valid [0-7] when corruption happens (bounds check doesn't trigger)
  • Crash is deterministic at call 14269

📋 Next Steps (NEEDS USER INPUT)

Option A: Deep GDB Investigation (SLOW)

  • Set hardware watchpoint on g_tls_sll_head[0]
  • Run to call 14250, then watch for corruption
  • Time: 1-2 hours, may not work with optimization

Option B: Disable Optimizations (DIAGNOSTIC)

  • Rebuild with -O0 to see if bug disappears
  • If so, likely compiler optimization bug or UB
  • Time: 10 minutes

Option C: Simplified Stress Test (QUICK)

  • Disable P0 batch optimization temporarily
  • Disable SFC temporarily
  • Test with simpler code path
  • Time: 20 minutes

After Fix Verified

  1. Commit P0 fix:

    • Fix #1: Clear next in tls_sll_pop
    • Fix #2: NULL-terminate in trc_linear_carve
    • Box I/E/3 validation infrastructure
    • Double-free detection
  2. Update CLAUDE.md with findings

  3. Performance benchmark (release build)


🎯 Expected Outcome

After applying Fix #2, the allocator should:

  • Pass 100K iterations without crash
  • Pass 1M iterations without crash
  • Maintain performance (~2.7M ops/s for 256B)

📝 Lessons Learned

  1. Stale pointers are dangerous: Always NULL-terminate linked lists
  2. Optimization exposes bugs: an -O0 debug build can mask a missing initialization that -O3 exposes
  3. Multiple fixes needed: Fix #1 alone was insufficient
  4. Chain integrity: Carved chains MUST be properly terminated

🔧 Build Flags (CRITICAL)

MUST use these flags:

HEADER_CLASSIDX=1
AGGRESSIVE_INLINE=1
PREWARM_TLS=1

Why: HAKMEM_TINY_HEADER_CLASSIDX=1 is required for Fix #1 to execute!

Use build.sh to ensure correct flags:

./build.sh bench_random_mixed_hakmem