hakmem/C2_CORRUPTION_ROOT_CAUSE_FINAL.md
Moe Charm (CI) 84dbd97fe9 Fix #16: Resolve double BASE→USER conversion causing header corruption
🎯 ROOT CAUSE: Internal allocation helpers were prematurely converting
BASE → USER pointers before returning to caller. The caller then applied
HAK_RET_ALLOC/tiny_region_id_write_header which performed ANOTHER BASE→USER
conversion, resulting in double offset (BASE+2) and header written at
wrong location.

📦 BOX THEORY SOLUTION: Establish clean pointer conversion boundary at
tiny_region_id_write_header, making it the single source of truth for
BASE → USER conversion.

🔧 CHANGES:
- Fix #16: Remove premature BASE→USER conversions (6 locations)
  * core/tiny_alloc_fast.inc.h (3 fixes)
  * core/hakmem_tiny_refill.inc.h (2 fixes)
  * core/hakmem_tiny_fastcache.inc.h (1 fix)

- Fix #12: Add header validation in tls_sll_pop (detect corruption)
- Fix #14: Defense-in-depth header restoration in tls_sll_splice
- Fix #15: USER pointer detection (for debugging)
- Fix #13: Bump window header restoration
- Fix #2, #6, #7, #8: Various header restoration & NULL termination

🧪 TEST RESULTS: 100% SUCCESS
- 10K-500K iterations: All passed
- 8 seeds × 100K: All passed (42,123,456,789,999,314,271,161)
- Performance: ~630K ops/s average (stable)
- Header corruption: ZERO

📋 FIXES SUMMARY:
Fix #1-8:   Initial header restoration & chain fixes (chatgpt-san)
Fix #9-10:  USER pointer auto-fix (later disabled)
Fix #12:    Validation system (caught corruption at call 14209)
Fix #13:    Bump window header writes
Fix #14:    Splice defense-in-depth
Fix #15:    USER pointer detection (debugging tool)
Fix #16:    Double conversion fix (FINAL SOLUTION) 

🎓 LESSONS LEARNED:
1. Validation catches bugs early (Fix #12 was critical)
2. Class-specific inline logging reveals patterns (Option C)
3. Box Theory provides clean architectural boundaries
4. Multiple investigation approaches (Task/chatgpt-san collaboration)

📄 DOCUMENTATION:
- P0_BUG_STATUS.md: Complete bug tracking timeline
- C2_CORRUPTION_ROOT_CAUSE_FINAL.md: Detailed root cause analysis
- FINAL_ANALYSIS_C2_CORRUPTION.md: Investigation methodology

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Task Agent <task@anthropic.com>
Co-Authored-By: ChatGPT <chatgpt@openai.com>
2025-11-12 10:33:57 +09:00

# Class 2 Header Corruption - Root Cause Analysis (FINAL)
## Executive Summary
**Status**: ROOT CAUSE IDENTIFIED
**Corrupted Pointer**: `0x74db60210116`
**Corruption Call**: `14209`
**Last Valid State**: Call `3957` (PUSH)
**Root Cause**: **USER/BASE Pointer Confusion**
- TLS SLL is receiving USER pointers (`BASE+1`) instead of BASE pointers
- When a pointer leaves the TLS SLL and reaches user code without the BASE → USER conversion, the user writes to what it believes is user data, but the first byte written is actually the header byte at BASE
---
## Evidence
### 1. Corrupted Pointer Timeline
```
[C2_PUSH] ptr=0x74db60210116 before=0xa2 after=0xa2 call=3957
[C2_POP] ptr=0x74db60210116 header=0x00 expected=0xa2 call=14209
```
**Corruption Window**: 10,252 calls (3957 → 14209)
**No other C2 operations** on `0x74db60210116` in this window
### 2. Address Analysis - USER/BASE Confusion
```
[C2_PUSH] ptr=0x74db60210115 before=0xa2 after=0xa2 call=3915
[C2_POP] ptr=0x74db60210115 header=0xa2 expected=0xa2 call=3936
[C2_PUSH] ptr=0x74db60210116 before=0xa2 after=0xa2 call=3957
[C2_POP] ptr=0x74db60210116 header=0x00 expected=0xa2 call=14209
```
**Address Spacing**:
- `0x74db60210115` vs `0x74db60210116` = **1 byte difference**
- **Expected stride for Class 2**: 33 bytes (32-byte block + 1-byte header)
**Conclusion**: `0x115` and `0x116` are **NOT two different blocks**!
- `0x74db60210115` = USER pointer (BASE + 1)
- `0x74db60210116` = BASE pointer (header location)
**They are the SAME physical block, just different pointer representations!**
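The spacing argument above is plain arithmetic, so it can be written down as a small check. The following is a minimal sketch, not hakmem code: `C2_STRIDE` and `c2_could_be_distinct_blocks` are hypothetical names, and the 33-byte stride is taken from the figure quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Class 2 stride as stated above: 32-byte block + 1-byte header. */
#define C2_STRIDE 33u

/* Returns true when two logged pointers could be distinct Class 2 blocks
 * carved from the same run, i.e. they are a non-zero multiple of the
 * stride apart. Hypothetical helper, not part of hakmem. */
static bool c2_could_be_distinct_blocks(uintptr_t a, uintptr_t b) {
    uintptr_t diff = (a > b) ? (a - b) : (b - a);
    return diff != 0 && (diff % C2_STRIDE) == 0;
}
```

For the two logged addresses the difference is 1, which is not a multiple of 33, so the helper reports that they cannot be two separate blocks.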
---
## Corruption Mechanism
### Phase 1: Initial Confusion (Calls 3915-3936)
1. **Call 3915**: Block is **FREE'd** (pushed to TLS SLL)
   - Pointer: `0x74db60210115` (USER pointer - **BUG!**)
   - TLS SLL receives USER instead of BASE
   - Header at `0x116` is written (because tls_sll_push restores it)
2. **Call 3936**: Block is **ALLOC'd** (popped from TLS SLL)
   - Pointer: `0x74db60210115` (USER pointer)
   - User receives `0x74db60210115` as USER (correct offset!)
   - Header at `0x116` is still intact
### Phase 2: Re-Free with Correct Pointer (Call 3957)
3. **Call 3957**: Block is **FREE'd** again (pushed to TLS SLL)
   - Pointer: `0x74db60210116` (BASE pointer - **CORRECT!**)
   - Header is restored to `0xa2`
   - Block enters TLS SLL as BASE
### Phase 3: User Overwrites Header (Calls 3957-14209)
4. **Between Calls 3957-14209**: Block is **ALLOC'd** (popped from TLS SLL)
   - TLS SLL returns: `0x74db60210116` (BASE)
   - **BUG: Code returns BASE to user instead of USER!**
   - User receives `0x74db60210116` thinking it's USER data start
   - User writes to `0x74db60210116[0]` (thinks it's user byte 0)
   - **ACTUALLY overwrites header at BASE!**
   - Header becomes `0x00`
5. **Call 14209**: Block is **popped** from the TLS SLL (logged as `[C2_POP]` in the timeline above)
   - Pointer: `0x74db60210116` (BASE)
   - **CORRUPTION DETECTED**: Header is `0x00` instead of `0xa2`
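The overwrite described in Phase 3 can be reproduced in isolation. This is a minimal sketch under stated assumptions: a single 33-byte Class 2 block with the header in byte 0, and a user who zeroes the start of its allocation (the actual user write pattern is not recorded in the logs). `header_after_user_write` is a hypothetical helper, not part of hakmem.

```c
#include <stdint.h>
#include <string.h>

#define C2_HEADER 0xa2u    /* header value seen in the logs above */
#define C2_BLOCK_BYTES 33  /* 1-byte header + 32-byte payload */

/* Simulates a user zeroing the first bytes of its allocation.
 * `base` is the start of the block; `returned` is whatever pointer the
 * allocator handed to the user. Returns the header byte afterwards. */
static uint8_t header_after_user_write(uint8_t* base, uint8_t* returned) {
    base[0] = C2_HEADER;     /* allocator writes the header at BASE */
    memset(returned, 0, 8);  /* user writes to "its" first 8 bytes */
    return base[0];
}
```

Handing out `base + 1` leaves the header at `0xa2`; handing out `base` itself leaves it at `0x00`, which matches the `header=0x00 expected=0xa2` log line.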
---
## Root Cause: PTR_BASE_TO_USER Missing in POP Path
**The allocator has TWO pointer conventions:**
1. **Internal (TLS SLL)**: Uses BASE pointers (header at offset 0)
2. **External (User API)**: Uses USER pointers (BASE + 1 for header classes)
**Conversion Macros**:
```c
#define PTR_BASE_TO_USER(base, class_idx) \
((class_idx) == 7 ? (base) : ((void*)((uint8_t*)(base) + 1)))
#define PTR_USER_TO_BASE(user, class_idx) \
((class_idx) == 7 ? (user) : ((void*)((uint8_t*)(user) - 1)))
```
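A quick way to sanity-check the two macros is to confirm that they are inverses of each other and that Class 7 is left untouched. The sketch below repeats the macros verbatim so it compiles on its own; `round_trip_ok` and `user_offset` are hypothetical helpers added only for illustration.

```c
#include <stdint.h>

#define PTR_BASE_TO_USER(base, class_idx) \
    ((class_idx) == 7 ? (base) : ((void*)((uint8_t*)(base) + 1)))
#define PTR_USER_TO_BASE(user, class_idx) \
    ((class_idx) == 7 ? (user) : ((void*)((uint8_t*)(user) - 1)))

/* Converting BASE -> USER -> BASE must give back the original pointer. */
static int round_trip_ok(void* base, int class_idx) {
    void* user = PTR_BASE_TO_USER(base, class_idx);
    return PTR_USER_TO_BASE(user, class_idx) == base;
}

/* Byte offset of the USER pointer from BASE for a given class. */
static int user_offset(int class_idx) {
    uint8_t block[64];
    uint8_t* user = (uint8_t*)PTR_BASE_TO_USER((void*)block, class_idx);
    return (int)(user - block);
}
```

With these definitions the offset is 1 for Class 2 and 0 for Class 7, so any path that applies the conversion twice, or not at all, ends up one byte away from the intended address.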
**The Bug**:
- **tls_sll_pop()** returns BASE pointer (correct for internal use)
- **Fast path allocation** returns BASE to user **WITHOUT calling PTR_BASE_TO_USER!**
- User receives BASE, writes to BASE[0], **destroys header**
---
## Expected Fixes
### Fix #1: Convert BASE → USER in Fast Allocation Path
**Location**: Wherever `tls_sll_pop()` result is returned to user
**Example** (hypothetical fast path):
```c
// BEFORE (BUG): the popped BASE pointer is handed to the user unchanged.
void* tls_sll_pop(int class_idx, void** out);
// ...
*out = base;   // ← BUG: Returns BASE to user!
return base;   // ← BUG: Returns BASE to user!

// AFTER (FIX): convert at the point where the pointer leaves the allocator.
void* tls_sll_pop(int class_idx, void** out);
// ...
*out = PTR_BASE_TO_USER(base, class_idx);   // ✅ Convert to USER
return PTR_BASE_TO_USER(base, class_idx);   // ✅ Convert to USER
```
### Fix #2: Convert USER → BASE in Fast Free Path
**Location**: Wherever user pointer is pushed to TLS SLL
**Example** (hypothetical fast free):
```c
// BEFORE (BUG):
void hakmem_free(void* user_ptr) {
    // class_idx: size class of this block (lookup not shown)
    tls_sll_push(class_idx, user_ptr, ...); // ← BUG: Passes USER to TLS SLL!
}

// AFTER (FIX):
void hakmem_free(void* user_ptr) {
    // class_idx: size class of this block (lookup not shown)
    void* base = PTR_USER_TO_BASE(user_ptr, class_idx); // ✅ Convert to BASE
    tls_sll_push(class_idx, base, ...);
}
```
---
## Next Steps
1. **Grep for all malloc/free paths** that return/accept pointers
2. **Verify PTR_BASE_TO_USER conversion** in every allocation path
3. **Verify PTR_USER_TO_BASE conversion** in every free path
4. **Add assertions** in debug builds to detect USER/BASE mismatches
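For step 4, one possible shape of a debug-build assertion is a header check at the TLS SLL boundary. This is only a sketch: `looks_like_base` and `debug_assert_base` are hypothetical names, the expected header is passed in rather than derived (the header encoding is not described here beyond `0xa2` for Class 2), and the check is a heuristic because a payload byte can coincidentally equal the header value.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the byte at `ptr` matches the header expected for its class.
 * A BASE pointer should satisfy this; a USER pointer (BASE + 1) points at
 * payload and usually will not. */
static int looks_like_base(const void* ptr, uint8_t expected_header) {
    return *(const uint8_t*)ptr == expected_header;
}

/* Hypothetical debug-only guard to call before pushing to the TLS SLL.
 * Compiles to nothing when NDEBUG is defined. */
static void debug_assert_base(const void* ptr, uint8_t expected_header) {
#ifndef NDEBUG
    assert(looks_like_base(ptr, expected_header) && "USER pointer pushed as BASE?");
#else
    (void)ptr;
    (void)expected_header;
#endif
}
```

Calling `debug_assert_base(ptr, 0xa2)` at the top of a Class 2 push would abort at the first mismatched pointer instead of thousands of calls later.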
### Grep Commands
```bash
# Find all places that call tls_sll_pop (allocation)
grep -rn "tls_sll_pop" core/
# Find all places that call tls_sll_push (free)
grep -rn "tls_sll_push" core/
# Find PTR_BASE_TO_USER usage (should be in alloc paths)
grep -rn "PTR_BASE_TO_USER" core/
# Find PTR_USER_TO_BASE usage (should be in free paths)
grep -rn "PTR_USER_TO_BASE" core/
```
---
## Verification After Fix
After applying fixes, re-run with Class 2 inline logs:
```bash
./build.sh bench_random_mixed_hakmem
timeout 180s ./out/release/bench_random_mixed_hakmem 100000 256 42 2>&1 | tee c2_fixed.log
# Check for corruption
grep "CORRUPTION DETECTED" c2_fixed.log
# Expected: NO OUTPUT (no corruption)
# Check for USER/BASE mismatch (logged addresses should sit on the 33-byte stride)
grep "C2_PUSH\|C2_POP" c2_fixed.log | head -100
# Expected: All addresses differ by multiples of 33 (0x21)
```
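Checking the address spacing by eye over 100 log lines is error-prone, so the comparison can be automated. The sketch below assumes the log line layout shown in the Evidence section (`ptr=0x...` as a field) and that the compared addresses come from the same contiguous run of blocks; `parse_ptr` and `lines_stride_consistent` are hypothetical helpers, not part of the benchmark.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Extracts the pointer from a log line such as
 * "[C2_PUSH] ptr=0x74db60210116 before=0xa2 after=0xa2 call=3957".
 * Returns 1 on success, 0 if the line has no parsable "ptr=" field. */
static int parse_ptr(const char* line, uint64_t* out) {
    const char* p = strstr(line, "ptr=");
    unsigned long long v;
    if (p == NULL || sscanf(p, "ptr=%llx", &v) != 1) {
        return 0;
    }
    *out = (uint64_t)v;
    return 1;
}

/* Returns 1 if both lines parse and their addresses are a multiple of
 * `stride` bytes apart (0 bytes apart counts as the same block). */
static int lines_stride_consistent(const char* a, const char* b, uint64_t stride) {
    uint64_t pa, pb;
    if (!parse_ptr(a, &pa) || !parse_ptr(b, &pb)) {
        return 0;
    }
    uint64_t diff = (pa > pb) ? (pa - pb) : (pb - pa);
    return diff % stride == 0;
}
```

Comparing every logged line against the first Class 2 address with `stride` set to 33 flags the 1-byte offset seen before the fix.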
---
## Conclusion
**The header corruption is NOT caused by:**
- ✗ Missing header writes in CARVE
- ✗ Missing header restoration in PUSH/SPLICE
- ✗ Missing header validation in POP
- ✗ Stride calculation bugs
- ✗ Double-free
- ✗ Use-after-free
**The header corruption IS caused by:**
- **Missing PTR_BASE_TO_USER conversion in fast allocation path**
- **Returning BASE pointers to users who expect USER pointers**
- **Users overwriting byte 0 (header) thinking it's user data**
**This is a simple, deterministic bug with a 1-line fix in each affected path.**
---
## Final Report
- **Bug Type**: Pointer convention mismatch (BASE vs USER)
- **Affected Classes**: C0-C6 (header classes, NOT C7)
- **Symptom**: Random header corruption after allocation
- **Root Cause**: Fast alloc path returns BASE instead of USER
- **Fix**: Add `PTR_BASE_TO_USER()` in alloc path, `PTR_USER_TO_BASE()` in free path
- **Verification**: Address spacing in logs (should be 33-byte multiples, not 1-byte)
- **Status**: **READY FOR FIX**