# Class 2 Header Corruption - Root Cause Analysis (FINAL)

## Executive Summary

**Status**: ROOT CAUSE IDENTIFIED

- **Corrupted Pointer**: `0x74db60210116`
- **Corruption Call**: `14209` (POP, where the bad header is detected)
- **Last Valid State**: Call `3957` (PUSH)

**Root Cause**: **USER/BASE Pointer Confusion**

- The TLS SLL is receiving USER pointers (`BASE+1`) instead of BASE pointers.
- The TLS SLL then treats each such USER pointer as a BASE, so the byte it writes and later validates as the header is really the block's first user byte. Ordinary writes to user data overwrite that "header".

---

## Evidence

### 1. Corrupted Pointer Timeline

```
[C2_PUSH] ptr=0x74db60210116 before=0xa2 after=0xa2 call=3957
[C2_POP] ptr=0x74db60210116 header=0x00 expected=0xa2 call=14209
```

**Corruption Window**: 10,252 calls (3957 → 14209)

**No other C2 operations** on `0x74db60210116` in this window.

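
Confirming that no other C2 operation touches the pointer inside the window is easy to get wrong by eye in a long log. The sketch below is a minimal C filter over trace lines in the `[C2_PUSH]`/`[C2_POP]` format shown above; it extracts only the `ptr=` and `call=` fields. `c2_event`, `parse_c2_line`, and `count_ops_in_window` are illustrative names, not part of the allocator. Feeding it the log lines and the window `(3957, 14209)` for `0x74db60210116` should return 0, matching the statement above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed C2 trace line (illustrative type, not from the allocator). */
typedef struct {
    int is_push;                 /* 1 for [C2_PUSH], 0 for [C2_POP] */
    unsigned long long ptr;      /* value of the ptr= field         */
    unsigned long call;          /* value of the call= field        */
} c2_event;

/* Parses a "[C2_PUSH] ..." or "[C2_POP] ..." line. Returns 1 on success. */
static int parse_c2_line(const char* line, c2_event* ev) {
    const char* p;
    const char* c;
    if (strncmp(line, "[C2_PUSH]", 9) == 0) {
        ev->is_push = 1;
    } else if (strncmp(line, "[C2_POP]", 8) == 0) {
        ev->is_push = 0;
    } else {
        return 0;                /* not a C2 trace line */
    }
    p = strstr(line, "ptr=");
    c = strstr(line, "call=");
    if (p == NULL || c == NULL) return 0;
    /* %llx accepts an optional 0x prefix, like strtoull with base 16 */
    return sscanf(p, "ptr=%llx", &ev->ptr) == 1 &&
           sscanf(c, "call=%lu", &ev->call) == 1;
}

/* Counts C2 events on `ptr` whose call number is strictly between lo and hi. */
static int count_ops_in_window(const char* const* lines, int n,
                               unsigned long long ptr,
                               unsigned long lo, unsigned long hi) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        c2_event ev;
        if (parse_c2_line(lines[i], &ev) && ev.ptr == ptr &&
            ev.call > lo && ev.call < hi) {
            count++;
        }
    }
    return count;
}
```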
### 2. Address Analysis - USER/BASE Confusion

```
[C2_PUSH] ptr=0x74db60210115 before=0xa2 after=0xa2 call=3915
[C2_POP] ptr=0x74db60210115 header=0xa2 expected=0xa2 call=3936
[C2_PUSH] ptr=0x74db60210116 before=0xa2 after=0xa2 call=3957
[C2_POP] ptr=0x74db60210116 header=0x00 expected=0xa2 call=14209
```

**Address Spacing**:

- `0x74db60210115` vs `0x74db60210116` = **1-byte difference**
- **Expected stride for Class 2**: 33 bytes (32-byte block + 1-byte header), so BASE pointers of two distinct blocks can never be 1 byte apart

**Conclusion**: `0x115` and `0x116` are **NOT two different blocks**! Since `PTR_BASE_TO_USER` adds 1 for header classes, the lower address is the BASE and the higher one is the USER pointer:

- `0x74db60210115` = BASE pointer (header location)
- `0x74db60210116` = USER pointer (BASE + 1)

**They are the SAME physical block, just different pointer representations!**

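
The same-block argument can be made mechanical. The sketch below classifies a pointer relative to one Class 2 BASE that is assumed valid, using only the 33-byte stride and the `BASE + 1` rule stated in this document; `classify_c2` and the enum names are illustrative. With `0x...115` as BASE, `0x...116` comes out as that block's USER pointer; with the reverse reading (treating `0x...116` as BASE), `0x...115` matches no block start at all, which is why only the lower address can be the BASE.

```c
#include <assert.h>
#include <stdint.h>

enum { C2_STRIDE = 33 };  /* 32-byte block + 1-byte header, per this analysis */

typedef enum {
    SAME_BASE,            /* p is the BASE itself                        */
    USER_OF_BASE,         /* p == BASE + 1, i.e. PTR_BASE_TO_USER(BASE)   */
    OTHER_BASE,           /* p is a whole number of strides away          */
    NOT_A_BLOCK_START     /* p is neither a BASE nor this block's USER    */
} c2_relation;

/* Classifies p relative to `base`, assumed to be a valid Class 2 BASE in the
 * same slab. Distinct BASE pointers always differ by a multiple of the stride,
 * so a 1-byte gap can only be a BASE/USER pair of one block. */
static c2_relation classify_c2(uintptr_t base, uintptr_t p) {
    uintptr_t d = p >= base ? p - base : base - p;
    if (d == 0) return SAME_BASE;
    if (p == base + 1) return USER_OF_BASE;
    if (d % C2_STRIDE == 0) return OTHER_BASE;
    return NOT_A_BLOCK_START;
}
```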
---

## Corruption Mechanism

### Phase 1: Normal Round Trip (Calls 3915-3936)

1. **Call 3915**: Block is **FREE'd** (pushed to TLS SLL)
   - Pointer: `0x74db60210115` (BASE pointer, **correct**)
   - `tls_sll_push` restores the header `0xa2` at `0x115`
2. **Call 3936**: Block is **ALLOC'd** (popped from TLS SLL)
   - Pointer: `0x74db60210115` (BASE); the header check passes (`0xa2`)
   - If the alloc path applies `PTR_BASE_TO_USER`, the caller receives `0x74db60210116` (USER), consistent with the pointer freed at call 3957

### Phase 2: Re-Free with the Wrong Pointer (Call 3957)

3. **Call 3957**: Block is **FREE'd** again (pushed to TLS SLL)
   - Pointer: `0x74db60210116` (USER pointer, **BUG!**)
   - `tls_sll_push` treats it as BASE and writes `0xa2` at `0x116`, which is the block's first user byte; the real header at `0x115` is not involved
   - The block now sits in the TLS SLL under its USER address

### Phase 3: User Data Overwrites the "Header" (Calls 3957-14209)

4. **Between Calls 3957-14209**: no C2 PUSH/POP is logged for `0x74db60210116`, yet the byte at `0x116` changes from `0xa2` to `0x00`
   - To the TLS SLL, `0x116` is a header byte; to the block whose BASE is `0x115`, it is user byte 0
   - Any write to the start of that block's user data therefore overwrites what the TLS SLL later validates as the header
   - The logs show only that the write happened outside the instrumented C2 PUSH/POP paths, not which code performed it
5. **Call 14209**: Block is **ALLOC'd** (popped from TLS SLL)
   - Pointer: `0x74db60210116` (the USER pointer pushed at call 3957)
   - **CORRUPTION DETECTED**: the byte checked as the header is `0x00` instead of `0xa2`

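
The aliasing in Phases 2 and 3 can be reproduced on a plain byte buffer. The sketch below is a simulation, not allocator code: `sll_push_restore_header` and `sll_pop_header_ok` are hypothetical stand-ins that model only the header write and header check described above, and the `memset` stands in for the unidentified write in Phase 3. It shows the real header at BASE staying `0xa2` while the byte the TLS SLL checks (user byte 0) reads `0x00`, the same mismatch reported at call 14209.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define C2_HEADER 0xa2  /* header value seen in the C2 logs */
#define C2_STRIDE 33    /* 32-byte block + 1-byte header     */

/* Hypothetical stand-ins: push writes the header at whatever pointer it is
 * given; pop checks the byte at that pointer. */
static void sll_push_restore_header(uint8_t* p) { p[0] = C2_HEADER; }
static int sll_pop_header_ok(const uint8_t* p) { return p[0] == C2_HEADER; }

/* Replays the logged sequence on one simulated Class 2 block.
 * Returns 1 if the pop-time check at "call 14209" reports corruption. */
static int replay_user_pointer_push(void) {
    uint8_t block[C2_STRIDE];
    memset(block, 0xee, sizeof block);  /* arbitrary filler           */
    uint8_t* base = block;              /* plays 0x74db60210115       */
    uint8_t* user = block + 1;          /* plays 0x74db60210116       */

    sll_push_restore_header(base);      /* call 3915: BASE pushed (correct)  */
    assert(sll_pop_header_ok(base));    /* call 3936: header intact on pop   */
    sll_push_restore_header(user);      /* call 3957: USER pushed as if BASE */
    memset(user, 0, 8);                 /* assumed write to user bytes 0..7  */
    assert(base[0] == C2_HEADER);       /* the real header is untouched      */
    return !sll_pop_header_ok(user);    /* call 14209: 0x00 != 0xa2          */
}
```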
---

## Root Cause: Missing BASE/USER Conversion at the TLS SLL Boundary

**The allocator has TWO pointer conventions:**

1. **Internal (TLS SLL)**: Uses BASE pointers (header at offset 0)
2. **External (User API)**: Uses USER pointers (BASE + 1 for header classes; BASE and USER coincide for class 7, which has no header)

**Conversion Macros**:

```c
#include <stdint.h>  /* uint8_t */

#define PTR_BASE_TO_USER(base, class_idx) \
    ((class_idx) == 7 ? (base) : ((void*)((uint8_t*)(base) + 1)))

#define PTR_USER_TO_BASE(user, class_idx) \
    ((class_idx) == 7 ? (user) : ((void*)((uint8_t*)(user) - 1)))
```

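
A quick way to pin down the convention is a round-trip check on the two macros above. This sketch copies them verbatim and adds two illustrative helpers (`round_trips`, `user_offset`), which are not allocator functions. For a header class the USER pointer sits exactly 1 byte above BASE, for class 7 the two coincide, and converting BASE to USER and back must always return the original pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Conversion macros exactly as given in this document. */
#define PTR_BASE_TO_USER(base, class_idx) \
    ((class_idx) == 7 ? (base) : ((void*)((uint8_t*)(base) + 1)))
#define PTR_USER_TO_BASE(user, class_idx) \
    ((class_idx) == 7 ? (user) : ((void*)((uint8_t*)(user) - 1)))

/* Returns 1 if BASE -> USER -> BASE gives back the original pointer. */
static int round_trips(void* base, int class_idx) {
    void* user = PTR_BASE_TO_USER(base, class_idx);
    return PTR_USER_TO_BASE(user, class_idx) == base;
}

/* Byte offset of the USER pointer relative to BASE (1 for header classes). */
static long user_offset(void* base, int class_idx) {
    return (long)((uint8_t*)PTR_BASE_TO_USER(base, class_idx) - (uint8_t*)base);
}
```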
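**The Bug**:

- `tls_sll_push()` and `tls_sll_pop()` work on BASE pointers (correct for internal use)
- At call 3957 a USER pointer (`0x74db60210116`) reached `tls_sll_push()` **without `PTR_USER_TO_BASE`**, so the TLS SLL treats user byte 0 as the header
- A fast allocation path that returns the `tls_sll_pop()` result **without `PTR_BASE_TO_USER`** breaks the same invariant from the other side: the caller receives BASE, writes to BASE[0], and **destroys the header**
- Both conversions must therefore be applied at every boundary between the TLS SLL and user code

---
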
## Expected Fixes

### Fix #1: Convert BASE → USER in Fast Allocation Path

**Location**: Wherever the result of `tls_sll_pop()` is returned to the caller

**Example** (hypothetical fast path):

```c
// Hypothetical fast-path allocator. tls_sll_pop() is assumed to store a BASE
// pointer in *out and return non-NULL on success. It keeps returning BASE;
// the conversion belongs to the caller at the user-facing boundary.

// BEFORE (BUG):
void* base;
if (tls_sll_pop(class_idx, &base)) {
    return base;                               // ← BUG: Returns BASE to user!
}

// AFTER (FIX):
void* base;
if (tls_sll_pop(class_idx, &base)) {
    return PTR_BASE_TO_USER(base, class_idx);  // ✅ Convert to USER
}
```

### Fix #2: Convert USER → BASE in Fast Free Path

**Location**: Wherever a user pointer is pushed to the TLS SLL

**Example** (hypothetical fast free):

```c
// BEFORE (BUG):
void hakmem_free(void* user_ptr) {
    int class_idx = /* ... class lookup ... */;
    tls_sll_push(class_idx, user_ptr, ...);  // ← BUG: Passes USER to TLS SLL!
}

// AFTER (FIX):
void hakmem_free(void* user_ptr) {
    int class_idx = /* ... class lookup ... */;
    void* base = PTR_USER_TO_BASE(user_ptr, class_idx);  // ✅ Convert to BASE
    tls_sll_push(class_idx, base, ...);
}
```

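
One way to make a missed conversion fail at compile time rather than at call 14209 is to give BASE and USER pointers distinct types. This is an optional pattern, not something this analysis found in the codebase; `base_ptr`, `user_ptr`, `base_to_user`, `user_to_base`, and `tls_sll_push_typed` are illustrative names, and the conversion bodies mirror `PTR_BASE_TO_USER`/`PTR_USER_TO_BASE`. Because the two structs are incompatible types, passing a `user_ptr` where a `base_ptr` is expected is a compile error, and single-pointer structs like these usually compile down to a bare pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative distinct types: a user_ptr cannot be passed where a base_ptr
 * is expected (and vice versa) without an explicit conversion. */
typedef struct { void* p; } base_ptr;
typedef struct { void* p; } user_ptr;

/* Same rule as PTR_BASE_TO_USER: class 7 has no header, others add 1. */
static user_ptr base_to_user(base_ptr b, int class_idx) {
    user_ptr u = { class_idx == 7 ? b.p : (void*)((uint8_t*)b.p + 1) };
    return u;
}

/* Same rule as PTR_USER_TO_BASE. */
static base_ptr user_to_base(user_ptr u, int class_idx) {
    base_ptr b = { class_idx == 7 ? u.p : (void*)((uint8_t*)u.p - 1) };
    return b;
}

/* A typed push would then reject USER pointers at compile time, e.g.
 *   void tls_sll_push_typed(int class_idx, base_ptr base);  // hypothetical
 *   tls_sll_push_typed(2, some_user_ptr);   // error: incompatible type */
```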
---

## Next Steps

1. **Grep for all malloc/free paths** that return/accept pointers
2. **Verify PTR_BASE_TO_USER conversion** in every allocation path
3. **Verify PTR_USER_TO_BASE conversion** in every free path
4. **Add assertions** in debug builds to detect USER/BASE mismatches

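
For step 4, one possible debug assertion follows from the stride argument in the Evidence section: a pointer entering the TLS SLL should sit a whole number of 33-byte strides past the start of its slab's block area, and a USER pointer (BASE + 1) never does. This is a sketch under assumptions: `slab_start` (where a slab's first block begins), `is_block_base`, and `ASSERT_SLL_BASE` are hypothetical names, since this document does not describe the slab layout API. A call such as `ASSERT_SLL_BASE(base, slab_start, 33)` at the top of the push path would have fired at call 3957.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: p is a block BASE if it lies a whole number of strides
 * past slab_start. A USER pointer (BASE + 1) fails for any stride > 1. */
static int is_block_base(const void* p, const void* slab_start, size_t stride) {
    uintptr_t a = (uintptr_t)p;
    uintptr_t s = (uintptr_t)slab_start;
    return a >= s && (a - s) % stride == 0;
}

/* Debug-only guard for the TLS SLL push path; compiled out under NDEBUG. */
#ifndef NDEBUG
#define ASSERT_SLL_BASE(p, slab_start, stride) \
    assert(is_block_base((p), (slab_start), (stride)) && \
           "USER pointer pushed to TLS SLL?")
#else
#define ASSERT_SLL_BASE(p, slab_start, stride) ((void)0)
#endif
```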
### Grep Commands

```bash
# Find all places that call tls_sll_pop (allocation)
grep -rn "tls_sll_pop" core/

# Find all places that call tls_sll_push (free)
grep -rn "tls_sll_push" core/

# Find PTR_BASE_TO_USER usage (should be in alloc paths)
grep -rn "PTR_BASE_TO_USER" core/

# Find PTR_USER_TO_BASE usage (should be in free paths)
grep -rn "PTR_USER_TO_BASE" core/
```

---

## Verification After Fix

After applying the fixes, re-run with Class 2 inline logs:

```bash
./build.sh bench_random_mixed_hakmem
timeout 180s ./out/release/bench_random_mixed_hakmem 100000 256 42 2>&1 | tee c2_fixed.log

# Check for corruption
grep "CORRUPTION DETECTED" c2_fixed.log
# Expected: NO OUTPUT (no corruption)

# Check for USER/BASE mismatch (block addresses within one slab should be 33 bytes apart)
grep "C2_PUSH\|C2_POP" c2_fixed.log | head -100
# Expected: addresses differ by multiples of 33 (0x21), never by 1
```

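
Checking 100 addresses for multiples of `0x21` by eye is error-prone. The sketch below takes the `ptr=` values (for example, parsed from the grep output above) and reports the first address that is not congruent to the first one modulo the stride. It assumes all addresses come from a single slab whose blocks share one 33-byte grid; `first_off_stride` is an illustrative name. Comparing each address against the first is enough: if every address differs from the first by a multiple of 33, every pair does.

```c
#include <assert.h>
#include <stddef.h>

/* Returns -1 if every address differs from addrs[0] by a multiple of
 * `stride` (so every pair does too), else the index of the first that does
 * not. Assumes all addresses belong to one slab on a single stride grid. */
static long first_off_stride(const unsigned long long* addrs, size_t n,
                             unsigned long long stride) {
    for (size_t i = 1; i < n; i++) {
        unsigned long long d = addrs[i] > addrs[0] ? addrs[i] - addrs[0]
                                                   : addrs[0] - addrs[i];
        if (d % stride != 0) return (long)i;
    }
    return -1;
}
```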
---

## Conclusion

**The header corruption is NOT caused by:**

- ✗ Missing header writes in CARVE
- ✗ Missing header restoration in PUSH/SPLICE
- ✗ Missing header validation in POP
- ✗ Stride calculation bugs
- ✗ Double-free
- ✗ Use-after-free

**The header corruption IS caused by:**

- ✓ **Missing BASE/USER conversion at the TLS SLL boundary** (a USER pointer reaching `tls_sll_push()`; an alloc path skipping `PTR_BASE_TO_USER` has the same effect)
- ✓ **The TLS SLL treating a USER pointer as a BASE pointer**
- ✓ **Writes to user byte 0 overwriting the byte the TLS SLL validates as the header**

**This is a simple, deterministic bug with a 1-line fix in each affected path.**

---

## Final Report

- **Bug Type**: Pointer convention mismatch (BASE vs USER)
- **Affected Classes**: C0-C6 (header classes, NOT C7)
- **Symptom**: Header mismatch (`0x00` instead of `0xa2`) detected when a block is popped from the TLS SLL
- **Root Cause**: A USER pointer (`BASE+1`) enters the TLS SLL, which then treats user byte 0 as the header
- **Fix**: Add `PTR_BASE_TO_USER()` in the alloc path and `PTR_USER_TO_BASE()` in the free path
- **Verification**: Address spacing in logs (should be 33-byte multiples, never 1 byte)
- **Status**: **READY FOR FIX**