🎯 ROOT CAUSE:
Internal allocation helpers were prematurely converting BASE → USER pointers before returning to caller. The caller then applied HAK_RET_ALLOC/tiny_region_id_write_header which performed ANOTHER BASE→USER conversion, resulting in double offset (BASE+2) and header written at wrong location.

📦 BOX THEORY SOLUTION:
Establish clean pointer conversion boundary at tiny_region_id_write_header, making it the single source of truth for BASE → USER conversion.

🔧 CHANGES:
- Fix #16: Remove premature BASE→USER conversions (6 locations)
  * core/tiny_alloc_fast.inc.h (3 fixes)
  * core/hakmem_tiny_refill.inc.h (2 fixes)
  * core/hakmem_tiny_fastcache.inc.h (1 fix)
- Fix #12: Add header validation in tls_sll_pop (detect corruption)
- Fix #14: Defense-in-depth header restoration in tls_sll_splice
- Fix #15: USER pointer detection (for debugging)
- Fix #13: Bump window header restoration
- Fix #2, #6, #7, #8: Various header restoration & NULL termination

🧪 TEST RESULTS: 100% SUCCESS
- 10K-500K iterations: All passed
- 8 seeds × 100K: All passed (42,123,456,789,999,314,271,161)
- Performance: ~630K ops/s average (stable)
- Header corruption: ZERO

📋 FIXES SUMMARY:
Fix #1-8: Initial header restoration & chain fixes (chatgpt-san)
Fix #9-10: USER pointer auto-fix (later disabled)
Fix #12: Validation system (caught corruption at call 14209)
Fix #13: Bump window header writes
Fix #14: Splice defense-in-depth
Fix #15: USER pointer detection (debugging tool)
Fix #16: Double conversion fix (FINAL SOLUTION) ✅

🎓 LESSONS LEARNED:
1. Validation catches bugs early (Fix #12 was critical)
2. Class-specific inline logging reveals patterns (Option C)
3. Box Theory provides clean architectural boundaries
4. Multiple investigation approaches (Task/chatgpt-san collaboration)

📄 DOCUMENTATION:
- P0_BUG_STATUS.md: Complete bug tracking timeline
- C2_CORRUPTION_ROOT_CAUSE_FINAL.md: Detailed root cause analysis
- FINAL_ANALYSIS_C2_CORRUPTION.md: Investigation methodology

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Task Agent <task@anthropic.com>
Co-Authored-By: ChatGPT <chatgpt@openai.com>
Class 2 Header Corruption - FINAL ROOT CAUSE
Executive Summary
STATUS: ✅ ROOT CAUSE IDENTIFIED
Corrupted Pointer: 0x74db60210116
Corruption Call: 14209
Last Valid PUSH: Call 3957
Root Cause: The logs reveal 0x74db60210115 and 0x74db60210116 (only 1 byte apart) are being pushed/popped from TLS SLL. This spacing is IMPOSSIBLE for Class 2 (32B blocks + 1B header = 33B stride).
Conclusion: These are USER and BASE representations of the SAME block, indicating a USER/BASE pointer mismatch somewhere in the code that allows USER pointers to leak into the TLS SLL.
Evidence
Timeline of Corrupted Block
[C2_PUSH] ptr=0x74db60210115 before=0xa2 after=0xa2 call=3915 ← USER pointer!
[C2_POP] ptr=0x74db60210115 header=0xa2 expected=0xa2 call=3936 ← USER pointer!
[C2_PUSH] ptr=0x74db60210116 before=0xa2 after=0xa2 call=3957 ← BASE pointer (correct)
[C2_POP] ptr=0x74db60210116 header=0x00 expected=0xa2 call=14209 ← CORRUPTION!
Address Analysis
0x74db60210115 ← USER pointer (BASE + 1)
0x74db60210116 ← BASE pointer (header location)
Difference: 1 byte (should be 33 bytes for different Class 2 blocks)
Conclusion: Same physical block, two different pointer conventions
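The spacing argument can be checked mechanically. The following is a minimal sketch, assuming blocks of one class are laid out contiguously at a fixed stride; `could_be_distinct_blocks` is a hypothetical name used only for illustration and is not part of hakmem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: two addresses can only be the starts of two
 * different blocks in the same slab if they are a non-zero multiple of
 * the stride apart. Assumes contiguous, fixed-stride layout. */
static bool could_be_distinct_blocks(uintptr_t a, uintptr_t b, size_t stride) {
    assert(stride != 0);
    uintptr_t diff = (a > b) ? (a - b) : (b - a);
    return diff != 0 && (diff % stride) == 0;
}
```

With the two logged addresses and the 33-byte stride quoted above, the helper reports that they cannot be two separate blocks, which is what points to a pointer-convention mismatch on a single block.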
Corruption Mechanism
Phase 1: USER Pointer Leak (Calls 3915-3936)
- Call 3915: FREE operation pushes 0x115 (USER pointer) to TLS SLL
  - BUG: Code path passes USER to tls_sll_push instead of BASE
  - TLS SLL receives USER pointer
  - tls_sll_push writes header at USER-1 (0x116), so header is correct
- Call 3936: ALLOC operation pops 0x115 (USER pointer) from TLS SLL
  - Returns USER pointer to application (correct for external API)
  - User writes to 0x115+ (user data area)
  - Header at 0x116 remains intact (not touched by user)
Phase 2: Correct BASE Pointer (Call 3957)
- Call 3957: FREE operation pushes 0x116 (BASE pointer) to TLS SLL
  - Correct: Passes BASE to tls_sll_push
  - Header restored to 0xa2
Phase 3: User Overwrites Header (Calls 3957-14209)
- Between 3957-14209: ALLOC operation pops 0x116 from TLS SLL
  - BUG: Returns BASE pointer to user instead of USER pointer!
  - User receives 0x116 thinking it's the start of user data
  - User writes to 0x116[0] (thinks it's user byte 0)
  - ACTUALLY overwrites header byte!
  - Header becomes 0x00
- Call 14209: FREE operation pushes 0x116 to TLS SLL
  - CORRUPTION DETECTED: Header is 0x00 instead of 0xa2
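The overwrite in Phase 3 can be reproduced in isolation. This is a minimal sketch under stated assumptions: `demo_block_t` and the `demo_*` functions are hypothetical stand-ins for one Class 2 block and its allocation path, not hakmem code, and the header value 0xa2 is taken from the logs above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HEADER 0xa2u  /* header value seen in the Class 2 logs */

/* Hypothetical block: byte 0 is the header, bytes 1..32 are user data. */
typedef struct { uint8_t bytes[33]; } demo_block_t;

/* Correct path: write the header at BASE, hand out BASE + 1 (USER). */
static uint8_t* demo_alloc_correct(demo_block_t* blk) {
    blk->bytes[0] = DEMO_HEADER;
    return &blk->bytes[1];
}

/* Buggy path: the header is written, but BASE itself is handed out. */
static uint8_t* demo_alloc_buggy(demo_block_t* blk) {
    blk->bytes[0] = DEMO_HEADER;
    return &blk->bytes[0];
}

/* What the application does: store a zero in its first data byte. */
static void demo_user_write(uint8_t* user) {
    assert(user != NULL);
    user[0] = 0x00;
}

/* The header as the allocator would read it on the next pop. */
static uint8_t demo_read_header(const demo_block_t* blk) {
    return blk->bytes[0];
}
```

After the correct path the header still reads 0xa2; after the buggy path the application's first store lands on the header byte and it reads 0x00, matching the `header=0x00 expected=0xa2` log line.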
Code Analysis
Allocation Paths (USER Conversion) ✅ CORRECT
File: /mnt/workdisk/public_share/hakmem/core/tiny_region_id.h:46
static inline void* tiny_region_id_write_header(void* base, int class_idx) {
    if (!base) return base;
    if (__builtin_expect(class_idx == 7, 0)) {
        return base;  // C7: headerless
    }
    // Write header at BASE
    uint8_t* header_ptr = (uint8_t*)base;
    *header_ptr = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
    void* user = header_ptr + 1;  // ✅ Convert BASE → USER
    return user;                  // ✅ CORRECT: Returns USER pointer
}
Usage: All HAK_RET_ALLOC(class_idx, ptr) calls use this function, which correctly returns USER pointers.
Free Paths (BASE Conversion) - MIXED RESULTS
Path 1: Ultra-Simple Free ✅ CORRECT
File: /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_free.inc:383
void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);  // ✅ Convert USER → BASE
if (tls_sll_push(class_idx, base, (uint32_t)sll_cap)) {
    return;  // Success
}
Status: ✅ CORRECT - Converts USER → BASE before push
Path 2: Freelist Drain ❓ SUSPICIOUS
File: /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_free.inc:75
static inline void tiny_drain_freelist_to_sll_once(SuperSlab* ss, int slab_idx, int class_idx) {
    // ...
    while (m->freelist && moved < budget) {
        void* p = m->freelist;  // ← What is this? BASE or USER?
        // ...
        if (tls_sll_push(class_idx, p, sll_capacity)) {  // ← Pushing p directly
            moved++;
        }
    }
}
Question: Is m->freelist stored as BASE or USER?
Answer: Freelist stores pointers at offset 0 (header location for header classes), so m->freelist contains BASE pointers. This is CORRECT.
Path 3: Fast Free ❓ NEEDS INVESTIGATION
File: /mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h
Need to check if fast free path converts USER → BASE.
Next Steps: Find the Buggy Path
Step 1: Check Fast Free Path
grep -A 10 -B 5 "tls_sll_push" core/tiny_free_fast_v2.inc.h
Look for paths that pass ptr directly to tls_sll_push without USER → BASE conversion.
Step 2: Check All Free Wrappers
grep -rn "void.*free.*void.*ptr" core/ | grep -v "\.o:"
Check all free entry points to ensure USER → BASE conversion.
Step 3: Add Validation to tls_sll_push
Temporarily add an address parity check to tls_sll_push:
// In tls_sll_box.h: tls_sll_push()
#if !HAKMEM_BUILD_RELEASE
if (class_idx != 7) {
    // For header classes, ptr should be BASE (even address for 32B blocks)
    // USER pointers would be BASE+1 (odd addresses for 32B blocks)
    uintptr_t addr = (uintptr_t)ptr;
    if ((addr & 1) != 0) {  // ODD address = USER pointer!
        extern _Atomic uint64_t malloc_count;
        uint64_t call = atomic_load(&malloc_count);
        // Cast so the format specifier is portable (uint64_t is not always unsigned long)
        fprintf(stderr, "[TLS_SLL_PUSH_BUG] call=%llu cls=%d ptr=%p is ODD (USER pointer!)\n",
                (unsigned long long)call, class_idx, ptr);
        fprintf(stderr, "[TLS_SLL_PUSH_BUG] Caller passed USER instead of BASE!\n");
        fflush(stderr);
        abort();
    }
}
#endif
This will catch USER pointers immediately at injection point!
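The parity test relies on every BASE address in the class being even. If blocks really are 33 bytes apart, as the Executive Summary states, consecutive BASE addresses would alternate between even and odd, so a check relative to the slab's first block may be more robust. A minimal sketch, assuming the slab base and stride are available at the push site; `ptr_is_block_base` is a hypothetical helper, not an existing hakmem function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when ptr sits exactly on a block boundary
 * of a slab whose first block starts at slab_base and whose blocks are
 * stride bytes apart. A USER pointer (BASE + 1) fails this test
 * regardless of whether the address is odd or even. */
static bool ptr_is_block_base(const void* slab_base, size_t stride, const void* ptr) {
    assert(stride != 0);
    uintptr_t base = (uintptr_t)slab_base;
    uintptr_t addr = (uintptr_t)ptr;
    if (addr < base) {
        return false;  /* below the slab: cannot be one of its blocks */
    }
    return ((addr - base) % stride) == 0;
}
```

In a debug build, the `(addr & 1) != 0` condition could be swapped for `!ptr_is_block_base(...)` while keeping the same logging and `abort()`.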
Step 4: Run Test
./build.sh bench_random_mixed_hakmem
timeout 60s ./out/release/bench_random_mixed_hakmem 10000 256 42 2>&1 | tee user_ptr_catch.log
Expected: Immediate abort. The resulting core dump (or a run under gdb) then gives a backtrace showing which path is passing USER pointers.
Hypothesis
Based on the evidence, the bug is likely in:
- Fast free path that doesn't convert USER → BASE before tls_sll_push
- Some wrapper around hakmem_free() that pre-converts USER → BASE incorrectly
- Some refill/drain path that accidentally uses USER pointers from freelist
Most Likely: Fast free path optimization that skips USER → BASE conversion for performance.
Verification Plan
- Add ODD address validation to tls_sll_push (debug builds only)
- Run 10K iteration test
- Catch USER pointer injection with backtrace
- Fix the specific path
- Re-test with 100K iterations
- Remove validation (keep in comments for future debugging)
Expected Fix
Once we identify the buggy path, the fix will be a 1-liner:
// BEFORE (BUG):
tls_sll_push(class_idx, user_ptr, ...); // ← Passing USER!
// AFTER (FIX):
void* base = PTR_USER_TO_BASE(user_ptr, class_idx); // ✅ Convert to BASE
tls_sll_push(class_idx, base, ...);
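PTR_USER_TO_BASE is used above without a definition. The sketch below shows one way such a conversion pair could look, assuming the 1-byte header and the headerless class 7 seen in tiny_region_id_write_header; the `demo_*` names are hypothetical and the real macro may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HEADERLESS_CLASS 7  /* C7 carries no header byte */

/* Hypothetical USER → BASE conversion: step back over the header. */
static inline void* demo_user_to_base(void* user, int class_idx) {
    assert(user != NULL);
    return (class_idx == DEMO_HEADERLESS_CLASS)
               ? user
               : (void*)((uint8_t*)user - 1);
}

/* Hypothetical BASE → USER conversion: skip the header. */
static inline void* demo_base_to_user(void* base, int class_idx) {
    assert(base != NULL);
    return (class_idx == DEMO_HEADERLESS_CLASS)
               ? base
               : (void*)((uint8_t*)base + 1);
}
```

Keeping both directions in one place makes the failure modes easy to see: applying `demo_base_to_user` twice to a header class yields BASE+2, the double offset described in the commit message, while the two helpers applied in sequence return the original pointer.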
Status
- ✅ Root cause identified (USER/BASE mismatch)
- ✅ Evidence collected (logs showing ODD/EVEN addresses)
- ✅ Mechanism understood (user overwrites header when given BASE)
- ⏳ Specific buggy path: TO BE IDENTIFIED (next step)
- ⏳ Fix: TO BE APPLIED (1-line change)
- ⏳ Verification: TO BE DONE (100K test)