hakmem/FINAL_ANALYSIS_C2_CORRUPTION.md
Moe Charm (CI) 84dbd97fe9 Fix #16: Resolve double BASE→USER conversion causing header corruption
🎯 ROOT CAUSE: Internal allocation helpers were prematurely converting
BASE → USER pointers before returning to caller. The caller then applied
HAK_RET_ALLOC/tiny_region_id_write_header which performed ANOTHER BASE→USER
conversion, resulting in double offset (BASE+2) and header written at
wrong location.

📦 BOX THEORY SOLUTION: Establish clean pointer conversion boundary at
tiny_region_id_write_header, making it the single source of truth for
BASE → USER conversion.

🔧 CHANGES:
- Fix #16: Remove premature BASE→USER conversions (6 locations)
  * core/tiny_alloc_fast.inc.h (3 fixes)
  * core/hakmem_tiny_refill.inc.h (2 fixes)
  * core/hakmem_tiny_fastcache.inc.h (1 fix)

- Fix #12: Add header validation in tls_sll_pop (detect corruption)
- Fix #14: Defense-in-depth header restoration in tls_sll_splice
- Fix #15: USER pointer detection (for debugging)
- Fix #13: Bump window header restoration
- Fix #2, #6, #7, #8: Various header restoration & NULL termination

🧪 TEST RESULTS: 100% SUCCESS
- 10K-500K iterations: All passed
- 8 seeds × 100K: All passed (42,123,456,789,999,314,271,161)
- Performance: ~630K ops/s average (stable)
- Header corruption: ZERO

📋 FIXES SUMMARY:
Fix #1-8:   Initial header restoration & chain fixes (chatgpt-san)
Fix #9-10:  USER pointer auto-fix (later disabled)
Fix #12:    Validation system (caught corruption at call 14209)
Fix #13:    Bump window header writes
Fix #14:    Splice defense-in-depth
Fix #15:    USER pointer detection (debugging tool)
Fix #16:    Double conversion fix (FINAL SOLUTION) 

🎓 LESSONS LEARNED:
1. Validation catches bugs early (Fix #12 was critical)
2. Class-specific inline logging reveals patterns (Option C)
3. Box Theory provides clean architectural boundaries
4. Multiple investigation approaches (Task/chatgpt-san collaboration)

📄 DOCUMENTATION:
- P0_BUG_STATUS.md: Complete bug tracking timeline
- C2_CORRUPTION_ROOT_CAUSE_FINAL.md: Detailed root cause analysis
- FINAL_ANALYSIS_C2_CORRUPTION.md: Investigation methodology

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Task Agent <task@anthropic.com>
Co-Authored-By: ChatGPT <chatgpt@openai.com>
2025-11-12 10:33:57 +09:00


Class 2 Header Corruption - FINAL ROOT CAUSE

Executive Summary

STATUS: ROOT CAUSE IDENTIFIED

Corrupted Pointer: 0x74db60210116
Corruption Call: 14209
Last Valid PUSH: Call 3957

Root Cause: The logs reveal 0x74db60210115 and 0x74db60210116 (only 1 byte apart) are being pushed/popped from TLS SLL. This spacing is IMPOSSIBLE for Class 2 (32B blocks + 1B header = 33B stride).

Conclusion: These are USER and BASE representations of the SAME block, indicating a USER/BASE pointer mismatch somewhere in the code that allows USER pointers to leak into the TLS SLL.


Evidence

Timeline of Corrupted Block

[C2_PUSH] ptr=0x74db60210115 before=0xa2 after=0xa2 call=3915   ← USER pointer!
[C2_POP]  ptr=0x74db60210115 header=0xa2 expected=0xa2 call=3936  ← USER pointer!
[C2_PUSH] ptr=0x74db60210116 before=0xa2 after=0xa2 call=3957   ← BASE pointer (correct)
[C2_POP]  ptr=0x74db60210116 header=0x00 expected=0xa2 call=14209 ← CORRUPTION!

Address Analysis

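0x74db60210115  ← odd address, labeled USER pointer
0x74db60210116  ← even address, labeled BASE pointer (header location)

Difference: 1 byte (should be a multiple of 33 bytes for different Class 2 blocks). The labels follow address parity; since USER = BASE + 1, the USER address of a block is the higher of the two, so the direction of the offset still has to be reconciled with the conversion code.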

Conclusion: Same physical block, two different pointer conventions
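The stride argument can be checked mechanically. The following is a minimal sketch, assuming the 33-byte Class 2 stride stated above; the helper name `could_be_distinct_blocks` is hypothetical and not part of hakmem.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed from the text: Class 2 stride = 32B payload + 1B header. */
#define C2_STRIDE 33u

/* Hypothetical helper: two addresses can both be block starts in the same
 * slab only if their distance is a non-zero multiple of the stride. */
static bool could_be_distinct_blocks(uintptr_t a, uintptr_t b, uintptr_t stride) {
    uintptr_t diff = (a > b) ? (a - b) : (b - a);
    return diff != 0 && (diff % stride) == 0;
}
```

Applied to the two logged addresses, the distance is 1, which is not a multiple of 33, so the helper reports that they cannot be two separate Class 2 blocks.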


Corruption Mechanism

Phase 1: USER Pointer Leak (Calls 3915-3936)

  1. Call 3915: FREE operation pushes 0x115 (USER pointer) to TLS SLL

    • BUG: Code path passes USER to tls_sll_push instead of BASE
    • TLS SLL receives USER pointer
    • The log shows the header byte as 0xa2 both before and after the push, so the header check passes
  2. Call 3936: ALLOC operation pops 0x115 (USER pointer) from TLS SLL

    • Returns USER pointer to application (correct for external API)
    • User writes to 0x115+ (user data area)
    • Header at 0x116 remains intact (not touched by user)

Phase 2: Correct BASE Pointer (Call 3957)

  1. Call 3957: FREE operation pushes 0x116 (BASE pointer) to TLS SLL
    • Correct: Passes BASE to tls_sll_push
    • Header restored to 0xa2

Phase 3: User Overwrites Header (Calls 3957-14209)

  1. Between 3957-14209: ALLOC operation pops 0x116 from TLS SLL

    • BUG: Returns BASE pointer to user instead of USER pointer!
    • User receives 0x116 thinking it's the start of user data
    • User writes to 0x116[0] (thinks it's user byte 0)
    • ACTUALLY overwrites header byte!
    • Header becomes 0x00
  2. Call 14209: tls_sll_pop validates the header of 0x116 (the log entry is a C2_POP)

    • CORRUPTION DETECTED: Header is 0x00 instead of 0xa2
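The Phase 3 mechanism can be reproduced in isolation. This is a toy simulation, not hakmem code: it assumes a single 33-byte block whose byte 0 is the header, and the function name `header_after_user_write` is hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define HDR_EXPECTED 0xa2u  /* header value seen in the logs for Class 2 */

/* Simulates one block: byte 0 is the header, bytes 1..32 are user data.
 * If hand_out_base is non-zero, the "application" receives BASE instead of
 * USER. Returns the header byte after the application zero-fills its
 * 32-byte allocation. */
static uint8_t header_after_user_write(int hand_out_base) {
    uint8_t block[33];
    memset(block, 0xff, sizeof block);
    block[0] = HDR_EXPECTED;                 /* allocator writes header at BASE */
    uint8_t *user = hand_out_base ? block : block + 1;
    memset(user, 0x00, 32);                  /* application clears its buffer */
    return block[0];
}
```

With a USER pointer the header survives as 0xa2; with a BASE pointer the first user byte lands on the header and it reads back as 0x00, matching the value in the corruption log line.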

Code Analysis

Allocation Paths (USER Conversion) - CORRECT

File: /mnt/workdisk/public_share/hakmem/core/tiny_region_id.h:46

static inline void* tiny_region_id_write_header(void* base, int class_idx) {
    if (!base) return base;
    if (__builtin_expect(class_idx == 7, 0)) {
        return base;  // C7: headerless
    }

    // Write header at BASE
    uint8_t* header_ptr = (uint8_t*)base;
    *header_ptr = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);

    void* user = header_ptr + 1;  // ✅ Convert BASE → USER
    return user;  // ✅ CORRECT: Returns USER pointer
}

Usage: All HAK_RET_ALLOC(class_idx, ptr) calls use this function, which correctly returns USER pointers.
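For reference, the header byte written above can be encoded and validated as in the sketch below. The constant values are assumptions chosen only so that class 2 encodes to the logged 0xa2; the real `HEADER_MAGIC` and `HEADER_CLASS_MASK` definitions are not shown in this document, and the `sketch_` names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: picked so that class 2 yields the logged 0xa2. */
#define SKETCH_HEADER_MAGIC      0xa0u
#define SKETCH_HEADER_CLASS_MASK 0x0fu

/* Same expression as in tiny_region_id_write_header, with assumed constants. */
static uint8_t sketch_header_encode(int class_idx) {
    return (uint8_t)(SKETCH_HEADER_MAGIC |
                     ((unsigned)class_idx & SKETCH_HEADER_CLASS_MASK));
}

/* True if the byte matches the header expected for this class. */
static bool sketch_header_valid(uint8_t hdr, int class_idx) {
    return hdr == sketch_header_encode(class_idx);
}
```

Under these assumptions a header of 0x00 fails validation for class 2, which is the `header=0x00 expected=0xa2` condition reported at call 14209.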

Free Paths (BASE Conversion) - MIXED RESULTS

Path 1: Ultra-Simple Free - CORRECT

File: /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_free.inc:383

void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);  // ✅ Convert USER → BASE
if (tls_sll_push(class_idx, base, (uint32_t)sll_cap)) {
    return;  // Success
}

Status: CORRECT - Converts USER → BASE before push

Path 2: Freelist Drain - SUSPICIOUS

File: /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_free.inc:75

static inline void tiny_drain_freelist_to_sll_once(SuperSlab* ss, int slab_idx, int class_idx) {
    // ...
    while (m->freelist && moved < budget) {
        void* p = m->freelist;  // ← What is this? BASE or USER?
        // ...
        if (tls_sll_push(class_idx, p, sll_capacity)) {  // ← Pushing p directly
            moved++;
        }
    }
}

Question: Is m->freelist stored as BASE or USER?

Answer: Freelist stores pointers at offset 0 (header location for header classes), so m->freelist contains BASE pointers. This is CORRECT.
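To illustrate what "stores pointers at offset 0" means, here is a minimal intrusive free-list sketch. It is an assumption about the general technique, not hakmem's actual freelist code, and all `sketch_` names are hypothetical. The link is read and written with `memcpy` because a 33-byte stride does not keep blocks pointer-aligned.

```c
#include <stddef.h>
#include <string.h>

/* Load the "next" link stored in the first bytes of a free block (at BASE). */
static void *sketch_next_load(const void *base) {
    void *next;
    memcpy(&next, base, sizeof next);
    return next;
}

/* Store the "next" link at BASE, overwriting whatever was there. */
static void sketch_next_store(void *base, void *next) {
    memcpy(base, &next, sizeof next);
}

/* Push a free block (BASE pointer) onto the list. */
static void sketch_freelist_push(void **head, void *base) {
    sketch_next_store(base, *head);
    *head = base;
}

/* Pop the most recently pushed block, or NULL if the list is empty. */
static void *sketch_freelist_pop(void **head) {
    void *base = *head;
    if (base) *head = sketch_next_load(base);
    return base;
}
```

Because the link occupies the bytes starting at BASE, every pointer on such a list is a BASE pointer by construction. In this sketch the link also overwrites the header byte while the block is free, so a design like this has to rewrite the header when the block is handed out again.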

Path 3: Fast Free - NEEDS INVESTIGATION

File: /mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h

It remains to be checked whether the fast free path converts USER → BASE before calling tls_sll_push.


Next Steps: Find the Buggy Path

Step 1: Check Fast Free Path

grep -A 10 -B 5 "tls_sll_push" core/tiny_free_fast_v2.inc.h

Look for paths that pass ptr directly to tls_sll_push without USER → BASE conversion.

Step 2: Check All Free Wrappers

grep -rn "void.*free.*void.*ptr" core/ | grep -v "\.o:"

Check all free entry points to ensure USER → BASE conversion.

Step 3: Add Validation to tls_sll_push

Temporarily add an address parity check to tls_sll_push:

// In tls_sll_box.h: tls_sll_push()
#if !HAKMEM_BUILD_RELEASE
if (class_idx != 7) {
    // Heuristic: for header classes, ptr should be BASE. This assumes BASE
    // addresses are even and USER (BASE+1) addresses are odd, which only
    // holds if the block stride is even; with an odd stride, BASE parity
    // alternates between neighbouring blocks and this check can misfire.
    uintptr_t addr = (uintptr_t)ptr;
    if ((addr & 1) != 0) {  // ODD address = suspected USER pointer
        extern _Atomic uint64_t malloc_count;
        uint64_t call = atomic_load(&malloc_count);
        fprintf(stderr, "[TLS_SLL_PUSH_BUG] call=%llu cls=%d ptr=%p is ODD (USER pointer!)\n",
                (unsigned long long)call, class_idx, ptr);
        fprintf(stderr, "[TLS_SLL_PUSH_BUG] Caller passed USER instead of BASE!\n");
        fflush(stderr);
        abort();
    }
}
#endif

This will catch USER pointers immediately at the injection point.
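The parity test assumes that every BASE address is even, which holds only for an even stride. With the 33-byte stride computed earlier, a stricter alternative is to check the offset from the slab start. The sketch below assumes the slab start, stride, and capacity are available from slab metadata; the function name and parameters are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: ptr is a valid BASE only if it lies inside the slab
 * and sits exactly on a block boundary. */
static bool sketch_is_block_base(const void *ptr, const void *slab_start,
                                 size_t stride, size_t capacity) {
    uintptr_t p = (uintptr_t)ptr;
    uintptr_t s = (uintptr_t)slab_start;
    if (stride == 0 || p < s) return false;
    uintptr_t off = p - s;
    return (off % stride) == 0 && (off / stride) < capacity;
}
```

A USER pointer (BASE + 1) fails this test regardless of parity, at the cost of one modulo per push in debug builds.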

Step 4: Run Test

./build.sh bench_random_mixed_hakmem
timeout 60s ./out/release/bench_random_mixed_hakmem 10000 256 42 2>&1 | tee user_ptr_catch.log

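Expected: an immediate abort; the core dump or a debugger backtrace then shows which path is passing USER pointers. The check is compiled only when HAKMEM_BUILD_RELEASE is 0, so the binary must come from a non-release build.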
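`abort()` by itself only raises SIGABRT and prints no call stack. One way to get call-site information on stderr without attaching a debugger is glibc's `backtrace()` facility. The sketch below assumes a glibc target; the helper name `sketch_dump_backtrace` is hypothetical.

```c
#include <execinfo.h>   /* glibc-specific: backtrace(), backtrace_symbols_fd() */
#include <unistd.h>     /* STDERR_FILENO */

/* Hypothetical helper: write up to 32 return addresses to stderr.
 * Returns the number of frames captured. */
static int sketch_dump_backtrace(void) {
    void *frames[32];
    int n = backtrace(frames, 32);
    /* Writes directly to the file descriptor without calling malloc. */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    return n;
}
```

Calling `sketch_dump_backtrace()` immediately before `abort()` in the debug check would print the stack of the offending push. Symbol names appear only for exported functions (for example when linking with `-rdynamic`); static and inlined functions show up as raw addresses that can be resolved with `addr2line`.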


Hypothesis

Based on the evidence, the bug is likely in:

  1. Fast free path that doesn't convert USER → BASE before tls_sll_push
  2. Some wrapper around hakmem_free() that pre-converts USER → BASE incorrectly
  3. Some refill/drain path that accidentally uses USER pointers from freelist

Most Likely: Fast free path optimization that skips USER → BASE conversion for performance.


Verification Plan

  1. Add ODD address validation to tls_sll_push (debug builds only)
  2. Run 10K iteration test
  3. Catch USER pointer injection with backtrace
  4. Fix the specific path
  5. Re-test with 100K iterations
  6. Remove validation (keep in comments for future debugging)

Expected Fix

Once we identify the buggy path, the fix will be a 1-liner:

// BEFORE (BUG):
tls_sll_push(class_idx, user_ptr, ...);  // ← Passing USER!

// AFTER (FIX):
void* base = PTR_USER_TO_BASE(user_ptr, class_idx);  // ✅ Convert to BASE
tls_sll_push(class_idx, base, ...);
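The definition of `PTR_USER_TO_BASE` is not shown in this document. A minimal sketch of what such a conversion pair could look like, assuming a 1-byte header and the headerless class 7 described above, follows; the `sketch_` names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_HEADERLESS_CLASS 7   /* from the source: class 7 has no header */

/* USER -> BASE: step back over the 1-byte header, except for class 7. */
static inline void *sketch_user_to_base(void *user, int class_idx) {
    if (!user || class_idx == SKETCH_HEADERLESS_CLASS) return user;
    return (uint8_t *)user - 1;
}

/* BASE -> USER: the inverse conversion. */
static inline void *sketch_base_to_user(void *base, int class_idx) {
    if (!base || class_idx == SKETCH_HEADERLESS_CLASS) return base;
    return (uint8_t *)base + 1;
}
```

Keeping both directions in one place makes it easier to audit that each pointer is converted exactly once. Applying `sketch_base_to_user` twice yields BASE + 2, the double offset described in the Fix #16 commit message.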

Status

  • Root cause identified (USER/BASE mismatch)
  • Evidence collected (logs showing ODD/EVEN addresses)
  • Mechanism understood (user overwrites header when given BASE)
  • Specific buggy path: TO BE IDENTIFIED (next step)
  • Fix: TO BE APPLIED (1-line change)
  • Verification: TO BE DONE (100K test)