hakmem/FREELIST_CORRUPTION_ROOT_CAUSE.md
Moe Charm (CI) b8ed2b05b4 Phase 6-2.6: Fix slab_data_start() consistency in refill/validation paths
Problem:
- Phase 6-2.5 changed SUPERSLAB_SLAB0_DATA_OFFSET from 1024 → 2048
- Fixed sizeof(SuperSlab) mismatch (1088 bytes)
- But 3 locations still used old slab_data_start() + manual offset

This caused:
- Address mismatch between allocation carving and validation
- Freelist corruption false positives
- 53-byte misalignment errors resolved, but new errors appeared

Changes:
1. core/tiny_tls_guard.h:34
   - Validation: slab_data_start() → tiny_slab_base_for()
   - Ensures validation uses same base address as allocation

2. core/hakmem_tiny_refill.inc.h:222
   - Allocation carving: Remove manual +2048 hack
   - Use canonical tiny_slab_base_for()

3. core/hakmem_tiny_refill.inc.h:275
   - Bump allocation: Remove duplicate slab_start calculation
   - Use existing base calculation with tiny_slab_base_for()

Result:
- Consistent use of tiny_slab_base_for() across all paths
- All code uses SUPERSLAB_SLAB0_DATA_OFFSET constant
- Remaining freelist corruption needs deeper investigation (not simple offset bug)

Related commits:
- d2f0d8458: Phase 6-2.5 (constants.h + 2048 offset)
- c9053a43a: Phase 6-2.3~6-2.4 (active counter + SEGV fixes)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 22:34:24 +09:00


FREELIST CORRUPTION ROOT CAUSE ANALYSIS

Phase 6-2.5 SLAB0_DATA_OFFSET Investigation

Executive Summary

The freelist corruption observed after changing SLAB0_DATA_OFFSET from 1024 to 2048 is NOT caused by the offset change. The root cause is a use-after-free in the remote free queue (ss_remote_push), triggered by double-frees: the test run logs 180+ duplicate freed addresses.

Timeline

  • Initial symptom: [TRC_FAILFAST] stage=freelist_next cls=7 node=0x7e1ff3c1d474
  • Investigation started: After Phase 6-2.5 offset change
  • Root cause found: Use-after-free in ss_remote_push + double-frees

Root Cause Analysis

1. Double-Free Epidemic

# Test reveals 180+ duplicate freed addresses
HAKMEM_WRAP_TINY=1 ./larson_hakmem 1 1 1024 1024 1 12345 1 | \
  grep "free_local_box" | awk '{print $6}' | sort | uniq -d | wc -l
# Result: 180+ duplicates
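
A repeated address in the free log is only a double free if the block was not allocated again in between, so the duplicate count above is an upper bound. The following is a minimal sketch of a stricter check, assuming a hypothetical in-memory event log that records both allocations and frees. The `ev` type and `count_double_frees` are illustrative names and are not part of HAKMEM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event record: one entry per logged alloc or free. */
typedef enum { EV_ALLOC, EV_FREE } ev_kind;
typedef struct { ev_kind kind; uint64_t addr; } ev;

/* Counts frees of addresses that were not live at the time of the free.
 * Quadratic backward scan, chosen for clarity rather than speed. */
static size_t count_double_frees(const ev *log, size_t n) {
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (log[i].kind != EV_FREE) continue;
        /* Assumption: an address first seen as a free was allocated
         * before logging began, so it starts out live. */
        bool live = true;
        for (size_t j = i; j-- > 0; ) {
            if (log[j].addr == log[i].addr) {
                /* Most recent earlier event for this address decides. */
                live = (log[j].kind == EV_ALLOC);
                break;
            }
        }
        if (!live) bad++;
    }
    return bad;
}
```

Two consecutive frees of the same address count as one double free, while free, alloc, free of the same address counts as none.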

2. Use-After-Free Vulnerability

Location: /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.h:437

static inline int ss_remote_push(SuperSlab* ss, int slab_idx, void* ptr) {
    // ... validation ...
    do {
        old = atomic_load_explicit(head, memory_order_acquire);
        if (!g_remote_side_enable) {
            *(void**)ptr = (void*)old;  // ← WRITES TO POTENTIALLY ALLOCATED MEMORY!
        }
    } while (!atomic_compare_exchange_weak_explicit(...));
}

3. The Attack Sequence

  1. Thread A frees block X → X is pushed to the remote queue (next pointer written into X)
  2. Thread B (owner) drains the remote queue → X is added to the freelist
  3. Thread B allocates X → the application starts using it
  4. Thread C frees X a second time → ss_remote_push writes a next pointer into live user memory
  5. The application keeps writing to X, overwriting that next pointer with its own data (including the 0x6261 pattern)
  6. Freelist traversal interprets the user data as a next pointer → CRASH
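
The sequence can be replayed deterministically in a single thread. The sketch below uses a hypothetical toy freelist (`toy_freelist`, `toy_push`, `toy_pop`), not the HAKMEM data structures. It only assumes what the steps describe: the next pointer is stored in the first word of a freed block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy freelist: the next pointer lives in the first
 * word of each free block (in-band), as in the steps above. */
typedef struct { void *head; } toy_freelist;

static void toy_push(toy_freelist *fl, void *block) {
    memcpy(block, &fl->head, sizeof(void *)); /* overwrites block's first word */
    fl->head = block;
}

static void *toy_pop(toy_freelist *fl) {
    void *block = fl->head;
    if (block != NULL) memcpy(&fl->head, block, sizeof(void *));
    return block;
}

/* Replays the sequence and returns the bogus "next" value the
 * allocator ends up holding as its freelist head. */
static uintptr_t toy_replay_double_free(void) {
    static uintptr_t storage[8];          /* stands in for block X */
    toy_freelist fl = { NULL };

    toy_push(&fl, storage);               /* steps 1-2: X freed, on freelist */
    void *user = toy_pop(&fl);            /* step 3: X handed to the application */
    memset(user, 0, sizeof storage);

    toy_push(&fl, user);                  /* step 4: second free of live X */

    uintptr_t data = 0x6261;              /* step 5: application writes its data */
    memcpy(user, &data, sizeof data);

    (void)toy_pop(&fl);                   /* step 6: next pointer read from user data */
    return (uintptr_t)fl.head;
}
```

On platforms where pointers and `uintptr_t` share a representation, `toy_replay_double_free()` returns 0x6261, the same kind of non-address value that appears as `node=0x6261` in the fail-fast output.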

Evidence

Corrupted Pointers

  • 0x7c1b4a606261 - User data ending with 0x6261 pattern
  • 0x6261 - Pure user data, no valid address
  • Pattern 0x6261 detected as "TLS guard scribble" in code

Debug Output

[TRC_FREELIST_LOG] stage=free_local_box cls=7 node=0x7da27ec0b800 next=0x7da27ec0bc00
[TRC_FREELIST_LOG] stage=free_local_box cls=7 node=0x7da27ec0b800 next=0x7da27ec04000
                                                     ^^^^^^^^^^^ SAME ADDRESS FREED TWICE!

Remote Queue Activity

[DEBUG ss_remote_push] Call #1 ss=0x735d23e00000 slab_idx=0
[DEBUG ss_remote_push] Call #2 ss=0x735d23e00000 slab_idx=5
[TRC_FAILFAST] stage=freelist_next cls=7 node=0x6261

Why SLAB0_DATA_OFFSET Change Exposed This

The offset change from 1024 to 2048 did not cause the bug, which already existed but was latent. The change may have:

  1. Changed memory layout/timing
  2. Made corruption more visible
  3. Affected which blocks get double-freed

Attempted Mitigations

1. Enable Safe Free (COMPLETED)

// core/hakmem_tiny.c:39
int g_tiny_safe_free = 1;  // ULTRATHINK FIX: Enable by default

Result: Still crashes - race condition persists

2. Required Fixes (PENDING)

  • Add ownership validation before writing next pointer
  • Implement proper memory barriers
  • Add atomic state tracking for blocks
  • Consider hazard pointers or epoch-based reclamation
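
As an illustration of the "atomic state tracking" item, here is a minimal sketch using C11 atomics. It assumes one state byte per block in a side table. The names `blk_states`, `blk_mark_allocated`, and `blk_try_free` are hypothetical and do not exist in HAKMEM. A free only proceeds if it wins the ALLOCATED → FREE transition, so a second free of the same block is rejected before any next pointer is written.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BLK_COUNT 64                       /* illustrative table size */
enum { BLK_ALLOCATED = 0, BLK_FREE = 1 };

/* Hypothetical side table: one atomic state byte per block. */
typedef struct { atomic_uchar state[BLK_COUNT]; } blk_states;

static void blk_states_init(blk_states *t) {
    for (size_t i = 0; i < BLK_COUNT; i++)
        atomic_init(&t->state[i], BLK_FREE);
}

/* Allocation path: FREE -> ALLOCATED. Fails if the block is in use. */
static bool blk_mark_allocated(blk_states *t, size_t idx) {
    unsigned char expected = BLK_FREE;
    return atomic_compare_exchange_strong_explicit(
        &t->state[idx], &expected, BLK_ALLOCATED,
        memory_order_acq_rel, memory_order_acquire);
}

/* Free path: ALLOCATED -> FREE. Returns false on a double free, in
 * which case the caller must not touch the block's memory. */
static bool blk_try_free(blk_states *t, size_t idx) {
    unsigned char expected = BLK_ALLOCATED;
    return atomic_compare_exchange_strong_explicit(
        &t->state[idx], &expected, BLK_FREE,
        memory_order_acq_rel, memory_order_acquire);
}
```

This catches a second free only while the block is still free. A stale free that arrives after the block has been reallocated would still succeed, which is where hazard pointers or epoch-based reclamation would come in.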

Reproduction

# Immediate crash with SuperSlab enabled
HAKMEM_WRAP_TINY=1 ./larson_hakmem 1 1 1024 1024 1 12345 1

# Works fine without SuperSlab
HAKMEM_WRAP_TINY=0 ./larson_hakmem 1 1 1024 1024 1 12345 1

Recommendations

  1. IMMEDIATE: Do not use in production
  2. SHORT-TERM: Disable remote free queue (HAKMEM_TINY_DISABLE_REMOTE=1)
  3. LONG-TERM: Redesign lock-free MPSC with safe memory reclamation

Technical Details

Memory Layout (Class 7, 1024-byte blocks)

SuperSlab base: 0x7c1b4a600000
Slab 0 start:   0x7c1b4a600000 + 2048 = 0x7c1b4a600800
Block 0:        0x7c1b4a600800
Block 1:        0x7c1b4a600c00
Block 42:       0x7c1b4a60b000 (offset 43008 from slab 0 start)
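
The addresses above follow from base + offset + index × block size. Here is a small sketch of that arithmetic, using the 2048-byte offset and 1024-byte class 7 block size quoted in this document. The macro and function names are illustrative stand-ins, not the actual HAKMEM definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLAB0_DATA_OFFSET 2048u  /* offset quoted in this document */
#define CLASS7_BLOCK_SIZE 1024u  /* class 7 block size quoted in this document */

/* Hypothetical helper: address of block `idx` in slab 0 of the
 * SuperSlab that starts at `ss_base`. */
static uint64_t class7_block_addr(uint64_t ss_base, size_t idx) {
    return ss_base + SLAB0_DATA_OFFSET + (uint64_t)idx * CLASS7_BLOCK_SIZE;
}
```

For block 42 this gives 0x7c1b4a600800 + 42 × 1024 = 0x7c1b4a600800 + 0xa800 = 0x7c1b4a60b000, matching the table.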

Validation Points

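  • Offset 2048 is correct (a multiple of the 1024-byte block size, so blocks stay 1024-byte aligned)
  • sizeof(SuperSlab) = 1088 exceeds the old 1024-byte offset, so slab 0 data must start at the next 1024-byte boundary, 2048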
  • All legitimate blocks ARE properly aligned
  • Corruption comes from use-after-free, not misalignment
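
These points can be checked mechanically. The sketch below assumes the same 2048-byte offset and 1024-byte block size, and uses hypothetical helper names. It does not check an upper bound, because the slab size is not stated in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SLAB0_DATA_OFFSET 2048u  /* offset quoted in this document */
#define CLASS7_BLOCK_SIZE 1024u  /* class 7 block size quoted in this document */

/* Rounds n up to the next multiple of align (align > 0). */
static uint64_t round_up(uint64_t n, uint64_t align) {
    return (n + align - 1) / align * align;
}

/* Hypothetical check: does `ptr` point at the start of a class 7
 * block in slab 0 of the SuperSlab at `ss_base`? Lower bound only. */
static bool class7_ptr_is_block_start(uint64_t ss_base, uint64_t ptr) {
    if (ptr < ss_base + SLAB0_DATA_OFFSET) return false;
    return (ptr - ss_base - SLAB0_DATA_OFFSET) % CLASS7_BLOCK_SIZE == 0;
}
```

With these helpers, `round_up(1088, 1024)` yields 2048, block 42 at 0x7c1b4a60b000 passes the check, and the corrupted pointers 0x7c1b4a606261 and 0x6261 both fail it.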

Conclusion

The HAKMEM allocator has a critical memory safety bug in its lock-free remote free queue. The bug allows:

  • Use-after-free corruption
  • Double-free vulnerabilities
  • Memory corruption of active allocations

This is a SECURITY VULNERABILITY that could be exploited for arbitrary code execution.

Author

Claude Opus 4.1 (ULTRATHINK Mode)
Analysis Date: 2025-11-07