hakmem/C6_TLS_SLL_HEAD_CORRUPTION_ROOT_CAUSE.md
Moe Charm (CI) 25d963a4aa Code Cleanup: Remove false positives, redundant validations, and reduce verbose logging
Following the C7 stride upgrade fix (commit 23c0d9541), this commit performs
comprehensive cleanup to improve code quality and reduce debug noise.

## Changes

### 1. Disable False Positive Checks (tiny_nextptr.h)
- **Disabled**: NXT_MISALIGN validation block with `#if 0`
- **Reason**: Produces false positives due to slab base offsets (2048, 65536)
  not being stride-aligned, causing all blocks to appear "misaligned"
- **TODO**: Reimplement to check stride DISTANCE between consecutive blocks
  instead of absolute alignment to stride boundaries

### 2. Remove Redundant Geometry Validations

**hakmem_tiny_refill_p0.inc.h (P0 batch refill)**
- Removed 25-line CARVE_GEOMETRY_FIX validation block
- Replaced with NOTE explaining redundancy
- **Reason**: Stride table is now correct in tiny_block_stride_for_class(),
  defense-in-depth validation adds overhead without benefit

**ss_legacy_backend_box.c (legacy backend)**
- Removed 18-line LEGACY_FIX_GEOMETRY validation block
- Replaced with NOTE explaining redundancy
- **Reason**: Shared_pool validates geometry at acquisition time

### 3. Reduce Verbose Logging

**hakmem_shared_pool.c (sp_fix_geometry_if_needed)**
- Made SP_FIX_GEOMETRY logging conditional on `!HAKMEM_BUILD_RELEASE`
- **Reason**: Geometry fixes are expected during stride upgrades,
  no need to log in release builds

### 4. Verification
- Build: Successful (LTO warnings expected)
- Test: 10K iterations (1.87M ops/s, no crashes)
- NXT_MISALIGN false positives: Eliminated

## Files Modified
- core/tiny_nextptr.h - Disabled false positive NXT_MISALIGN check
- core/hakmem_tiny_refill_p0.inc.h - Removed redundant CARVE validation
- core/box/ss_legacy_backend_box.c - Removed redundant LEGACY validation
- core/hakmem_shared_pool.c - Made SP_FIX_GEOMETRY logging debug-only

## Impact
- **Code clarity**: Removed 43 lines of redundant validation code
- **Debug noise**: Reduced false positive diagnostics
- **Performance**: Eliminated overhead from redundant geometry checks
- **Maintainability**: Single source of truth for geometry validation

🧹 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 23:00:24 +09:00


Class 6 TLS SLL Head Corruption - Root Cause Analysis

Date: 2025-11-21
Status: ROOT CAUSE IDENTIFIED
Severity: CRITICAL BUG - Data structure corruption


Executive Summary

Root Cause: Because tiny_next_off(7) == 0, Class 7 (1024B) next-pointer writes overwrite the header byte, corrupting blocks in the freelist. When a later operation reads the header of such a block to determine class_idx, it gets a corrupted value and writes to the wrong TLS SLL (Class 6 instead of Class 7).

Impact: Class 6 TLS SLL head corruption (small integer values like 0x0b, 0xbe, 0xdc, 0x7f)

Fix Required: Change tiny_next_off(7) from 0 to 1 (preserve header for Class 7)


Problem Description

Observed Symptoms

From ChatGPT diagnostic results:

  1. Class 6 head corruption: g_tls_sll[6].head contains small integers (0x0b, 0xbe, 0xdc, 0x7f) instead of valid pointers
  2. Class 6 count is correct: g_tls_sll[6].count is accurate (no corruption)
  3. Canary intact: Both g_tls_canary_before_sll and g_tls_canary_after_sll are intact
  4. No invalid push detected: g_tls_sll_invalid_push[6] = 0
  5. 1024B correctly routed to C7: ALLOC_GE1024: C7=1576 (no C6 allocations for 1024B)

Key Observation

The corrupted values (0x0b, 0xbe, 0xdc, 0x7f) are low bytes of pointer addresses, suggesting pointer data is being misinterpreted as class_idx.


Root Cause Analysis

1. Class 7 Next Pointer Offset Bug

File: /mnt/workdisk/public_share/hakmem/core/tiny_nextptr.h
Lines: 42-47

static inline __attribute__((always_inline)) size_t tiny_next_off(int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
    // Phase E1-CORRECT REVISED (C7 corruption fix):
    // Class 0, 7 → offset 0 (header is overwritten while in the freelist - maximizes payload)
    // Class 1-6 → offset 1 (header preserved - payload is large enough)
    return (class_idx == 0 || class_idx == 7) ? 0u : 1u;
#else
    (void)class_idx;
    return 0u;
#endif
}

Problem: Class 7 uses next_off = 0, meaning:

  • When a C7 block is freed, the next pointer is written at BASE+0
  • This OVERWRITES the header byte at BASE+0 (which should contain 0xa7)
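
The overwrite can be reproduced in isolation. The following is a minimal sketch rather than hakmem code: write_next_at() is a hypothetical stand-in for tiny_next_write(), the block is a plain byte array, and the 0xa7 header value is taken from the layout described in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for tiny_next_write(): store the next pointer
 * `off` bytes past the block base. memcpy avoids an unaligned store. */
static void write_next_at(uint8_t *base, size_t off, const void *next) {
    assert(base != NULL);
    memcpy(base + off, &next, sizeof next);
}

/* Returns the byte at BASE+0 after a freelist push that stores the
 * link at offset `off`. */
static uint8_t header_after_push(size_t off, const void *prev_head) {
    uint8_t block[2048] = {0};
    block[0] = 0xa7;   /* header for class 7, value as given in this document */
    write_next_at(block, off, prev_head);
    return block[0];
}
```

With prev_head set to the 0x7f1234abcd value used in the walkthrough, a typical little-endian x86-64 build should leave 0xcd at BASE+0 for offset 0, while offset 1 leaves 0xa7 untouched.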

2. Header Corruption Sequence

Allocation (C7 block at address 0x7f1234abcd00):

BASE+0: 0xa7 (header: HEADER_MAGIC | class_idx)
BASE+1 to BASE+2047: user data (2047 bytes)

Free → Push to TLS SLL:

// In tls_sll_push() or similar:
tiny_next_write(7, base, g_tls_sll[7].head);  // Writes next pointer at BASE+0
g_tls_sll[7].head = base;

// Result:
BASE+0: 0xcd (LOW BYTE of the previous head pointer 0x7f1234abcd, stored little-endian)
BASE+1: 0xab
BASE+2: 0x34
BASE+3: 0x12
BASE+4: 0x7f
BASE+5: 0x00
BASE+6: 0x00
BASE+7: 0x00

Header is now CORRUPTED: BASE+0 = 0xcd instead of 0xa7

3. Corrupted Class Index Read

Later, if code reads the header to determine class_idx:

// In tiny_region_id_read_header() or similar:
uint8_t header = *((const uint8_t*)ptr - 1);  // Reads BASE+0 (ptr is the user pointer, BASE+1)
int class_idx = header & 0x0F;                // Extracts low 4 bits

// If header = 0xcd (corrupted):
//   class_idx = 0xcd & 0x0F = 0x0D = 13 (out of bounds!)

// If header = 0xbe (corrupted):
//   class_idx = 0xbe & 0x0F = 0x0E = 14 (out of bounds!)

// If header = 0x06 (lucky corruption):
//   class_idx = 0x06 & 0x0F = 0x06 = 6 (WRONG CLASS!)
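
A header read that also checks the magic nibble would reject most of these corrupted bytes instead of turning them into a class index. This is a sketch under stated assumptions: HEADER_MAGIC is taken to be 0xa0 (inferred from the 0xa7 example above), the class count of 8 is inferred from C0-C7, and header_class_checked() is a hypothetical helper, not an existing hakmem function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants: 0xa7 for class 7 implies a 0xa0 magic in the high
 * nibble; 8 classes are inferred from the C0-C7 naming. */
#define HEADER_MAGIC      0xa0u
#define HEADER_CLASS_MASK 0x0fu
#define TINY_NUM_CLASSES  8

/* Hypothetical checked decoder: returns the class index, or -1 when the
 * byte does not look like a valid header. */
static int header_class_checked(uint8_t header) {
    if ((header & 0xf0u) != HEADER_MAGIC) {
        return -1;                      /* magic nibble mismatch */
    }
    int class_idx = (int)(header & HEADER_CLASS_MASK);
    if (class_idx >= TINY_NUM_CLASSES) {
        return -1;                      /* index outside the class table */
    }
    return class_idx;
}
```

A pointer byte that happens to fall in the range 0xa0-0xa7 still passes this check, so it narrows the window rather than closing it.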

4. Wrong TLS SLL Write

If the corrupted class_idx is used to access g_tls_sll[]:

// Somewhere in the code (e.g., refill, push, pop):
g_tls_sll[class_idx].head = some_pointer;

// If class_idx = 6 (from corrupted header 0x?6):
g_tls_sll[6].head = 0x...0b  // Low byte of pointer → 0x0b

Result: Class 6 TLS SLL head is corrupted with pointer low bytes!


Evidence Supporting This Theory

1. Struct Layout is Correct

sizeof(TinyTLSSLL) = 16 bytes
C6 -> C7 gap: 16 bytes (correct)
C6.head offset: 0
C7.head offset: 16 (correct)

No struct alignment issues.
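
The layout numbers above could be pinned at compile time so that a future field change cannot silently shift them. The struct below is a hypothetical layout that matches the reported sizes on an LP64 target; the real TinyTLSSLL definition may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout matching the reported numbers (16 bytes, head at
 * offset 0); not copied from hakmem. */
typedef struct {
    void    *head;   /* offset 0 */
    uint32_t count;  /* offset 8 */
    uint32_t pad;    /* explicit padding up to 16 bytes */
} TinyTLSSLLSketch;

_Static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");
_Static_assert(sizeof(TinyTLSSLLSketch) == 16, "element stride must be 16");
_Static_assert(offsetof(TinyTLSSLLSketch, head) == 0, "head must come first");

/* Byte distance between the head fields of two slots in the array. */
static size_t head_gap(int from_class, int to_class) {
    static TinyTLSSLLSketch sll[8];
    return (size_t)((char *)&sll[to_class].head - (char *)&sll[from_class].head);
}
```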

2. All Head Write Sites are Correct

All g_tls_sll[class_idx].head = ... writes use correct array indexing. No pointer arithmetic bugs found.

3. Size-to-Class Routing is Correct

hak_tiny_size_to_class(1024) = 7  // Correct
g_size_to_class_lut_2k[1025] = 7  // Correct (1024 + 1 byte header)

4. Corruption Values Match Pointer Low Bytes

Observed corruptions: 0x0b, 0xbe, 0xdc, 0x7f

These are typical low bytes of x86-64 heap pointers (0x7f..., 0xbe..., 0xdc..., 0x0b...).

5. Code That Reads Headers Exists

Multiple locations read header & 0x0F to get class_idx:

  • tiny_free_fast_v2.inc.h:106: tiny_region_id_read_header(ptr)
  • tiny_ultra_fast.inc.h:68: header & 0x0F
  • pool_tls.c:157: header & 0x0F
  • hakmem_smallmid.c:307: header & 0x0f

Critical Code Paths

Path 1: C7 Free → Header Corruption

  1. User frees 1024B allocation (Class 7)
  2. tiny_free_fast_v2.inc.h or similar calls:
    int class_idx = tiny_region_id_read_header(ptr);  // Reads 0xa7
    
  3. Push to freelist (e.g., meta->freelist):
    tiny_next_write(7, base, meta->freelist);  // Writes at BASE+0, OVERWRITES header!
    
  4. Header corrupted: BASE+0 = 0x?? (pointer low byte) instead of 0xa7

Path 2: Corrupted Header → Wrong Class Write

  1. Allocation from freelist (refill or pop):
    void* p = meta->freelist;
    meta->freelist = tiny_next_read(7, p);  // Reads next pointer
    
  2. Later free (different code path):
    int class_idx = tiny_region_id_read_header(p);  // Reads corrupted header
    // class_idx = 0x?6 & 0x0F = 6 (WRONG!)
    
  3. Push to wrong TLS SLL:
    g_tls_sll[6].head = base;  // Should be g_tls_sll[7].head!
    

Why ChatGPT Diagnostics Didn't Catch This

  1. Push-side validation: Only validates pointers being pushed, not the class_idx used for indexing
  2. Count is correct: Count operations don't depend on corrupted headers
  3. Canary intact: Corruption is within valid array bounds (C6 is a valid index)
  4. Routing is correct: Initial routing (1024B → C7) is correct; corruption happens after allocation
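
The canary point can be illustrated with a small model: a write through a wrong but in-range class index lands inside the array, so guard words on either side never change. GuardedSLL, the CANARY value, and push_with_class() are all hypothetical names for this sketch, not hakmem definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CANARY 0xdeadbeefcafebabeULL   /* arbitrary sentinel for this sketch */

/* Hypothetical mirror of a guarded TLS layout: canaries on both sides
 * of the per-class head array. */
typedef struct {
    uint64_t canary_before;
    void    *head[8];
    uint64_t canary_after;
} GuardedSLL;

static int canaries_intact(const GuardedSLL *g) {
    return g->canary_before == CANARY && g->canary_after == CANARY;
}

/* Simulates a push that trusts a class index decoded from a header.
 * The index is in range, so nothing outside head[] is touched. */
static void push_with_class(GuardedSLL *g, int class_idx, void *base) {
    assert(class_idx >= 0 && class_idx < 8);
    g->head[class_idx] = base;
}
```

Pushing a class 7 block with class_idx = 6 updates head[6], leaves head[7] alone, and keeps both canaries intact, which matches the observed diagnostics.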

Locations That Write to g_tls_sll[*].head

Direct Writes (12 entries)

  1. core/tiny_ultra_fast.inc.h:52 - Pop operation
  2. core/tiny_ultra_fast.inc.h:80 - Push operation
  3. core/hakmem_tiny_lifecycle.inc:164 - Reset
  4. core/tiny_alloc_fast_inline.h:56 - NULL assignment (sentinel)
  5. core/tiny_alloc_fast_inline.h:62 - Pop next
  6. core/tiny_alloc_fast_inline.h:107 - Push base
  7. core/tiny_alloc_fast_inline.h:113 - Push ptr
  8. core/tiny_alloc_fast.inc.h:873 - Reset
  9. core/box/tls_sll_box.h:246 - Push
  10. core/box/tls_sll_box.h:274,319,362 - Sentinel/corruption recovery
  11. core/box/tls_sll_box.h:396 - Pop
  12. core/box/tls_sll_box.h:474 - Splice

Indirect Writes (via trc_splice_to_sll)

  • core/hakmem_tiny_refill_p0.inc.h:244,284 - Batch refill splice
  • Calls tls_sll_splice() → writes to g_tls_sll[class_idx].head

All sites correctly index with class_idx. The bug is that class_idx itself is corrupted.


The Fix

Option 1: Change the C7 Next Offset to 1 (Recommended)

File: core/tiny_nextptr.h
Line: 47

// BEFORE (BUG):
return (class_idx == 0 || class_idx == 7) ? 0u : 1u;

// AFTER (FIX):
return (class_idx == 0) ? 0u : 1u;  // C7 now uses offset 1 (preserve header)

Rationale:

  • C7 has 2048B total size (1B header + 2047B payload)
  • Using offset 1 leaves 2046B usable (still plenty for 1024B request)
  • Preserves header integrity for all freelist operations
  • Aligns with C1-C6 behavior (consistent design)

Cost: 1 byte payload loss per C7 block (2047B → 2046B usable)
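
A small round-trip check can serve as a regression test for the proposed offset rule. This is a sketch only: next_off_fixed(), next_write(), and next_read() are hypothetical helpers that model the behavior described here, and the block is a plain 2048-byte array rather than a real slab block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset rule after the proposed fix: only class 0 reuses the header byte. */
static size_t next_off_fixed(int class_idx) {
    return (class_idx == 0) ? 0u : 1u;
}

/* Hypothetical link helpers over a raw block; memcpy handles the
 * unaligned access at BASE+1. */
static void next_write(int class_idx, uint8_t *base, void *next) {
    assert(base != NULL);
    memcpy(base + next_off_fixed(class_idx), &next, sizeof next);
}

static void *next_read(int class_idx, const uint8_t *base) {
    void *next;
    memcpy(&next, base + next_off_fixed(class_idx), sizeof next);
    return next;
}
```

With this rule a class 7 block keeps 0xa7 at BASE+0 across a push, and the stored link still reads back unchanged.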

Option 2: Restore Header Before Header-Dependent Operations

Add header restoration in all paths that:

  1. Pop from freelist (before splice to TLS SLL)
  2. Pop from TLS SLL (before returning to user)

Cons: Complex, error-prone, performance overhead
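
For comparison, Option 2 would look roughly like the sketch below on each pop path. pop_and_restore() is a hypothetical helper, HEADER_MAGIC is assumed to be 0xa0 (inferred from the 0xa7 example), and the link is assumed to sit at BASE+0 as in the current C7 behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HEADER_MAGIC 0xa0u   /* assumed from the 0xa7 example in this document */

/* Hypothetical pop for a class whose link lives at BASE+0: read the
 * link, advance the list head, then rewrite the header the link destroyed. */
static void *pop_and_restore(void **head, int class_idx) {
    assert(head != NULL);
    uint8_t *base = (uint8_t *)*head;
    if (base == NULL) {
        return NULL;                               /* empty list */
    }
    void *next;
    memcpy(&next, base, sizeof next);              /* link stored at offset 0 */
    *head = next;
    base[0] = (uint8_t)(HEADER_MAGIC | (unsigned)class_idx);
    return base;
}
```

Every pop site would need the same restoration step, which is where the complexity and overhead noted above come from.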


Verification Plan

  1. Apply Fix: Change tiny_next_off() so that it returns 1 for class 7
  2. Rebuild: ./build.sh bench_random_mixed_hakmem
  3. Test: Run benchmark with HAKMEM_TINY_SLL_DIAG=1
  4. Monitor: Check for C6 head corruption logs
  5. Validate: Confirm g_tls_sll[6].head stays valid (no small integers)

Additional Diagnostics

If corruption persists after fix, add:

// In tls_sll_push() before line 246:
if (class_idx == 6 || class_idx == 7) {
    uint8_t header = *(uint8_t*)ptr;
    uint8_t expected = HEADER_MAGIC | class_idx;
    if (header != expected) {
        fprintf(stderr, "[TLS_SLL_PUSH] C%d header corruption! ptr=%p header=0x%02x expected=0x%02x\n",
                class_idx, ptr, header, expected);
    }
}

Related Files

  • core/tiny_nextptr.h - Next pointer offset logic (BUG HERE)
  • core/box/tiny_next_ptr_box.h - Box API wrapper
  • core/tiny_region_id.h - Header read/write operations
  • core/box/tls_sll_box.h - TLS SLL push/pop/splice
  • core/hakmem_tiny_refill_p0.inc.h - P0 refill (uses splice)
  • core/tiny_refill_opt.h - Refill chain operations

Timeline

  • Phase E1-CORRECT: Introduced C7 header + offset 0 decision
  • Comment: "header is overwritten while in the freelist - maximizes payload" (translated from the Japanese source comment)
  • Trade-off: Saved 1 byte payload, but broke header integrity
  • Impact: Freelist operations corrupt headers → wrong class_idx reads → C6 corruption

Conclusion

The corruption is NOT a direct write to g_tls_sll[6] with wrong data. It's an indirect corruption via:

  1. C7 next pointer write → overwrites header at BASE+0
  2. Corrupted header → wrong class_idx when read
  3. Wrong class_idx → write to g_tls_sll[6] instead of g_tls_sll[7]

Fix: Change tiny_next_off(7) from 0 to 1 to preserve C7 headers.

Cost: 1 byte per C7 block (negligible for 2KB blocks)
Benefit: Eliminates critical data structure corruption