# TLS SLL Header Corruption Diagnosis & Fix Instructions for ChatGPT
## Problem Statement

**Symptom**:
- Baseline (Headerless OFF) crashes with SIGSEGV
- Error log: `[TLS_SLL_HDR_RESET] cls=1 base=0x7ef296abf8c8 got=0x31 expect=0xa1 count=0`
- Location: `core/box/tls_sll_box.h` header integrity check during pop operation

**Root Cause**:
Header byte at offset 0 from base pointer contains user data (0x31) instead of header magic (0xa1).
This indicates one of:
1. Wrong pointer is being stored in TLS SLL
2. Header is not being written correctly before push
3. Adjacent block corruption overwrites header
4. Header write/read offset mismatch

**Impact**:
- TLS SLL header reset occurs (entire freelist for class 1 dropped)
- Subsequent allocations may fail or use wrong metadata
- Benchmark crashes with SIGSEGV
- Memory corruption potential

**Timeline**:
- Discovered during Phase 1 TLS Hint Box benchmarking
- Affects baseline configuration (no hints involved)
- Suggests pre-existing issue in shared TLS SLL code

---
## Investigation Strategy
**Phase A: Understand the Error**
- Where is header validation happening?
- What does 0x31 represent? (Is it deterministic or random data?)
- Can we reproduce with minimal allocations?
**Phase B: Locate Corruption Source**
- Where is header supposed to be written?
- Is header being written BEFORE push or after?
- Are there any recent changes to header write logic?
**Phase C: Implement Fix**
- Add instrumentation to catch corruption early
- Identify exact allocation/free cycle causing problem
- Fix root cause (not just symptom)
**Phase D: Validate**
- TC1 baseline should complete without crashes
- TC2/TC3 can then be evaluated
- No performance regression
---
## Deep Dive: TLS SLL Header Corruption
### What is 0x31?
The error reports `got=0x31`. Let's understand what this means:
```c
// Expected (header magic for class 1):
//   0xa1 = 0xa0 (HEADER_MAGIC) | 0x01 (class_idx)
// Got:
//   0x31 = 0b00110001
//        = ASCII '1' character
//        = upper nibble 0x3, not 0xa: most likely user data or metadata
```
**Questions to answer**:
1. Is 0x31 always the same, or does it vary? (Deterministic vs random corruption)
2. Does 0x31 correspond to any known data pattern in hakmem?
3. Does the corruption happen during alloc or free?
4. Is 0x31 part of the test program's data?
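
To help answer these questions, a byte read from the header slot can be split into its two nibbles. The following is a minimal sketch, assuming the `0xa0` magic and `0x0f` class mask used elsewhere in this document; the helper names and `HEADER_MAGIC_MASK` are hypothetical and not part of hakmem.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, taken from the snippets in this document. */
#define HEADER_MAGIC      0xa0u
#define HEADER_MAGIC_MASK 0xf0u /* hypothetical name: upper nibble */
#define HEADER_CLASS_MASK 0x0fu

/* True if the upper nibble carries the header magic. */
static bool hdr_has_magic(uint8_t b) {
    return (b & HEADER_MAGIC_MASK) == HEADER_MAGIC;
}

/* Class index in the lower nibble (only meaningful if the magic matches). */
static int hdr_class(uint8_t b) {
    return (int)(b & HEADER_CLASS_MASK);
}

/* True if the byte is printable ASCII, i.e. plausibly user data. */
static bool hdr_looks_like_text(uint8_t b) {
    return isprint(b) != 0;
}

/* Usage: the expected byte decodes cleanly, the observed one does not. */
static void hdr_self_check(void) {
    assert(hdr_has_magic(0xa1) && hdr_class(0xa1) == 1);
    assert(!hdr_has_magic(0x31));
    assert(hdr_looks_like_text(0x31)); /* ASCII '1' */
}
```

A byte that fails `hdr_has_magic` but passes `hdr_looks_like_text` points toward the slot holding user data rather than a partially damaged header.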
### TLS SLL Header Check Logic
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h` (around lines 280-320)
```c
// In tls_sll_pop_impl():
if (tiny_class_preserves_header(class_idx)) {
    uint8_t* b = (uint8_t*)raw_base;
    uint8_t got = *b; // Read byte at offset 0 of base pointer
    uint8_t expected = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
    if (got != expected) {
        // CORRUPTION DETECTED!
        fprintf(stderr, "[TLS_SLL_HDR_RESET] cls=%d base=%p got=0x%02x expect=0x%02x ...\n",
                class_idx, raw_base, got, expected);
        // ... reset logic follows
    }
}
```
**Key Points**:
- Header is read at `(uint8_t*)raw_base` (offset 0)
- Expected value is `0xa0 | class_idx`
- For class 1: expect `0xa1`
- Got `0x31` instead (user data)
### When Does This Happen?
The error occurs during `tls_sll_pop()`, which is called when:
1. **Freelist refill**: Taking blocks from TLS SLL back to unified cache
2. **Magazine spill**: Freelist → TLS SLL transition for overflow
3. **Allocation path**: Pulling blocks from TLS SLL to satisfy malloc
**The header must already be wrong before the pop**: either it was wrong at push time, or it was overwritten while the block sat in the list. The check only detects it during the pop.
This suggests:
- Either the pointer stored in TLS SLL is wrong (points to wrong location)
- Or the header was never written correctly
- Or adjacent block corruption overwrote the header
- Or there's an offset calculation error between push and pop
---
## Diagnostic Procedure
### Step 1: Reproduce with Minimal Test
Create the smallest possible test case:
**File**: `/mnt/workdisk/public_share/hakmem/tests/test_tls_sll_minimal.c`
```c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    // Unbuffered stdout: progress lines survive a crash, even when piped.
    setvbuf(stdout, NULL, _IONBF, 0);

    printf("Test 1: Simple alloc/free cycle\n");
    for (int i = 0; i < 10; i++) {
        void* p = malloc(16); // Class 1
        if (p) {
            memset(p, 0x31, 16); // Write user data (includes 0x31!)
            free(p);
        }
    }
    printf("✓ Test 1 passed\n");

    printf("Test 2: Rapid alloc/free (trigger refill)\n");
    for (int i = 0; i < 1000; i++) {
        void* p = malloc(16);
        if (p) {
            memset(p, 0x31, 16);
            free(p);
        }
    }
    printf("✓ Test 2 passed\n");

    printf("Test 3: Multiple sizes\n");
    for (int size = 8; size <= 512; size *= 2) {
        for (int j = 0; j < 100; j++) {
            void* p = malloc(size);
            if (p) {
                memset(p, 0x31, size);
                free(p);
            }
        }
    }
    printf("✓ Test 3 passed\n");

    printf("Test 4: Heavy churn (trigger SLL push/pop)\n");
    void* ptrs[100];
    for (int round = 0; round < 10; round++) {
        for (int i = 0; i < 100; i++) {
            ptrs[i] = malloc(16);
            if (ptrs[i]) memset(ptrs[i], 0x31, 16);
        }
        for (int i = 0; i < 100; i++) {
            free(ptrs[i]);
        }
    }
    printf("✓ Test 4 passed\n");
    return 0;
}
```
**Build and test**:
```bash
cd /mnt/workdisk/public_share/hakmem
mkdir -p tests
gcc -o tests/test_tls_sll_minimal tests/test_tls_sll_minimal.c
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
```
**Goal**: Find the minimal reproduction:
- If test 1 fails: Early corruption (basic alloc/free)
- If test 2 fails: Refill-related corruption
- If test 3 fails: Class-specific issue
- If test 4 fails: SLL push/pop cycling issue
### Step 2: Add Diagnostic Logging
Instrument the header write/read paths:
#### Instrument Header Write
**File**: `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_config_box.inc`
Find the `HAK_RET_ALLOC` macro and add logging:
```c
// Add diagnostic logging
#define HAK_RET_ALLOC(base, cls) do { \
    fprintf(stderr, "[ALLOC_HEADER_WRITE] base=%p cls=%d\n", (void*)(base), (int)(cls)); \
    uint8_t* hdr = (uint8_t*)(base); \
    uint8_t magic = (uint8_t)(0xa0 | ((cls) & 0x0f)); \
    *hdr = magic; \
    fprintf(stderr, "[ALLOC_HEADER_WROTE] base=%p magic=0x%02x (at %p)\n", (void*)(base), *hdr, (void*)hdr); \
    __atomic_thread_fence(__ATOMIC_RELEASE); \
    hak_user_ptr_t user = ptr_base_to_user((base), (cls)); \
    /* %td is the printf length modifier for ptrdiff_t */ \
    fprintf(stderr, "[ALLOC_RETURN] user=%p (base=%p + %td)\n", (void*)user, (void*)(base), (char*)user - (char*)(base)); \
    return user; \
} while(0)
```
#### Instrument Header Read
**File**: `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h`
Modify the header read/check in `tls_sll_pop_impl()`:
```c
// In tls_sll_pop_impl(), before the check:
if (tiny_class_preserves_header(class_idx)) {
    uint8_t* b = (uint8_t*)raw_base;
    uint8_t got = *b;
    uint8_t expected = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
    // NEW DIAGNOSTIC LOGGING:
    fprintf(stderr, "[TLS_SLL_POP_CHECK] class=%d raw_base=%p checking at %p\n",
            class_idx, (void*)raw_base, (void*)b);
    fprintf(stderr, "[TLS_SLL_POP_READ] got=0x%02x expected=0x%02x\n", got, expected);
    if (got != expected) {
        fprintf(stderr, "[CORRUPTION_DETECTED] Mismatch! Dumping context...\n");
        // %td is the printf length modifier for ptrdiff_t
        fprintf(stderr, "[CORRUPTION_CONTEXT] raw_base=%p, offset=%td\n",
                (void*)raw_base, (char*)b - (char*)raw_base);
        // Dump surrounding bytes (8 before base, 16 from base onward)
        fprintf(stderr, "[CORRUPTION_DUMP] Bytes around base: ");
        for (int i = -8; i < 16; i++) {
            fprintf(stderr, "%02x ", ((uint8_t*)raw_base)[i]);
        }
        fprintf(stderr, "\n");
        // ... existing reset logic
    }
}
```
#### Instrument SLL Push
**File**: `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h`
Find `tls_sll_push_impl()` and add logging:
```c
static inline bool tls_sll_push_impl(..., hak_base_ptr_t ptr, ...) {
    fprintf(stderr, "[TLS_SLL_PUSH] class=%d ptr=%p\n", class_idx, (void*)ptr);
    // Check header BEFORE push
    if (tiny_class_preserves_header(class_idx)) {
        uint8_t hdr = *(uint8_t*)ptr;
        fprintf(stderr, "[TLS_SLL_PUSH_HDR_CHECK] ptr=%p header=0x%02x\n", (void*)ptr, hdr);
    }
    // ... existing push logic
}
```
**Build and run**:
```bash
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal 2>&1 | grep -E "ALLOC|POP|PUSH|CORRUPTION" | head -100
```
**What to look for**:
- Do ALLOC_HEADER_WRITE and TLS_SLL_PUSH_HDR_CHECK match?
- Does TLS_SLL_POP_READ show corruption?
- What is the sequence: WRITE → PUSH → POP?
- Are pointers consistent across operations?
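
One caveat with the instrumentation above: `fprintf` runs inside the allocator, and depending on the libc it may take stream locks or allocate internally, which can re-enter `malloc`. The following is a minimal sketch of an alternative that formats into a stack buffer and emits with `write(2)`. `dbg_format` and `DBG_LOG` are hypothetical names, and the assumption that `vsnprintf` does not allocate for simple `%p`/`%d`/`%x` conversions holds on common libcs but is not guaranteed.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: format into a caller-provided buffer.
 * Returns the number of bytes stored (excluding the terminator). */
static int dbg_format(char *buf, size_t cap, const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(buf, cap, fmt, ap);
    va_end(ap);
    if (n < 0) return 0;                    /* formatting error */
    if ((size_t)n >= cap) n = (int)cap - 1; /* output was truncated */
    return n;
}

/* Hypothetical macro: log one line to stderr without touching stdio streams. */
#define DBG_LOG(...) do { \
    char dbg_buf_[256]; \
    int dbg_n_ = dbg_format(dbg_buf_, sizeof dbg_buf_, __VA_ARGS__); \
    if (dbg_n_ > 0) { \
        ssize_t dbg_w_ = write(STDERR_FILENO, dbg_buf_, (size_t)dbg_n_); \
        (void)dbg_w_; /* best effort: nothing useful to do on failure */ \
    } \
} while (0)

/* Usage: same message shape as the [TLS_SLL_POP_READ] line above. */
static void dbg_self_check(void) {
    char buf[64];
    int n = dbg_format(buf, sizeof buf,
                       "[TLS_SLL_POP_READ] got=0x%02x expected=0x%02x\n",
                       0x31u, 0xa1u);
    assert(n == (int)strlen(buf));
    assert(strcmp(buf, "[TLS_SLL_POP_READ] got=0x31 expected=0xa1\n") == 0);
}
```

If the logs from the `fprintf` version look interleaved or the process hangs during logging, swapping the calls for something like `DBG_LOG` is one way to rule the logging itself out as a factor.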
### Step 3: Examine Header Write Locations
Search for all places headers are written:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -rn "= 0xa\|= HEADER_MAGIC\|= TINY_HEADER\|0xa0 |" core/ --include="*.h" --include="*.c" --include="*.inc"
```
Expected locations:
1. `core/hakmem_tiny_config_box.inc` - HAK_RET_ALLOC macro
2. `core/box/tls_sll_box.h` - Optional header write on SLL push (if needed)
3. `core/tiny_alloc_fast_push.c` - Fast path allocations
4. Other allocation paths?
**Check each location**:
- Is the offset correct? (Should be offset 0 from base)
- Is it written BEFORE or AFTER pushing to TLS SLL?
- Is there an atomic fence to prevent reordering?
- Is the class_idx valid?
### Step 4: Examine Pointer Conversion Logic
The key question: **Are we storing the right pointer in TLS SLL?**
**File**: `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_types.h`
Check the pointer conversion macros:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -A5 "ptr_user_to_base\|ptr_base_to_user\|HAK_BASE_FROM_RAW" core/hakmem_tiny_types.h
```
**Critical questions**:
1. When we free a user pointer, do we convert it to base pointer correctly?
2. When we push to TLS SLL, do we push the base pointer or user pointer?
3. When we pop from TLS SLL, do we get back the exact same base pointer?
**Expected flow**:
```
Alloc: BASE → (write header at BASE) → (convert to USER) → return USER
Free: USER → (convert to BASE) → (push BASE to TLS SLL)
Pop: (pop BASE from TLS SLL) → (read header at BASE) → validate
```
If any step uses wrong offset, corruption occurs.
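
The expected flow can be modelled on a single stack buffer to see how one missing conversion yields exactly the observed `got=0x31`. This is a toy sketch, not hakmem code: the `toy_*` functions are hypothetical stand-ins, and the 1-byte header and `0xa0` magic are taken from the snippets in this document.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HEADER_MAGIC           0xa0u
#define HEADER_CLASS_MASK      0x0fu
#define TINY_HEADER_SIZE_BYTES 1

/* Toy stand-ins for the real pointer conversions. */
static uint8_t *toy_base_to_user(uint8_t *base) { return base + TINY_HEADER_SIZE_BYTES; }
static uint8_t *toy_user_to_base(uint8_t *user) { return user - TINY_HEADER_SIZE_BYTES; }

static uint8_t toy_expected(int cls) {
    return (uint8_t)(HEADER_MAGIC | ((unsigned)cls & HEADER_CLASS_MASK));
}

/* Alloc: write the header at base+0, hand out base+1. */
static uint8_t *toy_alloc(uint8_t *block, int cls) {
    block[0] = toy_expected(cls);
    return toy_base_to_user(block);
}

/* Pop-side check: the byte found where the header should be. */
static uint8_t toy_pop_check(const uint8_t *pushed) { return pushed[0]; }

static void toy_flow_self_check(void) {
    uint8_t block[17] = {0};           /* 1 header byte + 16 user bytes */
    uint8_t *user = toy_alloc(block, 1);
    memset(user, 0x31, 16);            /* the test program's user data */

    /* Correct free path: USER -> BASE before pushing. */
    assert(toy_pop_check(toy_user_to_base(user)) == 0xa1);

    /* Buggy free path: USER pushed as-is, so the pop reads user data. */
    assert(toy_pop_check(user) == 0x31);
}
```

Note that the user's `memset` never touches the header byte in this model; the mismatch comes purely from reading at the wrong address.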
### Step 5: Git Blame on Recent Changes
```bash
cd /mnt/workdisk/public_share/hakmem
git log --oneline -30
git show b5be708b6 # "Fix potential freelist corruption"
git show c91602f18 # "Fix ptr_user_to_base_blind regression"
git show f3f75ba3d # "Fix magazine spill RAW pointer"
```
**Check**: Did any of these changes affect header write logic?
Look for:
- Changes to `HAK_RET_ALLOC` macro
- Changes to pointer conversion logic
- Changes to TLS SLL push/pop
- Changes to header offset calculations
### Step 6: Review Commit History for TLS SLL
```bash
cd /mnt/workdisk/public_share/hakmem
git log --oneline --all -- core/box/tls_sll_box.h | head -20
git log -p --all -- core/box/tls_sll_box.h | head -200
```
Look for:
- When was header logic last changed?
- Were there any defensive fixes recently?
- Any atomic fence changes?
- Any offset calculation changes?
### Step 7: Check Phase 1 Configuration
**File**: `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_types.h`
Verify the header configuration:
```c
// Phase 1: headerless = false → headers ON
// Header should be at offset 0 of base pointer
#define TINY_HEADER_SIZE_BYTES 1
#define HEADER_MAGIC 0xa0
```
**Check**:
- Is HEADERLESS defined? (Should be undefined for Phase 1)
- Is header size correct? (Should be 1 byte)
- Are offset calculations consistent?
---
## Likely Root Causes (Narrowed)
### Root Cause A: Header Written at Wrong Offset
**Symptom**: User data appears where header should be
**Check**:
```c
// In HAK_RET_ALLOC, are we writing at the right place?
// Phase 1: header at offset 0 of base
uint8_t* hdr_ptr = (uint8_t*)base; // Should be offset 0
*hdr_ptr = magic;
// If this was changed to:
uint8_t* hdr_ptr = (uint8_t*)base + 1; // WRONG! User data location
*hdr_ptr = magic;
// Then header is written in user space, gets overwritten
```
**How to verify**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -n "HAK_RET_ALLOC" core/hakmem_tiny_config_box.inc
# Check that header write is at (uint8_t*)base, not base+offset
```
**Fix**: Ensure header write is at `(uint8_t*)base`, not base+offset.
### Root Cause B: User Pointer Pushed Instead of Base Pointer
**Symptom**: SLL contains user pointers, but pop expects base pointers
**Sequence**:
```c
// During free:
void* user_ptr = ...; // User pointer (base + 1 for Phase 1)
tls_sll_push(class_idx, user_ptr, cap); // WRONG! Should be base pointer
// During pop:
void* popped = tls_sll_pop(class_idx); // Gets user_ptr
uint8_t header = *(uint8_t*)popped; // Reads at user_ptr, not base_ptr!
// This reads user data instead of header
```
**How to verify**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -rn "tls_sll_push" core/ --include="*.c" --include="*.inc" -A3 -B3
# Check that all pushes use base pointer, not user pointer
```
**Fix**: Convert user pointer to base pointer before pushing:
```c
hak_base_ptr_t base = ptr_user_to_base(user_ptr, class_idx);
tls_sll_push(class_idx, base, cap);
```
### Root Cause C: Atomic Fence Missing
**Symptom**: Compiler reorders header write after SLL push
**Check**:
```c
*(uint8_t*)base = header_magic; // Instruction 1
__atomic_thread_fence(__ATOMIC_RELEASE); // Fence (required!)
tls_sll_push(class_idx, base, cap); // Instruction 2
```
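If the fence is missing and another thread can observe the node (for example through a cross-thread free path), the CPU or compiler might:
1. Make the push visible before the header write
2. Let the other thread see an unprepared node in the SLL
3. Let that thread's pop read the unwritten header → corruption

Within a single thread, a load always observes that thread's own earlier stores in program order. For a strictly thread-local list, a missing fence alone therefore cannot produce this mismatch; treat this cause as relevant only if blocks cross threads.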
**How to verify**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -rn -B5 "tls_sll_push" core/ --include="*.c" --include="*.inc" | grep -E "fence|barrier|atomic"
# Check that fence exists between header write and push
```
**Fix**: Add `__atomic_thread_fence(__ATOMIC_RELEASE)` after header write, before SLL push.
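
Ordering only becomes observable when a block is handed to another thread. The sketch below shows that handoff with C11 atomics, using a release store paired with an acquire exchange instead of a standalone fence. The single-slot `handoff_slot` is a hypothetical illustration and not how hakmem passes blocks between threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-element mailbox shared between two threads. */
static _Atomic(uint8_t *) handoff_slot;

/* Producer: write the header, then publish with release ordering so the
 * header store cannot become visible after the pointer. */
static void handoff_publish(uint8_t *base, uint8_t header) {
    base[0] = header;
    atomic_store_explicit(&handoff_slot, base, memory_order_release);
}

/* Consumer: take the pointer with acquire ordering; if it is non-NULL,
 * the header written before the release store is visible. */
static uint8_t *handoff_take(void) {
    return atomic_exchange_explicit(&handoff_slot, NULL, memory_order_acquire);
}

/* Usage (single-threaded here, only to show the call sequence). */
static void handoff_self_check(void) {
    uint8_t block[17] = {0};
    handoff_publish(block, 0xa1);
    uint8_t *got = handoff_take();
    assert(got == block && got[0] == 0xa1);
    assert(handoff_take() == NULL); /* slot is empty again */
}
```

The release must be paired with an acquire on the reading side; a fence on the writer alone does not order what the reader sees.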
### Root Cause D: Magazine Spill Pointer Wrapping
**Symptom**: Magazine stores RAW pointer, SLL expects BASE pointer
**Already Fixed**: Commit f3f75ba3d added `HAK_BASE_FROM_RAW()` wrapper
**Verify**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -n "HAK_BASE_FROM_RAW\|magazine.*spill" core/hakmem_tiny_refill.inc.h
# Check line 228 or nearby has the fix
```
**Expected code**:
```c
void* p = mag->items[--mag->top].ptr;
hak_base_ptr_t base_p = HAK_BASE_FROM_RAW(p); // Must have this!
if (!tls_sll_push(class_idx, base_p, cap)) {
    // ...
}
```
**Fix**: If missing, add `HAK_BASE_FROM_RAW()` wrapper around raw pointer.
### Root Cause E: Class Index Mismatch
**Symptom**: Wrong class_idx used for header magic
**Check**:
```c
int class_idx = ...; // Where does this come from?
uint8_t magic = (uint8_t)(0xa0 | (class_idx & 0x0f));
// If class_idx is wrong (e.g., -1 or 999), the mask still yields 0xaX, but for the wrong class.
// Note: this alone cannot produce 0x31, whose upper nibble is 0x3.
```
**How to verify**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -rn "class_idx\|tiny_size_to_class" core/ --include="*.h" | grep -E "= -1|= 0xff"
# Look for places where class_idx might be invalid
```
**Fix**: Validate class_idx is in range [0, 7] before using:
```c
if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) {
    fprintf(stderr, "[ERROR] Invalid class_idx: %d\n", class_idx);
    abort();
}
```
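
To judge how likely this cause is for the observed error, it helps to see what the masking does to out-of-range indices. A minimal sketch, assuming the `0xa0 | (class_idx & 0x0f)` formula shown in this document; `make_header` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define HEADER_MAGIC      0xa0
#define HEADER_CLASS_MASK 0x0f

/* Same formula as the header write: magic in the upper nibble,
 * masked class index in the lower nibble. */
static uint8_t make_header(int class_idx) {
    return (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
}

static void make_header_self_check(void) {
    assert(make_header(1) == 0xa1);
    assert(make_header(-1) == 0xaf);  /* -1 & 0x0f == 0x0f */
    assert(make_header(999) == 0xa7); /* 999 & 0x0f == 7 */
    /* Whatever the index, the upper nibble stays 0xa. */
    for (int c = -1024; c <= 1024; c++) {
        assert((make_header(c) & 0xf0) == 0xa0);
    }
}
```

A bad `class_idx` would therefore show up as `got=0xaX` with the wrong `X`, which makes this cause a poor match for `got=0x31` on its own.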
### Root Cause F: Offset Calculation Error
**Symptom**: Header written at base, but read at base+offset (or vice versa)
**Check**:
```c
// During alloc:
*(uint8_t*)base = magic; // Write at base+0
user = base + 1; // User at base+1 (Phase 1)
// During free/pop:
base = user - 1; // Should recover original base
uint8_t hdr = *(uint8_t*)base; // Should read at base+0
// BUT if conversion is wrong:
base = user - 0; // WRONG! Off by one
uint8_t hdr = *(uint8_t*)base; // Reads at wrong location
```
**How to verify**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -A10 "ptr_user_to_base_impl\|ptr_base_to_user_impl" core/hakmem_tiny_types.h
# Check offset calculations are consistent
```
**Fix**: Ensure offset calculations match between:
- `ptr_base_to_user` (add offset)
- `ptr_user_to_base` (subtract same offset)
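
The consistency requirement can be stated as a round-trip property: converting a base pointer to a user pointer and back must return the original base. A small sketch of that check follows; the `good_*`/`bad_*` functions are hypothetical toy conversions with a 1-byte header, not the real `ptr_*_impl` functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t *(*ptr_conv_fn)(uint8_t *);

/* Toy conversions: the good pair is symmetric, the bad one is off by one. */
static uint8_t *good_base_to_user(uint8_t *b) { return b + 1; }
static uint8_t *good_user_to_base(uint8_t *u) { return u - 1; }
static uint8_t *bad_user_to_base(uint8_t *u)  { return u - 0; }

/* True if to_base(to_user(base)) recovers the original base. */
static bool conversions_round_trip(ptr_conv_fn to_user, ptr_conv_fn to_base) {
    uint8_t block[32];
    uint8_t *base = block + 8; /* keep all arithmetic inside the array */
    return to_base(to_user(base)) == base;
}

static void round_trip_self_check(void) {
    assert(conversions_round_trip(good_base_to_user, good_user_to_base));
    assert(!conversions_round_trip(good_base_to_user, bad_user_to_base));
}
```

The same property could be asserted once per size class in a debug build, since the real conversions depend on `cls`.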
---
## Proposed Fix Patterns
Based on diagnostic results, the fix will likely be one of:
### Fix Pattern 1: Restore Header Write Logic
**Problem**: Header write uses wrong offset or wrong pointer
**File**: `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_config_box.inc`
```c
#define HAK_RET_ALLOC(base, cls) do { \
/* Write header FIRST at offset 0 of base */ \
*(uint8_t*)(base) = (uint8_t)(0xa0 | ((cls) & 0x0f)); \
/* Ensure header write completes before next operation */ \
__atomic_thread_fence(__ATOMIC_RELEASE); \
/* Now convert to user pointer and return */ \
return ptr_base_to_user((base), (cls)); \
} while(0)
```
### Fix Pattern 2: Add Missing Fence
**Problem**: Header write is reordered after the SLL push (only observable when another thread can read the node)
**File**: `/mnt/workdisk/public_share/hakmem/core/tiny_alloc_fast_push.c` or `core/hakmem_tiny_free.inc`
```c
// Before push to TLS SLL:
*(uint8_t*)base = header_magic;
__atomic_thread_fence(__ATOMIC_RELEASE); // ADD THIS LINE
tls_sll_push(class_idx, base, cap);
```
### Fix Pattern 3: Fix Pointer Type in Push
**Problem**: User pointer pushed instead of base pointer
**File**: Multiple locations (search for `tls_sll_push`)
```c
// In free path:
void* user_ptr = ptr; // From user
hak_base_ptr_t base_ptr = ptr_user_to_base(user_ptr, class_idx); // Convert!
if (!tls_sll_push(class_idx, base_ptr, cap)) { // Push base, not user
    // ...
}
```
### Fix Pattern 4: Validate Inputs
**Problem**: Invalid class_idx or pointer values
**File**: `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h`
```c
// At entry of tls_sll_push_impl():
static inline bool tls_sll_push_impl(..., hak_base_ptr_t ptr, ...) {
    // Validate inputs
    if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) {
        fprintf(stderr, "[ERROR] Invalid class_idx: %d\n", class_idx);
        return false;
    }
    if (!ptr || ptr == (void*)-1) {
        fprintf(stderr, "[ERROR] Invalid pointer: %p\n", ptr);
        return false;
    }
    // ... existing logic
}
```
### Fix Pattern 5: Check Magazine Spill
**Problem**: Magazine spill uses wrong pointer type
**File**: `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_refill.inc.h`
```c
// Around line 228 (magazine spill):
void* p = mag->items[--mag->top].ptr;
// MUST convert RAW to BASE before pushing:
hak_base_ptr_t base_p = HAK_BASE_FROM_RAW(p); // Essential!
if (!tls_sll_push(class_idx, base_p, cap)) {
    // ... error handling
}
```
**Verify fix exists**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -n "HAK_BASE_FROM_RAW" core/hakmem_tiny_refill.inc.h
# Should see it used before tls_sll_push
```
### Fix Pattern 6: Fix Offset Calculation
**Problem**: Pointer conversion uses wrong offset
**File**: `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_types.h`
```c
// Verify Phase 1 offsets:
static inline hak_user_ptr_t ptr_base_to_user_impl(hak_base_ptr_t base, int cls) {
    if (tiny_class_preserves_header(cls)) {
        return (hak_user_ptr_t)((uint8_t*)base + TINY_HEADER_SIZE_BYTES); // +1 for Phase 1
    }
    return (hak_user_ptr_t)base;
}

static inline hak_base_ptr_t ptr_user_to_base_impl(hak_user_ptr_t user, int cls) {
    if (tiny_class_preserves_header(cls)) {
        return (hak_base_ptr_t)((uint8_t*)user - TINY_HEADER_SIZE_BYTES); // -1 for Phase 1
    }
    return (hak_base_ptr_t)user;
}
```
**Check**: Ensure +1 and -1 match, and TINY_HEADER_SIZE_BYTES is 1.

---
## Debug Workflow
### Quick Debug Cycle
```bash
cd /mnt/workdisk/public_share/hakmem
# 1. Make changes to source
# ... edit files ...
# 2. Rebuild
make clean && make shared -j8
# 3. Test with minimal reproducer
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal 2>&1 | tee debug.log
# 4. Check for errors
grep "TLS_SLL_HDR_RESET\|CORRUPTION\|SIGSEGV" debug.log
# 5. Analyze log patterns
grep -E "ALLOC|PUSH|POP" debug.log | head -50
```
### Advanced Debug: GDB
```bash
cd /mnt/workdisk/public_share/hakmem
# Build with debug symbols
make clean
CFLAGS="-g -O0" make shared -j8
# Run under GDB
gdb --args ./tests/test_tls_sll_minimal
```
**GDB commands**:
```gdb
(gdb) set environment LD_PRELOAD ./libhakmem.so
(gdb) break tls_sll_push_impl
(gdb) break tls_sll_pop_impl
(gdb) run
(gdb) print /x *(uint8_t*)ptr # Check header byte
(gdb) print class_idx
(gdb) backtrace
(gdb) continue
```
### Memory Corruption Detection
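Enable AddressSanitizer (ASan). Caveat: ASan interposes on `malloc`/`free` itself and expects its runtime to be loaded first, so an ASan-instrumented `libhakmem.so` loaded via `LD_PRELOAD` may refuse to start or may bypass hakmem's allocation paths. Treat this as a best-effort check: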
```bash
cd /mnt/workdisk/public_share/hakmem
make clean
CFLAGS="-fsanitize=address -g" LDFLAGS="-fsanitize=address" make shared -j8
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
```
ASan will catch:
- Buffer overflows
- Use-after-free
- Double-free
- Invalid pointer arithmetic
---
## After Applying Fix
### Step 1: Rebuild and Test Minimal Reproducer
```bash
cd /mnt/workdisk/public_share/hakmem
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
```
**Expected**:
- All tests pass
- No `[TLS_SLL_HDR_RESET]` errors
- No SIGSEGV crashes
### Step 2: Run TC1 Baseline Test
```bash
cd /mnt/workdisk/public_share/hakmem
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so timeout 20 ./mimalloc-bench/out/bench/sh8bench 2>&1 | tail -20
```
**Expected**:
- "Total elapsed time..." message
- No SIGSEGV
- Completion within timeout
### Step 3: Run Full Benchmark Suite
```bash
cd /mnt/workdisk/public_share/hakmem
# cfrac test
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 2>&1 | head -10
# larson test
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 2>&1 | tail -10
# sh6bench test
LD_PRELOAD=./libhakmem.so timeout 20 ./mimalloc-bench/out/bench/sh6bench 2>&1 | tail -5
```
**Expected**: All pass without crashes or corruption errors
### Step 4: Regression Check
Ensure fix doesn't break other configurations:
```bash
cd /mnt/workdisk/public_share/hakmem
# Test Phase 2 (headerless=true) - if implemented
# ... config changes ...
# make clean && make shared -j8
# LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
# Test with different workloads
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/mstress 10 2
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/rptest 10
```
### Step 5: Performance Check
Verify no performance regression:
```bash
cd /mnt/workdisk/public_share/hakmem
# Before fix (save baseline):
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep "Total elapsed"
# Note: May crash, but if it runs, record time
# After fix:
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep "Total elapsed"
# Compare: Should be within 5% of baseline (if baseline worked)
```
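
The 5% comparison is simple arithmetic on the two "Total elapsed" values. A minimal sketch follows; the function names and the timings in the self-check are hypothetical, and only slowdowns are counted as regressions, which is an assumption about how the threshold is meant.

```c
#include <assert.h>
#include <stdbool.h>

/* Percent change in elapsed time; positive means the "after" run was slower. */
static double elapsed_change_pct(double before_s, double after_s) {
    return (after_s - before_s) / before_s * 100.0;
}

/* True if the slowdown stays within budget_pct (speedups always pass). */
static bool within_regression_budget(double before_s, double after_s,
                                     double budget_pct) {
    if (before_s <= 0.0) return false; /* no usable baseline */
    return elapsed_change_pct(before_s, after_s) <= budget_pct;
}

/* Usage with made-up timings in seconds. */
static void perf_self_check(void) {
    assert(within_regression_budget(10.0, 10.4, 5.0));  /* about +4% */
    assert(!within_regression_budget(10.0, 10.6, 5.0)); /* about +6% */
    assert(within_regression_budget(10.0, 9.0, 5.0));   /* faster */
    assert(!within_regression_budget(0.0, 1.0, 5.0));   /* baseline crashed */
}
```

If the baseline crashed before printing a time, there is nothing to compare against, which is why a non-positive baseline is rejected here.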
### Step 6: Remove Diagnostic Logging
After fix is confirmed, remove debug logging:
```bash
cd /mnt/workdisk/public_share/hakmem
# Remove fprintf statements added for diagnosis
# Restore original HAK_RET_ALLOC macro
# Restore original tls_sll_push/pop implementations
# Rebuild clean version
make clean
make shared -j8
# Final test
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
LD_PRELOAD=./libhakmem.so timeout 20 ./mimalloc-bench/out/bench/sh8bench
```
### Step 7: Commit with Detailed Message
```bash
cd /mnt/workdisk/public_share/hakmem
git status
git add [modified files]
git commit -m "Fix TLS SLL header corruption

Problem: Header magic byte being corrupted during allocation/free path,
causing [TLS_SLL_HDR_RESET] errors and SIGSEGV crashes in baseline tests.

Symptoms:
- sh8bench crashes with SIGSEGV
- Error: [TLS_SLL_HDR_RESET] cls=1 got=0x31 expect=0xa1
- Header validation fails during tls_sll_pop()

Root cause: [DESCRIBE WHAT WAS WRONG - e.g.:]
- User pointer was being pushed to TLS SLL instead of base pointer
- Header read at wrong offset due to pointer type mismatch
- Missing atomic fence allowed reordering of header write

Solution: [DESCRIBE WHAT WAS FIXED - e.g.:]
- Convert user pointer to base pointer before tls_sll_push()
- Add atomic fence after header write, before SLL operations
- Validate pointer types at SLL entry points

Changes:
- core/hakmem_tiny_config_box.inc: Fixed HAK_RET_ALLOC header offset
- core/box/tls_sll_box.h: Added pointer validation
- core/hakmem_tiny_free.inc: Convert to base ptr before push

Validation:
- test_tls_sll_minimal passes (4/4 tests)
- sh8bench baseline completes successfully
- cfrac/larson/sh6bench pass without crashes
- No performance regression (<2% variance)

Verified: TC1 baseline stability restored, ready for Phase 1 testing"
```
---
## Expected Timeline
**Phase A: Understanding (1-2 hours)**
- Read this document
- Understand TLS SLL architecture
- Review header mechanism
- Locate relevant code sections
**Phase B: Diagnosis (2-4 hours)**
- Create minimal test case
- Add diagnostic logging
- Run tests and analyze logs
- Identify root cause
**Phase C: Fix Implementation (1-2 hours)**
- Implement surgical fix
- Remove diagnostic logging
- Clean build and test
**Phase D: Validation (1-2 hours)**
- Run full test suite
- Verify no regressions
- Performance check
- Document and commit
**Total: 5-10 hours** for complete diagnosis, fix, and validation

---
## Success Criteria
**Must Have**:
1. No `[TLS_SLL_HDR_RESET]` errors in baseline tests
2. sh8bench completes without SIGSEGV
3. Minimal test suite passes (4/4 tests)
4. Fix is surgical (minimal code changes)
5. Root cause documented clearly
**Nice to Have**:
1. Performance neutral (<5% variance)
2. Fix applies to all configurations
3. Additional validation checks added
4. Regression tests added
**Verification**:
```bash
cd /mnt/workdisk/public_share/hakmem
LD_PRELOAD=./libhakmem.so timeout 20 ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep -E "Total elapsed|RESET|SIGSEGV"
# Should show "Total elapsed time" with no errors
```
---
## Common Pitfalls
### Pitfall 1: Fixing Symptoms, Not Root Cause
**Wrong approach**:
```c
// Just disable the check
if (got != expected) {
    // Do nothing, ignore corruption
}
```
**Right approach**:
- Understand WHY corruption happens
- Fix the source (wrong pointer, wrong offset, etc.)
- Keep the validation check enabled
### Pitfall 2: Over-Engineering
**Wrong approach**:
- Rewrite entire TLS SLL system
- Add complex locking mechanisms
- Change fundamental architecture
**Right approach**:
- Minimal fix (usually 1-5 lines)
- Fix pointer conversion or offset
- Add fence if missing
### Pitfall 3: Ignoring Test Results
**Wrong approach**:
- Fix compiles, assume it works
- Skip minimal reproducer
- Don't verify with benchmarks
**Right approach**:
- Test with minimal case FIRST
- Verify all benchmarks pass
- Check performance impact
### Pitfall 4: Removing Too Much Logging Too Early
**Wrong approach**:
- Remove diagnostic logging immediately
- Hard to debug if issue returns
**Right approach**:
- Keep logging until fix is verified
- Remove logging in separate commit
- Document what was learned
---
## Additional Resources
### Key Files to Understand
1. `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h`
- TLS SLL push/pop implementation
- Header validation logic
2. `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_config_box.inc`
- HAK_RET_ALLOC macro
- Header write logic
3. `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_types.h`
- Pointer conversion macros
- ptr_user_to_base / ptr_base_to_user
4. `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_refill.inc.h`
- Magazine spill logic
- TLS SLL interaction
### Useful Git Commands
```bash
# Find when header logic changed
git log -p --all -S "0xa0" -- core/
# Find recent changes to TLS SLL
git log --oneline -20 -- core/box/tls_sll_box.h
# Compare current vs previous version
git diff HEAD~5 core/hakmem_tiny_config_box.inc
# Find all references to a function
git grep -n "tls_sll_push" core/
```
### Debugging Commands
```bash
# Check header size configuration
grep -n "TINY_HEADER\|HEADERLESS" core/hakmem_tiny_types.h
# Find all allocation return points
grep -rn "HAK_RET_ALLOC\|return.*user" core/ --include="*.inc"
# Find all TLS SLL push calls
grep -rn "tls_sll_push" core/ --include="*.c" --include="*.inc" -B3 -A3
# Check atomic operations
grep -rn "atomic_thread_fence\|__atomic\|memory_order" core/ --include="*.h"
```
---
## Questions to Answer During Diagnosis
1. **What is 0x31?**
- Is it always 0x31, or does it vary?
- Does it correspond to test data?
- Is it ASCII '1' character?
2. **Where is the header written?**
- In HAK_RET_ALLOC macro?
- In tls_sll_push?
- Somewhere else?
3. **Where is the header read?**
- In tls_sll_pop?
- In allocation path?
4. **Are offsets consistent?**
- Write at offset X
- Read at offset X
- Both use same base pointer?
5. **Are pointer types correct?**
- Push base or user pointer?
- Pop returns base or user pointer?
- Conversions correct?
6. **Is there a fence?**
- Between header write and SLL push?
- Between SLL pop and header read?
7. **Is class_idx valid?**
- In range [0, 7]?
- Matches actual allocation size?
8. **Has this ever worked?**
- Check git history
- Was there a recent breaking change?
---
## Document Version
- **Version**: 1.0
- **Date**: 2025-12-03
- **Author**: System diagnostic documentation
- **Target**: ChatGPT diagnostic agent
- **Estimated completion time**: 5-10 hours
---
## Final Checklist
Before considering the fix complete:
- [ ] Minimal reproducer created and passes
- [ ] Root cause identified and documented
- [ ] Fix implemented with explanation
- [ ] Diagnostic logging removed
- [ ] All baseline tests pass
- [ ] No performance regression
- [ ] Git commit with detailed message
- [ ] This document updated with findings
**Good luck with the diagnosis!**