Created 9 diagnostic and handoff documents (48KB) to guide ChatGPT through systematic diagnosis and fix of TLS SLL header corruption issue. Documents Added: - README_HANDOFF_CHATGPT.md: Master guide explaining 3-doc system - CHATGPT_CONTEXT_SUMMARY.md: Quick facts & architecture (2-3 min read) - CHATGPT_HANDOFF_TLS_DIAGNOSIS.md: 7-step procedure (4-8h timeline) - GEMINI_HANDOFF_SUMMARY.md: Handoff summary for user review - STATUS_2025_12_03_CURRENT.md: Complete project status snapshot - TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md: Deep reference (1,150+ lines) - 6 root cause patterns with code examples - Diagnostic logging instrumentation - Fix templates and validation procedures - TLS_SS_HINT_BOX_DESIGN.md: Phase 1 optimization design (1,148 lines) - HEADERLESS_STABILITY_DEBUG_INSTRUCTIONS.md: Test environment setup - SEGFAULT_INVESTIGATION_FOR_GEMINI.md: Original investigation notes Problem Context: - Baseline (Headerless OFF) crashes with [TLS_SLL_HDR_RESET] - Error: cls=1 base=0x... got=0x31 expect=0xa1 - Blocks Phase 1 validation and Phase 2 progression Expected Outcome: - ChatGPT follows 7-step diagnostic process - Root cause identified (one of 6 patterns) - Surgical fix (1-5 lines) - TC1 baseline completes without crashes 🤖 Generated with Claude Code (https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
TLS SLL Header Corruption Diagnosis & Fix Instructions for ChatGPT
Problem Statement
Symptom:
- Baseline (Headerless OFF) crashes with SIGSEGV
- Error log:
[TLS_SLL_HDR_RESET] cls=1 base=0x7ef296abf8c8 got=0x31 expect=0xa1 count=0
- Location: core/box/tls_sll_box.h, header integrity check during pop operation
Suspected Root Cause: The header byte at offset 0 from the base pointer holds what looks like user data (0x31) instead of the header magic (0xa1). This indicates one of:
- Wrong pointer is being stored in TLS SLL
- Header is not being written correctly before push
- Adjacent block corruption overwrites header
- Header write/read offset mismatch
Impact:
- TLS SLL header reset occurs (entire freelist for class 1 dropped)
- Subsequent allocations may fail or use wrong metadata
- Benchmark crashes with SIGSEGV
- Memory corruption potential
Timeline:
- Discovered during Phase 1 TLS Hint Box benchmarking
- Affects baseline configuration (no hints involved)
- Suggests pre-existing issue in shared TLS SLL code
Investigation Strategy
Phase A: Understand the Error
- Where is header validation happening?
- What does 0x31 represent? (Is it deterministic or random data?)
- Can we reproduce with minimal allocations?
Phase B: Locate Corruption Source
- Where is header supposed to be written?
- Is header being written BEFORE push or after?
- Are there any recent changes to header write logic?
Phase C: Implement Fix
- Add instrumentation to catch corruption early
- Identify exact allocation/free cycle causing problem
- Fix root cause (not just symptom)
Phase D: Validate
- TC1 baseline should complete without crashes
- TC2/TC3 can then be evaluated
- No performance regression
Deep Dive: TLS SLL Header Corruption
What is 0x31?
The error reports got=0x31. Let's understand what this means:
// Expected (header magic for class 1):
0xa1 = 0xa0 (HEADER_MAGIC) | 0x01 (class_idx)
// Got:
0x31 = 0b00110001
= ASCII '1' character
= Some piece of user data or metadata
Questions to answer:
- Is 0x31 always the same, or does it vary? (Deterministic vs random corruption)
- Does 0x31 correspond to any known data pattern in hakmem?
- Does the corruption happen during alloc or free?
- Is 0x31 part of the test program's data?
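As a starting point for these questions, the observed byte can be split into its magic nibble and class nibble. The sketch below assumes the layout stated in this document (0xa0 magic in the high nibble, class index in the low nibble); the `toy_` names are hypothetical and do not exist in hakmem.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from this document; the real definitions live in hakmem. */
#define TOY_HEADER_MAGIC      0xa0u
#define TOY_HEADER_CLASS_MASK 0x0fu

typedef enum {
    TOY_HDR_OK,           /* magic nibble and class both match          */
    TOY_HDR_WRONG_CLASS,  /* magic nibble matches, class index differs  */
    TOY_HDR_NOT_A_HEADER  /* magic nibble is wrong: probably not a header */
} toy_hdr_verdict;

/* Classify the byte found at offset 0 of a popped block. */
static inline toy_hdr_verdict toy_classify_header(uint8_t got, int class_idx) {
    assert(class_idx >= 0 && class_idx <= (int)TOY_HEADER_CLASS_MASK);
    if ((got & 0xf0u) != TOY_HEADER_MAGIC) {
        return TOY_HDR_NOT_A_HEADER;
    }
    if ((got & TOY_HEADER_CLASS_MASK) != (unsigned)class_idx) {
        return TOY_HDR_WRONG_CLASS;
    }
    return TOY_HDR_OK;
}
```

For got=0x31 and class 1 the verdict is TOY_HDR_NOT_A_HEADER: the high nibble is 0x3, so the byte was never a header for any class. That points toward a wrong pointer or an overwrite rather than a wrong class index.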
TLS SLL Header Check Logic
Location: /mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h (around lines 280-320)
// In tls_sll_pop_impl():
if (tiny_class_preserves_header(class_idx)) {
uint8_t* b = (uint8_t*)raw_base;
uint8_t got = *b; // Read byte at offset 0 of base pointer
uint8_t expected = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
if (got != expected) {
// CORRUPTION DETECTED!
fprintf(stderr, "[TLS_SLL_HDR_RESET] cls=%d base=%p got=0x%02x expect=0x%02x ...\n",
class_idx, raw_base, got, expected);
// ... reset logic follows
}
}
Key Points:
- Header is read at (uint8_t*)raw_base (offset 0)
- Expected value is 0xa0 | class_idx
- For class 1: expect 0xa1
- Got 0x31 instead (user data)
When Does This Happen?
The error occurs during tls_sll_pop(), which is called when:
- Freelist refill: Taking blocks from TLS SLL back to unified cache
- Magazine spill: Freelist → TLS SLL transition for overflow
- Allocation path: Pulling blocks from TLS SLL to satisfy malloc
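The header was already wrong by the time the block was popped: the corruption happened either before the push or while the block sat in the TLS SLL, and it is only detected at pop.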
This suggests:
- Either the pointer stored in TLS SLL is wrong (points to wrong location)
- Or the header was never written correctly
- Or adjacent block corruption overwrote the header
- Or there's an offset calculation error between push and pop
Diagnostic Procedure
Step 1: Reproduce with Minimal Test
Create the smallest possible test case:
File: /mnt/workdisk/public_share/hakmem/tests/test_tls_sll_minimal.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main() {
printf("Test 1: Simple alloc/free cycle\n");
for (int i = 0; i < 10; i++) {
void* p = malloc(16); // Class 1
if (p) {
memset(p, 0x31, 16); // Write user data (includes 0x31!)
free(p);
}
}
printf("✓ Test 1 passed\n");
printf("Test 2: Rapid alloc/free (trigger refill)\n");
for (int i = 0; i < 1000; i++) {
void* p = malloc(16);
if (p) {
memset(p, 0x31, 16);
free(p);
}
}
printf("✓ Test 2 passed\n");
printf("Test 3: Multiple sizes\n");
for (int size = 8; size <= 512; size *= 2) {
for (int j = 0; j < 100; j++) {
void* p = malloc(size);
if (p) {
memset(p, 0x31, size);
free(p);
}
}
}
printf("✓ Test 3 passed\n");
printf("Test 4: Heavy churn (trigger SLL push/pop)\n");
void* ptrs[100];
for (int round = 0; round < 10; round++) {
for (int i = 0; i < 100; i++) {
ptrs[i] = malloc(16);
if (ptrs[i]) memset(ptrs[i], 0x31, 16);
}
for (int i = 0; i < 100; i++) {
free(ptrs[i]);
}
}
printf("✓ Test 4 passed\n");
return 0;
}
Build and test:
cd /mnt/workdisk/public_share/hakmem
mkdir -p tests
gcc -o tests/test_tls_sll_minimal tests/test_tls_sll_minimal.c
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
Goal: Find the minimal reproduction:
- If test 1 fails: Early corruption (basic alloc/free)
- If test 2 fails: Refill-related corruption
- If test 3 fails: Class-specific issue
- If test 4 fails: SLL push/pop cycling issue
Step 2: Add Diagnostic Logging
Instrument the header write/read paths:
Instrument Header Write
File: /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_config_box.inc
Find the HAK_RET_ALLOC macro and add logging:
// Add diagnostic logging
#define HAK_RET_ALLOC(base, cls) do { \
fprintf(stderr, "[ALLOC_HEADER_WRITE] base=%p cls=%d\n", (void*)(base), (int)(cls)); \
uint8_t* hdr = (uint8_t*)(base); \
uint8_t magic = (uint8_t)(0xa0 | ((cls) & 0x0f)); \
*hdr = magic; \
fprintf(stderr, "[ALLOC_HEADER_WROTE] base=%p magic=0x%02x (at %p)\n", (void*)(base), *hdr, (void*)hdr); \
__atomic_thread_fence(__ATOMIC_RELEASE); \
hak_user_ptr_t user = ptr_base_to_user((base), (cls)); \
/* %td is the printf length modifier for ptrdiff_t */ \
fprintf(stderr, "[ALLOC_RETURN] user=%p (base=%p + %td)\n", (void*)user, (void*)(base), (char*)user - (char*)(base)); \
return user; \
} while(0)
Instrument Header Read
File: /mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h
Modify the header read/check in tls_sll_pop_impl():
// In tls_sll_pop_impl(), before the check:
if (tiny_class_preserves_header(class_idx)) {
uint8_t* b = (uint8_t*)raw_base;
uint8_t got = *b;
uint8_t expected = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
// NEW DIAGNOSTIC LOGGING:
fprintf(stderr, "[TLS_SLL_POP_CHECK] class=%d raw_base=%p checking at %p\n",
class_idx, raw_base, b);
fprintf(stderr, "[TLS_SLL_POP_READ] got=0x%02x expected=0x%02x\n", got, expected);
if (got != expected) {
fprintf(stderr, "[CORRUPTION_DETECTED] Mismatch! Dumping context...\n");
fprintf(stderr, "[CORRUPTION_CONTEXT] raw_base=%p, offset=%td\n", (void*)raw_base, (char*)b - (char*)raw_base);
// Dump surrounding bytes
fprintf(stderr, "[CORRUPTION_DUMP] Bytes around base: ");
for (int i = -8; i < 16; i++) {
fprintf(stderr, "%02x ", ((uint8_t*)raw_base)[i]);
}
fprintf(stderr, "\n");
// ... existing reset logic
}
}
Instrument SLL Push
File: /mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h
Find tls_sll_push_impl() and add logging:
static inline bool tls_sll_push_impl(..., hak_base_ptr_t ptr, ...) {
fprintf(stderr, "[TLS_SLL_PUSH] class=%d ptr=%p\n", class_idx, ptr);
// Check header BEFORE push
if (tiny_class_preserves_header(class_idx)) {
uint8_t hdr = *(uint8_t*)ptr;
fprintf(stderr, "[TLS_SLL_PUSH_HDR_CHECK] ptr=%p header=0x%02x\n", ptr, hdr);
}
// ... existing push logic
}
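The push-side logging above only prints the header. A small guard that also compares it against the expected value would flag the offending push site directly instead of waiting for pop. This is a minimal sketch under the header layout described in this document; `toy_push_header_ok` and the `[TOY_PUSH_BAD_HDR]` tag are hypothetical names, not part of hakmem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_HEADER_MAGIC      0xa0u
#define TOY_HEADER_CLASS_MASK 0x0fu

/* Returns true if the byte at `base` is the header expected for class_idx.
 * On mismatch it logs the values so the caller's push site can be found. */
static inline bool toy_push_header_ok(const void *base, int class_idx) {
    assert(base != NULL);
    uint8_t got = *(const uint8_t *)base;
    uint8_t expected =
        (uint8_t)(TOY_HEADER_MAGIC | ((unsigned)class_idx & TOY_HEADER_CLASS_MASK));
    if (got != expected) {
        fprintf(stderr, "[TOY_PUSH_BAD_HDR] cls=%d base=%p got=0x%02x expect=0x%02x\n",
                class_idx, base, got, expected);
        return false;
    }
    return true;
}
```

Calling such a guard at the top of the push path, and printing a backtrace or aborting when it returns false, narrows the search to the caller that handed over the bad pointer.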
Build and run:
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal 2>&1 | grep -E "ALLOC|POP|PUSH|CORRUPTION" | head -100
What to look for:
- Do ALLOC_HEADER_WRITE and TLS_SLL_PUSH_HDR_CHECK match?
- Does TLS_SLL_POP_READ show corruption?
- What is the sequence: WRITE → PUSH → POP?
- Are pointers consistent across operations?
Step 3: Examine Header Write Locations
Search for all places headers are written:
cd /mnt/workdisk/public_share/hakmem
grep -rn "= 0xa\|= HEADER_MAGIC\|= TINY_HEADER\|0xa0 |" core/ --include="*.h" --include="*.c" --include="*.inc"
Expected locations:
- core/hakmem_tiny_config_box.inc - HAK_RET_ALLOC macro
- core/box/tls_sll_box.h - Optional header write on SLL push (if needed)
- core/tiny_alloc_fast_push.c - Fast path allocations
- Other allocation paths?
Check each location:
- Is the offset correct? (Should be offset 0 from base)
- Is it written BEFORE or AFTER pushing to TLS SLL?
- Is there an atomic fence to prevent reordering?
- Is the class_idx valid?
Step 4: Examine Pointer Conversion Logic
The key question: Are we storing the right pointer in TLS SLL?
File: /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_types.h
Check the pointer conversion macros:
cd /mnt/workdisk/public_share/hakmem
grep -A5 "ptr_user_to_base\|ptr_base_to_user\|HAK_BASE_FROM_RAW" core/hakmem_tiny_types.h
Critical questions:
- When we free a user pointer, do we convert it to base pointer correctly?
- When we push to TLS SLL, do we push the base pointer or user pointer?
- When we pop from TLS SLL, do we get back the exact same base pointer?
Expected flow:
Alloc: BASE → (write header at BASE) → (convert to USER) → return USER
Free: USER → (convert to BASE) → (push BASE to TLS SLL)
Pop: (pop BASE from TLS SLL) → (read header at BASE) → validate
If any step uses wrong offset, corruption occurs.
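The expected flow can be exercised on a single block in isolation. The sketch below assumes a 1-byte header at offset 0, as this document describes; the `toy_` helpers are stand-ins for ptr_base_to_user / ptr_user_to_base and are not the real hakmem functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_HEADER_SIZE_BYTES 1  /* 1-byte header, per this document */

static inline uint8_t *toy_base_to_user(uint8_t *base) {
    return base + TOY_HEADER_SIZE_BYTES;
}

static inline uint8_t *toy_user_to_base(uint8_t *user) {
    return user - TOY_HEADER_SIZE_BYTES;
}

/* Alloc -> Free -> Pop on one block: the pointer validated at pop must be
 * the same address the header was written to at alloc. */
static inline bool toy_roundtrip_ok(uint8_t *block, uint8_t magic) {
    uint8_t *base = block;
    *base = magic;                              /* Alloc: header at BASE  */
    uint8_t *user = toy_base_to_user(base);     /* Alloc: return USER     */
    user[0] = 0x31;                             /* caller writes its data */
    uint8_t *pushed = toy_user_to_base(user);   /* Free: USER -> BASE     */
    return pushed == base && *pushed == magic;  /* Pop: validate at BASE  */
}
```

If either helper used a different offset, `pushed` would no longer equal `base` and the check would read the caller's 0x31 instead of the header.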
Step 5: Git Blame on Recent Changes
cd /mnt/workdisk/public_share/hakmem
git log --oneline -30
git show b5be708b6 # "Fix potential freelist corruption"
git show c91602f18 # "Fix ptr_user_to_base_blind regression"
git show f3f75ba3d # "Fix magazine spill RAW pointer"
Check: Did any of these changes affect header write logic?
Look for:
- Changes to HAK_RET_ALLOC macro
- Changes to pointer conversion logic
- Changes to TLS SLL push/pop
- Changes to header offset calculations
Step 6: Review Commit History for TLS SLL
cd /mnt/workdisk/public_share/hakmem
git log --oneline --all -- core/box/tls_sll_box.h | head -20
git log -p --all -- core/box/tls_sll_box.h | head -200
Look for:
- When was header logic last changed?
- Were there any defensive fixes recently?
- Any atomic fence changes?
- Any offset calculation changes?
Step 7: Check Phase 1 Configuration
File: /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_types.h
Verify the header configuration:
// Phase 1: headerless = false → headers ON
// Header should be at offset 0 of base pointer
#define TINY_HEADER_SIZE_BYTES 1
#define HEADER_MAGIC 0xa0
Check:
- Is HEADERLESS defined? (Should be undefined for Phase 1)
- Is header size correct? (Should be 1 byte)
- Are offset calculations consistent?
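These configuration checks could also be enforced at compile time. The sketch below uses stand-in macros with the values quoted in this document (1-byte header, 0xa0 magic, 0x0f class mask, classes 0 to 7); the `TOY_` names are hypothetical, and the real macros should be substituted when trying this in hakmem.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the hakmem configuration macros (values from this document). */
#define TOY_HEADER_SIZE_BYTES 1
#define TOY_HEADER_MAGIC      0xa0
#define TOY_HEADER_CLASS_MASK 0x0f
#define TOY_NUM_CLASSES       8

/* The build fails if the header layout assumptions stop holding. */
_Static_assert(TOY_HEADER_SIZE_BYTES == 1,
               "header must be exactly one byte");
_Static_assert((TOY_HEADER_MAGIC & TOY_HEADER_CLASS_MASK) == 0,
               "magic bits and class bits must not overlap");
_Static_assert(TOY_NUM_CLASSES - 1 <= TOY_HEADER_CLASS_MASK,
               "every class index must fit in the class mask");
_Static_assert((TOY_HEADER_MAGIC | TOY_HEADER_CLASS_MASK) <= UINT8_MAX,
               "header value must fit in one byte");
```

A check like this costs nothing at run time and catches a configuration drift before any benchmark is run.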
Likely Root Causes (Narrowed)
Root Cause A: Header Written at Wrong Offset
Symptom: User data appears where header should be
Check:
// In HAK_RET_ALLOC, are we writing at the right place?
// Phase 1: header at offset 0 of base
uint8_t* hdr_ptr = (uint8_t*)base; // Should be offset 0
*hdr_ptr = magic;
// If this was changed to:
uint8_t* hdr_ptr = (uint8_t*)base + 1; // WRONG! User data location
*hdr_ptr = magic;
// Then header is written in user space, gets overwritten
How to verify:
cd /mnt/workdisk/public_share/hakmem
grep -n "HAK_RET_ALLOC" core/hakmem_tiny_config_box.inc
# Check that header write is at (uint8_t*)base, not base+offset
Fix: Ensure header write is at (uint8_t*)base, not base+offset.
Root Cause B: User Pointer Pushed Instead of Base Pointer
Symptom: SLL contains user pointers, but pop expects base pointers
Sequence:
// During free:
void* user_ptr = ...; // User pointer (base + 1 for Phase 1)
tls_sll_push(class_idx, user_ptr); // WRONG! Should be base pointer
// During pop:
void* popped = tls_sll_pop(class_idx); // Gets user_ptr
uint8_t header = *(uint8_t*)popped; // Reads at user_ptr, not base_ptr!
// This reads user data instead of header
How to verify:
cd /mnt/workdisk/public_share/hakmem
grep -rn "tls_sll_push" core/ --include="*.c" --include="*.inc" -A3 -B3
# Check that all pushes use base pointer, not user pointer
Fix: Convert user pointer to base pointer before pushing:
hak_base_ptr_t base = ptr_user_to_base(user_ptr, class_idx);
tls_sll_push(class_idx, base, cap);
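A toy model shows how this pattern would produce exactly the reported byte when the application fills its buffer with 0x31, as the minimal test program does. The block size and layout here are assumptions for illustration, and the model ignores the next-pointer that a real SLL node would also store in the block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_HEADER_SIZE_BYTES 1

/* Simulate one alloc/free of a class-1 block and return the byte that a
 * pop-time header check would read.
 * push_user_ptr != 0 models the bug: the USER pointer is pushed as-is. */
static inline uint8_t toy_byte_seen_at_pop(int push_user_ptr) {
    uint8_t block[TOY_HEADER_SIZE_BYTES + 16];
    uint8_t *base = block;
    *base = 0xa1;                                  /* header for class 1 */
    uint8_t *user = base + TOY_HEADER_SIZE_BYTES;
    memset(user, 0x31, 16);                        /* same fill as the test program */
    uint8_t *pushed = push_user_ptr ? user : user - TOY_HEADER_SIZE_BYTES;
    return *pushed;                                /* what pop validates */
}
```

With the correct conversion the check sees 0xa1; with the user pointer pushed unchanged it sees 0x31, matching the got=0x31 expect=0xa1 log line. This does not prove Root Cause B, but it makes it a strong candidate if the benchmark also writes '1' characters into its buffers.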
Root Cause C: Atomic Fence Missing
Symptom: Compiler reorders header write after SLL push
Check:
*(uint8_t*)base = header_magic; // Instruction 1
__atomic_thread_fence(__ATOMIC_RELEASE); // Fence (required!)
tls_sll_push(class_idx, base); // Instruction 2
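If the fence is missing, the CPU/compiler might:
- Schedule the push before the header write
- Let another thread see an unprepared node in the SLL
- Let pop read an unwritten header → corruption
Note that within a single thread, a pop always observes that thread's own earlier header write, so this pattern only applies if blocks can cross threads.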
How to verify:
cd /mnt/workdisk/public_share/hakmem
grep -rn -B5 "tls_sll_push" core/ --include="*.c" --include="*.inc" | grep -E "fence|barrier|atomic"
# Check that fence exists between header write and push
Fix: Add __atomic_thread_fence(__ATOMIC_RELEASE) after header write, before SLL push.
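For reference, a release fence only has an effect when it pairs with an acquire on the consuming side of a cross-thread hand-off. The sketch below uses C11 atomics and a hypothetical one-slot hand-off (`toy_shared_slot`); it is not a hakmem structure, and whether hakmem has such a cross-thread path is not stated in this document.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cross-thread hand-off slot; starts out NULL. */
static _Atomic(uint8_t *) toy_shared_slot;

/* Producer: write the header, then publish the block. */
static inline void toy_publish(uint8_t *base, uint8_t magic) {
    *base = magic;
    /* Orders the header write before the publishing store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&toy_shared_slot, base, memory_order_relaxed);
}

/* Consumer: take the block, then it is safe to read the header. */
static inline uint8_t *toy_consume(void) {
    uint8_t *base =
        atomic_exchange_explicit(&toy_shared_slot, NULL, memory_order_relaxed);
    /* Pairs with the release fence in toy_publish(). */
    atomic_thread_fence(memory_order_acquire);
    return base;
}
```

If push and pop for a class always run on the same thread, program order already guarantees that pop sees the header, and a missing fence is unlikely to explain the 0x31 byte.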
Root Cause D: Magazine Spill Pointer Wrapping
Symptom: Magazine stores RAW pointer, SLL expects BASE pointer
Already Fixed: Commit f3f75ba3d added HAK_BASE_FROM_RAW() wrapper
Verify:
cd /mnt/workdisk/public_share/hakmem
grep -n "HAK_BASE_FROM_RAW\|magazine.*spill" core/hakmem_tiny_refill.inc.h
# Check line 228 or nearby has the fix
Expected code:
void* p = mag->items[--mag->top].ptr;
hak_base_ptr_t base_p = HAK_BASE_FROM_RAW(p); // Must have this!
if (!tls_sll_push(class_idx, base_p, cap)) {
// ...
}
Fix: If missing, add HAK_BASE_FROM_RAW() wrapper around raw pointer.
Root Cause E: Class Index Mismatch
Symptom: Wrong class_idx used for header magic
Check:
int class_idx = ...; // Where does this come from?
uint8_t magic = (uint8_t)(0xa0 | (class_idx & 0x0f));
// If class_idx is wrong (e.g., -1 or 999), magic will be corrupt
How to verify:
cd /mnt/workdisk/public_share/hakmem
grep -rn "class_idx\|tiny_size_to_class" core/ --include="*.h" | grep -E "= -1|= 0xff"
# Look for places where class_idx might be invalid
Fix: Validate class_idx is in range [0, 7] before using:
if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) {
fprintf(stderr, "[ERROR] Invalid class_idx: %d\n", class_idx);
abort();
}
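Before spending time on this pattern, it helps to check whether any class index could produce the observed byte at all. The sketch below assumes the magic computation shown above (0xa0 OR-ed with the masked class index); the `toy_` function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_HEADER_MAGIC      0xa0u
#define TOY_HEADER_CLASS_MASK 0x0fu

/* Same computation as the header write: magic OR masked class index. */
static inline uint8_t toy_magic_for_class(int class_idx) {
    return (uint8_t)(TOY_HEADER_MAGIC |
                     ((unsigned)class_idx & TOY_HEADER_CLASS_MASK));
}

/* True if some class index in [lo, hi] would produce `byte` as its header. */
static inline bool toy_some_class_yields(uint8_t byte, int lo, int hi) {
    for (int c = lo; c <= hi; c++) {
        if (toy_magic_for_class(c) == byte) {
            return true;
        }
    }
    return false;
}
```

Because of the mask, even an invalid index such as -1 or 999 yields a byte in the 0xa0 to 0xaf range (0xaf and 0xa7 respectively). No class index yields 0x31, so under this assumption a class mismatch alone cannot explain the logged value.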
Root Cause F: Offset Calculation Error
Symptom: Header written at base, but read at base+offset (or vice versa)
Check:
// During alloc:
*(uint8_t*)base = magic; // Write at base+0
user = base + 1; // User at base+1 (Phase 1)
// During free/pop:
base = user - 1; // Should recover original base
uint8_t hdr = *(uint8_t*)base; // Should read at base+0
// BUT if conversion is wrong:
base = user - 0; // WRONG! Off by one
uint8_t hdr = *(uint8_t*)base; // Reads at wrong location
How to verify:
cd /mnt/workdisk/public_share/hakmem
grep -A10 "ptr_user_to_base_impl\|ptr_base_to_user_impl" core/hakmem_tiny_types.h
# Check offset calculations are consistent
Fix: Ensure offset calculations match between:
- ptr_base_to_user (add offset)
- ptr_user_to_base (subtract same offset)
Proposed Fix Patterns
Based on diagnostic results, the fix will likely be one of:
Fix Pattern 1: Restore Header Write Logic
Problem: Header write uses wrong offset or wrong pointer
File: /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_config_box.inc
#define HAK_RET_ALLOC(base, cls) do { \
/* Write header FIRST at offset 0 of base */ \
*(uint8_t*)(base) = (uint8_t)(0xa0 | ((cls) & 0x0f)); \
/* Ensure header write completes before next operation */ \
__atomic_thread_fence(__ATOMIC_RELEASE); \
/* Now convert to user pointer and return */ \
return ptr_base_to_user((base), (cls)); \
} while(0)
Fix Pattern 2: Add Missing Fence
Problem: Compiler reorders header write after SLL push
File: /mnt/workdisk/public_share/hakmem/core/tiny_alloc_fast_push.c or core/hakmem_tiny_free.inc
// Before push to TLS SLL:
*(uint8_t*)base = header_magic;
__atomic_thread_fence(__ATOMIC_RELEASE); // ADD THIS LINE
tls_sll_push(class_idx, base, cap);
Fix Pattern 3: Fix Pointer Type in Push
Problem: User pointer pushed instead of base pointer
File: Multiple locations (search for tls_sll_push)
// In free path:
void* user_ptr = ptr; // From user
hak_base_ptr_t base_ptr = ptr_user_to_base(user_ptr, class_idx); // Convert!
if (!tls_sll_push(class_idx, base_ptr, cap)) { // Push base, not user
// ...
}
Fix Pattern 4: Validate Inputs
Problem: Invalid class_idx or pointer values
File: /mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h
// At entry of tls_sll_push_impl():
static inline bool tls_sll_push_impl(..., hak_base_ptr_t ptr, ...) {
// Validate inputs
if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) {
fprintf(stderr, "[ERROR] Invalid class_idx: %d\n", class_idx);
return false;
}
if (!ptr || ptr == (void*)-1) {
fprintf(stderr, "[ERROR] Invalid pointer: %p\n", ptr);
return false;
}
// ... existing logic
}
Fix Pattern 5: Check Magazine Spill
Problem: Magazine spill uses wrong pointer type
File: /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_refill.inc.h
// Around line 228 (magazine spill):
void* p = mag->items[--mag->top].ptr;
// MUST convert RAW to BASE before pushing:
hak_base_ptr_t base_p = HAK_BASE_FROM_RAW(p); // Essential!
if (!tls_sll_push(class_idx, base_p, cap)) {
// ... error handling
}
Verify fix exists:
cd /mnt/workdisk/public_share/hakmem
grep -n "HAK_BASE_FROM_RAW" core/hakmem_tiny_refill.inc.h
# Should see it used before tls_sll_push
Fix Pattern 6: Fix Offset Calculation
Problem: Pointer conversion uses wrong offset
File: /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_types.h
// Verify Phase 1 offsets:
static inline hak_user_ptr_t ptr_base_to_user_impl(hak_base_ptr_t base, int cls) {
if (tiny_class_preserves_header(cls)) {
return (hak_user_ptr_t)((uint8_t*)base + TINY_HEADER_SIZE_BYTES); // +1 for Phase 1
}
return (hak_user_ptr_t)base;
}
static inline hak_base_ptr_t ptr_user_to_base_impl(hak_user_ptr_t user, int cls) {
if (tiny_class_preserves_header(cls)) {
return (hak_base_ptr_t)((uint8_t*)user - TINY_HEADER_SIZE_BYTES); // -1 for Phase 1
}
return (hak_base_ptr_t)user;
}
Check: Ensure +1 and -1 match, and TINY_HEADER_SIZE_BYTES is 1.
Debug Workflow
Quick Debug Cycle
cd /mnt/workdisk/public_share/hakmem
# 1. Make changes to source
# ... edit files ...
# 2. Rebuild
make clean && make shared -j8
# 3. Test with minimal reproducer
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal 2>&1 | tee debug.log
# 4. Check for errors
grep "TLS_SLL_HDR_RESET\|CORRUPTION\|SIGSEGV" debug.log
# 5. Analyze log patterns
grep -E "ALLOC|PUSH|POP" debug.log | head -50
Advanced Debug: GDB
cd /mnt/workdisk/public_share/hakmem
# Build with debug symbols
make clean
CFLAGS="-g -O0" make shared -j8
# Run under GDB
gdb --args ./tests/test_tls_sll_minimal
GDB commands:
(gdb) set environment LD_PRELOAD ./libhakmem.so
(gdb) break tls_sll_push_impl
(gdb) break tls_sll_pop_impl
(gdb) run
(gdb) print /x *(uint8_t*)ptr # Check header byte
(gdb) print class_idx
(gdb) backtrace
(gdb) continue
Memory Corruption Detection
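Enable AddressSanitizer (ASan installs its own malloc/free interceptors and expects its runtime to be loaded first, so combining it with an LD_PRELOAD allocator may need extra setup):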
cd /mnt/workdisk/public_share/hakmem
make clean
CFLAGS="-fsanitize=address -g" LDFLAGS="-fsanitize=address" make shared -j8
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
ASan will catch:
- Buffer overflows
- Use-after-free
- Double-free
- Invalid pointer arithmetic
After Applying Fix
Step 1: Rebuild and Test Minimal Reproducer
cd /mnt/workdisk/public_share/hakmem
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
Expected:
- All tests pass
- No [TLS_SLL_HDR_RESET] errors
- No SIGSEGV crashes
Step 2: Run TC1 Baseline Test
cd /mnt/workdisk/public_share/hakmem
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so timeout 20 ./mimalloc-bench/out/bench/sh8bench 2>&1 | tail -20
Expected:
- "Total elapsed time..." message
- No SIGSEGV
- Completion within timeout
Step 3: Run Full Benchmark Suite
cd /mnt/workdisk/public_share/hakmem
# cfrac test
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 2>&1 | head -10
# larson test
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 2>&1 | tail -10
# sh6bench test
LD_PRELOAD=./libhakmem.so timeout 20 ./mimalloc-bench/out/bench/sh6bench 2>&1 | tail -5
Expected: All pass without crashes or corruption errors
Step 4: Regression Check
Ensure fix doesn't break other configurations:
cd /mnt/workdisk/public_share/hakmem
# Test Phase 2 (headerless=true) - if implemented
# ... config changes ...
# make clean && make shared -j8
# LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
# Test with different workloads
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/mstress 10 2
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/rptest 10
Step 5: Performance Check
Verify no performance regression:
cd /mnt/workdisk/public_share/hakmem
# Before fix (save baseline):
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep "Total elapsed"
# Note: May crash, but if it runs, record time
# After fix:
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep "Total elapsed"
# Compare: Should be within 5% of baseline (if baseline worked)
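The 5% comparison is simple arithmetic on the two "Total elapsed" values. A minimal sketch, treating only a slowdown as a regression; the function names are hypothetical and the timings must be copied in by hand from the benchmark output.

```c
#include <assert.h>
#include <stdbool.h>

/* Percent change of `after_s` relative to `before_s`; positive means slower. */
static inline double toy_regression_pct(double before_s, double after_s) {
    assert(before_s > 0.0);
    return (after_s - before_s) / before_s * 100.0;
}

/* True if the run is no more than limit_pct slower than the baseline. */
static inline bool toy_within_budget(double before_s, double after_s,
                                     double limit_pct) {
    return toy_regression_pct(before_s, after_s) <= limit_pct;
}
```

For example, a baseline of 10.0 s and a post-fix time of 10.4 s is a 4% slowdown and passes a 5% budget, while 10.6 s does not. A single run is noisy, so comparing the best or median of several runs is safer.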
Step 6: Remove Diagnostic Logging
After fix is confirmed, remove debug logging:
cd /mnt/workdisk/public_share/hakmem
# Remove fprintf statements added for diagnosis
# Restore original HAK_RET_ALLOC macro
# Restore original tls_sll_push/pop implementations
# Rebuild clean version
make clean
make shared -j8
# Final test
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
LD_PRELOAD=./libhakmem.so timeout 20 ./mimalloc-bench/out/bench/sh8bench
Step 7: Commit with Detailed Message
cd /mnt/workdisk/public_share/hakmem
git status
git add [modified files]
git commit -m "Fix TLS SLL header corruption
Problem: Header magic byte being corrupted during allocation/free path,
causing [TLS_SLL_HDR_RESET] errors and SIGSEGV crashes in baseline tests.
Symptoms:
- sh8bench crashes with SIGSEGV
- Error: [TLS_SLL_HDR_RESET] cls=1 got=0x31 expect=0xa1
- Header validation fails during tls_sll_pop()
Root cause: [DESCRIBE WHAT WAS WRONG - e.g.:]
- User pointer was being pushed to TLS SLL instead of base pointer
- Header read at wrong offset due to pointer type mismatch
- Missing atomic fence allowed reordering of header write
Solution: [DESCRIBE WHAT WAS FIXED - e.g.:]
- Convert user pointer to base pointer before tls_sll_push()
- Add atomic fence after header write, before SLL operations
- Validate pointer types at SLL entry points
Changes:
- core/hakmem_tiny_config_box.inc: Fixed HAK_RET_ALLOC header offset
- core/box/tls_sll_box.h: Added pointer validation
- core/hakmem_tiny_free.inc: Convert to base ptr before push
Validation:
- test_tls_sll_minimal passes (4/4 tests)
- sh8bench baseline completes successfully
- cfrac/larson/sh6bench pass without crashes
- No performance regression (<5% variance)
Verified: TC1 baseline stability restored, ready for Phase 1 testing"
Expected Timeline
Phase A: Understanding (1-2 hours)
- Read this document
- Understand TLS SLL architecture
- Review header mechanism
- Locate relevant code sections
Phase B: Diagnosis (2-4 hours)
- Create minimal test case
- Add diagnostic logging
- Run tests and analyze logs
- Identify root cause
Phase C: Fix Implementation (1-2 hours)
- Implement surgical fix
- Remove diagnostic logging
- Clean build and test
Phase D: Validation (1-2 hours)
- Run full test suite
- Verify no regressions
- Performance check
- Document and commit
Total: 5-10 hours for complete diagnosis, fix, and validation
Success Criteria
Must Have:
- No [TLS_SLL_HDR_RESET] errors in baseline tests
- sh8bench completes without SIGSEGV
- Minimal test suite passes (4/4 tests)
- Fix is surgical (minimal code changes)
- Root cause documented clearly
Nice to Have:
- Performance neutral (<5% variance)
- Fix applies to all configurations
- Additional validation checks added
- Regression tests added
Verification:
cd /mnt/workdisk/public_share/hakmem
LD_PRELOAD=./libhakmem.so timeout 20 ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep -E "Total elapsed|RESET|SIGSEGV"
# Should show "Total elapsed time" with no errors
Common Pitfalls
Pitfall 1: Fixing Symptoms, Not Root Cause
Wrong approach:
// Just disable the check
if (got != expected) {
// Do nothing, ignore corruption
}
Right approach:
- Understand WHY corruption happens
- Fix the source (wrong pointer, wrong offset, etc.)
- Keep the validation check enabled
Pitfall 2: Over-Engineering
Wrong approach:
- Rewrite entire TLS SLL system
- Add complex locking mechanisms
- Change fundamental architecture
Right approach:
- Minimal fix (usually 1-5 lines)
- Fix pointer conversion or offset
- Add fence if missing
Pitfall 3: Ignoring Test Results
Wrong approach:
- Fix compiles, assume it works
- Skip minimal reproducer
- Don't verify with benchmarks
Right approach:
- Test with minimal case FIRST
- Verify all benchmarks pass
- Check performance impact
Pitfall 4: Removing Too Much Logging Too Early
Wrong approach:
- Remove diagnostic logging immediately
- Hard to debug if issue returns
Right approach:
- Keep logging until fix is verified
- Remove logging in separate commit
- Document what was learned
Additional Resources
Key Files to Understand
- /mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h
  - TLS SLL push/pop implementation
  - Header validation logic
- /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_config_box.inc
  - HAK_RET_ALLOC macro
  - Header write logic
- /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_types.h
  - Pointer conversion macros
  - ptr_user_to_base / ptr_base_to_user
- /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_refill.inc.h
  - Magazine spill logic
  - TLS SLL interaction
Useful Git Commands
# Find when header logic changed
git log -p --all -S "0xa0" -- core/
# Find recent changes to TLS SLL
git log --oneline -20 -- core/box/tls_sll_box.h
# Compare current vs previous version
git diff HEAD~5 core/hakmem_tiny_config_box.inc
# Find all references to a function
git grep -n "tls_sll_push" core/
Debugging Commands
# Check header size configuration
grep -n "TINY_HEADER\|HEADERLESS" core/hakmem_tiny_types.h
# Find all allocation return points
grep -rn "HAK_RET_ALLOC\|return.*user" core/ --include="*.inc"
# Find all TLS SLL push calls
grep -rn "tls_sll_push" core/ --include="*.c" --include="*.inc" -B3 -A3
# Check atomic operations
grep -rn "atomic_thread_fence\|__atomic\|memory_order" core/ --include="*.h"
Questions to Answer During Diagnosis
- What is 0x31?
  - Is it always 0x31, or does it vary?
  - Does it correspond to test data?
  - Is it ASCII '1' character?
- Where is the header written?
  - In HAK_RET_ALLOC macro?
  - In tls_sll_push?
  - Somewhere else?
- Where is the header read?
  - In tls_sll_pop?
  - In allocation path?
- Are offsets consistent?
  - Write at offset X
  - Read at offset X
  - Both use same base pointer?
- Are pointer types correct?
  - Push base or user pointer?
  - Pop returns base or user pointer?
  - Conversions correct?
- Is there a fence?
  - Between header write and SLL push?
  - Between SLL pop and header read?
- Is class_idx valid?
  - In range [0, 7]?
  - Matches actual allocation size?
- Has this ever worked?
  - Check git history
  - Was there a recent breaking change?
Document Version
- Version: 1.0
- Date: 2025-12-03
- Author: System diagnostic documentation
- Target: ChatGPT diagnostic agent
- Estimated completion time: 5-10 hours
Final Checklist
Before considering the fix complete:
- Minimal reproducer created and passes
- Root cause identified and documented
- Fix implemented with explanation
- Diagnostic logging removed
- All baseline tests pass
- No performance regression
- Git commit with detailed message
- This document updated with findings
Good luck with the diagnosis!