# ChatGPT Task: TLS SLL Header Corruption Diagnosis & Fix
**Status**: BLOCKING - System instability detected in baseline configuration
**Priority**: CRITICAL
**Assigned to**: Claude (ChatGPT model)
**Expected Duration**: 4-8 hours

---
## Executive Summary
The hakmem memory allocator baseline configuration crashes with a critical header corruption error:
```
[TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1
```
This occurs in **shared code paths** (not Phase 1 specific), blocking all further development and validation.
**Your Task**: Diagnose and fix this issue using the comprehensive diagnostic guide.

---
## What You Need to Know
### Context
- **Project**: hakmem - custom memory allocator with "Box Theory" architecture
- **Language**: C
- **Current Phase**: Phase 1 implementation + Phase 2 (Headerless) planning
- **Problem**: Baseline test crashes before completing benchmarks
- **Error Location**: `core/box/tls_sll_box.h` - header validation during TLS SLL pop
### The Error
When a block is popped from the TLS SLL (Thread-Local Singly Linked List), the header validation performs this check:
```c
uint8_t got = *b;                     // Read byte at offset 0 of base pointer
uint8_t expected = 0xa0 | class_idx;  // For class 1: 0xa1
if (got != expected) {
    // ERROR DETECTED - got 0x31 instead of 0xa1
}
```
The header byte contains user data (0x31 is the ASCII character '1') instead of the expected magic value (0xa1).
**This means** one of the following:
1. Wrong pointer was stored in TLS SLL
2. Header was not written before pushing to TLS SLL
3. Header was overwritten after pushing
4. Offset calculation is wrong
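
The check shown earlier in this section can be expressed as a pair of small helpers. This is a minimal sketch, not the code in `core/box/tls_sll_box.h`: the names `hdr_expected` and `hdr_is_valid` are hypothetical, and the 4-bit class mask is an assumption inferred from the `0xa0 | class_idx` encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: high nibble carries the magic, low nibble the size class. */
#define HDR_MAGIC      0xa0u
#define HDR_CLASS_MASK 0x0fu

/* Header byte a valid block of the given class should carry. */
static inline uint8_t hdr_expected(unsigned class_idx) {
    assert(class_idx <= HDR_CLASS_MASK);  /* class must fit in the low nibble */
    return (uint8_t)(HDR_MAGIC | (class_idx & HDR_CLASS_MASK));
}

/* True when the byte at offset 0 of `base` matches the expected header. */
static inline bool hdr_is_valid(const uint8_t *base, unsigned class_idx) {
    return *base == hdr_expected(class_idx);
}
```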
---
## Your Step-by-Step Task
### Step 1: Read the Comprehensive Diagnostic Document
**File**: `/mnt/workdisk/public_share/hakmem/docs/TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md`
This 1,150+ line document contains:
- 6 detailed root cause patterns with code examples
- Minimal test case template (test_tls_sll_minimal.c)
- Diagnostic logging instrumentation points
- Fix patterns with code snippets
- 7-step validation procedure
**Action**: Read the entire document and understand the investigation methodology.

---
### Step 2: Reproduce the Error with Minimal Test Case
Create `/mnt/workdisk/public_share/hakmem/tests/test_tls_sll_minimal.c` based on the template in the diagnostic document.
```bash
cd /mnt/workdisk/public_share/hakmem
# Build minimal test
gcc -g -O1 -I./core -I./core/box \
tests/test_tls_sll_minimal.c \
-L. -lhakmem -lpthread -o test_minimal
# Run (should crash with TLS_SLL_HDR_RESET error)
./test_minimal 2>&1 | grep -E "TLS_SLL_HDR_RESET|Segmentation"
```
**Expected Output**: The test should reproduce the header corruption within the first 100-1000 allocations.
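
For orientation, a reproducer of this kind might look like the sketch below. It is not the template from the diagnostic document: `stress_small_blocks`, the batch size, and the block size passed by the caller are all hypothetical, and it relies only on standard `malloc`/`free`, which are assumed to resolve to hakmem when the program is linked against `-lhakmem`. Filling every block with `'1'` means user data mistaken for a header would read as 0x31. Which request size maps to class 1 is not stated here and needs to be checked against the allocator's size classes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BATCH 64  /* hypothetical number of live blocks per round */

/* Allocate BATCH blocks of `size` bytes, fill them with '1' (0x31), then
 * free them, for `rounds` rounds. Returns the number of successful
 * allocations. */
static size_t stress_small_blocks(size_t rounds, size_t size) {
    void *blocks[BATCH];
    size_t ok = 0;

    assert(size > 0);
    for (size_t r = 0; r < rounds; r++) {
        for (size_t i = 0; i < BATCH; i++) {
            blocks[i] = malloc(size);
            if (blocks[i] != NULL) {
                memset(blocks[i], '1', size);  /* recognizable user data */
                ok++;
            }
        }
        /* Free everything so that later rounds reuse the freed blocks. */
        for (size_t i = 0; i < BATCH; i++) {
            free(blocks[i]);  /* free(NULL) is a no-op */
        }
    }
    return ok;
}
```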

---
### Step 3: Add Diagnostic Logging
Instrument the following locations to capture when header corruption occurs:
**Location A**: `core/hakmem_tiny_free.inc` - Header write before TLS SLL push
```c
// Around line 550: Before tls_sll_push()
// ADD LOGGING:
fprintf(stderr, "[HEADER_WRITE] base=%p, offset=%zu, writing 0x%02x\n",
base, offset, (HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK)));
```
**Location B**: `core/box/tls_sll_box.h` - Header read during pop
```c
// Around line 282-303: In tls_sll_pop_impl()
// ADD LOGGING:
fprintf(stderr, "[HEADER_READ] base=%p, got=0x%02x, expected=0x%02x\n",
raw_base, got, expected);
```
**Location C**: `core/hakmem_tiny_refill.inc.h` - Magazine spill
```c
// Around line 228: Before/after tls_sll_push()
// ADD LOGGING:
fprintf(stderr, "[SPILL] class=%d, ptr=%p (wrapping to base)\n", class_idx, p);
```
**Action**: Add detailed logging to identify which allocation/free cycle causes corruption.
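
Because the logging sits on hot paths, it helps to compile it out unless the debug flag is set. The sketch below assumes the `HAKMEM_TINY_DEBUG_LOGGING` define from the Step 4 build line is what gates it. `HDR_LOG` and `hdr_format_read` are hypothetical helpers, not existing hakmem symbols, and the format string simply mirrors the one suggested for Location B.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#ifndef HAKMEM_TINY_DEBUG_LOGGING
#define HAKMEM_TINY_DEBUG_LOGGING 0
#endif

/* Hypothetical wrapper: expands to fprintf(stderr, ...) when logging is
 * enabled and to a no-op otherwise, so release builds pay nothing. */
#if HAKMEM_TINY_DEBUG_LOGGING
#define HDR_LOG(...) fprintf(stderr, __VA_ARGS__)
#else
#define HDR_LOG(...) ((void)0)
#endif

/* Format one [HEADER_READ] record into `buf`; returns snprintf's result. */
static int hdr_format_read(char *buf, size_t n, void *base,
                           unsigned got, unsigned expected) {
    return snprintf(buf, n,
                    "[HEADER_READ] base=%p, got=0x%02x, expected=0x%02x\n",
                    base, got, expected);
}
```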

---
### Step 4: Run Diagnostic Test with Logging
```bash
# Rebuild with logging enabled
make clean
make shared -j8 EXTRA_CFLAGS="-g -O1 -DHAKMEM_TINY_DEBUG_LOGGING=1"
# Run minimal test and capture log
./test_minimal 2>&1 | tee diagnostic_output.txt
# Analyze log to find last successful write before corruption
grep HEADER_WRITE diagnostic_output.txt | tail -10
# Match on "got=0x31" so that addresses containing 0x31 do not produce false hits
grep HEADER_READ diagnostic_output.txt | grep -A1 -B1 "got=0x31"
```
**Expected Result**: Log will show exact allocation/free sequence leading to corruption.
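
If the log grows too large for `grep` alone, the same filter can be applied programmatically. The sketch below parses one line in the `[HEADER_READ]` format suggested for Location B and reports whether the observed byte differs from the expected one. `header_read_mismatch` is a hypothetical helper, and it assumes the log lines keep exactly that format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Returns true if `line` is a [HEADER_READ] record whose `got` byte differs
 * from its `expected` byte. On a successful parse the two values are stored
 * through the out-parameters; other lines return false and leave them alone. */
static bool header_read_mismatch(const char *line,
                                 unsigned *got, unsigned *expected) {
    unsigned g, e;

    /* %*[^,] skips the base address; the literal "0x" is consumed before %x. */
    if (sscanf(line, "[HEADER_READ] base=%*[^,], got=0x%x, expected=0x%x",
               &g, &e) != 2) {
        return false;
    }
    *got = g;
    *expected = e;
    return g != e;
}
```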

---
### Step 5: Identify Root Cause (One of Six Patterns)
Based on diagnostic logs, match against these patterns from the diagnostic document:
1. **RAW Pointer vs BASE Pointer**: Wrong pointer type passed to tls_sll_push()
2. **Header Offset Mismatch**: Writing at offset 1, reading at offset 0
3. **Atomic Fence Missing**: Compiler reordering causing write-after-push
4. **Adjacent Block Overflow**: User data from preceding block overwrites header
5. **Class Index Mismatch**: Push with class_idx A, pop as class_idx B
6. **Headerless Mode Interference**: Mixed header/headerless logic
**Action**: Determine which pattern applies to your findings.
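
A quick look at the corrupt byte can narrow the list before reading code. The heuristic below is an assumption of this handoff, not something taken from the diagnostic document: if the high nibble still holds the 0xa magic but the class differs, a class index mismatch (pattern 5) is plausible. If the magic is absent, as with 0x31, the byte was probably never a header at that address, which points toward the pointer, offset, or overwrite patterns. User data can coincidentally contain 0xa?, so treat the result as a hint only. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    HDR_OK,            /* magic and class both match */
    HDR_WRONG_CLASS,   /* magic present, class differs (consistent with pattern 5) */
    HDR_NOT_A_HEADER   /* magic absent: byte does not look like a header */
} hdr_triage_t;

/* Classify the byte read at pop time against the class being popped.
 * Assumes the 0xa0 magic in the high nibble and the class in the low nibble. */
static hdr_triage_t hdr_triage(uint8_t got, unsigned class_idx) {
    assert(class_idx <= 0x0fu);
    if ((got & 0xf0u) != 0xa0u) {
        return HDR_NOT_A_HEADER;
    }
    if ((got & 0x0fu) != class_idx) {
        return HDR_WRONG_CLASS;
    }
    return HDR_OK;
}
```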

---
### Step 6: Implement Surgical Fix
Once root cause is identified, apply a minimal fix (typically 1-5 lines):
**Example fixes** (from diagnostic document):
```c
// Pattern 1 - RAW vs BASE pointer:
// WRONG:
tls_sll_push(class_idx, p, size); // p is RAW pointer
// FIXED:
hak_base_ptr_t base = HAK_BASE_FROM_RAW(p);
tls_sll_push(class_idx, base, size);

// Pattern 2 - Offset mismatch:
// WRONG:
*(uint8_t*)((char*)base + 1) = header; // Writing at offset 1
// In pop: uint8_t h = *((uint8_t*)base); // Reading at offset 0
// FIXED:
*(uint8_t*)base = header; // Consistent offset

// Pattern 3 - Atomic fence missing (atomic_thread_fence needs <stdatomic.h>):
// WRONG:
*hdr = magic;
tls_sll_push(...);
// FIXED:
*hdr = magic;
atomic_thread_fence(memory_order_release); // Prevent reordering
tls_sll_push(...);
```
**Action**: Apply fix to source code and rebuild.
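
To see why an offset mismatch (pattern 2) would produce exactly the reported symptom, consider the toy model below. It is a standalone sketch and does not touch hakmem: the block is a local buffer pre-filled with `'1'` to mimic stale user data, and `toy_header_roundtrip` is a hypothetical helper. Writing the header at offset 1 and reading at offset 0 yields 0x31, while a consistent offset yields the header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { TOY_BLOCK_SIZE = 16 };

/* Fill a block with '1' (0x31), write `header` at `write_off`, and return
 * the byte a pop-side check would read at `read_off`. */
static uint8_t toy_header_roundtrip(size_t write_off, size_t read_off,
                                    uint8_t header) {
    uint8_t block[TOY_BLOCK_SIZE];

    assert(write_off < TOY_BLOCK_SIZE && read_off < TOY_BLOCK_SIZE);
    memset(block, '1', sizeof block);  /* stale user data */
    block[write_off] = header;         /* push side writes the header */
    return block[read_off];            /* pop side validates this byte */
}
```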

---
### Step 7: Validate Fix
```bash
# Step 7a: Run minimal test
./test_minimal 2>&1 | grep -E "TLS_SLL_HDR_RESET|passed|failed"
# Step 7b: Run baseline benchmark
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | \
grep -E "TLS_SLL_HDR_RESET|Total|PASSED|FAILED"
# Step 7c: Run cfrac (memory intensive)
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 2>&1 | \
grep -E "error|TLS_SLL_HDR_RESET|Total"
# Step 7d: Check for regressions
make test -j8 FILTER="tls_sll"
```
**Success Criteria**:
- ✅ Minimal test completes without TLS_SLL_HDR_RESET
- ✅ sh8bench runs to completion (several minutes)
- ✅ cfrac completes without errors
- ✅ All unit tests pass
- ✅ No performance regression (< 5%)
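
The 5% budget can be checked mechanically once a baseline number exists. The sketch below assumes the benchmark reports a throughput figure where higher is better (for example operations per second). The document does not say which metric is used, so adapt the sign if the figure is a run time. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Percentage by which the candidate throughput falls below the baseline.
 * Negative values mean the candidate is faster. */
static double regression_pct(double baseline_ops, double candidate_ops) {
    assert(baseline_ops > 0.0);
    return (baseline_ops - candidate_ops) / baseline_ops * 100.0;
}

/* True when the regression stays under the 5% budget. */
static bool within_budget(double baseline_ops, double candidate_ops) {
    return regression_pct(baseline_ops, candidate_ops) < 5.0;
}
```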
---
## Commit & Documentation
Once validated, commit with detailed message:
```bash
git add -A
git commit -m "Fix TLS SLL header corruption in [Component]
Root Cause:
[Brief 1-2 sentence explanation of what was wrong]
Pattern Affected:
[Which of the 6 patterns this was]
Fix Applied:
[Minimal description of the fix]
Validation:
- [Test case] passed
- [Benchmark] completed without TLS_SLL_HDR_RESET
- No performance regression
Related Issues:
- TLS SLL baseline instability
- Required for Phase 1/2 validation"
```
---
## Reference Files
| File | Purpose |
|------|---------|
| `docs/TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md` | **Complete diagnostic guide** - READ FIRST |
| `core/box/tls_sll_box.h` | TLS SLL implementation (header validation at lines 282-303) |
| `core/hakmem_tiny_free.inc` | Free path (header write before push, around line 550) |
| `core/hakmem_tiny_refill.inc.h` | Magazine spill (around line 228) |
| `docs/HEADERLESS_STABILITY_DEBUG_INSTRUCTIONS.md` | Test environment setup |
| `debug_artifacts/headerless/` | Benchmark results showing error |
---
## Communication Plan
**Status Updates**: After each step, provide brief status:
- Step 2: "Reproducer created - X allocations before crash"
- Step 3: "Logging added to [X locations]"
- Step 4: "Log analysis complete - [pattern identified]"
- Step 5: "Root cause identified: Pattern #[N]"
- Step 6: "Fix applied - [brief description]"
- Step 7: "Validation complete - [test results]"
---
## Post-Fix: Unblocking Next Phases
Once this issue is fixed, the following can proceed:
1. **Phase 1 Completion**: TLS Hint Box performance optimization (currently showing a 2.3% improvement against a target of 15-20%)
2. **Phase 2 Validation**: Test Headerless mode (ON/OFF configurations)
3. **Performance Benchmarking**: Full multi-test suite (TC1, TC2, TC3)
4. **Future Phases**: Phase 102 (MemApi bridge), production optimization
---
## Success Metric
**GOAL**: TC1 baseline test completes successfully with zero TLS_SLL_HDR_RESET errors.
Current Status: ❌ FAILING (crashes at ~22 seconds)
Target Status: ✅ PASSING (completion in 4-6 minutes)

---
**Questions?** Refer to the diagnostic document for detailed explanations of each pattern and debugging technique.
**Ready to start?** Begin with Step 1: Read the full diagnostic guide.
🚀 Your investigation begins now!