hakmem/docs/CHATGPT_HANDOFF_TLS_DIAGNOSIS.md
Moe Charm (CI) 2624dcce62 Add comprehensive ChatGPT handoff documentation for TLS SLL diagnosis
Created 9 diagnostic and handoff documents (48KB) to guide ChatGPT through
systematic diagnosis and fix of TLS SLL header corruption issue.

Documents Added:
- README_HANDOFF_CHATGPT.md: Master guide explaining 3-doc system
- CHATGPT_CONTEXT_SUMMARY.md: Quick facts & architecture (2-3 min read)
- CHATGPT_HANDOFF_TLS_DIAGNOSIS.md: 7-step procedure (4-8h timeline)
- GEMINI_HANDOFF_SUMMARY.md: Handoff summary for user review
- STATUS_2025_12_03_CURRENT.md: Complete project status snapshot
- TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md: Deep reference (1,150+ lines)
  - 6 root cause patterns with code examples
  - Diagnostic logging instrumentation
  - Fix templates and validation procedures
- TLS_SS_HINT_BOX_DESIGN.md: Phase 1 optimization design (1,148 lines)
- HEADERLESS_STABILITY_DEBUG_INSTRUCTIONS.md: Test environment setup
- SEGFAULT_INVESTIGATION_FOR_GEMINI.md: Original investigation notes

Problem Context:
- Baseline (Headerless OFF) crashes with [TLS_SLL_HDR_RESET]
- Error: cls=1 base=0x... got=0x31 expect=0xa1
- Blocks Phase 1 validation and Phase 2 progression

Expected Outcome:
- ChatGPT follows 7-step diagnostic process
- Root cause identified (one of 6 patterns)
- Surgical fix (1-5 lines)
- TC1 baseline completes without crashes

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 20:41:34 +09:00


ChatGPT Task: TLS SLL Header Corruption Diagnosis & Fix

Status: BLOCKING - System instability detected in baseline configuration
Priority: CRITICAL
Assigned to: Claude (ChatGPT model)
Expected Duration: 4-8 hours


Executive Summary

The hakmem memory allocator baseline configuration crashes with a critical header corruption error:

[TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1

This occurs in shared code paths (not Phase 1 specific), blocking all further development and validation.

Your Task: Diagnose and fix this issue using the comprehensive diagnostic guide.


What You Need to Know

Context

  • Project: hakmem - custom memory allocator with "Box Theory" architecture
  • Language: C
  • Current Phase: Phase 1 implementation + Phase 2 (Headerless) planning
  • Problem: Baseline test crashes before completing benchmarks
  • Error Location: core/box/tls_sll_box.h - header validation during TLS SLL pop

The Error

When a block is popped from the TLS SLL (Thread-Local Singly-Linked List), the pop path validates the block's header byte:

uint8_t got = *b;              // Read byte at offset 0 of base pointer
uint8_t expected = 0xa0 | class_idx;  // For class 1: 0xa1

if (got != expected) {
    // ERROR DETECTED - got 0x31 instead of 0xa1
}

The header byte contains user data (0x31 = '1' character) instead of the expected magic value (0xa1).

This means one of the following:

  1. Wrong pointer was stored in TLS SLL
  2. Header was not written before pushing to TLS SLL
  3. Header was overwritten after pushing
  4. Offset calculation is wrong
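
The check can be exercised in isolation before touching the allocator. The following is a minimal sketch, not hakmem code: the helper names hdr_write and hdr_valid and the 0x0f class mask are assumptions for illustration. Only the 0xa0 | class_idx formula comes from the snippet above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed constants: 0xa0 | class_idx is from the text; the 0x0f mask is a guess. */
#define HDR_MAGIC      0xa0u
#define HDR_CLASS_MASK 0x0fu

/* Hypothetical helper: write the one-byte header at offset 0 of the block base. */
static void hdr_write(uint8_t *base, unsigned class_idx) {
    assert(base != NULL);
    *base = (uint8_t)(HDR_MAGIC | (class_idx & HDR_CLASS_MASK));
}

/* Hypothetical helper: return 1 if the byte at offset 0 is the header
   expected for class_idx, 0 otherwise. */
static int hdr_valid(const uint8_t *base, unsigned class_idx) {
    assert(base != NULL);
    uint8_t expected = (uint8_t)(HDR_MAGIC | (class_idx & HDR_CLASS_MASK));
    return *base == expected;
}
```

Writing the header for class 1 and then storing the character '1' at offset 0 makes hdr_valid fail in the same way as the reported got=0x31 expect=0xa1.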

Your Step-by-Step Task

Step 1: Read the Comprehensive Diagnostic Document

File: /mnt/workdisk/public_share/hakmem/docs/TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md

This 1,150+ line document contains:

  • 6 detailed root cause patterns with code examples
  • Minimal test case template (test_tls_sll_minimal.c)
  • Diagnostic logging instrumentation points
  • Fix patterns with code snippets
  • 7-step validation procedure

Action: Read the entire document and understand the investigation methodology.


Step 2: Reproduce the Error with Minimal Test Case

Create /mnt/workdisk/public_share/hakmem/tests/test_tls_sll_minimal.c based on the template in the diagnostic document.

cd /mnt/workdisk/public_share/hakmem

# Build minimal test
gcc -g -O1 -I./core -I./core/box \
    tests/test_tls_sll_minimal.c \
    -L. -lhakmem -lpthread -o test_minimal

# Run (should crash with TLS_SLL_HDR_RESET error)
./test_minimal 2>&1 | grep -E "TLS_SLL_HDR_RESET|Segmentation"

Expected Output: The test should reproduce the header corruption within the first 100-1000 allocations.
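
If the template is not at hand, a reproducer along these lines may be a reasonable starting point. This is a sketch under assumptions, not the template from the diagnostic document: it assumes hakmem interposes the standard malloc/free, and the function name, batch size, and block size are hypothetical. Blocks are filled with '1' (0x31) so that any user data reaching a header byte shows up as the reported value.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stress helper: allocate a batch of small blocks, fill each
   with '1' (0x31), then free the batch so later rounds reuse the blocks.
   Returns the number of blocks allocated and freed. */
static size_t stress_small_allocs(size_t rounds, size_t block_size) {
    enum { BATCH = 64 };
    void *slots[BATCH];
    size_t completed = 0;

    assert(block_size > 0);
    for (size_t r = 0; r < rounds; r++) {
        size_t got = 0;
        for (; got < BATCH; got++) {
            slots[got] = malloc(block_size);
            if (slots[got] == NULL) {
                break;                      /* out of memory: stop this batch */
            }
            memset(slots[got], '1', block_size);
        }
        for (size_t i = 0; i < got; i++) {
            free(slots[i]);                 /* freed blocks go back to the allocator */
        }
        completed += got;
        if (got < BATCH) {
            break;
        }
    }
    return completed;
}
```

Calling stress_small_allocs(16, 24) from main performs 1024 allocations, which matches the 100-1000 range in which the crash is expected. The block size that maps to class 1 is not stated here, so try several small sizes.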


Step 3: Add Diagnostic Logging

Instrument the following locations to capture when header corruption occurs:

Location A: core/hakmem_tiny_free.inc - Header write before TLS SLL push

// Around line 550: Before tls_sll_push()
// ADD LOGGING:
fprintf(stderr, "[HEADER_WRITE] base=%p, offset=%zu, writing 0x%02x\n",
        base, offset, (HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK)));

Location B: core/box/tls_sll_box.h - Header read during pop

// Around lines 282-303: In tls_sll_pop_impl()
// ADD LOGGING:
fprintf(stderr, "[HEADER_READ] base=%p, got=0x%02x, expected=0x%02x\n",
        raw_base, got, expected);

Location C: core/hakmem_tiny_refill.inc.h - Magazine spill

// Around line 228: Before/after tls_sll_push()
// ADD LOGGING:
fprintf(stderr, "[SPILL] class=%d, ptr=%p (wrapping to base)\n", class_idx, p);

Action: Add detailed logging to identify which allocation/free cycle causes corruption.
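
One caution when logging from inside an allocator: stdio calls such as fprintf may themselves allocate and re-enter the code under test. The sketch below shows one way to reduce that risk, assuming a POSIX system. It formats into a stack buffer and emits with write(2), and compiles the output out unless HAKMEM_TINY_DEBUG_LOGGING (the flag used in Step 4) is set. The function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: format one HEADER_READ line into a caller-supplied
   buffer. Returns the snprintf result (length that would have been written). */
static int diag_format_header_read(char *buf, size_t cap, const void *base,
                                   unsigned got, unsigned expected) {
    assert(buf != NULL && cap > 0);
    return snprintf(buf, cap,
                    "[HEADER_READ] base=%p, got=0x%02x, expected=0x%02x\n",
                    base, got, expected);
}

/* Hypothetical helper: emit the line on stderr via write(2) from a stack
   buffer. Does nothing unless HAKMEM_TINY_DEBUG_LOGGING is defined non-zero. */
static void diag_log_header_read(const void *base, unsigned got, unsigned expected) {
#if defined(HAKMEM_TINY_DEBUG_LOGGING) && HAKMEM_TINY_DEBUG_LOGGING
    char line[128];
    int n = diag_format_header_read(line, sizeof line, base, got, expected);
    if (n > 0) {
        size_t len = (size_t)n < sizeof line ? (size_t)n : sizeof line - 1;
        ssize_t ignored = write(STDERR_FILENO, line, len);
        (void)ignored;                      /* best-effort diagnostics only */
    }
#else
    (void)base; (void)got; (void)expected;  /* logging compiled out */
#endif
}
```

The line format matches Location B, so the grep commands in Step 4 work unchanged.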


Step 4: Run Diagnostic Test with Logging

# Rebuild with logging enabled
make clean
make shared -j8 EXTRA_CFLAGS="-g -O1 -DHAKMEM_TINY_DEBUG_LOGGING=1"

# Run minimal test and capture log
./test_minimal 2>&1 | tee diagnostic_output.txt

# Analyze log to find last successful write before corruption
grep HEADER_WRITE diagnostic_output.txt | tail -10
grep HEADER_READ diagnostic_output.txt | grep -A1 -B1 "got=0x31"

Expected Result: The log will show the exact allocation/free sequence leading to corruption.
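
For long logs, a small parser can complement grep by reporting the first HEADER_READ line whose got and expected values differ. The sketch below assumes the line format from Location B and that %p prints addresses as 0x-prefixed hex (as glibc does). The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed "[HEADER_READ] ..." log line. */
typedef struct {
    unsigned long long base;
    unsigned got;
    unsigned expected;
} header_read_rec;

/* Parse a single log line. Returns 1 if it is a well-formed HEADER_READ
   line, 0 for any other line. */
static int parse_header_read(const char *line, header_read_rec *out) {
    assert(line != NULL && out != NULL);
    return sscanf(line, "[HEADER_READ] base=%llx, got=%x, expected=%x",
                  &out->base, &out->got, &out->expected) == 3;
}

/* Scan a log stream. Returns the 1-based position, counted among
   HEADER_READ lines only, of the first mismatch and stores it in *hit.
   Returns 0 if every HEADER_READ line matched. */
static unsigned long first_mismatch(FILE *log, header_read_rec *hit) {
    char line[256];
    unsigned long seen = 0;
    header_read_rec rec;

    assert(log != NULL && hit != NULL);
    while (fgets(line, sizeof line, log) != NULL) {
        if (!parse_header_read(line, &rec)) {
            continue;                       /* HEADER_WRITE, SPILL, other output */
        }
        seen++;
        if (rec.got != rec.expected) {
            *hit = rec;
            return seen;
        }
    }
    return 0;
}
```

Open diagnostic_output.txt with fopen and pass it to first_mismatch. The base address in the returned record can then be searched for among the HEADER_WRITE lines to find the last write to that block.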


Step 5: Identify Root Cause (One of Six Patterns)

Based on diagnostic logs, match against these patterns from the diagnostic document:

  1. RAW Pointer vs BASE Pointer: Wrong pointer type passed to tls_sll_push()
  2. Header Offset Mismatch: Writing at offset 1, reading at offset 0
  3. Atomic Fence Missing: Compiler reordering causing write-after-push
  4. Adjacent Block Overflow: User data from preceding block overwrites header
  5. Class Index Mismatch: Push with class_idx A, pop as class_idx B
  6. Headerless Mode Interference: Mixed header/headerless logic

Action: Determine which pattern applies to your findings.
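
The corrupted byte itself can narrow the candidates. As a rough heuristic, sketched below under the assumption that the low four bits carry the class index: a byte that still has the 0xa0 magic nibble but different class bits points toward Pattern 5, while a printable ASCII byte such as 0x31 points toward user data reaching the header (Patterns 1, 2, or 4). The enum and function names are hypothetical, and the result is a hint, not a diagnosis.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

typedef enum {
    HDR_OK,             /* byte equals the expected header */
    HDR_WRONG_CLASS,    /* magic nibble intact, class bits differ (Pattern 5 candidate) */
    HDR_PRINTABLE_DATA, /* printable ASCII, likely user data (Patterns 1, 2, 4 candidates) */
    HDR_OTHER           /* anything else, e.g. zeroed or pointer bytes */
} hdr_verdict;

/* Hypothetical triage helper: classify the byte read at offset 0 during pop. */
static hdr_verdict classify_header_byte(uint8_t got, unsigned class_idx) {
    assert(class_idx <= 0x0fu);             /* assumed 4-bit class field */
    uint8_t expected = (uint8_t)(0xa0u | class_idx);
    if (got == expected) {
        return HDR_OK;
    }
    if ((got & 0xf0u) == 0xa0u) {
        return HDR_WRONG_CLASS;
    }
    if (isprint(got)) {
        return HDR_PRINTABLE_DATA;
    }
    return HDR_OTHER;
}
```

For the reported error, classify_header_byte(0x31, 1) yields HDR_PRINTABLE_DATA, which suggests starting with the pointer and offset patterns before the class mismatch pattern.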


Step 6: Implement Surgical Fix

Once root cause is identified, apply a minimal fix (typically 1-5 lines):

Example fixes (from diagnostic document):

// Pattern 1 - RAW vs BASE pointer:
// WRONG:
tls_sll_push(class_idx, p, size);  // p is RAW pointer
// FIXED:
hak_base_ptr_t base = HAK_BASE_FROM_RAW(p);
tls_sll_push(class_idx, base, size);

// Pattern 2 - Offset mismatch:
// WRONG:
*(uint8_t*)((char*)base + 1) = header;  // Writing at offset 1
// In pop: uint8_t h = *((uint8_t*)base);  // Reading at offset 0
// FIXED:
*(uint8_t*)base = header;  // Consistent offset

// Pattern 3 - Atomic fence missing:
// WRONG:
*hdr = magic;
tls_sll_push(...);
// FIXED:
*hdr = magic;
atomic_thread_fence(memory_order_release);  // Requires <stdatomic.h>; orders the header store before the push
tls_sll_push(...);

Action: Apply fix to source code and rebuild.
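
While validating a fix, it can help to check the header at push time as well, so that a bad pointer is rejected where it enters the list rather than at a later pop. The sketch below uses a toy list, not the hakmem TLS SLL: the struct, the function name, the block size, and the offset of the next pointer are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy layout (assumed): header byte at offset 0, next pointer at offset 8. */
enum { TOY_BLOCK_SIZE = 32, TOY_NEXT_OFFSET = 8 };

typedef struct {
    uint8_t *head;
    unsigned count;
} toy_sll;

/* Push base onto the list only if its header byte is valid for class_idx.
   Returns 1 on success, 0 if the block was rejected. */
static int toy_push_checked(toy_sll *list, uint8_t *base, unsigned class_idx) {
    assert(list != NULL && base != NULL);
    uint8_t expected = (uint8_t)(0xa0u | (class_idx & 0x0fu));
    if (*base != expected) {
        return 0;                           /* header missing, or not a base pointer */
    }
    /* memcpy avoids alignment assumptions about the block. */
    memcpy(base + TOY_NEXT_OFFSET, &list->head, sizeof list->head);
    list->head = base;
    list->count++;
    return 1;
}
```

With a block whose header is 0xa1 and whose payload is filled with '1', pushing the base pointer succeeds, while pushing base + 1 (a stand-in for a RAW user pointer, as in Pattern 1) is rejected because the byte read there is 0x31.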


Step 7: Validate Fix

# Step 7a: Run minimal test
./test_minimal 2>&1 | grep -E "TLS_SLL_HDR_RESET|passed|failed"

# Step 7b: Run baseline benchmark
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | \
  grep -E "TLS_SLL_HDR_RESET|Total|PASSED|FAILED"

# Step 7c: Run cfrac (memory intensive)
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 2>&1 | \
  grep -E "error|TLS_SLL_HDR_RESET|Total"

# Step 7d: Check for regressions
make test -j8 FILTER="tls_sll"

Success Criteria:

  • Minimal test completes without TLS_SLL_HDR_RESET
  • sh8bench runs to completion (several minutes)
  • cfrac completes without errors
  • All unit tests pass
  • No performance regression beyond 5%

Commit & Documentation

Once validated, commit with detailed message:

git add -A
git commit -m "Fix TLS SLL header corruption in [Component]

Root Cause:
[Brief 1-2 sentence explanation of what was wrong]

Pattern Affected:
[Which of the 6 patterns this was]

Fix Applied:
[Minimal description of the fix]

Validation:
- [Test case] passed
- [Benchmark] completed without TLS_SLL_HDR_RESET
- No performance regression

Related Issues:
- TLS SLL baseline instability
- Required for Phase 1/2 validation"

Reference Files

| File | Purpose |
|------|---------|
| docs/TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md | Complete diagnostic guide - READ FIRST |
| core/box/tls_sll_box.h | TLS SLL implementation (header validation at lines 282-303) |
| core/hakmem_tiny_free.inc | Free path (header write before push, around line 550) |
| core/hakmem_tiny_refill.inc.h | Magazine spill (around line 228) |
| docs/HEADERLESS_STABILITY_DEBUG_INSTRUCTIONS.md | Test environment setup |
| debug_artifacts/headerless/ | Benchmark results showing error |

Communication Plan

Status Updates: After each step, provide a brief status update:

  • Step 2: "Reproducer created - X allocations before crash"
  • Step 3: "Logging added to [X locations]"
  • Step 4: "Log analysis complete - [pattern identified]"
  • Step 5: "Root cause identified: Pattern #[N]"
  • Step 6: "Fix applied - [brief description]"
  • Step 7: "Validation complete - [test results]"

Post-Fix: Unblocking Next Phases

Once this issue is fixed, the following can proceed:

  1. Phase 1 Completion: TLS Hint Box performance optimization (currently showing 2.3% improvement vs target 15-20%)
  2. Phase 2 Validation: Test Headerless mode (ON/OFF configurations)
  3. Performance Benchmarking: Full multi-test suite (TC1, TC2, TC3)
  4. Future Phases: Phase 102 (MemApi bridge), production optimization

Success Metric

GOAL: TC1 baseline test completes successfully with zero TLS_SLL_HDR_RESET errors.

Current Status: FAILING (crashes at ~22 seconds)
Target Status: PASSING (completion in 4-6 minutes)


Questions? Refer to the diagnostic document for detailed explanations of each pattern and debugging technique.

Ready to start? Begin with Step 1: Read the full diagnostic guide.

🚀 Your investigation begins now!