hakmem/docs/CHATGPT_CONTEXT_SUMMARY.md
Moe Charm (CI) 2624dcce62 Add comprehensive ChatGPT handoff documentation for TLS SLL diagnosis
Created 9 diagnostic and handoff documents (48KB) to guide ChatGPT through
systematic diagnosis and fix of TLS SLL header corruption issue.

Documents Added:
- README_HANDOFF_CHATGPT.md: Master guide explaining 3-doc system
- CHATGPT_CONTEXT_SUMMARY.md: Quick facts & architecture (2-3 min read)
- CHATGPT_HANDOFF_TLS_DIAGNOSIS.md: 7-step procedure (4-8h timeline)
- GEMINI_HANDOFF_SUMMARY.md: Handoff summary for user review
- STATUS_2025_12_03_CURRENT.md: Complete project status snapshot
- TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md: Deep reference (1,150+ lines)
  - 6 root cause patterns with code examples
  - Diagnostic logging instrumentation
  - Fix templates and validation procedures
- TLS_SS_HINT_BOX_DESIGN.md: Phase 1 optimization design (1,148 lines)
- HEADERLESS_STABILITY_DEBUG_INSTRUCTIONS.md: Test environment setup
- SEGFAULT_INVESTIGATION_FOR_GEMINI.md: Original investigation notes

Problem Context:
- Baseline (Headerless OFF) crashes with [TLS_SLL_HDR_RESET]
- Error: cls=1 base=0x... got=0x31 expect=0xa1
- Blocks Phase 1 validation and Phase 2 progression

Expected Outcome:
- ChatGPT follows 7-step diagnostic process
- Root cause identified (one of 6 patterns)
- Surgical fix (1-5 lines)
- TC1 baseline completes without crashes

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 20:41:34 +09:00


Context Summary for ChatGPT - TLS SLL Header Corruption Fix

Date: 2025-12-03
Project: hakmem - Custom Memory Allocator
Handoff From: Gemini + Task agent (previous phase)
Current Task: Diagnose and fix TLS SLL header corruption
Status: CRITICAL BLOCKER - Investigation Required


Quick Facts

| Item | Value |
|------|-------|
| Problem | Header corruption in TLS SLL during baseline testing |
| Error Message | [TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1 |
| Error Location | core/box/tls_sll_box.h:282-303 |
| Affected Configurations | ALL (shared code path issue) |
| Root Cause | Unknown (6 patterns documented) |
| Fix Type | Surgical (1-5 lines expected) |
| Build Status | Succeeds |
| Baseline Test Status | Crashes (SIGSEGV at ~22 seconds) |

What is 0x31 vs 0xa1?

Expected (header magic): 0xa1 = 0xa0 (HEADER_MAGIC) | 0x01 (class_idx=1)
Got (corruption):        0x31 = ASCII character '1' or some user data

This means: the byte at the header offset holds user data instead of the header, so the header was either never written or was overwritten.
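The arithmetic above can be checked with a small sketch. It assumes the class index occupies the low nibble (a 0x0f mask); only the 0xa0 magic is stated in this summary, and the `sketch_` names are hypothetical, not hakmem identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* HEADER_MAGIC (0xa0) comes from the summary above; the 0x0f class mask
 * is an assumption made for this sketch. */
#define SKETCH_HEADER_MAGIC      0xa0u
#define SKETCH_HEADER_CLASS_MASK 0x0fu

/* Build the header byte expected for a given class index. */
static inline uint8_t sketch_expected_header(unsigned class_idx) {
    assert(class_idx <= SKETCH_HEADER_CLASS_MASK);
    return (uint8_t)(SKETCH_HEADER_MAGIC | (class_idx & SKETCH_HEADER_CLASS_MASK));
}

/* True when the high nibble carries the magic, i.e. the byte looks like a header. */
static inline bool sketch_has_header_magic(uint8_t byte) {
    return (byte & (uint8_t)~SKETCH_HEADER_CLASS_MASK) == SKETCH_HEADER_MAGIC;
}

/* True when the byte is printable ASCII, a hint that it may be user data. */
static inline bool sketch_looks_like_ascii(uint8_t byte) {
    return byte >= 0x20 && byte <= 0x7e;
}
```

Under these assumptions, 0xa1 passes the magic check, while 0x31 fails it and falls in the printable ASCII range, which is consistent with the "user data" reading.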

Project Architecture (Box Theory)

The hakmem allocator uses a Box Theory architecture where:

  • Each component (memory layout, pointer conversion, TLS state) is a separate "box"
  • Each box has a single responsibility and clear API boundaries
  • Examples:
    • tiny_layout_box.h - Class sizes and header offsets (single source of truth)
    • ptr_conversion_box.h - Pointer type safety (base vs user pointers)
    • tls_sll_box.h - Thread-local single-linked list management
    • tls_ss_hint_box.h - SuperSlab hint cache (Phase 1 optimization)

Recent Changes (Last 5 Commits)

  1. f3f75ba3d - "Fix Magazine Spill RAW pointer type conversion"

    • Added HAK_BASE_FROM_RAW() wrapper in hakmem_tiny_refill.inc.h:228
    • Status: Fixed
  2. 2dc9d5d59 - "Fix include order in hakmem.c"

    • Moved #include "box/hak_kpi_util.inc.h" before hak_core_init.inc.h
    • Status: Fixed
  3. 94f9ea51 - "Implement TLS SuperSlab Hint Box (Phase 1)"

    • New header-only cache for recently-used SuperSlabs
    • Status: Implemented, but only 2.3% performance improvement (target was 15-20%)
  4. Earlier: Box theory framework, phantom types, etc.


The Remaining Issue: TLS SLL Header Corruption

Symptom

# Build succeeds
$ make clean && make shared -j8
Building libhakmem.so... OK (547KB)

# But baseline test crashes
$ LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
[TLS_SLL_HDR_RESET] cls=1 base=0x7ef296abf8c8 got=0x31 expect=0xa1 count=0
Segmentation fault (core dumped)

Timeline

  • When Discovered: During Phase 1 benchmarking (2025-12-03)
  • Frequency: 100% reproducible with sh8bench
  • Scope: Reproduces in the baseline build (Headerless OFF); because the code path is shared, all configurations are affected

Error Location

File: core/box/tls_sll_box.h (lines 282-303)
Function: tls_sll_pop_impl()
Operation: Reading the header byte for validation

// Simplified logic (actual code has more details)
if (tiny_class_preserves_header(class_idx)) {
    uint8_t* b = (uint8_t*)raw_base;
    uint8_t got = *b;  // Read byte at offset 0
    uint8_t expected = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));

    if (got != expected) {
        fprintf(stderr, "[TLS_SLL_HDR_RESET] cls=%d base=%p got=0x%02x expect=0x%02x\n",
                class_idx, raw_base, got, expected);
        // Reset TLS SLL for this class
    }
}
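When the check fails, the single `got` byte says little about what overwrote the header. One possible diagnostic aid is to dump a few bytes starting at the base pointer. The helper below is a hypothetical sketch, not an existing hakmem function; it only formats bytes into a caller-supplied buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of hakmem): format `len` bytes starting at
 * `p` as space-separated hex into `out`. Returns the number of characters
 * written, excluding the terminating NUL; output is truncated if `out` is
 * too small. */
static inline size_t sketch_hex_dump(const void *p, size_t len,
                                     char *out, size_t out_size) {
    const uint8_t *bytes = (const uint8_t *)p;
    size_t used = 0;
    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* Each byte needs up to 3 characters (separator + "xx") plus a NUL. */
        if (out_size - used < 4) break;
        int n = snprintf(out + used, out_size - used,
                         i ? " %02x" : "%02x", bytes[i]);
        if (n < 0) break;
        used += (size_t)n;
    }
    return used;
}
```

In the mismatch branch, the resulting string could be printed to stderr next to the existing message. A run of printable ASCII values at the base would suggest user data; this is only a hint, not a diagnosis.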

Root Cause - Six Documented Patterns

The diagnostic document identifies six possible patterns:

  1. RAW Pointer vs BASE Pointer - Wrong pointer type passed to tls_sll_push()
  2. Header Offset Mismatch - Writing at one offset, reading at another
  3. Atomic Fence Missing - Compiler/CPU reordering of write + push
  4. Adjacent Block Overflow - User data from previous block overwrites header
  5. Class Index Mismatch - Push with one class_idx, pop as different class_idx
  6. Headerless Mode Interference - Mixed header/headerless logic despite OFF flag
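As an illustration of pattern 4 only, the sketch below models two adjacent blocks in a byte array, each with a 1-byte header at its base. The 16-byte block size, the 0x0f class mask, and all `sketch_` names are assumptions for the example and do not reflect hakmem's real class table. It shows how a one-byte overrun of '1' characters could produce got=0x31 in the next block's header; it does not establish that this is the actual root cause.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative layout only: 1-byte header followed by user bytes. */
enum { SKETCH_BLOCK_SIZE = 16, SKETCH_HEADER_SIZE = 1 };
#define SKETCH_MAGIC 0xa0u

/* Write the header byte for `class_idx` at the base of block `i`. */
static inline void sketch_write_header(uint8_t *arena, int i, unsigned class_idx) {
    arena[i * SKETCH_BLOCK_SIZE] = (uint8_t)(SKETCH_MAGIC | (class_idx & 0x0fu));
}

/* Return the header byte currently stored at the base of block `i`. */
static inline uint8_t sketch_read_header(const uint8_t *arena, int i) {
    return arena[i * SKETCH_BLOCK_SIZE];
}

/* Simulate user code writing `n` bytes of `fill` starting at the user
 * pointer of block `i`. If `n` exceeds the usable size
 * (SKETCH_BLOCK_SIZE - SKETCH_HEADER_SIZE), the write spills into the
 * header of block `i + 1`. The caller must keep the write inside `arena`. */
static inline void sketch_user_fill(uint8_t *arena, int i, size_t n, uint8_t fill) {
    memset(arena + i * SKETCH_BLOCK_SIZE + SKETCH_HEADER_SIZE, fill, n);
}
```

Filling block 0 with 15 bytes leaves block 1's header at 0xa1; filling it with 16 bytes turns that header into 0x31, matching the observed error values.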

Your Task

You have two comprehensive documents:

  1. docs/CHATGPT_HANDOFF_TLS_DIAGNOSIS.md (THIS FILE'S COMPANION)

    • Step-by-step task breakdown
    • 7-step investigation and fix process
    • Expected validation criteria
  2. docs/TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md (MAIN REFERENCE - 1,150+ LINES)

    • Deep dive into all 6 root cause patterns
    • Code examples for each pattern
    • Minimal test case template
    • Diagnostic logging instrumentation
    • Fix code templates
    • 7-step validation procedure

Follow the handoff document's steps 1-7 to diagnose and fix this issue.


Build & Test Commands

Quick Build

cd /mnt/workdisk/public_share/hakmem
make clean
make shared -j8

Baseline Test (Should Currently Crash)

LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | \
  grep -E "TLS_SLL_HDR_RESET|Total|Segmentation"

Minimal Test Case (After Creation)

./tests/test_tls_sll_minimal 2>&1 | grep -E "TLS_SLL_HDR_RESET|PASS|FAIL"
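The minimal test case does not exist yet. A possible starting point, sketched here under assumptions, is a routine that repeatedly allocates, fills, verifies, and frees small blocks through the standard `malloc`/`free` interface, so that it exercises whichever allocator is preloaded. The function name, the block limit, and the choice of fill byte are hypothetical; the size that maps to class 1 is not stated in this summary, so it is left as a parameter.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { SKETCH_MAX_BLOCKS = 256 };

/* Allocate `count` blocks of `size` bytes, fill each with '1' (0x31),
 * verify the contents, then free them; repeat for `rounds` rounds.
 * Returns the number of blocks whose contents changed unexpectedly,
 * or -1 if an allocation fails. */
static inline int sketch_churn_small_blocks(int rounds, int count, size_t size) {
    unsigned char *blocks[SKETCH_MAX_BLOCKS];
    int bad = 0;
    assert(count > 0 && count <= SKETCH_MAX_BLOCKS && size > 0);
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < count; i++) {
            blocks[i] = malloc(size);
            if (!blocks[i]) {
                while (i-- > 0) free(blocks[i]);  /* release what we hold */
                return -1;
            }
            memset(blocks[i], '1', size);
        }
        for (int i = 0; i < count; i++) {
            for (size_t k = 0; k < size; k++) {
                if (blocks[i][k] != '1') { bad++; break; }
            }
            free(blocks[i]);
        }
    }
    return bad;
}
```

A test driver could call this with a size believed to land in class 1 and print PASS when the result is 0 and no [TLS_SLL_HDR_RESET] line appears on stderr, which fits the grep pattern in the command above.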

Important File Locations

| Path | Purpose |
|------|---------|
| core/box/tls_sll_box.h | TLS SLL implementation (error source) |
| core/hakmem_tiny_free.inc | Free path - where headers are written |
| core/hakmem_tiny_refill.inc.h | Magazine spill - recent fix location |
| core/box/ptr_conversion_box.h | Pointer type conversion |
| core/box/tiny_layout_box.h | Class layout definitions |
| core/box/tls_ss_hint_box.h | Phase 1 optimization (new) |
| docs/TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md | YOUR MAIN REFERENCE |

Key Data Structures

TLS SLL Header Structure

typedef struct {
    uint8_t hdr;       // Header: 0xa0 | class_idx
    uint8_t pad;       // Padding/metadata
    uint16_t _unused;  // Alignment
    SuperSlab* next;   // Pointer to next SuperSlab
} TlsSllEntry;

Header Validation

// Expected value for class 1:
expected = 0xa0 | 1 = 0xa1

// What we're seeing:
got = 0x31 = some user data

// This means the header was never written OR was overwritten

Pointer Types in hakmem

The codebase distinguishes between:

hak_base_ptr_t   - "Base pointer" pointing to start of allocation (includes header)
hak_user_ptr_t   - "User pointer" pointing to user data (after offset adjustment)

Conversion:
user = base + tiny_user_offset(class_idx)   // Typically base + 1
base = user - tiny_user_offset(class_idx)   // Typically user - 1

Critical: In Headerless mode, the offset is 0, so base == user.
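The base/user distinction can be sketched with distinct wrapper structs, so that the compiler rejects passing one kind of pointer where the other is expected. The types and functions below are hypothetical stand-ins; the real definitions live in ptr_conversion_box.h and may differ. The sketch also simplifies the offset rule to a headerless flag, whereas the summary's `tiny_user_offset()` takes a class index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for hak_base_ptr_t / hak_user_ptr_t: two distinct
 * struct types, so mixing them up is a compile-time error. */
typedef struct { uint8_t *p; } sketch_base_ptr_t;
typedef struct { uint8_t *p; } sketch_user_ptr_t;

/* Assumed offset rule: 1 header byte normally, 0 in headerless mode. */
static inline size_t sketch_user_offset(int headerless) {
    return headerless ? 0u : 1u;
}

/* base -> user: skip over the header (if any). */
static inline sketch_user_ptr_t sketch_base_to_user(sketch_base_ptr_t b, int headerless) {
    sketch_user_ptr_t u = { b.p + sketch_user_offset(headerless) };
    return u;
}

/* user -> base: step back to the start of the block. */
static inline sketch_base_ptr_t sketch_user_to_base(sketch_user_ptr_t u, int headerless) {
    sketch_base_ptr_t b = { u.p - sketch_user_offset(headerless) };
    return b;
}
```

With this shape, a push function that accepts only the base type cannot be handed a user pointer by accident. Pattern 1 corresponds to the case where a raw `void*` bypasses such wrappers.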


Known Good Patterns (For Reference)

From previous fixes:

// Pattern: Wrapping RAW pointer before TLS SLL push (ALREADY FIXED)
void* p = mag->items[--mag->top].ptr;              // RAW pointer (user offset)
hak_base_ptr_t base_p = HAK_BASE_FROM_RAW(p);     // Wrap to base pointer
if (!tls_sll_push(class_idx, base_p, cap)) {      // Push base pointer

// Pattern: Consistent include order (ALREADY FIXED)
#include "box/hak_kpi_util.inc.h"      // Must come first
#include "hak_core_init.inc.h"          // Must come after

Success Criteria

| Criteria | Status |
|----------|--------|
| TLS SLL Header Corruption diagnosed | In progress |
| Root cause pattern identified | In progress |
| Minimal reproducer created | In progress |
| Fix implemented | In progress |
| sh8bench runs without errors | GOAL |
| cfrac runs without errors | GOAL |
| No performance regression | GOAL |

Previous Phase Context

This project has gone through several phases:

  • Phase 0: Initial implementation (completed)
  • Phase 1: TLS SuperSlab Hint Box optimization (implemented, needs validation)
  • Phase 2: Headerless mode (designed, blocked by current issue)
  • Phase 102: MemApi bridge (future)

The current issue blocks validation of Phase 1 and progression to Phase 2.


Timeline Estimate

  • Step 1 (Read guide): 15-30 min
  • Step 2-3 (Setup + logging): 1-2 hours
  • Step 4 (Diagnostic run): 30 min
  • Step 5 (Pattern matching): 1 hour
  • Step 6 (Fix implementation): 30 min - 1 hour
  • Step 7 (Validation): 1-2 hours

Total: 4-8 hours expected


Next: Start Investigation

👉 Next Action: Read docs/CHATGPT_HANDOFF_TLS_DIAGNOSIS.md and follow steps 1-7.

The comprehensive diagnostic guide (docs/TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md) contains all the details you need for each pattern and debugging technique.

Questions or blockers? The diagnostic guide has extensive explanations for each pattern.


You're now ready to begin the investigation. Good luck! 🚀