hakmem/docs/README_HANDOFF_CHATGPT.md
Moe Charm (CI) 2624dcce62 Add comprehensive ChatGPT handoff documentation for TLS SLL diagnosis
Created 9 diagnostic and handoff documents (48KB) to guide ChatGPT through
systematic diagnosis and fix of TLS SLL header corruption issue.

Documents Added:
- README_HANDOFF_CHATGPT.md: Master guide explaining 3-doc system
- CHATGPT_CONTEXT_SUMMARY.md: Quick facts & architecture (2-3 min read)
- CHATGPT_HANDOFF_TLS_DIAGNOSIS.md: 7-step procedure (4-8h timeline)
- GEMINI_HANDOFF_SUMMARY.md: Handoff summary for user review
- STATUS_2025_12_03_CURRENT.md: Complete project status snapshot
- TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md: Deep reference (1,150+ lines)
  - 6 root cause patterns with code examples
  - Diagnostic logging instrumentation
  - Fix templates and validation procedures
- TLS_SS_HINT_BOX_DESIGN.md: Phase 1 optimization design (1,148 lines)
- HEADERLESS_STABILITY_DEBUG_INSTRUCTIONS.md: Test environment setup
- SEGFAULT_INVESTIGATION_FOR_GEMINI.md: Original investigation notes

Problem Context:
- Baseline (Headerless OFF) crashes with [TLS_SLL_HDR_RESET]
- Error: cls=1 base=0x... got=0x31 expect=0xa1
- Blocks Phase 1 validation and Phase 2 progression

Expected Outcome:
- ChatGPT follows 7-step diagnostic process
- Root cause identified (one of 6 patterns)
- Surgical fix (1-5 lines)
- TC1 baseline completes without crashes

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 20:41:34 +09:00


🚀 ChatGPT Task Handoff - TLS SLL Header Corruption Fix

Target: Claude (ChatGPT model)
Task: Diagnose and fix critical TLS SLL header corruption
Status: Ready for immediate handoff
Date: 2025-12-03


Quick Start (TL;DR)

The Problem: hakmem baseline crashes with header corruption

[TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1

Your Task: Fix it using 7 documented steps

Documents You Need (in order):

  1. 📖 READ FIRST: CHATGPT_CONTEXT_SUMMARY.md (2-3 min read)
  2. 📋 FOLLOW: CHATGPT_HANDOFF_TLS_DIAGNOSIS.md (7 detailed steps)
  3. 🔍 REFERENCE: TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md (1,150 lines of deep reference)

Success: TC1 baseline test completes without crashes

Timeline: 4-8 hours expected


The Three Documents Explained

1. CHATGPT_CONTEXT_SUMMARY.md

Purpose: Quick reference and architecture overview
Read Time: 2-3 minutes
Contains:

  • What 0x31 means vs 0xa1
  • Project architecture (Box Theory)
  • Recent changes (5 commits)
  • The remaining issue explained simply
  • File locations and data structures
  • Build & test commands
  • Success criteria

When to Use:

  • First thing to read
  • Reference when you need quick facts
  • Before diving into detailed diagnosis

2. CHATGPT_HANDOFF_TLS_DIAGNOSIS.md

Purpose: Step-by-step task breakdown for fixing the issue
Follow Time: 4-8 hours
Contains:

  • Executive summary
  • 7 specific steps to diagnose and fix:
    • Step 1: Read the diagnostic guide
    • Step 2: Reproduce with minimal test
    • Step 3: Add diagnostic logging
    • Step 4: Run diagnostic test
    • Step 5: Identify root cause pattern
    • Step 6: Implement fix
    • Step 7: Validate fix
  • Expected output for each step
  • How to identify which of 6 patterns caused the issue
  • Example fix code for each pattern
  • Validation criteria
  • Commit message template

When to Use:

  • This is your TASK DOCUMENT
  • Follow the 7 steps in order
  • After each step, update status

3. TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md

Purpose: Deep reference for detailed understanding
Reference Time: As needed during diagnosis
Contains:

  • 6 root cause patterns with full code examples
  • Minimal test case template
  • Detailed diagnostic logging instrumentation
  • Pattern-specific fix templates
  • 7-step validation procedure
  • Debugging techniques and tools

When to Use:

  • During Step 3 (diagnostic logging)
  • During Step 5 (pattern matching)
  • During Step 6 (implementing fix)
  • As reference for understanding each pattern

Document Relationships

┌─────────────────────────────────────────┐
│ CHATGPT_CONTEXT_SUMMARY.md              │
│ (Start here - 2-3 min)                  │
│ ↓                                       │
│ Quick facts + architecture overview     │
└──────────────┬──────────────────────────┘
               │
               ↓
┌──────────────────────────────────────────┐
│ CHATGPT_HANDOFF_TLS_DIAGNOSIS.md        │
│ (Follow these 7 steps - 4-8 hours)      │
│ ↓                                        │
│ Step 1: Read diagnostic guide            │
│ Step 2: Create minimal reproducer        │
│ Step 3: Add logging [→ consult ref #3]  │
│ Step 4: Run diagnostic test              │
│ Step 5: Match pattern [→ consult ref #3]│
│ Step 6: Implement fix [→ consult ref #3]│
│ Step 7: Validate                         │
└──────────────┬───────────────────────────┘
               │
               ↓
┌──────────────────────────────────────────┐
│ TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md   │
│ (Deep reference - consult as needed)     │
│                                          │
│ 6 Root Cause Patterns:                   │
│ 1. RAW vs BASE pointer                   │
│ 2. Header offset mismatch                │
│ 3. Atomic fence missing                  │
│ 4. Adjacent block overflow               │
│ 5. Class index mismatch                  │
│ 6. Headerless mode interference          │
│                                          │
│ For each pattern: code examples + fixes  │
└──────────────────────────────────────────┘
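
To make Pattern 1 (RAW vs BASE pointer) concrete, here is a minimal sketch of how mixing up the two pointers could produce exactly the kind of mismatch in the error message. It assumes a 1-byte header stored at BASE with the user pointer at BASE + 1; that layout, and every name below, is hypothetical and not taken from `ptr_conversion_box.h`.

```c
#include <stdint.h>

/* Assumption: a 1-byte header sits at BASE and the caller sees BASE + 1. */
enum { HDR_SIZE = 1 };

static inline uint8_t *base_to_user(uint8_t *base) { return base + HDR_SIZE; }
static inline uint8_t *user_to_base(uint8_t *user) { return user - HDR_SIZE; }

/* Correct: convert the user pointer back to BASE before reading the header. */
static inline uint8_t read_header_ok(uint8_t *user) {
    return *user_to_base(user);
}

/* Pattern 1 bug: treats the user pointer as BASE, so the "header"
 * it returns is really the first byte of the caller's payload. */
static inline uint8_t read_header_buggy(uint8_t *user) {
    return *user;
}
```

If the payload begins with the character `'1'` (0x31), the buggy read returns 0x31 while the correct read returns 0xa1. That matches the reported symptom, but it does not by itself prove Pattern 1 is the cause.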

How to Use These Documents

Before Starting

  1. Read Summary (2-3 min)

    • Understand what the problem is
    • Learn about the project architecture
    • Know what tools you'll use
  2. Skim Handoff (5 min)

    • Understand the 7-step process
    • Know what's expected at each step
    • Identify reference points

During Work

  1. Follow Handoff Step-by-Step (4-8 hours)

    • Step 1: Read the diagnostic guide thoroughly
    • Step 2: Create minimal reproducer
    • Step 3: Add logging (reference diagnostic guide)
    • Step 4: Run and capture output
    • Step 5: Match observed behavior to patterns (reference diagnostic guide)
    • Step 6: Implement fix (reference diagnostic guide for fix templates)
    • Step 7: Validate success
  2. Consult Diagnostic Guide as Needed

    • When you need pattern details (Step 5)
    • When you need fix code templates (Step 6)
    • When you need validation procedures (Step 7)

After Completion

  1. Report Status
    • Which root cause pattern was identified
    • What fix was applied
    • Validation results
    • Commit message

Key Information to Know

The Error Explained

Error Message: [TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1

Interpretation:
- Location: Reading header byte from allocated block during free
- Expected: 0xa1 (0xa0 MAGIC | class_idx=1)
- Got: 0x31 (user data or corruption)
- Meaning: Header was never written OR was overwritten

Root Cause: One of 6 documented patterns
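
The interpretation above can be sketched as a small check. The magic value 0xa0 and the `MAGIC | class_idx` composition come from the error text; the nibble masks and all function names are assumptions for illustration, not the definitions in `tls_sll_box.h`.

```c
#include <stdbool.h>
#include <stdint.h>

#define HDR_MAGIC      0xa0u  /* from the error text: 0xa0 MAGIC */
#define HDR_MAGIC_MASK 0xf0u  /* assumption: magic lives in the high nibble */
#define HDR_CLASS_MASK 0x0fu  /* assumption: class index lives in the low nibble */

/* Header byte expected for a given size class. */
static inline uint8_t hdr_expected(unsigned class_idx) {
    return (uint8_t)(HDR_MAGIC | (class_idx & HDR_CLASS_MASK));
}

/* True when the byte carries the magic, whatever class it names. */
static inline bool hdr_has_magic(uint8_t byte) {
    return (byte & HDR_MAGIC_MASK) == HDR_MAGIC;
}

/* True when the byte is exactly the header for class_idx. */
static inline bool hdr_matches(uint8_t byte, unsigned class_idx) {
    return byte == hdr_expected(class_idx);
}
```

Under these assumptions `hdr_expected(1)` is 0xa1, and 0x31 fails even the magic check. 0x31 is also the ASCII code for `'1'`, which is consistent with user data landing where the header should be.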

Success Looks Like

# Before fix:
$ LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
[TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1
Segmentation fault (code 139)
Execution time: ~22 seconds before crash

# After fix:
$ LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
Total: 54.5 Mops/s  [no TLS_SLL_HDR_RESET errors]
Execution time: 4-6 minutes [completes successfully]

File Locations You'll Need

| File | Purpose | Action |
| --- | --- | --- |
| core/box/tls_sll_box.h | Error source | Read/understand |
| core/hakmem_tiny_free.inc | Header write | Add logging |
| core/hakmem_tiny_refill.inc.h | Magazine spill | Check for issues |
| core/box/ptr_conversion_box.h | Pointer conversion | Understand logic |
| core/box/tiny_layout_box.h | Class layout | Understand definitions |
| tests/test_tls_sll_minimal.c | Your test | Create this |
| debug_artifacts/headerless/ | Benchmark logs | Reference existing |
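
As a starting point for `tests/test_tls_sll_minimal.c`, here is a hedged sketch of an allocation/free churn loop. It is not the template from the diagnostic guide. The function name, the interleaved free order, and the 0x31 fill are illustrative assumptions, and this document does not state which block sizes map to class 1.

```c
#include <stdlib.h>
#include <string.h>

/* Allocates `count` blocks of `size` bytes per round, fills each with
 * 0x31 ('1'), verifies the fill, then frees even slots before odd slots.
 * Returns the number of blocks whose payload did not read back intact,
 * or -1 if an allocation failed. */
static long churn_small_blocks(size_t rounds, size_t count, size_t size) {
    void **slots = calloc(count, sizeof *slots);
    if (!slots) return -1;
    long damaged = 0;
    for (size_t r = 0; r < rounds; r++) {
        for (size_t i = 0; i < count; i++) {
            slots[i] = malloc(size);
            if (!slots[i]) {
                while (i > 0) free(slots[--i]);  /* release what we got */
                free(slots);
                return -1;
            }
            memset(slots[i], 0x31, size);
        }
        for (size_t i = 0; i < count; i++) {
            const unsigned char *p = slots[i];
            for (size_t b = 0; b < size; b++) {
                if (p[b] != 0x31) { damaged++; break; }
            }
        }
        /* Interleaved frees so neighbouring blocks are released at different times. */
        for (size_t i = 0; i < count; i += 2) free(slots[i]);
        for (size_t i = 1; i < count; i += 2) free(slots[i]);
    }
    free(slots);
    return damaged;
}
```

Call it from a `main` with a few candidate sizes and run the binary under `LD_PRELOAD=./libhakmem.so`. With a healthy allocator it should return 0. Whether this particular pattern triggers `[TLS_SLL_HDR_RESET]` is not guaranteed.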

Commands You'll Use

Build & Test

# Clean build
cd /mnt/workdisk/public_share/hakmem
make clean
make shared -j8

# Run baseline (will currently crash)
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench

# Run minimal test (after creating it)
./tests/test_tls_sll_minimal

With Logging

# Build with debug logging
make clean
make shared -j8 EXTRA_CFLAGS="-g -O1 -DHAKMEM_TINY_DEBUG_LOGGING=1"

# Capture diagnostic output
./tests/test_tls_sll_minimal 2>&1 | tee diagnostic_output.txt

# Analyze logs
grep HEADER_WRITE diagnostic_output.txt | tail -10
grep -B5 "got=0x31" diagnostic_output.txt
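
For the `grep HEADER_WRITE` step to find anything, the instrumentation has to emit lines carrying that tag. The sketch below shows one possible shape. The `HEADER_WRITE` tag and the `HAKMEM_TINY_DEBUG_LOGGING` flag appear in the commands above; the field layout and both function names are assumptions. It formats into a stack buffer and uses POSIX `write` rather than `fprintf`, since stdio calls inside a malloc replacement may themselves allocate.

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#ifndef HAKMEM_TINY_DEBUG_LOGGING
#define HAKMEM_TINY_DEBUG_LOGGING 0
#endif

/* Formats one log line into buf; the field layout is hypothetical. */
static int format_header_write(char *buf, size_t cap, unsigned cls,
                               const void *base, uint8_t hdr) {
    return snprintf(buf, cap, "[HEADER_WRITE] cls=%u base=0x%llx hdr=0x%02x",
                    cls, (unsigned long long)(uintptr_t)base, (unsigned)hdr);
}

/* Emits the line on stderr only when debug logging is compiled in. */
static void log_header_write(unsigned cls, const void *base, uint8_t hdr) {
#if HAKMEM_TINY_DEBUG_LOGGING
    char line[96];
    int n = format_header_write(line, sizeof line - 1, cls, base, hdr);
    if (n < 0) return;
    if ((size_t)n > sizeof line - 2) n = (int)(sizeof line - 2);  /* truncated */
    line[n++] = '\n';
    ssize_t w = write(STDERR_FILENO, line, (size_t)n);
    (void)w;  /* best-effort diagnostics; ignore short writes */
#else
    (void)cls; (void)base; (void)hdr;
#endif
}
```

Logging the same `base` value at header-write time and at the point where `[TLS_SLL_HDR_RESET]` fires lets `grep -B5 "got=0x31"` show whether a header was ever written for the failing block.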

What to Expect

Per-Step Timeline

  • Step 1 (Read diagnostic guide): 30-45 min
  • Step 2 (Create reproducer): 30-60 min
  • Step 3 (Add logging): 1-2 hours
  • Step 4 (Run test): 30 min
  • Step 5 (Pattern matching): 1 hour
  • Step 6 (Implement fix): 30 min - 1 hour
  • Step 7 (Validate): 1-2 hours

Total: 4-8 hours

What You'll Discover

By the end of the process, you will have:

  • Identified which of 6 patterns caused the issue
  • Created a minimal reproducer
  • Added diagnostic logging to find corruption
  • Traced the exact allocation/free sequence causing the problem
  • Implemented a 1-5 line fix
  • Validated the fix works with multiple benchmarks
  • Understood the root cause completely

Communication Checkpoints

After completing each step, provide brief status:

  • Step 2: "Reproducer created - crashes after X allocations"
  • Step 4: "Diagnostic logs show pattern [A/B/C/etc]"
  • Step 5: "Root cause identified as Pattern #[N]"
  • Step 6: "Fix applied - [1-2 line description]"
  • Step 7: "Validation: sh8bench passed, cfrac passed, no regressions"


Success Criteria (Clear & Measurable)

| Criterion | Status |
| --- | --- |
| Minimal reproducer created | Expected |
| Root cause identified (one of 6 patterns) | Expected |
| Diagnostic logging captured | Expected |
| Fix implemented (1-5 lines) | Expected |
| sh8bench completes without crashes | TARGET |
| cfrac completes without crashes | TARGET |
| Unit tests pass | TARGET |
| < 5% performance regression | TARGET |
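
The "< 5% performance regression" criterion can be checked with simple arithmetic on throughput figures. This is a minimal sketch: the helper names are hypothetical, and this document does not specify which measurement serves as the baseline.

```c
/* Throughput regression in percent; positive means the candidate is slower.
 * Returns 0 when the baseline is not positive, to avoid dividing by zero. */
static double regression_pct(double baseline_mops, double candidate_mops) {
    if (baseline_mops <= 0.0) return 0.0;
    return (baseline_mops - candidate_mops) / baseline_mops * 100.0;
}

/* Non-zero when the regression is strictly below max_pct. */
static int within_budget(double baseline_mops, double candidate_mops,
                         double max_pct) {
    return regression_pct(baseline_mops, candidate_mops) < max_pct;
}
```

For example, against a reference of 54.5 Mops/s, a run at 52.0 Mops/s is a regression of about 4.6% and passes. A run at 51.0 Mops/s is about 6.4% and does not.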

If You Get Stuck

Problem: Can't reproduce the error

  • Solution: Check if build includes logging headers. Verify LD_PRELOAD path is correct.

Problem: Logs don't show expected pattern

  • Solution: Check if you're logging at the right locations. Reference diagnostic guide for exact instrumentation points.

Problem: Multiple patterns seem possible

  • Solution: Add more detailed logging to narrow down. Reference diagnostic guide's pattern-specific logging recommendations.

Problem: Fix doesn't resolve the issue

  • Solution: Confirm that the logs actually show the pattern you assumed. If they do not, test the remaining patterns in order (#2, #3, and so on).

Next Steps After Completion

Once TLS SLL header corruption is fixed:

  1. Validate Phase 1 Performance (Currently 2.3%, target 15-20%)

    • Profile with perf/cachegrind
    • Identify secondary bottlenecks
    • Consider cache size optimization
  2. Proceed to Phase 2 (Headerless mode)

    • Implement HAKMEM_TINY_HEADERLESS toggle
    • Test alignment guarantees
    • Benchmark performance trade-offs
  3. Plan Phase 102 (MemApi bridge)

    • Connect hakmem to nyrt Ring0 runtime
    • Design integration points

Questions Before Starting?

  • What is Box Theory? → Read the Context Summary
  • What are Phantom Types? → Read the Context Summary
  • What are the 6 root cause patterns? → They're in the Diagnostic Guide
  • How do I add logging? → Step 3 of Handoff document + Diagnostic Guide

All answers are in the three documents. No need for external research.


You're Now Ready! 🚀

  1. Read CHATGPT_CONTEXT_SUMMARY.md (2-3 min)
  2. Follow CHATGPT_HANDOFF_TLS_DIAGNOSIS.md (7 steps, 4-8 hours)
  3. Reference TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md (as needed)

Start with Step 1 of the Handoff document.

Expected outcome: TLS SLL header corruption diagnosed and fixed.

Next review: After fix is validated and committed.


Good luck! The investigation methodology is solid, the documentation is comprehensive, and the fix is likely to be simple once identified. 💪