# 🚀 ChatGPT Task Handoff - TLS SLL Header Corruption Fix
**Target**: Claude (ChatGPT model)
**Task**: Diagnose and fix critical TLS SLL header corruption
**Status**: Ready for immediate handoff
**Date**: 2025-12-03
---
## Quick Start (TL;DR)
**The Problem**: hakmem baseline crashes with header corruption
```
[TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1
```
**Your Task**: Fix it using 7 documented steps
**Documents You Need** (in order):
1. 📖 **READ FIRST**: `CHATGPT_CONTEXT_SUMMARY.md` (2-3 min read)
2. 📋 **FOLLOW**: `CHATGPT_HANDOFF_TLS_DIAGNOSIS.md` (7 detailed steps)
3. 🔍 **REFERENCE**: `TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md` (1,150 lines of deep reference)
**Success**: TC1 baseline test completes without crashes
**Timeline**: 4-8 hours expected
---
## The Three Documents Explained
### 1. CHATGPT_CONTEXT_SUMMARY.md
**Purpose**: Quick reference and architecture overview
**Read Time**: 2-3 minutes
**Contains**:
- What 0x31 means vs 0xa1
- Project architecture (Box Theory)
- Recent changes (5 commits)
- The remaining issue explained simply
- File locations and data structures
- Build & test commands
- Success criteria
**When to Use**:
- First thing to read
- Reference when you need quick facts
- Before diving into detailed diagnosis
---
### 2. CHATGPT_HANDOFF_TLS_DIAGNOSIS.md
**Purpose**: Step-by-step task breakdown for fixing the issue
**Follow Time**: 4-8 hours
**Contains**:
- Executive summary
- 7 specific steps to diagnose and fix:
  - Step 1: Read the diagnostic guide
  - Step 2: Reproduce with minimal test
  - Step 3: Add diagnostic logging
  - Step 4: Run diagnostic test
  - Step 5: Identify root cause pattern
  - Step 6: Implement fix
  - Step 7: Validate fix
- Expected output for each step
- How to identify which of the 6 patterns caused the issue
- Example fix code for each pattern
- Validation criteria
- Commit message template
**When to Use**:
- This is your TASK DOCUMENT
- Follow the 7 steps in order
- After each step, update status
---
### 3. TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md
**Purpose**: Deep reference for detailed understanding
**Reference Time**: As needed during diagnosis
**Contains**:
- 6 root cause patterns with full code examples
- Minimal test case template
- Detailed diagnostic logging instrumentation
- Pattern-specific fix templates
- 7-step validation procedure
- Debugging techniques and tools
**When to Use**:
- During Step 3 (diagnostic logging)
- During Step 5 (pattern matching)
- During Step 6 (implementing fix)
- As reference for understanding each pattern
---
## Document Relationships
```
┌──────────────────────────────────────────┐
│ CHATGPT_CONTEXT_SUMMARY.md               │
│ (Start here - 2-3 min)                   │
│              ↓                           │
│ Quick facts + architecture overview      │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│ CHATGPT_HANDOFF_TLS_DIAGNOSIS.md         │
│ (Follow these 7 steps - 4-8 hours)       │
│              ↓                           │
│ Step 1: Read diagnostic guide            │
│ Step 2: Create minimal reproducer        │
│ Step 3: Add logging [→ consult ref #3]   │
│ Step 4: Run diagnostic test              │
│ Step 5: Match pattern [→ consult ref #3] │
│ Step 6: Implement fix [→ consult ref #3] │
│ Step 7: Validate                         │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│ TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md   │
│ (Deep reference - consult as needed)     │
│                                          │
│ 6 Root Cause Patterns:                   │
│   1. RAW vs BASE pointer                 │
│   2. Header offset mismatch              │
│   3. Atomic fence missing                │
│   4. Adjacent block overflow             │
│   5. Class index mismatch                │
│   6. Headerless mode interference        │
│                                          │
│ For each pattern: code examples + fixes  │
└──────────────────────────────────────────┘
```
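
To make pattern 1 (RAW vs BASE pointer) concrete, here is a minimal sketch of how reading a header through the wrong pointer could produce `got=0x31`. It assumes a single header byte stored directly in front of the user data. That layout and every name in the block (`HDR_SIZE`, `base_to_user`, `block_alloc`, and so on) are hypothetical and are not taken from hakmem's sources. Whether this is what actually happens is what Step 5 is meant to determine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: one header byte sits at the block base, user data follows.
 * All names and constants here are illustrative, not hakmem's API. */
#define HDR_SIZE  1u
#define HDR_MAGIC 0xa0u

/* Convert between the block base and the pointer handed to the user. */
static void *base_to_user(void *base) { return (uint8_t *)base + HDR_SIZE; }
static void *user_to_base(void *user) { return (uint8_t *)user - HDR_SIZE; }

/* Write the header at the base and return the user pointer. */
static void *block_alloc(void *base, unsigned class_idx)
{
    assert(class_idx <= 0x0fu);
    *(uint8_t *)base = (uint8_t)(HDR_MAGIC | class_idx);
    return base_to_user(base);
}

/* Simulate the application filling its buffer with ASCII '1' (0x31). */
static void fill_payload(void *user, size_t len) { memset(user, '1', len); }

/* Read the header byte the way a free path would, given a BASE pointer. */
static uint8_t header_at(const void *base) { return *(const uint8_t *)base; }
```

With a 16-byte block of class 1, `header_at(user_to_base(user))` yields `0xa1`, while `header_at(user)` (the user pointer mistaken for the base) yields `0x31` once the payload has been filled with `'1'` characters.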
---
## How to Use These Documents
### Before Starting
1. **Read Summary** (2-3 min)
   - Understand what the problem is
   - Learn about the project architecture
   - Know what tools you'll use
2. **Skim Handoff** (5 min)
   - Understand the 7-step process
   - Know what's expected at each step
   - Identify reference points
### During Work
3. **Follow Handoff Step-by-Step** (4-8 hours)
   - Step 1: Read the diagnostic guide thoroughly
   - Step 2: Create minimal reproducer
   - Step 3: Add logging (reference diagnostic guide)
   - Step 4: Run and capture output
   - Step 5: Match observed behavior to patterns (reference diagnostic guide)
   - Step 6: Implement fix (reference diagnostic guide for fix templates)
   - Step 7: Validate success
4. **Consult Diagnostic Guide as Needed**
   - When you need pattern details (Step 5)
   - When you need fix code templates (Step 6)
   - When you need validation procedures (Step 7)
### After Completion
5. **Report Status**
   - Which root cause pattern was identified
   - What fix was applied
   - Validation results
   - Commit message
---
## Key Information to Know
### The Error Explained
```
Error Message: [TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1
Interpretation:
- Location: Reading header byte from allocated block during free
- Expected: 0xa1 (0xa0 MAGIC | class_idx=1)
- Got: 0x31 (user data or corruption)
- Meaning: Header was never written OR was overwritten
Root Cause: One of 6 documented patterns
```
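
The interpretation above can be checked mechanically. The sketch below assumes the magic occupies the high nibble and the class index the low nibble. The source only states `0xa0 MAGIC | class_idx`, so the masks and all function names here are assumptions for illustration, not hakmem's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: high nibble = magic, low nibble = class index.
 * Constants and names are illustrative, not taken from hakmem. */
#define HDR_MAGIC      0xa0u
#define HDR_MAGIC_MASK 0xf0u
#define HDR_CLASS_MASK 0x0fu

/* Build the header byte expected for a given size class. */
static uint8_t hdr_expected(unsigned class_idx)
{
    assert(class_idx <= HDR_CLASS_MASK);
    return (uint8_t)(HDR_MAGIC | class_idx);
}

/* Return 1 if the byte matches the expected header for this class. */
static int hdr_is_valid(uint8_t got, unsigned class_idx)
{
    return got == hdr_expected(class_idx);
}

/* Return 1 if the byte carries the magic nibble at all (any class). */
static int hdr_has_magic(uint8_t got)
{
    return (got & HDR_MAGIC_MASK) == HDR_MAGIC;
}
```

Under these assumptions `hdr_expected(1)` is `0xa1`, and `0x31` fails even the magic check. Note that `0x31` is the ASCII code for the character `'1'`, which is consistent with the byte being user data rather than a header with the wrong class.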
### Success Looks Like
```bash
# Before fix:
$ LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
[TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1
Segmentation fault (code 139)
Execution time: ~22 seconds before crash
# After fix:
$ LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
Total: 54.5 Mops/s [no TLS_SLL_HDR_RESET errors]
Execution time: 4-6 minutes [completes successfully]
```
---
## File Locations You'll Need
| File | Purpose | Action |
|------|---------|--------|
| `core/box/tls_sll_box.h` | Error source | Read/understand |
| `core/hakmem_tiny_free.inc` | Header write | Add logging |
| `core/hakmem_tiny_refill.inc.h` | Magazine spill | Check for issues |
| `core/box/ptr_conversion_box.h` | Pointer conversion | Understand logic |
| `core/box/tiny_layout_box.h` | Class layout | Understand definitions |
| `tests/test_tls_sll_minimal.c` | Your test | Create this |
| `debug_artifacts/headerless/` | Benchmark logs | Reference existing |
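
As a starting point for `tests/test_tls_sll_minimal.c`, the core of a reproducer might look like the sketch below. It is not the template from the diagnostic guide. The function name, the batch size, and the choice of `'1'` as fill byte are assumptions, and the block size that maps to `cls=1` has to be taken from the class layout definitions rather than guessed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BATCH 64  /* hypothetical batch size; tune while reproducing */

/* Allocate and free small blocks in batches. Each block is filled with
 * '1' (0x31) and its payload is verified just before it is freed.
 * Returns the number of blocks whose payload changed, or -1 if malloc
 * failed. */
static long stress_small_blocks(size_t block_size, size_t rounds)
{
    unsigned char *blocks[BATCH];
    long corrupted = 0;

    assert(block_size > 0);
    for (size_t r = 0; r < rounds; r++) {
        for (size_t i = 0; i < BATCH; i++) {
            blocks[i] = malloc(block_size);
            if (blocks[i] == NULL) {
                while (i > 0) free(blocks[--i]);
                return -1;
            }
            memset(blocks[i], '1', block_size);
        }
        /* Free in reverse allocation order, checking each payload first. */
        for (size_t i = BATCH; i > 0; i--) {
            unsigned char *p = blocks[i - 1];
            for (size_t b = 0; b < block_size; b++) {
                if (p[b] != '1') { corrupted++; break; }
            }
            free(p);
        }
    }
    return corrupted;
}
```

Calling this from `main` with a range of small sizes and running the binary under `LD_PRELOAD=./libhakmem.so` exercises the allocator's free path. With a healthy allocator the function returns 0.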
---
## Commands You'll Use
### Build & Test
```bash
# Clean build
cd /mnt/workdisk/public_share/hakmem
make clean
make shared -j8
# Run baseline (will currently crash)
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
# Run minimal test (after creating it)
./tests/test_tls_sll_minimal
```
### With Logging
```bash
# Build with debug logging
make clean
make shared -j8 EXTRA_CFLAGS="-g -O1 -DHAKMEM_TINY_DEBUG_LOGGING=1"
# Capture diagnostic output
./tests/test_tls_sll_minimal 2>&1 | tee diagnostic_output.txt
# Analyze logs
grep HEADER_WRITE diagnostic_output.txt | tail -10
grep -B5 "got=0x31" diagnostic_output.txt
```
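
The `grep HEADER_WRITE` command above presupposes that the header write path emits a tagged log line. A minimal sketch of such instrumentation follows, gated on the `HAKMEM_TINY_DEBUG_LOGGING` define used in the build command. The record format, `format_header_write`, and `LOG_HEADER_WRITE` are hypothetical, so use the exact instrumentation points and format from the diagnostic guide where they differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format one HEADER_WRITE record into buf.
 * Returns the length reported by snprintf. The format is an assumption. */
static int format_header_write(char *buf, size_t cap, unsigned class_idx,
                               const void *base, uint8_t value)
{
    assert(buf != NULL && cap > 0);
    return snprintf(buf, cap, "[HEADER_WRITE] cls=%u base=%p val=0x%02x",
                    class_idx, (void *)base, (unsigned)value);
}

/* Emit the record to stderr only in debug-logging builds. */
#if defined(HAKMEM_TINY_DEBUG_LOGGING) && HAKMEM_TINY_DEBUG_LOGGING
#define LOG_HEADER_WRITE(cls, base, val)                                 \
    do {                                                                 \
        char line_[96];                                                  \
        format_header_write(line_, sizeof line_, (cls), (base), (val));  \
        fprintf(stderr, "%s\n", line_);                                  \
    } while (0)
#else
#define LOG_HEADER_WRITE(cls, base, val) ((void)0)
#endif
```

Because the records go to stderr, the `2>&1 | tee` redirection captures them in `diagnostic_output.txt` alongside the `[TLS_SLL_HDR_RESET]` messages, so `grep -B5` shows the writes that preceded a mismatch.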
---
## What to Expect
### Per-Step Timeline
- **Step 1** (Read diagnostic guide): 30-45 min
- **Step 2** (Create reproducer): 30-60 min
- **Step 3** (Add logging): 1-2 hours
- **Step 4** (Run test): 30 min
- **Step 5** (Pattern matching): 1 hour
- **Step 6** (Implement fix): 30 min - 1 hour
- **Step 7** (Validate): 1-2 hours
**Total**: 4-8 hours
### What You'll Discover
By the end of the process, you will have:
- ✅ Identified which of the 6 patterns caused the issue
- ✅ Created a minimal reproducer
- ✅ Added diagnostic logging to find corruption
- ✅ Traced the exact allocation/free sequence causing the problem
- ✅ Implemented a 1-5 line fix
- ✅ Validated the fix works with multiple benchmarks
- ✅ Understood the root cause completely
---
## Communication Checkpoints
After completing each step, provide brief status:
- **Step 2**: "Reproducer created - crashes after X allocations"
- **Step 4**: "Diagnostic logs show pattern [A/B/C/etc]"
- **Step 5**: "Root cause identified as Pattern #[N]"
- **Step 6**: "Fix applied - [1-2 line description]"
- **Step 7**: "Validation: sh8bench passed, cfrac passed, no regressions"
---
## Success Criteria (Clear & Measurable)
| Criterion | Status |
|-----------|--------|
| Minimal reproducer created | ✅ Expected |
| Root cause identified (one of 6 patterns) | ✅ Expected |
| Diagnostic logging captured | ✅ Expected |
| Fix implemented (1-5 lines) | ✅ Expected |
| sh8bench completes without crashes | ✅ TARGET |
| cfrac completes without crashes | ✅ TARGET |
| Unit tests pass | ✅ TARGET |
| < 5% performance regression | ✅ TARGET |
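
The last criterion is simple arithmetic on benchmark throughput. A small sketch, assuming throughput is reported in Mops/s as in the sh8bench output and that the regression is measured against a reference throughput you choose (the document does not say which run serves as the baseline). Both function names are hypothetical.

```c
#include <assert.h>

/* Percent of throughput lost relative to the baseline.
 * A negative result means the candidate is faster. */
static double regression_pct(double baseline_mops, double candidate_mops)
{
    assert(baseline_mops > 0.0);
    return (baseline_mops - candidate_mops) / baseline_mops * 100.0;
}

/* Return 1 if the candidate stays under the allowed regression budget. */
static int within_budget(double baseline_mops, double candidate_mops,
                         double max_regression_pct)
{
    return regression_pct(baseline_mops, candidate_mops) < max_regression_pct;
}
```

For example, a drop from 100 to 96 Mops/s is a 4% regression and passes a 5% budget, while a drop to 90 Mops/s does not.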
---
## If You Get Stuck
**Problem**: Can't reproduce the error
- **Solution**: Check if build includes logging headers. Verify LD_PRELOAD path is correct.
**Problem**: Logs don't show expected pattern
- **Solution**: Check if you're logging at the right locations. Reference diagnostic guide for exact instrumentation points.
**Problem**: Multiple patterns seem possible
- **Solution**: Add more detailed logging to narrow down. Reference diagnostic guide's pattern-specific logging recommendations.
**Problem**: Fix doesn't resolve the issue
- **Solution**: Confirm that the logs actually show the pattern you assumed. If they do not, test the remaining patterns in order (#2, #3, and so on).
---
## Next Steps After Completion
Once TLS SLL header corruption is fixed:
1. **Validate Phase 1 Performance** (Currently 2.3%, target 15-20%)
   - Profile with perf/cachegrind
   - Identify secondary bottlenecks
   - Consider cache size optimization
2. **Proceed to Phase 2** (Headerless mode)
   - Implement HAKMEM_TINY_HEADERLESS toggle
   - Test alignment guarantees
   - Benchmark performance trade-offs
3. **Plan Phase 102** (MemApi bridge)
   - Connect hakmem to nyrt Ring0 runtime
   - Design integration points
---
## Questions Before Starting?
- ❓ What is Box Theory? → Read the Context Summary
- ❓ What are Phantom Types? → Read the Context Summary
- ❓ What are the 6 root cause patterns? → They're in the Diagnostic Guide
- ❓ How do I add logging? → Step 3 of Handoff document + Diagnostic Guide
**All answers are in the three documents. No need for external research.**
---
## You're Now Ready! 🚀
1. **Read** `CHATGPT_CONTEXT_SUMMARY.md` (2-3 min)
2. **Follow** `CHATGPT_HANDOFF_TLS_DIAGNOSIS.md` (7 steps, 4-8 hours)
3. **Reference** `TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md` (as needed)
**Start with Step 1 of the Handoff document.**
**Expected outcome**: TLS SLL header corruption diagnosed and fixed. ✅
**Next review**: After fix is validated and committed.
---
**Good luck! The investigation methodology is solid, the documentation is comprehensive, and the fix is likely to be simple once identified. 💪**