# 🚀 ChatGPT Task Handoff - TLS SLL Header Corruption Fix

**Target**: Claude (ChatGPT model)
**Task**: Diagnose and fix critical TLS SLL header corruption
**Status**: Ready for immediate handoff
**Date**: 2025-12-03

---
## Quick Start (TL;DR)

**The Problem**: hakmem baseline crashes with header corruption
```
[TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1
```

**Your Task**: Fix it using 7 documented steps

**Documents You Need** (in order):
1. 📖 **READ FIRST**: `CHATGPT_CONTEXT_SUMMARY.md` (2-3 min read)
2. 📋 **FOLLOW**: `CHATGPT_HANDOFF_TLS_DIAGNOSIS.md` (7 detailed steps)
3. 🔍 **REFERENCE**: `TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md` (1,150 lines of deep reference)

**Success**: TC1 baseline test completes without crashes

**Timeline**: 4-8 hours expected

---
## The Three Documents Explained

### 1. CHATGPT_CONTEXT_SUMMARY.md

**Purpose**: Quick reference and architecture overview
**Read Time**: 2-3 minutes
**Contains**:
- What 0x31 means vs 0xa1
- Project architecture (Box Theory)
- Recent changes (5 commits)
- The remaining issue explained simply
- File locations and data structures
- Build & test commands
- Success criteria

**When to Use**:
- First thing to read
- Reference when you need quick facts
- Before diving into detailed diagnosis

---
### 2. CHATGPT_HANDOFF_TLS_DIAGNOSIS.md

**Purpose**: Step-by-step task breakdown for fixing the issue
**Follow Time**: 4-8 hours
**Contains**:
- Executive summary
- 7 specific steps to diagnose and fix:
  - Step 1: Read the diagnostic guide
  - Step 2: Reproduce with minimal test
  - Step 3: Add diagnostic logging
  - Step 4: Run diagnostic test
  - Step 5: Identify root cause pattern
  - Step 6: Implement fix
  - Step 7: Validate fix
- Expected output for each step
- How to identify which of 6 patterns caused the issue
- Example fix code for each pattern
- Validation criteria
- Commit message template

**When to Use**:
- This is your TASK DOCUMENT
- Follow the 7 steps in order
- After each step, update status

---
### 3. TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md

**Purpose**: Deep reference for detailed understanding
**Reference Time**: As needed during diagnosis
**Contains**:
- 6 root cause patterns with full code examples
- Minimal test case template
- Detailed diagnostic logging instrumentation
- Pattern-specific fix templates
- 7-step validation procedure
- Debugging techniques and tools

**When to Use**:
- During Step 3 (diagnostic logging)
- During Step 5 (pattern matching)
- During Step 6 (implementing fix)
- As reference for understanding each pattern

---
## Document Relationships

```
┌──────────────────────────────────────────┐
│ CHATGPT_CONTEXT_SUMMARY.md               │
│ (Start here - 2-3 min)                   │
│ ↓                                        │
│ Quick facts + architecture overview      │
└────────────────────┬─────────────────────┘
                     │
                     ↓
┌──────────────────────────────────────────┐
│ CHATGPT_HANDOFF_TLS_DIAGNOSIS.md         │
│ (Follow these 7 steps - 4-8 hours)       │
│ ↓                                        │
│ Step 1: Read diagnostic guide            │
│ Step 2: Create minimal reproducer        │
│ Step 3: Add logging   [→ consult ref #3] │
│ Step 4: Run diagnostic test              │
│ Step 5: Match pattern [→ consult ref #3] │
│ Step 6: Implement fix [→ consult ref #3] │
│ Step 7: Validate                         │
└────────────────────┬─────────────────────┘
                     │
                     ↓
┌──────────────────────────────────────────┐
│ TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md   │
│ (Deep reference - consult as needed)     │
│                                          │
│ 6 Root Cause Patterns:                   │
│   1. RAW vs BASE pointer                 │
│   2. Header offset mismatch              │
│   3. Atomic fence missing                │
│   4. Adjacent block overflow             │
│   5. Class index mismatch                │
│   6. Headerless mode interference        │
│                                          │
│ For each pattern: code examples + fixes  │
└──────────────────────────────────────────┘
```
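To make patterns 1 and 2 from the diagram more concrete, here is a minimal sketch of how a pointer mix-up could yield `got=0x31`. It assumes a 1-byte header stored at the block base with the user pointer one byte after it; that layout and every `ex_` name are illustrative assumptions, not hakmem's actual definitions (see `core/box/ptr_conversion_box.h` for the real conversion logic). Reading the header through a user pointer instead of a base pointer returns payload bytes rather than the header.

```c
#include <stdint.h>

/* Assumed layout for illustration only: [1-byte header][user payload...] */
enum { EX_HEADER_SIZE = 1 };

/* Convert between the block base and the pointer handed to the caller. */
static inline void *ex_base_to_user(void *base) {
    return (uint8_t *)base + EX_HEADER_SIZE;
}

static inline void *ex_user_to_base(void *user) {
    return (uint8_t *)user - EX_HEADER_SIZE;
}

/* Header accessors: both expect a BASE pointer, never a user pointer. */
static inline void ex_write_header(void *base, uint8_t hdr) {
    *(uint8_t *)base = hdr;
}

static inline uint8_t ex_read_header(const void *base) {
    return *(const uint8_t *)base;
}
```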
---
## How to Use These Documents

### Before Starting

1. **Read Summary** (2-3 min)
   - Understand what the problem is
   - Learn about the project architecture
   - Know what tools you'll use

2. **Skim Handoff** (5 min)
   - Understand the 7-step process
   - Know what's expected at each step
   - Identify reference points

### During Work

3. **Follow Handoff Step-by-Step** (4-8 hours)
   - Step 1: Read the diagnostic guide thoroughly
   - Step 2: Create minimal reproducer
   - Step 3: Add logging (reference diagnostic guide)
   - Step 4: Run and capture output
   - Step 5: Match observed behavior to patterns (reference diagnostic guide)
   - Step 6: Implement fix (reference diagnostic guide for fix templates)
   - Step 7: Validate success

4. **Consult Diagnostic Guide as Needed**
   - When you need pattern details (Step 5)
   - When you need fix code templates (Step 6)
   - When you need validation procedures (Step 7)

### After Completion

5. **Report Status**
   - Which root cause pattern was identified
   - What fix was applied
   - Validation results
   - Commit message

---
## Key Information to Know

### The Error Explained

```
Error Message: [TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1

Interpretation:
- Location: Reading header byte from allocated block during free
- Expected: 0xa1 (0xa0 MAGIC | class_idx=1)
- Got: 0x31 (user data or corruption)
- Meaning: Header was never written OR was overwritten

Root Cause: One of 6 documented patterns
```
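The interpretation above can be sketched in C. This is a minimal illustration, assuming the header byte carries the magic value in its high nibble and the class index in its low nibble, which is what `0xa0 MAGIC | class_idx=1` suggests; the masks and all `ex_` names are hypothetical, not identifiers from hakmem.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: high nibble = magic, low nibble = size-class index. */
#define EX_HDR_MAGIC      0xa0u
#define EX_HDR_MAGIC_MASK 0xf0u
#define EX_HDR_CLASS_MASK 0x0fu

/* Header byte the free path would expect for a block of this class. */
static inline uint8_t ex_hdr_expected(unsigned class_idx) {
    return (uint8_t)(EX_HDR_MAGIC | (class_idx & EX_HDR_CLASS_MASK));
}

/* True when the byte still carries the magic nibble. */
static inline bool ex_hdr_magic_ok(uint8_t hdr) {
    return (hdr & EX_HDR_MAGIC_MASK) == EX_HDR_MAGIC;
}

/* Class index stored in the low nibble of a header byte. */
static inline unsigned ex_hdr_class(uint8_t hdr) {
    return hdr & EX_HDR_CLASS_MASK;
}
```

Under these assumptions `ex_hdr_expected(1)` is `0xa1`, and `ex_hdr_magic_ok(0x31)` is false because the high nibble of `0x31` is `0x30`. Note that `0x31` is also the ASCII code for the character `'1'`, which fits the "user data" reading but does not prove it.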
### Success Looks Like

```bash
# Before fix:
$ LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
[TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1
Segmentation fault (code 139)
Execution time: ~22 seconds before crash

# After fix:
$ LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
Total: 54.5 Mops/s [no TLS_SLL_HDR_RESET errors]
Execution time: 4-6 minutes [completes successfully]
```

---
## File Locations You'll Need

| File | Purpose | Action |
|------|---------|--------|
| `core/box/tls_sll_box.h` | Error source | Read/understand |
| `core/hakmem_tiny_free.inc` | Header write | Add logging |
| `core/hakmem_tiny_refill.inc.h` | Magazine spill | Check for issues |
| `core/box/ptr_conversion_box.h` | Pointer conversion | Understand logic |
| `core/box/tiny_layout_box.h` | Class layout | Understand definitions |
| `tests/test_tls_sll_minimal.c` | Your test | Create this |
| `debug_artifacts/headerless/` | Benchmark logs | Reference existing |

---
## Commands You'll Use

### Build & Test

```bash
# Clean build
cd /mnt/workdisk/public_share/hakmem
make clean
make shared -j8

# Run baseline (will currently crash)
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench

# Run minimal test (after creating it)
./tests/test_tls_sll_minimal
```
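The minimal test referenced in the last command does not exist yet. One possible starting point for its core loop is sketched below; it is a sketch under stated assumptions, not the template from the diagnostic guide. The batch size, the fill pattern, and the `ex_churn` name are all hypothetical, and which request size maps to `cls=1` is not stated here, so the block size is left as a parameter.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reproducer core; the batch size and fill pattern are
 * illustrative assumptions, not values taken from hakmem. */
enum { EX_BATCH = 256 };

/* Per round: allocate EX_BATCH blocks of block_size bytes, fill each with a
 * per-slot byte, verify the payloads, then free in reverse order.
 * Returns the number of blocks whose payload changed, or -1 if malloc fails. */
long ex_churn(size_t rounds, size_t block_size) {
    unsigned char *blocks[EX_BATCH];
    long damaged = 0;

    for (size_t r = 0; r < rounds; r++) {
        for (size_t i = 0; i < EX_BATCH; i++) {
            blocks[i] = malloc(block_size);
            if (blocks[i] == NULL) {
                while (i > 0) free(blocks[--i]);  /* release what we hold */
                return -1;
            }
            memset(blocks[i], (int)(i & 0xff), block_size);
        }
        for (size_t i = 0; i < EX_BATCH; i++) {
            for (size_t k = 0; k < block_size; k++) {
                if (blocks[i][k] != (unsigned char)(i & 0xff)) {
                    damaged++;
                    break;
                }
            }
        }
        for (size_t i = EX_BATCH; i > 0; i--) free(blocks[i - 1]);
    }
    return damaged;
}
```

A `main` that calls `ex_churn` with a large round count and exits non-zero on any non-zero result would complete the test. With hakmem in the picture, the corruption would more likely surface as the `TLS_SLL_HDR_RESET` message or a crash than as a payload mismatch, so treat the return value as a secondary signal.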
### With Logging

```bash
# Build with debug logging
make clean
make shared -j8 EXTRA_CFLAGS="-g -O1 -DHAKMEM_TINY_DEBUG_LOGGING=1"

# Capture diagnostic output
./tests/test_tls_sll_minimal 2>&1 | tee diagnostic_output.txt

# Analyze logs
grep HEADER_WRITE diagnostic_output.txt | tail -10
grep -B5 "got=0x31" diagnostic_output.txt
```
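If `grep` is not enough and you want to tabulate the reset records, a small parser is one option. The sketch below assumes each record starts at the beginning of a line and follows the field layout of the message shown in this document, with `base` printed as a hexadecimal address; the struct and function names are hypothetical. It relies only on standard `sscanf`, whose `%x` conversions accept an optional `0x` prefix.

```c
#include <stdio.h>

/* Hypothetical record type for one TLS_SLL_HDR_RESET log line. */
typedef struct {
    unsigned cls;             /* size-class index */
    unsigned long long base;  /* block base address */
    unsigned got;             /* header byte actually read */
    unsigned expect;          /* header byte that was expected */
} ex_hdr_reset;

/* Returns 1 and fills *out if the line is a reset record, else returns 0. */
int ex_parse_hdr_reset(const char *line, ex_hdr_reset *out) {
    return sscanf(line,
                  "[TLS_SLL_HDR_RESET] cls=%u base=%llx got=%x expect=%x",
                  &out->cls, &out->base, &out->got, &out->expect) == 4;
}
```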
---
## What to Expect

### Per-Step Timeline

- **Step 1** (Read diagnostic guide): 30-45 min
- **Step 2** (Create reproducer): 30-60 min
- **Step 3** (Add logging): 1-2 hours
- **Step 4** (Run test): 30 min
- **Step 5** (Pattern matching): 1 hour
- **Step 6** (Implement fix): 30 min - 1 hour
- **Step 7** (Validate): 1-2 hours

**Total**: 4-8 hours

### What You'll Discover

By the end of the process, you will have:
- ✅ Identified which of 6 patterns caused the issue
- ✅ Created a minimal reproducer
- ✅ Added diagnostic logging to find corruption
- ✅ Traced the exact allocation/free sequence causing the problem
- ✅ Implemented a 1-5 line fix
- ✅ Validated the fix works with multiple benchmarks
- ✅ Understood the root cause completely

---
## Communication Checkpoints

After completing each of the following steps, provide a brief status:

- **Step 2**: "Reproducer created - crashes after X allocations"
- **Step 4**: "Diagnostic logs point to pattern(s) #[N]"
- **Step 5**: "Root cause identified as Pattern #[N]"
- **Step 6**: "Fix applied - [1-2 line description]"
- **Step 7**: "Validation: sh8bench passed, cfrac passed, no regressions"

---
## Success Criteria (Clear & Measurable)

| Criterion | Status |
|-----------|--------|
| Minimal reproducer created | ✅ Expected |
| Root cause identified (one of 6 patterns) | ✅ Expected |
| Diagnostic logging captured | ✅ Expected |
| Fix implemented (1-5 lines) | ✅ Expected |
| sh8bench completes without crashes | ✅ TARGET |
| cfrac completes without crashes | ✅ TARGET |
| Unit tests pass | ✅ TARGET |
| < 5% performance regression | ✅ TARGET |

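
The last criterion can be checked mechanically once you have two throughput numbers. The sketch below assumes the metric is benchmark throughput in Mops/s (the unit sh8bench reports above) and that regression means relative throughput loss against a pre-fix reference run; the table does not spell out either point, and the function names are hypothetical. The baseline must be greater than zero.

```c
/* Relative throughput loss in percent; positive means the candidate is
 * slower than the baseline. Assumes baseline_mops > 0. */
double ex_regression_pct(double baseline_mops, double candidate_mops) {
    return (baseline_mops - candidate_mops) / baseline_mops * 100.0;
}

/* Returns 1 when the loss stays strictly below the budget (5.0 for "< 5%"). */
int ex_within_budget(double baseline_mops, double candidate_mops,
                     double budget_pct) {
    return ex_regression_pct(baseline_mops, candidate_mops) < budget_pct;
}
```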
---
## If You Get Stuck

**Problem**: Can't reproduce the error
- **Solution**: Check that the build includes the logging headers, and verify that the LD_PRELOAD path is correct.

**Problem**: Logs don't show expected pattern
- **Solution**: Check that you are logging at the right locations. See the diagnostic guide for the exact instrumentation points.

**Problem**: Multiple patterns seem possible
- **Solution**: Add more detailed logging to narrow it down. See the diagnostic guide's pattern-specific logging recommendations.

**Problem**: Fix doesn't resolve the issue
- **Solution**: Confirm that the logs actually show the pattern you assumed. If they don't, the root cause may be a different pattern; work through the remaining patterns (#2, #3, and so on) in order.

---
## Next Steps After Completion

Once TLS SLL header corruption is fixed:

1. **Validate Phase 1 Performance** (Currently 2.3%, target 15-20%)
   - Profile with perf/cachegrind
   - Identify secondary bottlenecks
   - Consider cache size optimization

2. **Proceed to Phase 2** (Headerless mode)
   - Implement HAKMEM_TINY_HEADERLESS toggle
   - Test alignment guarantees
   - Benchmark performance trade-offs

3. **Plan Phase 102** (MemApi bridge)
   - Connect hakmem to nyrt Ring0 runtime
   - Design integration points

---
## Questions Before Starting?

- ❓ What is Box Theory? → Read the Context Summary
- ❓ What are Phantom Types? → Read the Context Summary
- ❓ What are the 6 root cause patterns? → They're in the Diagnostic Guide
- ❓ How do I add logging? → Step 3 of Handoff document + Diagnostic Guide

**All answers are in the three documents. No need for external research.**

---
## You're Now Ready! 🚀

1. **Read** `CHATGPT_CONTEXT_SUMMARY.md` (2-3 min)
2. **Follow** `CHATGPT_HANDOFF_TLS_DIAGNOSIS.md` (7 steps, 4-8 hours)
3. **Reference** `TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md` (as needed)

**Start with Step 1 of the Handoff document.**

**Expected outcome**: TLS SLL header corruption diagnosed and fixed. ✅

**Next review**: After fix is validated and committed.

---

**Good luck! The investigation methodology is solid, the documentation is comprehensive, and the fix is likely to be simple once identified. 💪**