# Context Summary for ChatGPT - TLS SLL Header Corruption Fix

**Date**: 2025-12-03
**Project**: hakmem - Custom Memory Allocator
**Handoff From**: Gemini + Task agent (previous phase)
**Current Task**: Diagnose and fix TLS SLL header corruption
**Status**: CRITICAL BLOCKER - Investigation Required

---

## Quick Facts

| Item | Value |
|------|-------|
| **Problem** | Header corruption in TLS SLL during baseline testing |
| **Error Message** | `[TLS_SLL_HDR_RESET] cls=1 base=0x... got=0x31 expect=0xa1` |
| **Error Location** | `core/box/tls_sll_box.h:282-303` |
| **Affected Configurations** | ALL (shared code path issue) |
| **Root Cause** | Unknown (6 patterns documented) |
| **Fix Type** | Surgical (1-5 lines expected) |
| **Build Status** | ✅ Succeeds |
| **Baseline Test Status** | ❌ Crashes (SIGSEGV at ~22 seconds) |

---

## What is 0x31 vs 0xa1?

```
Expected (header magic): 0xa1 = 0xa0 (HEADER_MAGIC) | 0x01 (class_idx=1)
Got (corruption):        0x31 = ASCII character '1' or some user data

This means: User data exists where header should be.
```
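
The decoding above can be checked with a few lines of C. This is a minimal sketch: `HEADER_MAGIC = 0xa0` comes from the text, while the mask value `0x0f` and the helper names `hdr_expected`, `hdr_has_magic`, and `hdr_describe` are assumptions made for illustration and may not match hakmem's actual definitions.

```c
#include <stdint.h>
#include <stdio.h>

/* 0xa0 is taken from this document; the 0x0f class mask is an assumption. */
#define HEADER_MAGIC      0xa0u
#define HEADER_CLASS_MASK 0x0fu

/* Header byte the pop path would expect for a given class (hypothetical helper). */
static uint8_t hdr_expected(int class_idx) {
    return (uint8_t)(HEADER_MAGIC | ((unsigned)class_idx & HEADER_CLASS_MASK));
}

/* Returns 1 if the high nibble matches the magic, i.e. the byte looks like a header. */
static int hdr_has_magic(uint8_t byte) {
    return (byte & 0xf0u) == HEADER_MAGIC;
}

/* Print how a byte read at offset 0 would be interpreted. */
static void hdr_describe(uint8_t got, int class_idx) {
    uint8_t expect = hdr_expected(class_idx);
    if (got == expect) {
        printf("0x%02x: valid header for class %d\n", got, class_idx);
    } else if (hdr_has_magic(got)) {
        printf("0x%02x: header magic present, but class %u != %d\n",
               got, got & HEADER_CLASS_MASK, class_idx);
    } else {
        printf("0x%02x: no header magic (expected 0x%02x); likely user data\n",
               got, expect);
    }
}
```

Under these assumptions, `hdr_describe(0x31, 1)` takes the last branch: the high nibble is `0x3`, not `0xa`, so the byte was never a header for any class. That makes a plain class-index mix-up less likely than a missing or overwritten header.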

---

## Project Architecture (Box Theory)

The hakmem allocator uses a **Box Theory** architecture where:

- Each component (memory layout, pointer conversion, TLS state) is a separate "box"
- Each box has a single responsibility and clear API boundaries
- Examples:
  - `tiny_layout_box.h` - Class sizes and header offsets (single source of truth)
  - `ptr_conversion_box.h` - Pointer type safety (base vs user pointers)
  - `tls_sll_box.h` - Thread-local singly linked list (TLS SLL) management
  - `tls_ss_hint_box.h` - SuperSlab hint cache (Phase 1 optimization)

---

## Recent Changes (Last 5 Commits)

1. **f3f75ba3d** - "Fix Magazine Spill RAW pointer type conversion"
   - Added `HAK_BASE_FROM_RAW()` wrapper in `hakmem_tiny_refill.inc.h:228`
   - Status: ✅ Fixed

2. **2dc9d5d59** - "Fix include order in hakmem.c"
   - Moved `#include "box/hak_kpi_util.inc.h"` before `hak_core_init.inc.h`
   - Status: ✅ Fixed

3. **94f9ea51** - "Implement TLS SuperSlab Hint Box (Phase 1)"
   - New header-only cache for recently-used SuperSlabs
   - Status: ✅ Implemented, but only 2.3% performance improvement (target was 15-20%)

4. Earlier: Box theory framework, phantom types, etc.

---

## The Remaining Issue: TLS SLL Header Corruption

### Symptom

```bash
# Build succeeds
$ make clean && make shared -j8
Building libhakmem.so... OK (547KB)

# But baseline test crashes
$ LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
[TLS_SLL_HDR_RESET] cls=1 base=0x7ef296abf8c8 got=0x31 expect=0xa1 count=0
Segmentation fault (core dumped)
```

### Timeline

- **When Discovered**: During Phase 1 benchmarking (2025-12-03)
- **Frequency**: 100% reproducible with sh8bench
- **Scope**: Occurs in the baseline build (Headerless OFF), so every configuration that shares this code path is affected

### Error Location

**File**: `core/box/tls_sll_box.h` (lines 282-303)
**Function**: `tls_sll_pop_impl()`
**Operation**: Header validation when a block is popped

```c
// Simplified logic (actual code has more details)
if (tiny_class_preserves_header(class_idx)) {
    uint8_t* b = (uint8_t*)raw_base;
    uint8_t got = *b;  // Read byte at offset 0
    uint8_t expected = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));

    if (got != expected) {
        fprintf(stderr, "[TLS_SLL_HDR_RESET] cls=%d base=%p got=0x%02x expect=0x%02x\n",
                class_idx, raw_base, got, expected);
        // Reset TLS SLL for this class
    }
}
```

### Root Cause - Six Documented Patterns

The diagnostic document identifies six possible patterns:

1. **RAW Pointer vs BASE Pointer** - Wrong pointer type passed to `tls_sll_push()`
2. **Header Offset Mismatch** - Writing at one offset, reading at another
3. **Atomic Fence Missing** - Compiler/CPU reordering of write + push
4. **Adjacent Block Overflow** - User data from previous block overwrites header
5. **Class Index Mismatch** - Push with one class_idx, pop as different class_idx
6. **Headerless Mode Interference** - Mixed header/headerless logic despite OFF flag
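
To make pattern 1 concrete, the following self-contained simulation shows how treating a user pointer as a base pointer produces exactly `got=0x31`. It is an illustration only, not hakmem code and not a claim about the actual root cause: the 16-byte block size, the `0x0f` mask, and all function names are assumptions.

```c
#include <stdint.h>
#include <string.h>

#define HEADER_MAGIC      0xa0u   /* from this document */
#define HEADER_CLASS_MASK 0x0fu   /* assumed mask value */
#define USER_OFFSET       1       /* "typically base + 1" per this document */

/* Simulate a 16-byte class-1 block: header byte, then user payload. */
static void block_init(uint8_t block[16], int class_idx) {
    block[0] = (uint8_t)(HEADER_MAGIC | ((unsigned)class_idx & HEADER_CLASS_MASK));
    memset(block + USER_OFFSET, 0, 16 - USER_OFFSET);
}

/* Same comparison as the simplified pop-side check: read the byte at offset 0. */
static int header_ok(const void* supposed_base, int class_idx, uint8_t* got_out) {
    uint8_t got = *(const uint8_t*)supposed_base;
    uint8_t expect = (uint8_t)(HEADER_MAGIC | ((unsigned)class_idx & HEADER_CLASS_MASK));
    if (got_out) *got_out = got;
    return got == expect;
}

/* Returns the byte the check sees when a user pointer is mistaken for a base pointer. */
static uint8_t simulate_wrong_pointer_push(void) {
    uint8_t block[16];
    uint8_t got = 0;
    block_init(block, 1);
    uint8_t* user = block + USER_OFFSET;
    memcpy(user, "1234", 4);            /* application writes ASCII digits */
    (void)header_ok(user, 1, &got);     /* bug: user pointer treated as base */
    return got;
}
```

In this simulation the header at offset 0 is intact the whole time; the check fails only because it looks one byte too far. Patterns 2 and 4 would leave a similar signature, so the logged `base` address (and whether it is aligned as a block start should be) is worth comparing against the block layout before settling on a pattern.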

---

## Your Task

**You have two comprehensive documents**:

1. **`docs/CHATGPT_HANDOFF_TLS_DIAGNOSIS.md`** (THIS FILE'S COMPANION)
   - Step-by-step task breakdown
   - 7-step investigation and fix process
   - Expected validation criteria

2. **`docs/TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md`** (MAIN REFERENCE - 1,150+ LINES)
   - Deep dive into all 6 root cause patterns
   - Code examples for each pattern
   - Minimal test case template
   - Diagnostic logging instrumentation
   - Fix code templates
   - 7-step validation procedure

**Follow the handoff document's steps 1-7 to diagnose and fix this issue.**

---

## Build & Test Commands

### Quick Build

```bash
cd /mnt/workdisk/public_share/hakmem
make clean
make shared -j8
```

### Baseline Test (Should Currently Crash)

```bash
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | \
  grep -E "TLS_SLL_HDR_RESET|Total|Segmentation"
```

### Minimal Test Case (After Creation)

```bash
./tests/test_tls_sll_minimal 2>&1 | grep -E "TLS_SLL_HDR_RESET|PASS|FAIL"
```
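
The minimal test does not exist yet, and the template in `docs/TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md` takes precedence over anything here. As a rough idea of what its core loop might look like, the sketch below churns small allocations filled with ASCII `'1'` (0x31). The function name, the choice of 16-byte blocks for class 1, and the round/count values are all assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stress loop: allocate small blocks, fill them with ASCII '1'
 * (0x31), free them, and repeat so that freed blocks cycle through the
 * allocator's thread-local free list. Returns the number of blocks whose
 * contents changed between fill and free (0 means no user data was damaged),
 * or -1 if an allocation failed. */
static long stress_small_blocks(size_t rounds, size_t count, size_t size) {
    long damaged = 0;
    void** blocks = calloc(count, sizeof *blocks);
    if (!blocks) return -1;

    for (size_t r = 0; r < rounds; r++) {
        for (size_t i = 0; i < count; i++) {
            blocks[i] = malloc(size);
            if (!blocks[i]) {
                while (i > 0) free(blocks[--i]);
                free(blocks);
                return -1;
            }
            memset(blocks[i], '1', size);
        }
        for (size_t i = 0; i < count; i++) {
            const unsigned char* p = blocks[i];
            for (size_t k = 0; k < size; k++) {
                if (p[k] != '1') { damaged++; break; }
            }
            free(blocks[i]);
        }
    }
    free(blocks);
    return damaged;
}
```

A `main()` that calls, for example, `stress_small_blocks(1000, 64, 16)` and prints `PASS` or `FAIL` based on the result would fit the `grep` filter above. Run under `LD_PRELOAD=./libhakmem.so`, any `TLS_SLL_HDR_RESET` line on stderr would indicate the corruption was reproduced; whether this loop actually triggers it is untested.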

---

## Important File Locations

| Path | Purpose |
|------|---------|
| `core/box/tls_sll_box.h` | TLS SLL implementation (error source) |
| `core/hakmem_tiny_free.inc` | Free path - where headers are written |
| `core/hakmem_tiny_refill.inc.h` | Magazine spill - recent fix location |
| `core/box/ptr_conversion_box.h` | Pointer type conversion |
| `core/box/tiny_layout_box.h` | Class layout definitions |
| `core/box/tls_ss_hint_box.h` | Phase 1 optimization (new) |
| `docs/TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md` | YOUR MAIN REFERENCE |

---

## Key Data Structures

### TLS SLL Header Structure

```c
typedef struct {
    uint8_t hdr;        // Header: 0xa0 | class_idx
    uint8_t pad;        // Padding/metadata
    uint16_t _unused;   // Alignment
    SuperSlab* next;    // Pointer to next SuperSlab
} TlsSllEntry;
```

### Header Validation

```c
// Expected value for class 1:
//   expected = 0xa0 | 1 = 0xa1

// What we're seeing:
//   got = 0x31 (some user data)

// This means the header was never written OR was overwritten
```

---

## Pointer Types in hakmem

The codebase distinguishes between:

```c
// hak_base_ptr_t - "Base pointer": start of the allocation (includes header)
// hak_user_ptr_t - "User pointer": start of user data (after offset adjustment)
//
// Conversion:
user = base + tiny_user_offset(class_idx);  // Typically base + 1
base = user - tiny_user_offset(class_idx);  // Typically user - 1
```

**Critical**: In Headerless mode, the offset is 0, so base == user.
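
The value of distinct pointer types is that mixing them up becomes a compile error rather than a runtime corruption. The sketch below shows the general idea with single-member struct wrappers. The `demo_*` names are hypothetical, and the offset is keyed on a simple headerless flag instead of `tiny_user_offset(class_idx)`, so this is not hakmem's actual API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrappers; hakmem's real hak_base_ptr_t / hak_user_ptr_t may differ. */
typedef struct { uint8_t* p; } demo_base_ptr_t;
typedef struct { uint8_t* p; } demo_user_ptr_t;

/* Assumed layout: 1-byte header when headers are on, 0 in Headerless mode. */
static size_t demo_user_offset(int headerless) {
    return headerless ? 0u : 1u;
}

/* base -> user: step over the header (if any). */
static demo_user_ptr_t demo_base_to_user(demo_base_ptr_t b, int headerless) {
    demo_user_ptr_t u = { b.p + demo_user_offset(headerless) };
    return u;
}

/* user -> base: step back to the header (if any). */
static demo_base_ptr_t demo_user_to_base(demo_user_ptr_t u, int headerless) {
    demo_base_ptr_t b = { u.p - demo_user_offset(headerless) };
    return b;
}

/* Accepts only the base type: passing a demo_user_ptr_t is a compile error. */
static uint8_t demo_read_header(demo_base_ptr_t b) {
    return *b.p;
}
```

The protection only holds while pointers stay wrapped. Any site that starts from a plain `void*` and wraps it by hand reintroduces the risk, which is why such sites are natural places to audit first.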

---

## Known Good Patterns (For Reference)

From previous fixes:

```c
// Pattern: Wrapping RAW pointer before TLS SLL push (ALREADY FIXED)
void* p = mag->items[--mag->top].ptr;          // RAW pointer (user offset)
hak_base_ptr_t base_p = HAK_BASE_FROM_RAW(p);  // Wrap to base pointer
if (!tls_sll_push(class_idx, base_p, cap)) {   // Push base pointer
    // Push rejected: fallback handling elided in this excerpt
}

// Pattern: Consistent include order (ALREADY FIXED)
#include "box/hak_kpi_util.inc.h"  // Must come first
#include "hak_core_init.inc.h"     // Must come after
```

---

## Success Criteria

| Criterion | Status |
|-----------|--------|
| TLS SLL Header Corruption diagnosed | ❌ In progress |
| Root cause pattern identified | ❌ In progress |
| Minimal reproducer created | ❌ In progress |
| Fix implemented | ❌ In progress |
| sh8bench runs without errors | ❌ GOAL |
| cfrac runs without errors | ❌ GOAL |
| No performance regression | ❌ GOAL |

---

## Previous Phase Context

This project has gone through several phases:

- **Phase 0**: Initial implementation (completed)
- **Phase 1**: TLS SuperSlab Hint Box optimization (implemented, needs validation)
- **Phase 2**: Headerless mode (designed, blocked by current issue)
- **Phase 102**: MemApi bridge (future)

The current issue blocks validation of Phase 1 and progression to Phase 2.

---

## Timeline Estimate

- **Step 1 (Read guide)**: 15-30 min
- **Step 2-3 (Setup + logging)**: 1-2 hours
- **Step 4 (Diagnostic run)**: 30 min
- **Step 5 (Pattern matching)**: 1 hour
- **Step 6 (Fix implementation)**: 30 min - 1 hour
- **Step 7 (Validation)**: 1-2 hours

**Total**: 4-8 hours expected

---

## Next: Start Investigation

👉 **Next Action**: Read `docs/CHATGPT_HANDOFF_TLS_DIAGNOSIS.md` and follow steps 1-7.

The comprehensive diagnostic guide (`docs/TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md`) contains all the details you need for each pattern and debugging technique.

**Questions or blockers?** The diagnostic guide has extensive explanations for each pattern.

---

**You're now ready to begin the investigation. Good luck! 🚀**