# TLS SLL Header Corruption Diagnosis & Fix Instructions for ChatGPT

## Problem Statement

**Symptom**:
- Baseline (Headerless OFF) crashes with SIGSEGV
- Error log: `[TLS_SLL_HDR_RESET] cls=1 base=0x7ef296abf8c8 got=0x31 expect=0xa1 count=0`
- Location: `core/box/tls_sll_box.h` header integrity check during pop operation

**Root Cause**:
Header byte at offset 0 from base pointer contains user data (0x31) instead of header magic (0xa1).
This indicates one of:
1. Wrong pointer is being stored in TLS SLL
2. Header is not being written correctly before push
3. Adjacent block corruption overwrites header
4. Header write/read offset mismatch

**Impact**:
- TLS SLL header reset occurs (entire freelist for class 1 dropped)
- Subsequent allocations may fail or use wrong metadata
- Benchmark crashes with SIGSEGV
- Memory corruption potential

**Timeline**:
- Discovered during Phase 1 TLS Hint Box benchmarking
- Affects baseline configuration (no hints involved)
- Suggests pre-existing issue in shared TLS SLL code

---

## Investigation Strategy

**Phase A: Understand the Error**
- Where is header validation happening?
- What does 0x31 represent? (Is it deterministic or random data?)
- Can we reproduce with minimal allocations?

**Phase B: Locate Corruption Source**
- Where is header supposed to be written?
- Is header being written BEFORE push or after?
- Are there any recent changes to header write logic?

**Phase C: Implement Fix**
- Add instrumentation to catch corruption early
- Identify exact allocation/free cycle causing problem
- Fix root cause (not just symptom)

**Phase D: Validate**
- TC1 baseline should complete without crashes
- TC2/TC3 can then be evaluated
- No performance regression

---

## Deep Dive: TLS SLL Header Corruption

### What is 0x31?

The error reports `got=0x31`. Let's understand what this means:

```c
// Expected (header magic for class 1):
//   0xa1 = 0xa0 (HEADER_MAGIC) | 0x01 (class_idx)

// Got:
//   0x31 = 0b00110001
//        = ASCII '1' character
//        = Some piece of user data or metadata (the high nibble is 0x3, not 0xa)
```

**Questions to answer**:
1. Is 0x31 always the same, or does it vary? (Deterministic vs random corruption)
2. Does 0x31 correspond to any known data pattern in hakmem?
3. Does the corruption happen during alloc or free?
4. Is 0x31 part of the test program's data?

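One way to make the "what is 0x31?" question mechanical is to decode the byte against the header layout. The sketch below assumes the layout implied by the check quoted in this document (magic `0xa0` in the high nibble, class index in the low nibble, mask `0x0f`); the helper names are hypothetical and not part of hakmem.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, taken from the header check quoted in this document. */
#define HEADER_MAGIC      0xa0u
#define HEADER_CLASS_MASK 0x0fu

/* True if the high nibble carries the header magic. */
static bool hdr_has_magic(uint8_t b) {
    return (b & 0xf0u) == HEADER_MAGIC;
}

/* Class index stored in the low nibble (only meaningful if hdr_has_magic). */
static int hdr_class(uint8_t b) {
    return (int)(b & HEADER_CLASS_MASK);
}

/* True if the byte is printable ASCII, a hint that it may be user data. */
static bool hdr_looks_like_text(uint8_t b) {
    return isprint(b) != 0;
}
```

For the logged values, `hdr_has_magic(0x31)` is false and `hdr_looks_like_text(0x31)` is true, which is consistent with the byte being user data rather than a valid header carrying the wrong class.
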
### TLS SLL Header Check Logic

**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h` (around lines 280-320)

```c
// In tls_sll_pop_impl():
if (tiny_class_preserves_header(class_idx)) {
    uint8_t* b = (uint8_t*)raw_base;
    uint8_t got = *b;  // Read byte at offset 0 of base pointer
    uint8_t expected = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));

    if (got != expected) {
        // CORRUPTION DETECTED!
        fprintf(stderr, "[TLS_SLL_HDR_RESET] cls=%d base=%p got=0x%02x expect=0x%02x ...\n",
                class_idx, raw_base, got, expected);
        // ... reset logic follows
    }
}
```

**Key Points**:
- Header is read at `(uint8_t*)raw_base` (offset 0)
- Expected value is `0xa0 | class_idx`
- For class 1: expect `0xa1`
- Got `0x31` instead (user data)

### When Does This Happen?

The error occurs during `tls_sll_pop()`, which is called when:
1. **Freelist refill**: Taking blocks from TLS SLL back to unified cache
2. **Magazine spill**: Freelist → TLS SLL transition for overflow
3. **Allocation path**: Pulling blocks from TLS SLL to satisfy malloc

**The header was already wrong when the pop ran.** The check only executes during pop, so the corruption happened earlier: either before the push, or while the block was sitting in the list.

This suggests:
- Either the pointer stored in TLS SLL is wrong (points to wrong location)
- Or the header was never written correctly
- Or adjacent block corruption overwrote the header
- Or there's an offset calculation error between push and pop

---

## Diagnostic Procedure

### Step 1: Reproduce with Minimal Test

Create the smallest possible test case:

**File**: `/mnt/workdisk/public_share/hakmem/tests/test_tls_sll_minimal.c`

```c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    printf("Test 1: Simple alloc/free cycle\n");
    for (int i = 0; i < 10; i++) {
        void* p = malloc(16);  // Class 1
        if (p) {
            memset(p, 0x31, 16);  // Write user data (includes 0x31!)
            free(p);
        }
    }
    printf("✓ Test 1 passed\n");

    printf("Test 2: Rapid alloc/free (trigger refill)\n");
    for (int i = 0; i < 1000; i++) {
        void* p = malloc(16);
        if (p) {
            memset(p, 0x31, 16);
            free(p);
        }
    }
    printf("✓ Test 2 passed\n");

    printf("Test 3: Multiple sizes\n");
    for (int size = 8; size <= 512; size *= 2) {
        for (int j = 0; j < 100; j++) {
            void* p = malloc(size);
            if (p) {
                memset(p, 0x31, size);
                free(p);
            }
        }
    }
    printf("✓ Test 3 passed\n");

    printf("Test 4: Heavy churn (trigger SLL push/pop)\n");
    void* ptrs[100];
    for (int round = 0; round < 10; round++) {
        for (int i = 0; i < 100; i++) {
            ptrs[i] = malloc(16);
            if (ptrs[i]) memset(ptrs[i], 0x31, 16);
        }
        for (int i = 0; i < 100; i++) {
            free(ptrs[i]);
        }
    }
    printf("✓ Test 4 passed\n");

    return 0;
}
```

**Build and test**:
```bash
cd /mnt/workdisk/public_share/hakmem
mkdir -p tests
gcc -o tests/test_tls_sll_minimal tests/test_tls_sll_minimal.c
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
```

**Goal**: Find the minimal reproduction:
- If test 1 fails: Early corruption (basic alloc/free)
- If test 2 fails: Refill-related corruption
- If test 3 fails: Class-specific issue
- If test 4 fails: SLL push/pop cycling issue

### Step 2: Add Diagnostic Logging

Instrument the header write/read paths:

#### Instrument Header Write

**File**: `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_config_box.inc`

Find the `HAK_RET_ALLOC` macro and add logging:

```c
// Add diagnostic logging (%p takes void*, pointer differences print with %td)
#define HAK_RET_ALLOC(base, cls) do { \
    fprintf(stderr, "[ALLOC_HEADER_WRITE] base=%p cls=%d\n", (void*)(base), (int)(cls)); \
    uint8_t* hdr = (uint8_t*)(base); \
    uint8_t magic = (uint8_t)(0xa0 | ((cls) & 0x0f)); \
    *hdr = magic; \
    fprintf(stderr, "[ALLOC_HEADER_WROTE] base=%p magic=0x%02x (at %p)\n", (void*)(base), *hdr, (void*)hdr); \
    __atomic_thread_fence(__ATOMIC_RELEASE); \
    hak_user_ptr_t user = ptr_base_to_user((base), (cls)); \
    fprintf(stderr, "[ALLOC_RETURN] user=%p (base=%p + %td)\n", (void*)user, (void*)(base), (char*)user - (char*)(base)); \
    return user; \
} while(0)
```

#### Instrument Header Read

**File**: `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h`

Modify the header read/check in `tls_sll_pop_impl()`:

```c
// In tls_sll_pop_impl(), before the check:
if (tiny_class_preserves_header(class_idx)) {
    uint8_t* b = (uint8_t*)raw_base;
    uint8_t got = *b;
    uint8_t expected = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));

    // NEW DIAGNOSTIC LOGGING:
    fprintf(stderr, "[TLS_SLL_POP_CHECK] class=%d raw_base=%p checking at %p\n",
            class_idx, (void*)raw_base, (void*)b);
    fprintf(stderr, "[TLS_SLL_POP_READ] got=0x%02x expected=0x%02x\n", got, expected);

    if (got != expected) {
        fprintf(stderr, "[CORRUPTION_DETECTED] Mismatch! Dumping context...\n");
        fprintf(stderr, "[CORRUPTION_CONTEXT] raw_base=%p, offset=%td\n", (void*)raw_base, (char*)b - (char*)raw_base);

        // Dump surrounding bytes: 8 before base, 16 from base onward
        fprintf(stderr, "[CORRUPTION_DUMP] Bytes around base: ");
        for (int i = -8; i < 16; i++) {
            fprintf(stderr, "%02x ", ((uint8_t*)raw_base)[i]);
        }
        fprintf(stderr, "\n");

        // ... existing reset logic
    }
}
```

#### Instrument SLL Push

**File**: `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h`

Find `tls_sll_push_impl()` and add logging:

```c
static inline bool tls_sll_push_impl(..., hak_base_ptr_t ptr, ...) {
    fprintf(stderr, "[TLS_SLL_PUSH] class=%d ptr=%p\n", class_idx, (void*)ptr);

    // Check header BEFORE push
    if (tiny_class_preserves_header(class_idx)) {
        uint8_t hdr = *(uint8_t*)ptr;
        fprintf(stderr, "[TLS_SLL_PUSH_HDR_CHECK] ptr=%p header=0x%02x\n", (void*)ptr, hdr);
    }

    // ... existing push logic
}
```

**Build and run**:
```bash
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal 2>&1 | grep -E "ALLOC|POP|PUSH|CORRUPTION" | head -100
```

**What to look for**:
- Do `ALLOC_HEADER_WROTE` and `TLS_SLL_PUSH_HDR_CHECK` report the same pointer and header byte?
- Does `TLS_SLL_POP_READ` show corruption?
- What is the sequence: WRITE → PUSH → POP?
- Are pointers consistent across operations?

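A caveat on the `fprintf`-based instrumentation: stdio is allowed to allocate internally, and inside an allocator that can re-enter `malloc` or disturb the very state being debugged. A minimal sketch of an alternative, assuming a POSIX system, formats into a stack buffer and emits it with `write(2)`. `dbg_log` is a hypothetical helper name, and `vsnprintf` is only expected (not guaranteed) to stay off the heap for simple formats like the ones used here.

```c
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: format into a stack buffer, then write(2) to stderr.
 * Returns the number of bytes written, or -1 on error. */
static int dbg_log(const char *fmt, ...) {
    char buf[256];
    va_list ap;

    va_start(ap, fmt);
    int n = vsnprintf(buf, sizeof buf, fmt, ap);
    va_end(ap);

    if (n < 0) return -1;
    if ((size_t)n >= sizeof buf) n = (int)sizeof buf - 1;  /* output was truncated */
    return (int)write(STDERR_FILENO, buf, (size_t)n);
}
```

A call such as `dbg_log("[TLS_SLL_PUSH] class=%d ptr=%p\n", class_idx, (void*)ptr)` could then stand in for the corresponding `fprintf(stderr, ...)` line.
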
### Step 3: Examine Header Write Locations

Search for all places headers are written:

```bash
cd /mnt/workdisk/public_share/hakmem
grep -rn "= 0xa\|= HEADER_MAGIC\|= TINY_HEADER\|0xa0 |" core/ --include="*.h" --include="*.c" --include="*.inc"
```

Expected locations:
1. `core/hakmem_tiny_config_box.inc` - HAK_RET_ALLOC macro
2. `core/box/tls_sll_box.h` - Optional header write on SLL push (if needed)
3. `core/tiny_alloc_fast_push.c` - Fast path allocations
4. Other allocation paths?

**Check each location**:
- Is the offset correct? (Should be offset 0 from base)
- Is it written BEFORE or AFTER pushing to TLS SLL?
- Is there an atomic fence to prevent reordering?
- Is the class_idx valid?

### Step 4: Examine Pointer Conversion Logic

The key question: **Are we storing the right pointer in TLS SLL?**

**File**: `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_types.h`

Check the pointer conversion macros:

```bash
cd /mnt/workdisk/public_share/hakmem
grep -A5 "ptr_user_to_base\|ptr_base_to_user\|HAK_BASE_FROM_RAW" core/hakmem_tiny_types.h
```

**Critical questions**:
1. When we free a user pointer, do we convert it to base pointer correctly?
2. When we push to TLS SLL, do we push the base pointer or user pointer?
3. When we pop from TLS SLL, do we get back the exact same base pointer?

**Expected flow**:
```
Alloc: BASE → (write header at BASE) → (convert to USER) → return USER
Free:  USER → (convert to BASE) → (push BASE to TLS SLL)
Pop:   (pop BASE from TLS SLL) → (read header at BASE) → validate
```

If any step uses wrong offset, corruption occurs.

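The expected alloc/free/pop flow can be exercised in isolation with a toy model. The sketch below does not use hakmem code: `toy_alloc`, `toy_cycle` and the other helpers are hypothetical stand-ins that assume a 1-byte header at offset 0 and a user pointer at base + 1. It shows that pushing the user pointer instead of the base pointer makes the pop-side read land on user data.

```c
#include <stdint.h>
#include <string.h>

#define HDR_SIZE  1      /* assumed: 1-byte header, as in Phase 1 */
#define HDR_MAGIC 0xa0u  /* assumed: magic in the high nibble */

/* Hypothetical stand-ins for ptr_base_to_user / ptr_user_to_base. */
static uint8_t *toy_base_to_user(uint8_t *base) { return base + HDR_SIZE; }
static uint8_t *toy_user_to_base(uint8_t *user) { return user - HDR_SIZE; }

/* Alloc path: stamp the header at base+0, hand out base+1. */
static uint8_t *toy_alloc(uint8_t *base, int cls) {
    base[0] = (uint8_t)(HDR_MAGIC | (cls & 0x0f));
    return toy_base_to_user(base);
}

/* The byte the pop-side check reads for whatever pointer was pushed. */
static uint8_t toy_header_seen_by_pop(const uint8_t *pushed) {
    return pushed[0];
}

/* One alloc → user write → free → pop cycle for class 1.
 * push_user_by_mistake selects the buggy free path. */
static uint8_t toy_cycle(int push_user_by_mistake) {
    uint8_t block[HDR_SIZE + 16] = {0};
    uint8_t *user = toy_alloc(block, 1);
    memset(user, 0x31, 16);  /* user fills the payload */
    uint8_t *pushed = push_user_by_mistake ? user : toy_user_to_base(user);
    return toy_header_seen_by_pop(pushed);
}
```

Under these assumptions `toy_cycle(0)` yields `0xa1` and `toy_cycle(1)` yields `0x31`, the same pair of values as in the error log. That only shows the mechanism is plausible; it does not establish which cause is the real one.
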
### Step 5: Git Blame on Recent Changes

```bash
cd /mnt/workdisk/public_share/hakmem
git log --oneline -30
git show b5be708b6  # "Fix potential freelist corruption"
git show c91602f18  # "Fix ptr_user_to_base_blind regression"
git show f3f75ba3d  # "Fix magazine spill RAW pointer"
```

**Check**: Did any of these changes affect header write logic?

Look for:
- Changes to `HAK_RET_ALLOC` macro
- Changes to pointer conversion logic
- Changes to TLS SLL push/pop
- Changes to header offset calculations

### Step 6: Review Commit History for TLS SLL

```bash
cd /mnt/workdisk/public_share/hakmem
git log --oneline --all -- core/box/tls_sll_box.h | head -20
git log -p --all -- core/box/tls_sll_box.h | head -200
```

Look for:
- When was header logic last changed?
- Were there any defensive fixes recently?
- Any atomic fence changes?
- Any offset calculation changes?

### Step 7: Check Phase 1 Configuration

**File**: `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_types.h`

Verify the header configuration:

```c
// Phase 1: headerless = false → headers ON
// Header should be at offset 0 of base pointer
#define TINY_HEADER_SIZE_BYTES 1
#define HEADER_MAGIC 0xa0
```

**Check**:
- Is HEADERLESS defined? (Should be undefined for Phase 1)
- Is header size correct? (Should be 1 byte)
- Are offset calculations consistent?

---

## Likely Root Causes (Narrowed)

### Root Cause A: Header Written at Wrong Offset

**Symptom**: User data appears where header should be

**Check**:
```c
// In HAK_RET_ALLOC, are we writing at the right place?
// Phase 1: header at offset 0 of base
uint8_t* hdr_ptr = (uint8_t*)base;  // Should be offset 0
*hdr_ptr = magic;

// If this was changed to:
uint8_t* hdr_ptr = (uint8_t*)base + 1;  // WRONG! User data location
*hdr_ptr = magic;
// Then header is written in user space, gets overwritten
```

**How to verify**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -n "HAK_RET_ALLOC" core/hakmem_tiny_config_box.inc
# Check that header write is at (uint8_t*)base, not base+offset
```

**Fix**: Ensure header write is at `(uint8_t*)base`, not base+offset.

### Root Cause B: User Pointer Pushed Instead of Base Pointer

**Symptom**: SLL contains user pointers, but pop expects base pointers

**Sequence**:
```c
// During free:
void* user_ptr = ...;  // User pointer (base + 1 for Phase 1)
tls_sll_push(class_idx, user_ptr, cap);  // WRONG! Should be base pointer

// During pop:
void* popped = tls_sll_pop(class_idx);  // Gets user_ptr
uint8_t header = *(uint8_t*)popped;  // Reads at user_ptr, not base_ptr!
// This reads user data instead of header
```

**How to verify**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -rn "tls_sll_push" core/ --include="*.c" --include="*.inc" -A3 -B3
# Check that all pushes use base pointer, not user pointer
```

**Fix**: Convert user pointer to base pointer before pushing:
```c
hak_base_ptr_t base = ptr_user_to_base(user_ptr, class_idx);
tls_sll_push(class_idx, base, cap);
```

### Root Cause C: Atomic Fence Missing

**Symptom**: Header write is reordered after the block becomes visible to another thread

**Check**:
```c
*(uint8_t*)base = header_magic;           // Instruction 1
__atomic_thread_fence(__ATOMIC_RELEASE);  // Fence (required!)
tls_sll_push(class_idx, base, cap);       // Instruction 2
```

If the fence is missing and the block can be reached by another thread, the CPU/compiler might:
1. Schedule push before header write
2. Other thread sees unprepared node in SLL
3. Pop reads unwritten header → corruption

Within a single thread, program order already guarantees that a later pop sees the earlier header write, so a missing fence cannot explain the failure if the list is only ever touched by its owning thread. This cause applies only where blocks cross threads (for example a remote free), and the reading side then needs a matching acquire.

**How to verify**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -rn -B5 "tls_sll_push" core/ --include="*.c" --include="*.inc" | grep -E "fence|barrier|atomic"
# Check that fence exists between header write and push
```

**Fix**: Add `__atomic_thread_fence(__ATOMIC_RELEASE)` after header write, before SLL push.

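If blocks can be handed to another thread, the ordering requirement can also be expressed with C11 atomics rather than a bare fence. The sketch below is an illustration under stated assumptions, not hakmem's design: `g_slot`, `publish_block` and `take_block` are hypothetical, and a single-slot mailbox stands in for whatever cross-thread path exists. The release store publishes the header write; the acquire exchange on the consumer side is what makes it visible.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-slot mailbox used to hand one block to another thread. */
static _Atomic(uint8_t *) g_slot = NULL;

/* Producer: write the header, then publish the pointer with release ordering.
 * The header store cannot be reordered after the release store. */
static void publish_block(uint8_t *base, int cls) {
    base[0] = (uint8_t)(0xa0u | (cls & 0x0f));
    atomic_store_explicit(&g_slot, base, memory_order_release);
}

/* Consumer: the acquire exchange pairs with the release store above, so a
 * non-NULL result comes with a visible header byte. Empties the slot. */
static uint8_t *take_block(void) {
    return atomic_exchange_explicit(&g_slot, (uint8_t *)NULL, memory_order_acquire);
}
```

In real use the two functions would run on different threads. A release on the producer side has no effect on visibility unless the consumer performs a matching acquire.
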
### Root Cause D: Magazine Spill Pointer Wrapping

**Symptom**: Magazine stores RAW pointer, SLL expects BASE pointer

**Already Fixed**: Commit f3f75ba3d added `HAK_BASE_FROM_RAW()` wrapper

**Verify**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -n "HAK_BASE_FROM_RAW\|magazine.*spill" core/hakmem_tiny_refill.inc.h
# Check line 228 or nearby has the fix
```

**Expected code**:
```c
void* p = mag->items[--mag->top].ptr;
hak_base_ptr_t base_p = HAK_BASE_FROM_RAW(p);  // Must have this!
if (!tls_sll_push(class_idx, base_p, cap)) {
    // ...
}
```

**Fix**: If missing, add `HAK_BASE_FROM_RAW()` wrapper around raw pointer.

### Root Cause E: Class Index Mismatch

**Symptom**: Wrong class_idx used for header magic

**Check**:
```c
int class_idx = ...;  // Where does this come from?
uint8_t magic = (uint8_t)(0xa0 | (class_idx & 0x0f));
// If class_idx is wrong (e.g., -1 or 999), the mask still yields 0xaX,
// but with the wrong class nibble (0xaf and 0xa7 respectively)
```

**How to verify**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -rn "class_idx\|tiny_size_to_class" core/ --include="*.h" | grep -E "= -1|= 0xff"
# Look for places where class_idx might be invalid
```

**Fix**: Validate class_idx is in range [0, 7] before using:
```c
if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) {
    fprintf(stderr, "[ERROR] Invalid class_idx: %d\n", class_idx);
    abort();
}
```

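Because the header expression masks the class index to four bits, an out-of-range `class_idx` still produces a byte of the form `0xaX`. A bad class index on its own therefore cannot account for `got=0x31`, whose high nibble is `0x3`. The sketch below checks that arithmetic; it reuses the header expression quoted in this document, and `header_for_class` and `magic_nibble_always_set` are hypothetical helper names.

```c
#include <stdint.h>

/* Same expression as the header write quoted in this document. */
static uint8_t header_for_class(int class_idx) {
    return (uint8_t)(0xa0 | (class_idx & 0x0f));
}

/* Returns 1 if every class_idx in [lo, hi] yields a byte whose high nibble
 * is 0xa, and 0 as soon as one does not. */
static int magic_nibble_always_set(int lo, int hi) {
    for (int c = lo; c <= hi; c++) {
        if ((header_for_class(c) & 0xf0) != 0xa0) return 0;
    }
    return 1;
}
```

With this, `header_for_class(-1)` is `0xaf` and `header_for_class(999)` is `0xa7`, so a class mismatch would surface as a wrong low nibble rather than as a byte like `0x31`.
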
### Root Cause F: Offset Calculation Error

**Symptom**: Header written at base, but read at base+offset (or vice versa)

**Check**:
```c
// During alloc:
*(uint8_t*)base = magic;  // Write at base+0
user = base + 1;          // User at base+1 (Phase 1)

// During free/pop:
base = user - 1;                 // Should recover original base
uint8_t hdr = *(uint8_t*)base;   // Should read at base+0

// BUT if conversion is wrong:
base = user - 0;                 // WRONG! Off by one
uint8_t hdr = *(uint8_t*)base;   // Reads at wrong location
```

**How to verify**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -A10 "ptr_user_to_base_impl\|ptr_base_to_user_impl" core/hakmem_tiny_types.h
# Check offset calculations are consistent
```

**Fix**: Ensure offset calculations match between:
- `ptr_base_to_user` (add offset)
- `ptr_user_to_base` (subtract same offset)

---

## Proposed Fix Patterns

Based on diagnostic results, the fix will likely be one of:

### Fix Pattern 1: Restore Header Write Logic

**Problem**: Header write uses wrong offset or wrong pointer

**File**: `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_config_box.inc`

```c
#define HAK_RET_ALLOC(base, cls) do { \
    /* Write header FIRST at offset 0 of base */ \
    *(uint8_t*)(base) = (uint8_t)(0xa0 | ((cls) & 0x0f)); \
    /* Ensure header write completes before next operation */ \
    __atomic_thread_fence(__ATOMIC_RELEASE); \
    /* Now convert to user pointer and return */ \
    return ptr_base_to_user((base), (cls)); \
} while(0)
```

### Fix Pattern 2: Add Missing Fence

**Problem**: Compiler reorders header write after SLL push

**File**: `/mnt/workdisk/public_share/hakmem/core/tiny_alloc_fast_push.c` or `core/hakmem_tiny_free.inc`

```c
// Before push to TLS SLL:
*(uint8_t*)base = header_magic;
__atomic_thread_fence(__ATOMIC_RELEASE);  // ADD THIS LINE
tls_sll_push(class_idx, base, cap);
```

### Fix Pattern 3: Fix Pointer Type in Push

**Problem**: User pointer pushed instead of base pointer

**File**: Multiple locations (search for `tls_sll_push`)

```c
// In free path:
void* user_ptr = ptr;  // From user
hak_base_ptr_t base_ptr = ptr_user_to_base(user_ptr, class_idx);  // Convert!
if (!tls_sll_push(class_idx, base_ptr, cap)) {  // Push base, not user
    // ...
}
```

### Fix Pattern 4: Validate Inputs

**Problem**: Invalid class_idx or pointer values

**File**: `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h`

```c
// At entry of tls_sll_push_impl():
static inline bool tls_sll_push_impl(..., hak_base_ptr_t ptr, ...) {
    // Validate inputs
    if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) {
        fprintf(stderr, "[ERROR] Invalid class_idx: %d\n", class_idx);
        return false;
    }
    if (!ptr || ptr == (void*)-1) {
        fprintf(stderr, "[ERROR] Invalid pointer: %p\n", (void*)ptr);
        return false;
    }

    // ... existing logic
}
```

### Fix Pattern 5: Check Magazine Spill

**Problem**: Magazine spill uses wrong pointer type

**File**: `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_refill.inc.h`

```c
// Around line 228 (magazine spill):
void* p = mag->items[--mag->top].ptr;

// MUST convert RAW to BASE before pushing:
hak_base_ptr_t base_p = HAK_BASE_FROM_RAW(p);  // Essential!

if (!tls_sll_push(class_idx, base_p, cap)) {
    // ... error handling
}
```

**Verify fix exists**:
```bash
cd /mnt/workdisk/public_share/hakmem
grep -n "HAK_BASE_FROM_RAW" core/hakmem_tiny_refill.inc.h
# Should see it used before tls_sll_push
```

### Fix Pattern 6: Fix Offset Calculation

**Problem**: Pointer conversion uses wrong offset

**File**: `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_types.h`

```c
// Verify Phase 1 offsets:
static inline hak_user_ptr_t ptr_base_to_user_impl(hak_base_ptr_t base, int cls) {
    if (tiny_class_preserves_header(cls)) {
        return (hak_user_ptr_t)((uint8_t*)base + TINY_HEADER_SIZE_BYTES);  // +1 for Phase 1
    }
    return (hak_user_ptr_t)base;
}

static inline hak_base_ptr_t ptr_user_to_base_impl(hak_user_ptr_t user, int cls) {
    if (tiny_class_preserves_header(cls)) {
        return (hak_base_ptr_t)((uint8_t*)user - TINY_HEADER_SIZE_BYTES);  // -1 for Phase 1
    }
    return (hak_base_ptr_t)user;
}
```

**Check**: Ensure +1 and -1 match, and TINY_HEADER_SIZE_BYTES is 1.

---

## Debug Workflow

### Quick Debug Cycle

```bash
cd /mnt/workdisk/public_share/hakmem

# 1. Make changes to source
# ... edit files ...

# 2. Rebuild
make clean && make shared -j8

# 3. Test with minimal reproducer
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal 2>&1 | tee debug.log

# 4. Check for errors
grep "TLS_SLL_HDR_RESET\|CORRUPTION\|SIGSEGV" debug.log

# 5. Analyze log patterns
grep -E "ALLOC|PUSH|POP" debug.log | head -50
```

### Advanced Debug: GDB

```bash
cd /mnt/workdisk/public_share/hakmem

# Build with debug symbols
make clean
CFLAGS="-g -O0" make shared -j8

# Run under GDB
gdb --args ./tests/test_tls_sll_minimal
```

**GDB commands**:
```gdb
(gdb) set environment LD_PRELOAD ./libhakmem.so
(gdb) set breakpoint pending on          # library is not loaded yet
(gdb) break tls_sll_push_impl
(gdb) break tls_sll_pop_impl
(gdb) run
(gdb) print/x *(unsigned char*)ptr       # Check header byte (raw_base in pop)
(gdb) print class_idx
(gdb) backtrace
(gdb) continue
```

### Memory Corruption Detection

Enable AddressSanitizer:

```bash
cd /mnt/workdisk/public_share/hakmem
make clean
CFLAGS="-fsanitize=address -g" LDFLAGS="-fsanitize=address" make shared -j8
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
```

ASan installs its own `malloc`/`free`, so combining it with a preloaded allocator may need extra care (for example, the ASan runtime generally has to be loaded before other libraries).

In instrumented code and for memory it tracks, ASan can catch:
- Buffer overflows
- Use-after-free
- Double-free
- Out-of-bounds accesses through miscalculated pointers

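ASan only tracks memory it allocates itself, so blocks carved out of hakmem's own regions are not covered automatically. One option is manual poisoning through the ASan interface macros `ASAN_POISON_MEMORY_REGION` and `ASAN_UNPOISON_MEMORY_REGION` from `<sanitizer/asan_interface.h>`. The sketch below is a hypothetical pair of hooks (`toy_on_free`, `toy_on_alloc`), not existing hakmem code; where such hooks belong and exactly which bytes to poison are assumptions. The macros compile to no-ops when ASan is disabled.

```c
#include <stddef.h>
#include <stdint.h>

/* Use the real ASan interface when the header is available. */
#if defined(__has_include)
#  if __has_include(<sanitizer/asan_interface.h>)
#    include <sanitizer/asan_interface.h>
#  endif
#endif
#ifndef ASAN_POISON_MEMORY_REGION   /* header missing: fall back to no-ops */
#  define ASAN_POISON_MEMORY_REGION(addr, size)   ((void)(addr), (void)(size))
#  define ASAN_UNPOISON_MEMORY_REGION(addr, size) ((void)(addr), (void)(size))
#endif

/* Hypothetical free-path hook: mark the user area as off limits while the
 * block sits in a free list. ASan may poison only an aligned subregion. */
static void toy_on_free(uint8_t *user, size_t user_size) {
    ASAN_POISON_MEMORY_REGION(user, user_size);
}

/* Hypothetical alloc-path hook: make the user area accessible again
 * before the block is handed back to the caller. */
static void toy_on_alloc(uint8_t *user, size_t user_size) {
    ASAN_UNPOISON_MEMORY_REGION(user, user_size);
}
```
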
---

## After Applying Fix

### Step 1: Rebuild and Test Minimal Reproducer

```bash
cd /mnt/workdisk/public_share/hakmem
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
```

**Expected**:
- All tests pass
- No `[TLS_SLL_HDR_RESET]` errors
- No SIGSEGV crashes

### Step 2: Run TC1 Baseline Test

```bash
cd /mnt/workdisk/public_share/hakmem
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so timeout 20 ./mimalloc-bench/out/bench/sh8bench 2>&1 | tail -20
```

**Expected**:
- "Total elapsed time..." message
- No SIGSEGV
- Completion within timeout

### Step 3: Run Full Benchmark Suite

```bash
cd /mnt/workdisk/public_share/hakmem

# cfrac test
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 2>&1 | head -10

# larson test
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 2>&1 | tail -10

# sh6bench test
LD_PRELOAD=./libhakmem.so timeout 20 ./mimalloc-bench/out/bench/sh6bench 2>&1 | tail -5
```

**Expected**: All pass without crashes or corruption errors

### Step 4: Regression Check

Ensure fix doesn't break other configurations:

```bash
cd /mnt/workdisk/public_share/hakmem

# Test Phase 2 (headerless=true) - if implemented
# ... config changes ...
# make clean && make shared -j8
# LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal

# Test with different workloads
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/mstress 10 2
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/rptest 10
```

### Step 5: Performance Check

Verify no performance regression:

```bash
cd /mnt/workdisk/public_share/hakmem

# Before fix (save baseline):
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep "Total elapsed"
# Note: May crash, but if it runs, record time

# After fix:
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep "Total elapsed"

# Compare: Should be within 5% of baseline (if baseline worked)
```

### Step 6: Remove Diagnostic Logging

After fix is confirmed, remove debug logging:

```bash
cd /mnt/workdisk/public_share/hakmem

# Remove fprintf statements added for diagnosis
# Restore original HAK_RET_ALLOC macro
# Restore original tls_sll_push/pop implementations

# Rebuild clean version
make clean
make shared -j8

# Final test
LD_PRELOAD=./libhakmem.so ./tests/test_tls_sll_minimal
LD_PRELOAD=./libhakmem.so timeout 20 ./mimalloc-bench/out/bench/sh8bench
```

### Step 7: Commit with Detailed Message

```bash
cd /mnt/workdisk/public_share/hakmem
git status
git add [modified files]
git commit -m "Fix TLS SLL header corruption

Problem: Header magic byte being corrupted during allocation/free path,
causing [TLS_SLL_HDR_RESET] errors and SIGSEGV crashes in baseline tests.

Symptoms:
- sh8bench crashes with SIGSEGV
- Error: [TLS_SLL_HDR_RESET] cls=1 got=0x31 expect=0xa1
- Header validation fails during tls_sll_pop()

Root cause: [DESCRIBE WHAT WAS WRONG - e.g.:]
- User pointer was being pushed to TLS SLL instead of base pointer
- Header read at wrong offset due to pointer type mismatch
- Missing atomic fence allowed reordering of header write

Solution: [DESCRIBE WHAT WAS FIXED - e.g.:]
- Convert user pointer to base pointer before tls_sll_push()
- Add atomic fence after header write, before SLL operations
- Validate pointer types at SLL entry points

Changes:
- core/hakmem_tiny_config_box.inc: Fixed HAK_RET_ALLOC header offset
- core/box/tls_sll_box.h: Added pointer validation
- core/hakmem_tiny_free.inc: Convert to base ptr before push

Validation:
- test_tls_sll_minimal passes (4/4 tests)
- sh8bench baseline completes successfully
- cfrac/larson/sh6bench pass without crashes
- No performance regression (<5% variance)

Verified: TC1 baseline stability restored, ready for Phase 1 testing"
```

---

## Expected Timeline

**Phase A: Understanding (1-2 hours)**
- Read this document
- Understand TLS SLL architecture
- Review header mechanism
- Locate relevant code sections

**Phase B: Diagnosis (2-4 hours)**
- Create minimal test case
- Add diagnostic logging
- Run tests and analyze logs
- Identify root cause

**Phase C: Fix Implementation (1-2 hours)**
- Implement surgical fix
- Remove diagnostic logging once the fix is verified (see Pitfall 4)
- Clean build and test

**Phase D: Validation (1-2 hours)**
- Run full test suite
- Verify no regressions
- Performance check
- Document and commit

**Total: 5-10 hours** for complete diagnosis, fix, and validation

---

## Success Criteria

**Must Have**:
1. No `[TLS_SLL_HDR_RESET]` errors in baseline tests
2. sh8bench completes without SIGSEGV
3. Minimal test suite passes (4/4 tests)
4. Fix is surgical (minimal code changes)
5. Root cause documented clearly

**Nice to Have**:
1. Performance neutral (<5% variance)
2. Fix applies to all configurations
3. Additional validation checks added
4. Regression tests added

**Verification**:
```bash
cd /mnt/workdisk/public_share/hakmem
LD_PRELOAD=./libhakmem.so timeout 20 ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep -E "Total elapsed|RESET|SIGSEGV"
# Should show "Total elapsed time" with no errors
```

---

## Common Pitfalls

### Pitfall 1: Fixing Symptoms, Not Root Cause

**Wrong approach**:
```c
// Just disable the check
if (got != expected) {
    // Do nothing, ignore corruption
}
```

**Right approach**:
- Understand WHY corruption happens
- Fix the source (wrong pointer, wrong offset, etc.)
- Keep the validation check enabled
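
As a rough illustration of keeping the check enabled, the sketch below reports a mismatch with enough context to trace it instead of ignoring it. The `sketch_` names and the log tag are hypothetical, and the header layout (magic `0xa0` in the high nibble, class index in the low nibble) is an assumption inferred from `expect=0xa1` for `cls=1`; the real check in `core/box/tls_sll_box.h` may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: high nibble = magic 0xa0, low nibble = class index. */
#define SKETCH_HDR_MAGIC 0xa0u

/* Hypothetical helper: returns 1 if the header byte at base is intact.
 * On mismatch it logs the details and returns 0, so the caller can drop
 * the list or abort; the corruption is never silently ignored. */
static int sketch_check_header(const uint8_t *base, unsigned class_idx)
{
    assert(base != NULL);
    uint8_t expect = (uint8_t)(SKETCH_HDR_MAGIC | (class_idx & 0x0fu));
    uint8_t got = base[0];
    if (got == expect) {
        return 1;
    }
    fprintf(stderr,
            "[SKETCH_HDR_MISMATCH] cls=%u base=%p got=0x%02x expect=0x%02x\n",
            class_idx, (const void *)base, (unsigned)got, (unsigned)expect);
    return 0;
}
```

Returning a status instead of swallowing the mismatch leaves the policy (reset the list, abort, or log and continue) to the caller while the evidence still reaches the log.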

### Pitfall 2: Over-Engineering

**Wrong approach**:
- Rewrite entire TLS SLL system
- Add complex locking mechanisms
- Change fundamental architecture

**Right approach**:
- Minimal fix (usually 1-5 lines)
- Fix pointer conversion or offset
- Add fence if missing
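
A minimal fix of this kind might look like the sketch below, where the free path converts the user pointer back to the base pointer before pushing. Everything here is an assumption for illustration: the `sketch_` names, the 1-byte header, and storing the next link right after the header. The real conversion lives in `ptr_user_to_base` and the real push in `tls_sll_push`, whose signatures are not shown in this document.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HDR_MAGIC 0xa0u  /* assumed magic, class index in low nibble */
#define SKETCH_HDR_SIZE  1u     /* assumed: one header byte before user data */

/* Hypothetical per-class thread-local list. */
typedef struct {
    uint8_t *head;   /* base pointer of the first free block, or NULL */
    unsigned count;  /* number of blocks on the list */
} sketch_sll;

static uint8_t *sketch_user_to_base(void *user)
{
    return (uint8_t *)user - SKETCH_HDR_SIZE;
}

/* Free path: convert to base, restore the header, then link the block. */
static void sketch_free_push(sketch_sll *sll, void *user, unsigned class_idx)
{
    assert(sll != NULL && user != NULL);
    uint8_t *base = sketch_user_to_base(user);   /* the one-line fix */
    base[0] = (uint8_t)(SKETCH_HDR_MAGIC | (class_idx & 0x0fu));
    /* Release fence mirroring the "add fence if missing" step; for a
     * strictly thread-local list it should not be required. */
    atomic_thread_fence(memory_order_release);
    /* Assumed link location: the bytes right after the header. */
    memcpy(base + SKETCH_HDR_SIZE, &sll->head, sizeof sll->head);
    sll->head = base;
    sll->count++;
}
```

With this shape, a pop-time check that reads `head[0]` sees `0xa1` for class 1 rather than the first byte of user data.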

### Pitfall 3: Ignoring Test Results

**Wrong approach**:
- Fix compiles, assume it works
- Skip minimal reproducer
- Don't verify with benchmarks

**Right approach**:
- Test with minimal case FIRST
- Verify all benchmarks pass
- Check performance impact

### Pitfall 4: Removing Too Much Logging Too Early

**Wrong approach**:
- Remove diagnostic logging immediately, which makes the issue hard to debug if it returns

**Right approach**:
- Keep logging until fix is verified
- Remove logging in separate commit
- Document what was learned

---

## Additional Resources

### Key Files to Understand

1. `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h`
   - TLS SLL push/pop implementation
   - Header validation logic

2. `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_config_box.inc`
   - HAK_RET_ALLOC macro
   - Header write logic

3. `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_types.h`
   - Pointer conversion macros
   - ptr_user_to_base / ptr_base_to_user

4. `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_refill.inc.h`
   - Magazine spill logic
   - TLS SLL interaction

### Useful Git Commands

```bash
# Find when header logic changed
git log -p --all -S "0xa0" -- core/

# Find recent changes to TLS SLL
git log --oneline -20 -- core/box/tls_sll_box.h

# Compare current vs previous version
git diff HEAD~5 core/hakmem_tiny_config_box.inc

# Find all references to a function
git grep -n "tls_sll_push" core/
```

### Debugging Commands

```bash
# Check header size configuration
grep -n "TINY_HEADER\|HEADERLESS" core/hakmem_tiny_types.h

# Find all allocation return points
grep -rn "HAK_RET_ALLOC\|return.*user" core/ --include="*.inc"

# Find all TLS SLL push calls
grep -rn "tls_sll_push" core/ --include="*.c" --include="*.inc" -B3 -A3

# Check atomic operations
grep -rn "atomic_thread_fence\|__atomic\|memory_order" core/ --include="*.h"
```

---

## Questions to Answer During Diagnosis

1. **What is 0x31?**
   - Is it always 0x31, or does it vary?
   - Does it correspond to test data?
   - Is it ASCII '1' character?

2. **Where is the header written?**
   - In HAK_RET_ALLOC macro?
   - In tls_sll_push?
   - Somewhere else?

3. **Where is the header read?**
   - In tls_sll_pop?
   - In allocation path?

4. **Are offsets consistent?**
   - Write at offset X
   - Read at offset X
   - Both use same base pointer?

5. **Are pointer types correct?**
   - Push base or user pointer?
   - Pop returns base or user pointer?
   - Conversions correct?

6. **Is there a fence?**
   - Between header write and SLL push?
   - Between SLL pop and header read?

7. **Is class_idx valid?**
   - In range [0, 7]?
   - Matches actual allocation size?

8. **Has this ever worked?**
   - Check git history
   - Was there a recent breaking change?
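
Questions 1, 4, and 5 can be explored with a small stand-alone model before touching the allocator. The sketch below is not hakmem code: the `sketch_` names, the 1-byte header, and the `0xa0 | class_idx` encoding are assumptions based on the `got=0x31 expect=0xa1` log line. It models one way the symptom could arise: if a user pointer is treated as a base pointer, the byte read as the header is the first byte of user data. Under these assumptions `sketch_demo(0)` yields `0xa1` and `sketch_demo(1)` yields `0x31`, which is the ASCII code for the character '1'.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HDR_MAGIC 0xa0u  /* assumed magic in the high nibble */
#define SKETCH_HDR_SIZE  1u     /* assumed header size in bytes */
#define SKETCH_NUM_CLASS 8u     /* class_idx range [0, 7] from question 7 */

/* Allocation side: write the header at base, hand out the user pointer. */
static uint8_t *sketch_alloc_from(uint8_t *base, unsigned class_idx)
{
    assert(class_idx < SKETCH_NUM_CLASS);
    base[0] = (uint8_t)(SKETCH_HDR_MAGIC | class_idx);
    return base + SKETCH_HDR_SIZE;
}

/* Pop side: the byte a header check reads if ptr is taken as the base. */
static uint8_t sketch_header_seen(const uint8_t *ptr)
{
    return ptr[0];
}

/* Returns the header byte seen when the correct base pointer is checked
 * (wrong_ptr == 0) or when the user pointer is mistaken for it. */
static uint8_t sketch_demo(int wrong_ptr)
{
    uint8_t block[16] = { 0 };
    uint8_t *user = sketch_alloc_from(block, 1);
    memcpy(user, "1234", 4);            /* user data starting with '1' */
    return sketch_header_seen(wrong_ptr ? user : block);
}
```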

---

## Document Version

- **Version**: 1.0
- **Date**: 2025-12-03
- **Author**: System diagnostic documentation
- **Target**: ChatGPT diagnostic agent
- **Estimated completion time**: 5-10 hours

---

## Final Checklist

Before considering the fix complete:

- [ ] Minimal reproducer created and passes
- [ ] Root cause identified and documented
- [ ] Fix implemented with explanation
- [ ] Diagnostic logging removed
- [ ] All baseline tests pass
- [ ] No performance regression
- [ ] Git commit with detailed message
- [ ] This document updated with findings

**Good luck with the diagnosis!**