# Headerless Stability Debug Instructions (Root-Cause / Fail-Fast)

Quality bar for this playbook:

| Metric | Score | Notes |
| --- | --- | --- |
| Coverage | 9/10 | Seven root-cause candidates + multiple probes |
| Actionability | 9/10 | Copy/pasteable bash + gdb/asan commands |
| Time budget | 10-22h | Phased so we can stop after each milestone |
| Expected success | 85-90% | Parallel probes + bisect safety net |

Goal (Definition of Done)
- Reproduce, isolate, and permanently fix the headerless instability with a verified regression test.
- Fix must be A/B switchable and observable (Box Theory: isolate boxes, single boundary, backout flag).

Scope and signals
- Both Headerless OFF and Headerless ON crash: suggests shared path, not just hint box.
- Observed symptoms: TLS_SLL integrity failures, invalid free() pointers, hangs in sh8bench/cfrac.

Box Theory anchors (work inside clear boxes, fail-fast, reversible)
- Box 2: Remote queue push/drain (no owner/publish side effects).
- Box 3: Ownership CAS (only at bind boundary).
- Box 4: Publish/Adopt boundary (single drain->bind->owner acquire point).
- Hint box: tls_ss_hint cache (guarded by `HAKMEM_TINY_SS_TLS_HINT`).
- Backouts: `HAKMEM_TINY_HEADERLESS`, `HAKMEM_TINY_SS_TLS_HINT`, `HAKMEM_TINY_SS_ADOPT`, `HAKMEM_TINY_RF_FORCE_NOTIFY`.

---

## Step-by-Step Flow

### 0) Pre-flight (15 min)
- Run `ulimit -c unlimited`, and check that `git status -sb` shows a tree clean enough to bisect.
- Use single-thread first: `export HAKMEM_TINY_THREADS=1`.
- Disable learn/ACE noise: `export HAKMEM_ACE_ENABLED=0 HAKMEM_LEARN=0`.
- Keep artifacts: `mkdir -p debug_artifacts/headerless`.

### 1) Test Case 1: Headerless OFF (control)
```bash
cd /mnt/workdisk/public_share/hakmem
make clean && make shared -j8
LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench \
  2>&1 | tee debug_artifacts/headerless/tc1_off.log | tail -40
```
Expected: completes with "Total elapsed time".
If it crashes: the base path (non-headerless) is already broken -> focus on shared free/registry first.

### 2) Test Case 2: Headerless ON, hint OFF
```bash
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"
LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench \
  2>&1 | tee debug_artifacts/headerless/tc2_hdrless_nohint.log | tail -40
```
The outcome tells us whether the headerless core path is already unstable without the hint box.

### 3) Test Case 3: Headerless ON, hint ON (Phase 1 path)
```bash
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench \
  2>&1 | tee debug_artifacts/headerless/tc3_hdrless_hint.log | tail -40
```
If TC2 passes and TC3 fails, suspect the hint cache or the adopt boundary; otherwise suspect a shared box.

### 4) ASan pass (pinpoint corruption early)
```bash
make clean && make asan-shared-alloc -j8 \
  EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
LD_PRELOAD=./libhakmem_asan.so timeout 20 ./mimalloc-bench/out/bench/sh8bench \
  2>&1 | tee debug_artifacts/headerless/asan_hdrless.log | head -200
```
If ASan is noisy, rebuild with the hint disabled (`-DHAKMEM_TINY_SS_TLS_HINT=0`, as in TC2) and rerun to see if the corruption follows the hint box.

### 5) GDB capture (first crash)
```bash
make clean && make shared -j8 EXTRA_CFLAGS="-g -O1 -DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
gdb --args ./mimalloc-bench/out/bench/sh8bench
(gdb) set environment LD_PRELOAD ./libhakmem.so
(gdb) run
(gdb) bt
(gdb) frame 0
(gdb) info locals
(gdb) x/4gx ptr  # replace ptr with the crashing pointer
```
Save to `debug_artifacts/headerless/gdb_bt.txt`.

### 6) Git bisect (only after TC1 result is known)
```bash
git bisect start
git bisect bad HEAD
git bisect good <last-known-good>  # e.g., pre f3f75ba3d if that was stable
# For each step (the two lines below can be saved as a script for `git bisect run`;
# exit 125 tells bisect to skip a commit that does not build):
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1" || exit 125
LD_PRELOAD=./libhakmem.so timeout 15 ./mimalloc-bench/out/bench/sh8bench && exit 0 || exit 1
```
Record each verdict in `debug_artifacts/headerless/bisect_log.txt`. Reset with `git bisect reset` afterwards.

---
## Root-Cause Candidates (7) and Probes

1) TLS hint cache stale/dangling (Box: hint)
- Symptom: free() uses cached ss that was recycled; remote-dangling or wrong class.
- Probe: log generation vs pointer range.
```c
/* Cast to void * so the %p arguments are well-defined for any pointer type. */
fprintf(stderr, "[HINT_LOOKUP] ptr=%p ss=%p gen=%llu magic=%llx\n",
        (void *)ptr, (void *)ss,
        ss ? (unsigned long long)ss->generation : 0ULL,
        ss ? (unsigned long long)ss->magic : 0ULL);
```
- A/B: `HAKMEM_TINY_SS_TLS_HINT=0` should fully remove this path.

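The generation probe for candidate 1 can be turned into a validity check rather than only a log line. The following is a minimal sketch, not the project's code: `sketch_superslab`, `sketch_hint`, `SKETCH_SS_MAGIC`, and all field names are hypothetical stand-ins, and it assumes slab headers stay mapped after recycling so that reading `magic` and `generation` through a stale pointer is safe. A hint is accepted only if the magic is intact, the generation matches the one recorded when the hint was cached, and the pointer falls inside the slab's range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up magic value for this sketch; the real constant is not shown in this playbook. */
#define SKETCH_SS_MAGIC 0x5353484bULL

/* Hypothetical stand-in for the SuperSlab header; field names are assumptions. */
typedef struct {
    uint64_t  magic;      /* equals SKETCH_SS_MAGIC while the slab is live */
    uint64_t  generation; /* bumped every time the slab is recycled */
    uintptr_t base;       /* first byte covered by the slab */
    size_t    size;       /* number of bytes covered */
} sketch_superslab;

/* What a thread caches: the slab pointer plus the generation seen at cache time. */
typedef struct {
    const sketch_superslab *ss;
    uint64_t                generation;
} sketch_hint;

/* Returns true only if the cached hint may still be used to free ptr. */
static bool sketch_hint_valid(const sketch_hint *h, const void *ptr)
{
    if (h == NULL || h->ss == NULL) return false;
    const sketch_superslab *ss = h->ss;
    if (ss->magic != SKETCH_SS_MAGIC) return false;    /* header overwritten */
    if (ss->generation != h->generation) return false; /* slab recycled since caching */
    uintptr_t p = (uintptr_t)ptr;
    return p >= ss->base && p - ss->base < ss->size;   /* ptr inside slab range */
}
```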
2) TLS SLL normalize mismatch (Box: TLS SLL)
- Symptom: headerless ptr hits queue expecting header offset.
- Probe: in `core/box/tls_sll_box.h` around normalize/mismatch detection, log once:
```c
/* actual_has_header, expected_has_header and queue_name are placeholders for local state. */
fprintf(stderr, "[TLS_SLL_MISMATCH] ptr=%p has_hdr=%d expect_hdr=%d q=%s\n",
        (void *)ptr, actual_has_header, expected_has_header, queue_name);
```
- Check that `TLS_SLL_NORMALIZE_USERPTR/RAWPTR` is invoked at every push/pop boundary.

3) SuperSlab registry stale or race (Box: registry boundary)
- Symptom: registry returns freed slab; hint and registry disagree.
- Probe: add generation/epoch in TinySuperSlab and compare on lookup; assert `SUPERSLAB_MAGIC`.
- A/B: force registry path only by turning hint off; compare crash locus.

4) Class index drift (Box: metadata)
- Symptom: slab->class_idx corrupt -> wrong free list math.
- Probe: after `slab_index_for()`, assert `class_idx < TINY_NUM_CLASSES`; log slab_idx/class_idx.
- A/B: run small vs 1024-byte classes; see if only one class fails.

5) Magazine wrap/unwrap slip (Box: refill/magazine)
- Symptom: pointer stored raw, read as user (or vice versa) in refill spill.
- Probe: instrument `core/hakmem_tiny_refill.inc` around magazine push/pop; dump raw/user pointer deltas.
- A/B: force refill slow path only: `export HAKMEM_TINY_MUST_ADOPT=1`.

6) Remote queue drain boundary breach (Box 2->4 boundary)
- Symptom: remote drain merges freelist twice or skips owner check.
- Probe: ring events or one-shot logs at `ss_remote_drain_to_freelist()` and adopt boundary:
```c
fprintf(stderr, "[REMOTE_DRAIN] ss=%p slab=%d count_before=%u\n",
        (void *)ss, slab_idx, remote_counts[slab_idx]);
```
- A/B: `HAKMEM_TINY_SS_ADOPT=0` to see if crash is tied to adopt boundary logic.

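The "merges freelist twice" symptom in candidate 6 can be caught at the drain itself with a debug-only membership check. The sketch below is an illustration under stated assumptions, not the body of `ss_remote_drain_to_freelist()`: it models both lists as plain singly linked chains (`sketch_node` is hypothetical), ignores atomics and ownership, and costs O(n*m), so it only makes sense behind a debug flag. It refuses to splice a remote chain if any of its nodes is already on the freelist, and treats a chain longer than `max_nodes` as cyclic or corrupt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical free-block node: the first word of a free block links to the next one. */
typedef struct sketch_node { struct sketch_node *next; } sketch_node;

/* Linear membership test over a NULL-terminated chain. */
static bool sketch_list_contains(const sketch_node *head, const sketch_node *n)
{
    for (; head != NULL; head = head->next)
        if (head == n) return true;
    return false;
}

/*
 * Splices the remote chain in front of *freelist.
 * Returns false and leaves *freelist untouched if a remote node is already
 * on the freelist (merged twice) or the chain exceeds max_nodes (cycle).
 */
static bool sketch_drain_checked(sketch_node **freelist, sketch_node *remote,
                                 size_t max_nodes)
{
    sketch_node *tail = NULL;
    size_t seen = 0;
    for (sketch_node *n = remote; n != NULL; n = n->next) {
        if (++seen > max_nodes) return false;                 /* runaway or cyclic chain */
        if (sketch_list_contains(*freelist, n)) return false; /* already merged once */
        tail = n;
    }
    if (tail == NULL) return true; /* empty remote chain: nothing to do */
    tail->next = *freelist;
    *freelist = remote;
    return true;
}
```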
7) Pointer wrap/unwrap toggle confusion (Box: pointer bridge)
- Symptom: header offset applied twice or skipped.
- Probe: assert alignment and expected delta at every `user_to_raw/raw_to_user` site in free path.
- A/B: run with `HAKMEM_TINY_HEADERLESS=0` vs `1` with same workload; see if delta shows only in headerless.

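The delta assertion for candidate 7 could look roughly like the sketch below. The helper names, the header size, and the block alignment are all assumptions: the playbook does not state the real header width, so `hdr` is a parameter (0 for headerless builds) and `block_align` is whatever boundary a raw block is expected to sit on. The check passes only when the raw pointer is on a block boundary and the user pointer sits exactly one header offset above it, which catches both a skipped and a doubled offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bridge helpers; hdr is the header size in bytes (0 when headerless). */
static inline void *sketch_raw_to_user(void *raw, size_t hdr)
{
    return (unsigned char *)raw + hdr;
}

static inline void *sketch_user_to_raw(void *user, size_t hdr)
{
    return (unsigned char *)user - hdr;
}

/* True if raw is block-aligned and user is exactly one header offset above raw. */
static bool sketch_bridge_ok(const void *raw, const void *user,
                             size_t hdr, size_t block_align)
{
    uintptr_t r = (uintptr_t)raw;
    uintptr_t u = (uintptr_t)user;
    if (block_align == 0 || r % block_align != 0) return false; /* raw off block boundary */
    return u >= r && u - r == hdr; /* offset applied exactly once */
}
```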
---

## Data to Capture (single-pass, no log spam)
- Logs: last 400 lines from each TC run; grep for `[TLS_SLL]`, `[HINT]`, `[REMOTE]`.
- GDB: full `bt`, `frame 0`, `info locals`, and pointer dump.
- ASan: first 150 lines including shadow/poison info.
- Minimal repro: smallest C snippet or shell script that crashes within 30s.
- Env stamp: `uname -a`, `lscpu | head -20`, `git rev-parse HEAD`.

Format when reporting:
```
=== TC1 (Headerless OFF) ===
Result: crash / hang / pass
Last log lines: ...

=== TC2 (Headerless ON, hint OFF) ===
Result: ...

=== TC3 (Headerless ON, hint ON) ===
Result: ...

=== ASan ===
<first 20 lines + error site>

=== GDB (first crash) ===
<bt + frame 0 locals>
```

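The `Result:` field in the report can be derived mechanically instead of by eye. The sketch below is one possible mapping, under these assumptions: the benchmark is run under GNU `timeout`, which exits with 124 when the time limit is hit; a process killed by a signal shows up in the shell as 128 plus the signal number; and a passing run prints "Total elapsed time". The names are hypothetical. Because the test-case commands pipe through `tee` and `tail`, the exit code would have to be captured from the first pipeline stage (for example via bash's `PIPESTATUS`) before being fed to a helper like this. The report format only lists crash, hang, and pass, so the extra "fail" value here covers nonzero exits that are neither.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result categories; "fail" is an addition for leftover nonzero exits. */
typedef enum {
    SKETCH_PASS,
    SKETCH_CRASH,
    SKETCH_HANG,
    SKETCH_FAIL
} sketch_result;

/*
 * exit_code: shell-visible status of `timeout N ./bench`.
 * saw_total_elapsed: whether the log contains "Total elapsed time".
 */
static sketch_result sketch_classify(int exit_code, bool saw_total_elapsed)
{
    if (exit_code == 124) return SKETCH_HANG;  /* GNU timeout: time limit reached */
    if (exit_code > 128) return SKETCH_CRASH;  /* 128 + signal, e.g. 139 for SIGSEGV on Linux */
    if (exit_code == 0 && saw_total_elapsed) return SKETCH_PASS;
    return SKETCH_FAIL;                        /* other nonzero exit or missing completion line */
}

/* Label to paste into the "Result:" line of the report. */
static const char *sketch_result_name(sketch_result r)
{
    switch (r) {
    case SKETCH_PASS:  return "pass";
    case SKETCH_CRASH: return "crash";
    case SKETCH_HANG:  return "hang";
    default:           return "fail";
    }
}
```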
---

## Observability and Guardrails (Box Theory)
- One-shot logs only; no continuous debug spam. Use counters where possible.
- Keep boundary single: drain->bind->owner_acquire only inside refill/adopt; do not add side effects in remote push/publish.
- Toggleable fixes: wrap new checks with `#if defined(DEBUG_HDRLESS)` or env flags so we can A/B quickly.
- Fail-fast: `assert`/`abort` on invalid class_idx, magic, or out-of-range pointers instead of silently recovering.

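The guardrails above (one-shot logs, counters, toggles, fail-fast) can be combined in a few lines. This is a sketch under assumptions, not existing project code: the `SKETCH_DEBUG_HDRLESS` environment variable, the macro, and the function names are all invented for illustration, and the counter is a plain global where real code would likely want a per-thread or atomic counter. The first bad hit at a site is logged, later hits only bump the counter, and the check aborts only when the flag is set so the same binary can be run A/B.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* True if the environment variable is set to anything other than "" or "0". */
static bool sketch_env_enabled(const char *name)
{
    const char *v = getenv(name);
    return v != NULL && v[0] != '\0' && !(v[0] == '0' && v[1] == '\0');
}

/* Prints on the first hit only; every hit increments the counter. */
#define SKETCH_LOG_ONCE(counter, ...)          \
    do {                                       \
        if ((counter)++ == 0)                  \
            fprintf(stderr, __VA_ARGS__);      \
    } while (0)

/* Plain global for the sketch; not thread-safe as written. */
static unsigned long sketch_bad_class_hits;

/*
 * Returns true for a valid class index. On an invalid one it logs once,
 * then aborts if the (hypothetical) SKETCH_DEBUG_HDRLESS flag is enabled,
 * otherwise reports the failure to the caller.
 */
static bool sketch_check_class(unsigned class_idx, unsigned num_classes)
{
    if (class_idx < num_classes) return true;
    SKETCH_LOG_ONCE(sketch_bad_class_hits,
                    "[CLASS_IDX] bad class_idx=%u (limit %u)\n",
                    class_idx, num_classes);
    if (sketch_env_enabled("SKETCH_DEBUG_HDRLESS")) abort(); /* fail-fast when armed */
    return false;
}
```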
---

## Decision Tree
- TC1 fails -> shared free/registry bug; ignore hint; inspect pointer normalize + registry first.
- TC1 passes, TC2 fails -> headerless core path bug; focus on pointer normalize and class_idx drift.
- TC2 passes, TC3 fails -> hint cache or adopt boundary; focus on stale hint + generation checks.
- ASan shows UAF/double-free -> instrument free path and magazine spill; gate hint off to see if corruption follows.
- Bisect isolates commit -> fix there, keep A/B flag, add regression test.

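The first three branches of the tree are a pure function of the three test-case outcomes, which the sketch below encodes so the triage step is unambiguous. The enum and function names are hypothetical, and the ASan and bisect branches are left out because they depend on evidence other than pass/fail. The branches are evaluated in order, so a TC1 failure wins regardless of what TC2 and TC3 did.

```c
#include <assert.h>

/* Outcome of one test case; hang and crash both count as a failure here. */
typedef enum { SKETCH_TC_PASS, SKETCH_TC_FAIL } sketch_tc;

/* Where to look first, following the decision tree's first three branches. */
typedef enum {
    SKETCH_FOCUS_SHARED_FREE_REGISTRY, /* TC1 fails */
    SKETCH_FOCUS_HEADERLESS_CORE,      /* TC1 passes, TC2 fails */
    SKETCH_FOCUS_HINT_OR_ADOPT,        /* TC2 passes, TC3 fails */
    SKETCH_FOCUS_NONE                  /* all three pass */
} sketch_focus;

static sketch_focus sketch_triage(sketch_tc tc1, sketch_tc tc2, sketch_tc tc3)
{
    if (tc1 == SKETCH_TC_FAIL) return SKETCH_FOCUS_SHARED_FREE_REGISTRY;
    if (tc2 == SKETCH_TC_FAIL) return SKETCH_FOCUS_HEADERLESS_CORE;
    if (tc3 == SKETCH_TC_FAIL) return SKETCH_FOCUS_HINT_OR_ADOPT;
    return SKETCH_FOCUS_NONE;
}
```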
---

## Timeline (target 10-22h)
- 2-4h: run TC1-3, capture GDB/ASan, decide branch of decision tree.
- 4-8h: instrument relevant box (from candidates), build A/B toggles, derive minimal repro.
- 2-6h: root-cause confirmation with repro + ASan clean pass.
- 2-4h: implement fix, add regression test, verify all three test cases + baseline perf smoke.

---

## Quick Command Reference
```bash
# Clean builds
make clean && make shared -j8
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
make clean && make asan-shared-alloc -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Runs
LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench
LD_PRELOAD=./libhakmem_asan.so timeout 20 ./mimalloc-bench/out/bench/sh8bench

# GDB essentials
gdb --args ./mimalloc-bench/out/bench/sh8bench
(gdb) set environment LD_PRELOAD ./libhakmem.so
(gdb) run
(gdb) bt
(gdb) frame 0
(gdb) info locals

# Bisect skeleton
git bisect start
git bisect bad HEAD
git bisect good <good-sha>
# build/test, mark good|bad|skip
git bisect reset
```