Created 9 diagnostic and handoff documents (48KB) to guide ChatGPT through systematic diagnosis and fix of TLS SLL header corruption issue.

Documents Added:
- README_HANDOFF_CHATGPT.md: Master guide explaining 3-doc system
- CHATGPT_CONTEXT_SUMMARY.md: Quick facts & architecture (2-3 min read)
- CHATGPT_HANDOFF_TLS_DIAGNOSIS.md: 7-step procedure (4-8h timeline)
- GEMINI_HANDOFF_SUMMARY.md: Handoff summary for user review
- STATUS_2025_12_03_CURRENT.md: Complete project status snapshot
- TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md: Deep reference (1,150+ lines)
  - 6 root cause patterns with code examples
  - Diagnostic logging instrumentation
  - Fix templates and validation procedures
- TLS_SS_HINT_BOX_DESIGN.md: Phase 1 optimization design (1,148 lines)
- HEADERLESS_STABILITY_DEBUG_INSTRUCTIONS.md: Test environment setup
- SEGFAULT_INVESTIGATION_FOR_GEMINI.md: Original investigation notes

Problem Context:
- Baseline (Headerless OFF) crashes with [TLS_SLL_HDR_RESET]
- Error: cls=1 base=0x... got=0x31 expect=0xa1
- Blocks Phase 1 validation and Phase 2 progression

Expected Outcome:
- ChatGPT follows 7-step diagnostic process
- Root cause identified (one of 6 patterns)
- Surgical fix (1-5 lines)
- TC1 baseline completes without crashes

🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Headerless Stability Debug Instructions (Root-Cause / Fail-Fast)
Quality bar for this playbook:
| Metric | Score | Notes |
|---|---|---|
| Coverage | 9/10 | Seven root-cause candidates + multiple probes |
| Actionability | 9/10 | Copy/pasteable bash + gdb/asan commands |
| Time budget | 10-22h | Phased so we can stop after each milestone |
| Expected success | 85-90% | Parallel probes + bisect safety net |
Goal (Definition of Done)
- Reproduce, isolate, and permanently fix the headerless instability with a verified regression test.
- Fix must be A/B switchable and observable (Box Theory: isolate boxes, single boundary, backout flag).
Scope and signals
- Both Headerless OFF and Headerless ON crash: suggests shared path, not just hint box.
- Observed symptoms: TLS_SLL integrity failures, invalid free() pointers, hangs in sh8bench/cfrac.
Box Theory anchors (work inside clear boxes, fail-fast, reversible)
- Box 2: Remote queue push/drain (no owner/publish side effects).
- Box 3: Ownership CAS (only at bind boundary).
- Box 4: Publish/Adopt boundary (single drain->bind->owner acquire point).
- Hint box: tls_ss_hint cache (guarded by `HAKMEM_TINY_SS_TLS_HINT`).
- Backouts: `HAKMEM_TINY_HEADERLESS`, `HAKMEM_TINY_SS_TLS_HINT`, `HAKMEM_TINY_SS_ADOPT`, `HAKMEM_TINY_RF_FORCE_NOTIFY`.
Step-by-Step Flow
0) Pre-flight (15 min)
- `ulimit -c unlimited`; ensure `git status -sb` is clean enough to bisect.
- Use single-thread first: `export HAKMEM_TINY_THREADS=1`.
- Disable learn/ACE noise: `export HAKMEM_ACE_ENABLED=0 HAKMEM_LEARN=0`.
- Keep artifacts: `mkdir -p debug_artifacts/headerless`.
1) Test Case 1 — Headerless OFF (control)
cd /mnt/workdisk/public_share/hakmem
make clean && make shared -j8
LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench \
2>&1 | tee debug_artifacts/headerless/tc1_off.log | tail -40
Expected: completes with "Total elapsed time".
If it crashes: the base path (non-headerless) is already broken -> focus on shared free/registry first.
2) Test Case 2 — Headerless ON, hint OFF
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"
LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench \
2>&1 | tee debug_artifacts/headerless/tc2_hdrless_nohint.log | tail -40
Outcome tells us whether headerless core path (without hint) is already unstable.
3) Test Case 3 — Headerless ON, hint ON (Phase 1 path)
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench \
2>&1 | tee debug_artifacts/headerless/tc3_hdrless_hint.log | tail -40
If TC2 passes and TC3 fails, suspect hint cache / adopt boundary; otherwise suspect shared box.
4) ASan pass (pinpoint corruption early)
make clean && make asan-shared-alloc -j8 \
EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
LD_PRELOAD=./libhakmem_asan.so timeout 20 ./mimalloc-bench/out/bench/sh8bench \
2>&1 | tee debug_artifacts/headerless/asan_hdrless.log | head -200
If ASan is noisy, rebuild with -DHAKMEM_TINY_SS_TLS_HINT=0 (as in TC2) and rerun to see if corruption follows the hint box.
5) GDB capture (first crash)
make clean && make shared -j8 EXTRA_CFLAGS="-g -O1 -DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
gdb --args ./mimalloc-bench/out/bench/sh8bench
(gdb) set environment LD_PRELOAD ./libhakmem.so
(gdb) run
(gdb) bt
(gdb) frame 0
(gdb) info locals
(gdb) x/4gx ptr # replace ptr with the crashing pointer
Save to debug_artifacts/headerless/gdb_bt.txt.
6) Git bisect (only after TC1 result is known)
git bisect start
git bisect bad HEAD
git bisect good <last-known-good> # e.g., pre f3f75ba3d if that was stable
# For each step (run these in a script or subshell; a bare `exit` would close an interactive shell):
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1" || exit 125
LD_PRELOAD=./libhakmem.so timeout 15 ./mimalloc-bench/out/bench/sh8bench && exit 0 || exit 1
Record each verdict in debug_artifacts/headerless/bisect_log.txt. Reset with git bisect reset after.
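The `|| exit 125` and `&& exit 0 || exit 1` pattern above lines up with the exit-code contract of `git bisect run`: 0 marks a commit good, 125 skips it, 1-127 (other than 125) mark it bad, and any other code aborts the bisect. A crashing benchmark usually reports a shell status of 128 plus the signal number, and `timeout` reports 124 on a hang, so raw statuses are collapsed to 1 before being returned. A minimal sketch of that mapping in C, assuming the statuses are collected by some wrapper; `demo_bisect_exit_code` and the enum names are hypothetical and not part of hakmem.

```c
#include <assert.h>

/* Exit codes understood by `git bisect run`. */
enum {
    DEMO_BISECT_GOOD = 0,
    DEMO_BISECT_BAD  = 1,
    DEMO_BISECT_SKIP = 125
};

/*
 * build_status: exit status of the build step (0 = built).
 * test_status:  shell-style exit status of the benchmark run
 *               (0 = pass, 124 = killed by timeout(1), 128+N = died on signal N).
 * Hypothetical helper for illustration only.
 */
static int demo_bisect_exit_code(int build_status, int test_status) {
    if (build_status != 0) return DEMO_BISECT_SKIP;  /* commit cannot be tested */
    if (test_status == 0)  return DEMO_BISECT_GOOD;
    return DEMO_BISECT_BAD;  /* crash, hang, or failure all collapse to 1 */
}
```

Returning a raw status such as 139 (SIGSEGV) would stop the bisect instead of marking the commit bad, which is why the collapse to 1 matters.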
Root-Cause Candidates (7) and Probes
- TLS hint cache stale/dangling (Box: hint)
- Symptom: free() uses cached ss that was recycled; remote-dangling or wrong class.
- Probe: log generation vs pointer range.
fprintf(stderr, "[HINT_LOOKUP] ptr=%p ss=%p gen=%llu magic=%llx\n",
ptr, ss, ss ? (unsigned long long)ss->generation : 0,
ss ? (unsigned long long)ss->magic : 0);
- A/B: `HAKMEM_TINY_SS_TLS_HINT=0` should fully remove this path.
- TLS SLL normalize mismatch (Box: TLS SLL)
- Symptom: headerless ptr hits queue expecting header offset.
- Probe: in `core/box/tls_sll_box.h` around normalize/mismatch detection, log once:
fprintf(stderr, "[TLS_SLL_MISMATCH] ptr=%p has_hdr=%d expect_hdr=%d q=%s\n",
ptr, actual_has_header, expected_has_header, queue_name);
- Check that `TLS_SLL_NORMALIZE_USERPTR/RAWPTR` is invoked at every push/pop boundary.
- SuperSlab registry stale or race (Box: registry boundary)
- Symptom: registry returns freed slab; hint and registry disagree.
- Probe: add generation/epoch in TinySuperSlab and compare on lookup; assert `SUPERSLAB_MAGIC`.
- A/B: force registry path only by turning hint off; compare crash locus.
- Class index drift (Box: metadata)
- Symptom: slab->class_idx corrupt -> wrong free list math.
- Probe: after `slab_index_for()`, assert `class_idx < TINY_NUM_CLASSES`; log slab_idx/class_idx.
- A/B: run small vs 1024-byte classes; see if only one class fails.
- Magazine wrap/unwrap slip (Box: refill/magazine)
- Symptom: pointer stored raw, read as user (or vice versa) in refill spill.
- Probe: instrument `core/hakmem_tiny_refill.inc` around magazine push/pop; dump raw/user pointer deltas.
- A/B: force refill slow path only: `export HAKMEM_TINY_MUST_ADOPT=1`.
- Remote queue drain boundary breach (Box 2->4 boundary)
- Symptom: remote drain merges freelist twice or skips owner check.
- Probe: ring events or one-shot logs at `ss_remote_drain_to_freelist()` and adopt boundary:
fprintf(stderr, "[REMOTE_DRAIN] ss=%p slab=%d count_before=%u\n", ss, slab_idx, remote_counts[slab_idx]);
- A/B: `HAKMEM_TINY_SS_ADOPT=0` to see if crash is tied to adopt boundary logic.
- Pointer wrap/unwrap toggle confusion (Box: pointer bridge)
- Symptom: header offset applied twice or skipped.
- Probe: assert alignment and expected delta at every `user_to_raw`/`raw_to_user` site in free path.
- A/B: run with `HAKMEM_TINY_HEADERLESS=0` vs `1` with same workload; see if delta shows only in headerless.
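Several of the probes above reduce to small predicates that can be asserted at the boundary in question. The sketch below illustrates three of them: a generation/magic/range check for a cached hint (candidates 1 and 3), a class index bound (candidate 4), and a wrap/unwrap round-trip check (candidate 7). Every `demo_*` name, the struct layouts, the magic value, and the 1-byte header size are assumptions made for illustration; the real hakmem types and constants may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the real hakmem definitions. */
#define DEMO_SS_MAGIC    0x5353414C4142ULL  /* placeholder magic value */
#define DEMO_HDR_SIZE    1u                 /* assumed header size in bytes */
#define DEMO_NUM_CLASSES 8u                 /* assumed number of size classes */

typedef struct {
    uint64_t  magic;       /* equals DEMO_SS_MAGIC while the slab is live */
    uint64_t  generation;  /* bumped each time the slab is recycled */
    uintptr_t base;        /* first byte covered by the slab */
    size_t    size;        /* number of bytes covered */
} demo_superslab;

typedef struct {
    demo_superslab *ss;    /* cached slab pointer */
    uint64_t generation;   /* generation seen when the hint was cached */
} demo_hint;

/* A hint is usable only if magic, generation, and pointer range all agree. */
static int demo_hint_valid(const demo_hint *h, const void *ptr) {
    if (h == NULL || h->ss == NULL) return 0;
    const demo_superslab *ss = h->ss;
    uintptr_t p = (uintptr_t)ptr;
    if (ss->magic != DEMO_SS_MAGIC) return 0;
    if (ss->generation != h->generation) return 0;  /* slab was recycled */
    return p >= ss->base && p - ss->base < ss->size;
}

static int demo_class_idx_ok(unsigned class_idx) {
    return class_idx < DEMO_NUM_CLASSES;
}

static void *demo_user_to_raw(void *user, int headerless) {
    return headerless ? user : (void *)((uint8_t *)user - DEMO_HDR_SIZE);
}

static void *demo_raw_to_user(void *raw, int headerless) {
    return headerless ? raw : (void *)((uint8_t *)raw + DEMO_HDR_SIZE);
}

/* The delta must be 0 in headerless mode, DEMO_HDR_SIZE otherwise,
 * and unwrap followed by wrap must return the original pointer. */
static int demo_roundtrip_ok(void *user, int headerless) {
    void *raw = demo_user_to_raw(user, headerless);
    ptrdiff_t delta = (uint8_t *)user - (uint8_t *)raw;
    ptrdiff_t expect = headerless ? 0 : (ptrdiff_t)DEMO_HDR_SIZE;
    return delta == expect && demo_raw_to_user(raw, headerless) == user;
}

/* Usage example: returns the number of failed expectations. */
static int demo_probe_selfcheck(void) {
    unsigned char arena[64];
    demo_superslab ss = { DEMO_SS_MAGIC, 3, (uintptr_t)arena, sizeof arena };
    demo_hint hint = { &ss, ss.generation };
    int failures = 0;

    failures += !demo_hint_valid(&hint, arena + 16);           /* live, in range */
    failures += demo_hint_valid(&hint, arena + sizeof arena);  /* one past the end */
    ss.generation++;                                           /* simulate recycle */
    failures += demo_hint_valid(&hint, arena + 16);            /* stale hint rejected */
    failures += !demo_roundtrip_ok(arena + 8, 0);
    failures += !demo_roundtrip_ok(arena + 8, 1);
    failures += !demo_class_idx_ok(DEMO_NUM_CLASSES - 1);
    failures += demo_class_idx_ok(DEMO_NUM_CLASSES);
    return failures;
}
```

One caveat with this design: reading `magic` and `generation` through a stale hint is itself a use-after-free if the slab memory has been unmapped, so in a real allocator the generation would likely need to live somewhere that outlives the slab.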
Data to Capture (single-pass, no log spam)
- Logs: last 400 lines from each TC run; grep for `[TLS_SLL]`, `[HINT]`, `[REMOTE]`.
- GDB: full `bt`, `frame 0`, `info locals`, and pointer dump.
- ASan: first 150 lines including shadow/poison info.
- Minimal repro: smallest C snippet or shell script that crashes within 30s.
- Env stamp: `uname -a`, `lscpu | head -20`, `git rev-parse HEAD`.
Format when reporting:
=== TC1 (Headerless OFF) ===
Result: crash / hang / pass
Last log lines: ...
=== TC2 (Headerless ON, hint OFF) ===
Result: ...
=== TC3 (Headerless ON, hint ON) ===
Result: ...
=== ASan ===
<first 20 lines + error site>
=== GDB (first crash) ===
<bt + frame 0 locals>
Observability and Guardrails (Box Theory)
- One-shot logs only; no continuous debug spam. Use counters where possible.
- Keep boundary single: drain->bind->owner_acquire only inside refill/adopt; do not add side effects in remote push/publish.
- Toggleable fixes: wrap new checks with `#if defined(DEBUG_HDRLESS)` or env flags so we can A/B quickly.
- Fail-fast: `assert`/`abort` on invalid class_idx, magic, or out-of-range pointers instead of silently recovering.
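The guardrails above (one-shot logs, toggleable checks, fail-fast) can be sketched with a few small C11 helpers. This is an illustration under assumptions, not hakmem code: `DEMO_LOG_ONCE`, `demo_env_flag`, and `demo_require_class_idx` are hypothetical names, and the convention that an unset variable falls back to a default while `"0"` means off is an assumption about how an env flag might be parsed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 for the first caller on a given flag, 0 for every later call. */
static int demo_first_time(atomic_flag *flag) {
    return !atomic_flag_test_and_set(flag);
}

/* Emit the message once per call site, however often the site is hit. */
#define DEMO_LOG_ONCE(...)                                        \
    do {                                                          \
        static atomic_flag demo_logged_ = ATOMIC_FLAG_INIT;       \
        if (demo_first_time(&demo_logged_))                       \
            fprintf(stderr, __VA_ARGS__);                         \
    } while (0)

/* Env toggle: unset or empty -> default_on; "0" -> off; anything else -> on. */
static int demo_env_flag(const char *name, int default_on) {
    assert(name != NULL);
    const char *v = getenv(name);
    if (v == NULL || v[0] == '\0') return default_on;
    return !(v[0] == '0' && v[1] == '\0');
}

static int demo_class_idx_in_range(unsigned class_idx, unsigned num_classes) {
    return class_idx < num_classes;
}

/* Fail-fast: log once, then abort instead of silently recovering. */
static void demo_require_class_idx(unsigned class_idx, unsigned num_classes) {
    if (!demo_class_idx_in_range(class_idx, num_classes)) {
        DEMO_LOG_ONCE("[DEMO_FAILFAST] class_idx=%u limit=%u\n",
                      class_idx, num_classes);
        abort();
    }
}
```

In a hot path the result of `demo_env_flag` would normally be read once and cached, since `getenv` on every free would distort the timing being measured.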
Decision Tree
- TC1 fails -> shared free/registry bug; ignore hint; inspect pointer normalize + registry first.
- TC1 passes, TC2 fails -> headerless core path bug; focus on pointer normalize and class_idx drift.
- TC2 passes, TC3 fails -> hint cache or adopt boundary; focus on stale hint + generation checks.
- ASan shows UAF/double-free -> instrument free path and magazine spill; gate hint off to see if corruption follows.
- Bisect isolates commit -> fix there, keep A/B flag, add regression test.
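The first three branches of the decision tree depend only on the TC1-TC3 results, so they can be written down as a small pure function, which is handy when the triage is scripted or logged. A minimal sketch; the enum and function names are hypothetical, and the final "no repro" case is an assumption covering the situation where all three test cases pass and only the ASan or bisect branches remain.

```c
#include <assert.h>

typedef enum { DEMO_PASS, DEMO_FAIL } demo_tc_result;  /* FAIL covers crash and hang */

typedef enum {
    DEMO_FOCUS_SHARED_FREE_REGISTRY,  /* TC1 fails */
    DEMO_FOCUS_HEADERLESS_CORE,       /* TC1 passes, TC2 fails */
    DEMO_FOCUS_HINT_OR_ADOPT,         /* TC2 passes, TC3 fails */
    DEMO_FOCUS_NO_REPRO               /* all pass: fall back to ASan and bisect */
} demo_focus;

/* Earlier test cases take priority: a TC1 failure makes TC2/TC3 irrelevant. */
static demo_focus demo_next_focus(demo_tc_result tc1,
                                  demo_tc_result tc2,
                                  demo_tc_result tc3) {
    if (tc1 == DEMO_FAIL) return DEMO_FOCUS_SHARED_FREE_REGISTRY;
    if (tc2 == DEMO_FAIL) return DEMO_FOCUS_HEADERLESS_CORE;
    if (tc3 == DEMO_FAIL) return DEMO_FOCUS_HINT_OR_ADOPT;
    return DEMO_FOCUS_NO_REPRO;
}

/* Human-readable label for reports. */
static const char *demo_focus_name(demo_focus f) {
    switch (f) {
    case DEMO_FOCUS_SHARED_FREE_REGISTRY: return "shared free/registry";
    case DEMO_FOCUS_HEADERLESS_CORE:      return "headerless core path";
    case DEMO_FOCUS_HINT_OR_ADOPT:        return "hint cache or adopt boundary";
    case DEMO_FOCUS_NO_REPRO:             return "no repro in TC1-3";
    }
    return "unknown";
}
```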
Timeline (target 10-22h)
- 2-4h: run TC1-3, capture GDB/ASan, decide branch of decision tree.
- 4-8h: instrument relevant box (from candidates), build A/B toggles, derive minimal repro.
- 2-6h: root-cause confirmation with repro + ASan clean pass.
- 2-4h: implement fix, add regression test, verify all three test cases + baseline perf smoke.
Quick Command Reference
# Clean builds
make clean && make shared -j8
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
make clean && make asan-shared-alloc -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
# Runs
LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench
LD_PRELOAD=./libhakmem_asan.so timeout 20 ./mimalloc-bench/out/bench/sh8bench
# GDB essentials
gdb --args ./mimalloc-bench/out/bench/sh8bench
(gdb) set environment LD_PRELOAD ./libhakmem.so
(gdb) run
(gdb) bt
(gdb) frame 0
(gdb) info locals
# Bisect skeleton
git bisect start
git bisect bad HEAD
git bisect good <good-sha>
# build/test, mark good|bad|skip
git bisect reset