hakmem/docs/HEADERLESS_STABILITY_DEBUG_INSTRUCTIONS.md
Moe Charm (CI) 2624dcce62 Add comprehensive ChatGPT handoff documentation for TLS SLL diagnosis
Created 9 diagnostic and handoff documents (48KB) to guide ChatGPT through
systematic diagnosis and fix of TLS SLL header corruption issue.

Documents Added:
- README_HANDOFF_CHATGPT.md: Master guide explaining 3-doc system
- CHATGPT_CONTEXT_SUMMARY.md: Quick facts & architecture (2-3 min read)
- CHATGPT_HANDOFF_TLS_DIAGNOSIS.md: 7-step procedure (4-8h timeline)
- GEMINI_HANDOFF_SUMMARY.md: Handoff summary for user review
- STATUS_2025_12_03_CURRENT.md: Complete project status snapshot
- TLS_SLL_HEADER_CORRUPTION_DIAGNOSIS.md: Deep reference (1,150+ lines)
  - 6 root cause patterns with code examples
  - Diagnostic logging instrumentation
  - Fix templates and validation procedures
- TLS_SS_HINT_BOX_DESIGN.md: Phase 1 optimization design (1,148 lines)
- HEADERLESS_STABILITY_DEBUG_INSTRUCTIONS.md: Test environment setup
- SEGFAULT_INVESTIGATION_FOR_GEMINI.md: Original investigation notes

Problem Context:
- Baseline (Headerless OFF) crashes with [TLS_SLL_HDR_RESET]
- Error: cls=1 base=0x... got=0x31 expect=0xa1
- Blocks Phase 1 validation and Phase 2 progression

Expected Outcome:
- ChatGPT follows 7-step diagnostic process
- Root cause identified (one of 6 patterns)
- Surgical fix (1-5 lines)
- TC1 baseline completes without crashes

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 20:41:34 +09:00


Headerless Stability Debug Instructions (Root-Cause / Fail-Fast)

Quality bar for this playbook:

Metric            Score    Notes
Coverage          9/10     Seven root-cause candidates + multiple probes
Actionability     9/10     Copy/pasteable bash + gdb/asan commands
Time budget       10-22h   Phased so we can stop after each milestone
Expected success  85-90%   Parallel probes + bisect safety net

Goal (Definition of Done)

  • Reproduce, isolate, and permanently fix the headerless instability with a verified regression test.
  • Fix must be A/B switchable and observable (Box Theory: isolate boxes, single boundary, backout flag).

Scope and signals

  • Both Headerless OFF and Headerless ON crash, which suggests a bug in a shared path rather than only in the hint box.
  • Observed symptoms: TLS_SLL integrity failures, invalid free() pointers, hangs in sh8bench/cfrac.

Box Theory anchors (work inside clear boxes, fail-fast, reversible)

  • Box 2: Remote queue push/drain (no owner/publish side effects).
  • Box 3: Ownership CAS (only at bind boundary).
  • Box 4: Publish/Adopt boundary (single drain->bind->owner acquire point).
  • Hint box: tls_ss_hint cache (guarded by HAKMEM_TINY_SS_TLS_HINT).
  • Backouts: HAKMEM_TINY_HEADERLESS, HAKMEM_TINY_SS_TLS_HINT, HAKMEM_TINY_SS_ADOPT, HAKMEM_TINY_RF_FORCE_NOTIFY.

Step-by-Step Flow

0) Pre-flight (15 min)

  • Enable core dumps with ulimit -c unlimited; check that git status -sb shows a tree clean enough to bisect.
  • Use single-thread first: export HAKMEM_TINY_THREADS=1.
  • Disable learn/ACE noise: export HAKMEM_ACE_ENABLED=0 HAKMEM_LEARN=0.
  • Keep artifacts: mkdir -p debug_artifacts/headerless.
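
The pre-flight items above can be collected into one helper so every session starts from the same state. This is a minimal sketch: the function name hdrless_preflight and the warning messages are illustrative and not part of the repository; the variables and the artifact directory are the ones listed above.

```bash
#!/usr/bin/env bash
# Pre-flight sketch for a headerless debug session.
# hdrless_preflight is a hypothetical helper name, not an existing script.
hdrless_preflight() {
  # Enable core dumps; some environments forbid raising the limit, so only warn.
  ulimit -c unlimited 2>/dev/null || echo "warn: could not enable core dumps" >&2

  # Single thread first, learn/ACE noise off (variables from the list above).
  export HAKMEM_TINY_THREADS=1
  export HAKMEM_ACE_ENABLED=0 HAKMEM_LEARN=0

  # Keep all artifacts in one place.
  mkdir -p debug_artifacts/headerless

  # Show the working-tree state so local changes are visible before a bisect.
  git status -sb 2>/dev/null || echo "warn: not inside a git work tree" >&2
}
```

Source the file and call hdrless_preflight from the repository root, so that the exported variables stay in the current shell.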

1) Test Case 1 — Headerless OFF (control)

cd /mnt/workdisk/public_share/hakmem
make clean && make shared -j8
LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench \
  2>&1 | tee debug_artifacts/headerless/tc1_off.log | tail -40

Expected: completes with "Total elapsed time".
If it crashes: the base path (non-headerless) is already broken -> focus on shared free/registry first.
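
To label a run consistently as crash / hang / pass, the exit status and the log can be checked together. The sketch below assumes GNU timeout (which exits with 124 when the time limit is hit) and uses the "Total elapsed time" line as the pass marker; classify_run is a hypothetical helper name.

```bash
# Classify one benchmark run as pass / hang / crash.
#   $1: exit status of the `timeout ...` command
#   $2: path to the captured log
# classify_run is an illustrative name, not an existing script.
classify_run() {
  local rc="$1" log="$2"
  if [ "$rc" -eq 124 ]; then
    echo "hang"    # GNU timeout reports an expired time limit as 124
  elif [ "$rc" -eq 0 ] && grep -q "Total elapsed time" "$log" 2>/dev/null; then
    echo "pass"    # clean exit and the benchmark printed its summary line
  else
    echo "crash"   # non-zero exit (e.g. 139 after SIGSEGV) or missing summary
  fi
}
```

Because the run commands pipe into tee and tail, $? reflects the last command of the pipeline. In bash, read the benchmark's status from ${PIPESTATUS[0]} immediately after the pipeline and pass that value as the first argument.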

2) Test Case 2 — Headerless ON, hint OFF

make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"
LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench \
  2>&1 | tee debug_artifacts/headerless/tc2_hdrless_nohint.log | tail -40

The outcome tells us whether the headerless core path (without the hint) is already unstable.

3) Test Case 3 — Headerless ON, hint ON (Phase 1 path)

make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench \
  2>&1 | tee debug_artifacts/headerless/tc3_hdrless_hint.log | tail -40

If TC2 passes and TC3 fails, suspect hint cache / adopt boundary; otherwise suspect shared box.
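
The three test cases differ only in EXTRA_CFLAGS, so the matrix can be written down once and printed as a dry-run plan. A minimal sketch, with hypothetical helper names tc_cflags and tc_plan; the flags and commands are copied from steps 1-3.

```bash
# Map a test-case label to the EXTRA_CFLAGS used in steps 1-3.
# tc_cflags and tc_plan are illustrative names, not existing scripts.
tc_cflags() {
  case "$1" in
    tc1) echo "" ;;  # Headerless OFF (control): default build
    tc2) echo "-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0" ;;
    tc3) echo "-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1" ;;
    *)   echo "unknown test case: $1" >&2; return 2 ;;
  esac
}

# Print (do not execute) the build and run commands for one test case.
tc_plan() {
  local tc="$1" flags
  flags="$(tc_cflags "$tc")" || return 2
  if [ -n "$flags" ]; then
    echo "make clean && make shared -j8 EXTRA_CFLAGS=\"$flags\""
  else
    echo "make clean && make shared -j8"
  fi
  echo "LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench"
}
```

For example, `for tc in tc1 tc2 tc3; do tc_plan "$tc"; done` prints the full plan for review before anything is rebuilt.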

4) ASan pass (pinpoint corruption early)

make clean && make asan-shared-alloc -j8 \
  EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
LD_PRELOAD=./libhakmem_asan.so timeout 20 ./mimalloc-bench/out/bench/sh8bench \
  2>&1 | tee debug_artifacts/headerless/asan_hdrless.log | head -200

If ASan is noisy, rebuild with -DHAKMEM_TINY_SS_TLS_HINT=0 and rerun to see if the corruption follows the hint box.

5) GDB capture (first crash)

make clean && make shared -j8 EXTRA_CFLAGS="-g -O1 -DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
gdb --args ./mimalloc-bench/out/bench/sh8bench
(gdb) set environment LD_PRELOAD ./libhakmem.so
(gdb) run
(gdb) bt
(gdb) frame 0
(gdb) info locals
(gdb) x/4gx ptr  # replace ptr with the crashing pointer

Save to debug_artifacts/headerless/gdb_bt.txt.

6) Git bisect (only after TC1 result is known)

git bisect start
git bisect bad HEAD
git bisect good <last-known-good>   # e.g., pre f3f75ba3d if that was stable
# For each step (e.g., as the body of a script given to git bisect run):
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1" || exit 125
LD_PRELOAD=./libhakmem.so timeout 15 ./mimalloc-bench/out/bench/sh8bench && exit 0 || exit 1

Record each verdict in debug_artifacts/headerless/bisect_log.txt. Reset with git bisect reset after.
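
If the per-step commands are wrapped in a script for git bisect run, the exit code matters: 0 marks the commit good, 125 skips it, any other code from 1 to 127 marks it bad, and a code above 127 aborts the bisect. A crashing benchmark typically surfaces as 139 (SIGSEGV), so it has to be folded into 1. A minimal sketch of that mapping, with a hypothetical helper name bisect_verdict:

```bash
# Translate build/run exit statuses into a `git bisect run` exit code.
#   build failed       -> 125 (skip: this commit cannot be tested)
#   run exited with 0  -> 0   (good)
#   anything else      -> 1   (bad; covers timeouts and signal exits above 127)
# bisect_verdict is an illustrative name, not an existing script.
bisect_verdict() {
  local build_rc="$1" run_rc="$2"
  if [ "$build_rc" -ne 0 ]; then
    echo 125
  elif [ "$run_rc" -eq 0 ]; then
    echo 0
  else
    echo 1
  fi
}
```

In a step script, capture the two statuses after the build and the run, then finish with `exit "$(bisect_verdict "$build_rc" "$run_rc")"`.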


Root-Cause Candidates (7) and Probes

  1. TLS hint cache stale/dangling (Box: hint)
  • Symptom: free() uses cached ss that was recycled; remote-dangling or wrong class.
  • Probe: log generation vs pointer range.
fprintf(stderr, "[HINT_LOOKUP] ptr=%p ss=%p gen=%llu magic=%llx\n",
        (void *)ptr, (void *)ss, ss ? (unsigned long long)ss->generation : 0,
        ss ? (unsigned long long)ss->magic : 0);
  • A/B: HAKMEM_TINY_SS_TLS_HINT=0 should fully remove this path.
  2. TLS SLL normalize mismatch (Box: TLS SLL)
  • Symptom: headerless ptr hits queue expecting header offset.
  • Probe: in core/box/tls_sll_box.h around normalize/mismatch detection, log once:
fprintf(stderr, "[TLS_SLL_MISMATCH] ptr=%p has_hdr=%d expect_hdr=%d q=%s\n",
        ptr, actual_has_header, expected_has_header, queue_name);
  • Check that TLS_SLL_NORMALIZE_USERPTR/RAWPTR is invoked at every push/pop boundary.
  3. SuperSlab registry stale or race (Box: registry boundary)
  • Symptom: registry returns freed slab; hint and registry disagree.
  • Probe: add generation/epoch in TinySuperSlab and compare on lookup; assert SUPERSLAB_MAGIC.
  • A/B: force registry path only by turning hint off; compare crash locus.
  4. Class index drift (Box: metadata)
  • Symptom: slab->class_idx corrupt -> wrong free list math.
  • Probe: after slab_index_for(), assert class_idx < TINY_NUM_CLASSES; log slab_idx/class_idx.
  • A/B: run small vs 1024-byte classes; see if only one class fails.
  5. Magazine wrap/unwrap slip (Box: refill/magazine)
  • Symptom: pointer stored raw, read as user (or vice versa) in refill spill.
  • Probe: instrument core/hakmem_tiny_refill.inc around magazine push/pop; dump raw/user pointer deltas.
  • A/B: force refill slow path only: export HAKMEM_TINY_MUST_ADOPT=1.
  6. Remote queue drain boundary breach (Box 2->4 boundary)
  • Symptom: remote drain merges freelist twice or skips owner check.
  • Probe: ring events or one-shot logs at ss_remote_drain_to_freelist() and adopt boundary:
fprintf(stderr, "[REMOTE_DRAIN] ss=%p slab=%d count_before=%u\n", ss, slab_idx, remote_counts[slab_idx]);
  • A/B: HAKMEM_TINY_SS_ADOPT=0 to see if crash is tied to adopt boundary logic.
  7. Pointer wrap/unwrap toggle confusion (Box: pointer bridge)
  • Symptom: header offset applied twice or skipped.
  • Probe: assert alignment and expected delta at every user_to_raw/raw_to_user site in free path.
  • A/B: run with HAKMEM_TINY_HEADERLESS=0 vs 1 with same workload; see if delta shows only in headerless.

Data to Capture (single-pass, no log spam)

  • Logs: last 400 lines from each TC run; grep for [TLS_SLL], [HINT], [REMOTE].
  • GDB: full bt, frame 0, info locals, and pointer dump.
  • ASan: first 150 lines including shadow/poison info.
  • Minimal repro: smallest C snippet or shell script that crashes within 30s.
  • Env stamp: uname -a, lscpu | head -20, git rev-parse HEAD.
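
The log and environment items above can be gathered with two small helpers. This is a sketch under assumptions: extract_signals and env_stamp are hypothetical names, and the grep pattern matches any tag that starts with [TLS_SLL, [HINT, or [REMOTE (for example [TLS_SLL_HDR_RESET] or [HINT_LOOKUP]).

```bash
# Print the tagged diagnostic lines found in the last 400 lines of a log.
# extract_signals and env_stamp are illustrative names, not existing scripts.
extract_signals() {
  local log="$1"
  # `|| true` keeps the status at 0 when no tagged line is present.
  tail -n 400 "$log" | grep -E '\[(TLS_SLL|HINT|REMOTE)' || true
}

# Print the environment stamp that accompanies a report.
env_stamp() {
  uname -a
  lscpu 2>/dev/null | head -20                         # lscpu may be missing
  git rev-parse HEAD 2>/dev/null || echo "git: no commit available"
}
```

For example, `extract_signals debug_artifacts/headerless/tc1_off.log` gives the short, spam-free excerpt, and `env_stamp > debug_artifacts/headerless/env.txt` stores the stamp next to the logs (the env.txt file name is only a suggestion).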

Format when reporting:

=== TC1 (Headerless OFF) ===
Result: crash / hang / pass
Last log lines: ...

=== TC2 (Headerless ON, hint OFF) ===
Result: ...

=== TC3 (Headerless ON, hint ON) ===
Result: ...

=== ASan ===
<first 20 lines + error site>

=== GDB (first crash) ===
<bt + frame 0 locals>

Observability and Guardrails (Box Theory)

  • One-shot logs only; no continuous debug spam. Use counters where possible.
  • Keep boundary single: drain->bind->owner_acquire only inside refill/adopt; do not add side effects in remote push/publish.
  • Toggleable fixes: wrap new checks with #if defined(DEBUG_HDRLESS) or env flags so we can A/B quickly.
  • Fail-fast: assert/abort on invalid class_idx, magic, or out-of-range pointers instead of silently recovering.

Decision Tree

  • TC1 fails -> shared free/registry bug; ignore hint; inspect pointer normalize + registry first.
  • TC1 passes, TC2 fails -> headerless core path bug; focus on pointer normalize and class_idx drift.
  • TC2 passes, TC3 fails -> hint cache or adopt boundary; focus on stale hint + generation checks.
  • ASan shows UAF/double-free -> instrument free path and magazine spill; gate hint off to see if corruption follows.
  • Bisect isolates commit -> fix there, keep A/B flag, add regression test.
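
The first three branches of the tree depend only on the TC1-TC3 results, so they can be encoded directly. A minimal sketch with a hypothetical helper name decide_branch; the final fallback message is an assumption for the case where none of the three test cases fails, which the tree above does not cover explicitly.

```bash
# Map TC1..TC3 results (each "pass" or "fail") to a decision-tree branch.
# decide_branch is an illustrative name, not an existing script.
decide_branch() {
  local tc1="$1" tc2="$2" tc3="$3"
  if [ "$tc1" = "fail" ]; then
    echo "shared free/registry: inspect pointer normalize + registry first"
  elif [ "$tc2" = "fail" ]; then
    echo "headerless core path: pointer normalize and class_idx drift"
  elif [ "$tc3" = "fail" ]; then
    echo "hint cache or adopt boundary: stale hint + generation checks"
  else
    # Assumption: not a branch of the tree itself, just a reminder.
    echo "no TC failure: rely on ASan and bisect results"
  fi
}
```

The order of the checks mirrors the tree: a TC1 failure takes priority regardless of what TC2 and TC3 report.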

Timeline (target 10-22h)

  • 2-4h: run TC1-3, capture GDB/ASan, decide branch of decision tree.
  • 4-8h: instrument relevant box (from candidates), build A/B toggles, derive minimal repro.
  • 2-6h: root-cause confirmation with repro + ASan clean pass.
  • 2-4h: implement fix, add regression test, verify all three test cases + baseline perf smoke.

Quick Command Reference

# Clean builds
make clean && make shared -j8
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"
make clean && make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
make clean && make asan-shared-alloc -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Runs
LD_PRELOAD=./libhakmem.so timeout 30 ./mimalloc-bench/out/bench/sh8bench
LD_PRELOAD=./libhakmem_asan.so timeout 20 ./mimalloc-bench/out/bench/sh8bench

# GDB essentials
gdb --args ./mimalloc-bench/out/bench/sh8bench
(gdb) set environment LD_PRELOAD ./libhakmem.so
(gdb) run
(gdb) bt
(gdb) frame 0
(gdb) info locals

# Bisect skeleton
git bisect start
git bisect bad HEAD
git bisect good <good-sha>
# build/test, mark good|bad|skip
git bisect reset