TINY 1KB (class7) SEGV Triage Plan
=================================
Scope
- Reproducible SEGV on fixed-size 1KB bench: `./bench_fixed_size_hakmem 200000 1024 128`
- The crash persists with Direct-FC OFF, so the fault is likely in the non-direct P0 path or the legacy refill path.
- Goal: isolate failing path, capture backtrace, prove root cause, and patch with minimal deltas.
Quick Repro Matrix
- Release (baseline):
  - `./build.sh release bench_fixed_size_hakmem`
  - `./bench_fixed_size_hakmem 200000 1024 128` → SEGV
- Disable P0 (all classes):
  - `HAKMEM_TINY_P0_DISABLE=1 ./bench_fixed_size_hakmem 200000 1024 128` → Check (SEGV persists?)
- Disable remote drain:
  - `HAKMEM_TINY_P0_NO_DRAIN=1 ./bench_fixed_size_hakmem 200000 1024 128` → Check
- Assume 1T (disable remote side-table):
  - `HAKMEM_TINY_ASSUME_1T=1 ./bench_fixed_size_hakmem 200000 1024 128` → Check
Debug Build + Guards
1) Build debug flavor
   - `./build.sh debug bench_fixed_size_hakmem`
2) Strong safety/guards
   - `export HAKMEM_TINY_SAFE_FREE_STRICT=1`
   - `export HAKMEM_TINY_DEBUG_REMOTE_GUARD=1`
   - `export HAKMEM_INVALID_FREE_LOG=1`
   - `export HAKMEM_TINY_RF_FORCE_NOTIFY=1`
3) Run under gdb
   - `gdb --args ./bench_fixed_size_hakmem 200000 1024 128`
   - `(gdb) run`
   - On crash: `(gdb) bt`, `(gdb) frame 0`, `(gdb) p/x *meta`, `(gdb) p tls->slab_idx`, `(gdb) p tls->ss`, `(gdb) p meta->used`, `(gdb) p meta->carved`, `(gdb) p meta->capacity`
Hypotheses (ranked)
1) Capacity/stride mismatch in class7 carve
   - class7 uses stride=1024 (no 1B header). Any code calculating with `bs = class_size + 1` will overstep the slab's usable region.
   - Check that the capacity computed in `superslab_init_slab()` and every linear carve helper use the same stride.
2) TLS slab switch with stale pointers (already fixed for P0 direct path; check legacy/P0-general)
   - After `superslab_refill()`, ensure `tls = &g_tls_slabs[c]; meta = tls->meta;` are reloaded before touching counters or doing a linear carve.
3) Remote drain corrupts freelist
   - Verify the sentinel is cleared; ensure the drain happens before the freelist pop; check that the class7 path uses the same ordering.
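
To make hypothesis 1 concrete, the sketch below works through the arithmetic of a stride mismatch: capacity is derived from one stride while the carve advances by another. The function names and the 64 KiB usable size are hypothetical and chosen only for illustration; they are not taken from the hakmem sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration only; not hakmem APIs. */

/* Capacity as an init routine would compute it for a given stride. */
static size_t demo_capacity(size_t usable, size_t stride) {
    return usable / stride;
}

/* Exclusive end offset after carving `capacity` blocks with `carve_stride`. */
static size_t demo_carve_end(size_t capacity, size_t carve_stride) {
    return capacity * carve_stride;
}

/* Bytes by which a full linear carve runs past the usable region (0 if none). */
static size_t demo_overstep(size_t usable, size_t init_stride, size_t carve_stride) {
    size_t end = demo_carve_end(demo_capacity(usable, init_stride), carve_stride);
    return end > usable ? end - usable : 0;
}
```

With an assumed usable region of 65536 bytes, `demo_overstep(65536, 1024, 1024)` is 0, while `demo_overstep(65536, 1024, 1025)` is 64: each of the 64 blocks drifts by one byte, so the last block ends 64 bytes past the region.
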
Files to Inspect
- `core/tiny_superslab_alloc.inc.h` (superslab_refill, adopt_bind_if_safe, stride/capacity)
- `core/hakmem_tiny_refill.inc.h` (legacy SLL refill, carve/pop ordering, bounds checks)
- `core/hakmem_tiny_refill_p0.inc.h` (P0 general path; C7 is currently guarded OFF for Direct-FC; confirm the P0 batch path is not entered for C7)
- `core/superslab/superslab_inline.h` (remote drain, sentinel guard)
Instrumentation to Add (debug-only)
- In `superslab_init_slab(ss, idx, class_size, tid)`:
  - Compute `stride = class_size + (class_idx != 7 ? 1 : 0)`; assert `meta->capacity == usable/stride`.
- In linear carve path (legacy + P0-general):
  - Before write: assert `meta->carved < meta->capacity`; compute base and assert `ptr < slab_base+usable`.
- After `superslab_refill()` in any loop: rebind `tls/meta` unconditionally.
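
The assertions listed above could look roughly like the following. This is a sketch under assumptions: `demo_meta_t` is a stand-in that only borrows the `carved` and `capacity` field names from the gdb commands in this plan, and `demo_stride` and `demo_carve_one` are hypothetical helpers, not functions from the hakmem tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the slab metadata (real layout is assumed unknown). */
typedef struct {
    uint32_t carved;    /* blocks handed out linearly so far */
    uint32_t capacity;  /* total blocks that fit in the slab */
} demo_meta_t;

/* Stride rule as stated in this plan: class 7 carries no 1-byte header. */
static size_t demo_stride(int class_idx, size_t class_size) {
    return class_size + (class_idx != 7 ? 1u : 0u);
}

/* Carve one block linearly, with debug-only guards before the pointer is used. */
static uint8_t *demo_carve_one(demo_meta_t *meta, uint8_t *slab_base,
                               size_t usable, size_t stride) {
    /* Capacity must have been derived from the same stride used here. */
    assert(meta->capacity == usable / stride);
    /* There must be room left before carving. */
    assert(meta->carved < meta->capacity);

    uint8_t *ptr = slab_base + (size_t)meta->carved * stride;
    /* Slightly stronger than `ptr < slab_base+usable`: the whole block must fit. */
    assert(ptr + stride <= slab_base + usable);

    meta->carved++;
    return ptr;
}
```

Because the guards use `assert`, they compile away under `NDEBUG`, which keeps them debug-only as intended.
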
Bisect Switches
- Kill P0 entirely: `HAKMEM_TINY_P0_DISABLE=1`
- Skip remote drain: `HAKMEM_TINY_P0_NO_DRAIN=1`
- Assume single-threaded (1T) mode: `HAKMEM_TINY_ASSUME_1T=1`
- Disable simplified refills (if applicable): `HAKMEM_TINY_SIMPLE_REFILL=0` (add if not present)
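
If a switch such as `HAKMEM_TINY_SIMPLE_REFILL` has to be added, one possible shape for reading it is sketched below. How hakmem actually parses its environment variables is not shown in this plan, so the parsing rule here (unset or empty means default, a leading `0` means off, anything else means on) and both helper names are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical parsing rule: NULL or empty -> default, leading '0' -> off, else on. */
static int demo_parse_flag(const char *value, int default_value) {
    if (value == NULL || *value == '\0') {
        return default_value;
    }
    return *value != '0';
}

/* Read a boolean switch from the environment using the rule above. */
static int demo_env_flag(const char *name, int default_value) {
    return demo_parse_flag(getenv(name), default_value);
}
```

A caller would typically evaluate `demo_env_flag("HAKMEM_TINY_SIMPLE_REFILL", 1)` once at init and cache the result, so the hot path never calls `getenv`.
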
Patch Strategy (expected minimal fix)
1) Make class7 stride consistently 1024 in all carve paths (no +1 header). Audit bs computations.
2) Ensure tls/meta rebind after every `superslab_refill()` in non-direct paths.
3) Enforce drain-before-pop ordering and sentinel clear.
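
Step 2 of the strategy can be illustrated with a toy model of the refill loop. Everything here is a hypothetical stand-in: `demo_refill` plays the role of `superslab_refill()`, and `demo_tls_slabs` mimics `g_tls_slabs` with only the fields needed to show the rebind. The real structures and signatures are assumed to differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the real TLS/meta structures. */
typedef struct { unsigned carved, capacity; } demo_meta_t;
typedef struct { demo_meta_t *meta; int slab_idx; } demo_tls_slab_t;

static demo_meta_t demo_metas[2] = { { 0, 2 }, { 0, 2 } };
static demo_tls_slab_t demo_tls_slabs[8];

/* Stand-in for superslab_refill(): binds class c to the next slab, 0 on failure. */
static int demo_refill(int c) {
    demo_tls_slab_t *tls = &demo_tls_slabs[c];
    int next = tls->meta ? tls->slab_idx + 1 : 0;
    if (next >= 2) {
        return 0;
    }
    tls->slab_idx = next;
    tls->meta = &demo_metas[next];
    return 1;
}

/* Carve up to `want` blocks for class c; returns how many were taken. */
static unsigned demo_take(int c, unsigned want) {
    demo_tls_slab_t *tls = &demo_tls_slabs[c];
    demo_meta_t *meta = tls->meta;
    unsigned taken = 0;
    while (taken < want) {
        if (meta == NULL || meta->carved >= meta->capacity) {
            if (!demo_refill(c)) {
                break;
            }
            /* Rebind unconditionally: the refill may have switched slabs,
             * so the cached meta pointer must not be reused. */
            tls = &demo_tls_slabs[c];
            meta = tls->meta;
            continue;
        }
        meta->carved++;
        taken++;
    }
    return taken;
}
```

Without the rebind, the loop would keep incrementing `carved` on the exhausted slab's metadata, which is the stale-pointer failure described in hypothesis 2.
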
Acceptance Criteria
- `./bench_fixed_size_hakmem 200000 1024 128` passes 3/3 without SEGV.
- Debug counters show `active_delta == taken` (no mismatch).
- No invalid-free logs under STRICT mode.
Notes
- We already defaulted C7 DirectFC to OFF and guarded P0 entry for C7 unless explicitly enabled (`HAKMEM_TINY_P0_C7_ENABLE=1`).
- Focus on legacy/P0-general carve paths for C7.