# Phase 18: Hot Text Isolation v1 (Design)

## 0. Context (from Phase 17)

Phase 17 established **Case B**:
- Same-binary `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator delta is negligible**.
- The large gap appears vs the tiny `bench_random_mixed_system` binary.

Signal:
- I-cache misses / instructions / cycles are far worse in the hakmem-linked binary.
- Binary size (`~653K`) vs system (`~21K`) correlates with the throughput gap.

Ref: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`

---

## 1. Goal

Reduce **hot-path instruction footprint** and improve **I-cache locality** in the hakmem-linked binary, without changing allocator algorithms.

Primary success metric:
- Mixed (16–1024B) throughput improvement, with accompanying reductions in:
  - `iTLB/icache misses` (or “I-cache misses” counter used in Phase 17)
  - total instructions executed per 200M iters

---

## 2. Non-goals

- No allocator algorithm redesign.
- No behavioral changes to safety/Fail-Fast semantics (only layout/placement changes).
- No “delete code = faster” experiments (Phase 17 showed layout dominates; deletions confound results).

---

## 3. Box Theory framing

This is a “build/layout box”:
- **Box**: HotTextIsolationBox (compile-time layout controls + annotations)
- **Boundary**: build flag / TU split (no runtime overhead)
- **Rollback**: single Makefile knob (`HOT_TEXT_ISOLATION=0/1`) or `-DHAKMEM_HOT_TEXT_ISOLATION=0/1`
- **Observability**: perf stat + binary size (no always-on logs)

---

## 4. Design: v1 tactics (low-risk)

### 4.1 Hot/Cold attributes SSOT

Introduce a single header defining attributes:
- `HAK_HOT_FN` → `__attribute__((hot))` (and optionally `.text.hak_hot`)
- `HAK_COLD_FN` → `__attribute__((cold,noinline))` (and optionally `.text.hak_cold`)

Activated only when `HAKMEM_HOT_TEXT_ISOLATION=1`.

Why:
- Makes “what is hot/cold” explicit and consistent (SSOT).
- Lets us annotate a small set of functions without scattering ad-hoc attributes.
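A minimal sketch of what such an attribute header and its use could look like, assuming a GCC- or Clang-compatible compiler. Only `HAK_HOT_FN`, `HAK_COLD_FN`, and `HAKMEM_HOT_TEXT_ISOLATION` come from this design; every `sketch_*` name and the one-slot cache are hypothetical stand-ins, not the real hakmem fast path. The optional `.text.hak_hot` / `.text.hak_cold` section placement is left out because section naming is platform-specific.

```c
/* Illustration only: the sketch_* names are hypothetical, not hakmem code. */
#include <stddef.h>
#include <stdlib.h>

#ifndef HAKMEM_HOT_TEXT_ISOLATION
#define HAKMEM_HOT_TEXT_ISOLATION 0   /* default OFF, so rollback is one knob */
#endif

#if HAKMEM_HOT_TEXT_ISOLATION && (defined(__GNUC__) || defined(__clang__))
#define HAK_HOT_FN  __attribute__((hot))
#define HAK_COLD_FN __attribute__((cold, noinline))
#else
#define HAK_HOT_FN    /* expands to nothing when the knob is off */
#define HAK_COLD_FN
#endif

#define SKETCH_SLOT_SIZE 64

static void *sketch_slot;                    /* one cached block, or NULL */
static unsigned long sketch_fallback_count;  /* stand-in for wrapper diagnostics */

/* Cold path: counts the fallback and goes to libc. Every block is at least
 * SKETCH_SLOT_SIZE bytes so it can later be reused from the slot. */
HAK_COLD_FN static void *sketch_malloc_cold(size_t size)
{
    sketch_fallback_count++;
    return malloc(size < SKETCH_SLOT_SIZE ? SKETCH_SLOT_SIZE : size);
}

/* Hot path: one check, then return; everything else is out of line. */
HAK_HOT_FN static void *sketch_malloc(size_t size)
{
    void *p = sketch_slot;
    if (p != NULL && size <= SKETCH_SLOT_SIZE) {
        sketch_slot = NULL;
        return p;
    }
    return sketch_malloc_cold(size);
}

/* Caches one small block for reuse; anything else is returned to libc. */
HAK_HOT_FN static void sketch_free(void *p, size_t size)
{
    if (p != NULL && size <= SKETCH_SLOT_SIZE && sketch_slot == NULL) {
        sketch_slot = p;
        return;
    }
    free(p);
}
```

Building with `-DHAKMEM_HOT_TEXT_ISOLATION=1` turns the attributes on; without it both macros expand to nothing and behavior is unchanged, which is the property the rollback knob relies on.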

### 4.2 Translation-unit split for wrappers

Move wrapper definitions out of `core/hakmem.c` into a dedicated TU:
- `core/hak_wrappers_box.c` includes `core/box/hak_wrappers.inc.h`

Why:
- Prevents wrapper text from being interleaved with unrelated code in the same TU.
- Improves the linker’s ability to cluster hot code.
- Enables future link-order experiments (symbol ordering files) without touching allocator logic.

### 4.3 Cold code isolation

Ensure rarely-hit helpers stay cold/out-of-line:
- wrapper diagnostics (`wrapper_record_fallback`, ptr trace dumps, verbose logging)
- “slow fallback” paths (`malloc_cold`, `free_cold`)

Principle:
- Hot path must remain a straight-line “try → return” shape.
- Anything that allocates/logs/diagnoses is cold and must not be inlined into hot wrappers.

### 4.4 Optional: section GC for bench builds

For bench binaries only:
- add `-ffunction-sections -fdata-sections`
- link with `-Wl,--gc-sections`

Why:
- Drops truly-unused text and reduces overall text pressure.
- Helps the linker keep hot text denser.

This is optional because it is toolchain-sensitive; measure before promoting.

---

## 7. v2 Extension (if v1 is NEUTRAL): BENCH_MINIMAL compile-out

Phase 17 shows the hakmem-linked binary executes ~2x instructions vs the tiny system binary.
If v1 (TU split/attributes) is NEUTRAL, the next likely lever is **not placement-only**, but **removing per-call fixed costs** from the hot path by compiling them out in a bench-only build.

Concept:
- Introduce `HAKMEM_BENCH_MINIMAL=1` build mode (Makefile knob)
- In this mode:
  - “promoted defaults” are treated as compile-time constants (FastLane ON, snapshots ON, etc.)
  - ENV gates become compile-time (no TLS/env probing in hot path)
  - Hot counters/stats macros compile out completely

Why this still fits Box Theory:
- It is a **build box** (reversible by knob), not an algorithm rewrite
- Boundaries remain: hot path stays Fail-Fast; cold fallback remains intact
- Observability shifts to `perf stat` (no always-on logging)

Expected impact:
- If instruction footprint is truly dominant, this is the first place to see **double-digit gains** (+10–20%).

---

## 5. Risks / mitigations

### Risk A: layout tweaks regress throughput

Mitigation:
- A/B using the same workload + perf stat counters (Phase 17 set).
- If regression: keep as research-only (build knob default OFF).

### Risk B: Toolchain sensitivity (ld vs lld, LTO interactions)

Mitigation:
- Keep v1 minimal (TU split + attributes first).
- Only enable `--gc-sections` if it’s stable in the current toolchain.

---

## 6. Expected impact

Conservative:
- +3–10% throughput improvement on Mixed by reducing instruction footprint and I-cache misses.

Stretch goal:
- Bring “hakmem-linked + FORCE_LIBC” closer to `bench_random_mixed_system` ceiling by minimizing wrapper text working-set.
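
For reference, the compile-out idea from section 7 can be sketched as below. This is a rough illustration under stated assumptions, not the actual hakmem gate code: only `HAKMEM_BENCH_MINIMAL` comes from this design, while the `SKETCH_*` macros, the `sketch_*` functions, and the `HAKMEM_SKETCH_FASTLANE` environment variable are hypothetical.

```c
/* Illustration only: every SKETCH_* / sketch_* name is hypothetical. */
#include <stdlib.h>
#include <string.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0   /* default OFF: normal runtime-gated build */
#endif

#if HAKMEM_BENCH_MINIMAL
/* Bench-only build: the promoted default is a constant, so the compiler can
 * fold the branch, and the stats macro expands to nothing. */
#define SKETCH_FASTLANE_ENABLED() 1
#define SKETCH_STAT_INC(counter)  ((void)0)
#else
/* Normal build: the gate is read from the environment once, then cached. */
static int sketch_fastlane_cached = -1;

static int sketch_fastlane_probe(void)
{
    if (sketch_fastlane_cached < 0) {
        const char *v = getenv("HAKMEM_SKETCH_FASTLANE"); /* hypothetical name */
        /* Default ON unless the variable is set to exactly "0". */
        sketch_fastlane_cached = (v == NULL || strcmp(v, "0") != 0);
    }
    return sketch_fastlane_cached;
}

#define SKETCH_FASTLANE_ENABLED() sketch_fastlane_probe()
#define SKETCH_STAT_INC(counter)  ((counter)++)
#endif

static unsigned long sketch_hot_calls;   /* hot-path counter */

/* Returns 1 when the fast lane would be taken, 0 otherwise. */
static int sketch_route(void)
{
    SKETCH_STAT_INC(sketch_hot_calls);
    return SKETCH_FASTLANE_ENABLED() ? 1 : 0;
}
```

With `-DHAKMEM_BENCH_MINIMAL=1` the probe function and the counter increment are not compiled into the hot path at all; without it the build keeps its runtime gate, so the knob stays reversible.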