# Phase 18: Hot Text Isolation v1 — Design

## 0. Context (from Phase 17)

Phase 17 established **Case B**:
- A same-binary A/B with `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows that the **allocator delta is negligible**.
- The large gap appears against the tiny `bench_random_mixed_system` binary.

Signal:
- I-cache misses / instructions / cycles are far worse in the hakmem-linked binary.
- Binary size (`~653K`) vs system (`~21K`) correlates with the throughput gap.

Ref: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`

---

## 1. Goal

Reduce **hot-path instruction footprint** and improve **I-cache locality** in the hakmem-linked binary, without changing allocator algorithms.

Primary success metric:
- Mixed (16–1024B) throughput improvement, with accompanying reductions in:
  - `iTLB/icache misses` (or “I-cache misses” counter used in Phase 17)
  - total instructions executed per 200M iters

---

## 2. Non-goals

- No allocator algorithm redesign.
- No behavioral changes to safety/Fail-Fast semantics (only layout/placement changes).
- No “delete code = faster” experiments (Phase 17 showed layout dominates; deletions confound results).

---

## 3. Box Theory framing

This is a “build/layout box”:
- **Box**: HotTextIsolationBox (compile-time layout controls + annotations)
- **Boundary**: build flag / TU split (no runtime overhead)
- **Rollback**: single Makefile knob (`HOT_TEXT_ISOLATION=0/1`) or `-DHAKMEM_HOT_TEXT_ISOLATION=0/1`
- **Observability**: perf stat + binary size (no always-on logs)

---

## 4. Design: v1 tactics (low-risk)

### 4.1 Hot/Cold attributes SSOT

Introduce a single header that is the single source of truth (SSOT) for hot/cold attributes:
- `HAK_HOT_FN` → `__attribute__((hot))` (and optionally placement in a `.text.hak_hot` section)
- `HAK_COLD_FN` → `__attribute__((cold,noinline))` (and optionally placement in a `.text.hak_cold` section)

The attributes take effect only when `HAKMEM_HOT_TEXT_ISOLATION=1`.

Why:
- Makes “what is hot/cold” explicit and consistent (SSOT).
- Lets us annotate a small set of functions without scattering ad-hoc attributes.
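
A minimal sketch of what such a header could look like, assuming GCC or Clang. The header name, the include guard, the default of `0` for an unset knob, and the two `demo_*` functions are illustrative assumptions and are not taken from the hakmem sources; only `HAK_HOT_FN`, `HAK_COLD_FN`, and `HAKMEM_HOT_TEXT_ISOLATION` come from this design.

```c
/* hak_hot_cold_attrs.h: hypothetical file name, sketch only. */
#ifndef HAK_HOT_COLD_ATTRS_H
#define HAK_HOT_COLD_ATTRS_H

/* Assumption: an unset knob means "isolation OFF". */
#ifndef HAKMEM_HOT_TEXT_ISOLATION
#define HAKMEM_HOT_TEXT_ISOLATION 0
#endif

#if HAKMEM_HOT_TEXT_ISOLATION && (defined(__GNUC__) || defined(__clang__))
/* A section("...") attribute could be added here for explicit placement. */
#define HAK_HOT_FN  __attribute__((hot))
#define HAK_COLD_FN __attribute__((cold, noinline))
#else
/* Knob OFF or unsupported compiler: the macros expand to nothing. */
#define HAK_HOT_FN
#define HAK_COLD_FN
#endif

#endif /* HAK_HOT_COLD_ATTRS_H */

/* Hypothetical functions showing how the macros would be applied. */
HAK_COLD_FN static int demo_slow_path(int x) {
    return x * 2;
}

HAK_HOT_FN static int demo_fast_path(int x) {
    return (x >= 0) ? x + 1 : demo_slow_path(x);
}
```

Building with `-DHAKMEM_HOT_TEXT_ISOLATION=1` turns the attributes on; with the knob at `0` the same source compiles with no annotations, which is what makes the A/B comparison a pure layout change.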

### 4.2 Translation-unit split for wrappers

Move wrapper definitions out of `core/hakmem.c` into a dedicated TU:
- `core/hak_wrappers_box.c` includes `core/box/hak_wrappers.inc.h`

Why:
- Prevents wrapper text from being interleaved with unrelated code in the same TU.
- Improves the linker’s ability to cluster hot code.
- Enables future link-order experiments (symbol ordering files) without touching allocator logic.

### 4.3 Cold code isolation

Ensure rarely-hit helpers stay cold/out-of-line:
- wrapper diagnostics (`wrapper_record_fallback`, ptr trace dumps, verbose logging)
- “slow fallback” paths (`malloc_cold`, `free_cold`)

Principle:
- Hot path must remain a straight-line “try → return” shape.
- Anything that allocates/logs/diagnoses is cold and must not be inlined into hot wrappers.
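
The “try → return” shape can be illustrated with a toy wrapper pair, assuming GCC or Clang. Everything prefixed `demo_` or `DEMO_` (including the one-slot cache and the fallback counter) is hypothetical and stands in for the real fast path and diagnostics; it is not the hakmem wrapper code.

```c
#include <stddef.h>
#include <stdlib.h>

#if defined(__GNUC__) || defined(__clang__)
#define DEMO_HOT       __attribute__((hot))
#define DEMO_COLD      __attribute__((cold, noinline))
#define DEMO_LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define DEMO_HOT
#define DEMO_COLD
#define DEMO_LIKELY(x) (x)
#endif

#define DEMO_SLOT_BYTES 64

/* Hypothetical one-slot cache; every cached block holds >= DEMO_SLOT_BYTES. */
static void *demo_cache_slot = NULL;
static unsigned long demo_fallback_count = 0;

/* Cold: bookkeeping and the real allocation stay out of line. */
DEMO_COLD static void *demo_malloc_cold(size_t size) {
    demo_fallback_count++;
    return malloc(size < DEMO_SLOT_BYTES ? DEMO_SLOT_BYTES : size);
}

DEMO_COLD static void demo_free_cold(void *p) {
    free(p);
}

/* Hot: one predictable branch, then return. */
DEMO_HOT static void *demo_malloc_hot(size_t size) {
    void *p = demo_cache_slot;
    if (DEMO_LIKELY(p != NULL && size <= DEMO_SLOT_BYTES)) {
        demo_cache_slot = NULL;
        return p;
    }
    return demo_malloc_cold(size);
}

DEMO_HOT static void demo_free_hot(void *p) {
    if (DEMO_LIKELY(p != NULL && demo_cache_slot == NULL)) {
        demo_cache_slot = p;
        return;
    }
    demo_free_cold(p);
}
```

The hot functions contain only the cache probe and a tail call; the counter update and the libc calls sit in the `noinline` cold functions, so the compiler cannot fold them back into the hot text.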

### 4.4 Optional: section GC for bench builds

For bench binaries only:
- add `-ffunction-sections -fdata-sections`
- link with `-Wl,--gc-sections`

Why:
- Drops truly-unused text and reduces overall text pressure.
- Helps the linker keep hot text denser.

This is optional because it is toolchain-sensitive; measure before promoting.

---

## 7. v2 Extension (if v1 is NEUTRAL): BENCH_MINIMAL compile-out

Phase 17 shows the hakmem-linked binary executes roughly 2x the instructions of the tiny system binary. If v1 (TU split/attributes) is NEUTRAL, the next likely lever is **not placement-only**, but **removing per-call fixed costs** from the hot path by compiling them out in a bench-only build.

Concept:
- Introduce `HAKMEM_BENCH_MINIMAL=1` build mode (Makefile knob)
- In this mode:
  - “promoted defaults” are treated as compile-time constants (FastLane ON, snapshots ON, etc.)
  - ENV gates become compile-time (no TLS/env probing in hot path)
  - Hot counters/stats macros compile out completely
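
A rough sketch of how the compile-out could look for one gate and one counter. Only `HAKMEM_BENCH_MINIMAL` comes from this design; the `DEMO_*` macros, the `DEMO_FASTLANE` environment variable, the default-ON reading of the gate, and the plain `static` cache (instead of TLS) are assumptions made for illustration.

```c
#include <stdlib.h>

/* Assumption: an unset knob means the normal, fully gated build. */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

/* Hypothetical hot-path counter. */
static unsigned long demo_hot_calls = 0;

#if HAKMEM_BENCH_MINIMAL
/* Bench-only build: the gate is a constant and the counter vanishes. */
#define DEMO_FASTLANE_ENABLED() 1
#define DEMO_STAT_INC(counter)  ((void)0)
#else
/* Normal build: probe the environment once, then reuse the cached value. */
static int demo_fastlane_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        const char *v = getenv("DEMO_FASTLANE"); /* hypothetical variable */
        cached = (v == NULL || v[0] != '0');     /* default ON */
    }
    return cached;
}
#define DEMO_FASTLANE_ENABLED() demo_fastlane_enabled()
#define DEMO_STAT_INC(counter)  ((counter)++)
#endif

/* Returns 1 when the fast lane is taken, 0 otherwise. */
static int demo_alloc_path(void) {
    DEMO_STAT_INC(demo_hot_calls);
    return DEMO_FASTLANE_ENABLED() ? 1 : 0;
}
```

With `-DHAKMEM_BENCH_MINIMAL=1`, `demo_alloc_path` reduces to returning a constant, so the per-call load, compare, and increment disappear from the hot path rather than merely being moved.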

Why this still fits Box Theory:
- It is a **build box** (reversible by knob), not an algorithm rewrite
- Boundaries remain: hot path stays Fail-Fast; cold fallback remains intact
- Observability shifts to `perf stat` (no always-on logging)

Expected impact:
- If instruction footprint is truly dominant, this is the first place to see **double-digit gains** (+10–20%).

## 5. Risks / mitigations

### Risk A: Layout tweaks regress throughput

Mitigation:
- A/B using the same workload + perf stat counters (Phase 17 set).
- If throughput regresses: keep as research-only (build knob default OFF).

### Risk B: Toolchain sensitivity (ld vs lld, LTO interactions)

Mitigation:
- Keep v1 minimal (TU split + attributes first).
- Only enable `--gc-sections` if it’s stable in the current toolchain.

---

## 6. Expected impact

Conservative:
- +3–10% throughput improvement on Mixed by reducing instruction footprint and I-cache misses.

Stretch goal:
- Bring “hakmem-linked + FORCE_LIBC” closer to `bench_random_mixed_system` ceiling by minimizing wrapper text working-set.