hakmem/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md

# Phase 18: Hot Text Isolation v1 — Design

## 0. Context (from Phase 17)

Phase 17 established **Case B**:
- Same-binary `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator delta is negligible**.
- The large gap appears vs the tiny `bench_random_mixed_system` binary.

Signal:
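- I-cache misses, instructions, and cycles are all far worse in the hakmem-linked binary (over 200M iters: 153K vs 68K I-cache misses, 41.3B vs 21.5B instructions, 17.9B vs 10.2B cycles).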
- Binary size (`~653K`) vs system (`~21K`) correlates with the throughput gap.

Ref: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`

---

## 1. Goal

Reduce **hot-path instruction footprint** and improve **I-cache locality** in the hakmem-linked binary, without changing allocator algorithms.

Primary success metric:
- Mixed (16–1024B) throughput improvement, with accompanying reductions in:
  - `iTLB/icache misses` (or “I-cache misses” counter used in Phase 17)
  - total instructions executed per 200M iters

---

## 2. Non-goals

- No allocator algorithm redesign.
- No behavioral changes to safety/Fail-Fast semantics (only layout/placement changes).
- No “delete code = faster” experiments (Phase 17 showed layout dominates; deletions confound results).

---

## 3. Box Theory framing

This is a “build/layout box”:
- **Box**: HotTextIsolationBox (compile-time layout controls + annotations)
- **Boundary**: build flag / TU split (no runtime overhead)
- **Rollback**: single Makefile knob (`HOT_TEXT_ISOLATION=0/1`) or `-DHAKMEM_HOT_TEXT_ISOLATION=0/1`
- **Observability**: perf stat + binary size (no always-on logs)

---

## 4. Design: v1 tactics (low-risk)

### 4.1 Hot/Cold attributes SSOT

Introduce a single header defining attributes:
- `HAK_HOT_FN` → `__attribute__((hot))` (and optionally `.text.hak_hot`)
- `HAK_COLD_FN` → `__attribute__((cold,noinline))` (and optionally `.text.hak_cold`)

Activated only when `HAKMEM_HOT_TEXT_ISOLATION=1`.

Why:
- Makes “what is hot/cold” explicit and consistent (SSOT).
- Lets us annotate a small set of functions without scattering ad-hoc attributes.
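
A minimal sketch of what such a header could look like, assuming a GCC/Clang toolchain. The include-guard name and the `demo_*` functions are hypothetical and only illustrate usage; the optional `.text.hak_hot` / `.text.hak_cold` section placement is left out here because section-name syntax differs between object formats.

```c
#include <stddef.h>

/* Sketch of an SSOT attribute header (guard name is hypothetical). */
#ifndef HAK_HOT_COLD_ATTRS_SKETCH_H
#define HAK_HOT_COLD_ATTRS_SKETCH_H

/* Default the knob to OFF so a plain build is unchanged. */
#ifndef HAKMEM_HOT_TEXT_ISOLATION
#define HAKMEM_HOT_TEXT_ISOLATION 0
#endif

#if HAKMEM_HOT_TEXT_ISOLATION && (defined(__GNUC__) || defined(__clang__))
/* hot: hint that the function is executed frequently. */
#define HAK_HOT_FN  __attribute__((hot))
/* cold + noinline: size-optimized and never inlined into hot callers. */
#define HAK_COLD_FN __attribute__((cold, noinline))
#else
/* Knob OFF or unknown compiler: the macros expand to nothing. */
#define HAK_HOT_FN
#define HAK_COLD_FN
#endif

#endif /* HAK_HOT_COLD_ATTRS_SKETCH_H */

/* Illustrative use only; these functions are not part of hakmem. */
static HAK_COLD_FN int demo_slow_path(size_t n)
{
    return (int)(n % 7);
}

HAK_HOT_FN int demo_fast_path(size_t n)
{
    if (n < 1024) {
        return 1;               /* common case stays short */
    }
    return demo_slow_path(n);   /* rare case goes out of line */
}
```

Because the macros expand to nothing when the knob is off, annotated functions compile identically to unannotated ones in a default build, which keeps the A/B comparison clean.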

### 4.2 Translation-unit split for wrappers

Move wrapper definitions out of `core/hakmem.c` into a dedicated TU:
- `core/hak_wrappers_box.c` includes `core/box/hak_wrappers.inc.h`

Why:
- Prevents wrapper text from being interleaved with unrelated code in the same TU.
- Improves the linker’s ability to cluster hot code.
- Enables future link-order experiments (symbol ordering files) without touching allocator logic.

### 4.3 Cold code isolation

Ensure rarely-hit helpers stay cold/out-of-line:
- wrapper diagnostics (`wrapper_record_fallback`, ptr trace dumps, verbose logging)
- “slow fallback” paths (`malloc_cold`, `free_cold`)

Principle:
- Hot path must remain a straight-line “try → return” shape.
- Anything that allocates/logs/diagnoses is cold and must not be inlined into hot wrappers.
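
The "try → return" shape can be sketched as follows. This is an illustration under stated assumptions, not the hakmem wrapper: the bump pool, `DEMO_FAST_MAX`, and every `demo_*` name are hypothetical, and the attribute requires GCC or Clang.

```c
#include <stddef.h>
#include <stdlib.h>

#if defined(__GNUC__) || defined(__clang__)
#define DEMO_COLD __attribute__((cold, noinline))
#else
#define DEMO_COLD
#endif

#define DEMO_FAST_MAX 1024u

/* Hypothetical fast-path backing store: a tiny bump pool. */
static _Alignas(16) unsigned char demo_pool[4096];
static size_t demo_pool_used;
static unsigned long demo_fallbacks;

/* Cold path: bookkeeping and the general allocator live here,
 * out of line, so they never bloat the hot wrapper body. */
static DEMO_COLD void *demo_malloc_cold(size_t size)
{
    demo_fallbacks++;
    return malloc(size);
}

/* Hot path: one attempt, then return or hand off to the cold path. */
void *demo_malloc(size_t size)
{
    if (size != 0 && size <= DEMO_FAST_MAX) {
        size_t need = (size + 15u) & ~(size_t)15u;  /* round up to 16 */
        if (need <= sizeof demo_pool - demo_pool_used) {
            void *p = demo_pool + demo_pool_used;
            demo_pool_used += need;
            return p;
        }
    }
    return demo_malloc_cold(size);
}

unsigned long demo_fallback_count(void)
{
    return demo_fallbacks;
}
```

The hot function contains no logging, counters, or calls other than the single hand-off, so its text stays small regardless of how much diagnostic code the cold function accumulates.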

### 4.4 Optional: section GC for bench builds

For bench binaries only:
- add `-ffunction-sections -fdata-sections`
- link with `-Wl,--gc-sections`

Why:
- Drops truly-unused text and reduces overall text pressure.
- Helps the linker keep hot text denser.

This is optional because it is toolchain-sensitive; measure before promoting.

---

## 7. v2 Extension (if v1 is NEUTRAL): BENCH_MINIMAL compile-out

Phase 17 shows the hakmem-linked binary executes ~2x instructions vs the tiny system binary. If v1 (TU split/attributes) is NEUTRAL, the next likely lever is **not placement-only**, but **removing per-call fixed costs** from the hot path by compiling them out in a bench-only build.

Concept:
- Introduce `HAKMEM_BENCH_MINIMAL=1` build mode (Makefile knob)
- In this mode:
  - “promoted defaults” are treated as compile-time constants (FastLane ON, snapshots ON, etc.)
  - ENV gates become compile-time (no TLS/env probing in hot path)
  - Hot counters/stats macros compile out completely
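
As a rough sketch of the compile-out idea, assuming a single-threaded caller: with `HAKMEM_BENCH_MINIMAL=1` the gate folds to a constant and the stats macro expands to nothing, while the default build keeps the environment probe and the counter. The `DEMO_FASTLANE` variable and all `demo_*` names are hypothetical stand-ins, not hakmem identifiers.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

#if HAKMEM_BENCH_MINIMAL
/* Promoted default baked in: no getenv, no cached flag to load. */
static inline int demo_fastlane_enabled(void) { return 1; }
/* Stats compile out completely. */
#define DEMO_STAT_INC(counter) ((void)0)
#else
/* Default build: probe the environment once and cache the answer.
 * (Not thread-safe; a real gate would need TLS or atomics.) */
static inline int demo_fastlane_enabled(void)
{
    static int cached = -1;
    if (cached < 0) {
        const char *v = getenv("DEMO_FASTLANE");
        cached = (v == NULL || strcmp(v, "0") != 0);  /* default ON */
    }
    return cached;
}
#define DEMO_STAT_INC(counter) ((counter)++)
#endif

static unsigned long demo_stat_fast_hits;

/* Returns 1 when the request takes the fast lane, 0 otherwise. */
int demo_alloc_route(size_t size)
{
    if (demo_fastlane_enabled() && size <= 1024) {
        DEMO_STAT_INC(demo_stat_fast_hits);
        return 1;
    }
    return 0;
}

unsigned long demo_stat_read(void)
{
    return demo_stat_fast_hits;
}
```

In the minimal build the branch on `demo_fastlane_enabled()` and the counter increment both disappear from the generated code, which is the per-call fixed cost this mode is meant to remove.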

Why this still fits Box Theory:
- It is a **build box** (reversible by knob), not an algorithm rewrite
- Boundaries remain: hot path stays Fail-Fast; cold fallback remains intact
- Observability shifts to `perf stat` (no always-on logging)

Expected impact:
- If instruction footprint is truly dominant, this is the first place to see **double-digit gains** (+10–20%).

---

## 5. Risks / mitigations

### Risk A: layout tweaks regress throughput

Mitigation:
- A/B using the same workload + perf stat counters (Phase 17 set).
- If regression: keep as research-only (build knob default OFF).

### Risk B: Toolchain sensitivity (ld vs lld, LTO interactions)

Mitigation:
- Keep v1 minimal (TU split + attributes first).
- Only enable `--gc-sections` if it’s stable in the current toolchain.

---

## 6. Expected impact

Conservative:
- +3–10% throughput improvement on Mixed by reducing instruction footprint and I-cache misses.

Stretch goal:
- Bring “hakmem-linked + FORCE_LIBC” closer to `bench_random_mixed_system` ceiling by minimizing wrapper text working-set.

Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added

## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)

Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.

Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).

Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).

Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)

ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)

Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.

---

## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed

Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate allocator logic difference vs binary layout penalty.

Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)

Result: **Case B confirmed**: allocator difference is negligible, layout penalty dominates.

Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem

Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)

Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.

Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.

Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)

Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)

---

## Phase 18: Hot Text Isolation — Design Added

Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).

Strategy (v1 → v2 progression):

v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)

v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (the main bet)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement

Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)

Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)

Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
2025-12-15 05:25:47 +09:00