hakmem/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md

# Phase 18: Hot Text Isolation v1 — Next Instructions

## Status

- Phase 17 confirms **Case B**: allocator logic delta is negligible; gap is **layout/I-cache**.
- Next: reduce instruction footprint + improve I-cache locality via **Hot Text Isolation**.

Refs:
- Phase 17 results: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`
- Phase 18 design: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`

---

## 0. Goal / Success Criteria

Primary (v1 は “低リスク・効果小さめ” 想定):
- Mixed (16–1024B) throughput **+2%** 以上で GO（layout work の現実ライン）

Secondary (must move in the right direction):
- I-cache misses reduced（目安: **-10%** 以上）
- Total instructions reduced（目安: **-5%** 以上）

If throughput is NEUTRAL but counters improve significantly, keep as research box and iterate once.

---

## 1. Patch Plan (small, reversible)

### Patch 1: Hot/Cold attribute SSOT (L0 Box)

Add:
- `core/box/hot_text_attrs_box.h`

Defines:
- `HAK_HOT_FN`, `HAK_COLD_FN` (no-op when `HAKMEM_HOT_TEXT_ISOLATION=0`)

Usage:
- annotate only a short, high-impact list first:
  - wrappers: `malloc/free/calloc/realloc`
  - FastLane entry helpers (if non-inline)
  - cold helpers: `malloc_cold/free_cold`, wrapper diagnostics

Rollback: build knob off.

### Patch 2: Wrapper TU split (L1 Box boundary)

Move wrapper definitions out of `core/hakmem.c`:
- new: `core/hak_wrappers_box.c`
  - `#include "box/hak_wrappers.inc.h"`
- remove wrapper include from `core/hakmem.c`

Rationale:
- Prevents wrapper text from being interleaved with unrelated code in one TU.
- Sets up link-order clustering.

Rollback: restore include in `core/hakmem.c` and drop new TU.

### Patch 3 (optional): bench-only section GC

Makefile knob:
- `HOT_TEXT_GC_SECTIONS=0/1`（research-only）

When `=1`, add for bench builds:
- `-ffunction-sections -fdata-sections`
- `-Wl,--gc-sections`

Notes:
- Keep it bench-only first (do not touch shared lib build until proven stable).
- **Phase 18 v1 outcome**: This exact flag set caused an I-cache regression in this repo/toolchain.
  - Ref: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md`
  - Therefore, **Patch 3 is NO-GO for now** unless combined with explicit hot symbol ordering.

---

## 2. A/B Procedure (required)

### 2.1 Baseline build (OFF)

```sh
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```

Perf stat (1 run, 200M iters):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 200000000 400 1
```

### 2.2 Optimized build (ON)

```sh
make clean
make -j HOT_TEXT_ISOLATION=1 bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```

Perf stat (same command):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 200000000 400 1
```

### 2.3 System ceiling check (optional)

```sh
./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput" || true
```

---

## 3. GO/NO-GO Decision

- **GO**: Mixed 10-run mean **+2%** 以上 and no health regressions
- **NEUTRAL**: within ±2% → keep as research box, iterate once (more cold isolation or better clustering)
- **NO-GO**: **-2%** or worse → rollback and freeze

Health profiles:

```sh
scripts/verify_health_profiles.sh
```

---

## 4. Reporting (required artifacts)

Create:
- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md`
  - throughput A/B (10-run)
  - binary sizes
  - perf stat table (cycles/instructions/I-cache)
  - conclusion (GO/NEUTRAL/NO-GO)

Update:
- `CURRENT_TASK.md` (Phase 18 status + next)

---

## 5. Notes / guardrails

- This phase intentionally compares **different binaries** (layout is the target), but keep the environment clean (`env -i`, fixed profile, same machine).
- Avoid “delete code” experiments; only isolate/cold/cluster.
- Keep “cold” truly cold: no allocations, no logging, no TLS-heavy helpers.

---

## 6. If v1 is NEUTRAL: Phase 18 v2（BENCH_MINIMAL）へ即進む

Phase 17 の “instructions 2x” を直接削るには、layout だけでなく **hot path に混ざっている ENV/stats/debug の固定費を compile-out** する必要がある可能性が高い。

次の一手（bench 専用 binary / rollback 可能）:

- `HAKMEM_BENCH_MINIMAL=1`（Makefile knob）で:
  - FastLane / wrapper の “常用ON 経路” を固定し、ENV gate を compile-time 定数化
  - hot counters を完全 compile-out
  - 観測は `perf stat` のみ（常時ログ禁止）

期待: +10–20%（もし本当に instruction footprint が支配ならここで大きく動く）
-												Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added

## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)

Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.

Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).

Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).

Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)

ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)

Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.

---

## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed

Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate allocator logic差 vs binary layout penalty.

Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)

Result: **Case B confirmed** — Allocator差 negligible, layout penalty dominates.

Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem

Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)

Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.

Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.

Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)

Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)

---

## Phase 18: Hot Text Isolation — Design Added

Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).

Strategy (v1 → v2 progression):

v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)

v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (本命)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement

Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)

Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)

Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-15 05:25:47 +09:00
+								# Phase 18: Hot Text Isolation v1 — Next Instructions
 								## Status
 								- Phase 17 confirms **Case B**: allocator logic delta is negligible; gap is **layout/I-cache**.
 								- Next: reduce instruction footprint + improve I-cache locality via **Hot Text Isolation**.
 								Refs:
 								- Phase 17 results: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`
 								- Phase 18 design: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`
 								---
 								## 0. Goal / Success Criteria
 								Primary (v1 は “低リスク・効果小さめ” 想定):
 								- Mixed (16–1024B) throughput **+2%** 以上で GO（layout work の現実ライン）
 								Secondary (must move in the right direction):
 								- I-cache misses reduced（目安: **-10%** 以上）
 								- Total instructions reduced（目安: **-5%** 以上）
 								If throughput is NEUTRAL but counters improve significantly, keep as research box and iterate once.
 								---
 								## 1. Patch Plan (small, reversible)
 								### Patch 1: Hot/Cold attribute SSOT (L0 Box)
 								Add:
 								- `core/box/hot_text_attrs_box.h`
 								Defines:
 								- `HAK_HOT_FN`, `HAK_COLD_FN` (no-op when `HAKMEM_HOT_TEXT_ISOLATION=0`)
 								Usage:
 								- annotate only a short, high-impact list first:
 								  - wrappers: `malloc/free/calloc/realloc`
 								  - FastLane entry helpers (if non-inline)
 								  - cold helpers: `malloc_cold/free_cold`, wrapper diagnostics
 								Rollback: build knob off.
 								### Patch 2: Wrapper TU split (L1 Box boundary)
 								Move wrapper definitions out of `core/hakmem.c`:
 								- new: `core/hak_wrappers_box.c`
 								  - `#include "box/hak_wrappers.inc.h"`
 								- remove wrapper include from `core/hakmem.c`
 								Rationale:
 								- Prevents wrapper text from being interleaved with unrelated code in one TU.
 								- Sets up link-order clustering.
 								Rollback: restore include in `core/hakmem.c` and drop new TU.
 								### Patch 3 (optional): bench-only section GC
 								Makefile knob:
-												Phase 18 v1: Hot Text Isolation — NO-GO (I-cache regression)

## Summary

Phase 18 v1 attempted layout optimization using section splitting + GC:
- `-ffunction-sections -fdata-sections -Wl,--gc-sections`

Result: **Catastrophic I-cache regression**
- Throughput: -0.87% (48.94M → 48.52M ops/s)
- I-cache misses: +91.06% (131K → 250K)
- Variance: +80% (σ=0.45M → σ=0.81M)

Root cause: Section-based splitting without explicit hot symbol ordering
fragments code locality, destroying natural compiler/LTO layout.

## Build Knob Safety

Makefile updated to separate concerns:
- `HOT_TEXT_ISOLATION=1` → attributes only (safe, but no perf gain)
- `HOT_TEXT_GC_SECTIONS=1` → section splitting (currently NO-GO)

Both kept as research boxes (default OFF).

## Verdict

Freeze Phase 18 v1:
- Do NOT use section-based linking without strong ordering strategy
- Keep hot/cold attributes as placeholder (currently unused)
- Proceed to Phase 18 v2: BENCH_MINIMAL compile-out

Expected impact v2: +10-20% via instruction count reduction
- GO threshold: +5% minimum, +8% preferred
- Only continue if instructions clearly drop

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md

Modified:
- Makefile (build knob safety isolation)
- CURRENT_TASK.md (Phase 18 v1 verdict)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md

## Lessons

1. Layout optimization is extremely fragile without ordering guarantees
2. I-cache is first-order performance factor (IPC=2.30 is memory-bound)
3. Compiler defaults may be better than manual section splitting
4. Next frontier: instruction count reduction (stats/ENV removal)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-15 05:53:58 +09:00
+								- `HOT_TEXT_GC_SECTIONS=0/1`（research-only）
-												Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added

## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)

Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.

Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).

Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).

Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)

ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)

Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.

---

## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed

Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate allocator logic差 vs binary layout penalty.

Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)

Result: **Case B confirmed** — Allocator差 negligible, layout penalty dominates.

Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem

Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)

Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.

Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.

Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)

Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)

---

## Phase 18: Hot Text Isolation — Design Added

Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).

Strategy (v1 → v2 progression):

v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)

v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (本命)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement

Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)

Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)

Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-15 05:25:47 +09:00
 								When `=1`, add for bench builds:
 								- `-ffunction-sections -fdata-sections`
-												Phase 18 v1: Hot Text Isolation — NO-GO (I-cache regression)

## Summary

Phase 18 v1 attempted layout optimization using section splitting + GC:
- `-ffunction-sections -fdata-sections -Wl,--gc-sections`

Result: **Catastrophic I-cache regression**
- Throughput: -0.87% (48.94M → 48.52M ops/s)
- I-cache misses: +91.06% (131K → 250K)
- Variance: +80% (σ=0.45M → σ=0.81M)

Root cause: Section-based splitting without explicit hot symbol ordering
fragments code locality, destroying natural compiler/LTO layout.

## Build Knob Safety

Makefile updated to separate concerns:
- `HOT_TEXT_ISOLATION=1` → attributes only (safe, but no perf gain)
- `HOT_TEXT_GC_SECTIONS=1` → section splitting (currently NO-GO)

Both kept as research boxes (default OFF).

## Verdict

Freeze Phase 18 v1:
- Do NOT use section-based linking without strong ordering strategy
- Keep hot/cold attributes as placeholder (currently unused)
- Proceed to Phase 18 v2: BENCH_MINIMAL compile-out

Expected impact v2: +10-20% via instruction count reduction
- GO threshold: +5% minimum, +8% preferred
- Only continue if instructions clearly drop

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md

Modified:
- Makefile (build knob safety isolation)
- CURRENT_TASK.md (Phase 18 v1 verdict)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md

## Lessons

1. Layout optimization is extremely fragile without ordering guarantees
2. I-cache is first-order performance factor (IPC=2.30 is memory-bound)
3. Compiler defaults may be better than manual section splitting
4. Next frontier: instruction count reduction (stats/ENV removal)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-15 05:53:58 +09:00
+								- `-Wl,--gc-sections`
-												Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added

## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)

Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.

Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).

Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).

Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)

ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)

Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.

---

## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed

Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate allocator logic差 vs binary layout penalty.

Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)

Result: **Case B confirmed** — Allocator差 negligible, layout penalty dominates.

Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem

Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)

Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.

Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.

Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)

Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)

---

## Phase 18: Hot Text Isolation — Design Added

Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).

Strategy (v1 → v2 progression):

v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)

v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (本命)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement

Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)

Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)

Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-15 05:25:47 +09:00
 								Notes:
 								- Keep it bench-only first (do not touch shared lib build until proven stable).
-												Phase 18 v1: Hot Text Isolation — NO-GO (I-cache regression)

## Summary

Phase 18 v1 attempted layout optimization using section splitting + GC:
- `-ffunction-sections -fdata-sections -Wl,--gc-sections`

Result: **Catastrophic I-cache regression**
- Throughput: -0.87% (48.94M → 48.52M ops/s)
- I-cache misses: +91.06% (131K → 250K)
- Variance: +80% (σ=0.45M → σ=0.81M)

Root cause: Section-based splitting without explicit hot symbol ordering
fragments code locality, destroying natural compiler/LTO layout.

## Build Knob Safety

Makefile updated to separate concerns:
- `HOT_TEXT_ISOLATION=1` → attributes only (safe, but no perf gain)
- `HOT_TEXT_GC_SECTIONS=1` → section splitting (currently NO-GO)

Both kept as research boxes (default OFF).

## Verdict

Freeze Phase 18 v1:
- Do NOT use section-based linking without strong ordering strategy
- Keep hot/cold attributes as placeholder (currently unused)
- Proceed to Phase 18 v2: BENCH_MINIMAL compile-out

Expected impact v2: +10-20% via instruction count reduction
- GO threshold: +5% minimum, +8% preferred
- Only continue if instructions clearly drop

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md

Modified:
- Makefile (build knob safety isolation)
- CURRENT_TASK.md (Phase 18 v1 verdict)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md

## Lessons

1. Layout optimization is extremely fragile without ordering guarantees
2. I-cache is first-order performance factor (IPC=2.30 is memory-bound)
3. Compiler defaults may be better than manual section splitting
4. Next frontier: instruction count reduction (stats/ENV removal)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-15 05:53:58 +09:00
+								- **Phase 18 v1 outcome**: This exact flag set caused an I-cache regression in this repo/toolchain.
 								  - Ref: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md`
 								  - Therefore, **Patch 3 is NO-GO for now** unless combined with explicit hot symbol ordering.
-												Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added

## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)

Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.

Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).

Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).

Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)

ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)

Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.

---

## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed

Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate allocator logic差 vs binary layout penalty.

Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)

Result: **Case B confirmed** — Allocator差 negligible, layout penalty dominates.

Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem

Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)

Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.

Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.

Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)

Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)

---

## Phase 18: Hot Text Isolation — Design Added

Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).

Strategy (v1 → v2 progression):

v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)

v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (本命)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement

Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)

Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)

Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-15 05:25:47 +09:00
 								---
 								## 2. A/B Procedure (required)
 								### 2.1 Baseline build (OFF)
 								```sh
 								make clean
 								make -j bench_random_mixed_hakmem bench_random_mixed_system
 								ls -lh bench_random_mixed_hakmem bench_random_mixed_system
 								scripts/run_mixed_10_cleanenv.sh
 								```
 								Perf stat (1 run, 200M iters):
 								```sh
 								perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
 								  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
 								  ./bench_random_mixed_hakmem 200000000 400 1
 								```
 								### 2.2 Optimized build (ON)
 								```sh
 								make clean
 								make -j HOT_TEXT_ISOLATION=1 bench_random_mixed_hakmem bench_random_mixed_system
 								ls -lh bench_random_mixed_hakmem bench_random_mixed_system
 								scripts/run_mixed_10_cleanenv.sh
 								```
 								Perf stat (same command):
 								```sh
 								perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
 								  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
 								  ./bench_random_mixed_hakmem 200000000 400 1
 								```
 								### 2.3 System ceiling check (optional)
 								```sh
 								./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput" || true
 								```
 								---
 								## 3. GO/NO-GO Decision
 								- **GO**: Mixed 10-run mean **+2%** 以上 and no health regressions
 								- **NEUTRAL**: within ±2% → keep as research box, iterate once (more cold isolation or better clustering)
 								- **NO-GO**: **-2%** or worse → rollback and freeze
 								Health profiles:
 								```sh
 								scripts/verify_health_profiles.sh
 								```
 								---
 								## 4. Reporting (required artifacts)
 								Create:
 								- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md`
 								  - throughput A/B (10-run)
 								  - binary sizes
 								  - perf stat table (cycles/instructions/I-cache)
 								  - conclusion (GO/NEUTRAL/NO-GO)
 								Update:
 								- `CURRENT_TASK.md` (Phase 18 status + next)
 								---
 								## 5. Notes / guardrails
 								- This phase intentionally compares **different binaries** (layout is the target), but keep the environment clean (`env -i`, fixed profile, same machine).
 								- Avoid “delete code” experiments; only isolate/cold/cluster.
 								- Keep “cold” truly cold: no allocations, no logging, no TLS-heavy helpers.
 								---
 								## 6. If v1 is NEUTRAL: Phase 18 v2（BENCH_MINIMAL）へ即進む
 								Phase 17 の “instructions 2x” を直接削るには、layout だけでなく **hot path に混ざっている ENV/stats/debug の固定費を compile-out** する必要がある可能性が高い。
 								次の一手（bench 専用 binary / rollback 可能）:
 								- `HAKMEM_BENCH_MINIMAL=1`（Makefile knob）で:
 								  - FastLane / wrapper の “常用ON 経路” を固定し、ENV gate を compile-time 定数化
 								  - hot counters を完全 compile-out
 								  - 観測は `perf stat` のみ（常時ログ禁止）
 								期待: +10–20%（もし本当に instruction footprint が支配ならここで大きく動く）