hakmem/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md
Moe Charm (CI) b1912d6587 Phase 18 v1: Hot Text Isolation — NO-GO (I-cache regression)
## Summary

Phase 18 v1 attempted layout optimization using section splitting + GC:
- `-ffunction-sections -fdata-sections -Wl,--gc-sections`

Result: **Catastrophic I-cache regression**
- Throughput: -0.87% (48.94M → 48.52M ops/s)
- I-cache misses: +91.06% (131K → 250K)
- Variance: +80% (σ=0.45M → σ=0.81M)

Root cause: Section-based splitting without explicit hot symbol ordering
fragments code locality, destroying natural compiler/LTO layout.

## Build Knob Safety

Makefile updated to separate concerns:
- `HOT_TEXT_ISOLATION=1` → attributes only (safe, but no perf gain)
- `HOT_TEXT_GC_SECTIONS=1` → section splitting (currently NO-GO)

Both kept as research boxes (default OFF).

## Verdict

Freeze Phase 18 v1:
- Do NOT use section-based linking without strong ordering strategy
- Keep hot/cold attributes as placeholder (currently unused)
- Proceed to Phase 18 v2: BENCH_MINIMAL compile-out

Expected impact v2: +10-20% via instruction count reduction
- GO threshold: +5% minimum, +8% preferred
- Only continue if instructions clearly drop

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md

Modified:
- Makefile (build knob safety isolation)
- CURRENT_TASK.md (Phase 18 v1 verdict)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md

## Lessons

1. Layout optimization is extremely fragile without ordering guarantees
2. I-cache behavior is a first-order performance factor (IPC=2.30, treated here as memory-bound)
3. Compiler defaults may be better than manual section splitting
4. Next frontier: instruction count reduction (stats/ENV removal)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-15 05:53:58 +09:00


Phase 18: Hot Text Isolation v1 — Next Instructions

Status

  • Phase 17 confirms Case B: allocator logic delta is negligible; gap is layout/I-cache.
  • Next: reduce instruction footprint + improve I-cache locality via Hot Text Isolation.

Refs:

  • Phase 17 results: docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md
  • Phase 18 design: docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md

0. Goal / Success Criteria

Primary (v1 is expected to be low-risk with a modest effect):

  • Mixed (16-1024B) throughput +2% or more for GO (the realistic bar for layout work)

Secondary (must move in the right direction):

  • I-cache misses reduced (target: -10% or more)
  • Total instructions reduced (target: -5% or more)

If throughput is NEUTRAL but counters improve significantly, keep as research box and iterate once.
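
The secondary targets above can be checked mechanically once the counter values are in hand. The following is a minimal sketch, not the project's tooling: the function names `hak_pct_change` and `hak_counters_improved` are hypothetical, and the inputs are assumed to be raw counter totals from a baseline run and an optimized run. Note that the perf stat commands in section 2 do not list an I-cache event; on many Linux systems one such as `L1-icache-load-misses` would have to be added to obtain that counter.

```c
#include <stdbool.h>

/* Relative change of `opt` versus `base`, in percent (negative = reduction). */
static double hak_pct_change(double base, double opt) {
    return (opt - base) / base * 100.0;
}

/* Secondary criteria from this section (hypothetical helper):
 * I-cache misses down by at least 10% AND total instructions down by at least 5%. */
static bool hak_counters_improved(double icache_base, double icache_opt,
                                  double instr_base, double instr_opt) {
    return hak_pct_change(icache_base, icache_opt) <= -10.0 &&
           hak_pct_change(instr_base, instr_opt) <= -5.0;
}
```

With the rounded v1 figures from the commit summary (I-cache misses 131K → 250K), the first condition fails by a wide margin regardless of the instruction count.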


1. Patch Plan (small, reversible)

Patch 1: Hot/Cold attribute SSOT (L0 Box)

Add:

  • core/box/hot_text_attrs_box.h

Defines:

  • HAK_HOT_FN, HAK_COLD_FN (no-op when HAKMEM_HOT_TEXT_ISOLATION=0)

Usage:

  • annotate only a short, high-impact list first:
    • wrappers: malloc/free/calloc/realloc
    • FastLane entry helpers (if non-inline)
    • cold helpers: malloc_cold/free_cold, wrapper diagnostics

Rollback: build knob off.
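
As an illustration of Patch 1, a header along these lines would give the two macros a single definition point. This is a sketch under assumptions, not the actual contents of core/box/hot_text_attrs_box.h: the include guard name, the choice of `noinline` on the cold side, and the `demo_*` functions are all hypothetical. It relies on the GCC/Clang `hot` and `cold` function attributes, which on many targets group such functions into separate text subsections.

```c
/* Hypothetical sketch of a hot/cold attribute header (names are assumptions). */
#ifndef HAK_HOT_TEXT_ATTRS_BOX_H
#define HAK_HOT_TEXT_ATTRS_BOX_H

#ifndef HAKMEM_HOT_TEXT_ISOLATION
#define HAKMEM_HOT_TEXT_ISOLATION 0   /* default OFF: macros expand to nothing */
#endif

#if HAKMEM_HOT_TEXT_ISOLATION && (defined(__GNUC__) || defined(__clang__))
#define HAK_HOT_FN  __attribute__((hot))
#define HAK_COLD_FN __attribute__((cold, noinline))
#else
#define HAK_HOT_FN
#define HAK_COLD_FN
#endif

#endif /* HAK_HOT_TEXT_ATTRS_BOX_H */

#include <stddef.h>

/* Hypothetical cold helper: rounds a size up to a multiple of 16. */
HAK_COLD_FN static size_t demo_round_up_cold(size_t n) {
    return (n + 15u) & ~(size_t)15u;
}

/* Hypothetical hot entry: small sizes stay inline, the rest go to the cold helper. */
HAK_HOT_FN static size_t demo_size_class(size_t n) {
    if (n != 0 && n <= 16) return 16;
    return demo_round_up_cold(n);
}
```

Because both macros expand to nothing when the knob is 0, the same source builds identically to the baseline, which is what makes "build knob off" a complete rollback.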

Patch 2: Wrapper TU split (L1 Box boundary)

Move wrapper definitions out of core/hakmem.c:

  • new: core/hak_wrappers_box.c
    • #include "box/hak_wrappers.inc.h"
  • remove wrapper include from core/hakmem.c

Rationale:

  • Prevents wrapper text from being interleaved with unrelated code in one TU.
  • Sets up link-order clustering.

Rollback: restore include in core/hakmem.c and drop new TU.

Patch 3 (optional): bench-only section GC

Makefile knob:

  • HOT_TEXT_GC_SECTIONS=0/1 (research-only)

When =1, add for bench builds:

  • -ffunction-sections -fdata-sections
  • -Wl,--gc-sections

Notes:

  • Keep it bench-only first (do not touch shared lib build until proven stable).
  • Phase 18 v1 outcome: This exact flag set caused an I-cache regression in this repo/toolchain.
    • Ref: docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md
    • Therefore, Patch 3 is NO-GO for now unless combined with explicit hot symbol ordering.

2. A/B Procedure (required)

2.1 Baseline build (OFF)

make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh

Perf stat (1 run, 200M iters):

perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 200000000 400 1

2.2 Optimized build (ON)

make clean
make -j HOT_TEXT_ISOLATION=1 bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh

Perf stat (same command):

perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 200000000 400 1

2.3 System ceiling check (optional)

./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput" || true

3. GO/NO-GO Decision

  • GO: Mixed 10-run mean +2% or more and no health regressions
  • NEUTRAL: within ±2% → keep as research box, iterate once (more cold isolation or better clustering)
  • NO-GO: -2% or worse → rollback and freeze
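
The decision rule above can be written down as a small function. This is a sketch only: the type and function names are hypothetical, and treating a failed health check as NO-GO is an assumption, since the rule states only that GO requires no health regressions.

```c
typedef enum { HAK_GO, HAK_NEUTRAL, HAK_NO_GO } hak_verdict;

/* Arithmetic mean of n throughput samples (e.g. the 10-run results). */
static double hak_mean(const double *v, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += v[i];
    return n > 0 ? sum / n : 0.0;
}

/* Relative change of the optimized mean versus the baseline mean, in percent. */
static double hak_delta_pct(double base_mean, double opt_mean) {
    return (opt_mean - base_mean) / base_mean * 100.0;
}

/* Thresholds from this section: GO at +2% or more, NO-GO at -2% or worse,
 * NEUTRAL in between. Assumption: a health regression is treated as NO-GO. */
static hak_verdict hak_decide(double delta_pct, int health_ok) {
    if (!health_ok || delta_pct <= -2.0) return HAK_NO_GO;
    if (delta_pct >= 2.0) return HAK_GO;
    return HAK_NEUTRAL;
}
```

Applied to the rounded v1 means from the commit summary (48.94M → 48.52M ops/s, about -0.86%), the throughput rule alone lands in the NEUTRAL band, which suggests the recorded NO-GO verdict also weighed the I-cache miss and variance regressions.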

Health profiles:

scripts/verify_health_profiles.sh

4. Reporting (required artifacts)

Create:

  • docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md
    • throughput A/B (10-run)
    • binary sizes
    • perf stat table (cycles/instructions/I-cache)
    • conclusion (GO/NEUTRAL/NO-GO)

Update:

  • CURRENT_TASK.md (Phase 18 status + next)

5. Notes / guardrails

  • This phase intentionally compares different binaries (layout is the target), but keep the environment clean (env -i, fixed profile, same machine).
  • Avoid “delete code” experiments; only isolate/cold/cluster.
  • Keep “cold” truly cold: no allocations, no logging, no TLS-heavy helpers.

6. If v1 is NEUTRAL: proceed immediately to Phase 18 v2 (BENCH_MINIMAL)

To cut the "2x instructions" gap from Phase 17 directly, layout changes alone are probably not enough; the fixed cost of ENV/stats/debug code mixed into the hot path most likely has to be compiled out.

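Next step (bench-only binary, rollback possible):

  • With HAKMEM_BENCH_MINIMAL=1 (Makefile knob):
    • Pin the always-on FastLane / wrapper paths and turn ENV gates into compile-time constants
    • Compile out hot counters entirely
    • Observe with perf stat only (no always-on logging)

Expected: +10-20% (if instruction footprint really is the dominant factor, this is where throughput should move substantially)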
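
A compile-out of this kind could look roughly like the sketch below. Only `HAKMEM_BENCH_MINIMAL` comes from this document; the environment variable `HAKMEM_FASTLANE_DEMO`, the macros `HAK_FASTLANE_ENABLED` and `HAK_STAT_INC`, and the counter and function names are hypothetical stand-ins for whatever gates and counters the real hot path uses.

```c
#include <stdlib.h>
#include <string.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

#if HAKMEM_BENCH_MINIMAL
/* Bench-only build: the gate is a compile-time constant and counters vanish,
 * so the compiler can drop the slow branch and the increment entirely. */
#define HAK_FASTLANE_ENABLED() 1
#define HAK_STAT_INC(counter)  ((void)0)
#else
/* Default build: runtime ENV gate (cached after first lookup) plus a counter. */
static unsigned long hak_stat_fast_hits;

static int hak_fastlane_enabled_env(void) {
    static int cached = -1;
    if (cached < 0) {
        const char *v = getenv("HAKMEM_FASTLANE_DEMO"); /* hypothetical variable */
        cached = (v == NULL || strcmp(v, "0") != 0);    /* ON unless set to "0" */
    }
    return cached;
}
#define HAK_FASTLANE_ENABLED() hak_fastlane_enabled_env()
#define HAK_STAT_INC(counter)  ((counter)++)
#endif

/* Hypothetical hot-path stub: returns 1 for the fast path, 0 for the slow path. */
static int demo_alloc_path(void) {
    if (HAK_FASTLANE_ENABLED()) {
        HAK_STAT_INC(hak_stat_fast_hits);
        return 1;
    }
    return 0;
}
```

Keeping both variants behind one macro pair means the call sites do not change between builds, so turning the knob off restores the runtime-gated behavior.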