## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)
Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.
Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).
Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).
Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)
ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)
Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.
---
## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed
Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate the allocator logic delta from the binary layout penalty.
Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)
Result: **Case B confirmed**: allocator delta is negligible; layout penalty dominates.
Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem
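The percentages above are plain relative changes between throughputs. A minimal sketch of that arithmetic follows, using the rounded M ops/s figures from the list; the helper names are illustrative and not part of the repository. Recomputing from the rounded inputs gives roughly +74.25% for the last gap, so the reported +74.26% presumably comes from unrounded throughputs.

```c
#include <stdio.h>

/* Relative change of `value` versus `baseline`, in percent. */
static double pct_change(double value, double baseline) {
    return (value / baseline - 1.0) * 100.0;
}

/* Prints the three gaps using the rounded throughputs (M ops/s) listed above. */
static void print_gap_breakdown(void) {
    const double hakmem = 48.12;      /* FORCE_LIBC=0 */
    const double libc_same = 48.31;   /* FORCE_LIBC=1, same binary */
    const double system_bin = 83.85;  /* separate 21K binary */

    printf("libc vs hakmem:   %+.2f%%\n", pct_change(libc_same, hakmem));
    printf("system vs libc:   %+.2f%%\n", pct_change(system_bin, libc_same));
    printf("system vs hakmem: %+.2f%%\n", pct_change(system_bin, hakmem));
}
```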
Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)
Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.
Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.
Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)
Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)
---
## Phase 18: Hot Text Isolation — Design Added
Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).
Strategy (v1 → v2 progression):
v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)
v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (the main bet)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement
Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)
Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)
Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Phase 18: Hot Text Isolation v1 — Next Instructions
Status
- Phase 17 confirms Case B: allocator logic delta is negligible; gap is layout/I-cache.
- Next: reduce instruction footprint + improve I-cache locality via Hot Text Isolation.
Refs:
- Phase 17 results: docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md
- Phase 18 design: docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md
0. Goal / Success Criteria
Primary (v1 is expected to be low-risk with a small effect):
- GO if Mixed (16–1024B) throughput improves by +2% or more (a realistic bar for layout work)
Secondary (must move in the right direction):
- I-cache misses reduced (guideline: -10% or better)
- Total instructions reduced (guideline: -5% or better)
If throughput is NEUTRAL but counters improve significantly, keep as research box and iterate once.
1. Patch Plan (small, reversible)
Patch 1: Hot/Cold attribute SSOT (L0 Box)
Add:
- core/box/hot_text_attrs_box.h
Defines:
- HAK_HOT_FN, HAK_COLD_FN (no-op when HAKMEM_HOT_TEXT_ISOLATION=0)
Usage:
- annotate only a short, high-impact list first:
  - wrappers: malloc/free/calloc/realloc
  - FastLane entry helpers (if non-inline)
  - cold helpers: malloc_cold/free_cold, wrapper diagnostics
Rollback: build knob off.
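A minimal sketch of what the Patch 1 header could look like, assuming a GCC/Clang toolchain. The macro names and the HAKMEM_HOT_TEXT_ISOLATION knob come from the plan above; the exact attribute choice and the demo_* functions (including the size-to-class mapping) are assumptions for illustration, not hakmem's actual code.

```c
#include <stddef.h>

/* Assumed contents of core/box/hot_text_attrs_box.h (sketch only). */
#ifndef HAKMEM_HOT_TEXT_ISOLATION
#define HAKMEM_HOT_TEXT_ISOLATION 0 /* default OFF: both macros expand to nothing */
#endif

#if HAKMEM_HOT_TEXT_ISOLATION && (defined(__GNUC__) || defined(__clang__))
#define HAK_HOT_FN  __attribute__((hot))
#define HAK_COLD_FN __attribute__((cold, noinline))
#else
#define HAK_HOT_FN
#define HAK_COLD_FN
#endif

/* Hypothetical cold helper: rare out-of-range path, no allocation or logging. */
HAK_COLD_FN static int demo_size_class_cold(size_t size) {
    (void)size;
    return -1;
}

/* Hypothetical hot helper: maps 1..1024 bytes to power-of-two classes 0..7. */
HAK_HOT_FN static int demo_size_class(size_t size) {
    if (size == 0 || size > 1024) {
        return demo_size_class_cold(size);
    }
    int cls = 0;
    size_t cap = 8;
    while (cap < size) {
        cap <<= 1;
        cls++;
    }
    return cls;
}
```

With the knob at 0 the annotations vanish, so the same source compiles identically to the baseline, which is what makes the rollback a pure build-flag change.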
Patch 2: Wrapper TU split (L1 Box boundary)
Move wrapper definitions out of core/hakmem.c:
- new: core/hak_wrappers_box.c
  - #include "box/hak_wrappers.inc.h"
- remove wrapper include from core/hakmem.c
Rationale:
- Prevents wrapper text from being interleaved with unrelated code in one TU.
- Sets up link-order clustering.
Rollback: restore include in core/hakmem.c and drop new TU.
Patch 3 (optional): bench-only section GC
Makefile knob:
- HOT_TEXT_ISOLATION=0/1
When =1, add for bench builds:
- -DHAKMEM_HOT_TEXT_ISOLATION=1
- -ffunction-sections -fdata-sections
- LDFLAGS += -Wl,--gc-sections
Notes:
- Keep it bench-only first (do not touch shared lib build until proven stable).
- If toolchain rejects --gc-sections or results are unstable → skip this patch.
2. A/B Procedure (required)
2.1 Baseline build (OFF)
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
Perf stat (1 run, 200M iters):
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 200000000 400 1
2.2 Optimized build (ON)
make clean
make -j HOT_TEXT_ISOLATION=1 bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
Perf stat (same command):
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 200000000 400 1
2.3 System ceiling check (optional)
./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput" || true
3. GO/NO-GO Decision
- GO: Mixed 10-run mean +2% or better and no health regressions
- NEUTRAL: within ±2% → keep as research box, iterate once (more cold isolation or better clustering)
- NO-GO: -2% or worse → rollback and freeze
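The decision rule above can be sketched as a small classifier over the two 10-run samples. The function and type names are hypothetical. Treating a health regression as NO-GO regardless of throughput is an assumption, since the rule only says GO requires no health regressions.

```c
#include <stddef.h>

typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_NO_GO } verdict_t;

/* Arithmetic mean of n samples; returns 0.0 for an empty sample. */
static double mean(const double *xs, size_t n) {
    if (n == 0) {
        return 0.0;
    }
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += xs[i];
    }
    return sum / (double)n;
}

/* Classifies an A/B result from per-run throughputs (baseline vs optimized). */
static verdict_t classify_ab(const double *base, const double *opt, size_t n,
                             int health_ok) {
    double base_mean = mean(base, n);
    if (base_mean <= 0.0) {
        return VERDICT_NEUTRAL; /* no usable baseline: make no claim */
    }
    double delta_pct = (mean(opt, n) / base_mean - 1.0) * 100.0;

    if (!health_ok || delta_pct <= -2.0) {
        return VERDICT_NO_GO;   /* assumption: a health regression always blocks */
    }
    if (delta_pct >= 2.0) {
        return VERDICT_GO;
    }
    return VERDICT_NEUTRAL;     /* within +/-2%: keep as research box */
}
```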
Health profiles:
scripts/verify_health_profiles.sh
4. Reporting (required artifacts)
Create:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md
  - throughput A/B (10-run)
  - binary sizes
  - perf stat table (cycles/instructions/I-cache)
  - conclusion (GO/NEUTRAL/NO-GO)
Update:
- CURRENT_TASK.md (Phase 18 status + next)
5. Notes / guardrails
- This phase intentionally compares different binaries (layout is the target), but keep the environment clean (env -i, fixed profile, same machine).
- Avoid “delete code” experiments; only isolate/cold/cluster.
- Keep “cold” truly cold: no allocations, no logging, no TLS-heavy helpers.
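6. If v1 is NEUTRAL: proceed immediately to Phase 18 v2 (BENCH_MINIMAL)
To directly cut the “2x instructions” seen in Phase 17, layout changes alone are probably not enough; the fixed costs of ENV/stats/debug code mixed into the hot path likely need to be compiled out.
Next move (bench-only binary, can be rolled back):
- With HAKMEM_BENCH_MINIMAL=1 (Makefile knob):
  - pin FastLane / wrapper to the “always-ON” path and turn ENV gates into compile-time constants
  - compile out hot counters entirely
  - observe via perf stat only (no always-on logging)
Expected: +10–20% (if instruction footprint really dominates, this is where the numbers should move significantly)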
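The compile-out idea can be sketched as follows. Only the HAKMEM_BENCH_MINIMAL knob name comes from the plan; the DEMO_* macros, the demo_* functions, and the HAKMEM_DEMO_FASTLANE variable are hypothetical stand-ins, and the lazy ENV cache is simplified (not thread-safe).

```c
#include <stdlib.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0 /* default: runtime ENV gate and counters stay in */
#endif

#if HAKMEM_BENCH_MINIMAL
/* Gate pinned ON at compile time: the branch folds away, no getenv() anywhere. */
#define DEMO_FASTLANE_ENABLED() 1
#define DEMO_STAT_INC(counter) ((void)0)
#else
/* Runtime gate: reads the (hypothetical) variable once and caches the result. */
static int demo_fastlane_enabled_env(void) {
    static int cached = -1; /* -1 means "not read yet" */
    if (cached < 0) {
        const char *e = getenv("HAKMEM_DEMO_FASTLANE");
        cached = (e == NULL || e[0] != '0'); /* unset or non-"0" means ON */
    }
    return cached;
}
#define DEMO_FASTLANE_ENABLED() demo_fastlane_enabled_env()
#define DEMO_STAT_INC(counter) ((counter)++)
#endif

/* Hot counter; never touched when HAKMEM_BENCH_MINIMAL=1. */
static unsigned long demo_alloc_calls;

/* Returns 1 when the FastLane route is taken, 0 for the fallback route. */
static int demo_alloc_route(void) {
    DEMO_STAT_INC(demo_alloc_calls);
    return DEMO_FASTLANE_ENABLED() ? 1 : 0;
}
```

Building the same file with -DHAKMEM_BENCH_MINIMAL=1 removes both the counter increment and the gate check from the hot function, which is the kind of fixed cost this step aims to eliminate.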