Phase 35-A: BENCH_MINIMAL gate function elimination (GO +4.39%)
- tiny_front_v3_enabled() → constant true
- tiny_metadata_cache_enabled() → constant 0
- learner_v7_enabled() → constant false
- small_learner_v2_enabled() → constant false

Phase 36: Policy snapshot init-once (GO +0.71%)
- small_policy_v7_snapshot() version check skip in BENCH_MINIMAL
- TLS cache for policy snapshot

Phase 37: Standard TLS cache (NO-GO -0.07%)
- TLS cache for Standard build attempted
- Runtime gate overhead negates benefit

Phase 38: FAST/OBSERVE/Standard workflow established
- make perf_fast, make perf_observe targets
- Scorecard and documentation updates

Phase 39: Hot path gate constantization (GO +1.98%)
- front_gate_unified_enabled() → constant 1
- alloc_dualhot_enabled() → constant 0
- g_bench_fast_front, g_v3_enabled blocks → compile-out
- free_dispatch_stats_enabled() → constant false

Results:
- FAST v3: 56.04M ops/s (47.4% of mimalloc)
- Standard: 53.50M ops/s (45.3% of mimalloc)
- M1 target (50%): 5.5% remaining

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 39: FAST v3 Gate Function Constantization — Results
Summary
Result: GO (+1.98%)
Constantizing the gate functions in Phase 39 improved FAST build throughput by +1.98%.
A/B Test Results (10-run official measurement)
Baseline (FAST v2 without Phase 39)
Mean: 54.95M ops/s
Treatment (FAST v3 with Phase 39)
Mean: 56.04M ops/s
Delta
- +1.98% (well above the GO threshold of +0.5%)
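The delta and the GO decision follow directly from the two means above. A minimal sketch of that arithmetic, assuming the delta is the relative change of the treatment mean over the baseline mean; the function names `delta_percent` and `is_go` are illustrative and do not come from the repository:

```c
#include <assert.h>
#include <stdbool.h>

/* Relative change of the treatment mean over the baseline mean, in percent.
 * Hypothetical helper; the name is not taken from the source tree. */
static double delta_percent(double baseline_mops, double treatment_mops) {
    assert(baseline_mops > 0.0); /* a zero baseline has no defined delta */
    return (treatment_mops - baseline_mops) / baseline_mops * 100.0;
}

/* GO when the improvement exceeds the threshold (+0.5% in this report). */
static bool is_go(double delta_pct, double threshold_pct) {
    return delta_pct > threshold_pct;
}
```

With the reported means, `delta_percent(54.95, 56.04)` is (56.04 - 54.95) / 54.95 × 100 ≈ 1.98, which clears the 0.5 threshold.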
Measurement conditions:
- make perf_fast (10-run clean env)
- ITERS=20000000 WS=400
Changes Made
A) malloc hot path (core/front/malloc_tiny_fast.h)
- front_gate_unified_enabled() → fixed to 1 under BENCH_MINIMAL
- alloc_dualhot_enabled() → fixed to 0 under BENCH_MINIMAL
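The pattern behind these two changes can be sketched as follows. This is an illustration under assumptions, not the code in core/front/malloc_tiny_fast.h: the function `example_gate_enabled`, the caller `example_alloc_path`, and the environment variable `HYPOTHETICAL_GATE_ENV` are invented for the example, and the lazy-init variant ignores thread safety for brevity. `__builtin_expect` is a GCC/Clang builtin.

```c
#include <stdlib.h>

#ifdef BENCH_MINIMAL
/* Benchmark build: the gate is a compile-time constant, so callers can
 * fold the branch away entirely. */
static inline int example_gate_enabled(void) {
    return 1;
}
#else
/* Default build: lazy-init gate. The first call reads the environment,
 * later calls pay for a load and a (well-predicted) branch. */
static inline int example_gate_enabled(void) {
    static int cached = -1; /* -1 means "not yet initialized" */
    if (__builtin_expect(cached < 0, 0)) {
        const char *e = getenv("HYPOTHETICAL_GATE_ENV"); /* assumed name */
        cached = (e == NULL || e[0] != '0') ? 1 : 0;     /* default: on */
    }
    return cached;
}
#endif

/* Hypothetical caller: with BENCH_MINIMAL defined, the compiler can drop
 * the else-path because the condition is known to be 1. */
int example_alloc_path(void) {
    if (example_gate_enabled()) {
        return 1; /* unified front path */
    }
    return 0;     /* legacy path */
}
```

Compiling with `-DBENCH_MINIMAL` selects the constant variant; without it, the lazy-init variant keeps the gate configurable at run time.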
B) free dispatcher (core/box/hak_free_api.inc.h)
- g_bench_fast_front block → compiled out under BENCH_MINIMAL
- g_v3_enabled block → compiled out under BENCH_MINIMAL
- g_free_dispatch_ssot → deferred (lazy-init retained)
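Compiling out a block differs from constantizing a gate: the guarded code is removed from the translation unit rather than left for the optimizer to prove dead. A rough sketch of the idea, assuming a runtime flag that guards an optional path in the free dispatcher; `example_fast_front_flag` and `example_free_dispatch` are hypothetical names, and the return codes exist only to make the taken path observable:

```c
#include <stddef.h>

#ifndef BENCH_MINIMAL
/* Hypothetical runtime flag; in a benchmark build it does not exist at all. */
static int example_fast_front_flag = 0;
#endif

/* Returns 0 for NULL, 1 for the optional fast-front path, 2 for the
 * default path. */
int example_free_dispatch(void *ptr) {
    if (ptr == NULL) {
        return 0;
    }
#ifndef BENCH_MINIMAL
    /* This whole block, including the load of the flag, is absent when
     * BENCH_MINIMAL is defined. */
    if (example_fast_front_flag) {
        return 1;
    }
#endif
    return 2;
}
```

In a build with `-DBENCH_MINIMAL`, the function reduces to the NULL check and the default path, which is the smaller hot path the results above are measuring.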
C) stats gate (core/box/free_dispatch_stats_box.h)
- free_dispatch_stats_enabled() → fixed to false under BENCH_MINIMAL
Analysis
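The 10-run official measurement confirmed that compiling out the lazy-init gate functions yields a +1.98% performance improvement.
Contributing factors:
- Branch elimination: prediction via __builtin_expect is efficient, but removing the branch itself is more effective still
- I-cache pressure: removing the lazy-init code paths shrinks the I-cache footprint
- Compiler optimization: with the gates constant, the compiler can apply further optimizations at the call sites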
Recommendation
Verdict: GO (+1.98% > +0.5%)
All Phase 39 changes are adopted and finalized as FAST v3.
Files Modified
- core/front/malloc_tiny_fast.h
- core/box/hak_free_api.inc.h
- core/box/free_dispatch_stats_box.h