Moe Charm (CI) b7085c47e1 Phase 35-39: FAST build optimization complete (+7.13% cumulative)

Phase 35-A: BENCH_MINIMAL gate function elimination (GO +4.39%)
- tiny_front_v3_enabled() → constant true
- tiny_metadata_cache_enabled() → constant 0
- learner_v7_enabled() → constant false
- small_learner_v2_enabled() → constant false

Phase 36: Policy snapshot init-once (GO +0.71%)
- small_policy_v7_snapshot() version check skip in BENCH_MINIMAL
- TLS cache for policy snapshot

Phase 37: Standard TLS cache (NO-GO -0.07%)
- TLS cache for Standard build attempted
- Runtime gate overhead negates benefit

Phase 38: FAST/OBSERVE/Standard workflow established
- make perf_fast, make perf_observe targets
- Scorecard and documentation updates

Phase 39: Hot path gate constantization (GO +1.98%)
- front_gate_unified_enabled() → constant 1
- alloc_dualhot_enabled() → constant 0
- g_bench_fast_front, g_v3_enabled blocks → compile-out
- free_dispatch_stats_enabled() → constant false

Results:
- FAST v3: 56.04M ops/s (47.4% of mimalloc)
- Standard: 53.50M ops/s (45.3% of mimalloc)
- M1 target (50%): 5.5% remaining

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-16 15:01:56 +09:00

Phase 39: FAST v3 Gate Function Constantization — Results

Summary

Result: GO (+1.98%)

Constantizing the gate functions in Phase 39 improved FAST build throughput by +1.98%.

A/B Test Results (10-run official measurement)

Baseline (FAST v2 without Phase 39)

Mean: 54.95M ops/s

Treatment (FAST v3 with Phase 39)

Mean: 56.04M ops/s

Delta

+1.98% (well above the GO threshold of +0.5%)

Measurement conditions:

make perf_fast (10-run clean env)
ITERS=20000000 WS=400
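
The delta and the GO/NO-GO verdict follow directly from the two means. A minimal sketch of that arithmetic, assuming the verdict is a strict comparison against the +0.5% threshold; the helper names `delta_percent` and `is_go` are hypothetical and not part of the repository:

```c
#include <stdbool.h>

/* Relative throughput change of treatment over baseline, in percent.
 * Inputs are mean throughputs in M ops/s. */
static double delta_percent(double baseline_mops, double treatment_mops) {
    return (treatment_mops - baseline_mops) / baseline_mops * 100.0;
}

/* GO when the improvement strictly exceeds the threshold.
 * Assumption: the report's rule is "delta > +0.5%". */
static bool is_go(double delta_pct, double threshold_pct) {
    return delta_pct > threshold_pct;
}
```

With the means reported above, `delta_percent(54.95, 56.04)` evaluates to roughly 1.98, which `is_go(..., 0.5)` accepts.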

Changes Made

A) malloc hot path (core/front/malloc_tiny_fast.h)

front_gate_unified_enabled() → fixed to 1 under BENCH_MINIMAL
alloc_dualhot_enabled() → fixed to 0 under BENCH_MINIMAL
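
The general shape of this change can be sketched as follows. This is an illustration of the pattern only, not the project's actual code: `example_gate_enabled` and the `EXAMPLE_GATE` environment variable are hypothetical, and the lazy-init variant is an assumption about how such a gate typically looks.

```c
#include <stdlib.h>

#if defined(BENCH_MINIMAL)
/* Constantized gate: the compiler can fold the result into callers
 * and drop the dead branch entirely. */
static inline int example_gate_enabled(void) {
    return 1;
}
#else
/* Lazy-init gate: reads a (hypothetical) environment variable once,
 * then pays a predicted branch on every later call. */
static inline int example_gate_enabled(void) {
    static int cached = -1;                      /* -1 = not initialized yet */
    if (__builtin_expect(cached < 0, 0)) {
        const char *e = getenv("EXAMPLE_GATE");  /* hypothetical variable */
        cached = (e == NULL || e[0] != '0') ? 1 : 0;  /* default: enabled */
    }
    return cached;
}
#endif
```

Callers stay unchanged in both builds; only the gate definition differs, so the constant version needs no edits at the call sites.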

B) free dispatcher (core/box/hak_free_api.inc.h)

g_bench_fast_front block → compiled out under BENCH_MINIMAL
g_v3_enabled block → compiled out under BENCH_MINIMAL
g_free_dispatch_ssot → deferred (lazy-init retained)
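
Compiling out a flag-guarded block, as opposed to constantizing a function, might look like the sketch below. The dispatcher, the flag `g_example_fast_front`, and the return codes are all hypothetical stand-ins; the real blocks in hak_free_api.inc.h are not reproduced here.

```c
#include <stddef.h>

#if !defined(BENCH_MINIMAL)
/* Hypothetical runtime flag; only exists in non-minimal builds. */
static int g_example_fast_front = 0;
#endif

/* Returns a code naming the path taken:
 * 0 = NULL pointer, 1 = flag-guarded path, 2 = standard path. */
static int example_free_dispatch(void *p) {
    if (p == NULL) {
        return 0;
    }
#if !defined(BENCH_MINIMAL)
    /* Under BENCH_MINIMAL this whole block, including the load of the
     * flag, disappears from the hot path. */
    if (g_example_fast_front) {
        return 1;
    }
#endif
    return 2;
}
```

Guarding the flag's definition with the same condition keeps the minimal build free of unused-variable warnings.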

C) stats gate (core/box/free_dispatch_stats_box.h)

free_dispatch_stats_enabled() → fixed to false under BENCH_MINIMAL

Analysis

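The 10-run official measurement confirmed that compiling out the lazy-init gate functions yields a +1.98% performance improvement.

Contributing factors:

Branch elimination: prediction hinted via __builtin_expect is efficient, but removing the branch altogether is more effective still
I-cache pressure: removing the lazy-init code paths shrinks the I-cache footprint
Compiler optimization: constant gates allow further optimization at the call sites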

Recommendation

Verdict: GO (+1.98% > +0.5%)

Adopt all Phase 39 changes and finalize them as FAST v3.

Files Modified

core/front/malloc_tiny_fast.h
core/box/hak_free_api.inc.h
core/box/free_dispatch_stats_box.h
