hakmem/docs/analysis/PHASE6_10_CUMULATIVE_RESULTS.md
Commit 37bb3ee63f by Moe Charm (CI), 2025-12-14 20:48:34 +09:00: Phase 6-10: Cumulative Results & Strategic Analysis (+24.6%)

Phase 6-10: Cumulative Results & Strategic Analysis

Date: 2025-12-14
Status: COMPLETE - Major performance milestone achieved


Executive Summary

Phases 6-10 delivered +24.6% cumulative improvement on Mixed workload (16-1024B), representing the largest performance gain in hakmem's optimization history.

Cumulative Performance

| Metric | Phase 5 Baseline | Phase 10 Final | Improvement |
|---|---|---|---|
| Throughput (Mixed) | 43.04M ops/s | 53.62M ops/s | +24.6% 🎉 |
| Optimization Strategy | Per-phase incremental | Layer collapse + path consolidation | Structural |

Phase-by-Phase Breakdown

Phase 6-1: Front FastLane (Layer Collapse) — +11.13%

Target: fixed cost of the wrapper→gate→policy→route layer stack

Strategy: consolidate into a single entry point (FastLane)

  • front_fastlane_try_malloc(size) - handles size→class→route→handler in one place
  • front_fastlane_try_free(ptr) - header validation + direct call

Result: +11.13% (largest single improvement in hakmem's history)

Key Files:

  • core/box/front_fastlane_env_box.h
  • core/box/front_fastlane_box.h
  • core/box/hak_wrappers.inc.h (wrapper integration)

Lesson: Consolidation at wrapper level is the winning pattern (vs micro-optimizations in hot path)


Phase 6-2: Front FastLane Free DeDup — +5.18%

Target: duplicate header validation on the FastLane free side

Strategy: with DeDup ON, call free_tiny_fast() directly (skipping the duplicate validation)

Result: +5.18% (σ: 1.00% → 0.58%, variance -42%)

Key Files:

  • core/box/front_fastlane_env_box.h (DeDup gate added)
  • core/box/front_fastlane_box.h (direct call path)

Lesson: Eliminating redundant checks provides both performance and stability gains


Phase 7: Front FastLane Free Hot/Cold Alignment — NO-GO (-2.16%)

Target: route FastLane free through free_tiny_fast_hot()

Result: -2.16% (FROZEN)

Root Cause:

  • Hot/cold split added overhead to FastLane's lightweight path
  • TLS access + cold path branching
  • Reduced I-cache locality

Lesson: Same optimization has different effects in different contexts (wrapper vs FastLane)


Phase 8: FREE-STATIC-ROUTE ENV Cache Fix — +2.61%

Target: ENV cache accident that prevented the Phase 3 D1 optimization from taking effect

Strategy: add a refresh mechanism after bench_profile's putenv calls

Result: +2.61% (σ: 867K → 336K, variance -61%)

Key Files:

  • core/box/tiny_free_route_cache_env_box.{h,c} (refresh added)
  • core/bench_profile.h (sync added)

Lesson: ENV gate synchronization is critical for ensuring optimizations actually work


Phase 9: FREE-TINY-FAST MONO DUALHOT — +2.72%

Target: applying the lesson of Phase 7's failure, make C0-C3 direct via early-exit inside the monolithic function

Strategy: early-exit C0-C3 to LEGACY inside free_tiny_fast() (no function split)

Result: +2.72% (σ: 2.44M → 955K, variance -60.8%, 2.6x more stable)

Key Files:

  • core/box/free_tiny_fast_mono_dualhot_env_box.h
  • core/front/malloc_tiny_fast.h (early-exit added)

Lesson: Monolithic early-exit > function split (avoids call overhead, maintains I-cache locality)


Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT — +1.89%

Target: extend Phase 9 (C0-C3) to C4-C7

Strategy: LEGACY direct dispatch, with a cached nonlegacy_mask guarding against misrouting ULTRA/MID/V7 classes

Result: +1.89%

Key Files:

  • core/box/free_tiny_fast_mono_legacy_direct_env_box.h (nonlegacy_mask caching)
  • core/front/malloc_tiny_fast.h (C4-C7 early-exit)

Lesson: Cached bitmask enables safe fast-path with minimal overhead


Phase 11: ENV Snapshot "maybe-fast" API — NO-GO (-8.35%)

Target: reduce the fixed cost (~2-3%) of ENV snapshot lookups

Strategy: consolidate the enabled+snapshot+front_snap lookups into a single call via the hakmem_env_snapshot_maybe_fast() API

Result: -8.35% (FROZEN, design mistake)

Root Cause:

  • Calling maybe_fast() inside inline hot-path functions made the ctor_mode check accumulate at every call site
  • Inhibited compiler optimization (unconditional call vs conditional branch)

Lesson: ENV gate optimization should target gate itself, not call sites


Technical Patterns (Across All Phases)

Winning Patterns

  1. Wrapper-level consolidation (Phase 6-1: +11.13%)

    • Collapse multiple layers into single entry point
    • Reduce total instruction count, improve I-cache locality
  2. Deduplication (Phase 6-2: +5.18%)

    • Eliminate redundant checks at layer boundaries
    • Both performance and stability benefits
  3. Monolithic early-exit (Phase 9: +2.72%, Phase 10: +1.89%)

    • Better than function split (avoids call overhead)
    • Maintains I-cache locality
  4. Cached bitmask (Phase 10: nonlegacy_mask)

    • Single bit operation for complex condition
    • Computed once at init, hot path reads only
  5. ENV gate synchronization (Phase 8: +2.61%)

    • Refresh mechanism ensures optimizations actually work
    • Critical for bench_profile integration

Anti-Patterns

  1. Function split for lightweight paths (Phase 7: -2.16%)

    • Hot/cold split adds overhead when path is already optimized
    • Context-dependent: works for wrapper, fails for FastLane
  2. Call-site API changes (Phase 11: -8.35%)

    • Moving helper calls into inline hot path accumulates overhead
    • Compiler optimization inhibited
  3. Branch hints without profiling (E3-4, E5-3c precedents)

    • Profile-dependent, often regresses

Perf Profile Analysis (after Phase 10)

Current Hotspots (Mixed, 30M iterations)

| Symbol | Self% | Classification | Next Action |
|---|---|---|---|
| front_fastlane_try_free | 33.88% | Consolidation point (expected) | Successfully consolidated |
| main | 26.21% | Benchmark overhead | N/A (not optimizable) |
| malloc | 23.26% | Alloc wrapper consolidation point | 🔍 Investigate further |
| tiny_header_finalize_alloc | 5.33% | Header write | E5-2 tried, NEUTRAL |
| tiny_c7_ultra_alloc | 3.75% | C7 ULTRA path | 🔍 Possible target |
| unified_cache_push | 1.61% | Cache push | Marginal ROI (~+1.0%) |
| hakmem_env_snapshot | 0.82% | ENV snapshot | Phase 11 failed |
| tiny_front_v3_snapshot_get | 0.66% | Front snapshot | Phase 11 failed |

Key Observations

  1. front_fastlane_try_free (33.88%) - This is EXPECTED and GOOD

    • Phase 6-1 successfully consolidated free path
    • High self% indicates consolidation worked (vs distributed overhead)
    • Not a bottleneck in itself, but a measurement concentration produced by the consolidation
  2. malloc (23.26%) - Alloc wrapper consolidation point

    • Similar to front_fastlane_try_free (includes FastLane alloc)
    • May indicate alloc side has room for optimization
    • Need deeper analysis to separate wrapper vs actual hot path
  3. tiny_header_finalize_alloc (5.33%) - Already optimized

    • Phase 5 E5-2 (Header Write-Once) was NEUTRAL (+0.45%)
    • Branch overhead ≈ savings
    • Further optimization unlikely to help
  4. Remaining hotspots < 2% - Marginal ROI

    • unified_cache_push (1.61%): E5-3b predicted ~+1.0%
    • tiny_c7_ultra_alloc (3.75%): C7-specific, narrow scope

Strategic Options (Next Steps)

Option A: Continue Micro-Optimizations (Marginal ROI)

Targets:

  • unified_cache_push (1.61%) - predicted +1.0% ROI
  • tiny_c7_ultra_alloc (3.75%) - C7-specific

Risk: Diminishing returns (each optimization < +1-2%)

Recommendation: Low priority unless low-hanging fruit identified


Option B: Alloc Side Deep Dive 🔍 (High Potential)

Observation: malloc (23.26%) suggests alloc side may have structural opportunities

Investigation Steps:

  1. Perf profile with --call-graph dwarf to separate wrapper vs hot path
  2. Identify if FastLane alloc has similar consolidation opportunities as FastLane free
  3. Look for alloc-side deduplication opportunities

Expected ROI: +5-10% (if similar to Phase 6 free-side gains)

Recommendation: HIGH PRIORITY - Most promising next direction


Option C: Declare Victory 🎉 (Strategic Pause)

Achievement: +24.6% cumulative (Phase 6-10)

Context:

  • Phase 5: ~+9-10% cumulative
  • Phase 6-10: +24.6% cumulative
  • Total (Phase 5-10): ~+30-35% cumulative

Rationale:

  • Reached major performance milestone
  • Remaining optimizations < +2% each (marginal)
  • Risk/reward ratio decreasing

Recommendation: NEUTRAL - Valid option depending on project goals


Cumulative Statistics

Performance Gains

| Phase Range | Cumulative Gain | Individual Phases |
|---|---|---|
| Phase 5 (E4-E5) | ~+9-10% | E4-1 (+3.51%), E4-2 (+21.83%), E5-1 (+3.35%) |
| Phase 6-10 | +24.6% | 6-1 (+11.13%), 6-2 (+5.18%), 8 (+2.61%), 9 (+2.72%), 10 (+1.89%) |
| Total (Phase 5-10) | ~+30-35% | Estimated compound effect |

Stability Improvements

Multiple phases delivered variance reduction:

  • Phase 6-2: σ -42% (1.00% → 0.58%)
  • Phase 8: σ -61% (867K → 336K)
  • Phase 9: σ -60.8% (2.44M → 955K, 2.6x more stable)

Insight: Consolidation/deduplication patterns provide BOTH performance AND stability


Lessons Learned

1. Consolidation > Micro-Optimization

Evidence: Phase 6-1 (+11.13%) > all micro-optimizations combined

Pattern: Wrapper-level structural changes yield largest gains

2. Context Matters

Evidence: Hot/cold split works in wrapper (Phase 1), fails in FastLane (Phase 7 -2.16%)

Pattern: Same optimization technique has different effects in different contexts

3. Monolithic Early-Exit > Function Split

Evidence: Phase 9 (+2.72%) succeeded where Phase 7 (-2.16%) failed

Pattern: Avoid function call overhead, maintain I-cache locality

4. Inline Hot Path = Danger Zone

Evidence: Phase 11 (-8.35%) - API call in inline function accumulated overhead

Pattern: Even 2-3 instructions are expensive at high call frequency

5. ENV Gate Optimization Should Target Gate Itself

Evidence: Phase 11 (-8.35%) - call-site changes inhibited compiler optimization

Pattern: Optimize the gate (caching, probe window), not the call sites

6. Variance Reduction = Hidden Value

Evidence: Multiple phases delivered 40-60% variance reduction

Pattern: Consolidation/deduplication improves both mean AND stability


Recommendations

Immediate Next Steps (Priority Order)

  1. HIGH PRIORITY: Alloc side deep dive (Option B)

    • Perf profile with call-graph to identify opportunities
    • Look for FastLane alloc equivalent optimizations
    • Expected ROI: +5-10%
  2. MEDIUM PRIORITY: Strategic pause (Option C)

    • Document Phase 6-10 achievements
    • Re-evaluate project goals and priorities
    • Consider if +24.6% is "good enough" milestone
  3. LOW PRIORITY: Micro-optimizations (Option A)

    • Only pursue if low-hanging fruit identified
    • Each remaining optimization < +2% ROI

Long-Term Strategic Questions

  1. What is the target performance?

    • mimalloc competitive? (need benchmark comparison)
    • Absolute threshold? (e.g., 60M ops/s on Mixed)
  2. What is the risk tolerance?

    • Each optimization has ~20-30% chance of NO-GO
    • Diminishing returns as we optimize further
  3. What is the maintenance cost?

    • Each new optimization adds complexity
    • ENV gates increase test matrix

Conclusion

Phase 6-10 delivered +24.6% cumulative improvement, representing hakmem's most successful optimization sequence. The winning patterns (consolidation, deduplication, monolithic early-exit) are now well-established and can guide future work.

Status: MAJOR MILESTONE ACHIEVED

Next: Recommend alloc side deep dive as highest-ROI next direction, or strategic pause to consolidate gains.


Implementation: Claude Code
Date: 2025-12-14
Phases: 6-10 (with Phase 11 NO-GO documented)
Commits: ea221d057, f301ee4df, dcc1d42e7, be723ca05, 871034da1, 71b1354d3, ad73ca554, 16c7bce2d