hakmem/docs/analysis/PHASE6_10_CUMULATIVE_RESULTS.md
Commit 37bb3ee63f by Moe Charm (CI), 2025-12-14 20:48:34 +09:00: Phase 6-10: Cumulative Results & Strategic Analysis (+24.6%)

Phase 6-10: Cumulative Results & Strategic Analysis

Date: 2025-12-14
Status: COMPLETE - Major performance milestone achieved


Executive Summary

Phases 6-10 delivered +24.6% cumulative improvement on Mixed workload (16-1024B), representing the largest performance gain in hakmem's optimization history.

Cumulative Performance

| Metric | Phase 5 Baseline | Phase 10 Final | Improvement |
|---|---|---|---|
| Throughput (Mixed) | 43.04M ops/s | 53.62M ops/s | +24.6% 🎉 |
| Optimization Strategy | Per-phase incremental | Layer collapse + path consolidation | Structural |

Phase-by-Phase Breakdown

Phase 6-1: Front FastLane (Layer Collapse) — +11.13%

Target: fixed cost of the wrapper→gate→policy→route layer stack

Strategy: consolidate into a single entry point (FastLane)

  • front_fastlane_try_malloc(size) - handles size→class→route→handler in one place
  • front_fastlane_try_free(ptr) - header validation + direct call

Result: +11.13% (largest single improvement in hakmem's history)

Key Files:

  • core/box/front_fastlane_env_box.h
  • core/box/front_fastlane_box.h
  • core/box/hak_wrappers.inc.h (wrapper integration)

Lesson: Consolidation at wrapper level is the winning pattern (vs micro-optimizations in hot path)


Phase 6-2: Front FastLane Free DeDup — +5.18%

Target: duplicate header validation on the FastLane free side

Strategy: with DeDup ON, call free_tiny_fast() directly (skipping the duplicate validation)

Result: +5.18% (σ: 1.00% → 0.58%, variance -42%)

Key Files:

  • core/box/front_fastlane_env_box.h (DeDup gate added)
  • core/box/front_fastlane_box.h (direct call path)

Lesson: Eliminating redundant checks provides both performance and stability gains


Phase 7: Front FastLane Free Hot/Cold Alignment — NO-GO (-2.16%)

Target: route FastLane free through free_tiny_fast_hot()

Result: -2.16% (FROZEN)

Root Cause:

  • Hot/cold split added overhead to FastLane's lightweight path
  • TLS access + cold path branching
  • Reduced I-cache locality

Lesson: Same optimization has different effects in different contexts (wrapper vs FastLane)


Phase 8: FREE-STATIC-ROUTE ENV Cache Fix — +2.61%

Target: ENV cache accident that prevented the Phase 3 D1 optimization from taking effect

Strategy: add a refresh mechanism after bench_profile's putenv calls

Result: +2.61% (σ: 867K → 336K, variance -61%)

Key Files:

  • core/box/tiny_free_route_cache_env_box.{h,c} (refresh added)
  • core/bench_profile.h (sync added)

Lesson: ENV gate synchronization is critical for ensuring optimizations actually work


Phase 9: FREE-TINY-FAST MONO DUALHOT — +2.72%

Target: applying the lesson of Phase 7's failure, make C0-C3 direct via early-exit inside the monolithic function

Strategy: early-exit C0-C3 to LEGACY inside free_tiny_fast() (no function split)

Result: +2.72% (σ: 2.44M → 955K, variance -60.8%, 2.6x more stable)

Key Files:

  • core/box/free_tiny_fast_mono_dualhot_env_box.h
  • core/front/malloc_tiny_fast.h (early-exit added)

Lesson: Monolithic early-exit > function split (avoids call overhead, maintains I-cache locality)


Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT — +1.89%

Target: extend Phase 9 (C0-C3) to C4-C7

Strategy: LEGACY direct dispatch, with a cached nonlegacy_mask guarding against misrouting ULTRA/MID/V7 classes

Result: +1.89%

Key Files:

  • core/box/free_tiny_fast_mono_legacy_direct_env_box.h (nonlegacy_mask caching)
  • core/front/malloc_tiny_fast.h (C4-C7 early-exit)

Lesson: Cached bitmask enables safe fast-path with minimal overhead


Phase 11: ENV Snapshot "maybe-fast" API — NO-GO (-8.35%)

Target: reduce the fixed cost (~2-3%) of ENV snapshot lookups

Strategy: consolidate the enabled+snapshot+front_snap lookups into a single call via the hakmem_env_snapshot_maybe_fast() API

Result: -8.35% (FROZEN, design mistake)

Root Cause:

  • Calling maybe_fast() inside inline hot-path functions made the ctor_mode check accumulate at every call site
  • Inhibited compiler optimization (unconditional call vs conditional branch)

Lesson: ENV gate optimization should target gate itself, not call sites


Technical Patterns (Across All Phases)

Winning Patterns

  1. Wrapper-level consolidation (Phase 6-1: +11.13%)

    • Collapse multiple layers into single entry point
    • Reduce total instruction count, improve I-cache locality
  2. Deduplication (Phase 6-2: +5.18%)

    • Eliminate redundant checks at layer boundaries
    • Both performance and stability benefits
  3. Monolithic early-exit (Phase 9: +2.72%, Phase 10: +1.89%)

    • Better than function split (avoids call overhead)
    • Maintains I-cache locality
  4. Cached bitmask (Phase 10: nonlegacy_mask)

    • Single bit operation for complex condition
    • Computed once at init, hot path reads only
  5. ENV gate synchronization (Phase 8: +2.61%)

    • Refresh mechanism ensures optimizations actually work
    • Critical for bench_profile integration

Anti-Patterns

  1. Function split for lightweight paths (Phase 7: -2.16%)

    • Hot/cold split adds overhead when path is already optimized
    • Context-dependent: works for wrapper, fails for FastLane
  2. Call-site API changes (Phase 11: -8.35%)

    • Moving helper calls into inline hot path accumulates overhead
    • Compiler optimization inhibited
  3. Branch hints without profiling (E3-4, E5-3c precedents)

    • Profile-dependent, often regresses

Perf Profile Analysis (after Phase 10)

Current Hotspots (Mixed, 30M iterations)

| Symbol | Self% | Classification | Next Action |
|---|---|---|---|
| front_fastlane_try_free | 33.88% | Consolidation point (expected) | Successfully consolidated |
| main | 26.21% | Benchmark overhead | N/A (not optimizable) |
| malloc | 23.26% | Alloc wrapper consolidation point | 🔍 Investigate further |
| tiny_header_finalize_alloc | 5.33% | Header write | E5-2 tried, NEUTRAL |
| tiny_c7_ultra_alloc | 3.75% | C7 ULTRA path | 🔍 Possible target |
| unified_cache_push | 1.61% | Cache push | Marginal ROI (~+1.0%) |
| hakmem_env_snapshot | 0.82% | ENV snapshot | Phase 11 failed |
| tiny_front_v3_snapshot_get | 0.66% | Front snapshot | Phase 11 failed |

Key Observations

  1. front_fastlane_try_free (33.88%) - This is EXPECTED and GOOD

    • Phase 6-1 successfully consolidated free path
    • High self% indicates consolidation worked (vs distributed overhead)
    • Not a bottleneck in itself, but a measurement concentration produced by the consolidation
  2. malloc (23.26%) - Alloc wrapper consolidation point

    • Similar to front_fastlane_try_free (includes FastLane alloc)
    • May indicate alloc side has room for optimization
    • Need deeper analysis to separate wrapper vs actual hot path
  3. tiny_header_finalize_alloc (5.33%) - Already optimized

    • Phase 5 E5-2 (Header Write-Once) was NEUTRAL (+0.45%)
    • Branch overhead ≈ savings
    • Further optimization unlikely to help
  4. Remaining hotspots < 2% - Marginal ROI

    • unified_cache_push (1.61%): E5-3b predicted ~+1.0%
    • tiny_c7_ultra_alloc (3.75%): C7-specific, narrow scope

Strategic Options (Next Steps)

Option A: Continue Micro-Optimizations (Marginal ROI)

Targets:

  • unified_cache_push (1.61%) - predicted +1.0% ROI
  • tiny_c7_ultra_alloc (3.75%) - C7-specific

Risk: Diminishing returns (each optimization < +1-2%)

Recommendation: Low priority unless low-hanging fruit identified


Option B: Alloc Side Deep Dive 🔍 (High Potential)

Observation: malloc (23.26%) suggests alloc side may have structural opportunities

Investigation Steps:

  1. Perf profile with --call-graph dwarf to separate wrapper vs hot path
  2. Identify if FastLane alloc has similar consolidation opportunities as FastLane free
  3. Look for alloc-side deduplication opportunities

Expected ROI: +5-10% (if similar to Phase 6 free-side gains)

Recommendation: HIGH PRIORITY - Most promising next direction


Option C: Declare Victory 🎉 (Strategic Pause)

Achievement: +24.6% cumulative (Phase 6-10)

Context:

  • Phase 5: ~+9-10% cumulative
  • Phase 6-10: +24.6% cumulative
  • Total (Phase 5-10): ~+30-35% cumulative

Rationale:

  • Reached major performance milestone
  • Remaining optimizations < +2% each (marginal)
  • Risk/reward ratio decreasing

Recommendation: NEUTRAL - Valid option depending on project goals


Cumulative Statistics

Performance Gains

| Phase Range | Cumulative Gain | Individual Phases |
|---|---|---|
| Phase 5 (E4-E5) | ~+9-10% | E4-1 (+3.51%), E4-2 (+21.83%), E5-1 (+3.35%) |
| Phase 6-10 | +24.6% | 6-1 (+11.13%), 6-2 (+5.18%), 8 (+2.61%), 9 (+2.72%), 10 (+1.89%) |
| Total (Phase 5-10) | ~+30-35% | Estimated compound effect |

Stability Improvements

Multiple phases delivered variance reduction:

  • Phase 6-2: σ -42% (1.00% → 0.58%)
  • Phase 8: σ -61% (867K → 336K)
  • Phase 9: σ -60.8% (2.44M → 955K, 2.6x more stable)

Insight: Consolidation/deduplication patterns provide BOTH performance AND stability


Lessons Learned

1. Consolidation > Micro-Optimization

Evidence: Phase 6-1 (+11.13%) > all micro-optimizations combined

Pattern: Wrapper-level structural changes yield largest gains

2. Context Matters

Evidence: Hot/cold split works in wrapper (Phase 1), fails in FastLane (Phase 7 -2.16%)

Pattern: Same optimization technique has different effects in different contexts

3. Monolithic Early-Exit > Function Split

Evidence: Phase 9 (+2.72%) succeeded where Phase 7 (-2.16%) failed

Pattern: Avoid function call overhead, maintain I-cache locality

4. Inline Hot Path = Danger Zone

Evidence: Phase 11 (-8.35%) - API call in inline function accumulated overhead

Pattern: Even 2-3 instructions are expensive at high call frequency

5. ENV Gate Optimization Should Target Gate Itself

Evidence: Phase 11 (-8.35%) - call-site changes inhibited compiler optimization

Pattern: Optimize the gate (caching, probe window), not the call sites

6. Variance Reduction = Hidden Value

Evidence: Multiple phases delivered 40-60% variance reduction

Pattern: Consolidation/deduplication improves both mean AND stability


Recommendations

Immediate Next Steps (Priority Order)

  1. HIGH PRIORITY: Alloc side deep dive (Option B)

    • Perf profile with call-graph to identify opportunities
    • Look for FastLane alloc equivalent optimizations
    • Expected ROI: +5-10%
  2. MEDIUM PRIORITY: Strategic pause (Option C)

    • Document Phase 6-10 achievements
    • Re-evaluate project goals and priorities
    • Consider if +24.6% is "good enough" milestone
  3. LOW PRIORITY: Micro-optimizations (Option A)

    • Only pursue if low-hanging fruit identified
    • Each remaining optimization < +2% ROI

Long-Term Strategic Questions

  1. What is the target performance?

    • mimalloc competitive? (need benchmark comparison)
    • Absolute threshold? (e.g., 60M ops/s on Mixed)
  2. What is the risk tolerance?

    • Each optimization has ~20-30% chance of NO-GO
    • Diminishing returns as we optimize further
  3. What is the maintenance cost?

    • Each new optimization adds complexity
    • ENV gates increase test matrix

Conclusion

Phase 6-10 delivered +24.6% cumulative improvement, representing hakmem's most successful optimization sequence. The winning patterns (consolidation, deduplication, monolithic early-exit) are now well-established and can guide future work.

Status: MAJOR MILESTONE ACHIEVED

Next: Recommend alloc side deep dive as highest-ROI next direction, or strategic pause to consolidate gains.


Implementation: Claude Code
Date: 2025-12-14
Phases: 6-10 (with Phase 11 NO-GO documented)
Commits: ea221d057, f301ee4df, dcc1d42e7, be723ca05, 871034da1, 71b1354d3, ad73ca554, 16c7bce2d