Phase 6-10: Cumulative Results & Strategic Analysis
Date: 2025-12-14
Status: ✅ COMPLETE - Major performance milestone achieved
Executive Summary
Phases 6-10 delivered +24.6% cumulative improvement on Mixed workload (16-1024B), representing the largest performance gain in hakmem's optimization history.
Cumulative Performance
| Metric | Phase 5 Baseline | Phase 10 Final | Improvement |
|---|---|---|---|
| Throughput (Mixed) | 43.04M ops/s | 53.62M ops/s | +24.6% 🎉 |
| Optimization Strategy | Per-phase incremental | Layer collapse + path consolidation | Structural |
Phase-by-Phase Breakdown
Phase 6-1: Front FastLane (Layer Collapse) — ✅ +11.13%
Target: Fixed cost of the wrapper→gate→policy→route layers
Strategy: Consolidate into a single entry point (FastLane)
- `front_fastlane_try_malloc(size)` - handles size→class→route→handler in one place
- `front_fastlane_try_free(ptr)` - header validation + direct call
Result: +11.13% (largest single improvement in hakmem's history)
Key Files:
- `core/box/front_fastlane_env_box.h`
- `core/box/front_fastlane_box.h`
- `core/box/hak_wrappers.inc.h` (wrapper integration)
Lesson: Consolidation at wrapper level is the winning pattern (vs micro-optimizations in hot path)
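The consolidation idea can be sketched as a single function that resolves size→class→handler in one place; the names and the 8-class/128-byte granularity below are illustrative stand-ins, not hakmem's actual layout:

```c
#include <stddef.h>
#include <stdlib.h>

/* Illustrative sketch of the FastLane pattern: one entry point resolves
 * size -> class -> handler, instead of wrapper -> gate -> policy -> route.
 * The 8-class / 128-byte granularity is a hypothetical stand-in. */
#define FASTLANE_MAX 1024

static int fastlane_class_for(size_t size) {
    return (int)((size - 1) / 128);   /* sizes 1..1024 map to classes 0..7 */
}

static void *fastlane_try_malloc(size_t size) {
    if (size == 0 || size > FASTLANE_MAX)
        return NULL;                  /* not ours: caller falls back to slow path */
    int cls = fastlane_class_for(size);
    (void)cls;                        /* real code would index a per-class bin */
    return malloc(size);              /* stand-in for the class handler */
}
```

Keeping the whole decision in one function is what collapses the layered fixed cost: one call frame, one branch chain, one I-cache region.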
Phase 6-2: Front FastLane Free DeDup — ✅ +5.18%
Target: Duplicate header validation on the FastLane free path
Strategy: With DeDup ON, call free_tiny_fast() directly (skip the duplicate validation)
Result: +5.18% (σ: 1.00% → 0.58%, variance -42%)
Key Files:
- `core/box/front_fastlane_env_box.h` (DeDup gate added)
- `core/box/front_fastlane_box.h` (direct call path)
Lesson: Eliminating redundant checks provides both performance and stability gains
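The dedup pattern can be sketched as follows; the header layout and magic value are hypothetical. The point is that the trusted inner function never re-validates what the FastLane entry already checked:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative sketch of the DeDup idea: FastLane validates the header
 * once, then calls a trusted variant that does NOT re-validate.
 * tiny_hdr and TINY_MAGIC are hypothetical stand-ins. */
typedef struct { unsigned magic; } tiny_hdr;
#define TINY_MAGIC 0x7131u

static bool header_valid(const tiny_hdr *h) {
    return h != NULL && h->magic == TINY_MAGIC;
}

static void free_tiny_fast_trusted(tiny_hdr *h) {
    free(h);                       /* no duplicate header check here */
}

static bool fastlane_try_free(tiny_hdr *h) {
    if (!header_valid(h))
        return false;              /* fall back to the general free path */
    free_tiny_fast_trusted(h);     /* direct call: validation done exactly once */
    return true;
}
```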
Phase 7: Front FastLane Free Hot/Cold Alignment — ❌ NO-GO (-2.16%)
Target: Route FastLane free through free_tiny_fast_hot()
Result: -2.16% (FROZEN)
Root Cause:
- Hot/cold split added overhead to FastLane's already lightweight path
- TLS access + cold-path branching
- Reduced I-cache locality
Lesson: Same optimization has different effects in different contexts (wrapper vs FastLane)
Phase 8: FREE-STATIC-ROUTE ENV Cache Fix — ✅ +2.61%
Target: ENV cache accident that kept the Phase 3 D1 optimization from taking effect
Strategy: Add a refresh mechanism after bench_profile's putenv
Result: +2.61% (σ: 867K → 336K, variance -61%)
Key Files:
- `core/box/tiny_free_route_cache_env_box.{h,c}` (refresh added)
- `core/bench_profile.h` (sync added)
Lesson: ENV gate synchronization is critical for ensuring optimizations actually work
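The refresh mechanism can be sketched as a lazily cached ENV read plus an explicit refresh hook that bench_profile calls after mutating the environment; the variable name `HAK_FREE_STATIC_ROUTE` is an illustrative stand-in:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch of the Phase 8 fix: the ENV-derived decision is
 * cached once, and an explicit refresh hook lets bench_profile resync
 * the cache after it changes the environment. */
static int g_route_cached = -1;      /* -1 = not yet read */

static void route_cache_refresh(void) {
    const char *v = getenv("HAK_FREE_STATIC_ROUTE");
    g_route_cached = (v != NULL && strcmp(v, "1") == 0) ? 1 : 0;
}

static int route_enabled(void) {
    if (g_route_cached < 0)
        route_cache_refresh();       /* lazy first read */
    return g_route_cached;           /* hot path: one cached load */
}
```

Without the refresh call, an environment change made after the first `route_enabled()` read is silently ignored, which is exactly the "ENV cache accident" described above.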
Phase 9: FREE-TINY-FAST MONO DUALHOT — ✅ +2.72%
Target: Learning from Phase 7's failure, route C0-C3 directly via an early-exit inside the monolithic function
Strategy: Early-exit C0-C3 to LEGACY inside free_tiny_fast() (no function split)
Result: +2.72% (σ: 2.44M → 955K, variance -60.8%, 2.6x more stable)
Key Files:
- `core/box/free_tiny_fast_mono_dualhot_env_box.h`
- `core/front/malloc_tiny_fast.h` (early-exit added)
Lesson: Monolithic early-exit > function split (avoids call overhead, maintains I-cache locality)
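The pattern can be sketched as a branch inside one function rather than a call into a split hot function; the route codes are hypothetical, added only so the shape is observable:

```c
#include <stdlib.h>

/* Illustrative sketch of the Phase 9 pattern: C0-C3 take an early-exit
 * branch inside the one monolithic function, instead of the hot/cold
 * function split that regressed in Phase 7. */
enum { ROUTE_LEGACY = 0, ROUTE_GENERAL = 1 };

static int free_tiny_fast_sketch(void *ptr, int cls) {
    if (cls <= 3) {        /* C0-C3: stay in this function, no extra call frame */
        free(ptr);         /* direct LEGACY free */
        return ROUTE_LEGACY;
    }
    free(ptr);             /* C4+ continue down the general path */
    return ROUTE_GENERAL;
}
```

A split version would pay a call (or tail-call) into a separate hot function on every free; the monolithic branch keeps both paths in the same I-cache region.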
Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT — ✅ +1.89%
Target: Extend Phase 9 (C0-C3) to C4-C7
Strategy: LEGACY direct, with a cached nonlegacy_mask to prevent misrouting ULTRA/MID/V7
Result: +1.89%
Key Files:
- `core/box/free_tiny_fast_mono_legacy_direct_env_box.h` (nonlegacy_mask caching)
- `core/front/malloc_tiny_fast.h` (C4-C7 early-exit)
Lesson: Cached bitmask enables safe fast-path with minimal overhead
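The cached-bitmask idea can be sketched as follows; which classes ULTRA/MID/V7 actually occupy is an assumption here. The point is the one-time init plus a single-bit hot-path test:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of nonlegacy_mask: compute once at init which
 * classes are NOT legacy (ULTRA/MID/V7), so the hot path decides with
 * a single shift-and-test. The class assignments are hypothetical. */
static uint8_t g_nonlegacy_mask;     /* bit i set => class i is not legacy */

static void nonlegacy_mask_init(bool mid_c5, bool v7_c6, bool ultra_c7) {
    uint8_t m = 0;
    if (mid_c5)   m |= 1u << 5;
    if (v7_c6)    m |= 1u << 6;
    if (ultra_c7) m |= 1u << 7;
    g_nonlegacy_mask = m;            /* written once, before the hot path runs */
}

static bool class_is_legacy(int cls) {
    return ((g_nonlegacy_mask >> cls) & 1u) == 0;  /* hot path: one bit test */
}
```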
Phase 11: ENV Snapshot "maybe-fast" API — ❌ NO-GO (-8.35%)
Target: Reduce the fixed cost (~2-3%) of ENV snapshot lookups
Strategy: Consolidate enabled+snapshot+front_snap into a single hakmem_env_snapshot_maybe_fast() API call
Result: -8.35% (FROZEN, design mistake)
Root Cause:
- Calling `maybe_fast()` inside inline hot-path functions accumulated ctor_mode checks
- Inhibited compiler optimization (unconditional call vs conditional branch)
Lesson: ENV gate optimization should target gate itself, not call sites
Technical Patterns (Across All Phases)
Winning Patterns ✅
1. Wrapper-level consolidation (Phase 6-1: +11.13%)
   - Collapse multiple layers into a single entry point
   - Reduce total instruction count, improve I-cache locality
2. Deduplication (Phase 6-2: +5.18%)
   - Eliminate redundant checks at layer boundaries
   - Both performance and stability benefits
3. Monolithic early-exit (Phase 9: +2.72%, Phase 10: +1.89%)
   - Better than a function split (avoids call overhead)
   - Maintains I-cache locality
4. Cached bitmask (Phase 10: nonlegacy_mask)
   - Single bit operation for a complex condition
   - Computed once at init; hot path only reads
5. ENV gate synchronization (Phase 8: +2.61%)
   - Refresh mechanism ensures optimizations actually take effect
   - Critical for bench_profile integration
Anti-Patterns ❌
1. Function split for lightweight paths (Phase 7: -2.16%)
   - Hot/cold split adds overhead when the path is already optimized
   - Context-dependent: works for the wrapper, fails for FastLane
2. Call-site API changes (Phase 11: -8.35%)
   - Moving helper calls into the inline hot path accumulates overhead
   - Inhibits compiler optimization
3. Branch hints without profiling (E3-4, E5-3c precedents)
   - Profile-dependent, often regresses
Perf Profile Analysis (after Phase 10)
Current Hotspots (Mixed, 30M iterations)
| Symbol | Self% | Classification | Next Action |
|---|---|---|---|
| `front_fastlane_try_free` | 33.88% | Consolidation point (expected) | ✅ Successfully consolidated |
| `main` | 26.21% | Benchmark overhead | N/A (not optimizable) |
| `malloc` | 23.26% | Alloc wrapper consolidation point | 🔍 Investigate further |
| `tiny_header_finalize_alloc` | 5.33% | Header write | ⚪ E5-2 tried, NEUTRAL |
| `tiny_c7_ultra_alloc` | 3.75% | C7 ULTRA path | 🔍 Possible target |
| `unified_cache_push` | 1.61% | Cache push | ⚪ Marginal ROI (~+1.0%) |
| `hakmem_env_snapshot` | 0.82% | ENV snapshot | ❌ Phase 11 failed |
| `tiny_front_v3_snapshot_get` | 0.66% | Front snapshot | ❌ Phase 11 failed |
Key Observations
1. `front_fastlane_try_free` (33.88%) - This is EXPECTED and GOOD
   - Phase 6-1 successfully consolidated the free path
   - High self% indicates the consolidation worked (vs distributed overhead)
   - Not a bottleneck, just a measurement artifact
2. `malloc` (23.26%) - Alloc wrapper consolidation point
   - Similar to `front_fastlane_try_free` (includes FastLane alloc)
   - May indicate the alloc side has room for optimization
   - Need deeper analysis to separate wrapper vs actual hot path
3. `tiny_header_finalize_alloc` (5.33%) - Already optimized
   - Phase 5 E5-2 (Header Write-Once) was NEUTRAL (+0.45%)
   - Branch overhead ≈ savings
   - Further optimization unlikely to help
4. Remaining hotspots < 2% - Marginal ROI
   - `unified_cache_push` (1.61%): E5-3b predicted ~+1.0%
   - `tiny_c7_ultra_alloc` (3.75%): C7-specific, narrow scope
Strategic Options (Next Steps)
Option A: Continue Micro-Optimizations ⚪ (Marginal ROI)
Targets:
- `unified_cache_push` (1.61%) - predicted +1.0% ROI
- `tiny_c7_ultra_alloc` (3.75%) - C7-specific
Risk: Diminishing returns (each optimization < +1-2%)
Recommendation: Low priority unless low-hanging fruit identified
Option B: Alloc Side Deep Dive 🔍 (High Potential)
Observation: malloc (23.26%) suggests alloc side may have structural opportunities
Investigation Steps:
- Perf profile with `--call-graph dwarf` to separate wrapper vs hot path
- Identify whether FastLane alloc has consolidation opportunities similar to FastLane free
- Look for alloc-side deduplication opportunities
Expected ROI: +5-10% (if similar to Phase 6 free-side gains)
Recommendation: HIGH PRIORITY - Most promising next direction
Option C: Declare Victory 🎉 (Strategic Pause)
Achievement: +24.6% cumulative (Phase 6-10)
Context:
- Phase 5: ~+9-10% cumulative
- Phase 6-10: +24.6% cumulative
- Total (Phase 5-10): ~+30-35% cumulative
Rationale:
- Reached major performance milestone
- Remaining optimizations < +2% each (marginal)
- Risk/reward ratio decreasing
Recommendation: ⚪ NEUTRAL - Valid option depending on project goals
Cumulative Statistics
Performance Gains
| Phase Range | Cumulative Gain | Individual Phases |
|---|---|---|
| Phase 5 (E4-E5) | ~+9-10% | E4-1 (+3.51%), E4-2 (+21.83%), E5-1 (+3.35%) |
| Phase 6-10 | +24.6% | 6-1 (+11.13%), 6-2 (+5.18%), 8 (+2.61%), 9 (+2.72%), 10 (+1.89%) |
| Total (Phase 5-10) | ~+30-35% | Estimated compound effect |
Stability Improvements
Multiple phases delivered variance reduction:
- Phase 6-2: σ -42% (1.00% → 0.58%)
- Phase 8: σ -61% (867K → 336K)
- Phase 9: σ -60.8% (2.44M → 955K, 2.6x more stable)
Insight: Consolidation/deduplication patterns provide BOTH performance AND stability
Lessons Learned
1. Consolidation > Micro-Optimization
Evidence: Phase 6-1 (+11.13%) > all micro-optimizations combined
Pattern: Wrapper-level structural changes yield largest gains
2. Context Matters
Evidence: Hot/cold split works in wrapper (Phase 1), fails in FastLane (Phase 7 -2.16%)
Pattern: Same optimization technique has different effects in different contexts
3. Monolithic Early-Exit > Function Split
Evidence: Phase 9 (+2.72%) succeeded where Phase 7 (-2.16%) failed
Pattern: Avoid function call overhead, maintain I-cache locality
4. Inline Hot Path = Danger Zone
Evidence: Phase 11 (-8.35%) - API call in inline function accumulated overhead
Pattern: Even 2-3 instructions are expensive at high call frequency
5. ENV Gate Optimization Should Target Gate Itself
Evidence: Phase 11 (-8.35%) - call-site changes inhibited compiler optimization
Pattern: Optimize the gate (caching, probe window), not the call sites
6. Variance Reduction = Hidden Value
Evidence: Multiple phases delivered 40-60% variance reduction
Pattern: Consolidation/deduplication improves both mean AND stability
Recommendations
Immediate Next Steps (Priority Order)
1. HIGH PRIORITY: Alloc side deep dive (Option B)
   - Perf profile with call-graph to identify opportunities
   - Look for FastLane-alloc equivalents of the free-side optimizations
   - Expected ROI: +5-10%
2. MEDIUM PRIORITY: Strategic pause (Option C)
   - Document Phase 6-10 achievements
   - Re-evaluate project goals and priorities
   - Consider whether +24.6% is a "good enough" milestone
3. LOW PRIORITY: Micro-optimizations (Option A)
   - Only pursue if low-hanging fruit is identified
   - Each remaining optimization < +2% ROI
Long-Term Strategic Questions
1. What is the target performance?
   - mimalloc competitive? (needs benchmark comparison)
   - Absolute threshold? (e.g., 60M ops/s on Mixed)
2. What is the risk tolerance?
   - Each optimization has a ~20-30% chance of NO-GO
   - Diminishing returns as we optimize further
3. What is the maintenance cost?
   - Each new optimization adds complexity
   - ENV gates increase the test matrix
Conclusion
Phase 6-10 delivered +24.6% cumulative improvement, representing hakmem's most successful optimization sequence. The winning patterns (consolidation, deduplication, monolithic early-exit) are now well-established and can guide future work.
Status: ✅ MAJOR MILESTONE ACHIEVED
Next: Recommend alloc side deep dive as highest-ROI next direction, or strategic pause to consolidate gains.
Implementation: Claude Code
Date: 2025-12-14
Phases: 6-10 (with Phase 11 NO-GO documented)
Commits: ea221d057, f301ee4df, dcc1d42e7, be723ca05, 871034da1, 71b1354d3, ad73ca554, 16c7bce2d