diff --git a/docs/analysis/PHASE6_10_CUMULATIVE_RESULTS.md b/docs/analysis/PHASE6_10_CUMULATIVE_RESULTS.md
new file mode 100644
index 00000000..0bdf6bf9
--- /dev/null
+++ b/docs/analysis/PHASE6_10_CUMULATIVE_RESULTS.md
@@ -0,0 +1,369 @@
+# Phase 6-10: Cumulative Results & Strategic Analysis
+
+**Date**: 2025-12-14
+**Status**: ✅ **COMPLETE** - Major performance milestone achieved
+
+---
+
+## Executive Summary
+
+Phases 6-10 delivered a **+24.6% cumulative improvement** on the Mixed workload (16-1024B), the largest performance gain in hakmem's optimization history.
+
+### Cumulative Performance
+
+| Metric | Phase 5 Baseline | Phase 10 Final | Improvement |
+|--------|------------------|----------------|-------------|
+| **Throughput (Mixed)** | 43.04M ops/s | 53.62M ops/s | **+24.6%** 🎉 |
+| **Optimization Strategy** | Per-phase incremental | Layer collapse + path consolidation | Structural |
+
+---
+
+## Phase-by-Phase Breakdown
+
+### Phase 6-1: Front FastLane (Layer Collapse) - ✅ **+11.13%**
+
+**Target**: Fixed cost of the wrapper→gate→policy→route layers
+
+**Strategy**: Consolidate into a single entry point (FastLane)
+- `front_fastlane_try_malloc(size)` - handles size→class→route→handler in one place
+- `front_fastlane_try_free(ptr)` - header validation + direct call
+
+**Result**: +11.13% (the largest single improvement in hakmem's history)
+
+**Key Files**:
+- `core/box/front_fastlane_env_box.h`
+- `core/box/front_fastlane_box.h`
+- `core/box/hak_wrappers.inc.h` (wrapper integration)
+
+**Lesson**: **Consolidation at wrapper level** is the winning pattern (vs micro-optimizations in hot path)
+
+---
+
+### Phase 6-2: Front FastLane Free DeDup - ✅ **+5.18%**
+
+**Target**: Duplicate header validation on the FastLane free side
+
+**Strategy**: With DeDup ON, call `free_tiny_fast()` directly (skipping the duplicate validation)
+
+**Result**: +5.18% (σ: 1.00% → 0.58%, variance -42%)
+
+**Key Files**:
+- `core/box/front_fastlane_env_box.h` (DeDup gate added)
+- `core/box/front_fastlane_box.h` (direct call path)
+
+**Lesson**: **Eliminating redundant checks** provides both 
performance and stability gains
+
+---
+
+### Phase 7: Front FastLane Free Hot/Cold Alignment - ❌ **NO-GO (-2.16%)**
+
+**Target**: Align FastLane free with `free_tiny_fast_hot()`
+
+**Result**: -2.16% (FROZEN)
+
+**Root Cause**:
+- The hot/cold split adds overhead to FastLane's lightweight path
+- TLS access + cold path branching
+- Reduced I-cache locality
+
+**Lesson**: **Same optimization has different effects in different contexts** (wrapper vs FastLane)
+
+---
+
+### Phase 8: FREE-STATIC-ROUTE ENV Cache Fix - ✅ **+2.61%**
+
+**Target**: ENV cache bug that kept the Phase 3 D1 optimization from taking effect
+
+**Strategy**: Add a refresh mechanism that runs after `bench_profile` calls putenv
+
+**Result**: +2.61% (σ: 867K → 336K, variance -61%)
+
+**Key Files**:
+- `core/box/tiny_free_route_cache_env_box.{h,c}` (refresh added)
+- `core/bench_profile.h` (sync added)
+
+**Lesson**: **ENV gate synchronization** is critical for ensuring optimizations actually work
+
+---
+
+### Phase 9: FREE-TINY-FAST MONO DUALHOT - ✅ **+2.72%**
+
+**Target**: Applying the lesson from Phase 7's failure, handle C0-C3 directly via an early-exit inside the monolithic function
+
+**Strategy**: Inside `free_tiny_fast()`, early-exit C0-C3 to LEGACY (no function split)
+
+**Result**: +2.72% (σ: 2.44M → 955K, variance -60.8%, 2.6x more stable)
+
+**Key Files**:
+- `core/box/free_tiny_fast_mono_dualhot_env_box.h`
+- `core/front/malloc_tiny_fast.h` (early-exit added)
+
+**Lesson**: **Monolithic early-exit > function split** (avoids call overhead, maintains I-cache locality)
+
+---
+
+### Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT - ✅ **+1.89%**
+
+**Target**: Extend Phase 9 (C0-C3) to C4-C7
+
+**Strategy**: LEGACY direct, with a cached `nonlegacy_mask` that prevents ULTRA/MID/V7 classes from being misrouted
+
+**Result**: +1.89%
+
+**Key Files**:
+- `core/box/free_tiny_fast_mono_legacy_direct_env_box.h` (nonlegacy_mask caching)
+- `core/front/malloc_tiny_fast.h` (C4-C7 early-exit)
+
+**Lesson**: **Cached bitmask** enables safe fast-path with minimal overhead
+
+---
+
+### Phase 11: ENV Snapshot "maybe-fast" API - ❌ **NO-GO (-8.35%)**
+
+**Target**: Reduce the fixed cost of ENV snapshot lookups (~2-3%)
+
+**Strategy**: Consolidate enabled+snapshot+front_snap into a single call via the `hakmem_env_snapshot_maybe_fast()` API
+
+**Result**: -8.35% (FROZEN, design mistake)
+
+**Root Cause**:
+- Calling `maybe_fast()` inside inline hot-path functions made the `ctor_mode` checks accumulate
+- Compiler optimization inhibited (unconditional call vs conditional branch)
+
+**Lesson**: **ENV gate optimization should target gate itself, not call sites**
+
+---
+
+## Technical Patterns (Across All Phases)
+
+### Winning Patterns ✅
+
+1. **Wrapper-level consolidation** (Phase 6-1: +11.13%)
+   - Collapse multiple layers into single entry point
+   - Reduce total instruction count, improve I-cache locality
+
+2. **Deduplication** (Phase 6-2: +5.18%)
+   - Eliminate redundant checks at layer boundaries
+   - Both performance and stability benefits
+
+3. **Monolithic early-exit** (Phase 9: +2.72%, Phase 10: +1.89%)
+   - Better than function split (avoids call overhead)
+   - Maintains I-cache locality
+
+4. **Cached bitmask** (Phase 10: nonlegacy_mask)
+   - Single bit operation for complex condition
+   - Computed once at init, hot path reads only
+
+5. **ENV gate synchronization** (Phase 8: +2.61%)
+   - Refresh mechanism ensures optimizations actually work
+   - Critical for bench_profile integration
+
+### Anti-Patterns ❌
+
+1. **Function split for lightweight paths** (Phase 7: -2.16%)
+   - Hot/cold split adds overhead when path is already optimized
+   - Context-dependent: works for wrapper, fails for FastLane
+
+2. **Call-site API changes** (Phase 11: -8.35%)
+   - Moving helper calls into inline hot path accumulates overhead
+   - Compiler optimization inhibited
+
+3. 
**Branch hints without profiling** (E3-4, E5-3c precedents)
+   - Profile-dependent, often regresses
+
+---
+
+## Perf Profile Analysis (After Phase 10)
+
+### Current Hotspots (Mixed, 30M iterations)
+
+| Symbol | Self% | Classification | Next Action |
+|--------|-------|----------------|-------------|
+| `front_fastlane_try_free` | 33.88% | Consolidation point (expected) | ✅ Successfully consolidated |
+| `main` | 26.21% | Benchmark overhead | N/A (not optimizable) |
+| `malloc` | 23.26% | Alloc wrapper consolidation point | 🔍 Investigate further |
+| `tiny_header_finalize_alloc` | 5.33% | Header write | ⚪ E5-2 tried, NEUTRAL |
+| `tiny_c7_ultra_alloc` | 3.75% | C7 ULTRA path | 🔍 Possible target |
+| `unified_cache_push` | 1.61% | Cache push | ⚪ Marginal ROI (~+1.0%) |
+| `hakmem_env_snapshot` | 0.82% | ENV snapshot | ❌ Phase 11 failed |
+| `tiny_front_v3_snapshot_get` | 0.66% | Front snapshot | ❌ Phase 11 failed |
+
+### Key Observations
+
+1. **`front_fastlane_try_free` (33.88%)** - This is EXPECTED and GOOD
+   - Phase 6-1 successfully consolidated the free path
+   - High self% indicates consolidation worked (vs distributed overhead)
+   - Not a new bottleneck: the cost is now attributed to one symbol instead of being spread across several
+
+2. **`malloc` (23.26%)** - Alloc wrapper consolidation point
+   - Similar to `front_fastlane_try_free` (includes FastLane alloc)
+   - May indicate alloc side has room for optimization
+   - Need deeper analysis to separate wrapper vs actual hot path
+
+3. **`tiny_header_finalize_alloc` (5.33%)** - Already optimized
+   - Phase 5 E5-2 (Header Write-Once) was NEUTRAL (+0.45%)
+   - Branch overhead ≈ savings
+   - Further optimization unlikely to help
+
+4. 
**Remaining hotspots < 4%** - Marginal ROI
+   - `unified_cache_push` (1.61%): E5-3b predicted ~+1.0%
+   - `tiny_c7_ultra_alloc` (3.75%): C7-specific, narrow scope
+
+---
+
+## Strategic Options (Next Steps)
+
+### Option A: Continue Micro-Optimizations ⚪ (Marginal ROI)
+
+**Targets**:
+- `unified_cache_push` (1.61%) - predicted +1.0% ROI
+- `tiny_c7_ultra_alloc` (3.75%) - C7-specific
+
+**Risk**: Diminishing returns (each optimization < +1-2%)
+
+**Recommendation**: Low priority unless low-hanging fruit identified
+
+---
+
+### Option B: Alloc Side Deep Dive 🔍 (High Potential)
+
+**Observation**: `malloc` (23.26%) suggests alloc side may have structural opportunities
+
+**Investigation Steps**:
+1. Perf profile with `--call-graph dwarf` to separate wrapper vs hot path
+2. Identify if FastLane alloc has similar consolidation opportunities as FastLane free
+3. Look for alloc-side deduplication opportunities
+
+**Expected ROI**: +5-10% (if similar to Phase 6 free-side gains)
+
+**Recommendation**: **HIGH PRIORITY** - Most promising next direction
+
+---
+
+### Option C: Declare Victory 🎉 (Strategic Pause)
+
+**Achievement**: +24.6% cumulative (Phase 6-10)
+
+**Context**:
+- Phase 5: ~+9-10% cumulative
+- **Phase 6-10: +24.6% cumulative**
+- **Total (Phase 5-10): ~+30-35% cumulative**
+
+**Rationale**:
+- Reached major performance milestone
+- Remaining optimizations < +2% each (marginal)
+- Risk/reward ratio decreasing
+
+**Recommendation**: ⚪ NEUTRAL - Valid option depending on project goals
+
+---
+
+## Cumulative Statistics
+
+### Performance Gains
+
+| Phase Range | Cumulative Gain | Individual Phases |
+|-------------|-----------------|-------------------|
+| Phase 5 (E4-E5) | ~+9-10% | E4-1 (+3.51%), E4-2 (+21.83%), E5-1 (+3.35%) |
+| **Phase 6-10** | **+24.6%** | 6-1 (+11.13%), 6-2 (+5.18%), 8 (+2.61%), 9 (+2.72%), 10 (+1.89%) |
+| **Total (Phase 5-10)** | **~+30-35%** | Estimated compound effect |
+
+### Stability Improvements
+
+Multiple phases delivered 
variance reduction:
+- Phase 6-2: σ -42% (1.00% → 0.58%)
+- Phase 8: σ -61% (867K → 336K)
+- Phase 9: σ -60.8% (2.44M → 955K, 2.6x more stable)
+
+**Insight**: Consolidation/deduplication patterns provide BOTH performance AND stability
+
+---
+
+## Lessons Learned
+
+### 1. Consolidation > Micro-Optimization
+
+**Evidence**: Phase 6-1 (+11.13%) > all micro-optimizations combined
+
+**Pattern**: Wrapper-level structural changes yield largest gains
+
+### 2. Context Matters
+
+**Evidence**: Hot/cold split works in wrapper (Phase 1), fails in FastLane (Phase 7 -2.16%)
+
+**Pattern**: Same optimization technique has different effects in different contexts
+
+### 3. Monolithic Early-Exit > Function Split
+
+**Evidence**: Phase 9 (+2.72%) succeeded where Phase 7 (-2.16%) failed
+
+**Pattern**: Avoid function call overhead, maintain I-cache locality
+
+### 4. Inline Hot Path = Danger Zone
+
+**Evidence**: Phase 11 (-8.35%) - API call in inline function accumulated overhead
+
+**Pattern**: Even 2-3 instructions are expensive at high call frequency
+
+### 5. ENV Gate Optimization Should Target Gate Itself
+
+**Evidence**: Phase 11 (-8.35%) - call-site changes inhibited compiler optimization
+
+**Pattern**: Optimize the gate (caching, probe window), not the call sites
+
+### 6. Variance Reduction = Hidden Value
+
+**Evidence**: Multiple phases delivered 40-60% variance reduction
+
+**Pattern**: Consolidation/deduplication improves both mean AND stability
+
+---
+
+## Recommendations
+
+### Immediate Next Steps (Priority Order)
+
+1. **HIGH PRIORITY**: Alloc side deep dive (Option B)
+   - Perf profile with call-graph to identify opportunities
+   - Look for FastLane alloc equivalent optimizations
+   - Expected ROI: +5-10%
+
+2. **MEDIUM PRIORITY**: Strategic pause (Option C)
+   - Document Phase 6-10 achievements
+   - Re-evaluate project goals and priorities
+   - Consider if +24.6% is "good enough" milestone
+
+3. 
**LOW PRIORITY**: Micro-optimizations (Option A)
+   - Only pursue if low-hanging fruit identified
+   - Each remaining optimization < +2% ROI
+
+### Long-Term Strategic Questions
+
+1. **What is the target performance?**
+   - Competitive with mimalloc? (needs a benchmark comparison)
+   - Absolute threshold? (e.g., 60M ops/s on Mixed)
+
+2. **What is the risk tolerance?**
+   - Each optimization has ~20-30% chance of NO-GO
+   - Diminishing returns as we optimize further
+
+3. **What is the maintenance cost?**
+   - Each new optimization adds complexity
+   - ENV gates increase test matrix
+
+---
+
+## Conclusion
+
+Phase 6-10 delivered a **+24.6% cumulative improvement**, hakmem's most successful optimization sequence. The winning patterns (consolidation, deduplication, monolithic early-exit) are now well-established and can guide future work.
+
+**Status**: ✅ **MAJOR MILESTONE ACHIEVED**
+
+**Next**: The **alloc side deep dive** is the recommended highest-ROI direction; a **strategic pause** to consolidate gains is the alternative.
+
+---
+
+**Implementation**: Claude Code
+**Date**: 2025-12-14
+**Phases**: 6-10 (with Phase 11 NO-GO documented)
+**Commits**: ea221d057, f301ee4df, dcc1d42e7, be723ca05, 871034da1, 71b1354d3, ad73ca554, 16c7bce2d
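
Reviewer sketch (not part of the patch): the "cached bitmask" and "monolithic early-exit" patterns credited to Phases 9 and 10 could look roughly like the C below. This is a minimal illustration under stated assumptions, not hakmem's actual code. Every identifier here (`route_t`, `NUM_CLASSES`, `g_nonlegacy_mask`, `nonlegacy_mask_init`, `free_fast_sketch`) is hypothetical, and the 8-class layout is inferred only from the C0-C7 naming in the document.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not hakmem's real API. */
enum { NUM_CLASSES = 8 };                 /* assumed: size classes C0..C7 */

typedef enum { ROUTE_LEGACY, ROUTE_ULTRA, ROUTE_MID, ROUTE_V7 } route_t;

/* Bit i set => class i must NOT take the LEGACY early-exit. */
static uint8_t g_nonlegacy_mask;

/* Computed once at init from a per-class route table (assumed input). */
static void nonlegacy_mask_init(const route_t *routes) {
    assert(routes != 0);
    uint8_t mask = 0;
    for (int c = 0; c < NUM_CLASSES; c++) {
        if (routes[c] != ROUTE_LEGACY) {
            mask |= (uint8_t)(1u << c);
        }
    }
    g_nonlegacy_mask = mask;
}

/* Monolithic free path: the early-exit lives inside the same function,
 * so no extra call is introduced. Returns 1 when the LEGACY early-exit
 * handled the pointer's class, 0 when the general path must run. */
static int free_fast_sketch(unsigned class_idx) {
    if (class_idx < NUM_CLASSES &&
        ((g_nonlegacy_mask >> class_idx) & 1u) == 0) {
        /* A direct LEGACY free would happen here. */
        return 1;
    }
    /* Fall through to the general (ULTRA/MID/V7-aware) path. */
    return 0;
}
```

The hot path performs one shift, one mask, and one branch, which matches the document's description of a single bit operation standing in for a more complex routing condition. Whether the real mask is per-thread, global, or refreshed after ENV changes is not stated in the document and is left open here.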