hakmem/docs/analysis/PHASE6_10_CUMULATIVE_RESULTS.md
Moe Charm (CI) 37bb3ee63f Phase 6-10: Cumulative Results & Strategic Analysis (+24.6%)
Comprehensive analysis of Phases 6-10 achievements:
- Cumulative improvement: +24.6% (43.04M → 53.62M ops/s)
- Individual phases: 6-1 (+11.13%), 6-2 (+5.18%), 8 (+2.61%), 9 (+2.72%), 10 (+1.89%)
- Phase 7 NO-GO (-2.16%), Phase 11 NO-GO (-8.35%)

Winning patterns:
- Wrapper-level consolidation (Phase 6-1: largest single gain)
- Deduplication at layer boundaries (Phase 6-2)
- Monolithic early-exit (Phase 9, 10 vs Phase 7 function split)

Next strategic options:
A) Micro-optimizations (marginal ROI < +2%)
B) Alloc side deep dive (malloc 23.26% hotspot, high potential +5-10%)
C) Strategic pause (declare victory at +24.6%)

Recommendation: Alloc side investigation as highest-ROI next direction.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 20:48:34 +09:00


# Phase 6-10: Cumulative Results & Strategic Analysis
**Date**: 2025-12-14
**Status**: ✅ **COMPLETE** - Major performance milestone achieved
---
## Executive Summary
Phases 6-10 delivered **+24.6% cumulative improvement** on Mixed workload (16-1024B), representing the largest performance gain in hakmem's optimization history.
### Cumulative Performance
| Metric | Phase 5 Baseline | Phase 10 Final | Improvement |
|--------|------------------|----------------|-------------|
| **Throughput (Mixed)** | 43.04M ops/s | 53.62M ops/s | **+24.6%** 🎉 |
| **Optimization Strategy** | Per-phase incremental | Layer collapse + path consolidation | Structural |
---
## Phase-by-Phase Breakdown
### Phase 6-1: Front FastLane (Layer Collapse) — ✅ **+11.13%**
**Target**: Fixed cost of the wrapper→gate→policy→route layers
**Strategy**: Consolidate into a single entry point (FastLane)
- `front_fastlane_try_malloc(size)` - handles size→class→route→handler in one place
- `front_fastlane_try_free(ptr)` - header validation + direct call
**Result**: +11.13% (the largest single improvement in hakmem's history)
**Key Files**:
- `core/box/front_fastlane_env_box.h`
- `core/box/front_fastlane_box.h`
- `core/box/hak_wrappers.inc.h` (wrapper integration)
**Lesson**: **Consolidation at wrapper level** is the winning pattern (vs micro-optimizations in hot path)
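
The layer collapse described above can be sketched roughly as follows. This is a minimal illustration, not hakmem's code: every `sketch_`-prefixed name, the power-of-two class layout, and the `malloc`-backed handlers are assumptions made so the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_NUM_CLASSES 8   /* hypothetical: classes cap at 8, 16, ..., 1024 bytes */

typedef void *(*sketch_alloc_fn)(size_t size);

/* Stand-in handlers; a real allocator would pop from per-class caches. */
static void *sketch_handler_small(size_t size) { return malloc(size); }
static void *sketch_handler_large(size_t size) { return malloc(size); }

/* Route table filled in once, so the hot path only indexes it. */
static sketch_alloc_fn sketch_route[SKETCH_NUM_CLASSES] = {
    sketch_handler_small, sketch_handler_small,
    sketch_handler_small, sketch_handler_small,
    sketch_handler_large, sketch_handler_large,
    sketch_handler_large, sketch_handler_large,
};

/* size -> class: smallest power-of-two bucket that fits, or -1 if out of range. */
static int sketch_size_to_class(size_t size) {
    if (size == 0 || size > 1024) return -1;
    int cls = 0;
    for (size_t cap = 8; cap < size; cap <<= 1) cls++;
    return cls;
}

/* Single entry point: size -> class -> route -> handler in one function,
 * instead of passing through separate wrapper, gate, policy, and route layers. */
static void *sketch_fastlane_try_malloc(size_t size) {
    int cls = sketch_size_to_class(size);
    if (cls < 0) return NULL;              /* not handled: caller takes the slow path */
    assert(cls < SKETCH_NUM_CLASSES);
    return sketch_route[cls](size);
}
```

A `NULL` return here means "not handled by the fast lane", so the wrapper would fall through to its existing path.
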
---
### Phase 6-2: Front FastLane Free DeDup — ✅ **+5.18%**
**Target**: Duplicate header validation on the FastLane free side
**Strategy**: With DeDup ON, call `free_tiny_fast()` directly (skipping the duplicate validation)
**Result**: +5.18% (σ: 1.00% → 0.58%, variance -42%)
**Key Files**:
- `core/box/front_fastlane_env_box.h` (adds the DeDup gate)
- `core/box/front_fastlane_box.h` (direct call path)
**Lesson**: **Eliminating redundant checks** provides both performance and stability gains
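
To make the duplicated check concrete, here is a small sketch under stated assumptions: the one-byte header, the `0xA0` tag, and all `sketch_` names are hypothetical, and the counter exists only to show how many validations each path performs.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAGIC 0xA0u           /* hypothetical tag in the header's high nibble */

static int sketch_validations;       /* counts header checks, for illustration only */

/* The header byte sits immediately before the user pointer. */
static int sketch_header_valid(const uint8_t *user_ptr) {
    sketch_validations++;
    return (user_ptr[-1] & 0xF0u) == SKETCH_MAGIC;
}

/* Inner free routine: validates the header itself before releasing. */
static int sketch_free_tiny_fast(uint8_t *user_ptr) {
    if (!sketch_header_valid(user_ptr)) return 0;
    user_ptr[-1] = 0;                /* stand-in for pushing the block to a cache */
    return 1;
}

/* Outer entry point. Without dedup the header is checked at both layers;
 * with dedup the outer layer trusts the inner check and calls straight through. */
static int sketch_fastlane_try_free(uint8_t *user_ptr, int dedup_on) {
    assert(user_ptr != 0);
    if (dedup_on) return sketch_free_tiny_fast(user_ptr);
    if (!sketch_header_valid(user_ptr)) return 0;
    return sketch_free_tiny_fast(user_ptr);
}
```

With `dedup_on == 0` a successful free performs two validations; with `dedup_on == 1` it performs one.
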
---
### Phase 7: Front FastLane Free Hot/Cold Alignment — ❌ **NO-GO (-2.16%)**
**Target**: Route FastLane free through `free_tiny_fast_hot()`
**Result**: -2.16% (FROZEN)
**Root Cause**:
- The hot/cold split adds overhead to FastLane's lightweight path
- TLS access + cold path branching
- Reduced I-cache locality
**Lesson**: **Same optimization has different effects in different contexts** (wrapper vs FastLane)
---
### Phase 8: FREE-STATIC-ROUTE ENV Cache Fix — ✅ **+2.61%**
**Target**: An ENV cache mishap that kept the Phase 3 D1 optimization from taking effect
**Strategy**: Add a refresh mechanism that runs after `bench_profile` calls putenv
**Result**: +2.61% (σ: 867K → 336K, variance -61%)
**Key Files**:
- `core/box/tiny_free_route_cache_env_box.{h,c}` (adds refresh)
- `core/bench_profile.h` (adds sync)
**Lesson**: **ENV gate synchronization** is critical for ensuring optimizations actually work
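
The failure mode is easy to reproduce in miniature. The sketch below assumes a gate that caches one environment variable on first use; the variable name `SKETCH_FREE_STATIC_ROUTE` and every `sketch_` function are hypothetical, and `setenv` stands in for the profile's `putenv`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L      /* for setenv() */
#endif
#include <assert.h>
#include <stdlib.h>

#define SKETCH_ENV_NAME "SKETCH_FREE_STATIC_ROUTE"   /* hypothetical variable */

static int sketch_cached = -1;       /* -1 means "not read yet" */

static int sketch_read_env(void) {
    const char *v = getenv(SKETCH_ENV_NAME);
    return (v != NULL && v[0] == '1') ? 1 : 0;
}

/* Hot-path gate: reads the environment once, then serves the cached value. */
static int sketch_gate_enabled(void) {
    if (sketch_cached < 0) sketch_cached = sketch_read_env();
    return sketch_cached;
}

/* Refresh hook: re-reads the environment and overwrites the cache. */
static void sketch_gate_refresh(void) {
    sketch_cached = sketch_read_env();
}

/* Stand-in for a benchmark profile that sets defaults after startup.
 * Without the refresh call, a gate that was already read stays stale. */
static void sketch_apply_profile(void) {
    int rc = setenv(SKETCH_ENV_NAME, "1", 1);
    assert(rc == 0);
    (void)rc;
    sketch_gate_refresh();
}
```

If the gate is queried before the profile runs and the profile skips the refresh, the optimization stays off even though the variable is set, which matches the kind of accident this phase fixed.
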
---
### Phase 9: FREE-TINY-FAST MONO DUALHOT — ✅ **+2.72%**
**Target**: Applying the lesson from the Phase 7 failure, send C0-C3 direct via an early-exit inside the monolithic function
**Strategy**: Early-exit C0-C3 to LEGACY inside `free_tiny_fast()` (no function split)
**Result**: +2.72% (σ: 2.44M → 955K, variance -60.8%, 2.6x more stable)
**Key Files**:
- `core/box/free_tiny_fast_mono_dualhot_env_box.h`
- `core/front/malloc_tiny_fast.h` (adds early-exit)
**Lesson**: **Monolithic early-exit > function split** (avoids call overhead, maintains I-cache locality)
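
As a rough illustration of the shape of this change, the sketch below keeps the early-exit inside one function rather than moving the hot case into a separate helper. The route enum, the counters, and the `sketch_` names are assumptions; only the C0-C3 range comes from the text above.

```c
#include <assert.h>

enum sketch_route { SKETCH_ROUTE_LEGACY = 0, SKETCH_ROUTE_GENERAL = 1 };

static int sketch_legacy_frees;      /* illustration-only counters */
static int sketch_general_frees;

/* One monolithic free function. The C0-C3 case exits early in place,
 * so there is no extra call and no separate hot/cold helper to jump to. */
static enum sketch_route sketch_free_tiny_fast(unsigned class_idx, int dualhot_on) {
    assert(class_idx < 8);
    if (dualhot_on && class_idx <= 3) {
        sketch_legacy_frees++;       /* stand-in for the direct LEGACY free */
        return SKETCH_ROUTE_LEGACY;
    }
    /* General path: policy lookup and route dispatch would happen here. */
    sketch_general_frees++;
    return SKETCH_ROUTE_GENERAL;
}
```

The `dualhot_on` flag plays the role of an ENV gate, so the early-exit can be switched off for A/B comparison.
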
---
### Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT — ✅ **+1.89%**
**Target**: Extend Phase 9 (C0-C3) to C4-C7
**Strategy**: Cache a `nonlegacy_mask` so frees can go LEGACY direct without misfiring on ULTRA/MID/V7 classes
**Result**: +1.89%
**Key Files**:
- `core/box/free_tiny_fast_mono_legacy_direct_env_box.h` (nonlegacy_mask caching)
- `core/front/malloc_tiny_fast.h` (C4-C7 early-exit)
**Lesson**: **Cached bitmask** enables safe fast-path with minimal overhead
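
A cached mask of this kind might look like the following sketch. The route kinds and the one-bit-per-class encoding are assumptions for illustration; only the name `nonlegacy_mask` and the ULTRA/MID/V7 labels come from the text above.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_kind {
    SKETCH_KIND_LEGACY = 0,
    SKETCH_KIND_ULTRA,
    SKETCH_KIND_MID,
    SKETCH_KIND_V7
};

/* Bit c is set when class c is routed to something other than LEGACY. */
static uint8_t sketch_nonlegacy_mask;

/* Computed once at init from the per-class route configuration. */
static void sketch_mask_init(const enum sketch_kind kinds[8]) {
    uint8_t mask = 0;
    for (unsigned c = 0; c < 8; c++) {
        if (kinds[c] != SKETCH_KIND_LEGACY) mask |= (uint8_t)(1u << c);
    }
    sketch_nonlegacy_mask = mask;
}

/* Hot path: a shift and a mask replace several per-route checks. */
static int sketch_can_go_legacy_direct(unsigned class_idx) {
    assert(class_idx < 8);
    return ((sketch_nonlegacy_mask >> class_idx) & 1u) == 0;
}
```

Because the mask is written only at init, the free path reads it without re-evaluating the route configuration.
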
---
### Phase 11: ENV Snapshot "maybe-fast" API — ❌ **NO-GO (-8.35%)**
**Target**: Reduce the fixed cost of ENV snapshot lookups (~2-3%)
**Strategy**: Use the `hakmem_env_snapshot_maybe_fast()` API to fold enabled+snapshot+front_snap into a single lookup
**Result**: -8.35% (FROZEN, design mistake)
**Root Cause**:
- Calling `maybe_fast()` inside inline hot-path functions makes the `ctor_mode` check accumulate
- Compiler optimization is inhibited (unconditional call vs conditional branch)
**Lesson**: **ENV gate optimization should target gate itself, not call sites**
---
## Technical Patterns (Across All Phases)
### Winning Patterns ✅
1. **Wrapper-level consolidation** (Phase 6-1: +11.13%)
- Collapse multiple layers into single entry point
- Reduce total instruction count, improve I-cache locality
2. **Deduplication** (Phase 6-2: +5.18%)
- Eliminate redundant checks at layer boundaries
- Both performance and stability benefits
3. **Monolithic early-exit** (Phase 9: +2.72%, Phase 10: +1.89%)
- Better than function split (avoids call overhead)
- Maintains I-cache locality
4. **Cached bitmask** (Phase 10: nonlegacy_mask)
- Single bit operation for complex condition
- Computed once at init, hot path reads only
5. **ENV gate synchronization** (Phase 8: +2.61%)
- Refresh mechanism ensures optimizations actually work
- Critical for bench_profile integration
### Anti-Patterns ❌
1. **Function split for lightweight paths** (Phase 7: -2.16%)
- Hot/cold split adds overhead when path is already optimized
- Context-dependent: works for wrapper, fails for FastLane
2. **Call-site API changes** (Phase 11: -8.35%)
- Moving helper calls into inline hot path accumulates overhead
- Compiler optimization inhibited
3. **Branch hints without profiling** (E3-4, E5-3c precedents)
- Profile-dependent, often regresses
---
## Perf Profile Analysis (After Phase 10)
### Current Hotspots (Mixed, 30M iterations)
| Symbol | Self% | Classification | Next Action |
|--------|-------|----------------|-------------|
| `front_fastlane_try_free` | 33.88% | Consolidation point (expected) | ✅ Successfully consolidated |
| `main` | 26.21% | Benchmark overhead | N/A (not optimizable) |
| `malloc` | 23.26% | Alloc wrapper consolidation point | 🔍 Investigate further |
| `tiny_header_finalize_alloc` | 5.33% | Header write | ⚪ E5-2 tried, NEUTRAL |
| `tiny_c7_ultra_alloc` | 3.75% | C7 ULTRA path | 🔍 Possible target |
| `unified_cache_push` | 1.61% | Cache push | ⚪ Marginal ROI (~+1.0%) |
| `hakmem_env_snapshot` | 0.82% | ENV snapshot | ❌ Phase 11 failed |
| `tiny_front_v3_snapshot_get` | 0.66% | Front snapshot | ❌ Phase 11 failed |
### Key Observations
1. **`front_fastlane_try_free` (33.88%)** - This is EXPECTED and GOOD
- Phase 6-1 successfully consolidated free path
- High self% indicates consolidation worked (vs distributed overhead)
- Not a bottleneck, just a measurement artifact
2. **`malloc` (23.26%)** - Alloc wrapper consolidation point
- Similar to `front_fastlane_try_free` (includes FastLane alloc)
- May indicate alloc side has room for optimization
- Need deeper analysis to separate wrapper vs actual hot path
3. **`tiny_header_finalize_alloc` (5.33%)** - Already optimized
- Phase 5 E5-2 (Header Write-Once) was NEUTRAL (+0.45%)
- Branch overhead ≈ savings
- Further optimization unlikely to help
4. **Remaining hotspots < 4%** - Marginal ROI
- `unified_cache_push` (1.61%): E5-3b predicted ~+1.0%
- `tiny_c7_ultra_alloc` (3.75%): C7-specific, narrow scope
---
## Strategic Options (Next Steps)
### Option A: Continue Micro-Optimizations ⚪ (Marginal ROI)
**Targets**:
- `unified_cache_push` (1.61%) - predicted +1.0% ROI
- `tiny_c7_ultra_alloc` (3.75%) - C7-specific
**Risk**: Diminishing returns (each optimization < +1-2%)
**Recommendation**: Low priority unless low-hanging fruit identified
---
### Option B: Alloc Side Deep Dive 🔍 (High Potential)
**Observation**: `malloc` (23.26%) suggests alloc side may have structural opportunities
**Investigation Steps**:
1. Perf profile with `--call-graph dwarf` to separate wrapper vs hot path
2. Identify if FastLane alloc has similar consolidation opportunities as FastLane free
3. Look for alloc-side deduplication opportunities
**Expected ROI**: +5-10% (if similar to Phase 6 free-side gains)
**Recommendation**: **HIGH PRIORITY** - Most promising next direction
---
### Option C: Declare Victory 🎉 (Strategic Pause)
**Achievement**: +24.6% cumulative (Phase 6-10)
**Context**:
- Phase 5: ~+9-10% cumulative
- **Phase 6-10: +24.6% cumulative**
- **Total (Phase 5-10): ~+30-35% cumulative**
**Rationale**:
- Reached major performance milestone
- Remaining optimizations < +2% each (marginal)
- Risk/reward ratio decreasing
**Recommendation**: ⚪ NEUTRAL - Valid option depending on project goals
---
## Cumulative Statistics
### Performance Gains
| Phase Range | Cumulative Gain | Individual Phases |
|-------------|-----------------|-------------------|
| Phase 5 (E4-E5) | ~+9-10% | E4-1 (+3.51%), E4-2 (+21.83%), E5-1 (+3.35%) |
| **Phase 6-10** | **+24.6%** | 6-1 (+11.13%), 6-2 (+5.18%), 8 (+2.61%), 9 (+2.72%), 10 (+1.89%) |
| **Total (Phase 5-10)** | **~+30-35%** | Estimated compound effect |
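
As a cross-check on the Phase 6-10 row, the per-phase gains can be compounded multiplicatively and compared with the endpoint throughput ratio. The helper names below are hypothetical; the inputs are the figures quoted in this document.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply (1 + g/100) across phases; returns the combined speedup factor. */
static double sketch_compound(const double *gains_pct, size_t n) {
    assert(gains_pct != NULL);
    double factor = 1.0;
    for (size_t i = 0; i < n; i++) {
        factor *= 1.0 + gains_pct[i] / 100.0;
    }
    return factor;
}

/* Endpoint ratio between two throughput measurements. */
static double sketch_ratio(double before_ops, double after_ops) {
    assert(before_ops > 0.0);
    return after_ops / before_ops;
}
```

Feeding in 11.13, 5.18, 2.61, 2.72, and 1.89 gives a factor of about 1.255, while 53.62M / 43.04M is about 1.246. The small gap is plausible if each phase was measured against its own baseline run rather than a shared one, which is an assumption here rather than something the source states.
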
### Stability Improvements
Multiple phases delivered variance reduction:
- Phase 6-2: σ -42% (1.00% → 0.58%)
- Phase 8: σ -61% (867K → 336K)
- Phase 9: σ -60.8% (2.44M → 955K, 2.6x more stable)
**Insight**: Consolidation/deduplication patterns provide BOTH performance AND stability
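
The variance figures above follow from a simple relative-reduction formula. The sketch below uses hypothetical helper names and the σ values listed in this section.

```c
#include <assert.h>

/* Relative reduction in sigma, in percent: (before - after) / before * 100. */
static double sketch_sigma_reduction_pct(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}

/* How many times smaller the new sigma is compared with the old one. */
static double sketch_stability_factor(double before, double after) {
    assert(after > 0.0);
    return before / after;
}
```

Applied to 1.00% → 0.58%, 867K → 336K, and 2.44M → 955K, the first helper reproduces the listed reductions to within rounding, and the second gives roughly 2.6 for Phase 9.
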
---
## Lessons Learned
### 1. Consolidation > Micro-Optimization
**Evidence**: Phase 6-1 (+11.13%) > all micro-optimizations combined
**Pattern**: Wrapper-level structural changes yield largest gains
### 2. Context Matters
**Evidence**: Hot/cold split works in wrapper (Phase 1), fails in FastLane (Phase 7 -2.16%)
**Pattern**: Same optimization technique has different effects in different contexts
### 3. Monolithic Early-Exit > Function Split
**Evidence**: Phase 9 (+2.72%) succeeded where Phase 7 (-2.16%) failed
**Pattern**: Avoid function call overhead, maintain I-cache locality
### 4. Inline Hot Path = Danger Zone
**Evidence**: Phase 11 (-8.35%) - API call in inline function accumulated overhead
**Pattern**: Even 2-3 instructions are expensive at high call frequency
### 5. ENV Gate Optimization Should Target Gate Itself
**Evidence**: Phase 11 (-8.35%) - call-site changes inhibited compiler optimization
**Pattern**: Optimize the gate (caching, probe window), not the call sites
### 6. Variance Reduction = Hidden Value
**Evidence**: Multiple phases delivered 40-60% variance reduction
**Pattern**: Consolidation/deduplication improves both mean AND stability
---
## Recommendations
### Immediate Next Steps (Priority Order)
1. **HIGH PRIORITY**: Alloc side deep dive (Option B)
- Perf profile with call-graph to identify opportunities
- Look for FastLane alloc equivalent optimizations
- Expected ROI: +5-10%
2. **MEDIUM PRIORITY**: Strategic pause (Option C)
- Document Phase 6-10 achievements
- Re-evaluate project goals and priorities
- Consider if +24.6% is "good enough" milestone
3. **LOW PRIORITY**: Micro-optimizations (Option A)
- Only pursue if low-hanging fruit identified
- Each remaining optimization < +2% ROI
### Long-Term Strategic Questions
1. **What is the target performance?**
- mimalloc competitive? (need benchmark comparison)
- Absolute threshold? (e.g., 60M ops/s on Mixed)
2. **What is the risk tolerance?**
- Each optimization has ~20-30% chance of NO-GO
- Diminishing returns as we optimize further
3. **What is the maintenance cost?**
- Each new optimization adds complexity
- ENV gates increase test matrix
---
## Conclusion
Phase 6-10 delivered **+24.6% cumulative improvement**, representing hakmem's most successful optimization sequence. The winning patterns (consolidation, deduplication, monolithic early-exit) are now well-established and can guide future work.
**Status**: ✅ **MAJOR MILESTONE ACHIEVED**
**Next**: Recommend **alloc side deep dive** as highest-ROI next direction, or **strategic pause** to consolidate gains.
---
**Implementation**: Claude Code
**Date**: 2025-12-14
**Phases**: 6-10 (with Phase 11 NO-GO documented)
**Commits**: ea221d057, f301ee4df, dcc1d42e7, be723ca05, 871034da1, 71b1354d3, ad73ca554, 16c7bce2d