# Phase 6-10: Cumulative Results & Strategic Analysis
**Date**: 2025-12-14
**Status**: ✅ **COMPLETE** - Major performance milestone achieved
---
## Executive Summary
Phases 6-10 delivered **+24.6% cumulative improvement** on Mixed workload (16-1024B), representing the largest performance gain in hakmem's optimization history.
### Cumulative Performance
| Metric | Phase 5 Baseline | Phase 10 Final | Improvement |
|--------|------------------|----------------|-------------|
| **Throughput (Mixed)** | 43.04M ops/s | 53.62M ops/s | **+24.6%** 🎉 |
| **Optimization Strategy** | Per-phase incremental | Layer collapse + path consolidation | Structural |
---
## Phase-by-Phase Breakdown
### Phase 6-1: Front FastLane (Layer Collapse) — ✅ **+11.13%**
**Target**: fixed overhead of the wrapper→gate→policy→route layers
**Strategy**: consolidate into a single entry point (FastLane)
- `front_fastlane_try_malloc(size)` - resolves size→class→route→handler in one place
- `front_fastlane_try_free(ptr)` - header validation + direct call
**Result**: +11.13% (the largest single improvement in hakmem's history)
**Key Files**:
- `core/box/front_fastlane_env_box.h`
- `core/box/front_fastlane_box.h`
- `core/box/hak_wrappers.inc.h` (wrapper integration)
**Lesson**: **Consolidation at wrapper level** is the winning pattern (vs micro-optimizations in hot path)
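A minimal self-contained sketch of the layer-collapse shape: one entry point resolves size→class→handler instead of passing through separate wrapper/gate/policy/route layers. All names, the size-to-class rule, and the handler below are hypothetical stand-ins, not the real hakmem internals.

```c
#include <stddef.h>
#include <stdlib.h>

#define TINY_MAX_SIZE 1024

/* Hypothetical per-class handler: stands in for the real route target. */
static void *tiny_alloc_class(unsigned cls) {
    return malloc((size_t)16 << cls);         /* classes are 16B..1024B here  */
}

/* Single consolidated entry point: size -> class -> handler in one frame.
 * Returns NULL when the request is outside FastLane scope, so the caller
 * falls back to the general allocation path. */
static inline void *front_fastlane_try_malloc_sketch(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE)
        return NULL;                          /* not tiny: take the slow path */
    unsigned cls = 0;                         /* size -> class (round up)     */
    while (((size_t)16 << cls) < size)
        cls++;
    return tiny_alloc_class(cls);             /* class -> handler, direct call */
}

int main(void) {
    void *p = front_fastlane_try_malloc_sketch(100);   /* lands in the 128B class */
    free(p);
    return p != NULL ? 0 : 1;
}
```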
---
### Phase 6-2: Front FastLane Free DeDup — ✅ **+5.18%**
**Target**: duplicate header validation on the FastLane free path
**Strategy**: with DeDup ON, call `free_tiny_fast()` directly (skipping the duplicate validation)
**Result**: +5.18% (σ: 1.00% → 0.58%, variance -42%)
**Key Files**:
- `core/box/front_fastlane_env_box.h` (DeDup gate added)
- `core/box/front_fastlane_box.h` (direct call path)
**Lesson**: **Eliminating redundant checks** provides both performance and stability gains
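A minimal sketch of the dedup idea: the FastLane free path has already validated the tiny header, so with DeDup ON it calls the class-specific free routine directly instead of re-validating. The header layout, flag, and function names are hypothetical, not the actual hakmem code.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct { uint8_t magic; uint8_t cls; } tiny_header_t;

static bool g_dedup_on = true;                /* hypothetical ENV-gated flag   */

static void free_tiny_class(void *p, unsigned cls) {
    (void)cls;
    free((tiny_header_t *)p - 1);             /* release header + payload      */
}

/* Legacy full path: validates the header again before freeing. */
static bool free_tiny_checked(void *p) {
    tiny_header_t *h = (tiny_header_t *)p - 1;
    if (h->magic != 0xA5) return false;       /* duplicate validation          */
    free_tiny_class(p, h->cls);
    return true;
}

/* FastLane free: the header is validated here, so with DeDup ON the
 * class-specific free is called directly and the re-check is skipped. */
static bool front_fastlane_try_free_sketch(void *p) {
    tiny_header_t *h = (tiny_header_t *)p - 1;
    if (h->magic != 0xA5) return false;       /* single validation point       */
    if (g_dedup_on) {
        free_tiny_class(p, h->cls);           /* direct call, no re-check      */
        return true;
    }
    return free_tiny_checked(p);              /* DeDup OFF: legacy double check */
}

int main(void) {
    tiny_header_t *h = malloc(sizeof *h + 64);
    if (!h) return 1;
    h->magic = 0xA5; h->cls = 2;
    return front_fastlane_try_free_sketch(h + 1) ? 0 : 1;
}
```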
---
### Phase 7: Front FastLane Free Hot/Cold Alignment — ❌ **NO-GO (-2.16%)**
**Target**: route FastLane free through `free_tiny_fast_hot()`
**Result**: -2.16% (FROZEN)
**Root Cause**:
- The hot/cold split added overhead to FastLane's already lightweight path
- TLS access + cold-path branching
- Reduced I-cache locality
**Lesson**: **Same optimization has different effects in different contexts** (wrapper vs FastLane)
---
### Phase 8: FREE-STATIC-ROUTE ENV Cache Fix — ✅ **+2.61%**
**Target**: ENV cache accident that kept the Phase 3 D1 optimization from taking effect
**Strategy**: add a refresh mechanism triggered after `bench_profile` calls putenv
**Result**: +2.61% (σ: 867K → 336K, variance -61%)
**Key Files**:
- `core/box/tiny_free_route_cache_env_box.{h,c}` (refresh added)
- `core/bench_profile.h` (sync added)
**Lesson**: **ENV gate synchronization** is critical for ensuring optimizations actually work
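The failure mode and fix, reduced to a sketch: an ENV-gated flag is cached on first read, so a later putenv() from the benchmark profile is invisible unless the cache is explicitly refreshed. The variable name, ENV key, and helper names are hypothetical, not the real hakmem code.

```c
#include <stdlib.h>
#include <string.h>

static int g_route_cache_enabled = -1;        /* -1 = not yet read             */

static int route_cache_enabled(void) {
    if (g_route_cache_enabled < 0) {          /* read ENV once, then cache     */
        const char *v = getenv("HAKMEM_FREE_STATIC_ROUTE");
        g_route_cache_enabled = (v && strcmp(v, "1") == 0) ? 1 : 0;
    }
    return g_route_cache_enabled;
}

/* Refresh hook: called after the benchmark profile's putenv() so the cached
 * value reflects the profile that was just applied. */
static void route_cache_env_refresh(void) {
    g_route_cache_enabled = -1;               /* force re-read on next check   */
}

int main(void) {
    int before = route_cache_enabled();       /* caches the initial value      */
    putenv("HAKMEM_FREE_STATIC_ROUTE=1");     /* e.g. applied by bench_profile */
    route_cache_env_refresh();                /* without this, 'before' sticks */
    return route_cache_enabled() >= before ? 0 : 1;
}
```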
---
### Phase 9: FREE-TINY-FAST MONO DUALHOT — ✅ **+2.72%**
**Target**: applying the lesson from Phase 7's failure, direct C0-C3 dispatch via an early-exit inside the monolithic function
**Strategy**: early-exit C0-C3 to LEGACY inside `free_tiny_fast()` (no function split)
**Result**: +2.72% (σ: 2.44M → 955K, variance -60.8%, 2.6x more stable)
**Key Files**:
- `core/box/free_tiny_fast_mono_dualhot_env_box.h`
- `core/front/malloc_tiny_fast.h` (early-exit added)
**Lesson**: **Monolithic early-exit > function split** (avoids call overhead, maintains I-cache locality)
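A sketch of the shape that worked: instead of splitting the routine into hot/cold functions (the Phase 7 approach), the C0-C3 LEGACY case exits early inside the same monolithic body, so there is no extra call and the I-cache footprint is unchanged. All names and the gate flag are hypothetical stand-ins.

```c
#include <stdbool.h>

static void legacy_free_direct(void *p, unsigned cls) { (void)p; (void)cls; }
static void tiny_free_general(void *p, unsigned cls)  { (void)p; (void)cls; }

static bool g_dualhot_on = true;              /* hypothetical ENV gate         */

static void free_tiny_fast_sketch(void *p, unsigned cls) {
    /* Early exit: C0-C3 are LEGACY-managed, so dispatch directly in place.  */
    if (g_dualhot_on && cls <= 3) {
        legacy_free_direct(p, cls);
        return;
    }
    /* Remainder of the monolithic routine (route lookup, other backends).   */
    tiny_free_general(p, cls);
}

int main(void) {
    int x;
    free_tiny_fast_sketch(&x, 2);             /* C2: takes the early exit      */
    free_tiny_fast_sketch(&x, 6);             /* C6: falls through to general  */
    return 0;
}
```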
---
### Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT — ✅ **+1.89%**
**Target**: extend Phase 9 (C0-C3) to C4-C7
**Strategy**: LEGACY direct dispatch guarded by a cached `nonlegacy_mask` to avoid misrouting ULTRA/MID/V7 classes
**Result**: +1.89%
**Key Files**:
- `core/box/free_tiny_fast_mono_legacy_direct_env_box.h` (nonlegacy_mask caching)
- `core/front/malloc_tiny_fast.h` (C4-C7 early-exit)
**Lesson**: **Cached bitmask** enables safe fast-path with minimal overhead
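A sketch of the cached-bitmask pattern: a per-class mask is computed once at init (bit set = class owned by a non-LEGACY backend such as ULTRA/MID/V7), and the hot path needs only a single bit test to decide whether direct LEGACY dispatch is safe. The mask name matches the document; the init-time backend query and everything else is a hypothetical stand-in.

```c
#include <stdint.h>
#include <stdbool.h>

static uint8_t g_nonlegacy_mask;              /* bit c set => class c is not LEGACY */

/* Hypothetical init-time query of the configured backend per class. */
static bool class_uses_nonlegacy_backend(unsigned cls) { return cls == 7; }

static void nonlegacy_mask_init(void) {       /* run once, e.g. from the ctor  */
    g_nonlegacy_mask = 0;
    for (unsigned c = 0; c < 8; c++)
        if (class_uses_nonlegacy_backend(c))
            g_nonlegacy_mask |= (uint8_t)(1u << c);
}

/* Hot path: one bit operation replaces the per-backend checks. */
static inline bool legacy_direct_ok(unsigned cls) {
    return (g_nonlegacy_mask & (1u << cls)) == 0;
}

int main(void) {
    nonlegacy_mask_init();
    return (legacy_direct_ok(3) && !legacy_direct_ok(7)) ? 0 : 1;
}
```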
---
### Phase 11: ENV Snapshot "maybe-fast" API — ❌ **NO-GO (-8.35%)**
**Target**: reduce the fixed cost of ENV snapshot lookups (~2-3%)
**Strategy**: consolidate the enabled+snapshot+front_snap lookups into a single `hakmem_env_snapshot_maybe_fast()` API call
**Result**: -8.35% (FROZEN, design mistake)
**Root Cause**:
- Calling `maybe_fast()` inside inline hot-path functions repeated the `ctor_mode` check at every call site
- Inhibited compiler optimization (unconditional call vs conditional branch)
**Lesson**: **ENV gate optimization should target gate itself, not call sites**
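A sketch contrasting the two shapes, not the real hakmem code (all names and types are hypothetical): the Phase 11 style pays the `ctor_mode` branch at every inline call site, while the gate-side style resolves the check once and lets hot paths read a plain cached flag.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { bool fastlane; bool dedup; } env_snapshot_t;

static bool           g_ctor_done;
static env_snapshot_t g_snapshot;

/* Phase 11 style (the anti-pattern): every inline hot-path caller pays the
 * ctor_mode branch plus a call the compiler cannot always fold away. */
static inline const env_snapshot_t *env_snapshot_maybe_fast_sketch(void) {
    if (!g_ctor_done)
        return NULL;                          /* re-checked at every call site */
    return &g_snapshot;
}

/* Gate-side style (the lesson): resolve the check once where the gate is
 * evaluated, then hot paths read a cached flag. */
static bool g_fastlane_on;                    /* written once after the ctor   */

static inline bool fastlane_gate_sketch(void) {
    return g_fastlane_on;                     /* single load, no extra branch  */
}

int main(void) {
    g_ctor_done = true;
    g_snapshot.fastlane = true;
    g_fastlane_on = g_snapshot.fastlane;      /* gate resolved once            */
    const env_snapshot_t *s = env_snapshot_maybe_fast_sketch();
    return (s && fastlane_gate_sketch()) ? 0 : 1;
}
```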
---
## Technical Patterns (Across All Phases)
### Winning Patterns ✅
1. **Wrapper-level consolidation** (Phase 6-1: +11.13%)
- Collapse multiple layers into single entry point
- Reduce total instruction count, improve I-cache locality
2. **Deduplication** (Phase 6-2: +5.18%)
- Eliminate redundant checks at layer boundaries
- Both performance and stability benefits
3. **Monolithic early-exit** (Phase 9: +2.72%, Phase 10: +1.89%)
- Better than function split (avoids call overhead)
- Maintains I-cache locality
4. **Cached bitmask** (Phase 10: nonlegacy_mask)
- Single bit operation for complex condition
- Computed once at init, hot path reads only
5. **ENV gate synchronization** (Phase 8: +2.61%)
- Refresh mechanism ensures optimizations actually work
- Critical for bench_profile integration
### Anti-Patterns ❌
1. **Function split for lightweight paths** (Phase 7: -2.16%)
- Hot/cold split adds overhead when path is already optimized
- Context-dependent: works for wrapper, fails for FastLane
2. **Call-site API changes** (Phase 11: -8.35%)
- Moving helper calls into inline hot path accumulates overhead
- Compiler optimization inhibited
3. **Branch hints without profiling** (E3-4, E5-3c precedents)
- Profile-dependent, often regresses
---
## Perf Profile Analysis (After Phase 10)
### Current Hotspots (Mixed, 30M iterations)
| Symbol | Self% | Classification | Next Action |
|--------|-------|----------------|-------------|
| `front_fastlane_try_free` | 33.88% | Consolidation point (expected) | ✅ Successfully consolidated |
| `main` | 26.21% | Benchmark overhead | N/A (not optimizable) |
| `malloc` | 23.26% | Alloc wrapper consolidation point | 🔍 Investigate further |
| `tiny_header_finalize_alloc` | 5.33% | Header write | ⚪ E5-2 tried, NEUTRAL |
| `tiny_c7_ultra_alloc` | 3.75% | C7 ULTRA path | 🔍 Possible target |
| `unified_cache_push` | 1.61% | Cache push | ⚪ Marginal ROI (~+1.0%) |
| `hakmem_env_snapshot` | 0.82% | ENV snapshot | ❌ Phase 11 failed |
| `tiny_front_v3_snapshot_get` | 0.66% | Front snapshot | ❌ Phase 11 failed |
### Key Observations
1. **`front_fastlane_try_free` (33.88%)** - This is EXPECTED and GOOD
- Phase 6-1 successfully consolidated free path
- High self% indicates consolidation worked (vs distributed overhead)
   - Not a new bottleneck: time previously distributed across layers now attributes to a single symbol
2. **`malloc` (23.26%)** - Alloc wrapper consolidation point
- Similar to `front_fastlane_try_free` (includes FastLane alloc)
- May indicate alloc side has room for optimization
- Need deeper analysis to separate wrapper vs actual hot path
3. **`tiny_header_finalize_alloc` (5.33%)** - Already optimized
- Phase 5 E5-2 (Header Write-Once) was NEUTRAL (+0.45%)
- Branch overhead ≈ savings
- Further optimization unlikely to help
4. **Remaining hotspots < 2%** - Marginal ROI
- `unified_cache_push` (1.61%): E5-3b predicted ~+1.0%
- `tiny_c7_ultra_alloc` (3.75%): C7-specific, narrow scope
---
## Strategic Options (Next Steps)
### Option A: Continue Micro-Optimizations ⚪ (Marginal ROI)
**Targets**:
- `unified_cache_push` (1.61%) - predicted +1.0% ROI
- `tiny_c7_ultra_alloc` (3.75%) - C7-specific
**Risk**: Diminishing returns (each optimization < +1-2%)
**Recommendation**: Low priority unless low-hanging fruit identified
---
### Option B: Alloc Side Deep Dive 🔍 (High Potential)
**Observation**: `malloc` (23.26%) suggests alloc side may have structural opportunities
**Investigation Steps**:
1. Perf profile with `--call-graph dwarf` to separate wrapper vs hot path
2. Identify if FastLane alloc has similar consolidation opportunities as FastLane free
3. Look for alloc-side deduplication opportunities
**Expected ROI**: +5-10% (if similar to Phase 6 free-side gains)
**Recommendation**: **HIGH PRIORITY** - Most promising next direction
---
### Option C: Declare Victory 🎉 (Strategic Pause)
**Achievement**: +24.6% cumulative (Phase 6-10)
**Context**:
- Phase 5: ~+9-10% cumulative
- **Phase 6-10: +24.6% cumulative**
- **Total (Phase 5-10): ~+30-35% cumulative**
**Rationale**:
- Reached major performance milestone
- Remaining optimizations < +2% each (marginal)
- Risk/reward ratio decreasing
**Recommendation**: ⚪ NEUTRAL - Valid option depending on project goals
---
## Cumulative Statistics
### Performance Gains
| Phase Range | Cumulative Gain | Individual Phases |
|-------------|-----------------|-------------------|
| Phase 5 (E4-E5) | ~+9-10% | E4-1 (+3.51%), E4-2 (+21.83%), E5-1 (+3.35%) |
| **Phase 6-10** | **+24.6%** | 6-1 (+11.13%), 6-2 (+5.18%), 8 (+2.61%), 9 (+2.72%), 10 (+1.89%) |
| **Total (Phase 5-10)** | **~+30-35%** | Estimated compound effect |
### Stability Improvements
Multiple phases delivered variance reduction:
- Phase 6-2: σ -42% (1.00% → 0.58%)
- Phase 8: σ -61% (867K → 336K)
- Phase 9: σ -60.8% (2.44M → 955K, 2.6x more stable)
**Insight**: Consolidation/deduplication patterns provide BOTH performance AND stability
---
## Lessons Learned
### 1. Consolidation > Micro-Optimization
**Evidence**: Phase 6-1 (+11.13%) > all micro-optimizations combined
**Pattern**: Wrapper-level structural changes yield largest gains
### 2. Context Matters
**Evidence**: Hot/cold split works in wrapper (Phase 1), fails in FastLane (Phase 7 -2.16%)
**Pattern**: Same optimization technique has different effects in different contexts
### 3. Monolithic Early-Exit > Function Split
**Evidence**: Phase 9 (+2.72%) succeeded where Phase 7 (-2.16%) failed
**Pattern**: Avoid function call overhead, maintain I-cache locality
### 4. Inline Hot Path = Danger Zone
**Evidence**: Phase 11 (-8.35%) - API call in inline function accumulated overhead
**Pattern**: Even 2-3 instructions are expensive at high call frequency
### 5. ENV Gate Optimization Should Target Gate Itself
**Evidence**: Phase 11 (-8.35%) - call-site changes inhibited compiler optimization
**Pattern**: Optimize the gate (caching, probe window), not the call sites
### 6. Variance Reduction = Hidden Value
**Evidence**: Multiple phases delivered 40-60% variance reduction
**Pattern**: Consolidation/deduplication improves both mean AND stability
---
## Recommendations
### Immediate Next Steps (Priority Order)
1. **HIGH PRIORITY**: Alloc side deep dive (Option B)
- Perf profile with call-graph to identify opportunities
- Look for FastLane alloc equivalent optimizations
- Expected ROI: +5-10%
2. **MEDIUM PRIORITY**: Strategic pause (Option C)
- Document Phase 6-10 achievements
- Re-evaluate project goals and priorities
   - Consider whether +24.6% is a "good enough" milestone
3. **LOW PRIORITY**: Micro-optimizations (Option A)
- Only pursue if low-hanging fruit identified
- Each remaining optimization < +2% ROI
### Long-Term Strategic Questions
1. **What is the target performance?**
   - Competitive with mimalloc? (needs a benchmark comparison)
- Absolute threshold? (e.g., 60M ops/s on Mixed)
2. **What is the risk tolerance?**
- Each optimization has ~20-30% chance of NO-GO
- Diminishing returns as we optimize further
3. **What is the maintenance cost?**
- Each new optimization adds complexity
- ENV gates increase test matrix
---
## Conclusion
Phase 6-10 delivered **+24.6% cumulative improvement**, representing hakmem's most successful optimization sequence. The winning patterns (consolidation, deduplication, monolithic early-exit) are now well-established and can guide future work.
**Status**: ✅ **MAJOR MILESTONE ACHIEVED**
**Next**: Recommend **alloc side deep dive** as highest-ROI next direction, or **strategic pause** to consolidate gains.
---
**Implementation**: Claude Code
**Date**: 2025-12-14
**Phases**: 6-10 (with Phase 11 NO-GO documented)
**Commits**: ea221d057, f301ee4df, dcc1d42e7, be723ca05, 871034da1, 71b1354d3, ad73ca554, 16c7bce2d