hakmem/docs/analysis/PHASE5_E5_3_ANALYSIS_AND_RECOMMENDATIONS.md

# Phase 5 E5-3: Candidate Analysis and Strategic Recommendations

## Executive Summary

**Recommendation**: **DEFER E5-3 candidates** and move to **E6 (ENV snapshot branch-shape fix)**.

**Rationale**:
- E5-2 (Header Write-Once, 3.35% self%) achieved only +0.45% NEUTRAL
- E5-4 (Malloc Tiny Direct, E5-1 pattern) also came back NEUTRAL (-0.48%)
- E5-3 candidates (7.14%, 3.39%, 2.97% self%) have similar or worse ROI profiles
- Profiler self% != optimization opportunity (time-weighted samples can mislead)
- Cumulative gains from E4+E5-1 (~+9-10%) represent significant progress
- Next phase should target higher-level structural opportunities

---

## E5-3 Candidate Analysis

### Context: Post-E5-2 Baseline
- **E5-1 (Free Tiny Direct)**: +3.35% GO (adopted)
- **E5-2 (Header Write-Once)**: +0.45% NEUTRAL (frozen as research box)
- **New baseline**: 44.42M ops/s (Mixed, 20M iters, ws=400)

### Available Candidates (from perf profile)

| Candidate | Self% | Call Frequency | ROI Assessment |
|-----------|-------|----------------|----------------|
| free_tiny_fast_cold | 7.14% | LOW (cold path) | **NO-GO** |
| unified_cache_push | 3.39% | HIGH (every free) | **MAYBE** |
| hakmem_env_snapshot_enabled | 2.97% | HIGH (wrapper+gate) | **NO-GO** |

---

## Detailed Analysis

### E5-3a: free_tiny_fast_cold (7.14% self%) ❌ **NO-GO**

**Hypothesis**: Cold path branch structure optimization (route determination, LARSON check)

**Why NO-GO**:
1. **Self% Misleading**: 7.14% is time-weighted, not frequency-weighted
   - Cold path is called RARELY (only when hot path misses)
   - High self% means the path is expensive when hit, not that it is hit often
   - Optimizing the cold path has minimal impact on overall throughput

2. **Branch Prediction Already Optimized**:
   - Current implementation uses `__builtin_expect` hints
   - LARSON/heap checks are already marked UNLIKELY
   - Further branch reordering has marginal benefit (~0.1-0.5% at best)

3. **Similar to E5-2 Failure**:
   - E5-2 targeted 3.35% self%, gained only +0.45%
   - E5-3a targets 7.14% self% but at a lower call frequency
   - Expected gain: +0.3-1.0% (< +1.0% GO threshold)

4. **Structural Issues**:
   - Goto-based early exit adds control flow complexity
   - Potential I-cache pollution (similar to Phase 1 A3 failure)
   - Safety risks (LARSON check bypass in optimized path)

**Conservative Estimate**: +0.5% ± 0.5% (NEUTRAL range)

**Decision**: **NO-GO / DEFER**
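
As a rough cross-check of the estimate above, Amdahl's law gives an upper bound on the whole-program gain from speeding up a single function. The sketch below is illustrative only: `amdahl_gain_pct` and `print_e5_3_bounds` are hypothetical helpers, and the 10% local saving is an assumed figure, not a measurement.

```c
#include <stdio.h>

/* Upper bound on whole-program throughput gain (in percent) when a function
 * that accounts for `self_frac` of total time spends `local_saving` less time
 * (0.10 = 10% fewer cycles inside that function). This is Amdahl's law. */
double amdahl_gain_pct(double self_frac, double local_saving) {
    double new_time = 1.0 - self_frac * local_saving; /* normalized runtime */
    return (1.0 / new_time - 1.0) * 100.0;
}

/* Print the bound for the three E5-3 candidates under an assumed 10% saving.
 * The self% values are the ones listed in the candidate table. */
void print_e5_3_bounds(void) {
    const char *names[] = { "free_tiny_fast_cold", "unified_cache_push",
                            "hakmem_env_snapshot_enabled" };
    const double self_frac[] = { 0.0714, 0.0339, 0.0297 };
    for (int i = 0; i < 3; i++) {
        printf("%-28s %+.2f%%\n", names[i],
               amdahl_gain_pct(self_frac[i], 0.10));
    }
}
```

Under that assumption, trimming 10% from `free_tiny_fast_cold` is bounded at roughly +0.7% overall, which is below the +1.0% GO threshold.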

---

### E5-3b: unified_cache_push (3.39% self%) ⚠️ **MAYBE**

**Hypothesis**: Push operation overhead (TLS access, modulo arithmetic, bounds check)

**Why MAYBE**:
1. **Frequency**: Called on EVERY free (high frequency)
2. **Current Implementation**: Already highly optimized
   - Ring buffer with power-of-2 masking (no division)
   - Single TLS access (g_unified_cache[class_idx])
   - Minimal branch count (1-2 branches)

3. **Potential Optimizations**:
   - **Inline Expansion**: Force always_inline (may hurt I-cache)
   - **TLS Caching**: Cache g_unified_cache base pointer (adds TLS variable)
   - **Bounds Check Removal**: Assume capacity never changes (unsafe)

4. **Risk Assessment**:
   - **High risk**: unified_cache_push is already on the critical path
   - **Low ROI**: 3.39% self% with limited optimization headroom
   - **Similar to E5-2**: Micro-optimization with marginal benefit

**Conservative Estimate**: +0.5-1.5% (borderline NEUTRAL/GO)

**Decision**: **DEFER** (pursue only if the E5-1 pattern is exhausted)
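
To show why the headroom is limited, here is a minimal sketch of a ring-buffer push with power-of-2 masking. It is not the actual `unified_cache_push`: the type, names, capacity, and layout (`sketch_ring_t`, `g_sketch_cache`, `SKETCH_CACHE_CAP`) are assumptions made for illustration.

```c
#include <stdint.h>

#define SKETCH_NUM_CLASSES 8u                   /* assumed: classes C0-C7 */
#define SKETCH_CACHE_CAP   8u                   /* assumed: power of two */
#define SKETCH_CACHE_MASK  (SKETCH_CACHE_CAP - 1u)

/* Hypothetical per-class ring; the real cache layout may differ. */
typedef struct {
    void    *slots[SKETCH_CACHE_CAP];
    uint32_t head;   /* next slot to pop */
    uint32_t tail;   /* next slot to push */
} sketch_ring_t;

/* One thread-local array, indexed by size class (zero-initialized). */
static _Thread_local sketch_ring_t g_sketch_cache[SKETCH_NUM_CLASSES];

/* Returns 1 on success, 0 when the ring is full (caller takes a slow path). */
int sketch_cache_push(unsigned class_idx, void *ptr) {
    sketch_ring_t *r = &g_sketch_cache[class_idx];      /* single TLS access */
    uint32_t next = (r->tail + 1u) & SKETCH_CACHE_MASK; /* wrap via AND */
    if (next == r->head) {                              /* the one full check */
        return 0;
    }
    r->slots[r->tail] = ptr;
    r->tail = next;
    return 1;
}
```

In this sketch the hot path is one TLS address computation, one AND, one compare, and two stores. Removing the remaining check would mean dropping the full test, which is the unsafe option listed above.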

---

### E5-3c: hakmem_env_snapshot_enabled (2.97% self%) ❌ **NO-GO**

**Hypothesis**: Branch hint optimization (enabled=1 is the common case in MIXED)

**Why NO-GO**:
1. **E3-4 Precedent**: Phase 4 E3-4 (ENV Constructor Init) **FAILED**
   - Attempted to eliminate lazy check overhead (3.22% self%)
   - Result: -1.44% regression (constructor mode added overhead)
   - Root cause: Branch predictor tuning is profile-dependent

2. **Branch Hint Contradiction**:
   - Default builds: enabled=0 → hint UNLIKELY is correct
   - MIXED preset: enabled=1 → hint UNLIKELY is WRONG
   - Changing hint helps MIXED but hurts default builds

3. **Optimization Space**: Already consolidated in E4-1 (E1)
   - ENV snapshot reduced 3 TLS reads → 1 TLS read
   - Remaining overhead is unavoidable (lazy init check)
   - Further optimization requires constructor init (E3-4 showed this fails)

**Conservative Estimate**: -1.0% to +0.5% (high regression risk)

**Decision**: **NO-GO** (proven failure in E3-4)
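
The hint contradiction in item 2 can be sketched as follows. This is a simplified illustration, not the actual `hakmem_env_snapshot_enabled`: the `sketch_*` names, the cached-state encoding, and the value parsing are assumptions. Only the `HAKMEM_ENV_SNAPSHOT` variable name and the use of `__builtin_expect` come from this document.

```c
#include <stdlib.h>

#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)

/* -1 = not read yet, 0 = disabled, 1 = enabled (hypothetical encoding) */
static int g_sketch_snapshot_state = -1;

/* Lazy gate: reads HAKMEM_ENV_SNAPSHOT once, then returns the cached value. */
int sketch_env_snapshot_enabled(void) {
    if (SKETCH_UNLIKELY(g_sketch_snapshot_state < 0)) {   /* lazy init check */
        const char *v = getenv("HAKMEM_ENV_SNAPSHOT");
        g_sketch_snapshot_state = (v != NULL && v[0] == '1') ? 1 : 0;
    }
    return g_sketch_snapshot_state;
}

/* Caller hinted for the default (OFF) configuration.
 * Returns 1 if the snapshot path ran, 0 if the legacy path ran. */
int sketch_route(void) {
    if (SKETCH_UNLIKELY(sketch_env_snapshot_enabled())) {
        return 1;   /* snapshot path: hinted as the unlikely side */
    }
    return 0;       /* legacy path: hinted as the expected side */
}
```

The hint changes only how the compiler lays out the two sides, not the result. With `HAKMEM_ENV_SNAPSHOT=1` the function still returns 1 on every call, but the path it takes is the one the compiler was told to treat as unlikely.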

---

## Strategic Recommendations

### Priority 1: Exploit E5-1 Success Pattern ✅

**E5-1 Strategy (Free Tiny Direct)**:
- **Target**: Wrapper-level overhead (deduplication)
- **Method**: Single header check → direct call to free_tiny_fast()
- **Result**: +3.35% (GO)
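
The shape of this pattern can be sketched as below. The header encoding (one byte before the user pointer, magic in the high nibble, class index in the low nibble) and all `sketch_*` names are assumptions for illustration; the real header format and the `free_tiny_fast()` signature may differ.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HDR_MAGIC      0xA0u   /* hypothetical magic, high nibble */
#define SKETCH_HDR_MAGIC_MASK 0xF0u
#define SKETCH_HDR_CLASS_MASK 0x0Fu

static int g_sketch_fast_frees;   /* counters so the routing is observable */
static int g_sketch_slow_frees;

/* Stand-in for free_tiny_fast(): the header is already validated here. */
void sketch_free_tiny_fast(void *ptr, unsigned class_idx) {
    (void)ptr;
    (void)class_idx;
    g_sketch_fast_frees++;
}

/* Stand-in for the generic path (large blocks, foreign pointers, ...). */
void sketch_free_generic(void *ptr) {
    (void)ptr;
    g_sketch_slow_frees++;
}

/* Wrapper: a single header read decides the route, and the class index is
 * passed along so the callee does not have to read the header again. */
void sketch_free(void *ptr) {
    if (ptr == NULL) {
        return;
    }
    uint8_t hdr = ((const uint8_t *)ptr)[-1];
    if ((hdr & SKETCH_HDR_MAGIC_MASK) == SKETCH_HDR_MAGIC) {
        sketch_free_tiny_fast(ptr, hdr & SKETCH_HDR_CLASS_MASK);
        return;
    }
    sketch_free_generic(ptr);
}
```

The point of the sketch is the deduplication: the wrapper performs the only header check and hands the decoded class to the fast path directly.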

**Replicable Patterns** (updated):
1. **Malloc Tiny Direct (E5-4)**: Tested → ⚪ NEUTRAL (-0.48%) → frozen
   - Lesson: alloc side is already thin (LTO/inlining), so wrapper-level direct bypass doesn't remove real work

2. **Alloc Gate Specialization**: Per-class fast paths
   - C0-C3: Direct to LEGACY (skip policy snapshot)
   - C4-C7: Route-specific fast paths
   - Expected: +1-3%

3. **E6: ENV snapshot branch-shape fix**
   - Semantics unchanged, only branch shape for MIXED where `HAKMEM_ENV_SNAPSHOT=1` is steady-state
   - Goal: reduce mispredicts caused by hints that assume the snapshot is OFF

### Priority 2: Profile New Baseline

After E4+E5-1 adoption (~+9-10% cumulative):
1. **Re-profile Mixed workload** (new bottlenecks may emerge)
2. **Identify high-frequency, high-overhead** targets
3. **Focus on deduplication/consolidation** (proven pattern)

### Priority 3: Avoid Diminishing Returns

**Red Flags** (E5-2, E5-3 lessons):
- **Self% > 3%** but **low frequency** → misleading
- **Micro-optimizations** in already-optimized code → marginal ROI
- **Branch hint tuning** → profile-dependent, high regression risk
- **Cold path optimization** → time-weighted ≠ frequency-weighted

**Green Flags** (E4-1, E4-2, E5-1 successes):
- **Wrapper-level deduplication** → +3-6% per optimization
- **TLS consolidation** → +2-4% per consolidation
- **Direct path creation** → +2-4% per path
- **Structural changes** (not micro-tuning) → higher ROI

---

## Lessons from Phase 5

### Wins (E4-1, E4-2, E5-1)
1. **ENV Snapshot Consolidation** (E4-1): +3.51%
   - 3 TLS reads → 1 TLS read
   - Deduplication > micro-optimization

2. **Malloc Wrapper Snapshot** (E4-2): +21.83% standalone (+6.43% combined)
   - Function call elimination (tiny_get_max_size)
   - Pre-caching + TLS consolidation

3. **Free Tiny Direct** (E5-1): +3.35%
   - Single header check → direct call
   - Wrapper-level deduplication

**Common Pattern**: **Eliminate redundancy at architectural boundaries** (wrapper, gate, snapshot)

### Losses / Neutrals (E3-4, E5-2)
1. **ENV Constructor Init** (E3-4): -1.44%
   - Constructor mode added overhead
   - Branch prediction is profile-dependent

2. **Header Write-Once** (E5-2): +0.45% NEUTRAL
   - Assumption incorrect (headers NOT redundant)
   - Branch overhead ≈ savings

**Common Pattern**: **Micro-optimizations in hot functions** have limited ROI when code is already optimized

---

## Conclusion

**E5-3 Recommendation**: **DEFER all three candidates**

**Rationale**:
1. **E5-3a (cold path)**: Low frequency, high risk, estimated +0.5% NEUTRAL
2. **E5-3b (push)**: Already optimized, marginal ROI, estimated +1.0% borderline
3. **E5-3c (env snapshot)**: Proven failure (E3-4), estimated -1.0% NO-GO

**Next Steps**:
1. ✅ **Promote E5-1** to `MIXED_TINYV3_C7_SAFE` preset (if not already done)
2. ✅ **Profile new baseline** (E4+E5-1 ON) to find next high-ROI targets
3. ✅ **Design E5-4**: Malloc Tiny Direct (E5-1 pattern applied to alloc side)
   - Expected at design time: +2-4% based on E5-1 precedent
   - Outcome: tested NEUTRAL (-0.48%) and frozen; E6 (ENV snapshot branch-shape fix) is next
4. ✅ **Update roadmap**: Focus on wrapper-level optimizations, avoid diminishing returns

**Key Insight**: **Profiler self% is necessary but not sufficient** for optimization prioritization. Frequency, redundancy, and architectural seams matter more than raw self%.

---

## Appendix: Implementation Notes (E5-3a - Not Executed)

**Files Created** (research box, not tested):
- `core/box/free_cold_shape_env_box.{h,c}` (ENV gate)
- `core/box/free_cold_shape_stats_box.{h,c}` (stats counters)

**Integration Point**:
- `core/front/malloc_tiny_fast.h` (lines 418-437, free_tiny_fast_cold)

**Decision**: **FROZEN** (default OFF, do not pursue A/B testing)

**Rationale**: Pre-analysis shows NO-GO (low frequency, high risk, marginal ROI < +1.0%)

---

**Date**: 2025-12-14
**Phase**: 5 E5-3
**Status**: Analysis Complete → **DEFER E5-3**; E5-4 (Malloc Direct Path) tested NEUTRAL (-0.48%) and frozen, proceed to E6
**Cumulative**: E4+E5-1 = ~+9-10% (baseline: 44.42M ops/s Mixed)

**Commit** (2025-12-14 06:44:04 +09:00): Phase 5 E5-3: Candidate Analysis (All DEFERRED) + E5-4 Instructions

E5-3 Analysis Results:
- free_tiny_fast_cold (7.14%): DEFER - cold path, low ROI
- unified_cache_push (3.39%): DEFER - already optimized
- hakmem_env_snapshot_enabled (2.97%): DEFER - low headroom

Key Insight: perf self% is time-weighted, not frequency-weighted.
Cold paths appear hot but have low total impact.

Next: E5-4 (Malloc Tiny Direct Path)
- Apply E5-1 winning pattern to malloc side
- Target: tiny_alloc_gate_fast() gate tax elimination
- ENV gate: HAKMEM_MALLOC_TINY_DIRECT=0/1

Files added:
- docs/analysis/PHASE5_E5_3_ANALYSIS_AND_RECOMMENDATIONS.md
- docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md
- core/box/free_cold_shape_env_box.{h,c} (research box, not tested)
- core/box/free_cold_shape_stats_box.{h,c} (research box, not tested)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

**Commit** (2025-12-14 06:59:35 +09:00): Phase 5: freeze E5-4 malloc tiny direct (neutral)