# Phase 5 E5-3: Candidate Analysis and Strategic Recommendations

## Executive Summary

**Recommendation**: **DEFER E5-3 optimization**. Continue with the established winning pattern (E5-1 style wrapper-level optimizations) rather than pursuing diminishing-returns micro-optimizations in profiler hot spots.

**Rationale**:
- E5-2 (Header Write-Once, 3.35% self%) achieved only +0.45% (NEUTRAL)
- E5-3 candidates (7.14%, 3.39%, 2.97% self%) have similar or worse ROI profiles
- Profiler self% ≠ optimization opportunity (time-weighted samples can mislead)
- Cumulative gains from E4+E5-1 (~+9-10%) represent significant progress
- Next phase should target higher-level structural opportunities

---

## E5-3 Candidate Analysis

### Context: Post-E5-2 Baseline
- **E5-1 (Free Tiny Direct)**: +3.35% GO (adopted)
- **E5-2 (Header Write-Once)**: +0.45% NEUTRAL (frozen as research box)
- **New baseline**: 44.42M ops/s (Mixed, 20M iters, ws=400)

### Available Candidates (from perf profile)

| Candidate | Self% | Call Frequency | ROI Assessment |
|-----------|-------|----------------|----------------|
| free_tiny_fast_cold | 7.14% | LOW (cold path) | **NO-GO** |
| unified_cache_push | 3.39% | HIGH (every free) | **MAYBE** |
| hakmem_env_snapshot_enabled | 2.97% | HIGH (wrapper+gate) | **NO-GO** |

---

## Detailed Analysis

### E5-3a: free_tiny_fast_cold (7.14% self%) ❌ **NO-GO**

**Hypothesis**: Cold path branch structure optimization (route determination, LARSON check)

**Why NO-GO**:
1. **Self% Misleading**: 7.14% is time-weighted, not a call frequency
   - The cold path is called RARELY (only when the hot path misses)
   - High self% means the path is expensive when hit, not that it carries a large removable cost
   - Optimizing the cold path has minimal impact on overall throughput

2. **Branch Prediction Already Optimized**:
   - Current implementation uses `__builtin_expect` hints
   - LARSON/heap checks are already marked UNLIKELY
   - Further branch reordering has marginal benefit (~0.1-0.5% at best)

3. **Similar to E5-2 Failure**:
   - E5-2 targeted 3.35% self%, gained only +0.45%
   - E5-3a targets 7.14% self% BUT at lower frequency
   - Expected gain: +0.3-1.0% (< +1.0% GO threshold)

4. **Structural Issues**:
   - Goto-based early exit adds control flow complexity
   - Potential I-cache pollution (similar to Phase 1 A3 failure)
   - Safety risks (LARSON check bypass in optimized path)

**Conservative Estimate**: +0.5% ± 0.5% (NEUTRAL range)

**Decision**: **NO-GO / DEFER**

---

### E5-3b: unified_cache_push (3.39% self%) ⚠️ **MAYBE**

**Hypothesis**: Push operation overhead (TLS access, modulo arithmetic, bounds check)

**Why MAYBE**:
1. **Frequency**: Called on EVERY free (high frequency)
2. **Current Implementation**: Already highly optimized
   - Ring buffer with power-of-2 masking (no division)
   - Single TLS access (`g_unified_cache[class_idx]`)
   - Minimal branch count (1-2 branches)

3. **Potential Optimizations**:
   - **Inline Expansion**: Force `always_inline` (may hurt I-cache)
   - **TLS Caching**: Cache `g_unified_cache` base pointer (adds TLS variable)
   - **Bounds Check Removal**: Assume capacity never changes (unsafe)

4. **Risk Assessment**:
   - **High risk**: unified_cache_push is already in critical path
   - **Low ROI**: 3.39% self% with limited optimization headroom
   - **Similar to E5-2**: Micro-optimization with marginal benefit

**Conservative Estimate**: +0.5-1.5% (borderline NEUTRAL/GO)

**Decision**: **DEFER** (pursue only if E5-1 pattern exhausted)

---

### E5-3c: hakmem_env_snapshot_enabled (2.97% self%) ❌ **NO-GO**

**Hypothesis**: Branch hint optimization (enabled=1 is the common case in MIXED)

**Why NO-GO**:
1. **E3-4 Precedent**: Phase 4 E3-4 (ENV Constructor Init) **FAILED**
   - Attempted to eliminate lazy check overhead (3.22% self%)
   - Result: -1.44% regression (constructor mode added overhead)
   - Root cause: Branch predictor tuning is profile-dependent

2. **Branch Hint Contradiction**:
   - Default builds: enabled=0 → hint UNLIKELY is correct
   - MIXED preset: enabled=1 → hint UNLIKELY is WRONG
   - Changing the hint helps MIXED but hurts default builds

3. **Optimization Space**: Already consolidated in E4-1 (E1)
   - ENV snapshot reduced 3 TLS reads → 1 TLS read
   - Remaining overhead is unavoidable (lazy init check)
   - Further optimization requires constructor init (E3-4 showed this fails)

**Conservative Estimate**: -1.0% to +0.5% (high regression risk)

**Decision**: **NO-GO** (proven failure in E3-4)

---

## Strategic Recommendations

### Priority 1: Exploit E5-1 Success Pattern ✅

**E5-1 Strategy (Free Tiny Direct)**:
- **Target**: Wrapper-level overhead (deduplication)
- **Method**: Single header check → direct call to `free_tiny_fast()`
- **Result**: +3.35% (GO)

**Replicable Patterns**:
1. **Malloc Tiny Direct**: Apply E5-1 pattern to malloc() side
   - Single size check → direct call to `malloc_tiny_fast_for_class()`
   - Eliminate: Size validation redundancy, ENV snapshot overhead
   - Expected: +2-4% (similar to E5-1)

2. **Alloc Gate Specialization**: Per-class fast paths
   - C0-C3: Direct to LEGACY (skip policy snapshot)
   - C4-C7: Route-specific fast paths
   - Expected: +1-3%

### Priority 2: Profile New Baseline

After E4+E5-1 adoption (~+9-10% cumulative):
1. **Re-profile Mixed workload** (new bottlenecks may emerge)
2. **Identify high-frequency, high-overhead** targets
3. **Focus on deduplication/consolidation** (proven pattern)

### Priority 3: Avoid Diminishing Returns

**Red Flags** (E5-2, E5-3 lessons):
- **Self% > 3%** but **low frequency** → misleading
- **Micro-optimizations** in already-optimized code → marginal ROI
- **Branch hint tuning** → profile-dependent, high regression risk
- **Cold path optimization** → time-weighted ≠ frequency-weighted

**Green Flags** (E4-1, E4-2, E5-1 successes):
- **Wrapper-level deduplication** → +3-6% per optimization
- **TLS consolidation** → +2-4% per consolidation
- **Direct path creation** → +2-4% per path
- **Structural changes** (not micro-tuning) → higher ROI

---

## Lessons from Phase 5

### Wins (E4-1, E4-2, E5-1)
1. **ENV Snapshot Consolidation** (E4-1): +3.51%
   - 3 TLS reads → 1 TLS read
   - Deduplication > micro-optimization

2. **Malloc Wrapper Snapshot** (E4-2): +21.83% standalone (+6.43% combined)
   - Function call elimination (`tiny_get_max_size`)
   - Pre-caching + TLS consolidation

3. **Free Tiny Direct** (E5-1): +3.35%
   - Single header check → direct call
   - Wrapper-level deduplication

**Common Pattern**: **Eliminate redundancy at architectural boundaries** (wrapper, gate, snapshot)

### Losses / Neutrals (E3-4, E5-2)
1. **ENV Constructor Init** (E3-4): -1.44%
   - Constructor mode added overhead
   - Branch prediction is profile-dependent

2. **Header Write-Once** (E5-2): +0.45% NEUTRAL
   - Assumption incorrect (headers NOT redundant)
   - Branch overhead ≈ savings

**Common Pattern**: **Micro-optimizations in hot functions** have limited ROI when code is already optimized

---

## Conclusion

**E5-3 Recommendation**: **DEFER all three candidates**

**Rationale**:
1. **E5-3a (cold path)**: Low frequency, high risk, estimated +0.5% ± 0.5% (NEUTRAL)
2. **E5-3b (push)**: Already optimized, marginal ROI, estimated +0.5-1.5% (borderline)
3. **E5-3c (env snapshot)**: Proven failure (E3-4), estimated -1.0% to +0.5% (NO-GO)

**Next Steps**:
1. ✅ **Promote E5-1** to `MIXED_TINYV3_C7_SAFE` preset (if not already done)
2. ✅ **Profile new baseline** (E4+E5-1 ON) to find next high-ROI targets
3. ✅ **Design E5-4**: Malloc Tiny Direct (E5-1 pattern applied to alloc side)
   - Expected: +2-4% based on E5-1 precedent
   - Lower risk than E5-3 candidates
4. ✅ **Update roadmap**: Focus on wrapper-level optimizations, avoid diminishing returns

**Key Insight**: **Profiler self% is necessary but not sufficient** for optimization prioritization. Frequency, redundancy, and architectural seams matter more than raw self%.

---

## Appendix: Implementation Notes (E5-3a - Not Executed)

**Files Created** (research box, not tested):
- `core/box/free_cold_shape_env_box.{h,c}` (ENV gate)
- `core/box/free_cold_shape_stats_box.{h,c}` (stats counters)

**Integration Point**:
- `core/front/malloc_tiny_fast.h` (lines 418-437, free_tiny_fast_cold)

**Decision**: **FROZEN** (default OFF, do not pursue A/B testing)

**Rationale**: Pre-analysis shows NO-GO (low frequency, high risk, marginal ROI < +1.0%)

---

**Date**: 2025-12-14
**Phase**: 5 E5-3
**Status**: Analysis Complete → **DEFER E5-3**, Proceed to E5-4 (Malloc Direct Path)
**Cumulative**: E4+E5-1 = ~+9-10% (baseline: 44.42M ops/s Mixed)