Phase 5 E5-3: Candidate Analysis and Strategic Recommendations

Executive Summary

Recommendation: DEFER E5-3 candidates and move to E6 (ENV snapshot branch-shape fix).

Rationale:

  • E5-2 (Header Write-Once, 3.35% self%) achieved only +0.45% NEUTRAL
  • E5-4 (Malloc Tiny Direct, E5-1 pattern) also came back NEUTRAL (-0.48%)
  • E5-3 candidates (7.14%, 3.39%, 2.97% self%) have similar or worse ROI profiles
  • Profiler self% != optimization opportunity (time-weighted samples can mislead)
  • Cumulative gains from E4+E5-1 (~+9-10%) represent significant progress
  • Next phase should target higher-level structural opportunities

E5-3 Candidate Analysis

Context: Post-E5-2 Baseline

  • E5-1 (Free Tiny Direct): +3.35% GO (adopted)
  • E5-2 (Header Write-Once): +0.45% NEUTRAL (frozen as research box)
  • New baseline: 44.42M ops/s (Mixed, 20M iters, ws=400)

Available Candidates (from perf profile)

| Candidate | Self% | Call Frequency | ROI Assessment |
|---|---|---|---|
| free_tiny_fast_cold | 7.14% | LOW (cold path) | NO-GO |
| unified_cache_push | 3.39% | HIGH (every free) | MAYBE |
| hakmem_env_snapshot_enabled | 2.97% | HIGH (wrapper+gate) | NO-GO |

Detailed Analysis

E5-3a: free_tiny_fast_cold (7.14% self%) NO-GO

Hypothesis: Cold path branch structure optimization (route determination, LARSON check)

Why NO-GO:

  1. Self% Is Misleading: 7.14% is time-weighted, not frequency-weighted

    • The cold path is called RARELY (only when the hot path misses)
    • High self% means each hit is expensive, not that the total cost is high
    • Optimizing the cold path therefore has minimal impact on overall throughput
  2. Branch Prediction Already Optimized:

    • Current implementation uses __builtin_expect hints
    • LARSON/heap checks are already marked UNLIKELY
    • Further branch reordering has marginal benefit (~0.1-0.5% at best)
  3. Similar to E5-2 Failure:

    • E5-2 targeted 3.35% self%, gained only +0.45%
    • E5-3a targets 7.14% self% BUT lower frequency
    • Expected gain: +0.3-1.0% (< +1.0% GO threshold)
  4. Structural Issues:

    • Goto-based early exit adds control flow complexity
    • Potential I-cache pollution (similar to Phase 1 A3 failure)
    • Safety risks (LARSON check bypass in optimized path)

Conservative Estimate: +0.5% ± 0.5% (NEUTRAL range)

Decision: NO-GO / DEFER
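
For reference, a minimal sketch of the cold-path branch shape discussed in point 2 above. All names (HM_UNLIKELY, free_tiny_fast_cold_sketch) are hypothetical stand-ins, not the actual hakmem code:

```c
/* Sketch only: rare checks are already hinted as unlikely, so further
 * branch reordering on this path has little headroom. */
#define HM_UNLIKELY(x) __builtin_expect(!!(x), 0)

static void free_tiny_fast_cold_sketch(void *ptr, int class_idx)
{
    if (HM_UNLIKELY(ptr == NULL))
        return;
    if (HM_UNLIKELY(class_idx < 0)) {   /* e.g. LARSON / foreign-pointer check */
        /* fall back to the fully-checked slow free */
        return;
    }
    /* ... route determination and push into the per-class cache ... */
    (void)ptr;
}
```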


E5-3b: unified_cache_push (3.39% self%) ⚠️ MAYBE

Hypothesis: Push operation overhead (TLS access, modulo arithmetic, bounds check)

Why MAYBE:

  1. Frequency: Called on EVERY free (high frequency)

  2. Current Implementation: Already highly optimized

    • Ring buffer with power-of-2 masking (no division)
    • Single TLS access (g_unified_cache[class_idx])
    • Minimal branch count (1-2 branches)
  3. Potential Optimizations:

    • Inline Expansion: Force always_inline (may hurt I-cache)
    • TLS Caching: Cache g_unified_cache base pointer (adds TLS variable)
    • Bounds Check Removal: Assume capacity never changes (unsafe)
  4. Risk Assessment:

    • High risk: unified_cache_push is already in critical path
    • Low ROI: 3.39% self% with limited optimization headroom
    • Similar to E5-2: Micro-optimization with marginal benefit

Conservative Estimate: +0.5-1.5% (borderline NEUTRAL/GO)

Decision: DEFER (pursue only if E5-1 pattern exhausted)
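
A minimal sketch of the push shape described in point 2 above; all names (tiny_cache_t, cache_push_sketch, CACHE_CAP) are hypothetical stand-ins, not the real unified_cache_push:

```c
#include <stdint.h>

#define CACHE_CAP  256u               /* power of 2 -> mask instead of modulo */
#define CACHE_MASK (CACHE_CAP - 1u)

typedef struct {
    void    *slots[CACHE_CAP];
    uint32_t head;
    uint32_t tail;
} tiny_cache_t;

static __thread tiny_cache_t g_cache_sketch[8];   /* one ring per size class */

static inline int cache_push_sketch(int class_idx, void *p)
{
    tiny_cache_t *c = &g_cache_sketch[class_idx]; /* single TLS access */
    uint32_t next = (c->head + 1u) & CACHE_MASK;  /* masking, no division */
    if (next == c->tail)                          /* one fullness branch */
        return 0;                                 /* full: caller takes the slow path */
    c->slots[c->head] = p;
    c->head = next;
    return 1;
}
```

Since the real push is already this thin, the remaining headroom is mostly in inlining and TLS-base caching, which is why the ROI is rated as marginal.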


E5-3c: hakmem_env_snapshot_enabled (2.97% self%) NO-GO

Hypothesis: Branch hint optimization (enabled=1 is the common case in MIXED)

Why NO-GO:

  1. E3-4 Precedent: Phase 4 E3-4 (ENV Constructor Init) FAILED

    • Attempted to eliminate lazy check overhead (3.22% self%)
    • Result: -1.44% regression (constructor mode added overhead)
    • Root cause: Branch predictor tuning is profile-dependent
  2. Branch Hint Contradiction:

    • Default builds: enabled=0 → hint UNLIKELY is correct
    • MIXED preset: enabled=1 → hint UNLIKELY is WRONG
    • Changing hint helps MIXED but hurts default builds
  3. Optimization Space: Already consolidated in E4-1 (E1)

    • ENV snapshot reduced 3 TLS reads → 1 TLS read
    • Remaining overhead is unavoidable (lazy init check)
    • Further optimization requires constructor init (E3-4 showed this fails)

Conservative Estimate: -1.0% to +0.5% (high regression risk)

Decision: NO-GO (proven failure in E3-4)
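
To make the hint contradiction concrete, a hedged sketch with hypothetical names; the real hakmem_env_snapshot_enabled may be shaped differently:

```c
/* Sketch only: a single static hint cannot be right for both profiles. */
static __thread int g_env_enabled_sketch = -1;    /* -1 = not initialized yet */

int hakmem_env_enabled_sketch(void)
{
    if (__builtin_expect(g_env_enabled_sketch < 0, 0))
        g_env_enabled_sketch = 0;                 /* lazy init: read env once */

    /* Hinting "disabled" is correct for default builds (enabled == 0) but
     * wrong for the MIXED preset where enabled == 1 is the steady state. */
    return __builtin_expect(g_env_enabled_sketch, 0);
}
```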


Strategic Recommendations

Priority 1: Exploit E5-1 Success Pattern

E5-1 Strategy (Free Tiny Direct):

  • Target: Wrapper-level overhead (deduplication)
  • Method: Single header check → direct call to free_tiny_fast()
  • Result: +3.35% (GO)

Replicable Patterns (updated):

  1. Malloc Tiny Direct (E5-4): Tested → NEUTRAL (-0.48%) → freeze

    • Lesson: alloc side is already thin (LTO/inlining), so wrapper-level direct bypass doesn't remove real work
  2. Alloc Gate Specialization: Per-class fast paths

    • C0-C3: Direct to LEGACY (skip policy snapshot)
    • C4-C7: Route-specific fast paths
    • Expected: +1-3%
  3. E6: ENV snapshot branch-shape fix

    • Semantics unchanged; only the branch shape changes, targeting MIXED where HAKMEM_ENV_SNAPSHOT=1 is the steady state
    • Goal: reduce mispredicts caused by hints that assume the OFF default
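
A sketch of what the per-class gate specialization in item 2 could look like; routes, thresholds, and names (tiny_alloc_legacy, tiny_alloc_routed) are assumptions for illustration only:

```c
#include <stdlib.h>

/* Stubs standing in for the existing allocation paths (hypothetical names). */
static void *tiny_alloc_legacy(int class_idx) { (void)class_idx; return malloc(16); }
static void *tiny_alloc_routed(int class_idx) { (void)class_idx; return malloc(64); }

/* Per-class specialization: small classes (C0-C3) skip the policy snapshot
 * and go straight to the legacy tiny path; larger classes keep routing. */
static inline void *alloc_gate_sketch(int class_idx)
{
    if (class_idx <= 3)
        return tiny_alloc_legacy(class_idx);
    return tiny_alloc_routed(class_idx);          /* C4-C7: route-specific path */
}
```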

Priority 2: Profile New Baseline

After E4+E5-1 adoption (~+9-10% cumulative):

  1. Re-profile Mixed workload (new bottlenecks may emerge)
  2. Identify high-frequency, high-overhead targets
  3. Focus on deduplication/consolidation (proven pattern)

Priority 3: Avoid Diminishing Returns

Red Flags (E5-2, E5-3 lessons):

  • Self% > 3% but low frequency → misleading
  • Micro-optimizations in already-optimized code → marginal ROI
  • Branch hint tuning → profile-dependent, high regression risk
  • Cold path optimization → time-weighted ≠ frequency-weighted

Green Flags (E4-1, E4-2, E5-1 successes):

  • Wrapper-level deduplication → +3-6% per optimization
  • TLS consolidation → +2-4% per consolidation
  • Direct path creation → +2-4% per path
  • Structural changes (not micro-tuning) → higher ROI

Lessons from Phase 5

Wins (E4-1, E4-2, E5-1)

  1. ENV Snapshot Consolidation (E4-1): +3.51%

    • 3 TLS reads → 1 TLS read
    • Deduplication > micro-optimization
  2. Malloc Wrapper Snapshot (E4-2): +21.83% standalone (+6.43% combined)

    • Function call elimination (tiny_get_max_size)
    • Pre-caching + TLS consolidation
  3. Free Tiny Direct (E5-1): +3.35%

    • Single header check → direct call
    • Wrapper-level deduplication

Common Pattern: Eliminate redundancy at architectural boundaries (wrapper, gate, snapshot)
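
As an illustration of that pattern, a hedged sketch of the E4-1-style TLS consolidation; the struct layout and names are hypothetical, not the actual snapshot:

```c
/* Before: three independent TLS loads per call site.
 * After:  one TLS load of a snapshot struct, fields then read from registers. */
typedef struct {
    unsigned char tiny_enabled;    /* formerly one TLS read  */
    unsigned char pool_enabled;    /* formerly a second read */
    unsigned char stats_enabled;   /* formerly a third read  */
} env_snapshot_sketch_t;

static __thread env_snapshot_sketch_t g_env_snapshot_sketch;  /* filled on first use */

static inline int fast_path_allowed_sketch(void)
{
    env_snapshot_sketch_t s = g_env_snapshot_sketch;           /* single TLS access */
    return s.tiny_enabled && s.pool_enabled && !s.stats_enabled;
}
```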

Losses / Neutrals (E3-4, E5-2)

  1. ENV Constructor Init (E3-4): -1.44%

    • Constructor mode added overhead
    • Branch prediction is profile-dependent
  2. Header Write-Once (E5-2): +0.45% NEUTRAL

    • Assumption incorrect (headers NOT redundant)
    • Branch overhead ≈ savings

Common Pattern: Micro-optimizations in hot functions have limited ROI when code is already optimized


Conclusion

E5-3 Recommendation: DEFER all three candidates

Rationale:

  1. E5-3a (cold path): Low frequency, high risk, estimated +0.5% NEUTRAL
  2. E5-3b (push): Already optimized, marginal ROI, estimated +1.0% borderline
  3. E5-3c (env snapshot): Proven failure (E3-4), estimated -1.0% NO-GO

Next Steps:

  1. Promote E5-1 to MIXED_TINYV3_C7_SAFE preset (if not already done)
  2. Profile new baseline (E4+E5-1 ON) to find next high-ROI targets
  3. E5-4: Malloc Tiny Direct (E5-1 pattern applied to the alloc side)
    • Originally expected +2-4% based on the E5-1 precedent, with lower risk than the E5-3 candidates
    • Outcome (see Executive Summary): NEUTRAL (-0.48%), frozen; the next structural target is E6
  4. Update roadmap: Focus on wrapper-level optimizations, avoid diminishing returns

Key Insight: Profiler self% is necessary but not sufficient for optimization prioritization. Frequency, redundancy, and architectural seams matter more than raw self%.
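
A rough back-of-envelope check using only the numbers reported above: taking E5-2 as the calibration point, only a small fraction of measured self time converted into throughput, which caps what similar micro-targets can deliver.

$$
\frac{+0.45\%}{3.35\%} \approx 0.13
\qquad\Rightarrow\qquad
7.14\% \times 0.13 \approx +0.9\%
$$

This lands in the same +0.3-1.0% band estimated for E5-3a, and is optimistic because the cold path is hit far less often than its self% alone suggests.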


Appendix: Implementation Notes (E5-3a - Not Executed)

Files Created (research box, not tested):

  • core/box/free_cold_shape_env_box.{h,c} (ENV gate)
  • core/box/free_cold_shape_stats_box.{h,c} (stats counters)

Integration Point:

  • core/front/malloc_tiny_fast.h (lines 418-437, free_tiny_fast_cold)

Decision: FROZEN (default OFF, do not pursue A/B testing)

Rationale: Pre-analysis shows NO-GO (low frequency, high risk, marginal ROI < +1.0%)
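
For context, a minimal sketch of a default-OFF ENV gate in the research-box style; the environment variable name HAKMEM_FREE_COLD_SHAPE is hypothetical and not necessarily the box's actual gate:

```c
#include <stdlib.h>

static int g_cold_shape_enabled = -1;   /* -1 = not read yet */

static int cold_shape_enabled_sketch(void)
{
    if (g_cold_shape_enabled < 0) {
        const char *v = getenv("HAKMEM_FREE_COLD_SHAPE");       /* hypothetical */
        g_cold_shape_enabled = (v != NULL && v[0] == '1') ? 1 : 0; /* OFF unless "1" */
    }
    return g_cold_shape_enabled;
}
```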


Date: 2025-12-14
Phase: 5 E5-3
Status: Analysis Complete → DEFER E5-3; E5-4 (Malloc Tiny Direct) tested NEUTRAL → proceed to E6
Cumulative: E4+E5-1 = ~+9-10% (baseline: 44.42M ops/s Mixed)