# Phase 4 Comprehensive Status Analysis

**Date**: 2025-12-14
**Analyst**: Claude Code
**Baseline**: E1 enabled (~45M ops/s)

---

## Part 1: E2 Freeze Decision Analysis

### Test Data Review

**E2 Configuration**: HAKMEM_TINY_ALLOC_DUALHOT (C0-C3 fast path for alloc)
**Baseline**: HAKMEM_ENV_SNAPSHOT=1 (E1 enabled)
**Test**: 10-run A/B, 20M iterations, ws=400

#### Statistical Analysis

| Metric | Baseline (E2=0) | Optimized (E2=1) | Delta |
|--------|-----------------|------------------|-------|
| Mean | 45.40M ops/s | 45.30M ops/s | -0.21% |
| Median | 45.51M ops/s | 45.22M ops/s | -0.62% |
| StdDev | 0.38M (0.84% CV) | 0.49M (1.07% CV) | +28% variance |

#### Variance Consistency Analysis

**Baseline runs** (DUALHOT=0):
- Range: 44.60M - 45.90M (1.30M spread)
- Runs within ±1% of mean: 9/10 (90%)
- Outliers: Run 8 (44.60M, -1.76% from mean)

**Optimized runs** (DUALHOT=1):
- Range: 44.59M - 46.28M (1.69M spread)
- Runs within ±1% of mean: 8/10 (80%)
- Outliers: Run 2 (46.28M, +2.16% from mean), Run 3 (44.59M, -1.58% from mean)

**Observation**: Higher variance in the optimized version suggests branch misprediction or cache effects.

#### Comparison to Free DUALHOT Success

| Path | DUALHOT Result | Reason |
|------|----------------|--------|
| **Free** | **+13.0%** | Skips policy_snapshot() + tiny_route_for_class() for C0-C3 (48% of frees) |
| **Alloc** | **-0.21%** | Route already cached (Phase 3 C3); C0-C3 check adds a branch without bypassing cost |

**Root Cause**:
- Free path: the C0-C3 optimization skips **expensive operations** (policy snapshot + route lookup)
- Alloc path: the C0-C3 optimization skips **already-cached operations** (static routing eliminates the lookup)
- Net effect: branch overhead ≈ savings → neutral

### E2 Freeze Recommendation

**Decision**: ✅ **DEFINITIVE FREEZE**

**Rationale**:
1. **Result is consistent**: All 10 runs showed the same pattern (no bimodal distribution)
2. **Not a measurement error**: StdDev 0.38M-0.49M is normal for this workload
3. **Root cause understood**: Alloc path already optimized via C3 static routing
4. **Free vs Alloc asymmetry explained**: Free skips expensive ops; alloc skips cheap cached ops
5. **No alternative conditions warranted**:
   - Different workload (C6-heavy): Won't help - the same route caching applies
   - Different iteration count: Won't change the fundamental branch cost vs savings trade-off
   - Combined flags: No synergy available - route caching is already optimal

**Conclusion**: E2 is a **structural dead-end** for the Mixed workload. Alloc route optimization is saturated by C3.

---

## Part 2: Fresh Perf Profile Analysis (E1 Enabled)

### Profile Configuration

**Command**: `HAKMEM_ENV_SNAPSHOT=1 perf record -F 999 -- ./bench_random_mixed_hakmem 40000000 400 1`
**Throughput**: 45.26M ops/s
**Samples**: 946 samples, 3.25B cycles

### Top Functions (self% >= 2.0%)

| Rank | Function | self% | Change from Pre-E1 | Category |
|------|----------|-------|-------------------|----------|
| 1 | free | 22.19% | +2.5pp (from ~19%) | Wrapper |
| 2 | tiny_alloc_gate_fast | 18.99% | +3.6pp (from 15.37%) | Alloc Gate |
| 3 | main | 15.21% | No change | Benchmark |
| 4 | malloc | 13.36% | No change | Wrapper |
| 5 | free_tiny_fast_cold | 7.32% | +1.5pp (from 5.84%) | Free Path |
| 6 | hakmem_env_snapshot_enabled | 3.22% | **NEW (was 0% combined)** | ENV Gate |
| 7 | tiny_region_id_write_header | 2.60% | +0.1pp (from 2.50%) | Header |
| 8 | unified_cache_push | 2.56% | -1.4pp (from 3.97%) | Cache |
| 9 | tiny_route_for_class | 2.29% | +0.01pp (from 2.28%) | Routing |
| 10 | small_policy_v7_snapshot | 2.26% | No data | Policy |
| 11 | tiny_c7_ultra_alloc | 2.16% | -1.8pp (from 3.97%) | C7 Alloc |

### E1 Impact Analysis

**Expected**: E1 consolidates 3 ENV gates (3.26% self%) → 1 TLS read
**Actual**: `hakmem_env_snapshot_enabled` shows 3.22% self%

**Interpretation**:
- ENV overhead **shifted** from 3 separate functions → 1 function
- **NOT eliminated** - still paying 3.22% for ENV checking
- E1's +3.92% gain
likely from **reduced TLS pressure** (fewer TLS variables), not eliminated checks
- The snapshot approach caches results, reducing repeated getenv() calls

**Surprise findings**:
1. **tiny_alloc_gate_fast increased** from 15.37% → 18.99% (+3.6pp)
   - Possible reason: other functions got faster (relative %), or I-cache effects
2. **hakmem_env_snapshot_enabled is a NEW hot spot** (3.22%)
   - This is the consolidation point - still significant overhead
3. **unified_cache_push decreased** from 3.97% → 2.56% (-1.4pp)
   - Good sign: cache operations are more efficient

### Hot Spot Distribution

**Pre-E1** (Phase 4 D3 baseline):
- ENV gates (3 functions): 3.26%
- tiny_alloc_gate_fast: 15.37%
- free_tiny_fast_cold: 5.84%
- **Total measured overhead**: ~24.5%

**Post-E1** (current):
- ENV snapshot (1 function): 3.22%
- tiny_alloc_gate_fast: 18.99%
- free_tiny_fast_cold: 7.32%
- **Total measured overhead**: ~29.5%

**Analysis**: Overhead increased in absolute %, but throughput increased +3.92%. This suggests:
- The baseline got faster (other code optimized)
- Relative % shifted to the measured functions
- Perf sampling variance (946 samples has a ~±3% error margin)

---

## Part 3: E3 Candidate Identification

### Methodology

**Selection Criteria**:
1. self% >= 5% (significant impact)
2. Not already heavily optimized (avoid saturated areas)
3. Different approach from route/TLS optimization (explore new vectors)

### Candidate Analysis

#### Candidate E3-1: tiny_alloc_gate_fast (18.99% self%) - ROUTING SATURATION

**Current State**:
- Phase 3 C3: Static routing (+2.20% gain)
- Phase 4 D3: Alloc gate shape (+0.56% neutral)
- Phase 4 E2: Per-class fast path (-0.21% neutral)

**Why it's 18.99%**:
- Route determination: already cached (C3)
- Branch prediction: already tuned (D3)
- Per-class specialization: no benefit (E2)

**Remaining Overhead**:
- Function call overhead (not inlined)
- ENV snapshot check (3.22%, now consolidated)
- Size→class conversion (hak_tiny_size_to_class)
- Wrapper→gate dispatch

**Optimization Approach**: **INLINING + DISPATCH OPTIMIZATION**
- **Strategy**: Inline tiny_alloc_gate_fast into the malloc wrapper
  - Eliminate function call overhead (save ~5-10 cycles)
  - Improve I-cache locality (malloc + gate in the same cache line)
  - Enable cross-function optimization (the compiler can optimize malloc→gate→fast_path as one unit)
- **Expected Gain**: +1-2% (reduce 18.99% self by 10-15% = ~2pp overall)
- **Risk**: Medium (I-cache pressure, as seen in the A3 -4% regression)

**Recommendation**: **DEFER** - Route optimization saturated; inlining has I-cache risk

---

#### Candidate E3-2: free (22.19% self%) - WRAPPER OVERHEAD

**Current State**:
- Phase 2 B4: Wrapper hot/cold split (+1.47% gain)
- Wrapper shape already optimized (rare checks in cold path)

**Why it's 22.19%**:
- This is the `free()` wrapper function (libc entry point)
- Includes: LD mode check, jemalloc check, diagnostics, then dispatch to free_tiny_fast

**Optimization Approach**: **WRAPPER BYPASS (IFUNC) or Function Pointer Caching**
- **Strategy 1 (IFUNC)**: Use GNU IFUNC to resolve malloc/free at load time
  - Direct binding: `malloc → tiny_alloc_gate_fast` (no wrapper layer)
  - Risk: HIGH (ABI compatibility, thread-safety)
- **Strategy 2 (Function Pointer)**: Cache `g_free_impl` in TLS
  - Check once at thread init, then direct call
  - Risk: Medium; lower
gain (+1-2%)

**Recommendation**: **HIGH PRIORITY** - Large potential gain; prototype with the function pointer approach first

---

#### Candidate E3-3: free_tiny_fast_cold (7.32% self%) - COLD PATH OPTIMIZATION

**Current State**:
- Phase FREE-DUALHOT: Hot/cold split (+13% gain for C0-C3 hot path)
- Cold path handles C4-C7 (~50% of frees)

**Optimization Approach**: **C4-C7 ROUTE SPECIALIZATION**
- **Strategy**: Create per-class cold paths (similar to the E2 alloc attempt)
- **Expected Gain**: +0.5-1.0%
- **Risk**: Low

**Recommendation**: **MEDIUM PRIORITY** - Incremental gain, but may hit diminishing returns like E2

---

#### Candidate E3-4: hakmem_env_snapshot_enabled (3.22% self%) - ENV OVERHEAD REDUCTION ⭐

**Current State**:
- Phase 4 E1: ENV snapshot consolidation (+3.92% gain)
- 3 separate ENV gates → 1 consolidated snapshot

**Why it's 3.22%**:
- This IS the optimization (the consolidation point)
- Still checking `g_hakmem_env_snapshot.initialized` on every call
- TLS read overhead (1 TLS variable vs 3, but still 1 read per hot path)

**Optimization Approach**: **LAZY INIT ELIMINATION**
- **Strategy**: Force ENV snapshot initialization at library load time (constructor)
  - Use `__attribute__((constructor))` to init before main()
  - Eliminate the `if (!initialized)` check in the hot path
  - Make `hakmem_env_get()` a pure TLS read (no branch)
- **Expected Gain**: +0.5-1.5% (eliminates the 3.22% check overhead)
- **Risk**: Low (standard initialization pattern)
- **Implementation**:

```c
__attribute__((constructor))
static void hakmem_env_snapshot_init_early(void) {
    hakmem_env_snapshot_init();  // Force init before any alloc/free
}

static inline const hakmem_env_snapshot* hakmem_env_get(void) {
    return &g_hakmem_env_snapshot;  // No check, just return
}
```

**Recommendation**: **HIGH PRIORITY** - Clean win, low risk, eliminates E1's remaining overhead

---

#### Candidate E3-5: tiny_region_id_write_header (2.60% self%) - HEADER WRITE OPTIMIZATION

**Current State**:
- Phase 1 A3: always_inline attempt
→ -4.00% regression (NO-GO)
- I-cache pressure issue identified

**Optimization Approach**: **SELECTIVE INLINING**
- **Strategy**: Inline only for hot classes (C7 ULTRA, C0-C3 LEGACY)
- **Expected Gain**: +0.5-1.0%
- **Risk**: Medium (I-cache effects)

**Recommendation**: **LOW PRIORITY** - A3 already explored; I-cache risk remains

---

### E3 Candidate Ranking

| Rank | Candidate | self% | Approach | Expected Gain | Risk | ROI |
|------|-----------|-------|----------|---------------|------|-----|
| **1** | **hakmem_env_snapshot_enabled** | **3.22%** | **Constructor init** | **+0.5-1.5%** | **Low** | **⭐⭐⭐** |
| **2** | **free wrapper** | **22.19%** | **Function pointer cache** | **+1-2%** | **Medium** | **⭐⭐⭐** |
| 3 | tiny_alloc_gate_fast | 18.99% | Inlining | +1-2% | High (I-cache) | ⭐⭐ |
| 4 | free_tiny_fast_cold | 7.32% | Route specialization | +0.5-1.0% | Low | ⭐⭐ |
| 5 | tiny_region_id_write_header | 2.60% | Selective inline | +0.5-1.0% | Medium | ⭐ |

---

## Part 4: Summary & Recommendations

### E2 Final Decision

**Decision**: ✅ **FREEZE DEFINITIVELY**

**Rationale**:
1. **Result is consistent**: -0.21% mean, -0.62% median across 10 runs
2. **Root cause clear**: Alloc route optimization saturated by Phase 3 C3 static routing
3. **Free vs Alloc asymmetry**: Free DUALHOT skips expensive ops; alloc skips cached ops
4. **No alternative testing needed**: Workload/iteration changes won't fix a structural issue
5. **Lesson learned**: Per-class specialization only works when it bypasses uncached overhead

**Action**:
- Keep `HAKMEM_TINY_ALLOC_DUALHOT=0` as default (research box frozen)
- Document in CURRENT_TASK.md as a NEUTRAL result
- No further investigation warranted

---

### Perf Findings (E1 Enabled Baseline)

**Throughput**: 45.26M ops/s (+3.92% from pre-E1 baseline)

**Hot Spots** (self% >= 5%):
1. free (22.19%) - Wrapper overhead
2. tiny_alloc_gate_fast (18.99%) - Route overhead (saturated)
3. main (15.21%) - Benchmark driver
4. malloc (13.36%) - Wrapper overhead
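The free-vs-alloc asymmetry in the E2 rationale can be illustrated with a toy model of the DUALHOT split. All names below are hypothetical stand-ins, not the real hakmem internals: the C0-C3 hot path pays only a class check, while the cold path still runs the policy/route work.

```c
#include <stddef.h>

/* Toy model of the free-path DUALHOT split (hypothetical names). */
static int g_policy_calls = 0;                 /* counts "expensive" policy work */

static void policy_snapshot_and_route(void) { g_policy_calls++; }

static void free_cold(void *p, int cls) {      /* C4-C7: policy snapshot + route lookup */
    (void)p; (void)cls;
    policy_snapshot_and_route();
}

static void free_hot(void *p, int cls) {       /* C0-C3: straight to the magazine */
    (void)p; (void)cls;
}

static void toy_free(void *p, int cls) {
    if (cls <= 3) free_hot(p, cls);            /* ~48% of frees bypass policy work */
    else          free_cold(p, cls);
}
```

On the alloc side the analogous `cls <= 3` guard buys nothing, because C3 static routing already removed the policy/route cost for every class, which matches the -0.21% neutral E2 result.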
5. free_tiny_fast_cold (7.32%) - C4-C7 free path

**E1 Impact**:
- ENV overhead consolidated: 3.26% (3 functions) → 3.22% (1 function)
- Gain from reduced TLS pressure: +3.92%
- **Remaining opportunity**: Eliminate the lazy init check (3.22% → 0%)

**New Hot Spots**:
- hakmem_env_snapshot_enabled: 3.22% (consolidation point)

**Changes from Pre-E1**:
- tiny_alloc_gate_fast: +3.6pp (15.37% → 18.99%)
- free: +2.5pp (~19% → 22.19%)
- unified_cache_push: -1.4pp (3.97% → 2.56%)

---

### E3 Recommendation

**Primary Target**: **hakmem_env_snapshot_enabled (E3-4)**

**Approach**: Constructor-based initialization
- Force ENV snapshot init at library load time
- Eliminate the lazy init check in the hot path
- Make `hakmem_env_get()` a pure TLS read (no branch)

**Expected Gain**: +0.5-1.5%

**Implementation Complexity**: Low (2-day task)
- Add an `__attribute__((constructor))` function
- Remove the init check from hakmem_env_get()
- A/B test with 10-run Mixed + 5-run C6-heavy

**Rationale**:
1. **Low risk**: Standard initialization pattern (used by jemalloc, tcmalloc)
2. **Clear gain**: Eliminates 3.22% overhead (lazy init check)
3. **Compounds E1**: Completes the ENV snapshot optimization started in E1
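The constructor-based initialization idea can be demonstrated in isolation. The struct and names below are illustrative, not the real hakmem types, and `__attribute__((constructor))` is GCC/Clang-specific:

```c
#include <stdlib.h>

/* Illustrative snapshot struct; the real hakmem layout differs. */
typedef struct { int initialized; int env_snapshot; } env_snapshot_t;

static env_snapshot_t g_env;                   /* zeroed until the constructor runs */

__attribute__((constructor))
static void env_snapshot_init_early(void) {
    /* getenv() cost is paid once at load time, never on the alloc/free path. */
    g_env.env_snapshot = getenv("HAKMEM_ENV_SNAPSHOT") ? 1 : 0;
    g_env.initialized  = 1;
}

static inline const env_snapshot_t *env_get(void) {
    return &g_env;                             /* pure read: no lazy-init branch */
}
```

Because the constructor runs before main(), env_get() can drop the `if (!initialized)` branch entirely. One caveat: allocations made by other, earlier-running constructors would still see the zeroed snapshot, so a change like this needs careful A/B re-validation.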
4. **Different vector**: Not route/TLS optimization - this is **initialization overhead reduction**

**Success Criteria**:
- Mean gain >= +0.5% (conservative)
- No regression on any profile
- Health check passes

---

**Secondary Target**: **free wrapper (E3-2)**

**Approach**: Function pointer caching
- Cache `g_free_impl` in TLS at thread init
- Direct call instead of LD mode check + dispatch
- Lower risk than the IFUNC approach

**Expected Gain**: +1-2%
**Implementation Complexity**: Medium (3-4 day task)
**Risk**: Medium (thread-safety, initialization order)

---

### Phase 4 Status

**Active Optimizations**:
- E1 (ENV Snapshot): +3.92% ✅ GO (research box, default OFF / opt-in)
- E3-4 (ENV Constructor Init): ❌ NO-GO (frozen, default OFF, requires E1)

**Frozen Optimizations**:
- D3 (Alloc Gate Shape): +0.56% ⚪ NEUTRAL (research box, default OFF)
- E2 (Alloc Per-Class FastPath): -0.21% ⚪ NEUTRAL (research box, default OFF)

**Cumulative Gain** (Phase 2-4):
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- D1 (Free route cache): +2.19%
- E1 (ENV snapshot): +3.92%
- **Total (Phase 4)**: ~+3.9% (E1 only)

**Baseline (for reference)**:
- E1=1, CTOR=0: 45.26M ops/s (Mixed, 40M iters, ws=400)
- E1=1, CTOR=1: 46.86M ops/s (Mixed, 20M iters, ws=400; re-validation: -1.44%)

**Remaining Potential**:
- E3-2 (Wrapper function ptr): +1-2%
- E3-3 (Free route special): +0.5-1.0%
- **Realistic ceiling**: ~48-50M ops/s (without major redesign)

---

### Next Steps

#### Immediate (Priority 1)

1. **Freeze E2 in CURRENT_TASK.md**
   - Document the NEUTRAL decision (-0.21%)
   - Add the root cause explanation (route caching saturation)
   - Mark as research box (default OFF, frozen)
2. **E3-4 promotion gate (re-validation)**
   - E3-4 already passed GO, but re-confirm with a 10-run A/B after low-level adjustments (branch hints, refresh, etc.)
   - A/B: Mixed 10-run (E1=1, CTOR=0 vs 1)
   - Health check: `scripts/verify_health_profiles.sh`

#### Short-term (Priority 2)
3. **Re-run perf with E1/E3-4 ON**
   - Confirm that `hakmem_env_snapshot_enabled` drops out of the top functions, or that its self% falls significantly
   - Select the next core target (alloc gate / free_tiny_fast_cold / wrapper) using the self% >= 5% criterion

#### Long-term (Priority 3)

4. **Consider non-incremental approaches**
   - Mimalloc-style TLS bucket redesign (major overhaul)
   - Static-compiled routing (eliminate runtime policy)
   - IFUNC for zero-overhead wrapper (high risk)

---

### Lessons Learned

#### Route Optimization Saturation

**Observation**: E2 (alloc per-class) showed -0.21% neutral despite the free-path success (+13%)

**Insight**:
- Route optimization has diminishing returns after static caching (C3)
- Further specialization adds branch overhead without eliminating cost
- **Lesson**: Don't pursue per-class specialization on already-cached paths

#### Shape Optimization Plateau

**Observation**: D3 (alloc gate shape) showed +0.56% neutral despite the B3 success (+2.89%)

**Insight**:
- Branch prediction saturates after initial tuning
- LIKELY/UNLIKELY hints have limited benefit on well-trained branches
- **Lesson**: Shape optimization is good for a first pass; limited ROI afterward

#### ENV Consolidation Success

**Observation**: E1 (ENV snapshot) achieved a +3.92% gain

**Insight**:
- Reducing TLS pressure (3 vars → 1 var) has a measurable benefit
- The consolidation point still has overhead (3.22% self%)
- **Lesson**: Constructor init is the next logical step (eliminate the lazy check)

#### Inlining I-Cache Risk

**Observation**: A3 (header always_inline) showed a -4% regression on Mixed

**Insight**:
- Aggressive inlining can thrash the I-cache on mixed workloads
- Selective inlining (per-class) may work but needs careful profiling
- **Lesson**: Inlining is high-risk; constructor/caching approaches are safer

---

### Realistic Expectations

**Current State**: 45M ops/s (E1 enabled)
**Target**: 48-50M ops/s (with E3-4, E3-2)
**Ceiling**: ~55-60M ops/s (without major redesign)
**Gap to mimalloc**: ~2.3x (128M vs 55M ops/s)

**Why a large gap remains**:
- Architectural overhead: 4-5 layer design (wrapper
→ gate → policy → route → handler) vs mimalloc's 1-layer TLS buckets
- Per-call policy: hakmem evaluates policy on every call; mimalloc uses a static TLS layout
- Instruction overhead: ~50-100 instructions per alloc/free vs mimalloc's ~10-15

**Next phase options**:
1. **Incremental** (E3-4, E3-2): +1-3% gains, safe, diminishing returns
2. **Structural redesign**: +20-50% potential, high risk, months of work
3. **Workload-specific tuning**: Optimize for specific profiles (C6-heavy, C7-only), not general Mixed

**Recommendation**: Pursue E3-4 (low-hanging fruit), then re-evaluate whether a structural redesign is warranted.

---

**Analysis Complete**: 2025-12-14
**Next Action**: Implement E3-4 (ENV Constructor Init)
**Expected Timeline**: 2-3 days (design → implement → A/B → decision)
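As a closing sketch for the incremental option (E3-2 function-pointer caching): the names below are hypothetical, and the real wrapper's LD-mode / jemalloc / diagnostics logic is reduced to a stub resolver. The one-time per-thread resolution replaces the per-call checks; a residual NULL test remains unless the pointer is installed at thread init.

```c
#include <stddef.h>

typedef void (*free_fn)(void *);

static int  g_fast_calls = 0;
static void free_tiny_fast(void *p) { (void)p; g_fast_calls++; }  /* stand-in fast path */

static _Thread_local free_fn t_free_impl = NULL;

/* One-time decision per thread (stands in for LD-mode/jemalloc checks). */
static free_fn resolve_free_impl(void) { return free_tiny_fast; }

static void hak_free_wrapper(void *p) {
    free_fn fn = t_free_impl;
    if (fn == NULL) {                /* cold: first free on this thread */
        fn = resolve_free_impl();
        t_free_impl = fn;
    }
    fn(p);                           /* hot: one indirect call, no mode checks */
}
```

The noted risks (thread-safety, initialization order) show up here as the question of who installs `t_free_impl` first on each thread.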