# Phase 4: Perf Profile Analysis - Next Optimization Target

**Date**: 2025-12-14
**Baseline**: Phase 3 + D1 Complete (~46.37M ops/s, MIXED_TINYV3_C7_SAFE)
**Profile**: MIXED_TINYV3_C7_SAFE (20M iterations, ws=400, F=999Hz)
**Samples**: 922 samples, 3.1B cycles

## Executive Summary

**Current Status**:
- Phase 3 + D1: ~8.93% cumulative gain (37.5M → 51M ops/s baseline)
- D3 (Alloc Gate Shape): NEUTRAL (+0.56% mean, -0.5% median) → frozen as research box
- **Learning**: Shape optimizations (B3, D3) have limited ROI - branch prediction improvements plateau

**Next Strategy**: Identify self% ≥ 5% functions and apply **different approaches** (not shape-based):
- Hot/cold split (separate rare paths)
- Caching (avoid repeated expensive operations)
- Inlining (reduce function call overhead)
- ENV gate consolidation (reduce repeated TLS/getenv checks)

---

## Perf Report Analysis

### Top Functions (self% ≥ 5%)

Filtered for hakmem internal functions (excluding main, malloc/free wrappers); ranked rows meet the 5% bar, unranked rows are shown for context:

| Rank | Function | self% | Category | Already Optimized? |
|------|----------|-------|----------|--------------------|
| 1 | `tiny_alloc_gate_fast.lto_priv.0` | **15.37%** | Alloc Gate | D3 shape (neutral) |
| 2 | `free_tiny_fast_cold.lto_priv.0` | **5.84%** | Free Path | Hot/cold split done |
| - | `unified_cache_push.lto_priv.0` | 3.97% | Cache | Core primitive |
| - | `tiny_c7_ultra_alloc.constprop.0` | 3.97% | C7 Alloc | Not optimized |
| - | `tiny_region_id_write_header.lto_priv.0` | 2.50% | Header | A3 inlining (NO-GO) |
| - | `tiny_route_for_class.lto_priv.0` | 2.28% | Routing | C3 static cache done |

**Key Observations**:
1. **tiny_alloc_gate_fast** (15.37%): Still dominant despite D3 shape optimization
2. **free_tiny_fast_cold** (5.84%): Cold path still hot (ENV gate overhead?)
3. **ENV gate functions** (1-2% each): `tiny_c7_ultra_enabled_env` (1.28%), `tiny_front_v3_enabled` (1.01%), `tiny_metadata_cache_enabled` (0.97%)
   - Combined: **~3.26%** on ENV checking overhead
   - Repeated TLS reads + getenv lazy init

---

## Detailed Candidate Analysis

### Candidate 1: `tiny_alloc_gate_fast` (15.37% self%) ⭐ TOP TARGET

**Current State**:
- Phase D3: Alloc gate shape optimization → NEUTRAL (+0.56% mean, -0.5% median)
- Approach: Branch hints (LIKELY/UNLIKELY) + route table direct access
- Result: Limited improvement (branch prediction already well-tuned)

**Perf Annotate Hotspots** (lines with >5% samples):

```asm
 9.97%: cmp    $0x2,%r13d            # Route comparison (ROUTE_POOL_ONLY check)
 5.77%: movzbl (%rsi,%rbx,1),%r13d   # Route table load (g_tiny_route)
11.32%: mov    0x280aea(%rip),%eax   # rel_route_logged.26 (C7 logging check)
 5.72%: test   %eax,%eax             # Route logging branch
```

**Root Causes**:
1. **Route determination overhead** (9.97% + 5.77% = 15.74%):
   - `g_tiny_route[class_idx & 7]` load + comparison
   - Branch on `ROUTE_POOL_ONLY` (rare path, but checked every call)
2. **C7 logging overhead** (11.32% + 5.72% = 17.04%):
   - `rel_route_logged.26` TLS check (C7-specific, rare in Mixed)
   - Branch misprediction when C7 is ~10% of traffic
3. **ENV gate overhead**:
   - `alloc_gate_shape_enabled()` check (line 151)
   - `tiny_route_get()` falls back to slow path (line 186)

**Optimization Opportunities**:

#### Option A1: **Per-Class Fast Path Specialization** (HIGH ROI, STRUCTURAL)

**Approach**: Create specialized `tiny_alloc_gate_fast_c{0-7}()` for each class
- **Benefit**: Eliminate runtime route determination (static per-class decision)
- **Strategy**:
  - C0-C3 (LEGACY): Direct to `malloc_tiny_fast_for_class()`, skip route check
  - C4-C6 (MID/V3): Direct to small_policy path, skip LEGACY check
  - C7 (ULTRA): Direct to `tiny_c7_ultra_alloc()`, skip all route logic
- **Expected gain**: Eliminate 15.74% route overhead → **+2-3% overall**
- **Risk**: Medium (code duplication, must maintain 8 variants)
- **Precedent**: FREE path already does this via `HAKMEM_FREE_TINY_FAST_HOTCOLD` (+13% win)
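To make A1 concrete, here is a minimal sketch of what per-class gates and their dispatcher could look like. The helper signatures are assumptions: `tiny_class_for_size()` is a hypothetical name for the size-to-class mapping, and the real prototypes of `tiny_c7_ultra_alloc()`, `malloc_tiny_fast_for_class()`, and `tiny_alloc_gate_fast()` may differ.

```c
#include <stddef.h>

/* Assumed declarations of existing hakmem internals (signatures are guesses). */
void*    tiny_c7_ultra_alloc(size_t size);
void*    malloc_tiny_fast_for_class(unsigned class_idx, size_t size);
void*    tiny_alloc_gate_fast(size_t size);    /* existing generic gate */
unsigned tiny_class_for_size(size_t size);     /* hypothetical helper */

/* C7 (ULTRA): the route is known per variant, so the per-class gate is a thin
 * shim with no g_tiny_route[] load, no ROUTE_POOL_ONLY compare, and no
 * logging branch. */
static inline void* tiny_alloc_gate_fast_c7(size_t size) {
    return tiny_c7_ultra_alloc(size);
}

/* C0 (LEGACY): direct to the legacy tiny fast path, skipping the route check. */
static inline void* tiny_alloc_gate_fast_c0(size_t size) {
    return malloc_tiny_fast_for_class(0u, size);
}

/* Dispatcher: the only per-call work left is the class computation; once all
 * eight cases exist the switch can compile to a jump table instead of
 * data-dependent route branches. */
static inline void* tiny_alloc_gate_dispatch(size_t size) {
    switch (tiny_class_for_size(size)) {
        case 7:  return tiny_alloc_gate_fast_c7(size);
        case 0:  return tiny_alloc_gate_fast_c0(size);
        /* ... c1-c6 variants elided ... */
        default: return tiny_alloc_gate_fast(size);   /* fall back to current gate */
    }
}
```

Each shim mirrors what the FREE path already does under `HAKMEM_FREE_TINY_FAST_HOTCOLD`; whether the dispatch lives in the malloc wrapper or inside the gate itself is an open design choice for the E2 design doc.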
#### Option A2: **Route Cache Consolidation** (MEDIUM ROI, CACHE-BASED)

**Approach**: Extend C3 static routing to alloc gate (bypass `tiny_route_get()` entirely)
- **Benefit**: Eliminate `tiny_route_get()` call + route table load
- **Strategy**:
  - Check `g_tiny_static_route_ready` once (already cached)
  - Use `g_tiny_static_route_table[class_idx]` directly (already done in C3)
  - Remove duplicate `g_tiny_route[]` load (line 157)
- **Expected gain**: Reduce 5.77% route load overhead → **+0.5-1% overall**
- **Risk**: Low (extends existing C3 infrastructure)
- **Note**: Partial overlap with A1 (both reduce route overhead)

#### Option A3: **C7 Logging Branch Elimination** (LOW ROI, ENV-BASED)

**Approach**: Make C7 logging opt-in via ENV (default OFF in Mixed profile)
- **Benefit**: Eliminate 17.04% C7 logging overhead in Mixed workload
- **Strategy**:
  - Add `HAKMEM_TINY_C7_ROUTE_LOGGING=0` to MIXED_TINYV3_C7_SAFE
  - Keep logging enabled in C6_HEAVY profile (debugging use case)
- **Expected gain**: Eliminate 17.04% local overhead → **+2-3% in alloc_gate_fast** → **+0.3-0.5% overall**
- **Risk**: Very low (ENV-gated, reversible)
- **Caveat**: This is ~17% of *tiny_alloc_gate_fast's* self%, not 17% of total runtime

**Recommendation**: **Pursue A1 (Per-Class Fast Path)** as primary target
- Rationale: Structural change that eliminates root cause (runtime route determination)
- Precedent: FREE path hot/cold split achieved +13% with similar approach
- A2 can be quick win before A1 (low-hanging fruit)
- A3 is minor (local to tiny_alloc_gate_fast, small overall impact)

---

### Candidate 2: `free_tiny_fast_cold` (5.84% self%) ⚠️ ALREADY OPTIMIZED

**Current State**:
- Phase FREE-TINY-FAST-HOTCOLD-1: Hot/cold split → +13% gain
- Split C0-C3 (hot) from C4-C7 (cold)
- Cold path still shows 5.84% self% → expected (C4-C7 are ~50% of frees)

**Perf Annotate Hotspots**:

```asm
4.12%: call tiny_route_for_class.lto_priv.0   # Route determination (C4-C7)
3.95%: cmpl g_tiny_front_v3_snapshot_ready    # Front v3 snapshot check
3.63%: cmpl %fs:0xfffffffffffb3b00            # TLS ENV check (FREE_TINY_FAST_HOTCOLD)
```

**Root Causes**:
1. **Route determination** (4.12%): Necessary for C4-C7 (not LEGACY)
2. **ENV gate overhead** (3.95% + 3.63% = 7.58%): Repeated TLS checks
3. **Front v3 snapshot check** (3.95%): Lazy init overhead

**Optimization Opportunities**:

#### Option B1: **ENV Gate Consolidation** (MEDIUM ROI, CACHE-BASED)

**Approach**: Consolidate repeated ENV checks into single TLS snapshot
- **Benefit**: Reduce 7.58% ENV checking overhead
- **Strategy**:
  - Create `struct free_env_snapshot { uint8_t hotcold_on; uint8_t front_v3_on; ... }`
  - Cache in TLS (initialized once per thread)
  - Single TLS read per `free_tiny_fast_cold()` call
- **Expected gain**: Reduce 7.58% local overhead → **+0.4-0.6% overall** (5.84% * 7.58% = ~0.44%)
- **Risk**: Low (existing pattern in C3 static routing)
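A minimal sketch of the B1 snapshot under assumed names: `free_env_snapshot`, `free_env_get()`, and the consumer function are illustrative, not the real `free_tiny_fast_cold()`; the ENV variable spellings are the ones used elsewhere in this document.

```c
#include <stdint.h>
#include <stdlib.h>

/* Option B1 sketch: one per-thread snapshot replaces the repeated
 * hotcold / front-v3 TLS and getenv checks on the cold free path. */
struct free_env_snapshot {
    uint8_t initialized;   /* 0 until first use on this thread */
    uint8_t hotcold_on;    /* HAKMEM_FREE_TINY_FAST_HOTCOLD */
    uint8_t front_v3_on;   /* HAKMEM_TINY_FRONT_V3_ENABLED */
    uint8_t pad[5];        /* keep the struct at 8 bytes */
};

static __thread struct free_env_snapshot t_free_env;

static inline const struct free_env_snapshot* free_env_get(void) {
    if (__builtin_expect(!t_free_env.initialized, 0)) {
        const char* a = getenv("HAKMEM_FREE_TINY_FAST_HOTCOLD");
        const char* b = getenv("HAKMEM_TINY_FRONT_V3_ENABLED");
        t_free_env.hotcold_on  = (uint8_t)(a && *a && *a != '0');
        t_free_env.front_v3_on = (uint8_t)(b && *b && *b != '0');
        t_free_env.initialized = 1;
    }
    return &t_free_env;    /* one TLS base read per call after init */
}

/* Illustrative consumer: the cold free path branches on cached flags
 * instead of calling separate ENV gates. */
static inline int free_cold_uses_front_v3(void) {
    const struct free_env_snapshot* env = free_env_get();
    return env->hotcold_on && env->front_v3_on;
}
```

The same caching idea generalizes to every gate in Option C1 below, so B1 could also reuse that shared snapshot instead of a free-specific one.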
#### Option B2: **C4-C7 Route Specialization** (LOW ROI, STRUCTURAL)

**Approach**: Create per-class cold paths (similar to A1 for alloc)
- **Benefit**: Eliminate route determination for C4-C7
- **Strategy**: Split `free_tiny_fast_cold()` into 4 variants (C4, C5, C6, C7)
- **Expected gain**: Reduce 4.12% route overhead → **+0.24% overall**
- **Risk**: Medium (code duplication)
- **Note**: Lower priority than A1 (free path already optimized via hot/cold split)

**Recommendation**: **Pursue B1 (ENV Gate Consolidation)** as secondary target
- Rationale: Complements A1 (alloc gate specialization)
- Can be applied to both alloc and free paths (shared infrastructure)
- Lower ROI than A1, but easier to implement

---

### Candidate 3: ENV Gate Functions (Combined 3.26% self%) 🎯 CROSS-CUTTING

**Functions**:
- `tiny_c7_ultra_enabled_env.lto_priv.0` (1.28%)
- `tiny_front_v3_enabled.lto_priv.0` (1.01%)
- `tiny_metadata_cache_enabled.lto_priv.0` (0.97%)

**Current Pattern** (from source):

```c
static inline int tiny_front_v3_enabled(void) {
    static __thread int g = -1;
    if (__builtin_expect(g == -1, 0)) {
        const char* e = getenv("HAKMEM_TINY_FRONT_V3_ENABLED");
        g = (e && *e && *e != '0') ? 1 : 0;
    }
    return g;
}
```

**Root Causes**:
1. **TLS read overhead**: Each function reads a separate TLS variable (3 separate reads in hot path)
2. **Lazy init check**: `g == -1` branch on every call (cold, but still checked)
3. **Function call overhead**: Called from multiple hot paths (not always inlined)

**Optimization Opportunities**:

#### Option C1: **ENV Snapshot Consolidation** ⭐ HIGH ROI

**Approach**: Consolidate all ENV gates into single TLS snapshot struct
- **Benefit**: Reduce 3 TLS reads → 1 TLS read, eliminate 2 lazy init checks
- **Strategy**:

```c
#include <stdint.h>

struct hakmem_env_snapshot {
    uint8_t initialized;        // needed by the check in hakmem_env_get() below
    uint8_t front_v3_on;
    uint8_t metadata_cache_on;
    uint8_t c7_ultra_on;
    uint8_t free_hotcold_on;
    uint8_t static_route_on;
    // ... (8 bytes total, cache-friendly)
};

extern __thread struct hakmem_env_snapshot g_hakmem_env_snapshot;

void hakmem_env_snapshot_init(void);   // one-time per-thread init (sketched below)

static inline const struct hakmem_env_snapshot* hakmem_env_get(void) {
    if (__builtin_expect(!g_hakmem_env_snapshot.initialized, 0)) {
        hakmem_env_snapshot_init(); // One-time init
    }
    return &g_hakmem_env_snapshot;
}
```

- **Expected gain**: Eliminate 3.26% ENV overhead → **+3.0-3.5% overall**
- **Risk**: Medium (refactor all ENV gate call sites)
- **Precedent**: `tiny_front_v3_snapshot` already does this for front v3 config

**Recommendation**: **HIGHEST PRIORITY - Pursue C1 as Phase 4 PRIMARY TARGET**
- Rationale:
  - **3.26% direct overhead** (measurable in perf)
  - **Cross-cutting benefit**: Improves both alloc and free paths
  - **Structural improvement**: Reduces TLS pressure across entire codebase
  - **Clear precedent**: `tiny_front_v3_snapshot` pattern already proven
  - **Compounds with A1**: Per-class fast paths can check single ENV snapshot instead of multiple gates
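A possible shape for the `hakmem_env_snapshot_init()` referenced in the C1 sketch, assuming the existing gate helpers return `int`. Calling them once per thread keeps the ENV-variable parsing in one place during the migration; the gate names are the ones listed in Step 2 of the implementation plan below.

```c
#include <stdint.h>

/* Existing per-gate helpers (declared here for the sketch; return type assumed). */
int tiny_front_v3_enabled(void);
int tiny_metadata_cache_enabled(void);
int tiny_c7_ultra_enabled_env(void);
int free_tiny_fast_hotcold_enabled(void);
int tiny_static_route_enabled(void);

struct hakmem_env_snapshot {
    uint8_t initialized;
    uint8_t front_v3_on;
    uint8_t metadata_cache_on;
    uint8_t c7_ultra_on;
    uint8_t free_hotcold_on;
    uint8_t static_route_on;
    uint8_t pad[2];            /* explicit padding to reach the 8 bytes noted above */
};

__thread struct hakmem_env_snapshot g_hakmem_env_snapshot;

/* One-time per-thread init: snapshot every gate so hot paths only read
 * the 8-byte TLS struct afterwards. */
void hakmem_env_snapshot_init(void) {
    g_hakmem_env_snapshot.front_v3_on       = (uint8_t)tiny_front_v3_enabled();
    g_hakmem_env_snapshot.metadata_cache_on = (uint8_t)tiny_metadata_cache_enabled();
    g_hakmem_env_snapshot.c7_ultra_on       = (uint8_t)tiny_c7_ultra_enabled_env();
    g_hakmem_env_snapshot.free_hotcold_on   = (uint8_t)free_tiny_fast_hotcold_enabled();
    g_hakmem_env_snapshot.static_route_on   = (uint8_t)tiny_static_route_enabled();
    g_hakmem_env_snapshot.initialized       = 1;
}
```

An alternative is to parse the ENV variables directly here and delete the old gates in the same change; reusing the existing gates first keeps the A/B diff smaller.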
---

## Selected Next Target

### Phase 4 E1: **ENV Snapshot Consolidation** (PRIMARY TARGET)

**Function**: Consolidate all ENV gates into single TLS snapshot
**Expected Gain**: **+3.0-3.5%** (eliminate 3.26% ENV overhead)
**Risk**: Medium (refactor ENV gate call sites)
**Effort**: 2-3 days (create snapshot struct, refactor ~20 call sites, A/B test)

**Implementation Plan**:

#### Step 1: Create ENV Snapshot Infrastructure
- File: `core/box/hakmem_env_snapshot_box.h/c`
- Struct: `hakmem_env_snapshot` (8-byte TLS struct)
- API: `hakmem_env_get()` (lazy init, returns const snapshot*)

#### Step 2: Migrate ENV Gates
Priority order (by self% impact):
1. `tiny_c7_ultra_enabled_env()` (1.28%)
2. `tiny_front_v3_enabled()` (1.01%)
3. `tiny_metadata_cache_enabled()` (0.97%)
4. `free_tiny_fast_hotcold_enabled()` (in `free_tiny_fast_cold`)
5. `tiny_static_route_enabled()` (in routing hot path)

#### Step 3: Refactor Call Sites
- Replace: `if (tiny_front_v3_enabled()) { ... }`
- With: `const struct hakmem_env_snapshot* env = hakmem_env_get(); if (env->front_v3_on) { ... }`
- Count: ~20-30 call sites (grep analysis needed)
- See the before/after sketch following the success criteria

#### Step 4: A/B Test
- Baseline: Current mainline (Phase 3 + D1)
- Optimized: ENV snapshot consolidation
- Workloads: Mixed (10-run), C6-heavy (5-run)
- Threshold: +1.0% mean gain for GO

#### Step 5: Validation
- Health check: `verify_health_profiles.sh`
- Regression check: Ensure no performance loss on any profile

**Success Criteria**:
- [ ] ENV snapshot struct created
- [ ] All priority ENV gates migrated
- [ ] A/B test shows +2.5% or better (Mixed, 10-run)
- [ ] Health check passes
- [ ] Default ON in MIXED_TINYV3_C7_SAFE
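An illustrative before/after for Step 3. It assumes the header planned in Step 1 (`core/box/hakmem_env_snapshot_box.h`) exposes the struct and `hakmem_env_get()` from the C1 sketch; the surrounding functions are hypothetical stand-ins for real call sites.

```c
#include "core/box/hakmem_env_snapshot_box.h"   /* planned in Step 1 */

/* Existing per-gate helpers, still available during the migration. */
int tiny_front_v3_enabled(void);
int tiny_metadata_cache_enabled(void);
int tiny_c7_ultra_enabled_env(void);

/* Before: each gate call is a separate TLS read plus a lazy-init branch. */
static void hot_path_before(void) {
    if (tiny_front_v3_enabled())       { /* front v3 path */ }
    if (tiny_metadata_cache_enabled()) { /* metadata cache path */ }
    if (tiny_c7_ultra_enabled_env())   { /* C7 ultra path */ }
}

/* After: one snapshot read, then cheap byte tests on cached flags. */
static void hot_path_after(void) {
    const struct hakmem_env_snapshot* env = hakmem_env_get();
    if (env->front_v3_on)       { /* front v3 path */ }
    if (env->metadata_cache_on) { /* metadata cache path */ }
    if (env->c7_ultra_on)       { /* C7 ultra path */ }
}
```

A grep for the five gate names in Step 2 should enumerate the ~20-30 call sites mentioned above.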
---

## Alternative Targets (Lower Priority)

### Phase 4 E2: **Per-Class Alloc Fast Path** (SECONDARY TARGET)

**Function**: Specialize `tiny_alloc_gate_fast()` per class
**Expected Gain**: **+2-3%** (eliminate 15.74% route overhead in tiny_alloc_gate_fast)
**Risk**: Medium (code duplication, 8 variants to maintain)
**Effort**: 3-4 days (create 8 fast paths, refactor malloc wrapper, A/B test)

**Why Secondary?**:
- Higher implementation complexity (8 variants vs. 1 snapshot struct)
- Dependent on E1 success (ENV snapshot makes per-class paths cleaner)
- Can be pursued after E1 proves ENV consolidation pattern

---

## Candidate Summary Table

| Phase | Target | self% | Approach | Expected Gain | Risk | Priority |
|-------|--------|-------|----------|---------------|------|----------|
| **E1** | **ENV Snapshot Consolidation** | **3.26%** | **Caching** | **+3.0-3.5%** | **Medium** | **⭐ PRIMARY** |
| E2 | Per-Class Alloc Fast Path | 15.37% | Hot/Cold Split | +2-3% | Medium | Secondary |
| E3 | Free ENV Gate Consolidation | 7.58% (local) | Caching | +0.4-0.6% | Low | Tertiary |
| E4 | C7 Logging Elimination | 17.04% (local) | ENV-gated | +0.3-0.5% | Very Low | Quick Win |

---

## Shape Optimization Plateau Analysis

**Observation**: D3 (Alloc Gate Shape) achieved only +0.56% mean gain (NEUTRAL)

**Why Shape Optimizations Plateau?**:
1. **Branch Prediction Saturation**: Modern CPUs (Zen3/Zen4) already predict well-trained branches
   - LIKELY/UNLIKELY hints: Marginal benefit on hot paths
   - B3 (Routing Shape): +2.89% → Initial win (untrained branches)
   - D3 (Alloc Gate Shape): +0.56% → Diminishing returns (already trained)
2. **I-Cache Pressure**: Adding cold helpers can regress if not carefully placed
   - A3 (always_inline header): -4.00% on Mixed (I-cache thrashing)
   - D3: Neutral (no regression, but no clear win)
3. **TLS/Memory Overhead Dominates**: ENV gates (3.26%) > Branch misprediction (~0.5%)
   - Next optimization should target memory/TLS overhead, not branches

**Lessons Learned**:
- Shape optimizations: Good for first pass (B3 +2.89%), limited ROI after
- Next frontier: Caching (ENV snapshot), structural changes (per-class paths)
- Avoid: More LIKELY/UNLIKELY hints (saturated)
- Prefer: Eliminate checks entirely (snapshot) or specialize paths (per-class)

---

## Next Steps

1. **Phase 4 E1: ENV Snapshot Consolidation** (PRIMARY)
   - Create design doc: `PHASE4_E1_ENV_SNAPSHOT_CONSOLIDATION_DESIGN.md`
   - Implement snapshot infrastructure
   - Migrate priority ENV gates
   - A/B test (Mixed 10-run)
   - Target: +3.0% gain, promote to default if successful
2. **Phase 4 E2: Per-Class Alloc Fast Path** (SECONDARY)
   - Depends on E1 success
   - Design doc: `PHASE4_E2_PER_CLASS_ALLOC_FASTPATH_DESIGN.md`
   - Prototype C7-only fast path first (highest gain, least complexity)
   - A/B test incremental per-class specialization
   - Target: +2-3% gain
3. **Update CURRENT_TASK.md**:
   - Document perf findings
   - Note shape optimization plateau
   - List E1 as next target

---

## Appendix: Perf Command Reference

```bash
# Profile current mainline
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
perf record -F 999 -- ./bench_random_mixed_hakmem 40000000 400 1

# Generate report (sorted by symbol, no children aggregation)
perf report --stdio --no-children --sort=symbol | head -80

# Annotate specific function
perf annotate --stdio tiny_alloc_gate_fast.lto_priv.0 | head -100
```

**Key Metrics**:
- Samples: 922 (sufficient for 0.1% precision)
- Frequency: 999 Hz (balance between overhead and resolution)
- Iterations: 40M (runtime ~0.86s, enough for stable sampling)
- Workload: Mixed (ws=400, representative of production)

---

**Status**: Ready for Phase 4 E1 implementation
**Baseline**: 46.37M ops/s (Phase 3 + D1)
**Target**: 47.8M ops/s (+3.0% via ENV snapshot consolidation)