hakmem/docs/analysis/PHASE4_PERF_PROFILE_ANALYSIS.md
2025-12-14 00:48:03 +09:00

Phase 4: Perf Profile Analysis - Next Optimization Target

Date: 2025-12-14
Baseline: Phase 3 + D1 Complete (~46.37M ops/s, MIXED_TINYV3_C7_SAFE)
Profile: MIXED_TINYV3_C7_SAFE (20M iterations, ws=400, F=999Hz)
Samples: 922 samples, 3.1B cycles

Executive Summary

Current Status:

  • Phase 3 + D1: ~8.93% cumulative gain (37.5M → 51M ops/s baseline)
  • D3 (Alloc Gate Shape): NEUTRAL (+0.56% mean, -0.5% median) → frozen as research box
  • Learning: Shape optimizations (B3, D3) have limited ROI - branch prediction improvements plateau

Next Strategy: Identify self% ≥ 5% functions and apply different approaches (not shape-based):

  • Hot/cold split (separate rare paths)
  • Caching (avoid repeated expensive operations)
  • Inlining (reduce function call overhead)
  • ENV gate consolidation (reduce repeated TLS/getenv checks)

Perf Report Analysis

Top Functions (self% ≥ 5%, plus notable sub-5% entries)

Filtered for hakmem internal functions (excluding main, malloc/free wrappers):

| Rank | Function | self% | Category | Already Optimized? |
|------|----------|-------|----------|--------------------|
| 1 | tiny_alloc_gate_fast.lto_priv.0 | 15.37% | Alloc Gate | D3 shape (neutral) |
| 2 | free_tiny_fast_cold.lto_priv.0 | 5.84% | Free Path | Hot/cold split done |
| - | unified_cache_push.lto_priv.0 | 3.97% | Cache | Core primitive |
| - | tiny_c7_ultra_alloc.constprop.0 | 3.97% | C7 Alloc | Not optimized |
| - | tiny_region_id_write_header.lto_priv.0 | 2.50% | Header | A3 inlining (NO-GO) |
| - | tiny_route_for_class.lto_priv.0 | 2.28% | Routing | C3 static cache done |

Key Observations:

  1. tiny_alloc_gate_fast (15.37%): Still dominant despite D3 shape optimization
  2. free_tiny_fast_cold (5.84%): Cold path still hot (ENV gate overhead?)
  3. ENV gate functions (1-2% each): tiny_c7_ultra_enabled_env (1.28%), tiny_front_v3_enabled (1.01%), tiny_metadata_cache_enabled (0.97%)
    • Combined: ~3.26% on ENV checking overhead
    • Repeated TLS reads + getenv lazy init

Detailed Candidate Analysis

Candidate 1: tiny_alloc_gate_fast (15.37% self%) TOP TARGET

Current State:

  • Phase D3: Alloc gate shape optimization → NEUTRAL (+0.56% mean, -0.5% median)
  • Approach: Branch hints (LIKELY/UNLIKELY) + route table direct access
  • Result: Limited improvement (branch prediction already well-tuned)

Perf Annotate Hotspots (lines with >5% samples):

9.97%: cmp $0x2,%r13d              # Route comparison (ROUTE_POOL_ONLY check)
5.77%: movzbl (%rsi,%rbx,1),%r13d # Route table load (g_tiny_route)
11.32%: mov 0x280aea(%rip),%eax   # rel_route_logged.26 (C7 logging check)
5.72%: test %eax,%eax             # Route logging branch

Root Causes:

  1. Route determination overhead (9.97% + 5.77% = 15.74%):
    • g_tiny_route[class_idx & 7] load + comparison
    • Branch on ROUTE_POOL_ONLY (rare path, but checked every call)
  2. C7 logging overhead (11.32% + 5.72% = 17.04%):
    • rel_route_logged.26 TLS check (C7-specific, rare in Mixed)
    • Branch misprediction when C7 is ~10% of traffic
  3. ENV gate overhead:
    • alloc_gate_shape_enabled() check (line 151)
    • tiny_route_get() falls back to slow path (line 186)

Optimization Opportunities:

Option A1: Per-Class Fast Path Specialization (HIGH ROI, STRUCTURAL)

Approach: Create specialized tiny_alloc_gate_fast_c{0-7}() for each class

  • Benefit: Eliminate runtime route determination (static per-class decision)
  • Strategy:
    • C0-C3 (LEGACY): Direct to malloc_tiny_fast_for_class(), skip route check
    • C4-C6 (MID/V3): Direct to small_policy path, skip LEGACY check
    • C7 (ULTRA): Direct to tiny_c7_ultra_alloc(), skip all route logic
  • Expected gain: Eliminate 15.74% route overhead → +2-3% overall
  • Risk: Medium (code duplication, must maintain 8 variants)
  • Precedent: FREE path already does this via HAKMEM_FREE_TINY_FAST_HOTCOLD (+13% win)
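A minimal sketch of the per-class dispatch shape, assuming the route mapping above; function names and return values are illustrative stand-ins, not the actual hakmem API:

```c
#include <stddef.h>

typedef void* (*tiny_alloc_fn)(size_t);

/* Stand-ins for the real per-route allocators (hypothetical names). */
static void* alloc_legacy(size_t n)   { (void)n; return (void*)1; }
static void* alloc_small_v3(size_t n) { (void)n; return (void*)2; }
static void* alloc_c7_ultra(size_t n) { (void)n; return (void*)3; }

/* Per-class gates: the route is baked in at build time, so the hot path
 * has no g_tiny_route[] load and no ROUTE_POOL_ONLY comparison. */
static void* gate_legacy(size_t n) { return alloc_legacy(n); }
static void* gate_v3(size_t n)     { return alloc_small_v3(n); }
static void* gate_c7(size_t n)     { return alloc_c7_ultra(n); }

/* C0-C3 -> LEGACY, C4-C6 -> MID/V3, C7 -> ULTRA (per the strategy above). */
static const tiny_alloc_fn g_gate_by_class[8] = {
    gate_legacy, gate_legacy, gate_legacy, gate_legacy,  /* C0-C3 */
    gate_v3, gate_v3, gate_v3,                           /* C4-C6 */
    gate_c7                                              /* C7    */
};

void* tiny_alloc_gate_dispatch(unsigned class_idx, size_t n) {
    return g_gate_by_class[class_idx & 7](n);
}
```

The table keeps the sketch compact; in the real A1 the malloc wrapper would call the specialized gate directly for a known class, avoiding even the indirect call.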

Option A2: Route Cache Consolidation (MEDIUM ROI, CACHE-BASED)

Approach: Extend C3 static routing to alloc gate (bypass tiny_route_get() entirely)

  • Benefit: Eliminate tiny_route_get() call + route table load
  • Strategy:
    • Check g_tiny_static_route_ready once (already cached)
    • Use g_tiny_static_route_table[class_idx] directly (already done in C3)
    • Remove duplicate g_tiny_route[] load (line 157)
  • Expected gain: Reduce 5.77% route load overhead → +0.5-1% overall
  • Risk: Low (extends existing C3 infrastructure)
  • Note: Partial overlap with A1 (both reduce route overhead)
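The A2 hot-path shape might look like the following; the table names mirror the C3 infrastructure mentioned above, but the route codes and build logic are assumptions for illustration:

```c
#include <stdint.h>

/* Illustrative route codes (not the real enum values). */
enum { ROUTE_LEGACY = 0, ROUTE_V3 = 1, ROUTE_ULTRA = 2 };

/* Hypothetical mirror of the C3 static-route tables named in the text. */
static int     g_tiny_static_route_ready = 0;
static uint8_t g_tiny_static_route_table[8];

static void tiny_static_route_build(void) {
    for (int c = 0; c < 8; c++)
        g_tiny_static_route_table[c] = (uint8_t)((c < 4) ? ROUTE_LEGACY : ROUTE_V3);
    g_tiny_static_route_table[7] = ROUTE_ULTRA;
    g_tiny_static_route_ready = 1;
}

/* A2 shape: one ready check plus one table load on the hot path --
 * no tiny_route_get() call and no duplicate g_tiny_route[] load. */
uint8_t tiny_route_cached(unsigned class_idx) {
    if (__builtin_expect(!g_tiny_static_route_ready, 0))
        tiny_static_route_build();
    return g_tiny_static_route_table[class_idx & 7];
}
```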

Option A3: C7 Logging Branch Elimination (LOW ROI, ENV-BASED)

Approach: Make C7 logging opt-in via ENV (default OFF in Mixed profile)

  • Benefit: Eliminate 17.04% C7 logging overhead in Mixed workload
  • Strategy:
    • Add HAKMEM_TINY_C7_ROUTE_LOGGING=0 to MIXED_TINYV3_C7_SAFE
    • Keep logging enabled in C6_HEAVY profile (debugging use case)
  • Expected gain: Eliminate 17.04% local overhead → +2-3% within tiny_alloc_gate_fast, +0.3-0.5% overall
  • Risk: Very low (ENV-gated, reversible)
  • Caveat: This is ~17% of tiny_alloc_gate_fast's self%, not 17% of total runtime
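A sketch of the A3 gate shape, using the HAKMEM_TINY_C7_ROUTE_LOGGING variable proposed above (gate only; the real logging body is elided):

```c
#include <stdlib.h>

/* A3 sketch: the C7 route-logging branch is gated by a single cached ENV
 * read, default OFF, so Mixed runs skip the rel_route_logged bookkeeping. */
static int c7_route_logging_enabled(void) {
    static int cached = -1;                      /* -1 = not yet read */
    if (__builtin_expect(cached == -1, 0)) {
        const char* e = getenv("HAKMEM_TINY_C7_ROUTE_LOGGING");
        cached = (e && *e && *e != '0') ? 1 : 0; /* default OFF */
    }
    return cached;
}
```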

Recommendation: Pursue A1 (Per-Class Fast Path) as primary target

  • Rationale: Structural change that eliminates root cause (runtime route determination)
  • Precedent: FREE path hot/cold split achieved +13% with similar approach
  • A2 can be quick win before A1 (low-hanging fruit)
  • A3 is minor (local to tiny_alloc_gate_fast, small overall impact)

Candidate 2: free_tiny_fast_cold (5.84% self%) ⚠️ ALREADY OPTIMIZED

Current State:

  • Phase FREE-TINY-FAST-HOTCOLD-1: Hot/cold split → +13% gain
  • Split C0-C3 (hot) from C4-C7 (cold)
  • Cold path still shows 5.84% self% → expected (C4-C7 are ~50% of frees)

Perf Annotate Hotspots:

4.12%: call tiny_route_for_class.lto_priv.0  # Route determination (C4-C7)
3.95%: cmpl g_tiny_front_v3_snapshot_ready   # Front v3 snapshot check
3.63%: cmpl %fs:0xfffffffffffb3b00           # TLS ENV check (FREE_TINY_FAST_HOTCOLD)

Root Causes:

  1. Route determination (4.12%): Necessary for C4-C7 (not LEGACY)
  2. ENV gate overhead (3.95% + 3.63% = 7.58%): Repeated TLS checks
  3. Front v3 snapshot check (3.95%): Lazy init overhead

Optimization Opportunities:

Option B1: ENV Gate Consolidation (MEDIUM ROI, CACHE-BASED)

Approach: Consolidate repeated ENV checks into single TLS snapshot

  • Benefit: Reduce 7.58% ENV checking overhead
  • Strategy:
    • Create struct free_env_snapshot { uint8_t hotcold_on; uint8_t front_v3_on; ... }
    • Cache in TLS (initialized once per thread)
    • Single TLS read per free_tiny_fast_cold() call
  • Expected gain: Reduce 7.58% local overhead → +0.4-0.6% overall (5.84% * 7.58% = ~0.44%)
  • Risk: Low (existing pattern in C3 static routing)

Option B2: C4-C7 Route Specialization (LOW ROI, STRUCTURAL)

Approach: Create per-class cold paths (similar to A1 for alloc)

  • Benefit: Eliminate route determination for C4-C7
  • Strategy: Split free_tiny_fast_cold() into 4 variants (C4, C5, C6, C7)
  • Expected gain: Reduce 4.12% route overhead → +0.24% overall
  • Risk: Medium (code duplication)
  • Note: Lower priority than A1 (free path already optimized via hot/cold split)

Recommendation: Pursue B1 (ENV Gate Consolidation) as secondary target

  • Rationale: Complements A1 (alloc gate specialization)
  • Can be applied to both alloc and free paths (shared infrastructure)
  • Lower ROI than A1, but easier to implement

Candidate 3: ENV Gate Functions (Combined 3.26% self%) 🎯 CROSS-CUTTING

Functions:

  • tiny_c7_ultra_enabled_env.lto_priv.0 (1.28%)
  • tiny_front_v3_enabled.lto_priv.0 (1.01%)
  • tiny_metadata_cache_enabled.lto_priv.0 (0.97%)

Current Pattern (from source):

static inline int tiny_front_v3_enabled(void) {
    static __thread int g = -1;
    if (__builtin_expect(g == -1, 0)) {
        const char* e = getenv("HAKMEM_TINY_FRONT_V3_ENABLED");
        g = (e && *e && *e != '0') ? 1 : 0;
    }
    return g;
}

Root Causes:

  1. TLS read overhead: Each function reads separate TLS variable (3 separate reads in hot path)
  2. Lazy init check: g == -1 branch on every call (cold, but still checked)
  3. Function call overhead: Called from multiple hot paths (not always inlined)

Optimization Opportunities:

Option C1: ENV Snapshot Consolidation HIGH ROI

Approach: Consolidate all ENV gates into single TLS snapshot struct

  • Benefit: Reduce 3 TLS reads → 1 TLS read, eliminate 2 lazy init checks
  • Strategy:
    struct hakmem_env_snapshot {
        uint8_t initialized;
        uint8_t front_v3_on;
        uint8_t metadata_cache_on;
        uint8_t c7_ultra_on;
        uint8_t free_hotcold_on;
        uint8_t static_route_on;
        // ... (8 bytes total, cache-friendly)
    };
    
    extern __thread struct hakmem_env_snapshot g_hakmem_env_snapshot;
    
    static inline const struct hakmem_env_snapshot* hakmem_env_get(void) {
        if (__builtin_expect(!g_hakmem_env_snapshot.initialized, 0)) {
            hakmem_env_snapshot_init();  // One-time init, sets initialized = 1
        }
        return &g_hakmem_env_snapshot;
    }
    
    
  • Expected gain: Eliminate 3.26% ENV overhead → +3.0-3.5% overall
  • Risk: Medium (refactor all ENV gate call sites)
  • Precedent: tiny_front_v3_snapshot already does this for front v3 config
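A self-contained sketch of the one-time init side (the struct is duplicated here so the sketch compiles standalone; ENV variable names other than HAKMEM_TINY_FRONT_V3_ENABLED are assumptions):

```c
#include <stdint.h>
#include <stdlib.h>

struct hakmem_env_snapshot {
    uint8_t initialized;
    uint8_t front_v3_on;
    uint8_t metadata_cache_on;
    uint8_t c7_ultra_on;
};

static __thread struct hakmem_env_snapshot g_hakmem_env_snapshot;

/* Shared "is this ENV var set to non-zero?" helper. */
static uint8_t env_on(const char* name) {
    const char* e = getenv(name);
    return (uint8_t)((e && *e && *e != '0') ? 1 : 0);
}

/* One getenv pass per thread; every later check is a plain byte read. */
void hakmem_env_snapshot_init(void) {
    g_hakmem_env_snapshot.front_v3_on       = env_on("HAKMEM_TINY_FRONT_V3_ENABLED");
    g_hakmem_env_snapshot.metadata_cache_on = env_on("HAKMEM_TINY_METADATA_CACHE"); /* name assumed */
    g_hakmem_env_snapshot.c7_ultra_on       = env_on("HAKMEM_TINY_C7_ULTRA");       /* name assumed */
    g_hakmem_env_snapshot.initialized       = 1;
}

const struct hakmem_env_snapshot* hakmem_env_get(void) {
    if (__builtin_expect(!g_hakmem_env_snapshot.initialized, 0))
        hakmem_env_snapshot_init();
    return &g_hakmem_env_snapshot;
}
```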

Recommendation: HIGHEST PRIORITY - Pursue C1 as Phase 4 PRIMARY TARGET

  • Rationale:
    • 3.26% direct overhead (measurable in perf)
    • Cross-cutting benefit: Improves both alloc and free paths
    • Structural improvement: Reduces TLS pressure across entire codebase
    • Clear precedent: tiny_front_v3_snapshot pattern already proven
    • Compounds with A1: Per-class fast paths can check single ENV snapshot instead of multiple gates

Selected Next Target

Phase 4 E1: ENV Snapshot Consolidation (PRIMARY TARGET)

Function: Consolidate all ENV gates into single TLS snapshot
Expected Gain: +3.0-3.5% (eliminate 3.26% ENV overhead)
Risk: Medium (refactor ENV gate call sites)
Effort: 2-3 days (create snapshot struct, refactor ~20 call sites, A/B test)

Implementation Plan:

Step 1: Create ENV Snapshot Infrastructure

  • File: core/box/hakmem_env_snapshot_box.h/c
  • Struct: hakmem_env_snapshot (8-byte TLS struct)
  • API: hakmem_env_get() (lazy init, returns const snapshot*)

Step 2: Migrate ENV Gates

Priority order (by self% impact):

  1. tiny_c7_ultra_enabled_env() (1.28%)
  2. tiny_front_v3_enabled() (1.01%)
  3. tiny_metadata_cache_enabled() (0.97%)
  4. free_tiny_fast_hotcold_enabled() (in free_tiny_fast_cold)
  5. tiny_static_route_enabled() (in routing hot path)

Step 3: Refactor Call Sites

  • Replace: if (tiny_front_v3_enabled()) { ... }
  • With: const hakmem_env_snapshot* env = hakmem_env_get(); if (env->front_v3_on) { ... }
  • Count: ~20-30 call sites (grep analysis needed)
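The call-site change can be sketched as a before/after pair; the stubs are hypothetical stand-ins for the Step 1 API so both shapes compile:

```c
/* Minimal stubs standing in for the real gates (hypothetical API). */
struct hakmem_env_snapshot { unsigned char front_v3_on; };
static struct hakmem_env_snapshot g_snap = { 1 };
static const struct hakmem_env_snapshot* hakmem_env_get(void) { return &g_snap; }
static int tiny_front_v3_enabled(void) { return g_snap.front_v3_on; }

/* Before: each gate is a separate call with its own TLS read
 * and lazy-init branch. */
int path_before(void) {
    return tiny_front_v3_enabled() ? 1 : 0;
}

/* After: one snapshot pointer, then plain byte reads for every gate. */
int path_after(void) {
    const struct hakmem_env_snapshot* env = hakmem_env_get();
    return env->front_v3_on ? 1 : 0;
}
```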

Step 4: A/B Test

  • Baseline: Current mainline (Phase 3 + D1)
  • Optimized: ENV snapshot consolidation
  • Workloads: Mixed (10-run), C6-heavy (5-run)
  • Threshold: +1.0% mean gain for GO

Step 5: Validation

  • Health check: verify_health_profiles.sh
  • Regression check: Ensure no performance loss on any profile

Success Criteria:

  • ENV snapshot struct created
  • All priority ENV gates migrated
  • A/B test shows +2.5% or better (Mixed, 10-run)
  • Health check passes
  • Default ON in MIXED_TINYV3_C7_SAFE

Alternative Targets (Lower Priority)

Phase 4 E2: Per-Class Alloc Fast Path (SECONDARY TARGET)

Function: Specialize tiny_alloc_gate_fast() per class
Expected Gain: +2-3% (eliminate 15.74% route overhead in tiny_alloc_gate_fast)
Risk: Medium (code duplication, 8 variants to maintain)
Effort: 3-4 days (create 8 fast paths, refactor malloc wrapper, A/B test)

Why Secondary?:

  • Higher implementation complexity (8 variants vs. 1 snapshot struct)
  • Dependent on E1 success (ENV snapshot makes per-class paths cleaner)
  • Can be pursued after E1 proves ENV consolidation pattern

Candidate Summary Table

| Phase | Target | self% | Approach | Expected Gain | Risk | Priority |
|-------|--------|-------|----------|---------------|------|----------|
| E1 | ENV Snapshot Consolidation | 3.26% | Caching | +3.0-3.5% | Medium | PRIMARY |
| E2 | Per-Class Alloc Fast Path | 15.37% | Hot/Cold Split | +2-3% | Medium | Secondary |
| E3 | Free ENV Gate Consolidation | 7.58% (local) | Caching | +0.4-0.6% | Low | Tertiary |
| E4 | C7 Logging Elimination | 17.04% (local) | ENV-gated | +0.3-0.5% | Very Low | Quick Win |

Shape Optimization Plateau Analysis

Observation: D3 (Alloc Gate Shape) achieved only +0.56% mean gain (NEUTRAL)

Why Shape Optimizations Plateau?:

  1. Branch Prediction Saturation: Modern CPUs (Zen3/Zen4) already predict well-trained branches

    • LIKELY/UNLIKELY hints: Marginal benefit on hot paths
    • B3 (Routing Shape): +2.89% → Initial win (untrained branches)
    • D3 (Alloc Gate Shape): +0.56% → Diminishing returns (already trained)
  2. I-Cache Pressure: Adding cold helpers can regress if not carefully placed

    • A3 (always_inline header): -4.00% on Mixed (I-cache thrashing)
    • D3: Neutral (no regression, but no clear win)
  3. TLS/Memory Overhead Dominates: ENV gates (3.26%) > Branch misprediction (~0.5%)

    • Next optimization should target memory/TLS overhead, not branches

Lessons Learned:

  • Shape optimizations: Good for first pass (B3 +2.89%), limited ROI after
  • Next frontier: Caching (ENV snapshot), structural changes (per-class paths)
  • Avoid: More LIKELY/UNLIKELY hints (saturated)
  • Prefer: Eliminate checks entirely (snapshot) or specialize paths (per-class)

Next Steps

  1. Phase 4 E1: ENV Snapshot Consolidation (PRIMARY)

    • Create design doc: PHASE4_E1_ENV_SNAPSHOT_CONSOLIDATION_DESIGN.md
    • Implement snapshot infrastructure
    • Migrate priority ENV gates
    • A/B test (Mixed 10-run)
    • Target: +3.0% gain, promote to default if successful
  2. Phase 4 E2: Per-Class Alloc Fast Path (SECONDARY)

    • Depends on E1 success
    • Design doc: PHASE4_E2_PER_CLASS_ALLOC_FASTPATH_DESIGN.md
    • Prototype C7-only fast path first (highest gain, least complexity)
    • A/B test incremental per-class specialization
    • Target: +2-3% gain
  3. Update CURRENT_TASK.md:

    • Document perf findings
    • Note shape optimization plateau
    • List E1 as next target

Appendix: Perf Command Reference

# Profile current mainline
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
perf record -F 999 -- ./bench_random_mixed_hakmem 40000000 400 1

# Generate report (sorted by symbol, no children aggregation)
perf report --stdio --no-children --sort=symbol | head -80

# Annotate specific function
perf annotate --stdio tiny_alloc_gate_fast.lto_priv.0 | head -100

Key Metrics:

  • Samples: 922 (sufficient for 0.1% precision)
  • Frequency: 999 Hz (balance between overhead and resolution)
  • Iterations: 40M (runtime ~0.86s, enough for stable sampling)
  • Workload: Mixed (ws=400, representative of production)

Status: Ready for Phase 4 E1 implementation
Baseline: 46.37M ops/s (Phase 3 + D1)
Target: 47.8M ops/s (+3.0% via ENV snapshot consolidation)