Phase 4: Perf Profile Analysis - Next Optimization Target
Date: 2025-12-14
Baseline: Phase 3 + D1 Complete (~46.37M ops/s, MIXED_TINYV3_C7_SAFE)
Profile: MIXED_TINYV3_C7_SAFE (20M iterations, ws=400, F=999Hz)
Samples: 922 samples, 3.1B cycles
Executive Summary
Current Status:
- Phase 3 + D1: ~8.93% cumulative gain (37.5M → 51M ops/s baseline)
- D3 (Alloc Gate Shape): NEUTRAL (+0.56% mean, -0.5% median) → frozen as research box
- Learning: Shape optimizations (B3, D3) have limited ROI - branch prediction improvements plateau
Next Strategy: Identify functions with self% ≥ 5% and apply approaches other than shape tuning:
- Hot/cold split (separate rare paths)
- Caching (avoid repeated expensive operations)
- Inlining (reduce function call overhead)
- ENV gate consolidation (reduce repeated TLS/getenv checks)
Perf Report Analysis
Top Functions (self% ≥ 5%)
Filtered for hakmem internal functions (excluding main, malloc/free wrappers):
| Rank | Function | self% | Category | Already Optimized? |
|---|---|---|---|---|
| 1 | tiny_alloc_gate_fast.lto_priv.0 | 15.37% | Alloc Gate | D3 shape (neutral) |
| 2 | free_tiny_fast_cold.lto_priv.0 | 5.84% | Free Path | Hot/cold split done |
| - | unified_cache_push.lto_priv.0 | 3.97% | Cache | Core primitive |
| - | tiny_c7_ultra_alloc.constprop.0 | 3.97% | C7 Alloc | Not optimized |
| - | tiny_region_id_write_header.lto_priv.0 | 2.50% | Header | A3 inlining (NO-GO) |
| - | tiny_route_for_class.lto_priv.0 | 2.28% | Routing | C3 static cache done |
Key Observations:
- tiny_alloc_gate_fast (15.37%): Still dominant despite D3 shape optimization
- free_tiny_fast_cold (5.84%): Cold path still hot (ENV gate overhead?)
- ENV gate functions (1-2% each): `tiny_c7_ultra_enabled_env` (1.28%), `tiny_front_v3_enabled` (1.01%), `tiny_metadata_cache_enabled` (0.97%)
  - Combined: ~3.26% spent on ENV-checking overhead
  - Cause: repeated TLS reads plus getenv lazy-init checks
Detailed Candidate Analysis
Candidate 1: tiny_alloc_gate_fast (15.37% self%) ⭐ TOP TARGET
Current State:
- Phase D3: Alloc gate shape optimization → NEUTRAL (+0.56% mean, -0.5% median)
- Approach: Branch hints (LIKELY/UNLIKELY) + route table direct access
- Result: Limited improvement (branch prediction already well-tuned)
Perf Annotate Hotspots (lines with >5% samples):
```
 9.97%: cmp    $0x2,%r13d               # Route comparison (ROUTE_POOL_ONLY check)
 5.77%: movzbl (%rsi,%rbx,1),%r13d      # Route table load (g_tiny_route)
11.32%: mov    0x280aea(%rip),%eax      # rel_route_logged.26 (C7 logging check)
 5.72%: test   %eax,%eax                # Route logging branch
```
Root Causes:
- Route determination overhead (9.97% + 5.77% = 15.74%):
  - `g_tiny_route[class_idx & 7]` load + comparison
  - Branch on `ROUTE_POOL_ONLY` (rare path, but checked on every call)
- C7 logging overhead (11.32% + 5.72% = 17.04%):
  - `rel_route_logged.26` TLS check (C7-specific, rare in Mixed)
  - Branch misprediction when C7 is ~10% of traffic
- ENV gate overhead:
  - `alloc_gate_shape_enabled()` check (line 151)
  - `tiny_route_get()` falls back to the slow path (line 186)
Optimization Opportunities:
Option A1: Per-Class Fast Path Specialization (HIGH ROI, STRUCTURAL)
Approach: Create specialized tiny_alloc_gate_fast_c{0-7}() for each class
- Benefit: Eliminate runtime route determination (static per-class decision)
- Strategy (see the sketch after this list):
  - C0-C3 (LEGACY): direct to `malloc_tiny_fast_for_class()`, skip the route check
  - C4-C6 (MID/V3): direct to the small_policy path, skip the LEGACY check
  - C7 (ULTRA): direct to `tiny_c7_ultra_alloc()`, skip all route logic
- Expected gain: Eliminate 15.74% route overhead → +2-3% overall
- Risk: Medium (code duplication, must maintain 8 variants)
- Precedent: the FREE path already does this via `HAKMEM_FREE_TINY_FAST_HOTCOLD` (+13% win)
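A minimal sketch of the idea, not the actual hakmem code: the `_c0`/`_c7` variant names, the dispatch table, and the prototypes assumed for `malloc_tiny_fast_for_class()` and `tiny_c7_ultra_alloc()` are illustrative assumptions.

```c
#include <stddef.h>

/* Assumed prototypes; the real entry points live in the hakmem tiny front end. */
void* malloc_tiny_fast_for_class(int class_idx, size_t size);
void* tiny_c7_ultra_alloc(size_t size);

/* Per-class gate variants: the routing decision is baked in at build time, so the
 * g_tiny_route[] load and the ROUTE_POOL_ONLY compare disappear from the hot path. */
static void* tiny_alloc_gate_fast_c0(size_t size) {
    return malloc_tiny_fast_for_class(0, size);   /* C0 (LEGACY): no route check */
}
static void* tiny_alloc_gate_fast_c7(size_t size) {
    return tiny_c7_ultra_alloc(size);             /* C7 (ULTRA): no route or logging branch */
}

/* The malloc wrapper selects the variant from the class index once; with LTO the
 * indirect call can be flattened, or a switch statement can be used instead. */
typedef void* (*tiny_alloc_gate_fn)(size_t);
static const tiny_alloc_gate_fn g_tiny_alloc_gate_by_class[8] = {
    [0] = tiny_alloc_gate_fast_c0,
    /* [1]..[6] = the corresponding _c1.._c6 variants */
    [7] = tiny_alloc_gate_fast_c7,
};
```

Whether dispatch is a function-pointer table or a switch is a secondary choice; the point is that each variant contains no per-call route determination.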
Option A2: Route Cache Consolidation (MEDIUM ROI, CACHE-BASED)
Approach: Extend C3 static routing to alloc gate (bypass tiny_route_get() entirely)
- Benefit: Eliminates the `tiny_route_get()` call and the route table load
- Strategy (see the sketch after this list):
  - Check `g_tiny_static_route_ready` once (already cached)
  - Use `g_tiny_static_route_table[class_idx]` directly (already done in C3)
  - Remove the duplicate `g_tiny_route[]` load (line 157)
- Expected gain: Reduce 5.77% route load overhead → +0.5-1% overall
- Risk: Low (extends existing C3 infrastructure)
- Note: Partial overlap with A1 (both reduce route overhead)
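A minimal sketch under the assumption that the C3 globals named above exist roughly as shown; the exact types, the `& 7` mask, and the fallback to `tiny_route_for_class()` are guesses for illustration.

```c
#include <stdint.h>

/* Assumed declarations; the real C3 static-route state lives in the routing code. */
extern int     g_tiny_static_route_ready;        /* set once after snapshot init (assumed type) */
extern uint8_t g_tiny_static_route_table[8];     /* per-class route codes (assumed type) */
uint8_t tiny_route_for_class(unsigned class_idx);/* existing dynamic lookup (assumed prototype) */

static inline uint8_t alloc_gate_route(unsigned class_idx) {
    if (__builtin_expect(g_tiny_static_route_ready, 1)) {
        /* Fast path: one cached table load, no tiny_route_get() call and no
         * duplicate g_tiny_route[] load. */
        return g_tiny_static_route_table[class_idx & 7];
    }
    /* Cold path: fall back to the dynamic lookup until the static table is ready. */
    return tiny_route_for_class(class_idx);
}
```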
Option A3: C7 Logging Branch Elimination (LOW ROI, ENV-BASED)
Approach: Make C7 logging opt-in via ENV (default OFF in Mixed profile)
- Benefit: Eliminate 17.04% C7 logging overhead in Mixed workload
- Strategy (a gate sketch follows after this list):
  - Add `HAKMEM_TINY_C7_ROUTE_LOGGING=0` to MIXED_TINYV3_C7_SAFE
  - Keep logging enabled in the C6_HEAVY profile (debugging use case)
- Expected gain: Eliminate 17.04% local overhead → +2-3% in alloc_gate_fast → +0.3-0.5% overall
- Risk: Very low (ENV-gated, reversible)
- Caveat: This is ~17% of tiny_alloc_gate_fast's self%, not 17% of total runtime
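A sketch of how the logging branch could be gated, reusing the lazy-TLS pattern the existing ENV gates use; `tiny_c7_route_logging_enabled()` is a hypothetical helper name, and `HAKMEM_TINY_C7_ROUTE_LOGGING` is the variable proposed above.

```c
#include <stdlib.h>

/* Hypothetical gate: logging stays off unless HAKMEM_TINY_C7_ROUTE_LOGGING is set
 * to a non-zero value, so MIXED_TINYV3_C7_SAFE pays only a cached byte test. */
static inline int tiny_c7_route_logging_enabled(void) {
    static __thread int g = -1;                   /* -1 = not resolved yet */
    if (__builtin_expect(g == -1, 0)) {
        const char* e = getenv("HAKMEM_TINY_C7_ROUTE_LOGGING");
        g = (e && *e && *e != '0') ? 1 : 0;       /* unset or "0" means OFF */
    }
    return g;
}
```

If E1 lands, this flag would naturally become one more byte in the consolidated snapshot rather than a separate TLS gate.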
Recommendation: Pursue A1 (Per-Class Fast Path) as the primary target for this candidate
- Rationale: Structural change that eliminates root cause (runtime route determination)
- Precedent: FREE path hot/cold split achieved +13% with similar approach
- A2 can be a quick win before A1 (low-hanging fruit)
- A3 is minor (local to tiny_alloc_gate_fast, small overall impact)
Candidate 2: free_tiny_fast_cold (5.84% self%) ⚠️ ALREADY OPTIMIZED
Current State:
- Phase FREE-TINY-FAST-HOTCOLD-1: Hot/cold split → +13% gain
- Split C0-C3 (hot) from C4-C7 (cold)
- Cold path still shows 5.84% self% → expected (C4-C7 are ~50% of frees)
Perf Annotate Hotspots:
```
4.12%: call tiny_route_for_class.lto_priv.0    # Route determination (C4-C7)
3.95%: cmpl g_tiny_front_v3_snapshot_ready     # Front v3 snapshot check
3.63%: cmpl %fs:0xfffffffffffb3b00             # TLS ENV check (FREE_TINY_FAST_HOTCOLD)
```
Root Causes:
- Route determination (4.12%): Necessary for C4-C7 (not LEGACY)
- ENV gate overhead (3.95% + 3.63% = 7.58%): Repeated TLS checks
- Front v3 snapshot check (3.95%): Lazy init overhead
Optimization Opportunities:
Option B1: ENV Gate Consolidation (MEDIUM ROI, CACHE-BASED)
Approach: Consolidate repeated ENV checks into single TLS snapshot
- Benefit: Reduce 7.58% ENV checking overhead
- Strategy:
  - Create `struct free_env_snapshot { uint8_t hotcold_on; uint8_t front_v3_on; ... }`
  - Cache it in TLS (initialized once per thread)
  - Single TLS read per `free_tiny_fast_cold()` call
- Expected gain: Reduce 7.58% local overhead → +0.4-0.6% overall (5.84% * 7.58% = ~0.44%)
- Risk: Low (existing pattern in C3 static routing)
Option B2: C4-C7 Route Specialization (LOW ROI, STRUCTURAL)
Approach: Create per-class cold paths (similar to A1 for alloc)
- Benefit: Eliminate route determination for C4-C7
- Strategy: split `free_tiny_fast_cold()` into 4 variants (C4, C5, C6, C7)
- Expected gain: Reduce 4.12% route overhead → +0.24% overall
- Risk: Medium (code duplication)
- Note: Lower priority than A1 (free path already optimized via hot/cold split)
Recommendation: Pursue B1 (ENV Gate Consolidation) as secondary target
- Rationale: Complements A1 (alloc gate specialization)
- Can be applied to both alloc and free paths (shared infrastructure)
- Lower ROI than A1, but easier to implement
Candidate 3: ENV Gate Functions (Combined 3.26% self%) 🎯 CROSS-CUTTING
Functions:
- `tiny_c7_ultra_enabled_env.lto_priv.0` (1.28%)
- `tiny_front_v3_enabled.lto_priv.0` (1.01%)
- `tiny_metadata_cache_enabled.lto_priv.0` (0.97%)
Current Pattern (from source):
```c
static inline int tiny_front_v3_enabled(void) {
    static __thread int g = -1;
    if (__builtin_expect(g == -1, 0)) {
        const char* e = getenv("HAKMEM_TINY_FRONT_V3_ENABLED");
        g = (e && *e && *e != '0') ? 1 : 0;
    }
    return g;
}
```
Root Causes:
- TLS read overhead: Each function reads separate TLS variable (3 separate reads in hot path)
- Lazy init check: the `g == -1` branch runs on every call (cold, but still checked)
- Function call overhead: called from multiple hot paths (not always inlined)
Optimization Opportunities:
Option C1: ENV Snapshot Consolidation ⭐ HIGH ROI
Approach: Consolidate all ENV gates into single TLS snapshot struct
- Benefit: Reduce 3 TLS reads → 1 TLS read, eliminate 2 lazy init checks
- Strategy:
```c
struct hakmem_env_snapshot {
    uint8_t front_v3_on;
    uint8_t metadata_cache_on;
    uint8_t c7_ultra_on;
    uint8_t free_hotcold_on;
    uint8_t static_route_on;
    // ... (8 bytes total, cache-friendly)
};

extern __thread struct hakmem_env_snapshot g_hakmem_env_snapshot;

static inline const struct hakmem_env_snapshot* hakmem_env_get(void) {
    if (__builtin_expect(!g_hakmem_env_snapshot.initialized, 0)) {
        hakmem_env_snapshot_init();  // One-time init
    }
    return &g_hakmem_env_snapshot;
}
```

- Expected gain: Eliminate 3.26% ENV overhead → +3.0-3.5% overall
- Risk: Medium (refactor all ENV gate call sites)
- Precedent: `tiny_front_v3_snapshot` already does this for the front v3 config (a one-time init sketch follows below)
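For completeness, a minimal init sketch to pair with the accessor above. It assumes an `initialized` byte among the fields elided by the `// ...`, and all ENV variable names other than `HAKMEM_TINY_FRONT_V3_ENABLED` and `HAKMEM_FREE_TINY_FAST_HOTCOLD` are placeholders, not confirmed names.

```c
#include <stdint.h>
#include <stdlib.h>

__thread struct hakmem_env_snapshot g_hakmem_env_snapshot;  /* TLS instance declared extern above */

/* Resolve every gate once per thread; hot paths afterwards do plain byte
 * compares on the snapshot returned by hakmem_env_get(). */
static int hakmem_env_on(const char* name) {
    const char* e = getenv(name);
    return (e && *e && *e != '0') ? 1 : 0;
}

void hakmem_env_snapshot_init(void) {
    g_hakmem_env_snapshot.front_v3_on       = (uint8_t)hakmem_env_on("HAKMEM_TINY_FRONT_V3_ENABLED");
    g_hakmem_env_snapshot.metadata_cache_on = (uint8_t)hakmem_env_on("HAKMEM_TINY_METADATA_CACHE");   /* name is a placeholder */
    g_hakmem_env_snapshot.c7_ultra_on       = (uint8_t)hakmem_env_on("HAKMEM_TINY_C7_ULTRA");         /* name is a placeholder */
    g_hakmem_env_snapshot.free_hotcold_on   = (uint8_t)hakmem_env_on("HAKMEM_FREE_TINY_FAST_HOTCOLD");
    g_hakmem_env_snapshot.static_route_on   = (uint8_t)hakmem_env_on("HAKMEM_TINY_STATIC_ROUTE");     /* name is a placeholder */
    g_hakmem_env_snapshot.initialized       = 1;  /* set last so callers of hakmem_env_get() see complete values */
}
```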
Recommendation: HIGHEST PRIORITY - Pursue C1 as Phase 4 PRIMARY TARGET
- Rationale:
- 3.26% direct overhead (measurable in perf)
- Cross-cutting benefit: Improves both alloc and free paths
- Structural improvement: Reduces TLS pressure across entire codebase
- Clear precedent: the `tiny_front_v3_snapshot` pattern is already proven
- Compounds with A1: per-class fast paths can check a single ENV snapshot instead of multiple gates
Selected Next Target
Phase 4 E1: ENV Snapshot Consolidation (PRIMARY TARGET)
Function: Consolidate all ENV gates into a single TLS snapshot
Expected Gain: +3.0-3.5% (eliminate 3.26% ENV overhead)
Risk: Medium (refactor ENV gate call sites)
Effort: 2-3 days (create snapshot struct, refactor ~20 call sites, A/B test)
Implementation Plan:
Step 1: Create ENV Snapshot Infrastructure
- File: `core/box/hakmem_env_snapshot_box.h/c`
- Struct: `hakmem_env_snapshot` (8-byte TLS struct)
- API: `hakmem_env_get()` (lazy init, returns a const snapshot pointer)
Step 2: Migrate ENV Gates
Priority order (by self% impact):
1. `tiny_c7_ultra_enabled_env()` (1.28%)
2. `tiny_front_v3_enabled()` (1.01%)
3. `tiny_metadata_cache_enabled()` (0.97%)
4. `free_tiny_fast_hotcold_enabled()` (in free_tiny_fast_cold)
5. `tiny_static_route_enabled()` (in the routing hot path)
Step 3: Refactor Call Sites
- Replace: `if (tiny_front_v3_enabled()) { ... }`
- With: `const hakmem_env_snapshot* env = hakmem_env_get(); if (env->front_v3_on) { ... }`
- Count: ~20-30 call sites (grep analysis needed)
Step 4: A/B Test
- Baseline: Current mainline (Phase 3 + D1)
- Optimized: ENV snapshot consolidation
- Workloads: Mixed (10-run), C6-heavy (5-run)
- Threshold: +1.0% mean gain for GO
Step 5: Validation
- Health check: `verify_health_profiles.sh`
- Regression check: ensure no performance loss on any profile
Success Criteria:
- ENV snapshot struct created
- All priority ENV gates migrated
- A/B test shows +2.5% or better (Mixed, 10-run)
- Health check passes
- Default ON in MIXED_TINYV3_C7_SAFE
Alternative Targets (Lower Priority)
Phase 4 E2: Per-Class Alloc Fast Path (SECONDARY TARGET)
Function: Specialize tiny_alloc_gate_fast() per class
Expected Gain: +2-3% (eliminate 15.74% route overhead in tiny_alloc_gate_fast)
Risk: Medium (code duplication, 8 variants to maintain)
Effort: 3-4 days (create 8 fast paths, refactor malloc wrapper, A/B test)
Why Secondary?:
- Higher implementation complexity (8 variants vs. 1 snapshot struct)
- Dependent on E1 success (ENV snapshot makes per-class paths cleaner)
- Can be pursued after E1 proves ENV consolidation pattern
Candidate Summary Table
| Phase | Target | self% | Approach | Expected Gain | Risk | Priority |
|---|---|---|---|---|---|---|
| E1 | ENV Snapshot Consolidation | 3.26% | Caching | +3.0-3.5% | Medium | ⭐ PRIMARY |
| E2 | Per-Class Alloc Fast Path | 15.37% | Hot/Cold Split | +2-3% | Medium | Secondary |
| E3 | Free ENV Gate Consolidation | 7.58% (local) | Caching | +0.4-0.6% | Low | Tertiary |
| E4 | C7 Logging Elimination | 17.04% (local) | ENV-gated | +0.3-0.5% | Very Low | Quick Win |
Shape Optimization Plateau Analysis
Observation: D3 (Alloc Gate Shape) achieved only +0.56% mean gain (NEUTRAL)
Why Do Shape Optimizations Plateau?
1. Branch prediction saturation: modern CPUs (Zen3/Zen4) already predict well-trained branches
   - LIKELY/UNLIKELY hints: marginal benefit on hot paths
   - B3 (Routing Shape): +2.89% → initial win (untrained branches)
   - D3 (Alloc Gate Shape): +0.56% → diminishing returns (already trained)
2. I-cache pressure: adding cold helpers can regress performance if not placed carefully
   - A3 (always_inline header): -4.00% on Mixed (I-cache thrashing)
   - D3: neutral (no regression, but no clear win)
3. TLS/memory overhead now dominates: ENV gates (3.26%) > branch misprediction (~0.5%)
   - The next optimization should target memory/TLS overhead, not branches
Lessons Learned:
- Shape optimizations: Good for first pass (B3 +2.89%), limited ROI after
- Next frontier: Caching (ENV snapshot), structural changes (per-class paths)
- Avoid: More LIKELY/UNLIKELY hints (saturated)
- Prefer: Eliminate checks entirely (snapshot) or specialize paths (per-class)
Next Steps
1. Phase 4 E1: ENV Snapshot Consolidation (PRIMARY)
   - Create design doc: PHASE4_E1_ENV_SNAPSHOT_CONSOLIDATION_DESIGN.md
   - Implement the snapshot infrastructure
   - Migrate the priority ENV gates
   - A/B test (Mixed, 10-run)
   - Target: +3.0% gain; promote to default if successful
2. Phase 4 E2: Per-Class Alloc Fast Path (SECONDARY)
   - Depends on E1 success
   - Design doc: PHASE4_E2_PER_CLASS_ALLOC_FASTPATH_DESIGN.md
   - Prototype a C7-only fast path first (highest gain, least complexity)
   - A/B test incremental per-class specialization
   - Target: +2-3% gain
3. Update CURRENT_TASK.md:
   - Document the perf findings
   - Note the shape optimization plateau
   - List E1 as the next target
Appendix: Perf Command Reference
```bash
# Profile current mainline
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
perf record -F 999 -- ./bench_random_mixed_hakmem 40000000 400 1

# Generate report (sorted by symbol, no children aggregation)
perf report --stdio --no-children --sort=symbol | head -80

# Annotate specific function
perf annotate --stdio tiny_alloc_gate_fast.lto_priv.0 | head -100
```
Key Metrics:
- Samples: 922 (sufficient for 0.1% precision)
- Frequency: 999 Hz (balance between overhead and resolution)
- Iterations: 40M (runtime ~0.86s, enough for stable sampling)
- Workload: Mixed (ws=400, representative of production)
Status: Ready for Phase 4 E1 implementation
Baseline: 46.37M ops/s (Phase 3 + D1)
Target: 47.8M ops/s (+3.0% via ENV snapshot consolidation)