# Phase 5 E5: Comprehensive Analysis & Implementation Report

## Executive Summary

**Date**: 2025-12-14
**Baseline**: 47.34M ops/s (Mixed, E4-1+E4-2 ON, 20M iters, ws=400)
**Status**: Step 1 Complete - Perf Analysis & Priority Decision

---

## Step 1: Perf Analysis - Free() Internal Breakdown

### Test Configuration

```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 999 -- \
  ./bench_random_mixed_hakmem 20000000 400 1
```

**Results**: 476 samples @ 999Hz, throughput 44.12M ops/s

### Top Hotspots (self% >= 2.0%)

| Rank | Function | Self% | Analysis |
|------|----------|-------|----------|
| 1 | `free` | 21.67% | Wrapper layer overhead (header check, ENV snapshot, gate dispatch) |
| 2 | `tiny_alloc_gate_fast` | 18.55% | Alloc gate (reduced from E4 baseline) |
| 3 | `main` | 17.44% | Benchmark driver |
| 4 | `malloc` | 12.27% | Wrapper layer (reduced from E4 baseline) |
| 5 | `free_tiny_fast_cold` | 7.89% | **NEW**: Cold path (route determination, policy snapshot) |
| 6 | `unified_cache_push` | 3.47% | TLS cache push operation |
| 7 | `tiny_c7_ultra_alloc` | 2.98% | C7 ULTRA alloc path |
| 8 | `tiny_region_id_write_header` | 2.59% | **Header write tax** (reduced from 6.97% in E4 baseline) |
| 9 | `hakmem_env_snapshot_enabled` | 2.57% | **ENV snapshot gate overhead** (reduced from 4.29%) |

### Free Path Internal Breakdown

**Total free() overhead**: 21.67% (wrapper) + 7.89% (cold path) = **29.56%**

**Key Observations**:

1. **Wrapper Layer (21.67% self%)**:
   - NULL check: ~0%
   - Header magic validation: **~8% of free() self%**
     - Load header byte: `movzbl -0x1(%rbp),%r15d` (7.44% sample hit)
     - Magic comparison: `cmp $0xa0,%r14b` (0%)
     - Class extraction: `and $0xf,%r12d` (0%)
   - ENV snapshot calls: **~8% of free() self%**
     - `mov 0x5b311(%rip),%edx` (7.73% + 6.98% sample hits)
     - Branch to cold init paths (0%)
   - TLS reads for gate dispatch: **~5% of free() self%**
     - `mov 0x14(%r15),%ebx` (8.01% sample hit)

2. **Cold Path (7.89% self%)**:
   - `tiny_route_for_class()`: 1.95% self% (route determination)
   - Policy snapshot: 0.64% self%
   - Class-based dispatch overhead

3. **Header Write Tax (2.59% self%)**:
   - **DOWN from 6.97% in E4 baseline** (a 63% reduction)
   - Note: This is already significantly improved
   - Further optimization ROI is lower now

4. **ENV Snapshot Overhead (2.57% self%)**:
   - **DOWN from 4.29% in E4 baseline** (a 40% reduction)
   - Lazy init checks + TLS reads
   - Still visible but reduced impact

### Critical Insight: Header Check Overhead

**Annotated perf shows**:
- Lines with 7-8% sample hits are concentrated in:
  1. **Header load**: `movzbl -0x1(%rbp),%r15d` (offset 0x1cd6d)
  2. **Multiple TLS/global reads** for ENV gates (offsets 0x1ccd1, 0x1ccf8, 0x1cd01, 0x1cd2d)

**These lines represent**:
- **Header validation overhead**: ~7-8% of free() self%
- **ENV gate consolidation overhead**: ~8% of free() self% (multiple checks)
- **Total wrapper tax**: ~16% of 21.67% = **~3.5% of total runtime**

### Branch Prediction Analysis

**From perf annotate**:
- Page boundary check: `test $0xfff,%ebp` (0% hit - well predicted)
- Magic comparison: `cmp $0xa0,%r14b` (0% hit - well predicted)
- ENV snapshot checks: Multiple branches with 0% hit (well predicted)
- **No significant branch misprediction issues**

---

## Step 2: E5 Candidate Priority Decision

### E5-1: Free() Tiny Direct Path (HIGHEST PRIORITY)

**Target**: 21.67% (free wrapper) + 7.89% (free_tiny_fast_cold) = **29.56% total**

**Hypothesis**:
- Current free() wrapper does header validation + ENV snapshot + gate dispatch
- For Tiny allocations (48% of frees in Mixed), this is redundant work:
  1. Header is validated in wrapper (magic check)
  2. Header is re-extracted in `free_tiny_fast()`
  3. Route determination happens in `free_tiny_fast_cold()`
- **Opportunity**: Single-check Tiny direct path in wrapper

**Strategy** (Box Theory):
- **L0 SplitBox**: `HAKMEM_FREE_TINY_DIRECT=0/1` (default 0)
- **L1 HotBox**: Wrapper-level Tiny fast path
  - Check: `(header & 0xF0) == 0xA0` (Tiny magic)
  - Extract: `class_idx = header & 0x0F`
  - Validate: `class_idx < 8` (fail-fast)
  - Direct call: `free_tiny_fast_direct(ptr, base, class_idx)`
    - Skips: ENV snapshot consolidation in wrapper
    - Skips: Cold path route determination
    - Skips: Re-validation of header
- **L1 ColdBox**: Existing fallback for Mid/Pool/Large/invalid

**Expected ROI**:
- **Baseline 29.56% overhead** → **Target 18-20% overhead** (30-40% reduction)
- **Translation**: +3-5% throughput improvement
- **Risk**: Low (single boundary, fail-fast validation)

**Implementation Complexity**: Medium
- New function: `free_tiny_fast_direct()` (inline, thin wrapper)
- Integration: 1 boundary in `free()` wrapper
- Counters: 3 (direct_hit, direct_miss, invalid_header)

### E5-2: Header Write to Refill Boundary (MEDIUM PRIORITY)

**Target**: 2.59% (tiny_region_id_write_header)

**Hypothesis**:
- Header write happens on EVERY alloc (hot path tax)
- Blocks are reused within same class → header is stable
- **Opportunity**: Write header once at refill boundary (cold path)

**Strategy** (Box Theory):
- **L0 SplitBox**: `HAKMEM_TINY_HEADER_PREFILL=0/1` (default 0)
- **L2 HeaderPrefillBox**: Refill boundary header initialization
  - When: Slab refill (cold path, ~64 blocks at once)
  - Where: `tiny_unified_cache_refill()` or similar
  - Action: Prefill all headers in slab page
- **L1 HotBox**: Alloc path skips header write
  - Condition: Check "prefill done" flag per slab
  - Fast path: Return `base + 1` directly (no header write)
  - Fallback: Traditional `tiny_region_id_write_header()` if not prefilled

**Expected ROI**:
- **Baseline 2.59%** → **Target 0.5-1.0%** (60-80% reduction)
- **Translation**: +1.5-2.0% throughput improvement
- **Risk**: Medium (requires slab-level "prefilled" tracking)

**Implementation Complexity**: High
- Metadata: Per-slab "header_prefilled" flag (1 bit per slab)
- Integration: 2 boundaries (refill + alloc hot path)
- Safety: Fallback to hot path write if prefill not done

**Note**: Phase 1 A3 showed `always_inline` on header write was **NO-GO** (-4% on Mixed). This approach is different: **eliminate writes**, not inline them.

### E5-3: ENV Snapshot Branch Shape (LOW PRIORITY)

**Target**: 2.57% (hakmem_env_snapshot_enabled)

**Hypothesis**:
- `MIXED_TINYV3_C7_SAFE` now defaults to `HAKMEM_ENV_SNAPSHOT=1`
- Current branch hint: `__builtin_expect(..., 0)` (expects DISABLED)
- Reality: Snapshot is ENABLED → hint is backwards
- **Opportunity**: Flip branch shape to match reality

**Strategy** (Box Theory):
- **L0 SplitBox**: `HAKMEM_ENV_SNAPSHOT_SHAPE=0/1` (default 0)
- **L1 ShapeBox**: Branch restructuring
  - Old: `if (__builtin_expect(!enabled, 1)) { legacy } else { snapshot }`
  - New: `if (__builtin_expect(enabled, 1)) { snapshot } else { legacy }`
  - Or: Move legacy to `noinline,cold` helper

**Expected ROI**:
- **Baseline 2.57%** → **Target 1.5-2.0%** (20-40% reduction)
- **Translation**: +0.5-1.0% throughput improvement
- **Risk**: Very low (branch shape change only, no logic change)

**Implementation Complexity**: Low
- Changes: 5-8 call sites in hot paths
- Integration: Flip `__builtin_expect()` polarity
- Counters: None needed (perf annotation will show)

**Note**: This is a "polish" optimization. E3-4 showed branch hints can backfire. Approach carefully with A/B testing.

---

## Step 2 Conclusion: Priority Ranking

### Priority Order (ROI × Safety × Complexity)

1. **E5-1 (Free Tiny Direct Path)**: +3-5% expected, medium complexity, low risk
   - **ROI Score**: 9/10 (largest target, proven pattern)
   - **Safety Score**: 8/10 (single boundary, fail-fast)
   - **Implementation Score**: 7/10 (medium effort)
   - **TOTAL**: 24/30 → **HIGHEST PRIORITY**

2. **E5-2 (Header Prefill at Refill)**: +1.5-2.0% expected, high complexity, medium risk
   - **ROI Score**: 6/10 (2.59% target, diminishing returns from E4)
   - **Safety Score**: 6/10 (metadata overhead, fallback needed)
   - **Implementation Score**: 4/10 (high effort, slab tracking)
   - **TOTAL**: 16/30 → **MEDIUM PRIORITY**

3. **E5-3 (ENV Snapshot Shape)**: +0.5-1.0% expected, low complexity, low risk
   - **ROI Score**: 4/10 (small target, branch hint uncertainty)
   - **Safety Score**: 9/10 (pure shape change)
   - **Implementation Score**: 9/10 (low effort)
   - **TOTAL**: 22/30 → **LOW PRIORITY** (ranked last on ROI despite the second-highest total; an easy win if E5-1 succeeds)

### Decision: Proceed with E5-1 (Free Tiny Direct Path)

**Rationale**:
1. **Largest target**: 29.56% combined overhead (free wrapper + cold path)
2. **Proven pattern**: E4 success shows TLS consolidation works
3. **Single boundary**: Clear separation, fail-fast validation
4. **Low risk**: Header magic check is reliable, fallback is trivial
5. **Box Theory compliant**: L0 ENV gate, L1 hot/cold split, minimal counters

**Expected Gain**: +3-5% (conservative estimate)
**Success Threshold**: +1.0% (GO criteria)
**Fallback Plan**: If NO-GO, freeze and pursue E5-2 or E5-3

---

## Step 3: E5-1 Implementation Plan (Next)

### Design Document

- File: `docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md`
- Sections:
  1. Problem Statement (29.56% overhead analysis)
  2. Solution Architecture (wrapper-level Tiny direct path)
  3. Box Theory Boundary (L0 ENV gate, L1 hot/cold split)
  4. Safety Gates (header validation, fail-fast, fallback)
  5. Integration Points (free() wrapper, counters)
  6. A/B Test Plan (10-run Mixed, health check)

### Implementation Files

- **ENV Box**: `core/box/free_tiny_direct_env_box.h/.c`
  - ENV gate: `HAKMEM_FREE_TINY_DIRECT=0/1`
  - Lazy init, static cache
- **Stats Box**: `core/box/free_tiny_direct_stats_box.h`
  - Counters: `direct_hit`, `direct_miss`, `invalid_header`
- **Integration**: `core/box/hak_wrappers.inc.h`
  - Lines ~595-640 (free() wrapper hot path)
  - New inline: `free_tiny_fast_direct()`

### A/B Test Protocol

```bash
# Baseline (E5-1=0)
for i in {1..10}; do
  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FREE_TINY_DIRECT=0 \
    ./bench_random_mixed_hakmem 20000000 400 1
done

# Optimized (E5-1=1)
for i in {1..10}; do
  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FREE_TINY_DIRECT=1 \
    ./bench_random_mixed_hakmem 20000000 400 1
done
```

**Success Criteria**:
- **GO**: Mean +1.0% or higher
- **NEUTRAL**: Mean strictly between -1.0% and +1.0%
- **NO-GO**: Mean -1.0% or lower

---

## Appendix: Perf Data Details

### Free() Annotated Hot Lines

**Top sample-hit lines** (from `perf annotate --stdio free`):

1. **8.01%**: `mov 0x14(%r15),%ebx` (offset 0x1cd2d)
   - Loading `wrapper_env->wrap_shape` from TLS
   - **Analysis**: ENV snapshot field access overhead

2. **7.73%**: `mov %rdi,%rbp` (offset 0x1ccd1)
   - Saving ptr argument to register
   - **Analysis**: Register pressure in wrapper preamble

3. **7.44%**: `mov 0x5b311(%rip),%edx` (offset 0x1cd01)
   - Loading cached ENV gate value (global)
   - **Analysis**: Lazy init check overhead

4. **6.98%**: `test %r11d,%r11d` (offset 0x1ccf8)
   - Testing ENV gate flag (TLS)
   - **Analysis**: Branch condition for ENV snapshot path

5. **7.62%**: `pop %r14` (offset 0x1ce6e)
   - Epilogue register restore
   - **Analysis**: Function call overhead (6+ register saves)

**Total wrapper overhead from top 5 lines**: ~38% of free() self% = **~8% of total runtime**

### free_tiny_fast_cold Analysis

**7.89% self% breakdown**:
- `tiny_route_for_class()`: 1.95% (route determination)
- Policy snapshot: 0.64%
- Class dispatch: ~5% (remainder)

**Cold path is triggered when**:
- Class != C7 (no ULTRA early-exit)
- OR class requires route determination (C0-C6 in Mixed)

**Opportunity**: Bypass cold path entirely for Tiny by validating in wrapper.

---

## Next Actions

1. ✅ **Step 1 Complete**: Perf analysis → E5-1 selected
2. 🔄 **Step 3 Next**: Implement E5-1 design + code
3. ⏳ **Step 4 Next**: 10-run A/B test (Mixed)
4. ⏳ **Step 5 Next**: Health check
5. ⏳ **Step 6 Next**: Update CURRENT_TASK.md with results

**Status**: Ready to proceed with E5-1 implementation.
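
The single-check wrapper path proposed for E5-1 (magic check, class extraction, fail-fast validation) could look roughly like the sketch below. It is an illustration only, not the actual hakmem code: `tiny_direct_decode()`, `tiny_direct_stats_t`, and the `TINY_*` macros are hypothetical names, and the sketch assumes the 1-byte header sits immediately before the user pointer, as the `movzbl -0x1(%rbp)` load in the perf annotation suggests. The page-boundary check seen in the annotation is omitted for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, taken from the report: high nibble 0xA marks a Tiny block,
 * the low nibble carries the class index, and valid classes are 0..7. */
#define TINY_MAGIC_MASK  0xF0u
#define TINY_MAGIC_VALUE 0xA0u
#define TINY_CLASS_MASK  0x0Fu
#define TINY_NUM_CLASSES 8u

/* Hypothetical counters mirroring direct_hit / direct_miss / invalid_header. */
typedef struct {
    uint64_t direct_hit;
    uint64_t direct_miss;
    uint64_t invalid_header;
} tiny_direct_stats_t;

/* Decode the header byte stored at ptr - 1.
 * Returns true and sets *class_idx only for a valid Tiny header;
 * on false the caller would fall back to the existing cold path. */
static inline bool tiny_direct_decode(const void *ptr, unsigned *class_idx,
                                      tiny_direct_stats_t *stats)
{
    uint8_t header = *((const uint8_t *)ptr - 1);

    if ((header & TINY_MAGIC_MASK) != TINY_MAGIC_VALUE) {
        stats->direct_miss++;      /* Not a Tiny block (Mid/Pool/Large). */
        return false;
    }

    unsigned cls = header & TINY_CLASS_MASK;
    if (cls >= TINY_NUM_CLASSES) {
        stats->invalid_header++;   /* Tiny magic but class out of range: fail fast. */
        return false;
    }

    stats->direct_hit++;
    *class_idx = cls;
    return true;
}
```

In a wrapper, a `true` result would lead straight to `free_tiny_fast_direct(ptr, base, class_idx)`; any `false` result keeps today's behavior, which is what makes the boundary easy to A/B test behind `HAKMEM_FREE_TINY_DIRECT`.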
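
The GO / NEUTRAL / NO-GO decision from the A/B protocol reduces to comparing the mean throughput of the two 10-run sets. The helper below is a minimal sketch of that arithmetic, assuming throughput values in any consistent unit (for example M ops/s); the names `ab_mean`, `ab_delta_percent`, and `ab_classify` are hypothetical and not part of the benchmark tooling.

```c
#include <stddef.h>

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_verdict_t;

/* Arithmetic mean of n samples; returns 0.0 for an empty set. */
static double ab_mean(const double *xs, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += xs[i];
    }
    return n ? sum / (double)n : 0.0;
}

/* Relative change of the optimized mean versus the baseline mean, in percent.
 * Returns 0.0 if the baseline mean is zero to avoid dividing by zero. */
static double ab_delta_percent(const double *base, size_t nb,
                               const double *opt, size_t no)
{
    double mb = ab_mean(base, nb);
    if (mb == 0.0) {
        return 0.0;
    }
    return (ab_mean(opt, no) - mb) / mb * 100.0;
}

/* Thresholds from the success criteria: +1.0% or higher is GO,
 * -1.0% or lower is NO-GO, anything in between is NEUTRAL. */
static ab_verdict_t ab_classify(double delta_percent)
{
    if (delta_percent >= 1.0) {
        return AB_GO;
    }
    if (delta_percent <= -1.0) {
        return AB_NO_GO;
    }
    return AB_NEUTRAL;
}
```

Feeding the ten baseline and ten optimized throughput figures into `ab_delta_percent()` and passing the result to `ab_classify()` yields the verdict; a mean-only comparison ignores run-to-run variance, so a borderline result may still warrant a rerun.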