# Phase 19: FastLane Instruction Reduction - Design Document

## 0. Executive Summary

**Goal**: Reduce the instruction/branch count gap between hakmem and libc to close the throughput gap
**Current Gap**: hakmem 44.88M ops/s vs libc 77.62M ops/s (libc is 73.0% faster)
**Target**: Reduce the instruction gap from +53.8% to <+25%, targeting a +15-25% throughput improvement
**Success Criteria**: Achieve 52-56M ops/s (from the current 44.88M ops/s)

### Key Findings

Per-operation overhead comparison (200M ops):

| Metric | hakmem | libc | Delta | Delta % |
|--------|--------|------|-------|---------|
| **Instructions/op** | 209.09 | 135.92 | +73.17 | **+53.8%** |
| **Branches/op** | 52.33 | 22.93 | +29.40 | **+128.2%** |
| Cycles/op | 96.48 | 54.69 | +41.79 | +76.4% |
| Branch-miss % | 2.22% | 2.87% | -0.65% | Better |

**Critical insight**: hakmem executes **73 extra instructions** and **29 extra branches** per operation vs libc. This overhead accounts for essentially the entire throughput gap.

---

## 1. Gap Analysis (Per-Operation Breakdown)

### 1.1 Instruction Gap: +73.17 instructions/op (+53.8%)

The excess comes from multiple layers of overhead:

- **FastLane wrapper checks**: ENV gates, class mask validation, size checks
- **Policy snapshot overhead**: TLS reads for routing decisions (3+ reads even with the ENV snapshot)
- **Route determination**: Static route table lookup vs a direct path
- **Multiple ENV gates**: Scattered throughout the hot path (DUALHOT, LEGACY_DIRECT, C7_ULTRA, etc.)
- **Stats counters**: Atomic increments on the hot path (FREE_PATH_STAT_INC, ALLOC_GATE_STAT_INC, etc.)
- **Header validation duplication**: Both FastLane and free_tiny_fast validate the header

### 1.2 Branch Gap: +29.40 branches/op (+128.2%)

The branch gap is **2.3x worse** than the instruction gap (see the sketch at the end of Section 1):

- **Cascading ENV checks**: Each layer adds 1-2 branches (g_initialized, class_mask, DUALHOT, C7_ULTRA, LEGACY_DIRECT)
- **Route dispatch**: Static route check + route_kind switch
- **Early-exit patterns**: Multiple if-checks for the ULTRA/DUALHOT/LEGACY paths
- **Stats gating**: `if (__builtin_expect(...))` patterns around counters

### 1.3 Why the Cycles/op Gap is Smaller Than Expected

Despite the +76.4% cycle gap, the CPU achieves 2.17 IPC (hakmem) vs 2.49 IPC (libc). This suggests:

- **Good CPU pipelining**: The branch predictor is working well (2.22% miss rate)
- **I-cache locality**: Code is reasonably compact despite the extra instructions
- **But**: We still pay for every extra branch in pipeline stalls
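To make the branch arithmetic concrete, here is a minimal, self-contained sketch of the cascading-gate shape described above. All names (`g_dualhot_on`, `free_dualhot`, etc.) are hypothetical stand-ins rather than hakmem's actual symbols; the point is that each layered gate costs roughly one branch per call, so four to five extra branches accumulate before any real free work starts.

```c
#include <stdbool.h>

/* Hypothetical gates and handlers, for illustration only. */
static bool g_initialized, g_dualhot_on, g_c7_ultra_on, g_legacy_direct_on;

static int free_dualhot(void* p)  { (void)p; return 1; }
static int free_c7_ultra(void* p) { (void)p; return 1; }
static int free_legacy(void* p)   { (void)p; return 1; }
static int free_generic(void* p)  { (void)p; return 1; }

/* Cascading shape: every call re-tests each gate, costing one branch per
 * layer even when the predictor gets all of them right. */
static int free_cascaded(void* ptr) {
    if (!g_initialized)     return 0;                  /* branch 1 */
    if (g_dualhot_on)       return free_dualhot(ptr);  /* branch 2 */
    if (g_c7_ultra_on)      return free_c7_ultra(ptr); /* branch 3 */
    if (g_legacy_direct_on) return free_legacy(ptr);   /* branch 4 */
    return free_generic(ptr);
}
```

Candidates A and B in Section 3 target exactly this shape: fewer layers, and one pre-computed decision instead of a chain of independently re-checked gates.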
---

## 2. Hot Path Breakdown (perf report)

Top 10 hot functions (% of cycles):

| Function | % time | Category | Reduction Target? |
|----------|--------|----------|-------------------|
| **front_fastlane_try_free** | 23.97% | Wrapper | ✓ **YES** (remove layer) |
| **malloc** | 23.84% | Wrapper | ✓ **YES** (remove layer) |
| main | 22.02% | Benchmark | (baseline) |
| **free** | 6.82% | Wrapper | ✓ **YES** (remove layer) |
| unified_cache_push | 4.44% | Core | Optimize later |
| tiny_header_finalize_alloc | 4.34% | Core | Optimize later |
| tiny_c7_ultra_alloc | 3.38% | Core | Optimize later |
| tiny_c7_ultra_free | 2.07% | Core | Optimize later |
| hakmem_env_snapshot_enabled | 1.22% | ENV | ✓ **YES** (eliminate checks) |
| hak_super_lookup | 0.98% | Core | Optimize later |

**Critical observation**: The top 3 user-space functions are **all wrappers**:

- `front_fastlane_try_free` (23.97%) + `free` (6.82%) = **30.79%** on free wrappers
- `malloc` (23.84%) on the alloc wrapper
- Combined wrapper overhead: **~54-55%** of all cycles

### 2.1 front_fastlane_try_free Annotated Breakdown

From `perf annotate`, the hot path has these expensive operations:

**Header validation** (lines 1c786-1c791, ~3% of samples):

```asm
movzbl -0x1(%rbp),%ebx      # Load header byte
mov    %ebx,%eax            # Copy to eax
and    $0xfffffff0,%eax     # Extract magic (0xA0)
cmp    $0xa0,%al            # Check magic
jne    ... (fallback)       # Branch on mismatch
```

**ENV snapshot checks** (lines 1c7ff-1c822, ~7% of samples):

```asm
cmpl   $0x1,0x628fa(%rip)   # g_hakmem_env_snapshot_ctor_mode (3.01%)
mov    0x628ef(%rip),%r15d  # g_hakmem_env_snapshot_gate (1.36%)
je     ...
cmp    $0xffffffff,%r15d
je     ... (init path)
test   %r15d,%r15d
jne    ... (snapshot path)
```

**Class routing overhead** (lines 1c7d1-1c7fb, ~3% of samples):

```asm
mov    0x6299c(%rip),%r15d  # g.5.lto_priv.0 (policy gate)
cmp    $0x1,%r15d
jne    ... (fallback)
movzbl 0x6298f(%rip),%eax   # g_mask.3.lto_priv.0
cmp    $0xff,%al
je     ... (all-classes path)
movzbl %al,%r9d
bt     %r13d,%r9d           # Bit test class mask
jae    ... (fallback)
```

**Total overhead**: ~15-20% of the cycles in front_fastlane_try_free are spent on:

- Header validation (done again in free_tiny_fast)
- ENV snapshot probing
- Policy/route checks

---

## 3. Reduction Candidates (Prioritized by ROI)

### Candidate A: **Eliminate FastLane Wrapper Layer** (Highest ROI)

**Problem**: front_fastlane_try_free + free wrappers consume 30.79% of cycles
**Root cause**: Double header validation + ENV checks + class mask checks
**Proposal**: Call free_tiny_fast() directly from the free() wrapper

**Implementation**:

```c
// In the free() wrapper:
void free(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return;

    // Phase 19-A: Direct call (no FastLane layer)
    if (free_tiny_fast(ptr)) {
        return;  // Handled
    }

    // Fallback to the cold path
    free_cold(ptr);
}
```

**Reduction estimate**:

- **Instructions**: -15-20/op (eliminate the duplicate header read, ENV checks, class mask checks)
- **Branches**: -5-7/op (remove the FastLane gate checks)
- **Impact**: ~10-15% throughput improvement (removes the 30% wrapper overhead)

**Risk**: **LOW** (free_tiny_fast already has validation + routing logic)
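The fallback named in Phase 19-1 (keep FastLane as a compile-time option) could look like the sketch below. The `HAKMEM_FASTLANE_LAYER` flag name and the prototypes are assumptions for illustration; `front_fastlane_try_free`, `free_tiny_fast`, and `free_cold` are the existing functions.

```c
// Assumed prototypes for the existing paths (signatures are assumptions):
int  front_fastlane_try_free(void* ptr);
int  free_tiny_fast(void* ptr);
void free_cold(void* ptr);

// Sketch only: HAKMEM_FASTLANE_LAYER is a hypothetical build flag that
// keeps the old layered path selectable for A/B testing.
void free(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return;
#if HAKMEM_FASTLANE_LAYER
    if (front_fastlane_try_free(ptr)) return;  // legacy layered path
#else
    if (free_tiny_fast(ptr)) return;           // Phase 19-A direct path
#endif
    free_cold(ptr);  // cold fallback in both configurations
}
```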
---

### Candidate B: **Consolidate ENV Snapshot Checks** (High ROI)

**Problem**: The ENV snapshot is checked **3+ times per operation**:

1. FastLane entry: `g_initialized` check
2. Route determination: `hakmem_env_snapshot_enabled()` check
3. Route-specific: `tiny_c7_ultra_enabled_env()` check
4. Legacy fallback: another ENV snapshot check

**Proposal**: A single ENV snapshot read at entry, passing context down

**Implementation**:

```c
// Phase 19-B: ENV context struct
typedef struct {
    bool c7_ultra_enabled;
    bool dualhot_enabled;
    bool legacy_direct_enabled;
    SmallRouteKind route_kind[8];  // Pre-computed routes
} FastLaneCtx;

static __thread FastLaneCtx g_fastlane_ctx = {0};
static __thread int g_fastlane_ctx_init = 0;

static inline const FastLaneCtx* fastlane_ctx_get(void) {
    if (__builtin_expect(g_fastlane_ctx_init == 0, 0)) {
        // One-time init per thread
        const HakmemEnvSnapshot* env = hakmem_env_snapshot();
        g_fastlane_ctx.c7_ultra_enabled = env->tiny_c7_ultra_enabled;
        // ... populate other fields
        g_fastlane_ctx_init = 1;
    }
    return &g_fastlane_ctx;
}
```

**Reduction estimate**:

- **Instructions**: -8-12/op (eliminate redundant TLS reads)
- **Branches**: -3-5/op (a single init check instead of multiple)
- **Impact**: ~5-8% throughput improvement

**Risk**: **MEDIUM** (ENV changes during runtime must be handled; use an invalidation hook, sketched below)
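For the MEDIUM risk above, one plausible shape for the invalidation hook is a global generation counter: the ENV-snapshot infrastructure bumps it whenever a gate changes, and each thread lazily re-populates its `FastLaneCtx` when its cached generation goes stale. This is a sketch under assumptions: `fastlane_ctx_invalidate_all` and `fastlane_ctx_populate` are hypothetical names, and `FastLaneCtx` is the struct defined above.

```c
#include <stdatomic.h>

static _Atomic unsigned g_env_generation = 1;   /* bumped on any ENV change */
static __thread unsigned g_ctx_generation = 0;  /* 0 = never populated */
static __thread FastLaneCtx g_fastlane_ctx;

void fastlane_ctx_populate(FastLaneCtx* ctx);   /* assumed helper: re-reads
                                                 * the ENV snapshot */

/* Hook for the ENV-snapshot infrastructure to call when a gate changes. */
void fastlane_ctx_invalidate_all(void) {
    atomic_fetch_add_explicit(&g_env_generation, 1, memory_order_release);
}

static inline const FastLaneCtx* fastlane_ctx_get(void) {
    unsigned gen = atomic_load_explicit(&g_env_generation,
                                        memory_order_acquire);
    if (__builtin_expect(g_ctx_generation != gen, 0)) {
        fastlane_ctx_populate(&g_fastlane_ctx);
        g_ctx_generation = gen;
    }
    return &g_fastlane_ctx;
}
```

The hot path still pays a single compare-and-branch, so the -3-5 branches/op estimate is preserved while the cache stays revertible at runtime.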
---

### Candidate C: **Remove Stats Counters from Hot Path** (Medium ROI)

**Problem**: Stats counters on the hot path add atomic increments:

- `FRONT_FASTLANE_STAT_INC(free_total)` (every op)
- `FREE_PATH_STAT_INC(total_calls)` (every op)
- `ALLOC_GATE_STAT_INC(total_calls)` (every alloc)
- `tiny_front_free_stat_inc(class_idx)` (every free)

**Proposal**: Keep exact stats in Debug builds; make them sample-based (1-in-N) in Release builds

**Implementation**:

```c
// Phase 19-C: Sampling-based stats (snippet as expanded at each stat site)
static __thread uint32_t g_stat_counter = 0;

#if HAKMEM_BUILD_RELEASE
// Release: sample 1-in-4096 operations
if (__builtin_expect((++g_stat_counter & 0xFFF) == 0, 0)) {
    FRONT_FASTLANE_STAT_INC(free_total);
}
#else
// Debug: keep exact counts
FRONT_FASTLANE_STAT_INC(free_total);
#endif
```

**Reduction estimate**:

- **Instructions**: -4-6/op (remove atomic increments)
- **Branches**: -2-3/op (remove the `if (__builtin_expect(...))` checks)
- **Impact**: ~3-5% throughput improvement

**Risk**: **LOW** (stats are already compile-time optional)

---

### Candidate D: **Inline Header Validation** (Medium ROI)

**Problem**: Header validation happens twice:

1. FastLane wrapper: `*((uint8_t*)ptr - 1)` (lines 179-191 in front_fastlane_box.h)
2. free_tiny_fast: the same check (lines 598-605 in malloc_tiny_fast.h)

**Proposal**: Trust the FastLane validation; remove the duplicate check

**Implementation**:

```c
// Phase 19-D: Add a "trusted" variant
static inline int free_tiny_fast_trusted(void* ptr, int class_idx, void* base) {
    // Skip header validation (caller already validated)
    // Go directly to route dispatch
    ...
}

// In FastLane:
uint8_t header = *((uint8_t*)ptr - 1);
int class_idx = header & 0x0F;
void* base = tiny_user_to_base_inline(ptr);
return free_tiny_fast_trusted(ptr, class_idx, base);
```

**Reduction estimate**:

- **Instructions**: -3-5/op (remove the duplicate header load + extract)
- **Branches**: -1-2/op (remove the duplicate magic check)
- **Impact**: ~2-3% throughput improvement

**Risk**: **MEDIUM** (every caller must be guaranteed to validate the header)

---

### Candidate E: **Static Route Table Optimization** (Lower ROI)

**Problem**: Route determination uses TLS lookups + bit tests:

```c
if (tiny_static_route_ready_fast()) {
    route_kind = tiny_static_route_get_kind_fast(class_idx);
} else {
    route_kind = tiny_policy_hot_get_route(class_idx);
}
```

**Proposal**: Pre-compute common routes at init; inline the direct paths

**Implementation**:

```c
// Phase 19-E: Route fast path (C0-C3 LEGACY, C7 ULTRA)
static __thread uint8_t g_route_fastmap = 0;  // bit 0=C0 ... bit 7=C7, 1=LEGACY

static inline bool is_legacy_route_fast(int class_idx) {
    return (g_route_fastmap >> class_idx) & 1;
}
```

**Reduction estimate**:

- **Instructions**: -3-4/op (replace a function call with a bit test)
- **Branches**: -1-2/op (replace nested ifs with a single bit test)
- **Impact**: ~2-3% throughput improvement

**Risk**: **LOW** (the route table is already static)

---

## 4. Combined Impact Estimate

Assuming independent reductions (a conservative estimate at 80% efficiency due to overlap):

| Candidate | Instructions/op | Branches/op | Throughput |
|-----------|-----------------|-------------|------------|
| Baseline | 209.09 | 52.33 | 44.88M ops/s |
| **A: Remove FastLane layer** | -17.5 | -6.0 | +12% |
| **B: ENV snapshot consolidation** | -10.0 | -4.0 | +6% |
| **C: Stats removal (Release)** | -5.0 | -2.5 | +4% |
| **D: Inline header validation** | -4.0 | -1.5 | +2% |
| **E: Static route fast path** | -3.5 | -1.5 | +2% |
| **Combined (80% efficiency)** | **-32.0** | **-12.4** | **+21%** |

**Projected outcome**:

- Instructions/op: 209.09 → **177.09** (vs libc 135.92, gap reduced from +53.8% to +30.3%)
- Branches/op: 52.33 → **39.93** (vs libc 22.93, gap reduced from +128.2% to +74.1%)
- Throughput: 44.88M → **54.3M ops/s** (vs libc 77.62M, gap reduced from +73.0% to +43.0%)

**Achievement vs Goal**: ✓ Meets target (+21%, within the +15-25% goal)

---

## 5. Implementation Plan

### Phase 19-1: Remove FastLane Wrapper Layer (A)

**Priority**: P0 (highest ROI)
**Effort**: 2-3 hours
**Risk**: Low (free_tiny_fast is already complete)

Steps:

1. Modify the `free()` wrapper to call `free_tiny_fast(ptr)` directly
2. Modify the `malloc()` wrapper to call `malloc_tiny_fast(size)` directly
3. Measure: expect +10-15% throughput
4. Fallback: keep FastLane as a compile-time option (sketched under Candidate A above)

### Phase 19-2: ENV Snapshot Consolidation (B)

**Priority**: P1 (high ROI, moderate risk)
**Effort**: 4-6 hours
**Risk**: Medium (ENV invalidation needed)

Steps:

1. Create the `FastLaneCtx` struct with pre-computed ENV state
2. Add a TLS cache with an invalidation hook
3. Replace the scattered ENV checks with a single context read
4. Measure: expect +5-8% throughput on top of Phase 19-1
5. Fallback: ENV-gate the new path (HAKMEM_FASTLANE_ENV_CTX=1); see the probe sketch below
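A minimal sketch of the opt-in gate from step 5, assuming the probe is cached after the first call. The `HAKMEM_FASTLANE_ENV_CTX` variable comes from the plan above; the helper name is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: probe HAKMEM_FASTLANE_ENV_CTX once and cache the result.
 * fastlane_env_ctx_enabled() is a hypothetical helper name; a racing
 * first call is benign because both writers store the same value. */
static int fastlane_env_ctx_enabled(void) {
    static int cached = -1;  /* -1 = not yet probed */
    if (__builtin_expect(cached < 0, 0)) {
        const char* v = getenv("HAKMEM_FASTLANE_ENV_CTX");
        cached = (v != NULL && strcmp(v, "1") == 0);
    }
    return cached;
}
```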
### Phase 19-3: Stats Removal (C) + Header Inline (D)

**Priority**: P2 (medium ROI, low risk)
**Effort**: 2-3 hours
**Risk**: Low (stats are already compile-time optional)

Steps:

1. Make stats sample-based (1-in-4096) in Release builds
2. Add the `free_tiny_fast_trusted()` variant (skips header validation)
3. Measure: expect +3-5% throughput on top of Phase 19-2
4. Fallback: compile-time flags for both features

### Phase 19-4: Static Route Fast Path (E)

**Priority**: P3 (lower ROI, polish)
**Effort**: 2-3 hours
**Risk**: Low (the route table is static)

Steps:

1. Add the `g_route_fastmap` TLS cache
2. Replace function calls with bit tests
3. Measure: expect +2-3% throughput on top of Phase 19-3
4. Fallback: keep the existing path as a fallback

---

## 6. Box Theory Compliance

### Boundary Preservation

- **L0 (ENV)**: Keep the existing ENV gates; add new ones for each optimization
- **L1 (Hot inline)**: free_tiny_fast(), malloc_tiny_fast() remain unchanged
- **L2 (Cold fallback)**: free_cold(), malloc_cold() remain unchanged
- **L3 (Stats)**: Made optional via #if guards

### Reversibility

- Each phase is ENV-gated (revertible at runtime)
- Compile-time fallbacks are preserved (HAKMEM_BUILD_RELEASE controls stats)
- The FastLane layer can be kept as a compile-time option for A/B testing

### Incremental Rollout

- Phase 19-1: Remove wrapper (default ON)
- Phase 19-2: ENV context (default OFF, opt-in for testing)
- Phase 19-3: Stats/header (default ON in Release, OFF in Debug)
- Phase 19-4: Route fast path (default ON)

---

## 7. Validation Checklist

After each phase:

- [ ] Run perf stat (compare instructions/branches/cycles per op)
- [ ] Run perf record + annotate (verify the hot path reduction)
- [ ] Run the benchmark suite (Mixed, C6-heavy, C7-heavy)
- [ ] Check correctness (Larson, multithreaded, stress tests)
- [ ] Measure RSS/memory overhead (should be unchanged)
- [ ] A/B test (ENV toggle to verify reversibility)

Success criteria:

- [ ] Throughput improvement matches the estimate (±20%)
- [ ] Instruction count reduction matches the estimate (±20%)
- [ ] Branch count reduction matches the estimate (±20%)
- [ ] No correctness regressions (all tests pass)
- [ ] No memory overhead increase (RSS unchanged)

---

## 8. Risk Assessment

### High-Risk Areas

1. **ENV invalidation** (Phase 19-2): Runtime ENV changes could break the cached context
   - Mitigation: Use invalidation hooks (the existing hakmem_env_snapshot infrastructure)
   - Fallback: Revert to scattered ENV checks
2. **Header validation trust** (Phase 19-3, Candidate D): Skipping validation could miss corruption
   - Mitigation: Keep validation in Debug builds and test extensively (see the sketch at the end of this section)
   - Fallback: A compile-time option to keep the duplicate checks

### Medium-Risk Areas

1. **FastLane removal** (Phase 19-1): Could break the gradual rollout (class_mask filtering)
   - Mitigation: Keep class_mask filtering in the FastLane path only (the direct path always falls back safely)
   - Fallback: Keep FastLane as a compile-time option

### Low-Risk Areas

1. **Stats removal** (Phase 19-3, Candidate C): Already compile-time optional
2. **Route fast path** (Phase 19-4): The route table is static; no runtime changes
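For the header-validation-trust mitigation above, the Debug-only re-check could look like this sketch. The helper name is hypothetical; the mask/magic constants (`0xF0`, `0xA0`) match the header check shown in the Section 2.1 annotation, and `HAKMEM_BUILD_RELEASE` is the existing build flag.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: re-assert the header invariant in Debug builds only, so the
 * trusted fast path keeps corruption coverage during testing.
 * debug_check_tiny_header() is a hypothetical helper name. */
static inline void debug_check_tiny_header(const void* ptr) {
#if !HAKMEM_BUILD_RELEASE
    uint8_t header = *((const uint8_t*)ptr - 1);
    assert((header & 0xF0) == 0xA0 && "tiny header magic corrupted");
#else
    (void)ptr;  /* compiles away in Release */
#endif
}
```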
---

## 9. Future Optimization Opportunities (Post-Phase 19)

After Phase 19 closes the wrapper gap, the next targets are:

1. **Unified Cache optimization** (4.44% of cycles):
   - Reduce cache-miss overhead (refill path)
   - Optimize the LIFO vs ring buffer trade-off
2. **Header finalization** (4.34% of cycles):
   - Investigate always_inline for tiny_header_finalize_alloc()
   - Reduce metadata writes (defer to a batch update)
3. **C7 ULTRA optimization** (3.38% + 2.07% = 5.45% of cycles):
   - Investigate TLS cache locality
   - Reduce ULTRA push/pop overhead
4. **Super lookup optimization** (0.98% of cycles):
   - Already optimized in Phase 12 (mask-based)
   - Further reduction may require architectural changes

**Estimated ceiling**: With all optimizations, throughput could approach ~65-70M ops/s (vs libc's 77.62M)
**Remaining gap**: Likely fundamental architectural differences (thread-local vs global allocator)

---

## 10. Appendix: Detailed perf Data

### 10.1 perf stat Results (200M ops)

**hakmem (FORCE_LIBC=0)**:

```
Performance counter stats for 'bench_random_mixed_hakmem ... HAKMEM_FORCE_LIBC_ALLOC=0':

    19,296,118,430      cycles
    41,817,886,925      instructions              # 2.17 insn per cycle
    10,466,190,806      branches
       232,592,257      branch-misses             # 2.22% of all branches
         1,660,073      cache-misses
           134,601      L1-icache-load-misses

       4.913685503 seconds time elapsed

Throughput: 44.88M ops/s
```

**libc (FORCE_LIBC=1)**:

```
Performance counter stats for 'bench_random_mixed_hakmem ... HAKMEM_FORCE_LIBC_ALLOC=1':

    10,937,550,228      cycles
    27,183,469,339      instructions              # 2.49 insn per cycle
     4,586,617,379      branches
       131,515,905      branch-misses             # 2.87% of all branches
           767,370      cache-misses
            64,102      L1-icache-load-misses

       2.835174452 seconds time elapsed

Throughput: 77.62M ops/s
```

### 10.2 Top 30 Hot Functions (perf report)

```
23.97%  front_fastlane_try_free.lto_priv.0
23.84%  malloc
22.02%  main
 6.82%  free
 4.44%  unified_cache_push.lto_priv.0
 4.34%  tiny_header_finalize_alloc.lto_priv.0
 3.38%  tiny_c7_ultra_alloc.constprop.0
 2.07%  tiny_c7_ultra_free
 1.22%  hakmem_env_snapshot_enabled.lto_priv.0
 0.98%  hak_super_lookup.part.0.lto_priv.4.lto_priv.0
 0.85%  hakmem_env_snapshot.lto_priv.0
 0.82%  hak_pool_free_v1_slow_impl
 0.59%  tiny_front_v3_snapshot_get.lto_priv.0
 0.30%  __memset_avx2_unaligned_erms (libc)
 0.30%  tiny_unified_lifo_enabled.lto_priv.0
 0.28%  hak_free_at.constprop.0
 0.24%  hak_pool_try_alloc.part.0
 0.24%  malloc_cold
 0.16%  hak_pool_try_alloc_v1_impl.part.0
 0.14%  free_cold.constprop.0
 0.13%  mid_inuse_dec_deferred
 0.12%  hak_pool_mid_lookup
 0.12%  do_user_addr_fault (kernel)
 0.11%  handle_pte_fault (kernel)
 0.11%  __mod_memcg_lruvec_state (kernel)
 0.10%  do_anonymous_page (kernel)
 0.09%  classify_ptr
 0.07%  tiny_get_max_size.lto_priv.0
 0.06%  __handle_mm_fault (kernel)
 0.06%  __alloc_pages (kernel)
```

---

## 11. Conclusion

Phase 19 has **clear, actionable targets** with high ROI:

1. **Immediate action (Phase 19-1)**: Remove the FastLane wrapper layer
   - Expected: +10-15% throughput
   - Risk: Low
   - Effort: 2-3 hours
2. **Follow-up (Phases 19-2 through 19-4)**: ENV consolidation + stats + route optimization
   - Expected: +6-11% additional throughput
   - Risk: Medium (ENV invalidation)
   - Effort: 8-12 hours

**Combined target**: +21% throughput (44.88M → 54.3M ops/s)
**Gap closure**: Reduce the instruction gap from +53.8% to +30.3% vs libc

This positions hakmem for competitive performance while maintaining safety and Box Theory compliance.