# Phase 19: FastLane Instruction Reduction - Design Document

## 0. Executive Summary

**Goal**: Reduce the instruction/branch count gap between hakmem and libc to close the throughput gap
**Current Gap**: hakmem 44.88M ops/s vs libc 77.62M ops/s (libc is 73.0% faster)
**Target**: Reduce the instruction gap from +53.8% to <+25%, targeting a +15-25% throughput improvement
**Success Criteria**: Achieve 52-56M ops/s (from the current 44.88M ops/s)

### Key Findings

Per-operation overhead comparison (200M ops):

| Metric | hakmem | libc | Delta | Delta % |
|--------|--------|------|-------|---------|
| **Instructions/op** | 209.09 | 135.92 | +73.17 | **+53.8%** |
| **Branches/op** | 52.33 | 22.93 | +29.40 | **+128.2%** |
| Cycles/op | 96.48 | 54.69 | +41.79 | +76.4% |
| Branch-miss % | 2.22% | 2.87% | -0.65% | Better |

**Critical insight**: hakmem executes **73 extra instructions** and **29 extra branches** per operation vs libc. This overhead accounts for essentially the entire throughput gap.

---

## 1. Gap Analysis (Per-Operation Breakdown)

### 1.1 Instruction Gap: +73.17 instructions/op (+53.8%)

The excess comes from multiple layers of overhead:

- **FastLane wrapper checks**: ENV gates, class mask validation, size checks
- **Policy snapshot overhead**: TLS reads for routing decisions (3+ reads even with the ENV snapshot)
- **Route determination**: Static route table lookup vs a direct path
- **Multiple ENV gates**: Scattered throughout the hot path (DUALHOT, LEGACY_DIRECT, C7_ULTRA, etc.)
- **Stats counters**: Atomic increments on the hot path (FREE_PATH_STAT_INC, ALLOC_GATE_STAT_INC, etc.)
- **Header validation duplication**: Both FastLane and free_tiny_fast validate the header

### 1.2 Branch Gap: +29.40 branches/op (+128.2%)

The branch gap is **2.3x worse** than the instruction gap (see the sketch at the end of Section 1):

- **Cascading ENV checks**: Each layer adds 1-2 branches (g_initialized, class_mask, DUALHOT, C7_ULTRA, LEGACY_DIRECT)
- **Route dispatch**: Static route check + route_kind switch
- **Early-exit patterns**: Multiple if-checks for the ULTRA/DUALHOT/LEGACY paths
- **Stats gating**: `if (__builtin_expect(...))` patterns around counters

### 1.3 Why the Cycles/op Gap is Smaller Than Expected

Despite the +76.4% cycle gap, the CPU achieves 2.17 IPC (hakmem) vs 2.49 IPC (libc). This suggests:

- **Good CPU pipelining**: The branch predictor is working well (2.22% miss rate)
- **I-cache locality**: Code is reasonably compact despite the extra instructions
- **But**: We still pay for every extra branch in pipeline stalls
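To make the branch arithmetic concrete, here is a minimal, self-contained sketch of the cascading-gate shape described above. All names (`g_dualhot_on`, `free_dualhot`, etc.) are hypothetical stand-ins rather than hakmem's actual symbols; the point is that each layered gate costs roughly one branch per call, so four to five extra branches accumulate before any real free work starts.

```c
#include <stdbool.h>

/* Hypothetical gates and handlers, for illustration only. */
static bool g_initialized, g_dualhot_on, g_c7_ultra_on, g_legacy_direct_on;

static int free_dualhot(void* p)  { (void)p; return 1; }
static int free_c7_ultra(void* p) { (void)p; return 1; }
static int free_legacy(void* p)   { (void)p; return 1; }
static int free_generic(void* p)  { (void)p; return 1; }

/* Cascading shape: every call re-tests each gate, costing one branch per
 * layer even when the predictor gets all of them right. */
static int free_cascaded(void* ptr) {
    if (!g_initialized)     return 0;                  /* branch 1 */
    if (g_dualhot_on)       return free_dualhot(ptr);  /* branch 2 */
    if (g_c7_ultra_on)      return free_c7_ultra(ptr); /* branch 3 */
    if (g_legacy_direct_on) return free_legacy(ptr);   /* branch 4 */
    return free_generic(ptr);
}
```

Candidates A and B in Section 3 target exactly this shape: fewer layers, and one pre-computed decision instead of a chain of independently re-checked gates.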
---

## 2. Hot Path Breakdown (perf report)

Top 10 hot functions (% of cycles):

| Function | % time | Category | Reduction Target? |
|----------|--------|----------|-------------------|
| **front_fastlane_try_free** | 23.97% | Wrapper | ✓ **YES** (remove layer) |
| **malloc** | 23.84% | Wrapper | ✓ **YES** (remove layer) |
| main | 22.02% | Benchmark | (baseline) |
| **free** | 6.82% | Wrapper | ✓ **YES** (remove layer) |
| unified_cache_push | 4.44% | Core | Optimize later |
| tiny_header_finalize_alloc | 4.34% | Core | Optimize later |
| tiny_c7_ultra_alloc | 3.38% | Core | Optimize later |
| tiny_c7_ultra_free | 2.07% | Core | Optimize later |
| hakmem_env_snapshot_enabled | 1.22% | ENV | ✓ **YES** (eliminate checks) |
| hak_super_lookup | 0.98% | Core | Optimize later |

**Critical observation**: The top 3 user-space functions are **all wrappers**:

- `front_fastlane_try_free` (23.97%) + `free` (6.82%) = **30.79%** on free wrappers
- `malloc` (23.84%) on the alloc wrapper
- Combined wrapper overhead: **~54-55%** of all cycles

### 2.1 front_fastlane_try_free Annotated Breakdown

From `perf annotate`, the hot path has these expensive operations:

**Header validation** (lines 1c786-1c791, ~3% of samples):

```asm
movzbl -0x1(%rbp),%ebx      # Load header byte
mov    %ebx,%eax            # Copy to eax
and    $0xfffffff0,%eax     # Extract magic (0xA0)
cmp    $0xa0,%al            # Check magic
jne    ... (fallback)       # Branch on mismatch
```

**ENV snapshot checks** (lines 1c7ff-1c822, ~7% of samples):

```asm
cmpl   $0x1,0x628fa(%rip)   # g_hakmem_env_snapshot_ctor_mode (3.01%)
mov    0x628ef(%rip),%r15d  # g_hakmem_env_snapshot_gate (1.36%)
je     ...
cmp    $0xffffffff,%r15d
je     ... (init path)
test   %r15d,%r15d
jne    ... (snapshot path)
```

**Class routing overhead** (lines 1c7d1-1c7fb, ~3% of samples):

```asm
mov    0x6299c(%rip),%r15d  # g.5.lto_priv.0 (policy gate)
cmp    $0x1,%r15d
jne    ... (fallback)
movzbl 0x6298f(%rip),%eax   # g_mask.3.lto_priv.0
cmp    $0xff,%al
je     ... (all-classes path)
movzbl %al,%r9d
bt     %r13d,%r9d           # Bit test class mask
jae    ... (fallback)
```

**Total overhead**: ~15-20% of the cycles in front_fastlane_try_free are spent on:

- Header validation (done again in free_tiny_fast)
- ENV snapshot probing
- Policy/route checks

---

## 3. Reduction Candidates (Prioritized by ROI)

### Candidate A: **Eliminate FastLane Wrapper Layer** (Highest ROI)

**Problem**: front_fastlane_try_free + free wrappers consume 30.79% of cycles
**Root cause**: Double header validation + ENV checks + class mask checks
**Proposal**: Call free_tiny_fast() directly from the free() wrapper

**Implementation**:

```c
// In the free() wrapper:
void free(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return;

    // Phase 19-A: Direct call (no FastLane layer)
    if (free_tiny_fast(ptr)) {
        return;  // Handled
    }

    // Fallback to the cold path
    free_cold(ptr);
}
```

**Reduction estimate**:

- **Instructions**: -15-20/op (eliminate the duplicate header read, ENV checks, class mask checks)
- **Branches**: -5-7/op (remove the FastLane gate checks)
- **Impact**: ~10-15% throughput improvement (removes the 30% wrapper overhead)

**Risk**: **LOW** (free_tiny_fast already has validation + routing logic)
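The fallback named in Phase 19-1 (keep FastLane as a compile-time option) could look like the sketch below. The `HAKMEM_FASTLANE_LAYER` flag name and the prototypes are assumptions for illustration; `front_fastlane_try_free`, `free_tiny_fast`, and `free_cold` are the existing functions.

```c
// Assumed prototypes for the existing paths (signatures are assumptions):
int  front_fastlane_try_free(void* ptr);
int  free_tiny_fast(void* ptr);
void free_cold(void* ptr);

// Sketch only: HAKMEM_FASTLANE_LAYER is a hypothetical build flag that
// keeps the old layered path selectable for A/B testing.
void free(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return;
#if HAKMEM_FASTLANE_LAYER
    if (front_fastlane_try_free(ptr)) return;  // legacy layered path
#else
    if (free_tiny_fast(ptr)) return;           // Phase 19-A direct path
#endif
    free_cold(ptr);  // cold fallback in both configurations
}
```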
---

### Candidate B: **Consolidate ENV Snapshot Checks** (High ROI)

**Problem**: The ENV snapshot is checked **3+ times per operation**:

1. FastLane entry: `g_initialized` check
2. Route determination: `hakmem_env_snapshot_enabled()` check
3. Route-specific: `tiny_c7_ultra_enabled_env()` check
4. Legacy fallback: another ENV snapshot check

**Proposal**: A single ENV snapshot read at entry, passing context down

**Implementation**:

```c
// Phase 19-B: ENV context struct
typedef struct {
    bool c7_ultra_enabled;
    bool dualhot_enabled;
    bool legacy_direct_enabled;
    SmallRouteKind route_kind[8];  // Pre-computed routes
} FastLaneCtx;

static __thread FastLaneCtx g_fastlane_ctx = {0};
static __thread int g_fastlane_ctx_init = 0;

static inline const FastLaneCtx* fastlane_ctx_get(void) {
    if (__builtin_expect(g_fastlane_ctx_init == 0, 0)) {
        // One-time init per thread
        const HakmemEnvSnapshot* env = hakmem_env_snapshot();
        g_fastlane_ctx.c7_ultra_enabled = env->tiny_c7_ultra_enabled;
        // ... populate other fields
        g_fastlane_ctx_init = 1;
    }
    return &g_fastlane_ctx;
}
```

**Reduction estimate**:

- **Instructions**: -8-12/op (eliminate redundant TLS reads)
- **Branches**: -3-5/op (a single init check instead of multiple)
- **Impact**: ~5-8% throughput improvement

**Risk**: **MEDIUM** (ENV changes during runtime must be handled; use an invalidation hook, sketched below)
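For the MEDIUM risk above, one plausible shape for the invalidation hook is a global generation counter: the ENV-snapshot infrastructure bumps it whenever a gate changes, and each thread lazily re-populates its `FastLaneCtx` when its cached generation goes stale. This is a sketch under assumptions: `fastlane_ctx_invalidate_all` and `fastlane_ctx_populate` are hypothetical names, and `FastLaneCtx` is the struct defined above.

```c
#include <stdatomic.h>

static _Atomic unsigned g_env_generation = 1;   /* bumped on any ENV change */
static __thread unsigned g_ctx_generation = 0;  /* 0 = never populated */
static __thread FastLaneCtx g_fastlane_ctx;

void fastlane_ctx_populate(FastLaneCtx* ctx);   /* assumed helper: re-reads
                                                 * the ENV snapshot */

/* Hook for the ENV-snapshot infrastructure to call when a gate changes. */
void fastlane_ctx_invalidate_all(void) {
    atomic_fetch_add_explicit(&g_env_generation, 1, memory_order_release);
}

static inline const FastLaneCtx* fastlane_ctx_get(void) {
    unsigned gen = atomic_load_explicit(&g_env_generation,
                                        memory_order_acquire);
    if (__builtin_expect(g_ctx_generation != gen, 0)) {
        fastlane_ctx_populate(&g_fastlane_ctx);
        g_ctx_generation = gen;
    }
    return &g_fastlane_ctx;
}
```

The hot path still pays a single compare-and-branch, so the -3-5 branches/op estimate is preserved while the cache stays revertible at runtime.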
---

### Candidate C: **Remove Stats Counters from Hot Path** (Medium ROI)

**Problem**: Stats counters on the hot path add atomic increments:

- `FRONT_FASTLANE_STAT_INC(free_total)` (every op)
- `FREE_PATH_STAT_INC(total_calls)` (every op)
- `ALLOC_GATE_STAT_INC(total_calls)` (every alloc)
- `tiny_front_free_stat_inc(class_idx)` (every free)

**Proposal**: Keep exact stats in Debug builds; make them sample-based (1-in-N) in Release builds

**Implementation**:

```c
// Phase 19-C: Sampling-based stats (snippet as expanded at each stat site)
static __thread uint32_t g_stat_counter = 0;

#if HAKMEM_BUILD_RELEASE
// Release: sample 1-in-4096 operations
if (__builtin_expect((++g_stat_counter & 0xFFF) == 0, 0)) {
    FRONT_FASTLANE_STAT_INC(free_total);
}
#else
// Debug: keep exact counts
FRONT_FASTLANE_STAT_INC(free_total);
#endif
```

**Reduction estimate**:

- **Instructions**: -4-6/op (remove atomic increments)
- **Branches**: -2-3/op (remove the `if (__builtin_expect(...))` checks)
- **Impact**: ~3-5% throughput improvement

**Risk**: **LOW** (stats are already compile-time optional)

---

### Candidate D: **Inline Header Validation** (Medium ROI)

**Problem**: Header validation happens twice:

1. FastLane wrapper: `*((uint8_t*)ptr - 1)` (lines 179-191 in front_fastlane_box.h)
2. free_tiny_fast: the same check (lines 598-605 in malloc_tiny_fast.h)

**Proposal**: Trust the FastLane validation; remove the duplicate check

**Implementation**:

```c
// Phase 19-D: Add a "trusted" variant
static inline int free_tiny_fast_trusted(void* ptr, int class_idx, void* base) {
    // Skip header validation (caller already validated)
    // Go directly to route dispatch
    ...
}

// In FastLane:
uint8_t header = *((uint8_t*)ptr - 1);
int class_idx = header & 0x0F;
void* base = tiny_user_to_base_inline(ptr);
return free_tiny_fast_trusted(ptr, class_idx, base);
```

**Reduction estimate**:

- **Instructions**: -3-5/op (remove the duplicate header load + extract)
- **Branches**: -1-2/op (remove the duplicate magic check)
- **Impact**: ~2-3% throughput improvement

**Risk**: **MEDIUM** (every caller must be guaranteed to validate the header)

---

### Candidate E: **Static Route Table Optimization** (Lower ROI)

**Problem**: Route determination uses TLS lookups + bit tests:

```c
if (tiny_static_route_ready_fast()) {
    route_kind = tiny_static_route_get_kind_fast(class_idx);
} else {
    route_kind = tiny_policy_hot_get_route(class_idx);
}
```

**Proposal**: Pre-compute common routes at init; inline the direct paths

**Implementation**:

```c
// Phase 19-E: Route fast path (C0-C3 LEGACY, C7 ULTRA)
static __thread uint8_t g_route_fastmap = 0;  // bit 0=C0 ... bit 7=C7, 1=LEGACY

static inline bool is_legacy_route_fast(int class_idx) {
    return (g_route_fastmap >> class_idx) & 1;
}
```

**Reduction estimate**:

- **Instructions**: -3-4/op (replace a function call with a bit test)
- **Branches**: -1-2/op (replace nested ifs with a single bit test)
- **Impact**: ~2-3% throughput improvement

**Risk**: **LOW** (the route table is already static)

---

## 4. Combined Impact Estimate

Assuming independent reductions (a conservative estimate at 80% efficiency due to overlap):

| Candidate | Instructions/op | Branches/op | Throughput |
|-----------|-----------------|-------------|------------|
| Baseline | 209.09 | 52.33 | 44.88M ops/s |
| **A: Remove FastLane layer** | -17.5 | -6.0 | +12% |
| **B: ENV snapshot consolidation** | -10.0 | -4.0 | +6% |
| **C: Stats removal (Release)** | -5.0 | -2.5 | +4% |
| **D: Inline header validation** | -4.0 | -1.5 | +2% |
| **E: Static route fast path** | -3.5 | -1.5 | +2% |
| **Combined (80% efficiency)** | **-32.0** | **-12.4** | **+21%** |

**Projected outcome**:

- Instructions/op: 209.09 → **177.09** (vs libc 135.92, gap reduced from +53.8% to +30.3%)
- Branches/op: 52.33 → **39.93** (vs libc 22.93, gap reduced from +128.2% to +74.1%)
- Throughput: 44.88M → **54.3M ops/s** (vs libc 77.62M, gap reduced from +73.0% to +43.0%)

**Achievement vs Goal**: ✓ Meets target (+21%, within the +15-25% goal)

---

## 5. Implementation Plan

### Phase 19-1: Remove FastLane Wrapper Layer (A)

**Priority**: P0 (highest ROI)
**Effort**: 2-3 hours
**Risk**: Low (free_tiny_fast is already complete)

Steps:

1. Modify the `free()` wrapper to call `free_tiny_fast(ptr)` directly
2. Modify the `malloc()` wrapper to call `malloc_tiny_fast(size)` directly
3. Measure: expect +10-15% throughput
4. Fallback: keep FastLane as a compile-time option (sketched under Candidate A above)

### Phase 19-2: ENV Snapshot Consolidation (B)

**Priority**: P1 (high ROI, moderate risk)
**Effort**: 4-6 hours
**Risk**: Medium (ENV invalidation needed)

Steps:

1. Create the `FastLaneCtx` struct with pre-computed ENV state
2. Add a TLS cache with an invalidation hook
3. Replace the scattered ENV checks with a single context read
4. Measure: expect +5-8% throughput on top of Phase 19-1
5. Fallback: ENV-gate the new path (HAKMEM_FASTLANE_ENV_CTX=1); see the probe sketch below
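A minimal sketch of the opt-in gate from step 5, assuming the probe is cached after the first call. The `HAKMEM_FASTLANE_ENV_CTX` variable comes from the plan above; the helper name is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: probe HAKMEM_FASTLANE_ENV_CTX once and cache the result.
 * fastlane_env_ctx_enabled() is a hypothetical helper name; a racing
 * first call is benign because both writers store the same value. */
static int fastlane_env_ctx_enabled(void) {
    static int cached = -1;  /* -1 = not yet probed */
    if (__builtin_expect(cached < 0, 0)) {
        const char* v = getenv("HAKMEM_FASTLANE_ENV_CTX");
        cached = (v != NULL && strcmp(v, "1") == 0);
    }
    return cached;
}
```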
### Phase 19-3: Stats Removal (C) + Header Inline (D)

**Priority**: P2 (medium ROI, low risk)
**Effort**: 2-3 hours
**Risk**: Low (stats are already compile-time optional)

Steps:

1. Make stats sample-based (1-in-4096) in Release builds
2. Add the `free_tiny_fast_trusted()` variant (skips header validation)
3. Measure: expect +3-5% throughput on top of Phase 19-2
4. Fallback: compile-time flags for both features

### Phase 19-4: Static Route Fast Path (E)

**Priority**: P3 (lower ROI, polish)
**Effort**: 2-3 hours
**Risk**: Low (the route table is static)

Steps:

1. Add the `g_route_fastmap` TLS cache
2. Replace function calls with bit tests
3. Measure: expect +2-3% throughput on top of Phase 19-3
4. Fallback: keep the existing path as a fallback

---

## 6. Box Theory Compliance

### Boundary Preservation

- **L0 (ENV)**: Keep the existing ENV gates; add new ones for each optimization
- **L1 (Hot inline)**: free_tiny_fast(), malloc_tiny_fast() remain unchanged
- **L2 (Cold fallback)**: free_cold(), malloc_cold() remain unchanged
- **L3 (Stats)**: Made optional via #if guards

### Reversibility

- Each phase is ENV-gated (revertible at runtime)
- Compile-time fallbacks are preserved (HAKMEM_BUILD_RELEASE controls stats)
- The FastLane layer can be kept as a compile-time option for A/B testing

### Incremental Rollout

- Phase 19-1: Remove wrapper (default ON)
- Phase 19-2: ENV context (default OFF, opt-in for testing)
- Phase 19-3: Stats/header (default ON in Release, OFF in Debug)
- Phase 19-4: Route fast path (default ON)

---

## 7. Validation Checklist

After each phase:

- [ ] Run perf stat (compare instructions/branches/cycles per op)
- [ ] Run perf record + annotate (verify the hot path reduction)
- [ ] Run the benchmark suite (Mixed, C6-heavy, C7-heavy)
- [ ] Check correctness (Larson, multithreaded, stress tests)
- [ ] Measure RSS/memory overhead (should be unchanged)
- [ ] A/B test (ENV toggle to verify reversibility)

Success criteria:

- [ ] Throughput improvement matches the estimate (±20%)
- [ ] Instruction count reduction matches the estimate (±20%)
- [ ] Branch count reduction matches the estimate (±20%)
- [ ] No correctness regressions (all tests pass)
- [ ] No memory overhead increase (RSS unchanged)

---

## 8. Risk Assessment

### High-Risk Areas

1. **ENV invalidation** (Phase 19-2): Runtime ENV changes could break the cached context
   - Mitigation: Use invalidation hooks (the existing hakmem_env_snapshot infrastructure)
   - Fallback: Revert to scattered ENV checks
2. **Header validation trust** (Phase 19-3, Candidate D): Skipping validation could miss corruption
   - Mitigation: Keep validation in Debug builds and test extensively (see the sketch at the end of this section)
   - Fallback: A compile-time option to keep the duplicate checks

### Medium-Risk Areas

1. **FastLane removal** (Phase 19-1): Could break the gradual rollout (class_mask filtering)
   - Mitigation: Keep class_mask filtering in the FastLane path only (the direct path always falls back safely)
   - Fallback: Keep FastLane as a compile-time option

### Low-Risk Areas

1. **Stats removal** (Phase 19-3, Candidate C): Already compile-time optional
2. **Route fast path** (Phase 19-4): The route table is static; no runtime changes
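For the header-validation-trust mitigation above, the Debug-only re-check could look like this sketch. The helper name is hypothetical; the mask/magic constants (`0xF0`, `0xA0`) match the header check shown in the Section 2.1 annotation, and `HAKMEM_BUILD_RELEASE` is the existing build flag.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: re-assert the header invariant in Debug builds only, so the
 * trusted fast path keeps corruption coverage during testing.
 * debug_check_tiny_header() is a hypothetical helper name. */
static inline void debug_check_tiny_header(const void* ptr) {
#if !HAKMEM_BUILD_RELEASE
    uint8_t header = *((const uint8_t*)ptr - 1);
    assert((header & 0xF0) == 0xA0 && "tiny header magic corrupted");
#else
    (void)ptr;  /* compiles away in Release */
#endif
}
```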
---

## 9. Future Optimization Opportunities (Post-Phase 19)

After Phase 19 closes the wrapper gap, the next targets are:

1. **Unified Cache optimization** (4.44% of cycles):
   - Reduce cache-miss overhead (refill path)
   - Optimize the LIFO vs ring buffer trade-off
2. **Header finalization** (4.34% of cycles):
   - Investigate always_inline for tiny_header_finalize_alloc()
   - Reduce metadata writes (defer to a batch update)
3. **C7 ULTRA optimization** (3.38% + 2.07% = 5.45% of cycles):
   - Investigate TLS cache locality
   - Reduce ULTRA push/pop overhead
4. **Super lookup optimization** (0.98% of cycles):
   - Already optimized in Phase 12 (mask-based)
   - Further reduction may require architectural changes

**Estimated ceiling**: With all optimizations, throughput could approach ~65-70M ops/s (vs libc's 77.62M)
**Remaining gap**: Likely fundamental architectural differences (thread-local vs global allocator)

---

## 10. Appendix: Detailed perf Data

### 10.1 perf stat Results (200M ops)

**hakmem (FORCE_LIBC=0)**:

```
Performance counter stats for 'bench_random_mixed_hakmem ... HAKMEM_FORCE_LIBC_ALLOC=0':

    19,296,118,430      cycles
    41,817,886,925      instructions              # 2.17 insn per cycle
    10,466,190,806      branches
       232,592,257      branch-misses             # 2.22% of all branches
         1,660,073      cache-misses
           134,601      L1-icache-load-misses

       4.913685503 seconds time elapsed

Throughput: 44.88M ops/s
```

**libc (FORCE_LIBC=1)**:

```
Performance counter stats for 'bench_random_mixed_hakmem ... HAKMEM_FORCE_LIBC_ALLOC=1':

    10,937,550,228      cycles
    27,183,469,339      instructions              # 2.49 insn per cycle
     4,586,617,379      branches
       131,515,905      branch-misses             # 2.87% of all branches
           767,370      cache-misses
            64,102      L1-icache-load-misses

       2.835174452 seconds time elapsed

Throughput: 77.62M ops/s
```

### 10.2 Top 30 Hot Functions (perf report)

```
23.97%  front_fastlane_try_free.lto_priv.0
23.84%  malloc
22.02%  main
 6.82%  free
 4.44%  unified_cache_push.lto_priv.0
 4.34%  tiny_header_finalize_alloc.lto_priv.0
 3.38%  tiny_c7_ultra_alloc.constprop.0
 2.07%  tiny_c7_ultra_free
 1.22%  hakmem_env_snapshot_enabled.lto_priv.0
 0.98%  hak_super_lookup.part.0.lto_priv.4.lto_priv.0
 0.85%  hakmem_env_snapshot.lto_priv.0
 0.82%  hak_pool_free_v1_slow_impl
 0.59%  tiny_front_v3_snapshot_get.lto_priv.0
 0.30%  __memset_avx2_unaligned_erms (libc)
 0.30%  tiny_unified_lifo_enabled.lto_priv.0
 0.28%  hak_free_at.constprop.0
 0.24%  hak_pool_try_alloc.part.0
 0.24%  malloc_cold
 0.16%  hak_pool_try_alloc_v1_impl.part.0
 0.14%  free_cold.constprop.0
 0.13%  mid_inuse_dec_deferred
 0.12%  hak_pool_mid_lookup
 0.12%  do_user_addr_fault (kernel)
 0.11%  handle_pte_fault (kernel)
 0.11%  __mod_memcg_lruvec_state (kernel)
 0.10%  do_anonymous_page (kernel)
 0.09%  classify_ptr
 0.07%  tiny_get_max_size.lto_priv.0
 0.06%  __handle_mm_fault (kernel)
 0.06%  __alloc_pages (kernel)
```

---

## 11. Conclusion

Phase 19 has **clear, actionable targets** with high ROI:

1. **Immediate action (Phase 19-1)**: Remove the FastLane wrapper layer
   - Expected: +10-15% throughput
   - Risk: Low
   - Effort: 2-3 hours
2. **Follow-up (Phases 19-2 through 19-4)**: ENV consolidation + stats + route optimization
   - Expected: +6-11% additional throughput
   - Risk: Medium (ENV invalidation)
   - Effort: 8-12 hours

**Combined target**: +21% throughput (44.88M → 54.3M ops/s)
**Gap closure**: Reduce the instruction gap from +53.8% to +30.3% vs libc

This positions hakmem for competitive performance while maintaining safety and Box Theory compliance.