hakmem/docs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.md

# Phase 5 E5: Comprehensive Analysis & Implementation Report

## Executive Summary

**Date**: 2025-12-14
**Baseline**: 47.34M ops/s (Mixed, E4-1+E4-2 ON, 20M iters, ws=400)
**Status**: Step 1 Complete - Perf Analysis & Priority Decision

---

## Step 1: Perf Analysis - Free() Internal Breakdown

### Test Configuration
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 999 -- \
  ./bench_random_mixed_hakmem 20000000 400 1
```

**Results**: 476 samples @ 999Hz, throughput 44.12M ops/s

### Top Hotspots (self% >= 2.0%)

| Rank | Function | Self% | Analysis |
|------|----------|-------|----------|
| 1 | `free` | 21.67% | Wrapper layer overhead (header check, ENV snapshot, gate dispatch) |
| 2 | `tiny_alloc_gate_fast` | 18.55% | Alloc gate (reduced from E4 baseline) |
| 3 | `main` | 17.44% | Benchmark driver |
| 4 | `malloc` | 12.27% | Wrapper layer (reduced from E4 baseline) |
| 5 | `free_tiny_fast_cold` | 7.89% | **NEW**: Cold path (route determination, policy snapshot) |
| 6 | `unified_cache_push` | 3.47% | TLS cache push operation |
| 7 | `tiny_c7_ultra_alloc` | 2.98% | C7 ULTRA alloc path |
| 8 | `tiny_region_id_write_header` | 2.59% | **Header write tax** (reduced from 6.97% in E4 baseline) |
| 9 | `hakmem_env_snapshot_enabled` | 2.57% | **ENV snapshot gate overhead** (reduced from 4.29%) |

### Free Path Internal Breakdown

**Total free() overhead**: 21.67% (wrapper) + 7.89% (cold path) = **29.56%**

**Key Observations**:

1. **Wrapper Layer (21.67% self%)**:
   - NULL check: ~0%
   - Header magic validation: **~8% of free() self%**
     - Load header byte: `movzbl -0x1(%rbp),%r15d` (7.44% sample hit)
     - Magic comparison: `cmp $0xa0,%r14b` (0%)
     - Class extraction: `and $0xf,%r12d` (0%)
   - ENV snapshot calls: **~8% of free() self%**
     - `mov 0x5b311(%rip),%edx` (7.73% + 6.98% sample hits)
     - Branch to cold init paths (0%)
   - TLS reads for gate dispatch: **~5% of free() self%**
     - `mov 0x14(%r15),%ebx` (8.01% sample hit)

2. **Cold Path (7.89% self%)**:
   - `tiny_route_for_class()`: 1.95% self% (route determination)
   - Policy snapshot: 0.64% self%
   - Class-based dispatch overhead

3. **Header Write Tax (2.59% self%)**:
   - **Down from 6.97% in the E4 baseline** (a ~63% relative reduction)
   - Already significantly improved, so the ROI of further optimization is lower

4. **ENV Snapshot Overhead (2.57% self%)**:
   - **Down from 4.29% in the E4 baseline** (a ~40% relative reduction)
   - Lazy init checks + TLS reads
   - Still visible but reduced impact

### Critical Insight: Header Check Overhead

**Annotated perf shows**:
- Lines with 7-8% sample hits are concentrated in:
  1. **Header load**: `movzbl -0x1(%rbp),%r15d` (offset 0x1cd6d)
  2. **Multiple TLS/global reads** for ENV gates (offsets 0x1ccd1, 0x1ccf8, 0x1cd01, 0x1cd2d)

**These lines represent**:
- **Header validation overhead**: ~7-8% of free() self%
- **ENV gate consolidation overhead**: ~8% of free() self% (multiple checks)
- **Total wrapper tax**: ~16% of 21.67% = **~3.5% of total runtime**

### Branch Prediction Analysis

**From perf annotate**:
- Page boundary check: `test $0xfff,%ebp` (0% hit - well predicted)
- Magic comparison: `cmp $0xa0,%r14b` (0% hit - well predicted)
- ENV snapshot checks: Multiple branches with 0% hit (well predicted)
- **No significant branch misprediction issues**

---

## Step 2: E5 Candidate Priority Decision

### E5-1: Free() Tiny Direct Path (HIGHEST PRIORITY)

**Target**: 21.67% (free wrapper) + 7.89% (free_tiny_fast_cold) = **29.56% total**

**Hypothesis**:
- Current free() wrapper does header validation + ENV snapshot + gate dispatch
- For Tiny allocations (48% of frees in Mixed), this is redundant work:
  1. Header is validated in wrapper (magic check)
  2. Header is re-extracted in `free_tiny_fast()`
  3. Route determination happens in `free_tiny_fast_cold()`
- **Opportunity**: Single-check Tiny direct path in wrapper

**Strategy** (Box Theory):
- **L0 SplitBox**: `HAKMEM_FREE_TINY_DIRECT=0/1` (default 0)
- **L1 HotBox**: Wrapper-level Tiny fast path
  - Check: `(header & 0xF0) == 0xA0` (Tiny magic)
  - Extract: `class_idx = header & 0x0F`
  - Validate: `class_idx < 8` (fail-fast)
  - Direct call: `free_tiny_fast_direct(ptr, base, class_idx)`
    - Skips: ENV snapshot consolidation in wrapper
    - Skips: Cold path route determination
    - Skips: Re-validation of header
- **L1 ColdBox**: Existing fallback for Mid/Pool/Large/invalid
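
The wrapper-level checks listed above can be sketched as a single classification helper. This is a minimal illustration, not the hakmem implementation: the name `tiny_direct_classify` and the macro names are hypothetical, and the rationale given for the page boundary guard is an assumption. Only the masks (`0xF0`, `0xA0`, `0x0F`), the class bound (8), and the 1-byte header at `ptr - 1` come from this report.

```c
#include <stdint.h>
#include <stddef.h>

/* Constants taken from the checks listed in the strategy above. */
#define TINY_MAGIC_MASK   0xF0u
#define TINY_MAGIC        0xA0u
#define TINY_CLASS_MASK   0x0Fu
#define TINY_NUM_CLASSES  8u
#define PAGE_OFFSET_MASK  0xFFFu

/* Hypothetical helper: returns the Tiny class index (0..7) when the
 * 1-byte header in front of `ptr` looks like a Tiny header, or -1 when
 * the caller should fall back to the existing (cold) path. */
static inline int tiny_direct_classify(const void *ptr)
{
    if (ptr == NULL)
        return -1;

    /* Page boundary guard: for a page-aligned pointer the header byte
     * would sit on the previous page, so it is not read here
     * (assumed rationale). */
    if (((uintptr_t)ptr & PAGE_OFFSET_MASK) == 0)
        return -1;

    uint8_t header = *((const uint8_t *)ptr - 1);

    /* Tiny magic lives in the high nibble. */
    if ((header & TINY_MAGIC_MASK) != TINY_MAGIC)
        return -1;

    /* Class index lives in the low nibble; fail fast when out of range. */
    unsigned class_idx = header & TINY_CLASS_MASK;
    if (class_idx >= TINY_NUM_CLASSES)
        return -1;

    return (int)class_idx;
}
```

A wrapper would call the direct free path only when this returns a non-negative class index, and otherwise continue into the unchanged fallback, so every rejected pointer takes exactly the path it takes today.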

**Expected ROI**:
- **Baseline 29.56% overhead** → **Target 18-20% overhead** (30-40% reduction)
- **Translation**: +3-5% throughput improvement
- **Risk**: Low (single boundary, fail-fast validation)

**Implementation Complexity**: Medium
- New function: `free_tiny_fast_direct()` (inline, thin wrapper)
- Integration: 1 boundary in `free()` wrapper
- Counters: 3 (direct_hit, direct_miss, invalid_header)

### E5-2: Header Write to Refill Boundary (MEDIUM PRIORITY)

**Target**: 2.59% (tiny_region_id_write_header)

**Hypothesis**:
- Header write happens on EVERY alloc (hot path tax)
- Blocks are reused within same class → header is stable
- **Opportunity**: Write header once at refill boundary (cold path)

**Strategy** (Box Theory):
- **L0 SplitBox**: `HAKMEM_TINY_HEADER_PREFILL=0/1` (default 0)
- **L2 HeaderPrefillBox**: Refill boundary header initialization
  - When: Slab refill (cold path, ~64 blocks at once)
  - Where: `tiny_unified_cache_refill()` or similar
  - Action: Prefill all headers in slab page
- **L1 HotBox**: Alloc path skips header write
  - Condition: Check "prefill done" flag per slab
  - Fast path: Return `base + 1` directly (no header write)
  - Fallback: Traditional `tiny_region_id_write_header()` if not prefilled
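
The prefill idea can be illustrated with a toy slab descriptor. Everything here is a sketch under stated assumptions: `demo_slab_t` and both functions are hypothetical, the real slab metadata layout is not shown in this report, and the header encoding (`0xA0 | class_idx`) is inferred from the magic check described for the free path.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define TINY_MAGIC 0xA0u

/* Hypothetical slab descriptor; not the real hakmem metadata layout. */
typedef struct {
    uint8_t *base;             /* start of the slab's block area           */
    size_t   block_stride;     /* bytes per block, including 1-byte header */
    size_t   block_count;      /* number of blocks in the slab             */
    uint8_t  class_idx;        /* Tiny class served by this slab (0..7)    */
    bool     header_prefilled; /* set once every header has been written   */
} demo_slab_t;

/* Cold path: write every block header once, at refill time. */
static void demo_slab_prefill_headers(demo_slab_t *slab)
{
    uint8_t header = (uint8_t)(TINY_MAGIC | (slab->class_idx & 0x0Fu));
    for (size_t i = 0; i < slab->block_count; i++)
        slab->base[i * slab->block_stride] = header;
    slab->header_prefilled = true;
}

/* Hot path: return the user pointer for block `idx`. The header is
 * written here only when the slab was not prefilled (fallback). */
static inline void *demo_slab_user_ptr(demo_slab_t *slab, size_t idx)
{
    uint8_t *block = slab->base + idx * slab->block_stride;
    if (!slab->header_prefilled)
        block[0] = (uint8_t)(TINY_MAGIC | (slab->class_idx & 0x0Fu));
    return block + 1; /* base + 1: skip the 1-byte header */
}
```

The sketch relies on the hypothesis stated above: a block stays in the same class across reuse, so a header written at refill remains valid. If anything can overwrite or reclassify a block, the flag would have to be cleared for that slab.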

**Expected ROI**:
- **Baseline 2.59%** → **Target 0.5-1.0%** (60-80% reduction)
- **Translation**: +1.5-2.0% throughput improvement
- **Risk**: Medium (requires slab-level "prefilled" tracking)

**Implementation Complexity**: High
- Metadata: Per-slab "header_prefilled" flag (1 bit per slab)
- Integration: 2 boundaries (refill + alloc hot path)
- Safety: Fallback to hot path write if prefill not done

**Note**: Phase 1 A3 showed `always_inline` on header write was **NO-GO** (-4% on Mixed).
This approach is different: **eliminate writes**, not inline them.

### E5-3: ENV Snapshot Branch Shape (LOW PRIORITY)

**Target**: 2.57% (hakmem_env_snapshot_enabled)

**Hypothesis**:
- `MIXED_TINYV3_C7_SAFE` now defaults to `HAKMEM_ENV_SNAPSHOT=1`
- Current branch hint: `__builtin_expect(..., 0)` (expects DISABLED)
- Reality: Snapshot is ENABLED → hint is backwards
- **Opportunity**: Flip branch shape to match reality

**Strategy** (Box Theory):
- **L0 SplitBox**: `HAKMEM_ENV_SNAPSHOT_SHAPE=0/1` (default 0)
- **L1 ShapeBox**: Branch restructuring
  - Old: `if (__builtin_expect(!enabled, 1)) { legacy } else { snapshot }`
  - New: `if (__builtin_expect(enabled, 1)) { snapshot } else { legacy }`
  - Or: Move legacy to `noinline,cold` helper
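
A minimal sketch of the "new" shape combined with the out-of-line legacy helper, assuming GCC or Clang (`__builtin_expect` and the `noinline`/`cold` attributes are compiler extensions). The names `demo_env_t`, `demo_snapshot`, and both functions are hypothetical placeholders, not hakmem symbols.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the ENV snapshot structure. */
typedef struct { int wrap_shape; } demo_env_t;

static demo_env_t demo_snapshot = { 1 };

/* Legacy path moved out of line so the hot path stays compact. */
__attribute__((noinline, cold))
static int demo_wrap_shape_legacy(void)
{
    return 0; /* placeholder for the per-call ENV lookups */
}

static inline int demo_wrap_shape(bool snapshot_enabled)
{
    /* New shape: hint that the snapshot path is the common case. */
    if (__builtin_expect(snapshot_enabled, 1))
        return demo_snapshot.wrap_shape;
    return demo_wrap_shape_legacy();
}
```

The hint changes only code layout and static prediction, never the result, which is why the risk is rated very low. Whether the layout change helps still has to be measured.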

**Expected ROI**:
- **Baseline 2.57%** → **Target 1.5-2.0%** (20-40% reduction)
- **Translation**: +0.5-1.0% throughput improvement
- **Risk**: Very low (branch shape change only, no logic change)

**Implementation Complexity**: Low
- Changes: 5-8 call sites in hot paths
- Integration: Flip `__builtin_expect()` polarity
- Counters: None needed (perf annotation will show)

**Note**: This is a "polish" optimization. E3-4 showed branch hints can backfire.
Approach carefully with A/B testing.

---

## Step 2 Conclusion: Priority Ranking

### Priority Order (ROI + Safety + Implementation scores, each out of 10)

1. **E5-1 (Free Tiny Direct Path)**: +3-5% expected, medium complexity, low risk
   - **ROI Score**: 9/10 (largest target, proven pattern)
   - **Safety Score**: 8/10 (single boundary, fail-fast)
   - **Implementation Score**: 7/10 (medium effort)
   - **TOTAL**: 24/30 → **HIGHEST PRIORITY**

2. **E5-2 (Header Prefill at Refill)**: +1.5-2.0% expected, high complexity, medium risk
   - **ROI Score**: 6/10 (2.59% target, diminishing returns from E4)
   - **Safety Score**: 6/10 (metadata overhead, fallback needed)
   - **Implementation Score**: 4/10 (high effort, slab tracking)
   - **TOTAL**: 16/30 → **MEDIUM PRIORITY**

3. **E5-3 (ENV Snapshot Shape)**: +0.5-1.0% expected, low complexity, low risk
   - **ROI Score**: 4/10 (small target, branch hint uncertainty)
   - **Safety Score**: 9/10 (pure shape change)
   - **Implementation Score**: 9/10 (low effort)
   - **TOTAL**: 22/30 → **LOW PRIORITY** (ranked below E5-2 on expected gain despite the higher total; an easy win if E5-1 succeeds)

### Decision: Proceed with E5-1 (Free Tiny Direct Path)

**Rationale**:
1. **Largest target**: 29.56% combined overhead (free wrapper + cold path)
2. **Proven pattern**: E4 success shows TLS consolidation works
3. **Single boundary**: Clear separation, fail-fast validation
4. **Low risk**: Header magic check is reliable, fallback is trivial
5. **Box Theory compliant**: L0 ENV gate, L1 hot/cold split, minimal counters

**Expected Gain**: +3-5% (conservative estimate)
**Success Threshold**: +1.0% (GO criteria)
**Fallback Plan**: If NO-GO, freeze and pursue E5-2 or E5-3

---

## Step 3: E5-1 Implementation Plan (Next)

### Design Document
- File: `docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md`
- Sections:
  1. Problem Statement (29.56% overhead analysis)
  2. Solution Architecture (wrapper-level Tiny direct path)
  3. Box Theory Boundary (L0 ENV gate, L1 hot/cold split)
  4. Safety Gates (header validation, fail-fast, fallback)
  5. Integration Points (free() wrapper, counters)
  6. A/B Test Plan (10-run Mixed, health check)

### Implementation Files
- **ENV Box**: `core/box/free_tiny_direct_env_box.h/.c`
  - ENV gate: `HAKMEM_FREE_TINY_DIRECT=0/1`
  - Lazy init, static cache
- **Stats Box**: `core/box/free_tiny_direct_stats_box.h`
  - Counters: `direct_hit`, `direct_miss`, `invalid_header`
- **Integration**: `core/box/hak_wrappers.inc.h`
  - Lines ~595-640 (free() wrapper hot path)
  - New inline: `free_tiny_fast_direct()`
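
The "lazy init, static cache" gate and the three counters might look roughly like the following. This is an assumption-laden sketch: the function and type names are hypothetical, the parsing rule (only the exact string `1` enables the gate) is a guess, and the cache is not synchronized across threads. Only the variable name `HAKMEM_FREE_TINY_DIRECT`, its default of 0, and the counter names come from this report.

```c
#include <stdlib.h>
#include <stdint.h>

/* Assumed parsing rule: only the exact string "1" turns the gate on;
 * unset or any other value keeps the default (0). */
static int demo_gate_parse(const char *value)
{
    return (value != NULL && value[0] == '1' && value[1] == '\0') ? 1 : 0;
}

/* Lazy init with a static cache: -1 = not read yet, 0 = off, 1 = on.
 * getenv() runs only on the first call; later calls are one load and
 * one well-predicted branch. */
static int demo_free_tiny_direct_enabled(void)
{
    static int cached = -1;
    if (__builtin_expect(cached < 0, 0))
        cached = demo_gate_parse(getenv("HAKMEM_FREE_TINY_DIRECT"));
    return cached;
}

/* The three counters named in the plan. */
typedef struct {
    uint64_t direct_hit;      /* frees served by the direct path    */
    uint64_t direct_miss;     /* frees that fell back to the wrapper */
    uint64_t invalid_header;  /* header failed the fail-fast checks  */
} demo_free_tiny_direct_stats_t;
```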

### A/B Test Protocol
```bash
# Baseline (E5-1=0)
for i in {1..10}; do
  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FREE_TINY_DIRECT=0 \
    ./bench_random_mixed_hakmem 20000000 400 1
done

# Optimized (E5-1=1)
for i in {1..10}; do
  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FREE_TINY_DIRECT=1 \
    ./bench_random_mixed_hakmem 20000000 400 1
done
```

**Success Criteria**:
- **GO**: Mean change of +1.0% or higher
- **NEUTRAL**: Mean change strictly between -1.0% and +1.0%
- **NO-GO**: Mean change of -1.0% or lower
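
The criteria above can be applied mechanically to the per-run throughput numbers. The helpers below are a hypothetical sketch (none of these names exist in hakmem); they compute the mean of each set of runs, the relative change in percent, and the resulting verdict, treating exactly ±1.0% as GO and NO-GO respectively.

```c
#include <stddef.h>

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_verdict_t;

/* Arithmetic mean of n throughput samples (0.0 for an empty set). */
static double ab_mean(const double *runs, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += runs[i];
    return (n > 0) ? sum / (double)n : 0.0;
}

/* Relative change of the optimized mean vs. the baseline mean, in percent. */
static double ab_delta_percent(double baseline_mean, double optimized_mean)
{
    return (optimized_mean - baseline_mean) / baseline_mean * 100.0;
}

/* Map the mean change onto the GO / NEUTRAL / NO-GO thresholds. */
static ab_verdict_t ab_classify(double delta_percent)
{
    if (delta_percent >= 1.0)
        return AB_GO;
    if (delta_percent <= -1.0)
        return AB_NO_GO;
    return AB_NEUTRAL;
}
```

For example, made-up means of 100.0 (baseline) and 102.0 (optimized) give a +2.0% change and a GO verdict, while 100.0 and 100.5 give NEUTRAL.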

---

## Appendix: Perf Data Details

### Free() Annotated Hot Lines

**Top sample-hit lines** (from `perf annotate --stdio free`):

1. **8.01%**: `mov 0x14(%r15),%ebx` (offset 0x1cd2d)
   - Loading `wrapper_env->wrap_shape` from TLS
   - **Analysis**: ENV snapshot field access overhead

2. **7.73%**: `mov %rdi,%rbp` (offset 0x1ccd1)
   - Saving ptr argument to register
   - **Analysis**: Register pressure in wrapper preamble

3. **7.44%**: `mov 0x5b311(%rip),%edx` (offset 0x1cd01)
   - Loading cached ENV gate value (global)
   - **Analysis**: Lazy init check overhead

4. **6.98%**: `test %r11d,%r11d` (offset 0x1ccf8)
   - Testing ENV gate flag (TLS)
   - **Analysis**: Branch condition for ENV snapshot path

5. **7.62%**: `pop %r14` (offset 0x1ce6e)
   - Epilogue register restore
   - **Analysis**: Function call overhead (6+ register saves)

**Total wrapper overhead from top 5 lines**: ~38% of free() self% (37.78% summed) × 21.67% = **~8% of total runtime**

### Free_Tiny_Fast_Cold Analysis

**7.89% self% breakdown**:
- `tiny_route_for_class()`: 1.95% (route determination)
- Policy snapshot: 0.64%
- Class dispatch: ~5% (remainder)

**Cold path is triggered when**:
- Class != C7 (no ULTRA early-exit)
- OR class requires route determination (C0-C6 in Mixed)

**Opportunity**: Bypass cold path entirely for Tiny by validating in wrapper.

---

## Next Actions

1. ✅ **Steps 1-2 complete**: Perf analysis and priority decision → E5-1 selected
2. 🔄 **Step 3 (next)**: Implement E5-1 design + code
3. ⏳ **Step 4**: 10-run A/B test (Mixed)
4. ⏳ **Step 5**: Health check
5. ⏳ **Step 6**: Update CURRENT_TASK.md with results

**Status**: Ready to proceed with E5-1 implementation.

---

## Phase 5 E5-1: Free Tiny Direct Path (+3.35% GO)

Target: Consolidate free() wrapper overhead (29.56% combined)
- free() wrapper: 21.67% self%
- free_tiny_fast_cold(): 7.89% self%

Strategy: Single header check in wrapper → direct call to free_tiny_fast()
- Eliminates redundant header validation (validated twice before)
- Bypasses cold path routing for Tiny allocations
- High coverage: 48% of frees in Mixed workload are Tiny

Implementation:
- ENV gate: HAKMEM_FREE_TINY_DIRECT=0/1 (default 0)
- core/box/free_tiny_direct_env_box.h: ENV gate
- core/box/free_tiny_direct_stats_box.h: Stats counters
- core/box/hak_wrappers.inc.h: Wrapper integration (lines 593-625)

Safety gates:
- Page boundary guard ((ptr & 0xFFF) != 0)
- Tiny magic validation ((header & 0xF0) == 0xA0)
- Class bounds check (class_idx < 8)
- Fail-fast fallback to existing paths

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DIRECT=0): 44.38M ops/s (mean), 44.45M ops/s (median)
- Optimized (DIRECT=1): 45.87M ops/s (mean), 45.95M ops/s (median)
- Improvement: +3.35% mean, +3.36% median

Decision: GO (+3.35% >= +1.0% threshold)
- 3rd consecutive success with consolidation/deduplication pattern
- E4-1: +3.51%, E4-2: +21.83%, E5-1: +3.35%
- Health check: PASS (all profiles)

Phase 5 Cumulative:
- E4 Combined: +6.43%
- E5-1: +3.35%
- Estimated total: ~+10%
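
As a quick check of that estimate, assuming the two gains apply to the same workload and compound multiplicatively (an assumption, not something measured here):

$$
(1 + 0.0643)\,(1 + 0.0335) - 1 = 1.0643 \times 1.0335 - 1 \approx 0.09995 \approx +10.0\%
$$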

Deliverables:
- docs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-1 complete)
