# Phase 4: Perf Profile Analysis - Next Optimization Target
**Date**: 2025-12-14
**Baseline**: Phase 3 + D1 Complete (~46.37M ops/s, MIXED_TINYV3_C7_SAFE)
**Profile**: MIXED_TINYV3_C7_SAFE (20M iterations, ws=400, F=999Hz)
**Samples**: 922 samples, 3.1B cycles
## Executive Summary
**Current Status**:
- Phase 3 + D1: ~8.93% cumulative gain (37.5M → 51M ops/s baseline)
- D3 (Alloc Gate Shape): NEUTRAL (+0.56% mean, -0.5% median) → frozen as research box
- **Learning**: Shape optimizations (B3, D3) have limited ROI; branch-prediction gains plateau once the hot branches are already well predicted
**Next Strategy**: Identify self% ≥ 5% functions and apply **different approaches** (not shape-based):
- Hot/cold split (separate rare paths)
- Caching (avoid repeated expensive operations)
- Inlining (reduce function call overhead)
- ENV gate consolidation (reduce repeated TLS/getenv checks)
---
## Perf Report Analysis
### Top Functions (self% ≥ 5% ranked; lower entries kept for context)
Filtered for hakmem internal functions (excluding main, malloc/free wrappers):
| Rank | Function | self% | Category | Already Optimized? |
|------|----------|-------|----------|--------------------|
| 1 | `tiny_alloc_gate_fast.lto_priv.0` | **15.37%** | Alloc Gate | D3 shape (neutral) |
| 2 | `free_tiny_fast_cold.lto_priv.0` | **5.84%** | Free Path | Hot/cold split done |
| - | `unified_cache_push.lto_priv.0` | 3.97% | Cache | Core primitive |
| - | `tiny_c7_ultra_alloc.constprop.0` | 3.97% | C7 Alloc | Not optimized |
| - | `tiny_region_id_write_header.lto_priv.0` | 2.50% | Header | A3 inlining (NO-GO) |
| - | `tiny_route_for_class.lto_priv.0` | 2.28% | Routing | C3 static cache done |
**Key Observations**:
1. **tiny_alloc_gate_fast** (15.37%): Still dominant despite D3 shape optimization
2. **free_tiny_fast_cold** (5.84%): Cold path still hot (ENV gate overhead?)
3. **ENV gate functions** (~1% each): `tiny_c7_ultra_enabled_env` (1.28%), `tiny_front_v3_enabled` (1.01%), `tiny_metadata_cache_enabled` (0.97%)
- Combined: **~3.26%** on ENV checking overhead
- Repeated TLS reads + getenv lazy init
---
## Detailed Candidate Analysis
### Candidate 1: `tiny_alloc_gate_fast` (15.37% self%) ⭐ TOP TARGET
**Current State**:
- Phase D3: Alloc gate shape optimization → NEUTRAL (+0.56% mean, -0.5% median)
- Approach: Branch hints (LIKELY/UNLIKELY) + route table direct access
- Result: Limited improvement (branch prediction already well-tuned)
**Perf Annotate Hotspots** (lines with >5% samples):
```asm
9.97%: cmp $0x2,%r13d # Route comparison (ROUTE_POOL_ONLY check)
5.77%: movzbl (%rsi,%rbx,1),%r13d # Route table load (g_tiny_route)
11.32%: mov 0x280aea(%rip),%eax # rel_route_logged.26 (C7 logging check)
5.72%: test %eax,%eax # Route logging branch
```
**Root Causes**:
1. **Route determination overhead** (9.97% + 5.77% = 15.74%):
- `g_tiny_route[class_idx & 7]` load + comparison
- Branch on `ROUTE_POOL_ONLY` (rare path, but checked every call)
2. **C7 logging overhead** (11.32% + 5.72% = 17.04%):
- `rel_route_logged.26` TLS check (C7-specific, rare in Mixed)
- Branch misprediction when C7 is ~10% of traffic
3. **ENV gate overhead**:
- `alloc_gate_shape_enabled()` check (line 151)
- `tiny_route_get()` falls back to slow path (line 186)
**Optimization Opportunities**:
#### Option A1: **Per-Class Fast Path Specialization** (HIGH ROI, STRUCTURAL)
**Approach**: Create specialized `tiny_alloc_gate_fast_c{0-7}()` for each class (see the sketch after this list)
- **Benefit**: Eliminate runtime route determination (static per-class decision)
- **Strategy**:
- C0-C3 (LEGACY): Direct to `malloc_tiny_fast_for_class()`, skip route check
- C4-C6 (MID/V3): Direct to small_policy path, skip LEGACY check
- C7 (ULTRA): Direct to `tiny_c7_ultra_alloc()`, skip all route logic
- **Expected gain**: Eliminate 15.74% route overhead → **+2-3% overall**
- **Risk**: Medium (code duplication, must maintain 8 variants)
- **Precedent**: FREE path already does this via `HAKMEM_FREE_TINY_FAST_HOTCOLD` (+13% win)
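A minimal sketch of Option A1, assuming the existing helpers keep their current roles; the `_c{N}` variant names and the dispatch switch are hypothetical, and the helper signatures are assumptions:
```c
#include <stddef.h>

/* Existing hakmem helpers referenced above; exact signatures are assumed. */
void* malloc_tiny_fast_for_class(int class_idx, size_t size);
void* tiny_c7_ultra_alloc(size_t size);

/* C0 (LEGACY): the route is fixed per class, so there is no g_tiny_route[]
 * load and no ROUTE_POOL_ONLY comparison on this path. */
static inline void* tiny_alloc_gate_fast_c0(size_t size) {
    return malloc_tiny_fast_for_class(0 /* class_idx */, size);
}

/* C7 (ULTRA): call the ultra allocator directly, skipping route logic and
 * the rel_route_logged check entirely. */
static inline void* tiny_alloc_gate_fast_c7(size_t size) {
    return tiny_c7_ultra_alloc(size);
}

/* Malloc wrapper side: class_idx is already derived from size, so a switch
 * lets the compiler emit direct calls (or inline the specialized bodies). */
static inline void* tiny_alloc_dispatch(int class_idx, size_t size) {
    switch (class_idx) {
    case 0: return tiny_alloc_gate_fast_c0(size);
    /* ... C1-C6 variants elided ... */
    case 7: return tiny_alloc_gate_fast_c7(size);
    default: return NULL;  /* in practice, fall back to the generic gate */
    }
}
```
Each variant bakes the route decision in at compile time, which is what removes the route-table load and `ROUTE_POOL_ONLY` comparison measured above.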
#### Option A2: **Route Cache Consolidation** (MEDIUM ROI, CACHE-BASED)
**Approach**: Extend C3 static routing to the alloc gate, bypassing `tiny_route_get()` entirely (see the sketch after this list)
- **Benefit**: Eliminate `tiny_route_get()` call + route table load
- **Strategy**:
- Check `g_tiny_static_route_ready` once (already cached)
- Use `g_tiny_static_route_table[class_idx]` directly (already done in C3)
- Remove duplicate `g_tiny_route[]` load (line 157)
- **Expected gain**: Reduce 5.77% route load overhead → **+0.5-1% overall**
- **Risk**: Low (extends existing C3 infrastructure)
- **Note**: Partial overlap with A1 (both reduce route overhead)
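A minimal sketch of Option A2, assuming the C3 infrastructure exposes the ready flag and the static table as globals; the exact types and the `tiny_route_fast()` name are assumptions:
```c
#include <stdint.h>

/* Assumed C3 infrastructure (names taken from the bullets above). */
extern int     g_tiny_static_route_ready;       /* set once after init */
extern uint8_t g_tiny_static_route_table[8];    /* per-class route id */
int tiny_route_get(int class_idx);              /* existing slow path */

static inline int tiny_route_fast(int class_idx) {
    /* One load from the pre-populated table; no tiny_route_get() call and
     * no duplicate g_tiny_route[] access in the alloc gate. */
    if (__builtin_expect(g_tiny_static_route_ready, 1))
        return g_tiny_static_route_table[class_idx & 7];
    return tiny_route_get(class_idx);           /* cold fallback before init */
}
```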
#### Option A3: **C7 Logging Branch Elimination** (LOW ROI, ENV-BASED)
**Approach**: Make C7 logging opt-in via ENV (default OFF in Mixed profile)
- **Benefit**: Eliminate 17.04% C7 logging overhead in Mixed workload
- **Strategy**:
- Add `HAKMEM_TINY_C7_ROUTE_LOGGING=0` to MIXED_TINYV3_C7_SAFE
- Keep logging enabled in C6_HEAVY profile (debugging use case)
- **Expected gain**: Eliminate 17.04% local overhead → **+2-3% in alloc_gate_fast** → **+0.3-0.5% overall**
- **Risk**: Very low (ENV-gated, reversible)
- **Caveat**: This is ~17% of *tiny_alloc_gate_fast's* self%, not 17% of total runtime
**Recommendation**: **Pursue A1 (Per-Class Fast Path)** as primary target
- Rationale: Structural change that eliminates root cause (runtime route determination)
- Precedent: FREE path hot/cold split achieved +13% with similar approach
- A2 can be a quick win before A1 (low-hanging fruit)
- A3 is minor (local to tiny_alloc_gate_fast, small overall impact)
---
### Candidate 2: `free_tiny_fast_cold` (5.84% self%) ⚠️ ALREADY OPTIMIZED
**Current State**:
- Phase FREE-TINY-FAST-HOTCOLD-1: Hot/cold split → +13% gain
- Split C0-C3 (hot) from C4-C7 (cold)
- Cold path still shows 5.84% self% → expected (C4-C7 are ~50% of frees)
**Perf Annotate Hotspots**:
```asm
4.12%: call tiny_route_for_class.lto_priv.0 # Route determination (C4-C7)
3.95%: cmpl g_tiny_front_v3_snapshot_ready # Front v3 snapshot check
3.63%: cmpl %fs:0xfffffffffffb3b00 # TLS ENV check (FREE_TINY_FAST_HOTCOLD)
```
**Root Causes**:
1. **Route determination** (4.12%): Necessary for C4-C7 (not LEGACY)
2. **ENV gate overhead** (3.95% + 3.63% = 7.58%): Repeated TLS checks
   - 3.95%: front v3 snapshot readiness check (lazy-init overhead)
   - 3.63%: `HAKMEM_FREE_TINY_FAST_HOTCOLD` TLS gate
**Optimization Opportunities**:
#### Option B1: **ENV Gate Consolidation** (MEDIUM ROI, CACHE-BASED)
**Approach**: Consolidate repeated ENV checks into a single TLS snapshot (see the sketch after this list)
- **Benefit**: Reduce 7.58% ENV checking overhead
- **Strategy**:
- Create `struct free_env_snapshot { uint8_t hotcold_on; uint8_t front_v3_on; ... }`
- Cache in TLS (initialized once per thread)
- Single TLS read per `free_tiny_fast_cold()` call
- **Expected gain**: Reduce 7.58% local overhead → **+0.4-0.6% overall** (5.84% * 7.58% = ~0.44%)
- **Risk**: Low (existing pattern in C3 static routing)
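A minimal sketch of Option B1; this is the same snapshot pattern proposed in Option C1 below, specialized for the free path. The struct layout beyond `hotcold_on`/`front_v3_on`, the `g_free_env` name, and the init helper are hypothetical:
```c
#include <stdint.h>

struct free_env_snapshot {
    uint8_t initialized;
    uint8_t hotcold_on;    /* HAKMEM_FREE_TINY_FAST_HOTCOLD */
    uint8_t front_v3_on;   /* front v3 enabled / snapshot ready */
};
extern __thread struct free_env_snapshot g_free_env;
void free_env_snapshot_init(void);   /* one-time getenv() parsing per thread */

static inline const struct free_env_snapshot* free_env_get(void) {
    if (__builtin_expect(!g_free_env.initialized, 0))
        free_env_snapshot_init();
    /* free_tiny_fast_cold() then does one TLS access and plain byte loads
     * instead of separate hotcold / front_v3 gate calls on every free. */
    return &g_free_env;
}
```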
#### Option B2: **C4-C7 Route Specialization** (LOW ROI, STRUCTURAL)
**Approach**: Create per-class cold paths (similar to A1 for alloc)
- **Benefit**: Eliminate route determination for C4-C7
- **Strategy**: Split `free_tiny_fast_cold()` into 4 variants (C4, C5, C6, C7)
- **Expected gain**: Reduce 4.12% route overhead → **+0.24% overall**
- **Risk**: Medium (code duplication)
- **Note**: Lower priority than A1 (free path already optimized via hot/cold split)
**Recommendation**: **Pursue B1 (ENV Gate Consolidation)** as secondary target
- Rationale: Complements A1 (alloc gate specialization)
- Can be applied to both alloc and free paths (shared infrastructure)
- Lower ROI than A1, but easier to implement
---
### Candidate 3: ENV Gate Functions (Combined 3.26% self%) 🎯 CROSS-CUTTING
**Functions**:
- `tiny_c7_ultra_enabled_env.lto_priv.0` (1.28%)
- `tiny_front_v3_enabled.lto_priv.0` (1.01%)
- `tiny_metadata_cache_enabled.lto_priv.0` (0.97%)
**Current Pattern** (from source):
```c
static inline int tiny_front_v3_enabled(void) {
    static __thread int g = -1;
    if (__builtin_expect(g == -1, 0)) {
        const char* e = getenv("HAKMEM_TINY_FRONT_V3_ENABLED");
        g = (e && *e && *e != '0') ? 1 : 0;
    }
    return g;
}
```
**Root Causes**:
1. **TLS read overhead**: Each function reads its own TLS variable (three separate reads on the hot path)
2. **Lazy init check**: `g == -1` branch on every call (cold, but still checked)
3. **Function call overhead**: Called from multiple hot paths (not always inlined)
**Optimization Opportunities**:
#### Option C1: **ENV Snapshot Consolidation** ⭐ HIGH ROI
**Approach**: Consolidate all ENV gates into single TLS snapshot struct
- **Benefit**: Reduce 3 TLS reads → 1 TLS read, eliminate 2 lazy init checks
- **Strategy**:
```c
struct hakmem_env_snapshot {
    uint8_t initialized;        // set once by hakmem_env_snapshot_init()
    uint8_t front_v3_on;
    uint8_t metadata_cache_on;
    uint8_t c7_ultra_on;
    uint8_t free_hotcold_on;
    uint8_t static_route_on;
    // ... (8 bytes total, cache-friendly)
};
extern __thread struct hakmem_env_snapshot g_hakmem_env_snapshot;

static inline const struct hakmem_env_snapshot* hakmem_env_get(void) {
    if (__builtin_expect(!g_hakmem_env_snapshot.initialized, 0)) {
        hakmem_env_snapshot_init(); // One-time init per thread
    }
    return &g_hakmem_env_snapshot;
}
```
- **Expected gain**: Eliminate 3.26% ENV overhead → **+3.0-3.5% overall**
- **Risk**: Medium (refactor all ENV gate call sites)
- **Precedent**: `tiny_front_v3_snapshot` already does this for front v3 config
**Recommendation**: **HIGHEST PRIORITY - Pursue C1 as Phase 4 PRIMARY TARGET**
- Rationale:
- **3.26% direct overhead** (measurable in perf)
- **Cross-cutting benefit**: Improves both alloc and free paths
- **Structural improvement**: Reduces TLS pressure across entire codebase
- **Clear precedent**: `tiny_front_v3_snapshot` pattern already proven
- **Compounds with A1**: Per-class fast paths can check single ENV snapshot instead of multiple gates
---
## Selected Next Target
### Phase 4 E1: **ENV Snapshot Consolidation** (PRIMARY TARGET)
**Function**: Consolidate all ENV gates into single TLS snapshot
**Expected Gain**: **+3.0-3.5%** (eliminate 3.26% ENV overhead)
**Risk**: Medium (refactor ENV gate call sites)
**Effort**: 2-3 days (create snapshot struct, refactor ~20 call sites, A/B test)
**Implementation Plan**:
#### Step 1: Create ENV Snapshot Infrastructure
- File: `core/box/hakmem_env_snapshot_box.h/c`
- Struct: `hakmem_env_snapshot` (8-byte TLS struct)
- API: `hakmem_env_get()` (lazy init, returns a `const` snapshot pointer); an init-side sketch follows below
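A minimal sketch of the `.c` side, assuming the struct and TLS instance live in the header named above. `HAKMEM_TINY_FRONT_V3_ENABLED` and `HAKMEM_FREE_TINY_FAST_HOTCOLD` appear elsewhere in this doc; the other variable names and the default-off values are placeholders:
```c
#include <stdint.h>
#include <stdlib.h>
#include "hakmem_env_snapshot_box.h"   /* struct hakmem_env_snapshot, init prototype */

__thread struct hakmem_env_snapshot g_hakmem_env_snapshot;  /* zero-initialized per thread */

static int env_on(const char* name, int dflt) {
    const char* e = getenv(name);
    return (e && *e) ? (*e != '0') : dflt;
}

void hakmem_env_snapshot_init(void) {
    struct hakmem_env_snapshot* s = &g_hakmem_env_snapshot;
    s->front_v3_on       = (uint8_t)env_on("HAKMEM_TINY_FRONT_V3_ENABLED", 0);
    s->metadata_cache_on = (uint8_t)env_on("HAKMEM_TINY_METADATA_CACHE", 0);   /* name assumed */
    s->c7_ultra_on       = (uint8_t)env_on("HAKMEM_TINY_C7_ULTRA", 0);         /* name assumed */
    s->free_hotcold_on   = (uint8_t)env_on("HAKMEM_FREE_TINY_FAST_HOTCOLD", 0);
    s->static_route_on   = (uint8_t)env_on("HAKMEM_TINY_STATIC_ROUTE", 0);     /* name assumed */
    s->initialized = 1;   /* set last; hakmem_env_get() branches on this flag */
}
```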
#### Step 2: Migrate ENV Gates
Priority order (by self% impact):
1. `tiny_c7_ultra_enabled_env()` (1.28%)
2. `tiny_front_v3_enabled()` (1.01%)
3. `tiny_metadata_cache_enabled()` (0.97%)
4. `free_tiny_fast_hotcold_enabled()` (in `free_tiny_fast_cold`)
5. `tiny_static_route_enabled()` (in routing hot path)
#### Step 3: Refactor Call Sites
- Replace: `if (tiny_front_v3_enabled()) { ... }`
- With: `const struct hakmem_env_snapshot* env = hakmem_env_get(); if (env->front_v3_on) { ... }` (see the sketch after this list)
- Count: ~20-30 call sites (grep analysis needed)
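A hedged before/after sketch of one call-site conversion; the combined condition is illustrative, and only the gate names come from the migration list above:
```c
#include "hakmem_env_snapshot_box.h"   /* snapshot struct + hakmem_env_get() */

int tiny_front_v3_enabled(void);        /* existing per-gate helpers */
int tiny_metadata_cache_enabled(void);

/* Before: two lazily-initialized TLS gates, each with its own init branch. */
static inline int front_path_enabled_old(void) {
    return tiny_front_v3_enabled() && tiny_metadata_cache_enabled();
}

/* After: one snapshot pointer, two plain byte loads. */
static inline int front_path_enabled_new(void) {
    const struct hakmem_env_snapshot* env = hakmem_env_get();
    return env->front_v3_on && env->metadata_cache_on;
}
```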
#### Step 4: A/B Test
- Baseline: Current mainline (Phase 3 + D1)
- Optimized: ENV snapshot consolidation
- Workloads: Mixed (10-run), C6-heavy (5-run)
- Threshold: +1.0% mean gain for GO
#### Step 5: Validation
- Health check: `verify_health_profiles.sh`
- Regression check: Ensure no performance loss on any profile
**Success Criteria**:
- [ ] ENV snapshot struct created
- [ ] All priority ENV gates migrated
- [ ] A/B test shows +2.5% or better (Mixed, 10-run)
- [ ] Health check passes
- [ ] Default ON in MIXED_TINYV3_C7_SAFE
---
## Alternative Targets (Lower Priority)
### Phase 4 E2: **Per-Class Alloc Fast Path** (SECONDARY TARGET)
**Function**: Specialize `tiny_alloc_gate_fast()` per class
**Expected Gain**: **+2-3%** (eliminate 15.74% route overhead in tiny_alloc_gate_fast)
**Risk**: Medium (code duplication, 8 variants to maintain)
**Effort**: 3-4 days (create 8 fast paths, refactor malloc wrapper, A/B test)
**Why Secondary?**:
- Higher implementation complexity (8 variants vs. 1 snapshot struct)
- Dependent on E1 success (ENV snapshot makes per-class paths cleaner)
- Can be pursued after E1 proves ENV consolidation pattern
---
## Candidate Summary Table
| Phase | Target | self% | Approach | Expected Gain | Risk | Priority |
|-------|--------|-------|----------|---------------|------|----------|
| **E1** | **ENV Snapshot Consolidation** | **3.26%** | **Caching** | **+3.0-3.5%** | **Medium** | **⭐ PRIMARY** |
| E2 | Per-Class Alloc Fast Path | 15.37% | Hot/Cold Split | +2-3% | Medium | Secondary |
| E3 | Free ENV Gate Consolidation | 7.58% (local) | Caching | +0.4-0.6% | Low | Tertiary |
| E4 | C7 Logging Elimination | 17.04% (local) | ENV-gated | +0.3-0.5% | Very Low | Quick Win |
---
## Shape Optimization Plateau Analysis
**Observation**: D3 (Alloc Gate Shape) achieved only +0.56% mean gain (NEUTRAL)
**Why Shape Optimizations Plateau?**:
1. **Branch Prediction Saturation**: Modern CPUs (Zen3/Zen4) already predict well-trained branches
- LIKELY/UNLIKELY hints: Marginal benefit on hot paths
- B3 (Routing Shape): +2.89% → Initial win (untrained branches)
- D3 (Alloc Gate Shape): +0.56% → Diminishing returns (already trained)
2. **I-Cache Pressure**: Adding cold helpers can regress if not carefully placed
- A3 (always_inline header): -4.00% on Mixed (I-cache thrashing)
- D3: Neutral (no regression, but no clear win)
3. **TLS/Memory Overhead Dominates**: ENV gates (3.26%) > Branch misprediction (~0.5%)
- Next optimization should target memory/TLS overhead, not branches
**Lessons Learned**:
- Shape optimizations: Good for first pass (B3 +2.89%), limited ROI after
- Next frontier: Caching (ENV snapshot), structural changes (per-class paths)
- Avoid: More LIKELY/UNLIKELY hints (saturated)
- Prefer: Eliminate checks entirely (snapshot) or specialize paths (per-class)
---
## Next Steps
1. **Phase 4 E1: ENV Snapshot Consolidation** (PRIMARY)
- Create design doc: `PHASE4_E1_ENV_SNAPSHOT_CONSOLIDATION_DESIGN.md`
- Implement snapshot infrastructure
- Migrate priority ENV gates
- A/B test (Mixed 10-run)
- Target: +3.0% gain, promote to default if successful
2. **Phase 4 E2: Per-Class Alloc Fast Path** (SECONDARY)
- Depends on E1 success
- Design doc: `PHASE4_E2_PER_CLASS_ALLOC_FASTPATH_DESIGN.md`
- Prototype C7-only fast path first (highest gain, least complexity)
- A/B test incremental per-class specialization
- Target: +2-3% gain
3. **Update CURRENT_TASK.md**:
- Document perf findings
- Note shape optimization plateau
- List E1 as next target
---
## Appendix: Perf Command Reference
```bash
# Profile current mainline
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
perf record -F 999 -- ./bench_random_mixed_hakmem 40000000 400 1
# Generate report (sorted by symbol, no children aggregation)
perf report --stdio --no-children --sort=symbol | head -80
# Annotate specific function
perf annotate --stdio tiny_alloc_gate_fast.lto_priv.0 | head -100
```
**Key Metrics**:
- Samples: 922 (≈0.1% of cycles per sample; coarse, but adequate for ranking multi-percent hotspots)
- Frequency: 999 Hz (balance between overhead and resolution)
- Iterations: 40M (runtime ~0.86s, enough for stable sampling)
- Workload: Mixed (ws=400, representative of production)
---
**Status**: Ready for Phase 4 E1 implementation
**Baseline**: 46.37M ops/s (Phase 3 + D1)
**Target**: 47.8M ops/s (+3.0% via ENV snapshot consolidation)