hakmem/docs/analysis/PHASE4_COMPREHENSIVE_STATUS_ANALYSIS.md

# Phase 4 Comprehensive Status Analysis

**Date**: 2025-12-14
**Analyst**: Claude Code
**Baseline**: E1 enabled (~45M ops/s)

---

## Part 1: E2 Freeze Decision Analysis

### Test Data Review

**E2 Configuration**: HAKMEM_TINY_ALLOC_DUALHOT (C0-C3 fast path for alloc)
**Baseline**: HAKMEM_ENV_SNAPSHOT=1 (E1 enabled)
**Test**: 10-run A/B, 20M iterations, ws=400

#### Statistical Analysis

| Metric | Baseline (E2=0) | Optimized (E2=1) | Delta |
|--------|-----------------|------------------|-------|
| Mean | 45.40M ops/s | 45.30M ops/s | -0.21% |
| Median | 45.51M ops/s | 45.22M ops/s | -0.62% |
| StdDev | 0.38M (0.84% CV) | 0.49M (1.07% CV) | +28% variance |

#### Variance Consistency Analysis

**Baseline runs** (DUALHOT=0):
- Range: 44.60M - 45.90M (1.30M spread)
- Runs within ±1% of mean: 9/10 (90%)
- Outliers: Run 8 (44.60M, -1.76% from mean)

**Optimized runs** (DUALHOT=1):
- Range: 44.59M - 46.28M (1.69M spread)
- Runs within ±1% of mean: 8/10 (80%)
- Outliers: Run 2 (46.28M, +2.16% from mean), Run 3 (44.59M, -1.58% from mean)

**Observation**: Higher variance in optimized version suggests branch misprediction or cache effects.

#### Comparison to Free DUALHOT Success

| Path | DUALHOT Result | Reason |
|------|----------------|--------|
| **Free** | **+13.0%** | Skips policy_snapshot() + tiny_route_for_class() for C0-C3 (48% of frees) |
| **Alloc** | **-0.21%** | Route already cached (Phase 3 C3), C0-C3 check adds branch without bypassing cost |

**Root Cause**:
- Free path: C0-C3 optimization skips **expensive operations** (policy snapshot + route lookup)
- Alloc path: C0-C3 optimization skips **already-cached operations** (static routing eliminates lookup)
- Net effect: Branch overhead ≈ Savings → neutral

### E2 Freeze Recommendation

**Decision**: ✅ **DEFINITIVE FREEZE**

**Rationale**:

1. **Result is consistent**: All 10 runs showed similar pattern (no bimodal distribution)
2. **Not a measurement error**: StdDev 0.38M-0.49M is normal for this workload
3. **Root cause understood**: Alloc path already optimized via C3 static routing
4. **Free vs Alloc asymmetry explained**: Free skips expensive ops, alloc skips cheap cached ops
5. **No alternative conditions warranted**:
   - Different workload (C6-heavy): Won't help - same route caching applies
   - Different iteration count: Won't change fundamental branch cost vs savings trade-off
   - Combined flags: No synergy available - route caching is already optimal

**Conclusion**: E2 is a **structural dead-end** for Mixed workload. Alloc route optimization saturated by C3.

---

## Part 2: Fresh Perf Profile Analysis (E1 Enabled)

### Profile Configuration

**Command**: `HAKMEM_ENV_SNAPSHOT=1 perf record -F 999 -- ./bench_random_mixed_hakmem 40000000 400 1`
**Throughput**: 45.26M ops/s
**Samples**: 946 samples, 3.25B cycles

### Top Functions (self% >= 2.0%)

| Rank | Function | self% | Change from Pre-E1 | Category |
|------|----------|-------|-------------------|----------|
| 1 | free | 22.19% | +2.5pp (from ~19%) | Wrapper |
| 2 | tiny_alloc_gate_fast | 18.99% | +3.6pp (from 15.37%) | Alloc Gate |
| 3 | main | 15.21% | No change | Benchmark |
| 4 | malloc | 13.36% | No change | Wrapper |
| 5 | free_tiny_fast_cold | 7.32% | +1.5pp (from 5.84%) | Free Path |
| 6 | hakmem_env_snapshot_enabled | 3.22% | **NEW (was 0% combined)** | ENV Gate |
| 7 | tiny_region_id_write_header | 2.60% | +0.1pp (from 2.50%) | Header |
| 8 | unified_cache_push | 2.56% | -1.4pp (from 3.97%) | Cache |
| 9 | tiny_route_for_class | 2.29% | +0.01pp (from 2.28%) | Routing |
| 10 | small_policy_v7_snapshot | 2.26% | No data | Policy |
| 11 | tiny_c7_ultra_alloc | 2.16% | -1.8pp (from 3.97%) | C7 Alloc |

### E1 Impact Analysis

**Expected**: E1 consolidates 3 ENV gates (3.26% self%) → 1 TLS read
**Actual**: `hakmem_env_snapshot_enabled` shows 3.22% self%

**Interpretation**:
- ENV overhead **shifted** from 3 separate functions → 1 function
- **NOT eliminated** - still paying 3.22% for ENV checking
- E1's +3.92% gain likely from **reduced TLS pressure** (fewer TLS variables), not eliminated checks
- The snapshot approach caches results, reducing repeated getenv() calls

**Surprise findings**:
1. **tiny_alloc_gate_fast increased** from 15.37% → 18.99% (+3.6pp)
   - Possible reason: Other functions got faster (relative %), or I-cache effects
2. **hakmem_env_snapshot_enabled is NEW hot spot** (3.22%)
   - This is the consolidation point - still significant overhead
3. **unified_cache_push decreased** from 3.97% → 2.56% (-1.4pp)
   - Good sign: Cache operations more efficient

### Hot Spot Distribution

**Pre-E1** (Phase 4 D3 baseline):
- ENV gates (3 functions): 3.26%
- tiny_alloc_gate_fast: 15.37%
- free_tiny_fast_cold: 5.84%
- **Total measured overhead**: ~24.5%

**Post-E1** (current):
- ENV snapshot (1 function): 3.22%
- tiny_alloc_gate_fast: 18.99%
- free_tiny_fast_cold: 7.32%
- **Total measured overhead**: ~29.5%

**Analysis**: Overhead increased in absolute %, but throughput increased +3.92%. This suggests:
- Baseline got faster (other code optimized)
- Relative % shifted to measured functions
- Perf sampling variance (946 samples has ~±3% error margin)

---

## Part 3: E3 Candidate Identification

### Methodology

**Selection Criteria**:
1. self% >= 5% (significant impact)
2. Not already heavily optimized (avoid saturated areas)
3. Different approach from route/TLS optimization (explore new vectors)

### Candidate Analysis

#### Candidate E3-1: tiny_alloc_gate_fast (18.99% self%) - ROUTING SATURATION

**Current State**:
- Phase 3 C3: Static routing (+2.20% gain)
- Phase 4 D3: Alloc gate shape (+0.56% neutral)
- Phase 4 E2: Per-class fast path (-0.21% neutral)

**Why it's 18.99%**:
- Route determination: Already cached (C3)
- Branch prediction: Already tuned (D3)
- Per-class specialization: No benefit (E2)

**Remaining Overhead**:
- Function call overhead (not inlined)
- ENV snapshot check (3.22% now consolidated)
- Size→class conversion (hak_tiny_size_to_class)
- Wrapper→gate dispatch

**Optimization Approach**: **INLINING + DISPATCH OPTIMIZATION**
- **Strategy**: Inline tiny_alloc_gate_fast into malloc wrapper
  - Eliminate function call overhead (save ~5-10 cycles)
  - Improve I-cache locality (malloc + gate in same cache line)
  - Enable cross-function optimization (compiler can optimize malloc→gate→fast_path as one unit)
- **Expected Gain**: +1-2% (reduce 18.99% self by 10-15% = ~2pp overall)
- **Risk**: Medium (I-cache pressure, as seen in A3 -4% regression)

**Recommendation**: **DEFER** - Route optimization saturated, inlining has I-cache risk

---

#### Candidate E3-2: free (22.19% self%) - WRAPPER OVERHEAD

**Current State**:
- Phase 2 B4: Wrapper hot/cold split (+1.47% gain)
- Wrapper shape already optimized (rare checks in cold path)

**Why it's 22.19%**:
- This is the `free()` wrapper function (libc entry point)
- Includes: LD mode check, jemalloc check, diagnostics, then dispatch to free_tiny_fast

**Optimization Approach**: **WRAPPER BYPASS (IFUNC) or Function Pointer Caching**
- **Strategy 1 (IFUNC)**: Use GNU IFUNC to resolve malloc/free at load time
  - Direct binding: `malloc → tiny_alloc_gate_fast` (no wrapper layer)
  - Risk: HIGH (ABI compatibility, thread-safety)
- **Strategy 2 (Function Pointer)**: Cache `g_free_impl` in TLS
  - Check once at thread init, then direct call
  - Risk: Medium, Lower gain (+1-2%)

**Recommendation**: **HIGH PRIORITY** - Large potential gain, prototype with function pointer approach first

---

#### Candidate E3-3: free_tiny_fast_cold (7.32% self%) - COLD PATH OPTIMIZATION

**Current State**:
- Phase FREE-DUALHOT: Hot/cold split (+13% gain for C0-C3 hot path)
- Cold path handles C4-C7 (~50% of frees)

**Optimization Approach**: **C4-C7 ROUTE SPECIALIZATION**
- **Strategy**: Create per-class cold paths (similar to E2 alloc attempt)
- **Expected Gain**: +0.5-1.0%
- **Risk**: Low

**Recommendation**: **MEDIUM PRIORITY** - Incremental gain, but may hit diminishing returns like E2

---

#### Candidate E3-4: hakmem_env_snapshot_enabled (3.22% self%) - ENV OVERHEAD REDUCTION ⭐

**Current State**:
- Phase 4 E1: ENV snapshot consolidation (+3.92% gain)
- 3 separate ENV gates → 1 consolidated snapshot

**Why it's 3.22%**:
- This IS the optimization (consolidation point)
- Still checking `g_hakmem_env_snapshot.initialized` on every call
- TLS read overhead (1 TLS variable vs 3, but still 1 read per hot path)

**Optimization Approach**: **LAZY INIT ELIMINATION**
- **Strategy**: Force ENV snapshot initialization at library load time (constructor)
  - Use `__attribute__((constructor))` to init before main()
  - Eliminate `if (!initialized)` check in hot path
  - Make `hakmem_env_get()` a pure TLS read (no branch)
- **Expected Gain**: +0.5-1.5% (eliminate 3.22% check overhead)
- **Risk**: Low (standard initialization pattern)
- **Implementation**:
  ```c
  __attribute__((constructor))
  static void hakmem_env_snapshot_init_early(void) {
      hakmem_env_snapshot_init();  // Force init before any alloc/free
  }

  static inline const hakmem_env_snapshot* hakmem_env_get(void) {
      return &g_hakmem_env_snapshot;  // No check, just return
  }
  ```

**Recommendation**: **HIGH PRIORITY** - Clean win, low risk, eliminates E1's remaining overhead

---

#### Candidate E3-5: tiny_region_id_write_header (2.60% self%) - HEADER WRITE OPTIMIZATION

**Current State**:
- Phase 1 A3: always_inline attempt → -4.00% regression (NO-GO)
- I-cache pressure issue identified

**Optimization Approach**: **SELECTIVE INLINING**
- **Strategy**: Inline only for hot classes (C7 ULTRA, C0-C3 LEGACY)
- **Expected Gain**: +0.5-1.0%
- **Risk**: Medium (I-cache effects)

**Recommendation**: **LOW PRIORITY** - A3 already explored, I-cache risk remains

---

### E3 Candidate Ranking

| Rank | Candidate | self% | Approach | Expected Gain | Risk | ROI |
|------|-----------|-------|----------|---------------|------|-----|
| **1** | **hakmem_env_snapshot_enabled** | **3.22%** | **Constructor init** | **+0.5-1.5%** | **Low** | **⭐⭐⭐** |
| **2** | **free wrapper** | **22.19%** | **Function pointer cache** | **+1-2%** | **Medium** | **⭐⭐⭐** |
| 3 | tiny_alloc_gate_fast | 18.99% | Inlining | +1-2% | High (I-cache) | ⭐⭐ |
| 4 | free_tiny_fast_cold | 7.32% | Route specialization | +0.5-1.0% | Low | ⭐⭐ |
| 5 | tiny_region_id_write_header | 2.60% | Selective inline | +0.5-1.0% | Medium | ⭐ |

---

## Part 4: Summary & Recommendations

### E2 Final Decision

**Decision**: ✅ **FREEZE DEFINITIVELY**

**Rationale**:
1. **Result is consistent**: -0.21% mean, -0.62% median across 10 runs
2. **Root cause clear**: Alloc route optimization saturated by Phase 3 C3 static routing
3. **Free vs Alloc asymmetry**: Free DUALHOT skips expensive ops, alloc skips cached ops
4. **No alternative testing needed**: Workload/iteration changes won't fix structural issue
5. **Lesson learned**: Per-class specialization only works when bypassing uncached overhead

**Action**:
- Keep `HAKMEM_TINY_ALLOC_DUALHOT=0` as default (research box frozen)
- Document in CURRENT_TASK.md as NEUTRAL result
- No further investigation warranted

---

### Perf Findings (E1 Enabled Baseline)

**Throughput**: 45.26M ops/s (+3.92% from pre-E1 baseline)

**Hot Spots** (self% >= 5%):
1. free (22.19%) - Wrapper overhead
2. tiny_alloc_gate_fast (18.99%) - Route overhead (saturated)
3. main (15.21%) - Benchmark driver
4. malloc (13.36%) - Wrapper overhead
5. free_tiny_fast_cold (7.32%) - C4-C7 free path

**E1 Impact**:
- ENV overhead consolidated: 3.26% (3 functions) → 3.22% (1 function)
- Gain from reduced TLS pressure: +3.92%
- **Remaining opportunity**: Eliminate lazy init check (3.22% → 0%)

**New Hot Spots**:
- hakmem_env_snapshot_enabled: 3.22% (consolidation point)

**Changes from Pre-E1**:
- tiny_alloc_gate_fast: +3.6pp (15.37% → 18.99%)
- free: +2.5pp (~19% → 22.19%)
- unified_cache_push: -1.4pp (3.97% → 2.56%)

---

### E3 Recommendation

**Primary Target**: **hakmem_env_snapshot_enabled (E3-4)**

**Approach**: Constructor-based initialization
- Force ENV snapshot init at library load time
- Eliminate lazy init check in hot path
- Make `hakmem_env_get()` a pure TLS read (no branch)

**Expected Gain**: +0.5-1.5%

**Implementation Complexity**: Low (2-day task)
- Add `__attribute__((constructor))` function
- Remove init check from hakmem_env_get()
- A/B test with 10-run Mixed + 5-run C6-heavy

**Rationale**:
1. **Low risk**: Standard initialization pattern (used by jemalloc, tcmalloc)
2. **Clear gain**: Eliminates 3.22% overhead (lazy init check)
3. **Compounds E1**: Completes ENV snapshot optimization started in E1
4. **Different vector**: Not route/TLS optimization - this is **initialization overhead reduction**

**Success Criteria**:
- Mean gain >= +0.5% (conservative)
- No regression on any profile
- Health check passes

---

**Secondary Target**: **free wrapper (E3-2)**

**Approach**: Function pointer caching
- Cache `g_free_impl` in TLS at thread init
- Direct call instead of LD mode check + dispatch
- Lower risk than IFUNC approach

**Expected Gain**: +1-2%

**Implementation Complexity**: Medium (3-4 day task)

**Risk**: Medium (thread-safety, initialization order)

---

### Phase 4 Status

**Active Optimizations**:
- E1 (ENV Snapshot): +3.92% ✅ GO (research box, default OFF / opt-in)
- E3-4 (ENV Constructor Init): +4.75% ✅ GO (research box, default OFF / opt-in, requires E1)

**Frozen Optimizations**:
- D3 (Alloc Gate Shape): +0.56% ⚪ NEUTRAL (research box, default OFF)
- E2 (Alloc Per-Class FastPath): -0.21% ⚪ NEUTRAL (research box, default OFF)

**Cumulative Gain** (Phase 2-4):
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- D1 (Free route cache): +2.19%
- E1 (ENV snapshot): +3.92%
- E3-4 (ENV ctor): +4.75% (opt-in, requires E1)
- **Total (opt-in含む): ~17%**（プロファイル/ENV 組み合わせ依存）

**Baseline（参考）**:
- E1=1, CTOR=0: 45.26M ops/s（Mixed, 40M iters, ws=400）
- E1=1, CTOR=1: 46.38M ops/s（Mixed, 20M iters, ws=400）

**Remaining Potential**:
- E3-2 (Wrapper function ptr): +1-2%
- E3-3 (Free route special): +0.5-1.0%
- **Realistic ceiling**: ~48-50M ops/s (without major redesign)

---

### Next Steps

#### Immediate (Priority 1)

1. **Freeze E2 in CURRENT_TASK.md**
   - Document NEUTRAL decision (-0.21%)
   - Add root cause explanation (route caching saturation)
   - Mark as research box (default OFF, frozen)

2. **E3-4 の昇格ゲート（再検証）**
   - E3-4 は GO 済みだが、branch hint/refresh など “足元の調整” 後に 10-run 再確認
   - A/B: Mixed 10-run（E1=1, CTOR=0 vs 1）
   - 健康診断: `scripts/verify_health_profiles.sh`

#### Short-term (Priority 2)

3. **E1/E3-4 ON の状態で perf を取り直す**
   - `hakmem_env_snapshot_enabled` が Top から落ちる／self% が有意に下がること
   - 次の芯（alloc gate / free_tiny_fast_cold / wrapper）を “self% ≥ 5%” で選定

#### Long-term (Priority 3)

6. **Consider non-incremental approaches**
   - Mimalloc-style TLS bucket redesign (major overhaul)
   - Static-compiled routing (eliminate runtime policy)
   - IFUNC for zero-overhead wrapper (high risk)

---

### Lessons Learned

#### Route Optimization Saturation

**Observation**: E2 (alloc per-class) showed -0.21% neutral despite free path success (+13%)

**Insight**:
- Route optimization has diminishing returns after static caching (C3)
- Further specialization adds branch overhead without eliminating cost
- **Lesson**: Don't pursue per-class specialization on already-cached paths

#### Shape Optimization Plateau

**Observation**: D3 (alloc gate shape) showed +0.56% neutral despite B3 success (+2.89%)

**Insight**:
- Branch prediction saturates after initial tuning
- LIKELY/UNLIKELY hints have limited benefit on well-trained branches
- **Lesson**: Shape optimization good for first pass, limited ROI after

#### ENV Consolidation Success

**Observation**: E1 (ENV snapshot) achieved +3.92% gain

**Insight**:
- Reducing TLS pressure (3 vars → 1 var) has measurable benefit
- Consolidation point still has overhead (3.22% self%)
- **Lesson**: Constructor init is next logical step (eliminate lazy check)

#### Inlining I-Cache Risk

**Observation**: A3 (header always_inline) showed -4% regression on Mixed

**Insight**:
- Aggressive inlining can thrash I-cache on mixed workloads
- Selective inlining (per-class) may work but needs careful profiling
- **Lesson**: Inlining is high-risk, constructor/caching approaches safer

---

### Realistic Expectations

**Current State**: 45M ops/s (E1 enabled)
**Target**: 48-50M ops/s (with E3-4, E3-2)
**Ceiling**: ~55-60M ops/s (without major redesign)

**Gap to mimalloc**: ~2.5x (128M vs 55M ops/s)

**Why large gap remains**:
- Architectural overhead: 4-5 layer design (wrapper → gate → policy → route → handler) vs mimalloc's 1-layer TLS buckets
- Per-call policy: hakmem evaluates policy on every call, mimalloc uses static TLS layout
- Instruction overhead: ~50-100 instructions per alloc/free vs mimalloc's ~10-15

**Next phase options**:
1. **Incremental** (E3-4, E3-2): +1-3% gains, safe, diminishing returns
2. **Structural redesign**: +20-50% potential, high risk, months of work
3. **Workload-specific tuning**: Optimize for specific profiles (C6-heavy, C7-only), not general Mixed

**Recommendation**: Pursue E3-4 (low-hanging fruit), then re-evaluate if structural redesign warranted.

---

**Analysis Complete**: 2025-12-14
**Next Action**: Implement E3-4 (ENV Constructor Init)
**Expected Timeline**: 2-3 days (design → implement → A/B → decision)
-												Phase 4 E3-4: ENV Constructor Init (+4.75% GO)

Target: Eliminate E1 lazy init check overhead (3.22% self%)
- E1 consolidated ENV gates but lazy check remained in hot path
- Strategy: __attribute__((constructor(101))) for pre-main init

Implementation:
- ENV gate: HAKMEM_ENV_SNAPSHOT_CTOR=0/1 (default 0, research box)
- core/box/hakmem_env_snapshot_box.c: Constructor function added
  - Reads ENV before main() when CTOR=1
  - Refresh also syncs gate state for bench_profile putenv
- core/box/hakmem_env_snapshot_box.h: Dual-mode enabled check
  - CTOR=1 fast path: direct global read (no lazy branch)
  - CTOR=0 fallback: legacy lazy init (rollback safe)
  - Branch hints adjusted for default OFF baseline

A/B Test Results (Mixed, 10-run, 20M iters, E1=1):
- Baseline (CTOR=0): 44.28M ops/s (mean), 44.60M ops/s (median)
- Optimized (CTOR=1): 46.38M ops/s (mean), 46.53M ops/s (median)
- Improvement: +4.75% mean, +4.35% median

Decision: GO (+4.75% >> +0.5% threshold)
- Expected +0.5-1.5%, achieved +4.75%
- Lazy init branch overhead was larger than expected
- Action: Keep as research box (default OFF), evaluate promotion

Phase 4 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E2 (Alloc Per-Class): -0.21% (NEUTRAL, frozen)
- E3-4 (Constructor Init): +4.75%
- Total Phase 4: ~+8.5%

Deliverables:
- docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_DESIGN.md
- docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE4_COMPREHENSIVE_STATUS_ANALYSIS.md
- docs/analysis/PHASE4_EXECUTIVE_SUMMARY.md
- scripts/verify_health_profiles.sh (sanity check script)
- CURRENT_TASK.md (E3-4 complete, next instructions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-14 02:57:35 +09:00
+								# Phase 4 Comprehensive Status Analysis
 								**Date**: 2025-12-14
 								**Analyst**: Claude Code
 								**Baseline**: E1 enabled (~45M ops/s)
 								---
 								## Part 1: E2 Freeze Decision Analysis
 								### Test Data Review
 								**E2 Configuration**: HAKMEM_TINY_ALLOC_DUALHOT (C0-C3 fast path for alloc)
 								**Baseline**: HAKMEM_ENV_SNAPSHOT=1 (E1 enabled)
 								**Test**: 10-run A/B, 20M iterations, ws=400
 								#### Statistical Analysis
 								| Metric | Baseline (E2=0) | Optimized (E2=1) | Delta |
 								|--------|-----------------|------------------|-------|
 								| Mean | 45.40M ops/s | 45.30M ops/s | -0.21% |
 								| Median | 45.51M ops/s | 45.22M ops/s | -0.62% |
 								| StdDev | 0.38M (0.84% CV) | 0.49M (1.07% CV) | +28% variance |
 								#### Variance Consistency Analysis
 								**Baseline runs** (DUALHOT=0):
 								- Range: 44.60M - 45.90M (1.30M spread)
 								- Runs within ±1% of mean: 9/10 (90%)
 								- Outliers: Run 8 (44.60M, -1.76% from mean)
 								**Optimized runs** (DUALHOT=1):
 								- Range: 44.59M - 46.28M (1.69M spread)
 								- Runs within ±1% of mean: 8/10 (80%)
 								- Outliers: Run 2 (46.28M, +2.16% from mean), Run 3 (44.59M, -1.58% from mean)
 								**Observation**: Higher variance in optimized version suggests branch misprediction or cache effects.
 								#### Comparison to Free DUALHOT Success
 								| Path | DUALHOT Result | Reason |
 								|------|----------------|--------|
 								| **Free** | **+13.0%** | Skips policy_snapshot() + tiny_route_for_class() for C0-C3 (48% of frees) |
 								| **Alloc** | **-0.21%** | Route already cached (Phase 3 C3), C0-C3 check adds branch without bypassing cost |
 								**Root Cause**:
 								- Free path: C0-C3 optimization skips **expensive operations** (policy snapshot + route lookup)
 								- Alloc path: C0-C3 optimization skips **already-cached operations** (static routing eliminates lookup)
 								- Net effect: Branch overhead ≈ Savings → neutral
 								### E2 Freeze Recommendation
 								**Decision**: ✅ **DEFINITIVE FREEZE**
 								**Rationale**:
 . **Result is consistent**: All 10 runs showed similar pattern (no bimodal distribution)
 . **Not a measurement error**: StdDev 0.38M-0.49M is normal for this workload
 . **Root cause understood**: Alloc path already optimized via C3 static routing
 . **Free vs Alloc asymmetry explained**: Free skips expensive ops, alloc skips cheap cached ops
 . **No alternative conditions warranted**:
 								   - Different workload (C6-heavy): Won't help - same route caching applies
 								   - Different iteration count: Won't change fundamental branch cost vs savings trade-off
 								   - Combined flags: No synergy available - route caching is already optimal
 								**Conclusion**: E2 is a **structural dead-end** for Mixed workload. Alloc route optimization saturated by C3.
 								---
 								## Part 2: Fresh Perf Profile Analysis (E1 Enabled)
 								### Profile Configuration
 								**Command**: `HAKMEM_ENV_SNAPSHOT=1 perf record -F 999 -- ./bench_random_mixed_hakmem 40000000 400 1`
 								**Throughput**: 45.26M ops/s
 								**Samples**: 946 samples, 3.25B cycles
 								### Top Functions (self% >= 2.0%)
 								| Rank | Function | self% | Change from Pre-E1 | Category |
 								|------|----------|-------|-------------------|----------|
 								| 1 | free | 22.19% | +2.5pp (from ~19%) | Wrapper |
 								| 2 | tiny_alloc_gate_fast | 18.99% | +3.6pp (from 15.37%) | Alloc Gate |
 								| 3 | main | 15.21% | No change | Benchmark |
 								| 4 | malloc | 13.36% | No change | Wrapper |
 								| 5 | free_tiny_fast_cold | 7.32% | +1.5pp (from 5.84%) | Free Path |
 								| 6 | hakmem_env_snapshot_enabled | 3.22% | **NEW (was 0% combined)** | ENV Gate |
 								| 7 | tiny_region_id_write_header | 2.60% | +0.1pp (from 2.50%) | Header |
 								| 8 | unified_cache_push | 2.56% | -1.4pp (from 3.97%) | Cache |
 								| 9 | tiny_route_for_class | 2.29% | +0.01pp (from 2.28%) | Routing |
 								| 10 | small_policy_v7_snapshot | 2.26% | No data | Policy |
 								| 11 | tiny_c7_ultra_alloc | 2.16% | -1.8pp (from 3.97%) | C7 Alloc |
 								### E1 Impact Analysis
 								**Expected**: E1 consolidates 3 ENV gates (3.26% self%) → 1 TLS read
 								**Actual**: `hakmem_env_snapshot_enabled` shows 3.22% self%
 								**Interpretation**:
 								- ENV overhead **shifted** from 3 separate functions → 1 function
 								- **NOT eliminated** - still paying 3.22% for ENV checking
 								- E1's +3.92% gain likely from **reduced TLS pressure** (fewer TLS variables), not eliminated checks
 								- The snapshot approach caches results, reducing repeated getenv() calls
 								**Surprise findings**:
 . **tiny_alloc_gate_fast increased** from 15.37% → 18.99% (+3.6pp)
 								   - Possible reason: Other functions got faster (relative %), or I-cache effects
 . **hakmem_env_snapshot_enabled is NEW hot spot** (3.22%)
 								   - This is the consolidation point - still significant overhead
 . **unified_cache_push decreased** from 3.97% → 2.56% (-1.4pp)
 								   - Good sign: Cache operations more efficient
 								### Hot Spot Distribution
 								**Pre-E1** (Phase 4 D3 baseline):
 								- ENV gates (3 functions): 3.26%
 								- tiny_alloc_gate_fast: 15.37%
 								- free_tiny_fast_cold: 5.84%
 								- **Total measured overhead**: ~24.5%
 								**Post-E1** (current):
 								- ENV snapshot (1 function): 3.22%
 								- tiny_alloc_gate_fast: 18.99%
 								- free_tiny_fast_cold: 7.32%
 								- **Total measured overhead**: ~29.5%
 								**Analysis**: Overhead increased in absolute %, but throughput increased +3.92%. This suggests:
 								- Baseline got faster (other code optimized)
 								- Relative % shifted to measured functions
 								- Perf sampling variance (946 samples has ~±3% error margin)
 								---
 								## Part 3: E3 Candidate Identification
 								### Methodology
 								**Selection Criteria**:
 . self% >= 5% (significant impact)
 . Not already heavily optimized (avoid saturated areas)
 . Different approach from route/TLS optimization (explore new vectors)
 								### Candidate Analysis
 								#### Candidate E3-1: tiny_alloc_gate_fast (18.99% self%) - ROUTING SATURATION
 								**Current State**:
 								- Phase 3 C3: Static routing (+2.20% gain)
 								- Phase 4 D3: Alloc gate shape (+0.56% neutral)
 								- Phase 4 E2: Per-class fast path (-0.21% neutral)
 								**Why it's 18.99%**:
 								- Route determination: Already cached (C3)
 								- Branch prediction: Already tuned (D3)
 								- Per-class specialization: No benefit (E2)
 								**Remaining Overhead**:
 								- Function call overhead (not inlined)
 								- ENV snapshot check (3.22% now consolidated)
 								- Size→class conversion (hak_tiny_size_to_class)
 								- Wrapper→gate dispatch
 								**Optimization Approach**: **INLINING + DISPATCH OPTIMIZATION**
 								- **Strategy**: Inline tiny_alloc_gate_fast into malloc wrapper
 								  - Eliminate function call overhead (save ~5-10 cycles)
 								  - Improve I-cache locality (malloc + gate in same cache line)
 								  - Enable cross-function optimization (compiler can optimize malloc→gate→fast_path as one unit)
 								- **Expected Gain**: +1-2% (reduce 18.99% self by 10-15% = ~2pp overall)
 								- **Risk**: Medium (I-cache pressure, as seen in A3 -4% regression)
 								**Recommendation**: **DEFER** - Route optimization saturated, inlining has I-cache risk
 								---
 								#### Candidate E3-2: free (22.19% self%) - WRAPPER OVERHEAD
 								**Current State**:
 								- Phase 2 B4: Wrapper hot/cold split (+1.47% gain)
 								- Wrapper shape already optimized (rare checks in cold path)
 								**Why it's 22.19%**:
 								- This is the `free()` wrapper function (libc entry point)
 								- Includes: LD mode check, jemalloc check, diagnostics, then dispatch to free_tiny_fast
 								**Optimization Approach**: **WRAPPER BYPASS (IFUNC) or Function Pointer Caching**
 								- **Strategy 1 (IFUNC)**: Use GNU IFUNC to resolve malloc/free at load time
 								  - Direct binding: `malloc → tiny_alloc_gate_fast` (no wrapper layer)
 								  - Risk: HIGH (ABI compatibility, thread-safety)
 								- **Strategy 2 (Function Pointer)**: Cache `g_free_impl` in TLS
 								  - Check once at thread init, then direct call
 								  - Risk: Medium, Lower gain (+1-2%)
 								**Recommendation**: **HIGH PRIORITY** - Large potential gain, prototype with function pointer approach first
 								---
 								#### Candidate E3-3: free_tiny_fast_cold (7.32% self%) - COLD PATH OPTIMIZATION
 								**Current State**:
 								- Phase FREE-DUALHOT: Hot/cold split (+13% gain for C0-C3 hot path)
 								- Cold path handles C4-C7 (~50% of frees)
 								**Optimization Approach**: **C4-C7 ROUTE SPECIALIZATION**
 								- **Strategy**: Create per-class cold paths (similar to E2 alloc attempt)
 								- **Expected Gain**: +0.5-1.0%
 								- **Risk**: Low
 								**Recommendation**: **MEDIUM PRIORITY** - Incremental gain, but may hit diminishing returns like E2
 								---
 								#### Candidate E3-4: hakmem_env_snapshot_enabled (3.22% self%) - ENV OVERHEAD REDUCTION ⭐
 								**Current State**:
 								- Phase 4 E1: ENV snapshot consolidation (+3.92% gain)
 								- 3 separate ENV gates → 1 consolidated snapshot
 								**Why it's 3.22%**:
 								- This IS the optimization (consolidation point)
 								- Still checking `g_hakmem_env_snapshot.initialized` on every call
 								- TLS read overhead (1 TLS variable vs 3, but still 1 read per hot path)
 								**Optimization Approach**: **LAZY INIT ELIMINATION**
 								- **Strategy**: Force ENV snapshot initialization at library load time (constructor)
 								  - Use `__attribute__((constructor))` to init before main()
 								  - Eliminate `if (!initialized)` check in hot path
 								  - Make `hakmem_env_get()` a pure TLS read (no branch)
 								- **Expected Gain**: +0.5-1.5% (eliminate 3.22% check overhead)
 								- **Risk**: Low (standard initialization pattern)
 								- **Implementation**:
 								  ```c
 								  __attribute__((constructor))
 								  static void hakmem_env_snapshot_init_early(void) {
 								      hakmem_env_snapshot_init();  // Force init before any alloc/free
 								  }
 								  static inline const hakmem_env_snapshot* hakmem_env_get(void) {
 								      return &g_hakmem_env_snapshot;  // No check, just return
 								  }
 								  ```
 								**Recommendation**: **HIGH PRIORITY** - Clean win, low risk, eliminates E1's remaining overhead
 								---
 								#### Candidate E3-5: tiny_region_id_write_header (2.60% self%) - HEADER WRITE OPTIMIZATION
 								**Current State**:
 								- Phase 1 A3: always_inline attempt → -4.00% regression (NO-GO)
 								- I-cache pressure issue identified
 								**Optimization Approach**: **SELECTIVE INLINING**
 								- **Strategy**: Inline only for hot classes (C7 ULTRA, C0-C3 LEGACY)
 								- **Expected Gain**: +0.5-1.0%
 								- **Risk**: Medium (I-cache effects)
 								**Recommendation**: **LOW PRIORITY** - A3 already explored, I-cache risk remains
 								---
 								### E3 Candidate Ranking
 								| Rank | Candidate | self% | Approach | Expected Gain | Risk | ROI |
 								|------|-----------|-------|----------|---------------|------|-----|
 								| **1** | **hakmem_env_snapshot_enabled** | **3.22%** | **Constructor init** | **+0.5-1.5%** | **Low** | **⭐⭐⭐** |
 								| **2** | **free wrapper** | **22.19%** | **Function pointer cache** | **+1-2%** | **Medium** | **⭐⭐⭐** |
 								| 3 | tiny_alloc_gate_fast | 18.99% | Inlining | +1-2% | High (I-cache) | ⭐⭐ |
 								| 4 | free_tiny_fast_cold | 7.32% | Route specialization | +0.5-1.0% | Low | ⭐⭐ |
 								| 5 | tiny_region_id_write_header | 2.60% | Selective inline | +0.5-1.0% | Medium | ⭐ |
 								---
 								## Part 4: Summary & Recommendations
 								### E2 Final Decision
 								**Decision**: ✅ **FREEZE DEFINITIVELY**
 								**Rationale**:
 . **Result is consistent**: -0.21% mean, -0.62% median across 10 runs
 . **Root cause clear**: Alloc route optimization saturated by Phase 3 C3 static routing
 . **Free vs Alloc asymmetry**: Free DUALHOT skips expensive ops, alloc skips cached ops
 . **No alternative testing needed**: Workload/iteration changes won't fix structural issue
 . **Lesson learned**: Per-class specialization only works when bypassing uncached overhead
 								**Action**:
 								- Keep `HAKMEM_TINY_ALLOC_DUALHOT=0` as default (research box frozen)
 								- Document in CURRENT_TASK.md as NEUTRAL result
 								- No further investigation warranted
 								---
 								### Perf Findings (E1 Enabled Baseline)
 								**Throughput**: 45.26M ops/s (+3.92% from pre-E1 baseline)
 								**Hot Spots** (self% >= 5%):
 . free (22.19%) - Wrapper overhead
 . tiny_alloc_gate_fast (18.99%) - Route overhead (saturated)
 . main (15.21%) - Benchmark driver
 . malloc (13.36%) - Wrapper overhead
 . free_tiny_fast_cold (7.32%) - C4-C7 free path
 								**E1 Impact**:
 								- ENV overhead consolidated: 3.26% (3 functions) → 3.22% (1 function)
 								- Gain from reduced TLS pressure: +3.92%
 								- **Remaining opportunity**: Eliminate lazy init check (3.22% → 0%)
 								**New Hot Spots**:
 								- hakmem_env_snapshot_enabled: 3.22% (consolidation point)
 								**Changes from Pre-E1**:
 								- tiny_alloc_gate_fast: +3.6pp (15.37% → 18.99%)
 								- free: +2.5pp (~19% → 22.19%)
 								- unified_cache_push: -1.4pp (3.97% → 2.56%)
 								---
 								### E3 Recommendation
 								**Primary Target**: **hakmem_env_snapshot_enabled (E3-4)**
 								**Approach**: Constructor-based initialization
 								- Force ENV snapshot init at library load time
 								- Eliminate lazy init check in hot path
 								- Make `hakmem_env_get()` a pure TLS read (no branch)
 								**Expected Gain**: +0.5-1.5%
 								**Implementation Complexity**: Low (2-day task)
 								- Add `__attribute__((constructor))` function
 								- Remove init check from hakmem_env_get()
 								- A/B test with 10-run Mixed + 5-run C6-heavy
 								**Rationale**:
 . **Low risk**: Standard initialization pattern (used by jemalloc, tcmalloc)
 . **Clear gain**: Eliminates 3.22% overhead (lazy init check)
 . **Compounds E1**: Completes ENV snapshot optimization started in E1
 . **Different vector**: Not route/TLS optimization - this is **initialization overhead reduction**
 								**Success Criteria**:
 								- Mean gain >= +0.5% (conservative)
 								- No regression on any profile
 								- Health check passes
 								---
 								**Secondary Target**: **free wrapper (E3-2)**
 								**Approach**: Function pointer caching
 								- Cache `g_free_impl` in TLS at thread init
 								- Direct call instead of LD mode check + dispatch
 								- Lower risk than IFUNC approach
 								**Expected Gain**: +1-2%
 								**Implementation Complexity**: Medium (3-4 day task)
 								**Risk**: Medium (thread-safety, initialization order)
 								---
 								### Phase 4 Status
 								**Active Optimizations**:
 								- E1 (ENV Snapshot): +3.92% ✅ GO (research box, default OFF / opt-in)
 								- E3-4 (ENV Constructor Init): +4.75% ✅ GO (research box, default OFF / opt-in, requires E1)
 								**Frozen Optimizations**:
 								- D3 (Alloc Gate Shape): +0.56% ⚪ NEUTRAL (research box, default OFF)
 								- E2 (Alloc Per-Class FastPath): -0.21% ⚪ NEUTRAL (research box, default OFF)
 								**Cumulative Gain** (Phase 2-4):
 								- B3 (Routing shape): +2.89%
 								- B4 (Wrapper split): +1.47%
 								- C3 (Static routing): +2.20%
 								- D1 (Free route cache): +2.19%
 								- E1 (ENV snapshot): +3.92%
 								- E3-4 (ENV ctor): +4.75% (opt-in, requires E1)
 								- **Total (opt-in含む): ~17%**（プロファイル/ENV 組み合わせ依存）
 								**Baseline（参考）**:
 								- E1=1, CTOR=0: 45.26M ops/s（Mixed, 40M iters, ws=400）
 								- E1=1, CTOR=1: 46.38M ops/s（Mixed, 20M iters, ws=400）
 								**Remaining Potential**:
 								- E3-2 (Wrapper function ptr): +1-2%
 								- E3-3 (Free route special): +0.5-1.0%
 								- **Realistic ceiling**: ~48-50M ops/s (without major redesign)
 								---
 								### Next Steps
 								#### Immediate (Priority 1)
 . **Freeze E2 in CURRENT_TASK.md**
 								   - Document NEUTRAL decision (-0.21%)
 								   - Add root cause explanation (route caching saturation)
 								   - Mark as research box (default OFF, frozen)
 . **E3-4 の昇格ゲート（再検証）**
 								   - E3-4 は GO 済みだが、branch hint/refresh など “足元の調整” 後に 10-run 再確認
 								   - A/B: Mixed 10-run（E1=1, CTOR=0 vs 1）
 								   - 健康診断: `scripts/verify_health_profiles.sh`
 								#### Short-term (Priority 2)
 . **E1/E3-4 ON の状態で perf を取り直す**
 								   - `hakmem_env_snapshot_enabled` が Top から落ちる／self% が有意に下がること
 								   - 次の芯（alloc gate / free_tiny_fast_cold / wrapper）を “self% ≥ 5%” で選定
 								#### Long-term (Priority 3)
 . **Consider non-incremental approaches**
 								   - Mimalloc-style TLS bucket redesign (major overhaul)
 								   - Static-compiled routing (eliminate runtime policy)
 								   - IFUNC for zero-overhead wrapper (high risk)
 								---
 								### Lessons Learned
 								#### Route Optimization Saturation
 								**Observation**: E2 (alloc per-class) showed -0.21% neutral despite free path success (+13%)
 								**Insight**:
 								- Route optimization has diminishing returns after static caching (C3)
 								- Further specialization adds branch overhead without eliminating cost
 								- **Lesson**: Don't pursue per-class specialization on already-cached paths
 								#### Shape Optimization Plateau
 								**Observation**: D3 (alloc gate shape) showed +0.56% neutral despite B3 success (+2.89%)
 								**Insight**:
 								- Branch prediction saturates after initial tuning
 								- LIKELY/UNLIKELY hints have limited benefit on well-trained branches
 								- **Lesson**: Shape optimization good for first pass, limited ROI after
 								#### ENV Consolidation Success
 								**Observation**: E1 (ENV snapshot) achieved +3.92% gain
 								**Insight**:
 								- Reducing TLS pressure (3 vars → 1 var) has measurable benefit
 								- Consolidation point still has overhead (3.22% self%)
 								- **Lesson**: Constructor init is next logical step (eliminate lazy check)
 								#### Inlining I-Cache Risk
 								**Observation**: A3 (header always_inline) showed -4% regression on Mixed
 								**Insight**:
 								- Aggressive inlining can thrash I-cache on mixed workloads
 								- Selective inlining (per-class) may work but needs careful profiling
 								- **Lesson**: Inlining is high-risk, constructor/caching approaches safer
 								---
 								### Realistic Expectations
 								**Current State**: 45M ops/s (E1 enabled)
 								**Target**: 48-50M ops/s (with E3-4, E3-2)
 								**Ceiling**: ~55-60M ops/s (without major redesign)
 								**Gap to mimalloc**: ~2.5x (128M vs 55M ops/s)
 								**Why large gap remains**:
 								- Architectural overhead: 4-5 layer design (wrapper → gate → policy → route → handler) vs mimalloc's 1-layer TLS buckets
 								- Per-call policy: hakmem evaluates policy on every call, mimalloc uses static TLS layout
 								- Instruction overhead: ~50-100 instructions per alloc/free vs mimalloc's ~10-15
 								**Next phase options**:
 . **Incremental** (E3-4, E3-2): +1-3% gains, safe, diminishing returns
 . **Structural redesign**: +20-50% potential, high risk, months of work
 . **Workload-specific tuning**: Optimize for specific profiles (C6-heavy, C7-only), not general Mixed
 								**Recommendation**: Pursue E3-4 (low-hanging fruit), then re-evaluate if structural redesign warranted.
 								---
 								**Analysis Complete**: 2025-12-14
 								**Next Action**: Implement E3-4 (ENV Constructor Init)
 								**Expected Timeline**: 2-3 days (design → implement → A/B → decision)