Phase 4 E3-4: ENV Constructor Init (+4.75% GO)

Target: Eliminate E1 lazy init check overhead (3.22% self%) - E1 consolidated ENV gates but lazy check remained in hot path - Strategy: __attribute__((constructor(101))) for pre-main init Implementation: - ENV gate: HAKMEM_ENV_SNAPSHOT_CTOR=0/1 (default 0, research box) - core/box/hakmem_env_snapshot_box.c: Constructor function added - Reads ENV before main() when CTOR=1 - Refresh also syncs gate state for bench_profile putenv - core/box/hakmem_env_snapshot_box.h: Dual-mode enabled check - CTOR=1 fast path: direct global read (no lazy branch) - CTOR=0 fallback: legacy lazy init (rollback safe) - Branch hints adjusted for default OFF baseline A/B Test Results (Mixed, 10-run, 20M iters, E1=1): - Baseline (CTOR=0): 44.28M ops/s (mean), 44.60M ops/s (median) - Optimized (CTOR=1): 46.38M ops/s (mean), 46.53M ops/s (median) - Improvement: +4.75% mean, +4.35% median Decision: GO (+4.75% >> +0.5% threshold) - Expected +0.5-1.5%, achieved +4.75% - Lazy init branch overhead was larger than expected - Action: Keep as research box (default OFF), evaluate promotion Phase 4 Cumulative: - E1 (ENV Snapshot): +3.92% - E2 (Alloc Per-Class): -0.21% (NEUTRAL, frozen) - E3-4 (Constructor Init): +4.75% - Total Phase 4: ~+8.5% Deliverables: - docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_DESIGN.md - docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_NEXT_INSTRUCTIONS.md - docs/analysis/PHASE4_COMPREHENSIVE_STATUS_ANALYSIS.md - docs/analysis/PHASE4_EXECUTIVE_SUMMARY.md - scripts/verify_health_profiles.sh (sanity check script) - CURRENT_TASK.md (E3-4 complete, next instructions) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 02:57:35 +09:00
parent 6a6744d065
commit 21e2e4ac2b
11 changed files with 1010 additions and 10 deletions
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@ -1,6 +1,37 @@
 # 本線タスク（現在）

-## 更新メモ（2025-12-14 Phase 4 E2 Complete - Alloc Per-Class FastPath）
+## 更新メモ（2025-12-14 Phase 4 E3-4 Complete - ENV Constructor Init）
+
+### Phase 4 E3-4: ENV Constructor Init ✅ GO (+4.75%) (2025-12-14)
+
+**Target**: E1 の lazy init check（3.22% self%）を constructor init で排除
+- E1 で ENV snapshot を統合したが、`hakmem_env_snapshot_enabled()` の lazy check が残っていた
+- Strategy: `__attribute__((constructor(101)))` で main() 前に gate 初期化
+
+**Implementation**:
+- ENV gate: `HAKMEM_ENV_SNAPSHOT_CTOR=0/1` (default: 0, research box)
+- `core/box/hakmem_env_snapshot_box.c`: Constructor function 追加
+- `core/box/hakmem_env_snapshot_box.h`: Dual-mode enabled check (constructor vs legacy)
+
+**A/B Test Results** (Mixed, 10-run, 20M iters, HAKMEM_ENV_SNAPSHOT=1):
+- Baseline (CTOR=0): **44.28M ops/s** (mean), 44.60M ops/s (median), σ=1.0M
+- Optimized (CTOR=1): **46.38M ops/s** (mean), 46.53M ops/s (median), σ=0.5M
+- **Improvement: +4.75% mean, +4.35% median**
+
+**Decision: GO** (+4.75% >> +0.5% threshold)
+- 期待値 +0.5-1.5% を大幅に上回る +4.75% 達成
+- Action: Keep as research box for now (default OFF)
+- Design doc: `docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_DESIGN.md`
+
+**Key Insight**: Lazy init check overhead was larger than expected. Constructor pattern eliminates branch in hot path entirely, yielding substantial gain.
+
+**Cumulative Status (Phase 4)**:
+- E1 (ENV Snapshot): +3.92% (GO)
+- E2 (Alloc Per-Class): -0.21% (NEUTRAL, frozen)
+- **E3-4 (Constructor Init): +4.75% (GO)**
+- **Total Phase 4: ~+8.5%**
+
+---

 ### Phase 4 E2: Alloc Per-Class FastPath ⚪ NEUTRAL (2025-12-14)

@ -34,8 +65,11 @@
 **Cumulative Status**:
 - Phase 4 E1: +3.92% (GO, research box)
 - Phase 4 E2: -0.21% (NEUTRAL, frozen)
+- Phase 4 E3-4: +4.75% (GO, research box; requires E1)

-### Next: Phase 4 E3 - TBD (consult perf profile or pursue other optimization vectors)
+### Next: Phase 4 E3-4（昇格判断）
+
+- 指示書: `docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_NEXT_INSTRUCTIONS.md`

 ---

--- a/core/box/hakmem_env_snapshot_box.c
+++ b/core/box/hakmem_env_snapshot_box.c
@ -1,4 +1,6 @@
 // hakmem_env_snapshot_box.c - Phase 4 E1: ENV Snapshot Consolidation (implementation)
+//
+// E3-4 Extension: Constructor init to eliminate lazy check overhead (3.22% self%)

 #include "hakmem_env_snapshot_box.h"
 #include <stdio.h>
@ -13,6 +15,30 @@ extern bool small_learner_v2_enabled(void);
 HakmemEnvSnapshot g_hakmem_env_snapshot = {0};
 int g_hakmem_env_snapshot_ready = 0;

+// E3-4: Global gate state (not static local - avoids lazy init overhead)
+int g_hakmem_env_snapshot_gate = -1;
+int g_hakmem_env_snapshot_ctor_mode = -1;
+
+// E3-4: Constructor - run before main() to init gate without lazy check
+__attribute__((constructor(101)))
+static void hakmem_env_snapshot_gate_ctor(void) {
+    // Read HAKMEM_ENV_SNAPSHOT_CTOR (default OFF for safety)
+    const char* ctor_env = getenv("HAKMEM_ENV_SNAPSHOT_CTOR");
+    g_hakmem_env_snapshot_ctor_mode = (ctor_env && *ctor_env == '1') ? 1 : 0;
+
+    if (g_hakmem_env_snapshot_ctor_mode) {
+        // Constructor mode: init gate now (before any malloc/free calls)
+        const char* e = getenv("HAKMEM_ENV_SNAPSHOT");
+        g_hakmem_env_snapshot_gate = (e && *e == '1') ? 1 : 0;
+
+#if !HAKMEM_BUILD_RELEASE
+        fprintf(stderr, "[E3-4] Constructor init: HAKMEM_ENV_SNAPSHOT_GATE=%d\n",
+                g_hakmem_env_snapshot_gate);
+        fflush(stderr);
+#endif
+    }
+}
+
 // Internal helper: read all ENV vars and compute effective values
 static void hakmem_env_snapshot_load(HakmemEnvSnapshot* snap) {
    // Read HAKMEM_TINY_C7_ULTRA (default: ON)
@ -68,6 +94,14 @@ void hakmem_env_snapshot_init(void) {
 // Refresh snapshot from ENV (for bench_profile putenv sync)
 // This ensures that after bench_setenv_default() runs, the snapshot is refreshed
 void hakmem_env_snapshot_refresh_from_env(void) {
+    // Refresh gate state too so bench_profile putenv defaults take effect even if
+    // the gate was lazily initialized earlier (e.g. by pre-main malloc/free).
+    const char* ctor_env = getenv("HAKMEM_ENV_SNAPSHOT_CTOR");
+    g_hakmem_env_snapshot_ctor_mode = (ctor_env && *ctor_env == '1') ? 1 : 0;
+
+    const char* gate_env = getenv("HAKMEM_ENV_SNAPSHOT");
+    g_hakmem_env_snapshot_gate = (gate_env && *gate_env == '1') ? 1 : 0;
+
    hakmem_env_snapshot_load(&g_hakmem_env_snapshot);
    g_hakmem_env_snapshot_ready = 1;

--- a/core/box/hakmem_env_snapshot_box.h
+++ b/core/box/hakmem_env_snapshot_box.h
@ -10,10 +10,16 @@
 //   - Lazy init with version-based refresh (follows tiny_front_v3_snapshot pattern)
 //   - Learner interlock: tiny_metadata_cache_eff = cache && !learner
 //
+// E3-4 Extension: Constructor init to eliminate lazy check overhead
+//   - ENV: HAKMEM_ENV_SNAPSHOT_CTOR=0/1 (default 0)
+//   - When =1: Gate init runs in constructor (before main)
+//   - Eliminates 3.22% lazy init check overhead
+//
 // Benefits:
 //   - 3 TLS reads → 1 TLS read (66% reduction)
 //   - 3 lazy init checks → 1 lazy init check
-//   - Expected gain: +1-3% (conservative from 3.26% overhead)
+//   - E3-4: Lazy init check → no check (constructor init)
+//   - Expected gain: +1-3% (E1) + +0.5-1.5% (E3-4)

 #ifndef HAK_ENV_SNAPSHOT_BOX_H
 #define HAK_ENV_SNAPSHOT_BOX_H
@ -47,18 +53,29 @@ static inline const HakmemEnvSnapshot* hakmem_env_snapshot(void) {
    return &g_hakmem_env_snapshot;
 }

+// E3-4: Global gate state (defined in hakmem_env_snapshot_box.c)
+extern int g_hakmem_env_snapshot_gate;
+extern int g_hakmem_env_snapshot_ctor_mode;
+
 // ENV gate: default OFF (research box, set =1 to enable)
+// E3-4: Dual-mode - constructor init (fast) or legacy lazy init (fallback)
 static inline bool hakmem_env_snapshot_enabled(void) {
-    static int g = -1;
-    if (__builtin_expect(g == -1, 0)) {
+    // E3-4 Fast path: constructor mode (no lazy check, just global read)
+    // Default is OFF, so ctor_mode==1 is UNLIKELY.
+    if (__builtin_expect(g_hakmem_env_snapshot_ctor_mode == 1, 0)) {
+        return g_hakmem_env_snapshot_gate != 0;
+    }
+
+    // Legacy path: lazy init (fallback when HAKMEM_ENV_SNAPSHOT_CTOR=0)
+    if (__builtin_expect(g_hakmem_env_snapshot_gate == -1, 0)) {
        const char* e = getenv("HAKMEM_ENV_SNAPSHOT");
        if (e && *e) {
-            g = (*e == '1') ? 1 : 0;
+            g_hakmem_env_snapshot_gate = (*e == '1') ? 1 : 0;
        } else {
-            g = 0;  // default: OFF (research box)
+            g_hakmem_env_snapshot_gate = 0;  // default: OFF (research box)
        }
    }
-    return g != 0;
+    return g_hakmem_env_snapshot_gate != 0;
 }

 #endif // HAK_ENV_SNAPSHOT_BOX_H
--- a/docs/analysis/ENV_PROFILE_PRESETS.md
+++ b/docs/analysis/ENV_PROFILE_PRESETS.md
@ -108,6 +108,16 @@ HAKMEM_ENV_SNAPSHOT=1
  - **Status**: ✅ GO（Mixed 10-run: **+3.92% avg / +4.01% median**）→ default OFF（opt-in）
  - **Effect**: `tiny_c7_ultra_enabled_env/tiny_front_v3_enabled/tiny_metadata_cache_enabled` のホット ENV gate を snapshot 1 本に集約
  - **Rollback**: `HAKMEM_ENV_SNAPSHOT=0`
+- **Phase 4 E3-4（ENV Constructor Init）** ✅ GO (opt-in):
+```sh
+# Requires E1
+HAKMEM_ENV_SNAPSHOT=1
+HAKMEM_ENV_SNAPSHOT_CTOR=1
+```
+  - **Status**: ✅ GO（Mixed 10-run: **+4.75% mean / +4.35% median**）→ default OFF（opt-in）
+  - **Effect**: `hakmem_env_snapshot_enabled()` の lazy gate 判定を constructor init で短絡（hot path の分岐/ロード削減）
+  - **Note**: “constructor での pre-main init” を効かせたい場合は、プロセス起動前に ENV を設定する（bench_profile putenv だけでは遅い）
+  - **Rollback**: `HAKMEM_ENV_SNAPSHOT_CTOR=0`
 - v2 系は触らない（C7_SAFE では Pool v2 / Tiny v2 は常時 OFF）。
 - FREE_POLICY/THP を触る実験例（現在の HEAD では必須ではなく、組み合わせによっては微マイナスになる場合もある）:
 ```sh
--- a/docs/analysis/PHASE4_COMPREHENSIVE_STATUS_ANALYSIS.md
+++ b/docs/analysis/PHASE4_COMPREHENSIVE_STATUS_ANALYSIS.md
@ -0,0 +1,486 @@
+# Phase 4 Comprehensive Status Analysis
+
+**Date**: 2025-12-14
+**Analyst**: Claude Code
+**Baseline**: E1 enabled (~45M ops/s)
+
+---
+
+## Part 1: E2 Freeze Decision Analysis
+
+### Test Data Review
+
+**E2 Configuration**: HAKMEM_TINY_ALLOC_DUALHOT (C0-C3 fast path for alloc)
+**Baseline**: HAKMEM_ENV_SNAPSHOT=1 (E1 enabled)
+**Test**: 10-run A/B, 20M iterations, ws=400
+
+#### Statistical Analysis
+
+| Metric | Baseline (E2=0) | Optimized (E2=1) | Delta |
+|--------|-----------------|------------------|-------|
+| Mean | 45.40M ops/s | 45.30M ops/s | -0.21% |
+| Median | 45.51M ops/s | 45.22M ops/s | -0.62% |
+| StdDev | 0.38M (0.84% CV) | 0.49M (1.07% CV) | +28% variance |
+
+#### Variance Consistency Analysis
+
+**Baseline runs** (DUALHOT=0):
+- Range: 44.60M - 45.90M (1.30M spread)
+- Runs within ±1% of mean: 9/10 (90%)
+- Outliers: Run 8 (44.60M, -1.76% from mean)
+
+**Optimized runs** (DUALHOT=1):
+- Range: 44.59M - 46.28M (1.69M spread)
+- Runs within ±1% of mean: 8/10 (80%)
+- Outliers: Run 2 (46.28M, +2.16% from mean), Run 3 (44.59M, -1.58% from mean)
+
+**Observation**: Higher variance in optimized version suggests branch misprediction or cache effects.
+
+#### Comparison to Free DUALHOT Success
+
+| Path | DUALHOT Result | Reason |
+|------|----------------|--------|
+| **Free** | **+13.0%** | Skips policy_snapshot() + tiny_route_for_class() for C0-C3 (48% of frees) |
+| **Alloc** | **-0.21%** | Route already cached (Phase 3 C3), C0-C3 check adds branch without bypassing cost |
+
+**Root Cause**:
+- Free path: C0-C3 optimization skips **expensive operations** (policy snapshot + route lookup)
+- Alloc path: C0-C3 optimization skips **already-cached operations** (static routing eliminates lookup)
+- Net effect: Branch overhead ≈ Savings → neutral
+
+### E2 Freeze Recommendation
+
+**Decision**: ✅ **DEFINITIVE FREEZE**
+
+**Rationale**:
+
+1. **Result is consistent**: All 10 runs showed similar pattern (no bimodal distribution)
+2. **Not a measurement error**: StdDev 0.38M-0.49M is normal for this workload
+3. **Root cause understood**: Alloc path already optimized via C3 static routing
+4. **Free vs Alloc asymmetry explained**: Free skips expensive ops, alloc skips cheap cached ops
+5. **No alternative conditions warranted**:
+   - Different workload (C6-heavy): Won't help - same route caching applies
+   - Different iteration count: Won't change fundamental branch cost vs savings trade-off
+   - Combined flags: No synergy available - route caching is already optimal
+
+**Conclusion**: E2 is a **structural dead-end** for Mixed workload. Alloc route optimization saturated by C3.
+
+---
+
+## Part 2: Fresh Perf Profile Analysis (E1 Enabled)
+
+### Profile Configuration
+
+**Command**: `HAKMEM_ENV_SNAPSHOT=1 perf record -F 999 -- ./bench_random_mixed_hakmem 40000000 400 1`
+**Throughput**: 45.26M ops/s
+**Samples**: 946 samples, 3.25B cycles
+
+### Top Functions (self% >= 2.0%)
+
+| Rank | Function | self% | Change from Pre-E1 | Category |
+|------|----------|-------|-------------------|----------|
+| 1 | free | 22.19% | +2.5pp (from ~19%) | Wrapper |
+| 2 | tiny_alloc_gate_fast | 18.99% | +3.6pp (from 15.37%) | Alloc Gate |
+| 3 | main | 15.21% | No change | Benchmark |
+| 4 | malloc | 13.36% | No change | Wrapper |
+| 5 | free_tiny_fast_cold | 7.32% | +1.5pp (from 5.84%) | Free Path |
+| 6 | hakmem_env_snapshot_enabled | 3.22% | **NEW (was 0% combined)** | ENV Gate |
+| 7 | tiny_region_id_write_header | 2.60% | +0.1pp (from 2.50%) | Header |
+| 8 | unified_cache_push | 2.56% | -1.4pp (from 3.97%) | Cache |
+| 9 | tiny_route_for_class | 2.29% | +0.01pp (from 2.28%) | Routing |
+| 10 | small_policy_v7_snapshot | 2.26% | No data | Policy |
+| 11 | tiny_c7_ultra_alloc | 2.16% | -1.8pp (from 3.97%) | C7 Alloc |
+
+### E1 Impact Analysis
+
+**Expected**: E1 consolidates 3 ENV gates (3.26% self%) → 1 TLS read
+**Actual**: `hakmem_env_snapshot_enabled` shows 3.22% self%
+
+**Interpretation**:
+- ENV overhead **shifted** from 3 separate functions → 1 function
+- **NOT eliminated** - still paying 3.22% for ENV checking
+- E1's +3.92% gain likely from **reduced TLS pressure** (fewer TLS variables), not eliminated checks
+- The snapshot approach caches results, reducing repeated getenv() calls
+
+**Surprise findings**:
+1. **tiny_alloc_gate_fast increased** from 15.37% → 18.99% (+3.6pp)
+   - Possible reason: Other functions got faster (relative %), or I-cache effects
+2. **hakmem_env_snapshot_enabled is NEW hot spot** (3.22%)
+   - This is the consolidation point - still significant overhead
+3. **unified_cache_push decreased** from 3.97% → 2.56% (-1.4pp)
+   - Good sign: Cache operations more efficient
+
+### Hot Spot Distribution
+
+**Pre-E1** (Phase 4 D3 baseline):
+- ENV gates (3 functions): 3.26%
+- tiny_alloc_gate_fast: 15.37%
+- free_tiny_fast_cold: 5.84%
+- **Total measured overhead**: ~24.5%
+
+**Post-E1** (current):
+- ENV snapshot (1 function): 3.22%
+- tiny_alloc_gate_fast: 18.99%
+- free_tiny_fast_cold: 7.32%
+- **Total measured overhead**: ~29.5%
+
+**Analysis**: Overhead increased in absolute %, but throughput increased +3.92%. This suggests:
+- Baseline got faster (other code optimized)
+- Relative % shifted to measured functions
+- Perf sampling variance (946 samples has ~±3% error margin)
+
+---
+
+## Part 3: E3 Candidate Identification
+
+### Methodology
+
+**Selection Criteria**:
+1. self% >= 5% (significant impact)
+2. Not already heavily optimized (avoid saturated areas)
+3. Different approach from route/TLS optimization (explore new vectors)
+
+### Candidate Analysis
+
+#### Candidate E3-1: tiny_alloc_gate_fast (18.99% self%) - ROUTING SATURATION
+
+**Current State**:
+- Phase 3 C3: Static routing (+2.20% gain)
+- Phase 4 D3: Alloc gate shape (+0.56% neutral)
+- Phase 4 E2: Per-class fast path (-0.21% neutral)
+
+**Why it's 18.99%**:
+- Route determination: Already cached (C3)
+- Branch prediction: Already tuned (D3)
+- Per-class specialization: No benefit (E2)
+
+**Remaining Overhead**:
+- Function call overhead (not inlined)
+- ENV snapshot check (3.22% now consolidated)
+- Size→class conversion (hak_tiny_size_to_class)
+- Wrapper→gate dispatch
+
+**Optimization Approach**: **INLINING + DISPATCH OPTIMIZATION**
+- **Strategy**: Inline tiny_alloc_gate_fast into malloc wrapper
+  - Eliminate function call overhead (save ~5-10 cycles)
+  - Improve I-cache locality (malloc + gate in same cache line)
+  - Enable cross-function optimization (compiler can optimize malloc→gate→fast_path as one unit)
+- **Expected Gain**: +1-2% (reduce 18.99% self by 10-15% = ~2pp overall)
+- **Risk**: Medium (I-cache pressure, as seen in A3 -4% regression)
+
+**Recommendation**: **DEFER** - Route optimization saturated, inlining has I-cache risk
+
+---
+
+#### Candidate E3-2: free (22.19% self%) - WRAPPER OVERHEAD
+
+**Current State**:
+- Phase 2 B4: Wrapper hot/cold split (+1.47% gain)
+- Wrapper shape already optimized (rare checks in cold path)
+
+**Why it's 22.19%**:
+- This is the `free()` wrapper function (libc entry point)
+- Includes: LD mode check, jemalloc check, diagnostics, then dispatch to free_tiny_fast
+
+**Optimization Approach**: **WRAPPER BYPASS (IFUNC) or Function Pointer Caching**
+- **Strategy 1 (IFUNC)**: Use GNU IFUNC to resolve malloc/free at load time
+  - Direct binding: `malloc → tiny_alloc_gate_fast` (no wrapper layer)
+  - Risk: HIGH (ABI compatibility, thread-safety)
+- **Strategy 2 (Function Pointer)**: Cache `g_free_impl` in TLS
+  - Check once at thread init, then direct call
+  - Risk: Medium, Lower gain (+1-2%)
+
+**Recommendation**: **HIGH PRIORITY** - Large potential gain, prototype with function pointer approach first
+
+---
+
+#### Candidate E3-3: free_tiny_fast_cold (7.32% self%) - COLD PATH OPTIMIZATION
+
+**Current State**:
+- Phase FREE-DUALHOT: Hot/cold split (+13% gain for C0-C3 hot path)
+- Cold path handles C4-C7 (~50% of frees)
+
+**Optimization Approach**: **C4-C7 ROUTE SPECIALIZATION**
+- **Strategy**: Create per-class cold paths (similar to E2 alloc attempt)
+- **Expected Gain**: +0.5-1.0%
+- **Risk**: Low
+
+**Recommendation**: **MEDIUM PRIORITY** - Incremental gain, but may hit diminishing returns like E2
+
+---
+
+#### Candidate E3-4: hakmem_env_snapshot_enabled (3.22% self%) - ENV OVERHEAD REDUCTION ⭐
+
+**Current State**:
+- Phase 4 E1: ENV snapshot consolidation (+3.92% gain)
+- 3 separate ENV gates → 1 consolidated snapshot
+
+**Why it's 3.22%**:
+- This IS the optimization (consolidation point)
+- Still checking `g_hakmem_env_snapshot.initialized` on every call
+- TLS read overhead (1 TLS variable vs 3, but still 1 read per hot path)
+
+**Optimization Approach**: **LAZY INIT ELIMINATION**
+- **Strategy**: Force ENV snapshot initialization at library load time (constructor)
+  - Use `__attribute__((constructor))` to init before main()
+  - Eliminate `if (!initialized)` check in hot path
+  - Make `hakmem_env_get()` a pure TLS read (no branch)
+- **Expected Gain**: +0.5-1.5% (eliminate 3.22% check overhead)
+- **Risk**: Low (standard initialization pattern)
+- **Implementation**:
+  ```c
+  __attribute__((constructor))
+  static void hakmem_env_snapshot_init_early(void) {
+      hakmem_env_snapshot_init();  // Force init before any alloc/free
+  }
+
+  static inline const hakmem_env_snapshot* hakmem_env_get(void) {
+      return &g_hakmem_env_snapshot;  // No check, just return
+  }
+  ```
+
+**Recommendation**: **HIGH PRIORITY** - Clean win, low risk, eliminates E1's remaining overhead
+
+---
+
+#### Candidate E3-5: tiny_region_id_write_header (2.60% self%) - HEADER WRITE OPTIMIZATION
+
+**Current State**:
+- Phase 1 A3: always_inline attempt → -4.00% regression (NO-GO)
+- I-cache pressure issue identified
+
+**Optimization Approach**: **SELECTIVE INLINING**
+- **Strategy**: Inline only for hot classes (C7 ULTRA, C0-C3 LEGACY)
+- **Expected Gain**: +0.5-1.0%
+- **Risk**: Medium (I-cache effects)
+
+**Recommendation**: **LOW PRIORITY** - A3 already explored, I-cache risk remains
+
+---
+
+### E3 Candidate Ranking
+
+| Rank | Candidate | self% | Approach | Expected Gain | Risk | ROI |
+|------|-----------|-------|----------|---------------|------|-----|
+| **1** | **hakmem_env_snapshot_enabled** | **3.22%** | **Constructor init** | **+0.5-1.5%** | **Low** | **⭐⭐⭐** |
+| **2** | **free wrapper** | **22.19%** | **Function pointer cache** | **+1-2%** | **Medium** | **⭐⭐⭐** |
+| 3 | tiny_alloc_gate_fast | 18.99% | Inlining | +1-2% | High (I-cache) | ⭐⭐ |
+| 4 | free_tiny_fast_cold | 7.32% | Route specialization | +0.5-1.0% | Low | ⭐⭐ |
+| 5 | tiny_region_id_write_header | 2.60% | Selective inline | +0.5-1.0% | Medium | ⭐ |
+
+---
+
+## Part 4: Summary & Recommendations
+
+### E2 Final Decision
+
+**Decision**: ✅ **FREEZE DEFINITIVELY**
+
+**Rationale**:
+1. **Result is consistent**: -0.21% mean, -0.62% median across 10 runs
+2. **Root cause clear**: Alloc route optimization saturated by Phase 3 C3 static routing
+3. **Free vs Alloc asymmetry**: Free DUALHOT skips expensive ops, alloc skips cached ops
+4. **No alternative testing needed**: Workload/iteration changes won't fix structural issue
+5. **Lesson learned**: Per-class specialization only works when bypassing uncached overhead
+
+**Action**:
+- Keep `HAKMEM_TINY_ALLOC_DUALHOT=0` as default (research box frozen)
+- Document in CURRENT_TASK.md as NEUTRAL result
+- No further investigation warranted
+
+---
+
+### Perf Findings (E1 Enabled Baseline)
+
+**Throughput**: 45.26M ops/s (+3.92% from pre-E1 baseline)
+
+**Hot Spots** (self% >= 5%):
+1. free (22.19%) - Wrapper overhead
+2. tiny_alloc_gate_fast (18.99%) - Route overhead (saturated)
+3. main (15.21%) - Benchmark driver
+4. malloc (13.36%) - Wrapper overhead
+5. free_tiny_fast_cold (7.32%) - C4-C7 free path
+
+**E1 Impact**:
+- ENV overhead consolidated: 3.26% (3 functions) → 3.22% (1 function)
+- Gain from reduced TLS pressure: +3.92%
+- **Remaining opportunity**: Eliminate lazy init check (3.22% → 0%)
+
+**New Hot Spots**:
+- hakmem_env_snapshot_enabled: 3.22% (consolidation point)
+
+**Changes from Pre-E1**:
+- tiny_alloc_gate_fast: +3.6pp (15.37% → 18.99%)
+- free: +2.5pp (~19% → 22.19%)
+- unified_cache_push: -1.4pp (3.97% → 2.56%)
+
+---
+
+### E3 Recommendation
+
+**Primary Target**: **hakmem_env_snapshot_enabled (E3-4)**
+
+**Approach**: Constructor-based initialization
+- Force ENV snapshot init at library load time
+- Eliminate lazy init check in hot path
+- Make `hakmem_env_get()` a pure TLS read (no branch)
+
+**Expected Gain**: +0.5-1.5%
+
+**Implementation Complexity**: Low (2-day task)
+- Add `__attribute__((constructor))` function
+- Remove init check from hakmem_env_get()
+- A/B test with 10-run Mixed + 5-run C6-heavy
+
+**Rationale**:
+1. **Low risk**: Standard initialization pattern (used by jemalloc, tcmalloc)
+2. **Clear gain**: Eliminates 3.22% overhead (lazy init check)
+3. **Compounds E1**: Completes ENV snapshot optimization started in E1
+4. **Different vector**: Not route/TLS optimization - this is **initialization overhead reduction**
+
+**Success Criteria**:
+- Mean gain >= +0.5% (conservative)
+- No regression on any profile
+- Health check passes
+
+---
+
+**Secondary Target**: **free wrapper (E3-2)**
+
+**Approach**: Function pointer caching
+- Cache `g_free_impl` in TLS at thread init
+- Direct call instead of LD mode check + dispatch
+- Lower risk than IFUNC approach
+
+**Expected Gain**: +1-2%
+
+**Implementation Complexity**: Medium (3-4 day task)
+
+**Risk**: Medium (thread-safety, initialization order)
+
+---
+
+### Phase 4 Status
+
+**Active Optimizations**:
+- E1 (ENV Snapshot): +3.92% ✅ GO (research box, default OFF / opt-in)
+- E3-4 (ENV Constructor Init): +4.75% ✅ GO (research box, default OFF / opt-in, requires E1)
+
+**Frozen Optimizations**:
+- D3 (Alloc Gate Shape): +0.56% ⚪ NEUTRAL (research box, default OFF)
+- E2 (Alloc Per-Class FastPath): -0.21% ⚪ NEUTRAL (research box, default OFF)
+
+**Cumulative Gain** (Phase 2-4):
+- B3 (Routing shape): +2.89%
+- B4 (Wrapper split): +1.47%
+- C3 (Static routing): +2.20%
+- D1 (Free route cache): +2.19%
+- E1 (ENV snapshot): +3.92%
+- E3-4 (ENV ctor): +4.75% (opt-in, requires E1)
+- **Total (opt-in含む): ~17%**（プロファイル/ENV 組み合わせ依存）
+
+**Baseline（参考）**:
+- E1=1, CTOR=0: 45.26M ops/s（Mixed, 40M iters, ws=400）
+- E1=1, CTOR=1: 46.38M ops/s（Mixed, 20M iters, ws=400）
+
+**Remaining Potential**:
+- E3-2 (Wrapper function ptr): +1-2%
+- E3-3 (Free route special): +0.5-1.0%
+- **Realistic ceiling**: ~48-50M ops/s (without major redesign)
+
+---
+
+### Next Steps
+
+#### Immediate (Priority 1)
+
+1. **Freeze E2 in CURRENT_TASK.md**
+   - Document NEUTRAL decision (-0.21%)
+   - Add root cause explanation (route caching saturation)
+   - Mark as research box (default OFF, frozen)
+
+2. **E3-4 の昇格ゲート（再検証）**
+   - E3-4 は GO 済みだが、branch hint/refresh など “足元の調整” 後に 10-run 再確認
+   - A/B: Mixed 10-run（E1=1, CTOR=0 vs 1）
+   - 健康診断: `scripts/verify_health_profiles.sh`
+
+#### Short-term (Priority 2)
+
+3. **E1/E3-4 ON の状態で perf を取り直す**
+   - `hakmem_env_snapshot_enabled` が Top から落ちる／self% が有意に下がること
+   - 次の芯（alloc gate / free_tiny_fast_cold / wrapper）を “self% ≥ 5%” で選定
+
+#### Long-term (Priority 3)
+
+6. **Consider non-incremental approaches**
+   - Mimalloc-style TLS bucket redesign (major overhaul)
+   - Static-compiled routing (eliminate runtime policy)
+   - IFUNC for zero-overhead wrapper (high risk)
+
+---
+
+### Lessons Learned
+
+#### Route Optimization Saturation
+
+**Observation**: E2 (alloc per-class) showed -0.21% neutral despite free path success (+13%)
+
+**Insight**:
+- Route optimization has diminishing returns after static caching (C3)
+- Further specialization adds branch overhead without eliminating cost
+- **Lesson**: Don't pursue per-class specialization on already-cached paths
+
+#### Shape Optimization Plateau
+
+**Observation**: D3 (alloc gate shape) showed +0.56% neutral despite B3 success (+2.89%)
+
+**Insight**:
+- Branch prediction saturates after initial tuning
+- LIKELY/UNLIKELY hints have limited benefit on well-trained branches
+- **Lesson**: Shape optimization good for first pass, limited ROI after
+
+#### ENV Consolidation Success
+
+**Observation**: E1 (ENV snapshot) achieved +3.92% gain
+
+**Insight**:
+- Reducing TLS pressure (3 vars → 1 var) has measurable benefit
+- Consolidation point still has overhead (3.22% self%)
+- **Lesson**: Constructor init is next logical step (eliminate lazy check)
+
+#### Inlining I-Cache Risk
+
+**Observation**: A3 (header always_inline) showed -4% regression on Mixed
+
+**Insight**:
+- Aggressive inlining can thrash I-cache on mixed workloads
+- Selective inlining (per-class) may work but needs careful profiling
+- **Lesson**: Inlining is high-risk, constructor/caching approaches safer
+
+---
+
+### Realistic Expectations
+
+**Current State**: 45M ops/s (E1 enabled)
+**Target**: 48-50M ops/s (with E3-4, E3-2)
+**Ceiling**: ~55-60M ops/s (without major redesign)
+
+**Gap to mimalloc**: ~2.5x (128M vs 55M ops/s)
+
+**Why large gap remains**:
+- Architectural overhead: 4-5 layer design (wrapper → gate → policy → route → handler) vs mimalloc's 1-layer TLS buckets
+- Per-call policy: hakmem evaluates policy on every call, mimalloc uses static TLS layout
+- Instruction overhead: ~50-100 instructions per alloc/free vs mimalloc's ~10-15
+
+**Next phase options**:
+1. **Incremental** (E3-4, E3-2): +1-3% gains, safe, diminishing returns
+2. **Structural redesign**: +20-50% potential, high risk, months of work
+3. **Workload-specific tuning**: Optimize for specific profiles (C6-heavy, C7-only), not general Mixed
+
+**Recommendation**: Pursue E3-4 (low-hanging fruit), then re-evaluate if structural redesign warranted.
+
+---
+
+**Analysis Complete**: 2025-12-14
+**Next Action**: Implement E3-4 (ENV Constructor Init)
+**Expected Timeline**: 2-3 days (design → implement → A/B → decision)
--- a/docs/analysis/PHASE4_E1_ENV_SNAPSHOT_CONSOLIDATION_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE4_E1_ENV_SNAPSHOT_CONSOLIDATION_NEXT_INSTRUCTIONS.md
@ -102,6 +102,7 @@ E1=1 で perf を取り直し、次を確認:
 NEUTRAL/NO-GO の場合:
 - default OFF のまま freeze（本線は汚さない）

-## Next（Phase 4 E2）
+## Next（Phase 4 E3-4）

- 次の指示書: `docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_NEXT_INSTRUCTIONS.md`
+- 設計メモ: `docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_DESIGN.md`
+- 次の指示書: `docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_NEXT_INSTRUCTIONS.md`
--- a/docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_NEXT_INSTRUCTIONS.md
@ -1,5 +1,12 @@
 # Phase 4 E2: Alloc Per-Class Fast Path（次の指示書）

+## Status（2025-12-14）
+
+- 🔬 NEUTRAL（Mixed 10-run: **-0.21% mean / -0.62% median**）
+- Decision: freeze（research box, default OFF）
+- Results: `docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_AB_TEST_RESULTS.md`
+- Next: `docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_NEXT_INSTRUCTIONS.md`
+
 ## Step 0: 前提（E1 を ON にしてから評価）

 E2 は “ENV overhead を消した状態” で効果を見たいので、まず E1 を有効化して測る。
--- a/docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_DESIGN.md
+++ b/docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_DESIGN.md
@ -0,0 +1,218 @@
+# Phase 4 E3-4: ENV Constructor Init 設計メモ
+
+## 目的
+
+E1 で統合した ENV snapshot の lazy init check（3.22% self%）を排除。
+
+**期待**: **+0.5-1.5%** 改善
+
+---
+
+## 結果（A/B テスト）
+
+**判定**: ✅ **GO** (+4.75%)
+
+| Metric | Baseline (CTOR=0) | Optimized (CTOR=1) | Delta |
+|--------|-------------------|-------------------|-------|
+| Mean | 44.27M ops/s | 46.38M ops/s | **+4.75%** |
+| Median | 44.60M ops/s | 46.53M ops/s | **+4.35%** |
+
+**観察**:
+- 期待値 +0.5-1.5% を大幅に上回る +4.75% 達成
+- 全 10 run で Optimized が Baseline を上回る（一貫した改善）
+- Median でも +4.35% 確認（外れ値ではない）
+
+**分析**:
+- lazy init check（`if (g == -1)`）の削除効果が予想以上
+- 分岐予測ミス削減 + TLS アクセスパターン改善が複合的に効いた可能性
+- E1 (+3.92%) と E3-4 (+4.75%) の累積効果: **~+9%**
+
+---
+
+## 現状分析
+
+### E1 完了後の状態
+
+- `hakmem_env_snapshot_enabled()`: 3.22% self%（perf profile）
+- 原因: 毎回の lazy init check（`static int g = -1` + `getenv()`）
+
+```c
+// 現在の実装（core/box/hakmem_env_snapshot_box.h:51-62）
+static inline bool hakmem_env_snapshot_enabled(void) {
+    static int g = -1;
+    if (__builtin_expect(g == -1, 0)) {  // ← この分岐が 3.22%
+        const char* e = getenv("HAKMEM_ENV_SNAPSHOT");
+        if (e && *e) {
+            g = (*e == '1') ? 1 : 0;
+        } else {
+            g = 0;
+        }
+    }
+    return g != 0;
+}
+```
+
+### 問題
+
+1. **分岐コスト**: `if (g == -1)` が hot path で毎回評価
+2. **予測ミス**: first call で branch misprediction
+3. **関数呼び出しオーバーヘッド**: inline でも分岐は残る
+
+---
+
+## 設計
+
+### アプローチ: Constructor Init + Direct Read
+
+```c
+// 新しい実装
+static int g_hakmem_env_snapshot_gate = -1;
+
+__attribute__((constructor(101)))  // priority 101: after libc init
+static void hakmem_env_snapshot_gate_init(void) {
+    const char* e = getenv("HAKMEM_ENV_SNAPSHOT");
+    g_hakmem_env_snapshot_gate = (e && *e == '1') ? 1 : 0;
+}
+
+static inline bool hakmem_env_snapshot_enabled(void) {
+    return g_hakmem_env_snapshot_gate != 0;  // No branch (just load + compare)
+}
+```
+
+### 利点
+
+1. **分岐削減**: `if (g == -1)` 完全排除
+2. **一度だけ**: `getenv()` は main() 前に 1 回のみ
+3. **キャッシュ効率**: global read は TLS より高速（L1 hit 率高い）
+
+### リスク
+
+| リスク | 対策 |
+|--------|------|
+| putenv() 後の変更が反映されない | bench_profile の `hakmem_env_snapshot_refresh_from_env()` で gate/snapshot を同期 |
+| constructor order | priority 101 で libc init 後を保証 |
+| fork() 安全性 | hakmem は fork-safe 設計済み |
+
+---
+
+## Box Theory（実装計画）
+
+### L0: Env（戻せる）
+
+```
+HAKMEM_ENV_SNAPSHOT_CTOR=0/1  # default: 0（OFF）
+```
+
+- **ON (=1)**: Constructor init を使用（lazy check なし）
+- **OFF (=0)**: 従来の lazy init を使用（rollback 可能）
+
+### L1: ENV Constructor Box（境界: 1 箇所）
+
+#### 変更対象
+
+- `core/box/hakmem_env_snapshot_box.h` (変更)
+  - `hakmem_env_snapshot_enabled()` を 2 つのモードで実装
+- `core/box/hakmem_env_snapshot_box.c` (変更)
+  - Constructor 関数を追加
+
+---
+
+## 実装指示
+
+### Patch 1: Constructor Init Gate
+
+**ファイル**: `core/box/hakmem_env_snapshot_box.c`
+
+```c
+// Global gate (not static local - avoids lazy init)
+int g_hakmem_env_snapshot_gate = -1;
+int g_hakmem_env_snapshot_ctor_mode = -1;
+
+// Constructor: run before main()
+__attribute__((constructor(101)))
+static void hakmem_env_snapshot_gate_ctor(void) {
+    // Read HAKMEM_ENV_SNAPSHOT_CTOR (default OFF)
+    const char* ctor_env = getenv("HAKMEM_ENV_SNAPSHOT_CTOR");
+    g_hakmem_env_snapshot_ctor_mode = (ctor_env && *ctor_env == '1') ? 1 : 0;
+
+    if (g_hakmem_env_snapshot_ctor_mode) {
+        // Constructor mode: init gate now
+        const char* e = getenv("HAKMEM_ENV_SNAPSHOT");
+        g_hakmem_env_snapshot_gate = (e && *e == '1') ? 1 : 0;
+    }
+}
+```
+
+### Patch 2: Dual-Mode Enabled Check
+
+**ファイル**: `core/box/hakmem_env_snapshot_box.h`
+
+```c
+// Global gate state (defined in .c)
+extern int g_hakmem_env_snapshot_gate;
+extern int g_hakmem_env_snapshot_ctor_mode;
+
+static inline bool hakmem_env_snapshot_enabled(void) {
+    // Fast path: constructor mode (no branch except final compare)
+    // Default is OFF, so ctor_mode==1 is UNLIKELY.
+    if (__builtin_expect(g_hakmem_env_snapshot_ctor_mode == 1, 0)) {
+        return g_hakmem_env_snapshot_gate != 0;
+    }
+
+    // Slow path: legacy lazy init (fallback)
+    if (__builtin_expect(g_hakmem_env_snapshot_gate == -1, 0)) {
+        const char* e = getenv("HAKMEM_ENV_SNAPSHOT");
+        g_hakmem_env_snapshot_gate = (e && *e == '1') ? 1 : 0;
+    }
+    return g_hakmem_env_snapshot_gate != 0;
+}
+```
+
+---
+
+## A/B テスト計画
+
+### Test Matrix
+
+| Profile | Iterations | Runs | Command |
+|---------|-----------|------|---------|
+| Mixed | 20M | 10 | `./bench_random_mixed_hakmem 20000000 400 1` |
+
+### Baseline
+
+```bash
+HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ENV_SNAPSHOT=1 HAKMEM_ENV_SNAPSHOT_CTOR=0 \
+  ./bench_random_mixed_hakmem 20000000 400 1
+```
+
+### Optimized
+
+```bash
+HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ENV_SNAPSHOT=1 HAKMEM_ENV_SNAPSHOT_CTOR=1 \
+  ./bench_random_mixed_hakmem 20000000 400 1
+```
+
+### 判定基準
+
+- **GO**: +0.5% 以上
+- **NEUTRAL**: ±0.5%（研究箱維持）
+- **NO-GO**: -0.5% 以下
+
+---
+
+## 期待値の根拠
+
+**なぜ +0.5-1.5% か？**
+
+1. **現在のオーバーヘッド**: 3.22% self%
+2. **削減分**: lazy init check の分岐コスト（~10-15 cycles per call）
+3. **削減率**: ~15-30% of 3.22% → 0.5-1.0%
+4. **追加効果**: better branch prediction（warm path に分岐なし）
+
+---
+
+## 非目標
+
+- snapshot refresh API の変更（putenv sync は既存 API で対応）
+- E1 の構造変更（consolidation は維持）
+- 他の ENV gate の constructor 化（E3-4 は hakmem_env_snapshot_enabled のみ）
--- a/docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_NEXT_INSTRUCTIONS.md
@ -0,0 +1,106 @@
+# Phase 4 E3-4: ENV Constructor Init（次の指示書）
+
+## Status（2025-12-14）
+
+- ✅ 実装済み（research box / default OFF）
+- A/B（Mixed, 10-run, iter=20M, ws=400, E1=1）で **+4.75% mean / +4.35% median** を観測
+- ENV:
+  - E1: `HAKMEM_ENV_SNAPSHOT=0/1`（default 0）
+  - E3-4: `HAKMEM_ENV_SNAPSHOT_CTOR=0/1`（default 0、E1=1 前提）
+
+## ゴール
+
+1) “E3-4 の勝ち” を再確認して固定化する  
+2) 本線（プリセット）へ昇格するか判断する（戻せる形で）
+
+---
+
+## Step 0: 前提（E1 を ON にして測る）
+
+E3-4 は `hakmem_env_snapshot_enabled()` の gate 判定を短絡する最適化なので、E1 が ON であることが前提。
+
+---
+
+## Step 1: Build & 健康診断（先に通す）
+
+```sh
+make bench_random_mixed_hakmem -j
+scripts/verify_health_profiles.sh
+```
+
+---
+
+## Step 2: A/B（Mixed 10-run）
+
+Mixed 10-run（iter=20M, ws=400）:
+
+```sh
+# Baseline: ctor=0
+HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
+HAKMEM_ENV_SNAPSHOT=1 \
+HAKMEM_ENV_SNAPSHOT_CTOR=0 \
+./bench_random_mixed_hakmem 20000000 400 1
+
+# Optimized: ctor=1
+HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
+HAKMEM_ENV_SNAPSHOT=1 \
+HAKMEM_ENV_SNAPSHOT_CTOR=1 \
+./bench_random_mixed_hakmem 20000000 400 1
+```
+
+判定（10-run mean）:
+- GO: **+1.0% 以上**
+- ±1%: NEUTRAL（research box 維持）
+- -1% 以下: NO-GO（freeze）
+
+注意:
+- “constructor の pre-main init” を効かせたい場合は、起動前に ENV を設定する（bench_profile putenv だけでは遅い）。
+
+---
+
+## Step 3: perf で “消えたか” を確認（E3-4=1）
+
+```sh
+HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
+HAKMEM_ENV_SNAPSHOT=1 \
+HAKMEM_ENV_SNAPSHOT_CTOR=1 \
+perf record -F 99 -- ./bench_random_mixed_hakmem 20000000 400 1
+
+perf report --stdio --no-children
+```
+
+確認ポイント:
+- `hakmem_env_snapshot_enabled` の self% が有意に下がる／Top から落ちる
+- 代わりに “snapshot 参照” が 1 箇所に集約されている
+
+---
+
+## Step 4: 昇格（GO の場合のみ）
+
+### Option A（推奨・安全）: E1 だけプリセット昇格、E3-4 は opt-in 維持
+
+- `core/bench_profile.h`（`MIXED_TINYV3_C7_SAFE`）:
+  - `bench_setenv_default("HAKMEM_ENV_SNAPSHOT","1");`
+  - `HAKMEM_ENV_SNAPSHOT_CTOR` は入れない（研究箱のまま）
+- `docs/analysis/ENV_PROFILE_PRESETS.md` に E1/E3-4 の推奨セットを追記
+- `CURRENT_TASK.md` を更新
+
+### Option B（攻める）: E1+E3-4 をプリセット昇格
+
+- 20-run validation（mean/median 両方）を通してから
+- 注意: `HAKMEM_ENV_SNAPSHOT_CTOR=1` をプリセット default にする場合、分岐 hint/期待値も合わせて見直す（baseline を汚さない）
+
+---
+
+## Step 5: Rollback（いつでも戻せる）
+
+```sh
+HAKMEM_ENV_SNAPSHOT=0
+HAKMEM_ENV_SNAPSHOT_CTOR=0
+```
+
+---
+
+## Next（Phase 4 Close）
+
+- E1/E3-4 の “どこまで本線に入れるか” を決めたら、Phase 4 は CLOSE（勝ち箱はプリセットへ、研究箱は freeze）にする。
--- a/docs/analysis/PHASE4_EXECUTIVE_SUMMARY.md
+++ b/docs/analysis/PHASE4_EXECUTIVE_SUMMARY.md
@ -0,0 +1,68 @@
+# Phase 4 Status - Executive Summary
+
+**Date**: 2025-12-14
+**Status**: E1 GO（opt-in）, E2 FROZEN, E3-4 GO（opt-in）
+**Baseline**: Mixed 20M/ws=400（E1/E3-4 の ON/OFF に依存。結果は各 A/B セクション参照）
+
+---
+
+## Quick Status
+
+### E2 Decision: FREEZE ✅ (NEUTRAL)
+
+**Result**: -0.21% mean, -0.62% median (NEUTRAL)
+
+**Why Freeze?**
+- Alloc route optimization saturated by Phase 3 C3 (static routing)
+- Free DUALHOT worked (+13%) because it skipped expensive ops
+- Alloc DUALHOT doesn't work (-0.21%) because route already cached
+- **Lesson**: Per-class specialization only helps when bypassing uncached overhead
+
+**Action**: Keep as research box (default OFF), no further investigation
+
+---
+
+## E1/E3-4 Results (Mixed A/B)
+
+### E1: ENV Snapshot Consolidation ✅ GO (opt-in)
+
+**Result**: +3.92% avg, +4.01% median  
+**ENV**: `HAKMEM_ENV_SNAPSHOT=1`（default OFF）
+
+### E3-4: ENV Constructor Init ✅ GO (opt-in)
+
+**Result**: +4.75% mean, +4.35% median（E1=1 前提）  
+**ENV**: `HAKMEM_ENV_SNAPSHOT=1 HAKMEM_ENV_SNAPSHOT_CTOR=1`（default OFF）
+
+**Note**: “constructor での pre-main init” を効かせたい場合はプロセス起動前に ENV を設定（bench_profile putenv だけでは遅い）
+
+---
+
+## Phase 4 Cumulative Status
+
+**Active**:
+- E1 (ENV Snapshot): +3.92% ✅ GO（opt-in）
+- E3-4 (ENV CTOR): +4.75% ✅ GO（opt-in, requires E1）
+
+**Frozen**:
+- D3 (Alloc Gate Shape): +0.56% ⚪
+- E2 (Alloc Per-Class FastPath): -0.21% ⚪
+
+## Next Actions
+
+1. E3-4 の “hint/refresh” 調整後に 10-run 再確認（昇格前の最終ゲート）
+2. GO 維持なら `ENV_PROFILE_PRESETS.md` と `CURRENT_TASK.md` に “E1+E3-4 の推奨セット” を明記
+3. E1/E3-4 ON の状態で perf を取り直して次の芯を選ぶ（alloc gate / free_tiny_fast_cold など）
+
+---
+
+## Key Lessons
+
+1. **Route optimization saturated**: C3 already cached routes, E2 no benefit
+2. **Shape optimization plateaued**: D3 +0.56% neutral, branch prediction saturated
+3. **ENV consolidation successful**: E1 +3.92%, constructor init is next step
+4. **Different optimization vectors needed**: Move beyond route/shape to init/dispatch overhead
+
+---
+
+**Full Analysis**: `/mnt/workdisk/public_share/hakmem/docs/analysis/PHASE4_COMPREHENSIVE_STATUS_ANALYSIS.md`
--- a/scripts/verify_health_profiles.sh
+++ b/scripts/verify_health_profiles.sh
@ -0,0 +1,19 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+root_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+cd "$root_dir"
+
+if [[ ! -x ./bench_random_mixed_hakmem || ! -x ./bench_mid_large_mt_hakmem ]]; then
+  make bench_random_mixed_hakmem bench_mid_large_mt_hakmem -j
+fi
+
+echo "== Health Profile 1/2: MIXED_TINYV3_C7_SAFE =="
+HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE ./bench_random_mixed_hakmem 1000000 400 1
+
+echo
+echo "== Health Profile 2/2: C6_HEAVY_LEGACY_POOLV1 =="
+HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 ./bench_mid_large_mt_hakmem 1 1000000 400 1
+
+echo
+echo "OK: health profiles passed"