Phase 5 E5-3: Candidate Analysis (All DEFERRED) + E5-4 Instructions

E5-3 Analysis Results: - free_tiny_fast_cold (7.14%): DEFER - cold path, low ROI - unified_cache_push (3.39%): DEFER - already optimized - hakmem_env_snapshot_enabled (2.97%): DEFER - low headroom Key Insight: perf self% is time-weighted, not frequency-weighted. Cold paths appear hot but have low total impact. Next: E5-4 (Malloc Tiny Direct Path) - Apply E5-1 winning pattern to malloc side - Target: tiny_alloc_gate_fast() gate tax elimination - ENV gate: HAKMEM_MALLOC_TINY_DIRECT=0/1 Files added: - docs/analysis/PHASE5_E5_3_ANALYSIS_AND_RECOMMENDATIONS.md - docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md - core/box/free_cold_shape_env_box.{h,c} (research box, not tested) - core/box/free_cold_shape_stats_box.{h,c} (research box, not tested) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-14 06:44:04 +09:00
parent f7b18aaf13
commit 580e7f4fa3
11 changed files with 594 additions and 7 deletions
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@ -1,5 +1,68 @@
 # 本線タスク（現在）

+## 更新メモ（2025-12-14 Phase 5 E5-3 Analysis - Strategic Pivot）
+
+### Phase 5 E5-3: Candidate Analysis & Strategic Recommendations ⚠️ DEFER (2025-12-14)
+
+**Decision**: **DEFER all E5-3 candidates** (E5-3a/b/c). Pivot to E5-4 (Malloc Direct Path, E5-1 pattern replication).
+
+**Analysis**:
+- **E5-3a (free_tiny_fast_cold 7.14%)**: NO-GO (cold path, low frequency despite high self%)
+- **E5-3b (unified_cache_push 3.39%)**: MAYBE (already optimized, marginal ROI ~+1.0%)
+- **E5-3c (hakmem_env_snapshot_enabled 2.97%)**: NO-GO (E3-4 precedent shows -1.44% regression)
+
+**Key Insight**: **Profiler self% ≠ optimization opportunity**
+- Self% is time-weighted (samples during execution), not frequency-weighted
+- Cold paths appear hot due to expensive operations when hit, not total cost
+- E5-2 lesson: 3.35% self% → +0.45% NEUTRAL (branch overhead ≈ savings)
+
+**ROI Assessment**:
+| Candidate | Self% | Frequency | Expected Gain | Risk | Decision |
+|-----------|-------|-----------|---------------|------|----------|
+| E5-3a (cold path) | 7.14% | LOW | +0.5% | HIGH | NO-GO |
+| E5-3b (push) | 3.39% | HIGH | +1.0% | MEDIUM | DEFER |
+| E5-3c (env snapshot) | 2.97% | HIGH | -1.0% | HIGH | NO-GO |
+
+**Strategic Pivot**: Focus on **E5-1 Success Pattern** (wrapper-level deduplication)
+- E5-1 (Free Tiny Direct): +3.35% (GO) ✅
+- **Next**: E5-4 (Malloc Tiny Direct) - Apply E5-1 pattern to alloc side
+- **Expected**: +2-4% (similar to E5-1, based on malloc wrapper overhead)
+
+**Cumulative Status (Phase 5)**:
+- E4-1 (Free Wrapper Snapshot): +3.51% standalone
+- E4-2 (Malloc Wrapper Snapshot): +21.83% standalone
+- E4 Combined: +6.43% (from baseline with both OFF)
+- E5-1 (Free Tiny Direct): +3.35% (from E4 baseline)
+- E5-2 (Header Write-Once): +0.45% NEUTRAL (frozen)
+- **E5-3**: **DEFER** (analysis complete, no implementation/test)
+- **Total Phase 5**: ~+9-10% cumulative (E4+E5-1 promoted, E5-2 frozen, E5-3 deferred)
+
+**Implementation** (E5-3a research box, NOT TESTED):
+- Files created:
+  - `core/box/free_cold_shape_env_box.{h,c}` (ENV gate, default OFF)
+  - `core/box/free_cold_shape_stats_box.{h,c}` (stats counters)
+  - `docs/analysis/PHASE5_E5_3_ANALYSIS_AND_RECOMMENDATIONS.md` (analysis)
+- Files modified:
+  - `core/front/malloc_tiny_fast.h` (lines 418-437, cold path shape optimization)
+- Pattern: Early exit for LEGACY path (skip LARSON check when !use_tiny_heap)
+- **Status**: FROZEN (default OFF, pre-analysis shows NO-GO, not worth A/B testing)
+
+**Key Lessons**:
+1. **Profiler self% misleads** when frequency is low (cold path)
+2. **Micro-optimizations plateau** in already-optimized code (E5-2, E5-3b)
+3. **Branch hints are profile-dependent** (E3-4 failure, E5-3c risk)
+4. **Wrapper-level deduplication wins** (E4-1, E4-2, E5-1 pattern)
+
+**Next Steps**:
+- **E5-4 Design**: Malloc Tiny Direct Path (E5-1 pattern for alloc)
+  - Target: malloc() wrapper overhead (~12.95% self% in E4 profile)
+  - Method: Single size check → direct call to malloc_tiny_fast_for_class()
+  - Expected: +2-4% (based on E5-1 precedent +3.35%)
+- Design doc: `docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_DESIGN.md`
+- Next instructions: `docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md`
+
+---
+
 ## 更新メモ（2025-12-14 Phase 5 E5-2 Complete - Header Write-Once）

 ### Phase 5 E5-2: Header Write-Once Optimization ⚪ NEUTRAL (2025-12-14)
@ -120,12 +183,15 @@

 **Next Steps**:
 - ✅ Promote: `HAKMEM_FREE_TINY_DIRECT=1` to `MIXED_TINYV3_C7_SAFE` preset
- Next: E5-2 (Header Prefill at Refill, 2.59% target) or E5-3 (ENV Snapshot Shape, 2.57% target)
+- ✅ E5-2: NEUTRAL → FREEZE
+- ✅ E5-3: DEFER（ROI 低）
+- Next: **E5-4 (Malloc Tiny Direct)**（E5-1 パターンの alloc 側複製）
 - Design docs:
  - `docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md`
  - `docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md`
  - `docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_NEXT_INSTRUCTIONS.md`
  - `docs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.md`
+  - `docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md`

 ---

--- a/core/box/free_cold_shape_env_box.c
+++ b/core/box/free_cold_shape_env_box.c
@ -0,0 +1,5 @@
+// free_cold_shape_env_box.c - Phase 5 E5-3a: Free Cold Path Shape Optimization
+#include "free_cold_shape_env_box.h"
+
+// Global gate state (-1: uninitialized, 0: OFF, 1: ON)
+int g_free_cold_shape = -1;
--- a/core/box/free_cold_shape_env_box.h
+++ b/core/box/free_cold_shape_env_box.h
@ -0,0 +1,57 @@
+// free_cold_shape_env_box.h - Phase 5 E5-3a: Free Cold Path Shape Optimization
+//
+// Purpose: Optimize free_tiny_fast_cold() branch structure for better prediction
+// Target: free_tiny_fast_cold (7.14% self% in Mixed workload)
+//
+// Hypothesis:
+//   - Cold path has heavy branching overhead (route determination, LARSON check, ENV gates)
+//   - MIXED workload: LARSON=0 and use_tiny_heap=0 are COMMON (not rare)
+//   - Current branch hints assume LARSON/TinyHeap are rare, but profile shows otherwise
+//   - Reordering branches + fixing hints can reduce mispredictions
+//
+// Strategy:
+//   - Shape 1 (Optimized): Reorder branches to handle common LEGACY path first
+//     - Check use_tiny_heap==0 FIRST (LIKELY in Mixed, ~90%+ of cold path)
+//     - Short-circuit to LEGACY fallback when heap routing not needed
+//     - Defer LARSON/cross-thread checks to only when needed (heap routes)
+//   - Keep LARSON safety when needed (heap routes still do cross-thread check)
+//
+// Design:
+//   - ENV: HAKMEM_FREE_COLD_SHAPE=0/1 (default: 0, research box)
+//   - Shape 0 (baseline): Current structure (LARSON+heap check, then legacy)
+//   - Shape 1 (optimized): use_tiny_heap==0 early exit, LARSON only for heap
+//
+// Expected Benefit:
+//   - Reduce branch mispredictions in cold path (~7.14% self%)
+//   - Target gain: +1-3% (if branch prediction is bottleneck)
+//   - Conservative estimate: +0.5-1.5% (cold path is 7.14%, not dominant)
+//
+// Box Theory Compliance:
+//   - L0: ENV gate (default 0)
+//   - L1: Single boundary (free_tiny_fast_cold function)
+//   - Rollback: ENV=0 reverts to baseline
+//   - A/B testable: Same binary, ENV toggle
+
+#ifndef HAK_FREE_COLD_SHAPE_ENV_BOX_H
+#define HAK_FREE_COLD_SHAPE_ENV_BOX_H
+
+#include <stdlib.h>
+
+// Global gate state (defined in free_cold_shape_env_box.c)
+extern int g_free_cold_shape;
+
+// ENV gate: Check if optimized cold path shape is enabled
+// Default: 0 (baseline), set HAKMEM_FREE_COLD_SHAPE=1 for optimized shape
+static inline int free_cold_shape_enabled(void) {
+    if (__builtin_expect(g_free_cold_shape == -1, 0)) {
+        const char* e = getenv("HAKMEM_FREE_COLD_SHAPE");
+        if (e && *e) {
+            g_free_cold_shape = (*e == '1') ? 1 : 0;
+        } else {
+            g_free_cold_shape = 0;  // default: OFF (research box)
+        }
+    }
+    return g_free_cold_shape;
+}
+
+#endif  // HAK_FREE_COLD_SHAPE_ENV_BOX_H
--- a/core/box/free_cold_shape_stats_box.c
+++ b/core/box/free_cold_shape_stats_box.c
@ -0,0 +1,29 @@
+// free_cold_shape_stats_box.c - Phase 5 E5-3a: Free Cold Shape Stats
+#include "free_cold_shape_stats_box.h"
+
+// Stats counters (global atomics)
+_Atomic uint64_t g_free_cold_shape_legacy_fast = 0;
+_Atomic uint64_t g_free_cold_shape_heap_path = 0;
+_Atomic uint64_t g_free_cold_shape_enabled_count = 0;
+
+void free_cold_shape_print_stats(void) {
+#if !HAKMEM_BUILD_RELEASE
+    uint64_t legacy = atomic_load(&g_free_cold_shape_legacy_fast);
+    uint64_t heap = atomic_load(&g_free_cold_shape_heap_path);
+    uint64_t enabled = atomic_load(&g_free_cold_shape_enabled_count);
+    uint64_t total = legacy + heap;
+
+    if (total == 0) return;  // No activity
+
+    fprintf(stderr, "\n[FREE-COLD-SHAPE] Stats:\n");
+    fprintf(stderr, "  Shape enabled: %llu\n", (unsigned long long)enabled);
+    fprintf(stderr, "  LEGACY fast path: %llu (%.1f%%)\n",
+            (unsigned long long)legacy,
+            100.0 * legacy / total);
+    fprintf(stderr, "  Heap route path: %llu (%.1f%%)\n",
+            (unsigned long long)heap,
+            100.0 * heap / total);
+    fprintf(stderr, "  Total cold hits: %llu\n", (unsigned long long)total);
+    fflush(stderr);
+#endif
+}
--- a/core/box/free_cold_shape_stats_box.h
+++ b/core/box/free_cold_shape_stats_box.h
@ -0,0 +1,34 @@
+// free_cold_shape_stats_box.h - Phase 5 E5-3a: Free Cold Shape Stats
+//
+// Purpose: Track cold path branch distributions
+// Metrics: legacy_fast_path, heap_path, shape_enabled
+
+#ifndef HAK_FREE_COLD_SHAPE_STATS_BOX_H
+#define HAK_FREE_COLD_SHAPE_STATS_BOX_H
+
+#include <stdint.h>
+#include <stdatomic.h>
+#include <stdio.h>
+
+// Forward declarations for HAKMEM_DEBUG_COUNTERS
+#ifndef HAKMEM_DEBUG_COUNTERS
+#define HAKMEM_DEBUG_COUNTERS 0
+#endif
+
+// Stats counters (global atomics, always compiled)
+extern _Atomic uint64_t g_free_cold_shape_legacy_fast;  // Optimized: LEGACY path (no heap)
+extern _Atomic uint64_t g_free_cold_shape_heap_path;    // Heap route path
+extern _Atomic uint64_t g_free_cold_shape_enabled_count;  // Shape=1 hits
+
+// Increment macros (compile-out in release builds)
+#if HAKMEM_DEBUG_COUNTERS
+    #define FREE_COLD_SHAPE_STAT_INC(name) \
+        atomic_fetch_add_explicit(&g_free_cold_shape_##name, 1, memory_order_relaxed)
+#else
+    #define FREE_COLD_SHAPE_STAT_INC(name) ((void)0)
+#endif
+
+// Print stats (implemented in free_cold_shape_stats_box.c)
+void free_cold_shape_print_stats(void);
+
+#endif  // HAK_FREE_COLD_SHAPE_STATS_BOX_H
--- a/core/front/malloc_tiny_fast.h
+++ b/core/front/malloc_tiny_fast.h
@ -70,6 +70,8 @@
 #include "../box/tiny_metadata_cache_hot_box.h" // Phase 3 C2: Policy hot cache (metadata cache optimization)
 #include "../box/tiny_free_route_cache_env_box.h" // Phase 3 D1: Free path route cache
 #include "../box/hakmem_env_snapshot_box.h" // Phase 4 E1: ENV snapshot consolidation
+#include "../box/free_cold_shape_env_box.h" // Phase 5 E5-3a: Free cold path shape optimization
+#include "../box/free_cold_shape_stats_box.h" // Phase 5 E5-3a: Free cold shape stats

 // Helper: current thread id (low 32 bits) for owner check
 #ifndef TINY_SELF_U32_LOCAL_DEFINED
@ -413,6 +415,28 @@ static int free_tiny_fast_cold(void* ptr, void* base, int class_idx)
    }
 #endif  // !HAKMEM_BUILD_RELEASE

+    // Phase 5 E5-3a: Optimized cold path shape
+    // Strategy: Handle common LEGACY path first (use_tiny_heap==0 in Mixed ~90%+)
+    // Defer expensive LARSON/cross-thread checks to only when heap routing needed
+    static __thread int g_cold_shape = -1;
+    if (__builtin_expect(g_cold_shape == -1, 0)) {
+        g_cold_shape = free_cold_shape_enabled() ? 1 : 0;
+    }
+
+    if (g_cold_shape == 1) {
+        // Optimized shape: Check use_tiny_heap FIRST
+        if (__builtin_expect(!use_tiny_heap, 1)) {
+            // Most common case in Mixed: LEGACY path, no heap routing
+            // Skip LARSON/cross-thread check entirely (not needed for legacy)
+            FREE_COLD_SHAPE_STAT_INC(legacy_fast);
+            FREE_COLD_SHAPE_STAT_INC(enabled_count);
+            goto legacy_fallback;
+        }
+        // Rare: heap routing needed, do full validation
+        FREE_COLD_SHAPE_STAT_INC(heap_path);
+    }
+
+    // Baseline shape: LARSON check first (current behavior)
    // Cross-thread free detection (Larson MT crash fix, ENV gated) + TinyHeap free path
    {
        static __thread int g_larson_fix = -1;
@ -467,7 +491,7 @@ static int free_tiny_fast_cold(void* ptr, void* base, int class_idx)
                        }
                        return 0;  // remote push failed; fall back to normal path
                    }
-                    // Same-thread + TinyHeap route → route-based free
+                                    // Same-thread + TinyHeap route → route-based free
                    if (__builtin_expect(use_tiny_heap, 0)) {
                        FREE_TINY_FAST_HOTCOLD_STAT_INC(cold_tinyheap);
                        switch (route) {
@ -541,6 +565,7 @@ static int free_tiny_fast_cold(void* ptr, void* base, int class_idx)
 #endif

    // Phase REFACTOR-2: Legacy fallback (use unified helper)
+legacy_fallback:
    FREE_TINY_FAST_HOTCOLD_STAT_INC(cold_legacy_fallback);
    tiny_legacy_fallback_free_base(base, class_idx);
    return 1;
--- a/docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_NEXT_INSTRUCTIONS.md
@ -72,7 +72,7 @@ perf report --stdio --no-children
 ```

 判断基準（self% ≥ 5%）:
- `tiny_region_id_write_header` が依然 5% 以上 → **E5-2** 優先
+- `tiny_region_id_write_header` が依然 5% 以上 → **E5-2** は NEUTRAL で freeze 済み（次は E5-4 を優先）
 - `hakmem_env_snapshot_enabled` / `tiny_get_max_size` が 5% 付近まで上がる → **E5-3** 優先

 ---
@ -83,4 +83,3 @@ perf report --stdio --no-children
  - 目標: `tiny_region_id_write_header` の hot path stores を減らす（A3 の “always_inline” は NO-GO 済み）
 - E5-3: `hakmem_env_snapshot_enabled()` の分岐形/配置を “enabled 前提” に寄せる
  - 目標: mispredict を避け、`malloc_tiny_fast.h` 内の繰り返し gate を軽くする
-
--- a/docs/analysis/PHASE5_E5_3_ANALYSIS_AND_RECOMMENDATIONS.md
+++ b/docs/analysis/PHASE5_E5_3_ANALYSIS_AND_RECOMMENDATIONS.md
@ -0,0 +1,231 @@
+# Phase 5 E5-3: Candidate Analysis and Strategic Recommendations
+
+## Executive Summary
+
+**Recommendation**: **DEFER E5-3 optimization**. Continue with established winning patterns (E5-1 style wrapper-level optimizations) rather than pursuing diminishing-returns micro-optimizations in profiler hot spots.
+
+**Rationale**:
+- E5-2 (Header Write-Once, 3.35% self%) achieved only +0.45% NEUTRAL
+- E5-3 candidates (7.14%, 3.39%, 2.97% self%) have similar or worse ROI profiles
+- Profiler self% != optimization opportunity (time-weighted samples can mislead)
+- Cumulative gains from E4+E5-1 (~+9-10%) represent significant progress
+- Next phase should target higher-level structural opportunities
+
+---
+
+## E5-3 Candidate Analysis
+
+### Context: Post-E5-2 Baseline
+- **E5-1 (Free Tiny Direct)**: +3.35% GO (adopted)
+- **E5-2 (Header Write-Once)**: +0.45% NEUTRAL (frozen as research box)
+- **New baseline**: 44.42M ops/s (Mixed, 20M iters, ws=400)
+
+### Available Candidates (from perf profile)
+
+| Candidate | Self% | Call Frequency | ROI Assessment |
+|-----------|-------|----------------|----------------|
+| free_tiny_fast_cold | 7.14% | LOW (cold path) | **NO-GO** |
+| unified_cache_push | 3.39% | HIGH (every free) | **MAYBE** |
+| hakmem_env_snapshot_enabled | 2.97% | HIGH (wrapper+gate) | **NO-GO** |
+
+---
+
+## Detailed Analysis
+
+### E5-3a: free_tiny_fast_cold (7.14% self%) ❌ **NO-GO**
+
+**Hypothesis**: Cold path branch structure optimization (route determination, LARSON check)
+
+**Why NO-GO**:
+1. **Self% Misleading**: 7.14% is time-weighted, not frequency
+   - Cold path is called RARELY (only when hot path misses)
+   - High self% = expensive when hit, not = high total cost
+   - Optimizing cold path has minimal impact on overall throughput
+
+2. **Branch Prediction Already Optimized**:
+   - Current implementation uses `__builtin_expect` hints
+   - LARSON/heap checks are already marked UNLIKELY
+   - Further branch reordering has marginal benefit (~0.1-0.5% at best)
+
+3. **Similar to E5-2 Failure**:
+   - E5-2 targeted 3.35% self%, gained only +0.45%
+   - E5-3a targets 7.14% self% BUT lower frequency
+   - Expected gain: +0.3-1.0% (< +1.0% GO threshold)
+
+4. **Structural Issues**:
+   - Goto-based early exit adds control flow complexity
+   - Potential I-cache pollution (similar to Phase 1 A3 failure)
+   - Safety risks (LARSON check bypass in optimized path)
+
+**Conservative Estimate**: +0.5% ± 0.5% (NEUTRAL range)
+
+**Decision**: **NO-GO / DEFER**
+
+---
+
+### E5-3b: unified_cache_push (3.39% self%) ⚠️ **MAYBE**
+
+**Hypothesis**: Push operation overhead (TLS access, modulo arithmetic, bounds check)
+
+**Why MAYBE**:
+1. **Frequency**: Called on EVERY free (high frequency)
+2. **Current Implementation**: Already highly optimized
+   - Ring buffer with power-of-2 masking (no division)
+   - Single TLS access (g_unified_cache[class_idx])
+   - Minimal branch count (1-2 branches)
+
+3. **Potential Optimizations**:
+   - **Inline Expansion**: Force always_inline (may hurt I-cache)
+   - **TLS Caching**: Cache g_unified_cache base pointer (adds TLS variable)
+   - **Bounds Check Removal**: Assume capacity never changes (unsafe)
+
+4. **Risk Assessment**:
+   - **High risk**: unified_cache_push is already in critical path
+   - **Low ROI**: 3.39% self% with limited optimization headroom
+   - **Similar to E5-2**: Micro-optimization with marginal benefit
+
+**Conservative Estimate**: +0.5-1.5% (borderline NEUTRAL/GO)
+
+**Decision**: **DEFER** (pursue only if E5-1 pattern exhausted)
+
+---
+
+### E5-3c: hakmem_env_snapshot_enabled (2.97% self%) ❌ **NO-GO**
+
+**Hypothesis**: Branch hint optimization (enabled=1 is常用 in MIXED)
+
+**Why NO-GO**:
+1. **E3-4 Precedent**: Phase 4 E3-4 (ENV Constructor Init) **FAILED**
+   - Attempted to eliminate lazy check overhead (3.22% self%)
+   - Result: -1.44% regression (constructor mode added overhead)
+   - Root cause: Branch predictor tuning is profile-dependent
+
+2. **Branch Hint Contradiction**:
+   - Default builds: enabled=0 → hint UNLIKELY is correct
+   - MIXED preset: enabled=1 → hint UNLIKELY is WRONG
+   - Changing hint helps MIXED but hurts default builds
+
+3. **Optimization Space**: Already consolidated in E4-1 (E1)
+   - ENV snapshot reduced 3 TLS reads → 1 TLS read
+   - Remaining overhead is unavoidable (lazy init check)
+   - Further optimization requires constructor init (E3-4 showed this fails)
+
+**Conservative Estimate**: -1.0% to +0.5% (high regression risk)
+
+**Decision**: **NO-GO** (proven failure in E3-4)
+
+---
+
+## Strategic Recommendations
+
+### Priority 1: Exploit E5-1 Success Pattern ✅
+
+**E5-1 Strategy (Free Tiny Direct)**:
+- **Target**: Wrapper-level overhead (deduplication)
+- **Method**: Single header check → direct call to free_tiny_fast()
+- **Result**: +3.35% (GO)
+
+**Replicable Patterns**:
+1. **Malloc Tiny Direct**: Apply E5-1 pattern to malloc() side
+   - Single size check → direct call to malloc_tiny_fast_for_class()
+   - Eliminate: Size validation redundancy, ENV snapshot overhead
+   - Expected: +2-4% (similar to E5-1)
+
+2. **Alloc Gate Specialization**: Per-class fast paths
+   - C0-C3: Direct to LEGACY (skip policy snapshot)
+   - C4-C7: Route-specific fast paths
+   - Expected: +1-3%
+
+### Priority 2: Profile New Baseline
+
+After E4+E5-1 adoption (~+9-10% cumulative):
+1. **Re-profile Mixed workload** (new bottlenecks may emerge)
+2. **Identify high-frequency, high-overhead** targets
+3. **Focus on deduplication/consolidation** (proven pattern)
+
+### Priority 3: Avoid Diminishing Returns
+
+**Red Flags** (E5-2, E5-3 lessons):
+- **Self% > 3%** but **low frequency** → misleading
+- **Micro-optimizations** in already-optimized code → marginal ROI
+- **Branch hint tuning** → profile-dependent, high regression risk
+- **Cold path optimization** → time-weighted ≠ frequency-weighted
+
+**Green Flags** (E4-1, E4-2, E5-1 successes):
+- **Wrapper-level deduplication** → +3-6% per optimization
+- **TLS consolidation** → +2-4% per consolidation
+- **Direct path creation** → +2-4% per path
+- **Structural changes** (not micro-tuning) → higher ROI
+
+---
+
+## Lessons from Phase 5
+
+### Wins (E4-1, E4-2, E5-1)
+1. **ENV Snapshot Consolidation** (E4-1): +3.51%
+   - 3 TLS reads → 1 TLS read
+   - Deduplication > micro-optimization
+
+2. **Malloc Wrapper Snapshot** (E4-2): +21.83% standalone (+6.43% combined)
+   - Function call elimination (tiny_get_max_size)
+   - Pre-caching + TLS consolidation
+
+3. **Free Tiny Direct** (E5-1): +3.35%
+   - Single header check → direct call
+   - Wrapper-level deduplication
+
+**Common Pattern**: **Eliminate redundancy at architectural boundaries** (wrapper, gate, snapshot)
+
+### Losses / Neutrals (E3-4, E5-2)
+1. **ENV Constructor Init** (E3-4): -1.44%
+   - Constructor mode added overhead
+   - Branch prediction is profile-dependent
+
+2. **Header Write-Once** (E5-2): +0.45% NEUTRAL
+   - Assumption incorrect (headers NOT redundant)
+   - Branch overhead ≈ savings
+
+**Common Pattern**: **Micro-optimizations in hot functions** have limited ROI when code is already optimized
+
+---
+
+## Conclusion
+
+**E5-3 Recommendation**: **DEFER all three candidates**
+
+**Rationale**:
+1. **E5-3a (cold path)**: Low frequency, high risk, estimated +0.5% NEUTRAL
+2. **E5-3b (push)**: Already optimized, marginal ROI, estimated +1.0% borderline
+3. **E5-3c (env snapshot)**: Proven failure (E3-4), estimated -1.0% NO-GO
+
+**Next Steps**:
+1. ✅ **Promote E5-1** to `MIXED_TINYV3_C7_SAFE` preset (if not already done)
+2. ✅ **Profile new baseline** (E4+E5-1 ON) to find next high-ROI targets
+3. ✅ **Design E5-4**: Malloc Tiny Direct (E5-1 pattern applied to alloc side)
+   - Expected: +2-4% based on E5-1 precedent
+   - Lower risk than E5-3 candidates
+4. ✅ **Update roadmap**: Focus on wrapper-level optimizations, avoid diminishing returns
+
+**Key Insight**: **Profiler self% is necessary but not sufficient** for optimization prioritization. Frequency, redundancy, and architectural seams matter more than raw self%.
+
+---
+
+## Appendix: Implementation Notes (E5-3a - Not Executed)
+
+**Files Created** (research box, not tested):
+- `core/box/free_cold_shape_env_box.{h,c}` (ENV gate)
+- `core/box/free_cold_shape_stats_box.{h,c}` (stats counters)
+
+**Integration Point**:
+- `core/front/malloc_tiny_fast.h` (lines 418-437, free_tiny_fast_cold)
+
+**Decision**: **FROZEN** (default OFF, do not pursue A/B testing)
+
+**Rationale**: Pre-analysis shows NO-GO (low frequency, high risk, marginal ROI < +1.0%)
+
+---
+
+**Date**: 2025-12-14
+**Phase**: 5 E5-3
+**Status**: Analysis Complete → **DEFER E5-3**, Proceed to E5-4 (Malloc Direct Path)
+**Cumulative**: E4+E5-1 = ~+9-10% (baseline: 44.42M ops/s Mixed)
--- a/docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md
@ -0,0 +1,122 @@
+# Phase 5 E5-4: Malloc Tiny Direct Path（次の指示書）
+
+## Status（2025-12-14 / E5-2 FREEZE 後）
+
+- E5-1（Free Tiny Direct）は ✅ GO（+3.35%）
+- E5-2（Header refill write-once）は ⚪ NEUTRAL → FREEZE
+- E5-3（env shape 等）は **DEFER**
+- 次の芯: **E5-4（Malloc Tiny Direct）** = E5-1 の成功パターンを alloc 側へ複製
+
+狙い:
+- `malloc()` wrapper から `tiny_alloc_gate_fast()` 呼び出しの “ゲート税” を削り、
+  **wrapper → malloc_tiny_fast_for_class()** へ最短で入る。
+
+前提:
+- “Tiny を使ってはいけない” モード（POOL_ONLY 等）を壊さない（= `g_tiny_route[]` は必ず尊重）。
+- fail-fast: 失敗したら既存経路へ即フォールバック。
+- 戻せる: ENV gate default OFF。
+
+---
+
+## Step 0: 対象ホットの確認（perf）
+
+E4/E5-1 を ON にした baseline で確認:
+```sh
+HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
+  perf record -F 99 -- ./bench_random_mixed_hakmem 20000000 400 1
+perf report --stdio --no-children
+```
+
+狙いの目安:
+- `tiny_alloc_gate_fast` が self% **≥ 8%** なら E5-4 の ROI は高い
+
+---
+
+## Step 1: 箱の追加（ENV gate + optional stats）
+
+### 1) ENV gate（必須）
+- 新規: `core/box/malloc_tiny_direct_env_box.h`
+  - ENV: `HAKMEM_MALLOC_TINY_DIRECT=0/1`（default 0）
+  - `static inline bool malloc_tiny_direct_enabled(void)` を提供
+
+### 2) stats（任意、compile-out 推奨）
+- 新規: `core/box/malloc_tiny_direct_stats_box.h`
+  - `direct_total`, `direct_hit`, `direct_miss`, `route_pool_only`, `class_oob`, `fast_null`
+  - `HAKMEM_DEBUG_COUNTERS=0` で compile-out（観測税ゼロ）
+
+Box Theory:
+- L0: ENV gate（戻せる）
+- L1: direct try（副作用ゼロ）
+- 見える化: カウンタのみ
+
+---
+
+## Step 2: wrapper へ統合（境界1箇所）
+
+対象: `core/box/hak_wrappers.inc.h` の `malloc()` hot path（E4-2 snapshot の中）
+
+やること:
+- 既存の
+  - `size <= 256` → `tiny_alloc_gate_fast(size)`
+  - `size <= tiny_get_max_size()` → `tiny_alloc_gate_fast(size)`
+  を “direct try” に置換/前段追加する。
+
+**Direct try の条件（安全最優先）**:
+1) `malloc_wrapper_env_snapshot_enabled()` が ON（E4-2 の経路内）
+2) `env->front_gate_unified` が true（Tiny front を使う前提）
+3) `size <= 256`（まず最頻だけ、範囲を狭く）
+4) `class_idx = hak_tiny_size_to_class(size)` が [0..7]
+5) `g_tiny_route[class_idx] != ROUTE_POOL_ONLY`（Tiny 禁止を尊重）
+
+**Direct try の呼び出し**:
+- `void* p = malloc_tiny_fast_for_class(size, class_idx);`
+- `p != NULL` なら即 return
+- `p == NULL` なら既存ルートにフォールバック（TinyFirst/Refill失敗を許容）
+
+重要:
+- `tiny_alloc_gate_fast()` の “診断/検証” は bypass されるので、
+  debug ビルドでは direct try を **tiny_alloc_gate_diag_enabled()==0 のときだけ**に限定する（推奨）。
+
+---
+
+## Step 3: A/B テスト（同一バイナリ）
+
+### A: baseline（E5-4 OFF）
+```sh
+HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
+  HAKMEM_MALLOC_TINY_DIRECT=0 \
+  ./bench_random_mixed_hakmem 20000000 400 1
+```
+
+### B: optimized（E5-4 ON）
+```sh
+HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
+  HAKMEM_MALLOC_TINY_DIRECT=1 \
+  ./bench_random_mixed_hakmem 20000000 400 1
+```
+
+判定（Mixed 10-run mean）:
+- GO: **+1.0% 以上**
+- ±1.0%: NEUTRAL → freeze
+- -1.0% 以下: NO-GO → freeze
+
+追加で C6-heavy も 5-run だけ確認（回帰が無いこと）。
+
+---
+
+## Step 4: 健康診断（必須）
+
+```sh
+scripts/verify_health_profiles.sh
+```
+
+---
+
+## Step 5: 昇格（GO のときだけ）
+
+- `core/bench_profile.h`（`MIXED_TINYV3_C7_SAFE`）に:
+  - `bench_setenv_default("HAKMEM_MALLOC_TINY_DIRECT", "1");`
+- `docs/analysis/ENV_PROFILE_PRESETS.md` に:
+  - 効果、A/B、rollback（`HAKMEM_MALLOC_TINY_DIRECT=0`）を追記
+- `CURRENT_TASK.md` を更新
+
--- a/docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md
@ -1,6 +1,6 @@
 # Phase 5 E5: Post E4-Combined Next Instructions（次の指示書）

-## Status（2025-12-14 / E4 Combined GO 後）
+## Status（2025-12-14 / E5-2 FREEZE 反映）

 - Baseline（Mixed, 20M iters, ws=400）: **47.34M ops/s**（E4-1+E4-2 ON）
 - Hot spots（self%）:
@ -15,6 +15,9 @@

 Update:
 - E5-1（Free Tiny Direct Path）✅ GO（+3.35% mean / +3.36% median）→ 指示書: `docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_NEXT_INSTRUCTIONS.md`
+- E5-2（Header write to refill boundary）⚪ NEUTRAL → FREEZE（追わない）
+- E5-3（env shape 等）DEFER → 次は E5-4（malloc 側 direct）
+- E5-4 指示書: `docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md`

 ---

@ -74,7 +77,15 @@ perf report --stdio --no-children --symbol free

 ---

-## E5-2（優先B）: `tiny_region_id_write_header` を “毎回 alloc” から外す（refill 境界へ）
+## E5-2: Header write-once（⚪ NEUTRAL → FROZEN）
+
+結論:
+- E5-2 は **NEUTRAL**（branch overhead ≈ savings）なので **freeze**。
+- 以後は追わず、次は E5-4 を優先する。
+
+参照:
+- Design: `docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md`
+- Results: `docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md`

 ### 仮説
 `tiny_region_id_write_header` は “正しいが高頻度”。
@ -96,7 +107,14 @@ perf report --stdio --no-children --symbol free

 ---

-## E5-3（優先C / 小パッチ）: `hakmem_env_snapshot_enabled()` の分岐形を “enabled 前提” に寄せる
+## E5-4（次の芯）: Malloc Tiny Direct（E5-1 の alloc 側複製）
+
+指示書:
+- `docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md`
+
+---
+
+## E5-3（DEFER）: `hakmem_env_snapshot_enabled()` の分岐形を “enabled 前提” に寄せる

 ### 背景
 `MIXED_TINYV3_C7_SAFE` では `HAKMEM_ENV_SNAPSHOT=1` が常用になったため、
--- a/docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md
@ -73,3 +73,4 @@ scripts/verify_health_profiles.sh
 - E4 合算 A/B: `docs/analysis/PHASE5_E4_COMBINED_AB_TEST_NEXT_INSTRUCTIONS.md`
 - E5 次の芯: `docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md`
 - E5-1 昇格: `docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_NEXT_INSTRUCTIONS.md`
+- E5-4 次: `docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md`