hakmem/docs/analysis/PHASE3_BASELINE_AND_CANDIDATES.md

# Phase 3: Baseline Establishment & Next Optimization Candidates

**Date**: 2025-12-13
**Status**: BASELINE ESTABLISHED
**Goal**: Identify next micro-optimization targets with +1-5% potential each

---

## Executive Summary

**Baseline Performance (MID_V3=0, MIXED workload)**:
- Mean: 45.78M ops/s
- Median: 46.79M ops/s
- Range: 42.36M - 47.12M ops/s
- StdDev: ~1.75M ops/s (~3.8% of the mean)

**Top Optimization Candidates**:
1. **free() wrapper** (28.95% self%) - HIGH PRIORITY
2. **tiny_alloc_gate_fast()** (12.75% self%) - HIGH PRIORITY
3. **main() benchmark overhead** (12.53% self%) - IGNORE (benchmark artifact)

**Expected Next Gains**: +3-8% cumulative from free path optimizations

---

## Step 0: Baseline Establishment

### Configuration Verification

**Profile**: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE`

**Key Settings** (verified in `/mnt/workdisk/public_share/hakmem/core/bench_profile.h:74`):
```c
bench_setenv_default("HAKMEM_MID_V3_ENABLED", "0");  // CRITICAL: MID_V3 disabled for Mixed
bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x0");
```

**Optimization Flags Enabled**:
- `HAKMEM_FREE_TINY_FAST_HOTCOLD=1` (Phase FREE-TINY-FAST-DUALHOT-1)
- `HAKMEM_WRAP_SHAPE=1` (Phase 2 B4: Hot/Cold wrapper split)
- `HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1` (Phase 2 B3: Route branch optimization)
- `HAKMEM_TINY_STATIC_ROUTE=1` (Phase 3 C3: Static routing, +2.2%)

### Baseline Measurements (5 runs)

| Run | Throughput (M ops/s) | Time (ms) |
|-----|---------------------|-----------|
| 1   | 46.84               | 21        |
| 2   | 46.79               | 21        |
| 3   | 45.77               | 22        |
| 4   | 47.12               | 21        |
| 5   | 42.36               | 24        |

**Statistics**:
- **Mean**: 45.78M ops/s
- **Median**: 46.79M ops/s
- **Min**: 42.36M ops/s
- **Max**: 47.12M ops/s
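- **Range**: 4.76M ops/s (11.2% of the minimum)
- **StdDev**: ~1.75M ops/s (~3.8% of the mean)

**Assessment**: The baseline is reasonably stable: the standard deviation is ~3.8% of the mean, and most of that spread comes from run 5 (42.36M ops/s), an outlier likely caused by system noise. The median (46.79M ops/s) is therefore used as the baseline reference.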
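
The summary statistics can be rechecked from the five table values. The following is a minimal sketch, not project code: all helper names (`stats_mean`, `stats_median`, `stats_stddev`, `sqrt_newton`) are hypothetical, and it assumes the population form of the standard deviation (dividing by n), which the document does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Throughput of the five baseline runs in M ops/s (copied from the table above). */
static const double k_baseline_runs[5] = {46.84, 46.79, 45.77, 47.12, 42.36};

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

static double stats_mean(const double *v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Median of up to 16 samples; sorts a local copy so the input stays untouched. */
static double stats_median(const double *v, size_t n) {
    double tmp[16];
    assert(n > 0 && n <= 16);
    for (size_t i = 0; i < n; i++) tmp[i] = v[i];
    qsort(tmp, n, sizeof tmp[0], cmp_double);
    return (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}

/* Newton iteration for the square root, used only to keep the sketch free of libm. */
static double sqrt_newton(double x) {
    if (x <= 0.0) return 0.0;
    double r = (x > 1.0) ? x : 1.0;
    for (int i = 0; i < 64; i++) r = 0.5 * (r + x / r);
    return r;
}

/* Population standard deviation (assumption: divides by n, not n - 1). */
static double stats_stddev(const double *v, size_t n) {
    double m = stats_mean(v, n);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) acc += (v[i] - m) * (v[i] - m);
    return sqrt_newton(acc / (double)n);
}
```

Calling the three helpers on `k_baseline_runs` gives a mean of about 45.78, a median of 46.79, and a population standard deviation of about 1.77, in line with the rounded figures above. Dividing by n - 1 instead would give roughly 1.98, so the choice of estimator matters at this sample size.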

**Comparison to Previous**:
- Previous C3 baseline: ~39.8M ops/s (with default settings)
- **Current baseline: 46.79M ops/s**
- **Improvement: +17.5%** (confirms MID_V3=0 + cumulative optimizations working)

---

## Step 1: Perf Profiling Results

### Profiling Setup

**Command**:
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 -g ./bench_random_mixed_hakmem 10000000 400 1
```

**Results**:
- Samples: 30 (cycles:P event)
- Event count: 921,849,973 cycles
- Throughput: 47.37M ops/s (consistent with baseline)

### Top Functions by Self%

| Rank | Symbol                              | Self%  | Samples | Children% | Category        |
|------|-------------------------------------|--------|---------|-----------|-----------------|
| 1    | `free`                              | 28.95% | 3       | 45.20%    | **HOT WRAPPER** |
| 2    | `tiny_alloc_gate_fast.lto_priv.0`   | 12.75% | 3       | 29.11%    | **HOT ALLOC**   |
| 3    | `main`                              | 12.53% | 3       | 21.00%    | Benchmark       |
| 4    | `malloc`                            | 12.43% | 3       | 16.71%    | Wrapper         |
| 5    | `tiny_front_v3_enabled.lto_priv.0`  | 7.75%  | 2       | 7.85%     | Tiny front      |
| 6    | `tiny_route_for_class.lto_priv.0`   | 4.39%  | 2       | 24.78%    | Route lookup    |
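| 7    | `tiny_c7_ultra_free`                | 4.39%  | 1       | 8.66%     | C7 free         |
| 8    | `free_tiny_fast_cold.lto_priv.0`    | 4.28%  | 1       | 8.56%     | Free cold path  |
| 9    | `tiny_region_id_write_header.lto_priv.0` | 4.27% | 1   | 12.95%    | Header write    |
| 10   | `free.cold`                         | 4.15%  | 1       | 4.15%     | Cold path       |
| 11   | `hak_pool_free`                     | 4.02%  | 1       | 4.02%     | Pool free       |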

### Call Graph Analysis

**free() hot path** (28.95% self, 45.20% children):
```
free (28.95% self)
├── tiny_route_for_class.lto_priv.0 (20.38%)  ← MAJOR BOTTLENECK
├── free (recursive, 16.24%)
├── tiny_region_id_write_header.lto_priv.0 (4.29%)
└── malloc (4.28%)
```

**tiny_alloc_gate_fast** (12.75% self, 29.11% children):
```
tiny_alloc_gate_fast (12.75% self)
├── tiny_alloc_gate_fast (recursive inlining, 20.64%)
├── main (4.27%)
└── free (4.20%)
```

**Key Insight**: `tiny_route_for_class()` is called from `free()` and accounts for 20.38% of total time in that call chain (inclusive time; its self time is only 4.39%). This makes it the **#1 optimization target**.

---

## Step 2: Candidate Prioritization

### HIGH PRIORITY (Expected +1.5-5% each)

#### 1. **free() wrapper path** (28.95% self%)
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:524`

**Current Implementation**:
```c
void free(void* ptr) {
    if (!ptr) return;

    // BenchFast bypass (unlikely, 0)
    if (__builtin_expect(bench_fast_enabled(), 0)) { ... }

    const wrapper_env_cfg_t* wcfg = wrapper_env_cfg();  // ← Memory load

    if (__builtin_expect(wcfg->wrap_shape, 0)) {        // ← Branch
        if (__builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 1)) {
            int freed;
            if (__builtin_expect(hak_free_tiny_fast_hotcold_enabled(), 0)) {
                freed = free_tiny_fast_hot(ptr);
            } else {
                freed = free_tiny_fast(ptr);
            }
            if (__builtin_expect(freed, 1)) {
                return;  // SUCCESS
            }
        }
        return free_cold(ptr, wcfg);
    }
    // Legacy path...
}
```

**Optimization Opportunities**:

**A. Cache `wrapper_env_cfg()` result** (Expected: +1-2%)
- Currently calls `wrapper_env_cfg()` on every free
- Could cache in TLS or register during init
- Risk: LOW (read-only after init)

**B. Inline `free_tiny_fast_hot()` decision** (Expected: +1-2%)
- The `hak_free_tiny_fast_hotcold_enabled()` branch is a runtime env check
- It could be resolved at compile time or cached at init time
- Risk: LOW (already gated by HAKMEM_FREE_TINY_FAST_HOTCOLD)

**C. Reduce branch mispredictions** (Expected: +0.5-1%)
- Reorder branches to put likely path first
- Current: `bench_fast_enabled()` checked first (unlikely=0)
- Optimization: Move Tiny fast path check earlier
- Risk: LOW

**Total Expected Gain: +2.5-5%**
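
Option B (init-time caching of the hot/cold flag) could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual hakmem implementation: `parse_env_flag`, `hotcold_flag_init`, `hotcold_enabled_fast`, and the default value of 0 are all hypothetical. Only the environment variable name comes from the flag list above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cached flag: -1 = not yet read, 0/1 = cached value. */
static int g_hotcold_cached = -1;

/* Treat unset or empty as the default; any value not starting with '0' enables. */
static int parse_env_flag(const char *value, int default_on) {
    if (value == NULL || value[0] == '\0') return default_on;
    return value[0] != '0';
}

/* Read the environment exactly once, e.g. from the allocator's init routine.
 * The default of 0 is an assumption made for this sketch. */
static void hotcold_flag_init(void) {
    g_hotcold_cached = parse_env_flag(getenv("HAKMEM_FREE_TINY_FAST_HOTCOLD"), 0);
}

/* Hot-path query: a plain global load, with no getenv() and no lazy-init branch.
 * The assert documents the init-ordering requirement and vanishes under NDEBUG. */
static inline int hotcold_enabled_fast(void) {
    assert(g_hotcold_cached >= 0);
    return g_hotcold_cached;
}
```

The trade-off is that the hot path relies on `hotcold_flag_init()` having run before the first `free()`. That is the same init-ordering concern noted for the route snapshot below.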

#### 2. **tiny_route_for_class()** (4.39% self%, 24.78% children)
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tiny_route_env_box.h:147`

**Current Implementation**:
```c
static inline tiny_route_kind_t tiny_route_for_class(uint8_t ci) {
    FREE_DISPATCH_STAT_INC(route_for_class_calls);  // Debug stat (RELEASE: noop)
    if (__builtin_expect(!g_tiny_route_snapshot_done, 0)) {
        tiny_route_snapshot_init();
    }
    if (__builtin_expect(ci >= TINY_NUM_CLASSES, 0)) {
        return TINY_ROUTE_LEGACY;
    }
    return g_tiny_route_class[ci];
}
```

**Optimization Opportunities**:

**A. Eliminate `g_tiny_route_snapshot_done` check** (Expected: +1-2%)
- Check happens on EVERY call from free path
- Phase 3 C3 already implemented static routing for alloc path
- **Proposal**: Apply same static route cache to free path
- Implementation: Add `tiny_static_route_for_free(ci)` that bypasses snapshot check
- Risk: MEDIUM (need to ensure init ordering)

**B. Remove bounds check `ci >= TINY_NUM_CLASSES`** (Expected: +0.5-1%)
- In free path, `ci` is derived from header (already validated)
- Could add `tiny_route_for_class_unchecked(ci)` variant
- Risk: MEDIUM (need careful caller audit)

**Total Expected Gain: +1.5-3%**
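
Proposals A and B can be combined in a single lookup that has neither the snapshot check nor the bounds check on the release path. The sketch below assumes 8 tiny classes (C0-C7) and a two-value route enum. `TINY_ROUTE_OTHER`, the table name, and the init function are hypothetical stand-ins, and the real route values would come from the existing snapshot logic.

```c
#include <assert.h>
#include <stdint.h>

#define TINY_NUM_CLASSES 8  /* assumption: classes C0-C7 */

/* Reduced stand-in for the real route enum. */
typedef enum { TINY_ROUTE_LEGACY = 0, TINY_ROUTE_OTHER = 1 } tiny_route_kind_t;

static tiny_route_kind_t g_free_static_route[TINY_NUM_CLASSES];
static int g_free_static_route_ready = 0;

/* Must run once before the first free(); here every class is simply LEGACY,
 * which matches the MIXED profile described in this document. */
static void tiny_static_route_for_free_init(void) {
    for (int ci = 0; ci < TINY_NUM_CLASSES; ci++) {
        g_free_static_route[ci] = TINY_ROUTE_LEGACY;
    }
    g_free_static_route_ready = 1;
}

/* Unchecked lookup: the assert covers init ordering and the class index in
 * debug builds only; with NDEBUG the body is a single array load. */
static inline tiny_route_kind_t tiny_static_route_for_free(uint8_t ci) {
    assert(g_free_static_route_ready && ci < TINY_NUM_CLASSES);
    return g_free_static_route[ci];
}
```

Keeping the checks as debug-only asserts is one way to address the MEDIUM risk rating: a caller audit is still needed, but violations surface in debug builds instead of reading out of bounds silently.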

#### 3. **tiny_alloc_gate_fast()** (12.75% self%)
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tiny_alloc_gate_box.h:139`

**Current Implementation**:
```c
static inline void* tiny_alloc_gate_fast(size_t size)
{
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL;
    }

    TinyRoutePolicy route = tiny_route_get(class_idx);  // ← Already optimized (Phase 3 C3)

    if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) {
        return NULL;
    }

    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);

    if (__builtin_expect(route == ROUTE_TINY_ONLY, 1)) {
        return user_ptr;
    }

    // ROUTE_TINY_FIRST fallback...
}
```

**Optimization Opportunities**:

**A. Specialize for common routes** (Expected: +1-2%)
- MIXED workload: C0-C7 are all ROUTE_TINY_ONLY (LEGACY)
- Could create `tiny_alloc_gate_fast_legacy_only()` variant
- Eliminates ROUTE_POOL_ONLY and ROUTE_TINY_FIRST checks
- Risk: LOW (already profiled via HAKMEM_TINY_STATIC_ROUTE)

**B. Inline `malloc_tiny_fast_for_class()`** (Expected: +0.5-1%)
- May not be fully inlined by LTO
- Add `__attribute__((always_inline))` to force inlining
- Risk: LOW

**Total Expected Gain: +1.5-3%**
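
A legacy-only gate (option A) might be shaped like the following sketch. The two `_stub` functions are hypothetical stand-ins for `hak_tiny_size_to_class()` and `malloc_tiny_fast_for_class()`, and the power-of-two size classes from 8 to 1024 bytes are an assumption made only so the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TINY_NUM_CLASSES 8  /* assumption: classes C0-C7 */

/* Stand-in size-to-class mapping: 8, 16, ..., 1024 bytes (hypothetical). */
static int hak_tiny_size_to_class_stub(size_t size) {
    size_t cap = 8;
    for (int ci = 0; ci < TINY_NUM_CLASSES; ci++, cap <<= 1) {
        if (size <= cap) return ci;
    }
    return -1;  /* not a tiny size */
}

/* Stand-in for the real per-class fast allocator. */
static void *malloc_tiny_fast_for_class_stub(size_t size, int class_idx) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    return malloc(size);
}

/* Specialized gate: every class is known to be TINY_ONLY, so the route lookup
 * and the POOL_ONLY / TINY_FIRST branches are gone. */
static inline void *tiny_alloc_gate_fast_legacy_only(size_t size) {
    int class_idx = hak_tiny_size_to_class_stub(size);
    if (__builtin_expect(class_idx < 0, 0)) {
        return NULL;  /* caller falls back to the general path */
    }
    return malloc_tiny_fast_for_class_stub(size, class_idx);
}
```

The variant is only valid while the profile guarantees that all classes route to the tiny allocator, which is why it would sit behind its own ENV gate.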

---

### MEDIUM PRIORITY (Expected +0.5-1% each)

#### 4. **tiny_front_v3_enabled()** (7.75% self%)
- Appears in free path via `free_tiny_fast_hot()`
- Likely a runtime env check that could be cached
- Risk: LOW
- Expected Gain: +0.5-1%

#### 5. **free.cold** (4.15% self%)
- Cold path for free wrapper
- Handles classification and fallback
- Not a hot optimization target (already in slow path)
- Expected Gain: <+0.5%

---

### LOW PRIORITY / IGNORE

#### 6. **main()** (12.53% self%)
- Benchmark overhead (not part of allocator)
- IGNORE

#### 7. **malloc()** (12.43% self%)
- Already optimized in previous phases
- Appears lower than free in profile
- Defer to next round

---

## Step 3: Recommended Next Steps

### Phase 3 D1: Free Path Route Cache ✅ GO (ENV opt-in)
**Target**: Remove `tiny_route_for_class()` calls from the free path
**Result**: Mixed 10-run mean **+1.06%** (the median loses in some runs)
**Decision**: ✅ GO, but **promotion to default awaits 20-run confirmation**

**ENV Gate**: `HAKMEM_FREE_STATIC_ROUTE=1` (default: 0)

---

### Phase 3 D2: Wrapper Env Cache ❌ NO-GO (FROZEN)
**Target**: Remove `wrapper_env_cfg()` calls from the wrapper hot path
**Result**: Mixed 10-run mean **-1.44%** regression
**Decision**: ❌ NO-GO (research box frozen, default OFF)

**ENV Gate**: `HAKMEM_WRAP_ENV_CACHE=1` (default: 0)
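
To make the D2 idea concrete, the sketch below shows a TLS pointer cache layered over an accessor that is already cheap. It is a simplified illustration, not the code in `wrapper_env_cache_box.h`: the one-field struct, the global, and the check-free `wrapper_env_cfg()` stand-in are assumptions. Only the names `wrapper_env_cfg`, `wrapper_env_cfg_t`, and `wrapper_env_cfg_fast` appear in the project notes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real config struct; only one field is modeled. */
typedef struct { int wrap_shape; } wrapper_env_cfg_t;

static wrapper_env_cfg_t g_wrapper_env = { 1 };

/* Stand-in for the existing accessor, modeled here as a bare pointer return. */
static inline const wrapper_env_cfg_t *wrapper_env_cfg(void) {
    return &g_wrapper_env;
}

/* Per-thread cached pointer (the D2 approach). */
static _Thread_local const wrapper_env_cfg_t *t_wrapper_env_cache = NULL;

static inline const wrapper_env_cfg_t *wrapper_env_cfg_fast(void) {
    const wrapper_env_cfg_t *cfg = t_wrapper_env_cache;  /* TLS load */
    if (__builtin_expect(cfg == NULL, 0)) {              /* extra branch */
        cfg = wrapper_env_cfg();
        t_wrapper_env_cache = cfg;
    }
    assert(cfg != NULL);
    return cfg;
}
```

Seen this way, the cached path replaces one cheap call with a TLS load plus a branch, so a measured regression is plausible when the underlying accessor was not expensive to begin with.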

---

### Phase 3 D3: Alloc Gate Specialization (MEDIUM PRIORITY)
**Target**: `tiny_alloc_gate_fast()` for LEGACY-only route
**Expected Gain**: +1-2%
**Risk**: LOW
**Effort**: 2-3 hours

**Implementation**:
1. Create `tiny_alloc_gate_fast_legacy()` specialized variant
2. Eliminate ROUTE_POOL_ONLY and ROUTE_TINY_FIRST branches
3. Use in MIXED profile where all classes are LEGACY
4. A/B test: BASELINE vs D3

**ENV Gate**: `HAKMEM_ALLOC_GATE_LEGACY_ONLY=1` (default: 0)

---

## Expected Cumulative Results (Updated)

| Phase      | Optimization                     | Expected Gain | Notes |
|------------|----------------------------------|---------------|-------|
| Baseline   | MID_V3=0 + B3+B4+C3              | -             | - |
| **D1**     | Free route cache                 | +0 to +2%     | Mean wins; median pending confirmation (default OFF) |
| **D2**     | Wrapper env cache                | -             | NO-GO (frozen) |
| **D3**     | Alloc gate specialization        | +0 to +2%     | Start only if perf shows more than 5% |

**With MID_V3 fix for Mixed**: +13% additional (expected ~56M ops/s total)

---

## Risk Assessment

| Optimization        | Risk Level | Mitigation                                      |
|---------------------|------------|-------------------------------------------------|
| Free route cache    | MEDIUM     | Ensure init ordering, ENV gate for rollback     |
| Wrapper env cache   | n/a        | NO-GO (-1.44% regression)                       |
| Alloc specialization| LOW        | Profile-specific, existing static route pattern |

**All optimizations**: Follow ENV gate + A/B test + decision pattern (research box)

---

## Post-D1/D2 Status (2025-12-13)

### Phase 3 D1/D2 Validation Complete ✅

1. **D1 (Free Route Cache)**: ✅ ADOPT - PROMOTED TO DEFAULT
   - 20-run validation completed
   - Results: Mean +2.19%, Median +2.37% (both criteria met)
   - Status: Added to MIXED_TINYV3_C7_SAFE preset as default
   - Implementation: `HAKMEM_FREE_STATIC_ROUTE=1`

2. **D2 (Wrapper Env Cache)**: ❌ FROZEN
   - Results: -1.44% regression
   - Status: Research box frozen, default OFF, do not pursue
   - Implementation: `HAKMEM_WRAP_ENV_CACHE=1` (opt-in only, not recommended)
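
The GO/NO-GO rule used for these validations (mean gain of at least +1.0% and median gain of at least +0.0%, as recorded in the Phase 3 finalization notes) can be written down as a small check. The function and type names below are hypothetical; the thresholds and throughput figures are the ones recorded for D1 (46.30M to 47.32M mean, 46.30M to 47.39M median) and D2 (46.52M to 45.85M mean, 46.47M to 45.98M median).

```c
#include <assert.h>

typedef enum { DECISION_NO_GO = 0, DECISION_GO = 1 } ab_decision_t;

/* Relative change of optimized versus baseline throughput, in percent. */
static double gain_percent(double baseline, double optimized) {
    assert(baseline > 0.0);
    return (optimized / baseline - 1.0) * 100.0;
}

/* GO only if both criteria hold: mean gain >= +1.0% and median gain >= 0.0%. */
static ab_decision_t ab_decide(double base_mean, double opt_mean,
                               double base_median, double opt_median) {
    double mean_gain = gain_percent(base_mean, opt_mean);
    double median_gain = gain_percent(base_median, opt_median);
    return (mean_gain >= 1.0 && median_gain >= 0.0) ? DECISION_GO
                                                    : DECISION_NO_GO;
}
```

With the D1 figures, `ab_decide(46.30, 47.32, 46.30, 47.39)` returns `DECISION_GO`; with the D2 figures, `ab_decide(46.52, 45.85, 46.47, 45.98)` returns `DECISION_NO_GO`. Recomputing from the rounded throughputs gives about +2.2% for D1, so small differences from the reported +2.19% are rounding effects.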

### Active Optimizations in MIXED_TINYV3_C7_SAFE

1. **B3**: Routing branch shape (+2.89% proven)
2. **B4**: Wrapper hot/cold split (+1.47% proven)
3. **C3**: Static routing (+2.20% proven)
4. **D1**: Free route cache (+2.19% proven) - NEW
5. **MID_V3**: OFF for Mixed (C6 routing fix, +13% proven)

**Cumulative gain**: ~7.6% as a conservative estimate (B3 + B4 + C3 + D1, excluding MID_V3 fix); the four proven figures listed above add up to 8.75%
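
How the individual gains combine depends on whether they are added or compounded. The sketch below computes both from the four proven figures. The helper names are hypothetical, and treating the gains as independent is an assumption, since each was measured in its own A/B run.

```c
#include <assert.h>
#include <stddef.h>

/* Proven gains in percent: B3, B4, C3, D1 (20-run), as listed above. */
static const double k_gains_pct[4] = {2.89, 1.47, 2.20, 2.19};

/* Simple sum of percentages. */
static double gains_additive(const double *pct, size_t n) {
    assert(pct != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += pct[i];
    return sum;
}

/* Compounded gain: multiply the (1 + g) factors, then convert back to percent. */
static double gains_compounded(const double *pct, size_t n) {
    assert(pct != NULL);
    double factor = 1.0;
    for (size_t i = 0; i < n; i++) factor *= 1.0 + pct[i] / 100.0;
    return (factor - 1.0) * 100.0;
}
```

On these inputs the additive total is 8.75% and the compounded total is about 9.0%. The ~7.6% figure corresponds to the additive sum when D1 is counted at its earlier 10-run result of +1.06% (2.89 + 1.47 + 2.20 + 1.06 = 7.62).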

### Next Actions

1. **Profile**: Run perf on current baseline to identify next targets
   - Requirement: self% ≥5% for Phase 3 D3 consideration
   - Target: `tiny_alloc_gate_fast` specialization

2. **Optional**: Phase 3 D3 (Alloc gate specialization) - pending perf validation
   - Only proceed if perf shows ≥5% self% in alloc gate
   - ENV: `HAKMEM_ALLOC_GATE_LEGACY_ONLY=0/1`

3. **Phase 4 Planning**: If no more 5%+ targets, prepare Phase 4 roadmap

---

## Appendix: Raw Perf Data

### Full Perf Report (Top 20)

```
# Samples: 30  of event 'cycles:P'
# Event count (approx.): 921849973

    46.11%     0.00%             0  [.] 0000000000000000
    45.20%    28.95%             3  [.] free
    29.11%    12.75%             3  [.] tiny_alloc_gate_fast.lto_priv.0
    24.78%     4.39%             2  [.] tiny_route_for_class.lto_priv.0
    21.00%    12.53%             3  [.] main
    16.71%    12.43%             3  [.] malloc
    12.95%     4.27%             1  [.] tiny_region_id_write_header.lto_priv.0
     8.66%     4.39%             1  [.] tiny_c7_ultra_free
     8.56%     4.28%             1  [.] free_tiny_fast_cold.lto_priv.0
     7.85%     7.75%             2  [.] tiny_front_v3_enabled.lto_priv.0
     4.27%     0.00%             0  [.] 0x00007ad3a9c2d001
     4.23%     0.00%             0  [.] tiny_c7_ultra_enabled_env.lto_priv.0
     4.21%     0.00%             0  [.] 0x00007ad3ab960c81
     4.20%     0.00%             0  [.] 0x00007ad3ab939401
     4.15%     4.15%             1  [.] free.cold
     4.15%     0.00%             0  [.] unified_cache_push.lto_priv.0
     4.02%     4.02%             1  [.] hak_pool_free
```

### Baseline Run Details

**Run 1**: 46.84M ops/s
```
Throughput =  46841499 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30208
```

**Run 2**: 46.79M ops/s
```
Throughput =  46793317 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080
```

**Run 3**: 45.77M ops/s
```
Throughput =  45772756 ops/s [iter=1000000 ws=400] time=0.022s
[RSS] max_kb=34176
```

**Run 4**: 47.12M ops/s
```
Throughput =  47117176 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080
```

**Run 5**: 42.36M ops/s (outlier)
```
Throughput =  42359615 ops/s [iter=1000000 ws=400] time=0.024s
[RSS] max_kb=30080
```

---

## Document History

- **2025-12-13**: Initial baseline establishment and candidate analysis
- **Next**: Phase 3 D1 implementation (Free route cache)

Phase 3 D2: Wrapper Env Cache - [DECISION: NO-GO]

Target: Reduce wrapper_env_cfg() overhead in malloc/free hot path
- Strategy: Cache wrapper env configuration pointer in TLS
- Approach: Fast pointer cache (TLS caches const wrapper_env_cfg_t*)

Implementation:
- core/box/wrapper_env_cache_env_box.h: ENV gate (HAKMEM_WRAP_ENV_CACHE)
- core/box/wrapper_env_cache_box.h: TLS cache layer (wrapper_env_cfg_fast)
- core/box/hak_wrappers.inc.h: Integration into malloc/free hot paths
- ENV gate: HAKMEM_WRAP_ENV_CACHE=0/1 (default OFF)

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (D2=0): 46.52M ops/s (avg), 46.47M ops/s (median)
- Optimized (D2=1): 45.85M ops/s (avg), 45.98M ops/s (median)
- Improvement: avg -1.44%, median -1.05% (DECISION: NO-GO)

Analysis:
- Regression cause: TLS cache adds overhead (branch + TLS access)
- wrapper_env_cfg() is already minimal (pointer return after simple check)
- Adding TLS caching layer makes it worse, not better
- Branch prediction penalty outweighs any potential savings

Cumulative Phase 2-3:
- B3: +2.89%, B4: +1.47%, C3: +2.20%
- D1: +1.06% (opt-in), D2: -1.44% (NO-GO)
- Total: ~7.2% (excluding D2)

Decision: FREEZE as research box (default OFF, regression confirmed)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

2025-12-13 22:03:27 +09:00

Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established

Summary:
- D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT
  - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median)
  - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median)
  - Mean gain: +2.19%, Median gain: +2.37%
  - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%)
  - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset

- D2 (Wrapper env cache): FROZEN
  - Previous result: -1.44% regression (TLS overhead > benefit)
  - Status: Research box (do not pursue further)
  - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset)

- Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13)

Cumulative Gains (Phase 2-3):
  B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19%
  Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%)
  MID_V3 fix: +13% (structural change, Mixed OFF by default)

Documentation Updates:
  - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report
  - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status
  - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results
  - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status
  - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN
  - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status
  - CURRENT_TASK.md: Phase 3 complete summary

Next:
  - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%)
  - Or Phase 4 planning if no more D3-class targets
  - Current active optimizations: B3, B4, C3, D1, MID_V3 fix

Files Changed:
  - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines)
  - docs/analysis/*.md (6 files updated with D1/D2 results)
  - CURRENT_TASK.md (Phase 3 status update)
  - analyze_d1_results.py (statistical analysis script)
  - core/bench_profile.h (D1 promoted to default in MIXED preset)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

2025-12-13 22:42:22 +09:00
+								### Phase 3 D1: Free Path Route Cache ✅ GO（ENV opt-in）
 								**Target**: `tiny_route_for_class()` の呼び出しを free path から削る
 								**Result**: Mixed 10-run mean **+1.06%**（median は負ける回がある）
 								**Decision**: ✅ GO だが **default 化は 20-run 確認待ち**
-												Phase 3 D2: Wrapper Env Cache - [DECISION: NO-GO]

Target: Reduce wrapper_env_cfg() overhead in malloc/free hot path
- Strategy: Cache wrapper env configuration pointer in TLS
- Approach: Fast pointer cache (TLS caches const wrapper_env_cfg_t*)

Implementation:
- core/box/wrapper_env_cache_env_box.h: ENV gate (HAKMEM_WRAP_ENV_CACHE)
- core/box/wrapper_env_cache_box.h: TLS cache layer (wrapper_env_cfg_fast)
- core/box/hak_wrappers.inc.h: Integration into malloc/free hot paths
- ENV gate: HAKMEM_WRAP_ENV_CACHE=0/1 (default OFF)

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (D2=0): 46.52M ops/s (avg), 46.47M ops/s (median)
- Optimized (D2=1): 45.85M ops/s (avg), 45.98M ops/s (median)
- Improvement: avg -1.44%, median -1.05% (DECISION: NO-GO)

Analysis:
- Regression cause: TLS cache adds overhead (branch + TLS access)
- wrapper_env_cfg() is already minimal (pointer return after simple check)
- Adding TLS caching layer makes it worse, not better
- Branch prediction penalty outweighs any potential savings

Cumulative Phase 2-3:
- B3: +2.89%, B4: +1.47%, C3: +2.20%
- D1: +1.06% (opt-in), D2: -1.44% (NO-GO)
- Total: ~7.2% (excluding D2)

Decision: FREEZE as research box (default OFF, regression confirmed)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-13 22:03:27 +09:00
-												Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established

Summary:
- D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT
  - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median)
  - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median)
  - Mean gain: +2.19%, Median gain: +2.37%
  - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%)
  - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset

- D2 (Wrapper env cache): FROZEN
  - Previous result: -1.44% regression (TLS overhead > benefit)
  - Status: Research box (do not pursue further)
  - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset)

- Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13)

Cumulative Gains (Phase 2-3):
  B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19%
  Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%)
  MID_V3 fix: +13% (structural change, Mixed OFF by default)

Documentation Updates:
  - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report
  - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status
  - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results
  - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status
  - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN
  - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status
  - CURRENT_TASK.md: Phase 3 complete summary

Next:
  - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%)
  - Or Phase 4 planning if no more D3-class targets
  - Current active optimizations: B3, B4, C3, D1, MID_V3 fix

Files Changed:
  - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines)
  - docs/analysis/*.md (6 files updated with D1/D2 results)
  - CURRENT_TASK.md (Phase 3 status update)
  - analyze_d1_results.py (statistical analysis script)
  - core/bench_profile.h (D1 promoted to default in MIXED preset)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-13 22:42:22 +09:00
**ENV Gate**: `HAKMEM_FREE_STATIC_ROUTE=1` (default: 0)

---

### Phase 3 D2: Wrapper Env Cache ❌ NO-GO (FROZEN)

**Target**: Remove the `wrapper_env_cfg()` call from the wrapper hot path
**Result**: Mixed 10-run mean **-1.44%** regression
**Decision**: ❌ NO-GO (research box frozen, default OFF)
**ENV Gate**: `HAKMEM_WRAP_ENV_CACHE=1` (default: 0)
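
The regression is easier to see when the two call shapes are placed side by side. The following is a minimal sketch, not the project's code: the struct field, the `_sketch` function names, and the initialization logic are assumptions made for illustration, and the baseline initializer here is not thread-safe. It only shows that a TLS pointer cache placed in front of an accessor that is already "check, then return pointer" adds a TLS load and a second branch rather than removing work.

```c
#include <stddef.h>

/* Hypothetical stand-in for the real wrapper env config struct. */
typedef struct {
    int wrap_shape;                 /* illustrative field only */
} wrapper_env_cfg_t;

static wrapper_env_cfg_t g_cfg;     /* process-wide config (assumed layout) */
static int g_cfg_ready;             /* 0 until the first initialization */

/* Baseline shape: one well-predicted check, then a pointer return. */
const wrapper_env_cfg_t *wrapper_env_cfg_sketch(void) {
    if (!g_cfg_ready) {             /* cold path: taken once */
        g_cfg.wrap_shape = 1;
        g_cfg_ready = 1;
    }
    return &g_cfg;
}

/* D2 shape: a per-thread cached pointer in front of the baseline. */
static _Thread_local const wrapper_env_cfg_t *t_cfg_cache;

const wrapper_env_cfg_t *wrapper_env_cfg_fast_sketch(void) {
    const wrapper_env_cfg_t *p = t_cfg_cache;   /* TLS load on every call */
    if (p == NULL) {                            /* extra branch on every call */
        p = wrapper_env_cfg_sketch();
        t_cfg_cache = p;
    }
    return p;
}
```

Both functions return the same pointer; the cached variant simply reaches it through one more load and one more branch, which is consistent with the measured -1.44%.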

---

### Phase 3 D3: Alloc Gate Specialization (MEDIUM PRIORITY)

**Target**: `tiny_alloc_gate_fast()` for LEGACY-only route
**Expected Gain**: +1-2%
**Risk**: LOW
**Effort**: 2-3 hours

**Implementation**:
1. Create `tiny_alloc_gate_fast_legacy()` specialized variant
2. Eliminate ROUTE_POOL_ONLY and ROUTE_TINY_FIRST branches
3. Use in MIXED profile where all classes are LEGACY
4. A/B test: BASELINE vs D3

**ENV Gate**: `HAKMEM_ALLOC_GATE_LEGACY_ONLY=1` (default: 0)
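
The idea behind the specialization can be sketched as follows. This is an illustration under stated assumptions, not the actual gate: the enum values, the `PATH_*` tags, and all function names below are hypothetical, and the way the ENV variable is parsed (first character equals `'1'`) is assumed rather than taken from the code base. The general gate dispatches on the route on every call; the legacy-only variant is valid only when every class uses the LEGACY route, so it has no dispatch at all.

```c
#include <stddef.h>
#include <stdlib.h>

/* Route kinds named in the text; the numeric values are illustrative. */
typedef enum {
    ROUTE_LEGACY = 0,
    ROUTE_POOL_ONLY,
    ROUTE_TINY_FIRST
} route_kind_t;

/* Hypothetical back ends, reduced to tags so the sketch stays checkable. */
enum { PATH_LEGACY = 1, PATH_POOL = 2, PATH_TINY = 3 };

/* General gate: branches on the per-class route on every call. */
int alloc_gate_general(route_kind_t route) {
    switch (route) {
    case ROUTE_POOL_ONLY:  return PATH_POOL;
    case ROUTE_TINY_FIRST: return PATH_TINY;
    default:               return PATH_LEGACY;
    }
}

/* Specialized gate: only valid when all classes are ROUTE_LEGACY,
 * so the ROUTE_POOL_ONLY and ROUTE_TINY_FIRST branches are gone. */
int alloc_gate_legacy_only(void) {
    return PATH_LEGACY;
}

/* ENV gate read (assumed parsing): 1 only if the value starts with '1'.
 * Real code would read this once at init, not on the hot path. */
int alloc_gate_legacy_only_enabled(void) {
    const char *v = getenv("HAKMEM_ALLOC_GATE_LEGACY_ONLY");
    return (v != NULL && v[0] == '1') ? 1 : 0;
}
```

For the LEGACY route the two variants must agree, which is the invariant an A/B test of D3 would rely on: only the cost of reaching the result changes, not the result.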

---

## Expected Cumulative Results (Updated)

| Phase      | Optimization                     | Expected Gain | Notes |
|------------|----------------------------------|---------------|-------|
| Baseline   | MID_V3=0 + B3+B4+C3              | -             | - |
| **D1**     | Free route cache                 | +0 to +2%     | Mean improved; median confirmation pending (default OFF) |
| **D2**     | Wrapper env cache                | -             | NO-GO (frozen) |
| **D3**     | Alloc gate specialization        | +0 to +2%     | Start only if perf shows more than 5% self% |

**With MID_V3 fix for Mixed**: +13% additional (expected ~56M ops/s total)

---

## Risk Assessment

| Optimization        | Risk Level | Mitigation                                      |
|---------------------|------------|-------------------------------------------------|
| Free route cache    | MEDIUM     | Ensure init ordering, ENV gate for rollback     |
| Wrapper env cache   | -          | NO-GO (-1.44% regression)                       |
| Alloc specialization | LOW       | Profile-specific, existing static route pattern |

**All optimizations**: Follow ENV gate + A/B test + decision pattern (research box)
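
The decision step of this pattern can be written down as a small rule. The sketch below assumes the GO criteria used for the D1 validation (mean gain >= +1.0% and median gain >= +0.0%); the function names are hypothetical and the inputs are the mean and median throughput of the baseline and optimized runs.

```c
/* Relative change of `opt` over `base`, in percent. */
double ab_gain_pct(double base, double opt) {
    return (opt - base) / base * 100.0;
}

/* Returns 1 for GO, 0 for NO-GO.
 * Assumed criteria: mean gain >= +1.0% and median gain >= 0.0%. */
int ab_decide_go(double base_mean, double opt_mean,
                 double base_median, double opt_median) {
    double mean_gain = ab_gain_pct(base_mean, opt_mean);
    double median_gain = ab_gain_pct(base_median, opt_median);
    return (mean_gain >= 1.0 && median_gain >= 0.0) ? 1 : 0;
}
```

Fed with the reported D1 20-run figures (46.30M to 47.32M mean, 46.30M to 47.39M median) the rule returns GO, and with the D2 10-run figures (46.52M to 45.85M mean, 46.47M to 45.98M median) it returns NO-GO. Gains recomputed from these rounded throughputs can differ in the second decimal from the reported percentages.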

---

## Post-D1/D2 Status (2025-12-13)

### Phase 3 D1/D2 Validation Complete ✅

1. **D1 (Free Route Cache)**: ✅ ADOPT - PROMOTED TO DEFAULT
   - 20-run validation completed
   - Results: Mean +2.19%, Median +2.37% (both criteria met)
   - Status: Added to MIXED_TINYV3_C7_SAFE preset as default
   - Implementation: `HAKMEM_FREE_STATIC_ROUTE=1`
2. **D2 (Wrapper Env Cache)**: ❌ FROZEN
   - Results: -1.44% regression
   - Status: Research box frozen, default OFF, do not pursue
   - Implementation: `HAKMEM_WRAP_ENV_CACHE=1` (opt-in only, not recommended)

### Active Optimizations in MIXED_TINYV3_C7_SAFE

1. **B3**: Routing branch shape (+2.89% proven)
2. **B4**: Wrapper hot/cold split (+1.47% proven)
3. **C3**: Static routing (+2.20% proven)
4. **D1**: Free route cache (+2.19% proven) - NEW
5. **MID_V3**: OFF for Mixed (C6 routing fix, +13% proven)

**Cumulative gain**: ~7.6% (B3 + B4 + C3 + D1, excluding MID_V3 fix)
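
Per-step gains can be combined in two ways: added up, or compounded. The helper below is a small sketch for checking such figures (the function names are hypothetical). Applied to the four listed steps (2.89, 1.47, 2.20, 2.19), plain addition gives 8.75% and compounding gives about 9.04%; the ~7.6% quoted above is lower than both, and this section does not state how it was derived.

```c
#include <stddef.h>

/* Sum of per-step gains, in percent (ignores compounding). */
double gains_additive_pct(const double *gains_pct, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += gains_pct[i];
    }
    return sum;
}

/* Compounded gain, in percent: product of (1 + g/100), minus 1. */
double gains_compounded_pct(const double *gains_pct, size_t n) {
    double factor = 1.0;
    for (size_t i = 0; i < n; i++) {
        factor *= 1.0 + gains_pct[i] / 100.0;
    }
    return (factor - 1.0) * 100.0;
}
```

Compounding is the appropriate model when each step was measured against a baseline that already included the previous steps; addition slightly understates the total in that case.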

### Next Actions

-												Phase 3 D2: Wrapper Env Cache - [DECISION: NO-GO]

Target: Reduce wrapper_env_cfg() overhead in malloc/free hot path
- Strategy: Cache wrapper env configuration pointer in TLS
- Approach: Fast pointer cache (TLS caches const wrapper_env_cfg_t*)

Implementation:
- core/box/wrapper_env_cache_env_box.h: ENV gate (HAKMEM_WRAP_ENV_CACHE)
- core/box/wrapper_env_cache_box.h: TLS cache layer (wrapper_env_cfg_fast)
- core/box/hak_wrappers.inc.h: Integration into malloc/free hot paths
- ENV gate: HAKMEM_WRAP_ENV_CACHE=0/1 (default OFF)

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (D2=0): 46.52M ops/s (avg), 46.47M ops/s (median)
- Optimized (D2=1): 45.85M ops/s (avg), 45.98M ops/s (median)
- Improvement: avg -1.44%, median -1.05% (DECISION: NO-GO)

Analysis:
- Regression cause: TLS cache adds overhead (branch + TLS access)
- wrapper_env_cfg() is already minimal (pointer return after simple check)
- Adding TLS caching layer makes it worse, not better
- Branch prediction penalty outweighs any potential savings

Cumulative Phase 2-3:
- B3: +2.89%, B4: +1.47%, C3: +2.20%
- D1: +1.06% (opt-in), D2: -1.44% (NO-GO)
- Total: ~7.2% (excluding D2)

Decision: FREEZE as research box (default OFF, regression confirmed)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
											2025-12-13 22:03:27 +09:00
-												Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established

Summary:
- D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT
  - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median)
  - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median)
  - Mean gain: +2.19%, Median gain: +2.37%
  - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%)
  - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset

- D2 (Wrapper env cache): FROZEN
  - Previous result: -1.44% regression (TLS overhead > benefit)
  - Status: Research box (do not pursue further)
  - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset)

- Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13)

Cumulative Gains (Phase 2-3):
  B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19%
  Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%)
  MID_V3 fix: +13% (structural change, Mixed OFF by default)

Documentation Updates:
  - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report
  - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status
  - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results
  - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status
  - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN
  - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status
  - CURRENT_TASK.md: Phase 3 complete summary

Next:
  - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%)
  - Or Phase 4 planning if no more D3-class targets
  - Current active optimizations: B3, B4, C3, D1, MID_V3 fix

Files Changed:
  - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines)
  - docs/analysis/*.md (6 files updated with D1/D2 results)
  - CURRENT_TASK.md (Phase 3 status update)
  - analyze_d1_results.py (statistical analysis script)
  - core/bench_profile.h (D1 promoted to default in MIXED preset)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

											
										
										
2025-12-13 22:42:22 +09:00

- **Profile**: Run perf on the current baseline to identify the next targets
   - Requirement: self% ≥5% for Phase 3 D3 consideration
   - Target: `tiny_alloc_gate_fast` specialization
- **Optional**: Phase 3 D3 (Alloc gate specialization), pending perf validation
   - Only proceed if perf shows ≥5% self% in the alloc gate
   - ENV: `HAKMEM_ALLOC_GATE_LEGACY_ONLY=0/1`
- **Phase 4 Planning**: If no targets with ≥5% self% remain, prepare the Phase 4 roadmap

---

## Appendix: Raw Perf Data

### Full Perf Report (Top 20)

```
# Samples: 30  of event 'cycles:P'
# Event count (approx.): 921849973
 .11%     0.00%             0  [.] 0000000000000000
 .20%    28.95%             3  [.] free
 .11%    12.75%             3  [.] tiny_alloc_gate_fast.lto_priv.0
 .78%     4.39%             2  [.] tiny_route_for_class.lto_priv.0
 .00%    12.53%             3  [.] main
 .71%    12.43%             3  [.] malloc
 .95%     4.27%             1  [.] tiny_region_id_write_header.lto_priv.0
 .66%     4.39%             1  [.] tiny_c7_ultra_free
 .56%     4.28%             1  [.] free_tiny_fast_cold.lto_priv.0
 .85%     7.75%             2  [.] tiny_front_v3_enabled.lto_priv.0
 .27%     0.00%             0  [.] 0x00007ad3a9c2d001
 .23%     0.00%             0  [.] tiny_c7_ultra_enabled_env.lto_priv.0
 .21%     0.00%             0  [.] 0x00007ad3ab960c81
 .20%     0.00%             0  [.] 0x00007ad3ab939401
 .15%     4.15%             1  [.] free.cold
 .15%     0.00%             0  [.] unified_cache_push.lto_priv.0
 .02%     4.02%             1  [.] hak_pool_free
```
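
The self% ≥5% requirement for D3 candidates can be checked mechanically against a report like the one above. The following is a minimal sketch, not part of the hakmem tooling: the function name `hot_symbols`, the regular expression, and the assumed column layout (first percentage column, self%, sample count, `[.]`, symbol) are illustrative assumptions based only on the rows shown in this appendix.

```python
import re

# Assumed row layout, taken from the rows in this appendix:
#   <first-column %>  <self %>  <samples>  [.] <symbol>
# The first column is skipped because only self% matters for the threshold.
ROW = re.compile(r"^\s*\S*%\s+(\d+(?:\.\d+)?)%\s+(\d+)\s+\[\.\]\s+(\S+)")


def hot_symbols(report_text, min_self_pct=5.0):
    """Return (symbol, self_pct, samples) rows at or above the threshold."""
    hits = []
    for line in report_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, comment, or blank line
        self_pct = float(match.group(1))
        if self_pct >= min_self_pct:
            hits.append((match.group(3), self_pct, int(match.group(2))))
    # Highest self% first, so the strongest candidates lead the list.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# A few rows copied from the report above.
SAMPLE = """\
 .20%    28.95%             3  [.] free
 .11%    12.75%             3  [.] tiny_alloc_gate_fast.lto_priv.0
 .78%     4.39%             2  [.] tiny_route_for_class.lto_priv.0
 .00%    12.53%             3  [.] main
 .71%    12.43%             3  [.] malloc
 .85%     7.75%             2  [.] tiny_front_v3_enabled.lto_priv.0
"""

candidates = hot_symbols(SAMPLE)
for symbol, self_pct, samples in candidates:
    print(f"{self_pct:6.2f}%  samples={samples}  {symbol}")
```

The sample count is returned alongside self% because this report contains only 30 samples in total; a caller may want to require a minimum sample count before treating a percentage as a reliable signal.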

### Baseline Run Details

**Run 1**: 46.84M ops/s
```
Throughput =  46841499 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30208
```

**Run 2**: 46.79M ops/s
```
Throughput =  46793317 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080
```

**Run 3**: 45.77M ops/s
```
Throughput =  45772756 ops/s [iter=1000000 ws=400] time=0.022s
[RSS] max_kb=34176
```

**Run 4**: 47.12M ops/s
```
Throughput =  47117176 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080
```

**Run 5**: 42.36M ops/s (outlier)
```
Throughput =  42359615 ops/s [iter=1000000 ws=400] time=0.024s
[RSS] max_kb=30080
```
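
The baseline statistics can be recomputed directly from the five throughput lines above. This is a small sketch rather than the project's `analyze_d1_results.py`, whose contents are not shown here. The function name `summarize` is hypothetical. The population standard deviation is used as an assumption, since the document does not say which estimator produced the ~1.75M figure.

```python
import statistics

# Throughput values (ops/s) copied from the five baseline runs above.
RUNS_OPS = [46841499, 46793317, 45772756, 47117176, 42359615]


def summarize(runs):
    """Summary statistics for a list of throughput samples (ops/s)."""
    mean = statistics.fmean(runs)
    # Assumption: population standard deviation (divide by N, not N-1).
    spread = statistics.pstdev(runs)
    return {
        "mean": mean,
        "median": statistics.median(runs),
        "min": min(runs),
        "max": max(runs),
        "range": max(runs) - min(runs),
        "pstdev": spread,
        "cv_pct": 100.0 * spread / mean,  # spread as a percentage of the mean
    }


summary = summarize(RUNS_OPS)
for key, value in summary.items():
    if key == "cv_pct":
        print(f"{key:>7}: {value:.2f}%")
    else:
        print(f"{key:>7}: {value / 1e6:.2f}M ops/s")
```

With only five samples, a single slow run such as Run 5 pulls the mean well below the median, which is why the median is the more robust reference value here.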

---

## Document History

- **2025-12-13**: Initial baseline establishment and candidate analysis
- **Next**: Phase 3 D1 implementation (Free route cache)