hakmem/docs/analysis/PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md

# Phase 83-1: Switch Dispatch Fixed Mode - A/B Test Results

## Objective
Remove the per-operation ENV gate overhead of `tiny_inline_slots_switch_dispatch_enabled()` by pre-computing the decision once at the bench_profile boundary.

- **Pattern**: Phase 78-1 replication (inline slots fixed mode)
- **Expected Gain**: +0.3-1.0% (branch reduction)

## Implementation Summary

### Box Theory Design
- **Boundary**: bench_profile calls `tiny_inline_slots_switch_dispatch_fixed_refresh_from_env()` after putenv defaults
- **Hot path**: `tiny_inline_slots_switch_dispatch_enabled_fast()` reads cached global when FIXED=1
- **Reversible**: toggle HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0/1
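
The boundary/hot-path split above can be sketched roughly as follows. This is a minimal illustration, not the contents of `tiny_inline_slots_switch_dispatch_fixed_box.h`: the two function names and the ENV variable names come from this document, while the globals `g_sd_fixed_mode` and `g_sd_fixed_value` and the helper `env_flag()` are hypothetical, and the ENV parsing rule is assumed to match the lazy-init gate.

```c
#include <stdlib.h>

/* Hypothetical cache globals; the real box may name or store these differently. */
static int g_sd_fixed_mode  = 0;  /* 1 when ..._SWITCHDISPATCH_FIXED is enabled */
static int g_sd_fixed_value = 0;  /* cached ..._SWITCHDISPATCH decision */

/* Hypothetical helper: a flag is ON when set, non-empty, and not starting with '0'. */
static int env_flag(const char *name) {
    const char *e = getenv(name);
    return (e && *e && *e != '0') ? 1 : 0;
}

/* Boundary: intended to be called once by bench_profile after putenv defaults. */
void tiny_inline_slots_switch_dispatch_fixed_refresh_from_env(void) {
    g_sd_fixed_mode  = env_flag("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED");
    g_sd_fixed_value = env_flag("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH");
}

/* Existing lazy-init gate, used as the fallback when FIXED=0. */
static inline int tiny_inline_slots_switch_dispatch_enabled(void) {
    static int g_switch_dispatch_enabled = -1;  /* -1 = uncached */
    if (__builtin_expect(g_switch_dispatch_enabled == -1, 0)) {
        g_switch_dispatch_enabled = env_flag("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH");
    }
    return g_switch_dispatch_enabled;
}

/* Hot path: read the cached global when FIXED=1, otherwise use the lazy gate. */
static inline int tiny_inline_slots_switch_dispatch_enabled_fast(void) {
    if (__builtin_expect(g_sd_fixed_mode, 1)) {
        return g_sd_fixed_value;
    }
    return tiny_inline_slots_switch_dispatch_enabled();
}
```

Under these assumptions the fast path still executes one predictable branch and one load per call, the same shape as the lazy-init gate it replaces.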

### Files Created
1. `core/box/tiny_inline_slots_switch_dispatch_fixed_box.h` - Fast-path API + global cache
2. `core/box/tiny_inline_slots_switch_dispatch_fixed_box.c` - Refresh implementation

### Files Modified
1. `core/box/tiny_front_hot_box.h` - Alloc path: `_enabled()` → `_enabled_fast()`
2. `core/box/tiny_legacy_fallback_box.h` - Free path: `_enabled()` → `_enabled_fast()`
3. `Makefile` - Added `tiny_inline_slots_switch_dispatch_fixed_box.o`

## A/B Test Results

### Quick Check (3-run)
**Baseline (FIXED=0, SWITCH=1)**:
- Run 1: 54.12 M ops/s
- Run 2: 55.01 M ops/s
- Run 3: 52.95 M ops/s
- **Mean: 54.02 M ops/s**

**Treatment (FIXED=1, SWITCH=1)**:
- Run 1: 54.57 M ops/s
- Run 2: 54.17 M ops/s
- Run 3: 53.94 M ops/s
- **Mean: 54.23 M ops/s**

**Quick Check Gain: +0.39%** (+0.21 M ops/s)

### Full Test (10-run)
**Baseline (FIXED=0, SWITCH=1)**:
```
Run 1:  54.13 M ops/s
Run 2:  54.14 M ops/s
Run 3:  51.30 M ops/s
Run 4:  52.75 M ops/s
Run 5:  52.68 M ops/s
Run 6:  53.75 M ops/s
Run 7:  53.44 M ops/s
Run 8:  53.33 M ops/s
Run 9:  53.43 M ops/s
Run 10: 52.73 M ops/s
Mean: 53.17 M ops/s
```

**Treatment (FIXED=1, SWITCH=1)**:
```
Run 1:  52.35 M ops/s
Run 2:  52.87 M ops/s
Run 3:  54.36 M ops/s
Run 4:  53.13 M ops/s
Run 5:  52.36 M ops/s
Run 6:  54.12 M ops/s
Run 7:  53.55 M ops/s
Run 8:  53.76 M ops/s
Run 9:  53.81 M ops/s
Run 10: 53.12 M ops/s
Mean: 53.34 M ops/s
```

**Full Test Gain: +0.32%** (+0.17 M ops/s)
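
For reference, the means and the gain can be recomputed from the per-run figures listed above. The sketch below is illustrative only; the helper names (`mean_of`, `range_of`, `gain_percent`) are hypothetical and not part of hakmem, and the arrays simply copy the 10-run tables.

```c
#include <stddef.h>

/* 10-run throughput samples in M ops/s, copied from the tables above. */
static const double baseline_runs[10] = {
    54.13, 54.14, 51.30, 52.75, 52.68, 53.75, 53.44, 53.33, 53.43, 52.73
};
static const double treatment_runs[10] = {
    52.35, 52.87, 54.36, 53.13, 52.36, 54.12, 53.55, 53.76, 53.81, 53.12
};

/* Arithmetic mean of n samples. */
static double mean_of(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (double)n;
}

/* Spread between the fastest and the slowest run. */
static double range_of(const double *x, size_t n) {
    double lo = x[0], hi = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    return hi - lo;
}

/* Relative gain of the treatment mean over the baseline mean, in percent. */
static double gain_percent(double baseline_mean, double treatment_mean) {
    return (treatment_mean - baseline_mean) / baseline_mean * 100.0;
}
```

Calling `gain_percent(mean_of(baseline_runs, 10), mean_of(treatment_runs, 10))` reproduces the gain to within rounding of the means. `range_of(baseline_runs, 10)` gives 2.84 M ops/s, about 5% of the baseline mean, which is far wider than the measured gain.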

## perf stat Analysis

### Baseline (FIXED=0, SWITCH=1)
```
Throughput:        54.07 M ops/s
Cycles:            1,697,024,527
Instructions:      3,515,034,248 (2.07 IPC)
Branches:          893,509,797
Branch-misses:     28,621,855 (3.20%)
```

### Treatment (FIXED=1, SWITCH=1)
```
Throughput:        53.98 M ops/s
Cycles:            1,706,618,243
Instructions:      3,513,893,603 (2.06 IPC)
Branches:          893,343,014
Branch-misses:     28,582,157 (3.20%)
```

### perf stat Delta
| Metric | Baseline | Treatment | Delta | % Change |
|--------|----------|-----------|-------|----------|
| Throughput | 54.07 M | 53.98 M | -0.09 M | -0.17% |
| Cycles | 1,697M | 1,707M | +10M | +0.57% |
| Instructions | 3,515M | 3,514M | -1M | -0.03% |
| Branches | 893.5M | 893.3M | -0.2M | **-0.02%** |
| Branch-misses | 28.6M | 28.6M | -0.04M | -0.14% |

**Key Finding**: Branch reduction is negligible (-0.02%). The throughput delta of a single perf stat run (-0.17%) is within run-to-run noise and should not be read as a regression.
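
The derived columns in the delta table (IPC, branch-miss rate, % change) follow directly from the raw counters. A minimal sketch of that arithmetic is shown below; the struct and function names are hypothetical, and the counter values are copied from the two perf stat blocks above.

```c
#include <stdint.h>

/* Raw perf stat counters for one run (hypothetical struct for illustration). */
typedef struct {
    uint64_t cycles;
    uint64_t instructions;
    uint64_t branches;
    uint64_t branch_misses;
} perf_counters;

static const perf_counters perf_baseline  = {
    1697024527ULL, 3515034248ULL, 893509797ULL, 28621855ULL
};
static const perf_counters perf_treatment = {
    1706618243ULL, 3513893603ULL, 893343014ULL, 28582157ULL
};

/* Instructions retired per cycle. */
static double ipc(const perf_counters *c) {
    return (double)c->instructions / (double)c->cycles;
}

/* Branch mispredictions as a percentage of all branches. */
static double branch_miss_rate_pct(const perf_counters *c) {
    return (double)c->branch_misses / (double)c->branches * 100.0;
}

/* Signed percentage change from a baseline counter to a treatment counter. */
static double pct_change(uint64_t base, uint64_t treat) {
    return ((double)treat - (double)base) / (double)base * 100.0;
}
```

For example, `pct_change(perf_baseline.branches, perf_treatment.branches)` yields the roughly -0.02% branch delta in the table.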

## Analysis

### Expected vs Actual
- **Expected**: +0.3-1.0% gain via branch reduction (Phase 78-1 pattern)
- **Actual**: +0.32% gain (10-run average)
- **Branch reduction**: -0.02% (essentially zero)

### Interpretation
1. **Marginal Gain**: +0.32% is at the very bottom of the expected range
2. **No Branch Reduction**: -0.02% branch count change is within noise
3. **High Variance**: the single perf stat run shows -0.17% throughput, the opposite sign of the 10-run +0.32%, so the measured gain is within run-to-run noise
4. **Pattern Mismatch**: Phase 78-1 achieved +2.31% with clear branch reduction

### Root Cause Hypothesis
The optimization targets `tiny_inline_slots_switch_dispatch_enabled()` which uses a static lazy-init cache:
```c
#include <stdlib.h>  // getenv

static inline int tiny_inline_slots_switch_dispatch_enabled(void) {
    static int g_switch_dispatch_enabled = -1;  // -1 = uncached
    if (__builtin_expect(g_switch_dispatch_enabled == -1, 0)) {
        // First call only
        const char* e = getenv("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH");
        g_switch_dispatch_enabled = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_switch_dispatch_enabled;
}
```

**Issue**: After the first call, the `g_switch_dispatch_enabled == -1` branch is never taken, so the CPU predicts it correctly every time. Combined with the `__builtin_expect` hint, the check already costs close to nothing.

**Contrast with Phase 78-1**: That phase optimized per-class ENV gates (`tiny_c4_inline_slots_enabled()` etc.), which are called thousands of times per benchmark run. The switch dispatch check is called once per alloc/free operation, but its lazy-init cache already eliminates most of the overhead.

## Decision Gate

- **GO Threshold**: +1.0%
- **Actual Result**: +0.32%

**Status**: ❌ **NO-GO** (below threshold, negligible branch reduction)

### Recommendations
1. **Do not promote** SWITCHDISPATCH_FIXED=1 to SSOT
2. **Keep code** as research box (reversible design preserved)
3. **Phase 78-1 pattern** not applicable to lazy-init ENV gates (diminishing returns)

## ENV Variables

### Baseline (Phase 80-1 mode)
```bash
HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0  # Disabled (lazy-init)
HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1        # Switch dispatch ON
```

### Treatment (Phase 83-1 mode)
```bash
HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=1  # Enabled (startup cache)
HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1        # Switch dispatch ON
```

## Next Steps

1. ✅ **Phase 80-1**: Switch dispatch remains in SSOT (+1.65% STRONG GO)
2. ❌ **Phase 83-1**: Fixed mode NOT promoted (marginal gain)
3. 🔬 **Research**: Investigate other optimization opportunities beyond ENV gate overhead

---

**Phase 83-1 Conclusion**: NO-GO due to marginal gain (+0.32%) and negligible branch reduction. Lazy-init pattern already optimizes ENV gate overhead effectively.

Phase 83-1 + Allocator Comparison: Switch dispatch fixed (NO-GO +0.32%), PROFILE correction, SCORECARD update

Key changes:
- Phase 83-1: Switch dispatch fixed mode (`tiny_inline_slots_switch_dispatch_fixed_box`)
  - NO-GO (marginal +0.32%, branch reduction negligible)
  - Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns
- Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M):
  - tcmalloc: 115.26M (92.33% of mimalloc)
  - jemalloc: 97.39M (77.96% of mimalloc)
  - system: 85.20M (68.24% of mimalloc)
  - mimalloc: 124.82M (baseline)
- hakmem PROFILE correction: `scripts/run_mixed_10_cleanenv.sh` + `run_allocator_quick_matrix.sh`
  - PROFILE explicitly set to MIXED_TINYV3_C7_SAFE for hakmem measurements
  - Result: baseline stabilized to 55.53M (44.46% of mimalloc)
  - Previous unstable measurement (35.57M) was due to profile leak
- Documentation:
  - PERFORMANCE_TARGETS_SCORECARD.md: Reference allocators + M1/M2 milestone status
  - PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md: Phase 83-1 analysis (NO-GO)
  - ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md: Quick comparison procedure
  - ALLOCATOR_COMPARISON_SSOT.md: Detailed SSOT methodology
- M2 milestone status: 44.46% (target 55%, gap -10.54pp) - structural improvements needed

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-18 18:50:00 +09:00
