# Phase 83-1: Switch Dispatch Fixed Mode - A/B Test Results

## Objective

Remove per-operation ENV gate overhead from `tiny_inline_slots_switch_dispatch_enabled()` by pre-computing the decision at the bench_profile boundary.

**Pattern**: Phase 78-1 replication (inline slots fixed mode)
**Expected Gain**: +0.3-1.0% (branch reduction)

## Implementation Summary

### Box Theory Design

- **Boundary**: bench_profile calls `tiny_inline_slots_switch_dispatch_fixed_refresh_from_env()` after applying the putenv defaults
- **Hot path**: `tiny_inline_slots_switch_dispatch_enabled_fast()` reads the cached global when FIXED=1
- **Reversible**: toggle `HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0/1`

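The three points above can be sketched roughly as follows. This is a minimal illustration, not the contents of `tiny_inline_slots_switch_dispatch_fixed_box.h`: the two cache globals and the `env_flag()` helper are hypothetical names, and the fallback to the lazy-init gate when FIXED=0 is an assumption about how the reversible toggle behaves.

```c
#include <stdlib.h>

/* Hypothetical cache globals; the real names in the box header may differ. */
static int g_switch_dispatch_fixed_mode  = 0;  /* 1 = use the pre-computed value */
static int g_switch_dispatch_fixed_value = 0;  /* cached SWITCHDISPATCH decision */

/* Hypothetical helper: a flag is on when set, non-empty, and not starting with '0'. */
static int env_flag(const char *name) {
    const char *e = getenv(name);
    return (e && *e && *e != '0') ? 1 : 0;
}

/* Lazy-init gate: reads the environment on the first call only. */
static int tiny_inline_slots_switch_dispatch_enabled(void) {
    static int cached = -1;  /* -1 = uncached */
    if (__builtin_expect(cached == -1, 0)) {
        cached = env_flag("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH");
    }
    return cached;
}

/* Boundary: called once by bench_profile after the putenv defaults are applied. */
void tiny_inline_slots_switch_dispatch_fixed_refresh_from_env(void) {
    g_switch_dispatch_fixed_mode  = env_flag("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED");
    g_switch_dispatch_fixed_value = env_flag("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH");
}

/* Hot path: with FIXED=1 this is a plain global read; otherwise it falls back
 * to the lazy-init gate (assumed behavior for the reversible toggle). */
static inline int tiny_inline_slots_switch_dispatch_enabled_fast(void) {
    if (g_switch_dispatch_fixed_mode) {
        return g_switch_dispatch_fixed_value;
    }
    return tiny_inline_slots_switch_dispatch_enabled();
}
```

In this sketch the fast path still tests one global before returning another, so it trades one well-predicted branch for a different one. That is consistent with the near-zero branch delta measured below.
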
### Files Created

1. `core/box/tiny_inline_slots_switch_dispatch_fixed_box.h` - Fast-path API + global cache
2. `core/box/tiny_inline_slots_switch_dispatch_fixed_box.c` - Refresh implementation

### Files Modified

1. `core/box/tiny_front_hot_box.h` - Alloc path: `_enabled()` → `_enabled_fast()`
2. `core/box/tiny_legacy_fallback_box.h` - Free path: `_enabled()` → `_enabled_fast()`
3. `Makefile` - Added `tiny_inline_slots_switch_dispatch_fixed_box.o`

## A/B Test Results

### Quick Check (3-run)

**Baseline (FIXED=0, SWITCH=1)**:
- Run 1: 54.12 M ops/s
- Run 2: 55.01 M ops/s
- Run 3: 52.95 M ops/s
- **Mean: 54.02 M ops/s**

**Treatment (FIXED=1, SWITCH=1)**:
- Run 1: 54.57 M ops/s
- Run 2: 54.17 M ops/s
- Run 3: 53.94 M ops/s
- **Mean: 54.23 M ops/s**

**Quick Check Gain: +0.39%** (+0.21 M ops/s)

### Full Test (10-run)

**Baseline (FIXED=0, SWITCH=1)**:
```
Run 1: 54.13 M ops/s
Run 2: 54.14 M ops/s
Run 3: 51.30 M ops/s
Run 4: 52.75 M ops/s
Run 5: 52.68 M ops/s
Run 6: 53.75 M ops/s
Run 7: 53.44 M ops/s
Run 8: 53.33 M ops/s
Run 9: 53.43 M ops/s
Run 10: 52.73 M ops/s
Mean: 53.17 M ops/s
```

**Treatment (FIXED=1, SWITCH=1)**:
```
Run 1: 52.35 M ops/s
Run 2: 52.87 M ops/s
Run 3: 54.36 M ops/s
Run 4: 53.13 M ops/s
Run 5: 52.36 M ops/s
Run 6: 54.12 M ops/s
Run 7: 53.55 M ops/s
Run 8: 53.76 M ops/s
Run 9: 53.81 M ops/s
Run 10: 53.12 M ops/s
Mean: 53.34 M ops/s
```

**Full Test Gain: +0.32%** (+0.17 M ops/s)

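The 10-run summary can be reproduced from the listed samples with a few lines of C. This is only a sketch for checking the arithmetic. The helper names are hypothetical and the samples are the rounded values from the tables above, so the printed gain can differ in the last digit from the reported figure, which was derived from the rounded means. The min-max spread is included because it gives a rough sense of run-to-run noise: about 2.8 M ops/s for the baseline and 2.0 M ops/s for the treatment, against a mean difference of under 0.2 M ops/s.

```c
#include <stddef.h>
#include <stdio.h>

/* Throughput samples in M ops/s, copied from the 10-run tables. */
static const double baseline[]  = {54.13, 54.14, 51.30, 52.75, 52.68,
                                   53.75, 53.44, 53.33, 53.43, 52.73};
static const double treatment[] = {52.35, 52.87, 54.36, 53.13, 52.36,
                                   54.12, 53.55, 53.76, 53.81, 53.12};

/* Arithmetic mean of n samples. */
static double mean(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (double)n;
}

/* Min-max spread: a crude indicator of run-to-run noise. */
static double spread(const double *x, size_t n) {
    double lo = x[0], hi = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    return hi - lo;
}

/* Relative gain of the treatment over the baseline, in percent. */
static double gain_percent(double base, double treat) {
    return (treat - base) / base * 100.0;
}

/* Prints means, spreads, and the relative gain for the A/B pair. */
static void print_ab_summary(void) {
    const size_t n = sizeof baseline / sizeof baseline[0];
    double mb = mean(baseline, n), mt = mean(treatment, n);
    printf("baseline : mean %.2f, spread %.2f M ops/s\n", mb, spread(baseline, n));
    printf("treatment: mean %.2f, spread %.2f M ops/s\n", mt, spread(treatment, n));
    printf("gain     : %+.2f%%\n", gain_percent(mb, mt));
}
```

Calling `print_ab_summary()` from a small `main()` prints the figures for both configurations.
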
## perf stat Analysis

### Baseline (FIXED=0, SWITCH=1)
```
Throughput: 54.07 M ops/s
Cycles: 1,697,024,527
Instructions: 3,515,034,248 (2.07 IPC)
Branches: 893,509,797
Branch-misses: 28,621,855 (3.20%)
```

### Treatment (FIXED=1, SWITCH=1)
```
Throughput: 53.98 M ops/s
Cycles: 1,706,618,243
Instructions: 3,513,893,603 (2.06 IPC)
Branches: 893,343,014
Branch-misses: 28,582,157 (3.20%)
```

### perf stat Delta

| Metric | Baseline | Treatment | Delta | % Change |
|--------|----------|-----------|-------|----------|
| Throughput | 54.07 M | 53.98 M | -0.09 M | -0.17% |
| Cycles | 1,697M | 1,707M | +10M | +0.56% |
| Instructions | 3,515M | 3,514M | -1M | -0.03% |
| Branches | 893.5M | 893.3M | -0.2M | **-0.02%** |
| Branch-misses | 28.6M | 28.6M | -0.04M | -0.14% |

**Key Finding**: Branch reduction is negligible (-0.02%). The -0.17% throughput delta comes from a single perf stat run and is within noise.

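For reference, the percentage column and the IPC figures follow directly from the raw counters. The sketch below recomputes them. The `perf_counters` struct and the function names are hypothetical, introduced only for this illustration, and the counter values are copied from the two perf stat blocks above.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the four counters reported by perf stat. */
typedef struct {
    uint64_t cycles;
    uint64_t instructions;
    uint64_t branches;
    uint64_t branch_misses;
} perf_counters;

/* Raw counter values from the baseline and treatment runs. */
static const perf_counters perf_baseline  = {1697024527ULL, 3515034248ULL,
                                             893509797ULL, 28621855ULL};
static const perf_counters perf_treatment = {1706618243ULL, 3513893603ULL,
                                             893343014ULL, 28582157ULL};

/* Relative change from baseline to treatment, in percent. */
static double pct_change(uint64_t base, uint64_t treat) {
    return ((double)treat - (double)base) / (double)base * 100.0;
}

/* Instructions retired per cycle. */
static double ipc(const perf_counters *c) {
    return (double)c->instructions / (double)c->cycles;
}

/* Prints one line per counter plus the IPC of both runs. */
static void print_perf_delta(void) {
    const perf_counters *b = &perf_baseline, *t = &perf_treatment;
    printf("cycles        %+.2f%%\n", pct_change(b->cycles, t->cycles));
    printf("instructions  %+.2f%%\n", pct_change(b->instructions, t->instructions));
    printf("branches      %+.2f%%\n", pct_change(b->branches, t->branches));
    printf("branch-misses %+.2f%%\n", pct_change(b->branch_misses, t->branch_misses));
    printf("IPC           %.2f -> %.2f\n", ipc(b), ipc(t));
}
```

The branch count moves by roughly 0.17 M out of 893 M, which is why the table reports the change as essentially zero.
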
## Analysis

### Expected vs Actual

- **Expected**: +0.3-1.0% gain via branch reduction (Phase 78-1 pattern)
- **Actual**: +0.32% gain (10-run average)
- **Branch reduction**: -0.02% (essentially zero)

### Interpretation

1. **Marginal Gain**: +0.32% is at the very bottom of the expected range
2. **No Branch Reduction**: the -0.02% branch count change is within noise
3. **High Variance**: the single perf stat run shows -0.17%, contradicting the 10-run +0.32%
4. **Pattern Mismatch**: Phase 78-1 achieved +2.31% with clear branch reduction

### Root Cause Hypothesis

The optimization targets `tiny_inline_slots_switch_dispatch_enabled()`, which already uses a static lazy-init cache:

```c
#include <stdlib.h>  // getenv

static inline int tiny_inline_slots_switch_dispatch_enabled(void) {
    static int g_switch_dispatch_enabled = -1;  // -1 = uncached
    if (__builtin_expect(g_switch_dispatch_enabled == -1, 0)) {
        // First call only: read the ENV variable and cache the result
        const char* e = getenv("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH");
        g_switch_dispatch_enabled = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_switch_dispatch_enabled;
}
```

**Issue**: After the first call, `g_switch_dispatch_enabled == -1` is always false, so the branch is predicted correctly every time. The compiler and CPU already reduce this check to near-zero cost, which leaves little overhead for the fixed mode to remove.

**Contrast with Phase 78-1**: That phase optimized the per-class ENV gates (`tiny_c4_inline_slots_enabled()` etc.), which are called thousands of times per benchmark run. The switch dispatch check is called once per alloc/free operation, but the lazy-init pattern already eliminates most of its overhead.

## Decision Gate

**GO Threshold**: +1.0%
**Actual Result**: +0.32%

**Status**: ❌ **NO-GO** (below threshold, negligible branch reduction)

### Recommendations

1. **Do not promote** `SWITCHDISPATCH_FIXED=1` to SSOT
2. **Keep the code** as a research box (reversible design preserved)
3. **Phase 78-1 pattern** is not applicable to lazy-init ENV gates (diminishing returns)

## ENV Variables

### Baseline (Phase 80-1 mode)
```bash
HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0  # Disabled (lazy-init)
HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1        # Switch dispatch ON
```

### Treatment (Phase 83-1 mode)
```bash
HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=1  # Enabled (startup cache)
HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1        # Switch dispatch ON
```

## Next Steps

1. ✅ **Phase 80-1**: Switch dispatch remains in SSOT (+1.65% STRONG GO)
2. ❌ **Phase 83-1**: Fixed mode NOT promoted (marginal gain)
3. 🔬 **Research**: Investigate other optimization opportunities beyond ENV gate overhead

---

**Phase 83-1 Conclusion**: NO-GO due to marginal gain (+0.32%) and negligible branch reduction. The lazy-init pattern already optimizes ENV gate overhead effectively.