hakmem/docs/analysis/PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md
Moe Charm (CI) 89a9212700 Phase 83-1 + Allocator Comparison: Switch dispatch fixed (NO-GO +0.32%), PROFILE correction, SCORECARD update
Key changes:
- Phase 83-1: Switch dispatch fixed mode (tiny_inline_slots_switch_dispatch_fixed_box) - NO-GO (marginal +0.32%, branch reduction negligible)
  Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns

- Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M):
  tcmalloc: 115.26M (92.33% of mimalloc)
  jemalloc: 97.39M (77.96% of mimalloc)
  system: 85.20M (68.24% of mimalloc)
  mimalloc: 124.82M (baseline)

- hakmem PROFILE correction: scripts/run_mixed_10_cleanenv.sh + run_allocator_quick_matrix.sh
  PROFILE explicitly set to MIXED_TINYV3_C7_SAFE for hakmem measurements
  Result: baseline stabilized to 55.53M (44.46% of mimalloc)
  Previous unstable measurement (35.57M) was due to profile leak

- Documentation:
  * PERFORMANCE_TARGETS_SCORECARD.md: Reference allocators + M1/M2 milestone status
  * PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md: Phase 83-1 analysis (NO-GO)
  * ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md: Quick comparison procedure
  * ALLOCATOR_COMPARISON_SSOT.md: Detailed SSOT methodology

- M2 milestone status: 44.46% (target 55%, gap -10.54pp) - structural improvements needed

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-18 18:50:00 +09:00


Phase 83-1: Switch Dispatch Fixed Mode - A/B Test Results

Objective

Remove the per-operation ENV gate overhead of tiny_inline_slots_switch_dispatch_enabled() by pre-computing the decision once at the bench_profile boundary.

Pattern: Phase 78-1 replication (inline slots fixed mode)

Expected Gain: +0.3-1.0% (branch reduction)

Implementation Summary

Box Theory Design

  • Boundary: bench_profile calls tiny_inline_slots_switch_dispatch_fixed_refresh_from_env() after putenv defaults
  • Hot path: tiny_inline_slots_switch_dispatch_enabled_fast() reads cached global when FIXED=1
  • Reversible: toggle HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0/1
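The three bullets above can be sketched roughly as follows. This is a minimal illustration, not the contents of the actual box files: the shortened function names, the two globals, and the env_flag() helper are hypothetical, and only the two ENV variable names come from this document.

```c
#include <stdlib.h>

/* Hypothetical globals; the real box may name and lay these out differently. */
static int g_fixed_mode = 0;      /* 1 = trust g_cached_enabled on the hot path */
static int g_cached_enabled = 0;  /* switch-dispatch decision taken at startup  */

/* Treat an ENV value as a flag: unset, empty, or a leading '0' means off. */
static int env_flag(const char *value) {
    return (value && *value && *value != '0') ? 1 : 0;
}

/* Boundary: call once, after the profile has applied its putenv defaults. */
static void switch_dispatch_fixed_refresh_from_env(void) {
    g_fixed_mode = env_flag(getenv("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED"));
    g_cached_enabled = env_flag(getenv("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH"));
}

/* Existing behaviour: lazy-init gate that reads the ENV on first use only. */
static int switch_dispatch_enabled_lazy(void) {
    static int cached = -1;  /* -1 = not read yet */
    if (cached == -1) {
        cached = env_flag(getenv("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH"));
    }
    return cached;
}

/* Hot path: read the startup cache when FIXED=1, else fall back to lazy-init. */
static inline int switch_dispatch_enabled_fast(void) {
    if (g_fixed_mode) {
        return g_cached_enabled;
    }
    return switch_dispatch_enabled_lazy();
}
```

Note that the fast path still executes one well-predicted branch (the g_fixed_mode test) in place of the lazy-init check, which is consistent with the near-zero change in branch counts reported later.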

Files Created

  1. core/box/tiny_inline_slots_switch_dispatch_fixed_box.h - Fast-path API + global cache
  2. core/box/tiny_inline_slots_switch_dispatch_fixed_box.c - Refresh implementation

Files Modified

  1. core/box/tiny_front_hot_box.h - Alloc path: _enabled() → _enabled_fast()
  2. core/box/tiny_legacy_fallback_box.h - Free path: _enabled() → _enabled_fast()
  3. Makefile - Added tiny_inline_slots_switch_dispatch_fixed_box.o

A/B Test Results

Quick Check (3-run)

Baseline (FIXED=0, SWITCH=1):

  • Run 1: 54.12 M ops/s
  • Run 2: 55.01 M ops/s
  • Run 3: 52.95 M ops/s
  • Mean: 54.02 M ops/s

Treatment (FIXED=1, SWITCH=1):

  • Run 1: 54.57 M ops/s
  • Run 2: 54.17 M ops/s
  • Run 3: 53.94 M ops/s
  • Mean: 54.23 M ops/s

Quick Check Gain: +0.39% (+0.21 M ops/s)

Full Test (10-run)

Baseline (FIXED=0, SWITCH=1):

Run 1:  54.13 M ops/s
Run 2:  54.14 M ops/s
Run 3:  51.30 M ops/s
Run 4:  52.75 M ops/s
Run 5:  52.68 M ops/s
Run 6:  53.75 M ops/s
Run 7:  53.44 M ops/s
Run 8:  53.33 M ops/s
Run 9:  53.43 M ops/s
Run 10: 52.73 M ops/s
Mean: 53.17 M ops/s

Treatment (FIXED=1, SWITCH=1):

Run 1:  52.35 M ops/s
Run 2:  52.87 M ops/s
Run 3:  54.36 M ops/s
Run 4:  53.13 M ops/s
Run 5:  52.36 M ops/s
Run 6:  54.12 M ops/s
Run 7:  53.55 M ops/s
Run 8:  53.76 M ops/s
Run 9:  53.81 M ops/s
Run 10: 53.12 M ops/s
Mean: 53.34 M ops/s

Full Test Gain: +0.32% (+0.17 M ops/s)
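To make the arithmetic behind the 10-run gain reproducible, the following sketch recomputes the means and the relative gain from the per-run values listed above, plus the max-min spread as a crude noise indicator. The helper names are hypothetical and not part of hakmem. Because the listed runs are rounded to two decimals, the recomputed gain may differ from the reported +0.32% in the second decimal place.

```c
#include <stddef.h>

/* Per-run throughput in M ops/s, copied from the 10-run listings above. */
static const double kBaseline[10] = {
    54.13, 54.14, 51.30, 52.75, 52.68, 53.75, 53.44, 53.33, 53.43, 52.73
};
static const double kTreatment[10] = {
    52.35, 52.87, 54.36, 53.13, 52.36, 54.12, 53.55, 53.76, 53.81, 53.12
};

/* Arithmetic mean of n runs (n must be > 0). */
static double runs_mean(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum / (double)n;
}

/* Spread between the fastest and slowest run (n must be > 0). */
static double runs_range(const double *x, size_t n) {
    double lo = x[0];
    double hi = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    return hi - lo;
}

/* Relative gain of treatment over baseline, in percent. */
static double gain_percent(double baseline_mean, double treatment_mean) {
    return (treatment_mean - baseline_mean) / baseline_mean * 100.0;
}
```

Calling gain_percent(runs_mean(kBaseline, 10), runs_mean(kTreatment, 10)) gives the A/B gain, and runs_range(kBaseline, 10) shows that the baseline alone varies by roughly 2.8 M ops/s between runs, more than ten times the measured gain of about 0.17 M ops/s.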

perf stat Analysis

Baseline (FIXED=0, SWITCH=1)

Throughput:        54.07 M ops/s
Cycles:            1,697,024,527
Instructions:      3,515,034,248 (2.07 IPC)
Branches:          893,509,797
Branch-misses:     28,621,855 (3.20%)

Treatment (FIXED=1, SWITCH=1)

Throughput:        53.98 M ops/s
Cycles:            1,706,618,243
Instructions:      3,513,893,603 (2.06 IPC)
Branches:          893,343,014
Branch-misses:     28,582,157 (3.20%)

perf stat Delta

| Metric | Baseline | Treatment | Delta | % Change |
|---|---|---|---|---|
| Throughput (ops/s) | 54.07 M | 53.98 M | -0.09 M | -0.17% |
| Cycles | 1,697M | 1,707M | +10M | +0.56% |
| Instructions | 3,515M | 3,514M | -1M | -0.03% |
| Branches | 893.5M | 893.3M | -0.2M | -0.02% |
| Branch-misses | 28.6M | 28.6M | -0.04M | -0.14% |

Key Finding: Branch reduction is negligible (-0.02%). The -0.17% throughput delta comes from a single perf stat run and is within run-to-run noise.
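The percentages in the delta table follow directly from the raw counters. As a hedged sketch (the struct and helper names are hypothetical, and the counter values are copied from the two perf stat listings above), they can be recomputed like this:

```c
#include <stdint.h>

/* Raw perf stat counters for one run. */
typedef struct {
    uint64_t cycles;
    uint64_t instructions;
    uint64_t branches;
    uint64_t branch_misses;
} perf_counters_t;

/* Values copied from the Baseline and Treatment perf stat sections. */
static const perf_counters_t kPerfBaseline = {
    1697024527ULL, 3515034248ULL, 893509797ULL, 28621855ULL
};
static const perf_counters_t kPerfTreatment = {
    1706618243ULL, 3513893603ULL, 893343014ULL, 28582157ULL
};

/* Relative change from baseline to treatment, in percent. */
static double pct_change(uint64_t baseline, uint64_t treatment) {
    return ((double)treatment - (double)baseline) / (double)baseline * 100.0;
}

/* Instructions retired per cycle. */
static double ipc(const perf_counters_t *c) {
    return (double)c->instructions / (double)c->cycles;
}

/* Share of branches that were mispredicted, in percent. */
static double branch_miss_rate_pct(const perf_counters_t *c) {
    return (double)c->branch_misses / (double)c->branches * 100.0;
}
```

For example, pct_change(kPerfBaseline.branches, kPerfTreatment.branches) reproduces the roughly -0.02% branch delta, which is the figure the NO-GO decision leans on.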

Analysis

Expected vs Actual

  • Expected: +0.3-1.0% gain via branch reduction (Phase 78-1 pattern)
  • Actual: +0.32% gain (10-run average)
  • Branch reduction: -0.02% (essentially zero)

Interpretation

  1. Marginal Gain: +0.32% is at the very bottom of the expected range
  2. No Branch Reduction: -0.02% branch count change is within noise
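  3. High Variance: the single perf stat run shows -0.17%, the opposite sign of the 10-run +0.32%; the 10 baseline runs alone span 51.30 to 54.14 M ops/s, so both deltas sit within run-to-run noise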
  4. Pattern Mismatch: Phase 78-1 achieved +2.31% with clear branch reduction

Root Cause Hypothesis

The optimization targets tiny_inline_slots_switch_dispatch_enabled() which uses a static lazy-init cache:

static inline int tiny_inline_slots_switch_dispatch_enabled(void) {
    static int g_switch_dispatch_enabled = -1;  // -1 = uncached
    if (__builtin_expect(g_switch_dispatch_enabled == -1, 0)) {
        // First call only
        const char* e = getenv("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH");
        g_switch_dispatch_enabled = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_switch_dispatch_enabled;
}

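Issue: After the first call, the g_switch_dispatch_enabled == -1 check is never taken again, so the branch is always predicted correctly. The compiler and CPU already reduce this check to near-zero cost, which leaves almost nothing for a fixed-mode cache to remove.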

Contrast with Phase 78-1: that phase optimized per-class ENV gates (tiny_c4_inline_slots_enabled() etc.), which are called thousands of times per benchmark run. The switch dispatch check is called once per alloc/free operation, but its lazy-init pattern already eliminates most of the overhead.

Decision Gate

GO Threshold: +1.0%

Actual Result: +0.32%

Status: NO-GO (below threshold, negligible branch reduction)
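The gate itself is a single threshold comparison. A minimal sketch, assuming the comparison is inclusive (the document does not say whether exactly +1.0% would pass) and using hypothetical names:

```c
/* Outcome of an A/B decision gate. */
typedef enum {
    GATE_NO_GO = 0,
    GATE_GO = 1
} gate_result_t;

/* GO when the measured gain reaches the threshold; both values are in percent.
 * Assumption: a gain exactly equal to the threshold counts as GO. */
static gate_result_t decision_gate(double gain_pct, double threshold_pct) {
    return (gain_pct >= threshold_pct) ? GATE_GO : GATE_NO_GO;
}
```

With the numbers from this phase, decision_gate(0.32, 1.0) yields GATE_NO_GO, while the +1.65% reported for Phase 80-1 would pass the same threshold.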

Recommendations

  1. Do not promote SWITCHDISPATCH_FIXED=1 to SSOT
  2. Keep code as research box (reversible design preserved)
  3. Phase 78-1 pattern not applicable to lazy-init ENV gates (diminishing returns)

ENV Variables

Baseline (Phase 80-1 mode)

HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0  # Disabled (lazy-init)
HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1        # Switch dispatch ON

Treatment (Phase 83-1 mode)

HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=1  # Enabled (startup cache)
HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1        # Switch dispatch ON

Next Steps

  1. Phase 80-1: Switch dispatch remains in SSOT (+1.65% STRONG GO)
  2. Phase 83-1: Fixed mode NOT promoted (marginal gain)
  3. 🔬 Research: Investigate other optimization opportunities beyond ENV gate overhead

Phase 83-1 Conclusion: NO-GO due to marginal gain (+0.32%) and negligible branch reduction. Lazy-init pattern already optimizes ENV gate overhead effectively.