Phase 77-1: C3 Inline Slots A/B Test Results

Executive Summary

Decision: NO-GO (+0.40% gain, below +1.0% GO threshold)

Key Finding: C3 inline slots provide minimal performance improvement (+0.40%) despite architectural alignment with successful C4-C6 optimizations. This suggests C3 traffic is not a bottleneck in the mixed workload (WS=400, 16-1040B allocations).


Test Configuration

Workload

  • Binary: ./bench_random_mixed_hakmem (with C3 inline slots support compiled in)
  • Iterations: 20,000,000 ops per run
  • Working Set: 400 slots
  • Size Range: 16-1040B (mixed allocations)
  • Runs: 10 per configuration

Configurations

  • Baseline: C3 OFF (HAKMEM_TINY_C3_INLINE_SLOTS=0), C4/C5/C6 ON
  • Treatment: C3 ON (HAKMEM_TINY_C3_INLINE_SLOTS=1), C4/C5/C6 ON (see the env-toggle sketch after this list)
  • Measurement: Throughput (ops/s)
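The only functional difference between the two configurations is the HAKMEM_TINY_C3_INLINE_SLOTS environment toggle. As a minimal sketch (hakmem's actual gating code is not shown here and may differ), an env-var A/B switch of this kind is typically read once at startup and cached:

```c
/* Illustrative only: hakmem's real gating logic may differ.
 * Shows the usual pattern for an env-var A/B switch such as
 * HAKMEM_TINY_C3_INLINE_SLOTS, read once and cached. */
#include <stdio.h>
#include <stdlib.h>

static int c3_inline_slots_enabled(void) {
    const char *v = getenv("HAKMEM_TINY_C3_INLINE_SLOTS");
    return v != NULL && v[0] == '1';   /* "1" => treatment (C3 ON), otherwise baseline */
}

int main(void) {
    printf("C3 inline slots: %s\n", c3_inline_slots_enabled() ? "ON" : "OFF");
    return 0;
}
```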

Raw Results (10 runs each)

Baseline (C3 OFF)

40435972, 41430741, 41023773, 39807320, 40474129,
40436476, 40643305, 40116079, 40295157, 40622709
  • Mean: 40.52 M ops/s
  • Min: 39.80 M ops/s
  • Max: 41.43 M ops/s
  • Std Dev: ~0.57 M ops/s

Treatment (C3 ON)

40836958, 40492669, 40726473, 41205860, 40609735,
40943945, 40612661, 41083970, 40370334, 40040018
  • Mean: 40.69 M ops/s
  • Min: 40.04 M ops/s
  • Max: 41.20 M ops/s
  • Std Dev: ~0.43 M ops/s

Delta Analysis

| Metric | Value |
| --- | --- |
| Baseline Mean | 40.52 M ops/s |
| Treatment Mean | 40.69 M ops/s |
| Absolute Gain | 0.17 M ops/s |
| Relative Gain | +0.40% |
| GO Threshold | +1.0% |
| Status | NO-GO |
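
The headline numbers above can be recomputed directly from the 10 raw runs; a minimal standalone sketch (not part of the hakmem tree) that reproduces the means and the relative gain:

```c
/* Recomputes the Delta Analysis figures from the raw runs listed earlier.
 * Values are ops/s; standalone, not part of the benchmark harness. */
#include <stdio.h>

static double mean(const double *v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += v[i];
    return s / n;
}

int main(void) {
    const double baseline[10] = {
        40435972, 41430741, 41023773, 39807320, 40474129,
        40436476, 40643305, 40116079, 40295157, 40622709
    };
    const double treatment[10] = {
        40836958, 40492669, 40726473, 41205860, 40609735,
        40943945, 40612661, 41083970, 40370334, 40040018
    };

    double mb = mean(baseline, 10);
    double mt = mean(treatment, 10);

    printf("baseline  mean: %.2f M ops/s\n", mb / 1e6);
    printf("treatment mean: %.2f M ops/s\n", mt / 1e6);
    printf("relative gain : %+.2f%% (GO threshold: +1.0%%)\n", (mt - mb) / mb * 100.0);
    return 0;
}
```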

Confidence Analysis

  • Sample size: 10 per group
  • Overlap: Baseline and Treatment ranges have significant overlap
  • Signal-to-noise: Gain (0.17M) is comparable to baseline std dev (0.57M)
  • Conclusion: Gain is within noise, not statistically significant

Root Cause Analysis: Why No Gain?

1. Phase 77-0 Observation Confirmed

  • Unified_cache statistics showed C3 had only 1 miss out of 20M operations (~0.000005%, effectively a zero miss rate)
  • This ultra-low miss rate indicates C3 is already well-serviced by existing mechanisms

2. Warm Pool Effectiveness

  • Warm pool + first-page-cache are likely intercepting C3 traffic
  • C3 is below the "hot class" threshold where inline slots provide ROI

3. TLS Overhead vs. Benefit

  • C3 adds 2KB/thread TLS overhead (an illustrative layout follows this list)
  • No corresponding reduction in unified_cache misses → overhead not justified
  • Unlike C4-C6 where inline slots eliminated significant unified_cache traffic
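For scale, 2KB/thread corresponds to roughly 256 pointer-sized slots per thread. A purely illustrative layout (this is not hakmem's actual data structure; the slot count is assumed from the 2KB figure) showing what that TLS footprint looks like:

```c
/* Purely illustrative: NOT hakmem's actual structure. 256 eight-byte slot
 * pointers ~= 2 KB of TLS per thread, matching the overhead figure above. */
#include <stdio.h>

#define C3_INLINE_SLOTS 256

static _Thread_local struct {
    void    *slots[C3_INLINE_SLOTS];   /* cached free blocks for class C3 */
    unsigned count;                    /* number of valid entries */
} c3_inline_cache;

int main(void) {
    printf("TLS cost per thread: %zu bytes\n", sizeof(c3_inline_cache));
    return 0;
}
```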

4. Workload Characteristics

  • WS=400 mixed workload is dominated by C5-C6 (57.17% + 28.55% = 85.7% of operations)
  • C3 only ~15.6% of workload (64-128B size range)
  • Even if C3 were optimized, it can only affect 15.6% of operations
  • Only 4-5% of that traffic is currently hitting unified_cache (based on Phase 77-0 data); a rough upper bound is sketched below
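
Combining the two shares above gives a rough ceiling on what any C3-only change could move. A back-of-the-envelope sketch (the ~15.6% and 4-5% figures are quoted from this document and Phase 77-0; the real per-op saving would be smaller still):

```c
/* Back-of-the-envelope upper bound: fraction of all operations that a
 * C3 inline-slots optimization could even touch. Illustrative arithmetic only. */
#include <stdio.h>

int main(void) {
    double c3_share      = 0.156;  /* ~15.6% of ops are C3 in this workload */
    double unified_share = 0.05;   /* ~4-5% of C3 traffic reaches unified_cache */
    double touched = c3_share * unified_share;

    /* Prints ~0.78%: even eliminating that traffic entirely affects well under
     * 1% of operations, consistent with the observed +0.40% being within noise. */
    printf("upper bound on affected ops: %.2f%%\n", touched * 100.0);
    return 0;
}
```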

Comparison to C4-C6 Success

Why C4-C6 Succeeded (+7.05% cumulative)

| Factor | C4-C6 | C3 |
| --- | --- | --- |
| Hot traffic % | 14.29% + 28.55% + 57.17% = 100% of Tiny | ~15.6% of total |
| Unified_cache hits | Low but visible | Almost none |
| Context dependency | Super-additive synergy | No interaction |
| Size class range | 128-2048B (large objects) | 64-128B (small) |

Key Insight: C4-C6 optimization succeeded because it addressed active contention in the unified_cache. C3 optimization addresses non-existent contention.


Per-Class Coverage Summary (Final)

C0-C7 Optimization Status

| Class | Size Range | Coverage % | Optimization | Result | Status |
| --- | --- | --- | --- | --- | --- |
| C6 | 1025-2048B | 57.17% | Inline Slots | +2.87% | GO (Phase 75-1) |
| C5 | 513-1024B | 28.55% | Inline Slots | +1.10% | GO (Phase 75-2) |
| C4 | 257-512B | 14.29% | Inline Slots | +1.27% (in context) | GO (Phase 76-1, +7.05% cumulative) |
| C3 | 65-256B | ~15.6% | Inline Slots | +0.40% | NO-GO (Phase 77-1) |
| C2 | 33-64B | ~15.6% | Not tested | N/A | ⏸️ CONDITIONAL (blocked by C3 NO-GO) |
| C7 | 2049-4096B | 0.00% | N/A | N/A | NO-GO (Phase 76-0) |
| C0-C1 | <32B | Minimal | N/A | N/A | ⏸️ Future (blocked by C2) |

Decision Logic

Success Criteria

| Criterion | Threshold | Actual | Pass |
| --- | --- | --- | --- |
| GO Threshold | ≥ +1.0% | +0.40% | ✗ |
| Noise floor | < 50% of baseline std dev | 30% of std dev | ⚠️ |
| Statistical significance | p < 0.05 (10 samples) | High overlap | ✗ |

Decision: NO-GO

Rationale:

  1. Below GO threshold: +0.40% is significantly below +1.0% GO floor
  2. Statistical insignificance: Gain is within measurement noise
  3. Root cause confirmed: Phase 77-0 data shows C3 has minimal unified_cache contention
  4. No follow-on to C2: Phase 77-2 (C2) conditional on C3 success → BLOCKED

Impact: C3-C2 optimization axis exhausted. Per-class inline slots optimization complete at C4-C6.


Phase 77-2 Status: SKIPPED (Conditional NO-GO)

Phase 77-2 (C2 inline slots) was conditional on Phase 77-1 (C3) success. Since Phase 77-1 is NO-GO:

  • Phase 77-2 is SKIPPED (not implemented)
  • C2 remains unoptimized (consistent with Phase 77-0 observation: negligible unified_cache traffic)

Recommendations

1. Lock C4-C6 as Permanent SSOT (already done in Phase 76-2)

  • C4+C5+C6 inline slots = +7.05% cumulative gain, super-additive
  • Promoted to defaults in core/bench_profile.h and test scripts

2. Explore Alternative Optimization Axes (Phase 78+)

Given C3 NO-GO, consider:

  • Option A: Allocation fast-path further optimization (instruction/branch reduction)
  • Option B: Metadata/page lookup optimization (avoid pointer chasing)
  • Option C: Warm pool tuning beyond Phase 69's WarmPool=16
  • Option D: Alternative size-class strategies (C1/C2 with different thresholds)

3. Track mimalloc Ratio (Secondary Metric, ongoing)

  • Current: 89.2% (Phase 76-2 baseline)
  • Monitor code bloat from C4-C6 additions
  • Rebase the FAST PGO profile if bloat becomes a concern

Conclusion

Phase 77-1 validates that per-class inline slots optimization has a natural stopping point at C3. Unlike C4-C6, which addressed hot unified_cache traffic, C3 (and by extension C2) appears to be well-serviced by the existing warm pool and caching mechanisms.

Key Learning: Not all size classes benefit equally from the same optimization pattern. C3's low traffic and non-existent unified_cache contention make inline slots wasteful in this workload.

Status: DECISION MADE (C3 NO-GO, C4-C6 locked to SSOT, Phase 77 complete)


Phase 77 Status: ✓ COMPLETE (Phase 77-0 GO, Phase 77-1 NO-GO, Phase 77-2 SKIPPED)

Next Phase: Phase 78 (Alternative optimization axis TBD)