Files
hakmem/docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md
Moe Charm (CI) 89a9212700 Phase 83-1 + Allocator Comparison: Switch dispatch fixed (NO-GO +0.32%), PROFILE correction, SCORECARD update
Key changes:
- Phase 83-1: Switch dispatch fixed mode (tiny_inline_slots_switch_dispatch_fixed_box) - NO-GO (marginal +0.32%, branch reduction negligible)
  Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns

- Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M):
  tcmalloc: 115.26M (92.33% of mimalloc)
  jemalloc: 97.39M (77.96% of mimalloc)
  system: 85.20M (68.24% of mimalloc)
  mimalloc: 124.82M (baseline)

- hakmem PROFILE correction: scripts/run_mixed_10_cleanenv.sh + run_allocator_quick_matrix.sh
  PROFILE explicitly set to MIXED_TINYV3_C7_SAFE for hakmem measurements
  Result: baseline stabilized to 55.53M (44.46% of mimalloc)
  Previous unstable measurement (35.57M) was due to profile leak

- Documentation:
  * PERFORMANCE_TARGETS_SCORECARD.md: Reference allocators + M1/M2 milestone status
  * PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md: Phase 83-1 analysis (NO-GO)
  * ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md: Quick comparison procedure
  * ALLOCATOR_COMPARISON_SSOT.md: Detailed SSOT methodology

- M2 milestone status: 44.46% (target 55%, gap -10.54pp) - structural improvements needed

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-18 18:50:00 +09:00

9.2 KiB
Raw Blame History

Phase 76-2: C4+C5+C6 Comprehensive 4-Point Matrix Results

Executive Summary

Decision: STRONG GO (+7.05% cumulative gain, exceeds +3.0% threshold with super-additivity)

Key Finding: C4+C5+C6 inline slots deliver +7.05% throughput gain on Standard binary, completing the per-class optimization trilogy with synergistic interaction effects.

Critical Discovery: C4 shows negative performance in isolation (-0.08% without C5/C6) but synergistic gain with C5+C6 present (+1.27% marginal contribution in full stack).


4-Point Matrix Test Results

Test Configuration

  • Workload: Mixed SSOT (WS=400, ITERS=20000000)
  • Binary: ./bench_random_mixed_hakmem (Standard build)
  • Runs: 10 per configuration
  • Harness: scripts/run_mixed_10_cleanenv.sh

Raw Data (10 runs per point)

Point Config Average Throughput Delta vs A Status
A C4=0, C5=0, C6=0 49.48 M ops/s - Baseline
B C4=1, C5=0, C6=0 49.44 M ops/s -0.08% Regression
C C4=0, C5=1, C6=1 52.27 M ops/s +5.63% Strong gain
D C4=1, C5=1, C6=1 52.97 M ops/s +7.05% Excellent gain

Per-Point Details

Point A (All OFF): 48804232, 49822782, 50299414, 49431043, 48346953, 50594873, 49295433, 48956687, 49491449, 49803811

  • Mean: 49.48 M ops/s
  • σ: 0.63 M ops/s

Point B (C4 Only): 49246268, 49780577, 49618929, 48652983, 50000003, 48989740, 49973913, 49077610, 50144043, 48958613

  • Mean: 49.44 M ops/s
  • σ: 0.56 M ops/s
  • Δ vs A: -0.08%

Point C (C5+C6 Only): 52249144, 52038944, 52804475, 52441811, 52193156, 52561113, 51884004, 52336668, 52019796, 52196738

  • Mean: 52.27 M ops/s
  • σ: 0.38 M ops/s
  • Δ vs A: +5.63%

Point D (All ON): 52909030, 51748016, 53837633, 52436623, 53136539, 52671717, 54071840, 52759324, 52769820, 53374875

  • Mean: 52.97 M ops/s
  • σ: 0.92 M ops/s
  • Δ vs A: +7.05%

Sub-Additivity Analysis

Additivity Calculation

If C4 and C5+C6 gains were purely additive, we would expect:

Expected D = A + (B-A) + (C-A)
           = 49.48 + (-0.04) + (2.79)
           = 52.23 M ops/s

Actual D: 52.97 M ops/s

Sub-additivity loss: -1.42% (negative indicates SUPER-ADDITIVITY)

Interpretation

The combined C4+C5+C6 gain is 1.42% better than additive, indicating synergistic interaction:

  • C4 solo: -0.08% (detrimental when C5/C6 OFF)
  • C5+C6 solo: +5.63% (strong gain)
  • C4+C5+C6 combined: +7.05% (super-additive!)
  • Marginal contribution of C4 in full stack: +1.27% (vs D vs C)

Key Insight: C4 optimization is context-dependent. It provides minimal or negative benefit when the hot allocation path still goes through the full unified_cache. But when C5+C6 are already on the fast path (reducing unified_cache traffic for 85.7% of operations), C4 becomes synergistic on the remaining 14.3% of operations.


Decision Matrix

Success Criteria

Criterion Threshold Actual Pass
GO Threshold ≥ +1.0% +7.05%
Ideal Threshold ≥ +3.0% +7.05%
Sub-additivity < 20% loss -1.42% (super-additive)
Pattern consistency D > C > A

Decision: STRONG GO

Rationale:

  1. Cumulative gain of +7.05% exceeds ideal threshold (+3.0%) by +4.05pp
  2. Super-additive behavior (actual > expected) indicates positive interaction synergy
  3. All thresholds exceeded with robust measurement across 40 total runs
  4. Clear hierarchy: D > C > A (with B showing context-dependent behavior)

Quality Rating: Excellent GO (exceeds threshold by +4.05pp, demonstrates synergistic gains)


Comparison to Phase 75-3 (C5+C6 Matrix)

Phase 75-3 Results

Point Config Throughput Delta
A C5=0, C6=0 42.36 M ops/s -
B C5=1, C6=0 43.54 M ops/s +2.79%
C C5=0, C6=1 44.25 M ops/s +4.46%
D C5=1, C6=1 44.65 M ops/s +5.41%

Phase 76-2 Results (with C4)

Point Config Throughput Delta
A C4=0, C5=0, C6=0 49.48 M ops/s -
B C4=1, C5=0, C6=0 49.44 M ops/s -0.08%
C C4=0, C5=1, C6=1 52.27 M ops/s +5.63%
D C4=1, C5=1, C6=1 52.97 M ops/s +7.05%

Key Differences

  1. Baseline Difference: Phase 75-3 baseline (42.36M) vs Phase 76-2 baseline (49.48M)

    • Different warm-up/system conditions
    • Percentage gains are directly comparable
  2. C5+C6 Contribution:

    • Phase 75-3: +5.41% (isolated)
    • Phase 76-2 Point C: +5.63% (confirms reproducibility)
  3. C4 Contribution:

    • Phase 75-3: N/A (C4 not yet measured)
    • Phase 76-2 Point B: -0.08% (alone), +1.27% marginal (in full stack)
  4. Cumulative Effect:

    • Phase 75-3 (C5+C6): +5.41%
    • Phase 76-2 (C4+C5+C6): +7.05%
    • Additional contribution from C4: +1.64pp

Insights: Context-Dependent Optimization

C4 Behavior Analysis

Finding: C4 inline slots show paradoxical behavior:

  • Standalone (C4 only, C5/C6 OFF): -0.08% (regression)
  • In context (C4 with C5+C6 ON): +1.27% (gain)

Hypothesis: When C5+C6 are OFF, the allocation fast path still heavily uses unified_cache for all size classes (C0-C7). C4 inline slots add TLS overhead without significant branch elimination benefit.

When C5+C6 are ON, unified_cache traffic for C5-C6 is eliminated (85.7% of operations avoid unified_cache). The remaining C4 operations see more benefit from inline slots because:

  1. TLS overhead is amortized across fewer unified_cache operations
  2. Branch prediction state improves without C5/C6 hot traffic
  3. L1-dcache pressure from inline slots is offset by reduced unified_cache accesses

Implication: Per-class optimizations are not independently additive but context-dependent. This validates the importance of 4-point matrix testing before promoting optimizations.


Per-Class Coverage Summary (Final)

C4-C7 Optimization Complete

Class Size Range Coverage % Optimization Individual Gain Cumulative Status
C6 1025-2048B 57.17% Inline Slots +2.87%
C5 513-1024B 28.55% Inline Slots +1.10%
C4 257-512B 14.29% Inline Slots +1.27% (in context)
C7 2049-4096B 0.00% N/A N/A NO-GO
Combined C4-C6 256-2048B 100% Inline Slots +7.05% STRONG GO

Measurement Progression

  1. Phase 75-1 (C6 only): +2.87% (10-run A/B)
  2. Phase 75-2 (C5 only, isolated): +1.10% (10-run A/B)
  3. Phase 75-3 (C5+C6 interaction): +5.41% (4-point matrix)
  4. Phase 76-0 (C7 analysis): NO-GO (0% operations)
  5. Phase 76-1 (C4 in context): +1.73% (10-run A/B with C5+C6 ON)
  6. Phase 76-2 (C4+C5+C6 interaction): +7.05% (4-point matrix, super-additive)

Immediate (Completed)

  1. C4 Inline Slots Promoted to SSOT

    • core/bench_profile.h: C4 default ON
    • scripts/run_mixed_10_cleanenv.sh: C4 default ON
    • Combined C4+C5+C6 now preset default
  2. Phase 76-2 Results Documented

    • This file: PHASE76_2_C4C5C6_MATRIX_RESULTS.md
    • CURRENT_TASK.md updated with Phase 76-2

Optional (Future Phases)

  1. FAST PGO Rebase (Track B - periodic, not decision-point)

    • Monitor code bloat impact from C4 addition
    • Regenerate PGO profile with C4+C5+C6=ON if code bloat becomes concern
    • Track mimalloc ratio progress (secondary metric)
  2. Next Optimization Axis (Phase 77+)

    • C4+C5+C6 optimizations complete and locked to SSOT
    • Explore new optimization strategies:
      • Allocation fast-path further optimization
      • Metadata/page lookup optimization
      • Alternative size-class strategies (C3/C2)

Artifacts

Test Logs

  • /tmp/phase76_2_point_A.log (C4=0, C5=0, C6=0)
  • /tmp/phase76_2_point_B.log (C4=1, C5=0, C6=0)
  • /tmp/phase76_2_point_C.log (C4=0, C5=1, C6=1)
  • /tmp/phase76_2_point_D.log (C4=1, C5=1, C6=1)

Analysis Script

  • /tmp/phase76_2_analysis.sh (matrix calculation)
  • /tmp/phase76_2_matrix_test.sh (test harness)

Binary Information

  • Binary: ./bench_random_mixed_hakmem
  • Build time: 2025-12-18 (Phase 76-1)
  • Size: 674K
  • Compiler: gcc -O3 -march=native -flto

Conclusion

Phase 76-2 validates that C4+C5+C6 inline slots deliver +7.05% cumulative throughput gain on Standard binary, completing comprehensive optimization of C4-C7 size class allocations.

Critical Discovery: Per-class optimizations are context-dependent rather than independently additive. C4 shows negative performance in isolation but strong synergistic gains when C5+C6 are already optimized. This finding emphasizes the importance of 4-point matrix testing before promoting multi-stage optimizations.

Recommendation: Lock C4+C5+C6 configuration as SSOT baseline ( completed). Proceed to next optimization axis (Phase 77+) with confidence that per-class inline slots optimization is exhausted.


Phase 76-2 Status: ✓ COMPLETE (STRONG GO, +7.05% super-additive gain validated)

Next Phase: Phase 77 (Alternative optimization axes) or FAST PGO periodic tracking (Track B)