Files
hakmem/docs/analysis/PHASE79_1_C2_LOCAL_CACHE_RESULTS.md
Moe Charm (CI) 89a9212700 Phase 83-1 + Allocator Comparison: Switch dispatch fixed (NO-GO +0.32%), PROFILE correction, SCORECARD update
Key changes:
- Phase 83-1: Switch dispatch fixed mode (tiny_inline_slots_switch_dispatch_fixed_box) - NO-GO (marginal +0.32%, branch reduction negligible)
  Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns

- Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M):
  tcmalloc: 115.26M (92.33% of mimalloc)
  jemalloc: 97.39M (77.96% of mimalloc)
  system: 85.20M (68.24% of mimalloc)
  mimalloc: 124.82M (baseline)

- hakmem PROFILE correction: scripts/run_mixed_10_cleanenv.sh + run_allocator_quick_matrix.sh
  PROFILE explicitly set to MIXED_TINYV3_C7_SAFE for hakmem measurements
  Result: baseline stabilized to 55.53M (44.46% of mimalloc)
  Previous unstable measurement (35.57M) was due to profile leak

- Documentation:
  * PERFORMANCE_TARGETS_SCORECARD.md: Reference allocators + M1/M2 milestone status
  * PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md: Phase 83-1 analysis (NO-GO)
  * ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md: Quick comparison procedure
  * ALLOCATOR_COMPARISON_SSOT.md: Detailed SSOT methodology

- M2 milestone status: 44.46% (target 55%, gap -10.54pp) - structural improvements needed

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-18 18:50:00 +09:00

11 KiB
Raw Blame History

Phase 79-1: C2 Local Cache Optimization Results

Executive Summary

Decision: NO-GO (+0.57% gain, below +1.0% GO threshold)

Key Finding: Despite Phase 79-0 identifying C2 Stage3 lock contention, implementing a TLS-local cache for C2 allocations did NOT deliver the predicted performance gain (+0.5% to +1.5%). Actual result: +0.57% ≈ at lower bound of prediction but insufficient to exceed threshold.


Test Configuration

Implementation

  • New Files: 4 box files (env, tls, api, c variable)
  • Integration: Allocation/deallocation hot paths (tiny_front_hot_box.h, tiny_legacy_fallback_box.h)
  • ENV Variable: HAKMEM_TINY_C2_LOCAL_CACHE=0/1 (default OFF)
  • TLS Capacity: 64 slots (512B per thread, per Phase 79-0 spec)
  • Pattern: Same ring buffer + fail-fast approach as C3/C4/C5/C6

Test Setup

  • Binary: ./bench_random_mixed_hakmem (same binary, ENV-gated)
  • Baseline: HAKMEM_TINY_C2_LOCAL_CACHE=0 (no C2 cache, Phase 78-1 baseline)
  • Treatment: HAKMEM_TINY_C2_LOCAL_CACHE=1 (C2 local cache enabled)
  • Workload: 20M iterations, WS=400, 16-1040B mixed allocations
  • Runs: 10 per configuration

Raw Results

Baseline (HAKMEM_TINY_C2_LOCAL_CACHE=0)

Run 1: 42.93 M ops/s
Run 2: 42.30 M ops/s
Run 3: 41.84 M ops/s
Run 4: 41.36 M ops/s
Run 5: 41.79 M ops/s
Run 6: 39.51 M ops/s
Run 7: 42.35 M ops/s
Run 8: 42.41 M ops/s
Run 9: 42.53 M ops/s
Run 10: 41.66 M ops/s

Mean: 41.86 M ops/s
Range: 39.51 - 42.93 M ops/s (3.42 M ops/s variance)

Treatment (HAKMEM_TINY_C2_LOCAL_CACHE=1)

Run 1: 42.51 M ops/s
Run 2: 42.22 M ops/s
Run 3: 42.37 M ops/s
Run 4: 42.66 M ops/s
Run 5: 41.89 M ops/s
Run 6: 41.94 M ops/s
Run 7: 42.19 M ops/s
Run 8: 40.75 M ops/s
Run 9: 41.97 M ops/s
Run 10: 42.53 M ops/s

Mean: 42.10 M ops/s
Range: 40.75 - 42.66 M ops/s (1.91 M ops/s variance)

Delta Analysis

Metric Value
Baseline Mean 41.86 M ops/s
Treatment Mean 42.10 M ops/s
Absolute Gain +0.24 M ops/s
Relative Gain +0.57%
GO Threshold +1.0%
Status NO-GO

Root Cause Analysis

Why C2 Local Cache Underperformed

  1. Phase 79-0 Contention Signal Misleading

    • Observation: 2 Stage3 (backend lock) hits for C2 in single 20M iteration run
    • Lock rate: 0.08% (1 lock per 1.25M operations)
    • Problem: This extremely low contention rate suggests:
      • Even with local cache, reduction in absolute lock count is minimal
      • 1-2 backend locks per 20M ops = negligible CPU impact
      • Not a "hot contention" pattern like unified_cache misses or magazine thrashing
  2. TLS Cache Hit Rates Likely Low

    • C2 allocation/free pattern may not favor TLS retention
    • Phase 77-0 showed C3 unified_cache traffic minimal (already warm-pool served)
    • C2 might have similar characteristic: already well-served by existing mechanisms
    • Local cache helps ONLY if frees cluster within same thread (locality)
  3. Cache Capacity Constraints

    • 64 slots = relatively small ring buffer
    • May hit full condition frequently, forcing fallback to unified_cache anyway
    • Reduced effective cache hit rate vs. larger capacities
  4. Workload Characteristics (WS=400)

    • Small working set (400 unique allocations)
    • Warm pool already preloads allocations efficiently
    • Magazine caching might already be serving C2 well
    • Less free-clustering per thread = lower C2 local cache efficiency

Comparison to Other Phases

Phase Optimization Predicted Actual Result
75-1 C6 Inline Slots +2-3% +2.87% GO
76-1 C4 Inline Slots +1-2% +1.73% GO
77-1 C3 Inline Slots +0.5-1% +0.40% NO-GO
78-1 Fixed Mode +1-2% +2.31% GO
79-1 C2 Local Cache +0.5-1.5% +0.57% NO-GO

Key Pattern:

  • Larger classes (C6=512B, C4=128B) benefit significantly from inline slots
  • Smaller classes (C3=64B, C2=32B) show diminishing returns or hit warm-pool saturation
  • C2 appears to be in warm-pool-dominated regime (like C3)

Why C2 is Different from C4-C6

C4-C6 Success Pattern

  • Classes handled 2.5M-5.0M operations in workload
  • Lock contention: Measured Stage3 hits = 0-2 (Stage2 dominated)
  • Root cause: Unified_cache misses forcing backend pool access
  • Solution: Inline slots reduce unified_cache pressure
  • Result: Intercepting traffic before unified_cache was effective

C2 Failure Pattern

  • Class handles 2.5M operations (same as C3)
  • Lock contention: ALL 2 C2 locks = Stage3 (backend-only)
  • Root cause hypothesis: C2 frees not being cached/retained
  • Solution attempted: TLS cache to locally retain frees
  • Problem: Even with local cache, no measurable improvement
  • Conclusion: Lock contention wasn't actually the bottleneck, or solution doesn't address it

Technical Observations

  1. Variability Analysis

    • Baseline variance: 3.42 M ops/s (8.2% coefficient of variation)
    • Treatment variance: 1.91 M ops/s (4.5% coefficient of variation)
    • Treatment shows lower variance (more stable) but not higher throughput
    • Suggests: C2 cache reduces noise but doesn't accelerate hot path
  2. Lock Statistics Interpretation

    • Phase 78-0 showed 2 Stage3 locks per 2.5M C2 ops
    • If local cache eliminated both locks: ~50-100 cycles saved per 20M ops
    • Expected gain: 50-100 cycles / (40.52M ops × 2-3 cycles/op) ≈ +0.2-0.4% (matches observation!)
    • Insight: Lock contention existed but was NOT the primary throughput bottleneck
  3. Why Lock Stats Misled

    • Lock acquisition is expensive (~50-100 cycles) but rare (0.08%)
    • The cost is paid only twice per 20M operations
    • Per-operation baseline cost > occasional lock cost
    • Lesson: Lock statistics ≠ throughput impact. Frequency matters more than per-event cost.

Alternative Hypotheses (Not Tested)

If C2 cache had worked, we would expect:

  • ~50% of C2 frees captured by local cache
  • Each cache hit saves ~10-20 cycles vs. unified_cache path
  • Net: +0.5-1.0% throughput
  • Actual observation: No measurable savings

Why it didn't work:

  1. C2 local cache capacity (64) too small or too large (untested)
  2. C2 frees don't cluster per-thread (random distribution)
  3. Warm pool already intercepting C2 allocations before local cache hits
  4. Magazine caching already effective for C2
  5. Contention analysis (Phase 79-0) misidentified true bottleneck

Decision Logic

Success Criteria NOT Met

Criterion Threshold Actual Pass
GO Threshold ≥ +1.0% +0.57%
Prediction accuracy Within 50% +113% error
Pattern consistency Aligns with prior Counter to C3 (similar) ⚠️

Decision: NO-GO

Rationale:

  1. Gain (+0.57%) significantly below GO threshold (+1.0%)
  2. Prediction error large (+0.93% expected at median, actual +0.57%)
  3. ⚠️ Result contradicts Phase 77-1 C3 pattern (both NO-GO for similar reasons)
  4. Code quality: Implementation correct (no behavioral issues)
  5. Safety: Safe to discard (ENV-gated, easily disabled)

Implications

Phase 79 Strategy Revision

Original Plan:

  • Phase 79-0: Identify C0-C3 bottleneck (C2 Stage3 lock contention identified)
  • Phase 79-1: Implement 1-box C2 local cache (implemented)
  • Phase 79-1 A/B test: +1.0% GO (only +0.57%)

Learning:

  • Lock statistics are misleading for throughput optimization
  • Frequency of operation matters more than per-event cost
  • C0-C3 classes may already be well-served by warm pool + magazine caching
  • Further gains require targeting different bottleneck or different mechanism

Recommendations

  1. Option A: Accept Phase 79-1 NO-GO

    • Revert C2 local cache (remove from codebase)
    • Archive findings (lock contention identified but not throughput-limiting)
    • Focus on other optimization axes (Phase 80+)
  2. Option B: Investigate Alternative C2 Mechanism (Phase 79-2)

    • Magazine local hold buffer optimization (if available)
    • Warm pool size tuning for C2
    • SizeClass lookup caching for C2
    • Expected gain: +0.3-0.8% (speculative)
  3. Option C: Larger C2 Cache Experiment (Phase 79-1b)

    • Test 128 or 256-slot C2 cache (1KB or 2KB per thread)
    • Hypothesis: Larger capacity = higher hit rate
    • Risk: TLS bloat, diminishing returns
    • Expected effort: 1 hour (Makefile + env config change only)
  4. Option D: Abandon C0-C3 Axis

    • Observation: C3 (+0.40%), C2 (+0.57%) both fall below threshold
    • C0-C1 likely even smaller gains
    • Warm pool + magazine caching already dominates C0-C3
    • Recommend shifting focus to other allocator subsystems

Code Status

Files Created (Phase 79-1a):

  • core/box/tiny_c2_local_cache_env_box.h
  • core/box/tiny_c2_local_cache_tls_box.h
  • core/front/tiny_c2_local_cache.h
  • core/tiny_c2_local_cache.c

Files Modified (Phase 79-1b):

  • Makefile (added tiny_c2_local_cache.o)
  • core/box/tiny_front_hot_box.h (added C2 cache pop)
  • core/box/tiny_legacy_fallback_box.h (added C2 cache push)

Status: Implementation complete, A/B test complete, decision: NO-GO


Cumulative Performance Track

Phase Optimization Result Cumulative
75-1 C6 Inline Slots +2.87% +2.87%
75-3 C5+C6 interaction +5.41% (baseline dependent)
76-2 C4+C5+C6 matrix +7.05% +7.05%
77-1 C3 Inline Slots +0.40% NO-GO
78-1 Fixed Mode +2.31% +9.36%
79-1 C2 Local Cache +0.57% NO-GO

Current Baseline: 41.86 M ops/s (from Phase 78-1: 40.52 → 41.46 M ops/s, but higher in Phase 79-1)


Conclusion

Phase 79-1 NO-GO validates the following insights:

  1. Lock statistics don't predict throughput: Phase 79-0's Stage3 lock analysis identified real contention but overestimated its performance impact (~0.2% vs. predicted 0.5-1.5%).

  2. Warm pool effectiveness: Classes C2-C3 appear to be in warm-pool-dominated regime already, similar to observation from Phase 77-1 (C3 warm pool serving allocations before inline slots could help).

  3. Diminishing returns in tiny classes: C0-C3 optimization ROI drops significantly compared to C4-C6, suggesting fundamental architecture already optimizes small classes well.

  4. Per-thread locality matters: Allocation patterns don't cluster per-thread for C2, reducing value of TLS-local caches.

Next Steps: Consider Phase 80 with different optimization axis (e.g., Magazine overflow handling, compile-time constant optimization, or focus on non-tiny allocation sizes).


Status: Phase 79-1 Complete (NO-GO)

Decision Point: Archive C2 local cache or experiment with alternative C2 mechanism (Phase 79-2)?