hakmem/docs/analysis/PHASE78_0_SSOT_VERIFICATION.md
Moe Charm (CI) 89a9212700 Phase 83-1 + Allocator Comparison: Switch dispatch fixed (NO-GO +0.32%), PROFILE correction, SCORECARD update
Key changes:
- Phase 83-1: Switch dispatch fixed mode (tiny_inline_slots_switch_dispatch_fixed_box) - NO-GO (marginal +0.32%, branch reduction negligible)
  Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns

- Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M):
  tcmalloc: 115.26M (92.33% of mimalloc)
  jemalloc: 97.39M (77.96% of mimalloc)
  system: 85.20M (68.24% of mimalloc)
  mimalloc: 124.82M (baseline)

- hakmem PROFILE correction: scripts/run_mixed_10_cleanenv.sh + run_allocator_quick_matrix.sh
  PROFILE explicitly set to MIXED_TINYV3_C7_SAFE for hakmem measurements
  Result: baseline stabilized to 55.53M (44.46% of mimalloc)
  Previous unstable measurement (35.57M) was due to profile leak

- Documentation:
  * PERFORMANCE_TARGETS_SCORECARD.md: Reference allocators + M1/M2 milestone status
  * PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md: Phase 83-1 analysis (NO-GO)
  * ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md: Quick comparison procedure
  * ALLOCATOR_COMPARISON_SSOT.md: Detailed SSOT methodology

- M2 milestone status: 44.46% (target 55%, gap -10.54pp) - structural improvements needed

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-18 18:50:00 +09:00


Phase 78-0: SSOT Verification & Phase 78-1 Plan

Phase 78-0 Complete: SSOT Verified

Verification Results (Single Run)

  • Binary: ./bench_random_mixed_hakmem (Standard, C4/C5/C6 ON, C3 OFF)
  • Configuration: HAKMEM_ROUTE_BANNER=1, HAKMEM_MEASURE_UNIFIED_CACHE=1
  • Workload: 20M iterations, WS=400, 16-1040B mixed allocations

Route Configuration

  • unified_cache_enabled = 1 ✓
  • warm_pool_max_per_class = 12 ✓
  • All routes = LEGACY (correct for Phase 76-2 state) ✓

Unified Cache Statistics (Per-Class)

| Class | Hits | Misses | Interpretation |
|-------|------|--------|----------------|
| C4 | 0 | 1 | Inline slots active (full interception) ✓ |
| C5 | 0 | 1 | Inline slots active (full interception) ✓ |
| C6 | 0 | 1 | Inline slots active (full interception) ✓ |

Critical Insight

Zero unified_cache hits for C4/C5/C6 = Expected and Correct

The inline slots ARE working perfectly:

  • During steady-state operations: 100% of C4/C5/C6 traffic intercepted by inline slots
  • Never reaches unified_cache during normal allocation path
  • 1 miss per class occurs only during initialization/drain (not steady-state)
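The hit/miss pattern above can be reproduced with a toy model. This is only a sketch of the interception idea, not hakmem's actual data structures: `toy_class_t`, `toy_alloc`, `toy_free`, and all capacities are hypothetical names and values chosen for illustration.

```c
#include <stddef.h>

#define SLOT_CAP    4    /* hypothetical inline slot capacity */
#define CACHE_CAP   16   /* hypothetical unified cache capacity */
#define POOL_BLOCKS 32   /* hypothetical backing pool size */
#define BLOCK_SIZE  64

typedef struct {
    void  *slots[SLOT_CAP];   /* inline slots: checked first */
    size_t nslots;
    void  *cache[CACHE_CAP];  /* stand-in for the unified cache */
    size_t ncache;
    unsigned long hits;       /* unified cache hits */
    unsigned long misses;     /* unified cache misses */
    char   pool[POOL_BLOCKS][BLOCK_SIZE];
    size_t next;              /* bump index into the backing pool */
} toy_class_t;

static void *toy_alloc(toy_class_t *c) {
    if (c->nslots > 0)                    /* intercepted by inline slots */
        return c->slots[--c->nslots];
    if (c->ncache > 0) {                  /* only reached when slots are empty */
        c->hits++;
        return c->cache[--c->ncache];
    }
    c->misses++;                          /* cold start: refill from the pool */
    return (c->next < POOL_BLOCKS) ? (void *)c->pool[c->next++] : NULL;
}

static void toy_free(toy_class_t *c, void *p) {
    if (c->nslots < SLOT_CAP) {           /* freed blocks return to inline slots */
        c->slots[c->nslots++] = p;
        return;
    }
    if (c->ncache < CACHE_CAP)            /* overflow spills to the cache */
        c->cache[c->ncache++] = p;
}
```

In this model, a loop of alloc/free pairs touches the cache exactly once (the first allocation, when the inline slots are still empty) and never records a hit, which mirrors the `0 hits / 1 miss` rows in the table.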

Throughput Baseline

  • 40.50 M ops/s (confirms Phase 76-2 SSOT baseline intact)

GATE DECISION

GO TO PHASE 78-1

SSOT state verified:

  • C4/C5/C6 inline slots confirmed active
  • Traffic interception pattern correct
  • Ready for per-op overhead optimization

Phase 78-1: Per-Op Decision Overhead Removal

Problem Statement

Current inline slot enable checks (tiny_c4/c5/c6_inline_slots_enabled()) add per-operation overhead:

// Current (Phase 76-1): Called on EVERY alloc/free
if (class_idx == 4 && tiny_c4_inline_slots_enabled()) {
    // tiny_c4_inline_slots_enabled() = function call + cached static check
}

Each operation has:

  1. Function call overhead
  2. Static variable load (g_c4_inline_slots_enabled)
  3. Comparison (== -1) - minimal but measurable
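For reference, the lazy-init pattern behind these three costs usually looks like the sketch below. It assumes the `-1` sentinel mentioned above; the environment variable name `EXAMPLE_C4_INLINE_SLOTS` is a placeholder, not the real hakmem variable, and the real function body may differ.

```c
#include <stdlib.h>

/* -1 = not read yet, 0 = disabled, 1 = enabled */
static int g_c4_inline_slots_enabled = -1;

/* Lazy-init enable check: the ENV is read on the first call only,
 * but every later call still pays for the call, the load, and the
 * comparison against -1. */
static int tiny_c4_inline_slots_enabled(void) {
    if (g_c4_inline_slots_enabled == -1) {
        /* EXAMPLE_C4_INLINE_SLOTS is a hypothetical ENV name */
        const char *v = getenv("EXAMPLE_C4_INLINE_SLOTS");
        g_c4_inline_slots_enabled = (v != NULL && v[0] == '1') ? 1 : 0;
    }
    return g_c4_inline_slots_enabled;
}
```

Once the value is cached, changing the environment has no effect, so the remaining per-op cost is purely the check itself.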

Solution: Fixed Mode Optimization

New ENV: HAKMEM_TINY_INLINE_SLOTS_FIXED=0/1 (default OFF for conservative testing)

When FIXED=1:

  1. At program startup (via bench_profile_apply), read all C4/C5/C6 ENVs once
  2. Cache decisions in static globals: g_c4_inline_slots_fixed_mode, etc.
  3. Hot path: Direct global read instead of a function call (removes the per-op call and the lazy-init comparison; only the load of the cached decision remains)
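A minimal sketch of the three steps, shown for C4 only (C5/C6 would follow the same shape). It assumes that `_fast()` falls back to the lazy check when fixed mode is off; `g_inline_slots_fixed_active`, `env_is_one`, and `EXAMPLE_C4_INLINE_SLOTS` are hypothetical names, and the real box may be organized differently.

```c
#include <stdlib.h>

static int env_is_one(const char *name) {   /* hypothetical helper */
    const char *v = getenv(name);
    return v != NULL && v[0] == '1';
}

/* Existing lazy path: -1 = unread, 0 = off, 1 = on */
static int g_c4_inline_slots_enabled = -1;

static int tiny_c4_inline_slots_enabled(void) {
    if (g_c4_inline_slots_enabled == -1)
        g_c4_inline_slots_enabled = env_is_one("EXAMPLE_C4_INLINE_SLOTS");
    return g_c4_inline_slots_enabled;
}

/* Fixed mode: decisions are read once at startup and then frozen */
static int g_inline_slots_fixed_active  = 0;  /* hypothetical name */
static int g_c4_inline_slots_fixed_mode = 0;

void tiny_inline_slots_fixed_mode_init(void) {
    g_inline_slots_fixed_active = env_is_one("HAKMEM_TINY_INLINE_SLOTS_FIXED");
    if (g_inline_slots_fixed_active)
        g_c4_inline_slots_fixed_mode = env_is_one("EXAMPLE_C4_INLINE_SLOTS");
}

/* Hot path: a plain global read when FIXED=1, lazy check otherwise */
static inline int tiny_c4_inline_slots_enabled_fast(void) {
    if (g_inline_slots_fixed_active)
        return g_c4_inline_slots_fixed_mode;
    return tiny_c4_inline_slots_enabled();
}
```

Note that this shape still leaves one predictable branch on the fixed-mode flag in the hot path, so any gain depends on that branch being cheaper than the call it replaces.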

Expected Performance Impact

  • Optimistic: +1.5% to +3.0% (eliminate per-op decision overhead)
  • Realistic: +0.5% to +1.5% (modern CPUs speculate through branches well)
  • Conservative: +0.1% to +0.5% (if CPU already eliminated the cost via prediction)

Implementation Checklist

Phase 78-1a: Create Fixed Mode Box

  • ✓ Created: core/box/tiny_inline_slots_fixed_mode_box.h
    • Global caching variables: g_c4/c5/c6_inline_slots_fixed_mode
    • Initialization function: tiny_inline_slots_fixed_mode_init()
    • Fast path functions: tiny_c4_inline_slots_enabled_fast(), etc.

Phase 78-1b: Update Alloc Path (tiny_front_hot_box.h)

  • Replace tiny_c4/c5/c6_inline_slots_enabled() with fast versions
  • Add include: #include "tiny_inline_slots_fixed_mode_box.h"
  • Update enable checks to use _fast() suffix

Phase 78-1c: Update Free Path (tiny_legacy_fallback_box.h)

  • Replace tiny_c4/c5/c6_inline_slots_enabled() with fast versions
  • Add include: #include "tiny_inline_slots_fixed_mode_box.h"
  • Update enable checks to use _fast() suffix

Phase 78-1d: Initialize at Program Startup

  • Option 1: Call tiny_inline_slots_fixed_mode_init() from bench_profile_apply()
  • Option 2: Call from hakmem_tiny_init_thread() (TLS init time)
  • Recommended: Option 1 (once at program startup, not per-thread)

Phase 78-1e: A/B Test

  • Baseline: HAKMEM_TINY_INLINE_SLOTS_FIXED=0 (default, Phase 76-2 behavior)
  • Treatment: HAKMEM_TINY_INLINE_SLOTS_FIXED=1 (fixed mode optimization)
  • GO Threshold: +1.0% (same as Phase 77-1, same binary)
  • Runs: 10 per configuration (WS=400, 20M iterations)
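The comparison against the GO threshold reduces to a relative gain between the two run means. The helper below is a sketch of that arithmetic; `mean_ops` and `ab_gain_percent` are hypothetical names, and the inputs are expected to be per-run throughput values (for example, M ops/s from each of the 10 runs).

```c
#include <stddef.h>

/* Arithmetic mean of n throughput samples; returns 0.0 for n == 0. */
static double mean_ops(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return (n > 0) ? sum / (double)n : 0.0;
}

/* Relative gain of treatment over baseline, in percent.
 * Returns 0.0 when the baseline mean is not positive. */
static double ab_gain_percent(const double *base, size_t nb,
                              const double *treat, size_t nt) {
    double b = mean_ops(base, nb);
    if (b <= 0.0)
        return 0.0;
    return (mean_ops(treat, nt) - b) / b * 100.0;
}
```

With a baseline near 40.50 M ops/s, the +1.0% threshold corresponds to roughly 0.4 M ops/s, so run-to-run variance should be checked before trusting a small difference.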

Code Pattern

Alloc Path (tiny_front_hot_box.h)

#include "tiny_inline_slots_fixed_mode_box.h"  // NEW

// In tiny_hot_alloc_fast():
// Phase 78-1: C3 inline slots with fixed mode
if (class_idx == 3 && tiny_c3_inline_slots_enabled_fast()) {  // CHANGED: use _fast()
    // ...
}

// Phase 76-1: C4 Inline Slots with fixed mode
if (class_idx == 4 && tiny_c4_inline_slots_enabled_fast()) {  // CHANGED: use _fast()
    // ...
}

Initialization (bench_profile.h or hakmem_tiny.c)

extern void tiny_inline_slots_fixed_mode_init(void);

void bench_apply_profile(void) {
    // ... existing code ...

    // Phase 78-1: Initialize fixed mode if enabled
    if (tiny_inline_slots_fixed_enabled()) {
        tiny_inline_slots_fixed_mode_init();
    }
}

Rationale for This Optimization

  1. Proven Optimization: C4/C5/C6 are locked to SSOT (+7.05% cumulative)
  2. Per-Op Overhead Matters: Hot path executes 20M+ times per benchmark
  3. Low Risk: Backward compatible (FIXED=0 is default, restores Phase 76-1 behavior)
  4. Architectural Fit: Aligns with Box Pattern (single responsibility at initialization)
  5. Foundation for Future: Can apply same technique to other per-op decisions

Risk Assessment

Low Risk:

  • Backward compatible (FIXED=0 by default)
  • No change to inline slots logic, only to enable checks
  • Can quickly disable with ENV (FIXED=0)
  • A/B testing validates correctness

Potential Issues:

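  • The compiler may already inline the enable checks, and the branch predictor may already hide the comparison, leaving little overhead to remove; the A/B test will show whether this is the case
  • Cache coherency on multi-socket systems (unlikely to affect performance, since the globals are written once at startup and only read afterwards)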

Success Criteria

PASS (+1.0% minimum):

  • Implementation complete
  • A/B test shows +1.0% or greater gain
  • Promote FIXED to default
  • Document in PHASE78_1 results

⚠️ MARGINAL (+0.3% to below +1.0%):

  • Measurable gain but below threshold
  • Keep as optional optimization (FIXED=0 default)
  • Investigate CPU branch prediction effectiveness

FAIL (< +0.3%):

  • Compiler/CPU already eliminated the overhead
  • Revert to Phase 76-1 behavior (simpler code)
  • Explore alternative optimizations (Phase 79+)
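The three bands can be expressed as a small classifier. This is a sketch with hypothetical names (`ab_verdict_t`, `ab_classify`); it assumes that gains between +0.9% and +1.0% count as MARGINAL, since only +1.0% or more qualifies as PASS.

```c
typedef enum { AB_FAIL, AB_MARGINAL, AB_PASS } ab_verdict_t;

/* Map a measured gain (in percent) onto the success criteria. */
static ab_verdict_t ab_classify(double gain_percent) {
    if (gain_percent >= 1.0)
        return AB_PASS;       /* promote FIXED to default */
    if (gain_percent >= 0.3)
        return AB_MARGINAL;   /* keep optional, FIXED=0 default */
    return AB_FAIL;           /* revert to Phase 76-1 behavior */
}
```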

Next Steps

  1. Implement Phase 78-1 (if approved):

    • Update tiny_c4/c5/c6_inline_slots_env_box.h to check fixed mode
    • Update tiny_front_hot_box.h and tiny_legacy_fallback_box.h
    • Add initialization call to bench_profile_apply()
    • Build and test
  2. Run Phase 78-1 A/B Test (10 runs each configuration)

  3. Decision Gate:

    • ≥ +1.0% → Promote to SSOT
    • ⚠️ +0.3% to below +1.0% → Keep optional
    • < +0.3% → Revert (keep Phase 76-1 as is)
  4. Phase 79+: If Phase 78-1 ≥ +1.0%, continue with alternative optimization axes


Summary Table

| Phase | Focus | Result | Decision |
|-------|-------|--------|----------|
| 77-0 | C0-C3 Volume | C3 traffic minimal | Proceed to 77-1 |
| 77-1 | C3 Inline Slots | +0.40% (NO-GO) | NO-GO, skip 77-2 |
| 78-0 | SSOT Verification | Verified | Proceed to 78-1 |
| 78-1 | Per-Op Overhead | TBD | In Progress |

Status: Phase 78-0 Complete, Phase 78-1 Plan Finalized, Ready for Implementation

Binary Size: Phase 76-2 baseline + ~1.5KB (new box, static globals)

Code Quality: Low-risk optimization (backward compatible, architectural alignment)