Phase 76-0: C7 Per-Class Statistics Analysis (Establishing the SSOT)
Executive Summary
Definitive C7 Statistics from Mixed SSOT Workload:
- C7 Hit Count: 0 (ZERO allocations)
- C7 Percentage: 0.00% of C4-C7 operations
- Verdict: NO-GO for C7 P2 (inline slots optimization)
Test Configuration
Binary: bench_random_mixed_hakmem_observe (with HAKMEM_MEASURE_UNIFIED_CACHE=1)
Environment Variables:
HAKMEM_WARM_POOL_SIZE=16
HAKMEM_TINY_C5_INLINE_SLOTS=1
HAKMEM_TINY_C6_INLINE_SLOTS=1
Benchmark Parameters:
- Iterations: 20,000,000
- Working Set Size: 400
- Runs: 1 (per-class stats are cumulative)
Unified Cache Initialization:
C4 capacity = 64 (power of 2)
C5 capacity = 128 (power of 2)
C6 capacity = 128 (power of 2)
C7 capacity = 128 (power of 2)
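All four capacities are powers of two. The log does not say why. One common motivation, assumed here and not confirmed by the source, is that a ring-buffer index can wrap with a bitmask instead of a modulo. The sketch below uses hypothetical helpers (`is_power_of_two`, `wrap_index`; neither is a hakmem function) to check that property for the reported capacities.

```python
def is_power_of_two(n: int) -> bool:
    # A positive integer is a power of two iff exactly one bit is set.
    return n > 0 and (n & (n - 1)) == 0

def wrap_index(i: int, capacity: int) -> int:
    # Hypothetical ring-index wrap: for power-of-two capacities,
    # i & (capacity - 1) equals i % capacity without a division.
    assert is_power_of_two(capacity)
    return i & (capacity - 1)

# Capacities reported at unified cache initialization.
CAPACITIES = {"C4": 64, "C5": 128, "C6": 128, "C7": 128}

for cls, cap in CAPACITIES.items():
    assert is_power_of_two(cap)
    # Mask-based wrap agrees with modulo over several laps of the ring.
    assert all(wrap_index(i, cap) == i % cap for i in range(4 * cap))
    print(f"{cls}: capacity {cap}, mask 0x{cap - 1:x}")
```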
Results: Per-Class Statistics
C7 Statistics (CRITICAL FINDING)
| Metric | Value |
|---|---|
| Hit Count | 0 |
| Miss Count | 0 |
| Push Count | 0 |
| Full Count | 0 |
| Total Allocations | 0 |
| Occupied Slots | 0/128 |
| Hit Rate | N/A |
| Full Rate | N/A |
Status: C7 received ZERO allocations in the Mixed SSOT workload.
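The N/A entries follow from empty denominators. A minimal sketch, assuming hit rate = hit / (hit + miss) and full rate = full / (push + full), the definitions implied by the per-class log lines rather than confirmed from hakmem's source:

```python
from typing import Optional

def rate(numerator: int, denominator: int) -> Optional[float]:
    """Percentage, or None (reported as N/A) when the class saw no operations."""
    if denominator == 0:
        return None
    return 100.0 * numerator / denominator

def hit_rate(hit: int, miss: int) -> Optional[float]:
    return rate(hit, hit + miss)

def full_rate(push: int, full: int) -> Optional[float]:
    return rate(full, push + full)

def fmt(pct: Optional[float]) -> str:
    return "N/A" if pct is None else f"{pct:.1f}%"

print("C7 hit rate: ", fmt(hit_rate(0, 0)))        # N/A
print("C7 full rate:", fmt(full_rate(0, 0)))       # N/A
print("C4 hit rate: ", fmt(hit_rate(687_563, 1)))  # 100.0%
```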
C4-C7 Ranking (Cumulative)
| Class | Hit Count | Miss Count | Capacity | Hit % | Share of C4-C7 Hits |
|---|---|---|---|---|---|
| C6 | 2,750,854 | 1 | 128 | 100.0% | 57.17% |
| C5 | 1,373,604 | 1 | 128 | 100.0% | 28.55% |
| C4 | 687,563 | 1 | 64 | 100.0% | 14.29% |
| C7 | 0 | 0 | 128 | N/A | 0.00% |
| TOTAL | 4,812,021 | 3 | — | — | 100.00% |
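The share column is each class's hit count divided by the C4-C7 total. This short check reproduces the column from the counts in the table:

```python
# Hit counts from the C4-C7 ranking table.
HITS = {"C6": 2_750_854, "C5": 1_373_604, "C4": 687_563, "C7": 0}

total = sum(HITS.values())  # 4,812,021
shares = {cls: 100.0 * n / total for cls, n in HITS.items()}

# Print classes from largest to smallest hit count, as in the table.
for cls, n in sorted(HITS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cls}: {n:>9,} hits {shares[cls]:6.2f}%")
```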
Coverage Analysis
| Cumulative Classes | Hits | Share of C4-C7 Hits |
|---|---|---|
| C6 alone | 2,750,854 | 57.17% |
| C5+C6 | 4,124,458 | 85.71% |
| C4+C5+C6 | 4,812,021 | 100.00% |
| C4+C5+C6+C7 | 4,812,021 | 100.00% (no change) |
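Each coverage row is a running sum of the ranked hit counts over the same C4-C7 total. A minimal sketch using `itertools.accumulate`:

```python
from itertools import accumulate

# Hit counts from the ranking table, ordered from largest to smallest.
ORDER = ["C6", "C5", "C4", "C7"]
HITS = {"C6": 2_750_854, "C5": 1_373_604, "C4": 687_563, "C7": 0}

total = sum(HITS.values())
running = list(accumulate(HITS[c] for c in ORDER))

coverage = {}
for i, ops in enumerate(running):
    label = "+".join(sorted(ORDER[: i + 1]))  # e.g. "C5+C6"
    coverage[label] = 100.0 * ops / total
    print(f"{label:<12} {ops:>10,} {coverage[label]:6.2f}%")
```

Computing from the raw sum gives 85.71% for C5+C6; adding the already-rounded per-class shares (57.17% + 28.55%) overstates it by 0.01pp.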
Decision Analysis
Threshold Criteria
- GO for C7 P2: C7 > 20% of C4-C7 operations
- NEUTRAL: 15% < C7 ≤ 20% of C4-C7 operations
- CONSIDER C4 redesign: C7 ≤ 15% of C4-C7 operations
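The three bands can be expressed as a small decision function. This is a hypothetical helper (`c7_p2_verdict` is not part of hakmem) written directly from the listed thresholds, with the boundaries placed exactly as stated (> 20% is GO, 15% < x ≤ 20% is NEUTRAL, ≤ 15% is CONSIDER C4 redesign):

```python
def c7_p2_verdict(c7_share_pct: float) -> str:
    """Map C7's share of C4-C7 operations (in percent) to a decision band."""
    if c7_share_pct > 20.0:
        return "GO for C7 P2"
    if c7_share_pct > 15.0:
        return "NEUTRAL"
    return "CONSIDER C4 redesign"

print(c7_p2_verdict(0.0))  # the measured C7 share in this run
```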
Verdict: NO-GO for C7 P2
C7: 0.00%, far below the 15% threshold, which by the criteria above also places this result in the "CONSIDER C4 redesign" band.
Explanation:
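- Zero Volume: The Mixed SSOT workload (128-1024B allocations) did NOT generate any C7 (1024-2048B) allocations in this run.
- Workload Mismatch: The benchmark exercises C4-C6 intensively (4,812,021 hits) but never reaches C7. The working set size (400) and iteration count (20M) scale the volume of operations; which classes are hit is most likely determined by the benchmark's allocation size range rather than by these parameters.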
- No Optimization Benefit: Any C7 P2 (inline slots) optimization would provide 0% improvement for this specific workload.
- Opportunity Cost: Engineering effort for C7 P2 is better spent on C4 (14.29% of C4-C7 hits) or on workloads that actually exercise C7.
Recommended Next Phase
Phase 76-1: C4 Per-Class Deep Dive
Objective: Analyze C4 (14.29% of C4-C7 hits, about 12.9% of all unified-cache allocations in this run) as the next optimization target
Rationale:
- C4 is the largest C4-C7 class not yet covered by inline slots (C5 and C6 already are)
- C4 (256-512B) represents a significant portion of tiny allocations
- With C5 and C6 (85.7% of C4-C7 hits) already served by inline slots, C4 is the remaining lever for overall performance in this size range
Investigation Areas:
- C4 Hit Rate: Currently 100.0% (1 miss in 687,564 lookups), so there is essentially no miss rate left to reduce; any gain would have to come from a cheaper hit path.
- C4 Cache Occupancy: 63/64 slots occupied (near full)
- C4 Allocation Pattern: Is there temporal locality opportunity?
- Alternative: Investigate workloads that DO use C7 (system-level, long-lived objects)
Suggested Implementation Options:
- C4 LIFO optimization (vs current FIFO-like behavior)
- C4 spatial locality improvements
- C4 refill batching (similar to C5/C6)
- Hybrid C4-C5 inline slots strategy
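To illustrate what the LIFO option would change: a LIFO free list hands back the most recently freed block, which is more likely to still be cache-hot, while a FIFO queue hands back the oldest one. The sketch below models only reuse order with Python containers; `LifoSlots` and `FifoSlots` are hypothetical and do not describe the C4 cache's real data structure.

```python
from collections import deque

class LifoSlots:
    """Hypothetical slot cache: reuse the most recently freed block first."""
    def __init__(self):
        self._stack = []

    def free(self, block):
        self._stack.append(block)

    def alloc(self):
        return self._stack.pop() if self._stack else None

class FifoSlots:
    """Hypothetical slot cache: reuse the oldest freed block first."""
    def __init__(self):
        self._queue = deque()

    def free(self, block):
        self._queue.append(block)

    def alloc(self):
        return self._queue.popleft() if self._queue else None

for cache in (LifoSlots(), FifoSlots()):
    for addr in (0x100, 0x200, 0x300):  # blocks freed in this order
        cache.free(addr)
    print(type(cache).__name__, hex(cache.alloc()))
```

With the free order above, the LIFO cache returns 0x300 (the last block freed) and the FIFO cache returns 0x100.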
Artifacts
Raw Log
Location: /tmp/phase76_0_c7_stats.log
Key excerpts:
[Unified-STATS] Unified Cache Metrics:
[Unified-STATS] Consistency Check:
[Unified-STATS] total_allocs (hit+miss) = 5327287
[Unified-STATS] total_frees (push+full) = 1202827
C2: 128/2048 slots occupied, hit=172530 miss=1 (100.0% hit), push=172531 full=0 (0.0% full)
C3: 128/2048 slots occupied, hit=342731 miss=1 (100.0% hit), push=342732 full=0 (0.0% full)
C4: 63/64 slots occupied, hit=687563 miss=1 (100.0% hit), push=687564 full=0 (0.0% full)
C5: 75/128 slots occupied, hit=1373604 miss=1 (100.0% hit), push=0 full=0 (0.0% full)
C6: 42/128 slots occupied, hit=2750854 miss=1 (100.0% hit), push=0 full=0 (0.0% full)
[C7 MISSING - 0 operations]
Throughput = 46152700 ops/s [iter=20000000 ws=400] time=0.433s
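The consistency-check totals can be re-derived from the per-class lines. The sketch below parses the excerpt with regular expressions written for the exact line format shown (an assumption about the full log) and treats C7, which has no line, as all-zero counters:

```python
import re

# Excerpt of /tmp/phase76_0_c7_stats.log, copied from the Raw Log section.
LOG = """\
[Unified-STATS] total_allocs (hit+miss) = 5327287
[Unified-STATS] total_frees (push+full) = 1202827
C2: 128/2048 slots occupied, hit=172530 miss=1 (100.0% hit), push=172531 full=0 (0.0% full)
C3: 128/2048 slots occupied, hit=342731 miss=1 (100.0% hit), push=342732 full=0 (0.0% full)
C4: 63/64 slots occupied, hit=687563 miss=1 (100.0% hit), push=687564 full=0 (0.0% full)
C5: 75/128 slots occupied, hit=1373604 miss=1 (100.0% hit), push=0 full=0 (0.0% full)
C6: 42/128 slots occupied, hit=2750854 miss=1 (100.0% hit), push=0 full=0 (0.0% full)
"""

CLASS_RE = re.compile(
    r"^(C\d): \d+/\d+ slots occupied, hit=(\d+) miss=(\d+) .*?push=(\d+) full=(\d+)",
    re.MULTILINE,
)
TOTAL_RE = re.compile(r"total_(allocs|frees) \([^)]*\) = (\d+)")

# Per-class counters: {class: (hit, miss, push, full)}.
stats = {m[1]: tuple(int(x) for x in m.groups()[1:]) for m in CLASS_RE.finditer(LOG)}
totals = {kind: int(n) for kind, n in TOTAL_RE.findall(LOG)}

allocs = sum(hit + miss for hit, miss, _, _ in stats.values())
frees = sum(push + full for _, _, push, full in stats.values())
# Classes without a log line (C7 here) count as zero.
c4_c7_hits = sum(stats.get(c, (0, 0, 0, 0))[0] for c in ("C4", "C5", "C6", "C7"))

print("allocs:", allocs, "matches" if allocs == totals["allocs"] else "MISMATCH")
print("frees: ", frees, "matches" if frees == totals["frees"] else "MISMATCH")
print("C4-C7 hits:", c4_c7_hits, "| C7 line present:", "C7" in stats)
```

Only C2, C3, and C4 contribute to total_frees; C5 and C6 report push=0. This is consistent with their frees being absorbed by the inline slots enabled in the test configuration, although the log itself does not confirm that path.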
Verification Output
C7 Initialization: ✓ Capacity=128 allocated
C7 Route Assignment: ✓ LEGACY route configured
C7 Operations: ✗ ZERO allocations
C7 Carve Attempts: 0 (no operations triggered)
C7 Warm Pool: 0 pops, 0 pushes
C7 Meta Used Counter: 0 total operations
Key Insights
- Workload Characterization: The Mixed SSOT benchmark concentrates its C4-C7 traffic in C4-C6 (128-1024B). This focus is intentional and appropriate for most mixed workloads.
- C7 Usage Outside This Benchmark: C7 (1024-2048B) allocations are expected in:
  - Long-lived data structures (hash tables, trees)
  - System-level workloads (networking buffers)
  - Specialized benchmarks (not representative of general use)
- Optimization Priority:
  - C6 (57.2%): ✓ Already optimized with inline slots
  - C5 (28.5%): ✓ Already optimized with inline slots
  - C4 (14.3%): ← Next optimization target
  - C7 (0.0%): ✗ Not present in the mixed workload
- Engineering Trade-offs:
  - C7 P2 would add complexity for 0% mixed-workload benefit
  - A C4 redesign could improve 14.3% of C4-C7 operations
  - Consider shelving C7-specific optimization unless a dedicated workload justifies it
Conclusion
Phase 76-0 Complete: C7 is definitively measured at 0.00% of Mixed SSOT operations.
Next Action: Proceed to Phase 76-1: C4 Analysis to evaluate the largest remaining optimization opportunity (14.29% of C4-C7 operations).
File: /tmp/phase76_0_c7_stats.log
Date: 2025-12-18
Status: ✓ Decision gate established