# Phase 76-0: C7 Per-Class Statistics Analysis (SSOT Standardization)

## Executive Summary

**Definitive C7 Statistics from Mixed SSOT Workload:**
- **C7 Hit Count: 0** (ZERO allocations)
- **C7 Percentage: 0.00%** of C4-C7 operations
- **Verdict: NO-GO for C7 P2 (inline slots optimization)**

---

## Test Configuration

**Binary**: `bench_random_mixed_hakmem_observe` (with `HAKMEM_MEASURE_UNIFIED_CACHE=1`)

**Environment Variables**:
```bash
HAKMEM_WARM_POOL_SIZE=16
HAKMEM_TINY_C5_INLINE_SLOTS=1
HAKMEM_TINY_C6_INLINE_SLOTS=1
```

**Benchmark Parameters**:
- Iterations: 20,000,000
- Working Set Size: 400
- Runs: 1 (per-class stats are cumulative)

**Unified Cache Initialization**:
```
C4 capacity = 64 (power of 2)
C5 capacity = 128 (power of 2)
C6 capacity = 128 (power of 2)
C7 capacity = 128 (power of 2)
```

---

## Results: Per-Class Statistics

### C7 Statistics (CRITICAL FINDING)
| Metric | Value |
|--------|-------|
| Hit Count | 0 |
| Miss Count | 0 |
| Push Count | 0 |
| Full Count | 0 |
| **Total Allocations** | **0** |
| **Occupied Slots** | **0/128** |
| Hit Rate | N/A |
| Full Rate | N/A |

**Status**: C7 received **ZERO allocations** in the Mixed SSOT workload.

### C4-C7 Ranking (Cumulative)

| Class | Hit Count | Miss Count | Capacity | Hit Rate | Share of C4-C7 Hits |
|-------|-----------|------------|----------|----------|---------------------|
| C6 | 2,750,854 | 1 | 128 | 100.0% | **57.17%** |
| C5 | 1,373,604 | 1 | 128 | 100.0% | **28.55%** |
| C4 | 687,563 | 1 | 64 | 100.0% | **14.29%** |
| C7 | 0 | 0 | 128 | N/A | **0.00%** |
| **TOTAL** | **4,812,021** | **3** | - | - | **100.00%** |

### Coverage Analysis

| Cumulative Classes | Hits | Share of C4-C7 Hits |
|--------------------|------|---------------------|
| C6 alone | 2,750,854 | 57.17% |
| C5+C6 | 4,124,458 | 85.71% |
| **C4+C5+C6** | **4,812,021** | **100.00%** |
| C4+C5+C6+C7 | 4,812,021 | 100.00% (no change) |
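
The shares and cumulative coverage follow directly from the per-class hit counts. The following is a minimal Python sketch that recomputes them, assuming the hit counts are copied by hand from the ranking table; the names `hits`, `share`, and `coverage` are illustrative and not part of hakmem. Running it prints one cumulative line per class, from busiest to idlest.

```python
# Hit counts copied from the C4-C7 ranking table (illustrative names).
hits = {"C6": 2_750_854, "C5": 1_373_604, "C4": 687_563, "C7": 0}

total = sum(hits.values())


def share(count, total):
    """Percentage of the total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# Per-class share of all C4-C7 hits.
shares = {cls: share(n, total) for cls, n in hits.items()}

# Cumulative coverage, adding classes from busiest to idlest.
coverage = []
running = 0
for cls, n in sorted(hits.items(), key=lambda kv: kv[1], reverse=True):
    running += n
    coverage.append((cls, running, share(running, total)))

for cls, ops, pct in coverage:
    print(f"up to {cls}: {ops:,} hits ({pct:.2f}%)")
```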

---

## Decision Analysis

### Threshold Criteria
- **GO for C7 P2**: C7 > 20% of C4-C7 operations
- **NEUTRAL**: 15% < C7 ≤ 20% of C4-C7 operations
- **CONSIDER C4 redesign**: C7 ≤ 15% of C4-C7 operations
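
The gate above can be written as a small function. This is only a sketch: `classify_c7_share` and its return labels are hypothetical names chosen for illustration, and the thresholds are the ones listed above.

```python
def classify_c7_share(c7_pct: float) -> str:
    """Map C7's share of C4-C7 operations (in percent) to a decision band."""
    if c7_pct > 20.0:
        return "GO"
    if c7_pct > 15.0:
        return "NEUTRAL"
    return "CONSIDER_C4_REDESIGN"


# Measured C7 share in this run.
print(classify_c7_share(0.0))
```

With the measured C7 share of 0.00%, the function lands in the C4-redesign band, which corresponds to NO-GO for C7 P2.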

### Verdict: **NO-GO for C7 P2**

**C7: 0.00%** - far below both the 20% GO threshold and the 15% NEUTRAL floor

**Explanation:**
1. **Zero Volume**: The Mixed SSOT workload (128-1024B allocations) does NOT generate any C7 (1024-2048B) allocations.
2. **Workload Mismatch**: The benchmark configuration (working set size 400, 20M iterations, allocation sizes up to 1024B) exercises C4-C6 intensively but never reaches C7.
3. **No Optimization Benefit**: Any C7 P2 (inline slots) optimization would provide 0% improvement for this specific workload.
4. **Opportunity Cost**: Engineering effort for C7 P2 would be better spent on C4 (14.29% of C4-C7 operations) or on investigating alternative workloads.

---

## Recommended Next Phase

### Phase 76-1: C4 Per-Class Deep Dive

**Objective**: Analyze C4 (14.3% of C4-C7 operations) as the next optimization target

**Rationale**:
- C4 is the **largest remaining bottleneck** after C5+C6 inline slots
- C4 (128-256B, given the 128-1024B range stated for C4-C6) represents a significant portion of tiny allocations
- With C5/C6 (85.7% of C4-C7 operations) already optimized, C4 becomes critical for overall performance

**Investigation Areas**:
1. **C4 Hit Rate**: Currently 100.0% (687,563 hits, 1 miss); is there any room left for miss reduction?
2. **C4 Cache Occupancy**: 63/64 slots occupied (near full)
3. **C4 Allocation Pattern**: Is there a temporal locality opportunity?
4. **Alternative**: Investigate workloads that DO use C7 (system-level, long-lived objects)

**Suggested Implementation Options**:
- C4 LIFO optimization (vs current FIFO-like behavior)
- C4 spatial locality improvements
- C4 refill batching (similar to C5/C6)
- Hybrid C4-C5 inline slots strategy

---

## Artifacts

### Raw Log
Location: `/tmp/phase76_0_c7_stats.log`

Key excerpts:
```
[Unified-STATS] Unified Cache Metrics:
[Unified-STATS] Consistency Check:
[Unified-STATS] total_allocs (hit+miss) = 5327287
[Unified-STATS] total_frees (push+full) = 1202827

C2: 128/2048 slots occupied, hit=172530 miss=1 (100.0% hit), push=172531 full=0 (0.0% full)
C3: 128/2048 slots occupied, hit=342731 miss=1 (100.0% hit), push=342732 full=0 (0.0% full)
C4: 63/64 slots occupied, hit=687563 miss=1 (100.0% hit), push=687564 full=0 (0.0% full)
C5: 75/128 slots occupied, hit=1373604 miss=1 (100.0% hit), push=0 full=0 (0.0% full)
C6: 42/128 slots occupied, hit=2750854 miss=1 (100.0% hit), push=0 full=0 (0.0% full)
[C7 MISSING - 0 operations]

Throughput = 46152700 ops/s [iter=20000000 ws=400] time=0.433s
```
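
The consistency totals in the excerpt can be cross-checked against the per-class lines. The following is a minimal Python sketch, assuming the per-class line format shown in the excerpt; the regular expression and the `parse_class_stats` helper are illustrative and not part of hakmem's tooling.

```python
import re

# Matches per-class lines such as:
# "C4: 63/64 slots occupied, hit=687563 miss=1 (100.0% hit), push=687564 full=0 (0.0% full)"
LINE_RE = re.compile(
    r"^(C\d+): (\d+)/(\d+) slots occupied, "
    r"hit=(\d+) miss=(\d+) \([\d.]+% hit\), "
    r"push=(\d+) full=(\d+)"
)


def parse_class_stats(log_text):
    """Return {class_name: counters} for every per-class line found."""
    stats = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            occupied, capacity, hit, miss, push, full = map(int, m.groups()[1:])
            stats[m.group(1)] = dict(occupied=occupied, capacity=capacity,
                                     hit=hit, miss=miss, push=push, full=full)
    return stats


# Per-class lines copied from the log excerpt.
excerpt = """\
C2: 128/2048 slots occupied, hit=172530 miss=1 (100.0% hit), push=172531 full=0 (0.0% full)
C3: 128/2048 slots occupied, hit=342731 miss=1 (100.0% hit), push=342732 full=0 (0.0% full)
C4: 63/64 slots occupied, hit=687563 miss=1 (100.0% hit), push=687564 full=0 (0.0% full)
C5: 75/128 slots occupied, hit=1373604 miss=1 (100.0% hit), push=0 full=0 (0.0% full)
C6: 42/128 slots occupied, hit=2750854 miss=1 (100.0% hit), push=0 full=0 (0.0% full)
[C7 MISSING - 0 operations]
"""

stats = parse_class_stats(excerpt)
total_allocs = sum(s["hit"] + s["miss"] for s in stats.values())
total_frees = sum(s["push"] + s["full"] for s in stats.values())
print(total_allocs, total_frees)
```

Both sums match the `total_allocs` (5327287) and `total_frees` (1202827) values in the log's consistency check. Those totals include C2 and C3, so they are larger than the C4-C7 totals used in the ranking table.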

### Verification Output
```
C7 Initialization: ✓ Capacity=128 allocated
C7 Route Assignment: ✓ LEGACY route configured
C7 Operations: ✗ ZERO allocations
C7 Carve Attempts: 0 (no operations triggered)
C7 Warm Pool: 0 pops, 0 pushes
C7 Meta Used Counter: 0 total operations
```

---

## Key Insights

1. **Workload Characterization**: The Mixed SSOT benchmark is tuned for C4-C6 (128-1024B). This is intentional and appropriate for most mixed workloads.

2. **C7 Opportunity**: C7 (1024-2048B) allocations appear in:
   - Long-lived data structures (hash tables, trees)
   - System-level workloads (networking buffers)
   - Specialized benchmarks (not representative of general use)

3. **Optimization Priority** (share of C4-C7 operations):
   - C6 (57.2%): ✓ Already optimized with inline slots
   - C5 (28.5%): ✓ Already optimized with inline slots
   - C4 (14.3%): ← **Next optimization target**
   - C7 (0.0%): ✗ No presence in mixed workload

4. **Engineering Trade-offs**:
   - C7 P2 would add complexity for 0% mixed-workload benefit
   - C4 redesign could improve 14.3% of C4-C7 operations
   - Consider phasing out C7 optimization if isolated workloads don't justify it

---

## Conclusion

**Phase 76-0 Complete**: C7 is definitively measured at 0.00% of Mixed SSOT operations.

**Next Action**: Proceed to **Phase 76-1: C4 Analysis** to evaluate the largest remaining optimization opportunity (14.29% of C4-C7 operations).

**File**: `/tmp/phase76_0_c7_stats.log`
**Date**: 2025-12-18
**Status**: ✓ Decision gate established