hakmem/docs/analysis/PHASE75_PERCLASS_ANALYSIS_0_SSOT.md
2025-12-18 09:11:56 +09:00

# Phase 75 Per-Class Analysis - Mixed SSOT Unified-STATS
**Status**: ANALYSIS COMPLETE, ready for Phase 75 (P2: Hot-class Inline Slots) targeting decision
**Workload**: Mixed SSOT (WS=400, ITERS=20000000, WarmPool=16)
**Measurement**: `HAKMEM_MEASURE_UNIFIED_CACHE=1` OBSERVE run
---
## 1. Per-Class Unified-STATS (Ranked by Volume)
### Data Summary
| Class | Capacity | Occupied | Hit Count | Push Count | Total Ops | Hit Rate | % of Total |
|-------|----------|----------|-----------|------------|-----------|----------|-----------|
| **C6** | 128 | 127 | 2,750,854 | 2,750,855 | **5,501,709** | 100.0% | **57.2%** |
| **C5** | 128 | 127 | 1,373,604 | 1,373,605 | **2,747,209** | 100.0% | **28.5%** |
| **C4** | 64 | 63 | 687,563 | 687,564 | **1,375,127** | 100.0% | **14.3%** |
| **C7** | ? | ? | ? | ? | **?** | ? | **?** |
**Total C4-C6**: 9,624,045 operations (100% hit rate across all three classes)
**Observation**: C7 statistics not visible in current OBSERVE output (may require additional diagnostics)
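
The Total Ops and % of Total columns can be re-derived from the raw hit and push counts. The following is a minimal sketch, assuming Total Ops is simply Hit Count + Push Count and that the share is taken over the three observed classes only; the names `STATS`, `total_ops`, and `volume_share` are illustrative and not part of hakmem.

```python
# Counts copied from the Data Summary table (C7 is omitted: no data yet).
STATS = {
    "C6": {"hits": 2_750_854, "pushes": 2_750_855},
    "C5": {"hits": 1_373_604, "pushes": 1_373_605},
    "C4": {"hits": 687_563, "pushes": 687_564},
}


def total_ops(stats):
    """Total Ops per class, assumed to be Hit Count + Push Count."""
    return {cls: s["hits"] + s["pushes"] for cls, s in stats.items()}


def volume_share(stats):
    """Percent share of the observed (C4-C6) operation volume per class."""
    ops = total_ops(stats)
    grand_total = sum(ops.values())
    return {cls: 100.0 * n / grand_total for cls, n in ops.items()}


ops = total_ops(STATS)
for cls, pct in volume_share(STATS).items():
    print(f"{cls}: {ops[cls]:>9,} ops, {pct:.1f}% of observed volume")
print(f"Total C4-C6: {sum(ops.values()):,}")
```

Because C7 is missing, every share here is relative to C4-C6 only and will shrink once C7 counts are added to `STATS`.
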
---
## 2. Ranking & Key Findings
### Volume Ranking (Descending)
1. **C6: 57.2% of observed C4-C6 volume** (2.75M hits, 2.75M pushes)
   - Highest operational density
   - Cache occupancy: 127/128 (99.2%)
   - 100% hit rate
2. **C5: 28.5% of observed C4-C6 volume** (1.37M hits, 1.37M pushes)
   - Second-highest operational density
   - Cache occupancy: 127/128 (99.2%)
   - 100% hit rate
3. **C4: 14.3% of observed C4-C6 volume** (687K hits, 687K pushes)
   - Lower operational density
   - Cache occupancy: 63/64 (98.4%)
   - 100% hit rate
4. **C7: UNKNOWN**
   - Statistics not yet captured
   - Requires a separate analysis run with explicit C7 flags
---
## 3. Unified-STATS Interpretation
### Perfect Hit Rates (100% across all observed classes)
All observed classes (C4, C5, C6) achieve **100% hit rate** in Mixed SSOT workload:
- Zero refill events (push count equals hit count to within one operation per class)
- All allocations sourced from unified_cache (no fallback to backend)
- Cache capacity is **never exhausted** (0% full events)
**Implication**: UnifiedCache is **sufficiently sized** for Mixed SSOT; the refill path is not active during the benchmark.
### Cache Occupancy Patterns
```
C4: 63/64 slots occupied (98.4%) - 1 free slot
C5: 127/128 slots occupied (99.2%) - 1 free slot
C6: 127/128 slots occupied (99.2%) - 1 free slot
```
**Finding**: All classes operate at **near-capacity** (98-99%), indicating:
- Steady-state working set matches cache capacity
- Minimal fragmentation
- High cache efficiency
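
The occupancy percentages follow directly from the occupied/capacity pairs. A small sketch of that arithmetic, using hypothetical names (`OCCUPANCY`, `occupancy_report`) that do not come from the hakmem sources:

```python
# (occupied, capacity) pairs copied from the occupancy listing.
OCCUPANCY = {"C4": (63, 64), "C5": (127, 128), "C6": (127, 128)}


def occupancy_report(occupancy):
    """Return percent occupied and free slot count for each class."""
    report = {}
    for cls, (occupied, capacity) in occupancy.items():
        report[cls] = {
            "percent": 100.0 * occupied / capacity,
            "free_slots": capacity - occupied,
        }
    return report


for cls, row in occupancy_report(OCCUPANCY).items():
    print(f"{cls}: {row['percent']:.1f}% occupied, {row['free_slots']} free")
```
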
---
## 4. P2 (Hot-class Inline Slots) Targeting Strategy
### Recommendation: PRIMARY TARGET = C6
**Rationale**:
1. **Highest ROI**: C6 dominates with 57.2% of observed operations
   - ~2.75M hit operations = largest branch-reduction opportunity
   - Any per-operation saving on C6 applies to 57.2% of observed C4-C6 ops
2. **Secondary Target**: C5 (28.5%)
   - Significant volume, second-priority optimization
   - Compound benefit: C6 + C5 = 85.7% of observed C4-C6 operations
3. **Low Priority**: C4 (14.3%)
   - Lowest observed volume, lower ROI
   - Defer unless C6/C5 optimization requires it
4. **Unknown**: C7
   - Statistics not yet available
   - Recommend gathering C7 stats before deciding between C6/C5/C4 and C7 targeting
---
## 5. Inline Slots Design Impact Analysis
### Estimated Branch Reduction (per optimization)
Assuming **inline fast-path** placement (TLS-direct, zero-branch):
**Per-class impact** (based on Phase 74 lessons):
- Instruction count reduction per hit: ~2-4 instructions (push/pop branch elimination)
- Expected throughput gain per 1M hits: +0.05-0.10% (conservative estimate)
**C6 standalone**: 2.75M hits × 0.05-0.10%/M = **+0.14-0.28%** (projected, if branch overhead dominates)
**C6 + C5 combined**: 4.12M hits × 0.05-0.10%/M = **+0.21-0.41%** (projected)
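
The projection above is a linear scaling of hit count by an assumed per-million-hits gain. A minimal sketch of that calculation, where `projected_gain_pct` is an illustrative helper and the 0.05-0.10% per 1M hits range is the conservative estimate stated above, not a measured value:

```python
C6_HITS = 2_750_854
C5_HITS = 1_373_604


def projected_gain_pct(hits, gain_per_million=(0.05, 0.10)):
    """Scale an assumed (low, high) % gain per 1M hits by the hit count."""
    millions = hits / 1_000_000
    low, high = gain_per_million
    return millions * low, millions * high


for label, hits in (("C6", C6_HITS), ("C6+C5", C6_HITS + C5_HITS)):
    low, high = projected_gain_pct(hits)
    print(f"{label}: +{low:.2f}% to +{high:.2f}% (projected)")
```

The model ignores the risk factors listed next, so it gives an upper-level intuition for branch savings only, not a throughput prediction.
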
**Risk factors**:
- Cache-miss sensitivity (Phase 74-2 showed +86% cache-misses from register pressure)
- TLS struct bloat (each inline slot = ~8-16 bytes × capacity per class)
- Memory hierarchy effects (L1-dcache pressure from TLS expansion)
---
## 6. Before/After Unified-STATS Baseline
### FAST PGO Baseline Reference (Phase 69: WarmPool=16)
**Important (SSOT)**:
- This baseline is from the FAST PGO scorecard and is the correct reference for mimalloc ratio tracking.
- If you run `scripts/run_mixed_10_cleanenv.sh` without setting `BENCH_BIN`, it defaults to the Standard binary (`./bench_random_mixed_hakmem`).
- To measure Phase 75 on FAST PGO, set `BENCH_BIN` explicitly:
  - `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh`
```
FAST Mixed SSOT Throughput: 62.63 M ops/s (51.77% of mimalloc)
Target M2: 55% of mimalloc (~66.5 M ops/s, given the ~121.0 M ops/s mimalloc rate implied by 62.63 / 0.5177)
Remaining gap: +3.23pp
```
### Phase 75 (P2) Success Criteria (measured vs FAST PGO baseline)
| Scenario | Throughput | vs Baseline | Status |
|----------|-----------|-----------|--------|
| **GO** | ≥ 64.1 M ops/s | ≥ +2.4% | about +1.2pp toward M2 |
| **NEUTRAL** | 61.6-64.1 M ops/s | -1.6% to +2.4% | freeze, continue to Phase 76 |
| **NO-GO** | ≤ 61.6 M ops/s | ≤ -1.6% | revert immediately |
**Strict gate**: +2.0% for structural change (TLS bloat risk)
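
The gate table can be applied mechanically to a measured throughput. The sketch below encodes only the throughput thresholds from the table; `classify` and `delta_pct` are hypothetical helpers, and the separate +2.0% strict gate is deliberately not modeled.

```python
BASELINE_MOPS = 62.63   # FAST PGO baseline (M ops/s)
GO_MIN_MOPS = 64.1      # at or above this: GO
NO_GO_MAX_MOPS = 61.6   # at or below this: NO-GO


def classify(throughput_mops):
    """Map a measured throughput (M ops/s) to GO / NEUTRAL / NO-GO."""
    if throughput_mops >= GO_MIN_MOPS:
        return "GO"
    if throughput_mops <= NO_GO_MAX_MOPS:
        return "NO-GO"
    return "NEUTRAL"


def delta_pct(throughput_mops, baseline=BASELINE_MOPS):
    """Relative change versus the baseline, in percent."""
    return 100.0 * (throughput_mops - baseline) / baseline


for sample in (61.0, 62.63, 64.5):
    print(f"{sample:.2f} M ops/s: {classify(sample)} ({delta_pct(sample):+.2f}%)")
```

The sample throughputs in the loop are made-up inputs for illustration, not benchmark results.
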
---
## 7. Risk Assessment: TLS Expansion vs Benefit
### TLS Struct Bloat Analysis
**Current TLS size** (estimated from Phase 69):
- UnifiedCache entries: minimal (backend pointers only)
- WarmPool SLL: ~2KB (Phase 69-71)
- **Total TINY_MEM TLS: ~2-4KB per thread**
**Proposed P2 expansion** (inline slots for C4-C7):
- C4 inline: 64 slots × 8 bytes = 512 bytes
- C5 inline: 128 slots × 8 bytes = 1,024 bytes
- C6 inline: 128 slots × 8 bytes = 1,024 bytes
- C7 inline: ??? slots × 8 bytes = ???
- **Total P2 expansion (at 8 bytes/slot): ~1KB (C6 only), ~2KB (C6+C5), ~2.5KB (C4-C6); ~4-5KB estimated for all C4-C7, pending C7 capacity**
**TLS Memory Trade-off**:
- 10 threads × 4KB = **40KB system-wide** (negligible)
- But **per-thread L1-dcache footprint** increases
- L1-dcache pressure → potential cache evictions
- Phase 74-2 showed this can dominate (cache-misses +86%)
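
The per-class expansion figures are slots × bytes per slot, and the L1-dcache concern can be framed as a count of additional cache lines touched per thread. A rough sketch under two stated assumptions: 8-byte (pointer-sized) slots as in the estimate above, and a 64-byte cache line, which is typical for x86-64 but is not stated in this document. `CAPACITY`, `inline_bytes`, and `cache_lines` are illustrative names.

```python
SLOT_BYTES = 8          # assumed pointer-sized slot
CACHE_LINE_BYTES = 64   # assumption: typical x86-64 line size
CAPACITY = {"C4": 64, "C5": 128, "C6": 128}  # C7 capacity unknown, omitted


def inline_bytes(classes, slot_bytes=SLOT_BYTES):
    """TLS bytes added by giving each listed class inline slots."""
    return sum(CAPACITY[cls] * slot_bytes for cls in classes)


def cache_lines(nbytes, line_bytes=CACHE_LINE_BYTES):
    """Number of cache lines needed to hold nbytes (ceiling division)."""
    return -(-nbytes // line_bytes)


for option in (["C6"], ["C6", "C5"], ["C4", "C5", "C6"]):
    size = inline_bytes(option)
    print(f"{'+'.join(option)}: {size} bytes, {cache_lines(size)} cache lines")
```

With 16-byte slots (the upper end of the 8-16 byte range mentioned in Section 5), pass `slot_bytes=16` and the sizes double.
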
### Decision Gate
**Before proceeding with P2**:
1. Gather C7 statistics (currently missing)
2. Validate the assumed C6 > C5 > C4 > C7 volume ordering once C7 data is available
3. Decide: C6-only, C6+C5, or full C4-C7?
4. Benchmark single-class inline (C6) first to validate ROI before expanding
---
## 8. Next Steps (User Decision Required)
### Option A: Proceed with C6-only P2 (Recommended - Lowest Risk)
**Approach**:
- Implement inline slots for C6 only (highest volume, 57.2%)
- Measure impact: target +1.5-2.5% throughput (note that this exceeds the branch-only projection in Section 5)
- If successful, expand to C5 in Phase 75-2
**Pros**: Lowest TLS bloat, highest ROI/risk ratio
**Cons**: Multi-phase approach, requires two A/B cycles
### Option B: Proceed with C6+C5 P2 (Moderate Risk)
**Approach**:
- Implement inline slots for C6 + C5 (combined 85.7% of observed C4-C6 ops)
- Measure impact: target +2.0-3.0% throughput
- If successful, consolidate as Phase 75 final
**Pros**: Single A/B cycle, covers 85.7% of the observed operation volume
**Cons**: Higher TLS bloat (~2KB), higher L1-dcache and register pressure risk
### Option C: Defer P2 Until C7 Analysis
**Approach**:
- Gather C7 statistics from separate OBSERVE run
- Rank all four classes before targeting
- Decide on C6/C5/C4/C7 balance based on full data
**Pros**: Data-driven decision, reduces risk of targeting wrong class
**Cons**: Adds diagnostic cycle before implementation
---
## 9. Recommendation Summary
**PRIMARY RECOMMENDATION**: **Option A - Start with C6-only**
**Rationale**:
1. C6 is clearly dominant (57.2% volume)
2. Lowest TLS bloat (~1KB) limits L1-dcache and register pressure risk
3. Conservative approach aligns with Phase 74 learnings (register pressure matters)
4. Fail-fast: if C6 shows positive ROI, expand to C5; if NO-GO, iterate differently
**Secondary**: Gather C7 stats in parallel to validate completeness
**Decision**: **User choice** - provide approach preference before proceeding to Phase 75 implementation
---
## Artifacts
- **Baseline**: Mixed SSOT OBSERVE run: `./bench_random_mixed_hakmem_observe 20000000 400 1`
- **Measurement**: Per-class Unified-STATS with `HAKMEM_MEASURE_UNIFIED_CACHE=1`
- **Analysis**: This document (PHASE75_PERCLASS_ANALYSIS_0_SSOT.md)
---
## Timeline
- Phase 74 (P1/P0): UnifiedCache hit-path optimization → FROZEN (NEUTRAL)
- Phase 75 (P2): Hot-class Inline Slots → **PENDING USER DECISION** (targeting strategy)
- Phase 75-1: Implement selected class(es) → (next)
- Phase 75-2: A/B test & results → (next)