tomoaki/hakmem

Moe Charm (CI) e9fad41154 docs: clarify Phase 75 vs FAST PGO SSOT

2025-12-18 09:11:56 +09:00

Phase 75 Per-Class Analysis - Mixed SSOT Unified-STATS

Status: ANALYSIS COMPLETE, ready for Phase 75 (P2: Hot-class Inline Slots) targeting decision

Workload: Mixed SSOT (WS=400, ITERS=20000000, WarmPool=16)

Measurement: HAKMEM_MEASURE_UNIFIED_CACHE=1 OBSERVE run

1. Per-Class Unified-STATS (Ranked by Volume)

Data Summary

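| Class | Capacity | Occupied | Hit Count | Push Count | Total Ops | Hit Rate | % of Total |
|-------|----------|----------|-----------|------------|-----------|----------|------------|
| C6 | 128 | 127 | 2,750,854 | 2,750,855 | 5,501,709 | 100.0% | 57.2% |
| C5 | 128 | 127 | 1,373,604 | 1,373,605 | 2,747,209 | 100.0% | 28.5% |
| C4 | 64 | 63 | 687,563 | 687,564 | 1,375,127 | 100.0% | 14.3% |
| C7 | ? | ? | ? | ? | ? | ? | ? |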

Total C4-C6: 9,624,045 operations (100% hit rate across all three classes)
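The Total Ops and % of Total columns can be re-derived from the hit and push counters. The following is a minimal sketch of that arithmetic; the variable names are illustrative and not part of hakmem, and C7 is left out because its counters were not captured.

```python
# Hit and push counts copied from the Data Summary table (C7 is unknown).
stats = {
    "C6": {"hits": 2_750_854, "pushes": 2_750_855},
    "C5": {"hits": 1_373_604, "pushes": 1_373_605},
    "C4": {"hits": 687_563, "pushes": 687_564},
}

# Total Ops per class is the sum of the hit and push counters.
total_ops = {cls: s["hits"] + s["pushes"] for cls, s in stats.items()}
grand_total = sum(total_ops.values())

# Share of the observed C4-C6 volume, in percent.
share_pct = {cls: 100.0 * ops / grand_total for cls, ops in total_ops.items()}

for cls in stats:
    print(f"{cls}: {total_ops[cls]:,} ops, {share_pct[cls]:.1f}% of observed total")
print(f"Total C4-C6: {grand_total:,}")
```

Because C7 is missing, every share here is relative to the observed C4-C6 total only and will shrink once C7 volume is known.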

Observation: C7 statistics not visible in current OBSERVE output (may require additional diagnostics)

2. Ranking & Key Findings

Volume Ranking (Descending)

C6: 57.2% of observed C4-C6 volume (2.75M hits, 2.75M pushes)
- Highest operational density
- Cache occupancy: 127/128 (99.2%)
- Perfect 100% hit rate
C5: 28.5% of observed C4-C6 volume (1.37M hits, 1.37M pushes)
- Second-highest operational density
- Cache occupancy: 127/128 (99.2%)
- Perfect 100% hit rate
C4: 14.3% of observed C4-C6 volume (687K hits, 687K pushes)
- Lower operational density
- Cache occupancy: 63/64 (98.4%)
- Perfect 100% hit rate
C7: UNKNOWN
- Statistics not yet captured
- Requires separate analysis run with explicit C7 flags

3. Unified-STATS Interpretation

Perfect Hit Rates (100% across all observed classes)

All observed classes (C4, C5, C6) achieve 100% hit rate in Mixed SSOT workload:

Zero refill events observed (push and hit counts differ by exactly 1 in each class)
All allocations sourced from unified_cache (no fallback to backend)
Cache capacity is never exhausted (0% full events)

Implication: UnifiedCache sufficiently sized for Mixed SSOT; refill path not active during benchmark.

Cache Occupancy Patterns

C4: 63/64  slots occupied (98.4%) - 1 free slot
C5: 127/128 slots occupied (99.2%) - 1 free slot
C6: 127/128 slots occupied (99.2%) - 1 free slot

Finding: All classes operate at near-capacity (98-99%), indicating:

Steady-state working set matches cache capacity
Minimal fragmentation
High cache efficiency
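The occupancy percentages above follow directly from the Occupied and Capacity columns. A small sketch of that calculation, using illustrative names that do not come from the hakmem source:

```python
# (occupied, capacity) per class, from the Unified-STATS table.
occupancy = {"C4": (63, 64), "C5": (127, 128), "C6": (127, 128)}

# Occupancy in percent and number of free slots per class.
occupancy_pct = {c: 100.0 * occ / cap for c, (occ, cap) in occupancy.items()}
free_slots = {c: cap - occ for c, (occ, cap) in occupancy.items()}

for cls, (occ, cap) in occupancy.items():
    print(f"{cls}: {occ}/{cap} = {occupancy_pct[cls]:.1f}% occupied, "
          f"{free_slots[cls]} free")
```

Note that this is a single snapshot taken at the end of the run; it does not show how occupancy varied during the benchmark.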

4. P2 (Hot-class Inline Slots) Targeting Strategy

Recommendation: PRIMARY TARGET = C6

Rationale:

Highest ROI: C6 dominates with 57.2% of operations
- ~2.75M hit operations = highest branch reduction opportunity
- Any per-operation saving on C6 applies to 57.2% of observed C4-C6 operations
Secondary Target: C5 (28.5%)
- Significant volume, second-priority optimization
- Compound benefit: C6 + C5 = 85.7% of observed C4-C6 operations
Low Priority: C4 (14.3%)
- Lowest volume, lower ROI
- Defer unless C6/C5 optimization requires it
Unknown: C7
- Statistics not yet available
- Recommend gathering C7 stats before deciding C6/C5/C4 vs C7 targeting
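The targeting order can be read as a cumulative-coverage question: how much of the observed volume is covered as classes are added in descending order. A minimal sketch, assuming only the Total Ops figures from the table (names are illustrative):

```python
# Total Ops per class from the Data Summary table (C7 not captured).
total_ops = {"C6": 5_501_709, "C5": 2_747_209, "C4": 1_375_127}
grand_total = sum(total_ops.values())

# Rank classes by volume, then accumulate coverage in percent.
ranked = sorted(total_ops.items(), key=lambda kv: kv[1], reverse=True)
cumulative = []
running = 0
for cls, ops in ranked:
    running += ops
    cumulative.append((cls, 100.0 * running / grand_total))

for cls, pct in cumulative:
    print(f"up to and including {cls}: {pct:.1f}% of observed C4-C6 ops")
```

The second entry reproduces the 85.7% figure for C6 + C5. The ranking would need to be recomputed if C7 turns out to carry significant volume.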

5. Inline Slots Design Impact Analysis

Estimated Branch Reduction (per optimization)

Assuming inline fast-path placement (TLS-direct, zero-branch):

Per-class impact (based on Phase 74 lessons):

Instruction count reduction per hit: ~2-4 instructions (push/pop branch elimination)
Expected throughput gain per 1M hits: +0.05-0.10% (conservative estimate)

C6 standalone: 2.75M hits × 0.05-0.10%/M = +0.14-0.28% (projected, if branch overhead dominates)

C6 + C5 combined: 4.12M hits × 0.05-0.10%/M = +0.21-0.41% (projected)
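The two projections use a simple linear model: hits in millions multiplied by an assumed gain per million hits. The sketch below reproduces that arithmetic; the function name is hypothetical, and the 0.05-0.10%/M rate is the document's own conservative estimate rather than a measured value.

```python
def projected_gain_pct(hits, rate_lo=0.05, rate_hi=0.10):
    """Return (low, high) projected throughput gain in percent.

    rate_lo and rate_hi are the assumed gains, in percent, per 1M hits.
    """
    millions = hits / 1_000_000
    return millions * rate_lo, millions * rate_hi


c6_hits = 2_750_854
c5_hits = 1_373_604

c6_only = projected_gain_pct(c6_hits)
c6_plus_c5 = projected_gain_pct(c6_hits + c5_hits)

print(f"C6 only: +{c6_only[0]:.2f}% to +{c6_only[1]:.2f}%")
print(f"C6 + C5: +{c6_plus_c5[0]:.2f}% to +{c6_plus_c5[1]:.2f}%")
```

The model scales linearly with hit count and ignores the cache and TLS effects listed under the risk factors, so it should be read as an upper-level estimate rather than a prediction.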

Risk factors:

Cache-miss sensitivity (Phase 74-2 showed +86% cache-misses from register pressure)
TLS struct bloat (each inline slot = ~8-16 bytes × capacity per class)
Memory hierarchy effects (L1-dcache pressure from TLS expansion)
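To make the inline-slot idea concrete, here is a toy model of a fixed-capacity LIFO slot array for a single size class, with a fallback signal when the array is empty or full. It is a conceptual sketch only: the class and method names are hypothetical, it is written in Python rather than C, and it does not reflect hakmem's actual TLS layout or its branch behavior.

```python
class InlineSlots:
    """Toy model of a per-class, fixed-capacity LIFO slot array."""

    def __init__(self, capacity):
        self.slots = [None] * capacity  # stands in for the inline pointer array
        self.count = 0                  # number of occupied slots

    def pop(self):
        """Return a cached block, or None so the caller can fall back."""
        if self.count == 0:
            return None
        self.count -= 1
        block = self.slots[self.count]
        self.slots[self.count] = None
        return block

    def push(self, block):
        """Store a freed block; return False if full so the caller can fall back."""
        if self.count == len(self.slots):
            return False
        self.slots[self.count] = block
        self.count += 1
        return True


# Usage: a C6-sized array with the capacity listed in the table.
c6_slots = InlineSlots(128)
c6_slots.push("block-a")
c6_slots.push("block-b")
print(c6_slots.pop())  # most recently pushed block comes back first
```

The fixed array per class is what drives the TLS size concern: every class given inline slots adds capacity × slot size bytes to each thread's TLS.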

6. Before/After Unified-STATS Baseline

FAST PGO Baseline Reference (Phase 69: WarmPool=16)

Important (SSOT):

This baseline is from the FAST PGO scorecard and is the correct reference for mimalloc ratio tracking.
If you run scripts/run_mixed_10_cleanenv.sh without setting BENCH_BIN, it defaults to the Standard binary (./bench_random_mixed_hakmem).
To measure Phase 75 on FAST PGO, set:
- BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh

FAST Mixed SSOT Throughput: 62.63 M ops/s (51.77% of mimalloc)
Target M2: 55% of mimalloc (~65.1 M ops/s baseline)
Remaining gap: +3.23pp
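The remaining gap is stated in percentage points of the mimalloc ratio. The sketch below derives it and also expresses it as a relative throughput gain, under the assumption (not stated in the source) that mimalloc's own throughput stays fixed between measurements.

```python
ratio_pct = 51.77   # current FAST Mixed SSOT throughput as a percent of mimalloc
target_pct = 55.0   # M2 target, percent of mimalloc

# Gap in percentage points of the mimalloc ratio.
gap_pp = target_pct - ratio_pct

# Relative throughput gain needed to close the gap,
# assuming mimalloc throughput is unchanged (an assumption).
relative_gain_pct = 100.0 * (target_pct / ratio_pct - 1.0)

print(f"gap: +{gap_pp:.2f}pp")
print(f"relative gain needed: +{relative_gain_pct:.2f}%")
```

Percentage points and relative percent are different units, so the +3.23pp gap should not be compared directly with the relative gains in the success criteria.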

Phase 75 (P2) Success Criteria (measured vs FAST PGO baseline)

| Scenario | Throughput | vs Baseline | Status |
|----------|------------|-------------|--------|
| GO | ≥ 64.1 M ops/s | +2.4% | +0.8pp toward M2 |
| NEUTRAL | 61.6-64.1 M ops/s | ±1.5% | freeze, continue Phase 76 |
| NO-GO | ≤ 61.6 M ops/s | -1.6% | revert immediately |

Strict gate: +2.0% for structural change (TLS bloat risk)
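The success criteria can be applied mechanically to a measured throughput. This is a minimal sketch using the thresholds from the table; the function names are hypothetical, and values exactly on a boundary are resolved in favor of GO or NO-GO, following the ≥ and ≤ signs in the table.

```python
BASELINE_MOPS = 62.63    # FAST PGO baseline, M ops/s
GO_MIN_MOPS = 64.1       # GO threshold from the table
NO_GO_MAX_MOPS = 61.6    # NO-GO threshold from the table


def classify(throughput_mops):
    """Map a measured throughput (M ops/s) to GO, NEUTRAL, or NO-GO."""
    if throughput_mops >= GO_MIN_MOPS:
        return "GO"
    if throughput_mops <= NO_GO_MAX_MOPS:
        return "NO-GO"
    return "NEUTRAL"


def delta_vs_baseline_pct(throughput_mops):
    """Relative change versus the FAST PGO baseline, in percent."""
    return 100.0 * (throughput_mops / BASELINE_MOPS - 1.0)


for measured in (61.0, 62.63, 64.5):
    print(f"{measured:.2f} M ops/s: {classify(measured)} "
          f"({delta_vs_baseline_pct(measured):+.2f}%)")
```

The sketch does not encode the separate +2.0% strict gate; that would need its own check if it is meant to apply in addition to the table.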

7. Risk Assessment: TLS Expansion vs Benefit

TLS Struct Bloat Analysis

Current TLS size (estimated from Phase 69):

UnifiedCache entries: minimal (backend pointers only)
WarmPool SLL: ~2KB (Phase 69-71)
Total TINY_MEM TLS: ~2-4KB per thread

Proposed P2 expansion (inline slots for C4-C7):

C4 inline: 64 slots × 8 bytes = 512 bytes
C5 inline: 128 slots × 8 bytes = 1,024 bytes
C6 inline: 128 slots × 8 bytes = 1,024 bytes
C7 inline: ??? slots × 8 bytes = ???
Total P2 expansion: ~2.5-3.5KB per thread (selective) or ~4-5KB (all C4-C7); the three known classes C4-C6 alone sum to 2,560 bytes (2.5KB)

TLS Memory Trade-off:

10 threads × 4KB = 40KB system-wide (negligible)
But per-thread L1-dcache footprint increases
- L1-dcache pressure → potential cache evictions
- Phase 74-2 showed this can dominate (cache-misses +86%)
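The per-class sizes above are slots × 8 bytes, and the L1-dcache concern can be approximated by counting cache lines. The following sketch assumes a 64-byte cache line, which is typical for x86-64 but is not stated in the source; C7 is omitted because its capacity is unknown.

```python
SLOT_BYTES = 8          # pointer-sized slot, as used in the estimate above
CACHE_LINE_BYTES = 64   # assumption: typical x86-64 cache line size

# Slot capacities from the Unified-STATS table (C7 capacity is unknown).
capacity = {"C4": 64, "C5": 128, "C6": 128}

inline_bytes = {cls: n * SLOT_BYTES for cls, n in capacity.items()}
# Ceiling division: number of cache lines each inline array would span.
cache_lines = {cls: -(-b // CACHE_LINE_BYTES) for cls, b in inline_bytes.items()}
subtotal_bytes = sum(inline_bytes.values())

for cls in capacity:
    print(f"{cls}: {inline_bytes[cls]} bytes, {cache_lines[cls]} cache lines")
print(f"C4-C6 subtotal: {subtotal_bytes} bytes per thread")
```

In practice only the lines near the top of each LIFO array are likely to be touched in steady state, so the full line count is a worst-case footprint rather than the expected one.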

Decision Gate

Before proceeding with P2:

Gather C7 statistics (currently missing)
Validate C6 > C5 > C4 > C7 ordering
Decide: C6-only, C6+C5, or full C4-C7?
Benchmark single-class inline (C6) first to validate ROI before expanding

8. Next Steps (User Decision Required)

Option A: Proceed with C6-only P2 (Recommended - Lowest Risk)

Approach:

Implement inline slots for C6 only (highest volume, 57.2%)
Measure impact: target +1.5-2.5% throughput
If successful, expand to C5 in Phase 75-2

Pros: Lowest TLS bloat, highest ROI/risk ratio

Cons: Multi-phase approach, requires two A/B cycles

Option B: Proceed with C6+C5 P2 (Moderate Risk)

Approach:

Implement inline slots for C6 + C5 (combined 85.7% of observed C4-C6 ops)
Measure impact: target +2.0-3.0% throughput
If successful, consolidate as Phase 75 final

Pros: Single A/B cycle, captures 85.7% of optimization opportunity

Cons: Higher TLS bloat (~2KB), higher register pressure risk

Option C: Defer P2 Until C7 Analysis

Approach:

Gather C7 statistics from separate OBSERVE run
Rank all four classes before targeting
Decide on C6/C5/C4/C7 balance based on full data

Pros: Data-driven decision, reduces risk of targeting wrong class

Cons: Adds diagnostic cycle before implementation

9. Recommendation Summary

PRIMARY RECOMMENDATION: Option A - Start with C6-only

Rationale:

C6 is clearly dominant (57.2% volume)
Lowest TLS bloat (~1KB) reduces register pressure risk
Conservative approach aligns with Phase 74 learnings (register pressure matters)
Fail-fast: if C6 shows positive ROI, expand to C5; if NO-GO, iterate differently

Secondary: Gather C7 stats in parallel to validate completeness

Decision: User choice - provide approach preference before proceeding to Phase 75 implementation

Artifacts

Baseline: Mixed SSOT OBSERVE run: ./bench_random_mixed_hakmem_observe 20000000 400 1
Measurement: Per-class Unified-STATS with HAKMEM_MEASURE_UNIFIED_CACHE=1
Analysis: This document (PHASE75_PERCLASS_ANALYSIS_0_SSOT.md)

Timeline

Phase 74 (P1/P0): UnifiedCache hit-path optimization → FROZEN (NEUTRAL)
Phase 75 (P2): Hot-class Inline Slots → PENDING USER DECISION (targeting strategy)
Phase 75-1: Implement selected class(es) → (next)
Phase 75-2: A/B test & results → (next)
