187 lines
6.6 KiB
Markdown
187 lines
6.6 KiB
Markdown
|
|
# Phase 89: Bottleneck Analysis & Next Optimization Candidates
|
||
|
|
|
||
|
|
**Date**: 2025-12-18
|
||
|
|
**SSOT Baseline (Standard)**: 51.36M ops/s
|
||
|
|
**SSOT Optimized (FAST PGO)**: 54.16M ops/s (+5.45%)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Perf Profile Summary
|
||
|
|
|
||
|
|
**Profile Run**: 40M operations (0.78s), 833 samples
|
||
|
|
**Top 50 Functions by CPU Time**:
|
||
|
|
|
||
|
|
| Rank | Function | CPU Time | Type | Notes |
|
||
|
|
|------|----------|----------|------|-------|
|
||
|
|
| 1 | **free** | 27.40% | **HOTTEST** | Free path (malloc_tiny_fast main handler) |
|
||
|
|
| 2 | main | 26.30% | Loop | Benchmark loop structure (not optimizable) |
|
||
|
|
| 3 | **malloc** | 20.36% | **HOTTEST** | Alloc path (malloc_tiny_fast main handler) |
|
||
|
|
| 4 | malloc.cold | 10.65% | Cold path | Rarely executed alloc fallback |
|
||
|
|
| 5 | free.cold | 5.59% | Cold path | Rarely executed free fallback |
|
||
|
|
| 6 | **tiny_region_id_write_header** | 2.98% | **HOT** | Region metadata write (inlined candidate) |
|
||
|
|
| 7-50 | Various | ~5% | Minor | Page faults, memset, init (one-time/rare) |
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Key Observations
|
||
|
|
|
||
|
|
### CPU Time Breakdown:
|
||
|
|
- **malloc + free combined**: 47.76% (27.40% + 20.36%)
|
||
|
|
- This is the core allocation/deallocation hot path
|
||
|
|
- Current architecture: `malloc_tiny_fast.h` with inline slots (C4-C7) already optimized
|
||
|
|
|
||
|
|
- **tiny_region_id_write_header**: 2.98%
|
||
|
|
- Called during every free for C4-C7 classes
|
||
|
|
- Currently NOT inlined to all call sites (selective inlining only)
|
||
|
|
- Potential optimization: Force always_inline for hot paths
|
||
|
|
|
||
|
|
- **malloc.cold / free.cold**: 10.65% + 5.59% = 16.24%
|
||
|
|
- Cold paths (fallback routes)
|
||
|
|
- Should NOT be optimized (violates layout tax principle)
|
||
|
|
- Adding code to optimize cold paths increases code bloat
|
||
|
|
|
||
|
|
### Inline Slots Status (from OBSERVE):
|
||
|
|
- C4/C5/C6 inline slots ARE active during measurement
|
||
|
|
- PUSH TOTAL: 4.81M ops (100% of C4-C7 operations)
|
||
|
|
- Overflow rate: 0.003% (negligible)
|
||
|
|
- **Conclusion**: Inline slots are working perfectly, not a bottleneck
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Top 3 Optimization Candidates
|
||
|
|
|
||
|
|
### Candidate 1: tiny_region_id_write_header Inlining (2.98% CPU)
|
||
|
|
|
||
|
|
**Current Implementation**:
|
||
|
|
- Located in: `core/region_id_v6.c`
|
||
|
|
- Called from: `malloc_tiny_fast.h` during free path
|
||
|
|
- Current inlining: Selective (only some call sites)
|
||
|
|
|
||
|
|
**Opportunity**:
|
||
|
|
- Force `always_inline` on hot-path call sites to eliminate function call overhead
|
||
|
|
- Estimated savings: 1-2% CPU time (small gain, low risk)
|
||
|
|
- **Layout Impact**: MINIMAL (only modifying call site, not adding code bulk)
|
||
|
|
|
||
|
|
**Risk Assessment**:
|
||
|
|
- LOW: Function is already optimized, only changing inline strategy
|
||
|
|
- No new branches or code paths
|
||
|
|
- I-cache pressure: minimal (function body is ~30-50 cycles)
|
||
|
|
|
||
|
|
**Recommendation**: **YES - PURSUE**
|
||
|
|
- Implement: Add `__attribute__((always_inline))` to hot-path wrapper
|
||
|
|
- Target: Free path only (malloc path is lower frequency)
|
||
|
|
- Expected gain: +1-2% throughput
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Candidate 2: malloc/free Hot-Path Branch Reduction (47.76% CPU)
|
||
|
|
|
||
|
|
**Current Implementation**:
|
||
|
|
- Located in: `core/front/malloc_tiny_fast.h` (Phase 9/10/80-1 optimized)
|
||
|
|
- Already using: Fixed inline slots, switch dispatch, per-op policy snapshots
|
||
|
|
- Branches: 1-3 per operation (policy check, class route, handler dispatch)
|
||
|
|
|
||
|
|
**Opportunity**:
|
||
|
|
- Profile shows **56.4M branch-misses** out of ~1.75 insn/cycle
|
||
|
|
- This indicates branch prediction pressure, not a simple optimization
|
||
|
|
- Further reduction requires: Per-thread pre-computed routing tables or elimination of policy snapshot checks
|
||
|
|
|
||
|
|
**Analysis**:
|
||
|
|
- Phase 9/10/78-1/80-1/83-1 have already eliminated most low-hanging branches
|
||
|
|
- Remaining optimization would require structural change (pre-compute all routing at init time)
|
||
|
|
- **Risk**: Code bloat from pre-computed tables, potential layout tax regression
|
||
|
|
|
||
|
|
**Recommendation**: **DEFERRED TO PHASE 90+**
|
||
|
|
- Requires architectural change (similar to Phase 85's approach, which was NO-GO)
|
||
|
|
- Wait for overflow/workload characteristics that justify the complexity
|
||
|
|
- Current gains are saturated
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Candidate 3: Cold-Path De-duplication (malloc.cold/free.cold = 16.24% CPU)
|
||
|
|
|
||
|
|
**Current Implementation**:
|
||
|
|
- malloc.cold: 10.65% (fallback alloc path)
|
||
|
|
- free.cold: 5.59% (fallback free path)
|
||
|
|
|
||
|
|
**Opportunity**: NONE (Intentional Design)
|
||
|
|
|
||
|
|
**Rationale**:
|
||
|
|
- Cold paths are EXPLICITLY separate to avoid code bloat in hot path
|
||
|
|
- Separating code improves I-cache utilization for hot path
|
||
|
|
- Optimizing cold path would ADD code to hot path (violating layout tax principle)
|
||
|
|
- Cold paths are rarely executed in SSOT workload
|
||
|
|
|
||
|
|
**Recommendation**: **NO - DO NOT PURSUE**
|
||
|
|
- Aligns with user's emphasis on "avoiding layout tax"
|
||
|
|
- Cold paths are correctly placed
|
||
|
|
- Optimization here would hurt hot-path performance
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Performance Ceiling Analysis
|
||
|
|
|
||
|
|
**FAST PGO vs Standard: 5.45% delta**
|
||
|
|
|
||
|
|
This gap represents:
|
||
|
|
1. **PGO branch prediction optimizations** (~3%)
|
||
|
|
- PGO reorders frequently-taken paths
|
||
|
|
- Improves branch prediction hit rate
|
||
|
|
|
||
|
|
2. **Code layout optimizations** (~2%)
|
||
|
|
- Hottest functions placed contiguously
|
||
|
|
- Reduces I-cache misses
|
||
|
|
|
||
|
|
3. **Inlining decisions** (~0.5%)
|
||
|
|
- PGO optimizes inlining thresholds
|
||
|
|
- Fewer expensive calls in hot path
|
||
|
|
|
||
|
|
**Implication for Standard Build**:
|
||
|
|
- Standard build is fundamentally limited by branch prediction pressure
|
||
|
|
- Further gains require: (a) reducing branches, or (b) making branches more predictable
|
||
|
|
- Both options require careful architectural tradeoffs
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Recommended Strategy for Phase 90+
|
||
|
|
|
||
|
|
### Immediate (Quick Win):
|
||
|
|
1. **Phase 90: tiny_region_id_write_header always_inline**
|
||
|
|
- Effort: 1-2 lines of code
|
||
|
|
- Expected gain: +1-2%
|
||
|
|
- Risk: LOW
|
||
|
|
|
||
|
|
### Medium-term (Structural):
|
||
|
|
2. **Phase 91: Hot-path routing pre-computation (optional)**
|
||
|
|
- Only if overflow rate increases or workload changes
|
||
|
|
- Risk: MEDIUM (code bloat, layout tax)
|
||
|
|
- Expected gain: +2-3% (speculative)
|
||
|
|
|
||
|
|
3. **Phase 92: Allocator comparison sweep**
|
||
|
|
- Use FAST PGO as comparison baseline (+5.45%)
|
||
|
|
- Verify gap closure as individual optimizations accumulate
|
||
|
|
|
||
|
|
### Deferred:
|
||
|
|
- Avoid cold-path optimization (maintains I-cache discipline)
|
||
|
|
- Do NOT pursue redundant branch elimination (saturation point reached)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Summary Table
|
||
|
|
|
||
|
|
| Candidate | Priority | Effort | Risk | Expected Gain | Recommendation |
|
||
|
|
|-----------|----------|--------|------|----------------|-----------------|
|
||
|
|
| tiny_region_id_write_header inlining | HIGH | 1-2h | LOW | +1-2% | **PURSUE** |
|
||
|
|
| malloc/free branch reduction | MED | 20-40h | MEDIUM | +2-3% | DEFER |
|
||
|
|
| cold-path optimization | LOW | 10-20h | HIGH | +1% | **AVOID** |
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Layout Tax Adherence Check
|
||
|
|
|
||
|
|
✓ Candidate 1 (header inlining): No code bulk, maintains I-cache discipline
|
||
|
|
✓ Candidate 2 deferred: Avoids adding branches to hot path
|
||
|
|
✓ Candidate 3 avoided: Maintains cold-path separation principle
|
||
|
|
|
||
|
|
**Conclusion**: All recommendations align with user's "避けるlayout tax" principle.
|