Files
hakmem/docs/analysis/PHASE89_BOTTLENECK_ANALYSIS.md
2025-12-19 03:45:01 +09:00

6.6 KiB

Phase 89: Bottleneck Analysis & Next Optimization Candidates

Date: 2025-12-18
SSOT Baseline (Standard): 51.36M ops/s
SSOT Optimized (FAST PGO): 54.16M ops/s (+5.45%)


Perf Profile Summary

Profile Run: 40M operations (0.78s), 833 samples
Top 50 Functions by CPU Time:

Rank Function CPU Time Type Notes
1 free 27.40% HOTTEST Free path (malloc_tiny_fast main handler)
2 main 26.30% Loop Benchmark loop structure (not optimizable)
3 malloc 20.36% HOTTEST Alloc path (malloc_tiny_fast main handler)
4 malloc.cold 10.65% Cold path Rarely executed alloc fallback
5 free.cold 5.59% Cold path Rarely executed free fallback
6 tiny_region_id_write_header 2.98% HOT Region metadata write (inlined candidate)
7-50 Various ~5% Minor Page faults, memset, init (one-time/rare)

Key Observations

CPU Time Breakdown:

  • malloc + free combined: 47.76% (27.40% + 20.36%)

    • This is the core allocation/deallocation hot path
    • Current architecture: malloc_tiny_fast.h with inline slots (C4-C7) already optimized
  • tiny_region_id_write_header: 2.98%

    • Called during every free for C4-C7 classes
    • Currently NOT inlined to all call sites (selective inlining only)
    • Potential optimization: Force always_inline for hot paths
  • malloc.cold / free.cold: 10.65% + 5.59% = 16.24%

    • Cold paths (fallback routes)
    • Should NOT be optimized (violates layout tax principle)
    • Adding code to optimize cold paths increases code bloat

Inline Slots Status (from OBSERVE):

  • C4/C5/C6 inline slots ARE active during measurement
  • PUSH TOTAL: 4.81M ops (100% of C4-C7 operations)
  • Overflow rate: 0.003% (negligible)
  • Conclusion: Inline slots are working perfectly, not a bottleneck

Top 3 Optimization Candidates

Candidate 1: tiny_region_id_write_header Inlining (2.98% CPU)

Current Implementation:

  • Located in: core/region_id_v6.c
  • Called from: malloc_tiny_fast.h during free path
  • Current inlining: Selective (only some call sites)

Opportunity:

  • Force always_inline on hot-path call sites to eliminate function call overhead
  • Estimated savings: 1-2% CPU time (small gain, low risk)
  • Layout Impact: MINIMAL (only modifying call site, not adding code bulk)

Risk Assessment:

  • LOW: Function is already optimized, only changing inline strategy
  • No new branches or code paths
  • I-cache pressure: minimal (function body is ~30-50 cycles)

Recommendation: YES - PURSUE

  • Implement: Add __attribute__((always_inline)) to hot-path wrapper
  • Target: Free path only (malloc path is lower frequency)
  • Expected gain: +1-2% throughput

Candidate 2: malloc/free Hot-Path Branch Reduction (47.76% CPU)

Current Implementation:

  • Located in: core/front/malloc_tiny_fast.h (Phase 9/10/80-1 optimized)
  • Already using: Fixed inline slots, switch dispatch, per-op policy snapshots
  • Branches: 1-3 per operation (policy check, class route, handler dispatch)

Opportunity:

  • Profile shows 56.4M branch-misses out of ~1.75 insn/cycle
  • This indicates branch prediction pressure, not a simple optimization
  • Further reduction requires: Per-thread pre-computed routing tables or elimination of policy snapshot checks

Analysis:

  • Phase 9/10/78-1/80-1/83-1 have already eliminated most low-hanging branches
  • Remaining optimization would require structural change (pre-compute all routing at init time)
  • Risk: Code bloat from pre-computed tables, potential layout tax regression

Recommendation: DEFERRED TO PHASE 90+

  • Requires architectural change (similar to Phase 85's approach, which was NO-GO)
  • Wait for overflow/workload characteristics that justify the complexity
  • Current gains are saturated

Candidate 3: Cold-Path De-duplication (malloc.cold/free.cold = 16.24% CPU)

Current Implementation:

  • malloc.cold: 10.65% (fallback alloc path)
  • free.cold: 5.59% (fallback free path)

Opportunity: NONE (Intentional Design)

Rationale:

  • Cold paths are EXPLICITLY separate to avoid code bloat in hot path
  • Separating code improves I-cache utilization for hot path
  • Optimizing cold path would ADD code to hot path (violating layout tax principle)
  • Cold paths are rarely executed in SSOT workload

Recommendation: NO - DO NOT PURSUE

  • Aligns with user's emphasis on "avoiding layout tax"
  • Cold paths are correctly placed
  • Optimization here would hurt hot-path performance

Performance Ceiling Analysis

FAST PGO vs Standard: 5.45% delta

This gap represents:

  1. PGO branch prediction optimizations (~3%)

    • PGO reorders frequently-taken paths
    • Improves branch prediction hit rate
  2. Code layout optimizations (~2%)

    • Hottest functions placed contiguously
    • Reduces I-cache misses
  3. Inlining decisions (~0.5%)

    • PGO optimizes inlining thresholds
    • Fewer expensive calls in hot path

Implication for Standard Build:

  • Standard build is fundamentally limited by branch prediction pressure
  • Further gains require: (a) reducing branches, or (b) making branches more predictable
  • Both options require careful architectural tradeoffs

Immediate (Quick Win):

  1. Phase 90: tiny_region_id_write_header always_inline
    • Effort: 1-2 lines of code
    • Expected gain: +1-2%
    • Risk: LOW

Medium-term (Structural):

  1. Phase 91: Hot-path routing pre-computation (optional)

    • Only if overflow rate increases or workload changes
    • Risk: MEDIUM (code bloat, layout tax)
    • Expected gain: +2-3% (speculative)
  2. Phase 92: Allocator comparison sweep

    • Use FAST PGO as comparison baseline (+5.45%)
    • Verify gap closure as individual optimizations accumulate

Deferred:

  • Avoid cold-path optimization (maintains I-cache discipline)
  • Do NOT pursue redundant branch elimination (saturation point reached)

Summary Table

Candidate Priority Effort Risk Expected Gain Recommendation
tiny_region_id_write_header inlining HIGH 1-2h LOW +1-2% PURSUE
malloc/free branch reduction MED 20-40h MEDIUM +2-3% DEFER
cold-path optimization LOW 10-20h HIGH +1% AVOID

Layout Tax Adherence Check

✓ Candidate 1 (header inlining): No code bulk, maintains I-cache discipline
✓ Candidate 2 deferred: Avoids adding branches to hot path
✓ Candidate 3 avoided: Maintains cold-path separation principle

Conclusion: All recommendations align with user's "避けるlayout tax" principle.