tomoaki/hakmem

Fork 0

Files

Moe Charm (CI) 2013514f7b Working state before pushing to cyu remote

2025-12-19 03:45:01 +09:00

6.6 KiB

Raw Blame History

Phase 89: Bottleneck Analysis & Next Optimization Candidates

Date: 2025-12-18
SSOT Baseline (Standard): 51.36M ops/s
SSOT Optimized (FAST PGO): 54.16M ops/s (+5.45%)

Perf Profile Summary

Profile Run: 40M operations (0.78s), 833 samples
Top 50 Functions by CPU Time:

Rank	Function	CPU Time	Type	Notes
1	free	27.40%	HOTTEST	Free path (malloc_tiny_fast main handler)
2	main	26.30%	Loop	Benchmark loop structure (not optimizable)
3	malloc	20.36%	HOTTEST	Alloc path (malloc_tiny_fast main handler)
4	malloc.cold	10.65%	Cold path	Rarely executed alloc fallback
5	free.cold	5.59%	Cold path	Rarely executed free fallback
6	tiny_region_id_write_header	2.98%	HOT	Region metadata write (inlined candidate)
7-50	Various	~5%	Minor	Page faults, memset, init (one-time/rare)

Key Observations

CPU Time Breakdown:

malloc + free combined: 47.76% (27.40% + 20.36%)
- This is the core allocation/deallocation hot path
- Current architecture: malloc_tiny_fast.h with inline slots (C4-C7) already optimized
tiny_region_id_write_header: 2.98%
- Called during every free for C4-C7 classes
- Currently NOT inlined to all call sites (selective inlining only)
- Potential optimization: Force always_inline for hot paths
malloc.cold / free.cold: 10.65% + 5.59% = 16.24%
- Cold paths (fallback routes)
- Should NOT be optimized (violates layout tax principle)
- Adding code to optimize cold paths increases code bloat

Inline Slots Status (from OBSERVE):

C4/C5/C6 inline slots ARE active during measurement
PUSH TOTAL: 4.81M ops (100% of C4-C7 operations)
Overflow rate: 0.003% (negligible)
Conclusion: Inline slots are working perfectly, not a bottleneck

Top 3 Optimization Candidates

Candidate 1: tiny_region_id_write_header Inlining (2.98% CPU)

Current Implementation:

Located in: core/region_id_v6.c
Called from: malloc_tiny_fast.h during free path
Current inlining: Selective (only some call sites)

Opportunity:

Force always_inline on hot-path call sites to eliminate function call overhead
Estimated savings: 1-2% CPU time (small gain, low risk)
Layout Impact: MINIMAL (only modifying call site, not adding code bulk)

Risk Assessment:

LOW: Function is already optimized, only changing inline strategy
No new branches or code paths
I-cache pressure: minimal (function body is ~30-50 cycles)

Recommendation: YES - PURSUE

Implement: Add __attribute__((always_inline)) to hot-path wrapper
Target: Free path only (malloc path is lower frequency)
Expected gain: +1-2% throughput

Candidate 2: malloc/free Hot-Path Branch Reduction (47.76% CPU)

Current Implementation:

Located in: core/front/malloc_tiny_fast.h (Phase 9/10/80-1 optimized)
Already using: Fixed inline slots, switch dispatch, per-op policy snapshots
Branches: 1-3 per operation (policy check, class route, handler dispatch)

Opportunity:

Profile shows 56.4M branch-misses out of ~1.75 insn/cycle
This indicates branch prediction pressure, not a simple optimization
Further reduction requires: Per-thread pre-computed routing tables or elimination of policy snapshot checks

Analysis:

Phase 9/10/78-1/80-1/83-1 have already eliminated most low-hanging branches
Remaining optimization would require structural change (pre-compute all routing at init time)
Risk: Code bloat from pre-computed tables, potential layout tax regression

Recommendation: DEFERRED TO PHASE 90+

Requires architectural change (similar to Phase 85's approach, which was NO-GO)
Wait for overflow/workload characteristics that justify the complexity
Current gains are saturated

Candidate 3: Cold-Path De-duplication (malloc.cold/free.cold = 16.24% CPU)

Current Implementation:

malloc.cold: 10.65% (fallback alloc path)
free.cold: 5.59% (fallback free path)

Opportunity: NONE (Intentional Design)

Rationale:

Cold paths are EXPLICITLY separate to avoid code bloat in hot path
Separating code improves I-cache utilization for hot path
Optimizing cold path would ADD code to hot path (violating layout tax principle)
Cold paths are rarely executed in SSOT workload

Recommendation: NO - DO NOT PURSUE

Aligns with user's emphasis on "avoiding layout tax"
Cold paths are correctly placed
Optimization here would hurt hot-path performance

Performance Ceiling Analysis

FAST PGO vs Standard: 5.45% delta

This gap represents:

PGO branch prediction optimizations (~3%)
- PGO reorders frequently-taken paths
- Improves branch prediction hit rate
Code layout optimizations (~2%)
- Hottest functions placed contiguously
- Reduces I-cache misses
Inlining decisions (~0.5%)
- PGO optimizes inlining thresholds
- Fewer expensive calls in hot path

Implication for Standard Build:

Standard build is fundamentally limited by branch prediction pressure
Further gains require: (a) reducing branches, or (b) making branches more predictable
Both options require careful architectural tradeoffs

Recommended Strategy for Phase 90+

Immediate (Quick Win):

Phase 90: tiny_region_id_write_header always_inline
- Effort: 1-2 lines of code
- Expected gain: +1-2%
- Risk: LOW

Medium-term (Structural):

Phase 91: Hot-path routing pre-computation (optional)
- Only if overflow rate increases or workload changes
- Risk: MEDIUM (code bloat, layout tax)
- Expected gain: +2-3% (speculative)
Phase 92: Allocator comparison sweep
- Use FAST PGO as comparison baseline (+5.45%)
- Verify gap closure as individual optimizations accumulate

Deferred:

Avoid cold-path optimization (maintains I-cache discipline)
Do NOT pursue redundant branch elimination (saturation point reached)

Summary Table

Candidate	Priority	Effort	Risk	Expected Gain	Recommendation
tiny_region_id_write_header inlining	HIGH	1-2h	LOW	+1-2%	PURSUE
malloc/free branch reduction	MED	20-40h	MEDIUM	+2-3%	DEFER
cold-path optimization	LOW	10-20h	HIGH	+1%	AVOID

Layout Tax Adherence Check

✓ Candidate 1 (header inlining): No code bulk, maintains I-cache discipline
✓ Candidate 2 deferred: Avoids adding branches to hot path
✓ Candidate 3 avoided: Maintains cold-path separation principle

Conclusion: All recommendations align with user's "避けるlayout tax" principle.

6.6 KiB Raw Blame History

Phase 89: Bottleneck Analysis & Next Optimization Candidates

Perf Profile Summary

Key Observations

CPU Time Breakdown:

Inline Slots Status (from OBSERVE):

Top 3 Optimization Candidates

Candidate 1: tiny_region_id_write_header Inlining (2.98% CPU)

Candidate 2: malloc/free Hot-Path Branch Reduction (47.76% CPU)

Candidate 3: Cold-Path De-duplication (malloc.cold/free.cold = 16.24% CPU)

Performance Ceiling Analysis

Recommended Strategy for Phase 90+

Immediate (Quick Win):

Medium-term (Structural):

Deferred:

Summary Table

Layout Tax Adherence Check

6.6 KiB

Raw Blame History