diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
index 9ef561ae..2010bbd9 100644
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -36,11 +36,21 @@
 
 ## 3) Next Instructions
 
-**Phase 62: Next (TBD)**
+**Phase 62: C7 ULTRA Hotpath Optimization - Planning Complete**
 
-- Phase 61 was NEUTRAL (+0.31%), so search for the next target
-- Check the Top 50 hot functions via runtime profiling (Phase 61: `tiny_region_id_write_header` 2.32%, `tiny_c7_ultra_alloc` 1.90%)
-- Candidates: TLS prefetch optimization, refill batch size tuning, IPC profiling
+After completing Phases 59b and 61, runtime profiling identified the next targets:
+
+- **New profile**: 200M ops Mixed benchmark (Speed-first mode)
+  - tiny_c7_ultra_alloc: **5.18%** (2.41% self + multi-stack overhead)
+  - tiny_region_id_write_header: **3.82%** (2.72% + 1.10%)
+  - unified_cache_push: 1.37% (Phase 46A already pursued)
+
+- **Phase 62 recommendation**: C7 ULTRA Inline + IPC Analysis
+  - Option A: tiny_c7_ultra_alloc dependency chain reordering (+1-3% expected)
+  - Option B: tiny_region_id_write_header reordering (+0.5-1.5%, higher risk)
+  - Option C: Algorithmic redesign (post-50% milestone)
+
+Details: `docs/analysis/PHASE62_NEXT_TARGET_ANALYSIS.md` (complete, ready for implementation)
 
 **Phase 61: Complete (NEUTRAL +0.31%, research box)**
 
diff --git a/docs/analysis/PHASE62_NEXT_TARGET_ANALYSIS.md b/docs/analysis/PHASE62_NEXT_TARGET_ANALYSIS.md
new file mode 100644
index 00000000..b054f6f3
--- /dev/null
+++ b/docs/analysis/PHASE62_NEXT_TARGET_ANALYSIS.md
@@ -0,0 +1,174 @@
+# Phase 62: Allocation Hotpath Optimization - Target Analysis
+
+**Date**: 2025-12-17
+**Status**: Planning Phase
+**Baseline**: 48.34% of mimalloc (Phase 59b Speed-first)
+
+---
+
+## Executive Summary
+
+Runtime profiling (Phase 59b Speed-first profile) shows that, after the micro-optimization attempts of Phases 59-61, the next highest-value targets are:
+
+1. **tiny_c7_ultra_alloc: 5.18%** (new primary target)
+2. **tiny_region_id_write_header: 3.82%** (reconfirmed hot)
+3. **unified_cache_push: 1.37%** (already optimized in Phase 46A)
+
+Phase 62 targets dependency chain optimization in `tiny_c7_ultra_alloc`, with a potential +1-3% gain.
+
+---
+
+## Profiling Results (200M ops Mixed benchmark)
+
+### Top Allocation Functions
+
+```
+Function                          | Self % | Stack %  | Status
+----------------------------------|--------|----------|------------------
+malloc (wrapper)                  | 27.17% | ~60%     | Core loop
+free (wrapper)                    | 25.95% | ~60%     | Core loop
+main (benchmark loop)             | 26.78% | ~60%     | Core loop
+tiny_c7_ultra_alloc               | 2.41%  | 5.18%    | NEW TARGET
+tiny_region_id_write_header       | 2.72%  | 3.82%    | Phase 61 confirmed
+unified_cache_push                | 1.37%  | 1.37%    | Phase 46A (no-go)
+tiny_c7_ultra_free                | 0.56%  | 0.56%    | Lower priority
+```
+
+**Note**: Stack % is the cumulative overhead summed across all call stacks that reach the function.
+
+### Key Findings
+
+1. **Allocation-Specific Hot Path**: 10.37% Stack % (C7 ultra alloc 5.18% + region write 3.82% + cache push 1.37%)
+2. **Core Allocator**: 79.9% Self % (malloc + free + main loop interactions)
+3. **Profiling Confidence**: 376 samples, clear hot path, low noise
+
+---
+
+## Phase 62 Options
+
+### Option A: C7 ULTRA Hotpath (5.18% - PRIMARY CANDIDATE)
+
+**Opportunities**:
+- **A1: Inline Decision Path** - Ensure `tiny_c7_ultra_alloc` is always inlined
+- **A2: TLS Prefetch** - Speculatively load the C7 metadata structure
+- **A3: Dependency Chain Reduction** - Reorder operations for parallelism
+- **A4: Carve Batch Optimization** - Pre-carve slabs to reduce refill calls
+
+**Expected Gain**: +1-3% (out of the 5.18% attributable to this path)
+
+**Risk Level**: Medium
+- Precedent: Phase 46A, a similar optimization (-0.68% from layout tax)
+- Phase 43: Branch elimination (-1.18% regression)
+- But: the target is 5x larger than the Phase 46A target (higher absolute gain margin)
+
+**Rationale**:
+- C7 ULTRA is already optimized on the free side (Phase 7+); the alloc side is underexplored
+- No successful alloc-side structural optimization since Phase 39 (+1.98% gate prune)
+- This path is not micro-architecture bound (unlike the Phase 46A store-ordering work)
+
+---
+
+### Option B: tiny_region_id_write_header (3.82% - SECONDARY)
+
+**Opportunities**:
+- **B1: Dependency Chain Reorder** - Schedule non-dependent operations earlier
+- **B2: Condition Consolidation** - Reduce branch count
+- **B3: Store Bypass** - Avoid load-after-store stalls
+
+**Expected Gain**: +0.5-1.5%
+
+**Risk Level**: High
+- Phase 43: Header write optimization (-1.18%)
+- Phase 46A: always_inline (-0.68%)
+- Layout tax is real and measurable
+
+**Decision**: Secondary option; pursue only if Option A fails
+
+---
+
+### Option C: Algorithmic Redesign (VERY HIGH IMPACT, VERY HIGH COST)
+
+**Examples**:
+- Segment pre-allocation vs demand-based allocation
+- Free-side batching (coalesce multiple frees)
+- Static route caching (trade memory for latency)
+
+**Expected Gain**: +3-8% (affects the 79.9% spent in core functions)
+
+**Risk**: Very high (requires major refactoring and extensive testing)
+
+**Decision**: Post-50% milestone option; requires a strategic decision
+
+---
+
+## Phase 62A Recommendation: C7 ULTRA Inline + IPC Analysis
+
+### Implementation Plan
+
+**Step 1: Deep Profiling** (1-2 hours)
+```bash
+perf record -F 99 -g -e cycles:P,cache-misses,branch-misses,stalled-cycles-frontend \
+  -- ./bench_random_mixed_hakmem_minimal 200000000 400 1
+perf report --stdio | grep -A 20 "tiny_c7_ultra_alloc"
+```
+
+**Step 2: ASM Inspection** (1 hour)
+- Run `objdump -d` on tiny_c7_ultra_alloc
+- Identify dependency chains (load-use, store-use distances)
+- Map them to approximate CPU latencies (L1: 4 cycles, L2: 10, L3: 40-75)
+- Identify stores that can be deferred or reordered
+
+**Step 3: A/B Test** (2-3 hours)
+- Create the `HAKMEM_TINY_C7_ULTRA_INLINE_OPT` ENV gate
+- Implement dependency chain reordering (if a candidate is identified)
+- Run the 10-run Mixed benchmark
+- Judge the result against a ±0.5% threshold (micro-scale)
+
+**Step 4: Decision**
+- +0.5% or higher → GO (adopt as default)
+- Within ±0.5% → NEUTRAL (keep as research box)
+- -0.5% or lower → NO-GO (revert, document)
+
+---
+
+## Alternative: Quick Validation (if time-limited)
+
+If deep optimization is not feasible, proceed with:
+
+1. **Phase 62B: Static Routing Cache** - Pre-compute route decisions for each class
+   - Phase 45 suggested +0.5-1.0% from TLS prefetch
+   - Lower risk than modifying C7
+
+2. **Phase 62C: Carve Batch Study** - Analyze carve operation frequency
+   - May identify a batching opportunity with minimal code changes
+
+---
+
+## Box Theory Compliance
+
+- **Single Conversion Point**: The C7 ultra path has a clear entry point
+- **Clear Boundary**: The tiny_c7_ultra_alloc() function boundary
+- **Reversible**: ENV gate (`HAKMEM_TINY_C7_ULTRA_INLINE_OPT=0/1`)
+- **No Side Effects**: Pure optimization, no new data structures
+- **Performance**: Expected +1-3% (TBD via A/B test)
+
+---
+
+## Success Criteria
+
+| Metric | Target | Status |
+|--------|--------|--------|
+| M1 (50%) | 50.0% | 48.34% (gap -1.66 pp) |
+| Throughput improvement | +1-3% | TBD |
+| Variance (CV) | <2.5% | Current 2.52% (marginally above target) |
+| Memory efficiency | <35MB RSS | Current 33MB ✓ |
+| Syscall budget | <1e-7/op | Current 1.25e-7/op (above target) |
+
+---
+
+## Timeline
+
+- **Phase 62A (C7 ULTRA Inline)**: Single phase, 4-6 hours
+- **Decision point**: After the A/B test
+- **Next phases**: Based on the Phase 62A result
+
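The GO / NEUTRAL / NO-GO rule in Step 4 and the CV criterion in the Success Criteria table can be sketched as a small helper for post-processing the 10-run A/B results. This is an illustrative sketch only and sits outside the patch: the function names (`delta_pct`, `cv_pct`, `decide`) and the assumption that each run yields one throughput number are hypothetical, and only the ±0.5% threshold comes from the analysis document.

```python
from statistics import mean, stdev


def delta_pct(baseline, candidate):
    """Mean throughput change of candidate relative to baseline, in percent."""
    return (mean(candidate) / mean(baseline) - 1.0) * 100.0


def cv_pct(runs):
    """Coefficient of variation (sample stdev / mean), in percent."""
    return stdev(runs) / mean(runs) * 100.0


def decide(baseline, candidate, threshold_pct=0.5):
    """Apply the Step 4 rule: GO at +threshold or more, NO-GO at -threshold or less."""
    d = delta_pct(baseline, candidate)
    if d >= threshold_pct:
        return "GO"
    if d <= -threshold_pct:
        return "NO-GO"
    return "NEUTRAL"


if __name__ == "__main__":
    # Hypothetical throughput samples (ops/s); real values would come from the benchmark runs.
    base = [100.0] * 10
    cand = [100.31] * 10
    print(decide(base, cand), f"{delta_pct(base, cand):+.2f}%")
```

Comparing means alone ignores run-to-run noise; with a CV near 2.5% and a ±0.5% threshold, checking `cv_pct` on both arms before trusting the verdict seems prudent.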