# Phase 62: Allocation Hotpath Optimization - Target Analysis

**Date**: 2025-12-17
**Status**: Planning Phase
**Baseline**: 48.34% of mimalloc (Phase 59b Speed-first)

---

## Executive Summary

Runtime profiling (Phase 59b Speed-first profile) shows that, after the micro-optimization attempts of Phases 59-61, the next highest-value targets are:

1. **tiny_c7_ultra_alloc: 5.18%** (new primary target)
2. **tiny_region_id_write_header: 3.82%** (reconfirmed hot)
3. **unified_cache_push: 1.37%** (already attempted in Phase 46A; no-go)

Phase 62 targets dependency-chain optimization in `tiny_c7_ultra_alloc`, with a potential gain of +1-3%.

---

## Profiling Results (200M ops Mixed benchmark)

### Top Allocation Functions

```
Function                          | Self % | Stack %  | Status
----------------------------------|--------|----------|------------------
malloc (wrapper)                  | 27.17% | ~60%     | Core loop
free (wrapper)                    | 25.95% | ~60%     | Core loop
main (benchmark loop)             | 26.78% | ~60%     | Core loop
tiny_c7_ultra_alloc               | 2.41%  | 5.18%    | NEW TARGET
tiny_region_id_write_header       | 2.72%  | 3.82%    | Phase 61 confirmed
unified_cache_push                | 1.37%  | 1.37%    | Phase 46A (no-go)
tiny_c7_ultra_free                | 0.56%  | 0.56%    | Lower priority
```

**Note**: Stack % represents cumulative overhead from multiple call stacks.

### Key Findings

1. **Allocation-Specific Hot Path**: 10.37% of stack time (5.18% C7 ultra + 3.82% region write + 1.37% cache)
2. **Core Allocator**: 79.9% of self time (27.17% malloc + 25.95% free + 26.78% main loop)
3. **Profiling Confidence**: 376 samples, clear hot path, low noise

---

## Phase 62 Options

### Option A: C7 ULTRA Hotpath (5.18% - PRIMARY CANDIDATE)

**Opportunities**:
- **A1: Inline Decision Path** - Ensure `tiny_c7_ultra_alloc` is always inlined
- **A2: TLS Prefetch** - Speculatively load the C7 metadata structure
- **A3: Dependency Chain Reduction** - Reorder operations for parallelism
- **A4: Carve Batch Optimization** - Pre-carve slabs to reduce refill calls
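
A1 and A2 can be illustrated with a minimal sketch. The `c7_tls_cache_t` layout, the `c7_pop` helper, and the choice to store the freelist link in the first word of each free block are assumptions made for illustration; the actual internals of `tiny_c7_ultra_alloc` are not shown in this document. `__attribute__((always_inline))`, `__builtin_expect`, and `__builtin_prefetch` are GCC/Clang extensions. Whether forced inlining or the prefetch helps is exactly what the A/B test has to decide, since earlier phases saw regressions from layout changes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread cache for class C7: a singly linked freelist
 * whose "next" pointer lives in the first word of each free block. */
typedef struct {
    void  *head;   /* first free block, or NULL when a refill is needed */
    size_t count;  /* number of blocks currently on the freelist */
} c7_tls_cache_t;

/* A1: force inlining so the caller pays no call/return overhead. */
static inline __attribute__((always_inline))
void *c7_pop(c7_tls_cache_t *c)
{
    assert(c != NULL);

    void *blk = c->head;
    if (__builtin_expect(blk == NULL, 0))
        return NULL;                  /* slow path: caller must refill */

    void *next = *(void **)blk;       /* load the link stored in the block */

    /* A2: hint that the next block will be read soon.
     * Arguments: address, 0 = read access, 3 = keep in all cache levels.
     * Prefetching a NULL address is harmless (the hint never faults). */
    __builtin_prefetch(next, 0, 3);

    c->head = next;
    c->count--;
    return blk;
}
```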

**Expected Gain**: +1-3% (out of an addressable 5.18% share)
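
The +1-3% range can be sanity-checked against the profile share. As a rough model (an assumption, not a measurement): if a fraction `f` of the cycles in a function with profile share `p` is eliminated and nothing else changes, throughput improves by `1/(1 - p*f) - 1`. With `p = 0.0518`, removing 20% of the function's cycles gives about +1.0%, removing half gives about +2.7%, and removing the function entirely caps the gain at about +5.5%. Under this model the +1-3% target corresponds to eliminating roughly 19% to 56% of the cycles attributed to `tiny_c7_ultra_alloc`. The helper name `throughput_gain` is hypothetical.

```c
#include <assert.h>

/* Throughput gain when a fraction `removed` (0..1) of the cycles in a
 * function with profile share `share` (0..1) is eliminated, assuming the
 * rest of the program is unaffected (no layout or cache side effects). */
static double throughput_gain(double share, double removed)
{
    assert(share >= 0.0 && share < 1.0);
    assert(removed >= 0.0 && removed <= 1.0);
    return 1.0 / (1.0 - share * removed) - 1.0;
}
```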

**Risk Level**: Medium
- Precedent: Phase 46A similar optimization (-0.68% from layout tax)
- Phase 43: Branch elimination (-1.18% regression)
- But: 5x larger than Phase 46A target (higher absolute gain margin)

**Rationale**:
- C7 ULTRA already optimized in free (Phase 7+), alloc side underexplored
- No successful alloc-side structural optimization since Phase 39 (+1.98% gate prune)
- This is not micro-architecture bound (unlike Phase 46A store-ordering)

---

### Option B: tiny_region_id_write_header (3.82% - SECONDARY)

**Opportunities**:
- **B1: Dependency Chain Reorder** - Schedule non-dependent operations earlier
- **B2: Condition Consolidation** - Reduce branch count
- **B3: Store Bypass** - Avoid load-after-store stalls

**Expected Gain**: +0.5-1.5%

**Risk Level**: High
- Phase 43: Header write optimization (-1.18%)
- Phase 46A: always_inline (-0.68%)
- Layout tax is real and measurable

**Decision**: Secondary option; pursue only if Option A fails

---

### Option C: Algorithmic Redesign (VERY HIGH IMPACT, VERY HIGH COST)

**Examples**:
- Segment pre-allocation vs demand-based
- Free-side batching (coalesce multiple frees)
- Static route caching (trade memory for latency)

**Expected Gain**: +3-8% (affects 79.9% core functions)

**Risk**: Very high (requires major refactoring, extensive testing)

**Decision**: Post-50% milestone option; requires strategic decision

---

## Phase 62A Recommendation: C7 ULTRA Inline + IPC Analysis

### Implementation Plan

**Step 1: Deep Profiling** (1-2 hours)
```bash
perf record -F 99 -g -e cycles:P,cache-misses,branch-misses,stalled-cycles-frontend \
  -- ./bench_random_mixed_hakmem_minimal 200000000 400 1
perf report --stdio | grep -A 20 "tiny_c7_ultra_alloc"
```

**Step 2: ASM Inspection** (1 hour)
- Run `objdump -d` on `tiny_c7_ultra_alloc`
- Identify dependency chains (load-use, store-use distances)
- Map to CPU cache latencies (L1: 4 cycles, L2: 10, L3: 40-75)
- Identify stores that can be deferred/reordered

**Step 3: A/B Test** (2-3 hours)
- Create `HAKMEM_TINY_C7_ULTRA_INLINE_OPT` ENV gate
- Implement dependency chain reordering (if identified)
- Run 10-run Mixed benchmark
- Decision threshold: ±0.5% (micro-scale)
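
A minimal sketch of the ENV gate follows. Only the variable name `HAKMEM_TINY_C7_ULTRA_INLINE_OPT` comes from this plan; the function names, the default-off policy, and the lazy caching scheme are assumptions, and the project's existing gate helpers may look different.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed policy: the gate is on only when the value starts with '1';
 * an unset variable or any other value leaves it off. */
static int c7_parse_gate(const char *v)
{
    return (v != NULL && v[0] == '1') ? 1 : 0;
}

/* Cached gate state: -1 = not yet read, 0 = disabled, 1 = enabled.
 * This is a plain int, so the first call should happen before worker
 * threads start (or the cache should be made thread-local/atomic). */
static int g_c7_inline_opt = -1;

static inline int c7_ultra_inline_opt_enabled(void)
{
    /* getenv runs once; later calls cost one load and one predictable branch. */
    if (__builtin_expect(g_c7_inline_opt < 0, 0))
        g_c7_inline_opt = c7_parse_gate(getenv("HAKMEM_TINY_C7_ULTRA_INLINE_OPT"));

    assert(g_c7_inline_opt == 0 || g_c7_inline_opt == 1);
    return g_c7_inline_opt;
}
```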

**Step 4: Decision**
- +0.5% or higher → GO (adopt as default)
- Within ±0.5% → NEUTRAL (keep as research box)
- -0.5% or lower → NO-GO (revert, document)
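
The decision rule above can be written down as a small classifier, which keeps the thresholds in one place when scripting the 10-run comparison. The names `ab_verdict_t` and `ab_decide` are hypothetical; the inputs are assumed to be mean throughputs (for example ops/s) of the baseline and candidate builds.

```c
#include <assert.h>

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_verdict_t;

/* Classify an A/B result using the Step 4 thresholds:
 * +0.5% or higher is GO, -0.5% or lower is NO-GO, anything between is NEUTRAL. */
static ab_verdict_t ab_decide(double baseline_mean, double candidate_mean)
{
    assert(baseline_mean > 0.0);

    double delta_pct = (candidate_mean - baseline_mean) / baseline_mean * 100.0;
    if (delta_pct >= 0.5)
        return AB_GO;
    if (delta_pct <= -0.5)
        return AB_NO_GO;
    return AB_NEUTRAL;
}
```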

---

## Alternative: Quick Validation (if time-limited)

If deep optimization is not feasible, proceed with:

1. **Phase 62B: Static Routing Cache** - Pre-compute route decisions for each class
   - Phase 45 suggested +0.5-1.0% from TLS prefetch
   - Lower risk than C7 modification

2. **Phase 62C: Carve Batch Study** - Analyze carve operation frequency
   - May identify batching opportunity with minimal code changes
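
The idea behind Phase 62B can be sketched as a per-class lookup table filled once at init time. Everything below is illustrative: the class count of 8 (C0..C7), the two route names, and the placeholder policy in `route_decide_slow` are assumptions, not the project's actual routing logic.

```c
#include <assert.h>
#include <stdint.h>

enum { TINY_NUM_CLASSES = 8 };   /* assumed: size classes C0..C7 */

typedef enum { ROUTE_UNIFIED_CACHE = 0, ROUTE_C7_ULTRA = 1 } tiny_route_t;

static uint8_t g_route_table[TINY_NUM_CLASSES];

/* Slow decision logic, evaluated once per class at init time.
 * Placeholder policy: only C7 takes the ULTRA path, and only when enabled. */
static tiny_route_t route_decide_slow(int cls, int c7_ultra_enabled)
{
    return (cls == 7 && c7_ultra_enabled) ? ROUTE_C7_ULTRA : ROUTE_UNIFIED_CACHE;
}

/* Pre-compute the route for every class. */
static void route_table_init(int c7_ultra_enabled)
{
    for (int cls = 0; cls < TINY_NUM_CLASSES; cls++)
        g_route_table[cls] = (uint8_t)route_decide_slow(cls, c7_ultra_enabled);
}

/* Hot path: one indexed load replaces the per-call decision branches. */
static inline tiny_route_t route_lookup(int cls)
{
    assert(cls >= 0 && cls < TINY_NUM_CLASSES);
    return (tiny_route_t)g_route_table[cls];
}
```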

---

## Box Theory Compliance

- **Single Conversion Point**: C7 ultra path has clear entry point
- **Clear Boundary**: `tiny_c7_ultra_alloc()` function boundary
- **Reversible**: ENV gate (`HAKMEM_TINY_C7_ULTRA_INLINE_OPT=0/1`)
- **No Side Effects**: Pure optimization, no new data structures
- **Performance**: Expected +1-3% (TBD via A/B test)

---

## Success Criteria

| Metric | Target | Status |
|--------|--------|--------|
| M1 (50%) | 50.0% | 48.34% (gap -1.66%) |
| Throughput improvement | +1-3% | TBD |
| Variance (CV) | <2.5% | Current 2.52% (marginally above target) |
| Memory efficiency | <35MB RSS | Current 33MB ✓ |
| Syscall budget | <1e-7/op | Current 1.25e-7/op (above target) |
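
The variance criterion can be checked mechanically over the 10 benchmark runs. The sketch below assumes CV is defined as the sample standard deviation (n - 1 denominator) divided by the mean; the document does not say which definition is used, and the helper name `cv_below` is hypothetical. Comparing squared quantities avoids a square root, which is valid as long as the mean is positive.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the coefficient of variation of x[0..n-1] is strictly below
 * limit_pct (in percent), else 0. Uses the sample variance (n - 1) and
 * compares variance < (limit * mean)^2 instead of taking a square root. */
static int cv_below(const double *x, size_t n, double limit_pct)
{
    assert(x != NULL && n >= 2 && limit_pct > 0.0);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    double mean = sum / (double)n;
    assert(mean > 0.0);              /* throughput samples are positive */

    double ss = 0.0;                 /* sum of squared deviations */
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - mean;
        ss += d * d;
    }
    double variance = ss / (double)(n - 1);

    double bound = limit_pct / 100.0 * mean;
    return variance < bound * bound;
}
```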

---

## Timeline

- **Phase 62A (C7 ULTRA Inline)**: Single phase, 4-6 hours
- **Decision point**: After A/B test
- **Next phases**: Based on Phase 62A result