hakmem/docs/analysis/PHASE62_NEXT_TARGET_ANALYSIS.md

# Phase 62: Allocation Hotpath Optimization - Target Analysis

**Date**: 2025-12-17  
**Status**: Planning Phase  
**Baseline**: 48.34% of mimalloc (Phase 59b Speed-first)

---

## Executive Summary

Runtime profiling (Phase 59b Speed-first profile) reveals that after Phases 59-61 micro-optimization attempts, the next highest-value targets are:

1. **tiny_c7_ultra_alloc: 5.18%** (new primary target)
2. **tiny_region_id_write_header: 3.82%** (reconfirmed hot)
3. **unified_cache_push: 1.37%** (optimization attempted in Phase 46A, no-go)

Phase 62 targets `tiny_c7_ultra_alloc` dependency chain optimization with potential +1-3% gain.

---

## Profiling Results (200M ops Mixed benchmark)

### Top Allocation Functions

```
Function                          | Self % | Stack % | Status
----------------------------------|--------|---------|-------------------
malloc (wrapper)                  | 27.17% | ~60%    | Core loop
free (wrapper)                    | 25.95% | ~60%    | Core loop
main (benchmark loop)             | 26.78% | ~60%    | Core loop
tiny_c7_ultra_alloc               | 2.41%  | 5.18%   | NEW TARGET
tiny_region_id_write_header       | 2.72%  | 3.82%   | Phase 61 confirmed
unified_cache_push                | 1.37%  | 1.37%   | Phase 46A (no-go)
tiny_c7_ultra_free                | 0.56%  | 0.56%   | Lower priority

**Note**: Stack % is the cumulative share (self plus callees), aggregated over every call stack in which the function appears.

### Key Findings

1. **Allocation-Specific Hot Path**: 10.37% by Stack % (C7 ultra 5.18% + region write 3.82% + cache 1.37%)
2. **Core Loop**: 79.9% self time (malloc 27.17% + free 25.95% + benchmark main 26.78%)
3. **Profiling Confidence**: 376 samples (about 0.27% of the profile per sample), clear hot path, low noise
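To put the sample count in perspective, the short C sketch below converts it into a per-sample resolution and into the approximate number of samples behind the 5.18% figure. The function names are illustrative and not part of hakmem; the only inputs are the 376 samples and the 5.18% Stack % reported above.

```c
#include <stdio.h>

/* Percent of the whole profile represented by a single sample. */
static double sample_resolution_pct(unsigned total_samples) {
    return 100.0 / (double)total_samples;
}

/* Approximate number of samples behind a reported percentage. */
static double samples_for_share(double share_pct, unsigned total_samples) {
    return share_pct / 100.0 * (double)total_samples;
}

/* Print both figures for the profile described in this document. */
static void print_profile_resolution(void) {
    const unsigned total = 376;   /* sample count reported in Key Findings */
    printf("one sample  = %.2f%% of the profile\n", sample_resolution_pct(total));
    printf("5.18%% share = %.1f samples\n", samples_for_share(5.18, total));
}
```

One sample is about 0.27% of the profile and the 5.18% share corresponds to roughly 19.5 samples, so differences of a few tenths of a percent between functions sit close to the sampling resolution.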

---

## Phase 62 Options

### Option A: C7 ULTRA Hotpath (5.18% - PRIMARY CANDIDATE)

**Opportunities**:
- **A1: Inline Decision Path** - Ensure `tiny_c7_ultra_alloc` always inlined
- **A2: TLS Prefetch** - Speculatively load C7 metadata structure
- **A3: Dependency Chain Reduction** - Reorder operations for parallelism
- **A4: Carve Batch Optimization** - Pre-carve slabs to reduce refill calls
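As a rough illustration of what A1 and A2 could look like in C, the sketch below shows a forced-inline fast path over a thread-local free list with a prefetch hint. Everything here is hypothetical: `demo_c7_tls_t`, `demo_c7_alloc_fast`, and `demo_c7_free` are stand-ins and do not reflect the real `tiny_c7_ultra_alloc` or its data layout. The prefetch target (the next free block) is chosen only for illustration, whereas A2 targets the C7 metadata structure. GCC or Clang is assumed for `__thread`, `always_inline`, and `__builtin_prefetch`.

```c
#include <stddef.h>

/* Hypothetical per-thread state; NOT the real hakmem C7 layout. */
typedef struct {
    void *freelist;   /* head of a singly linked list of free blocks */
} demo_c7_tls_t;

static __thread demo_c7_tls_t demo_c7_tls;

/* Push a block back; the first word of a free block stores the next pointer. */
static inline void demo_c7_free(void *block) {
    *(void **)block = demo_c7_tls.freelist;
    demo_c7_tls.freelist = block;
}

/* A1: always_inline folds the fast path into every caller.
 * A2: the prefetch hints the following block into cache ahead of the next call. */
static inline __attribute__((always_inline)) void *demo_c7_alloc_fast(void) {
    void *block = demo_c7_tls.freelist;
    if (block == NULL) {
        return NULL;                    /* caller would take the refill path */
    }
    void *next = *(void **)block;
    __builtin_prefetch(next, 0, 3);     /* read hint; does not fault on NULL */
    demo_c7_tls.freelist = next;
    return block;
}
```

Forced inlining changes code layout in every caller, which is the same mechanism behind the layout tax noted for Phase 46A, so a change like this belongs behind the A/B gate rather than being adopted directly.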

**Expected Gain**: +1-3% throughput (the function accounts for 5.18% of runtime, which bounds the addressable gain)

**Risk Level**: Medium
- Precedent: Phase 46A similar optimization (-0.68% from layout tax)
- Phase 43: Branch elimination (-1.18% regression)
- But: 5x larger than Phase 46A target (higher absolute gain margin)

**Rationale**: 
- C7 ULTRA already optimized in free (Phase 7+), alloc side underexplored
- No successful alloc-side structural optimization since Phase 39 (+1.98% gate prune)
- This is not micro-architecture bound (unlike Phase 46A store-ordering)
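The +1-3% estimate can be sanity-checked with Amdahl's law. The sketch below is a back-of-the-envelope helper, not project code. It assumes the 5.18% Stack % is a fair proxy for the function's share of total runtime and that removed cycles do not reappear elsewhere (for example as layout tax).

```c
#include <stdio.h>

/* Amdahl's law: overall gain in percent when a fraction `removed` (0..1) of a
 * function that accounts for `share` (0..1) of total runtime is eliminated. */
static double overall_gain_pct(double share, double removed) {
    return (1.0 / (1.0 - share * removed) - 1.0) * 100.0;
}

/* Inverse: fraction of the function's cost that must be removed to reach
 * an overall gain of `gain_pct` percent. */
static double required_reduction(double share, double gain_pct) {
    double speedup = 1.0 + gain_pct / 100.0;
    return (1.0 - 1.0 / speedup) / share;
}

/* Print the bounds for the 5.18% share discussed in Option A. */
static void print_option_a_bounds(void) {
    const double share = 0.0518;
    printf("ceiling (function removed entirely): +%.2f%%\n",
           overall_gain_pct(share, 1.0));
    printf("needed for +1%%: %.0f%% of the function's cost\n",
           required_reduction(share, 1.0) * 100.0);
    printf("needed for +3%%: %.0f%% of the function's cost\n",
           required_reduction(share, 3.0) * 100.0);
}
```

Under these assumptions the ceiling is about +5.5%, and the +1-3% range corresponds to removing roughly 19% to 56% of the cycles attributed to `tiny_c7_ultra_alloc`.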

---

### Option B: tiny_region_id_write_header (3.82% - SECONDARY)

**Opportunities**:
- **B1: Dependency Chain Reorder** - Schedule non-dependent operations earlier
- **B2: Condition Consolidation** - Reduce branch count
- **B3: Store Bypass** - Avoid load-after-store stalls

**Expected Gain**: +0.5-1.5%

**Risk Level**: High
- Phase 43: Header write optimization (-1.18%)
- Phase 46A: always_inline (-0.68%)
- Layout tax is real and measurable

**Decision**: Secondary option; pursue only if Option A fails

---

### Option C: Algorithmic Redesign (VERY HIGH IMPACT, VERY HIGH COST)

**Examples**:
- Segment pre-allocation vs demand-based
- Free-side batching (coalesce multiple frees)
- Static route caching (trade memory for latency)

**Expected Gain**: +3-8% (affects 79.9% core functions)

**Risk**: Very high (requires major refactoring, extensive testing)

**Decision**: Post-50% milestone option; requires strategic decision

---

## Phase 62A Recommendation: C7 ULTRA Inline + IPC Analysis

### Implementation Plan

**Step 1: Deep Profiling** (1-2 hours)
```bash
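# Sample cycles plus cache, branch, and frontend-stall events with call graphs.
# stalled-cycles-frontend is not exposed on every CPU; drop it if perf rejects it.
perf record -F 99 -g -e cycles:P,cache-misses,branch-misses,stalled-cycles-frontend \
  -- ./bench_random_mixed_hakmem_minimal 200000000 400 1
# Show the report entry for the target function and the 20 lines that follow it.
perf report --stdio | grep -A 20 "tiny_c7_ultra_alloc"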
```

**Step 2: ASM Inspection** (1 hour)
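- Run `objdump -d` on the benchmark binary and locate `tiny_c7_ultra_alloc` (or its inlined body in the caller)
- Identify dependency chains (load-use, store-use distances)
- Map to approximate, CPU-dependent cache latencies (L1: 4 cycles, L2: 10, L3: 40-75)
- Identify stores that can be deferred/reordered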
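The reason dependency chains matter can be shown with a small idealized model: loads whose addresses depend on each other pay their latencies in sequence, while independent loads can overlap. The helper names below are illustrative, and the model ignores issue width, load-port limits, and out-of-order window size.

```c
#include <stddef.h>

/* Cycles for n loads where each address depends on the previous result:
 * latencies add up because no load can start before its predecessor ends. */
static unsigned dependent_chain_cycles(const unsigned *latency, size_t n) {
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        total += latency[i];
    }
    return total;
}

/* Cycles for n independent loads that can all be in flight together
 * (idealized): the slowest load determines the total. */
static unsigned independent_loads_cycles(const unsigned *latency, size_t n) {
    unsigned worst = 0;
    for (size_t i = 0; i < n; i++) {
        if (latency[i] > worst) {
            worst = latency[i];
        }
    }
    return worst;
}
```

For example, two L1 hits followed by an L2 hit (4, 4, and 10 cycles using the latencies listed above) cost 18 cycles as a dependent chain but about 10 cycles if the three loads are independent. That difference is what reordering aims to recover.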

**Step 3: A/B Test** (2-3 hours)
- Create the `HAKMEM_TINY_C7_ULTRA_INLINE_OPT` ENV gate
- Implement dependency chain reordering (if any is identified in Step 2)
- Run the Mixed benchmark 10 times per configuration
- Compare mean throughput against a ±0.5% threshold (micro-scale)
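A minimal sketch of how such an ENV gate might be read in C follows. Only the variable name `HAKMEM_TINY_C7_ULTRA_INLINE_OPT` comes from this document. The helper names, the "only a leading `1` enables" rule, and the default of off when the variable is unset are assumptions made for illustration.

```c
#include <stdlib.h>

/* Interpret the variable's text: only a leading '1' enables the new path.
 * An unset variable (NULL) is treated as disabled (assumed default). */
static int gate_value_from_string(const char *value) {
    return (value != NULL && value[0] == '1') ? 1 : 0;
}

/* Read the gate once and cache the result so the hot path pays only a
 * load and a compare after the first call. */
static int c7_ultra_inline_opt_enabled(void) {
    static int cached = -1;   /* -1 means "not read yet" */
    if (cached < 0) {
        cached = gate_value_from_string(getenv("HAKMEM_TINY_C7_ULTRA_INLINE_OPT"));
    }
    return cached;
}
```

Concurrent first calls would all store the same value, but that is still formally a data race in C11. Resolving the gate once during allocator initialization avoids the issue and keeps the check out of the hot path.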

**Step 4: Decision**
- +0.5% or higher → GO (adopt as default)
- Between -0.5% and +0.5% → NEUTRAL (keep as research box)
- -0.5% or lower → NO-GO (revert, document)
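The decision rule can be written down directly, together with a rough check of how many runs the ±0.5% threshold needs. The sketch below is illustrative rather than project code. It assumes the 2.52% CV listed under Success Criteria is the run-to-run variation of both configurations and that runs are independent, so the standard error of the difference of two means is CV × sqrt(2/n).

```c
typedef enum { DECISION_NO_GO, DECISION_NEUTRAL, DECISION_GO } ab_decision_t;

/* Apply the Step 4 thresholds to a measured throughput delta in percent. */
static ab_decision_t classify_delta(double delta_pct) {
    const double threshold = 0.5;
    if (delta_pct >= threshold) {
        return DECISION_GO;
    }
    if (delta_pct <= -threshold) {
        return DECISION_NO_GO;
    }
    return DECISION_NEUTRAL;
}

/* Runs per configuration so that the standard error of the difference of
 * two means is at most `threshold_pct`: n >= 2 * (cv / threshold)^2. */
static unsigned runs_needed(double cv_pct, double threshold_pct) {
    double ratio = cv_pct / threshold_pct;
    double n = 2.0 * ratio * ratio;
    unsigned whole = (unsigned)n;
    return (n > (double)whole) ? whole + 1 : whole;   /* round up */
}
```

With a CV of 2.52%, 10 runs per configuration give a standard error of about 1.1% on the difference, and `runs_needed(2.52, 0.5)` evaluates to 51. Under these assumptions a result near ±0.5% from a 10-run comparison is better treated as NEUTRAL unless it is confirmed with more runs.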

---

## Alternative: Quick Validation (if time-limited)

If deep optimization is not feasible, proceed with:

1. **Phase 62B: Static Routing Cache** - Pre-compute route decisions for each class
   - Phase 45 suggested +0.5-1.0% from TLS prefetch
   - Lower risk than C7 modification

2. **Phase 62C: Carve Batch Study** - Analyze carve operation frequency
   - May identify batching opportunity with minimal code changes
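For Phase 62B, the idea of pre-computing route decisions could take the form of a per-class lookup table filled once at startup. The sketch below is purely hypothetical: the route names, the class count of 8 (C0..C7), and the placeholder policy are assumptions for illustration and do not describe hakmem's actual routing.

```c
#include <stddef.h>

#define DEMO_NUM_CLASSES 8   /* assumption: size classes C0..C7 */

typedef enum {
    DEMO_ROUTE_LEGACY,
    DEMO_ROUTE_UNIFIED_CACHE,
    DEMO_ROUTE_C7_ULTRA
} demo_route_t;

static demo_route_t demo_route_table[DEMO_NUM_CLASSES];

/* Slow decision, evaluated once per class (placeholder policy). */
static demo_route_t demo_decide_route(size_t class_idx) {
    return (class_idx == 7) ? DEMO_ROUTE_C7_ULTRA : DEMO_ROUTE_UNIFIED_CACHE;
}

/* Fill the table at startup so the hot path never re-evaluates the policy. */
static void demo_route_table_init(void) {
    for (size_t c = 0; c < DEMO_NUM_CLASSES; c++) {
        demo_route_table[c] = demo_decide_route(c);
    }
}

/* Hot-path lookup: a single indexed load replaces the branchy decision. */
static inline demo_route_t demo_route_for_class(size_t class_idx) {
    return demo_route_table[class_idx];
}
```

The table trades a few bytes of memory for a shorter decision path, which matches the "trade memory for latency" framing of static route caching under Option C.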

---

## Box Theory Compliance

- **Single Conversion Point**: C7 ultra path has clear entry point
- **Clear Boundary**: `tiny_c7_ultra_alloc()` function boundary
- **Reversible**: ENV gate (`HAKMEM_TINY_C7_ULTRA_INLINE_OPT=0/1`)
- **No Side Effects**: Pure optimization, no new data structures
- **Performance**: Expected +1-3% (TBD via A/B test)

---

## Success Criteria

| Metric | Target | Status |
|--------|--------|--------|
| M1 (50%) | 50.0% | 48.34% (gap -1.66 pp) |
| Throughput improvement | +1-3% | TBD |
| Variance (CV) | <2.5% | Current 2.52% ✗ (marginally above target) |
| Memory efficiency | <35MB RSS | Current 33MB ✓ |
| Syscall budget | <1e-7/op | Current 1.25e-7/op ✗ (above target) |

---

## Timeline

- **Phase 62A (C7 ULTRA Inline)**: Single phase, 4-6 hours
- **Decision point**: After A/B test
- **Next phases**: Based on Phase 62A result