hakmem/docs/analysis/PHASE62_NEXT_TARGET_ANALYSIS.md
Moe Charm (CI) ea417200d2 Phase 62: C7 ULTRA Hotpath Optimization - Planning & Profiling Analysis
Complete planning for Phase 62 based on runtime profiling of Phase 59b baseline.

Key Findings (200M ops Mixed benchmark):
- tiny_c7_ultra_alloc: 5.18% (new primary target, 5x larger than Phase 61)
- tiny_region_id_write_header: 3.82% (reconfirmed, Phase 61 showed 2.32%)
- Allocation-specific hot path: 12.37% (C7 + header + cache)

Phase 62 Recommendation: Option A (C7 ULTRA Inline + IPC Analysis)
- Expected gain: +1-3% (higher absolute margin than Phases 46A/61)
- Risk level: Medium (layout tax precedent from Phase 46A -0.68%, Phase 43 -1.18%)
- Approach: Deep profiling → ASM inspection → A/B test with ENV gate

Alternative Options:
- Option B: tiny_region_id_write_header (3.82%, higher risk)
- Option C: Algorithmic redesign (post-50% milestone)

Box Theory Compliance:
- Single conversion point: tiny_c7_ultra_alloc() boundary
- Reversible: ENV gate HAKMEM_TINY_C7_ULTRA_INLINE_OPT (0/1)
- No side effects: Pure dependency chain reordering

Timeline: Single phase, 4-6 hours (profile + ASM + test)

Documentation:
- PHASE62_NEXT_TARGET_ANALYSIS.md: Complete planning document with profiling data
- CURRENT_TASK.md: Updated next phase guidance

Profiling tools prepared:
- perf record with extended events (cycles, cache-misses, branch-misses)
- ASM inspection methodology documented
- A/B test threshold: ±0.5% (micro-scale)

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-17 16:27:06 +09:00


Phase 62: Allocation Hotpath Optimization - Target Analysis

Date: 2025-12-17
Status: Planning Phase
Baseline: 48.34% of mimalloc (Phase 59b Speed-first)


Executive Summary

Runtime profiling (Phase 59b Speed-first profile) shows that, after the micro-optimization attempts of Phases 59-61, the next highest-value targets are:

  1. tiny_c7_ultra_alloc: 5.18% (new primary target)
  2. tiny_region_id_write_header: 3.82% (reconfirmed hot)
  3. unified_cache_push: 1.37% (already optimized in Phase 46A)

Phase 62 targets dependency-chain optimization in tiny_c7_ultra_alloc, with a potential gain of +1-3%.


Profiling Results (200M ops Mixed benchmark)

Top Allocation Functions

Function                          | Self % | Stack % | Status
----------------------------------|--------|---------|------------------
malloc (wrapper)                  | 27.17% | ~60%    | Core loop
free (wrapper)                    | 25.95% | ~60%    | Core loop
main (benchmark loop)             | 26.78% | ~60%    | Core loop
tiny_c7_ultra_alloc               | 2.41%  | 5.18%   | NEW TARGET
tiny_region_id_write_header       | 2.72%  | 3.82%   | Phase 61 confirmed
unified_cache_push                | 1.37%  | 1.37%   | Phase 46A (no-go)
tiny_c7_ultra_free                | 0.56%  | 0.56%   | Lower priority

Note: Stack % is cumulative: it aggregates a function's overhead across all call stacks in which it appears, so the column does not sum to 100%.

Key Findings

  1. Allocation Specific Hot Path: 12.37% (C7 ultra + region write + cache)
  2. Core Allocator: 79.9% (malloc + free + main loop interactions)
  3. Profiling Confidence: 376 samples, clear hot path, low noise

Phase 62 Options

Option A: C7 ULTRA Hotpath (5.18% - PRIMARY CANDIDATE)

Opportunities:

  • A1: Inline Decision Path - Ensure tiny_c7_ultra_alloc always inlined
  • A2: TLS Prefetch - Speculatively load C7 metadata structure
  • A3: Dependency Chain Reduction - Reorder operations for parallelism
  • A4: Carve Batch Optimization - Pre-carve slabs to reduce refill calls

Expected Gain: +1-3% (the function accounts for a 5.18% cumulative share, which caps the addressable gain)

Risk Level: Medium

  • Precedent: Phase 46A similar optimization (-0.68% from layout tax)
  • Phase 43: Branch elimination (-1.18% regression)
  • But: 5x larger than Phase 46A target (higher absolute gain margin)

Rationale:

  • C7 ULTRA already optimized in free (Phase 7+), alloc side underexplored
  • No successful alloc-side structural optimization since Phase 39 (+1.98% gate prune)
  • This is not micro-architecture bound (unlike Phase 46A store-ordering)

Option B: tiny_region_id_write_header (3.82% - SECONDARY)

Opportunities:

  • B1: Dependency Chain Reorder - Schedule non-dependent operations earlier
  • B2: Condition Consolidation - Reduce branch count
  • B3: Store Bypass - Avoid load-after-store stalls
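
The three ideas above can be sketched on a toy header writer. The 1-byte header format (magic nibble plus class index), the constants, and the function names are assumptions made for illustration; they are not taken from tiny_region_id_write_header itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 1-byte header: high nibble = magic tag, low nibble = class. */
#define HDR_MAGIC_SKETCH     0xA0u
#define HDR_CLASS_MASK       0x0Fu
#define HDR_MAX_CLASS_SKETCH 7u

static inline void *write_header_sketch(uint8_t *base, unsigned class_idx)
{
    /* B1: the header value depends only on the arguments, so compute it
     * first; it does not have to wait on any memory access. */
    uint8_t hdr = (uint8_t)(HDR_MAGIC_SKETCH | (class_idx & HDR_CLASS_MASK));

    /* B2: one combined test (bitwise OR of two conditions) instead of two
     * separate conditional branches. */
    if ((base == NULL) | (class_idx > HDR_MAX_CLASS_SKETCH))
        return NULL;

    base[0] = hdr;

    /* B3: derive the user pointer from `base` (already in a register)
     * rather than reading anything back from the byte just stored. */
    return base + 1;
}

/* Recover the class index from a user pointer produced above. */
static inline unsigned header_class_sketch(const void *user_ptr)
{
    return ((const uint8_t *)user_ptr)[-1] & HDR_CLASS_MASK;
}
```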

Expected Gain: +0.5-1.5%

Risk Level: High

  • Phase 43: Header write optimization (-1.18%)
  • Phase 46A: always_inline (-0.68%)
  • Layout tax is real and measurable

Decision: Secondary option; pursue only if Option A fails


Option C: Algorithmic Redesign (VERY HIGH IMPACT, VERY HIGH COST)

Examples:

  • Segment pre-allocation vs demand-based
  • Free-side batching (coalesce multiple frees)
  • Static route caching (trade memory for latency)

Expected Gain: +3-8% (affects 79.9% core functions)

Risk: Very high (requires major refactoring, extensive testing)

Decision: Post-50% milestone option; requires strategic decision


Phase 62A Recommendation: C7 ULTRA Inline + IPC Analysis

Implementation Plan

Step 1: Deep Profiling (1-2 hours)

perf record -F 99 -g -e cycles:P,cache-misses,branch-misses,stalled-cycles-frontend \
  -- ./bench_random_mixed_hakmem_minimal 200000000 400 1
perf report --stdio | grep -A 20 "tiny_c7_ultra_alloc"

Step 2: ASM Inspection (1 hour)

  • objdump -d on tiny_c7_ultra_alloc
  • Identify dependency chains (load-use, store-use distances)
  • Map to approximate CPU latencies (L1: ~4 cycles, L2: ~10, L3: ~40-75; exact values depend on the target microarchitecture)
  • Identify stores that can be deferred/reordered
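
To make the load-use reasoning concrete, the following sketch compares a fully dependent chain of loads with the same loads issued independently. It uses the latency figures quoted above (low end for L3) and an idealized overlap model that ignores issue width and memory-level-parallelism limits. The function names are hypothetical.

```c
/* Latencies in cycles, from the figures quoted in Step 2 (L3 low end). */
enum { LAT_L1 = 4, LAT_L2 = 10, LAT_L3 = 40 };

/* Dependent loads: each address comes from the previous load's result,
 * so the latencies add up. */
static unsigned chain_cycles_serial(const unsigned *lat, unsigned n)
{
    unsigned total = 0;
    for (unsigned i = 0; i < n; i++)
        total += lat[i];
    return total;
}

/* Independent loads issued together: under ideal overlap the cost is
 * bounded by the slowest single load. */
static unsigned chain_cycles_parallel(const unsigned *lat, unsigned n)
{
    unsigned worst = 0;
    for (unsigned i = 0; i < n; i++)
        if (lat[i] > worst)
            worst = lat[i];
    return worst;
}
```

For example, two L1 hits followed by one L2 hit cost 4 + 4 + 10 = 18 cycles when chained, but about 10 cycles if the three loads can be made independent. That difference is the kind of saving dependency chain reordering aims for.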

Step 3: A/B Test (2-3 hours)

  • Create HAKMEM_TINY_C7_ULTRA_INLINE_OPT ENV gate
  • Implement dependency chain reordering (if identified)
  • Run 10-run Mixed benchmark
  • Apply a ±0.5% decision threshold (micro-scale) to the measured throughput change
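
A minimal sketch of how such an ENV gate could be read once and cached. Only the variable name HAKMEM_TINY_C7_ULTRA_INLINE_OPT comes from this document; the function names and the "exactly `1` enables" parsing rule are assumptions.

```c
#include <stdlib.h>

/* Assumed rule: the gate is on only when the value is exactly "1". */
static int gate_parse_sketch(const char *value)
{
    return value != NULL && value[0] == '1' && value[1] == '\0';
}

/* Reads the environment once and caches the result, so the hot path pays
 * for a single integer compare rather than a getenv() call. */
static int c7_ultra_inline_opt_enabled(void)
{
    static int cached = -1; /* -1 = not read yet */
    if (cached < 0)
        cached = gate_parse_sketch(getenv("HAKMEM_TINY_C7_ULTRA_INLINE_OPT"));
    return cached;
}
```

With concurrent first calls, several threads may each read the environment, but they all store the same value. A production gate would more likely be initialized once at startup.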

Step 4: Decision

  • +0.5% or higher → GO (adopt as default)
  • Between -0.5% and +0.5% → NEUTRAL (keep as research box)
  • -0.5% or lower → NO-GO (revert, document)
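
The decision rule can be written as a small classifier over the mean throughput of the baseline and optimized runs. This is a sketch: the type and function names are hypothetical, and it assumes throughput is reported as "higher is better".

```c
#include <stddef.h>

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_verdict_t;

/* Arithmetic mean of n samples; returns 0.0 for an empty set. */
static double mean_sketch(const double *x, size_t n)
{
    if (n == 0)
        return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Relative change of the optimized mean versus the baseline mean, in
 * percent. Returns 0.0 if the baseline mean is zero. */
static double delta_pct_sketch(const double *base, const double *opt, size_t n)
{
    double b = mean_sketch(base, n);
    double o = mean_sketch(opt, n);
    if (b == 0.0)
        return 0.0;
    return (o - b) / b * 100.0;
}

/* GO at +0.5% or higher, NO-GO at -0.5% or lower, NEUTRAL in between. */
static ab_verdict_t ab_classify_sketch(double delta_pct)
{
    if (delta_pct >= 0.5)
        return AB_GO;
    if (delta_pct <= -0.5)
        return AB_NO_GO;
    return AB_NEUTRAL;
}
```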

Alternative: Quick Validation (if time-limited)

If deep optimization is not feasible, proceed with:

  1. Phase 62B: Static Routing Cache - Pre-compute route decisions for each class

    • Phase 45 suggested +0.5-1.0% from TLS prefetch
    • Lower risk than C7 modification
  2. Phase 62C: Carve Batch Study - Analyze carve operation frequency

    • May identify batching opportunity with minimal code changes
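
For Phase 62B, a static routing cache could look roughly like the sketch below: the route for each size class is decided once and then read from a small table. The class count (C0..C7), the route names, and the stand-in decision function are all assumptions, not hakmem's actual routing logic.

```c
#include <stdint.h>

#define TINY_CLASS_COUNT_SKETCH 8 /* assumption: size classes C0..C7 */

typedef enum {
    ROUTE_DEFAULT_SKETCH  = 0,
    ROUTE_C7_ULTRA_SKETCH = 1
} route_sketch_t;

static uint8_t g_route_table_sketch[TINY_CLASS_COUNT_SKETCH];

/* Stand-in for the dynamic decision (feature gates, ENV checks, ...). */
static route_sketch_t route_decide_slow_sketch(unsigned class_idx)
{
    return class_idx == 7 ? ROUTE_C7_ULTRA_SKETCH : ROUTE_DEFAULT_SKETCH;
}

/* Run once at startup: pre-compute the route for every class. */
static void route_table_init_sketch(void)
{
    for (unsigned c = 0; c < TINY_CLASS_COUNT_SKETCH; c++)
        g_route_table_sketch[c] = (uint8_t)route_decide_slow_sketch(c);
}

/* Hot path: a single table load replaces the chain of gate checks.
 * The mask keeps the index in range (class count is a power of two). */
static inline route_sketch_t route_for_class_sketch(unsigned class_idx)
{
    return (route_sketch_t)
        g_route_table_sketch[class_idx & (TINY_CLASS_COUNT_SKETCH - 1)];
}
```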

Box Theory Compliance

  • Single Conversion Point: C7 ultra path has clear entry point
  • Clear Boundary: tiny_c7_ultra_alloc() function boundary
  • Reversible: ENV gate (HAKMEM_TINY_C7_ULTRA_INLINE_OPT=0/1)
  • No Side Effects: Pure optimization, no new data structures
  • Performance: Expected +1-3% (TBD via A/B test)

Success Criteria

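Metric                 | Target    | Status
-----------------------|-----------|-----------------------------------------
M1 (50%)               | 50.0%     | 48.34% (gap -1.66 points)
Throughput improvement | +1-3%     | TBD
Variance (CV)          | <2.5%     | Current 2.52% (marginally above target)
Memory efficiency      | <35MB RSS | Current 33MB ✓
Syscall budget         | <1e-7/op  | Current 1.25e-7/op (above target)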
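
As a worked check of the M1 gap, the sketch below computes the relative gain needed to move from 48.34% to 50.0% of mimalloc, and the ratio that a given gain would produce. It assumes the +1-3% figure is a relative throughput gain and that mimalloc's throughput stays fixed. The function names are hypothetical.

```c
/* Relative gain (in percent) needed to move from current_pct to target_pct. */
static double required_gain_pct(double current_pct, double target_pct)
{
    return (target_pct / current_pct - 1.0) * 100.0;
}

/* Ratio reached after applying a relative gain (in percent) to current_pct. */
static double projected_pct(double current_pct, double gain_pct)
{
    return current_pct * (1.0 + gain_pct / 100.0);
}
```

Under these assumptions, reaching 50.0% from 48.34% needs about +3.4%, while a +1% gain gives about 48.82% and a +3% gain about 49.79%. The upper end of the expected range therefore lands just short of M1.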

Timeline

  • Phase 62A (C7 ULTRA Inline): Single phase, 4-6 hours
  • Decision point: After A/B test
  • Next phases: Based on Phase 62A result