hakmem/docs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.md
Moe Charm (CI) 8875132134 Phase 5 E5-1: Free Tiny Direct Path (+3.35% GO)
Target: Consolidate free() wrapper overhead (29.56% combined)
- free() wrapper: 21.67% self%
- free_tiny_fast_cold(): 7.89% self%

Strategy: Single header check in wrapper → direct call to free_tiny_fast()
- Eliminates redundant header validation (validated twice before)
- Bypasses cold path routing for Tiny allocations
- High coverage: 48% of frees in Mixed workload are Tiny

Implementation:
- ENV gate: HAKMEM_FREE_TINY_DIRECT=0/1 (default 0)
- core/box/free_tiny_direct_env_box.h: ENV gate
- core/box/free_tiny_direct_stats_box.h: Stats counters
- core/box/hak_wrappers.inc.h: Wrapper integration (lines 593-625)

Safety gates:
- Page boundary guard ((ptr & 0xFFF) != 0)
- Tiny magic validation ((header & 0xF0) == 0xA0)
- Class bounds check (class_idx < 8)
- Fail-fast fallback to existing paths

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DIRECT=0): 44.38M ops/s (mean), 44.45M ops/s (median)
- Optimized (DIRECT=1): 45.87M ops/s (mean), 45.95M ops/s (median)
- Improvement: +3.35% mean, +3.36% median

Decision: GO (+3.35% >= +1.0% threshold)
- 3rd consecutive success with consolidation/deduplication pattern
- E4-1: +3.51%, E4-2: +21.83%, E5-1: +3.35%
- Health check: PASS (all profiles)

Phase 5 Cumulative:
- E4 Combined: +6.43%
- E5-1: +3.35%
- Estimated total: ~+10%

Deliverables:
- docs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-1 complete)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:52:32 +09:00

Phase 5 E5: Comprehensive Analysis & Implementation Report

Executive Summary

Date: 2025-12-14
Baseline: 47.34M ops/s (Mixed, E4-1+E4-2 ON, 20M iters, ws=400)
Status: Step 1 Complete - Perf Analysis & Priority Decision


Step 1: Perf Analysis - Free() Internal Breakdown

Test Configuration

HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 999 -- \
  ./bench_random_mixed_hakmem 20000000 400 1

Results: 476 samples @ 999Hz, throughput 44.12M ops/s

Top Hotspots (self% >= 2.0%)

| Rank | Function | Self% | Analysis |
|------|----------|-------|----------|
| 1 | free | 21.67% | Wrapper layer overhead (header check, ENV snapshot, gate dispatch) |
| 2 | tiny_alloc_gate_fast | 18.55% | Alloc gate (reduced from E4 baseline) |
| 3 | main | 17.44% | Benchmark driver |
| 4 | malloc | 12.27% | Wrapper layer (reduced from E4 baseline) |
| 5 | free_tiny_fast_cold | 7.89% | NEW: Cold path (route determination, policy snapshot) |
| 6 | unified_cache_push | 3.47% | TLS cache push operation |
| 7 | tiny_c7_ultra_alloc | 2.98% | C7 ULTRA alloc path |
| 8 | tiny_region_id_write_header | 2.59% | Header write tax (reduced from 6.97% in E4 baseline) |
| 9 | hakmem_env_snapshot_enabled | 2.57% | ENV snapshot gate overhead (reduced from 4.29%) |

Free Path Internal Breakdown

Total free() overhead: 21.67% (wrapper) + 7.89% (cold path) = 29.56%

Key Observations:

  1. Wrapper Layer (21.67% self%):

    • NULL check: ~0%
    • Header magic validation: ~8% of free() self%
      • Load header byte: movzbl -0x1(%rbp),%r15d (7.44% sample hit)
      • Magic comparison: cmp $0xa0,%r14b (0%)
      • Class extraction: and $0xf,%r12d (0%)
    • ENV snapshot calls: ~8% of free() self%
      • mov 0x5b311(%rip),%edx (7.73% + 6.98% sample hits)
      • Branch to cold init paths (0%)
    • TLS reads for gate dispatch: ~5% of free() self%
      • mov 0x14(%r15),%ebx (8.01% sample hit)
  2. Cold Path (7.89% self%):

    • tiny_route_for_class(): 1.95% self% (route determination)
    • Policy snapshot: 0.64% self%
    • Class-based dispatch overhead
  3. Header Write Tax (2.59% self%):

    • DOWN from 6.97% in E4 baseline (a ~63% reduction)
    • Note: This is already significantly improved
    • Further optimization ROI is lower now
  4. ENV Snapshot Overhead (2.57% self%):

    • DOWN from 4.29% in E4 baseline (a ~40% reduction)
    • Lazy init checks + TLS reads
    • Still visible but reduced impact

Critical Insight: Header Check Overhead

Annotated perf shows:

  • Lines with 7-8% sample hits are concentrated in:
    1. Header load: movzbl -0x1(%rbp),%r15d (line 0x1cd6d)
    2. Multiple TLS/global reads for ENV gates (lines 0x1ccd1, 0x1ccf8, 0x1cd01, 0x1cd2d)

These lines represent:

  • Header validation overhead: ~7-8% of free() self%
  • ENV gate consolidation overhead: ~8% of free() self% (multiple checks)
  • Total wrapper tax: ~16% of 21.67% = ~3.5% of total runtime

Branch Prediction Analysis

From perf annotate:

  • Page boundary check: test $0xfff,%ebp (0% hit - well predicted)
  • Magic comparison: cmp $0xa0,%r14b (0% hit - well predicted)
  • ENV snapshot checks: Multiple branches with 0% hit (well predicted)
  • No significant branch misprediction issues

Step 2: E5 Candidate Priority Decision

E5-1: Free() Tiny Direct Path (HIGHEST PRIORITY)

Target: 21.67% (free wrapper) + 7.89% (free_tiny_fast_cold) = 29.56% total

Hypothesis:

  • Current free() wrapper does header validation + ENV snapshot + gate dispatch
  • For Tiny allocations (48% of frees in Mixed), this is redundant work:
    1. Header is validated in wrapper (magic check)
    2. Header is re-extracted in free_tiny_fast()
    3. Route determination happens in free_tiny_fast_cold()
  • Opportunity: Single-check Tiny direct path in wrapper

Strategy (Box Theory):

  • L0 SplitBox: HAKMEM_FREE_TINY_DIRECT=0/1 (default 0)
  • L1 HotBox: Wrapper-level Tiny fast path
    • Check: (header & 0xF0) == 0xA0 (Tiny magic)
    • Extract: class_idx = header & 0x0F
    • Validate: class_idx < 8 (fail-fast)
    • Direct call: free_tiny_fast_direct(ptr, base, class_idx)
      • Skips: ENV snapshot consolidation in wrapper
      • Skips: Cold path route determination
      • Skips: Re-validation of header
  • L1 ColdBox: Existing fallback for Mid/Pool/Large/invalid
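
The check sequence in the L1 HotBox can be sketched in C as follows. This is a minimal sketch, not the code in hak_wrappers.inc.h: the helper name `tiny_direct_class_of` and the macro names are hypothetical. It assumes the 1-byte header sits at `ptr - 1` (as the `movzbl -0x1(%rbp)` load in the perf annotation suggests) and that a page-aligned `ptr` is rejected so the header read never touches the previous page.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the constants come from the checks listed above. */
#define DEMO_TINY_MAGIC_MASK  0xF0u
#define DEMO_TINY_MAGIC       0xA0u
#define DEMO_TINY_CLASS_MASK  0x0Fu
#define DEMO_TINY_CLASS_COUNT 8u

/* Returns the Tiny class index (0..7) if ptr passes every gate, or -1 so
 * the caller can fall back to the existing (cold) free path. */
static inline int tiny_direct_class_of(const void *ptr)
{
    uintptr_t p = (uintptr_t)ptr;

    if (p == 0)
        return -1;                      /* free(NULL) is handled elsewhere */
    if ((p & 0xFFFu) == 0)
        return -1;                      /* page-aligned: ptr - 1 is on another page */

    uint8_t header = *((const uint8_t *)ptr - 1);

    if ((header & DEMO_TINY_MAGIC_MASK) != DEMO_TINY_MAGIC)
        return -1;                      /* not a Tiny block (Mid/Pool/Large/invalid) */

    unsigned class_idx = header & DEMO_TINY_CLASS_MASK;
    if (class_idx >= DEMO_TINY_CLASS_COUNT)
        return -1;                      /* fail fast on an out-of-range class */

    return (int)class_idx;
}
```

A wrapper would call the direct free path only when the result is non-negative; every other outcome keeps the pre-existing behavior.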

Expected ROI:

  • Baseline 29.56% overhead → Target 18-20% overhead (30-40% reduction)
  • Translation: +3-5% throughput improvement
  • Risk: Low (single boundary, fail-fast validation)

Implementation Complexity: Medium

  • New function: free_tiny_fast_direct() (inline, thin wrapper)
  • Integration: 1 boundary in free() wrapper
  • Counters: 3 (direct_hit, direct_miss, invalid_header)
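
The three counters could look roughly like the sketch below. The type and function names are hypothetical, and the meaning assigned to each counter is an assumption: `direct_hit` for frees taken by the direct path, `direct_miss` for pointers without Tiny magic, and `invalid_header` for Tiny magic with an out-of-range class.

```c
#include <stdint.h>

/* Assumed outcome categories; the real stats box may define them differently. */
typedef enum {
    FTD_DIRECT_HIT,
    FTD_DIRECT_MISS,
    FTD_INVALID_HEADER
} ftd_outcome_t;

typedef struct {
    uint64_t direct_hit;
    uint64_t direct_miss;
    uint64_t invalid_header;
} ftd_stats_t;

/* Bump exactly one counter per free() call that reaches the gate. */
static inline void ftd_stats_record(ftd_stats_t *s, ftd_outcome_t outcome)
{
    switch (outcome) {
    case FTD_DIRECT_HIT:     s->direct_hit++;     break;
    case FTD_DIRECT_MISS:    s->direct_miss++;    break;
    case FTD_INVALID_HEADER: s->invalid_header++; break;
    }
}

/* Fraction of gated frees that took the direct path (0.0 if none recorded). */
static inline double ftd_stats_hit_rate(const ftd_stats_t *s)
{
    uint64_t total = s->direct_hit + s->direct_miss + s->invalid_header;
    return total ? (double)s->direct_hit / (double)total : 0.0;
}
```

A hit rate far from the Tiny share of frees in the workload would suggest the gate is rejecting pointers it should accept.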

E5-2: Header Write to Refill Boundary (MEDIUM PRIORITY)

Target: 2.59% (tiny_region_id_write_header)

Hypothesis:

  • Header write happens on EVERY alloc (hot path tax)
  • Blocks are reused within same class → header is stable
  • Opportunity: Write header once at refill boundary (cold path)

Strategy (Box Theory):

  • L0 SplitBox: HAKMEM_TINY_HEADER_PREFILL=0/1 (default 0)
  • L2 HeaderPrefillBox: Refill boundary header initialization
    • When: Slab refill (cold path, ~64 blocks at once)
    • Where: tiny_unified_cache_refill() or similar
    • Action: Prefill all headers in slab page
  • L1 HotBox: Alloc path skips header write
    • Condition: Check "prefill done" flag per slab
    • Fast path: Return base + 1 directly (no header write)
    • Fallback: Traditional tiny_region_id_write_header() if not prefilled
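
A minimal sketch of the prefill idea, under stated assumptions: `demo_slab_t` and its fields are hypothetical stand-ins for the real slab metadata, each block is assumed to start with a 1-byte header of the form `0xA0 | class_idx`, and nothing is assumed to overwrite that byte while a block sits in a free list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slab descriptor, for illustration only. */
typedef struct {
    uint8_t *base;             /* first block; its first byte is the header */
    size_t   stride;           /* bytes per block, header byte included */
    size_t   count;            /* number of blocks in the slab */
    uint8_t  class_idx;        /* Tiny class served by this slab */
    bool     header_prefilled; /* the per-slab "prefill done" flag */
} demo_slab_t;

static inline uint8_t demo_tiny_header(uint8_t class_idx)
{
    return (uint8_t)(0xA0u | (class_idx & 0x0Fu));
}

/* Cold path: write every header once when the slab is refilled. */
static void demo_slab_prefill_headers(demo_slab_t *slab)
{
    for (size_t i = 0; i < slab->count; i++)
        slab->base[i * slab->stride] = demo_tiny_header(slab->class_idx);
    slab->header_prefilled = true;
}

/* Hot path: skip the header write when the slab was prefilled,
 * otherwise fall back to writing it per allocation. */
static inline void *demo_slab_block_to_user(demo_slab_t *slab, size_t idx)
{
    uint8_t *block = slab->base + idx * slab->stride;
    if (!slab->header_prefilled)
        *block = demo_tiny_header(slab->class_idx);
    return block + 1;          /* user pointer is base + 1 */
}
```

The flag check replaces a store with a load and a branch, so the gain depends on that flag being cheap to reach from the alloc hot path.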

Expected ROI:

  • Baseline 2.59% → Target 0.5-1.0% (60-80% reduction)
  • Translation: +1.5-2.0% throughput improvement
  • Risk: Medium (requires slab-level "prefilled" tracking)

Implementation Complexity: High

  • Metadata: Per-slab "header_prefilled" flag (1 bit per slab)
  • Integration: 2 boundaries (refill + alloc hot path)
  • Safety: Fallback to hot path write if prefill not done

Note: Phase 1 A3 showed always_inline on header write was NO-GO (-4% on Mixed). This approach is different: eliminate writes, not inline them.

E5-3: ENV Snapshot Branch Shape (LOW PRIORITY)

Target: 2.57% (hakmem_env_snapshot_enabled)

Hypothesis:

  • MIXED_TINYV3_C7_SAFE now defaults to HAKMEM_ENV_SNAPSHOT=1
  • Current branch hint: __builtin_expect(..., 0) (expects DISABLED)
  • Reality: Snapshot is ENABLED → hint is backwards
  • Opportunity: Flip branch shape to match reality

Strategy (Box Theory):

  • L0 SplitBox: HAKMEM_ENV_SNAPSHOT_SHAPE=0/1 (default 0)
  • L1 ShapeBox: Branch restructuring
    • Old: if (__builtin_expect(!enabled, 1)) { legacy } else { snapshot }
    • New: if (__builtin_expect(enabled, 1)) { snapshot } else { legacy }
    • Or: Move legacy to noinline,cold helper
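
The Old and New shapes can be compared in a small sketch. All names here are hypothetical, and the code relies on the GCC/Clang `__builtin_expect` builtin and the `noinline`/`cold` function attributes. Both functions return the same path for the same flag; only the hint and the code layout differ.

```c
#include <stdbool.h>

enum { DEMO_PATH_SNAPSHOT = 1, DEMO_PATH_LEGACY = 2 };

/* Legacy handling moved out of line so it does not occupy hot-path code. */
__attribute__((noinline, cold))
static int demo_legacy_path(void)
{
    return DEMO_PATH_LEGACY;
}

/* Old shape: the hint says "snapshot is usually disabled". */
static inline int demo_route_old(bool enabled)
{
    if (__builtin_expect(!enabled, 1))
        return demo_legacy_path();
    return DEMO_PATH_SNAPSHOT;
}

/* New shape: the hint matches a profile where the snapshot is enabled. */
static inline int demo_route_new(bool enabled)
{
    if (__builtin_expect(enabled, 1))
        return DEMO_PATH_SNAPSHOT;
    return demo_legacy_path();
}
```

Because the two shapes are behaviorally identical, any difference has to be measured with an A/B run rather than inferred from the source.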

Expected ROI:

  • Baseline 2.57% → Target 1.5-2.0% (20-40% reduction)
  • Translation: +0.5-1.0% throughput improvement
  • Risk: Very low (branch shape change only, no logic change)

Implementation Complexity: Low

  • Changes: 5-8 call sites in hot paths
  • Integration: Flip __builtin_expect() polarity
  • Counters: None needed (perf annotation will show)

Note: This is a "polish" optimization. E3-4 showed branch hints can backfire. Approach carefully with A/B testing.


Step 2 Conclusion: Priority Ranking

Priority Order (ROI × Safety × Complexity)

  1. E5-1 (Free Tiny Direct Path): +3-5% expected, medium complexity, low risk

    • ROI Score: 9/10 (largest target, proven pattern)
    • Safety Score: 8/10 (single boundary, fail-fast)
    • Implementation Score: 7/10 (medium effort)
    • TOTAL: 24/30 → HIGHEST PRIORITY
  2. E5-2 (Header Prefill at Refill): +1.5-2.0% expected, high complexity, medium risk

    • ROI Score: 6/10 (2.59% target, diminishing returns from E4)
    • Safety Score: 6/10 (metadata overhead, fallback needed)
    • Implementation Score: 4/10 (high effort, slab tracking)
    • TOTAL: 16/30 → MEDIUM PRIORITY
  3. E5-3 (ENV Snapshot Shape): +0.5-1.0% expected, low complexity, low risk

    • ROI Score: 4/10 (small target, branch hint uncertainty)
    • Safety Score: 9/10 (pure shape change)
    • Implementation Score: 9/10 (low effort)
    • TOTAL: 22/30 → LOW PRIORITY (ranked below E5-2 on expected gain despite the higher total; an easy win if E5-1 succeeds)

Decision: Proceed with E5-1 (Free Tiny Direct Path)

Rationale:

  1. Largest target: 29.56% combined overhead (free wrapper + cold path)
  2. Proven pattern: E4 success shows TLS consolidation works
  3. Single boundary: Clear separation, fail-fast validation
  4. Low risk: Header magic check is reliable, fallback is trivial
  5. Box Theory compliant: L0 ENV gate, L1 hot/cold split, minimal counters

Expected Gain: +3-5% (conservative estimate)
Success Threshold: +1.0% (GO criteria)
Fallback Plan: If NO-GO, freeze and pursue E5-2 or E5-3


Step 3: E5-1 Implementation Plan (Next)

Design Document

  • File: docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md
  • Sections:
    1. Problem Statement (29.56% overhead analysis)
    2. Solution Architecture (wrapper-level Tiny direct path)
    3. Box Theory Boundary (L0 ENV gate, L1 hot/cold split)
    4. Safety Gates (header validation, fail-fast, fallback)
    5. Integration Points (free() wrapper, counters)
    6. A/B Test Plan (10-run Mixed, health check)

Implementation Files

  • ENV Box: core/box/free_tiny_direct_env_box.h/.c
    • ENV gate: HAKMEM_FREE_TINY_DIRECT=0/1
    • Lazy init, static cache
  • Stats Box: core/box/free_tiny_direct_stats_box.h
    • Counters: direct_hit, direct_miss, invalid_header
  • Integration: core/box/hak_wrappers.inc.h
    • Lines ~595-640 (free() wrapper hot path)
    • New inline: free_tiny_fast_direct()
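
One way to read "lazy init, static cache" for the ENV gate is sketched below. The function names are hypothetical and the parsing rule (only a leading `1` enables the gate) is an assumption; the real box may accept other values. The cache is not synchronized, which is tolerable here only because every thread would compute and store the same value.

```c
#include <stdlib.h>

/* Assumed parsing rule: unset or anything not starting with '1' means off. */
static inline int demo_parse_gate(const char *value)
{
    return (value != NULL && value[0] == '1') ? 1 : 0;
}

/* Reads HAKMEM_FREE_TINY_DIRECT once, then serves the cached result. */
static inline int demo_free_tiny_direct_enabled(void)
{
    static int cached = -1;                 /* -1 = not initialized yet */
    if (__builtin_expect(cached < 0, 0))    /* GCC/Clang builtin; init is rare */
        cached = demo_parse_gate(getenv("HAKMEM_FREE_TINY_DIRECT"));
    return cached;
}
```

Since the value is cached on first use, changing the variable after the first `free()` has no effect, which is why the A/B runs set it per process.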

A/B Test Protocol

# Baseline (E5-1=0)
for i in {1..10}; do
  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FREE_TINY_DIRECT=0 \
    ./bench_random_mixed_hakmem 20000000 400 1
done

# Optimized (E5-1=1)
for i in {1..10}; do
  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FREE_TINY_DIRECT=1 \
    ./bench_random_mixed_hakmem 20000000 400 1
done

Success Criteria:

  • GO: Mean change of +1.0% or higher
  • NEUTRAL: Mean change strictly between -1.0% and +1.0%
  • NO-GO: Mean change of -1.0% or lower
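
The criteria can be applied mechanically once the per-run throughputs are collected. The sketch below uses hypothetical helper names and treats the ±1.0% boundaries as inclusive for GO and NO-GO, which is an assumption about how ties are meant to be resolved.

```c
#include <stddef.h>

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_verdict_t;

/* Arithmetic mean of n throughput samples (n must be > 0). */
static double ab_mean(const double *xs, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += xs[i];
    return sum / (double)n;
}

/* Relative change of the optimized mean over the baseline mean, in percent. */
static double ab_delta_percent(double baseline_mean, double optimized_mean)
{
    return (optimized_mean / baseline_mean - 1.0) * 100.0;
}

/* Map a mean change to the GO / NEUTRAL / NO-GO verdict. */
static ab_verdict_t ab_classify(double delta_percent)
{
    if (delta_percent >= 1.0)
        return AB_GO;
    if (delta_percent <= -1.0)
        return AB_NO_GO;
    return AB_NEUTRAL;
}
```

With the means reported in the commit message (44.38M and 45.87M ops/s), `ab_delta_percent` gives roughly +3.4%, in line with the reported +3.35%, and `ab_classify` returns `AB_GO`.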

Appendix: Perf Data Details

Free() Annotated Hot Lines

Top sample-hit lines (from perf annotate --stdio free):

  1. 8.01%: mov 0x14(%r15),%ebx (offset 0x1cd2d)

    • Loading wrapper_env->wrap_shape from TLS
    • Analysis: ENV snapshot field access overhead
  2. 7.73%: mov %rdi,%rbp (offset 0x1ccd1)

    • Saving ptr argument to register
    • Analysis: Register pressure in wrapper preamble
  3. 7.44%: mov 0x5b311(%rip),%edx (offset 0x1cd01)

    • Loading cached ENV gate value (global)
    • Analysis: Lazy init check overhead
  4. 6.98%: test %r11d,%r11d (offset 0x1ccf8)

    • Testing ENV gate flag (TLS)
    • Analysis: Branch condition for ENV snapshot path
  5. 7.62%: pop %r14 (offset 0x1ce6e)

    • Epilogue register restore
    • Analysis: Function call overhead (6+ register saves)

Total wrapper overhead from top 5 lines: ~38% of free() self% (8.01 + 7.73 + 7.44 + 6.98 + 7.62 = 37.78) = ~8% of total runtime

Free_Tiny_Fast_Cold Analysis

7.89% self% breakdown:

  • tiny_route_for_class(): 1.95% (route determination)
  • Policy snapshot: 0.64%
  • Class dispatch: ~5% (remainder)

Cold path is triggered when:

  • Class != C7 (no ULTRA early-exit)
  • OR class requires route determination (C0-C6 in Mixed)

Opportunity: Bypass cold path entirely for Tiny by validating in wrapper.


Next Actions

  1. Step 1 Complete: Perf analysis → E5-1 selected
  2. 🔄 Step 3 Next: Implement E5-1 design + code
  3. Step 4 Next: 10-run A/B test (Mixed)
  4. Step 5 Next: Health check
  5. Step 6 Next: Update CURRENT_TASK.md with results

Status: Ready to proceed with E5-1 implementation.