
Moe Charm (CI) 8875132134 Phase 5 E5-1: Free Tiny Direct Path (+3.35% GO)

Target: Consolidate free() wrapper overhead (29.56% combined)
- free() wrapper: 21.67% self%
- free_tiny_fast_cold(): 7.89% self%

Strategy: Single header check in wrapper → direct call to free_tiny_fast()
- Eliminates redundant header validation (validated twice before)
- Bypasses cold path routing for Tiny allocations
- High coverage: 48% of frees in Mixed workload are Tiny

Implementation:
- ENV gate: HAKMEM_FREE_TINY_DIRECT=0/1 (default 0)
- core/box/free_tiny_direct_env_box.h: ENV gate
- core/box/free_tiny_direct_stats_box.h: Stats counters
- core/box/hak_wrappers.inc.h: Wrapper integration (lines 593-625)

Safety gates:
- Page boundary guard ((ptr & 0xFFF) != 0)
- Tiny magic validation ((header & 0xF0) == 0xA0)
- Class bounds check (class_idx < 8)
- Fail-fast fallback to existing paths

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DIRECT=0): 44.38M ops/s (mean), 44.45M ops/s (median)
- Optimized (DIRECT=1): 45.87M ops/s (mean), 45.95M ops/s (median)
- Improvement: +3.35% mean, +3.36% median

Decision: GO (+3.35% >= +1.0% threshold)
- 3rd consecutive success with consolidation/deduplication pattern
- E4-1: +3.51%, E4-2: +21.83%, E5-1: +3.35%
- Health check: PASS (all profiles)

Phase 5 Cumulative:
- E4 Combined: +6.43%
- E5-1: +3.35%
- Estimated total: ~+10%

Deliverables:
- docs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-1 complete)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-14 05:52:32 +09:00


Phase 5 E5: Comprehensive Analysis & Implementation Report

Executive Summary

Date: 2025-12-14
Baseline: 47.34M ops/s (Mixed, E4-1+E4-2 ON, 20M iters, ws=400)
Status: Step 1 Complete - Perf Analysis & Priority Decision

Step 1: Perf Analysis - Free() Internal Breakdown

Test Configuration

HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 999 -- \
  ./bench_random_mixed_hakmem 20000000 400 1

Results: 476 samples @ 999Hz, throughput 44.12M ops/s

Top Hotspots (self% >= 2.0%)

| Rank | Function | Self% | Analysis |
|------|----------|-------|----------|
| 1 | `free` | 21.67% | Wrapper layer overhead (header check, ENV snapshot, gate dispatch) |
| 2 | `tiny_alloc_gate_fast` | 18.55% | Alloc gate (reduced from E4 baseline) |
| 3 | `main` | 17.44% | Benchmark driver |
| 4 | `malloc` | 12.27% | Wrapper layer (reduced from E4 baseline) |
| 5 | `free_tiny_fast_cold` | 7.89% | NEW: Cold path (route determination, policy snapshot) |
| 6 | `unified_cache_push` | 3.47% | TLS cache push operation |
| 7 | `tiny_c7_ultra_alloc` | 2.98% | C7 ULTRA alloc path |
| 8 | `tiny_region_id_write_header` | 2.59% | Header write tax (reduced from 6.97% in E4 baseline) |
| 9 | `hakmem_env_snapshot_enabled` | 2.57% | ENV snapshot gate overhead (reduced from 4.29%) |

Free Path Internal Breakdown

Total free() overhead: 21.67% (wrapper) + 7.89% (cold path) = 29.56%

Key Observations:

Wrapper Layer (21.67% self%):
- NULL check: ~0%
- Header magic validation: ~8% of free() self%
  - Load header byte: movzbl -0x1(%rbp),%r15d (7.44% sample hit)
  - Magic comparison: cmp $0xa0,%r14b (0%)
  - Class extraction: and $0xf,%r12d (0%)
- ENV snapshot calls: ~8% of free() self%
  - mov 0x5b311(%rip),%edx (7.73% + 6.98% sample hits)
  - Branch to cold init paths (0%)
- TLS reads for gate dispatch: ~5% of free() self%
  - mov 0x14(%r15),%ebx (8.01% sample hit)
Cold Path (7.89% self%):
- tiny_route_for_class(): 1.95% self% (route determination)
- Policy snapshot: 0.64% self%
- Class-based dispatch overhead
Header Write Tax (2.59% self%):
- DOWN from 6.97% in E4 baseline (a 63% reduction)
- Note: This is already significantly improved
- Further optimization ROI is lower now
ENV Snapshot Overhead (2.57% self%):
- DOWN from 4.29% in E4 baseline (a 40% reduction)
- Lazy init checks + TLS reads
- Still visible but reduced impact

Critical Insight: Header Check Overhead

Annotated perf shows:

Lines with 7-8% sample hits are concentrated in:
1. Header load: movzbl -0x1(%rbp),%r15d (line 0x1cd6d)
2. Multiple TLS/global reads for ENV gates (lines 0x1ccd1, 0x1ccf8, 0x1cd01, 0x1cd2d)

These lines represent:

Header validation overhead: ~7-8% of free() self%
ENV gate consolidation overhead: ~8% of free() self% (multiple checks)
Total wrapper tax: ~16% of 21.67% = ~3.5% of total runtime

Branch Prediction Analysis

From perf annotate:

Page boundary check: test $0xfff,%ebp (0% hit - well predicted)
Magic comparison: cmp $0xa0,%r14b (0% hit - well predicted)
ENV snapshot checks: Multiple branches with 0% hit (well predicted)
No significant branch misprediction issues

Step 2: E5 Candidate Priority Decision

E5-1: Free() Tiny Direct Path (HIGHEST PRIORITY)

Target: 21.67% (free wrapper) + 7.89% (free_tiny_fast_cold) = 29.56% total

Hypothesis:

Current free() wrapper does header validation + ENV snapshot + gate dispatch
For Tiny allocations (48% of frees in Mixed), this is redundant work:
1. Header is validated in wrapper (magic check)
2. Header is re-extracted in free_tiny_fast()
3. Route determination happens in free_tiny_fast_cold()
Opportunity: Single-check Tiny direct path in wrapper

Strategy (Box Theory):

L0 SplitBox: HAKMEM_FREE_TINY_DIRECT=0/1 (default 0)
L1 HotBox: Wrapper-level Tiny fast path
- Check: (header & 0xF0) == 0xA0 (Tiny magic)
- Extract: class_idx = header & 0x0F
- Validate: class_idx < 8 (fail-fast)
- Direct call: free_tiny_fast_direct(ptr, base, class_idx)
  - Skips: ENV snapshot consolidation in wrapper
  - Skips: Cold path route determination
  - Skips: Re-validation of header
L1 ColdBox: Existing fallback for Mid/Pool/Large/invalid
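
The single-check decision described above can be sketched as follows. This is a minimal illustration, not the code in hak_wrappers.inc.h: the names `free_tiny_direct_classify` and `free_route_t` are hypothetical, and the sketch assumes the 1-byte header sits immediately before the user pointer (so `base` would be `ptr - 1`). The magic mask, class mask, class bound, and page boundary guard are the ones listed in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants taken from the checks listed in the report. */
#define TINY_MAGIC_MASK  0xF0u
#define TINY_MAGIC       0xA0u
#define TINY_CLASS_MASK  0x0Fu
#define TINY_NUM_CLASSES 8u

/* Hypothetical result type: which path the wrapper should take. */
typedef enum { FREE_ROUTE_TINY_DIRECT, FREE_ROUTE_FALLBACK } free_route_t;

/*
 * Decide whether ptr may take the direct Tiny path. On success,
 * *class_idx receives the size class read from the header byte.
 * Assumption: the header byte is stored at ptr - 1.
 */
static free_route_t free_tiny_direct_classify(const void *ptr, unsigned *class_idx)
{
    assert(class_idx != NULL);

    if (ptr == NULL)
        return FREE_ROUTE_FALLBACK;

    /* Page boundary guard: for a page-aligned ptr, the byte at ptr - 1
     * lies on the previous page, so leave it to the existing paths. */
    if (((uintptr_t)ptr & 0xFFFu) == 0)
        return FREE_ROUTE_FALLBACK;

    uint8_t header = *((const uint8_t *)ptr - 1);

    /* Tiny magic validation. */
    if ((header & TINY_MAGIC_MASK) != TINY_MAGIC)
        return FREE_ROUTE_FALLBACK;

    /* Class extraction and fail-fast bounds check. */
    unsigned cls = header & TINY_CLASS_MASK;
    if (cls >= TINY_NUM_CLASSES)
        return FREE_ROUTE_FALLBACK;

    *class_idx = cls;
    return FREE_ROUTE_TINY_DIRECT;
}
```

In a wrapper, a `FREE_ROUTE_TINY_DIRECT` result would lead to the direct call with the extracted class, and any other result would fall through to the existing Mid/Pool/Large handling unchanged.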

Expected ROI:

Baseline 29.56% overhead → Target 18-20% overhead (30-40% reduction)
Translation: +3-5% throughput improvement
Risk: Low (single boundary, fail-fast validation)

Implementation Complexity: Medium

New function: free_tiny_fast_direct() (inline, thin wrapper)
Integration: 1 boundary in free() wrapper
Counters: 3 (direct_hit, direct_miss, invalid_header)

E5-2: Header Write to Refill Boundary (MEDIUM PRIORITY)

Target: 2.59% (tiny_region_id_write_header)

Hypothesis:

Header write happens on EVERY alloc (hot path tax)
Blocks are reused within same class → header is stable
Opportunity: Write header once at refill boundary (cold path)

Strategy (Box Theory):

L0 SplitBox: HAKMEM_TINY_HEADER_PREFILL=0/1 (default 0)
L2 HeaderPrefillBox: Refill boundary header initialization
- When: Slab refill (cold path, ~64 blocks at once)
- Where: tiny_unified_cache_refill() or similar
- Action: Prefill all headers in slab page
L1 HotBox: Alloc path skips header write
- Condition: Check "prefill done" flag per slab
- Fast path: Return base + 1 directly (no header write)
- Fallback: Traditional tiny_region_id_write_header() if not prefilled
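
The prefill idea can be illustrated with a small sketch. The slab descriptor `tiny_slab_t` and both function names are hypothetical, and the layout (one header byte at the start of each block, user pointer at base + 1) is an assumption drawn from the "Return base + 1" line above. hakmem's actual slab metadata is not shown in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TINY_MAGIC 0xA0u

/* Hypothetical slab descriptor; the real metadata will differ. */
typedef struct {
    uint8_t *base;             /* start of the slab memory */
    size_t   block_size;       /* stride per block, header byte included */
    size_t   block_count;      /* number of blocks in the slab */
    unsigned class_idx;        /* size class, 0..7 */
    bool     header_prefilled; /* the per-slab "prefill done" flag */
} tiny_slab_t;

/* Cold path: write every block header once, at refill time. */
static void tiny_slab_prefill_headers(tiny_slab_t *slab)
{
    assert(slab != NULL && slab->class_idx < 8);
    uint8_t header = (uint8_t)(TINY_MAGIC | slab->class_idx);
    for (size_t i = 0; i < slab->block_count; i++)
        slab->base[i * slab->block_size] = header;
    slab->header_prefilled = true;
}

/* Hot path: skip the header store when the slab was prefilled. */
static void *tiny_block_to_user(tiny_slab_t *slab, size_t block_idx)
{
    assert(slab != NULL && block_idx < slab->block_count);
    uint8_t *block = slab->base + block_idx * slab->block_size;
    if (!slab->header_prefilled)
        *block = (uint8_t)(TINY_MAGIC | slab->class_idx); /* fallback write */
    return block + 1; /* user pointer sits just past the header byte */
}
```

The flag check replaces a store with a load and a branch, so whether this wins in practice depends on how well that branch predicts. That is the kind of question the A/B gate is meant to answer.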

Expected ROI:

Baseline 2.59% → Target 0.5-1.0% (60-80% reduction)
Translation: +1.5-2.0% throughput improvement
Risk: Medium (requires slab-level "prefilled" tracking)

Implementation Complexity: High

Metadata: Per-slab "header_prefilled" flag (1 bit per slab)
Integration: 2 boundaries (refill + alloc hot path)
Safety: Fallback to hot path write if prefill not done

Note: Phase 1 A3 showed always_inline on header write was NO-GO (-4% on Mixed). This approach is different: eliminate writes, not inline them.

E5-3: ENV Snapshot Branch Shape (LOW PRIORITY)

Target: 2.57% (hakmem_env_snapshot_enabled)

Hypothesis:

MIXED_TINYV3_C7_SAFE now defaults to HAKMEM_ENV_SNAPSHOT=1
Current branch hint: __builtin_expect(..., 0) (expects DISABLED)
Reality: Snapshot is ENABLED → hint is backwards
Opportunity: Flip branch shape to match reality

Strategy (Box Theory):

L0 SplitBox: HAKMEM_ENV_SNAPSHOT_SHAPE=0/1 (default 0)
L1 ShapeBox: Branch restructuring
- Old: if (__builtin_expect(!enabled, 1)) { legacy } else { snapshot }
- New: if (__builtin_expect(enabled, 1)) { snapshot } else { legacy }
- Or: Move legacy to noinline,cold helper
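
The "New" shape combined with the cold helper could look like the sketch below. `snapshot_path`, `legacy_path`, and `gate_dispatch` are placeholder names with dummy bodies. `__builtin_expect` and the `noinline`/`cold` attributes are GCC/Clang extensions, so the macros degrade to plain code on other compilers.

```c
#include <assert.h>

/* GCC/Clang extensions; other compilers get a plain expression. */
#if defined(__GNUC__)
#define LIKELY(x)     __builtin_expect(!!(x), 1)
#define COLD_NOINLINE __attribute__((noinline, cold))
#else
#define LIKELY(x)     (x)
#define COLD_NOINLINE
#endif

/* Placeholder for the snapshot-enabled path (the expected case). */
static int snapshot_path(int v) { return v + 1; }

/* Placeholder for the legacy path, kept out of line and marked cold. */
static COLD_NOINLINE int legacy_path(int v) { return v + 2; }

/* New shape: the hint matches the profile default (snapshot enabled). */
static int gate_dispatch(int snapshot_enabled, int v)
{
    assert(snapshot_enabled == 0 || snapshot_enabled == 1);
    if (LIKELY(snapshot_enabled))
        return snapshot_path(v);
    return legacy_path(v);
}
```

Both shapes compute the same result. Only the code layout the compiler is asked to favor changes, which is why the effect has to be measured rather than assumed.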

Expected ROI:

Baseline 2.57% → Target 1.5-2.0% (20-40% reduction)
Translation: +0.5-1.0% throughput improvement
Risk: Very low (branch shape change only, no logic change)

Implementation Complexity: Low

Changes: 5-8 call sites in hot paths
Integration: Flip __builtin_expect() polarity
Counters: None needed (perf annotation will show)

Note: This is a "polish" optimization. E3-4 showed branch hints can backfire. Approach carefully with A/B testing.

Step 2 Conclusion: Priority Ranking

Priority Order (ROI + Safety + Implementation, 10 points each)

E5-1 (Free Tiny Direct Path): +3-5% expected, medium complexity, low risk
- ROI Score: 9/10 (largest target, proven pattern)
- Safety Score: 8/10 (single boundary, fail-fast)
- Implementation Score: 7/10 (medium effort)
- TOTAL: 24/30 → HIGHEST PRIORITY
E5-2 (Header Prefill at Refill): +1.5-2.0% expected, high complexity, medium risk
- ROI Score: 6/10 (2.59% target, diminishing returns from E4)
- Safety Score: 6/10 (metadata overhead, fallback needed)
- Implementation Score: 4/10 (high effort, slab tracking)
- TOTAL: 16/30 → MEDIUM PRIORITY
E5-3 (ENV Snapshot Shape): +0.5-1.0% expected, low complexity, low risk
- ROI Score: 4/10 (small target, branch hint uncertainty)
- Safety Score: 9/10 (pure shape change)
- Implementation Score: 9/10 (low effort)
- TOTAL: 22/30 → LOW PRIORITY (small expected gain despite the second-highest total; an easy win if E5-1 succeeds)
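
For reference, each total above is the sum of the three component scores. A small sketch that reproduces them, using a hypothetical `candidate_t` type:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one candidate; each score is out of 10. */
typedef struct {
    const char *name;
    int roi;
    int safety;
    int implementation;
} candidate_t;

/* Scores copied from the ranking above. */
static const candidate_t E5_CANDIDATES[] = {
    { "E5-1", 9, 8, 7 },
    { "E5-2", 6, 6, 4 },
    { "E5-3", 4, 9, 9 },
};

/* Total out of 30: a plain sum of the three scores. */
static int candidate_total(const candidate_t *c)
{
    assert(c != NULL);
    return c->roi + c->safety + c->implementation;
}
```

Sorting by total alone would place E5-3 ahead of E5-2, so the priority labels evidently also weigh the absolute expected gain.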

Decision: Proceed with E5-1 (Free Tiny Direct Path)

Rationale:

Largest target: 29.56% combined overhead (free wrapper + cold path)
Proven pattern: E4 success shows TLS consolidation works
Single boundary: Clear separation, fail-fast validation
Low risk: Header magic check is reliable, fallback is trivial
Box Theory compliant: L0 ENV gate, L1 hot/cold split, minimal counters

Expected Gain: +3-5% (conservative estimate)
Success Threshold: +1.0% (GO criteria)
Fallback Plan: If NO-GO, freeze and pursue E5-2 or E5-3

Step 3: E5-1 Implementation Plan (Next)

Design Document

File: docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md
Sections:
1. Problem Statement (29.56% overhead analysis)
2. Solution Architecture (wrapper-level Tiny direct path)
3. Box Theory Boundary (L0 ENV gate, L1 hot/cold split)
4. Safety Gates (header validation, fail-fast, fallback)
5. Integration Points (free() wrapper, counters)
6. A/B Test Plan (10-run Mixed, health check)

Implementation Files

ENV Box: core/box/free_tiny_direct_env_box.h/.c
- ENV gate: HAKMEM_FREE_TINY_DIRECT=0/1
- Lazy init, static cache
Stats Box: core/box/free_tiny_direct_stats_box.h
- Counters: direct_hit, direct_miss, invalid_header
Integration: core/box/hak_wrappers.inc.h
- Lines ~595-640 (free() wrapper hot path)
- New inline: free_tiny_fast_direct()

A/B Test Protocol

# Baseline (E5-1=0)
for i in {1..10}; do
  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FREE_TINY_DIRECT=0 \
    ./bench_random_mixed_hakmem 20000000 400 1
done

# Optimized (E5-1=1)
for i in {1..10}; do
  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FREE_TINY_DIRECT=1 \
    ./bench_random_mixed_hakmem 20000000 400 1
done

Success Criteria:

GO: Mean gain of +1.0% or more
NEUTRAL: Mean change strictly between -1.0% and +1.0%
NO-GO: Mean change of -1.0% or lower
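
The verdict rule can be written down as a small helper. The names are hypothetical. The thresholds are the ones stated above, with the boundary values themselves counted as GO and NO-GO respectively.

```c
#include <assert.h>

typedef enum { AB_GO, AB_NEUTRAL, AB_NO_GO } ab_verdict_t;

/* Relative change of the optimized mean over the baseline mean, in percent. */
static double ab_delta_percent(double baseline_mean, double optimized_mean)
{
    assert(baseline_mean > 0.0);
    return (optimized_mean - baseline_mean) / baseline_mean * 100.0;
}

/* Apply the success criteria to a mean change expressed in percent. */
static ab_verdict_t ab_classify(double delta_percent)
{
    if (delta_percent >= 1.0)
        return AB_GO;
    if (delta_percent <= -1.0)
        return AB_NO_GO;
    return AB_NEUTRAL;
}
```

Fed with the 10-run means from the commit message (44.38M and 45.87M ops/s), `ab_delta_percent` gives a gain of a little over 3.3%, which `ab_classify` maps to GO.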

Appendix: Perf Data Details

Free() Annotated Hot Lines

Top sample-hit lines (from perf annotate --stdio free):

8.01%: mov 0x14(%r15),%ebx (offset 0x1cd2d)
- Loading wrapper_env->wrap_shape from TLS
- Analysis: ENV snapshot field access overhead
7.73%: mov %rdi,%rbp (offset 0x1ccd1)
- Saving ptr argument to register
- Analysis: Register pressure in wrapper preamble
7.44%: mov 0x5b311(%rip),%edx (offset 0x1cd01)
- Loading cached ENV gate value (global)
- Analysis: Lazy init check overhead
6.98%: test %r11d,%r11d (offset 0x1ccf8)
- Testing ENV gate flag (TLS)
- Analysis: Branch condition for ENV snapshot path
7.62%: pop %r14 (offset 0x1ce6e)
- Epilogue register restore
- Analysis: Function call overhead (6+ register saves)

Total wrapper overhead from top 5 lines: ~38% of free() self% (37.78% summed) = ~8% of total runtime

Free_Tiny_Fast_Cold Analysis

7.89% self% breakdown:

tiny_route_for_class(): 1.95% (route determination)
Policy snapshot: 0.64%
Class dispatch: ~5% (remainder)

Cold path is triggered when:

Class != C7 (no ULTRA early-exit)
OR class requires route determination (C0-C6 in Mixed)

Opportunity: Bypass cold path entirely for Tiny by validating in wrapper.

Next Actions

✅ Step 1 Complete: Perf analysis → E5-1 selected
🔄 Step 3 Next: Implement E5-1 design + code
⏳ Step 4 Next: 10-run A/B test (Mixed)
⏳ Step 5 Next: Health check
⏳ Step 6 Next: Update CURRENT_TASK.md with results

Status: Ready to proceed with E5-1 implementation.
