Target: Consolidate free() wrapper overhead (29.56% combined)
- free() wrapper: 21.67% self%
- free_tiny_fast_cold(): 7.89% self%

Strategy: Single header check in wrapper → direct call to free_tiny_fast()
- Eliminates redundant header validation (validated twice before)
- Bypasses cold path routing for Tiny allocations
- High coverage: 48% of frees in Mixed workload are Tiny

Implementation:
- ENV gate: HAKMEM_FREE_TINY_DIRECT=0/1 (default 0)
- core/box/free_tiny_direct_env_box.h: ENV gate
- core/box/free_tiny_direct_stats_box.h: Stats counters
- core/box/hak_wrappers.inc.h: Wrapper integration (lines 593-625)

Safety gates:
- Page boundary guard ((ptr & 0xFFF) != 0)
- Tiny magic validation ((header & 0xF0) == 0xA0)
- Class bounds check (class_idx < 8)
- Fail-fast fallback to existing paths

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DIRECT=0): 44.38M ops/s (mean), 44.45M ops/s (median)
- Optimized (DIRECT=1): 45.87M ops/s (mean), 45.95M ops/s (median)
- Improvement: +3.35% mean, +3.36% median

Decision: GO (+3.35% >= +1.0% threshold)
- 3rd consecutive success with consolidation/deduplication pattern
- E4-1: +3.51%, E4-2: +21.83%, E5-1: +3.35%
- Health check: PASS (all profiles)

Phase 5 Cumulative:
- E4 Combined: +6.43%
- E5-1: +3.35%
- Estimated total: ~+10%

Deliverables:
- docs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-1 complete)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Phase 5 E5: Comprehensive Analysis & Implementation Report
Executive Summary
Date: 2025-12-14
Baseline: 47.34M ops/s (Mixed, E4-1+E4-2 ON, 20M iters, ws=400)
Status: Step 1 Complete - Perf Analysis & Priority Decision
Step 1: Perf Analysis - Free() Internal Breakdown
Test Configuration
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 999 -- \
./bench_random_mixed_hakmem 20000000 400 1
Results: 476 samples @ 999Hz, throughput 44.12M ops/s
Top Hotspots (self% >= 2.0%)
| Rank | Function | Self% | Analysis |
|---|---|---|---|
| 1 | free | 21.67% | Wrapper layer overhead (header check, ENV snapshot, gate dispatch) |
| 2 | tiny_alloc_gate_fast | 18.55% | Alloc gate (reduced from E4 baseline) |
| 3 | main | 17.44% | Benchmark driver |
| 4 | malloc | 12.27% | Wrapper layer (reduced from E4 baseline) |
| 5 | free_tiny_fast_cold | 7.89% | NEW: Cold path (route determination, policy snapshot) |
| 6 | unified_cache_push | 3.47% | TLS cache push operation |
| 7 | tiny_c7_ultra_alloc | 2.98% | C7 ULTRA alloc path |
| 8 | tiny_region_id_write_header | 2.59% | Header write tax (reduced from 6.97% in E4 baseline) |
| 9 | hakmem_env_snapshot_enabled | 2.57% | ENV snapshot gate overhead (reduced from 4.29%) |
Free Path Internal Breakdown
Total free() overhead: 21.67% (wrapper) + 7.89% (cold path) = 29.56%
Key Observations:
- Wrapper Layer (21.67% self%):
  - NULL check: ~0%
  - Header magic validation: ~8% of free() self%
    - Load header byte: movzbl -0x1(%rbp),%r15d (7.44% sample hit)
    - Magic comparison: cmp $0xa0,%r14b (0%)
    - Class extraction: and $0xf,%r12d (0%)
  - ENV snapshot calls: ~8% of free() self%
    - mov 0x5b311(%rip),%edx (7.73% + 6.98% sample hits)
    - Branch to cold init paths (0%)
  - TLS reads for gate dispatch: ~5% of free() self%
    - mov 0x14(%r15),%ebx (8.01% sample hit)
- Cold Path (7.89% self%):
  - tiny_route_for_class(): 1.95% self% (route determination)
  - Policy snapshot: 0.64% self%
  - Class-based dispatch overhead
- Header Write Tax (2.59% self%):
  - DOWN from 6.97% in E4 baseline (-62% reduction!)
  - Note: This is already significantly improved
  - Further optimization ROI is lower now
- ENV Snapshot Overhead (2.57% self%):
  - DOWN from 4.29% in E4 baseline (-40% reduction!)
  - Lazy init checks + TLS reads
  - Still visible but reduced impact
Critical Insight: Header Check Overhead
Annotated perf shows:
- Lines with 7-8% sample hits are concentrated in:
  - Header load: movzbl -0x1(%rbp),%r15d (line 0x1cd6d)
  - Multiple TLS/global reads for ENV gates (lines 0x1ccd1, 0x1ccf8, 0x1cd01, 0x1cd2d)
These lines represent:
- Header validation overhead: ~7-8% of free() self%
- ENV gate consolidation overhead: ~8% of free() self% (multiple checks)
- Total wrapper tax: ~16% of 21.67% = ~3.5% of total runtime
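The conversion from "share of free() self time" to "share of total runtime" used above is a plain product of two percentages. A minimal sketch of that arithmetic follows; the helper name `share_of_total` is hypothetical and not part of the hakmem sources.

```c
#include <assert.h>

/* Convert "X% of one function's self time" into "% of total runtime".
 * Both inputs and the result are expressed in percent units.
 * Hypothetical helper, for illustration only. */
static double share_of_total(double pct_of_self, double self_pct_of_total)
{
    assert(pct_of_self >= 0.0 && self_pct_of_total >= 0.0);
    return pct_of_self * self_pct_of_total / 100.0;
}
```

With the figures from this section, `share_of_total(16.0, 21.67)` evaluates to roughly 3.47, which is the "~3.5% of total runtime" quoted for the wrapper tax.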
Branch Prediction Analysis
From perf annotate:
- Page boundary check: test $0xfff,%ebp (0% hit - well predicted)
- Magic comparison: cmp $0xa0,%r14b (0% hit - well predicted)
- ENV snapshot checks: Multiple branches with 0% hit (well predicted)
- No significant branch misprediction issues
Step 2: E5 Candidate Priority Decision
E5-1: Free() Tiny Direct Path (HIGHEST PRIORITY)
Target: 21.67% (free wrapper) + 7.89% (free_tiny_fast_cold) = 29.56% total
Hypothesis:
- Current free() wrapper does header validation + ENV snapshot + gate dispatch
- For Tiny allocations (48% of frees in Mixed), this is redundant work:
  - Header is validated in wrapper (magic check)
  - Header is re-extracted in free_tiny_fast()
  - Route determination happens in free_tiny_fast_cold()
- Opportunity: Single-check Tiny direct path in wrapper
Strategy (Box Theory):
- L0 SplitBox: HAKMEM_FREE_TINY_DIRECT=0/1 (default 0)
- L1 HotBox: Wrapper-level Tiny fast path
  - Check: (header & 0xF0) == 0xA0 (Tiny magic)
  - Extract: class_idx = header & 0x0F
  - Validate: class_idx < 8 (fail-fast)
  - Direct call: free_tiny_fast_direct(ptr, base, class_idx)
    - Skips: ENV snapshot consolidation in wrapper
    - Skips: Cold path route determination
    - Skips: Re-validation of header
- L1 ColdBox: Existing fallback for Mid/Pool/Large/invalid
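The strategy above can be sketched as a single wrapper-level check. This is an illustrative sketch, not the actual hakmem code. It makes four assumptions: the header byte sits at `ptr[-1]`, `base` is `ptr - 1`, `free_tiny_fast_direct()` is replaced by a recording stub, and a page-boundary rejection is counted as `direct_miss`. The function name `free_try_tiny_direct` and the stats struct are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TINY_MAGIC_MASK  0xF0u
#define TINY_MAGIC       0xA0u
#define TINY_CLASS_MASK  0x0Fu
#define TINY_NUM_CLASSES 8u

/* Hypothetical counters mirroring direct_hit / direct_miss / invalid_header. */
typedef struct {
    unsigned long direct_hit;
    unsigned long direct_miss;
    unsigned long invalid_header;
} tiny_direct_stats_t;

static tiny_direct_stats_t g_tiny_direct_stats;

/* Stub standing in for the real fast path: it only records its arguments. */
static void    *g_last_base;
static unsigned g_last_class;

static void free_tiny_fast_direct(void *ptr, void *base, unsigned class_idx)
{
    (void)ptr;
    g_last_base  = base;
    g_last_class = class_idx;
}

/* Returns true if ptr was consumed by the Tiny direct path.
 * On false the caller falls through to the existing free() logic. */
static bool free_try_tiny_direct(void *ptr)
{
    assert(ptr != NULL);               /* NULL is handled earlier in free() */
    uintptr_t addr = (uintptr_t)ptr;

    /* Page boundary guard: ptr[-1] would lie on the previous page. */
    if ((addr & 0xFFFu) == 0) {
        g_tiny_direct_stats.direct_miss++;
        return false;
    }

    uint8_t header = *((const uint8_t *)ptr - 1);

    /* Single magic check: the upper nibble tags a Tiny block. */
    if ((header & TINY_MAGIC_MASK) != TINY_MAGIC) {
        g_tiny_direct_stats.direct_miss++;
        return false;
    }

    /* Fail-fast class bounds check. */
    unsigned class_idx = header & TINY_CLASS_MASK;
    if (class_idx >= TINY_NUM_CLASSES) {
        g_tiny_direct_stats.invalid_header++;
        return false;
    }

    g_tiny_direct_stats.direct_hit++;
    free_tiny_fast_direct(ptr, (uint8_t *)ptr - 1, class_idx);
    return true;
}
```

In the wrapper this would sit behind the `HAKMEM_FREE_TINY_DIRECT` gate, with every `false` return continuing into the unchanged Mid/Pool/Large path.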
Expected ROI:
- Baseline 29.56% overhead → Target 18-20% overhead (30-40% reduction)
- Translation: +3-5% throughput improvement
- Risk: Low (single boundary, fail-fast validation)
Implementation Complexity: Medium
- New function: free_tiny_fast_direct() (inline, thin wrapper)
- Integration: 1 boundary in free() wrapper
- Counters: 3 (direct_hit, direct_miss, invalid_header)
E5-2: Header Write to Refill Boundary (MEDIUM PRIORITY)
Target: 2.59% (tiny_region_id_write_header)
Hypothesis:
- Header write happens on EVERY alloc (hot path tax)
- Blocks are reused within same class → header is stable
- Opportunity: Write header once at refill boundary (cold path)
Strategy (Box Theory):
- L0 SplitBox: HAKMEM_TINY_HEADER_PREFILL=0/1 (default 0)
- L2 HeaderPrefillBox: Refill boundary header initialization
  - When: Slab refill (cold path, ~64 blocks at once)
  - Where: tiny_unified_cache_refill() or similar
  - Action: Prefill all headers in slab page
- L1 HotBox: Alloc path skips header write
  - Condition: Check "prefill done" flag per slab
  - Fast path: Return base + 1 directly (no header write)
  - Fallback: Traditional tiny_region_id_write_header() if not prefilled
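A rough sketch of the prefill idea, under stated assumptions: the slab descriptor `tiny_slab_t`, its fields, and the function names below are hypothetical, since the real slab metadata layout is not shown in this report. The header encoding (0xA0 magic in the upper nibble, class index in the lower nibble) follows the checks described for E5-1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TINY_MAGIC 0xA0u

/* Hypothetical slab descriptor, for illustration only. */
typedef struct {
    uint8_t *mem;              /* start of slab memory                  */
    size_t   block_size;       /* stride, including the 1-byte header   */
    size_t   block_count;
    unsigned class_idx;
    bool     header_prefilled; /* the per-slab "prefill done" flag      */
} tiny_slab_t;

static inline void tiny_write_header(uint8_t *base, unsigned class_idx)
{
    *base = (uint8_t)(TINY_MAGIC | (class_idx & 0x0Fu));
}

/* Cold path: run once per slab refill, writes every header in the slab. */
static void tiny_slab_prefill_headers(tiny_slab_t *slab)
{
    assert(slab != NULL && slab->mem != NULL);
    for (size_t i = 0; i < slab->block_count; i++)
        tiny_write_header(slab->mem + i * slab->block_size, slab->class_idx);
    slab->header_prefilled = true;
}

/* Hot path: hand out block i and return the user pointer (base + 1).
 * The header store is skipped when the slab was prefilled. */
static inline void *tiny_alloc_from_slab(tiny_slab_t *slab, size_t i)
{
    uint8_t *base = slab->mem + i * slab->block_size;
    if (!slab->header_prefilled)
        tiny_write_header(base, slab->class_idx);   /* traditional fallback */
    return base + 1;
}
```

The sketch relies on the hypothesis stated above that headers stay stable while blocks are reused within a class. If anything overwrites a header byte after prefill, the hot path will not repair it, which is part of why the risk is rated Medium.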
Expected ROI:
- Baseline 2.59% → Target 0.5-1.0% (60-80% reduction)
- Translation: +1.5-2.0% throughput improvement
- Risk: Medium (requires slab-level "prefilled" tracking)
Implementation Complexity: High
- Metadata: Per-slab "header_prefilled" flag (1 bit per slab)
- Integration: 2 boundaries (refill + alloc hot path)
- Safety: Fallback to hot path write if prefill not done
Note: Phase 1 A3 showed always_inline on header write was NO-GO (-4% on Mixed).
This approach is different: eliminate writes, not inline them.
E5-3: ENV Snapshot Branch Shape (LOW PRIORITY)
Target: 2.57% (hakmem_env_snapshot_enabled)
Hypothesis:
- MIXED_TINYV3_C7_SAFE now defaults to HAKMEM_ENV_SNAPSHOT=1
- Current branch hint: __builtin_expect(..., 0) (expects DISABLED)
- Reality: Snapshot is ENABLED → hint is backwards
- Opportunity: Flip branch shape to match reality
Strategy (Box Theory):
- L0 SplitBox: HAKMEM_ENV_SNAPSHOT_SHAPE=0/1 (default 0)
- L1 ShapeBox: Branch restructuring
  - Old: if (__builtin_expect(!enabled, 1)) { legacy } else { snapshot }
  - New: if (__builtin_expect(enabled, 1)) { snapshot } else { legacy }
  - Or: Move legacy to noinline,cold helper
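The "New" shape combined with the out-of-line legacy helper might look like the sketch below. All names (`HAK_LIKELY`, `HAK_COLD`, `env_read*`, the `PATH_*` values) are placeholders invented for this illustration; only `__builtin_expect` and the `noinline`/`cold` attributes are real GCC/Clang features.

```c
#include <assert.h>

#if defined(__GNUC__) || defined(__clang__)
#  define HAK_LIKELY(x) __builtin_expect(!!(x), 1)
#  define HAK_COLD      __attribute__((noinline, cold))
#else
#  define HAK_LIKELY(x) (x)
#  define HAK_COLD
#endif

enum { PATH_LEGACY = 0, PATH_SNAPSHOT = 1 };

/* Legacy path moved out of line so the hot path stays compact. */
static HAK_COLD int env_read_legacy(void)
{
    return PATH_LEGACY;
}

static inline int env_read_snapshot(void)
{
    return PATH_SNAPSHOT;
}

/* Hint now matches the profile default: snapshot enabled is the likely case. */
static inline int env_read(int snapshot_enabled)
{
    assert(snapshot_enabled == 0 || snapshot_enabled == 1);
    if (HAK_LIKELY(snapshot_enabled))
        return env_read_snapshot();
    return env_read_legacy();
}
```

The hint changes only code layout, not results, so both branches must keep returning the same values as before. Whether it helps has to be confirmed by A/B measurement.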
Expected ROI:
- Baseline 2.57% → Target 1.5-2.0% (20-40% reduction)
- Translation: +0.5-1.0% throughput improvement
- Risk: Very low (branch shape change only, no logic change)
Implementation Complexity: Low
- Changes: 5-8 call sites in hot paths
- Integration: Flip __builtin_expect() polarity
- Counters: None needed (perf annotation will show)
Note: This is a "polish" optimization. E3-4 showed branch hints can backfire. Approach carefully with A/B testing.
Step 2 Conclusion: Priority Ranking
Priority Order (ROI × Safety × Complexity)
- E5-1 (Free Tiny Direct Path): +3-5% expected, medium complexity, low risk
  - ROI Score: 9/10 (largest target, proven pattern)
  - Safety Score: 8/10 (single boundary, fail-fast)
  - Implementation Score: 7/10 (medium effort)
  - TOTAL: 24/30 → HIGHEST PRIORITY
- E5-2 (Header Prefill at Refill): +1.5-2.0% expected, high complexity, medium risk
  - ROI Score: 6/10 (2.59% target, diminishing returns from E4)
  - Safety Score: 6/10 (metadata overhead, fallback needed)
  - Implementation Score: 4/10 (high effort, slab tracking)
  - TOTAL: 16/30 → MEDIUM PRIORITY
- E5-3 (ENV Snapshot Shape): +0.5-1.0% expected, low complexity, low risk
  - ROI Score: 4/10 (small target, branch hint uncertainty)
  - Safety Score: 9/10 (pure shape change)
  - Implementation Score: 9/10 (low effort)
  - TOTAL: 22/30 → LOW PRIORITY (but easy win if E5-1 succeeds)
Decision: Proceed with E5-1 (Free Tiny Direct Path)
Rationale:
- Largest target: 29.56% combined overhead (free wrapper + cold path)
- Proven pattern: E4 success shows TLS consolidation works
- Single boundary: Clear separation, fail-fast validation
- Low risk: Header magic check is reliable, fallback is trivial
- Box Theory compliant: L0 ENV gate, L1 hot/cold split, minimal counters
Expected Gain: +3-5% (conservative estimate)
Success Threshold: +1.0% (GO criteria)
Fallback Plan: If NO-GO, freeze and pursue E5-2 or E5-3
Step 3: E5-1 Implementation Plan (Next)
Design Document
- File: docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md
- Sections:
  - Problem Statement (29.56% overhead analysis)
  - Solution Architecture (wrapper-level Tiny direct path)
  - Box Theory Boundary (L0 ENV gate, L1 hot/cold split)
  - Safety Gates (header validation, fail-fast, fallback)
  - Integration Points (free() wrapper, counters)
  - A/B Test Plan (10-run Mixed, health check)
Implementation Files
- ENV Box: core/box/free_tiny_direct_env_box.h/.c
  - ENV gate: HAKMEM_FREE_TINY_DIRECT=0/1
  - Lazy init, static cache
- Stats Box: core/box/free_tiny_direct_stats_box.h
  - Counters: direct_hit, direct_miss, invalid_header
- Integration: core/box/hak_wrappers.inc.h
  - Lines ~595-640 (free() wrapper hot path)
  - New inline: free_tiny_fast_direct()
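The "lazy init, static cache" pattern for the ENV Box could look like the following sketch. The function and variable names are hypothetical, as is the parsing rule (only a leading `1` enables the gate). The env variable name and its default of 0 come from this report. The unsynchronized cache is a simplification: a production version would need to account for concurrent first calls.

```c
#include <assert.h>
#include <stdlib.h>

/* Cached gate value: -1 = not read yet, 0 = disabled, 1 = enabled. */
static int g_free_tiny_direct = -1;

/* Unset, empty, or anything not starting with '1' keeps the default (0). */
static int free_tiny_direct_parse(const char *s)
{
    return (s != NULL && s[0] == '1') ? 1 : 0;
}

/* Reads the environment on first use, then serves the cached value. */
static inline int free_tiny_direct_enabled(void)
{
    int v = g_free_tiny_direct;
    if (v < 0) {
        v = free_tiny_direct_parse(getenv("HAKMEM_FREE_TINY_DIRECT"));
        g_free_tiny_direct = v;
    }
    assert(v == 0 || v == 1);
    return v;
}
```

After the first call, the gate costs one load and one compare per free(), which matters because ENV gate reads already show up in the wrapper's hot lines.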
A/B Test Protocol
# Baseline (E5-1=0)
for i in {1..10}; do
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FREE_TINY_DIRECT=0 \
./bench_random_mixed_hakmem 20000000 400 1
done
# Optimized (E5-1=1)
for i in {1..10}; do
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FREE_TINY_DIRECT=1 \
./bench_random_mixed_hakmem 20000000 400 1
done
Success Criteria:
- GO: Mean +1.0% or higher
- NEUTRAL: Mean change within ±1.0%
- NO-GO: Mean -1.0% or lower
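The success criteria reduce to a relative-gain computation plus two thresholds. A small sketch follows, with hypothetical names (`ab_gain_pct`, `ab_classify`, `ab_verdict_t`). The ±1.0% thresholds are the ones listed above.

```c
#include <assert.h>

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_verdict_t;

/* Relative change of the optimized mean versus the baseline mean, in percent. */
static double ab_gain_pct(double baseline_mean, double optimized_mean)
{
    assert(baseline_mean > 0.0);
    return (optimized_mean - baseline_mean) / baseline_mean * 100.0;
}

/* GO at +1.0% or higher, NO-GO at -1.0% or lower, NEUTRAL in between. */
static ab_verdict_t ab_classify(double gain_pct)
{
    if (gain_pct >= 1.0)
        return AB_GO;
    if (gain_pct <= -1.0)
        return AB_NO_GO;
    return AB_NEUTRAL;
}
```

Each argument is the mean throughput over one 10-run loop, for example in M ops/s. The units cancel, so any consistent unit works.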
Appendix: Perf Data Details
Free() Annotated Hot Lines
Top sample-hit lines (from perf annotate --stdio free):
- 8.01%: mov 0x14(%r15),%ebx (offset 0x1cd2d)
  - Loading wrapper_env->wrap_shape from TLS
  - Analysis: ENV snapshot field access overhead
- 7.73%: mov %rdi,%rbp (offset 0x1ccd1)
  - Saving ptr argument to register
  - Analysis: Register pressure in wrapper preamble
- 7.44%: mov 0x5b311(%rip),%edx (offset 0x1cd01)
  - Loading cached ENV gate value (global)
  - Analysis: Lazy init check overhead
- 6.98%: test %r11d,%r11d (offset 0x1ccf8)
  - Testing ENV gate flag (TLS)
  - Analysis: Branch condition for ENV snapshot path
- 7.62%: pop %r14 (offset 0x1ce6e)
  - Epilogue register restore
  - Analysis: Function call overhead (6+ register saves)
Total wrapper overhead from top 5 lines: ~38% of free() self% (8.01 + 7.73 + 7.44 + 6.98 + 7.62 = 37.78) = ~8% of total runtime
Free_Tiny_Fast_Cold Analysis
7.89% self% breakdown:
- tiny_route_for_class(): 1.95% (route determination)
- Policy snapshot: 0.64%
- Class dispatch: ~5% (remainder)
Cold path is triggered when:
- Class != C7 (no ULTRA early-exit)
- OR class requires route determination (C0-C6 in Mixed)
Opportunity: Bypass cold path entirely for Tiny by validating in wrapper.
Next Actions
- ✅ Step 1 Complete: Perf analysis → E5-1 selected
- 🔄 Step 3 Next: Implement E5-1 design + code
- ⏳ Step 4 Next: 10-run A/B test (Mixed)
- ⏳ Step 5 Next: Health check
- ⏳ Step 6 Next: Update CURRENT_TASK.md with results
Status: Ready to proceed with E5-1 implementation.