Moe Charm (CI) 8875132134 Phase 5 E5-1: Free Tiny Direct Path (+3.35% GO)
Target: Consolidate free() wrapper overhead (29.56% combined)
- free() wrapper: 21.67% self%
- free_tiny_fast_cold(): 7.89% self%

Strategy: Single header check in wrapper → direct call to free_tiny_fast()
- Eliminates redundant header validation (validated twice before)
- Bypasses cold path routing for Tiny allocations
- High coverage: 48% of frees in Mixed workload are Tiny

Implementation:
- ENV gate: HAKMEM_FREE_TINY_DIRECT=0/1 (default 0)
- core/box/free_tiny_direct_env_box.h: ENV gate
- core/box/free_tiny_direct_stats_box.h: Stats counters
- core/box/hak_wrappers.inc.h: Wrapper integration (lines 593-625)

Safety gates:
- Page boundary guard ((ptr & 0xFFF) != 0)
- Tiny magic validation ((header & 0xF0) == 0xA0)
- Class bounds check (class_idx < 8)
- Fail-fast fallback to existing paths

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DIRECT=0): 44.38M ops/s (mean), 44.45M ops/s (median)
- Optimized (DIRECT=1): 45.87M ops/s (mean), 45.95M ops/s (median)
- Improvement: +3.35% mean, +3.36% median

Decision: GO (+3.35% >= +1.0% threshold)
- 3rd consecutive success with consolidation/deduplication pattern
- E4-1: +3.51%, E4-2: +21.83%, E5-1: +3.35%
- Health check: PASS (all profiles)

Phase 5 Cumulative:
- E4 Combined: +6.43%
- E5-1: +3.35%
- Estimated total: ~+10%

Deliverables:
- docs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-1 complete)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:52:32 +09:00


# Phase 5 E5: Comprehensive Analysis & Implementation Report
## Executive Summary
**Date**: 2025-12-14

**Baseline**: 47.34M ops/s (Mixed, E4-1+E4-2 ON, 20M iters, ws=400)

**Status**: Step 1 Complete - Perf Analysis & Priority Decision

---
## Step 1: Perf Analysis - Free() Internal Breakdown
### Test Configuration
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 999 -- \
./bench_random_mixed_hakmem 20000000 400 1
```
**Results**: 476 samples @ 999Hz, throughput 44.12M ops/s
### Top Hotspots (self% >= 2.0%)
| Rank | Function | Self% | Analysis |
|------|----------|-------|----------|
| 1 | `free` | 21.67% | Wrapper layer overhead (header check, ENV snapshot, gate dispatch) |
| 2 | `tiny_alloc_gate_fast` | 18.55% | Alloc gate (reduced from E4 baseline) |
| 3 | `main` | 17.44% | Benchmark driver |
| 4 | `malloc` | 12.27% | Wrapper layer (reduced from E4 baseline) |
| 5 | `free_tiny_fast_cold` | 7.89% | **NEW**: Cold path (route determination, policy snapshot) |
| 6 | `unified_cache_push` | 3.47% | TLS cache push operation |
| 7 | `tiny_c7_ultra_alloc` | 2.98% | C7 ULTRA alloc path |
| 8 | `tiny_region_id_write_header` | 2.59% | **Header write tax** (reduced from 6.97% in E4 baseline) |
| 9 | `hakmem_env_snapshot_enabled` | 2.57% | **ENV snapshot gate overhead** (reduced from 4.29%) |
### Free Path Internal Breakdown
**Total free() overhead**: 21.67% (wrapper) + 7.89% (cold path) = **29.56%**
**Key Observations**:
1. **Wrapper Layer (21.67% self%)**:
   - NULL check: ~0%
   - Header magic validation: **~8% of free() self%**
     - Load header byte: `movzbl -0x1(%rbp),%r15d` (7.44% sample hit)
     - Magic comparison: `cmp $0xa0,%r14b` (0%)
     - Class extraction: `and $0xf,%r12d` (0%)
   - ENV snapshot calls: **~8% of free() self%**
     - `mov 0x5b311(%rip),%edx` (7.73% + 6.98% sample hits)
     - Branch to cold init paths (0%)
   - TLS reads for gate dispatch: **~5% of free() self%**
     - `mov 0x14(%r15),%ebx` (8.01% sample hit)
2. **Cold Path (7.89% self%)**:
   - `tiny_route_for_class()`: 1.95% self% (route determination)
   - Policy snapshot: 0.64% self%
   - Class-based dispatch overhead
3. **Header Write Tax (2.59% self%)**:
   - **Down from 6.97% in the E4 baseline** (62% reduction)
   - Already significantly improved, so further optimization has lower ROI
4. **ENV Snapshot Overhead (2.57% self%)**:
   - **Down from 4.29% in the E4 baseline** (40% reduction)
   - Lazy init checks + TLS reads
   - Still visible, but with reduced impact
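
The reduction figures above are relative changes in self%. A minimal sketch of that arithmetic, assuming a hypothetical helper name `relative_reduction_pct()` that is not part of HAKMEM:

```c
#include <assert.h>

/* Relative reduction, in percent, when a self% figure drops
 * from `before` to `after`. Hypothetical helper for illustration. */
static double relative_reduction_pct(double before, double after)
{
    assert(before > 0.0);  /* a zero baseline has no meaningful relative change */
    return (before - after) / before * 100.0;
}
```

With the values reported here, `relative_reduction_pct(6.97, 2.59)` gives roughly 62.8 and `relative_reduction_pct(4.29, 2.57)` gives roughly 40.1.
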
### Critical Insight: Header Check Overhead
**Annotated perf shows**:
- Lines with 7-8% sample hits are concentrated in:
  1. **Header load**: `movzbl -0x1(%rbp),%r15d` (offset 0x1cd6d)
  2. **Multiple TLS/global reads** for ENV gates (offsets 0x1ccd1, 0x1ccf8, 0x1cd01, 0x1cd2d)
**These lines represent**:
- **Header validation overhead**: ~7-8% of free() self%
- **ENV gate consolidation overhead**: ~8% of free() self% (multiple checks)
- **Total wrapper tax**: ~16% of 21.67% = **~3.5% of total runtime**
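
The "wrapper tax" conversion is a share of `free()`'s own samples scaled by `free()`'s self%. A small sketch of that step, with `share_of_total_pct()` as a hypothetical name used only for illustration:

```c
#include <assert.h>

/* Convert a share of one function's own samples into a share of total runtime.
 *   fraction_of_self: portion of the function's samples (0.0 .. 1.0)
 *   self_pct:         the function's self% of all samples            */
static double share_of_total_pct(double fraction_of_self, double self_pct)
{
    assert(fraction_of_self >= 0.0 && fraction_of_self <= 1.0);
    return fraction_of_self * self_pct;
}
```

For the figures above, `share_of_total_pct(0.16, 21.67)` is about 3.47, which is the ~3.5% of total runtime quoted.
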
### Branch Prediction Analysis
**From perf annotate**:
- Page boundary check: `test $0xfff,%ebp` (0% hit - well predicted)
- Magic comparison: `cmp $0xa0,%r14b` (0% hit - well predicted)
- ENV snapshot checks: Multiple branches with 0% hit (well predicted)
- **No significant branch misprediction issues**
---
## Step 2: E5 Candidate Priority Decision
### E5-1: Free() Tiny Direct Path (HIGHEST PRIORITY)
**Target**: 21.67% (free wrapper) + 7.89% (free_tiny_fast_cold) = **29.56% total**
**Hypothesis**:
- Current free() wrapper does header validation + ENV snapshot + gate dispatch
- For Tiny allocations (48% of frees in Mixed), this is redundant work:
  1. Header is validated in wrapper (magic check)
  2. Header is re-extracted in `free_tiny_fast()`
  3. Route determination happens in `free_tiny_fast_cold()`
- **Opportunity**: Single-check Tiny direct path in wrapper
**Strategy** (Box Theory):
- **L0 SplitBox**: `HAKMEM_FREE_TINY_DIRECT=0/1` (default 0)
- **L1 HotBox**: Wrapper-level Tiny fast path
  - Check: `(header & 0xF0) == 0xA0` (Tiny magic)
  - Extract: `class_idx = header & 0x0F`
  - Validate: `class_idx < 8` (fail-fast)
  - Direct call: `free_tiny_fast_direct(ptr, base, class_idx)`
  - Skips: ENV snapshot consolidation in wrapper
  - Skips: Cold path route determination
  - Skips: Re-validation of header
- **L1 ColdBox**: Existing fallback for Mid/Pool/Large/invalid
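
A minimal sketch of the L1 HotBox check, assuming the 1-byte header sits at `ptr - 1` (as the `movzbl -0x1(%rbp)` load suggests) and using the page-boundary gate `(ptr & 0xFFF) != 0` listed in the commit summary. `tiny_direct_class_of()` and the `TINY_*` macros are hypothetical names for illustration, not HAKMEM identifiers; a caller would pass a non-negative result on to the direct free call and treat `-1` as "use the existing path".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TINY_MAGIC_MASK   0xF0u  /* upper nibble carries the Tiny magic */
#define TINY_MAGIC_VALUE  0xA0u
#define TINY_CLASS_MASK   0x0Fu  /* lower nibble carries the class index */
#define TINY_CLASS_COUNT  8u

/* Returns the Tiny class index (0..7) if ptr qualifies for the direct path,
 * or -1 so the caller falls back to the existing free path. */
static int tiny_direct_class_of(const void *ptr)
{
    assert(ptr != NULL);  /* assumption: NULL is handled earlier in the wrapper */

    /* Page boundary guard. Assumed rationale: for a page-aligned ptr the
     * byte at ptr - 1 would lie on the previous page. */
    if (((uintptr_t)ptr & 0xFFFu) == 0) {
        return -1;
    }

    uint8_t header = *((const uint8_t *)ptr - 1);
    if ((header & TINY_MAGIC_MASK) != TINY_MAGIC_VALUE) {
        return -1;  /* not a Tiny block: Mid/Pool/Large/invalid */
    }

    unsigned class_idx = header & TINY_CLASS_MASK;
    if (class_idx >= TINY_CLASS_COUNT) {
        return -1;  /* fail fast on an out-of-range class */
    }
    return (int)class_idx;
}
```
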
**Expected ROI**:
- **Baseline 29.56% overhead** → **Target 18-20% overhead** (30-40% reduction)
- **Translation**: +3-5% throughput improvement
- **Risk**: Low (single boundary, fail-fast validation)
**Implementation Complexity**: Medium
- New function: `free_tiny_fast_direct()` (inline, thin wrapper)
- Integration: 1 boundary in `free()` wrapper
- Counters: 3 (direct_hit, direct_miss, invalid_header)
### E5-2: Header Write to Refill Boundary (MEDIUM PRIORITY)
**Target**: 2.59% (tiny_region_id_write_header)
**Hypothesis**:
- Header write happens on EVERY alloc (hot path tax)
- Blocks are reused within same class → header is stable
- **Opportunity**: Write header once at refill boundary (cold path)
**Strategy** (Box Theory):
- **L0 SplitBox**: `HAKMEM_TINY_HEADER_PREFILL=0/1` (default 0)
- **L2 HeaderPrefillBox**: Refill boundary header initialization
  - When: Slab refill (cold path, ~64 blocks at once)
  - Where: `tiny_unified_cache_refill()` or similar
  - Action: Prefill all headers in slab page
- **L1 HotBox**: Alloc path skips header write
  - Condition: Check "prefill done" flag per slab
  - Fast path: Return `base + 1` directly (no header write)
  - Fallback: Traditional `tiny_region_id_write_header()` if not prefilled
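
The prefill idea can be sketched as follows. This is an illustration under stated assumptions, not the HAKMEM layout: `demo_slab_t` and both functions are hypothetical, and the header byte is assumed to be `0xA0 | class_idx` stored at the start of each block, matching the magic/class split described for E5-1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slab descriptor; the real metadata is not shown in this report. */
typedef struct {
    uint8_t *base;            /* start of the slab's block storage */
    size_t   block_stride;    /* bytes per block, including the 1-byte header */
    size_t   block_count;
    uint8_t  class_idx;       /* Tiny class, 0..7 */
    bool     header_prefilled;
} demo_slab_t;

/* Cold path: write every header once when the slab is refilled. */
static void demo_slab_prefill_headers(demo_slab_t *slab)
{
    assert(slab->class_idx < 8);
    for (size_t i = 0; i < slab->block_count; i++) {
        slab->base[i * slab->block_stride] = (uint8_t)(0xA0u | slab->class_idx);
    }
    slab->header_prefilled = true;
}

/* Hot path: return the user pointer for block i; the header is written
 * here only when the slab has not been prefilled (fallback). */
static void *demo_slab_user_ptr(demo_slab_t *slab, size_t i)
{
    assert(i < slab->block_count);
    uint8_t *block = slab->base + i * slab->block_stride;
    if (!slab->header_prefilled) {
        block[0] = (uint8_t)(0xA0u | slab->class_idx);
    }
    return block + 1;  /* user data starts after the header byte */
}
```
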
**Expected ROI**:
- **Baseline 2.59%** → **Target 0.5-1.0%** (60-80% reduction)
- **Translation**: +1.5-2.0% throughput improvement
- **Risk**: Medium (requires slab-level "prefilled" tracking)
**Implementation Complexity**: High
- Metadata: Per-slab "header_prefilled" flag (1 bit per slab)
- Integration: 2 boundaries (refill + alloc hot path)
- Safety: Fallback to hot path write if prefill not done
**Note**: Phase 1 A3 showed `always_inline` on header write was **NO-GO** (-4% on Mixed).
This approach is different: **eliminate writes**, not inline them.
### E5-3: ENV Snapshot Branch Shape (LOW PRIORITY)
**Target**: 2.57% (hakmem_env_snapshot_enabled)
**Hypothesis**:
- `MIXED_TINYV3_C7_SAFE` now defaults to `HAKMEM_ENV_SNAPSHOT=1`
- Current branch hint: `__builtin_expect(..., 0)` (expects DISABLED)
- Reality: Snapshot is ENABLED → hint is backwards
- **Opportunity**: Flip branch shape to match reality
**Strategy** (Box Theory):
- **L0 SplitBox**: `HAKMEM_ENV_SNAPSHOT_SHAPE=0/1` (default 0)
- **L1 ShapeBox**: Branch restructuring
  - Old: `if (__builtin_expect(!enabled, 1)) { legacy } else { snapshot }`
  - New: `if (__builtin_expect(enabled, 1)) { snapshot } else { legacy }`
  - Or: Move legacy to `noinline,cold` helper
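
A sketch of the "New" shape combined with the `noinline,cold` helper option, assuming GCC or Clang (both provide `__builtin_expect` and these attributes). The `demo_*` names and the arithmetic inside each path are placeholders for illustration only.

```c
#include <assert.h>

#define DEMO_LIKELY(x) __builtin_expect(!!(x), 1)

/* Legacy path moved out of line so it stays off the hot instruction stream. */
__attribute__((noinline, cold))
static int demo_legacy_path(int value)
{
    return value - 1;  /* placeholder work */
}

static inline int demo_snapshot_path(int value)
{
    return value + 1;  /* placeholder work */
}

/* New shape: the enabled (snapshot) case is the predicted branch. */
static int demo_dispatch(int snapshot_enabled, int value)
{
    assert(snapshot_enabled == 0 || snapshot_enabled == 1);
    if (DEMO_LIKELY(snapshot_enabled)) {
        return demo_snapshot_path(value);
    }
    return demo_legacy_path(value);
}
```

The hint changes only code layout, not results, so both polarities must return the same values; any gain has to be confirmed by measurement.
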
**Expected ROI**:
- **Baseline 2.57%** → **Target 1.5-2.0%** (20-40% reduction)
- **Translation**: +0.5-1.0% throughput improvement
- **Risk**: Very low (branch shape change only, no logic change)
**Implementation Complexity**: Low
- Changes: 5-8 call sites in hot paths
- Integration: Flip `__builtin_expect()` polarity
- Counters: None needed (perf annotation will show)
**Note**: This is a "polish" optimization. E3-4 showed branch hints can backfire.
Approach carefully with A/B testing.

---
## Step 2 Conclusion: Priority Ranking
### Priority Order (ROI × Safety × Complexity)
1. **E5-1 (Free Tiny Direct Path)**: +3-5% expected, medium complexity, low risk
   - **ROI Score**: 9/10 (largest target, proven pattern)
   - **Safety Score**: 8/10 (single boundary, fail-fast)
   - **Implementation Score**: 7/10 (medium effort)
   - **TOTAL**: 24/30 → **HIGHEST PRIORITY**
2. **E5-2 (Header Prefill at Refill)**: +1.5-2.0% expected, high complexity, medium risk
   - **ROI Score**: 6/10 (2.59% target, diminishing returns from E4)
   - **Safety Score**: 6/10 (metadata overhead, fallback needed)
   - **Implementation Score**: 4/10 (high effort, slab tracking)
   - **TOTAL**: 16/30 → **MEDIUM PRIORITY**
3. **E5-3 (ENV Snapshot Shape)**: +0.5-1.0% expected, low complexity, low risk
   - **ROI Score**: 4/10 (small target, branch hint uncertainty)
   - **Safety Score**: 9/10 (pure shape change)
   - **Implementation Score**: 9/10 (low effort)
   - **TOTAL**: 22/30 → **LOW PRIORITY** (ranked below E5-2 on expected gain despite the higher total; an easy win if E5-1 succeeds)
### Decision: Proceed with E5-1 (Free Tiny Direct Path)
**Rationale**:
1. **Largest target**: 29.56% combined overhead (free wrapper + cold path)
2. **Proven pattern**: E4 success shows TLS consolidation works
3. **Single boundary**: Clear separation, fail-fast validation
4. **Low risk**: Header magic check is reliable, fallback is trivial
5. **Box Theory compliant**: L0 ENV gate, L1 hot/cold split, minimal counters
**Expected Gain**: +3-5% (conservative estimate)

**Success Threshold**: +1.0% (GO criteria)

**Fallback Plan**: If NO-GO, freeze and pursue E5-2 or E5-3

---
## Step 3: E5-1 Implementation Plan (Next)
### Design Document
- File: `docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md`
- Sections:
  1. Problem Statement (29.56% overhead analysis)
  2. Solution Architecture (wrapper-level Tiny direct path)
  3. Box Theory Boundary (L0 ENV gate, L1 hot/cold split)
  4. Safety Gates (header validation, fail-fast, fallback)
  5. Integration Points (free() wrapper, counters)
  6. A/B Test Plan (10-run Mixed, health check)
### Implementation Files
- **ENV Box**: `core/box/free_tiny_direct_env_box.h/.c`
  - ENV gate: `HAKMEM_FREE_TINY_DIRECT=0/1`
  - Lazy init, static cache
- **Stats Box**: `core/box/free_tiny_direct_stats_box.h`
  - Counters: `direct_hit`, `direct_miss`, `invalid_header`
- **Integration**: `core/box/hak_wrappers.inc.h`
  - Lines ~595-640 (free() wrapper hot path)
  - New inline: `free_tiny_fast_direct()`
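
A minimal sketch of a lazily initialized ENV gate with a static cache, in the spirit of the ENV Box above. The function names are hypothetical and the parsing rule (only a value starting with `1` enables the gate) is an assumption; the actual box may differ. The plain `static int` is not synchronized, so this is a single-threaded illustration only.

```c
#include <assert.h>
#include <stdlib.h>

/* Pure parser: unset or anything not starting with '1' means disabled (default 0). */
static int demo_parse_gate(const char *value)
{
    return (value != NULL && value[0] == '1') ? 1 : 0;
}

/* Lazy init with a static cache: getenv() runs on the first call only;
 * later calls cost one load and one compare. */
static int demo_free_tiny_direct_enabled(void)
{
    static int cached = -1;  /* -1 = not yet initialized */
    if (cached < 0) {
        cached = demo_parse_gate(getenv("HAKMEM_FREE_TINY_DIRECT"));
    }
    assert(cached == 0 || cached == 1);
    return cached;
}
```
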
### A/B Test Protocol
```bash
# Baseline (E5-1=0)
for i in {1..10}; do
  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FREE_TINY_DIRECT=0 \
    ./bench_random_mixed_hakmem 20000000 400 1
done

# Optimized (E5-1=1)
for i in {1..10}; do
  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FREE_TINY_DIRECT=1 \
    ./bench_random_mixed_hakmem 20000000 400 1
done
```
**Success Criteria**:
- **GO**: Mean throughput change of +1.0% or higher vs. baseline
- **NEUTRAL**: Mean change within ±1.0% (between the two thresholds)
- **NO-GO**: Mean change of -1.0% or lower
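
The criteria above can be applied mechanically to the per-run throughput numbers. The sketch below is one possible encoding with hypothetical `ab_*` names; it treats exactly +1.0% as GO and exactly -1.0% as NO-GO. As a check against the commit summary, the rounded means 44.38M and 45.87M ops/s give a delta of about +3.36%, which lands in GO.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_verdict_t;

/* Arithmetic mean of n throughput samples (e.g. the 10 runs of one arm). */
static double ab_mean(const double *runs, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += runs[i];
    }
    return sum / (double)n;
}

/* Relative change of the optimized mean vs. the baseline mean, in percent. */
static double ab_delta_pct(double baseline_mean, double optimized_mean)
{
    assert(baseline_mean > 0.0);
    return (optimized_mean - baseline_mean) / baseline_mean * 100.0;
}

/* GO at +1.0% or higher, NO-GO at -1.0% or lower, NEUTRAL in between. */
static ab_verdict_t ab_verdict(double delta_pct)
{
    if (delta_pct >= 1.0) {
        return AB_GO;
    }
    if (delta_pct <= -1.0) {
        return AB_NO_GO;
    }
    return AB_NEUTRAL;
}
```
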
---
## Appendix: Perf Data Details
### Free() Annotated Hot Lines
**Top sample-hit lines** (from `perf annotate --stdio free`):
1. **8.01%**: `mov 0x14(%r15),%ebx` (offset 0x1cd2d)
   - Loading `wrapper_env->wrap_shape` from TLS
   - **Analysis**: ENV snapshot field access overhead
2. **7.73%**: `mov %rdi,%rbp` (offset 0x1ccd1)
   - Saving ptr argument to register
   - **Analysis**: Register pressure in wrapper preamble
3. **7.44%**: `mov 0x5b311(%rip),%edx` (offset 0x1cd01)
   - Loading cached ENV gate value (global)
   - **Analysis**: Lazy init check overhead
4. **6.98%**: `test %r11d,%r11d` (offset 0x1ccf8)
   - Testing ENV gate flag (TLS)
   - **Analysis**: Branch condition for ENV snapshot path
5. **7.62%**: `pop %r14` (offset 0x1ce6e)
   - Epilogue register restore
   - **Analysis**: Function call overhead (6+ register saves)

**Total wrapper overhead from top 5 lines**: ~38% of free() self% (8.01 + 7.73 + 7.44 + 6.98 + 7.62 = 37.78%) = **~8% of total runtime**
### Free_Tiny_Fast_Cold Analysis
**7.89% self% breakdown**:
- `tiny_route_for_class()`: 1.95% (route determination)
- Policy snapshot: 0.64%
- Class dispatch: ~5% (remainder)
**Cold path is triggered when**:
- Class != C7 (no ULTRA early-exit)
- OR class requires route determination (C0-C6 in Mixed)
**Opportunity**: Bypass cold path entirely for Tiny by validating in wrapper.

---
## Next Actions
1. **Step 1 Complete**: Perf analysis → E5-1 selected
2. 🔄 **Step 3 Next**: Implement E5-1 design + code
3. **Step 4 Next**: 10-run A/B test (Mixed)
4. **Step 5 Next**: Health check
5. **Step 6 Next**: Update CURRENT_TASK.md with results

**Status**: Ready to proceed with E5-1 implementation.