hakmem/docs/analysis/PHASE4_COMPREHENSIVE_STATUS_ANALYSIS.md
Moe Charm (CI) 4a070d8a14 Phase 5 E4-1: Free Wrapper ENV Snapshot (+3.51% GO, ADOPTED)
Target: Consolidate free wrapper TLS reads (2→1)
- free() is 25.26% self% (top hot spot)
- Strategy: Apply E1 success pattern (ENV snapshot) to free path

Implementation:
- ENV gate: HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/free_wrapper_env_snapshot_box.{h,c}: New box
  - Consolidates 2 TLS reads → 1 TLS read (50% reduction)
  - Reduces 4 branches → 3 branches (25% reduction)
  - Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in free() wrapper
- Makefile: Add free_wrapper_env_snapshot_box.o to all targets
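The 2→1 consolidation can be sketched in miniature (illustrative only: the flag names, bit layout, and probe logic below are hypothetical, not the real free_wrapper_env_snapshot_box API):

```c
/* Hypothetical sketch of the 2->1 TLS-read consolidation: two separate
 * TLS flags become bits in one TLS snapshot byte, so the free() wrapper
 * pays a single TLS read and one combined mask test on the hot path. */
#include <stdint.h>

#define SNAP_FREE_FAST (1u << 0) /* formerly its own TLS flag */
#define SNAP_DIAG_OFF  (1u << 1) /* formerly its own TLS flag */
#define SNAP_READY     (1u << 7) /* lazy-init marker */

static __thread uint8_t t_free_snapshot;

static uint8_t free_snapshot_init(void) {
    /* One-time probe; the real box re-syncs with bench_profile putenv. */
    uint8_t s = SNAP_READY | SNAP_FREE_FAST | SNAP_DIAG_OFF;
    t_free_snapshot = s;
    return s;
}

static int free_wrapper_fast_ok(void) {
    uint8_t s = t_free_snapshot;               /* single TLS read */
    if (__builtin_expect(!(s & SNAP_READY), 0))
        s = free_snapshot_init();              /* cold path, once per thread */
    return (s & (SNAP_FREE_FAST | SNAP_DIAG_OFF))
        == (SNAP_FREE_FAST | SNAP_DIAG_OFF);   /* one combined branch */
}
```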

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 45.35M ops/s (mean), 45.31M ops/s (median)
- Optimized (SNAPSHOT=1): 46.94M ops/s (mean), 47.15M ops/s (median)
- Improvement: +3.51% mean, +4.07% median

Decision: GO (+3.51% >= +1.0% threshold)
- Exceeded conservative estimate (+1.5% → +3.51%)
- Similar efficiency to E1 (+3.92%)
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset

Phase 5 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- Total Phase 4-5: ~+7.5%

E3-4 Correction:
- Phase 4 E3-4 (ENV Constructor Init): NO-GO / FROZEN
- Initial A/B showed +4.75%, but investigation revealed:
  - Branch prediction hint mismatch (UNLIKELY with always-true)
  - Retest confirmed -1.78% regression
  - Root cause: __builtin_expect(..., 0) with ctor_mode==1
- Decision: Freeze as research box (default OFF)
- Learning: Branch hints need careful tuning, TLS consolidation safer
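The hint mismatch can be reproduced in miniature (hypothetical names; only the `__builtin_expect(..., 0)` pattern is taken from the finding above):

```c
/* Sketch of the E3-4 root cause: after the constructor runs, ctor_mode
 * is always 1, yet the hot path still hinted the branch as unlikely
 * (__builtin_expect(..., 0)), pessimizing codegen for the common case. */
static int ctor_mode = 1; /* set by the constructor; true from then on */

static int env_enabled_mismatched(void) {
    /* BUG pattern: hints "rarely taken" for an always-taken branch. */
    if (__builtin_expect(ctor_mode, 0))
        return 1;
    return 0;
}

static int env_enabled_corrected(void) {
    /* Matching hint: the branch is effectively always taken. */
    if (__builtin_expect(ctor_mode, 1))
        return 1;
    return 0;
}
```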

Deliverables:
- docs/analysis/PHASE5_E4_FREE_GATE_OPTIMIZATION_1_DESIGN.md
- docs/analysis/PHASE5_E4_1_FREE_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-1 added, E3-4 corrected)
- CURRENT_TASK.md (E4-1 complete, E3-4 frozen)
- core/bench_profile.h (E4-1 promoted to default)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 04:24:34 +09:00


# Phase 4 Comprehensive Status Analysis
**Date**: 2025-12-14
**Analyst**: Claude Code
**Baseline**: E1 enabled (~45M ops/s)
---
## Part 1: E2 Freeze Decision Analysis
### Test Data Review
**E2 Configuration**: HAKMEM_TINY_ALLOC_DUALHOT (C0-C3 fast path for alloc)
**Baseline**: HAKMEM_ENV_SNAPSHOT=1 (E1 enabled)
**Test**: 10-run A/B, 20M iterations, ws=400
#### Statistical Analysis
| Metric | Baseline (E2=0) | Optimized (E2=1) | Delta |
|--------|-----------------|------------------|-------|
| Mean | 45.40M ops/s | 45.30M ops/s | -0.21% |
| Median | 45.51M ops/s | 45.22M ops/s | -0.62% |
| StdDev | 0.38M (0.84% CV) | 0.49M (1.07% CV) | +28% variance |
#### Variance Consistency Analysis
**Baseline runs** (DUALHOT=0):
- Range: 44.60M - 45.90M (1.30M spread)
- Runs within ±1% of mean: 9/10 (90%)
- Outliers: Run 8 (44.60M, -1.76% from mean)
**Optimized runs** (DUALHOT=1):
- Range: 44.59M - 46.28M (1.69M spread)
- Runs within ±1% of mean: 8/10 (80%)
- Outliers: Run 2 (46.28M, +2.16% from mean), Run 3 (44.59M, -1.58% from mean)
**Observation**: Higher variance in optimized version suggests branch misprediction or cache effects.
#### Comparison to Free DUALHOT Success
| Path | DUALHOT Result | Reason |
|------|----------------|--------|
| **Free** | **+13.0%** | Skips policy_snapshot() + tiny_route_for_class() for C0-C3 (48% of frees) |
| **Alloc** | **-0.21%** | Route already cached (Phase 3 C3), C0-C3 check adds branch without bypassing cost |
**Root Cause**:
- Free path: C0-C3 optimization skips **expensive operations** (policy snapshot + route lookup)
- Alloc path: C0-C3 optimization skips **already-cached operations** (static routing eliminates lookup)
- Net effect: Branch overhead ≈ Savings → neutral
### E2 Freeze Recommendation
**Decision**: ✅ **DEFINITIVE FREEZE**
**Rationale**:
1. **Result is consistent**: All 10 runs showed similar pattern (no bimodal distribution)
2. **Not a measurement error**: StdDev 0.38M-0.49M is normal for this workload
3. **Root cause understood**: Alloc path already optimized via C3 static routing
4. **Free vs Alloc asymmetry explained**: Free skips expensive ops, alloc skips cheap cached ops
5. **No alternative conditions warranted**:
- Different workload (C6-heavy): Won't help - same route caching applies
- Different iteration count: Won't change fundamental branch cost vs savings trade-off
- Combined flags: No synergy available - route caching is already optimal
**Conclusion**: E2 is a **structural dead-end** for Mixed workload. Alloc route optimization saturated by C3.
---
## Part 2: Fresh Perf Profile Analysis (E1 Enabled)
### Profile Configuration
**Command**: `HAKMEM_ENV_SNAPSHOT=1 perf record -F 999 -- ./bench_random_mixed_hakmem 40000000 400 1`
**Throughput**: 45.26M ops/s
**Samples**: 946 samples, 3.25B cycles
### Top Functions (self% >= 2.0%)
| Rank | Function | self% | Change from Pre-E1 | Category |
|------|----------|-------|-------------------|----------|
| 1 | free | 22.19% | +2.5pp (from ~19%) | Wrapper |
| 2 | tiny_alloc_gate_fast | 18.99% | +3.6pp (from 15.37%) | Alloc Gate |
| 3 | main | 15.21% | No change | Benchmark |
| 4 | malloc | 13.36% | No change | Wrapper |
| 5 | free_tiny_fast_cold | 7.32% | +1.5pp (from 5.84%) | Free Path |
| 6 | hakmem_env_snapshot_enabled | 3.22% | **NEW (replaces 3 gates, 3.26% combined)** | ENV Gate |
| 7 | tiny_region_id_write_header | 2.60% | +0.1pp (from 2.50%) | Header |
| 8 | unified_cache_push | 2.56% | -1.4pp (from 3.97%) | Cache |
| 9 | tiny_route_for_class | 2.29% | +0.01pp (from 2.28%) | Routing |
| 10 | small_policy_v7_snapshot | 2.26% | No data | Policy |
| 11 | tiny_c7_ultra_alloc | 2.16% | -1.8pp (from 3.97%) | C7 Alloc |
### E1 Impact Analysis
**Expected**: E1 consolidates 3 ENV gates (3.26% self%) → 1 TLS read
**Actual**: `hakmem_env_snapshot_enabled` shows 3.22% self%
**Interpretation**:
- ENV overhead **shifted** from 3 separate functions → 1 function
- **NOT eliminated** - still paying 3.22% for ENV checking
- E1's +3.92% gain likely from **reduced TLS pressure** (fewer TLS variables), not eliminated checks
- The snapshot approach caches results, reducing repeated getenv() calls
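The shift (not elimination) is visible in a minimal model of the snapshot pattern (gate names and defaults are illustrative, not hakmem's actual gates):

```c
/* Model of the E1 pattern: three getenv()-backed gates collapse into one
 * cached snapshot, so steady-state calls pay one guard check plus one
 * struct read instead of three flag lookups -- cheaper, but not free
 * (the residual 3.22% self%). */
#include <stdlib.h>
#include <string.h>

typedef struct {
    int initialized;            /* lazy-init guard, tested on every call */
    int gate_a, gate_b, gate_c; /* formerly three separate ENV gates */
} env_snapshot_t;

static env_snapshot_t g_snapshot;

static int env_flag(const char *name, int dflt) {
    const char *v = getenv(name);
    return v ? (strcmp(v, "0") != 0) : dflt;
}

static const env_snapshot_t *env_get(void) {
    if (__builtin_expect(!g_snapshot.initialized, 0)) {
        g_snapshot.gate_a = env_flag("HAKMEM_GATE_A", 1); /* hypothetical */
        g_snapshot.gate_b = env_flag("HAKMEM_GATE_B", 1); /* hypothetical */
        g_snapshot.gate_c = env_flag("HAKMEM_GATE_C", 0); /* hypothetical */
        g_snapshot.initialized = 1;
    }
    return &g_snapshot;
}
```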
**Surprise findings**:
1. **tiny_alloc_gate_fast increased** from 15.37% → 18.99% (+3.6pp)
- Possible reason: Other functions got faster (relative %), or I-cache effects
2. **hakmem_env_snapshot_enabled is NEW hot spot** (3.22%)
- This is the consolidation point - still significant overhead
3. **unified_cache_push decreased** from 3.97% → 2.56% (-1.4pp)
- Good sign: Cache operations more efficient
### Hot Spot Distribution
**Pre-E1** (Phase 4 D3 baseline):
- ENV gates (3 functions): 3.26%
- tiny_alloc_gate_fast: 15.37%
- free_tiny_fast_cold: 5.84%
- **Total measured overhead**: ~24.5%
**Post-E1** (current):
- ENV snapshot (1 function): 3.22%
- tiny_alloc_gate_fast: 18.99%
- free_tiny_fast_cold: 7.32%
- **Total measured overhead**: ~29.5%
**Analysis**: Overhead increased in absolute %, but throughput increased +3.92%. This suggests:
- Baseline got faster (other code optimized)
- Relative % shifted to measured functions
- Perf sampling variance (946 samples has ~±3% error margin)
---
## Part 3: E3 Candidate Identification
### Methodology
**Selection Criteria**:
1. self% >= 5% (significant impact)
2. Not already heavily optimized (avoid saturated areas)
3. Different approach from route/TLS optimization (explore new vectors)
### Candidate Analysis
#### Candidate E3-1: tiny_alloc_gate_fast (18.99% self%) - ROUTING SATURATION
**Current State**:
- Phase 3 C3: Static routing (+2.20% gain)
- Phase 4 D3: Alloc gate shape (+0.56% neutral)
- Phase 4 E2: Per-class fast path (-0.21% neutral)
**Why it's 18.99%**:
- Route determination: Already cached (C3)
- Branch prediction: Already tuned (D3)
- Per-class specialization: No benefit (E2)
**Remaining Overhead**:
- Function call overhead (not inlined)
- ENV snapshot check (3.22% now consolidated)
- Size→class conversion (hak_tiny_size_to_class)
- Wrapper→gate dispatch
**Optimization Approach**: **INLINING + DISPATCH OPTIMIZATION**
- **Strategy**: Inline tiny_alloc_gate_fast into malloc wrapper
- Eliminate function call overhead (save ~5-10 cycles)
- Improve I-cache locality (malloc + gate in same cache line)
- Enable cross-function optimization (compiler can optimize malloc→gate→fast_path as one unit)
- **Expected Gain**: +1-2% (reduce 18.99% self by 10-15% = ~2pp overall)
- **Risk**: Medium (I-cache pressure, as seen in A3 -4% regression)
**Recommendation**: **DEFER** - Route optimization saturated, inlining has I-cache risk
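For concreteness, the inlining strategy would look roughly like this (toy size→class map and bump pool; the real gate and classing logic differ):

```c
/* Sketch of E3-1: force the gate into the wrapper so malloc -> gate ->
 * fast path compiles as one unit. Deferred in the text because heavy
 * inlining risks I-cache pressure (cf. the A3 -4% regression). */
#include <stddef.h>
#include <stdlib.h>

static unsigned char g_pool[8][4096]; /* toy per-class bump pools */
static size_t g_off[8];

static inline __attribute__((always_inline)) int size_to_class(size_t n) {
    return (n > 0 && n <= 64) ? (int)((n - 1) / 8) : -1; /* toy 8B map */
}

static inline __attribute__((always_inline)) void *gate_fast(size_t n) {
    int c = size_to_class(n);
    if (__builtin_expect(c < 0, 0))
        return malloc(n);                /* beyond tiny range: fall back */
    size_t sz = ((size_t)c + 1) * 8;
    if (g_off[c] + sz > sizeof g_pool[c])
        return malloc(n);                /* pool exhausted: fall back */
    void *p = &g_pool[c][g_off[c]];
    g_off[c] += sz;
    return p;
}

void *hak_malloc(size_t n) { return gate_fast(n); } /* one compiled unit */
```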
---
#### Candidate E3-2: free (22.19% self%) - WRAPPER OVERHEAD
**Current State**:
- Phase 2 B4: Wrapper hot/cold split (+1.47% gain)
- Wrapper shape already optimized (rare checks in cold path)
**Why it's 22.19%**:
- This is the `free()` wrapper function (libc entry point)
- Includes: LD mode check, jemalloc check, diagnostics, then dispatch to free_tiny_fast
**Optimization Approach**: **WRAPPER BYPASS (IFUNC) or Function Pointer Caching**
- **Strategy 1 (IFUNC)**: Use GNU IFUNC to resolve malloc/free at load time
- Direct binding: `malloc → tiny_alloc_gate_fast` (no wrapper layer)
- Risk: HIGH (ABI compatibility, thread-safety)
- **Strategy 2 (Function Pointer)**: Cache `g_free_impl` in TLS
- Check once at thread init, then direct call
- Risk: Medium, Lower gain (+1-2%)
**Recommendation**: **HIGH PRIORITY** - Large potential gain, prototype with function pointer approach first
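Strategy 2 could be prototyped along these lines (stubbed paths; the real wrapper's checks and entry points are more involved):

```c
/* Sketch of free-wrapper function-pointer caching: resolve the dispatch
 * target once per thread, then every later free() is a single indirect
 * call instead of repeating the LD-mode / jemalloc checks. */
#include <stddef.h>

typedef void (*free_fn)(void *);

static int g_fast_calls, g_slow_calls;

/* Stubs standing in for the real fast and slow free paths. */
static void free_tiny_fast(void *p) { (void)p; g_fast_calls++; }
static void free_slow_path(void *p) { (void)p; g_slow_calls++; }
static int  ld_mode_active(void)    { return 0; } /* assume normal mode */

static __thread free_fn t_free_impl; /* per-thread cached target */

static free_fn resolve_free_impl(void) {
    /* Decided once per thread: rare-mode checks leave the hot path. */
    return ld_mode_active() ? free_slow_path : free_tiny_fast;
}

static void hak_free(void *p) {
    free_fn f = t_free_impl;
    if (__builtin_expect(f == NULL, 0)) /* only true on the first call */
        f = t_free_impl = resolve_free_impl();
    f(p);
}
```

A prototype like this sidesteps the IFUNC ABI risk while still collapsing the per-call checks to one NULL test.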
---
#### Candidate E3-3: free_tiny_fast_cold (7.32% self%) - COLD PATH OPTIMIZATION
**Current State**:
- Phase FREE-DUALHOT: Hot/cold split (+13% gain for C0-C3 hot path)
- Cold path handles C4-C7 (~50% of frees)
**Optimization Approach**: **C4-C7 ROUTE SPECIALIZATION**
- **Strategy**: Create per-class cold paths (similar to E2 alloc attempt)
- **Expected Gain**: +0.5-1.0%
- **Risk**: Low
**Recommendation**: **MEDIUM PRIORITY** - Incremental gain, but may hit diminishing returns like E2
---
#### Candidate E3-4: hakmem_env_snapshot_enabled (3.22% self%) - ENV OVERHEAD REDUCTION ⭐
**Current State**:
- Phase 4 E1: ENV snapshot consolidation (+3.92% gain)
- 3 separate ENV gates → 1 consolidated snapshot
**Why it's 3.22%**:
- This IS the optimization (consolidation point)
- Still checking `g_hakmem_env_snapshot.initialized` on every call
- TLS read overhead (1 TLS variable vs 3, but still 1 read per hot path)
**Optimization Approach**: **LAZY INIT ELIMINATION**
- **Strategy**: Force ENV snapshot initialization at library load time (constructor)
- Use `__attribute__((constructor))` to init before main()
- Eliminate `if (!initialized)` check in hot path
- Make `hakmem_env_get()` a pure TLS read (no branch)
- **Expected Gain**: +0.5-1.5% (eliminate 3.22% check overhead)
- **Risk**: Low (standard initialization pattern)
- **Implementation**:
```c
__attribute__((constructor))
static void hakmem_env_snapshot_init_early(void) {
    hakmem_env_snapshot_init(); // Force init before any alloc/free
}

static inline const hakmem_env_snapshot* hakmem_env_get(void) {
    return &g_hakmem_env_snapshot; // No check, just return
}
```
**Recommendation**: **HIGH PRIORITY** - Clean win, low risk, eliminates E1's remaining overhead
---
#### Candidate E3-5: tiny_region_id_write_header (2.60% self%) - HEADER WRITE OPTIMIZATION
**Current State**:
- Phase 1 A3: always_inline attempt → -4.00% regression (NO-GO)
- I-cache pressure issue identified
**Optimization Approach**: **SELECTIVE INLINING**
- **Strategy**: Inline only for hot classes (C7 ULTRA, C0-C3 LEGACY)
- **Expected Gain**: +0.5-1.0%
- **Risk**: Medium (I-cache effects)
**Recommendation**: **LOW PRIORITY** - A3 already explored, I-cache risk remains
---
### E3 Candidate Ranking
| Rank | Candidate | self% | Approach | Expected Gain | Risk | ROI |
|------|-----------|-------|----------|---------------|------|-----|
| **1** | **hakmem_env_snapshot_enabled** | **3.22%** | **Constructor init** | **+0.5-1.5%** | **Low** | **⭐⭐⭐** |
| **2** | **free wrapper** | **22.19%** | **Function pointer cache** | **+1-2%** | **Medium** | **⭐⭐⭐** |
| 3 | tiny_alloc_gate_fast | 18.99% | Inlining | +1-2% | High (I-cache) | ⭐⭐ |
| 4 | free_tiny_fast_cold | 7.32% | Route specialization | +0.5-1.0% | Low | ⭐⭐ |
| 5 | tiny_region_id_write_header | 2.60% | Selective inline | +0.5-1.0% | Medium | ⭐ |
---
## Part 4: Summary & Recommendations
### E2 Final Decision
**Decision**: ✅ **FREEZE DEFINITIVELY**
**Rationale**:
1. **Result is consistent**: -0.21% mean, -0.62% median across 10 runs
2. **Root cause clear**: Alloc route optimization saturated by Phase 3 C3 static routing
3. **Free vs Alloc asymmetry**: Free DUALHOT skips expensive ops, alloc skips cached ops
4. **No alternative testing needed**: Workload/iteration changes won't fix structural issue
5. **Lesson learned**: Per-class specialization only works when bypassing uncached overhead
**Action**:
- Keep `HAKMEM_TINY_ALLOC_DUALHOT=0` as default (research box frozen)
- Document in CURRENT_TASK.md as NEUTRAL result
- No further investigation warranted
---
### Perf Findings (E1 Enabled Baseline)
**Throughput**: 45.26M ops/s (+3.92% from pre-E1 baseline)
**Hot Spots** (self% >= 5%):
1. free (22.19%) - Wrapper overhead
2. tiny_alloc_gate_fast (18.99%) - Route overhead (saturated)
3. main (15.21%) - Benchmark driver
4. malloc (13.36%) - Wrapper overhead
5. free_tiny_fast_cold (7.32%) - C4-C7 free path
**E1 Impact**:
- ENV overhead consolidated: 3.26% (3 functions) → 3.22% (1 function)
- Gain from reduced TLS pressure: +3.92%
- **Remaining opportunity**: Eliminate lazy init check (3.22% → 0%)
**New Hot Spots**:
- hakmem_env_snapshot_enabled: 3.22% (consolidation point)
**Changes from Pre-E1**:
- tiny_alloc_gate_fast: +3.6pp (15.37% → 18.99%)
- free: +2.5pp (~19% → 22.19%)
- unified_cache_push: -1.4pp (3.97% → 2.56%)
---
### E3 Recommendation
**Primary Target**: **hakmem_env_snapshot_enabled (E3-4)**
**Approach**: Constructor-based initialization
- Force ENV snapshot init at library load time
- Eliminate lazy init check in hot path
- Make `hakmem_env_get()` a pure TLS read (no branch)
**Expected Gain**: +0.5-1.5%
**Implementation Complexity**: Low (2-day task)
- Add `__attribute__((constructor))` function
- Remove init check from hakmem_env_get()
- A/B test with 10-run Mixed + 5-run C6-heavy
**Rationale**:
1. **Low risk**: Standard initialization pattern (used by jemalloc, tcmalloc)
2. **Clear gain**: Eliminates 3.22% overhead (lazy init check)
3. **Compounds E1**: Completes ENV snapshot optimization started in E1
4. **Different vector**: Not route/TLS optimization - this is **initialization overhead reduction**
**Success Criteria**:
- Mean gain >= +0.5% (conservative)
- No regression on any profile
- Health check passes
---
**Secondary Target**: **free wrapper (E3-2)**
**Approach**: Function pointer caching
- Cache `g_free_impl` in TLS at thread init
- Direct call instead of LD mode check + dispatch
- Lower risk than IFUNC approach
**Expected Gain**: +1-2%
**Implementation Complexity**: Medium (3-4 day task)
**Risk**: Medium (thread-safety, initialization order)
---
### Phase 4 Status
**Active Optimizations**:
- E1 (ENV Snapshot): +3.92% ✅ GO (research box, default OFF / opt-in)
**Frozen Optimizations**:
- E3-4 (ENV Constructor Init): ❌ NO-GO (frozen, default OFF, requires E1)
- D3 (Alloc Gate Shape): +0.56% ⚪ NEUTRAL (research box, default OFF)
- E2 (Alloc Per-Class FastPath): -0.21% ⚪ NEUTRAL (research box, default OFF)
**Cumulative Gain** (Phase 2-4):
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- D1 (Free route cache): +2.19%
- E1 (ENV snapshot): +3.92%
- **Total (Phase 4)**: ~+3.9% (E1 only)
**Baseline reference**:
- E1=1, CTOR=0: 45.26M ops/s (Mixed, 40M iters, ws=400)
- E1=1, CTOR=1: 46.86M ops/s (Mixed, 20M iters, ws=400; re-validation: -1.44%)
**Remaining Potential**:
- E3-2 (Wrapper function ptr): +1-2%
- E3-3 (Free route special): +0.5-1.0%
- **Realistic ceiling**: ~48-50M ops/s (without major redesign)
---
### Next Steps
#### Immediate (Priority 1)
1. **Freeze E2 in CURRENT_TASK.md**
- Document NEUTRAL decision (-0.21%)
- Add root cause explanation (route caching saturation)
- Mark as research box (default OFF, frozen)
2. **E3-4 promotion gate (re-validation)**
- E3-4 initially measured GO, but re-confirm with a 10-run A/B after the low-level adjustments (branch hints, refresh)
- A/B: Mixed 10-run (E1=1, CTOR=0 vs 1)
- Health check: `scripts/verify_health_profiles.sh`
#### Short-term (Priority 2)
3. **Re-run perf with E1/E3-4 enabled**
- Confirm `hakmem_env_snapshot_enabled` drops out of the top functions (its self% falls significantly)
- Select the next core target (alloc gate / free_tiny_fast_cold / wrapper) using the self% ≥ 5% criterion
#### Long-term (Priority 3)
4. **Consider non-incremental approaches**
- Mimalloc-style TLS bucket redesign (major overhaul)
- Static-compiled routing (eliminate runtime policy)
- IFUNC for zero-overhead wrapper (high risk)
---
### Lessons Learned
#### Route Optimization Saturation
**Observation**: E2 (alloc per-class) showed -0.21% neutral despite free path success (+13%)
**Insight**:
- Route optimization has diminishing returns after static caching (C3)
- Further specialization adds branch overhead without eliminating cost
- **Lesson**: Don't pursue per-class specialization on already-cached paths
#### Shape Optimization Plateau
**Observation**: D3 (alloc gate shape) showed +0.56% neutral despite B3 success (+2.89%)
**Insight**:
- Branch prediction saturates after initial tuning
- LIKELY/UNLIKELY hints have limited benefit on well-trained branches
- **Lesson**: Shape optimization good for first pass, limited ROI after
#### ENV Consolidation Success
**Observation**: E1 (ENV snapshot) achieved +3.92% gain
**Insight**:
- Reducing TLS pressure (3 vars → 1 var) has measurable benefit
- Consolidation point still has overhead (3.22% self%)
- **Lesson**: Constructor init is next logical step (eliminate lazy check)
#### Inlining I-Cache Risk
**Observation**: A3 (header always_inline) showed -4% regression on Mixed
**Insight**:
- Aggressive inlining can thrash I-cache on mixed workloads
- Selective inlining (per-class) may work but needs careful profiling
- **Lesson**: Inlining is high-risk, constructor/caching approaches safer
---
### Realistic Expectations
**Current State**: 45M ops/s (E1 enabled)
**Target**: 48-50M ops/s (with E3-4, E3-2)
**Ceiling**: ~55-60M ops/s (without major redesign)
**Gap to mimalloc**: ~2.3x even at the ceiling (128M vs 55M ops/s)
**Why large gap remains**:
- Architectural overhead: 4-5 layer design (wrapper → gate → policy → route → handler) vs mimalloc's 1-layer TLS buckets
- Per-call policy: hakmem evaluates policy on every call, mimalloc uses static TLS layout
- Instruction overhead: ~50-100 instructions per alloc/free vs mimalloc's ~10-15
**Next phase options**:
1. **Incremental** (E3-4, E3-2): +1-3% gains, safe, diminishing returns
2. **Structural redesign**: +20-50% potential, high risk, months of work
3. **Workload-specific tuning**: Optimize for specific profiles (C6-heavy, C7-only), not general Mixed
**Recommendation**: Pursue E3-4 (low-hanging fruit), then re-evaluate if structural redesign warranted.
---
**Analysis Complete**: 2025-12-14
**Next Action**: Implement E3-4 (ENV Constructor Init)
**Expected Timeline**: 2-3 days (design → implement → A/B → decision)