Target: Consolidate free wrapper TLS reads (2→1)
- free() is 25.26% self% (top hot spot)
- Strategy: Apply E1 success pattern (ENV snapshot) to free path
Implementation:
- ENV gate: HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/free_wrapper_env_snapshot_box.{h,c}: New box
- Consolidates 2 TLS reads → 1 TLS read (50% reduction)
- Reduces 4 branches → 3 branches (25% reduction)
- Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in free() wrapper
- Makefile: Add free_wrapper_env_snapshot_box.o to all targets
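The consolidation idea above can be sketched as follows. This is an illustrative reconstruction, not the actual `free_wrapper_env_snapshot_box` source: the struct layout, env-variable names, and `free_env_get` helper are assumptions. The point is that two separate TLS reads collapse into one TLS struct read, with the lazy-init branch paid only on the cold path.

```c
#include <stdlib.h>

// Hypothetical snapshot struct: both gate flags live in one TLS slot,
// so the free() hot path does a single TLS read instead of two.
typedef struct {
    int initialized;   // lazy-init guard (checked once per thread)
    int gate_a;        // illustrative stand-in for the first gate flag
    int gate_b;        // illustrative stand-in for the second gate flag
} free_env_snapshot;

static __thread free_env_snapshot t_snap;

static inline const free_env_snapshot *free_env_get(void) {
    if (!t_snap.initialized) {                      // cold: runs once per thread
        const char *a = getenv("HAKMEM_GATE_A");    // env names are illustrative
        const char *b = getenv("HAKMEM_GATE_B");
        t_snap.gate_a = (a && a[0] == '1');
        t_snap.gate_b = (b && b[0] == '1');
        t_snap.initialized = 1;
    }
    return &t_snap;                                 // hot: one TLS read, no getenv
}
```

The branch reduction (4 → 3) follows the same shape: the per-flag init checks merge into the single `initialized` test.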
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 45.35M ops/s (mean), 45.31M ops/s (median)
- Optimized (SNAPSHOT=1): 46.94M ops/s (mean), 47.15M ops/s (median)
- Improvement: +3.51% mean, +4.07% median
Decision: GO (+3.51% >= +1.0% threshold)
- Exceeded conservative estimate (+1.5% → +3.51%)
- Similar efficiency to E1 (+3.92%)
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset
Phase 5 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- Total Phase 4-5: ~+7.5%
E3-4 Correction:
- Phase 4 E3-4 (ENV Constructor Init): NO-GO / FROZEN
- Initial A/B showed +4.75%, but investigation revealed:
- Branch prediction hint mismatch (UNLIKELY with always-true)
- Retest confirmed -1.78% regression
- Root cause: __builtin_expect(..., 0) with ctor_mode==1
- Decision: Freeze as research box (default OFF)
- Learning: Branch hints need careful tuning, TLS consolidation safer
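The hint mismatch described above can be shown in miniature. This is a hedged sketch of the failure pattern, not the E3-4 code itself (`g_ctor_mode` and the function names are invented): once constructor init makes the condition always-true, an `UNLIKELY` hint on it tells the compiler to lay out the common case as the out-of-line path, which is exactly backwards.

```c
// UNLIKELY as commonly defined: predict the condition false.
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

static int g_ctor_mode = 1; // constructor init forces this to 1 at load time

static int env_enabled_bad(void) {
    // BUG pattern: the condition is true on every call, but hinted as
    // unlikely, so the taken path is placed on the "cold" side.
    if (UNLIKELY(g_ctor_mode))
        return 1;
    return 0;
}

static int env_enabled_fixed(void) {
    // Fix: make the hint match runtime truth (or drop the hint entirely).
    if (__builtin_expect(!!g_ctor_mode, 1))
        return 1;
    return 0;
}
```

Both functions return the same value; only the code layout the compiler emits differs, which is why the bug only showed up as a throughput regression, not a correctness failure.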
Deliverables:
- docs/analysis/PHASE5_E4_FREE_GATE_OPTIMIZATION_1_DESIGN.md
- docs/analysis/PHASE5_E4_1_FREE_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-1 added, E3-4 corrected)
- CURRENT_TASK.md (E4-1 complete, E3-4 frozen)
- core/bench_profile.h (E4-1 promoted to default)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Phase 4 Comprehensive Status Analysis
Date: 2025-12-14
Analyst: Claude Code
Baseline: E1 enabled (~45M ops/s)
Part 1: E2 Freeze Decision Analysis
Test Data Review
E2 Configuration: HAKMEM_TINY_ALLOC_DUALHOT (C0-C3 fast path for alloc)
Baseline: HAKMEM_ENV_SNAPSHOT=1 (E1 enabled)
Test: 10-run A/B, 20M iterations, ws=400
Statistical Analysis
| Metric | Baseline (E2=0) | Optimized (E2=1) | Delta |
|---|---|---|---|
| Mean | 45.40M ops/s | 45.30M ops/s | -0.21% |
| Median | 45.51M ops/s | 45.22M ops/s | -0.62% |
| StdDev | 0.38M (0.84% CV) | 0.49M (1.07% CV) | +28% variance |
Variance Consistency Analysis
Baseline runs (DUALHOT=0):
- Range: 44.60M - 45.90M (1.30M spread)
- Runs within ±1% of mean: 9/10 (90%)
- Outliers: Run 8 (44.60M, -1.76% from mean)
Optimized runs (DUALHOT=1):
- Range: 44.59M - 46.28M (1.69M spread)
- Runs within ±1% of mean: 8/10 (80%)
- Outliers: Run 2 (46.28M, +2.16% from mean), Run 3 (44.59M, -1.58% from mean)
Observation: Higher variance in optimized version suggests branch misprediction or cache effects.
Comparison to Free DUALHOT Success
| Path | DUALHOT Result | Reason |
|---|---|---|
| Free | +13.0% | Skips policy_snapshot() + tiny_route_for_class() for C0-C3 (48% of frees) |
| Alloc | -0.21% | Route already cached (Phase 3 C3), C0-C3 check adds branch without bypassing cost |
Root Cause:
- Free path: C0-C3 optimization skips expensive operations (policy snapshot + route lookup)
- Alloc path: C0-C3 optimization skips already-cached operations (static routing eliminates lookup)
- Net effect: Branch overhead ≈ Savings → neutral
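The free-path hot/cold split the asymmetry refers to has roughly this shape. The helper names and the stand-in bodies are assumptions (the real hot path pushes into a per-class cache); what matters is that the C0-C3 branch bypasses `policy_snapshot()` and `tiny_route_for_class()` entirely, whereas the alloc-side equivalent had nothing comparably expensive left to bypass.

```c
#include <stddef.h>

static int g_hot_hits, g_cold_hits; // counters standing in for the real paths

static void hot_push(int cls, void *p) { (void)cls; (void)p; g_hot_hits++; }
static void free_tiny_fast_cold_stub(void *p, int cls) { (void)p; (void)cls; g_cold_hits++; }

static inline void free_dualhot(void *p, int cls) {
    if (cls <= 3)                          // C0-C3: ~48% of frees in Mixed
        hot_push(cls, p);                  // skips policy_snapshot() + tiny_route_for_class()
    else
        free_tiny_fast_cold_stub(p, cls);  // C4-C7: full policy + route path
}
```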
E2 Freeze Recommendation
Decision: ✅ DEFINITIVE FREEZE
Rationale:
- Result is consistent: All 10 runs showed similar pattern (no bimodal distribution)
- Not a measurement error: StdDev 0.38M-0.49M is normal for this workload
- Root cause understood: Alloc path already optimized via C3 static routing
- Free vs Alloc asymmetry explained: Free skips expensive ops, alloc skips cheap cached ops
- No alternative conditions warranted:
- Different workload (C6-heavy): Won't help - same route caching applies
- Different iteration count: Won't change fundamental branch cost vs savings trade-off
- Combined flags: No synergy available - route caching is already optimal
Conclusion: E2 is a structural dead-end for Mixed workload. Alloc route optimization saturated by C3.
Part 2: Fresh Perf Profile Analysis (E1 Enabled)
Profile Configuration
Command: HAKMEM_ENV_SNAPSHOT=1 perf record -F 999 -- ./bench_random_mixed_hakmem 40000000 400 1
Throughput: 45.26M ops/s
Samples: 946 samples, 3.25B cycles
Top Functions (self% >= 2.0%)
| Rank | Function | self% | Change from Pre-E1 | Category |
|---|---|---|---|---|
| 1 | free | 22.19% | +2.5pp (from ~19%) | Wrapper |
| 2 | tiny_alloc_gate_fast | 18.99% | +3.6pp (from 15.37%) | Alloc Gate |
| 3 | main | 15.21% | No change | Benchmark |
| 4 | malloc | 13.36% | No change | Wrapper |
| 5 | free_tiny_fast_cold | 7.32% | +1.5pp (from 5.84%) | Free Path |
| 6 | hakmem_env_snapshot_enabled | 3.22% | NEW (was 0% combined) | ENV Gate |
| 7 | tiny_region_id_write_header | 2.60% | +0.1pp (from 2.50%) | Header |
| 8 | unified_cache_push | 2.56% | -1.4pp (from 3.97%) | Cache |
| 9 | tiny_route_for_class | 2.29% | +0.01pp (from 2.28%) | Routing |
| 10 | small_policy_v7_snapshot | 2.26% | No data | Policy |
| 11 | tiny_c7_ultra_alloc | 2.16% | -1.8pp (from 3.97%) | C7 Alloc |
E1 Impact Analysis
Expected: E1 consolidates 3 ENV gates (3.26% self%) → 1 TLS read
Actual: hakmem_env_snapshot_enabled shows 3.22% self%
Interpretation:
- ENV overhead shifted from 3 separate functions → 1 function
- NOT eliminated - still paying 3.22% for ENV checking
- E1's +3.92% gain likely from reduced TLS pressure (fewer TLS variables), not eliminated checks
- The snapshot approach caches results, reducing repeated getenv() calls
Surprise findings:
- tiny_alloc_gate_fast increased from 15.37% → 18.99% (+3.6pp)
- Possible reason: Other functions got faster (relative %), or I-cache effects
- hakmem_env_snapshot_enabled is NEW hot spot (3.22%)
- This is the consolidation point - still significant overhead
- unified_cache_push decreased from 3.97% → 2.56% (-1.4pp)
- Good sign: Cache operations more efficient
Hot Spot Distribution
Pre-E1 (Phase 4 D3 baseline):
- ENV gates (3 functions): 3.26%
- tiny_alloc_gate_fast: 15.37%
- free_tiny_fast_cold: 5.84%
- Total measured overhead: ~24.5%
Post-E1 (current):
- ENV snapshot (1 function): 3.22%
- tiny_alloc_gate_fast: 18.99%
- free_tiny_fast_cold: 7.32%
- Total measured overhead: ~29.5%
Analysis: Overhead increased in absolute %, but throughput increased +3.92%. This suggests:
- Baseline got faster (other code optimized)
- Relative % shifted to measured functions
- Perf sampling variance (946 samples has ~±3% error margin)
Part 3: E3 Candidate Identification
Methodology
Selection Criteria:
- self% >= 5% (significant impact)
- Not already heavily optimized (avoid saturated areas)
- Different approach from route/TLS optimization (explore new vectors)
Candidate Analysis
Candidate E3-1: tiny_alloc_gate_fast (18.99% self%) - ROUTING SATURATION
Current State:
- Phase 3 C3: Static routing (+2.20% gain)
- Phase 4 D3: Alloc gate shape (+0.56% neutral)
- Phase 4 E2: Per-class fast path (-0.21% neutral)
Why it's 18.99%:
- Route determination: Already cached (C3)
- Branch prediction: Already tuned (D3)
- Per-class specialization: No benefit (E2)
Remaining Overhead:
- Function call overhead (not inlined)
- ENV snapshot check (3.22% now consolidated)
- Size→class conversion (hak_tiny_size_to_class)
- Wrapper→gate dispatch
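For context on the size→class item above, a plausible shape for such a conversion is sketched below. The real `hak_tiny_size_to_class` is not shown in this document, so the class boundaries here (16B doubling up through C7) are pure assumptions; the sketch only illustrates why this step costs a handful of cycles per call.

```c
#include <stddef.h>

// Hypothetical tiny-class mapping: C0 = 1..16B, each class doubles,
// everything above the C6 boundary falls into C7. Boundaries are assumed.
static inline int tiny_size_to_class(size_t sz) {
    if (sz == 0) sz = 1;
    int cls = 0;
    size_t cap = 16;
    while (sz > cap && cls < 7) { // a few compare+shift iterations per call
        cap <<= 1;
        cls++;
    }
    return cls;
}
```

A branch-free variant (e.g. via a count-leading-zeros intrinsic) is a common micro-optimization for this step, but whether it pays off depends on how well the loop's branches predict.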
Optimization Approach: INLINING + DISPATCH OPTIMIZATION
- Strategy: Inline tiny_alloc_gate_fast into malloc wrapper
- Eliminate function call overhead (save ~5-10 cycles)
- Improve I-cache locality (malloc + gate in same cache line)
- Enable cross-function optimization (compiler can optimize malloc→gate→fast_path as one unit)
- Expected Gain: +1-2% (reduce 18.99% self by 10-15% = ~2pp overall)
- Risk: Medium (I-cache pressure, as seen in A3 -4% regression)
Recommendation: DEFER - Route optimization saturated, inlining has I-cache risk
Candidate E3-2: free (22.19% self%) - WRAPPER OVERHEAD
Current State:
- Phase 2 B4: Wrapper hot/cold split (+1.47% gain)
- Wrapper shape already optimized (rare checks in cold path)
Why it's 22.19%:
- This is the `free()` wrapper function (libc entry point)
- Includes: LD mode check, jemalloc check, diagnostics, then dispatch to free_tiny_fast
Optimization Approach: WRAPPER BYPASS (IFUNC) or Function Pointer Caching
- Strategy 1 (IFUNC): Use GNU IFUNC to resolve malloc/free at load time
  - Direct binding: `malloc → tiny_alloc_gate_fast` (no wrapper layer)
  - Risk: HIGH (ABI compatibility, thread-safety)
- Strategy 2 (Function Pointer): Cache `g_free_impl` in TLS
  - Check once at thread init, then direct call
  - Risk: Medium, lower gain (+1-2%)
Recommendation: HIGH PRIORITY - Large potential gain, prototype with function pointer approach first
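Strategy 2 can be prototyped along these lines. This is a hedged sketch, not hakmem's actual wrapper: `resolve_free_impl` stands in for the real LD-mode/jemalloc checks, and `free_tiny_fast_stub` for the real fast path. The idea is that the mode checks run once per thread, after which every `free()` is a single indirect call through a TLS-cached pointer.

```c
#include <stdlib.h>

typedef void (*free_fn)(void *);

// Stand-in for the real fast free path.
static void free_tiny_fast_stub(void *p) { free(p); }

// Stand-in for the once-per-thread LD-mode / jemalloc resolution.
static free_fn resolve_free_impl(void) {
    return free_tiny_fast_stub;
}

static __thread free_fn t_free_impl; // cached per-thread dispatch target

static inline void hak_free(void *p) {
    if (t_free_impl == NULL)          // cold: first call on this thread
        t_free_impl = resolve_free_impl();
    t_free_impl(p);                   // hot: direct call, no mode checks
}
```

The thread-safety risk noted above is why the cache is TLS rather than a global: each thread resolves independently, so no synchronization is needed on the hot path.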
Candidate E3-3: free_tiny_fast_cold (7.32% self%) - COLD PATH OPTIMIZATION
Current State:
- Phase FREE-DUALHOT: Hot/cold split (+13% gain for C0-C3 hot path)
- Cold path handles C4-C7 (~50% of frees)
Optimization Approach: C4-C7 ROUTE SPECIALIZATION
- Strategy: Create per-class cold paths (similar to E2 alloc attempt)
- Expected Gain: +0.5-1.0%
- Risk: Low
Recommendation: MEDIUM PRIORITY - Incremental gain, but may hit diminishing returns like E2
Candidate E3-4: hakmem_env_snapshot_enabled (3.22% self%) - ENV OVERHEAD REDUCTION ⭐
Current State:
- Phase 4 E1: ENV snapshot consolidation (+3.92% gain)
- 3 separate ENV gates → 1 consolidated snapshot
Why it's 3.22%:
- This IS the optimization (consolidation point)
- Still checking `g_hakmem_env_snapshot.initialized` on every call
- TLS read overhead (1 TLS variable vs 3, but still 1 read per hot path)
Optimization Approach: LAZY INIT ELIMINATION
- Strategy: Force ENV snapshot initialization at library load time (constructor)
  - Use `__attribute__((constructor))` to init before main()
  - Eliminate `if (!initialized)` check in hot path
  - Make `hakmem_env_get()` a pure TLS read (no branch)
- Expected Gain: +0.5-1.5% (eliminate 3.22% check overhead)
- Risk: Low (standard initialization pattern)
- Implementation:

```c
__attribute__((constructor))
static void hakmem_env_snapshot_init_early(void) {
    hakmem_env_snapshot_init(); // Force init before any alloc/free
}

static inline const hakmem_env_snapshot* hakmem_env_get(void) {
    return &g_hakmem_env_snapshot; // No check, just return
}
```
Recommendation: HIGH PRIORITY - Clean win, low risk, eliminates E1's remaining overhead
Candidate E3-5: tiny_region_id_write_header (2.60% self%) - HEADER WRITE OPTIMIZATION
Current State:
- Phase 1 A3: always_inline attempt → -4.00% regression (NO-GO)
- I-cache pressure issue identified
Optimization Approach: SELECTIVE INLINING
- Strategy: Inline only for hot classes (C7 ULTRA, C0-C3 LEGACY)
- Expected Gain: +0.5-1.0%
- Risk: Medium (I-cache effects)
Recommendation: LOW PRIORITY - A3 already explored, I-cache risk remains
E3 Candidate Ranking
| Rank | Candidate | self% | Approach | Expected Gain | Risk | ROI |
|---|---|---|---|---|---|---|
| 1 | hakmem_env_snapshot_enabled | 3.22% | Constructor init | +0.5-1.5% | Low | ⭐⭐⭐ |
| 2 | free wrapper | 22.19% | Function pointer cache | +1-2% | Medium | ⭐⭐⭐ |
| 3 | tiny_alloc_gate_fast | 18.99% | Inlining | +1-2% | High (I-cache) | ⭐⭐ |
| 4 | free_tiny_fast_cold | 7.32% | Route specialization | +0.5-1.0% | Low | ⭐⭐ |
| 5 | tiny_region_id_write_header | 2.60% | Selective inline | +0.5-1.0% | Medium | ⭐ |
Part 4: Summary & Recommendations
E2 Final Decision
Decision: ✅ FREEZE DEFINITIVELY
Rationale:
- Result is consistent: -0.21% mean, -0.62% median across 10 runs
- Root cause clear: Alloc route optimization saturated by Phase 3 C3 static routing
- Free vs Alloc asymmetry: Free DUALHOT skips expensive ops, alloc skips cached ops
- No alternative testing needed: Workload/iteration changes won't fix structural issue
- Lesson learned: Per-class specialization only works when bypassing uncached overhead
Action:
- Keep `HAKMEM_TINY_ALLOC_DUALHOT=0` as default (research box frozen)
- Document in CURRENT_TASK.md as NEUTRAL result
- No further investigation warranted
Perf Findings (E1 Enabled Baseline)
Throughput: 45.26M ops/s (+3.92% from pre-E1 baseline)
Hot Spots (self% >= 5%):
- free (22.19%) - Wrapper overhead
- tiny_alloc_gate_fast (18.99%) - Route overhead (saturated)
- main (15.21%) - Benchmark driver
- malloc (13.36%) - Wrapper overhead
- free_tiny_fast_cold (7.32%) - C4-C7 free path
E1 Impact:
- ENV overhead consolidated: 3.26% (3 functions) → 3.22% (1 function)
- Gain from reduced TLS pressure: +3.92%
- Remaining opportunity: Eliminate lazy init check (3.22% → 0%)
New Hot Spots:
- hakmem_env_snapshot_enabled: 3.22% (consolidation point)
Changes from Pre-E1:
- tiny_alloc_gate_fast: +3.6pp (15.37% → 18.99%)
- free: +2.5pp (~19% → 22.19%)
- unified_cache_push: -1.4pp (3.97% → 2.56%)
E3 Recommendation
Primary Target: hakmem_env_snapshot_enabled (E3-4)
Approach: Constructor-based initialization
- Force ENV snapshot init at library load time
- Eliminate lazy init check in hot path
- Make `hakmem_env_get()` a pure TLS read (no branch)
Expected Gain: +0.5-1.5%
Implementation Complexity: Low (2-day task)
- Add `__attribute__((constructor))` function
- Remove init check from `hakmem_env_get()`
- A/B test with 10-run Mixed + 5-run C6-heavy
Rationale:
- Low risk: Standard initialization pattern (used by jemalloc, tcmalloc)
- Clear gain: Eliminates 3.22% overhead (lazy init check)
- Compounds E1: Completes ENV snapshot optimization started in E1
- Different vector: Not route/TLS optimization - this is initialization overhead reduction
Success Criteria:
- Mean gain >= +0.5% (conservative)
- No regression on any profile
- Health check passes
Secondary Target: free wrapper (E3-2)
Approach: Function pointer caching
- Cache `g_free_impl` in TLS at thread init
- Direct call instead of LD mode check + dispatch
- Lower risk than IFUNC approach
Expected Gain: +1-2%
Implementation Complexity: Medium (3-4 day task)
Risk: Medium (thread-safety, initialization order)
Phase 4 Status
Active Optimizations:
- E1 (ENV Snapshot): +3.92% ✅ GO (research box, default OFF / opt-in)
- E3-4 (ENV Constructor Init): ❌ NO-GO (frozen, default OFF, requires E1)
Frozen Optimizations:
- D3 (Alloc Gate Shape): +0.56% ⚪ NEUTRAL (research box, default OFF)
- E2 (Alloc Per-Class FastPath): -0.21% ⚪ NEUTRAL (research box, default OFF)
Cumulative Gain (Phase 2-4):
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- D1 (Free route cache): +2.19%
- E1 (ENV snapshot): +3.92%
- Total (Phase 4): ~+3.9% (E1 only)
Baseline (reference):
- E1=1, CTOR=0: 45.26M ops/s (Mixed, 40M iters, ws=400)
- E1=1, CTOR=1: 46.86M ops/s (Mixed, 20M iters, ws=400; re-validation: -1.44%)
Remaining Potential:
- E3-2 (Wrapper function ptr): +1-2%
- E3-3 (Free route special): +0.5-1.0%
- Realistic ceiling: ~48-50M ops/s (without major redesign)
Next Steps
Immediate (Priority 1)
- Freeze E2 in CURRENT_TASK.md
  - Document NEUTRAL decision (-0.21%)
  - Add root cause explanation (route caching saturation)
  - Mark as research box (default OFF, frozen)
- E3-4 promotion gate (re-validation)
  - E3-4 initially reached GO, but re-confirm with a 10-run A/B after the remaining groundwork (branch hint tuning, refresh)
  - A/B: Mixed 10-run (E1=1, CTOR=0 vs 1)
  - Health check: `scripts/verify_health_profiles.sh`
Short-term (Priority 2)
- Re-run perf with E1/E3-4 ON
  - Confirm `hakmem_env_snapshot_enabled` drops out of the Top list / its self% decreases significantly
  - Pick the next focus (alloc gate / free_tiny_fast_cold / wrapper) using the "self% >= 5%" criterion
Long-term (Priority 3)
- Consider non-incremental approaches
- Mimalloc-style TLS bucket redesign (major overhaul)
- Static-compiled routing (eliminate runtime policy)
- IFUNC for zero-overhead wrapper (high risk)
Lessons Learned
Route Optimization Saturation
Observation: E2 (alloc per-class) showed -0.21% neutral despite free path success (+13%)
Insight:
- Route optimization has diminishing returns after static caching (C3)
- Further specialization adds branch overhead without eliminating cost
- Lesson: Don't pursue per-class specialization on already-cached paths
Shape Optimization Plateau
Observation: D3 (alloc gate shape) showed +0.56% neutral despite B3 success (+2.89%)
Insight:
- Branch prediction saturates after initial tuning
- LIKELY/UNLIKELY hints have limited benefit on well-trained branches
- Lesson: Shape optimization good for first pass, limited ROI after
ENV Consolidation Success
Observation: E1 (ENV snapshot) achieved +3.92% gain
Insight:
- Reducing TLS pressure (3 vars → 1 var) has measurable benefit
- Consolidation point still has overhead (3.22% self%)
- Lesson: Constructor init is next logical step (eliminate lazy check)
Inlining I-Cache Risk
Observation: A3 (header always_inline) showed -4% regression on Mixed
Insight:
- Aggressive inlining can thrash I-cache on mixed workloads
- Selective inlining (per-class) may work but needs careful profiling
- Lesson: Inlining is high-risk, constructor/caching approaches safer
Realistic Expectations
Current State: 45M ops/s (E1 enabled)
Target: 48-50M ops/s (with E3-4, E3-2)
Ceiling: ~55-60M ops/s (without major redesign)
Gap to mimalloc: ~2.5x (128M vs 55M ops/s)
Why large gap remains:
- Architectural overhead: 4-5 layer design (wrapper → gate → policy → route → handler) vs mimalloc's 1-layer TLS buckets
- Per-call policy: hakmem evaluates policy on every call, mimalloc uses static TLS layout
- Instruction overhead: ~50-100 instructions per alloc/free vs mimalloc's ~10-15
Next phase options:
- Incremental (E3-4, E3-2): +1-3% gains, safe, diminishing returns
- Structural redesign: +20-50% potential, high risk, months of work
- Workload-specific tuning: Optimize for specific profiles (C6-heavy, C7-only), not general Mixed
Recommendation: Pursue E3-4 (low-hanging fruit), then re-evaluate if structural redesign warranted.
Analysis Complete: 2025-12-14
Next Action: Implement E3-4 (ENV Constructor Init)
Expected Timeline: 2-3 days (design → implement → A/B → decision)