Moe Charm (CI) 6cdbd815ab Phase 5 E4 Combined: E4-1 + E4-2 (+6.43% GO, baseline consolidated)

Combined A/B Test Results (10-run Mixed):
- Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median)
- Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median)
- Improvement: +6.43% mean, +6.74% median

Interaction Analysis:
- E4-1 alone: +3.51% (measured in separate session)
- E4-2 alone: +21.83% (measured in separate session)
- Combined: +6.43% (measured in same binary)
- Pattern: SUBADDITIVE (overlapping bottlenecks)

Key Finding: Single-binary incremental gain is the accurate metric
- E4-1 and E4-2 target overlapping TLS/branch resources
- Individual measurements were from different baselines/sessions
- Combined measurement (same binary, both flags) shows true progress

Phase 5 Total Progress:
- Original baseline (session start): 35.74M ops/s
- Combined optimized: 47.34M ops/s
- Total gain: +32.4% (cross-session, reference only)
- Same-binary gain: +6.43% (E4-1+E4-2 both ON vs both OFF)

New Baseline Perf Profile (47.0M ops/s):
- free: 37.56% self% (still top hotspot)
- tiny_alloc_gate_fast: 13.73% (reduced from 19.50%)
- malloc: 12.95% (reduced from 16.13%)
- tiny_region_id_write_header: 6.97% (header write tax)
- hakmem_env_snapshot_enabled: 4.29% (ENV overhead visible)

Health Check: PASS
- MIXED_TINYV3_C7_SAFE: 42.3M ops/s
- C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s

Phase 5 E5 Candidates (from perf profile):
- E5-1: free() path internals (37.56% self%)
- E5-2: Header write reduction (6.97% self%)
- E5-3: ENV snapshot overhead (4.29% self%)

Deliverables:
- docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md
- docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md
- CURRENT_TASK.md (E4 combined complete, E5 candidates)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (E5 pointer)
- perf.data.e4combined (perf profile data)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-14 05:36:57 +09:00

Phase 5 E4 Combined (E4-1 + E4-2) - A/B Test Results

Date: 2025-12-14
Status: ✅ GO (+6.43% mean gain)
New Baseline: 47.34M ops/s (Mixed, 20M iters, ws=400)

Executive Summary

The combined effect of E4-1 (Free Wrapper ENV Snapshot) and E4-2 (Malloc Wrapper ENV Snapshot) is a +6.43% improvement in a same-binary A/B test. The individual A/B numbers are reference-only (measured in different sessions) and should not be summed.

Key Finding: ENV snapshot optimizations target overlapping resources (TLS access, cache lines, branch predictor). Even when both are “wins” independently, the combined incremental gain is not additive.

A/B Test Results (Mixed, 10-run, 20M iters, ws=400)

Baseline Configuration (both OFF)

HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=0
HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=0

Results:

Mean: 44.48M ops/s
Median: 44.39M ops/s
StdDev: 0.38M ops/s

Raw data (ops/s):

45041282, 44252030, 44962831, 44159599, 44219264,
44339939, 44436723, 43943643, 44939786, 44475893

Optimized Configuration (both ON)

HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1
HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1

Results:

Mean: 47.34M ops/s
Median: 47.38M ops/s
StdDev: 0.42M ops/s

Raw data (ops/s):

47805624, 46325254, 47678853, 47318676, 47444745,
47296416, 47244865, 47484869, 47698161, 47094537

Performance Delta

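| Metric | Baseline (ops/s) | Optimized (ops/s) | Gain |
|--------|------------------|-------------------|------|
| Mean | 44.48M | 47.34M | +6.43% ✅ |
| Median | 44.39M | 47.38M | +6.74% ✅ |
| StdDev | 0.38M | 0.42M | +10.5% (slightly higher variance) |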

Decision: ✅ GO (+6.43% >= +1.0% threshold)
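
The mean, median, and gain figures can be re-derived from the raw data listed above. The following is a minimal C sketch for doing so, not code from hakmem: the `ab_*` helper names and the `AB_*` arrays are hypothetical, and the +1.0% threshold is the one quoted in the Decision line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Raw ops/s samples copied from the two 10-run result blocks above. */
static const double AB_BASELINE[10] = {
    45041282, 44252030, 44962831, 44159599, 44219264,
    44339939, 44436723, 43943643, 44939786, 44475893};
static const double AB_OPTIMIZED[10] = {
    47805624, 46325254, 47678853, 47318676, 47444745,
    47296416, 47244865, 47484869, 47698161, 47094537};

/* Arithmetic mean of n samples. */
static double ab_mean(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (double)n;
}

/* qsort comparator for doubles (ascending). */
static int ab_cmp_double(const void *a, const void *b) {
    double da = *(const double *)a, db = *(const double *)b;
    return (da > db) - (da < db);
}

/* Median; sorts a private copy so the caller's array is left untouched. */
static double ab_median(const double *x, size_t n) {
    double tmp[64];
    assert(n > 0 && n <= 64); /* sketch-only limit on the sample count */
    memcpy(tmp, x, n * sizeof(double));
    qsort(tmp, n, sizeof(double), ab_cmp_double);
    return (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}

/* Relative gain of optimized over baseline, in percent. */
static double ab_gain_pct(double baseline, double optimized) {
    return (optimized / baseline - 1.0) * 100.0;
}

/* GO when the mean gain meets or exceeds the threshold (in percent). */
static int ab_is_go(double gain_pct, double threshold_pct) {
    return gain_pct >= threshold_pct;
}
```

Calling `ab_gain_pct(ab_mean(AB_BASELINE, 10), ab_mean(AB_OPTIMIZED, 10))` gives roughly 6.43, and the same call on the medians gives roughly 6.74, matching the table.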

Individual vs Combined Analysis

Individual reference results (measured in separate sessions, so reference values only)

E4-1 (free wrapper snapshot) A/B: 45.35M → 46.94M (+3.51%)
E4-2 (malloc wrapper snapshot) A/B: 35.74M → 43.54M (+21.83%)

Combined (same-binary comparison, so authoritative)

both OFF: 44.48M
both ON: 47.34M (+6.43% mean / +6.74% median)

Interaction Analysis

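The standalone A/B results for E4-1 and E4-2 were measured in separate sessions (different binary states), so simply adding them (+3.51% + 21.83%) is not a valid comparison.

The Combined A/B in this document (switching both flags OFF/ON in the same binary) is the only comparison that currently gives the correct incremental gain.

Combined conclusion:

Same-binary comparison: +6.43% mean / +6.74% median ✅
The standalone wins are real, but for the interaction (the incremental gain with both ON) the Combined result is the one to use.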

Why Subadditive? Technical Analysis

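1. Baseline mismatch (differing preconditions in the standalone tests)

The standalone A/B tests for E4-1 and E4-2 did not share measurement conditions (binary state, ENV, surrounding optimizations), so building an "additive expected value" from them makes the combined result appear subadditive.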

2. Shared Bottlenecks

Both optimizations target the same underlying resources:

TLS access consolidation: Reducing multiple TLS reads to single snapshot
Memory bandwidth: TLS reads compete for same cache lines
Cache hierarchy: ENV data shares L1/L2 cache space

Once TLS access is optimized in one path (e.g., free), optimizing it in the other path (malloc) yields diminishing returns.

3. Branch Predictor Saturation

Both ENV snapshot checks add branches:

// Free path (E4-1)
if (free_wrapper_env_snapshot_enabled()) {
    struct free_wrapper_env_snapshot snap = free_wrapper_env_snapshot();
    // ...
}

// Malloc path (E4-2)
if (malloc_wrapper_env_snapshot_enabled()) {
    struct malloc_wrapper_env_snapshot snap = malloc_wrapper_env_snapshot();
    // ...
}

These branches compete for branch predictor entries, so the overhead of enabling both checks is not simply the sum of each one measured alone.

4. Measurement Methodology

Individual tests were run sequentially, not in isolation:

E4-1 was tested first (changing code + binary)
E4-2 was tested on top of E4-1's code changes
Combined test uses both, but baseline may have drifted

Lesson: Always measure combined effect from a clean baseline with all optimizations OFF.

Health Check Results

scripts/verify_health_profiles.sh

Status: ✅ PASS (all profiles passed)

Profile 1: MIXED_TINYV3_C7_SAFE

Throughput: 42.3M ops/s
Status: PASS

Profile 2: C6_HEAVY_LEGACY_POOLV1

Throughput: 20.9M ops/s
Status: PASS

No regressions detected in health profiles.

Perf Profile (New Baseline: E4 Combined ON)

Command:

HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1 \
HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1 \
perf record -F 99 -- ./bench_random_mixed_hakmem 20000000 400 1

Throughput: 47.0M ops/s (20M iters, ws=400)
Samples: 52 samples @ 99Hz

Top Hot Spots (self% >= 2.0%)

| Rank | Function | Self% | Notes |
|------|----------|-------|-------|
| 1 | free | 37.56% | Wrapper + gate (still dominant) |
| 2 | tiny_alloc_gate_fast | 13.73% | Reduced from 19.50% (E4-2 effect) |
| 3 | malloc | 12.95% | Reduced from 16.13% (E4-2 effect) |
| 4 | main | 11.13% | Benchmark driver |
| 5 | tiny_region_id_write_header | 6.97% | Header write tax |
| 6 | tiny_c7_ultra_alloc | 4.56% | C7 alloc path |
| 7 | hakmem_env_snapshot_enabled | 4.29% | ENV snapshot overhead (NEW) |
| 8 | tiny_get_max_size | 4.24% | Size limit check |
| 9 | tiny_route_for_class | 2.27% | Route lookup |
| 10 | unified_cache_push | 2.13% | TLS cache push |

Key Observations

free() dominance: 37.56% self% is the largest single hot spot
- Already optimized with ENV snapshot (E4-1)
- Further optimization requires analyzing free() internals
malloc/alloc gate reduction: E4-2 successfully reduced combined malloc + tiny_alloc_gate_fast
- Before: 16.13% + 19.50% = 35.63%
- After: 12.95% + 13.73% = 26.68%
- Reduction: -8.95 percentage points ✅
ENV snapshot overhead visible: hakmem_env_snapshot_enabled() now shows 4.29% self%
- This is the cost of ENV snapshot checks
- Offset by larger gains from TLS consolidation
- Future: Consider caching enabled() result in hot paths
Header write tax: tiny_region_id_write_header (6.97%) is a candidate for E5
- Phase 1 A3 showed always_inline is NO-GO (-4% on Mixed)
- Alternative: Reduce write frequency (selective mode, cached headers)

Next Phase 5 Candidates (from perf profile)

E5-1: free() Path Internals (37.56% self%)

Target: Analyze free_tiny_fast() / free_tiny_fast_hotcold() structure
Opportunity: Largest single hot spot, but already heavily optimized
Challenge: Diminishing returns (already has ENV snapshot, hotcold split, static routing)
Estimated ROI: Medium (+2-5%)

E5-2: Header Write Reduction (6.97% self%)

Target: tiny_region_id_write_header() call frequency
Strategy: Conditional header writes (write only when needed)
Challenge: Phase 1 A3 showed always_inline causes I-cache pressure (-4%)
Estimated ROI: Medium (+1-3%)
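
One way to read "write only when needed" is to skip the store when the header already holds the wanted value. The sketch below illustrates that idea only. The `demo_header_t` layout, the `demo_*` names, and the store counter are all hypothetical, since this report does not show what `tiny_region_id_write_header()` actually writes.

```c
#include <stdint.h>

/* Hypothetical 1-byte block header; hakmem's real layout is not shown here. */
typedef struct {
    uint8_t class_tag;
} demo_header_t;

/* Instrumentation for this sketch only: counts stores that were really issued. */
static unsigned long demo_header_stores = 0;

/* Store the tag only when it differs from what the header already holds. */
static inline void demo_write_header_if_needed(demo_header_t *h, uint8_t class_tag) {
    if (h->class_tag != class_tag) {
        h->class_tag = class_tag;
        demo_header_stores++;
    }
}
```

This trades a store for a load and a branch, so whether it helps would have to be confirmed with the same same-binary A/B protocol used in this report. It also assumes the header byte is initialized before the first comparison.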

E5-3: ENV Snapshot Overhead (4.29% self%)

Target: hakmem_env_snapshot_enabled() check cost
Strategy: Cache enabled() result in TLS per-thread
Opportunity: Remove repeated enabled() checks in hot loops
Estimated ROI: Low-Medium (+1-2%)
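
The per-thread caching strategy could look roughly like the sketch below. It is an illustration under stated assumptions, not hakmem's implementation: `slow_env_snapshot_enabled()` is a hypothetical stand-in for the real check, and the parsing rule (enabled only when the variable starts with `1`) is assumed rather than taken from the source.

```c
#include <stdlib.h>

/* Counts slow-path evaluations so the effect of the cache can be observed. */
static int slow_enabled_calls = 0;

/* Hypothetical stand-in for the real enabled() check (parsing rule is assumed). */
static int slow_env_snapshot_enabled(void) {
    slow_enabled_calls++;
    const char *v = getenv("HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT");
    return v != NULL && v[0] == '1';
}

/* Per-thread cache: -1 = not yet resolved, 0 = disabled, 1 = enabled. */
static _Thread_local int tls_enabled_cache = -1;

/* Hot-path check: one TLS read after the first call on each thread. */
static inline int env_snapshot_enabled_cached(void) {
    int cached = tls_enabled_cache;
    if (cached < 0) {
        cached = slow_env_snapshot_enabled();
        tls_enabled_cache = cached;
    }
    return cached;
}
```

With this shape, each thread pays for the slow check once, and later changes to the environment variable are not observed by that thread. That fits a flag that is fixed for the lifetime of the process.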

Cumulative Phase 5 Status

Individual Optimizations

E4-1 (Free Wrapper ENV Snapshot): +3.51% standalone
E4-2 (Malloc Wrapper ENV Snapshot): +21.83% standalone (on E4-1 baseline)

Combined Effect

E4 Combined: +6.43% (from "both OFF" baseline of 44.48M)
Overall Phase 5 Progress: 35.74M → 47.34M = +32.4%

Interaction Type

SUBADDITIVE: Combined gain (6.43%) < Sum of individual gains (25.34%)
Reason: standalone results were measured against different, drifting baselines; both optimizations also share TLS/cache resources
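
For reference, the arithmetic behind the subadditive label can be written out as below. This is a small sketch with hypothetical helper names. The compounded variant is included only for comparison and is not a figure reported in this document. Both expectations rest on the cross-session standalone numbers, so they are reference values only.

```c
/* Naive additive expectation: sum the standalone percentages. */
static double expected_additive_pct(double a_pct, double b_pct) {
    return a_pct + b_pct;
}

/* Compounded expectation: apply one gain on top of the other. */
static double expected_compounded_pct(double a_pct, double b_pct) {
    return ((1.0 + a_pct / 100.0) * (1.0 + b_pct / 100.0) - 1.0) * 100.0;
}

/* Subadditive when the measured combined gain is below the additive sum. */
static int is_subadditive(double combined_pct, double a_pct, double b_pct) {
    return combined_pct < expected_additive_pct(a_pct, b_pct);
}
```

With the standalone values 3.51 and 21.83, the additive expectation is 25.34% and the compounded one is about 26.11%. The measured 6.43% is below both.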

Key Insight

ENV snapshot optimizations work best when applied early. Stacking multiple ENV snapshots yields diminishing returns due to:

Shared TLS access patterns
Branch predictor competition
Cache line contention
Baseline measurement drift

Next Steps

Immediate Actions

✅ Update CURRENT_TASK.md with E4 combined results
✅ Create PHASE5_E4_COMBINED_AB_TEST_RESULTS.md
Profile analysis: Identify E5 candidates

Future Phase 5 Work

E5-1: free() path internals optimization
- Analyze free_tiny_fast_hotcold() structure
- Consider: unified cache optimization, hotcold threshold tuning
E5-2: Header write reduction
- Selective header writes (only when classification needed)
- Cached header mode (write once, reuse)
E5-3: ENV snapshot overhead reduction
- Cache enabled() result in TLS
- Eliminate repeated checks in hot loops

Long-term Considerations

Baseline stability: Need consistent baseline measurement protocol
Measurement methodology: Test combined effects from clean baseline (all OFF)
Diminishing returns: the ENV snapshot pattern is plateauing (+6.43% combined vs. +25.34% from naively summing the standalone results)

References

E4-1 Design: docs/analysis/PHASE5_E4_1_FREE_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
E4-2 Design: docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
Combined Instructions: docs/analysis/PHASE5_E4_COMBINED_AB_TEST_NEXT_INSTRUCTIONS.md
CURRENT_TASK.md: Updated with E4 combined results

Conclusion

Decision: ✅ GO - Keep both optimizations DEFAULT ON

Rationale:

Combined gain (+6.43%) exceeds threshold (+1.0%)
New baseline (47.34M ops/s) is the highest achieved in Phase 5
Health checks pass with no regressions
Both optimizations provide value, even if subadditive

Action Items:

Keep HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1 (default ON)
Keep HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1 (default ON)
Shift focus to next bottleneck (free path internals or header write)
Update perf profile baseline to 47.34M ops/s for future comparisons

Phase 5 Progress: 35.74M → 47.34M ops/s = +32.4% cumulative gain ✅
