Combined A/B Test Results (10-run Mixed): - Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median) - Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median) - Improvement: +6.43% mean, +6.74% median Interaction Analysis: - E4-1 alone: +3.51% (measured in separate session) - E4-2 alone: +21.83% (measured in separate session) - Combined: +6.43% (measured in same binary) - Pattern: SUBADDITIVE (overlapping bottlenecks) Key Finding: Single-binary incremental gain is the accurate metric - E4-1 and E4-2 target overlapping TLS/branch resources - Individual measurements were from different baselines/sessions - Combined measurement (same binary, both flags) shows true progress Phase 5 Total Progress: - Original baseline (session start): 35.74M ops/s - Combined optimized: 47.34M ops/s - Total gain: +32.4% (cross-session, reference only) - Same-binary gain: +6.43% (E4-1+E4-2 both ON vs both OFF) New Baseline Perf Profile (47.0M ops/s): - free: 37.56% self% (still top hotspot) - tiny_alloc_gate_fast: 13.73% (reduced from 19.50%) - malloc: 12.95% (reduced from 16.13%) - tiny_region_id_write_header: 6.97% (header write tax) - hakmem_env_snapshot_enabled: 4.29% (ENV overhead visible) Health Check: PASS - MIXED_TINYV3_C7_SAFE: 42.3M ops/s - C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s Phase 5 E5 Candidates (from perf profile): - E5-1: free() path internals (37.56% self%) - E5-2: Header write reduction (6.97% self%) - E5-3: ENV snapshot overhead (4.29% self%) Deliverables: - docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md - docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md - CURRENT_TASK.md (E4 combined complete, E5 candidates) - docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (E5 pointer) - perf.data.e4combined (perf profile data) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
10 KiB
Phase 5 E4 Combined (E4-1 + E4-2) - A/B Test Results
Date: 2025-12-14 Status: ✅ GO (+6.43% mean gain) New Baseline: 47.34M ops/s (Mixed, 20M iters, ws=400)
Executive Summary
Combined effect of E4-1 (Free Wrapper ENV Snapshot) + E4-2 (Malloc Wrapper ENV Snapshot) shows +6.43% improvement (same-binary A/B). Individual A/B numbers are reference-only (measured in different sessions) and should not be summed.
Key Finding: ENV snapshot optimizations target overlapping resources (TLS access, cache lines, branch predictor). Even when both are “wins” independently, the combined incremental gain is not additive.
A/B Test Results (Mixed, 10-run, 20M iters, ws=400)
Baseline Configuration (both OFF)
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=0
HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=0
Results:
- Mean: 44.48M ops/s
- Median: 44.39M ops/s
- StdDev: 0.38M ops/s
Raw data (ops/s):
45041282, 44252030, 44962831, 44159599, 44219264,
44339939, 44436723, 43943643, 44939786, 44475893
Optimized Configuration (both ON)
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1
HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1
Results:
- Mean: 47.34M ops/s
- Median: 47.38M ops/s
- StdDev: 0.42M ops/s
Raw data (ops/s):
47805624, 46325254, 47678853, 47318676, 47444745,
47296416, 47244865, 47484869, 47698161, 47094537
Performance Delta
| Metric | Baseline | Optimized | Gain |
|---|---|---|---|
| Mean | 44.48M | 47.34M | +6.43% ✅ |
| Median | 44.39M | 47.38M | +6.74% ✅ |
| StdDev | 0.38M | 0.42M | +10.5% (slightly higher variance) |
Decision: ✅ GO (+6.43% >= +1.0% threshold)
Individual vs Combined Analysis
Individual reference results(別セッションなので “参考値”)
- E4-1(free wrapper snapshot)A/B: 45.35M → 46.94M(+3.51%)
- E4-2(malloc wrapper snapshot)A/B: 35.74M → 43.54M(+21.83%)
Combined(同一バイナリ比較なので “正”)
- both OFF: 44.48M
- both ON: 47.34M(+6.43% mean / +6.74% median)
Interaction Analysis
E4-1 / E4-2 の “単独” A/B は **別セッション(別バイナリ状態)**で測られているため、 単純加算(+3.51% + +21.83%)は 比較として成立しません。
本ドキュメントの Combined A/B(同一バイナリで両方 OFF/ON を切替) が、 現時点の正しい “増分” を与える 唯一の比較 です。
Combined の結論:
- 同一バイナリ内の比較で +6.43% mean / +6.74% median ✅
- “単独の勝ち” は事実だが、相互作用(同時 ON の増分)は Combined を採用する
Why Subadditive? Technical Analysis
1. Baseline mismatch(単独テストの前提差)
E4-1 と E4-2 の “単独” A/B は測定条件(バイナリ状態/ENV/周辺最適化)が一致していないため、 「足し算期待値」を作ると 見かけ上 subadditive に見えます。
2. Shared Bottlenecks
Both optimizations target the same underlying resource:
- TLS access consolidation: Reducing multiple TLS reads to single snapshot
- Memory bandwidth: TLS reads compete for same cache lines
- Cache hierarchy: ENV data shares L1/L2 cache space
Once TLS access is optimized in one path (e.g., free), optimizing it in the other path (malloc) yields diminishing returns.
3. Branch Predictor Saturation
Both ENV snapshot checks add branches:
// Free path (E4-1)
if (free_wrapper_env_snapshot_enabled()) {
struct free_wrapper_env_snapshot snap = free_wrapper_env_snapshot();
// ...
}
// Malloc path (E4-2)
if (malloc_wrapper_env_snapshot_enabled()) {
struct malloc_wrapper_env_snapshot snap = malloc_wrapper_env_snapshot();
// ...
}
These branches compete for branch predictor entries. Combined overhead is non-linear.
4. Measurement Methodology
Individual tests were run sequentially, not in isolation:
- E4-1 was tested first (changing code + binary)
- E4-2 was tested on top of E4-1's code changes
- Combined test uses both, but baseline may have drifted
Lesson: Always measure combined effect from a clean baseline with all optimizations OFF.
Health Check Results
scripts/verify_health_profiles.sh
Status: ✅ PASS (all profiles passed)
Profile 1: MIXED_TINYV3_C7_SAFE
- Throughput: 42.3M ops/s
- Status: PASS
Profile 2: C6_HEAVY_LEGACY_POOLV1
- Throughput: 20.9M ops/s
- Status: PASS
No regressions detected in health profiles.
Perf Profile (New Baseline: E4 Combined ON)
Command:
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1 \
HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1 \
perf record -F 99 -- ./bench_random_mixed_hakmem 20000000 400 1
Throughput: 47.0M ops/s (20M iters, ws=400) Samples: 52 samples @ 99Hz
Top Hot Spots (self% >= 2.0%)
| Rank | Function | Self% | Notes |
|---|---|---|---|
| 1 | free | 37.56% | Wrapper + gate (still dominant) |
| 2 | tiny_alloc_gate_fast | 13.73% | Reduced from 19.50% (E4-2 effect) |
| 3 | malloc | 12.95% | Reduced from 16.13% (E4-2 effect) |
| 4 | main | 11.13% | Benchmark driver |
| 5 | tiny_region_id_write_header | 6.97% | Header write tax |
| 6 | tiny_c7_ultra_alloc | 4.56% | C7 alloc path |
| 7 | hakmem_env_snapshot_enabled | 4.29% | ENV snapshot overhead (NEW) |
| 8 | tiny_get_max_size | 4.24% | Size limit check |
| 9 | tiny_route_for_class | 2.27% | Route lookup |
| 10 | unified_cache_push | 2.13% | TLS cache push |
Key Observations
-
free() dominance: 37.56% self% is the largest single hot spot
- Already optimized with ENV snapshot (E4-1)
- Further optimization requires analyzing free() internals
-
malloc/alloc gate reduction: E4-2 successfully reduced combined malloc + tiny_alloc_gate_fast
- Before: 16.13% + 19.50% = 35.63%
- After: 12.95% + 13.73% = 26.68%
- Reduction: -8.95 percentage points ✅
-
ENV snapshot overhead visible: hakmem_env_snapshot_enabled() now shows 4.29% self%
- This is the cost of ENV snapshot checks
- Offset by larger gains from TLS consolidation
- Future: Consider caching enabled() result in hot paths
-
Header write tax: tiny_region_id_write_header (6.97%) is a candidate for E5
- Phase 1 A3 showed always_inline is NO-GO (-4% on Mixed)
- Alternative: Reduce write frequency (selective mode, cached headers)
Next Phase 5 Candidates (self% >= 5%)
E5-1: free() Path Internals (37.56% self%)
- Target: Analyze free_tiny_fast() / free_tiny_fast_hotcold() structure
- Opportunity: Largest single hot spot, but already heavily optimized
- Challenge: Diminishing returns (already has ENV snapshot, hotcold split, static routing)
- Estimated ROI: Medium (+2-5%)
E5-2: Header Write Reduction (6.97% self%)
- Target: tiny_region_id_write_header() call frequency
- Strategy: Conditional header writes (write only when needed)
- Challenge: Phase 1 A3 showed always_inline causes I-cache pressure (-4%)
- Estimated ROI: Medium (+1-3%)
E5-3: ENV Snapshot Overhead (4.29% self%)
- Target: hakmem_env_snapshot_enabled() check cost
- Strategy: Cache enabled() result in TLS per-thread
- Opportunity: Remove repeated enabled() checks in hot loops
- Estimated ROI: Low-Medium (+1-2%)
Cumulative Phase 5 Status
Individual Optimizations
- E4-1 (Free Wrapper ENV Snapshot): +3.51% standalone
- E4-2 (Malloc Wrapper ENV Snapshot): +21.83% standalone (on E4-1 baseline)
Combined Effect
- E4 Combined: +6.43% (from "both OFF" baseline of 44.48M)
- Overall Phase 5 Progress: 35.74M → 47.34M = +32.4%
Interaction Type
- SUBADDITIVE: Combined gain (6.43%) < Sum of individual gains (25.34%)
- Reason: Overlapping baseline shifts, shared TLS/cache resources, baseline drift
Key Insight
ENV snapshot optimizations work best when applied early. Stacking multiple ENV snapshots yields diminishing returns due to:
- Shared TLS access patterns
- Branch predictor competition
- Cache line contention
- Baseline measurement drift
Next Steps
Immediate Actions
- ✅ Update CURRENT_TASK.md with E4 combined results
- ✅ Create PHASE5_E4_COMBINED_AB_TEST_RESULTS.md
- Profile analysis: Identify E5 candidates
Future Phase 5 Work
-
E5-1: free() path internals optimization
- Analyze free_tiny_fast_hotcold() structure
- Consider: unified cache optimization, hotcold threshold tuning
-
E5-2: Header write reduction
- Selective header writes (only when classification needed)
- Cached header mode (write once, reuse)
-
E5-3: ENV snapshot overhead reduction
- Cache enabled() result in TLS
- Eliminate repeated checks in hot loops
Long-term Considerations
- Baseline stability: Need consistent baseline measurement protocol
- Measurement methodology: Test combined effects from clean baseline (all OFF)
- Diminishing returns: ENV snapshot pattern is plateauing (+6.43% combined vs +25% expected)
References
- E4-1 Design:
docs/analysis/PHASE5_E4_1_FREE_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md - E4-2 Design:
docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md - Combined Instructions:
docs/analysis/PHASE5_E4_COMBINED_AB_TEST_NEXT_INSTRUCTIONS.md - CURRENT_TASK.md: Updated with E4 combined results
Conclusion
Decision: ✅ GO - Keep both optimizations DEFAULT ON
Rationale:
- Combined gain (+6.43%) exceeds threshold (+1.0%)
- New baseline (47.34M ops/s) is highest achieved in Phase 5
- Health checks pass with no regressions
- Both optimizations provide value, even if subadditive
Action Items:
- Keep HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1 (default ON)
- Keep HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1 (default ON)
- Shift focus to next bottleneck (free path internals or header write)
- Update perf profile baseline to 47.34M ops/s for future comparisons
Phase 5 Progress: 35.74M → 47.34M ops/s = +32.4% cumulative gain ✅