# Phase 5 E4 Combined (E4-1 + E4-2) - A/B Test Results
**Date**: 2025-12-14
**Status**: ✅ GO (+6.43% mean gain)
**New Baseline**: 47.34M ops/s (Mixed, 20M iters, ws=400)
---
## Executive Summary
Combined effect of E4-1 (Free Wrapper ENV Snapshot) + E4-2 (Malloc Wrapper ENV Snapshot) shows **+6.43% improvement** (same-binary A/B). Individual A/B numbers are **reference-only** (measured in different sessions) and should not be summed.
**Key Finding**: ENV snapshot optimizations target overlapping resources (TLS access, cache lines, branch predictor). Even when both are “wins” independently, the combined incremental gain is not additive.
---
## A/B Test Results (Mixed, 10-run, 20M iters, ws=400)
### Baseline Configuration (both OFF)
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=0
HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=0
```
**Results**:
- Mean: **44.48M ops/s**
- Median: **44.39M ops/s**
- StdDev: **0.38M ops/s**
Raw data (ops/s):
```
45041282, 44252030, 44962831, 44159599, 44219264,
44339939, 44436723, 43943643, 44939786, 44475893
```
### Optimized Configuration (both ON)
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1
HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1
```
**Results**:
- Mean: **47.34M ops/s**
- Median: **47.38M ops/s**
- StdDev: **0.42M ops/s**
Raw data (ops/s):
```
47805624, 46325254, 47678853, 47318676, 47444745,
47296416, 47244865, 47484869, 47698161, 47094537
```
### Performance Delta
| Metric | Baseline | Optimized | Gain |
|--------|----------|-----------|------|
| **Mean** | 44.48M | 47.34M | **+6.43%** ✅ |
| **Median** | 44.39M | 47.38M | **+6.74%** ✅ |
| **StdDev** | 0.38M | 0.42M | +10.5% (slightly higher variance) |
**Decision**: ✅ **GO** (+6.43% >= +1.0% threshold)
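
The Gain column and the GO decision follow from two small formulas. The sketch below is illustrative only: `pct_gain` and `is_go` are hypothetical names rather than project functions, and the +1.0% threshold is the one stated in the decision line. With the table's rounded means, `pct_gain(44.48, 47.34)` evaluates to roughly 6.43, which clears the threshold.

```c
#include <assert.h>
#include <stdbool.h>

/* Relative gain of `optimized` over `baseline`, in percent. */
double pct_gain(double baseline, double optimized) {
    assert(baseline > 0.0);
    return (optimized - baseline) / baseline * 100.0;
}

/* GO when the mean gain meets or exceeds the threshold (in percent). */
bool is_go(double gain_pct, double threshold_pct) {
    return gain_pct >= threshold_pct;
}
```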
---
## Individual vs Combined Analysis
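### Individual results (separate sessions, so reference values only)
- E4-1 (free wrapper snapshot) A/B: 45.35M → 46.94M (+3.51%)
- E4-2 (malloc wrapper snapshot) A/B: 35.74M → 43.54M (+21.83%)
### Combined (same-binary comparison, so the authoritative figure)
- both OFF: 44.48M
- both ON: 47.34M (+6.43% mean / +6.74% median)
### Interaction Analysis
The individual E4-1 and E4-2 A/B tests were measured in **separate sessions (different binary states)**, so simply adding the gains (+3.51% + +21.83%) is **not a valid comparison**.
The **Combined A/B in this document (both flags toggled OFF/ON in the same binary)** is the **only comparison** that currently gives the correct incremental gain.
**Combined conclusion**:
- Same-binary comparison: **+6.43% mean / +6.74% median** ✅
- Each optimization did win on its own, but for the **interaction (the incremental gain with both ON), the Combined result is the figure to adopt**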
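
To make the non-comparability concrete, the sketch below applies the cross-session percentages to this document's own both-OFF baseline. The helper names are hypothetical. Applying the naive sum of +25.34% to 44.48M would predict about 55.75M ops/s, well above the 47.34M actually measured, which illustrates why percentages taken against the 45.35M and 35.74M baselines cannot be carried over to the 44.48M one.

```c
#include <assert.h>

/* Percent gain implied by stacking two gains multiplicatively,
 * which is only meaningful when both were measured on one baseline. */
double compounded_gain_pct(double gain_a_pct, double gain_b_pct) {
    return ((1.0 + gain_a_pct / 100.0) * (1.0 + gain_b_pct / 100.0) - 1.0) * 100.0;
}

/* Throughput predicted by applying a percent gain to a baseline. */
double predicted_throughput(double baseline, double gain_pct) {
    assert(baseline > 0.0);
    return baseline * (1.0 + gain_pct / 100.0);
}
```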
---
## Why Subadditive? Technical Analysis
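### 1. Baseline mismatch (differing assumptions in the individual tests)
The individual E4-1 and E4-2 A/B tests did not share the same measurement conditions (binary state, ENV, surrounding optimizations), so building an “additive expected value” from them makes the combined result **appear subadditive**.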
### 2. Shared Bottlenecks
Both optimizations target the same underlying resources:
- **TLS access consolidation**: Reducing multiple TLS reads to single snapshot
- **Memory bandwidth**: TLS reads compete for same cache lines
- **Cache hierarchy**: ENV data shares L1/L2 cache space
Once TLS access is optimized in one path (e.g., free), optimizing it in the other path (malloc) yields diminishing returns.
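
The consolidation idea can be sketched as follows. This is not hakmem's code: `env_snapshot`, `tls_env`, `env_snapshot_load`, `env_snapshot_store`, and the three flags are hypothetical names chosen for illustration. The point is only the shape: one thread-local struct is copied into a local once per call, and all later branches read the local copy instead of separate TLS variables.

```c
#include <stdbool.h>

/* Hypothetical bundle of ENV-derived flags. */
struct env_snapshot {
    bool front_gate;
    bool header_write;
    bool stats;
};

/* One thread-local object instead of three separate TLS variables. */
static _Thread_local struct env_snapshot tls_env = { true, false, true };

/* Single TLS access: copy the whole struct into a local value. */
static inline struct env_snapshot env_snapshot_load(void) {
    return tls_env;
}

/* Cold-path setter, e.g. after ENV has been parsed on this thread. */
void env_snapshot_store(struct env_snapshot s) {
    tls_env = s;
}

/* Hot-path sketch: every branch tests the local copy.
 * Returns a bitmask of the features that were taken. */
unsigned hot_path_features(void) {
    const struct env_snapshot snap = env_snapshot_load();
    unsigned taken = 0;
    if (snap.front_gate)   taken |= 1u;
    if (snap.header_write) taken |= 2u;
    if (snap.stats)        taken |= 4u;
    return taken;
}
```

If both the free and malloc paths already load such a snapshot, the second path has fewer redundant TLS reads left to remove, which is one plausible reading of the diminishing returns described above.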
### 3. Branch Predictor Saturation
Both ENV snapshot checks add branches:
```c
// Free path (E4-1)
if (free_wrapper_env_snapshot_enabled()) {
    struct free_wrapper_env_snapshot snap = free_wrapper_env_snapshot();
    // ...
}

// Malloc path (E4-2)
if (malloc_wrapper_env_snapshot_enabled()) {
    struct malloc_wrapper_env_snapshot snap = malloc_wrapper_env_snapshot();
    // ...
}
```
These branches compete for branch predictor entries, so the cost with both checks enabled need not equal the sum of the costs of each check alone.
### 4. Measurement Methodology
Individual tests were run sequentially, not in isolation:
- E4-1 was tested first (changing code + binary)
- E4-2 was tested on top of E4-1's code changes
- The combined test toggles both flags in one binary, so its baseline (44.48M) differs from the baselines of the individual tests (45.35M and 35.74M)
**Lesson**: Always measure combined effect from a **clean baseline** with all optimizations OFF.
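
One way to apply this lesson is a 2x2 same-binary matrix: both OFF, each flag alone, and both ON. The sketch below computes the interaction term from such a matrix. The struct and function names are hypothetical, and this document only measured the both-OFF and both-ON cells, so the single-flag cells would have to come from additional runs. A negative result means the combination is subadditive. For example, hypothetical throughputs of 100, 105, 110, and 112 give gains of +5%, +10%, and +12%, so the interaction is -3 percentage points.

```c
#include <assert.h>

/* Throughputs from one binary in one session (hypothetical layout). */
struct ab_matrix {
    double off_off;  /* both flags OFF */
    double a_only;   /* first flag ON, second OFF */
    double b_only;   /* first flag OFF, second ON */
    double both;     /* both flags ON */
};

static double gain_pct(double base, double value) {
    return (value - base) / base * 100.0;
}

/* Combined gain minus the sum of the single-flag gains, in
 * percentage points. Negative: subadditive. Positive: superadditive. */
double interaction_pp(const struct ab_matrix *m) {
    assert(m->off_off > 0.0);
    const double ga = gain_pct(m->off_off, m->a_only);
    const double gb = gain_pct(m->off_off, m->b_only);
    const double gboth = gain_pct(m->off_off, m->both);
    return gboth - (ga + gb);
}
```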
---
## Health Check Results
```bash
scripts/verify_health_profiles.sh
```
**Status**: ✅ **PASS** (all profiles passed)
### Profile 1: MIXED_TINYV3_C7_SAFE
- Throughput: **42.3M ops/s**
- Status: PASS
### Profile 2: C6_HEAVY_LEGACY_POOLV1
- Throughput: **20.9M ops/s**
- Status: PASS
**No regressions detected** in health profiles.
---
## Perf Profile (New Baseline: E4 Combined ON)
**Command**:
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1 \
HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1 \
perf record -F 99 -- ./bench_random_mixed_hakmem 20000000 400 1
```
**Throughput**: 47.0M ops/s (20M iters, ws=400)
**Samples**: 52 samples @ 99Hz
### Top Hot Spots (self% >= 2.0%)
| Rank | Function | Self% | Notes |
|------|----------|-------|-------|
| 1 | **free** | **37.56%** | Wrapper + gate (still dominant) |
| 2 | tiny_alloc_gate_fast | 13.73% | Reduced from 19.50% (E4-2 effect) |
| 3 | malloc | 12.95% | Reduced from 16.13% (E4-2 effect) |
| 4 | main | 11.13% | Benchmark driver |
| 5 | tiny_region_id_write_header | 6.97% | Header write tax |
| 6 | tiny_c7_ultra_alloc | 4.56% | C7 alloc path |
| 7 | hakmem_env_snapshot_enabled | 4.29% | **ENV snapshot overhead (NEW)** |
| 8 | tiny_get_max_size | 4.24% | Size limit check |
| 9 | tiny_route_for_class | 2.27% | Route lookup |
| 10 | unified_cache_push | 2.13% | TLS cache push |
### Key Observations
1. **free() dominance**: 37.56% self% is the largest single hot spot
- Already optimized with ENV snapshot (E4-1)
- Further optimization requires analyzing free() internals
2. **malloc/alloc gate reduction**: E4-2 successfully reduced combined malloc + tiny_alloc_gate_fast
- Before: 16.13% + 19.50% = 35.63%
- After: 12.95% + 13.73% = 26.68%
- **Reduction: -8.95 percentage points** ✅
3. **ENV snapshot overhead visible**: hakmem_env_snapshot_enabled() now shows 4.29% self%
- This is the **cost** of ENV snapshot checks
- Offset by larger gains from TLS consolidation
- Future: Consider caching enabled() result in hot paths
4. **Header write tax**: tiny_region_id_write_header (6.97%) is a candidate for E5
- Phase 1 A3 showed always_inline is NO-GO (-4% on Mixed)
- Alternative: Reduce write frequency (selective mode, cached headers)
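
The malloc + gate reduction in observation 2 is stated in percentage points, which is different from a relative reduction. A small sketch of both calculations follows, with illustrative function names. For the figures above, the share moves from 35.63% to 26.68%, which is -8.95 percentage points, or roughly a 25% relative reduction of that share.

```c
#include <assert.h>

/* Change in profile share, in percentage points. */
double share_delta_pp(double before_pct, double after_pct) {
    return after_pct - before_pct;
}

/* Change in profile share relative to the starting share, in percent. */
double share_delta_rel(double before_pct, double after_pct) {
    assert(before_pct > 0.0);
    return (after_pct - before_pct) / before_pct * 100.0;
}
```

Both figures describe shares of samples within each profile, not absolute time, so they are best read alongside the throughput numbers rather than instead of them.
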
### Next Phase 5 Candidates (self% >= 5%)
**E5-1: free() Path Internals** (37.56% self%)
- Target: Analyze free_tiny_fast() / free_tiny_fast_hotcold() structure
- Opportunity: Largest single hot spot, but already heavily optimized
- Challenge: Diminishing returns (already has ENV snapshot, hotcold split, static routing)
- Estimated ROI: Medium (+2-5%)
**E5-2: Header Write Reduction** (6.97% self%)
- Target: tiny_region_id_write_header() call frequency
- Strategy: Conditional header writes (write only when needed)
- Challenge: Phase 1 A3 showed always_inline causes I-cache pressure (-4%)
- Estimated ROI: Medium (+1-3%)
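
The "write only when needed" strategy could take a shape like the sketch below. This is an assumption for illustration, not the actual `tiny_region_id_write_header()` logic: the one-byte header layout and the function name are hypothetical. It trades an unconditional store for a load, a compare, and a branch, so whether it helps would need to be confirmed with an A/B run.

```c
#include <stdbool.h>
#include <stdint.h>

/* Store `expected` into the header only if it differs from the
 * current value. Returns true when a store was performed. */
bool header_write_if_changed(uint8_t *header, uint8_t expected) {
    if (*header == expected) {
        return false;  /* header already valid: skip the store */
    }
    *header = expected;
    return true;
}
```
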
**E5-3: ENV Snapshot Overhead** (4.29% self%)
- Target: hakmem_env_snapshot_enabled() check cost
- Strategy: Cache enabled() result in TLS per-thread
- Opportunity: Remove repeated enabled() checks in hot loops
- Estimated ROI: Low-Medium (+1-2%)
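
The per-thread caching strategy for E5-3 can be sketched as below. The function names and the parsing rule (unset or empty means default, "0" means off, anything else means on) are assumptions, not hakmem's actual behavior; only the environment variable name comes from this document. After the first call on a thread, the check reduces to one TLS load and a compare.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Assumed parsing rule: unset or empty -> default, "0" -> off, else on. */
bool parse_env_flag(const char *value, bool default_on) {
    if (value == NULL || value[0] == '\0') {
        return default_on;
    }
    return !(value[0] == '0' && value[1] == '\0');
}

/* -1 = not yet resolved, 0 = off, 1 = on; one copy per thread. */
static _Thread_local int tls_snapshot_enabled = -1;

/* Resolves the flag once per thread, then serves the cached value. */
bool snapshot_enabled_cached(void) {
    int state = tls_snapshot_enabled;
    if (state < 0) {  /* cold path: first call on this thread */
        state = parse_env_flag(getenv("HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT"), true) ? 1 : 0;
        tls_snapshot_enabled = state;
    }
    return state != 0;
}
```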
---
## Cumulative Phase 5 Status
### Individual Optimizations
- **E4-1** (Free Wrapper ENV Snapshot): +3.51% standalone (separate session, 45.35M baseline; reference only)
- **E4-2** (Malloc Wrapper ENV Snapshot): +21.83% standalone (separate session, 35.74M baseline; reference only)
### Combined Effect
- **E4 Combined**: +6.43% (from "both OFF" baseline of 44.48M)
- **Overall Phase 5 Progress**: 35.74M → 47.34M = **+32.4%** (cross-session, reference only)
### Interaction Type
- **SUBADDITIVE**: Combined gain (6.43%) < naive sum of the individual, cross-session gains (25.34%)
- **Reason**: Mismatched baselines across sessions and shared TLS/cache resources
### Key Insight
ENV snapshot optimizations work best when applied early. Stacking multiple ENV snapshots yields diminishing returns due to:
1. Shared TLS access patterns
2. Branch predictor competition
3. Cache line contention
4. Baseline measurement drift
---
## Next Steps
### Immediate Actions
1. ✅ Update CURRENT_TASK.md with E4 combined results
2. ✅ Create PHASE5_E4_COMBINED_AB_TEST_RESULTS.md
3. Profile analysis: Identify E5 candidates
### Future Phase 5 Work
1. **E5-1**: free() path internals optimization
- Analyze free_tiny_fast_hotcold() structure
- Consider: unified cache optimization, hotcold threshold tuning
2. **E5-2**: Header write reduction
- Selective header writes (only when classification needed)
- Cached header mode (write once, reuse)
3. **E5-3**: ENV snapshot overhead reduction
- Cache enabled() result in TLS
- Eliminate repeated checks in hot loops
### Long-term Considerations
- **Baseline stability**: Need consistent baseline measurement protocol
- **Measurement methodology**: Test combined effects from clean baseline (all OFF)
- **Diminishing returns**: ENV snapshot pattern is plateauing (+6.43% combined vs the ~25% naive sum of the individual, cross-session results)
---
## References
- **E4-1 Design**: `docs/analysis/PHASE5_E4_1_FREE_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md`
- **E4-2 Design**: `docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md`
- **Combined Instructions**: `docs/analysis/PHASE5_E4_COMBINED_AB_TEST_NEXT_INSTRUCTIONS.md`
- **CURRENT_TASK.md**: Updated with E4 combined results
---
## Conclusion
**Decision**: ✅ **GO** - Keep both optimizations DEFAULT ON
**Rationale**:
- Combined gain (+6.43%) exceeds threshold (+1.0%)
- New baseline (47.34M ops/s) is highest achieved in Phase 5
- Health checks pass with no regressions
- Both optimizations provide value, even if subadditive
**Action Items**:
1. Keep HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1 (default ON)
2. Keep HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1 (default ON)
3. Shift focus to next bottleneck (free path internals or header write)
4. Update perf profile baseline to 47.34M ops/s for future comparisons
**Phase 5 Progress**: 35.74M → 47.34M ops/s = **+32.4% cumulative gain** ✅ (cross-session, reference only)