# Phase 47 — FAST Front "PGO mode" A/B Test Results
## Executive Summary
**Decision: NEUTRAL**
- **Mean improvement**: +0.27% (below +0.5% threshold)
- **Median improvement**: +1.02% (positive signal)
- **Verdict**: Within noise range; no actionable performance gain
- **Side effects**: Higher variance in treatment group (2.32% vs 1.23% CV)
## Background
### Objective
Apply `HAKMEM_TINY_FRONT_PGO=1` to the FAST build to evaluate whether a compile-time fixed config (which eliminates runtime gate branches) yields a measurable performance improvement.
### Expected Outcome (from instructions)
- Original instruction estimate: **+3% to +8%**
- Revised expectation (based on Phase 46A lessons): **+0.5% to +2.0%**
- Rationale: Modern CPUs predict branches well; layout tax is a real risk
### Hypothesis
By converting runtime gate checks (e.g., `unified_cache_enabled()`) to compile-time constants:
- Eliminate 5-7 branches in hot path
- Improve I-cache density
- Enable better constant propagation
## Implementation
### Changes Made
1. **Makefile**: Added new target `bench_random_mixed_hakmem_fast_pgo`
   - Build flags: `-DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1`
   - Location: `/mnt/workdisk/public_share/hakmem/Makefile` (lines 662-670)
2. **Config Mechanism**: `core/box/tiny_front_config_box.h`
   - Normal mode: Runtime gate functions (e.g., `unified_cache_enabled()`)
   - PGO mode: Compile-time constants (e.g., `#define TINY_FRONT_UNIFIED_CACHE_ENABLED 1`)
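
The difference between the two modes can be pictured with a minimal sketch. This is not the contents of `core/box/tiny_front_config_box.h`: the body of `unified_cache_enabled()`, the environment variable `HAKMEM_SKETCH_UNIFIED_CACHE`, and the stub `tiny_alloc_route_sketch()` are assumptions made up for illustration. Only the macro name, the gate function name, and the `HAKMEM_TINY_FRONT_PGO` switch come from this report.

```c
#include <assert.h>
#include <stdlib.h>

/* Build-time switch named in this report; defaults to normal (runtime) mode. */
#ifndef HAKMEM_TINY_FRONT_PGO
#define HAKMEM_TINY_FRONT_PGO 0
#endif

/* Hypothetical runtime gate: reads an assumed env var once, then caches it. */
int unified_cache_enabled(void) {
    static int cached = -1;                  /* -1 = not yet resolved */
    if (cached < 0) {
        const char *v = getenv("HAKMEM_SKETCH_UNIFIED_CACHE"); /* assumed name */
        cached = (v == NULL || v[0] != '0'); /* unset means enabled */
    }
    return cached;
}

#if HAKMEM_TINY_FRONT_PGO
/* PGO mode: the gate is a literal, so the compiler can drop the dead branch. */
#define TINY_FRONT_UNIFIED_CACHE_ENABLED 1
#else
/* Normal mode: the gate is a function call followed by a branch. */
#define TINY_FRONT_UNIFIED_CACHE_ENABLED unified_cache_enabled()
#endif

/* Hypothetical hot-path stub: 1 = unified-cache route, 0 = fallback route. */
int tiny_alloc_route_sketch(void) {
    assert(HAKMEM_TINY_FRONT_PGO == 0 || HAKMEM_TINY_FRONT_PGO == 1);
    if (TINY_FRONT_UNIFIED_CACHE_ENABLED) {
        return 1;
    }
    return 0;
}
```

Compiling the same file with `-DHAKMEM_TINY_FRONT_PGO=1` turns the `if` into `if (1)`, which is the kind of branch removal the hypothesis relies on.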
### PGO Fixed Config Values
```c
#define TINY_FRONT_ULTRA_SLIM_ENABLED 0 // Disabled
#define TINY_FRONT_HEAP_V2_ENABLED 0 // Disabled
#define TINY_FRONT_SFC_ENABLED 1 // Enabled
#define TINY_FRONT_FASTCACHE_ENABLED 0 // Disabled
#define TINY_FRONT_TLS_SLL_ENABLED 1 // Enabled
#define TINY_FRONT_UNIFIED_CACHE_ENABLED 1 // Enabled
#define TINY_FRONT_UNIFIED_GATE_ENABLED 1 // Enabled
#define TINY_FRONT_METRICS_ENABLED 0 // Disabled
#define TINY_FRONT_DIAG_ENABLED 0 // Disabled
```
## A/B Test Results
### Methodology
- **Baseline**: `bench_random_mixed_hakmem_minimal` (FAST v3: `BENCH_MINIMAL=1`)
- **Treatment**: `bench_random_mixed_hakmem_fast_pgo` (FAST v3 + PGO: `BENCH_MINIMAL=1 + TINY_FRONT_PGO=1`)
- **Iterations**: 10 runs per variant
- **Workload**: 20M ops, WS=400, random mixed allocation pattern
### Raw Data
#### Baseline (FAST - BENCH_MINIMAL only)
```
60378212, 60412333, 60126097, 60557230, 59593446,
59503095, 59686129, 58695907, 58750183, 58687807
```
#### Treatment (FAST+PGO - BENCH_MINIMAL + TINY_FRONT_PGO)
```
61083082, 60515989, 60785621, 61251824, 61135770,
57473378, 58233393, 59070853, 58446760, 59977402
```
### Statistical Summary
| Metric | Baseline (ops/s) | Treatment (ops/s) | Delta |
|-----------------|------------------|-------------------|------------|
| **Mean** | 59,639,044 | 59,797,407 | **+0.27%** |
| **Median** | 59,639,788 | 60,246,696 | **+1.02%** |
| **Stdev** | 732,715 (1.23%) | 1,385,809 (2.32%) | +89% CV |
| **Min** | 58,687,807 | 57,473,378 | -2.1% |
| **Max** | 60,557,230 | 61,251,824 | +1.1% |
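
The Mean and Median rows can be recomputed from the raw data with a short sketch like the one below. The function and array names are illustrative and are not part of the hakmem scripts; the sample values are copied from the Raw Data section.

```c
#include <assert.h>
#include <stdlib.h>

/* Raw throughput samples (ops/s) copied from the Raw Data section. */
const double BASELINE_OPS[10] = {
    60378212, 60412333, 60126097, 60557230, 59593446,
    59503095, 59686129, 58695907, 58750183, 58687807
};
const double TREATMENT_OPS[10] = {
    61083082, 60515989, 60785621, 61251824, 61135770,
    57473378, 58233393, 59070853, 58446760, 59977402
};

/* qsort comparator for doubles, ascending order. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples. */
double sample_mean(const double *v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum / (double)n;
}

/* Median of n samples; sorts a private copy so the input stays untouched. */
double sample_median(const double *v, size_t n) {
    assert(n > 0);
    double *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    for (size_t i = 0; i < n; i++) {
        tmp[i] = v[i];
    }
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double med = (n % 2 == 1) ? tmp[n / 2]
                              : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
    free(tmp);
    return med;
}

/* Relative change of treatment over baseline, in percent. */
double percent_delta(double baseline, double treatment) {
    assert(baseline != 0.0);
    return 100.0 * (treatment - baseline) / baseline;
}
```

Passing the two means to `percent_delta` reproduces the Mean delta, and passing the two medians reproduces the Median delta.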
### Decision Criteria
| Threshold | Range | Decision | Result |
|-----------|---------|----------|---------|
| GO | ≥ +0.5% | Accept | ❌ |
| NEUTRAL | ±0.5% | Research | ✅ |
| NO-GO | ≤ -0.5% | Revert | ❌ |
**Actual**: Mean +0.27% → **NEUTRAL**
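
The decision table amounts to a three-way comparison on the mean delta. A minimal sketch follows, with the enum and function names chosen here for illustration rather than taken from the project:

```c
#include <assert.h>

/* Possible outcomes of an A/B phase, as used in the decision table. */
typedef enum {
    DECISION_NO_GO = -1,
    DECISION_NEUTRAL = 0,
    DECISION_GO = 1
} decision_t;

/* Classify a mean throughput delta (percent) against a symmetric threshold.
 * Both boundaries are inclusive, matching ">= +0.5%" and "<= -0.5%". */
decision_t classify_mean_delta(double delta_pct, double threshold_pct) {
    assert(threshold_pct > 0.0);
    if (delta_pct >= threshold_pct) {
        return DECISION_GO;
    }
    if (delta_pct <= -threshold_pct) {
        return DECISION_NO_GO;
    }
    return DECISION_NEUTRAL;
}
```

With the 0.5% threshold, the Phase 47 mean of +0.27% classifies as NEUTRAL, while the median of +1.02% would have classified as GO if it were the deciding metric.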
## Analysis
### Observations
1. **Mean vs Median divergence**:
   - Mean: +0.27% (borderline noise)
   - Median: +1.02% (positive signal, above threshold)
   - Interpretation: The median suggests a possible small gain, but the mean is pulled down by the low treatment runs (notably run 6)
2. **Variance increase**:
   - Baseline CV: 1.23%
   - Treatment CV: 2.32% (+89% relative increase)
   - Possible causes:
     - Layout tax (code rearrangement affecting I-cache/alignment)
     - Workload interaction with fixed config
     - Run-to-run noise amplification
3. **Outlier in treatment**:
   - Run 6: 57.47M ops/s (lowest across both groups)
   - May indicate a transient instability or cache-thrashing event; a single run cannot confirm either
   - Treatment runs 6-10 are all slower than runs 1-5, and the baseline also trends downward across its runs, so run-order drift cannot be ruled out
### Why NEUTRAL (not GO)?
1. **Mean below threshold**: +0.27% < +0.5% decision boundary
2. **High variance**: Treatment CV is nearly 2× the baseline (2.32% vs 1.23%), which indicates measurement uncertainty
3. **Phase 46A lesson**: Small positive signals can mask layout tax; require conservative threshold
4. **Reproducibility concern**: Wide spread in treatment group reduces confidence
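
A rough significance check makes the "within noise" reading concrete. It assumes the 10 runs per variant are independent and roughly normally distributed, which this report does not test. The Welch form of the standard error is a choice made here, not part of the original methodology. Using the Mean and Stdev rows of the Statistical Summary (B = baseline, T = treatment):

$$
\mathrm{SE}_{\Delta} = \sqrt{\frac{s_B^2}{n_B} + \frac{s_T^2}{n_T}} = \sqrt{\frac{732{,}715^2}{10} + \frac{1{,}385{,}809^2}{10}} \approx 495{,}700\ \text{ops/s}
$$

$$
t = \frac{\bar{x}_T - \bar{x}_B}{\mathrm{SE}_{\Delta}} = \frac{59{,}797{,}407 - 59{,}639{,}044}{495{,}700} \approx 0.32
$$

The observed difference is about a third of one standard error, so under these assumptions the data cannot distinguish the +0.27% mean delta from zero.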
### Why not NO-GO?
- Median improvement (+1.02%) is positive and above threshold
- No systematic regression pattern (just higher variance)
- Possibility of genuine small gain obscured by variance
## Health Check
**Status**: ✅ PASS
- Command: `make perf_observe` (1 run)
- Outcome: No crashes, assertions, or integrity failures
- Throughput (OBSERVE build): 48.27M ops/s (expected ~20% slower than FAST)
- Health profiles: Both C6_HEAVY and C7_SAFE passed
## Comparison with Phase 46A
| Aspect | Phase 46A (`always_inline`) | Phase 47 (PGO mode) |
|-------------------------|------------------------------|---------------------|
| **Hypothesis** | Inline hot function | Compile-time gates |
| **Expected gain** | +1% to +2% | +0.5% to +2.0% |
| **Actual mean** | -0.68% (NO-GO) | +0.27% (NEUTRAL) |
| **Actual median** | +0.17% | +1.02% |
| **Variance** | Similar to baseline | 2× baseline |
| **Binary size change** | None (inline ≈ non-inline) | Unknown (not measured) |
| **Lesson** | Layout tax real risk | Variance amplification risk |
### Key Insight
Both phases show a **positive median without a mean that clears the GO threshold** (Phase 46A's mean was negative at -0.68%; Phase 47's is +0.27%). This pattern suggests:
- A genuine micro-optimization may be present (median)
- Layout tax or variance offsets it in the mean
- Conservative threshold (±0.5% mean) is justified
## Recommendations
### 1. Keep as Research Box (Current Status)
- **Action**: Leave `bench_random_mixed_hakmem_fast_pgo` target in Makefile for future experiments
- **Rationale**: Median +1.02% suggests potential; may combine well with other optimizations
- **Do NOT**: Make it the default or promote it to the FAST standard build
### 2. Future Investigation (Optional)
If pursuing further:
1. **Increase sample size**: 20-30 runs per variant to shrink the standard error of the mean (the run-to-run variance itself will not change)
2. **Perf-counter analysis**: Check whether the variance correlates with:
   - Cache miss patterns (`perf stat -e cache-misses`)
   - Branch misprediction (`perf stat -e branch-misses`)
   - TLB misses (`perf stat -e dTLB-load-misses`)
3. **Binary size/layout analysis**:
   ```bash
   size bench_random_mixed_hakmem_minimal bench_random_mixed_hakmem_fast_pgo
   objdump -d ... | analyze_layout.py
   ```
4. **Workload sensitivity**:
   - Test on different allocation patterns (C6-heavy, C7-safe, etc.)
   - Check if variance is workload-specific
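
For the sample-size item, a back-of-the-envelope sketch shows why 20-30 runs is a reasonable range. It assumes independent runs, so that the standard error of the mean is CV / sqrt(n), and it solves for the smallest n at which that standard error drops to a chosen target. The function name is illustrative. This is only the point where one standard error equals the target, not a full power calculation.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest run count n such that cv_pct / sqrt(n) <= target_sem_pct,
 * i.e. n >= (cv_pct / target_sem_pct)^2. Both arguments are in percent. */
size_t runs_for_target_sem(double cv_pct, double target_sem_pct) {
    assert(cv_pct >= 0.0 && target_sem_pct > 0.0);
    double ratio = cv_pct / target_sem_pct;
    double need = ratio * ratio;
    size_t n = (size_t)need;       /* truncates toward zero */
    if ((double)n < need) {
        n++;                       /* round up to a whole run */
    }
    return n < 1 ? 1 : n;
}
```

With the treatment CV of 2.32% and the 0.5% decision threshold as the target, this gives 22 runs; with the baseline CV of 1.23% it gives 7.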
### 3. DO NOT Promote (Current Verdict)
- **Reason**: Mean +0.27% within ±0.5% noise threshold
- **Risk**: High variance (2.32% CV) suggests instability
- **Box Theory**: FAST build should be stable baseline, not experimental
## Lessons Learned
1. **Branch prediction is effective**: Even 5-7 branch eliminations yield <1% gain
2. **Layout tax is real**: Variance increase (2× CV) suggests code rearrangement side effects
3. **Conservative thresholds justified**: ±0.5% mean threshold filters out noise
4. **Median-positive ≠ actionable**: Need both mean and median above threshold for GO decision
## Files Modified
1. **Makefile**: Added `bench_random_mixed_hakmem_fast_pgo` target (lines 662-670)
   - Build flags: `EXTRA_CFLAGS='-DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1'`
2. **No code changes**: PGO mode uses existing `tiny_front_config_box.h` infrastructure
## Next Steps
### If NEUTRAL (Current)
- Document in scorecard as "NEUTRAL - research box retained"
- Monitor future phases for synergy opportunities
### If Future GO Signal Emerges
1. Run extended validation (30+ runs)
2. Profile binary layout changes
3. Test across multiple workloads
4. Update scorecard and promote to FAST standard
## Appendix: Test Commands
### Baseline (FAST)
```bash
make bench_random_mixed_hakmem_minimal
BENCH_BIN=./bench_random_mixed_hakmem_minimal scripts/run_mixed_10_cleanenv.sh
```
### Treatment (FAST+PGO)
```bash
make bench_random_mixed_hakmem_fast_pgo
BENCH_BIN=./bench_random_mixed_hakmem_fast_pgo scripts/run_mixed_10_cleanenv.sh
```
### Health Check
```bash
make perf_observe
```
## References
- **Phase 47 Instructions**: `docs/analysis/PHASE47_FAST_FRONT_PGO_MODE_INSTRUCTIONS.md`
- **Phase 46A Results**: `docs/analysis/PHASE46A_TINY_REGION_ID_WRITE_HEADER_ALWAYS_INLINE_RESULTS.md`
- **Box Theory**: `docs/analysis/PHASE2_STRUCTURAL_CHANGES_NEXT_INSTRUCTIONS.md`
- **Config Box**: `core/box/tiny_front_config_box.h`