# Phase 47 — FAST Front "PGO mode" A/B Test Results

## Executive Summary

**Decision: NEUTRAL**

- **Mean improvement**: +0.27% (below +0.5% threshold)
- **Median improvement**: +1.02% (positive signal)
- **Verdict**: Within noise range; no actionable performance gain
- **Side effects**: Higher variance in treatment group (2.32% vs 1.23% CV)

## Background

### Objective

Apply `HAKMEM_TINY_FRONT_PGO=1` to the FAST build to evaluate whether a compile-time fixed config (which eliminates runtime gate branches) yields a measurable performance improvement.

### Expected Outcome (from instructions)

- Original instruction estimate: **+3% to +8%**
- Revised expectation (based on Phase 46A lessons): **+0.5% to +2.0%**
- Rationale: Modern CPUs predict branches well; layout tax is a real risk

### Hypothesis

Converting runtime gate checks (e.g., `unified_cache_enabled()`) to compile-time constants should:
- Eliminate 5-7 branches in the hot path
- Improve I-cache density
- Enable better constant propagation

## Implementation

### Changes Made

1. **Makefile**: Added new target `bench_random_mixed_hakmem_fast_pgo`
   - Build flags: `-DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1`
   - Location: `/mnt/workdisk/public_share/hakmem/Makefile` (lines 662-670)

2. **Config Mechanism**: `core/box/tiny_front_config_box.h`
   - Normal mode: Runtime gate functions (e.g., `unified_cache_enabled()`)
   - PGO mode: Compile-time constants (e.g., `#define TINY_FRONT_UNIFIED_CACHE_ENABLED 1`)

### PGO Fixed Config Values

```c
#define TINY_FRONT_ULTRA_SLIM_ENABLED    0  // Disabled
#define TINY_FRONT_HEAP_V2_ENABLED       0  // Disabled
#define TINY_FRONT_SFC_ENABLED           1  // Enabled
#define TINY_FRONT_FASTCACHE_ENABLED     0  // Disabled
#define TINY_FRONT_TLS_SLL_ENABLED       1  // Enabled
#define TINY_FRONT_UNIFIED_CACHE_ENABLED 1  // Enabled
#define TINY_FRONT_UNIFIED_GATE_ENABLED  1  // Enabled
#define TINY_FRONT_METRICS_ENABLED       0  // Disabled
#define TINY_FRONT_DIAG_ENABLED          0  // Disabled
```

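The two modes listed under Config Mechanism can be illustrated with a minimal sketch. It assumes the header selects between a runtime gate and a literal constant depending on `HAKMEM_TINY_FRONT_PGO`. The names `unified_cache_enabled_runtime()`, `EXAMPLE_UNIFIED_CACHE`, and `pick_alloc_path()` are hypothetical and used only for illustration; this is not the actual content of `tiny_front_config_box.h`.

```c
#include <stdlib.h>

/* Hypothetical runtime gate: reads an illustrative environment variable
 * once and caches the answer (enabled unless the value starts with '0'). */
int unified_cache_enabled_runtime(void) {
    static int cached = -1; /* -1 means "not read yet" */
    if (cached < 0) {
        const char *v = getenv("EXAMPLE_UNIFIED_CACHE");
        cached = (v == NULL || v[0] != '0');
    }
    return cached;
}

#ifndef HAKMEM_TINY_FRONT_PGO
#define HAKMEM_TINY_FRONT_PGO 0 /* normal mode unless built with -DHAKMEM_TINY_FRONT_PGO=1 */
#endif

#if HAKMEM_TINY_FRONT_PGO
/* PGO mode: the gate is a literal, so the compiler can fold the branch away. */
#define TINY_FRONT_UNIFIED_CACHE_ENABLED 1
#else
/* Normal mode: the gate is a function call evaluated at run time. */
#define TINY_FRONT_UNIFIED_CACHE_ENABLED unified_cache_enabled_runtime()
#endif

/* Hypothetical hot-path fragment: returns 1 for the unified-cache path,
 * 0 for the fallback path. */
int pick_alloc_path(void) {
    if (TINY_FRONT_UNIFIED_CACHE_ENABLED) {
        return 1;
    }
    return 0;
}
```

Building the same file with `-DHAKMEM_TINY_FRONT_PGO=1` turns the condition in `pick_alloc_path()` into a constant, which is the kind of branch elimination the hypothesis relies on.
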
## A/B Test Results

### Methodology

- **Baseline**: `bench_random_mixed_hakmem_minimal` (FAST v3: `BENCH_MINIMAL=1`)
- **Treatment**: `bench_random_mixed_hakmem_fast_pgo` (FAST v3 + PGO: `BENCH_MINIMAL=1 + TINY_FRONT_PGO=1`)
- **Iterations**: 10 runs per variant
- **Workload**: 20M ops, WS=400, random mixed allocation pattern

### Raw Data

Throughput in ops/s, listed in run order.

#### Baseline (FAST - BENCH_MINIMAL only)
```
60378212, 60412333, 60126097, 60557230, 59593446,
59503095, 59686129, 58695907, 58750183, 58687807
```

#### Treatment (FAST+PGO - BENCH_MINIMAL + TINY_FRONT_PGO)
```
61083082, 60515989, 60785621, 61251824, 61135770,
57473378, 58233393, 59070853, 58446760, 59977402
```

### Statistical Summary

| Metric         | Baseline (ops/s) | Treatment (ops/s) | Delta      |
|----------------|------------------|-------------------|------------|
| **Mean**       | 59,639,044       | 59,797,407        | **+0.27%** |
| **Median**     | 59,639,788       | 60,246,696        | **+1.02%** |
| **Stdev (CV)** | 732,715 (1.23%)  | 1,385,809 (2.32%) | +89% CV    |
| **Min**        | 58,687,807       | 57,473,378        | -2.1%      |
| **Max**        | 60,557,230       | 61,251,824        | +1.1%      |

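The figures in the table can be recomputed from the raw data with a short sketch like the one below. It assumes the reported stdev is the sample standard deviation (n - 1 denominator), which matches the published values. The type and function names (`Summary`, `summarize`, `delta_pct`, `sqrt_newton`) are illustrative and not part of the hakmem tooling.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    double mean;
    double median;
    double stdev;  /* sample standard deviation (n - 1 denominator) */
    double cv_pct; /* coefficient of variation, in percent */
} Summary;

/* Newton's method square root; keeps the sketch free of a libm link step. */
static double sqrt_newton(double x) {
    if (x <= 0.0) return 0.0;
    double r = x;
    for (int i = 0; i < 64; i++) r = 0.5 * (r + x / r);
    return r;
}

/* qsort comparator for doubles in ascending order. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Summarize n >= 2 samples. Note: sorts `v` in place, so run order is lost. */
Summary summarize(double *v, size_t n) {
    Summary s;
    double sum = 0.0, sq = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    s.mean = sum / (double)n;
    for (size_t i = 0; i < n; i++) sq += (v[i] - s.mean) * (v[i] - s.mean);
    s.stdev = sqrt_newton(sq / (double)(n - 1));
    s.cv_pct = 100.0 * s.stdev / s.mean;
    qsort(v, n, sizeof v[0], cmp_double);
    s.median = (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
    return s;
}

/* Relative change of `treat` over `base`, in percent. */
double delta_pct(double base, double treat) {
    return 100.0 * (treat - base) / base;
}

/* Raw throughput samples (ops/s) copied from the Raw Data section. */
double baseline_ops[10] = {
    60378212, 60412333, 60126097, 60557230, 59593446,
    59503095, 59686129, 58695907, 58750183, 58687807
};
double treatment_ops[10] = {
    61083082, 60515989, 60785621, 61251824, 61135770,
    57473378, 58233393, 59070853, 58446760, 59977402
};
```

Calling `summarize(baseline_ops, 10)` and `summarize(treatment_ops, 10)`, then passing the two means or medians to `delta_pct`, should reproduce the Mean, Median, and Stdev rows up to rounding.
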
### Decision Criteria

| Threshold | Range (mean delta) | Decision | Result |
|-----------|--------------------|----------|--------|
| GO        | ≥ +0.5%            | Accept   | ❌     |
| NEUTRAL   | within ±0.5%       | Research | ✅     |
| NO-GO     | ≤ -0.5%            | Revert   | ❌     |

**Actual**: Mean +0.27% → **NEUTRAL**

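The decision rule in the table can be written as a small classifier. This is a sketch under one assumption: a delta of exactly ±0.5% is treated as GO or NO-GO rather than NEUTRAL, since the table lists those boundaries as inclusive. The enum and function names are illustrative.

```c
/* Outcome categories from the Decision Criteria table. */
typedef enum { NO_GO = -1, NEUTRAL = 0, GO = 1 } Decision;

/* Classify a mean throughput delta given in percent (e.g. 0.27 for +0.27%). */
Decision classify_mean_delta(double mean_delta_pct) {
    if (mean_delta_pct >= 0.5) return GO;      /* accept */
    if (mean_delta_pct <= -0.5) return NO_GO;  /* revert */
    return NEUTRAL;                            /* keep as research box */
}

/* Human-readable label for a decision. */
const char *decision_name(Decision d) {
    switch (d) {
    case GO:    return "GO";
    case NO_GO: return "NO-GO";
    default:    return "NEUTRAL";
    }
}
```

With the Phase 47 mean delta of +0.27%, `classify_mean_delta(0.27)` returns `NEUTRAL`, matching the verdict above.
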
## Analysis

### Observations

1. **Mean vs Median divergence**:
   - Mean: +0.27% (inside the ±0.5% noise band)
   - Median: +1.02% (positive signal, above threshold)
   - Interpretation: The median suggests a possible small gain, but the mean is sensitive to the low outliers in the treatment group

2. **Variance increase**:
   - Baseline CV: 1.23%
   - Treatment CV: 2.32% (+89% relative increase)
   - Possible causes:
     - Layout tax (code rearrangement affecting I-cache/alignment)
     - Workload interaction with fixed config
     - Run-to-run noise amplification

3. **Outlier in treatment**:
   - Run 6: 57.47M ops/s (lowest across both groups)
   - Suggests potential instability or a cache-thrashing event

### Why NEUTRAL (not GO)?

1. **Mean below threshold**: +0.27% < +0.5% decision boundary
2. **High variance**: Treatment CV is about 1.9× the baseline CV (2.32% vs 1.23%), which increases measurement uncertainty
3. **Phase 46A lesson**: Small positive signals can mask layout tax, so a conservative threshold is required
4. **Reproducibility concern**: Wide spread in treatment group reduces confidence

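One way to make "within noise" more concrete is Welch's t statistic, computed from the summary statistics. This check is not part of the original analysis; it is a sketch that assumes the runs are independent and roughly normally distributed. The function names are illustrative.

```c
/* Newton's method square root; keeps the sketch free of a libm link step. */
static double sqrt_newton(double x) {
    if (x <= 0.0) return 0.0;
    double r = x;
    for (int i = 0; i < 64; i++) r = 0.5 * (r + x / r);
    return r;
}

/* Welch's t statistic for two independent samples, from summary statistics:
 * t = (mean_b - mean_a) / sqrt(sd_a^2 / n_a + sd_b^2 / n_b). */
double welch_t(double mean_a, double sd_a, int n_a,
               double mean_b, double sd_b, int n_b) {
    double se2 = sd_a * sd_a / (double)n_a + sd_b * sd_b / (double)n_b;
    return (mean_b - mean_a) / sqrt_newton(se2);
}
```

Feeding in the table values, `welch_t(59639044.0, 732715.0, 10, 59797407.0, 1385809.0, 10)` gives a statistic of roughly 0.3. That is far below the |t| of about 2 typically needed to call a difference significant with 10 runs per group, which is consistent with the NEUTRAL verdict.
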
### Why not NO-GO?

- Median improvement (+1.02%) is positive and above threshold
- No systematic regression pattern (just higher variance)
- Possibility of genuine small gain obscured by variance

## Health Check

**Status**: ✅ PASS

- Command: `make perf_observe` (1 run)
- Outcome: No crashes, assertions, or integrity failures
- Throughput (OBSERVE build): 48.27M ops/s (the OBSERVE build is expected to be ~20% slower than FAST)
- Health profiles: Both C6_HEAVY and C7_SAFE passed

## Comparison with Phase 46A

| Aspect                 | Phase 46A (`always_inline`) | Phase 47 (PGO mode)         |
|------------------------|-----------------------------|-----------------------------|
| **Hypothesis**         | Inline hot function         | Compile-time gates          |
| **Expected gain**      | +1% to +2%                  | +0.5% to +2.0%              |
| **Actual mean**        | -0.68% (NO-GO)              | +0.27% (NEUTRAL)            |
| **Actual median**      | +0.17%                      | +1.02%                      |
| **Variance**           | Similar to baseline         | ~1.9× baseline CV           |
| **Binary size change** | None (inline ≈ non-inline)  | Unknown (not measured)      |
| **Lesson**             | Layout tax is a real risk   | Variance amplification risk |

### Key Insight

In both phases the median is more favorable than the mean (Phase 46A: +0.17% vs -0.68%; Phase 47: +1.02% vs +0.27%). This pattern suggests:
- A genuine micro-optimization may be present (median)
- But layout tax or variance offsets the mean improvement
- The conservative threshold (±0.5% mean) is justified

## Recommendations

### 1. Keep as Research Box (Current Status)

- **Action**: Leave the `bench_random_mixed_hakmem_fast_pgo` target in the Makefile for future experiments
- **Rationale**: Median +1.02% suggests potential; may combine well with other optimizations
- **Do NOT**: Make it the default or promote it to the FAST standard build

### 2. Future Investigation (Optional)

If pursuing further:

1. **Increase sample size**: 20-30 runs to reduce variance noise
2. **Profiling analysis**: Check whether the variance correlates with:
   - Cache miss patterns (`perf stat -e cache-misses`)
   - Branch misprediction (`perf stat -e branch-misses`)
   - TLB misses (`perf stat -e dTLB-load-misses`)

3. **Binary size/layout analysis**:
   ```bash
   size bench_random_mixed_hakmem_minimal bench_random_mixed_hakmem_fast_pgo
   objdump -d ... | analyze_layout.py
   ```

4. **Workload sensitivity**:
   - Test on different allocation patterns (C6-heavy, C7-safe, etc.)
   - Check if variance is workload-specific

### 3. DO NOT Promote (Current Verdict)

- **Reason**: Mean +0.27% within ±0.5% noise threshold
- **Risk**: High variance (2.32% CV) suggests instability
- **Box Theory**: FAST build should be stable baseline, not experimental

## Lessons Learned

1. **Branch prediction is effective**: Even eliminating 5-7 branches yields a mean gain of <1%
2. **Layout tax is real**: The variance increase (~1.9× CV) suggests code rearrangement side effects
3. **Conservative thresholds justified**: The ±0.5% mean threshold filters out noise
4. **Median-positive ≠ actionable**: Both mean and median must be above threshold for a GO decision

## Files Modified

1. **Makefile**: Added `bench_random_mixed_hakmem_fast_pgo` target (lines 662-670)
   - Build flags: `EXTRA_CFLAGS='-DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1'`

2. **No code changes**: PGO mode uses existing `tiny_front_config_box.h` infrastructure

## Next Steps

### If NEUTRAL (Current)

- Document in scorecard as "NEUTRAL - research box retained"
- Monitor future phases for synergy opportunities

### If Future GO Signal Emerges

1. Run extended validation (30+ runs)
2. Profile binary layout changes
3. Test across multiple workloads
4. Update scorecard and promote to FAST standard

## Appendix: Test Commands

### Baseline (FAST)
```bash
make bench_random_mixed_hakmem_minimal
BENCH_BIN=./bench_random_mixed_hakmem_minimal scripts/run_mixed_10_cleanenv.sh
```

### Treatment (FAST+PGO)
```bash
make bench_random_mixed_hakmem_fast_pgo
BENCH_BIN=./bench_random_mixed_hakmem_fast_pgo scripts/run_mixed_10_cleanenv.sh
```

### Health Check
```bash
make perf_observe
```

## References

- **Phase 47 Instructions**: `docs/analysis/PHASE47_FAST_FRONT_PGO_MODE_INSTRUCTIONS.md`
- **Phase 46A Results**: `docs/analysis/PHASE46A_TINY_REGION_ID_WRITE_HEADER_ALWAYS_INLINE_RESULTS.md`
- **Box Theory**: `docs/analysis/PHASE2_STRUCTURAL_CHANGES_NEXT_INSTRUCTIONS.md`
- **Config Box**: `core/box/tiny_front_config_box.h`