hakmem/docs/analysis/PHASE47_FAST_FRONT_PGO_MODE_RESULTS.md

# Phase 47 — FAST Front "PGO mode" A/B Test Results

## Executive Summary

**Decision: NEUTRAL**

- **Mean improvement**: +0.27% (below +0.5% threshold)
- **Median improvement**: +1.02% (positive signal)
- **Verdict**: Within noise range; no actionable performance gain
- **Side effects**: Higher variance in treatment group (2.32% vs 1.23% CV)

## Background

### Objective

Apply `HAKMEM_TINY_FRONT_PGO=1` to the FAST build to evaluate whether a compile-time fixed config (which eliminates runtime gate branches) yields a measurable performance improvement.

### Expected Outcome (from instructions)

- Original instruction estimate: **+3~8%**
- Revised expectation (based on Phase 46A lessons): **+0.5~2.0%**
  - Rationale: Modern CPUs predict branches well; layout tax is a real risk

### Hypothesis

By converting runtime gate checks (e.g., `unified_cache_enabled()`) to compile-time constants:
- Eliminate 5-7 branches in hot path
- Improve I-cache density
- Enable better constant propagation

## Implementation

### Changes Made

1. **Makefile**: Added new target `bench_random_mixed_hakmem_fast_pgo`
   - Build flags: `-DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1`
   - Location: `/mnt/workdisk/public_share/hakmem/Makefile` (lines 662-670)

2. **Config Mechanism**: `core/box/tiny_front_config_box.h`
   - Normal mode: Runtime gate functions (e.g., `unified_cache_enabled()`)
   - PGO mode: Compile-time constants (e.g., `#define TINY_FRONT_UNIFIED_CACHE_ENABLED 1`)

### PGO Fixed Config Values

```c
#define TINY_FRONT_ULTRA_SLIM_ENABLED    0   // Disabled
#define TINY_FRONT_HEAP_V2_ENABLED       0   // Disabled
#define TINY_FRONT_SFC_ENABLED           1   // Enabled
#define TINY_FRONT_FASTCACHE_ENABLED     0   // Disabled
#define TINY_FRONT_TLS_SLL_ENABLED       1   // Enabled
#define TINY_FRONT_UNIFIED_CACHE_ENABLED 1   // Enabled
#define TINY_FRONT_UNIFIED_GATE_ENABLED  1   // Enabled
#define TINY_FRONT_METRICS_ENABLED       0   // Disabled
#define TINY_FRONT_DIAG_ENABLED          0   // Disabled
```

## A/B Test Results

### Methodology

- **Baseline**: `bench_random_mixed_hakmem_minimal` (FAST v3: `BENCH_MINIMAL=1`)
- **Treatment**: `bench_random_mixed_hakmem_fast_pgo` (FAST v3 + PGO: `BENCH_MINIMAL=1 + TINY_FRONT_PGO=1`)
- **Iterations**: 10 runs per variant
- **Workload**: 20M ops, WS=400, random mixed allocation pattern

### Raw Data

#### Baseline (FAST - BENCH_MINIMAL only)
```
60378212, 60412333, 60126097, 60557230, 59593446,
59503095, 59686129, 58695907, 58750183, 58687807
```

#### Treatment (FAST+PGO - BENCH_MINIMAL + TINY_FRONT_PGO)
```
61083082, 60515989, 60785621, 61251824, 61135770,
57473378, 58233393, 59070853, 58446760, 59977402
```

### Statistical Summary

| Metric          | Baseline (ops/s) | Treatment (ops/s) | Delta      |
|-----------------|------------------|-------------------|------------|
| **Mean**        | 59,639,044       | 59,797,407        | **+0.27%** |
| **Median**      | 59,639,788       | 60,246,696        | **+1.02%** |
| **Stdev**       | 732,715 (1.23%)  | 1,385,809 (2.32%) | +89% CV    |
| **Min**         | 58,687,807       | 57,473,378        | -2.1%      |
| **Max**         | 60,557,230       | 61,251,824        | +1.1%      |
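
The summary figures can be re-derived from the raw samples. The sketch below is not the project's analysis script; the helper names `summarize` and `pct_delta` are hypothetical, and it assumes the reported stdev is the sample (n - 1) estimator, which is what reproduces the table's values.

```python
import statistics

# Raw throughput samples (ops/s), copied from the "Raw Data" section.
baseline = [60378212, 60412333, 60126097, 60557230, 59593446,
            59503095, 59686129, 58695907, 58750183, 58687807]
treatment = [61083082, 60515989, 60785621, 61251824, 61135770,
             57473378, 58233393, 59070853, 58446760, 59977402]


def summarize(samples):
    """Return mean, median, sample stdev, and CV (%) for one variant."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample stdev (n - 1 denominator)
    return {
        "mean": mean,
        "median": statistics.median(samples),
        "stdev": stdev,
        "cv_pct": 100.0 * stdev / mean,
    }


def pct_delta(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return 100.0 * (new - old) / old


base = summarize(baseline)
treat = summarize(treatment)
mean_delta = pct_delta(treat["mean"], base["mean"])
median_delta = pct_delta(treat["median"], base["median"])

print(f"mean delta:   {mean_delta:+.2f}%")
print(f"median delta: {median_delta:+.2f}%")
print(f"CV: baseline {base['cv_pct']:.2f}%, treatment {treat['cv_pct']:.2f}%")
```

Keeping the raw samples next to the computation makes it easy to re-check the table if more runs are appended later.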

### Decision Criteria

| Threshold | Range (mean Δ)    | Decision | Result  |
|-----------|-------------------|----------|---------|
| GO        | ≥ +0.5%           | Accept   | ❌      |
| NEUTRAL   | -0.5% < Δ < +0.5% | Research | ✅      |
| NO-GO     | ≤ -0.5%           | Revert   | ❌      |

**Actual**: Mean +0.27% → **NEUTRAL**
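
The decision rule is mechanical enough to express as a small function. This is a minimal sketch assuming the decision is driven by the mean delta alone and that the boundary values belong to GO and NO-GO, as the table's `≥` and `≤` rows indicate; the name `classify` is hypothetical and not part of the hakmem tooling.

```python
def classify(mean_delta_pct, threshold_pct=0.5):
    """Map a mean throughput delta (in percent) to a GO/NEUTRAL/NO-GO verdict."""
    if mean_delta_pct >= threshold_pct:
        return "GO"
    if mean_delta_pct <= -threshold_pct:
        return "NO-GO"
    return "NEUTRAL"


# Phase 47 mean delta and, for comparison, the Phase 46A mean delta.
print(classify(0.27))
print(classify(-0.68))
```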

## Analysis

### Observations

1. **Mean vs Median divergence**:
   - Mean: +0.27% (borderline noise)
   - Median: +1.02% (positive signal, above threshold)
   - Interpretation: Median suggests possible small gain, but mean shows high outlier sensitivity

2. **Variance increase**:
   - Baseline CV: 1.23%
   - Treatment CV: 2.32% (+89% relative increase)
   - Possible causes:
     - Layout tax (code rearrangement affecting I-cache/alignment)
     - Workload interaction with fixed config
     - Run-to-run noise amplification

3. **Outlier in treatment**:
   - Run 6: 57.47M ops/s (lowest across both groups)
   - Suggests potential instability or cache thrashing event

### Why NEUTRAL (not GO)?

1. **Mean below threshold**: +0.27% < +0.5% decision boundary
2. **High variance**: Treatment CV is ~1.9× baseline (2.32% vs 1.23%), which widens the uncertainty on the measured mean
3. **Phase 46A lesson**: Small positive signals can mask layout tax; require conservative threshold
4. **Reproducibility concern**: Wide spread in treatment group reduces confidence
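
One way to quantify "within noise" is to compare the mean difference against its standard error. The sketch below computes a Welch-style t statistic from the raw samples using only the standard library; it was not part of the original analysis, it assumes the runs are independent, and `welch_t` is a hypothetical helper name.

```python
import math
import statistics

# Raw throughput samples (ops/s), copied from the "Raw Data" section.
baseline = [60378212, 60412333, 60126097, 60557230, 59593446,
            59503095, 59686129, 58695907, 58750183, 58687807]
treatment = [61083082, 60515989, 60785621, 61251824, 61135770,
             57473378, 58233393, 59070853, 58446760, 59977402]


def welch_t(a, b):
    """Return (t statistic, standard error) for mean(b) - mean(a).

    Uses unequal-variance (Welch) standard error: sqrt(s_a^2/n_a + s_b^2/n_b).
    """
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    diff = statistics.mean(b) - statistics.mean(a)
    return diff / se, se


t_stat, se = welch_t(baseline, treatment)
se_pct = 100.0 * se / statistics.mean(baseline)

print(f"standard error of the difference: {se_pct:.2f}% of baseline mean")
print(f"t statistic: {t_stat:.2f}")
```

With 10 runs per variant, the standard error of the difference is larger than the +0.27% delta itself, and the t statistic is far below the value of roughly 2 that would normally be needed to call the difference significant.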

### Why not NO-GO?

- Mean delta (+0.27%) is above the -0.5% NO-GO boundary; median improvement (+1.02%) is positive and above the +0.5% threshold
- No systematic regression pattern (just higher variance)
- Possibility of genuine small gain obscured by variance

## Health Check

**Status**: ✅ PASS

- Command: `make perf_observe` (1 run)
- Outcome: No crashes, assertions, or integrity failures
- Throughput (OBSERVE build): 48.27M ops/s (expected ~20% slower than FAST)
- Health profiles: Both C6_HEAVY and C7_SAFE passed

## Comparison with Phase 46A

| Aspect                  | Phase 46A (`always_inline`) | Phase 47 (PGO mode) |
|-------------------------|------------------------------|---------------------|
| **Hypothesis**          | Inline hot function          | Compile-time gates  |
| **Expected gain**       | +1~2%                        | +0.5~2.0%           |
| **Actual mean**         | -0.68% (NO-GO)               | +0.27% (NEUTRAL)    |
| **Actual median**       | +0.17%                       | +1.02%              |
| **Variance**            | Similar to baseline          | 2× baseline         |
| **Binary size change**  | None (inline ≈ non-inline)   | Unknown (not measured) |
| **Lesson**              | Layout tax real risk         | Variance amplification risk |

### Key Insight

Both phases show a **median-positive** signal that the mean does not confirm (Phase 46A: mean -0.68%; Phase 47: mean +0.27%). This pattern suggests:
- A genuine micro-optimization may be present (median)
- But layout tax or variance offsets it in the mean
- Conservative threshold (±0.5% mean) is justified

## Recommendations

### 1. Keep as Research Box (Current Status)

- **Action**: Leave `bench_random_mixed_hakmem_fast_pgo` target in Makefile for future experiments
- **Rationale**: Median +1.02% suggests potential; may combine well with other optimizations
- **Do NOT**: Make default or promote to FAST standard build

### 2. Future Investigation (Optional)

If pursuing further:

1. **Increase sample size**: 20-30 runs to reduce variance noise
2. **Hardware-counter analysis**: Check whether variance correlates with:
   - Cache miss patterns (`perf stat -e cache-misses`)
   - Branch misprediction (`perf stat -e branch-misses`)
   - TLB misses (`perf stat -e dTLB-load-misses`)

3. **Binary size/layout analysis**:
   ```bash
   size bench_random_mixed_hakmem_minimal bench_random_mixed_hakmem_fast_pgo
   objdump -d ... | analyze_layout.py
   ```

4. **Workload sensitivity**:
   - Test on different allocation patterns (C6-heavy, C7-safe, etc.)
   - Check if variance is workload-specific
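
To gauge how many runs an extended validation would need, a standard two-sample size estimate can be applied to the observed CVs. This is a rough sketch under a normal approximation, assuming independent runs, a two-sided 5% significance level, and 80% power; none of these parameters come from the original plan, and `runs_per_variant` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def runs_per_variant(cv_a_pct, cv_b_pct, delta_pct, alpha=0.05, power=0.80):
    """Approximate runs per variant needed to detect a mean delta of delta_pct.

    All inputs are percentages of the baseline mean. Uses the normal
    approximation n = (z_alpha + z_beta)^2 * (cv_a^2 + cv_b^2) / delta^2.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = cv_a_pct ** 2 + cv_b_pct ** 2
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / delta_pct ** 2)


# Observed CVs: baseline 1.23%, treatment 2.32%.
n_half_pct = runs_per_variant(1.23, 2.32, 0.5)  # detect a +0.5% mean delta
n_one_pct = runs_per_variant(1.23, 2.32, 1.0)   # detect a +1.0% mean delta
print(n_half_pct, n_one_pct)
```

Under these assumptions, resolving a +0.5% delta at the current noise level takes on the order of a couple hundred runs per variant, so reducing run-to-run variance may pay off more than adding runs.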

### 3. DO NOT Promote (Current Verdict)

- **Reason**: Mean +0.27% within ±0.5% noise threshold
- **Risk**: High variance (2.32% CV) suggests instability
- **Box Theory**: FAST build should be stable baseline, not experimental

## Lessons Learned

1. **Branch prediction is effective**: Eliminating 5-7 runtime gate branches yielded only +0.27% mean (+1.02% median), consistent with those branches already being well predicted
2. **Layout tax is real**: Variance increase (2× CV) suggests code rearrangement side effects
3. **Conservative thresholds justified**: ±0.5% mean threshold filters out noise
4. **Median-positive ≠ actionable**: Need both mean and median above threshold for GO decision

## Files Modified

1. **Makefile**: Added `bench_random_mixed_hakmem_fast_pgo` target (lines 662-670)
   - Build flags: `EXTRA_CFLAGS='-DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1'`

2. **No code changes**: PGO mode uses existing `tiny_front_config_box.h` infrastructure

## Next Steps

### If NEUTRAL (Current)

- Document in scorecard as "NEUTRAL - research box retained"
- Monitor future phases for synergy opportunities

### If Future GO Signal Emerges

1. Run extended validation (30+ runs)
2. Profile binary layout changes
3. Test across multiple workloads
4. Update scorecard and promote to FAST standard

## Appendix: Test Commands

### Baseline (FAST)
```bash
make bench_random_mixed_hakmem_minimal
BENCH_BIN=./bench_random_mixed_hakmem_minimal scripts/run_mixed_10_cleanenv.sh
```

### Treatment (FAST+PGO)
```bash
make bench_random_mixed_hakmem_fast_pgo
BENCH_BIN=./bench_random_mixed_hakmem_fast_pgo scripts/run_mixed_10_cleanenv.sh
```

### Health Check
```bash
make perf_observe
```

## References

- **Phase 47 Instructions**: `docs/analysis/PHASE47_FAST_FRONT_PGO_MODE_INSTRUCTIONS.md`
- **Phase 46A Results**: `docs/analysis/PHASE46A_TINY_REGION_ID_WRITE_HEADER_ALWAYS_INLINE_RESULTS.md`
- **Box Theory**: `docs/analysis/PHASE2_STRUCTURAL_CHANGES_NEXT_INSTRUCTIONS.md`
- **Config Box**: `core/box/tiny_front_config_box.h`

Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement

## Summary

Completed Phase 54-60 optimization work:

**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset

**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY

**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc

**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized

## Key Metrics

- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes

## Files Added/Modified

New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h

Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py

Documentation: Phase 40-60 analysis documents

## Design Decisions

1. Profile separation (core/bench_profile.h):
   - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
   - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)

2. Box Theory compliance:
   - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
   - Single conversion points maintained
   - No physical deletions (compile-out only)

3. Lessons learned:
   - SSOT effective only where redundancy exists (Phase 60 showed limits)
   - Branch prediction extremely effective (~0 cycles for well-predicted branches)
   - Early-exit pattern valuable even when seemingly redundant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

											
										
										
2025-12-17 06:24:01 +09:00