## Summary
Completed Phase 54-60 optimization work:
**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset
**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY
**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc
**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized
## Key Metrics
- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes
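
The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded numbers quoted in this message. It assumes that "800× under target" means the target equals the measured rate times 800, which is an inference rather than something stated here. With the rounded CVs the stability ratio comes out near 2.67×, slightly below the quoted 2.68×, which was presumably computed from unrounded values.

```python
# Figures copied from the summary above (rounded as quoted).
hakmem_ops = 59.184e6      # hakmem FAST (Balanced), ops/s
mimalloc_ops = 120.466e6   # mimalloc, ops/s
hakmem_cv = 1.31           # percent
mimalloc_cv = 3.50         # percent
syscalls_per_op = 1.25e-7
headroom = 800             # "800x under target"

ratio_pct = 100.0 * hakmem_ops / mimalloc_ops
cv_ratio = mimalloc_cv / hakmem_cv
# Assumption: target = measured rate * headroom factor.
implied_target = syscalls_per_op * headroom

print(f"throughput ratio: {ratio_pct:.2f}% of mimalloc")
print(f"CV ratio: {cv_ratio:.2f}x")
print(f"implied syscall target: {implied_target:.1e} per op")
```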
## Files Added/Modified
New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h
Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py
Documentation: Phase 40-60 analysis documents
## Design Decisions
1. Profile separation (core/bench_profile.h):
- MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
- MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)
2. Box Theory compliance:
- All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
- Single conversion points maintained
- No physical deletions (compile-out only)
3. Lessons learned:
- SSOT effective only where redundancy exists (Phase 60 showed limits)
- Branch prediction extremely effective (~0 cycles for well-predicted branches)
- Early-exit pattern valuable even when seemingly redundant
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Phase 47 — FAST Front "PGO mode" A/B Test Results
Executive Summary
Decision: NEUTRAL
- Mean improvement: +0.27% (below +0.5% threshold)
- Median improvement: +1.02% (positive signal)
- Verdict: Within noise range; no actionable performance gain
- Side effects: Higher variance in treatment group (2.32% vs 1.23% CV)
Background
Objective
Apply HAKMEM_TINY_FRONT_PGO=1 to the FAST build and evaluate whether a compile-time fixed config (which eliminates runtime gate branches) yields a measurable performance improvement.
Expected Outcome (from instructions)
- Original instruction estimate: +3~8%
- Revised expectation (based on Phase 46A lessons): +0.5~2.0%
- Rationale: Modern CPUs predict branches well; layout tax is a real risk
Hypothesis
By converting runtime gate checks (e.g., unified_cache_enabled()) to compile-time constants:
- Eliminate 5-7 branches in hot path
- Improve I-cache density
- Enable better constant propagation
Implementation
Changes Made
- Makefile: Added new target bench_random_mixed_hakmem_fast_pgo
  - Build flags: -DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1
  - Location: /mnt/workdisk/public_share/hakmem/Makefile (lines 662-670)
- Config Mechanism: core/box/tiny_front_config_box.h
  - Normal mode: Runtime gate functions (e.g., unified_cache_enabled())
  - PGO mode: Compile-time constants (e.g., #define TINY_FRONT_UNIFIED_CACHE_ENABLED 1)
PGO Fixed Config Values
#define TINY_FRONT_ULTRA_SLIM_ENABLED 0 // Disabled
#define TINY_FRONT_HEAP_V2_ENABLED 0 // Disabled
#define TINY_FRONT_SFC_ENABLED 1 // Enabled
#define TINY_FRONT_FASTCACHE_ENABLED 0 // Disabled
#define TINY_FRONT_TLS_SLL_ENABLED 1 // Enabled
#define TINY_FRONT_UNIFIED_CACHE_ENABLED 1 // Enabled
#define TINY_FRONT_UNIFIED_GATE_ENABLED 1 // Enabled
#define TINY_FRONT_METRICS_ENABLED 0 // Disabled
#define TINY_FRONT_DIAG_ENABLED 0 // Disabled
A/B Test Results
Methodology
- Baseline: bench_random_mixed_hakmem_minimal (FAST v3: BENCH_MINIMAL=1)
- Treatment: bench_random_mixed_hakmem_fast_pgo (FAST v3 + PGO: BENCH_MINIMAL=1 + TINY_FRONT_PGO=1)
- Iterations: 10 runs per variant
- Workload: 20M ops, WS=400, random mixed allocation pattern
Raw Data
Baseline (FAST - BENCH_MINIMAL only)
60378212, 60412333, 60126097, 60557230, 59593446,
59503095, 59686129, 58695907, 58750183, 58687807
Treatment (FAST+PGO - BENCH_MINIMAL + TINY_FRONT_PGO)
61083082, 60515989, 60785621, 61251824, 61135770,
57473378, 58233393, 59070853, 58446760, 59977402
Statistical Summary
| Metric | Baseline (ops/s) | Treatment (ops/s) | Delta |
|---|---|---|---|
| Mean | 59,639,044 | 59,797,407 | +0.27% |
| Median | 59,639,788 | 60,246,696 | +1.02% |
| Stdev (CV) | 732,715 (1.23%) | 1,385,809 (2.32%) | +89% (relative CV) |
| Min | 58,687,807 | 57,473,378 | -2.1% |
| Max | 60,557,230 | 61,251,824 | +1.1% |
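
The summary table can be reproduced from the raw samples. The sketch below uses Python's standard `statistics` module and assumes the Stdev row is the sample standard deviation (n - 1 denominator), which is the variant that matches the reported values. The helper names `summarize` and `delta_pct` are illustrative and not part of the repository.

```python
import statistics

# Raw throughput samples (ops/s) copied from the Raw Data section.
baseline = [60378212, 60412333, 60126097, 60557230, 59593446,
            59503095, 59686129, 58695907, 58750183, 58687807]
treatment = [61083082, 60515989, 60785621, 61251824, 61135770,
             57473378, 58233393, 59070853, 58446760, 59977402]

def summarize(samples):
    """Return mean, median, sample stdev, CV (%), min and max for one variant."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample stdev (n - 1 denominator)
    return {
        "mean": mean,
        "median": statistics.median(samples),
        "stdev": stdev,
        "cv_pct": 100.0 * stdev / mean,
        "min": min(samples),
        "max": max(samples),
    }

def delta_pct(new, old):
    """Relative change of new versus old, in percent."""
    return 100.0 * (new - old) / old

base = summarize(baseline)
treat = summarize(treatment)

for key in ("mean", "median", "min", "max"):
    print(f"{key:>6}: {delta_pct(treat[key], base[key]):+.2f}%")
print(f"CV: {base['cv_pct']:.2f}% -> {treat['cv_pct']:.2f}% "
      f"({delta_pct(treat['cv_pct'], base['cv_pct']):+.0f}% relative)")
```

Note that the "+89%" in the Stdev row is the relative change of the CV itself, not a change in throughput.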
Decision Criteria
| Threshold | Range | Decision | Result |
|---|---|---|---|
| GO | ≥ +0.5% | Accept | ❌ |
| NEUTRAL | ±0.5% | Research | ✅ |
| NO-GO | ≤ -0.5% | Revert | ❌ |
Actual: Mean +0.27% → NEUTRAL
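
The decision rule in the table can be written as a small function. This is a minimal sketch: the name `classify` is hypothetical, and it assumes the boundary values (exactly ±0.5%) fall on the GO / NO-GO side, following the "≥" and "≤" in the table.

```python
def classify(mean_delta_pct, threshold_pct=0.5):
    """Map a mean throughput delta (%) to GO / NEUTRAL / NO-GO."""
    if mean_delta_pct >= threshold_pct:
        return "GO"
    if mean_delta_pct <= -threshold_pct:
        return "NO-GO"
    return "NEUTRAL"

print(classify(0.27))   # Phase 47 mean delta
print(classify(-0.68))  # Phase 46A mean delta
```

Only the mean delta feeds the rule; the median is reported alongside it but does not change the verdict.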
Analysis
Observations
- Mean vs Median divergence:
  - Mean: +0.27% (borderline noise)
  - Median: +1.02% (positive signal, above threshold)
  - Interpretation: The median suggests a possible small gain, but the mean is sensitive to the low treatment runs and stays below threshold
- Variance increase:
  - Baseline CV: 1.23%
  - Treatment CV: 2.32% (+89% relative increase)
  - Possible causes:
    - Layout tax (code rearrangement affecting I-cache/alignment)
    - Workload interaction with fixed config
    - Run-to-run noise amplification
- Outlier in treatment:
  - Run 6: 57.47M ops/s (lowest across both groups)
  - Suggests potential instability or cache thrashing event
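
To make the outlier sensitivity concrete, the sketch below recomputes the mean delta with and without the single lowest treatment run (run 6). This is an illustration of how much one run moves the mean with 10 samples, not an argument for discarding that run. The helper name `mean_delta_pct` is hypothetical.

```python
import statistics

# Raw throughput samples (ops/s) copied from the Raw Data section.
baseline = [60378212, 60412333, 60126097, 60557230, 59593446,
            59503095, 59686129, 58695907, 58750183, 58687807]
treatment = [61083082, 60515989, 60785621, 61251824, 61135770,
             57473378, 58233393, 59070853, 58446760, 59977402]

def mean_delta_pct(treat, base):
    """Percent change of the treatment mean relative to the baseline mean."""
    base_mean = statistics.mean(base)
    return 100.0 * (statistics.mean(treat) - base_mean) / base_mean

lowest = min(treatment)  # run 6 in the treatment series
without_lowest = [x for x in treatment if x != lowest]

full_delta = mean_delta_pct(treatment, baseline)
trimmed_delta = mean_delta_pct(without_lowest, baseline)

print(f"mean delta, all 10 runs:       {full_delta:+.2f}%")
print(f"mean delta, lowest run removed: {trimmed_delta:+.2f}%")
```

A shift of this size from a single run is one reason the decision is based on a conservative mean threshold rather than on the median alone.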
Why NEUTRAL (not GO)?
- Mean below threshold: +0.27% < +0.5% decision boundary
- High variance: Treatment CV is ~1.9× the baseline CV (2.32% vs 1.23%), which suggests measurement uncertainty
- Phase 46A lesson: Small positive signals can mask layout tax; require conservative threshold
- Reproducibility concern: Wide spread in treatment group reduces confidence
Why not NO-GO?
- Median improvement (+1.02%) is positive and above threshold
- No systematic regression pattern (just higher variance)
- Possibility of genuine small gain obscured by variance
Health Check
Status: ✅ PASS
- Command: make perf_observe (1 run)
- Outcome: No crashes, assertions, or integrity failures
- Throughput (OBSERVE build): 48.27M ops/s (expected ~20% slower than FAST)
- Health profiles: Both C6_HEAVY and C7_SAFE passed
Comparison with Phase 46A
| Aspect | Phase 46A (always_inline) | Phase 47 (PGO mode) |
|---|---|---|
| Hypothesis | Inline hot function | Compile-time gates |
| Expected gain | +1~2% | +0.5~2.0% |
| Actual mean | -0.68% (NO-GO) | +0.27% (NEUTRAL) |
| Actual median | +0.17% | +1.02% |
| Variance | Similar to baseline | ~1.9× baseline (CV) |
| Binary size change | None (inline ≈ non-inline) | Unknown (not measured) |
| Lesson | Layout tax real risk | Variance amplification risk |
Key Insight
Both phases show a positive median that the mean does not confirm (Phase 46A: median +0.17%, mean -0.68%; Phase 47: median +1.02%, mean +0.27%). This pattern suggests:
- Genuine micro-optimization present (median)
- But layout tax or variance offsets mean improvement
- Conservative threshold (±0.5% mean) is justified
Recommendations
1. Keep as Research Box (Current Status)
- Action: Leave bench_random_mixed_hakmem_fast_pgo target in Makefile for future experiments
- Rationale: Median +1.02% suggests potential; may combine well with other optimizations
- Do NOT: Make default or promote to FAST standard build
2. Future Investigation (Optional)
If pursuing further:
- Increase sample size: 20-30 runs to reduce variance noise
- Profile-guided analysis: Check if variance correlates with:
  - Cache miss patterns (perf stat -e cache-misses)
  - Branch misprediction (perf stat -e branch-misses)
  - TLB misses (perf stat -e dTLB-load-misses)
- Binary size/layout analysis:
      size bench_random_mixed_hakmem_minimal bench_random_mixed_hakmem_fast_pgo
      objdump -d ... | analyze_layout.py
- Workload sensitivity:
  - Test on different allocation patterns (C6-heavy, C7-safe, etc.)
  - Check if variance is workload-specific
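
A rough way to judge how many runs are needed is to look at the standard error of the difference between the two means. The sketch below assumes the runs are independent and that the per-variant standard deviations stay about the same as in this 10-run sample; neither assumption has been verified here. The helper name `se_of_mean_diff` is hypothetical.

```python
import math
import statistics

# Raw throughput samples (ops/s) copied from the Raw Data section.
baseline = [60378212, 60412333, 60126097, 60557230, 59593446,
            59503095, 59686129, 58695907, 58750183, 58687807]
treatment = [61083082, 60515989, 60785621, 61251824, 61135770,
             57473378, 58233393, 59070853, 58446760, 59977402]

def se_of_mean_diff(sd_a, sd_b, n):
    """Standard error of the difference of two means, n runs per variant."""
    return math.sqrt(sd_a ** 2 / n + sd_b ** 2 / n)

sd_base = statistics.stdev(baseline)
sd_treat = statistics.stdev(treatment)
base_mean = statistics.mean(baseline)

# Standard error expressed as a percentage of the baseline mean.
se_pct_by_n = {
    n: 100.0 * se_of_mean_diff(sd_base, sd_treat, n) / base_mean
    for n in (10, 20, 30)
}

for n, se_pct in se_pct_by_n.items():
    print(f"n={n:2d}: standard error of mean delta ~ {se_pct:.2f}%")
```

With 10 runs per variant the standard error is roughly 0.8% of the baseline mean, larger than the observed +0.27%, which is consistent with the NEUTRAL verdict. Tripling the run count shrinks it by a factor of about 1.7, to just under the ±0.5% decision threshold.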
3. DO NOT Promote (Current Verdict)
- Reason: Mean +0.27% within ±0.5% noise threshold
- Risk: High variance (2.32% CV) suggests instability
- Box Theory: FAST build should be stable baseline, not experimental
Lessons Learned
- Branch prediction is effective: Even 5-7 branch eliminations yield <1% gain
- Layout tax is a plausible risk: The variance increase (~1.9× CV) is consistent with code rearrangement side effects, though binary size and layout were not measured in this phase
- Conservative thresholds justified: ±0.5% mean threshold filters out noise
- Median-positive ≠ actionable: Need both mean and median above threshold for GO decision
Files Modified
- Makefile: Added bench_random_mixed_hakmem_fast_pgo target (lines 662-670)
  - Build flags: EXTRA_CFLAGS='-DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1'
- No code changes: PGO mode uses existing tiny_front_config_box.h infrastructure
Next Steps
If NEUTRAL (Current)
- Document in scorecard as "NEUTRAL - research box retained"
- Monitor future phases for synergy opportunities
If Future GO Signal Emerges
- Run extended validation (30+ runs)
- Profile binary layout changes
- Test across multiple workloads
- Update scorecard and promote to FAST standard
Appendix: Test Commands
Baseline (FAST)
make bench_random_mixed_hakmem_minimal
BENCH_BIN=./bench_random_mixed_hakmem_minimal scripts/run_mixed_10_cleanenv.sh
Treatment (FAST+PGO)
make bench_random_mixed_hakmem_fast_pgo
BENCH_BIN=./bench_random_mixed_hakmem_fast_pgo scripts/run_mixed_10_cleanenv.sh
Health Check
make perf_observe
References
- Phase 47 Instructions: docs/analysis/PHASE47_FAST_FRONT_PGO_MODE_INSTRUCTIONS.md
- Phase 46A Results: docs/analysis/PHASE46A_TINY_REGION_ID_WRITE_HEADER_ALWAYS_INLINE_RESULTS.md
- Box Theory: docs/analysis/PHASE2_STRUCTURAL_CHANGES_NEXT_INSTRUCTIONS.md
- Config Box: core/box/tiny_front_config_box.h