hakmem/docs/analysis/PHASE47_FAST_FRONT_PGO_MODE_RESULTS.md
Moe Charm (CI) 7adbcdfcb6 Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement
2025-12-17 06:24:01 +09:00


Phase 47 — FAST Front "PGO mode" A/B Test Results

Executive Summary

Decision: NEUTRAL

  • Mean improvement: +0.27% (below +0.5% threshold)
  • Median improvement: +1.02% (positive signal)
  • Verdict: Within noise range; no actionable performance gain
  • Side effects: Higher variance in treatment group (2.32% vs 1.23% CV)

Background

Objective

Apply HAKMEM_TINY_FRONT_PGO=1 to the FAST build to evaluate whether a compile-time fixed config (which eliminates runtime gate branches) yields a measurable performance improvement.

Expected Outcome (from instructions)

  • Original instruction estimate: +3~8%
  • Revised expectation (based on Phase 46A lessons): +0.5~2.0%
    • Rationale: Modern CPUs predict branches well; layout tax is a real risk

Hypothesis

By converting runtime gate checks (e.g., unified_cache_enabled()) to compile-time constants:

  • Eliminate 5-7 branches in hot path
  • Improve I-cache density
  • Enable better constant propagation

Implementation

Changes Made

  1. Makefile: Added new target bench_random_mixed_hakmem_fast_pgo

    • Build flags: -DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1
    • Location: /mnt/workdisk/public_share/hakmem/Makefile (lines 662-670)
  2. Config Mechanism: core/box/tiny_front_config_box.h

    • Normal mode: Runtime gate functions (e.g., unified_cache_enabled())
    • PGO mode: Compile-time constants (e.g., #define TINY_FRONT_UNIFIED_CACHE_ENABLED 1)

PGO Fixed Config Values

#define TINY_FRONT_ULTRA_SLIM_ENABLED    0   // Disabled
#define TINY_FRONT_HEAP_V2_ENABLED       0   // Disabled
#define TINY_FRONT_SFC_ENABLED           1   // Enabled
#define TINY_FRONT_FASTCACHE_ENABLED     0   // Disabled
#define TINY_FRONT_TLS_SLL_ENABLED       1   // Enabled
#define TINY_FRONT_UNIFIED_CACHE_ENABLED 1   // Enabled
#define TINY_FRONT_UNIFIED_GATE_ENABLED  1   // Enabled
#define TINY_FRONT_METRICS_ENABLED       0   // Disabled
#define TINY_FRONT_DIAG_ENABLED          0   // Disabled
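
As a minimal sketch of how one of the constants above can stand in for a runtime gate, the fragment below switches a single gate between a function call and a literal. Only `TINY_FRONT_UNIFIED_CACHE_ENABLED`, `unified_cache_enabled()`, and `HAKMEM_TINY_FRONT_PGO` come from this report. The flag variable, the gate body, `demo_alloc_path()`, and `demo_check_default_path()` are hypothetical, and the real tiny_front_config_box.h may be organized differently.

```c
#include <assert.h>

/* Hypothetical runtime state; the real gate's backing storage is not shown in this report. */
static int g_demo_unified_cache_flag = 1;

/* Stand-in for the runtime gate function named in the report. */
static inline int unified_cache_enabled(void) {
    return g_demo_unified_cache_flag;
}

#if defined(HAKMEM_TINY_FRONT_PGO) && HAKMEM_TINY_FRONT_PGO
/* PGO mode: the gate is a literal, so the compiler can fold the branch away. */
#define TINY_FRONT_UNIFIED_CACHE_ENABLED 1
#else
/* Normal mode: the gate is evaluated at run time on every call. */
#define TINY_FRONT_UNIFIED_CACHE_ENABLED unified_cache_enabled()
#endif

/* Hypothetical hot path: returns 1 for the cache path, 0 for the fallback path. */
static int demo_alloc_path(void) {
    if (TINY_FRONT_UNIFIED_CACHE_ENABLED) {
        return 1;
    }
    return 0;
}

/* With the flag left at its default of 1, both modes take the cache path. */
static void demo_check_default_path(void) {
    assert(demo_alloc_path() == 1);
}
```

Call sites read the same macro in both modes, so switching modes needs only the `-DHAKMEM_TINY_FRONT_PGO=1` build flag and no source edits, which matches the "no code changes" note in this report.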

A/B Test Results

Methodology

  • Baseline: bench_random_mixed_hakmem_minimal (FAST v3: BENCH_MINIMAL=1)
  • Treatment: bench_random_mixed_hakmem_fast_pgo (FAST v3 + PGO: BENCH_MINIMAL=1 + TINY_FRONT_PGO=1)
  • Iterations: 10 runs per variant
  • Workload: 20M ops, WS=400, random mixed allocation pattern

Raw Data

Baseline (FAST - BENCH_MINIMAL only)

60378212, 60412333, 60126097, 60557230, 59593446,
59503095, 59686129, 58695907, 58750183, 58687807

Treatment (FAST+PGO - BENCH_MINIMAL + TINY_FRONT_PGO)

61083082, 60515989, 60785621, 61251824, 61135770,
57473378, 58233393, 59070853, 58446760, 59977402

Statistical Summary

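| Metric | Baseline (ops/s) | Treatment (ops/s) | Delta |
|--------|------------------|-------------------|-------|
| Mean | 59,639,044 | 59,797,407 | +0.27% |
| Median | 59,639,788 | 60,246,696 | +1.02% |
| Stdev | 732,715 (1.23%) | 1,385,809 (2.32%) | +89% CV |
| Min | 58,687,807 | 57,473,378 | -2.1% |
| Max | 60,557,230 | 61,251,824 | +1.1% |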
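
The summary figures can be recomputed from the raw data with a short routine. The sketch below assumes the sample (n-1) standard deviation, which reproduces the Stdev row. The names `run_stats`, `summarize`, and `newton_sqrt` are illustrative and are not taken from the repository's analysis scripts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_RUNS 64

typedef struct {
    double mean, median, stdev, cv_pct;
} run_stats;

/* Newton iteration, used so the sketch builds without linking libm. */
static double newton_sqrt(double v) {
    if (v <= 0.0) return 0.0;
    double x = v;
    for (int i = 0; i < 100; i++) x = 0.5 * (x + v / x);
    return x;
}

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Mean, median, sample standard deviation, and CV (in percent) of n runs. */
static run_stats summarize(const double *runs, size_t n) {
    assert(n >= 2 && n <= MAX_RUNS);
    double sorted[MAX_RUNS];
    memcpy(sorted, runs, n * sizeof sorted[0]);
    qsort(sorted, n, sizeof sorted[0], cmp_double);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += runs[i];

    run_stats s;
    s.mean = sum / (double)n;
    s.median = (n % 2) ? sorted[n / 2]
                       : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);

    double ss = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = runs[i] - s.mean;
        ss += d * d;
    }
    s.stdev = newton_sqrt(ss / (double)(n - 1));
    s.cv_pct = 100.0 * s.stdev / s.mean;
    return s;
}

/* Raw throughput values (ops/s) copied from the Raw Data section. */
static const double BASELINE[10] = {
    60378212, 60412333, 60126097, 60557230, 59593446,
    59503095, 59686129, 58695907, 58750183, 58687807};
static const double TREATMENT[10] = {
    61083082, 60515989, 60785621, 61251824, 61135770,
    57473378, 58233393, 59070853, 58446760, 59977402};
```

Calling `summarize(BASELINE, 10)` and `summarize(TREATMENT, 10)` and comparing the two means gives the Delta column.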

Decision Criteria

| Threshold | Range | Decision | Result |
|-----------|-------|----------|--------|
| GO | ≥ +0.5% | Accept | |
| NEUTRAL | ±0.5% | Research | |
| NO-GO | ≤ -0.5% | Revert | |

Actual: Mean +0.27% → NEUTRAL
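
The decision rule can be written as a small classifier. This is a sketch under one assumption: the table does not say which side owns an exact ±0.5% value, so the boundaries are assigned here to GO and NO-GO. The enum and function names are illustrative.

```c
#include <assert.h>

typedef enum { DECISION_NO_GO, DECISION_NEUTRAL, DECISION_GO } ab_decision;

/* Classify a mean throughput delta, given in percent, against the ±0.5% thresholds. */
static ab_decision classify_mean_delta(double delta_pct) {
    assert(delta_pct == delta_pct); /* reject NaN input */
    if (delta_pct >= 0.5) return DECISION_GO;
    if (delta_pct <= -0.5) return DECISION_NO_GO;
    return DECISION_NEUTRAL;
}
```

With the figures in this report, `classify_mean_delta(0.27)` yields `DECISION_NEUTRAL`, and the Phase 46A mean of -0.68% yields `DECISION_NO_GO`.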

Analysis

Observations

  1. Mean vs Median divergence:

    • Mean: +0.27% (borderline noise)
    • Median: +1.02% (positive signal, above threshold)
    • Interpretation: The median suggests a possible small gain, but the mean is pulled down by the low treatment runs (as low as 57.47M ops/s), so the two statistics disagree
  2. Variance increase:

    • Baseline CV: 1.23%
    • Treatment CV: 2.32% (+89% relative increase)
    • Possible causes:
      • Layout tax (code rearrangement affecting I-cache/alignment)
      • Workload interaction with fixed config
      • Run-to-run noise amplification
  3. Outlier in treatment:

    • Run 6: 57.47M ops/s (lowest across both groups)
    • May indicate instability or a cache-thrashing event; the cause was not profiled

Why NEUTRAL (not GO)?

  1. Mean below threshold: +0.27% < +0.5% decision boundary
  2. High variance: Treatment CV is about 1.9× the baseline CV (2.32% vs 1.23%), which suggests measurement uncertainty
  3. Phase 46A lesson: Small positive signals can mask layout tax; require conservative threshold
  4. Reproducibility concern: Wide spread in treatment group reduces confidence
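
One way to quantify "within noise" is to express the mean difference in units of its standard error (Welch's t statistic). The sketch below is an added illustration and was not part of the original test procedure. It assumes the runs are independent, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Newton iteration, used so the sketch builds without linking libm. */
static double newton_sqrt(double v) {
    if (v <= 0.0) return 0.0;
    double x = v;
    for (int i = 0; i < 100; i++) x = 0.5 * (x + v / x);
    return x;
}

static double mean_of(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (double)n;
}

/* Sample variance with the n-1 denominator. */
static double sample_var(const double *x, size_t n) {
    double m = mean_of(x, n), ss = 0.0;
    for (size_t i = 0; i < n; i++) ss += (x[i] - m) * (x[i] - m);
    return ss / (double)(n - 1);
}

/* Welch's t: (mean(b) - mean(a)) divided by the standard error of that difference. */
static double welch_t(const double *a, size_t na, const double *b, size_t nb) {
    assert(na >= 2 && nb >= 2);
    double se = newton_sqrt(sample_var(a, na) / (double)na +
                            sample_var(b, nb) / (double)nb);
    assert(se > 0.0);
    return (mean_of(b, nb) - mean_of(a, na)) / se;
}

/* Raw throughput values (ops/s) copied from the Raw Data section. */
static const double BASELINE[10] = {
    60378212, 60412333, 60126097, 60557230, 59593446,
    59503095, 59686129, 58695907, 58750183, 58687807};
static const double TREATMENT[10] = {
    61083082, 60515989, 60785621, 61251824, 61135770,
    57473378, 58233393, 59070853, 58446760, 59977402};
```

For the data above, `welch_t(BASELINE, 10, TREATMENT, 10)` comes out at roughly 0.3, meaning the observed +0.27% is about a third of one standard error. That is consistent with the NEUTRAL verdict.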

Why not NO-GO?

  • Median improvement (+1.02%) is positive and above threshold
  • No systematic regression pattern (just higher variance)
  • Possibility of genuine small gain obscured by variance

Health Check

Status: PASS

  • Command: make perf_observe (1 run)
  • Outcome: No crashes, assertions, or integrity failures
  • Throughput (OBSERVE build): 48.27M ops/s (expected ~20% slower than FAST)
  • Health profiles: Both C6_HEAVY and C7_SAFE passed

Comparison with Phase 46A

| Aspect | Phase 46A (always_inline) | Phase 47 (PGO mode) |
|--------|---------------------------|---------------------|
| Hypothesis | Inline hot function | Compile-time gates |
| Expected gain | +1~2% | +0.5~2.0% |
| Actual mean | -0.68% (NO-GO) | +0.27% (NEUTRAL) |
| Actual median | +0.17% | +1.02% |
| Variance | Similar to baseline | 2× baseline |
| Binary size change | None (inline ≈ non-inline) | Unknown (not measured) |
| Lesson | Layout tax real risk | Variance amplification risk |

Key Insight

Both phases show a positive median with a mean that fails the GO threshold (-0.68% in Phase 46A, +0.27% here). This pattern suggests:

  • Genuine micro-optimization present (median)
  • But layout tax or variance offsets mean improvement
  • Conservative threshold (±0.5% mean) is justified

Recommendations

1. Keep as Research Box (Current Status)

  • Action: Leave bench_random_mixed_hakmem_fast_pgo target in Makefile for future experiments
  • Rationale: Median +1.02% suggests potential; may combine well with other optimizations
  • Do NOT: Make default or promote to FAST standard build

2. Future Investigation (Optional)

If pursuing further:

  1. Increase sample size: 20-30 runs to reduce the standard error of the mean (the run-to-run variance itself will not shrink)

  2. Profile-guided analysis: Check if variance correlates with:

    • Cache miss patterns (perf stat -e cache-misses)
    • Branch misprediction (perf stat -e branch-misses)
    • TLB misses (perf stat -e dTLB-load-misses)
  3. Binary size/layout analysis:

    size bench_random_mixed_hakmem_minimal bench_random_mixed_hakmem_fast_pgo
    objdump -d ... | analyze_layout.py
    
  4. Workload sensitivity:

    • Test on different allocation patterns (C6-heavy, C7-safe, etc.)
    • Check if variance is workload-specific
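
To judge how many runs a follow-up needs, the standard error of the relative mean difference can be estimated from the two CVs. The sketch below is an approximation that assumes independent runs, the same number of runs per variant, and nearly equal means. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Newton iteration, used so the sketch builds without linking libm. */
static double newton_sqrt(double v) {
    if (v <= 0.0) return 0.0;
    double x = v;
    for (int i = 0; i < 100; i++) x = 0.5 * (x + v / x);
    return x;
}

/*
 * Approximate standard error (in percentage points) of the relative
 * difference between two means, given each variant's CV in percent
 * and the number of runs per variant.
 */
static double delta_se_pct(double cv_a_pct, double cv_b_pct, size_t runs) {
    assert(runs >= 1);
    double var_sum = cv_a_pct * cv_a_pct + cv_b_pct * cv_b_pct;
    return newton_sqrt(var_sum / (double)runs);
}
```

With the CVs from this report (1.23% and 2.32%), `delta_se_pct` gives about 0.83 percentage points at 10 runs and about 0.48 at 30 runs. If the variance stays at this level, one standard error at 30 runs is still close to the ±0.5% threshold, so reducing the variance matters as much as adding runs.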

3. DO NOT Promote (Current Verdict)

  • Reason: Mean +0.27% within ±0.5% noise threshold
  • Risk: High variance (2.32% CV) suggests instability
  • Box Theory: FAST build should be stable baseline, not experimental

Lessons Learned

  1. Branch prediction is effective: Even 5-7 branch eliminations yield <1% gain
  2. Layout tax remains a plausible risk: The higher treatment CV (2.32% vs 1.23%) is consistent with code rearrangement side effects, although binary layout was not measured
  3. Conservative thresholds justified: ±0.5% mean threshold filters out noise
  4. Median-positive ≠ actionable: Need both mean and median above threshold for GO decision

Files Modified

  1. Makefile: Added bench_random_mixed_hakmem_fast_pgo target (lines 662-670)

    • Build flags: EXTRA_CFLAGS='-DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1'
  2. No code changes: PGO mode uses existing tiny_front_config_box.h infrastructure

Next Steps

If NEUTRAL (Current)

  • Document in scorecard as "NEUTRAL - research box retained"
  • Monitor future phases for synergy opportunities

If Future GO Signal Emerges

  1. Run extended validation (30+ runs)
  2. Profile binary layout changes
  3. Test across multiple workloads
  4. Update scorecard and promote to FAST standard

Appendix: Test Commands

Baseline (FAST)

make bench_random_mixed_hakmem_minimal
BENCH_BIN=./bench_random_mixed_hakmem_minimal scripts/run_mixed_10_cleanenv.sh

Treatment (FAST+PGO)

make bench_random_mixed_hakmem_fast_pgo
BENCH_BIN=./bench_random_mixed_hakmem_fast_pgo scripts/run_mixed_10_cleanenv.sh

Health Check

make perf_observe

References

  • Phase 47 Instructions: docs/analysis/PHASE47_FAST_FRONT_PGO_MODE_INSTRUCTIONS.md
  • Phase 46A Results: docs/analysis/PHASE46A_TINY_REGION_ID_WRITE_HEADER_ALWAYS_INLINE_RESULTS.md
  • Box Theory: docs/analysis/PHASE2_STRUCTURAL_CHANGES_NEXT_INSTRUCTIONS.md
  • Config Box: core/box/tiny_front_config_box.h