Files

Moe Charm (CI) 7adbcdfcb6 Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement

## Summary

Completed Phase 54-60 optimization work:

**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset

**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY

**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc

**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized

## Key Metrics

- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes

## Files Added/Modified

New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h

Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py

Documentation: Phase 40-60 analysis documents

## Design Decisions

1. Profile separation (core/bench_profile.h):
   - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
   - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)

2. Box Theory compliance:
   - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
   - Single conversion points maintained
   - No physical deletions (compile-out only)

3. Lessons learned:
   - SSOT effective only where redundancy exists (Phase 60 showed limits)
   - Branch prediction extremely effective (~0 cycles for well-predicted branches)
   - Early-exit pattern valuable even when seemingly redundant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-17 06:24:01 +09:00

8.8 KiB

Raw Blame History

Phase 47 — FAST Front "PGO mode" A/B Test Results

Executive Summary

Decision: NEUTRAL

Mean improvement: +0.27% (below +0.5% threshold)
Median improvement: +1.02% (positive signal)
Verdict: Within noise range; no actionable performance gain
Side effects: Higher variance in treatment group (2.32% vs 1.23% CV)

Background

Objective

Apply HAKMEM_TINY_FRONT_PGO=1 to FAST build to evaluate whether compile-time fixed config (eliminating runtime gate branches) yields measurable performance improvements.

Expected Outcome (from instructions)

Original instruction estimate: +3~8%
Revised expectation (based on Phase 46A lessons): +0.5~2.0%
- Rationale: Modern CPUs predict branches well; layout tax is a real risk

Hypothesis

By converting runtime gate checks (e.g., unified_cache_enabled()) to compile-time constants:

Eliminate 5-7 branches in hot path
Improve I-cache density
Enable better constant propagation

Implementation

Changes Made

Makefile: Added new target bench_random_mixed_hakmem_fast_pgo
- Build flags: -DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1
- Location: /mnt/workdisk/public_share/hakmem/Makefile (line 662-670)
Config Mechanism: core/box/tiny_front_config_box.h
- Normal mode: Runtime gate functions (e.g., unified_cache_enabled())
- PGO mode: Compile-time constants (e.g., #define TINY_FRONT_UNIFIED_CACHE_ENABLED 1)

PGO Fixed Config Values

#define TINY_FRONT_ULTRA_SLIM_ENABLED    0   // Disabled
#define TINY_FRONT_HEAP_V2_ENABLED       0   // Disabled
#define TINY_FRONT_SFC_ENABLED           1   // Enabled
#define TINY_FRONT_FASTCACHE_ENABLED     0   // Disabled
#define TINY_FRONT_TLS_SLL_ENABLED       1   // Enabled
#define TINY_FRONT_UNIFIED_CACHE_ENABLED 1   // Enabled
#define TINY_FRONT_UNIFIED_GATE_ENABLED  1   // Enabled
#define TINY_FRONT_METRICS_ENABLED       0   // Disabled
#define TINY_FRONT_DIAG_ENABLED          0   // Disabled

A/B Test Results

Methodology

Baseline: bench_random_mixed_hakmem_minimal (FAST v3: BENCH_MINIMAL=1)
Treatment: bench_random_mixed_hakmem_fast_pgo (FAST v3 + PGO: BENCH_MINIMAL=1 + TINY_FRONT_PGO=1)
Iterations: 10 runs per variant
Workload: 20M ops, WS=400, random mixed allocation pattern

Raw Data

Baseline (FAST - BENCH_MINIMAL only)

60378212, 60412333, 60126097, 60557230, 59593446,
59503095, 59686129, 58695907, 58750183, 58687807

Treatment (FAST+PGO - BENCH_MINIMAL + TINY_FRONT_PGO)

61083082, 60515989, 60785621, 61251824, 61135770,
57473378, 58233393, 59070853, 58446760, 59977402

Statistical Summary

Metric	Baseline (ops/s)	Treatment (ops/s)	Delta
Mean	59,639,044	59,797,407	+0.27%
Median	59,639,788	60,246,696	+1.02%
Stdev	732,715 (1.23%)	1,385,809 (2.32%)	+89% CV
Min	58,687,807	57,473,378	-2.1%
Max	60,557,230	61,251,824	+1.1%

Decision Criteria

Threshold	Range	Decision	Result
GO	≥ +0.5%	Accept	❌
NEUTRAL	±0.5%	Research	✅
NO-GO	≤ -0.5%	Revert	❌

Actual: Mean +0.27% → NEUTRAL

Analysis

Observations

Mean vs Median divergence:
- Mean: +0.27% (borderline noise)
- Median: +1.02% (positive signal, above threshold)
- Interpretation: Median suggests possible small gain, but mean shows high outlier sensitivity
Variance increase:
- Baseline CV: 1.23%
- Treatment CV: 2.32% (+89% relative increase)
- Possible causes:
  - Layout tax (code rearrangement affecting I-cache/alignment)
  - Workload interaction with fixed config
  - Run-to-run noise amplification
Outlier in treatment:
- Run 6: 57.47M ops/s (lowest across both groups)
- Suggests potential instability or cache thrashing event

Why NEUTRAL (not GO)?

Mean below threshold: +0.27% < +0.5% decision boundary
High variance: 2× coefficient of variation suggests measurement uncertainty
Phase 46A lesson: Small positive signals can mask layout tax; require conservative threshold
Reproducibility concern: Wide spread in treatment group reduces confidence

Why not NO-GO?

Median improvement (+1.02%) is positive and above threshold
No systematic regression pattern (just higher variance)
Possibility of genuine small gain obscured by variance

Health Check

Status: ✅ PASS

Command: make perf_observe (1 run)
Outcome: No crashes, assertions, or integrity failures
Throughput (OBSERVE build): 48.27M ops/s (expected ~20% slower than FAST)
Health profiles: Both C6_HEAVY and C7_SAFE passed

Comparison with Phase 46A

Aspect	Phase 46A (`always_inline`)	Phase 47 (PGO mode)
Hypothesis	Inline hot function	Compile-time gates
Expected gain	+1~2%	+0.5~2.0%
Actual mean	-0.68% (NO-GO)	+0.27% (NEUTRAL)
Actual median	+0.17%	+1.02%
Variance	Similar to baseline	2× baseline
Binary size change	None (inline ≈ non-inline)	Unknown (not measured)
Lesson	Layout tax real risk	Variance amplification risk

Key Insight

Both phases show median-positive, mean-neutral signals. This pattern suggests:

Genuine micro-optimization present (median)
But layout tax or variance offsets mean improvement
Conservative threshold (±0.5% mean) is justified

Recommendations

1. Keep as Research Box (Current Status)

Action: Leave bench_random_mixed_hakmem_fast_pgo target in Makefile for future experiments
Rationale: Median +1.02% suggests potential; may combine well with other optimizations
Do NOT: Make default or promote to FAST standard build

2. Future Investigation (Optional)

If pursuing further:

Increase sample size: 20-30 runs to reduce variance noise
Profile-guided analysis: Check if variance correlates with:
- Cache miss patterns (perf stat -e cache-misses)
- Branch misprediction (perf stat -e branch-misses)
- TLB misses (perf stat -e dTLB-load-misses)

Binary size/layout analysis:

size bench_random_mixed_hakmem_minimal bench_random_mixed_hakmem_fast_pgo
objdump -d ... | analyze_layout.py

Workload sensitivity:
- Test on different allocation patterns (C6-heavy, C7-safe, etc.)
- Check if variance is workload-specific

3. DO NOT Promote (Current Verdict)

Reason: Mean +0.27% within ±0.5% noise threshold
Risk: High variance (2.32% CV) suggests instability
Box Theory: FAST build should be stable baseline, not experimental

Lessons Learned

Branch prediction is effective: Even 5-7 branch eliminations yield <1% gain
Layout tax is real: Variance increase (2× CV) suggests code rearrangement side effects
Conservative thresholds justified: ±0.5% mean threshold filters out noise
Median-positive ≠ actionable: Need both mean and median above threshold for GO decision

Files Modified

Makefile: Added bench_random_mixed_hakmem_fast_pgo target (lines 662-670)
- Build flags: EXTRA_CFLAGS='-DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1'
No code changes: PGO mode uses existing tiny_front_config_box.h infrastructure

Next Steps

If NEUTRAL (Current)

Document in scorecard as "NEUTRAL - research box retained"
Monitor future phases for synergy opportunities

If Future GO Signal Emerges

Run extended validation (30+ runs)
Profile binary layout changes
Test across multiple workloads
Update scorecard and promote to FAST standard

Appendix: Test Commands

Baseline (FAST)

make bench_random_mixed_hakmem_minimal
BENCH_BIN=./bench_random_mixed_hakmem_minimal scripts/run_mixed_10_cleanenv.sh

Treatment (FAST+PGO)

make bench_random_mixed_hakmem_fast_pgo
BENCH_BIN=./bench_random_mixed_hakmem_fast_pgo scripts/run_mixed_10_cleanenv.sh

Health Check

make perf_observe

References

Phase 47 Instructions: docs/analysis/PHASE47_FAST_FRONT_PGO_MODE_INSTRUCTIONS.md
Phase 46A Results: docs/analysis/PHASE46A_TINY_REGION_ID_WRITE_HEADER_ALWAYS_INLINE_RESULTS.md
Box Theory: docs/analysis/PHASE2_STRUCTURAL_CHANGES_NEXT_INSTRUCTIONS.md
Config Box: core/box/tiny_front_config_box.h

8.8 KiB Raw Blame History Unescape Escape

Phase 47 — FAST Front "PGO mode" A/B Test Results

Executive Summary

Background

Objective

Expected Outcome (from instructions)

Hypothesis

Implementation

Changes Made

PGO Fixed Config Values

A/B Test Results

Methodology

Raw Data

Baseline (FAST - BENCH_MINIMAL only)

Treatment (FAST+PGO - BENCH_MINIMAL + TINY_FRONT_PGO)

Statistical Summary

Decision Criteria

Analysis

Observations

Why NEUTRAL (not GO)?

Why not NO-GO?

Health Check

Comparison with Phase 46A

Key Insight

Recommendations

1. Keep as Research Box (Current Status)

2. Future Investigation (Optional)

3. DO NOT Promote (Current Verdict)

Lessons Learned

Files Modified

Next Steps

If NEUTRAL (Current)

If Future GO Signal Emerges

Appendix: Test Commands

Baseline (FAST)

Treatment (FAST+PGO)

Health Check

References

8.8 KiB

Raw Blame History