Phase 74-1 (ENV-gated LOCALIZE):
- Result: +0.50% (NEUTRAL)
- Runtime branch overhead caused instructions/branches to increase
- Diagnosed: Branch tax dominates intended optimization

Phase 74-2 (compile-time LOCALIZE):
- Result: -0.87% (NEUTRAL, P1 frozen)
- Removed runtime branch → instructions -0.6%, branches -2.3% ✓
- But cache-misses +86% (register pressure/spill) → net loss
- Conclusion: LOCALIZE itself works, but is fragile to cache effects

Key finding:
- Dependency chain reduction (LOCALIZE) has low ROI due to cache-miss sensitivity
- P1 (LOCALIZE) frozen at default OFF
- Next: Phase 74-3 (P0: FASTAPI): move branches outside hot loop

Files:
- core/hakmem_build_flags.h: HAKMEM_TINY_UC_LOCALIZE_COMPILED flag
- core/box/tiny_unified_cache_hitpath_env_box.h: ENV gate (frozen)
- core/front/tiny_unified_cache.h: compile-time #if blocks
- docs/analysis/PHASE74_*: Design, instructions, results
- CURRENT_TASK.md: P1 frozen, P0 next instructions

Also includes:
- Phase 69 refill tuning results (archived docs)
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 69 baseline update
- PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md: Route banner docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Phase 69-1: Refill Tuning Parameter Sweeps - Results
Date: 2025-12-17
Baseline: Phase 68 PGO (bench_random_mixed_hakmem_minimal_pgo)
Benchmark: scripts/run_mixed_10_cleanenv.sh (RUNS=10)
Goal: Find +3-6% optimization for M2 milestone (55% of mimalloc)
Executive Summary
Winner Identified: Warm Pool Size=16 achieves +3.26% (Strong GO) with ENV-only change.
- No code changes required: deploy via the HAKMEM_WARM_POOL_SIZE=16 environment variable
- Exceeds M2 threshold (+3.0% Strong GO criterion)
- Single strongest improvement among all tested parameters
- Combined optimizations are non-additive: Warm Pool Size=16 alone outperforms the tested combination (Warm Pool Size=16 + Unified Cache C5-C7=256)
⚠️ Important correction (2025-12 audit):
The previously reported “Refill Batch Size sweep” based on TINY_REFILL_BATCH_SIZE was not measuring a real knob.
That macro currently has zero call sites (it is defined but not referenced in the active Tiny front path), so any
observed deltas were layout/drift noise, not an algorithmic effect.
Full Sweep Results
Baseline (Phase 68 PGO)
| Metric | Value |
|---|---|
| Mean | 60.65M ops/s |
| Median | 60.68M ops/s |
| CV | 1.68% |
| % of mimalloc | 50.93% |
Runs: 10
Binary: bench_random_mixed_hakmem_minimal_pgo (PGO optimized)
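The Mean, Median, and CV columns used throughout this report can be reproduced from the per-run throughputs with a few lines of Python. The sketch below is illustrative only: the run values are made up (they are not the Phase 69 logs), and it assumes CV is the sample standard deviation divided by the mean, which the report does not state explicitly.

```python
import statistics

def summarize(runs_mops):
    """Return (mean, median, cv_percent) for a list of per-run throughputs in M ops/s."""
    mean = statistics.mean(runs_mops)
    median = statistics.median(runs_mops)
    # Assumption: CV = sample standard deviation / mean, expressed as a percentage.
    cv_percent = statistics.stdev(runs_mops) / mean * 100.0
    return mean, median, cv_percent

# Hypothetical 10-run sample (M ops/s); NOT the actual Phase 69 measurements.
example_runs = [60.0, 61.0, 59.0, 62.0, 60.5, 61.5, 59.5, 60.0, 61.0, 60.5]

mean, median, cv = summarize(example_runs)
print(f"mean={mean:.2f}M ops/s median={median:.2f}M ops/s CV={cv:.2f}%")
```

If the benchmark script uses the population standard deviation instead, swap `statistics.stdev` for `statistics.pstdev`; with 10 runs the difference is about 5% of the CV value.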
1. Warm Pool Size Sweep (ENV-only, no recompile)
Parameter: HAKMEM_WARM_POOL_SIZE (default: 12 SuperSlabs/class)
| Size | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
|---|---|---|---|---|---|
| 16 | 62.63 | 63.38 | 2.43% | +3.26% | Strong GO ✓✓✓ |
| 24 | 62.37 | 62.35 | 1.99% | +2.84% | GO ✓ |
Winner: Size=16 (+3.26%)
Analysis:
- Size=16 exceeds +3.0% Strong GO threshold
- Size=24 shows diminishing returns (+2.84% vs +3.26%)
- Among the sizes tested (default 12, 16, 24), Size=16 is the sweet spot between cache hit rate and memory overhead
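The "vs Baseline" column is the relative change of each mean against the Phase 68 PGO baseline mean. A minimal sketch of that arithmetic follows, using the means from the tables above and the +3.0% Strong GO criterion stated in this report; the function names are illustrative, not part of the benchmark tooling.

```python
BASELINE_MEAN = 60.65  # M ops/s, Phase 68 PGO baseline mean from this report
STRONG_GO_PCT = 3.0    # Strong GO criterion stated in this report

def pct_vs_baseline(mean_mops, baseline=BASELINE_MEAN):
    """Relative change of a sweep mean against the baseline mean, in percent."""
    return (mean_mops / baseline - 1.0) * 100.0

def is_strong_go(delta_pct):
    """True when the improvement meets the Strong GO criterion."""
    return delta_pct >= STRONG_GO_PCT

# Means taken from the Warm Pool Size sweep table.
for size, mean in [(16, 62.63), (24, 62.37)]:
    delta = pct_vs_baseline(mean)
    print(f"Size={size}: {delta:+.2f}% strong_go={is_strong_go(delta)}")
```

Run as written, this gives +3.26% for Size=16 (Strong GO) and +2.84% for Size=24 (below the Strong GO line), matching the table.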
Command Used:
# Size=16
HAKMEM_WARM_POOL_SIZE=16 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
# Size=24
HAKMEM_WARM_POOL_SIZE=24 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
2. Unified Cache C5-C7 Sweep (ENV-only, no recompile)
Parameter: HAKMEM_TINY_UNIFIED_C5, HAKMEM_TINY_UNIFIED_C6, HAKMEM_TINY_UNIFIED_C7 (default: 128 slots)
| Cache Size | Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
|---|---|---|---|---|---|
| 256 | 61.92 | 61.70 | 1.49% | +2.09% | GO ✓ |
| 512 | 61.80 | 62.00 | 1.21% | +1.89% | GO ✓ |
Winner: Cache=256 (+2.09%)
Analysis:
- Cache=256 shows +2.09% improvement (GO threshold)
- Cache=512 shows diminishing returns (+1.89% vs +2.09%)
- Larger caches provide marginal gains while increasing memory overhead
- Lower CV than baseline (1.49% vs 1.68%) indicates stable performance
Command Used:
# Cache=256
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
# Cache=512
HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
3. Combined Optimization Check
Configuration: Warm Pool Size=16 + Unified Cache C5-C7=256
| Mean (M ops/s) | Median (M ops/s) | CV | vs Baseline | Decision |
|---|---|---|---|---|
| 62.35 | 62.32 | 1.91% | +2.81% | GO (non-additive) |
Analysis:
- Combined result (+2.81%) is LESS than Warm Pool Size=16 alone (+3.26%)
- Non-additive behavior indicates parameters are not orthogonal
- Likely explanation: Warm pool optimization reduces unified cache miss rate, making cache capacity increase redundant
- Recommendation: Use Warm Pool Size=16 alone for maximum benefit
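The non-additivity claim can be made concrete by comparing the measured combined gain with the naive sum of the two individual gains. The sketch below uses the percentages reported above; the helper name is hypothetical.

```python
def additivity_gap(delta_a, delta_b, delta_combined):
    """Gap in percentage points between the measured combined gain and the naive sum."""
    return delta_combined - (delta_a + delta_b)

# Gains vs baseline (percent) taken from the sweep tables in this report.
warm16, cache256, combined = 3.26, 2.09, 2.81

gap = additivity_gap(warm16, cache256, combined)
beats_best_single = combined > max(warm16, cache256)

print(f"naive sum={warm16 + cache256:.2f}% measured={combined:.2f}% gap={gap:+.2f} pp")
print(f"combined beats best single knob: {beats_best_single}")
```

A gap near zero would suggest the knobs are roughly independent. Here the combined run falls about 2.5 percentage points short of the naive sum and does not even beat Warm Pool Size=16 alone, although the 0.45-point difference between the two is within the run-to-run CV reported for these sweeps.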
Command Used:
HAKMEM_WARM_POOL_SIZE=16 HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
4. Refill Batch Size Sweep (invalid — macro not wired)
The TINY_REFILL_BATCH_SIZE macro is currently define-only:
rg -n "TINY_REFILL_BATCH_SIZE" core
# -> core/hakmem_tiny_config.h only
So we do not treat it as a tuning parameter until it is actually connected to refill logic.
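The same "define-only" check can be scripted so that it separates the `#define` line from real use sites. The following is a rough sketch rather than the project's tooling: the function name and the demo file contents are hypothetical, and it treats any mention outside the macro's own `#define` (including `#ifdef` guards or comments) as a use.

```python
import re
import tempfile
from pathlib import Path

def macro_use_sites(root, macro):
    """Return (file name, line number) for each mention of `macro` outside its own #define."""
    define_re = re.compile(r"^\s*#\s*define\s+" + re.escape(macro) + r"\b")
    word_re = re.compile(r"\b" + re.escape(macro) + r"\b")
    sites = []
    for path in sorted(Path(root).rglob("*")):
        # Only scan regular C sources and headers.
        if not path.is_file() or path.suffix not in {".h", ".c"}:
            continue
        lines = path.read_text(errors="replace").splitlines()
        for line_no, line in enumerate(lines, 1):
            if word_re.search(line) and not define_re.match(line):
                sites.append((path.name, line_no))
    return sites

# Demo on a throwaway tree with hypothetical file contents (not the real core/ directory).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "config.h").write_text("#define TINY_REFILL_BATCH_SIZE 16\n")
    Path(tmp, "refill.c").write_text("int refill(void) { return 0; }\n")
    demo_sites = macro_use_sites(tmp, "TINY_REFILL_BATCH_SIZE")
    print("define-only" if not demo_sites else demo_sites)
```

An empty result means the macro has no call sites, so sweeping it can only move the benchmark through layout or drift effects.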
If we want to tune refill frequency, use the real knobs:
- HAKMEM_TINY_REFILL_COUNT_HOT
- HAKMEM_TINY_REFILL_COUNT_MID
- HAKMEM_TINY_REFILL_COUNT / HAKMEM_TINY_REFILL_COUNT_C{0..7}
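A sweep over one of these knobs could follow the same pattern as the Warm Pool and Unified Cache sweeps above. The sketch below only builds and prints the command lines; the candidate values (32, 64) are placeholders chosen for illustration, since this report does not list valid ranges or defaults for the refill-count variables.

```python
SCRIPT = "scripts/run_mixed_10_cleanenv.sh"
BENCH_BIN = "./bench_random_mixed_hakmem_minimal_pgo"

def sweep_env(knob, value, runs=10):
    """Environment overrides for one sweep point, mirroring the commands in this report."""
    return {knob: str(value), "RUNS": str(runs), "BENCH_BIN": BENCH_BIN}

def sweep_plan(knob, values):
    """One (value, env overrides) pair per candidate value."""
    return [(value, sweep_env(knob, value)) for value in values]

# Placeholder candidate values; check the knob's real range before running.
plan = sweep_plan("HAKMEM_TINY_REFILL_COUNT_HOT", [32, 64])

for value, env in plan:
    assignments = " ".join(f"{key}={val}" for key, val in env.items())
    print(f"{assignments} {SCRIPT}")
    # To execute instead of print (needs `import os, subprocess`):
    #   subprocess.run([SCRIPT], env={**os.environ, **env}, check=True)
```

Keeping the plan as data makes it easy to log exactly which environment each result came from, which matters when deltas are close to the run-to-run CV.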
Recommendations
Phase 69-2 (Baseline Promotion)
Primary Recommendation: Deploy Warm Pool Size=16 (ENV-only)
Rationale:
- Strongest single improvement (+3.26%, Strong GO)
- No code changes required, so zero risk of layout tax
- Immediate deployment via environment variable
- Exceeds M2 threshold (+3.0% Strong GO criterion)
Deployment:
# Add to PGO training environment and benchmark scripts
export HAKMEM_WARM_POOL_SIZE=16
Secondary Options (for Phase 69-3+)
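Option A: Warm Pool Size=16 + Refill Batch=32
- Combined potential: Unknown (requires testing, may be non-additive like unified cache)
- Complexity: Requires PGO rebuild for Batch=32
- Risk: Layout tax from code change
- Blocker: TINY_REFILL_BATCH_SIZE is currently define-only (zero call sites), so this option cannot be tested until the macro is wired into refill logic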
Option B: Warm Pool Size=16 alone (recommended)
- Gain: +3.26% measured (10 runs, CV 2.43%)
- Complexity: ENV-only, zero code changes
- Risk: None (reversible via ENV)
Raw Data Files
All 10-run logs saved to:
- /tmp/phase69_baseline.log: Phase 68 PGO baseline
- /tmp/phase69_warm16.log: Warm Pool Size=16
- /tmp/phase69_warm24.log: Warm Pool Size=24
- /tmp/phase69_cache256.log: Unified Cache C5-C7=256
- /tmp/phase69_cache512.log: Unified Cache C5-C7=512
- /tmp/phase69_combined.log: Combined (Warm=16 + Cache=256)
- /tmp/phase69_batch32.log: Refill Batch=32
Next Steps
Awaiting User Instructions for Phase 69-2:
- Confirm Warm Pool Size=16 as baseline promotion candidate
- Decide whether to:
  - Update ENV defaults in hakmem_tiny_config.h (preferred for SSOT)
  - Document as recommended ENV setting in README/docs
  - Add to PGO training scripts
- Re-run make pgo-fast-full with HAKMEM_WARM_POOL_SIZE=16 in training environment
- Update PERFORMANCE_TARGETS_SCORECARD.md with new baseline (projected: 62.63M ops/s, ~52.6% of mimalloc)
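The projected "~52.6% of mimalloc" figure follows from the baseline numbers in this report. The sketch below shows the arithmetic, assuming the mimalloc reference throughput is unchanged from the run that produced the 50.93% baseline figure; the variable names are illustrative.

```python
baseline_mean = 60.65  # M ops/s, Phase 68 PGO baseline mean
baseline_pct = 50.93   # baseline as % of mimalloc
new_mean = 62.63       # M ops/s, Warm Pool Size=16 mean
m2_target_pct = 55.0   # M2 milestone, % of mimalloc

# Implied mimalloc throughput (assumption: same reference run as the baseline).
mimalloc_mean = baseline_mean / (baseline_pct / 100.0)

projected_pct = new_mean / mimalloc_mean * 100.0
remaining_pp = m2_target_pct - projected_pct

print(f"implied mimalloc: {mimalloc_mean:.1f}M ops/s")
print(f"projected: {projected_pct:.1f}% of mimalloc, {remaining_pp:.1f} pp short of M2")
```

With these inputs the promotion closes a bit under half of the gap between the 50.93% baseline and the 55% M2 target, so further gains are still needed to reach M2.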
Phase 69-1 Status: ✅ COMPLETE
Winner: Warm Pool Size=16 (+3.26%, Strong GO, ENV-only)