# Phase 69-0: Refill Frequency × Fixed Tax Reduction - Design Memo

**Status**: 🟡 DESIGN (Phase 69-0)

**Objective**: Reduce "refill count × fixed overhead" by tuning batch sizes and cache capacities to minimize refill frequency. Target: **+3-6%** (shortest path to M2: 55%).

---

## Executive Summary

**Current Performance**: 61.614M ops/s = 50.93% of mimalloc (Phase 68 PGO baseline)

**M2 Target**: 66.54M ops/s = 55% of mimalloc (Gap: **+4.9M ops/s = +7.96%**)

**Strategy**: Reduce refill frequency by tuning batch sizes and cache capacities. Every refill incurs fixed overhead (SuperSlab lookup, metadata access, chain splicing). Reducing refill count directly improves throughput without micro-optimization risks.

**Why This Approach**:
- ✅ **High reproducibility**: Unlike branch/inline tuning, batch tuning has low layout tax risk
- ✅ **Box-compliant**: No structural changes, only parameter tuning
- ✅ **Measurable**: refill count is directly observable via counters
- ✅ **Reversible**: All changes are config-only (ENV variables or compile-time constants)

---

## 1. Current State Analysis

### 1.1 Refill Path Hierarchy

```
malloc()
  → Unified Cache (C2/C3: 2048 slots, C5-C7: 128 slots)
      ↓ MISS
    unified_cache_refill()
      ↓
    Warm Pool (12 SuperSlabs per class)
      ↓ MISS
    sll_refill_*_from_ss(class_idx, max_take=refill_count(class_idx))
      ↓
    trc_pop_from_freelist() OR trc_linear_carve()
      ↓
    trc_splice_to_sll()
```

### 1.2 Key Parameters (Current)

| Component | Parameter | Current Value | Location |
|-----------|-----------|---------------|----------|
| **Refill Count (hot/mid/global)** | `HAKMEM_TINY_REFILL_COUNT_*` | hot=128, mid=96, global=64 (defaults) | `core/hakmem_tiny_init.inc:270` |
| **Unified Cache C2/C3** | `unified_capacity(2/3)` | 2048 slots | `core/front/tiny_unified_cache.h:129` |
| **Unified Cache C5-C7** | `unified_capacity(5-7)` | 128 slots | `core/front/tiny_unified_cache.h:131` |
| **Unified Cache Others** | `unified_capacity(0/1/4)` | 64 slots | `core/front/tiny_unified_cache.h:128` |
| **Warm Pool Size** | `TINY_WARM_POOL_MAX_PER_CLASS` | 12 | `core/front/tiny_warm_pool.h:46` |

### 1.3 Refill Overhead Breakdown

**Fixed Overhead per refill** (~50-100 cycles):
- SuperSlab lookup (warm pool pop or registry scan)
- SlabMeta load (freelist, used, carved, capacity)
- Chain building (header writes, next pointer linking)
- TLS SLL splice (update head/count atomically)

**Cost Formula**:
```
Total Refill Cost = (refill_count × fixed_overhead) + (blocks_carved × per_block_cost)
```

**Optimization Strategy**:
- ↑ Batch size → ↓ refill_count → ↓ Total Cost (up to cache capacity limit)
- ↑ Cache capacity → ↓ miss rate → ↓ refill_count

---

## 2. Tunable Parameters (Phase 69 Sweep Plan)

### 2.1 Refill Count Sweep (ENV-only)

**Parameter**:
- `HAKMEM_TINY_REFILL_COUNT_HOT` (classes C0-C3)
- `HAKMEM_TINY_REFILL_COUNT_MID` (classes C4-C7)
- `HAKMEM_TINY_REFILL_COUNT` (fallback/global)

**Current defaults** (when ENV unset):
- hot=128, mid=96, global=64

**Rationale**:
- Smaller counts reduce per-refill work but increase refill frequency.
- Larger counts reduce refill frequency but may increase chain-building cost and TLS pressure.
- Sweep is cheap (ENV-only) and reversible.

**A/B Test Method**:
```bash
# Baseline
RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh

# Treatment examples (pick one axis at a time)
HAKMEM_TINY_REFILL_COUNT_MID=128 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_REFILL_COUNT_MID=64 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
```

---

### 2.2 Unified Cache Capacity Sweep (C5-C7 Focus)

**Parameter**: `unified_capacity(class_idx)` for C5-C7 (129B-1024B, Mixed workload)

**Current**: 128 slots (C5-C7)

**Rationale**: C5-C7 handle 129B-1024B range (mid-size allocations in Mixed benchmark). Increasing capacity reduces miss rate.
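
Both sweeps so far act on the same term of the Section 1.3 cost formula: a larger refill count or a lower miss rate means fewer refill events, each of which pays the fixed overhead. The sketch below evaluates that formula for a few batch sizes. It is a toy model, not a measurement: the 75-cycle fixed overhead is simply the midpoint of the ~50-100 cycle estimate, the 5-cycle per-block cost is an assumption, and the helper names `refill_events` and `refill_cost` are hypothetical.

```bash
#!/usr/bin/env bash
# Toy model of the Section 1.3 cost formula. All cycle numbers are
# illustrative assumptions, not measured values.

# refill_events BLOCKS BATCH
# Number of refills needed to supply BLOCKS blocks when each refill
# delivers BATCH blocks (ceiling division).
refill_events() {
  local blocks=$1 batch=$2
  echo $(( (blocks + batch - 1) / batch ))
}

# refill_cost BLOCKS BATCH FIXED PER_BLOCK
# Total cycles = refill_events * FIXED + BLOCKS * PER_BLOCK
refill_cost() {
  local blocks=$1 batch=$2 fixed=$3 per_block=$4
  local events
  events=$(refill_events "$blocks" "$batch")
  echo $(( events * fixed + blocks * per_block ))
}

# Compare candidate refill counts for 1,000,000 refilled blocks,
# assuming 75 cycles of fixed overhead and 5 cycles per block.
for batch in 64 96 128 160; do
  printf 'batch=%-4s refills=%-6s cycles=%s\n' \
    "$batch" \
    "$(refill_events 1000000 "$batch")" \
    "$(refill_cost 1000000 "$batch" 75 5)"
done
```

In this model, doubling the batch from 64 to 128 roughly halves the fixed-overhead term while the per-block term stays constant. The model ignores the cache capacity limit and any TLS pressure from longer chains, which is why the real effect has to come from the A/B runs.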

**Sweep Range**: [128, 256, 512]

**ENV Control**:
```bash
# Baseline (128)
HAKMEM_TINY_UNIFIED_C5=128 HAKMEM_TINY_UNIFIED_C6=128 HAKMEM_TINY_UNIFIED_C7=128

# Treatment 1 (256)
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256

# Treatment 2 (512)
HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512
```

**Expected Winner**: 256 (balance between working set and RSS)

**Why Not C2/C3**:
- C2/C3 already have 2048 slots (very high capacity)
- Further increase unlikely to improve (already low miss rate)
- Focus on under-optimized classes (C5-C7)

---

### 2.3 Warm Pool Size Sweep

**Parameter**: `TINY_WARM_POOL_MAX_PER_CLASS`

**Current**: 12 SuperSlabs per class

**Rationale**: Warm pool caches hot SuperSlabs to avoid registry scan. Larger pool → lower registry scan frequency.

**Sweep Range**: [12, 16, 24]

**ENV Control**:
```bash
HAKMEM_WARM_POOL_SIZE=16  # Treatment 1
HAKMEM_WARM_POOL_SIZE=24  # Treatment 2
```

**Expected Winner**: 16 (diminishing returns beyond this)

**Caveat**: Memory overhead = pool_size × sizeof(SuperSlab*) × TINY_NUM_CLASSES = 16 × 8 × 8 = 1KB per thread (negligible)

---

## 3. Implementation Strategy

### 3.1 Phase 69-1: Single-Parameter Sweeps (Isolation)

**Goal**: Measure each parameter's individual impact to avoid confounding effects.

**Order** (easiest to hardest):

1. **Warm Pool Size** (ENV-only, no recompile):
   ```bash
   for SIZE in 12 16 24; do
     HAKMEM_WARM_POOL_SIZE=$SIZE RUNS=10 scripts/run_mixed_10_cleanenv.sh
   done
   ```
   - Expected: +0.5-1.0% (registry scan reduction)
   - Risk: Low (ENV-only change)

2. **Unified Cache C5-C7** (ENV-only, no recompile):
   ```bash
   for CAP in 128 256 512; do
     HAKMEM_TINY_UNIFIED_C5=$CAP HAKMEM_TINY_UNIFIED_C6=$CAP HAKMEM_TINY_UNIFIED_C7=$CAP \
       RUNS=10 scripts/run_mixed_10_cleanenv.sh
   done
   ```
   - Expected: +1-2% (miss rate reduction for mid-size allocations)
   - Risk: Low (ENV-only change)

3. **Refill Batch Size** (removed, 2025-12 audit): `TINY_REFILL_BATCH_SIZE` is not wired into the current Tiny front, so it does not function as a knob. The actual refill amount is controlled by `HAKMEM_TINY_REFILL_COUNT_*` (ENV-only).

### 3.2 Phase 69-2: Combined Optimization (Best Settings)

After identifying winners from Phase 69-1, combine them:

```bash
# Example: C5-C7=256, warm_pool=16
HAKMEM_WARM_POOL_SIZE=16 \
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 \
RUNS=10 scripts/run_mixed_10_cleanenv.sh
```

**Expected Combined Gain**: +3-6% (additive if parameters are orthogonal)

---

## 4. Measurement & Validation

### 4.1 Primary Metric: Throughput

```bash
RUNS=10 scripts/run_mixed_10_cleanenv.sh
# Extract: Mean, Median, CV
# Decision: GO (+1%), Strong GO (+3%)
```

### 4.2 Secondary Metrics (Observability)

**Unified Cache Hit Rate**:
```bash
HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_random_mixed_hakmem_minimal_pgo
# Output: g_unified_cache_hits_global / (hits + misses)
```

**Refill Count** (requires instrumentation):
```c
#include <stdatomic.h>
#include <stdint.h>

// Add to unified_cache_refill():
static _Atomic uint64_t g_refill_count_total = 0;
// The three-argument form with an explicit memory order is atomic_fetch_add_explicit().
atomic_fetch_add_explicit(&g_refill_count_total, 1, memory_order_relaxed);
```

**Warm Pool Hit Rate**:
```bash
# Already exists in g_warm_pool_stats[class_idx].hits / misses
# Print at shutdown via tiny_warm_pool_print_stats()
```

### 4.3 Layout Tax Check (If NO-GO)

```bash
# Run forensics on regression
./scripts/box/layout_tax_forensics_box.sh \
  ./bench_random_mixed_hakmem_minimal_pgo \
  ./bench_random_mixed_hakmem_minimal_pgo_phase69
```

---

## 5. Risk Assessment

| Risk | Mitigation |
|------|------------|
| **Layout Tax** (batch size change) | Use `layout_tax_forensics_box.sh` to diagnose. Revert if IPC drops >3%. |
| **Cache Thrashing** (capacity too high) | Monitor LLC-misses via perf stat. Limit C5-C7 capacity to 512 max. |
| **RSS Increase** (larger batches/caches) | Measure RSS before/after. Acceptable if <+5% for +3% throughput. |
| **PGO Mismatch** (batch change) | Re-run PGO training after batch size change (included in `pgo-fast-full`). |

---

## 6. Decision Criteria

### GO Thresholds

- **GO**: +1.0% (additive improvement, worth merging)
- **Strong GO**: +3.0% (M2-worthy, promote to baseline)
- **NEUTRAL**: ±1.0% (no regression, but no clear win)
- **NO-GO**: <-1.0% (regression, investigate layout tax)

### Promotion Strategy

- **Single-parameter GO**: Merge immediately if +1%+
- **Combined GO**: Require +3%+ to justify complexity
- **Strong GO**: Update PGO baseline + PERFORMANCE_TARGETS_SCORECARD

---

## 7. Next Steps

### Immediate (Phase 69-1)

1. **ENV Sweeps** (no recompile):
   - Warm pool size: 12 → 16 → 24
   - Unified cache C5-C7: 128 → 256 → 512

2. **Refill Count Sweep** (ENV-only):
   - `HAKMEM_TINY_REFILL_COUNT_MID`: 64 → 96 → 128 → 160 (example)
   - `HAKMEM_TINY_REFILL_COUNT_HOT`: 96 → 128 → 160 (example)

### Follow-Up (Phase 69-2)

3. **Combined Optimization**:
   - Apply winning parameters from Phase 69-1
   - Verify additive gains (+3-6% target)

4. **Baseline Promotion** (if Strong GO):
   - Update `pgo_fast_profile_config.sh` with winning ENV vars
   - Re-run `make pgo-fast-full` to bake optimizations into baseline

---

## Artifacts

- **This design memo**: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
- **Sweep script** (TODO): `scripts/box/phase69_refill_sweep.sh`
- **Results log** (TODO): `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`

---

**Status**: 🟢 READY FOR SWEEP (Phase 69-1)

**Estimated Time**: 2-3 hours (ENV sweeps) + 4-6 hours (batch sweep with PGO)

**Expected Outcome**: +3-6% combined gain → M2 within range (55% target)
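
The Phase 69-1 sweeps could be driven from a single script. The following is a minimal sketch of what the TODO artifact `scripts/box/phase69_refill_sweep.sh` might look like, not the actual script. The ENV names, sweep ranges, benchmark binary, and runner path come from this memo. The `DRY_RUN` switch and the `run_case` and `sweep_all` helpers are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of scripts/box/phase69_refill_sweep.sh (Phase 69-1).
# ENV names and sweep ranges are taken from the design memo; DRY_RUN,
# run_case, and sweep_all are illustrative helpers, not existing tooling.

RUNNER=${RUNNER:-scripts/run_mixed_10_cleanenv.sh}
BENCH_BIN=${BENCH_BIN:-./bench_random_mixed_hakmem_minimal_pgo}
RUNS=${RUNS:-10}

# run_case VAR=VALUE...
# Run the benchmark with the given ENV overrides. DRY_RUN defaults to 1,
# so the sketch only prints the command line unless DRY_RUN=0 is set.
run_case() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$* RUNS=$RUNS BENCH_BIN=$BENCH_BIN $RUNNER"
  else
    echo "== $* =="
    env "$@" RUNS="$RUNS" BENCH_BIN="$BENCH_BIN" "$RUNNER"
  fi
}

# Sweep one axis at a time so each parameter is measured in isolation.
sweep_all() {
  local v
  for v in 12 16 24; do
    run_case HAKMEM_WARM_POOL_SIZE="$v"
  done
  for v in 128 256 512; do
    run_case HAKMEM_TINY_UNIFIED_C5="$v" HAKMEM_TINY_UNIFIED_C6="$v" HAKMEM_TINY_UNIFIED_C7="$v"
  done
  for v in 64 96 128 160; do
    run_case HAKMEM_TINY_REFILL_COUNT_MID="$v"
  done
  for v in 96 128 160; do
    run_case HAKMEM_TINY_REFILL_COUNT_HOT="$v"
  done
}

sweep_all
```

Running the sketch as-is only lists the planned cases. Setting `DRY_RUN=0` would execute them through the runner, with each case compared against the unmodified baseline using the GO thresholds in Section 6.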