- Promoted Warm Pool Size=16 as the new baseline (+3.26% gain).
- Updated PERFORMANCE_TARGETS_SCORECARD.md with Phase 69 results.
- Updated scripts/run_mixed_10_cleanenv.sh and core/bench_profile.h to use HAKMEM_WARM_POOL_SIZE=16 by default.
- Clarified that TINY_REFILL_BATCH_SIZE is not currently connected.
Phase 69-0: Refill Frequency × Fixed Tax Reduction — Design Memo
Status: 🟡 DESIGN (Phase 69-0)
Objective: Reduce "refill count × fixed overhead" by tuning batch sizes and cache capacities to minimize refill frequency. Target: +3-6% (shortest path to M2: 55%).
Executive Summary
Current Performance: 61.614M ops/s = 50.93% of mimalloc (Phase 68 PGO baseline)
M2 Target: 66.54M ops/s = 55% of mimalloc (Gap: +4.9M ops/s = +7.96%)
Strategy: Reduce refill frequency by tuning batch sizes and cache capacities. Every refill incurs fixed overhead (SuperSlab lookup, metadata access, chain splicing). Reducing refill count directly improves throughput without micro-optimization risks.
Why This Approach:
- ✅ High reproducibility: Unlike branch/inline tuning, batch tuning has low layout tax risk
- ✅ Box-compliant: No structural changes, only parameter tuning
- ✅ Measurable: refill count is directly observable via counters
- ✅ Reversible: All changes are config-only (ENV variables or compile-time constants)
1. Current State Analysis
1.1 Refill Path Hierarchy
malloc() → Unified Cache (C2/C3: 2048 slots, C5-C7: 128 slots)
↓ MISS
unified_cache_refill()
↓
Warm Pool (12 SuperSlabs per class)
↓ MISS
sll_refill_*_from_ss(class_idx, max_take=refill_count(class_idx))
↓
trc_pop_from_freelist() OR trc_linear_carve()
↓
trc_splice_to_sll()
1.2 Key Parameters (Current)
| Component | Parameter | Current Value | Location |
|---|---|---|---|
| Refill Count (hot/mid/global) | HAKMEM_TINY_REFILL_COUNT_* | hot=128, mid=96, global=64 (defaults) | core/hakmem_tiny_init.inc:270 |
| Unified Cache C2/C3 | unified_capacity(2/3) | 2048 slots | core/front/tiny_unified_cache.h:129 |
| Unified Cache C5-C7 | unified_capacity(5-7) | 128 slots | core/front/tiny_unified_cache.h:131 |
| Unified Cache Others | unified_capacity(0/1/4) | 64 slots | core/front/tiny_unified_cache.h:128 |
| Warm Pool Size | TINY_WARM_POOL_MAX_PER_CLASS | 12 | core/front/tiny_warm_pool.h:46 |
1.3 Refill Overhead Breakdown
Fixed Overhead per refill (~50-100 cycles):
- SuperSlab lookup (warm pool pop or registry scan)
- SlabMeta load (freelist, used, carved, capacity)
- Chain building (header writes, next pointer linking)
- TLS SLL splice (update head/count atomically)
Cost Formula:
Total Refill Cost = (refill_count × fixed_overhead) + (blocks_carved × per_block_cost)
Optimization Strategy:
- ↑ Batch size → ↓ refill_count → ↓ Total Cost (up to cache capacity limit)
- ↑ Cache capacity → ↓ miss rate → ↓ refill_count
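To make the trade-off concrete, a worked example with round, assumed numbers (illustration only, not measurements):
blocks needed = 1,000,000; fixed_overhead ≈ 75 cycles
batch = 64  → refill_count ≈ 15,625 → fixed tax ≈ 1.17M cycles
batch = 128 → refill_count ≈ 7,813  → fixed tax ≈ 0.59M cycles
The blocks_carved × per_block_cost term is identical in both cases, so the saving is purely the halved fixed tax, bounded by cache capacity and TLS pressure.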
2. Tunable Parameters (Phase 69 Sweep Plan)
2.1 Refill Count Sweep (ENV-only)
Parameters:
- HAKMEM_TINY_REFILL_COUNT_HOT (classes C0–C3)
- HAKMEM_TINY_REFILL_COUNT_MID (classes C4–C7)
- HAKMEM_TINY_REFILL_COUNT (fallback/global)
Current defaults (when ENV unset):
- hot=128, mid=96, global=64
Rationale:
- Smaller counts reduce per-refill work but increase refill frequency.
- Larger counts reduce refill frequency but may increase chain-building cost and TLS pressure.
- Sweep is cheap (ENV-only) and reversible.
A/B Test Method:
# Baseline
RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
# Treatment examples (pick one axis at a time)
HAKMEM_TINY_REFILL_COUNT_MID=128 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_REFILL_COUNT_MID=64 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
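The A/B pairs above generalize to a sweep; a minimal loop using the same runner and the example values listed in Section 7 (values are illustrative, not validated winners):
for MID in 64 96 128 160; do
  HAKMEM_TINY_REFILL_COUNT_MID=$MID RUNS=10 \
    BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
done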
2.2 Unified Cache Capacity Sweep (C5-C7 Focus)
Parameter: unified_capacity(class_idx) for C5-C7 (129B-1024B, Mixed workload)
Current: 128 slots (C5-C7)
Rationale: C5-C7 handle 129B-1024B range (mid-size allocations in Mixed benchmark). Increasing capacity reduces miss rate.
Sweep Range: [128, 256, 512]
ENV Control:
# Baseline (128)
HAKMEM_TINY_UNIFIED_C5=128 HAKMEM_TINY_UNIFIED_C6=128 HAKMEM_TINY_UNIFIED_C7=128
# Treatment 1 (256)
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256
# Treatment 2 (512)
HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512
Expected Winner: 256 (balance between working set and RSS)
Why Not C2/C3:
- C2/C3 already have 2048 slots (very high capacity)
- Further increase unlikely to improve (already low miss rate)
- Focus on under-optimized classes (C5-C7)
2.3 Warm Pool Size Sweep
Parameter: TINY_WARM_POOL_MAX_PER_CLASS
Current: 12 SuperSlabs per class
Rationale: Warm pool caches hot SuperSlabs to avoid registry scan. Larger pool → lower registry scan frequency.
Sweep Range: [12, 16, 24]
ENV Control:
HAKMEM_WARM_POOL_SIZE=16 # Treatment 1
HAKMEM_WARM_POOL_SIZE=24 # Treatment 2
Expected Winner: 16 (diminishing returns beyond this)
Caveat: Memory overhead = pool_size × sizeof(SuperSlab*) × TINY_NUM_CLASSES = 16 × 8 × 8 = 1KB per thread (negligible)
3. Implementation Strategy
3.1 Phase 69-1: Single-Parameter Sweeps (Isolation)
Goal: Measure each parameter's individual impact to avoid confounding effects.
Order (easiest to hardest):
1. Warm Pool Size (ENV-only, no recompile):
   for SIZE in 12 16 24; do
     HAKMEM_WARM_POOL_SIZE=$SIZE RUNS=10 scripts/run_mixed_10_cleanenv.sh
   done
   - Expected: +0.5-1.0% (registry scan reduction)
   - Risk: Low (ENV-only change)
2. Unified Cache C5-C7 (ENV-only, no recompile):
   for CAP in 128 256 512; do
     HAKMEM_TINY_UNIFIED_C5=$CAP HAKMEM_TINY_UNIFIED_C6=$CAP HAKMEM_TINY_UNIFIED_C7=$CAP \
       RUNS=10 scripts/run_mixed_10_cleanenv.sh
   done
   - Expected: +1-2% (miss rate reduction for mid-size allocations)
   - Risk: Low (ENV-only change)
3. Refill Batch Size (removed, 2025-12 audit):
   TINY_REFILL_BATCH_SIZE is not wired into the current Tiny front and therefore does not function as a knob. The actual refill amount is controlled via HAKMEM_TINY_REFILL_COUNT_* (ENV-only).
3.2 Phase 69-2: Combined Optimization (Best Settings)
After identifying winners from Phase 69-1, combine them:
# Example: C5-C7=256, warm_pool=16
HAKMEM_WARM_POOL_SIZE=16 \
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 \
RUNS=10 scripts/run_mixed_10_cleanenv.sh
Expected Combined Gain: +3-6% (additive if parameters are orthogonal)
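If the refill-count sweep from 2.1 also produces a winner, it stacks onto the same command line (the MID value of 128 below is purely illustrative, not a validated winner):
HAKMEM_WARM_POOL_SIZE=16 \
HAKMEM_TINY_REFILL_COUNT_MID=128 \
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 \
RUNS=10 scripts/run_mixed_10_cleanenv.sh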
4. Measurement & Validation
4.1 Primary Metric: Throughput
RUNS=10 scripts/run_mixed_10_cleanenv.sh
# Extract: Mean, Median, CV
# Decision: GO (+1%), Strong GO (+3%)
4.2 Secondary Metrics (Observability)
Unified Cache Hit Rate:
HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_random_mixed_hakmem_minimal_pgo
# Output: g_unified_cache_hits_global / (hits + misses)
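To tie the capacity sweep back to this counter, the measurement flag can be combined with a treatment setting (same flag and ENV names as above; the hit/miss totals come from the existing counter output):
HAKMEM_MEASURE_UNIFIED_CACHE=1 \
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 \
./bench_random_mixed_hakmem_minimal_pgo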
Refill Count (requires instrumentation):
// Add to unified_cache_refill():
static _Atomic uint64_t g_refill_count_total = 0;
atomic_fetch_add(&g_refill_count_total, 1, memory_order_relaxed);
Warm Pool Hit Rate:
# Already exists in g_warm_pool_stats[class_idx].hits / misses
# Print at shutdown via tiny_warm_pool_print_stats()
4.3 Layout Tax Check (If NO-GO)
# Run forensics on regression
./scripts/box/layout_tax_forensics_box.sh \
./bench_random_mixed_hakmem_minimal_pgo \
./bench_random_mixed_hakmem_minimal_pgo_phase69
5. Risk Assessment
| Risk | Mitigation |
|---|---|
| Layout Tax (batch size change) | Use layout_tax_forensics_box.sh to diagnose. Revert if IPC drops >3%. |
| Cache Thrashing (capacity too high) | Monitor LLC-misses via perf stat. Limit C5-C7 capacity to 512 max. |
| RSS Increase (larger batches/caches) | Measure RSS before/after. Acceptable if <+5% for +3% throughput. |
| PGO Mismatch (batch change) | Re-run PGO training after batch size change (included in pgo-fast-full). |
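For the LLC and RSS checks in the table above, generic commands along these lines are sufficient (event names vary by CPU/kernel; shown as an illustration, not part of the project's scripts):
perf stat -e LLC-loads,LLC-load-misses,instructions,cycles \
  ./bench_random_mixed_hakmem_minimal_pgo
/usr/bin/time -v ./bench_random_mixed_hakmem_minimal_pgo   # "Maximum resident set size" line = RSS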
6. Decision Criteria
GO Thresholds
- GO: +1.0% (additive improvement, worth merging)
- Strong GO: +3.0% (M2-worthy, promote to baseline)
- NEUTRAL: ±1.0% (no regression, but no clear win)
- NO-GO: <-1.0% (regression, investigate layout tax)
Promotion Strategy
- Single-parameter GO: Merge immediately if +1%+
- Combined GO: Require +3%+ to justify complexity
- Strong GO: Update PGO baseline + PERFORMANCE_TARGETS_SCORECARD
7. Next Steps
Immediate (Phase 69-1)
1. ENV Sweeps (no recompile):
   - Warm pool size: 12 → 16 → 24
   - Unified cache C5-C7: 128 → 256 → 512
2. Refill Count Sweep (ENV-only):
   - HAKMEM_TINY_REFILL_COUNT_MID: 64 → 96 → 128 → 160 (example)
   - HAKMEM_TINY_REFILL_COUNT_HOT: 96 → 128 → 160 (example)
(A combined script sketch for these sweeps follows below.)
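The sweeps above are what the TODO script under Artifacts is meant to wrap. A minimal sketch of scripts/box/phase69_refill_sweep.sh, assuming only the runner and ENV knobs already used in this memo (result parsing/logging intentionally omitted):
#!/usr/bin/env bash
# Phase 69-1 single-parameter sweeps, one axis at a time.
set -euo pipefail
BENCH_BIN="${BENCH_BIN:-./bench_random_mixed_hakmem_minimal_pgo}"
RUNS="${RUNS:-10}"
# Axis 1: warm pool size
for SIZE in 12 16 24; do
  echo "== warm_pool=${SIZE} =="
  HAKMEM_WARM_POOL_SIZE=$SIZE RUNS=$RUNS BENCH_BIN=$BENCH_BIN scripts/run_mixed_10_cleanenv.sh
done
# Axis 2: unified cache capacity for C5-C7
for CAP in 128 256 512; do
  echo "== unified_C5-C7=${CAP} =="
  HAKMEM_TINY_UNIFIED_C5=$CAP HAKMEM_TINY_UNIFIED_C6=$CAP HAKMEM_TINY_UNIFIED_C7=$CAP \
    RUNS=$RUNS BENCH_BIN=$BENCH_BIN scripts/run_mixed_10_cleanenv.sh
done
# Axis 3: refill count (mid classes); hot classes can be swept the same way
for MID in 64 96 128 160; do
  echo "== refill_count_mid=${MID} =="
  HAKMEM_TINY_REFILL_COUNT_MID=$MID RUNS=$RUNS BENCH_BIN=$BENCH_BIN scripts/run_mixed_10_cleanenv.sh
done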
Follow-Up (Phase 69-2)
1. Combined Optimization:
   - Apply winning parameters from Phase 69-1
   - Verify additive gains (+3-6% target)
2. Baseline Promotion (if Strong GO):
   - Update pgo_fast_profile_config.sh with winning ENV vars
   - Re-run make pgo-fast-full to bake optimizations into baseline
   - Update PERFORMANCE_TARGETS_SCORECARD.md with the new baseline
Artifacts
- This design memo: docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md
- Sweep script (TODO): scripts/box/phase69_refill_sweep.sh
- Results log (TODO): docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md
Status: 🟢 READY FOR SWEEP (Phase 69-1)
Estimated Time: 2-3 hours (ENV sweeps) + 4-6 hours (refill-count sweep + PGO re-run for baseline promotion)
Expected Outcome: +3-6% combined gain → puts the M2 target (55%) within reach