Phase 69-0: Refill Frequency × Fixed Tax Reduction — Design Memo

Status: 🟡 DESIGN (Phase 69-0)

Objective: Reduce "refill count × fixed overhead" by tuning batch sizes and cache capacities to minimize refill frequency. Target: +3-6% (shortest path to M2: 55%).


Executive Summary

Current Performance: 61.614M ops/s = 50.93% of mimalloc (Phase 68 PGO baseline)

M2 Target: 66.54M ops/s = 55% of mimalloc (Gap: +4.9M ops/s = +7.96%)

Strategy: Reduce refill frequency by tuning batch sizes and cache capacities. Every refill incurs fixed overhead (SuperSlab lookup, metadata access, chain splicing). Reducing refill count directly improves throughput without micro-optimization risks.

Why This Approach:

  • High reproducibility: Unlike branch/inline tuning, batch tuning has low layout tax risk
  • Box-compliant: No structural changes, only parameter tuning
  • Measurable: refill count is directly observable via counters
  • Reversible: All changes are config-only (ENV variables or compile-time constants)

1. Current State Analysis

1.1 Refill Path Hierarchy

malloc() → Unified Cache (C2/C3: 2048 slots, C5-C7: 128 slots)
            ↓ MISS
         unified_cache_refill()
            ↓
         Warm Pool (12 SuperSlabs per class)
            ↓ MISS
         sll_refill_*_from_ss(class_idx, max_take=refill_count(class_idx))
            ↓
         trc_pop_from_freelist() OR trc_linear_carve()
            ↓
         trc_splice_to_sll()

1.2 Key Parameters (Current)

| Component | Parameter | Current Value | Location |
| --- | --- | --- | --- |
| Refill Count (hot/mid/global) | HAKMEM_TINY_REFILL_COUNT_* | hot=128, mid=96, global=64 (defaults) | core/hakmem_tiny_init.inc:270 |
| Unified Cache C2/C3 | unified_capacity(2/3) | 2048 slots | core/front/tiny_unified_cache.h:129 |
| Unified Cache C5-C7 | unified_capacity(5-7) | 128 slots | core/front/tiny_unified_cache.h:131 |
| Unified Cache Others | unified_capacity(0/1/4) | 64 slots | core/front/tiny_unified_cache.h:128 |
| Warm Pool Size | TINY_WARM_POOL_MAX_PER_CLASS | 12 | core/front/tiny_warm_pool.h:46 |

1.3 Refill Overhead Breakdown

Fixed Overhead per refill (~50-100 cycles):

  • SuperSlab lookup (warm pool pop or registry scan)
  • SlabMeta load (freelist, used, carved, capacity)
  • Chain building (header writes, next pointer linking)
  • TLS SLL splice (update head/count atomically)

Cost Formula:

Total Refill Cost = (num_refills × fixed_overhead) + (blocks_carved × per_block_cost)

(num_refills = number of refill events; not to be confused with the HAKMEM_TINY_REFILL_COUNT_* batch-size parameters, which set how many blocks each refill takes.)

Optimization Strategy:

  • ↑ Batch size → ↓ num_refills → ↓ Total Cost (up to cache capacity limit)
  • ↑ Cache capacity → ↓ miss rate → ↓ num_refills
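
As a rough illustration of the formula, the standalone calculation below compares batch sizes of 64, 128, and 256 for a single class. The fixed overhead uses the midpoint of the 50-100 cycle range from 1.3; the per-block cost of 4 cycles is an assumed placeholder, not a measurement.

    /* Back-of-the-envelope check of the cost formula above.
       Numbers are illustrative only (fixed overhead ~75 cycles, ~4 cycles/block). */
    #include <stdio.h>

    int main(void) {
        const double allocations  = 10e6;  /* allocations served from one class */
        const double fixed_cycles = 75.0;  /* per-refill fixed overhead (assumed midpoint) */
        const double per_block    = 4.0;   /* per-block carve/splice cost (assumed) */

        for (int batch = 64; batch <= 256; batch *= 2) {
            double num_refills = allocations / batch;
            double total = num_refills * fixed_cycles + allocations * per_block;
            printf("batch=%3d  refills=%8.0f  total=%11.0f cycles  (%.2f cycles/alloc)\n",
                   batch, num_refills, total, total / allocations);
        }
        return 0;
    }

Doubling the batch from 64 to 128 removes half of the fixed-overhead term, while going from 128 to 256 only removes a further quarter of it, which is why the sweep favors moderate increases over very large batches.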

2. Tunable Parameters (Phase 69 Sweep Plan)

2.1 Refill Count Sweep (ENV-only)

Parameter:

  • HAKMEM_TINY_REFILL_COUNT_HOT (classes C0-C3)
  • HAKMEM_TINY_REFILL_COUNT_MID (classes C4-C7)
  • HAKMEM_TINY_REFILL_COUNT (fallback/global)

Current defaults (when ENV unset):

  • hot=128, mid=96, global=64

Rationale:

  • Smaller counts reduce per-refill work but increase refill frequency.
  • Larger counts reduce refill frequency but may increase chain-building cost and TLS pressure.
  • Sweep is cheap (ENV-only) and reversible.
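
For reference, a minimal sketch of how the hot/mid/global split maps onto class indices. Only the ENV names, the C0-C3/C4-C7 split, and the 128/96/64 defaults come from this memo; the helper below, including its fallback behavior, is illustrative, and the real logic lives in core/hakmem_tiny_init.inc.

    /* Illustrative only: per-class refill count with ENV override. */
    #include <stdio.h>
    #include <stdlib.h>

    static int env_or(const char* name, int fallback) {
        const char* v = getenv(name);
        return (v && *v) ? atoi(v) : fallback;
    }

    static int refill_count(int class_idx) {
        if (class_idx <= 3)  /* C0-C3: hot classes */
            return env_or("HAKMEM_TINY_REFILL_COUNT_HOT", 128);
        if (class_idx <= 7)  /* C4-C7: mid classes */
            return env_or("HAKMEM_TINY_REFILL_COUNT_MID", 96);
        /* anything else would use the global fallback (exact semantics assumed) */
        return env_or("HAKMEM_TINY_REFILL_COUNT", 64);
    }

    int main(void) {
        for (int c = 0; c < 8; c++)
            printf("class %d -> refill_count %d\n", c, refill_count(c));
        return 0;
    }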

A/B Test Method:

# Baseline
RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh

# Treatment examples (pick one axis at a time)
HAKMEM_TINY_REFILL_COUNT_MID=128 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_REFILL_COUNT_MID=64  RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh

2.2 Unified Cache Capacity Sweep (C5-C7 Focus)

Parameter: unified_capacity(class_idx) for C5-C7 (129B-1024B, Mixed workload)

Current: 128 slots (C5-C7)

Rationale: C5-C7 handle 129B-1024B range (mid-size allocations in Mixed benchmark). Increasing capacity reduces miss rate.

Sweep Range: [128, 256, 512]

ENV Control:

# Baseline (128)
HAKMEM_TINY_UNIFIED_C5=128 HAKMEM_TINY_UNIFIED_C6=128 HAKMEM_TINY_UNIFIED_C7=128

# Treatment 1 (256)
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256

# Treatment 2 (512)
HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512

Expected Winner: 256 (balance between working set and RSS)

Why Not C2/C3:

  • C2/C3 already have 2048 slots (very high capacity)
  • Further increase unlikely to improve (already low miss rate)
  • Focus on under-optimized classes (C5-C7)
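
To put the C5-C7 sweep range and the RSS concern in perspective, a standalone footprint estimate, assuming each unified-cache slot stores one 8-byte block pointer (the slot layout is an assumption, not taken from tiny_unified_cache.h):

    /* Rough per-thread footprint of the C5-C7 slot arrays for the sweep values. */
    #include <stdio.h>

    int main(void) {
        const int caps[]    = {128, 256, 512};  /* sweep range for C5-C7 */
        const int classes   = 3;                /* C5, C6, C7 */
        const int slot_bytes = 8;               /* assumed: one pointer per slot */

        for (int i = 0; i < 3; i++) {
            int bytes = caps[i] * classes * slot_bytes;
            printf("capacity %3d -> %5d bytes (%.1f KiB) of slot arrays per thread\n",
                   caps[i], bytes, bytes / 1024.0);
        }
        return 0;
    }

The slot arrays themselves stay small even at 512; the real RSS exposure is the cached blocks they keep alive (potentially up to 512 × 1 KiB ≈ 512 KiB per thread for C7 alone), which is why the sweep is capped at 512.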

2.3 Warm Pool Size Sweep

Parameter: TINY_WARM_POOL_MAX_PER_CLASS

Current: 12 SuperSlabs per class

Rationale: Warm pool caches hot SuperSlabs to avoid registry scan. Larger pool → lower registry scan frequency.

Sweep Range: [12, 16, 24]

ENV Control:

HAKMEM_WARM_POOL_SIZE=16  # Treatment 1
HAKMEM_WARM_POOL_SIZE=24  # Treatment 2

Expected Winner: 16 (diminishing returns beyond this)

Caveat: Memory overhead = pool_size × sizeof(SuperSlab*) × TINY_NUM_CLASSES = 16 × 8 × 8 = 1KB per thread (negligible)
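
A minimal sketch of the per-class pool shape that arithmetic implies (the actual structure in core/front/tiny_warm_pool.h may differ; this only shows where pool_size × 8 bytes × 8 classes comes from):

    /* Illustrative per-thread warm pool layout: a fixed stack of SuperSlab
       pointers per class. With 16 slots, 8 classes, and 8-byte pointers this is
       16 * 8 * 8 = 1024 bytes per thread, matching the caveat above. */
    #include <stdio.h>

    #define TINY_NUM_CLASSES             8
    #define TINY_WARM_POOL_MAX_PER_CLASS 16  /* candidate value from the sweep */

    typedef struct SuperSlab SuperSlab;      /* opaque here */

    typedef struct {
        SuperSlab* slots[TINY_WARM_POOL_MAX_PER_CLASS];
        int        count;
    } WarmPoolClass;

    static WarmPoolClass g_warm_pool[TINY_NUM_CLASSES];

    int main(void) {
        printf("warm pool pointer storage: %zu bytes per thread\n",
               sizeof(g_warm_pool[0].slots) * TINY_NUM_CLASSES);
        return 0;
    }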


3. Implementation Strategy

3.1 Phase 69-1: Single-Parameter Sweeps (Isolation)

Goal: Measure each parameter's individual impact to avoid confounding effects.

Order (easiest to hardest):

  1. Warm Pool Size (ENV-only, no recompile):

    for SIZE in 12 16 24; do
      HAKMEM_WARM_POOL_SIZE=$SIZE RUNS=10 scripts/run_mixed_10_cleanenv.sh
    done
    
    • Expected: +0.5-1.0% (registry scan reduction)
    • Risk: Low (ENV-only change)
  2. Unified Cache C5-C7 (ENV-only, no recompile):

    for CAP in 128 256 512; do
      HAKMEM_TINY_UNIFIED_C5=$CAP HAKMEM_TINY_UNIFIED_C6=$CAP HAKMEM_TINY_UNIFIED_C7=$CAP \
        RUNS=10 scripts/run_mixed_10_cleanenv.sh
    done
    
    • Expected: +1-2% (miss rate reduction for mid-size allocations)
    • Risk: Low (ENV-only change)
  3. Refill Batch Size (removed; per a 2025-12 audit, TINY_REFILL_BATCH_SIZE is not wired into the current Tiny front and does not function as a knob. The actual refill amount is controlled via HAKMEM_TINY_REFILL_COUNT_*, ENV-only; see 2.1.)

3.2 Phase 69-2: Combined Optimization (Best Settings)

After identifying winners from Phase 69-1, combine them:

# Example: C5-C7=256, warm_pool=16
HAKMEM_WARM_POOL_SIZE=16 \
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 \
  RUNS=10 scripts/run_mixed_10_cleanenv.sh

Expected Combined Gain: +3-6% (additive if parameters are orthogonal)


4. Measurement & Validation

4.1 Primary Metric: Throughput

RUNS=10 scripts/run_mixed_10_cleanenv.sh
# Extract: Mean, Median, CV
# Decision: GO (+1%), Strong GO (+3%)

4.2 Secondary Metrics (Observability)

Unified Cache Hit Rate:

HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_random_mixed_hakmem_minimal_pgo
# Output: g_unified_cache_hits_global / (hits + misses)

Refill Count (requires instrumentation):

// Add to unified_cache_refill() (needs <stdatomic.h> / <stdint.h>):
static _Atomic uint64_t g_refill_count_total = 0;
// One increment per refill event; relaxed ordering is sufficient for a stats counter.
atomic_fetch_add(&g_refill_count_total, 1, memory_order_relaxed);

Warm Pool Hit Rate:

# Already exists in g_warm_pool_stats[class_idx].hits / misses
# Print at shutdown via tiny_warm_pool_print_stats()
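
If per-class percentages are wanted rather than raw counters, a sketch of the kind of summary tiny_warm_pool_print_stats() could emit. The g_warm_pool_stats layout below is assumed; only the hits/misses fields are referenced by this memo.

    /* Hypothetical stats shape and print loop; illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint64_t hits, misses; } WarmPoolStats;
    static WarmPoolStats g_warm_pool_stats[8];

    static void warm_pool_print_hit_rates(void) {
        for (int c = 0; c < 8; c++) {
            uint64_t total = g_warm_pool_stats[c].hits + g_warm_pool_stats[c].misses;
            double rate = total ? 100.0 * (double)g_warm_pool_stats[c].hits / (double)total : 0.0;
            printf("class %d: hits=%llu misses=%llu hit-rate=%.2f%%\n",
                   c, (unsigned long long)g_warm_pool_stats[c].hits,
                   (unsigned long long)g_warm_pool_stats[c].misses, rate);
        }
    }

    int main(void) {
        g_warm_pool_stats[5] = (WarmPoolStats){ .hits = 900, .misses = 100 };  /* dummy data */
        warm_pool_print_hit_rates();
        return 0;
    }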

4.3 Layout Tax Check (If NO-GO)

# Run forensics on regression
./scripts/box/layout_tax_forensics_box.sh \
    ./bench_random_mixed_hakmem_minimal_pgo \
    ./bench_random_mixed_hakmem_minimal_pgo_phase69

5. Risk Assessment

| Risk | Mitigation |
| --- | --- |
| Layout Tax (batch size change) | Use layout_tax_forensics_box.sh to diagnose. Revert if IPC drops >3%. |
| Cache Thrashing (capacity too high) | Monitor LLC-misses via perf stat. Limit C5-C7 capacity to 512 max. |
| RSS Increase (larger batches/caches) | Measure RSS before/after. Acceptable if <+5% for +3% throughput. |
| PGO Mismatch (batch change) | Re-run PGO training after batch size change (included in pgo-fast-full). |

6. Decision Criteria

GO Thresholds

  • GO: +1.0% (additive improvement, worth merging)
  • Strong GO: +3.0% (M2-worthy, promote to baseline)
  • NEUTRAL: ±1.0% (no regression, but no clear win)
  • NO-GO: <-1.0% (regression, investigate layout tax)
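
These thresholds reduce to a simple comparison of baseline and treatment means; a trivial standalone helper (illustrative only):

    /* Classify a sweep result against the GO thresholds above.
       delta_pct = (treatment_mean / baseline_mean - 1) * 100. */
    #include <stdio.h>

    static const char* classify(double baseline_mops, double treatment_mops) {
        double delta_pct = (treatment_mops / baseline_mops - 1.0) * 100.0;
        if (delta_pct >= 3.0)  return "Strong GO";
        if (delta_pct >= 1.0)  return "GO";
        if (delta_pct > -1.0)  return "NEUTRAL";
        return "NO-GO (check layout tax)";
    }

    int main(void) {
        /* Example: Phase 68 baseline (61.614M ops/s) vs the M2 target (66.54M ops/s). */
        printf("%s\n", classify(61.614, 66.54));
        return 0;
    }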

Promotion Strategy

  • Single-parameter GO: Merge immediately if +1%+
  • Combined GO: Require +3%+ to justify complexity
  • Strong GO: Update PGO baseline + PERFORMANCE_TARGETS_SCORECARD

7. Next Steps

Immediate (Phase 69-1)

  1. ENV Sweeps (no recompile):

    • Warm pool size: 12 → 16 → 24
    • Unified cache C5-C7: 128 → 256 → 512
  2. Refill Count Sweep (ENV-only):

    • HAKMEM_TINY_REFILL_COUNT_MID: 64 → 96 → 128 → 160 (example)
    • HAKMEM_TINY_REFILL_COUNT_HOT: 96 → 128 → 160 (example)

Follow-Up (Phase 69-2)

  1. Combined Optimization:

    • Apply winning parameters from Phase 69-1
    • Verify additive gains (+3-6% target)
  2. Baseline Promotion (if Strong GO):

    • Update pgo_fast_profile_config.sh with winning ENV vars
    • Re-run make pgo-fast-full to bake optimizations into baseline

Artifacts

  • This design memo: docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md
  • Sweep script (TODO): scripts/box/phase69_refill_sweep.sh
  • Results log (TODO): docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md

Status: 🟢 READY FOR SWEEP (Phase 69-1)

Estimated Time: 2-3 hours (ENV sweeps) + 4-6 hours (batch sweep with PGO)

Expected Outcome: +3-6% combined gain → within reach of M2 (55% target)