Phase 69-0: Refill Frequency × Fixed Tax Reduction — Design Memo

Status: 🟡 DESIGN (Phase 69-0)

Objective: Reduce "refill count × fixed overhead" by tuning batch sizes and cache capacities to minimize refill frequency. Target: +3-6% (shortest path to M2: 55%).


Executive Summary

Current Performance: 61.614M ops/s = 50.93% of mimalloc (Phase 68 PGO baseline)

M2 Target: 66.54M ops/s = 55% of mimalloc (Gap: +4.9M ops/s = +7.96%)

Strategy: Reduce refill frequency by tuning batch sizes and cache capacities. Every refill incurs fixed overhead (SuperSlab lookup, metadata access, chain splicing). Reducing refill count directly improves throughput without micro-optimization risks.

Why This Approach:

  • High reproducibility: Unlike branch/inline tuning, batch tuning has low layout tax risk
  • Box-compliant: No structural changes, only parameter tuning
  • Measurable: refill count is directly observable via counters
  • Reversible: All changes are config-only (ENV variables or compile-time constants)

1. Current State Analysis

1.1 Refill Path Hierarchy

malloc() → Unified Cache (C2/C3: 2048 slots, C5-C7: 128 slots)
            ↓ MISS
         unified_cache_refill()
            ↓
         Warm Pool (12 SuperSlabs per class)
            ↓ MISS
         sll_refill_*_from_ss(class_idx, max_take=refill_count(class_idx))
            ↓
         trc_pop_from_freelist() OR trc_linear_carve()
            ↓
         trc_splice_to_sll()

1.2 Key Parameters (Current)

| Component | Parameter | Current Value | Location |
| --- | --- | --- | --- |
| Refill Count (hot/mid/global) | HAKMEM_TINY_REFILL_COUNT_* | hot=128, mid=96, global=64 (defaults) | core/hakmem_tiny_init.inc:270 |
| Unified Cache C2/C3 | unified_capacity(2/3) | 2048 slots | core/front/tiny_unified_cache.h:129 |
| Unified Cache C5-C7 | unified_capacity(5-7) | 128 slots | core/front/tiny_unified_cache.h:131 |
| Unified Cache Others | unified_capacity(0/1/4) | 64 slots | core/front/tiny_unified_cache.h:128 |
| Warm Pool Size | TINY_WARM_POOL_MAX_PER_CLASS | 12 | core/front/tiny_warm_pool.h:46 |

1.3 Refill Overhead Breakdown

Fixed Overhead per refill (~50-100 cycles):

  • SuperSlab lookup (warm pool pop or registry scan)
  • SlabMeta load (freelist, used, carved, capacity)
  • Chain building (header writes, next pointer linking)
  • TLS SLL splice (update head/count atomically)

Cost Formula:

Total Refill Cost = (num_refills × fixed_overhead) + (blocks_carved × per_block_cost)

(num_refills = number of refill events; not to be confused with the HAKMEM_TINY_REFILL_COUNT_* batch-size parameters, which set how many blocks each refill takes.)

Optimization Strategy:

  • ↑ Batch size → ↓ num_refills → ↓ Total Cost (up to cache capacity limit)
  • ↑ Cache capacity → ↓ miss rate → ↓ num_refills
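
As a rough illustration of the formula, the standalone calculation below compares batch sizes of 64, 128, and 256 for a single class. The fixed overhead uses the midpoint of the 50-100 cycle range from 1.3; the per-block cost of 4 cycles is an assumed placeholder, not a measurement.

    /* Back-of-the-envelope check of the cost formula above.
       Numbers are illustrative only (fixed overhead ~75 cycles, ~4 cycles/block). */
    #include <stdio.h>

    int main(void) {
        const double allocations  = 10e6;  /* allocations served from one class */
        const double fixed_cycles = 75.0;  /* per-refill fixed overhead (assumed midpoint) */
        const double per_block    = 4.0;   /* per-block carve/splice cost (assumed) */

        for (int batch = 64; batch <= 256; batch *= 2) {
            double num_refills = allocations / batch;
            double total = num_refills * fixed_cycles + allocations * per_block;
            printf("batch=%3d  refills=%8.0f  total=%11.0f cycles  (%.2f cycles/alloc)\n",
                   batch, num_refills, total, total / allocations);
        }
        return 0;
    }

Doubling the batch from 64 to 128 removes half of the fixed-overhead term, while going from 128 to 256 only removes a further quarter of it, which is why the sweep favors moderate increases over very large batches.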

2. Tunable Parameters (Phase 69 Sweep Plan)

2.1 Refill Count Sweep (ENV-only)

Parameter:

  • HAKMEM_TINY_REFILL_COUNT_HOT (classes C0-C3)
  • HAKMEM_TINY_REFILL_COUNT_MID (classes C4-C7)
  • HAKMEM_TINY_REFILL_COUNT (fallback/global)

Current defaults (when ENV unset):

  • hot=128, mid=96, global=64

Rationale:

  • Smaller counts reduce per-refill work but increase refill frequency.
  • Larger counts reduce refill frequency but may increase chain-building cost and TLS pressure.
  • Sweep is cheap (ENV-only) and reversible.
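
For reference, a minimal sketch of how the hot/mid/global split maps onto class indices. Only the ENV names, the C0-C3/C4-C7 split, and the 128/96/64 defaults come from this memo; the helper below, including its fallback behavior, is illustrative, and the real logic lives in core/hakmem_tiny_init.inc.

    /* Illustrative only: per-class refill count with ENV override. */
    #include <stdio.h>
    #include <stdlib.h>

    static int env_or(const char* name, int fallback) {
        const char* v = getenv(name);
        return (v && *v) ? atoi(v) : fallback;
    }

    static int refill_count(int class_idx) {
        if (class_idx <= 3)  /* C0-C3: hot classes */
            return env_or("HAKMEM_TINY_REFILL_COUNT_HOT", 128);
        if (class_idx <= 7)  /* C4-C7: mid classes */
            return env_or("HAKMEM_TINY_REFILL_COUNT_MID", 96);
        /* anything else would use the global fallback (exact semantics assumed) */
        return env_or("HAKMEM_TINY_REFILL_COUNT", 64);
    }

    int main(void) {
        for (int c = 0; c < 8; c++)
            printf("class %d -> refill_count %d\n", c, refill_count(c));
        return 0;
    }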

A/B Test Method:

# Baseline
RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh

# Treatment examples (pick one axis at a time)
HAKMEM_TINY_REFILL_COUNT_MID=128 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_REFILL_COUNT_MID=64  RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh

2.2 Unified Cache Capacity Sweep (C5-C7 Focus)

Parameter: unified_capacity(class_idx) for C5-C7 (129B-1024B, Mixed workload)

Current: 128 slots (C5-C7)

Rationale: C5-C7 handle 129B-1024B range (mid-size allocations in Mixed benchmark). Increasing capacity reduces miss rate.

Sweep Range: [128, 256, 512]

ENV Control:

# Baseline (128)
HAKMEM_TINY_UNIFIED_C5=128 HAKMEM_TINY_UNIFIED_C6=128 HAKMEM_TINY_UNIFIED_C7=128

# Treatment 1 (256)
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256

# Treatment 2 (512)
HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512

Expected Winner: 256 (balance between working set and RSS)

Why Not C2/C3:

  • C2/C3 already have 2048 slots (very high capacity)
  • Further increase unlikely to improve (already low miss rate)
  • Focus on under-optimized classes (C5-C7)
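
To put the C5-C7 sweep range and the RSS concern in perspective, a standalone footprint estimate, assuming each unified-cache slot stores one 8-byte block pointer (the slot layout is an assumption, not taken from tiny_unified_cache.h):

    /* Rough per-thread footprint of the C5-C7 slot arrays for the sweep values. */
    #include <stdio.h>

    int main(void) {
        const int caps[]    = {128, 256, 512};  /* sweep range for C5-C7 */
        const int classes   = 3;                /* C5, C6, C7 */
        const int slot_bytes = 8;               /* assumed: one pointer per slot */

        for (int i = 0; i < 3; i++) {
            int bytes = caps[i] * classes * slot_bytes;
            printf("capacity %3d -> %5d bytes (%.1f KiB) of slot arrays per thread\n",
                   caps[i], bytes, bytes / 1024.0);
        }
        return 0;
    }

The slot arrays themselves stay small even at 512; the real RSS exposure is the cached blocks they keep alive (potentially up to 512 × 1 KiB ≈ 512 KiB per thread for C7 alone), which is why the sweep is capped at 512.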

2.3 Warm Pool Size Sweep

Parameter: TINY_WARM_POOL_MAX_PER_CLASS

Current: 12 SuperSlabs per class

Rationale: Warm pool caches hot SuperSlabs to avoid registry scan. Larger pool → lower registry scan frequency.

Sweep Range: [12, 16, 24]

ENV Control:

HAKMEM_WARM_POOL_SIZE=16  # Treatment 1
HAKMEM_WARM_POOL_SIZE=24  # Treatment 2

Expected Winner: 16 (diminishing returns beyond this)

Caveat: Memory overhead = pool_size × sizeof(SuperSlab*) × TINY_NUM_CLASSES = 16 × 8 × 8 = 1KB per thread (negligible)
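
A minimal sketch of the per-class pool shape that arithmetic implies (the actual structure in core/front/tiny_warm_pool.h may differ; this only shows where pool_size × 8 bytes × 8 classes comes from):

    /* Illustrative per-thread warm pool layout: a fixed stack of SuperSlab
       pointers per class. With 16 slots, 8 classes, and 8-byte pointers this is
       16 * 8 * 8 = 1024 bytes per thread, matching the caveat above. */
    #include <stdio.h>

    #define TINY_NUM_CLASSES             8
    #define TINY_WARM_POOL_MAX_PER_CLASS 16  /* candidate value from the sweep */

    typedef struct SuperSlab SuperSlab;      /* opaque here */

    typedef struct {
        SuperSlab* slots[TINY_WARM_POOL_MAX_PER_CLASS];
        int        count;
    } WarmPoolClass;

    static WarmPoolClass g_warm_pool[TINY_NUM_CLASSES];

    int main(void) {
        printf("warm pool pointer storage: %zu bytes per thread\n",
               sizeof(g_warm_pool[0].slots) * TINY_NUM_CLASSES);
        return 0;
    }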


3. Implementation Strategy

3.1 Phase 69-1: Single-Parameter Sweeps (Isolation)

Goal: Measure each parameter's individual impact to avoid confounding effects.

Order (easiest to hardest):

  1. Warm Pool Size (ENV-only, no recompile):

    for SIZE in 12 16 24; do
      HAKMEM_WARM_POOL_SIZE=$SIZE RUNS=10 scripts/run_mixed_10_cleanenv.sh
    done
    
    • Expected: +0.5-1.0% (registry scan reduction)
    • Risk: Low (ENV-only change)
  2. Unified Cache C5-C7 (ENV-only, no recompile):

    for CAP in 128 256 512; do
      HAKMEM_TINY_UNIFIED_C5=$CAP HAKMEM_TINY_UNIFIED_C6=$CAP HAKMEM_TINY_UNIFIED_C7=$CAP \
        RUNS=10 scripts/run_mixed_10_cleanenv.sh
    done
    
    • Expected: +1-2% (miss rate reduction for mid-size allocations)
    • Risk: Low (ENV-only change)
  3. Refill Batch Size (removed; per a 2025-12 audit, TINY_REFILL_BATCH_SIZE is not wired into the current Tiny front and does not function as a knob. The actual refill amount is controlled via HAKMEM_TINY_REFILL_COUNT_*, ENV-only; see 2.1.)

3.2 Phase 69-2: Combined Optimization (Best Settings)

After identifying winners from Phase 69-1, combine them:

# Example: C5-C7=256, warm_pool=16
HAKMEM_WARM_POOL_SIZE=16 \
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 \
  RUNS=10 scripts/run_mixed_10_cleanenv.sh

Expected Combined Gain: +3-6% (additive if parameters are orthogonal)


4. Measurement & Validation

4.1 Primary Metric: Throughput

RUNS=10 scripts/run_mixed_10_cleanenv.sh
# Extract: Mean, Median, CV
# Decision: GO (+1%), Strong GO (+3%)

4.2 Secondary Metrics (Observability)

Unified Cache Hit Rate:

HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_random_mixed_hakmem_minimal_pgo
# Output: g_unified_cache_hits_global / (hits + misses)

Refill Count (requires instrumentation):

// Add to unified_cache_refill() (needs <stdatomic.h> / <stdint.h>):
static _Atomic uint64_t g_refill_count_total = 0;
// One increment per refill event; relaxed ordering is sufficient for a stats counter.
atomic_fetch_add(&g_refill_count_total, 1, memory_order_relaxed);

Warm Pool Hit Rate:

# Already exists in g_warm_pool_stats[class_idx].hits / misses
# Print at shutdown via tiny_warm_pool_print_stats()
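
If per-class percentages are wanted rather than raw counters, a sketch of the kind of summary tiny_warm_pool_print_stats() could emit. The g_warm_pool_stats layout below is assumed; only the hits/misses fields are referenced by this memo.

    /* Hypothetical stats shape and print loop; illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint64_t hits, misses; } WarmPoolStats;
    static WarmPoolStats g_warm_pool_stats[8];

    static void warm_pool_print_hit_rates(void) {
        for (int c = 0; c < 8; c++) {
            uint64_t total = g_warm_pool_stats[c].hits + g_warm_pool_stats[c].misses;
            double rate = total ? 100.0 * (double)g_warm_pool_stats[c].hits / (double)total : 0.0;
            printf("class %d: hits=%llu misses=%llu hit-rate=%.2f%%\n",
                   c, (unsigned long long)g_warm_pool_stats[c].hits,
                   (unsigned long long)g_warm_pool_stats[c].misses, rate);
        }
    }

    int main(void) {
        g_warm_pool_stats[5] = (WarmPoolStats){ .hits = 900, .misses = 100 };  /* dummy data */
        warm_pool_print_hit_rates();
        return 0;
    }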

4.3 Layout Tax Check (If NO-GO)

# Run forensics on regression
./scripts/box/layout_tax_forensics_box.sh \
    ./bench_random_mixed_hakmem_minimal_pgo \
    ./bench_random_mixed_hakmem_minimal_pgo_phase69

5. Risk Assessment

| Risk | Mitigation |
| --- | --- |
| Layout Tax (batch size change) | Use layout_tax_forensics_box.sh to diagnose. Revert if IPC drops >3%. |
| Cache Thrashing (capacity too high) | Monitor LLC-misses via perf stat. Limit C5-C7 capacity to 512 max. |
| RSS Increase (larger batches/caches) | Measure RSS before/after. Acceptable if <+5% for +3% throughput. |
| PGO Mismatch (batch change) | Re-run PGO training after batch size change (included in pgo-fast-full). |

6. Decision Criteria

GO Thresholds

  • GO: +1.0% (additive improvement, worth merging)
  • Strong GO: +3.0% (M2-worthy, promote to baseline)
  • NEUTRAL: ±1.0% (no regression, but no clear win)
  • NO-GO: <-1.0% (regression, investigate layout tax)
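
These thresholds reduce to a simple comparison of baseline and treatment means; a trivial standalone helper (illustrative only):

    /* Classify a sweep result against the GO thresholds above.
       delta_pct = (treatment_mean / baseline_mean - 1) * 100. */
    #include <stdio.h>

    static const char* classify(double baseline_mops, double treatment_mops) {
        double delta_pct = (treatment_mops / baseline_mops - 1.0) * 100.0;
        if (delta_pct >= 3.0)  return "Strong GO";
        if (delta_pct >= 1.0)  return "GO";
        if (delta_pct > -1.0)  return "NEUTRAL";
        return "NO-GO (check layout tax)";
    }

    int main(void) {
        /* Example: Phase 68 baseline (61.614M ops/s) vs the M2 target (66.54M ops/s). */
        printf("%s\n", classify(61.614, 66.54));
        return 0;
    }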

Promotion Strategy

  • Single-parameter GO: Merge immediately if +1%+
  • Combined GO: Require +3%+ to justify complexity
  • Strong GO: Update PGO baseline + PERFORMANCE_TARGETS_SCORECARD

7. Next Steps

Immediate (Phase 69-1)

  1. ENV Sweeps (no recompile):

    • Warm pool size: 12 → 16 → 24
    • Unified cache C5-C7: 128 → 256 → 512
  2. Refill Count Sweep (ENV-only):

    • HAKMEM_TINY_REFILL_COUNT_MID: 64 → 96 → 128 → 160 (example)
    • HAKMEM_TINY_REFILL_COUNT_HOT: 96 → 128 → 160 (example)

Follow-Up (Phase 69-2)

  1. Combined Optimization:

    • Apply winning parameters from Phase 69-1
    • Verify additive gains (+3-6% target)
  2. Baseline Promotion (if Strong GO):

    • Update pgo_fast_profile_config.sh with winning ENV vars
    • Re-run make pgo-fast-full to bake optimizations into baseline

Artifacts

  • This design memo: docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md
  • Sweep script (TODO): scripts/box/phase69_refill_sweep.sh
  • Results log (TODO): docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md

Status: 🟢 READY FOR SWEEP (Phase 69-1)

Estimated Time: 2-3 hours (ENV sweeps) + 4-6 hours (batch sweep with PGO)

Expected Outcome: +3-6% combined gain → within reach of M2 (55% target)