From 5c9b09148b8e04904c9cf02a7f1592317833e10e Mon Sep 17 00:00:00 2001
From: "Moe Charm (CI)"
Date: Wed, 17 Dec 2025 21:22:21 +0900
Subject: [PATCH] Phase 69-0: Refill tuning design memo (parameter sweep plan)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Changes:
- docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md: New design document
  * Identified 3 tunable parameters: refill batch size, unified cache C5-C7 capacity, warm pool size
  * Sweep plan: single-parameter isolation → combined optimization
  * Expected gain: +3-6% (shortest path to M2: 55% target)
  * Risk assessment and decision criteria (GO/Strong GO/NO-GO thresholds)
- CURRENT_TASK.md: Phase 69-0 marked complete, Phase 69-1 (sweep execution) set Active

Key Parameters Identified:
1. TINY_REFILL_BATCH_SIZE: 16 → 32/64 (expected +1-3%)
2. Unified Cache C5-C7: 128 → 256/512 slots (expected +1-2%)
3. Warm Pool: 12 → 16/24 SuperSlabs (expected +0.5-1%)

Strategy:
- ENV-only sweeps first (warm pool, cache capacity) - no recompile
- Batch size sweep requires PGO rebuild - highest expected gain
- Combined optimization targets +3-6% additive gain

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5
---
 CURRENT_TASK.md | 43 ++-
 .../PHASE69_REFILL_TUNING_0_DESIGN.md | 320 ++++++++++++++++++
 2 files changed, 357 insertions(+), 6 deletions(-)
 create mode 100644 docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md

diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
index 6b665cec..6ee67d2c 100644
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -58,17 +58,48 @@
 
 ---
 
-**Phase 67b (follow-up): Boundary inline/unroll tuning**
+**Phase 69: Cut "refill frequency × fixed tax" (shortest path to M2)**
+
+**Phase 69-0: Parameter sweep design memo** ✅ **Done**
+
+- ✓ `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md` created
+- ✓ Tunable parameters identified:
+  - `TINY_REFILL_BATCH_SIZE` (16 → 32/64/128)
+  - Unified Cache C5-C7 capacity (128 → 256/512)
+  - Warm Pool size (12 → 16/24)
+- ✓ Sweep plan drafted (single-parameter → combined optimization)
+- ✓ Risk assessment & decision criteria defined
+
+**Phase 69-1 (Active): Sweep execution**
+
+- **Goal**: +3-6% (shortest path to M2: 55% target)
+- **Measures** (in priority order):
+  1. **Warm Pool Size** (ENV-only, no recompile): 12 → 16 → 24
+     - Expected: +0.5-1.0% (registry scan reduction)
+  2. **Unified Cache C5-C7** (ENV-only, no recompile): 128 → 256 → 512
+     - Expected: +1-2% (miss rate reduction for mid-size allocations)
+  3. **Refill Batch Size** (requires PGO rebuild): 16 → 32 → 64
+     - Expected: +1-3% (refill frequency reduction)
+- **Procedure**:
+  - 10 runs per parameter setting with `scripts/run_mixed_10_cleanenv.sh`
+  - On failure, run `scripts/box/layout_tax_forensics_box.sh` to classify the cause
+- **Decision**:
+  - GO: +1.0% (first stage)
+  - "Strong GO": +3.0% or more (promote as the core lever that brings M2 within reach)
+
+**Phase 69-2 (follow-up): Apply the winning settings to the baseline**
+- Apply the winning parameters to `pgo_fast_profile_config.sh` / `core/hakmem_tiny_config.h`
+- Rebuild with `make pgo-fast-full` → promote to baseline
+
+---
+
+**Phase 67b (follow-up, fallback): Boundary inline/unroll tuning**
 - **Caution**: High layout tax risk (Phase 64 reference)
 - **Prerequisite**: Top 50 execution verification is required
-- If touched at all, keep it minimal and high-confidence only (e.g., C0 allocator inline candidates only)
+- Recommended to defer as a fallback in case Phase 69 misses
 
 **Note**: Do not delete the research boxes now (there is strong precedent for link-out/deletion causing layout tax, so keeping them compiled out is the right call)
 
-**Road to M2 (55% target)**:
-- PGO may already have hit its improvement ceiling of about +1% (profile training set exhausted)
-- Next levers: (1) eliminating layout tax (can be investigated on the Phase 67a foundation) / (2) structural changes (box design) / (3) compiler flags tuning
-
 ## 3) Archive
 
 - Detailed logs: `CURRENT_TASK_ARCHIVE_20251210.md`
diff --git a/docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md b/docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md
new file mode 100644
index 00000000..ddd903e7
--- /dev/null
+++ b/docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md
@@ -0,0 +1,320 @@
+# Phase 69-0: Refill Frequency × Fixed Tax Reduction (Design Memo)
+
+**Status**: 🟡 DESIGN (Phase 69-0)
+
+**Objective**: Reduce "refill count × fixed overhead" by tuning batch sizes and cache capacities to minimize refill frequency. Target: **+3-6%** (shortest path to M2: 55%).
+
+---
+
+## Executive Summary
+
+**Current Performance**: 61.614M ops/s = 50.93% of mimalloc (Phase 68 PGO baseline)
+
+**M2 Target**: 66.54M ops/s = 55% of mimalloc (Gap: **+4.93M ops/s = +7.99%**)
+
+**Strategy**: Reduce refill frequency by tuning batch sizes and cache capacities. Every refill incurs fixed overhead (SuperSlab lookup, metadata access, chain splicing). Reducing refill count directly improves throughput without micro-optimization risks.
+
+**Why This Approach**:
+- ✅ **High reproducibility**: Unlike branch/inline tuning, batch tuning has low layout tax risk
+- ✅ **Box-compliant**: No structural changes, only parameter tuning
+- ✅ **Measurable**: Refill count is directly observable via counters
+- ✅ **Reversible**: All changes are config-only (ENV variables or compile-time constants)
+
+---
+
+## 1. Current State Analysis
+
+### 1.1 Refill Path Hierarchy
+
+```
+malloc() → Unified Cache (C2/C3: 2048 slots, C5-C7: 128 slots)
+    ↓ MISS
+    unified_cache_refill()
+    ↓
+    Warm Pool (12 SuperSlabs per class)
+    ↓ MISS
+    sll_refill_batch_from_ss(class_idx, max_take=16)
+    ↓
+    trc_pop_from_freelist() OR trc_linear_carve()
+    ↓
+    trc_splice_to_sll()
+```
+
+### 1.2 Key Parameters (Current)
+
+| Component | Parameter | Current Value | Location |
+|-----------|-----------|---------------|----------|
+| **Refill Batch Size** | `TINY_REFILL_BATCH_SIZE` | 16 | `core/hakmem_tiny_config.h:87` |
+| **Unified Cache C2/C3** | `unified_capacity(2/3)` | 2048 slots | `core/front/tiny_unified_cache.h:129` |
+| **Unified Cache C5-C7** | `unified_capacity(5-7)` | 128 slots | `core/front/tiny_unified_cache.h:131` |
+| **Unified Cache Others** | `unified_capacity(0/1/4)` | 64 slots | `core/front/tiny_unified_cache.h:128` |
+| **Warm Pool Size** | `TINY_WARM_POOL_MAX_PER_CLASS` | 12 | `core/front/tiny_warm_pool.h:46` |
+| **Drain Batch Size** | `TINY_DRAIN_BATCH_SIZE` | 16 | `core/hakmem_tiny_config.h:90` |
+
+### 1.3 Refill Overhead Breakdown
+
+**Fixed Overhead per refill** (~50-100 cycles):
+- SuperSlab lookup (warm pool pop or registry scan)
+- SlabMeta load (freelist, used, carved, capacity)
+- Chain building (header writes, next pointer linking)
+- TLS SLL splice (update head/count atomically)
+
+**Cost Formula**:
+```
+Total Refill Cost = (refill_count × fixed_overhead) + (blocks_carved × per_block_cost)
+```
+
+**Optimization Strategy**:
+- ↑ Batch size → ↓ refill_count → ↓ Total Cost (up to cache capacity limit)
+- ↑ Cache capacity → ↓ miss rate → ↓ refill_count
+
+---
+
+## 2. Tunable Parameters (Phase 69 Sweep Plan)
+
+### 2.1 Refill Batch Size Sweep
+
+**Parameter**: `TINY_REFILL_BATCH_SIZE` (global default for all classes)
+
+**Current**: 16 (conservative, optimized for RSS)
+
+**Sweep Range**: [16, 32, 64, 128]
+
+**Rationale**:
+- 16 → 32: Expected +1-2% (fewer refill calls)
+- 32 → 64: Expected +1-2% (diminishing returns, cache pressure increases)
+- 64 → 128: Expected +0-1% (likely NO-GO due to cache thrashing)
+
+**A/B Test Method**:
+```bash
+# Baseline (16)
+make clean && make bench_random_mixed_hakmem_minimal_pgo
+ITERS=20000000 WS=400 RUNS=10 scripts/run_mixed_10_cleanenv.sh
+
+# Treatment (32)
+sed -i 's/TINY_REFILL_BATCH_SIZE 16/TINY_REFILL_BATCH_SIZE 32/' core/hakmem_tiny_config.h
+make clean && make pgo-fast-full
+ITERS=20000000 WS=400 RUNS=10 scripts/run_mixed_10_cleanenv.sh
+
+# Compare results
+```
+
+**Expected Winner**: 32 (balance between refill frequency and cache locality)
+
+---
+
+### 2.2 Unified Cache Capacity Sweep (C5-C7 Focus)
+
+**Parameter**: `unified_capacity(class_idx)` for C5-C7 (129B-1024B, Mixed workload)
+
+**Current**: 128 slots (C5-C7)
+
+**Rationale**: C5-C7 handle the 129B-1024B range (mid-size allocations in the Mixed benchmark). Increasing capacity reduces the miss rate.
+
+**Sweep Range**: [128, 256, 512]
+
+**ENV Control**:
+```bash
+# Baseline (128)
+HAKMEM_TINY_UNIFIED_C5=128 HAKMEM_TINY_UNIFIED_C6=128 HAKMEM_TINY_UNIFIED_C7=128
+
+# Treatment 1 (256)
+HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256
+
+# Treatment 2 (512)
+HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512
+```
+
+**Expected Winner**: 256 (balance between working set and RSS)
+
+**Why Not C2/C3**:
+- C2/C3 already have 2048 slots (very high capacity)
+- Further increase unlikely to improve (already low miss rate)
+- Focus on under-optimized classes (C5-C7)
+
+---
+
+### 2.3 Warm Pool Size Sweep
+
+**Parameter**: `TINY_WARM_POOL_MAX_PER_CLASS`
+
+**Current**: 12 SuperSlabs per class
+
+**Rationale**: The warm pool caches hot SuperSlabs to avoid a registry scan. Larger pool → lower registry scan frequency.
+
+**Sweep Range**: [12, 16, 24]
+
+**ENV Control**:
+```bash
+HAKMEM_WARM_POOL_SIZE=16  # Treatment 1
+HAKMEM_WARM_POOL_SIZE=24  # Treatment 2
+```
+
+**Expected Winner**: 16 (diminishing returns beyond this)
+
+**Caveat**: Memory overhead = pool_size × sizeof(SuperSlab*) × TINY_NUM_CLASSES = 16 × 8 × 8 = 1 KB per thread (negligible)
+
+---
+
+## 3. Implementation Strategy
+
+### 3.1 Phase 69-1: Single-Parameter Sweeps (Isolation)
+
+**Goal**: Measure each parameter's individual impact to avoid confounding effects.
+
+**Order** (easiest to hardest):
+
+1. **Warm Pool Size** (ENV-only, no recompile):
+   ```bash
+   for SIZE in 12 16 24; do
+     HAKMEM_WARM_POOL_SIZE=$SIZE RUNS=10 scripts/run_mixed_10_cleanenv.sh
+   done
+   ```
+   - Expected: +0.5-1.0% (registry scan reduction)
+   - Risk: Low (ENV-only change)
+
+2. **Unified Cache C5-C7** (ENV-only, no recompile):
+   ```bash
+   for CAP in 128 256 512; do
+     HAKMEM_TINY_UNIFIED_C5=$CAP HAKMEM_TINY_UNIFIED_C6=$CAP HAKMEM_TINY_UNIFIED_C7=$CAP \
+     RUNS=10 scripts/run_mixed_10_cleanenv.sh
+   done
+   ```
+   - Expected: +1-2% (miss rate reduction for mid-size allocations)
+   - Risk: Low (ENV-only change)
+
+3. **Refill Batch Size** (requires recompile + PGO):
+   ```bash
+   for BATCH in 16 32 64; do
+     sed -i "s/TINY_REFILL_BATCH_SIZE .*/TINY_REFILL_BATCH_SIZE $BATCH/" core/hakmem_tiny_config.h
+     make pgo-fast-full
+     RUNS=10 scripts/run_mixed_10_cleanenv.sh
+   done
+   ```
+   - Expected: +1-3% (refill frequency reduction)
+   - Risk: Medium (requires PGO rebuild, potential layout tax)
+
+### 3.2 Phase 69-2: Combined Optimization (Best Settings)
+
+After identifying winners from Phase 69-1, combine them:
+
+```bash
+# Example: batch=32, C5-C7=256, warm_pool=16
+HAKMEM_WARM_POOL_SIZE=16 \
+HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 \
+  make pgo-fast-full  # with TINY_REFILL_BATCH_SIZE=32
+
+RUNS=10 scripts/run_mixed_10_cleanenv.sh
+```
+
+**Expected Combined Gain**: +3-6% (additive if parameters are orthogonal)
+
+---
+
+## 4. Measurement & Validation
+
+### 4.1 Primary Metric: Throughput
+
+```bash
+RUNS=10 scripts/run_mixed_10_cleanenv.sh
+# Extract: Mean, Median, CV
+# Decision: GO (+1%), Strong GO (+3%)
+```
+
+### 4.2 Secondary Metrics (Observability)
+
+**Unified Cache Hit Rate**:
+```bash
+HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_random_mixed_hakmem_minimal_pgo
+# Output: g_unified_cache_hits_global / (hits + misses)
+```
+
+**Refill Count** (requires instrumentation):
+```c
+// Add to unified_cache_refill() (needs <stdatomic.h> and <stdint.h>):
+static _Atomic uint64_t g_refill_count_total = 0;
+atomic_fetch_add_explicit(&g_refill_count_total, 1, memory_order_relaxed);
+```
+
+**Warm Pool Hit Rate**:
+```bash
+# Already exists in g_warm_pool_stats[class_idx].hits / misses
+# Print at shutdown via tiny_warm_pool_print_stats()
+```
+
+### 4.3 Layout Tax Check (If NO-GO)
+
+```bash
+# Run forensics on regression
+./scripts/box/layout_tax_forensics_box.sh \
+  ./bench_random_mixed_hakmem_minimal_pgo \
+  ./bench_random_mixed_hakmem_minimal_pgo_phase69
+```
+
+---
+
+## 5. Risk Assessment
+
+| Risk | Mitigation |
+|------|------------|
+| **Layout Tax** (batch size change) | Use `layout_tax_forensics_box.sh` to diagnose. Revert if IPC drops >3%. |
+| **Cache Thrashing** (capacity too high) | Monitor LLC-misses via perf stat. Limit C5-C7 capacity to 512 max. |
+| **RSS Increase** (larger batches/caches) | Measure RSS before/after. Acceptable if RSS grows by less than 5% for a +3% throughput gain. |
+| **PGO Mismatch** (batch change) | Re-run PGO training after batch size change (included in `pgo-fast-full`). |
+
+---
+
+## 6. Decision Criteria
+
+### GO Thresholds
+
+- **GO**: ≥ +1.0% (additive improvement, worth merging)
+- **Strong GO**: ≥ +3.0% (M2-worthy, promote to baseline)
+- **NEUTRAL**: within ±1.0% (no regression, but no clear win)
+- **NO-GO**: < -1.0% (regression, investigate layout tax)
+
+### Promotion Strategy
+
+- **Single-parameter GO**: Merge immediately at +1% or better
+- **Combined GO**: Require +3% or better to justify complexity
+- **Strong GO**: Update PGO baseline + PERFORMANCE_TARGETS_SCORECARD
+
+---
+
+## 7. Next Steps
+
+### Immediate (Phase 69-1)
+
+1. **ENV Sweeps** (no recompile):
+   - Warm pool size: 12 → 16 → 24
+   - Unified cache C5-C7: 128 → 256 → 512
+
+2. **Batch Size Sweep** (requires PGO rebuild):
+   - TINY_REFILL_BATCH_SIZE: 16 → 32 → 64
+
+### Follow-Up (Phase 69-2)
+
+3. **Combined Optimization**:
+   - Apply winning parameters from Phase 69-1
+   - Verify additive gains (+3-6% target)
+
+4. **Baseline Promotion** (if Strong GO):
+   - Update `pgo_fast_profile_config.sh` with winning ENV vars
+   - Update `core/hakmem_tiny_config.h` with winning batch size
+   - Re-run `make pgo-fast-full` to bake optimizations into baseline
+
+---
+
+## Artifacts
+
+- **This design memo**: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
+- **Sweep script** (TODO): `scripts/box/phase69_refill_sweep.sh`
+- **Results log** (TODO): `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
+
+---
+
+**Status**: 🟢 READY FOR SWEEP (Phase 69-1)
+
+**Estimated Time**: 2-3 hours (ENV sweeps) + 4-6 hours (batch sweep with PGO)
+
+**Expected Outcome**: +3-6% combined gain → within reach of M2 (55% target)
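
Reviewer note (not part of the patch): the Section 6 thresholds can be applied mechanically to sweep output. The following is a minimal sketch under stated assumptions. `classify_delta` is a hypothetical helper name that does not exist in the repository; it assumes its two arguments are mean throughputs in M ops/s taken from `scripts/run_mixed_10_cleanenv.sh` runs, and it treats the GO and Strong GO thresholds as inclusive lower bounds. The 62.5 treatment value in the example call is made up for illustration and is not a measured result.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not repository code): classify a sweep
# result against the Section 6 thresholds.
# Usage: classify_delta <baseline_mops> <treatment_mops>
classify_delta() {
  local baseline="$1" treatment="$2"
  awk -v b="$baseline" -v t="$treatment" 'BEGIN {
    if (b <= 0) { print "invalid baseline" > "/dev/stderr"; exit 2 }
    d = (t - b) / b * 100.0            # relative change in percent
    if      (d >=  3.0) v = "STRONG_GO"
    else if (d >=  1.0) v = "GO"
    else if (d <  -1.0) v = "NO_GO"
    else                v = "NEUTRAL"
    printf "%+.2f%% %s\n", d, v
  }'
}

# Example: Phase 68 baseline (61.614) vs. an illustrative treatment mean.
classify_delta 61.614 62.5   # prints: +1.44% GO
```

Delegating the arithmetic to `awk` keeps the helper dependency-free, since bash has no native floating-point comparison.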