From 5c9b09148b8e04904c9cf02a7f1592317833e10e Mon Sep 17 00:00:00 2001
From: "Moe Charm (CI)" <moecharm@example.com>
Date: Wed, 17 Dec 2025 21:22:21 +0900
Subject: [PATCH] Phase 69-0: Refill tuning design memo (parameter sweep plan)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Changes:
- docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md: New design document
  * Identified 3 tunable parameters: refill batch size, unified cache C5-C7 capacity, warm pool size
  * Sweep plan: single-parameter isolation → combined optimization
  * Expected gain: +3-6% (shortest path to M2: 55% target)
  * Risk assessment and decision criteria (GO/Strong GO/NO-GO thresholds)

- CURRENT_TASK.md: Phase 69-0 marked complete, Phase 69-1 (sweep execution) set Active

Key Parameters Identified:
1. TINY_REFILL_BATCH_SIZE: 16 → 32/64 (expected +1-3%)
2. Unified Cache C5-C7: 128 → 256/512 slots (expected +1-2%)
3. Warm Pool: 12 → 16/24 SuperSlabs (expected +0.5-1%)

Strategy:
- ENV-only sweeps first (warm pool, cache capacity) - no recompile
- Batch size sweep requires PGO rebuild - highest expected gain
- Combined optimization targets +3-6% additive gain

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
---
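The GO / Strong GO / NEUTRAL / NO-GO thresholds defined in section 6 of the design memo can be sketched as a small shell helper. This is only an illustration under stated assumptions: `classify_gain` is a hypothetical name, not an existing script in the repo, and it expects the baseline and treatment mean throughput to be passed in by hand, since the output format of `scripts/run_mixed_10_cleanenv.sh` is not shown in this patch.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): classify a throughput change
# against the memo's thresholds.
#   gain >= +3.0%  -> STRONG_GO
#   gain >= +1.0%  -> GO
#   gain <  -1.0%  -> NO_GO
#   otherwise      -> NEUTRAL
# Usage: classify_gain BASELINE_OPS TREATMENT_OPS
classify_gain() {
  awk -v base="$1" -v treat="$2" 'BEGIN {
    if (base <= 0) { print "ERROR: baseline must be positive"; exit 1 }
    gain = (treat - base) / base * 100.0
    if      (gain >= 3.0)  verdict = "STRONG_GO"
    else if (gain >= 1.0)  verdict = "GO"
    else if (gain < -1.0)  verdict = "NO_GO"
    else                   verdict = "NEUTRAL"
    printf "%+.2f%% %s\n", gain, verdict
  }'
}

# Example: Phase 68 baseline (61.614M ops/s) vs. the M2 target (66.54M ops/s).
classify_gain 61.614 66.54
```

The same call with a measured treatment mean in place of 66.54 gives the verdict for one sweep point; values are in whatever unit the benchmark reports, as long as both arguments use the same one.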
 CURRENT_TASK.md                               |  43 ++-
 .../PHASE69_REFILL_TUNING_0_DESIGN.md         | 320 ++++++++++++++++++
 2 files changed, 357 insertions(+), 6 deletions(-)
 create mode 100644 docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md
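The cost formula in section 1.3 of the memo (refill_count × fixed_overhead + blocks × per_block_cost) implies diminishing returns as the batch grows, which is why the sweep stops at 64 or 128. A rough sketch of that arithmetic follows. It assumes every refill delivers exactly `batch` blocks and uses 75 cycles as the fixed overhead, the midpoint of the memo's ~50-100 cycle estimate; `amortized_fixed_cost` is a hypothetical name used only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fixed refill overhead amortized over one batch.
# Assumes each refill hands out exactly BATCH blocks (a simplification).
# Usage: amortized_fixed_cost FIXED_CYCLES BATCH
amortized_fixed_cost() {
  awk -v fixed="$1" -v batch="$2" 'BEGIN {
    if (batch <= 0) { print "ERROR: batch must be positive"; exit 1 }
    printf "%.2f\n", fixed / batch
  }'
}

# 75 cycles is an assumed midpoint of the memo's ~50-100 cycle estimate.
for batch in 16 32 64 128; do
  printf 'batch=%-3d fixed overhead per block: %s cycles\n' \
    "$batch" "$(amortized_fixed_cost 75 "$batch")"
done
```

Each doubling halves the per-block share of the fixed cost, so the absolute saving shrinks at every step (about 2.3 cycles per block from 16 to 32, about 0.6 from 64 to 128), while cache pressure from the larger batch keeps growing.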

diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
index 6b665cec..6ee67d2c 100644
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -58,17 +58,48 @@
 
 ---
 
-**Phase 67b（後続）: 境界inline/unrollチューニング**
+**Phase 69: Cut "refill frequency × fixed tax" (shortest path to M2)**
+
+**Phase 69-0: Parameter sweep design memo** ✅ **Done**
+
+- ✓ Created `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
+- ✓ Identified tunable parameters:
+  - `TINY_REFILL_BATCH_SIZE` (16 → 32/64/128)
+  - Unified Cache C5-C7 capacity (128 → 256/512)
+  - Warm Pool size (12 → 16/24)
+- ✓ Drafted sweep plan (single-parameter → combined optimization)
+- ✓ Defined risk assessment & decision criteria
+
+**Phase 69-1 (Active): Sweep execution**
+
+- **Goal**: +3-6% (shortest path to M2: 55% target)
+- **Measures** (in priority order):
+  1. **Warm Pool Size** (ENV-only, no recompile): 12 → 16 → 24
+     - Expected: +0.5-1.0% (registry scan reduction)
+  2. **Unified Cache C5-C7** (ENV-only, no recompile): 128 → 256 → 512
+     - Expected: +1-2% (miss rate reduction for mid-size allocations)
+  3. **Refill Batch Size** (requires PGO rebuild): 16 → 32 → 64
+     - Expected: +1-3% (refill frequency reduction)
+- **Procedure**:
+  - 10 runs per parameter setting with `scripts/run_mixed_10_cleanenv.sh`
+  - On failure, run `scripts/box/layout_tax_forensics_box.sh` to classify the cause
+- **Decision**:
+  - GO: +1.0% (first tier)
+  - "Strong GO": +3.0% or more (promoted as the core of the M2 push)
+
+**Phase 69-2 (follow-up): Fold winning settings into the baseline**
+- Apply the winning parameters to `pgo_fast_profile_config.sh` / `core/hakmem_tiny_config.h`
+- Rebuild with `make pgo-fast-full` → promote to baseline
+
+---
+
+**Phase 67b (follow-up, fallback): boundary inline/unroll tuning**
 - **注意**: layout tax リスク高い（Phase 64 reference）
 - **前提**: Top 50 実行確認が必須
-- 触るなら最小限・高確度だけ（例: C0 allocator inline candidates のみ）
+- Recommended to defer as a fallback in case Phase 69 misses
 
 **注記**: 研究箱の削除は今やらない（link-out/削除が layout tax を起こす前例が強いので、compile-out維持が正解）
 
-**M2 への道 (55% target)**:
-- PGO はもう +1% 程度の改善上限に達した可能性（profile training set 枯渇）
-- 次のレバーは: (1) layout tax 排除 (Phase 67a の基盤で調査可能) / (2) structural changes（box design） / (3) compiler flags tuning
-
 ## 3) アーカイブ
 
 - 詳細ログ: `CURRENT_TASK_ARCHIVE_20251210.md`
diff --git a/docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md b/docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md
new file mode 100644
index 00000000..ddd903e7
--- /dev/null
+++ b/docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md
@@ -0,0 +1,320 @@
+# Phase 69-0: Refill Frequency × Fixed Tax Reduction — Design Memo
+
+**Status**: 🟡 DESIGN (Phase 69-0)
+
+**Objective**: Reduce "refill count × fixed overhead" by tuning batch sizes and cache capacities to minimize refill frequency. Target: **+3-6%** (shortest path to M2: 55%).
+
+---
+
+## Executive Summary
+
+**Current Performance**: 61.614M ops/s = 50.93% of mimalloc (Phase 68 PGO baseline)
+
+**M2 Target**: 66.54M ops/s = 55% of mimalloc (Gap: **+4.93M ops/s = +8.0%**)
+
+**Strategy**: Reduce refill frequency by tuning batch sizes and cache capacities. Every refill incurs fixed overhead (SuperSlab lookup, metadata access, chain splicing). Reducing refill count directly improves throughput without micro-optimization risks.
+
+**Why This Approach**:
+- ✅ **High reproducibility**: Unlike branch/inline tuning, batch tuning has low layout tax risk
+- ✅ **Box-compliant**: No structural changes, only parameter tuning
+- ✅ **Measurable**: cache and warm-pool hit rates are exposed via existing counters; a refill counter must be added (see 4.2)
+- ✅ **Reversible**: All changes are config-only (ENV variables or compile-time constants)
+
+---
+
+## 1. Current State Analysis
+
+### 1.1 Refill Path Hierarchy
+
+```
+malloc() → Unified Cache (C2/C3: 2048 slots, C5-C7: 128 slots)
+            ↓ MISS
+         unified_cache_refill()
+            ↓
+         Warm Pool (12 SuperSlabs per class)
+            ↓ MISS
+         sll_refill_batch_from_ss(class_idx, max_take=16)
+            ↓
+         trc_pop_from_freelist() OR trc_linear_carve()
+            ↓
+         trc_splice_to_sll()
+```
+
+### 1.2 Key Parameters (Current)
+
+| Component | Parameter | Current Value | Location |
+|-----------|-----------|---------------|----------|
+| **Refill Batch Size** | `TINY_REFILL_BATCH_SIZE` | 16 | `core/hakmem_tiny_config.h:87` |
+| **Unified Cache C2/C3** | `unified_capacity(2/3)` | 2048 slots | `core/front/tiny_unified_cache.h:129` |
+| **Unified Cache C5-C7** | `unified_capacity(5-7)` | 128 slots | `core/front/tiny_unified_cache.h:131` |
+| **Unified Cache Others** | `unified_capacity(0/1/4)` | 64 slots | `core/front/tiny_unified_cache.h:128` |
+| **Warm Pool Size** | `TINY_WARM_POOL_MAX_PER_CLASS` | 12 | `core/front/tiny_warm_pool.h:46` |
+| **Drain Batch Size** | `TINY_DRAIN_BATCH_SIZE` | 16 | `core/hakmem_tiny_config.h:90` |
+
+### 1.3 Refill Overhead Breakdown
+
+**Fixed Overhead per refill** (~50-100 cycles):
+- SuperSlab lookup (warm pool pop or registry scan)
+- SlabMeta load (freelist, used, carved, capacity)
+- Chain building (header writes, next pointer linking)
+- TLS SLL splice (update head/count atomically)
+
+**Cost Formula**:
+```
+Total Refill Cost = (refill_count × fixed_overhead) + (blocks_carved × per_block_cost)
+```
+
+**Optimization Strategy**:
+- ↑ Batch size → ↓ refill_count → ↓ Total Cost (up to cache capacity limit)
+- ↑ Cache capacity → ↓ miss rate → ↓ refill_count
+
+---
+
+## 2. Tunable Parameters (Phase 69 Sweep Plan)
+
+### 2.1 Refill Batch Size Sweep
+
+**Parameter**: `TINY_REFILL_BATCH_SIZE` (global default for all classes)
+
+**Current**: 16 (conservative, optimized for RSS)
+
+**Sweep Range**: [16, 32, 64, 128]
+
+**Rationale**:
+- 16 → 32: Expected +1-2% (fewer refill calls)
+- 32 → 64: Expected +1-2% (diminishing returns, cache pressure increases)
+- 64 → 128: Expected +0-1% (likely NO-GO due to cache thrashing)
+
+**A/B Test Method**:
+```bash
+# Baseline (16)
+make clean && make bench_random_mixed_hakmem_minimal_pgo
+ITERS=20000000 WS=400 RUNS=10 scripts/run_mixed_10_cleanenv.sh
+
+# Treatment (32)
+sed -i 's/TINY_REFILL_BATCH_SIZE 16/TINY_REFILL_BATCH_SIZE 32/' core/hakmem_tiny_config.h
+make clean && make pgo-fast-full
+ITERS=20000000 WS=400 RUNS=10 scripts/run_mixed_10_cleanenv.sh
+
+# Compare results
+```
+
+**Expected Winner**: 32 (balance between refill frequency and cache locality)
+
+---
+
+### 2.2 Unified Cache Capacity Sweep (C5-C7 Focus)
+
+**Parameter**: `unified_capacity(class_idx)` for C5-C7 (129B-1024B, Mixed workload)
+
+**Current**: 128 slots (C5-C7)
+
+**Rationale**: C5-C7 handle 129B-1024B range (mid-size allocations in Mixed benchmark). Increasing capacity reduces miss rate.
+
+**Sweep Range**: [128, 256, 512]
+
+**ENV Control**:
+```bash
+# Baseline (128)
+HAKMEM_TINY_UNIFIED_C5=128 HAKMEM_TINY_UNIFIED_C6=128 HAKMEM_TINY_UNIFIED_C7=128
+
+# Treatment 1 (256)
+HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256
+
+# Treatment 2 (512)
+HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512
+```
+
+**Expected Winner**: 256 (balance between working set and RSS)
+
+**Why Not C2/C3**:
+- C2/C3 already have 2048 slots (very high capacity)
+- Further increase unlikely to improve (already low miss rate)
+- Focus on under-optimized classes (C5-C7)
+
+---
+
+### 2.3 Warm Pool Size Sweep
+
+**Parameter**: `TINY_WARM_POOL_MAX_PER_CLASS`
+
+**Current**: 12 SuperSlabs per class
+
+**Rationale**: Warm pool caches hot SuperSlabs to avoid registry scan. Larger pool → lower registry scan frequency.
+
+**Sweep Range**: [12, 16, 24]
+
+**ENV Control**:
+```bash
+HAKMEM_WARM_POOL_SIZE=16  # Treatment 1
+HAKMEM_WARM_POOL_SIZE=24  # Treatment 2
+```
+
+**Expected Winner**: 16 (diminishing returns beyond this)
+
+**Caveat**: Memory overhead = pool_size × sizeof(SuperSlab*) × TINY_NUM_CLASSES = 16 × 8 × 8 = 1KB per thread (negligible)
+
+---
+
+## 3. Implementation Strategy
+
+### 3.1 Phase 69-1: Single-Parameter Sweeps (Isolation)
+
+**Goal**: Measure each parameter's individual impact to avoid confounding effects.
+
+**Order** (easiest to hardest):
+
+1. **Warm Pool Size** (ENV-only, no recompile):
+   ```bash
+   for SIZE in 12 16 24; do
+     HAKMEM_WARM_POOL_SIZE=$SIZE RUNS=10 scripts/run_mixed_10_cleanenv.sh
+   done
+   ```
+   - Expected: +0.5-1.0% (registry scan reduction)
+   - Risk: Low (ENV-only change)
+
+2. **Unified Cache C5-C7** (ENV-only, no recompile):
+   ```bash
+   for CAP in 128 256 512; do
+     HAKMEM_TINY_UNIFIED_C5=$CAP HAKMEM_TINY_UNIFIED_C6=$CAP HAKMEM_TINY_UNIFIED_C7=$CAP \
+       RUNS=10 scripts/run_mixed_10_cleanenv.sh
+   done
+   ```
+   - Expected: +1-2% (miss rate reduction for mid-size allocations)
+   - Risk: Low (ENV-only change)
+
+3. **Refill Batch Size** (requires recompile + PGO):
+   ```bash
+   for BATCH in 16 32 64; do
+     sed -i -E "s/(TINY_REFILL_BATCH_SIZE[[:space:]]+)[0-9]+/\1$BATCH/" core/hakmem_tiny_config.h
+     make pgo-fast-full
+     RUNS=10 scripts/run_mixed_10_cleanenv.sh
+   done
+   ```
+   - Expected: +1-3% (refill frequency reduction)
+   - Risk: Medium (requires PGO rebuild, potential layout tax)
+
+### 3.2 Phase 69-2: Combined Optimization (Best Settings)
+
+After identifying winners from Phase 69-1, combine them:
+
+```bash
+# Example: batch=32, C5-C7=256, warm_pool=16
+export HAKMEM_WARM_POOL_SIZE=16
+export HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256
+make pgo-fast-full  # with TINY_REFILL_BATCH_SIZE=32 set in core/hakmem_tiny_config.h
+
+RUNS=10 scripts/run_mixed_10_cleanenv.sh  # exported ENV applies to the benchmark run too
+```
+
+**Expected Combined Gain**: +3-6% (additive if parameters are orthogonal)
+
+---
+
+## 4. Measurement & Validation
+
+### 4.1 Primary Metric: Throughput
+
+```bash
+RUNS=10 scripts/run_mixed_10_cleanenv.sh
+# Extract: Mean, Median, CV
+# Decision: GO (+1%), Strong GO (+3%)
+```
+
+### 4.2 Secondary Metrics (Observability)
+
+**Unified Cache Hit Rate**:
+```bash
+HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_random_mixed_hakmem_minimal_pgo
+# Output: g_unified_cache_hits_global / (hits + misses)
+```
+
+**Refill Count** (requires instrumentation):
+```c
+// Add to unified_cache_refill() (needs <stdatomic.h> and <stdint.h>):
+static _Atomic uint64_t g_refill_count_total = 0;
+atomic_fetch_add_explicit(&g_refill_count_total, 1, memory_order_relaxed);
+```
+
+**Warm Pool Hit Rate**:
+```bash
+# Already exists in g_warm_pool_stats[class_idx].hits / misses
+# Print at shutdown via tiny_warm_pool_print_stats()
+```
+
+### 4.3 Layout Tax Check (If NO-GO)
+
+```bash
+# Run forensics on regression
+./scripts/box/layout_tax_forensics_box.sh \
+    ./bench_random_mixed_hakmem_minimal_pgo \
+    ./bench_random_mixed_hakmem_minimal_pgo_phase69
+```
+
+---
+
+## 5. Risk Assessment
+
+| Risk | Mitigation |
+|------|------------|
+| **Layout Tax** (batch size change) | Use `layout_tax_forensics_box.sh` to diagnose. Revert if IPC drops >3%. |
+| **Cache Thrashing** (capacity too high) | Monitor LLC-misses via perf stat. Limit C5-C7 capacity to 512 max. |
+| **RSS Increase** (larger batches/caches) | Measure RSS before/after. Acceptable if <+5% for +3% throughput. |
+| **PGO Mismatch** (batch change) | Re-run PGO training after batch size change (included in `pgo-fast-full`). |
+
+---
+
+## 6. Decision Criteria
+
+### GO Thresholds
+
+- **GO**: +1.0% or more (additive improvement, worth merging)
+- **Strong GO**: +3.0% or more (M2-worthy, promote to baseline)
+- **NEUTRAL**: from -1.0% to just under +1.0% (no regression, but no clear win)
+- **NO-GO**: below -1.0% (regression, investigate layout tax)
+
+### Promotion Strategy
+
+- **Single-parameter GO**: Merge immediately if +1%+
+- **Combined GO**: Require +3%+ to justify complexity
+- **Strong GO**: Update PGO baseline + PERFORMANCE_TARGETS_SCORECARD
+
+---
+
+## 7. Next Steps
+
+### Immediate (Phase 69-1)
+
+1. **ENV Sweeps** (no recompile):
+   - Warm pool size: 12 → 16 → 24
+   - Unified cache C5-C7: 128 → 256 → 512
+
+2. **Batch Size Sweep** (requires PGO rebuild):
+   - TINY_REFILL_BATCH_SIZE: 16 → 32 → 64
+
+### Follow-Up (Phase 69-2)
+
+3. **Combined Optimization**:
+   - Apply winning parameters from Phase 69-1
+   - Verify additive gains (+3-6% target)
+
+4. **Baseline Promotion** (if Strong GO):
+   - Update `pgo_fast_profile_config.sh` with winning ENV vars
+   - Update `core/hakmem_tiny_config.h` with winning batch size
+   - Re-run `make pgo-fast-full` to bake optimizations into baseline
+
+---
+
+## Artifacts
+
+- **This design memo**: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
+- **Sweep script** (TODO): `scripts/box/phase69_refill_sweep.sh`
+- **Results log** (TODO): `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
+
+---
+
+**Status**: 🟢 READY FOR SWEEP (Phase 69-1)
+
+**Estimated Time**: 2-3 hours (ENV sweeps) + 4-6 hours (batch sweep with PGO)
+
+**Expected Outcome**: +3-6% combined gain → brings M2 (55% target) within range