hakmem/docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md

# Phase 69-0: Refill Frequency × Fixed Tax Reduction — Design Memo

**Status**: 🟡 DESIGN (Phase 69-0)

**Objective**: Reduce "refill count × fixed overhead" by tuning batch sizes and cache capacities to minimize refill frequency. Target: **+3-6%** (shortest path to M2: 55%).

---

## Executive Summary

**Current Performance**: 61.614M ops/s = 50.93% of mimalloc (Phase 68 PGO baseline)

**M2 Target**: 66.54M ops/s = 55% of mimalloc (Gap: **+4.93M ops/s ≈ +8.0%**)
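
The gap can be rechecked from the two percentages alone. This is a minimal sketch, assuming `awk` is available; the function name `m2_gap_pct` is hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash
# Relative gain (in percent) needed to move from the current share of
# mimalloc throughput to a target share. Both arguments are percentages.
m2_gap_pct() {
  awk -v cur="$1" -v tgt="$2" 'BEGIN { printf "%.2f\n", (tgt / cur - 1) * 100 }'
}

# Figures from this memo: 50.93% of mimalloc today, 55% for M2.
m2_gap_pct 50.93 55
```

For these inputs the required relative gain is roughly +8%, so the +3-6% targeted here would close a large part of the distance to M2 but probably not all of it.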

**Strategy**: Reduce refill frequency by tuning batch sizes and cache capacities. Every refill incurs fixed overhead (SuperSlab lookup, metadata access, chain splicing). Reducing refill count directly improves throughput without micro-optimization risks.

**Why This Approach**:
- ✅ **High reproducibility**: Unlike branch/inline tuning, batch tuning has low layout tax risk
- ✅ **Box-compliant**: No structural changes, only parameter tuning
- ✅ **Measurable**: refill count is directly observable via counters
- ✅ **Reversible**: All changes are config-only (ENV variables or compile-time constants)

---

## 1. Current State Analysis

### 1.1 Refill Path Hierarchy

```
malloc() → Unified Cache (C2/C3: 2048 slots, C5-C7: 128 slots)
            ↓ MISS
         unified_cache_refill()
            ↓
         Warm Pool (12 SuperSlabs per class)
            ↓ MISS
         sll_refill_batch_from_ss(class_idx, max_take=16)
            ↓
         trc_pop_from_freelist() OR trc_linear_carve()
            ↓
         trc_splice_to_sll()
```

### 1.2 Key Parameters (Current)

| Component | Parameter | Current Value | Location |
|-----------|-----------|---------------|----------|
| **Refill Batch Size** | `TINY_REFILL_BATCH_SIZE` | 16 | `core/hakmem_tiny_config.h:87` |
| **Unified Cache C2/C3** | `unified_capacity(2/3)` | 2048 slots | `core/front/tiny_unified_cache.h:129` |
| **Unified Cache C5-C7** | `unified_capacity(5-7)` | 128 slots | `core/front/tiny_unified_cache.h:131` |
| **Unified Cache Others** | `unified_capacity(0/1/4)` | 64 slots | `core/front/tiny_unified_cache.h:128` |
| **Warm Pool Size** | `TINY_WARM_POOL_MAX_PER_CLASS` | 12 | `core/front/tiny_warm_pool.h:46` |
| **Drain Batch Size** | `TINY_DRAIN_BATCH_SIZE` | 16 | `core/hakmem_tiny_config.h:90` |

### 1.3 Refill Overhead Breakdown

**Fixed Overhead per refill** (~50-100 cycles):
- SuperSlab lookup (warm pool pop or registry scan)
- SlabMeta load (freelist, used, carved, capacity)
- Chain building (header writes, next pointer linking)
- TLS SLL splice (update head/count atomically)

**Cost Formula**:
```
Total Refill Cost = (refill_count × fixed_overhead) + (blocks_carved × per_block_cost)
```

**Optimization Strategy**:
- ↑ Batch size → ↓ refill_count → ↓ Total Cost (up to cache capacity limit)
- ↑ Cache capacity → ↓ miss rate → ↓ refill_count
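
The cost formula can be evaluated for a few batch sizes to see how the fixed term shrinks. This is a sketch only: the 75-cycle fixed overhead is the midpoint of the ~50-100 cycle estimate above, the 5-cycle per-block cost is an assumed placeholder, and `refill_cost` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Total refill cost in cycles for carving `blocks` blocks in batches of `batch`.
# `fixed` and `per_block` are ASSUMED cycle costs, not measured values.
refill_cost() {
  local blocks="$1" batch="$2" fixed="$3" per_block="$4"
  local refills=$(( (blocks + batch - 1) / batch ))   # ceil(blocks / batch)
  echo $(( refills * fixed + blocks * per_block ))
}

# 1,000,000 blocks, 75-cycle fixed overhead, 5 cycles per block (both assumed).
for batch in 16 32 64 128; do
  echo "batch=$batch cost=$(refill_cost 1000000 "$batch" 75 5)"
done
```

Each doubling of the batch size halves only the fixed term while the per-block term stays constant, which is why the expected returns diminish at larger batch sizes.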

---

## 2. Tunable Parameters (Phase 69 Sweep Plan)

### 2.1 Refill Batch Size Sweep

**Parameter**: `TINY_REFILL_BATCH_SIZE` (global default for all classes)

**Current**: 16 (conservative, optimized for RSS)

**Sweep Range**: [16, 32, 64, 128]

**Rationale**:
- 16 → 32: Expected +1-2% (fewer refill calls)
- 32 → 64: Expected +1-2% (diminishing returns, cache pressure increases)
- 64 → 128: Expected +0-1% (likely NO-GO due to cache thrashing)

**A/B Test Method**:
```bash
# Baseline (16)
make clean && make bench_random_mixed_hakmem_minimal_pgo
ITERS=20000000 WS=400 RUNS=10 scripts/run_mixed_10_cleanenv.sh

# Treatment (32)
sed -i 's/TINY_REFILL_BATCH_SIZE 16/TINY_REFILL_BATCH_SIZE 32/' core/hakmem_tiny_config.h
make clean && make pgo-fast-full
ITERS=20000000 WS=400 RUNS=10 scripts/run_mixed_10_cleanenv.sh

# Compare results
```

**Expected Winner**: 32 (balance between refill frequency and cache locality)

---

### 2.2 Unified Cache Capacity Sweep (C5-C7 Focus)

**Parameter**: `unified_capacity(class_idx)` for C5-C7 (129B-1024B, Mixed workload)

**Current**: 128 slots (C5-C7)

**Rationale**: C5-C7 handle 129B-1024B range (mid-size allocations in Mixed benchmark). Increasing capacity reduces miss rate.

**Sweep Range**: [128, 256, 512]

**ENV Control**:
```bash
# Baseline (128)
HAKMEM_TINY_UNIFIED_C5=128 HAKMEM_TINY_UNIFIED_C6=128 HAKMEM_TINY_UNIFIED_C7=128

# Treatment 1 (256)
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256

# Treatment 2 (512)
HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512
```

**Expected Winner**: 256 (balance between working set and RSS)

**Why Not C2/C3**:
- C2/C3 already have 2048 slots (very high capacity)
- Further increase unlikely to improve (already low miss rate)
- Focus on under-optimized classes (C5-C7)
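
To put the RSS side of the trade-off in numbers, the sketch below computes an upper bound on the bytes parked in the cache when every slot is full. The block sizes 256/512/1024 bytes for C5/C6/C7 are an assumption inferred from the 129B-1024B range and are not confirmed by the source; `cache_resident_bytes` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Upper bound on bytes held in the unified cache when every slot of every
# listed class is full. Arguments: capacity, then one block size per class.
cache_resident_bytes() {
  local cap="$1" total=0 size
  shift
  for size in "$@"; do
    total=$(( total + cap * size ))
  done
  echo "$total"
}

# ASSUMED block sizes for C5/C6/C7: 256, 512, 1024 bytes.
for cap in 128 256 512; do
  echo "cap=$cap bytes=$(cache_resident_bytes "$cap" 256 512 1024)"
done
```

Under these assumptions the bound doubles with each capacity step, so the RSS check in the risk assessment matters most for the 512-slot treatment.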

---

### 2.3 Warm Pool Size Sweep

**Parameter**: `TINY_WARM_POOL_MAX_PER_CLASS`

**Current**: 12 SuperSlabs per class

**Rationale**: Warm pool caches hot SuperSlabs to avoid registry scan. Larger pool → lower registry scan frequency.

**Sweep Range**: [12, 16, 24]

**ENV Control**:
```bash
HAKMEM_WARM_POOL_SIZE=16  # Treatment 1
HAKMEM_WARM_POOL_SIZE=24  # Treatment 2
```

**Expected Winner**: 16 (diminishing returns beyond this)

**Caveat**: Memory overhead = pool_size × sizeof(SuperSlab*) × TINY_NUM_CLASSES = 16 × 8 × 8 = 1KB per thread (negligible)
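
The same arithmetic can be applied to every pool size in the sweep. This is a minimal sketch; the defaults of 8-byte pointers and 8 classes are taken from the caveat above, and `warm_pool_bytes` is a hypothetical helper name. It counts pointer storage only, not the SuperSlabs the pool keeps alive.

```bash
#!/usr/bin/env bash
# Pointer storage for the warm pool: pool_size x pointer_size x class_count.
# Defaults (8-byte pointers, 8 classes) follow the caveat in this section.
warm_pool_bytes() {
  local pool_size="$1" ptr_size="${2:-8}" classes="${3:-8}"
  echo $(( pool_size * ptr_size * classes ))
}

for size in 12 16 24; do
  echo "pool_size=$size bytes=$(warm_pool_bytes "$size")"
done
```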

---

## 3. Implementation Strategy

### 3.1 Phase 69-1: Single-Parameter Sweeps (Isolation)

**Goal**: Measure each parameter's individual impact to avoid confounding effects.

**Order** (easiest to hardest):

1. **Warm Pool Size** (ENV-only, no recompile):
   ```bash
   for SIZE in 12 16 24; do
     HAKMEM_WARM_POOL_SIZE=$SIZE RUNS=10 scripts/run_mixed_10_cleanenv.sh
   done
   ```
   - Expected: +0.5-1.0% (registry scan reduction)
   - Risk: Low (ENV-only change)

2. **Unified Cache C5-C7** (ENV-only, no recompile):
   ```bash
   for CAP in 128 256 512; do
     HAKMEM_TINY_UNIFIED_C5=$CAP HAKMEM_TINY_UNIFIED_C6=$CAP HAKMEM_TINY_UNIFIED_C7=$CAP \
       RUNS=10 scripts/run_mixed_10_cleanenv.sh
   done
   ```
   - Expected: +1-2% (miss rate reduction for mid-size allocations)
   - Risk: Low (ENV-only change)

3. **Refill Batch Size** (requires recompile + PGO):
   ```bash
   for BATCH in 16 32 64; do
     sed -i "s/TINY_REFILL_BATCH_SIZE .*/TINY_REFILL_BATCH_SIZE $BATCH/" core/hakmem_tiny_config.h
     make pgo-fast-full
     RUNS=10 scripts/run_mixed_10_cleanenv.sh
   done
   ```
   - Expected: +1-3% (refill frequency reduction)
   - Risk: Medium (requires PGO rebuild, potential layout tax)
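
The two ENV-only loops above share one shape, which could be factored into a small driver. The following is a sketch under stated assumptions, not the planned `phase69_refill_sweep.sh`: `sweep_env` is a hypothetical helper, and the `sh -c` command is a stand-in for the benchmark script.

```bash
#!/usr/bin/env bash
# Run a command once per value, with one environment variable set for that
# run only. Usage: sweep_env VAR "v1 v2 ..." command [args...]
sweep_env() {
  local var="$1" values="$2" value
  shift 2
  for value in $values; do          # intentionally unquoted: split on spaces
    echo "== $var=$value =="
    env "$var=$value" "$@" || return 1
  done
}

# Stand-in command for illustration; in the sweep this would be the benchmark script.
sweep_env HAKMEM_WARM_POOL_SIZE "12 16 24" sh -c 'echo "pool=$HAKMEM_WARM_POOL_SIZE"'
```

Setting the variable through `env` keeps each treatment isolated, so a value from one iteration cannot leak into the next run or into the calling shell.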

### 3.2 Phase 69-2: Combined Optimization (Best Settings)

After identifying winners from Phase 69-1, combine them:

```bash
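# Example: batch=32, C5-C7=256, warm_pool=16
# Export the runtime settings so that both PGO training and the benchmark run see them.
export HAKMEM_WARM_POOL_SIZE=16
export HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256

make pgo-fast-full  # with TINY_REFILL_BATCH_SIZE=32 in core/hakmem_tiny_config.h

RUNS=10 scripts/run_mixed_10_cleanenv.sh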
```

**Expected Combined Gain**: +3-6% (additive if parameters are orthogonal)
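
Under the additive assumption, the combined range is simply the sum of the per-parameter estimates from Phase 69-1. A minimal sketch, assuming `awk` is available; `sum_gains` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sum per-parameter gain estimates (in percent) under the additive assumption.
sum_gains() {
  printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.1f\n", s }'
}

# Low and high ends of the Phase 69-1 estimates:
# warm pool (+0.5-1.0%), unified cache C5-C7 (+1-2%), batch size (+1-3%).
echo "low=$(sum_gains 0.5 1 1) high=$(sum_gains 1.0 2 3)"
```

The sum spans +2.5% to +6.0%, so the +3-6% figure holds only if the parameters really are close to orthogonal and none of them lands below its low estimate.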

---

## 4. Measurement & Validation

### 4.1 Primary Metric: Throughput

```bash
RUNS=10 scripts/run_mixed_10_cleanenv.sh
# Extract: Mean, Median, CV
# Decision: GO (+1%), Strong GO (+3%)
```
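
If the benchmark script does not already print these statistics, they could be derived from the per-run throughput values roughly as follows. This is a sketch: `run_stats` is a hypothetical helper, the sample values are made up for illustration, and CV is computed here as sample standard deviation divided by mean.

```bash
#!/usr/bin/env bash
# Print mean, median and CV (sample standard deviation / mean, in percent)
# for throughput samples given as arguments. Needs at least one sample.
run_stats() {
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      mean = sum / NR
      median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      for (i = 1; i <= NR; i++) ss += (v[i] - mean) ^ 2
      sd = (NR > 1) ? sqrt(ss / (NR - 1)) : 0
      printf "mean=%.3f median=%.3f cv=%.2f%%\n", mean, median, 100 * sd / mean
    }'
}

# Hypothetical samples in M ops/s, for illustration only.
run_stats 60 62 61
```

A CV well below the +1% GO threshold is what makes a +1% difference between two configurations meaningful. With a noisier run set, the comparison should be repeated before deciding.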

### 4.2 Secondary Metrics (Observability)

**Unified Cache Hit Rate**:
```bash
HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_random_mixed_hakmem_minimal_pgo
# Output: g_unified_cache_hits_global / (hits + misses)
```

**Refill Count** (requires instrumentation):
```c
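// Add to unified_cache_refill() (requires <stdatomic.h> and <stdint.h>):
static _Atomic uint64_t g_refill_count_total = 0;
// The variant that takes a memory order is atomic_fetch_add_explicit().
atomic_fetch_add_explicit(&g_refill_count_total, 1, memory_order_relaxed);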
```

**Warm Pool Hit Rate**:
```bash
# Already exists in g_warm_pool_stats[class_idx].hits / misses
# Print at shutdown via tiny_warm_pool_print_stats()
```

### 4.3 Layout Tax Check (If NO-GO)

```bash
# Run forensics on regression
./scripts/box/layout_tax_forensics_box.sh \
    ./bench_random_mixed_hakmem_minimal_pgo \
    ./bench_random_mixed_hakmem_minimal_pgo_phase69
```

---

## 5. Risk Assessment

| Risk | Mitigation |
|------|------------|
| **Layout Tax** (batch size change) | Use `layout_tax_forensics_box.sh` to diagnose. Revert if IPC drops >3%. |
| **Cache Thrashing** (capacity too high) | Monitor LLC-misses via perf stat. Limit C5-C7 capacity to 512 max. |
| **RSS Increase** (larger batches/caches) | Measure RSS before/after. Acceptable if <+5% for +3% throughput. |
| **PGO Mismatch** (batch change) | Re-run PGO training after batch size change (included in `pgo-fast-full`). |

---

## 6. Decision Criteria

### GO Thresholds

- **GO**: ≥ +1.0% (additive improvement, worth merging)
- **Strong GO**: ≥ +3.0% (M2-worthy, promote to baseline)
- **NEUTRAL**: within ±1.0% (no regression, but no clear win)
- **NO-GO**: < -1.0% (regression, investigate layout tax)
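
The thresholds can be applied mechanically to a baseline/treatment pair. This is a minimal sketch, assuming `awk` is available; `classify_delta` is a hypothetical helper. Treating exactly +1.0% as GO and exactly -1.0% as NEUTRAL is an interpretation of the boundaries, not something the thresholds state.

```bash
#!/usr/bin/env bash
# Classify a treatment run against a baseline using the GO thresholds.
# Arguments: baseline and treatment throughput, in the same unit.
classify_delta() {
  awk -v base="$1" -v treat="$2" 'BEGIN {
    delta = (treat / base - 1) * 100
    if      (delta >= 3.0)  verdict = "STRONG GO"
    else if (delta >= 1.0)  verdict = "GO"
    else if (delta >= -1.0) verdict = "NEUTRAL"
    else                    verdict = "NO-GO"
    printf "%+.2f%% %s\n", delta, verdict
  }'
}

# Phase 68 baseline against a hypothetical treatment result (M ops/s).
classify_delta 61.614 62.5
```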

### Promotion Strategy

- **Single-parameter GO**: Merge immediately if +1%+
- **Combined GO**: Require +3%+ to justify complexity
- **Strong GO**: Update PGO baseline + PERFORMANCE_TARGETS_SCORECARD

---

## 7. Next Steps

### Immediate (Phase 69-1)

1. **ENV Sweeps** (no recompile):
   - Warm pool size: 12 → 16 → 24
   - Unified cache C5-C7: 128 → 256 → 512

2. **Batch Size Sweep** (requires PGO rebuild):
   - TINY_REFILL_BATCH_SIZE: 16 → 32 → 64

### Follow-Up (Phase 69-2)

3. **Combined Optimization**:
   - Apply winning parameters from Phase 69-1
   - Verify additive gains (+3-6% target)

4. **Baseline Promotion** (if Strong GO):
   - Update `pgo_fast_profile_config.sh` with winning ENV vars
   - Update `core/hakmem_tiny_config.h` with winning batch size
   - Re-run `make pgo-fast-full` to bake optimizations into baseline

---

## Artifacts

- **This design memo**: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
- **Sweep script** (TODO): `scripts/box/phase69_refill_sweep.sh`
- **Results log** (TODO): `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`

---

**Status**: 🟢 READY FOR SWEEP (Phase 69-1)

**Estimated Time**: 2-3 hours (ENV sweeps) + 4-6 hours (batch sweep with PGO)

**Expected Outcome**: +3-6% combined gain → within range of M2 (55% target)

---

Phase 69-0: Refill tuning design memo (parameter sweep plan)

Changes:
- docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md: New design document
  * Identified 3 tunable parameters: refill batch size, unified cache C5-C7 capacity, warm pool size
  * Sweep plan: single-parameter isolation → combined optimization
  * Expected gain: +3-6% (shortest path to M2: 55% target)
  * Risk assessment and decision criteria (GO/Strong GO/NO-GO thresholds)

- CURRENT_TASK.md: Phase 69-0 marked complete, Phase 69-1 (sweep execution) set Active

Key Parameters Identified:
1. TINY_REFILL_BATCH_SIZE: 16 → 32/64 (expected +1-3%)
2. Unified Cache C5-C7: 128 → 256/512 slots (expected +1-2%)
3. Warm Pool: 12 → 16/24 SuperSlabs (expected +0.5-1%)

Strategy:
- ENV-only sweeps first (warm pool, cache capacity) - no recompile
- Batch size sweep requires PGO rebuild - highest expected gain
- Combined optimization targets +3-6% additive gain

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-17 21:22:21 +09:00