# Phase 69-0: Refill Frequency × Fixed Tax Reduction — Design Memo
**Status**: 🟡 DESIGN (Phase 69-0)
**Objective**: Reduce "refill count × fixed overhead" by tuning batch sizes and cache capacities to minimize refill frequency. Target: **+3-6%** (shortest path to M2: 55%).
---
## Executive Summary
**Current Performance**: 61.614M ops/s = 50.93% of mimalloc (Phase 68 PGO baseline)
**M2 Target**: 66.54M ops/s = 55% of mimalloc (Gap: **+4.9M ops/s = +7.96%**)
**Strategy**: Reduce refill frequency by tuning batch sizes and cache capacities. Every refill incurs fixed overhead (SuperSlab lookup, metadata access, chain splicing). Reducing refill count directly improves throughput without micro-optimization risks.
**Why This Approach**:
- **High reproducibility**: Unlike branch/inline tuning, batch tuning has low layout tax risk
- **Box-compliant**: No structural changes, only parameter tuning
- **Measurable**: refill count is directly observable via counters
- **Reversible**: All changes are config-only (ENV variables or compile-time constants)
---
## 1. Current State Analysis
### 1.1 Refill Path Hierarchy
```
malloc() → Unified Cache (C2/C3: 2048 slots, C5-C7: 128 slots)
  ↓ MISS
unified_cache_refill()
  → Warm Pool (12 SuperSlabs per class)
  ↓ MISS
  → sll_refill_*_from_ss(class_idx, max_take=refill_count(class_idx))
      → trc_pop_from_freelist() OR trc_linear_carve()
      → trc_splice_to_sll()
```
### 1.2 Key Parameters (Current)
| Component | Parameter | Current Value | Location |
|-----------|-----------|---------------|----------|
| **Refill Count (hot/mid/global)** | `HAKMEM_TINY_REFILL_COUNT_*` | hot=128, mid=96, global=64 (defaults) | `core/hakmem_tiny_init.inc:270` |
| **Unified Cache C2/C3** | `unified_capacity(2/3)` | 2048 slots | `core/front/tiny_unified_cache.h:129` |
| **Unified Cache C5-C7** | `unified_capacity(5-7)` | 128 slots | `core/front/tiny_unified_cache.h:131` |
| **Unified Cache Others** | `unified_capacity(0/1/4)` | 64 slots | `core/front/tiny_unified_cache.h:128` |
| **Warm Pool Size** | `TINY_WARM_POOL_MAX_PER_CLASS` | 12 | `core/front/tiny_warm_pool.h:46` |
### 1.3 Refill Overhead Breakdown
**Fixed Overhead per refill** (~50-100 cycles):
- SuperSlab lookup (warm pool pop or registry scan)
- SlabMeta load (freelist, used, carved, capacity)
- Chain building (header writes, next pointer linking)
- TLS SLL splice (update head/count atomically)
**Cost Formula**:
```
Total Refill Cost = (refill_count × fixed_overhead) + (blocks_carved × per_block_cost)
```
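To make the trade-off concrete, a back-of-the-envelope calculation; the cycle figures below are illustrative assumptions, not measured values:
```bash
# Illustrative only: assume 10M allocations served via refill,
# ~75 cycles fixed overhead per refill, ~4 cycles per carved block.
ALLOCS=10000000; FIXED=75; PER_BLOCK=4
for BATCH in 64 128 256; do
  REFILLS=$((ALLOCS / BATCH))
  echo "batch=$BATCH refills=$REFILLS" \
       "fixed_cycles=$((REFILLS * FIXED)) total_cycles=$((REFILLS * FIXED + ALLOCS * PER_BLOCK))"
done
```
Doubling the batch halves the fixed-overhead term while the per-block term stays constant, which is why the gain flattens once fixed overhead no longer dominates.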
**Optimization Strategy**:
- ↑ Batch size → ↓ refill_count → ↓ Total Cost (up to cache capacity limit)
- ↑ Cache capacity → ↓ miss rate → ↓ refill_count
---
## 2. Tunable Parameters (Phase 69 Sweep Plan)
### 2.1 Refill Count Sweep (ENV-only)
**Parameter**:
- `HAKMEM_TINY_REFILL_COUNT_HOT` (classes C0-C3)
- `HAKMEM_TINY_REFILL_COUNT_MID` (classes C4-C7)
- `HAKMEM_TINY_REFILL_COUNT` (fallback/global)
**Current defaults** (when ENV unset):
- hot=128, mid=96, global=64
**Rationale**:
- Smaller counts reduce per-refill work but increase refill frequency.
- Larger counts reduce refill frequency but may increase chain-building cost and TLS pressure.
- Sweep is cheap (ENV-only) and reversible.
**A/B Test Method**:
```bash
# Baseline
RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
# Treatment examples (pick one axis at a time)
HAKMEM_TINY_REFILL_COUNT_MID=128 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_REFILL_COUNT_MID=64 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
```
---
### 2.2 Unified Cache Capacity Sweep (C5-C7 Focus)
**Parameter**: `unified_capacity(class_idx)` for C5-C7 (129B-1024B, Mixed workload)
**Current**: 128 slots (C5-C7)
**Rationale**: C5-C7 handle 129B-1024B range (mid-size allocations in Mixed benchmark). Increasing capacity reduces miss rate.
**Sweep Range**: [128, 256, 512]
**ENV Control**:
```bash
# Baseline (128)
HAKMEM_TINY_UNIFIED_C5=128 HAKMEM_TINY_UNIFIED_C6=128 HAKMEM_TINY_UNIFIED_C7=128
# Treatment 1 (256)
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256
# Treatment 2 (512)
HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512
```
**Expected Winner**: 256 (balance between working set and RSS)
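To put a rough number on the RSS side of that balance, here is a conservative upper bound, under the assumption that each occupied slot can pin one block of up to 1024B (the top of the C5-C7 range):
```bash
# Worst-case extra resident memory per thread across the three C5-C7 classes,
# assuming every slot holds a block of the maximum size in the range (1024B).
for CAP in 128 256 512; do
  echo "cap=$CAP worst_case=$((CAP * 1024 * 3 / 1024)) KiB/thread"
done
```
Even the 512-slot case is bounded, but it is exactly why §5 caps the sweep at 512 and calls for an RSS re-measurement.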
**Why Not C2/C3**:
- C2/C3 already have 2048 slots (very high capacity)
- Further increase unlikely to improve (already low miss rate)
- Focus on under-optimized classes (C5-C7)
---
### 2.3 Warm Pool Size Sweep
**Parameter**: `TINY_WARM_POOL_MAX_PER_CLASS`
**Current**: 12 SuperSlabs per class
**Rationale**: Warm pool caches hot SuperSlabs to avoid registry scan. Larger pool → lower registry scan frequency.
**Sweep Range**: [12, 16, 24]
**ENV Control**:
```bash
HAKMEM_WARM_POOL_SIZE=16 # Treatment 1
HAKMEM_WARM_POOL_SIZE=24 # Treatment 2
```
**Expected Winner**: 16 (diminishing returns beyond this)
**Caveat**: Memory overhead = pool_size × sizeof(SuperSlab*) × TINY_NUM_CLASSES = 16 × 8 × 8 = 1KB per thread (negligible)
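The same arithmetic over the full sweep range (pointer-array overhead only, per thread):
```bash
# pool_size × sizeof(SuperSlab*) × TINY_NUM_CLASSES, with 8-byte pointers and 8 classes.
for SIZE in 12 16 24; do
  echo "pool_size=$SIZE overhead=$((SIZE * 8 * 8)) bytes/thread"
done
```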
---
## 3. Implementation Strategy
### 3.1 Phase 69-1: Single-Parameter Sweeps (Isolation)
**Goal**: Measure each parameter's individual impact to avoid confounding effects.
**Order** (easiest to hardest):
1. **Warm Pool Size** (ENV-only, no recompile):
```bash
for SIZE in 12 16 24; do
HAKMEM_WARM_POOL_SIZE=$SIZE RUNS=10 scripts/run_mixed_10_cleanenv.sh
done
```
- Expected: +0.5-1.0% (registry scan reduction)
- Risk: Low (ENV-only change)
2. **Unified Cache C5-C7** (ENV-only, no recompile):
```bash
for CAP in 128 256 512; do
HAKMEM_TINY_UNIFIED_C5=$CAP HAKMEM_TINY_UNIFIED_C6=$CAP HAKMEM_TINY_UNIFIED_C7=$CAP \
RUNS=10 scripts/run_mixed_10_cleanenv.sh
done
```
- Expected: +1-2% (miss rate reduction for mid-size allocations)
- Risk: Low (ENV-only change)
3. **Refill Batch Size** (removed, 2025-12 audit):
   `TINY_REFILL_BATCH_SIZE` is not wired into the current Tiny front, so it is not a usable knob.
   The actual refill amount is controlled via `HAKMEM_TINY_REFILL_COUNT_*` (ENV-only); a sweep sketch follows below.
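A minimal version of that sweep, using the knob that is actually wired up (values from §7, one axis at a time):
```bash
# MID axis (C4-C7); repeat with HAKMEM_TINY_REFILL_COUNT_HOT for the hot classes.
for COUNT in 64 96 128 160; do
  echo "=== HAKMEM_TINY_REFILL_COUNT_MID=$COUNT ==="
  HAKMEM_TINY_REFILL_COUNT_MID=$COUNT RUNS=10 scripts/run_mixed_10_cleanenv.sh
done
```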
### 3.2 Phase 69-2: Combined Optimization (Best Settings)
After identifying winners from Phase 69-1, combine them:
```bash
# Example: C5-C7=256, warm_pool=16
HAKMEM_WARM_POOL_SIZE=16 \
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 \
RUNS=10 scripts/run_mixed_10_cleanenv.sh
```
**Expected Combined Gain**: +3-6% (additive if parameters are orthogonal)
---
## 4. Measurement & Validation
### 4.1 Primary Metric: Throughput
```bash
RUNS=10 scripts/run_mixed_10_cleanenv.sh
# Extract: Mean, Median, CV
# Decision: GO (+1%), Strong GO (+3%)
```
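The statistics can be pulled out of a saved log with standard tools; the grep pattern below is a placeholder and must be adapted to whatever per-run throughput line the script actually prints:
```bash
RUNS=10 scripts/run_mixed_10_cleanenv.sh | tee mixed10.log
# Placeholder pattern: assumes one "<number> ops/s" figure per run in the log.
grep -oE '[0-9]+(\.[0-9]+)? ops/s' mixed10.log | awk '{print $1}' | sort -n | awk '
  { v[NR] = $1; s += $1; ss += $1 * $1 }
  END {
    n = NR; mean = s / n
    var = ss / n - mean * mean; if (var < 0) var = 0
    med = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
    printf "n=%d  mean=%.3f  median=%.3f  CV=%.2f%%\n", n, mean, med, 100 * sqrt(var) / mean
  }'
```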
### 4.2 Secondary Metrics (Observability)
**Unified Cache Hit Rate**:
```bash
HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_random_mixed_hakmem_minimal_pgo
# Output: g_unified_cache_hits_global / (hits + misses)
```
**Refill Count** (requires instrumentation):
```c
// Add to unified_cache_refill():
static _Atomic uint64_t g_refill_count_total = 0;
atomic_fetch_add(&g_refill_count_total, 1, memory_order_relaxed);
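// Report g_refill_count_total at shutdown (e.g., from a destructor) so that
// refill frequency can be compared between baseline and treatment runs.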
```
**Warm Pool Hit Rate**:
```bash
# Already exists in g_warm_pool_stats[class_idx].hits / misses
# Print at shutdown via tiny_warm_pool_print_stats()
```
### 4.3 Layout Tax Check (If NO-GO)
```bash
# Run forensics on regression
./scripts/box/layout_tax_forensics_box.sh \
./bench_random_mixed_hakmem_minimal_pgo \
./bench_random_mixed_hakmem_minimal_pgo_phase69
```
---
## 5. Risk Assessment
| Risk | Mitigation |
|------|------------|
| **Layout Tax** (batch size change) | Use `layout_tax_forensics_box.sh` to diagnose. Revert if IPC drops >3%. |
| **Cache Thrashing** (capacity too high) | Monitor LLC-misses via perf stat. Limit C5-C7 capacity to 512 max. |
| **RSS Increase** (larger batches/caches) | Measure RSS before/after. Acceptable if <+5% for +3% throughput. |
| **PGO Mismatch** (batch change) | Re-run PGO training after batch size change (included in `pgo-fast-full`). |
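For the RSS and cache-thrashing rows, stock tooling is enough (binary name as used elsewhere in this memo; perf event names vary by CPU):
```bash
# Peak RSS before/after a capacity change (GNU time, not the shell builtin).
/usr/bin/time -v ./bench_random_mixed_hakmem_minimal_pgo 2>&1 | grep 'Maximum resident'

# LLC misses plus IPC for the layout-tax check (>3% IPC drop => revert).
perf stat -e LLC-loads,LLC-load-misses,instructions,cycles \
  ./bench_random_mixed_hakmem_minimal_pgo
```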
---
## 6. Decision Criteria
### GO Thresholds
- **GO**: +1.0% (additive improvement, worth merging)
- **Strong GO**: +3.0% (M2-worthy, promote to baseline)
- **NEUTRAL**: ±1.0% (no regression, but no clear win)
- **NO-GO**: <-1.0% (regression, investigate layout tax)
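Given baseline and treatment means from §4.1, the verdict is mechanical; a small helper sketch (name and interface hypothetical):
```bash
# Usage: go_check.sh <baseline_mean_ops> <treatment_mean_ops>
BASE=$1; TREAT=$2
awk -v b="$BASE" -v t="$TREAT" 'BEGIN {
  d = (t - b) / b * 100
  if      (d >= 3.0)  v = "Strong GO"
  else if (d >= 1.0)  v = "GO"
  else if (d > -1.0)  v = "NEUTRAL"
  else                v = "NO-GO (run layout tax forensics)"
  printf "delta=%+.2f%%  ->  %s\n", d, v
}'
```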
### Promotion Strategy
- **Single-parameter GO**: Merge immediately if +1%+
- **Combined GO**: Require +3%+ to justify complexity
- **Strong GO**: Update PGO baseline + PERFORMANCE_TARGETS_SCORECARD
---
## 7. Next Steps
### Immediate (Phase 69-1)
1. **ENV Sweeps** (no recompile):
- Warm pool size: 12 → 16 → 24
- Unified cache C5-C7: 128 → 256 → 512
2. **Refill Count Sweep** (ENV-only):
- `HAKMEM_TINY_REFILL_COUNT_MID`: 64 → 96 → 128 → 160 (example)
- `HAKMEM_TINY_REFILL_COUNT_HOT`: 96 → 128 → 160 (example)
### Follow-Up (Phase 69-2)
3. **Combined Optimization**:
- Apply winning parameters from Phase 69-1
- Verify additive gains (+3-6% target)
4. **Baseline Promotion** (if Strong GO):
- Update `pgo_fast_profile_config.sh` with winning ENV vars
- Re-run `make pgo-fast-full` to bake optimizations into baseline
---
## Artifacts
- **This design memo**: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
- **Sweep script** (TODO): `scripts/box/phase69_refill_sweep.sh`
- **Results log** (TODO): `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
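A possible skeleton for the TODO sweep script, chaining the three ENV axes from §2 one at a time (file layout and helper names are a sketch, not the final script):
```bash
#!/usr/bin/env bash
# scripts/box/phase69_refill_sweep.sh -- sketch only
set -eo pipefail
BENCH_BIN=${BENCH_BIN:-./bench_random_mixed_hakmem_minimal_pgo}
LOG=${LOG:-phase69_sweep.log}

run() {  # run one configuration (VAR=VALUE args) and label it in the log
  echo "### ${*:-baseline}" | tee -a "$LOG"
  env "$@" RUNS=10 BENCH_BIN="$BENCH_BIN" scripts/run_mixed_10_cleanenv.sh | tee -a "$LOG"
}

run                                           # baseline (current defaults)
for S in 12 16 24;      do run HAKMEM_WARM_POOL_SIZE=$S; done
for C in 128 256 512;   do run HAKMEM_TINY_UNIFIED_C5=$C HAKMEM_TINY_UNIFIED_C6=$C HAKMEM_TINY_UNIFIED_C7=$C; done
for R in 64 96 128 160; do run HAKMEM_TINY_REFILL_COUNT_MID=$R; done
```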
---
**Status**: 🟢 READY FOR SWEEP (Phase 69-1)
**Estimated Time**: 2-3 hours (ENV sweeps) + 4-6 hours (batch sweep with PGO)
**Expected Outcome**: +3-6% combined gain → within reach of M2 (55% target)