# Phase 69-0: Refill Frequency × Fixed Tax Reduction — Design Memo
**Status**: 🟡 DESIGN (Phase 69-0)

**Objective**: Reduce "refill count × fixed overhead" by tuning batch sizes and cache capacities to minimize refill frequency. Target: **+3–6%** (shortest path to M2: 55%).

---
## Executive Summary

**Current Performance**: 61.614M ops/s = 50.93% of mimalloc (Phase 68 PGO baseline)

**M2 Target**: 66.54M ops/s = 55% of mimalloc (Gap: **+4.93M ops/s ≈ +8.0%**)

**Strategy**: Reduce refill frequency by tuning batch sizes and cache capacities. Every refill incurs fixed overhead (SuperSlab lookup, metadata access, chain splicing), so reducing the refill count improves throughput directly, without the risks of micro-optimization.

**Why This Approach**:
- ✅ **High reproducibility**: Unlike branch/inline tuning, batch tuning has low layout tax risk
- ✅ **Box-compliant**: No structural changes, only parameter tuning
- ✅ **Measurable**: Refill count is directly observable via counters
- ✅ **Reversible**: All changes are config-only (ENV variables or compile-time constants)

---
## 1. Current State Analysis

### 1.1 Refill Path Hierarchy

```
malloc() → Unified Cache (C2/C3: 2048 slots, C5-C7: 128 slots)
↓ MISS
unified_cache_refill()
↓
Warm Pool (12 SuperSlabs per class)
↓ MISS
sll_refill_*_from_ss(class_idx, max_take=refill_count(class_idx))
↓
trc_pop_from_freelist() OR trc_linear_carve()
↓
trc_splice_to_sll()
```

### 1.2 Key Parameters (Current)

| Component | Parameter | Current Value | Location |
|-----------|-----------|---------------|----------|
| **Refill Count (hot/mid/global)** | `HAKMEM_TINY_REFILL_COUNT_*` | hot=128, mid=96, global=64 (defaults) | `core/hakmem_tiny_init.inc:270` |
| **Unified Cache C2/C3** | `unified_capacity(2/3)` | 2048 slots | `core/front/tiny_unified_cache.h:129` |
| **Unified Cache C5-C7** | `unified_capacity(5-7)` | 128 slots | `core/front/tiny_unified_cache.h:131` |
| **Unified Cache Others** | `unified_capacity(0/1/4)` | 64 slots | `core/front/tiny_unified_cache.h:128` |
| **Warm Pool Size** | `TINY_WARM_POOL_MAX_PER_CLASS` | 12 | `core/front/tiny_warm_pool.h:46` |

### 1.3 Refill Overhead Breakdown

**Fixed Overhead per refill** (~50-100 cycles):
- SuperSlab lookup (warm pool pop or registry scan)
- SlabMeta load (freelist, used, carved, capacity)
- Chain building (header writes, next pointer linking)
- TLS SLL splice (update head/count atomically)

**Cost Formula**:
```
Total Refill Cost = (refill_count × fixed_overhead) + (blocks_carved × per_block_cost)
```

**Optimization Strategy**:
- ↑ Batch size → ↓ refill_count → ↓ Total Cost (up to cache capacity limit)
- ↑ Cache capacity → ↓ miss rate → ↓ refill_count

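The fixed-overhead term of the cost formula can be evaluated for a few batch sizes to see how it shrinks as batches grow. This is a rough sketch only: it assumes every refill delivers a full batch, the 75-cycle fixed overhead is simply the midpoint of the 50-100 cycle estimate, and `refill_fixed_cost` is a hypothetical helper name, not part of the codebase.

```bash
# Hypothetical helper: fixed-overhead part of the refill cost formula.
# Usage: refill_fixed_cost <blocks_needed> <batch_size> <fixed_overhead_cycles>
# Prints "<refill_count> <total_fixed_cycles>".
refill_fixed_cost() {
  # Ceiling division: a partial batch still costs one full refill.
  refills=$(( ($1 + $2 - 1) / $2 ))
  echo "$refills $(( refills * $3 ))"
}

# 1,000,000 blocks served through refills, assumed 75 cycles of fixed overhead each.
for batch in 64 96 128 256; do
  echo "batch=$batch -> $(refill_fixed_cost 1000000 "$batch" 75)"
done
```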
---

## 2. Tunable Parameters (Phase 69 Sweep Plan)

### 2.1 Refill Count Sweep (ENV-only)

**Parameters**:
- `HAKMEM_TINY_REFILL_COUNT_HOT` (classes C0–C3)
- `HAKMEM_TINY_REFILL_COUNT_MID` (classes C4–C7)
- `HAKMEM_TINY_REFILL_COUNT` (fallback/global)

**Current defaults** (when ENV unset):
- hot=128, mid=96, global=64

**Rationale**:
- Smaller counts reduce per-refill work but increase refill frequency.
- Larger counts reduce refill frequency but may increase chain-building cost and TLS pressure.
- The sweep is cheap (ENV-only) and reversible.

**A/B Test Method**:
```bash
# Baseline
RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh

# Treatment examples (pick one axis at a time)
HAKMEM_TINY_REFILL_COUNT_MID=128 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_REFILL_COUNT_MID=64 RUNS=10 BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh
```
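Each A/B pair ultimately reduces to one number: the relative change of the treatment mean against the baseline mean. A minimal sketch of that calculation follows; `pct_delta` is a hypothetical helper, and the throughput values in the example call are illustrative, not measured results.

```bash
# Hypothetical helper: relative throughput change of a treatment vs. the baseline, in percent.
# Usage: pct_delta <baseline_mops> <treatment_mops>
pct_delta() {
  awk -v base="$1" -v treat="$2" 'BEGIN { printf "%+.2f\n", (treat - base) / base * 100 }'
}

pct_delta 61.614 62.846   # illustrative numbers only
```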
---

### 2.2 Unified Cache Capacity Sweep (C5-C7 Focus)

**Parameter**: `unified_capacity(class_idx)` for C5-C7 (129B-1024B, Mixed workload)

**Current**: 128 slots (C5-C7)

**Rationale**: C5-C7 handle the 129B-1024B range (mid-size allocations in the Mixed benchmark). Increasing capacity reduces the miss rate.

**Sweep Range**: [128, 256, 512]

**ENV Control**:
```bash
# Baseline (128)
HAKMEM_TINY_UNIFIED_C5=128 HAKMEM_TINY_UNIFIED_C6=128 HAKMEM_TINY_UNIFIED_C7=128

# Treatment 1 (256)
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256

# Treatment 2 (512)
HAKMEM_TINY_UNIFIED_C5=512 HAKMEM_TINY_UNIFIED_C6=512 HAKMEM_TINY_UNIFIED_C7=512
```

**Expected Winner**: 256 (balance between working set and RSS)

**Why Not C2/C3**:
- C2/C3 already have 2048 slots (very high capacity)
- A further increase is unlikely to help (the miss rate is already low)
- Focus on under-optimized classes (C5-C7)

---

### 2.3 Warm Pool Size Sweep

**Parameter**: `TINY_WARM_POOL_MAX_PER_CLASS`

**Current**: 12 SuperSlabs per class

**Rationale**: Warm pool caches hot SuperSlabs to avoid registry scan. Larger pool → lower registry scan frequency.

**Sweep Range**: [12, 16, 24]

**ENV Control**:
```bash
HAKMEM_WARM_POOL_SIZE=16  # Treatment 1
HAKMEM_WARM_POOL_SIZE=24  # Treatment 2
```

**Expected Winner**: 16 (diminishing returns beyond this)

**Caveat**: Memory overhead = pool_size × sizeof(SuperSlab*) × TINY_NUM_CLASSES = 16 × 8 × 8 = 1KB per thread (negligible)

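The same overhead arithmetic can be repeated for every value in the sweep range. The sketch below assumes 8-byte pointers and 8 tiny classes, exactly as in the caveat; `warm_pool_bytes` is a hypothetical helper name.

```bash
# Per-thread pointer-array footprint of the warm pool, in bytes.
# Assumes sizeof(SuperSlab*) = 8 and TINY_NUM_CLASSES = 8, as in the caveat.
warm_pool_bytes() {
  echo $(( $1 * 8 * 8 ))
}

for size in 12 16 24; do
  echo "pool_size=$size -> $(warm_pool_bytes "$size") bytes per thread"
done
```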

---

## 3. Implementation Strategy

### 3.1 Phase 69-1: Single-Parameter Sweeps (Isolation)

**Goal**: Measure each parameter's individual impact to avoid confounding effects.

**Order** (easiest to hardest):

1. **Warm Pool Size** (ENV-only, no recompile):
   ```bash
   for SIZE in 12 16 24; do
     HAKMEM_WARM_POOL_SIZE=$SIZE RUNS=10 scripts/run_mixed_10_cleanenv.sh
   done
   ```
   - Expected: +0.5-1.0% (registry scan reduction)
   - Risk: Low (ENV-only change)

2. **Unified Cache C5-C7** (ENV-only, no recompile):
   ```bash
   for CAP in 128 256 512; do
     HAKMEM_TINY_UNIFIED_C5=$CAP HAKMEM_TINY_UNIFIED_C6=$CAP HAKMEM_TINY_UNIFIED_C7=$CAP \
     RUNS=10 scripts/run_mixed_10_cleanenv.sh
   done
   ```
   - Expected: +1-2% (miss rate reduction for mid-size allocations)
   - Risk: Low (ENV-only change)

3. **Refill Batch Size** (removed, 2025-12 audit):
   `TINY_REFILL_BATCH_SIZE` is not wired into the current Tiny front, so it does not work as a tuning knob.
   The actual refill amount is controlled by `HAKMEM_TINY_REFILL_COUNT_*` (ENV-only).

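The single-axis sweeps can be collected into one dry-run driver that only prints the commands, so the full matrix can be reviewed before anything runs. This is a sketch under assumptions: the ENV names, value ranges, and script path are the ones used in this memo, while the function name `phase69_1_commands` and the choice to sweep only `HAKMEM_TINY_REFILL_COUNT_MID` on the refill axis are illustrative.

```bash
# Dry-run sketch: print one benchmark command per treatment, one axis at a time.
BENCH='RUNS=10 scripts/run_mixed_10_cleanenv.sh'

phase69_1_commands() {
  # Axis 1: warm pool size
  for size in 12 16 24; do
    echo "HAKMEM_WARM_POOL_SIZE=$size $BENCH"
  done
  # Axis 2: unified cache capacity for C5-C7 (all three classes move together)
  for cap in 128 256 512; do
    echo "HAKMEM_TINY_UNIFIED_C5=$cap HAKMEM_TINY_UNIFIED_C6=$cap HAKMEM_TINY_UNIFIED_C7=$cap $BENCH"
  done
  # Axis 3: refill count for the mid classes
  for count in 64 96 128 160; do
    echo "HAKMEM_TINY_REFILL_COUNT_MID=$count $BENCH"
  done
}

phase69_1_commands
```

Piping the output to `sh` from the repository root would execute the sweep; keeping generation and execution separate makes it easy to confirm that each line changes exactly one axis.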
### 3.2 Phase 69-2: Combined Optimization (Best Settings)

After identifying winners from Phase 69-1, combine them:

```bash
# Example: C5-C7=256, warm_pool=16
HAKMEM_WARM_POOL_SIZE=16 \
HAKMEM_TINY_UNIFIED_C5=256 HAKMEM_TINY_UNIFIED_C6=256 HAKMEM_TINY_UNIFIED_C7=256 \
RUNS=10 scripts/run_mixed_10_cleanenv.sh
```

**Expected Combined Gain**: +3-6% (additive if parameters are orthogonal)

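If the parameters really are orthogonal, the individual gains combine roughly additively; strictly, independent speedups multiply, which gives a slightly larger figure. The sketch below prints both so the measured combined result can be compared against either expectation. `combined_gain` is a hypothetical helper and the input gains are illustrative, not measurements.

```bash
# Combine per-parameter gains (in percent) under an independence assumption.
# Prints "<additive_sum> <compounded>", both in percent.
combined_gain() {
  printf '%s\n' "$@" | awk '
    BEGIN { prod = 1 }
    { sum += $1; prod *= 1 + $1 / 100 }
    END { printf "%.2f %.2f\n", sum, (prod - 1) * 100 }'
}

combined_gain 1.0 2.0 1.5   # illustrative single-parameter gains
```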

---

## 4. Measurement & Validation

### 4.1 Primary Metric: Throughput

```bash
RUNS=10 scripts/run_mixed_10_cleanenv.sh
# Extract: Mean, Median, CV
# Decision: GO (+1%), Strong GO (+3%)
```
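Mean, median, and CV can be computed with a small `sort` and `awk` pipeline once the per-run throughput values are available. The sketch assumes the values have already been extracted one per line (the output format of `run_mixed_10_cleanenv.sh` is not specified here), uses the sample standard deviation for CV, and `summarize_runs` is a hypothetical helper name.

```bash
# Summarize throughput samples (one number per line on stdin).
# Prints "mean=<m> median=<med> cv=<cv>%" where CV = sample stddev / mean.
summarize_runs() {
  sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1
      mean = sum / NR
      # Input is sorted, so the median is the middle value (or the mean of the two middle values).
      median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      for (i = 1; i <= NR; i++) ss += (v[i] - mean) ^ 2
      sd = (NR > 1) ? sqrt(ss / (NR - 1)) : 0
      printf "mean=%.3f median=%.3f cv=%.2f%%\n", mean, median, sd / mean * 100
    }'
}

printf '%s\n' 61.2 61.9 61.5 61.7 61.6 | summarize_runs   # illustrative samples
```

For the five illustrative samples this prints `mean=61.580 median=61.600 cv=0.42%`. A CV well below the +1% GO threshold is what makes a 1% difference between two means believable.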
### 4.2 Secondary Metrics (Observability)

**Unified Cache Hit Rate**:
```bash
HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_random_mixed_hakmem_minimal_pgo
# Output: g_unified_cache_hits_global / (hits + misses)
```

**Refill Count** (requires instrumentation):
```c
// Add to unified_cache_refill() (requires <stdatomic.h> and <stdint.h>):
static _Atomic uint64_t g_refill_count_total = 0;
atomic_fetch_add_explicit(&g_refill_count_total, 1, memory_order_relaxed);
```

**Warm Pool Hit Rate**:
```bash
# Already exists in g_warm_pool_stats[class_idx].hits / misses
# Print at shutdown via tiny_warm_pool_print_stats()
```
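Both hit rates reduce to `hits / (hits + misses)` over the raw counters. A minimal sketch of that reduction follows, with a guard for the case where no events were recorded; `hit_rate_pct` is a hypothetical helper and the counter values are illustrative.

```bash
# Hit rate in percent from raw counters: hits / (hits + misses).
# Usage: hit_rate_pct <hits> <misses>
hit_rate_pct() {
  awk -v h="$1" -v m="$2" 'BEGIN {
    total = h + m
    if (total == 0) { print "n/a"; exit }
    printf "%.2f\n", h / total * 100
  }'
}

hit_rate_pct 9800 200   # illustrative counter values
```

Comparing this figure before and after a capacity change shows whether a throughput gain actually came from fewer misses rather than from an unrelated layout effect.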
### 4.3 Layout Tax Check (If NO-GO)

```bash
# Run forensics on regression
./scripts/box/layout_tax_forensics_box.sh \
  ./bench_random_mixed_hakmem_minimal_pgo \
  ./bench_random_mixed_hakmem_minimal_pgo_phase69
```

---

## 5. Risk Assessment

| Risk | Mitigation |
|------|------------|
| **Layout Tax** (batch size change) | Use `layout_tax_forensics_box.sh` to diagnose. Revert if IPC drops >3%. |
| **Cache Thrashing** (capacity too high) | Monitor LLC-misses via perf stat. Limit C5-C7 capacity to 512 max. |
| **RSS Increase** (larger batches/caches) | Measure RSS before/after. Acceptable if <+5% for +3% throughput. |
| **PGO Mismatch** (batch change) | Re-run PGO training after batch size change (included in `pgo-fast-full`). |

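The RSS rule in the table can be written as a simple gate. This sketch reads the rule as "RSS growth below 5% and throughput gain of at least 3%", which is one interpretation of the table entry; `rss_tradeoff_ok` is a hypothetical helper and the deltas in the example are illustrative.

```bash
# Returns 0 (accept) when RSS grows by less than 5% and throughput gains at least 3%.
# Usage: rss_tradeoff_ok <rss_delta_pct> <throughput_delta_pct>
rss_tradeoff_ok() {
  awk -v rss="$1" -v tput="$2" 'BEGIN { exit !(rss < 5.0 && tput >= 3.0) }'
}

if rss_tradeoff_ok 4.2 3.4; then echo "accept"; else echo "reject"; fi   # illustrative deltas
```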
---

## 6. Decision Criteria

### GO Thresholds

- **GO**: ≥ +1.0% (additive improvement, worth merging)
- **Strong GO**: ≥ +3.0% (M2-worthy, promote to baseline)
- **NEUTRAL**: within ±1.0% (no regression, but no clear win)
- **NO-GO**: < -1.0% (regression, investigate layout tax)

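The thresholds map directly onto a small classifier over the measured delta. In this sketch the boundary cases are an assumption: exactly +1.0% counts as GO, exactly +3.0% as Strong GO, and exactly -1.0% as NEUTRAL. `classify_delta` is a hypothetical helper name.

```bash
# Map a throughput delta (percent vs. baseline) to the decision labels above.
classify_delta() {
  awk -v d="$1" 'BEGIN {
    if (d >= 3.0)       print "STRONG GO"
    else if (d >= 1.0)  print "GO"
    else if (d >= -1.0) print "NEUTRAL"
    else                print "NO-GO"
  }'
}

for d in 3.4 1.2 0.3 -1.8; do   # illustrative deltas
  echo "$d% -> $(classify_delta "$d")"
done
```

Applying the same function to every sweep result keeps the GO/NO-GO call mechanical and avoids re-reading the thresholds for each run.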
### Promotion Strategy

- **Single-parameter GO**: Merge immediately at +1% or better
- **Combined GO**: Require +3% or better to justify the added complexity
- **Strong GO**: Update PGO baseline + PERFORMANCE_TARGETS_SCORECARD

---
## 7. Next Steps

### Immediate (Phase 69-1)

1. **ENV Sweeps** (no recompile):
   - Warm pool size: 12 → 16 → 24
   - Unified cache C5-C7: 128 → 256 → 512

2. **Refill Count Sweep** (ENV-only):
   - `HAKMEM_TINY_REFILL_COUNT_MID`: 64 → 96 → 128 → 160 (example)
   - `HAKMEM_TINY_REFILL_COUNT_HOT`: 96 → 128 → 160 (example)

### Follow-Up (Phase 69-2)

3. **Combined Optimization**:
   - Apply winning parameters from Phase 69-1
   - Verify additive gains (+3-6% target)

4. **Baseline Promotion** (if Strong GO):
   - Update `pgo_fast_profile_config.sh` with winning ENV vars
   - Re-run `make pgo-fast-full` to bake optimizations into baseline

---
## Artifacts

- **This design memo**: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
- **Sweep script** (TODO): `scripts/box/phase69_refill_sweep.sh`
- **Results log** (TODO): `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`

---

**Status**: 🟢 READY FOR SWEEP (Phase 69-1)

**Estimated Time**: 2-3 hours (ENV sweeps) + 4-6 hours (batch sweep with PGO)

**Expected Outcome**: +3-6% combined gain → within range of M2 (55% target)