Phase 83-1 + Allocator Comparison: Switch dispatch fixed (NO-GO +0.32%), PROFILE correction, SCORECARD update

Key changes:
- Phase 83-1: Switch dispatch fixed mode (tiny_inline_slots_switch_dispatch_fixed_box) - NO-GO (marginal +0.32%, branch reduction negligible)
  Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns

- Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M):
  tcmalloc: 115.26M (92.33% of mimalloc)
  jemalloc: 97.39M (77.96% of mimalloc)
  system: 85.20M (68.24% of mimalloc)
  mimalloc: 124.82M (baseline)

- hakmem PROFILE correction: scripts/run_mixed_10_cleanenv.sh + run_allocator_quick_matrix.sh
  PROFILE explicitly set to MIXED_TINYV3_C7_SAFE for hakmem measurements
  Result: baseline stabilized at 55.53M (44.46% of mimalloc)
  Previous unstable measurement (35.57M) was caused by a profile leak

- Documentation:
  * PERFORMANCE_TARGETS_SCORECARD.md: Reference allocators + M1/M2 milestone status
  * PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md: Phase 83-1 analysis (NO-GO)
  * ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md: Quick comparison procedure
  * ALLOCATOR_COMPARISON_SSOT.md: Detailed SSOT methodology

- M2 milestone status: 44.46% (target 55%, gap -10.54pp) - structural improvements needed

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Moe Charm (CI)
2025-12-18 18:50:00 +09:00
parent d5c1113b4c
commit 89a9212700
50 changed files with 4428 additions and 58 deletions


@@ -53,17 +53,60 @@ Note:
| allocator | mean (M ops/s) | median (M ops/s) | ratio vs mimalloc (mean) | CV |
|----------|-----------------|------------------|--------------------------|-----|
| **mimalloc (separate)** | **120.979** | 120.967 | **100%** | 0.90% |
| jemalloc (LD_PRELOAD) | 96.06 | 97.00 | 79.73% | 2.93% |
| system (separate) | 85.10 | 85.24 | 70.65% | 1.01% |
| **mimalloc (separate)** | **124.82** | 124.71 | **100%** | 1.10% |
| **tcmalloc (LD_PRELOAD)** | **115.26** | 115.51 | **92.33%** | 1.22% |
| **jemalloc (LD_PRELOAD)** | **97.39** | 97.88 | **77.96%** | 1.29% |
| **system (separate)** | **85.20** | 85.40 | **68.24%** | 1.98% |
| libc (same binary) | 76.26 | 76.66 | 63.30% | (old) |
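The mean, median, CV and ratio columns can be derived from per-run throughputs with a few lines of shell. This is a minimal sketch, not the logic of the actual measurement scripts. The helper names `summarize` and `ratio_pct` are hypothetical, and the sketch assumes CV means the sample standard deviation divided by the mean, because the table does not say which variant it uses.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for summarizing N runs of one allocator (M ops/s).
# Assumption: CV = sample standard deviation (n-1) / mean, in percent.
set -euo pipefail

# summarize RUN... -> prints "mean median cv_percent"
summarize() {
  [ "$#" -gt 0 ] || { echo "summarize: no runs given" >&2; return 1; }
  printf '%s\n' "$@" | LC_ALL=C sort -n | LC_ALL=C awk '
    { x[NR] = $1; sum += $1 }
    END {
      n = NR
      mean = sum / n
      # median of the sorted values
      if (n % 2) median = x[(n + 1) / 2]
      else median = (x[n / 2] + x[n / 2 + 1]) / 2
      for (i = 1; i <= n; i++) ss += (x[i] - mean) ^ 2
      sd = (n > 1) ? sqrt(ss / (n - 1)) : 0
      printf "%.2f %.2f %.2f\n", mean, median, 100 * sd / mean
    }'
}

# ratio_pct VALUE BASELINE -> VALUE / BASELINE as a percentage
ratio_pct() {
  LC_ALL=C awk -v v="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 100 * v / b }'
}
```

Passing an allocator's ten per-run values to `summarize` gives its mean, median and CV. Calling `ratio_pct <allocator mean> <mimalloc mean>` then gives the ratio column.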
Notes:
- **Phase 59b rebase**: mimalloc updated (120.466M → 120.979M, +0.43% variation)
- `system/mimalloc/jemalloc` are measured as separate binaries, so they serve as a **reference that includes layout (text size / I-cache) differences**
- **2025-12-18 Update (corrected)**: tcmalloc/jemalloc/system measurements completed (10-run Random Mixed, WS=400, ITERS=20M, SEED=1)
- tcmalloc: 115.26M ops/s (92.33% of mimalloc) ✓
- jemalloc: 97.39M ops/s (77.96% of mimalloc)
- system: 85.20M ops/s (68.24% of mimalloc)
- mimalloc: 124.82M ops/s (baseline)
- Measurement script: `scripts/run_allocator_quick_matrix.sh` (hakmem via run_mixed_10_cleanenv.sh)
- **Fix**: hakmem measurements now set HAKMEM_PROFILE explicitly → results are back within the SSOT range
- `system/mimalloc/jemalloc/tcmalloc` are measured as separate binaries, so they serve as a **reference that includes layout (text size / I-cache) differences**
- `tcmalloc (LD_PRELOAD)` is installed from gperftools: `/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so`
- `libc (same binary)` uses `HAKMEM_FORCE_LIBC_ALLOC=1` and gives a rough comparison on the same layout (measured before Phase 48)
- **Use the FAST build for mimalloc comparisons** (the Standard build's gate overhead is a hakmem-specific tax)
- **First jemalloc measurement**: 79.73% of mimalloc (Phase 59 baseline), a strong competitor about 9pp ahead of system (70.65%)
- Comparison procedure (SSOT): `docs/analysis/ALLOCATOR_COMPARISON_SSOT.md`
- **Same-binary comparison (minimizes layout differences)**: `scripts/run_allocator_preload_matrix.sh` (fixed `bench_random_mixed_system` binary + swapped `LD_PRELOAD`)
- Note: this takes a different path from the hakmem SSOT (`bench_random_mixed_hakmem*`), so treat it as a drop-in wrapper reference
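The HAKMEM_PROFILE fix comes down to never letting the benchmark inherit profile settings from the caller's shell. The following is a minimal sketch of that idea, assuming "clean env" means starting from an empty environment. `run_pinned` is a hypothetical helper and does not reproduce the contents of `scripts/run_mixed_10_cleanenv.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a benchmark with a scrubbed environment and an
# explicitly pinned HAKMEM_PROFILE, so nothing from the caller's shell
# (e.g. an exported HAKMEM_* knob) can leak into the measurement.
set -euo pipefail

# run_pinned PROFILE CMD [ARGS...]
run_pinned() {
  local profile="$1"; shift
  # env -i starts from an empty environment; only PATH, HOME and the
  # pinned profile are passed through (add anything else explicitly).
  env -i PATH="$PATH" HOME="${HOME:-/}" HAKMEM_PROFILE="$profile" "$@"
}
```

Usage would look like `run_pinned MIXED_TINYV3_C7_SAFE <hakmem bench binary> <args>`. The benchmark arguments are left to the caller because they are not listed here. Any variable the benchmark genuinely needs, such as `LD_LIBRARY_PATH`, has to be added to the `env -i` line explicitly.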
## Allocator Comparison (bench_allocators_compare.sh, small-scale reference)
Note:
- This is a **small-scale reference** produced by `bench_allocators_*` with `--scenario mixed` (a simple 8B..1MB mix).
- It is **separate** from the Mixed 16-1024B SSOT (`scripts/run_mixed_10_cleanenv.sh`), so do not confuse it with the FAST baseline or the milestones.
Run (example):
```bash
make bench
JEMALLOC_SO=/path/to/libjemalloc.so.2 \
TCMALLOC_SO=/path/to/libtcmalloc.so \
scripts/bench_allocators_compare.sh --scenario mixed --iterations 50
```
Results (2025-12-18, mixed, iterations=50):
| allocator | ops/sec (M) | vs mimalloc (Phase 69 ref) | vs system | soft_pf | RSS (MB) |
|----------|--------------|----------------------------|-----------|---------|----------|
| tcmalloc (LD_PRELOAD) | 34.56 | 28.6% | 11.2x | 3,842 | 21.5 |
| jemalloc (LD_PRELOAD) | 24.33 | 20.1% | 7.9x | 143 | 3.8 |
| hakmem (linked) | 16.85 | 13.9% | 5.4x | 4,701 | 46.5 |
| system (linked) | 3.09 | 2.6% | 1.0x | 68,590 | 19.6 |
Notes:
- `soft_pf`/`RSS` come from `getrusage()` (on Linux, `ru_maxrss` is reported in KB)
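The following sketch collects the same two columns outside the comparison script. It assumes `soft_pf` corresponds to `getrusage()`'s `ru_minflt` and that RSS (MB) is `ru_maxrss` divided by 1024. It uses GNU time (`/usr/bin/time`), whose `%R` (minor page faults) and `%M` (max RSS in KB) format specifiers report these rusage values. `measure_rusage` and `kb_to_mb` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: collect soft page faults and peak RSS for one run.
# Assumes soft_pf = ru_minflt and RSS (MB) = ru_maxrss / 1024 (Linux reports
# ru_maxrss in KB). Requires GNU time at /usr/bin/time.
set -euo pipefail

# kb_to_mb KB -> megabytes with one decimal place
kb_to_mb() {
  LC_ALL=C awk -v kb="$1" 'BEGIN { printf "%.1f\n", kb / 1024 }'
}

# measure_rusage CMD [ARGS...] -> "soft_pf=<minor faults> rss_mb=<peak RSS>"
measure_rusage() {
  local tmp minflt maxrss_kb
  tmp="$(mktemp)"
  # GNU time: %R = minor (reclaimable) page faults, %M = max RSS in KB.
  /usr/bin/time -o "$tmp" -f '%R %M' "$@" >/dev/null
  read -r minflt maxrss_kb < "$tmp"
  rm -f "$tmp"
  echo "soft_pf=$minflt rss_mb=$(kb_to_mb "$maxrss_kb")"
}
```

Because of `set -e`, a benchmark that exits non-zero aborts `measure_rusage` before the output file is parsed.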
## Allocator Comparison (Random Mixed, 10-run, WS=400, reference)
Note:
- Separate-binary comparisons include a layout tax.
- To **prefer a same-binary comparison (LD_PRELOAD)**, use `scripts/run_allocator_preload_matrix.sh`.
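The same-binary approach can be sketched as a loop that keeps one benchmark binary fixed and swaps only `LD_PRELOAD`. This sketch is not `scripts/run_allocator_preload_matrix.sh` itself. `preload_matrix` is a hypothetical helper, the token `none` stands for the system allocator, and missing libraries are skipped rather than treated as errors.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run one fixed benchmark binary under several
# allocators by swapping only LD_PRELOAD, so code layout stays identical.
set -euo pipefail

# preload_matrix "LIB..." CMD [ARGS...]
#   LIB is a shared-library path, or "none" for the system allocator.
preload_matrix() {
  local libs="$1"; shift
  local lib
  for lib in $libs; do
    if [ "$lib" = none ]; then
      echo "== system (no LD_PRELOAD)"
      # make sure an inherited LD_PRELOAD cannot sneak into the baseline
      env -u LD_PRELOAD "$@"
    elif [ -e "$lib" ]; then
      echo "== $lib"
      LD_PRELOAD="$lib" "$@"
    else
      echo "skip: $lib not found" >&2
    fi
  done
}
```

For example, `preload_matrix "none /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so" ./bench_random_mixed_system` runs the system baseline and then tcmalloc. The benchmark's own arguments are omitted because they are not documented in this section.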
## 1) Speed (relative targets)
@@ -71,14 +114,16 @@ Notes:
Recommended milestones (Mixed 16-1024B, FAST build):
| Milestone | Target | Current (FAST v3 + PGO Phase 69) | Status |
| Milestone | Target | Current (2025-12-18, corrected) | Status |
|-----------|--------|-----------------------------------|--------|
| M1 | **50%** of mimalloc | 51.77% | 🟢 **EXCEEDED** (Phase 69, Warm Pool Size=16, ENV-only) |
| M2 | **55%** of mimalloc | - | 🔴 Not reached (+3.23pp remaining, Phase 69+ ongoing) |
| M1 | **50%** of mimalloc | 44.46% | 🟡 **Not reached** (measured after the PROFILE fix) |
| M2 | **55%** of mimalloc | 44.46% | 🔴 **Not reached** (Gap: -10.54pp) |
| M3 | **60%** of mimalloc | - | 🔴 Not reached (structural changes needed) |
| M4 | **65-70%** of mimalloc | - | 🔴 Not reached (structural changes needed) |
**Current:** FAST v3 + PGO (Phase 69) = 62.63M ops/s = 51.77% of mimalloc (Warm Pool Size=16, ENV-only, verified over 10 runs)
**Current:** hakmem (FAST PGO) (2025-12-18) = 55.53M ops/s = 44.46% of mimalloc (Random Mixed, WS=400, ITERS=20M, 10-run)
⚠️ **Important**: the Phase 69 baseline (62.63M = 51.77%) may reflect outdated measurement conditions. After making PROFILE explicit, the new baseline is 44.46% (M1 not reached).
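The milestone gaps are percentage-point differences between the current ratio (hakmem mean / mimalloc mean × 100) and each target. The minimal sketch below uses a hypothetical `gap_to_milestones` helper. It takes the ratio directly and uses 65% as M4's target, the lower end of its 65-70% range.

```bash
#!/usr/bin/env bash
# Hypothetical helper: percentage-point gap from the current ratio
# (hakmem mean / mimalloc mean * 100) to each milestone target.
# M4 uses 65%, the lower end of its 65-70% range.
set -euo pipefail

# gap_to_milestones PCT -> one line per milestone
gap_to_milestones() {
  LC_ALL=C awk -v pct="$1" 'BEGIN {
    split("M1 M2 M3 M4", name, " ")
    split("50 55 60 65", target, " ")
    printf "current %.2f%%\n", pct
    for (i = 1; i <= 4; i++) {
      gap = pct - target[i]
      status = (gap >= 0) ? "reached" : "not reached"
      printf "%s %d%% %s (%+.2fpp)\n", name[i], target[i], status, gap
    }
  }'
}
```

`gap_to_milestones 44.46` reports M2 at -10.54pp, matching the table. `gap_to_milestones 51.77` reports M1 at +1.77pp and M2 at -3.23pp, matching the Phase 69 rows.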
**Phase 68 PGO promotion (Phase 66 → Phase 68 upgrade):**
- Phase 66 baseline: 60.89M ops/s = 50.32% (+3.0% mean, 3-run stable)