# Phase 8 Strategy: Ultrathink Analysis

## 🔍 Summary of Findings

### Complete Overhead Breakdown (1M × 16B test)

```
Total RSS at peak:     30.6 MB
Expected data:         22.9 MB  (7.6 ptr + 15.3 allocs)
Total overhead:         7.7 MB

Full breakdown:
├─ Reserved SuperSlabs:      4.0 MB  (52%)  ← intentional Phase 7.6 design
├─ SuperSlab fragmentation:  1.0 MB  (13%)  ← 2 MB alignment required
├─ Program baseline:         2.0 MB  (26%)  ← libc + global structures
├─ Slab metadata:            0.4 MB   (5%)
├─ TLS structures:           0.1 MB   (1%)
└─ Other:                    0.2 MB   (3%)
──────────────────────────────────────
Total identified:            7.7 MB (100%)  ✅ Fully accounted for!
```

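
The "expected data" figure can be reproduced from the test parameters. The sketch below assumes one 8-byte pointer is kept per allocation and that "MB" means 2^20 bytes. Neither is stated explicitly above, but together they reproduce the 7.6 + 15.3 MB split. All function names are illustrative.

```c
#include <stddef.h>

/* Convert a byte count to MiB (2^20 bytes). Assumption: the "MB" figures
 * in the breakdown are binary megabytes. */
static double bytes_to_mib(double bytes) {
    return bytes / (1024.0 * 1024.0);
}

/* Payload the test itself needs: one pointer slot plus one block per
 * allocation. Assumption: the caller passes ptr_size = 8 (64-bit build). */
static double expected_data_mib(size_t n_allocs, size_t block_size, size_t ptr_size) {
    double ptr_bytes   = (double)n_allocs * (double)ptr_size;
    double block_bytes = (double)n_allocs * (double)block_size;
    return bytes_to_mib(ptr_bytes + block_bytes);
}

/* Overhead is whatever RSS remains after subtracting the payload. */
static double overhead_mib(double peak_rss_mib, double expected_mib) {
    return peak_rss_mib - expected_mib;
}
```

For the 1M × 16B test, `expected_data_mib(1000000, 16, 8)` is about 22.9, and `overhead_mib(30.6, ...)` is about 7.7, matching the totals above.
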
### Gap to mimalloc

```
HAKMEM (1M scale):  32.9 MB
mimalloc:           25.1 MB
Gap:                 7.8 MB

Gap breakdown:
├─ Reserved SuperSlabs:       4.0 MB (51%)  ← the biggest culprit
├─ Magazine cache (assumed):  2.0 MB (26%)  ← 0 after a flush, but remains during operation
└─ Other overhead:            1.8 MB (23%)  ← hard to reduce
```

---

## 🎯 Pros and Cons of the Phase 7.6 Design

### Impact of EMPTY_SUPERSLAB_RESERVE = 2

**Defined at:** `hakmem_tiny.c:111-113`
```c
#define EMPTY_SUPERSLAB_RESERVE 2  // Keep up to N empty SuperSlabs per class
static SuperSlab* g_empty_superslabs[TINY_NUM_CLASSES];
```

**Benefits (why it was added in Phase 7.6):**
1. Faster re-allocation: no mmap syscall needed
2. Less fragmentation: existing SuperSlabs are reused
3. More stable latency: the cost of acquiring memory is eliminated

**Drawbacks:**
1. **4 MB of permanent overhead** (2 reserved SuperSlabs × 2 MB) ← fatal in the comparison with mimalloc
2. Wasted memory for small workloads
3. Memory is not returned when idle

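
To make the trade-off concrete, here is a minimal sketch of how a bounded reserve of empty SuperSlabs could behave. It is not the code in `hakmem_tiny.c`: the `ReservePool` type, the function names, and the use of `malloc`/`free` in place of `mmap`/`munmap` are all assumptions made so the example is self-contained.

```c
#include <stdlib.h>

#define SKETCH_RESERVE_MAX 2              /* plays the role of EMPTY_SUPERSLAB_RESERVE */
#define SKETCH_SUPERSLAB_SIZE (2u << 20)  /* 2 MB per SuperSlab */

/* Hypothetical per-class pool of empty SuperSlabs kept for reuse. */
typedef struct {
    void *slots[SKETCH_RESERVE_MAX];
    int   count;
    long  os_allocs;   /* times we had to go to the "OS" for memory */
    long  os_frees;    /* times memory was actually returned */
} ReservePool;

/* Get a SuperSlab: reuse a reserved one if possible, otherwise allocate. */
static void *pool_acquire(ReservePool *p) {
    if (p->count > 0) {
        return p->slots[--p->count];      /* fast path: no syscall */
    }
    p->os_allocs++;
    return malloc(SKETCH_SUPERSLAB_SIZE); /* stand-in for mmap */
}

/* Release an empty SuperSlab: keep it if the reserve has room, else free. */
static void pool_release(ReservePool *p, void *ss) {
    if (p->count < SKETCH_RESERVE_MAX) {
        p->slots[p->count++] = ss;        /* memory stays resident */
        return;
    }
    p->os_frees++;
    free(ss);                             /* stand-in for munmap */
}

/* Memory pinned by the reserve, in bytes. */
static size_t pool_reserved_bytes(const ReservePool *p) {
    return (size_t)p->count * SKETCH_SUPERSLAB_SIZE;
}
```

Releasing three empty SuperSlabs into this pool keeps two of them (4 MB stays resident) and frees one; the next acquire is then served from the reserve without touching the OS. That is both the benefit and the drawback listed above.
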
### Decision Criteria in Phase 7.6 (at the time)

**Situation at the time:**
- Before Phase 7.6: 40.9 MB
- After Phase 7.6: 33.0 MB (about 8 MB saved)
- Goal: implement dynamic SuperSlab deallocation

**Phase 7.6 results:**
- Empty SuperSlab detection: ✅ success
- Dynamic deallocation: ✅ success
- Reserve design: ✅ performance maintained

**It was the right design for Phase 7.6:**
- Achieved the 8 MB reduction (net -4 MB even after accounting for the 4 MB reserve)
- Performance maintained (no re-allocation penalty)

**What Phase 8 should revisit:**
- 4 MB is too large when compared against mimalloc
- The reserve strategy needs rethinking

---

## 📊 Magazine Cache: What Actually Happens

### Current settings (hakmem_tiny.c:79-150)

```c
#define TINY_TLS_MAG_CAP 2048

// Per-class magazine capacity (excerpt).
static inline int tiny_effective_cap(int class_idx) {
    switch (class_idx) {
    case 0: return 2048;  // Class 0 (16B): 2048 × 16B = 32 KB
    case 1: return 1024;  // Class 1 (32B): 1024 × 32B = 32 KB
    case 2: return 768;
    case 3: return 512;
    case 4: case 5: return 256;
    case 6: return 128;
    default: return 64;
    }
}
```

### Magazine Memory Usage (at maximum)

| Class | Block Size | Capacity | Max Memory | Typical Usage |
|-------|-----------|----------|------------|---------------|
| 0 | 16B | 2048 | 32 KB | < 1% (measured) |
| 1 | 32B | 1024 | 32 KB | < 1% |
| 2 | 64B | 768 | 48 KB | < 5% |
| 3 | 128B | 512 | 64 KB | < 10% |
| 4 | 256B | 256 | 64 KB | Appropriate? |
| 5 | 512B | 256 | 128 KB | Appropriate? |
| 6 | 1KB | 128 | 128 KB | Appropriate? |
| 7 | 2KB | 64 | 128 KB | Appropriate? |
| **Total** | | | **624 KB** | **Actual use < 10%** |

**Findings:**
- Classes 0-3 are over-provisioned (usage < 1-10%)
- Classes 4-7 are sized appropriately
- 624 KB is reserved in total but less than 62 KB is actually used ← **90% waste!**

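
The 624 KB total in the table is simply capacity × block size summed over the eight classes. The following sketch recomputes it. The `sketch_` names are illustrative, and the block-size rule (16B doubling per class) is an assumption read off the Block Size column rather than taken from the source file.

```c
#include <stddef.h>

#define SKETCH_NUM_CLASSES 8   /* assumption: classes 0-7, as in the table */

/* Capacity per class, mirroring the excerpt of tiny_effective_cap(). */
static int sketch_effective_cap(int class_idx) {
    switch (class_idx) {
    case 0: return 2048;
    case 1: return 1024;
    case 2: return 768;
    case 3: return 512;
    case 4: case 5: return 256;
    case 6: return 128;
    default: return 64;
    }
}

/* Block size per class: 16B for class 0, doubling with each class. */
static size_t sketch_block_size(int class_idx) {
    return (size_t)16 << class_idx;
}

/* Upper bound on the bytes of blocks one thread's magazines can hold. */
static size_t sketch_magazine_max_bytes(void) {
    size_t total = 0;
    for (int c = 0; c < SKETCH_NUM_CLASSES; c++) {
        total += (size_t)sketch_effective_cap(c) * sketch_block_size(c);
    }
    return total;
}
```

`sketch_magazine_max_bytes()` returns 638976 bytes, which is 624 KB.
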
**Estimated magazine residue in the 1M test:**
- Worst case: 2048 blocks (class 0) = 32 KB
- Typical: 200-500 blocks = 3-8 KB
- Best case (after a flush): 0 KB

**In other words, the magazine cache is actually not that large!**
- Only about 32 KB even at peak
- Just 0.4% of the 7.8 MB gap

---

## 💡 A Fundamental Rethink of the Phase 8 Strategy

### Previous assumption (before the investigation)

```
Breakdown of the 7.8 MB gap (assumed):
├─ Magazine cache:     4 MB (51%)  ← wrong!
├─ System overhead:    3 MB (38%)
└─ Other:            0.8 MB (11%)

Phase 8 plan (previous):
→ Save 3-4 MB with a two-level magazine
```

### Reality (after the investigation)

```
Breakdown of the 7.8 MB gap (measured):
├─ Reserved SuperSlabs:    4 MB (51%)   ← the biggest culprit!
├─ Magazine cache:      0.03 MB (0.4%)  ← practically negligible
├─ Fragmentation:          1 MB (13%)
├─ Program baseline:       2 MB (26%)
└─ Other:               0.77 MB (10%)

The real Phase 8 challenge:
→ How to reduce the 4 MB of reserved SuperSlabs
```

---

## 🚀 Phase 8 Strategy Options (Ultrathink)

### Option A: Reduce Reserved SuperSlabs ⭐⭐⭐⭐⭐

**Approach:**
```c
// Before (Phase 7.6):
// #define EMPTY_SUPERSLAB_RESERVE 2  // 4 MB overhead

// After (Phase 8), pick one:
#define EMPTY_SUPERSLAB_RESERVE 0     // 0 MB overhead
// or
// #define EMPTY_SUPERSLAB_RESERVE 1  // 2 MB overhead (compromise)
```

**Expected effect:**
- Reserve 2 → 0: **saves 4 MB** ✨
- Reserve 2 → 1: **saves 2 MB**
- Gap: 7.8 MB → 3.8-5.8 MB

**Concerns:**
1. Slower re-allocation
   - More frequent mmap syscalls (~5 μs/call)
   - Affects: alloc/free cycling workloads
   - Needs measurement

2. Possible increase in fragmentation
   - A new SuperSlab gets allocated while older ones sit unused
   - Avoidable depending on the implementation

**Implementation difficulty:** ⭐ (one-line change)

**Risk:** ⭐⭐⭐ (performance impact must be measured)

---

### Option B: Adaptive Reserve ⭐⭐⭐⭐

**Approach:**
```c
// Adjust the reserve dynamically by monitoring the workload.
// alloc_rate_high, churn_rate_high, and idle_time_ms are placeholders
// for signals the monitoring code would have to maintain.
int adaptive_reserve_count(int class_idx) {
    if (alloc_rate_high && churn_rate_high) {
        return 2;  // Hot workload: keep reserve
    }
    if (idle_time_ms > 10) {
        return 0;  // Idle: release reserve
    }
    return 1;  // Normal: minimal reserve
}
```

**Expected effect:**
- When idle: -4 MB
- When hot: performance maintained
- Best of both worlds!

**Implementation difficulty:** ⭐⭐⭐⭐ (100-200 lines)

**Risk:** ⭐⭐ (tuning required)

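
A compilable version of the same policy might look like the sketch below. The `ClassStats` fields and every threshold are hypothetical: the source does not say how allocation rate, churn, or idle time would be tracked, so they are passed in as plain numbers here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-class workload statistics; none of these fields
 * exist in the source. They stand in for whatever monitoring is added. */
typedef struct {
    uint32_t allocs_per_ms;        /* recent allocation rate */
    uint32_t empty_events_per_ms;  /* how often SuperSlabs go empty (churn) */
    uint32_t idle_ms;              /* time since the last allocation */
} ClassStats;

/* Placeholder thresholds that would need tuning. */
#define SKETCH_HOT_ALLOC_RATE 1000u
#define SKETCH_HOT_CHURN_RATE 1u
#define SKETCH_IDLE_MS        10u   /* the 10 ms figure from the outline above */

/* Returns how many empty SuperSlabs to keep in reserve for this class. */
static int adaptive_reserve_count_sketch(const ClassStats *s) {
    bool alloc_rate_high = s->allocs_per_ms >= SKETCH_HOT_ALLOC_RATE;
    bool churn_rate_high = s->empty_events_per_ms >= SKETCH_HOT_CHURN_RATE;

    if (alloc_rate_high && churn_rate_high) return 2;  /* hot: keep reserve */
    if (s->idle_ms > SKETCH_IDLE_MS)        return 0;  /* idle: release reserve */
    return 1;                                          /* normal: minimal reserve */
}
```

Keeping the policy a pure function of a stats snapshot makes the thresholds easy to test and tune separately from the code that gathers the statistics.
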
---

### Option C: Two-level Magazine (the previous plan) ⭐⭐⭐

**Re-evaluation:**
- **Assumed effect: -3-4 MB** ← wrong!
- **Actual effect: -0.03 MB** ← the magazine cache is small

**Conclusion:** a two-level magazine has **almost no effect on closing the gap**

**It does have other benefits, though:**
1. Better locality (hot cache of 256 entries → higher L1 hit rate)
2. Fewer spills (better performance)
3. Less TLS memory (624 KB → 200 KB)

**Priority in Phase 8:** ⭐⭐⭐ (worthwhile if the goal is performance)

---

### Option D: SuperSlab Alignment Optimization ⭐⭐

**Current state:**
- 2 MB alignment is required
- Fragmentation: 1 MB (13%)

**Possible improvements:**
1. Partial SuperSlab release (per slab, 64 KB)
2. Dynamic alignment (1 MB for small workloads)

**Expected effect:** -0.5-1 MB

**Implementation difficulty:** ⭐⭐⭐⭐⭐ (high risk, low payoff)

---

## 📋 Recommended Roadmap

### Phase 8.1: Reserved SuperSlabs Reduction Experiment (1-2 days)

**Step 1: Reduce the reserve from 2 to 1**
```c
#define EMPTY_SUPERSLAB_RESERVE 1  // 2 → 1
```

**What to measure:**
1. Battle test memory: expected 32.9 → 30.9 MB (-2 MB)
2. Performance impact on the re-allocation benchmark
3. Change in fragmentation

**Decision criteria:**
- Performance drop < 5%: OK → consider going from reserve 1 → 0
- Performance drop > 5%: NG → consider Adaptive Reserve

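
The decision rule can be written down directly. This is only a sketch: the type and function names are made up for illustration, and treating a drop of exactly 5% as NG is an assumption, since the criteria above only cover the < 5% and > 5% cases.

```c
typedef enum {
    NEXT_TRY_RESERVE_0,      /* drop < 5%: consider reserve 1 -> 0 */
    NEXT_ADAPTIVE_RESERVE    /* otherwise: consider Adaptive Reserve */
} NextStep;

/* Relative slowdown in percent; positive means the candidate is slower.
 * Both arguments are throughputs in the same unit (for example ops/s). */
static double perf_drop_percent(double baseline_ops, double candidate_ops) {
    return (baseline_ops - candidate_ops) / baseline_ops * 100.0;
}

/* Apply the Phase 8.1 criteria to a pair of benchmark results. */
static NextStep phase81_decide(double baseline_ops, double candidate_ops) {
    /* Assumption: exactly 5% counts as NG. */
    return perf_drop_percent(baseline_ops, candidate_ops) < 5.0
               ? NEXT_TRY_RESERVE_0
               : NEXT_ADAPTIVE_RESERVE;
}
```

For example, a benchmark that falls from 100 to 97 units of throughput is a 3% drop and passes; a fall to 90 is a 10% drop and points to Adaptive Reserve instead.
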
---

### Phase 8.2: Reserve 1 → 0, or Implement Adaptive Reserve (2-3 days)

**Option 2a: Reserve 0 (simple)**
```c
#define EMPTY_SUPERSLAB_RESERVE 0
```
- Expected: 32.9 → 28.9 MB (-4 MB)
- Gap: 7.8 → 3.8 MB (**closes 51% of the gap!**)

**Option 2b: Adaptive Reserve (advanced)**
- Implement idle detection
- Monitor the allocation rate
- Expected: -4 MB when idle, performance maintained when hot

---

### Phase 8.3: Two-level Magazine (performance optimization) (3-4 days)

**Change of purpose:** closing the gap → **improving performance**

**Design:**
```
Hot Magazine (256 cap, TLS)
├─ L1 cache optimization
├─ Spill frequency 1/8 → 1/1 (always fully utilized)
└─ Better locality

Cold Magazine (remove or keep minimal)
└─ Not needed (reserved SuperSlabs take its place)
```

**Expected effect:**
- Memory: essentially unchanged (apart from the TLS reduction below)
- Performance: +5-10% (better locality)
- TLS usage: -400 KB

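
As a rough illustration of the hot magazine, the sketch below implements a fixed-capacity LIFO stack of free blocks. The `HotMagazine` type and its functions are hypothetical and not taken from the source; a real allocator would keep one such structure per size class in thread-local storage and add refill and spill paths.

```c
#include <stddef.h>

#define SKETCH_HOT_CAP 256   /* hot magazine capacity from the design above */

/* Hypothetical hot magazine: a small LIFO stack of free blocks. */
typedef struct {
    void *items[SKETCH_HOT_CAP];
    int   top;
} HotMagazine;

/* Pop the most recently freed block; NULL means the caller must refill. */
static void *hot_pop(HotMagazine *m) {
    return m->top > 0 ? m->items[--m->top] : NULL;
}

/* Push a freed block; returns 0 when full so the caller can spill
 * (for example, back to the owning slab). */
static int hot_push(HotMagazine *m, void *block) {
    if (m->top >= SKETCH_HOT_CAP) return 0;
    m->items[m->top++] = block;
    return 1;
}

/* Bytes spent on the pointer array itself. */
static size_t hot_footprint_bytes(void) {
    return sizeof(((HotMagazine *)0)->items);
}
```

LIFO order means an allocation tends to receive the block that was freed most recently, which is the one most likely to still be in cache. With 8-byte pointers the array is 2 KB per class, small next to the 2048-entry class 0 magazine.
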
---

## 🎯 Phase 8 Final Goals

### Conservative Plan (Reserve 1)

```
After Phase 8.1 (reserve 1):
  HAKMEM: 30.9 MB (-2 MB)
  Gap:    5.8 MB (23%)

After Phase 8.3 (Two-level):
  HAKMEM: 30.5 MB (-0.4 MB, TLS reduction)
  Gap:    5.4 MB (22%)
```

### Aggressive Plan (Reserve 0)

```
After Phase 8.2 (Option 2a, reserve 0):
  HAKMEM: 28.9 MB (-4 MB) ✨
  Gap:    3.8 MB (15%)

After Phase 8.3 (Two-level):
  HAKMEM: 28.5 MB (-0.4 MB)
  Gap:    3.4 MB (14%) 🎯

🏆 +14% vs mimalloc = acceptable!
```

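
The projections in both plans follow from a simple linear model: each reserved SuperSlab costs 2 MB, the two-level magazine saves about 0.4 MB of TLS, and the percentage is the gap divided by mimalloc's 25.1 MB. The sketch below encodes that model. The constants come from this document; the function and macro names are illustrative, and the linearity itself is an assumption that the Phase 8.1 measurements would need to confirm.

```c
/* Figures taken from the measurements in this document (MB). */
#define SKETCH_HAKMEM_MB        32.9
#define SKETCH_MIMALLOC_MB      25.1
#define SKETCH_SUPERSLAB_MB      2.0   /* cost of one reserved SuperSlab */
#define SKETCH_CURRENT_RESERVE   2
#define SKETCH_TLS_SAVING_MB     0.4   /* expected from the two-level magazine */

/* Projected HAKMEM footprint for a given reserve count. */
static double projected_hakmem_mb(int reserve, int two_level) {
    double mb = SKETCH_HAKMEM_MB
              - (SKETCH_CURRENT_RESERVE - reserve) * SKETCH_SUPERSLAB_MB;
    if (two_level) mb -= SKETCH_TLS_SAVING_MB;
    return mb;
}

/* Gap to mimalloc in MB. */
static double gap_mb(double hakmem_mb) {
    return hakmem_mb - SKETCH_MIMALLOC_MB;
}

/* Gap as a percentage of mimalloc's footprint. */
static double gap_percent(double hakmem_mb) {
    return gap_mb(hakmem_mb) / SKETCH_MIMALLOC_MB * 100.0;
}
```

`projected_hakmem_mb(1, 1)` gives about 30.5 MB (a gap of roughly 21.5%), and `projected_hakmem_mb(0, 1)` gives about 28.5 MB (roughly 13.5%), which the plans round to 22% and 14%.
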
---

## ✅ Conclusion

### Findings

1. **51% of the 7.8 MB gap is reserved SuperSlabs (the Phase 7.6 design)**
2. **The magazine cache is roughly 1/100 of what was assumed (0.4% of the gap)**
3. **A two-level magazine yields almost no memory savings**

### Phase 8 Strategy

**Priority 1:** Reduce reserved SuperSlabs ⭐⭐⭐⭐⭐
- Largest effect: -4 MB (closes 51% of the gap)
- Lowest implementation risk: one-line change
- Needs measurement: performance impact

**Priority 2:** Two-level Magazine (repurposed) ⭐⭐⭐
- Effect: +5-10% performance (minimal memory effect)
- Rationale: locality optimization

**Priority 3:** Adaptive Reserve ⭐⭐⭐⭐
- Effect: the fallback if Priority 1 causes performance problems
- Best of both worlds

### Target

```
Conservative: Gap 5.4 MB (22%)
Aggressive:   Gap 3.4 MB (14%) 🎯

🚀 After Phase 8: +14-22% vs mimalloc
   = Production-ready!
```