# Phase 6-1: Ultra-Simple Fast Path - Comprehensive Evaluation Report

**Measured on**: 2025-11-02
**Evaluator**: Claude Code
**Goal**: Decide whether Phase 6-1 should become the baseline

---

## 📊 Results Summary

### 1. LIFO Performance (64B single size)

| Allocator | Throughput | Phase 6-1 speedup over allocator |
|-----------|------------|----------------------------------|
| **Phase 6-1** | **476 M ops/sec** | **100%** (reference) |
| System glibc | 156-174 M ops/sec | +173-205% |

### 2. Mixed Workload (8-128B mixed sizes)

| Allocator | Mixed LIFO | Phase 6-1 speedup over allocator |
|-----------|------------|----------------------------------|
| **Phase 6-1** | **113.25 M ops/sec** | **100%** (reference) ✅ |
| System malloc | 76.06 M ops/sec | **+49%** 🏆 |
| mimalloc | 24.16 M ops/sec | **+369%** 🚀 |
| Existing HAKX | 16.60 M ops/sec | **+582%** 🚀 |

**Phase 6-1 Pattern Performance:**
- Mixed LIFO: 113.25 M ops/sec
- Mixed FIFO: 109.27 M ops/sec
- Mixed Random: 92.17 M ops/sec
- Interleaved: 110.73 M ops/sec

### 3. CPU/Memory Efficiency

| Metric | Phase 6-1 | System | Difference |
|--------|-----------|--------|------------|
| **Peak RSS** | 1536 KB | 1408 KB | +9% (roughly equal) ✅ |
| **CPU Time** | 6.63 sec | 2.62 sec | +153% (2.5x slower) 🔴 |
| **CPU Efficiency** | 30.2 M ops per CPU-sec | 76.3 M ops per CPU-sec | **-60% (worse)** ⚠️ |

---

## ✅ Strengths of Phase 6-1

### 1. **Dominant Mixed Workload performance**
- **4.7x faster** than mimalloc
- **6.8x faster** than the existing HAKX
- **1.5x faster** than System malloc

This is an unexpectedly strong result: it fully eliminates the existing HAKX weakness (Mixed -31%).

### 2. **Simple design**
- Fast path: only 3-4 instructions
- Backend: a simple implementation of about 200 lines
- No magazine layers
- 100% hit rate (all patterns)

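
To make the "3-4 instruction" fast path concrete, here is a minimal sketch of a per-class LIFO free list, assuming each free block stores the pointer to the next free block in its first bytes. The names `tiny_freelist_t`, `tiny_pop`, and `tiny_push` are hypothetical and are not taken from the Phase 6-1 source, whose actual fast path may differ.

```c
#include <stddef.h>

/* Hypothetical per-size-class free list: a singly linked LIFO stack whose
 * "next" pointer lives inside each free block (so every block must be at
 * least sizeof(void*) bytes and pointer-aligned). */
typedef struct {
    void *head;
} tiny_freelist_t;

/* Allocation fast path: load head, test for NULL, load next, store head. */
static inline void *tiny_pop(tiny_freelist_t *fl) {
    void *block = fl->head;
    if (block != NULL) {
        fl->head = *(void **)block;   /* next pointer stored in the block */
    }
    return block;                     /* NULL: caller takes the slow path (refill) */
}

/* Free fast path: link the block in front of the current head. */
static inline void tiny_push(tiny_freelist_t *fl, void *block) {
    *(void **)block = fl->head;
    fl->head = block;
}
```

Because the most recently freed block is handed out first, a LIFO benchmark pattern keeps hitting memory that is still warm in cache, which would be consistent with the single-size LIFO result being the fastest case.
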
### 3. **Memory efficiency**
- Peak RSS: 1536 KB (roughly equal to System)
- Memory overhead: only +9%

---

## ⚠️ Weaknesses of Phase 6-1

### 1. **Poor CPU efficiency** (the biggest problem!)

```
CPU Efficiency:
- System malloc: 76.3 M ops per CPU-second
- Phase 6-1:     30.2 M ops per CPU-second
→ Phase 6-1 consumes about 2.5x more CPU
```

**Suspected causes:**
1. Is the if-chain for size-to-class conversion too heavy?
2. Overhead of free list operations?
3. Is chunk allocation happening too often?

**Comparison with another AI's report:**
- mimalloc: CPU ~17%
- Existing HAKX: CPU ~49% (2.9x more than mimalloc)
- **Phase 6-1: probably on par with HAKX or worse**

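
The CPU-efficiency figures above are operation counts divided by consumed CPU time. The following is a minimal sketch of that arithmetic, assuming CPU time is taken as user plus system time from POSIX `getrusage`; the report does not state how the time was actually measured, and the helper names `cpu_seconds_used` and `mops_per_cpu_sec` are hypothetical.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* User + system CPU time consumed by this process so far, in seconds.
 * Returns -1.0 if the measurement fails. */
static double cpu_seconds_used(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        return -1.0;
    }
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6
         + (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec / 1e6;
}

/* CPU efficiency in millions of operations per CPU-second. */
static double mops_per_cpu_sec(double ops, double cpu_sec) {
    return ops / cpu_sec / 1e6;
}
```

With the reported numbers, `mops_per_cpu_sec(200e6, 6.63)` gives about 30.2 and `mops_per_cpu_sec(200e6, 2.62)` gives about 76.3, a ratio of roughly 2.5.
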
### 2. **Memory-leak-like behavior**

```c
// No munmap! Freed memory is never returned to the OS
void* allocate_chunk(void) {
    return mmap(NULL, CHUNK_SIZE, ...);
}
```

**Problems:**
- RSS keeps growing during long runs
- Not usable in production

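
For contrast, here is a minimal sketch of a chunk allocator that pairs `mmap` with `munmap`, so that an unused chunk can be handed back to the OS. The 64 KiB `CHUNK_SIZE` and the names `chunk_alloc` and `chunk_release` are assumptions for illustration only; the report elides the real `mmap` arguments and does not state the chunk size.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE            /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

#define CHUNK_SIZE (64 * 1024)     /* assumed value, not from the report */

/* Map one anonymous, private, read-write chunk; returns NULL on failure. */
static void *chunk_alloc(void) {
    void *p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Return a chunk to the OS once no block inside it is in use.
 * Returns 0 on success and -1 on error, the same convention as munmap. */
static int chunk_release(void *chunk) {
    return munmap(chunk, CHUNK_SIZE);
}
```

Deciding when a chunk is entirely free (for example with a per-chunk count of live blocks) is the harder part and is outside this sketch.
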
### 3. **No learning layer**

- Fixed refill count (64 blocks)
- No hotness tracking
- No dynamic capacity adjustment

The strengths of the existing HAKMEM (ACE, Learner thread) are lost.

### 4. **Integration issues**

- Not integrated with the SuperSlab system
- No coordination with L25 (32KB-2MB)
- Cannot take advantage of the +171% Mid-Large strength

---

## 🎯 Should it become the baseline?

### ❌ **NO - not yet**

**Reasons:**

1. **CPU efficiency is too poor**
   - Consumes 2.5x more CPU (vs System)
   - Possibly worse than the existing HAKX
   - Not usable in production

2. **Memory leak problem**
   - No munmap → RSS keeps growing
   - Becomes a problem in long-running processes

3. **No learning layer**
   - Cannot adjust dynamically to load
   - The original Phase 6 goal ("Smart Back") is not implemented

4. **No integration**
   - No coordination with Mid-Large (+171%)
   - Overall performance is not optimized

---

## 💡 Next Actions

### Option A: Improve Phase 6-1 CPU efficiency, then re-evaluate (recommended)

**Proposed improvements:**

1. **Size-to-class optimization**
   ```c
   // if-chain → lookup table
   static const uint8_t size_to_class_lut[129] = {...};
   ```

2. **Implement memory release**
   ```c
   // Periodic munmap of unused chunks
   void hak_tiny_simple_gc(void);
   ```

3. **Profile to identify the bottleneck**
   ```bash
   perf record -g ./bench_mixed_workload
   perf report
   ```

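
The lookup-table idea in item 1 can be sketched as follows, assuming five power-of-two size classes (8, 16, 32, 64, and 128 bytes). The class layout and the names `size_to_class_init` and `size_to_class` are assumptions, since the report elides the actual table contents. Unlike the `static const` table proposed above, this version fills the table once at startup to keep the sketch short; after that, a lookup is one bounds check plus one table load.

```c
#include <stddef.h>
#include <stdint.h>

#define TINY_MAX_SIZE 128   /* largest size served by the table */

/* Assumed classes: 0 -> 8 B, 1 -> 16 B, 2 -> 32 B, 3 -> 64 B, 4 -> 128 B. */
static uint8_t size_to_class_lut[TINY_MAX_SIZE + 1];

/* Fill the table once at startup: entry i holds the smallest class
 * whose block size is at least i bytes. */
static void size_to_class_init(void) {
    size_t class_size = 8;
    uint8_t cls = 0;
    for (size_t i = 0; i <= TINY_MAX_SIZE; i++) {
        if (i > class_size) {   /* size i no longer fits: move to the next class */
            class_size *= 2;
            cls++;
        }
        size_to_class_lut[i] = cls;
    }
}

/* Returns the class index, or -1 for sizes the tiny allocator does not handle. */
static inline int size_to_class(size_t size) {
    return (size <= TINY_MAX_SIZE) ? (int)size_to_class_lut[size] : -1;
}
```
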
**Expected effects:**
- CPU efficiency improved by 30%, toward parity with System (full parity needs about 2.5x: 30.2 → 76.3 M ops per CPU-second)
- Memory leak resolved
- Production ready

### Option B: Design Phase 6-2 (Learning Layer) first

The Phase 6-1 fast path is good, but the baseline decision would wait until Smart Back is implemented.

### Option C: Hybrid approach

- Tiny: Phase 6-1 (strong on Mixed)
- Mid: existing HAKX (+171%)
- Large: L25/SuperSlab

Because of the CPU-efficiency problem, this would be a partial adoption only.

---

## 📝 Conclusion

**Phase 6-1 is overwhelmingly fast on mixed workloads** (1.5x System, 4.7x mimalloc)

**But its CPU efficiency is too poor** (it consumes 2.5x more CPU than System)

→ **It cannot be the baseline yet**

**Next steps:**
1. Improve CPU efficiency (Option A)
2. Fix the memory leak
3. Re-measure → decide on the baseline

---

## 📈 Measurement Data

### Benchmark Files

- `benchmarks/src/tiny/phase6/bench_tiny_simple.c` - LIFO single size
- `benchmarks/src/tiny/phase6/bench_mixed_workload.c` - Mixed 8-128B
- `benchmarks/src/tiny/phase6/bench_mixed_system.c` - System comparison
- `benchmarks/src/tiny/phase6/test_tiny_simple.c` - Functional test

### Results

```
=== LIFO Performance (64B) ===
Phase 6-1: 476.09 M ops/sec, 4.17 cycles/op
System:    156-174 M ops/sec

=== Mixed Workload (8-128B) ===
Phase 6-1:
  Mixed LIFO:   113.25 M ops/sec
  Mixed FIFO:   109.27 M ops/sec
  Mixed Random: 92.17 M ops/sec
  Interleaved:  110.73 M ops/sec
  Hit Rate:     100.00% (all classes)

System malloc:
  Mixed LIFO:   76.06 M ops/sec

=== CPU/Memory Efficiency ===
Phase 6-1:
  Peak RSS:       1536 KB
  CPU Time:       6.63 sec (200M ops)
  CPU Efficiency: 30.2 M ops/sec

System malloc:
  Peak RSS:       1408 KB
  CPU Time:       2.62 sec (200M ops)
  CPU Efficiency: 76.3 M ops/sec
```