# Baseline Performance Measurement (2025-11-01)

**Purpose**: Measure current performance in detail before the simplification work begins.

---

## 📊 Measurement Results

### Tiny Hot Bench (64B)

```
Throughput: 172.87 - 190.43 M ops/sec (average: ~179 M/s)
Latency: 5.25 - 5.78 ns/op

Performance counters (3 runs average):
- Instructions: 2,001,155,032
- Cycles: 424,906,995
- Branches: 443,675,939
- Branch misses: 605,482 (0.14%)
- L1-dcache loads: 483,391,104
- L1-dcache misses: 1,336,694 (0.28%)
- IPC: 4.71
```

**Calculation**:
- 2.001B instructions / 20M ops = **100.1 instructions/op**
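The per-op figures and percentages can be re-derived from the raw counters. A minimal sketch, assuming the 20M op count used in the calculation; the struct and function names (`perf_counters`, `perf_derived`, `perf_derive`) are hypothetical helpers, not part of the allocator:

```c
#include <assert.h>
#include <stdint.h>

/* Raw counter totals for one benchmark (hypothetical helper type). */
typedef struct {
    uint64_t ops;
    uint64_t instructions;
    uint64_t cycles;
    uint64_t branches;
    uint64_t branch_misses;
    uint64_t l1d_loads;
    uint64_t l1d_misses;
} perf_counters;

/* Ratios derived from the raw totals. */
typedef struct {
    double instr_per_op;
    double cycles_per_op;
    double ipc;              /* instructions per cycle */
    double branch_miss_pct;  /* branch misses as % of branches */
    double l1d_miss_pct;     /* L1-dcache misses as % of loads */
} perf_derived;

perf_derived perf_derive(const perf_counters *c) {
    assert(c->ops > 0 && c->cycles > 0 && c->branches > 0 && c->l1d_loads > 0);
    perf_derived d;
    d.instr_per_op    = (double)c->instructions / (double)c->ops;
    d.cycles_per_op   = (double)c->cycles / (double)c->ops;
    d.ipc             = (double)c->instructions / (double)c->cycles;
    d.branch_miss_pct = 100.0 * (double)c->branch_misses / (double)c->branches;
    d.l1d_miss_pct    = 100.0 * (double)c->l1d_misses / (double)c->l1d_loads;
    return d;
}

/* Tiny Hot counters from the listing above; ops = 20M is taken from the calculation. */
const perf_counters tiny_hot_counters = {
    .ops           = 20000000ULL,
    .instructions  = 2001155032ULL,
    .cycles        = 424906995ULL,
    .branches      = 443675939ULL,
    .branch_misses = 605482ULL,
    .l1d_loads     = 483391104ULL,
    .l1d_misses    = 1336694ULL,
};
```

Calling `perf_derive(&tiny_hot_counters)` also yields cycles/op (about 21 for Tiny Hot), which the listing does not report directly.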

---

### Random Mixed Bench (8-128B)

```
Throughput: 21.18 - 21.89 M ops/sec (average: ~21.6 M/s)
Latency: 45.68 - 47.20 ns/op

Performance counters (3 runs average):
- Instructions: 8,250,602,755
- Cycles: 3,576,062,935
- Branches: 2,117,913,982
- Branch misses: 29,586,718 (1.40%)
- L1-dcache loads: 2,416,946,713
- L1-dcache misses: 4,496,837 (0.19%)
- IPC: 2.31
```

**Calculation**:
- 8.25B instructions / 20M ops = **412.5 instructions/op**
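The latency ranges are the reciprocal of the throughput ranges. A small sketch of that conversion; `mops_to_ns_per_op` is a hypothetical helper name:

```c
#include <assert.h>

/* Convert throughput in millions of ops/sec to mean latency in ns/op.
 * 1 s = 1e9 ns and 1 M ops = 1e6 ops, so ns/op = 1e9 / (mops * 1e6) = 1000 / mops. */
double mops_to_ns_per_op(double mops_per_sec) {
    assert(mops_per_sec > 0.0);
    return 1000.0 / mops_per_sec;
}
```

For example, `mops_to_ns_per_op(21.89)` gives about 45.68 ns/op, matching the lower end of the Random Mixed latency range.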

---

## 🔍 Analysis

### ⚠️ Problems

#### 1. Instruction count is too high

**Tiny Hot: 100 instructions/op**
- mimalloc's fast path is estimated at 10-20 instructions/op
- **5-10x instruction overhead**

**Random Mixed: 412 instructions/op**
- Extremely heavy: 3.576B cycles / 20M ops is roughly 179 cycles/op
- Evidence that the checks in the 6-7 layers are accumulating

#### 2. Branch miss rate

**Tiny Hot: 0.14%** - good ✅
- Single size, so branch prediction works well

**Random Mixed: 1.40%** - somewhat high ⚠️
- Sizes are random, so branch prediction misses more often
- The conditional branches across the 6-7 layers contribute

#### 3. L1 cache miss rate

**Tiny Hot: 0.28%** - good ✅

**Random Mixed: 0.19%** - good ✅

→ Cache misses are not the problem. **The instruction count is.**

---

## 🎯 Targets (recommended by ChatGPT Pro)

### Targets after simplification

**Tiny Hot**:
- Current: 100 instructions/op, 179 M ops/s
- Target: **20-30 instructions/op** (3-5x reduction), **240-250 M ops/s** (about +35%)

**Random Mixed**:
- Current: 412 instructions/op, 21.6 M ops/s
- Target: **100-150 instructions/op** (roughly 3-4x reduction), **23-24 M ops/s** (about +10%)
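The reduction factors and throughput gains quoted in the targets are simple ratios. A sketch for re-checking them; both function names are hypothetical:

```c
#include <assert.h>

/* How many times smaller the target is than the current value
 * (e.g. 100 -> 25 instructions/op is a 4x reduction). */
double reduction_factor(double current, double target) {
    assert(current > 0.0 && target > 0.0);
    return current / target;
}

/* Relative change in percent when moving from current to target
 * (e.g. 100 -> 135 M ops/s is +35%). */
double gain_percent(double current, double target) {
    assert(current > 0.0);
    return (target - current) / current * 100.0;
}
```

With the numbers above, 179 → 240-250 M ops/s works out to about +34% to +40%, and 21.6 → 23-24 M ops/s to about +6% to +11%, so the quoted +35% and +10% are approximate.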

---

## 📋 Current Code Structure (the problem)

### Layer structure of hak_tiny_alloc (6-7 layers!)

```c
// Structural outline only: what each layer does on a hit is elided.
void* hak_tiny_alloc(size_t size) {
    // Layer 0: Size to class
    int class_idx = hak_tiny_size_to_class(size);

    // Layer 1: HAKMEM_TINY_BENCH_FASTPATH (conditional)
#ifdef HAKMEM_TINY_BENCH_FASTPATH
    // Benchmark-only SLL
    if (g_tls_sll_head[class_idx]) { ... }
    if (g_tls_mags[class_idx].top > 0) { ... }
#endif

    // Layer 2: TinyHotMag (class_idx <= 2, conditional)
    if (g_hotmag_enable && class_idx <= 2 && ...) {
        hotmag_pop(class_idx);
    }

    // Layer 3: g_hot_alloc_fn (dedicated functions for classes 0-3)
    if (g_hot_alloc_fn[class_idx] != NULL) {
        switch (class_idx) {
            case 0: tiny_hot_pop_class0(); break;
            case 1: tiny_hot_pop_class1(); break;
            case 2: tiny_hot_pop_class2(); break;
            case 3: tiny_hot_pop_class3(); break;
        }
    }

    // Layer 4: tiny_fast_pop (Fast Head SLL)
    void* fast = tiny_fast_pop(class_idx);

    // Layer 5: hak_tiny_alloc_slow (Magazine, Slab, etc.)
    return hak_tiny_alloc_slow(size, class_idx);
}
```

**Problems**:
1. **Redundant layers**: Layers 1-4 all fetch from a TLS cache (duplicated work!)
2. **Too many conditional branches**: every layer performs an `if (...)` check
3. **Function call overhead**: every layer makes a function call

---

## 🚀 Simplification Plan (recommended by ChatGPT Pro)

### Goal: 6-7 layers → 3 layers

```c
void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0) return NULL;

    // === Layer 1: TLS Bump (hot classes 0-2 only) ===
    // Ultra fast: bcur += size; if (bcur <= bend) return old;
    if (class_idx <= 2) {
        void* p = tiny_bump_alloc(class_idx);
        if (likely(p)) return p;
    }

    // === Layer 2: TLS Small Magazine (128 items) ===
    // Fast: magazine pop (index only)
    void* p = small_mag_pop(class_idx);
    if (likely(p)) return p;

    // === Layer 3: Slow path (Slab/refill) ===
    return tiny_alloc_slow(class_idx);
}
```
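To make Layer 1 concrete, here is a minimal sketch of what a `bcur`/`bend` bump path could look like. It is an illustration under assumptions, not the actual implementation: the class sizes, the `g_tls_bump` array, and `tiny_bump_refill` are hypothetical, and the sketch checks the remaining bytes before advancing rather than advancing first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TINY_BUMP_CLASSES 3  /* hot classes 0-2, as in the plan above */

/* Hypothetical block sizes; the real size classes are not given in this note. */
static const size_t k_bump_class_size[TINY_BUMP_CLASSES] = { 16, 32, 64 };

/* Per-thread bump window: [bcur, bend) is the unused part of the current chunk. */
typedef struct {
    uint8_t *bcur;
    uint8_t *bend;
} tiny_bump_t;

static _Thread_local tiny_bump_t g_tls_bump[TINY_BUMP_CLASSES];

/* Hypothetical refill hook: point the window at a fresh chunk.
 * In a real allocator the slow path would supply the chunk. */
void tiny_bump_refill(int class_idx, void *chunk, size_t chunk_bytes) {
    assert(class_idx >= 0 && class_idx < TINY_BUMP_CLASSES);
    g_tls_bump[class_idx].bcur = (uint8_t *)chunk;
    g_tls_bump[class_idx].bend = (uint8_t *)chunk + chunk_bytes;
}

/* Return the next block of the class, or NULL when the window is empty
 * or has less than one block left (caller falls through to the next layer). */
void *tiny_bump_alloc(int class_idx) {
    assert(class_idx >= 0 && class_idx < TINY_BUMP_CLASSES);
    tiny_bump_t *b = &g_tls_bump[class_idx];
    size_t size = k_bump_class_size[class_idx];
    if (b->bcur == NULL || (size_t)(b->bend - b->bcur) < size) return NULL;
    void *old = b->bcur;
    b->bcur += size;  /* bump: one add on the hit path */
    return old;
}
```

The hit path is one load pair, one compare, and one add, which is where the low instruction count of a bump allocator comes from. Freed blocks cannot be returned to a bump window, so they would have to go to another layer such as the magazine.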
**Layers to remove**:
- ✂️ HAKMEM_TINY_BENCH_FASTPATH (benchmark-only, not needed in production)
- ✂️ TinyHotMag (redundant)
- ✂️ g_hot_alloc_fn (redundant)
- ✂️ tiny_fast_pop (redundant)

**Expected effect**:
- Instruction count (Tiny Hot): 100 → 20-30 instructions/op (-70-80%)
- Branch count: large reduction
- Throughput (Tiny Hot): 179 → 240-250 M ops/s (about +35%)

---

## Next Actions

1. ✅ Baseline measurement complete
2. 🔄 Layer 1: implement TLS Bump (2-register path using bcur/bend)
3. 🔄 Layer 2: implement Small Magazine 128
4. 🔄 Remove the unneeded layers
5. 🔄 Re-measure and compare
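For step 3, a Small Magazine can be sketched as a fixed-size per-thread stack of pointers, so a pop is an index decrement and one load. This is an assumption-laden illustration: the class count, `small_mag_t`, `g_tls_small_mag`, and `small_mag_push` are hypothetical; only the name `small_mag_pop` and the capacity of 128 come from this note.

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_MAG_CAP     128  /* "128 items" from the plan */
#define SMALL_MAG_CLASSES 8    /* hypothetical number of size classes */

/* Per-thread, per-class LIFO of free blocks. */
typedef struct {
    void *items[SMALL_MAG_CAP];
    int   top;  /* number of cached pointers; items[top - 1] is popped next */
} small_mag_t;

static _Thread_local small_mag_t g_tls_small_mag[SMALL_MAG_CLASSES];

/* Pop one cached block, or NULL if the magazine is empty. */
void *small_mag_pop(int class_idx) {
    assert(class_idx >= 0 && class_idx < SMALL_MAG_CLASSES);
    small_mag_t *m = &g_tls_small_mag[class_idx];
    if (m->top == 0) return NULL;
    return m->items[--m->top];
}

/* Cache a freed block. Returns 1 on success, 0 if the magazine is full
 * (the caller would then hand the block to the slow path). */
int small_mag_push(int class_idx, void *p) {
    assert(class_idx >= 0 && class_idx < SMALL_MAG_CLASSES);
    small_mag_t *m = &g_tls_small_mag[class_idx];
    if (m->top == SMALL_MAG_CAP) return 0;
    m->items[m->top++] = p;
    return 1;
}
```

LIFO order means the most recently freed block is reused first, which tends to keep it warm in cache.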

---

**Reference**: ChatGPT Pro UltraThink Response (`docs/analysis/CHATGPT_PRO_ULTRATHINK_RESPONSE.md`)