
# Baseline Performance Measurement (2025-11-01)

**Purpose**: Measure current performance in detail before simplification.

---

## 📊 Measurement Results

### Tiny Hot Bench (64B)

```
Throughput: 172.87 - 190.43 M ops/sec (average: ~179 M/s)
Latency: 5.25 - 5.78 ns/op

Performance counters (3 runs average):
- Instructions:        2,001,155,032
- Cycles:                424,906,995
- Branches:              443,675,939
- Branch misses:             605,482 (0.14%)
- L1-dcache loads:       483,391,104
- L1-dcache misses:        1,336,694 (0.28%)
- IPC:                        4.71
```

**Calculation**:
- 2.001B instructions / 20M ops = **100.1 instructions/op**

---

### Random Mixed Bench (8-128B)

```
Throughput: 21.18 - 21.89 M ops/sec (average: ~21.6 M/s)
Latency: 45.68 - 47.20 ns/op

Performance counters (3 runs average):
- Instructions:        8,250,602,755
- Cycles:              3,576,062,935
- Branches:            2,117,913,982
- Branch misses:          29,586,718 (1.40%)
- L1-dcache loads:     2,416,946,713
- L1-dcache misses:        4,496,837 (0.19%)
- IPC:                        2.31
```

**Calculation**:
- 8.25B instructions / 20M ops = **412.5 instructions/op**
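The per-operation figures follow directly from the counter totals. The block below is a minimal sketch of that arithmetic, assuming 20M operations per run as in the calculations above. The struct and function names (`perf_sample`, `per_op`, `pct`, `report`) are illustrative and not part of hakmem.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Raw counter totals for one benchmark (3-run averages from the tables above).
 * Field and type names are illustrative, not hakmem identifiers. */
typedef struct {
    uint64_t ops;
    uint64_t instructions;
    uint64_t cycles;
    uint64_t branches;
    uint64_t branch_misses;
    uint64_t l1d_loads;
    uint64_t l1d_misses;
} perf_sample;

/* Per-operation cost: counter total divided by the operation count. */
static double per_op(uint64_t total, uint64_t ops) {
    assert(ops > 0);
    return (double)total / (double)ops;
}

/* Miss rate as a percentage of the matching event total. */
static double pct(uint64_t misses, uint64_t total) {
    assert(total > 0);
    return 100.0 * (double)misses / (double)total;
}

static const perf_sample TINY_HOT = {
    .ops = 20000000ULL,
    .instructions = 2001155032ULL, .cycles = 424906995ULL,
    .branches = 443675939ULL,      .branch_misses = 605482ULL,
    .l1d_loads = 483391104ULL,     .l1d_misses = 1336694ULL,
};

static const perf_sample RANDOM_MIXED = {
    .ops = 20000000ULL,
    .instructions = 8250602755ULL, .cycles = 3576062935ULL,
    .branches = 2117913982ULL,     .branch_misses = 29586718ULL,
    .l1d_loads = 2416946713ULL,    .l1d_misses = 4496837ULL,
};

/* Print the derived metrics for one benchmark on a single line. */
static void report(const char *name, const perf_sample *s) {
    printf("%-13s %6.1f insn/op %6.1f cyc/op IPC %.2f br-miss %.2f%% L1d-miss %.2f%%\n",
           name,
           per_op(s->instructions, s->ops),
           per_op(s->cycles, s->ops),
           (double)s->instructions / (double)s->cycles,
           pct(s->branch_misses, s->branches),
           pct(s->l1d_misses, s->l1d_loads));
}
```

Calling `report("Tiny Hot", &TINY_HOT)` and `report("Random Mixed", &RANDOM_MIXED)` prints instructions/op, cycles/op, IPC, and both miss rates, which makes it easy to rerun the same arithmetic on post-simplification counters.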

---

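## 🔍 Analysis

### ⚠️ Problems

#### 1. Instruction count is too high

**Tiny Hot: 100 instructions/op**
- mimalloc's fast path is estimated at 10-20 instructions/op
- **5-10x instruction overhead**

**Random Mixed: 412 instructions/op**
- Extremely expensive: about 179 cycles/op (3.576B cycles / 20M ops)
- Evidence that the checks across the 6-7 layers are accumulating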

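#### 2. Branch miss rate

**Tiny Hot: 0.14%** - good ✅
- A single size is allocated, so branch prediction works well

**Random Mixed: 1.40%** - somewhat high ⚠️
- Sizes are random, so branches are harder to predict
- The conditional branches across the 6-7 layers contribute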

#### 3. L1 cache miss rate

**Tiny Hot: 0.28%** - good ✅
**Random Mixed: 0.19%** - good ✅

→ Cache misses are not the problem. **Instruction count is.**

---

## 🎯 Targets (recommended by ChatGPT Pro)

### Targets after simplification

**Tiny Hot**:
- Current: 100 instructions/op, 179 M ops/s
- Target: **20-30 instructions/op** (3-5x reduction), **240-250 M ops/s** (+35%)

**Random Mixed**:
- Current: 412 instructions/op, 21.6 M ops/s
- Target: **100-150 instructions/op** (3-4x reduction), **23-24 M ops/s** (+10%)

---

## 📋 Current Code Structure (the problem)

### Layer structure of hak_tiny_alloc (6-7 layers!)

```c
void* hak_tiny_alloc(size_t size) {
    // Layer 0: Size to class
    int class_idx = hak_tiny_size_to_class(size);

    // Layer 1: HAKMEM_TINY_BENCH_FASTPATH (conditional)
    #ifdef HAKMEM_TINY_BENCH_FASTPATH
        // Benchmark-only SLL
        if (g_tls_sll_head[class_idx]) { ... }
        if (g_tls_mags[class_idx].top > 0) { ... }
    #endif

    // Layer 2: TinyHotMag (class_idx <= 2, conditional)
    if (g_hotmag_enable && class_idx <= 2 && ...) {
        hotmag_pop(class_idx);
    }

    // Layer 3: g_hot_alloc_fn (dedicated functions for classes 0-3)
    if (g_hot_alloc_fn[class_idx] != NULL) {
        switch (class_idx) {
            case 0: tiny_hot_pop_class0(); break;
            case 1: tiny_hot_pop_class1(); break;
            case 2: tiny_hot_pop_class2(); break;
            case 3: tiny_hot_pop_class3(); break;
        }
    }

    // Layer 4: tiny_fast_pop (Fast Head SLL)
    void* fast = tiny_fast_pop(class_idx);

    // Layer 5: hak_tiny_alloc_slow (Magazine, Slab, etc.)
    return hak_tiny_alloc_slow(size, class_idx);
}
```

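**Problems**:
1. **Redundant layers**: Layers 1-4 all fetch from a TLS cache (duplicated work)
2. **Too many conditional branches**: an `if (...)` check at every layer
3. **Function call overhead**: a function call at every layer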

---

## 🚀 Simplification Plan (recommended by ChatGPT Pro)

### Goal: 6-7 layers → 3 layers

```c
void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0) return NULL;

    // === Layer 1: TLS Bump (hot classes 0-2 only) ===
    // Ultra fast: bcur += size; if (bcur <= bend) return old;
    if (class_idx <= 2) {
        void* p = tiny_bump_alloc(class_idx);
        if (likely(p)) return p;
    }

    // === Layer 2: TLS Small Magazine (128 items) ===
    // Fast: magazine pop (index only)
    void* p = small_mag_pop(class_idx);
    if (likely(p)) return p;

    // === Layer 3: Slow path (Slab/refill) ===
    return tiny_alloc_slow(class_idx);
}
```

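**Layers to remove**:
- ✂️ HAKMEM_TINY_BENCH_FASTPATH (benchmark-only, not needed in production)
- ✂️ TinyHotMag (redundant)
- ✂️ g_hot_alloc_fn (redundant)
- ✂️ tiny_fast_pop (redundant)

**Expected effect**:
- Instruction count: 100 → 20-30 instructions/op (-70-80%)
- Branch count: large reduction
- Throughput: 179 → 240-250 M ops/s (+35%)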
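The Layer 1 comment (`bcur += size; if (bcur <= bend) return old;`) can be fleshed out as follows. This is only a sketch under stated assumptions: the class sizes, the `tiny_bump` struct, the `g_tls_bump` array, and the `tiny_bump_refill` helper are hypothetical and not taken from the hakmem source. The check compares the remaining bytes against the block size rather than forming `bcur + size` first, so the cursor never moves past the end of the region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TINY_BUMP_CLASSES 3 /* hot classes 0-2, as in the proposal */

/* Hypothetical block sizes per class; the real hakmem size table may differ. */
static const size_t k_bump_class_size[TINY_BUMP_CLASSES] = {8, 16, 32};

/* Per-thread bump window: two words per class (the "2-register path"). */
typedef struct {
    uintptr_t bcur; /* address of the next free byte */
    uintptr_t bend; /* one past the last usable byte */
} tiny_bump;

static _Thread_local tiny_bump g_tls_bump[TINY_BUMP_CLASSES];

/* Refill side (assumed helper): hand a fresh region to one class. */
static void tiny_bump_refill(int class_idx, void *base, size_t bytes) {
    assert(class_idx >= 0 && class_idx < TINY_BUMP_CLASSES);
    g_tls_bump[class_idx].bcur = (uintptr_t)base;
    g_tls_bump[class_idx].bend = (uintptr_t)base + bytes;
}

/* Fast path: one compare, one add. Returns NULL when the window is
 * exhausted (or was never refilled) so the caller falls through to Layer 2. */
static void *tiny_bump_alloc(int class_idx) {
    tiny_bump *b = &g_tls_bump[class_idx];
    size_t size = k_bump_class_size[class_idx];
    if (b->bend - b->bcur < size) return NULL;
    uintptr_t old = b->bcur;
    b->bcur = old + size;
    return (void *)old;
}
```

A zero-initialized window has `bend - bcur == 0`, so the first call on a thread returns NULL without a separate "initialized" check, which keeps the hot path to a single branch.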

---

## Next Actions

1. ✅ Baseline measurement complete
2. 🔄 Layer 1: implement TLS Bump (2-register path using bcur/bend)
3. 🔄 Layer 2: implement Small Magazine 128
4. 🔄 Remove unneeded layers
5. 🔄 Re-measure and compare
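For the Small Magazine step, an index-only pop over a fixed 128-entry array might look like the sketch below. The struct layout, `TINY_NUM_CLASSES`, the `g_tls_small_mag` array, and the `small_mag_push` helper are assumptions made for illustration. Only the name `small_mag_pop` and the 128-item capacity come from the plan above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMALL_MAG_CAP 128  /* capacity from the plan: 128 items */
#define TINY_NUM_CLASSES 8 /* hypothetical number of tiny size classes */

/* Per-thread LIFO stack of free blocks for one size class. */
typedef struct {
    uint32_t top;               /* number of cached pointers */
    void *items[SMALL_MAG_CAP]; /* cached free blocks */
} small_mag;

static _Thread_local small_mag g_tls_small_mag[TINY_NUM_CLASSES];

/* Pop touches only the index and one array slot.
 * Returns NULL when empty so the caller can take the slow path. */
static void *small_mag_pop(int class_idx) {
    small_mag *m = &g_tls_small_mag[class_idx];
    if (m->top == 0) return NULL;
    return m->items[--m->top];
}

/* Push for the free/refill side (assumed helper).
 * Returns 1 on success, 0 when the magazine is full. */
static int small_mag_push(int class_idx, void *p) {
    small_mag *m = &g_tls_small_mag[class_idx];
    if (m->top == SMALL_MAG_CAP) return 0;
    m->items[m->top++] = p;
    return 1;
}
```

LIFO order means the most recently freed block is handed out first, which tends to keep it warm in L1. Whether that matters here should be confirmed in the re-measurement step.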

---

**Reference**: ChatGPT Pro UltraThink Response (`docs/analysis/CHATGPT_PRO_ULTRATHINK_RESPONSE.md`)