hakmem/docs/analysis/PERF_ANALYSIS_2025_11_05.md

# HAKMEM Larson Benchmark Perf Analysis - 2025-11-05

## 🎯 測定結果

### スループット比較 (threads=4)

| Allocator | Throughput | vs System |
|-----------|-----------|-----------|
| **HAKMEM** | **3.62M ops/s** | **21.6%** |
| System malloc | 16.76M ops/s | 100% |
| mimalloc | 16.76M ops/s | 100% |

### スループット比較 (threads=1)

| Allocator | Throughput | vs System |
|-----------|-----------|-----------|
| **HAKMEM** | **2.59M ops/s** | **18.1%** |
| System malloc | 14.31M ops/s | 100% |

---

## 🔥 ボトルネック分析 (perf record -F 999)

### HAKMEM CPU Time トップ関数

```
28.51%  superslab_refill          💀💀💀 圧倒的ボトルネック
 2.58%  exercise_heap             (ベンチマーク本体)
 2.21%  hak_free_at
 1.87%  memset
 1.18%  sll_refill_batch_from_ss
 0.88%  malloc
```

**問題：アロケータ (superslab_refill) がベンチマーク本体より遅い！**

### System malloc CPU Time トップ関数

```
20.70%  exercise_heap             ✅ ベンチマーク本体が一番！
18.08%  _int_free
10.59%  cfree@GLIBC_2.2.5
```

**正常：ベンチマーク本体が CPU time を最も使う**

---

## 🐛 Root Cause: Registry 線形スキャン

### Hot Instructions (perf annotate superslab_refill)

```
32.36%  cmp    0x10(%rsp),%r11d    ← ループ比較
16.78%  inc    %r13d               ← カウンタ++
16.29%  add    $0x18,%rbx          ← ポインタ進める
10.89%  test   %r15,%r15           ← NULL チェック
10.83%  cmp    $0x3ffff,%r13d      ← 上限チェック (0x3ffff = 262143!)
10.50%  mov    (%rbx),%r15         ← 間接ロード
```

**合計 97.65% の CPU time がループに集中！**

### 該当コード

**File**: `core/hakmem_tiny_free.inc:917-943`

```c
const int scan_max = tiny_reg_scan_max();  // デフォルト 256
for (int i = 0; i < SUPER_REG_SIZE && scanned < scan_max; i++) {
    //                  ^^^^^^^^^^^^^ 262,144 エントリ！
    SuperRegEntry* e = &g_super_reg[i];
    uintptr_t base = atomic_load_explicit((_Atomic uintptr_t*)&e->base, memory_order_acquire);
    if (base == 0) continue;
    SuperSlab* ss = atomic_load_explicit(&e->ss, memory_order_acquire);
    if (!ss || ss->magic != SUPERSLAB_MAGIC) continue;
    if ((int)ss->size_class != class_idx) { scanned++; continue; }
    // ... 内側のループで slab をスキャン
}
```

**問題点：**

1. **262,144 エントリを線形スキャン** (`SUPER_REG_SIZE = 262144`)
2. **2 回の atomic load** per iteration (base + ss)
3. **class_idx 不一致でも iteration 継続** → 最悪 262,144 回ループ
4. **Cache miss 連発** (1つのエントリ = 24 bytes, 全体 = 6 MB)

**コスト見積もり：**
```
1 iteration = 2 atomic loads (20 cycles) + 比較 (5 cycles) = 25 cycles
262,144 iterations × 25 cycles = 6.5M cycles
@ 4GHz = 1.6ms per refill call
```

**refill 頻度:**
- TLS cache miss 時に発生 (hit rate ~95%)
- Larson benchmark: 3.62M ops/s × 5% miss = 181K refills/sec
- Total overhead: 181K × 1.6ms = **289 seconds = 480% of CPU time!**

---

## 💡 解決策

### Priority 1: Registry を per-class にインデックス化 🔥🔥🔥

**現状：**
```c
SuperRegEntry g_super_reg[262144];  // 全 class が混在
```

**提案：**
```c
SuperRegEntry g_super_reg_by_class[TINY_NUM_CLASSES][4096];
// 8 classes × 4096 entries = 32K total
```

**効果：**
- スキャン対象: 262,144 → 4,096 エントリ (-98.4%)
- 期待改善: **+200-300%** (2.59M → 7.8-10.4M ops/s)

### Priority 2: Registry スキャンを早期終了

**現状：**
```c
for (int i = 0; i < SUPER_REG_SIZE && scanned < scan_max; i++) {
    // 一致しなくても全エントリをイテレート
}
```

**提案：**
```c
for (int i = 0; i < scan_max && i < registry_size[class_idx]; i++) {
    // class 専用 registry のみスキャン
    // 早期終了: 最初の freelist 発見で即 return
}
```

**効果：**
- 早期終了により平均ループ回数: 4,096 → 10-50 回 (-99%)
- 期待改善: 追加 +50-100%

### Priority 3: getenv() キャッシング

**現状：**
- `tiny_reg_scan_max()` で毎回 `getenv()` チェック
- `static int v = -1` で初回のみ実行（既に最適化済み）

**効果：**
- 既に実装済み ✅

---

## 📊 期待効果まとめ

| 最適化 | 改善率 | スループット予測 |
|--------|--------|-----------------|
| **Baseline (現状)** | - | 2.59M ops/s (18% of system) |
| Per-class registry | +200-300% | 7.8-10.4M ops/s (54-73%) |
| 早期終了 | +50-100% | 11.7-20.8M ops/s (82-145%) |
| **Total** | **+350-700%** | **11.7-20.8M ops/s** 🎯 |

**Goal:** System malloc 同等 (14.31M ops/s) を超える！

---

## 🎯 実装プラン

### Phase 1 (1-2日): Per-class Registry

**変更箇所：**
1. `core/hakmem_super_registry.h`: 構造体変更
2. `core/hakmem_super_registry.c`: register/unregister 関数更新
3. `core/hakmem_tiny_free.inc:917`: スキャンロジック簡素化
4. `core/tiny_mmap_gate.h:46`: 同上

**実装：**
```c
// hakmem_super_registry.h
#define SUPER_REG_PER_CLASS 4096
SuperRegEntry g_super_reg_by_class[TINY_NUM_CLASSES][SUPER_REG_PER_CLASS];

// hakmem_tiny_free.inc
int scan_max = tiny_reg_scan_max();
int reg_size = g_super_reg_class_size[class_idx];
for (int i = 0; i < scan_max && i < reg_size; i++) {
    SuperRegEntry* e = &g_super_reg_by_class[class_idx][i];
    // ... 既存のロジック（class_idx チェック不要！）
}
```

**期待効果:** +200-300% (2.59M → 7.8-10.4M ops/s)

### Phase 2 (1日): 早期終了 + First-fit

**変更箇所：**
- `core/hakmem_tiny_free.inc:929-941`: 最初の freelist で即 return

**実装：**
```c
for (int s = 0; s < reg_cap; s++) {
    if (ss->slabs[s].freelist) {
        SlabHandle h = slab_try_acquire(ss, s, self_tid);
        if (slab_is_valid(&h)) {
            slab_drain_remote_full(&h);
            tiny_drain_freelist_to_sll_once(h.ss, h.slab_idx, class_idx);
            tiny_tls_bind_slab(tls, ss, s);
            return ss;  // 🚀 即 return！
        }
    }
}
```

**期待効果:** 追加 +50-100%

---

## 📚 参考

### 既存の分析ドキュメント

- `SLL_REFILL_BOTTLENECK_ANALYSIS.md` (外部AI作成)
  - superslab_refill の 298 行複雑性を指摘
  - Priority 3: Registry 線形スキャン (+10-12% と見積もり)
  - **実際の影響はもっと大きかった** (CPU time 28.51%!)

- `LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md` (外部AI作成)
  - malloc() エントリーポイントの分岐削減を提案
  - **既に実装済み** (Option A: Inline TLS cache access)
  - 効果: 0.46M → 2.59M ops/s (+463%) ✅

### Perf コマンド

```bash
# Record
perf record -g --call-graph dwarf -F 999 -o hakmem_perf.data \
  -- env HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 2 8 128 1024 1 12345 4

# Report (top functions)
perf report -i hakmem_perf.data --stdio --no-children --sort symbol | head -60

# Annotate (hot instructions)
perf annotate -i hakmem_perf.data superslab_refill --stdio | \
  grep -E "^\s+[0-9]+\.[0-9]+" | sort -rn | head -30
```

---

## 🎯 結論

**HAKMEM の Larson 性能低下 (-78.4%) は Registry 線形スキャンが原因**

1. ✅ **Root Cause 特定**: superslab_refill が 28.51% CPU time を消費
2. ✅ **ボトルネック特定**: 262,144 エントリの線形スキャン
3. ✅ **解決策提案**: Per-class registry (+200-300%)

**次のステップ:** Phase 1 実装 → 2.59M から 7.8-10.4M ops/s へ (+3-4倍!)

---

**Date**: 2025-11-05
**Measured with**: perf record -F 999, larson_hakmem threads=4
**Status**: Root cause identified, solution designed ✅
-												Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)

## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-26 13:14:18 +09:00
+								# HAKMEM Larson Benchmark Perf Analysis - 2025-11-05
 								## 🎯 測定結果
 								### スループット比較 (threads=4)
 								| Allocator | Throughput | vs System |
 								|-----------|-----------|-----------|
 								| **HAKMEM** | **3.62M ops/s** | **21.6%** |
 								| System malloc | 16.76M ops/s | 100% |
 								| mimalloc | 16.76M ops/s | 100% |
 								### スループット比較 (threads=1)
 								| Allocator | Throughput | vs System |
 								|-----------|-----------|-----------|
 								| **HAKMEM** | **2.59M ops/s** | **18.1%** |
 								| System malloc | 14.31M ops/s | 100% |
 								---
 								## 🔥 ボトルネック分析 (perf record -F 999)
 								### HAKMEM CPU Time トップ関数
 								```
 .51%  superslab_refill          💀💀💀 圧倒的ボトルネック
 .58%  exercise_heap             (ベンチマーク本体)
 .21%  hak_free_at
 .87%  memset
 .18%  sll_refill_batch_from_ss
 .88%  malloc
 								```
 								**問題：アロケータ (superslab_refill) がベンチマーク本体より遅い！**
 								### System malloc CPU Time トップ関数
 								```
 .70%  exercise_heap             ✅ ベンチマーク本体が一番！
 .08%  _int_free
 .59%  cfree@GLIBC_2.2.5
 								```
 								**正常：ベンチマーク本体が CPU time を最も使う**
 								---
 								## 🐛 Root Cause: Registry 線形スキャン
 								### Hot Instructions (perf annotate superslab_refill)
 								```
 .36%  cmp    0x10(%rsp),%r11d    ← ループ比較
 .78%  inc    %r13d               ← カウンタ++
 .29%  add    $0x18,%rbx          ← ポインタ進める
 .89%  test   %r15,%r15           ← NULL チェック
 .83%  cmp    $0x3ffff,%r13d      ← 上限チェック (0x3ffff = 262143!)
 .50%  mov    (%rbx),%r15         ← 間接ロード
 								```
 								**合計 97.65% の CPU time がループに集中！**
 								### 該当コード
 								**File**: `core/hakmem_tiny_free.inc:917-943`
 								```c
 								const int scan_max = tiny_reg_scan_max();  // デフォルト 256
 								for (int i = 0; i < SUPER_REG_SIZE && scanned < scan_max; i++) {
 								    //                  ^^^^^^^^^^^^^ 262,144 エントリ！
 								    SuperRegEntry* e = &g_super_reg[i];
 								    uintptr_t base = atomic_load_explicit((_Atomic uintptr_t*)&e->base, memory_order_acquire);
 								    if (base == 0) continue;
 								    SuperSlab* ss = atomic_load_explicit(&e->ss, memory_order_acquire);
 								    if (!ss || ss->magic != SUPERSLAB_MAGIC) continue;
 								    if ((int)ss->size_class != class_idx) { scanned++; continue; }
 								    // ... 内側のループで slab をスキャン
 								}
 								```
 								**問題点：**
 . **262,144 エントリを線形スキャン** (`SUPER_REG_SIZE = 262144`)
 . **2 回の atomic load** per iteration (base + ss)
 . **class_idx 不一致でも iteration 継続** → 最悪 262,144 回ループ
 . **Cache miss 連発** (1つのエントリ = 24 bytes, 全体 = 6 MB)
 								**コスト見積もり：**
 								```
 iteration = 2 atomic loads (20 cycles) + 比較 (5 cycles) = 25 cycles
 ,144 iterations × 25 cycles = 6.5M cycles
 								@ 4GHz = 1.6ms per refill call
 								```
 								**refill 頻度:**
 								- TLS cache miss 時に発生 (hit rate ~95%)
 								- Larson benchmark: 3.62M ops/s × 5% miss = 181K refills/sec
 								- Total overhead: 181K × 1.6ms = **289 seconds = 480% of CPU time!**
 								---
 								## 💡 解決策
 								### Priority 1: Registry を per-class にインデックス化 🔥🔥🔥
 								**現状：**
 								```c
 								SuperRegEntry g_super_reg[262144];  // 全 class が混在
 								```
 								**提案：**
 								```c
 								SuperRegEntry g_super_reg_by_class[TINY_NUM_CLASSES][4096];
 								// 8 classes × 4096 entries = 32K total
 								```
 								**効果：**
 								- スキャン対象: 262,144 → 4,096 エントリ (-98.4%)
 								- 期待改善: **+200-300%** (2.59M → 7.8-10.4M ops/s)
 								### Priority 2: Registry スキャンを早期終了
 								**現状：**
 								```c
 								for (int i = 0; i < SUPER_REG_SIZE && scanned < scan_max; i++) {
 								    // 一致しなくても全エントリをイテレート
 								}
 								```
 								**提案：**
 								```c
 								for (int i = 0; i < scan_max && i < registry_size[class_idx]; i++) {
 								    // class 専用 registry のみスキャン
 								    // 早期終了: 最初の freelist 発見で即 return
 								}
 								```
 								**効果：**
 								- 早期終了により平均ループ回数: 4,096 → 10-50 回 (-99%)
 								- 期待改善: 追加 +50-100%
 								### Priority 3: getenv() キャッシング
 								**現状：**
 								- `tiny_reg_scan_max()` で毎回 `getenv()` チェック
 								- `static int v = -1` で初回のみ実行（既に最適化済み）
 								**効果：**
 								- 既に実装済み ✅
 								---
 								## 📊 期待効果まとめ
 								| 最適化 | 改善率 | スループット予測 |
 								|--------|--------|-----------------|
 								| **Baseline (現状)** | - | 2.59M ops/s (18% of system) |
 								| Per-class registry | +200-300% | 7.8-10.4M ops/s (54-73%) |
 								| 早期終了 | +50-100% | 11.7-20.8M ops/s (82-145%) |
 								| **Total** | **+350-700%** | **11.7-20.8M ops/s** 🎯 |
 								**Goal:** System malloc 同等 (14.31M ops/s) を超える！
 								---
 								## 🎯 実装プラン
 								### Phase 1 (1-2日): Per-class Registry
 								**変更箇所：**
 . `core/hakmem_super_registry.h`: 構造体変更
 . `core/hakmem_super_registry.c`: register/unregister 関数更新
 . `core/hakmem_tiny_free.inc:917`: スキャンロジック簡素化
 . `core/tiny_mmap_gate.h:46`: 同上
 								**実装：**
 								```c
 								// hakmem_super_registry.h
 								#define SUPER_REG_PER_CLASS 4096
 								SuperRegEntry g_super_reg_by_class[TINY_NUM_CLASSES][SUPER_REG_PER_CLASS];
 								// hakmem_tiny_free.inc
 								int scan_max = tiny_reg_scan_max();
 								int reg_size = g_super_reg_class_size[class_idx];
 								for (int i = 0; i < scan_max && i < reg_size; i++) {
 								    SuperRegEntry* e = &g_super_reg_by_class[class_idx][i];
 								    // ... 既存のロジック（class_idx チェック不要！）
 								}
 								```
 								**期待効果:** +200-300% (2.59M → 7.8-10.4M ops/s)
 								### Phase 2 (1日): 早期終了 + First-fit
 								**変更箇所：**
 								- `core/hakmem_tiny_free.inc:929-941`: 最初の freelist で即 return
 								**実装：**
 								```c
 								for (int s = 0; s < reg_cap; s++) {
 								    if (ss->slabs[s].freelist) {
 								        SlabHandle h = slab_try_acquire(ss, s, self_tid);
 								        if (slab_is_valid(&h)) {
 								            slab_drain_remote_full(&h);
 								            tiny_drain_freelist_to_sll_once(h.ss, h.slab_idx, class_idx);
 								            tiny_tls_bind_slab(tls, ss, s);
 								            return ss;  // 🚀 即 return！
 								        }
 								    }
 								}
 								```
 								**期待効果:** 追加 +50-100%
 								---
 								## 📚 参考
 								### 既存の分析ドキュメント
 								- `SLL_REFILL_BOTTLENECK_ANALYSIS.md` (外部AI作成)
 								  - superslab_refill の 298 行複雑性を指摘
 								  - Priority 3: Registry 線形スキャン (+10-12% と見積もり)
 								  - **実際の影響はもっと大きかった** (CPU time 28.51%!)
 								- `LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md` (外部AI作成)
 								  - malloc() エントリーポイントの分岐削減を提案
 								  - **既に実装済み** (Option A: Inline TLS cache access)
 								  - 効果: 0.46M → 2.59M ops/s (+463%) ✅
 								### Perf コマンド
 								```bash
 								# Record
 								perf record -g --call-graph dwarf -F 999 -o hakmem_perf.data \
 								  -- env HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 2 8 128 1024 1 12345 4
 								# Report (top functions)
 								perf report -i hakmem_perf.data --stdio --no-children --sort symbol | head -60
 								# Annotate (hot instructions)
 								perf annotate -i hakmem_perf.data superslab_refill --stdio | \
 								  grep -E "^\s+[0-9]+\.[0-9]+" | sort -rn | head -30
 								```
 								---
 								## 🎯 結論
 								**HAKMEM の Larson 性能低下 (-78.4%) は Registry 線形スキャンが原因**
 . ✅ **Root Cause 特定**: superslab_refill が 28.51% CPU time を消費
 . ✅ **ボトルネック特定**: 262,144 エントリの線形スキャン
 . ✅ **解決策提案**: Per-class registry (+200-300%)
 								**次のステップ:** Phase 1 実装 → 2.59M から 7.8-10.4M ops/s へ (+3-4倍!)
 								---
 								**Date**: 2025-11-05
 								**Measured with**: perf record -F 999, larson_hakmem threads=4
 								**Status**: Root cause identified, solution designed ✅