Files

Moe Charm (CI) 859027e06c Perf Analysis: Registry 線形スキャンがボトルネック (28.51% CPU)

- perf record で superslab_refill が 28.51% CPU を消費していることを特定
- Root cause: 262,144 エントリの Registry を線形スキャン
- Hot instructions: ループ比較 (32.36%), カウンタ++ (16.78%), ポインタ進める (16.29%)
- 解決策: per-class registry (8 classes × 4096 entries) に変更
- 期待効果: +200-300% (2.59M → 7.8-10.4M ops/s)

詳細: PERF_ANALYSIS_2025_11_05.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-05 16:44:43 +09:00

7.5 KiB

Raw Blame History

HAKMEM Larson Benchmark Perf Analysis - 2025-11-05

🎯 測定結果

スループット比較 (threads=4)

Allocator	Throughput	vs System
HAKMEM	3.62M ops/s	21.6%
System malloc	16.76M ops/s	100%
mimalloc	16.76M ops/s	100%

スループット比較 (threads=1)

Allocator	Throughput	vs System
HAKMEM	2.59M ops/s	18.1%
System malloc	14.31M ops/s	100%

🔥 ボトルネック分析 (perf record -F 999)

HAKMEM CPU Time トップ関数

28.51%  superslab_refill          💀💀💀 圧倒的ボトルネック
 2.58%  exercise_heap             (ベンチマーク本体)
 2.21%  hak_free_at
 1.87%  memset
 1.18%  sll_refill_batch_from_ss
 0.88%  malloc

問題：アロケータ (superslab_refill) がベンチマーク本体より遅い！

System malloc CPU Time トップ関数

20.70%  exercise_heap             ✅ ベンチマーク本体が一番！
18.08%  _int_free
10.59%  cfree@GLIBC_2.2.5

正常：ベンチマーク本体が CPU time を最も使う

🐛 Root Cause: Registry 線形スキャン

Hot Instructions (perf annotate superslab_refill)

32.36%  cmp    0x10(%rsp),%r11d    ← ループ比較
16.78%  inc    %r13d               ← カウンタ++
16.29%  add    $0x18,%rbx          ← ポインタ進める
10.89%  test   %r15,%r15           ← NULL チェック
10.83%  cmp    $0x3ffff,%r13d      ← 上限チェック (0x3ffff = 262143!)
10.50%  mov    (%rbx),%r15         ← 間接ロード

合計 97.65% の CPU time がループに集中！

該当コード

File: core/hakmem_tiny_free.inc:917-943

const int scan_max = tiny_reg_scan_max();  // デフォルト 256
for (int i = 0; i < SUPER_REG_SIZE && scanned < scan_max; i++) {
    //                  ^^^^^^^^^^^^^ 262,144 エントリ！
    SuperRegEntry* e = &g_super_reg[i];
    uintptr_t base = atomic_load_explicit((_Atomic uintptr_t*)&e->base, memory_order_acquire);
    if (base == 0) continue;
    SuperSlab* ss = atomic_load_explicit(&e->ss, memory_order_acquire);
    if (!ss || ss->magic != SUPERSLAB_MAGIC) continue;
    if ((int)ss->size_class != class_idx) { scanned++; continue; }
    // ... 内側のループで slab をスキャン
}

問題点：

262,144 エントリを線形スキャン (SUPER_REG_SIZE = 262144)
2 回の atomic load per iteration (base + ss)
class_idx 不一致でも iteration 継続 → 最悪 262,144 回ループ
Cache miss 連発 (1つのエントリ = 24 bytes, 全体 = 6 MB)

コスト見積もり：

1 iteration = 2 atomic loads (20 cycles) + 比較 (5 cycles) = 25 cycles
262,144 iterations × 25 cycles = 6.5M cycles
@ 4GHz = 1.6ms per refill call

refill 頻度:

TLS cache miss 時に発生 (hit rate ~95%)
Larson benchmark: 3.62M ops/s × 5% miss = 181K refills/sec
Total overhead: 181K × 1.6ms = 289 seconds = 480% of CPU time!

💡 解決策

Priority 1: Registry を per-class にインデックス化 🔥🔥🔥

現状：

SuperRegEntry g_super_reg[262144];  // 全 class が混在

提案：

SuperRegEntry g_super_reg_by_class[TINY_NUM_CLASSES][4096];
// 8 classes × 4096 entries = 32K total

効果：

スキャン対象: 262,144 → 4,096 エントリ (-98.4%)
期待改善: +200-300% (2.59M → 7.8-10.4M ops/s)

Priority 2: Registry スキャンを早期終了

現状：

for (int i = 0; i < SUPER_REG_SIZE && scanned < scan_max; i++) {
    // 一致しなくても全エントリをイテレート
}

提案：

for (int i = 0; i < scan_max && i < registry_size[class_idx]; i++) {
    // class 専用 registry のみスキャン
    // 早期終了: 最初の freelist 発見で即 return
}

効果：

早期終了により平均ループ回数: 4,096 → 10-50 回 (-99%)
期待改善: 追加 +50-100%

Priority 3: getenv() キャッシング

現状：

tiny_reg_scan_max() で毎回 getenv() チェック
static int v = -1 で初回のみ実行（既に最適化済み）

効果：

既に実装済み ✅

📊 期待効果まとめ

最適化	改善率	スループット予測
Baseline (現状)	-	2.59M ops/s (18% of system)
Per-class registry	+200-300%	7.8-10.4M ops/s (54-73%)
早期終了	+50-100%	11.7-20.8M ops/s (82-145%)
Total	+350-700%	11.7-20.8M ops/s 🎯

Goal: System malloc 同等 (14.31M ops/s) を超える！

🎯 実装プラン

Phase 1 (1-2日): Per-class Registry

変更箇所：

core/hakmem_super_registry.h: 構造体変更
core/hakmem_super_registry.c: register/unregister 関数更新
core/hakmem_tiny_free.inc:917: スキャンロジック簡素化
core/tiny_mmap_gate.h:46: 同上

実装：

// hakmem_super_registry.h
#define SUPER_REG_PER_CLASS 4096
SuperRegEntry g_super_reg_by_class[TINY_NUM_CLASSES][SUPER_REG_PER_CLASS];

// hakmem_tiny_free.inc
int scan_max = tiny_reg_scan_max();
int reg_size = g_super_reg_class_size[class_idx];
for (int i = 0; i < scan_max && i < reg_size; i++) {
    SuperRegEntry* e = &g_super_reg_by_class[class_idx][i];
    // ... 既存のロジック（class_idx チェック不要！）
}

期待効果: +200-300% (2.59M → 7.8-10.4M ops/s)

Phase 2 (1日): 早期終了 + First-fit

変更箇所：

core/hakmem_tiny_free.inc:929-941: 最初の freelist で即 return

実装：

for (int s = 0; s < reg_cap; s++) {
    if (ss->slabs[s].freelist) {
        SlabHandle h = slab_try_acquire(ss, s, self_tid);
        if (slab_is_valid(&h)) {
            slab_drain_remote_full(&h);
            tiny_drain_freelist_to_sll_once(h.ss, h.slab_idx, class_idx);
            tiny_tls_bind_slab(tls, ss, s);
            return ss;  // 🚀 即 return！
        }
    }
}

期待効果: 追加 +50-100%

📚 参考

既存の分析ドキュメント

SLL_REFILL_BOTTLENECK_ANALYSIS.md (外部AI作成)
- superslab_refill の 298 行複雑性を指摘
- Priority 3: Registry 線形スキャン (+10-12% と見積もり)
- 実際の影響はもっと大きかった (CPU time 28.51%!)
LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md (外部AI作成)
- malloc() エントリーポイントの分岐削減を提案
- 既に実装済み (Option A: Inline TLS cache access)
- 効果: 0.46M → 2.59M ops/s (+463%) ✅

Perf コマンド

# Record
perf record -g --call-graph dwarf -F 999 -o hakmem_perf.data \
  -- env HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 2 8 128 1024 1 12345 4

# Report (top functions)
perf report -i hakmem_perf.data --stdio --no-children --sort symbol | head -60

# Annotate (hot instructions)
perf annotate -i hakmem_perf.data superslab_refill --stdio | \
  grep -E "^\s+[0-9]+\.[0-9]+" | sort -rn | head -30

🎯 結論

HAKMEM の Larson 性能低下 (-78.4%) は Registry 線形スキャンが原因

✅ Root Cause 特定: superslab_refill が 28.51% CPU time を消費
✅ ボトルネック特定: 262,144 エントリの線形スキャン
✅ 解決策提案: Per-class registry (+200-300%)

次のステップ: Phase 1 実装 → 2.59M から 7.8-10.4M ops/s へ (+3-4倍!)

Date: 2025-11-05 Measured with: perf record -F 999, larson_hakmem threads=4 Status: Root cause identified, solution designed ✅

7.5 KiB Raw Blame History Unescape Escape

HAKMEM Larson Benchmark Perf Analysis - 2025-11-05

🎯 測定結果

スループット比較 (threads=4)

スループット比較 (threads=1)

🔥 ボトルネック分析 (perf record -F 999)

HAKMEM CPU Time トップ関数

System malloc CPU Time トップ関数

🐛 Root Cause: Registry 線形スキャン

Hot Instructions (perf annotate superslab_refill)

該当コード

💡 解決策

Priority 1: Registry を per-class にインデックス化 🔥🔥🔥

Priority 2: Registry スキャンを早期終了

Priority 3: getenv() キャッシング

📊 期待効果まとめ

🎯 実装プラン

Phase 1 (1-2日): Per-class Registry

Phase 2 (1日): 早期終了 + First-fit

📚 参考

既存の分析ドキュメント

Perf コマンド

🎯 結論

7.5 KiB

Raw Blame History