From 859027e06c1dfad0d91cb90d560842d3aca8696e Mon Sep 17 00:00:00 2001
From: "Moe Charm (CI)" <moecharm@example.com>
Date: Wed, 5 Nov 2025 16:44:43 +0900
Subject: [PATCH] =?UTF-8?q?Perf=20Analysis:=20Registry=20=E7=B7=9A?=
 =?UTF-8?q?=E5=BD=A2=E3=82=B9=E3=82=AD=E3=83=A3=E3=83=B3=E3=81=8C=E3=83=9C?=
 =?UTF-8?q?=E3=83=88=E3=83=AB=E3=83=8D=E3=83=83=E3=82=AF=20(28.51%=20CPU)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- perf record で superslab_refill が 28.51% CPU を消費していることを特定
- Root cause: 262,144 エントリの Registry を線形スキャン
- Hot instructions: ループ比較 (32.36%), カウンタ++ (16.78%), ポインタ進める (16.29%)
- 解決策: per-class registry (8 classes × 4096 entries) に変更
- 期待効果: +200-300% (2.59M → 7.8-10.4M ops/s)

詳細: PERF_ANALYSIS_2025_11_05.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
---
 PERF_ANALYSIS_2025_11_05.md | 1094 ++++++++---------------------------
 1 file changed, 236 insertions(+), 858 deletions(-)

diff --git a/PERF_ANALYSIS_2025_11_05.md b/PERF_ANALYSIS_2025_11_05.md
index 8b9f68e8..88cb12c1 100644
--- a/PERF_ANALYSIS_2025_11_05.md
+++ b/PERF_ANALYSIS_2025_11_05.md
@@ -1,885 +1,263 @@
-# HAKMEM vs mimalloc Root Cause Analysis
+# HAKMEM Larson Benchmark Perf Analysis - 2025-11-05
 
-**Date:** 2025-11-05
-**Test:** Larson benchmark (2s, 4 threads, 8-128B allocations)
+## 🎯 測定結果
+
+### スループット比較 (threads=4)
+
+| Allocator | Throughput | vs System |
+|-----------|-----------|-----------|
+| **HAKMEM** | **3.62M ops/s** | **21.6%** |
+| System malloc | 16.76M ops/s | 100% |
+| mimalloc | 16.76M ops/s | 100% |
+
+### スループット比較 (threads=1)
+
+| Allocator | Throughput | vs System |
+|-----------|-----------|-----------|
+| **HAKMEM** | **2.59M ops/s** | **18.1%** |
+| System malloc | 14.31M ops/s | 100% |
 
 ---
 
-## Executive Summary
+## 🔥 ボトルネック分析 (perf record -F 999)
 
-**Performance Gap:** HAKMEM is **6.4x slower** than mimalloc (2.62M ops/s vs 16.76M ops/s)
+### HAKMEM CPU Time トップ関数
 
-**Root Cause:** HAKMEM spends **7.25% of CPU time** in `superslab_refill` - a slow refill path that mimalloc avoids almost entirely. Combined with **4.45x instruction overhead** and **3.19x L1 cache miss rate**, this creates a perfect storm of inefficiency.
+```
+28.51%  superslab_refill          💀💀💀 圧倒的ボトルネック
+ 2.58%  exercise_heap             (ベンチマーク本体)
+ 2.21%  hak_free_at
+ 1.87%  memset
+ 1.18%  sll_refill_batch_from_ss
+ 0.88%  malloc
+```
 
-**Key Finding:** HAKMEM executes **28x more instructions per operation** than mimalloc (17,366 vs 610 instructions/op).
+**問題：アロケータ (superslab_refill) がベンチマーク本体より遅い！**
+
+### System malloc CPU Time トップ関数
+
+```
+20.70%  exercise_heap             ✅ ベンチマーク本体が一番！
+18.08%  _int_free
+10.59%  cfree@GLIBC_2.2.5
+```
+
+**正常：ベンチマーク本体が CPU time を最も使う**
 
 ---
 
-## Performance Metrics Comparison
+## 🐛 Root Cause: Registry 線形スキャン
 
-### Throughput
-| Allocator | Ops/sec | Relative | Time |
-|-----------|---------|----------|------|
-| HAKMEM    | 2.62M   | 1.00x    | 4.28s |
-| mimalloc  | 16.76M  | 6.39x    | 4.13s |
+### Hot Instructions (perf annotate superslab_refill)
 
-### CPU Performance Counters
-
-| Metric | HAKMEM | mimalloc | HAKMEM/mimalloc |
-|--------|---------|----------|-----------------|
-| **Cycles** | 16,971M | 11,482M | 1.48x |
-| **Instructions** | 45,516M | 10,219M | **4.45x** |
-| **IPC** | 2.68 | 0.89 | 3.01x |
-| **L1 cache miss rate** | 15.61% | 4.89% | **3.19x** |
-| **Cache miss rate** | 5.89% | 40.79% | 0.14x |
-| **Branch miss rate** | 0.83% | 6.05% | 0.14x |
-| **L1 loads** | 11,071M | 3,940M | 2.81x |
-| **L1 misses** | 1,728M | 192M | **9.00x** |
-| **Branches** | 14,224M | 1,847M | 7.70x |
-| **Branch misses** | 118M | 112M | 1.05x |
-
-### Per-Operation Metrics
-
-| Metric | HAKMEM | mimalloc | Ratio |
-|--------|---------|----------|-------|
-| **Instructions/op** | 17,366 | 610 | **28.5x** |
-| **Cycles/op** | 6,473 | 685 | **9.4x** |
-| **L1 loads/op** | 4,224 | 235 | **18.0x** |
-| **L1 misses/op** | 659 | 11.5 | **57.3x** |
-| **Branches/op** | 5,426 | 110 | **49.3x** |
-
----
-
-## Key Insights from Metrics
-
-1. **HAKMEM executes 28x MORE instructions per operation**
-   - HAKMEM: 17,366 instructions/op
-   - mimalloc: 610 instructions/op
-   - **This is the smoking gun - massive algorithmic overhead**
-
-2. **HAKMEM has 57x MORE L1 cache misses per operation**
-   - HAKMEM: 659 L1 misses/op
-   - mimalloc: 11.5 L1 misses/op
-   - **Poor cache locality destroys performance**
-
-3. **HAKMEM has HIGH IPC (2.68) but still loses**
-   - CPU is executing instructions efficiently
-   - But it's executing the **WRONG** instructions
-   - **Algorithm problem, not CPU problem**
-
-4. **mimalloc has LOWER cache efficiency overall**
-   - mimalloc: 40.79% cache miss rate
-   - HAKMEM: 5.89% cache miss rate
-   - **But mimalloc still wins 6x on throughput**
-   - **Suggests mimalloc's algorithm is fundamentally better**
-
----
-
-## Top CPU Hotspots
-
-### HAKMEM Top Functions (user-space only)
-| % CPU | Function | Category | Notes |
-|-------|----------|----------|-------|
-| 7.25% | superslab_refill.lto_priv.0 | **REFILL** | **MAIN BOTTLENECK** |
-| 1.33% | memset | Init | Memory zeroing |
-| 0.55% | exercise_heap | Benchmark | Test code |
-| 0.42% | hak_tiny_init.part.0 | Init | Initialization |
-| 0.40% | hkm_custom_malloc | Entry | Main entry |
-| 0.39% | hak_free_at.constprop.0 | Free | Free path |
-| 0.31% | hak_tiny_alloc_slow | Alloc | Slow path |
-| 0.23% | pthread_mutex_lock | Sync | Lock overhead |
-| 0.21% | pthread_mutex_unlock | Sync | Unlock overhead |
-| 0.20% | hkm_custom_free | Entry | Free entry |
-| 0.12% | hak_tiny_owner_slab | Meta | Ownership check |
-
-**Total allocator overhead visible: ~11.4%** (excluding benchmark)
-
-### mimalloc Top Functions (user-space only)
-| % CPU | Function | Category | Notes |
-|-------|----------|----------|-------|
-| 30.33% | exercise_heap | Benchmark | Test code |
-| 6.72% | operator delete[] | Free | Fast free |
-| 4.15% | _mi_page_free_collect | Free | Collection |
-| 2.95% | mi_malloc | Entry | Main entry |
-| 2.57% | _mi_page_reclaim | Reclaim | Page reclaim |
-| 2.57% | _mi_free_block_mt | Free | MT free |
-| 1.18% | _mi_free_generic | Free | Generic free |
-| 1.03% | mi_segment_reclaim | Reclaim | Segment reclaim |
-| 0.69% | mi_thread_init | Init | TLS init |
-| 0.63% | _mi_page_use_delayed_free | Free | Delayed free |
-
-**Total allocator overhead visible: ~22.5%** (excluding benchmark)
-
----
-
-## Root Cause Analysis
-
-### Primary Bottleneck: superslab_refill (7.25% CPU)
-
-**What it does:**
-- Called from `hak_tiny_alloc_slow` when fast cache is empty
-- Refills the magazine/fast-cache with new blocks from superslab
-- Includes memory allocation and initialization (memset)
-
-**Why is this catastrophic?**
-- **7.25% CPU in a SINGLE function** is massive for an allocator
-- mimalloc has **NO equivalent high-cost refill function**
-- Indicates HAKMEM is **constantly missing the fast path**
-- Each refill is expensive (includes 1.33% memset overhead)
-
-**Call frequency analysis:**
-- Total time: 4.28s
-- superslab_refill: 7.25% = 0.31s
-- Total ops: 2.62M ops/s × 4.28s = 11.2M ops
-- If refill happens every N ops, and takes 0.31s:
-  - Assuming 50 cycles/op in refill = 16.97B cycles × 0.0725 = 1.23B cycles
-  - At 4 GHz = 0.31s ✓
-- **Estimated refill frequency: every 100-200 operations**
-
-**Impact:**
-- Fast cache capacity: 16 slots per class
-- Refill count: ~64 blocks per refill
-- Hit rate: ~60-70% (30-40% miss rate is TERRIBLE)
-- **mimalloc's tcache likely has >95% hit rate**
-
----
-
-### Secondary Issues
-
-#### 1. **Instruction Count Explosion (4.45x more, 28x per-op)**
-- HAKMEM: 45.5B instructions total, 17,366 per op
-- mimalloc: 10.2B instructions total, 610 per op
-- **Gap: 35.3B excess instructions, 16,756 per op**
-
-**What causes this?**
-- Complex fast path with many branches (5,426 branches/op vs 110)
-- Magazine layer overhead (pop, refill, push)
-- SuperSlab metadata lookups
-- Ownership checks (hak_tiny_owner_slab)
-- TLS access overhead
-- Debug instrumentation (tiny_debug_ring_record)
-
-**Evidence from disassembly:**
-```asm
-hkm_custom_malloc:
-    push   %r15          ; Save 6 registers
-    push   %r14
-    push   %r13
-    push   %r12
-    push   %rbp
-    push   %rbx
-    sub    $0x58,%rsp    ; 88 bytes stack
-    mov    %fs:0x28,%rax ; Stack canary
-    ...
-    test   %eax,%eax     ; Multiple branches
-    js     ...           ; Size class check
-    je     ...           ; Init check
-    cmp    $0x400,%rbx   ; Threshold check
-    jbe    ...           ; Another branch
+```
+32.36%  cmp    0x10(%rsp),%r11d    ← ループ比較
+16.78%  inc    %r13d               ← カウンタ++
+16.29%  add    $0x18,%rbx          ← ポインタ進める
+10.89%  test   %r15,%r15           ← NULL チェック
+10.83%  cmp    $0x3ffff,%r13d      ← 上限チェック (0x3ffff = 262143!)
+10.50%  mov    (%rbx),%r15         ← 間接ロード
 ```
 
-**mimalloc likely has:**
-```asm
-mi_malloc:
-    mov    %fs:0x?,%rax  ; Get TLS tcache
-    mov    (%rax),%rdx   ; Load head
-    test   %rdx,%rdx     ; Check if empty
-    je     slow_path     ; Miss -> slow path
-    mov    8(%rdx),%rcx  ; Load next
-    mov    %rcx,(%rax)   ; Update head
-    ret                  ; Done (6-8 instructions!)
-```
+**合計 97.65% の CPU time がループに集中！**
 
-#### 2. **L1 Cache Miss Explosion (3.19x rate, 57x per-op)**
-- HAKMEM: 15.61% miss rate, 659 misses/op
-- mimalloc: 4.89% miss rate, 11.5 misses/op
+### 該当コード
 
-**What causes this?**
-- **TLS cache thrashing** - accessing scattered TLS variables
-- **Magazine structure layout** - poor spatial locality
-- **SuperSlab metadata** - cold cache lines on refill
-- **Pointer chasing** - magazine → superslab → slab → block
-- **Debug structures** - debug ring buffer causes cache pollution
+**File**: `core/hakmem_tiny_free.inc:917-943`
 
-**Memory access pattern:**
-```
-HAKMEM malloc:
-  TLS var 1 → size class        [cache miss]
-  TLS var 2 → magazine          [cache miss]
-  magazine → fast_cache array   [cache miss]
-  fast_cache → block ptr        [cache miss]
-  → MISS → slow path
-  superslab lookup              [cache miss]
-  superslab metadata            [cache miss]
-  new slab allocation           [cache miss]
-  memset slab                   [many cache misses]
-```
-
-**mimalloc malloc:**
-```
-  TLS tcache → head ptr         [1 cache hit]
-  head → next ptr               [1 cache hit/miss]
-  → HIT → return                [done!]
-```
-
-#### 3. **Fast Path is Not Fast**
-- HAKMEM's `hkm_custom_malloc`: only 0.40% CPU visible
-- mimalloc's `mi_malloc`: 2.95% CPU visible
-
-**Paradox:** HAKMEM entry shows less CPU but is 6x slower?
-
-**Explanation:**
-- HAKMEM's work is **hidden in inlined code**
-- Profiler attributes time to callees (superslab_refill)
-- The "fast path" is actually calling into slow paths
-- **High miss rate means fast path is rarely taken**
-
----
-
-## Hypothesis Verification
-
-| Hypothesis | Status | Evidence |
-|------------|--------|----------|
-| **Refill overhead is massive** | ✅ CONFIRMED | 7.25% CPU in superslab_refill |
-| **Too many instructions** | ✅ CONFIRMED | 4.45x more, 28x per-op |
-| **Cache locality problems** | ✅ CONFIRMED | 3.19x worse miss rate, 57x per-op |
-| **Atomic operations overhead** | ❌ REJECTED | Branch miss 0.83% vs 6.05% (better) |
-| **Complex fast path** | ✅ CONFIRMED | 5,426 branches/op vs 110 |
-| **SuperSlab lookup cost** | ⚠️ PARTIAL | Only 0.12% visible in hak_tiny_owner_slab |
-| **Cross-thread free overhead** | ⚠️ UNKNOWN | Need to profile free path separately |
-
----
-
-## Detailed Problem Breakdown
-
-### Problem 1: Magazine Refill Design (PRIMARY - 7.25% CPU)
-
-**Current flow:**
-```
-malloc(size)
-  → hkm_custom_malloc() [0.40% CPU]
-      → size_to_class()
-      → TLS magazine lookup
-      → fast_cache check
-      → MISS (30-40% of the time!)
-      → hak_tiny_alloc_slow() [0.31% CPU]
-          → superslab_refill() [7.25% CPU!]
-              → ss_os_acquire() or slab allocation
-              → memset() [1.33% CPU]
-              → fill magazine with N blocks
-              → return 1 block
-```
-
-**mimalloc flow:**
-```
-mi_malloc(size)
-  → mi_malloc() [2.95% CPU - all inline]
-      → size_to_class (branchless)
-      → TLS tcache[class].head
-      → head != NULL? (95%+ hit rate)
-      → pop head, return
-      → MISS (rare!)
-      → mi_malloc_generic() [0.20% CPU]
-          → find free page
-          → return block
-```
-
-**Key differences:**
-1. **Hit rate:** HAKMEM 60-70%, mimalloc 95%+
-2. **Miss cost:** HAKMEM 7.25% (superslab_refill), mimalloc 0.20% (generic)
-3. **Cache size:** HAKMEM 16 slots, mimalloc probably 64+
-4. **Refill cost:** HAKMEM includes memset (1.33%), mimalloc lazy init
-
-**Impact calculation:**
-- HAKMEM miss rate: 30%
-- HAKMEM miss cost: 7.25% / 30% = 24.2% of miss time
-- mimalloc miss rate: 5%
-- mimalloc miss cost: 0.20% / 5% = 4% of miss time
-- **HAKMEM's miss is 6x more expensive per miss!**
-
-### Problem 2: Instruction Overhead (4.45x, 28x per-op)
-
-**Instruction budget per operation:**
-- mimalloc: 610 instructions/op (fast path ~20, slow path amortized)
-- HAKMEM: 17,366 instructions/op (27.7x more!)
-
-**Where do 17,366 instructions go?**
-
-Estimated breakdown (based on profiling and code analysis):
-```
-Function overhead (push/pop/stack):     ~500 instructions  (3%)
-Size class calculation:                 ~200 instructions  (1%)
-TLS access (scattered):                 ~800 instructions  (5%)
-Magazine lookup/management:             ~1,000 instructions (6%)
-Fast cache check/pop:                   ~300 instructions  (2%)
-Miss detection:                         ~200 instructions  (1%)
-Slow path call overhead:                ~400 instructions  (2%)
-SuperSlab refill (30% miss rate):       ~8,000 instructions (46%)
-  ├─ SuperSlab lookup:                  ~1,500 instructions
-  ├─ Slab allocation:                   ~3,000 instructions
-  ├─ memset:                            ~2,500 instructions
-  └─ Magazine fill:                     ~1,000 instructions
-Debug instrumentation:                  ~1,500 instructions (9%)
-Cross-thread handling:                  ~2,000 instructions (12%)
-Misc overhead:                          ~2,466 instructions (14%)
-──────────────────────────────────────────────────────────
-Total:                                  ~17,366 instructions
-```
-
-**Key insight:** 46% of instructions are in SuperSlab refill, which only happens 30% of the time. This means when refill happens, it costs **~26,000 instructions per refill** (serving ~64 blocks), or **~400 instructions per block amortized**.
-
-**mimalloc's 610 instructions:**
-```
-Fast path hit (95%):                    ~20 instructions   (3%)
-Fast path miss (5%):                    ~200 instructions  (16%)
-Slow path (5% × cost):                  ~8,000 instructions (81%)
-  └─ Amortized: 8000 × 0.05 = ~400 instructions
-──────────────────────────────────────────────────────────
-Total amortized:                        ~610 instructions
-```
-
-**Conclusion:** Even mimalloc's slow path costs ~8,000 instructions, but it happens only 5% of the time. HAKMEM's refill costs ~8,000 instructions and happens 30% of the time. **The hit rate is the killer.**
-
-### Problem 3: L1 Cache Thrashing (15.61% miss rate, 659 misses/op)
-
-**Cache behavior analysis:**
-
-**HAKMEM cache access pattern (per operation):**
-```
-L1 loads: 4,224 per op
-L1 misses: 659 per op (15.61%)
-
-Breakdown of cache misses:
-- TLS variable access (scattered):        ~50 misses  (8%)
-- Magazine structure access:              ~40 misses  (6%)
-- Fast cache array access:                ~30 misses  (5%)
-- SuperSlab lookup (30% ops):             ~200 misses (30%)
-- Slab metadata access:                   ~100 misses (15%)
-- memset during refill (30% ops):         ~150 misses (23%)
-- Debug ring buffer:                      ~50 misses  (8%)
-- Misc/stack:                             ~39 misses  (6%)
-────────────────────────────────────────────────────────
-Total:                                    ~659 misses
-```
-
-**mimalloc cache access pattern (per operation):**
-```
-L1 loads: 235 per op
-L1 misses: 11.5 per op (4.89%)
-
-Breakdown (estimated):
-- TLS tcache access (packed):             ~2 misses   (17%)
-- tcache array (fast path hit):           ~0 misses   (0%)
-- Slow path (5% ops):                     ~200 misses (83%)
-  └─ Amortized: 200 × 0.05 = ~10 misses
-────────────────────────────────────────────────────────
-Total:                                    ~11.5 misses
-```
-
-**Key differences:**
-1. **TLS layout:** mimalloc packs hot data in one structure, HAKMEM scatters across many TLS vars
-2. **Magazine overhead:** HAKMEM's 3-layer cache (fast/magazine/superslab) vs mimalloc's 2-layer (tcache/page)
-3. **Refill frequency:** HAKMEM refills 30% vs mimalloc 5%
-4. **Refill cost:** HAKMEM's refill does memset (cache-intensive), mimalloc lazy-inits
-
----
-
-## Comparison with System malloc
-
-From CLAUDE.md, comprehensive benchmark results:
-- **System malloc (glibc):** 135.94 M ops/s (tiny allocations)
-- **HAKMEM:** 2.62 M ops/s (this test)
-- **mimalloc:** 16.76 M ops/s (this test)
-
-**System malloc is 52x faster than HAKMEM, 8x faster than mimalloc!**
-
-**Why is System tcache so fast?**
-
-System malloc (glibc 2.28+) uses tcache:
 ```c
-// Simplified tcache fast path (~5 instructions)
-void* malloc(size_t size) {
-    tcache_entry *e = tcache->entries[size_class];
-    if (e) {
-        tcache->entries[size_class] = e->next;
-        return (void*)e;
+const int scan_max = tiny_reg_scan_max();  // デフォルト 256
+for (int i = 0; i < SUPER_REG_SIZE && scanned < scan_max; i++) {
+    //                  ^^^^^^^^^^^^^ 262,144 エントリ！
+    SuperRegEntry* e = &g_super_reg[i];
+    uintptr_t base = atomic_load_explicit((_Atomic uintptr_t*)&e->base, memory_order_acquire);
+    if (base == 0) continue;
+    SuperSlab* ss = atomic_load_explicit(&e->ss, memory_order_acquire);
+    if (!ss || ss->magic != SUPERSLAB_MAGIC) continue;
+    if ((int)ss->size_class != class_idx) { scanned++; continue; }
+    // ... 内側のループで slab をスキャン
+}
+```
+
+**問題点：**
+
+1. **262,144 エントリを線形スキャン** (`SUPER_REG_SIZE = 262144`)
+2. **2 回の atomic load** per iteration (base + ss)
+3. **class_idx 不一致でも iteration 継続** → 最悪 262,144 回ループ
+4. **Cache miss 連発** (1つのエントリ = 24 bytes, 全体 = 6 MB)
+
+**コスト見積もり：**
+```
+1 iteration = 2 atomic loads (20 cycles) + 比較 (5 cycles) = 25 cycles
+262,144 iterations × 25 cycles = 6.5M cycles
+@ 4GHz = 1.6ms per refill call
+```
+
+**refill 頻度:**
+- TLS cache miss 時に発生 (hit rate ~95%)
+- Larson benchmark: 3.62M ops/s × 5% miss = 181K refills/sec
+- Total overhead: 181K × 1.6ms = **289 seconds = 480% of CPU time!**
+
+---
+
+## 💡 解決策
+
+### Priority 1: Registry を per-class にインデックス化 🔥🔥🔥
+
+**現状：**
+```c
+SuperRegEntry g_super_reg[262144];  // 全 class が混在
+```
+
+**提案：**
+```c
+SuperRegEntry g_super_reg_by_class[TINY_NUM_CLASSES][4096];
+// 8 classes × 4096 entries = 32K total
+```
+
+**効果：**
+- スキャン対象: 262,144 → 4,096 エントリ (-98.4%)
+- 期待改善: **+200-300%** (2.59M → 7.8-10.4M ops/s)
+
+### Priority 2: Registry スキャンを早期終了
+
+**現状：**
+```c
+for (int i = 0; i < SUPER_REG_SIZE && scanned < scan_max; i++) {
+    // 一致しなくても全エントリをイテレート
+}
+```
+
+**提案：**
+```c
+for (int i = 0; i < scan_max && i < registry_size[class_idx]; i++) {
+    // class 専用 registry のみスキャン
+    // 早期終了: 最初の freelist 発見で即 return
+}
+```
+
+**効果：**
+- 早期終了により平均ループ回数: 4,096 → 10-50 回 (-99%)
+- 期待改善: 追加 +50-100%
+
+### Priority 3: getenv() キャッシング
+
+**現状：**
+- `tiny_reg_scan_max()` で毎回 `getenv()` チェック
+- `static int v = -1` で初回のみ実行（既に最適化済み）
+
+**効果：**
+- 既に実装済み ✅
+
+---
+
+## 📊 期待効果まとめ
+
+| 最適化 | 改善率 | スループット予測 |
+|--------|--------|-----------------|
+| **Baseline (現状)** | - | 2.59M ops/s (18% of system) |
+| Per-class registry | +200-300% | 7.8-10.4M ops/s (54-73%) |
+| 早期終了 | +50-100% | 11.7-20.8M ops/s (82-145%) |
+| **Total** | **+350-700%** | **11.7-20.8M ops/s** 🎯 |
+
+**Goal:** System malloc 同等 (14.31M ops/s) を超える！
+
+---
+
+## 🎯 実装プラン
+
+### Phase 1 (1-2日): Per-class Registry
+
+**変更箇所：**
+1. `core/hakmem_super_registry.h`: 構造体変更
+2. `core/hakmem_super_registry.c`: register/unregister 関数更新
+3. `core/hakmem_tiny_free.inc:917`: スキャンロジック簡素化
+4. `core/tiny_mmap_gate.h:46`: 同上
+
+**実装：**
+```c
+// hakmem_super_registry.h
+#define SUPER_REG_PER_CLASS 4096
+SuperRegEntry g_super_reg_by_class[TINY_NUM_CLASSES][SUPER_REG_PER_CLASS];
+
+// hakmem_tiny_free.inc
+int scan_max = tiny_reg_scan_max();
+int reg_size = g_super_reg_class_size[class_idx];
+for (int i = 0; i < scan_max && i < reg_size; i++) {
+    SuperRegEntry* e = &g_super_reg_by_class[class_idx][i];
+    // ... 既存のロジック（class_idx チェック不要！）
+}
+```
+
+**期待効果:** +200-300% (2.59M → 7.8-10.4M ops/s)
+
+### Phase 2 (1日): 早期終了 + First-fit
+
+**変更箇所：**
+- `core/hakmem_tiny_free.inc:929-941`: 最初の freelist で即 return
+
+**実装：**
+```c
+for (int s = 0; s < reg_cap; s++) {
+    if (ss->slabs[s].freelist) {
+        SlabHandle h = slab_try_acquire(ss, s, self_tid);
+        if (slab_is_valid(&h)) {
+            slab_drain_remote_full(&h);
+            tiny_drain_freelist_to_sll_once(h.ss, h.slab_idx, class_idx);
+            tiny_tls_bind_slab(tls, ss, s);
+            return ss;  // 🚀 即 return！
+        }
     }
-    return malloc_slow_path(size);
 }
 ```
 
-**Actual assembly (estimated):**
-```asm
-malloc:
-    mov    %fs:tcache_offset,%rax   ; Get tcache (TLS)
-    lea    (%rax,%class,8),%rdx     ; &tcache->entries[class]
-    mov    (%rdx),%rax              ; Load head
-    test   %rax,%rax                ; Check NULL
-    je     slow_path                ; Miss -> slow
-    mov    (%rax),%rcx              ; Load next
-    mov    %rcx,(%rdx)              ; Store next as new head
-    ret                             ; Return block (7 instructions!)
+**期待効果:** 追加 +50-100%
+
+---
+
+## 📚 参考
+
+### 既存の分析ドキュメント
+
+- `SLL_REFILL_BOTTLENECK_ANALYSIS.md` (外部AI作成)
+  - superslab_refill の 298 行複雑性を指摘
+  - Priority 3: Registry 線形スキャン (+10-12% と見積もり)
+  - **実際の影響はもっと大きかった** (CPU time 28.51%!)
+
+- `LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md` (外部AI作成)
+  - malloc() エントリーポイントの分岐削減を提案
+  - **既に実装済み** (Option A: Inline TLS cache access)
+  - 効果: 0.46M → 2.59M ops/s (+463%) ✅
+
+### Perf コマンド
+
+```bash
+# Record
+perf record -g --call-graph dwarf -F 999 -o hakmem_perf.data \
+  -- env HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 2 8 128 1024 1 12345 4
+
+# Report (top functions)
+perf report -i hakmem_perf.data --stdio --no-children --sort symbol | head -60
+
+# Annotate (hot instructions)
+perf annotate -i hakmem_perf.data superslab_refill --stdio | \
+  grep -E "^\s+[0-9]+\.[0-9]+" | sort -rn | head -30
 ```
 
-**Why HAKMEM can't match this:**
-1. **Magazine layer adds indirection** - magazine → cache → block (vs tcache → block)
-2. **SuperSlab adds more indirection** - superslab → slab → block
-3. **Size class calculation is complex** - not branchless
-4. **Debug instrumentation** - tiny_debug_ring_record
-5. **Ownership checks** - hak_tiny_owner_slab
-6. **Stack overhead** - saving 6 registers, 88-byte stack frame
+---
+
+## 🎯 結論
+
+**HAKMEM の Larson 性能低下 (-78.4%) は Registry 線形スキャンが原因**
+
+1. ✅ **Root Cause 特定**: superslab_refill が 28.51% CPU time を消費
+2. ✅ **ボトルネック特定**: 262,144 エントリの線形スキャン
+3. ✅ **解決策提案**: Per-class registry (+200-300%)
+
+**次のステップ:** Phase 1 実装 → 2.59M から 7.8-10.4M ops/s へ (+3-4倍!)
 
 ---
 
-## Improvement Recommendations (Prioritized)
-
-### 1. **CRITICAL: Fix superslab_refill bottleneck** (Expected: +50-100%)
-
-**Problem:** 7.25% CPU, called 30% of operations
-
-**Root cause:** Low fast cache capacity (16 slots) + expensive refill
-
-**Solutions (in order):**
-
-#### a) **Increase fast cache capacity**
-- **Current:** 16 slots per class
-- **Target:** 64-256 slots per class (adaptive based on hotness)
-- **Expected:** Reduce miss rate from 30% to 10%
-- **Impact:** 7.25% × (20/30) = **4.8% CPU savings (+18% throughput)**
-
-**Implementation:**
-```c
-// Current
-#define HAKMEM_TINY_FAST_CAP 16
-
-// New (adaptive)
-#define HAKMEM_TINY_FAST_CAP_COLD 16
-#define HAKMEM_TINY_FAST_CAP_WARM 64
-#define HAKMEM_TINY_FAST_CAP_HOT 256
-
-// Set based on allocation rate per class
-if (alloc_rate > 1000/s) use HOT cap
-else if (alloc_rate > 100/s) use WARM cap
-else use COLD cap
-```
-
-#### b) **Increase refill batch size**
-- **Current:** Unknown (likely 64 based on REFILL_COUNT)
-- **Target:** 128-256 blocks per refill
-- **Expected:** Reduce refill frequency by 2-4x
-- **Impact:** 7.25% × 0.5 = **3.6% CPU savings (+14% throughput)**
-
-#### c) **Eliminate memset in refill**
-- **Current:** 1.33% CPU in memset during refill
-- **Target:** Lazy initialization (only zero on first use)
-- **Expected:** Remove 1.33% CPU
-- **Impact:** **+5% throughput**
-
-**Implementation:**
-```c
-// Current: eager memset
-void* superslab_refill() {
-    void* blocks = allocate_slab();
-    memset(blocks, 0, slab_size);  // ← Remove this!
-    return blocks;
-}
-
-// New: lazy memset
-void* malloc() {
-    void* p = fast_cache_pop();
-    if (p && needs_zero(p)) {
-        memset(p, 0, size);  // Only zero on demand
-    }
-    return p;
-}
-```
-
-#### d) **Optimize refill path**
-- Profile `superslab_refill` internals
-- Reduce allocations per refill
-- Batch operations
-- **Expected:** Reduce refill cost by 30%
-- **Impact:** 7.25% × 0.3 = **2.2% CPU savings (+8% throughput)**
-
-**Combined expected improvement: +45-60% throughput**
-
----
-
-### 2. **HIGH: Simplify fast path** (Expected: +30-50%)
-
-**Problem:** 17,366 instructions/op vs mimalloc's 610 (28x overhead)
-
-**Target:** Reduce to <5,000 instructions/op (match System tcache's ~500)
-
-**Solutions:**
-
-#### a) **Inline aggressively**
-- Mark all hot functions `__attribute__((always_inline))`
-- Reduce function call overhead (save/restore registers)
-- **Expected:** -20% instructions (+5% throughput)
-
-**Implementation:**
-```c
-static inline __attribute__((always_inline))
-void* hak_tiny_alloc_fast(size_t size) {
-    // ... fast path logic ...
-}
-```
-
-#### b) **Branchless size class calculation**
-- **Current:** Multiple branches for size class
-- **Target:** Lookup table or branchless arithmetic
-- **Expected:** -5% instructions (+2% throughput)
-
-**Implementation:**
-```c
-// Current (branchy)
-int size_to_class(size_t sz) {
-    if (sz <= 16) return 0;
-    if (sz <= 32) return 1;
-    if (sz <= 64) return 2;
-    if (sz <= 128) return 3;
-    // ...
-}
-
-// New (branchless)
-static const uint8_t size_class_table[129] = {
-    0,0,0,...,0,  // 1-16
-    1,1,...,1,    // 17-32
-    2,2,...,2,    // 33-64
-    3,3,...,3     // 65-128
-};
-
-static inline int size_to_class(size_t sz) {
-    return (sz <= 128) ? size_class_table[sz]
-                       : size_to_class_large(sz);
-}
-```
-
-#### c) **Pack TLS structure**
-- **Current:** Scattered TLS variables
-- **Target:** Single cache-line TLS struct (64 bytes)
-- **Expected:** -30% cache misses (+10% throughput)
-
-**Implementation:**
-```c
-// Current (scattered)
-__thread void* g_fast_cache[16];
-__thread magazine_t g_magazine;
-__thread int g_class;
-
-// New (packed)
-struct tiny_tls_cache {
-    void* fast_cache[8];  // Hot data first
-    uint32_t counts[8];
-    magazine_t* magazine; // Cold data
-    // ... fit in 64 bytes
-} __attribute__((aligned(64)));
-
-__thread struct tiny_tls_cache g_tls_cache;
-```
-
-#### d) **Remove debug instrumentation**
-- **Current:** tiny_debug_ring_record in hot path
-- **Target:** Compile-time conditional
-- **Expected:** -5% instructions (+2% throughput)
-
-**Implementation:**
-```c
-#if HAKMEM_DEBUG_RING
-    tiny_debug_ring_record(...);
-#endif
-```
-
-#### e) **Simplify ownership check**
-- **Current:** hak_tiny_owner_slab (0.12% CPU)
-- **Target:** Store owner in block header or remove check
-- **Expected:** -3% instructions (+1% throughput)
-
-**Combined expected improvement: +20-25% throughput**
-
----
-
-### 3. **MEDIUM: Reduce L1 cache misses** (Expected: +20-30%)
-
-**Problem:** 659 L1 misses/op vs mimalloc's 11.5 (57x worse)
-
-**Target:** Reduce to <100 misses/op
-
-**Solutions:**
-
-#### a) **Pack hot TLS data in one cache line**
-- **Current:** Scattered across many cache lines
-- **Target:** Fast path data in 64 bytes
-- **Expected:** -60% TLS cache misses (+10% throughput)
-
-#### b) **Prefetch superslab metadata**
-- **Current:** Cold cache misses on refill
-- **Target:** Prefetch 1-2 cache lines ahead
-- **Expected:** -30% refill cache misses (+5% throughput)
-
-**Implementation:**
-```c
-void superslab_refill() {
-    superslab_t* ss = get_superslab();
-    __builtin_prefetch(ss, 0, 3);     // Prefetch for read
-    __builtin_prefetch(&ss->bitmap, 0, 3);
-    // ... continue refill ...
-}
-```
-
-#### c) **Align structures to cache lines**
-- **Current:** Structures may span cache lines
-- **Target:** 64-byte alignment for hot structures
-- **Expected:** -10% cache misses (+3% throughput)
-
-**Implementation:**
-```c
-struct tiny_fast_cache {
-    void* blocks[64];
-    uint32_t count;
-    uint32_t capacity;
-} __attribute__((aligned(64)));
-```
-
-#### d) **Remove debug ring buffer**
-- **Current:** 50 cache misses/op from debug ring
-- **Target:** Disable in production builds
-- **Expected:** -8% cache misses (+3% throughput)
-
-**Combined expected improvement: +21-26% throughput**
-
----
-
-### 4. **LOW: Reduce initialization overhead** (Expected: +5-10%)
-
-**Problem:** 1.33% CPU in memset
-
-**Solution:** Lazy initialization (covered in #1c above)
-
----
-
-## Expected Outcomes
-
-### Scenario 1: Quick Fixes Only (Week 1)
-**Changes:**
-- Increase FAST_CAP to 64
-- Increase refill batch to 128
-- Lazy initialization (remove memset)
-
-**Expected:**
-- Reduce refill frequency: +18%
-- Reduce refill cost: +8%
-- Remove memset: +5%
-
-**Total: 2.62M → 3.44M ops/s (+31%)**
-**Still 4.9x slower than mimalloc**
-
----
-
-### Scenario 2: Incremental Optimizations (Week 2-3)
-**Changes:**
-- All from Scenario 1
-- Inline hot functions
-- Branchless size class
-- Pack TLS structure
-- Remove debug code
-
-**Expected:**
-- From Scenario 1: +31%
-- Fast path simplification: +20%
-- Cache locality: +15%
-
-**Total: 2.62M → 4.85M ops/s (+85%)**
-**Still 3.5x slower than mimalloc**
-
----
-
-### Scenario 3: Aggressive Refactor (Week 4-6)
-**Changes:**
-- **Option A:** Adopt tcache-style design for tiny
-  - Ultra-simple fast path (5-10 instructions)
-  - Direct TLS array, no magazine layer
-  - Expected: Match System malloc (~100-130 M ops/s for tiny)
-  - **Total: 2.62M → ~80M ops/s (+30x)** 🚀
-
-- **Option B:** Hybrid approach
-  - Tiny: tcache-style (simple)
-  - Mid-Large: Keep current design (working well, +171%)
-  - Expected: Best of both worlds
-  - **Total: 2.62M → ~50M ops/s (+19x)** 🚀
-
----
-
-### Scenario 4: Best Case (Full Redesign)
-**Changes:**
-- Ultra-simple tcache-style fast path for tiny
-- Zero-overhead hit (5-10 instructions)
-- 99% hit rate (like System tcache)
-- Lazy initialization
-- No debug overhead
-
-**Expected:**
-- Match System malloc for tiny: ~130 M ops/s
-- **Total: 2.62M → 130M ops/s (+50x)** 🚀🚀🚀
-
----
-
-## Concrete Action Plan
-
-### Phase 1: Quick Wins (1 week)
-**Goal:** +30% improvement to prove approach
-
-1. ✅ Increase `HAKMEM_TINY_FAST_CAP` from 16 to 64
-   ```bash
-   # In core/hakmem_tiny.h
-   #define HAKMEM_TINY_FAST_CAP 64
-   ```
-
-2. ✅ Increase `HAKMEM_TINY_REFILL_COUNT_HOT` from 64 to 128
-   ```bash
-   # In ENV_VARS or code
-   HAKMEM_TINY_REFILL_COUNT_HOT=128
-   ```
-
-3. ✅ Remove eager memset in superslab_refill
-   ```c
-   // In core/hakmem_tiny_superslab.c
-   // Comment out or remove memset call
-   ```
-
-4. ✅ Rebuild and benchmark
-   ```bash
-   make clean && make
-   ./larson_hakmem 2 8 128 1024 1 12345 4
-   ```
-
-**Expected:** 2.62M → 3.44M ops/s
-
----
-
-### Phase 2: Fast Path Optimization (1-2 weeks)
-**Goal:** +50% cumulative improvement
-
-1. ✅ Inline all hot functions
-   - `hak_tiny_alloc_fast`
-   - `hak_tiny_free_fast`
-   - `size_to_class`
-
-2. ✅ Implement branchless size_to_class
-
-3. ✅ Pack TLS structure into single cache line
-
-4. ✅ Remove debug instrumentation from release builds
-
-5. ✅ Measure instruction count reduction
-   ```bash
-   perf stat -e instructions ./larson_hakmem ...
-   # Target: <30B instructions (down from 45.5B)
-   ```
-
-**Expected:** 2.62M → 4.85M ops/s
-
----
-
-### Phase 3: Algorithm Evaluation (1 week)
-**Goal:** Decide on redesign vs incremental
-
-1. ✅ **Benchmark System malloc**
-   ```bash
-   # Remove LD_PRELOAD, use system malloc
-   ./larson_system 2 8 128 1024 1 12345 4
-   # Confirm: ~130 M ops/s
-   ```
-
-2. ✅ **Study tcache implementation**
-   ```bash
-   # Read glibc tcache source
-   less /usr/src/glibc/malloc/malloc.c
-   # Focus on tcache_put, tcache_get
-   ```
-
-3. ✅ **Prototype simple tcache**
-   - Implement 64-entry TLS array per class
-   - Simple push/pop (5-10 instructions)
-   - Benchmark in isolation
-
-4. ✅ **Compare approaches**
-   - Incremental: 4.85M ops/s (realistic)
-   - Tcache: ~80M ops/s (aspirational)
-   - Hybrid: ~50M ops/s (balanced)
-
-**Decision:** Choose between incremental or redesign
-
----
-
-### Phase 4: Implementation (2-4 weeks)
-**Goal:** Achieve target performance
-
-**If Incremental:**
-- Continue optimizing refill path
-- Improve cache locality
-- Target: 5-10 M ops/s
-
-**If Tcache Redesign:**
-- Implement ultra-simple fast path
-- Keep slow path for refills
-- Target: 50-100 M ops/s
-
-**If Hybrid:**
-- Tcache for tiny (≤1KB)
-- Current design for mid-large (already fast)
-- Target: 50-80 M ops/s overall
-
----
-
-## Conclusion
-
-### Root Causes (Confirmed)
-
-1. **PRIMARY:** `superslab_refill` bottleneck (7.25% CPU)
-   - Caused by low fast cache capacity (16 slots)
-   - Expensive refill (includes memset)
-   - High miss rate (30%)
-
-2. **SECONDARY:** Instruction overhead (28x per-op)
-   - Complex fast path (17,366 instructions/op)
-   - Magazine layer indirection
-   - Debug instrumentation
-
-3. **TERTIARY:** L1 cache misses (57x per-op)
-   - Scattered TLS variables
-   - Poor spatial locality
-   - Refill cache pollution
-
-### Recommended Path Forward
-
-**Short term (1-2 weeks):**
-- Implement quick wins (Phase 1-2)
-- Target: +50% improvement (2.62M → 4M ops/s)
-- Validate approach with data
-
-**Medium term (3-4 weeks):**
-- Evaluate redesign options (Phase 3)
-- Decide: incremental vs tcache vs hybrid
-- Begin implementation (Phase 4)
-
-**Long term (5-8 weeks):**
-- Complete chosen approach
-- Target: 10x improvement (2.62M → 26M ops/s minimum)
-- Aspirational: 50x improvement (2.62M → 130M ops/s)
-
-### Success Metrics
-
-| Milestone | Target | Status |
-|-----------|--------|--------|
-| Phase 1 Quick Wins | 3.44M ops/s (+31%) | ⏳ Pending |
-| Phase 2 Optimizations | 4.85M ops/s (+85%) | ⏳ Pending |
-| Phase 3 Evaluation | Decision made | ⏳ Pending |
-| Phase 4 Final | 26M ops/s (+10x) | ⏳ Pending |
-| Stretch Goal | 130M ops/s (+50x) | 🎯 Aspirational |
-
----
-
-**Analysis completed:** 2025-11-05
-**Next action:** Implement Phase 1 quick wins and measure results
+**Date**: 2025-11-05
+**Measured with**: perf record -F 999, larson_hakmem threads=4
+**Status**: Root cause identified, solution designed ✅