hakmem/benchmarks/results/RESULTS.md

# HAKMEM vs System Malloc Benchmark Results

**Date**: 2025-10-27
**HAKMEM Version**: Phase 8.3 (ACE Step 1-3)
**Platform**: Linux 5.15.167.4-microsoft-standard-WSL2
**Compiler**: GCC with `-O3 -march=native`

---

## Benchmark Overview

### Test Patterns (6 total)

| Test | Pattern | Purpose |
|------|---------|---------|
| **Test 1: Sequential LIFO** | alloc[0..99] → free[99..0] (reverse order) | Best case: takes full advantage of the freelist's LIFO behavior |
| **Test 2: Sequential FIFO** | alloc[0..99] → free[0..99] (same order) | Worst case: measures freelist fragmentation under FIFO frees |
| **Test 3: Random Order Free** | alloc[0..99] → free[random] (random order) | Realistic: cache misses and fragmentation |
| **Test 4: Interleaved Alloc/Free** | alloc → free → alloc → free (alternating) | Fast churn: measures the effect of the magazine cache |
| **Test 5: Mixed Sizes** | 8B, 16B, 32B, 64B mixed | Multi-size: cost of switching between size classes |
| **Test 6: Long-lived vs Short-lived** | 50% retained, the rest churned | Memory pressure: performance under high load |
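
To make the difference between Test 1 and Test 2 concrete, here is a minimal sketch of one batch in each free order. It is not the benchmark's actual source: `run_batch`, `free_order_t`, and `BATCH` are hypothetical names, and plain `malloc`/`free` stand in for whichever allocator is under test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BATCH 100  /* matches alloc[0..99] in the table above */

/* Hypothetical selector for the order in which a batch is freed. */
typedef enum { FREE_LIFO, FREE_FIFO } free_order_t;

/* Allocate BATCH blocks of `size` bytes, then free them in the requested
 * order. Returns the number of blocks that were allocated and freed. */
static size_t run_batch(size_t size, free_order_t order) {
    void *blocks[BATCH];
    size_t n = 0;

    assert(size > 0);
    for (size_t i = 0; i < BATCH; i++) {
        blocks[i] = malloc(size);
        if (blocks[i] == NULL) break;   /* stop early on allocation failure */
        n++;
    }
    if (order == FREE_LIFO) {
        /* Test 1: newest block first (99 -> 0) */
        for (size_t i = n; i > 0; i--) free(blocks[i - 1]);
    } else {
        /* Test 2: oldest block first (0 -> 99) */
        for (size_t i = 0; i < n; i++) free(blocks[i]);
    }
    return n;
}
```

A throughput figure would come from calling `run_batch` in a timed loop and dividing the total number of alloc/free operations by the elapsed time.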

### Tested Size Classes
- **16B**: Tiny pool (8-64B)
- **32B**: Tiny pool (8-64B)
- **64B**: Tiny pool (8-64B)
- **128B**: MF2 pool (65-2048B)

---

## Results Summary

### 🏆 Overall Winner by Size Class

| Size Class | LIFO | FIFO | Random | Interleaved | Mixed | Long-lived | **Total Winner** |
|------------|------|------|--------|-------------|-------|------------|------------------|
| **16B** | System | System | System | System | - | System | **System (5/5)** |
| **32B** | System | System | System | System | - | System | **System (5/5)** |
| **64B** | System | System | System | System | - | System | **System (5/5)** |
| **128B** | **HAKMEM** | **HAKMEM** | **HAKMEM** | **HAKMEM** | - | **HAKMEM** | **HAKMEM (5/5)** |
| **Mixed** | - | - | - | - | System | - | **System (1/1)** |

---

## Detailed Results

### 16 Bytes (Tiny Pool)

| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| LIFO | 212.24 M ops/s | **404.88 M ops/s** | System | **+90.7%** |
| FIFO | 210.90 M ops/s | **402.95 M ops/s** | System | **+91.0%** |
| Random | 109.91 M ops/s | **148.50 M ops/s** | System | **+35.1%** |
| Interleaved | 204.28 M ops/s | **405.50 M ops/s** | System | **+98.5%** |
| Long-lived | 208.82 M ops/s | **409.17 M ops/s** | System | **+95.9%** |

**Analysis**: System malloc dominates at 16B, running at roughly twice HAKMEM's throughput in every pattern except Random (+35.1%).

---

### 32 Bytes (Tiny Pool)

| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| LIFO | 210.79 M ops/s | **401.61 M ops/s** | System | **+90.5%** |
| FIFO | 211.48 M ops/s | **401.52 M ops/s** | System | **+89.9%** |
| Random | 110.03 M ops/s | **148.94 M ops/s** | System | **+35.4%** |
| Interleaved | 203.77 M ops/s | **403.95 M ops/s** | System | **+98.3%** |
| Long-lived | 208.39 M ops/s | **405.39 M ops/s** | System | **+94.5%** |

**Analysis**: As with 16B, System malloc dominates.

---

### 64 Bytes (Tiny Pool)

| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| LIFO | 210.56 M ops/s | **400.45 M ops/s** | System | **+90.2%** |
| FIFO | 210.51 M ops/s | **386.92 M ops/s** | System | **+83.8%** |
| Random | 110.41 M ops/s | **147.07 M ops/s** | System | **+33.2%** |
| Interleaved | 204.72 M ops/s | **404.72 M ops/s** | System | **+97.7%** |
| Long-lived | 207.96 M ops/s | **403.51 M ops/s** | System | **+94.0%** |

**Analysis**: System malloc stays ahead even at the largest Tiny pool size.

---

### 128 Bytes (MF2 Pool)

| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| LIFO | **209.20 M ops/s** | 166.98 M ops/s | HAKMEM | **+25.3%** |
| FIFO | **209.40 M ops/s** | 171.44 M ops/s | HAKMEM | **+22.1%** |
| Random | **109.41 M ops/s** | 71.21 M ops/s | HAKMEM | **+53.6%** |
| Interleaved | **203.93 M ops/s** | 185.41 M ops/s | HAKMEM | **+10.0%** |
| Long-lived | **206.51 M ops/s** | 182.92 M ops/s | HAKMEM | **+12.9%** |

**Analysis**: 🎉 **HAKMEM wins every test!** At 128B, the MF2 pool (65-2048B) clearly outperforms System malloc, with the largest lead in the Random pattern (**+53.6%**).

---

### Mixed Sizes (8B, 16B, 32B, 64B)

| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| Mixed | 205.10 M ops/s | **406.60 M ops/s** | System | **+98.2%** |

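**Analysis**: System malloc leads on mixed sizes. Size-class switching cost is the suspected factor, although the gap (+98.2%) is close to the single-size Tiny pool results.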

---

## Overall Assessment

### 🏅 Performance Summary

| Allocator | Wins | Avg Speedup | Best Result | Worst Result |
|-----------|------|-------------|-------------|--------------|
| **HAKMEM** | 5/21 tests | - | **+53.6%** (128B Random) | **-49.6%** (16B Interleaved; System +98.5%) |
| **System** | 16/21 tests | **+81.3%** (Tiny pool avg) | **+98.5%** (16B Interleaved) | **-34.9%** (128B Random; HAKMEM +53.6%) |
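
The Gap values in this report appear to be the winner's throughput relative to the loser's. The sketch below shows that calculation, plus the loser's deficit relative to the winner, which is a smaller number for the same pair of results. The function names are illustrative and not part of HAKMEM.

```c
#include <assert.h>

/* Winner's advantage over the loser, in percent: (winner / loser - 1) * 100. */
static double gap_percent(double winner_mops, double loser_mops) {
    assert(loser_mops > 0.0);
    return (winner_mops / loser_mops - 1.0) * 100.0;
}

/* Loser's deficit relative to the winner, in percent: (loser / winner - 1) * 100. */
static double deficit_percent(double winner_mops, double loser_mops) {
    assert(winner_mops > 0.0);
    return (loser_mops / winner_mops - 1.0) * 100.0;
}
```

For 16B Interleaved (405.50 vs 204.28 M ops/s), `gap_percent` gives about +98.5% while `deficit_percent` gives about -49.6%. Both describe the same measurement, so it matters which one a table cell quotes.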

### 🔍 Key Insights

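1. **System malloc dominates the Tiny pool (8-64B)**
   - Likely cause: the system allocator's thread-local cache is extremely fast (on a glibc-based system this is the tcache, a design similar to the thread caches in tcmalloc and jemalloc)
   - HAKMEM is stable at about 200M ops/sec (about 110M in the Random pattern)
   - System reaches 400M+ ops/sec (about 148M in the Random pattern)

2. **HAKMEM leads in the MF2 pool (65-2048B)**
   - Wins every pattern at 128B (+10.0% to +53.6%)
   - Especially strong in the Random pattern (+53.6%)
   - MF2's page-based allocation appears to pay off

3. **HAKMEM strengths**
   - Stable throughput at medium sizes (128B+)
   - Strong on random-order free patterns
   - Memory efficiency (further improvement planned with Phase 8.3 ACE)

4. **HAKMEM weaknesses**
   - About half the speed of System malloc at small sizes (8-64B)
   - The Tiny pool is not yet sufficiently optimized
   - The magazine cache provides only limited benefit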

---

## ACE (Agentic Context Engineering) Status

### Phase 8.3 Implementation Status

✅ **Steps 1-3 complete (Current)**:
- SuperSlab `lg_size` support (variable size, 1MB ↔ 2MB)
- ACE tick function (promotion/demotion logic)
- Counter tracking (alloc_count, live_blocks, hot_score)

⏳ **Steps 4-5 not yet implemented**:
- ε-greedy bandit (batch/threshold tuning)
- PGO regeneration
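
Since the ε-greedy bandit in Step 4 is not implemented yet, the following is only a generic sketch of the technique, not HAKMEM's design. It assumes each arm is one candidate setting (for example a batch size) and the reward is some measured score such as throughput. `bandit_t`, `N_ARMS`, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define N_ARMS 4   /* hypothetical: four candidate settings */

typedef struct {
    double   mean_reward[N_ARMS];  /* running mean reward per arm */
    uint32_t pulls[N_ARMS];        /* number of updates per arm */
    uint32_t rng_state;            /* xorshift32 state, must be non-zero */
    uint32_t epsilon_permille;     /* exploration probability in 1/1000 units */
} bandit_t;

/* Small deterministic PRNG so the sketch needs no library RNG. */
static uint32_t xorshift32(uint32_t *s) {
    uint32_t x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *s = x;
    return x;
}

static void bandit_init(bandit_t *b, uint32_t seed, uint32_t epsilon_permille) {
    assert(seed != 0 && epsilon_permille <= 1000);
    for (int i = 0; i < N_ARMS; i++) {
        b->mean_reward[i] = 0.0;
        b->pulls[i] = 0;
    }
    b->rng_state = seed;
    b->epsilon_permille = epsilon_permille;
}

/* With probability epsilon pick a random arm (explore),
 * otherwise pick the arm with the best mean so far (exploit). */
static int bandit_select(bandit_t *b) {
    if (xorshift32(&b->rng_state) % 1000 < b->epsilon_permille)
        return (int)(xorshift32(&b->rng_state) % N_ARMS);
    int best = 0;
    for (int i = 1; i < N_ARMS; i++)
        if (b->mean_reward[i] > b->mean_reward[best]) best = i;
    return best;
}

/* Fold one observed reward into the arm's running mean. */
static void bandit_update(bandit_t *b, int arm, double reward) {
    assert(arm >= 0 && arm < N_ARMS);
    b->pulls[arm]++;
    b->mean_reward[arm] += (reward - b->mean_reward[arm]) / b->pulls[arm];
}
```

In an allocator this selection would presumably run on the tick path rather than per allocation, so that the tuning logic itself adds nothing to the hot path.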

### ACE Stats (from HAKMEM run)

| Class | Current Size | Target Size | Hot Score | Allocs | Live Blocks |
|-------|-------------|-------------|-----------|--------|-------------|
| 8B | 1MB | 1MB | 1000 | 3.15M | 25.0M |
| 16B | 1MB | 1MB | 1000 | 3.14M | 475.0M |
| 24B | 1MB | 1MB | 1000 | 3.14M | 475.0M |
| 32B | 1MB | 1MB | 1000 | 3.15M | 475.0M |
| 40B | 1MB | 1MB | 1000 | 15.47M | 450.0M |

---

## Next Actions

### Priority: High
1. **Speed up the Tiny pool**
   - Improve the magazine cache
   - Optimize the thread-local cache
   - Make SuperSlab allocation lighter

2. **Complete ACE Phase 8.3**
   - Step 4: implement the ε-greedy bandit
   - Step 5: regenerate the PGO profile
   - Measure the RSS reduction

### Priority: Medium
3. **Optimize the mixed-size pattern**
   - Reduce size-class switching cost
   - Introduce size-class prediction

---

## Conclusion

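**Current Status**: HAKMEM outperforms System malloc in the MF2 pool (measured at 128B) but runs at about half its speed in the Tiny pool (8-64B).

**Next Goal**: Double Tiny pool throughput to reach parity with System malloc.

**Long-term Vision**: An allocator that beats System malloc in every size class while also being more memory-efficient.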

---

## Historical Performance (HAKMEM Step 3d vs mimalloc)

### 🏆 Best Performance Record (HAKMEM Step 3d)

**Top 9 Results**:
1. Test 6 (128B Long-lived): **313.27 M ops/sec** ← 🥇 NEW RECORD!
2. Test 6 (16B Long-lived): 312.59 M ops/sec
3. Test 6 (64B Long-lived): 312.24 M ops/sec
4. Test 6 (32B Long-lived): 310.88 M ops/sec
5. Test 4 (32B Interleaved): 310.38 M ops/sec
6. Test 4 (64B Interleaved): 309.94 M ops/sec
7. Test 4 (16B Interleaved): 309.85 M ops/sec
8. Test 4 (128B Interleaved): 308.88 M ops/sec
9. Test 2 (32B FIFO): 307.53 M ops/sec

### 🎯 HAKMEM vs mimalloc (Step 3d)

| Metric | HAKMEM Step 3d | mimalloc | Winner | Gap |
|--------|----------------|----------|--------|-----|
| **Performance** | 313.27 M ops/sec | 307.00 M ops/sec | **HAKMEM** | **+2.0%** 🎉 |
| **Memory (RSS)** | 13,208 KB (13.2 MB) | 4,036 KB (4.0 MB) | **mimalloc** | **+227% (3.27x)** ⚠️ |

**Analysis**:
- ✅ **Speed**: HAKMEM beats mimalloc by **+2.0%** (313.27 vs 307.00 M ops/sec)
- ⚠️ **Memory**: HAKMEM uses **3.27x** the memory of mimalloc (+9.2 MB)

### 🎯 Performance vs Memory Trade-off

| Version | Speed (128B) | RSS Memory | Speed per MB of RSS |
|---------|-------------|------------|---------------------|
| **mimalloc** | 307.0 M ops/s | 4.0 MB | **76.75 M ops/s per MB** 🏆 |
| **HAKMEM Step 3d** | 313.3 M ops/s | 13.2 MB | 23.73 M ops/s per MB |
| **HAKMEM Phase 8.3** | 206.5 M ops/s | TBD | TBD |

**Goal (Phase 8.3 ACE)**: Reduce RSS from 13.2 MB to 4-6 MB while sustaining 300M+ ops/sec
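
The last column of the trade-off table is throughput divided by RSS in MB. A minimal sketch of that ratio follows; `mops_per_mb` is an illustrative name, not a HAKMEM function.

```c
#include <assert.h>

/* Throughput (M ops/s) per MB of resident memory. */
static double mops_per_mb(double mops_per_sec, double rss_mb) {
    assert(rss_mb > 0.0);
    return mops_per_sec / rss_mb;
}
```

Applied to the Phase 8.3 goal, 300 M ops/s at 4-6 MB works out to between 50 and 75 M ops/s per MB, compared with 76.75 for mimalloc in the table above.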

---

## Regression Analysis: Phase 8.3 vs Step 3d

### 128B Long-lived Test

| Version | Throughput | vs Step 3d | vs mimalloc |
|---------|------------|-----------|-------------|
| **HAKMEM Step 3d** (Best) | 313.27 M ops/s | baseline | **+2.0%** ✅ |
| **HAKMEM Phase 8.3** (Current) | 206.51 M ops/s | **-34.1%** ⚠️ | **-32.7%** ⚠️ |
| **mimalloc** | 307.00 M ops/s | -2.0% | baseline |
| **System malloc** | 182.92 M ops/s | -41.6% | -40.4% |

**Regression**: Phase 8.3 is **34.1% slower** than Step 3d!

### 🔍 Root Cause Analysis

The cause is the ACE (Agentic Context Engineering) counter tracking that Phase 8.3 added to the hot path.

#### 1. **ACE Counter Tracking on Every Allocation** (hakmem_tiny.c:1251-1264)
```c
g_ss_ace[class_idx].alloc_count++;    // +1 write
g_ss_ace[class_idx].live_blocks++;    // +1 write
if ((g_ss_ace[class_idx].alloc_count & 0x3FFFu) == 0) { // +1 load, +1 AND, +1 compare
    hak_tiny_superslab_ace_tick(...);
}
```
- **Impact**: 2 writes + 3 ops per allocation
- **Benchmark**: 200M allocations = **400M extra writes**

#### 2. **ACE Counter Tracking on Every Free** (hakmem_tiny.c:1336-1338, 1355-1357)
```c
if (g_ss_ace[ss->size_class].live_blocks > 0) {  // +1 load, +1 compare
    g_ss_ace[ss->size_class].live_blocks--;       // +1 write
}
```
- **Impact**: 1 load + 1 compare + 1 write per free
- **Benchmark**: 200M frees = **200M extra writes** (plus a load and a compare on each free)

#### 3. **Registry Lookup Overhead** (hakmem_super_registry.h:52-74)
```c
for (int lg = 20; lg <= 21; lg++) {  // Try both 1MB and 2MB
    // ... probe loop ...
    if (b == base && e->lg_size == lg) return e->ss;  // Extra field check
}
```
- **Impact**: Doubles worst-case lookup time, extra lg_size comparisons on every free

#### 4. **Memory Pressure**
- Accesses to `g_ss_ace[class_idx]` put pressure on the cache
- A write to the global array happens on every call

### 💡 Solution Options

1. **Option A: Sampling-based Tracking**
   - Update the counters with probability 1/256 only (statistically sufficient)
   - Expected: ~1% overhead (313M → 310M ops/s)

2. **Option B: Per-TLS Counters**
   - Use thread-local counters to make the writes cheaper
   - Aggregate at tick time

3. **Option C: Conditional ACE (compile-time flag)**
   - Allow tracking to be compiled out with `#ifdef HAKMEM_ACE_ENABLE`
   - ACE off in production; ACE on only when memory footprint is the priority

4. **Option D: ACE v2 - Lazy Observation**
   - Count only at magazine refill/spill (the existing slow path)
   - Leave the alloc/free hot path completely untouched
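
As a rough illustration of how Options B and C could combine, the sketch below keeps a per-thread delta on the hot path and folds it into a shared total only every 16384 allocations, with the whole hook compiled out when `HAKMEM_ACE_ENABLE` is undefined. Apart from that flag name and the `0x3FFF` interval, everything here (`ace_note_alloc`, `ace_flush`, `g_alloc_total`, `t_alloc_delta`, `N_CLASSES`) is hypothetical and not taken from the HAKMEM source. Whether it recovers the lost throughput would need to be measured.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Defined here so the sketch tracks by default; in a real build the flag
 * would normally come from the compiler command line (-DHAKMEM_ACE_ENABLE). */
#define HAKMEM_ACE_ENABLE 1

#define N_CLASSES  8        /* hypothetical number of size classes */
#define FLUSH_MASK 0x3FFFu  /* flush every 16384 allocations per thread */

/* Shared totals, touched only on flush (the slow path). */
static _Atomic uint64_t g_alloc_total[N_CLASSES];

/* Per-thread pending counts: plain increments, no shared cache line. */
static _Thread_local uint32_t t_alloc_delta[N_CLASSES];

/* Move this thread's pending count into the shared total. */
static void ace_flush(int class_idx) {
    assert(class_idx >= 0 && class_idx < N_CLASSES);
    atomic_fetch_add_explicit(&g_alloc_total[class_idx],
                              t_alloc_delta[class_idx],
                              memory_order_relaxed);
    t_alloc_delta[class_idx] = 0;
}

/* Hot-path hook: one thread-local increment and one masked compare. */
static inline void ace_note_alloc(int class_idx) {
#ifdef HAKMEM_ACE_ENABLE
    if ((++t_alloc_delta[class_idx] & FLUSH_MASK) == 0)
        ace_flush(class_idx);
#else
    (void)class_idx;  /* tracking compiled out (Option C) */
#endif
}
```

The shared total lags the true count by up to 16383 allocations per thread, which seems acceptable for a heuristic that only drives promotion and demotion decisions at tick time.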

---

## Raw Data

- HAKMEM Phase 8.3: `benchmarks/hakmem_result.txt`
- System malloc: `benchmarks/system_result.txt`
- HAKMEM Step 3d: (Historical data, referenced above)