hakmem/benchmarks/results/RESULTS.md
# HAKMEM vs System Malloc Benchmark Results
**Date**: 2025-10-27
**HAKMEM Version**: Phase 8.3 (ACE Step 1-3)
**Platform**: Linux 5.15.167.4-microsoft-standard-WSL2
**Compiler**: GCC with `-O3 -march=native`
---
## Benchmark Overview
### Test Patterns (6 total)
| Test | Pattern | Purpose |
|------|---------|------|
| **Test 1: Sequential LIFO** | alloc[0..99] → free[99..0] (reverse order) | Best case: makes full use of the freelist's LIFO behavior |
| **Test 2: Sequential FIFO** | alloc[0..99] → free[0..99] (same order) | Worst case: measures freelist fragmentation under FIFO frees |
| **Test 3: Random Order Free** | alloc[0..99] → free[random] (random order) | Realistic: cache misses and fragmentation |
| **Test 4: Interleaved Alloc/Free** | alloc → free → alloc → free (alternating) | Fast churn: measures the effect of the magazine cache |
| **Test 5: Mixed Sizes** | 8B, 16B, 32B, 64B mixed | Multi-size: cost of switching between size classes |
| **Test 6: Long-lived vs Short-lived** | 50% retained, the rest churned | Memory pressure: performance under heavy load |
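
The benchmark source itself is not included in this report, so the following is only a minimal sketch of how Tests 1, 2 and 4 from the table could be driven. The function names (`run_lifo`, `run_fifo`, `run_interleaved`, `mops_per_sec`) and the `g_sink` variable are hypothetical; only the batch size of 100 and the free orders come from the table.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BATCH 100  /* alloc[0..99], as in the pattern table */

/* Writing each pointer to a volatile keeps the compiler from
 * optimizing malloc/free pairs away at -O3. */
static volatile uintptr_t g_sink;

static void fill(void **blk, size_t size) {
    for (int i = 0; i < BATCH; i++) {
        blk[i] = malloc(size);
        g_sink = (uintptr_t)blk[i];
    }
}

/* Test 1 (LIFO): free in reverse allocation order. */
static uint64_t run_lifo(size_t size, int rounds) {
    void *blk[BATCH];
    for (int r = 0; r < rounds; r++) {
        fill(blk, size);
        for (int i = BATCH - 1; i >= 0; i--) free(blk[i]);
    }
    return 2u * BATCH * (uint64_t)rounds;  /* one alloc + one free per block */
}

/* Test 2 (FIFO): free in the same order as allocation. */
static uint64_t run_fifo(size_t size, int rounds) {
    void *blk[BATCH];
    for (int r = 0; r < rounds; r++) {
        fill(blk, size);
        for (int i = 0; i < BATCH; i++) free(blk[i]);
    }
    return 2u * BATCH * (uint64_t)rounds;
}

/* Test 4 (Interleaved): every allocation is freed immediately. */
static uint64_t run_interleaved(size_t size, int rounds) {
    for (int i = 0; i < rounds * BATCH; i++) {
        void *p = malloc(size);
        g_sink = (uintptr_t)p;
        free(p);
    }
    return 2u * BATCH * (uint64_t)rounds;
}

/* Throughput in M ops/s, the unit used in the result tables. */
static double mops_per_sec(uint64_t ops, double seconds) {
    return (double)ops / seconds / 1e6;
}
```

Each runner returns the number of operations it performed, so a caller can time the call and pass both values to `mops_per_sec`.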
### Test Size Classes
- **16B**: Tiny pool (8-64B)
- **32B**: Tiny pool (8-64B)
- **64B**: Tiny pool (8-64B)
- **128B**: MF2 pool (65-2048B)
---
## Results Summary
### 🏆 Overall Winner by Size Class
| Size Class | LIFO | FIFO | Random | Interleaved | Mixed | Long-lived | **Total Winner** |
|------------|------|------|--------|-------------|-------|------------|------------------|
| **16B** | System | System | System | System | - | System | **System (5/5)** |
| **32B** | System | System | System | System | - | System | **System (5/5)** |
| **64B** | System | System | System | System | - | System | **System (5/5)** |
| **128B** | **HAKMEM** | **HAKMEM** | **HAKMEM** | **HAKMEM** | - | **HAKMEM** | **HAKMEM (5/5)** |
| **Mixed** | - | - | - | - | System | - | **System (1/1)** |
---
## Detailed Results
### 16 Bytes (Tiny Pool)
| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| LIFO | 212.24 M ops/s | **404.88 M ops/s** | System | **+90.7%** |
| FIFO | 210.90 M ops/s | **402.95 M ops/s** | System | **+91.0%** |
| Random | 109.91 M ops/s | **148.50 M ops/s** | System | **+35.1%** |
| Interleaved | 204.28 M ops/s | **405.50 M ops/s** | System | **+98.5%** |
| Long-lived | 208.82 M ops/s | **409.17 M ops/s** | System | **+95.9%** |

**Analysis**: System malloc dominates at 16B, running at roughly twice HAKMEM's speed on every pattern except Random.

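
The Gap column in these tables appears to be the winner's throughput relative to the loser's. A minimal sketch of that arithmetic follows, together with the same gap seen from the slower side; the function names are hypothetical. It shows that a +98.5% lead for System corresponds to HAKMEM running about 49.6% slower, not 98.5% slower.

```c
/* Percentage by which the faster allocator leads the slower one
 * (the convention the Gap column appears to use). */
static double gap_percent(double winner_mops, double loser_mops) {
    return (winner_mops / loser_mops - 1.0) * 100.0;
}

/* The same gap from the slower side: how far the loser trails the winner. */
static double deficit_percent(double winner_mops, double loser_mops) {
    return (1.0 - loser_mops / winner_mops) * 100.0;
}

/* 16B Interleaved row: System 405.50 vs HAKMEM 204.28 M ops/s.
 *   gap_percent(405.50, 204.28)     is about 98.5
 *   deficit_percent(405.50, 204.28) is about 49.6 */
```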
---
### 32 Bytes (Tiny Pool)
| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| LIFO | 210.79 M ops/s | **401.61 M ops/s** | System | **+90.5%** |
| FIFO | 211.48 M ops/s | **401.52 M ops/s** | System | **+89.9%** |
| Random | 110.03 M ops/s | **148.94 M ops/s** | System | **+35.4%** |
| Interleaved | 203.77 M ops/s | **403.95 M ops/s** | System | **+98.3%** |
| Long-lived | 208.39 M ops/s | **405.39 M ops/s** | System | **+94.5%** |

**Analysis**: As at 16B, System malloc dominates.

---
### 64 Bytes (Tiny Pool)
| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| LIFO | 210.56 M ops/s | **400.45 M ops/s** | System | **+90.2%** |
| FIFO | 210.51 M ops/s | **386.92 M ops/s** | System | **+83.8%** |
| Random | 110.41 M ops/s | **147.07 M ops/s** | System | **+33.2%** |
| Interleaved | 204.72 M ops/s | **404.72 M ops/s** | System | **+97.7%** |
| Long-lived | 207.96 M ops/s | **403.51 M ops/s** | System | **+94.0%** |

**Analysis**: Even at the largest Tiny pool size, System malloc stays ahead.

---
### 128 Bytes (MF2 Pool)
| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| LIFO | **209.20 M ops/s** | 166.98 M ops/s | HAKMEM | **+25.3%** |
| FIFO | **209.40 M ops/s** | 171.44 M ops/s | HAKMEM | **+22.1%** |
| Random | **109.41 M ops/s** | 71.21 M ops/s | HAKMEM | **+53.6%** |
| Interleaved | **203.93 M ops/s** | 185.41 M ops/s | HAKMEM | **+10.0%** |
| Long-lived | **206.51 M ops/s** | 182.92 M ops/s | HAKMEM | **+12.9%** |

**Analysis**: 🎉 **HAKMEM wins every test!** At 128B, the only MF2 pool (65-2048B) size measured here, HAKMEM clearly outperforms System malloc, with the largest lead (**+53.6%**) on the Random pattern.

---
### Mixed Sizes (8B, 16B, 32B, 64B)
| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| Mixed | 205.10 M ops/s | **406.60 M ops/s** | System | **+98.2%** |
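
**Analysis**: System malloc also leads with mixed sizes. The gap is in line with the single-size Tiny pool results, and size-class switching cost may contribute to it.

---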
## Overall Assessment
### 🏅 Performance Summary
| Allocator | Wins | Avg Speedup | Best Result | Worst Result |
|-----------|------|-------------|-------------|--------------|
| **HAKMEM** | 5/21 tests | - | **+53.6%** (128B Random) | System **+98.5%** faster (16B Interleaved) |
| **System** | 16/21 tests | **+81.3%** (Tiny pool avg) | **+98.5%** (16B Interleaved) | HAKMEM **+53.6%** faster (128B Random) |
### 🔍 Key Insights
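1. **System malloc dominates the Tiny pool (8-64B)**
   - Likely cause: the system allocator's thread-local cache is extremely fast (on glibc-based Linux this is tcache rather than tcmalloc/jemalloc)
   - HAKMEM is stable at about 204-212M ops/sec (about 110M on the Random pattern)
   - System reaches roughly 390-410M ops/sec (about 148M on the Random pattern)
2. **HAKMEM leads in the MF2 pool (65-2048B)**
   - Wins every pattern at 128B (+10.0% to +53.6%)
   - Strongest on the Random pattern (+53.6%)
   - MF2's page-based allocation appears to pay off
3. **HAKMEM strengths**
   - Stability at medium sizes (128B+)
   - Strong on random access patterns
   - Memory efficiency (further improvement planned in Phase 8.3 ACE)
4. **HAKMEM weaknesses**
   - About half the speed of System malloc at small sizes (8-64B)
   - Tiny pool is not yet sufficiently optimized
   - Magazine cache has only a limited effect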
---
## ACE (Agentic Context Engineering) Status
### Phase 8.3 Implementation Status
**Steps 1-3 complete (current)**:
- SuperSlab lg_size support (variable size, 1MB ↔ 2MB)
- ACE tick function (promotion/demotion logic)
- Counter tracking (alloc_count, live_blocks, hot_score)

**Steps 4-5 not yet implemented**:
- ε-greedy bandit (batch/threshold tuning)
- PGO regeneration
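
Step 4 is not implemented in the source, so the following is only a sketch of what an ε-greedy bandit over candidate batch sizes might look like. The arm values, the reward definition (for example, measured M ops/s), and all type and function names are assumptions made for illustration.

```c
#include <stdint.h>

#define N_ARMS 4

/* Hypothetical candidate batch sizes, one per arm. */
static const int k_batch_sizes[N_ARMS] = { 16, 32, 64, 128 };

typedef struct {
    double   mean[N_ARMS];   /* running mean reward per arm */
    uint32_t pulls[N_ARMS];  /* times each arm has been tried */
    uint32_t rng;            /* xorshift32 state, must be non-zero */
    double   epsilon;        /* exploration probability, e.g. 0.1 */
} bandit_sketch;

static uint32_t xorshift32(uint32_t *s) {
    uint32_t x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *s = x;
}

/* Arm with the highest observed mean reward. */
static int bandit_best_arm(const bandit_sketch *b) {
    int best = 0;
    for (int i = 1; i < N_ARMS; i++)
        if (b->mean[i] > b->mean[best]) best = i;
    return best;
}

/* With probability epsilon explore a random arm, otherwise exploit. */
static int bandit_select(bandit_sketch *b) {
    double u = (xorshift32(&b->rng) >> 8) * (1.0 / 16777216.0);  /* [0, 1) */
    if (u < b->epsilon)
        return (int)(xorshift32(&b->rng) % N_ARMS);
    return bandit_best_arm(b);
}

/* Fold a new reward into the incremental mean for that arm. */
static void bandit_update(bandit_sketch *b, int arm, double reward) {
    b->pulls[arm]++;
    b->mean[arm] += (reward - b->mean[arm]) / b->pulls[arm];
}

static int bandit_batch_size(int arm) {
    return k_batch_sizes[arm];
}
```

A caller would pick an arm with `bandit_select`, run with `bandit_batch_size(arm)` for one measurement window, and report the observed throughput through `bandit_update`.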
### ACE Stats (from HAKMEM run)
| Class | Current Size | Target Size | Hot Score | Allocs | Live Blocks |
|-------|-------------|-------------|-----------|--------|-------------|
| 8B | 1MB | 1MB | 1000 | 3.15M | 25.0M |
| 16B | 1MB | 1MB | 1000 | 3.14M | 475.0M |
| 24B | 1MB | 1MB | 1000 | 3.14M | 475.0M |
| 32B | 1MB | 1MB | 1000 | 3.15M | 475.0M |
| 40B | 1MB | 1MB | 1000 | 15.47M | 450.0M |
---
## Next Actions
### Priority: High
1. **Speed up the Tiny pool**
   - Improve the magazine cache
   - Optimize the thread-local cache
   - Make SuperSlab allocation lighter
2. **Complete ACE Phase 8.3**
   - Step 4: implement the ε-greedy bandit
   - Step 5: regenerate the PGO profile
   - Measure the RSS reduction
### Priority: Medium
3. **Optimize the mixed-size pattern**
   - Reduce size-class switching cost
   - Introduce size-class prediction
---
## Conclusion
**Current Status**: HAKMEM beats System malloc in the MF2 pool (measured at 128B) but runs at about half its speed in the Tiny pool (8-64B).

**Next Goal**: Double Tiny pool throughput to reach parity with System malloc.

**Long-term Vision**: An allocator that beats System malloc in every size class while also being more memory-efficient.

---
## Historical Performance (HAKMEM Step 3d vs mimalloc)
### 🏆 Best Performance Record (HAKMEM Step 3d)
**Top 9 Results**:
1. Test 6 (128B Long-lived): **313.27 M ops/sec** ← 🥇 NEW RECORD!
2. Test 6 (16B Long-lived): 312.59 M ops/sec
3. Test 6 (64B Long-lived): 312.24 M ops/sec
4. Test 6 (32B Long-lived): 310.88 M ops/sec
5. Test 4 (32B Interleaved): 310.38 M ops/sec
6. Test 4 (64B Interleaved): 309.94 M ops/sec
7. Test 4 (16B Interleaved): 309.85 M ops/sec
8. Test 4 (128B Interleaved): 308.88 M ops/sec
9. Test 2 (32B FIFO): 307.53 M ops/sec
### 🎯 HAKMEM vs mimalloc (Step 3d)
| Metric | HAKMEM Step 3d | mimalloc | Winner | Gap |
|--------|----------------|----------|--------|-----|
| **Performance** | 313.27 M ops/sec | 307.00 M ops/sec | **HAKMEM** | **+2.0%** 🎉 |
| **Memory (RSS)** | 13,208 KB (13.2 MB) | 4,036 KB (4.0 MB) | **mimalloc** | HAKMEM uses **3.27x** (+227%) ⚠️ |

**Analysis**:
- ✅ **Speed**: HAKMEM is **+2.0%** faster than mimalloc (313.27 vs 307.00 M ops/sec)
- ⚠️ **Memory**: HAKMEM uses **3.27x** the memory of mimalloc (+9.2 MB)
### 🎯 Performance vs Memory Trade-off
| Version | Speed (128B) | RSS Memory | Speed per MB of RSS |
|---------|-------------|------------|----------------|
| **mimalloc** | 307.0 M ops/s | 4.0 MB | **76.75 M ops/s per MB** 🏆 |
| **HAKMEM Step 3d** | 313.3 M ops/s | 13.2 MB | 23.73 M ops/s per MB |
| **HAKMEM Phase 8.3** | 206.5 M ops/s | TBD | TBD |

**Goal (Phase 8.3 ACE)**: Reduce RSS from 13.2 MB to 4-6 MB while sustaining 300M+ ops/sec.

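
The trade-off figures above reduce to two small ratios. The sketch below, with hypothetical function names, reproduces them from the numbers in the tables. By the same arithmetic, the Phase 8.3 goal of 300 M ops/s within 4-6 MB would correspond to between 50 and 75 M ops/s per MB.

```c
/* How many times larger one RSS figure is than a baseline. */
static double rss_multiple(double rss_kb, double baseline_kb) {
    return rss_kb / baseline_kb;
}

/* Throughput per MB of resident memory (M ops/s per MB). */
static double mops_per_mb(double mops, double rss_mb) {
    return mops / rss_mb;
}

/* rss_multiple(13208, 4036) is about 3.27  (HAKMEM Step 3d vs mimalloc)
 * mops_per_mb(307.0, 4.0)   is 76.75       (mimalloc)
 * mops_per_mb(313.3, 13.2)  is about 23.73 (HAKMEM Step 3d) */
```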
---
## Regression Analysis: Phase 8.3 vs Step 3d
### 128B Long-lived Test
| Version | Throughput | vs Step 3d | vs mimalloc |
|---------|------------|-----------|-------------|
| **HAKMEM Step 3d** (Best) | 313.27 M ops/s | baseline | **+2.0%** ✅ |
| **HAKMEM Phase 8.3** (Current) | 206.51 M ops/s | **-34.1%** ⚠️ | **-32.7%** ⚠️ |
| **mimalloc** | 307.00 M ops/s | -2.0% | baseline |
| **System malloc** | 182.92 M ops/s | -41.6% | -40.4% |

**Regression**: Phase 8.3 is **34.1% slower** than Step 3d on this test.
### 🔍 Root Cause Analysis
The cause is the ACE (Agentic Context Engineering) counter tracking that Phase 8.3 added to the hot path.
#### 1. **ACE Counter Tracking on Every Allocation** (hakmem_tiny.c:1251-1264)
```c
g_ss_ace[class_idx].alloc_count++; // +1 write
g_ss_ace[class_idx].live_blocks++; // +1 write
if ((g_ss_ace[class_idx].alloc_count & 0x3FFFu) == 0) { // +1 load, +1 AND, +1 compare
    hak_tiny_superslab_ace_tick(...);
}
```
- **Impact**: 2 writes + 3 ops per allocation
- **Benchmark**: 200M allocations = **400M extra writes**
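
To put the excerpt in proportion: the `0x3FFF` mask means the tick branch is taken once per 16,384 allocations, so the tick call itself is rare and the cost comes from the two unconditional writes. A small sketch of that arithmetic, with hypothetical helper names and assuming `alloc_count` starts at zero:

```c
#include <stdint.h>

#define ACE_TICK_MASK 0x3FFFu  /* mask used in the excerpt */

/* Times the tick branch is taken over n allocations. */
static uint64_t ace_tick_calls(uint64_t n_allocs) {
    return n_allocs / (ACE_TICK_MASK + 1u);
}

/* Unconditional counter writes: alloc_count++ and live_blocks++. */
static uint64_t ace_extra_writes(uint64_t n_allocs) {
    return 2u * n_allocs;
}

/* For the 200M-allocation benchmark:
 *   ace_tick_calls(200000000)   == 12207
 *   ace_extra_writes(200000000) == 400000000 */
```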
#### 2. **ACE Counter Tracking on Every Free** (hakmem_tiny.c:1336-1338, 1355-1357)
```c
if (g_ss_ace[ss->size_class].live_blocks > 0) { // +1 load, +1 compare
    g_ss_ace[ss->size_class].live_blocks--; // +1 write
}
```
- **Impact**: 1 load + 1 compare + 1 write per free
- **Benchmark**: 200M frees = **200M extra writes**, each preceded by a load and a compare
#### 3. **Registry Lookup Overhead** (hakmem_super_registry.h:52-74)
```c
for (int lg = 20; lg <= 21; lg++) { // Try both 1MB and 2MB
    // ... probe loop ...
    if (b == base && e->lg_size == lg) return e->ss; // Extra field check
}
```
- **Impact**: Doubles worst-case lookup time, extra lg_size comparisons on every free
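
The excerpt elides the probe loop, so the following is a self-contained sketch of the general idea rather than the code in `hakmem_super_registry.h`. The table size, hash function, entry layout, and probe counter are assumptions. Only the loop over `lg = 20..21` and the `lg_size` check come from the excerpt. It shows why a pointer inside a 2MB SuperSlab costs an extra probe pass compared with one inside a 1MB SuperSlab.

```c
#include <stddef.h>
#include <stdint.h>

#define REG_SIZE 64  /* hypothetical table size, power of two */

typedef struct {
    uintptr_t base;     /* SuperSlab base address, aligned to 1 << lg_size */
    uint8_t   lg_size;  /* 20 = 1MB, 21 = 2MB */
    void     *ss;       /* NULL marks an empty slot */
} reg_entry_sketch;

static reg_entry_sketch g_reg[REG_SIZE];

static size_t reg_hash(uintptr_t base) {
    return (size_t)(base >> 20) & (REG_SIZE - 1);
}

/* Linear-probing insert; returns 1 on success, 0 if the table is full. */
static int reg_insert(uintptr_t base, uint8_t lg, void *ss) {
    size_t h = reg_hash(base);
    for (size_t i = 0; i < REG_SIZE; i++) {
        reg_entry_sketch *e = &g_reg[(h + i) & (REG_SIZE - 1)];
        if (e->ss == NULL) {
            e->base = base;
            e->lg_size = lg;
            e->ss = ss;
            return 1;
        }
    }
    return 0;
}

/* Try the 1MB alignment first, then 2MB; *probes counts slots inspected. */
static void *reg_lookup(uintptr_t ptr, int *probes) {
    for (int lg = 20; lg <= 21; lg++) {
        uintptr_t b = ptr & ~(((uintptr_t)1 << lg) - 1);
        size_t h = reg_hash(b);
        for (size_t i = 0; i < REG_SIZE; i++) {
            reg_entry_sketch *e = &g_reg[(h + i) & (REG_SIZE - 1)];
            (*probes)++;
            if (e->ss == NULL) break;  /* empty slot ends this chain */
            if (e->base == b && e->lg_size == lg) return e->ss;
        }
    }
    return NULL;
}
```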
#### 4. **Memory Pressure**
- Accesses to `g_ss_ace[class_idx]` put pressure on the cache
- Every allocation and free writes to a global array
### 💡 Solution Options
1. **Option A: Sampling-based Tracking**
   - Update the counters with probability 1/256 only (statistically sufficient)
   - Expected: ~1% overhead (313M → 310M ops/s)
2. **Option B: Per-TLS Counters**
   - Use thread-local counters to make the writes cheaper
   - Aggregate at tick time
3. **Option C: Conditional ACE (compile-time flag)**
   - Allow tracking to be disabled with `#ifdef HAKMEM_ACE_ENABLE`
   - ACE off in production; ACE on only when memory footprint is the priority
4. **Option D: ACE v2 - Lazy Observation**
   - Count only at magazine refill/spill (the existing slow path)
   - Leave the alloc/free hot path untouched
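
As one illustration, Options B and C could be combined roughly as follows. This is a sketch under stated assumptions, not code from HAKMEM: the flag `ACE_SKETCH_ENABLE` stands in for `HAKMEM_ACE_ENABLE`, the flush interval of 256 events is arbitrary, and all function and variable names are hypothetical. Each thread accumulates a private delta and only touches the shared counter when it flushes. It requires C11 (`_Thread_local`, `<stdatomic.h>`).

```c
#include <stdatomic.h>
#include <stdint.h>

#ifndef ACE_SKETCH_ENABLE
#define ACE_SKETCH_ENABLE 1   /* stand-in for the HAKMEM_ACE_ENABLE flag */
#endif

#define ACE_FLUSH_EVERY 256u  /* hypothetical flush interval, power of two */

static _Atomic int64_t g_live_blocks;          /* shared, written only on flush */
static _Thread_local int32_t  t_live_delta;    /* per-thread pending delta */
static _Thread_local uint32_t t_events;        /* per-thread event count */

/* Publish this thread's pending delta to the shared counter. */
static inline void ace_flush(void) {
    if (t_live_delta != 0) {
        atomic_fetch_add_explicit(&g_live_blocks, t_live_delta,
                                  memory_order_relaxed);
        t_live_delta = 0;
    }
}

/* Hot-path hook: thread-local arithmetic only, except on flush. */
static inline void ace_note(int delta) {
#if ACE_SKETCH_ENABLE
    t_live_delta += delta;
    if ((++t_events & (ACE_FLUSH_EVERY - 1u)) == 0)
        ace_flush();
#else
    (void)delta;  /* tracking compiled out */
#endif
}

static inline void ace_on_alloc(void) { ace_note(+1); }
static inline void ace_on_free(void)  { ace_note(-1); }

/* Approximate live-block count; lags by each thread's unflushed delta. */
static inline int64_t ace_live_blocks(void) {
    return atomic_load_explicit(&g_live_blocks, memory_order_relaxed);
}
```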
---
## Raw Data
- HAKMEM Phase 8.3: `benchmarks/hakmem_result.txt`
- System malloc: `benchmarks/system_result.txt`
- HAKMEM Step 3d: (Historical data, referenced above)