# HAKMEM vs System Malloc Benchmark Results

**Date**: 2025-10-27

**HAKMEM Version**: Phase 8.3 (ACE Step 1-3)

**Platform**: Linux 5.15.167.4-microsoft-standard-WSL2

**Compiler**: GCC with `-O3 -march=native`

---

## Benchmark Overview

### Test Patterns (6 total)

| Test | Pattern | Purpose |
|------|---------|---------|
| **Test 1: Sequential LIFO** | alloc[0..99] → free[99..0] (reverse order) | Best case: makes full use of the freelist's LIFO behavior |
| **Test 2: Sequential FIFO** | alloc[0..99] → free[0..99] (same order) | Worst case: measures freelist fragmentation under FIFO frees |
| **Test 3: Random Order Free** | alloc[0..99] → free[random] (random order) | Realistic: cache misses and fragmentation |
| **Test 4: Interleaved Alloc/Free** | alloc → free → alloc → free (alternating) | Fast churn: measures the effect of the magazine cache |
| **Test 5: Mixed Sizes** | 8B, 16B, 32B, 64B mixed | Multi-size: cost of switching size classes |
| **Test 6: Long-lived vs Short-lived** | 50% retained, the rest churned | Memory pressure: performance under heavy load |

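
The two sequential patterns (Tests 1 and 2) can be sketched roughly as follows. This is an illustrative harness written against the standard `malloc`/`free` interface, not the benchmark's actual source; the function name `run_sequential`, the `SLOTS` constant, and the `rounds` parameter are assumptions made for the example.

```c
#include <stdlib.h>
#include <stddef.h>

#define SLOTS 100  /* matches alloc[0..99] in the pattern table */

/* Allocate SLOTS blocks of `size` bytes, then free them in reverse order
 * (LIFO) or in allocation order (FIFO), repeated `rounds` times.
 * Returns the number of alloc+free operations performed, or 0 if an
 * allocation failed. */
size_t run_sequential(size_t size, int lifo, size_t rounds)
{
    void *slot[SLOTS];
    size_t ops = 0;

    for (size_t r = 0; r < rounds; r++) {
        for (int i = 0; i < SLOTS; i++) {
            slot[i] = malloc(size);
            if (!slot[i]) {
                while (i-- > 0) free(slot[i]);  /* release what we got so far */
                return 0;
            }
        }
        if (lifo) {
            for (int i = SLOTS - 1; i >= 0; i--) free(slot[i]);
        } else {
            for (int i = 0; i < SLOTS; i++) free(slot[i]);
        }
        ops += 2 * SLOTS;
    }
    return ops;
}
```

Dividing the returned operation count by the elapsed wall-clock time would give a throughput figure in the same unit (ops/s) as the result tables; the timing code is left out to keep the sketch short.
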
### Tested Size Classes

- **16B**: Tiny pool (8-64B)
- **32B**: Tiny pool (8-64B)
- **64B**: Tiny pool (8-64B)
- **128B**: MF2 pool (65-2048B)

---

## Results Summary

### 🏆 Overall Winner by Size Class

| Size Class | LIFO | FIFO | Random | Interleaved | Mixed | Long-lived | **Total Winner** |
|------------|------|------|--------|-------------|-------|------------|------------------|
| **16B** | System | System | System | System | - | System | **System (5/5)** |
| **32B** | System | System | System | System | - | System | **System (5/5)** |
| **64B** | System | System | System | System | - | System | **System (5/5)** |
| **128B** | **HAKMEM** | **HAKMEM** | **HAKMEM** | **HAKMEM** | - | **HAKMEM** | **HAKMEM (5/5)** |
| **Mixed** | - | - | - | - | System | - | **System (1/1)** |

---

## Detailed Results

### 16 Bytes (Tiny Pool)

| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| LIFO | 212.24 M ops/s | **404.88 M ops/s** | System | **+90.7%** |
| FIFO | 210.90 M ops/s | **402.95 M ops/s** | System | **+91.0%** |
| Random | 109.91 M ops/s | **148.50 M ops/s** | System | **+35.1%** |
| Interleaved | 204.28 M ops/s | **405.50 M ops/s** | System | **+98.5%** |
| Long-lived | 208.82 M ops/s | **409.17 M ops/s** | System | **+95.9%** |

**Analysis**: System malloc dominates at 16B, running at roughly twice HAKMEM's throughput in every pattern except Random (+35.1%). Gap is the winner's throughput advantage over the loser, in percent.

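
The Gap column appears to be the winner's throughput divided by the loser's, minus one, expressed in percent. That formula is an assumption inferred from the numbers rather than something the report states, but it reproduces the table values to within about 0.1 percentage point. A minimal sketch, with hypothetical function names:

```c
/* Relative advantage of the faster allocator, in percent.
 * Assumed formula: (winner / loser - 1) * 100. */
double gap_percent(double winner_mops, double loser_mops)
{
    return (winner_mops / loser_mops - 1.0) * 100.0;
}

/* The same comparison seen from the slower side: how far the loser falls
 * short of the winner, in percent (always negative). Note that a +98.5%
 * gap corresponds to roughly a -49.6% deficit, not -98.5%. */
double deficit_percent(double winner_mops, double loser_mops)
{
    return (loser_mops / winner_mops - 1.0) * 100.0;
}
```
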
---

### 32 Bytes (Tiny Pool)

| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| LIFO | 210.79 M ops/s | **401.61 M ops/s** | System | **+90.5%** |
| FIFO | 211.48 M ops/s | **401.52 M ops/s** | System | **+89.9%** |
| Random | 110.03 M ops/s | **148.94 M ops/s** | System | **+35.4%** |
| Interleaved | 203.77 M ops/s | **403.95 M ops/s** | System | **+98.3%** |
| Long-lived | 208.39 M ops/s | **405.39 M ops/s** | System | **+94.5%** |

**Analysis**: As with 16B, System malloc dominates.

---

### 64 Bytes (Tiny Pool)

| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| LIFO | 210.56 M ops/s | **400.45 M ops/s** | System | **+90.2%** |
| FIFO | 210.51 M ops/s | **386.92 M ops/s** | System | **+83.8%** |
| Random | 110.41 M ops/s | **147.07 M ops/s** | System | **+33.2%** |
| Interleaved | 204.72 M ops/s | **404.72 M ops/s** | System | **+97.7%** |
| Long-lived | 207.96 M ops/s | **403.51 M ops/s** | System | **+94.0%** |

**Analysis**: System malloc stays ahead even at the largest Tiny pool size.

---

### 128 Bytes (MF2 Pool)

| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| LIFO | **209.20 M ops/s** | 166.98 M ops/s | HAKMEM | **+25.3%** |
| FIFO | **209.40 M ops/s** | 171.44 M ops/s | HAKMEM | **+22.1%** |
| Random | **109.41 M ops/s** | 71.21 M ops/s | HAKMEM | **+53.6%** |
| Interleaved | **203.93 M ops/s** | 185.41 M ops/s | HAKMEM | **+10.0%** |
| Long-lived | **206.51 M ops/s** | 182.92 M ops/s | HAKMEM | **+12.9%** |

**Analysis**: 🎉 **HAKMEM wins every pattern!** At 128B, the only MF2 pool (65-2048B) size tested, HAKMEM clearly outperforms System malloc, with its largest lead, **+53.6%**, on the Random pattern.

---

### Mixed Sizes (8B, 16B, 32B, 64B)

| Test | HAKMEM | System | Winner | Gap |
|------|--------|--------|--------|-----|
| Mixed | 205.10 M ops/s | **406.60 M ops/s** | System | **+98.2%** |

**Analysis**: System malloc is ahead on mixed sizes; the cost of switching size classes is a contributing factor.

---

## Overall Assessment

### 🏅 Performance Summary

| Allocator | Wins | Avg Speedup | Best Result | Worst Result |
|-----------|------|-------------|-------------|--------------|
| **HAKMEM** | 5/21 tests | - | **+53.6%** (128B Random) | System **+98.5%** faster (16B Interleaved) |
| **System** | 16/21 tests | **+81.3%** (Tiny pool avg) | **+98.5%** (16B Interleaved) | HAKMEM **+53.6%** faster (128B Random) |

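
The "Tiny pool avg" figure can be checked by averaging the per-test gaps over the 15 single-size Tiny pool results (16B, 32B, 64B; five patterns each). The sketch below assumes the average is a plain arithmetic mean of `(system / hakmem - 1) * 100`, which the report does not state explicitly; the struct and function names are made up for the example, while the throughput values are copied from the detailed result tables.

```c
#include <stddef.h>

struct result { double hakmem; double sys; };  /* throughput in M ops/s */

/* Arithmetic mean of (sys / hakmem - 1) * 100 over n results. */
double mean_system_gap(const struct result *r, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (r[i].sys / r[i].hakmem - 1.0) * 100.0;
    return n ? sum / (double)n : 0.0;
}

/* LIFO, FIFO, Random, Interleaved, Long-lived for each size class. */
static const struct result tiny_pool[] = {
    {212.24, 404.88}, {210.90, 402.95}, {109.91, 148.50}, {204.28, 405.50}, {208.82, 409.17},  /* 16B */
    {210.79, 401.61}, {211.48, 401.52}, {110.03, 148.94}, {203.77, 403.95}, {208.39, 405.39},  /* 32B */
    {210.56, 400.45}, {210.51, 386.92}, {110.41, 147.07}, {204.72, 404.72}, {207.96, 403.51},  /* 64B */
};
```

Calling `mean_system_gap(tiny_pool, sizeof tiny_pool / sizeof tiny_pool[0])` gives a value close to the reported +81.3%. The Mixed test is left out here, which is what the "Tiny pool avg" label suggests.
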
### 🔍 Key Insights

1. **System malloc dominates the Tiny pool range (8-64B)**
   - Cause: the system allocator's tcmalloc/jemalloc-style thread-local cache is extremely fast
   - HAKMEM holds steady at about 200M ops/sec (about 110M on Random)
   - System reaches 400M+ ops/sec (about 148M on Random)

2. **HAKMEM leads in the MF2 pool range (65-2048B)**
   - Wins every pattern at 128B (+10.0% to +53.6%)
   - Especially strong on the Random pattern (+53.6%)
   - MF2's page-based allocation is paying off

3. **HAKMEM strengths**
   - Stability at medium sizes (128B+)
   - Strength on random access patterns
   - Memory efficiency (further improvement planned with Phase 8.3 ACE)

4. **HAKMEM weaknesses**
   - About half the speed of System malloc at small sizes (8-64B)
   - The Tiny pool is insufficiently optimized
   - The magazine cache has only a limited effect

---

## ACE (Agentic Context Engineering) Status

### Phase 8.3 Implementation Status

✅ **Steps 1-3 complete (current)**:
- SuperSlab lg_size support (variable size, 1MB ↔ 2MB)
- ACE tick function (promotion/demotion logic)
- Counter tracking (alloc_count, live_blocks, hot_score)

⏳ **Steps 4-5 not yet implemented**:
- ε-greedy bandit (batch/threshold tuning)
- PGO regeneration

### ACE Stats (from HAKMEM run)

| Class | Current Size | Target Size | Hot Score | Allocs | Live Blocks |
|-------|-------------|-------------|-----------|--------|-------------|
| 8B | 1MB | 1MB | 1000 | 3.15M | 25.0M |
| 16B | 1MB | 1MB | 1000 | 3.14M | 475.0M |
| 24B | 1MB | 1MB | 1000 | 3.14M | 475.0M |
| 32B | 1MB | 1MB | 1000 | 3.15M | 475.0M |
| 40B | 1MB | 1MB | 1000 | 15.47M | 450.0M |

---

## Next Actions

### Priority: High

1. **Speed up the Tiny pool**
   - Improve the magazine cache
   - Optimize the thread-local cache
   - Make SuperSlab allocation lighter

2. **Complete ACE Phase 8.3**
   - Step 4: implement the ε-greedy bandit
   - Step 5: regenerate PGO
   - Measure the RSS reduction

### Priority: Medium

3. **Optimize the mixed-size pattern**
   - Reduce the cost of switching size classes
   - Introduce size-class prediction

---

## Conclusion

**Current Status**: HAKMEM beats System malloc in the MF2 pool (measured at 128B) but runs at about half its speed in the Tiny pool (8-64B).

**Next Goal**: Make the Tiny pool 2x faster, bringing it level with System malloc.

**Long-term Vision**: An allocator that outperforms System malloc in every size class while also being more memory-efficient.

---

## Historical Performance (HAKMEM Step 3d vs mimalloc)

### 🏆 Best Performance Record (HAKMEM Step 3d)

**Top 9 Results**:
1. Test 6 (128B Long-lived): **313.27 M ops/sec** ← 🥇 NEW RECORD!
2. Test 6 (16B Long-lived): 312.59 M ops/sec
3. Test 6 (64B Long-lived): 312.24 M ops/sec
4. Test 6 (32B Long-lived): 310.88 M ops/sec
5. Test 4 (32B Interleaved): 310.38 M ops/sec
6. Test 4 (64B Interleaved): 309.94 M ops/sec
7. Test 4 (16B Interleaved): 309.85 M ops/sec
8. Test 4 (128B Interleaved): 308.88 M ops/sec
9. Test 2 (32B FIFO): 307.53 M ops/sec

### 🎯 HAKMEM vs mimalloc (Step 3d)

| Metric | HAKMEM Step 3d | mimalloc | Winner | Gap |
|--------|----------------|----------|--------|-----|
| **Performance** | 313.27 M ops/sec | 307.00 M ops/sec | **HAKMEM** | **+2.0%** 🎉 |
| **Memory (RSS)** | 13,208 KB (13.2 MB) | 4,036 KB (4.0 MB) | **mimalloc** | HAKMEM uses **+227% (3.27x)** ⚠️ |

**Analysis**:
- ✅ **Speed**: HAKMEM is **+2.0%** faster than mimalloc (313.27 vs 307.00 M ops/sec)
- ⚠️ **Memory**: HAKMEM uses **3.27x** the memory of mimalloc (+9.2 MB)

### 🎯 Performance vs Memory Trade-off

| Version | Speed (128B) | RSS Memory | Speed per MB of RSS |
|---------|-------------|------------|----------------|
| **mimalloc** | 307.0 M ops/s | 4.0 MB | **76.75 M ops/s per MB** 🏆 |
| **HAKMEM Step 3d** | 313.3 M ops/s | 13.2 MB | 23.74 M ops/s per MB |
| **HAKMEM Phase 8.3** | 206.5 M ops/s | TBD | TBD |

**Goal (Phase 8.3 ACE)**: Reduce RSS from 13.2 MB to 4-6 MB while keeping throughput at 300M+ ops/sec.

---

## Regression Analysis: Phase 8.3 vs Step 3d

### 128B Long-lived Test

| Version | Throughput | vs Step 3d | vs mimalloc |
|---------|------------|-----------|-------------|
| **HAKMEM Step 3d** (Best) | 313.27 M ops/s | baseline | **+2.0%** ✅ |
| **HAKMEM Phase 8.3** (Current) | 206.51 M ops/s | **-34.1%** ⚠️ | **-32.7%** ⚠️ |
| **mimalloc** | 307.00 M ops/s | -2.0% | baseline |
| **System malloc** | 182.92 M ops/s | -41.6% | -40.4% |

**Regression**: Phase 8.3 is **34.1% slower** than Step 3d!

### 🔍 Root Cause Analysis

The cause is the ACE (Agentic Context Engineering) counter tracking that Phase 8.3 added to the hot path.

#### 1. **ACE Counter Tracking on Every Allocation** (hakmem_tiny.c:1251-1264)

```c
g_ss_ace[class_idx].alloc_count++;                        // +1 write
g_ss_ace[class_idx].live_blocks++;                        // +1 write
if ((g_ss_ace[class_idx].alloc_count & 0x3FFFu) == 0) {   // +1 load, +1 AND, +1 compare
    hak_tiny_superslab_ace_tick(...);
}
```

- **Impact**: 2 writes + 3 ops per allocation
- **Benchmark**: 200M allocations = **400M extra writes**

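
To make the masked test in the excerpt concrete: `alloc_count & 0x3FFFu` is zero once every 0x4000 (16384) allocations, so the tick itself is rare while the two counter writes happen on every call. The following self-contained sketch models only that bookkeeping. The struct layout, field types, and function names are assumptions for illustration; the real `g_ss_ace` entries and `hak_tiny_superslab_ace_tick()` are not reproduced here.

```c
#include <stdint.h>

/* Hypothetical stand-in for one entry of g_ss_ace[]; only the fields named
 * in the excerpt are modeled, plus a tick counter for illustration. */
struct ace_state {
    uint32_t alloc_count;
    uint32_t live_blocks;
    uint32_t ticks;        /* how many times the tick branch was taken */
};

/* Mirrors the allocation-side bookkeeping: two writes on every call, then a
 * masked test that is true once every 0x4000 (16384) allocations. */
static inline void ace_on_alloc(struct ace_state *s)
{
    s->alloc_count++;
    s->live_blocks++;
    if ((s->alloc_count & 0x3FFFu) == 0) {
        s->ticks++;        /* the real code calls the ACE tick function here */
    }
}

/* Run n allocations through the bookkeeping and report the tick count. */
uint32_t ace_ticks_after(uint32_t n)
{
    struct ace_state s = {0, 0, 0};
    for (uint32_t i = 0; i < n; i++)
        ace_on_alloc(&s);
    return s.ticks;
}
```

Under these assumptions the tick branch is taken only about 12,000 times in a 200M-allocation run, so the per-call counter writes, not the tick, are the more plausible source of overhead.
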
#### 2. **ACE Counter Tracking on Every Free** (hakmem_tiny.c:1336-1338, 1355-1357)

```c
if (g_ss_ace[ss->size_class].live_blocks > 0) {   // +1 load, +1 compare
    g_ss_ace[ss->size_class].live_blocks--;       // +1 write
}
```

- **Impact**: 1 load + 1 compare + 1 write per free
- **Benchmark**: 200M frees = **200M extra operations**

#### 3. **Registry Lookup Overhead** (hakmem_super_registry.h:52-74)

```c
for (int lg = 20; lg <= 21; lg++) {  // Try both 1MB and 2MB
    // ... probe loop ...
    if (b == base && e->lg_size == lg) return e->ss;  // Extra field check
}
```

- **Impact**: Doubles worst-case lookup time, extra lg_size comparisons on every free

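
The reason the lookup has to loop over both sizes is that a pointer alone does not say whether it belongs to a 1MB or a 2MB SuperSlab, so the candidate base address must be computed under each alignment and probed separately. Below is a simplified, self-contained sketch of that idea. The table size, hash function, linear probing scheme, and all names (`reg_entry`, `reg_insert`, `reg_lookup`) are hypothetical and are not taken from `hakmem_super_registry.h`.

```c
#include <stdint.h>
#include <stddef.h>

#define REG_SIZE 64u            /* hypothetical table size, power of two */

struct reg_entry {
    uintptr_t base;             /* aligned start of the SuperSlab, 0 = empty */
    int       lg_size;          /* 20 for 1MB, 21 for 2MB */
    void     *ss;               /* opaque SuperSlab handle */
};

static struct reg_entry g_reg[REG_SIZE];

static size_t reg_hash(uintptr_t base, int lg)
{
    return (size_t)(base >> lg) & (REG_SIZE - 1);
}

/* Insert with linear probing; returns 0 on success, -1 if the table is full. */
int reg_insert(uintptr_t base, int lg, void *ss)
{
    size_t h = reg_hash(base, lg);
    for (size_t i = 0; i < REG_SIZE; i++) {
        struct reg_entry *e = &g_reg[(h + i) & (REG_SIZE - 1)];
        if (e->base == 0) {
            e->base = base;
            e->lg_size = lg;
            e->ss = ss;
            return 0;
        }
    }
    return -1;
}

/* Try the 1MB alignment first, then the 2MB alignment. A miss on the first
 * pass costs a full probe sequence before the second pass even starts. */
void *reg_lookup(const void *ptr)
{
    for (int lg = 20; lg <= 21; lg++) {
        uintptr_t b = (uintptr_t)ptr & ~(((uintptr_t)1 << lg) - 1);
        size_t h = reg_hash(b, lg);
        for (size_t i = 0; i < REG_SIZE; i++) {
            const struct reg_entry *e = &g_reg[(h + i) & (REG_SIZE - 1)];
            if (e->base == 0) break;                       /* empty slot ends the probe */
            if (e->base == b && e->lg_size == lg) return e->ss;
        }
    }
    return NULL;
}
```

In this sketch a pointer inside a 2MB SuperSlab always pays for a failed 1MB pass first, which is one way to picture the doubled worst case noted above.
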
#### 4. **Memory Pressure**

- Accesses to `g_ss_ace[class_idx]` put extra load on the cache
- A write to the global array happens on every call

### 💡 Solution Options

1. **Option A: Sampling-based Tracking**
   - Update the counters with probability 1/256 only (statistically sufficient)
   - Expected: ~1% overhead (313M → 310M ops/s)

2. **Option B: Per-TLS Counters**
   - Use thread-local counters to make the writes cheaper
   - Aggregate at tick time

3. **Option C: Conditional ACE (compile-time flag)**
   - Allow tracking to be compiled out with `#ifdef HAKMEM_ACE_ENABLE`
   - ACE off in production; ACE on only when memory footprint is the priority

4. **Option D: ACE v2 - Lazy Observation**
   - Count only on magazine refill/spill (the existing slow path)
   - Leave the alloc/free hot path completely untouched

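
As a rough illustration of Option A, the sketch below updates a shared counter on only one call in 256. Two simplifications are assumed: it samples deterministically (every 256th event per thread) instead of with a random 1/256 probability, and it uses a thread-local event counter, which overlaps with the idea in Option B. All names are hypothetical, and the shared counter is a plain variable; a multi-threaded build would need an atomic add.

```c
#include <stdint.h>

#define ACE_SAMPLE_SHIFT 8                              /* 1 in 2^8 = 256 events */
#define ACE_SAMPLE_MASK  ((1u << ACE_SAMPLE_SHIFT) - 1u)

/* Shared estimate; hypothetical stand-in for a per-class alloc_count.
 * Not atomic: this sketch is single-threaded. */
static uint64_t g_alloc_estimate;

/* Per-thread event counter: cheap to bump and never shared. */
static _Thread_local uint32_t t_events;

/* Called on every allocation. Only every 256th call touches the shared
 * counter, and it adds 256 so the estimate stays within 255 of the truth. */
static inline void ace_sample_alloc(void)
{
    if ((++t_events & ACE_SAMPLE_MASK) == 0)
        g_alloc_estimate += (uint64_t)1 << ACE_SAMPLE_SHIFT;
}

/* Reset the counters, run n sampled allocations, and return the estimate. */
uint64_t ace_estimate_after(uint32_t n)
{
    g_alloc_estimate = 0;
    t_events = 0;
    for (uint32_t i = 0; i < n; i++)
        ace_sample_alloc();
    return g_alloc_estimate;
}
```
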
---

## Raw Data

- HAKMEM Phase 8.3: `benchmarks/hakmem_result.txt`
- System malloc: `benchmarks/system_result.txt`
- HAKMEM Step 3d: (Historical data, referenced above)