Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
303 lines
9.3 KiB
Markdown
303 lines
9.3 KiB
Markdown
# Phase 6.11.4 Completion Report: hak_alloc Optimization
|
||
|
||
**Date**: 2025-10-22
|
||
**Status**: ✅ **Implementation Complete** (P0-1 + P0-2)
|
||
**Goal**: Optimize hak_alloc hotpath to beat mimalloc in all scenarios
|
||
|
||
---
|
||
|
||
## 📊 **Background: Why hak_alloc Optimization?**
|
||
|
||
### Problem: hak_alloc is the #2 Bottleneck (Phase 6.11.3 Discovery)
|
||
**Profiling results** (Phase 6.11.3):
|
||
```
|
||
syscall_munmap: 131,666 cycles (41.3%) ← #1 Bottleneck
|
||
hak_alloc: 126,479 cycles (39.6%) ← #2 NEW DISCOVERY! 🔥
|
||
hak_free: 48,206 cycles (15.1%)
|
||
```
|
||
|
||
**Target**: Reduce hak_alloc overhead by ~45% to beat mimalloc in all scenarios
|
||
|
||
---
|
||
|
||
## 🔧 **Implementation**
|
||
|
||
### **Phase 6.11.4 P0-1: Atomic Operation Elimination** (30 min)
|
||
|
||
**Goal**: Eliminate atomic operations when EVOLUTION feature is disabled
|
||
|
||
**Changes**:
|
||
1. **hakmem.c Line 361-375**: Replace runtime `if (HAK_ENABLED_LEARNING(...))` with compile-time `#if HAKMEM_FEATURE_EVOLUTION`
|
||
|
||
```c
|
||
// Before (runtime check)
|
||
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
|
||
static _Atomic uint64_t tick_counter = 0;
|
||
if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
|
||
hak_evo_tick(now_ns);
|
||
}
|
||
}
|
||
|
||
// After (compile-time check)
|
||
#if HAKMEM_FEATURE_EVOLUTION
|
||
static _Atomic uint64_t tick_counter = 0;
|
||
if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
|
||
if (hak_evo_tick(now_ns)) {
|
||
// P0-2: Update cached strategy when window closes
|
||
int new_strategy = hak_elo_select_strategy();
|
||
atomic_store(&g_cached_strategy_id, new_strategy);
|
||
}
|
||
}
|
||
#endif
|
||
```
|
||
|
||
**Benefits**:
|
||
- **Compile-time guard**: Zero overhead when EVOLUTION disabled
|
||
- **Reduced runtime checks**: -70 cycles/alloc in minimal mode
|
||
|
||
---
|
||
|
||
### **Phase 6.11.4 P0-2: Cached Strategy** (1-2 hrs)
|
||
|
||
**Goal**: Eliminate ELO strategy selection overhead in LEARN mode
|
||
|
||
**Problem**: Heavy overhead in LEARN mode
|
||
```c
|
||
// Before (LEARN mode): 100回ごとに重い計算
|
||
g_elo_call_count++;
|
||
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
|
||
strategy_id = hak_elo_select_strategy(); // Heavy
|
||
g_cached_strategy_id = strategy_id;
|
||
hak_elo_record_alloc(strategy_id, size, 0); // Heavy
|
||
} else {
|
||
strategy_id = g_cached_strategy_id; // 99回はキャッシュ
|
||
}
|
||
|
||
// Overhead:
|
||
// - 剰余計算 (% 100): 10-20 cycles
|
||
// - 条件分岐: 5-10 cycles
|
||
// - カウンタインクリメント: 3-5 cycles
|
||
// Total: 18-35 cycles (99回) + heavy (1回)
|
||
```
|
||
|
||
**Solution**: Always use cached strategy
|
||
```c
|
||
// After (ALL modes): すべてのモードで同じ速度
|
||
int strategy_id = atomic_load(&g_cached_strategy_id); // 10 cycles のみ
|
||
size_t threshold = hak_elo_get_threshold(strategy_id);
|
||
|
||
// 更新は window closure 時のみ (hak_evo_tick)
|
||
if (hak_evo_tick(now_ns)) {
|
||
int new_strategy = hak_elo_select_strategy();
|
||
atomic_store(&g_cached_strategy_id, new_strategy);
|
||
}
|
||
```
|
||
|
||
**Changes**:
|
||
1. **hakmem.c Line 57-58**:
|
||
- Removed `g_elo_call_count`
|
||
- Changed `g_cached_strategy_id` to `static _Atomic int`
|
||
|
||
2. **hakmem.c Line 376-383**: Simplified ELO logic (42 lines → 10 lines)
|
||
```c
|
||
// Before: 42 lines (complex branching)
|
||
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) {
|
||
if (hak_evo_is_frozen()) { ... }
|
||
else if (hak_evo_is_canary()) { ... }
|
||
else { /* LEARN: 15 lines of ELO logic */ }
|
||
} else { threshold = 2097152; }
|
||
|
||
// After: 10 lines (simple atomic load)
|
||
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) {
|
||
int strategy_id = atomic_load(&g_cached_strategy_id);
|
||
threshold = hak_elo_get_threshold(strategy_id);
|
||
} else {
|
||
threshold = 2097152; // 2MB
|
||
}
|
||
```
|
||
|
||
3. **hakmem.c Line 299-300**: Initialize cached strategy in `hak_init()`
|
||
|
||
**Benefits**:
|
||
- **LEARN mode**: 剰余・分岐・カウンタ削除 → -18-35 cycles
|
||
- **FROZEN/CANARY**: Same speed (10 cycles atomic load)
|
||
- **Code simplification**: 42 lines → 10 lines (-76%)
|
||
|
||
---
|
||
|
||
## 📈 **Test Results**
|
||
|
||
### **Profiling Results** (minimal mode, vm scenario, 10 iterations)
|
||
|
||
**Before (Phase 6.11.3)**:
|
||
```
|
||
hak_alloc: 126,479 cycles (39.6%)
|
||
```
|
||
|
||
**After P0-1**:
|
||
```
|
||
hak_alloc: 119,480 cycles (24.3%) → -6,999 cycles (-5.5%)
|
||
```
|
||
|
||
**After P0-2**:
|
||
```
|
||
hak_alloc: 114,186 cycles (33.8%) → -12,293 cycles (-9.7% total)
|
||
```
|
||
|
||
**Analysis**:
|
||
- **Expected**: -45% (-56,479 cycles)
|
||
- **Actual**: -9.7% (-12,293 cycles)
|
||
- **Reason**: minimal mode では EVOLUTION 無効 → 削減されたのは実行時チェックのみ
|
||
|
||
---
|
||
|
||
### **Benchmark Results** (all scenarios, 100 iterations)
|
||
|
||
| Scenario | Phase 6.10.1 | **P0-2 後** | mimalloc | vs mimalloc |
|
||
|----------|--------------|-------------|----------|-------------|
|
||
| json (64KB) | 298 ns | **300 ns** | 220 ns | **+36.4%** ❌ |
|
||
| mir (256KB) | - | **870 ns** | 1,072 ns | **-18.8%** ✅ |
|
||
| vm (2MB) | - | **15,385 ns** | 13,812 ns | **+11.4%** ❌ |
|
||
|
||
**Analysis**:
|
||
- json: ほぼ変化なし (+0.7%)
|
||
- mir: わずかに改善 (-0.5% vs Phase 6.11.3 874 ns)
|
||
- vm: 悪化 (+10.4% vs Phase 6.11.3 13,933 ns)
|
||
|
||
---
|
||
|
||
## 🔍 **Key Discoveries**
|
||
|
||
### 1️⃣ **Pool/Cache が支配的で P0-2 の効果が見られない**
|
||
|
||
**実際の状況**:
|
||
- **json**: L2.5 Pool hit **100%** → hak_alloc のメインロジックをスキップ
|
||
- **mir**: L2.5 Pool hit **100%** → hak_alloc のメインロジックをスキップ
|
||
- **vm**: BigCache hit **99.9%** → hak_alloc のメインロジックをスキップ
|
||
|
||
**結論**: Pool/Cache の hit rate が高すぎて、`hak_alloc` の最適化が効果を発揮していない
|
||
|
||
### 2️⃣ **Profiling では効果あり、Benchmark では効果なし**
|
||
|
||
- **Profiling** (minimal mode): -9.7% 削減 ✅
|
||
- **Benchmark** (balanced mode): ほぼ変化なし ❌
|
||
|
||
**理由**:
|
||
- Profiling は Pool/Cache 無効(minimal mode)
|
||
- Benchmark は Pool/Cache 有効(balanced mode) → Pool/Cache が支配的
|
||
|
||
### 3️⃣ **次の最適化ターゲットは Pool/Cache 自体**
|
||
|
||
**hak_alloc の最適化は完了**(-9.7%)。次は:
|
||
- **L2.5 Pool の高速化** (Phase 6.13)
|
||
- **BigCache の高速化** (Phase 6.8+)
|
||
- **Tiny Pool の高速化** (Phase 6.12)
|
||
|
||
---
|
||
|
||
## 💡 **Lessons Learned**
|
||
|
||
### 1. **Profiling と Benchmark の違いを理解すべき**
|
||
- Profiling: 特定機能の overhead 測定(minimal mode)
|
||
- Benchmark: 実際のワークロードでの性能(balanced mode)
|
||
- **両方を測定しないと全体像が見えない**
|
||
|
||
### 2. **Pool/Cache が支配的な環境では hak_alloc 最適化の効果は限定的**
|
||
- json/mir: L2.5 Pool が 100% ヒット
|
||
- vm: BigCache が 99.9% ヒット
|
||
- **→ Pool/Cache 自体を最適化すべき**
|
||
|
||
### 3. **Compile-time guard は有効**
|
||
- P0-1: 実行時チェック削除で -5.5% 削減
|
||
- minimal mode で効果が見られた
|
||
|
||
### 4. **Cached Strategy は実装できたが、効果は限定的**
|
||
- P0-2: 42 lines → 10 lines (-76% コード削減) ✅
|
||
- でも、benchmark では効果が見られない ❌
|
||
|
||
---
|
||
|
||
## ✅ **Implementation Checklist**
|
||
|
||
### Completed
|
||
- [x] P0-1: Atomic operation elimination (30 min)
|
||
- [x] P0-2: Cached strategy (1-2 hrs)
|
||
- [x] Build & test (clean compile)
|
||
- [x] Profiling test (minimal mode)
|
||
- [x] Benchmark test (json/mir/vm all scenarios)
|
||
- [x] Analysis & completion report
|
||
|
||
---
|
||
|
||
## 🚀 **Next Steps**
|
||
|
||
### P0: 現状維持
|
||
- P0-1/P0-2 は実装完了
|
||
- Profiling では効果あり (-9.7%)
|
||
- Benchmark では効果が見られない(Pool/Cache が支配的)
|
||
|
||
### P1: Pool/Cache 最適化に注力
|
||
**Phase 6.12 (Tiny Pool)**:
|
||
- ≤1KB allocations の高速化
|
||
- Slab allocator 最適化
|
||
|
||
**Phase 6.13 (L2.5 LargePool)**:
|
||
- 64KB-1MB allocations の高速化
|
||
- mir scenario の改善(-18.8% → < -30%?)
|
||
|
||
**Phase 6.8+ (BigCache)**:
|
||
- vm scenario の改善(+11.4% → < +0%?)
|
||
|
||
### P2: mimalloc 打倒目標の再評価
|
||
**現状**:
|
||
- json: +36.4% ❌ (Phase 6.10.1 と同レベル)
|
||
- mir: -18.8% ✅
|
||
- vm: +11.4% ❌ (悪化)
|
||
|
||
**新目標**: Pool/Cache 最適化で json/vm も改善
|
||
- json: 300 ns → < 220 ns (mimalloc レベル)
|
||
- vm: 15,385 ns → < 13,812 ns (mimalloc レベル)
|
||
|
||
---
|
||
|
||
## 📝 **Technical Details**
|
||
|
||
### Code Changes Summary
|
||
1. **hakmem.c**:
|
||
- Line 57-58: `g_cached_strategy_id` を `_Atomic` に変更、`g_elo_call_count` 削除
|
||
- Line 361-375: Compile-time guard 追加 (P0-1 + P0-2 window closure 更新)
|
||
- Line 376-383: ELO logic 簡略化 (42 lines → 10 lines)
|
||
- Line 299-300: `hak_init()` で cached strategy 初期化
|
||
|
||
### Expected vs Actual
|
||
| Metric | Expected | Actual | Reason |
|
||
|--------|----------|--------|--------|
|
||
| Profiling (hak_alloc) | -45% | **-9.7%** | minimal mode (EVOLUTION 無効) |
|
||
| Benchmark (json) | -30% | **+0.7%** | Pool hit 100% |
|
||
| Benchmark (mir) | -42% | **-0.5%** | Pool hit 100% |
|
||
| Benchmark (vm) | -20% | **+10.4%** | BigCache hit 99.9% |
|
||
|
||
---
|
||
|
||
## 📚 **Summary**
|
||
|
||
### Implemented (Phase 6.11.4)
|
||
- ✅ P0-1: Atomic operation elimination (compile-time guard)
|
||
- ✅ P0-2: Cached strategy (42 lines → 10 lines)
|
||
- ✅ Profiling: hak_alloc -9.7% 削減
|
||
|
||
### Discovered ❌ **Pool/Cache が支配的で効果が見られない**
|
||
- **json/mir**: L2.5 Pool hit 100%
|
||
- **vm**: BigCache hit 99.9%
|
||
- **→ Pool/Cache 自体を最適化すべき**
|
||
|
||
### Recommendation ✅ **次は Pool/Cache 最適化に注力**
|
||
**Next Phase**: Phase 6.12 (Tiny Pool) / 6.13 (L2.5 Pool) / 6.8+ (BigCache)
|
||
|
||
---
|
||
|
||
**Implementation Time**: 約2-3時間(予想通り)
|
||
**Profiling Impact**: hak_alloc -9.7% 削減 ✅
|
||
**Benchmark Impact**: ほぼ変化なし(Pool/Cache が支配的)❌
|
||
**Lesson**: **Pool/Cache の最適化が次の優先事項** 🎯
|