# Phase 6.11.4 Completion Report: hak_alloc Optimization
**Date**: 2025-10-22
**Status**: ✅ **Implementation Complete** (P0-1 + P0-2)
**Goal**: Optimize the hak_alloc hot path to beat mimalloc in all scenarios
---
## 📊 **Background: Why hak_alloc Optimization?**
### Problem: hak_alloc is the #2 Bottleneck (Phase 6.11.3 Discovery)
**Profiling results** (Phase 6.11.3):
```
syscall_munmap: 131,666 cycles (41.3%)  ← #1 Bottleneck
hak_alloc:      126,479 cycles (39.6%)  ← #2 NEW DISCOVERY! 🔥
hak_free:        48,206 cycles (15.1%)
```
**Target**: Reduce hak_alloc overhead by ~45% to beat mimalloc in all scenarios
---
## 🔧 **Implementation**
### **Phase 6.11.4 P0-1: Atomic Operation Elimination** (30 min)
**Goal**: Eliminate atomic operations when EVOLUTION feature is disabled
**Changes**:
1. **hakmem.c Line 361-375**: Replace runtime `if (HAK_ENABLED_LEARNING(...))` with compile-time `#if HAKMEM_FEATURE_EVOLUTION`
```c
// Before (runtime check)
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(now_ns);
    }
}

// After (compile-time check)
#if HAKMEM_FEATURE_EVOLUTION
static _Atomic uint64_t tick_counter = 0;
if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
    if (hak_evo_tick(now_ns)) {
        // P0-2: Update cached strategy when window closes
        int new_strategy = hak_elo_select_strategy();
        atomic_store(&g_cached_strategy_id, new_strategy);
    }
}
#endif
```
**Benefits**:
- **Compile-time guard**: Zero overhead when EVOLUTION disabled
- **Reduced runtime checks**: -70 cycles/alloc in minimal mode
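The `#if` above only works because the feature flag resolves to a constant during preprocessing. A minimal sketch of that switch, assuming the flag is defined in a build-config header (the real definition is not shown in this report):
```c
/* Hypothetical sketch: the feature flag is a 0/1 preprocessor constant, so
 * when EVOLUTION is off the whole tick-counter block above is compiled out:
 * no atomic, no branch, zero cycles on the hot path. */
#ifndef HAKMEM_FEATURE_EVOLUTION
#define HAKMEM_FEATURE_EVOLUTION 0   /* minimal mode: EVOLUTION disabled */
#endif
```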
---
### **Phase 6.11.4 P0-2: Cached Strategy** (1-2 hrs)
**Goal**: Eliminate ELO strategy selection overhead in LEARN mode
**Problem**: Heavy overhead in LEARN mode
```c
// Before (LEARN mode): heavy computation every 100 calls
g_elo_call_count++;
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
    strategy_id = hak_elo_select_strategy();     // Heavy
    g_cached_strategy_id = strategy_id;
    hak_elo_record_alloc(strategy_id, size, 0);  // Heavy
} else {
    strategy_id = g_cached_strategy_id;          // Cached on the other 99 calls
}

// Overhead:
// - Modulo (% 100):       10-20 cycles
// - Branch:                5-10 cycles
// - Counter increment:     3-5 cycles
// Total: 18-35 cycles (99 of 100 calls) + heavy work (1 of 100)
```
**Solution**: Always use cached strategy
```c
// After (ALL modes): same speed in every mode
int strategy_id = atomic_load(&g_cached_strategy_id);  // ~10 cycles only
size_t threshold = hak_elo_get_threshold(strategy_id);

// The cache is updated only at window closure (hak_evo_tick)
if (hak_evo_tick(now_ns)) {
    int new_strategy = hak_elo_select_strategy();
    atomic_store(&g_cached_strategy_id, new_strategy);
}
```
**Changes**:
1. **hakmem.c Line 57-58**:
- Removed `g_elo_call_count`
- Changed `g_cached_strategy_id` to `static _Atomic int`
2. **hakmem.c Line 376-383**: Simplified ELO logic (42 lines → 10 lines)
```c
// Before: 42 lines (complex branching)
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) {
    if (hak_evo_is_frozen())      { ... }
    else if (hak_evo_is_canary()) { ... }
    else { /* LEARN: 15 lines of ELO logic */ }
} else {
    threshold = 2097152;
}

// After: 10 lines (simple atomic load)
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) {
    int strategy_id = atomic_load(&g_cached_strategy_id);
    threshold = hak_elo_get_threshold(strategy_id);
} else {
    threshold = 2097152;  // 2MB
}
```
3. **hakmem.c Line 299-300**: Initialize cached strategy in `hak_init()`
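For reference, a minimal sketch of what that initialization likely looks like (the body of `hak_init()` is not reproduced in this report, so the surrounding code is an assumption):
```c
// Sketch (assumed shape): seed the cached strategy once at startup so the
// hot path never observes the -1 "uninitialized" sentinel and only ever
// performs the cheap atomic load shown above.
static _Atomic int g_cached_strategy_id = -1;

void hak_init(void) {
    /* ... existing initialization ... */
    atomic_store(&g_cached_strategy_id, hak_elo_select_strategy());
}
```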
**Benefits**:
- **LEARN mode**: modulo, branch, and counter removed → -18 to -35 cycles
- **FROZEN/CANARY**: Same speed (10 cycles atomic load)
- **Code simplification**: 42 lines → 10 lines (-76%)
---
## 📈 **Test Results**
### **Profiling Results** (minimal mode, vm scenario, 10 iterations)
**Before (Phase 6.11.3)**:
```
hak_alloc: 126,479 cycles (39.6%)
```
**After P0-1**:
```
hak_alloc: 119,480 cycles (24.3%) → -6,999 cycles (-5.5%)
```
**After P0-2**:
```
hak_alloc: 114,186 cycles (33.8%) → -12,293 cycles (-9.7% total)
```
**Analysis**:
- **Expected**: -45% (-56,479 cycles)
- **Actual**: -9.7% (-12,293 cycles)
- **Reason**: EVOLUTION is disabled in minimal mode → only the runtime checks were eliminated
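For reference, the quoted percentages follow directly from the measured cycle counts:

$$
\frac{126{,}479 - 119{,}480}{126{,}479} \approx 5.5\%, \qquad
\frac{126{,}479 - 114{,}186}{126{,}479} \approx 9.7\%
$$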
---
### **Benchmark Results** (all scenarios, 100 iterations)
| Scenario | Phase 6.10.1 | **After P0-2** | mimalloc | vs mimalloc |
|----------|--------------|-------------|----------|-------------|
| json (64KB) | 298 ns | **300 ns** | 220 ns | **+36.4%** ❌ |
| mir (256KB) | - | **870 ns** | 1,072 ns | **-18.8%** ✅ |
| vm (2MB) | - | **15,385 ns** | 13,812 ns | **+11.4%** ❌ |
**Analysis**:
- json: essentially unchanged (+0.7%)
- mir: slight improvement (-0.5% vs Phase 6.11.3's 874 ns)
- vm: regression (+10.4% vs Phase 6.11.3's 13,933 ns)
---
## 🔍 **Key Discoveries**
### 1⃣ **Pool/Cache dominates, so P0-2 shows no visible effect**
**Observed behavior**:
- **json**: L2.5 Pool hit **100%** → skips the main logic of hak_alloc
- **mir**: L2.5 Pool hit **100%** → skips the main logic of hak_alloc
- **vm**: BigCache hit **99.9%** → skips the main logic of hak_alloc
**Conclusion**: The Pool/Cache hit rates are so high that the `hak_alloc` optimization is almost never exercised.
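This is essentially Amdahl's law applied to the allocation path: even if the slow path had achieved the full expected -45%, the benchmark-level gain is bounded by the fraction of allocations that miss the Pool/Cache. As a rough bound for the vm scenario:

$$
\text{overall gain} \le (1 - \text{hit rate}) \times \text{slow-path gain} = 0.1\% \times 45\% \approx 0.045\%
$$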
### 2⃣ **Effective in profiling, invisible in benchmarks**
- **Profiling** (minimal mode): -9.7% reduction ✅
- **Benchmark** (balanced mode): essentially unchanged ❌
**Reason**:
- Profiling runs with Pool/Cache disabled (minimal mode)
- Benchmarks run with Pool/Cache enabled (balanced mode) → Pool/Cache dominates
### 3⃣ **The next optimization target is the Pool/Cache itself**
**The hak_alloc optimization is done** (-9.7%). Next up:
- **Speed up the L2.5 Pool** (Phase 6.13)
- **Speed up BigCache** (Phase 6.8+)
- **Speed up the Tiny Pool** (Phase 6.12)
---
## 💡 **Lessons Learned**
### 1. **Understand the difference between profiling and benchmarking**
- Profiling: measures the overhead of a specific feature (minimal mode)
- Benchmark: measures performance on a realistic workload (balanced mode)
- **You need both to see the whole picture**
### 2. **hak_alloc optimization has limited impact when the Pool/Cache dominates**
- json/mir: L2.5 Pool hits 100%
- vm: BigCache hits 99.9%
- **→ The Pool/Cache itself should be optimized**
### 3. **Compile-time guards work**
- P0-1: removing the runtime check cut -5.5%
- The effect was visible in minimal mode
### 4. **Cached Strategy was implemented, but its impact is limited**
- P0-2: 42 lines → 10 lines (-76% code reduction) ✅
- But no effect is visible in the benchmarks ❌
---
## ✅ **Implementation Checklist**
### Completed
- [x] P0-1: Atomic operation elimination (30 min)
- [x] P0-2: Cached strategy (1-2 hrs)
- [x] Build & test (clean compile)
- [x] Profiling test (minimal mode)
- [x] Benchmark test (json/mir/vm all scenarios)
- [x] Analysis & completion report
---
## 🚀 **Next Steps**
### P0: Keep the current implementation
- P0-1/P0-2 are implemented and complete
- Effective in profiling (-9.7%)
- No effect visible in benchmarks (Pool/Cache dominates)
### P1: Focus on Pool/Cache optimization
**Phase 6.12 (Tiny Pool)**:
- Speed up ≤1KB allocations
- Optimize the slab allocator
**Phase 6.13 (L2.5 LargePool)**:
- Speed up 64KB-1MB allocations
- Improve the mir scenario (-18.8% → below -30%)
**Phase 6.8+ (BigCache)**:
- Improve the vm scenario (+11.4% → below +0%)
### P2: Re-evaluate the goal of beating mimalloc
**Current state**:
- json: +36.4% ❌ (same level as Phase 6.10.1)
- mir: -18.8% ✅
- vm: +11.4% ❌ (regression)
**New goal**: Improve json/vm as well via Pool/Cache optimization
- json: 300 ns → < 220 ns (mimalloc level)
- vm: 15,385 ns → < 13,812 ns (mimalloc level)
---
## 📝 **Technical Details**
### Code Changes Summary
1. **hakmem.c**:
- Line 57-58: changed `g_cached_strategy_id` to `_Atomic`, removed `g_elo_call_count`
- Line 361-375: added compile-time guard (P0-1 + P0-2 window-closure update)
- Line 376-383: simplified ELO logic (42 lines → 10 lines)
- Line 299-300: initialize the cached strategy in `hak_init()`
### Expected vs Actual
| Metric | Expected | Actual | Reason |
|--------|----------|--------|--------|
| Profiling (hak_alloc) | -45% | **-9.7%** | minimal mode (EVOLUTION disabled) |
| Benchmark (json) | -30% | **+0.7%** | Pool hit 100% |
| Benchmark (mir) | -42% | **-0.5%** | Pool hit 100% |
| Benchmark (vm) | -20% | **+10.4%** | BigCache hit 99.9% |
---
## 📚 **Summary**
### Implemented (Phase 6.11.4)
- ✅ P0-1: Atomic operation elimination (compile-time guard)
- ✅ P0-2: Cached strategy (42 lines → 10 lines)
- ✅ Profiling: hak_alloc reduced by -9.7%
### Discovered ❌ **Pool/Cache dominates, so no effect is visible**
- **json/mir**: L2.5 Pool hit 100%
- **vm**: BigCache hit 99.9%
- **→ The Pool/Cache itself should be optimized**
### Recommendation ✅ **Focus next on Pool/Cache optimization**
**Next Phase**: Phase 6.12 (Tiny Pool) / 6.13 (L2.5 Pool) / 6.8+ (BigCache)
---
**Implementation Time**: ~2-3 hours (as estimated)
**Profiling Impact**: hak_alloc reduced by -9.7% ✅
**Benchmark Impact**: essentially unchanged (Pool/Cache dominates) ❌
**Lesson**: **Pool/Cache optimization is the next priority** 🎯