# Phase 6.11.4 Completion Report: hak_alloc Optimization **Date**: 2025-10-22 **Status**: ✅ **Implementation Complete** (P0-1 + P0-2) **Goal**: Optimize hak_alloc hotpath to beat mimalloc in all scenarios --- ## 📊 **Background: Why hak_alloc Optimization?** ### Problem: hak_alloc is the #2 Bottleneck (Phase 6.11.3 Discovery) **Profiling results** (Phase 6.11.3): ``` syscall_munmap: 131,666 cycles (41.3%) ← #1 Bottleneck hak_alloc: 126,479 cycles (39.6%) ← #2 NEW DISCOVERY! 🔥 hak_free: 48,206 cycles (15.1%) ``` **Target**: Reduce hak_alloc overhead by ~45% to beat mimalloc in all scenarios --- ## 🔧 **Implementation** ### **Phase 6.11.4 P0-1: Atomic Operation Elimination** (30 min) **Goal**: Eliminate atomic operations when EVOLUTION feature is disabled **Changes**: 1. **hakmem.c Line 361-375**: Replace runtime `if (HAK_ENABLED_LEARNING(...))` with compile-time `#if HAKMEM_FEATURE_EVOLUTION` ```c // Before (runtime check) if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) { static _Atomic uint64_t tick_counter = 0; if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) { hak_evo_tick(now_ns); } } // After (compile-time check) #if HAKMEM_FEATURE_EVOLUTION static _Atomic uint64_t tick_counter = 0; if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) { if (hak_evo_tick(now_ns)) { // P0-2: Update cached strategy when window closes int new_strategy = hak_elo_select_strategy(); atomic_store(&g_cached_strategy_id, new_strategy); } } #endif ``` **Benefits**: - **Compile-time guard**: Zero overhead when EVOLUTION disabled - **Reduced runtime checks**: -70 cycles/alloc in minimal mode --- ### **Phase 6.11.4 P0-2: Cached Strategy** (1-2 hrs) **Goal**: Eliminate ELO strategy selection overhead in LEARN mode **Problem**: Heavy overhead in LEARN mode ```c // Before (LEARN mode): 100回ごとに重い計算 g_elo_call_count++; if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) { strategy_id = hak_elo_select_strategy(); // Heavy g_cached_strategy_id = strategy_id; hak_elo_record_alloc(strategy_id, size, 0); // Heavy } else { strategy_id = g_cached_strategy_id; // 99回はキャッシュ } // Overhead: // - 剰余計算 (% 100): 10-20 cycles // - 条件分岐: 5-10 cycles // - カウンタインクリメント: 3-5 cycles // Total: 18-35 cycles (99回) + heavy (1回) ``` **Solution**: Always use cached strategy ```c // After (ALL modes): すべてのモードで同じ速度 int strategy_id = atomic_load(&g_cached_strategy_id); // 10 cycles のみ size_t threshold = hak_elo_get_threshold(strategy_id); // 更新は window closure 時のみ (hak_evo_tick) if (hak_evo_tick(now_ns)) { int new_strategy = hak_elo_select_strategy(); atomic_store(&g_cached_strategy_id, new_strategy); } ``` **Changes**: 1. **hakmem.c Line 57-58**: - Removed `g_elo_call_count` - Changed `g_cached_strategy_id` to `static _Atomic int` 2. **hakmem.c Line 376-383**: Simplified ELO logic (42 lines → 10 lines) ```c // Before: 42 lines (complex branching) if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) { if (hak_evo_is_frozen()) { ... } else if (hak_evo_is_canary()) { ... } else { /* LEARN: 15 lines of ELO logic */ } } else { threshold = 2097152; } // After: 10 lines (simple atomic load) if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) { int strategy_id = atomic_load(&g_cached_strategy_id); threshold = hak_elo_get_threshold(strategy_id); } else { threshold = 2097152; // 2MB } ``` 3. **hakmem.c Line 299-300**: Initialize cached strategy in `hak_init()` **Benefits**: - **LEARN mode**: 剰余・分岐・カウンタ削除 → -18-35 cycles - **FROZEN/CANARY**: Same speed (10 cycles atomic load) - **Code simplification**: 42 lines → 10 lines (-76%) --- ## 📈 **Test Results** ### **Profiling Results** (minimal mode, vm scenario, 10 iterations) **Before (Phase 6.11.3)**: ``` hak_alloc: 126,479 cycles (39.6%) ``` **After P0-1**: ``` hak_alloc: 119,480 cycles (24.3%) → -6,999 cycles (-5.5%) ``` **After P0-2**: ``` hak_alloc: 114,186 cycles (33.8%) → -12,293 cycles (-9.7% total) ``` **Analysis**: - **Expected**: -45% (-56,479 cycles) - **Actual**: -9.7% (-12,293 cycles) - **Reason**: minimal mode では EVOLUTION 無効 → 削減されたのは実行時チェックのみ --- ### **Benchmark Results** (all scenarios, 100 iterations) | Scenario | Phase 6.10.1 | **P0-2 後** | mimalloc | vs mimalloc | |----------|--------------|-------------|----------|-------------| | json (64KB) | 298 ns | **300 ns** | 220 ns | **+36.4%** ❌ | | mir (256KB) | - | **870 ns** | 1,072 ns | **-18.8%** ✅ | | vm (2MB) | - | **15,385 ns** | 13,812 ns | **+11.4%** ❌ | **Analysis**: - json: ほぼ変化なし (+0.7%) - mir: わずかに改善 (-0.5% vs Phase 6.11.3 874 ns) - vm: 悪化 (+10.4% vs Phase 6.11.3 13,933 ns) --- ## 🔍 **Key Discoveries** ### 1️⃣ **Pool/Cache が支配的で P0-2 の効果が見られない** **実際の状況**: - **json**: L2.5 Pool hit **100%** → hak_alloc のメインロジックをスキップ - **mir**: L2.5 Pool hit **100%** → hak_alloc のメインロジックをスキップ - **vm**: BigCache hit **99.9%** → hak_alloc のメインロジックをスキップ **結論**: Pool/Cache の hit rate が高すぎて、`hak_alloc` の最適化が効果を発揮していない ### 2️⃣ **Profiling では効果あり、Benchmark では効果なし** - **Profiling** (minimal mode): -9.7% 削減 ✅ - **Benchmark** (balanced mode): ほぼ変化なし ❌ **理由**: - Profiling は Pool/Cache 無効(minimal mode) - Benchmark は Pool/Cache 有効(balanced mode) → Pool/Cache が支配的 ### 3️⃣ **次の最適化ターゲットは Pool/Cache 自体** **hak_alloc の最適化は完了**(-9.7%)。次は: - **L2.5 Pool の高速化** (Phase 6.13) - **BigCache の高速化** (Phase 6.8+) - **Tiny Pool の高速化** (Phase 6.12) --- ## 💡 **Lessons Learned** ### 1. **Profiling と Benchmark の違いを理解すべき** - Profiling: 特定機能の overhead 測定(minimal mode) - Benchmark: 実際のワークロードでの性能(balanced mode) - **両方を測定しないと全体像が見えない** ### 2. **Pool/Cache が支配的な環境では hak_alloc 最適化の効果は限定的** - json/mir: L2.5 Pool が 100% ヒット - vm: BigCache が 99.9% ヒット - **→ Pool/Cache 自体を最適化すべき** ### 3. **Compile-time guard は有効** - P0-1: 実行時チェック削除で -5.5% 削減 - minimal mode で効果が見られた ### 4. **Cached Strategy は実装できたが、効果は限定的** - P0-2: 42 lines → 10 lines (-76% コード削減) ✅ - でも、benchmark では効果が見られない ❌ --- ## ✅ **Implementation Checklist** ### Completed - [x] P0-1: Atomic operation elimination (30 min) - [x] P0-2: Cached strategy (1-2 hrs) - [x] Build & test (clean compile) - [x] Profiling test (minimal mode) - [x] Benchmark test (json/mir/vm all scenarios) - [x] Analysis & completion report --- ## 🚀 **Next Steps** ### P0: 現状維持 - P0-1/P0-2 は実装完了 - Profiling では効果あり (-9.7%) - Benchmark では効果が見られない(Pool/Cache が支配的) ### P1: Pool/Cache 最適化に注力 **Phase 6.12 (Tiny Pool)**: - ≤1KB allocations の高速化 - Slab allocator 最適化 **Phase 6.13 (L2.5 LargePool)**: - 64KB-1MB allocations の高速化 - mir scenario の改善(-18.8% → < -30%?) **Phase 6.8+ (BigCache)**: - vm scenario の改善(+11.4% → < +0%?) ### P2: mimalloc 打倒目標の再評価 **現状**: - json: +36.4% ❌ (Phase 6.10.1 と同レベル) - mir: -18.8% ✅ - vm: +11.4% ❌ (悪化) **新目標**: Pool/Cache 最適化で json/vm も改善 - json: 300 ns → < 220 ns (mimalloc レベル) - vm: 15,385 ns → < 13,812 ns (mimalloc レベル) --- ## 📝 **Technical Details** ### Code Changes Summary 1. **hakmem.c**: - Line 57-58: `g_cached_strategy_id` を `_Atomic` に変更、`g_elo_call_count` 削除 - Line 361-375: Compile-time guard 追加 (P0-1 + P0-2 window closure 更新) - Line 376-383: ELO logic 簡略化 (42 lines → 10 lines) - Line 299-300: `hak_init()` で cached strategy 初期化 ### Expected vs Actual | Metric | Expected | Actual | Reason | |--------|----------|--------|--------| | Profiling (hak_alloc) | -45% | **-9.7%** | minimal mode (EVOLUTION 無効) | | Benchmark (json) | -30% | **+0.7%** | Pool hit 100% | | Benchmark (mir) | -42% | **-0.5%** | Pool hit 100% | | Benchmark (vm) | -20% | **+10.4%** | BigCache hit 99.9% | --- ## 📚 **Summary** ### Implemented (Phase 6.11.4) - ✅ P0-1: Atomic operation elimination (compile-time guard) - ✅ P0-2: Cached strategy (42 lines → 10 lines) - ✅ Profiling: hak_alloc -9.7% 削減 ### Discovered ❌ **Pool/Cache が支配的で効果が見られない** - **json/mir**: L2.5 Pool hit 100% - **vm**: BigCache hit 99.9% - **→ Pool/Cache 自体を最適化すべき** ### Recommendation ✅ **次は Pool/Cache 最適化に注力** **Next Phase**: Phase 6.12 (Tiny Pool) / 6.13 (L2.5 Pool) / 6.8+ (BigCache) --- **Implementation Time**: 約2-3時間(予想通り) **Profiling Impact**: hak_alloc -9.7% 削減 ✅ **Benchmark Impact**: ほぼ変化なし(Pool/Cache が支配的)❌ **Lesson**: **Pool/Cache の最適化が次の優先事項** 🎯