Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
9.3 KiB
Phase 6.11.4 Completion Report: hak_alloc Optimization
Date: 2025-10-22 Status: ✅ Implementation Complete (P0-1 + P0-2) Goal: Optimize hak_alloc hotpath to beat mimalloc in all scenarios
📊 Background: Why hak_alloc Optimization?
Problem: hak_alloc is the #2 Bottleneck (Phase 6.11.3 Discovery)
Profiling results (Phase 6.11.3):
syscall_munmap: 131,666 cycles (41.3%) ← #1 Bottleneck
hak_alloc: 126,479 cycles (39.6%) ← #2 NEW DISCOVERY! 🔥
hak_free: 48,206 cycles (15.1%)
Target: Reduce hak_alloc overhead by ~45% to beat mimalloc in all scenarios
🔧 Implementation
Phase 6.11.4 P0-1: Atomic Operation Elimination (30 min)
Goal: Eliminate atomic operations when EVOLUTION feature is disabled
Changes:
- hakmem.c Line 361-375: Replace runtime
if (HAK_ENABLED_LEARNING(...))with compile-time#if HAKMEM_FEATURE_EVOLUTION
// Before (runtime check)
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
static _Atomic uint64_t tick_counter = 0;
if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
hak_evo_tick(now_ns);
}
}
// After (compile-time check)
#if HAKMEM_FEATURE_EVOLUTION
static _Atomic uint64_t tick_counter = 0;
if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
if (hak_evo_tick(now_ns)) {
// P0-2: Update cached strategy when window closes
int new_strategy = hak_elo_select_strategy();
atomic_store(&g_cached_strategy_id, new_strategy);
}
}
#endif
Benefits:
- Compile-time guard: Zero overhead when EVOLUTION disabled
- Reduced runtime checks: -70 cycles/alloc in minimal mode
Phase 6.11.4 P0-2: Cached Strategy (1-2 hrs)
Goal: Eliminate ELO strategy selection overhead in LEARN mode
Problem: Heavy overhead in LEARN mode
// Before (LEARN mode): 100回ごとに重い計算
g_elo_call_count++;
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
strategy_id = hak_elo_select_strategy(); // Heavy
g_cached_strategy_id = strategy_id;
hak_elo_record_alloc(strategy_id, size, 0); // Heavy
} else {
strategy_id = g_cached_strategy_id; // 99回はキャッシュ
}
// Overhead:
// - 剰余計算 (% 100): 10-20 cycles
// - 条件分岐: 5-10 cycles
// - カウンタインクリメント: 3-5 cycles
// Total: 18-35 cycles (99回) + heavy (1回)
Solution: Always use cached strategy
// After (ALL modes): すべてのモードで同じ速度
int strategy_id = atomic_load(&g_cached_strategy_id); // 10 cycles のみ
size_t threshold = hak_elo_get_threshold(strategy_id);
// 更新は window closure 時のみ (hak_evo_tick)
if (hak_evo_tick(now_ns)) {
int new_strategy = hak_elo_select_strategy();
atomic_store(&g_cached_strategy_id, new_strategy);
}
Changes:
-
hakmem.c Line 57-58:
- Removed
g_elo_call_count - Changed
g_cached_strategy_idtostatic _Atomic int
- Removed
-
hakmem.c Line 376-383: Simplified ELO logic (42 lines → 10 lines)
// Before: 42 lines (complex branching) if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) { if (hak_evo_is_frozen()) { ... } else if (hak_evo_is_canary()) { ... } else { /* LEARN: 15 lines of ELO logic */ } } else { threshold = 2097152; } // After: 10 lines (simple atomic load) if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) { int strategy_id = atomic_load(&g_cached_strategy_id); threshold = hak_elo_get_threshold(strategy_id); } else { threshold = 2097152; // 2MB } -
hakmem.c Line 299-300: Initialize cached strategy in
hak_init()
Benefits:
- LEARN mode: 剰余・分岐・カウンタ削除 → -18-35 cycles
- FROZEN/CANARY: Same speed (10 cycles atomic load)
- Code simplification: 42 lines → 10 lines (-76%)
📈 Test Results
Profiling Results (minimal mode, vm scenario, 10 iterations)
Before (Phase 6.11.3):
hak_alloc: 126,479 cycles (39.6%)
After P0-1:
hak_alloc: 119,480 cycles (24.3%) → -6,999 cycles (-5.5%)
After P0-2:
hak_alloc: 114,186 cycles (33.8%) → -12,293 cycles (-9.7% total)
Analysis:
- Expected: -45% (-56,479 cycles)
- Actual: -9.7% (-12,293 cycles)
- Reason: minimal mode では EVOLUTION 無効 → 削減されたのは実行時チェックのみ
Benchmark Results (all scenarios, 100 iterations)
| Scenario | Phase 6.10.1 | P0-2 後 | mimalloc | vs mimalloc |
|---|---|---|---|---|
| json (64KB) | 298 ns | 300 ns | 220 ns | +36.4% ❌ |
| mir (256KB) | - | 870 ns | 1,072 ns | -18.8% ✅ |
| vm (2MB) | - | 15,385 ns | 13,812 ns | +11.4% ❌ |
Analysis:
- json: ほぼ変化なし (+0.7%)
- mir: わずかに改善 (-0.5% vs Phase 6.11.3 874 ns)
- vm: 悪化 (+10.4% vs Phase 6.11.3 13,933 ns)
🔍 Key Discoveries
1️⃣ Pool/Cache が支配的で P0-2 の効果が見られない
実際の状況:
- json: L2.5 Pool hit 100% → hak_alloc のメインロジックをスキップ
- mir: L2.5 Pool hit 100% → hak_alloc のメインロジックをスキップ
- vm: BigCache hit 99.9% → hak_alloc のメインロジックをスキップ
結論: Pool/Cache の hit rate が高すぎて、hak_alloc の最適化が効果を発揮していない
2️⃣ Profiling では効果あり、Benchmark では効果なし
- Profiling (minimal mode): -9.7% 削減 ✅
- Benchmark (balanced mode): ほぼ変化なし ❌
理由:
- Profiling は Pool/Cache 無効(minimal mode)
- Benchmark は Pool/Cache 有効(balanced mode) → Pool/Cache が支配的
3️⃣ 次の最適化ターゲットは Pool/Cache 自体
hak_alloc の最適化は完了(-9.7%)。次は:
- L2.5 Pool の高速化 (Phase 6.13)
- BigCache の高速化 (Phase 6.8+)
- Tiny Pool の高速化 (Phase 6.12)
💡 Lessons Learned
1. Profiling と Benchmark の違いを理解すべき
- Profiling: 特定機能の overhead 測定(minimal mode)
- Benchmark: 実際のワークロードでの性能(balanced mode)
- 両方を測定しないと全体像が見えない
2. Pool/Cache が支配的な環境では hak_alloc 最適化の効果は限定的
- json/mir: L2.5 Pool が 100% ヒット
- vm: BigCache が 99.9% ヒット
- → Pool/Cache 自体を最適化すべき
3. Compile-time guard は有効
- P0-1: 実行時チェック削除で -5.5% 削減
- minimal mode で効果が見られた
4. Cached Strategy は実装できたが、効果は限定的
- P0-2: 42 lines → 10 lines (-76% コード削減) ✅
- でも、benchmark では効果が見られない ❌
✅ Implementation Checklist
Completed
- P0-1: Atomic operation elimination (30 min)
- P0-2: Cached strategy (1-2 hrs)
- Build & test (clean compile)
- Profiling test (minimal mode)
- Benchmark test (json/mir/vm all scenarios)
- Analysis & completion report
🚀 Next Steps
P0: 現状維持
- P0-1/P0-2 は実装完了
- Profiling では効果あり (-9.7%)
- Benchmark では効果が見られない(Pool/Cache が支配的)
P1: Pool/Cache 最適化に注力
Phase 6.12 (Tiny Pool):
- ≤1KB allocations の高速化
- Slab allocator 最適化
Phase 6.13 (L2.5 LargePool):
- 64KB-1MB allocations の高速化
- mir scenario の改善(-18.8% → < -30%?)
Phase 6.8+ (BigCache):
- vm scenario の改善(+11.4% → < +0%?)
P2: mimalloc 打倒目標の再評価
現状:
- json: +36.4% ❌ (Phase 6.10.1 と同レベル)
- mir: -18.8% ✅
- vm: +11.4% ❌ (悪化)
新目標: Pool/Cache 最適化で json/vm も改善
- json: 300 ns → < 220 ns (mimalloc レベル)
- vm: 15,385 ns → < 13,812 ns (mimalloc レベル)
📝 Technical Details
Code Changes Summary
- hakmem.c:
- Line 57-58:
g_cached_strategy_idを_Atomicに変更、g_elo_call_count削除 - Line 361-375: Compile-time guard 追加 (P0-1 + P0-2 window closure 更新)
- Line 376-383: ELO logic 簡略化 (42 lines → 10 lines)
- Line 299-300:
hak_init()で cached strategy 初期化
- Line 57-58:
Expected vs Actual
| Metric | Expected | Actual | Reason |
|---|---|---|---|
| Profiling (hak_alloc) | -45% | -9.7% | minimal mode (EVOLUTION 無効) |
| Benchmark (json) | -30% | +0.7% | Pool hit 100% |
| Benchmark (mir) | -42% | -0.5% | Pool hit 100% |
| Benchmark (vm) | -20% | +10.4% | BigCache hit 99.9% |
📚 Summary
Implemented (Phase 6.11.4)
- ✅ P0-1: Atomic operation elimination (compile-time guard)
- ✅ P0-2: Cached strategy (42 lines → 10 lines)
- ✅ Profiling: hak_alloc -9.7% 削減
Discovered ❌ Pool/Cache が支配的で効果が見られない
- json/mir: L2.5 Pool hit 100%
- vm: BigCache hit 99.9%
- → Pool/Cache 自体を最適化すべき
Recommendation ✅ 次は Pool/Cache 最適化に注力
Next Phase: Phase 6.12 (Tiny Pool) / 6.13 (L2.5 Pool) / 6.8+ (BigCache)
Implementation Time: 約2-3時間(予想通り) Profiling Impact: hak_alloc -9.7% 削減 ✅ Benchmark Impact: ほぼ変化なし(Pool/Cache が支配的)❌ Lesson: Pool/Cache の最適化が次の優先事項 🎯