Files
hakmem/docs/archive/PHASE_6.11.4_COMPLETION_REPORT.md
Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00

9.3 KiB
Raw Blame History

Phase 6.11.4 Completion Report: hak_alloc Optimization

Date: 2025-10-22 Status: Implementation Complete (P0-1 + P0-2) Goal: Optimize hak_alloc hotpath to beat mimalloc in all scenarios


📊 Background: Why hak_alloc Optimization?

Problem: hak_alloc is the #2 Bottleneck (Phase 6.11.3 Discovery)

Profiling results (Phase 6.11.3):

syscall_munmap:  131,666 cycles (41.3%)  ← #1 Bottleneck
hak_alloc:       126,479 cycles (39.6%)  ← #2 NEW DISCOVERY! 🔥
hak_free:         48,206 cycles (15.1%)

Target: Reduce hak_alloc overhead by ~45% to beat mimalloc in all scenarios


🔧 Implementation

Phase 6.11.4 P0-1: Atomic Operation Elimination (30 min)

Goal: Eliminate atomic operations when EVOLUTION feature is disabled

Changes:

  1. hakmem.c Line 361-375: Replace runtime if (HAK_ENABLED_LEARNING(...)) with compile-time #if HAKMEM_FEATURE_EVOLUTION
// Before (runtime check)
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(now_ns);
    }
}

// After (compile-time check)
#if HAKMEM_FEATURE_EVOLUTION
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        if (hak_evo_tick(now_ns)) {
            // P0-2: Update cached strategy when window closes
            int new_strategy = hak_elo_select_strategy();
            atomic_store(&g_cached_strategy_id, new_strategy);
        }
    }
#endif

Benefits:

  • Compile-time guard: Zero overhead when EVOLUTION disabled
  • Reduced runtime checks: -70 cycles/alloc in minimal mode

Phase 6.11.4 P0-2: Cached Strategy (1-2 hrs)

Goal: Eliminate ELO strategy selection overhead in LEARN mode

Problem: Heavy overhead in LEARN mode

// Before (LEARN mode): 100回ごとに重い計算
g_elo_call_count++;
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
    strategy_id = hak_elo_select_strategy();  // Heavy
    g_cached_strategy_id = strategy_id;
    hak_elo_record_alloc(strategy_id, size, 0);  // Heavy
} else {
    strategy_id = g_cached_strategy_id;  // 99回はキャッシュ
}

// Overhead:
// - 剰余計算 (% 100): 10-20 cycles
// - 条件分岐: 5-10 cycles
// - カウンタインクリメント: 3-5 cycles
// Total: 18-35 cycles (99回) + heavy (1回)

Solution: Always use cached strategy

// After (ALL modes): すべてのモードで同じ速度
int strategy_id = atomic_load(&g_cached_strategy_id);  // 10 cycles のみ
size_t threshold = hak_elo_get_threshold(strategy_id);

// 更新は window closure 時のみ (hak_evo_tick)
if (hak_evo_tick(now_ns)) {
    int new_strategy = hak_elo_select_strategy();
    atomic_store(&g_cached_strategy_id, new_strategy);
}

Changes:

  1. hakmem.c Line 57-58:

    • Removed g_elo_call_count
    • Changed g_cached_strategy_id to static _Atomic int
  2. hakmem.c Line 376-383: Simplified ELO logic (42 lines → 10 lines)

    // Before: 42 lines (complex branching)
    if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) {
        if (hak_evo_is_frozen()) { ... }
        else if (hak_evo_is_canary()) { ... }
        else { /* LEARN: 15 lines of ELO logic */ }
    } else { threshold = 2097152; }
    
    // After: 10 lines (simple atomic load)
    if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) {
        int strategy_id = atomic_load(&g_cached_strategy_id);
        threshold = hak_elo_get_threshold(strategy_id);
    } else {
        threshold = 2097152;  // 2MB
    }
    
  3. hakmem.c Line 299-300: Initialize cached strategy in hak_init()

Benefits:

  • LEARN mode: 剰余・分岐・カウンタ削除 → -18-35 cycles
  • FROZEN/CANARY: Same speed (10 cycles atomic load)
  • Code simplification: 42 lines → 10 lines (-76%)

📈 Test Results

Profiling Results (minimal mode, vm scenario, 10 iterations)

Before (Phase 6.11.3):

hak_alloc: 126,479 cycles (39.6%)

After P0-1:

hak_alloc: 119,480 cycles (24.3%)  → -6,999 cycles (-5.5%)

After P0-2:

hak_alloc: 114,186 cycles (33.8%)  → -12,293 cycles (-9.7% total)

Analysis:

  • Expected: -45% (-56,479 cycles)
  • Actual: -9.7% (-12,293 cycles)
  • Reason: minimal mode では EVOLUTION 無効 → 削減されたのは実行時チェックのみ

Benchmark Results (all scenarios, 100 iterations)

Scenario Phase 6.10.1 P0-2 後 mimalloc vs mimalloc
json (64KB) 298 ns 300 ns 220 ns +36.4%
mir (256KB) - 870 ns 1,072 ns -18.8%
vm (2MB) - 15,385 ns 13,812 ns +11.4%

Analysis:

  • json: ほぼ変化なし (+0.7%)
  • mir: わずかに改善 (-0.5% vs Phase 6.11.3 874 ns)
  • vm: 悪化 (+10.4% vs Phase 6.11.3 13,933 ns)

🔍 Key Discoveries

1 Pool/Cache が支配的で P0-2 の効果が見られない

実際の状況:

  • json: L2.5 Pool hit 100% → hak_alloc のメインロジックをスキップ
  • mir: L2.5 Pool hit 100% → hak_alloc のメインロジックをスキップ
  • vm: BigCache hit 99.9% → hak_alloc のメインロジックをスキップ

結論: Pool/Cache の hit rate が高すぎて、hak_alloc の最適化が効果を発揮していない

2 Profiling では効果あり、Benchmark では効果なし

  • Profiling (minimal mode): -9.7% 削減
  • Benchmark (balanced mode): ほぼ変化なし

理由:

  • Profiling は Pool/Cache 無効minimal mode
  • Benchmark は Pool/Cache 有効balanced mode → Pool/Cache が支配的

3 次の最適化ターゲットは Pool/Cache 自体

hak_alloc の最適化は完了-9.7%)。次は:

  • L2.5 Pool の高速化 (Phase 6.13)
  • BigCache の高速化 (Phase 6.8+)
  • Tiny Pool の高速化 (Phase 6.12)

💡 Lessons Learned

1. Profiling と Benchmark の違いを理解すべき

  • Profiling: 特定機能の overhead 測定minimal mode
  • Benchmark: 実際のワークロードでの性能balanced mode
  • 両方を測定しないと全体像が見えない

2. Pool/Cache が支配的な環境では hak_alloc 最適化の効果は限定的

  • json/mir: L2.5 Pool が 100% ヒット
  • vm: BigCache が 99.9% ヒット
  • → Pool/Cache 自体を最適化すべき

3. Compile-time guard は有効

  • P0-1: 実行時チェック削除で -5.5% 削減
  • minimal mode で効果が見られた

4. Cached Strategy は実装できたが、効果は限定的

  • P0-2: 42 lines → 10 lines (-76% コード削減)
  • でも、benchmark では効果が見られない

Implementation Checklist

Completed

  • P0-1: Atomic operation elimination (30 min)
  • P0-2: Cached strategy (1-2 hrs)
  • Build & test (clean compile)
  • Profiling test (minimal mode)
  • Benchmark test (json/mir/vm all scenarios)
  • Analysis & completion report

🚀 Next Steps

P0: 現状維持

  • P0-1/P0-2 は実装完了
  • Profiling では効果あり (-9.7%)
  • Benchmark では効果が見られないPool/Cache が支配的)

P1: Pool/Cache 最適化に注力

Phase 6.12 (Tiny Pool):

  • ≤1KB allocations の高速化
  • Slab allocator 最適化

Phase 6.13 (L2.5 LargePool):

  • 64KB-1MB allocations の高速化
  • mir scenario の改善(-18.8% → < -30%

Phase 6.8+ (BigCache):

  • vm scenario の改善(+11.4% → < +0%

P2: mimalloc 打倒目標の再評価

現状:

  • json: +36.4% (Phase 6.10.1 と同レベル)
  • mir: -18.8%
  • vm: +11.4% (悪化)

新目標: Pool/Cache 最適化で json/vm も改善

  • json: 300 ns → < 220 ns (mimalloc レベル)
  • vm: 15,385 ns → < 13,812 ns (mimalloc レベル)

📝 Technical Details

Code Changes Summary

  1. hakmem.c:
    • Line 57-58: g_cached_strategy_id_Atomic に変更、g_elo_call_count 削除
    • Line 361-375: Compile-time guard 追加 (P0-1 + P0-2 window closure 更新)
    • Line 376-383: ELO logic 簡略化 (42 lines → 10 lines)
    • Line 299-300: hak_init() で cached strategy 初期化

Expected vs Actual

Metric Expected Actual Reason
Profiling (hak_alloc) -45% -9.7% minimal mode (EVOLUTION 無効)
Benchmark (json) -30% +0.7% Pool hit 100%
Benchmark (mir) -42% -0.5% Pool hit 100%
Benchmark (vm) -20% +10.4% BigCache hit 99.9%

📚 Summary

Implemented (Phase 6.11.4)

  • P0-1: Atomic operation elimination (compile-time guard)
  • P0-2: Cached strategy (42 lines → 10 lines)
  • Profiling: hak_alloc -9.7% 削減

Discovered Pool/Cache が支配的で効果が見られない

  • json/mir: L2.5 Pool hit 100%
  • vm: BigCache hit 99.9%
  • → Pool/Cache 自体を最適化すべき

Recommendation 次は Pool/Cache 最適化に注力

Next Phase: Phase 6.12 (Tiny Pool) / 6.13 (L2.5 Pool) / 6.8+ (BigCache)


Implementation Time: 約2-3時間予想通り Profiling Impact: hak_alloc -9.7% 削減 Benchmark Impact: ほぼ変化なしPool/Cache が支配的) Lesson: Pool/Cache の最適化が次の優先事項 🎯