Files

Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History

Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-05 12:31:14 +09:00

9.3 KiB

Raw Blame History

Phase 6.11.4 Completion Report: hak_alloc Optimization

Date: 2025-10-22 Status: ✅ Implementation Complete (P0-1 + P0-2) Goal: Optimize hak_alloc hotpath to beat mimalloc in all scenarios

📊 Background: Why hak_alloc Optimization?

Problem: hak_alloc is the #2 Bottleneck (Phase 6.11.3 Discovery)

Profiling results (Phase 6.11.3):

syscall_munmap:  131,666 cycles (41.3%)  ← #1 Bottleneck
hak_alloc:       126,479 cycles (39.6%)  ← #2 NEW DISCOVERY! 🔥
hak_free:         48,206 cycles (15.1%)

Target: Reduce hak_alloc overhead by ~45% to beat mimalloc in all scenarios

🔧 Implementation

Phase 6.11.4 P0-1: Atomic Operation Elimination (30 min)

Goal: Eliminate atomic operations when EVOLUTION feature is disabled

Changes:

hakmem.c Line 361-375: Replace runtime if (HAK_ENABLED_LEARNING(...)) with compile-time #if HAKMEM_FEATURE_EVOLUTION

// Before (runtime check)
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(now_ns);
    }
}

// After (compile-time check)
#if HAKMEM_FEATURE_EVOLUTION
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        if (hak_evo_tick(now_ns)) {
            // P0-2: Update cached strategy when window closes
            int new_strategy = hak_elo_select_strategy();
            atomic_store(&g_cached_strategy_id, new_strategy);
        }
    }
#endif

Benefits:

Compile-time guard: Zero overhead when EVOLUTION disabled
Reduced runtime checks: -70 cycles/alloc in minimal mode

Phase 6.11.4 P0-2: Cached Strategy (1-2 hrs)

Goal: Eliminate ELO strategy selection overhead in LEARN mode

Problem: Heavy overhead in LEARN mode

// Before (LEARN mode): 100回ごとに重い計算
g_elo_call_count++;
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
    strategy_id = hak_elo_select_strategy();  // Heavy
    g_cached_strategy_id = strategy_id;
    hak_elo_record_alloc(strategy_id, size, 0);  // Heavy
} else {
    strategy_id = g_cached_strategy_id;  // 99回はキャッシュ
}

// Overhead:
// - 剰余計算 (% 100): 10-20 cycles
// - 条件分岐: 5-10 cycles
// - カウンタインクリメント: 3-5 cycles
// Total: 18-35 cycles (99回) + heavy (1回)

Solution: Always use cached strategy

// After (ALL modes): すべてのモードで同じ速度
int strategy_id = atomic_load(&g_cached_strategy_id);  // 10 cycles のみ
size_t threshold = hak_elo_get_threshold(strategy_id);

// 更新は window closure 時のみ (hak_evo_tick)
if (hak_evo_tick(now_ns)) {
    int new_strategy = hak_elo_select_strategy();
    atomic_store(&g_cached_strategy_id, new_strategy);
}

Changes:

hakmem.c Line 57-58:
- Removed g_elo_call_count
- Changed g_cached_strategy_id to static _Atomic int

hakmem.c Line 376-383: Simplified ELO logic (42 lines → 10 lines)

// Before: 42 lines (complex branching)
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) {
    if (hak_evo_is_frozen()) { ... }
    else if (hak_evo_is_canary()) { ... }
    else { /* LEARN: 15 lines of ELO logic */ }
} else { threshold = 2097152; }

// After: 10 lines (simple atomic load)
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) {
    int strategy_id = atomic_load(&g_cached_strategy_id);
    threshold = hak_elo_get_threshold(strategy_id);
} else {
    threshold = 2097152;  // 2MB
}

hakmem.c Line 299-300: Initialize cached strategy in hak_init()

Benefits:

LEARN mode: 剰余・分岐・カウンタ削除 → -18-35 cycles
FROZEN/CANARY: Same speed (10 cycles atomic load)
Code simplification: 42 lines → 10 lines (-76%)

📈 Test Results

Profiling Results (minimal mode, vm scenario, 10 iterations)

Before (Phase 6.11.3):

hak_alloc: 126,479 cycles (39.6%)

After P0-1:

hak_alloc: 119,480 cycles (24.3%)  → -6,999 cycles (-5.5%)

After P0-2:

hak_alloc: 114,186 cycles (33.8%)  → -12,293 cycles (-9.7% total)

Analysis:

Expected: -45% (-56,479 cycles)
Actual: -9.7% (-12,293 cycles)
Reason: minimal mode では EVOLUTION 無効 → 削減されたのは実行時チェックのみ

Benchmark Results (all scenarios, 100 iterations)

Scenario	Phase 6.10.1	P0-2 後	mimalloc	vs mimalloc
json (64KB)	298 ns	300 ns	220 ns	+36.4% ❌
mir (256KB)	-	870 ns	1,072 ns	-18.8% ✅
vm (2MB)	-	15,385 ns	13,812 ns	+11.4% ❌

Analysis:

json: ほぼ変化なし (+0.7%)
mir: わずかに改善 (-0.5% vs Phase 6.11.3 874 ns)
vm: 悪化 (+10.4% vs Phase 6.11.3 13,933 ns)

🔍 Key Discoveries

1️⃣ Pool/Cache が支配的で P0-2 の効果が見られない

実際の状況:

json: L2.5 Pool hit 100% → hak_alloc のメインロジックをスキップ
mir: L2.5 Pool hit 100% → hak_alloc のメインロジックをスキップ
vm: BigCache hit 99.9% → hak_alloc のメインロジックをスキップ

結論: Pool/Cache の hit rate が高すぎて、hak_alloc の最適化が効果を発揮していない

2️⃣ Profiling では効果あり、Benchmark では効果なし

Profiling (minimal mode): -9.7% 削減 ✅
Benchmark (balanced mode): ほぼ変化なし ❌

理由:

Profiling は Pool/Cache 無効（minimal mode）
Benchmark は Pool/Cache 有効（balanced mode） → Pool/Cache が支配的

3️⃣ 次の最適化ターゲットは Pool/Cache 自体

hak_alloc の最適化は完了（-9.7%）。次は：

L2.5 Pool の高速化 (Phase 6.13)
BigCache の高速化 (Phase 6.8+)
Tiny Pool の高速化 (Phase 6.12)

💡 Lessons Learned

1. Profiling と Benchmark の違いを理解すべき

Profiling: 特定機能の overhead 測定（minimal mode）
Benchmark: 実際のワークロードでの性能（balanced mode）
両方を測定しないと全体像が見えない

2. Pool/Cache が支配的な環境では hak_alloc 最適化の効果は限定的

json/mir: L2.5 Pool が 100% ヒット
vm: BigCache が 99.9% ヒット
→ Pool/Cache 自体を最適化すべき

3. Compile-time guard は有効

P0-1: 実行時チェック削除で -5.5% 削減
minimal mode で効果が見られた

4. Cached Strategy は実装できたが、効果は限定的

P0-2: 42 lines → 10 lines (-76% コード削減) ✅
でも、benchmark では効果が見られない ❌

✅ Implementation Checklist

Completed

P0-1: Atomic operation elimination (30 min)
P0-2: Cached strategy (1-2 hrs)
Build & test (clean compile)
Profiling test (minimal mode)
Benchmark test (json/mir/vm all scenarios)
Analysis & completion report

🚀 Next Steps

P0: 現状維持

P0-1/P0-2 は実装完了
Profiling では効果あり (-9.7%)
Benchmark では効果が見られない（Pool/Cache が支配的）

P1: Pool/Cache 最適化に注力

Phase 6.12 (Tiny Pool):

≤1KB allocations の高速化
Slab allocator 最適化

Phase 6.13 (L2.5 LargePool):

64KB-1MB allocations の高速化
mir scenario の改善（-18.8% → < -30%？）

Phase 6.8+ (BigCache):

vm scenario の改善（+11.4% → < +0%？）

P2: mimalloc 打倒目標の再評価

現状:

json: +36.4% ❌ (Phase 6.10.1 と同レベル)
mir: -18.8% ✅
vm: +11.4% ❌ (悪化)

新目標: Pool/Cache 最適化で json/vm も改善

json: 300 ns → < 220 ns (mimalloc レベル)
vm: 15,385 ns → < 13,812 ns (mimalloc レベル)

📝 Technical Details

Code Changes Summary

hakmem.c:
- Line 57-58: g_cached_strategy_id を _Atomic に変更、g_elo_call_count 削除
- Line 361-375: Compile-time guard 追加 (P0-1 + P0-2 window closure 更新)
- Line 376-383: ELO logic 簡略化 (42 lines → 10 lines)
- Line 299-300: hak_init() で cached strategy 初期化

Expected vs Actual

Metric	Expected	Actual	Reason
Profiling (hak_alloc)	-45%	-9.7%	minimal mode (EVOLUTION 無効)
Benchmark (json)	-30%	+0.7%	Pool hit 100%
Benchmark (mir)	-42%	-0.5%	Pool hit 100%
Benchmark (vm)	-20%	+10.4%	BigCache hit 99.9%

📚 Summary

Implemented (Phase 6.11.4)

✅ P0-1: Atomic operation elimination (compile-time guard)
✅ P0-2: Cached strategy (42 lines → 10 lines)
✅ Profiling: hak_alloc -9.7% 削減

Discovered ❌ Pool/Cache が支配的で効果が見られない

json/mir: L2.5 Pool hit 100%
vm: BigCache hit 99.9%
→ Pool/Cache 自体を最適化すべき

Recommendation ✅ 次は Pool/Cache 最適化に注力

Next Phase: Phase 6.12 (Tiny Pool) / 6.13 (L2.5 Pool) / 6.8+ (BigCache)

Implementation Time: 約2-3時間（予想通り） Profiling Impact: hak_alloc -9.7% 削減 ✅ Benchmark Impact: ほぼ変化なし（Pool/Cache が支配的）❌ Lesson: Pool/Cache の最適化が次の優先事項 🎯

9.3 KiB Raw Blame History Unescape Escape