# Phase 6.11.4 Completion Report: hak_alloc Optimization
**Date**: 2025-10-22
**Status**: ✅ **Implementation Complete** (P0-1 + P0-2)
**Goal**: Optimize the hak_alloc hot path to beat mimalloc in all scenarios
---
## 📊 **Background: Why hak_alloc Optimization?**
### Problem: hak_alloc is the #2 Bottleneck (Phase 6.11.3 Discovery)
**Profiling results** (Phase 6.11.3):
```
syscall_munmap: 131,666 cycles (41.3%)  ← #1 Bottleneck
hak_alloc:      126,479 cycles (39.6%)  ← #2 NEW DISCOVERY! 🔥
hak_free:        48,206 cycles (15.1%)
```
**Target**: Reduce hak_alloc overhead by ~45% to beat mimalloc in all scenarios
---
## 🔧 **Implementation**
### **Phase 6.11.4 P0-1: Atomic Operation Elimination** (30 min)
**Goal**: Eliminate atomic operations when EVOLUTION feature is disabled
**Changes**:
1. **hakmem.c Line 361-375**: Replace runtime `if (HAK_ENABLED_LEARNING(...))` with compile-time `#if HAKMEM_FEATURE_EVOLUTION`
```c
// Before (runtime check)
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(now_ns);
    }
}

// After (compile-time check)
#if HAKMEM_FEATURE_EVOLUTION
static _Atomic uint64_t tick_counter = 0;
if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
    if (hak_evo_tick(now_ns)) {
        // P0-2: Update cached strategy when window closes
        int new_strategy = hak_elo_select_strategy();
        atomic_store(&g_cached_strategy_id, new_strategy);
    }
}
#endif
```
**Benefits**:
- **Compile-time guard**: Zero overhead when EVOLUTION disabled
- **Reduced runtime checks**: -70 cycles/alloc in minimal mode
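The `#if` above only works because the feature flag resolves to a constant during preprocessing. A minimal sketch of that switch, assuming the flag is defined in a build-config header (the real definition is not shown in this report):
```c
/* Hypothetical sketch: the feature flag is a 0/1 preprocessor constant, so
 * when EVOLUTION is off the whole tick-counter block above is compiled out:
 * no atomic, no branch, zero cycles on the hot path. */
#ifndef HAKMEM_FEATURE_EVOLUTION
#define HAKMEM_FEATURE_EVOLUTION 0   /* minimal mode: EVOLUTION disabled */
#endif
```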
---
### **Phase 6.11.4 P0-2: Cached Strategy** (1-2 hrs)
**Goal**: Eliminate ELO strategy selection overhead in LEARN mode
**Problem**: Heavy overhead in LEARN mode
```c
// Before (LEARN mode): heavy computation every 100 calls
g_elo_call_count++;
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
    strategy_id = hak_elo_select_strategy();     // Heavy
    g_cached_strategy_id = strategy_id;
    hak_elo_record_alloc(strategy_id, size, 0);  // Heavy
} else {
    strategy_id = g_cached_strategy_id;          // Cached on the other 99 calls
}

// Overhead:
// - Modulo (% 100):       10-20 cycles
// - Branch:                5-10 cycles
// - Counter increment:     3-5 cycles
// Total: 18-35 cycles (99 of 100 calls) + heavy work (1 of 100)
```
**Solution**: Always use cached strategy
```c
// After (ALL modes): same speed in every mode
int strategy_id = atomic_load(&g_cached_strategy_id);  // ~10 cycles only
size_t threshold = hak_elo_get_threshold(strategy_id);

// The cache is updated only at window closure (hak_evo_tick)
if (hak_evo_tick(now_ns)) {
    int new_strategy = hak_elo_select_strategy();
    atomic_store(&g_cached_strategy_id, new_strategy);
}
```
**Changes**:
1. **hakmem.c Line 57-58**:
- Removed `g_elo_call_count`
- Changed `g_cached_strategy_id` to `static _Atomic int`
2. **hakmem.c Line 376-383**: Simplified ELO logic (42 lines → 10 lines)
```c
// Before: 42 lines (complex branching)
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) {
    if (hak_evo_is_frozen())      { ... }
    else if (hak_evo_is_canary()) { ... }
    else { /* LEARN: 15 lines of ELO logic */ }
} else {
    threshold = 2097152;
}

// After: 10 lines (simple atomic load)
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) {
    int strategy_id = atomic_load(&g_cached_strategy_id);
    threshold = hak_elo_get_threshold(strategy_id);
} else {
    threshold = 2097152;  // 2MB
}
```
3. **hakmem.c Line 299-300**: Initialize cached strategy in `hak_init()`
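For reference, a minimal sketch of what that initialization likely looks like (the body of `hak_init()` is not reproduced in this report, so the surrounding code is an assumption):
```c
// Sketch (assumed shape): seed the cached strategy once at startup so the
// hot path never observes the -1 "uninitialized" sentinel and only ever
// performs the cheap atomic load shown above.
static _Atomic int g_cached_strategy_id = -1;

void hak_init(void) {
    /* ... existing initialization ... */
    atomic_store(&g_cached_strategy_id, hak_elo_select_strategy());
}
```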
**Benefits**:
- **LEARN mode**: modulo, branch, and counter removed → -18 to -35 cycles
- **FROZEN/CANARY**: Same speed (10 cycles atomic load)
- **Code simplification**: 42 lines → 10 lines (-76%)
---
## 📈 **Test Results**
### **Profiling Results** (minimal mode, vm scenario, 10 iterations)
**Before (Phase 6.11.3)**:
```
hak_alloc: 126,479 cycles (39.6%)
```
**After P0-1**:
```
hak_alloc: 119,480 cycles (24.3%) → -6,999 cycles (-5.5%)
```
**After P0-2**:
```
hak_alloc: 114,186 cycles (33.8%) → -12,293 cycles (-9.7% total)
```
**Analysis**:
- **Expected**: -45% (-56,479 cycles)
- **Actual**: -9.7% (-12,293 cycles)
- **Reason**: EVOLUTION is disabled in minimal mode → only the runtime checks were eliminated
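For reference, the quoted percentages follow directly from the measured cycle counts:

$$
\frac{126{,}479 - 119{,}480}{126{,}479} \approx 5.5\%, \qquad
\frac{126{,}479 - 114{,}186}{126{,}479} \approx 9.7\%
$$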
---
### **Benchmark Results** (all scenarios, 100 iterations)
| Scenario | Phase 6.10.1 | **After P0-2** | mimalloc | vs mimalloc |
|----------|--------------|-------------|----------|-------------|
| json (64KB) | 298 ns | **300 ns** | 220 ns | **+36.4%** ❌ |
| mir (256KB) | - | **870 ns** | 1,072 ns | **-18.8%** ✅ |
| vm (2MB) | - | **15,385 ns** | 13,812 ns | **+11.4%** ❌ |
**Analysis**:
- json: essentially unchanged (+0.7%)
- mir: slight improvement (-0.5% vs Phase 6.11.3's 874 ns)
- vm: regression (+10.4% vs Phase 6.11.3's 13,933 ns)
---
## 🔍 **Key Discoveries**
### 1⃣ **Pool/Cache dominates, so P0-2 shows no visible effect**
**Observed behavior**:
- **json**: L2.5 Pool hit **100%** → skips the main logic of hak_alloc
- **mir**: L2.5 Pool hit **100%** → skips the main logic of hak_alloc
- **vm**: BigCache hit **99.9%** → skips the main logic of hak_alloc
**Conclusion**: The Pool/Cache hit rates are so high that the `hak_alloc` optimization is almost never exercised.
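This is essentially Amdahl's law applied to the allocation path: even if the slow path had achieved the full expected -45%, the benchmark-level gain is bounded by the fraction of allocations that miss the Pool/Cache. As a rough bound for the vm scenario:

$$
\text{overall gain} \le (1 - \text{hit rate}) \times \text{slow-path gain} = 0.1\% \times 45\% \approx 0.045\%
$$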
### 2⃣ **Effective in profiling, invisible in benchmarks**
- **Profiling** (minimal mode): -9.7% reduction ✅
- **Benchmark** (balanced mode): essentially unchanged ❌
**Reason**:
- Profiling runs with Pool/Cache disabled (minimal mode)
- Benchmarks run with Pool/Cache enabled (balanced mode) → Pool/Cache dominates
### 3⃣ **The next optimization target is the Pool/Cache itself**
**The hak_alloc optimization is done** (-9.7%). Next up:
- **Speed up the L2.5 Pool** (Phase 6.13)
- **Speed up BigCache** (Phase 6.8+)
- **Speed up the Tiny Pool** (Phase 6.12)
---
## 💡 **Lessons Learned**
### 1. **Understand the difference between profiling and benchmarking**
- Profiling: measures the overhead of a specific feature (minimal mode)
- Benchmark: measures performance on a realistic workload (balanced mode)
- **You need both to see the whole picture**
### 2. **hak_alloc optimization has limited impact when the Pool/Cache dominates**
- json/mir: L2.5 Pool hits 100%
- vm: BigCache hits 99.9%
- **→ The Pool/Cache itself should be optimized**
### 3. **Compile-time guards work**
- P0-1: removing the runtime check cut -5.5%
- The effect was visible in minimal mode
### 4. **Cached Strategy was implemented, but its impact is limited**
- P0-2: 42 lines → 10 lines (-76% code reduction) ✅
- But no effect is visible in the benchmarks ❌
---
## ✅ **Implementation Checklist**
### Completed
- [x] P0-1: Atomic operation elimination (30 min)
- [x] P0-2: Cached strategy (1-2 hrs)
- [x] Build & test (clean compile)
- [x] Profiling test (minimal mode)
- [x] Benchmark test (json/mir/vm all scenarios)
- [x] Analysis & completion report
---
## 🚀 **Next Steps**
### P0: Keep the current implementation
- P0-1/P0-2 are implemented and complete
- Effective in profiling (-9.7%)
- No effect visible in benchmarks (Pool/Cache dominates)
### P1: Focus on Pool/Cache optimization
**Phase 6.12 (Tiny Pool)**:
- Speed up ≤1KB allocations
- Optimize the slab allocator
**Phase 6.13 (L2.5 LargePool)**:
- Speed up 64KB-1MB allocations
- Improve the mir scenario (-18.8% → below -30%)
**Phase 6.8+ (BigCache)**:
- Improve the vm scenario (+11.4% → below +0%)
### P2: Re-evaluate the goal of beating mimalloc
**Current state**:
- json: +36.4% ❌ (same level as Phase 6.10.1)
- mir: -18.8% ✅
- vm: +11.4% ❌ (regression)
**New goal**: Improve json/vm as well via Pool/Cache optimization
- json: 300 ns → < 220 ns (mimalloc level)
- vm: 15,385 ns → < 13,812 ns (mimalloc level)
---
## 📝 **Technical Details**
### Code Changes Summary
1. **hakmem.c**:
- Line 57-58: changed `g_cached_strategy_id` to `_Atomic`, removed `g_elo_call_count`
- Line 361-375: added compile-time guard (P0-1 + P0-2 window-closure update)
- Line 376-383: simplified ELO logic (42 lines → 10 lines)
- Line 299-300: initialize the cached strategy in `hak_init()`
### Expected vs Actual
| Metric | Expected | Actual | Reason |
|--------|----------|--------|--------|
| Profiling (hak_alloc) | -45% | **-9.7%** | minimal mode (EVOLUTION disabled) |
| Benchmark (json) | -30% | **+0.7%** | Pool hit 100% |
| Benchmark (mir) | -42% | **-0.5%** | Pool hit 100% |
| Benchmark (vm) | -20% | **+10.4%** | BigCache hit 99.9% |
---
## 📚 **Summary**
### Implemented (Phase 6.11.4)
- ✅ P0-1: Atomic operation elimination (compile-time guard)
- ✅ P0-2: Cached strategy (42 lines → 10 lines)
- ✅ Profiling: hak_alloc reduced by -9.7%
### Discovered ❌ **Pool/Cache dominates, so no effect is visible**
- **json/mir**: L2.5 Pool hit 100%
- **vm**: BigCache hit 99.9%
- **→ The Pool/Cache itself should be optimized**
### Recommendation ✅ **Focus next on Pool/Cache optimization**
**Next Phase**: Phase 6.12 (Tiny Pool) / 6.13 (L2.5 Pool) / 6.8+ (BigCache)
---
**Implementation Time**: ~2-3 hours (as estimated)
**Profiling Impact**: hak_alloc reduced by -9.7% ✅
**Benchmark Impact**: essentially unchanged (Pool/Cache dominates) ❌
**Lesson**: **Pool/Cache optimization is the next priority** 🎯