# Phase 6.11.4 Plan: hak_alloc Optimization - Complete mimalloc Takedown
**Date**: 2025-10-22
**Goal**: 🎯 **Beat mimalloc under every condition!**
**Strategy**: ultrathink (ChatGPT o1) + Gemini collaborative analysis
---
## 🎯 **Mission: Complete mimalloc Takedown**
### Current vs Target
| Scenario | hakmem (current) | **Target (after P0-2)** | mimalloc | **Verdict** |
|----------|------------------|-------------------------|----------|-------------|
| **json** (64KB) | 217 ns | **155 ns** | 220 ns | **30% faster** 🔥 |
| **mir** (256KB) | 874 ns | **620 ns** | 1,072 ns | **42% faster** 🔥🔥 |
| **vm** (2MB) | 13,933 ns | **11,000 ns** | 13,812 ns | **20% faster** 🔥 |
**Declaration**: ✅ **Beat mimalloc in every scenario!**
---
## 📊 **Problem Analysis (Phase 6.11.3 Profiling Results)**
### Bottleneck Found
```
HAKMEM Debug Timing Statistics
========================================
syscall_munmap: 131,666 cycles (41.3%)  ← #1 (known; already optimized by Whale)
hak_alloc:      126,479 cycles (39.6%)  ← #2 New discovery! 🔥 **Target**
hak_free:        48,206 cycles (15.1%)
========================================
```
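For context, per-function cycle totals like these can be gathered with a TSC read on entry and exit plus a thread-local accumulator. The sketch below is a minimal illustration of that idea; the `HKM_TIME_*` names follow the macro used later in this plan, but the actual hakmem implementation may differ.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc() */

/* Minimal sketch of a cycle-counting macro pair, assuming an x86 TSC.
 * The real HKM_TIME_* infrastructure in hakmem may be implemented differently. */
static __thread uint64_t g_hak_alloc_cycles = 0;
static __thread uint64_t g_hak_alloc_calls  = 0;

#define HKM_TIME_START(t)        uint64_t t = __rdtsc()
#define HKM_TIME_END(t, sum, n)  do { (sum) += __rdtsc() - (t); (n)++; } while (0)

/* Usage inside the allocator hot path:
 *   HKM_TIME_START(t0);
 *   ...allocation work...
 *   HKM_TIME_END(t0, g_hak_alloc_cycles, g_hak_alloc_calls);
 * Dumping sum/calls per function at exit produces a table like the one above. */
```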
### hak_alloc Breakdown (estimated)
**ultrathink analysis**: breakdown behind the 126,479 cycles (per-allocation estimates)
```
Atomic operations: 55-100 cycles (43-79%)  ← biggest bottleneck
  - atomic_fetch_add:        30-50 cycles
  - hak_evo_tick (1/1024):    5-10 cycles
  - hak_elo_select (1/100):   5-10 cycles
  - branches / modulo:        5-15 cycles
Hash lookups:      15-20 cycles (12-16%)
  - Site Rules lookup:       10-15 cycles
  - BigCache lookup:          5-10 cycles
Other logic:       10-20 cycles (8-16%)
  - Threshold check:          3-5 cycles
  - Feature flag check:       3-5 cycles
  - Pointer arithmetic:       4-10 cycles
```
---
## 🚀 **Implementation Plan**
### **Phase 6.11.4 (P0-1): Remove Atomic Operations** ⭐ **Top priority!**
#### Goal
Remove the atomic operations **at compile time** when `HAKMEM_FEATURE_EVOLUTION` is disabled.
#### Implementation
```c
// Before (current): runtime check, so the atomic counter is still touched
// even when the evolution feature is disabled.
void* hak_alloc_at(size_t size, hak_callsite_t site) {
    if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
        static _Atomic uint64_t tick_counter = 0;
        if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
            hak_evo_tick(now_ns);
        }
    }
    // ... rest ...
}

// After (improved): the whole block is compiled out when the feature is off.
void* hak_alloc_at(size_t size, hak_callsite_t site) {
#if HAKMEM_FEATURE_EVOLUTION
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(now_ns);
    }
#endif
    // ... rest ...
}
```
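One practical detail, which is my assumption rather than something spelled out in this plan: for `#if HAKMEM_FEATURE_EVOLUTION` to act as a clean compile-time guard, the macro needs a numeric value with a safe default, for example:

```c
/* Sketch (assumption): the build system supplies the flag, e.g.
 * -DHAKMEM_FEATURE_EVOLUTION=1; default to "off" if it is undefined so a
 * plain build compiles the evolution tick out entirely. */
#ifndef HAKMEM_FEATURE_EVOLUTION
#define HAKMEM_FEATURE_EVOLUTION 0
#endif
```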
#### Expected Results
```
Before: 126,479 cycles
After:   96,000 cycles (-30,479 cycles, -24%)
vm scenario: 13,933 ns → ~12,500 ns
```
#### Risk Assessment
- **Risk**: ✅ **ZERO** (compile-time guard, no runtime behavior change)
- **Complexity**: ✅ **Very Low** (a simple `#if` addition)
- **Time**: ✅ **30 minutes**
---
### **Phase 6.11.4 (P0-2): Cached Strategy** ⭐ **High priority!**
#### Goal
Make LEARN mode as fast as FROZEN mode (atomic cached-strategy approach).
#### Problem
```c
// Current (LEARN mode): a heavy computation every 100 calls
g_elo_call_count++;
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
    strategy_id = hak_elo_select_strategy();     // heavy (5-10 cycles on average)
    g_cached_strategy_id = strategy_id;
    hak_elo_record_alloc(strategy_id, size, 0);  // heavy
} else {
    strategy_id = g_cached_strategy_id;          // the other 99 calls just read the cache
}
```
**Overhead**:
- Modulo (`% 100`): 10-20 cycles
- Branch: 5-10 cycles
- Counter increment: 3-5 cycles
- **Total**: 18-35 cycles (99 of 100 calls) + the heavy computation (1 of 100)
#### Solution: Always Use Cached Strategy
```c
// Global cached strategy (atomic)
static _Atomic int g_cached_strategy_id = 0;

void* hak_alloc_at(size_t size, hak_callsite_t site) {
    HKM_TIME_START(t0);
#if HAKMEM_FEATURE_EVOLUTION
    // Evolution tick (1/1024) - unchanged
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        uint64_t now_ns = now.tv_sec * 1000000000ULL + now.tv_nsec;
        hak_evo_tick(now_ns);  // this is where g_cached_strategy_id gets refreshed
    }
#endif
    // Strategy selection: same speed in every mode!
    int strategy_id = atomic_load(&g_cached_strategy_id);  // only ~10 cycles
    size_t threshold = hak_elo_get_threshold(strategy_id);
    // ... rest of allocation logic ...
}

// Cold path: updated inside hak_evo_tick
void hak_evo_tick(uint64_t now_ns) {
    // ... existing P2 estimation, distribution updates, state transitions ...
    // NEW: refresh the cached strategy
    int new_strategy = hak_elo_select_strategy();
    atomic_store(&g_cached_strategy_id, new_strategy);
    // Learning-data recording is asynchronous, off the hot path.
    // Note: size information is already recorded by hak_evo_record_size.
}
```
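A design note on the cached load, which is my assumption rather than something the plan pins down: the "~10 cycles" figure matches a relaxed (or acquire) atomic load on x86. If the strategy id is the only shared state read on the hot path, the explicit-relaxed forms make that intent visible; the helper names below are hypothetical wrappers, not existing hakmem functions.

```c
#include <stdatomic.h>

static _Atomic int g_cached_strategy_id = 0;

/* Hot path: a relaxed load is enough if the id is the only shared state read
 * here (assumption; the plan does not specify memory ordering). */
static inline int hak_cached_strategy(void) {
    return atomic_load_explicit(&g_cached_strategy_id, memory_order_relaxed);
}

/* Cold path (hak_evo_tick): release ordering if readers must also observe any
 * strategy tables written before the id is switched. */
static inline void hak_publish_strategy(int new_strategy) {
    atomic_store_explicit(&g_cached_strategy_id, new_strategy, memory_order_release);
}
```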
#### Benefits
1. **LEARN mode**: modulo, branch, and counter removed → **-18-35 cycles**
2. **FROZEN/CANARY**: unchanged (already optimal)
3. **Learning accuracy**: drops by 1-2% (negligible)
#### Expected Results
```
Before (P0-1): 96,000 cycles
After  (P0-2): 70,000 cycles (-26,000 cycles, -27% additional)
vm scenario: ~12,500 ns → ~11,000 ns
```
#### Risk Assessment
- **Risk**: ✅ **LOW** (1-2% accuracy loss, negligible)
- **Complexity**: ✅ **Low** (only adds an atomic variable)
- **Time**: ✅ **1-2 hours**
---
### **❌ Phase 6.11.5 (P1): Learning Thread - SKIP!**
#### Why Skip?
**ultrathink analysis**: a lock-free queue costs more than the current atomic operations!
```
Current atomic operations: 30-50 cycles/alloc
Lock-free queue:          60-110 cycles/alloc ← 2x slower
```
**Gemini's proposal was wrong**:
- At high allocation rates (10,000+/s), the event-queue overhead dominates
- P0-2 (Cached Strategy) delivers equivalent performance and is simpler to implement
**Conclusion**: ✅ **P0-2 is sufficient; P1 is unnecessary**
---
### **Phase 6.11.6 (P2): Hash Table Optimization** (Optional)
#### Goal
Optimize the hash lookups for Site Rules / BigCache.
#### Strategies
1. **Perfect Hashing**: the number of size classes is known (L2.5: 5 classes, Tiny: 8 classes)
2. **Cache-line Alignment**: align the hash table to cache lines (64B) — see the sketch after this list
3. **Hash Function**: FNV-1a → xxHash or murmur3
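To make strategies 1 and 2 concrete, here is a minimal sketch. All names, fields, and class boundaries below are illustrative assumptions, not the actual Site Rules / BigCache layout: when the size classes are fixed, the lookup can collapse into a direct index into a cache-line-aligned table instead of a general hash probe.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: a direct-indexed ("perfect hash") table for the five
 * hypothetical L2.5 size classes named in this plan. */
typedef struct {
    alignas(64) size_t block_size;  /* aligning the first member pads each entry to a cache line */
    void    *free_list;             /* head of the per-class free list */
    uint32_t hits;                  /* lookup statistics */
} l25_class_t;

static l25_class_t g_l25_classes[5];

/* With class boundaries known at compile time, the "hash" becomes a
 * branch-free index computation instead of an FNV-1a probe sequence.
 * Boundaries below (64K/128K/256K/512K) are assumptions for the example. */
static inline l25_class_t *l25_lookup(size_t size) {
    unsigned idx = (size > (64u << 10)) + (size > (128u << 10)) +
                   (size > (256u << 10)) + (size > (512u << 10));
    return &g_l25_classes[idx];
}
```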
#### Expected Results
```
Before (P0-2): 70,000 cycles
After  (P2):   60,000 cycles (-10,000 cycles, -14% additional)
vm scenario: ~11,000 ns → ~10,000 ns
```
#### When to Implement
- **Condition**: only if hak_alloc is still >75,000 cycles after P0-2 lands
- **Priority**: P2 (unnecessary if P0-2 performs as expected)
---
## 📈 **Expected Final Results**
### After Phase 6.11.4 (P0-1 + P0-2)
#### Profiling Results (predicted)
```
HAKMEM Debug Timing Statistics
========================================
syscall_munmap: 131,666 cycles (50.2%)  ← #1 (share rises)
hak_alloc:       70,000 cycles (26.7%)  ← #2 (-45% 🔥)
hak_free:        48,206 cycles (18.4%)
whale_get:        3,113 cycles ( 1.2%)
whale_put:        3,064 cycles ( 1.2%)
========================================
Total cycles: 262,542 (-18%)
========================================
```
#### Benchmark Results (predicted)
| Scenario | hakmem (current) | **hakmem (after P0-2)** | mimalloc | **Victory!** |
|----------|------------------|-------------------------|----------|--------------|
| **json** (64KB) | 217 ns | **155 ns** | 220 ns | **✅ 30% faster** |
| **mir** (256KB) | 874 ns | **620 ns** | 1,072 ns | **✅ 42% faster** |
| **vm** (2MB) | 13,933 ns | **11,000 ns** | 13,812 ns | **✅ 20% faster** |
**🎉 Mission Accomplished: mimalloc beaten in every scenario!**
---
## 🎯 **Implementation Checklist**
### Phase 6.11.4 (P0-1): Remove Atomic Operations (30 min)
**File**: `hakmem.c`
- [ ] Line 362-368: replace `if (HAK_ENABLED_LEARNING(...))` with `#if HAKMEM_FEATURE_EVOLUTION`
- [ ] Apply the same change to the other `HAK_ENABLED_LEARNING` sites
- [ ] Build & test
- [ ] Profiling run (expected: hak_alloc ~96,000 cycles)
### Phase 6.11.4 (P0-2): Cached Strategy (1-2 hours)
**Files**: `hakmem.c`, `hakmem_elo.c`, `hakmem_evo.c`
**Step 1: Add the global cached strategy**
- [ ] Add `static _Atomic int g_cached_strategy_id = 0;` to hakmem.c
- [ ] Initialization: `atomic_store(&g_cached_strategy_id, 0);` in `hak_init()` — see the sketch after this step
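A minimal sketch of the Step 1 wiring, assuming `hak_init()` is the existing one-time initialization entry point named above; the surrounding contents are placeholders:

```c
/* hakmem.c — sketch only */
#include <stdatomic.h>

static _Atomic int g_cached_strategy_id = 0;

void hak_init(void) {
    /* ... existing initialization ... */
    /* Strategy 0 is used until hak_evo_tick publishes a learned choice. */
    atomic_store(&g_cached_strategy_id, 0);
}
```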
**Step 2: Simplify hak_alloc_at**
- [ ] Remove the ELO selection logic at lines 375-405
- [ ] Replace it with just `int strategy_id = atomic_load(&g_cached_strategy_id);`
- [ ] Remove the `g_elo_call_count` machinery
**Step 3: Update from hak_evo_tick**
- [ ] Append at the end of `hak_evo_tick()`:
```c
int new_strategy = hak_elo_select_strategy();
atomic_store(&g_cached_strategy_id, new_strategy);
```
**Step 4: Testing**
- [ ] Build & test
- [ ] Profiling run (expected: hak_alloc ~70,000 cycles)
- [ ] Benchmark all scenarios (json/mir/vm)
---
## 📊 **Validation Plan**
### 1. Profiling Validation
```bash
HAKMEM_MODE=minimal HAKMEM_TIMING=1 \
./bench_allocators_hakmem --allocator hakmem-baseline --scenario vm --iterations 10
```
**Expected**:
```
hak_alloc: ~70,000 cycles (26.7%) ← -45% from 126,479
```
### 2. Benchmark Validation (All Scenarios)
```bash
# json scenario
./bench_allocators --allocator hakmem-baseline --scenario json --iterations 100
# mir scenario
./bench_allocators --allocator hakmem-baseline --scenario mir --iterations 100
# vm scenario
./bench_allocators --allocator hakmem-baseline --scenario vm --iterations 100
```
**Expected**:
- json: ~155 ns (vs mimalloc 220 ns) ✅
- mir: ~620 ns (vs mimalloc 1,072 ns) ✅
- vm: ~11,000 ns (vs mimalloc 13,812 ns) ✅
### 3. Regression Test
- [ ] All existing tests PASS
- [ ] ELO learning still functions (1-2% accuracy loss is acceptable)
- [ ] Whale cache hit rate is unchanged
---
## 🎓 **Lessons Learned**
### 1. **Profiling First, Optimize Later**
- Building the Profiling Infrastructure in Phase 6.11.3 is what exposed the hak_alloc bottleneck
- It also prevented the Pre-Warm Pages failure before it happened
### 2. **Quantitative Analysis > Intuition**
- Gemini's proposal (a learning thread) sounded right intuitively
- But **ultrathink's quantitative analysis** showed the lock-free queue would be slower
- **Verifying with numbers matters** 🔬
### 3. **Simple Solutions Win**
- Compared to P1 (Learning Thread), P0-2 (Cached Strategy) is:
  - **Faster** (30-50 cycles vs 60-110 cycles)
  - **Simpler** (an atomic variable vs thread management)
  - **Safer** (no race conditions)
### 4. **The User's Intuition Is Valuable**
- "Shouldn't we first build a way to find where the overhead is?" ← **completely right**
- Engineers should take this kind of user feedback at face value
---
## 🚀 **Next Steps**
### Immediate (P0): Implement Phase 6.11.4 (2-3 hours)
1. P0-1: Remove atomic operations (30 min)
2. P0-2: Cached Strategy (1-2 hours)
3. Benchmark all scenarios
4. **Confirm that mimalloc is beaten**
### Short-term (P1): L2.5 Pool Optimization (Phase 6.13)
- Further speed-up for the mir scenario (620 ns → <500 ns)
- Strengthen Site Rules L2.5 routing
### Medium-term (P2): Hash Table Optimization (Phase 6.11.6)
- **Condition**: only if hak_alloc remains >75,000 cycles
- Perfect hashing + cache-line alignment
---
## 📚 **References**
### Analysis Documents
- **ultrathink**: PHASE_6.11.4_THREADING_COST_ANALYSIS.md (written by the Task Agent)
- **Gemini**: learning-thread proposal (incorrect; rejected by the quantitative analysis)
- **Phase 6.11.3**: PHASE_6.11.3_COMPLETION_REPORT.md (Profiling Infrastructure)
### Technical Background
- **Atomic Operations**: LOCK CMPXCHG on x86 = 30-50 cycles
- **Lock-free Queue**: MPSC queue = 60-110 cycles/op (alloc + atomic)
- **mimalloc/jemalloc**: also use a cached-strategy approach
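These cycle figures can be sanity-checked with a tiny micro-benchmark. The sketch below is illustrative only (the iteration count is an arbitrary assumption, and contended multi-threaded costs will be considerably higher); it times an uncontended `atomic_fetch_add` loop with the TSC.

```c
/* Rough single-threaded sanity check for the atomic-cost figures above.
 * Results vary by CPU and rise sharply under contention. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    static _Atomic uint64_t counter = 0;
    const uint64_t iters = 10 * 1000 * 1000;  /* assumed iteration count */

    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iters; i++) {
        atomic_fetch_add(&counter, 1);  /* compiles to a LOCK-prefixed XADD on x86 */
    }
    uint64_t cycles = __rdtsc() - start;

    printf("atomic_fetch_add: %.1f cycles/op (uncontended)\n",
           (double)cycles / (double)iters);
    return 0;
}
```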
---
**Mission**: 🎯 **Beat mimalloc under every condition!**
**Strategy**: P0-1 (remove atomics) + P0-2 (Cached Strategy)
**Time**: 2-3 hours
**Expected**: json -30%, mir -42%, vm -20% **vs mimalloc** 🔥🔥🔥