docs: Update CLAUDE.md with Phase 9-11 lessons and Phase 12 strategy

## Changes
- Updated performance metrics (Phase 11: 9.38M ops/s, still 9x slower)
- Added Phase 9-11 lesson learned section
- Identified root cause: SuperSlab allocation churn (877 SuperSlabs)
- Added Phase 12 strategy: Shared SuperSlab Pool (mimalloc-style)

## Phase 12 Plan
Goal: System malloc parity (90M ops/s)
Strategy: Multiple size classes share same SuperSlab
Expected: 877 → 100-200 SuperSlabs (-70-80%)
Expected perf: +650-860% improvement

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-13 14:47:03 +09:00
parent 2be754853f
commit 2b9a03fa8b
1 changed files with 60 additions and 15 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -11,15 +11,26 @@

 ---

-## 📊 Current Performance (2025-11-09)
+## 📊 Current Performance (2025-11-13)

-### Benchmark Results
+### Benchmark Results (Random Mixed 256B)
 ```
-Tiny (256B):         2.76M ops/s (P0 ON, 100K iterations) 🏆
-Mid-Large (8-32KB):  167.75M vs System 61.81M (+171%) 🏆
+HAKMEM (Phase 11):   9.38M ops/s (Prewarm=8, +6.4% vs Phase 10) ⚠️
+System malloc:       90M ops/s (baseline)
+Gap:                 9.6x slower (10.4% of target)
 ```

-### Key Findings
+### Lessons from Phases 9-11 🎓
+1. **Phase 9 (Lazy Deallocation)**: +12% → reducing syscalls was correct but not sufficient
+2. **Phase 10 (TLS/SFC expansion)**: +2% → frontend hit rate is not the bottleneck
+3. **Phase 11 (Prewarm)**: +6.4% → relieves the symptom only; not a root-cause fix
+
+### Root Cause Identified ✅
+- **SuperSlab allocation churn**: 877 SuperSlabs created (100K iterations)
+- **Limit of the current architecture**: 1 SuperSlab = 1 size class (fixed)
+- **Next strategy**: Phase 12 Shared SuperSlab Pool (mimalloc-style), the fundamental fix
+
+### Past Achievements
 1. **Major improvement in Phase 7** - Header-based fast free (+180-280%)
 2. **P0 batch optimization** - stable operation achieved by fixing meta->used
 3. **Mid-Large decisive win** - +171% vs System thanks to SuperSlab efficiency
@@ -256,6 +267,24 @@ Ratio:              947% (9.47x faster!) 🏆

 ## 📝 Development History (Summary)

+### Phase 11: SuperSlab Prewarm (2025-11-13) ⚠️ Lesson
+- Pre-allocate SuperSlabs at startup to reduce mmap calls
+- Result: +6.4% improvement (8.82M → 9.38M ops/s)
+- **Lesson**: Reducing syscalls was correct, but it did not resolve the underlying SuperSlab churn (877 created)
+- Details: `PHASE11_SUPERSLAB_PREWARM_IMPLEMENTATION_REPORT.md`
+
+### Phase 10: TLS/SFC Aggressive Tuning (2025-11-13) ⚠️ Lesson
+- TLS Cache capacity increased 2-8x, refill batch increased 4-8x
+- Result: +2% improvement (9.71M → 9.89M ops/s)
+- **Lesson**: Frontend hit rate is not the bottleneck; backend churn is the real issue
+- Details: `core/tiny_adaptive_sizing.c`, `core/hakmem_tiny_config.c`
+
+### Phase 9: SuperSlab Lazy Deallocation (2025-11-13) ✅
+- Removed mincore (841 syscalls → 0), introduced an LRU cache
+- Result: +12% improvement (8.67M → 9.71M ops/s)
+- Syscall reduction: 3,412 → 1,729 (-49%)
+- Details: `core/hakmem_super_registry.c`
+
 ### Phase 2: Design Flaws Analysis (2025-11-08) 🔍
 - Discovered design flaws in the fixed-size caches
 - SuperSlab fixed at 32 slabs, TLS Cache with fixed capacity, etc.
@@ -361,21 +390,37 @@ make print-flags
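 3. **The power of runtime A/B** - environment variables make it possible to isolate the problem area
 4. **Header-based optimization** - a single byte can yield a dramatic performance gain
 5. **Box Theory** - making boundaries explicit achieves both safety and performance
+6. **Limits of incremental optimization** - relieving symptoms does not close the fundamental performance gap (9x)
+7. **Importance of identifying the bottleneck** - Phases 9-11 targeted the wrong bottleneck (syscalls)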

 ---

-## 🚀 Next Optimization Candidates
+## 🚀 Phase 12: Shared SuperSlab Pool (Fundamental Fix)

-### Priority: Low (already fast enough)
-1. Final check of branch-miss/IPC with perf A/B (release)
-2. COUNTER_MISMATCH threshold/frequency logging
-3. Light tuning of class5/6 front priority and branch hints
-4. Pool TLS Phase 1.5b: Pre-warm + adaptive refill
+### Strategy: mimalloc-style dynamic slab sharing

-### Priority: Medium (design improvements)
-1. SuperSlab dynamic expansion (mimalloc-style linked chunks)
-2. TLS Cache adaptive sizing
-3. BigCache hash table with chaining
+**Goal**: Performance on par with System malloc (90M ops/s)
+
+**Root cause**:
+- Current architecture: 1 SuperSlab = 1 size class (fixed)
+- Problem: 877 SuperSlabs created → 877MB reserved → huge metadata overhead
+
+**Solution**:
+- Multiple size classes share the same SuperSlab
+- Dynamic slab assignment (class_idx is decided at time of use)
+- Expected effect: 877 SuperSlabs → 100-200 (-70-80%)
+
+**Implementation plan**:
+1. **Phase 12-1: Dynamic slab metadata** - extend SlabMeta (make class_idx dynamic)
+2. **Phase 12-2: Shared allocation** - multiple classes allocate from the same SS
+3. **Phase 12-3: Smart eviction** - free low-utilization slabs first
+4. **Phase 12-4: Benchmark** - compare against System malloc (target: 80-100%)
+
+**Expected performance improvement**:
+- SuperSlab count: 877 → 100-200 (-70-80%)
+- Metadata overhead: -70-80%
+- Cache miss rate: significant reduction
+- Performance: 9.38M → 70-90M ops/s (+650-860% expected)

 ---
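
The Phase 12 idea in the hunk above (several size classes sharing one SuperSlab, with `class_idx` bound when a slab is first used) can be illustrated with a minimal C sketch. This is not HAKMEM's code: `SharedPool`, `pool_acquire_slab`, `pool_release_slab`, `CLASS_NONE`, and `MAX_SS` are hypothetical names, and the layout of `SlabMeta` is assumed. Only the 32-slabs-per-SuperSlab figure and the `SlabMeta`/`class_idx` names come from the diff. The sketch tracks metadata only and performs no real memory mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are illustrative assumptions, not HAKMEM's real API. */
#define SLABS_PER_SS 32    /* the diff mentions a fixed 32 slabs per SuperSlab */
#define MAX_SS       8     /* arbitrary cap for this sketch */
#define CLASS_NONE   0xFF  /* slab is not bound to any size class yet */

typedef struct {
    uint8_t  class_idx;    /* bound on first use, not at SuperSlab creation */
    uint16_t used;         /* live blocks in this slab */
} SlabMeta;

typedef struct {
    SlabMeta slabs[SLABS_PER_SS];
} SuperSlab;

typedef struct {
    SuperSlab ss[MAX_SS];
    int       ss_count;    /* SuperSlabs created so far */
} SharedPool;

static void pool_init(SharedPool *p) {
    assert(p != NULL);
    p->ss_count = 0;
}

/* Hand out a slab for class_idx. Any unbound slab in an existing
 * SuperSlab is reused before a new SuperSlab is created, so different
 * size classes end up sharing the same SuperSlab. */
static SlabMeta *pool_acquire_slab(SharedPool *p, uint8_t class_idx) {
    assert(p != NULL);
    for (int i = 0; i < p->ss_count; i++) {
        for (int j = 0; j < SLABS_PER_SS; j++) {
            SlabMeta *m = &p->ss[i].slabs[j];
            if (m->class_idx == CLASS_NONE) {
                m->class_idx = class_idx;
                m->used = 0;
                return m;
            }
        }
    }
    if (p->ss_count == MAX_SS) {
        return NULL;       /* pool exhausted in this sketch */
    }
    SuperSlab *s = &p->ss[p->ss_count++];
    for (int j = 0; j < SLABS_PER_SS; j++) {
        s->slabs[j].class_idx = CLASS_NONE;
        s->slabs[j].used = 0;
    }
    s->slabs[0].class_idx = class_idx;
    return &s->slabs[0];
}

/* Unbind an empty slab so that any size class can claim it later. */
static void pool_release_slab(SlabMeta *m) {
    assert(m != NULL);
    m->class_idx = CLASS_NONE;
    m->used = 0;
}
```

Under the fixed one-class-per-SuperSlab scheme, eight size classes would need at least eight SuperSlabs; in this sketch they fit in one, which is the mechanism behind the expected drop in SuperSlab count. A real implementation would also need thread safety and a faster free-slab lookup than the linear scan used here.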