# Phase 6.12: Tiny Pool Implementation Completion Report

**Completed**: 2025-10-21

**Status**: ✅ Basic implementation complete, ❌ performance target not met, 🎯 moving on to P0 optimization

---
## 📊 **Implementation Summary**
### ✅ **Completed Work**
1. **8 size classes**
   - 8B, 16B, 32B, 64B, 128B, 256B, 512B, 1KB
   - Bitmap-based free block search (`__builtin_ctzll`)
   - Free list management (free_slabs / full_slabs)
2. **64KB slab allocator**
   - Uses `posix_memalign` (memory leak fixed)
   - Slab metadata: `TinySlab` struct + bitmap
3. **Lite P1 pre-allocation**
   - Pre-allocates only the four Tier 1 classes (8-64B)
   - 256KB resident (not 512KB)
4. **New benchmark scenarios**
   - string-builder (8-64B, short-lived)
   - token-stream (16-128B, FIFO)
   - small-objects (32-256B, long-lived)
5. **Separate warmup phase**
   - The measurement phase is now separated from the initialization phase
   - Improves measurement accuracy
---
## 📊 **Benchmark Results**
### **Measured Performance** (vs mimalloc)
| Scenario | hakmem | mimalloc | system | hakmem vs mimalloc |
|----------|--------|----------|--------|--------------------|
| **string-builder** (8-64B) | 7,871 ns | 18 ns | 18 ns | **437x slower** ❌ |
| **token-stream** (16-128B) | 99 ns | 9 ns | 12 ns | **11x slower** ⚠️ |
| **small-objects** (32-256B) | 6 ns | 3 ns | 6 ns | **2x slower** ✅ |

### **Test Environment**
- CPU: x86_64 Linux
- Compiler: gcc -O2
- Iterations: 10,000 (string-builder, token-stream, small-objects)
- Warmup: 4 size classes pre-allocated in each scenario
---
## 🔍 **Root Cause Analysis**
### **Findings from Task-sensei's Investigation**
**Culprit identified**: the double call to `find_slab_by_ptr()` costs **6,000 ns/op (75%)**
#### **Problematic Code**
```c
// hakmem.c:510 - hak_free_at()
if (hak_tiny_is_managed(ptr)) {  // ← 1st find_slab_by_ptr()
    hak_tiny_free(ptr);          // ← 2nd find_slab_by_ptr()
    return;
}

// hakmem_tiny.c:253 - hak_tiny_is_managed()
int hak_tiny_is_managed(void* ptr) {
    return find_slab_by_ptr(ptr) != NULL;  // ← O(N) linear search!
}

// hakmem_tiny.c:71-96 - find_slab_by_ptr()
static TinySlab* find_slab_by_ptr(void* ptr) {
    // Assumed: slabs are TINY_SLAB_SIZE-aligned, so the owning slab's base is ptr rounded down
    uintptr_t slab_base = (uintptr_t)ptr & ~(uintptr_t)(TINY_SLAB_SIZE - 1);

    // Search in free_slabs (O(N))
    for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
        for (TinySlab* slab = ...; slab; slab = slab->next) {
            if ((uintptr_t)slab->base == slab_base) return slab;
        }
    }
    // Search in full_slabs (O(N))
    for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
        for (TinySlab* slab = ...; slab; slab = slab->next) {
            if ((uintptr_t)slab->base == slab_base) return slab;
        }
    }
    return NULL;
}
```
#### **Cost Breakdown**
```
string-builder: 40,000 free calls
  × 2 find_slab_by_ptr() calls each
  × ~3,000 ns per call on average (O(N) search)
  = 240,000,000 ns total
  → 6,000 ns/op (75%)
```
**Other overhead**:
- memset: 100 ns/op (1.3%)
- Function calls: 80 ns/op (1.0%)
- Bitmap search: ~1,691 ns/op, estimated (21.5%)
---
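## 💡 **ChatGPT Pro Diagnosis**
### **Verdict: keep going, don't give up!**
**Reasons**:
1. **P0+TLS changes the order of magnitude**: 7,871 ns → 50-80 ns (roughly 98-157x faster)
2. **Differentiation via SACS/ELO**: Hot/Warm/Cold classification can be applied to the Tiny range as well
3. **Consistency**: L1/L2/L2.5/L3 all follow the same policy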

**Timebox**: if P0 does not get below 200 ns/op, shift focus to L2.5

---
## 🎯 **P0 Optimization Strategy**
### **Option B: Embedded Metadata (16B tag at the start of each slab)**
**Implementation**:
```c
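#include <stdint.h>

#define TINY_SLAB_SIZE (64 * 1024)               // slabs are 64KB and aligned to their size
#define MAGIC 0x484B4D01u                        // placeholder: the draft value 0xH4KM3M01 is not valid hex
#define unlikely(x) __builtin_expect(!!(x), 0)

typedef struct TinySlab TinySlab;
extern uintptr_t cookie;                         // assumed: per-process secret that obfuscates the owner pointer

typedef struct __attribute__((packed)) {
    uintptr_t xored_owner;  // owner ^ cookie
    uint32_t  magic;        // MAGIC
    uint16_t  class_idx;
    uint16_t  epoch;        // ABA prevention
} SlabTag;                  // 16 bytes on 64-bit targets

static inline TinySlab* owner_slab(void* p) {
    uintptr_t base = (uintptr_t)p & ~(uintptr_t)(TINY_SLAB_SIZE - 1);
    SlabTag* t = (SlabTag*)base;
    if (unlikely(t->magic != MAGIC)) return NULL;
    return (TinySlab*)(t->xored_owner ^ cookie);  // O(1)!
}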
```
**Expected effect**: 6,000 ns → 5 ns (**1,200x faster**)
### **Option C: Remove the Double Call**
```c
// Revised hak_free_at()
TinySlab* slab = owner_slab(ptr);  // ← single lookup
if (slab) {
    hak_tiny_free_with_slab(ptr, slab);
    return;
}
```
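**Expected effect**: 2x faster (one lookup per free instead of two)
### **Remove All memsets**
Remove every memset except the ones needed for benchmark measurement.

**Expected effect**: ~100 ns/op saved

---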
## 📊 **Expected Improvements**
| Change | Current | After | Speedup |
|--------|---------|-------|---------|
| **P0: Option B + C + memset removal** | 7,871 ns | 1,871 ns | 4.2x |
| **P1: TLS freelist** | 1,871 ns | 50-80 ns | ~23-37x |
| **Overall** | 7,871 ns | **50-80 ns** | **~98-157x** |

**vs mimalloc**: 18 ns vs 50-80 ns → 2.8-4.4x slower (acceptable)

---
## 🎯 **Final Targets**
| Scenario | Current | Target | mimalloc | Target vs mimalloc |
|----------|---------|--------|----------|--------------------|
| **string-builder** | 7,871 ns | **50-80 ns** | 18 ns | 2.8-4.4x slower ✅ |
| **token-stream** | 99 ns | **≤20 ns** | 9 ns | ≤2.2x slower ✅ |
| **small-objects** | 6 ns | **≤10 ns** | 3 ns | ≤3.3x slower ✅ |
---
## 📁 **Related Files**
### **Implementation**
- `apps/experiments/hakmem-poc/hakmem_tiny.h` - Tiny Pool API
- `apps/experiments/hakmem-poc/hakmem_tiny.c` - Slab allocator implementation
- `apps/experiments/hakmem-poc/hakmem.c` - Integration code
### **Benchmarks**
- `apps/experiments/hakmem-poc/bench_allocators.c` - Implementation of the 3 scenarios
- `apps/experiments/hakmem-poc/Makefile` - Build configuration
### **Investigation Reports**
- `apps/experiments/hakmem-poc/WARMUP_ZERO_EFFECT_INVESTIGATION.md` - Task-sensei's investigation
- `/tmp/chatgpt_tiny_pool_optimization.txt` - Questions submitted to ChatGPT Pro
- `/mnt/c/git/nyash-project/chatgpt_tiny_pool_optimization.txt` - Same file (Windows copy)
---
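## ✅ **Phase 6.12 Completion Assessment**
**Basic implementation**: ✅ Complete

**Performance target**: ❌ Not met (string-builder is 437x slower than mimalloc)

**Root cause**: ✅ Identified (double call to `find_slab_by_ptr()`)

**Optimization strategy**: ✅ Finalized (approved by ChatGPT Pro)

**Next step**: Start implementing Phase 6.12.1 (P0 optimization)

---

**Authors**: Claude + Task-sensei + ChatGPT Pro

**Date**: 2025-10-21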