hakmem/CLAUDE.md

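# HAKMEM Memory Allocator - Claude Work Log

This file records important information from development sessions with Claude.

## Project Overview

**HAKMEM** is a high-performance memory allocator with the following goals:
- Average performance on par with mimalloc
- Good memory efficiency through a smart learning layer
- Particularly strong performance for Mid-Large (8-32KB) allocations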

---

## 📊 Current Performance (2025-11-13)

### Benchmark Results (Random Mixed 256B)
```
HAKMEM (Phase 11):   9.38M ops/s (Prewarm=8, +6.4% vs Phase 10) ⚠️
System malloc:       90M ops/s (baseline)
Gap:                 9.6x slower (10.4% of target)
```

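### Lessons from Phases 9-11 🎓
1. **Phase 9 (Lazy Deallocation)**: +12% → reducing syscalls was correct but not sufficient
2. **Phase 10 (TLS/SFC expansion)**: +2% → the frontend hit rate is not the bottleneck
3. **Phase 11 (Prewarm)**: +6.4% → only eases the symptoms; not a root-cause fix

### Root Cause Identified ✅
- **SuperSlab allocation churn**: 877 SuperSlabs created (100K iterations)
- **Limit of the current architecture**: 1 SuperSlab = 1 size class (fixed)
- **Next strategy**: Phase 12 Shared SuperSlab Pool (mimalloc-style), the fundamental fix

### Past Achievements
1. **Major improvement in Phase 7** - Header-based fast free (+180-280%)
2. **P0 batch optimization** - stable operation achieved by the `meta->used` fix
3. **Decisive Mid-Large win** - +171% vs. System malloc thanks to SuperSlab efficiency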

---

## 🔥 **CRITICAL FIX: Pointer Conversion Bug (2025-11-13)** ✅

### **Root Cause**: DOUBLE CONVERSION (USER → BASE executed twice)

**Status**: ✅ **FIXED** - Minimal patch (< 15 lines)

**Symptoms**:
- C7 (1KB) alignment error: `delta % 1024 == 1` (off by one)
- Error log: `[C7_ALIGN_CHECK_FAIL] ptr=0x...402 base=0x...401`
- Expected: `delta % 1024 == 0` (aligned to block boundary)

**Root Cause**:
```c
// core/tiny_superslab_free.inc.h (before fix)
static inline void hak_tiny_free_superslab(void* ptr, SuperSlab* ss) {
    int slab_idx = slab_index_for(ss, ptr);  // ← Uses USER pointer (wrong!)
    // ... 8 lines ...
    void* base = (void*)((uint8_t*)ptr - 1);  // ← Converts USER → BASE

    // Problem: On 2nd free cycle, ptr is already BASE, so:
    // base = BASE - 1 = storage - 1 ← DOUBLE CONVERSION! Off by one!
}
```

**Fix** (lines 17-24):
```c
static inline void hak_tiny_free_superslab(void* ptr, SuperSlab* ss) {
    // ✅ FIX: Convert USER → BASE at entry point (single conversion)
    void* base = (void*)((uint8_t*)ptr - 1);

    // CRITICAL: Use BASE pointer for slab_index calculation!
    int slab_idx = slab_index_for(ss, base);  // ← Fixed!
    // ... rest of function uses BASE consistently
}
```
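The single-conversion rule can be illustrated in isolation. The following is a minimal sketch, not HAKMEM's actual code: `user_to_base` and `base_is_aligned` are hypothetical helpers, and the layout (USER pointer = BASE + 1 header byte) is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: a USER pointer sits 1 byte past the block's BASE
 * (the byte at BASE holds the header). Call this exactly once, at entry. */
static inline void *user_to_base(void *user) {
    return (void *)((uint8_t *)user - 1);
}

/* Hypothetical check in the spirit of the C7 alignment test: a BASE pointer
 * must lie on a block boundary relative to the start of its slab. */
static inline int base_is_aligned(const void *slab_start, const void *base,
                                  size_t block_size) {
    size_t delta = (size_t)((const uint8_t *)base - (const uint8_t *)slab_start);
    return delta % block_size == 0;
}
```

Applying `user_to_base` a second time to an already converted pointer yields `BASE - 1`, which no longer passes `base_is_aligned`; that is the kind of off-by-one the alignment check reported.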

**Verification**:
```bash
# Before fix: [C7_ALIGN_CHECK_FAIL] delta%blk=1
# After fix: No errors
./out/release/bench_fixed_size_hakmem 10000 1024 128  # ✅ PASS
```

**Detailed Report**: [`POINTER_CONVERSION_BUG_ANALYSIS.md`](POINTER_CONVERSION_BUG_ANALYSIS.md), [`POINTER_FIX_SUMMARY.md`](POINTER_FIX_SUMMARY.md)

---

## 🔥 **CRITICAL FIX: P0 TLS Stale Pointer Bug (2025-11-09)** ✅

### **Root Cause**: Active Counter Corruption

**Status**: ✅ **FIXED** - 1-line patch

**Symptoms**:
- SEGV crash in `bench_fixed_size` (256B, 1KB)
- Active counter corruption: `active_delta=-991` when allocating 128 blocks
- Trying to allocate 128 blocks from slab with capacity=64

**Root Cause**:
```c
// core/hakmem_tiny_refill_p0.inc.h:256-262 (before fix)
if (meta->carved >= meta->capacity) {
    if (superslab_refill(class_idx) == NULL) break;
    meta = tls->meta;  // ← Updates meta, but tls is STALE!
    continue;
}
ss_active_add(tls->ss, batch);  // ← Updates WRONG SuperSlab counter!
```

After `superslab_refill()` switches to a new SuperSlab, the local `tls` pointer becomes stale (still points to the old SuperSlab). Subsequent `ss_active_add(tls->ss, batch)` updates the WRONG SuperSlab's active counter, causing:
1. SuperSlab A's counter incorrectly incremented
2. SuperSlab B's counter unchanged (should have been incremented)
3. When blocks from B are freed → counter underflow → SEGV

**Fix** (line 279):
```c
if (meta->carved >= meta->capacity) {
    if (superslab_refill(class_idx) == NULL) break;
    tls = &g_tls_slabs[class_idx];  // ← RELOAD TLS after slab switch!
    meta = tls->meta;
    continue;
}
```

**Verification**:
```
256B fixed-size: 862K ops/s (stable, 200K iterations, 0 crashes) ✅
1KB fixed-size:  872K ops/s (stable, 200K iterations, 0 crashes) ✅
Stability test:  3/3 runs passed ✅
Counter errors:  0 (was: active_delta=-991) ✅
```

**Detailed Report**: [`TINY_256B_1KB_SEGV_FIX_REPORT.md`](TINY_256B_1KB_SEGV_FIX_REPORT.md)

---

## 🚀 Phase 7: Header-Based Fast Free (2025-11-08) ✅

### Achievements
- **+180-280% performance improvement** (Random Mixed 128-1024B)
- O(1) class identification via a 1-byte header (`0xa0 | class_idx`)
- Ultra-fast free path (3-5 instructions)

### Key Techniques
1. **Header write** - a 1-byte header is added at allocation time
2. **Fast free** - no SuperSlab lookup; pushes directly onto the TLS SLL
3. **Hybrid mincore** - `mincore()` runs only at page boundaries (99.9% of cases take 1-2 cycles)
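The header scheme above can be sketched as follows. This is an illustration under stated assumptions, not the code in `core/tiny_region_id.h`: the helper names are hypothetical, the class index is assumed to fit in the low nibble, and the page-boundary test reflects one reading of "Hybrid mincore" (the header byte at `user - 1` lies on a different page only when the USER pointer is page-aligned).

```c
#include <assert.h>
#include <stdint.h>

#define TOY_HDR_MAGIC      0xa0u  /* high nibble, from `0xa0 | class_idx` */
#define TOY_HDR_CLASS_MASK 0x0fu  /* assumption: class index fits in 4 bits */

/* Write the 1-byte header at BASE and return the USER pointer (BASE + 1). */
static inline void *toy_hdr_write(void *base, unsigned class_idx) {
    uint8_t *p = (uint8_t *)base;
    *p = (uint8_t)(TOY_HDR_MAGIC | (class_idx & TOY_HDR_CLASS_MASK));
    return p + 1;
}

/* Read the class index from the byte before USER; -1 if the magic is absent. */
static inline int toy_hdr_read_class(const void *user) {
    uint8_t h = *((const uint8_t *)user - 1);
    if ((h & 0xf0u) != TOY_HDR_MAGIC) return -1;
    return (int)(h & TOY_HDR_CLASS_MASK);
}

/* True only when USER is page-aligned, i.e. the header byte is on the
 * previous page and a residency check might be needed before reading it.
 * page_size must be a power of two. */
static inline int toy_hdr_on_previous_page(const void *user, uintptr_t page_size) {
    return ((uintptr_t)user & (page_size - 1)) == 0;
}
```

With 4096-byte pages, only pointers whose low 12 bits are zero take the slow check; every other free can read the header byte directly.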

### Results
```
Random Mixed 128B:  21M → 59M ops/s (+181%)
Random Mixed 256B:  19M → 70M ops/s (+268%)
Random Mixed 512B:  21M → 68M ops/s (+224%)
Random Mixed 1024B: 21M → 65M ops/s (+210%)
Larson 1T:          631K → 2.63M ops/s (+333%)
```

### How to Build
```bash
./build.sh bench_random_mixed_hakmem  # Phase 7 flags are set automatically
```

**Key files**:
- `core/tiny_region_id.h` - Header write API
- `core/tiny_free_fast_v2.inc.h` - Ultra-fast free (3-5 instructions)
- `core/box/hak_free_api.inc.h` - Dual-header dispatch

---

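## 🐛 P0 Batch Optimization: Critical Bug Fix (2025-11-09) ✅

### Problem
With P0 (batch refill optimization) ON, a SEGV occurred in the 100K-iteration run.

### Investigation Process

**Phase 1: Build system problem**
- Found by the Task agent: a build error meant a stale binary was being run
- Claude's fix: added a local size table (2 lines)
- Result: 100K iterations succeeded with P0 OFF (2.73M ops/s)

**Phase 2: The real P0 bug**
- Found by ChatGPT: **missing `meta->used` increment**

### Root Cause

**P0 path (before the fix, buggy)**:
```c
trc_pop_from_freelist(meta, ..., &chain);  // batch-pop from the freelist
trc_splice_to_sll(&chain, &g_tls_sll_head[cls]);  // splice onto the SLL
// meta->used += count;  ← this was missing! 💀
```

**Impact**:
- `meta->used` drifts away from the actual number of blocks in use
- The carve decision goes wrong → memory corruption → SEGV

### ChatGPT's Fix

```c
trc_splice_to_sll(...);
ss_active_add(tls->ss, from_freelist);
meta->used = (uint16_t)((uint32_t)meta->used + from_freelist);  // ← added! ✅
```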
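The counter rule behind this fix can be shown with a toy model. The sketch below is not the P0 code: `ToyBlock`, `ToySlabMeta`, and `toy_refill_from_freelist` are hypothetical stand-ins, and only the principle (every block moved out of the freelist must also be counted in `used`) is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyBlock { struct ToyBlock *next; } ToyBlock;

typedef struct {
    ToyBlock *freelist;  /* blocks that were returned to the slab */
    uint16_t  used;      /* blocks currently handed out */
} ToySlabMeta;

/* Move up to `want` blocks from meta->freelist onto the list at *sll_head
 * and account for them in meta->used. Returns the number of blocks moved. */
static uint32_t toy_refill_from_freelist(ToySlabMeta *meta, ToyBlock **sll_head,
                                         uint32_t want) {
    uint32_t taken = 0;
    while (taken < want && meta->freelist != NULL) {
        ToyBlock *b = meta->freelist;
        meta->freelist = b->next;  /* pop from the slab freelist */
        b->next = *sll_head;       /* push onto the thread-local list */
        *sll_head = b;
        taken++;
    }
    /* The step the buggy path skipped: keep the counter in sync. */
    meta->used = (uint16_t)((uint32_t)meta->used + taken);
    return taken;
}
```

Dropping the final counter update leaves `used` lower than the real number of live blocks, so a later decision based on `used` would treat handed-out blocks as still available.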

**Additional implementation (runtime A/B hooks)**:
- `HAKMEM_TINY_P0_ENABLE=1` - enable P0
- `HAKMEM_TINY_P0_NO_DRAIN=1` - disable remote drain (for isolating problems)
- `HAKMEM_TINY_P0_LOG=1` - counter validation log

### Results After the Fix

| Configuration | Before fix | After fix |
|------|--------|--------|
| P0 OFF | 2.51-2.59M ops/s | 2.73M ops/s |
| P0 ON + NO_DRAIN | ❌ SEGV | ✅ 2.45M ops/s |
| **P0 ON (recommended)** | ❌ SEGV | ✅ **2.76M ops/s** 🏆 |

**100K iterations**: all tests passed

### Recommended Production Settings

```bash
export HAKMEM_TINY_P0_ENABLE=1
./out/release/bench_random_mixed_hakmem
```

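**Performance**: 2.76M ops/s (fastest, stable)

### Known Warnings (Non-fatal)

**COUNTER_MISMATCH**:
- Frequency: rare (1-2 occurrences per 10K-100K iterations)
- Impact: none (no crash, no performance effect)
- Action: continue auditing (low priority)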

---

## 🎯 Pool TLS Phase 1.5a: Lock-Free Arena (2025-11-09) ✅

### Overview
Lock-free TLS arena with chunk carving for 8KB-52KB allocations

### Results
```
Pool TLS Phase 1.5a: 1.79M ops/s (8KB allocations)
System malloc:       0.19M ops/s (8KB allocations)
Ratio:              947% (9.47x faster!) 🏆
```

### Architecture
- Box P1: Pool TLS API (ultra-fast alloc/free)
- Box P2: Refill Manager (batch allocation)
- Box P3: TLS Arena Backend (exponential chunk growth 1MB→8MB)
- Box P4: System Memory API (mmap wrapper)
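The exponential chunk growth in Box P3 could look roughly like the sketch below. It assumes that "exponential" means doubling per chunk and that growth stops at the configured maximum; `toy_arena_chunk_mb` is a hypothetical name, not a function from `core/pool_tls_arena.c`.

```c
#include <assert.h>
#include <stddef.h>

/* Size in MB of the n-th chunk (0-based) an arena would request:
 * start at init_mb, double for each earlier chunk, never exceed max_mb. */
static size_t toy_arena_chunk_mb(size_t init_mb, size_t max_mb, unsigned n) {
    size_t mb = init_mb;
    while (n-- > 0 && mb < max_mb) {
        mb *= 2;
    }
    return mb < max_mb ? mb : max_mb;
}
```

With an initial size of 1MB and a maximum of 8MB, successive chunks under this assumption are 1, 2, 4, 8, 8, ... MB, which matches the 1MB→8MB range stated above.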

### How to Build
```bash
./build.sh bench_mid_large_mt_hakmem  # Pool TLS is enabled automatically
```

**Key files**:
- `core/pool_tls.h/c` - TLS freelist + size-to-class
- `core/pool_refill.h/c` - Batch refill
- `core/pool_tls_arena.h/c` - Chunk management

---

## 📝 Development History (Summary)

### Phase 11: SuperSlab Prewarm (2025-11-13) ⚠️ Lesson
- Pre-allocates SuperSlabs at startup to reduce mmap calls
- Result: +6.4% improvement (8.82M → 9.38M ops/s)
- **Lesson**: reducing syscalls was correct, but the underlying SuperSlab churn (877 created) remained unsolved
- Details: `PHASE11_SUPERSLAB_PREWARM_IMPLEMENTATION_REPORT.md`

### Phase 10: TLS/SFC Aggressive Tuning (2025-11-13) ⚠️ Lesson
- TLS cache capacity enlarged 2-8x, refill batch size increased 4-8x
- Result: +2% improvement (9.71M → 9.89M ops/s)
- **Lesson**: the frontend hit rate is not the bottleneck; backend churn is the real issue
- Details: `core/tiny_adaptive_sizing.c`, `core/hakmem_tiny_config.c`

### Phase 9: SuperSlab Lazy Deallocation (2025-11-13) ✅
- Removed mincore (841 syscalls → 0) and introduced an LRU cache
- Result: +12% improvement (8.67M → 9.71M ops/s)
- Syscall reduction: 3,412 → 1,729 (-49%)
- Details: `core/hakmem_super_registry.c`

### Phase 2: Design Flaws Analysis (2025-11-08) 🔍
- Found design flaws in the fixed-size caches
- SuperSlab fixed at 32 slabs, TLS cache with fixed capacity, and so on
- Details: `DESIGN_FLAWS_ANALYSIS.md`

### Phase 6-1.7: Box Theory Refactoring (2025-11-05) ✅
- Ultra-Simple Fast Path (3-4 instructions)
- +64% performance improvement (Larson 1.68M → 2.75M ops/s)
- Details: `core/tiny_alloc_fast.inc.h`, `core/tiny_free_fast.inc.h`

### Phase 6-2.1: P0 Optimization (2025-11-05) ✅
- Made `superslab_refill` O(n) → O(1) (using ctz)
- Introduced `nonempty_mask`
- Details: `core/hakmem_tiny_superslab.h`, `core/hakmem_tiny_refill_p0.inc.h`
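The O(n) → O(1) change in Phase 6-2.1 rests on a standard bitmask trick, sketched here. The function names and the 32-bit mask width are assumptions for illustration; `__builtin_ctz` is the GCC/Clang count-trailing-zeros builtin, and its result is undefined for an input of 0, hence the guard.

```c
#include <assert.h>
#include <stdint.h>

/* O(n) baseline: scan every slab's free count for the first non-empty one. */
static int toy_first_nonempty_scan(const uint16_t *free_counts, int n) {
    for (int i = 0; i < n; i++) {
        if (free_counts[i] > 0) return i;
    }
    return -1;
}

/* O(1) variant: bit i of the mask is set while slab i has free blocks,
 * so the lowest set bit is the first non-empty slab. */
static int toy_first_nonempty_ctz(uint32_t nonempty_mask) {
    if (nonempty_mask == 0) return -1;  /* ctz of 0 is undefined */
    return __builtin_ctz(nonempty_mask);
}

/* Mask maintenance: call when a slab gains or loses its last free block. */
static inline uint32_t toy_mask_set(uint32_t mask, int idx) {
    return mask | (UINT32_C(1) << idx);
}
static inline uint32_t toy_mask_clear(uint32_t mask, int idx) {
    return mask & ~(UINT32_C(1) << idx);
}
```

The cost moves from the lookup to the bookkeeping: the mask must be updated on every empty/non-empty transition, which is the same kind of counter discipline the P0 fixes in this log depend on.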

### Phase 6-2.3: Active Counter Fix (2025-11-07) ✅
- Fixed a missing active counter increment in P0 batch refill
- Achieved stable 4-thread operation (838K ops/s)

### Phase 6-2.2: Sanitizer Compatibility (2025-11-07) ✅
- Fixed ASan/TSan builds
- Introduced `HAKMEM_FORCE_LIBC_ALLOC_BUILD=1`

---

## 🛠️ Build System

### Basic Build
```bash
./build.sh <target>           # Release build (recommended)
./build.sh debug <target>     # Debug build
./build.sh help               # Show help
./build.sh list               # List targets
```

### Main Targets
- `bench_random_mixed_hakmem` - Tiny 1T mixed
- `bench_pool_tls_hakmem` - Pool TLS 8-52KB
- `bench_mid_large_mt_hakmem` - Mid-Large MT 8-32KB
- `larson_hakmem` - Larson mixed

### Pinned Flags
```
POOL_TLS_PHASE1=1
POOL_TLS_PREWARM=1
HEADER_CLASSIDX=1
AGGRESSIVE_INLINE=1
PREWARM_TLS=1
BUILD_RELEASE_DEFAULT=1  # Release mode
```

### Environment Variables (Pool TLS Arena)
```bash
export HAKMEM_POOL_TLS_ARENA_MB_INIT=2   # default 1
export HAKMEM_POOL_TLS_ARENA_MB_MAX=16   # default 8
export HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS=4  # default 3
```

### Environment Variables (P0)
```bash
export HAKMEM_TINY_P0_ENABLE=1      # Enable P0 (recommended)
export HAKMEM_TINY_P0_NO_DRAIN=1    # Disable remote drain (for debugging)
export HAKMEM_TINY_P0_LOG=1         # Counter validation log
```
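How a runtime switch of this kind might be read is sketched below. This is not HAKMEM's parsing code: the helper names are hypothetical, and the rules (unset, empty, or `0` means off; a malformed number falls back to the default) are assumptions. Only `getenv` and `strtol` from the C standard library are used.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed rule: a flag is on unless it is unset, empty, or exactly "0". */
static int toy_env_flag_value(const char *value) {
    return value != NULL && value[0] != '\0' &&
           !(value[0] == '0' && value[1] == '\0');
}

/* Read a boolean switch such as HAKMEM_TINY_P0_ENABLE from the environment. */
static int toy_env_flag(const char *name) {
    return toy_env_flag_value(getenv(name));
}

/* Parse a numeric setting (e.g. an arena size in MB); use `fallback`
 * when the value is unset, empty, or not a clean decimal integer. */
static long toy_env_long_value(const char *value, long fallback) {
    if (value == NULL || value[0] == '\0') return fallback;
    char *end = NULL;
    long v = strtol(value, &end, 10);
    return (*end == '\0') ? v : fallback;
}
```

Reading such switches once at startup and caching the result keeps the A/B hook off the allocation fast path.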

---

## 🔍 Debugging and Profiling

### Perf
```bash
perf stat -e cycles,instructions,branches,branch-misses,cache-misses -r 3 -- ./<bin>
```

### Strace
```bash
strace -e trace=mmap,madvise,munmap -c ./<bin>
```

### Build Verification
```bash
./build.sh verify <binary>
make print-flags
```

---

## 📚 Key Documents

- `BUILDING_QUICKSTART.md` - Build quick start
- `LARSON_GUIDE.md` - Larson benchmark integration guide
- `HISTORY.md` - Record of failed optimizations
- `100K_SEGV_ROOT_CAUSE_FINAL.md` - Detailed P0 SEGV investigation
- `P0_INVESTIGATION_FINAL.md` - Comprehensive P0 investigation report
- `DESIGN_FLAWS_ANALYSIS.md` - Design flaw analysis

---

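## 🎓 Lessons Learned

1. **Importance of build verification** - a build error can go unnoticed while a stale binary keeps running
2. **Counter consistency** - batch optimizations require every counter to stay in sync
3. **The power of runtime A/B** - environment variables make it possible to isolate the faulty component
4. **Header-based optimization** - a single byte can yield a dramatic performance gain
5. **Box Theory** - clear boundaries deliver both safety and performance
6. **Limits of incremental optimization** - easing symptoms does not close the fundamental performance gap (9x)
7. **Importance of identifying the bottleneck** - Phases 9-11 targeted the wrong bottleneck (syscalls)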

---

## 🚀 Phase 12: Shared SuperSlab Pool (Fundamental Fix)

### Strategy: mimalloc-Style Dynamic Slab Sharing

**Goal**: performance on par with System malloc (90M ops/s)

**Root cause**:
- Current architecture: 1 SuperSlab = 1 size class (fixed)
- Problem: 877 SuperSlabs created → 877MB reserved → huge metadata overhead

**Solution**:
- Multiple size classes share the same SuperSlab
- Dynamic slab assignment (`class_idx` is decided at time of use)
- Expected effect: 877 SuperSlabs → 100-200 (-70-80%)

**Implementation plan**:
1. **Phase 12-1: Dynamic slab metadata** - extend SlabMeta (make `class_idx` dynamic)
2. **Phase 12-2: Shared allocation** - multiple classes allocate from the same SuperSlab
3. **Phase 12-3: Smart eviction** - release low-utilization slabs first
4. **Phase 12-4: Benchmark** - compare with System malloc (target: 80-100%)

**Expected performance improvement**:
- SuperSlab count: 877 → 100-200 (-70-80%)
- Metadata overhead: -70-80%
- Cache miss rate: large reduction
- Performance: 9.38M → 70-90M ops/s (+650-860% expected)
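The core idea of Phase 12-1 and 12-2, a per-slab class index that is bound at time of use, can be sketched as follows. This is a toy model, not the planned SlabMeta layout: all names are hypothetical, the 32-slab count is borrowed from the design-flaws notes in this log, and real code would also track capacity, free lists, and ownership.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SLABS_PER_SS 32     /* assumption: 32 slabs per SuperSlab */
#define TOY_CLASS_NONE   0xffu  /* marks a slab not bound to any size class */

typedef struct {
    uint8_t class_idx[TOY_SLABS_PER_SS];  /* per-slab class, set on first use */
} ToySharedSS;

static void toy_ss_init(ToySharedSS *ss) {
    for (int i = 0; i < TOY_SLABS_PER_SS; i++) {
        ss->class_idx[i] = TOY_CLASS_NONE;
    }
}

/* Bind the first unbound slab to `class_idx`; return its index, or -1 when
 * the SuperSlab is full. Any size class may take any free slab. */
static int toy_ss_bind_slab(ToySharedSS *ss, uint8_t class_idx) {
    for (int i = 0; i < TOY_SLABS_PER_SS; i++) {
        if (ss->class_idx[i] == TOY_CLASS_NONE) {
            ss->class_idx[i] = class_idx;
            return i;
        }
    }
    return -1;
}

/* Return an empty slab to the shared pool so another class can reuse it. */
static void toy_ss_release_slab(ToySharedSS *ss, int idx) {
    ss->class_idx[idx] = TOY_CLASS_NONE;
}
```

Because a released slab can be rebound to a different class, one SuperSlab can serve a mixed workload instead of each size class reserving its own, which is where the expected drop in SuperSlab count would come from.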

---

## 🔥 **Performance Bottleneck Analysis (2025-11-13)**

### **Finding: Syscall Overhead Dominates**

**Status**: 🚧 **IN PROGRESS** - implementing Lazy Deallocation

**Perf profiling results**:
- HAKMEM: 8.67M ops/s
- System malloc: 80.5M ops/s
- **Cause of the 9.3x slowdown**: Syscall Overhead (99.2% CPU)

**Syscall statistics**:
```
HAKMEM:       3,412 syscalls (100K iterations)
System malloc:   13 syscalls (100K iterations)
Ratio:          262x

Breakdown:
- mmap:    1,250 calls (eager SuperSlab release)
- munmap:  1,321 calls (eager SuperSlab release)
- mincore:   841 calls (Phase 7 optimization backfired)
```

**Root cause**:
- HAKMEM: **Eager deallocation** (prioritizes RSS reduction) → frequent syscalls
- System malloc: **Lazy deallocation** (prioritizes speed) → minimal syscalls

**Fix plan** (reviewed by ChatGPT ✅):

1. **SuperSlab Lazy Deallocation** (top priority, +271% expected)
   - Treat SuperSlabs as a cache resource
   - LRU/generation management + a global cap
   - Almost never call munmap under high load

2. **Remove mincore** (top priority, +75% expected)
   - Drop the mincore dependency and unify on internal-metadata-driven checks
   - Manage via the registry/metadata approach

3. **Enlarge the TLS cache** (medium priority, +21% expected)
   - Increase hot-class capacity 2-4x
   - Takes effect in combination with Lazy SuperSlab
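The first item of the plan, treating SuperSlabs as a capped cache, can be sketched with a toy model. The code below is an illustration only: it uses a plain LIFO stack rather than the LRU/generation scheme named above, `malloc`/`free` stand in for `mmap`/`munmap`, and the cap of 4 and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_CACHE_CAP 4  /* hypothetical global cap on cached SuperSlabs */

typedef struct {
    void  *slots[TOY_CACHE_CAP];
    size_t count;      /* SuperSlabs currently parked in the cache */
    size_t os_allocs;  /* stand-in for the number of mmap calls */
    size_t os_frees;   /* stand-in for the number of munmap calls */
} ToySSCache;

/* Prefer a cached SuperSlab; go to the OS only when the cache is empty. */
static void *toy_ss_acquire(ToySSCache *c, size_t size) {
    if (c->count > 0) {
        return c->slots[--c->count];  /* reuse: no OS call */
    }
    c->os_allocs++;
    return malloc(size);  /* stand-in for mmap */
}

/* Park an empty SuperSlab instead of unmapping it; release it to the OS
 * only when the cache is already at its cap. */
static void toy_ss_release(ToySSCache *c, void *ss) {
    if (c->count < TOY_CACHE_CAP) {
        c->slots[c->count++] = ss;
        return;
    }
    c->os_frees++;
    free(ss);  /* stand-in for munmap */
}
```

In a steady-state loop that repeatedly empties and refills a SuperSlab, this pattern turns each release/acquire pair into two array operations, while the cap bounds how much memory stays resident.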

**Expected performance**: 8.67M → **74.5M ops/s** (93% of System malloc) 🎯

**Detailed report**: `RELEASE_DEBUG_OVERHEAD_REPORT.md`

---

## 📊 Current Status

```
BASE/USER Pointer Bugs:            ✅ FIXED (Iteration 66151 crash resolved)
Debug Overhead Removal:             ✅ COMPLETE (2.0M → 8.67M ops/s, +333%)
Phase 7 (Header-based fast free):  ✅ COMPLETE (+180-280%)
P0 (Batch refill optimization):     ✅ COMPLETE (2.76M ops/s)
Pool TLS (8-52KB arena):            ✅ COMPLETE (9.47x vs System)
Lazy Deallocation (syscall reduction): 🚧 IN PROGRESS (target: 74.5M ops/s)
```

**Current tasks** (2025-11-13):
```
1. Implement SuperSlab Lazy Deallocation (LRU + cap control)
2. Remove mincore; unify on internal-metadata-driven checks
3. Enlarge TLS cache capacity (2-4x)
```

**Recommended production settings**:
```bash
export HAKMEM_TINY_P0_ENABLE=1
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 100000 256 42
# Current: 8.67M ops/s
# Target:  74.5M ops/s (93% of System malloc)
```