hakmem/CURRENT_TASK.md

# CURRENT TASK (Phase 14–17 Snapshot) – Tiny / Mid / ExternalGuard / Small-Mid

**Last Updated**: 2025-11-16
**Owner**: ChatGPT → Phase 17 実装中: Claude Code
**Size**: 約 300 行（Claude 用コンテキスト簡略版）

---

## 1. 全体の現在地（どこまで終わっているか）

- Tiny (0–1023B)
  - NEW 3-layer front（bump / small_mag / slow）安定。
  - TinyHeapV2: 「alloc フロント＋統計」は実装済みだが、実運用は **C2/C3 を UltraHot に委譲**。
  - TinyUltraHot（Phase 14）:
    - C2/C3（16B/32B）専用 L0 ultra-fast path（Stealing モデル）。
    - 固定サイズベンチで +16〜36% 改善、hit 率 ≈ 100%。
  - Box 分離（Phase 15）:
    - free ラッパが外部ポインタまで `hak_free_at` に投げていた問題を修正。
    - BenchMeta（slots など）→ 直接 `__libc_free`、CoreAlloc（Tiny/Mid）→ `hak_free_at` の二段構えに整理。

- Mid / PoolTLS（1KB–32KB）
  - PoolTLS Phase 完了（Mid-Large MT ベンチ）
    - ~10.6M ops/s（system malloc より速い構成あり）。
    - lock contention（futex 68%）を lock-free MPSC + bind box で大幅削減。
  - GAP 修正（Tiny 1023B / Mid 1KB〜）:
    - `TINY_MAX_SIZE=1023` / `MID_MIN_SIZE=1024` で 1KB–8KB の「誰も扱わない帯」は解消済み。

- Shared SuperSlab Pool（Phase 12 – SP-SLOT Box）
  - 1 SuperSlab : 多 class 共有 + SLOT_UNUSED/ACTIVE/EMPTY 追跡。
  - SuperSlab 数: 877 → 72（-92%）、mmap/munmap: -48%、Throughput: +131%。
  - Lock contention P0-5 まで実装済み（Stage 2 lock-free claiming）。

- ExternalGuard（Phase 15）
  - UNKNOWN ポインタ（Tiny/Pool/Mid/L25/registry どこでも捕まらないもの）を最後の箱で扱う。
  - 挙動:
    - `hak_super_lookup` など全て miss → mincore でページ確認 → 原則「解放せず leak 扱い（安全優先）」。
  - Phase 15 修正で:
    - BenchMeta のポインタを CoreAlloc に渡さなくなり、UNKNOWN 呼び出し回数が激減。
    - `mincore` の CPU 負荷もベンチではほぼ無視できるレベルまで縮小。

---

## 2. Tiny 性能の現状（Phase 14–15 時点）

### 2.1 Fixed-size Tiny ベンチ（HAKMEM vs System）

`bench_fixed_size_hakmem` / `bench_fixed_size_system`（workset=128, 500K iterations 相当）

| Size   | HAKMEM (Phase 15) | System malloc | 比率     |
|--------|-------------------|---------------|----------|
| 128B   | ~16.6M ops/s      | ~90M ops/s    | ~18.5%   |
| 256B   | ~16.2M ops/s      | ~89.6M ops/s  | ~18.1%   |
| 512B   | ~15.0M ops/s      | ~90M ops/s    | ~16.6%   |
| 1024B  | ~15.1M ops/s      | ~90M ops/s    | ~16.8%   |

状態:
- クラッシュは完全解消（workset=64/128 で長尺 500K iter も安定）。
- Tiny UltraHot + 学習層 + ExternalGuard の組み合わせは「正しさ」は OK。
- 性能は system の ~16–18% レベル（約 5–6× 遅い）→ まだ大きな伸びしろあり。

### 2.2 C2/C3 UltraHot 専用ベンチ

固定サイズ（100K iterations, workset=128）

| Size | Baseline (UltraHot OFF) | UltraHot ON | 改善率      | Hit Rate |
|------|-------------------------|-------------|-------------|---------|
| 16B  | ~40.4M ops/s            | ~55.0M      | +36.2% 🚀    | ≈100%   |
| 32B  | ~43.5M ops/s            | ~50.6M      | +16.3% 🚀    | ≈100%   |

Random Mixed 256B：
- Baseline: ~8.96M ops/s
- UltraHot ON: ~8.81M ops/s（-1.6%、誤差〜軽微退化）
- 理由: C2/C3 が全体の 1–2% のみ → UltraHot のメリットが平均に薄まる。

結論:
- C2/C3 UltraHot は **ターゲットクラスに対しては実用級の Box**。
- 他ワークロードでは「ほぼ影響なし（わずかな分岐オーバーヘッドのみ）」の範囲に収まっている。

---

## 3. Phase 15: ExternalGuard / Domain 分離の成果

### 3.1 以前の問題

- free ラッパ（`core/box/hak_wrappers.inc.h`）が:
  - HAKMEM 所有かチェックせず、すべての `free(ptr)` を `hak_free_at(ptr, …)` に投げていた。
  - その結果:
    - ベンチ内部 `slots`（`calloc(256, sizeof(void*))` の 2KB など）も CoreAlloc に流入。
    - `classify_ptr` → UNKNOWN → ExternalGuard → mincore → 「解放せず leak」と判定。
  - ベンチ観測:
    - 約 0.84% の leak（BenchMeta がどんどん漏れる）。
    - `mincore` が Tiny ベンチ CPU の ~13% を消費。

### 3.2 修正内容（Phase 15）

- free ラッパ側:
  - 軽量なドメインチェックを追加:
    - Tiny/Pool 用の header magic を安全に読んで、HAKMEM 所有の可能性があるものだけ `hak_free_at` へ。
    - そうでない（BenchMeta/外部）ポインタは `__libc_free` へ。
- ExternalGuard:
  - UNKNOWN ポインタを「解放しない（leak）」方針に明示的変更。
  - デバッグ時のみ `HAKMEM_EXTERNAL_GUARD_LOG=1` で原因特定用ログを出す。

### 3.3 結果

- Leak 率:
  - 100K iter: 840 leaks → 0.84%
  - 500K iter: ~4200 leaks → 0.84%
  - ほぼ全部が BenchMeta / 外部ポインタであり、CoreAlloc 側の漏れではないと確認。
- 性能:
  - 256B 固定:
    - Before: 15.9M ops/s
    - After:  16.2M ops/s（+1.9%）→ domain check オーバーヘッドは軽微、むしろ微増。
- 安定性:
  - 全サイズ（128/256/512/1024B）で 500K iter 完走（クラッシュなし）。
  - ExternalGuard 経由の「危ない free」は leak に封じ込められた。

**要点:**  
Box 境界違反（BenchMeta→CoreAlloc 流入）はほぼ完全に解消。  
ベンチでの mincore / ExternalGuard コストも許容範囲になった。

---

## 4. Phase 16: Dynamic Tiny/Mid Boundary A/B Testing（2025-11-16完了）

### 4.1 実装内容

ENV変数でTiny/Mid境界を動的調整可能にする機能を追加：
- `HAKMEM_TINY_MAX_CLASS=7` (デフォルト): Tiny が 0-1023B を担当
- `HAKMEM_TINY_MAX_CLASS=5` (実験用): Tiny が 0-255B のみ担当

実装ファイル：
- `hakmem_tiny.h/c`: `tiny_get_max_size()` - ENV読取とクラス→サイズマッピング
- `hakmem_mid_mt.h/c`: `mid_get_min_size()` - 動的境界調整（サイズギャップ防止）
- `hak_alloc_api.inc.h`: 静的TINY_MAX_SIZEを動的呼び出しに変更

### 4.2 A/B Benchmark Results

| Size | Config A (C0-C7) | Config B (C0-C5) | 変化率 |
|------|------------------|------------------|--------|
| 128B | 6.34M ops/s | 1.38M ops/s | **-78%** ❌ |
| 256B | 6.34M ops/s | 1.36M ops/s | **-79%** ❌ |
| 512B | 5.55M ops/s | 1.33M ops/s | **-76%** ❌ |
| 1024B | 5.91M ops/s | 1.37M ops/s | **-77%** ❌ |

### 4.3 発見と結論

✅ **成功**: サイズギャップ修正完了（OOMクラッシュなし）
❌ **失敗**: Tiny カバレッジ削減で大幅な性能劣化 (-76% ~ -79%)
⚠️ **根本原因**: Mid の粗いサイズクラス (8KB/16KB/32KB) が小サイズで非効率
- Mid は 8KB ページ単位の設計 → 256B-1KB を投げると 8KB ページをほぼ数ブロックのために確保
- ページ fault・TLB・メタデータコストが相対的に巨大
- Tiny は slab + freelist で高密度 → 同じサイズでも桁違いに効率的

**教訓（ChatGPT先生分析）**:
1. Mid 箱の前提が「8KB〜用」になっている
   - 256B/512B/1024B では 8KB ページをほぼ1〜数個のブロックのために確保 → 非効率
2. パス長も Mid の方が長い（PoolTLS / mid registry / page 管理）
3. 「Tiny を削って Mid に任せれば軽くなる」という仮説は、現行の "8KB〜前提の Mid 設計" では成り立たない

**推奨**: **デフォルト HAKMEM_TINY_MAX_CLASS=7 (C0-C7) を維持**

---

## 5. Phase 17: Small-Mid Allocator Box - 実験完了 ✅（2025-11-16）

### 5.1 目標と動機

**問題**: Tiny C5-C7 (256B/512B/1KB) が ~6M ops/s → system malloc の ~6.7% レベル
**仮説**: 専用層を作れば 2-4x 改善可能
**結果**: ❌ **仮説は誤り** - 性能改善なし（±0-1%）

### 5.2 Phase 17-1: TLS Frontend Cache（Tiny delegation）

**実装**:
- TLS freelist（256B/512B/1KB、容量32/24/16）
- Backend: Tiny C5/C6/C7に委譲、Header変換（0xa0 → 0xb0）
- Auto-adjust: Small-Mid ON時にTinyをC0-C5に自動制限

**結果**:
| Size | OFF | ON | 変化率 |
|------|-----|-----|--------|
| 256B | 5.87M | 6.06M | **+3.3%** |
| 512B | 6.02M | 5.91M | **-1.9%** |
| 1024B | 5.58M | 5.54M | **-0.6%** |
| **平均** | 5.82M | 5.84M | **+0.3%** |

**教訓**: Delegation overhead = TLS savings → 正味利益ゼロ

### 5.3 Phase 17-2: Dedicated SuperSlab Backend

**実装**:
- Small-Mid専用SuperSlab pool（1MB、16 slabs/SS）
- Batch refill（8-16 blocks/refill）
- 直接0xb0 header書き込み（Tiny delegationなし）

**結果**:
| Size | OFF | ON | 変化率 |
|------|-----|-----|--------|
| 256B | 6.08M | 5.84M | **-4.1%** ⚠️ |
| 512B | 5.79M | 5.86M | **+1.2%** |
| 1024B | 5.42M | 5.44M | **+0.4%** |
| **平均** | 5.76M | 5.71M | **-0.9%** |

**Phase 17-1比較**: Phase 17-2の方が悪化（-3.6% on 256B）

### 5.4 根本原因分析（ChatGPT先生 + perf profiling）

**発見**: **70% page fault** が支配的 🔥

**Perf分析**:
- `asm_exc_page_fault`: 70% CPU時間
- 実際のallocation logic（TLS/refill）: 30% のみ
- **結論**: Frontend実装は成功、Backendが重すぎる

**なぜpage faultが多いか**:
```
Small-Mid: alloc → TLS miss → refill → SuperSlab新規確保
  → mmap(1MB) → page fault 発生 → 70%のCPU消費

Tiny: alloc → TLS miss → refill → 既存warm SuperSlab使用
  → page faultなし → 高速
```

**Small-Mid問題**:
1. 新しいSuperSlabを頻繁に確保（workloadが短いため）
2. Warm SuperSlabの再利用なし（usedカウンタ減らない）
3. Batch refillのメリットよりmmap/page faultコストが大きい

### 5.5 Phase 17の結論と教訓

❌ **Small-Mid専用層戦略は失敗**:
- Phase 17-1（Frontend only）: +0.3%
- Phase 17-2（Dedicated backend）: -0.9%
- 目標（2-4x改善）: **未達成**（-50-67%不足）

✅ **重要な発見**:
1. **Frontend（TLS/batch refill）設計はOK** - 30%のみの負荷
2. **70% page fault = SuperSlab層の問題**
3. **Tiny (6.08M) は既に十分速い** - これを超えるのは困難
4. **層の分離では性能は上がらない** - Backend最適化が必要

✅ **実装の価値**:
- ENV=0でゼロオーバーヘッド（branch predictor学習）
- 実験記録として価値あり（"なぜ専用層が効果なかったか"の証拠）
- Tiny最適化の邪魔にならない（完全分離アーキテクチャ）

### 5.6 次のステップ: SuperSlab Reuse（Phase 18候補）

**ChatGPT提案**: Tiny SuperSlabの最適化（Small-Mid専用層ではなく）

**Box SS-Reuse（SuperSlab slab再利用箱）**:
- **目標**: 70% page fault → 5-10%に削減
- **戦略**:
  1. meta->freelistを優先使用（現在はbump onlyで再利用なし）
  2. slabがemptyになったらshared_poolに返却
  3. 同じSuperSlab内で長く回す（新規mmap削減）
- **効果**: page fault大幅削減 → 2-4x改善期待
- **実装場所**: `core/hakmem_tiny_superslab.c`（Tiny用、Small-Midではない）

**Box SS-Prewarm（事前温め箱）**:
- クラスごとにSuperSlabを事前確保（Phase 11実績: +6.4%）
- page faultをbenchmark開始時に集中
- **課題**: benchmark専用、実運用では無駄

**推奨**: Box SS-Reuse優先（実運用価値あり、根本解決）

---

## 6. 未達成の目標・残課題（次フェーズ候補）

### 6.1 Tiny 性能ギャップ（System の ~18% 止まり）

現状:
- System malloc が ~90M ops/s レベルのところ、
- HAKMEM は 128〜1024B 固定で ~15–16M ops/s（約 18%）。

原因の切り分け（これまでの調査から）:
- Front（UltraHot/TinyHeapV2/TLS SLL）のパス長はかなり短縮済み。
- L1 dcache miss / instructions / branches は Phase 14 で大幅削減済みだが、
  - まだ Tiny が 0–1023B を全部抱えており、
  - 特に 512/1024B が Superslab/Pool 側のメタ負荷に効いている可能性。

候補:
- **Phase 17 で実装中！** Small-Mid Box（256B〜4KB 専用箱）を設計し、Tiny/Mid の間を分離する。
  - 詳細は § 5. Phase 17 を参照

### 6.2 UltraHot/TinyHeapV2 の拡張 or 整理

- C2/C3 UltraHot は成功（16/32B 用）。
- C4/C5 まで拡張した試み（Phase 14-B）は:
  - Fixed-size では改善あり。
  - Random Mixed で shared_pool_acquire_slab() が 47.5% まで膨らみ、大退化。
  - 原因: Superslab/TLS 在庫のバランスを壊す「窃取カスケード」。

方針:
- UltraHot は **C2/C3 専用 Box** に戻す（C4/C5 は一旦対象外にする）。
- もし C4/C5 を最適化したいなら、SmallMid Box の中で別設計する。

### 6.3 ExternalGuard の統計と自動アラート

- 現在:
  - `HAKMEM_EXTERNAL_GUARD_STATS=1` で統計を手動出力。
  - 100+ 回呼ばれたら WARNING を出すのみ。
- 構想:
  - 「ExternalGuard 呼び出しが一定閾値を超えたら、自動で簡易レポートを吐く」Box を追加。
  - 例: Top N 呼び出し元アドレス、サイズ帯、mincore 結果 など。

---

## 7. Claude Code 君向け TODO

### 7.1 Phase 17: Small-Mid Allocator Box ✅ 完了（2025-11-16）

**Phase 17-1**: TLS Frontend Cache
- ✅ 実装完了（TLS freelist + Tiny delegation）
- ✅ A/B テスト: ±0.3%（性能改善なし）
- ✅ 教訓: Delegation overhead = TLS savings

**Phase 17-2**: Dedicated SuperSlab Backend
- ✅ 実装完了（専用SuperSlab pool + batch refill）
- ✅ A/B テスト: -0.9%（Phase 17-1より悪化）
- ✅ 根本原因: 70% page fault（mmap/SuperSlab確保が重い）

**結論**: Small-Mid専用層は性能改善なし（±0-1%）、Tiny最適化が必要

### 7.2 Phase 18 候補: SuperSlab Reuse（Tiny最適化）

**Box SS-Reuse（最優先）**:
1. meta->freelist優先使用（現状: bump only）
2. slab empty検出→shared_pool返却
3. 同じSuperSlab内で長く回す（page fault削減）
4. 目標: 70% page fault → 5-10%、性能 2-4x改善

**Box SS-Prewarm（次優先）**:
1. クラスごとSuperSlab事前確保
2. page faultをbenchmark開始時に集中
3. Phase 11実績: +6.4%（参考値）

**Box SS-HotHint（長期）**:
1. クラス別ホットSuperSlab管理
2. locality最適化（cache効率）
3. SS-Reuseとの統合

### 7.3 その他タスク

1. ✅ **Phase 16/17 結果分析** - CURRENT_TASK.md記録完了
2. **C2/C3 UltraHot コード掃除** - C4/C5関連を別Box化
3. **ExternalGuard 統計自動化** - 閾値超過時レポート

---

## 8. Phase 17 実装ログ（完了）

### 2025-11-16
- ✅ **Phase 17-1完了**: TLS Frontend + Tiny delegation
  - 実装: `hakmem_smallmid.h/c`, auto-adjust, routing修正
  - A/B結果: +0.3%（性能改善なし）
  - 教訓: Delegation overhead = TLS savings

- ✅ **Phase 17-2完了**: Dedicated SuperSlab backend
  - 実装: `hakmem_smallmid_superslab.h/c`, batch refill, 0xb0 header
  - A/B結果: -0.9%（Phase 17-1より悪化）
  - 根本原因: 70% page fault（ChatGPT + perf分析）

- ✅ **重要な発見**:
  - Frontend（TLS/batch refill）: OK（30%のみ）
  - Backend（SuperSlab確保）: ボトルネック（70% page fault）
  - 専用層では性能上がらない → **Tiny SuperSlab最適化が必要**

- ✅ **CURRENT_TASK.md更新**: Phase 17結果 + Phase 18計画
- 🎯 **次**: Phase 18 Box SS-Reuse実装（Tiny SuperSlab最適化）

---

## 9. Phase 19 実装ログ（完了） 🎉

### 2025-11-16
- ✅ **Phase 19-1完了**: Box FrontMetrics（観測）
  - 実装: `core/box/front_metrics_box.h/c`、全層にヒット率計測追加
  - ENV: `HAKMEM_TINY_FRONT_METRICS=1`, `HAKMEM_TINY_FRONT_DUMP=1`
  - 結果: CSV形式で per-class ヒット率レポート生成

- ✅ **Phase 19-2完了**: ベンチマークとヒット率分析
  - ワークロード: Random Mixed 16-1040B、50万イテレーション
  - **重要な発見**:
    - **HeapV2**: 88-99% ヒット率（主力として機能）✅
    - **UltraHot**: 0.2-11.7% ヒット率（ほぼ素通り）⚠️
    - FC/SFC: 無効化済み（0%）
    - TLS SLL: fallback として 0.7-2.7% のみ

- ✅ **Phase 19-3完了**: Box FrontPrune（診断）
  - 実装: ENV切り替えで層を個別ON/OFF可能
  - ENV: `HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1`（デフォルトOFF）
  - ENV: `HAKMEM_TINY_FRONT_DISABLE_HEAPV2=1`（デフォルトON）

- ✅ **Phase 19-4完了**: A/Bテストと最適化
  - **テスト結果**:
    | 設定 | 性能 | vs Baseline | C2/C3 ヒット率 |
    |------|------|-------------|----------------|
    | Baseline（両方ON） | 10.1M ops/s | - | UH=11.7%, HV2=88.3% |
    | **HeapV2のみ** | **11.4M ops/s** | **+12.9%** ⭐ | HV2=99.3%, SLL=0.7% |
    | UltraHotのみ | 6.6M ops/s | -34.4% ❌ | UH=96.4% (C2), SLL=94.2% (C3) |

  - **決定的結論**:
    - **UltraHot削除で性能向上** (+12.9%)
    - 理由: 分岐予測ミスコスト > UltraHotヒット率向上効果
    - UltraHotチェック: 88.3%のケースで無駄な分岐 → CPU分岐予測器を混乱
    - HeapV2単独の方が予測可能性が高い → 性能向上

- ✅ **デフォルト設定変更**: UltraHot デフォルトOFF
  - 本番推奨: UltraHot OFF（最速設定）
  - 研究用: `HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1` で有効化可能
  - コードは削除せず ENV切り替えで残す（研究・デバッグ用）

- ✅ **Phase 19 成果**:
  - ChatGPT先生の「観測→診断→治療」戦略が完璧に機能 🎓
  - 直感に反する発見（UltraHotが阻害要因）をデータで証明
  - A/Bテストでリスクなし確認してから最適化実施
  - 詳細: `PHASE19_FRONTEND_METRICS_FINDINGS.md`, `PHASE19_AB_TEST_RESULTS.md`

---

## 10. Phase 20 計画: Tiny ホットパス一本化 + BenchFast モード 🎯

### 目標
- **性能目標**: 20-30M ops/s（system malloc の 25-35%）
- **設計目標**: 「箱を崩さず」に達成（研究価値を保つ）

### Phase 20-1: HeapV2 を唯一の Tiny Front に（本命ホットパス一本化）

**現状認識**:
- C2/C3: HeapV2 が 88-99% を処理（本命）
- UltraHot: 0.2-11.7% しか当たらず、分岐の邪魔（削ると +12.9%）
- FC/SFC: 実質 OFF、TLS SLL は fallback のみ

**実装方針**:
1. **HeapV2 を「唯一の front」として扱う**:
   - C2-C5: HeapV2 → fallback だけ TLS SLL
   - 他層（UltraHot, FC, SFC）はホットパスから完全に外し、実験用に退避

2. **HeapV2 の中身を徹底的に薄くする**:
   - size→class 再計算を全部やめて、「class_idx を渡すだけ」にする
   - 分岐を「classごとの専用関数」かテーブルジャンプにして 1-2 本に減らす
   - header 書き込み・TLS stack 操作・return までを「6-8 命令の直線」に近づける

3. **期待効果**:
   - 現在 11M ops/s → 目標 15-20M ops/s (+35-80% 改善)
   - 分岐削減 + 命令直線化 → CPU パイプライン効率向上

**ENV制御**:
```bash
# HeapV2専用モード（Phase 20デフォルト）
HAKMEM_TINY_FRONT_HEAPV2_ONLY=1  # UltraHot/FC/SFC完全バイパス

# 旧動作（研究用）
HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1  # Phase 19設定
```

---

### Phase 20-2: BenchFast モードで安全コストを外す

**現状認識**:
- `hak_free_at` / `classify_ptr` / ExternalGuard / mincore など、
  「LD_PRELOAD / 外部ライブラリから守る」層が、
  ベンチでは「絶対に hakmem だけを使っている」前提の上に乗っている

**実装方針**:
1. **ベンチ用完全信頼モード**（Box BenchFast）:
   - alloc/free ともに:
     - header 1バイト で Tiny を即判定
     - Pool/Mid/L25/ExternalGuard/registry を完全にバイパス
   - 変なポインタが来たら壊れていい（ベンチ用なので）

2. **ENV制御**:
   ```bash
   HAKMEM_BENCH_FAST_MODE=1  # 安全コスト全外し
   ```

3. **目的**:
   - 「箱全部乗せ版」と「安全コスト全外し版」の差を測る
   - 「設計そのものの限界」と「安全・汎用性のコスト」の内訳を見る
   - mimalloc と同じくらい「危ないモード」で、どこまで近づけるかを研究

4. **期待効果**:
   - HeapV2専用モード: 15-20M ops/s
   - BenchFast追加: 25-30M ops/s (+65-100% vs 現状)
   - system malloc (90M ops/s) の 28-33% に到達

---

### Phase 20-3: SuperSlab ホットセット チューニング

**現状認識**:
- SS-Reuse: 再利用率 98.8%、新規 mmap 1.2% → page fault は抑えられている
- とはいえ perf ではまだ `asm_exc_page_fault` がでかく見える場面もある

**実装方針**:
1. **Box SS-HotSet**（どのクラスが何枚をホットに持つか計測）:
   - クラスごとの「ホット SuperSlab 数」を 1-2 枚に抑えるように class_hints をチューニング
   - precharge (`HAKMEM_TINY_SS_PRECHARGE_Cn`) を使って、「最初から 2 枚だけ温める」戦略を試す

2. **Box SS-Compact**（ホットセット圧縮）:
   - 同じ SuperSlab に複数のホットクラスを詰め込む（Phase 12 の発展）
   - 例: C2/C3 を同じ SuperSlab に配置 → キャッシュ効率向上

3. **期待効果**:
   - page fault さらに削減 → +10-20% 性能向上
   - 既存の SS-Reuse/Cache 設計を、「Tiny front が見ているサイズ帯に合わせて細かく調整」

---

### Phase 20 実装順序

1. **Phase 20-1**: HeapV2 専用モード実装（優先度: 高）
   - 期待: +35-80% (11M → 15-20M ops/s)
   - 工数: 中（既存 HeapV2 をスリム化）

2. **Phase 20-2**: BenchFast モード実装（優先度: 中）
   - 期待: +65-100% (11M → 25-30M ops/s)
   - 工数: 中（安全層バイパス）

3. **Phase 20-3**: SS-HotSet チューニング（優先度: 低）
   - 期待: +10-20% 追加改善
   - 工数: 小（パラメータ調整 + 計測箱追加）

---

### Phase 20 成功条件

- ✅ Tiny 固定サイズで 20-30M ops/s 達成（system の 25-35%）
- ✅ 「箱を崩さず」達成（研究箱としての価値を保つ）
- ✅ ENV切り替えで「安全モード」「ベンチモード」を選べる状態を維持
- ✅ 残りの差（system との 2.5-3x）は「kernel/page fault + mimalloc の極端な inlining」と言える根拠を固める

---

### Phase 20 後の展望

ここまで行けたら:
- 「残りの差は kernel/page fault + mimalloc の極端な inlining・OS依存の差」だと自信を持って言える
- hakmem の「研究箱」としての価値（構造をいじりやすい / 可視化しやすい）を保ったまま、
  性能面でも「そこそこ実用に耐える」ラインに乗る
- 学術論文・技術ブログでの発表材料が揃う

---

## 11. Phase 20-1 実装ログ: Box SS-HotPrewarm（TLS Cache 事前確保） ✅

### 2025-11-16

#### 実装内容
- ✅ **Box SS-HotPrewarm 作成**: ENV制御の per-class TLS cache prewarm
  - 実装: `core/box/ss_hot_prewarm_box.h/c`
  - デフォルト targets: C2/C3=128, C4/C5=64（aggressive prewarm）
  - ENV制御: `HAKMEM_TINY_PREWARM_C2`, `_C3`, `_C4`, `_C5`, `_ALL`

- ✅ **初期化統合**: `hak_init_impl()` から自動呼び出し
  - 384 ブロック事前確保（C2=128, C3=128, C4=64, C5=64）
  - `box_prewarm_tls()` API 使用（安全な carve-push）

#### ベンチマーク結果（500K iterations, 256B random mixed）

| 設定 | Page Faults | Throughput | vs Baseline |
|------|-------------|------------|-------------|
| **Baseline** (Prewarm OFF) | 10,399 | 15.7M ops/s | - |
| **Phase 20-1** (Prewarm ON) | 10,342 | 16.2M ops/s | **+3.3%** ⭐ |

- **Page fault 削減**: 0.55%（期待: 50-66% → 現実: ほぼなし）
- **性能向上**: +3.3%（15.7M → 16.2M ops/s）

#### 分析と結論

**❌ Page Fault 削減の失敗理由**:
1. **ユーザーページ由来が支配的**: ベンチマーク自体の初期化・データ構造確保による page fault が大半
2. **SuperSlab 事前確保の限界**: 384 ブロック程度の prewarm では、ベンチマーク全体の page fault (10K+) に対して微々たる影響しかない
3. **カーネル側のコスト**: `asm_exc_page_fault` はユーザー空間だけでは制御不可能

**✅ Cache Warming 効果**:
1. **TLS SLL 事前充填**: 初期の refill コスト削減
2. **CPU サイクル節約**: +3.3% の性能向上
3. **安定性向上**: 初期状態が warm → 最初のアロケーションから高速

#### 決定: 「軽い +3% 箱」として確定

- **prewarm は有効**: 384 ブロック確保（C2/C3=128, C4/C5=64）のまま残す
- **これ以上の aggressive 化は不要**: RSS 消費増 vs page fault 削減効果が見合わない
- **次フェーズへ**: BenchFast モードで「上限性能」を測定し、構造的限界を把握

#### 変更ファイル
- `core/box/ss_hot_prewarm_box.h` - NEW
- `core/box/ss_hot_prewarm_box.c` - NEW
- `core/box/hak_core_init.inc.h` - prewarm 呼び出し追加
- `Makefile` - `ss_hot_prewarm_box.o` 追加

---

**Status**: Phase 20-1 完了 ✅ → **Phase 20-2 準備中** 🎯
**Next**: BenchFast モード実装（安全コスト全外し → 構造的上限測定）


---

## Phase 20-2: BenchFast Mode Implementation (2025-11-16) ✅

**Status**: ✅ **COMPLETE** - Recursion fixed via prealloc pool + init guard
**Goal**: Measure HAKMEM's structural performance ceiling by removing ALL safety costs
**Implementation**: Complete (core/box/bench_fast_box.{h,c})

### Design Philosophy

BenchFast mode bypasses all safety mechanisms to measure the theoretical maximum throughput:

**Alloc path** (6-8 instructions):
- size → class_idx → TLS SLL pop → write header → return USER pointer
- Bypasses: classify_ptr, Pool/Mid routing, registry, refill logic

**Free path** (3-5 instructions):
- Read header → BASE pointer → TLS SLL push
- Bypasses: registry lookup, mincore, ExternalGuard, capacity checks

### Implementation Details

**Files Created**:
- `core/box/bench_fast_box.h` - ENV-gated API with recursion guard
- `core/box/bench_fast_box.c` - Ultra-minimal alloc/free + prealloc pool

**Integration**:
- `core/box/hak_wrappers.inc.h` - malloc()/free() wrappers with BenchFast bypass
- `bench_random_mixed.c` - bench_fast_init() call before benchmark loop
- `Makefile` - bench_fast_box.o added to all object lists

**Activation**:
```bash
export HAKMEM_BENCH_FAST_MODE=1
./bench_fixed_size_hakmem 500000 256 128
```

### Recursion Fix: Prealloc Pool Strategy

**Problem**: When TLS SLL is empty, bench_fast_alloc() → hak_alloc_at() → malloc() → infinite loop

**Solution** (User's "C案"):
1. **Prealloc pool**: bench_fast_init() pre-allocates 50K blocks per class using normal path
2. **Init guard**: `bench_fast_init_in_progress` flag prevents BenchFast during init
3. **Pop-only alloc**: bench_fast_alloc() only pops from pool, NO REFILL

**Key Fix** (User's contribution):
```c
// core/box/bench_fast_box.h
extern __thread int bench_fast_init_in_progress;

// core/box/hak_wrappers.inc.h (malloc wrapper)
if (__builtin_expect(!bench_fast_init_in_progress && bench_fast_enabled(), 0)) {
    return bench_fast_alloc(size);  // Only activate AFTER init complete
}
```

### Performance Results (500K iterations, 256B fixed-size)

| Mode | Throughput | vs Baseline | vs System |
|------|------------|-------------|-----------|
| **Baseline** (通常) | 54.4M ops/s | - | 53.3% |
| **BenchFast** (安全コスト除去) | 56.9M ops/s | **+4.5%** | 55.7% |
| **System malloc** | 102.1M ops/s | +87.6% | 100% |

### 🔍 Critical Discovery: Safety Costs Are NOT the Bottleneck

**BenchFast で安全コストをすべて除去しても、わずか +4.5% しか改善しない！**

**What this reveals**:
- classify_ptr、Pool/Mid routing、registry、mincore、ExternalGuard → これらは**ボトルネックではない**
- 本当のボトルネックは**構造的な部分**：
  - SuperSlab 設計（1 SS = 1 class 固定）
  - メタデータアクセスパターン（cache miss 多発）
  - TLS SLL 効率（pointer chasing overhead）
  - 877 SuperSlab 生成による巨大なメタデータフットプリント

**System malloc との差**:
- Baseline: 47.7M ops/s 遅い（-46.7%）
- BenchFast でも 45.2M ops/s 遅い（-44.3%）
- → 安全コスト除去しても差は **たった 2.5M ops/s しか縮まらない**

### Implications for Future Work

**増分最適化の限界**:
- Phase 9-11 で学んだ教訓を確認：症状の緩和では埋まらない
- 安全コストは全体の 4.5% しか占めていない
- 残り 95.5% は**構造的なボトルネック**

**Phase 12 Shared SuperSlab Pool の重要性**:
- 877 SuperSlab → 100-200 に削減
- メタデータフットプリント削減 → cache miss 削減
- 動的 slab 共有 → 使用効率向上
- 期待性能: 70-90M ops/s（System の 70-90%）

### Bottleneck Breakdown (推定)

| コンポーネント | CPU 時間 | BenchFast で除去? |
|---------------|----------|------------------|
| SuperSlab metadata access | ~35% | ❌ 構造的 |
| TLS SLL pointer chasing | ~25% | ❌ 構造的 |
| Refill + carving | ~15% | ❌ 構造的 |
| classify_ptr + registry | ~10% | ✅ 除去済み |
| Pool/Mid routing | ~5% | ✅ 除去済み |
| mincore + guards | ~5% | ✅ 除去済み |
| その他 | ~5% | - |

**結論**: 構造的ボトルネック（75%）>> 安全コスト（20%）

**Next Steps**:
- Phase 12: Shared SuperSlab Pool（本質的解決）
- 877 SuperSlab → 100-200 に削減して cache miss を大幅削減
- 期待性能: 70-90M ops/s（System の 70-90%）

**Phase 20 完了**: BenchFast モードで「安全コストは 4.5%」と証明 ✅

---

## Phase 21: Hot Path Cache Optimization (HPCO) - 構造的ボトルネック攻略 🎯

**Status**: 🚧 **PLANNING** (ChatGPT先生のフィードバック反映済み)
**Goal**: アクセスパターン最適化で 60% CPU（メタアクセス 35% + ポインタチェイス 25%）を直接攻撃
**Target**: 75-82M ops/s（System malloc の 73-80%）

### Phase 20-2 で判明した構造的ボトルネック

**BenchFast の結論**:
- 安全コスト（classify_ptr/Pool routing/registry/mincore/guards）= **4.5%** しかない
- 残り 45M ops/s の差 = **箱の積み方そのもの**

**支配的ボトルネック** (60% CPU):
```
メタアクセス:      ~35% (SuperSlab/TinySlabMeta の複数フィールド読み書き)
ポインタチェイス:  ~25% (TLS SLL の next ポインタたどり)
carve/refill:      ~15% (batch carving + metadata updates)
```

**1 回の alloc/free で発生すること**:
- 何段も構造体を跨ぐ（TLS → SuperSlab → SlabMeta → freelist）
- ポインタを何回もたどる（SLL の next チェイン）
- メタデータを何フィールドも触る（used/capacity/carved/freelist/...）

### Phase 21 戦略（ChatGPT先生フィードバック反映）

#### Phase 21-1: Array-Based TLS Cache (C2/C3) 🔴 最優先

**狙い**: TLS SLL のポインタチェイス削減 → +15-20%

**現状の問題**:
```c
// TLS SLL (linked list) - 3 メモリアクセス、うち 1 回は cache miss
void* ptr = g_tls_sll_head[class_idx];  // 1. ヘッド読み込み
void* next = *(void**)ptr;              // 2. next ポインタ読み込み (cache miss!)
g_tls_sll_head[class_idx] = next;       // 3. ヘッド更新
```

**解決策: Ring Buffer**:
```c
// Box 21-1: Array-based hot cache (C2/C3 only)
typedef struct {
    void* slots[128];  // 初期サイズ 128（ENV で A/B: 64/128/256）
    uint16_t head;     // pop index
    uint16_t tail;     // push index
} TlsRingCache;

static __thread TlsRingCache g_hot_cache_c2;
static __thread TlsRingCache g_hot_cache_c3;

// Ultra-fast alloc (1-2 命令)
void* ptr = g_hot_cache_c2.slots[g_hot_cache_c2.head++ & 0x7F];  // ring wrap
```

**階層化** (ChatGPT先生フィードバック):
```
Ring → SLL → SuperSlab
  ↑      ↑       ↑
 L0     L1      L2

- alloc: Ring → 空なら SLL → 空なら SuperSlab
- free:  Ring → 満杯なら SLL
- drain: SLL → Ring に昇格（一方向）
```

**効果**:
- ポインタチェイス: 1 回 → **0 回**
- メモリアクセス: 3 → **2 回**
- cache locality: 配列は連続メモリ
- **期待: +15-20%** (54.4M → 62-65M ops/s)

**ENV 変数**:
```bash
HAKMEM_TINY_HOT_RING_C2=128   # C2 Ring サイズ (default: 128)
HAKMEM_TINY_HOT_RING_C3=128   # C3 Ring サイズ (default: 128)
HAKMEM_TINY_HOT_RING_ENABLE=1 # Ring cache 有効化
```

**実装ポイント** (ChatGPT先生):
- Ring サイズは 64/128/256 で A/B テスト
- C0/C1/C4/C5/C6/C7 は SLL のまま（使用頻度低い）
- drain 時: SLL → Ring への昇格（一方向）
- Ring が空 → SLL fallback → SuperSlab refill

#### Phase 21-2: Hot Slab Direct Index 🟡 中優先度

**狙い**: SuperSlab → slab ループ削減 → +10-15%

**現状の問題**:
```c
// 毎回 32 slab をスキャン
SuperSlab* ss = g_tls_slabs[class_idx].ss;
for (int i = 0; i < 32; i++) {  // ← ループ！
    TinySlabMeta* meta = &ss->slabs[i];
    if (meta->freelist != NULL) { ... }
}
```

**解決策: Hot Slab Cache**:
```c
// Box 21-2: Direct index to hot slab
static __thread TinySlabMeta* g_hot_slab[TINY_NUM_CLASSES];

void refill_from_hot_slab(int class_idx) {
    TinySlabMeta* hot = g_hot_slab[class_idx];

    // Hot slab が空なら更新
    if (!hot || hot->freelist == NULL) {
        hot = find_nonempty_slab(class_idx);  // 1回だけ探索
        g_hot_slab[class_idx] = hot;          // cache!
    }

    pop_batch_from_freelist(hot, ...);  // no loop!
}
```

**効果**:
- SuperSlab → slab ループ: 削除
- メタアクセス: 32 回 → **1 回**
- **期待: +10-15%** (62-65M → 70-75M ops/s)

**実装ポイント** (ChatGPT先生):
- Hot slab が EMPTY → find_nonempty_slab で差し替え
- free 時: hot slab に返す or freelist に戻す（ポリシー決める）
- shared_pool / SS-Reuse との整合性確保

#### Phase 21-3: Minimal Meta Access (C2/C3) 🟢 低優先度

**狙い**: 触るフィールド削減 → +5-10%

**現状の問題**:
```c
// 1 alloc/free で 4-5 フィールド触る
typedef struct {
    uint16_t used;      // ✅ 必須
    uint16_t capacity;  // ❌ compile-time 定数化できる
    uint16_t carved;    // ❌ C2/C3 では使わない
    void* freelist;     // ✅ 必須
} TinySlabMeta;
```

**解決策: アクセスパターン限定** (ChatGPT先生):
```c
// struct を分けなくてもOK（型分岐を避ける）
// C2/C3 コードパスで触るのを used/freelist だけに限定

#define C2_CAPACITY 64  // compile-time 定数

static inline int c2_can_alloc(TinySlabMeta* meta) {
    return meta->used < C2_CAPACITY;  // capacity フィールド不要！
}
```

**効果**:
- 触るフィールド: 4-5 → **2 個** (used/freelist のみ)
- cache line 消費: 削減
- **期待: +5-10%** (70-75M → 75-82M ops/s)

**実装ポイント** (ChatGPT先生):
- struct 分離は後回し（型分岐コスト vs 効果のトレードオフ）
- アクセスパターン限定だけでも cache 効果あり
- Phase 21-1/2 の結果を見てから判断

### Phase 21 実装順序

```
Phase 21-1 (Array-based TLS Cache C2/C3):
  ↓ +15-20% → 62-65M ops/s
Phase 21-2 (Hot Slab Direct Index):
  ↓ +10-15% → 70-75M ops/s
Phase 21-3 (Minimal Meta Access):
  ↓ +5-10%  → 75-82M ops/s
  ↓
🎯 Target: System malloc の 73-80%
```

**Phase 12 (SuperSlab 共有) は後回し**:
- Phase 21 で 80M ops/s 到達後、残り 20M ops/s を Phase 12 で詰める

### ChatGPT先生フィードバック（重要）

1. **Box 21-1 (Ring cache)**: ✅ perf 的にドンピシャ
   - Ring → SLL → SuperSlab の階層を明確に
   - Ring サイズは 128/64 から ENV で A/B
   - drain 時: SLL → Ring への昇格（一方向）

2. **Box 21-2 (Hot slab)**: ✅ 有効だが扱いに注意
   - hot slab が EMPTY 時の差し替えロジック
   - shared_pool / SS-Reuse との整合性

3. **Box 21-3 (Minimal meta)**: ⚠️ 後回しでOK
   - struct 分離は型分岐コスト増
   - アクセスパターン限定だけで効果あり
   - 21-1/2 の結果を見てから判断

4. **Phase 12 との順番**: ✅ 合理的
   - アクセスパターン > SuperSlab 数
   - Phase 21 → Phase 12 の順で問題なし

### 実装リスク

**低リスク**:
- C2/C3 のみ変更（他クラスは SLL のまま）
- 既存構造を大きく変えない
- ENV で A/B テスト可能

**注意点**:
- Ring と SLL の境界を明確に
- shared_pool / SS-Reuse との整合
- 型分岐が増えすぎないように

### 次のアクション

**Phase 21-1 実装開始**:
1. `core/box/hot_ring_cache_box.{h,c}` 作成
2. C2/C3 専用 TlsRingCache 実装
3. Ring → SLL → SuperSlab 階層化
4. ENV: `HAKMEM_TINY_HOT_RING_ENABLE=1`
5. ベンチマーク: 目標 62-65M ops/s (+15-20%)

---
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								# CURRENT TASK (Phase 14–17 Snapshot) – Tiny / Mid / ExternalGuard / Small-Mid
-												CURRENT_TASK: Registry 線形スキャン ボトルネック特定 (2025-11-05)

- perf 分析で superslab_refill が 28.51% CPU を消費
- Root cause: 262,144 エントリの線形スキャン (97.65% の hot instructions)
- 解決策: per-class registry (8×4096 = 32K entries)
- 期待効果: +200-300% (2.59M → 7.8-10.4M ops/s)
- Box Refactor は既に動いている (+463% ST, +131% MT)

次のアクション: Phase 1 実装 (per-class registry 変更)

詳細: PERF_ANALYSIS_2025_11_05.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-05 16:47:04 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								**Last Updated**: 2025-11-16
 								**Owner**: ChatGPT → Phase 17 実装中: Claude Code
 								**Size**: 約 300 行（Claude 用コンテキスト簡略版）
-												Fix: CRITICAL multi-threaded freelist/remote queue race condition

Root Cause:
===========
Freelist and remote queue contained the SAME blocks, causing use-after-free:

1. Thread A (owner): pops block X from freelist → allocates to user
2. User writes data ("ab") to block X
3. Thread B (remote): free(block X) → adds to remote queue
4. Thread A (later): drains remote queue → *(void**)block_X = chain_head
   → OVERWRITES USER DATA! 💥

The freelist pop path did NOT drain the remote queue first, so blocks could
be simultaneously in both freelist and remote queue.

Fix:
====
Add remote queue drain BEFORE freelist pop in refill path:

core/hakmem_tiny_refill_p0.inc.h:
  - Call _ss_remote_drain_to_freelist_unsafe() BEFORE trc_pop_from_freelist()
  - Add #include "superslab/superslab_inline.h"
  - This ensures freelist and remote queue are mutually exclusive

Test Results:
=============
BEFORE:
  larson_hakmem (4 threads): ❌ SEGV in seconds (freelist corruption)

AFTER:
  larson_hakmem (4 threads): ✅ 931,629 ops/s (1073 sec stable run)
  bench_random_mixed:        ✅ 1,020,163 ops/s (no crashes)

Evidence:
  - Fail-Fast logs showed next pointer corruption: 0x...6261 (ASCII "ab")
  - Single-threaded benchmarks worked (865K ops/s)
  - Multi-threaded Larson crashed immediately
  - Fix eliminates all crashes in both benchmarks

Files:
  - core/hakmem_tiny_refill_p0.inc.h: Add remote drain before freelist pop
  - CURRENT_TASK.md: Document fix details

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-08 01:35:45 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								---
 								## 1. 全体の現在地（どこまで終わっているか）
 								- Tiny (0–1023B)
 								  - NEW 3-layer front（bump / small_mag / slow）安定。
 								  - TinyHeapV2: 「alloc フロント＋統計」は実装済みだが、実運用は **C2/C3 を UltraHot に委譲**。
 								  - TinyUltraHot（Phase 14）:
 								    - C2/C3（16B/32B）専用 L0 ultra-fast path（Stealing モデル）。
 								    - 固定サイズベンチで +16〜36% 改善、hit 率 ≈ 100%。
 								  - Box 分離（Phase 15）:
 								    - free ラッパが外部ポインタまで `hak_free_at` に投げていた問題を修正。
 								    - BenchMeta（slots など）→ 直接 `__libc_free`、CoreAlloc（Tiny/Mid）→ `hak_free_at` の二段構えに整理。
 								- Mid / PoolTLS（1KB–32KB）
 								  - PoolTLS Phase 完了（Mid-Large MT ベンチ）
 								    - ~10.6M ops/s（system malloc より速い構成あり）。
 								    - lock contention（futex 68%）を lock-free MPSC + bind box で大幅削減。
 								  - GAP 修正（Tiny 1023B / Mid 1KB〜）:
 								    - `TINY_MAX_SIZE=1023` / `MID_MIN_SIZE=1024` で 1KB–8KB の「誰も扱わない帯」は解消済み。
 								- Shared SuperSlab Pool（Phase 12 – SP-SLOT Box）
 								  - 1 SuperSlab : 多 class 共有 + SLOT_UNUSED/ACTIVE/EMPTY 追跡。
 								  - SuperSlab 数: 877 → 72（-92%）、mmap/munmap: -48%、Throughput: +131%。
 								  - Lock contention P0-5 まで実装済み（Stage 2 lock-free claiming）。
 								- ExternalGuard（Phase 15）
 								  - UNKNOWN ポインタ（Tiny/Pool/Mid/L25/registry どこでも捕まらないもの）を最後の箱で扱う。
 								  - 挙動:
 								    - `hak_super_lookup` など全て miss → mincore でページ確認 → 原則「解放せず leak 扱い（安全優先）」。
 								  - Phase 15 修正で:
 								    - BenchMeta のポインタを CoreAlloc に渡さなくなり、UNKNOWN 呼び出し回数が激減。
 								    - `mincore` の CPU 負荷もベンチではほぼ無視できるレベルまで縮小。
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
+								---
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								## 2. Tiny 性能の現状（Phase 14–15 時点）
 								### 2.1 Fixed-size Tiny ベンチ（HAKMEM vs System）
 								`bench_fixed_size_hakmem` / `bench_fixed_size_system`（workset=128, 500K iterations 相当）
 								| Size   | HAKMEM (Phase 15) | System malloc | 比率     |
 								|--------|-------------------|---------------|----------|
 								| 128B   | ~16.6M ops/s      | ~90M ops/s    | ~18.5%   |
 								| 256B   | ~16.2M ops/s      | ~89.6M ops/s  | ~18.1%   |
 								| 512B   | ~15.0M ops/s      | ~90M ops/s    | ~16.6%   |
 								| 1024B  | ~15.1M ops/s      | ~90M ops/s    | ~16.8%   |
 								状態:
 								- クラッシュは完全解消（workset=64/128 で長尺 500K iter も安定）。
 								- Tiny UltraHot + 学習層 + ExternalGuard の組み合わせは「正しさ」は OK。
 								- 性能は system の ~16–18% レベル（約 5–6× 遅い）→ まだ大きな伸びしろあり。
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								### 2.2 C2/C3 UltraHot 専用ベンチ
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								固定サイズ（100K iterations, workset=128）
 								| Size | Baseline (UltraHot OFF) | UltraHot ON | 改善率      | Hit Rate |
 								|------|-------------------------|-------------|-------------|---------|
 								| 16B  | ~40.4M ops/s            | ~55.0M      | +36.2% 🚀    | ≈100%   |
 								| 32B  | ~43.5M ops/s            | ~50.6M      | +16.3% 🚀    | ≈100%   |
 								Random Mixed 256B：
 								- Baseline: ~8.96M ops/s
 								- UltraHot ON: ~8.81M ops/s（-1.6%、誤差〜軽微退化）
 								- 理由: C2/C3 が全体の 1–2% のみ → UltraHot のメリットが平均に薄まる。
 								結論:
 								- C2/C3 UltraHot は **ターゲットクラスに対しては実用級の Box**。
 								- 他ワークロードでは「ほぼ影響なし（わずかな分岐オーバーヘッドのみ）」の範囲に収まっている。
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
 								---
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								## 3. Phase 15: ExternalGuard / Domain 分離の成果
 								### 3.1 以前の問題
 								- free ラッパ（`core/box/hak_wrappers.inc.h`）が:
 								  - HAKMEM 所有かチェックせず、すべての `free(ptr)` を `hak_free_at(ptr, …)` に投げていた。
 								  - その結果:
 								    - ベンチ内部 `slots`（`calloc(256, sizeof(void*))` の 2KB など）も CoreAlloc に流入。
 								    - `classify_ptr` → UNKNOWN → ExternalGuard → mincore → 「解放せず leak」と判定。
 								  - ベンチ観測:
 								    - 約 0.84% の leak（BenchMeta がどんどん漏れる）。
 								    - `mincore` が Tiny ベンチ CPU の ~13% を消費。
 								### 3.2 修正内容（Phase 15）
 								- free ラッパ側:
 								  - 軽量なドメインチェックを追加:
 								    - Tiny/Pool 用の header magic を安全に読んで、HAKMEM 所有の可能性があるものだけ `hak_free_at` へ。
 								    - そうでない（BenchMeta/外部）ポインタは `__libc_free` へ。
 								- ExternalGuard:
 								  - UNKNOWN ポインタを「解放しない（leak）」方針に明示的変更。
 								  - デバッグ時のみ `HAKMEM_EXTERNAL_GUARD_LOG=1` で原因特定用ログを出す。
 								### 3.3 結果
 								- Leak 率:
 								  - 100K iter: 840 leaks → 0.84%
 								  - 500K iter: ~4200 leaks → 0.84%
 								  - ほぼ全部が BenchMeta / 外部ポインタであり、CoreAlloc 側の漏れではないと確認。
 								- 性能:
 								  - 256B 固定:
 								    - Before: 15.9M ops/s
 								    - After:  16.2M ops/s（+1.9%）→ domain check オーバーヘッドは軽微、むしろ微増。
 								- 安定性:
 								  - 全サイズ（128/256/512/1024B）で 500K iter 完走（クラッシュなし）。
 								  - ExternalGuard 経由の「危ない free」は leak に封じ込められた。
 								**要点:**
 								Box 境界違反（BenchMeta→CoreAlloc 流入）はほぼ完全に解消。
 								ベンチでの mincore / ExternalGuard コストも許容範囲になった。
-												Phase 15: Box Separation (partial) - Box headers completed, routing deferred

**Status**: Box FG V2 + ExternalGuard 実装完了、hak_free_at routing は Phase 14-C に revert

**Files Created**:
1. core/box/front_gate_v2.h (98 lines)
   - Ultra-fast 1-byte header classification (TINY/POOL/MIDCAND/EXTERNAL)
   - Performance: 2-5 cycles
   - Same-page guard added (防御的プログラミング)

2. core/box/external_guard_box.h (146 lines)
   - ENV-controlled mincore safety check
   - HAKMEM_EXTERNAL_GUARD_MINCORE=0/1 (default: OFF)
   - Uses __libc_free() to avoid infinite loop

**Routing**:
- hak_free_at reverted to Phase 14-C (classify_ptr-based, stable)
- Phase 15 routing caused SEGV on page-aligned pointers

**Performance**:
- Phase 14-C (mincore ON): 16.5M ops/s (stable)
- mincore: 841 calls/100K iterations
- mincore OFF: SEGV (unsafe AllocHeader deref)

**Next Steps** (deferred):
- Mid/Large/C7 registry consolidation
- AllocHeader safety validation
- ExternalGuard integration

**Recommendation**: Stick with Phase 14-C for now
- mincore overhead acceptable (~1.9ms / 100K)
- Focus on other bottlenecks (TLS SLL, SuperSlab churn)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 22:08:51 +09:00
 								---
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								## 4. Phase 16: Dynamic Tiny/Mid Boundary A/B Testing（2025-11-16完了）
 								### 4.1 実装内容
 								ENV変数でTiny/Mid境界を動的調整可能にする機能を追加：
 								- `HAKMEM_TINY_MAX_CLASS=7` (デフォルト): Tiny が 0-1023B を担当
 								- `HAKMEM_TINY_MAX_CLASS=5` (実験用): Tiny が 0-255B のみ担当
 								実装ファイル：
 								- `hakmem_tiny.h/c`: `tiny_get_max_size()` - ENV読取とクラス→サイズマッピング
 								- `hakmem_mid_mt.h/c`: `mid_get_min_size()` - 動的境界調整（サイズギャップ防止）
 								- `hak_alloc_api.inc.h`: 静的TINY_MAX_SIZEを動的呼び出しに変更
 								### 4.2 A/B Benchmark Results
 								| Size | Config A (C0-C7) | Config B (C0-C5) | 変化率 |
 								|------|------------------|------------------|--------|
 								| 128B | 6.34M ops/s | 1.38M ops/s | **-78%** ❌ |
 								| 256B | 6.34M ops/s | 1.36M ops/s | **-79%** ❌ |
 								| 512B | 5.55M ops/s | 1.33M ops/s | **-76%** ❌ |
 								| 1024B | 5.91M ops/s | 1.37M ops/s | **-77%** ❌ |
 								### 4.3 発見と結論
 								✅ **成功**: サイズギャップ修正完了（OOMクラッシュなし）
 								❌ **失敗**: Tiny カバレッジ削減で大幅な性能劣化 (-76% ~ -79%)
 								⚠️ **根本原因**: Mid の粗いサイズクラス (8KB/16KB/32KB) が小サイズで非効率
 								- Mid は 8KB ページ単位の設計 → 256B-1KB を投げると 8KB ページをほぼ数ブロックのために確保
 								- ページ fault・TLB・メタデータコストが相対的に巨大
 								- Tiny は slab + freelist で高密度 → 同じサイズでも桁違いに効率的
 								**教訓（ChatGPT先生分析）**:
 . Mid 箱の前提が「8KB〜用」になっている
 								   - 256B/512B/1024B では 8KB ページをほぼ1〜数個のブロックのために確保 → 非効率
 . パス長も Mid の方が長い（PoolTLS / mid registry / page 管理）
 . 「Tiny を削って Mid に任せれば軽くなる」という仮説は、現行の "8KB〜前提の Mid 設計" では成り立たない
 								**推奨**: **デフォルト HAKMEM_TINY_MAX_CLASS=7 (C0-C7) を維持**
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
 								---
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								## 5. Phase 17: Small-Mid Allocator Box - 実験完了 ✅（2025-11-16）
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								### 5.1 目標と動機
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**問題**: Tiny C5-C7 (256B/512B/1KB) が ~6M ops/s → system malloc の ~6.7% レベル
 								**仮説**: 専用層を作れば 2-4x 改善可能
 								**結果**: ❌ **仮説は誤り** - 性能改善なし（±0-1%）
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								### 5.2 Phase 17-1: TLS Frontend Cache（Tiny delegation）
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**実装**:
 								- TLS freelist（256B/512B/1KB、容量32/24/16）
 								- Backend: Tiny C5/C6/C7に委譲、Header変換（0xa0 → 0xb0）
 								- Auto-adjust: Small-Mid ON時にTinyをC0-C5に自動制限
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**結果**:
 								| Size | OFF | ON | 変化率 |
 								|------|-----|-----|--------|
 								| 256B | 5.87M | 6.06M | **+3.3%** |
 								| 512B | 6.02M | 5.91M | **-1.9%** |
 								| 1024B | 5.58M | 5.54M | **-0.6%** |
 								| **平均** | 5.82M | 5.84M | **+0.3%** |
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**教訓**: Delegation overhead = TLS savings → 正味利益ゼロ
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								### 5.3 Phase 17-2: Dedicated SuperSlab Backend
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**実装**:
 								- Small-Mid専用SuperSlab pool（1MB、16 slabs/SS）
 								- Batch refill（8-16 blocks/refill）
 								- 直接0xb0 header書き込み（Tiny delegationなし）
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**結果**:
 								| Size | OFF | ON | 変化率 |
 								|------|-----|-----|--------|
 								| 256B | 6.08M | 5.84M | **-4.1%** ⚠️ |
 								| 512B | 5.79M | 5.86M | **+1.2%** |
 								| 1024B | 5.42M | 5.44M | **+0.4%** |
 								| **平均** | 5.76M | 5.71M | **-0.9%** |
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**Phase 17-1比較**: Phase 17-2の方が悪化（-3.6% on 256B）
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								### 5.4 根本原因分析（ChatGPT先生 + perf profiling）
-												Phase 17-1: Small-Mid Allocator - TLS Frontend Cache (結果: ±0.3%, 層分離成功)

Summary:
========
Phase 17-1 implements Small-Mid allocator as TLS frontend cache with Tiny backend delegation.
Result: Clean layer separation achieved with minimal overhead (±0.3%), but no performance gain.
Conclusion: Frontend-only approach is dead end. Phase 17-2 (dedicated backend) required for 2-3x target.

Implementation:
===============
1. Small-Mid TLS frontend (256B/512B/1KB - 3 classes)
   - TLS freelist (32/24/16 capacity)
   - Backend delegation to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)

2. Auto-adjust Tiny boundary
   - When Small-Mid ON: Tiny auto-limits to C0-C5 (0-255B)
   - When Small-Mid OFF: Tiny default C0-C7 (0-1023B)
   - Prevents routing conflict

3. Routing order fix
   - Small-Mid BEFORE Tiny (critical for proper execution)
   - Fall-through on TLS miss

Files Modified:
===============
- core/hakmem_smallmid.h/c: TLS freelist + backend delegation
- core/hakmem_tiny.c: tiny_get_max_size() auto-adjust
- core/box/hak_alloc_api.inc.h: Routing order (Small-Mid → Tiny)
- CURRENT_TASK.md: Phase 17-1 results + Phase 17-2 plan

A/B Benchmark Results:
======================
| Size   | Config A (OFF) | Config B (ON) | Delta    | % Change |
|--------|----------------|---------------|----------|----------|
| 256B   | 5.87M ops/s    | 6.06M ops/s   | +191K    | +3.3%    |
| 512B   | 6.02M ops/s    | 5.91M ops/s   | -112K    | -1.9%    |
| 1024B  | 5.58M ops/s    | 5.54M ops/s   | -35K     | -0.6%    |
| Overall| 5.82M ops/s    | 5.84M ops/s   | +20K     | +0.3%    |

Analysis:
=========
✅ SUCCESS: Clean layer separation (Small-Mid ↔ Tiny coexist)
✅ SUCCESS: Minimal overhead (±0.3% = measurement noise)
❌ FAIL: No performance gain (target was 2-4x)

Root Cause:
-----------
- Delegation overhead = TLS savings (net gain ≈ 0 instructions)
- Small-Mid TLS alloc: ~3-5 instructions
- Tiny backend delegation: ~3-5 instructions
- Header conversion: ~2 instructions
- No batching: 1:1 delegation to Tiny (no refill amortization)

Lessons Learned:
================
- Frontend-only approach ineffective (backend calls not reduced)
- Dedicated backend essential for meaningful improvement
- Clean separation achieved = solid foundation for Phase 17-2

Next Steps (Phase 17-2):
========================
- Dedicated Small-Mid SuperSlab backend (separate from Tiny)
- TLS batch refill (8-16 blocks per refill)
- Optimized 0xb0 header fast path (no delegation)
- Target: 12-15M ops/s (2.0-2.6x improvement)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 02:37:24 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**発見**: **70% page fault** が支配的 🔥
 								**Perf分析**:
 								- `asm_exc_page_fault`: 70% CPU時間
 								- 実際のallocation logic（TLS/refill）: 30% のみ
 								- **結論**: Frontend実装は成功、Backendが重すぎる
 								**なぜpage faultが多いか**:
 								```
 								Small-Mid: alloc → TLS miss → refill → SuperSlab新規確保
 								  → mmap(1MB) → page fault 発生 → 70%のCPU消費
 								Tiny: alloc → TLS miss → refill → 既存warm SuperSlab使用
 								  → page faultなし → 高速
 								```
 								**Small-Mid問題**:
 . 新しいSuperSlabを頻繁に確保（workloadが短いため）
 . Warm SuperSlabの再利用なし（usedカウンタ減らない）
 . Batch refillのメリットよりmmap/page faultコストが大きい
 								### 5.5 Phase 17の結論と教訓
 								❌ **Small-Mid専用層戦略は失敗**:
 								- Phase 17-1（Frontend only）: +0.3%
 								- Phase 17-2（Dedicated backend）: -0.9%
 								- 目標（2-4x改善）: **未達成**（-50-67%不足）
 								✅ **重要な発見**:
 . **Frontend（TLS/batch refill）設計はOK** - 30%のみの負荷
 . **70% page fault = SuperSlab層の問題**
 . **Tiny (6.08M) は既に十分速い** - これを超えるのは困難
 . **層の分離では性能は上がらない** - Backend最適化が必要
 								✅ **実装の価値**:
 								- ENV=0でゼロオーバーヘッド（branch predictor学習）
 								- 実験記録として価値あり（"なぜ専用層が効果なかったか"の証拠）
 								- Tiny最適化の邪魔にならない（完全分離アーキテクチャ）
 								### 5.6 次のステップ: SuperSlab Reuse（Phase 18候補）
 								**ChatGPT提案**: Tiny SuperSlabの最適化（Small-Mid専用層ではなく）
 								**Box SS-Reuse（SuperSlab slab再利用箱）**:
 								- **目標**: 70% page fault → 5-10%に削減
 								- **戦略**:
 . meta->freelistを優先使用（現在はbump onlyで再利用なし）
 . slabがemptyになったらshared_poolに返却
 . 同じSuperSlab内で長く回す（新規mmap削減）
 								- **効果**: page fault大幅削減 → 2-4x改善期待
 								- **実装場所**: `core/hakmem_tiny_superslab.c`（Tiny用、Small-Midではない）
 								**Box SS-Prewarm（事前温め箱）**:
 								- クラスごとにSuperSlabを事前確保（Phase 11実績: +6.4%）
 								- page faultをbenchmark開始時に集中
 								- **課題**: benchmark専用、実運用では無駄
 								**推奨**: Box SS-Reuse優先（実運用価値あり、根本解決）
-												Docs: Document workset=128 recursion fix in CURRENT_TASK

Added section 3.3 documenting the critical infinite recursion bug fix:
- Root cause: realloc() → hak_alloc_at() → shared_pool_init() → realloc() loop
- Symptoms: workset=128 hung, workset=64 worked (size-class specific)
- Fix: Replace realloc() with system mmap() for Shared Pool metadata
- Performance: timeout → 18.5M ops/s

Commit 176bbf656

											
										
										
											2025-11-15 14:36:35 +09:00
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
+								---
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								## 6. 未達成の目標・残課題（次フェーズ候補）
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								### 6.1 Tiny 性能ギャップ（System の ~18% 止まり）
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								現状:
 								- System malloc が ~90M ops/s レベルのところ、
 								- HAKMEM は 128〜1024B 固定で ~15–16M ops/s（約 18%）。
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								原因の切り分け（これまでの調査から）:
 								- Front（UltraHot/TinyHeapV2/TLS SLL）のパス長はかなり短縮済み。
 								- L1 dcache miss / instructions / branches は Phase 14 で大幅削減済みだが、
 								  - まだ Tiny が 0–1023B を全部抱えており、
 								  - 特に 512/1024B が Superslab/Pool 側のメタ負荷に効いている可能性。
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								候補:
 								- **Phase 17 で実装中！** Small-Mid Box（256B〜4KB 専用箱）を設計し、Tiny/Mid の間を分離する。
 								  - 詳細は § 5. Phase 17 を参照
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								### 6.2 UltraHot/TinyHeapV2 の拡張 or 整理
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								- C2/C3 UltraHot は成功（16/32B 用）。
 								- C4/C5 まで拡張した試み（Phase 14-B）は:
 								  - Fixed-size では改善あり。
 								  - Random Mixed で shared_pool_acquire_slab() が 47.5% まで膨らみ、大退化。
 								  - 原因: Superslab/TLS 在庫のバランスを壊す「窃取カスケード」。
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								方針:
 								- UltraHot は **C2/C3 専用 Box** に戻す（C4/C5 は一旦対象外にする）。
 								- もし C4/C5 を最適化したいなら、SmallMid Box の中で別設計する。
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								### 6.3 ExternalGuard の統計と自動アラート
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								- 現在:
 								  - `HAKMEM_EXTERNAL_GUARD_STATS=1` で統計を手動出力。
 								  - 100+ 回呼ばれたら WARNING を出すのみ。
 								- 構想:
 								  - 「ExternalGuard 呼び出しが一定閾値を超えたら、自動で簡易レポートを吐く」Box を追加。
 								  - 例: Top N 呼び出し元アドレス、サイズ帯、mincore 結果 など。
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								---
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								## 7. Claude Code 君向け TODO
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								### 7.1 Phase 17: Small-Mid Allocator Box ✅ 完了（2025-11-16）
-												Phase 17-1: Small-Mid Allocator - TLS Frontend Cache (結果: ±0.3%, 層分離成功)

Summary:
========
Phase 17-1 implements Small-Mid allocator as TLS frontend cache with Tiny backend delegation.
Result: Clean layer separation achieved with minimal overhead (±0.3%), but no performance gain.
Conclusion: Frontend-only approach is dead end. Phase 17-2 (dedicated backend) required for 2-3x target.

Implementation:
===============
1. Small-Mid TLS frontend (256B/512B/1KB - 3 classes)
   - TLS freelist (32/24/16 capacity)
   - Backend delegation to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)

2. Auto-adjust Tiny boundary
   - When Small-Mid ON: Tiny auto-limits to C0-C5 (0-255B)
   - When Small-Mid OFF: Tiny default C0-C7 (0-1023B)
   - Prevents routing conflict

3. Routing order fix
   - Small-Mid BEFORE Tiny (critical for proper execution)
   - Fall-through on TLS miss

Files Modified:
===============
- core/hakmem_smallmid.h/c: TLS freelist + backend delegation
- core/hakmem_tiny.c: tiny_get_max_size() auto-adjust
- core/box/hak_alloc_api.inc.h: Routing order (Small-Mid → Tiny)
- CURRENT_TASK.md: Phase 17-1 results + Phase 17-2 plan

A/B Benchmark Results:
======================
| Size   | Config A (OFF) | Config B (ON) | Delta    | % Change |
|--------|----------------|---------------|----------|----------|
| 256B   | 5.87M ops/s    | 6.06M ops/s   | +191K    | +3.3%    |
| 512B   | 6.02M ops/s    | 5.91M ops/s   | -112K    | -1.9%    |
| 1024B  | 5.58M ops/s    | 5.54M ops/s   | -35K     | -0.6%    |
| Overall| 5.82M ops/s    | 5.84M ops/s   | +20K     | +0.3%    |

Analysis:
=========
✅ SUCCESS: Clean layer separation (Small-Mid ↔ Tiny coexist)
✅ SUCCESS: Minimal overhead (±0.3% = measurement noise)
❌ FAIL: No performance gain (target was 2-4x)

Root Cause:
-----------
- Delegation overhead = TLS savings (net gain ≈ 0 instructions)
- Small-Mid TLS alloc: ~3-5 instructions
- Tiny backend delegation: ~3-5 instructions
- Header conversion: ~2 instructions
- No batching: 1:1 delegation to Tiny (no refill amortization)

Lessons Learned:
================
- Frontend-only approach ineffective (backend calls not reduced)
- Dedicated backend essential for meaningful improvement
- Clean separation achieved = solid foundation for Phase 17-2

Next Steps (Phase 17-2):
========================
- Dedicated Small-Mid SuperSlab backend (separate from Tiny)
- TLS batch refill (8-16 blocks per refill)
- Optimized 0xb0 header fast path (no delegation)
- Target: 12-15M ops/s (2.0-2.6x improvement)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 02:37:24 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**Phase 17-1**: TLS Frontend Cache
 								- ✅ 実装完了（TLS freelist + Tiny delegation）
 								- ✅ A/B テスト: ±0.3%（性能改善なし）
 								- ✅ 教訓: Delegation overhead = TLS savings
-												Phase 17-1: Small-Mid Allocator - TLS Frontend Cache (結果: ±0.3%, 層分離成功)

Summary:
========
Phase 17-1 implements Small-Mid allocator as TLS frontend cache with Tiny backend delegation.
Result: Clean layer separation achieved with minimal overhead (±0.3%), but no performance gain.
Conclusion: Frontend-only approach is dead end. Phase 17-2 (dedicated backend) required for 2-3x target.

Implementation:
===============
1. Small-Mid TLS frontend (256B/512B/1KB - 3 classes)
   - TLS freelist (32/24/16 capacity)
   - Backend delegation to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)

2. Auto-adjust Tiny boundary
   - When Small-Mid ON: Tiny auto-limits to C0-C5 (0-255B)
   - When Small-Mid OFF: Tiny default C0-C7 (0-1023B)
   - Prevents routing conflict

3. Routing order fix
   - Small-Mid BEFORE Tiny (critical for proper execution)
   - Fall-through on TLS miss

Files Modified:
===============
- core/hakmem_smallmid.h/c: TLS freelist + backend delegation
- core/hakmem_tiny.c: tiny_get_max_size() auto-adjust
- core/box/hak_alloc_api.inc.h: Routing order (Small-Mid → Tiny)
- CURRENT_TASK.md: Phase 17-1 results + Phase 17-2 plan

A/B Benchmark Results:
======================
| Size   | Config A (OFF) | Config B (ON) | Delta    | % Change |
|--------|----------------|---------------|----------|----------|
| 256B   | 5.87M ops/s    | 6.06M ops/s   | +191K    | +3.3%    |
| 512B   | 6.02M ops/s    | 5.91M ops/s   | -112K    | -1.9%    |
| 1024B  | 5.58M ops/s    | 5.54M ops/s   | -35K     | -0.6%    |
| Overall| 5.82M ops/s    | 5.84M ops/s   | +20K     | +0.3%    |

Analysis:
=========
✅ SUCCESS: Clean layer separation (Small-Mid ↔ Tiny coexist)
✅ SUCCESS: Minimal overhead (±0.3% = measurement noise)
❌ FAIL: No performance gain (target was 2-4x)

Root Cause:
-----------
- Delegation overhead = TLS savings (net gain ≈ 0 instructions)
- Small-Mid TLS alloc: ~3-5 instructions
- Tiny backend delegation: ~3-5 instructions
- Header conversion: ~2 instructions
- No batching: 1:1 delegation to Tiny (no refill amortization)

Lessons Learned:
================
- Frontend-only approach ineffective (backend calls not reduced)
- Dedicated backend essential for meaningful improvement
- Clean separation achieved = solid foundation for Phase 17-2

Next Steps (Phase 17-2):
========================
- Dedicated Small-Mid SuperSlab backend (separate from Tiny)
- TLS batch refill (8-16 blocks per refill)
- Optimized 0xb0 header fast path (no delegation)
- Target: 12-15M ops/s (2.0-2.6x improvement)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 02:37:24 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**Phase 17-2**: Dedicated SuperSlab Backend
 								- ✅ 実装完了（専用SuperSlab pool + batch refill）
 								- ✅ A/B テスト: -0.9%（Phase 17-1より悪化）
 								- ✅ 根本原因: 70% page fault（mmap/SuperSlab確保が重い）
-												Phase 17-1: Small-Mid Allocator - TLS Frontend Cache (結果: ±0.3%, 層分離成功)

Summary:
========
Phase 17-1 implements Small-Mid allocator as TLS frontend cache with Tiny backend delegation.
Result: Clean layer separation achieved with minimal overhead (±0.3%), but no performance gain.
Conclusion: Frontend-only approach is dead end. Phase 17-2 (dedicated backend) required for 2-3x target.

Implementation:
===============
1. Small-Mid TLS frontend (256B/512B/1KB - 3 classes)
   - TLS freelist (32/24/16 capacity)
   - Backend delegation to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)

2. Auto-adjust Tiny boundary
   - When Small-Mid ON: Tiny auto-limits to C0-C5 (0-255B)
   - When Small-Mid OFF: Tiny default C0-C7 (0-1023B)
   - Prevents routing conflict

3. Routing order fix
   - Small-Mid BEFORE Tiny (critical for proper execution)
   - Fall-through on TLS miss

Files Modified:
===============
- core/hakmem_smallmid.h/c: TLS freelist + backend delegation
- core/hakmem_tiny.c: tiny_get_max_size() auto-adjust
- core/box/hak_alloc_api.inc.h: Routing order (Small-Mid → Tiny)
- CURRENT_TASK.md: Phase 17-1 results + Phase 17-2 plan

A/B Benchmark Results:
======================
| Size   | Config A (OFF) | Config B (ON) | Delta    | % Change |
|--------|----------------|---------------|----------|----------|
| 256B   | 5.87M ops/s    | 6.06M ops/s   | +191K    | +3.3%    |
| 512B   | 6.02M ops/s    | 5.91M ops/s   | -112K    | -1.9%    |
| 1024B  | 5.58M ops/s    | 5.54M ops/s   | -35K     | -0.6%    |
| Overall| 5.82M ops/s    | 5.84M ops/s   | +20K     | +0.3%    |

Analysis:
=========
✅ SUCCESS: Clean layer separation (Small-Mid ↔ Tiny coexist)
✅ SUCCESS: Minimal overhead (±0.3% = measurement noise)
❌ FAIL: No performance gain (target was 2-4x)

Root Cause:
-----------
- Delegation overhead = TLS savings (net gain ≈ 0 instructions)
- Small-Mid TLS alloc: ~3-5 instructions
- Tiny backend delegation: ~3-5 instructions
- Header conversion: ~2 instructions
- No batching: 1:1 delegation to Tiny (no refill amortization)

Lessons Learned:
================
- Frontend-only approach ineffective (backend calls not reduced)
- Dedicated backend essential for meaningful improvement
- Clean separation achieved = solid foundation for Phase 17-2

Next Steps (Phase 17-2):
========================
- Dedicated Small-Mid SuperSlab backend (separate from Tiny)
- TLS batch refill (8-16 blocks per refill)
- Optimized 0xb0 header fast path (no delegation)
- Target: 12-15M ops/s (2.0-2.6x improvement)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 02:37:24 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**結論**: Small-Mid専用層は性能改善なし（±0-1%）、Tiny最適化が必要
-												Phase 17-1: Small-Mid Allocator - TLS Frontend Cache (結果: ±0.3%, 層分離成功)

Summary:
========
Phase 17-1 implements Small-Mid allocator as TLS frontend cache with Tiny backend delegation.
Result: Clean layer separation achieved with minimal overhead (±0.3%), but no performance gain.
Conclusion: Frontend-only approach is dead end. Phase 17-2 (dedicated backend) required for 2-3x target.

Implementation:
===============
1. Small-Mid TLS frontend (256B/512B/1KB - 3 classes)
   - TLS freelist (32/24/16 capacity)
   - Backend delegation to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)

2. Auto-adjust Tiny boundary
   - When Small-Mid ON: Tiny auto-limits to C0-C5 (0-255B)
   - When Small-Mid OFF: Tiny default C0-C7 (0-1023B)
   - Prevents routing conflict

3. Routing order fix
   - Small-Mid BEFORE Tiny (critical for proper execution)
   - Fall-through on TLS miss

Files Modified:
===============
- core/hakmem_smallmid.h/c: TLS freelist + backend delegation
- core/hakmem_tiny.c: tiny_get_max_size() auto-adjust
- core/box/hak_alloc_api.inc.h: Routing order (Small-Mid → Tiny)
- CURRENT_TASK.md: Phase 17-1 results + Phase 17-2 plan

A/B Benchmark Results:
======================
| Size   | Config A (OFF) | Config B (ON) | Delta    | % Change |
|--------|----------------|---------------|----------|----------|
| 256B   | 5.87M ops/s    | 6.06M ops/s   | +191K    | +3.3%    |
| 512B   | 6.02M ops/s    | 5.91M ops/s   | -112K    | -1.9%    |
| 1024B  | 5.58M ops/s    | 5.54M ops/s   | -35K     | -0.6%    |
| Overall| 5.82M ops/s    | 5.84M ops/s   | +20K     | +0.3%    |

Analysis:
=========
✅ SUCCESS: Clean layer separation (Small-Mid ↔ Tiny coexist)
✅ SUCCESS: Minimal overhead (±0.3% = measurement noise)
❌ FAIL: No performance gain (target was 2-4x)

Root Cause:
-----------
- Delegation overhead = TLS savings (net gain ≈ 0 instructions)
- Small-Mid TLS alloc: ~3-5 instructions
- Tiny backend delegation: ~3-5 instructions
- Header conversion: ~2 instructions
- No batching: 1:1 delegation to Tiny (no refill amortization)

Lessons Learned:
================
- Frontend-only approach ineffective (backend calls not reduced)
- Dedicated backend essential for meaningful improvement
- Clean separation achieved = solid foundation for Phase 17-2

Next Steps (Phase 17-2):
========================
- Dedicated Small-Mid SuperSlab backend (separate from Tiny)
- TLS batch refill (8-16 blocks per refill)
- Optimized 0xb0 header fast path (no delegation)
- Target: 12-15M ops/s (2.0-2.6x improvement)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 02:37:24 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								### 7.2 Phase 18 候補: SuperSlab Reuse（Tiny最適化）
-												Phase 17-1: Small-Mid Allocator - TLS Frontend Cache (結果: ±0.3%, 層分離成功)

Summary:
========
Phase 17-1 implements Small-Mid allocator as TLS frontend cache with Tiny backend delegation.
Result: Clean layer separation achieved with minimal overhead (±0.3%), but no performance gain.
Conclusion: Frontend-only approach is dead end. Phase 17-2 (dedicated backend) required for 2-3x target.

Implementation:
===============
1. Small-Mid TLS frontend (256B/512B/1KB - 3 classes)
   - TLS freelist (32/24/16 capacity)
   - Backend delegation to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)

2. Auto-adjust Tiny boundary
   - When Small-Mid ON: Tiny auto-limits to C0-C5 (0-255B)
   - When Small-Mid OFF: Tiny default C0-C7 (0-1023B)
   - Prevents routing conflict

3. Routing order fix
   - Small-Mid BEFORE Tiny (critical for proper execution)
   - Fall-through on TLS miss

Files Modified:
===============
- core/hakmem_smallmid.h/c: TLS freelist + backend delegation
- core/hakmem_tiny.c: tiny_get_max_size() auto-adjust
- core/box/hak_alloc_api.inc.h: Routing order (Small-Mid → Tiny)
- CURRENT_TASK.md: Phase 17-1 results + Phase 17-2 plan

A/B Benchmark Results:
======================
| Size   | Config A (OFF) | Config B (ON) | Delta    | % Change |
|--------|----------------|---------------|----------|----------|
| 256B   | 5.87M ops/s    | 6.06M ops/s   | +191K    | +3.3%    |
| 512B   | 6.02M ops/s    | 5.91M ops/s   | -112K    | -1.9%    |
| 1024B  | 5.58M ops/s    | 5.54M ops/s   | -35K     | -0.6%    |
| Overall| 5.82M ops/s    | 5.84M ops/s   | +20K     | +0.3%    |

Analysis:
=========
✅ SUCCESS: Clean layer separation (Small-Mid ↔ Tiny coexist)
✅ SUCCESS: Minimal overhead (±0.3% = measurement noise)
❌ FAIL: No performance gain (target was 2-4x)

Root Cause:
-----------
- Delegation overhead = TLS savings (net gain ≈ 0 instructions)
- Small-Mid TLS alloc: ~3-5 instructions
- Tiny backend delegation: ~3-5 instructions
- Header conversion: ~2 instructions
- No batching: 1:1 delegation to Tiny (no refill amortization)

Lessons Learned:
================
- Frontend-only approach ineffective (backend calls not reduced)
- Dedicated backend essential for meaningful improvement
- Clean separation achieved = solid foundation for Phase 17-2

Next Steps (Phase 17-2):
========================
- Dedicated Small-Mid SuperSlab backend (separate from Tiny)
- TLS batch refill (8-16 blocks per refill)
- Optimized 0xb0 header fast path (no delegation)
- Target: 12-15M ops/s (2.0-2.6x improvement)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 02:37:24 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**Box SS-Reuse（最優先）**:
 . meta->freelist優先使用（現状: bump only）
 . slab empty検出→shared_pool返却
 . 同じSuperSlab内で長く回す（page fault削減）
 . 目標: 70% page fault → 5-10%、性能 2-4x改善
-												Phase 17-1: Small-Mid Allocator - TLS Frontend Cache (結果: ±0.3%, 層分離成功)

Summary:
========
Phase 17-1 implements Small-Mid allocator as TLS frontend cache with Tiny backend delegation.
Result: Clean layer separation achieved with minimal overhead (±0.3%), but no performance gain.
Conclusion: Frontend-only approach is dead end. Phase 17-2 (dedicated backend) required for 2-3x target.

Implementation:
===============
1. Small-Mid TLS frontend (256B/512B/1KB - 3 classes)
   - TLS freelist (32/24/16 capacity)
   - Backend delegation to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)

2. Auto-adjust Tiny boundary
   - When Small-Mid ON: Tiny auto-limits to C0-C5 (0-255B)
   - When Small-Mid OFF: Tiny default C0-C7 (0-1023B)
   - Prevents routing conflict

3. Routing order fix
   - Small-Mid BEFORE Tiny (critical for proper execution)
   - Fall-through on TLS miss

Files Modified:
===============
- core/hakmem_smallmid.h/c: TLS freelist + backend delegation
- core/hakmem_tiny.c: tiny_get_max_size() auto-adjust
- core/box/hak_alloc_api.inc.h: Routing order (Small-Mid → Tiny)
- CURRENT_TASK.md: Phase 17-1 results + Phase 17-2 plan

A/B Benchmark Results:
======================
| Size   | Config A (OFF) | Config B (ON) | Delta    | % Change |
|--------|----------------|---------------|----------|----------|
| 256B   | 5.87M ops/s    | 6.06M ops/s   | +191K    | +3.3%    |
| 512B   | 6.02M ops/s    | 5.91M ops/s   | -112K    | -1.9%    |
| 1024B  | 5.58M ops/s    | 5.54M ops/s   | -35K     | -0.6%    |
| Overall| 5.82M ops/s    | 5.84M ops/s   | +20K     | +0.3%    |

Analysis:
=========
✅ SUCCESS: Clean layer separation (Small-Mid ↔ Tiny coexist)
✅ SUCCESS: Minimal overhead (±0.3% = measurement noise)
❌ FAIL: No performance gain (target was 2-4x)

Root Cause:
-----------
- Delegation overhead = TLS savings (net gain ≈ 0 instructions)
- Small-Mid TLS alloc: ~3-5 instructions
- Tiny backend delegation: ~3-5 instructions
- Header conversion: ~2 instructions
- No batching: 1:1 delegation to Tiny (no refill amortization)

Lessons Learned:
================
- Frontend-only approach ineffective (backend calls not reduced)
- Dedicated backend essential for meaningful improvement
- Clean separation achieved = solid foundation for Phase 17-2

Next Steps (Phase 17-2):
========================
- Dedicated Small-Mid SuperSlab backend (separate from Tiny)
- TLS batch refill (8-16 blocks per refill)
- Optimized 0xb0 header fast path (no delegation)
- Target: 12-15M ops/s (2.0-2.6x improvement)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 02:37:24 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**Box SS-Prewarm（次優先）**:
 . クラスごとSuperSlab事前確保
 . page faultをbenchmark開始時に集中
 . Phase 11実績: +6.4%（参考値）
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								**Box SS-HotHint（長期）**:
 . クラス別ホットSuperSlab管理
 . locality最適化（cache効率）
 . SS-Reuseとの統合
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								### 7.3 その他タスク
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+. ✅ **Phase 16/17 結果分析** - CURRENT_TASK.md記録完了
 . **C2/C3 UltraHot コード掃除** - C4/C5関連を別Box化
 . **ExternalGuard 統計自動化** - 閾値超過時レポート
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								---
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								## 8. Phase 17 実装ログ（完了）
-												Front-Direct implementation: SS→FC direct refill + SLL complete bypass

## Summary

Implemented Front-Direct architecture with complete SLL bypass:
- Direct SuperSlab → FastCache refill (1-hop, bypasses SLL)
- SLL-free allocation/free paths when Front-Direct enabled
- Legacy path sealing (SLL inline opt-in, SFC cascade ENV-only)

## New Modules

- core/refill/ss_refill_fc.h (236 lines): Standard SS→FC refill entry point
  - Remote drain → Freelist → Carve priority
  - Header restoration for C1-C6 (NOT C0/C7)
  - ENV: HAKMEM_TINY_P0_DRAIN_THRESH, HAKMEM_TINY_P0_NO_DRAIN

- core/front/fast_cache.h: FastCache (L1) type definition
- core/front/quick_slot.h: QuickSlot (L0) type definition

## Allocation Path (core/tiny_alloc_fast.inc.h)

- Added s_front_direct_alloc TLS flag (lazy ENV check)
- SLL pop guarded by: g_tls_sll_enable && !s_front_direct_alloc
- Refill dispatch:
  - Front-Direct: ss_refill_fc_fill() → fastcache_pop() (1-hop)
  - Legacy: sll_refill_batch_from_ss() → SLL → FC (2-hop, A/B only)
- SLL inline pop sealed (requires HAKMEM_TINY_INLINE_SLL=1 opt-in)

## Free Path (core/hakmem_tiny_free.inc, core/hakmem_tiny_fastcache.inc.h)

- FC priority: Try fastcache_push() first (same-thread free)
- tiny_fast_push() bypass: Returns 0 when s_front_direct_free || !g_tls_sll_enable
- Fallback: Magazine/slow path (safe, bypasses SLL)

## Legacy Sealing

- SFC cascade: Default OFF (ENV-only via HAKMEM_TINY_SFC_CASCADE=1)
- Deleted: core/hakmem_tiny_free.inc.bak, core/pool_refill_legacy.c.bak
- Documentation: ss_refill_fc_fill() promoted as CANONICAL refill entry

## ENV Controls

- HAKMEM_TINY_FRONT_DIRECT=1: Enable Front-Direct (SS→FC direct)
- HAKMEM_TINY_P0_DIRECT_FC_ALL=1: Same as above (alt name)
- HAKMEM_TINY_REFILL_BATCH=1: Enable batch refill (also enables Front-Direct)
- HAKMEM_TINY_SFC_CASCADE=1: Enable SFC cascade (default OFF)
- HAKMEM_TINY_INLINE_SLL=1: Enable inline SLL pop (default OFF, requires AGGRESSIVE_INLINE)

## Benchmarks (Front-Direct Enabled)

```bash
ENV: HAKMEM_BENCH_FAST_FRONT=1 HAKMEM_TINY_FRONT_DIRECT=1
     HAKMEM_TINY_REFILL_BATCH=1 HAKMEM_TINY_P0_DIRECT_FC_ALL=1
     HAKMEM_TINY_REFILL_COUNT_HOT=256 HAKMEM_TINY_REFILL_COUNT_MID=96
     HAKMEM_TINY_BUMP_CHUNK=256

bench_random_mixed (16-1040B random, 200K iter):
  256 slots: 1.44M ops/s (STABLE, 0 SEGV)
  128 slots: 1.44M ops/s (STABLE, 0 SEGV)

bench_fixed_size (fixed size, 200K iter):
  256B: 4.06M ops/s (has debug logs, expected >10M without logs)
  128B: Similar (debug logs affect)
```

## Verification

- TRACE_RING test (10K iter): **0 SLL events** detected ✅
- Complete SLL bypass confirmed when Front-Direct=1
- Stable execution: 200K iterations × multiple sizes, 0 SEGV

## Next Steps

- Disable debug logs in hak_alloc_api.inc.h (call_num 14250-14280 range)
- Re-benchmark with clean Release build (target: 10-15M ops/s)
- 128/256B shortcut path optimization (FC hit rate improvement)

Co-Authored-By: ChatGPT <chatgpt@openai.com>
Suggested-By: ultrathink

											
										
										
											2025-11-14 05:41:49 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								### 2025-11-16
 								- ✅ **Phase 17-1完了**: TLS Frontend + Tiny delegation
 								  - 実装: `hakmem_smallmid.h/c`, auto-adjust, routing修正
 								  - A/B結果: +0.3%（性能改善なし）
 								  - 教訓: Delegation overhead = TLS savings
-												Front-Direct implementation: SS→FC direct refill + SLL complete bypass

## Summary

Implemented Front-Direct architecture with complete SLL bypass:
- Direct SuperSlab → FastCache refill (1-hop, bypasses SLL)
- SLL-free allocation/free paths when Front-Direct enabled
- Legacy path sealing (SLL inline opt-in, SFC cascade ENV-only)

## New Modules

- core/refill/ss_refill_fc.h (236 lines): Standard SS→FC refill entry point
  - Remote drain → Freelist → Carve priority
  - Header restoration for C1-C6 (NOT C0/C7)
  - ENV: HAKMEM_TINY_P0_DRAIN_THRESH, HAKMEM_TINY_P0_NO_DRAIN

- core/front/fast_cache.h: FastCache (L1) type definition
- core/front/quick_slot.h: QuickSlot (L0) type definition

## Allocation Path (core/tiny_alloc_fast.inc.h)

- Added s_front_direct_alloc TLS flag (lazy ENV check)
- SLL pop guarded by: g_tls_sll_enable && !s_front_direct_alloc
- Refill dispatch:
  - Front-Direct: ss_refill_fc_fill() → fastcache_pop() (1-hop)
  - Legacy: sll_refill_batch_from_ss() → SLL → FC (2-hop, A/B only)
- SLL inline pop sealed (requires HAKMEM_TINY_INLINE_SLL=1 opt-in)

## Free Path (core/hakmem_tiny_free.inc, core/hakmem_tiny_fastcache.inc.h)

- FC priority: Try fastcache_push() first (same-thread free)
- tiny_fast_push() bypass: Returns 0 when s_front_direct_free || !g_tls_sll_enable
- Fallback: Magazine/slow path (safe, bypasses SLL)

## Legacy Sealing

- SFC cascade: Default OFF (ENV-only via HAKMEM_TINY_SFC_CASCADE=1)
- Deleted: core/hakmem_tiny_free.inc.bak, core/pool_refill_legacy.c.bak
- Documentation: ss_refill_fc_fill() promoted as CANONICAL refill entry

## ENV Controls

- HAKMEM_TINY_FRONT_DIRECT=1: Enable Front-Direct (SS→FC direct)
- HAKMEM_TINY_P0_DIRECT_FC_ALL=1: Same as above (alt name)
- HAKMEM_TINY_REFILL_BATCH=1: Enable batch refill (also enables Front-Direct)
- HAKMEM_TINY_SFC_CASCADE=1: Enable SFC cascade (default OFF)
- HAKMEM_TINY_INLINE_SLL=1: Enable inline SLL pop (default OFF, requires AGGRESSIVE_INLINE)

## Benchmarks (Front-Direct Enabled)

```bash
ENV: HAKMEM_BENCH_FAST_FRONT=1 HAKMEM_TINY_FRONT_DIRECT=1
     HAKMEM_TINY_REFILL_BATCH=1 HAKMEM_TINY_P0_DIRECT_FC_ALL=1
     HAKMEM_TINY_REFILL_COUNT_HOT=256 HAKMEM_TINY_REFILL_COUNT_MID=96
     HAKMEM_TINY_BUMP_CHUNK=256

bench_random_mixed (16-1040B random, 200K iter):
  256 slots: 1.44M ops/s (STABLE, 0 SEGV)
  128 slots: 1.44M ops/s (STABLE, 0 SEGV)

bench_fixed_size (fixed size, 200K iter):
  256B: 4.06M ops/s (has debug logs, expected >10M without logs)
  128B: Similar (debug logs affect)
```

## Verification

- TRACE_RING test (10K iter): **0 SLL events** detected ✅
- Complete SLL bypass confirmed when Front-Direct=1
- Stable execution: 200K iterations × multiple sizes, 0 SEGV

## Next Steps

- Disable debug logs in hak_alloc_api.inc.h (call_num 14250-14280 range)
- Re-benchmark with clean Release build (target: 10-15M ops/s)
- 128/256B shortcut path optimization (FC hit rate improvement)

Co-Authored-By: ChatGPT <chatgpt@openai.com>
Suggested-By: ultrathink

											
										
										
											2025-11-14 05:41:49 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								- ✅ **Phase 17-2完了**: Dedicated SuperSlab backend
 								  - 実装: `hakmem_smallmid_superslab.h/c`, batch refill, 0xb0 header
 								  - A/B結果: -0.9%（Phase 17-1より悪化）
 								  - 根本原因: 70% page fault（ChatGPT + perf分析）
-												Phase 15: Box Separation (partial) - Box headers completed, routing deferred

**Status**: Box FG V2 + ExternalGuard 実装完了、hak_free_at routing は Phase 14-C に revert

**Files Created**:
1. core/box/front_gate_v2.h (98 lines)
   - Ultra-fast 1-byte header classification (TINY/POOL/MIDCAND/EXTERNAL)
   - Performance: 2-5 cycles
   - Same-page guard added (防御的プログラミング)

2. core/box/external_guard_box.h (146 lines)
   - ENV-controlled mincore safety check
   - HAKMEM_EXTERNAL_GUARD_MINCORE=0/1 (default: OFF)
   - Uses __libc_free() to avoid infinite loop

**Routing**:
- hak_free_at reverted to Phase 14-C (classify_ptr-based, stable)
- Phase 15 routing caused SEGV on page-aligned pointers

**Performance**:
- Phase 14-C (mincore ON): 16.5M ops/s (stable)
- mincore: 841 calls/100K iterations
- mincore OFF: SEGV (unsafe AllocHeader deref)

**Next Steps** (deferred):
- Mid/Large/C7 registry consolidation
- AllocHeader safety validation
- ExternalGuard integration

**Recommendation**: Stick with Phase 14-C for now
- mincore overhead acceptable (~1.9ms / 100K)
- Focus on other bottlenecks (TLS SLL, SuperSlab churn)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 22:08:51 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								- ✅ **重要な発見**:
 								  - Frontend（TLS/batch refill）: OK（30%のみ）
 								  - Backend（SuperSlab確保）: ボトルネック（70% page fault）
 								  - 専用層では性能上がらない → **Tiny SuperSlab最適化が必要**
-												Phase 15: Box Separation (partial) - Box headers completed, routing deferred

**Status**: Box FG V2 + ExternalGuard 実装完了、hak_free_at routing は Phase 14-C に revert

**Files Created**:
1. core/box/front_gate_v2.h (98 lines)
   - Ultra-fast 1-byte header classification (TINY/POOL/MIDCAND/EXTERNAL)
   - Performance: 2-5 cycles
   - Same-page guard added (防御的プログラミング)

2. core/box/external_guard_box.h (146 lines)
   - ENV-controlled mincore safety check
   - HAKMEM_EXTERNAL_GUARD_MINCORE=0/1 (default: OFF)
   - Uses __libc_free() to avoid infinite loop

**Routing**:
- hak_free_at reverted to Phase 14-C (classify_ptr-based, stable)
- Phase 15 routing caused SEGV on page-aligned pointers

**Performance**:
- Phase 14-C (mincore ON): 16.5M ops/s (stable)
- mincore: 841 calls/100K iterations
- mincore OFF: SEGV (unsafe AllocHeader deref)

**Next Steps** (deferred):
- Mid/Large/C7 registry consolidation
- AllocHeader safety validation
- ExternalGuard integration

**Recommendation**: Stick with Phase 14-C for now
- mincore overhead acceptable (~1.9ms / 100K)
- Focus on other bottlenecks (TLS SLL, SuperSlab churn)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 22:08:51 +09:00
-												Phase 17-2: Small-Mid Dedicated SuperSlab Backend (実験結果: 70% page fault, 性能改善なし)

Summary:
========
Phase 17-2 implements dedicated SuperSlab backend for Small-Mid allocator (256B-1KB).
Result: No performance improvement (-0.9%), worse than Phase 17-1 (+0.3%).
Root cause: 70% page fault (ChatGPT + perf profiling).
Conclusion: Small-Mid専用層戦略は失敗。Tiny SuperSlab最適化が必要。

Implementation:
===============
1. Dedicated Small-Mid SuperSlab pool (1MB, 16 slabs/SS)
   - Separate from Tiny SuperSlab (no competition)
   - Batch refill (8-16 blocks per TLS refill)
   - Direct 0xb0 header writes (no Tiny delegation)

2. Backend architecture
   - SmallMidSuperSlab: 1MB aligned region, fast ptr→SS lookup
   - SmallMidSlabMeta: per-slab metadata (capacity/used/carved/freelist)
   - SmallMidSSHead: per-class pool with LRU tracking

3. Batch refill implementation
   - smallmid_refill_batch(): 8-16 blocks/call (vs 1 in Phase 17-1)
   - Freelist priority → bump allocation fallback
   - Auto SuperSlab expansion when exhausted

Files Added:
============
- core/hakmem_smallmid_superslab.h: SuperSlab metadata structures
- core/hakmem_smallmid_superslab.c: Backend implementation (~450 lines)

Files Modified:
===============
- core/hakmem_smallmid.c: Removed Tiny delegation, added batch refill
- Makefile: Added hakmem_smallmid_superslab.o to build
- CURRENT_TASK.md: Phase 17 完了記録 + Phase 18 計画

A/B Benchmark Results:
======================
| Size   | Phase 17-1 (ON) | Phase 17-2 (ON) | Delta    | vs Baseline |
|--------|-----------------|-----------------|----------|-------------|
| 256B   | 6.06M ops/s     | 5.84M ops/s     | -3.6%    | -4.1%       |
| 512B   | 5.91M ops/s     | 5.86M ops/s     | -0.8%    | +1.2%       |
| 1024B  | 5.54M ops/s     | 5.44M ops/s     | -1.8%    | +0.4%       |
| Avg    | 5.84M ops/s     | 5.71M ops/s     | -2.2%    | -0.9%       |

Performance Analysis (ChatGPT + perf):
======================================
✅ Frontend (TLS/batch refill): OK
   - Only 30% CPU time
   - Batch refill logic is efficient
   - Direct 0xb0 header writes work correctly

❌ Backend (SuperSlab allocation): BOTTLENECK
   - 70% CPU time in asm_exc_page_fault
   - mmap(1MB) → kernel page allocation → very slow
   - New SuperSlab allocation per benchmark run
   - No warm SuperSlab reuse (used counter never decrements)

Root Cause:
===========
Small-Mid allocates new SuperSlabs frequently:
  alloc → TLS miss → refill → new SuperSlab → mmap(1MB) → page fault (70%)

Tiny reuses warm SuperSlabs:
  alloc → TLS miss → refill → existing warm SuperSlab → no page fault

Key Finding: "70% page fault" reveals SuperSlab layer needs optimization,
NOT frontend layer (TLS/batch refill design is correct).

Lessons Learned:
================
1. ❌ Small-Mid専用層戦略は失敗 (Phase 17-1: +0.3%, Phase 17-2: -0.9%)
2. ✅ Frontend実装は成功 (30% CPU, batch refill works)
3. 🔥 70% page fault = SuperSlab allocation bottleneck
4. ✅ Tiny (6.08M ops/s) is already well-optimized, hard to beat
5. ✅ Layer separation doesn't improve performance - backend optimization needed

Next Steps (Phase 18):
======================
ChatGPT recommendation: Optimize Tiny SuperSlab (NOT Small-Mid specific layer)

Box SS-Reuse (Priority 1):
- Implement meta->freelist reuse (currently bump-only)
- Detect slab empty → return to shared_pool
- Reuse same SuperSlab for longer (reduce page faults)
- Target: 70% page fault → 5-10%, 2-4x improvement

Box SS-Prewarm (Priority 2):
- Pre-allocate SuperSlabs per class (Phase 11: +6.4%)
- Concentrate page faults at benchmark start
- Benchmark-only optimization

Small-Mid Implementation Status:
=================================
- ENV=0 by default (zero overhead, branch predictor learns)
- Complete separation from Tiny (no interference)
- Valuable as experimental record ("why dedicated layer failed")
- Can be removed later if needed (not blocking Tiny optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 03:21:13 +09:00
+								- ✅ **CURRENT_TASK.md更新**: Phase 17結果 + Phase 18計画
 								- 🎯 **次**: Phase 18 Box SS-Reuse実装（Tiny SuperSlab最適化）
-												Phase 15: Box Separation (partial) - Box headers completed, routing deferred

**Status**: Box FG V2 + ExternalGuard 実装完了、hak_free_at routing は Phase 14-C に revert

**Files Created**:
1. core/box/front_gate_v2.h (98 lines)
   - Ultra-fast 1-byte header classification (TINY/POOL/MIDCAND/EXTERNAL)
   - Performance: 2-5 cycles
   - Same-page guard added (防御的プログラミング)

2. core/box/external_guard_box.h (146 lines)
   - ENV-controlled mincore safety check
   - HAKMEM_EXTERNAL_GUARD_MINCORE=0/1 (default: OFF)
   - Uses __libc_free() to avoid infinite loop

**Routing**:
- hak_free_at reverted to Phase 14-C (classify_ptr-based, stable)
- Phase 15 routing caused SEGV on page-aligned pointers

**Performance**:
- Phase 14-C (mincore ON): 16.5M ops/s (stable)
- mincore: 841 calls/100K iterations
- mincore OFF: SEGV (unsafe AllocHeader deref)

**Next Steps** (deferred):
- Mid/Large/C7 registry consolidation
- AllocHeader safety validation
- ExternalGuard integration

**Recommendation**: Stick with Phase 14-C for now
- mincore overhead acceptable (~1.9ms / 100K)
- Focus on other bottlenecks (TLS SLL, SuperSlab churn)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 22:08:51 +09:00
-												Phase 19 & 20-1: Frontend optimization + TLS cache prewarm (+16.2% total)

Phase 19: Box FrontMetrics & Box FrontPrune (A/B testing framework)
========================================================================
- Box FrontMetrics: Per-class hit rate measurement for all frontend layers
  - Implementation: core/box/front_metrics_box.{h,c}
  - ENV: HAKMEM_TINY_FRONT_METRICS=1, HAKMEM_TINY_FRONT_DUMP=1
  - Output: CSV format per-class hit rate report

- A/B Test Results (Random Mixed 16-1040B, 500K iterations):
  | Config | Throughput | vs Baseline | C2/C3 Hit Rate |
  |--------|-----------|-------------|----------------|
  | Baseline (UH+HV2) | 10.1M ops/s | - | UH=11.7%, HV2=88.3% |
  | HeapV2 only | 11.4M ops/s | +12.9% ⭐ | HV2=99.3%, SLL=0.7% |
  | UltraHot only | 6.6M ops/s | -34.4% ❌ | UH=96.4%, SLL=94.2% |

- Key Finding: UltraHot removal improves performance by +12.9%
  - Root cause: Branch prediction miss cost > UltraHot hit rate benefit
  - UltraHot check: 88.3% cases = wasted branch → CPU confusion
  - HeapV2 alone: more predictable → better pipeline efficiency

- Default Setting Change: UltraHot default OFF
  - Production: UltraHot OFF (fastest)
  - Research: HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1 to enable
  - Code preserved (not deleted) for research/debug use

Phase 20-1: Box SS-HotPrewarm (TLS cache prewarming, +3.3%)
========================================================================
- Box SS-HotPrewarm: ENV-controlled per-class TLS cache prewarm
  - Implementation: core/box/ss_hot_prewarm_box.{h,c}
  - Default targets: C2/C3=128, C4/C5=64 (aggressive prewarm)
  - ENV: HAKMEM_TINY_PREWARM_C2, _C3, _C4, _C5, _ALL
  - Total: 384 blocks pre-allocated

- Benchmark Results (Random Mixed 256B, 500K iterations):
  | Config | Page Faults | Throughput | vs Baseline |
  |--------|-------------|------------|-------------|
  | Baseline (Prewarm OFF) | 10,399 | 15.7M ops/s | - |
  | Phase 20-1 (Prewarm ON) | 10,342 | 16.2M ops/s | +3.3% ⭐ |

  - Page fault reduction: 0.55% (expected: 50-66%, reality: minimal)
  - Performance gain: +3.3% (15.7M → 16.2M ops/s)

- Analysis:
  ❌ Page fault reduction failed:
    - User page-derived faults dominate (benchmark initialization)
    - 384 blocks prewarm = minimal impact on 10K+ total faults
    - Kernel-side cost (asm_exc_page_fault) uncontrollable from userspace

  ✅ Cache warming effect succeeded:
    - TLS SLL pre-filled → reduced initial refill cost
    - CPU cycle savings → +3.3% performance gain
    - Stability improvement: warm state from first allocation

- Decision: Keep as "light +3% box"
  - Prewarm valid: 384 blocks (C2/C3=128, C4/C5=64) preserved
  - No further aggressive scaling: RSS cost vs page fault reduction unbalanced
  - Next phase: BenchFast mode for structural upper limit measurement

Combined Performance Impact:
========================================================================
Phase 19 (HeapV2 only): +12.9% (10.1M → 11.4M ops/s)
Phase 20-1 (Prewarm ON): +3.3% (15.7M → 16.2M ops/s)
Total improvement: +16.2% vs original baseline

Files Changed:
========================================================================
Phase 19:
- core/box/front_metrics_box.{h,c} - NEW
- core/tiny_alloc_fast.inc.h - metrics + ENV gating
- PHASE19_AB_TEST_RESULTS.md - NEW (detailed A/B test report)
- PHASE19_FRONTEND_METRICS_FINDINGS.md - NEW (findings report)

Phase 20-1:
- core/box/ss_hot_prewarm_box.{h,c} - NEW
- core/box/hak_core_init.inc.h - prewarm call integration
- Makefile - ss_hot_prewarm_box.o added
- CURRENT_TASK.md - Phase 19 & 20-1 results documented

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 05:48:59 +09:00
+								---
 								## 9. Phase 19 実装ログ（完了） 🎉
 								### 2025-11-16
 								- ✅ **Phase 19-1完了**: Box FrontMetrics（観測）
 								  - 実装: `core/box/front_metrics_box.h/c`、全層にヒット率計測追加
 								  - ENV: `HAKMEM_TINY_FRONT_METRICS=1`, `HAKMEM_TINY_FRONT_DUMP=1`
 								  - 結果: CSV形式で per-class ヒット率レポート生成
 								- ✅ **Phase 19-2完了**: ベンチマークとヒット率分析
 								  - ワークロード: Random Mixed 16-1040B、50万イテレーション
 								  - **重要な発見**:
 								    - **HeapV2**: 88-99% ヒット率（主力として機能）✅
 								    - **UltraHot**: 0.2-11.7% ヒット率（ほぼ素通り）⚠️
 								    - FC/SFC: 無効化済み（0%）
 								    - TLS SLL: fallback として 0.7-2.7% のみ
 								- ✅ **Phase 19-3完了**: Box FrontPrune（診断）
 								  - 実装: ENV切り替えで層を個別ON/OFF可能
 								  - ENV: `HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1`（デフォルトOFF）
 								  - ENV: `HAKMEM_TINY_FRONT_DISABLE_HEAPV2=1`（デフォルトON）
 								- ✅ **Phase 19-4完了**: A/Bテストと最適化
 								  - **テスト結果**:
 								    | 設定 | 性能 | vs Baseline | C2/C3 ヒット率 |
 								    |------|------|-------------|----------------|
 								    | Baseline（両方ON） | 10.1M ops/s | - | UH=11.7%, HV2=88.3% |
 								    | **HeapV2のみ** | **11.4M ops/s** | **+12.9%** ⭐ | HV2=99.3%, SLL=0.7% |
 								    | UltraHotのみ | 6.6M ops/s | -34.4% ❌ | UH=96.4% (C2), SLL=94.2% (C3) |
 								  - **決定的結論**:
 								    - **UltraHot削除で性能向上** (+12.9%)
 								    - 理由: 分岐予測ミスコスト > UltraHotヒット率向上効果
 								    - UltraHotチェック: 88.3%のケースで無駄な分岐 → CPU分岐予測器を混乱
 								    - HeapV2単独の方が予測可能性が高い → 性能向上
 								- ✅ **デフォルト設定変更**: UltraHot デフォルトOFF
 								  - 本番推奨: UltraHot OFF（最速設定）
 								  - 研究用: `HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1` で有効化可能
 								  - コードは削除せず ENV切り替えで残す（研究・デバッグ用）
 								- ✅ **Phase 19 成果**:
 								  - ChatGPT先生の「観測→診断→治療」戦略が完璧に機能 🎓
 								  - 直感に反する発見（UltraHotが阻害要因）をデータで証明
 								  - A/Bテストでリスクなし確認してから最適化実施
 								  - 詳細: `PHASE19_FRONTEND_METRICS_FINDINGS.md`, `PHASE19_AB_TEST_RESULTS.md`
 								---
 								## 10. Phase 20 計画: Tiny ホットパス一本化 + BenchFast モード 🎯
 								### 目標
 								- **性能目標**: 20-30M ops/s（system malloc の 25-35%）
 								- **設計目標**: 「箱を崩さず」に達成（研究価値を保つ）
 								### Phase 20-1: HeapV2 を唯一の Tiny Front に（本命ホットパス一本化）
 								**現状認識**:
 								- C2/C3: HeapV2 が 88-99% を処理（本命）
 								- UltraHot: 0.2-11.7% しか当たらず、分岐の邪魔（削ると +12.9%）
 								- FC/SFC: 実質 OFF、TLS SLL は fallback のみ
 								**実装方針**:
 . **HeapV2 を「唯一の front」として扱う**:
 								   - C2-C5: HeapV2 → fallback だけ TLS SLL
 								   - 他層（UltraHot, FC, SFC）はホットパスから完全に外し、実験用に退避
 . **HeapV2 の中身を徹底的に薄くする**:
 								   - size→class 再計算を全部やめて、「class_idx を渡すだけ」にする
 								   - 分岐を「classごとの専用関数」かテーブルジャンプにして 1-2 本に減らす
 								   - header 書き込み・TLS stack 操作・return までを「6-8 命令の直線」に近づける
 . **期待効果**:
 								   - 現在 11M ops/s → 目標 15-20M ops/s (+35-80% 改善)
 								   - 分岐削減 + 命令直線化 → CPU パイプライン効率向上
 								**ENV制御**:
 								```bash
 								# HeapV2専用モード（Phase 20デフォルト）
 								HAKMEM_TINY_FRONT_HEAPV2_ONLY=1  # UltraHot/FC/SFC完全バイパス
 								# 旧動作（研究用）
 								HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1  # Phase 19設定
 								```
 								---
 								### Phase 20-2: BenchFast モードで安全コストを外す
 								**現状認識**:
 								- `hak_free_at` / `classify_ptr` / ExternalGuard / mincore など、
 								  「LD_PRELOAD / 外部ライブラリから守る」層が、
 								  ベンチでは「絶対に hakmem だけを使っている」前提の上に乗っている
 								**実装方針**:
 . **ベンチ用完全信頼モード**（Box BenchFast）:
 								   - alloc/free ともに:
 								     - header 1バイト で Tiny を即判定
 								     - Pool/Mid/L25/ExternalGuard/registry を完全にバイパス
 								   - 変なポインタが来たら壊れていい（ベンチ用なので）
 . **ENV制御**:
 								   ```bash
 								   HAKMEM_BENCH_FAST_MODE=1  # 安全コスト全外し
 								   ```
 . **目的**:
 								   - 「箱全部乗せ版」と「安全コスト全外し版」の差を測る
 								   - 「設計そのものの限界」と「安全・汎用性のコスト」の内訳を見る
 								   - mimalloc と同じくらい「危ないモード」で、どこまで近づけるかを研究
 . **期待効果**:
 								   - HeapV2専用モード: 15-20M ops/s
 								   - BenchFast追加: 25-30M ops/s (+65-100% vs 現状)
 								   - system malloc (90M ops/s) の 28-33% に到達
 								---
 								### Phase 20-3: SuperSlab ホットセット チューニング
 								**現状認識**:
 								- SS-Reuse: 再利用率 98.8%、新規 mmap 1.2% → page fault は抑えられている
 								- とはいえ perf ではまだ `asm_exc_page_fault` がでかく見える場面もある
 								**実装方針**:
 . **Box SS-HotSet**（どのクラスが何枚をホットに持つか計測）:
 								   - クラスごとの「ホット SuperSlab 数」を 1-2 枚に抑えるように class_hints をチューニング
 								   - precharge (`HAKMEM_TINY_SS_PRECHARGE_Cn`) を使って、「最初から 2 枚だけ温める」戦略を試す
 . **Box SS-Compact**（ホットセット圧縮）:
 								   - 同じ SuperSlab に複数のホットクラスを詰め込む（Phase 12 の発展）
 								   - 例: C2/C3 を同じ SuperSlab に配置 → キャッシュ効率向上
 . **期待効果**:
 								   - page fault さらに削減 → +10-20% 性能向上
 								   - 既存の SS-Reuse/Cache 設計を、「Tiny front が見ているサイズ帯に合わせて細かく調整」
 								---
 								### Phase 20 実装順序
 . **Phase 20-1**: HeapV2 専用モード実装（優先度: 高）
 								   - 期待: +35-80% (11M → 15-20M ops/s)
 								   - 工数: 中（既存 HeapV2 をスリム化）
 . **Phase 20-2**: BenchFast モード実装（優先度: 中）
 								   - 期待: +65-100% (11M → 25-30M ops/s)
 								   - 工数: 中（安全層バイパス）
 . **Phase 20-3**: SS-HotSet チューニング（優先度: 低）
 								   - 期待: +10-20% 追加改善
 								   - 工数: 小（パラメータ調整 + 計測箱追加）
 								---
 								### Phase 20 成功条件
 								- ✅ Tiny 固定サイズで 20-30M ops/s 達成（system の 25-35%）
 								- ✅ 「箱を崩さず」達成（研究箱としての価値を保つ）
 								- ✅ ENV切り替えで「安全モード」「ベンチモード」を選べる状態を維持
 								- ✅ 残りの差（system との 2.5-3x）は「kernel/page fault + mimalloc の極端な inlining」と言える根拠を固める
 								---
 								### Phase 20 後の展望
 								ここまで行けたら:
 								- 「残りの差は kernel/page fault + mimalloc の極端な inlining・OS依存の差」だと自信を持って言える
 								- hakmem の「研究箱」としての価値（構造をいじりやすい / 可視化しやすい）を保ったまま、
 								  性能面でも「そこそこ実用に耐える」ラインに乗る
 								- 学術論文・技術ブログでの発表材料が揃う
 								---
 								## 11. Phase 20-1 実装ログ: Box SS-HotPrewarm（TLS Cache 事前確保） ✅
 								### 2025-11-16
 								#### 実装内容
 								- ✅ **Box SS-HotPrewarm 作成**: ENV制御の per-class TLS cache prewarm
 								  - 実装: `core/box/ss_hot_prewarm_box.h/c`
 								  - デフォルト targets: C2/C3=128, C4/C5=64（aggressive prewarm）
 								  - ENV制御: `HAKMEM_TINY_PREWARM_C2`, `_C3`, `_C4`, `_C5`, `_ALL`
 								- ✅ **初期化統合**: `hak_init_impl()` から自動呼び出し
 								  - 384 ブロック事前確保（C2=128, C3=128, C4=64, C5=64）
 								  - `box_prewarm_tls()` API 使用（安全な carve-push）
 								#### ベンチマーク結果（500K iterations, 256B random mixed）
 								| 設定 | Page Faults | Throughput | vs Baseline |
 								|------|-------------|------------|-------------|
 								| **Baseline** (Prewarm OFF) | 10,399 | 15.7M ops/s | - |
 								| **Phase 20-1** (Prewarm ON) | 10,342 | 16.2M ops/s | **+3.3%** ⭐ |
 								- **Page fault 削減**: 0.55%（期待: 50-66% → 現実: ほぼなし）
 								- **性能向上**: +3.3%（15.7M → 16.2M ops/s）
 								#### 分析と結論
 								**❌ Page Fault 削減の失敗理由**:
 . **ユーザーページ由来が支配的**: ベンチマーク自体の初期化・データ構造確保による page fault が大半
 . **SuperSlab 事前確保の限界**: 384 ブロック程度の prewarm では、ベンチマーク全体の page fault (10K+) に対して微々たる影響しかない
 . **カーネル側のコスト**: `asm_exc_page_fault` はユーザー空間だけでは制御不可能
 								**✅ Cache Warming 効果**:
 . **TLS SLL 事前充填**: 初期の refill コスト削減
 . **CPU サイクル節約**: +3.3% の性能向上
 . **安定性向上**: 初期状態が warm → 最初のアロケーションから高速
 								#### 決定: 「軽い +3% 箱」として確定
 								- **prewarm は有効**: 384 ブロック確保（C2/C3=128, C4/C5=64）のまま残す
 								- **これ以上の aggressive 化は不要**: RSS 消費増 vs page fault 削減効果が見合わない
 								- **次フェーズへ**: BenchFast モードで「上限性能」を測定し、構造的限界を把握
 								#### 変更ファイル
 								- `core/box/ss_hot_prewarm_box.h` - NEW
 								- `core/box/ss_hot_prewarm_box.c` - NEW
 								- `core/box/hak_core_init.inc.h` - prewarm 呼び出し追加
 								- `Makefile` - `ss_hot_prewarm_box.o` 追加
 								---
 								**Status**: Phase 20-1 完了 ✅ → **Phase 20-2 準備中** 🎯
 								**Next**: BenchFast モード実装（安全コスト全外し → 構造的上限測定）
-												Phase 20-2: BenchFast mode - Structural bottleneck analysis (+4.5% ceiling)

## Summary
Implemented BenchFast mode to measure HAKMEM's structural performance ceiling
by removing ALL safety costs. Result: +4.5% improvement reveals safety mechanisms
are NOT the bottleneck - 95% of the performance gap is structural.

## Critical Discovery: Safety Costs ≠ Bottleneck

**BenchFast Performance** (500K iterations, 256B fixed-size):
- Baseline (normal):     54.4M ops/s (53.3% of System malloc)
- BenchFast (no safety): 56.9M ops/s (55.7% of System malloc) **+4.5%**
- System malloc:        102.1M ops/s (100%)

**Key Finding**: Removing classify_ptr, Pool/Mid routing, registry, mincore,
and ExternalGuard yields only +4.5% improvement. This proves these safety
mechanisms account for <5% of total overhead.

**Real Bottleneck** (estimated 75% of overhead):
- SuperSlab metadata access (~35% CPU)
- TLS SLL pointer chasing (~25% CPU)
- Refill + carving logic (~15% CPU)

## Implementation Details

**BenchFast Bypass Strategy**:
- Alloc: size → class_idx → TLS SLL pop → write header (6-8 instructions)
- Free: read header → BASE pointer → TLS SLL push (3-5 instructions)
- Bypasses: classify_ptr, Pool/Mid routing, registry, mincore, refill

**Recursion Fix** (User's "C案" - Prealloc Pool):
1. bench_fast_init() pre-allocates 50K blocks per class using normal path
2. bench_fast_init_in_progress guard prevents BenchFast during init
3. bench_fast_alloc() pop-only (NO REFILL) during benchmark

**Files**:
- core/box/bench_fast_box.{h,c}: Ultra-minimal alloc/free + prealloc pool
- core/box/hak_wrappers.inc.h: malloc wrapper with init guard check
- Makefile: bench_fast_box.o integration
- CURRENT_TASK.md: Phase 20-2 results documentation

**Activation**:
export HAKMEM_BENCH_FAST_MODE=1
./bench_fixed_size_hakmem 500000 256 128

## Implications for Future Work

**Incremental Optimization Ceiling Confirmed**:
- Phase 9-11 lesson reinforced: symptom relief ≠ root cause fix
- Safety costs: 4.5% (removable via BenchFast)
- Structural bottleneck: 95.5% (requires Phase 12 redesign)

**Phase 12 Shared SuperSlab Pool Priority**:
- 877 SuperSlab → 100-200 (reduce metadata footprint)
- Dynamic slab sharing (mimalloc-style)
- Expected: 70-90M ops/s (70-90% of System malloc)

**Bottleneck Breakdown**:
| Component              | CPU Time | BenchFast Removed? |
|------------------------|----------|-------------------|
| SuperSlab metadata     | ~35%     | ❌ Structural     |
| TLS SLL pointer chase  | ~25%     | ❌ Structural     |
| Refill + carving       | ~15%     | ❌ Structural     |
| classify_ptr/registry  | ~10%     | ✅ Removed        |
| Pool/Mid routing       | ~5%      | ✅ Removed        |
| mincore/guards         | ~5%      | ✅ Removed        |

**Conclusion**: Structural bottleneck (75%) >> Safety costs (20%)

## Phase 20 Complete
- Phase 20-1: SS-HotPrewarm (+3.3% from cache warming)
- Phase 20-2: BenchFast mode (proved safety costs = 4.5%)
- **Total Phase 20 improvement**: +7.8% (Phase 19 baseline → BenchFast)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 06:36:02 +09:00
 								---
 								## Phase 20-2: BenchFast Mode Implementation (2025-11-16) ✅
 								**Status**: ✅ **COMPLETE** - Recursion fixed via prealloc pool + init guard
 								**Goal**: Measure HAKMEM's structural performance ceiling by removing ALL safety costs
 								**Implementation**: Complete (core/box/bench_fast_box.{h,c})
 								### Design Philosophy
 								BenchFast mode bypasses all safety mechanisms to measure the theoretical maximum throughput:
 								**Alloc path** (6-8 instructions):
 								- size → class_idx → TLS SLL pop → write header → return USER pointer
 								- Bypasses: classify_ptr, Pool/Mid routing, registry, refill logic
 								**Free path** (3-5 instructions):
 								- Read header → BASE pointer → TLS SLL push
 								- Bypasses: registry lookup, mincore, ExternalGuard, capacity checks
 								### Implementation Details
 								**Files Created**:
 								- `core/box/bench_fast_box.h` - ENV-gated API with recursion guard
 								- `core/box/bench_fast_box.c` - Ultra-minimal alloc/free + prealloc pool
 								**Integration**:
 								- `core/box/hak_wrappers.inc.h` - malloc()/free() wrappers with BenchFast bypass
 								- `bench_random_mixed.c` - bench_fast_init() call before benchmark loop
 								- `Makefile` - bench_fast_box.o added to all object lists
 								**Activation**:
 								```bash
 								export HAKMEM_BENCH_FAST_MODE=1
 								./bench_fixed_size_hakmem 500000 256 128
 								```
 								### Recursion Fix: Prealloc Pool Strategy
 								**Problem**: When TLS SLL is empty, bench_fast_alloc() → hak_alloc_at() → malloc() → infinite loop
 								**Solution** (User's "C案"):
 . **Prealloc pool**: bench_fast_init() pre-allocates 50K blocks per class using normal path
 . **Init guard**: `bench_fast_init_in_progress` flag prevents BenchFast during init
 . **Pop-only alloc**: bench_fast_alloc() only pops from pool, NO REFILL
 								**Key Fix** (User's contribution):
 								```c
 								// core/box/bench_fast_box.h
 								extern __thread int bench_fast_init_in_progress;
 								// core/box/hak_wrappers.inc.h (malloc wrapper)
 								if (__builtin_expect(!bench_fast_init_in_progress && bench_fast_enabled(), 0)) {
 								    return bench_fast_alloc(size);  // Only activate AFTER init complete
 								}
 								```
 								### Performance Results (500K iterations, 256B fixed-size)
 								| Mode | Throughput | vs Baseline | vs System |
 								|------|------------|-------------|-----------|
 								| **Baseline** (通常) | 54.4M ops/s | - | 53.3% |
 								| **BenchFast** (安全コスト除去) | 56.9M ops/s | **+4.5%** | 55.7% |
 								| **System malloc** | 102.1M ops/s | +87.6% | 100% |
 								### 🔍 Critical Discovery: Safety Costs Are NOT the Bottleneck
 								**BenchFast で安全コストをすべて除去しても、わずか +4.5% しか改善しない！**
 								**What this reveals**:
 								- classify_ptr、Pool/Mid routing、registry、mincore、ExternalGuard → これらは**ボトルネックではない**
 								- 本当のボトルネックは**構造的な部分**：
 								  - SuperSlab 設計（1 SS = 1 class 固定）
 								  - メタデータアクセスパターン（cache miss 多発）
 								  - TLS SLL 効率（pointer chasing overhead）
 								  - 877 SuperSlab 生成による巨大なメタデータフットプリント
 								**System malloc との差**:
 								- Baseline: 47.7M ops/s 遅い（-46.7%）
 								- BenchFast でも 45.2M ops/s 遅い（-44.3%）
 								- → 安全コスト除去しても差は **たった 2.5M ops/s しか縮まらない**
 								### Implications for Future Work
 								**増分最適化の限界**:
 								- Phase 9-11 で学んだ教訓を確認：症状の緩和では埋まらない
 								- 安全コストは全体の 4.5% しか占めていない
 								- 残り 95.5% は**構造的なボトルネック**
 								**Phase 12 Shared SuperSlab Pool の重要性**:
 								- 877 SuperSlab → 100-200 に削減
 								- メタデータフットプリント削減 → cache miss 削減
 								- 動的 slab 共有 → 使用効率向上
 								- 期待性能: 70-90M ops/s（System の 70-90%）
 								### Bottleneck Breakdown (推定)
 								| コンポーネント | CPU 時間 | BenchFast で除去? |
 								|---------------|----------|------------------|
 								| SuperSlab metadata access | ~35% | ❌ 構造的 |
 								| TLS SLL pointer chasing | ~25% | ❌ 構造的 |
 								| Refill + carving | ~15% | ❌ 構造的 |
 								| classify_ptr + registry | ~10% | ✅ 除去済み |
 								| Pool/Mid routing | ~5% | ✅ 除去済み |
 								| mincore + guards | ~5% | ✅ 除去済み |
 								| その他 | ~5% | - |
 								**結論**: 構造的ボトルネック（75%）>> 安全コスト（20%）
 								**Next Steps**:
 								- Phase 12: Shared SuperSlab Pool（本質的解決）
 								- 877 SuperSlab → 100-200 に削減して cache miss を大幅削減
 								- 期待性能: 70-90M ops/s（System の 70-90%）
 								**Phase 20 完了**: BenchFast モードで「安全コストは 4.5%」と証明 ✅
-												Phase 21 戦略: Hot Path Cache Optimization (HPCO) - 構造的ボトルネック攻略

## Summary
Phase 20-2 BenchFast の結果を踏まえ、Phase 21 の実装戦略を策定。
安全コストは 4.5% のみ、残り 60% CPU（メタアクセス 35% + ポインタチェイス 25%）
が真のボトルネックと判明。アクセスパターン最適化で 75-82M ops/s を目指す。

## Phase 20-2 の重要な発見

**BenchFast 実験結果**:
- 安全コスト除去（classify_ptr/Pool routing/registry/mincore/guards）= **+4.5%**
- System malloc との差 45M ops/s = **箱の積み方そのもの**

**支配的ボトルネック** (60% CPU):
- メタアクセス: ~35% (SuperSlab/TinySlabMeta の複数フィールド読み書き)
- ポインタチェイス: ~25% (TLS SLL の next ポインタたどり)
- carve/refill: ~15% (batch carving + metadata updates)

## Phase 21 戦略（ChatGPT 先生フィードバック反映済み）

### Phase 21-1: Array-Based TLS Cache (C2/C3) 🔴 最優先
**狙い**: TLS SLL のポインタチェイス削減 → +15-20%
**方法**: Ring buffer (初期 128 slots, ENV で A/B 64/128/256)
**階層化**: Ring (L0) → SLL (L1) → SuperSlab (L2)
**期待**: 54.4M → 62-65M ops/s

### Phase 21-2: Hot Slab Direct Index 🟡 中優先度
**狙い**: SuperSlab → slab ループ削減 → +10-15%
**方法**: g_hot_slab[class_idx] で直接インデックス
**期待**: 62-65M → 70-75M ops/s

### Phase 21-3: Minimal Meta Access (C2/C3) 🟢 低優先度
**狙い**: 触るフィールド削減 → +5-10%
**方法**: アクセスパターン限定（used/freelist のみ）
**期待**: 70-75M → 75-82M ops/s

## 実装方針

**ChatGPT 先生のフィードバック**:
1. Ring → SLL → SuperSlab の階層を明確に
2. Ring サイズは 128/64 から ENV で A/B
3. struct 分離は後回し（型分岐コスト vs 効果）
4. Phase 21 → Phase 12 の順で問題なし

**実装リスク**: 低
- C2/C3 のみ変更（他クラスは SLL のまま）
- 既存構造を大きく変えない
- ENV で A/B テスト可能

**注意点**:
- Ring と SLL の境界を明確に
- shared_pool / SS-Reuse との整合
- 型分岐が増えすぎないように

## 次のステップ

1. Task 先生に既存 front layer 構造調査を依頼
2. C2/C3 の現在の alloc/free パス理解
3. UltraHot との関係整理（競合 or 階層化？）
4. Ring cache の最適統合ポイント特定
5. Phase 21-1 実装開始

🎯 Target: System malloc の 73-80% (75-82M ops/s)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 07:12:42 +09:00
+								---
 								## Phase 21: Hot Path Cache Optimization (HPCO) - 構造的ボトルネック攻略 🎯
 								**Status**: 🚧 **PLANNING** (ChatGPT先生のフィードバック反映済み)
 								**Goal**: アクセスパターン最適化で 60% CPU（メタアクセス 35% + ポインタチェイス 25%）を直接攻撃
 								**Target**: 75-82M ops/s（System malloc の 73-80%）
 								### Phase 20-2 で判明した構造的ボトルネック
 								**BenchFast の結論**:
 								- 安全コスト（classify_ptr/Pool routing/registry/mincore/guards）= **4.5%** しかない
 								- 残り 45M ops/s の差 = **箱の積み方そのもの**
 								**支配的ボトルネック** (60% CPU):
 								```
 								メタアクセス:      ~35% (SuperSlab/TinySlabMeta の複数フィールド読み書き)
 								ポインタチェイス:  ~25% (TLS SLL の next ポインタたどり)
 								carve/refill:      ~15% (batch carving + metadata updates)
 								```
 								**1 回の alloc/free で発生すること**:
 								- 何段も構造体を跨ぐ（TLS → SuperSlab → SlabMeta → freelist）
 								- ポインタを何回もたどる（SLL の next チェイン）
 								- メタデータを何フィールドも触る（used/capacity/carved/freelist/...）
 								### Phase 21 戦略（ChatGPT先生フィードバック反映）
 								#### Phase 21-1: Array-Based TLS Cache (C2/C3) 🔴 最優先
 								**狙い**: TLS SLL のポインタチェイス削減 → +15-20%
 								**現状の問題**:
 								```c
 								// TLS SLL (linked list) - 3 メモリアクセス、うち 1 回は cache miss
 								void* ptr = g_tls_sll_head[class_idx];  // 1. ヘッド読み込み
 								void* next = *(void**)ptr;              // 2. next ポインタ読み込み (cache miss!)
 								g_tls_sll_head[class_idx] = next;       // 3. ヘッド更新
 								```
 								**解決策: Ring Buffer**:
 								```c
 								// Box 21-1: Array-based hot cache (C2/C3 only)
 								typedef struct {
 								    void* slots[128];  // 初期サイズ 128（ENV で A/B: 64/128/256）
 								    uint16_t head;     // pop index
 								    uint16_t tail;     // push index
 								} TlsRingCache;
 								static __thread TlsRingCache g_hot_cache_c2;
 								static __thread TlsRingCache g_hot_cache_c3;
 								// Ultra-fast alloc (1-2 命令)
 								void* ptr = g_hot_cache_c2.slots[g_hot_cache_c2.head++ & 0x7F];  // ring wrap
 								```
 								**階層化** (ChatGPT先生フィードバック):
 								```
 								Ring → SLL → SuperSlab
 								  ↑      ↑       ↑
 								 L0     L1      L2
 								- alloc: Ring → 空なら SLL → 空なら SuperSlab
 								- free:  Ring → 満杯なら SLL
 								- drain: SLL → Ring に昇格（一方向）
 								```
 								**効果**:
 								- ポインタチェイス: 1 回 → **0 回**
 								- メモリアクセス: 3 → **2 回**
 								- cache locality: 配列は連続メモリ
 								- **期待: +15-20%** (54.4M → 62-65M ops/s)
 								**ENV 変数**:
 								```bash
 								HAKMEM_TINY_HOT_RING_C2=128   # C2 Ring サイズ (default: 128)
 								HAKMEM_TINY_HOT_RING_C3=128   # C3 Ring サイズ (default: 128)
 								HAKMEM_TINY_HOT_RING_ENABLE=1 # Ring cache 有効化
 								```
 								**実装ポイント** (ChatGPT先生):
 								- Ring サイズは 64/128/256 で A/B テスト
 								- C0/C1/C4/C5/C6/C7 は SLL のまま（使用頻度低い）
 								- drain 時: SLL → Ring への昇格（一方向）
 								- Ring が空 → SLL fallback → SuperSlab refill
 								#### Phase 21-2: Hot Slab Direct Index 🟡 中優先度
 								**狙い**: SuperSlab → slab ループ削減 → +10-15%
 								**現状の問題**:
 								```c
 								// 毎回 32 slab をスキャン
 								SuperSlab* ss = g_tls_slabs[class_idx].ss;
 								for (int i = 0; i < 32; i++) {  // ← ループ！
 								    TinySlabMeta* meta = &ss->slabs[i];
 								    if (meta->freelist != NULL) { ... }
 								}
 								```
 								**解決策: Hot Slab Cache**:
 								```c
 								// Box 21-2: Direct index to hot slab
 								static __thread TinySlabMeta* g_hot_slab[TINY_NUM_CLASSES];
 								void refill_from_hot_slab(int class_idx) {
 								    TinySlabMeta* hot = g_hot_slab[class_idx];
 								    // Hot slab が空なら更新
 								    if (!hot || hot->freelist == NULL) {
 								        hot = find_nonempty_slab(class_idx);  // 1回だけ探索
 								        g_hot_slab[class_idx] = hot;          // cache!
 								    }
 								    pop_batch_from_freelist(hot, ...);  // no loop!
 								}
 								```
 								**効果**:
 								- SuperSlab → slab ループ: 削除
 								- メタアクセス: 32 回 → **1 回**
 								- **期待: +10-15%** (62-65M → 70-75M ops/s)
 								**実装ポイント** (ChatGPT先生):
 								- Hot slab が EMPTY → find_nonempty_slab で差し替え
 								- free 時: hot slab に返す or freelist に戻す（ポリシー決める）
 								- shared_pool / SS-Reuse との整合性確保
 								#### Phase 21-3: Minimal Meta Access (C2/C3) 🟢 低優先度
 								**狙い**: 触るフィールド削減 → +5-10%
 								**現状の問題**:
 								```c
 								// 1 alloc/free で 4-5 フィールド触る
 								typedef struct {
 								    uint16_t used;      // ✅ 必須
 								    uint16_t capacity;  // ❌ compile-time 定数化できる
 								    uint16_t carved;    // ❌ C2/C3 では使わない
 								    void* freelist;     // ✅ 必須
 								} TinySlabMeta;
 								```
 								**解決策: アクセスパターン限定** (ChatGPT先生):
 								```c
 								// struct を分けなくてもOK（型分岐を避ける）
 								// C2/C3 コードパスで触るのを used/freelist だけに限定
 								#define C2_CAPACITY 64  // compile-time 定数
 								static inline int c2_can_alloc(TinySlabMeta* meta) {
 								    return meta->used < C2_CAPACITY;  // capacity フィールド不要！
 								}
 								```
 								**効果**:
 								- 触るフィールド: 4-5 → **2 個** (used/freelist のみ)
 								- cache line 消費: 削減
 								- **期待: +5-10%** (70-75M → 75-82M ops/s)
 								**実装ポイント** (ChatGPT先生):
 								- struct 分離は後回し（型分岐コスト vs 効果のトレードオフ）
 								- アクセスパターン限定だけでも cache 効果あり
 								- Phase 21-1/2 の結果を見てから判断
 								### Phase 21 実装順序
 								```
 								Phase 21-1 (Array-based TLS Cache C2/C3):
 								  ↓ +15-20% → 62-65M ops/s
 								Phase 21-2 (Hot Slab Direct Index):
 								  ↓ +10-15% → 70-75M ops/s
 								Phase 21-3 (Minimal Meta Access):
 								  ↓ +5-10%  → 75-82M ops/s
 								  ↓
 								🎯 Target: System malloc の 73-80%
 								```
 								**Phase 12 (SuperSlab 共有) は後回し**:
 								- Phase 21 で 80M ops/s 到達後、残り 20M ops/s を Phase 12 で詰める
 								### ChatGPT先生フィードバック（重要）
 . **Box 21-1 (Ring cache)**: ✅ perf 的にドンピシャ
 								   - Ring → SLL → SuperSlab の階層を明確に
 								   - Ring サイズは 128/64 から ENV で A/B
 								   - drain 時: SLL → Ring への昇格（一方向）
 . **Box 21-2 (Hot slab)**: ✅ 有効だが扱いに注意
 								   - hot slab が EMPTY 時の差し替えロジック
 								   - shared_pool / SS-Reuse との整合性
 . **Box 21-3 (Minimal meta)**: ⚠️ 後回しでOK
 								   - struct 分離は型分岐コスト増
 								   - アクセスパターン限定だけで効果あり
 								   - 21-1/2 の結果を見てから判断
 . **Phase 12 との順番**: ✅ 合理的
 								   - アクセスパターン > SuperSlab 数
 								   - Phase 21 → Phase 12 の順で問題なし
 								### 実装リスク
 								**低リスク**:
 								- C2/C3 のみ変更（他クラスは SLL のまま）
 								- 既存構造を大きく変えない
 								- ENV で A/B テスト可能
 								**注意点**:
 								- Ring と SLL の境界を明確に
 								- shared_pool / SS-Reuse との整合
 								- 型分岐が増えすぎないように
 								### 次のアクション
 								**Phase 21-1 実装開始**:
 . `core/box/hot_ring_cache_box.{h,c}` 作成
 . C2/C3 専用 TlsRingCache 実装
 . Ring → SLL → SuperSlab 階層化
 . ENV: `HAKMEM_TINY_HOT_RING_ENABLE=1`
 . ベンチマーク: 目標 62-65M ops/s (+15-20%)
 								---