# CURRENT TASK (Phase 14–17 Snapshot) – Tiny / Mid / ExternalGuard / Small-Mid

**Last Updated**: 2025-11-16
**Owner**: ChatGPT → Phase 17 in progress: Claude Code
**Size**: about 300 lines (condensed context version for Claude)

---

## 1. Overall status (what is done so far)

- Tiny (0–1023B)
  - NEW 3-layer front (bump / small_mag / slow) is stable.
  - TinyHeapV2: the "alloc front + statistics" part is implemented, but in practice **C2/C3 are delegated to UltraHot**.
  - TinyUltraHot (Phase 14):
    - L0 ultra-fast path dedicated to C2/C3 (16B/32B), using the Stealing model.
    - +16–36% on the fixed-size benchmarks, hit rate ≈ 100%.
  - Box separation (Phase 15):
    - Fixed the free wrapper forwarding even external pointers to `hak_free_at`.
    - Reorganized into two tiers: BenchMeta (slots etc.) → `__libc_free` directly; CoreAlloc (Tiny/Mid) → `hak_free_at`.

- Mid / PoolTLS (1KB–32KB)
  - PoolTLS phase complete (Mid-Large MT benchmark)
    - ~10.6M ops/s (faster than system malloc in some configurations).
    - Lock contention (futex 68%) greatly reduced with a lock-free MPSC queue + bind box.
  - GAP fix (Tiny up to 1023B / Mid from 1KB):
    - With `TINY_MAX_SIZE=1023` / `MID_MIN_SIZE=1024`, the 1KB–8KB band that no allocator handled is gone.

- Shared SuperSlab Pool (Phase 12 – SP-SLOT Box)
  - One SuperSlab shared by multiple classes + SLOT_UNUSED/ACTIVE/EMPTY tracking.
  - SuperSlab count: 877 → 72 (-92%), mmap/munmap: -48%, throughput: +131%.
  - Lock contention work implemented through P0-5 (Stage 2 lock-free claiming).

- ExternalGuard (Phase 15)
  - The last box: it handles UNKNOWN pointers (those not claimed by Tiny/Pool/Mid/L25/registry).
  - Behavior:
    - `hak_super_lookup` and every other lookup miss → check the page with mincore → as a rule, do not free and treat the pointer as a leak (safety first).
  - After the Phase 15 fix:
    - BenchMeta pointers no longer reach CoreAlloc, so the number of UNKNOWN calls dropped sharply.
    - The `mincore` CPU cost shrank to an almost negligible level in the benchmarks.

---

## 2. Current Tiny performance (as of Phase 14–15)

### 2.1 Fixed-size Tiny benchmark (HAKMEM vs System)

`bench_fixed_size_hakmem` / `bench_fixed_size_system` (workset=128, roughly 500K iterations)

| Size   | HAKMEM (Phase 15) | System malloc | Ratio    |
|--------|-------------------|---------------|----------|
| 128B   | ~16.6M ops/s      | ~90M ops/s    | ~18.5%   |
| 256B   | ~16.2M ops/s      | ~89.6M ops/s  | ~18.1%   |
| 512B   | ~15.0M ops/s      | ~90M ops/s    | ~16.6%   |
| 1024B  | ~15.1M ops/s      | ~90M ops/s    | ~16.8%   |

Status:
- Crashes are fully resolved (stable even on long 500K-iteration runs with workset=64/128).
- The combination of Tiny UltraHot + learning layer + ExternalGuard is OK in terms of correctness.
- Performance is ~16–18% of system (about 5–6× slower) → there is still a lot of headroom.

### 2.2 C2/C3 UltraHot dedicated benchmark

Fixed size (100K iterations, workset=128)

| Size | Baseline (UltraHot OFF) | UltraHot ON | Improvement | Hit Rate |
|------|-------------------------|-------------|-------------|----------|
| 16B  | ~40.4M ops/s            | ~55.0M      | +36.2% 🚀    | ≈100%    |
| 32B  | ~43.5M ops/s            | ~50.6M      | +16.3% 🚀    | ≈100%    |

Random Mixed 256B:
- Baseline: ~8.96M ops/s
- UltraHot ON: ~8.81M ops/s (-1.6%; between noise and a slight regression)
- Reason: C2/C3 account for only 1–2% of the total → the UltraHot benefit is diluted in the average.

Conclusion:
- C2/C3 UltraHot is a **production-grade Box for its target classes**.
- For other workloads it stays within "almost no effect (only a slight branch overhead)".
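
The dilution argument above can be checked with an Amdahl-style estimate. This is only an illustrative sketch: it assumes every operation costs the same time at baseline and that only the C2/C3 fraction gets faster. `diluted_speedup` and `print_dilution_estimate` are hypothetical helpers, not HAKMEM functions.

```c
#include <stdio.h>

/* Overall speedup when only a fraction `f` of the operations gets faster
 * by a factor `s` (Amdahl-style estimate).
 * Assumption: all operations cost the same time at baseline. */
double diluted_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

/* Prints the estimate for the documented numbers: C2/C3 at 1-2% of the
 * operations, with the +36.2% fixed-size gain used as the local speedup. */
void print_dilution_estimate(void) {
    printf("f=1%%: %.4fx\n", diluted_speedup(0.01, 1.362));
    printf("f=2%%: %.4fx\n", diluted_speedup(0.02, 1.362));
}
```

Under these assumptions the best-case overall gain stays well below 1%, so a result within about ±2% of the baseline is what one would expect for Random Mixed.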

---

## 3. Phase 15: ExternalGuard / Domain separation results

### 3.1 Previous problem

- The free wrapper (`core/box/hak_wrappers.inc.h`):
  - Forwarded every `free(ptr)` to `hak_free_at(ptr, …)` without checking whether HAKMEM owned the pointer.
  - As a result:
    - The benchmark's internal `slots` (e.g. the 2KB from `calloc(256, sizeof(void*))`) also flowed into CoreAlloc.
    - `classify_ptr` → UNKNOWN → ExternalGuard → mincore → judged "do not free, leak".
  - Benchmark observations:
    - About 0.84% leak (BenchMeta kept leaking).
    - `mincore` consumed ~13% of the Tiny benchmark CPU.

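### 3.2 Fix (Phase 15)

- Free wrapper:
  - Added a lightweight domain check:
    - Safely read the Tiny/Pool header magic and send only pointers that may be HAKMEM-owned to `hak_free_at`.
    - All other pointers (BenchMeta/external) go to `__libc_free`.
- ExternalGuard:
  - Explicitly changed the policy for UNKNOWN pointers to "do not free (leak)".
  - Emits diagnostic logs only when debugging, via `HAKMEM_EXTERNAL_GUARD_LOG=1`.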
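
A minimal sketch of such a domain check, under stated assumptions: this document only mentions the header values 0xa0 (Tiny) and 0xb0 (Small-Mid), so the 1-byte header position, the high-nibble mask, the 4KB page size, and every `sketch_`/`SKETCH_` name below are hypothetical. The Pool magic is not given here and is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants for this sketch (not taken from the HAKMEM source). */
#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_MAGIC_MASK  0xf0u
#define SKETCH_TINY_MAGIC  0xa0u

typedef enum {
    DOMAIN_EXTERNAL = 0,         /* hand to __libc_free          */
    DOMAIN_HAKMEM_CANDIDATE = 1  /* hand to hak_free_at          */
} sketch_domain_t;

/* Decide where a pointer passed to free() should go. */
sketch_domain_t sketch_classify_for_free(const void *ptr) {
    uintptr_t addr = (uintptr_t)ptr;
    if (ptr == NULL) return DOMAIN_EXTERNAL;
    /* Assumed layout: a 1-byte header sits just before the user pointer.
     * For a page-aligned pointer that byte lives on the previous page,
     * which may be unmapped, so the sketch refuses to read it. */
    if ((addr & (SKETCH_PAGE_SIZE - 1u)) == 0) return DOMAIN_EXTERNAL;
    uint8_t header = *((const uint8_t *)ptr - 1);
    if ((header & SKETCH_MAGIC_MASK) == SKETCH_TINY_MAGIC) {
        return DOMAIN_HAKMEM_CANDIDATE;
    }
    return DOMAIN_EXTERNAL;
}
```

The check only says "may be HAKMEM-owned": an external block can carry a matching byte by chance, so the callee still has to validate the pointer.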

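### 3.3 Results

- Leak rate:
  - 100K iter: 840 leaks → 0.84%
  - 500K iter: ~4200 leaks → 0.84%
  - Confirmed that almost all of them are BenchMeta / external pointers, not leaks on the CoreAlloc side.
- Performance:
  - 256B fixed:
    - Before: 15.9M ops/s
    - After:  16.2M ops/s (+1.9%) → the domain check overhead is minor; throughput even rose slightly.
- Stability:
  - All sizes (128/256/512/1024B) complete 500K iter (no crashes).
  - "Dangerous frees" that reach ExternalGuard are contained as leaks.

**Key points:**  
Box boundary violations (BenchMeta → CoreAlloc inflow) are almost completely eliminated.  
The mincore / ExternalGuard cost in the benchmarks is now within an acceptable range.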

---

## 4. Phase 16: Dynamic Tiny/Mid Boundary A/B Testing (completed 2025-11-16)

### 4.1 Implementation

Added the ability to adjust the Tiny/Mid boundary dynamically through an ENV variable:
- `HAKMEM_TINY_MAX_CLASS=7` (default): Tiny handles 0-1023B
- `HAKMEM_TINY_MAX_CLASS=5` (experimental): Tiny handles only 0-255B

Implementation files:
- `hakmem_tiny.h/c`: `tiny_get_max_size()` - reads the ENV variable and maps class → size
- `hakmem_mid_mt.h/c`: `mid_get_min_size()` - dynamic boundary adjustment (prevents a size gap)
- `hak_alloc_api.inc.h`: replaced the static TINY_MAX_SIZE with a dynamic call
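
A rough sketch of how the ENV-driven boundary could be derived. It is not the actual `tiny_get_max_size()` / `mid_get_min_size()` code: the `sketch_` functions are hypothetical, only the two documented settings (7 and 5) are mapped, and treating every other value as the default is an assumption.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_TINY_MAX_DEFAULT 1023u  /* class 7: Tiny serves 0-1023B */
#define SKETCH_TINY_MAX_CLASS5   255u  /* class 5: Tiny serves 0-255B  */

/* Map the raw value of HAKMEM_TINY_MAX_CLASS to Tiny's upper bound.
 * Unset, empty, malformed, or undocumented values fall back to the default. */
size_t sketch_tiny_max_size_from(const char *env_value) {
    if (env_value == NULL || *env_value == '\0') return SKETCH_TINY_MAX_DEFAULT;
    char *end = NULL;
    long cls = strtol(env_value, &end, 10);
    if (*end != '\0') return SKETCH_TINY_MAX_DEFAULT;  /* trailing junk */
    return (cls == 5) ? SKETCH_TINY_MAX_CLASS5 : SKETCH_TINY_MAX_DEFAULT;
}

/* Mid starts right above Tiny so that no size falls between the two boxes. */
size_t sketch_mid_min_size_from(const char *env_value) {
    return sketch_tiny_max_size_from(env_value) + 1;
}

/* Convenience wrapper that reads the real environment. */
size_t sketch_tiny_max_size(void) {
    return sketch_tiny_max_size_from(getenv("HAKMEM_TINY_MAX_CLASS"));
}
```

Deriving the Mid lower bound from the Tiny upper bound, rather than configuring it separately, is what keeps the two boxes from leaving an unserved size band.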

### 4.2 A/B Benchmark Results

| Size | Config A (C0-C7) | Config B (C0-C5) | Change |
|------|------------------|------------------|--------|
| 128B | 6.34M ops/s | 1.38M ops/s | **-78%** ❌ |
| 256B | 6.34M ops/s | 1.36M ops/s | **-79%** ❌ |
| 512B | 5.55M ops/s | 1.33M ops/s | **-76%** ❌ |
| 1024B | 5.91M ops/s | 1.37M ops/s | **-77%** ❌ |

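### 4.3 Findings and conclusion

✅ **Success**: size gap fixed (no OOM crashes)
❌ **Failure**: reducing Tiny coverage caused a large performance regression (-76% to -79%)
⚠️ **Root cause**: Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Mid is designed around 8KB pages → sending 256B-1KB requests to it reserves an 8KB page for only a few blocks
- Page fault, TLB, and metadata costs become relatively huge
- Tiny uses slab + freelist for high density → far more efficient (by an order of magnitude) at the same sizes

**Lessons (analysis by ChatGPT)**:
1. The Mid box assumes sizes "from 8KB up"
   - At 256B/512B/1024B it reserves an 8KB page for one to a few blocks → inefficient
2. The path is also longer in Mid (PoolTLS / mid registry / page management)
3. The hypothesis "shrinking Tiny and handing the work to Mid makes things lighter" does not hold with the current Mid design, which assumes 8KB and up

**Recommendation**: **keep the default HAKMEM_TINY_MAX_CLASS=7 (C0-C7)**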

---

## 5. Phase 17: Small-Mid Allocator Box (dedicated 256B-4KB layer) [in progress]

### 5.1 Goal

**Problem**: Tiny C6/C7 (512B/1KB) runs at 5.5M-5.9M ops/s → ~6% of system malloc
**Goal**: raise this to **10M-20M ops/s** with a dedicated Small-Mid layer and fill the gap between Tiny and Mid

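### 5.2 Design principles (reviewed by ChatGPT ✅)

1. **Dedicated SuperSlab separation**
   - Provide a SuperSlab pool dedicated to Small-Mid
   - Fully separated from Tiny's SuperSlabs (no contention)
   - **Avoids the Phase 12 churn problem** (most important!)

2. **Size classes**
   - Small-Mid: 256B / 512B / 1KB / 2KB / 4KB (5 classes)
   - No change on the Tiny side (C0-C5 kept)
   - Keep the increase in class count to a minimum

3. **Reused techniques**
   - Header-based fast free (proven in Phase 7)
   - TLS SLL freelist (same structure as Tiny)
   - Clear boundaries per Box theory (one-way dependencies)

4. **Boundary design**
   ```
   Tiny:      0-255B    (C0-C5, current design unchanged)
   Small-Mid: 256B-4KB  (new, fine-grained size classes)
   Mid:       8KB-32KB  (existing, efficient at page granularity)
   ```

5. **ENV control**
   - ON/OFF via `HAKMEM_SMALLMID_ENABLE=1`
   - A/B testable (default OFF)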
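
The five power-of-two classes lend themselves to a very small size → class mapping. The sketch below is only an illustration of that mapping; the `smallmid_sketch_` names and the convention of returning -1 for sizes that belong to another box are assumptions, not the contents of `core/hakmem_smallmid.h`.

```c
#include <stddef.h>

#define SMALLMID_SKETCH_MIN_SIZE     256u
#define SMALLMID_SKETCH_MAX_SIZE    4096u
#define SMALLMID_SKETCH_NUM_CLASSES    5

/* Returns the class index 0..4 for 256B/512B/1KB/2KB/4KB, or -1 when the
 * size belongs to another box (Tiny below 256B, larger boxes above 4KB). */
int smallmid_sketch_class_for(size_t size) {
    if (size < SMALLMID_SKETCH_MIN_SIZE || size > SMALLMID_SKETCH_MAX_SIZE) {
        return -1;
    }
    int cls = 0;
    size_t block = SMALLMID_SKETCH_MIN_SIZE;
    while (block < size) {   /* round up to the next power-of-two class */
        block <<= 1;
        cls++;
    }
    return cls;
}

/* Block size served by a class, or 0 for an invalid class index. */
size_t smallmid_sketch_block_size(int cls) {
    if (cls < 0 || cls >= SMALLMID_SKETCH_NUM_CLASSES) return 0;
    return (size_t)SMALLMID_SKETCH_MIN_SIZE << cls;
}
```

With only five classes the loop runs at most four times; a lookup table or a count-leading-zeros trick would remove the loop, at the cost of a less obvious sketch.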

### 5.3 Implementation steps

1. **Create the Small-Mid header** (`core/hakmem_smallmid.h`)
   - Define 5 size classes
   - TLS freelist structure
   - Fast alloc/free API

2. **Dedicated SuperSlab backend** (`core/hakmem_smallmid_superslab.c`)
   - SuperSlab pool dedicated to Small-Mid
   - Fully separated from the Tiny SuperSlabs
   - Span reservation / release logic

3. **Fast alloc/free path** (`core/smallmid_alloc_fast.inc.h`)
   - Header-based fast free (reused from Phase 7)
   - TLS SLL pop/push (same as Tiny)
   - Bump allocation fallback

4. **Routing integration** (`hak_alloc_api.inc.h`)
   ```c
   if (size <= 255)          → Tiny
   else if (size <= 4096)    → Small-Mid  // NEW!
   else if (size <= 32768)   → Mid
   else                      → ACE / mmap
   ```

5. **A/B benchmark**
   - Config A: Small-Mid OFF (current)
   - Config B: Small-Mid ON (new implementation)
   - Compare at 256B / 512B / 1KB / 2KB / 4KB
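
The routing pseudocode in step 4 can be written as compilable C along the following lines. This is a sketch, not the code in `hak_alloc_api.inc.h`: the enum, the function name, and the `smallmid_enabled` parameter (standing in for `HAKMEM_SMALLMID_ENABLE=1`) are hypothetical, and the OFF branch assumes the default Tiny range of 0-1023B.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    ROUTE_TINY,
    ROUTE_SMALLMID,
    ROUTE_MID,
    ROUTE_LARGE      /* ACE / mmap */
} sketch_route_t;

/* Pick the box that serves a request of `size` bytes. */
sketch_route_t sketch_route_for(size_t size, bool smallmid_enabled) {
    if (smallmid_enabled) {
        if (size <= 255)  return ROUTE_TINY;      /* Tiny limited to C0-C5 */
        if (size <= 4096) return ROUTE_SMALLMID;  /* new layer             */
    } else {
        if (size <= 1023) return ROUTE_TINY;      /* default Tiny range    */
    }
    if (size <= 32768)    return ROUTE_MID;
    return ROUTE_LARGE;
}
```

Keeping both configurations in one function makes the A/B comparison in step 5 a matter of flipping a single flag.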

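### 5.4 Concerns and mitigations (raised by ChatGPT)

❌ **Concern 1**: contention from SuperSlab sharing
- **Mitigation**: a boundary design in which Small-Mid reserves its own spans and operates entirely within them

❌ **Concern 2**: growth in the number of classes
- **Mitigation**: do not add classes on the Tiny side (C0-C5 as is); limit Small-Mid to 5 classes

❌ **Concern 3**: metadata overhead
- **Mitigation**: only TLS state + a size class array (a few KB), so the impact is minimal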

### 5.5 Expected effects

- **Performance**: 5.5M → 10-20M ops/s at 256B-1KB (target 2-4×)
- **Gap closure**: fills the space between Tiny (6M) and Mid (?)
- **Soundness under Box theory**: clear boundaries, one-way dependencies, A/B testable

### 5.6 Phase 17-1 implementation results (completed 2025-11-16)

**Strategy**: TLS Frontend Cache Only (delegating to the Tiny backend)
- Size classes: reduced from 5 to 3 (256B/512B/1KB only)
- Backend: delegates to Tiny C5/C6/C7, with header conversion (0xa0 → 0xb0)
- TLS capacity: conservative (32/24/16 blocks)

**Implementation files**:
- `core/hakmem_smallmid.h/c`: TLS freelist + backend delegation
- `core/hakmem_tiny.c`: `tiny_get_max_size()` auto-adjustment (limits Tiny to C0-C5 when Small-Mid is ON)
- `core/box/hak_alloc_api.inc.h`: places Small-Mid before Tiny (routing order)
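
To make the frontend-only strategy concrete, here is a simplified sketch of a bounded per-thread cache with the 32/24/16 capacities and of the 0xa0 → 0xb0 header rewrite. It is not the code in `core/hakmem_smallmid.c`: it uses a fixed array instead of the intrusive TLS SLL for brevity, the `sm_sketch_` names are hypothetical, and the assumption that the low nibble of the header is preserved is mine.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SM_SKETCH_CLASSES 3   /* 256B / 512B / 1KB */

/* Per-class TLS capacity from the Phase 17-1 notes: 32/24/16 blocks. */
static const unsigned sm_sketch_capacity[SM_SKETCH_CLASSES] = { 32, 24, 16 };

typedef struct {
    void    *slots[32];   /* sized for the largest capacity */
    unsigned count;
} sm_sketch_cache_t;

static _Thread_local sm_sketch_cache_t sm_sketch_tls[SM_SKETCH_CLASSES];

/* Rewrite a Tiny header (0xa0 | low bits) as a Small-Mid header
 * (0xb0 | low bits). Assumption: the low nibble is kept unchanged. */
uint8_t sm_sketch_convert_header(uint8_t tiny_header) {
    return (uint8_t)(0xb0u | (tiny_header & 0x0fu));
}

/* Cache a freed block; false means "full, delegate to the Tiny backend". */
bool sm_sketch_cache_push(int cls, void *block) {
    if (cls < 0 || cls >= SM_SKETCH_CLASSES || block == NULL) return false;
    sm_sketch_cache_t *cache = &sm_sketch_tls[cls];
    if (cache->count >= sm_sketch_capacity[cls]) return false;
    cache->slots[cache->count++] = block;
    return true;
}

/* Take a cached block; NULL means "empty, delegate to the Tiny backend". */
void *sm_sketch_cache_pop(int cls) {
    if (cls < 0 || cls >= SM_SKETCH_CLASSES) return NULL;
    sm_sketch_cache_t *cache = &sm_sketch_tls[cls];
    return cache->count ? cache->slots[--cache->count] : NULL;
}
```

Every miss in this design falls through to the Tiny backend one block at a time, which is the cost the Phase 17-1 benchmark below ends up measuring.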

### 5.7 A/B Benchmark Results（Phase 17-1）

| Size | Config A (OFF) | Config B (ON) | Change | Target met |
|------|----------------|---------------|--------|------------|
| **256B** | 5.87M ops/s | 6.06M ops/s | **+3.3%** | ❌ |
| **512B** | 6.02M ops/s | 5.91M ops/s | **-1.9%** | ❌ |
| **1024B** | 5.58M ops/s | 5.54M ops/s | **-0.6%** | ❌ |
| **Overall** | 5.82M ops/s | 5.84M ops/s | **+0.3%** | ❌ |

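### 5.8 Phase 17-1 outcomes and lessons

✅ **Successes**:
1. **Layer separation achieved** - Small-Mid and Tiny coexist cleanly
2. **Minimal overhead** - ±0.3% = within measurement noise (a clean implementation)
3. **Routing order fixed** - works correctly in the order Small-Mid → Tiny
4. **Auto-adjust** - Tiny is automatically limited to C0-C5 when Small-Mid is ON
5. **Foundation complete** - from here on, the remaining work is optimization

❌ **Failures**:
- **No performance improvement** (+0.3% is far from the 2-4× target)

**Root cause analysis**:
1. **Delegation overhead = TLS savings**
   - Small-Mid TLS alloc: ~3-5 instructions
   - Tiny backend delegation: ~3-5 instructions
   - Header conversion (0xa0 → 0xb0): ~2 instructions
   - **Net gain: ~0 instructions** (the overhead cancels the benefit)

2. **The backend is called one block at a time**
   - Small-Mid delegates to Tiny 1:1 (no batching)
   - No reduction in `hak_tiny_alloc()` / `hak_tiny_free()` calls
   - Expected: batch refills → actual: pass-through

**Lessons**:
- **The frontend-only approach has no effect** - the backend delegation cost is too high
- **A dedicated backend is required next** - a Small-Mid SuperSlab pool independent of Tiny is needed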

### 5.9 Next step: Phase 17-2 (dedicated backend)

**Strategy**: dedicated Small-Mid SuperSlab backend (fully separated from Tiny)

**Design**:
1. **Dedicated SuperSlab pool** (separate from Tiny)
   - No Tiny delegation
   - No header conversion overhead
   - Writes the 0xb0 header directly

2. **TLS refill batching**
   - Fetch 8-16 blocks per refill
   - Amortizes the SuperSlab lookup cost
   - Target: 50-70% frontend hit rate

3. **Optimized free path**
   - Read the 0xb0 header directly → push to the Small-Mid TLS
   - No backend round-trip for cached blocks
**Expected performance**:
- **Frontend hits**: 1-2 instructions (TLS pop/push)
- **Backend misses**: 5-8 instructions (batch refill)
- **Weighted average** (60% hit): 0.6×2 + 0.4×6 = 3.6, i.e. **~4 instructions**
- **Current Tiny path**: 8-12 instructions
- **Expected gain**: 50-67% reduction → **2-3x throughput** ✅
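
The estimate above is a plain weighted average, which the following sketch reproduces. The helper names are hypothetical, and equating "fewer instructions" with "proportionally higher throughput" is an assumption of the estimate, not a measured result.

```c
/* Expected instructions per operation for a cache with the given hit rate. */
double sketch_expected_cost(double hit_rate, double hit_cost, double miss_cost) {
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost;
}

/* Throughput ratio if throughput scaled inversely with instruction count
 * (an assumption; cache misses and branches are ignored). */
double sketch_throughput_ratio(double old_cost, double new_cost) {
    return old_cost / new_cost;
}
```

With a 60% hit rate, 2 instructions per hit, and 6 per miss, the expected cost is 3.6 instructions. Against the current 8-12 instructions that is a ratio of roughly 2.2 to 3.3, consistent with the 2-3x target.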

**Target metrics**:
- 256B: 5.87M → 12-15M ops/s (2.0-2.6x)
- 512B: 6.02M → 12-15M ops/s (2.0-2.5x)
- 1024B: 5.58M → 11-14M ops/s (2.0-2.5x)

**Implementation priority**:
1. Phase 17-2.1: Dedicated SuperSlab backend (separated from Tiny)
2. Phase 17-2.2: TLS batch refill (8-16 blocks)
3. Phase 17-2.3: Optimized 0xb0 header fast path
4. Phase 17-2.4: Benchmark validation (target: 12-18M ops/s)

---

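## 6. Unmet goals and remaining issues (candidates for the next phase)

### 6.1 Tiny performance gap (stuck at ~18% of System)

Current state:
- System malloc is at the ~90M ops/s level, whereas
- HAKMEM is at ~15–16M ops/s for fixed 128–1024B (about 18%).

Cause isolation (from the investigations so far):
- The front path length (UltraHot/TinyHeapV2/TLS SLL) has already been shortened considerably.
- L1 dcache misses / instructions / branches were cut substantially in Phase 14, but
  - Tiny still carries the whole 0–1023B range, and
  - 512/1024B in particular may be adding to the metadata load on the Superslab/Pool side.

Candidates:
- **Being implemented in Phase 17!** Design a Small-Mid Box (dedicated to 256B–4KB) that separates out the range between Tiny and Mid.
  - See § 5 (Phase 17) for details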

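### 6.2 Extending or cleaning up UltraHot/TinyHeapV2

- C2/C3 UltraHot is a success (for 16/32B).
- The attempt to extend it to C4/C5 (Phase 14-B):
  - Improved the fixed-size benchmarks.
  - In Random Mixed, `shared_pool_acquire_slab()` ballooned to 47.5%, a major regression.
  - Cause: a "stealing cascade" that upsets the balance of the Superslab/TLS inventory.

Policy:
- Return UltraHot to a **C2/C3-only Box** (C4/C5 are out of scope for now).
- If C4/C5 need optimizing, design that separately inside the SmallMid Box.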

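### 6.3 ExternalGuard statistics and automatic alerts

- Current:
  - Statistics are printed manually with `HAKMEM_EXTERNAL_GUARD_STATS=1`.
  - Only a WARNING is emitted once ExternalGuard has been called 100+ times.
- Idea:
  - Add a Box that automatically produces a short report when ExternalGuard calls exceed a threshold.
  - Example contents: top N caller addresses, size bands, mincore results, etc.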
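
A minimal sketch of the threshold trigger such a Box would need. The names are hypothetical, and firing exactly once when the counter reaches 100 is an assumption: the notes above only say that a WARNING appears after 100+ calls, not whether it repeats.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define GUARD_SKETCH_WARN_THRESHOLD 100u

static atomic_uint guard_sketch_calls;

/* Count one ExternalGuard call. Returns true exactly once, on the call
 * that reaches the threshold, so the caller reports a single time. */
bool guard_sketch_note_call(void) {
    unsigned previous = atomic_fetch_add(&guard_sketch_calls, 1u);
    return previous + 1u == GUARD_SKETCH_WARN_THRESHOLD;
}

/* Hook for the UNKNOWN-pointer path: emit the report when triggered. */
void guard_sketch_on_unknown_pointer(const void *ptr) {
    if (guard_sketch_note_call()) {
        fprintf(stderr, "[ExternalGuard sketch] WARNING: %u calls, last ptr=%p\n",
                GUARD_SKETCH_WARN_THRESHOLD, (void *)ptr);
    }
}
```

The atomic counter keeps the trigger correct when several threads reach ExternalGuard at once; the report body (top callers, size bands, mincore results) would replace the single `fprintf`.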

---

## 7. TODO for Claude Code (Phase 17-2 implementation list)

### 7.1 Phase 17-1: TLS Frontend Cache ✅ Done (2025-11-16)

1. ✅ **Header created** (`core/hakmem_smallmid.h`)
   - 3 size classes defined (256B/512B/1KB)
   - TLS freelist struct defined
   - size → class mapping function

2. ✅ **Backend delegation implemented** (`core/hakmem_smallmid.c`)
   - Delegates to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)
   - TLS SLL pop/push

3. ✅ **Auto-adjust** (`core/hakmem_tiny.c`)
   - Automatically limits Tiny to C0-C5 when Small-Mid is ON
   - Dynamic adjustment in `tiny_get_max_size()`

4. ✅ **Routing integration** (`hak_alloc_api.inc.h`)
   - Small-Mid placed before Tiny
   - ENV control: `HAKMEM_SMALLMID_ENABLE=1`

5. ✅ **A/B benchmark**
   - Config A/B run (3 runs each)
   - Result: ±0.3% (no performance improvement)
   - Lesson: frontend-only has no effect; a dedicated backend is required

### 7.2 Phase 17-2: Dedicated Backend 🚧 Next task

**Goal**: 2-3x performance improvement with a dedicated Small-Mid SuperSlab backend

1. **Dedicated SuperSlab backend** (`core/hakmem_smallmid_superslab.c`)
   - SuperSlab pool dedicated to Small-Mid (fully separated from Tiny)
   - Define the slab metadata structure
   - Span reservation / release logic

2. **TLS batch refill** (`core/smallmid_refill_box.c`)
   - Fetch 8-16 blocks per refill
   - Amortize the SuperSlab lookup cost
   - Fallback handling when a refill fails

3. **Optimized alloc/free path** (`core/hakmem_smallmid.c`)
   - Write the 0xb0 header directly (no Tiny delegation)
   - TLS hit: 1-2 instructions
   - TLS miss: batch refill (5-8 instructions)

4. **A/B benchmark**
   - Config A: Phase 17-2 OFF (currently 5.82M ops/s)
   - Config B: Phase 17-2 ON (target 12-15M ops/s)
   - Measure at 256B/512B/1KB

5. **Documentation**
   - `PHASE17_2_SMALLMID_BACKEND_DESIGN.md` - design document
   - `PHASE17_2_AB_RESULTS.md` - A/B test results

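### 7.3 Other tasks (after Phase 17-2)

1. **Detailed analysis of the Phase 16/17-1 results**
   - ✅ Done - recorded in CURRENT_TASK.md

2. **C2/C3 UltraHot code cleanup**
   - Move C4/C5-related definitions and branches behind an ENV guard or into a separate Box
   - Simplify the default configuration so that it targets only C2/C3

3. **Automating ExternalGuard statistics**
   - Automatic report when a threshold is exceeded

This CURRENT_TASK.md is only a condensed memo covering Phase 14–17.
For the detailed earlier history, see `CURRENT_TASK_FULL.md` and the individual PHASE reports.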

---

## 8. Phase 17 implementation log

### 2025-11-16 (Phase 17-1 complete)
- ✅ Phase 16 complete; A/B test results analyzed
- ✅ Reviewed ChatGPT's Small-Mid Box proposal
- ✅ Phase 17-1 implementation complete (TLS Frontend + Tiny Backend delegation)
  - `core/hakmem_smallmid.h/c`: TLS freelist + backend delegation
  - `core/hakmem_tiny.c`: Auto-adjust
  - `core/box/hak_alloc_api.inc.h`: routing order fixed
- ✅ A/B benchmark complete (result: ±0.3%, no performance improvement)
- ✅ Root cause analysis: delegation overhead = TLS savings (net gain zero)
- ✅ CURRENT_TASK.md updated (Phase 17-1 results + Phase 17-2 plan)
- 🚧 Next: start implementing the Phase 17-2 dedicated backend
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								# CURRENT TASK (Phase 14–17 Snapshot) – Tiny / Mid / ExternalGuard / Small-Mid
-												CURRENT_TASK: Registry 線形スキャン ボトルネック特定 (2025-11-05)

- perf 分析で superslab_refill が 28.51% CPU を消費
- Root cause: 262,144 エントリの線形スキャン (97.65% の hot instructions)
- 解決策: per-class registry (8×4096 = 32K entries)
- 期待効果: +200-300% (2.59M → 7.8-10.4M ops/s)
- Box Refactor は既に動いている (+463% ST, +131% MT)

次のアクション: Phase 1 実装 (per-class registry 変更)

詳細: PERF_ANALYSIS_2025_11_05.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-05 16:47:04 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								**Last Updated**: 2025-11-16
 								**Owner**: ChatGPT → Phase 17 実装中: Claude Code
 								**Size**: 約 300 行（Claude 用コンテキスト簡略版）
-												Fix: CRITICAL multi-threaded freelist/remote queue race condition

Root Cause:
===========
Freelist and remote queue contained the SAME blocks, causing use-after-free:

1. Thread A (owner): pops block X from freelist → allocates to user
2. User writes data ("ab") to block X
3. Thread B (remote): free(block X) → adds to remote queue
4. Thread A (later): drains remote queue → *(void**)block_X = chain_head
   → OVERWRITES USER DATA! 💥

The freelist pop path did NOT drain the remote queue first, so blocks could
be simultaneously in both freelist and remote queue.

Fix:
====
Add remote queue drain BEFORE freelist pop in refill path:

core/hakmem_tiny_refill_p0.inc.h:
  - Call _ss_remote_drain_to_freelist_unsafe() BEFORE trc_pop_from_freelist()
  - Add #include "superslab/superslab_inline.h"
  - This ensures freelist and remote queue are mutually exclusive

Test Results:
=============
BEFORE:
  larson_hakmem (4 threads): ❌ SEGV in seconds (freelist corruption)

AFTER:
  larson_hakmem (4 threads): ✅ 931,629 ops/s (1073 sec stable run)
  bench_random_mixed:        ✅ 1,020,163 ops/s (no crashes)

Evidence:
  - Fail-Fast logs showed next pointer corruption: 0x...6261 (ASCII "ab")
  - Single-threaded benchmarks worked (865K ops/s)
  - Multi-threaded Larson crashed immediately
  - Fix eliminates all crashes in both benchmarks

Files:
  - core/hakmem_tiny_refill_p0.inc.h: Add remote drain before freelist pop
  - CURRENT_TASK.md: Document fix details

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-08 01:35:45 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								---
 								## 1. 全体の現在地（どこまで終わっているか）
 								- Tiny (0–1023B)
 								  - NEW 3-layer front（bump / small_mag / slow）安定。
 								  - TinyHeapV2: 「alloc フロント＋統計」は実装済みだが、実運用は **C2/C3 を UltraHot に委譲**。
 								  - TinyUltraHot（Phase 14）:
 								    - C2/C3（16B/32B）専用 L0 ultra-fast path（Stealing モデル）。
 								    - 固定サイズベンチで +16〜36% 改善、hit 率 ≈ 100%。
 								  - Box 分離（Phase 15）:
 								    - free ラッパが外部ポインタまで `hak_free_at` に投げていた問題を修正。
 								    - BenchMeta（slots など）→ 直接 `__libc_free`、CoreAlloc（Tiny/Mid）→ `hak_free_at` の二段構えに整理。
 								- Mid / PoolTLS（1KB–32KB）
 								  - PoolTLS Phase 完了（Mid-Large MT ベンチ）
 								    - ~10.6M ops/s（system malloc より速い構成あり）。
 								    - lock contention（futex 68%）を lock-free MPSC + bind box で大幅削減。
 								  - GAP 修正（Tiny 1023B / Mid 1KB〜）:
 								    - `TINY_MAX_SIZE=1023` / `MID_MIN_SIZE=1024` で 1KB–8KB の「誰も扱わない帯」は解消済み。
 								- Shared SuperSlab Pool（Phase 12 – SP-SLOT Box）
 								  - 1 SuperSlab : 多 class 共有 + SLOT_UNUSED/ACTIVE/EMPTY 追跡。
 								  - SuperSlab 数: 877 → 72（-92%）、mmap/munmap: -48%、Throughput: +131%。
 								  - Lock contention P0-5 まで実装済み（Stage 2 lock-free claiming）。
 								- ExternalGuard（Phase 15）
 								  - UNKNOWN ポインタ（Tiny/Pool/Mid/L25/registry どこでも捕まらないもの）を最後の箱で扱う。
 								  - 挙動:
 								    - `hak_super_lookup` など全て miss → mincore でページ確認 → 原則「解放せず leak 扱い（安全優先）」。
 								  - Phase 15 修正で:
 								    - BenchMeta のポインタを CoreAlloc に渡さなくなり、UNKNOWN 呼び出し回数が激減。
 								    - `mincore` の CPU 負荷もベンチではほぼ無視できるレベルまで縮小。
-												Phase 15 完了: CURRENT_TASK更新 - ベンチマーク結果記録

Phase 15 Box Separation / Wrapper Domain Check 完了を記録:
- 99.29% BenchMeta 正常解放 (domain check 成功)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5倍差)

Next: Phase 16 - Tiny守備範囲最適化 (512/1024B → Mid へ移す A/B)

											
										
										
											2025-11-16 01:12:57 +09:00
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
+								---
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								## 2. Tiny 性能の現状（Phase 14–15 時点）
 								### 2.1 Fixed-size Tiny ベンチ（HAKMEM vs System）
 								`bench_fixed_size_hakmem` / `bench_fixed_size_system`（workset=128, 500K iterations 相当）
 								| Size   | HAKMEM (Phase 15) | System malloc | 比率     |
 								|--------|-------------------|---------------|----------|
 								| 128B   | ~16.6M ops/s      | ~90M ops/s    | ~18.5%   |
 								| 256B   | ~16.2M ops/s      | ~89.6M ops/s  | ~18.1%   |
 								| 512B   | ~15.0M ops/s      | ~90M ops/s    | ~16.6%   |
 								| 1024B  | ~15.1M ops/s      | ~90M ops/s    | ~16.8%   |
 								状態:
 								- クラッシュは完全解消（workset=64/128 で長尺 500K iter も安定）。
 								- Tiny UltraHot + 学習層 + ExternalGuard の組み合わせは「正しさ」は OK。
 								- 性能は system の ~16–18% レベル（約 5–6× 遅い）→ まだ大きな伸びしろあり。
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								### 2.2 C2/C3 UltraHot 専用ベンチ
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								固定サイズ（100K iterations, workset=128）
 								| Size | Baseline (UltraHot OFF) | UltraHot ON | 改善率      | Hit Rate |
 								|------|-------------------------|-------------|-------------|---------|
 								| 16B  | ~40.4M ops/s            | ~55.0M      | +36.2% 🚀    | ≈100%   |
 								| 32B  | ~43.5M ops/s            | ~50.6M      | +16.3% 🚀    | ≈100%   |
 								Random Mixed 256B：
 								- Baseline: ~8.96M ops/s
 								- UltraHot ON: ~8.81M ops/s（-1.6%、誤差〜軽微退化）
 								- 理由: C2/C3 が全体の 1–2% のみ → UltraHot のメリットが平均に薄まる。
 								結論:
 								- C2/C3 UltraHot は **ターゲットクラスに対しては実用級の Box**。
 								- 他ワークロードでは「ほぼ影響なし（わずかな分岐オーバーヘッドのみ）」の範囲に収まっている。
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
 								---
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 01:40:36 +09:00
+								## 3. Phase 15: ExternalGuard / Domain 分離の成果
 								### 3.1 以前の問題
 								- free ラッパ（`core/box/hak_wrappers.inc.h`）が:
 								  - HAKMEM 所有かチェックせず、すべての `free(ptr)` を `hak_free_at(ptr, …)` に投げていた。
 								  - その結果:
 								    - ベンチ内部 `slots`（`calloc(256, sizeof(void*))` の 2KB など）も CoreAlloc に流入。
 								    - `classify_ptr` → UNKNOWN → ExternalGuard → mincore → 「解放せず leak」と判定。
 								  - ベンチ観測:
 								    - 約 0.84% の leak（BenchMeta がどんどん漏れる）。
 								    - `mincore` が Tiny ベンチ CPU の ~13% を消費。
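### 3.2 Fixes (Phase 15)

- Free wrapper:
  - Added a lightweight domain check:
    - Safely read the Tiny/Pool header magic and send only pointers that may be HAKMEM-owned to `hak_free_at`.
    - Send all other pointers (BenchMeta / external) to `__libc_free`.
- ExternalGuard:
  - Explicitly changed the policy for UNKNOWN pointers to "do not free (leak)".
  - Emit diagnostic logs only when debugging, via `HAKMEM_EXTERNAL_GUARD_LOG=1`.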
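
The wrapper-side domain check can be sketched as follows. This is only an illustration under stated assumptions: the 1-byte header located immediately before the user pointer, the magic values, the 4096-byte page size, and the name `classify_for_free` are all hypothetical and not taken from the HAKMEM sources. The sketch skips the header read for page-aligned pointers, because the byte before such a pointer lives on the previous page, which may not be mapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All constants are illustrative assumptions, not HAKMEM's real values. */
#define SKETCH_PAGE_SIZE  4096u
#define SKETCH_MAGIC_MASK 0xF0u
#define SKETCH_MAGIC_TINY 0xA0u
#define SKETCH_MAGIC_POOL 0xB0u

static_assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1u)) == 0,
              "page size must be a power of two");

typedef enum { DOMAIN_EXTERNAL = 0, DOMAIN_HAKMEM = 1 } free_domain_t;

/* Decide which free path a pointer should take.
   DOMAIN_HAKMEM   -> candidate for hak_free_at
   DOMAIN_EXTERNAL -> hand to the libc free */
static inline free_domain_t classify_for_free(const void *ptr) {
    uintptr_t addr = (uintptr_t)ptr;
    if (addr == 0) return DOMAIN_EXTERNAL;

    /* Same-page guard: never read the byte before a page-aligned pointer. */
    if ((addr & (SKETCH_PAGE_SIZE - 1u)) == 0) return DOMAIN_EXTERNAL;

    uint8_t header = *((const uint8_t *)ptr - 1);
    uint8_t magic  = (uint8_t)(header & SKETCH_MAGIC_MASK);

    return (magic == SKETCH_MAGIC_TINY || magic == SKETCH_MAGIC_POOL)
               ? DOMAIN_HAKMEM
               : DOMAIN_EXTERNAL;
}
```

A magic match only marks the pointer as a candidate; the allocator's own lookup still has to confirm ownership, since an external block can contain the same byte by coincidence.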
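### 3.3 Results

- Leak rate:
  - 100K iter: 840 leaks → 0.84%
  - 500K iter: ~4200 leaks → 0.84%
  - Nearly all are BenchMeta / external pointers; confirmed that these are not leaks on the CoreAlloc side.
- Performance:
  - 256B fixed:
    - Before: 15.9M ops/s
    - After:  16.2M ops/s (+1.9%) → the domain check overhead is minor; throughput actually rose slightly.
- Stability:
  - All sizes (128/256/512/1024B) complete 500K iterations without crashing.
  - "Dangerous frees" that pass through ExternalGuard are contained as leaks.

**Key points:**
Box boundary violations (BenchMeta → CoreAlloc inflow) are almost completely eliminated.
The mincore / ExternalGuard cost in benchmarks is now within an acceptable range.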
-												Phase 15: Box Separation (partial) - Box headers completed, routing deferred

**Status**: Box FG V2 + ExternalGuard implemented; hak_free_at routing reverted to Phase 14-C

**Files Created**:
1. core/box/front_gate_v2.h (98 lines)
   - Ultra-fast 1-byte header classification (TINY/POOL/MIDCAND/EXTERNAL)
   - Performance: 2-5 cycles
   - Same-page guard added (defensive programming)

2. core/box/external_guard_box.h (146 lines)
   - ENV-controlled mincore safety check
   - HAKMEM_EXTERNAL_GUARD_MINCORE=0/1 (default: OFF)
   - Uses __libc_free() to avoid infinite loop

**Routing**:
- hak_free_at reverted to Phase 14-C (classify_ptr-based, stable)
- Phase 15 routing caused SEGV on page-aligned pointers

**Performance**:
- Phase 14-C (mincore ON): 16.5M ops/s (stable)
- mincore: 841 calls/100K iterations
- mincore OFF: SEGV (unsafe AllocHeader deref)

**Next Steps** (deferred):
- Mid/Large/C7 registry consolidation
- AllocHeader safety validation
- ExternalGuard integration

**Recommendation**: Stick with Phase 14-C for now
- mincore overhead acceptable (~1.9ms / 100K)
- Focus on other bottlenecks (TLS SLL, SuperSlab churn)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 22:08:51 +09:00
 								---
## 4. Phase 16: Dynamic Tiny/Mid Boundary A/B Testing (completed 2025-11-16)

### 4.1 Implementation

Added the ability to adjust the Tiny/Mid boundary dynamically through an ENV variable:

- `HAKMEM_TINY_MAX_CLASS=7` (default): Tiny handles 0-1023B
- `HAKMEM_TINY_MAX_CLASS=5` (experimental): Tiny handles only 0-255B

Implementation files:

- `hakmem_tiny.h/c`: `tiny_get_max_size()` - reads the ENV variable and maps class → size
- `hakmem_mid_mt.h/c`: `mid_get_min_size()` - dynamic boundary adjustment (prevents a size gap)
- `hak_alloc_api.inc.h`: replaced the static TINY_MAX_SIZE with the dynamic call
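
The ENV-driven boundary can be sketched as below. This is a hedged illustration rather than the actual `tiny_get_max_size()` / `mid_get_min_size()` code: the `sketch_` names are hypothetical, only the two class-to-size mappings stated above (7 → 1023B, 5 → 255B) are encoded, and the fallback to the default for any other value is an assumption. Defining the Mid minimum as one byte above the Tiny maximum is likewise an assumption, chosen because it leaves no unowned size range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_TINY_MAX_DEFAULT 1023u /* class limit 7, per the text above */
#define SKETCH_TINY_MAX_REDUCED 255u  /* class limit 5, per the text above */

/* Map the class limit to a byte limit. Only the two documented values
   are known; anything else falls back to the default (assumption). */
static inline size_t sketch_tiny_max_size_for_class(long max_class) {
    return (max_class == 5) ? SKETCH_TINY_MAX_REDUCED : SKETCH_TINY_MAX_DEFAULT;
}

/* Read HAKMEM_TINY_MAX_CLASS and return the largest size Tiny handles. */
static inline size_t sketch_tiny_get_max_size(void) {
    const char *env = getenv("HAKMEM_TINY_MAX_CLASS");
    long max_class = 7;
    if (env != NULL && *env != '\0') {
        char *end = NULL;
        long parsed = strtol(env, &end, 10);
        if (*end == '\0') max_class = parsed; /* reject trailing garbage */
    }
    return sketch_tiny_max_size_for_class(max_class);
}

/* Mid begins one byte above Tiny's limit, so no size is left unowned. */
static inline size_t sketch_mid_get_min_size(void) {
    size_t tiny_max = sketch_tiny_get_max_size();
    assert(tiny_max == SKETCH_TINY_MAX_REDUCED || tiny_max == SKETCH_TINY_MAX_DEFAULT);
    return tiny_max + 1;
}
```

A real hot path would likely cache the parsed value once instead of calling `getenv` on every allocation; the sketch re-reads it each time only to stay short.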
### 4.2 A/B Benchmark Results

| Size | Config A (C0-C7) | Config B (C0-C5) | Change |
|------|------------------|------------------|--------|
| 128B | 6.34M ops/s | 1.38M ops/s | **-78%** ❌ |
| 256B | 6.34M ops/s | 1.36M ops/s | **-79%** ❌ |
| 512B | 5.55M ops/s | 1.33M ops/s | **-76%** ❌ |
| 1024B | 5.91M ops/s | 1.37M ops/s | **-77%** ❌ |
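### 4.3 Findings and Conclusion

✅ **Success**: size gap fix complete (no OOM crashes)
❌ **Failure**: reducing Tiny coverage caused a large performance regression (-76% to -79%)
⚠️ **Root cause**: Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes

- Mid is designed around 8KB pages → sending it 256B-1KB requests reserves an 8KB page for only a few blocks
- Page fault, TLB, and metadata costs become relatively huge
- Tiny is dense thanks to slab + freelist → an order of magnitude more efficient at the same sizes

**Lessons (analysis by ChatGPT)**:

1. The Mid box assumes it serves "8KB and up"
   - At 256B/512B/1024B it reserves an 8KB page for roughly one to a few blocks → inefficient
2. The code path is also longer in Mid (PoolTLS / mid registry / page management)
3. The hypothesis "trim Tiny and hand the work to Mid to make things lighter" does not hold with the current Mid design, which assumes 8KB and up

**Recommendation**: **Keep the default HAKMEM_TINY_MAX_CLASS=7 (C0-C7)**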
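
To make the page-density argument concrete, the fraction of an 8KB page that live blocks occupy can be computed directly. This is a small arithmetic sketch; the helper name and the live-block counts used with it are hypothetical inputs, since the text above only says "one to a few blocks".

```c
#include <assert.h>
#include <stddef.h>

/* Percentage of a page occupied by live blocks of a given size. */
static inline double page_utilization_percent(size_t block_size,
                                              size_t live_blocks,
                                              size_t page_size) {
    assert(page_size > 0);
    return 100.0 * (double)(block_size * live_blocks) / (double)page_size;
}
```

Under these assumptions, a single live 256B block uses 3.125% of an 8KB page, while a fully packed slab of 32 such blocks uses 100%, which is the density gap the findings above point to.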
 								---
## 5. Phase 17: Small-Mid Allocator Box (dedicated 256B-4KB layer) [In progress]
-												Phase 15 complete: CURRENT_TASK update - benchmark results recorded

Records completion of the Phase 15 Box Separation / Wrapper Domain Check:
- 99.29% of BenchMeta freed correctly (domain check succeeded)
- 0.71% page-aligned leak (acceptable tradeoff)
- Performance: 14.9-16.6M ops/s (stable, crash-free)
- vs System malloc: 18.1% (5.5x gap)

Next: Phase 16 - Tiny coverage optimization (A/B test moving 512/1024B → Mid)

											
										
										
											2025-11-16 01:12:57 +09:00
### 5.1 Goal
**Problem**: Tiny C6/C7 (512B/1KB) runs at 5.5M-5.9M ops/s → only ~6% of system malloc
**Goal**: Raise this to **10M-20M ops/s** with a dedicated Small-Mid layer and close the gap between Tiny and Mid
### 5.2 Design Principles (reviewed by ChatGPT ✅)
1. **Dedicated SuperSlab separation**
   - Provide a SuperSlab pool dedicated to Small-Mid
   - Fully separated from Tiny's SuperSlabs (no contention)
   - **Avoids the Phase 12 churn problem** (most important!)
2. **Size classes**
   - Small-Mid: 256B / 512B / 1KB / 2KB / 4KB (5 classes)
   - No change on the Tiny side (C0-C5 kept)
   - Keep the increase in class count to a minimum
3. **Reused techniques**
   - Header-based fast free (proven in Phase 7)
   - TLS SLL freelist (same structure as Tiny)
   - Clear boundaries per Box theory (one-way dependencies)
4. **Boundary design**
   ```
   Tiny:      0-255B    (C0-C5, current design unchanged)
   Small-Mid: 256B-4KB  (new, fine-grained size classes)
   Mid:       8KB-32KB  (existing, efficient at page granularity)
   ```
5. **ENV control**
   - Toggle ON/OFF with `HAKMEM_SMALLMID_ENABLE=1`
   - Allows A/B testing (default OFF)
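
The size-class and ENV-control principles above can be sketched together. This is a minimal illustration under assumptions: the `smallmid_` function names and the class table layout are hypothetical, and treating only the exact string `1` as "enabled" is an assumed parsing rule. The class sizes and the 256B-4KB range come from the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SMALLMID_NUM_CLASSES 5

/* The five class sizes listed in the design principles. */
static const size_t smallmid_class_sizes[] = {256, 512, 1024, 2048, 4096};

static_assert(sizeof(smallmid_class_sizes) / sizeof(smallmid_class_sizes[0])
                  == SMALLMID_NUM_CLASSES,
              "class table must match SMALLMID_NUM_CLASSES");

/* Return the smallest class that fits size, or -1 if size is outside 256B-4KB. */
static inline int smallmid_class_index(size_t size) {
    if (size < 256 || size > 4096) return -1;
    for (int i = 0; i < SMALLMID_NUM_CLASSES; i++) {
        if (size <= smallmid_class_sizes[i]) return i;
    }
    return -1; /* unreachable given the range check above */
}

/* Default OFF: the layer is active only when the variable is exactly "1". */
static inline bool smallmid_enabled(void) {
    const char *env = getenv("HAKMEM_SMALLMID_ENABLE");
    return env != NULL && env[0] == '1' && env[1] == '\0';
}
```

With only five classes, a linear scan costs at most five comparisons; a bit-scan on the rounded-up size would be an alternative if the class sizes stay powers of two.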
### 5.3 Implementation Steps
1. **Create the Small-Mid header** (`core/hakmem_smallmid.h`)
   - Define the 5 size classes
   - TLS freelist structure
   - Fast alloc/free API
2. **Dedicated SuperSlab backend** (`core/hakmem_smallmid_superslab.c`)
   - SuperSlab pool dedicated to Small-Mid
   - Fully separated from Tiny SuperSlabs
   - Span reservation and release logic
3. **Fast alloc/free path** (`core/smallmid_alloc_fast.inc.h`)
   - Header-based fast free (reused from Phase 7)
   - TLS SLL pop/push (same as Tiny)
   - Bump allocation fallback
4. **Routing integration** (`hak_alloc_api.inc.h`)
   ```c
   if (size <= 255)          { /* → Tiny */ }
   else if (size <= 4096)    { /* → Small-Mid (new) */ }
   else if (size <= 32768)   { /* → Mid */ }
   else                      { /* → ACE / mmap */ }
   ```
5. **A/B benchmark**
   - Config A: Small-Mid OFF (current)
   - Config B: Small-Mid ON (new implementation)
   - Compare at 256B / 512B / 1KB / 2KB / 4KB
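
The fast path named in the steps above (TLS SLL pop/push with a bump allocation fallback) can be sketched as follows. This is an illustration under assumptions, not the planned `smallmid_alloc_fast.inc.h`: every `sketch_` name and the per-class struct layout are hypothetical, blocks are assumed to be at least pointer-sized and pointer-aligned, and the header write and slow-path refill are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_CLASSES 5

/* Per-thread, per-class state: an intrusive freelist plus a bump region. */
typedef struct {
    void    *head;     /* singly linked list; next pointer lives in the free block */
    uint8_t *bump_cur; /* next never-used byte of the reserved span */
    uint8_t *bump_end; /* one past the end of the reserved span */
} sketch_tls_class_t;

static _Thread_local sketch_tls_class_t g_sketch_tls[SKETCH_NUM_CLASSES];

static inline sketch_tls_class_t *sketch_tls_for_class(int idx) {
    assert(idx >= 0 && idx < SKETCH_NUM_CLASSES);
    return &g_sketch_tls[idx];
}

/* Free: push the block onto the thread-local list (no locks, no atomics). */
static inline void sketch_sll_push(sketch_tls_class_t *c, void *block) {
    *(void **)block = c->head;
    c->head = block;
}

/* Alloc: pop from the list first, then carve from the bump region. */
static inline void *sketch_alloc(sketch_tls_class_t *c, size_t block_size) {
    void *block = c->head;
    if (block != NULL) {
        c->head = *(void **)block;
        return block;
    }
    if (c->bump_cur != NULL && (size_t)(c->bump_end - c->bump_cur) >= block_size) {
        block = c->bump_cur;
        c->bump_cur += block_size;
        return block;
    }
    return NULL; /* a real allocator would refill from its SuperSlab backend here */
}
```

Because the list is thread-local and LIFO, the most recently freed block is handed out first, which tends to keep hot blocks in cache; frees arriving from another thread would need a separate remote path that this sketch does not cover.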
### 5.4 Concerns and Mitigations (raised by ChatGPT)
❌ **Concern 1**: Contention from SuperSlab sharing
- **Mitigation**: A boundary design in which Small-Mid reserves its own dedicated spans and operates entirely within them
❌ **Concern 2**: Growth in the number of size classes
- **Mitigation**: Do not add classes on the Tiny side (C0-C5 stay as they are); cap Small-Mid at 5 classes.
❌ **Concern 3**: Metadata overhead
- **Mitigation**: Only TLS state plus a size-class array (a few KB), so the impact is minimal.
### 5.5 Expected Benefits
- **Performance**: 256B-1KB from 5.5M to 10-20M ops/s (target: 2-4x)
- **Closing the gap**: fills the range between Tiny (6M) and Mid (?)
- **Soundness under Box theory**: clear boundaries, one-way dependencies, A/B-testable
-												Phase 17-1: Small-Mid Allocator - TLS Frontend Cache (結果: ±0.3%, 層分離成功)

Summary:
========
Phase 17-1 implements Small-Mid allocator as TLS frontend cache with Tiny backend delegation.
Result: Clean layer separation achieved with minimal overhead (±0.3%), but no performance gain.
Conclusion: Frontend-only approach is dead end. Phase 17-2 (dedicated backend) required for 2-3x target.

Implementation:
===============
1. Small-Mid TLS frontend (256B/512B/1KB - 3 classes)
   - TLS freelist (32/24/16 capacity)
   - Backend delegation to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)

2. Auto-adjust Tiny boundary
   - When Small-Mid ON: Tiny auto-limits to C0-C5 (0-255B)
   - When Small-Mid OFF: Tiny default C0-C7 (0-1023B)
   - Prevents routing conflict

3. Routing order fix
   - Small-Mid BEFORE Tiny (critical for proper execution)
   - Fall-through on TLS miss

Files Modified:
===============
- core/hakmem_smallmid.h/c: TLS freelist + backend delegation
- core/hakmem_tiny.c: tiny_get_max_size() auto-adjust
- core/box/hak_alloc_api.inc.h: Routing order (Small-Mid → Tiny)
- CURRENT_TASK.md: Phase 17-1 results + Phase 17-2 plan

A/B Benchmark Results:
======================
| Size   | Config A (OFF) | Config B (ON) | Delta    | % Change |
|--------|----------------|---------------|----------|----------|
| 256B   | 5.87M ops/s    | 6.06M ops/s   | +191K    | +3.3%    |
| 512B   | 6.02M ops/s    | 5.91M ops/s   | -112K    | -1.9%    |
| 1024B  | 5.58M ops/s    | 5.54M ops/s   | -35K     | -0.6%    |
| Overall| 5.82M ops/s    | 5.84M ops/s   | +20K     | +0.3%    |

Analysis:
=========
✅ SUCCESS: Clean layer separation (Small-Mid ↔ Tiny coexist)
✅ SUCCESS: Minimal overhead (±0.3% = measurement noise)
❌ FAIL: No performance gain (target was 2-4x)

Root Cause:
-----------
- Delegation overhead = TLS savings (net gain ≈ 0 instructions)
- Small-Mid TLS alloc: ~3-5 instructions
- Tiny backend delegation: ~3-5 instructions
- Header conversion: ~2 instructions
- No batching: 1:1 delegation to Tiny (no refill amortization)

Lessons Learned:
================
- Frontend-only approach ineffective (backend calls not reduced)
- Dedicated backend essential for meaningful improvement
- Clean separation achieved = solid foundation for Phase 17-2

Next Steps (Phase 17-2):
========================
- Dedicated Small-Mid SuperSlab backend (separate from Tiny)
- TLS batch refill (8-16 blocks per refill)
- Optimized 0xb0 header fast path (no delegation)
- Target: 12-15M ops/s (2.0-2.6x improvement)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 02:37:24 +09:00
### 5.6 Phase 17-1 Implementation Results (completed 2025-11-16)

**Strategy**: TLS frontend cache only (delegating to the Tiny backend)
- Size classes: reduced from 5 to 3 (256B/512B/1KB only)
- Backend: delegates to Tiny C5/C6/C7, with header conversion (0xa0 → 0xb0)
- TLS capacity: conservative (32/24/16 blocks)

**Files changed**:
- `core/hakmem_smallmid.h/c`: TLS freelist + backend delegation
- `core/hakmem_tiny.c`: `tiny_get_max_size()` auto-adjust (limits Tiny to C0-C5 when Small-Mid is ON)
- `core/box/hak_alloc_api.inc.h`: places Small-Mid ahead of Tiny (routing order)
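
The frontend-only strategy above can be sketched roughly as follows. This is a minimal illustration, not the code in `core/hakmem_smallmid.c`: every `sm_*` name is hypothetical, the backend is stubbed with `malloc`/`free` instead of Tiny C5/C6/C7, the size-to-class mapping ignores header bytes, and the header layout (high nibble = owner magic, low nibble = class index) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SM_NUM_CLASSES 3

/* Class sizes and TLS capacities as listed in the text (256B/512B/1KB, 32/24/16). */
static const size_t   sm_class_size[SM_NUM_CLASSES] = {256, 512, 1024};
static const unsigned sm_capacity[SM_NUM_CLASSES]   = {32, 24, 16};

typedef struct SmNode { struct SmNode *next; } SmNode;

/* Per-thread freelist heads and fill counts, one per class. */
static _Thread_local SmNode  *sm_head[SM_NUM_CLASSES];
static _Thread_local unsigned sm_count[SM_NUM_CLASSES];

/* Hypothetical size-to-class mapping; returns -1 when the size is out of range. */
static int sm_size_to_class(size_t size) {
    if (size < 256 || size > 1024) return -1;
    if (size <= 256) return 0;
    return size <= 512 ? 1 : 2;
}

/* Assumed header layout: swap the owner nibble 0xa0 -> 0xb0, keep the low nibble. */
static uint8_t sm_convert_header(uint8_t tiny_header) {
    return (uint8_t)(0xb0u | (tiny_header & 0x0fu));
}

/* Stand-ins for the Tiny backend (assumption: the real code delegates to Tiny). */
static void *sm_backend_alloc(int cls) { return malloc(sm_class_size[cls]); }
static void  sm_backend_free(void *p)  { free(p); }

static void *sm_alloc(size_t size) {
    int cls = sm_size_to_class(size);
    if (cls < 0) return NULL;                 /* not ours: caller falls through */
    SmNode *n = sm_head[cls];
    if (n != NULL) {                          /* TLS hit: pop */
        sm_head[cls] = n->next;
        sm_count[cls]--;
        return n;
    }
    return sm_backend_alloc(cls);             /* TLS miss: 1:1 delegation */
}

static void sm_free(void *p, int cls) {
    assert(cls >= 0 && cls < SM_NUM_CLASSES);
    if (sm_count[cls] < sm_capacity[cls]) {   /* room in TLS: push */
        SmNode *n = p;
        n->next = sm_head[cls];
        sm_head[cls] = n;
        sm_count[cls]++;
        return;
    }
    sm_backend_free(p);                       /* cache full: hand back to backend */
}
```

Note that every TLS miss in this shape costs one full backend call for one block, which is the pass-through behaviour the Phase 17-1 measurements point to.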

### 5.7 A/B Benchmark Results (Phase 17-1)

| Size | Config A (OFF) | Config B (ON) | Change | Target met |
|------|----------------|---------------|--------|------------|
| **256B** | 5.87M ops/s | 6.06M ops/s | **+3.3%** | ❌ |
| **512B** | 6.02M ops/s | 5.91M ops/s | **-1.9%** | ❌ |
| **1024B** | 5.58M ops/s | 5.54M ops/s | **-0.6%** | ❌ |
| **Overall** | 5.82M ops/s | 5.84M ops/s | **+0.3%** | ❌ |
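
### 5.8 Phase 17-1 Outcomes and Lessons

✅ **Successes**:
1. **Layer separation achieved** - Small-Mid and Tiny coexist cleanly
2. **Minimal overhead** - ±0.3%, within measurement noise (a clean implementation)
3. **Routing order fixed** - works correctly with Small-Mid checked before Tiny
4. **Auto-adjust** - Tiny automatically limits itself to C0-C5 when Small-Mid is ON
5. **Foundation complete** - from here on, optimization can only improve things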
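
The routing order and the auto-adjusted Tiny boundary can be illustrated with a small sketch. All `sketch_*` names are hypothetical, and the Small-Mid range of 256..1024 bytes is an assumption made for illustration; only the `HAKMEM_SMALLMID_ENABLE` variable and the 255B / 1023B Tiny limits come from the notes above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed Small-Mid range for this sketch (not confirmed by the source). */
#define SKETCH_SMALLMID_MIN 256u
#define SKETCH_SMALLMID_MAX 1024u

typedef enum { ROUTE_SMALLMID, ROUTE_TINY, ROUTE_MID_OR_LARGER } sketch_route_t;

/* Reads the A/B switch named in the text; treats a leading '1' as ON. */
static int sketch_smallmid_enabled(void) {
    const char *v = getenv("HAKMEM_SMALLMID_ENABLE");
    return v != NULL && v[0] == '1';
}

/* Tiny's upper bound: 0-255B (C0-C5) when Small-Mid is ON, 0-1023B (C0-C7) otherwise. */
static size_t sketch_tiny_max_size(int smallmid_on) {
    return smallmid_on ? 255u : 1023u;
}

static sketch_route_t sketch_route(size_t size, int smallmid_on) {
    /* Invariant behind the auto-adjust: when ON, the two ranges must not overlap. */
    assert(!smallmid_on || sketch_tiny_max_size(1) < SKETCH_SMALLMID_MIN);

    /* Small-Mid is checked before Tiny. */
    if (smallmid_on && size >= SKETCH_SMALLMID_MIN && size <= SKETCH_SMALLMID_MAX)
        return ROUTE_SMALLMID;
    if (size <= sketch_tiny_max_size(smallmid_on))
        return ROUTE_TINY;
    return ROUTE_MID_OR_LARGER;
}
```

With the boundary auto-adjusted, the two ranges are disjoint, so the check order only matters if the adjustment is ever skipped; the sketch keeps Small-Mid first to mirror the order described above.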

❌ **Failure**:
- **No performance gain** (+0.3% falls far short of the 2-4x target)

**Root cause analysis**:
1. **Delegation overhead = TLS savings**
   - Small-Mid TLS alloc: ~3-5 instructions
   - Tiny backend delegation: ~3-5 instructions
   - Header conversion (0xa0 → 0xb0): ~2 instructions
   - **Net gain: ~0 instructions** (the overhead cancels the benefit)
2. **The backend is called one block at a time**
   - Small-Mid delegates 1:1 to Tiny (no batching)
   - No reduction in `hak_tiny_alloc()` / `hak_tiny_free()` calls
   - Expected: batch refills → actual: pass-through

**Lessons**:
- **The frontend-only approach is ineffective** - backend delegation costs too much
- **A dedicated backend is required next** - Small-Mid needs its own SuperSlab pool, independent of Tiny

### 5.9 Next Step: Phase 17-2 (Dedicated Backend)

**Strategy**: a dedicated Small-Mid SuperSlab backend (fully separated from Tiny)

**Design**:
1. **Dedicated SuperSlab pool** (separate from Tiny)
   - No Tiny delegation
   - No header-conversion overhead
   - Writes the 0xb0 header directly
2. **TLS refill batching**
   - Fetch 8-16 blocks per refill
   - Amortizes the SuperSlab lookup cost
   - Target: 50-70% frontend hit rate
3. **Optimized free path**
   - Read the 0xb0 header directly → push to Small-Mid TLS
   - No backend round-trip for cached blocks
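
A batch refill along these lines might look like the sketch below. It is an illustration under stated assumptions, not the planned implementation: the `sm_*` names are hypothetical, a single 512B class is shown, the dedicated SuperSlab backend is replaced by a bump pointer over a `malloc`'d 64 KiB region, the batch size of 8 is the low end of the 8-16 range above, and header writes are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SM_BLOCK_SIZE   512u          /* one class only, for brevity */
#define SM_REFILL_BATCH 8u            /* low end of the 8-16 range in the text */
#define SM_SLAB_BYTES   (64u * 1024u) /* hypothetical slab size */

typedef struct SmBlock { struct SmBlock *next; } SmBlock;

typedef struct {
    SmBlock *head;     /* TLS freelist */
    unsigned count;    /* blocks currently cached */
    unsigned refills;  /* number of backend trips so far */
} SmTlsCache;

static _Thread_local SmTlsCache sm_tls;
static _Thread_local unsigned char *sm_bump, *sm_bump_end;

/* Stand-in for the dedicated backend: carve up to `want` blocks in one call.
 * Any tail of an exhausted slab is simply abandoned in this sketch. */
static unsigned sm_backend_carve(unsigned want, SmBlock **out) {
    unsigned got = 0;
    while (got < want) {
        if (sm_bump == NULL || (size_t)(sm_bump_end - sm_bump) < SM_BLOCK_SIZE) {
            sm_bump = malloc(SM_SLAB_BYTES);
            if (sm_bump == NULL) break;
            sm_bump_end = sm_bump + SM_SLAB_BYTES;
        }
        out[got++] = (SmBlock *)sm_bump;
        sm_bump += SM_BLOCK_SIZE;
    }
    return got;
}

static void *sm_alloc_512(void) {
    if (sm_tls.head == NULL) {                 /* miss: one backend trip, many blocks */
        SmBlock *batch[SM_REFILL_BATCH];
        unsigned n = sm_backend_carve(SM_REFILL_BATCH, batch);
        if (n == 0) return NULL;
        sm_tls.refills++;
        for (unsigned i = 0; i < n; i++) {
            batch[i]->next = sm_tls.head;
            sm_tls.head = batch[i];
        }
        sm_tls.count = n;
    }
    SmBlock *b = sm_tls.head;                  /* hit path: pop */
    sm_tls.head = b->next;
    sm_tls.count--;
    return b;
}

static void sm_free_512(void *p) {
    assert(p != NULL);
    SmBlock *b = p;                            /* push back without a backend round-trip */
    b->next = sm_tls.head;
    sm_tls.head = b;
    sm_tls.count++;
}
```

In this shape eight consecutive allocations cost a single backend trip, which is the amortization that the 1:1 delegation of Phase 17-1 lacked.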

**Expected performance**:
- **Frontend hits**: 1-2 instructions (TLS pop/push)
- **Backend misses**: 5-8 instructions (batch refill)
- **Weighted average** (60% hit): 0.6×2 + 0.4×6 = 3.6, i.e. **~4 instructions**
- **Current Tiny path**: 8-12 instructions
- **Expected gain**: 50-67% fewer instructions → **2-3x throughput** ✅
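
The arithmetic behind this estimate can be reproduced with a short sketch. The function names are hypothetical, and the ratio it computes compares instruction counts only; treating it as a throughput multiplier assumes the path is instruction-bound, which the notes do not establish.

```c
#include <assert.h>

/* Expected instructions per operation for a given frontend hit rate. */
static double sketch_expected_cost(double hit_rate, double hit_cost, double miss_cost) {
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost;
}

/* Ratio of current to projected instruction count (not a measured speedup). */
static double sketch_instruction_ratio(double current_cost, double projected_cost) {
    assert(projected_cost > 0.0);
    return current_cost / projected_cost;
}
```

With a 60% hit rate, 2 instructions per hit and 6 per miss, `sketch_expected_cost` gives 3.6. Rounding that to 4 and comparing against the 8-12 instruction Tiny path yields the 2-3x range quoted above; using 3.6 directly gives roughly 2.2-3.3x.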

**Target metrics**:
- 256B: 5.87M → 12-15M ops/s (2.0-2.6x)
- 512B: 6.02M → 12-15M ops/s (2.0-2.5x)
- 1024B: 5.58M → 11-14M ops/s (2.0-2.5x)

**Implementation priority**:
1. Phase 17-2.1: Dedicated SuperSlab backend (separated from Tiny)
2. Phase 17-2.2: TLS batch refill (8-16 blocks)
3. Phase 17-2.3: Optimized 0xb0 header fast path
4. Phase 17-2.4: Benchmark validation (target: 12-18M ops/s)
-												Docs: Document workset=128 recursion fix in CURRENT_TASK

Added section 3.3 documenting the critical infinite recursion bug fix:
- Root cause: realloc() → hak_alloc_at() → shared_pool_init() → realloc() loop
- Symptoms: workset=128 hung, workset=64 worked (size-class specific)
- Fix: Replace realloc() with system mmap() for Shared Pool metadata
- Performance: timeout → 18.5M ops/s

Commit 176bbf656

											
										
										
											2025-11-15 14:36:35 +09:00
-												Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
---
## 6. Unmet Goals and Open Issues (candidates for the next phase)
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
### 6.1 Tiny Performance Gap (stuck at ~18% of System)
Current state:
- System malloc runs at around ~90M ops/s, whereas
- HAKMEM reaches ~15–16M ops/s at fixed sizes of 128–1024B (about 18%).
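Root-cause breakdown (from the investigation so far):
- The front-end path (UltraHot/TinyHeapV2/TLS SLL) has already been shortened considerably.
- L1 dcache misses, instructions, and branches were cut substantially in Phase 14, but:
  - Tiny still handles the entire 0–1023B range, and
  - 512/1024B in particular may be adding to the metadata load on the Superslab/Pool side.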
Candidates:
- **Being implemented in Phase 17.** Design a Small-Mid Box (a dedicated box for 256B–4KB) that separates the range between Tiny and Mid.
  - See § 5 (Phase 17) for details.
### 6.2 Extend or Consolidate UltraHot/TinyHeapV2
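- C2/C3 UltraHot succeeded (for 16/32B).
- The attempt to extend it to C4/C5 (Phase 14-B):
  - Improved fixed-size results.
  - On Random Mixed, `shared_pool_acquire_slab()` ballooned to 47.5% and caused a major regression.
  - Cause: a "stealing cascade" that upsets the balance between Superslab and TLS inventory.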
-												Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-15 16:28:40 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
Policy:
- Return UltraHot to a **C2/C3-only Box** (C4/C5 are out of scope for now).
- If C4/C5 need optimizing, design that separately inside the SmallMid Box.
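
Restricting UltraHot to C2/C3 can be expressed as a per-class bit mask, in the same spirit as `HAKMEM_TINY_HEAP_V2_CLASS_MASK`. The following is a minimal sketch only: `ULTRAHOT_CLASS_MASK` and `ultrahot_class_enabled()` are hypothetical names that do not come from the hakmem sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names: nothing below is taken from the hakmem sources. */
enum { TINY_NUM_CLASSES = 8 };               /* Tiny classes C0..C7 */

/* Bit i set => class Ci may use the UltraHot path. 0x0C = C2 | C3. */
static const uint32_t ULTRAHOT_CLASS_MASK = (1u << 2) | (1u << 3);

/* Returns true only for classes whose bit is set in the mask. */
static inline bool ultrahot_class_enabled(unsigned cls)
{
    assert(cls < TINY_NUM_CLASSES);
    return ((ULTRAHOT_CLASS_MASK >> cls) & 1u) != 0;
}
```

With this mask, C4/C5 requests skip UltraHot and fall through to the regular Tiny front, which is the behavior the policy above asks for.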
### 6.3 ExternalGuard statistics and automatic alerts
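- Current:
  - `HAKMEM_EXTERNAL_GUARD_STATS=1` prints statistics on demand.
  - The only automatic behavior is a WARNING once it has been called 100+ times.
- Idea:
  - Add a Box that automatically emits a brief report once ExternalGuard calls exceed a threshold.
  - Examples: top N caller addresses, size bands, `mincore` results.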
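
The auto-report idea could look roughly like the sketch below. Everything here is hypothetical (`eg_stats`, `eg_record()`, `eg_report()`); only the 100-call threshold comes from the notes above. For brevity it is single-threaded, keeps the first eight distinct callers rather than a sorted top N, and omits size bands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names throughout; only the 100-call threshold is from the notes. */
enum { EG_REPORT_THRESHOLD = 100, EG_MAX_CALLERS = 8 };

typedef struct {
    uintptr_t caller;           /* call-site address that reached ExternalGuard */
    uint64_t  hits;
} eg_caller_slot;

typedef struct {
    uint64_t       total_calls;
    uint64_t       mapped;      /* page check said "mapped"   */
    uint64_t       unmapped;    /* page check said "unmapped" */
    eg_caller_slot callers[EG_MAX_CALLERS];
    bool           reported;    /* report is emitted at most once */
} eg_stats;

/* Record one ExternalGuard call. Returns true exactly once: on the call
 * that first reaches the threshold, so the caller can emit the report. */
static bool eg_record(eg_stats *st, uintptr_t caller, bool page_mapped)
{
    st->total_calls++;
    if (page_mapped) st->mapped++; else st->unmapped++;

    /* Slots fill in order, so a matching slot always precedes an empty one.
     * Callers beyond the table capacity are simply not tracked. */
    for (int i = 0; i < EG_MAX_CALLERS; i++) {
        eg_caller_slot *s = &st->callers[i];
        if (s->hits == 0 || s->caller == caller) {
            s->caller = caller;
            s->hits++;
            break;
        }
    }

    if (!st->reported && st->total_calls >= EG_REPORT_THRESHOLD) {
        st->reported = true;
        return true;
    }
    return false;
}

/* Print the brief report: totals first, then one line per tracked caller. */
static void eg_report(const eg_stats *st, FILE *out)
{
    fprintf(out, "[ExternalGuard] calls=%llu mapped=%llu unmapped=%llu\n",
            (unsigned long long)st->total_calls,
            (unsigned long long)st->mapped,
            (unsigned long long)st->unmapped);
    for (int i = 0; i < EG_MAX_CALLERS && st->callers[i].hits != 0; i++) {
        fprintf(out, "  caller=0x%llx hits=%llu\n",
                (unsigned long long)st->callers[i].caller,
                (unsigned long long)st->callers[i].hits);
    }
}
```

A real Box would need atomic or per-thread counters, since `free` can be called from any thread.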
---
											2025-11-15 16:28:40 +09:00
Phase 17-1: Small-Mid Allocator - TLS Frontend Cache (result: ±0.3%, layer separation succeeded)

Summary:
========
Phase 17-1 implements Small-Mid allocator as TLS frontend cache with Tiny backend delegation.
Result: Clean layer separation achieved with minimal overhead (±0.3%), but no performance gain.
Conclusion: Frontend-only approach is dead end. Phase 17-2 (dedicated backend) required for 2-3x target.

Implementation:
===============
1. Small-Mid TLS frontend (256B/512B/1KB - 3 classes)
   - TLS freelist (32/24/16 capacity)
   - Backend delegation to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)

2. Auto-adjust Tiny boundary
   - When Small-Mid ON: Tiny auto-limits to C0-C5 (0-255B)
   - When Small-Mid OFF: Tiny default C0-C7 (0-1023B)
   - Prevents routing conflict

3. Routing order fix
   - Small-Mid BEFORE Tiny (critical for proper execution)
   - Fall-through on TLS miss

Files Modified:
===============
- core/hakmem_smallmid.h/c: TLS freelist + backend delegation
- core/hakmem_tiny.c: tiny_get_max_size() auto-adjust
- core/box/hak_alloc_api.inc.h: Routing order (Small-Mid → Tiny)
- CURRENT_TASK.md: Phase 17-1 results + Phase 17-2 plan

A/B Benchmark Results:
======================
| Size   | Config A (OFF) | Config B (ON) | Delta    | % Change |
|--------|----------------|---------------|----------|----------|
| 256B   | 5.87M ops/s    | 6.06M ops/s   | +191K    | +3.3%    |
| 512B   | 6.02M ops/s    | 5.91M ops/s   | -112K    | -1.9%    |
| 1024B  | 5.58M ops/s    | 5.54M ops/s   | -35K     | -0.6%    |
| Overall| 5.82M ops/s    | 5.84M ops/s   | +20K     | +0.3%    |

Analysis:
=========
✅ SUCCESS: Clean layer separation (Small-Mid ↔ Tiny coexist)
✅ SUCCESS: Minimal overhead (±0.3% = measurement noise)
❌ FAIL: No performance gain (target was 2-4x)

Root Cause:
-----------
- Delegation overhead = TLS savings (net gain ≈ 0 instructions)
- Small-Mid TLS alloc: ~3-5 instructions
- Tiny backend delegation: ~3-5 instructions
- Header conversion: ~2 instructions
- No batching: 1:1 delegation to Tiny (no refill amortization)

Lessons Learned:
================
- Frontend-only approach ineffective (backend calls not reduced)
- Dedicated backend essential for meaningful improvement
- Clean separation achieved = solid foundation for Phase 17-2

Next Steps (Phase 17-2):
========================
- Dedicated Small-Mid SuperSlab backend (separate from Tiny)
- TLS batch refill (8-16 blocks per refill)
- Optimized 0xb0 header fast path (no delegation)
- Target: 12-15M ops/s (2.0-2.6x improvement)
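
The batch-refill step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: `sm_slab`, `sm_tls_cache`, `sm_refill()` and `sm_alloc()` are hypothetical names, the backend is reduced to a bump-pointer slab, and the batch size of 16 is simply the upper end of the 8-16 range mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types and names; the backend is reduced to a bump-pointer slab. */
typedef struct sm_block { struct sm_block *next; } sm_block;

typedef struct {
    uint8_t *cursor;        /* next uncarved byte   */
    uint8_t *end;           /* one past the slab    */
    size_t   block_size;    /* e.g. 256, 512, 1024  */
} sm_slab;

typedef struct {
    sm_block *head;         /* thread-local singly linked free list */
    uint32_t  count;
} sm_tls_cache;

enum { SM_REFILL_BATCH = 16 };   /* assumed: upper end of "8-16 blocks" */

/* Carve up to `want` blocks from the slab in one backend visit and link
 * them into the TLS list. Returns the number of blocks obtained. */
static uint32_t sm_refill(sm_tls_cache *tls, sm_slab *slab, uint32_t want)
{
    uint32_t got = 0;
    while (got < want &&
           (size_t)(slab->end - slab->cursor) >= slab->block_size) {
        sm_block *b = (sm_block *)slab->cursor;
        slab->cursor += slab->block_size;
        b->next = tls->head;
        tls->head = b;
        got++;
    }
    tls->count += got;
    return got;
}

/* Allocation fast path: pop from TLS; visit the backend only on a miss.
 * `backend_calls` is instrumentation to show the amortization. */
static void *sm_alloc(sm_tls_cache *tls, sm_slab *slab, uint64_t *backend_calls)
{
    if (tls->head == NULL) {
        (*backend_calls)++;
        if (sm_refill(tls, slab, SM_REFILL_BATCH) == 0) return NULL;
    }
    sm_block *b = tls->head;
    tls->head = b->next;
    tls->count--;
    return b;
}
```

With a batch of 16, sixty-four allocations touch the backend four times instead of sixty-four, which is the refill amortization that the 1:1 delegation in Phase 17-1 lacked.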

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-16 02:37:24 +09:00
## 7. TODO for Claude Code (Phase 17-2 implementation list)
### 7.1 Phase 17-1: TLS Frontend Cache ✅ Done (2025-11-16)
1. ✅ **Header created** (`core/hakmem_smallmid.h`)
   - Defines 3 size classes (256B/512B/1KB)
   - TLS freelist struct definition
   - size → class mapping function
2. ✅ **Backend delegation implemented** (`core/hakmem_smallmid.c`)
   - Delegates to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)
   - TLS SLL pop/push
3. ✅ **Auto-adjust** (`core/hakmem_tiny.c`)
   - Automatically limits Tiny to C0-C5 when Small-Mid is ON
   - Dynamic adjustment via `tiny_get_max_size()`
4. ✅ **Routing integration** (`hak_alloc_api.inc.h`)
   - Small-Mid is placed before Tiny
   - ENV control: `HAKMEM_SMALLMID_ENABLE=1`
5. ✅ **A/B benchmark**
   - Ran Config A/B (3 runs each)
   - Result: ±0.3% (no performance gain)
   - Lesson: a frontend-only design is ineffective; a dedicated backend is required
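
The Phase 17-1 frontend pieces listed above (size → class mapping, a capacity-bounded TLS freelist, and the header conversion) can be sketched as below. The class sizes (256B/512B/1KB), capacities (32/24/16) and the 0xa0 → 0xb0 conversion come from this document; all identifiers are hypothetical, and treating the low nibble of the header as a preserved class index is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical identifiers; sizes and capacities are the ones quoted above. */
enum { SMALLMID_NUM_CLASSES = 3 };
static const size_t   smallmid_class_size[SMALLMID_NUM_CLASSES] = { 256, 512, 1024 };
static const uint32_t smallmid_tls_cap[SMALLMID_NUM_CLASSES]    = { 32, 24, 16 };

typedef struct smallmid_node { struct smallmid_node *next; } smallmid_node;
typedef struct { smallmid_node *head; uint32_t count; } smallmid_tls_list;

/* One freelist per class, private to each thread (C11 _Thread_local). */
static _Thread_local smallmid_tls_list g_smallmid_tls[SMALLMID_NUM_CLASSES];

/* Smallest class that fits `size`, or -1 if the request is out of range.
 * Routing of sizes that Tiny still owns is left to the caller. */
static inline int smallmid_size_to_class(size_t size)
{
    if (size == 0) return -1;
    for (int c = 0; c < SMALLMID_NUM_CLASSES; c++) {
        if (size <= smallmid_class_size[c]) return c;
    }
    return -1;
}

/* Pop a cached block; NULL means a TLS miss and the caller falls through. */
static inline void *smallmid_tls_pop(int cls)
{
    smallmid_tls_list *l = &g_smallmid_tls[cls];
    smallmid_node *n = l->head;
    if (n == NULL) return NULL;
    l->head = n->next;
    l->count--;
    return n;
}

/* Push a freed block; false means the list is at capacity and the block
 * must go back to the backend instead. */
static inline bool smallmid_tls_push(int cls, void *block)
{
    smallmid_tls_list *l = &g_smallmid_tls[cls];
    if (l->count >= smallmid_tls_cap[cls]) return false;
    smallmid_node *n = (smallmid_node *)block;
    n->next = l->head;
    l->head = n;
    l->count++;
    return true;
}

/* Assumed header layout: high nibble = owner magic, low nibble preserved. */
static inline uint8_t smallmid_convert_header(uint8_t tiny_hdr)
{
    return (uint8_t)(0xb0u | (tiny_hdr & 0x0fu));
}
```

Because every pop miss still costs one call into Tiny, this layer alone cannot reduce backend traffic, which is consistent with the ±0.3% result above.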

### 7.2 Phase 17-2: Dedicated Backend 🚧 Next task

**Goal**: 2-3x performance improvement with a dedicated Small-Mid SuperSlab backend

1. **Dedicated SuperSlab backend** (`core/hakmem_smallmid_superslab.c`)
   - Dedicated Small-Mid SuperSlab pool (fully separated from Tiny)
   - Slab metadata structure definition
2025-11-16 01:40:36 +09:00
+   - Span reservation and release logic
- Phase 13-B: TinyHeapV2 supply path with dual-mode A/B framework (Stealing vs Leftover)

Summary:
- Implemented free path supply with ENV-gated A/B modes (HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE)
- Mode 0 (Stealing, default): L0 gets freed blocks first → +18% @ 32B
- Mode 1 (Leftover): L1 primary owner, L0 gets leftovers → Box-clean but -5% @ 16B
- Decision: Default to Stealing for performance (ChatGPT analysis: L0 doesn't corrupt learning layer signals)

Performance (100K iterations, workset=128):
- 16B: 43.9M → 45.6M ops/s (+3.9%)
- 32B: 41.9M → 49.6M ops/s (+18.4%) ✅
- 64B: 51.2M → 51.5M ops/s (+0.6%)
- 100% magazine hit rate (supply from free path working correctly)

Implementation:
- tiny_free_fast_v2.inc.h: Dual-mode supply (lines 134-166)
- tiny_heap_v2.h: Add tiny_heap_v2_leftover_mode() flag + rationale doc
- tiny_alloc_fast.inc.h: Alloc hook with tiny_heap_v2_alloc_by_class()
- CURRENT_TASK.md: Updated Phase 13-B status (complete) with A/B results

ENV flags:
- HAKMEM_TINY_HEAP_V2=1                      # Enable TinyHeapV2
- HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0        # Mode 0 (Stealing, default)
- HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE         # C1-C3 only (skip C0 -5% regression)
- HAKMEM_TINY_HEAP_V2_STATS=1                # Print statistics
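
As an illustration of the class mask above, `0xE` is binary `1110`, so bits 1 to 3 are set and bit 0 is clear. The helper below is a hypothetical sketch of how such a mask could be tested; it assumes one bit per class index and is not taken from the actual source.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 when bit `cls` is set in `mask`,
 * i.e. when that size class is enabled. Assumes one bit per class. */
static inline int tiny_heap_v2_class_enabled(uint32_t mask, int cls) {
    if (cls < 0 || cls >= 32) return 0;
    return (int)((mask >> cls) & 1u);
}
```

With `mask = 0xE`, the helper reports C1, C2 and C3 as enabled and C0 as disabled, matching the comment on the flag.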

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
2025-11-15 16:28:40 +09:00
- Phase 17-1: Small-Mid Allocator - TLS Frontend Cache (result: ±0.3%, layer separation succeeded)

Summary:
========
Phase 17-1 implements Small-Mid allocator as TLS frontend cache with Tiny backend delegation.
Result: Clean layer separation achieved with minimal overhead (±0.3%), but no performance gain.
Conclusion: The frontend-only approach is a dead end. Phase 17-2 (dedicated backend) is required for the 2-3x target.

Implementation:
===============
1. Small-Mid TLS frontend (256B/512B/1KB - 3 classes)
   - TLS freelist (32/24/16 capacity)
   - Backend delegation to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)

2. Auto-adjust Tiny boundary
   - When Small-Mid ON: Tiny auto-limits to C0-C5 (0-255B)
   - When Small-Mid OFF: Tiny default C0-C7 (0-1023B)
   - Prevents routing conflict

3. Routing order fix
   - Small-Mid BEFORE Tiny (critical for proper execution)
   - Fall-through on TLS miss

Files Modified:
===============
- core/hakmem_smallmid.h/c: TLS freelist + backend delegation
- core/hakmem_tiny.c: tiny_get_max_size() auto-adjust
- core/box/hak_alloc_api.inc.h: Routing order (Small-Mid → Tiny)
- CURRENT_TASK.md: Phase 17-1 results + Phase 17-2 plan

A/B Benchmark Results:
======================
| Size   | Config A (OFF) | Config B (ON) | Delta    | % Change |
|--------|----------------|---------------|----------|----------|
| 256B   | 5.87M ops/s    | 6.06M ops/s   | +191K    | +3.3%    |
| 512B   | 6.02M ops/s    | 5.91M ops/s   | -112K    | -1.9%    |
| 1024B  | 5.58M ops/s    | 5.54M ops/s   | -35K     | -0.6%    |
| Overall| 5.82M ops/s    | 5.84M ops/s   | +20K     | +0.3%    |

Analysis:
=========
✅ SUCCESS: Clean layer separation (Small-Mid ↔ Tiny coexist)
✅ SUCCESS: Minimal overhead (±0.3% = measurement noise)
❌ FAIL: No performance gain (target was 2-4x)

Root Cause:
-----------
- Delegation overhead = TLS savings (net gain ≈ 0 instructions)
- Small-Mid TLS alloc: ~3-5 instructions
- Tiny backend delegation: ~3-5 instructions
- Header conversion: ~2 instructions
- No batching: 1:1 delegation to Tiny (no refill amortization)

Lessons Learned:
================
- Frontend-only approach ineffective (backend calls not reduced)
- Dedicated backend essential for meaningful improvement
- Clean separation achieved = solid foundation for Phase 17-2

Next Steps (Phase 17-2):
========================
- Dedicated Small-Mid SuperSlab backend (separate from Tiny)
- TLS batch refill (8-16 blocks per refill)
- Optimized 0xb0 header fast path (no delegation)
- Target: 12-15M ops/s (2.0-2.6x improvement)
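
The amortization idea behind the batch refill can be sketched as follows. This is a minimal illustration, not the planned implementation: `backend_refill`, `smallmid_alloc_sketch`, `SM_BATCH` and `SM_BLOCK` are hypothetical names, and `malloc` stands in for the dedicated SuperSlab backend so the sketch can run on its own.

```c
#include <stddef.h>
#include <stdlib.h>

#define SM_BATCH 16   /* hypothetical: upper end of the 8-16 range */
#define SM_BLOCK 256  /* hypothetical block size for the demo */

static unsigned long g_backend_calls; /* counts slow-path visits */

/* Stand-in for a dedicated backend: hands out up to `want` blocks in a
 * single call. malloc is used only to keep the sketch self-contained. */
static size_t backend_refill(void **out, size_t want) {
    g_backend_calls++;
    size_t n = 0;
    while (n < want) {
        void *p = malloc(SM_BLOCK);
        if (!p) break;
        out[n++] = p;
    }
    return n;
}

static _Thread_local void *tls_cache[SM_BATCH];
static _Thread_local size_t tls_count;

/* Fast path pops from the thread-local cache; the backend is visited
 * only when the cache is empty, once per SM_BATCH allocations. */
static void *smallmid_alloc_sketch(void) {
    if (tls_count == 0) {
        tls_count = backend_refill(tls_cache, SM_BATCH);
        if (tls_count == 0) return NULL; /* refill failed: caller falls back */
    }
    return tls_cache[--tls_count];
}
```

In this sketch 32 allocations reach the backend twice, in contrast to the 1:1 delegation that Phase 17-1 identified as the root cause.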

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
2025-11-16 02:37:24 +09:00
+. **TLS batch refill** (`core/smallmid_refill_box.c`)
    - Fetch 8-16 blocks per refill
    - Amortize the SuperSlab lookup cost
    - Fallback handling when a refill fails
+. **Optimized alloc/free path** (`core/hakmem_smallmid.c`)
    - Write the 0xb0 header directly (no Tiny delegation)
    - TLS hit: 1-2 instructions
    - TLS miss: batch refill (5-8 instructions)
- Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md
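
The 3-stage order above can be sketched with a small state array. This is an assumption-laden illustration: the struct layout, `SLOTS_PER_SS`, and the function names are hypothetical, and the real SP-SLOT code works across many SuperSlabs with locking that is omitted here.

```c
#include <stddef.h>

typedef enum { SLOT_UNUSED = 0, SLOT_ACTIVE, SLOT_EMPTY } slot_state_t;

#define SLOTS_PER_SS 8 /* hypothetical slot count per SuperSlab */

typedef struct {
    slot_state_t state[SLOTS_PER_SS];
} superslab_sketch_t;

/* Returns a slot index, or -1 when the caller has to fall through to
 * stage 3 and create a new SuperSlab. */
static int sp_slot_acquire(superslab_sketch_t *ss) {
    /* Stage 1: reuse a slot that was used before and is now EMPTY. */
    for (int i = 0; i < SLOTS_PER_SS; i++) {
        if (ss->state[i] == SLOT_EMPTY) { ss->state[i] = SLOT_ACTIVE; return i; }
    }
    /* Stage 2: take a slot that has never been used. */
    for (int i = 0; i < SLOTS_PER_SS; i++) {
        if (ss->state[i] == SLOT_UNUSED) { ss->state[i] = SLOT_ACTIVE; return i; }
    }
    return -1; /* Stage 3 is left to the caller. */
}

/* Marks a slot as EMPTY so that stage 1 can pick it up again. */
static void sp_slot_release(superslab_sketch_t *ss, int i) {
    ss->state[i] = SLOT_EMPTY;
}
```

Preferring EMPTY slots keeps already-mapped memory in circulation, which is one plausible reading of how the SuperSlab count dropped from 877 to 72.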

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%
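
One common way to build such a queue is a Treiber-style push combined with a whole-list exchange on the consumer side. The sketch below uses C11 `<stdatomic.h>`; the type and function names are hypothetical and it is not the actual `pool_remote_push` code. It assumes many producer threads and a single owner thread that drains.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct remote_node {
    struct remote_node *next;
} remote_node_t;

typedef struct {
    _Atomic(remote_node_t *) head;
} remote_queue_t;

/* Any thread: push a freed block without taking a mutex. On CAS failure
 * `old` is reloaded with the current head and the link is rewritten. */
static void remote_push(remote_queue_t *q, remote_node_t *n) {
    remote_node_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &q->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Owner thread only: detach the whole list in one atomic exchange.
 * The returned list is in LIFO order (most recent push first). */
static remote_node_t *remote_drain(remote_queue_t *q) {
    return atomic_exchange_explicit(&q->head, (remote_node_t *)NULL,
                                    memory_order_acquire);
}
```

Because the consumer never pops single nodes, the usual ABA hazard of a lock-free stack does not arise in this shape, and no futex is involved on either path.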

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-14 14:18:56 +09:00
+. **A/B benchmark**
    - Config A: Phase 17-2 OFF (currently 5.82M ops/s)
    - Config B: Phase 17-2 ON (target 12-15M ops/s)
    - Measure performance at 256B/512B/1KB
+. **Documentation**
    - `PHASE17_2_SMALLMID_BACKEND_DESIGN.md` - design document
    - `PHASE17_2_AB_RESULTS.md` - A/B test results
+ ### 7.3 Other tasks (after Phase 17-2)
- Front-Direct implementation: SS→FC direct refill + SLL complete bypass

## Summary

Implemented Front-Direct architecture with complete SLL bypass:
- Direct SuperSlab → FastCache refill (1-hop, bypasses SLL)
- SLL-free allocation/free paths when Front-Direct enabled
- Legacy path sealing (SLL inline opt-in, SFC cascade ENV-only)

## New Modules

- core/refill/ss_refill_fc.h (236 lines): Standard SS→FC refill entry point
  - Remote drain → Freelist → Carve priority
  - Header restoration for C1-C6 (NOT C0/C7)
  - ENV: HAKMEM_TINY_P0_DRAIN_THRESH, HAKMEM_TINY_P0_NO_DRAIN

- core/front/fast_cache.h: FastCache (L1) type definition
- core/front/quick_slot.h: QuickSlot (L0) type definition

## Allocation Path (core/tiny_alloc_fast.inc.h)

- Added s_front_direct_alloc TLS flag (lazy ENV check)
- SLL pop guarded by: g_tls_sll_enable && !s_front_direct_alloc
- Refill dispatch:
  - Front-Direct: ss_refill_fc_fill() → fastcache_pop() (1-hop)
  - Legacy: sll_refill_batch_from_ss() → SLL → FC (2-hop, A/B only)
- SLL inline pop sealed (requires HAKMEM_TINY_INLINE_SLL=1 opt-in)
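
The lazy ENV check and the SLL guard described above might look roughly like this. It is a sketch under stated assumptions: the parsing rule in `env_flag_on` (anything not starting with `0` counts as on) and the names `front_direct_enabled`, `sll_pop_allowed` and `s_front_direct_alloc_sketch` are hypothetical, not taken from `tiny_alloc_fast.inc.h`.

```c
#include <stdlib.h>

/* Assumption: a flag is "on" when set to a non-empty value that does
 * not start with '0'. The real parsing rule may differ. */
static int env_flag_on(const char *value) {
    return value != NULL && value[0] != '\0' && value[0] != '0';
}

/* -1 means "not read yet"; the environment is consulted once per thread. */
static _Thread_local int s_front_direct_alloc_sketch = -1;

static int front_direct_enabled(void) {
    if (s_front_direct_alloc_sketch < 0) {
        s_front_direct_alloc_sketch =
            env_flag_on(getenv("HAKMEM_TINY_FRONT_DIRECT"));
    }
    return s_front_direct_alloc_sketch;
}

/* Mirrors the guard in the text: the SLL pop is attempted only when the
 * SLL is enabled and Front-Direct is off. */
static int sll_pop_allowed(int tls_sll_enable, int front_direct) {
    return tls_sll_enable && !front_direct;
}
```

Caching the result in a thread-local keeps `getenv` off the hot path after the first call on each thread.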

## Free Path (core/hakmem_tiny_free.inc, core/hakmem_tiny_fastcache.inc.h)

- FC priority: Try fastcache_push() first (same-thread free)
- tiny_fast_push() bypass: Returns 0 when s_front_direct_free || !g_tls_sll_enable
- Fallback: Magazine/slow path (safe, bypasses SLL)

## Legacy Sealing

- SFC cascade: Default OFF (ENV-only via HAKMEM_TINY_SFC_CASCADE=1)
- Deleted: core/hakmem_tiny_free.inc.bak, core/pool_refill_legacy.c.bak
- Documentation: ss_refill_fc_fill() promoted as CANONICAL refill entry

## ENV Controls

- HAKMEM_TINY_FRONT_DIRECT=1: Enable Front-Direct (SS→FC direct)
- HAKMEM_TINY_P0_DIRECT_FC_ALL=1: Same as above (alt name)
- HAKMEM_TINY_REFILL_BATCH=1: Enable batch refill (also enables Front-Direct)
- HAKMEM_TINY_SFC_CASCADE=1: Enable SFC cascade (default OFF)
- HAKMEM_TINY_INLINE_SLL=1: Enable inline SLL pop (default OFF, requires AGGRESSIVE_INLINE)

## Benchmarks (Front-Direct Enabled)

```bash
ENV: HAKMEM_BENCH_FAST_FRONT=1 HAKMEM_TINY_FRONT_DIRECT=1
     HAKMEM_TINY_REFILL_BATCH=1 HAKMEM_TINY_P0_DIRECT_FC_ALL=1
     HAKMEM_TINY_REFILL_COUNT_HOT=256 HAKMEM_TINY_REFILL_COUNT_MID=96
     HAKMEM_TINY_BUMP_CHUNK=256

bench_random_mixed (16-1040B random, 200K iter):
  256 slots: 1.44M ops/s (STABLE, 0 SEGV)
  128 slots: 1.44M ops/s (STABLE, 0 SEGV)

bench_fixed_size (fixed size, 200K iter):
  256B: 4.06M ops/s (has debug logs, expected >10M without logs)
  128B: Similar (also affected by debug logs)
```

## Verification

- TRACE_RING test (10K iter): **0 SLL events** detected ✅
- Complete SLL bypass confirmed when Front-Direct=1
- Stable execution: 200K iterations × multiple sizes, 0 SEGV

## Next Steps

- Disable debug logs in hak_alloc_api.inc.h (call_num 14250-14280 range)
- Re-benchmark with clean Release build (target: 10-15M ops/s)
- 128/256B shortcut path optimization (FC hit rate improvement)

Co-Authored-By: ChatGPT <chatgpt@openai.com>
Suggested-By: ultrathink

											
										
										
											2025-11-14 05:41:49 +09:00
+. **Detailed analysis of Phase 16/17-1 results**
+   - ✅ Done - recorded in CURRENT_TASK.md
 . **C2/C3 UltraHot code cleanup**
    - Move C4/C5-related definitions and branches behind an ENV guard or into a separate Box
    - Simplify the default configuration so that it covers only C2/C3
-												Front-Direct implementation: SS→FC direct refill + SLL complete bypass

## Summary

Implemented Front-Direct architecture with complete SLL bypass:
- Direct SuperSlab → FastCache refill (1-hop, bypasses SLL)
- SLL-free allocation/free paths when Front-Direct enabled
- Legacy path sealing (SLL inline opt-in, SFC cascade ENV-only)

## New Modules

- core/refill/ss_refill_fc.h (236 lines): Standard SS→FC refill entry point
  - Remote drain → Freelist → Carve priority
  - Header restoration for C1-C6 (NOT C0/C7)
  - ENV: HAKMEM_TINY_P0_DRAIN_THRESH, HAKMEM_TINY_P0_NO_DRAIN

- core/front/fast_cache.h: FastCache (L1) type definition
- core/front/quick_slot.h: QuickSlot (L0) type definition

## Allocation Path (core/tiny_alloc_fast.inc.h)

- Added s_front_direct_alloc TLS flag (lazy ENV check)
- SLL pop guarded by: g_tls_sll_enable && !s_front_direct_alloc
- Refill dispatch:
  - Front-Direct: ss_refill_fc_fill() → fastcache_pop() (1-hop)
  - Legacy: sll_refill_batch_from_ss() → SLL → FC (2-hop, A/B only)
- SLL inline pop sealed (requires HAKMEM_TINY_INLINE_SLL=1 opt-in)

## Free Path (core/hakmem_tiny_free.inc, core/hakmem_tiny_fastcache.inc.h)

- FC priority: Try fastcache_push() first (same-thread free)
- tiny_fast_push() bypass: Returns 0 when s_front_direct_free || !g_tls_sll_enable
- Fallback: Magazine/slow path (safe, bypasses SLL)

## Legacy Sealing

- SFC cascade: Default OFF (ENV-only via HAKMEM_TINY_SFC_CASCADE=1)
- Deleted: core/hakmem_tiny_free.inc.bak, core/pool_refill_legacy.c.bak
- Documentation: ss_refill_fc_fill() promoted as CANONICAL refill entry

## ENV Controls

- HAKMEM_TINY_FRONT_DIRECT=1: Enable Front-Direct (SS→FC direct)
- HAKMEM_TINY_P0_DIRECT_FC_ALL=1: Same as above (alt name)
- HAKMEM_TINY_REFILL_BATCH=1: Enable batch refill (also enables Front-Direct)
- HAKMEM_TINY_SFC_CASCADE=1: Enable SFC cascade (default OFF)
- HAKMEM_TINY_INLINE_SLL=1: Enable inline SLL pop (default OFF, requires AGGRESSIVE_INLINE)

## Benchmarks (Front-Direct Enabled)

```bash
ENV: HAKMEM_BENCH_FAST_FRONT=1 HAKMEM_TINY_FRONT_DIRECT=1
     HAKMEM_TINY_REFILL_BATCH=1 HAKMEM_TINY_P0_DIRECT_FC_ALL=1
     HAKMEM_TINY_REFILL_COUNT_HOT=256 HAKMEM_TINY_REFILL_COUNT_MID=96
     HAKMEM_TINY_BUMP_CHUNK=256

bench_random_mixed (16-1040B random, 200K iter):
  256 slots: 1.44M ops/s (STABLE, 0 SEGV)
  128 slots: 1.44M ops/s (STABLE, 0 SEGV)

bench_fixed_size (fixed size, 200K iter):
  256B: 4.06M ops/s (has debug logs, expected >10M without logs)
  128B: Similar (debug logs affect)
```

## Verification

- TRACE_RING test (10K iter): **0 SLL events** detected ✅
- Complete SLL bypass confirmed when Front-Direct=1
- Stable execution: 200K iterations × multiple sizes, 0 SEGV

## Next Steps

- Disable debug logs in hak_alloc_api.inc.h (call_num 14250-14280 range)
- Re-benchmark with clean Release build (target: 10-15M ops/s)
- 128/256B shortcut path optimization (FC hit rate improvement)

Co-Authored-By: ChatGPT <chatgpt@openai.com>
Suggested-By: ultrathink

											
										
										
											2025-11-14 05:41:49 +09:00
-												CURRENT_TASK.md: Add Phase 16 results and Phase 17 design plan

Added:
- Section 4: Phase 16 A/B Testing results (Dynamic Tiny/Mid boundary)
- Section 5: Phase 17 Small-Mid Box design plan (256B-4KB dedicated layer)
- Updated TODO list for Phase 17 implementation

Phase 16 Conclusion:
- Reducing Tiny coverage (C0-C5) caused -76% to -79% performance degradation
- Mid's coarse size classes (8KB/16KB/32KB) are inefficient for small sizes
- Recommendation: Keep default HAKMEM_TINY_MAX_CLASS=7

Phase 17 Plan:
- New Small-Mid allocator box for 256B-4KB range
- Dedicated SuperSlab pool (separated from Tiny to avoid Phase 12 churn)
- 5 size classes: 256B/512B/1KB/2KB/4KB
- Target: 10M-20M ops/s (2-4x improvement over current Tiny C6/C7)
- ENV control: HAKMEM_SMALLMID_ENABLE=1 for A/B testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-16 01:40:36 +09:00

**Automating ExternalGuard statistics**
- Automatic reporting when a threshold is exceeded

This CURRENT_TASK.md is only a condensed memo covering roughly Phases 14–17.
For the detailed earlier history, see `CURRENT_TASK_FULL.md` and the individual PHASE reports.

Phase 15: Box Separation (partial) - Box headers completed, routing deferred

**Status**: Box FG V2 + ExternalGuard implemented; hak_free_at routing reverted to Phase 14-C

**Files Created**:
1. core/box/front_gate_v2.h (98 lines)
   - Ultra-fast 1-byte header classification (TINY/POOL/MIDCAND/EXTERNAL)
   - Performance: 2-5 cycles
   - Same-page guard added (defensive programming)

2. core/box/external_guard_box.h (146 lines)
   - ENV-controlled mincore safety check
   - HAKMEM_EXTERNAL_GUARD_MINCORE=0/1 (default: OFF)
   - Uses __libc_free() to avoid infinite loop
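
The mincore safety check can be illustrated with a small Linux-only sketch: before dereferencing a header near an unknown pointer, ask the kernel whether the containing page is mapped at all. The function name `page_is_mapped` is hypothetical and this is not the code in `external_guard_box.h`; it only shows the underlying system call.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page containing `ptr` is mapped in this process,
 * 0 if it is not (mincore fails with ENOMEM), and -1 on any other error. */
int page_is_mapped(const void *ptr) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    /* mincore() requires a page-aligned start address. */
    uintptr_t base = (uintptr_t)ptr & ~((uintptr_t)page - 1);
    unsigned char vec;   /* one status byte per page queried */
    if (mincore((void *)base, (size_t)page, &vec) == 0)
        return 1;
    return errno == ENOMEM ? 0 : -1;
}
```

Note that a mapped page is not necessarily readable, and each query is a system call, which is why the number of mincore calls matters for throughput.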

**Routing**:
- hak_free_at reverted to Phase 14-C (classify_ptr-based, stable)
- Phase 15 routing caused SEGV on page-aligned pointers

**Performance**:
- Phase 14-C (mincore ON): 16.5M ops/s (stable)
- mincore: 841 calls/100K iterations
- mincore OFF: SEGV (unsafe AllocHeader deref)

**Next Steps** (deferred):
- Mid/Large/C7 registry consolidation
- AllocHeader safety validation
- ExternalGuard integration

**Recommendation**: Stick with Phase 14-C for now
- mincore overhead acceptable (~1.9ms / 100K)
- Focus on other bottlenecks (TLS SLL, SuperSlab churn)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-15 22:08:51 +09:00

---

Phase 17-1: Small-Mid Allocator - TLS Frontend Cache (result: ±0.3%, layer separation succeeded)

Summary:
========
Phase 17-1 implements the Small-Mid allocator as a TLS frontend cache with Tiny backend delegation.
Result: Clean layer separation achieved with minimal overhead (±0.3%), but no performance gain.
Conclusion: The frontend-only approach is a dead end. Phase 17-2 (dedicated backend) is required for the 2-3x target.

Implementation:
===============
1. Small-Mid TLS frontend (256B/512B/1KB - 3 classes)
   - TLS freelist (32/24/16 capacity)
   - Backend delegation to Tiny C5/C6/C7
   - Header conversion (0xa0 → 0xb0)

2. Auto-adjust Tiny boundary
   - When Small-Mid ON: Tiny auto-limits to C0-C5 (0-255B)
   - When Small-Mid OFF: Tiny default C0-C7 (0-1023B)
   - Prevents routing conflict

3. Routing order fix
   - Small-Mid BEFORE Tiny (critical for proper execution)
   - Fall-through on TLS miss
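
The frontend described in the steps above can be sketched as a per-thread freelist that delegates to a backend on a miss and rewrites the header tag. This is a simplified illustration, not the code in `core/hakmem_smallmid.c`: the one-byte header layout (magic in the high nibble, class index in the low nibble), the `stub_tiny_alloc` backend, the 256B lower bound, and all `sm_*` names are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SM_NUM_CLASSES 3
#define HDR_TINY       0xa0u   /* header magic values from the log above */
#define HDR_SMALLMID   0xb0u

/* Block sizes (256B/512B/1KB) and TLS capacities (32/24/16) per class. */
static const size_t   sm_size[SM_NUM_CLASSES] = {256, 512, 1024};
static const unsigned sm_cap[SM_NUM_CLASSES]  = {32, 24, 16};

typedef struct { void *head; unsigned count; } SmTlsList;
static _Thread_local SmTlsList sm_tls[SM_NUM_CLASSES];

/* Hypothetical stand-in for the Tiny backend: returns a user pointer whose
 * preceding byte holds a Tiny-style header (0xa0 | class). */
static void *stub_tiny_alloc(int cls) {
    uint8_t *raw = malloc(sm_size[cls] + 16);
    if (!raw) return NULL;
    raw[15] = (uint8_t)(HDR_TINY | (unsigned)cls);
    return raw + 16;
}

static int sm_class_for(size_t n) {
    if (n < 256) return -1;               /* assumed to stay with Tiny */
    for (int c = 0; c < SM_NUM_CLASSES; c++)
        if (n <= sm_size[c]) return c;
    return -1;
}

void *sm_alloc(size_t n) {
    int c = sm_class_for(n);
    if (c < 0) return NULL;               /* caller falls through */
    SmTlsList *l = &sm_tls[c];
    if (l->head) {                        /* TLS hit: pop the freelist */
        void *p = l->head;
        l->head = *(void **)p;
        l->count--;
        return p;
    }
    uint8_t *p = stub_tiny_alloc(c);      /* TLS miss: 1:1 delegation */
    if (!p) return NULL;
    p[-1] = (uint8_t)(HDR_SMALLMID | (p[-1] & 0x0fu));  /* 0xa0 -> 0xb0 */
    return p;
}

/* Returns 1 if the block carried a Small-Mid header and was handled. */
int sm_free(void *ptr) {
    uint8_t *p = ptr;
    if ((p[-1] & 0xf0u) != HDR_SMALLMID) return 0;
    int c = p[-1] & 0x0f;
    if (c >= SM_NUM_CLASSES) return 0;
    SmTlsList *l = &sm_tls[c];
    if (l->count >= sm_cap[c]) {          /* list full: return to backend */
        free(p - 16);
        return 1;
    }
    *(void **)p = l->head;                /* push onto the TLS freelist */
    l->head = p;
    l->count++;
    return 1;
}
```

Every miss in this sketch costs exactly one backend call, which matches the 1:1 delegation pattern and shows why the frontend alone cannot reduce backend traffic.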

Files Modified:
===============
- core/hakmem_smallmid.h/c: TLS freelist + backend delegation
- core/hakmem_tiny.c: tiny_get_max_size() auto-adjust
- core/box/hak_alloc_api.inc.h: Routing order (Small-Mid → Tiny)
- CURRENT_TASK.md: Phase 17-1 results + Phase 17-2 plan
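
The routing order and the auto-adjusted Tiny limit can be pictured as a size-based dispatch. The sketch below is an assumption-laden illustration: `route_for_size` and `tiny_max_size_sketch` are hypothetical names, the environment variable is re-read on every call for simplicity, and the byte ranges are taken from the boundaries quoted in this log.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef enum { ROUTE_SMALLMID, ROUTE_TINY, ROUTE_MID } Route;

/* Small-Mid is opt-in via HAKMEM_SMALLMID_ENABLE=1. */
static int smallmid_enabled(void) {
    const char *v = getenv("HAKMEM_SMALLMID_ENABLE");
    return v != NULL && strcmp(v, "1") == 0;
}

/* Auto-adjust: Tiny shrinks to 0-255B when Small-Mid is on, and keeps
 * its default 0-1023B range otherwise. */
static size_t tiny_max_size_sketch(void) {
    return smallmid_enabled() ? 255 : 1023;
}

/* Small-Mid is checked before Tiny so the two layers never claim the
 * same size; anything larger goes to Mid. */
Route route_for_size(size_t n) {
    if (smallmid_enabled() && n >= 256 && n <= 1024)
        return ROUTE_SMALLMID;
    if (n <= tiny_max_size_sketch())
        return ROUTE_TINY;
    return ROUTE_MID;
}
```

With the flag unset the dispatch reduces to the default Tiny/Mid split, which is what makes the A/B comparison a single environment toggle.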

A/B Benchmark Results:
======================
| Size   | Config A (OFF) | Config B (ON) | Delta    | % Change |
|--------|----------------|---------------|----------|----------|
| 256B   | 5.87M ops/s    | 6.06M ops/s   | +191K    | +3.3%    |
| 512B   | 6.02M ops/s    | 5.91M ops/s   | -112K    | -1.9%    |
| 1024B  | 5.58M ops/s    | 5.54M ops/s   | -35K     | -0.6%    |
| Overall| 5.82M ops/s    | 5.84M ops/s   | +20K     | +0.3%    |
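
For readers who want to recompute the table, the two helpers below show the arithmetic, assuming the Overall row is the arithmetic mean of the three size rows (which the rounded figures are consistent with). The function names are illustrative only.

```c
#include <stddef.h>

/* Relative change from the OFF config to the ON config, in percent. */
double pct_change(double off, double on) {
    return (on - off) / off * 100.0;
}

/* Arithmetic mean of `n` throughput values. */
double mean(const double *v, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}
```

Recomputing from the rounded table entries gives slightly different deltas than the ones listed (for example, about +0.2% overall rather than +0.3%), presumably because the table was derived from unrounded measurements. Either way the overall change is well under 1%.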

Analysis:
=========
✅ SUCCESS: Clean layer separation (Small-Mid ↔ Tiny coexist)
✅ SUCCESS: Minimal overhead (±0.3% = measurement noise)
❌ FAIL: No performance gain (target was 2-4x)

Root Cause:
-----------
- Delegation overhead = TLS savings (net gain ≈ 0 instructions)
- Small-Mid TLS alloc: ~3-5 instructions
- Tiny backend delegation: ~3-5 instructions
- Header conversion: ~2 instructions
- No batching: 1:1 delegation to Tiny (no refill amortization)

Lessons Learned:
================
- Frontend-only approach ineffective (backend calls not reduced)
- Dedicated backend essential for meaningful improvement
- Clean separation achieved = solid foundation for Phase 17-2

Next Steps (Phase 17-2):
========================
- Dedicated Small-Mid SuperSlab backend (separate from Tiny)
- TLS batch refill (8-16 blocks per refill)
- Optimized 0xb0 header fast path (no delegation)
- Target: 12-15M ops/s (2.0-2.6x improvement)
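
The batch refill idea can be sketched as follows: on a TLS miss, fetch several blocks from the backend in one call and thread them onto the freelist, so the backend cost is amortized over the batch. This is a hypothetical illustration of the plan, not Phase 17-2 code; `backend_carve` is a stub built on `malloc`, blocks are never returned, and the batch size of 8 is the low end of the range given above.

```c
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE   256
#define REFILL_BATCH 8     /* the plan mentions 8-16 blocks per refill */

static _Thread_local void *tls_head;             /* intrusive freelist */
static _Thread_local unsigned long backend_calls; /* trips to the backend */

/* Hypothetical backend: carve `n` contiguous blocks in a single call. */
static void *backend_carve(size_t n) {
    backend_calls++;
    return malloc(n * BLOCK_SIZE);
}

/* Refill the TLS freelist with one batch; returns 0 on backend failure. */
static int tls_refill(void) {
    char *chunk = backend_carve(REFILL_BATCH);
    if (!chunk) return 0;
    for (size_t i = 0; i < REFILL_BATCH; i++) {
        void *blk = chunk + i * BLOCK_SIZE;
        *(void **)blk = tls_head;   /* link block into the freelist */
        tls_head = blk;
    }
    return 1;
}

/* Fast path pops from TLS; the backend is touched once per batch. */
void *batch_alloc(void) {
    if (!tls_head && !tls_refill())
        return NULL;
    void *blk = tls_head;
    tls_head = *(void **)blk;
    return blk;
}
```

With a batch of 8, 32 allocations reach the backend 4 times instead of 32, which is the amortization the 1:1 delegation in Phase 17-1 lacked.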

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-16 02:37:24 +09:00

## 8. Phase 17 Implementation Log

### 2025-11-16 (Phase 17-1 complete)

- ✅ Phase 16 complete; A/B test results analyzed
- ✅ Reviewed ChatGPT's Small-Mid Box proposal
- ✅ Phase 17-1 implementation complete (TLS Frontend + Tiny Backend delegation)
  - `core/hakmem_smallmid.h/c`: TLS freelist + backend delegation
  - `core/hakmem_tiny.c`: auto-adjust feature
  - `core/box/hak_alloc_api.inc.h`: routing order fix
- ✅ A/B benchmark complete (result: ±0.3%, no performance improvement)
- ✅ Root cause analysis: delegation overhead = TLS savings (zero net gain)
- ✅ CURRENT_TASK.md updated (Phase 17-1 results + Phase 17-2 plan)
- 🚧 Next: start implementing the Phase 17-2 dedicated backend