Tiny Pool redesign: P0.1, P0.3, P1.1, P1.2 - Out-of-band class_idx lookup
This commit implements the first phase of Tiny Pool redesign based on
ChatGPT architecture review. The goal is to eliminate Header/Next pointer
conflicts by moving class_idx lookup out-of-band (to SuperSlab metadata).
## P0.1: C0(8B) class upgraded to 16B
- Size table changed: {16,32,64,128,256,512,1024,2048} (8 classes)
- LUT updated: 1..16 → class 0, 17..32 → class 1, etc.
- tiny_next_off: C0 now uses offset 1 (header preserved)
- Eliminates edge cases for 8B allocations
## P0.3: Slab reuse guard Box (tls_slab_reuse_guard_box.h)
- New Box for draining TLS SLL before slab reuse
- ENV gate: HAKMEM_TINY_SLAB_REUSE_GUARD=1
- Prevents stale pointers when slabs are recycled
- Follows Box theory: single responsibility, minimal API
## P1.1: SuperSlab class_map addition
- Added uint8_t class_map[SLABS_PER_SUPERSLAB_MAX] to SuperSlab
- Maps slab_idx → class_idx for out-of-band lookup
- Initialized to 255 (UNASSIGNED) on SuperSlab creation
- Set correctly on slab initialization in all backends
## P1.2: Free fast path uses class_map
- ENV gate: HAKMEM_TINY_USE_CLASS_MAP=1
- Free path can now get class_idx from class_map instead of Header
- Falls back to Header read if class_map returns invalid value
- Fixed Legacy Backend dynamic slab initialization bug
## Documentation added
- HAKMEM_ARCHITECTURE_OVERVIEW.md: 4-layer architecture analysis
- TLS_SLL_ARCHITECTURE_INVESTIGATION.md: Root cause analysis
- PTR_LIFECYCLE_TRACE_AND_ROOT_CAUSE_ANALYSIS.md: Pointer tracking
- TINY_REDESIGN_CHECKLIST.md: Implementation roadmap (P0-P3)
## Test results
- Baseline: 70% success rate (30% crash - pre-existing issue)
- class_map enabled: 70% success rate (same as baseline)
- Performance: ~30.5M ops/s (unchanged)
## Next steps (P1.3, P2, P3)
- P1.3: Add meta->active for accurate TLS/freelist sync
- P2: TLS SLL redesign with Box-based counting
- P3: Complete Header out-of-band migration
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-28 13:42:39 +09:00

# HAKMEM Architecture Overview - Design Questionnaire

**Created**: 2025-11-28
**Purpose**: Summarize the overall architecture for a design consultation with ChatGPT

---

## Part 1: Overall HAKMEM Architecture

### 1.1 Overview of the 4-Layer Structure

HAKMEM is designed as a **4-layer architecture**:

```
┌─────────────────────────────────────────────────────────────────┐
│ Layer 0 (L0): Tiny Pool - ≤1KB allocator                        │
│  - 8 size classes (8B, 16B, 32B, 64B, 128B, 256B, 512B, 1KB)    │
│  - TLS Magazine (fast cache)                                    │
│  - TLS SLL (singly linked list: unbounded overflow)             │
│  - TLS Active Slab (Arena-lite: lockless alloc by owner thread) │
│  - SuperSlab backend (multiple 64KB slabs)                      │
│  - MPSC remote-free queue (cross-thread free)                   │
│  - Impl: core/hakmem_tiny.{c,h}                                 │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│ Layer 1 (L1): ACE Layer - 1KB~2MB allocator + learning engine   │
│                                                                 │
│ [Mid Pool] 2KB~32KB (5 classes: 2/4/8/16/32KB)                  │
│  - TLS cache + Shared Pool                                      │
│  - W_MAX rounding (round-up allowed, e.g. 32KB→64KB)            │
│  - DYN1/DYN2: variable size classes (tuned by learning)         │
│                                                                 │
│ [Large Pool] 64KB~1MB (5 classes: 64/128/256/512KB/1MB)         │
│  - Bundle-based refill (refilled one bundle at a time)          │
│  - CAP learning (capacity auto-tuned from hit rate)             │
│                                                                 │
│ [ACE Controller] learning engine                                │
│  - ACE = Agentic Context Engineering                            │
│    (Adaptive Control/Cache Engine in the implementation)        │
│  - CAP learning: hit rate vs target → CAP ±Δ                    │
│  - W_MAX learning: UCB1 bandit + canary deployment              │
│  - DYN1 auto-assignment: peak detection on size histogram       │
│  - Budget constraint: total CAP limit + water-filling           │
│  - Impl: core/hakmem_learner.{c,h}, hakmem_ace_controller.{c,h} │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│ Layer 2 (L2): Big Cache - ≥2MB allocator                        │
│  - BigCache: size-class cache (Tier-2 optimization)             │
│  - mmap/munmap (THP: Transparent Huge Pages supported)          │
│  - Batch madvise: batched MADV_DONTNEED (fewer syscalls)        │
│  - THP threshold learning: UCB1/canary picks the threshold      │
│  - Impl: core/hakmem_bigcache.{c,h}, hakmem_batch.{c,h}         │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│ Layer 3: Policy & Learning Infrastructure                       │
│  - FrozenPolicy: RCU-like policy snapshot                       │
│  - Site Rules: per-call-site routing hints (optional)           │
│  - P² percentile: p99 estimation in O(1) memory                 │
│  - ELO rating: strategy evaluation + softmax selection          │
│  - Impl: core/hakmem_policy.{c,h}, hakmem_elo.{c,h}             │
└─────────────────────────────────────────────────────────────────┘
```

---

### 1.2 Responsibilities and Interactions of Each Layer

#### Layer 0: Tiny Pool (≤1KB)

**Responsibilities**:
- Ultra-fast handling of small allocations (8B~1KB)
- Contention avoidance via TLS (Thread-Local Storage)
- Safe handling of cross-thread frees (MPSC queue)

**Key data structures**:
- `TinyTLSSLL`: singly linked list (head, count)
- `TinySlab`: 64KB slab, bitmap-managed
- `SuperSlab`: parent structure of multiple slabs (introduced in Phase 7)
- `g_tls_sll[8]`: per-class TLS SLL (thread-local)

**Data flow (Alloc)**:
```
1. TLS SLL pop (3-4 instructions, 5-10 cycles) → HIT (85-90%)
   ├─ hit → return user pointer (base+1)
   └─ miss → go to 2

2. TLS Active Slab (bitmap scan, 5-6ns) → HIT (8-12%)
   ├─ hit → take block → push onto TLS SLL → back to 1
   └─ miss → go to 3

3. SuperSlab refill (batch refill, 32-128 blocks) → HIT (1-3%)
   ├─ success → batch push onto TLS SLL → back to 1
   └─ fail → allocate a new SuperSlab via mmap
```
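
Step 1 of the flow above is an ordinary intrusive singly-linked-list pop. The following is a minimal sketch, assuming a hypothetical `sll_t` that mirrors the `head`/`count` fields listed for `TinyTLSSLL`; it ignores the 1-byte header and the user-pointer offset, and `sll_push`/`sll_pop` are illustrative names rather than HAKMEM's actual API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-class list: the next pointer lives inside the free block. */
typedef struct {
    void    *head;
    uint32_t count;
} sll_t;

static _Thread_local sll_t g_sll[8];   /* one list per size class (assumed) */

/* Push a free block: store the old head in the block, then update head. */
static void sll_push(int cls, void *block) {
    memcpy(block, &g_sll[cls].head, sizeof(void *));
    g_sll[cls].head = block;
    g_sll[cls].count++;
}

/* Pop a block, or return NULL on a miss (caller falls through to step 2). */
static void *sll_pop(int cls) {
    void *block = g_sll[cls].head;
    if (block == NULL) return NULL;
    memcpy(&g_sll[cls].head, block, sizeof(void *));
    g_sll[cls].count--;
    return block;
}
```

Because the list is thread-local, neither function needs atomics or locks; that is where the "3-4 instructions" figure comes from.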

**Data flow (Free)**:
```
1. Header read (2-3 cycles) → determine class_idx
   ├─ 1-byte header: 0xA0 | class_idx
   └─ C0,C7: offset 0 / C1-6: offset 1

2. TLS SLL push (3-4 instructions) → done (95-99%)
   ├─ Same thread → push directly onto TLS SLL
   └─ Cross thread → enqueue on MPSC remote queue

3. Periodic drain (every 2048 frees) → return to freelist
   ├─ Move TLS SLL → slab freelist
   └─ meta->used-- (decrement in-use block count)
```
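
The header encoding in step 1 can be sketched as follows. The `0xA0 | class_idx` layout and the `& 0x07` mask come from this document; the function names and the high-nibble validity check are assumptions added for illustration.

```c
#include <stdint.h>

#define TINY_HDR_MAGIC 0xA0u   /* from the text: header = 0xA0 | class_idx */
#define TINY_HDR_MASK  0x07u   /* low 3 bits hold class_idx (0-7) */

/* Write the 1-byte header at the block base and return the user pointer. */
static void *tiny_hdr_write(uint8_t *base, int class_idx) {
    *base = (uint8_t)(TINY_HDR_MAGIC | ((unsigned)class_idx & TINY_HDR_MASK));
    return base + 1;
}

/* Read class_idx from the byte just before the user pointer.
 * Returns -1 if the high nibble is not the magic (assumed validity check). */
static int tiny_hdr_read(const void *user_ptr) {
    uint8_t h = *((const uint8_t *)user_ptr - 1);
    if ((h & 0xF0u) != TINY_HDR_MAGIC) return -1;
    return (int)(h & TINY_HDR_MASK);
}
```

A `-1` result is the point where a real free path would have to fall back to a slower classification.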

**Related files**:
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny.{c,h}` (2228 lines)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.{c,h}` (SuperSlab management)
- `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h` (TLS SLL API, 753 lines)
- `/mnt/workdisk/public_share/hakmem/core/tiny_region_id.h` (header operations, 222 lines)

---

#### Layer 1: ACE Layer (1KB~2MB) + Learning

**Responsibilities**:
- Efficient handling of medium-to-large allocations (1KB~2MB)
- Automatic capacity tuning via hit-rate-based learning
- W_MAX rounding (flexible size-class selection)

**Key data structures**:
- `FrozenPolicy`: snapshot (mid_cap[5], large_cap[5], w_max_mid, w_max_large)
- `PoolStats`: hit/miss counters (aggregated per window)
- `ucb1_t`: UCB1 bandit structure (candidate values, pulls, sum_score)

**Learning algorithms**:
```
[CAP learning] (implemented)
  Window length: 1 s (HAKMEM_LEARN_WINDOW_MS=1000)

  For each class c in {Mid0..4, Large0..4}:
    hit_rate_c = Δhits_c / (Δhits_c + Δmisses_c)

    IF hit_rate_c < target_c - eps:
      cap_c += step_c   (shortage → increase)
    ELIF hit_rate_c > target_c + eps:
      cap_c -= step_c   (excess → decrease)

    cap_c = clamp(cap_c, min_c, max_c)

  Budget constraint:
    IF sum(cap_mid) > BUDGET_MID:
      → trim starting from low-demand classes
    ELIF sum(cap_mid) < BUDGET_MID && WATER_FILLING:
      → distribute to high-demand classes

[W_MAX learning] (implemented)
  UCB1 bandit:
    Candidates: [1.4, 1.6, 1.8, 2.0]    (Mid)
                [1.25, 1.5, 1.75, 2.0]  (Large)

    For each arm i:
      UCB_i = mean_score_i + 1.5 * sqrt(log(total_pulls) / pulls_i)

    Select: arm = argmax(UCB_i)

  Canary deployment:
    Trial period: dwell_sec (3~5 s)
    Evaluation: hit-rate improvement
    No improvement → rollback

[DYN1 learning] (implemented)
  Size histogram:
    Peak detection → set a dynamic class in a range that does not overlap the fixed classes
    Example: fill the 8-16KB gap with 12KB
```
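
The per-class CAP update in the pseudocode above translates almost directly into C. This is a minimal sketch of one window's update for a single class; `cap_state_t`, `cap_update`, and the choice to leave the cap unchanged when a window has no traffic are assumptions, not taken from `hakmem_learner.c`.

```c
/* State for one size class (hypothetical layout). */
typedef struct {
    int    cap, min_cap, max_cap, step;
    double target, eps;
} cap_state_t;

static int clamp_int(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* One CAP-learning step: compare the window hit rate against the target
 * band [target - eps, target + eps] and move cap by one step if outside. */
static void cap_update(cap_state_t *s, unsigned long d_hits, unsigned long d_misses) {
    unsigned long total = d_hits + d_misses;
    if (total == 0) return;   /* idle window: keep cap as-is (assumption) */
    double hit_rate = (double)d_hits / (double)total;
    if (hit_rate < s->target - s->eps)      s->cap += s->step;  /* shortage */
    else if (hit_rate > s->target + s->eps) s->cap -= s->step;  /* excess   */
    s->cap = clamp_int(s->cap, s->min_cap, s->max_cap);
}
```

The `eps` band acts as a deadband: hit rates inside it cause no change, which limits oscillation around the target.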

**Data flow (Alloc)**:
```
1. Load FrozenPolicy (once, at function entry)
   pol = hkm_policy_get()

2. W_MAX check
   IF class ≤ W_MAX × size:
     class = round_up(size, class_stride)

3. Pool TLS cache → HIT (65-85%)
   ├─ hit → return
   └─ miss → go to 4

4. Shared Pool refill → HIT (10-20%)
   ├─ success → refill TLS → back to 3
   └─ fail → malloc fallback (1-5%)
```
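
One reading of the W_MAX check in step 2 is: round the request up to the next class, but only if that class is at most `W_MAX × size`. The sketch below implements that reading for the five Mid classes; `mid_class_for` is a hypothetical name, and the `-1` return (meaning "do not round up, use another path") is an assumption about what happens when the check fails.

```c
#include <stddef.h>

/* Mid Pool classes from the layer diagram: 2/4/8/16/32KB. */
static const size_t k_mid_classes[5] = {2048, 4096, 8192, 16384, 32768};

/* Return the index of the smallest class that fits `size` without exceeding
 * w_max * size, or -1 if rounding up would waste more than W_MAX allows. */
static int mid_class_for(size_t size, double w_max) {
    for (int i = 0; i < 5; i++) {
        if (k_mid_classes[i] < size) continue;
        return ((double)k_mid_classes[i] <= w_max * (double)size) ? i : -1;
    }
    return -1;  /* larger than the largest Mid class */
}
```

With `w_max = 1.4`, a 3000-byte request is accepted into the 4KB class (4096 ≤ 4200), while a 2100-byte request is rejected (4096 > 2940). A larger W_MAX trades internal fragmentation for a higher pool hit rate, which is the quantity the UCB1 bandit tunes.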

**Environment variables**:
```bash
# Enable learning
HAKMEM_LEARN=1

# Target hit rates
HAKMEM_TARGET_HIT_MID=0.65      # Mid: 65%
HAKMEM_TARGET_HIT_LARGE=0.55    # Large: 55%

# CAP adjustment step
HAKMEM_CAP_STEP_MID=4           # Mid: 4 pages/update
HAKMEM_CAP_STEP_LARGE=1         # Large: 1 bundle/update

# Budget constraint
HAKMEM_BUDGET_MID=300           # Mid total CAP limit: 300 pages
HAKMEM_BUDGET_LARGE=50          # Large total CAP limit: 50 bundles

# W_MAX learning
HAKMEM_WMAX_LEARN=1
HAKMEM_WMAX_CANDIDATES_MID=1.4,1.6,1.8
HAKMEM_WMAX_CANDIDATES_LARGE=1.25,1.5,2.0

# DYN1 auto-assignment
HAKMEM_DYN1_AUTO=1
HAKMEM_MID_DYN1=12288           # 12KB
```

**Related files**:
- `/mnt/workdisk/public_share/hakmem/core/hakmem_ace.{c,h}` (L1 integration)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_learner.{c,h}` (learning engine, 900 lines)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_pool.{c,h}` (Mid Pool)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_l25_pool.{c,h}` (Large Pool)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_ace_controller.{c,h}` (ACE control)

---

#### Layer 2: Big Cache (≥2MB)

**Responsibilities**:
- Caching of large allocations (≥2MB)
- Fewer mmap/munmap syscalls
- Use of THP (Transparent Huge Pages)

**Key data structures**:
- `BigCache`: per-size-class cache (hash map)
- `BatchMadvise`: MADV_DONTNEED batch structure

**Data flow**:
```
1. BigCache lookup → HIT (20-40%)
   ├─ hit → return
   └─ miss → go to 2

2. mmap (MAP_ANONYMOUS | MAP_PRIVATE)
   ├─ size ≥ thp_threshold → MADV_HUGEPAGE hint
   └─ success → register in cache

Free:
1. BigCache push (capacity-limited)
   ├─ space available → keep in cache
   └─ full → go to 2

2. Add to BatchMadvise
   ├─ batch_size < limit → add to batch
   └─ batch_size ≥ limit → go to 3

3. Batch flush: issue MADV_DONTNEED for the whole batch
```

**Related files**:
- `/mnt/workdisk/public_share/hakmem/core/hakmem_bigcache.{c,h}` (210 lines)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_batch.{c,h}` (120 lines)

---

#### Layer 3: Policy & Learning Infrastructure

**Responsibilities**:
- Policy management (RCU-like snapshot)
- Learning infrastructure (statistics, evaluation, selection)

**Key data structures**:
- `FrozenPolicy`: read-only snapshot
- `p2_t`: P² algorithm state (O(1) p99 estimation)
- `elo_t`: ELO rating structure

**FrozenPolicy RCU-like update**:
```
Writer (learner thread):
  1. Create the new policy
     new_pol = malloc(sizeof(FrozenPolicy))
     new_pol->mid_cap[0] = 128
     new_pol->generation++

  2. Atomic publish
     hkm_policy_publish(new_pol)

Reader (hot path):
  1. Take a snapshot (once, at function entry)
     const FrozenPolicy* pol = hkm_policy_get()

  2. Read-only access
     cap = pol->mid_cap[class_idx]

Grace period: none (simple generation management; the old policy is overwritten)
```
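
The publish/get pair above can be expressed with C11 atomics. This is a minimal sketch under assumptions: `frozen_policy_t`, `policy_publish`, and `policy_get` are illustrative stand-ins for `FrozenPolicy`, `hkm_policy_publish()`, and `hkm_policy_get()`, whose real definitions are not shown in this document.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced policy snapshot. */
typedef struct {
    uint32_t generation;
    int      mid_cap[5];
    int      large_cap[5];
} frozen_policy_t;

static _Atomic(const frozen_policy_t *) g_policy;

/* Writer: publish a fully initialized policy. The release half of the
 * ordering makes the field writes visible before the pointer is.
 * Returns the previous policy; without a grace period the caller cannot
 * know when readers have stopped using it. */
static const frozen_policy_t *policy_publish(const frozen_policy_t *new_pol) {
    return atomic_exchange_explicit(&g_policy, new_pol, memory_order_acq_rel);
}

/* Reader: take one snapshot at function entry and use only that pointer. */
static const frozen_policy_t *policy_get(void) {
    return atomic_load_explicit(&g_policy, memory_order_acquire);
}
```

The sketch makes the "no grace period" caveat concrete: `policy_publish` hands back the old pointer, but freeing or overwriting it immediately is only safe if no reader still holds a snapshot.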

**Related files**:
- `/mnt/workdisk/public_share/hakmem/core/hakmem_policy.{c,h}` (50 lines)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_p2.{c,h}` (P² algorithm, 130 lines)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_elo.{c,h}` (ELO rating, 450 lines)

---

### 1.3 Overall Data Flow

```
malloc(size) called
    |
    v
[Size check]
    |
    +---> size ≤ 1KB ---------> Layer 0 (Tiny Pool)
    |                              |
    |                              v
    |                         TLS SLL pop (5-10ns)
    |                              |
    |                              v
    |                         Return user pointer
    |
    +---> 1KB < size < 2MB ---> Layer 1 (ACE)
    |                              |
    |                              v
    |                         Get FrozenPolicy
    |                              |
    |                              v
    |                         W_MAX check → class selection
    |                              |
    |                              v
    |                         Pool TLS cache
    |                              |
    |                              v
    |                         Return user pointer
    |
    +---> size ≥ 2MB ----------> Layer 2 (Big Cache)
                                   |
                                   v
                              BigCache lookup
                                   |
                                   v
                              mmap (on miss)
                                   |
                                   v
                              Return user pointer

free(ptr) called
    |
    v
[Header read]
    |
    v
class_idx = *(uint8_t*)(ptr - 1) & 0x07
    |
    +---> class_idx < 8 -------> Layer 0 (Tiny Pool)
    |                              |
    |                              v
    |                         TLS SLL push (3-4 instructions)
    |                              |
    |                              v
    |                         Done (95-99%)
    |
    +---> class_idx ≥ 8 -------> classify_ptr() (slow path)
                                   |
                                   v
                              Registry lookup (50-100 cycles)
                                   |
                                   v
                              Route to Layer 1 or Layer 2
```

---

## Part 2: Role of the TLS SLL and Its Problems

### 2.1 Original Purpose of the TLS SLL

**Design intent**:
- Contention avoidance via Thread-Local Storage
- Lock-free fast free path (3-4 instructions, 5-10 cycles)
- Unbounded overflow (no Magazine capacity limit)

**Position in Box Theory**:
- **Layer 2** (Fast Path): the fastest Alloc/Free path
- **Box 5** (TLS-SLL First): SLL first; on a miss, fall back to HotMag/TLS list

**Interactions with other layers**:
```
TLS SLL ←→ TLS Active Slab
  - refill: slab bitmap scan → batch push onto TLS SLL
  - drain: TLS SLL → slab freelist (periodic, every 2048 frees)

TLS SLL ←→ SuperSlab
  - carve: linear carve from SuperSlab → feeds TLS SLL
  - free: MPSC remote queue → SuperSlab freelist (cross-thread)

TLS SLL ←→ ACE Learning
  - capacity: sll_cap_for_class(class_idx) → optimal capacity via learning
  - metrics: TLS SLL hit/miss → learning input
```

---

### 2.2 Details of the Header/Next Conflict

#### The Core of the Problem

**Physical constraints**:
- Class 0 (8B): [1B header][7B payload] → an 8B pointer does not fit
- Class 1-6 (16B~512B): [1B header][payload ≥ 8B] → fits, but interferes with user data
- Class 7 (2048B): [1B header][2047B payload] → fits

**Design rule** (tiny_nextptr.h):
```c
// Phase E1-CORRECT FINAL
static inline size_t tiny_next_off(int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
    // Class 0, 7 → offset 0 (header is clobbered while on the freelist)
    //   - C0: physical constraint (an 8B pointer does not fit)
    //   - C7: design choice (isolate next from user data)
    // Class 1-6 → offset 1 (header preserved)
    return (class_idx == 0 || class_idx == 7) ? 0u : 1u;
#else
    return 0u;
#endif
}
```
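
To make the effect of the offset rule visible, the sketch below stores and loads a freelist link at `base + offset`. `next_off` repeats the header-enabled branch of `tiny_next_off()`; `next_store` and `next_load` are hypothetical helpers, not functions from `tiny_nextptr.h`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same rule as tiny_next_off() above with headers enabled. */
static size_t next_off(int class_idx) {
    return (class_idx == 0 || class_idx == 7) ? 0u : 1u;
}

/* Store the freelist link at base + offset. memcpy is used because
 * base + 1 is not pointer-aligned. */
static void next_store(uint8_t *base, int class_idx, void *next) {
    memcpy(base + next_off(class_idx), &next, sizeof next);
}

/* Load the freelist link back from base + offset. */
static void *next_load(const uint8_t *base, int class_idx) {
    void *next;
    memcpy(&next, base + next_off(class_idx), sizeof next);
    return next;
}
```

For class 3 the header byte at `base[0]` survives the store; for class 7 the link starts at `base[0]` and replaces the header, so a later header read on that block no longer sees `0xA7`.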

**How the conflict arises**:
```
[On alloc]
1. Derive the user pointer from base
   user = base + 1  (C1-6)
   user = base      (C0, C7)

2. Write the header
   *(uint8_t*)base = 0xA0 | class_idx

3. Return to the user

[On free]
1. Read the header
   class_idx = *(uint8_t*)(ptr-1) & 0x07

2. Write the next pointer (freelist construction)
   * Class 0, 7: written at base+0 → overwrites the header
   * Class 1-6: written at base+1 → header preserved

3. TLS SLL push
   g_tls_sll[class_idx].head = base
```

**How the problem surfaced**:
- **Phase 7**: direct TLS push (3 instructions) → 59-70M ops/s
- **commit b09ba4d40**: Box TLS-SLL introduced (150 lines) → 6-9M ops/s (-85~-93%!)
  - Reason: complex normalization and checks were added to avoid the Header/Next conflict
  - Cost: 10-20x overhead

---

### 2.3 Current Problems

#### Problem 1: TLS SLL and meta->used Are Out of Sync

**Symptoms**:
- `meta->used` (in-use block count) is not updated on TLS SLL push/pop
- `meta->used` stays high until the periodic drain (every 2048 frees)
- The slab looks "full" → it is not reused → memory leak
- On slab reuse, stale pointers remain in the TLS SLL → double-free

**Root cause**:
```c
// tiny_free_fast_v2.inc.h (fast path, 95-99%)
void* base = (char*)ptr - 1;
tls_sll_push(class_idx, base, UINT32_MAX);  // meta->used is not changed!

// free_local_box.c (slow path, 1-5% or periodic drain)
meta->used--;  // ← the only place it is decremented
```
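
The desynchronization can be reproduced with two counters. This is a toy model, not HAKMEM code: `slab_meta_t`, `tls_cache_t`, and the three functions are hypothetical, and it assumes every cached block belongs to the same slab.

```c
#include <stdint.h>

typedef struct { uint32_t used; uint32_t capacity; } slab_meta_t;  /* hypothetical */
typedef struct { uint32_t count; } tls_cache_t;                    /* hypothetical */

/* Fast-path free: the block goes to the TLS list; meta->used is untouched. */
static void fast_free(tls_cache_t *tls, slab_meta_t *meta) {
    (void)meta;
    tls->count++;
}

/* Drain: blocks move to the slab freelist and used is finally decremented. */
static void drain(tls_cache_t *tls, slab_meta_t *meta) {
    meta->used -= tls->count;
    tls->count = 0;
}

/* Blocks actually held by the application. */
static uint32_t live_blocks(const tls_cache_t *tls, const slab_meta_t *meta) {
    return meta->used - tls->count;
}
```

After every block of a 4-block slab is freed through `fast_free`, `live_blocks` is 0 but `meta->used` is still 4, so an emptiness test based on `meta->used == 0` fails until `drain` runs.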

**Impact**:
- Larson benchmark: double-free crashes (30%)
- Long-running processes: memory leak + performance degradation

#### Problem 2: Excessive Overhead of Box TLS-SLL

**Phase 7 (direct push)**:
```c
// 3 instructions, 5-10 cycles
*(void**)base = g_tls_sll_head[class_idx];  // 1 mov
g_tls_sll_head[class_idx] = base;           // 1 mov
g_tls_sll_count[class_idx]++;               // 1 inc
```

**Current (Box TLS-SLL)**:
```c
// 150 lines, 50-100 cycles (release), 100-1000 cycles (debug)
if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
    // Internally:
    // - Bounds check (duplicate)
    // - Capacity check (duplicate)
    // - User pointer check (35 lines, debug)
    // - Header restoration (5 lines)
    // - Double-free scan (O(n), 100-1000 cycles, debug)
    // - PTR_TRACK macros
    // - finally the push itself (3 instructions)
}
```

**Performance impact**:
- Phase 7: 59-70M ops/s
- Current: 6-9M ops/s (-85~-93%)
- The introduction of Box TLS-SLL (commit b09ba4d40) is the main cause

#### Problem 3: Design Contradiction

**Contradictions**:
- **The free path must read the header** (to determine class_idx)
- **This conflicts with the "header is written at alloc" design**
  - Alloc: write the header
  - Free: read the header → get class_idx
  - Freelist construction: write the next pointer → overwrites the header (C0, C7)
  - Next alloc: the header cannot be read! → Registry lookup fallback (50-100 cycles)

**Failure of Phase E3-1**:
- Intent: remove the Registry lookup → +226-443% improvement
- Reality: the Registry lookup is on the slow path (1-5%), not on the fast path (95-99%)
- Result: unrelated overhead was added → -10~-38% regression

---

## Part 3: Questionnaire for ChatGPT

### 3.1 Architecture in General

#### Q1: Validity of the 4-Layer Design

**Background**:
- Layer 0 (Tiny): TLS-centric, ≤1KB
- Layer 1 (ACE): pool-centric, 1KB~2MB, learning features
- Layer 2 (Big): mmap-centric, ≥2MB
- Layer 3 (Policy): learning infrastructure, policy management

**Questions**:
1. Is this 4-layer separation appropriate? Should any layers be merged?
2. Are the responsibility boundaries of each layer clear? Is there any overlap?
3. What are the pros and cons of making the learning layer (Layer 3) independent?

**Expected answers**:
- Design principles for layer separation (SoC, SRP, etc.)
- Comparison with similar allocators (jemalloc, mimalloc, tcmalloc)
- Positioning of the learning layer (cross-cutting concern vs vertical integration)

---

#### Q2: Inter-Layer Interface Design

**Background**:
- Layer 0 ↔ Layer 1: does `malloc(1024)` belong to Layer 0 or Layer 1?
- FrozenPolicy: produced by Layer 3, read by Layers 0/1/2
- Learning metrics: collected by Layers 0/1/2, consumed by Layer 3

**Questions**:
1. Is the handling of the boundary size (1KB) appropriate? Should overlap be allowed?
2. Is the RCU-like update of FrozenPolicy sound? Is it safe to have no grace period?
3. Is the metrics collection interface appropriate? (push vs pull)

**Expected answers**:
- Design patterns for boundary alignment
- How to implement RCU correctly (grace period, quiescent state)
- Metrics pipeline design (sampling, aggregation, backpressure)

---

### 3.2 TLS SLL Design

#### Q3: Workarounds for the Header/Next Conflict

**Background**:
- Class 0 (8B): a next pointer physically does not fit
- Class 7 (2048B): offset 0 chosen to protect user data
- Class 1-6 (16B~512B): the header can be preserved, but it interferes with the next pointer while on the freelist

**Current solution**:
- Phase E1-CORRECT: C0/C7 use offset 0 (header clobbered), C1-6 use offset 1 (header preserved)
- Box TLS-SLL: complex normalization and checks (150 lines, 50-100 cycles)

**Questions**:
1. **Alternative design**: store the header in a separate region (slab metadata)?
2. **Trade-off**: memory overhead (1B/block) vs CPU overhead (50-100 cycles)?
3. **Solutions in other allocators**: how do jemalloc, mimalloc, and tcmalloc solve this?

**Expected answers**:
- Design patterns for header placement (in-band vs out-of-band)
- Metadata storage strategies (embed vs separate)
- The right balance between performance and safety
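
The out-of-band alternative in question 1 corresponds to the `class_map` described in the commit message at the top of this page: a `uint8_t` per slab, initialized to 255 (UNASSIGNED), mapping `slab_idx` to `class_idx`, with a fallback to the header read on an invalid value. The following is a minimal sketch of that lookup. The struct layout, the value 32 for the slab count, and the `base` field are assumptions; only the map itself and the 64KB slab size come from the document.

```c
#include <stdint.h>
#include <string.h>

#define SLABS_PER_SS     32     /* assumed; stands in for SLABS_PER_SUPERSLAB_MAX */
#define SLAB_SHIFT       16     /* 64KB slabs */
#define CLASS_UNASSIGNED 255u

typedef struct {
    uintptr_t base;                      /* start address of slab 0 (assumed layout) */
    uint8_t   class_map[SLABS_PER_SS];   /* slab_idx -> class_idx */
} superslab_t;

/* Mark every slab as unassigned on SuperSlab creation. */
static void superslab_init(superslab_t *ss, uintptr_t base) {
    ss->base = base;
    memset(ss->class_map, CLASS_UNASSIGNED, sizeof ss->class_map);
}

/* Out-of-band lookup: derive slab_idx from the address, then read the map.
 * Returns -1 when the slab is unassigned or ptr is out of range, so the
 * caller can fall back to the in-band header read. */
static int class_from_map(const superslab_t *ss, uintptr_t ptr) {
    if (ptr < ss->base) return -1;
    uintptr_t slab_idx = (ptr - ss->base) >> SLAB_SHIFT;
    if (slab_idx >= SLABS_PER_SS) return -1;
    uint8_t c = ss->class_map[slab_idx];
    return (c == CLASS_UNASSIGNED) ? -1 : (int)c;
}
```

Because the class now lives outside the block, overwriting byte 0 with a next pointer no longer destroys the information the free path needs.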

---

#### Q4: Synchronization Design Between TLS SLL and meta->used

**Background**:
- Fast path (95-99%): TLS SLL push/pop only → `meta->used` is not updated
- Slow path (1-5%): `tiny_free_local_box()` → `meta->used--`
- Periodic drain (every 2048 frees): TLS SLL → freelist → `meta->used--`

**Problems**:
- The slab-empty check (`meta->used == 0`) is delayed
- Stale pointers remain in the TLS SLL → double-free

**Questions**:
1. **Synchronization strategy**:
   - Option A: update `meta->used` on the fast path as well (atomic cost)
   - Option B: include the TLS SLL count in `meta->used` (added complexity)
   - Option C: drain more frequently (CPU overhead)
2. **Other allocators**: how do tcache (glibc) and mimalloc synchronize their TLS caches?
3. **Best practice**: synchronization patterns between a lock-free TLS cache and shared metadata

**Expected answers**:
- Choice of consistency model (eventual vs strong)
- Trade-off between performance and correctness
- Proven solutions

---

#### Q5: How to Use Thread-Local Storage

**Background**:
- TLS SLL: unbounded capacity, lock-free
- TLS Active Slab: allocated only by the owner thread, lockless
- MPSC remote queue: cross-thread free

**Questions**:
1. **Validity of the TLS design**:
   - Is the double cache of SLL + Active Slab redundant? Should they be merged?
   - Pros and cons of Magazine (fixed capacity) vs SLL (unbounded)
2. **Remote free strategy**:
   - Is an MPSC queue appropriate? Would SPSC (per-thread) be better?
   - Drain timing: periodic vs threshold vs on-demand
3. **Comparison with other allocators**:
   - jemalloc: thread cache
   - mimalloc: thread-local heap
   - tcmalloc: per-thread cache

**Expected answers**:
- Design principles for TLS caches (size, structure, eviction)
- Best practices for remote free (queue type, drain strategy)
- A suitable multi-tier cache configuration

---

### 3.3 Learning Layer Design

#### Q6: Choice of Learning Algorithms

**Background**:
- CAP learning: hit-rate-based PID control
- W_MAX learning: UCB1 bandit + canary deployment
- DYN1 learning: size histogram + peak detection

**Questions**:
1. **Validity of PID control**:
   - Current: P control only (proportional term)
   - Should we add I control (integral term) or D control (derivative term)?
   - What is the over-/undershoot risk?
2. **Why UCB1**:
   - Comparison with Thompson Sampling and epsilon-greedy
   - How to tune the exploration/exploitation balance
3. **Risks of online learning**:
   - Mislearning (noise, workload shift)
   - Oscillation
   - Failure to converge (divergence)

**Expected answers**:
- Design patterns for adaptive control (PID, MPC, RL)
- Selection criteria for bandit algorithms
- Safety measures for online learning (bounds, damping, rollback)

---

#### Q7: Online Learning vs Offline Learning

**Background**:
- Current: online learning (continuous adjustment at run time)
- Alternative: offline learning (profile → optimize → freeze)

**Questions**:
1. **Pros and cons of online learning**:
   - Pro: adapts to workload changes
   - Con: CPU overhead, risk of mislearning
2. **Hybrid approach**:
   - LEARN → FROZEN → CANARY (already implemented)
   - Possible improvements: convergence detection, re-learning triggers
3. **Production deployment**:
   - When should learning mode be used? (development vs production)
   - Integration with Profile-Guided Optimization (PGO)

**Expected answers**:
- Where adaptive systems apply (when to use online learning)
- Techniques for convergence detection
- Patterns for integrating with PGO

---

#### Q8: How to Use the Learning Results

**Background**:
- FrozenPolicy: learning results are published in an RCU-like manner
- Hot path: takes a snapshot with `hkm_policy_get()` and reads it read-only

**Questions**:
1. **Validity of the RCU-like update**:
   - No grace period → is that a problem?
   - Risk of stale reads (referencing an old policy)
2. **Policy versioning**:
   - Is a generation counter enough?
   - What is the risk of the ABA problem?
3. **Hot path overhead**:
   - Cost of `hkm_policy_get()` (assuming a cache hit)
   - Inlining strategy

**Expected answers**:
- A correct RCU implementation (grace period, quiescent state, memory barriers)
- Design of a versioning scheme
- Hot path optimization techniques

---

### 3.4 Memory Layout

#### Q9: Header Design

**Background**:
- 1-byte header: `0xA0 | class_idx`
- Position: block base (user pointer - 1)
- Contents: magic (0xA0) + class index (0-7)

**Questions**:
1. **Validity of the header size**:
   - 1B → 0.1% overhead (1KB block)
   - 2B → more info (flags, ownership, etc.)?
2. **Need for the magic number**:
   - Is 0xA0 enough for corruption detection?
   - Would a CRC/checksum be overkill?
3. **In-band vs out-of-band**:
   - Current: in-band (embedded in user data)
   - Alternative: out-of-band (stored in slab metadata)

**Expected answers**:
- Header design patterns (size, content, placement)
- Corruption detection strategies
- Trade-off analysis (memory vs CPU)

---

#### Q10: Coexistence with the Next Pointer

**Background**:
- Next pointer: used to build the freelist (8B)
- Header: allocation metadata (1B)
- Conflict: Classes 0/7 use offset 0 (header overwritten)

**Questions**:
1. **Evaluation of the coexistence strategy**:
   - Current: C0/C7 use offset 0, C1-6 use offset 1
   - Alternative A: all classes use offset 0 (header in a separate region)
   - Alternative B: all classes use offset 1 (C0 promoted to 16B)
2. **Solutions in other allocators**:
   - jemalloc: chunk header (separate region)
   - mimalloc: page header (separate region)
   - tcmalloc: span metadata (separate region)
3. **Performance impact**:
   - Cost of header reads/writes
   - Cache line efficiency

**Expected answers**:
- Best practices for metadata placement
- Evaluation of performance vs simplicity
- Industry standard solutions

---

#### Q11: Size Class Design

**Background**:
- Tiny (L0): 8 classes (8, 16, 32, 64, 128, 256, 512, 2048B)
- Mid (L1): 5 classes (2, 4, 8, 16, 32KB) + DYN1/DYN2
- Large (L1): 5 classes (64, 128, 256, 512KB, 1MB)

**Questions**:
1. **Validity of class granularity**:
   - Tiny: power-of-2 → internal fragmentation (up to 50%)
   - Mid/Large: fewer classes → round-up allowed (W_MAX)
2. **Dynamic size classes (DYN1/DYN2)**:
   - Fill gaps via peak detection
   - Risks: overfitting, class explosion
3. **Comparison with other allocators**:
   - jemalloc: 40+ classes (fine-grained)
   - mimalloc: 74 classes (very fine-grained)
   - tcmalloc: ~60 classes

**Expected answers**:
- Principles of size class design (granularity, coverage, fragmentation)
- Where dynamic sizing applies
- Industry standard configurations
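
The "up to 50%" internal fragmentation figure for power-of-2 classes can be checked with a few lines of arithmetic. This is a minimal sketch, assuming pure power-of-2 classes starting at 8 bytes and ignoring the 1-byte header; `pow2_class` and `internal_waste` are illustrative names.

```c
#include <stddef.h>

/* Smallest power-of-two class >= size, starting at 8 bytes. */
static size_t pow2_class(size_t size) {
    size_t c = 8;
    while (c < size) c <<= 1;
    return c;
}

/* Bytes wasted inside the block for a given request size. */
static size_t internal_waste(size_t size) {
    return pow2_class(size) - size;
}
```

The worst case is a request one byte above a class boundary: a 513-byte request lands in the 1024-byte class and wastes 511 bytes, just under 50% of the block.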

---

### 3.5 Concurrency

#### Q12: Trade-offs of Lock-Free Design

**Background**:
- TLS SLL: lock-free (thread-local only)
- MPSC remote queue: lock-free (atomic CAS)
- Slab freelist: mutex-protected (per-class lock)

**Questions**:
1. **Scope of lock-free**:
   - Current: TLS hot path only
   - Possible extension: make Shared Pool refill lock-free as well?
2. **ABA problem countermeasures**:
   - MPSC queue: pointer only (no counter)
   - Risk: impact when ABA occurs
3. **Performance vs complexity**:
   - Is lock-free really faster? (cache ping-pong risk)
   - When is a mutex (futex) sufficient?

**Expected answers**:
- Criteria for applying lock-free design
- Solutions to the ABA problem (tagged pointer, epoch-based GC)
- When to use lock-free vs lock-based
|
|
|
|
|
|
|
|
|
|
|
|
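
As a reference point for the ABA question, here is a minimal C11 sketch of one common MPSC shape: producers push with a CAS loop and the single consumer detaches the whole list with one atomic exchange. Because nothing ever pops a single node with CAS, a pointer-only head is not exposed to the classic ABA failure. Whether hakmem's remote queue drains this way is an assumption; `RemoteNode`, `RemoteQueue`, `remote_push`, `remote_drain`, and `remote_count` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct RemoteNode {
    struct RemoteNode *next;
} RemoteNode;

typedef struct {
    _Atomic(RemoteNode *) head;
} RemoteQueue;

/* Any thread: link the node in front of the current head. */
static void remote_push(RemoteQueue *q, RemoteNode *n) {
    RemoteNode *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        n->next = old; /* `old` is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
        &q->head, &old, n, memory_order_release, memory_order_relaxed));
}

/* Owner thread only: take the entire list in one step (LIFO order). */
static RemoteNode *remote_drain(RemoteQueue *q) {
    return atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
}

/* Length of a drained list, which is private to the consumer. */
static size_t remote_count(const RemoteNode *n) {
    size_t count = 0;
    for (; n != NULL; n = n->next) {
        count++;
    }
    return count;
}
```

If a design needs CAS-based single-node pop instead, the tagged pointer or epoch schemes named above become relevant.
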
---

#### Q13: Sharing data between threads

**Background**:
- FrozenPolicy: read by all threads (1 writer, N readers)
- SuperSlab registry: referenced by all threads (for lookup)
- MPSC remote queue: cross-thread free

**Questions**:
1. **Optimizing read-heavy shared data**:
   - FrozenPolicy: is RCU appropriate, or would a seqlock be better?
   - Registry: lock-free hash table vs mutex-protected
2. **Avoiding write contention**:
   - Learner thread: is the write frequency (once per second) appropriate?
   - Can batch updates reduce contention?
3. **Cache coherence overhead**:
   - False sharing risk (padding strategy)
   - NUMA awareness (future work)

**Expected answer**:
- Optimization techniques for read-heavy workloads
- Design patterns for write contention
- Cache coherence considerations

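
For the single-writer, many-reader case, one option worth comparing against a seqlock is publishing immutable snapshots through an atomic pointer, which is the read side of an RCU-style design. The sketch below is a minimal illustration under that assumption; `PolicySnapshot`, its fields, `policy_publish`, and `policy_current` are hypothetical and do not describe the actual FrozenPolicy layout. Reclaiming old snapshots is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Immutable once published; fields are placeholders for illustration. */
typedef struct {
    unsigned version;
    size_t w_max_percent;
} PolicySnapshot;

/* Zero-initialized static storage, so it starts out as NULL. */
static _Atomic(const PolicySnapshot *) g_policy;

/* Writer (e.g. a learner thread): fill in a snapshot, then publish it.
 * Release ordering makes the snapshot contents visible to readers. */
static void policy_publish(const PolicySnapshot *snap) {
    atomic_store_explicit(&g_policy, snap, memory_order_release);
}

/* Readers: a single acquire load, with no lock and no retry loop. */
static const PolicySnapshot *policy_current(void) {
    return atomic_load_explicit(&g_policy, memory_order_acquire);
}
```

Readers pay one load per access, while the writer must keep a retired snapshot alive until no reader can still hold it, which is where RCU grace periods or epochs would come in.
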
---

#### Q14: Patterns for avoiding race conditions

**Background**:
- Inconsistency between meta->used and the TLS SLL count
- TLS SLL not cleared on slab reuse
- Double-free detection (debug mode)

**Questions**:
1. **Root cause of the current bugs**:
   - Is the meta->used synchronization problem a design flaw or an implementation bug?
   - What is the right synchronization model? (eventual consistency vs strong consistency)
2. **Double-free prevention**:
   - Is the TLS SLL duplicate scan (debug) sufficient?
   - Is detection also needed in release mode? (Sanitizer dependency)
3. **Testing strategy**:
   - How to reproduce race conditions
   - Limits of ThreadSanitizer (TSan)

**Expected answer**:
- Classification of concurrency bugs (data race, deadlock, ABA, etc.)
- Prevention strategies (design patterns, tools)
- Testing methodology (stress test, model checking, formal verification)

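
The debug-mode duplicate scan mentioned above can be pictured as a bounded walk of the thread-local list before each push. This is a sketch of the general technique with hypothetical names (`SllNode`, `TlsSll`, `sll_contains`, `sll_push_checked`), not the code in `tls_sll_box.h`.

```c
#include <assert.h>
#include <stddef.h>

typedef struct SllNode {
    struct SllNode *next;
} SllNode;

typedef struct {
    SllNode *head;
    size_t count;
} TlsSll;

/* Walk at most max_scan nodes looking for p. */
static int sll_contains(const TlsSll *list, const void *p, size_t max_scan) {
    const SllNode *n = list->head;
    for (size_t i = 0; n != NULL && i < max_scan; i++, n = n->next) {
        if ((const void *)n == p) {
            return 1;
        }
    }
    return 0;
}

/* Returns 0 on success, -1 if p was found in the scanned prefix,
 * which indicates a double free. */
static int sll_push_checked(TlsSll *list, void *p, size_t max_scan) {
    assert(p != NULL);
    if (sll_contains(list, p, max_scan)) {
        return -1;
    }
    SllNode *n = (SllNode *)p;
    n->next = list->head;
    list->head = n;
    list->count++;
    return 0;
}
```

The scan only covers the pushing thread's own list and only its first `max_scan` entries, so a block freed twice by different threads, or one buried deeper in the list, passes unnoticed. That gap is what the release-mode question is about.
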
---

## Part 4: Summary of current problems

### 4.1 Header/Next conflict

**Problem details**:
1. **Design contradiction**:
   - Free path: must read the header (to determine class_idx)
   - Freelist construction: writing the next pointer overwrites the header (C0, C7)
   - Next alloc: the header is unreadable → registry lookup (50-100 cycles)

2. **Fix attempts so far**:
   - Phase E1-CORRECT: C0/C7 use offset 0, C1-6 use offset 1
   - Box TLS-SLL: complex normalization (150 lines, 50-100 cycles)
   - Phase E3-1: attempted to remove the registry lookup → failed (-10% to -38%)

3. **Root cause**:
   - **Physical constraint** (C0: an 8B block cannot hold both the 1B header and an 8B pointer)
   - **Design choice** (C7: protecting user data vs keeping the header)
   - **Unresolved trade-off** (memory overhead vs CPU overhead)

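
The physical constraint on C0 is plain byte counting, shown here as a small sketch. The constants follow the sizes stated in this document (1B header, 8B next pointer); the function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

enum { HEADER_BYTES = 1, NEXT_PTR_BYTES = 8 };

/* Nonzero when a free block can hold the header and the next pointer
 * side by side, i.e. next can live at offset 1. */
static int header_and_next_fit(size_t block_size) {
    return block_size >= (size_t)(HEADER_BYTES + NEXT_PTR_BYTES);
}

/* Smallest power-of-two block size for which both fields fit. */
static size_t min_coexisting_block(void) {
    size_t size = 8;
    while (!header_and_next_fit(size)) {
        size <<= 1;
    }
    return size;
}
```

An 8B block falls one byte short of the 9B needed, and 16B is the first power-of-two size that fits. The C7 case is different in kind: the block is large enough, so offset 0 there is a policy decision, not a capacity limit.
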
---

### 4.2 Synchronization between TLS SLL and meta->used

**Problem details**:
1. **Fast path (95-99%)**: TLS SLL push/pop only → meta->used is not updated
2. **Slow path (1-5% or drain)**: tiny_free_local_box() → meta->used--
3. **Delayed slab-empty detection**: meta->used stays high → the slab is not reused → leak
4. **TLS SLL not cleared**: on slab reuse, stale pointers remain → double-free

**Fix attempts**:
- Adjusted the periodic drain interval (2048 → 512 frees)
- Adaptive drain (higher frequency for hot classes)
- TLS SLL clear on slab reuse (debug mode)

**Root cause**:
- **Design-level problem**: the fast path is decoupled from meta->used
- **Undefined synchronization model**: eventual consistency? Strong consistency?

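
The drift between the two counters can be reproduced with a toy model that tracks only counts. It is a deliberate simplification: allocation always comes from the slab, and `SlabModel` and the `model_*` functions are hypothetical stand-ins for the real structures.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t meta_used;  /* blocks the slab metadata believes are live */
    size_t tls_cached; /* freed blocks parked in a thread's TLS SLL */
} SlabModel;

/* Slab allocation: metadata is updated. */
static void model_alloc(SlabModel *s) {
    s->meta_used++;
}

/* Fast-path free: only the TLS list grows; meta_used is untouched. */
static void model_free_fast(SlabModel *s) {
    assert(s->tls_cached < s->meta_used);
    s->tls_cached++;
}

/* Drain: cached blocks go back to the slab and the metadata catches up. */
static void model_drain(SlabModel *s) {
    s->meta_used -= s->tls_cached;
    s->tls_cached = 0;
}

/* What the slab-empty check sees. */
static int model_looks_empty(const SlabModel *s) {
    return s->meta_used == 0;
}

/* What is actually still held by the application. */
static size_t model_truly_live(const SlabModel *s) {
    return s->meta_used - s->tls_cached;
}
```

After three allocations and three fast-path frees, `model_truly_live` is 0 but `model_looks_empty` stays false until `model_drain` runs. That window is the delayed empty detection described in item 3.
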
---

### 4.3 30% crash rate even in production builds

**Symptoms**:
- Larson benchmark: 30% crash rate (double-free)
- Also occurs in release builds → a real bug, not an artifact of diagnostic code
- Highly reproducible in multi-threaded environments

**Causes**:
1. Synchronization mismatch between TLS SLL and meta->used (see 4.2 above)
2. Excessive Box TLS-SLL overhead → performance degradation → workload timeout → invalid state

**Scope of impact**:
- **Confirmed**: Larson benchmark (high alloc/free rate, multi-thread)
- **Potential**: production workloads (long-running, memory leak)

---

## Part 5: Expected deliverables

### 5.1 Design direction

**Answers requested from ChatGPT**:
1. **Resolving the Header/Next conflict**:
   - Best practices (implementations in other allocators)
   - Trade-off analysis (memory vs CPU)
   - Implementation options (in-band vs out-of-band)

2. **TLS SLL synchronization design**:
   - Appropriate synchronization model (eventual vs strong)
   - Estimated performance overhead
   - Implementation patterns (atomic, RCU, etc.)

3. **Improving the learning layer**:
   - PID control tuning (adding I/D terms)
   - Convergence detection techniques
   - Production deployment strategy

4. **Concurrency optimization**:
   - Scope of lock-free techniques
   - Race condition avoidance patterns
   - Testing strategy

---

### 5.2 Concrete implementation proposals

**Expected format**:
```
[Proposal] Move the header out-of-band

[Rationale]
- jemalloc/mimalloc manage chunk/page headers in a separate region
- An in-band header conflicts with the next pointer (C0/C7)
- Out-of-band avoids the conflict and lets the fast path be simplified

[Design]
Add a class_idx map to SuperSlab:
  uint8_t class_map[256];  // slab_idx → class_idx

Lookup:
  slab_idx = (ptr - ss->base) / SLAB_SIZE
  class_idx = ss->class_map[slab_idx]
  // Cost: 2 memory access (slab_idx calc + map lookup)
  // vs Current: 1 memory access (header read)

[Trade-off]
- Memory: +256B/SuperSlab (acceptable)
- CPU: +1 memory access (~3-5 cycles)
- Benefit: no conflict with the next pointer → Box TLS-SLL simplified → -50 cycles

[Net] +3 cycles (lookup) - 50 cycles (Box overhead) = -47 cycles (improvement)
```

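
The lookup in the sample proposal can be written out as a small C sketch. It rests on several assumptions: `SLAB_SIZE` is set to 64 KiB purely for illustration, addresses are modelled as `uintptr_t` so no real mapping is needed, and `SuperSlabSketch`, `ss_init`, and `ss_class_of` are hypothetical names. The 255 sentinel for unassigned slabs follows the convention described in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    SLAB_SIZE = 64 * 1024, /* assumed value, for illustration only */
    SLABS_PER_SUPERSLAB_MAX = 256,
    CLASS_UNASSIGNED = 255
};

typedef struct {
    uintptr_t base; /* address of slab 0 */
    uint8_t class_map[SLABS_PER_SUPERSLAB_MAX]; /* slab_idx -> class_idx */
} SuperSlabSketch;

/* Mark every slab as unassigned when the SuperSlab is created. */
static void ss_init(SuperSlabSketch *ss, uintptr_t base) {
    ss->base = base;
    memset(ss->class_map, CLASS_UNASSIGNED, sizeof ss->class_map);
}

/* Returns class_idx, or -1 when ptr is out of range or its slab is
 * unassigned; the caller would then fall back to the header read. */
static int ss_class_of(const SuperSlabSketch *ss, uintptr_t ptr) {
    if (ptr < ss->base) {
        return -1;
    }
    size_t slab_idx = (size_t)((ptr - ss->base) / SLAB_SIZE);
    if (slab_idx >= SLABS_PER_SUPERSLAB_MAX) {
        return -1;
    }
    uint8_t class_idx = ss->class_map[slab_idx];
    return class_idx == CLASS_UNASSIGNED ? -1 : (int)class_idx;
}
```

With a power-of-two `SLAB_SIZE` the division compiles to a shift, so the cost is dominated by the load of `base` and the load from `class_map`.
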
---

### 5.3 Prioritized roadmap

**Expected format**:
```
[P0 - Fix immediately] (1 week)
1. TLS SLL clear on slab reuse (defensive fix)
2. Adjust the periodic drain interval (adaptive mode)

[P1 - Short-term fixes] (2-3 weeks)
3. Move the header out-of-band (design change)
4. Simplify Box TLS-SLL (recover performance)

[P2 - Mid-term improvements] (1-2 months)
5. PID control tuning (adding I/D terms)
6. Implement convergence detection

[P3 - Long-term optimization] (3+ months)
7. NUMA awareness
8. Profile-Guided Optimization integration
```

---

## Attachments

### Related file paths

**Layer 0 (Tiny Pool)**:
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny.{c,h}`
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.{c,h}`
- `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h`
- `/mnt/workdisk/public_share/hakmem/core/tiny_region_id.h`
- `/mnt/workdisk/public_share/hakmem/core/tiny_nextptr.h`
- `/mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h`

**Layer 1 (ACE)**:
- `/mnt/workdisk/public_share/hakmem/core/hakmem_ace.{c,h}`
- `/mnt/workdisk/public_share/hakmem/core/hakmem_learner.{c,h}`
- `/mnt/workdisk/public_share/hakmem/core/hakmem_pool.{c,h}`
- `/mnt/workdisk/public_share/hakmem/core/hakmem_l25_pool.{c,h}`
- `/mnt/workdisk/public_share/hakmem/core/hakmem_ace_controller.{c,h}`

**Layer 2 (Big Cache)**:
- `/mnt/workdisk/public_share/hakmem/core/hakmem_bigcache.{c,h}`
- `/mnt/workdisk/public_share/hakmem/core/hakmem_batch.{c,h}`

**Layer 3 (Policy)**:
- `/mnt/workdisk/public_share/hakmem/core/hakmem_policy.{c,h}`
- `/mnt/workdisk/public_share/hakmem/core/hakmem_p2.{c,h}`
- `/mnt/workdisk/public_share/hakmem/core/hakmem_elo.{c,h}`

**Investigation reports**:
- `/mnt/workdisk/public_share/hakmem/docs/analysis/TLS_SLL_ARCHITECTURE_INVESTIGATION.md`
- `/mnt/workdisk/public_share/hakmem/docs/status/PHASE_E3-1_SUMMARY.md`
- `/mnt/workdisk/public_share/hakmem/BOX_THEORY_LAYER_DIAGRAM.txt`
- `/mnt/workdisk/public_share/hakmem/REFACTOR_ARCHITECTURE_DIAGRAM.txt`

---

**End of document**