
Moe Charm (CI) 5cc1f93622 Phase 13-A Step 1: TinyHeapV2 NO-REFILL L0 cache implementation

Implement TinyHeapV2 as a minimal "lucky hit" L0 cache that avoids
circular dependency with FastCache by eliminating self-refill.

Key Changes:
- New: core/front/tiny_heap_v2.h - NO-REFILL L0 cache implementation
  - tiny_heap_v2_alloc(): Pop from magazine if available, else return NULL
  - tiny_heap_v2_refill_mag(): No-op stub (no backend refill)
  - ENV: HAKMEM_TINY_HEAP_V2=1 to enable
  - ENV: HAKMEM_TINY_HEAP_V2_CLASS_MASK=bitmask (C0-C3 control)
  - ENV: HAKMEM_TINY_HEAP_V2_STATS=1 to print statistics
- Modified: core/hakmem_tiny_alloc_new.inc - Add TinyHeapV2 hook
  - Hook at entry point (after class_idx calculation)
  - Fallback to existing front if TinyHeapV2 returns NULL
- Modified: core/hakmem_tiny_alloc.inc - Add hook for legacy path
- Modified: core/hakmem_tiny.c - Add TLS variables and stats wrapper
  - TinyHeapV2Mag: Per-class magazine (capacity=16)
  - TinyHeapV2Stats: Per-class counters (alloc_calls, mag_hits, etc.)
  - tiny_heap_v2_print_stats(): Statistics output at exit
- New: TINY_HEAP_V2_TASK_SPEC.md - Phase 13 specification

Root Cause Fixed:
- BEFORE: TinyHeapV2 refilled from FastCache → circular dependency
  - TinyHeapV2 intercepted all allocs → FastCache never populated
  - Result: 100% backend OOM, 0% hit rate, 99% slowdown
- AFTER: TinyHeapV2 is passive L0 cache (no refill)
  - Magazine empty → return NULL → existing front handles it
  - Result: 0% overhead, stable baseline performance

A/B Test Results (100K iterations, fixed-size bench):
- C1 (8B):  Baseline 9,688 ops/s → HeapV2 ON 9,762 ops/s (+0.76%)
- C2 (16B): Baseline 9,804 ops/s → HeapV2 ON 9,845 ops/s (+0.42%)
- C3 (32B): Baseline 9,840 ops/s → HeapV2 ON 9,814 ops/s (-0.26%)
- All within noise range: NO PERFORMANCE REGRESSION ✅

Statistics (HeapV2 ON, C1-C3):
- alloc_calls: 200K (hook works correctly)
- mag_hits: 0 (0%) - Magazine empty as expected
- refill_calls: 0 - No refill executed (circular dependency avoided)
- backend_oom: 0 - No backend access

Next Steps (Phase 13-A Step 2):
- Implement magazine supply strategy (from existing front or free path)
- Goal: Populate magazine with "leftover" blocks from existing pipeline

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-15 01:42:57 +09:00

Tiny Heap v2 (T‑HEAP) – Task Spec for Claude Code

Date: 2025‑11‑14
Owner: Claude Code (Tiny Phase 13)
Status: Draft – ready for implementation

1. Background and Goals

Current state

As of Phase 12:
- The Shared SuperSlab Pool + SP‑SLOT Box is complete (multi‑class sharing, 92% fewer Superslabs).
- TLS SLL drain and lock‑free improvements have largely eliminated Superslab churn, futex contention, and races.
- Mid‑Large (8–32KB) goes through Pool TLS and is faster than System malloc (~10M ops/s).
Tiny (16–1024B):
- Structural bugs are mostly fixed, but on random_mixed / Larson there is still a large gap versus mimalloc / System.
- The Tiny front and back are cleanly separated by Boxes, but the shared pool / drain / TLS SLL layers are thick, so the path is not as simple as a per‑thread heap.

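Objective

Introduce a per‑thread heap flavor for Tiny (Tiny Heap v2 / T‑HEAP) in order to:

- Substantially raise performance on Tiny‑heavy workloads (random_mixed, Larson).
- Keep using the existing HAKMEM learning layer and Superslab management through a "thin box", at minimal cost.
- Leave the core structure (SP‑SLOT, drain, mid‑large, LD_PRELOAD support) intact and ship the new mode so that it can be A/B tested.

Target picture:

Tiny random_mixed / Larson:
- Current: ~8–9M ops/s range
- Target: 15–20M ops/s range (25–40% of System)
- Stretch: 20M+ (left to a later phase)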

2. Current implementation state (the TinyHeapV2 skeleton)

This repository already contains an experimental tiny heap v2 Box, but only as a skeleton.

2.1 Files and symbols already added

core/front/tiny_heap_v2.h
- ENV gate:
  - tiny_heap_v2_enabled():
    - Reads HAKMEM_TINY_HEAP_V2 to decide ON/OFF (result cached in TLS).
- TLS magazine type:
  - TinyHeapV2Mag:
    - void* items[TINY_HEAP_V2_MAG_CAP];
    - int top;
  - TINY_HEAP_V2_MAG_CAP is currently 16.
- TLS instance:
  - extern __thread TinyHeapV2Mag g_tiny_heap_v2_mag[TINY_NUM_CLASSES];
- Helpers:
  - tiny_heap_v2_refill_mag(int class_idx):
    - Fills the magazine with up to TINY_HEAP_V2_MAG_CAP entries, trying FastCache first and then the backend (tiny_alloc_fast_refill).
  - tiny_heap_v2_alloc(size_t size):
    - size → class_idx (hak_tiny_size_to_class).
    - Only classes 0–3 are handled.
    - magazine pop → refill → magazine pop → NULL on failure.
core/hakmem_tiny.c
- Include added:
  - #include "front/tiny_heap_v2.h"
- TLS definition added:
  - __thread TinyHeapV2Mag g_tiny_heap_v2_mag[TINY_NUM_CLASSES];
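
The pop → refill → pop flow listed above can be sketched roughly as follows. This is a minimal illustration, not the code in core/front/tiny_heap_v2.h: the `demo_*` names, the value of `TINY_NUM_CLASSES`, the function-pointer refill source, and the static arena are all assumptions made so the sketch is self-contained. The real helper takes a size and refills from FastCache / tiny_alloc_fast_refill.

```c
#include <assert.h>
#include <stddef.h>

#define TINY_NUM_CLASSES      8   /* assumed value, for illustration only */
#define TINY_HEAP_V2_MAG_CAP 16   /* capacity stated in the spec */

typedef struct {
    void* items[TINY_HEAP_V2_MAG_CAP];
    int   top;                    /* number of valid entries; 0 means empty */
} TinyHeapV2Mag;

/* Thread-local, zero-initialized: every thread starts with top == 0. */
static __thread TinyHeapV2Mag g_tiny_heap_v2_mag[TINY_NUM_CLASSES];

/* Hypothetical refill source standing in for FastCache / the backend. */
typedef void* (*demo_refill_fn)(int class_idx);

static void* demo_mag_pop(int class_idx) {
    TinyHeapV2Mag* mag = &g_tiny_heap_v2_mag[class_idx];
    return (mag->top > 0) ? mag->items[--mag->top] : NULL;
}

/* Pull blocks from the source until the magazine is full or the source is dry. */
static int demo_mag_refill(int class_idx, demo_refill_fn src) {
    TinyHeapV2Mag* mag = &g_tiny_heap_v2_mag[class_idx];
    int added = 0;
    while (mag->top < TINY_HEAP_V2_MAG_CAP) {
        void* p = src(class_idx);
        if (!p) break;
        mag->items[mag->top++] = p;
        added++;
    }
    return added;
}

/* magazine pop -> refill -> magazine pop -> NULL on failure. */
static void* demo_heap_v2_alloc(int class_idx, demo_refill_fn src) {
    if (class_idx < 0 || class_idx > 3) return NULL;   /* classes 0-3 only */
    void* p = demo_mag_pop(class_idx);
    if (p) return p;
    if (demo_mag_refill(class_idx, src) == 0) return NULL;
    return demo_mag_pop(class_idx);
}

/* Toy sources used only to exercise the sketch. */
static char demo_arena[64][32];
static int  demo_arena_used;

static void* demo_arena_source(int class_idx) {
    (void)class_idx;
    return (demo_arena_used < 64) ? (void*)demo_arena[demo_arena_used++] : NULL;
}

static void* demo_empty_source(int class_idx) {
    (void)class_idx;
    return NULL;
}
```

With `demo_empty_source` the allocation returns NULL and the caller falls through to the existing front, which matches the passive behavior described in the commit message above.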

core/tiny_alloc_fast.inc.h

tiny_alloc_fast() contains a commented‑out hook:

Current state (after being commented out):

```
// Experimental Tiny heap v2 front (Box T-HEAP) is currently disabled
// due to instability under shared SuperSlab pool. Keep the hook here
// commented out for future experimentation.
// if (__builtin_expect(tiny_heap_v2_enabled(), 0)) {
//     void* base = tiny_heap_v2_alloc(size);
//     if (base) {
//         HAK_RET_ALLOC(class_idx, base);
//     }
// }
```

2.2 Why it is currently disabled

When tiny_heap_v2_alloc was enabled and bench_random_mixed_hakmem was run:
- A SEGV occurred inside shared_pool_acquire_slab().
- Exhaustion of the SP‑SLOT lock‑free node pool ([P0-4 WARN] Node pool exhausted for class 7) was involved at the same time.
A later fix makes node pool exhaustion fall back to the legacy mutex‑protected free list, but:
- there was not enough time to validate the TinyHeapV2 path in combination with the shared pool, so
- for now the hook stays commented out, putting safety first.

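3. Phase 13 Tiny Heap v2 – Concrete tasks

Everything from here on is work to be handed to Claude Code.
It is split into three broad phases.

Phase 13‑A: Stabilize TinyHeapV2 (harden the existing skeleton)

Goal: reach a state where enabling tiny_heap_v2_alloc causes no SEGV or other breakage.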

A‑1. Check magazine initialization and basic operation

Verify:
- g_tiny_heap_v2_mag[class_idx].top starts at 0.
- Pointers stored in the magazine are BASE pointers (the header position), as in FastCache.
Test:
- Run a short (1K–10K iterations) bench_random_mixed_hakmem with
  - HAKMEM_TINY_HEAP_V2=1,
  - HAKMEM_TINY_FRONT_DIRECT=1 (if needed)
    and confirm that it exits normally.

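A‑2. Check consistency with the shared pool / SP‑SLOT / node pool

Points to watch around the shared pool:
- The node pool behind SP‑SLOT and the lock‑free free list now falls back to the mutex‑protected legacy free list when it is exhausted.
- With TinyHeapV2 in place, confirm that:
  - node pool exhaustion does not cause a crash, and
  - performance does not degrade drastically (no log flooding, no runaway behavior).
Procedure:
- There is no need to go as far as strace -c / perf record, but
- do a light check of shared_pool behavior with HAKMEM_SS_ACQUIRE_DEBUG=1 or HAKMEM_SS_FREE_DEBUG=1.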
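
As a rough picture of the behavior being checked here, the sketch below shows a fixed-size lock-free node pool that falls back to a mutex-protected list once it runs dry, instead of dereferencing a NULL node. It is an assumption-laden illustration: every `demo_*` name, the capacities, and the array-based legacy list are hypothetical and do not reflect the actual SP‑SLOT data structures.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_NODE_POOL_CAP 4    /* tiny on purpose, to force exhaustion */
#define DEMO_LEGACY_CAP    64

enum { DEMO_PATH_LOCKFREE = 0, DEMO_PATH_MUTEX = 1, DEMO_PATH_FULL = -1 };

typedef struct DemoNode {
    struct DemoNode* next;
    void*            slab;
} DemoNode;

static DemoNode           demo_pool[DEMO_NODE_POOL_CAP];
static atomic_int         demo_pool_next;   /* next unclaimed pool index */
static _Atomic(DemoNode*) demo_lf_head;     /* push-only lock-free stack */

static void*           demo_legacy[DEMO_LEGACY_CAP];  /* guarded by the mutex */
static int             demo_legacy_count;
static pthread_mutex_t demo_legacy_lock = PTHREAD_MUTEX_INITIALIZER;

/* Claim one node with a CAS loop; returns NULL when the pool is exhausted. */
static DemoNode* demo_node_alloc(void) {
    int idx = atomic_load(&demo_pool_next);
    while (idx < DEMO_NODE_POOL_CAP) {
        /* On failure, idx is reloaded with the current value and rechecked. */
        if (atomic_compare_exchange_weak(&demo_pool_next, &idx, idx + 1))
            return &demo_pool[idx];
    }
    return NULL;
}

/* Record a free slab; reports which path was taken. */
static int demo_push_free_slab(void* slab) {
    DemoNode* node = demo_node_alloc();
    if (node) {
        node->slab = slab;
        DemoNode* old = atomic_load(&demo_lf_head);
        do {
            node->next = old;
        } while (!atomic_compare_exchange_weak(&demo_lf_head, &old, node));
        return DEMO_PATH_LOCKFREE;
    }
    /* Node pool exhausted: take the slower mutex-protected path. */
    int path = DEMO_PATH_FULL;
    pthread_mutex_lock(&demo_legacy_lock);
    if (demo_legacy_count < DEMO_LEGACY_CAP) {
        demo_legacy[demo_legacy_count++] = slab;
        path = DEMO_PATH_MUTEX;
    }
    pthread_mutex_unlock(&demo_legacy_lock);
    return path;
}
```

The property worth asserting in a stress run is the one visible here: exhaustion changes the return path, never the validity of the pointers involved.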

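A‑3. Re‑enable the hook (restricted classes only)

Restore the commented‑out hook in core/tiny_alloc_fast.inc.h, with these restrictions:
- Try TinyHeapV2 only when class_idx <= 3.
- Structure the code so that a failure (NULL) falls back to the existing path.
At this point, confirm that:
- 100K iterations of the 128B / 256B / random_mixed benchmarks finish normally, and
- behavior (whether a SEGV occurs) does not differ between TinyHeapV2 ON and OFF.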
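
One possible shape for the re-enabled hook is sketched below, under the assumption that the class check lives at the call site. All `demo_*` functions and variables are stand-ins invented for this example; the real hook would call tiny_heap_v2_enabled(), tiny_heap_v2_alloc(), and HAK_RET_ALLOC as in the commented-out code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the real gate and allocators (hypothetical). */
static int  demo_heap_v2_on;        /* plays the role of tiny_heap_v2_enabled() */
static int  demo_v2_has_block;      /* whether the magazine has something to pop */
static int  demo_v2_calls;
static int  demo_legacy_calls;
static char demo_v2_block[32];
static char demo_legacy_block[32];

static void* demo_tiny_heap_v2_alloc(int class_idx) {
    (void)class_idx;
    demo_v2_calls++;
    return demo_v2_has_block ? (void*)demo_v2_block : NULL;
}

static void* demo_legacy_alloc(int class_idx) {
    (void)class_idx;
    demo_legacy_calls++;
    return demo_legacy_block;
}

/* Hook: try TinyHeapV2 only for classes 0-3; NULL falls through unchanged. */
static void* demo_tiny_alloc_fast(int class_idx) {
    if (__builtin_expect(demo_heap_v2_on, 0) && class_idx <= 3) {
        void* base = demo_tiny_heap_v2_alloc(class_idx);
        if (base) return base;
    }
    return demo_legacy_alloc(class_idx);   /* existing front, untouched */
}
```

Because the fallback is a plain fall-through, the OFF configuration and the "magazine empty" case both execute exactly the legacy call, which keeps the A/B comparison meaningful.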

Phase 13‑B: Deepen the T‑HEAP/T‑BACKEND/T‑REMOTE design (optional but recommended)

Goal: grow TinyHeapV2 from a plain "magazine front" into a per‑thread heap closer to mimalloc.

This phase is research‑heavy, so proceed incrementally.

B‑1. T‑BACKEND: introduce a "span supply box"

Direction:
- The current tiny_heap_v2_refill_mag depends on FastCache + tiny_alloc_fast_refill.
- Move it gradually toward a backend dedicated to TinyHeapV2 that manages memory in units of spans.
Specifics:
- Define a new Box API (for example in tiny_heap_backend.h):
```
typedef struct TinySpan TinySpan;  // wraps the existing Superslab/TinySlabMeta

TinySpan* tiny_backend_acquire_span(int class_idx);
void      tiny_backend_release_span(TinySpan* span);
```
- A thin wrapper is fine at first:
  - tiny_backend_acquire_span → just assigns one slab to TLS from the existing superslab_refill / shared_pool.
  - tiny_backend_release_span → just hands the slab to the existing shared_pool_release_slab / SP‑SLOT.
On the TinyHeapV2 side:
- When the magazine becomes empty:
  - First try FastCache / the existing refill (backward compatible).
  - In the longer term, evolve toward borrowing a span directly from tiny_backend_acquire_span and filling the magazine from it.
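
To make the long-term direction concrete, here is a hedged sketch of filling a magazine by carving blocks out of a borrowed span. The `TinySpan` field layout, the single static slab, the fixed 32-byte block size, and `demo_fill_mag_from_span` are all assumptions for illustration; the spec leaves `TinySpan` opaque, and a real wrapper would obtain memory through superslab_refill / the shared pool rather than a static buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAG_CAP    16
#define DEMO_BLOCK_SIZE 32   /* assumed block size, same for every class here */

/* Hypothetical layout; the spec only forward-declares TinySpan. */
typedef struct TinySpan {
    uint8_t* base;         /* first block of the span */
    size_t   block_size;   /* bytes per block */
    size_t   capacity;     /* total blocks in the span */
    size_t   used;         /* blocks already carved out */
} TinySpan;

static uint8_t  demo_slab[1024];   /* stands in for one slab from the shared pool */
static TinySpan demo_span;
static int      demo_span_in_use;

/* Toy wrapper: lends out the single demo slab, or NULL if it is already out. */
TinySpan* tiny_backend_acquire_span(int class_idx) {
    (void)class_idx;
    if (demo_span_in_use) return NULL;
    demo_span.base       = demo_slab;
    demo_span.block_size = DEMO_BLOCK_SIZE;
    demo_span.capacity   = sizeof demo_slab / DEMO_BLOCK_SIZE;
    demo_span.used       = 0;
    demo_span_in_use     = 1;
    return &demo_span;
}

void tiny_backend_release_span(TinySpan* span) {
    if (span == &demo_span) demo_span_in_use = 0;
}

/* Carve blocks from the span until the magazine is full or the span is spent. */
static int demo_fill_mag_from_span(TinySpan* span, void** items, int* top) {
    int added = 0;
    while (*top < DEMO_MAG_CAP && span->used < span->capacity) {
        items[(*top)++] = span->base + span->used * span->block_size;
        span->used++;
        added++;
    }
    return added;
}
```

Keeping the carve step separate from acquire/release means the magazine code never needs to know where the span came from, which is the boundary the Box API is meant to enforce.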

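B‑2. T‑REMOTE: keep cross‑thread free in view

Current state:
- free flows through hak_tiny_free_fast_v2 → TLS SLL → drain → Superslab.
Direction (long term):
- There is room to move the cross‑thread free path onto a T‑REMOTE that is simplified to match TinyHeapV2.
- For now, not breaking the free path takes priority, so this does not need to be touched in the short term.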

Phase 13‑C: Measurement and summary

Goal: quantify the effect of TinyHeapV2 ON/OFF and document how close it gets to mimalloc.

C‑1. Benchmark set

Run A/B comparisons on at least these two series:
1. bench_random_mixed_hakmem (100K, size=128/256/512/1024)
2. Larson variants (scripts/bench_larson_* / run_larson_claude.sh)
For each:
- Compare HAKMEM_TINY_HEAP_V2=0 and HAKMEM_TINY_HEAP_V2=1.
- Record throughput and, if possible, the syscall rate from strace -c.
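
The ON/OFF comparison boils down to a relative throughput change. A small helper along these lines can keep the reporting consistent; the function names and the idea of a symmetric noise band are assumptions for illustration, and the band width is something to pick from repeated runs rather than a value this spec defines.

```c
#include <assert.h>

/* Relative change of the variant versus the baseline, in percent. */
static double demo_ab_delta_percent(double baseline_ops, double variant_ops) {
    return (variant_ops - baseline_ops) / baseline_ops * 100.0;
}

/* True when the delta falls strictly inside a symmetric noise band (percent). */
static int demo_within_noise(double delta_percent, double noise_percent) {
    return delta_percent > -noise_percent && delta_percent < noise_percent;
}
```

For example, the C1 figures in the commit message above (9,688 → 9,762 ops/s) give a delta of roughly +0.76%.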

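C‑2. Report

Write a new .md file:
- TINY_HEAP_V2_EVALUATION.md
- Contents:
  - Implementation overview (how much of T‑HEAP/T‑BACKEND/T‑REMOTE/T‑EVENT was implemented).
  - Benchmark results (System / mimalloc / HAKMEM with HeapV2 ON and OFF).
  - Which sizes and workloads improved, and by how much.
  - Remaining gaps and hypotheses about their causes (kernel side, page faults, and so on).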

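4. Constraints and notes

Do not break the existing stable paths
- When HAKMEM_TINY_HEAP_V2 is 0, behavior must stay exactly the same as the current Phase 12 Tiny path.
- Every TinyHeapV2‑related change must be gated by an ENV variable or flag so that A/B comparison stays possible.
Do not break the shared pool / SP‑SLOT contract
- Span/superslab acquire and release must always go through the existing APIs (shared_pool_acquire_slab, shared_pool_release_slab, superslab_refill, etc.).
- Avoid touching SP‑SLOT's meta->slots[i].state or active_slots directly (go through the dedicated helpers only).
Introduce lock‑free code incrementally
- SP‑SLOT already uses lock‑free free lists and CAS, so
- when adding more lock‑free code on the TinyHeapV2 side, always provide a "mutex fallback" so that cases like node pool exhaustion do not SEGV.
Start with short benchmarks
- First confirm TinyHeapV2 stability at 10K–100K iterations, then move on to longer runs such as 200K–1M.
- For perf/strace runs, make sure P0-4 WARN and debug logs do not distort the performance measurements.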
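
The ENV gating requirement can be met with a flag that is read once per thread and then cached. The sketch below assumes that any value parsing to a nonzero integer means ON; that parsing rule and the `demo_*` names are assumptions, and only the variable name HAKMEM_TINY_HEAP_V2 comes from this spec.

```c
#include <assert.h>
#include <stdlib.h>

/* -1 = not read yet, 0 = OFF, 1 = ON. Thread-local, so no locking is needed. */
static __thread int demo_v2_cached = -1;

/* Assumed rule: unset or "0" means OFF, any nonzero integer means ON. */
static int demo_flag_from_string(const char* s) {
    return (s != NULL && atoi(s) != 0) ? 1 : 0;
}

/* Reads the environment on first use, then serves the cached answer. */
static int demo_tiny_heap_v2_enabled(void) {
    if (demo_v2_cached < 0) {
        demo_v2_cached = demo_flag_from_string(getenv("HAKMEM_TINY_HEAP_V2"));
    }
    return demo_v2_cached;
}
```

Since the cached value never changes after the first read, a run started with the variable unset takes the legacy path for its whole lifetime, which is what the "exactly the same behavior when 0" constraint asks for.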

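5. Summary

At this point TinyHeapV2 is an experimental Box that has only a skeleton and its TLS structures.
The intent of this task is for Claude Code to carry out the following while respecting Box Theory boundaries:
- Phase 13‑A: stabilize the existing skeleton and safely re‑enable the hook
- Phase 13‑B: evolve toward T‑BACKEND/T‑REMOTE (as far as feasible)
- Phase 13‑C: benchmarks, evaluation, and report

If this Box matures well, it should become quite an interesting testbed in which "HAKMEM's learning layer + Superslab management" coexists with "a simple, mimalloc‑style Tiny front".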
