Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)

Problem:
Warm pool had an effectively 0% hit rate (1 hit against 3976 misses) despite being implemented, so every cache miss went through the expensive superslab_refill registry scan.

Root Cause Analysis:
- Warm pool was initialized once and received a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- The next refill pushed another single slab, which was immediately exhausted
- Pool oscillated between 0 and 1 items, yielding a 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now run multiple superslab_refills and prefill the pool with 3 additional HOT superslabs before attempting to carve. This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect an empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs
- Store the extra slabs in the warm pool, keep 1 in TLS for immediate carving
- Track prefill events in the g_warm_pool_stats[].prefilled counter

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After: C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget is currently hardcoded to 3 (can be made tunable via an env var later if needed)
- All statistics are always compiled in; printing is ENV-gated via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider an adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+, pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
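
The secondary prefill described in the commit message can be sketched roughly as follows. This is a minimal toy model, not the hakmem implementation: `Slab`, `WarmPool`, `toy_registry_scan`, `toy_refill`, and `TOY_SLABS` are hypothetical names introduced for illustration, and only the pool capacity (16) and prefill budget (3) come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define WARM_POOL_MAX  16  /* TINY_WARM_POOL_MAX_PER_CLASS after this commit */
#define PREFILL_BUDGET 3   /* extra slabs loaded when the pool is empty */
#define TOY_SLABS      64  /* size of the toy backing store (assumption) */

typedef struct { int id; } Slab;  /* hypothetical stand-in for a superslab */

typedef struct {
    Slab *slots[WARM_POOL_MAX];
    size_t count;
    unsigned long hits, misses, prefilled;
} WarmPool;

static Slab   g_toy_slabs[TOY_SLABS];
static size_t g_toy_next;

static void warm_pool_init(WarmPool *p) {
    p->count = 0;
    p->hits = p->misses = p->prefilled = 0;
}

/* Stand-in for the expensive superslab_refill() registry scan. */
static Slab *toy_registry_scan(void) {
    Slab *s = &g_toy_slabs[g_toy_next % TOY_SLABS];
    s->id = (int)g_toy_next++;
    return s;
}

/* Returns 1 on success, 0 when the pool is already full. */
static int warm_pool_push(WarmPool *p, Slab *s) {
    assert(s != NULL);
    if (p->count >= WARM_POOL_MAX) return 0;
    p->slots[p->count++] = s;
    return 1;
}

/* Returns a pooled slab, or NULL when the pool is empty. */
static Slab *warm_pool_pop(WarmPool *p) {
    return p->count ? p->slots[--p->count] : NULL;
}

/* Cold path: return a slab to carve from, prefilling the pool on a miss. */
static Slab *toy_refill(WarmPool *p) {
    Slab *s = warm_pool_pop(p);
    if (s) { p->hits++; return s; }

    p->misses++;
    /* Pool is empty here: load PREFILL_BUDGET extra slabs into it. */
    for (int i = 0; i < PREFILL_BUDGET; i++) {
        if (!warm_pool_push(p, toy_registry_scan())) break;
        p->prefilled++;
    }
    return toy_registry_scan();  /* the slab kept for immediate carving */
}
```

In this toy model, eight consecutive calls to `toy_refill` on a freshly initialized pool yield 2 misses and 6 hits, because each miss leaves three slabs behind for the calls that follow. The measured 55.6% hit rate depends on workload behaviour that this sketch does not model.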
2025-12-04 23:31:54 +09:00
parent 2e3fcc92af
commit 5685c2f4c9
29 changed files with 6023 additions and 138 deletions
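
The hit-rate figures and the ENV-gated statistics printing mentioned in the commit message could look something like the sketch below. The struct layout, function names, and output format are assumptions made for illustration; only the environment variable name `HAKMEM_WARM_POOL_STATS` and the counter names hits/misses/prefilled come from the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-class counters; the real g_warm_pool_stats layout may differ. */
typedef struct {
    unsigned long hits, misses, prefilled;
} WarmPoolStats;

/* Hit rate in percent; 0.0 when nothing has been recorded yet. */
static double warm_pool_hit_rate(const WarmPoolStats *s) {
    assert(s != NULL);
    unsigned long total = s->hits + s->misses;
    return total ? 100.0 * (double)s->hits / (double)total : 0.0;
}

/* Returns 1 only when HAKMEM_WARM_POOL_STATS is set to exactly "1". */
static int warm_pool_stats_enabled(void) {
    const char *v = getenv("HAKMEM_WARM_POOL_STATS");
    return v != NULL && strcmp(v, "1") == 0;
}

/* Counters are always maintained; only the printing is gated. */
static void warm_pool_stats_print(int class_idx, const WarmPoolStats *s) {
    if (!warm_pool_stats_enabled()) return;
    fprintf(stderr, "C%d hits=%lu misses=%lu prefilled=%lu hit_rate=%.1f%%\n",
            class_idx, s->hits, s->misses, s->prefilled,
            warm_pool_hit_rate(s));
}
```

With the reported C7 counters (hits=3929, misses=3143), `warm_pool_hit_rate` evaluates to 3929 / 7072, which is about 55.6%, matching the figure in the commit message.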
--- a/docs/paper/ACE-Alloc/README.md
+++ b/docs/paper/ACE-Alloc/README.md
@@ -20,15 +20,19 @@ pandoc -s main.md -o paper.pdf

 Repro / Benchmarks
 ------------------
-簡易スイープ（性能とRSS）:
+簡易スイープ（性能と RSS）:

 ```
 scripts/sweep_mem_perf.sh both | tee sweep.csv
 ```

-Running in memory-focused mode:
+Representative benchmarks (Tiny / Mixed):

 ```
-HAKMEM_MEMORY_MODE=1 ./bench_tiny_hot_hakmem 64 1000 400000000
-HAKMEM_MEMORY_MODE=1 ./bench_random_mixed_hakmem 2000000 400 42
+make bench_tiny_hot_hakmem bench_random_mixed_hakmem
+
+HAKMEM_TINY_PROFILE=full         ./bench_tiny_hot_hakmem   64 100 60000
+HAKMEM_TINY_PROFILE=conservative ./bench_random_mixed_hakmem 2000000 400 42
 ```
+
+For details on environment variables and profiles, see `docs/specs/ENV_VARS.md`.
--- a/docs/paper/ACE-Alloc/main.md
+++ b/docs/paper/ACE-Alloc/main.md
@@ -4,7 +4,7 @@

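 Abstract

-This paper applies Agentic Context Engineering (ACE) to memory allocators and proposes ACE‑Alloc, a small-object allocator with a low-overhead learning layer suitable for production use. ACE‑Alloc implements an agent-style optimization loop consisting of observation (lightweight events), decision (dynamic control of cap/refill/trim), and application (an asynchronous thread), while adopting TLS batching to remove observation overhead from the hot path. It also removes per‑object headers while keeping the standard-API-compatible free(ptr), and achieves immediate classification without density loss through 32B prefix metadata at the end of each slab. In experiments, it shows a performance advantage over mimalloc on Tiny/Mid, and shows that the gap in memory efficiency can be narrowed through ACE control of Refill‑one, SLL shrinking, and Idle Trim.
+This paper applies Agentic Context Engineering (ACE) to memory allocators and proposes ACE‑Alloc, a small-object allocator with a Two‑Speed Tiny front (HOT/WARM/COLD) based on Box Theory and a low-overhead learning layer. ACE‑Alloc implements an agent-style optimization loop consisting of observation (lightweight events), decision (dynamic control of cap/refill/trim), and application (an asynchronous thread), while adopting TLS batching to remove observation overhead from the hot path. It also removes per‑object headers while keeping the standard-API-compatible free(ptr), and achieves immediate classification without density loss through 32B prefix metadata at the end of each slab together with the Tiny Front Gatekeeper/Route Box. On Tiny‑only hot-path benchmarks it achieves throughput of the same order as mimalloc, while on Mixed/Tiny+Mid workloads it shows that the trade-off between performance and memory efficiency can be explored systematically through ACE control of Refill‑one, SLL shrinking, Idle Trim, and Superslab Tiering.

 1. Introduction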

@@ -27,30 +27,45 @@
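  - Minimize hot-path instructions, branches, and memory accesses (close to zero).
  - Maintain standard API compatibility (free(ptr)) and memory density.
  - Apply the learning layer asynchronously, off the hot path.
+  - Following Box Theory, adopt a Two‑Speed structure that clearly separates the hot path (Tiny Front) from the learning layer (ACE/ELO/CAP).
 - Key design:
+  - Two‑Speed Tiny Front: separate the HOT path (TLS SLL / Unified Cache), the WARM path (batch refill), and the COLD path (Shared Pool / Superslab / Registry) into boxes, and remove Registry lookups, mutexes, and heavy diagnostics from the HOT path.
  - TLS batching (alloc/free observation counters accumulate in TLS and are reflected atomically only when a threshold is reached).
  - Observation ring plus background worker (event aggregation and policy application).
-  - A 32B prefix at the end of each slab (storing pool/type/class/owner) makes per‑object headers unnecessary.
-  - Policies: Refill‑one (replenish only one object on a miss), SLL shrinking, and Idle Trim/Flush.
+  - A 32B prefix at the end of each slab (storing pool/type/class/owner) together with the Tiny Layout/Ptr Bridge Box makes per‑object headers unnecessary.
+  - The Tiny Front Gatekeeper / Route Box consolidates USER→BASE conversion and Tiny vs Pool routing in one place at the malloc/free entry points.
+  - Policies: Refill‑one (replenish only one object on a miss), SLL shrinking, Idle Trim/Flush, and Superslab Tiering (HOT/DRAINING/FREE).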

 4. Implementation

-- Main components:
-  - Prefix metadata: `core/hakmem_tiny_superslab.h/c`
-  - TLS batching and ACE metrics: `core/hakmem_ace_metrics.h/c`
-  - Observation, decision, and application (INT engine): `core/hakmem_tiny_intel.inc`
-  - Where Refill‑one / SLL shrinking / Idle Trim are applied.
-- Compatibility and safety: the standard API, a safe mode under LD_PRELOAD, and handling of remote frees (design and future extensions).
+- Boxing of Tiny / Superslab:
+  - Tiny Front (HOT/WARM/COLD): `core/box/tiny_front_hot_box.h`, `core/box/tiny_front_cold_box.h`, `core/box/tiny_alloc_gate_box.h`, `core/box/tiny_free_gate_box.h`, `core/box/tiny_route_box.{h,c}`.
+  - Unified Cache / Backend: `core/tiny_unified_cache.{h,c}`, `core/hakmem_shared_pool_*.c`, `core/box/ss_allocation_box.{h,c}`.
+  - Superslab Tiering / Release Guard: `core/box/ss_tier_box.h`, `core/box/ss_release_guard_box.h`, `core/hakmem_super_registry.{c,h}`.
+- Headerless + pointer conversion:
+  - Prefix metadata and layout: `core/hakmem_tiny_superslab*.h`, `core/box/tiny_layout_box.h`, `core/box/tiny_header_box.h`.
+  - USER/BASE bridge: `core/box/tiny_ptr_bridge_box.h`; TLS SLL / Remote Queue: `core/box/tls_sll_box.h`, `core/box/tls_sll_drain_box.h`.
+- Learning layer (ACE/ELO/CAP):
+  - ACE metrics and controller: `core/hakmem_ace_metrics.{h,c}`, `core/hakmem_ace_controller.{h,c}`, `core/hakmem_elo.{h,c}`, `core/hakmem_learner.{h,c}`.
+  - INT engine: `core/hakmem_tiny_intel.inc` (the observe → decide → apply loop; by default it runs in OFF or OBSERVE mode).
+- Compatibility and safety:
+  - The standard API and a safe mode under LD_PRELOAD (free(ptr) calls from external applications are accepted as-is).
+  - Validation at the free boundary by the Tiny Front Gatekeeper Box (USER→BASE normalization, range checks, Fail‑Fast at Box boundaries).
+  - Remote frees are isolated in a dedicated Remote Queue Box, and ownership transfer and drain/publish/adopt are separated at Box boundaries.

 5. Evaluation

 - Benchmarks: Tiny Hot, Mid MT, Mixed (bundled with this repository).
+  - Tiny Hot: `bench_tiny_hot_hakmem` (fixed-size Tiny class; measures HOT-path performance of the Two‑Speed Tiny Front).
+  - Mixed: `bench_random_mixed_hakmem` (random sizes with interleaved malloc/free; also observes the ratio of HOT/WARM/COLD paths).
 - Metrics: throughput (M ops/sec), bandwidth, RSS/VmSize, fragmentation ratio (optional).
 - Comparison: mimalloc, system malloc.
 - Ablations:
  - ACE OFF comparison (learning layer disabled).
+  - Two‑Speed Tiny Front ON/OFF (switching among Tiny‑only/Tiny‑first/Pool‑only via the Tiny Route Profile).
+  - With and without Superslab Tiering / Eager FREE.
  - With and without Refill‑one / SLL shrinking / Idle Trim.
-  - Prefix metadata (headerless) vs per‑object headers (for reference).
+  - Prefix metadata (headerless) vs per‑object headers (for reference, if a comparison implementation exists).

 6. Related Work

@@ -69,34 +84,29 @@

 Appendix A. Artifact (reproduction steps)

-- Build (meta default):
+- Build (Tiny/Mixed benchmarks):
  ```sh
-  make bench_tiny_hot_hakmem
+  make bench_tiny_hot_hakmem bench_random_mixed_hakmem
  ```
 - Tiny (performance):
  ```sh
-  ./bench_tiny_hot_hakmem 64 100 60000
+  HAKMEM_TINY_PROFILE=full ./bench_tiny_hot_hakmem 64 100 60000
  ```
 - Mixed (performance):
  ```sh
-  ./bench_random_mixed_hakmem 2000000 400 42
-  ```
-- Memory-focused mode (recommended preset):
-  ```sh
-  HAKMEM_MEMORY_MODE=1 ./bench_tiny_hot_hakmem 64 1000 400000000
-  HAKMEM_MEMORY_MODE=1 ./bench_random_mixed_hakmem 2000000 400 42
+  HAKMEM_TINY_PROFILE=conservative ./bench_random_mixed_hakmem 2000000 400 42
  ```
 - Sweep measurement (short CSV output):
  ```sh
  scripts/sweep_mem_perf.sh both | tee sweep.csv
  ```
-- Average trend log (EMA):
+- INT engine + learning layer ON (example):
  ```sh
-  HAKMEM_TINY_OBS=1 HAKMEM_TINY_OBS_LOG_AVG=1 HAKMEM_TINY_OBS_LOG_EVERY=2 HAKMEM_INT_ENGINE=1 \
+  HAKMEM_INT_ENGINE=1 \
    ./bench_random_mixed_hakmem 2000000 400 42 2>&1 | less
  ```
+  (See `docs/specs/ENV_VARS.md` for details on environment variables and profiles.)

 Acknowledgments

 (TBD)
-