hakmem/main.md at 0b306f72f40d98f9eea1a28ee06c63c39c6db537

Moe Charm (CI) 5685c2f4c9 Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)

Problem: Warm pool had an effectively 0% hit rate (1 hit against 3976 misses)
despite being implemented, so every cache miss went through the expensive
superslab_refill registry scan.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now issue multiple superslab_refill calls
and prefill the pool with 3 additional HOT superslabs before attempting to
carve. This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
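
A minimal sketch of the cold-path logic described above, not the actual
unified_cache_refill() code. The types `Slab` and `WarmPool` and the helpers
`registry_scan_refill`, `warm_pool_pop`, `warm_pool_push`, and
`refill_cold_path` are hypothetical stand-ins; only the capacity of 16, the
budget of 3, and the `prefilled` counter come from this commit message.

```c
#include <assert.h>
#include <stddef.h>

#define WARM_POOL_MAX_PER_CLASS 16  /* mirrors TINY_WARM_POOL_MAX_PER_CLASS */
#define WARM_PREFILL_BUDGET      3  /* extra superslabs loaded on an empty pool */

typedef struct { int id; } Slab;    /* hypothetical stand-in for a superslab */

typedef struct {
    Slab    *items[WARM_POOL_MAX_PER_CLASS];
    int      count;
    unsigned hits, misses, prefilled;
} WarmPool;

/* Hypothetical stand-in for the expensive registry scan (superslab_refill). */
static Slab g_backing[64];
static int  g_next_slab;

static Slab *registry_scan_refill(void) {
    if (g_next_slab >= 64) return NULL;
    g_backing[g_next_slab].id = g_next_slab;
    return &g_backing[g_next_slab++];
}

static Slab *warm_pool_pop(WarmPool *p) {
    return p->count > 0 ? p->items[--p->count] : NULL;
}

static int warm_pool_push(WarmPool *p, Slab *s) {
    if (p->count >= WARM_POOL_MAX_PER_CLASS) return 0;
    p->items[p->count++] = s;
    return 1;
}

/* Cold path: returns a slab to carve from, or NULL if none is available. */
static Slab *refill_cold_path(WarmPool *p) {
    assert(p != NULL);
    Slab *s = warm_pool_pop(p);
    if (s) { p->hits++; return s; }          /* warm hit: no registry scan */

    p->misses++;
    s = registry_scan_refill();              /* kept for immediate carving */
    if (!s) return NULL;

    /* Secondary prefill: the pool is empty here, so stock it with extras. */
    for (int i = 0; i < WARM_PREFILL_BUDGET; i++) {
        Slab *extra = registry_scan_refill();
        if (!extra || !warm_pool_push(p, extra)) break;
        p->prefilled++;
    }
    return s;
}
```

In this toy model one miss is followed by three hits. The measured hit rate
reported below differs because real slabs are exhausted at different rates.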

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)
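
The hit rates above are consistent with hits / (hits + misses). A small
check of that arithmetic, where `hit_rate_percent` is an illustrative helper
and not a function from the repository:

```c
#include <assert.h>

/* Hit rate as a percentage of all warm-pool lookups (hits + misses). */
static double hit_rate_percent(unsigned long hits, unsigned long misses) {
    unsigned long total = hits + misses;
    assert(total >= hits);                   /* guard against wrap-around */
    return total ? 100.0 * (double)hits / (double)total : 0.0;
}
```

With the reported counters, 3929 / (3929 + 3143) is about 55.56%, which
rounds to the 55.6% shown, and 1 / (1 + 3976) is about 0.03%, shown as 0.0%.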

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (could be made tunable via an env var later if needed)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1
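
One way the "always compiled, ENV-gated printing" arrangement might look.
This is a sketch under assumptions: `WarmPoolStats`, `stats_flag_on`,
`stats_format`, and `stats_report` are hypothetical names, the output format
is invented for illustration, and treating only the exact value "1" as
enabled is an assumption about how HAKMEM_WARM_POOL_STATS is parsed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { unsigned long hits, misses, prefilled; } WarmPoolStats;

/* Assumption: only the exact value "1" enables printing. */
static int stats_flag_on(const char *value) {
    return value != NULL && strcmp(value, "1") == 0;
}

/* Formats one class's counters into buf; returns the snprintf result. */
static int stats_format(char *buf, size_t n, int cls, const WarmPoolStats *s) {
    assert(buf != NULL && s != NULL);
    return snprintf(buf, n, "C%d hits=%lu misses=%lu prefilled=%lu",
                    cls, s->hits, s->misses, s->prefilled);
}

/* Counters are updated unconditionally elsewhere; only the report is gated. */
static void stats_report(int cls, const WarmPoolStats *s) {
    char line[96];
    if (!stats_flag_on(getenv("HAKMEM_WARM_POOL_STATS"))) return;
    stats_format(line, sizeof line, cls, s);
    fprintf(stderr, "%s\n", line);
}
```

Keeping the counters unconditional means a run can be inspected after the
fact by setting the variable, without rebuilding.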

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+, pending the registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-12-04 23:31:54 +09:00

7.9 KiB


% ACE‑Alloc: Agentic Context Engineering for Runtime‑Adaptive Small‑Object Allocation
% Authors: (TBD)
% Draft v0.1, 2025‑11‑02

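Abstract

This paper proposes ACE-Alloc, a small-object allocator that applies Agentic Context Engineering (ACE) to memory allocation. It combines a Two-Speed Tiny front end (HOT/WARM/COLD) based on Box Theory with a low-overhead learning layer. ACE-Alloc implements an agentic optimization loop consisting of observation (lightweight events), decision (dynamic control of cap/refill/trim), and application (an asynchronous thread), and it uses TLS batching to remove observation overhead from the hot path. It also removes per-object headers while keeping a standard-API-compatible free(ptr): 32-byte prefix metadata at the slab tail, together with the Tiny Front Gatekeeper/Route Box, allows immediate classification without degrading density. On Tiny-only hot-path benchmarks, ACE-Alloc achieves throughput of the same order as mimalloc. On Mixed/Tiny+Mid workloads, we show that ACE control over Refill-one, SLL shrinking, Idle Trim, and Superslab Tiering makes it possible to explore the trade-off between performance and memory efficiency systematically.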

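Introduction

Fast allocation of small objects is a major performance determinant for many applications.
Existing high-performance allocators face a trade-off between performance and memory efficiency. Agentic control (ACE) enables dynamic optimization that tracks the workload.
Contributions of this paper:
- A design that implements ACE (Agentic Context Engineering) in a memory allocator and runs it with low overhead (TLS batching, asynchronous application).
- A design that conforms to the standard API while eliminating per-object headers, using a 32-byte prefix at the slab tail for immediate classification.
- Exploration and evaluation of the performance/memory sweet spot (Refill-one, SLL shrinking, Idle Trim/Flush).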
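
To make the headerless idea concrete, here is a minimal sketch of finding
slab-tail metadata from any interior pointer. It assumes slabs are
power-of-two sized and aligned to their size; `SLAB_SIZE`, the `SlabPrefix`
field layout, and `slab_prefix_of` are hypothetical and chosen only so that
the struct is 32 bytes and holds the pool/type/class/owner fields.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SLAB_SIZE   (64u * 1024u)   /* assumed: power of two, size-aligned */
#define PREFIX_SIZE 32u

typedef struct {
    uint8_t  pool;
    uint8_t  type;
    uint8_t  class_idx;
    uint8_t  reserved;
    uint32_t owner;
    uint8_t  pad[24];               /* pads the struct to 32 bytes */
} SlabPrefix;

static_assert(sizeof(SlabPrefix) == PREFIX_SIZE, "prefix must be 32 bytes");

/* Masks the pointer down to its slab base, then steps to the tail prefix. */
static SlabPrefix *slab_prefix_of(const void *ptr) {
    assert(ptr != NULL);
    uintptr_t base = (uintptr_t)ptr & ~((uintptr_t)SLAB_SIZE - 1);
    return (SlabPrefix *)(base + SLAB_SIZE - PREFIX_SIZE);
}
```

Because the lookup is a mask and an add, free(ptr) can classify a pointer
without reading any per-object header, which is what keeps density intact.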

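Background

Memory allocator fundamentals (small size classes, TLS caches, slabs/superslabs).
Key points of related implementations such as mimalloc (page descriptors, low metadata overhead, locality).
The gist of Agentic Context Engineering (ACE): implement an agentic observe → decide → apply loop without polluting the hot path.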

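Design of ACE-Alloc

Goals:
- Minimize hot-path instructions, branches, and memory accesses (as close to zero as possible).
- Maintain standard API compatibility (free(ptr)) and memory density.
- Apply the learning layer asynchronously, off the hot path.
- Following Box Theory, adopt a Two-Speed structure that cleanly separates the hot path (Tiny Front) from the learning layer (ACE/ELO/CAP).
Key design points:
- Two-Speed Tiny Front: separate the HOT path (TLS SLL / Unified Cache), the WARM path (batch refill), and the COLD path (Shared Pool / Superslab / Registry) into boxes, and exclude Registry lookups, mutexes, and heavy diagnostics from the HOT path.
- TLS batching (alloc/free observation counters accumulate in TLS and are published atomically only when a threshold is reached).
- Observation ring plus background worker (event aggregation and policy application).
- A 32-byte prefix at the slab tail (storing pool/type/class/owner), together with the Tiny Layout/Ptr Bridge Box, eliminates the need for per-object headers.
- The Tiny Front Gatekeeper / Route Box consolidates USER→BASE conversion and Tiny-vs-Pool routing in a single place at the malloc/free entry point.
- Policies: Refill-one (replenish only one item on a miss), SLL shrinking, Idle Trim/Flush, and Superslab Tiering (HOT/DRAINING/FREE).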
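
The TLS batching point can be illustrated with a short sketch: each thread
bumps a plain thread-local counter and touches the shared atomic only once
per batch. The names `obs_count_alloc`, `obs_read_allocs`, and the threshold
of 64 are assumptions made for this example, not values taken from the
implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define OBS_FLUSH_THRESHOLD 64u     /* assumed batch size */
static_assert(OBS_FLUSH_THRESHOLD > 0, "threshold must be positive");

static _Atomic uint64_t       g_alloc_events;   /* shared, read by the learner */
static _Thread_local uint32_t t_alloc_pending;  /* per-thread, no atomics */

/* Hot path: one thread-local increment; an atomic add once per batch. */
static inline void obs_count_alloc(void) {
    if (++t_alloc_pending >= OBS_FLUSH_THRESHOLD) {
        atomic_fetch_add_explicit(&g_alloc_events, t_alloc_pending,
                                  memory_order_relaxed);
        t_alloc_pending = 0;
    }
}

/* Off the hot path: the value may lag by up to one batch per thread. */
static uint64_t obs_read_allocs(void) {
    return atomic_load_explicit(&g_alloc_events, memory_order_relaxed);
}
```

The trade-off is staleness: the shared total can trail reality by up to
one batch per thread, which is acceptable for a learning layer that only
needs trends.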

Implementation

Boxing of Tiny / Superslab:
- Tiny Front (HOT/WARM/COLD): core/box/tiny_front_hot_box.h, core/box/tiny_front_cold_box.h, core/box/tiny_alloc_gate_box.h, core/box/tiny_free_gate_box.h, core/box/tiny_route_box.{h,c}.
- Unified Cache / Backend: core/tiny_unified_cache.{h,c}, core/hakmem_shared_pool_*.c, core/box/ss_allocation_box.{h,c}.
- Superslab Tiering / Release Guard: core/box/ss_tier_box.h, core/box/ss_release_guard_box.h, core/hakmem_super_registry.{c,h}.
Headerless design and pointer conversion:
- Prefix metadata and layout: core/hakmem_tiny_superslab*.h, core/box/tiny_layout_box.h, core/box/tiny_header_box.h.
- USER/BASE bridge: core/box/tiny_ptr_bridge_box.h. TLS SLL / Remote Queue: core/box/tls_sll_box.h, core/box/tls_sll_drain_box.h.
Learning layer (ACE/ELO/CAP):
- ACE metrics and controller: core/hakmem_ace_metrics.{h,c}, core/hakmem_ace_controller.{h,c}, core/hakmem_elo.{h,c}, core/hakmem_learner.{h,c}.
- INT engine: core/hakmem_tiny_intel.inc (the observe → decide → apply loop; by default it runs in OFF or OBSERVE mode).
Compatibility and safety:
- Standard API and a safe mode for LD_PRELOAD environments (free(ptr) calls from external applications are accepted as-is).
- Validation at the free boundary by the Tiny Front Gatekeeper Box (USER→BASE normalization, range checks, Fail-Fast at the Box boundary).
- Remote frees are isolated in a dedicated Remote Queue Box, and ownership transfer and drain/publish/adopt are separated at the Box boundary.
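
The free-boundary validation can be pictured roughly as follows. This is an
illustrative sketch only: `TinyRegion`, its fields, and `gate_user_to_base`
are hypothetical, and the USER-to-BASE distance is modeled as a plain
`user_offset` parameter because the actual layout is defined by the layout
and pointer-bridge boxes listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t start;        /* first byte of the Tiny region */
    uintptr_t end;          /* one past the last byte */
    size_t    user_offset;  /* assumed USER - BASE distance */
    size_t    block_size;   /* size of one block in this region */
} TinyRegion;

/* Normalizes a user pointer to its block base. Returns false, without
 * touching the pointed-to memory, when the pointer does not belong here. */
static bool gate_user_to_base(const TinyRegion *r, const void *user,
                              uintptr_t *base_out) {
    assert(r != NULL && base_out != NULL && r->block_size != 0);
    uintptr_t u = (uintptr_t)user;
    if (u < r->start + r->user_offset || u >= r->end)
        return false;                           /* range check */
    uintptr_t base = u - r->user_offset;        /* USER -> BASE */
    if ((base - r->start) % r->block_size != 0)
        return false;                           /* not on a block boundary */
    *base_out = base;
    return true;
}
```

A fail-fast caller would abort on a false result at the Tiny boundary, or
hand the pointer to the Pool route, instead of letting a bad pointer reach
the TLS lists.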

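Evaluation

Benchmarks: Tiny Hot, Mid MT, Mixed (bundled with this repository).
- Tiny Hot: bench_tiny_hot_hakmem (fixed-size Tiny class; measures HOT-path performance of the Two-Speed Tiny Front).
- Mixed: bench_random_mixed_hakmem (random sizes with interleaved malloc/free; also observes the HOT/WARM/COLD path ratio).
Metrics: throughput (M ops/sec), bandwidth, RSS/VmSize, fragmentation ratio (optional).
Comparison: mimalloc, system malloc.
Ablations:
- ACE OFF comparison (learning layer disabled).
- Two-Speed Tiny Front ON/OFF (switching among Tiny-only/Tiny-first/Pool-only via the Tiny Route Profile).
- With and without Superslab Tiering / Eager FREE.
- With and without Refill-one / SLL shrinking / Idle Trim.
- Prefix metadata (headerless) vs. per-object headers (for reference, if a comparison implementation exists).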

Related Work

Existing allocators (mimalloc, jemalloc, and others).
Dynamic optimization and learning-based methods (agent-based optimization).

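Discussion & Limitations

How quickly Idle Trim affects RSS (short vs. long measurement windows).
Design trade-offs under frequent remote frees (a safe remote queue as future work).
Tuning slab granularity and ACE policies when scaling to multiple threads.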

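Conclusion

ACE-Alloc is an implementation that minimizes hot-path overhead while using a learning layer to dynamically search for the sweet spot between memory efficiency and performance. Future work includes long-running fragmentation evaluation and optimization for multi-threaded scaling.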

Appendix A. Artifact (Reproduction Steps)

Build (Tiny/Mixed benchmarks):

make bench_tiny_hot_hakmem bench_random_mixed_hakmem

Tiny (performance):

HAKMEM_TINY_PROFILE=full ./bench_tiny_hot_hakmem 64 100 60000

Mixed (performance):

HAKMEM_TINY_PROFILE=conservative ./bench_random_mixed_hakmem 2000000 400 42

Sweep measurement (short run, CSV output):

scripts/sweep_mem_perf.sh both | tee sweep.csv

INT engine + learning layer ON (example):
```
HAKMEM_INT_ENGINE=1 \
  ./bench_random_mixed_hakmem 2000000 400 42 2>&1 | less
```
(See docs/specs/ENV_VARS.md for details on environment variables and profiles.)

Acknowledgments

(TBD)
