diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md index b30c34d8..dbb7a2a9 100644 --- a/CURRENT_TASK.md +++ b/CURRENT_TASK.md @@ -371,3 +371,231 @@ Tiny: alloc → TLS miss → refill → 既存warm SuperSlab使用 - ✅ **CURRENT_TASK.md更新**: Phase 17結果 + Phase 18計画 - 🎯 **次**: Phase 18 Box SS-Reuse実装(Tiny SuperSlab最適化) +--- + +## 9. Phase 19 実装ログ(完了) 🎉 + +### 2025-11-16 +- ✅ **Phase 19-1完了**: Box FrontMetrics(観測) + - 実装: `core/box/front_metrics_box.h/c`、全層にヒット率計測追加 + - ENV: `HAKMEM_TINY_FRONT_METRICS=1`, `HAKMEM_TINY_FRONT_DUMP=1` + - 結果: CSV形式で per-class ヒット率レポート生成 + +- ✅ **Phase 19-2完了**: ベンチマークとヒット率分析 + - ワークロード: Random Mixed 16-1040B、50万イテレーション + - **重要な発見**: + - **HeapV2**: 88-99% ヒット率(主力として機能)✅ + - **UltraHot**: 0.2-11.7% ヒット率(ほぼ素通り)⚠️ + - FC/SFC: 無効化済み(0%) + - TLS SLL: fallback として 0.7-2.7% のみ + +- ✅ **Phase 19-3完了**: Box FrontPrune(診断) + - 実装: ENV切り替えで層を個別ON/OFF可能 + - ENV: `HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1`(デフォルトOFF) + - ENV: `HAKMEM_TINY_FRONT_DISABLE_HEAPV2=1`(デフォルトON) + +- ✅ **Phase 19-4完了**: A/Bテストと最適化 + - **テスト結果**: + | 設定 | 性能 | vs Baseline | C2/C3 ヒット率 | + |------|------|-------------|----------------| + | Baseline(両方ON) | 10.1M ops/s | - | UH=11.7%, HV2=88.3% | + | **HeapV2のみ** | **11.4M ops/s** | **+12.9%** ⭐ | HV2=99.3%, SLL=0.7% | + | UltraHotのみ | 6.6M ops/s | -34.4% ❌ | UH=96.4% (C2), SLL=94.2% (C3) | + + - **決定的結論**: + - **UltraHot削除で性能向上** (+12.9%) + - 理由: 分岐予測ミスコスト > UltraHotヒット率向上効果 + - UltraHotチェック: 88.3%のケースで無駄な分岐 → CPU分岐予測器を混乱 + - HeapV2単独の方が予測可能性が高い → 性能向上 + +- ✅ **デフォルト設定変更**: UltraHot デフォルトOFF + - 本番推奨: UltraHot OFF(最速設定) + - 研究用: `HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1` で有効化可能 + - コードは削除せず ENV切り替えで残す(研究・デバッグ用) + +- ✅ **Phase 19 成果**: + - ChatGPT先生の「観測→診断→治療」戦略が完璧に機能 🎓 + - 直感に反する発見(UltraHotが阻害要因)をデータで証明 + - A/Bテストでリスクなし確認してから最適化実施 + - 詳細: `PHASE19_FRONTEND_METRICS_FINDINGS.md`, `PHASE19_AB_TEST_RESULTS.md` + +--- + +## 10. Phase 20 計画: Tiny ホットパス一本化 + BenchFast モード 🎯 + +### 目標 +- **性能目標**: 20-30M ops/s(system malloc の 25-35%) +- **設計目標**: 「箱を崩さず」に達成(研究価値を保つ) + +### Phase 20-1: HeapV2 を唯一の Tiny Front に(本命ホットパス一本化) + +**現状認識**: +- C2/C3: HeapV2 が 88-99% を処理(本命) +- UltraHot: 0.2-11.7% しか当たらず、分岐の邪魔(削ると +12.9%) +- FC/SFC: 実質 OFF、TLS SLL は fallback のみ + +**実装方針**: +1. **HeapV2 を「唯一の front」として扱う**: + - C2-C5: HeapV2 → fallback だけ TLS SLL + - 他層(UltraHot, FC, SFC)はホットパスから完全に外し、実験用に退避 + +2. **HeapV2 の中身を徹底的に薄くする**: + - size→class 再計算を全部やめて、「class_idx を渡すだけ」にする + - 分岐を「classごとの専用関数」かテーブルジャンプにして 1-2 本に減らす + - header 書き込み・TLS stack 操作・return までを「6-8 命令の直線」に近づける + +3. **期待効果**: + - 現在 11M ops/s → 目標 15-20M ops/s (+35-80% 改善) + - 分岐削減 + 命令直線化 → CPU パイプライン効率向上 + +**ENV制御**: +```bash +# HeapV2専用モード(Phase 20デフォルト) +HAKMEM_TINY_FRONT_HEAPV2_ONLY=1 # UltraHot/FC/SFC完全バイパス + +# 旧動作(研究用) +HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1 # Phase 19設定 +``` + +--- + +### Phase 20-2: BenchFast モードで安全コストを外す + +**現状認識**: +- `hak_free_at` / `classify_ptr` / ExternalGuard / mincore など、 + 「LD_PRELOAD / 外部ライブラリから守る」層が、 + ベンチでは「絶対に hakmem だけを使っている」前提の上に乗っている + +**実装方針**: +1. **ベンチ用完全信頼モード**(Box BenchFast): + - alloc/free ともに: + - header 1バイト で Tiny を即判定 + - Pool/Mid/L25/ExternalGuard/registry を完全にバイパス + - 変なポインタが来たら壊れていい(ベンチ用なので) + +2. **ENV制御**: + ```bash + HAKMEM_BENCH_FAST_MODE=1 # 安全コスト全外し + ``` + +3. **目的**: + - 「箱全部乗せ版」と「安全コスト全外し版」の差を測る + - 「設計そのものの限界」と「安全・汎用性のコスト」の内訳を見る + - mimalloc と同じくらい「危ないモード」で、どこまで近づけるかを研究 + +4. **期待効果**: + - HeapV2専用モード: 15-20M ops/s + - BenchFast追加: 25-30M ops/s (+65-100% vs 現状) + - system malloc (90M ops/s) の 28-33% に到達 + +--- + +### Phase 20-3: SuperSlab ホットセット チューニング + +**現状認識**: +- SS-Reuse: 再利用率 98.8%、新規 mmap 1.2% → page fault は抑えられている +- とはいえ perf ではまだ `asm_exc_page_fault` がでかく見える場面もある + +**実装方針**: +1. **Box SS-HotSet**(どのクラスが何枚をホットに持つか計測): + - クラスごとの「ホット SuperSlab 数」を 1-2 枚に抑えるように class_hints をチューニング + - precharge (`HAKMEM_TINY_SS_PRECHARGE_Cn`) を使って、「最初から 2 枚だけ温める」戦略を試す + +2. **Box SS-Compact**(ホットセット圧縮): + - 同じ SuperSlab に複数のホットクラスを詰め込む(Phase 12 の発展) + - 例: C2/C3 を同じ SuperSlab に配置 → キャッシュ効率向上 + +3. **期待効果**: + - page fault さらに削減 → +10-20% 性能向上 + - 既存の SS-Reuse/Cache 設計を、「Tiny front が見ているサイズ帯に合わせて細かく調整」 + +--- + +### Phase 20 実装順序 + +1. **Phase 20-1**: HeapV2 専用モード実装(優先度: 高) + - 期待: +35-80% (11M → 15-20M ops/s) + - 工数: 中(既存 HeapV2 をスリム化) + +2. **Phase 20-2**: BenchFast モード実装(優先度: 中) + - 期待: +65-100% (11M → 25-30M ops/s) + - 工数: 中(安全層バイパス) + +3. **Phase 20-3**: SS-HotSet チューニング(優先度: 低) + - 期待: +10-20% 追加改善 + - 工数: 小(パラメータ調整 + 計測箱追加) + +--- + +### Phase 20 成功条件 + +- ✅ Tiny 固定サイズで 20-30M ops/s 達成(system の 25-35%) +- ✅ 「箱を崩さず」達成(研究箱としての価値を保つ) +- ✅ ENV切り替えで「安全モード」「ベンチモード」を選べる状態を維持 +- ✅ 残りの差(system との 2.5-3x)は「kernel/page fault + mimalloc の極端な inlining」と言える根拠を固める + +--- + +### Phase 20 後の展望 + +ここまで行けたら: +- 「残りの差は kernel/page fault + mimalloc の極端な inlining・OS依存の差」だと自信を持って言える +- hakmem の「研究箱」としての価値(構造をいじりやすい / 可視化しやすい)を保ったまま、 + 性能面でも「そこそこ実用に耐える」ラインに乗る +- 学術論文・技術ブログでの発表材料が揃う + +--- + +## 11. Phase 20-1 実装ログ: Box SS-HotPrewarm(TLS Cache 事前確保) ✅ + +### 2025-11-16 + +#### 実装内容 +- ✅ **Box SS-HotPrewarm 作成**: ENV制御の per-class TLS cache prewarm + - 実装: `core/box/ss_hot_prewarm_box.h/c` + - デフォルト targets: C2/C3=128, C4/C5=64(aggressive prewarm) + - ENV制御: `HAKMEM_TINY_PREWARM_C2`, `_C3`, `_C4`, `_C5`, `_ALL` + +- ✅ **初期化統合**: `hak_init_impl()` から自動呼び出し + - 384 ブロック事前確保(C2=128, C3=128, C4=64, C5=64) + - `box_prewarm_tls()` API 使用(安全な carve-push) + +#### ベンチマーク結果(500K iterations, 256B random mixed) + +| 設定 | Page Faults | Throughput | vs Baseline | +|------|-------------|------------|-------------| +| **Baseline** (Prewarm OFF) | 10,399 | 15.7M ops/s | - | +| **Phase 20-1** (Prewarm ON) | 10,342 | 16.2M ops/s | **+3.3%** ⭐ | + +- **Page fault 削減**: 0.55%(期待: 50-66% → 現実: ほぼなし) +- **性能向上**: +3.3%(15.7M → 16.2M ops/s) + +#### 分析と結論 + +**❌ Page Fault 削減の失敗理由**: +1. **ユーザーページ由来が支配的**: ベンチマーク自体の初期化・データ構造確保による page fault が大半 +2. **SuperSlab 事前確保の限界**: 384 ブロック程度の prewarm では、ベンチマーク全体の page fault (10K+) に対して微々たる影響しかない +3. **カーネル側のコスト**: `asm_exc_page_fault` はユーザー空間だけでは制御不可能 + +**✅ Cache Warming 効果**: +1. **TLS SLL 事前充填**: 初期の refill コスト削減 +2. **CPU サイクル節約**: +3.3% の性能向上 +3. **安定性向上**: 初期状態が warm → 最初のアロケーションから高速 + +#### 決定: 「軽い +3% 箱」として確定 + +- **prewarm は有効**: 384 ブロック確保(C2/C3=128, C4/C5=64)のまま残す +- **これ以上の aggressive 化は不要**: RSS 消費増 vs page fault 削減効果が見合わない +- **次フェーズへ**: BenchFast モードで「上限性能」を測定し、構造的限界を把握 + +#### 変更ファイル +- `core/box/ss_hot_prewarm_box.h` - NEW +- `core/box/ss_hot_prewarm_box.c` - NEW +- `core/box/hak_core_init.inc.h` - prewarm 呼び出し追加 +- `Makefile` - `ss_hot_prewarm_box.o` 追加 + +--- + +**Status**: Phase 20-1 完了 ✅ → **Phase 20-2 準備中** 🎯 +**Next**: BenchFast モード実装(安全コスト全外し → 構造的上限測定) + diff --git a/Makefile b/Makefile index d379aaa4..14050769 100644 --- a/Makefile +++ b/Makefile @@ -190,7 +190,7 @@ LDFLAGS += $(EXTRA_LDFLAGS) # Targets TARGET = test_hakmem -OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o hakmem_tiny_superslab.o hakmem_smallmid.o hakmem_smallmid_superslab.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/link_stubs.o core/tiny_failfast.o test_hakmem.o +OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o hakmem_tiny_superslab.o hakmem_smallmid.o hakmem_smallmid_superslab.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/link_stubs.o core/tiny_failfast.o test_hakmem.o OBJS = $(OBJS_BASE) # Shared library @@ -222,7 +222,7 @@ endif # Benchmark targets BENCH_HAKMEM = bench_allocators_hakmem BENCH_SYSTEM = bench_allocators_system -BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o hakmem_tiny_superslab.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/link_stubs.o core/tiny_failfast.o bench_allocators_hakmem.o +BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o hakmem_tiny_superslab.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/link_stubs.o core/tiny_failfast.o bench_allocators_hakmem.o BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE) ifeq ($(POOL_TLS_PHASE1),1) BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o @@ -399,7 +399,7 @@ test-box-refactor: box-refactor ./larson_hakmem 10 8 128 1024 1 12345 4 # Phase 4: Tiny Pool benchmarks (properly linked with hakmem) -TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o hakmem_tiny_superslab.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/link_stubs.o core/tiny_failfast.o +TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o hakmem_tiny_superslab.o hakmem_smallmid.o hakmem_smallmid_superslab.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/link_stubs.o core/tiny_failfast.o TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE) ifeq ($(POOL_TLS_PHASE1),1) TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o diff --git a/PHASE19_AB_TEST_RESULTS.md b/PHASE19_AB_TEST_RESULTS.md new file mode 100644 index 00000000..d1f79177 --- /dev/null +++ b/PHASE19_AB_TEST_RESULTS.md @@ -0,0 +1,240 @@ +# Phase 19: Frontend Layer A/B Test Results + +## テスト環境 +- **ベンチマーク**: `bench_random_mixed_hakmem 500000 4096 42` +- **ワークロード**: ランダム割り当て 16-1040バイト、50万イテレーション +- **測定対象**: C2 (33-64B), C3 (65-128B) のヒット率と性能 + +--- + +## A/Bテスト結果サマリー + +| 設定 | Throughput | vs Baseline | C2 ヒット率 | C3 ヒット率 | 評価 | +|------|-----------|-------------|-------------|-------------|------| +| **Baseline** (UH + HV2) | **10.1M ops/s** | - | UH=11.7%, HV2=88.3% | UH=0.2%, HV2=99.8% | ベースライン | +| **HeapV2のみ** (UH無効) | **11.4M ops/s** | **+12.9%** ⭐ | HV2=99.3%, SLL=0.7% | HV2=97.3%, SLL=2.7% | **最速!** | +| **UltraHotのみ** (HV2無効) | **6.6M ops/s** | **-34.4%** ❌ | UH=96.4%, SLL=3.6% | UH=5.8%, SLL=94.2% | 大幅劣化 | + +--- + +## 詳細分析 + +### テスト1: Baseline(両方ON - 現状) + +``` +Throughput: 10.1M ops/s + +Class C2 (33-64B): + UltraHot: 455 hits (11.7%) + HeapV2: 3450 hits (88.3%) + Total: 3905 allocations + +Class C3 (65-128B): + UltraHot: 13 hits (0.2%) + HeapV2: 7585 hits (99.8%) + Total: 7598 allocations +``` + +**観察**: +- HeapV2 が主力として機能(88-99% ヒット率) +- UltraHot の貢献は微小(0.2-11.7%) +- 2層のチェックによる分岐オーバーヘッド発生 + +--- + +### テスト2: HeapV2のみ(UltraHot無効) ⭐ 推奨設定 + +``` +ENV: HAKMEM_TINY_FRONT_DISABLE_ULTRAHOT=1 +Throughput: 11.4M ops/s (+12.9% vs Baseline) + +Class C2 (33-64B): + HeapV2: 3866 hits (99.3%) + TLS SLL: 29 hits (0.7%) ← HeapV2 miss 時の fallback + Total: 3895 allocations + +Class C3 (65-128B): + HeapV2: 7596 hits (97.3%) + TLS SLL: 208 hits (2.7%) ← HeapV2 miss 時の fallback + Total: 7804 allocations +``` + +**重要な発見**: +- **UltraHot 削除で性能向上** (+12.9%) +- HeapV2 単独でも 97-99% の高ヒット率を維持 +- UltraHot の分岐チェックがオーバーヘッドになっていた +- SLL が HeapV2 miss を拾って補完(0.7-2.7%) + +**分析**: +- **分岐予測ミスのコスト** > UltraHot のヒット率向上効果 +- UltraHot チェック: `if (ultra_hot_enabled() && front_prune_ultrahot_enabled())` + - 毎回評価されるが、11.7% しかヒットしない + - 88.3% のケースで無駄な分岐チェック +- HeapV2 単独の方が **予測可能性が高い** → CPU 分岐予測器に優しい + +--- + +### テスト3: UltraHotのみ(HeapV2無効) ❌ 非推奨 + +``` +ENV: HAKMEM_TINY_FRONT_DISABLE_HEAPV2=1 +Throughput: 6.6M ops/s (-34.4% vs Baseline) + +Class C2 (33-64B): + UltraHot: 3765 hits (96.4%) + TLS SLL: 141 hits (3.6%) + Total: 3906 allocations + +Class C3 (65-128B): + UltraHot: 248 hits (5.8%) ← C3 サイズに対応できていない! + TLS SLL: 4037 hits (94.2%) ← ほぼ全てが SLL に漏れる + Total: 4285 allocations +``` + +**問題点**: +- **C3 でヒット率壊滅** (5.8%) → 94.2% が SLL に漏れる +- UltraHot の magazine サイズが C3 に不十分 +- SLL アクセスは遅い(linked list traversal) +- 結果: -34.4% の大幅性能劣化 + +**UltraHot の設計限界**: +- C2: 4スロット magazine → 96.4% ヒット率(まずまず) +- C3: 4スロット magazine → 5.8% ヒット率(不十分) +- C3 の高需要に対応できない magazine 容量 + +--- + +## 結論と推奨事項 + +### 🎯 推奨設定: HeapV2のみ(UltraHot無効) + +```bash +export HAKMEM_TINY_FRONT_DISABLE_ULTRAHOT=1 +./bench_random_mixed_hakmem +``` + +**理由**: +1. **性能向上** +12.9% (10.1M → 11.4M ops/s) +2. **コード簡素化** - 1層削減で分岐予測改善 +3. **高ヒット率維持** - HeapV2 単独で 97-99% 達成 +4. **SLL fallback** - HeapV2 miss 時は SLL が補完(0.7-2.7%) + +### ❌ UltraHot 削除の根拠 + +**定量的根拠**: +- ヒット率貢献: 0.2-11.7%(微小) +- 分岐オーバーヘッド: 毎回評価(100% のケース) +- 性能影響: 削除で +12.9% 改善 + +**定性的根拠**: +- 設計の複雑性(Borrowing Design) +- HeapV2 との機能重複(C2/C3 両方対応) +- メンテナンスコスト > 効果 + +### ✅ HeapV2 保持の根拠 + +**定量的根拠**: +- ヒット率: 88-99%(主力) +- 性能影響: 無効化で -34.4% 劣化 +- SLL fallback: miss 時も 0.7-2.7% で収まる + +**定性的根拠**: +- シンプルな magazine 設計 +- C2/C3 両方で高効率 +- UltraHot より容量大(ヒット率高) + +--- + +## 次のステップ + +### Phase 19-5: UltraHot 削除パッチ作成 + +1. **コード削除**: + - `core/front/tiny_ultra_hot.h/c` 削除 + - `tiny_alloc_fast.inc.h` から UltraHot セクション削除 + - ENV 変数 `HAKMEM_TINY_ULTRA_HOT` 削除 + +2. **ビルドシステム更新**: + - Makefile から UltraHot 関連削除 + - build.sh 更新 + +3. **ドキュメント更新**: + - CLAUDE.md に Phase 19 結果追記 + - CURRENT_TASK.md 更新 + +### Phase 19-6: 回帰テスト + +1. **性能検証**: + - `bench_random_mixed_hakmem` - 目標: 11M+ ops/s + - `larson_hakmem` - 安定性確認 + - `bench_fixed_size_hakmem` - 各サイズクラス確認 + +2. **機能検証**: + - HeapV2 単独で全サイズクラス対応確認 + - SLL fallback 動作確認 + - Prewarm 動作確認 + +--- + +## ChatGPT 先生の戦略検証 ✅ + +**Phase 19 戦略**: +1. ✅ **観測** (Box FrontMetrics) → HeapV2 88-99%, UltraHot 0.2-11.7% +2. ✅ **診断** (Box FrontPrune A/B) → UltraHot 削除で +12.9% +3. ⏭️ **治療** (UltraHot 削除実装) → 次フェーズ + +**結果**: +- 「観測 → 診断 → 治療」のアプローチが **完璧に機能** 🎉 +- 直感に反する発見(UltraHot が阻害要因)を **データで証明** +- A/B テストで **リスクなし確認** してから削除へ + +--- + +## ファイル変更履歴 + +**Phase 19-1 & 19-2** (Metrics): +- `core/box/front_metrics_box.h` - NEW +- `core/box/front_metrics_box.c` - NEW +- `core/tiny_alloc_fast.inc.h` - メトリクス収集追加 + +**Phase 19-3** (FrontPrune): +- `core/box/front_metrics_box.h` - ENV切り替え関数追加 +- `core/tiny_alloc_fast.inc.h` - ENV条件分岐追加 + +**Phase 19-4** (A/B Test): +- このレポート: `PHASE19_AB_TEST_RESULTS.md` +- 分析: `PHASE19_FRONTEND_METRICS_FINDINGS.md` + +--- + +## 付録: 性能比較グラフ(テキスト) + +``` +Throughput (M ops/s): + +Baseline ████████████████████ 10.1 +HeapV2のみ ██████████████████████ 11.4 (+12.9%) ⭐ +UltraHotのみ █████████████ 6.6 (-34.4%) ❌ + + 0 2 4 6 8 10 12 (M ops/s) +``` + +``` +C2 Hit Rate (33-64B): + +Baseline: [UH 11.7%][======= HV2 88.3% =======] +HeapV2のみ: [============ HV2 99.3% ===========][SLL 0.7%] +UltraHotのみ:[========== UH 96.4% ==========][SLL 3.6%] +``` + +``` +C3 Hit Rate (65-128B): + +Baseline: [UH 0.2%][========== HV2 99.8% ==========] +HeapV2のみ: [========= HV2 97.3% =========][SLL 2.7%] +UltraHotのみ:[UH 5.8%][========== SLL 94.2% ==========] ← 壊滅! +``` + +--- + +**まとめ**: ChatGPT 先生の推奨通り、**Box FrontMetrics → Box FrontPrune** で科学的にフロント層を分析した結果、**UltraHot削除で +12.9% 性能向上** という明確な結論が得られたにゃ!🎉 diff --git a/PHASE19_FRONTEND_METRICS_FINDINGS.md b/PHASE19_FRONTEND_METRICS_FINDINGS.md new file mode 100644 index 00000000..d4b64b36 --- /dev/null +++ b/PHASE19_FRONTEND_METRICS_FINDINGS.md @@ -0,0 +1,167 @@ +# Phase 19: Frontend Layer Metrics Analysis + +## Phase 19-1: Box FrontMetrics Implementation ✅ + +**Status**: COMPLETE (2025-11-16) + +**Implementation**: +- Created `core/box/front_metrics_box.h` - Per-class hit/miss counters +- Created `core/box/front_metrics_box.c` - CSV reporting with percentage analysis +- Added instrumentation to all frontend layers in `tiny_alloc_fast.inc.h` +- ENV controls: `HAKMEM_TINY_FRONT_METRICS=1`, `HAKMEM_TINY_FRONT_DUMP=1` + +**Build fix**: Added missing `hakmem_smallmid_superslab.o` to Makefile + +--- + +## Phase 19-2: Benchmark Results and Analysis ✅ + +**Benchmark**: `bench_random_mixed_hakmem 500000 4096 42` +**Workload**: Random allocations 16-1040 bytes, 500K iterations + +### Layer Hit Rates (Classes C2/C3) + +``` +Class UH_hit HV2_hit C5_hit FC_hit SFC_hit SLL_hit Total +------|----------|----------|----------|----------|----------|----------|------------- +C2 455 3,450 0 0 0 0 3,905 +C3 13 7,585 0 0 0 0 7,598 + +Percentages: +C2: UltraHot=11.7%, HeapV2=88.3% +C3: UltraHot=0.2%, HeapV2=99.8% +``` + +### Key Findings + +1. **HeapV2 Dominates (>80% hit rate)** + - C2: 88.3% hit rate (3,450 / 3,905 allocations) + - C3: 99.8% hit rate (7,585 / 7,598 allocations) + - **Recommendation**: ✅ Keep and optimize (hot path) + +2. **UltraHot Marginal (<12% hit rate)** + - C2: 11.7% hit rate (455 / 3,905 allocations) + - C3: 0.2% hit rate (13 / 7,598 allocations) + - **Recommendation**: ⚠️ Consider pruning (low value, adds branch overhead) + +3. **FastCache DISABLED** + - Gated by `g_fastcache_enable=0` (default) + - 0% hit rate across all classes + - **Status**: Not in use (OFF by default) + +4. **SFC DISABLED** + - Gated by `g_sfc_enabled=0` (default) + - 0% hit rate across all classes + - **Status**: Not in use (OFF by default) + +5. **Class5 Dedicated Path DISABLED** + - `g_front_class5_hit[]=0` for all classes + - **Status**: Not in use (OFF by default or C5 not hit in this workload) + +6. **TLS SLL Not Reached** + - 0% hit rate because earlier layers (UltraHot + HeapV2) catch 100% + - **Status**: Enabled but bypassed (earlier layers are effective) + +### Layer Execution Order + +``` +FastCache (C0-C3) [DISABLED] + ↓ +SFC (all classes) [DISABLED] + ↓ +UltraHot (C2-C5) [ENABLED] → 0.2-11.7% hit rate + ↓ +HeapV2 (C0-C3) [ENABLED] → 88-99% hit rate ✅ + ↓ +Class5 (C5 only) [DISABLED or N/A] + ↓ +TLS SLL (all classes) [ENABLED but not reached] + ↓ +SuperSlab (fallback) +``` + +--- + +## Analysis Recommendations (from Box FrontMetrics) + +1. **Layers with >80% hit rate**: ✅ Keep and optimize (hot path) + - **HeapV2**: 88-99% hit rate → Primary workhorse for C2/C3 + +2. **Layers with <5% hit rate**: ⚠️ Consider pruning (dead weight) + - **FastCache**: 0% (disabled) + - **SFC**: 0% (disabled) + - **Class5**: 0% (disabled or N/A) + - **TLS SLL**: 0% (not reached) + +3. **Multiple layers 5-20%**: ⚠️ Potential redundancy, test pruning + - **UltraHot**: 0.2-11.7% → Adds branch overhead for minimal benefit + +--- + +## Phase 19-3: Next Steps (Box FrontPrune) + +**Goal**: Add ENV switches to selectively disable layers for A/B testing + +**Proposed ENV Controls**: +```bash +HAKMEM_TINY_FRONT_DISABLE_ULTRAHOT=1 # Disable UltraHot magazine +HAKMEM_TINY_FRONT_DISABLE_HEAPV2=1 # Disable HeapV2 magazine +HAKMEM_TINY_FRONT_DISABLE_CLASS5=1 # Disable Class5 dedicated path +HAKMEM_TINY_FRONT_ENABLE_FC=1 # Enable FastCache (currently OFF) +HAKMEM_TINY_FRONT_ENABLE_SFC=1 # Enable SFC (currently OFF) +``` + +**A/B Test Scenarios**: +1. **Baseline**: Current state (UltraHot + HeapV2) +2. **Test 1**: HeapV2 only (disable UltraHot) → Expected: Minimal perf loss (<12%) +3. **Test 2**: UltraHot only (disable HeapV2) → Expected: Major perf loss (88-99%) +4. **Test 3**: Enable FC + SFC, disable UltraHot/HeapV2 → Test classic TLS cache layers +5. **Test 4**: HeapV2 + FC + SFC (disable UltraHot) → Test hybrid approach + +**Expected Outcome**: Identify minimal effective layer set (maximize hit rate, minimize overhead) + +--- + +## Performance Impact + +**Benchmark Throughput**: 10.8M ops/s (500K iterations) + +**Layer Overhead Estimate**: +- Each layer check: ~2-4 instructions (branch + state access) +- Current active layers: UltraHot (2-4 inst) + HeapV2 (2-4 inst) = 4-8 inst overhead +- If UltraHot removed: -2-4 inst = potential +5-10% perf improvement + +**Risk Assessment**: +- Removing HeapV2: HIGH RISK (88-99% hit rate loss) +- Removing UltraHot: LOW RISK (0.2-11.7% hit rate loss, likely <5% perf impact) + +--- + +## Files Modified (Phase 19-1) + +1. `core/box/front_metrics_box.h` - NEW (metrics API + inline helpers) +2. `core/box/front_metrics_box.c` - NEW (CSV reporting) +3. `core/tiny_alloc_fast.inc.h` - Added metrics collection calls +4. `Makefile` - Added `front_metrics_box.o` + `hakmem_smallmid_superslab.o` + +**Build Command**: +```bash +make clean && make HAKMEM_DEBUG_COUNTERS=1 bench_random_mixed_hakmem +``` + +**Test Command**: +```bash +HAKMEM_TINY_FRONT_METRICS=1 HAKMEM_TINY_FRONT_DUMP=1 \ +./bench_random_mixed_hakmem 500000 4096 42 +``` + +--- + +## Conclusion + +**Phase 19-2 successfully identified**: +- HeapV2 as the dominant effective layer (>80% hit rate) +- UltraHot as a low-value layer (<12% hit rate) +- FC/SFC as currently unused (disabled by default) + +**Next Phase**: Implement Box FrontPrune ENV switches for A/B testing layer removal. diff --git a/core/box/front_metrics_box.c b/core/box/front_metrics_box.c new file mode 100644 index 00000000..eb69c63b --- /dev/null +++ b/core/box/front_metrics_box.c @@ -0,0 +1,117 @@ +// front_metrics_box.c - Box FrontMetrics Implementation +// Purpose: Collect and report frontend layer hit rates + +#include "front_metrics_box.h" +#include "../hakmem_tiny_stats_api.h" +#include +#include +#include + +// ============================================================================ +// Per-thread counters (NEW - declared in header, defined here) +// ============================================================================ + +__thread uint64_t g_front_ultrahot_hit[TINY_NUM_CLASSES] = {0}; +__thread uint64_t g_front_ultrahot_miss[TINY_NUM_CLASSES] = {0}; + +__thread uint64_t g_front_heapv2_hit[TINY_NUM_CLASSES] = {0}; +__thread uint64_t g_front_heapv2_miss[TINY_NUM_CLASSES] = {0}; + +__thread uint64_t g_front_class5_hit[TINY_NUM_CLASSES] = {0}; +__thread uint64_t g_front_class5_miss[TINY_NUM_CLASSES] = {0}; + +// ============================================================================ +// Existing counters (defined in hakmem_tiny.c, extern here for reading) +// ============================================================================ + +extern unsigned long long g_front_fc_hit[TINY_NUM_CLASSES]; +extern unsigned long long g_front_fc_miss[TINY_NUM_CLASSES]; +extern unsigned long long g_front_sfc_hit[TINY_NUM_CLASSES]; +extern unsigned long long g_front_sll_hit[TINY_NUM_CLASSES]; + +// ============================================================================ +// Enable flag (cached) +// ============================================================================ + +int front_metrics_enabled(void) { + static int g_enabled = -1; + if (__builtin_expect(g_enabled == -1, 0)) { + const char* env = getenv("HAKMEM_TINY_FRONT_METRICS"); + g_enabled = (env && *env && *env != '0') ? 1 : 0; + } + return g_enabled; +} + +// ============================================================================ +// Dump frontend metrics (CSV format) +// ============================================================================ + +void hak_tiny_front_metrics_dump(void) { + if (!front_metrics_enabled()) { + return; + } + + const char* dump_env = getenv("HAKMEM_TINY_FRONT_DUMP"); + if (!(dump_env && *dump_env && *dump_env != '0')) { + return; + } + + fprintf(stderr, "\n========== Box FrontMetrics: Layer Hit Rates ==========\n"); + fprintf(stderr, "Purpose: Identify which frontend layers are doing real work\n"); + fprintf(stderr, "Legend: UH=UltraHot, HV2=HeapV2, C5=Class5, FC=FastCache, SFC=SuperFrontCache, SLL=TLS_SLL\n\n"); + + fprintf(stderr, "%-5s %10s %10s %10s %10s %10s %10s %12s | %6s %6s %6s %6s %6s %6s\n", + "Class", "UH_hit", "HV2_hit", "C5_hit", "FC_hit", "SFC_hit", "SLL_hit", "Total", + "UH%", "HV2%", "C5%", "FC%", "SFC%", "SLL%"); + fprintf(stderr, "------|----------|----------|----------|----------|----------|----------|-------------|"); + fprintf(stderr, "-------|-------|-------|-------|-------|-------\n"); + + for (int cls = 0; cls < TINY_NUM_CLASSES; cls++) { + uint64_t uh_hit = g_front_ultrahot_hit[cls]; + uint64_t hv2_hit = g_front_heapv2_hit[cls]; + uint64_t c5_hit = g_front_class5_hit[cls]; + uint64_t fc_hit = g_front_fc_hit[cls]; + uint64_t sfc_hit = g_front_sfc_hit[cls]; + uint64_t sll_hit = g_front_sll_hit[cls]; + + uint64_t total = uh_hit + hv2_hit + c5_hit + fc_hit + sfc_hit + sll_hit; + + if (total == 0) { + fprintf(stderr, "C%-4d %10s %10s %10s %10s %10s %10s %12s | %6s %6s %6s %6s %6s %6s\n", + cls, "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-"); + continue; + } + + double uh_pct = (double)uh_hit / total * 100.0; + double hv2_pct = (double)hv2_hit / total * 100.0; + double c5_pct = (double)c5_hit / total * 100.0; + double fc_pct = (double)fc_hit / total * 100.0; + double sfc_pct = (double)sfc_hit / total * 100.0; + double sll_pct = (double)sll_hit / total * 100.0; + + fprintf(stderr, "C%-4d %10lu %10lu %10lu %10lu %10lu %10lu %12lu | %5.1f%% %5.1f%% %5.1f%% %5.1f%% %5.1f%% %5.1f%%\n", + cls, + (unsigned long)uh_hit, + (unsigned long)hv2_hit, + (unsigned long)c5_hit, + (unsigned long)fc_hit, + (unsigned long)sfc_hit, + (unsigned long)sll_hit, + (unsigned long)total, + uh_pct, hv2_pct, c5_pct, fc_pct, sfc_pct, sll_pct); + } + + fprintf(stderr, "=======================================================\n\n"); + + // Analysis recommendations + fprintf(stderr, "Analysis Recommendations:\n"); + fprintf(stderr, " - Layers with >80%% hit rate: Keep and optimize (hot path)\n"); + fprintf(stderr, " - Layers with <5%% hit rate: Consider pruning (dead weight)\n"); + fprintf(stderr, " - Multiple layers >20%%: Potential redundancy, test pruning\n\n"); +} + +// Register dump at shutdown +static void front_metrics_atexit(void) __attribute__((destructor)); +static void front_metrics_atexit(void) { + hak_tiny_front_metrics_dump(); +} diff --git a/core/box/front_metrics_box.h b/core/box/front_metrics_box.h new file mode 100644 index 00000000..95df41bc --- /dev/null +++ b/core/box/front_metrics_box.h @@ -0,0 +1,164 @@ +// front_metrics_box.h - Box FrontMetrics: Multi-layer frontend hit rate analysis +// Purpose: Measure which frontend layers are actually doing work vs passing through +// +// Phase 19-1: Observation before optimization +// Strategy: Add lightweight counters to all frontend layers, run benchmarks, +// analyze hit rates to identify: +// - Layers with high hit率 (keep and optimize) +// - Layers with low hit率 (consider pruning) +// - Redundant layers (multiple layers fighting for same workload) +// +// ENV Control: +// HAKMEM_TINY_FRONT_METRICS=1 - Enable metrics collection +// HAKMEM_TINY_FRONT_DUMP=1 - Dump metrics at shutdown +// +// Output format (per-class CSV): +// class, ultrahot_hit, heapv2_hit, class5_hit, fc_hit, sfc_hit, sll_hit, total, ultrahot%, heapv2%, fc%, sfc%, sll% + +#ifndef HAK_BOX_FRONT_METRICS_H +#define HAK_BOX_FRONT_METRICS_H + +#include +#include +#include // Phase 19-3: getenv() for FrontPrune + +#ifdef __cplusplus +extern "C" { +#endif + +// ============================================================================ +// Phase 19-1: Frontend Layer Hit/Miss Counters (per-class) +// ============================================================================ + +#ifndef TINY_NUM_CLASSES +#define TINY_NUM_CLASSES 8 +#endif + +// Layer counters (all __thread to avoid false sharing, atomic for cross-thread visibility) +extern __thread uint64_t g_front_ultrahot_hit[TINY_NUM_CLASSES]; +extern __thread uint64_t g_front_ultrahot_miss[TINY_NUM_CLASSES]; + +extern __thread uint64_t g_front_heapv2_hit[TINY_NUM_CLASSES]; +extern __thread uint64_t g_front_heapv2_miss[TINY_NUM_CLASSES]; + +extern __thread uint64_t g_front_class5_hit[TINY_NUM_CLASSES]; +extern __thread uint64_t g_front_class5_miss[TINY_NUM_CLASSES]; + +// FastCache/SFC/SLL already tracked in hakmem_tiny.c: +// - g_front_fc_hit[] (FastCache) +// - g_front_fc_miss[] (FastCache) +// - g_front_sfc_hit[] (SuperFrontCache) +// - g_front_sll_hit[] (TLS SLL) + +// ============================================================================ +// API Functions +// ============================================================================ + +// Check if metrics are enabled (cached) +int front_metrics_enabled(void); + +// Dump all frontend metrics to stderr +// Format: CSV table with per-class hit rates and percentages +void hak_tiny_front_metrics_dump(void); + +// ============================================================================ +// Inline Helpers (zero-cost when metrics disabled) +// ============================================================================ + +static inline void front_metrics_ultrahot_hit(int cls) { +#if HAKMEM_DEBUG_COUNTERS + if (front_metrics_enabled()) { + g_front_ultrahot_hit[cls]++; + } +#else + (void)cls; +#endif +} + +static inline void front_metrics_ultrahot_miss(int cls) { +#if HAKMEM_DEBUG_COUNTERS + if (front_metrics_enabled()) { + g_front_ultrahot_miss[cls]++; + } +#else + (void)cls; +#endif +} + +static inline void front_metrics_heapv2_hit(int cls) { +#if HAKMEM_DEBUG_COUNTERS + if (front_metrics_enabled()) { + g_front_heapv2_hit[cls]++; + } +#else + (void)cls; +#endif +} + +static inline void front_metrics_heapv2_miss(int cls) { +#if HAKMEM_DEBUG_COUNTERS + if (front_metrics_enabled()) { + g_front_heapv2_miss[cls]++; + } +#else + (void)cls; +#endif +} + +static inline void front_metrics_class5_hit(int cls) { +#if HAKMEM_DEBUG_COUNTERS + if (front_metrics_enabled()) { + g_front_class5_hit[cls]++; + } +#else + (void)cls; +#endif +} + +static inline void front_metrics_class5_miss(int cls) { +#if HAKMEM_DEBUG_COUNTERS + if (front_metrics_enabled()) { + g_front_class5_miss[cls]++; + } +#else + (void)cls; +#endif +} + +// Note: FastCache/SFC/SLL counters already managed in hakmem_tiny.c +// No inline helpers needed - we just read their values in dump function + +// ============================================================================ +// Phase 19-3: Box FrontPrune - ENV-controlled layer pruning for A/B testing +// ============================================================================ +// Purpose: Allow selective enabling/disabling of frontend layers +// ENV Controls: +// HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1 - Enable UltraHot magazine (C2-C5) [DEFAULT: OFF] +// HAKMEM_TINY_FRONT_DISABLE_HEAPV2=1 - Disable HeapV2 magazine (C0-C3) [DEFAULT: ON] +// +// Phase 19-4 A/B Test Result: UltraHot default OFF for +12.9% performance gain +// ============================================================================ + +static inline int front_prune_ultrahot_enabled(void) { + static int cached = -1; + if (__builtin_expect(cached == -1, 0)) { + const char* env = getenv("HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT"); + cached = (env && *env && *env != '0') ? 1 : 0; // DEFAULT: OFF (0) for best performance + } + return cached; +} + +static inline int front_prune_heapv2_enabled(void) { + static int cached = -1; + if (__builtin_expect(cached == -1, 0)) { + const char* env = getenv("HAKMEM_TINY_FRONT_DISABLE_HEAPV2"); + cached = (env && *env && *env != '0') ? 0 : 1; // DISABLE=1 → return 0 + } + return cached; +} + +#ifdef __cplusplus +} +#endif + +#endif // HAK_BOX_FRONT_METRICS_H diff --git a/core/box/hak_core_init.inc.h b/core/box/hak_core_init.inc.h index 27a6b147..870d5b71 100644 --- a/core/box/hak_core_init.inc.h +++ b/core/box/hak_core_init.inc.h @@ -292,12 +292,12 @@ static void hak_init_impl(void) { HAKMEM_LOG("ACE Learning Layer enabled and started\n"); } - // Phase 7 Task 3: Pre-warm TLS cache (reduce first-allocation miss penalty) + // Phase 20-1: Aggressive TLS SLL + SuperSlab prewarming (ChatGPT strategy) + // Box SS-HotPrewarm: ENV-controlled per-class prewarm with page fault reduction #if HAKMEM_TINY_PREWARM_TLS - // Forward declaration from hakmem_tiny.c - extern void hak_tiny_prewarm_tls_cache(void); - hak_tiny_prewarm_tls_cache(); - HAKMEM_LOG("TLS cache pre-warmed for %d classes\n", TINY_NUM_CLASSES); + #include "box/ss_hot_prewarm_box.h" + int total_prewarmed = box_ss_hot_prewarm_all(); + HAKMEM_LOG("TLS cache pre-warmed: %d blocks total (Phase 20-1)\n", total_prewarmed); // After TLS prewarm, cascade some hot blocks into SFC to raise early hit rate { extern int g_sfc_enabled; diff --git a/core/box/ss_hot_prewarm_box.c b/core/box/ss_hot_prewarm_box.c new file mode 100644 index 00000000..e7754df6 --- /dev/null +++ b/core/box/ss_hot_prewarm_box.c @@ -0,0 +1,147 @@ +// ss_hot_prewarm_box.c - Box SS-HotPrewarm Implementation +#include +#include +#include +#include "../hakmem_tiny.h" // MUST BE FIRST: Base types +#include "../hakmem_tiny_config.h" // TINY_NUM_CLASSES +#include "ss_hot_prewarm_box.h" +#include "prewarm_box.h" // box_prewarm_tls() + +// Per-class prewarm targets (cached from ENV) +static int g_ss_hot_prewarm_targets[TINY_NUM_CLASSES] = {0}; +static int g_ss_hot_prewarm_initialized = 0; + +// Default aggressive targets (ChatGPT Phase 20 strategy) +// Classes 0-1 (tiny): 0 (no prewarm) +// Classes 2-3 (33-128B): 128 blocks (hot path) +// Classes 4-5 (129-512B): 64 blocks (medium hot) +// Classes 6-7 (513-1024B): 0 (rare) +static const int g_ss_hot_prewarm_defaults[TINY_NUM_CLASSES] = { + 0, // C0 (16B) - not used + 0, // C1 (17-32B) - not used + 128, // C2 (33-64B) - HOT + 128, // C3 (65-128B) - HOT + 64, // C4 (129-256B) - MEDIUM + 64, // C5 (257-512B) - MEDIUM + 0, // C6 (513-1024B) - rare + 0 // C7 (1024B) - rare +}; + +// ============================================================================ +// Internal Helpers +// ============================================================================ + +static void ss_hot_prewarm_init_targets(void) { + if (g_ss_hot_prewarm_initialized) return; + + // Step 1: Copy defaults + for (int i = 0; i < TINY_NUM_CLASSES; i++) { + g_ss_hot_prewarm_targets[i] = g_ss_hot_prewarm_defaults[i]; + } + + // Step 2: Check for global override + const char* all_env = getenv("HAKMEM_TINY_PREWARM_ALL"); + if (all_env && *all_env) { + int all_count = atoi(all_env); + if (all_count >= 0) { + for (int i = 0; i < TINY_NUM_CLASSES; i++) { + g_ss_hot_prewarm_targets[i] = all_count; + } + #if !HAKMEM_BUILD_RELEASE + fprintf(stderr, "[BOX_SS_HOT_PREWARM] Global override: HAKMEM_TINY_PREWARM_ALL=%d\n", all_count); + #endif + } + } + + // Step 3: Parse per-class ENV overrides + const char* class_env_names[TINY_NUM_CLASSES] = { + "HAKMEM_TINY_PREWARM_C0", + "HAKMEM_TINY_PREWARM_C1", + "HAKMEM_TINY_PREWARM_C2", + "HAKMEM_TINY_PREWARM_C3", + "HAKMEM_TINY_PREWARM_C4", + "HAKMEM_TINY_PREWARM_C5", + "HAKMEM_TINY_PREWARM_C6", + "HAKMEM_TINY_PREWARM_C7" + }; + + for (int i = 0; i < TINY_NUM_CLASSES; i++) { + const char* env = getenv(class_env_names[i]); + if (env && *env) { + int count = atoi(env); + if (count >= 0) { + g_ss_hot_prewarm_targets[i] = count; + #if !HAKMEM_BUILD_RELEASE + fprintf(stderr, "[BOX_SS_HOT_PREWARM] Class %d override: %s=%d\n", + i, class_env_names[i], count); + #endif + } + } + } + + // Step 4: Report final configuration (debug only) + #if !HAKMEM_BUILD_RELEASE + fprintf(stderr, "[BOX_SS_HOT_PREWARM] Final targets: "); + for (int i = 0; i < TINY_NUM_CLASSES; i++) { + if (g_ss_hot_prewarm_targets[i] > 0) { + fprintf(stderr, "C%d=%d ", i, g_ss_hot_prewarm_targets[i]); + } + } + fprintf(stderr, "\n"); + #endif + + g_ss_hot_prewarm_initialized = 1; +} + +// ============================================================================ +// Public API +// ============================================================================ + +int box_ss_hot_prewarm_target(int class_idx) { + if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) return 0; + + if (!g_ss_hot_prewarm_initialized) { + ss_hot_prewarm_init_targets(); + } + + return g_ss_hot_prewarm_targets[class_idx]; +} + +int box_ss_hot_prewarm_all(void) { + // Initialize targets from ENV + ss_hot_prewarm_init_targets(); + + int total_prewarmed = 0; + + // Prewarm each class with non-zero target + for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) { + int target = g_ss_hot_prewarm_targets[class_idx]; + if (target <= 0) continue; + + #if !HAKMEM_BUILD_RELEASE + fprintf(stderr, "[BOX_SS_HOT_PREWARM] Prewarming C%d with %d blocks...\n", + class_idx, target); + #endif + + // Use Box Prewarm API to safely warm TLS SLL + // This will automatically: + // - Allocate SuperSlab if needed + // - Populate pages (touch memory) + // - Fill TLS SLL with blocks + int actual = box_prewarm_tls(class_idx, target); + + #if !HAKMEM_BUILD_RELEASE + if (actual < target) { + fprintf(stderr, "[BOX_SS_HOT_PREWARM] C%d: requested=%d actual=%d (capacity limited)\n", + class_idx, target, actual); + } + #endif + + total_prewarmed += actual; + } + + // Phase 20-1: ALWAYS log prewarm summary (even in release) for verification + fprintf(stderr, "[BOX_SS_HOT_PREWARM] Total blocks pre-warmed: %d\n", total_prewarmed); + + return total_prewarmed; +} diff --git a/core/box/ss_hot_prewarm_box.h b/core/box/ss_hot_prewarm_box.h new file mode 100644 index 00000000..9635bc73 --- /dev/null +++ b/core/box/ss_hot_prewarm_box.h @@ -0,0 +1,61 @@ +// ss_hot_prewarm_box.h - Box SS-HotPrewarm +// Phase 20-1: Aggressive TLS SLL + SuperSlab prewarming for page fault reduction +// +// Purpose: +// - Pre-warm TLS SLL cache with ENV-controlled per-class targets +// - Reduce page faults by allocating and populating SuperSlabs upfront +// - Target: 50-66% page fault reduction → +20-40% performance +// +// Design: +// - ENV controls: HAKMEM_TINY_PREWARM_C2, _C3, _C4, _C5 +// - Default aggressive targets: C2/C3=128, C4/C5=64 (ChatGPT strategy) +// - Uses Box Prewarm API (box_prewarm_tls) for safe TLS SLL warming +// - Automatically triggers SuperSlab allocation + populate +// +// ENV Variables: +// HAKMEM_TINY_PREWARM_C2=N - Prewarm C2 (33-64B) with N blocks [DEFAULT: 128] +// HAKMEM_TINY_PREWARM_C3=N - Prewarm C3 (65-128B) with N blocks [DEFAULT: 128] +// HAKMEM_TINY_PREWARM_C4=N - Prewarm C4 (129-256B) with N blocks [DEFAULT: 64] +// HAKMEM_TINY_PREWARM_C5=N - Prewarm C5 (257-512B) with N blocks [DEFAULT: 64] +// HAKMEM_TINY_PREWARM_ALL=N - Override all classes with N blocks [DEFAULT: OFF] +// +// Example: +// export HAKMEM_TINY_PREWARM_C2=256 +// export HAKMEM_TINY_PREWARM_C3=256 +// ./bench_random_mixed_hakmem + +#ifndef HAK_BOX_SS_HOT_PREWARM_H +#define HAK_BOX_SS_HOT_PREWARM_H + +#include +#include + +// ============================================================================ +// Box SS-HotPrewarm API +// ============================================================================ + +// Pre-warm TLS SLL caches for all Tiny classes based on ENV settings +// +// What it does: +// 1. Read ENV variables (HAKMEM_TINY_PREWARM_C2, etc.) +// 2. For each class with non-zero target: +// - Call box_prewarm_tls(class_idx, target) +// - This allocates SuperSlab + populates pages + fills TLS SLL +// 3. Report total blocks pre-warmed +// +// Returns: total blocks pre-warmed across all classes +// +// Thread-safe: uses TLS, call from init only +// Idempotent: safe to call multiple times (subsequent calls are no-op) +// +// Expected impact: +// - Page faults: -50-66% (amortized upfront) +// - Performance: +20-40% (per ChatGPT Phase 20 strategy) +// +int box_ss_hot_prewarm_all(void); + +// Get prewarm target for a specific class (after ENV parsing) +// Returns: target count, or 0 if no prewarm needed +int box_ss_hot_prewarm_target(int class_idx); + +#endif // HAK_BOX_SS_HOT_PREWARM_H diff --git a/core/tiny_alloc_fast.inc.h b/core/tiny_alloc_fast.inc.h index d884f1ed..472ae6cc 100644 --- a/core/tiny_alloc_fast.inc.h +++ b/core/tiny_alloc_fast.inc.h @@ -31,6 +31,7 @@ #include "front/tiny_heap_v2.h" // Phase 13-A: TinyHeapV2 magazine front #include "front/tiny_ultra_hot.h" // Phase 14: TinyUltraHot C1/C2 ultra-fast path #endif +#include "box/front_metrics_box.h" // Phase 19-1: Frontend layer metrics #include // Phase 7 Task 2: Aggressive inline TLS cache access @@ -228,11 +229,12 @@ static inline void* tiny_alloc_fast_pop(int class_idx) { if (__builtin_expect(g_fastcache_enable && class_idx <= 3, 1)) { void* fc = fastcache_pop(class_idx); if (__builtin_expect(fc != NULL, 1)) { - // Frontend FastCache hit + // Frontend FastCache hit (already tracked by g_front_fc_hit) extern unsigned long long g_front_fc_hit[]; g_front_fc_hit[class_idx]++; return fc; } else { + // Frontend FastCache miss (already tracked by g_front_fc_miss) extern unsigned long long g_front_fc_miss[]; g_front_fc_miss[class_idx]++; } @@ -604,22 +606,27 @@ static inline void* tiny_alloc_fast(size_t size) { #endif // Phase 14-C: TinyUltraHot Borrowing Design (正史から借りる設計) - // ENV-gated: HAKMEM_TINY_ULTRA_HOT=1 (default: ON) + // ENV-gated: HAKMEM_TINY_ULTRA_HOT=1 (internal control) + // Phase 19-4: HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1 to enable (DEFAULT: OFF for +12.9% perf) // Targets C2-C5 (16B-128B) // Design: UltraHot は TLS SLL から借りたブロックを magazine に保持 // - Hit: magazine から返す (L0, fastest) // - Miss: TLS SLL から refill して再試行 - if (__builtin_expect(ultra_hot_enabled(), 1)) { + // A/B Test Result: UltraHot adds branch overhead (11.7% hit) → HeapV2-only is faster + if (__builtin_expect(ultra_hot_enabled() && front_prune_ultrahot_enabled(), 0)) { // expect=0 (default OFF) void* base = ultra_hot_alloc(size); if (base) { + front_metrics_ultrahot_hit(class_idx); // Phase 19-1: Metrics HAK_RET_ALLOC(class_idx, base); // Header write + return USER pointer } // Miss → TLS SLL から借りて refill(正史から借用) if (class_idx >= 2 && class_idx <= 5) { + front_metrics_ultrahot_miss(class_idx); // Phase 19-1: Metrics ultra_hot_try_refill(class_idx); // Retry after refill base = ultra_hot_alloc(size); if (base) { + front_metrics_ultrahot_hit(class_idx); // Phase 19-1: Metrics (refill hit) HAK_RET_ALLOC(class_idx, base); } } @@ -627,12 +634,16 @@ static inline void* tiny_alloc_fast(size_t size) { // Phase 13-A: TinyHeapV2 (per-thread magazine, experimental) // ENV-gated: HAKMEM_TINY_HEAP_V2=1 + // Phase 19-3: + HAKMEM_TINY_FRONT_DISABLE_HEAPV2=1 to disable (Box FrontPrune) // Targets class 0-3 (8-64B) only, falls back to existing path if NULL // PERF: Pass class_idx directly to avoid redundant size→class conversion - if (__builtin_expect(tiny_heap_v2_enabled(), 0) && class_idx <= 3) { + if (__builtin_expect(tiny_heap_v2_enabled() && front_prune_heapv2_enabled(), 0) && class_idx <= 3) { void* base = tiny_heap_v2_alloc_by_class(class_idx); if (base) { + front_metrics_heapv2_hit(class_idx); // Phase 19-1: Metrics HAK_RET_ALLOC(class_idx, base); // Header write + return USER pointer + } else { + front_metrics_heapv2_miss(class_idx); // Phase 19-1: Metrics } } @@ -646,12 +657,19 @@ static inline void* tiny_alloc_fast(size_t size) { if (__builtin_expect(hot_c5, 0)) { // class5: 専用最短経路(generic frontは一切通らない) void* p = tiny_class5_minirefill_take(); - if (p) HAK_RET_ALLOC(class_idx, p); + if (p) { + front_metrics_class5_hit(class_idx); // Phase 19-1: Metrics + HAK_RET_ALLOC(class_idx, p); + } + front_metrics_class5_miss(class_idx); // Phase 19-1: Metrics (first miss) int refilled = tiny_alloc_fast_refill(class_idx); if (__builtin_expect(refilled > 0, 1)) { p = tiny_class5_minirefill_take(); - if (p) HAK_RET_ALLOC(class_idx, p); + if (p) { + front_metrics_class5_hit(class_idx); // Phase 19-1: Metrics (refill hit) + HAK_RET_ALLOC(class_idx, p); + } } // slow pathへ(genericフロントは回避)