Step 1 & 2 Complete:
- Implemented: core/front/malloc_tiny_fast.h prefetch (lines 264-267, 331-334)
- LEGACY path prefetch of g_unified_cache[class_idx] to L1
- ENV gate: HAKMEM_TINY_PREFETCH=0/1 (default OFF)
- Conditional: only when prefetch enabled + route_kind == LEGACY
- A/B test (Mixed 10-run): PREFETCH=0 (39.33M) → =1 (39.20M) = -0.34% avg
- Median: +1.28% (within ±1.0% neutral range)
- Result: 🔬 NEUTRAL (research box, default OFF)

Decision: FREEZE as research box
- Average -0.34% suggests prefetch overhead > benefit
- Prefetch timing too late (after route_kind selection)
- TLS cache access is already fast (head/tail indices)
- Actual memory wait happens at slots[] array access (after prefetch)

Technical Learning:
- Prefetch effectiveness depends on L1 miss rate at access time
- Inserting prefetch after route selection may be too late
- Future approach: move prefetch earlier or use different target

Next: Phase 3 C2 (Metadata Cache Optimization, expected +5-10%)

🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Phase 3: Cache Locality Optimization (Kickoff Instructions)
Goal
Current: Mixed ~35.2M ops/s (after B3+B4)
Target: 57-68M ops/s (+12-22%)
Phase 3 Structure
C3 (Priority: 🔴 Highest): Static Routing
Background:
- In perf top for the Mixed workload, malloc/policy_snapshot is hot
- Current behavior: every malloc takes a policy snapshot and runs a learner evaluation, which adds significant overhead
- Proposal: decide a "static route" at init time, before malloc_tiny_fast() is called
Design memo: docs/analysis/PHASE3_C3_STATIC_ROUTING_1_DESIGN.md
Implementation steps:
- Profiling (establish the baseline)
    HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE ./bench_random_mixed_hakmem 1000000 400 1
    # Set the variable in front of perf so the benchmark inherits it;
    # placing it after "--" would make perf try to execute it as a command.
    HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 --call-graph dwarf -- ./bench_random_mixed_hakmem 1000000 400 1
    perf report --stdio
    # → check how much of the profile malloc/policy_snapshot/learner account for
- Static Route Detection (at init time)
  - Decide the route before malloc_tiny_fast() is called
  - Target: for each class C0-C7, determine whether LEGACY is dominant
  - ENV gate: HAKMEM_TINY_STATIC_ROUTE=1/0 (default 0)
- Route Bypass
    // Current (evaluated on every call):
    route = g_policy_learner->get_route(class_idx);  // expensive
    // C3 Static (decided at init time):
    if (static_route_enabled()) {
        route = g_static_route[class_idx];  // cached, no learner
    } else {
        route = learner_route(...);  // same as before
    }
- Expected: +5-8%
- A/B Test: Mixed 10-run + C6-heavy 5-run
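The detection and bypass steps above can be sketched in C as follows. This is a minimal illustration, not the project's implementation: `route_kind_t`, `ROUTE_OTHER`, `static_route_configure()`, `tiny_route_for_class()`, the `g_learner_calls` counter, and the body of `learner_route()` are all hypothetical stand-ins. Only the `HAKMEM_TINY_STATIC_ROUTE` gate, the `g_static_route[]` table, the C0-C7 class range, and the "disabled under LD" rule come from this plan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TINY_CLASS_COUNT 8 /* C0-C7, per the plan */

/* Hypothetical route kinds: only LEGACY is named in the plan. */
typedef enum { ROUTE_LEGACY = 0, ROUTE_OTHER = 1 } route_kind_t;

static route_kind_t g_static_route[TINY_CLASS_COUNT];
static int g_static_route_on = 0;
static unsigned long g_learner_calls = 0; /* instrumentation for this sketch only */

/* Stand-in for the per-call policy snapshot + learner evaluation. */
static route_kind_t learner_route(int class_idx) {
    g_learner_calls++;
    return (class_idx == 7) ? ROUTE_OTHER : ROUTE_LEGACY;
}

/* env_value is the raw HAKMEM_TINY_STATIC_ROUTE string (may be NULL).
 * A non-zero ld_mode forces the gate off, following the safety notes. */
static void static_route_configure(const char *env_value, int ld_mode) {
    g_static_route_on = (!ld_mode && env_value != NULL &&
                         strcmp(env_value, "1") == 0);
    if (!g_static_route_on) return;
    for (int c = 0; c < TINY_CLASS_COUNT; c++)
        g_static_route[c] = learner_route(c); /* evaluated once per class */
}

/* Init-time entry point: call before the first malloc_tiny_fast(). */
static void static_route_init(int ld_mode) {
    static_route_configure(getenv("HAKMEM_TINY_STATIC_ROUTE"), ld_mode);
}

/* Hot path: a table load when the gate is on, the learner otherwise. */
static route_kind_t tiny_route_for_class(int class_idx) {
    assert(class_idx >= 0 && class_idx < TINY_CLASS_COUNT);
    if (g_static_route_on) return g_static_route[class_idx]; /* no learner */
    return learner_route(class_idx); /* unchanged fallback path */
}
```

With the gate on, the learner runs exactly once per class at init and never again on the hot path. Unsetting the variable restores the original behavior without a rebuild.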
C1 (Priority: 🟡 Medium): TLS Cache Prefetch
Aim: prefetch the TLS cache (Unified Cache) that alloc actually touches, rather than the policy
Design memo: docs/analysis/PHASE3_C1_TLS_PREFETCH_1_DESIGN.md
Implementation:
// Inside malloc_tiny_fast_for_class(), only when the route is LEGACY:
__builtin_prefetch(&g_unified_cache[class_idx], 0, 3);
Expected: +2-4%
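A gated version of that prefetch might look like the sketch below. The `unified_cache_t` layout (head/tail indices plus a `slots[]` array) is inferred from the commit notes, and `UC_SLOTS`, `tiny_prefetch_enabled()`, `unified_cache_push()`, and `unified_cache_pop()` are hypothetical names. `__builtin_prefetch` is the GCC/Clang builtin: its second argument `0` requests a read prefetch and its third argument `3` requests the highest temporal locality.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TINY_CLASS_COUNT 8
#define UC_SLOTS 64 /* hypothetical capacity */

typedef enum { ROUTE_LEGACY = 0, ROUTE_OTHER = 1 } route_kind_t;

/* Assumed layout: ring indices followed by the slot array. */
typedef struct {
    unsigned head, tail; /* monotonically increasing; empty when equal */
    void *slots[UC_SLOTS];
} unified_cache_t;

static _Thread_local unified_cache_t g_unified_cache[TINY_CLASS_COUNT];

/* Reads HAKMEM_TINY_PREFETCH once and caches the result (default OFF). */
static int tiny_prefetch_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        const char *e = getenv("HAKMEM_TINY_PREFETCH");
        cached = (e != NULL && e[0] == '1') ? 1 : 0;
    }
    return cached;
}

/* Returns 1 on success, 0 when the ring is full. */
static int unified_cache_push(int class_idx, void *p) {
    assert(class_idx >= 0 && class_idx < TINY_CLASS_COUNT);
    unified_cache_t *uc = &g_unified_cache[class_idx];
    if (uc->tail - uc->head == UC_SLOTS) return 0;
    uc->slots[uc->tail % UC_SLOTS] = p;
    uc->tail++;
    return 1;
}

/* Returns a cached block, or NULL when the ring is empty. */
static void *unified_cache_pop(int class_idx, route_kind_t route) {
    assert(class_idx >= 0 && class_idx < TINY_CLASS_COUNT);
    unified_cache_t *uc = &g_unified_cache[class_idx];
    if (route == ROUTE_LEGACY && tiny_prefetch_enabled())
        __builtin_prefetch(uc, 0, 3); /* read, keep in all cache levels */
    if (uc->head == uc->tail) return NULL;
    void *p = uc->slots[uc->head % UC_SLOTS];
    uc->head++;
    return p;
}
```

In this sketch the prefetch is issued immediately before the line is read, so it has little time to take effect. A prefetch helps only when enough independent work separates it from the first use.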
C2 (Priority: 🟡 Medium): Slab Metadata Cache Optimization
Aim: place hot metadata (policy, slab descriptor) closer to where it is used
Expected: +5-10%
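One common way to bring hot metadata closer together is hot/cold field splitting, sketched below. The plan does not specify the actual slab descriptor layout, so `slab_meta_t`, every field name, and the 64-byte `CACHE_LINE` size are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed line size; verify for the target CPU */

/* Hypothetical slab descriptor: fields read on every alloc/free share the
 * first cache line, while statistics and bookkeeping start on the next. */
typedef struct {
    /* hot: touched on the allocation fast path */
    alignas(CACHE_LINE) void *freelist;
    uint32_t used;
    uint32_t capacity;
    uint8_t class_idx;
    uint8_t route; /* per-class route decision kept next to the data it guards */

    /* cold: touched only on refill, release, or stats dumps */
    alignas(CACHE_LINE) uint64_t alloc_total;
    uint64_t free_total;
    void *owner;
} slab_meta_t;

/* Compile-time checks that the split lands where intended. */
static_assert(offsetof(slab_meta_t, alloc_total) == CACHE_LINE,
              "cold fields must start on the second cache line");
static_assert(alignof(slab_meta_t) == CACHE_LINE,
              "descriptor must be cache-line aligned");

/* Returns 1 when every hot field sits inside the first cache line. */
static int slab_meta_hot_in_one_line(void) {
    return offsetof(slab_meta_t, route) + sizeof(uint8_t) <= CACHE_LINE;
}
```

The `static_assert` lines turn the layout intent into a build failure if a later field addition pushes hot data across the line boundary.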
Implementation Flow
1. Implement C3 (Static Routing) & run A/B test
├─ GO: make it the default
└─ NO-GO: freeze
2. Add C1 (TLS Prefetch)
└─ Cumulative test
3. C2 (Metadata Optimization)
└─ Final A/B
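The GO / NO-GO gate in this flow can be expressed as a small helper that turns per-run throughput numbers into a verdict. This is a sketch under assumptions: the ±1.0% neutral band is taken from the C1 commit notes, while using that same band as the GO and NO-GO threshold is a simplification, and all function and enum names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_verdict_t;

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n runs (n must be > 0). */
static double ab_mean(const double *v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Sorts v in place, then returns the median (n must be > 0). */
static double ab_median(double *v, size_t n) {
    assert(n > 0);
    qsort(v, n, sizeof v[0], cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* Relative change of candidate vs. baseline, in percent. */
static double ab_delta_pct(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
}

/* GO above +band, NO-GO below -band, NEUTRAL in between. */
static ab_verdict_t ab_verdict(double delta_pct, double band_pct) {
    if (delta_pct > band_pct) return AB_GO;
    if (delta_pct < -band_pct) return AB_NO_GO;
    return AB_NEUTRAL;
}
```

Feeding in the C1 averages from the commit notes (39.33M baseline, 39.20M with prefetch) and a 1.0 band yields `AB_NEUTRAL`, which matches the recorded decision to freeze C1 as a research box.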
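Safety Checks
- LD mode: static routing is disabled in LD environments (the policy learner stays active under LD)
- Lock depth: not needed, since this change is on the malloc side
- Rollback: can be switched OFF immediately via the ENV gate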
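Next Actions
Right now:
- Run perf record -F 99 ./bench_random_mixed_hakmem to identify hot spots
- Quantify the overhead of policy_snapshot / learner evaluation
- Start implementing C3 static route detection
After that:
- Confirm +5-8% with an A/B test (Mixed 10-run)
- Introduce C1/C2 in stages
Now that the leader's back is in sight, we push deeper still.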