hakmem/docs/analysis/PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md
Moe Charm (CI) d0b931b197 Phase 3 C1: TLS Prefetch Implementation - NEUTRAL Result (Research Box)
Step 1 & 2 Complete:
- Implemented: core/front/malloc_tiny_fast.h prefetch (lines 264-267, 331-334)
  - LEGACY path prefetch of g_unified_cache[class_idx] to L1
  - ENV gate: HAKMEM_TINY_PREFETCH=0/1 (default OFF)
  - Conditional: only when prefetch enabled + route_kind == LEGACY

- A/B test (Mixed 10-run): PREFETCH=0 (39.33M) → =1 (39.20M) = -0.34% avg
  - Median: +1.28% (within ±1.0% neutral range)
  - Result: 🔬 NEUTRAL (research box, default OFF)

Decision: FREEZE as research box
- Average -0.34% suggests prefetch overhead > benefit
- Prefetch timing too late (after route_kind selection)
- TLS cache access is already fast (head/tail indices)
- Actual memory wait happens at slots[] array access (after prefetch)

Technical Learning:
- Prefetch effectiveness depends on L1 miss rate at access time
- Inserting prefetch after route selection may be too late
- Future approach: move prefetch earlier or use different target

Next: Phase 3 C2 (Metadata Cache Optimization, expected +5-10%)

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 19:01:57 +09:00


Phase 3: Cache Locality Optimization (Kickoff Instructions)

Goal

Current: Mixed ~35.2M ops/s (after B3+B4). Target: 57-68M ops/s (+12-22%).

Phase 3 Structure

C3 (Priority: 🔴 Highest): Static Routing

Background:

  • In the Mixed workload, perf top shows malloc/policy_snapshot as hot
  • Today: every malloc performs a policy snapshot + learner evaluation → significant overhead
  • Proposal: decide a "static route" at init time, before malloc_tiny_fast() is called

Design notes: docs/analysis/PHASE3_C3_STATIC_ROUTING_1_DESIGN.md

Implementation steps:

  1. Profiling (establish the current baseline)

    HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE ./bench_random_mixed_hakmem 1000000 400 1
    HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 --call-graph dwarf -- ./bench_random_mixed_hakmem 1000000 400 1
    perf report --stdio
    # → check how much time goes to malloc/policy_snapshot/learner
    
  2. Static Route Detection (at init time)

    • "Decide" the route before malloc_tiny_fast() is called
    • Scope: for each class C0-C7, determine whether LEGACY is dominant (a sketch follows this list)
    • ENV gate: HAKMEM_TINY_STATIC_ROUTE=1/0 (default 0)
  3. Route Bypass

    // Today (evaluated on every call):
    route = g_policy_learner->get_route(class_idx);  // expensive

    // C3 Static (decided at init time):
    if (static_route_enabled()) {
        route = g_static_route[class_idx];  // cached, no learner
    } else {
        route = learner_route(...);  // same as before
    }
    
  4. Expected: +5-8%

  5. A/B Test: Mixed 10-run + C6-heavy 5-run
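
As a reference for step 2, here is a minimal sketch of what the init-time detection could look like. Only the ENV gate HAKMEM_TINY_STATIC_ROUTE, the C0-C7 classes, and the g_static_route[] idea come from this plan; hakmem_class_legacy_dominant(), route_kind_t, and HAKMEM_TINY_NUM_CLASSES are hypothetical placeholders, not the real hakmem identifiers.

/* Sketch only: hypothetical init-time static route detection.
 * HAKMEM_TINY_STATIC_ROUTE is the ENV gate from this plan; the other
 * names are placeholders, not the real hakmem identifiers. */
#include <stdlib.h>
#include <string.h>

#define HAKMEM_TINY_NUM_CLASSES 8   /* C0-C7 (placeholder constant) */

typedef enum { ROUTE_LEGACY = 0, ROUTE_LEARNER = 1 } route_kind_t;

static route_kind_t g_static_route[HAKMEM_TINY_NUM_CLASSES];
static int g_static_route_enabled = 0;

/* Placeholder: the real check would consult learner stats (or a
 * warm-up run) to decide whether LEGACY dominates for this class. */
static int hakmem_class_legacy_dominant(int class_idx)
{
    (void)class_idx;
    return 1;
}

static void hakmem_static_route_init(void)
{
    const char *env = getenv("HAKMEM_TINY_STATIC_ROUTE");
    g_static_route_enabled = (env != NULL && strcmp(env, "1") == 0);
    if (!g_static_route_enabled)
        return;                      /* default 0: keep per-call learner routing */

    for (int c = 0; c < HAKMEM_TINY_NUM_CLASSES; c++) {
        g_static_route[c] = hakmem_class_legacy_dominant(c)
                          ? ROUTE_LEGACY
                          : ROUTE_LEARNER;
    }
}

The bypass in step 3 would then read g_static_route[class_idx] on the hot path instead of invoking the learner.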

C1 (Priority: 🟡 Medium): TLS Cache Prefetch

Goal: prefetch the TLS cache (Unified Cache) that the allocation actually touches, rather than the policy data

Design notes: docs/analysis/PHASE3_C1_TLS_PREFETCH_1_DESIGN.md

Implementation:

// Inside malloc_tiny_fast_for_class(), only on the LEGACY route:
__builtin_prefetch(&g_unified_cache[class_idx], 0, 3);

Expected: +2-4%
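
For reference, a minimal sketch of how this prefetch could be gated by HAKMEM_TINY_PREFETCH (the ENV gate used in the C1 commit above) without calling getenv() on every allocation. unified_cache_t, the layout of g_unified_cache, and tiny_prefetch_enabled() are placeholder stand-ins for the real structures in core/front/malloc_tiny_fast.h.

/* Sketch only: caching the HAKMEM_TINY_PREFETCH gate so the hot path
 * does not call getenv() per allocation. unified_cache_t and
 * g_unified_cache are placeholder stand-ins for the real structures. */
#include <stdlib.h>

typedef struct { void *slots[64]; unsigned head, tail; } unified_cache_t;  /* placeholder */
static unified_cache_t g_unified_cache[8];                                 /* placeholder, C0-C7 */

static inline int tiny_prefetch_enabled(void)
{
    static int cached = -1;                        /* -1 = not parsed yet */
    if (cached < 0) {
        const char *env = getenv("HAKMEM_TINY_PREFETCH");
        cached = (env != NULL && env[0] == '1');   /* default OFF */
    }
    return cached;
}

/* Called from the LEGACY branch of malloc_tiny_fast_for_class(): */
static inline void tiny_prefetch_cache(int class_idx)
{
    if (tiny_prefetch_enabled())
        __builtin_prefetch(&g_unified_cache[class_idx], /*rw=*/0, /*locality=*/3);
}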

C2 (Priority: 🟡 Medium): Slab Metadata Cache Optimization

Goal: place hot metadata (policy, slab descriptors) closer to where it is accessed

Expected: +5-10%
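
A minimal sketch of one possible direction for C2, assuming the aim is to pack the per-class hot fields (route, policy bits, cache indices, active slab pointer) into a single cache line. All type and field names below are illustrative, not hakmem's actual layout.

/* Sketch only: one possible shape for C2, packing the per-class hot
 * metadata into a single cache line so that one miss brings in
 * everything the fast path needs. */
#include <stdint.h>

#define CACHELINE 64

typedef struct {
    /* hot: touched on every tiny alloc/free for this class */
    uint8_t   static_route;   /* C3 result for this class        */
    uint8_t   policy_flags;   /* snapshot of the policy bits     */
    uint16_t  cache_head;     /* unified-cache head index        */
    uint16_t  cache_tail;     /* unified-cache tail index        */
    uint16_t  obj_size;       /* object size of this size class  */
    void     *active_slab;    /* current slab descriptor         */
    /* cold fields (stats, learner state) stay in a separate struct */
} __attribute__((aligned(CACHELINE))) tiny_class_hot_t;

/* One aligned entry per class keeps C0-C7 on distinct cache lines
 * and avoids false sharing between classes. */
static tiny_class_hot_t g_tiny_hot[8];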

Implementation Flow

1. C3 (Static Routing) implementation & A/B test
   ├─ GO: make it the default
   └─ NO-GO: freeze
   
2. C1 (TLS Prefetch) follow-up implementation
   └─ Cumulative test
   
3. C2 (Metadata Optimization)
   └─ Final A/B

Safety Checks

  • LD mode: the static route is disabled in LD environments (the policy learner stays active under LD)
  • Lock depth: not a concern, since this sits on the malloc side
  • Rollback: can be switched OFF immediately via the ENV gate

Next Actions

Right now:

  1. Run perf record -F 99 ./bench_random_mixed_hakmem to identify the hot spots
  2. Quantify the overhead of policy_snapshot / learner evaluation
  3. Start implementing C3 static route detection

After that:

  • A/B test (Mixed 10-run) to confirm +5-8%
  • Phased introduction of C1/C2

The target is finally in sight; now we go even deeper.