hakmem/docs/analysis/PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md
Moe Charm (CI) d0b931b197 Phase 3 C1: TLS Prefetch Implementation - NEUTRAL Result (Research Box)
Step 1 & 2 Complete:
- Implemented: core/front/malloc_tiny_fast.h prefetch (lines 264-267, 331-334)
  - LEGACY path prefetch of g_unified_cache[class_idx] to L1
  - ENV gate: HAKMEM_TINY_PREFETCH=0/1 (default OFF)
  - Conditional: only when prefetch enabled + route_kind == LEGACY

- A/B test (Mixed 10-run): PREFETCH=0 (39.33M) → =1 (39.20M) = -0.34% avg
  - Median: +1.28% (within ±1.0% neutral range)
  - Result: 🔬 NEUTRAL (research box, default OFF)

Decision: FREEZE as research box
- Average -0.34% suggests prefetch overhead > benefit
- Prefetch timing too late (after route_kind selection)
- TLS cache access is already fast (head/tail indices)
- Actual memory wait happens at slots[] array access (after prefetch)

Technical Learning:
- Prefetch effectiveness depends on L1 miss rate at access time
- Inserting prefetch after route selection may be too late
- Future approach: move prefetch earlier or use different target

Next: Phase 3 C2 (Metadata Cache Optimization, expected +5-10%)

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 19:01:57 +09:00


# Phase 3: Cache Locality Optimization (Kickoff Instructions)
## Goal
**Current**: Mixed ~35.2M ops/s (after B3+B4)
**Target**: ~39.4-42.9M ops/s (+12-22%, the sum of the per-step expectations for C3, C1, and C2 below)
## Phase 3 Structure
### C3 (Priority: 🔴 Highest): Static Routing
**Background**:
- In `perf top` for Mixed, malloc/policy_snapshot is hot
- Currently: every malloc performs a policy snapshot + learner evaluation → large overhead
- Proposal: decide a "static route" at init time, before malloc_tiny_fast() is called
**Design memo**: `docs/analysis/PHASE3_C3_STATIC_ROUTING_1_DESIGN.md`
**Implementation steps**:
1. **Profiling (establish the baseline)**
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE ./bench_random_mixed_hakmem 1000000 400 1
# The variable assignment must precede perf; after "--" it would be treated as the command to run.
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 --call-graph dwarf -- ./bench_random_mixed_hakmem 1000000 400 1
perf report --stdio
# → check how much time malloc/policy_snapshot/learner account for
```
2. **Static Route Detection (at init)**
- "Decide" the route before malloc_tiny_fast() is called
- Scope: for each class C0-C7, determine whether LEGACY is dominant
- ENV gate: `HAKMEM_TINY_STATIC_ROUTE=1/0` (default 0)
3. **Route Bypass**
```c
// Current (evaluated on every call):
route = g_policy_learner->get_route(class_idx);  // expensive
// C3 Static (decided at init):
if (static_route_enabled()) {
    route = g_static_route[class_idx];  // cached, no learner
} else {
    route = learner_route(...);  // unchanged legacy behavior
}
```
4. **Expected gain**: +5-8%
5. **A/B Test**: Mixed 10-run + C6-heavy 5-run
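
The steps above can be sketched as a small, self-contained C fragment. This is only an illustration under stated assumptions: `route_kind_t`, `ROUTE_OTHER`, `static_route_init()`, `select_route()`, and the `legacy_dominant` input are hypothetical names, and `learner_route()` is a stub standing in for the real policy snapshot + learner path. Only `g_static_route`, `static_route_enabled()`, and the `HAKMEM_TINY_STATIC_ROUTE` gate come from the text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TINY_NUM_CLASSES 8  /* C0-C7, as in the text */

/* Hypothetical route kinds; only LEGACY is named in the source. */
typedef enum { ROUTE_LEGACY = 0, ROUTE_OTHER = 1 } route_kind_t;

static route_kind_t g_static_route[TINY_NUM_CLASSES];
static int g_static_route_on = -1;  /* -1 = environment not read yet */

/* Read HAKMEM_TINY_STATIC_ROUTE once; anything other than "1" means OFF.
 * The lazy read is not synchronized: a single-threaded sketch only. */
static int static_route_enabled(void) {
    if (g_static_route_on < 0) {
        const char *v = getenv("HAKMEM_TINY_STATIC_ROUTE");
        g_static_route_on = (v != NULL && strcmp(v, "1") == 0);
    }
    return g_static_route_on;
}

/* Stub for the expensive per-call path (policy snapshot + learner). */
static route_kind_t learner_route(int class_idx) {
    (void)class_idx;
    return ROUTE_LEGACY;
}

/* Init-time detection: fix one route per class from a dominance flag. */
static void static_route_init(const int legacy_dominant[TINY_NUM_CLASSES]) {
    for (int c = 0; c < TINY_NUM_CLASSES; c++)
        g_static_route[c] = legacy_dominant[c] ? ROUTE_LEGACY : ROUTE_OTHER;
}

/* Hot path: a table lookup when the gate is on, the learner otherwise. */
static route_kind_t select_route(int class_idx) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    if (static_route_enabled())
        return g_static_route[class_idx];
    return learner_route(class_idx);
}
```

With the gate on, the per-call cost reduces to one branch and one array load; with it off, behavior falls back to the learner, which is what makes the ENV gate usable as a rollback switch.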
### C1 (Priority: 🟡 Medium): TLS Cache Prefetch
**Aim**: prefetch the **TLS cache** (Unified Cache) that alloc actually touches, rather than the policy
**Design memo**: `docs/analysis/PHASE3_C1_TLS_PREFETCH_1_DESIGN.md`
**Implementation**:
```c
// Inside malloc_tiny_fast_for_class(), only on the LEGACY route:
__builtin_prefetch(&g_unified_cache[class_idx], 0, 3);
```
**Expected gain**: +2-4%
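
A minimal sketch of how the gated prefetch could sit in front of a per-class cache pop is shown here. The `unified_cache_t` layout (head/tail indices plus a `slots[]` array, with a capacity of 64), `cache_pop()`, `cache_push()`, and `tiny_prefetch_enabled()` are assumptions made for illustration. The source only names `g_unified_cache`, the `HAKMEM_TINY_PREFETCH` gate, and the LEGACY-only condition. `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TINY_NUM_CLASSES 8
#define UC_SLOTS 64  /* hypothetical capacity, power of two */

/* Hypothetical layout: ring buffer with head/tail indices and slots[]. */
typedef struct {
    unsigned head, tail;
    void *slots[UC_SLOTS];
} unified_cache_t;

typedef enum { ROUTE_LEGACY = 0, ROUTE_OTHER = 1 } route_kind_t;

static _Thread_local unified_cache_t g_unified_cache[TINY_NUM_CLASSES];
static int g_prefetch_on = -1;  /* -1 = environment not read yet */

/* Read HAKMEM_TINY_PREFETCH once; default is OFF. */
static int tiny_prefetch_enabled(void) {
    if (g_prefetch_on < 0) {
        const char *v = getenv("HAKMEM_TINY_PREFETCH");
        g_prefetch_on = (v != NULL && strcmp(v, "1") == 0);
    }
    return g_prefetch_on;
}

/* Push a block; returns 0 when the ring is full. */
static int cache_push(int class_idx, void *p) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    unified_cache_t *uc = &g_unified_cache[class_idx];
    if (uc->tail - uc->head == UC_SLOTS) return 0;
    uc->slots[uc->tail % UC_SLOTS] = p;
    uc->tail++;
    return 1;
}

/* Pop a block, or NULL when empty. The prefetch is only a hint. */
static void *cache_pop(int class_idx, route_kind_t route) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    unified_cache_t *uc = &g_unified_cache[class_idx];
    if (tiny_prefetch_enabled() && route == ROUTE_LEGACY)
        __builtin_prefetch(uc, 0, 3);  /* read access, high temporal locality */
    if (uc->head == uc->tail) return NULL;
    void *p = uc->slots[uc->head % UC_SLOTS];
    uc->head++;
    return p;
}
```

Note that in this sketch the prefetch is issued immediately before the access it targets, so the hardware has very little time to act on it. A prefetch generally pays off only when it is issued well ahead of the load.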
### C2 (Priority: 🟡 Medium): Slab Metadata Cache Optimization
**Aim**: place hot metadata (policy, slab descriptor) closer together
**Expected gain**: +5-10%
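
One common way to bring hot metadata closer together is a hot/cold split, where the fields read on every allocation share a single cache line. The sketch below shows that general technique only; it is not the design chosen for C2. All field names, `slab_hot_t`, `slab_cold_t`, `slab_desc_t`, and the 64-byte line size are assumptions.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size; typical for x86-64 */

/* Hot fields read on every alloc, aligned so they share one cache line. */
typedef struct {
    alignas(CACHE_LINE) void *freelist;
    uint32_t used;
    uint32_t capacity;
    uint8_t  class_idx;
    uint8_t  route_kind;
} slab_hot_t;

/* Cold fields (statistics, ownership) that the fast path never reads. */
typedef struct {
    uint64_t created_ns;
    uint64_t total_allocs;
    void    *owner;
} slab_cold_t;

typedef struct {
    slab_hot_t  hot;
    slab_cold_t cold;
} slab_desc_t;

/* Compile-time checks: the hot part is exactly one line, cold starts after. */
static_assert(sizeof(slab_hot_t) == CACHE_LINE, "hot part must be one line");
static_assert(offsetof(slab_desc_t, cold) == CACHE_LINE,
              "cold part must start on the next line");

/* Runtime check that a descriptor's hot part starts on a line boundary. */
static int hot_is_line_aligned(const slab_desc_t *d) {
    return ((uintptr_t)&d->hot % CACHE_LINE) == 0;
}
```

The `static_assert` lines make the layout assumption fail at compile time if a later field addition pushes the hot part past one line.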
## Implementation Flow
```
1. C3 (Static Routing): implement & A/B test
   ├─ GO: make it the default
   └─ NO-GO: freeze
2. C1 (TLS Prefetch): additional implementation
   └─ Cumulative test
3. C2 (Metadata Optimization)
   └─ Final A/B
```
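
The GO / NO-GO decision in this flow can be made mechanical with a small helper that summarizes each run set and classifies the delta. The sketch below is an assumption-laden illustration: the function names and the symmetric rule (above the band is GO, below it is NO-GO) are hypothetical. The ±1.0% neutral band is the value quoted in the commit notes for the C1 test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_verdict_t;

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n throughput samples (n > 0). */
static double mean(const double *v, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s / (double)n;
}

/* Median of n samples (n > 0); sorts v in place. */
static double median(double *v, size_t n) {
    qsort(v, n, sizeof v[0], cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* Relative change of test vs. base, in percent. */
static double pct_delta(double base, double test) {
    return 100.0 * (test - base) / base;
}

/* Classify a delta against a symmetric neutral band (e.g. 1.0 for ±1.0%). */
static ab_verdict_t ab_verdict(double delta_pct, double neutral_band_pct) {
    if (delta_pct > neutral_band_pct)  return AB_GO;
    if (delta_pct < -neutral_band_pct) return AB_NO_GO;
    return AB_NEUTRAL;
}
```

Reporting both mean and median is useful because a single outlier run can move the mean while leaving the median untouched. When the two disagree in sign, treating the result as neutral is the conservative choice.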
## Safety Checks
- **LD mode**: static route is disabled in LD environments (the policy learner stays active under LD)
- **Lock depth**: not needed, since this is on the malloc side
- **Rollback**: can be turned OFF immediately via the ENV gate
## Next Actions
**Right now**:
1. Run `perf record -F 99 ./bench_random_mixed_hakmem` to identify hot spots
2. Quantify the overhead of policy_snapshot / learner evaluation
3. Start implementing C3 static route detection
**After that**:
- Confirm +5-8% with the A/B test (Mixed 10-run)
- Introduce C1/C2 incrementally
---
**From the point where the leader's back is in sight, deeper still.**