hakmem/docs/analysis/PHASE_V11A_DESIGN_MID_V3.5.md
Moe Charm (CI) bf83612b97 Phase v11a-4: Add Mixed mainline benchmark results
Results:
- C6-heavy (257-512B): +5.1% (34.0M → 35.8M ops/s)
- Mixed 16-1024B:      +4.4% (38.6M → 40.3M ops/s)

Conclusion: On the Mixed mainline, C6→MID v3.5 is an adoption candidate.
Confirmed an improvement of +4-5%, exceeding the prediction (+1-3%).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 07:17:52 +09:00

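# Phase v11a Design Specification: MID v3.5 (257B-1KiB Unified Box)
## 1. Positioning
**Transition from v10 to v11a**:
- v10: v7 frozen as a C5/C6-only research box; Learner ON by default
- v11a: **MID v3.5 extended and unified as the main implementation for 257B-1KiB**
**Architectural roles**: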
```
L0: ULTRA (C4-C7) → FROZEN (unchanged)
L1: MID v3.5 (C5-C7) → mainline ★NEW
  ├─ C5: TLS cache + 2MiB segment (multi-page)
  ├─ C6: TLS cache + 2MiB segment (multi-page)
  └─ C7: TLS cache + 2MiB segment (multi-page)
L1-research: v7 (C5/C6) → research box (frozen)
L2: Segment/ColdIface/RegionId
L3: Policy Box + Learner v2 (expanded stats)
```
## 2. Changes from MID v3 to MID v3.5
### 2-1. Current MID v3 Configuration
**Implemented**:
- C5/C6 multi-class TLS heap
- 2MiB segments
- RefillPolicy (TLS segment hint, pool fallback)
- Policy routing (via Policy Box v7-4)
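- Legacy stats (basic data collected at page retire)
**Limitations**:
- C7 not supported (ULTRA only)
- No Learner statistics (v7 only)
- Assumes single-class segments
### 2-2. Features Added in v11a
#### Feature 1: Full C7 Support
**Goal**: MID v3.5 covers all of C5-C7
**Implementation**: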
```c
// mid_v3.5.h - new extension
// TLS context for C7
typedef struct {
    SmallHeapCtx ctx;     // Reuse existing context
    void *tls_page;       // Current page pointer (C7)
    uint32_t tls_offset;  // Allocation offset in page
} SmallHeapCtx_C7_MID;

// Allocation fast path
// C7: size > 512B → check MID_ROUTE_C7
// If enabled: try TLS fast alloc → refill on demand
```
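The fast path described in the comments above ("try TLS fast alloc → refill on demand") can be illustrated with a bump-pointer allocator over the current TLS page. This is a minimal sketch, not the actual hakmem code: `C7TlsSketch`, `c7_fast_alloc`, and `c7_attach_page` are hypothetical names, the `ctx` field is omitted, and the 64KiB page size is taken from the segment geometry in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MID_PAGE_SIZE (64u * 1024u)  /* 64KiB page, per the segment geometry */

/* Hypothetical, reduced form of the C7 TLS context (ctx field omitted). */
typedef struct {
    void    *tls_page;    /* current page, or NULL if none is attached */
    uint32_t tls_offset;  /* next free byte within tls_page */
} C7TlsSketch;

/* Fast path: bump-allocate block_size bytes from the current page.
 * Returns NULL when there is no page or not enough room left, which
 * signals the caller to take the slow path (refill on demand). */
static void *c7_fast_alloc(C7TlsSketch *t, uint32_t block_size) {
    assert(block_size > 0 && block_size <= MID_PAGE_SIZE);
    if (t->tls_page == NULL || block_size > MID_PAGE_SIZE - t->tls_offset) {
        return NULL;
    }
    void *p = (char *)t->tls_page + t->tls_offset;
    t->tls_offset += block_size;
    return p;
}

/* Stand-in for the refill path: attach a fresh page and reset the offset. */
static void c7_attach_page(C7TlsSketch *t, void *page) {
    t->tls_page = page;
    t->tls_offset = 0;
}
```

A caller would try `c7_fast_alloc` first and, on `NULL`, obtain a page from the cold interface, call `c7_attach_page`, and retry.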
**Policy routing**:
```c
// mid_policy.h
route_kind[7] = SMALL_ROUTE_MID_V3; // If C7 enabled, else ULTRA
```
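As an illustration of the "if C7 enabled, else ULTRA" rule, the following sketch fills an 8-entry route table from a single flag. Only `SMALL_ROUTE_MID_V3` and `route_kind` come from this document; `SMALL_ROUTE_ULTRA`, `SMALL_ROUTE_OTHER`, `RoutePolicySketch`, and `route_policy_init` are hypothetical, and the routes assigned to classes other than C7 are assumptions based on the layer diagram in section 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    SMALL_ROUTE_OTHER,   /* hypothetical placeholder for classes not covered here */
    SMALL_ROUTE_ULTRA,   /* hypothetical enumerator for the ULTRA layer */
    SMALL_ROUTE_MID_V3   /* enumerator named in this document */
} SmallRouteKind;

typedef struct {
    SmallRouteKind route_kind[8];  /* one route per size class C0..C7 */
} RoutePolicySketch;

/* Build the route table. C5/C6 go to MID (assumed default); C7 goes to
 * MID only when c7_mid_enabled is set, otherwise it stays on ULTRA. */
static void route_policy_init(RoutePolicySketch *p, bool c7_mid_enabled) {
    assert(p != NULL);
    for (int c = 0; c < 8; c++) {
        p->route_kind[c] = SMALL_ROUTE_OTHER;
    }
    p->route_kind[4] = SMALL_ROUTE_ULTRA;
    p->route_kind[5] = SMALL_ROUTE_MID_V3;
    p->route_kind[6] = SMALL_ROUTE_MID_V3;
    p->route_kind[7] = c7_mid_enabled ? SMALL_ROUTE_MID_V3 : SMALL_ROUTE_ULTRA;
}
```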
**Stats tracking**:
```c
// SmallPageStatsMID_v3: record class_idx for all retires
typedef struct {
    uint32_t class_idx;
    uint64_t total_allocations;
    uint64_t total_frees;
    uint32_t page_alloc_count;    // ← v11a new
    uint32_t free_hit_ratio_bps;  // ← v11a new (basis points)
} SmallPageStatsMID_v3;
```
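Since `free_hit_ratio_bps` is stored in basis points (1 bps = 0.01%, so 10000 bps = 100%), a small helper shows the conversion. This is a sketch: `ratio_to_bps` is a hypothetical name, and what counts as a "hit" is not defined in this document, so the function simply takes a hit count and a total.

```c
#include <assert.h>
#include <stdint.h>

/* Convert hits/total into basis points (0..10000).
 * Returns 0 for an empty sample to avoid dividing by zero.
 * Note: hits * 10000 is computed in 64 bits, which is ample for
 * per-page counters but would overflow for hits above ~1.8e15. */
static uint32_t ratio_to_bps(uint64_t hits, uint64_t total) {
    if (total == 0) {
        return 0;
    }
    assert(hits <= total);
    return (uint32_t)((hits * 10000u) / total);
}
```

For example, 8 hits out of 10 frees gives 8000 bps, which is the threshold value used in the Learner v2 example rules.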
#### Feature 2: Multi-class Segment Design Decision
**Two options considered**:
**Design A: Separate segments**
```
MID_v3_segment[3] = {
[0] → segment_C5,
[1] → segment_C6,
[2] → segment_C7
}
```
Pros: simple, clean class separation
Cons: 3x segment overhead, more complex TLS lookup
**Design B: Shared segment + per-class pages**
```
SmallSegment_MID_v3 {
free_pages[8]; // per class free stack
class_pages[8]; // current page per class
page_alloc[8]; // allocation count per class
}
```
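Pros: a single segment suffices; no RegionIdBox changes required
Cons: more complex logic
**v11a decision**: **Design B (shared segment)**
- Rationale: reuses the existing RegionIdBox (minimal changes)
- Unified segment geometry (same as v7: 2MiB segments / 64KiB pages)
- Can support multi-class TLS hints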
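To make Design B more concrete, here is a minimal sketch of one shared segment that hands out pages to several classes and keeps a per-class free stack of retired pages. All names (`SharedSegSketch`, `seg_init`, `seg_take_page`, `seg_retire_page`) are hypothetical, pages are represented only by their index, and the policy "reuse a retired page of the same class first, otherwise carve a fresh page" is an assumption rather than the confirmed behavior.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_PAGES   32    /* 2MiB segment / 64KiB pages */
#define NUM_CLASSES 8
#define NO_PAGE     (-1)

typedef struct {
    int8_t   next_free[SEG_PAGES];     /* links for the per-class free stacks */
    int8_t   free_top[NUM_CLASSES];    /* top of each class's free stack */
    int8_t   class_page[NUM_CLASSES];  /* current page per class */
    uint32_t page_alloc[NUM_CLASSES];  /* pages handed out per class */
    uint8_t  next_unused;              /* first page never used so far */
} SharedSegSketch;

static void seg_init(SharedSegSketch *s) {
    for (int c = 0; c < NUM_CLASSES; c++) {
        s->free_top[c] = NO_PAGE;
        s->class_page[c] = NO_PAGE;
        s->page_alloc[c] = 0;
    }
    s->next_unused = 0;
}

/* Take a page for class_idx. Returns the page index, or NO_PAGE when
 * the segment has neither a retired page of that class nor a fresh one. */
static int seg_take_page(SharedSegSketch *s, uint32_t class_idx) {
    assert(class_idx < NUM_CLASSES);
    int page = s->free_top[class_idx];
    if (page != NO_PAGE) {
        s->free_top[class_idx] = s->next_free[page];  /* pop */
    } else if (s->next_unused < SEG_PAGES) {
        page = s->next_unused++;                      /* carve a fresh page */
    } else {
        return NO_PAGE;
    }
    s->class_page[class_idx] = (int8_t)page;
    s->page_alloc[class_idx]++;
    return page;
}

/* Retire a page: push it onto the free stack of its class. */
static void seg_retire_page(SharedSegSketch *s, uint32_t class_idx, int page) {
    assert(class_idx < NUM_CLASSES && page >= 0 && page < SEG_PAGES);
    s->next_free[page] = s->free_top[class_idx];
    s->free_top[class_idx] = (int8_t)page;
    if (s->class_page[class_idx] == page) {
        s->class_page[class_idx] = NO_PAGE;
    }
}
```

Because every page lives in the same 2MiB segment, a pointer-to-segment lookup does not need to know the class, which is consistent with the stated goal of leaving RegionIdBox unchanged.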
#### Feature 3: Learner v2 (Expanded Stats)
**Limitations of the v7-7 Learner**:
```c
// Current: monitors only the C5 ratio
c5_ratio_pct = (stats->per_class[5].v7_allocs * 100) / total_allocs;
if (c5_ratio_pct >= THRESHOLD) route[5] = V7;
```
**v11a Learner v2 extensions**:
```c
typedef struct {
    uint64_t allocs[8];             // per class allocation count
    uint32_t retire_ratio_pct[8];   // per class retire efficiency
    uint64_t avg_page_utilization;  // global metric
    uint32_t free_hit_ratio_bps;    // global free hit (basis points)
    uint64_t eval_count;
} SmallLearnerStatsV2;
// Route decisions based on multiple metrics (extensible later)
// Example (Phase v11b):
// - C5_ratio < 30% AND retire_ratio < 50% → MID_v3
// - C5_ratio >= 30% AND free_hit > 8000bps → V7
```
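The example rules in the comments above can be written as a small decision function. This is only a sketch of those two rules as stated: `RouteDecision`, `class_ratio_pct`, and `learner_v2_decide_c5` are hypothetical names, and returning "keep the current route" when neither rule fires is an assumption.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ROUTE_KEEP, ROUTE_MID_V3, ROUTE_V7 } RouteDecision;

/* Share of one class in all allocations, in percent.
 * Returns 0 when nothing has been allocated yet. */
static uint32_t class_ratio_pct(const uint64_t allocs[8], int class_idx) {
    assert(class_idx >= 0 && class_idx < 8);
    uint64_t total = 0;
    for (int i = 0; i < 8; i++) {
        total += allocs[i];
    }
    if (total == 0) {
        return 0;
    }
    return (uint32_t)((allocs[class_idx] * 100u) / total);
}

/* Apply the two example rules; fall back to ROUTE_KEEP otherwise. */
static RouteDecision learner_v2_decide_c5(uint32_t c5_ratio_pct,
                                          uint32_t retire_ratio_pct,
                                          uint32_t free_hit_bps) {
    if (c5_ratio_pct < 30 && retire_ratio_pct < 50) {
        return ROUTE_MID_V3;
    }
    if (c5_ratio_pct >= 30 && free_hit_bps > 8000) {
        return ROUTE_V7;
    }
    return ROUTE_KEEP;
}
```

Unlike the v7-7 snippet, `class_ratio_pct` guards against a zero total, which matters for the first evaluation after startup.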
**Implementation flow**:
```
MID_v3 page retire
↓ record stats
SmallPageStatsMID_v3 {class_idx, allocs, free_hit_ratio}
↓ periodic publish (every LEARNER_EVAL_INTERVAL)
SmallLearnerStatsV2 aggregate
↓
small_learner_v2_evaluate()
↓
small_policy_v3_update_from_learner() ← NEW (Policy v2)
↓
TLS policy cache invalidation
```
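The flow above (record on retire, aggregate, evaluate periodically, then invalidate TLS policy caches) might look roughly like the following. This is a sketch under several assumptions: `LEARNER_EVAL_INTERVAL` is treated as a count of retired pages with an arbitrary value of 4, invalidation is modeled as a version counter that TLS caches would compare against, and all `flow_*` names and `LearnerFlowSketch` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LEARNER_EVAL_INTERVAL 4u  /* assumed: retired pages per evaluation */

typedef struct {
    uint64_t allocs[8];         /* per-class allocation totals */
    uint64_t free_hit_bps_sum;  /* sum of per-page free-hit ratios */
    uint64_t retired_pages;     /* number of pages recorded so far */
    uint64_t eval_count;        /* number of evaluations run */
    uint64_t policy_version;    /* bumped to invalidate TLS policy caches */
} LearnerFlowSketch;

/* Evaluate and publish: a real implementation would update routes here. */
static void flow_evaluate(LearnerFlowSketch *l) {
    l->eval_count++;
    l->policy_version++;  /* TLS caches reload when their version differs */
}

/* Record one retired page and trigger an evaluation every
 * LEARNER_EVAL_INTERVAL pages. */
static void flow_record_retire(LearnerFlowSketch *l, uint32_t class_idx,
                               uint64_t page_allocs, uint32_t free_hit_bps) {
    assert(class_idx < 8 && free_hit_bps <= 10000);
    l->allocs[class_idx] += page_allocs;
    l->free_hit_bps_sum += free_hit_bps;
    l->retired_pages++;
    if (l->retired_pages % LEARNER_EVAL_INTERVAL == 0) {
        flow_evaluate(l);
    }
}

/* Mean free-hit ratio over all recorded pages, in basis points. */
static uint32_t flow_avg_free_hit_bps(const LearnerFlowSketch *l) {
    if (l->retired_pages == 0) {
        return 0;
    }
    return (uint32_t)(l->free_hit_bps_sum / l->retired_pages);
}
```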
### 2-3. Inherited Existing Components
**Unchanged**:
- RegionIdBox: segment ptr → region lookup (existing behavior)
- Policy Box: route_kind[8] array (existing API)
- ColdIface: refill/retire interface (existing)
- TLS cache: per-class fast path (existing pattern)
**Changes required**:
- Policy initialization: add C7 routing
- Learner stats recording: add class_idx recording
- Stats aggregation: multi-class support
## 3. Implementation Schedule
### Phase v11a-1: Design & Infrastructure (Week 1-2)
- [ ] Finalize the SmallSegment_MID_v3 multi-class layout
- [ ] SmallPageStatsMID_v3 type definition + publish API
- [ ] SmallLearnerStatsV2 type definition
- [ ] Sketch the Policy v2 update function
- [ ] Extend the bench suite: C5/C6/C7 individual tests
### Phase v11a-2: Core Implementation (Week 3-4)
- [ ] SmallHeap_MID_v3_C7 alloc/free path
- [ ] Multi-class refill logic
- [ ] Stats recording (per-page class_idx)
- [ ] Learner stats aggregation
- [ ] Policy update_from_learner v2
### Phase v11a-3: Integration & Testing (Week 5)
- [ ] Learner default ON for MID_v3
- [ ] Perf benchmarks: C5/C6/C7 mixed
- [ ] Learner route switch verification
- [ ] Regression: v7 research preset still works
### Phase v11b: Multi-segment Expansion (TBD)
- [ ] Evaluate separate segment approach
- [ ] TLS multi-segment hint optimization
- [ ] C4 support decision (ULTRA vs MID_v3)
## 4. Minimizing API Changes
### Policy Box API (minimal changes)
```c
// Existing: function signatures unchanged
const SmallPolicyV7* small_policy_v7_snapshot(void);
void small_policy_v7_init_from_env(SmallPolicyV7* policy);
void small_policy_v7_update_from_learner(
    const SmallLearnerStatsV7* stats,
    SmallPolicyV7* policy_out
);
// v11a: only the type name is extended
// typedef SmallLearnerStatsV7 → SmallLearnerStatsV2 (backward compat)
// → internally, the new v2 fields are optional
```
### Learner Box API (newly added)
```c
// smallobject_learner_v2_box.h
typedef struct { /* SmallLearnerStatsV2 */ } SmallLearnerStatsV2;
void small_learner_v2_record_retire(uint32_t class_idx,
uint32_t free_hit_ratio_bps);
void small_learner_v2_evaluate(void);
const SmallLearnerStatsV2* small_learner_v2_stats_snapshot(void);
```
### ColdIface API (unchanged)
```c
// Existing refill/retire interface
typedef void (*cold_refill_page_fn)(uint32_t class_idx, ...);
typedef void (*cold_retire_page_fn)(uint32_t class_idx, ...);
```
## 5. Performance Projections
### Current MID v3 (C5/C6)
```
C5/C6 mixed (200-500B, 300K iter): 38.7M ops/s
C6 heavy (400-510B, 500K iter): 56.3M ops/s
Mixed 16-1024B (v7 OFF): 21.5M ops/s
```
### Expected MID v3.5 (after implementation)
```
C5/C6/C7 mixed (200-1000B): +3-5% (more pages, better locality)
C7 heavy (800-1000B): +2-3% (vs ULTRA fallback)
Mixed 16-1024B (with Learner): +1-2% (dynamic routing)
```
### Actual MID v3.5 Results (Phase v11a-4)
**C6-heavy (257-512B)**:
```
v3.5 OFF: 34.0M ops/s
v3.5 ON: 35.8M ops/s (+5.1%)
```
**Mixed 16-1024B (ws=400, 10M iters, avg of 3 runs)**:
```
v3.5 OFF: 38.6M ops/s
v3.5 ON: 40.3M ops/s (+4.4%)
```
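**Observations**: C6-heavy improved by +5%, in line with the prediction, and Mixed also improved by +4%.
Overall the results are better than predicted. On the Mixed mainline, routing C6→MID v3.5 is a viable adoption candidate.
**Metrics**:
- Throughput: +4-5% (exceeds the predicted +1-3%)
- Overhead: not measured (avoided by calling mmap directly)
- Learner accuracy: observation mode only (route switching is deferred to a future phase)
## 6. Finalized Design Decisions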
### Segment Geometry (v11a)
```
SmallSegment_MID_v3:
- Total size: 2 MiB (same as v7)
- Page size: 64 KiB (same as v7)
- Free stack: per-class (C5/C6/C7 each)
- Class pages: current[8], partial[8]
- RegionId: single segment per TLS thread
```
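With 2MiB segments and 64KiB pages, each segment holds 32 pages, and a page index can be derived from an address with mask-and-divide arithmetic. The sketch below assumes segments are aligned to their own size (2MiB), which this document does not state; `MID_SEG_SIZE`, `MID_PAGE_SIZE`, `MID_PAGES_PER_SEG`, `seg_base_of`, and `page_index_of` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MID_SEG_SIZE      ((uintptr_t)2 << 20)   /* 2 MiB */
#define MID_PAGE_SIZE     ((uintptr_t)64 << 10)  /* 64 KiB */
#define MID_PAGES_PER_SEG (MID_SEG_SIZE / MID_PAGE_SIZE)

static_assert(MID_SEG_SIZE % MID_PAGE_SIZE == 0,
              "segment size must be a whole number of pages");

/* Base address of the segment containing addr
 * (valid only if segments are 2MiB-aligned, which is assumed here). */
static uintptr_t seg_base_of(uintptr_t addr) {
    return addr & ~(MID_SEG_SIZE - 1);
}

/* Index (0..31) of the page containing addr within its segment. */
static uint32_t page_index_of(uintptr_t addr) {
    return (uint32_t)((addr & (MID_SEG_SIZE - 1)) / MID_PAGE_SIZE);
}
```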
### TLS Caching Pattern
```c
// TLS MID context
struct {
    SmallSegment_MID_v3 *seg;
    void *page[8];       // Current page per class
    uint32_t offset[8];  // Allocation offset
    uint32_t cache_hits;
    uint32_t cache_misses;
} __thread tls_mid_v3_ctx;
```
### Stats Recording
```c
// On page retire:
void small_cold_mid_v3_retire_page(..., uint32_t class_idx) {
    SmallPageMeta* meta = page->meta;
    meta->class_idx = class_idx; // ← record class
    // Calculate metrics
    uint32_t free_hit = calc_free_hit_ratio(page);
    // Publish stats
    SmallPageStatsMID_v3 stat = {
        .class_idx = class_idx,
        .total_allocations = page->alloc_count,
        .total_frees = page->free_count,
        .page_alloc_count = capacity,
        .free_hit_ratio_bps = free_hit
    };
    // Feed to Learner
    small_learner_v2_ingest_stats(&stat);
}
```
## 7. Next Decision Points
### Criteria for Moving to v11b
```
Go to v11b (multi-segment) if:
✓ C7 performance matches ULTRA (±2%)
✓ Learner accuracy > 90% on class patterns
✓ RegionId lookup latency acceptable (<2% overhead)
```
### Stay in v11a (iterate) if:
```
✗ C7 performance < 90% vs ULTRA
✗ Learner detection < 80% accuracy
✗ Stats aggregation cost > 5% CPU
```
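The go/stay criteria can be expressed as a single predicate, which also makes the gap between the two lists visible. This sketch makes several assumptions: "matches ULTRA (±2%)" is read as C7 throughput of at least 98% of ULTRA, any single "stay" condition is enough to stay, all "go" conditions must hold to move on, and values in between are reported as undetermined. `PhaseMetrics`, `PhaseDecision`, and `decide_phase` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    DECIDE_GO_V11B,
    DECIDE_STAY_V11A,
    DECIDE_UNDETERMINED
} PhaseDecision;

typedef struct {
    double c7_vs_ultra_pct;        /* C7 throughput as % of ULTRA */
    double learner_accuracy_pct;   /* Learner accuracy on class patterns */
    double regionid_overhead_pct;  /* RegionId lookup overhead */
    double stats_cpu_pct;          /* CPU cost of stats aggregation */
} PhaseMetrics;

static PhaseDecision decide_phase(const PhaseMetrics *m) {
    assert(m != NULL);
    /* Any failing condition keeps the work in v11a. */
    if (m->c7_vs_ultra_pct < 90.0 ||
        m->learner_accuracy_pct < 80.0 ||
        m->stats_cpu_pct > 5.0) {
        return DECIDE_STAY_V11A;
    }
    /* All passing conditions are required to move to v11b. */
    if (m->c7_vs_ultra_pct >= 98.0 &&
        m->learner_accuracy_pct > 90.0 &&
        m->regionid_overhead_pct < 2.0) {
        return DECIDE_GO_V11B;
    }
    return DECIDE_UNDETERMINED;
}
```

For instance, a Learner accuracy of 85% with C7 at 95% of ULTRA satisfies neither list, so the result is `DECIDE_UNDETERMINED`; the criteria as written leave that range open.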
## 8. Pruning Targets (Later)
### Branch Cutting (Phase v12+)
- Fine-grained optimization of the v3 backend
- Verification of v6 headerless gains
- Verification of v7 multi-class
- Multi-dimensional Learner optimization (free_pressure, fragmentation)
### Not in v11a
- Complex Policy v2 routing (multi-dimensional conditions)
- Simultaneous optimization of v6/v7/MID
- Large-scale ColdIface refactoring
---
**Document Date**: 2025-12-12
**Decision**: Option A (MID v3.5 consolidation)
**Target Completion**: Phase v11a end (2025-12-31)
**Next Review**: After Phase v11a-2 implementation