hakmem/docs/analysis/PHASE_V11A_DESIGN_MID_V3.5.md
Moe Charm (CI) bf83612b97 Phase v11a-4: Add Mixed mainline benchmark results
Results:
- C6-heavy (257-512B): +5.1% (34.0M → 35.8M ops/s)
- Mixed 16-1024B:      +4.4% (38.6M → 40.3M ops/s)

Conclusion: On the Mixed mainline, C6→MID v3.5 is an adoption candidate.
Confirmed a +4-5% improvement, exceeding the prediction (+1-3%).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 07:17:52 +09:00


Phase v11a Design Specification: MID v3.5 (257B-1KiB Unified Box)

1. Positioning

Migration from v10 to v11a:

  • v10: v7 frozen as a C5/C6-only research box, Learner default ON
  • v11a: MID v3.5 unified and extended as the main implementation for 257B-1KiB

Architectural roles:

L0: ULTRA (C4-C7)          → FROZEN (unchanged)
L1: MID v3.5 (C5-C7)       → mainline ★NEW
     ├─ C5: TLS cache + 2MiB segment (multi-page)
     ├─ C6: TLS cache + 2MiB segment (multi-page)
     └─ C7: TLS cache + 2MiB segment (multi-page)
L1-research: v7 (C5/C6)    → research box (frozen)
L2: Segment/ColdIface/RegionId
L3: Policy Box + Learner v2 (expanded stats)

2. Changes from MID v3 to MID v3.5

2-1. Current MID v3 configuration

Implemented:

  • C5/C6 multi-class TLS heap
  • 2MiB segments
  • RefillPolicy (TLS segment hint, pool fallback)
  • Policy routing (via Policy Box v7-4)
  • Legacy Stats (basic data recorded at page retire)

Limitations:

  • C7 not supported (ULTRA-specific)
  • No Learner stats (v7 only)
  • Assumes single-class segments

2-2. Features added in v11a

Feature 1: Full C7 support

Goal: MID v3.5 covers all of C5-C7

Implementation:

// mid_v3.5.h - new extension

// TLS context for C7
typedef struct {
    SmallHeapCtx ctx;  // Reuse existing context
    void *tls_page;    // Current page pointer (C7)
    uint32_t tls_offset; // Allocation offset in page
} SmallHeapCtx_C7_MID;

// Allocation fast path
// C7: size > 512B → check MID_ROUTE_C7
// If enabled: try TLS fast alloc → refill on demand
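The fast path described in the comments above can be sketched as a bump allocator over the TLS page. This is a minimal illustration, not the actual hakmem code: `MidC7Tls`, `mid_c7_try_alloc`, and `mid_c7_install_page` are hypothetical names, the 64 KiB page size is taken from the segment geometry in section 6, and bump allocation is an assumption suggested by the `tls_offset` field.

```c
#include <stddef.h>
#include <stdint.h>

#define MID_PAGE_SIZE (64u * 1024u)  /* 64 KiB page, per the segment geometry */

/* Hypothetical stand-in for SmallHeapCtx_C7_MID: only the two C7 fields. */
typedef struct {
    void    *tls_page;    /* current page, NULL until the first refill */
    uint32_t tls_offset;  /* next free byte within tls_page (<= MID_PAGE_SIZE) */
} MidC7Tls;

/* Fast path: bump-allocate block_size bytes from the current page.
 * Returns NULL when the caller must take the slow (refill) path. */
static void *mid_c7_try_alloc(MidC7Tls *tls, uint32_t block_size) {
    if (tls->tls_page == NULL) return NULL;                        /* no page yet */
    if (block_size > MID_PAGE_SIZE - tls->tls_offset) return NULL; /* page exhausted */
    void *p = (char *)tls->tls_page + tls->tls_offset;
    tls->tls_offset += block_size;
    return p;
}

/* Slow-path hook: install a fresh page (obtained elsewhere) and reset the offset. */
static void mid_c7_install_page(MidC7Tls *tls, void *page) {
    tls->tls_page = page;
    tls->tls_offset = 0;
}
```

A caller would try `mid_c7_try_alloc` first and, on NULL, obtain a page through the refill interface before retrying.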

Policy routing:

// mid_policy.h
route_kind[7] = SMALL_ROUTE_MID_V3;  // If C7 enabled, else ULTRA

Stats tracking:

// SmallPageStatsMID_v3: record class_idx for all retires
typedef struct {
    uint32_t class_idx;
    uint64_t total_allocations;
    uint64_t total_frees;
    uint32_t page_alloc_count;  // ← v11a new
    uint32_t free_hit_ratio_bps;  // ← v11a new (basis points)
} SmallPageStatsMID_v3;
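Since `free_hit_ratio_bps` is stored in basis points (1 bps = 0.01%), a small integer-only conversion helper may clarify the unit. The document does not define how hits and totals are counted, so `hits` and `total` below are hypothetical inputs and `ratio_to_bps` is an illustrative name.

```c
#include <stdint.h>

/* Convert hits/total into basis points (10000 bps = 100%).
 * Returns 0 for an empty sample; clamps hits to total so the result never
 * exceeds 10000. Assumes hits * 10000 fits in 64 bits. */
static uint32_t ratio_to_bps(uint64_t hits, uint64_t total) {
    if (total == 0) return 0;
    if (hits > total) hits = total;
    return (uint32_t)((hits * 10000u) / total);
}
```

For example, 80 hits out of 100 gives 8000 bps, the same scale as the 8000bps threshold mentioned in the Learner v2 example.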

Feature 2: Multi-class segment design decision

Two options considered:

Design A: Separate segments

MID_v3_segment[3] = {
    [0] → segment_C5,
    [1] → segment_C6,
    [2] → segment_C7
}

Pros: Simple, clean class separation
Cons: 3x segment overhead, more complex TLS lookup

Design B: Shared segment + per-class pages

SmallSegment_MID_v3 {
    free_pages[8];  // per class free stack
    class_pages[8]; // current page per class
    page_alloc[8];  // allocation count per class
}

Pros: A single segment suffices; no RegionIdBox changes needed
Cons: More complex logic

v11a decision: Design B (shared segment)

  • Reason: RegionIdBox already exists (minimizes changes)
  • Unified segment geometry (same as v7: 2MiB/64KiB)
  • Can support multi-class TLS hints
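To make Design B more concrete, here is one possible sketch of a shared segment that hands pages to classes on demand and keeps a per-class free stack of retired pages. All names (`SharedSeg`, `seg_take_page`, `seg_release_page`) and the index-linked stack layout are assumptions for illustration; only the 2MiB/64KiB geometry and the per-class `free_pages` / `page_alloc` idea come from the design above.

```c
#include <stdint.h>

#define SEG_PAGES   32    /* 2 MiB segment / 64 KiB pages */
#define NUM_CLASSES 8
#define NO_PAGE     (-1)

typedef struct {
    int8_t   free_head[NUM_CLASSES];  /* top of each class's stack of retired pages */
    int8_t   next_free[SEG_PAGES];    /* stack links between page indices */
    uint8_t  page_class[SEG_PAGES];   /* class that owns each carved page */
    uint32_t page_alloc[NUM_CLASSES]; /* pages handed out per class */
    uint8_t  fresh;                   /* first never-used page index */
} SharedSeg;

static void seg_init(SharedSeg *s) {
    for (int c = 0; c < NUM_CLASSES; c++) {
        s->free_head[c] = NO_PAGE;
        s->page_alloc[c] = 0;
    }
    s->fresh = 0;
}

/* Take a page for class_idx: reuse one from that class's free stack,
 * otherwise carve a never-used page. Returns NO_PAGE when the segment is full. */
static int seg_take_page(SharedSeg *s, uint32_t class_idx) {
    int page = s->free_head[class_idx];
    if (page != NO_PAGE) {
        s->free_head[class_idx] = s->next_free[page];
    } else if (s->fresh < SEG_PAGES) {
        page = s->fresh++;
        s->page_class[page] = (uint8_t)class_idx;
    } else {
        return NO_PAGE;
    }
    s->page_alloc[class_idx]++;
    return page;
}

/* Return a page to the free stack of the class that owns it. */
static void seg_release_page(SharedSeg *s, int page) {
    uint32_t c = s->page_class[page];
    s->next_free[page] = s->free_head[c];
    s->free_head[c] = (int8_t)page;
}
```

In this sketch a retired page stays bound to its class, which keeps the lookup simple at the cost of not rebalancing pages between classes.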

Feature 3: Learner v2 (Expanded Stats)

Limitations of the v7-7 Learner:

// Current: monitors only the C5 ratio
c5_ratio_pct = (stats->per_class[5].v7_allocs * 100) / total_allocs;
if (c5_ratio_pct >= THRESHOLD)  route[5] = V7;

v11a Learner v2 extensions:

typedef struct {
    uint64_t allocs[8];              // per class allocation count
    uint32_t retire_ratio_pct[8];    // per class retire efficiency
    uint64_t avg_page_utilization;   // global metric
    uint32_t free_hit_ratio_bps;     // global free hit (basis points)
    uint64_t eval_count;
} SmallLearnerStatsV2;

// Route decisions from multiple metrics (extensible later)
// Example (Phase v11b):
//   - C5_ratio < 30% AND retire_ratio < 50% → MID_v3
//   - C5_ratio >= 30% AND free_hit > 8000bps → V7
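The Phase v11b example rules in the comment above could be written as a small decision function. This is a sketch only: the `Route` enum and `learner_v2_route_c5` are hypothetical, the C5 ratio is computed from `allocs[]` rather than from the v7-specific counters, and keeping the current route when neither rule fires is an assumption.

```c
#include <stdint.h>

typedef enum { ROUTE_MID_V3, ROUTE_V7 } Route;  /* hypothetical route names */

typedef struct {
    uint64_t allocs[8];              /* per class allocation count */
    uint32_t retire_ratio_pct[8];    /* per class retire efficiency */
    uint64_t avg_page_utilization;   /* global metric */
    uint32_t free_hit_ratio_bps;     /* global free hit (basis points) */
    uint64_t eval_count;
} SmallLearnerStatsV2;

/* Apply the two example rules to class 5; fall back to the current route. */
static Route learner_v2_route_c5(const SmallLearnerStatsV2 *st, Route current) {
    uint64_t total = 0;
    for (int c = 0; c < 8; c++) total += st->allocs[c];
    if (total == 0) return current;  /* no data yet */

    uint64_t c5_pct = st->allocs[5] * 100 / total;
    if (c5_pct < 30 && st->retire_ratio_pct[5] < 50) return ROUTE_MID_V3;
    if (c5_pct >= 30 && st->free_hit_ratio_bps > 8000) return ROUTE_V7;
    return current;  /* neither rule matched */
}
```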

Implementation flow:

MID_v3 page retire
  ↓ record stats
SmallPageStatsMID_v3 {class_idx, allocs, free_hit_ratio}
  ↓ periodic publish (every LEARNER_EVAL_INTERVAL)
SmallLearnerStatsV2 aggregate
  ↓
small_learner_v2_evaluate()
  ↓
small_policy_v3_update_from_learner()  ← NEW (Policy v2)
  ↓
TLS policy cache invalidation
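The last step of the flow, TLS policy cache invalidation, is not specified further. One common way to do it is a global version counter that each thread compares against its cached copy. The sketch below assumes that approach; `Policy`, `policy_publish`, and `policy_tls_snapshot` are hypothetical names, not the existing Policy Box API.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct { uint8_t route_kind[8]; } Policy;  /* simplified policy */

static Policy      g_policy;          /* published by the Learner */
static atomic_uint g_policy_version;  /* bumped on every publish */

static _Thread_local Policy   tls_policy;
static _Thread_local unsigned tls_policy_version = UINT_MAX;  /* forces first refresh */

/* Learner side: publish new routes, then bump the version.
 * Assumes a single writer; a real implementation would also have to protect
 * readers from a half-written g_policy (for example with a seqlock). */
static void policy_publish(const Policy *next) {
    g_policy = *next;
    atomic_fetch_add(&g_policy_version, 1);
}

/* Allocation side: refresh the TLS copy only when the version changed. */
static const Policy *policy_tls_snapshot(void) {
    unsigned v = atomic_load(&g_policy_version);
    if (v != tls_policy_version) {
        tls_policy = g_policy;
        tls_policy_version = v;
    }
    return &tls_policy;
}
```

The hot path then costs one atomic load and a compare, and the copy happens only after the Learner has actually changed a route.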

2-3. Inherited existing components

Unchanged:

  • RegionIdBox: Segment ptr → region lookup (existing behavior)
  • Policy Box: route_kind[8] array (existing API)
  • ColdIface: refill/retire interface (existing)
  • TLS cache: per-class fast path (existing pattern)

Requires changes:

  • Policy initialization: add C7 routing
  • Learner stats recording: add class_idx recording
  • Stats aggregation: multi-class support

3. Implementation schedule

Phase v11a-1: Design & Infrastructure (Week 1-2)

  • Decide the SmallSegment_MID_v3 multi-class layout
  • SmallPageStatsMID_v3 type definition + publish API
  • SmallLearnerStatsV2 type definition
  • Sketch the Policy v2 update function
  • Extend the bench suite: C5/C6/C7 individual tests

Phase v11a-2: Core Implementation (Week 3-4)

  • SmallHeap_MID_v3_C7 alloc/free path
  • Multi-class refill logic
  • Stats recording (per-page class_idx)
  • Learner stats aggregation
  • Policy update_from_learner v2

Phase v11a-3: Integration & Testing (Week 5)

  • Learner default ON for MID_v3
  • Perf benchmarks: C5/C6/C7 mixed
  • Learner route switch verification
  • Regression: v7 research preset still works

Phase v11b: Multi-segment Expansion (TBD)

  • Evaluate separate segment approach
  • TLS multi-segment hint optimization
  • C4 support decision (ULTRA vs MID_v3)

4. Minimizing API changes

Policy Box API (minimal changes)

// Existing: function signatures unchanged
const SmallPolicyV7* small_policy_v7_snapshot(void);
void small_policy_v7_init_from_env(SmallPolicyV7* policy);
void small_policy_v7_update_from_learner(
    const SmallLearnerStatsV7* stats,
    SmallPolicyV7* policy_out
);

// v11a: only the type name is extended
// typedef SmallLearnerStatsV7 → SmallLearnerStatsV2 (backward compat)
// → internally, the new v2 fields are optional

Learner Box API (newly added)

// smallobject_learner_v2_box.h
typedef struct { /* SmallLearnerStatsV2 */ } SmallLearnerStatsV2;

void small_learner_v2_record_retire(uint32_t class_idx,
                                    uint32_t free_hit_ratio_bps);
void small_learner_v2_evaluate(void);
const SmallLearnerStatsV2* small_learner_v2_stats_snapshot(void);

ColdIface API (unchanged)

// Existing refill/retire interface
typedef void (*cold_refill_page_fn)(uint32_t class_idx, ...);
typedef void (*cold_retire_page_fn)(uint32_t class_idx, ...);

5. Performance predictions

Current MID v3 (C5/C6)

C5/C6 mixed (200-500B, 300K iter): 38.7M ops/s
C6 heavy (400-510B, 500K iter):    56.3M ops/s
Mixed 16-1024B (v7 OFF):           21.5M ops/s

Expected MID v3.5 (after implementation)

C5/C6/C7 mixed (200-1000B):        +3-5% (more pages, better locality)
C7 heavy (800-1000B):               +2-3% (vs ULTRA fallback)
Mixed 16-1024B (with Learner):     +1-2% (dynamic routing)

Actual MID v3.5 Results (Phase v11a-4)

C6-heavy (257-512B):

v3.5 OFF: 34.0M ops/s
v3.5 ON:  35.8M ops/s  (+5.1%)

Mixed 16-1024B (ws=400, 10M iters, avg of 3 runs):

v3.5 OFF: 38.6M ops/s
v3.5 ON:  40.3M ops/s  (+4.4%)

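Remarks: C6-heavy improved by +5% as predicted, and Mixed also improved by +4%, which is better than predicted. On the Mixed mainline, C6→MID v3.5 is a viable adoption candidate.

Metrics:

  • Throughput: +4-5% (exceeds the predicted +1-3%)
  • Overhead: not measured (avoided by calling mmap directly)
  • Learner accuracy: observation mode only (route switching is left to a future phase)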

6. Finalized design decisions

Segment Geometry (v11a)

SmallSegment_MID_v3:
  - Total size: 2 MiB (same as v7)
  - Page size: 64 KiB (same as v7)
  - Free stack: per-class (C5/C6/C7 each)
  - Class pages: current[8], partial[8]
  - RegionId: single segment per TLS thread
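The geometry above implies some simple address arithmetic, sketched here. The helpers assume that each segment is aligned to its own 2 MiB size and that pages carry no in-page header. Neither assumption is stated in this document, and the names `seg_base_of`, `page_index_of`, and `blocks_per_page` are illustrative.

```c
#include <stdint.h>

#define MID_SEG_SIZE      (2u * 1024u * 1024u)            /* 2 MiB */
#define MID_PAGE_SIZE     (64u * 1024u)                   /* 64 KiB */
#define MID_PAGES_PER_SEG (MID_SEG_SIZE / MID_PAGE_SIZE)  /* 32 pages */

/* Base address of the segment containing addr (assumes 2 MiB alignment). */
static uintptr_t seg_base_of(uintptr_t addr) {
    return addr & ~(uintptr_t)(MID_SEG_SIZE - 1);
}

/* Index (0..31) of the page containing addr within its segment. */
static uint32_t page_index_of(uintptr_t addr) {
    return (uint32_t)((addr & (MID_SEG_SIZE - 1)) / MID_PAGE_SIZE);
}

/* Blocks of block_size bytes that fit in one page (no header assumed). */
static uint32_t blocks_per_page(uint32_t block_size) {
    return MID_PAGE_SIZE / block_size;
}
```

Under these assumptions a page holds 128 blocks of 512B (the C6 upper bound) or 64 blocks of 1024B (the C7 upper bound).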

TLS Caching Pattern

// TLS MID context
struct {
    SmallSegment_MID_v3 *seg;
    void *page[8];  // Current page per class
    uint32_t offset[8];  // Allocation offset
    uint32_t cache_hits;
    uint32_t cache_misses;
} __thread tls_mid_v3_ctx;

Stats Recording

// On page retire:
void small_cold_mid_v3_retire_page(..., uint32_t class_idx) {
    SmallPageMeta* meta = page->meta;
    meta->class_idx = class_idx;  // ← record class

    // Calculate metrics
    uint32_t free_hit = calc_free_hit_ratio(page);

    // Publish stats
    SmallPageStatsMID_v3 stat = {
        .class_idx = class_idx,
        .total_allocations = page->alloc_count,
        .total_frees = page->free_count,
        .page_alloc_count = capacity,
        .free_hit_ratio_bps = free_hit
    };

    // Feed to Learner
    small_learner_v2_ingest_stats(&stat);
}

7. Next Decision Points

Criteria for moving to v11b

Go to v11b (multi-segment) if:
  ✓ C7 performance matches ULTRA (±2%)
  ✓ Learner accuracy > 90% on class patterns
  ✓ RegionId lookup latency acceptable (<2% overhead)

Stay in v11a (iterate) if:

  ✗ C7 performance < 90% vs ULTRA
  ✗ Learner detection < 80% accuracy
  ✗ Stats aggregation cost > 5% CPU
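The two checklists can be read as a three-way gate, since measurements may fall between the "go" and "stay" thresholds (for example, Learner accuracy of 85%). The sketch below encodes that reading. It assumes that any single "stay" condition is enough to stay, that all "go" conditions are required to proceed, and that C7 running faster than ULTRA also counts as matching; the type and function names are hypothetical.

```c
#include <stdbool.h>

typedef enum { V11_GO_V11B, V11_STAY_V11A, V11_UNDECIDED } V11Gate;

typedef struct {
    double c7_pct_of_ultra;             /* C7 throughput as % of ULTRA, e.g. 99.0 */
    double learner_accuracy_pct;        /* Learner accuracy on class patterns */
    double region_lookup_overhead_pct;  /* RegionId lookup overhead */
    double stats_cpu_pct;               /* CPU cost of stats aggregation */
} V11GateInput;

static V11Gate v11_gate(const V11GateInput *m) {
    /* Any one "stay" condition keeps the work in v11a. */
    bool stay = m->c7_pct_of_ultra < 90.0 ||
                m->learner_accuracy_pct < 80.0 ||
                m->stats_cpu_pct > 5.0;
    if (stay) return V11_STAY_V11A;

    /* All "go" conditions are required to move on to v11b. */
    bool go = m->c7_pct_of_ultra >= 98.0 &&
              m->learner_accuracy_pct > 90.0 &&
              m->region_lookup_overhead_pct < 2.0;
    return go ? V11_GO_V11B : V11_UNDECIDED;
}
```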

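8. Branch-cutting targets (later)

Branch Cutting (Phase v12+)

  • Fine-grained optimization of the v3 backend
  • Verification of v6 headerless gains
  • Verification of v7 multi-class
  • Multi-dimensional Learner optimization (free_pressure, fragmentation)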

Not in v11a

  • Complex routing in Policy v2 (multi-dimensional conditions)
  • Simultaneous optimization of v6/v7/MID
  • Large-scale refactoring of ColdIface

Document Date: 2025-12-12
Decision: Option A (MID v3.5 consolidation)
Target Completion: Phase v11a end (2025-12-31)
Next Review: After Phase v11a-2 implementation