# Mid Range MT Optimization Design (2025-11-01)

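## 📋 Overview

**Goal**: Bring multi-threaded performance in the 8-32KB size range up to mimalloc's level.

**Current performance**:
- hakmem: 46-47 M ops/s (mid_large_mt benchmark)
- mimalloc: 122 M ops/s
- Gap: **-62%** (2.6x slower) ❌

**Target performance**:
- **100-120 M ops/s** (roughly 80-100% of mimalloc)
- A 2.2-2.6x improvement

**Implementation time**: 1 week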

---

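## 🎯 Design Philosophy

### Position within the hybrid strategy

```
≤1KB (Tiny Pool)     → static optimization (P0 done, no learning needed)
8-32KB (Mid Range)   → mimalloc-style per-thread segment (this design) ← 🔥 here
≥64KB (Large Pool)   → learning-based (ELO strategy selection)
```

**Why mimalloc-style for 8-32KB only?**

1. ✅ **MT performance comes first**: lock contention is the largest bottleneck
2. ✅ **Learning pays off little**: the size range is narrow, so strategies differ only slightly
3. ✅ **Proven in mimalloc**: per-thread heaps have a track record
4. ✅ **No conflict with the learning layer**: the learning layer keeps handling 64KB and above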

### The core idea of mimalloc

**Free List Sharding**:
```c
// mimalloc's key insight:
// "each thread owns an independent heap segment"
// → lock-free, cache-local, scalable

static __thread Segment* my_segment;  // one per thread

void* alloc(size_t size) {
    // Fetched directly from TLS (no lock needed)
    return segment_alloc(my_segment, size);
}
```

---

## 🏗️ Architecture

### Overall structure

```
┌─────────────────────────────────────────────────────────┐
│ malloc(size) - 8KB ≤ size ≤ 32KB                        │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
         ┌──────────────────────────┐
         │ mid_size_to_class(size)  │ ← pick the size class
         │  8KB  → class 0          │
         │ 16KB  → class 1          │
         │ 32KB  → class 2          │
         └──────────┬───────────────┘
                    │
                    ▼
    ┌───────────────────────────────────────┐
    │ g_mid_segments[class_idx]             │ ← __thread TLS
    │                                       │
    │ MidThreadSegment {                    │
    │   void* free_list;    ← free objects  │
    │   void* current;      ← bump pointer  │
    │   void* end;          ← chunk end     │
    │   size_t chunk_size;  ← 64KB          │
    │   uint32_t used_count;                │
    │   uint32_t capacity;                  │
    │ }                                     │
    └───────────┬───────────────────────────┘
                │
                ├─── fast path: pop from free_list
                │    (the common case)
                │
                ├─── bump path: bump from current
                │    (when free_list is empty)
                │
                └─── slow path: allocate a new segment
                     (when current == end)
```

### Size class definitions

```c
// Three classes: 8KB, 16KB, 32KB
#define MID_SIZE_CLASS_8K   0   // 8KB
#define MID_SIZE_CLASS_16K  1   // 16KB
#define MID_SIZE_CLASS_32K  2   // 32KB
#define MID_NUM_CLASSES     3

// Size → class
static inline int mid_size_to_class(size_t size) {
    if (size <= 8192)  return MID_SIZE_CLASS_8K;
    if (size <= 16384) return MID_SIZE_CLASS_16K;
    return MID_SIZE_CLASS_32K;
}

// Class → block size
static inline size_t mid_class_to_size(int class_idx) {
    static const size_t sizes[MID_NUM_CLASSES] = {
        8192,   // 8KB
        16384,  // 16KB
        32768   // 32KB
    };
    return sizes[class_idx];
}
```
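
With only three power-of-two classes, a request just above a class boundary is served from the next class and leaves almost half of the block unused. The sketch below makes that rounding cost concrete. The helper names (`demo_block_size_for`, `demo_internal_waste`) are hypothetical and only mirror the class boundaries defined above.

```c
#include <assert.h>
#include <stddef.h>

/* Same class boundaries as mid_size_to_class()/mid_class_to_size() above. */
static size_t demo_block_size_for(size_t size) {
    if (size <= 8192)  return 8192;
    if (size <= 16384) return 16384;
    return 32768;
}

/* Bytes lost to rounding when a request is served from its class.
 * Hypothetical helper for illustration; valid for size <= 32768. */
static size_t demo_internal_waste(size_t size) {
    return demo_block_size_for(size) - size;
}
```

A request of 8193 bytes, for example, takes a 16KB block and wastes 8191 bytes, which is worth keeping in mind when judging memory overhead against the speed of a three-class lookup.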

---

## 📊 Data Structures

### MidThreadSegment (core data structure)

```c
// core/hakmem_mid_mt.h

typedef struct MidThreadSegment {
    // === Fast path (hot fields, offsets 0-27 on a typical LP64 target) ===
    void*    free_list;       // Free objects linked list (NULL if empty)
    void*    current;         // Bump allocation pointer
    void*    end;             // End of current chunk
    uint32_t used_count;      // Number of allocated blocks

    // === Metadata (offsets 32-59 on LP64, still within the first 64-byte line) ===
    void*    chunk_base;      // Base address of current chunk
    size_t   chunk_size;      // Size of chunk (64KB)
    size_t   block_size;      // Size of each block (8KB/16KB/32KB)
    uint32_t capacity;        // Total blocks in chunk

    // === Statistics (second 64-byte line, offset 64 onward on LP64) ===
    uint64_t alloc_count;     // Total allocations
    uint64_t free_count;      // Total frees
    uint32_t refill_count;    // Number of chunk refills
    uint32_t padding;

} __attribute__((aligned(64))) MidThreadSegment;

// TLS: each thread owns its own set of segments
extern __thread MidThreadSegment g_mid_segments[MID_NUM_CLASSES];
```

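**Design points**:
1. **64-byte alignment**: each segment starts on an L1 cache line boundary (avoids false sharing)
2. **Fast path first**: `free_list`, `current`, and `end` sit at the front
3. **Statistics kept apart**: counters the hot path does not need are placed last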
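
The layout claims can be checked at compile time rather than by inspection. The following is a minimal sketch, assuming a GCC/Clang-compatible compiler in C11 mode; `DemoSegment` is a hypothetical copy of the `MidThreadSegment` field order, renamed so that it does not clash with the real header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy of the MidThreadSegment field order, renamed to mark it as a demo. */
typedef struct DemoSegment {
    void*    free_list;
    void*    current;
    void*    end;
    uint32_t used_count;
    void*    chunk_base;
    size_t   chunk_size;
    size_t   block_size;
    uint32_t capacity;
    uint64_t alloc_count;
    uint64_t free_count;
    uint32_t refill_count;
    uint32_t padding;
} __attribute__((aligned(64))) DemoSegment;

/* The hot fields must end before the first 64-byte boundary. */
_Static_assert(offsetof(DemoSegment, used_count) + sizeof(uint32_t) <= 64,
               "hot fields spill out of the first cache line");

/* The aligned attribute rounds sizeof up to a multiple of 64, so consecutive
 * array elements (one per size class) never share a cache line. */
_Static_assert(sizeof(DemoSegment) % 64 == 0,
               "segment size is not a multiple of 64");
```

If a later field reordering pushes a hot field past the first line, the build fails instead of the benchmark silently regressing.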

### Segment Registry (recovering the size at free time)

```c
// Registry that lets free(ptr) recover the block size

typedef struct MidSegmentRegistry {
    void*  base;              // Segment base address
    size_t block_size;        // Block size (8KB/16KB/32KB)
    int    class_idx;         // Size class index
} MidSegmentRegistry;

// Global registry (protected by a lock)
typedef struct MidGlobalRegistry {
    MidSegmentRegistry* entries;  // Dynamic array
    uint32_t count;               // Number of entries
    uint32_t capacity;            // Array capacity
    pthread_mutex_t lock;         // Registry lock
} MidGlobalRegistry;

extern MidGlobalRegistry g_mid_registry;

// Registry operations
void mid_registry_add(void* base, size_t block_size, int class_idx);
int  mid_registry_lookup(void* ptr, size_t* out_block_size, int* out_class_idx);
void mid_registry_remove(void* base);
```

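**Design points**:
1. **Minimal lock scope**: the lock is taken only for registry operations; alloc and local free go through TLS and stay lock-free
2. **O(log N) lookup**: entries are sorted by address and found by binary search
3. **Lazy removal**: an entry is removed only once its segment is completely empty (rare)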
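
Binary search only works if insertion keeps the array sorted by base address. The sketch below shows one way to do that, under stated assumptions: the `Demo*` names are hypothetical, every chunk is 64KB, addresses are held as `uintptr_t`, and locking is left out so the ordering logic stays visible (the real registry would wrap both calls in its mutex).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_CHUNK_SIZE (64u * 1024u)

typedef struct DemoEntry {
    uintptr_t base;        /* chunk base address */
    size_t    block_size;
    int       class_idx;
} DemoEntry;

typedef struct DemoRegistry {
    DemoEntry* entries;    /* sorted by base, ascending */
    size_t     count;
    size_t     capacity;
} DemoRegistry;

/* Insert while keeping entries sorted by base. Returns false on OOM. */
static bool demo_registry_add(DemoRegistry* r, uintptr_t base,
                              size_t block_size, int class_idx) {
    if (r->count == r->capacity) {
        size_t new_cap = r->capacity ? r->capacity * 2 : 8;
        DemoEntry* grown = realloc(r->entries, new_cap * sizeof *grown);
        if (!grown) return false;
        r->entries = grown;
        r->capacity = new_cap;
    }
    /* Find the first slot whose predecessor has a smaller base. */
    size_t pos = r->count;
    while (pos > 0 && r->entries[pos - 1].base > base) pos--;
    /* Shift the tail up by one to open the slot. */
    memmove(&r->entries[pos + 1], &r->entries[pos],
            (r->count - pos) * sizeof *r->entries);
    r->entries[pos] = (DemoEntry){ base, block_size, class_idx };
    r->count++;
    return true;
}

/* Binary search for the chunk containing addr; NULL if none. */
static const DemoEntry* demo_registry_lookup(const DemoRegistry* r,
                                             uintptr_t addr) {
    size_t lo = 0, hi = r->count;              /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const DemoEntry* e = &r->entries[mid];
        if (addr < e->base)                         hi = mid;
        else if (addr >= e->base + DEMO_CHUNK_SIZE) lo = mid + 1;
        else                                        return e;
    }
    return NULL;
}
```

Insertion is O(N) because of the shift, which is acceptable here since it only runs on refill, while lookups stay O(log N).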

---

## 🚀 API Specification

### Allocation API

```c
// core/hakmem_mid_mt.h

/**
 * Mid Range MT allocation (8-32KB)
 *
 * @param size  Allocation size (must be 8KB ≤ size ≤ 32KB)
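 * @return      Allocated pointer, or NULL on failure. Blocks are laid out at
 *              multiples of block_size from the chunk base, so block_size
 *              alignment holds only if the chunk base itself is so aligned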
 *
 * Thread-safety: Lock-free (uses TLS)
 * Performance:   O(1) fast path, O(1) amortized
 */
void* mid_mt_alloc(size_t size);

/**
 * Mid Range MT free
 *
 * @param ptr   Pointer to free (must be from mid_mt_alloc)
 * @param size  Original allocation size (for class lookup)
 *
 * Thread-safety: Lock-free if freeing to own thread's segment
 *                Atomic CAS if remote free (cross-thread)
 * Performance:   O(1) local free; remote free adds an O(log N) registry lookup
 */
void mid_mt_free(void* ptr, size_t size);

/**
 * Initialize Mid Range MT allocator
 *
 * Call once at startup (thread-safe)
 */
void mid_mt_init(void);

/**
 * Cleanup thread-local segments (called on thread exit)
 */
void mid_mt_thread_exit(void);
```

### Internal helpers

```c
// Segment operations
static void*   segment_alloc(MidThreadSegment* seg, size_t size);
static void    segment_free_local(MidThreadSegment* seg, void* ptr);
static void    segment_free_remote(void* ptr, size_t block_size, int class_idx);
static bool    segment_refill(MidThreadSegment* seg, int class_idx);

// Chunk management
static void*   chunk_allocate(size_t chunk_size);
static void    chunk_deallocate(void* chunk, size_t chunk_size);

// Registry
static void    registry_register(void* base, size_t block_size, int class_idx);
static bool    registry_find_segment(void* ptr, size_t* block_size, int* class_idx);
```

---

## 🔍 Implementation Details

### Fast Path: segment_alloc

```c
// core/hakmem_mid_mt.c

static inline void* segment_alloc(MidThreadSegment* seg, size_t size) {
    // === Path 1: Free list (fastest, ~4-5 instructions) ===
    void* p = seg->free_list;
    if (likely(p != NULL)) {
        seg->free_list = *(void**)p;  // Pop from free list
        seg->used_count++;
        return p;
    }

    // === Path 2: Bump allocation (fast, ~6-8 instructions) ===
    p = seg->current;
    void* next = (uint8_t*)p + seg->block_size;

    if (likely(next <= seg->end)) {
        seg->current = next;
        seg->used_count++;
        return p;
    }

    // === Path 3: Refill (slow, called ~once per 64KB) ===
    int class_idx = mid_size_to_class(size);  // refill needs the class to pick block_size
    if (!segment_refill(seg, class_idx)) {
        return NULL;  // OOM
    }

    // Retry after refill
    p = seg->current;
    seg->current = (uint8_t*)p + seg->block_size;
    seg->used_count++;
    return p;
}
```

**Performance estimate** (assumed path mix, not measured):
- Path 1 (free list): ~99% of calls, ~4-5 instructions
- Path 2 (bump): ~0.9% of calls, ~6-8 instructions
- Path 3 (refill): ~0.1% of calls, ~1000 instructions (low amortized cost)
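
To see why the refill path is called cheap in amortized terms, the three estimates can be combined into an expected instruction count per allocation. This is a rough calculation that takes the midpoints of the ranges above (4.5 and 7) and treats the path mix as given:

$$
E[\text{instructions}] \approx 0.99 \times 4.5 + 0.009 \times 7 + 0.001 \times 1000 = 4.455 + 0.063 + 1.0 \approx 5.5
$$

Under these assumptions the refill path contributes about one instruction per allocation on average, so the expected cost stays close to the free-list path as long as refills remain near 0.1% of calls.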

### Segment Refill

```c
static bool segment_refill(MidThreadSegment* seg, int class_idx) {
    size_t block_size = mid_class_to_size(class_idx);
    size_t chunk_size = 64 * 1024;  // 64KB chunks

    // Allocate new chunk via mmap
    void* chunk = chunk_allocate(chunk_size);
    if (!chunk) return false;

    // Register chunk in global registry (for free path)
    registry_register(chunk, block_size, class_idx);

    // Setup segment
    seg->chunk_base = chunk;
    seg->chunk_size = chunk_size;
    seg->block_size = block_size;
    seg->current = chunk;
    seg->end = (uint8_t*)chunk + chunk_size;
    seg->capacity = chunk_size / block_size;
    seg->refill_count++;

    return true;
}
```
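
`segment_refill` relies on `chunk_allocate`, which is declared among the internal helpers but not shown. A minimal sketch of what it and `chunk_deallocate` could look like on Linux follows, assuming anonymous private mappings are acceptable. Note that `mmap` guarantees only page alignment; a 64KB-aligned chunk would need over-allocation and trimming, which this sketch does not do.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map a zero-filled, page-aligned anonymous region; NULL on failure. */
static void* chunk_allocate(size_t chunk_size) {
    void* p = mmap(NULL, chunk_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Return the region to the OS. */
static void chunk_deallocate(void* chunk, size_t chunk_size) {
    if (chunk) munmap(chunk, chunk_size);
}
```

Translating `MAP_FAILED` to `NULL` keeps the `if (!chunk) return false;` check in `segment_refill` valid as written.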

### Free Path: mid_mt_free

```c
void mid_mt_free(void* ptr, size_t size) {
    if (!ptr) return;

    int class_idx = mid_size_to_class(size);

    // === Fast path: Local free (own thread's segment) ===
    MidThreadSegment* seg = &g_mid_segments[class_idx];

    // Check if ptr belongs to current segment
    if (likely(ptr >= seg->chunk_base && ptr < seg->end)) {
        segment_free_local(seg, ptr);
        return;
    }

    // === Slow path: Remote free (cross-thread) ===
    // Need to find the owning segment via registry
    size_t block_size;
    int owner_class;

    if (registry_find_segment(ptr, &block_size, &owner_class)) {
        segment_free_remote(ptr, block_size, owner_class);
    } else {
        // Not from mid_mt allocator, fallback
        // (Should not happen if routing is correct)
    }
}

static inline void segment_free_local(MidThreadSegment* seg, void* ptr) {
    // Push to free list (lock-free, local)
    *(void**)ptr = seg->free_list;
    seg->free_list = ptr;
    seg->used_count--;
}

static void segment_free_remote(void* ptr, size_t block_size, int class_idx) {
    // Remote free: the freeing thread does not know which thread owns the segment.
    // Planned strategy (Phase 2): push onto a per-segment remote free list with
    // atomic CAS; the owner thread drains it periodically.

    // Phase 1: not implemented, so the block is intentionally leaked.
    (void)ptr;
    (void)block_size;
    (void)class_idx;
}
```
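
Taken together, the free-list pop, the bump path, and the local free behave as an intrusive LIFO: the most recently freed block is the next one handed out. The sketch below isolates that behavior in a single-threaded form. The `DemoSeg` type and `demo_seg_*` functions are hypothetical, use a fixed 8KB block size, and take a caller-provided chunk instead of calling `segment_refill`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>   /* malloc/free for the caller-provided chunk */

#define DEMO_BLOCK_SIZE 8192u

typedef struct DemoSeg {
    void*    free_list;  /* intrusive LIFO: a free block stores the next pointer */
    uint8_t* current;    /* bump pointer */
    uint8_t* end;        /* one past the last byte of the chunk */
} DemoSeg;

static void demo_seg_init(DemoSeg* s, void* chunk, size_t chunk_size) {
    s->free_list = NULL;
    s->current = (uint8_t*)chunk;
    s->end = (uint8_t*)chunk + chunk_size;
}

static void* demo_seg_alloc(DemoSeg* s) {
    /* Path 1: pop the most recently freed block. */
    void* p = s->free_list;
    if (p != NULL) {
        s->free_list = *(void**)p;
        return p;
    }
    /* Path 2: carve the next block out of the chunk. */
    if ((size_t)(s->end - s->current) >= DEMO_BLOCK_SIZE) {
        p = s->current;
        s->current += DEMO_BLOCK_SIZE;
        return p;
    }
    return NULL;  /* chunk exhausted; the real code would refill here */
}

static void demo_seg_free(DemoSeg* s, void* p) {
    /* Push onto the free list; the block's first word becomes the link. */
    *(void**)p = s->free_list;
    s->free_list = p;
}
```

Allocating `a` and `b`, freeing them in that order, and allocating twice again returns `b` first and then `a`, which is the reuse pattern that keeps recently touched blocks warm in cache.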

### Registry: Binary Search Lookup

```c
static bool registry_find_segment(void* ptr, size_t* block_size, int* class_idx) {
    pthread_mutex_lock(&g_mid_registry.lock);

    // Binary search for segment containing ptr
    int left = 0;
    // Cast before subtracting: count is unsigned, so count - 1 would wrap when count == 0
    int right = (int)g_mid_registry.count - 1;
    bool found = false;

    while (left <= right) {
        int mid = left + (right - left) / 2;
        MidSegmentRegistry* entry = &g_mid_registry.entries[mid];

        void* seg_end = (uint8_t*)entry->base + 64 * 1024;

        if (ptr < entry->base) {
            right = mid - 1;
        } else if (ptr >= seg_end) {
            left = mid + 1;
        } else {
            // Found!
            *block_size = entry->block_size;
            *class_idx = entry->class_idx;
            found = true;
            break;
        }
    }

    pthread_mutex_unlock(&g_mid_registry.lock);
    return found;
}
```

---

## 🔒 Thread Safety

### Lock-free region (fast path)

```
Allocation (own thread):
  g_mid_segments[class_idx]  ← __thread TLS, fully lock-free
  └─ free_list, current, end  ← accessed by a single thread only

Free (local, own thread):
  seg->free_list push  ← local operation, no lock needed
```

### Locked region (low frequency)

```
Registry operations:
  g_mid_registry  ← global, protected by pthread_mutex
  ├─ registry_register()      ← on refill (~0.1% of calls)
  ├─ registry_find_segment()  ← on remote free (10-20%?)
  └─ mid_registry_remove()    ← when a segment is returned (~0.01%)

Chunk allocation:
  mmap()  ← system call (~0.1% of calls)
```

### Remote Free (implemented in Phase 2)

```c
// Phase 1: remote free is not implemented (memory leaks)
// Phase 2: implement a per-segment remote free list

typedef struct MidThreadSegment {
    // ...existing fields...

    // Remote free list (accessed by other threads)
    _Atomic(void*) remote_free_head;  // Atomic Treiber stack
    uint32_t remote_free_count;       // Approximate count
} MidThreadSegment;

// Remote free (from other thread)
void segment_free_remote(MidThreadSegment* seg, void* ptr) {
    // Atomic push to remote_free_head (Treiber stack)
    void* old_head = atomic_load_explicit(&seg->remote_free_head, memory_order_relaxed);
    do {
        *(void**)ptr = old_head;  // link before publishing; a failed CAS reloads old_head
    } while (!atomic_compare_exchange_weak_explicit(
        &seg->remote_free_head, &old_head, ptr,
        memory_order_release, memory_order_relaxed));
}

// Drain remote frees (by owner thread, periodically)
void segment_drain_remote(MidThreadSegment* seg) {
    void* remote_head = atomic_exchange_explicit(
        &seg->remote_free_head, NULL, memory_order_acquire);

    // Merge to local free_list
    if (remote_head) {
        void* tail = remote_head;
        while (*(void**)tail != NULL) {
            tail = *(void**)tail;
        }
        *(void**)tail = seg->free_list;
        seg->free_list = remote_head;
    }
}
```

---

## 🧪 Test Plan

### Unit Tests

```c
// test_mid_mt_basic.c
void test_single_thread_alloc_free() {
    // Single thread, plain alloc/free
    void* p1 = mid_mt_alloc(8192);
    assert(p1 != NULL);
    mid_mt_free(p1, 8192);
}

void test_multiple_size_classes() {
    // Use all three size classes at once
    void* p8k  = mid_mt_alloc(8192);
    void* p16k = mid_mt_alloc(16384);
    void* p32k = mid_mt_alloc(32768);
    assert(p8k && p16k && p32k);
    mid_mt_free(p8k, 8192);
    mid_mt_free(p16k, 16384);
    mid_mt_free(p32k, 32768);
}

void test_segment_refill() {
    // Exhaust a segment to exercise refill
    // 64KB chunk / 8KB blocks = 8 blocks, so 10 allocations force one refill
    void* ptrs[10];
    for (int i = 0; i < 10; i++) {
        ptrs[i] = mid_mt_alloc(8192);
        assert(ptrs[i] != NULL);
    }
    for (int i = 0; i < 10; i++) {
        mid_mt_free(ptrs[i], 8192);
    }
}
```

### Multi-threaded Tests

```c
// test_mid_mt_multithread.c
void* thread_worker(void* arg) {
    int thread_id = *(int*)arg;

    // Each thread allocates and frees independently
    for (int i = 0; i < 100000; i++) {
        void* p = mid_mt_alloc(8192);
        assert(p != NULL);

        // Write to memory (check no overlap)
        *(int*)p = thread_id;

        mid_mt_free(p, 8192);
    }
    return NULL;
}

void test_concurrent_alloc() {
    pthread_t threads[4];
    int thread_ids[4] = {0, 1, 2, 3};

    for (int i = 0; i < 4; i++) {
        pthread_create(&threads[i], NULL, thread_worker, &thread_ids[i]);
    }

    for (int i = 0; i < 4; i++) {
        pthread_join(threads[i], NULL);
    }
}
```

### Benchmark Tests

```bash
# Existing benchmark: bench_mid_large_mt
./bench_mid_large_mt_hakx  # new implementation

# Comparison
./bench_mid_large_mt_mi    # mimalloc
./bench_mid_large_mt_system # glibc

# Target:
# Before: 46-47 M ops/s
# After:  100-120 M ops/s (mimalloc: 122 M ops/s)
```

---

## 📈 Performance Estimates

### Theoretical performance

**Fast path (free_list hit)**:
```asm
; segment_alloc - fast path (pseudo-assembly: 5 core instructions, plus the counter update and ret)
mov    rax, [seg->free_list]      ; Load free_list head
test   rax, rax                    ; NULL check
je     .bump_path                  ; Branch if empty
mov    rdx, [rax]                  ; Load next pointer
mov    [seg->free_list], rdx       ; Update free_list
inc    dword [seg->used_count]     ; Increment used count
ret                                 ; Return rax
```

**Instructions per allocation**: ~5 (vs mimalloc ~4-6)

**IPC**: 3.5-4.0 (modern CPU, good branch prediction)

**Theoretical peak**:
- 4 GHz CPU / 5 instructions × IPC 3.5 = **2.8 billion ops/s** per core
- An allocation that misses all the way to DRAM (~100ns) is capped at **~10M ops/s**; cache-resident blocks run well above that

**Multi-threaded (2 threads)**:
- Linear scaling (no lock contention)
- **2 × 10M = 20M ops/s** in the all-miss case (the benchmark runs higher because freed blocks are reused while still cached)

### Expected measurements

**bench_mid_large_mt** (8KB, 16KB, 32KB mixed, 2 threads):
- Working set: ~100-200 objects per thread
- Free list hit rate: ~90-95% (high reuse)
- Expected: **100-120 M ops/s**

**Rationale**:
1. mimalloc: 122 M ops/s (proven)
2. Our design: nearly the same architecture
3. Overhead: registry lookup (on remote free) → -10-20%?
4. But: remote frees are rare (each thread has its own working set)

**Phase 1 target**: **at least 100 M ops/s** (82% of mimalloc or better)

---

## 🚧 Implementation Milestones

### Day 1-2: Basic implementation

**Files to create**:
- [ ] `core/hakmem_mid_mt.h` - API definitions, data structures
- [ ] `core/hakmem_mid_mt.c` - implementation

**Work items**:
- [ ] `MidThreadSegment` data structure
- [ ] TLS variable `g_mid_segments[3]`
- [ ] `mid_mt_alloc()` - free_list + bump path only
- [ ] `mid_mt_free()` - local free only (remote not implemented)
- [ ] `segment_refill()` - mmap chunk allocation
- [ ] Basic unit tests

**Debugging**:
- [ ] Verify behavior with the single-threaded test
- [ ] Check for memory leaks with Valgrind
### Day 3-4: Integration

**Routing integration** (`core/hakmem.c`):
```c
void* malloc(size_t size) {
    if (size <= 1024) {
        return hak_tiny_alloc(size);
    }

    // NEW: Mid Range MT (as written, sizes above 1KB and below 8KB also land in the 8KB class)
    if (size <= 32768) {
        return mid_mt_alloc(size);
    }

    // Existing: L2 Pool / L2.5 Pool
    return hak_pool_alloc(size);
}

void free(void* ptr) {
    if (!ptr) return;

    // Size lookup (need to determine which allocator)
    size_t size = get_allocation_size(ptr);  // How?

    if (size <= 1024) {
        hak_tiny_free(ptr);
    } else if (size <= 32768) {
        mid_mt_free(ptr, size);
    } else {
        hak_pool_free(ptr);
    }
}
```

**Problem**: `free(ptr)` does not receive the size
**Solutions**:
1. Recover the size through the registry (this design)
2. Or: embed it in a header (costs memory)
**Work items**:
- [ ] Registry implementation (binary search lookup)
- [ ] Routing logic integration
- [ ] Verify behavior with the multi-threaded test

### Day 5-7: Benchmarking and optimization

**Benchmarking**:
- [ ] Measure with `bench_mid_large_mt`
- [ ] Compare against mimalloc
- [ ] Profile with `perf record`

**Optimization**:
- [ ] Inspect hot-path assembly (is it inlined?)
- [ ] Branch prediction tuning (likely/unlikely)
- [ ] Alignment tuning (64-byte aligned)
- [ ] Cache line analysis

**Success criteria**:
- [ ] 100 M ops/s or more?
- [ ] At least 80% of mimalloc?

---

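## 🤔 Risks and Mitigations

### Risk 1: Registry lock overhead

**Problem**: a registry lookup on every `free()` → lock contention?

**Mitigations**:
1. **Fast path**: first check whether the pointer belongs to the current segment (the common case)
2. **Registry**: consulted only on remote free (rare)
3. **Read-write lock**: lookups can proceed concurrently
4. **Phase 2**: per-segment metadata removes the need for the registry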

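### Risk 2: Remote free not implemented

**Problem**: Phase 1 has no remote free → memory leaks?

**Mitigations**:
1. **Benchmark characteristics**: threads in `bench_mid_large_mt` are independent → few remote frees
2. **Implemented in Phase 2**: atomic remote free list
3. **Short-term risk**: accepted (Phase 1 is validated by benchmark only)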

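### Risk 3: Chunk size not tuned

**Problem**: is a 64KB chunk optimal?

**Mitigations**:
1. **Phase 1**: fixed at 64KB (mimalloc does the same)
2. **Phase 2**: adjustable through an environment variable
3. **Tuning**: adjust based on benchmark results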
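
The Phase 2 environment-variable option could look roughly like the sketch below. Everything here is an assumption made for illustration: the variable name `HAKMEM_MID_CHUNK_KB` is hypothetical, and so are the accepted range (64 to 4096 KB) and the power-of-two requirement. Invalid input falls back to the 64KB default used in Phase 1.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_DEFAULT_CHUNK_SIZE ((size_t)64 * 1024)

/* Parse a chunk size given in KB. Falls back to 64KB for NULL, non-numeric,
 * out-of-range, or non-power-of-two input. The [64, 4096] KB bounds are an
 * assumption chosen so that a chunk always holds at least two 32KB blocks. */
static size_t demo_parse_chunk_kb(const char* text) {
    if (text == NULL || *text == '\0') return DEMO_DEFAULT_CHUNK_SIZE;
    errno = 0;
    char* end = NULL;
    unsigned long kb = strtoul(text, &end, 10);
    if (errno != 0 || *end != '\0') return DEMO_DEFAULT_CHUNK_SIZE;
    if (kb < 64 || kb > 4096 || (kb & (kb - 1)) != 0) {
        return DEMO_DEFAULT_CHUNK_SIZE;
    }
    return (size_t)kb * 1024;
}

/* HAKMEM_MID_CHUNK_KB is a hypothetical name, not an existing hakmem setting. */
static size_t demo_chunk_size_from_env(void) {
    return demo_parse_chunk_kb(getenv("HAKMEM_MID_CHUNK_KB"));
}
```

Reading the variable once at `mid_mt_init()` time and caching the result would keep `getenv` off the refill path.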

---

## 📚 References

1. **mimalloc paper**: "mimalloc: Free List Sharding in Action"
   - https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action/

2. **tcmalloc design**: "TCMalloc: Thread-Caching Malloc"
   - https://google.github.io/tcmalloc/design.html

3. **jemalloc documentation**: "jemalloc Memory Allocator"
   - http://jemalloc.net/

4. **Existing hakmem implementation**:
   - `core/hakmem_pool.c` - existing L2 Pool (for reference)
   - `core/hakmem_tiny.c` - Tiny Pool TLS design

---

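## ✅ Completion Criteria

Definition of done for Phase 1:

1. ✅ **Functionality**: alloc/free for 8-32KB works (single- and multi-threaded)
2. ✅ **Performance**: **100 M ops/s or more** on `bench_mid_large_mt`
3. ✅ **Correctness**: Valgrind clean, unit tests pass
4. ✅ **Integration**: works transparently through `core/hakmem.c` routing
5. ✅ **Documentation**: implementation report written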

---

**Created**: 2025-11-01
**Implementation window**: 2025-11-02 to 2025-11-08
**Review**: evaluate against benchmark results once Phase 1 is complete
**Next step**: Phase 2 (ChatGPT Pro P1-P2) or Remote Free implementation