Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History

Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-05 12:31:14 +09:00


Tiny Allocator Performance Analysis Report

📉 Current Problems

Benchmark results (2025-11-02)

HAKMEM Tiny: 52.59 M ops/sec (average)
System (glibc): 135.94 M ops/sec (average)
Difference: -61.3% (38.7% of System)

HAKMEM is slower in every pattern:

Sequential LIFO: -69.2%
Sequential FIFO: -69.4%
Random Free: -58.9%
Interleaved: -67.2%
Long/Short-lived: -68.6%

🔍 Root Causes

1. The fast path is too complex

System tcache (glibc):

// Only 3-4 instructions! (simplified sketch, not the actual glibc source)
void* tcache_get(size_t sz) {
    tcache_entry *e = &tcache->entries[tc_idx(sz)];
    if (e->count > 0) {
        void *ret = e->list;
        e->list = *(void **)ret;  // Singly linked list pop: next pointer is stored in the block
        e->count--;
        return ret;
    }
    return NULL;  // Fall back to the arena
}

HAKMEM Tiny (core/hakmem_tiny_alloc.inc:76-214):

Initialization check (lines 77-83)
Wrapper check (lines 84-101)
Size → class conversion (lines 103-109)
[ifdef] BENCH_FASTPATH (lines 111-157)
- SLL (singly linked list) check
- Magazine check
- Refill handling
HotMag check (lines 159-172)
- HotMag pop
- Conditional refill
Hot alloc (lines 174-199)
- Switch-case dispatch to per-class functions
Fast tier (lines 201-207)
Slow path (lines 209-213)

→ Dozens of branches plus multiple function calls

Cost of branch misprediction:

Recent CPUs: 15-20 cycles/miss
HAKMEM: 5-10 branches → potentially 50-200 cycles
System tcache: 1-2 branches → 15-40 cycles

2. Too many magazine layers

Current structure (4-5 layers):

HotMag (128 slots, class 0-2)
  ↓ miss
Hot Alloc (class-specific functions)
  ↓ miss
Fast Tier  
  ↓ miss
Magazine (TinyTLSMag)
  ↓ miss
TLS List
  ↓ miss
Slab (bitmap-based)
  ↓ miss
SuperSlab

System tcache (1 layer):

tcache (7 entries per size)
  ↓ miss
Arena (ptmalloc bins)

Problems:

Every layer adds branch and function-call overhead
Cache locality degrades
The complexity gets in the way of optimization

3. Refill is mixed into the fast path

Lines 160-172: HotMag refill on the fast path

if (g_hotmag_enable && class_idx <= 2 && g_fast_head[class_idx] == NULL) {
    hotmag_init_if_needed(class_idx);
    TinyHotMag* hm = &g_tls_hot_mag[class_idx];
    void* hotmag_ptr = hotmag_pop(class_idx);
    if (hotmag_ptr == NULL) {
        if (hotmag_try_refill(class_idx, hm) > 0) {  // ← Refill on fast path!
            hotmag_ptr = hotmag_pop(class_idx);
        }
    }
    ...
}

Problems:

Refill belongs on the slow path
The fast path should be a pure pop
System tcache keeps refill completely separate
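The separation argued for above can be sketched as follows. This is a minimal illustration, not HAKMEM code: every `split_` name, the class count, and the batch size are hypothetical, and `malloc` stands in for carving blocks out of a slab. It relies on the GCC/Clang extensions `__thread`, `__attribute__((noinline, cold))`, and `__builtin_expect`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SPLIT_NUM_CLASSES  8    /* hypothetical number of size classes */
#define SPLIT_REFILL_BATCH 16   /* hypothetical refill batch size */

/* Per-thread free-list heads, one per size class. */
static __thread void *split_head[SPLIT_NUM_CLASSES];

/* Slow path: kept out of line so the refill loop never sits in the hot code.
 * malloc() is only a stand-in for taking blocks from a slab; the demo never
 * returns these blocks to the system. */
__attribute__((noinline, cold))
static void *split_refill_and_pop(int cls, size_t block_size) {
    for (int i = 0; i < SPLIT_REFILL_BATCH; i++) {
        void *blk = malloc(block_size);
        if (!blk) break;
        *(void **)blk = split_head[cls];   /* link the block into the list */
        split_head[cls] = blk;
    }
    void *p = split_head[cls];
    if (p) split_head[cls] = *(void **)p;
    return p;
}

/* Fast path: a pure pop. On a hit there is exactly one branch. */
static inline void *split_alloc(int cls, size_t block_size) {
    assert(cls >= 0 && cls < SPLIT_NUM_CLASSES);
    assert(block_size >= sizeof(void *));  /* block must hold the next pointer */
    void *p = split_head[cls];
    if (__builtin_expect(p != NULL, 1)) {
        split_head[cls] = *(void **)p;
        return p;
    }
    return split_refill_and_pop(cls, block_size);
}
```

The first call for a class misses and takes the cold path once; subsequent calls pop from the refilled list without touching any refill logic.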

4. Bitmap-based Slab Management

HAKMEM:

int block_idx = hak_tiny_find_free_block(tls);  // Bitmap scan
if (block_idx >= 0) {
    hak_tiny_set_used(tls, block_idx);
    ...
}

System tcache/arena:

void *ret = bin->list;       // Free list pop (O(1))
bin->list = *(void **)ret;   // Next pointer is stored in the freed block itself

Problems:

Bitmap scan: O(n) worst case
Free list: O(1) always
Bitmaps hold up well against fragmentation but lose on speed
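To make the complexity difference concrete, here is a small side-by-side sketch. It is not the actual `hak_tiny_find_free_block`; the `bm_` names and the 256-block slab size are assumptions for illustration, and `__builtin_ctzll` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BM_WORDS 4   /* hypothetical slab: 4 x 64 = 256 blocks */

/* Bitmap: a set bit means "block in use". The scan walks words until one has
 * a clear bit, so the cost grows with the number of fully used words. */
static int bm_find_free(const uint64_t used[BM_WORDS]) {
    for (int w = 0; w < BM_WORDS; w++) {
        uint64_t free_bits = ~used[w];
        if (free_bits != 0)
            return w * 64 + __builtin_ctzll(free_bits);  /* lowest clear bit */
    }
    return -1;  /* slab is full */
}

/* Free list: the head is the answer; no scan regardless of occupancy. */
static void *bm_freelist_pop(void **head) {
    void *p = *head;
    if (p) *head = *(void **)p;  /* next pointer lives inside the free block */
    return p;
}
```

With two full words in front, the bitmap version inspects three words before it finds a block, while the free-list pop does the same amount of work no matter how full the slab is.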

🎯 Improvement Options

Option A: Ultra-Simple Fast Path (tcache-style) ⭐⭐⭐⭐⭐

Goal: match the speed of System tcache

Design:

// Per-thread (TLS) cache: one free-list head per size class
static __thread void* g_tls_tcache[TINY_NUM_CLASSES];

void* hak_tiny_alloc(size_t size) {
    int class_idx = size_to_class_inline(size);  // Inlined
    if (class_idx < 0) return NULL;

    // Ultra-fast path: just a few instructions on a hit
    void** head_ptr = &g_tls_tcache[class_idx];
    void* ptr = *head_ptr;
    if (ptr) {
        *head_ptr = *(void**)ptr;  // Pop from free list (next pointer is stored in the block)
        return ptr;
    }

    // Slow path: Refill from SuperSlab
    return hak_tiny_alloc_slow_refill(size, class_idx);
}
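The design above calls `size_to_class_inline` without showing it. One possible shape is sketched here, assuming eight power-of-two classes from 8 to 1024 bytes; the real HAKMEM class layout is not given in this report, so the `sc_` names, the class count, and the treatment of size 0 are all assumptions. `__builtin_clzll` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

#define SC_MIN_SHIFT   3   /* assumed smallest class: 8 bytes */
#define SC_NUM_CLASSES 8   /* assumed classes: 8, 16, ..., 1024 bytes */

/* Map a request size to a power-of-two size class, or -1 if out of range. */
static inline int sc_size_to_class(size_t size) {
    const size_t max_size = (size_t)1 << (SC_MIN_SHIFT + SC_NUM_CLASSES - 1);
    if (size == 0 || size > max_size)
        return -1;
    if (size <= ((size_t)1 << SC_MIN_SHIFT))
        return 0;
    /* ceil(log2(size)): number of bits needed to represent size - 1 */
    int bits = (int)(sizeof(unsigned long long) * 8)
             - __builtin_clzll((unsigned long long)(size - 1));
    int cls = bits - SC_MIN_SHIFT;
    assert(cls > 0 && cls < SC_NUM_CLASSES);
    return cls;
}
```

A lookup table indexed by `(size + 7) >> 3` would be an alternative that trades a small amount of memory for fewer instructions; which one is faster would need to be measured.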

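Pros:

Fast path: only 3-4 instructions
Branches: only two (class check + list check)
Speed comparable to System tcache can be expected

Cons:

The elaborate optimizations in the magazine layers go to waste
Requires a major refactoring

Implementation time: 1-2 weeks

Probability of success: ⭐⭐⭐⭐ (80%)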

Option B: Gradual reduction of magazine layers ⭐⭐⭐

Goal: reduce complexity while preserving the existing investment

Stage 1: Remove HotMag + Hot Alloc (two layers fewer)

void* hak_tiny_alloc(size_t size) {
    int class_idx = size_to_class_inline(size);
    if (class_idx < 0) return NULL;
    
    // Fast path: TLS Magazine only
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        return mag->items[--mag->top].ptr;
    }
    
    // Slow path
    return hak_tiny_alloc_slow(size, class_idx);
}

Stage 2: Replace the Magazine with a free list

// Replace Magazine with Free List
static __thread void* g_tls_free_list[TINY_NUM_CLASSES];
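Stage 2 shows only the declaration, so here is a minimal sketch of the operations that would sit on top of such a list. The `fl_` names and the class count are hypothetical stand-ins (the list mirrors `g_tls_free_list`), and the sketch assumes every block is at least `sizeof(void *)` bytes so the link can be stored inside the freed block.

```c
#include <assert.h>
#include <stddef.h>

#define FL_NUM_CLASSES 8   /* hypothetical stand-in for TINY_NUM_CLASSES */

/* Per-thread intrusive free lists: no separate items[] array as in a magazine. */
static __thread void *fl_free_list[FL_NUM_CLASSES];

/* Free side: push the block; its first word becomes the next pointer. */
static inline void fl_push(int cls, void *block) {
    assert(block != NULL && cls >= 0 && cls < FL_NUM_CLASSES);
    *(void **)block = fl_free_list[cls];
    fl_free_list[cls] = block;
}

/* Alloc side: pop the most recently freed block, or NULL if the list is empty. */
static inline void *fl_pop(int cls) {
    assert(cls >= 0 && cls < FL_NUM_CLASSES);
    void *block = fl_free_list[cls];
    if (block) fl_free_list[cls] = *(void **)block;
    return block;
}
```

The list is LIFO, so the block freed last is handed out first, which tends to keep recently touched memory in cache. Unlike a magazine, the list has no built-in capacity; a count or cap would have to be added separately if one is needed.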

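Pros:

Can be improved incrementally
Low risk

Cons:

May end up identical to Option A in the end
A half-finished state persists for a while

Implementation time: 2-3 weeks

Probability of success: ⭐⭐⭐ (60%)

Option C: Hybrid - tcache-style Tiny + current design kept for Mid-Large ⭐⭐⭐⭐

Goal: different strategies for Tiny and Mid-Large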

Tiny (≤1KB):

Ultra-simple design in the style of System tcache
Free-list based
Target: 80-90% of System

Mid-Large (8KB-32MB):

Keep and strengthen the current SuperSlab/L25
Target: 150-200% of System
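The split by size range could be dispatched roughly as below. Only the two ranges (Tiny up to 1KB, Mid-Large 8KB-32MB) come from this report; the `rt_` names are hypothetical, and routing every other size to a fallback path is an assumption, since the report does not say how sizes between 1KB and 8KB or above 32MB are handled.

```c
#include <assert.h>
#include <stddef.h>

#define RT_TINY_MAX     ((size_t)1024)               /* Tiny: <= 1KB */
#define RT_MIDLARGE_MIN ((size_t)8 * 1024)           /* Mid-Large: from 8KB ... */
#define RT_MIDLARGE_MAX ((size_t)32 * 1024 * 1024)   /* ... up to 32MB */

static_assert(RT_TINY_MAX < RT_MIDLARGE_MIN, "size ranges must not overlap");

typedef enum { RT_TINY, RT_MID_LARGE, RT_FALLBACK } rt_route;

/* Pick the allocator strategy for a request size. Tiny is tested first
 * because small allocations are the latency-critical case in this report. */
static inline rt_route rt_select(size_t size) {
    if (size <= RT_TINY_MAX)
        return RT_TINY;        /* tcache-style free list */
    if (size >= RT_MIDLARGE_MIN && size <= RT_MIDLARGE_MAX)
        return RT_MID_LARGE;   /* existing SuperSlab/L25 path */
    return RT_FALLBACK;        /* assumption: anything else goes elsewhere */
}
```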

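Pros:

A design suited to each size range
Keeps the Mid-Large strength (+171%!)
Removes the Tiny weakness

Cons:

The codebase becomes more complex
Uniformity is lost

Implementation time: 2-3 weeks

Probability of success: ⭐⭐⭐⭐ (75%)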

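📝 Recommended Approach

Short term (1-2 weeks): Option A (Ultra-Simple Fast Path)

Simplest and most effective
Speed comparable to System tcache can be expected
Easy to roll back if it fails

Medium term (1 month): Option C (Hybrid)

Fixes the Tiny weakness while keeping the Mid-Large strength
Makes overall performance on par with mimalloc a realistic target

Long term (3-6 months): Integration with the learning layer

Simplifying Tiny makes the learning layer easier to introduce
Coordination with ACE (Adaptive Compression Engine)

Next Steps

Prototype implementation of Option A (1 week)
- Create it as a new file, core/hakmem_tiny_simple.c
- Compare benchmarks
Evaluate the results
- Target: at least 80% of System (108 M ops/sec)
- Merge into mainline if the target is met
Mid-Large optimization (in parallel)
- Merge HAKX into mainline
- Optimize L25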
