
Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History

Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-05 12:31:14 +09:00

7.0 KiB


HAKMEM Development History

Phase 5-B-Simple: Dual Free Lists + Magazine Unification (2025-11-02~03) ❌

Goals

Dual Free Lists (mimalloc): +10-15%
Magazine unification: +3-5%
Total expected: +15-23% (16.53 → 19.1-20.3 M ops/sec)

Implementation

1. TinyUnifiedMag definition (hakmem_tiny.c:590-603)

typedef struct {
    void* slots[256];   // Large capacity for better hit rate
    uint16_t top;       // 0..256
    uint16_t cap;       // =256 (adjustable per class)
} TinyUnifiedMag;

static int g_unified_mag_enable = 1;
static uint16_t g_unified_mag_cap[TINY_NUM_CLASSES] = {
    64, 64, 64, 64,      // Classes 0-3 (hot): 64 slots
    32, 32, 16, 16       // Classes 4-7 (cold): smaller capacity
};
static __thread TinyUnifiedMag g_tls_unified_mag[TINY_NUM_CLASSES];
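
The magazine above is an array-backed LIFO stack, so both operations are a bounds check plus an index update. A minimal sketch of push and pop, assuming a stand-in type `Mag` and helper names `mag_init`, `mag_pop`, and `mag_push` that are hypothetical and do not appear in the HAKMEM sources:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAG_SLOTS 256  /* matches slots[256] in the definition above */

/* Hypothetical stand-in for TinyUnifiedMag. */
typedef struct {
    void*    slots[MAG_SLOTS];
    uint16_t top;   /* number of cached blocks */
    uint16_t cap;   /* effective capacity, <= MAG_SLOTS */
} Mag;

/* Start with an empty magazine and the given effective capacity. */
static void mag_init(Mag* m, uint16_t cap) {
    assert(cap <= MAG_SLOTS);
    m->top = 0;
    m->cap = cap;
}

/* Pop one block; NULL means the magazine is empty and the caller
 * has to take a slower path. */
static void* mag_pop(Mag* m) {
    return m->top > 0 ? m->slots[--m->top] : NULL;
}

/* Push one block; returns 0 when the magazine is full and the caller
 * has to put the block somewhere else. */
static int mag_push(Mag* m, void* p) {
    if (m->top >= m->cap) return 0;
    m->slots[m->top++] = p;
    return 1;
}
```

With a per-class `cap` smaller than the array length, the unused tail of `slots` still occupies TLS space, which is one reason the capacity choice matters.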

2. Dual Free Lists added (hakmem_tiny.h:147-151)

// Phase 5-B: Dual Free Lists (mimalloc-inspired optimization)
void* local_free;               // Local free list (same-thread, no atomic)
atomic_uintptr_t thread_free;   // Remote free list (cross-thread, atomic)

3. hak_tiny_alloc() rewrite (hakmem_tiny_alloc.inc:159-180)

Reduced from 48 lines to 8 lines
Reduced from 3-4 branches to 1 branch

if (__builtin_expect(g_unified_mag_enable, 1)) {
    TinyUnifiedMag* mag = &g_tls_unified_mag[class_idx];
    if (__builtin_expect(mag->top > 0, 1)) {
        void* ptr = mag->slots[--mag->top];
        HAK_RET_ALLOC(class_idx, ptr);
    }
    // Fast path - try local_free from TLS active slabs (no atomic!)
    TinySlab* slab = g_tls_active_slab_a[class_idx];
    if (!slab) slab = g_tls_active_slab_b[class_idx];
    if (slab && slab->local_free) {
        void* ptr = slab->local_free;
        slab->local_free = *(void**)ptr;
        HAK_RET_ALLOC(class_idx, ptr);
    }
}

4. Free path split (hakmem_tiny_free.inc)

Same-thread: local_free (no atomic) - lines 216-230
Remote-thread: thread_free (atomic CAS) - lines 468-484
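
The two free paths differ only in how the block is linked onto its list. The following is a rough sketch using C11 atomics, not the code at the cited line ranges; the reduced `Slab` type and the names `slab_free_local` and `slab_free_remote` are assumptions made for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slab header, reduced to the two list heads. */
typedef struct {
    void*            local_free;   /* touched by the owner thread only */
    atomic_uintptr_t thread_free;  /* pushed to by other threads */
} Slab;

/* Same-thread free: plain intrusive push. The first word of the freed
 * block stores the link to the previous head. No atomics needed. */
static void slab_free_local(Slab* s, void* p) {
    *(void**)p = s->local_free;
    s->local_free = p;
}

/* Cross-thread free: lock-free push with a CAS retry loop. On failure,
 * the compare-exchange reloads `head`, so the link is rewritten and
 * the push is retried. */
static void slab_free_remote(Slab* s, void* p) {
    uintptr_t head = atomic_load_explicit(&s->thread_free,
                                          memory_order_relaxed);
    do {
        *(void**)p = (void*)head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->thread_free, &head, (uintptr_t)p,
                 memory_order_release, memory_order_relaxed));
}
```

In a single-threaded run only `slab_free_local` is ever exercised, so the atomic head is carried along without being used.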

5. Migration logic (hakmem_tiny_slow.inc:12-76)

local_free → Magazine (batch 32 items)
thread_free → local_free → Magazine
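
The two migration steps can be sketched as follows, assuming the remote list is detached with a single atomic exchange and then spliced in front of `local_free`. The types are reduced stand-ins, and `slab_drain_remote`, `slab_migrate_to_mag`, and `MIGRATE_BATCH` are hypothetical names; only the batch size of 32 comes from the text:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MIGRATE_BATCH 32   /* batch size named in the text */
#define MAG_SLOTS     256

/* Hypothetical reduced types for the magazine and the slab header. */
typedef struct { void* slots[MAG_SLOTS]; uint16_t top, cap; } Mag;
typedef struct { void* local_free; atomic_uintptr_t thread_free; } Slab;

/* thread_free -> local_free: detach the whole remote list with one
 * atomic exchange, then prepend it to the local list. */
static void slab_drain_remote(Slab* s) {
    void* head = (void*)atomic_exchange_explicit(
        &s->thread_free, (uintptr_t)0, memory_order_acquire);
    if (!head) return;
    void* tail = head;
    while (*(void**)tail) tail = *(void**)tail;  /* walk to the last node */
    *(void**)tail = s->local_free;
    s->local_free = head;
}

/* local_free -> Magazine: move up to MIGRATE_BATCH blocks, stopping
 * early if the list runs out or the magazine fills up.
 * Returns the number of blocks moved. */
static int slab_migrate_to_mag(Slab* s, Mag* m) {
    int moved = 0;
    while (moved < MIGRATE_BATCH && m->top < m->cap && s->local_free) {
        void* p = s->local_free;
        s->local_free = *(void**)p;
        m->slots[m->top++] = p;
        moved++;
    }
    return moved;
}
```

Both loops follow one pointer per block, so the cost of a migration grows with the number of blocks it touches.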

6. Magazine refill from SuperSlab (hakmem_tiny_slow.inc:78-107)

Batch allocate 8-64 blocks

Benchmark results 💥

Initial (Magazine cap=256)

bench_random_mixed: 16.51 M ops/sec (baseline: 16.53, -0.12%)

After Dual Free Lists (Magazine cap=256)

bench_random_mixed: 16.35 M ops/sec (-1.1% vs baseline)

After local_free fast path (Magazine cap=256)

bench_random_mixed: 16.42 M ops/sec (-0.67% vs baseline)

After capacity optimization (Magazine cap=64)

bench_random_mixed: 16.36 M ops/sec (-1.0% vs baseline)

Final evaluation (Magazine cap=64)

Single-threaded (bench_tiny_hot, 64B):

System allocator: 169.49 M ops/sec
HAKMEM Phase 5-B: 49.91 M ops/sec
Regression: -71% (3.4x slower!)

Multi-threaded (bench_mid_large_mt, 2 threads, 8-32KB):

System allocator: 11.51 M ops/sec
HAKMEM Phase 5-B: 7.44 M ops/sec
Regression: -35%
⚠️ NOTE: Tests 8-32KB allocations (outside Tiny range)

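Root cause analysis 🔍

1. Magazine capacity mistuned

Problem: 64 slots is too small for the ST (single-threaded) workload
Details: with batch=100, the allocator falls through to the slow path every other time
Cause: compares poorly against the system allocator's tcache (7+ entries per size)
Perf analysis: hak_tiny_alloc_slow accounts for 4.25% (too high)

2. Migration logic overhead

Problem: the free list → Magazine migration in the slow path is expensive
Details: batch migration (32 items) happens frequently
Cause: accumulated cost of pointer chasing + atomic operations
Perf analysis: pthread_mutex_lock accounts for 3.40% (even though the run is single-threaded!)

3. Dual Free Lists miscalculation

Problem: no benefit in ST; if anything, only overhead
Details: remote_free never happens in ST
Cause: only the memory overhead of the dual structures remains
Lesson: an MT-only (multi-threaded) optimization was applied to ST

4. Problems with the Unified Magazine

Problem: unification gained simplicity but lost performance
Details: the old combination of HotMag (128 slots) + Fast + Quick was faster
Cause: simplification ≠ speedup
Lesson: complexity reduction does not necessarily bring a performance improvement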

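Lessons learned 📚

✅ Good Ideas

Magazine unification itself is a good idea (reducing complexity is the right direction)
Dual Free Lists are proven in mimalloc (but in MT environments)
The idea behind the migration logic (consolidating free lists into the Magazine)

❌ Bad Execution

Capacity tuning was inappropriate (64 slots → 128+ needed)
Dual Free Lists are MT-only (they should not have been introduced for ST)
Migration logic is too heavy (needs a smaller batch size or lazy migration)
Benchmark mismatch (an MT optimization was evaluated with an ST benchmark)

🎯 Next Time

Design ST and MT separately (conditional compilation or runtime switch)
Use larger capacities (128-256 slots for hot classes)
Make migration lighter (lazy migration, smaller batch size)
Choose the benchmark first (so that it matches the direction of the optimization)

Candidate next steps

Phase 5-B-v2: Magazine unification only (no Dual Free Lists, capacity 128-256)
Phase 6 series: move on to L25/SuperSlab optimization
Rollback: return to the baseline and try a different approach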

Phase 5-A: Direct Page Cache (2025-11-01) ❌

Goals

O(1) slab lookup via a direct cache: +15-20%

Implementation

O(1) direct page cache via a global slabs_direct[129]
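
The source gives only the array length, not the indexing scheme. A minimal sketch of what such a lookup could look like, assuming one entry per 8-byte size step (indices 0 to 128, covering requests up to 1024 bytes); `direct_index`, `direct_lookup`, and the opaque `Slab` type are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define DIRECT_ENTRIES  129                         /* array length from the text */
#define DIRECT_MAX_SIZE ((DIRECT_ENTRIES - 1) * 8)  /* 1024 under the 8-byte assumption */

typedef struct Slab Slab;   /* opaque, hypothetical */

/* Global table shared by every thread, as described in the text. */
static Slab* slabs_direct[DIRECT_ENTRIES];

/* Assumed mapping from request size to slot: ceil(size / 8). */
static size_t direct_index(size_t size) {
    return (size + 7) / 8;
}

/* O(1) lookup: one bounds check and one array load. Returns NULL for
 * sizes outside the table or for slots with no cached slab. */
static Slab* direct_lookup(size_t size) {
    if (size > DIRECT_MAX_SIZE) return NULL;
    return slabs_direct[direct_index(size)];
}
```

Because the table is global rather than thread-local, every thread reads and updates the same cache lines.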

Benchmark results 💥

bench_random_mixed: 15.25-16.04 M ops/sec (baseline: 16.53)
Regression: -3~-7.7% (expected +15-20% → actual -3~-7.7%)

Root cause

Contention on the global cache
Cache pollution
False sharing

Lessons learned

Global structures should be avoided (TLS is the default)
A Magazine-based approach is more effective than a direct cache

Phase 4-A1: HotMag capacity tuning (2025-10-31) ❌

Goals

Increase HotMag capacity to improve the hit rate

Results

No performance improvement

Lessons learned

Capacity alone has little effect
The structural problems need to be solved

Phase 3: Remote drain optimization (2025-10-30) ❌

Goals

Optimize remote drain

Results

No performance improvement

Lessons learned

Remote drain was not the bottleneck

Phase 2+1: Magazine + Registry optimizations (2025-10-29) ✅

Goals

Magazine capacity tuning
Registry optimization

Results

Success: performance improvement achieved

Lessons learned

The Magazine-based approach is effective
O(1) lookup is sufficient for the Registry
