


Phase 4 Improvement Roadmap

Overview

Phase 4 introduced a 3.6% performance regression, so we will recover the loss in stages.

Current state:

Phase 3: 391 M ops/sec
Phase 4: 373-380 M ops/sec
Regression: -3.6%
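
As a quick sanity check on these figures, the relative change can be computed directly. This is a minimal sketch; the helper name pct_change is illustrative and is not part of hakmem.

```c
#include <assert.h>

/* Relative change of `now` versus `base`, in percent.
 * A negative result means a regression. */
static double pct_change(double base, double now) {
    assert(base > 0.0);
    return (now - base) / base * 100.0;
}
```

With the numbers above, pct_change(391, 373) is about -4.6 and pct_change(391, 380) is about -2.8, so the quoted -3.6% sits inside the measured range.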

Targets:

Phase 4.1 (Quick Win): 385-390 M ops/sec (+1-2%)
Phase 4.2 (Gating): 390-395 M ops/sec (back to the Phase 3 level)
Phase 4.3 (Batching): 395-400 M ops/sec (above Phase 3)

Phase 4.1: Quick Win (Option A+B)

Goals

Implementation time: 5-10 minutes
Expected gain: +1-2% (385-390 M ops/sec)
Risk: low

Implementation

Option A: Reduce redundant memory accesses

Before:

// owner->class_idx is read three times (twice in the check, once for stats)
int is_tls_active = (owner == g_tls_active_slab_a[owner->class_idx] ||
                      owner == g_tls_active_slab_b[owner->class_idx]);

if (is_tls_active && !mini_mag_is_full(&owner->mini_mag)) {
    mini_mag_push(&owner->mini_mag, it.ptr);
    stats_record_free(owner->class_idx);
    continue;
}

After:

// Read once and reuse
uint8_t cidx = owner->class_idx;
TinySlab* tls_a = g_tls_active_slab_a[cidx];
TinySlab* tls_b = g_tls_active_slab_b[cidx];

if ((owner == tls_a || owner == tls_b) &&
    !mini_mag_is_full(&owner->mini_mag)) {
    mini_mag_push(&owner->mini_mag, it.ptr);
    stats_record_free(cidx);
    continue;
}

Improvements:

owner->class_idx: 3 reads → 1 read
g_tls_active_slab_a/b[cidx]: can be hoisted out of the loop (when every spilled item shares the same class index)
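
To make the effect of Option A concrete, here is a minimal, self-contained sketch of the same pattern. The types Slab and MiniMag, the constants, and try_push_local are simplified stand-ins invented for illustration; they are not the real hakmem definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define NUM_CLASSES 8   /* assumed class count for the sketch */
#define MINI_CAP    4   /* assumed mini-magazine capacity */

typedef struct { void* items[MINI_CAP]; int count; } MiniMag;
typedef struct { uint8_t class_idx; MiniMag mini_mag; } Slab;

static Slab* g_active_a[NUM_CLASSES];
static Slab* g_active_b[NUM_CLASSES];

/* Returns 1 if ptr was pushed into owner's mini-magazine,
 * 0 if the caller has to spill it somewhere else. */
static int try_push_local(Slab* owner, void* ptr) {
    uint8_t cidx = owner->class_idx;   /* single load, reused below */
    Slab* a = g_active_a[cidx];
    Slab* b = g_active_b[cidx];
    MiniMag* mm = &owner->mini_mag;

    if ((owner == a || owner == b) && mm->count < MINI_CAP) {
        mm->items[mm->count++] = ptr;
        return 1;
    }
    return 0;
}
```

Caching the index in a local matters mostly when the compiler cannot prove that intervening calls leave owner->class_idx unchanged; whether that applies here depends on how the real helpers are defined.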

Option B: Branch prediction hint

// When spilling from the TLS Magazine, items are likely to go back to a TLS-active slab
if (__builtin_expect((owner == tls_a || owner == tls_b) &&
                     !mini_mag_is_full(&owner->mini_mag), 1)) {
    mini_mag_push(&owner->mini_mag, it.ptr);
    stats_record_free(cidx);
    continue;
}

Improvements:

Fewer branch mispredictions
Tells the compiler which path is likely, so it can lay out the hot path as the fall-through case
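
A common way to keep such hints readable is to wrap __builtin_expect in macros. The sketch below assumes GCC or Clang and falls back to a plain expression elsewhere; the macro names HAK_LIKELY and HAK_UNLIKELY and the function count_non_null are hypothetical and only serve the example.

```c
#include <assert.h>

#if defined(__GNUC__) || defined(__clang__)
#  define HAK_LIKELY(x)   __builtin_expect(!!(x), 1)
#  define HAK_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define HAK_LIKELY(x)   (!!(x))
#  define HAK_UNLIKELY(x) (!!(x))
#endif

/* Counts non-NULL entries; NULL entries are assumed to be rare. */
static int count_non_null(void* const* items, int n) {
    int live = 0;
    for (int i = 0; i < n; i++) {
        if (HAK_UNLIKELY(items[i] == 0)) continue;
        live++;
    }
    return live;
}
```

The !! normalizes any truthy value to 1, which matters because __builtin_expect compares the expression against the expected constant. The hint only changes code layout, so it should be confirmed with a benchmark rather than assumed to help.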

Where to change

File: hakmem_tiny.c
Function: hak_tiny_free_with_slab()
Lines: 890-922

Verification

# Build with Phase 4.1 applied
make clean && make bench_tiny

# Run the benchmark (3 times)
./bench_tiny 2>&1 | tail -5
./bench_tiny 2>&1 | tail -5
./bench_tiny 2>&1 | tail -5

# Expected result: 385-390 M ops/sec

Success criteria

Minimum: 380 M ops/sec (no worse than the current Phase 4)
Target: 385-390 M ops/sec (+1-2%)
Ideal: 390+ M ops/sec (Phase 3 level)

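Phase 4.2: High-water Gate (Option E-1)

Goals

Implementation time: 10-20 minutes
Expected gain: +2-5% (390-395 M ops/sec)
Risk: low to medium

Implementation

High-water gate logic

Concept:

When the TLS Magazine is at high water (≥75% full), skip Phase 4 entirely.
Reason: the next allocations will be served from the TLS Magazine, so pushing items into the mini-mag is wasted work.

Implementation: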

// Add at the top of hak_tiny_free_with_slab()
void hak_tiny_free_with_slab(...) {
    // ... existing preprocessing ...

    // Phase 4.2: high-water gate
    int tls_occ = mag->count;  // current occupancy of the TLS Magazine
    int tls_cap = TLS_MAG_CAPACITY;  // 2048

    if (tls_occ >= (tls_cap * 3 / 4)) {
        // High water: Phase 4 disabled
        // Write every item straight to the bitmap (existing logic)
        for (int i = 0; i < mag->count; i++) {
            PoolItem it = mag->items[i];
            TinySlab* owner = hak_tiny_owner_slab(it.ptr);
            if (!owner) continue;

            // Spill to the bitmap (reuses existing logic)
            // ... bitmap operations ...
        }

        // Statistics
        g_tiny_pool.phase4_gate_skip[class_idx]++;
        return;
    }

    // Low water: run Phase 4 (existing logic)
    for (int i = 0; i < mag->count; i++) {
        PoolItem it = mag->items[i];
        TinySlab* owner = hak_tiny_owner_slab(it.ptr);
        if (!owner) continue;

        // Logic optimized in Phase 4.1
        uint8_t cidx = owner->class_idx;
        TinySlab* tls_a = g_tls_active_slab_a[cidx];
        TinySlab* tls_b = g_tls_active_slab_b[cidx];

        if (__builtin_expect((owner == tls_a || owner == tls_b) &&
                             !mini_mag_is_full(&owner->mini_mag), 1)) {
            mini_mag_push(&owner->mini_mag, it.ptr);
            stats_record_free(cidx);
            g_tiny_pool.phase4_mini_push[cidx]++;
            continue;
        }

        // Spill to the bitmap
        // ... bitmap operations ...
        g_tiny_pool.phase4_bitmap_spill[cidx]++;
    }

    g_tiny_pool.phase4_spill_count[class_idx]++;
}

Constant definitions

// hakmem_tiny.h or hakmem_tiny.c
#define TLS_MAG_CAPACITY 2048
#define TLS_MAG_HIGH_WATER (TLS_MAG_CAPACITY * 3 / 4)  // 1536
#define TLS_MAG_LOW_WATER (TLS_MAG_CAPACITY / 4)        // 512
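
The threshold arithmetic can be checked in isolation. The sketch below repeats the two constants so that it compiles on its own, and adds a compile-time check plus a small predicate; the function name tls_mag_at_high_water is an assumption made for illustration, and _Static_assert requires C11.

```c
#include <assert.h>

#define TLS_MAG_CAPACITY   2048
#define TLS_MAG_HIGH_WATER (TLS_MAG_CAPACITY * 3 / 4)  /* 1536 */

/* Fails the build if the capacity changes and the comment goes stale. */
_Static_assert(TLS_MAG_HIGH_WATER == 1536,
               "high-water mark should be 75% of capacity");

/* 1 when the TLS magazine is at or above the high-water mark, else 0. */
static inline int tls_mag_at_high_water(int count) {
    return count >= TLS_MAG_HIGH_WATER;
}
```

Multiplying before dividing keeps the integer arithmetic exact here; with a capacity that is not a multiple of 4 the mark would round down.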

Additional statistics

// hakmem_tiny.h: add to the TinyPool struct
typedef struct {
    // Existing
    uint64_t alloc_count[TINY_NUM_CLASSES];
    uint64_t free_count[TINY_NUM_CLASSES];
    uint64_t slab_count[TINY_NUM_CLASSES];

    // Phase 4 measurement (new)
    uint64_t phase4_spill_count[TINY_NUM_CLASSES];    // spill passes that ran Phase 4
    uint64_t phase4_mini_push[TINY_NUM_CLASSES];      // successful mini-mag pushes
    uint64_t phase4_bitmap_spill[TINY_NUM_CLASSES];   // bitmap spills
    uint64_t phase4_gate_skip[TINY_NUM_CLASSES];      // high-water skips
} TinyPool;

// hakmem_tiny.c: initialization
void hak_tiny_init(void) {
    // ... existing initialization ...

    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        g_tiny_pool.phase4_spill_count[i] = 0;
        g_tiny_pool.phase4_mini_push[i] = 0;
        g_tiny_pool.phase4_bitmap_spill[i] = 0;
        g_tiny_pool.phase4_gate_skip[i] = 0;
    }
}

Verification

# Build with Phase 4.2 applied
make clean && make bench_tiny

# Run the benchmark (3 times)
./bench_tiny 2>&1 | tail -5
./bench_tiny 2>&1 | tail -5
./bench_tiny 2>&1 | tail -5

# Check statistics (after implementation)
# Add the phase4 statistics to hak_tiny_print_stats()
./test_mf2 2>&1 | grep -A 10 "Phase 4"

# Expected result: 390-395 M ops/sec (Phase 3 level)

Success criteria

Minimum: 385 M ops/sec (holds the Phase 4.1 level)
Target: 390-395 M ops/sec (back to the Phase 3 level)
Ideal: 395+ M ops/sec (above Phase 3)

Revert decision

If throughput is still below 385 M ops/sec after Phase 4.2:

Revert all of Phase 4
Go back to Phase 3 (391 M ops/sec)
Consider the pull-style approach

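Phase 4.3: Per-slab Batching (Option E-2)

Goals

Implementation time: 30-40 minutes
Expected gain: +2-5% (395-400 M ops/sec)
Risk: medium to high (complex implementation)

Implementation

Per-slab grouping

Concept:

Group the 256 spilled items by slab.
is_tls_active checks: 256 → one per slab (1-8), a sharp reduction.

Data structure:

#define SLAB_BUCKETS 32  // number of buckets for linear probing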

typedef struct {
    TinySlab* owner;
    void* ptrs[256];
    int count;
} SlabBucket;

Implementation:

void hak_tiny_free_with_slab_batched(...) {
    // Phase 4.2: high-water gate
    int tls_occ = mag->count;
    if (tls_occ >= TLS_MAG_HIGH_WATER) {
        fast_spill_all_to_bitmap(mag);
        return;
    }

    // Phase 4.3: per-slab batching
    SlabBucket buckets[SLAB_BUCKETS] = {0};

    // 1st pass: group items by owning slab
    for (int i = 0; i < mag->count; i++) {
        PoolItem it = mag->items[i];
        TinySlab* owner = hak_tiny_owner_slab(it.ptr);
        if (!owner) continue;

        // Linear probing hash, bounded so a full table cannot loop forever
        size_t hash = ((uintptr_t)owner >> 6) & (SLAB_BUCKETS - 1);
        int probes = 0;
        while (probes < SLAB_BUCKETS &&
               buckets[hash].owner && buckets[hash].owner != owner) {
            hash = (hash + 1) & (SLAB_BUCKETS - 1);
            probes++;
        }

        if (probes == SLAB_BUCKETS || buckets[hash].count == 256) {
            // Table or bucket full: spill this item straight to the bitmap
            // ... bitmap operations ...
            continue;
        }

        if (!buckets[hash].owner) {
            buckets[hash].owner = owner;
        }
        buckets[hash].ptrs[buckets[hash].count++] = it.ptr;
    }

    // 2nd pass: process each slab (the check runs once per slab)
    for (int b = 0; b < SLAB_BUCKETS; b++) {
        if (!buckets[b].owner) continue;

        TinySlab* slab = buckets[b].owner;
        uint8_t cidx = slab->class_idx;
        TinySlab* tls_a = g_tls_active_slab_a[cidx];
        TinySlab* tls_b = g_tls_active_slab_b[cidx];

        int is_tls_active = (slab == tls_a || slab == tls_b);
        int room = mini_capacity(&slab->mini_mag) - mini_count(&slab->mini_mag);
        int take = is_tls_active ? min(room, buckets[b].count) : 0;

        // Push to the mini-mag in one batch
        for (int i = 0; i < take; i++) {
            mini_mag_push(&slab->mini_mag, buckets[b].ptrs[i]);
            stats_record_free(cidx);
            g_tiny_pool.phase4_mini_push[cidx]++;
        }

        // Spill the remainder to the bitmap in one batch
        for (int i = take; i < buckets[b].count; i++) {
            // ... bitmap operations ...
            g_tiny_pool.phase4_bitmap_spill[cidx]++;
        }
    }

    g_tiny_pool.phase4_spill_count[class_idx]++;
}
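
The grouping pass can be tried out in isolation. The sketch below is a scaled-down, self-contained version of the first pass: it bounds the probe sequence so that a full table is reported to the caller instead of looping, and it checks the per-bucket capacity. Bucket, BUCKETS, BUCKET_CAP, bucket_for and group_by_owner are hypothetical names and sizes chosen for the example, not hakmem definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define BUCKETS    8    /* power of two; assumed size for the sketch */
#define BUCKET_CAP 16   /* assumed per-bucket capacity */

typedef struct {
    const void* owner;
    void* ptrs[BUCKET_CAP];
    int count;
} Bucket;

/* Finds or claims the bucket for `owner`.
 * Returns NULL when every bucket already belongs to another owner. */
static Bucket* bucket_for(Bucket* table, const void* owner) {
    size_t h = ((uintptr_t)owner >> 6) & (BUCKETS - 1);
    for (int probe = 0; probe < BUCKETS; probe++) {
        Bucket* b = &table[(h + probe) & (BUCKETS - 1)];
        if (b->owner == NULL) b->owner = owner;
        if (b->owner == owner) return b;
    }
    return NULL;
}

/* Groups n (owner, ptr) pairs by owner.
 * Returns how many pairs could not be stored (table or bucket full). */
static int group_by_owner(Bucket* table, const void* const* owners,
                          void* const* ptrs, int n) {
    int overflow = 0;
    for (int i = 0; i < n; i++) {
        Bucket* b = bucket_for(table, owners[i]);
        if (b == NULL || b->count == BUCKET_CAP) { overflow++; continue; }
        b->ptrs[b->count++] = ptrs[i];
    }
    return overflow;
}
```

Items counted as overflow would take the direct bitmap path in the real allocator. Note also that a table of 32 buckets with 256 pointers each occupies roughly 64 KiB of stack when zero-initialized on every call, so smaller buckets or an index-based layout may be worth measuring.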

Helper functions

// Capacity and current item count of a mini-magazine
static inline int mini_capacity(PageMiniMag* mag) {
    return mag->capacity;
}

static inline int mini_count(PageMiniMag* mag) {
    return mag->count;
}

// min macro (note: evaluates each argument twice)
#define min(a, b) ((a) < (b) ? (a) : (b))

Verification

# Build with Phase 4.3 applied
make clean && make bench_tiny

# Run the benchmark (5 times)
for i in {1..5}; do
    ./bench_tiny 2>&1 | tail -5
done

# Check statistics
./test_mf2 2>&1 | grep -A 10 "Phase 4"

# Expected result: 395-400 M ops/sec (above Phase 3)

Success criteria

Minimum: 390 M ops/sec (holds the Phase 4.2 level)
Target: 395-400 M ops/sec (above Phase 3)
Ideal: 400+ M ops/sec (about 5% over the current Phase 4 result)

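Phase 4.4: Pull-style Inversion (future)

Goals

Implementation time: 1-2 hours
Expected gain: fixes the root cause
Risk: high (architectural change)

Concept

Current (push style):

The free side (spill) pushes items back into the mini-mag
Every spilled item pays the overhead
The benefit shows up on the allocation side (and is not guaranteed)

Improved (pull style):

The allocation side pulls from the mini-mag only when needed
Zero overhead on the free side
Allocation latency increases slightly (trade-off)

Where to implement

File: hakmem_tiny.c
Function: hak_tiny_alloc()
When: immediately before the bitmap scan

Implementation sketch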

void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0) return NULL;

    // 1. TLS Magazine (fast path)
    if (!mini_mag_is_empty(&g_tls_mag[class_idx])) {
        return mini_mag_pop(&g_tls_mag[class_idx]);
    }

    // 2. TLS Active Slabs (medium path)
    TinySlab* tls = g_tls_active_slab_a[class_idx];
    if (!(tls && tls->free_count > 0)) {
        tls = g_tls_active_slab_b[class_idx];
    }

    if (tls && tls->free_count > 0) {
        // Phase 4.4: Pull from page mini-mag
        if (!mini_mag_is_empty(&tls->mini_mag)) {
            void* p = mini_mag_pop(&tls->mini_mag);
            if (p) {
                stats_record_alloc(class_idx);
                return p;
            }
        }

        // Phase 4.4: Refill TLS Magazine from page mini-mag
        if (!mini_mag_is_empty(&tls->mini_mag)) {
            int pulled = mini_pull_batch(&tls->mini_mag, &g_tls_mag[class_idx], 16);
            if (pulled > 0) {
                void* p = mini_mag_pop(&g_tls_mag[class_idx]);
                if (p) {
                    stats_record_alloc(class_idx);
                    return p;
                }
            }
        }

        // Fallback: bitmap scan (existing logic)
        // ...
    }

    // 3. Global pool (existing logic)
    // ...
}
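
The sketch above calls mini_pull_batch(), which is not defined anywhere in this document. One possible shape for it, under the assumption that both magazines are simple LIFO arrays of the same type, is shown below. Mag, MAG_CAP and pull_batch are hypothetical stand-ins; the real TLS magazine and page mini-magazine may well use different types.

```c
#include <assert.h>

#define MAG_CAP 32   /* assumed capacity for the sketch */

typedef struct { void* items[MAG_CAP]; int count; } Mag;

/* Moves up to max_items pointers from src to dst (LIFO order).
 * Never overfills dst. Returns the number of pointers moved. */
static int pull_batch(Mag* src, Mag* dst, int max_items) {
    int room = MAG_CAP - dst->count;
    int n = src->count;
    if (n > max_items) n = max_items;
    if (n > room) n = room;
    if (n < 0) n = 0;
    for (int i = 0; i < n; i++) {
        dst->items[dst->count++] = src->items[--src->count];
    }
    return n;
}
```

Pulling a small batch (16 in the sketch above) amortizes the slow-path cost over several later allocations, at the price of one slightly longer allocation.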

Changes on the free side

void hak_tiny_free_with_slab(...) {
    // Phase 4.4: remove the push-style logic
    // Write every item straight to the bitmap (simplification)

    for (int i = 0; i < mag->count; i++) {
        PoolItem it = mag->items[i];
        TinySlab* owner = hak_tiny_owner_slab(it.ptr);
        if (!owner) continue;

        // Spill to the bitmap (existing logic)
        // ... bitmap operations ...
    }
}

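Trade-off

Pros:

Free latency is stable (no added overhead)
The allocation side is in control (it pulls only when needed)

Cons:

Allocation latency increases slightly (cost of pulling from the mini-mag)
The implementation is more complex

Measurement and diagnostics

Statistics output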

void hak_tiny_print_phase4_stats(void) {
    printf("========================================\n");
    printf("Phase 4 Statistics\n");
    printf("========================================\n");

    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        uint64_t spill = g_tiny_pool.phase4_spill_count[i];
        uint64_t mini = g_tiny_pool.phase4_mini_push[i];
        uint64_t bitmap = g_tiny_pool.phase4_bitmap_spill[i];
        uint64_t gate = g_tiny_pool.phase4_gate_skip[i];

        // Gated calls return before phase4_spill_count is incremented,
        // so the total number of calls is spill + gate.
        uint64_t calls = spill + gate;
        uint64_t items = mini + bitmap;

        if (calls == 0) continue;

        // Guard the item ratios against a zero denominator
        double mini_ratio = items ? (double)mini / items * 100 : 0.0;
        double bitmap_ratio = items ? 100 - mini_ratio : 0.0;
        double gate_ratio = (double)gate / calls * 100;

        // uint64_t is cast for printf so the format is portable
        printf("Class %d (%zu B):\n", i, g_tiny_class_sizes[i]);
        printf("  Spill count:     %llu\n", (unsigned long long)spill);
        printf("  Mini-mag push:   %llu (%.1f%%)\n", (unsigned long long)mini, mini_ratio);
        printf("  Bitmap spill:    %llu (%.1f%%)\n", (unsigned long long)bitmap, bitmap_ratio);
        printf("  Gate skip:       %llu (%.1f%%)\n", (unsigned long long)gate, gate_ratio);
        printf("\n");
    }
}

Benchmark comparison

# Phase 3 (baseline)
git checkout <phase3-commit>
make clean && make bench_tiny
./bench_tiny > results_phase3.txt

# Phase 4.1
git checkout <phase4.1-commit>
make clean && make bench_tiny
./bench_tiny > results_phase4_1.txt

# Phase 4.2
git checkout <phase4.2-commit>
make clean && make bench_tiny
./bench_tiny > results_phase4_2.txt

# Phase 4.3
git checkout <phase4.3-commit>
make clean && make bench_tiny
./bench_tiny > results_phase4_3.txt

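# Compare each Phase 4 result against the Phase 3 baseline
# (diff takes exactly two files, so loop over the Phase 4 results)
for f in results_phase4_*.txt; do
    diff results_phase3.txt "$f"
done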

Summary of success criteria

Phase	Implementation time	Expected performance	Risk	Required
4.1 (A+B)	5-10 min	385-390 M ops/sec	Low	✅ Yes
4.2 (Gate)	10-20 min	390-395 M ops/sec	Low to medium	✅ Yes
4.3 (Batch)	30-40 min	395-400 M ops/sec	Medium to high	⚠️ Conditional
4.4 (Pull)	1-2 hours	Fundamental fix	High	❌ Future

Revert condition:

Still < 385 M ops/sec after Phase 4.2 → revert all of Phase 4

Continue condition:

>= 390 M ops/sec at Phase 4.2 → proceed to Phase 4.3
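
The two conditions can be written down as a small decision helper, which is handy when scripting the benchmark loop. This is only a sketch: the names are invented for illustration, and the PHASE4_HOLD outcome for results between 385 and 390 M ops/sec is an assumption, since the roadmap does not say what happens in that band.

```c
#include <assert.h>

typedef enum {
    PHASE4_REVERT,   /* below 385: revert all of Phase 4 */
    PHASE4_HOLD,     /* 385 up to 390: assumed "stay at Phase 4.2" */
    PHASE4_PROCEED   /* 390 or more: go on to Phase 4.3 */
} Phase4Decision;

/* Maps a measured Phase 4.2 throughput (M ops/sec) to the next step. */
static Phase4Decision phase42_decide(double mops) {
    if (mops < 385.0) return PHASE4_REVERT;
    if (mops >= 390.0) return PHASE4_PROCEED;
    return PHASE4_HOLD;
}
```

Because single runs are noisy, it is probably safer to feed this the median of the three benchmark runs rather than one measurement.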

Timeline

Day 1 (today):

✅ Documentation (done)
🔄 Implement Phase 4.1 (5-10 min)
🔄 Verify Phase 4.1 (5 min)
🔄 Implement Phase 4.2 (10-20 min)
🔄 Verify Phase 4.2 (5 min)
✅ Commit
📊 Summarize results

Day 2 (conditional):

Only if Phase 4.2 succeeds
Implement and verify Phase 4.3

Future:

Phase 4.4 (pull style) to be considered separately

References

PHASE4_REGRESSION_ANALYSIS.md (detailed analysis)
ChatGPT Pro advice (2025-10-26)
