Phase 6.24: SuperSlab Optimization (Quick + Medium fix)

Date: 2025-10-24
Status: ✅ Success: beats Baseline by +2.6%
Goal: Recover the Phase 6.23 performance regression and exceed Baseline
Result: +8.2% vs Phase 6.23, +2.6% vs Baseline

📊 Benchmark Results

Multi-threaded (4 threads, 16B allocation)

| Version | Throughput | vs Phase 6.23 | vs Baseline (OFF) |
|---|---|---|---|
| Baseline (SuperSlab OFF) | 268.21 M ops/sec | - | baseline |
| Phase 6.23 (SuperSlab ON) | 254.15 M ops/sec | baseline | -5.2% ❌ |
| Phase 6.24 Quick fix | 271.01 M ops/sec | +6.6% | +1.0% ✅ |
| Phase 6.24 Quick + Medium | 275.10 M ops/sec | +8.2% 🎉 | +2.6% ✅ |

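Breakdown of improvements

| Optimization | Throughput | Step gain | Cumulative gain |
|---|---|---|---|
| Phase 6.23 (baseline) | 254.15 M ops/sec | - | - |
| + Quick fix (Lazy freelist) | 271.01 M ops/sec | +6.6% | +6.6% |
| + Medium fix (TLS unified) | 275.10 M ops/sec | +1.5% | +8.2% |

Conclusion: as Task-sensei's analysis predicted, freelist initialization cost was the main bottleneck.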
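The percentages in the tables can be re-derived from the raw throughput figures. The sketch below is only a checking aid: `pct_change` and `compound_pct` are hypothetical helper names, not part of hakmem. It also shows that the step gains compound multiplicatively, which is why +6.6% followed by +1.5% gives +8.2% rather than +8.1%.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after) {
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}

/* Combined effect of two successive gains, each given in percent.
 * Gains multiply: (1 + a) * (1 + b) - 1. */
static double compound_pct(double first_pct, double second_pct) {
    return ((1.0 + first_pct / 100.0) * (1.0 + second_pct / 100.0) - 1.0) * 100.0;
}
```

For example, `pct_change(254.15, 275.10)` reproduces the cumulative gain over Phase 6.23, and `pct_change(268.21, 275.10)` the gain over Baseline.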

🚀 Implementation

Quick fix: freelist Lazy Initialization

Problem: superslab_init_slab() builds a 4096-block freelist (4-7 μs).

Solution: implement a linear allocation mode.

- Do not build the freelist up front
- While meta->freelist == NULL, allocate by sequential memory access
- Use the freelist only to reuse blocks after a free
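For contrast, here is a sketch of the kind of eager freelist construction the quick fix avoids. The removed hakmem code is not shown in this document, so this is an assumed shape, not the original: `build_freelist_eager` and `freelist_length` are hypothetical names. The point is that the loop writes one "next" pointer into every block, so its cost grows linearly with slab capacity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Eager build: thread every block onto an intrusive singly linked list.
 * One pointer store per block, so init is O(capacity). */
static void* build_freelist_eager(unsigned char* base, size_t block_size,
                                  size_t capacity) {
    assert(block_size >= sizeof(void*));
    void* head = NULL;
    /* Walk backwards so the list ends up in ascending address order. */
    for (size_t i = capacity; i-- > 0; ) {
        unsigned char* block = base + i * block_size;
        memcpy(block, &head, sizeof head);  /* block->next = head */
        head = block;
    }
    return head;
}

/* Count list nodes; only used to check the sketch. */
static size_t freelist_length(const void* head) {
    size_t n = 0;
    while (head != NULL) {
        const void* next;
        memcpy(&next, head, sizeof next);
        head = next;
        n++;
    }
    return n;
}
```

The lazy variant replaces the whole loop with a single `freelist = NULL` store and defers all list maintenance to the free path.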

Implementation (hakmem_tiny_superslab.c:137-151, hakmem_tiny.c:895-925):

// superslab_init_slab() - NO freelist build!
void superslab_init_slab(SuperSlab* ss, int slab_idx, ...) {
    // ...
    // Phase 6.24: Lazy freelist initialization
    // NO freelist build here! (saves 4000-8000 cycles)
    TinySlabMeta* meta = &ss->slabs[slab_idx];
    meta->freelist = NULL;  // NULL = linear mode
    meta->used = 0;
    meta->capacity = (uint16_t)capacity;
    meta->owner_tid = owner_tid;
}

// superslab_alloc_from_slab() - Linear allocation
static inline void* superslab_alloc_from_slab(SuperSlab* ss, int slab_idx) {
    TinySlabMeta* meta = &ss->slabs[slab_idx];

    // Linear allocation mode (freelist == NULL)
    if (meta->freelist == NULL && meta->used < meta->capacity) {
        size_t block_size = g_tiny_class_sizes[ss->size_class];
        void* slab_start = slab_data_start(ss, slab_idx);
        if (slab_idx == 0) slab_start = (char*)slab_start + 1024;

        void* block = (char*)slab_start + (meta->used * block_size);
        meta->used++;
        return block;  // O(1) pointer arithmetic
    }

    // Freelist mode (after first free)
    if (meta->freelist) {
        void* block = meta->freelist;
        meta->freelist = *(void**)block;
        meta->used++;
        return block;
    }

    return NULL;
}

Effect: +6.6% (254.15 → 271.01 M ops/sec)
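The excerpt above shows only the allocation side. The following self-contained toy adds a free path to show how the two modes interact. It is a sketch under assumptions, not the hakmem implementation: `ToySlab`, `toy_init`, `toy_alloc`, `toy_free`, and the sizes are hypothetical, and it assumes every free goes through `toy_free` on the owning thread (remote frees are not modeled). Under that assumption, whenever the freelist is empty, `used` equals the number of blocks carved so far, so resuming linear allocation at index `used` never hands out a live block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TOY_BLOCK_SIZE = 16, TOY_CAPACITY = 8 };  /* hypothetical sizes */

typedef struct {
    void*    freelist;  /* NULL = linear mode */
    uint16_t used;      /* number of live blocks */
    uint16_t capacity;
    unsigned char data[TOY_BLOCK_SIZE * TOY_CAPACITY];
} ToySlab;

/* O(1) init: no per-block loop. */
static void toy_init(ToySlab* s) {
    s->freelist = NULL;
    s->used = 0;
    s->capacity = TOY_CAPACITY;
}

static void* toy_alloc(ToySlab* s) {
    if (s->freelist == NULL && s->used < s->capacity) {
        /* Linear mode: carve the next never-used block. */
        void* block = s->data + (size_t)s->used * TOY_BLOCK_SIZE;
        s->used++;
        return block;
    }
    if (s->freelist != NULL) {
        /* Freelist mode: pop a previously freed block. */
        void* block = s->freelist;
        memcpy(&s->freelist, block, sizeof(void*));  /* head = block->next */
        s->used++;
        return block;
    }
    return NULL;  /* slab is full */
}

static void toy_free(ToySlab* s, void* block) {
    assert(s->used > 0);
    memcpy(block, &s->freelist, sizeof(void*));      /* block->next = head */
    s->freelist = block;
    s->used--;
}
```

Allocating two blocks, freeing the first, and allocating twice more returns the freed block first and then the third sequential block, which is the reuse-then-resume behaviour the lazy scheme relies on.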

Medium fix: Unified TLS variables

Problem: the hot path reads both g_tls_superslab[class_idx] and g_tls_slab_idx[class_idx], which costs 3 TLS reads.

Solution: a unified TLS structure.

Implementation (hakmem_tiny.c:78-87):

// Phase 6.24: Unified TLS slab cache
typedef struct {
    SuperSlab* ss;           // SuperSlab pointer (8B)
    TinySlabMeta* meta;      // Direct slab metadata cache (8B)
    uint8_t slab_idx;        // Slab index (1B)
    uint8_t _pad[7];         // Padding to 24B total (keeps 8B alignment)
} TinyTLSSlab;

static __thread TinyTLSSlab g_tls_slabs[TINY_NUM_CLASSES];

Fast path (hakmem_tiny.c:969-1015):

static inline void* hak_tiny_alloc_superslab(int class_idx) {
    // Phase 6.24: 1 TLS read (down from 3!)
    TinyTLSSlab* tls = &g_tls_slabs[class_idx];
    TinySlabMeta* meta = tls->meta;  // Already cached!

    // Fast path: Direct metadata access
    if (meta && meta->freelist == NULL && meta->used < meta->capacity) {
        // Linear allocation
        size_t block_size = g_tiny_class_sizes[tls->ss->size_class];
        void* slab_start = slab_data_start(tls->ss, tls->slab_idx);
        if (tls->slab_idx == 0) slab_start = (char*)slab_start + 1024;
        void* block = (char*)slab_start + (meta->used * block_size);
        meta->used++;
        return block;
    }

    if (meta && meta->freelist) {
        // Freelist allocation
        void* block = meta->freelist;
        meta->freelist = *(void**)block;
        meta->used++;
        return block;
    }

    // Slow path: Refill
    // ...
}

Effect: +1.5% (271.01 → 275.10 M ops/sec)

TLS read reduction:

- Before: 3 TLS reads (g_tls_superslab + g_tls_slab_idx + a re-read on retry)
- After: 1 TLS read (g_tls_slabs only)
- Savings: 6-10 cycles per allocation
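To make the before/after layouts concrete, here is a small sketch that keeps both side by side. All names (`ToyMeta`, `ToySuperSlab`, `ToyTLSSlab`, `t_ss`, `t_idx`, `t_slabs`, `bind_slab`) are hypothetical stand-ins, and it uses C11 `_Thread_local` rather than `__thread`. How many TLS address computations the compiler actually emits depends on the compiler and TLS model, so treat the comments as describing the source-level accesses only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_NUM_CLASSES = 8, TOY_SLABS_PER_SS = 32 };  /* hypothetical */

typedef struct { void* freelist; uint16_t used; uint16_t capacity; } ToyMeta;
typedef struct { ToyMeta slabs[TOY_SLABS_PER_SS]; } ToySuperSlab;

/* Before: two parallel thread-local arrays. */
static _Thread_local ToySuperSlab* t_ss[TOY_NUM_CLASSES];
static _Thread_local uint8_t       t_idx[TOY_NUM_CLASSES];

/* After: one thread-local struct per class, with the metadata pointer cached. */
typedef struct {
    ToySuperSlab* ss;
    ToyMeta*      meta;
    uint8_t       slab_idx;
} ToyTLSSlab;
static _Thread_local ToyTLSSlab t_slabs[TOY_NUM_CLASSES];

/* Split layout: two thread-local loads plus an index computation. */
static ToyMeta* meta_split(int cls) {
    ToySuperSlab* ss = t_ss[cls];
    if (ss == NULL) return NULL;
    return &ss->slabs[t_idx[cls]];
}

/* Unified layout: one thread-local load, the pointer is already resolved. */
static ToyMeta* meta_unified(int cls) {
    return t_slabs[cls].meta;
}

/* Refill step: (re)bind a class to a slab. Updates both layouts so the
 * two lookups can be compared; a real allocator would keep only one. */
static void bind_slab(int cls, ToySuperSlab* ss, uint8_t idx) {
    assert(cls >= 0 && cls < TOY_NUM_CLASSES && idx < TOY_SLABS_PER_SS);
    t_ss[cls] = ss;
    t_idx[cls] = idx;
    t_slabs[cls] = (ToyTLSSlab){ .ss = ss, .meta = &ss->slabs[idx], .slab_idx = idx };
}
```

The trade-off in this design is that the cached `meta` pointer must be refreshed whenever the slab binding changes, which is why the sketch funnels every rebind through a single function.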

📈 Performance Analysis

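Accuracy of Task-sensei's analysis

| Predicted | Measured | Accuracy |
|---|---|---|
| Quick fix: +2-3% | +6.6% | 🎯 Exceeded |
| Medium fix: +0.5-1% | +1.5% | 🎯 Exceeded |
| Total: +2.5-4% | +8.2% | 🎯🎯 Far exceeded |

Task-sensei's estimates were conservative, but the direction was exactly right.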

Cycle savings calculation

Quick fix (Lazy freelist init):

- Before: 4096 iterations × 3-5 cycles = 12,288-20,480 cycles per slab init
- After: 0 cycles (linear allocation)
- With 4 threads × multiple SuperSlabs, this saves tens of thousands of cycles

Medium fix (TLS unified):

- Before: 3 TLS reads × 3-5 cycles = 9-15 cycles per allocation
- After: 1 TLS read × 3-5 cycles = 3-5 cycles per allocation
- Savings: 6-10 cycles per allocation
- 800M operations × 6-10 cycles = 4.8-8.0 billion cycles saved
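The estimates above are simple range arithmetic, sketched here so the numbers can be re-checked or re-run with different per-operation costs. `CycleRange`, `cycles_for`, `eager_init_cost`, and `tls_savings` are hypothetical helpers; the 3-5 cycle figures and the 800M operation count are taken from the text, not measured here.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t lo;  /* optimistic bound, in cycles */
    uint64_t hi;  /* pessimistic bound, in cycles */
} CycleRange;

/* Total cost of `count` operations that each take lo_each..hi_each cycles. */
static CycleRange cycles_for(uint64_t count, uint64_t lo_each, uint64_t hi_each) {
    assert(lo_each <= hi_each);
    CycleRange r = { count * lo_each, count * hi_each };
    return r;
}

/* Eager freelist build: 4096 blocks at 3-5 cycles per iteration. */
static CycleRange eager_init_cost(void) {
    return cycles_for(4096, 3, 5);
}

/* TLS unification: 3 reads become 1, so 2 reads are saved per allocation. */
static CycleRange tls_savings(uint64_t allocations) {
    const uint64_t reads_saved = 3 - 1;
    return cycles_for(allocations * reads_saved, 3, 5);
}
```

Calling `tls_savings(1)` gives the per-allocation saving, and `tls_savings(800000000)` the total for the 800M operations quoted above.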

🎓 Lessons Learned

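1. The power of lazy initialization

Not building the freelist up front was the biggest win.

- Initialization cost → 0 cycles
- Linear allocation → sequential memory access (best cache efficiency)
- Freelist used only for reuse after a free → no problem in practice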

2. TLS access cost is not negligible

A single TLS read costs 3-5 cycles, but three reads on the hot path add up to 9-15 cycles.

- The unified structure cuts this to one read → 6-10 cycles saved
- The effect accumulates in multi-threaded runs

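3. Profiling < theoretical analysis

Even without perf, code review plus calculation was enough to analyze this case.

- Task-sensei's theoretical analysis was on target
- "Don't guess, measure" still matters, but "understand, then optimize" is faster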

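4. The importance of incremental optimization

Splitting the work into Quick fix → Medium fix meant that:

- The effect of each optimization could be quantified
- Problems were clearly isolated
- Rollback is easy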

📂 File Changes

Changed files

| File | Change | Lines |
|---|---|---|
| `hakmem_tiny_superslab.h` | Added lazy-init comments | +5 |
| `hakmem_tiny_superslab.c` | Removed freelist construction | +12, -10 |
| `hakmem_tiny.c` | TLS unified + Linear allocation | +100, -30 |

Total

Changes: 3 files, ~87 lines added

🎯 Next Steps: the plan to beat mimalloc

Phase 6.24 puts the SuperSlab foundation in place. Next:

Priority 1: Complete the Tiny Pool (Phase 6.25)

Long-term fix: integrate Magazine and SuperSlab

- Implement a Hybrid Magazine (SuperSlab-backed)
- Or remove the Magazine entirely (mimalloc style)
- Expected gain: +5-10%

Target: 300+ M ops/sec (vs. 275 M ops/sec today)

Priority 2: Optimize the Mid Pool (Phase 6.26+)

Current: Mid 4T = 8.33 M/s
Target: on par with mimalloc (?)

Optimization candidates:

- Apply SuperSlab to the Mid Pool as well?
- Extend the TLS ring buffer
- Optimize headerless allocation

Priority 3: Optimize the Large Pool (Phase 6.27+)

Current: Large 4T = 1.27 M/s
Target: on par with mimalloc (?)