

Phase 6.12: Tiny Pool Implementation Completion Report

Completed: 2025-10-21
Status: ✅ basic implementation complete, ❌ performance target not met, 🎯 proceeding to P0 optimization

📊 Implementation Summary

✅ Completed work

8 size classes implemented
- 8B, 16B, 32B, 64B, 128B, 256B, 512B, 1KB
- Bitmap-based free block search (__builtin_ctzll)
- Free list management (free_slabs / full_slabs)
64KB slab allocator
- Uses posix_memalign (memory leak fixed)
- Slab metadata: TinySlab struct + bitmap
Lite P1 pre-allocation
- Pre-allocates only the four Tier 1 classes (8-64B)
- 256KB resident (4 classes × 64KB, not 512KB)
Benchmark scenarios added
- string-builder (8-64B, short-lived)
- token-stream (16-128B, FIFO)
- small-objects (32-256B, long-lived)
Warmup separation implemented
- The initialization phase is separated from the measurement phase
- Improves measurement accuracy
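
The bitmap-based free block search listed above can be sketched roughly as follows. This is a minimal illustration, not code from hakmem_tiny.c: the helper names (tiny_bitmap_find_free, tiny_bitmap_claim, tiny_bitmap_release) and the convention that a set bit means "free" are assumptions made for the example. __builtin_ctzll is the GCC/Clang builtin that counts trailing zero bits; its result is undefined for an argument of 0, so each word is checked first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not taken from hakmem_tiny.c.
 * Assumed convention: bit (i % 64) of words[i / 64] is 1 when block i is free. */
static int tiny_bitmap_find_free(const uint64_t *words, size_t nwords) {
    for (size_t w = 0; w < nwords; w++) {
        if (words[w] != 0) {
            /* __builtin_ctzll is undefined for 0, hence the check above. */
            return (int)(w * 64 + (size_t)__builtin_ctzll(words[w]));
        }
    }
    return -1; /* no free block in this slab */
}

/* Mark block idx as allocated by clearing its bit. */
static void tiny_bitmap_claim(uint64_t *words, int idx) {
    assert(idx >= 0);
    words[idx / 64] &= ~(1ULL << (idx % 64));
}

/* Mark block idx as free again by setting its bit. */
static void tiny_bitmap_release(uint64_t *words, int idx) {
    assert(idx >= 0);
    words[idx / 64] |= 1ULL << (idx % 64);
}
```

With this layout, finding a free block costs one word scan plus one instruction per non-empty word, which is why the bitmap itself is unlikely to be the dominant cost once slab lookup is fixed.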

📊 Benchmark Results

Measured performance (vs mimalloc)

Scenario	hakmem	mimalloc	system	Relative to mimalloc
string-builder (8-64B)	7,871 ns	18 ns	18 ns	437x slower ❌
token-stream (16-128B)	99 ns	9 ns	12 ns	11x slower ⚠️
small-objects (32-256B)	6 ns	3 ns	6 ns	2x slower ✅

Measurement environment

CPU: x86_64 Linux
Compiler: gcc -O2
Iterations: 10,000 (string-builder, token-stream, small-objects)
Warmup: four size classes pre-allocated in each scenario
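
To illustrate what separating warmup from measurement means in practice, here is a minimal timing sketch. It is not the code in bench_allocators.c: the function names (now_ns, op_alloc_free, bench_ns_per_op) are hypothetical, and plain malloc/free stands in for the allocator under test. Only the second loop is timed, so one-time initialization cost does not leak into the ns/op figure.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* for clock_gettime */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Hypothetical benchmark operation: allocate and free one small block.
 * The volatile pointer keeps the compiler from eliding the pair. */
static void op_alloc_free(void) {
    void *volatile p = malloc(32);
    free(p);
}

/* Run warmup_iters untimed iterations, then time iters iterations
 * and return the average cost per operation in nanoseconds. */
static double bench_ns_per_op(void (*op)(void), int warmup_iters, int iters) {
    assert(iters > 0);
    for (int i = 0; i < warmup_iters; i++) op(); /* initialization phase, untimed */
    uint64_t t0 = now_ns();
    for (int i = 0; i < iters; i++) op();        /* measurement phase */
    uint64_t t1 = now_ns();
    return (double)(t1 - t0) / (double)iters;
}
```

A call such as bench_ns_per_op(op_alloc_free, 1000, 10000) would mirror the 10,000-iteration setup above; the warmup count of 1000 is an arbitrary choice for the example.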

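🔍 Root Cause Analysis

Findings from the Task-sensei investigation

Culprit identified: the double call to find_slab_by_ptr() costs 6,000 ns/op (about 76% of the 7,871 ns total)

The problematic code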

// hakmem.c:510 - hak_free_at()
if (hak_tiny_is_managed(ptr)) {  // ← first find_slab_by_ptr() call
    hak_tiny_free(ptr);          // ← second find_slab_by_ptr() call
    return;
}

// hakmem_tiny.c:253 - hak_tiny_is_managed()
int hak_tiny_is_managed(void* ptr) {
    return find_slab_by_ptr(ptr) != NULL;  // ← O(N) linear search!
}

// hakmem_tiny.c:71-96 - find_slab_by_ptr()
// (the computation of slab_base from ptr is not shown in this excerpt)
static TinySlab* find_slab_by_ptr(void* ptr) {
    // Search in free_slabs (O(N))
    for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
        for (TinySlab* slab = ...; slab; slab = slab->next) {
            if ((uintptr_t)slab->base == slab_base) return slab;
        }
    }
    // Search in full_slabs (O(N))
    for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
        for (TinySlab* slab = ...; slab; slab = slab->next) {
            if ((uintptr_t)slab->base == slab_base) return slab;
        }
    }
    return NULL;
}

Cost breakdown

string-builder: 40,000 free calls
× 2 find_slab_by_ptr() calls per free
× 3,000 ns/call on average (O(N) search)
= 240,000,000 ns total
→ 6,000 ns/op (about 76%)

Other overhead:

memset: 100 ns/op (1.3%)
Function calls: 80 ns/op (1.0%)
Bitmap search: estimated 1,691 ns/op (21.5%)
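
The arithmetic behind this breakdown can be checked with a few lines of C. The constants are copied from the figures above; the function names are made up for the example. The point is that the four components sum exactly to the measured 7,871 ns/op, with slab lookup accounting for roughly three quarters of it.

```c
#include <assert.h>

/* Figures copied from the cost breakdown; names are illustrative only. */
enum {
    FREE_CALLS       = 40000, /* free calls in string-builder */
    LOOKUPS_PER_FREE = 2,     /* find_slab_by_ptr() is called twice */
    NS_PER_LOOKUP    = 3000,  /* average O(N) search cost */
    MEMSET_NS        = 100,
    CALL_OVERHEAD_NS = 80,
    BITMAP_SEARCH_NS = 1691   /* estimated */
};

/* Total time spent in slab lookup across all free calls. */
static long lookup_total_ns(void) {
    return (long)FREE_CALLS * LOOKUPS_PER_FREE * NS_PER_LOOKUP;
}

/* Lookup cost per free operation. */
static long lookup_ns_per_op(void) {
    return lookup_total_ns() / FREE_CALLS;
}

/* Sum of all four components per operation. */
static long total_ns_per_op(void) {
    return lookup_ns_per_op() + MEMSET_NS + CALL_OVERHEAD_NS + BITMAP_SEARCH_NS;
}

/* Share of the total attributable to slab lookup, in percent. */
static double lookup_share_percent(void) {
    return 100.0 * (double)lookup_ns_per_op() / (double)total_ns_per_op();
}
```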

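💡 ChatGPT Pro Diagnosis

Verdict: continue, do not give up

Reasons:

✅ P0 + TLS changes the order of magnitude: 7,871 ns → 50-80 ns (98-157x faster)
✅ SACS/ELO differentiation: Hot/Warm/Cold can be applied in the Tiny range as well
✅ Consistency: L1/L2/L2.5/L3 all operate under the same policy

Timebox: if P0 does not bring the cost down to ≤200 ns/op, shift focus to L2.5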

🎯 P0 Optimization Strategy

Option B: Embedded Metadata (a tag in the first 16B of each slab)

Implementation:

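#include <stddef.h>
#include <stdint.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

typedef struct __attribute__((packed)) {
    uintptr_t xored_owner;   // owner ^ cookie
    uint32_t  magic;         // tag magic; the note's "0xH4KM3M01" is not a valid hex literal
    uint16_t  class_idx;
    uint16_t  epoch;         // guards against ABA
} SlabTag;                   // 16 bytes on a 64-bit target

// Assumes every slab is TINY_SLAB_SIZE-aligned and that TinySlab, MAGIC,
// and cookie are defined elsewhere.
static inline TinySlab* owner_slab(void* p) {
    // Cast before negating so the mask is pointer-width whatever type TINY_SLAB_SIZE has.
    uintptr_t base = (uintptr_t)p & ~((uintptr_t)TINY_SLAB_SIZE - 1);
    SlabTag* t = (SlabTag*)base;
    if (unlikely(t->magic != MAGIC)) return NULL;
    return (TinySlab*)((t->xored_owner) ^ cookie);  // O(1)!
}

Expected effect: 6,000 ns → 5 ns (1,200x faster)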
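
The following sketch shows how the tag could be written when a slab is created and then recovered from an interior pointer. It is an illustration under stated assumptions, not the project's code: slab_init and owner_slab_demo are hypothetical names, TinySlab is reduced to two fields, SLAB_MAGIC and g_cookie are placeholder values (a real cookie would presumably be randomized per process), and a 64-bit target is assumed. Note also that the lookup dereferences the masked address, so it is only safe for pointers known to lie inside a 64KB-aligned slab; a pointer from another allocator could point at unmapped or unrelated memory.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for posix_memalign */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TINY_SLAB_SIZE (64u * 1024u)
#define SLAB_MAGIC     0x48414B4Du /* placeholder value, not from the source */

/* Hypothetical, reduced TinySlab; the real struct has more fields. */
typedef struct TinySlab { void *base; int class_idx; } TinySlab;

typedef struct __attribute__((packed)) {
    uintptr_t xored_owner; /* owner ^ cookie */
    uint32_t  magic;
    uint16_t  class_idx;
    uint16_t  epoch;
} SlabTag;

/* Placeholder cookie; assumed to be random per process in a real build. */
static uintptr_t g_cookie = (uintptr_t)0x5bd1e995u;

/* Allocate a 64KB-aligned slab and write the tag at its start. */
static int slab_init(TinySlab *slab, int class_idx) {
    void *mem = NULL;
    if (posix_memalign(&mem, TINY_SLAB_SIZE, TINY_SLAB_SIZE) != 0) return -1;
    slab->base = mem;
    slab->class_idx = class_idx;
    SlabTag *t = (SlabTag *)mem;
    t->xored_owner = (uintptr_t)slab ^ g_cookie;
    t->magic = SLAB_MAGIC;
    t->class_idx = (uint16_t)class_idx;
    t->epoch = 0;
    return 0;
}

/* Recover the owning slab from any pointer inside it, in O(1). */
static TinySlab *owner_slab_demo(void *p) {
    uintptr_t base = (uintptr_t)p & ~((uintptr_t)TINY_SLAB_SIZE - 1);
    const SlabTag *t = (const SlabTag *)base;
    if (t->magic != SLAB_MAGIC) return NULL;
    return (TinySlab *)(t->xored_owner ^ g_cookie);
}
```

Because the tag occupies the first 16 bytes, the first block handed out from a slab would have to start after it; how the real allocator accounts for that is not covered by this report.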

Option C: remove the double call

// hak_free_at(), revised
TinySlab* slab = owner_slab(ptr);  // ← looked up only once
if (slab) {
    hak_tiny_free_with_slab(ptr, slab);
    return;
}

Expected effect: 2x faster
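
The difference between the two free paths can be made concrete with a toy model that counts lookups. Everything here is hypothetical scaffolding for illustration: a single static slab of 64 blocks of 16 bytes, a lookup_slab function standing in for find_slab_by_ptr() or owner_slab(), and a tiny_free_with_slab body that simply sets a free bit. Only the call shape (look up once, pass the slab along) reflects the change described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy slab: 64 blocks of 16 bytes, one free bit per block. */
typedef struct TinySlab {
    char *base;
    size_t block_size;
    unsigned long long free_bits;
} TinySlab;

static char g_storage[64 * 16];
static TinySlab g_slab = { g_storage, 16, 0 };
static int g_lookups; /* counts how often the lookup runs */

/* Stand-in for find_slab_by_ptr() / owner_slab(). */
static TinySlab *lookup_slab(const void *ptr) {
    g_lookups++;
    uintptr_t p = (uintptr_t)ptr, lo = (uintptr_t)g_storage;
    return (p >= lo && p < lo + sizeof g_storage) ? &g_slab : NULL;
}

/* Free a block when the owning slab is already known. */
static void tiny_free_with_slab(void *ptr, TinySlab *slab) {
    size_t idx = (size_t)((char *)ptr - slab->base) / slab->block_size;
    slab->free_bits |= 1ULL << idx;
}

/* Before: the managed check and the free each perform a lookup. */
static int free_double_lookup(void *ptr) {
    if (lookup_slab(ptr) != NULL) {
        TinySlab *slab = lookup_slab(ptr);
        tiny_free_with_slab(ptr, slab);
        return 1;
    }
    return 0;
}

/* After: one lookup, and the result is passed along. */
static int free_single_lookup(void *ptr) {
    TinySlab *slab = lookup_slab(ptr);
    if (slab) {
        tiny_free_with_slab(ptr, slab);
        return 1;
    }
    return 0;
}
```

Halving the number of lookups is where the expected 2x comes from, under the assumption that lookup cost dominates the free path.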

Remove all memset calls

Remove every memset except the ones needed for benchmark measurement.

Expected effect: 100 ns saved

📊 Expected Improvements

Improvement	Current	After	Effect
P0: Option B + C + memset removal	7,871 ns	1,871 ns	4.2x faster
P1: TLS freelist	1,871 ns	50-80 ns	23-37x faster
Final	7,871 ns	50-80 ns	98-157x faster

Compared with mimalloc: 18 ns vs 50-80 ns → 2.8-4.4x slower (acceptable)
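
The report does not describe the planned TLS freelist in detail, so the following is only a generic sketch of the technique: each thread keeps a per-class singly linked LIFO list of freed blocks, so the common alloc/free path touches no shared state. The names (FreeNode, tls_free_head, tls_push, tls_pop) are hypothetical; TINY_NUM_CLASSES is set to 8 to match the eight size classes, and a 64-bit target is assumed so that the smallest 8B class can hold the link pointer.

```c
#include <assert.h>
#include <stddef.h>

#define TINY_NUM_CLASSES 8

/* A freed block is reused to store the link to the next free block. */
typedef struct FreeNode { struct FreeNode *next; } FreeNode;

/* One list head per size class, private to each thread. */
static _Thread_local FreeNode *tls_free_head[TINY_NUM_CLASSES];

/* Free path: push the block onto this thread's list for its class. */
static void tls_push(int class_idx, void *block) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    FreeNode *n = (FreeNode *)block;
    n->next = tls_free_head[class_idx];
    tls_free_head[class_idx] = n;
}

/* Alloc path: pop the most recently freed block, or NULL if the list is
 * empty (a real allocator would then fall back to the slab bitmap). */
static void *tls_pop(int class_idx) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    FreeNode *n = tls_free_head[class_idx];
    if (n) tls_free_head[class_idx] = n->next;
    return n;
}
```

Both operations are a handful of loads and stores with no locking, which is consistent with the 50-80 ns target, although the sketch ignores blocks freed by a thread other than the one that allocated them.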

🎯 Final Targets

Scenario	Current	Target	mimalloc	Target vs mimalloc
string-builder	7,871 ns	50-80 ns	18 ns	2.8-4.4x slower ✅
token-stream	99 ns	≤20 ns	9 ns	2.2x slower ✅
small-objects	6 ns	≤10 ns	3 ns	3.3x slower ✅

📁 Related Files

Implementation

apps/experiments/hakmem-poc/hakmem_tiny.h - Tiny Pool API
apps/experiments/hakmem-poc/hakmem_tiny.c - Slab allocator implementation
apps/experiments/hakmem-poc/hakmem.c - Integration code

Benchmarks

apps/experiments/hakmem-poc/bench_allocators.c - Implementation of the 3 scenarios
apps/experiments/hakmem-poc/Makefile - Build configuration

Investigation reports

apps/experiments/hakmem-poc/WARMUP_ZERO_EFFECT_INVESTIGATION.md - Task-sensei investigation
/tmp/chatgpt_tiny_pool_optimization.txt - Question sheet for ChatGPT Pro
/mnt/c/git/nyash-project/chatgpt_tiny_pool_optimization.txt - Same file (Windows copy)

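✅ Phase 6.12 Completion Assessment

Basic implementation: ✅ complete
Performance target: ❌ not met (437x slower than mimalloc)
Root cause: ✅ identified (double call to find_slab_by_ptr)
Optimization strategy: ✅ settled (approved by ChatGPT Pro)

Next step: start implementing Phase 6.12.1 (P0 optimization)

Authors: Claude + Task-sensei + ChatGPT Pro
Created: 2025-10-21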
