
Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History

Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-05 12:31:14 +09:00


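HAKMEM Memory Allocator - Claude Work Log

This file records important information from development sessions with Claude.

Project Overview

HAKMEM is a high-performance memory allocator with the following goals:

Average performance roughly on par with mimalloc
A smart learning layer that also targets memory efficiency
Especially strong performance for Mid-Large (8-32KB) allocations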

📊 Comprehensive Benchmark Results (2025-11-02)

Measurements Completed

Comprehensive Benchmark: 21 patterns (LIFO, FIFO, Random, Interleaved, Long/Short-lived, Mixed) × 4 sizes (16B, 32B, 64B, 128B)
Fragmentation Stress: 50 rounds, 2000 live slots, mixed sizes

Results Summary

Tiny (≤128B):    HAKMEM 52.59 M/s  vs  System 135.94 M/s  → -61.3% 💀
Fragment Stress: HAKMEM 4.68 M/s   vs  System 18.43 M/s   → -75.0% 💥
Mid-Large (8-32KB): HAKMEM 167.75 M/s vs System 61.81 M/s → +171% 🏆

Detailed Reports

benchmarks/results/BENCHMARK_SUMMARY_2025_11_02.md - overall summary
benchmarks/results/comprehensive_comparison.md - detailed comparison table

How to Run the Benchmarks

# Build
make bench_comprehensive_hakmem bench_comprehensive_system
make bench_fragment_stress_hakmem bench_fragment_stress_system

# Run
./bench_comprehensive_hakmem          # comprehensive test (~5 min)
./bench_fragment_stress_hakmem 50 2000  # fragmentation stress

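Key Findings

Tiny is structurally slower than System (-60% to -70%)
- Slower in every pattern (LIFO/FIFO/Random/Interleaved)
- Causes: Magazine layer overhead, refill cost, weak fragmentation resistance
Mid-Large is overwhelmingly strong (+108% to +171%)
- SuperSlab efficiency, the L25 middle layer, and avoiding System's mmap overhead
- HAKX-specific optimizations can make it faster still
Falling back to System malloc is not an option
- It would remove HAKMEM's reason to exist
- Tiny needs a fundamental redesign

Next Actions

Root-cause analysis for Tiny (why does it lose to System tcache?)
Investigate making the Magazine layer more efficient
Investigate merging Mid-Large (HAKX) into mainline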

Development History

Phase 6-1.7: Box Theory Refactoring (2025-11-05) ✅

Goal: Improve the Larson benchmark with an Ultra-Simple Fast Path (3-4 instructions)
Result: +64% performance improvement 🎉

Implementation

Box 1 (Foundation): core/tiny_atomic.h - atomic operation abstraction
Box 5 (Alloc Fast Path): core/tiny_alloc_fast.inc.h - direct TLS freelist pop (3-4 instructions)
Box 6 (Free Fast Path): core/tiny_free_fast.inc.h - TOCTOU-safe ownership check + TLS push

How to Build

Basic (Box-refactor only):

make box-refactor    # enables the Box 5/6 Fast Path
./larson_hakmem 2 8 128 1024 1 12345 4

Larson optimization (Box-refactor + environment variables):

make box-refactor

# Debug mode (+64%)
HAKMEM_TINY_REFILL_OPT_DEBUG=1 HAKMEM_TINY_TRACE_RING=0 HAKMEM_SAFE_FREE=0 \
HAKMEM_TINY_TLS_SLL=1 HAKMEM_TINY_TLS_LIST=0 HAKMEM_TINY_HOTMAG=0 \
HAKMEM_WRAP_TINY=1 HAKMEM_TINY_SS_ADOPT=1 \
./larson_hakmem 2 8 128 1024 1 12345 4

# Production mode (+150%)
HAKMEM_TINY_REFILL_COUNT_HOT=64 HAKMEM_TINY_FAST_CAP=16 \
HAKMEM_TINY_TRACE_RING=0 HAKMEM_SAFE_FREE=0 \
HAKMEM_TINY_TLS_SLL=1 HAKMEM_TINY_TLS_LIST=0 HAKMEM_TINY_HOTMAG=0 \
HAKMEM_WRAP_TINY=1 HAKMEM_TINY_SS_ADOPT=1 \
./larson_hakmem 2 8 128 1024 1 12345 4

Standard build (original code):

make larson_hakmem   # without Box-refactor

Performance Results

Configuration	Throughput	Improvement
Original code (debug mode)	1,676,8xx ops/s	baseline
Box-refactor (debug mode)	2,748,759 ops/s	+64% 🚀
Box-refactor (optimized mode)	4,192,128 ops/s	+150% 🏆

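ChatGPT's Assessment

"Good job"

Consolidating the boundary in one place improves safety (ownership → drain → bind gathered into SlabHandle)

A shorter hot path (bypassing the middle layers) lowers latency and reduces branches

Resolved the A213/A202 errors (a three-day blocker)

A/B testing is possible through environment knobs (g_sll_multiplier, g_sll_cap_override[])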

Integration with Batch Refill

Box-refactor is fully integrated with ChatGPT's Batch Refill optimization:

Box 5: tiny_alloc_fast()
  ↓ TLS freelist pop (3-4 instructions)
  ↓ Miss
  ↓ tiny_alloc_fast_refill()
  ↓ sll_refill_small_from_ss()
  ↓ (automatic mapping)
  ↓ sll_refill_batch_from_ss()  ← ChatGPT's optimization
  ↓   - trc_linear_carve() (batch of 64)
  ↓   - trc_splice_to_sll() (spliced in one step)
  ↓
  g_tls_sll_head refilled
  ↓ Retry pop → Success!

Effect of the integration:

Fast path: 3-4 instructions (Box 5)
Refill path: batch carving refills 64 blocks at once (ChatGPT's optimization)
Memory writes: 128 → 2 (-98%)
Result: +64% performance improvement
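
The carve-then-splice pattern can be sketched as follows. This is an illustration only, not the contents of core/tiny_refill_opt.h: chain_t, tls_list_t, linear_carve and splice_to_list are hypothetical names. One plausible reading of the "128 → 2" figure is that pushing 64 blocks one at a time updates the list head and count 64 times each, while a splice updates each of them once; the sketch is written to show that reading.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical private chain built by a carve, consumed by a splice. */
typedef struct chain {
    void *head;
    void *tail;
    uint32_t count;
} chain_t;

/* Hypothetical per-thread list state for one size class. */
typedef struct tls_list {
    void *head;
    uint32_t count;
} tls_list_t;

/* Carve up to `want` blocks of `block_size` bytes from [*cursor, end)
 * and link them into a chain that is not yet visible to the TLS list. */
static chain_t linear_carve(uint8_t **cursor, uint8_t *end,
                            size_t block_size, uint32_t want) {
    chain_t c = { NULL, NULL, 0 };
    assert(block_size >= sizeof(void *));
    while (c.count < want && (size_t)(end - *cursor) >= block_size) {
        void *blk = *cursor;
        *cursor += block_size;
        *(void **)blk = NULL;               /* new block ends the chain */
        if (c.tail) *(void **)c.tail = blk; /* append behind the old tail */
        else c.head = blk;
        c.tail = blk;
        c.count++;
    }
    return c;
}

/* Attach the whole chain with one head update and one count update. */
static void splice_to_list(tls_list_t *l, chain_t c) {
    if (c.count == 0) return;
    *(void **)c.tail = l->head;
    l->head = c.head;
    l->count += c.count;
}
```

Building the chain privately also means the list state is never observed half-updated during a refill.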

Key Files

core/tiny_atomic.h - Box 1: atomic operations
core/tiny_alloc_fast.inc.h - Box 5: Ultra-fast alloc
core/tiny_free_fast.inc.h - Box 6: Fast free with ownership validation
core/tiny_refill_opt.h - Batch Refill helpers (ChatGPT)
core/hakmem_tiny_refill_p0.inc.h - P0 Batch Refill optimization (ChatGPT)
Makefile - added the box-refactor target

Feature Flag

HAKMEM_TINY_PHASE6_BOX_REFACTOR=1: enables the Box Theory Fast Path
Default (no flag): the original code runs (backward compatibility preserved)

Phase 6-2.1: ChatGPT Pro P0 Optimization (2025-11-05) ✅

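Goal: Replace the O(n) linear scan in superslab_refill with an O(1) ctz lookup
Result: Improved internal efficiency, throughput maintained (4.19M ops/s)

Implementation

1. P0 Optimization (ChatGPT Pro):

O(n) → O(1) conversion: the linear scan over 32 slabs is reduced to a single __builtin_ctz() instruction
nonempty_mask: a uint32_t bitmask (bit i = slabs[i].freelist != NULL)
Effect: superslab_refill CPU 29.47% → 25.89% (-12%)

Code: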

// Before (O(n)): 32 loads + 32 branches
for (int i = 0; i < 32; i++) {
    if (slabs[i].freelist) { /* try acquire */ }
}

// After (O(1)): bitmap build + ctz
uint32_t mask = 0;
for (int i = 0; i < 32; i++) {
    if (slabs[i].freelist) mask |= (1u << i);
}
while (mask) {
    int i = __builtin_ctz(mask);  // 1 instruction!
    mask &= ~(1u << i);
    /* try acquire slab i */
}
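
The snippet above still rebuilds the mask with a 32-iteration loop. One way to make the lookup itself a single ctz is to keep the mask up to date whenever a freelist changes between empty and non-empty. The sketch below assumes that approach; superslab_t, slab_push, slab_pop and first_nonempty are hypothetical names, and the real nonempty_mask maintenance in core/hakmem_tiny_superslab.c may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLABS_PER_SS 32

/* Hypothetical, simplified SuperSlab: one freelist per slab plus a
 * bitmap where bit i is set exactly when freelist[i] != NULL. */
typedef struct superslab {
    void *freelist[SLABS_PER_SS];
    uint32_t nonempty_mask;
} superslab_t;

/* Push a block onto slab i and mark the slab as non-empty. */
static void slab_push(superslab_t *ss, int i, void *blk) {
    assert(i >= 0 && i < SLABS_PER_SS);
    *(void **)blk = ss->freelist[i];
    ss->freelist[i] = blk;
    ss->nonempty_mask |= (1u << i);
}

/* Pop a block from slab i; clear its bit when the list drains. */
static void *slab_pop(superslab_t *ss, int i) {
    assert(i >= 0 && i < SLABS_PER_SS);
    void *blk = ss->freelist[i];
    if (!blk) return NULL;
    ss->freelist[i] = *(void **)blk;
    if (!ss->freelist[i]) ss->nonempty_mask &= ~(1u << i);
    return blk;
}

/* Index of the lowest non-empty slab, or -1 if all are empty.
 * __builtin_ctz(0) is undefined, so the zero case is checked first. */
static int first_nonempty(const superslab_t *ss) {
    uint32_t m = ss->nonempty_mask;
    return m ? __builtin_ctz(m) : -1;
}
```

The trade-off is one extra bit operation on push and on the draining pop, in exchange for a scan-free refill lookup.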

2. Active Counter Bug Fix (ChatGPT Pro Ultrathink):

Problem: P0 batch refill updated meta->used but not ss->total_active_blocks
Impact: counter mismatch → memory leaks or incorrect reclamation
Fix: added ss_active_add(tls->ss, batch) to both the freelist path and the linear carve path

3. Debug Overhead Removal (Claude Task Agent Ultrathink):

Problem: refill_opt_dbg() executed an atomic CAS even with debug=off → -26% performance regression
Fix: removed the debug calls from trc_pop_from_freelist() and trc_linear_carve()
Effect: 3.10M → 4.19M ops/s (+35% recovery)
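
The fix above removed the debug calls outright. If counters are still wanted in debug builds, an alternative is to gate them at compile time so a release build contains no atomic instruction at all. The following is a sketch of that alternative, not what the project does: TINY_DEBUG_COUNTERS, DBG_COUNT, g_dbg_refills and refill are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical build flag; off unless the build passes
 * -DTINY_DEBUG_COUNTERS=1. */
#ifndef TINY_DEBUG_COUNTERS
#define TINY_DEBUG_COUNTERS 0
#endif

static _Atomic uint64_t g_dbg_refills;

#if TINY_DEBUG_COUNTERS
/* Debug build: relaxed atomic increment. */
#define DBG_COUNT(ctr) \
    ((void)atomic_fetch_add_explicit(&(ctr), 1, memory_order_relaxed))
#else
/* Release build: expands to nothing, so the hot path has no atomics. */
#define DBG_COUNT(ctr) ((void)0)
#endif

/* Stand-in for a refill routine; only the counter placement matters. */
static uint32_t refill(uint32_t want) {
    assert(want > 0);
    DBG_COUNT(g_dbg_refills);
    return want;
}
```

A runtime flag checked on every call would still cost a load and a branch on the hot path; the preprocessor gate avoids both.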

Performance Results

Version	Score	Change	Notes
BOX_REFACTOR baseline	4.19M ops/s	-	Original code
P0 (buggy)	4.19M ops/s	0%	Has the counter bug
P0 + active_add (debug on)	3.10M ops/s	-26%	Debug overhead
P0 + active_add + no debug	4.19M ops/s	0%	Final version ✅

Internal improvement (perf):

superslab_refill CPU: 29.47% → 25.89% (-12%)
Overall throughput: baseline maintained (recovered by removing the debug overhead)

Key Files

core/hakmem_tiny_superslab.h - added the nonempty_mask field
core/hakmem_tiny_superslab.c - nonempty_mask initialization
core/hakmem_tiny_free.inc - ctz optimization of superslab_refill
core/hakmem_tiny_refill_p0.inc.h - added the ss_active_add() calls
core/tiny_refill_opt.h - removed the debug overhead
Makefile - recorded the ULTRA_SIMPLE test result (-15%, disabled)

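Key Findings

ULTRA_SIMPLE test: 3.56M ops/s (-15% vs BOX_REFACTOR)
Both variants share the same bottleneck: superslab_refill at 29% CPU
P0 is a partial improvement: -12% internally, but the overall effect is limited
Lesson from the debug overhead: atomic operations do not belong on the hot path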

Phase 5-B-Simple: Dual Free Lists + Magazine Unification (2025-11-02) ❌

Goal: +15-23% → Actual: -71% ST, -35% MT
Magazine unification itself is a good idea, but combining capacity tuning with Dual Free Lists failed
Details: HISTORY.md

Phase 5-A: Direct Page Cache (2025-11-01) ❌

-3% to -7.7% due to contention on the global cache

Phase 2+1: Magazine + Registry optimizations (2025-10-29) ✅

Success: performance improvement achieved

Important Documents

LARSON_GUIDE.md - integrated Larson benchmark guide (build, run, profile)
HISTORY.md - detailed record of failed optimizations
CURRENT_TASK.md - current task
benchmarks/results/ - benchmark results

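🔍 Tiny Performance Analysis (2025-11-02)

Root Cause Found

Detailed report: benchmarks/results/TINY_PERFORMANCE_ANALYSIS.md

The fast path is too complex:

System tcache: 3-4 instructions
HAKMEM: dozens of branches + multiple function calls
Branch misprediction cost: 50-200 cycles (vs 15-40 cycles for System)

Proposed improvements:

Option A: Ultra-Simple Fast Path (tcache-style) ⭐⭐⭐⭐⭐
- Same design as System tcache
- 3-4 instruction fast path
- Estimated chance of success: 80%, duration: 1-2 weeks
Option C: Hybrid approach ⭐⭐⭐⭐
- Tiny: redesign in tcache style
- Mid-Large: keep as is (preserve the +171% advantage)
- Estimated chance of success: 75%, duration: 2-3 weeks

Recommendation: Option A → if it succeeds, evolve it into Option C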

🚀 Phase 6: Learning-Based Tiny Allocator (2025-11-02~)

Strategy Decision

User's insight: "Just imitate Mid-Large"

Concept: "Simple Front + Smart Back"

Front: Ultra-Simple Fast Path (System tcache style, 3-4 instructions)
Back: learning layer (dynamic capacity adjustment, hotness tracking)

Implementation Plan

Phase 1 (1 week): Ultra-Simple Fast Path

// TLS free list based (only 3-4 instructions!)
void* hak_tiny_alloc(size_t size) {
    int cls = size_to_class_inline(size);
    void** head = &g_tls_cache[cls];
    void* ptr = *head;
    if (ptr) {
        *head = *(void**)ptr;  // Pop
        return ptr;
    }
    return hak_tiny_alloc_slow(size, cls);
}

Goal: 70-80% of System (95-108 M ops/sec)
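
The planned fast path depends on helpers that are not shown here (g_tls_cache, size_to_class_inline, hak_tiny_alloc_slow). For experimenting with the pop/push pair on its own, here is a self-contained sketch. The four size classes (16, 32, 64, 128 bytes) are an assumption taken from the benchmark sizes above, and size_to_class, tiny_pop and tiny_push are hypothetical stand-ins, not HAKMEM functions.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CLASSES 4  /* assumed classes: 16, 32, 64, 128 bytes */

/* One singly linked free list per size class, private to each thread. */
static _Thread_local void *g_tls_cache[NUM_CLASSES];

/* Map a request size (1..128) to a class index. */
static inline int size_to_class(size_t size) {
    assert(size >= 1 && size <= 128);
    if (size <= 16) return 0;
    if (size <= 32) return 1;
    if (size <= 64) return 2;
    return 3;
}

/* Pop the head block; NULL means the caller must take the slow path. */
static inline void *tiny_pop(int cls) {
    void *ptr = g_tls_cache[cls];
    if (ptr) g_tls_cache[cls] = *(void **)ptr;  /* next pointer lives in the block */
    return ptr;
}

/* Push a freed block back in front of the list. */
static inline void tiny_push(int cls, void *ptr) {
    *(void **)ptr = g_tls_cache[cls];
    g_tls_cache[cls] = ptr;
}
```

Because the next pointer is stored inside the free block itself, the list needs no side storage, and both operations are a load, a store and one branch at most.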

Phase 2 (1 week): Learning layer

Class hotness tracking
Dynamic cache capacity adjustment (16-256 slots)
Adaptive refill count (16-128 blocks)

Goal: 80-90% of System (108-122 M ops/sec)
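
The plan does not specify how the refill count adapts. As one possible policy, the sketch below doubles the count for a class that misses often within an observation window and halves it for a class that barely misses, clamped to the 16-128 range given above. The window mechanism, the thresholds, and the names class_stats_t and adapt_refill are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REFILL_MIN 16u   /* lower bound from the plan */
#define REFILL_MAX 128u  /* upper bound from the plan */

/* Hypothetical per-class statistics. */
typedef struct class_stats {
    uint32_t refill_count;  /* blocks fetched per refill */
    uint32_t misses;        /* fast-path misses in the current window */
} class_stats_t;

/* Call once at the end of each observation window.
 * hot_threshold and cold_threshold are tuning values chosen by the caller. */
static void adapt_refill(class_stats_t *s,
                         uint32_t hot_threshold, uint32_t cold_threshold) {
    assert(s != NULL);
    if (s->misses >= hot_threshold && s->refill_count < REFILL_MAX) {
        s->refill_count *= 2;  /* hot class: refill in bigger batches */
        if (s->refill_count > REFILL_MAX) s->refill_count = REFILL_MAX;
    } else if (s->misses <= cold_threshold && s->refill_count > REFILL_MIN) {
        s->refill_count /= 2;  /* cold class: hold less memory */
        if (s->refill_count < REFILL_MIN) s->refill_count = REFILL_MIN;
    }
    s->misses = 0;             /* start a new window */
}
```

Counting misses rather than every allocation keeps the bookkeeping off the fast path, since a miss already pays for a refill.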

Phase 3 (1 week): Memory efficiency optimization

Reduce caches for cold classes
Goal: match System's speed and win on memory 🏆

Applying the Mid-Large HAKX Success Pattern

Element	HAKX (Mid-Large)	Applied to Tiny
Fast Path	Direct SuperSlab pop	TLS Free List pop (3-4 instructions) ✅
Learning layer	Size pattern learning	Class hotness learning ✅
Dedicated optimization	8-32KB specific	Favor hot classes ✅
Batch processing	Batch allocation	Adaptive refill ✅

Progress

Create TODO list
Update CURRENT_TASK.md
Update CLAUDE.md
Start Phase 1 implementation

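🛠️ Build System Improvements (2025-11-02)

Problem Found: Missed Rebuilds After `.inc` File Updates

Symptoms:

Updating .inc / .inc.h files did not trigger a rebuild of libhakmem.so
ChatGPT implemented optimizations repeatedly, but the score never changed
Cause: the Makefile dependencies did not include the .inc files

Impact:

Found by checking timestamps: libhakmem.so was still 36 minutes old
The stale binary kept being executed
No error is reported, so the problem is hard to notice (extremely dangerous!)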

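Solution: Automatic Dependency Generation ✅

Implementation:

Automatic dependency generation: already in place (adopted)
- gcc's -MMD -MP flags detect .inc files automatically as well
- Generates .d files (dependency information)
- Maintenance-free; the industry-standard approach
build.sh (clean on every build): can be added if needed
- Reliable but slow
smart_build.sh (clean only when needed, based on timestamps): can be added
- Cleans automatically if a .inc file is newer than the .so
verify_build.sh (post-build verification): can be added
- Checks after the build that the binary is up to date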
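
For reference, the automatic dependency generation described above usually takes the following shape in a GNU Makefile. This is a generic fragment, not the project's actual Makefile: the SRCS/OBJS/DEPS variables, the core/*.c wildcard and the rule bodies are assumptions. Only the -MMD and -MP flags and the -include directive are the mechanism named above.

```make
# Hypothetical variable names; adapt to the real Makefile.
SRCS := $(wildcard core/*.c)
OBJS := $(SRCS:.c=.o)
DEPS := $(OBJS:.o=.d)

# -MMD writes core/foo.d next to core/foo.o, listing every file that
# foo.c pulled in through #include (headers and .inc files alike).
# -MP adds phony targets so a deleted header does not break the build.
CFLAGS += -MMD -MP

%.o: %.c
	$(CC) $(CFLAGS) -fPIC -c $< -o $@

libhakmem.so: $(OBJS)
	$(CC) -shared -o $@ $^

# Load the generated dependency files; the leading '-' ignores
# missing ones on the first build.
-include $(DEPS)
```

Since the compiler itself writes the dependency lists, any file reached through #include is tracked regardless of its extension, which is what closes the .inc gap.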

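Build Notes

When updating .inc files:

With automatic dependency generation, a rebuild normally happens automatically
If in doubt, run make clean && make

How to check:

# Check timestamps
ls -la --time-style=full-iso libhakmem.so core/*.inc core/*.inc.h

# Force a rebuild
make clean && make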

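Verifying the Effect (2025-11-02)

Before the fix:

No matter which optimization was implemented, the score did not change (stuck at ~2.3-4.2M ops/s)

After the fix (running make clean && make):

Mode	Score (ops/s)	Change
Normal	2,229,692	baseline
TINY_ONLY	2,623,397	+18% 🎉
LARSON_MODE	1,459,156	-35% (allocation failures)
ONDEMAND	1,439,179	-35% (allocation failures)

→ Optimizations are now actually reflected in the binary, and the score changes!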
