HAKMEM vs System Malloc Benchmark Results

Date: 2025-10-27
HAKMEM Version: Phase 8.3 (ACE Step 1-3)
Platform: Linux 5.15.167.4-microsoft-standard-WSL2
Compiler: GCC with -O3 -march=native

Benchmark Overview

Test Patterns (6 total)

Test	Pattern	Purpose
Test 1: Sequential LIFO	alloc[0..99] → free[99..0] (reverse order)	Best case: makes full use of the freelist's LIFO behavior
Test 2: Sequential FIFO	alloc[0..99] → free[0..99] (same order)	Worst case: measures FIFO fragmentation of the freelist
Test 3: Random Order Free	alloc[0..99] → free[random] (random order)	Realistic: cache misses and fragmentation
Test 4: Interleaved Alloc/Free	alloc → free → alloc → free (alternating)	Fast churn: measures the effect of the magazine cache
Test 5: Mixed Sizes	8B, 16B, 32B, 64B mixed	Multi-size: cost of switching between size classes
Test 6: Long-lived vs Short-lived	hold 50%, churn the rest	Memory pressure: performance under heavy load
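
The patterns above differ mainly in the order in which blocks are freed. As a rough sketch only (this is not the actual benchmark source; the function names are hypothetical and the batch size of 100 is read off the table), Tests 1, 2 and 4 could be written in C like this:

```c
#include <stdlib.h>

#define BATCH 100  /* matches alloc[0..99] in the table */

/* Touch the block so the compiler cannot elide the malloc/free pair. */
static void *alloc_touch(size_t size) {
    void *p = malloc(size);
    if (p == NULL) abort();
    *(volatile char *)p = 1;
    return p;
}

/* Tests 1 and 2: fill a batch, then free it in reverse (LIFO) or
   same (FIFO) order. Returns the number of alloc/free pairs performed. */
static size_t run_batch(size_t size, size_t rounds, int lifo) {
    void *slots[BATCH];
    size_t pairs = 0;
    for (size_t r = 0; r < rounds; r++) {
        for (int i = 0; i < BATCH; i++) slots[i] = alloc_touch(size);
        for (int i = 0; i < BATCH; i++)
            free(slots[lifo ? BATCH - 1 - i : i]);
        pairs += BATCH;
    }
    return pairs;
}

/* Test 4: free each block immediately after allocating it. */
static size_t run_interleaved(size_t size, size_t count) {
    for (size_t i = 0; i < count; i++) free(alloc_touch(size));
    return count;
}
```

To turn the returned pair count into a throughput figure, one would time each call (for example with clock_gettime and CLOCK_MONOTONIC) and divide by the elapsed seconds. Whether the reported M ops/s counts an alloc/free pair as one operation or two is not stated in these results.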

Tested Size Classes

16B: Tiny pool (8-64B)
32B: Tiny pool (8-64B)
64B: Tiny pool (8-64B)
128B: MF2 pool (65-2048B)

Results Summary

🏆 Overall Winner by Size Class

Size Class	LIFO	FIFO	Random	Interleaved	Mixed	Long-lived	Total Winner
16B	System	System	System	System	-	System	System (5/5)
32B	System	System	System	System	-	System	System (5/5)
64B	System	System	System	System	-	System	System (5/5)
128B	HAKMEM	HAKMEM	HAKMEM	HAKMEM	-	HAKMEM	HAKMEM (5/5)
Mixed	-	-	-	-	System	-	System (1/1)

Detailed Results

16 Bytes (Tiny Pool)

Test	HAKMEM	System	Winner	Gap
LIFO	212.24 M ops/s	404.88 M ops/s	System	+90.7%
FIFO	210.90 M ops/s	402.95 M ops/s	System	+91.0%
Random	109.91 M ops/s	148.50 M ops/s	System	+35.1%
Interleaved	204.28 M ops/s	405.50 M ops/s	System	+98.5%
Long-lived	208.82 M ops/s	409.17 M ops/s	System	+95.9%

Analysis: System malloc dominates at 16B, running at roughly twice the speed of HAKMEM.
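
The Gap column appears to express the winner's throughput relative to the loser's. That definition is inferred from the numbers rather than stated in the results, so treat the helper below (a hypothetical name) as an assumption:

```c
/* Relative advantage of the faster allocator over the slower one, in percent.
   Assumed definition of the Gap column: (winner / loser - 1) * 100. */
static double gap_percent(double winner_mops, double loser_mops) {
    return (winner_mops / loser_mops - 1.0) * 100.0;
}
```

With the 16B Interleaved row, gap_percent(405.50, 204.28) comes out at about 98.5, which matches the table. Read the other way round, a gap of +98.5% means HAKMEM runs at about 50% of System malloc's throughput on that test.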

32 Bytes (Tiny Pool)

Test	HAKMEM	System	Winner	Gap
LIFO	210.79 M ops/s	401.61 M ops/s	System	+90.5%
FIFO	211.48 M ops/s	401.52 M ops/s	System	+89.9%
Random	110.03 M ops/s	148.94 M ops/s	System	+35.4%
Interleaved	203.77 M ops/s	403.95 M ops/s	System	+98.3%
Long-lived	208.39 M ops/s	405.39 M ops/s	System	+94.5%

Analysis: As with 16B, System malloc dominates.

64 Bytes (Tiny Pool)

Test	HAKMEM	System	Winner	Gap
LIFO	210.56 M ops/s	400.45 M ops/s	System	+90.2%
FIFO	210.51 M ops/s	386.92 M ops/s	System	+83.8%
Random	110.41 M ops/s	147.07 M ops/s	System	+33.2%
Interleaved	204.72 M ops/s	404.72 M ops/s	System	+97.7%
Long-lived	207.96 M ops/s	403.51 M ops/s	System	+94.0%

Analysis: System malloc stays ahead even at the largest Tiny pool size.

128 Bytes (MF2 Pool)

Test	HAKMEM	System	Winner	Gap
LIFO	209.20 M ops/s	166.98 M ops/s	HAKMEM	+25.3%
FIFO	209.40 M ops/s	171.44 M ops/s	HAKMEM	+22.1%
Random	109.41 M ops/s	71.21 M ops/s	HAKMEM	+53.6%
Interleaved	203.93 M ops/s	185.41 M ops/s	HAKMEM	+10.0%
Long-lived	206.51 M ops/s	182.92 M ops/s	HAKMEM	+12.9%

Analysis: 🎉 HAKMEM wins every pattern! At 128B, the MF2 pool (65-2048B) beats System malloc by +10.0% to +53.6%, with the largest margin on the Random pattern.

Mixed Sizes (8B, 16B, 32B, 64B)

Test	HAKMEM	System	Winner	Gap
Mixed	205.10 M ops/s	406.60 M ops/s	System	+98.2%

Analysis: System malloc leads on mixed sizes. The cost of switching between size classes is a contributing factor.

Overall Assessment

🏅 Performance Summary

Allocator	Wins	Avg Speedup	Best Result	Worst Result
HAKMEM	5/21 tests	-	+53.6% (128B Random)	-98.5% (16B Interleaved)
System	16/21 tests	+81.3% (Tiny pool avg)	+98.5% (16B Interleaved)	-53.6% (128B Random)

🔍 Key Insights

System malloc dominates the Tiny pool (8-64B)
- Likely cause: an extremely fast thread-local cache of the kind used by tcmalloc/jemalloc (in stock glibc malloc this role is played by tcache)
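- HAKMEM holds steady at about 200M ops/sec
- System reaches 400M+ ops/sec
HAKMEM leads in the MF2 pool (65-2048B)
- Wins every pattern at 128B (+10% to +53.6%)
- Especially strong on the Random pattern (+53.6%)
- MF2's page-based allocation is paying off
HAKMEM strengths
- Stability at medium sizes (128B+)
- Strength on random access patterns
- Memory efficiency (further improvement planned with Phase 8.3 ACE)
HAKMEM weaknesses
- About half the speed of System malloc at small sizes (8-64B)
- Tiny pool is not yet sufficiently optimized
- Magazine cache has only a limited effect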

ACE (Agentic Context Engineering) Status

Phase 8.3 Implementation Status

✅ Steps 1-3 complete (current):

SuperSlab lg_size support (variable size, 1MB ↔ 2MB)
ACE tick function (promotion/demotion logic)
Counter tracking (alloc_count, live_blocks, hot_score)

⏳ Steps 4-5 not yet implemented:

ε-greedy bandit (batch/threshold optimization)
PGO regeneration
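
Step 4 is not implemented yet, so the following is only a generic illustration of the ε-greedy technique, not HAKMEM's design. It assumes a small fixed set of candidate settings (the "arms", for instance a few batch sizes) and a reward such as measured throughput. Every name in it is hypothetical.

```c
#include <stdlib.h>

#define N_ARMS 4  /* hypothetical: number of candidate batch/threshold settings */

typedef struct {
    double   mean_reward[N_ARMS]; /* running average reward per arm */
    unsigned pulls[N_ARMS];       /* how many times each arm was tried */
    double   epsilon;             /* exploration probability in [0, 1] */
} bandit_t;

/* Pick an arm: explore at random with probability epsilon,
   otherwise exploit the arm with the best average so far. */
static int bandit_select(const bandit_t *b) {
    if ((double)rand() / ((double)RAND_MAX + 1.0) < b->epsilon)
        return rand() % N_ARMS;
    int best = 0;
    for (int i = 1; i < N_ARMS; i++)
        if (b->mean_reward[i] > b->mean_reward[best]) best = i;
    return best;
}

/* Fold an observed reward into the running average of the chosen arm. */
static void bandit_update(bandit_t *b, int arm, double reward) {
    b->pulls[arm]++;
    b->mean_reward[arm] += (reward - b->mean_reward[arm]) / b->pulls[arm];
}
```

In an allocator, a loop of this shape would presumably run on a slow path such as the ACE tick: select a setting, run with it for one interval, then feed the measured result back through bandit_update.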

ACE Stats (from HAKMEM run)

Class	Current Size	Target Size	Hot Score	Allocs	Live Blocks
8B	1MB	1MB	1000	3.15M	25.0M
16B	1MB	1MB	1000	3.14M	475.0M
24B	1MB	1MB	1000	3.14M	475.0M
32B	1MB	1MB	1000	3.15M	475.0M
40B	1MB	1MB	1000	15.47M	450.0M

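Next Actions

Priority: High

Speed up the Tiny pool
- Improve the magazine cache
- Optimize the thread-local cache
- Make SuperSlab allocation lighter
Complete ACE Phase 8.3
- Step 4: implement the ε-greedy bandit
- Step 5: regenerate PGO
- Measure the RSS reduction

Priority: Medium

Optimize the mixed-size pattern
- Reduce the cost of switching between size classes
- Introduce size-class prediction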

Conclusion

Current Status: HAKMEM outperforms System malloc in the MF2 pool (128B+) but runs at about half its speed in the Tiny pool (8-64B).

Next Goal: Double Tiny pool throughput → reach parity with System malloc.

Long-term Vision: An allocator that beats System malloc in every size class while also being more memory-efficient.

Historical Performance (HAKMEM Step 3d vs mimalloc)

🏆 Best Performance Record (HAKMEM Step 3d)

Top Results:

Test 6 (128B Long-lived): 313.27 M ops/sec ← 🥇 NEW RECORD!
Test 6 (16B Long-lived): 312.59 M ops/sec
Test 6 (64B Long-lived): 312.24 M ops/sec
Test 6 (32B Long-lived): 310.88 M ops/sec
Test 4 (32B Interleaved): 310.38 M ops/sec
Test 4 (64B Interleaved): 309.94 M ops/sec
Test 4 (16B Interleaved): 309.85 M ops/sec
Test 4 (128B Interleaved): 308.88 M ops/sec
Test 2 (32B FIFO): 307.53 M ops/sec

🎯 HAKMEM vs mimalloc (Step 3d)

Metric	HAKMEM Step 3d	mimalloc	Winner	Gap
Performance	313.27 M ops/sec	307.00 M ops/sec	HAKMEM	+2.0% 🎉
Memory (RSS)	13,208 KB (13.2 MB)	4,036 KB (4.0 MB)	mimalloc	-227% (3.27x) ⚠️

Analysis:

✅ Speed: HAKMEM beats mimalloc by +2.0% (313.27 vs 307.00 M ops/sec)
⚠️ Memory: HAKMEM uses 3.27x the memory of mimalloc (+9.2 MB)

🎯 Performance vs Memory Trade-off

Version	Speed (128B)	RSS Memory	Speed/MB Ratio
mimalloc	307.0 M ops/s	4.0 MB	76.75 M ops/s per MB 🏆
HAKMEM Step 3d	313.3 M ops/s	13.2 MB	23.74 M ops/s per MB
HAKMEM Phase 8.3	206.5 M ops/s	TBD	TBD

Goal (Phase 8.3 ACE): Reduce RSS from 13.2 MB to 4-6 MB while maintaining 300M+ ops/sec
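
The Speed/MB Ratio column is throughput divided by resident memory. A one-line helper (hypothetical name) makes it easy to see what the goal would mean on the same scale:

```c
/* Throughput per megabyte of resident memory, in M ops/s per MB. */
static double speed_per_mb(double mops, double rss_mb) {
    return mops / rss_mb;
}
```

speed_per_mb(307.0, 4.0) reproduces mimalloc's 76.75. Reaching the stated goal of 300 M ops/s at 4-6 MB would put HAKMEM between 50 and 75 on this scale, compared with roughly 24 for Step 3d.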

Regression Analysis: Phase 8.3 vs Step 3d

128B Long-lived Test

Version	Throughput	vs Step 3d	vs mimalloc
HAKMEM Step 3d (Best)	313.27 M ops/s	baseline	+2.0% ✅
HAKMEM Phase 8.3 (Current)	206.51 M ops/s	-34.1% ⚠️	-32.7% ⚠️
mimalloc	307.00 M ops/s	-2.0%	baseline
System malloc	182.92 M ops/s	-41.6%	-40.4%

Regression: Phase 8.3 is 34.1% slower than Step 3d!

🔍 Root Cause Analysis

The cause is the ACE (Agentic Context Engineering) counter tracking that Phase 8.3 added to the hot path.

1. ACE Counter Tracking on Every Allocation (hakmem_tiny.c:1251-1264)

g_ss_ace[class_idx].alloc_count++;    // +1 write
g_ss_ace[class_idx].live_blocks++;    // +1 write
if ((g_ss_ace[class_idx].alloc_count & 0x3FFFu) == 0) { // +1 load, +1 AND, +1 compare
    hak_tiny_superslab_ace_tick(...);
}

Impact: 2 writes + 3 ops per allocation
Benchmark: 200M allocations = 400M extra writes
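
The mask 0x3FFFu keeps the low 14 bits of alloc_count, so the tick branch is taken once every 16,384 allocations, while the two counter writes happen on every call. The sketch below reproduces only that cadence. It uses a hypothetical stand-in struct, since the real g_ss_ace layout and field widths are not shown here.

```c
#include <stdint.h>

/* Hypothetical stand-in for one g_ss_ace entry. */
typedef struct {
    uint32_t alloc_count;
    uint32_t live_blocks;
    uint32_t ticks;        /* counts how often the tick branch ran */
} ace_demo_t;

/* Mirrors the hot-path pattern: two writes per call, tick every 0x4000 calls. */
static void ace_demo_on_alloc(ace_demo_t *a) {
    a->alloc_count++;
    a->live_blocks++;
    if ((a->alloc_count & 0x3FFFu) == 0)
        a->ticks++;        /* the real code calls hak_tiny_superslab_ace_tick here */
}

/* Simulate n allocations and report how many times the tick fired. */
static uint32_t ace_demo_run(uint32_t n) {
    ace_demo_t a = {0, 0, 0};
    for (uint32_t i = 0; i < n; i++) ace_demo_on_alloc(&a);
    return a.ticks;
}
```

ace_demo_run(1u << 20) yields 64 ticks, and 200M allocations would trigger only about 12,207. This suggests the per-call counter writes, rather than the tick itself, account for most of the added hot-path work.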

2. ACE Counter Tracking on Every Free (hakmem_tiny.c:1336-1338, 1355-1357)

if (g_ss_ace[ss->size_class].live_blocks > 0) {  // +1 load, +1 compare
    g_ss_ace[ss->size_class].live_blocks--;       // +1 write
}

Impact: 1 load + 1 compare + 1 write per free
Benchmark: 200M frees = 200M extra operations

3. Registry Lookup Overhead (hakmem_super_registry.h:52-74)

for (int lg = 20; lg <= 21; lg++) {  // Try both 1MB and 2MB
    // ... probe loop ...
    if (b == base && e->lg_size == lg) return e->ss;  // Extra field check
}

Impact: Doubles worst-case lookup time, extra lg_size comparisons on every free
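
To see why variable SuperSlab sizes double the worst case, consider a lookup that does not know in advance whether a pointer belongs to a 1MB or a 2MB SuperSlab: it has to compute a candidate base for each size and probe for each one. The sketch below uses a hypothetical open-addressing table (the names, the table size, and the use of base 0 as the empty marker are all assumptions); the real hakmem_super_registry.h layout is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

#define REG_SIZE 64  /* hypothetical table size, power of two */

typedef struct { uintptr_t base; int lg_size; void *ss; } reg_entry_t;
static reg_entry_t g_reg[REG_SIZE];  /* base == 0 marks an empty slot */

static size_t reg_hash(uintptr_t base) {
    return (size_t)(base >> 20) & (REG_SIZE - 1);
}

/* Register a SuperSlab of size 2^lg bytes starting at base. */
static int reg_insert(uintptr_t base, int lg, void *ss) {
    size_t h = reg_hash(base);
    for (size_t probe = 0; probe < REG_SIZE; probe++) {
        reg_entry_t *e = &g_reg[(h + probe) & (REG_SIZE - 1)];
        if (e->base == 0) { e->base = base; e->lg_size = lg; e->ss = ss; return 1; }
    }
    return 0;  /* table full */
}

/* Find the SuperSlab owning ptr: one probe pass per candidate size. */
static void *reg_lookup(const void *ptr) {
    for (int lg = 20; lg <= 21; lg++) {            /* try 1MB, then 2MB */
        uintptr_t base = (uintptr_t)ptr & ~(((uintptr_t)1 << lg) - 1);
        size_t h = reg_hash(base);
        for (size_t probe = 0; probe < REG_SIZE; probe++) {
            const reg_entry_t *e = &g_reg[(h + probe) & (REG_SIZE - 1)];
            if (e->base == 0) break;               /* end of probe chain */
            if (e->base == base && e->lg_size == lg) return e->ss;
        }
    }
    return NULL;
}
```

A pointer in the upper half of a 2MB SuperSlab misses on the 1MB pass and is found only on the second pass, which is the doubled worst case described above.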

4. Memory Pressure

Accesses to g_ss_ace[class_idx] put pressure on the cache
A write to the global array happens on every call

💡 Solution Options

Option A: Sampling-based Tracking
- Update counters with probability 1/256 only (statistically sufficient)
- Expected: ~1% overhead (313M → 310M ops/s)
Option B: Per-TLS Counters
- Speed up writes by using thread-local counters
- Aggregate at tick time
Option C: Conditional ACE (compile-time flag)
- Make tracking removable via #ifdef HAKMEM_ACE_ENABLE
- ACE off in production; ACE on only when memory matters most
Option D: ACE v2 - Lazy Observation
- Count only at magazine refill/spill (the existing slow path)
- Leave the alloc/free hot path completely untouched
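
As one possible shape for Options B and C combined, the sketch below keeps a thread-local pending count on the hot path and flushes it to a shared counter in batches, with the whole mechanism compiled out unless HAKMEM_ACE_ENABLE is defined. It is an illustration under assumptions, not the project's code: the function names, the class count and the flush batch of 256 (borrowed from Option A's 1/256 figure) are all hypothetical. It requires C11 for _Thread_local and <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Normally supplied by the build (-DHAKMEM_ACE_ENABLE); defined here
   only so that this sketch compiles with tracking switched on. */
#ifndef HAKMEM_ACE_ENABLE
#define HAKMEM_ACE_ENABLE 1
#endif

#define ACE_NUM_CLASSES 8     /* hypothetical number of size classes */
#define ACE_FLUSH_BATCH 256u  /* hypothetical flush interval */

/* Shared totals, touched only when a thread flushes its batch. */
static _Atomic uint64_t g_alloc_total[ACE_NUM_CLASSES];

#ifdef HAKMEM_ACE_ENABLE
/* Thread-private pending counts: the hot path writes only these. */
static _Thread_local uint32_t t_pending[ACE_NUM_CLASSES];

static inline void ace_note_alloc(int class_idx) {
    if (++t_pending[class_idx] == ACE_FLUSH_BATCH) {
        atomic_fetch_add_explicit(&g_alloc_total[class_idx],
                                  t_pending[class_idx], memory_order_relaxed);
        t_pending[class_idx] = 0;
    }
}
#else
/* Tracking compiled out: the call disappears from the hot path. */
static inline void ace_note_alloc(int class_idx) { (void)class_idx; }
#endif

/* Read the flushed total for one class (excludes unflushed pending counts). */
static uint64_t ace_total(int class_idx) {
    return atomic_load_explicit(&g_alloc_total[class_idx], memory_order_relaxed);
}
```

The trade-off is that shared totals lag behind by up to one batch per thread, which should be acceptable for a heuristic such as hot_score but would not be for exact accounting. Option D goes one step further by moving the update into the refill/spill path, so the fast path carries no counter at all.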

Raw Data

HAKMEM Phase 8.3: benchmarks/hakmem_result.txt
System malloc: benchmarks/system_result.txt
HAKMEM Step 3d: (Historical data, referenced above)
