## Root Cause Analysis (GPT5)

**Physical Layout Constraints**:
- Class 0: 8B = [1B header][7B payload] → offset 1 = 9B needed = ❌ IMPOSSIBLE
- Class 1-6: >=16B = [1B header][15B+ payload] → offset 1 = ✅ POSSIBLE
- Class 7: 1KB → offset 0 (compatibility)

**Correct Specification**:
- HAKMEM_TINY_HEADER_CLASSIDX != 0:
  - Class 0, 7: next at offset 0 (overwrites header when on freelist)
  - Class 1-6: next at offset 1 (after header)
- HAKMEM_TINY_HEADER_CLASSIDX == 0:
  - All classes: next at offset 0

**Previous Bug**:
- Attempted "ALL classes offset 1" unification
- Class 0 with offset 1 caused immediate SEGV (9B > 8B block size)
- Mixed 2-arg/3-arg API caused confusion

## Fixes Applied

### 1. Restored 3-Argument Box API (core/box/tiny_next_ptr_box.h)
```c
// Correct signatures
void tiny_next_write(int class_idx, void* base, void* next_value);
void* tiny_next_read(int class_idx, const void* base);

// Correct offset calculation
size_t offset = (class_idx == 0 || class_idx == 7) ? 0 : 1;
```

### 2. Updated 123+ Call Sites Across 34 Files
- hakmem_tiny_hot_pop_v4.inc.h (4 locations)
- hakmem_tiny_fastcache.inc.h (3 locations)
- hakmem_tiny_tls_list.h (12 locations)
- superslab_inline.h (5 locations)
- tiny_fastcache.h (3 locations)
- ptr_trace.h (macro definitions)
- tls_sll_box.h (2 locations)
- + 27 additional files

Pattern: `tiny_next_read(base)` → `tiny_next_read(class_idx, base)`
Pattern: `tiny_next_write(base, next)` → `tiny_next_write(class_idx, base, next)`

### 3. Added Sentinel Detection Guards
- tiny_fast_push(): Block nodes with sentinel in ptr or ptr->next
- tls_list_push(): Block nodes with sentinel in ptr or ptr->next
- Defense-in-depth against remote free sentinel leakage

## Verification (GPT5 Report)

**Test Command**: `./out/release/bench_random_mixed_hakmem --iterations=70000`

**Results**:
- ✅ Main loop completed successfully
- ✅ Drain phase completed successfully
- ✅ NO SEGV (previous crash at iteration 66151 is FIXED)
- ℹ️ Final log: "tiny_alloc(1024) failed" is normal fallback to Mid/ACE layers

**Analysis**:
- Class 0 immediate SEGV: ✅ RESOLVED (correct offset 0 now used)
- 66K iteration crash: ✅ RESOLVED (offset consistency fixed)
- Box API conflicts: ✅ RESOLVED (unified 3-arg API)

## Technical Details

### Offset Logic Justification
```
Class 0: 8B block → next pointer (8B) fits ONLY at offset 0
Class 1: 16B block → next pointer (8B) fits at offset 1 (after 1B header)
Class 2: 32B block → next pointer (8B) fits at offset 1
...
Class 6: 512B block → next pointer (8B) fits at offset 1
Class 7: 1024B block → offset 0 for legacy compatibility
```

### Files Modified (Summary)
- Core API: `box/tiny_next_ptr_box.h`
- Hot paths: `hakmem_tiny_hot_pop*.inc.h`, `tiny_fastcache.h`
- TLS layers: `hakmem_tiny_tls_list.h`, `hakmem_tiny_tls_ops.h`
- SuperSlab: `superslab_inline.h`, `tiny_superslab_*.inc.h`
- Refill: `hakmem_tiny_refill.inc.h`, `tiny_refill_opt.h`
- Free paths: `tiny_free_magazine.inc.h`, `tiny_superslab_free.inc.h`
- Documentation: Multiple Phase E3 reports

## Remaining Work

None for Box API offset bugs - all structural issues resolved.

Future enhancements (non-critical):
- Periodic `grep -RF '*(void**)' core/` to detect direct pointer access violations (`-F` matches the pattern literally instead of as a regex)
- Enforce Box API usage via static analysis
- Document offset rationale in architecture docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
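
The offset rule in the commit message above can be sketched as follows. This is a minimal illustration, not the contents of `core/box/tiny_next_ptr_box.h`: the helper `tiny_next_offset` and the use of `memcpy` for the store at offset 1 are assumptions, and only the `HAKMEM_TINY_HEADER_CLASSIDX != 0` case is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte offset of the freelist "next" pointer in a block.
 * Classes 0 and 7 keep it at offset 0; classes 1-6 keep it after the 1B header. */
static size_t tiny_next_offset(int class_idx) {
    assert(class_idx >= 0 && class_idx <= 7);
    return (class_idx == 0 || class_idx == 7) ? 0 : 1;
}

/* Store the next pointer. memcpy is used because offset 1 is not
 * pointer-aligned, so a direct *(void**) store could be misaligned. */
static void tiny_next_write(int class_idx, void* base, void* next_value) {
    memcpy((uint8_t*)base + tiny_next_offset(class_idx),
           &next_value, sizeof(void*));
}

/* Load the next pointer from the same class-dependent offset. */
static void* tiny_next_read(int class_idx, const void* base) {
    void* next;
    memcpy(&next, (const uint8_t*)base + tiny_next_offset(class_idx),
           sizeof(void*));
    return next;
}
```

With this shape, a class 1 block keeps its header byte intact while it sits on a freelist, whereas a class 0 block has its header overwritten, which matches the specification listed in the commit message.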
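HAKMEM Memory Allocator - Claude Work Log
This file records important information from development sessions with Claude.
Project Overview
HAKMEM is a high-performance memory allocator with the following goals:
- Average performance on par with mimalloc
- Memory efficiency as well, via a smart learning layer
- Particularly strong performance in the Mid-Large range (8-32KB)
📊 Current Performance (2025-11-09)
Benchmark results
Tiny (256B): 2.76M ops/s (P0 ON, 100K iterations) 🏆
Mid-Large (8-32KB): 167.75M vs System 61.81M (+171%) 🏆
Key findings
- Major improvement in Phase 7 - Header-based fast free (+180-280%)
- P0 batch optimization - stable operation achieved by fixing meta->used
- Decisive Mid-Large win - +171% over System thanks to SuperSlab efficiency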
🔥 CRITICAL FIX: Pointer Conversion Bug (2025-11-13) ✅
Root Cause: DOUBLE CONVERSION (USER → BASE executed twice)
Status: ✅ FIXED - Minimal patch (< 15 lines)
Symptoms:
- C7 (1KB) alignment error: delta % 1024 == 1 (off by one)
- Error log: [C7_ALIGN_CHECK_FAIL] ptr=0x...402 base=0x...401
- Expected: delta % 1024 == 0 (aligned to block boundary)
Root Cause:
// core/tiny_superslab_free.inc.h (before fix)
static inline void hak_tiny_free_superslab(void* ptr, SuperSlab* ss) {
int slab_idx = slab_index_for(ss, ptr); // ← Uses USER pointer (wrong!)
// ... 8 lines ...
void* base = (void*)((uint8_t*)ptr - 1); // ← Converts USER → BASE
// Problem: On 2nd free cycle, ptr is already BASE, so:
// base = BASE - 1 = storage - 1 ← DOUBLE CONVERSION! Off by one!
}
Fix (lines 17-24):
static inline void hak_tiny_free_superslab(void* ptr, SuperSlab* ss) {
// ✅ FIX: Convert USER → BASE at entry point (single conversion)
void* base = (void*)((uint8_t*)ptr - 1);
// CRITICAL: Use BASE pointer for slab_index calculation!
int slab_idx = slab_index_for(ss, base); // ← Fixed!
// ... rest of function uses BASE consistently
}
Verification:
# Before fix: [C7_ALIGN_CHECK_FAIL] delta%blk=1
# After fix: No errors
./out/release/bench_fixed_size_hakmem 10000 1024 128 # ✅ PASS
Detailed Report: POINTER_CONVERSION_BUG_ANALYSIS.md, POINTER_FIX_SUMMARY.md
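
The single-conversion rule can be illustrated with a small sketch. It assumes the layout described in this log (USER = BASE + 1, blocks packed from the start of the slab); `user_to_base`, `block_delta`, and `free_entry_block_index` are hypothetical names, not functions from the code base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: each block is [1B header][payload], so USER = BASE + 1. */
static void* user_to_base(void* user_ptr) {
    return (uint8_t*)user_ptr - 1;
}

/* Offset of a pointer inside its block; 0 means it sits on a block boundary. */
static size_t block_delta(const void* slab_start, const void* ptr,
                          size_t block_size) {
    assert(block_size > 0);
    return (size_t)((const uint8_t*)ptr - (const uint8_t*)slab_start)
           % block_size;
}

/* Hypothetical free-path entry: convert USER -> BASE exactly once, then use
 * BASE for every later calculation (here, the block index). */
static size_t free_entry_block_index(const void* slab_start, void* user_ptr,
                                     size_t block_size) {
    void* base = user_to_base(user_ptr);
    assert(block_delta(slab_start, base, block_size) == 0);
    return (size_t)((const uint8_t*)base - (const uint8_t*)slab_start)
           / block_size;
}
```

Under these assumptions a USER pointer always shows a delta of 1, so an alignment check of the C7_ALIGN_CHECK_FAIL kind is only meaningful when it runs on the BASE pointer.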
🔥 CRITICAL FIX: P0 TLS Stale Pointer Bug (2025-11-09) ✅
Root Cause: Active Counter Corruption
Status: ✅ FIXED - 1-line patch
Symptoms:
- SEGV crash in bench_fixed_size (256B, 1KB)
- Active counter corruption: active_delta=-991 when allocating 128 blocks
- Trying to allocate 128 blocks from slab with capacity=64
Root Cause:
// core/hakmem_tiny_refill_p0.inc.h:256-262 (before fix)
if (meta->carved >= meta->capacity) {
if (superslab_refill(class_idx) == NULL) break;
meta = tls->meta; // ← Updates meta, but tls is STALE!
continue;
}
ss_active_add(tls->ss, batch); // ← Updates WRONG SuperSlab counter!
After superslab_refill() switches to a new SuperSlab, the local tls pointer becomes stale (still points to the old SuperSlab). Subsequent ss_active_add(tls->ss, batch) updates the WRONG SuperSlab's active counter, causing:
- SuperSlab A's counter incorrectly incremented
- SuperSlab B's counter unchanged (should have been incremented)
- When blocks from B are freed → counter underflow → SEGV
Fix (line 279):
if (meta->carved >= meta->capacity) {
if (superslab_refill(class_idx) == NULL) break;
tls = &g_tls_slabs[class_idx]; // ← RELOAD TLS after slab switch!
meta = tls->meta;
continue;
}
Verification:
256B fixed-size: 862K ops/s (stable, 200K iterations, 0 crashes) ✅
1KB fixed-size: 872K ops/s (stable, 200K iterations, 0 crashes) ✅
Stability test: 3/3 runs passed ✅
Counter errors: 0 (was: active_delta=-991) ✅
Detailed Report: TINY_256B_1KB_SEGV_FIX_REPORT.md
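
The same class of bug can be reproduced in a few lines. The sketch below is a simplified model, not the HAKMEM refill code: the types, `refill`, and `carve_batch` are stand-ins, and the stale value is modelled as a local snapshot of the SuperSlab pointer taken before the switch.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int active; } SuperSlab;               /* simplified stand-in */
typedef struct { SuperSlab* ss; int carved, capacity; } TlsSlab;

static SuperSlab g_ss_pool[2];
static TlsSlab g_tls_slab = { &g_ss_pool[0], 0, 64 };

/* Hypothetical refill: switches the TLS slot to a fresh SuperSlab. */
static void refill(void) {
    g_tls_slab.ss = &g_ss_pool[1];
    g_tls_slab.carved = 0;
}

/* Carve `want` blocks and charge them to a SuperSlab's active counter.
 * reload == 0 reproduces the bug; reload != 0 applies the fix.
 * Returns the SuperSlab that was charged. */
static SuperSlab* carve_batch(int want, int reload) {
    SuperSlab* ss = g_tls_slab.ss;              /* snapshot taken before refill */
    assert(want > 0 && want <= g_tls_slab.capacity);
    if (g_tls_slab.carved + want > g_tls_slab.capacity) {
        refill();
        if (reload) ss = g_tls_slab.ss;         /* re-read after the slab switch */
    }
    g_tls_slab.carved += want;
    ss->active += want;
    return ss;
}
```

Without the reload, the blocks come from the new SuperSlab but the old one is charged, which is the counter drift described above.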
🚀 Phase 7: Header-Based Fast Free (2025-11-08) ✅
Achievements
- +180-280% performance improvement (Random Mixed 128-1024B)
- 1-byte header (0xa0 | class_idx) for O(1) class identification
- Ultra-fast free path (3-5 instructions)
Key techniques
- Header write - a 1-byte header is added at allocation time
- Fast free - no SuperSlab lookup; pushes directly onto the TLS SLL
- Hybrid mincore - mincore() runs only at page boundaries (99.9% of cases cost 1-2 cycles)
Results
Random Mixed 128B: 21M → 59M ops/s (+181%)
Random Mixed 256B: 19M → 70M ops/s (+268%)
Random Mixed 512B: 21M → 68M ops/s (+224%)
Random Mixed 1024B: 21M → 65M ops/s (+210%)
Larson 1T: 631K → 2.63M ops/s (+333%)
How to build
./build.sh bench_random_mixed_hakmem # Phase 7 flags are set automatically
Key files:
- core/tiny_region_id.h - Header write API
- core/tiny_free_fast_v2.inc.h - Ultra-fast free (3-5 instructions)
- core/box/hak_free_api.inc.h - Dual-header dispatch
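
A minimal sketch of the header scheme follows. The value `0xa0 | class_idx` comes from this log; the function names, the nibble masks, and the `-1` return for a mismatched magic are assumptions made for illustration and are not the API of `core/tiny_region_id.h`.

```c
#include <assert.h>
#include <stdint.h>

#define TINY_HDR_MAGIC      0xa0u  /* from the log: header = 0xa0 | class_idx */
#define TINY_HDR_MAGIC_MASK 0xf0u  /* assumption: magic lives in the high nibble */
#define TINY_HDR_CLASS_MASK 0x0fu  /* assumption: class index in the low nibble */

/* Write the 1-byte header at BASE and return the USER pointer (BASE + 1). */
static void* tiny_header_write(void* base, int class_idx) {
    assert(class_idx >= 0 && class_idx <= 7);
    *(uint8_t*)base = (uint8_t)(TINY_HDR_MAGIC | (unsigned)class_idx);
    return (uint8_t*)base + 1;
}

/* Recover the class index from a USER pointer with a single byte load.
 * Returns -1 when the magic nibble does not match. */
static int tiny_header_class(const void* user_ptr) {
    uint8_t hdr = *((const uint8_t*)user_ptr - 1);
    if ((hdr & TINY_HDR_MAGIC_MASK) != TINY_HDR_MAGIC) return -1;
    return (int)(hdr & TINY_HDR_CLASS_MASK);
}
```

Because the class is read from the byte just before the USER pointer, the free path needs no SuperSlab lookup to decide which TLS list the block belongs to.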
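🐛 P0 Batch Optimization: Critical Bug Fix (2025-11-09) ✅
Problem
With P0 (batch refill optimization) ON, a SEGV occurred in the 100K-iteration run
Investigation process
Phase 1: Build system problem
- Found by the Task agent: a build error meant a stale binary was being run
- Claude's fix: added a local size table (2 lines)
- Result: 100K iterations succeed with P0 OFF (2.73M ops/s)
Phase 2: The real P0 bug
- Found by ChatGPT: missing meta->used increment
Root cause
P0 path (before fix, buggy):
trc_pop_from_freelist(meta, ..., &chain); // batch pop from the freelist
trc_splice_to_sll(&chain, &g_tls_sll_head[cls]); // splice onto the SLL
// meta->used += count; ← this was missing! 💀
Impact:
- meta->used drifts from the actual number of blocks in use
- The carve decision goes wrong → memory corruption → SEGV
ChatGPT's fix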
trc_splice_to_sll(...);
ss_active_add(tls->ss, from_freelist);
meta->used = (uint16_t)((uint32_t)meta->used + from_freelist); // ← added! ✅
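
The counter rule behind this fix can be sketched in isolation. The types and `pop_batch` below are simplified stand-ins, not the real `trc_*` helpers; the point is only that the number of blocks moved off the freelist is added to `used` in the same step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Node { struct Node* next; } Node;
typedef struct { Node* freelist; uint16_t used; } SlabMeta;  /* stand-in */

/* Move up to `want` nodes from the slab freelist onto *chain_head.
 * Returns how many were moved and adds that count to meta->used. */
static uint32_t pop_batch(SlabMeta* meta, uint32_t want, Node** chain_head) {
    uint32_t taken = 0;
    assert(meta != NULL && chain_head != NULL);
    while (taken < want && meta->freelist != NULL) {
        Node* n = meta->freelist;
        meta->freelist = n->next;   /* unlink from the freelist */
        n->next = *chain_head;      /* push onto the outgoing chain */
        *chain_head = n;
        taken++;
    }
    /* The step that was missing: keep `used` in sync with the blocks handed out. */
    meta->used = (uint16_t)((uint32_t)meta->used + taken);
    return taken;
}
```

Returning the actual count, rather than the requested one, lets the caller update any other counters with the same value.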
Additional implementation (runtime A/B hooks):
- HAKMEM_TINY_P0_ENABLE=1 - enable P0
- HAKMEM_TINY_P0_NO_DRAIN=1 - disable remote drain (for isolating problems)
- HAKMEM_TINY_P0_LOG=1 - counter verification log
Fix results
| Setting | Before fix | After fix |
|---|---|---|
| P0 OFF | 2.51-2.59M ops/s | 2.73M ops/s |
| P0 ON + NO_DRAIN | ❌ SEGV | ✅ 2.45M ops/s |
| P0 ON (recommended) | ❌ SEGV | ✅ 2.76M ops/s 🏆 |
100K iterations: all tests pass
Recommended production settings
export HAKMEM_TINY_P0_ENABLE=1
./out/release/bench_random_mixed_hakmem
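Performance: 2.76M ops/s (fastest, stable)
Known warnings (non-fatal)
COUNTER_MISMATCH:
- Frequency: rare (1-2 occurrences in 10K-100K iterations)
- Impact: none (no crash, no performance impact)
- Action: keep auditing (low priority)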
🎯 Pool TLS Phase 1.5a: Lock-Free Arena (2025-11-09) ✅
Overview
Lock-free TLS arena with chunk carving for 8KB-52KB allocations
Results
Pool TLS Phase 1.5a: 1.79M ops/s (8KB allocations)
System malloc: 0.19M ops/s (8KB allocations)
Ratio: 947% (9.47x faster!) 🏆
Architecture
- Box P1: Pool TLS API (ultra-fast alloc/free)
- Box P2: Refill Manager (batch allocation)
- Box P3: TLS Arena Backend (exponential chunk growth 1MB→8MB)
- Box P4: System Memory API (mmap wrapper)
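
The "exponential chunk growth 1MB→8MB" in Box P3 can be pictured with a short sketch. Doubling per growth level is an assumption (it is consistent with going from 1MB to 8MB in three levels), and `arena_chunk_mb` is a hypothetical helper, not a function from `core/pool_tls_arena.c`.

```c
#include <assert.h>
#include <stddef.h>

/* Chunk size in MB at a given growth level: starts at init_mb, doubles once
 * per level (assumed growth rule), and never exceeds max_mb. */
static size_t arena_chunk_mb(size_t init_mb, size_t max_mb, unsigned level) {
    size_t mb = init_mb;
    assert(init_mb > 0 && init_mb <= max_mb);
    while (level-- > 0 && mb < max_mb) {
        mb *= 2;
    }
    return mb < max_mb ? mb : max_mb;
}
```

With the defaults noted in this log (initial size 1MB, maximum 8MB), this rule yields 1, 2, 4, and 8MB for levels 0 through 3 and stays at 8MB afterwards.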
How to build
./build.sh bench_mid_large_mt_hakmem # Pool TLS is enabled automatically
Key files:
- core/pool_tls.h/c - TLS freelist + size-to-class
- core/pool_refill.h/c - Batch refill
- core/pool_tls_arena.h/c - Chunk management
📝 Development History (Summary)
Phase 2: Design Flaws Analysis (2025-11-08) 🔍
- Discovered design flaws in the fixed-size caches
- SuperSlab fixed at 32 slabs, TLS Cache with fixed capacity, etc.
- Details: DESIGN_FLAWS_ANALYSIS.md
Phase 6-1.7: Box Theory Refactoring (2025-11-05) ✅
- Ultra-Simple Fast Path (3-4 instructions)
- +64% performance improvement (Larson 1.68M → 2.75M ops/s)
- Details: core/tiny_alloc_fast.inc.h, core/tiny_free_fast.inc.h
Phase 6-2.1: P0 Optimization (2025-11-05) ✅
- superslab_refill reduced from O(n) to O(1) (using ctz)
- Introduced nonempty_mask
- Details: core/hakmem_tiny_superslab.h, core/hakmem_tiny_refill_p0.inc.h
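
The ctz-based lookup can be sketched as below. It assumes a 32-bit mask in which bit i is set when slab i has free blocks (this log mentions 32 slabs per SuperSlab elsewhere); the function names are hypothetical and `__builtin_ctz` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the first slab with free blocks, or -1 if there is none.
 * Replaces a linear scan over all slabs with one count-trailing-zeros. */
static int first_nonempty_slab(uint32_t nonempty_mask) {
    if (nonempty_mask == 0) return -1;      /* __builtin_ctz(0) is undefined */
    return __builtin_ctz(nonempty_mask);
}

/* Set or clear a slab's bit when it gains or loses its last free block. */
static uint32_t mark_slab(uint32_t mask, int slab_idx, int nonempty) {
    uint32_t bit;
    assert(slab_idx >= 0 && slab_idx < 32);
    bit = (uint32_t)1 << slab_idx;
    return nonempty ? (mask | bit) : (mask & ~bit);
}
```

The cost moves to the update side: every transition between empty and nonempty has to maintain the mask, otherwise the lookup returns stale answers.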
Phase 6-2.3: Active Counter Fix (2025-11-07) ✅
- Fixed the missing active counter increment in P0 batch refill
- Stable 4-thread operation achieved (838K ops/s)
Phase 6-2.2: Sanitizer Compatibility (2025-11-07) ✅
- Fixed ASan/TSan builds
- Introduced HAKMEM_FORCE_LIBC_ALLOC_BUILD=1
🛠️ Build System
Basic build
./build.sh <target> # Release build (recommended)
./build.sh debug <target> # Debug build
./build.sh help # Show help
./build.sh list # List targets
Main targets
- bench_random_mixed_hakmem - Tiny 1T mixed
- bench_pool_tls_hakmem - Pool TLS 8-52KB
- bench_mid_large_mt_hakmem - Mid-Large MT 8-32KB
- larson_hakmem - Larson mixed
Pinned flags
POOL_TLS_PHASE1=1
POOL_TLS_PREWARM=1
HEADER_CLASSIDX=1
AGGRESSIVE_INLINE=1
PREWARM_TLS=1
BUILD_RELEASE_DEFAULT=1 # Release mode
ENV variables (Pool TLS Arena)
export HAKMEM_POOL_TLS_ARENA_MB_INIT=2 # default 1
export HAKMEM_POOL_TLS_ARENA_MB_MAX=16 # default 8
export HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS=4 # default 3
ENV variables (P0)
export HAKMEM_TINY_P0_ENABLE=1 # Enable P0 (recommended)
export HAKMEM_TINY_P0_NO_DRAIN=1 # Disable remote drain (for debugging)
export HAKMEM_TINY_P0_LOG=1 # Counter verification log
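
A runtime A/B switch of this kind can be read with `getenv`. The sketch below is an illustration only: the helper names and the convention that unset, empty, or "0" means off are assumptions, and the log does not show how HAKMEM actually parses these variables.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed convention: NULL, "" and "0" mean off; any other value means on. */
static int flag_value_enabled(const char* value) {
    if (value == NULL || value[0] == '\0') return 0;
    return !(value[0] == '0' && value[1] == '\0');
}

/* Look up an environment variable and interpret it as an on/off flag. */
static int env_flag_enabled(const char* name) {
    assert(name != NULL);
    return flag_value_enabled(getenv(name));
}
```

A call such as `env_flag_enabled("HAKMEM_TINY_P0_ENABLE")` would normally be made once at initialization and the result cached, since `getenv` is too slow for an allocation hot path.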
🔍 Debugging and Profiling
Perf
perf stat -e cycles,instructions,branches,branch-misses,cache-misses -r 3 -- ./<bin>
Strace
strace -e trace=mmap,madvise,munmap -c ./<bin>
Build verification
./build.sh verify <binary>
make print-flags
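📚 Key Documents
- BUILDING_QUICKSTART.md - Build quickstart
- LARSON_GUIDE.md - Larson benchmark integration guide
- HISTORY.md - Record of failed optimizations
- 100K_SEGV_ROOT_CAUSE_FINAL.md - Detailed P0 SEGV investigation
- P0_INVESTIGATION_FINAL.md - Comprehensive P0 investigation report
- DESIGN_FLAWS_ANALYSIS.md - Design flaw analysis
🎓 Lessons Learned
- Importance of build verification - the danger of running a stale binary without noticing a build error
- Counter consistency - batch optimizations must keep every counter in sync
- The power of runtime A/B - environment variables make it possible to isolate the faulty path
- Header-based optimization - a single byte can yield a dramatic performance gain
- Box Theory - clear boundaries deliver both safety and performance
🚀 Next Optimization Candidates
Priority: low (already fast enough)
- Final check of branch-miss/IPC with perf A/B (release)
- COUNTER_MISMATCH threshold/frequency logging
- Light tuning of class5/6 front priority and branch hints
- Pool TLS Phase 1.5b: Pre-warm + adaptive refill
Priority: medium (design improvements)
- SuperSlab dynamic expansion (mimalloc-style linked chunks)
- TLS Cache adaptive sizing
- BigCache hash table with chaining
📊 Current Status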
Phase 7 (Header-based fast free): ✅ COMPLETE (+180-280%)
P0 (Batch refill optimization): ✅ COMPLETE (2.76M ops/s)
Pool TLS (8-52KB arena): ✅ COMPLETE (9.47x vs System)
Build System: ✅ STABLE (release/debug switching)
Production Readiness: ✅ READY (P0 ON recommended)
Recommended production settings:
export HAKMEM_TINY_P0_ENABLE=1
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 100000 256 42
# Expected: 2.76M ops/s ✅