2025-11-05 12:31:14 +09:00
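# HAKMEM Memory Allocator - Claude Work Log
This file records important information from development sessions with Claude.
## Project Overview
**HAKMEM** is a high-performance memory allocator with the following goals:
- Average performance roughly on par with mimalloc
- Better memory efficiency through a smart learning layer
- Particularly strong performance for Mid-Large (8-32KB) allocations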
---
## 📊 Current Performance (2025-11-09)
### Benchmark Results
```
Tiny (256B): 2.76M ops/s (P0 ON, 100K iterations) 🏆
Mid-Large (8-32KB): 167.75M vs System 61.81M (+171%) 🏆
```
### Key Findings
1. **Major improvement in Phase 7** - Header-based fast free (+180-280%)
2. **P0 batch optimization** - Stable operation achieved after the `meta->used` fix
3. **Mid-Large dominance** - +171% vs System thanks to SuperSlab efficiency
---
## 🔥 **CRITICAL FIX: P0 TLS Stale Pointer Bug (2025-11-09)** ✅
### **Root Cause**: Active Counter Corruption
**Status**: ✅ **FIXED** - 1-line patch
**Symptoms**:
- SEGV crash in `bench_fixed_size` (256B, 1KB)
- Active counter corruption: `active_delta=-991` when allocating 128 blocks
- Trying to allocate 128 blocks from slab with capacity=64
**Root Cause**:
```c
// core/hakmem_tiny_refill_p0.inc.h:256-262 (before fix)
if (meta->carved >= meta->capacity) {
    if (superslab_refill(class_idx) == NULL) break;
    meta = tls->meta;  // ← Updates meta, but tls is STALE!
    continue;
}
ss_active_add(tls->ss, batch); // ← Updates WRONG SuperSlab counter!
```
After `superslab_refill()` switches to a new SuperSlab, the local `tls` pointer becomes stale (still points to the old SuperSlab). Subsequent `ss_active_add(tls->ss, batch)` updates the WRONG SuperSlab's active counter, causing:
1. SuperSlab A's counter incorrectly incremented
2. SuperSlab B's counter unchanged (should have been incremented)
3. When blocks from B are freed → counter underflow → SEGV
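The three consequences above can be reproduced in a toy model. This is a sketch only, not HAKMEM code: `ToySuperSlab`, `toy_active_add`, `toy_free_blocks`, and `toy_refill_then_free` are hypothetical names, and the model tracks nothing but the two active counters.

```c
#include <stdint.h>

/* Toy stand-in for a SuperSlab: only the active-block counter is modeled. */
typedef struct {
    int32_t active;  /* blocks currently handed out from this SuperSlab */
} ToySuperSlab;

/* Credit `n` newly handed-out blocks to `ss`. */
static void toy_active_add(ToySuperSlab *ss, int32_t n) { ss->active += n; }

/* Free `n` blocks owned by `ss`.
 * Returns 0 on success, -1 if the counter would underflow. */
static int toy_free_blocks(ToySuperSlab *ss, int32_t n) {
    if (ss->active < n) return -1;
    ss->active -= n;
    return 0;
}

/* Carve `batch` blocks from `current` but credit them to `credited`,
 * then free them again. Passing credited != current models the stale
 * pointer; passing the same SuperSlab models the fixed code. */
static int toy_refill_then_free(ToySuperSlab *current, ToySuperSlab *credited,
                                int32_t batch) {
    toy_active_add(credited, batch);
    return toy_free_blocks(current, batch);
}
```

With two zeroed SuperSlabs A and B, `toy_refill_then_free(&B, &A, 128)` leaves A at 128 and reports an underflow on B, while `toy_refill_then_free(&B, &B, 128)` balances out to zero.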
**Fix** (line 279):
```c
if (meta->carved >= meta->capacity) {
    if (superslab_refill(class_idx) == NULL) break;
    tls = &g_tls_slabs[class_idx];  // ← RELOAD TLS after slab switch!
    meta = tls->meta;
    continue;
}
```
**Verification**:
```
256B fixed-size: 862K ops/s (stable, 200K iterations, 0 crashes) ✅
1KB fixed-size: 872K ops/s (stable, 200K iterations, 0 crashes) ✅
Stability test: 3/3 runs passed ✅
Counter errors: 0 (was: active_delta=-991) ✅
```
**Detailed Report**: [`TINY_256B_1KB_SEGV_FIX_REPORT.md`](TINY_256B_1KB_SEGV_FIX_REPORT.md)
---
## 🚀 Phase 7: Header-Based Fast Free (2025-11-08) ✅
### Achievements
- **+180-280% performance improvement** (Random Mixed 128-1024B)
- 1-byte header (`0xa0 | class_idx`) for O(1) class identification
- Ultra-fast free path (3-5 instructions)
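A minimal sketch of the `0xa0 | class_idx` header, for illustration only. It assumes the header byte sits immediately before the user pointer and that the class index fits in the low nibble; neither is confirmed by this log, and every `toy_`/`TOY_` name is hypothetical rather than taken from `core/tiny_region_id.h`.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_HDR_MAGIC      0xa0u  /* high nibble, as in `0xa0 | class_idx` */
#define TOY_HDR_MAGIC_MASK 0xf0u
#define TOY_HDR_CLASS_MASK 0x0fu  /* assumption: class index uses 4 bits */

/* Write the header into the first byte of `block` and return the user
 * pointer, which starts one byte later. */
static void *toy_header_write(void *block, unsigned class_idx) {
    uint8_t *p = (uint8_t *)block;
    p[0] = (uint8_t)(TOY_HDR_MAGIC | (class_idx & TOY_HDR_CLASS_MASK));
    return p + 1;
}

/* Recover the class index from a user pointer in O(1).
 * Returns -1 when the magic nibble does not match. */
static int toy_header_class(const void *user_ptr) {
    uint8_t hdr = ((const uint8_t *)user_ptr)[-1];
    if ((hdr & TOY_HDR_MAGIC_MASK) != TOY_HDR_MAGIC) return -1;
    return (int)(hdr & TOY_HDR_CLASS_MASK);
}
```

Under these assumptions the free path needs one byte load, one mask compare, and one mask to find the size class, which is in line with the 3-5 instruction figure quoted above.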
### Key Techniques
1. **Header write** - Adds a 1-byte header at allocation time
2. **Fast free** - No SuperSlab lookup; pushes directly onto the TLS SLL
3. **Hybrid mincore** - Calls mincore() only at page boundaries (99.9% of cases take 1-2 cycles)
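One plausible reading of the hybrid mincore check, sketched for Linux. It assumes the header is the byte just before the user pointer, so the previous page only needs probing when the pointer is page-aligned. `toy_page_size` and `toy_header_readable` are hypothetical names; the real check may differ.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Cache the page size so the fast path stays a mask and a compare. */
static uintptr_t toy_page_size(void) {
    static uintptr_t cached;
    if (cached == 0) cached = (uintptr_t)sysconf(_SC_PAGESIZE);
    return cached;
}

/* Returns 1 if the byte just before `user_ptr` is in mapped memory. */
static int toy_header_readable(const void *user_ptr) {
    uintptr_t addr = (uintptr_t)user_ptr;
    uintptr_t page = toy_page_size();

    /* Fast path: the header shares a page with user_ptr, which the
     * caller already holds, so no syscall is needed. */
    if ((addr & (page - 1)) != 0) return 1;

    /* Slow path: user_ptr is page-aligned, so the header lives on the
     * previous page. mincore() fails with ENOMEM if that page is unmapped. */
    if (addr < page) return 0;
    unsigned char vec;
    return mincore((void *)(addr - page), page, &vec) == 0;
}
```

Only pointers that land exactly on a page boundary pay for the syscall, which is how most frees can stay on the cheap path.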
### Results
```
Random Mixed 128B: 21M → 59M ops/s (+181%)
Random Mixed 256B: 19M → 70M ops/s (+268%)
Random Mixed 512B: 21M → 68M ops/s (+224%)
Random Mixed 1024B: 21M → 65M ops/s (+210%)
Larson 1T: 631K → 2.63M ops/s (+333%)
```
### Build
```bash
./build.sh bench_random_mixed_hakmem  # Phase 7 flags are set automatically
```
**Key files**:
- `core/tiny_region_id.h` - Header write API
- `core/tiny_free_fast_v2.inc.h` - Ultra-fast free (3-5 instructions)
- `core/box/hak_free_api.inc.h` - Dual-header dispatch
---
## 🐛 P0 Batch Optimization: Critical Bug Fix (2025-11-09) ✅
### Problem
With P0 (batch refill optimization) ON, a SEGV occurred in the 100K-iteration run.
### Investigation Process
**Phase 1: Build system issue**
- Found by Task-sensei: a build error meant a stale binary was being run
- Claude's fix: added a local size table (2 lines)
- Result: 100K iterations succeed with P0 OFF (2.73M ops/s)
**Phase 2: The real P0 bug**
- Found by ChatGPT-sensei: **missing `meta->used` increment**
### Root Cause
**P0 path (before fix, buggy)**:
```c
trc_pop_from_freelist(meta, ..., &chain);         // batch-pop from the freelist
trc_splice_to_sll(&chain, &g_tls_sll_head[cls]);  // splice onto the TLS SLL
// meta->used += count;  ← this is missing! 💀
```
**Impact**:
- `meta->used` drifts away from the actual number of blocks in use
- The carve decision goes wrong → memory corruption → SEGV
### ChatGPT-sensei's Fix
```c
trc_splice_to_sll(...);
ss_active_add(tls->ss, from_freelist);
meta->used = (uint16_t)((uint32_t)meta->used + from_freelist);  // ← added! ✅
```
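The accounting rule behind this fix can be shown in a self-contained sketch. Every name here (`ToyBlock`, `ToySlabMeta`, `toy_batch_refill`) is hypothetical, and the real `trc_*` helpers are not reproduced. The point is only that `used` must grow by exactly the number of blocks moved.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct ToyBlock { struct ToyBlock *next; } ToyBlock;

typedef struct {
    ToyBlock *freelist;  /* blocks that were returned to the slab */
    uint16_t  used;      /* blocks currently handed out */
} ToySlabMeta;

/* Move up to `want` blocks from the slab freelist onto the thread-local
 * list `*sll_head`, then account for them in meta->used.
 * Returns the number of blocks actually moved. */
static uint32_t toy_batch_refill(ToySlabMeta *meta, ToyBlock **sll_head,
                                 uint32_t want) {
    uint32_t moved = 0;
    while (moved < want && meta->freelist != NULL) {
        ToyBlock *b = meta->freelist;
        meta->freelist = b->next;  /* pop from the slab freelist */
        b->next = *sll_head;       /* push onto the thread-local list */
        *sll_head = b;
        moved++;
    }
    /* The step the buggy path skipped: keep the counter in sync. */
    meta->used = (uint16_t)((uint32_t)meta->used + moved);
    return moved;
}
```

Dropping the last assignment reproduces the drift described under Impact: the lists stay valid, but `used` no longer matches what was handed out.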
**Additional implementation (runtime A/B hooks)**:
- `HAKMEM_TINY_P0_ENABLE=1` - Enables P0
- `HAKMEM_TINY_P0_NO_DRAIN=1` - Disables remote drain (for isolating issues)
- `HAKMEM_TINY_P0_LOG=1` - Counter validation logging
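Runtime hooks like these are typically read once with `getenv`. The sketch below shows one way to do it. The variable names come from the list above, but `ToyP0Config`, `toy_env_flag`, and the rule that only the exact value `"1"` enables a flag are assumptions, not HAKMEM's actual parsing.

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if environment variable `name` is set to exactly "1". */
static int toy_env_flag(const char *name) {
    const char *v = getenv(name);
    return v != NULL && strcmp(v, "1") == 0;
}

typedef struct {
    int p0_enable;    /* HAKMEM_TINY_P0_ENABLE */
    int p0_no_drain;  /* HAKMEM_TINY_P0_NO_DRAIN */
    int p0_log;       /* HAKMEM_TINY_P0_LOG */
} ToyP0Config;

/* Read all three hooks in one place so the hot path only tests ints. */
static ToyP0Config toy_p0_config_load(void) {
    ToyP0Config c;
    c.p0_enable   = toy_env_flag("HAKMEM_TINY_P0_ENABLE");
    c.p0_no_drain = toy_env_flag("HAKMEM_TINY_P0_NO_DRAIN");
    c.p0_log      = toy_env_flag("HAKMEM_TINY_P0_LOG");
    return c;
}
```

Loading the flags once at startup keeps the A/B switch out of the allocation fast path while still allowing a rerun with a different setting and no rebuild.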
### Results After the Fix
| Configuration | Before fix | After fix |
|------|--------|--------|
| P0 OFF | 2.51-2.59M ops/s | 2.73M ops/s |
| P0 ON + NO_DRAIN | ❌ SEGV | ✅ 2.45M ops/s |
| **P0 ON (recommended)** | ❌ SEGV | ✅ **2.76M ops/s** 🏆 |

**100K iterations**: all tests pass
### Recommended Production Settings
```bash
export HAKMEM_TINY_P0_ENABLE=1
./out/release/bench_random_mixed_hakmem
```
**Performance**: 2.76M ops/s (fastest, stable)
### Known Warnings (non-fatal)
**COUNTER_MISMATCH**:
- Frequency: rare (1-2 occurrences per 10K-100K iterations)
- Impact: none (no crash, no performance impact)
- Action: continue auditing (low priority)
---
## 🎯 Pool TLS Phase 1.5a: Lock-Free Arena (2025-11-09) ✅
### Overview
Lock-free TLS arena with chunk carving for 8KB-52KB allocations
### Results
```
Pool TLS Phase 1.5a: 1.79M ops/s (8KB allocations)
System malloc: 0.19M ops/s (8KB allocations)
Ratio: 947% (9.47x faster!) 🏆
```
### Architecture
- Box P1: Pool TLS API (ultra-fast alloc/free)
- Box P2: Refill Manager (batch allocation)
- Box P3: TLS Arena Backend (exponential chunk growth 1MB→8MB)
- Box P4: System Memory API (mmap wrapper)
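The exponential chunk growth in Box P3 can be sketched as a pure function. This is an illustration, not the code in `core/pool_tls_arena.c`: `toy_arena_chunk_mb` is a hypothetical name, and treating the growth-level setting as the number of doublings is an assumption (it happens to be consistent with 1MB→8MB in 3 levels).

```c
#include <stddef.h>

/* Size in MB of the n-th chunk (0-based) carved by the arena.
 * The size doubles per chunk for at most `levels` doublings and is
 * capped at `max_mb`. */
static size_t toy_arena_chunk_mb(size_t init_mb, size_t max_mb,
                                 unsigned levels, unsigned n) {
    unsigned doublings = n < levels ? n : levels;
    size_t mb = init_mb;
    for (unsigned i = 0; i < doublings && mb < max_mb; i++) {
        mb *= 2;
    }
    return mb < max_mb ? mb : max_mb;
}
```

With an initial size of 1, a maximum of 8, and 3 levels, successive chunks come out as 1, 2, 4, 8, 8, ... MB.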
### Build
```bash
./build.sh bench_mid_large_mt_hakmem  # Pool TLS is enabled automatically
```
**Key files**:
- `core/pool_tls.h/c` - TLS freelist + size-to-class
- `core/pool_refill.h/c` - Batch refill
- `core/pool_tls_arena.h/c` - Chunk management
---
## 📝 Development History (Summary)
feat: Phase 7 + Phase 2 - Massive performance & stability improvements
Performance Achievements:
- Tiny allocations: +180-280% (21M → 59-70M ops/s random mixed)
- Single-thread: +24% (2.71M → 3.36M ops/s Larson)
- 4T stability: 0% → 95% (19/20 success rate)
- Overall: 91.3% of System malloc average (target was 40-55%) ✓
Phase 7 (Tasks 1-3): Core Optimizations
- Task 1: Header validation removal (Region-ID direct lookup)
- Task 2: Aggressive inline (TLS cache access optimization)
- Task 3: Pre-warm TLS cache (eliminate cold-start penalty)
Result: +180-280% improvement, 85-146% of System malloc
Critical Bug Fixes:
- Fix 64B allocation crash (size-to-class +1 for header)
- Fix 4T wrapper recursion bugs (BUG #7, #8, #10, #11)
- Remove malloc fallback (30% → 50% stability)
Phase 2a: SuperSlab Dynamic Expansion (CRITICAL)
- Implement mimalloc-style chunk linking
- Unlimited slab expansion (no more OOM at 32 slabs)
- Fix chunk initialization bug (bitmap=0x00000001 after expansion)
Files: core/hakmem_tiny_superslab.c/h, core/superslab/superslab_types.h
Result: 50% → 95% stability (19/20 4T success)
Phase 2b: TLS Cache Adaptive Sizing
- Dynamic capacity: 16-2048 slots based on usage
- High-water mark tracking + exponential growth/shrink
- Expected: +3-10% performance, -30-50% memory
Files: core/tiny_adaptive_sizing.c/h (new)
Phase 2c: BigCache Dynamic Hash Table
- Migrate from fixed 256×8 array to dynamic hash table
- Auto-resize: 256 → 512 → 1024 → 65,536 buckets
- Improved hash function (FNV-1a) + collision chaining
Files: core/hakmem_bigcache.c/h
Expected: +10-20% cache hit rate
Design Flaws Analysis:
- Identified 6 components with fixed-capacity bottlenecks
- SuperSlab (CRITICAL), TLS Cache (HIGH), BigCache/L2.5 (MEDIUM)
- Report: DESIGN_FLAWS_ANALYSIS.md (11 chapters)
Documentation:
- 13 comprehensive reports (PHASE*.md, DESIGN_FLAWS*.md)
- Implementation guides, test results, production readiness
- Bug fix reports, root cause analysis
Build System:
- Makefile: phase7 targets, PREWARM_TLS flag
- Auto dependency generation (-MMD -MP) for .inc files
Known Issues:
- 4T stability: 19/20 (95%) - investigating 1 failure for 100%
- L2.5 Pool dynamic sharding: design only (needs 2-3 days integration)
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
### Phase 2: Design Flaws Analysis (2025-11-08) 🔍
- Found design flaws in the fixed-size caches
- SuperSlab fixed at 32 slabs, TLS Cache with fixed capacity, and so on
- Details: `DESIGN_FLAWS_ANALYSIS.md`
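The Phase 2b notes in the commit message above describe the remedy for the fixed-capacity TLS cache as a dynamic capacity of 16-2048 slots with high-water mark tracking and exponential growth/shrink. A minimal sketch of that idea follows. The thresholds (grow when the peak reaches capacity, shrink when it stays below a quarter) and all `toy_`/`TOY_` names are assumptions, not the contents of `core/tiny_adaptive_sizing.c`.

```c
#include <stdint.h>

#define TOY_CAP_MIN 16u
#define TOY_CAP_MAX 2048u

typedef struct {
    uint32_t capacity;    /* current slot limit */
    uint32_t high_water;  /* peak occupancy since the last adjustment */
} ToyTlsCache;

/* Record the current occupancy so the peak is remembered. */
static void toy_cache_observe(ToyTlsCache *c, uint32_t count) {
    if (count > c->high_water) c->high_water = count;
}

/* Periodic adjustment: double when the peak hit the limit, halve when
 * the peak stayed under a quarter of it, then start a new window. */
static void toy_cache_adapt(ToyTlsCache *c) {
    if (c->high_water >= c->capacity && c->capacity < TOY_CAP_MAX) {
        c->capacity *= 2;
    } else if (c->high_water < c->capacity / 4 && c->capacity > TOY_CAP_MIN) {
        c->capacity /= 2;
    }
    if (c->capacity > TOY_CAP_MAX) c->capacity = TOY_CAP_MAX;
    if (c->capacity < TOY_CAP_MIN) c->capacity = TOY_CAP_MIN;
    c->high_water = 0;
}
```

The gap between the grow and shrink thresholds acts as hysteresis, so a cache hovering around half full does not oscillate between sizes.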
### Phase 6-1.7: Box Theory Refactoring (2025-11-05) ✅
- Ultra-Simple Fast Path (3-4 instructions)
- +64% performance improvement (Larson 1.68M → 2.75M ops/s)
- Details: `core/tiny_alloc_fast.inc.h`, `core/tiny_free_fast.inc.h`
### Phase 6-2.1: P0 Optimization (2025-11-05) ✅
- Made `superslab_refill` O(n) → O(1) (using ctz)
- Introduced `nonempty_mask`
- Details: `core/hakmem_tiny_superslab.h`, `core/hakmem_tiny_refill_p0.inc.h`
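The O(n) → O(1) change can be illustrated with a 32-bit mask and count-trailing-zeros. This sketch assumes one bit per slab, with a set bit meaning the slab has at least one free block; the `toy_` helpers are hypothetical, and `__builtin_ctz` is the GCC/Clang builtin.

```c
#include <stdint.h>

/* Returns the index of the lowest slab whose bit is set in
 * `nonempty_mask`, or -1 when no slab has free blocks. */
static int toy_first_nonempty(uint32_t nonempty_mask) {
    if (nonempty_mask == 0) return -1;  /* __builtin_ctz(0) is undefined */
    return __builtin_ctz(nonempty_mask);
}

/* Mark slab `slab` as having free blocks. */
static uint32_t toy_mask_set(uint32_t mask, unsigned slab) {
    return mask | (UINT32_C(1) << slab);
}

/* Mark slab `slab` as exhausted. */
static uint32_t toy_mask_clear(uint32_t mask, unsigned slab) {
    return mask & ~(UINT32_C(1) << slab);
}
```

Instead of scanning every slab on refill, the allocator keeps the mask up to date on free and exhaustion and finds a candidate with a single instruction.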
### Phase 6-2.3: Active Counter Fix (2025-11-07) ✅
- Fixed the missing active counter increment in P0 batch refill
- Achieved stable 4T operation (838K ops/s)
### Phase 6-2.2: Sanitizer Compatibility (2025-11-07) ✅
- Fixed ASan/TSan builds
- Introduced `HAKMEM_FORCE_LIBC_ALLOC_BUILD=1`
---
## 🛠️ Build System
### Basic Build
```bash
./build.sh <target>        # Release build (recommended)
./build.sh debug <target>  # Debug build
./build.sh help            # Show help
./build.sh list            # List targets
```
### Main Targets
- `bench_random_mixed_hakmem` - Tiny 1T mixed
- `bench_pool_tls_hakmem` - Pool TLS 8-52KB
- `bench_mid_large_mt_hakmem` - Mid-Large MT 8-32KB
- `larson_hakmem` - Larson mixed
### Pinned Flags
```
POOL_TLS_PHASE1=1
POOL_TLS_PREWARM=1
HEADER_CLASSIDX=1
AGGRESSIVE_INLINE=1
PREWARM_TLS=1
BUILD_RELEASE_DEFAULT=1 # Release mode
```
### Environment Variables (Pool TLS Arena)
```bash
export HAKMEM_POOL_TLS_ARENA_MB_INIT=2 # default 1
export HAKMEM_POOL_TLS_ARENA_MB_MAX=16 # default 8
export HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS=4 # default 3
```
### Environment Variables (P0)
```bash
export HAKMEM_TINY_P0_ENABLE=1    # Enable P0 (recommended)
export HAKMEM_TINY_P0_NO_DRAIN=1  # Disable remote drain (for debugging)
export HAKMEM_TINY_P0_LOG=1       # Counter validation logging
```
---
## 🔍 Debugging and Profiling
### Perf
```bash
perf stat -e cycles,instructions,branches,branch-misses,cache-misses -r 3 -- ./<bin>
```
### Strace
```bash
strace -e trace=mmap,madvise,munmap -c ./<bin>
```
### Build Verification
```bash
./build.sh verify <binary>
make print-flags
```
---
## 📚 Key Documents
- `BUILDING_QUICKSTART.md` - Build quick start
- `LARSON_GUIDE.md` - Larson benchmark integration guide
- `HISTORY.md` - Record of failed optimizations
- `100K_SEGV_ROOT_CAUSE_FINAL.md` - Detailed P0 SEGV investigation
- `P0_INVESTIGATION_FINAL.md` - Comprehensive P0 investigation report
- `DESIGN_FLAWS_ANALYSIS.md` - Design flaw analysis
---
## 🎓 Lessons Learned
1. **Importance of build verification** - A build error can go unnoticed and leave a stale binary running
2. **Counter consistency** - Batch optimizations require every counter to stay in sync
3. **The power of runtime A/B** - Environment variables make it possible to isolate the problem area
4. **Header-based optimization** - A single byte can deliver a dramatic performance gain
5. **Box Theory** - Clear boundaries deliver both safety and performance
---
## 🚀 Next Optimization Candidates
### Priority: Low (already fast enough)
1. Final branch-miss/IPC check with perf A/B (release)
2. COUNTER_MISMATCH threshold/frequency logging
3. Light tuning of class5/6 front priority and branch hints
4. Pool TLS Phase 1.5b: Pre-warm + adaptive refill
### Priority: Medium (design improvements)
1. SuperSlab dynamic expansion (mimalloc-style linked chunks)
2. TLS Cache adaptive sizing
3. BigCache hash table with chaining
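For the BigCache item, the commit notes name FNV-1a plus collision chaining. A generic sketch of that combination is below. The FNV-1a constants are the standard 32-bit ones, while `ToyEntry`, its fields, and the power-of-two bucket count are illustrative assumptions. What BigCache actually uses as a key is not stated in this log.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard 32-bit FNV-1a over a byte buffer. */
static uint32_t toy_fnv1a32(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = UINT32_C(2166136261);  /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= UINT32_C(16777619);        /* FNV prime */
    }
    return h;
}

typedef struct ToyEntry {
    uint64_t key;           /* hypothetical cache key */
    size_t   size;          /* hypothetical payload */
    struct ToyEntry *next;  /* collision chain */
} ToyEntry;

/* `nbuckets` must be a power of two so the mask acts as a modulo. */
static size_t toy_bucket_of(uint64_t key, size_t nbuckets) {
    return toy_fnv1a32(&key, sizeof key) & (nbuckets - 1);
}

/* Push `e` onto the front of its bucket's chain. */
static void toy_insert(ToyEntry **buckets, size_t nbuckets, ToyEntry *e) {
    size_t b = toy_bucket_of(e->key, nbuckets);
    e->next = buckets[b];
    buckets[b] = e;
}

/* Walk the chain for `key`; returns NULL when it is absent. */
static ToyEntry *toy_find(ToyEntry **buckets, size_t nbuckets, uint64_t key) {
    for (ToyEntry *e = buckets[toy_bucket_of(key, nbuckets)]; e != NULL; e = e->next) {
        if (e->key == key) return e;
    }
    return NULL;
}
```

Chaining means a full bucket never evicts or rejects an entry, which is the limitation a fixed 256×8 array has.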
---
## 📊 Current Status
```
Phase 7 (Header-based fast free): ✅ COMPLETE (+180-280%)
P0 (Batch refill optimization):   ✅ COMPLETE (2.76M ops/s)
Pool TLS (8-52KB arena):          ✅ COMPLETE (9.47x vs System)
Build System:                     ✅ STABLE (release/debug switching)
Production Readiness:             ✅ READY (P0 ON recommended)
```
**Recommended production settings**:
```bash
export HAKMEM_TINY_P0_ENABLE=1
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 100000 256 42
# Expected: 2.76M ops/s ✅
```