hakmem/docs/status/CURRENT_TASK.md

# CURRENT TASK – Performance Optimization Status

**Last Updated**: 2025-11-26
**Scope**: Phase UNIFIED-HEADER Bug Fixes / Header Read Performance

---

## 🎯 現状サマリ

### ✅ Phase UNIFIED-HEADER バグ修正完了 - 大幅な性能改善達成

| Benchmark | Before | After | Improvement |
|-----------|--------|-------|-------------|
| **Random Mixed (10M)** | 68-70M ops/s | **80.64M ops/s** | **+15-19%** 🎉 |
| **Fixed Size (10M)** | 21.3M ops/s | **30.09M ops/s** | **+41%** 🎉 |
| Larson (1T) | SEGV ❌ | SEGV ❌ | 未解決（別問題） |

### 現在の性能比較 (10M iterations, Random Mixed)
```
System malloc: 93M ops/s (baseline)
HAKMEM:        80.64M ops/s (87% of system malloc) ← NEW!
Gap:           ~13% (vs 以前の 27%)
```

**重要**: Phase 27 で「68-70M が限界」と結論したが、今回のバグ修正で **80.64M ops/s** を達成。
System malloc の **87%** まで到達（以前は 73-76%）。

---

## 🐛 Phase UNIFIED-HEADER で発見・修正したバグ

### Bug #1: `tiny_region_id_read_header()` の致命的な実装ミス ⚠️

**問題**: Phase 7 の目的は「SuperSlab lookup（100+ cycles）を排除してヘッダー読み込み（2-3 cycles）で O(1) class 識別」だったが、**実装が逆のことをしていた**

**発見された実装**:
```c
// tiny_region_id.h (修正前)
static inline int tiny_region_id_read_header(void* ptr) {
    // ❌ SuperSlab lookup してメタデータから class_idx を読む（100+ cycles）
    SuperSlab* ss = hak_super_lookup(ptr);
    return (int)ss->slabs[sidx].class_idx;  // これでは Phase 7 の意味がない！
}
```

**修正後**:
```c
// tiny_region_id.h (修正後)
static inline int tiny_region_id_read_header(void* ptr) {
    // ✅ 実際のヘッダーバイトを読む（2-3 cycles）
    uint8_t* header_ptr = (uint8_t*)ptr - 1;
    uint8_t header = *header_ptr;

    // Magic validation
    if ((header & 0xF0) != HEADER_MAGIC) return -1;

    // Extract class_idx
    return (int)(header & HEADER_CLASS_MASK);
}
```

**影響**:
- `class_idx=255` エラーの根本原因（スラブリサイクル時に `meta->class_idx = 255` を読んでしまう）
- Phase 7 の性能改善が発揮されていなかった（100+ cycles のlookup を毎回実行）
- 修正後: Fixed Size +41%, Random Mixed +15-19% 改善

### Bug #2: `tiny_superslab_free.inc.h` - USER→BASE 変換の chicken-and-egg 問題

**問題**: `PTR_USER_TO_BASE(ptr, 0)` で常に class 0 (headerless) を仮定
- C1-C7 (header あり) で間違った base pointer を計算

**修正**: 2段階 lookup
```c
// Step 1: USER ptr で slab を検索
int slab_idx = slab_index_for(ss, ptr);

// Step 2: meta から class を取得
uint8_t cls = meta->class_idx;

// Step 3: 正しい class で BASE に変換
void* base = PTR_USER_TO_BASE(ptr, cls);
```

### Bug #3: `sp_core_box.inc` Stage 3 - free_slab_mask クリア漏れ

**問題**: 新しい SuperSlab 割り当て時に `free_slab_mask` ビットをクリアしていない
- Stage 0.6 が同じ slab を複数の class に誤割り当て

**修正**:
```c
atomic_fetch_and_explicit(&new_ss->free_slab_mask, ~(1u << first_slot), memory_order_release);
```

### Bug #4: `tiny_ultra_fast.inc.h` - Alloc パスの +1 決め打ち

**問題**: `return (char*)base + 1;` が全 class で +1（C0 headerless で間違い）

**修正**: `return PTR_BASE_TO_USER(base, cl);`

---

## 📁 主要な修正ファイル

### 今回の修正（2025-11-26）
- `core/tiny_region_id.h:122-148` - ✅ ヘッダーバイト直接読み込みに修正（Phase 7 本来の設計）
- `core/tiny_superslab_free.inc.h:24-41` - ✅ 2段階 lookup 実装
- `core/box/sp_core_box.inc:693-695` - ✅ Stage 3 free_slab_mask クリア追加
- `core/tiny_ultra_fast.inc.h:55` - ✅ PTR_BASE_TO_USER 使用

### Arena Allocator 実装（以前）
- `core/box/ss_cache_box.inc:138-229` - SSArena allocator 追加
- `core/box/tls_sll_box.h:509-561` - Release mode で recycle check オプショナル化
- `core/tiny_free_fast_v2.inc.h:113-148` - Release mode で cross-check 削除
- `core/hakmem_tiny_sll_cap_box.inc:8-25` - C5 容量を full capacity に変更

---

## 🗃 過去の問題と解決（参考）

### Phase 27: アーキテクチャ限界調査（2025-11-25）
- **結論**: 68-70M ops/s が限界と判断
- **実際**: Bug 修正で **80.64M ops/s** 達成（+15-19%） ← 実装バグが原因だった！

### Arena Allocator 以前の状態
- **Random Mixed (5M ops)**: ~56-60M ops/s, **mmap 418回**
- **根本原因**: SuperSlab = allocation単位 = cache単位 という設計ミスマッチ
- **解決**: Arena allocator 実装 → mmap 92%削減、性能 +15%

---

## 📊 他アロケータとのアーキテクチャ対応（参考）

| HAKMEM | mimalloc | tcmalloc | jemalloc |
|--------|----------|----------|----------|
| SuperSlab (2MB) | Segment (~2MiB) | PageHeap | Extent |
| Slab (64KB) | Page (~64KiB) | Span | Run/slab |
| per-class freelist | pages_queue | Central freelist | bin/slab lists |
| Arena allocator | segment cache | PageHeap | extent_avail |

---

## ⚠️ 既知の問題

### Larson (MT) クラッシュ
- **Status**: 未解決（別のレースコンディション）
- **原因候補**:
  - Cross-thread free（Thread A alloc, Thread B free）
  - TLS SLL stale pointer
  - SuperSlab lifecycle race
- **Next Step**: ENV `HAKMEM_TINY_LARSON_FIX=1` を使った cross-thread 検証

---

## ✅ 完成したマイルストーン

1. **Arena Allocator 実装** - mmap 95% 削減達成 ✅
2. **Phase 27 調査** - アーキテクチャ限界の確認 ✅
3. **Phase UNIFIED-HEADER バグ修正** - 80.64M ops/s 達成 ✅
4. **Header Read 最適化** - SuperSlab lookup 排除 ✅

**現在の推奨**: 80.64M ops/s を新 baseline として、Larson (MT) 問題の解決と Mid-Large ワークロードの最適化に注力する。
-												Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)

## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-26 13:14:18 +09:00
+								# CURRENT TASK – Performance Optimization Status
 								**Last Updated**: 2025-11-26
 								**Scope**: Phase UNIFIED-HEADER Bug Fixes / Header Read Performance
 								---
 								## 🎯 現状サマリ
 								### ✅ Phase UNIFIED-HEADER バグ修正完了 - 大幅な性能改善達成
 								| Benchmark | Before | After | Improvement |
 								|-----------|--------|-------|-------------|
 								| **Random Mixed (10M)** | 68-70M ops/s | **80.64M ops/s** | **+15-19%** 🎉 |
 								| **Fixed Size (10M)** | 21.3M ops/s | **30.09M ops/s** | **+41%** 🎉 |
 								| Larson (1T) | SEGV ❌ | SEGV ❌ | 未解決（別問題） |
 								### 現在の性能比較 (10M iterations, Random Mixed)
 								```
 								System malloc: 93M ops/s (baseline)
 								HAKMEM:        80.64M ops/s (87% of system malloc) ← NEW!
 								Gap:           ~13% (vs 以前の 27%)
 								```
 								**重要**: Phase 27 で「68-70M が限界」と結論したが、今回のバグ修正で **80.64M ops/s** を達成。
 								System malloc の **87%** まで到達（以前は 73-76%）。
 								---
 								## 🐛 Phase UNIFIED-HEADER で発見・修正したバグ
 								### Bug #1: `tiny_region_id_read_header()` の致命的な実装ミス ⚠️
 								**問題**: Phase 7 の目的は「SuperSlab lookup（100+ cycles）を排除してヘッダー読み込み（2-3 cycles）で O(1) class 識別」だったが、**実装が逆のことをしていた**
 								**発見された実装**:
 								```c
 								// tiny_region_id.h (修正前)
 								static inline int tiny_region_id_read_header(void* ptr) {
 								    // ❌ SuperSlab lookup してメタデータから class_idx を読む（100+ cycles）
 								    SuperSlab* ss = hak_super_lookup(ptr);
 								    return (int)ss->slabs[sidx].class_idx;  // これでは Phase 7 の意味がない！
 								}
 								```
 								**修正後**:
 								```c
 								// tiny_region_id.h (修正後)
 								static inline int tiny_region_id_read_header(void* ptr) {
 								    // ✅ 実際のヘッダーバイトを読む（2-3 cycles）
 								    uint8_t* header_ptr = (uint8_t*)ptr - 1;
 								    uint8_t header = *header_ptr;
 								    // Magic validation
 								    if ((header & 0xF0) != HEADER_MAGIC) return -1;
 								    // Extract class_idx
 								    return (int)(header & HEADER_CLASS_MASK);
 								}
 								```
 								**影響**:
 								- `class_idx=255` エラーの根本原因（スラブリサイクル時に `meta->class_idx = 255` を読んでしまう）
 								- Phase 7 の性能改善が発揮されていなかった（100+ cycles のlookup を毎回実行）
 								- 修正後: Fixed Size +41%, Random Mixed +15-19% 改善
 								### Bug #2: `tiny_superslab_free.inc.h` - USER→BASE 変換の chicken-and-egg 問題
 								**問題**: `PTR_USER_TO_BASE(ptr, 0)` で常に class 0 (headerless) を仮定
 								- C1-C7 (header あり) で間違った base pointer を計算
 								**修正**: 2段階 lookup
 								```c
 								// Step 1: USER ptr で slab を検索
 								int slab_idx = slab_index_for(ss, ptr);
 								// Step 2: meta から class を取得
 								uint8_t cls = meta->class_idx;
 								// Step 3: 正しい class で BASE に変換
 								void* base = PTR_USER_TO_BASE(ptr, cls);
 								```
 								### Bug #3: `sp_core_box.inc` Stage 3 - free_slab_mask クリア漏れ
 								**問題**: 新しい SuperSlab 割り当て時に `free_slab_mask` ビットをクリアしていない
 								- Stage 0.6 が同じ slab を複数の class に誤割り当て
 								**修正**:
 								```c
 								atomic_fetch_and_explicit(&new_ss->free_slab_mask, ~(1u << first_slot), memory_order_release);
 								```
 								### Bug #4: `tiny_ultra_fast.inc.h` - Alloc パスの +1 決め打ち
 								**問題**: `return (char*)base + 1;` が全 class で +1（C0 headerless で間違い）
 								**修正**: `return PTR_BASE_TO_USER(base, cl);`
 								---
 								## 📁 主要な修正ファイル
 								### 今回の修正（2025-11-26）
 								- `core/tiny_region_id.h:122-148` - ✅ ヘッダーバイト直接読み込みに修正（Phase 7 本来の設計）
 								- `core/tiny_superslab_free.inc.h:24-41` - ✅ 2段階 lookup 実装
 								- `core/box/sp_core_box.inc:693-695` - ✅ Stage 3 free_slab_mask クリア追加
 								- `core/tiny_ultra_fast.inc.h:55` - ✅ PTR_BASE_TO_USER 使用
 								### Arena Allocator 実装（以前）
 								- `core/box/ss_cache_box.inc:138-229` - SSArena allocator 追加
 								- `core/box/tls_sll_box.h:509-561` - Release mode で recycle check オプショナル化
 								- `core/tiny_free_fast_v2.inc.h:113-148` - Release mode で cross-check 削除
 								- `core/hakmem_tiny_sll_cap_box.inc:8-25` - C5 容量を full capacity に変更
 								---
 								## 🗃 過去の問題と解決（参考）
 								### Phase 27: アーキテクチャ限界調査（2025-11-25）
 								- **結論**: 68-70M ops/s が限界と判断
 								- **実際**: Bug 修正で **80.64M ops/s** 達成（+15-19%） ← 実装バグが原因だった！
 								### Arena Allocator 以前の状態
 								- **Random Mixed (5M ops)**: ~56-60M ops/s, **mmap 418回**
 								- **根本原因**: SuperSlab = allocation単位 = cache単位 という設計ミスマッチ
 								- **解決**: Arena allocator 実装 → mmap 92%削減、性能 +15%
 								---
 								## 📊 他アロケータとのアーキテクチャ対応（参考）
 								| HAKMEM | mimalloc | tcmalloc | jemalloc |
 								|--------|----------|----------|----------|
 								| SuperSlab (2MB) | Segment (~2MiB) | PageHeap | Extent |
 								| Slab (64KB) | Page (~64KiB) | Span | Run/slab |
 								| per-class freelist | pages_queue | Central freelist | bin/slab lists |
 								| Arena allocator | segment cache | PageHeap | extent_avail |
 								---
 								## ⚠️ 既知の問題
 								### Larson (MT) クラッシュ
 								- **Status**: 未解決（別のレースコンディション）
 								- **原因候補**:
 								  - Cross-thread free（Thread A alloc, Thread B free）
 								  - TLS SLL stale pointer
 								  - SuperSlab lifecycle race
 								- **Next Step**: ENV `HAKMEM_TINY_LARSON_FIX=1` を使った cross-thread 検証
 								---
 								## ✅ 完成したマイルストーン
 . **Arena Allocator 実装** - mmap 95% 削減達成 ✅
 . **Phase 27 調査** - アーキテクチャ限界の確認 ✅
 . **Phase UNIFIED-HEADER バグ修正** - 80.64M ops/s 達成 ✅
 . **Header Read 最適化** - SuperSlab lookup 排除 ✅
 								**現在の推奨**: 80.64M ops/s を新 baseline として、Larson (MT) 問題の解決と Mid-Large ワークロードの最適化に注力する。