Moe Charm (CI) 72b38bc994 Phase E3-FINAL: Fix Box API offset bugs - ALL classes now use correct offsets

## Root Cause Analysis (GPT5)

**Physical Layout Constraints**:
- Class 0: 8B = [1B header][7B payload] → offset 1 = 9B needed = ❌ IMPOSSIBLE
- Class 1-6: >=16B = [1B header][15B+ payload] → offset 1 = ✅ POSSIBLE
- Class 7: 1KB → offset 0 (compatibility)

**Correct Specification**:
- HAKMEM_TINY_HEADER_CLASSIDX != 0:
  - Class 0, 7: next at offset 0 (overwrites header when on freelist)
  - Class 1-6: next at offset 1 (after header)
- HAKMEM_TINY_HEADER_CLASSIDX == 0:
  - All classes: next at offset 0

**Previous Bug**:
- Attempted "ALL classes offset 1" unification
- Class 0 with offset 1 caused immediate SEGV (9B > 8B block size)
- Mixed 2-arg/3-arg API caused confusion

## Fixes Applied

### 1. Restored 3-Argument Box API (core/box/tiny_next_ptr_box.h)
```c
// Correct signatures
void tiny_next_write(int class_idx, void* base, void* next_value)
void* tiny_next_read(int class_idx, const void* base)

// Correct offset calculation
size_t offset = (class_idx == 0 || class_idx == 7) ? 0 : 1;
```
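As a minimal sketch of how the 3-argument API could be implemented when `HAKMEM_TINY_HEADER_CLASSIDX != 0`, the block below keeps the signatures from the commit and fills in bodies that are assumptions, not the project's actual code. The helper `tiny_next_offset()` is a hypothetical name introduced only for this illustration, and `memcpy` is used because a pointer stored at offset 1 is not naturally aligned.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset of the freelist "next" field inside a block.
 * Classes 0 and 7 reuse the header byte (offset 0); classes 1-6 keep the
 * 1-byte header and store next right after it (offset 1). */
static inline size_t tiny_next_offset(int class_idx) {
    return (class_idx == 0 || class_idx == 7) ? 0u : 1u;
}

/* Store the next pointer. memcpy avoids an unaligned pointer store. */
static inline void tiny_next_write(int class_idx, void *base, void *next_value) {
    memcpy((unsigned char *)base + tiny_next_offset(class_idx),
           &next_value, sizeof next_value);
}

/* Load the next pointer from the same class-dependent offset. */
static inline void *tiny_next_read(int class_idx, const void *base) {
    void *next;
    memcpy(&next, (const unsigned char *)base + tiny_next_offset(class_idx),
           sizeof next);
    return next;
}
```

Passing `class_idx` on every call is what lets one function serve both layouts; a 2-argument variant has no way to know which offset applies.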

### 2. Updated 123+ Call Sites Across 34 Files
- hakmem_tiny_hot_pop_v4.inc.h (4 locations)
- hakmem_tiny_fastcache.inc.h (3 locations)
- hakmem_tiny_tls_list.h (12 locations)
- superslab_inline.h (5 locations)
- tiny_fastcache.h (3 locations)
- ptr_trace.h (macro definitions)
- tls_sll_box.h (2 locations)
- + 27 additional files

Pattern: `tiny_next_read(base)` → `tiny_next_read(class_idx, base)`
Pattern: `tiny_next_write(base, next)` → `tiny_next_write(class_idx, base, next)`

### 3. Added Sentinel Detection Guards
- tiny_fast_push(): Block nodes with sentinel in ptr or ptr->next
- tls_list_push(): Block nodes with sentinel in ptr or ptr->next
- Defense-in-depth against remote free sentinel leakage

## Verification (GPT5 Report)

**Test Command**: `./out/release/bench_random_mixed_hakmem --iterations=70000`

**Results**:
- ✅ Main loop completed successfully
- ✅ Drain phase completed successfully
- ✅ NO SEGV (previous crash at iteration 66151 is FIXED)
- ℹ️ Final log: "tiny_alloc(1024) failed" is normal fallback to Mid/ACE layers

**Analysis**:
- Class 0 immediate SEGV: ✅ RESOLVED (correct offset 0 now used)
- 66K iteration crash: ✅ RESOLVED (offset consistency fixed)
- Box API conflicts: ✅ RESOLVED (unified 3-arg API)

## Technical Details

### Offset Logic Justification
```
Class 0:  8B block → next pointer (8B) fits ONLY at offset 0
Class 1: 16B block → next pointer (8B) fits at offset 1 (after 1B header)
Class 2: 32B block → next pointer (8B) fits at offset 1
...
Class 6: 512B block → next pointer (8B) fits at offset 1
Class 7: 1024B block → offset 0 for legacy compatibility
```
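The arithmetic behind this table can be checked with a small sketch. It assumes the block sizes listed above (8 B for class 0, doubling up to 1024 B for class 7) and the 8-byte next pointer stated in the table; the function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Next-pointer size as stated in the table above (8 bytes). */
enum { TINY_NEXT_PTR_SIZE = 8 };

/* Block size per class as listed above: 8, 16, 32, ..., 1024 bytes. */
static inline size_t tiny_block_size(int class_idx) {
    return (size_t)8 << class_idx;
}

/* True if an 8-byte next field starting at `offset` stays inside the block. */
static inline bool tiny_next_fits(int class_idx, size_t offset) {
    return offset + TINY_NEXT_PTR_SIZE <= tiny_block_size(class_idx);
}
```

With these definitions, `tiny_next_fits(0, 1)` is false (1 + 8 = 9 > 8), which is the overflow that caused the Class 0 SEGV, while `tiny_next_fits(1, 1)` is true.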

### Files Modified (Summary)
- Core API: `box/tiny_next_ptr_box.h`
- Hot paths: `hakmem_tiny_hot_pop*.inc.h`, `tiny_fastcache.h`
- TLS layers: `hakmem_tiny_tls_list.h`, `hakmem_tiny_tls_ops.h`
- SuperSlab: `superslab_inline.h`, `tiny_superslab_*.inc.h`
- Refill: `hakmem_tiny_refill.inc.h`, `tiny_refill_opt.h`
- Free paths: `tiny_free_magazine.inc.h`, `tiny_superslab_free.inc.h`
- Documentation: Multiple Phase E3 reports

## Remaining Work

None for Box API offset bugs - all structural issues resolved.

Future enhancements (non-critical):
- Periodic `grep -RnF '*(void**)' core/` to detect direct pointer access violations (`-F` matches the pattern as a fixed string rather than a regex)
- Enforce Box API usage via static analysis
- Document offset rationale in architecture docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-13 06:50:20 +09:00

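HAKMEM Memory Allocator - Claude Work Log

This file records important information from development sessions with Claude.

Project Overview

HAKMEM is a high-performance memory allocator with the following goals:

Average performance roughly on par with mimalloc
Better memory efficiency through a smart learning layer
Particularly strong performance for Mid-Large (8-32KB) allocations

📊 Current Performance (2025-11-09)

Benchmark Results

Tiny (256B):         2.76M ops/s (P0 ON, 100K iterations) 🏆
Mid-Large (8-32KB):  167.75M vs System 61.81M (+171%) 🏆

Key Findings

Major improvement in Phase 7 - Header-based fast free (+180-280%)
P0 batch optimization - stable operation achieved after the meta->used fix
Mid-Large wins decisively - SuperSlab efficiency gives +171% over System
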
🔥 CRITICAL FIX: Pointer Conversion Bug (2025-11-13) ✅

Root Cause: DOUBLE CONVERSION (USER → BASE executed twice)

Status: ✅ FIXED - Minimal patch (< 15 lines)

Symptoms:

C7 (1KB) alignment error: delta % 1024 == 1 (off by one)
Error log: [C7_ALIGN_CHECK_FAIL] ptr=0x...402 base=0x...401
Expected: delta % 1024 == 0 (aligned to block boundary)

Root Cause:

// core/tiny_superslab_free.inc.h (before fix)
static inline void hak_tiny_free_superslab(void* ptr, SuperSlab* ss) {
    int slab_idx = slab_index_for(ss, ptr);  // ← Uses USER pointer (wrong!)
    // ... 8 lines ...
    void* base = (void*)((uint8_t*)ptr - 1);  // ← Converts USER → BASE

    // Problem: On 2nd free cycle, ptr is already BASE, so:
    // base = BASE - 1 = storage - 1 ← DOUBLE CONVERSION! Off by one!
}

Fix (line 17-24):

static inline void hak_tiny_free_superslab(void* ptr, SuperSlab* ss) {
    // ✅ FIX: Convert USER → BASE at entry point (single conversion)
    void* base = (void*)((uint8_t*)ptr - 1);

    // CRITICAL: Use BASE pointer for slab_index calculation!
    int slab_idx = slab_index_for(ss, base);  // ← Fixed!
    // ... rest of function uses BASE consistently
}
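The off-by-one symptom can be reproduced with a small sketch. It assumes, as the fix above does, that the USER pointer is BASE + 1 and that C7 blocks are 1024 bytes; `tiny_user_to_base()` and `block_delta()` are hypothetical helpers written for this illustration, not functions from the code base.

```c
#include <stddef.h>
#include <stdint.h>

enum { C7_BLOCK_SIZE = 1024 };

/* USER pointer = BASE + 1 (the byte after the 1-byte header). */
static inline void *tiny_user_to_base(void *user) {
    return (uint8_t *)user - 1;
}

/* Offset of `p` inside its block, measured from the start of the slab.
 * A correctly converted BASE pointer yields 0. */
static inline size_t block_delta(const void *slab_start, const void *p) {
    return (size_t)((const uint8_t *)p - (const uint8_t *)slab_start)
           % C7_BLOCK_SIZE;
}
```

For a block at slab offset 2048, the USER pointer gives a delta of 1, a single conversion gives 0, and converting twice gives 1023, which is why the conversion has to happen exactly once at the entry point.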

Verification:

# Before fix: [C7_ALIGN_CHECK_FAIL] delta%blk=1
# After fix: No errors
./out/release/bench_fixed_size_hakmem 10000 1024 128  # ✅ PASS

Detailed Report: POINTER_CONVERSION_BUG_ANALYSIS.md, POINTER_FIX_SUMMARY.md

🔥 CRITICAL FIX: P0 TLS Stale Pointer Bug (2025-11-09) ✅

Root Cause: Active Counter Corruption

Status: ✅ FIXED - 1-line patch

Symptoms:

SEGV crash in bench_fixed_size (256B, 1KB)
Active counter corruption: active_delta=-991 when allocating 128 blocks
Trying to allocate 128 blocks from slab with capacity=64

Root Cause:

// core/hakmem_tiny_refill_p0.inc.h:256-262 (before fix)
if (meta->carved >= meta->capacity) {
    if (superslab_refill(class_idx) == NULL) break;
    meta = tls->meta;  // ← Updates meta, but tls is STALE!
    continue;
}
ss_active_add(tls->ss, batch);  // ← Updates WRONG SuperSlab counter!

After superslab_refill() switches to a new SuperSlab, the local tls pointer becomes stale (still points to the old SuperSlab). Subsequent ss_active_add(tls->ss, batch) updates the WRONG SuperSlab's active counter, causing:

SuperSlab A's counter incorrectly incremented
SuperSlab B's counter unchanged (should have been incremented)
When blocks from B are freed → counter underflow → SEGV

Fix (line 279):

if (meta->carved >= meta->capacity) {
    if (superslab_refill(class_idx) == NULL) break;
    tls = &g_tls_slabs[class_idx];  // ← RELOAD TLS after slab switch!
    meta = tls->meta;
    continue;
}
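The effect of the missing reload can be illustrated with a deliberately simplified model. Everything here is an assumption made for the sketch: the `Sketch*` types, the globals, and the idea that the caller holds a cached view of the TLS slot taken before the refill. The real structures and refill logic are more involved.

```c
/* Hypothetical, simplified stand-ins for the allocator's real structures. */
typedef struct { int active; } SketchSuperSlab;
typedef struct { SketchSuperSlab *ss; int carved; int capacity; } SketchTlsSlab;

static SketchSuperSlab g_old_ss, g_new_ss;
static SketchTlsSlab g_slot = { &g_old_ss, 64, 64 };  /* slab already full */

/* Stand-in for superslab_refill(): binds the TLS slot to a new SuperSlab. */
static void sketch_refill(void) {
    g_slot.ss = &g_new_ss;
    g_slot.carved = 0;
}

/* Credits `batch` blocks to the active counter.
 * reload == 0 models the bug, reload != 0 models the fix. */
static void sketch_credit_batch(int batch, int reload) {
    SketchTlsSlab tls = g_slot;      /* cached view taken before the refill */
    if (tls.carved >= tls.capacity) {
        sketch_refill();
        if (reload) tls = g_slot;    /* the fix: re-read after the switch */
    }
    tls.ss->active += batch;
}
```

Without the reload, the batch is credited to the old SuperSlab, so later frees against the new one would drive its counter below zero.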

Verification:

256B fixed-size: 862K ops/s (stable, 200K iterations, 0 crashes) ✅
1KB fixed-size:  872K ops/s (stable, 200K iterations, 0 crashes) ✅
Stability test:  3/3 runs passed ✅
Counter errors:  0 (was: active_delta=-991) ✅

Detailed Report: TINY_256B_1KB_SEGV_FIX_REPORT.md

🚀 Phase 7: Header-Based Fast Free (2025-11-08) ✅

Achievements

+180-280% performance improvement (Random Mixed 128-1024B)
O(1) class identification via a 1-byte header (0xa0 | class_idx)
Ultra-fast free path (3-5 instructions)

Key Techniques

Header write - a 1-byte header is added at allocation time
Fast free - no SuperSlab lookup; push directly onto the TLS SLL
Hybrid mincore - mincore() runs only at page boundaries (99.9% of cases take 1-2 cycles)
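A minimal sketch of the header scheme described above, assuming the header byte is `0xa0 | class_idx` as stated and that the class index occupies the low nibble. The function names and the validity check on the high nibble are assumptions for illustration, not the API in core/tiny_region_id.h.

```c
#include <stdint.h>

#define TINY_HEADER_MAGIC      0xa0u  /* high nibble, as given in the log */
#define TINY_HEADER_CLASS_MASK 0x0fu  /* assumed: class index in low nibble */

/* Write the header at BASE and return the USER pointer (BASE + 1). */
static inline void *tiny_header_write(void *base, int class_idx) {
    uint8_t *p = (uint8_t *)base;
    p[0] = (uint8_t)(TINY_HEADER_MAGIC |
                     ((unsigned)class_idx & TINY_HEADER_CLASS_MASK));
    return p + 1;
}

/* Read the class index from the byte before USER.
 * Returns -1 if the byte does not carry the expected magic nibble. */
static inline int tiny_header_class(const void *user) {
    uint8_t h = ((const uint8_t *)user)[-1];
    if ((h & 0xf0u) != TINY_HEADER_MAGIC) return -1;
    return (int)(h & TINY_HEADER_CLASS_MASK);
}
```

Because the class comes from a single byte load next to the pointer, the free path needs no SuperSlab lookup to pick the right TLS list.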

Results

Random Mixed 128B:  21M → 59M ops/s (+181%)
Random Mixed 256B:  19M → 70M ops/s (+268%)
Random Mixed 512B:  21M → 68M ops/s (+224%)
Random Mixed 1024B: 21M → 65M ops/s (+210%)
Larson 1T:          631K → 2.63M ops/s (+333%)

Build

./build.sh bench_random_mixed_hakmem  # Phase 7 flags are set automatically

Main files:

core/tiny_region_id.h - Header write API
core/tiny_free_fast_v2.inc.h - Ultra-fast free (3-5 instructions)
core/box/hak_free_api.inc.h - Dual-header dispatch

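🐛 P0 Batch Optimization: Critical Bug Fix (2025-11-09) ✅

Problem

With P0 (batch refill optimization) ON, a SEGV occurred in the 100K-iteration run

Investigation Process

Phase 1: Build system problem

Found by Task-sensei: a build error meant a stale binary was being run
Fixed by Claude: added a local size table (2 lines)
Result: 100K succeeded with P0 OFF (2.73M ops/s)

Phase 2: The real P0 bug

Found by ChatGPT-sensei: missing meta->used increment

Root Cause

P0 path (before the fix, buggy):

trc_pop_from_freelist(meta, ..., &chain);  // batch pop from the freelist
trc_splice_to_sll(&chain, &g_tls_sll_head[cls]);  // splice onto the SLL
// meta->used += count;  ← this is missing! 💀

Impact:

meta->used drifts away from the actual number of blocks in use
The carve decision goes wrong → memory corruption → SEGV

ChatGPT-sensei's Fix

trc_splice_to_sll(...);
ss_active_add(tls->ss, from_freelist);
meta->used = (uint16_t)((uint32_t)meta->used + from_freelist);  // ← added! ✅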
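The invariant behind this fix is that every counter tracking the same blocks moves together. The sketch below shows that bookkeeping in isolation; the `Sketch*` structs and `sketch_account_batch()` are hypothetical simplifications, and only the `meta->used` update mirrors the line shown above.

```c
#include <stdint.h>

/* Hypothetical, reduced versions of the slab metadata and SuperSlab. */
typedef struct { uint16_t used; uint16_t capacity; } SketchSlabMeta;
typedef struct { uint32_t active; } SketchSlabOwner;

/* After moving `from_freelist` blocks from the slab freelist to the TLS SLL,
 * bump both the SuperSlab active counter and the per-slab used counter. */
static inline void sketch_account_batch(SketchSlabOwner *ss,
                                        SketchSlabMeta *meta,
                                        uint32_t from_freelist) {
    ss->active += from_freelist;
    meta->used = (uint16_t)((uint32_t)meta->used + from_freelist);
}
```

Keeping both updates in one helper is one way to make it harder to forget either of them when a new batch path is added.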

Additional implementation (runtime A/B hooks):

HAKMEM_TINY_P0_ENABLE=1 - enables P0
HAKMEM_TINY_P0_NO_DRAIN=1 - disables remote drain (for isolating problems)
HAKMEM_TINY_P0_LOG=1 - counter verification log
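A runtime A/B hook of this kind can be as small as a `getenv()` lookup. The sketch below is an assumption about how such a switch might be read, not the project's actual parsing: the helper names are hypothetical, and treating an empty value or a value starting with `0` as "off" is a choice made for this example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Interpret a flag value: unset, empty, or starting with '0' means off. */
static inline int sketch_flag_value_enabled(const char *value) {
    return value != NULL && value[0] != '\0' && value[0] != '0';
}

/* Look up an environment switch such as HAKMEM_TINY_P0_ENABLE. */
static inline int sketch_env_flag_enabled(const char *name) {
    return sketch_flag_value_enabled(getenv(name));
}
```

In a hot path the result would normally be read once and cached, since `getenv()` is too slow to call on every allocation.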

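Results After the Fix

Configuration         Before fix          After fix
P0 OFF                2.51-2.59M ops/s    2.73M ops/s
P0 ON + NO_DRAIN      ❌ SEGV             ✅ 2.45M ops/s
P0 ON (recommended)   ❌ SEGV             ✅ 2.76M ops/s 🏆

100K iterations: all tests passed

Recommended Production Settings

export HAKMEM_TINY_P0_ENABLE=1
./out/release/bench_random_mixed_hakmem

Performance: 2.76M ops/s (fastest, stable)

Known Warnings (non-fatal)

COUNTER_MISMATCH:

Frequency: rare (1-2 occurrences in 10K-100K iterations)
Impact: none (no crash, no performance impact)
Action: keep auditing (low priority)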

🎯 Pool TLS Phase 1.5a: Lock-Free Arena (2025-11-09) ✅

Overview

Lock-free TLS arena with chunk carving for 8KB-52KB allocations

Results

Pool TLS Phase 1.5a: 1.79M ops/s (8KB allocations)
System malloc:       0.19M ops/s (8KB allocations)
Ratio:              947% (9.47x faster!) 🏆

Architecture

Box P1: Pool TLS API (ultra-fast alloc/free)
Box P2: Refill Manager (batch allocation)
Box P3: TLS Arena Backend (exponential chunk growth 1MB→8MB)
Box P4: System Memory API (mmap wrapper)
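The exponential chunk growth in Box P3 can be sketched as a doubling schedule with a cap. This assumes a growth factor of 2, which matches going from 1MB to 8MB in three steps but is not confirmed by the log; `sketch_arena_chunk_mb()` is a hypothetical helper, not part of core/pool_tls_arena.

```c
#include <stddef.h>

/* Chunk size in MB at a given growth level: start at init_mb, double per
 * level (assumed factor), and never exceed max_mb. */
static inline size_t sketch_arena_chunk_mb(size_t init_mb, size_t max_mb,
                                           unsigned level) {
    size_t mb = init_mb;
    for (unsigned i = 0; i < level && mb < max_mb; i++) {
        mb *= 2;
    }
    return mb < max_mb ? mb : max_mb;
}
```

With an initial size of 1MB and a maximum of 8MB this yields 1, 2, 4, 8 for levels 0 to 3, and stays at 8 for any higher level.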

Build

./build.sh bench_mid_large_mt_hakmem  # Pool TLS is enabled automatically

Main files:

core/pool_tls.h/c - TLS freelist + size-to-class
core/pool_refill.h/c - Batch refill
core/pool_tls_arena.h/c - Chunk management

📝 Development History (Summary)

Phase 2: Design Flaws Analysis (2025-11-08) 🔍

Found design flaws in the fixed-size caches
SuperSlab fixed at 32 slabs, TLS Cache with fixed capacity, etc.
Details: DESIGN_FLAWS_ANALYSIS.md

Phase 6-1.7: Box Theory Refactoring (2025-11-05) ✅

Ultra-Simple Fast Path (3-4 instructions)
+64% performance improvement (Larson 1.68M → 2.75M ops/s)
Details: core/tiny_alloc_fast.inc.h, core/tiny_free_fast.inc.h

Phase 6-2.1: P0 Optimization (2025-11-05) ✅

superslab_refill reduced from O(n) to O(1) (using ctz)
Introduced nonempty_mask
Details: core/hakmem_tiny_superslab.h, core/hakmem_tiny_refill_p0.inc.h
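The O(n) to O(1) change rests on a standard bitmask trick: keep one bit per slab and use count-trailing-zeros to find the first set bit. The sketch below assumes a 32-bit mask (matching the 32 slabs per SuperSlab mentioned above) and the GCC/Clang builtin `__builtin_ctz`; the function name is hypothetical.

```c
#include <stdint.h>

/* Index of the first slab whose bit is set in nonempty_mask, or -1 if none.
 * __builtin_ctz is undefined for 0, so that case is handled first. */
static inline int sketch_first_nonempty_slab(uint32_t nonempty_mask) {
    if (nonempty_mask == 0) return -1;
    return __builtin_ctz(nonempty_mask);
}
```

Instead of scanning slab metadata one entry at a time, the refill path can jump straight to a candidate slab with one instruction on most targets.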

Phase 6-2.3: Active Counter Fix (2025-11-07) ✅

Fixed the missing active counter increment in P0 batch refill
Stable 4T operation achieved (838K ops/s)

Phase 6-2.2: Sanitizer Compatibility (2025-11-07) ✅

Fixed ASan/TSan builds
Introduced HAKMEM_FORCE_LIBC_ALLOC_BUILD=1

🛠️ Build System

Basic Build

./build.sh <target>           # Release build (recommended)
./build.sh debug <target>     # Debug build
./build.sh help               # Show help
./build.sh list               # List targets

Main Targets

bench_random_mixed_hakmem - Tiny 1T mixed
bench_pool_tls_hakmem - Pool TLS 8-52KB
bench_mid_large_mt_hakmem - Mid-Large MT 8-32KB
larson_hakmem - Larson mixed

Pinned Flags

POOL_TLS_PHASE1=1
POOL_TLS_PREWARM=1
HEADER_CLASSIDX=1
AGGRESSIVE_INLINE=1
PREWARM_TLS=1
BUILD_RELEASE_DEFAULT=1  # Release mode

Environment Variables (Pool TLS Arena)

export HAKMEM_POOL_TLS_ARENA_MB_INIT=2   # default 1
export HAKMEM_POOL_TLS_ARENA_MB_MAX=16   # default 8
export HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS=4  # default 3

Environment Variables (P0)

export HAKMEM_TINY_P0_ENABLE=1      # Enable P0 (recommended)
export HAKMEM_TINY_P0_NO_DRAIN=1    # Disable remote drain (for debugging)
export HAKMEM_TINY_P0_LOG=1         # Counter verification log

🔍 Debugging and Profiling

Perf

perf stat -e cycles,instructions,branches,branch-misses,cache-misses -r 3 -- ./<bin>

Strace

strace -e trace=mmap,madvise,munmap -c ./<bin>

Build Verification

./build.sh verify <binary>
make print-flags

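📚 Key Documents

BUILDING_QUICKSTART.md - Build quick start
LARSON_GUIDE.md - Larson benchmark integration guide
HISTORY.md - Record of failed optimizations
100K_SEGV_ROOT_CAUSE_FINAL.md - Detailed investigation of the P0 SEGV
P0_INVESTIGATION_FINAL.md - Comprehensive P0 investigation report
DESIGN_FLAWS_ANALYSIS.md - Design flaw analysis

🎓 Lessons Learned

Importance of build verification - an unnoticed build error can leave you running a stale binary
Counter consistency - batch optimizations must keep every counter in sync
Power of runtime A/B - environment variables make it possible to isolate the faulty component
Header-based optimization - a single byte can deliver a dramatic performance gain
Box Theory - clear boundaries give both safety and performance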

🚀 Next Optimization Candidates

Priority: Low (already fast enough)

Final check of branch-miss/IPC with perf A/B (release)
COUNTER_MISMATCH threshold/frequency logging
Light tuning of class5/6 front priority and branch hints
Pool TLS Phase 1.5b: Pre-warm + adaptive refill

Priority: Medium (design improvements)

SuperSlab dynamic expansion (mimalloc-style linked chunks)
TLS Cache adaptive sizing
BigCache hash table with chaining

📊 Current Status

Phase 7 (Header-based fast free):  ✅ COMPLETE (+180-280%)
P0 (Batch refill optimization):    ✅ COMPLETE (2.76M ops/s)
Pool TLS (8-52KB arena):            ✅ COMPLETE (9.47x vs System)
Build System:                       ✅ STABLE (release/debug switching)
Production Readiness:               ✅ READY (P0 ON recommended)

Recommended production settings:

export HAKMEM_TINY_P0_ENABLE=1
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 100000 256 42
# Expected: 2.76M ops/s ✅
