hakmem/CURRENT_TASK.md
Moe Charm (CI) cef99b311d Phase 15: Box Separation (partial) - Box headers completed, routing deferred
**Status**: Box FG V2 + ExternalGuard implemented; hak_free_at routing reverted to Phase 14-C

**Files Created**:
1. core/box/front_gate_v2.h (98 lines)
   - Ultra-fast 1-byte header classification (TINY/POOL/MIDCAND/EXTERNAL)
   - Performance: 2-5 cycles
   - Same-page guard added (defensive programming)

2. core/box/external_guard_box.h (146 lines)
   - ENV-controlled mincore safety check
   - HAKMEM_EXTERNAL_GUARD_MINCORE=0/1 (default: OFF)
   - Uses __libc_free() to avoid infinite loop

**Routing**:
- hak_free_at reverted to Phase 14-C (classify_ptr-based, stable)
- Phase 15 routing caused SEGV on page-aligned pointers

**Performance**:
- Phase 14-C (mincore ON): 16.5M ops/s (stable)
- mincore: 841 calls/100K iterations
- mincore OFF: SEGV (unsafe AllocHeader deref)

**Next Steps** (deferred):
- Mid/Large/C7 registry consolidation
- AllocHeader safety validation
- ExternalGuard integration

**Recommendation**: Stick with Phase 14-C for now
- mincore overhead acceptable (~1.9ms / 100K)
- Focus on other bottlenecks (TLS SLL, SuperSlab churn)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-15 22:08:51 +09:00


# CURRENT TASK Phase 14 (TinyUltraHot / C2/C3 Ultra-Fast Path)
**Date**: 2025-11-15
**Status**: ✅ TinyUltraHot implemented (+16-36% on C2/C3 workloads); Phase 13 TinyHeapV2 = safe stub
**Owner**: Claude Code → implementation complete
---
## 1. Where things stand overall
- Tiny (0-1023B):
  - Front: the NEW 3-layer front (bump / small_mag / slow) is stable.
  - TinyHeapV2: the "alloc front stats" are implemented, but nothing supplies the magazine → hit rate is 0%.
  - Drain: TLS SLL drain interval = 2048 (default). Tiny random mixed runs at roughly 9M ops/s.
- Mid (1KB-32KB):
  - GAP fixed: lowered to `MID_MIN_SIZE=1024` so that Mid handles 1KB-8KB.
  - With Pool TLS ON by default (mid bench): ~10.6M ops/s (faster than system malloc).
- Shared SuperSlab Pool (SPSLOT Box):
  - Implemented. SuperSlab count -92%, mmap/munmap -48%, throughput +131%.
  - Lock contention (Stage 2) has been addressed up to P0-5, for an improvement of about +23%.

Conclusion: the Mid / Shared Pool side is "done for now, as far as research goals go".
The large remaining headroom is in the **Tiny front (C0-C3)** and **some Tiny benchmarks (Larson / 1KB fixed)**.
---
## 2. Phase 14: TinyUltraHot Box (2025-11-15) ✅
### 2.1 Implementation Overview
**ChatGPT Phase 14 strategy**: attack L1 dcache misses
- **Problem**: `perf stat` comparison against system malloc
  - L1 dcache misses: 30x worse (2.9M vs 96K)
  - Instructions: 6.2x more (281M vs 45M)
  - Branches: 7.1x more (59M vs 8.3M)
- **Solution**: an ultra-simple straight-line path specialized for the 16B/32B classes (HAKMEM C2/C3, called C1/C2 inside UltraHot)
  - Target: ~60% of tiny allocations
  - Magazine-based (4 slots per class)
  - Single cache line TLS structure (64B aligned)
  - 5-7 instructions per alloc/free
### 2.2 Implementation Details
**Box**: `TinyUltraHot` (L0 ultra-fast path, C2/C3 = 16B/32B only)
**Files**:
- `core/front/tiny_ultra_hot.h` (343 lines, self-contained)
- `core/hakmem_tiny.c` (TLS + stats)
- `core/tiny_alloc_fast.inc.h` (alloc hook)
- `core/tiny_free_fast_v2.inc.h` (free hook)
- `bench_random_mixed.c` (stats output added)
**TLS Structure**:
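```c
#include <stdint.h>

typedef struct {
    void*   c1_mag[4];       // HAKMEM C2 (16B) magazine: 4 pointers = 32B on 64-bit
    void*   c2_mag[4];       // HAKMEM C3 (32B) magazine: 4 pointers = 32B on 64-bit
    uint8_t c1_top, c2_top;  // Stack tops for the two magazines
    // Statistics...
} __attribute__((aligned(64))) TinyUltraHot;
```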
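The alloc/free fast path over such a magazine is not shown in this file. A minimal sketch, assuming each magazine is a LIFO stack indexed by its `top` counter; `UltraHotMag`, `uh_pop` and `uh_push` are hypothetical names for illustration, not the contents of `tiny_ultra_hot.h`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UH_SLOTS 4  /* 4 slots per class, as stated above */

/* Hypothetical single-class magazine: a tiny LIFO stack of free blocks. */
typedef struct {
    void*   slots[UH_SLOTS];
    uint8_t top;              /* number of cached blocks, 0..UH_SLOTS */
} UltraHotMag;

/* Alloc side: pop the most recently freed block, or return NULL on a
 * miss so the caller can fall through to the next layer. */
static inline void* uh_pop(UltraHotMag* m) {
    if (m->top == 0) return NULL;
    return m->slots[--m->top];
}

/* Free side: cache the block if there is room. Returns 0 when the
 * magazine is full so the caller uses the regular free path instead. */
static inline int uh_push(UltraHotMag* m, void* p) {
    assert(p != NULL);
    if (m->top >= UH_SLOTS) return 0;
    m->slots[m->top++] = p;
    return 1;
}
```

Both operations are one compare plus one indexed load or store, which is roughly the instruction budget quoted above; the real code may differ.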
**ENV Controls**:
- `HAKMEM_TINY_ULTRA_HOT=0/1` - Enable/disable (default: 1)
- `HAKMEM_TINY_ULTRA_HOT_STATS=0/1` - Print stats at exit
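How these flags are parsed is not shown here. One plausible shape, as a sketch only: `env_flag` and `ultra_hot_enabled` are hypothetical helpers, and treating any value that does not start with `0` as "on" is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a 0/1 style flag. Unset or empty falls back to the default. */
static inline int env_flag(const char* name, int default_on) {
    assert(name != NULL);
    const char* v = getenv(name);
    if (v == NULL || *v == '\0') return default_on;
    return v[0] != '0';
}

/* Cached lookup so the hot path does not call getenv() repeatedly.
 * -1 means "not read yet". A real allocator would more likely resolve
 * this once at init time, before any threads exist. */
static inline int ultra_hot_enabled(void) {
    static int cached = -1;
    if (cached < 0) cached = env_flag("HAKMEM_TINY_ULTRA_HOT", 1);
    return cached;
}
```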
**Class Mapping** (CRITICAL):
- C0 = 8B (not covered)
- C1 = ? (unknown)
- **C2 = 16B** ← UltraHot C1
- **C3 = 32B** ← UltraHot C2
- C4+ = 64B+ (not covered)
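The mapping above can be written as a size check. This is a sketch under two assumptions: requests of 8B or less go to C0, and the class boundaries are inclusive (`size <= 16`, `size <= 32`). `ultrahot_class_for_size` is a hypothetical name.

```c
#include <stddef.h>

/* Returns the HAKMEM class index that UltraHot would serve, or -1 if
 * the request falls outside its range. */
static inline int ultrahot_class_for_size(size_t size) {
    if (size <= 8)  return -1;  /* C0 (8B): not covered */
    if (size <= 16) return 2;   /* C2 = 16B */
    if (size <= 32) return 3;   /* C3 = 32B (a 24B request lands here too) */
    return -1;                  /* C4+ (64B+): not covered */
}
```

Note that no size maps to class index 1 in this sketch, matching the "C1 = ? (unknown)" entry above.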
### 2.3 Performance Results
**Fixed-Size Benchmarks** (100K iterations, 128 workset):
| Size | Baseline | UltraHot ON | Improvement | Hit Rate |
|------|----------|-------------|-------------|----------|
| **16B (C2)** | 48.2M ops/s | 58.3M ops/s | **+20.9%** | 99.9% |
| **32B (C3)** | 45.1M ops/s | 55.9M ops/s | **+23.9%** | 99.9% |
**Extended C2/C3 Tests** (200K iterations, 256 workset):
| Size | Baseline | UltraHot ON | Improvement |
|------|----------|-------------|-------------|
| **16B (C2)** | 40.4M ops/s | 55.0M ops/s | **+36.2%** |
| **32B (C3)** | 43.5M ops/s | 50.6M ops/s | **+16.3%** |
| 24B (C3) | 43.5M ops/s | 44.6M ops/s | +2.5% |
**Random Mixed 256B** (100K iterations):
- Baseline: 8.96M ops/s
- UltraHot ON: 8.81M ops/s (-1.6%)
- **Reason**: C2/C3 coverage = only 1-2% of workload
  - UltraHot C1 (HAKMEM C2): alloc=45 (0.045%), free=820 (0.82%)
  - UltraHot C2 (HAKMEM C3): alloc=828 (0.83%), free=1,567 (1.57%)
  - Size distribution: 16-1040B (C2/C3 = ~1.7% of range)
- **Conclusion**: UltraHot overhead is small (-1.6%) on non-target workloads
### 2.4 Design Decisions
**Why C2/C3 only?**
- Cover ~60% of tiny allocations (ChatGPT estimate for 16B/32B)
- Small magazine (4 slots) fits in 1-2 cache lines
- Size check trivial (size <= 16 / size <= 32)
- Larger classes (C4+) have different access patterns
**Why 4 slots per magazine?**
- Target: 1 cache line (64B) for the hot magazine state
- C1 mag (32B) + C2 mag (32B) = 64B (first cache line)
- Counters + stats in second cache line
- Trade capacity for cache locality
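The cache-line arithmetic can be checked with `offsetof` and `sizeof`. A small sketch, assuming 8-byte pointers; `UltraHotLayout` is a stats-free copy of the struct from section 2.2, made up here for the measurement:

```c
#include <stddef.h>
#include <stdint.h>

/* Same field order as TinyUltraHot, without the statistics fields. */
typedef struct {
    void*   c1_mag[4];
    void*   c2_mag[4];
    uint8_t c1_top, c2_top;
} __attribute__((aligned(64))) UltraHotLayout;

/* Bytes occupied by the two magazines: the offset of the first counter. */
static inline size_t ultrahot_mag_bytes(void) {
    return offsetof(UltraHotLayout, c1_top);
}

/* Total footprint, rounded up to the 64B alignment by the compiler. */
static inline size_t ultrahot_total_bytes(void) {
    return sizeof(UltraHotLayout);
}
```

On a 64-bit target the magazines take 2 × 4 × 8 = 64B, so the counters start at offset 64 and the struct rounds up to 128B, i.e. two cache lines.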
**Integration with existing layers**:
- **L0** (fastest): TinyUltraHot (C2/C3 only)
- **L1** (fast): TinyHeapV2 (C0-C3, 16 slots, Phase 13)
- **L2** (normal): FastCache + TLS SLL
- Fallback chain: L0 miss → L1 → L2
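The fallback chain amounts to "try each layer in order, stop at the first hit". A generic sketch: `layer_alloc_fn`, `alloc_via_layers` and the two `demo_` layers are hypothetical, and the real fast path is presumably inlined rather than driven through function pointers.

```c
#include <stddef.h>

/* One allocation layer: returns a block, or NULL on a miss. */
typedef void* (*layer_alloc_fn)(size_t size);

/* Walk the layers from fastest to slowest (L0 → L1 → L2). */
static inline void* alloc_via_layers(const layer_alloc_fn* layers,
                                     size_t n_layers, size_t size) {
    for (size_t i = 0; i < n_layers; i++) {
        void* p = layers[i](size);
        if (p != NULL) return p;   /* hit: lower layers are never touched */
    }
    return NULL;                   /* every layer missed */
}

/* Demo layers for illustration only. */
static char demo_block[32];
static inline void* demo_always_miss(size_t size) { (void)size; return NULL; }
static inline void* demo_backing(size_t size) {
    return size <= sizeof demo_block ? (void*)demo_block : NULL;
}
```

With `{ demo_always_miss, demo_always_miss, demo_backing }` as the chain, a 16B request misses L0 and L1 and is served by the last layer.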
### 2.5 Critical Bug Fix: Class Numbering
**Issue**: Initial implementation assumed C1=16B, C2=32B
- **Symptom**: 0% hit rate, alloc_calls registered but free_calls=0
- **Root cause**: HAKMEM class numbering is C2=16B, C3=32B
- **Discovery**: Ran TinyHeapV2 stats on 16B → showed [C2] hit rate
- **Fix**: Changed checks from (class_idx==1||2) to (class_idx==2||3)
- **Verification**: Hit rate → 99.9% after fix
### 2.6 Next Steps (Optional)
1. **perf stat validation**: Measure actual L1 dcache miss reduction
2. **Larger magazines**: Test 8-16 slots if cache locality permits
3. **C0/C4 coverage**: Extend to 8B/64B if beneficial
4. **Adaptive enable**: Auto-detect workload characteristics
**Current Recommendation**: Phase 14 COMPLETE ✅
- C2/C3-heavy workloads: **+16-36% improvement**
- Mixed workloads: Negligible overhead (<2%)
- Magazine-based design proven effective
- Ready for production use (default: ON)
---
## 3. TinyHeapV2 Box: current state (Phase 13)
### 3.1 Implemented (Phase 13-A Alloc Front)
- Box: `TinyHeapV2` (per-thread magazine front, an L0 cache for C0-C3)
- Files:
  - `core/front/tiny_heap_v2.h`
  - `core/hakmem_tiny.c` (TLS definitions + stats output)
  - `core/hakmem_tiny_alloc_new.inc` (alloc hook)
- TLS structures:
  - `__thread TinyHeapV2Mag g_tiny_heap_v2_mag[TINY_NUM_CLASSES];`
  - `__thread TinyHeapV2Stats g_tiny_heap_v2_stats[TINY_NUM_CLASSES];`
- ENV:
  - `HAKMEM_TINY_HEAP_V2`: Box ON/OFF
  - `HAKMEM_TINY_HEAP_V2_CLASS_MASK`: bit0-3 enable C0-C3
  - `HAKMEM_TINY_HEAP_V2_STATS`: stats output ON
  - `HAKMEM_TINY_HEAP_V2_DEBUG`: initial debug logging
- Behavior:
  - `hak_tiny_alloc(size)` tries `tiny_heap_v2_alloc(size)` first when the class is C0-C3 and the mask allows it.
  - `tiny_heap_v2_alloc`:
    - If mag.top>0, pop (returns BASE) → `HAK_RET_ALLOC` converts it to header + user pointer.
    - If the magazine is empty, return **NULL immediately** and fall back to the existing front.
  - `tiny_heap_v2_refill_mag` is a NO-OP (no refill).
  - `tiny_heap_v2_try_push` is implemented but is not expected to be called from the real free/alloc paths yet, which is fine (it is used in Phase 13-B).
- Current performance:
  - 16/32/64B fixed-size (100K): within ±1% → hook overhead is essentially zero.
  - `alloc_calls` climbs to 200K but `mag_hits=0` (because there is no supply).

**Key point:** TinyHeapV2 is "an L0 stub that was slotted in without breaking anything".
**How to design the supply** is the main topic of Phase 13-B.
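The stub behavior described above (class-mask gate, pop on hit, immediate NULL on an empty magazine, no refill) can be sketched as follows. `V2Mag`, `V2Stats`, `v2_class_enabled` and `v2_alloc` are hypothetical stand-ins for the real `TinyHeapV2Mag` / `TinyHeapV2Stats` types and functions, whose definitions are not shown in this file; the 16-slot capacity is taken from section 2.4.

```c
#include <stddef.h>
#include <stdint.h>

#define V2_SLOTS 16

typedef struct { void* items[V2_SLOTS]; int top; } V2Mag;
typedef struct { uint64_t alloc_calls, mag_hits; } V2Stats;

/* bit0-3 of the mask enable C0-C3. Other classes are never served. */
static inline int v2_class_enabled(unsigned mask, int class_idx) {
    return class_idx >= 0 && class_idx <= 3 && ((mask >> class_idx) & 1u);
}

/* Pop on hit, NULL on an empty magazine. There is deliberately no
 * refill step, so without a supply path every call is a miss. */
static inline void* v2_alloc(V2Mag* mag, V2Stats* st) {
    st->alloc_calls++;
    if (mag->top > 0) {
        st->mag_hits++;
        return mag->items[--mag->top];
    }
    return NULL;  /* caller falls back to the existing front */
}
```

This reproduces the observed stats pattern: `alloc_calls` grows with every request while `mag_hits` stays at 0 until something pushes blocks into the magazine.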
---
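## 3. Recent bug fixes and spec adjustments (boxes that need no further changes)
### 3.1 Tiny / Mid size-boundary gap fix (done)
- Before:
  - `TINY_MAX_SIZE = 1024` / `MID_MIN_SIZE = 8192`, so 1KB-8KB belonged to nobody and went straight to mmap.
- Now:
  - Tiny: `TINY_MAX_SIZE = 1023` (Tiny handles up to 1023B, assuming a 1B header)
  - Mid: `MID_MIN_SIZE = 1024` (Mid MT handles 1KB-32KB).
- Effect:
  - `bench_fixed_size_hakmem 1024B` escaped "mmap hell" → improved to roughly 0.5M ops/s via the Mid MT path.
  - The SEGV is resolved. What remains is only a performance gap (independent of TinyHeapV2).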
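The boundary fix can be expressed as a size router plus a compile-time check that no gap remains between Tiny and Mid. A sketch: `SizeRoute`, `route_for_size` and `MID_MAX_SIZE` are hypothetical, and treating 32KB as an inclusive upper bound for Mid is an assumption.

```c
#include <stddef.h>

#define TINY_MAX_SIZE 1023u
#define MID_MIN_SIZE  1024u
#define MID_MAX_SIZE  (32u * 1024u)  /* assumption: Mid covers up to 32KB inclusive */

/* The bug was a hole between the two ranges. Make a hole a build error. */
_Static_assert(MID_MIN_SIZE == TINY_MAX_SIZE + 1u,
               "Tiny and Mid size ranges must be contiguous");

typedef enum { ROUTE_TINY, ROUTE_MID, ROUTE_LARGE } SizeRoute;

static inline SizeRoute route_for_size(size_t size) {
    if (size <= TINY_MAX_SIZE) return ROUTE_TINY;
    if (size <= MID_MAX_SIZE)  return ROUTE_MID;
    return ROUTE_LARGE;  /* falls through to the mmap-backed path */
}
```

With the old constants (`1024` / `8192`) the static assertion fails, which is the class of mistake it is meant to catch.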
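### 3.2 Shared Pool / LRU / Drain
- TLS SLL drain:
  - `HAKMEM_TINY_SLL_DRAIN_INTERVAL` default = 2048.
  - A/B tested with 128/256B fixed sizes. Neither regressed; both improved by roughly +5% to +15%.
- SPSLOT Box:
  - SuperSlab count reduction and syscall reduction are as expected.
  - futex / lock contention has been addressed up to P0-5 (further improvement is a high-cost area and is deferred for now).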
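An interval-based drain is typically a per-thread counter that fires every N operations. The sketch below assumes the interval counts frees, which this file does not state; `DrainCounter` and `drain_tick` are hypothetical names.

```c
#include <stdint.h>

#define SLL_DRAIN_INTERVAL_DEFAULT 2048u  /* default quoted above */

typedef struct {
    uint32_t frees_since_drain;
    uint32_t interval;
    uint64_t drains;          /* how many times a drain was triggered */
} DrainCounter;

/* Call once per free. Returns 1 when the caller should drain now. */
static inline int drain_tick(DrainCounter* c) {
    if (++c->frees_since_drain < c->interval) return 0;
    c->frees_since_drain = 0;
    c->drains++;
    return 1;
}
```

A larger interval amortizes the drain cost over more frees at the price of holding freed blocks longer, which is the trade-off the A/B run above was measuring.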
### 3.3 ✅ CRITICAL FIX: workset=128 Infinite Recursion Bug (2025-11-15)
**Commit**: 176bbf656
**Root Cause**:
- `shared_pool_ensure_capacity_unlocked()` used `realloc()` for Shared Pool metadata allocation
- `realloc()` → `hak_alloc_at(128B)` → `shared_pool_init()` → `realloc()` → **INFINITE RECURSION**
- Triggered by high memory pressure (workset=128) but not lower pressure (workset=64)
**Symptoms**:
- `bench_fixed_size_hakmem 1 16 128`: infinite hang (timeout)
- `bench_fixed_size_hakmem 1 1024 128`: worked fine (4.3M ops/s)
- **Size-class specific**: C2-C4 (16-64B) hung, C7 (1024B) worked
- Reason: Small allocations trigger more SuperSlab allocations → more metadata realloc → deeper recursion
**Fix** (`core/hakmem_shared_pool.c`):
- Replace `realloc()` with direct system `mmap()` for Shared Pool metadata
- Use `munmap()` to free old mappings (not `free()`!)
- **Breaks recursion**: Shared Pool metadata now allocated outside HAKMEM allocator
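A sketch of what an mmap-based replacement for `realloc()` could look like on Linux. `meta_grow` is a hypothetical helper, not the code in `core/hakmem_shared_pool.c`; the point is only that no call in it can re-enter the allocator.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Grow a metadata array without touching malloc/realloc/free.
 * Returns the new mapping, or NULL on failure (old mapping stays valid). */
static inline void* meta_grow(void* old, size_t old_bytes, size_t new_bytes) {
    assert(new_bytes > 0 && new_bytes >= old_bytes);
    void* fresh = mmap(NULL, new_bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (fresh == MAP_FAILED) return NULL;
    if (old != NULL) {
        memcpy(fresh, old, old_bytes);
        munmap(old, old_bytes);  /* not free(): the memory came from mmap */
    }
    return fresh;
}
```

Anonymous mappings are zero-filled, so the newly added tail of the array starts out cleared, similar to what a `calloc`-style grow would give.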
**Performance** (before → after):
- 16B / workset=128: **timeout → 18.5M ops/s** ✅ FIXED
- 1024B / workset=128: 4.3M ops/s → stable (no regression)
- 16B / workset=64: 44M ops/s → stable (no regression)
**Testing**:
```bash
# Critical test (previously hung indefinitely)
./out/release/bench_fixed_size_hakmem 10000 256 128
# Expected: ~18M ops/s, instant completion
```
**Key Lesson**:
- Never use allocator-managed memory for the allocator's own metadata
- Bootstrap phase must use system primitives (mmap) directly
- Workset size can expose hidden recursion bugs under memory pressure
---
## 4. Phase 13-B TinyHeapV2: supply path implementation ✅ Done
**Status**: completed 2025-11-15
**Result**: **Adopted the Stealing design (Mode 0 default); +18% improvement at 32B**
### 4.1 What was implemented
1. **Free path supply** (`core/tiny_free_fast_v2.inc.h`)
   - Two supply modes implemented (A/B switchable via ENV):
     - **Mode 0 (Stealing)**: L0 receives frees first (default)
     - **Mode 1 (Leftover)**: L1 is the primary owner; L0 only gets the leftovers
2. **Alloc path hook** (`core/tiny_alloc_fast.inc.h`)
   - `tiny_heap_v2_alloc_by_class(class_idx)`: optimized (turned a -47% regression into a +14% improvement)
   - Takes class_idx directly, removing redundant conversions and checks
3. **Full set of ENV flags**:
   - `HAKMEM_TINY_HEAP_V2` - Box ON/OFF
   - `HAKMEM_TINY_HEAP_V2_CLASS_MASK` - per-class enable (bitmask)
   - `HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE` - switch between Mode 0/1
   - `HAKMEM_TINY_HEAP_V2_STATS` - stats output
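The difference between the two supply modes is only which cache gets the first chance at a freed block. A sketch under that reading: here "L0" is the TinyHeapV2 magazine and "L1" is the existing front, both modeled as small stacks, and `SupplyStack`, `supply_push` and `free_supply` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SUPPLY_CAP 4  /* arbitrary capacity for the sketch */

typedef struct { void* items[SUPPLY_CAP]; int top; } SupplyStack;

static inline int supply_push(SupplyStack* s, void* p) {
    if (s->top >= SUPPLY_CAP) return 0;
    s->items[s->top++] = p;
    return 1;
}

/* leftover_mode == 0 (Stealing): L0 is offered the block first.
 * leftover_mode == 1 (Leftover): L1 is offered it first, and L0 only
 * sees blocks that L1 could not take.
 * Returns 0 if L0 took the block, 1 if L1 took it, -1 if both were full. */
static inline int free_supply(SupplyStack* l0, SupplyStack* l1,
                              void* p, int leftover_mode) {
    assert(p != NULL);
    SupplyStack* first  = leftover_mode ? l1 : l0;
    SupplyStack* second = leftover_mode ? l0 : l1;
    if (supply_push(first, p))  return first  == l0 ? 0 : 1;
    if (supply_push(second, p)) return second == l0 ? 0 : 1;
    return -1;
}
```

In Leftover mode L0 only fills once L1 is saturated, so a workload that never fills L1 would leave L0 with nothing to hand out.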
### 4.2 A/B test results (100K iterations, workset=128)
| Size | Baseline (V2 OFF) | **Mode 0 (Stealing)** | Mode 1 (Leftover) |
|--------|------------------|----------------------|------------------|
| **16B** | 43.9M ops/s | **45.6M (+3.9%)** ✅ | 41.6M (-5.2%) ❌ |
| **32B** | 41.9M ops/s | **49.6M (+18.4%)** ✅ | 41.1M (-1.9%) ❌ |
| **64B** | 51.2M ops/s | **51.5M (+0.6%)** ≈ | 51.0M (-0.4%) ≈ |

**Stats** (Mode 0 @ 16B):
- alloc_calls: 99,872
- mag_hits: 99,872 (**100.0% hit rate**)
- refill: 0 (supply from the free path only)
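### 4.3 Design decision: adopt Stealing as the default
**ChatGPT's analysis** (ultrathink consultation):
1. **Consistent with the learning layer**:
   - The learning layer mainly looks at Superslab / Pool / Drain statistics
   - L0 stealing does not disturb the carving/drain signals on the Superslab side
   - If needed, TinyHeapV2 hit/miss counters can be added as hooks for learning
2. **Tidying up Box boundaries**:
   - TinyHeapV2 is self-contained as a **front-only Box**
   - The learning layer is given "the Superslab/Pool world" and "L0/L1 statistics" as separate boxes
   - Performance (+18%) > strict Box boundaries
3. **Recommended policy**:
   - **For now, push performance with Stealing** (Mode 0 default)
   - Adjust consistency with the learning layer in later Phases as needed

**Decision**: adopt `HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0` (Stealing) as the default.
**Rationale**: +18% performance improvement at 32B; the impact on the learning layer is minor.
### 4.4 Remaining tasks (later Phases)
- [ ] **C0 (8B) optimization**: currently a -5% regression → consider disabling it via CLASS_MASK
- [ ] **Learning layer integration**: add TinyHeapV2 hit/miss/refill counters as learning hooks if needed
- [ ] **Random mixed bench**: A/B test with the 256B mixed workload as well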
---
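## 5. Notes on areas "not to touch for now"
- Mid-Large allocator (Pool TLS + lock-free Stage 1/2):
  - SEGV fixed, futex reduced by 95%, +896% improvement at 8T.
  - As a research topic this has progressed far enough for now, so it is fine to concentrate on Tiny.
- 100x gap on the Larson bench:
  - A larger topic that involves lock contention / metadata reuse.
  - To be tackled in a separate Phase once TinyHeapV2 has taken shape.
---
## 6. Summary (one-line notes for Claude Code)
- **Box boundary**: TinyHeapV2 is a "front-only L0 Cache Box". It does not touch Superslab / Pool / Drain.
- **Do right now**: insert the "leftover supply" from the alloc side at exactly one place, then collect stats and run A/B.
- **Free-side integration**: only sort out the design for now; implementation can wait until TinyHeapV2's behavior has been observed.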
---
## 4. Phase 15: Box Separation (2025-11-15) - Incomplete ⏸️
### 4.1 Goal
Eliminate mincore syscall overhead (13.65% CPU, 987 calls/100K) via Box Separation architecture.
### 4.2 Implementation Status
**Box headers complete; routing incomplete (SEGV)**
**Files Created**:
1. **`core/box/front_gate_v2.h`** (98 lines) ✅
   - Ultra-fast 1-byte header classification ONLY
   - Domains: TINY (0xa0), POOL (0xb0), MIDCAND, EXTERNAL
   - Performance: 2-5 cycles
   - Same-page guard added (defensive programming)
2. **`core/box/external_guard_box.h`** (146 lines) ✅
   - ENV-controlled mincore safety check
   - ENV flags:
     - `HAKMEM_EXTERNAL_GUARD_MINCORE=0/1` (default: 0 = OFF)
     - `HAKMEM_EXTERNAL_GUARD_LOG=0/1`
     - `HAKMEM_EXTERNAL_GUARD_STATS=0/1`
   - Expected: Called 0-10 times in bench (if >100 → box leak)
   - Uses `__libc_free()` to avoid infinite loop
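The contents of `external_guard_box.h` are not shown here. As an illustration of the kind of check a mincore guard performs on Linux, the sketch below asks the kernel whether the page holding a pointer is mapped at all; `page_is_mapped` is a hypothetical helper. `mincore(2)` fails with `ENOMEM` when the range contains unmapped pages, which is the signal used.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE  /* exposes mincore() on glibc */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page containing p is mapped, 0 otherwise.
 * Costs one syscall per call, which is the overhead discussed above. */
static inline int page_is_mapped(const void* p) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return 0;
    uintptr_t base = (uintptr_t)p & ~((uintptr_t)page - 1);
    unsigned char vec;  /* one status byte per page; content is ignored */
    return mincore((void*)base, (size_t)page, &vec) == 0;
}
```

A guard like this only says the page is mapped, not that a valid header lives there, so it prevents the SEGV but does not by itself classify the pointer.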
**Routing (hak_free_at)**:
- ❌ Phase 15 routing incomplete (SEGV on page-aligned pointers)
- ✅ Reverted to Phase 14-C (classify_ptr-based, stable)
### 4.3 Issues Encountered
1. **Page-aligned pointer crash** (pointers of the form 0x...000, i.e. `(ptr & 0xFFF) == 0`)
   - Box FG V2 missing same-page guard → fixed
   - Still crashes on drain phase → deeper issue
2. **C7 (1KB headerless) misclassification**
   - Box FG V2 cannot classify C7 (no 1-byte header)
   - Requires registry lookup fallback
3. **mincore OFF unsafe**
   - DISABLE_MINCORE=1 causes SEGV (invalid AllocHeader deref)
   - mincore safety check is essential for mixed allocations
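Issue 1 follows from where the 1-byte header lives: at `ptr - 1`, which for a page-aligned pointer is the last byte of the previous page and may not be mapped. A sketch of a classifier with a same-page guard: `fg_classify` and the `FG_` names are hypothetical, and comparing only the high nibble against 0xa0 / 0xb0 is an assumption about the header encoding.

```c
#include <stdint.h>

typedef enum { FG_TINY, FG_POOL, FG_MIDCAND, FG_EXTERNAL } FgDomain;

#define FG_PAGE_MASK  0xFFFu  /* 4KB pages assumed */
#define FG_MAGIC_MASK 0xF0u
#define FG_MAGIC_TINY 0xA0u
#define FG_MAGIC_POOL 0xB0u

/* Classify by the byte just before the user pointer. FG_EXTERNAL is
 * never returned here: in this sketch it is left to a later registry
 * lookup that rejects a MIDCAND pointer. */
static inline FgDomain fg_classify(const void* user_ptr) {
    uintptr_t addr = (uintptr_t)user_ptr;
    /* Same-page guard: ptr - 1 would be on the previous page, so do
     * not read it. Note the parentheses: == binds tighter than &. */
    if ((addr & FG_PAGE_MASK) == 0) return FG_MIDCAND;
    uint8_t hdr = *((const uint8_t*)user_ptr - 1);
    if ((hdr & FG_MAGIC_MASK) == FG_MAGIC_TINY) return FG_TINY;
    if ((hdr & FG_MAGIC_MASK) == FG_MAGIC_POOL) return FG_POOL;
    return FG_MIDCAND;  /* no recognizable header (e.g. headerless C7) */
}
```

Headerless C7 blocks and page-aligned pointers both end up as MIDCAND in this sketch, which is why issue 2 calls for a registry lookup fallback.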
### 4.4 Performance (Phase 14-C Baseline)
**Current (mincore ON)**:
- Random Mixed 256B: **16.5M ops/s**
- mincore: 841 calls/100K iterations
- Stable, no crashes
**Target (mincore OFF)**:
- Expected: +15% (eliminate 13.65% CPU overhead)
- Reality: SEGV (unsafe AllocHeader access)
### 4.5 Next Steps (Deferred to Future Phase)
A full Phase 15 implementation will be retried in a later phase:
1. **Mid/Large/C7 registry consolidation** - Unified lookup for MIDCAND
2. **AllocHeader safety** - Add header validation before deref
3. **ExternalGuard integration** - Proper libc delegation
**Current Recommendation**: Stick with Phase 14-C (stable, 16.5M ops/s)
- mincore overhead is acceptable for now (~1.9ms / 100K iterations)
- Focus on other bottlenecks (TLS SLL refill, SuperSlab churn)
---