# CURRENT TASK – Phase 14 (TinyUltraHot / C2/C3 Ultra-Fast Path)

**Date**: 2025-11-15

**Status**: ✅ TinyUltraHot implemented (+16-36% on C2/C3 workloads); Phase 13 TinyHeapV2 = safe stub

**Owner**: Claude Code → implementation complete

---

## 1. Where Things Stand Overall

- Tiny (0–1023B):
  - Front: the NEW 3-layer front (bump / small_mag / slow) is stable.
  - TinyHeapV2: "alloc front + stats" is implemented, but there is no magazine supply → hit rate 0%.
  - Drain: TLS SLL drain interval = 2048 (default). Tiny random mixed runs at roughly 9M ops/s.
- Mid (1KB–32KB):
  - Gap fixed: `MID_MIN_SIZE` lowered to 1024, so Mid now handles 1KB–8KB.
  - With Pool TLS ON by default (mid bench): ~10.6M ops/s (faster than system malloc).
- Shared SuperSlab Pool (SP‑SLOT Box):
  - Implemented. SuperSlab count -92%, mmap/munmap -48%, throughput +131%.
  - Lock contention (Stage 2) work is implemented through P0-5, for a gain of roughly +2–3%.

Conclusion: the Mid / Shared Pool side is "done for now, as far as the research goals go".

The largest remaining headroom is the **Tiny front (C0–C3)** and **some Tiny benchmarks (Larson / 1KB fixed)**.

---

## 2. Phase 14: TinyUltraHot Box (2025-11-15) ✅

### 2.1 Implementation Overview

**ChatGPT's Phase 14 strategy**: attack L1 dcache misses.

- **Problem**: `perf stat` comparison against system malloc
  - L1 dcache miss: 30x worse (2.9M vs 96K)
  - Instructions: 6.2x more (281M vs 45M)
  - Branches: 7.1x more (59M vs 8.3M)
- **Solution**: an ultra-simple straight-line path specialized for HAKMEM classes C2/C3 (16B/32B)
  - Target: ~60% of tiny allocations
  - Magazine-based (4 slots per class)
  - Single cache line TLS structure (64B aligned)
  - 5-7 instructions per alloc/free

### 2.2 Implementation Details

**Box**: `TinyUltraHot` (L0 ultra-fast path, C2/C3 = 16B/32B only)

**Files**:
- `core/front/tiny_ultra_hot.h` (343 lines, self-contained)
- `core/hakmem_tiny.c` (TLS + stats)
- `core/tiny_alloc_fast.inc.h` (alloc hook)
- `core/tiny_free_fast_v2.inc.h` (free hook)
- `bench_random_mixed.c` (stats output added)

**TLS Structure**:
```c
#include <stdint.h>

typedef struct {
    void* c1_mag[4];        // UltraHot slot 1 = HAKMEM C2 (16B); 4 pointers = 32B
    void* c2_mag[4];        // UltraHot slot 2 = HAKMEM C3 (32B); 4 pointers = 32B
    uint8_t c1_top, c2_top; // number of cached blocks in each magazine
    // Statistics...
} __attribute__((aligned(64))) TinyUltraHot;
```
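
The magazine operations behind this structure can be sketched as a bounded per-thread stack. This is a minimal illustration, not the code in `tiny_ultra_hot.h`: the names `UltraHotSketch`, `ultra_hot_pop`, and `ultra_hot_push` are hypothetical, and the statistics fields are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ULTRA_HOT_SLOTS 4  /* 4 slots per class, as in the struct above */

/* Hypothetical stand-in for TinyUltraHot without the statistics fields. */
typedef struct {
    void*   c1_mag[ULTRA_HOT_SLOTS];  /* HAKMEM C2 (16B) magazine */
    void*   c2_mag[ULTRA_HOT_SLOTS];  /* HAKMEM C3 (32B) magazine */
    uint8_t c1_top, c2_top;           /* number of cached blocks per magazine */
} __attribute__((aligned(64))) UltraHotSketch;

/* Pop a cached block; NULL means the magazine is empty and the caller
 * should fall back to the next layer. */
static inline void* ultra_hot_pop(void** mag, uint8_t* top) {
    assert(mag != NULL && top != NULL);
    if (*top == 0) return NULL;
    return mag[--(*top)];
}

/* Push a freed block; returns 0 when the magazine is full so the caller
 * can free through the next layer instead. */
static inline int ultra_hot_push(void** mag, uint8_t* top, void* block) {
    assert(mag != NULL && top != NULL);
    if (*top >= ULTRA_HOT_SLOTS) return 0;
    mag[(*top)++] = block;
    return 1;
}
```

Both operations are one bounds check plus one indexed load or store, which is the kind of straight-line code the 5-7 instruction target implies. The stack order (LIFO) is an assumption of this sketch; it returns the most recently freed block, which is the one most likely to still be in cache.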

**ENV Controls**:
- `HAKMEM_TINY_ULTRA_HOT=0/1` - Enable/disable (default: 1)
- `HAKMEM_TINY_ULTRA_HOT_STATS=0/1` - Print stats at exit
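
A 0/1 flag like these is typically read once and cached so the hot path never calls `getenv()`. The sketch below shows one way to do that; `env_flag` and `ultra_hot_enabled` are hypothetical helper names, and the parsing rule (anything not starting with `0` counts as on) is an assumption rather than HAKMEM's documented behavior.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a 0/1 flag from the environment; unset or empty falls back to
 * the given default. */
static int env_flag(const char* name, int default_value) {
    assert(name != NULL);
    const char* v = getenv(name);
    if (v == NULL || *v == '\0') return default_value;
    return v[0] != '0';
}

/* Cached lookup: -1 means "not read yet", so getenv() runs once instead
 * of on every allocation. */
static int ultra_hot_enabled(void) {
    static int cached = -1;
    if (cached < 0) cached = env_flag("HAKMEM_TINY_ULTRA_HOT", 1); /* default: 1 */
    return cached;
}
```

The unsynchronized `static int` is acceptable here only because every thread would compute the same value; a per-thread (`__thread`) cache is an alternative if that race is a concern.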

**Class Mapping** (CRITICAL):
- C0 = 8B (not covered)
- C1 = ? (unknown)
- **C2 = 16B** ← UltraHot C1
- **C3 = 32B** ← UltraHot C2
- C4+ = 64B+ (not covered)

### 2.3 Performance Results

**Fixed-Size Benchmarks** (100K iterations, 128 workset):

| Size | Baseline | UltraHot ON | Improvement | Hit Rate |
|------|----------|-------------|-------------|----------|
| **16B (C2)** | 48.2M ops/s | 58.3M ops/s | **+20.9%** | 99.9% |
| **32B (C3)** | 45.1M ops/s | 55.9M ops/s | **+23.9%** | 99.9% |

**Extended C2/C3 Tests** (200K iterations, 256 workset):

| Size | Baseline | UltraHot ON | Improvement |
|------|----------|-------------|-------------|
| **16B (C2)** | 40.4M ops/s | 55.0M ops/s | **+36.2%** |
| **32B (C3)** | 43.5M ops/s | 50.6M ops/s | **+16.3%** |
| 24B (C3) | 43.5M ops/s | 44.6M ops/s | +2.5% |

**Random Mixed 256B** (100K iterations):
- Baseline: 8.96M ops/s
- UltraHot ON: 8.81M ops/s (-1.6%)
- **Reason**: C2/C3 coverage = only 1-2% of workload
  - UltraHot C1 (16B): alloc=45 (0.045%), free=820 (0.82%)
  - UltraHot C2 (32B): alloc=828 (0.83%), free=1,567 (1.57%)
  - Size distribution: 16-1040B (C2/C3 = ~1.7% of range)
- **Conclusion**: UltraHot overhead negligible on non-target workloads

### 2.4 Design Decisions

**Why C2/C3 only?**
- Cover ~60% of tiny allocations (ChatGPT estimate for 16B/32B)
- Small magazine (4 slots) fits in 1-2 cache lines
- Size check trivial (size <= 16 / size <= 32)
- Larger classes (C4+) have different access patterns

**Why 4 slots per magazine?**
- Target: 1 cache line (64B) for all magazine slots
- `c1_mag` (32B) + `c2_mag` (32B) = 64B (first cache line)
- Counters + stats in second cache line
- Trade capacity for cache locality

**Integration with existing layers**:
- **L0** (fastest): TinyUltraHot (C2/C3 only)
- **L1** (fast): TinyHeapV2 (C0-C3, 16 slots, Phase 13)
- **L2** (normal): FastCache + TLS SLL
- Fallback chain: L0 miss → L1 → L2
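
The fallback chain can be pictured as "try each layer in order and stop at the first hit". The sketch below uses a function-pointer table purely for illustration; the real front is presumably inlined straight-line code, and `AllocLayer`, `alloc_via_chain`, and the two demo layers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One allocation layer: returns a block for the class, or NULL on a miss. */
typedef void* (*AllocLayer)(int class_idx);

/* Walk the layers in priority order (L0, L1, L2, ...). */
static void* alloc_via_chain(const AllocLayer* layers, size_t n, int class_idx) {
    assert(layers != NULL);
    for (size_t i = 0; i < n; i++) {
        void* p = layers[i](class_idx);
        if (p != NULL) return p;  /* hit: stop at the first layer with a block */
    }
    return NULL;                  /* every layer missed: caller takes the slow path */
}

/* Demo stand-ins for real layers. */
static int g_demo_block;
static void* layer_empty(int class_idx) { (void)class_idx; return NULL; }
static void* layer_hit(int class_idx)   { (void)class_idx; return &g_demo_block; }
```

With `{ layer_empty, layer_empty, layer_hit }` as the table, a request misses L0 and L1 and is served by the third layer; a miss in an upper layer costs only the failed check before moving on.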

### 2.5 Critical Bug Fix: Class Numbering

**Issue**: Initial implementation assumed C1=16B, C2=32B
- **Symptom**: 0% hit rate, alloc_calls registered but free_calls=0
- **Root cause**: HAKMEM class numbering is C2=16B, C3=32B
- **Discovery**: Ran TinyHeapV2 stats on 16B → showed [C2] hit rate
- **Fix**: Changed checks from (class_idx==1||2) to (class_idx==2||3)
- **Verification**: Hit rate → 99.9% after fix
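
The corrected check can be written as a single mapping from HAKMEM class index to UltraHot magazine, which keeps the "which classes are covered" decision in one place. This is a sketch under the class mapping stated above; `ultra_hot_mag_index` and the enum names are hypothetical.

```c
#include <assert.h>

/* HAKMEM class indices per the class mapping: C2 = 16B, C3 = 32B. */
enum { CLASS_16B = 2, CLASS_32B = 3 };

/* Returns the UltraHot magazine index (0 = 16B, 1 = 32B),
 * or -1 when the class is not covered by UltraHot. */
static int ultra_hot_mag_index(int class_idx) {
    assert(class_idx >= 0);
    switch (class_idx) {
    case CLASS_16B: return 0;
    case CLASS_32B: return 1;
    default:        return -1;  /* C0, C1, C4+ use the regular front */
    }
}
```

With the original off-by-one (checking 1 and 2), a 32B request would land in the 16B magazine slot and a 16B request would miss entirely; naming the indices makes that kind of mismatch easier to spot in review.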

### 2.6 Next Steps (Optional)

1. **perf stat validation**: Measure actual L1 dcache miss reduction
2. **Larger magazines**: Test 8-16 slots if cache locality permits
3. **C0/C4 coverage**: Extend to 8B/64B if beneficial
4. **Adaptive enable**: Auto-detect workload characteristics

**Current Recommendation**: Phase 14 COMPLETE ✅
- C2/C3-heavy workloads: **+16-36% improvement**
- Mixed workloads: Negligible overhead (<2%)
- Magazine-based design proven effective
- Ready for production use (default: ON)

---

## 3. TinyHeapV2 Box Status (Phase 13)

### 3.1 Implemented (Phase 13-A – Alloc Front)

- Box: `TinyHeapV2` (per-thread magazine front; an L0 cache for C0–C3)
- Files:
  - `core/front/tiny_heap_v2.h`
  - `core/hakmem_tiny.c` (TLS definitions + stats output)
  - `core/hakmem_tiny_alloc_new.inc` (alloc hook)
- TLS structures:
  - `__thread TinyHeapV2Mag g_tiny_heap_v2_mag[TINY_NUM_CLASSES];`
  - `__thread TinyHeapV2Stats g_tiny_heap_v2_stats[TINY_NUM_CLASSES];`
- ENV:
  - `HAKMEM_TINY_HEAP_V2` → Box ON/OFF.
  - `HAKMEM_TINY_HEAP_V2_CLASS_MASK` → bits 0–3 enable C0–C3.
  - `HAKMEM_TINY_HEAP_V2_STATS` → stats output ON.
  - `HAKMEM_TINY_HEAP_V2_DEBUG` → early debug logging.
- Behavior:
  - When the size falls in C0–C3 and the mask allows it, `hak_tiny_alloc(size)` tries `tiny_heap_v2_alloc(size)` first.
  - `tiny_heap_v2_alloc`:
    - If mag.top > 0, pop (returns BASE) → `HAK_RET_ALLOC` converts it to header + user pointer.
    - If the magazine is empty, return **NULL immediately** and fall back to the existing front.
  - `tiny_heap_v2_refill_mag` is a NO-OP (no refill).
  - `tiny_heap_v2_try_push` is implemented, but it is fine to assume it is not yet called from the real free/alloc paths (Phase 13-B will use it).
- Current performance:
  - 16/32/64B fixed-size (100K): within ±1% → hook overhead is close to zero.
  - `alloc_calls` climbs to 200K, but `mag_hits=0` (because there is no supply).

**Key point:** TinyHeapV2 is "an L0 stub that was slotted in without breaking anything".

**How to design the supply** is the subject of Phase 13-B.

---

## 3. Recent Bug Fixes and Spec Adjustments (boxes that need no further changes)

### 3.1 Tiny / Mid Size Boundary Gap Fix (done)

- Before:
  - `TINY_MAX_SIZE = 1024` / `MID_MIN_SIZE = 8192`: sizes from 1KB to 8KB belonged to no allocator and went straight to mmap.
- Now:
  - Tiny: `TINY_MAX_SIZE = 1023` (Tiny covers up to 1023B, assuming a 1B header).
  - Mid: `MID_MIN_SIZE = 1024` (Mid MT handles 1KB–32KB).
- Effect:
  - `bench_fixed_size_hakmem 1024B` escaped "mmap hell" → improved to the ~0.5M ops/s range via the Mid MT path.
  - The SEGV is resolved. What remains is only the performance gap (independent of TinyHeapV2).
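
The boundary fix amounts to making the Tiny and Mid ranges adjacent. A minimal sketch of the resulting size routing follows; the constants mirror the values quoted above, while `MID_MAX_SIZE`, `AllocRoute`, and `route_for_size` are hypothetical names, and treating everything above 32KB as one "other" route is a simplification.

```c
#include <assert.h>
#include <stddef.h>

#define TINY_MAX_SIZE 1023u          /* Tiny covers up to 1023B (1B header assumed) */
#define MID_MIN_SIZE  1024u          /* Mid now starts at 1KB instead of 8KB */
#define MID_MAX_SIZE  (32u * 1024u)  /* upper bound taken from "1KB–32KB" */

typedef enum { ROUTE_TINY, ROUTE_MID, ROUTE_OTHER } AllocRoute;

/* Compile-time check that no size falls between the two ranges. */
static_assert(MID_MIN_SIZE == TINY_MAX_SIZE + 1, "no gap between Tiny and Mid");

static AllocRoute route_for_size(size_t size) {
    if (size <= TINY_MAX_SIZE) return ROUTE_TINY;
    if (size <= MID_MAX_SIZE)  return ROUTE_MID;
    return ROUTE_OTHER;  /* larger sizes: handled by a different path */
}
```

A compile-time assertion of this kind would have flagged the original `1024` / `8192` pair, where every size in between silently fell through to mmap.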

### 3.2 Shared Pool / LRU / Drain

- TLS SLL drain:
  - `HAKMEM_TINY_SLL_DRAIN_INTERVAL` default = 2048.
  - A/B-tested with 128/256B fixed sizes. Neither regressed; both improved by roughly +5% to +15%.
- SP‑SLOT Box:
  - The SuperSlab count and syscall reductions are as expected.
  - futex / lock contention has been addressed through P0-5 (further gains are deferred for now as a high-cost area).

### 3.3 ✅ CRITICAL FIX: workset=128 Infinite Recursion Bug (2025-11-15)

**Commit**: 176bbf656

**Root Cause**:
- `shared_pool_ensure_capacity_unlocked()` used `realloc()` for Shared Pool metadata allocation
- `realloc()` → `hak_alloc_at(128B)` → `shared_pool_init()` → `realloc()` → **INFINITE RECURSION**
- Triggered by high memory pressure (workset=128) but not lower pressure (workset=64)

**Symptoms**:
- `bench_fixed_size_hakmem 1 16 128`: infinite hang (timeout)
- `bench_fixed_size_hakmem 1 1024 128`: worked fine (4.3M ops/s)
- **Size-class specific**: C1-C3 (16-64B) hung, C7 (1024B) worked
- Reason: Small allocations trigger more SuperSlab allocations → more metadata realloc → deeper recursion

**Fix** (`core/hakmem_shared_pool.c`):
- Replace `realloc()` with direct system `mmap()` for Shared Pool metadata
- Use `munmap()` to free old mappings (not `free()`!)
- **Breaks recursion**: Shared Pool metadata now allocated outside HAKMEM allocator

**Performance** (before → after):
- 16B / workset=128: **timeout → 18.5M ops/s** ✅ FIXED
- 1024B / workset=128: 4.3M ops/s → stable (no regression)
- 16B / workset=64: 44M ops/s → stable (no regression)

**Testing**:
```bash
# Critical test (previously hung indefinitely)
./out/release/bench_fixed_size_hakmem 10000 256 128
# Expected: ~18M ops/s, instant completion
```

**Key Lesson**:
- Never use allocator-managed memory for the allocator's own metadata
- Bootstrap phase must use system primitives (mmap) directly
- Workset size can expose hidden recursion bugs under memory pressure
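
The fix described above can be sketched as a grow routine that never calls into `malloc`/`realloc`. This is an illustration of the technique, not the code in `core/hakmem_shared_pool.c`; `meta_grow` is a hypothetical name and the sketch assumes a Linux/glibc target.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Grow a metadata array without touching malloc/realloc.
 * Returns the new mapping, or NULL on failure (the old mapping stays valid). */
static void* meta_grow(void* old, size_t old_bytes, size_t new_bytes) {
    assert(new_bytes >= old_bytes);
    void* fresh = mmap(NULL, new_bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (fresh == MAP_FAILED) return NULL;
    if (old != NULL) {
        memcpy(fresh, old, old_bytes);
        munmap(old, old_bytes);  /* not free(): the block never came from malloc */
    }
    return fresh;
}
```

Because the allocator being built is what `realloc()` resolves to, any call to it from inside the pool's own bookkeeping can re-enter the pool. Anonymous mappings are zero-filled by the kernel, so the newly added tail of the array starts out cleared without an explicit `memset`.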

---

## 4. Phase 13-B – TinyHeapV2: Supply Path Implementation ✅ Done

**Status**: completed 2025-11-15

**Result**: **Adopted the Stealing design (Mode 0 by default); +18% at 32B**

### 4.1 What Was Implemented

1. ✅ **Free-path supply** (`core/tiny_free_fast_v2.inc.h`)
   - Two supply modes are implemented (A/B-selectable via ENV):
     - **Mode 0 (Stealing)**: L0 receives frees first (default)
     - **Mode 1 (Leftover)**: L1 is the primary owner; L0 only gets the leftovers

2. ✅ **Alloc-path hook** (`core/tiny_alloc_fast.inc.h`)
   - `tiny_heap_v2_alloc_by_class(class_idx)` - optimized (turned a -47% regression into a +14% improvement)
   - Takes class_idx directly, removing redundant conversions and checks

3. ✅ **Full set of ENV flags**:
   - `HAKMEM_TINY_HEAP_V2` - Box ON/OFF
   - `HAKMEM_TINY_HEAP_V2_CLASS_MASK` - per-class enable (bitmask)
   - `HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE` - switch between Mode 0/1
   - `HAKMEM_TINY_HEAP_V2_STATS` - stats output
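
The difference between the two supply modes comes down to which cache is offered a freed block first. The sketch below models that ordering with two small stacks; it is an interpretation of the mode descriptions above, not the code in `tiny_free_fast_v2.inc.h`, and `Stack`, `SupplyMode`, and `free_supply` are hypothetical names. Here `l0` stands for the TinyHeapV2 magazine and `l1` for the existing front.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-thread cache with a fixed capacity. */
typedef struct { void* slot[16]; int top, cap; } Stack;

static int stack_push(Stack* s, void* p) {
    assert(s != NULL && s->cap <= 16);
    if (s->top >= s->cap) return 0;  /* full */
    s->slot[s->top++] = p;
    return 1;
}

typedef enum { SUPPLY_STEALING = 0, SUPPLY_LEFTOVER = 1 } SupplyMode;

/* Returns 0 if L0 took the block, 1 if L1 took it, and -1 if both were
 * full (the caller would then take the slow free path). */
static int free_supply(SupplyMode mode, Stack* l0, Stack* l1, void* block) {
    Stack* first  = (mode == SUPPLY_STEALING) ? l0 : l1;
    Stack* second = (mode == SUPPLY_STEALING) ? l1 : l0;
    if (stack_push(first, block))  return first == l0 ? 0 : 1;
    if (stack_push(second, block)) return second == l0 ? 0 : 1;
    return -1;
}
```

Under this model, Stealing keeps the L0 magazine stocked whenever frees arrive, while Leftover only feeds L0 once L1 is full, which would leave L0 with fewer blocks to serve.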

### 4.2 A/B Test Results (100K iterations, workset=128)

| Size | Baseline (V2 OFF) | **Mode 0 (Stealing)** | Mode 1 (Leftover) |
|--------|------------------|----------------------|------------------|
| **16B** | 43.9M ops/s | **45.6M (+3.9%)** ✅ | 41.6M (-5.2%) ❌ |
| **32B** | 41.9M ops/s | **49.6M (+18.4%)** ✅ | 41.1M (-1.9%) ❌ |
| **64B** | 51.2M ops/s | **51.5M (+0.6%)** ≈ | 51.0M (-0.4%) ≈ |

**Stats** (Mode 0 @ 16B):
- alloc_calls: 99,872
- mag_hits: 99,872 (**100.0% hit rate**)
- refill: 0 (supply comes from the free path only)

### 4.3 Design Decision: Stealing as the Default

**ChatGPT's analysis** (ultrathink consultation):

1. **Consistency with the learning layer is OK**:
   - The learning layer mainly looks at Superslab / Pool / Drain statistics
   - L0 stealing does not disturb the carving/drain signals on the Superslab side
   - If needed, TinyHeapV2 hit/miss counters can be added as learning hooks

2. **Box boundaries**:
   - TinyHeapV2 is self-contained as a **front-only Box**
   - The learning layer receives "the Superslab/Pool world" and "L0/L1 statistics" as separate boxes
   - Performance (+18%) > strict Box boundaries

3. **Recommended policy**:
   - **Push for performance with Stealing for now** (Mode 0 default)
   - Adjust alignment with the learning layer in later Phases as needed

**Decision**: adopt `HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0` (Stealing) as the default.

**Rationale**: +18% at 32B, with only a minor impact on the learning layer.

### 4.4 Remaining Tasks (later Phases)

- [ ] **C0 (8B) optimization**: currently a -5% regression → consider disabling it via CLASS_MASK
- [ ] **Learning-layer integration**: add TinyHeapV2 hit/miss/refill counters as learning hooks if needed
- [ ] **Random mixed bench**: also A/B-test the 256B mixed workload

---

## 5. "Do Not Touch For Now" Notes

- Mid-Large allocator (Pool TLS + lock-free Stage 1/2):
  - SEGV fixed, futex calls reduced by 95%, +896% at 8T.
  - As a research topic this has progressed far enough for now, so it is fine to focus on Tiny.
- The 100x gap on the Larson bench:
  - A larger topic involving lock contention and metadata reuse.
  - Tackle it in a separate Phase once TinyHeapV2 has taken reasonable shape.
Front-Direct implementation: SS→FC direct refill + SLL complete bypass
## Summary
Implemented Front-Direct architecture with complete SLL bypass:
- Direct SuperSlab → FastCache refill (1-hop, bypasses SLL)
- SLL-free allocation/free paths when Front-Direct enabled
- Legacy path sealing (SLL inline opt-in, SFC cascade ENV-only)
## New Modules
- core/refill/ss_refill_fc.h (236 lines): Standard SS→FC refill entry point
- Remote drain → Freelist → Carve priority
- Header restoration for C1-C6 (NOT C0/C7)
- ENV: HAKMEM_TINY_P0_DRAIN_THRESH, HAKMEM_TINY_P0_NO_DRAIN
- core/front/fast_cache.h: FastCache (L1) type definition
- core/front/quick_slot.h: QuickSlot (L0) type definition
## Allocation Path (core/tiny_alloc_fast.inc.h)
- Added s_front_direct_alloc TLS flag (lazy ENV check)
- SLL pop guarded by: g_tls_sll_enable && !s_front_direct_alloc
- Refill dispatch:
- Front-Direct: ss_refill_fc_fill() → fastcache_pop() (1-hop)
- Legacy: sll_refill_batch_from_ss() → SLL → FC (2-hop, A/B only)
- SLL inline pop sealed (requires HAKMEM_TINY_INLINE_SLL=1 opt-in)
## Free Path (core/hakmem_tiny_free.inc, core/hakmem_tiny_fastcache.inc.h)
- FC priority: Try fastcache_push() first (same-thread free)
- tiny_fast_push() bypass: Returns 0 when s_front_direct_free || !g_tls_sll_enable
- Fallback: Magazine/slow path (safe, bypasses SLL)
## Legacy Sealing
- SFC cascade: Default OFF (ENV-only via HAKMEM_TINY_SFC_CASCADE=1)
- Deleted: core/hakmem_tiny_free.inc.bak, core/pool_refill_legacy.c.bak
- Documentation: ss_refill_fc_fill() promoted as CANONICAL refill entry
## ENV Controls
- HAKMEM_TINY_FRONT_DIRECT=1: Enable Front-Direct (SS→FC direct)
- HAKMEM_TINY_P0_DIRECT_FC_ALL=1: Same as above (alt name)
- HAKMEM_TINY_REFILL_BATCH=1: Enable batch refill (also enables Front-Direct)
- HAKMEM_TINY_SFC_CASCADE=1: Enable SFC cascade (default OFF)
- HAKMEM_TINY_INLINE_SLL=1: Enable inline SLL pop (default OFF, requires AGGRESSIVE_INLINE)
## Benchmarks (Front-Direct Enabled)
```bash
ENV: HAKMEM_BENCH_FAST_FRONT=1 HAKMEM_TINY_FRONT_DIRECT=1
HAKMEM_TINY_REFILL_BATCH=1 HAKMEM_TINY_P0_DIRECT_FC_ALL=1
HAKMEM_TINY_REFILL_COUNT_HOT=256 HAKMEM_TINY_REFILL_COUNT_MID=96
HAKMEM_TINY_BUMP_CHUNK=256
bench_random_mixed (16-1040B random, 200K iter):
256 slots: 1.44M ops/s (STABLE, 0 SEGV)
128 slots: 1.44M ops/s (STABLE, 0 SEGV)
bench_fixed_size (fixed size, 200K iter):
256B: 4.06M ops/s (has debug logs, expected >10M without logs)
128B: Similar (debug logs affect)
```
## Verification
- TRACE_RING test (10K iter): **0 SLL events** detected ✅
- Complete SLL bypass confirmed when Front-Direct=1
- Stable execution: 200K iterations × multiple sizes, 0 SEGV
## Next Steps
- Disable debug logs in hak_alloc_api.inc.h (call_num 14250-14280 range)
- Re-benchmark with clean Release build (target: 10-15M ops/s)
- 128/256B shortcut path optimization (FC hit rate improvement)
Co-Authored-By: ChatGPT <chatgpt@openai.com>
Suggested-By: ultrathink
2025-11-14 05:41:49 +09:00

---
## 6. Summary (one-line notes for Claude Code)
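- **Box boundary**: TinyHeapV2 is a "front-only L0 Cache Box". It does not touch Superslab / Pool / Drain.
- **Do right now**: wire the "leftover supply" from the alloc side in at exactly one place, then collect stats and A/B results.
- **Free-side integration**: only sort out the design for now; implementation can wait until TinyHeapV2's behavior has been observed.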

---

## 4. Phase 15: Box Separation (2025-11-15) - Incomplete ⏸️

### 4.1 Goal

Eliminate mincore syscall overhead (13.65% CPU, 987 calls/100K) via Box Separation architecture.

### 4.2 Implementation Status

**Box headers complete; routing incomplete (SEGV)**

**Files Created**:
1. **`core/box/front_gate_v2.h`** (98 lines) ✅
   - Ultra-fast 1-byte header classification ONLY
   - Domains: TINY (0xa0), POOL (0xb0), MIDCAND, EXTERNAL
   - Performance: 2-5 cycles
   - Same-page guard added (defensive programming)

2. **`core/box/external_guard_box.h`** (146 lines) ✅
   - ENV-controlled mincore safety check
   - ENV flags:
     - `HAKMEM_EXTERNAL_GUARD_MINCORE=0/1` (default: 0 = OFF)
     - `HAKMEM_EXTERNAL_GUARD_LOG=0/1`
     - `HAKMEM_EXTERNAL_GUARD_STATS=0/1`
   - Expected: Called 0-10 times in bench (if >100 → box leak)
   - Uses __libc_free() to avoid infinite loop
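
An ENV-gated `mincore` check of the kind described for the external guard might look like the sketch below. It assumes Linux, where `mincore()` takes an `unsigned char*` vector and fails with `ENOMEM` for unmapped ranges; `page_is_mapped` and `external_guard_enabled` are hypothetical names, not the functions in `external_guard_box.h`.

```c
#define _DEFAULT_SOURCE   /* mincore() is a BSD/Linux extension */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page containing ptr is mapped, 0 otherwise. */
static int page_is_mapped(const void* ptr) {
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    uintptr_t base = (uintptr_t)ptr & ~((uintptr_t)page - 1);
    unsigned char vec = 0;
    /* mincore() fails (ENOMEM) when the range contains unmapped pages. */
    return mincore((void*)base, (size_t)page, &vec) == 0;
}

/* ENV gate: the check is skipped unless HAKMEM_EXTERNAL_GUARD_MINCORE=1
 * (default 0 = OFF, as listed above). */
static int external_guard_enabled(void) {
    const char* v = getenv("HAKMEM_EXTERNAL_GUARD_MINCORE");
    return v != NULL && v[0] == '1';
}
```

Each call is a full syscall, which is why the goal is to reach this path only for pointers that no other box can classify.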

**Routing (hak_free_at)**:
- ❌ Phase 15 routing incomplete (SEGV on page-aligned pointers)
- ✅ Reverted to Phase 14-C (classify_ptr-based, stable)

### 4.3 Issues Encountered

1. **Page-aligned pointer crash** (0x...000 & 0xFFF == 0)
   - Box FG V2 missing same-page guard → fixed
   - Still crashes on drain phase → deeper issue

2. **C7 (1KB headerless) misclassification**
   - Box FG V2 cannot classify C7 (no 1-byte header)
   - Requires registry lookup fallback

3. **mincore OFF unsafe**
   - DISABLE_MINCORE=1 causes SEGV (invalid AllocHeader deref)
   - mincore safety check is essential for mixed allocations

### 4.4 Performance (Phase 14-C Baseline)

**Current (mincore ON)**:
- Random Mixed 256B: **16.5M ops/s**
- mincore: 841 calls/100K iterations
- Stable, no crashes

**Target (mincore OFF)**:
- Expected: +15% (eliminate 13.65% CPU overhead)
- Reality: SEGV (unsafe AllocHeader access)

### 4.5 Next Steps (Deferred to Future Phase)

A full Phase 15 implementation will be retried in the next phase:
1. **Mid/Large/C7 registry consolidation** - Unified lookup for MIDCAND
2. **AllocHeader safety** - Add header validation before deref
3. **ExternalGuard integration** - Proper libc delegation

**Current Recommendation**: Stick with Phase 14-C (stable, 16.5M ops/s)
- mincore overhead is acceptable for now (~1.9ms / 100K iterations)
- Focus on other bottlenecks (TLS SLL refill, SuperSlab churn)

---