**Status**: Box FG V2 + ExternalGuard implementation complete; hak_free_at routing reverted to Phase 14-C

**Files Created**:
1. core/box/front_gate_v2.h (98 lines)
   - Ultra-fast 1-byte header classification (TINY/POOL/MIDCAND/EXTERNAL)
   - Performance: 2-5 cycles
   - Same-page guard added (defensive programming)
2. core/box/external_guard_box.h (146 lines)
   - ENV-controlled mincore safety check
   - HAKMEM_EXTERNAL_GUARD_MINCORE=0/1 (default: OFF)
   - Uses __libc_free() to avoid infinite loop

**Routing**:
- hak_free_at reverted to Phase 14-C (classify_ptr-based, stable)
- Phase 15 routing caused SEGV on page-aligned pointers

**Performance**:
- Phase 14-C (mincore ON): 16.5M ops/s (stable)
- mincore: 841 calls/100K iterations
- mincore OFF: SEGV (unsafe AllocHeader deref)

**Next Steps** (deferred):
- Mid/Large/C7 registry consolidation
- AllocHeader safety validation
- ExternalGuard integration

**Recommendation**: Stick with Phase 14-C for now
- mincore overhead acceptable (~1.9ms / 100K)
- Focus on other bottlenecks (TLS SLL, SuperSlab churn)
CURRENT TASK – Phase 14 (TinyUltraHot / C2/C3 Ultra-Fast Path)
Date: 2025-11-15
Status: ✅ TinyUltraHot implementation complete (+21-36% on C2/C3 workloads); Phase 13 TinyHeapV2 remains a safe stub
Owner: Claude Code → implementation complete
1. Where We Are Overall
- Tiny (0–1023B):
  - Front: NEW 3-layer front (bump / small_mag / slow) is stable.
  - TinyHeapV2: the alloc front + statistics are implemented, but there is no magazine supply yet → 0% hit rate.
  - Drain: TLS SLL drain interval = 2048 (default). Tiny random mixed runs at the ~9M ops/s level.
- Mid (1KB–32KB):
  - GAP fixed: MID_MIN_SIZE lowered to 1024 so that Mid covers 1KB–8KB.
  - Pool TLS ON by default (mid bench): ~10.6M ops/s (faster than system malloc).
- Shared SuperSlab Pool (SP‑SLOT Box):
  - Implementation complete. SuperSlab count -92%, mmap/munmap -48%, throughput +131%.
  - Lock contention (Stage 2) handled up through P0-5, for roughly a +2–3% improvement.
Conclusion: the Mid / Shared Pool side is "done for now as a research target".
The big remaining headroom is the Tiny front (C0–C3) and a few Tiny benches (Larson / 1KB fixed).
2. Phase 14: TinyUltraHot Box (2025-11-15) ✅
2.1 Implementation Overview
ChatGPT's Phase 14 strategy - attack L1 dcache misses:
- Problem: perf stat comparison against system malloc
- L1 dcache miss: 30x worse (2.9M vs 96K)
- Instructions: 6.2x more (281M vs 45M)
- Branches: 7.1x more (59M vs 8.3M)
- Solution: ultra-simple straight-line path specialized for 16B/32B (HAKMEM C2/C3, UltraHot's internal C1/C2)
- Target: ~60% of tiny allocations
- Magazine-based (4 slots per class)
- Single cache line TLS structure (64B aligned)
- 5-7 instructions per alloc/free
2.2 Implementation Details
Box: TinyUltraHot (L0 ultra-fast path, C2/C3 = 16B/32B only)
Files:
- core/front/tiny_ultra_hot.h (343 lines, self-contained)
- core/hakmem_tiny.c (TLS + stats)
- core/tiny_alloc_fast.inc.h (alloc hook)
- core/tiny_free_fast_v2.inc.h (free hook)
- bench_random_mixed.c (stats output added)
TLS Structure:
typedef struct {
void* c1_mag[4]; // C2 (16B) magazine (32B)
void* c2_mag[4]; // C3 (32B) magazine (32B)
uint8_t c1_top, c2_top;
// Statistics...
} __attribute__((aligned(64))) TinyUltraHot;
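A minimal sketch of the straight-line pop/push this structure enables, and what the "5-7 instructions" claim refers to (the struct is from above; the helper names are illustrative, not the actual source):

```c
#include <stdint.h>

/* idx = 0 -> c1_mag (16B class), idx = 1 -> c2_mag (32B class). */
static inline void* ultra_hot_pop(TinyUltraHot* uh, int idx) {
    uint8_t* top = idx ? &uh->c2_top : &uh->c1_top;
    void**   mag = idx ? uh->c2_mag  : uh->c1_mag;
    if (*top == 0) return NULL;      /* L0 miss: caller falls back to L1/L2 */
    return mag[--*top];              /* straight-line: load, decrement, load */
}

static inline int ultra_hot_push(TinyUltraHot* uh, int idx, void* p) {
    uint8_t* top = idx ? &uh->c2_top : &uh->c1_top;
    void**   mag = idx ? uh->c2_mag  : uh->c1_mag;
    if (*top >= 4) return 0;         /* magazine full: use the normal free path */
    mag[(*top)++] = p;
    return 1;
}
```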
ENV Controls:
- HAKMEM_TINY_ULTRA_HOT=0/1 - Enable/disable (default: 1)
- HAKMEM_TINY_ULTRA_HOT_STATS=0/1 - Print stats at exit
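A sketch of how these gates are presumably read once per process (variable names and parsing are assumptions; the real handling lives in core/front/tiny_ultra_hot.h):

```c
#include <stdlib.h>

static int g_ultra_hot_enabled = 1;   /* HAKMEM_TINY_ULTRA_HOT, default ON */
static int g_ultra_hot_stats   = 0;   /* HAKMEM_TINY_ULTRA_HOT_STATS       */

static void ultra_hot_read_env(void) {
    const char* e;
    if ((e = getenv("HAKMEM_TINY_ULTRA_HOT")) != NULL)
        g_ultra_hot_enabled = (e[0] != '0');
    if ((e = getenv("HAKMEM_TINY_ULTRA_HOT_STATS")) != NULL)
        g_ultra_hot_stats = (e[0] != '0');
}
```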
Class Mapping (CRITICAL):
- C0 = 8B (not covered)
- C1 = ? (unknown)
- C2 = 16B ← UltraHot C1
- C3 = 32B ← UltraHot C2
- C4+ = 64B+ (not covered)
2.3 Performance Results
Fixed-Size Benchmarks (100K iterations, 128 workset):
| Size | Baseline | UltraHot ON | Improvement | Hit Rate |
|---|---|---|---|---|
| 16B (C2) | 48.2M ops/s | 58.3M ops/s | +20.9% | 99.9% |
| 32B (C3) | 45.1M ops/s | 55.9M ops/s | +23.9% | 99.9% |
Extended C2/C3 Tests (200K iterations, 256 workset):
| Size | Baseline | UltraHot ON | Improvement |
|---|---|---|---|
| 16B (C2) | 40.4M ops/s | 55.0M ops/s | +36.2% |
| 32B (C3) | 43.5M ops/s | 50.6M ops/s | +16.3% |
| 24B (C3) | 43.5M ops/s | 44.6M ops/s | +2.5% |
Random Mixed 256B (100K iterations):
- Baseline: 8.96M ops/s
- UltraHot ON: 8.81M ops/s (-1.6%)
- Reason: C2/C3 coverage = only 1-2% of workload
- C1 alloc=45 (0.045%), free=820 (0.82%)
- C2 alloc=828 (0.83%), free=1,567 (1.57%)
- Size distribution: 16-1040B (C2/C3 = ~1.7% of range)
- Conclusion: UltraHot overhead negligible on non-target workloads
2.4 Design Decisions
Why C2/C3 only?
- Cover ~60% of tiny allocations (ChatGPT estimate for 16B/32B)
- Small magazine (4 slots) fits in 1-2 cache lines
- Size check trivial (size <= 16 / size <= 32)
- Larger classes (C4+) have different access patterns
Why 4 slots per magazine?
- Target: 1 cache line (64B) for all state
- C1 mag (32B) + C2 mag (32B) = 64B (first cache line)
- Counters + stats in second cache line
- Trade capacity for cache locality
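The cache-line math behind the 4-slot choice can be made explicit with a mirror of the layout (a sketch assuming 8-byte pointers; the stats fields are placeholders, not the real ones):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    void*    c1_mag[4];                 /* 32B */
    void*    c2_mag[4];                 /* 32B -> first 64B line ends here */
    uint8_t  c1_top, c2_top;
    uint64_t hits, misses;              /* placeholder stats */
} __attribute__((aligned(64))) TinyUltraHotLayoutSketch;

_Static_assert(offsetof(TinyUltraHotLayoutSketch, c1_top) == 64,
               "the two 4-slot magazines exactly fill the first cache line");
_Static_assert(sizeof(TinyUltraHotLayoutSketch) <= 2 * 64,
               "tops + stats fit on the second cache line");
```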
Integration with existing layers:
- L0 (fastest): TinyUltraHot (C2/C3 only)
- L1 (fast): TinyHeapV2 (C0-C3, 16 slots, Phase 13)
- L2 (normal): FastCache + TLS SLL
- Fallback chain: L0 miss → L1 → L2
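A sketch of that fallback chain as a single dispatch (tiny_heap_v2_alloc_by_class exists per the Phase 13-B notes below; the other helper names are illustrative):

```c
static inline void* tiny_alloc_layered(size_t size, int class_idx) {
    void* p;
    /* L0: TinyUltraHot, only HAKMEM C2/C3 (16B/32B) */
    if (class_idx == 2 || class_idx == 3) {
        p = tiny_ultra_hot_alloc(class_idx);
        if (p) return p;
    }
    /* L1: TinyHeapV2 magazine front (C0-C3) */
    p = tiny_heap_v2_alloc_by_class(class_idx);
    if (p) return p;
    /* L2: existing FastCache + TLS SLL front */
    return tiny_alloc_fast_slow_path(size, class_idx);
}
```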
2.5 Critical Bug Fix: Class Numbering
Issue: Initial implementation assumed C1=16B, C2=32B
- Symptom: 0% hit rate, alloc_calls registered but free_calls=0
- Root cause: HAKMEM class numbering is C2=16B, C3=32B
- Discovery: Ran TinyHeapV2 stats on 16B → showed [C2] hit rate
- Fix: Changed checks from (class_idx==1||2) to (class_idx==2||3)
- Verification: Hit rate → 99.9% after fix
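In code, the fix amounts to changing one predicate (a sketch; the helper name is illustrative):

```c
/* Before: assumed C1=16B, C2=32B -> never matched real requests (0% hits). */
static inline int ultra_hot_covers_old(int class_idx) { return class_idx == 1 || class_idx == 2; }

/* After: HAKMEM numbering is C2=16B, C3=32B -> 99.9% hit rate. */
static inline int ultra_hot_covers(int class_idx)     { return class_idx == 2 || class_idx == 3; }
```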
2.6 Next Steps (Optional)
- perf stat validation: Measure actual L1 dcache miss reduction
- Larger magazines: Test 8-16 slots if cache locality permits
- C0/C4 coverage: Extend to 8B/64B if beneficial
- Adaptive enable: Auto-detect workload characteristics
Current Recommendation: Phase 14 COMPLETE ✅
- C2/C3-heavy workloads: +16-36% improvement
- Mixed workloads: Negligible overhead (<2%)
- Magazine-based design proven effective
- Ready for production use (default: ON)
3. Current State of the TinyHeapV2 Box (Phase 13)
3.1 Implemented (Phase 13-A – Alloc Front)
- Box: TinyHeapV2 (per-thread magazine front; L0 cache for C0–C3)
- Files:
  - core/front/tiny_heap_v2.h
  - core/hakmem_tiny.c (TLS definitions + stats output)
  - core/hakmem_tiny_alloc_new.inc (alloc hook)
- TLS structure:
  - __thread TinyHeapV2Mag g_tiny_heap_v2_mag[TINY_NUM_CLASSES];
  - __thread TinyHeapV2Stats g_tiny_heap_v2_stats[TINY_NUM_CLASSES];
- ENV:
  - HAKMEM_TINY_HEAP_V2 → Box ON/OFF.
  - HAKMEM_TINY_HEAP_V2_CLASS_MASK → bits 0–3 enable C0–C3.
  - HAKMEM_TINY_HEAP_V2_STATS → enable stats output.
  - HAKMEM_TINY_HEAP_V2_DEBUG → early debug logging.
- Behaviour (see the sketch at the end of this subsection):
  - hak_tiny_alloc(size) tries tiny_heap_v2_alloc(size) first when the class is C0–C3 and the mask allows it.
  - tiny_heap_v2_alloc:
    - If mag.top > 0, pop (returns BASE) → converted to header + user pointer via HAK_RET_ALLOC.
    - If the magazine is empty, return NULL immediately and fall back to the existing front.
  - tiny_heap_v2_refill_mag is a NO-OP (no refill).
  - tiny_heap_v2_try_push is implemented but intentionally not yet called from the real free/alloc paths (it will be used in Phase 13-B).
- Current performance:
  - 16/32/64B fixed-size (100K) within ±1% → the hook overhead is essentially zero.
  - alloc_calls grows to 200K but mag_hits = 0 (because there is no supply).
Key point: TinyHeapV2 is an L0 stub that was inserted without breaking anything.
How to design the supply path is the main topic of Phase 13-B.
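A minimal sketch of the Phase 13-A alloc behaviour above (the TLS variable names and alloc_calls/mag_hits counters are from these notes; the magazine field names and the class helper are illustrative):

```c
static inline void* tiny_heap_v2_alloc_sketch(size_t size) {
    int cls = tiny_size_to_class(size);                /* assumed helper, C0-C3 only */
    TinyHeapV2Mag*   mag = &g_tiny_heap_v2_mag[cls];
    TinyHeapV2Stats* st  = &g_tiny_heap_v2_stats[cls];
    st->alloc_calls++;
    if (mag->top > 0) {
        st->mag_hits++;
        return mag->slots[--mag->top];                 /* BASE; caller wraps via HAK_RET_ALLOC */
    }
    return NULL;                                       /* empty: fall back to the existing front */
}
```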
4. Recent Bug Fixes / Spec Adjustments (boxes we no longer need to touch)
4.1 Tiny / Mid size-boundary gap fix (done)
- Before: with TINY_MAX_SIZE = 1024 / MID_MIN_SIZE = 8192, nobody owned 1KB–8KB and those sizes went straight to mmap.
- Now:
  - Tiny: TINY_MAX_SIZE = 1023 (with the 1-byte header, Tiny covers up to 1023B).
  - Mid: MID_MIN_SIZE = 1024 (Mid MT handles 1KB–32KB).
- Effect:
  - bench_fixed_size_hakmem 1024B escapes mmap hell → improves to the ~0.5M ops/s level via the Mid MT path.
  - The SEGV is gone; what remains is only a performance gap (independent of TinyHeapV2).
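For reference, a minimal sketch of the size routing after the fix (the constants are from these notes; the dispatch helpers are illustrative):

```c
#include <stddef.h>

#define TINY_MAX_SIZE 1023          /* 1-byte header -> Tiny serves up to 1023B */
#define MID_MIN_SIZE  1024          /* Mid MT takes over from 1KB               */
#define MID_MAX_SIZE  (32 * 1024)

/* Every size in [1, 32KB] now has an owner; 1KB-8KB no longer falls through to raw mmap. */
static inline int route_is_tiny(size_t size) { return size <= TINY_MAX_SIZE; }
static inline int route_is_mid(size_t size)  { return size >= MID_MIN_SIZE && size <= MID_MAX_SIZE; }
```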
4.2 Shared Pool / LRU / Drain
- TLS SLL drain:
  - HAKMEM_TINY_SLL_DRAIN_INTERVAL default = 2048.
  - A/B tested at 128/256B fixed sizes. No regression in either case; in fact roughly a +5–15% improvement.
- SP‑SLOT Box:
  - SuperSlab-count and syscall reductions are as expected.
  - futex / lock contention handled up through P0-5 (further gains are high-cost territory and deferred for now).
4.3 ✅ CRITICAL FIX: workset=128 Infinite Recursion Bug (2025-11-15)
Commit: 176bbf656
Root Cause:
- shared_pool_ensure_capacity_unlocked() used realloc() for Shared Pool metadata allocation
- realloc() → hak_alloc_at(128B) → shared_pool_init() → realloc() → INFINITE RECURSION
- Triggered by high memory pressure (workset=128) but not lower pressure (workset=64)
Symptoms:
- bench_fixed_size_hakmem 1 16 128: infinite hang (timeout)
- bench_fixed_size_hakmem 1 1024 128: worked fine (4.3M ops/s)
- Size-class specific: C1-C3 (16-64B) hung, C7 (1024B) worked
- Reason: Small allocations trigger more SuperSlab allocations → more metadata realloc → deeper recursion
Fix (core/hakmem_shared_pool.c):
- Replace realloc() with direct system mmap() for Shared Pool metadata (see the sketch below)
- Use munmap() to free old mappings (not free()!)
- Breaks recursion: Shared Pool metadata is now allocated outside the HAKMEM allocator
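A minimal sketch of the pattern the fix uses (the helper below is illustrative, not the actual function in core/hakmem_shared_pool.c):

```c
#include <string.h>
#include <sys/mman.h>

/* Grow Shared Pool metadata with raw mmap/munmap so that metadata growth
 * never re-enters the allocator (which is what caused the recursion). */
static int sp_meta_grow(void** buf, size_t* cap_bytes, size_t need_bytes) {
    if (need_bytes <= *cap_bytes) return 0;
    size_t new_cap = *cap_bytes ? *cap_bytes * 2 : 4096;
    while (new_cap < need_bytes) new_cap *= 2;

    void* p = mmap(NULL, new_cap, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;

    if (*buf) {
        memcpy(p, *buf, *cap_bytes);   /* carry over the old metadata */
        munmap(*buf, *cap_bytes);      /* NOT free(): stay outside the allocator */
    }
    *buf = p;
    *cap_bytes = new_cap;
    return 0;
}
```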
Performance (before → after):
- 16B / workset=128: timeout → 18.5M ops/s ✅ FIXED
- 1024B / workset=128: 4.3M ops/s → stable (no regression)
- 16B / workset=64: 44M ops/s → stable (no regression)
Testing:
# Critical test (previously hung indefinitely)
./out/release/bench_fixed_size_hakmem 10000 256 128
# Expected: ~18M ops/s, instant completion
Key Lesson:
- Never use allocator-managed memory for the allocator's own metadata
- Bootstrap phase must use system primitives (mmap) directly
- Workset size can expose hidden recursion bugs under memory pressure
5. Phase 13-B – TinyHeapV2: Supply Path Implementation ✅ Complete
Status: completed 2025-11-15. Result: adopted the Stealing design (Mode 0 default); +18% improvement at 32B.
5.1 Completed Work
- ✅ Free-path supply implemented (core/tiny_free_fast_v2.inc.h)
  - Two supply modes, A/B-testable via ENV (see the sketch after this list):
    - Mode 0 (Stealing): L0 receives frees first (default)
    - Mode 1 (Leftover): L1 is the primary owner; L0 only gets leftovers
- ✅ Alloc-path hook implemented (core/tiny_alloc_fast.inc.h)
  - tiny_heap_v2_alloc_by_class(class_idx) - optimized (turned a -47% regression into a +14% improvement)
  - Takes class_idx directly, removing redundant conversion and checks
- ✅ ENV flags complete:
  - HAKMEM_TINY_HEAP_V2 - Box ON/OFF
  - HAKMEM_TINY_HEAP_V2_CLASS_MASK - per-class enable (bitmask)
  - HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE - switch between Mode 0/1
  - HAKMEM_TINY_HEAP_V2_STATS - stats output
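A sketch of the two supply modes on the free path (tiny_heap_v2_try_push is from Phase 13-A; the mode variable and the L1 helper are illustrative):

```c
static inline int tiny_free_with_v2(void* base, int class_idx) {
    if (g_tiny_heap_v2_leftover_mode == 0) {
        /* Mode 0 (Stealing, default): the L0 magazine grabs the block first. */
        if (tiny_heap_v2_try_push(base, class_idx)) return 1;
        return tiny_free_to_l1(base, class_idx);     /* L0 full -> normal free path */
    } else {
        /* Mode 1 (Leftover): L1 owns the block; L0 only takes what L1 declines. */
        if (tiny_free_to_l1(base, class_idx)) return 1;
        return tiny_heap_v2_try_push(base, class_idx);
    }
}
```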
5.2 A/B Test Results (100K iterations, workset=128)
| Size | Baseline (V2 OFF) | Mode 0 (Stealing) | Mode 1 (Leftover) |
|---|---|---|---|
| 16B | 43.9M ops/s | 45.6M (+3.9%) ✅ | 41.6M (-5.2%) ❌ |
| 32B | 41.9M ops/s | 49.6M (+18.4%) ✅ | 41.1M (-1.9%) ❌ |
| 64B | 51.2M ops/s | 51.5M (+0.6%) ≈ | 51.0M (-0.4%) ≈ |
Statistics (Mode 0 @ 16B):
- alloc_calls: 99,872
- mag_hits: 99,872 (100.0% hit rate)
- refill: 0 (supply comes from the free path only)
5.3 Design Decision: Adopt Stealing as the Default
ChatGPT's analysis (ultrathink consultation):
- Consistency with the learning layer is OK:
  - The learning layer mainly looks at SuperSlab / Pool / Drain statistics
  - L0 stealing does not break the carving/drain signals on the SuperSlab side
  - If needed, TinyHeapV2's hit/miss counters can be added as hooks for the learning layer
- Box boundary cleanup:
  - TinyHeapV2 is self-contained as a front-only Box
  - Pass the learning layer the "SuperSlab/Pool world" and the "L0/L1 statistics" as separate boxes
  - Performance (+18%) > strict Box boundaries
- Recommended policy:
  - For now, chase performance with Stealing (Mode 0 default)
  - Reconcile with the learning layer in a later Phase if needed
Decision: adopt HAKMEM_TINY_HEAP_V2_LEFTOVER_MODE=0 (Stealing) as the default.
Rationale: +18% performance improvement at 32B; the impact on the learning layer is minor.
5.4 Remaining Tasks (later Phases)
- C0 (8B) optimization: currently a -5% regression → consider disabling it via CLASS_MASK
- Learning-layer integration: add TinyHeapV2 hit/miss/refill counters as learning hooks if needed
- Random mixed bench: A/B test on the 256B mixed workload as well
6. "Do Not Touch for Now" Areas
- Mid-Large allocator (Pool TLS + lock-free Stage 1/2):
  - SEGV fixed, futex calls reduced by 95%, +896% improvement at 8T.
  - Sufficiently advanced as a research topic for now; OK to focus on Tiny.
- The 100x gap on the Larson bench:
  - A larger theme involving lock contention / metadata reuse.
  - Attack it in a separate Phase once TinyHeapV2 has taken shape.
7. Summary (one-line memo for Claude Code)
- Box boundary: TinyHeapV2 is a "front-only L0 Cache Box". Do not touch SuperSlab / Pool / Drain.
- Do now: insert the "leftover supply" from the alloc side in exactly one place, then gather stats and A/B results.
- Free-side integration: just tidy up the design; implementation can wait until we see how TinyHeapV2 behaves.
8. Phase 15: Box Separation (2025-11-15) - Incomplete ⏸️
8.1 Goal
Eliminate mincore syscall overhead (13.65% CPU, 987 calls/100K) via Box Separation architecture.
8.2 Implementation Status
Box headers complete; routing incomplete (SEGV).
Files Created:
- core/box/front_gate_v2.h (98 lines) ✅
  - Ultra-fast 1-byte header classification ONLY (see the sketch below)
  - Domains: TINY (0xa0), POOL (0xb0), MIDCAND, EXTERNAL
  - Performance: 2-5 cycles
  - Same-page guard added (defensive programming)
- core/box/external_guard_box.h (146 lines) ✅
  - ENV-controlled mincore safety check
  - ENV flags:
    - HAKMEM_EXTERNAL_GUARD_MINCORE=0/1 (default: 0 = OFF)
    - HAKMEM_EXTERNAL_GUARD_LOG=0/1
    - HAKMEM_EXTERNAL_GUARD_STATS=0/1
  - Expected: called 0-10 times in the bench (if >100 → box leak)
  - Uses __libc_free() to avoid infinite loop
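A sketch of the classification Box FG V2 performs, based on the domains and the same-page guard described above; the header location (user_ptr - 1), the tag comparison, and the page-aligned fallback are assumptions:

```c
#include <stdint.h>

typedef enum { FG_TINY, FG_POOL, FG_MIDCAND, FG_EXTERNAL } fg_domain_t;

static inline fg_domain_t front_gate_v2_classify(void* user_ptr) {
    uintptr_t p = (uintptr_t)user_ptr;
    /* Same-page guard: the 1-byte header sits just before user_ptr, so a
     * page-aligned pointer would read the previous page -> do not deref;
     * defer to the slower registry-based classification instead. */
    if ((p & 0xFFF) == 0) return FG_MIDCAND;
    uint8_t tag = *((uint8_t*)user_ptr - 1);
    if ((tag & 0xF0) == 0xa0) return FG_TINY;   /* TINY domain (0xa0) */
    if ((tag & 0xF0) == 0xb0) return FG_POOL;   /* POOL domain (0xb0) */
    return FG_MIDCAND;   /* headerless candidates (e.g. C7) need a registry lookup */
}
```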
Routing (hak_free_at):
- ❌ Phase 15 routing incomplete (SEGV on page-aligned pointers)
- ✅ Reverted to Phase 14-C (classify_ptr-based, stable)
8.3 Issues Encountered
- Page-aligned pointer crash (0x...000 & 0xFFF == 0)
  - Box FG V2 was missing the same-page guard → fixed
  - Still crashes in the drain phase → deeper issue
- C7 (1KB headerless) misclassification
  - Box FG V2 cannot classify C7 (no 1-byte header)
  - Requires a registry-lookup fallback
- mincore OFF unsafe
  - DISABLE_MINCORE=1 causes SEGV (invalid AllocHeader deref)
  - The mincore safety check is essential for mixed allocations
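For reference, the mincore probe that ExternalGuard relies on amounts to asking the kernel whether the page is mapped before any AllocHeader dereference; a minimal sketch (not the actual code in core/box/external_guard_box.h):

```c
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static inline int page_is_mapped(const void* addr) {
    unsigned char vec;
    uintptr_t page_mask = (uintptr_t)sysconf(_SC_PAGESIZE) - 1;
    void* page = (void*)((uintptr_t)addr & ~page_mask);
    return mincore(page, 1, &vec) == 0;   /* -1/ENOMEM -> page not mapped */
}
```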
8.4 Performance (Phase 14-C Baseline)
Current (mincore ON):
- Random Mixed 256B: 16.5M ops/s
- mincore: 841 calls/100K iterations
- Stable, no crashes
Target (mincore OFF):
- Expected: +15% (eliminate 13.65% CPU overhead)
- Reality: SEGV (unsafe AllocHeader access)
8.5 Next Steps (Deferred to a Future Phase)
Full Phase 15 implementation will be reattempted in a later phase:
- Mid/Large/C7 registry consolidation - Unified lookup for MIDCAND
- AllocHeader safety - Add header validation before deref
- ExternalGuard integration - Proper libc delegation
Current Recommendation: Stick with Phase 14-C (stable, 16.5M ops/s)
- mincore overhead is acceptable for now (~1.9ms / 100K iterations)
- Focus on other bottlenecks (TLS SLL refill, SuperSlab churn)