Current Task: Phase 9-2 — SuperSlab State Unification Plan
Date: 2025-12-01
Status: Runtime bug provisionally resolved (slot sync stops the registry exhaustion)
Goal: Eliminate the duplicated Legacy/Shared metadata and consolidate SuperSlab state management into the shared pool as the root-cause fix.
Background / Symptoms
- Symptom: With HAKMEM_TINY_USE_SUPERSLAB=1, "SuperSlab registry full" — SuperSlabs were never released, so the registry was exhausted.
- Cause: SuperSlabs acquired via the Legacy path were not reflected in the Shared Pool's slot state, so shared_pool_release_slab() returned early.
- Workaround (already applied): sp_meta_sync_slots_from_ss() now detects the divergence, syncs the slots, and proceeds through EMPTY → free list → registry unregistration.
Root Cause (Box Theory view)
- Duplicated state: the Legacy path and the Shared Pool path each keep their own SuperSlab state, and the two drift apart.
- Multiplied boundaries: acquire/free happens at several boundaries, so EMPTY detection and slot transitions are scattered.
Goals
- Unify SuperSlab state transitions (UNUSED/ACTIVE/EMPTY) into the Shared Pool's slot state.
- Concentrate the acquire/free/adopt/drain boundaries on the shared-pool path (with an A/B guard so we can roll back).
- Keep the Legacy backend as a compatibility box, sync it at the entry point, and drive it toward a deletable state.
Next Steps (procedure)
- Entry-point unification design
  - Route superslab_allocate() through a thin shared-pool wrapper so registration and SharedSSMeta initialization always happen (ON/OFF via env var).
- Free-path cleanup
  - Make shared_pool_release_slab() the sole owner of EMPTY detection for TLS drain / remote / local frees.
  - Draft a design that funnels all empty_mask/nonempty_mask/freelist_mask updates through internal shared-pool helpers.
- Observation and guards
  - A/B with HAKMEM_TINY_SS_SHARED / HAKMEM_TINY_USE_SUPERSLAB; one-shot observation via *_DEBUG. Dashboard the shared_fail→legacy counter and registry occupancy to judge when migration is complete.
- Write the phased convergence plan
  - Document the stages for defaulting the Legacy backend to OFF and deleting it, plus the rollback (retreat) conditions.
Current Blockers / Risks
- If code keeps growing with Legacy/Shared mixed, new sync gaps are likely to appear.
- LRU/EMPTY mask responsibilities are scattered; consolidating them may cause side effects.
Deliverables
- Design note: entry-point unification wrapper, mask-update helpers, A/B guard design.
- Minimal patch proposal: introduce the wrapper and centralize mask updates (code changes come in the next step).
- Verification steps: regression test for registry exhaustion; confirm the shared_fail→legacy counter converges.
Commits
Phase 8 Root Cause Fix
Commit: 191e65983
Date: 2025-11-30
Files: 3 files, 36 insertions(+), 13 deletions(-)
Changes:
- bench_fast_box.c (Layer 0 + Layer 1):
  - Removed unified_cache_init() call (design misunderstanding)
  - Limited prealloc to 128 blocks/class (actual TLS SLL capacity)
  - Added root cause comments explaining why unified_cache_init() was wrong
- bench_fast_box.h (Layer 3):
  - Added Box Contract documentation (BenchFast uses TLS SLL, NOT UC)
  - Documented scope separation (workload vs infrastructure allocations)
  - Added contract violation example (Phase 8 bug explanation)
- tiny_unified_cache.c (Layer 2):
  - Changed calloc() → __libc_calloc() (infrastructure isolation)
  - Changed free() → __libc_free() (symmetric cleanup)
  - Added defensive fix comments explaining infrastructure bypass
Phase 8-TLS-Fix
Commit: da8f4d2c8
Date: 2025-11-30
Files: 3 files, 21 insertions(+), 11 deletions(-)
Changes:
- bench_fast_box.c (TLS→Atomic):
  - Changed __thread int bench_fast_init_in_progress → atomic_int g_bench_fast_init_in_progress
  - Added atomic_load() for reads, atomic_store() for writes
  - Added root cause comments (pthread_once creates fresh TLS)
- bench_fast_box.h (TLS→Atomic):
  - Updated extern declaration to match atomic_int
  - Added Phase 8-TLS-Fix comment explaining cross-thread safety
- bench_fast_box.c (Header Write):
  - Replaced tiny_region_id_write_header() → direct write *(uint8_t*)base = 0xa0 | class_idx
  - Added Phase 8-P3-Fix comment explaining P3 optimization bypass
  - Contract: BenchFast always writes headers (required for free routing)
- hak_wrappers.inc.h (Atomic):
  - Updated bench_fast_init_in_progress check to use atomic_load()
  - Added Phase 8-TLS-Fix comment for cross-thread safety
Performance Journey
Phase-by-Phase Progress
Phase 3 (mincore removal): 56.8 M ops/s
Phase 4 (Hot/Cold Box): 57.2 M ops/s (+0.7%)
Phase 5 (Mid MT fix): 52.3 M ops/s (-8.6% regression)
Phase 6 (Lock-free Mid MT): 42.1 M ops/s (Mid MT: +2.65%)
Phase 7-Step1 (Unified front): 80.6 M ops/s (+54.2%!) ⭐
Phase 7-Step4 (Dead code): 81.5 M ops/s (+1.1%) ⭐⭐
Phase 8 (Normal mode): 16.3 M ops/s (working, different workload)
Total improvement: +43.5% (56.8M → 81.5M) from Phase 3
Note: Phase 8 used different benchmark (10M iterations, ws=8192) vs Phase 7 (ws=256). Normal mode performance: 16.3M ops/s (working, no crash).
Technical Details
Layer 0: Prealloc Capacity Fix
File: core/box/bench_fast_box.c
Lines: 131-148
Root Cause:
- Old code preallocated 50,000 blocks/class
- TLS SLL actual capacity: 128 blocks (adaptive sizing limit)
- Lost blocks (beyond 128) caused heap corruption
Fix:
// Before:
const uint32_t PREALLOC_COUNT = 50000; // Too large!
// After:
const uint32_t ACTUAL_TLS_SLL_CAPACITY = 128; // Observed actual capacity
for (int cls = 2; cls <= 7; cls++) {
uint32_t capacity = ACTUAL_TLS_SLL_CAPACITY;
for (int i = 0; i < (int)capacity; i++) {
// preallocate...
}
}
Layer 1: Design Misunderstanding Fix
File: core/box/bench_fast_box.c
Lines: 123-128 (REMOVED)
Root Cause:
- BenchFast uses TLS SLL directly (g_tls_sll[])
- Unified Cache is NOT used by BenchFast
- unified_cache_init() created 16KB allocations (infrastructure)
- Later freed by BenchFast → header misclassification → CRASH
Fix:
// REMOVED:
// unified_cache_init(); // WRONG! BenchFast uses TLS SLL, not Unified Cache
// Added comment:
// Phase 8 Root Cause Fix: REMOVED unified_cache_init() call
// Reason: BenchFast uses TLS SLL directly, NOT Unified Cache
Layer 2: Infrastructure Isolation
File: core/front/tiny_unified_cache.c
Lines: 61-71 (init), 103-109 (shutdown)
Strategy: Dual-Path Separation
- Workload allocations (measured): HAKMEM paths (TLS SLL, Unified Cache)
- Infrastructure allocations (unmeasured): __libc_calloc/__libc_free
Fix:
// Before:
g_unified_cache[cls].slots = (void**)calloc(cap, sizeof(void*));
// After:
extern void* __libc_calloc(size_t, size_t);
g_unified_cache[cls].slots = (void**)__libc_calloc(cap, sizeof(void*));
Layer 3: Box Contract Documentation
File: core/box/bench_fast_box.h
Lines: 13-51
Added Documentation:
- BenchFast uses TLS SLL, NOT Unified Cache
- Scope separation (workload vs infrastructure)
- Preconditions and guarantees
- Contract violation example (Phase 8 bug)
TLS→Atomic Fix
File: core/box/bench_fast_box.c
Lines: 22-27 (declaration), 37, 124, 215 (usage)
Root Cause:
pthread_once() → creates new thread
New thread has fresh TLS (bench_fast_init_in_progress = 0)
Guard broken → getenv() allocates → freed by __libc_free() → CRASH
Fix:
// Before (TLS - broken):
__thread int bench_fast_init_in_progress = 0;
if (__builtin_expect(bench_fast_init_in_progress, 0)) { ... }
// After (Atomic - fixed):
atomic_int g_bench_fast_init_in_progress = 0;
if (__builtin_expect(atomic_load(&g_bench_fast_init_in_progress), 0)) { ... }
Box Theory Validation:
- Responsibility: Guard must protect entire process (not per-thread)
- Contract: "No BenchFast allocations during init" (all threads)
- Observable: Atomic variable visible across all threads
- Composable: Works with pthread_once() threading model
Header Write Fix
File: core/box/bench_fast_box.c
Lines: 70-80
Root Cause:
- P3 optimization: tiny_region_id_write_header() skips header writes by default
- BenchFast free routing checks header magic (0xa0-0xa7)
- No header → free() misroutes to __libc_free() → CRASH
Fix:
// Before (broken - calls function that skips write):
tiny_region_id_write_header(base, class_idx);
return (void*)((char*)base + 1);
// After (fixed - direct write):
*(uint8_t*)base = (uint8_t)(0xa0 | (class_idx & 0x0f)); // Direct write
return (void*)((char*)base + 1);
Contract: BenchFast always writes headers (required for free routing)
Next Phase Options
Option A: Continue Phase 7 (Steps 5-7) 📦
Goal: Remove remaining legacy layers (complete dead code elimination)
Expected: Additional +3-5% via further code cleanup
Duration: 1-2 days
Risk: Low (infrastructure already in place)
Remaining Steps:
- Step 5: Compile library with PGO flag (Makefile change)
- Step 6: Verify dead code elimination in assembly
- Step 7: Measure performance improvement
Option B: PGO Re-enablement 🚀
Goal: Re-enable PGO workflow from Phase 4-Step1
Expected: +6-13% cumulative (on top of 81.5M)
Duration: 2-3 days
Risk: Low (proven pattern)
Current projection:
- Phase 7 baseline: 81.5 M ops/s
- With PGO: ~86-93 M ops/s (+6-13%)
Option C: BenchFast Pool Expansion 🏎️
Goal: Increase BenchFast pool size for full 10M iteration support
Expected: Structural ceiling measurement (30-40M ops/s target)
Duration: 1 day
Risk: Low (just increase prealloc count)
Current status:
- Pool: 128 blocks/class (768 total)
- Exhaustion: C6/C7 exhaust after ~200 iterations
- Need: ~10,000 blocks/class for 10M iterations (60,000 total)
Option D: Production Readiness 📊
Goal: Comprehensive benchmark suite, deployment guide
Expected: Full performance comparison, stability testing
Duration: 3-5 days
Risk: Low (documentation + testing)
Recommendation
Top Pick: Option C (BenchFast Pool Expansion) 🏎️
Reasoning:
- Phase 8 fixes working: TLS→Atomic + Header write proven
- Quick win: Just increase ACTUAL_TLS_SLL_CAPACITY to 10,000
- Scientific value: Measure true structural ceiling (no safety costs)
- Low risk: 1-day task, no structural code changes (just a one-constant capacity tweak)
- Data-driven: Enables comparison vs normal mode (16.3M vs 30-40M expected)
Expected Result:
Normal mode: 16.3 M ops/s (current)
BenchFast mode: 30-40 M ops/s (target, 2-2.5x faster)
Implementation:
// core/box/bench_fast_box.c:140
const uint32_t ACTUAL_TLS_SLL_CAPACITY = 10000; // Was 128
Second Choice: Option B (PGO Re-enablement) 🚀
Reasoning:
- Proven benefit: +6.25% in Phase 4-Step1
- Cumulative: Would stack with Phase 7 (81.5M baseline)
- Low risk: Just fix build issue
- High impact: ~86-93 M ops/s projected
Current Performance Summary
bench_random_mixed (16B-1KB, Tiny workload)
Phase 7-Step4 (ws=256): 81.5 M ops/s (+55.5% total)
Phase 8 (ws=8192): 16.3 M ops/s (normal mode, working)
bench_mid_mt_gap (1KB-8KB, Mid MT workload, ws=256)
After Phase 6-B (lock-free): 42.09 M ops/s (+2.65%)
vs System malloc: 26.8 M ops/s (1.57x faster)
Overall Status
- ✅ Tiny allocations (16B-1KB): 81.5 M ops/s (excellent, +55.5%!)
- ✅ Mid MT allocations (1KB-8KB): 42 M ops/s (excellent, 1.57x vs system)
- ✅ BenchFast mode: No crash (TLS→Atomic + Header fix working)
- ⏸️ Large allocations (32KB-2MB): Not benchmarked yet
- ⏸️ MT workloads: No MT benchmarks yet
Decision Time
Choose your next phase:
- Option A: Continue Phase 7 (Steps 5-7, final cleanup)
- Option B: PGO re-enablement (recommended for normal builds)
- Option C: BenchFast pool expansion (recommended for ceiling measurement)
- Option D: Production readiness & benchmarking
Or: Celebrate Phase 8 success! 🎉 (Root cause fixes complete!)
Updated: 2025-11-30
Phase: 8 COMPLETE (Root Cause Fixes) → 9 PENDING
Previous: Phase 7 (Tiny Front Unification, +55.5%)
Achievement: BenchFast crash investigation and fixes (Box Theory root cause analysis!)