# Current Task: Phase 9-2 — SuperSlab State Unification Plan

**Date**: 2025-12-01
**Status**: Runtime bug provisionally resolved (slot sync stops registry exhaustion)
**Goal**: Eliminate the duplicated Legacy/Shared metadata and consolidate SuperSlab state management into the shared pool as the root fix.

---

## Background / Symptoms

- With `HAKMEM_TINY_USE_SUPERSLAB=1`: `SuperSlab registry full` — registry entries were never released and ran out.
- Cause: SuperSlabs acquired via the Legacy path were not reflected in the Shared Pool's slot state, so `shared_pool_release_slab()` returned early.
- Stopgap (already applied): `sp_meta_sync_slots_from_ss()` detects the divergence and syncs, allowing the EMPTY → free-list → registry-unregister sequence to complete.

## Root Cause (Box Theory View)

- Duplicated state: the Legacy path and the Shared Pool path each keep their own SuperSlab state, and the two drift apart.
- Multiplied boundaries: acquire/free crosses several boundaries, so EMPTY detection and slot transitions are scattered.

## Goals

1) Unify SuperSlab state transitions (UNUSED/ACTIVE/EMPTY) into the Shared Pool's slot state.
2) Consolidate the acquire/free/adopt/drain boundaries into the shared-pool path (with A/B guards so it can be rolled back).
3) Keep the Legacy backend as a compatibility box, sync it at the entry point, and move toward a state where it can be deleted.

## Next Steps

1. **Design a unified entry point**
   - Route `superslab_allocate()` through a thin shared-pool wrapper so registration and `SharedSSMeta` initialization always happen (toggled ON/OFF via env var).
2. **Clean up the free path**
   - Make `shared_pool_release_slab()` the sole owner of EMPTY detection for TLS drain / remote / local free.
   - Draft a design that funnels all `empty_mask/nonempty_mask/freelist_mask` updates through a single shared-pool-internal helper.
3. **Observation and guards**
   - A/B via `HAKMEM_TINY_SS_SHARED` / `HAKMEM_TINY_USE_SUPERSLAB`; one-shot observation via `*_DEBUG`.
   - Dashboard the `shared_fail→legacy` count and registry occupancy to judge when the migration is complete.
4. **Write the phased convergence plan**
   - Document the stages for defaulting the Legacy backend to OFF and then deleting it, plus the rollback (retreat) conditions.

## Current Blockers / Risks

- If code keeps growing while Legacy/Shared coexist, new sync gaps are likely to appear.
- LRU/EMPTY mask responsibilities are scattered; consolidation may cause side effects.

## Deliverables

- Design note: unified entry wrapper, mask-update helper, A/B guard design.
- Minimal patch proposal: introduce the wrapper and consolidate mask updates (code changes in the next step).
- Verification procedure: regression test for registry exhaustion; confirm the `shared_fail→legacy` counter converges.

---

## Commits

### Phase 8 Root Cause Fix
**Commit**: `191e65983`
**Date**: 2025-11-30
**Files**: 3 files, 36 insertions(+), 13 deletions(-)

**Changes**:
1. `bench_fast_box.c` (Layer 0 + Layer 1):
   - Removed unified_cache_init() call (design misunderstanding)
   - Limited prealloc to 128 blocks/class (actual TLS SLL capacity)
   - Added root cause comments explaining why unified_cache_init() was wrong
2. `bench_fast_box.h` (Layer 3):
   - Added Box Contract documentation (BenchFast uses TLS SLL, NOT UC)
   - Documented scope separation (workload vs infrastructure allocations)
   - Added contract violation example (Phase 8 bug explanation)
3. `tiny_unified_cache.c` (Layer 2):
   - Changed calloc() → __libc_calloc() (infrastructure isolation)
   - Changed free() → __libc_free() (symmetric cleanup)
   - Added defensive fix comments explaining the infrastructure bypass

### Phase 8-TLS-Fix
**Commit**: `da8f4d2c8`
**Date**: 2025-11-30
**Files**: 3 files, 21 insertions(+), 11 deletions(-)

**Changes**:
1. `bench_fast_box.c` (TLS→Atomic):
   - Changed `__thread int bench_fast_init_in_progress` → `atomic_int g_bench_fast_init_in_progress`
   - Added atomic_load() for reads, atomic_store() for writes
   - Added root cause comments (pthread_once init can run on a thread with fresh TLS)
2. `bench_fast_box.h` (TLS→Atomic):
   - Updated extern declaration to match atomic_int
   - Added Phase 8-TLS-Fix comment explaining cross-thread safety
3. `bench_fast_box.c` (Header Write):
   - Replaced `tiny_region_id_write_header()` → direct write `*(uint8_t*)base = 0xa0 | class_idx`
   - Added Phase 8-P3-Fix comment explaining the P3 optimization bypass
   - Contract: BenchFast always writes headers (required for free routing)
4. `hak_wrappers.inc.h` (Atomic):
   - Updated the bench_fast_init_in_progress check to use atomic_load()
   - Added Phase 8-TLS-Fix comment for cross-thread safety

---

## Performance Journey

### Phase-by-Phase Progress

```
Phase 3 (mincore removal):      56.8 M ops/s
Phase 4 (Hot/Cold Box):         57.2 M ops/s (+0.7%)
Phase 5 (Mid MT fix):           52.3 M ops/s (-8.6% regression)
Phase 6 (Lock-free Mid MT):     42.1 M ops/s (Mid MT: +2.65%)
Phase 7-Step1 (Unified front):  80.6 M ops/s (+54.2%!) ⭐
Phase 7-Step4 (Dead code):      81.5 M ops/s (+1.1%) ⭐⭐
Phase 8 (Normal mode):          16.3 M ops/s (working, different workload)

Total improvement: +43.5% (56.8M → 81.5M) from Phase 3
```

**Note**: Phase 8 used a different benchmark (10M iterations, ws=8192) vs Phase 7 (ws=256).
Normal mode performance: 16.3 M ops/s (working, no crash).

---

## Technical Details

### Layer 0: Prealloc Capacity Fix
**File**: `core/box/bench_fast_box.c`
**Lines**: 131-148

**Root Cause**:
- Old code preallocated 50,000 blocks/class
- TLS SLL actual capacity: 128 blocks (adaptive sizing limit)
- Lost blocks (beyond 128) caused heap corruption

**Fix**:
```c
// Before:
const uint32_t PREALLOC_COUNT = 50000;  // Too large!

// After:
const uint32_t ACTUAL_TLS_SLL_CAPACITY = 128;  // Observed actual capacity
for (int cls = 2; cls <= 7; cls++) {
    uint32_t capacity = ACTUAL_TLS_SLL_CAPACITY;
    for (int i = 0; i < (int)capacity; i++) {
        // preallocate...
    }
}
```

### Layer 1: Design Misunderstanding Fix
**File**: `core/box/bench_fast_box.c`
**Lines**: 123-128 (REMOVED)

**Root Cause**:
- BenchFast uses TLS SLL directly (g_tls_sll[])
- Unified Cache is NOT used by BenchFast
- unified_cache_init() created 16KB allocations (infrastructure)
- Later freed by BenchFast → header misclassification → CRASH

**Fix**:
```c
// REMOVED:
// unified_cache_init();  // WRONG! BenchFast uses TLS SLL, not Unified Cache

// Added comment:
// Phase 8 Root Cause Fix: REMOVED unified_cache_init() call
// Reason: BenchFast uses TLS SLL directly, NOT Unified Cache
```

### Layer 2: Infrastructure Isolation
**File**: `core/front/tiny_unified_cache.c`
**Lines**: 61-71 (init), 103-109 (shutdown)

**Strategy**: Dual-Path Separation
- **Workload allocations** (measured): HAKMEM paths (TLS SLL, Unified Cache)
- **Infrastructure allocations** (unmeasured): __libc_calloc/__libc_free

**Fix**:
```c
// Before:
g_unified_cache[cls].slots = (void**)calloc(cap, sizeof(void*));

// After:
extern void* __libc_calloc(size_t, size_t);
g_unified_cache[cls].slots = (void**)__libc_calloc(cap, sizeof(void*));
```

### Layer 3: Box Contract Documentation
**File**: `core/box/bench_fast_box.h`
**Lines**: 13-51

**Added Documentation**:
- BenchFast uses TLS SLL, NOT Unified Cache
- Scope separation (workload vs infrastructure)
- Preconditions and guarantees
- Contract violation example (Phase 8 bug)

### TLS→Atomic Fix
**File**: `core/box/bench_fast_box.c`
**Lines**: 22-27 (declaration), 37, 124, 215 (usage)

**Root Cause**:
```
Init runs under pthread_once() on a different thread
That thread's fresh TLS has bench_fast_init_in_progress = 0
Guard broken → getenv() allocates → freed by __libc_free() → CRASH
```

**Fix**:
```c
// Before (TLS - broken):
__thread int bench_fast_init_in_progress = 0;
if (__builtin_expect(bench_fast_init_in_progress, 0)) { ... }

// After (Atomic - fixed):
atomic_int g_bench_fast_init_in_progress = 0;
if (__builtin_expect(atomic_load(&g_bench_fast_init_in_progress), 0)) { ... }
```

**Box Theory Validation**:
- **Responsibility**: Guard must protect the entire process (not per-thread)
- **Contract**: "No BenchFast allocations during init" (all threads)
- **Observable**: Atomic variable visible across all threads
- **Composable**: Works with the pthread_once() threading model

### Header Write Fix
**File**: `core/box/bench_fast_box.c`
**Lines**: 70-80

**Root Cause**:
- P3 optimization: tiny_region_id_write_header() skips header writes by default
- BenchFast free routing checks the header magic (0xa0-0xa7)
- No header → free() misroutes to __libc_free() → CRASH

**Fix**:
```c
// Before (broken - calls a function that skips the write):
tiny_region_id_write_header(base, class_idx);
return (void*)((char*)base + 1);

// After (fixed - direct write):
*(uint8_t*)base = (uint8_t)(0xa0 | (class_idx & 0x0f));  // Direct write
return (void*)((char*)base + 1);
```

**Contract**: BenchFast always writes headers (required for free routing)

---

## Next Phase Options

### Option A: Continue Phase 7 (Steps 5-7) 📦
**Goal**: Remove remaining legacy layers (complete dead code elimination)
**Expected**: Additional +3-5% via further code cleanup
**Duration**: 1-2 days
**Risk**: Low (infrastructure already in place)

**Remaining Steps**:
- Step 5: Compile library with PGO flag (Makefile change)
- Step 6: Verify dead code elimination in assembly
- Step 7: Measure performance improvement

### Option B: PGO Re-enablement 🚀
**Goal**: Re-enable the PGO workflow from Phase 4-Step1
**Expected**: +6-13% cumulative (on top of 81.5M)
**Duration**: 2-3 days
**Risk**: Low (proven pattern)

**Current projection**:
- Phase 7 baseline: 81.5 M ops/s
- With PGO: ~86-93 M ops/s (+6-13%)

### Option C: BenchFast Pool Expansion 🏎️
**Goal**: Increase BenchFast pool size for full 10M iteration support
**Expected**: Structural ceiling measurement (30-40M ops/s target)
**Duration**: 1 day
**Risk**: Low (just increase the prealloc count)

**Current status**:
- Pool: 128 blocks/class (768 total)
- Exhaustion: C6/C7 exhaust after ~200 iterations
- Need: ~10,000 blocks/class for 10M iterations (60,000 total)

### Option D: Production Readiness 📊
**Goal**: Comprehensive benchmark suite, deployment guide
**Expected**: Full performance comparison, stability testing
**Duration**: 3-5 days
**Risk**: Low (documentation + testing)

---

## Recommendation

### Top Pick: **Option C (BenchFast Pool Expansion)** 🏎️

**Reasoning**:
1. **Phase 8 fixes working**: TLS→Atomic + Header write proven
2. **Quick win**: Just increase ACTUAL_TLS_SLL_CAPACITY to 10,000
3. **Scientific value**: Measures the true structural ceiling (no safety costs)
4. **Low risk**: 1-day task, a one-line capacity change only
5. **Data-driven**: Enables comparison vs normal mode (16.3M vs 30-40M expected)

**Expected Result**:
```
Normal mode:    16.3 M ops/s (current)
BenchFast mode: 30-40 M ops/s (target, 2-2.5x faster)
```

**Implementation**:
```c
// core/box/bench_fast_box.c:140
const uint32_t ACTUAL_TLS_SLL_CAPACITY = 10000;  // Was 128
```

---

### Second Choice: **Option B (PGO Re-enablement)** 🚀

**Reasoning**:
1. **Proven benefit**: +6.25% in Phase 4-Step1
2. **Cumulative**: Would stack with Phase 7 (81.5M baseline)
3. **Low risk**: Just fix the build issue
4. **High impact**: ~86-93 M ops/s projected

---

## Current Performance Summary

### bench_random_mixed (16B-1KB, Tiny workload)
```
Phase 7-Step4 (ws=256):  81.5 M ops/s (+55.5% total)
Phase 8 (ws=8192):       16.3 M ops/s (normal mode, working)
```

### bench_mid_mt_gap (1KB-8KB, Mid MT workload, ws=256)
```
After Phase 6-B (lock-free): 42.09 M ops/s (+2.65%)
vs System malloc:            26.8 M ops/s (1.57x faster)
```

### Overall Status
- ✅ **Tiny allocations** (16B-1KB): **81.5 M ops/s** (excellent, +55.5%!)
- ✅ **Mid MT allocations** (1KB-8KB): 42 M ops/s (excellent, 1.57x vs system)
- ✅ **BenchFast mode**: No crash (TLS→Atomic + Header fix working)
- ⏸️ **Large allocations** (32KB-2MB): Not benchmarked yet
- ⏸️ **MT workloads**: No MT benchmarks yet

---

## Decision Time

**Choose your next phase**:
- **Option A**: Continue Phase 7 (Steps 5-7, final cleanup)
- **Option B**: PGO re-enablement (recommended for normal builds)
- **Option C**: BenchFast pool expansion (recommended for ceiling measurement)
- **Option D**: Production readiness & benchmarking

**Or**: Celebrate Phase 8 success! 🎉 (Root cause fixes complete!)

---

Updated: 2025-11-30
Phase: 8 COMPLETE (Root Cause Fixes) → 9 PENDING
Previous: Phase 7 (Tiny Front Unification, +55.5%)
Achievement: BenchFast crash investigation and fixes (Box Theory root cause analysis!)