diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
index 5ebca6bd..6bad1ede 100644
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -1,363 +1,51 @@
-# Current Task: Phase 9-2 — SuperSlab State Unification Plan
+# Current Task: Phase 9-2 Refactoring (Complete)
-**Date**: 2025-12-01
-**Status**: Runtime bug provisionally resolved (slot sync stops the registry exhaustion)
-**Goal**: Eliminate the duplicated Legacy/Shared metadata and advance a root-cause design that consolidates SuperSlab state management into the shared pool.
-
----
-
-## Background / Symptoms
-- With `HAKMEM_TINY_USE_SUPERSLAB=1`, `SuperSlab registry full` occurs → registry entries are never released and the registry runs dry.
-- Cause: SuperSlabs acquired via the Legacy path were not reflected in the Shared Pool slot state, so `shared_pool_release_slab()` returned early.
-- Stopgap: `sp_meta_sync_slots_from_ss()` now syncs the state when it detects a mismatch and proceeds through EMPTY → free list → registry unregistration.
-
-## Root Cause (Box Theory view)
-- Duplicated state: the Legacy path and the Shared Pool path each keep their own SuperSlab state, and they drift apart.
-- Multiplied boundaries: there are several acquire/free boundaries, so EMPTY detection and slot transitions are scattered.
-
-## Goals
-1) Unify SuperSlab state transitions (UNUSED/ACTIVE/EMPTY) into the Shared Pool slot state.
-2) Consolidate the acquire/free/adopt/drain boundaries into the shared-pool path (with A/B guards so the change can be rolled back).
-3) Keep the Legacy backend as a compatibility box, sync it at the entry point, and bring it to a state where it can eventually be deleted.
-
-## Next Steps (procedure)
-1. **Design a unified entry point**
-   - Design a scheme where `superslab_allocate()` goes through a thin shared-pool wrapper so that registration and `SharedSSMeta` initialization always happen (ON/OFF via env).
-2. **Clean up the free path**
-   - Clarify responsibilities so that EMPTY detection from TLS drain / remote / local free is handled only by `shared_pool_release_slab()`.
-   - Draft a design that consolidates `empty_mask/nonempty_mask/freelist_mask` updates into shared-pool internal helpers.
-3. **Observation and guards**
-   - A/B via `HAKMEM_TINY_SS_SHARED` / `HAKMEM_TINY_USE_SUPERSLAB`, one-shot observation via `*_DEBUG`.
-   - Dashboard `shared_fail→legacy` and registry occupancy to decide when the migration is complete.
-4. **Write a phased convergence plan**
-   - Document the stages for turning the Legacy backend OFF by default and then deleting it, plus the retreat (rollback) conditions.
-
-## Current Blockers / Risks
-- Risk that sync gaps reappear if code keeps growing while Legacy/Shared stay mixed.
-- LRU/EMPTY mask responsibilities are scattered, so consolidation may cause side effects.
-
-## Expected Deliverables
-- Design note: unified entry wrapper, mask-update helpers, A/B guard design.
-- Minimal patch proposal: introduce the wrapper and consolidate mask updates (code changes in the next step).
-- Verification steps: regression test for registry exhaustion, confirmation that the `shared_fail→legacy` counter converges.
-
----
-
-## Commits
-
-### Phase 8 Root Cause Fix
-**Commit**: `191e65983` **Date**: 2025-11-30
-**Files**: 3 files, 36 insertions(+), 13 deletions(-)
-
-**Changes**:
-1. `bench_fast_box.c` (Layer 0 + Layer 1):
-   - Removed unified_cache_init() call (design misunderstanding)
-   - Limited prealloc to 128 blocks/class (actual TLS SLL capacity)
-   - Added root cause comments explaining why unified_cache_init() was wrong
-
-2. `bench_fast_box.h` (Layer 3):
-   - Added Box Contract documentation (BenchFast uses TLS SLL, NOT UC)
-   - Documented scope separation (workload vs infrastructure allocations)
-   - Added contract violation example (Phase 8 bug explanation)
-
-3. `tiny_unified_cache.c` (Layer 2):
-   - Changed calloc() → __libc_calloc() (infrastructure isolation)
-   - Changed free() → __libc_free() (symmetric cleanup)
-   - Added defensive fix comments explaining infrastructure bypass
-
-### Phase 8-TLS-Fix
-**Commit**: `da8f4d2c8`
-**Date**: 2025-11-30
-**Files**: 3 files, 21 insertions(+), 11 deletions(-)
-
-**Changes**:
-1. `bench_fast_box.c` (TLS→Atomic):
-   - Changed `__thread int bench_fast_init_in_progress` → `atomic_int g_bench_fast_init_in_progress`
-   - Added atomic_load() for reads, atomic_store() for writes
-   - Added root cause comments (pthread_once creates fresh TLS)
-
-2. `bench_fast_box.h` (TLS→Atomic):
-   - Updated extern declaration to match atomic_int
-   - Added Phase 8-TLS-Fix comment explaining cross-thread safety
-
-3. 
`bench_fast_box.c` (Header Write): - - Replaced `tiny_region_id_write_header()` → direct write `*(uint8_t*)base = 0xa0 | class_idx` - - Added Phase 8-P3-Fix comment explaining P3 optimization bypass - - Contract: BenchFast always writes headers (required for free routing) - -4. `hak_wrappers.inc.h` (Atomic): - - Updated bench_fast_init_in_progress check to use atomic_load() - - Added Phase 8-TLS-Fix comment for cross-thread safety +**Status**: **COMPLETE** (Phase 9-2 & Refactoring) +**Goal**: SuperSlab Unified Management, Stability Fixes, and Code Refactoring --- -## Performance Journey +## Phase 9-2 Achievements (Completed) -### Phase-by-Phase Progress +1. **Critical Fixes (Deadlock & OOM)** + * **Deadlock**: `shared_pool_acquire_slab` now releases `alloc_lock` before calling `superslab_allocate` (via `sp_internal_allocate_superslab`), preventing lock inversion with `g_super_reg_lock`. + * **OOM**: Enabled `HAKMEM_TINY_USE_SUPERSLAB=1` by default in `hakmem_build_flags.h`, ensuring fallback to Legacy Backend when Shared Pool hits soft cap. -``` -Phase 3 (mincore removal): 56.8 M ops/s -Phase 4 (Hot/Cold Box): 57.2 M ops/s (+0.7%) -Phase 5 (Mid MT fix): 52.3 M ops/s (-8.6% regression) -Phase 6 (Lock-free Mid MT): 42.1 M ops/s (Mid MT: +2.65%) -Phase 7-Step1 (Unified front): 80.6 M ops/s (+54.2%!) ⭐ -Phase 7-Step4 (Dead code): 81.5 M ops/s (+1.1%) ⭐⭐ -Phase 8 (Normal mode): 16.3 M ops/s (working, different workload) +2. **SuperSlab Management Unification** + * **Unified Entry**: `sp_internal_allocate_superslab` helper introduced to manage safe allocation flow. + * **Unified Free**: `remove_superslab_from_legacy_head` implemented to safely remove pointers from legacy lists when freeing via Shared Pool. -Total improvement: +43.5% (56.8M → 81.5M) from Phase 3 -``` - -**Note**: Phase 8 used different benchmark (10M iterations, ws=8192) vs Phase 7 (ws=256). -Normal mode performance: 16.3M ops/s (working, no crash). +3. **Code Refactoring (Split `hakmem_shared_pool.c`)** + * **Split Strategy**: Divided the monolithic `core/hakmem_shared_pool.c` (1400+ lines) into logical modules: + * `core/hakmem_shared_pool.c`: Initialization, stats, and common helpers. + * `core/hakmem_shared_pool_acquire.c`: Allocation logic (`shared_pool_acquire_slab` and Stage 0.5-3). + * `core/hakmem_shared_pool_release.c`: Deallocation logic (`shared_pool_release_slab`). + * `core/hakmem_shared_pool_internal.h`: Internal shared definitions and prototypes. + * **Makefile**: Updated to compile and link the new files. + * **Cleanups**: Removed unused "L0 Cache" experimental code and fixed incorrect function names (`superslab_alloc` -> `superslab_allocate`). --- -## Technical Details +## Next Phase Candidates (Handover from Phase 9-2) -### Layer 0: Prealloc Capacity Fix +### 1. Soft Cap (Policy) Tuning +* **Issue**: Medium Working Sets (8192) hit the Shared Pool "Soft Cap" easily, causing frequent fallbacks and performance degradation. +* **Action**: Review `hakmem_policy.c` and adjust `tiny_cap` or improve dynamic adjustment logic. -**File**: `core/box/bench_fast_box.c` -**Lines**: 131-148 +### 2. Fast Path Optimization +* **Issue**: Small Working Sets (256) show 70-88% performance vs SysAlloc due to lock/call overhead. Refactoring caused a slight dip (15%), highlighting the need for optimization. +* **Action**: Re-implement a lightweight L0 Cache or optimize the lock-free path in Shared Pool for hot-path performance. Consider inlining hot helpers again via header-only implementations if needed. 
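+
+The deadlock fix under "Critical Fixes" above comes down to a lock-ordering rule: `alloc_lock` must not be held across `superslab_allocate()`, which registers the new SuperSlab and takes the registry lock (`g_super_reg_lock`) internally. A minimal sketch of that pattern, condensed from the Stage 3b path of `shared_pool_acquire_slab()` (illustrative only; counters and debug hooks trimmed):
+
+```c
+/* Drop alloc_lock across the backend call so the alloc_lock ->
+ * g_super_reg_lock ordering is never held while another thread
+ * owns g_super_reg_lock and is waiting on alloc_lock. */
+pthread_mutex_unlock(&g_shared_pool.alloc_lock);
+SuperSlab* ss = sp_internal_allocate_superslab();  /* registry lock taken inside */
+pthread_mutex_lock(&g_shared_pool.alloc_lock);
+
+if (!ss) {
+    pthread_mutex_unlock(&g_shared_pool.alloc_lock);
+    return -1;  /* out of memory: higher layers can fall back to the Legacy Backend */
+}
+/* Pool state may have changed while the lock was released, so slot-table
+ * capacity is re-checked before the new SuperSlab is published. */
+```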
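+
+For the "Soft Cap (Policy) Tuning" candidate above, the knob in question is the per-class ACTIVE-slot limit exposed by `sp_class_active_limit()` (a `tiny_cap[class]` of 0 means unbounded). A rough sketch of how such a gate can be expressed, assuming the `class_active_slots[]` counter maintained under `alloc_lock`; the helper name is hypothetical and the real enforcement site may differ:
+
+```c
+/* Sketch: returns 1 when this class has reached its soft cap and the
+ * shared pool should stop growing it (normal mode then falls back to
+ * the Legacy Backend; strict mode surfaces the failure as OOM). */
+static int sp_soft_cap_reached_sketch(int class_idx) {
+    uint32_t cap = sp_class_active_limit(class_idx);  /* 0 => no limit */
+    if (cap == 0) return 0;
+    return g_shared_pool.class_active_slots[class_idx] >= cap;
+}
+```
+
+Tuning then means either raising `tiny_cap` for the classes that dominate the ws=8192 workload or improving the dynamic adjustment logic in `hakmem_policy.c` (e.g. reacting to observed Stage 3 allocation pressure).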
-**Root Cause**: -- Old code preallocated 50,000 blocks/class -- TLS SLL actual capacity: 128 blocks (adaptive sizing limit) -- Lost blocks (beyond 128) caused heap corruption - -**Fix**: -```c -// Before: -const uint32_t PREALLOC_COUNT = 50000; // Too large! - -// After: -const uint32_t ACTUAL_TLS_SLL_CAPACITY = 128; // Observed actual capacity -for (int cls = 2; cls <= 7; cls++) { - uint32_t capacity = ACTUAL_TLS_SLL_CAPACITY; - for (int i = 0; i < (int)capacity; i++) { - // preallocate... - } -} -``` - -### Layer 1: Design Misunderstanding Fix - -**File**: `core/box/bench_fast_box.c` -**Lines**: 123-128 (REMOVED) - -**Root Cause**: -- BenchFast uses TLS SLL directly (g_tls_sll[]) -- Unified Cache is NOT used by BenchFast -- unified_cache_init() created 16KB allocations (infrastructure) -- Later freed by BenchFast → header misclassification → CRASH - -**Fix**: -```c -// REMOVED: -// unified_cache_init(); // WRONG! BenchFast uses TLS SLL, not Unified Cache - -// Added comment: -// Phase 8 Root Cause Fix: REMOVED unified_cache_init() call -// Reason: BenchFast uses TLS SLL directly, NOT Unified Cache -``` - -### Layer 2: Infrastructure Isolation - -**File**: `core/front/tiny_unified_cache.c` -**Lines**: 61-71 (init), 103-109 (shutdown) - -**Strategy**: Dual-Path Separation -- **Workload allocations** (measured): HAKMEM paths (TLS SLL, Unified Cache) -- **Infrastructure allocations** (unmeasured): __libc_calloc/__libc_free - -**Fix**: -```c -// Before: -g_unified_cache[cls].slots = (void**)calloc(cap, sizeof(void*)); - -// After: -extern void* __libc_calloc(size_t, size_t); -g_unified_cache[cls].slots = (void**)__libc_calloc(cap, sizeof(void*)); -``` - -### Layer 3: Box Contract Documentation - -**File**: `core/box/bench_fast_box.h` -**Lines**: 13-51 - -**Added Documentation**: -- BenchFast uses TLS SLL, NOT Unified Cache -- Scope separation (workload vs infrastructure) -- Preconditions and guarantees -- Contract violation example (Phase 8 bug) - -### TLS→Atomic Fix - -**File**: `core/box/bench_fast_box.c` -**Lines**: 22-27 (declaration), 37, 124, 215 (usage) - -**Root Cause**: -``` -pthread_once() → creates new thread -New thread has fresh TLS (bench_fast_init_in_progress = 0) -Guard broken → getenv() allocates → freed by __libc_free() → CRASH -``` - -**Fix**: -```c -// Before (TLS - broken): -__thread int bench_fast_init_in_progress = 0; -if (__builtin_expect(bench_fast_init_in_progress, 0)) { ... } - -// After (Atomic - fixed): -atomic_int g_bench_fast_init_in_progress = 0; -if (__builtin_expect(atomic_load(&g_bench_fast_init_in_progress), 0)) { ... 
} -``` - -**箱理論 Validation**: -- **Responsibility**: Guard must protect entire process (not per-thread) -- **Contract**: "No BenchFast allocations during init" (all threads) -- **Observable**: Atomic variable visible across all threads -- **Composable**: Works with pthread_once() threading model - -### Header Write Fix - -**File**: `core/box/bench_fast_box.c` -**Lines**: 70-80 - -**Root Cause**: -- P3 optimization: tiny_region_id_write_header() skips header writes by default -- BenchFast free routing checks header magic (0xa0-0xa7) -- No header → free() misroutes to __libc_free() → CRASH - -**Fix**: -```c -// Before (broken - calls function that skips write): -tiny_region_id_write_header(base, class_idx); -return (void*)((char*)base + 1); - -// After (fixed - direct write): -*(uint8_t*)base = (uint8_t)(0xa0 | (class_idx & 0x0f)); // Direct write -return (void*)((char*)base + 1); -``` - -**Contract**: BenchFast always writes headers (required for free routing) +### 3. Legacy Backend Removal +* **Issue**: Legacy Backend (`g_superslab_heads`) is still kept for fallback but causes complexity. +* **Action**: Plan complete removal of `g_superslab_heads`, migrating all management to Shared Pool. --- -## Next Phase Options - -### Option A: Continue Phase 7 (Steps 5-7) 📦 -**Goal**: Remove remaining legacy layers (complete dead code elimination) -**Expected**: Additional +3-5% via further code cleanup -**Duration**: 1-2 days -**Risk**: Low (infrastructure already in place) - -**Remaining Steps**: -- Step 5: Compile library with PGO flag (Makefile change) -- Step 6: Verify dead code elimination in assembly -- Step 7: Measure performance improvement - -### Option B: PGO Re-enablement 🚀 -**Goal**: Re-enable PGO workflow from Phase 4-Step1 -**Expected**: +6-13% cumulative (on top of 81.5M) -**Duration**: 2-3 days -**Risk**: Low (proven pattern) - -**Current projection**: -- Phase 7 baseline: 81.5 M ops/s -- With PGO: ~86-93 M ops/s (+6-13%) - -### Option C: BenchFast Pool Expansion 🏎️ -**Goal**: Increase BenchFast pool size for full 10M iteration support -**Expected**: Structural ceiling measurement (30-40M ops/s target) -**Duration**: 1 day -**Risk**: Low (just increase prealloc count) - -**Current status**: -- Pool: 128 blocks/class (768 total) -- Exhaustion: C6/C7 exhaust after ~200 iterations -- Need: ~10,000 blocks/class for 10M iterations (60,000 total) - -### Option D: Production Readiness 📊 -**Goal**: Comprehensive benchmark suite, deployment guide -**Expected**: Full performance comparison, stability testing -**Duration**: 3-5 days -**Risk**: Low (documentation + testing) - ---- - -## Recommendation - -### Top Pick: **Option C (BenchFast Pool Expansion)** 🏎️ - -**Reasoning**: -1. **Phase 8 fixes working**: TLS→Atomic + Header write proven -2. **Quick win**: Just increase ACTUAL_TLS_SLL_CAPACITY to 10,000 -3. **Scientific value**: Measure true structural ceiling (no safety costs) -4. **Low risk**: 1-day task, no code changes (just capacity tuning) -5. **Data-driven**: Enables comparison vs normal mode (16.3M vs 30-40M expected) - -**Expected Result**: -``` -Normal mode: 16.3 M ops/s (current) -BenchFast mode: 30-40 M ops/s (target, 2-2.5x faster) -``` - -**Implementation**: -```c -// core/box/bench_fast_box.c:140 -const uint32_t ACTUAL_TLS_SLL_CAPACITY = 10000; // Was 128 -``` - ---- - -### Second Choice: **Option B (PGO Re-enablement)** 🚀 - -**Reasoning**: -1. **Proven benefit**: +6.25% in Phase 4-Step1 -2. **Cumulative**: Would stack with Phase 7 (81.5M baseline) -3. 
**Low risk**: Just fix build issue -4. **High impact**: ~86-93 M ops/s projected - ---- - -## Current Performance Summary - -### bench_random_mixed (16B-1KB, Tiny workload) -``` -Phase 7-Step4 (ws=256): 81.5 M ops/s (+55.5% total) -Phase 8 (ws=8192): 16.3 M ops/s (normal mode, working) -``` - -### bench_mid_mt_gap (1KB-8KB, Mid MT workload, ws=256) -``` -After Phase 6-B (lock-free): 42.09 M ops/s (+2.65%) -vs System malloc: 26.8 M ops/s (1.57x faster) -``` - -### Overall Status -- ✅ **Tiny allocations** (16B-1KB): **81.5 M ops/s** (excellent, +55.5%!) -- ✅ **Mid MT allocations** (1KB-8KB): 42 M ops/s (excellent, 1.57x vs system) -- ✅ **BenchFast mode**: No crash (TLS→Atomic + Header fix working) -- ⏸️ **Large allocations** (32KB-2MB): Not benchmarked yet -- ⏸️ **MT workloads**: No MT benchmarks yet - ---- - -## Decision Time - -**Choose your next phase**: -- **Option A**: Continue Phase 7 (Steps 5-7, final cleanup) -- **Option B**: PGO re-enablement (recommended for normal builds) -- **Option C**: BenchFast pool expansion (recommended for ceiling measurement) -- **Option D**: Production readiness & benchmarking - -**Or**: Celebrate Phase 8 success! 🎉 (Root cause fixes complete!) - ---- - -Updated: 2025-11-30 -Phase: 8 COMPLETE (Root Cause Fixes) → 9 PENDING -Previous: Phase 7 (Tiny Front Unification, +55.5%) -Achievement: BenchFast crash investigation and fixes (箱理論 root cause analysis!) +## Current Status +* **Build**: Passing (Clean build verified). +* **Benchmarks**: + * `HAKMEM_TINY_SS_SHARED=1` (Normal): ~20.0 M ops/s (working, fallback active). + * `HAKMEM_TINY_SS_SHARED=2` (Strict): ~20.3 M ops/s (working, OOMs on soft cap as expected). +* **Pending**: Selection of next focus area. diff --git a/Makefile b/Makefile index 0fda8b7f..28a03458 100644 --- a/Makefile +++ b/Makefile @@ -218,12 +218,12 @@ LDFLAGS += $(EXTRA_LDFLAGS) # Targets TARGET = test_hakmem -OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o superslab_allocate.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o superslab_head.o hakmem_smallmid.o hakmem_smallmid_superslab.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_tls_hint_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/link_stubs.o 
core/tiny_failfast.o test_hakmem.o +OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o superslab_allocate.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o superslab_head.o hakmem_smallmid.o hakmem_smallmid_superslab.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_tls_hint_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o test_hakmem.o OBJS = $(OBJS_BASE) # Shared library SHARED_LIB = libhakmem.so -SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o superslab_allocate_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o superslab_head_shared.o hakmem_smallmid_shared.o hakmem_smallmid_superslab_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_local_box_shared.o core/box/free_remote_box_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/unified_batch_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_tls_hint_box_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o core/tiny_alloc_fast_push_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_mid_mt_shared.o hakmem_super_registry_shared.o 
hakmem_shared_pool_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o +SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o superslab_allocate_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o superslab_head_shared.o hakmem_smallmid_shared.o hakmem_smallmid_superslab_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_local_box_shared.o core/box/free_remote_box_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/unified_batch_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_tls_hint_box_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o core/tiny_alloc_fast_push_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_mid_mt_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o # Pool TLS Phase 1 (enable with POOL_TLS_PHASE1=1) ifeq ($(POOL_TLS_PHASE1),1) @@ -250,7 +250,7 @@ endif # Benchmark targets BENCH_HAKMEM = bench_allocators_hakmem BENCH_SYSTEM = bench_allocators_system -BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o superslab_allocate.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o superslab_head.o hakmem_smallmid.o hakmem_smallmid_superslab.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o 
hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_tls_hint_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o bench_allocators_hakmem.o +BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o superslab_allocate.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o superslab_head.o hakmem_smallmid.o hakmem_smallmid_superslab.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_tls_hint_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o bench_allocators_hakmem.o BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE) ifeq ($(POOL_TLS_PHASE1),1) BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o @@ -427,7 +427,7 @@ test-box-refactor: box-refactor ./larson_hakmem 10 8 128 1024 1 12345 4 # Phase 4: Tiny Pool benchmarks (properly linked with hakmem) -TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o superslab_allocate.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o superslab_head.o hakmem_smallmid.o 
hakmem_smallmid_superslab.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_tls_hint_box.o core/box/slab_recycling_box.o core/box/tiny_sizeclass_hist_box.o core/box/pagefault_telemetry_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o +TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o superslab_allocate.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o superslab_head.o hakmem_smallmid.o hakmem_smallmid_superslab.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_tls_hint_box.o core/box/slab_recycling_box.o core/box/tiny_sizeclass_hist_box.o core/box/pagefault_telemetry_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE) ifeq ($(POOL_TLS_PHASE1),1) TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o diff --git a/core/hakmem_shared_pool.c b/core/hakmem_shared_pool.c index 8f678aaf..54468e39 100644 --- a/core/hakmem_shared_pool.c +++ b/core/hakmem_shared_pool.c @@ -1,6 +1,4 @@ -#include "hakmem_shared_pool.h" -#include "hakmem_tiny_superslab.h" -#include 
"hakmem_tiny_superslab_constants.h" +#include "hakmem_shared_pool_internal.h" #include "hakmem_debug_master.h" // Phase 4b: Master debug control #include "hakmem_stats_master.h" // Phase 4d: Master stats control #include "box/ss_slab_meta_box.h" // Phase 3d-A: SlabMeta Box boundary @@ -19,16 +17,17 @@ // ============================================================================ // P0 Lock Contention Instrumentation (Debug build only; counters defined always) // ============================================================================ -static _Atomic uint64_t g_lock_acquire_count = 0; // Total lock acquisitions -static _Atomic uint64_t g_lock_release_count = 0; // Total lock releases -static _Atomic uint64_t g_lock_acquire_slab_count = 0; // Locks from acquire_slab path -static _Atomic uint64_t g_lock_release_slab_count = 0; // Locks from release_slab path -static int g_lock_stats_enabled = -1; // -1=uninitialized, 0=off, 1=on +_Atomic uint64_t g_lock_acquire_count = 0; // Total lock acquisitions +_Atomic uint64_t g_lock_release_count = 0; // Total lock releases +_Atomic uint64_t g_lock_acquire_slab_count = 0; // Locks from acquire_slab path +_Atomic uint64_t g_lock_release_slab_count = 0; // Locks from release_slab path #if !HAKMEM_BUILD_RELEASE +int g_lock_stats_enabled = -1; // -1=uninitialized, 0=off, 1=on + // Initialize lock stats from environment variable // Phase 4b: Now uses hak_debug_check() for master debug control support -static inline void lock_stats_init(void) { +void lock_stats_init(void) { if (__builtin_expect(g_lock_stats_enabled == -1, 0)) { g_lock_stats_enabled = hak_debug_check("HAKMEM_SHARED_POOL_LOCK_STATS"); } @@ -60,27 +59,23 @@ static void __attribute__((destructor)) lock_stats_report(void) { } #else // Release build: No-op stubs -static inline void lock_stats_init(void) { - if (__builtin_expect(g_lock_stats_enabled == -1, 0)) { - g_lock_stats_enabled = 0; - } -} +int g_lock_stats_enabled = 0; #endif // ============================================================================ // SP Acquire Stage Statistics (Stage1/2/3 breakdown) // ============================================================================ -static _Atomic uint64_t g_sp_stage1_hits[TINY_NUM_CLASSES_SS]; -static _Atomic uint64_t g_sp_stage2_hits[TINY_NUM_CLASSES_SS]; -static _Atomic uint64_t g_sp_stage3_hits[TINY_NUM_CLASSES_SS]; +_Atomic uint64_t g_sp_stage1_hits[TINY_NUM_CLASSES_SS]; +_Atomic uint64_t g_sp_stage2_hits[TINY_NUM_CLASSES_SS]; +_Atomic uint64_t g_sp_stage3_hits[TINY_NUM_CLASSES_SS]; // Data collection gate (0=off, 1=on). 学習層からも有効化される。 -static int g_sp_stage_stats_enabled = 0; +int g_sp_stage_stats_enabled = 0; #if !HAKMEM_BUILD_RELEASE // Logging gate for destructor(ENV: HAKMEM_SHARED_POOL_STAGE_STATS) static int g_sp_stage_stats_log_enabled = -1; // -1=uninitialized, 0=off, 1=on -static inline void sp_stage_stats_init(void) { +void sp_stage_stats_init(void) { // Phase 4d: Now uses hak_stats_check() for unified stats control if (__builtin_expect(g_sp_stage_stats_log_enabled == -1, 0)) { g_sp_stage_stats_log_enabled = hak_stats_check("HAKMEM_SHARED_POOL_STAGE_STATS", "pool"); @@ -123,7 +118,7 @@ static void __attribute__((destructor)) sp_stage_stats_report(void) { } #else // Release build: No-op stubs -static inline void sp_stage_stats_init(void) {} +void sp_stage_stats_init(void) {} #endif // Snapshot Tiny-related backend metrics for learner / observability. 
@@ -161,7 +156,7 @@ shared_pool_tiny_metrics_snapshot(uint64_t stage1[TINY_NUM_CLASSES_SS], // Semantics: // - tiny_cap[class] == 0 → no limit (unbounded) // - otherwise: soft cap on ACTIVE slots managed by shared pool for this class. -static inline uint32_t sp_class_active_limit(int class_idx) { +uint32_t sp_class_active_limit(int class_idx) { const FrozenPolicy* pol = hkm_policy_get(); if (!pol) { return 0; // no limit @@ -211,14 +206,7 @@ static inline FreeSlotNode* node_alloc(int class_idx) { uint32_t idx = atomic_fetch_add(&g_node_alloc_index[class_idx], 1); if (idx >= MAX_FREE_NODES_PER_CLASS) { - // Pool exhausted - should be rare. Caller must fall back to legacy - // mutex-protected free list to preserve correctness. - #if !HAKMEM_BUILD_RELEASE - static _Atomic int warn_once = 0; - if (atomic_exchange(&warn_once, 1) == 0) { - fprintf(stderr, "[P0-4 WARN] Node pool exhausted for class %d\n", class_idx); - } - #endif + // Pool exhausted - should be rare. return NULL; } @@ -255,7 +243,7 @@ SharedSuperSlabPool g_shared_pool = { .ss_meta_count = 0 }; -static void +void shared_pool_ensure_capacity_unlocked(uint32_t min_capacity) { if (g_shared_pool.capacity >= min_capacity) { @@ -268,9 +256,6 @@ shared_pool_ensure_capacity_unlocked(uint32_t min_capacity) } // CRITICAL FIX: Use system mmap() directly to avoid recursion! - // Problem: realloc() goes through HAKMEM allocator → hak_alloc_at(128) - // → needs Shared Pool init → calls realloc() → INFINITE RECURSION! - // Solution: Allocate Shared Pool metadata using system mmap, not HAKMEM allocator size_t new_size = new_cap * sizeof(SuperSlab*); SuperSlab** new_slabs = (SuperSlab**)mmap(NULL, new_size, PROT_READ | PROT_WRITE, @@ -333,7 +318,7 @@ static int sp_slot_find_unused(SharedSSMeta* meta) { // Mark slot as ACTIVE (UNUSED→ACTIVE or EMPTY→ACTIVE) // P0-5: Uses atomic store for state transition (caller must hold mutex!) // Returns: 0 on success, -1 on error -static int sp_slot_mark_active(SharedSSMeta* meta, int slot_idx, int class_idx) { +int sp_slot_mark_active(SharedSSMeta* meta, int slot_idx, int class_idx) { if (!meta || slot_idx < 0 || slot_idx >= meta->total_slots) return -1; if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) return -1; @@ -357,7 +342,7 @@ static int sp_slot_mark_active(SharedSSMeta* meta, int slot_idx, int class_idx) // Mark slot as EMPTY (ACTIVE→EMPTY) // P0-5: Uses atomic store for state transition (caller must hold mutex!) // Returns: 0 on success, -1 on error -static int sp_slot_mark_empty(SharedSSMeta* meta, int slot_idx) { +int sp_slot_mark_empty(SharedSSMeta* meta, int slot_idx) { if (!meta || slot_idx < 0 || slot_idx >= meta->total_slots) return -1; SharedSlot* slot = &meta->slots[slot_idx]; @@ -379,7 +364,7 @@ static int sp_slot_mark_empty(SharedSSMeta* meta, int slot_idx) { // Sync SP-SLOT view from an existing SuperSlab. // This is needed when a legacy-allocated SuperSlab reaches the shared-pool // release path for the first time (slot states are still SLOT_UNUSED). 
-static void sp_meta_sync_slots_from_ss(SharedSSMeta* meta, SuperSlab* ss) { +void sp_meta_sync_slots_from_ss(SharedSSMeta* meta, SuperSlab* ss) { if (!meta || !ss) return; int cap = ss_slabs_capacity(ss); @@ -439,7 +424,7 @@ static int sp_meta_ensure_capacity(uint32_t min_count) { // Find SharedSSMeta for given SuperSlab, or create if not exists // Caller must hold alloc_lock // Returns: SharedSSMeta* on success, NULL on error -static SharedSSMeta* sp_meta_find_or_create(SuperSlab* ss) { +SharedSSMeta* sp_meta_find_or_create(SuperSlab* ss) { if (!ss) return NULL; // RACE FIX: Load count atomically for consistency (even under mutex) @@ -483,110 +468,27 @@ static SharedSSMeta* sp_meta_find_or_create(SuperSlab* ss) { return meta; } -// ============================================================================ -// Phase 12-1.x: Acquire Helper Boxes (Stage 0.5/1/2/3) -// ============================================================================ +// Find UNUSED slot and claim it (UNUSED → ACTIVE) using lock-free CAS +// Returns: slot_idx on success, -1 if no UNUSED slots +int sp_slot_claim_lockfree(SharedSSMeta* meta, int class_idx) { + if (!meta) return -1; -// Debug / stats helper (Stage hits) -static inline void sp_stage_stats_dump_if_enabled(void) { -#if !HAKMEM_BUILD_RELEASE - static int dump_en = -1; - if (__builtin_expect(dump_en == -1, 0)) { - const char* e = getenv("HAKMEM_SHARED_POOL_STAGE_STATS"); - dump_en = (e && *e && *e != '0') ? 1 : 0; - } - if (!dump_en) return; - - // 全クラス合計を出力(スキャン/ヒットの分布を見るため) - uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0; - for (int c = 0; c < TINY_NUM_CLASSES_SS; c++) { - s0 += atomic_load_explicit(&g_sp_stage0_hits[c], memory_order_relaxed); - s1 += atomic_load_explicit(&g_sp_stage1_hits[c], memory_order_relaxed); - s2 += atomic_load_explicit(&g_sp_stage2_hits[c], memory_order_relaxed); - s3 += atomic_load_explicit(&g_sp_stage3_hits[c], memory_order_relaxed); - } - fprintf(stderr, "[SP_STAGE_STATS] total: stage0.5=%lu stage1=%lu stage2=%lu stage3=%lu\n", - (unsigned long)s0, (unsigned long)s1, (unsigned long)s2, (unsigned long)s3); -#else - (void)g_sp_stage1_hits; (void)g_sp_stage2_hits; (void)g_sp_stage3_hits; -#endif -} - -// Stage 0.5: EMPTY slab direct scan(registry ベースの EMPTY 再利用) -static inline int -sp_acquire_from_empty_scan(int class_idx, SuperSlab** ss_out, int* slab_idx_out, int dbg_acquire) -{ - static int empty_reuse_enabled = -1; - if (__builtin_expect(empty_reuse_enabled == -1, 0)) { - const char* e = getenv("HAKMEM_SS_EMPTY_REUSE"); - empty_reuse_enabled = (e && *e && *e == '0') ? 0 : 1; // default ON - } - - if (!empty_reuse_enabled) { - return -1; - } - - extern SuperSlab* g_super_reg_by_class[TINY_NUM_CLASSES][SUPER_REG_PER_CLASS]; - extern int g_super_reg_class_size[TINY_NUM_CLASSES]; - - int reg_size = (class_idx < TINY_NUM_CLASSES) ? g_super_reg_class_size[class_idx] : 0; - static int scan_limit = -1; - if (__builtin_expect(scan_limit == -1, 0)) { - const char* e = getenv("HAKMEM_SS_EMPTY_SCAN_LIMIT"); - scan_limit = (e && *e) ? 
atoi(e) : 32; // default: scan first 32 SuperSlabs (Phase 9-2 tuning) - } - if (scan_limit > reg_size) scan_limit = reg_size; - - // Stage 0.5 hit counter for visualization - static _Atomic uint64_t stage05_hits = 0; - static _Atomic uint64_t stage05_attempts = 0; - atomic_fetch_add_explicit(&stage05_attempts, 1, memory_order_relaxed); - - for (int i = 0; i < scan_limit; i++) { - SuperSlab* ss = g_super_reg_by_class[class_idx][i]; - if (!(ss && ss->magic == SUPERSLAB_MAGIC)) continue; - if (ss->empty_count == 0) continue; // No EMPTY slabs in this SS - - uint32_t mask = ss->empty_mask; - while (mask) { - int empty_idx = __builtin_ctz(mask); - mask &= (mask - 1); // clear lowest bit - - TinySlabMeta* meta = &ss->slabs[empty_idx]; - if (meta->capacity > 0 && meta->used == 0) { - tiny_tls_slab_reuse_guard(ss); - ss_clear_slab_empty(ss, empty_idx); - - meta->class_idx = (uint8_t)class_idx; - ss->class_map[empty_idx] = (uint8_t)class_idx; - -#if !HAKMEM_BUILD_RELEASE - if (dbg_acquire == 1) { - fprintf(stderr, - "[SP_ACQUIRE_STAGE0.5_EMPTY] class=%d reusing EMPTY slab (ss=%p slab=%d empty_count=%u)\n", - class_idx, (void*)ss, empty_idx, ss->empty_count); - } -#else - (void)dbg_acquire; -#endif - - *ss_out = ss; - *slab_idx_out = empty_idx; - sp_stage_stats_init(); - if (g_sp_stage_stats_enabled) { - atomic_fetch_add(&g_sp_stage1_hits[class_idx], 1); - } - atomic_fetch_add_explicit(&stage05_hits, 1, memory_order_relaxed); - - // Stage 0.5 hit rate visualization (every 100 hits) - uint64_t hits = atomic_load_explicit(&stage05_hits, memory_order_relaxed); - if (hits % 100 == 1) { - uint64_t attempts = atomic_load_explicit(&stage05_attempts, memory_order_relaxed); - fprintf(stderr, "[STAGE0.5_STATS] hits=%lu attempts=%lu rate=%.1f%% (scan_limit=%d)\n", - hits, attempts, (double)hits * 100.0 / attempts, scan_limit); - } - return 0; + // Optimization: Quick check if any unused slots exist? + // For now, just iterate. Metadata size is small (max 32 slots). + for (int i = 0; i < meta->total_slots; i++) { + SharedSlot* slot = &meta->slots[i]; + SlotState state = atomic_load_explicit(&slot->state, memory_order_acquire); + if (state == SLOT_UNUSED) { + // Attempt CAS: UNUSED → ACTIVE + if (atomic_compare_exchange_strong_explicit( + &slot->state, + &state, + SLOT_ACTIVE, + memory_order_acq_rel, + memory_order_acquire)) { + return i; // Success! 
} + // CAS failed: someone else took it or state changed } } return -1; @@ -597,822 +499,108 @@ sp_acquire_from_empty_scan(int class_idx, SuperSlab** ss_out, int* slab_idx_out, // Push empty slot to per-class free list // Caller must hold alloc_lock // Returns: 0 on success, -1 if list is full -static int sp_freelist_push(int class_idx, SharedSSMeta* meta, int slot_idx) { +int sp_freelist_push_lockfree(int class_idx, SharedSSMeta* meta, int slot_idx) { if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) return -1; - if (!meta || slot_idx < 0 || slot_idx >= meta->total_slots) return -1; - FreeSlotList* list = &g_shared_pool.free_slots[class_idx]; - - if (list->count >= MAX_FREE_SLOTS_PER_CLASS) { - return -1; // List full + FreeSlotNode* node = node_alloc(class_idx); + if (!node) { + // Pool exhausted + return -1; } - list->entries[list->count].meta = meta; - list->entries[list->count].slot_idx = (uint8_t)slot_idx; - list->count++; + node->meta = meta; + node->slot_idx = slot_idx; + + // Lock-free push to stack (LIFO) + FreeSlotNode* old_head = atomic_load_explicit( + &g_shared_pool.free_slots_lockfree[class_idx].head, + memory_order_relaxed); + do { + node->next = old_head; + } while (!atomic_compare_exchange_weak_explicit( + &g_shared_pool.free_slots_lockfree[class_idx].head, + &old_head, + node, + memory_order_release, + memory_order_relaxed)); + return 0; } // Pop empty slot from per-class free list -// Caller must hold alloc_lock -// Returns: 1 if popped (out params filled), 0 if list empty -static int sp_freelist_pop(int class_idx, SharedSSMeta** out_meta, int* out_slot_idx) { +// Lock-free +// Returns: 1 on success, 0 if empty +int sp_freelist_pop_lockfree(int class_idx, SharedSSMeta** meta_out, int* slot_idx_out) { if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) return 0; - if (!out_meta || !out_slot_idx) return 0; - FreeSlotList* list = &g_shared_pool.free_slots[class_idx]; - - if (list->count == 0) { - return 0; // List empty - } - - // Pop from end (LIFO for cache locality) - list->count--; - *out_meta = list->entries[list->count].meta; - *out_slot_idx = list->entries[list->count].slot_idx; - return 1; -} - -// ============================================================================ -// P0-5: Lock-Free Slot Claiming (Stage 2 Optimization) -// ============================================================================ - -// Try to claim an UNUSED slot via lock-free CAS -// Returns: slot_idx on success, -1 if no UNUSED slots available -// LOCK-FREE: Can be called from any thread without mutex -static int sp_slot_claim_lockfree(SharedSSMeta* meta, int class_idx) { - if (!meta) return -1; - if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) return -1; - - // Scan all slots for UNUSED state - for (int i = 0; i < meta->total_slots; i++) { - SlotState expected = SLOT_UNUSED; - - // Try to claim this slot atomically (UNUSED → ACTIVE) - if (atomic_compare_exchange_strong_explicit( - &meta->slots[i].state, - &expected, - SLOT_ACTIVE, - memory_order_acq_rel, // Success: acquire+release semantics - memory_order_relaxed // Failure: just retry next slot - )) { - // Successfully claimed! 
Update non-atomic fields - // (Safe because we now own this slot) - meta->slots[i].class_idx = (uint8_t)class_idx; - meta->slots[i].slab_idx = (uint8_t)i; - - // Increment active_slots counter atomically - // (Multiple threads may claim slots concurrently) - atomic_fetch_add_explicit( - (_Atomic uint8_t*)&meta->active_slots, 1, - memory_order_relaxed - ); - - return i; // Return claimed slot index - } - - // CAS failed (slot was not UNUSED) - continue to next slot - } - - return -1; // No UNUSED slots available -} - -// ============================================================================ -// P0-4: Lock-Free Free Slot List Operations -// ============================================================================ - -// Push empty slot to lock-free per-class free list (LIFO) -// LOCK-FREE: Can be called from any thread without mutex -// Returns: 0 on success, -1 on failure (node pool exhausted) -static int sp_freelist_push_lockfree(int class_idx, SharedSSMeta* meta, int slot_idx) { - if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) return -1; - if (!meta || slot_idx < 0 || slot_idx >= meta->total_slots) return -1; - - // Allocate node from pool - FreeSlotNode* node = node_alloc(class_idx); - if (!node) { - // Fallback: push into legacy per-class free list - // ASSUME: Caller already holds alloc_lock (e.g., shared_pool_release_slab:772) - // Do NOT lock again to avoid deadlock on non-recursive mutex! - (void)sp_freelist_push(class_idx, meta, slot_idx); - return 0; - } - - // Fill node data - node->meta = meta; - node->slot_idx = (uint8_t)slot_idx; - - // Lock-free LIFO push using CAS loop - LockFreeFreeList* list = &g_shared_pool.free_slots_lockfree[class_idx]; - FreeSlotNode* old_head = atomic_load_explicit(&list->head, memory_order_relaxed); - - do { - node->next = old_head; - } while (!atomic_compare_exchange_weak_explicit( - &list->head, &old_head, node, - memory_order_release, // Success: publish node to other threads - memory_order_relaxed // Failure: retry with updated old_head - )); - - return 0; // Success -} - -// Pop empty slot from lock-free per-class free list (LIFO) -// LOCK-FREE: Can be called from any thread without mutex -// Returns: 1 if popped (out params filled), 0 if list empty -static int sp_freelist_pop_lockfree(int class_idx, SharedSSMeta** out_meta, int* out_slot_idx) { - if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) return 0; - if (!out_meta || !out_slot_idx) return 0; - - LockFreeFreeList* list = &g_shared_pool.free_slots_lockfree[class_idx]; - FreeSlotNode* old_head = atomic_load_explicit(&list->head, memory_order_acquire); - - // Lock-free LIFO pop using CAS loop - do { - if (old_head == NULL) { - return 0; // List empty - } - } while (!atomic_compare_exchange_weak_explicit( - &list->head, &old_head, old_head->next, - memory_order_acquire, // Success: acquire node data - memory_order_acquire // Failure: retry with updated old_head - )); - - // Extract data from popped node - *out_meta = old_head->meta; - *out_slot_idx = old_head->slot_idx; - - // Recycle node back into per-class free list so that long-running workloads - // do not permanently consume new nodes on every EMPTY event. 
- FreeSlotNode* free_head = atomic_load_explicit( - &g_node_free_head[class_idx], + FreeSlotNode* head = atomic_load_explicit( + &g_shared_pool.free_slots_lockfree[class_idx].head, memory_order_acquire); - do { - old_head->next = free_head; - } while (!atomic_compare_exchange_weak_explicit( - &g_node_free_head[class_idx], - &free_head, - old_head, - memory_order_release, - memory_order_acquire)); - return 1; // Success + while (head) { + FreeSlotNode* next = head->next; + if (atomic_compare_exchange_weak_explicit( + &g_shared_pool.free_slots_lockfree[class_idx].head, + &head, + next, + memory_order_acquire, + memory_order_acquire)) { + // Success! + *meta_out = head->meta; + *slot_idx_out = head->slot_idx; + + // Recycle node (push to free_head list) + FreeSlotNode* free_head = atomic_load_explicit(&g_node_free_head[class_idx], memory_order_relaxed); + do { + head->next = free_head; + } while (!atomic_compare_exchange_weak_explicit( + &g_node_free_head[class_idx], + &free_head, + head, + memory_order_release, + memory_order_relaxed)); + + return 1; + } + // CAS failed: head updated, retry + } + return 0; // Empty list } -// Internal helper: Allocates a new SuperSlab from the OS and performs basic initialization. -// Does NOT interact with g_shared_pool.slabs[] or g_shared_pool.total_count directly. -// Caller is responsible for adding the SuperSlab to g_shared_pool's arrays and metadata. -static SuperSlab* + +// Allocator helper for SuperSlab (Phase 9-2 Task 1) +SuperSlab* sp_internal_allocate_superslab(void) { - // Use size_class 0 as a neutral hint; Phase 12 per-slab class_idx is authoritative. + // Use legacy backend to allocate a SuperSlab (malloc-based) extern SuperSlab* superslab_allocate(uint8_t size_class); - SuperSlab* ss = superslab_allocate(0); - + // Pass 8 as class_idx (dummy, will be overwritten) or larger + SuperSlab* ss = superslab_allocate(8); if (!ss) { return NULL; } - // PageFaultTelemetry: mark all backing pages for this Superslab (approximate) - size_t ss_bytes = (size_t)1 << ss->lg_size; - for (size_t off = 0; off < ss_bytes; off += 4096) { - pagefault_telemetry_touch(PF_BUCKET_SS_META, (char*)ss + off); - } + // Initialize basic fields if not done by superslab_alloc + ss->active_slabs = 0; + ss->slab_bitmap = 0; - // superslab_allocate() already: - // - zeroes slab metadata / remote queues, - // - sets magic/lg_size/etc, - // - registers in global registry. - // For shared-pool semantics we normalize all slab class_idx to UNASSIGNED. - int max_slabs = ss_slabs_capacity(ss); - for (int i = 0; i < max_slabs; i++) { - ss_slab_meta_class_idx_set(ss, i, 255); // UNASSIGNED - // P1.1: Initialize class_map to UNASSIGNED as well - ss->class_map[i] = 255; - } return ss; } +// ============================================================================ +// Public API (High-level) +// ============================================================================ + SuperSlab* shared_pool_acquire_superslab(void) { - shared_pool_init(); - - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - // For now, always allocate a fresh SuperSlab and register it. - // More advanced reuse/GC comes later. 
- // Release lock to avoid deadlock with registry during superslab_allocate - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - SuperSlab* ss = sp_internal_allocate_superslab(); // Call lock-free internal helper - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - if (!ss) { - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return NULL; - } - - // Add newly allocated SuperSlab to the shared pool's internal array - if (g_shared_pool.total_count >= g_shared_pool.capacity) { - shared_pool_ensure_capacity_unlocked(g_shared_pool.total_count + 1); - if (g_shared_pool.total_count >= g_shared_pool.capacity) { - // Pool table expansion failed; leave ss alive (registry-owned), - // but do not treat it as part of shared_pool. - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return NULL; - } - } - g_shared_pool.slabs[g_shared_pool.total_count] = ss; - g_shared_pool.total_count++; - - // Not counted as active until at least one slab is assigned. - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return ss; + // Phase 12: Legacy wrapper? + // This function seems to be a direct allocation bypass. + return sp_internal_allocate_superslab(); } -// ---------- Layer 4: Public API (High-level) ---------- - -// Ensure slab geometry matches current class stride (handles upgrades like C7 1024->2048). -static inline void sp_fix_geometry_if_needed(SuperSlab* ss, int slab_idx, int class_idx) -{ - if (!ss || slab_idx < 0 || class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) { - return; - } - TinySlabMeta* meta = &ss->slabs[slab_idx]; - size_t stride = g_tiny_class_sizes[class_idx]; - size_t usable = (slab_idx == 0) ? SUPERSLAB_SLAB0_USABLE_SIZE : SUPERSLAB_SLAB_USABLE_SIZE; - uint16_t expect_cap = (uint16_t)(usable / stride); - - // Reinitialize if capacity is off or class_idx mismatches. - if (meta->class_idx != (uint8_t)class_idx || meta->capacity != expect_cap) { - #if !HAKMEM_BUILD_RELEASE - extern __thread int g_hakmem_lock_depth; - g_hakmem_lock_depth++; - fprintf(stderr, "[SP_FIX_GEOMETRY] ss=%p slab=%d cls=%d: old_cls=%u old_cap=%u -> new_cls=%d new_cap=%u (stride=%zu)\n", - (void*)ss, slab_idx, class_idx, - meta->class_idx, meta->capacity, - class_idx, expect_cap, stride); - g_hakmem_lock_depth--; - #endif - - superslab_init_slab(ss, slab_idx, stride, 0 /*owner_tid*/); - meta->class_idx = (uint8_t)class_idx; - // P1.1: Update class_map after geometry fix - ss->class_map[slab_idx] = (uint8_t)class_idx; - } -} - -int -shared_pool_acquire_slab(int class_idx, SuperSlab** ss_out, int* slab_idx_out) -{ - // Phase 12: SP-SLOT Box - 3-Stage Acquire Logic - // - // Stage 1: Reuse EMPTY slots from per-class free list (EMPTY→ACTIVE) - // Stage 2: Find UNUSED slots in existing SuperSlabs - // Stage 3: Get new SuperSlab (LRU pop or mmap) - // - // Invariants: - // - On success: *ss_out != NULL, 0 <= *slab_idx_out < total_slots - // - The chosen slab has meta->class_idx == class_idx - - if (!ss_out || !slab_idx_out) { - return -1; - } - if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) { - return -1; - } - - shared_pool_init(); - - // Debug logging / stage stats -#if !HAKMEM_BUILD_RELEASE - static int dbg_acquire = -1; - if (__builtin_expect(dbg_acquire == -1, 0)) { - const char* e = getenv("HAKMEM_SS_ACQUIRE_DEBUG"); - dbg_acquire = (e && *e && *e != '0') ? 
1 : 0; - } -#else - static const int dbg_acquire = 0; -#endif - sp_stage_stats_init(); - -stage1_retry_after_tension_drain: - // ========== Stage 0.5 (Phase 12-1.1): EMPTY slab direct scan ========== - // Scan existing SuperSlabs for EMPTY slabs (highest reuse priority) to - // avoid Stage 3 (mmap) when freed slabs are available. - if (sp_acquire_from_empty_scan(class_idx, ss_out, slab_idx_out, dbg_acquire) == 0) { - return 0; - } - - // ========== Stage 1 (Lock-Free): Try to reuse EMPTY slots ========== - // P0-4: Lock-free pop from per-class free list (no mutex needed!) - // Best case: Same class freed a slot, reuse immediately (cache-hot) - SharedSSMeta* reuse_meta = NULL; - int reuse_slot_idx = -1; - - if (sp_freelist_pop_lockfree(class_idx, &reuse_meta, &reuse_slot_idx)) { - // Found EMPTY slot from lock-free list! - // Now acquire mutex ONLY for slot activation and metadata update - - // P0 instrumentation: count lock acquisitions - lock_stats_init(); - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_acquire_count, 1); - atomic_fetch_add(&g_lock_acquire_slab_count, 1); - } - - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - // P0.3: Guard against TLS SLL orphaned pointers before reusing slab - // RACE FIX: Load SuperSlab pointer atomically BEFORE guard (consistency) - SuperSlab* ss_guard = atomic_load_explicit(&reuse_meta->ss, memory_order_relaxed); - if (ss_guard) { - tiny_tls_slab_reuse_guard(ss_guard); - } - - // Activate slot under mutex (slot state transition requires protection) - if (sp_slot_mark_active(reuse_meta, reuse_slot_idx, class_idx) == 0) { - // RACE FIX: Load SuperSlab pointer atomically (consistency) - SuperSlab* ss = atomic_load_explicit(&reuse_meta->ss, memory_order_relaxed); - - // RACE FIX: Check if SuperSlab was freed (NULL pointer) - // This can happen if Thread A freed the SuperSlab after pushing slot to freelist, - // but Thread B popped the stale slot before the freelist was cleared. - if (!ss) { - // SuperSlab freed - skip and fall through to Stage 2/3 - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - goto stage2_fallback; - } - - #if !HAKMEM_BUILD_RELEASE - if (dbg_acquire == 1) { - fprintf(stderr, "[SP_ACQUIRE_STAGE1_LOCKFREE] class=%d reusing EMPTY slot (ss=%p slab=%d)\n", - class_idx, (void*)ss, reuse_slot_idx); - } - #endif - - // Update SuperSlab metadata - ss->slab_bitmap |= (1u << reuse_slot_idx); - ss_slab_meta_class_idx_set(ss, reuse_slot_idx, (uint8_t)class_idx); - - if (ss->active_slabs == 0) { - // Was empty, now active again - ss->active_slabs = 1; - g_shared_pool.active_count++; - } - // Track per-class active slots (approximate, under alloc_lock) - if (class_idx < TINY_NUM_CLASSES_SS) { - g_shared_pool.class_active_slots[class_idx]++; - } - - // Update hint - g_shared_pool.class_hints[class_idx] = ss; - - *ss_out = ss; - *slab_idx_out = reuse_slot_idx; - - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - if (g_sp_stage_stats_enabled) { - atomic_fetch_add(&g_sp_stage1_hits[class_idx], 1); - } - return 0; // ✅ Stage 1 (lock-free) success - } - - // Slot activation failed (race condition?) 
- release lock and fall through - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - } - -stage2_fallback: - // ========== Stage 2 (Lock-Free): Try to claim UNUSED slots ========== - // P0-5: Lock-free atomic CAS claiming (no mutex needed for slot state transition!) - // RACE FIX: Read ss_meta_count atomically (now properly declared as _Atomic) - // No cast needed! memory_order_acquire synchronizes with release in sp_meta_find_or_create - uint32_t meta_count = atomic_load_explicit( - &g_shared_pool.ss_meta_count, - memory_order_acquire - ); - - for (uint32_t i = 0; i < meta_count; i++) { - SharedSSMeta* meta = &g_shared_pool.ss_metadata[i]; - - // Try lock-free claiming (UNUSED → ACTIVE via CAS) - int claimed_idx = sp_slot_claim_lockfree(meta, class_idx); - if (claimed_idx >= 0) { - // RACE FIX: Load SuperSlab pointer atomically (critical for lock-free Stage 2) - // Use memory_order_acquire to synchronize with release in sp_meta_find_or_create - SuperSlab* ss = atomic_load_explicit(&meta->ss, memory_order_acquire); - if (!ss) { - // SuperSlab was freed between claiming and loading - skip this entry - continue; - } - - #if !HAKMEM_BUILD_RELEASE - if (dbg_acquire == 1) { - fprintf(stderr, "[SP_ACQUIRE_STAGE2_LOCKFREE] class=%d claimed UNUSED slot (ss=%p slab=%d)\n", - class_idx, (void*)ss, claimed_idx); - } - #endif - - // P0 instrumentation: count lock acquisitions - lock_stats_init(); - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_acquire_count, 1); - atomic_fetch_add(&g_lock_acquire_slab_count, 1); - } - - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - // Update SuperSlab metadata under mutex - ss->slab_bitmap |= (1u << claimed_idx); - ss_slab_meta_class_idx_set(ss, claimed_idx, (uint8_t)class_idx); - - if (ss->active_slabs == 0) { - ss->active_slabs = 1; - g_shared_pool.active_count++; - } - if (class_idx < TINY_NUM_CLASSES_SS) { - g_shared_pool.class_active_slots[class_idx]++; - } - - // Update hint - g_shared_pool.class_hints[class_idx] = ss; - - *ss_out = ss; - *slab_idx_out = claimed_idx; - sp_fix_geometry_if_needed(ss, claimed_idx, class_idx); - - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - if (g_sp_stage_stats_enabled) { - atomic_fetch_add(&g_sp_stage2_hits[class_idx], 1); - } - return 0; // ✅ Stage 2 (lock-free) success - } - - // Claim failed (no UNUSED slots in this meta) - continue to next SuperSlab - } - - // ========== Tension-Based Drain: Try to create EMPTY slots before Stage 3 ========== - // If TLS SLL has accumulated blocks, drain them to enable EMPTY slot detection - // This can avoid allocating new SuperSlabs by reusing EMPTY slots in Stage 1 - // ENV: HAKMEM_TINY_TENSION_DRAIN_ENABLE=0 to disable (default=1) - // ENV: HAKMEM_TINY_TENSION_DRAIN_THRESHOLD=N to set threshold (default=1024) - { - static int tension_drain_enabled = -1; - static uint32_t tension_threshold = 1024; - - if (tension_drain_enabled < 0) { - const char* env = getenv("HAKMEM_TINY_TENSION_DRAIN_ENABLE"); - tension_drain_enabled = (env == NULL || atoi(env) != 0) ? 
1 : 0; - - const char* thresh_env = getenv("HAKMEM_TINY_TENSION_DRAIN_THRESHOLD"); - if (thresh_env) { - tension_threshold = (uint32_t)atoi(thresh_env); - if (tension_threshold < 64) tension_threshold = 64; - if (tension_threshold > 65536) tension_threshold = 65536; - } - } - - if (tension_drain_enabled) { - extern __thread TinyTLSSLL g_tls_sll[TINY_NUM_CLASSES]; - extern uint32_t tiny_tls_sll_drain(int class_idx, uint32_t batch_size); - - uint32_t sll_count = (class_idx < TINY_NUM_CLASSES) ? g_tls_sll[class_idx].count : 0; - - if (sll_count >= tension_threshold) { - // Drain all blocks to maximize EMPTY slot creation - uint32_t drained = tiny_tls_sll_drain(class_idx, 0); // 0 = drain all - - if (drained > 0) { - // Retry Stage 1 (EMPTY reuse) after drain - // Some slabs might have become EMPTY (meta->used == 0) - goto stage1_retry_after_tension_drain; - } - } - } - } - - // ========== Stage 3: Mutex-protected fallback (new SuperSlab allocation) ========== - // All existing SuperSlabs have no UNUSED slots → need new SuperSlab - // P0 instrumentation: count lock acquisitions - lock_stats_init(); - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_acquire_count, 1); - atomic_fetch_add(&g_lock_acquire_slab_count, 1); - } - - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - // ========== Stage 3: Get new SuperSlab ========== - // Try LRU cache first, then mmap - SuperSlab* new_ss = NULL; - - // Stage 3a: Try LRU cache - extern SuperSlab* hak_ss_lru_pop(uint8_t size_class); - new_ss = hak_ss_lru_pop((uint8_t)class_idx); - - int from_lru = (new_ss != NULL); - - // Stage 3b: If LRU miss, allocate new SuperSlab - if (!new_ss) { - // Release the alloc_lock to avoid deadlock with registry during superslab_allocate - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - - SuperSlab* allocated_ss = sp_internal_allocate_superslab(); - - // Re-acquire the alloc_lock - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_acquire_count, 1); - atomic_fetch_add(&g_lock_acquire_slab_count, 1); // This is part of acquisition path - } - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - if (!allocated_ss) { - // Allocation failed; return now. - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return -1; // Out of memory - } - - new_ss = allocated_ss; - - // Add newly allocated SuperSlab to the shared pool's internal array - if (g_shared_pool.total_count >= g_shared_pool.capacity) { - shared_pool_ensure_capacity_unlocked(g_shared_pool.total_count + 1); - if (g_shared_pool.total_count >= g_shared_pool.capacity) { - // Pool table expansion failed; leave ss alive (registry-owned), - // but do not treat it as part of shared_pool. - // This is a critical error, return early. 
- if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return -1; - } - } - g_shared_pool.slabs[g_shared_pool.total_count] = new_ss; - g_shared_pool.total_count++; - } - - #if !HAKMEM_BUILD_RELEASE - if (dbg_acquire == 1 && new_ss) { - fprintf(stderr, "[SP_ACQUIRE_STAGE3] class=%d new SuperSlab (ss=%p from_lru=%d)\n", - class_idx, (void*)new_ss, from_lru); - } - #endif - - if (!new_ss) { - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return -1; // ❌ Out of memory - } - - // Before creating a new SuperSlab, consult learning-layer soft cap. - // If current active slots for this class already exceed the policy cap, - // fail early so caller can fall back to legacy backend. - uint32_t limit = sp_class_active_limit(class_idx); - if (limit > 0) { - uint32_t cur = g_shared_pool.class_active_slots[class_idx]; - if (cur >= limit) { - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return -1; // Soft cap reached for this class - } - } - - // Create metadata for this new SuperSlab - SharedSSMeta* new_meta = sp_meta_find_or_create(new_ss); - if (!new_meta) { - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return -1; // ❌ Metadata allocation failed - } - - // Assign first slot to this class - int first_slot = 0; - if (sp_slot_mark_active(new_meta, first_slot, class_idx) != 0) { - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return -1; // ❌ Should not happen - } - - // Update SuperSlab metadata - new_ss->slab_bitmap |= (1u << first_slot); - ss_slab_meta_class_idx_set(new_ss, first_slot, (uint8_t)class_idx); - new_ss->active_slabs = 1; - g_shared_pool.active_count++; - if (class_idx < TINY_NUM_CLASSES_SS) { - g_shared_pool.class_active_slots[class_idx]++; - } - - // Update hint - g_shared_pool.class_hints[class_idx] = new_ss; - - *ss_out = new_ss; - *slab_idx_out = first_slot; - sp_fix_geometry_if_needed(new_ss, first_slot, class_idx); - - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - if (g_sp_stage_stats_enabled) { - atomic_fetch_add(&g_sp_stage3_hits[class_idx], 1); - } - return 0; // ✅ Stage 3 success -} - -void -shared_pool_release_slab(SuperSlab* ss, int slab_idx) -{ - // Phase 12: SP-SLOT Box - Slot-based Release - // - // Flow: - // 1. Validate inputs and check meta->used == 0 - // 2. Find SharedSSMeta for this SuperSlab - // 3. Mark slot ACTIVE → EMPTY - // 4. Push to per-class free list (enables same-class reuse) - // 5. If all slots EMPTY → superslab_free() → LRU cache - - if (!ss) { - return; - } - if (slab_idx < 0 || slab_idx >= SLABS_PER_SUPERSLAB_MAX) { - return; - } - - // Debug logging -#if !HAKMEM_BUILD_RELEASE - static int dbg = -1; - if (__builtin_expect(dbg == -1, 0)) { - const char* e = getenv("HAKMEM_SS_FREE_DEBUG"); - dbg = (e && *e && *e != '0') ? 
1 : 0; - } -#else - static const int dbg = 0; -#endif - - // P0 instrumentation: count lock acquisitions - lock_stats_init(); - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_acquire_count, 1); - atomic_fetch_add(&g_lock_release_slab_count, 1); - } - - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - TinySlabMeta* slab_meta = &ss->slabs[slab_idx]; - if (slab_meta->used != 0) { - // Not actually empty; nothing to do - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return; - } - - uint8_t class_idx = slab_meta->class_idx; - - #if !HAKMEM_BUILD_RELEASE - if (dbg == 1) { - fprintf(stderr, "[SP_SLOT_RELEASE] ss=%p slab_idx=%d class=%d used=0 (marking EMPTY)\n", - (void*)ss, slab_idx, class_idx); - } - #endif - - // Find SharedSSMeta for this SuperSlab - SharedSSMeta* sp_meta = NULL; - uint32_t count = atomic_load_explicit(&g_shared_pool.ss_meta_count, memory_order_relaxed); - for (uint32_t i = 0; i < count; i++) { - // RACE FIX: Load pointer atomically - SuperSlab* meta_ss = atomic_load_explicit(&g_shared_pool.ss_metadata[i].ss, memory_order_relaxed); - if (meta_ss == ss) { - sp_meta = &g_shared_pool.ss_metadata[i]; - break; - } - } - - if (!sp_meta) { - // SuperSlab not in SP-SLOT system yet - create metadata - sp_meta = sp_meta_find_or_create(ss); - if (!sp_meta) { - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return; // Failed to create metadata - } - } - - // Mark slot as EMPTY (ACTIVE → EMPTY) - uint32_t slab_bit = (1u << slab_idx); - SlotState slot_state = atomic_load_explicit( - &sp_meta->slots[slab_idx].state, - memory_order_acquire); - if (slot_state != SLOT_ACTIVE && (ss->slab_bitmap & slab_bit)) { - // Legacy path import: rebuild slot states from SuperSlab bitmap/class_map - sp_meta_sync_slots_from_ss(sp_meta, ss); - slot_state = atomic_load_explicit( - &sp_meta->slots[slab_idx].state, - memory_order_acquire); - } - - if (slot_state != SLOT_ACTIVE || sp_slot_mark_empty(sp_meta, slab_idx) != 0) { - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return; // Slot wasn't ACTIVE - } - - // Update SuperSlab metadata - uint32_t bit = (1u << slab_idx); - if (ss->slab_bitmap & bit) { - ss->slab_bitmap &= ~bit; - slab_meta->class_idx = 255; // UNASSIGNED - // P1.1: Mark class_map as UNASSIGNED when releasing slab - ss->class_map[slab_idx] = 255; - - if (ss->active_slabs > 0) { - ss->active_slabs--; - if (ss->active_slabs == 0 && g_shared_pool.active_count > 0) { - g_shared_pool.active_count--; - } - } - if (class_idx < TINY_NUM_CLASSES_SS && - g_shared_pool.class_active_slots[class_idx] > 0) { - g_shared_pool.class_active_slots[class_idx]--; - } - } - - // P0-4: Push to lock-free per-class free list (enables reuse by same class) - // Note: push BEFORE releasing mutex (slot state already updated under lock) - if (class_idx < TINY_NUM_CLASSES_SS) { - sp_freelist_push_lockfree(class_idx, sp_meta, slab_idx); - - #if !HAKMEM_BUILD_RELEASE - if (dbg == 1) { - fprintf(stderr, "[SP_SLOT_FREELIST_LOCKFREE] class=%d pushed slot (ss=%p slab=%d) active_slots=%u/%u\n", - class_idx, (void*)ss, slab_idx, - sp_meta->active_slots, sp_meta->total_slots); - } - #endif - } - - // Check if SuperSlab is now completely empty (all slots EMPTY or UNUSED) - if (sp_meta->active_slots == 0) { - #if !HAKMEM_BUILD_RELEASE - if (dbg == 1) { - fprintf(stderr, "[SP_SLOT_COMPLETELY_EMPTY] ss=%p active_slots=0 (calling 
superslab_free)\n",
-                    (void*)ss);
-        }
-        #endif
-
-        if (g_lock_stats_enabled == 1) {
-            atomic_fetch_add(&g_lock_release_count, 1);
-        }
-
-        // RACE FIX: Set meta->ss to NULL BEFORE unlocking mutex
-        // This prevents Stage 2 from accessing freed SuperSlab
-        atomic_store_explicit(&sp_meta->ss, NULL, memory_order_release);
-
-        pthread_mutex_unlock(&g_shared_pool.alloc_lock);
-
-        // Remove from legacy backend list (if present) to prevent dangling pointers
-        extern void remove_superslab_from_legacy_head(SuperSlab* ss);
-        remove_superslab_from_legacy_head(ss);
-
-        // Free SuperSlab:
-        //   1. Try LRU cache (hak_ss_lru_push) - lazy deallocation
-        //   2. Or munmap if LRU is full - eager deallocation
-        extern void superslab_free(SuperSlab* ss);
-        superslab_free(ss);
-        return;
-    }
-
-    if (g_lock_stats_enabled == 1) {
-        atomic_fetch_add(&g_lock_release_count, 1);
-    }
-    pthread_mutex_unlock(&g_shared_pool.alloc_lock);
+void sp_fix_geometry_if_needed(SuperSlab* ss, int slab_idx, int class_idx) {
+    // Phase 9-1: For now, we assume geometry is compatible or set by caller.
+    // This hook exists for future use when we support dynamic geometry resizing.
+    (void)ss; (void)slab_idx; (void)class_idx;
 }
diff --git a/core/hakmem_shared_pool_acquire.c b/core/hakmem_shared_pool_acquire.c
new file mode 100644
index 00000000..3f7cba84
--- /dev/null
+++ b/core/hakmem_shared_pool_acquire.c
@@ -0,0 +1,479 @@
+#include "hakmem_shared_pool_internal.h"
+#include "hakmem_debug_master.h"
+#include "hakmem_stats_master.h"
+#include "box/ss_slab_meta_box.h"
+#include "box/ss_hot_cold_box.h"
+#include "box/pagefault_telemetry_box.h"
+#include "box/tls_sll_drain_box.h"
+#include "box/tls_slab_reuse_guard_box.h"
+#include "hakmem_policy.h"
+
+#include <stdio.h>    /* fprintf/stderr (debug logging) */
+#include <stdlib.h>   /* getenv, atoi */
+#include <pthread.h>  /* pthread_mutex_lock/unlock */
+
+// Stage 0.5: EMPTY slab direct scan (registry-based EMPTY reuse)
+// Scan existing SuperSlabs for EMPTY slabs (highest reuse priority) to
+// avoid Stage 3 (mmap) when freed slabs are available.
+static inline int
+sp_acquire_from_empty_scan(int class_idx, SuperSlab** ss_out, int* slab_idx_out, int dbg_acquire)
+{
+    static int empty_reuse_enabled = -1;
+    if (__builtin_expect(empty_reuse_enabled == -1, 0)) {
+        const char* e = getenv("HAKMEM_SS_EMPTY_REUSE");
+        empty_reuse_enabled = (e && *e && *e == '0') ? 0 : 1;  // default ON
+    }
+
+    if (!empty_reuse_enabled) {
+        return -1;
+    }
+
+    extern SuperSlab* g_super_reg_by_class[TINY_NUM_CLASSES][SUPER_REG_PER_CLASS];
+    extern int g_super_reg_class_size[TINY_NUM_CLASSES];
+
+    int reg_size = (class_idx < TINY_NUM_CLASSES) ? g_super_reg_class_size[class_idx] : 0;
+    static int scan_limit = -1;
+    if (__builtin_expect(scan_limit == -1, 0)) {
+        const char* e = getenv("HAKMEM_SS_EMPTY_SCAN_LIMIT");
+        scan_limit = (e && *e) ?
atoi(e) : 32; // default: scan first 32 SuperSlabs (Phase 9-2 tuning) + } + if (scan_limit > reg_size) scan_limit = reg_size; + + // Stage 0.5 hit counter for visualization + static _Atomic uint64_t stage05_hits = 0; + static _Atomic uint64_t stage05_attempts = 0; + atomic_fetch_add_explicit(&stage05_attempts, 1, memory_order_relaxed); + + for (int i = 0; i < scan_limit; i++) { + SuperSlab* ss = g_super_reg_by_class[class_idx][i]; + if (!(ss && ss->magic == SUPERSLAB_MAGIC)) continue; + if (ss->empty_count == 0) continue; // No EMPTY slabs in this SS + + uint32_t mask = ss->empty_mask; + while (mask) { + int empty_idx = __builtin_ctz(mask); + mask &= (mask - 1); // clear lowest bit + + TinySlabMeta* meta = &ss->slabs[empty_idx]; + if (meta->capacity > 0 && meta->used == 0) { + tiny_tls_slab_reuse_guard(ss); + ss_clear_slab_empty(ss, empty_idx); + + meta->class_idx = (uint8_t)class_idx; + ss->class_map[empty_idx] = (uint8_t)class_idx; + +#if !HAKMEM_BUILD_RELEASE + if (dbg_acquire == 1) { + fprintf(stderr, + "[SP_ACQUIRE_STAGE0.5_EMPTY] class=%d reusing EMPTY slab (ss=%p slab=%d empty_count=%u)\n", + class_idx, (void*)ss, empty_idx, ss->empty_count); + } +#else + (void)dbg_acquire; +#endif + + *ss_out = ss; + *slab_idx_out = empty_idx; + sp_stage_stats_init(); + if (g_sp_stage_stats_enabled) { + atomic_fetch_add(&g_sp_stage1_hits[class_idx], 1); + } + atomic_fetch_add_explicit(&stage05_hits, 1, memory_order_relaxed); + + // Stage 0.5 hit rate visualization (every 100 hits) + uint64_t hits = atomic_load_explicit(&stage05_hits, memory_order_relaxed); + if (hits % 100 == 1) { + uint64_t attempts = atomic_load_explicit(&stage05_attempts, memory_order_relaxed); + fprintf(stderr, "[STAGE0.5_STATS] hits=%lu attempts=%lu rate=%.1f%% (scan_limit=%d)\n", + hits, attempts, (double)hits * 100.0 / attempts, scan_limit); + } + return 0; + } + } + } + return -1; +} + +int +shared_pool_acquire_slab(int class_idx, SuperSlab** ss_out, int* slab_idx_out) +{ + // Phase 12: SP-SLOT Box - 3-Stage Acquire Logic + // + // Stage 1: Reuse EMPTY slots from per-class free list (EMPTY→ACTIVE) + // Stage 2: Find UNUSED slots in existing SuperSlabs + // Stage 3: Get new SuperSlab (LRU pop or mmap) + // + // Invariants: + // - On success: *ss_out != NULL, 0 <= *slab_idx_out < total_slots + // - The chosen slab has meta->class_idx == class_idx + + if (!ss_out || !slab_idx_out) { + return -1; + } + if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) { + return -1; + } + + shared_pool_init(); + + // Debug logging / stage stats +#if !HAKMEM_BUILD_RELEASE + static int dbg_acquire = -1; + if (__builtin_expect(dbg_acquire == -1, 0)) { + const char* e = getenv("HAKMEM_SS_ACQUIRE_DEBUG"); + dbg_acquire = (e && *e && *e != '0') ? 1 : 0; + } +#else + static const int dbg_acquire = 0; +#endif + sp_stage_stats_init(); + +stage1_retry_after_tension_drain: + // ========== Stage 0.5 (Phase 12-1.1): EMPTY slab direct scan ========== + // Scan existing SuperSlabs for EMPTY slabs (highest reuse priority) to + // avoid Stage 3 (mmap) when freed slabs are available. + if (sp_acquire_from_empty_scan(class_idx, ss_out, slab_idx_out, dbg_acquire) == 0) { + return 0; + } + + // ========== Stage 1 (Lock-Free): Try to reuse EMPTY slots ========== + // P0-4: Lock-free pop from per-class free list (no mutex needed!) 
+ // Best case: Same class freed a slot, reuse immediately (cache-hot) + SharedSSMeta* reuse_meta = NULL; + int reuse_slot_idx = -1; + + if (sp_freelist_pop_lockfree(class_idx, &reuse_meta, &reuse_slot_idx)) { + // Found EMPTY slot from lock-free list! + // Now acquire mutex ONLY for slot activation and metadata update + + // P0 instrumentation: count lock acquisitions + lock_stats_init(); + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_acquire_count, 1); + atomic_fetch_add(&g_lock_acquire_slab_count, 1); + } + + pthread_mutex_lock(&g_shared_pool.alloc_lock); + + // P0.3: Guard against TLS SLL orphaned pointers before reusing slab + // RACE FIX: Load SuperSlab pointer atomically BEFORE guard (consistency) + SuperSlab* ss_guard = atomic_load_explicit(&reuse_meta->ss, memory_order_relaxed); + if (ss_guard) { + tiny_tls_slab_reuse_guard(ss_guard); + } + + // Activate slot under mutex (slot state transition requires protection) + if (sp_slot_mark_active(reuse_meta, reuse_slot_idx, class_idx) == 0) { + // RACE FIX: Load SuperSlab pointer atomically (consistency) + SuperSlab* ss = atomic_load_explicit(&reuse_meta->ss, memory_order_relaxed); + + // RACE FIX: Check if SuperSlab was freed (NULL pointer) + // This can happen if Thread A freed the SuperSlab after pushing slot to freelist, + // but Thread B popped the stale slot before the freelist was cleared. + if (!ss) { + // SuperSlab freed - skip and fall through to Stage 2/3 + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + goto stage2_fallback; + } + + #if !HAKMEM_BUILD_RELEASE + if (dbg_acquire == 1) { + fprintf(stderr, "[SP_ACQUIRE_STAGE1_LOCKFREE] class=%d reusing EMPTY slot (ss=%p slab=%d)\n", + class_idx, (void*)ss, reuse_slot_idx); + } + #endif + + // Update SuperSlab metadata + ss->slab_bitmap |= (1u << reuse_slot_idx); + ss_slab_meta_class_idx_set(ss, reuse_slot_idx, (uint8_t)class_idx); + + if (ss->active_slabs == 0) { + // Was empty, now active again + ss->active_slabs = 1; + g_shared_pool.active_count++; + } + // Track per-class active slots (approximate, under alloc_lock) + if (class_idx < TINY_NUM_CLASSES_SS) { + g_shared_pool.class_active_slots[class_idx]++; + } + + // Update hint + g_shared_pool.class_hints[class_idx] = ss; + + *ss_out = ss; + *slab_idx_out = reuse_slot_idx; + + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + if (g_sp_stage_stats_enabled) { + atomic_fetch_add(&g_sp_stage1_hits[class_idx], 1); + } + return 0; // ✅ Stage 1 (lock-free) success + } + + // Slot activation failed (race condition?) - release lock and fall through + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + } + +stage2_fallback: + // ========== Stage 2 (Lock-Free): Try to claim UNUSED slots ========== + // P0-5: Lock-free atomic CAS claiming (no mutex needed for slot state transition!) + // RACE FIX: Read ss_meta_count atomically (now properly declared as _Atomic) + // No cast needed! 
memory_order_acquire synchronizes with release in sp_meta_find_or_create + uint32_t meta_count = atomic_load_explicit( + &g_shared_pool.ss_meta_count, + memory_order_acquire + ); + + for (uint32_t i = 0; i < meta_count; i++) { + SharedSSMeta* meta = &g_shared_pool.ss_metadata[i]; + + // Try lock-free claiming (UNUSED → ACTIVE via CAS) + int claimed_idx = sp_slot_claim_lockfree(meta, class_idx); + if (claimed_idx >= 0) { + // RACE FIX: Load SuperSlab pointer atomically (critical for lock-free Stage 2) + // Use memory_order_acquire to synchronize with release in sp_meta_find_or_create + SuperSlab* ss = atomic_load_explicit(&meta->ss, memory_order_acquire); + if (!ss) { + // SuperSlab was freed between claiming and loading - skip this entry + continue; + } + + #if !HAKMEM_BUILD_RELEASE + if (dbg_acquire == 1) { + fprintf(stderr, "[SP_ACQUIRE_STAGE2_LOCKFREE] class=%d claimed UNUSED slot (ss=%p slab=%d)\n", + class_idx, (void*)ss, claimed_idx); + } + #endif + + // P0 instrumentation: count lock acquisitions + lock_stats_init(); + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_acquire_count, 1); + atomic_fetch_add(&g_lock_acquire_slab_count, 1); + } + + pthread_mutex_lock(&g_shared_pool.alloc_lock); + + // Update SuperSlab metadata under mutex + ss->slab_bitmap |= (1u << claimed_idx); + ss_slab_meta_class_idx_set(ss, claimed_idx, (uint8_t)class_idx); + + if (ss->active_slabs == 0) { + ss->active_slabs = 1; + g_shared_pool.active_count++; + } + if (class_idx < TINY_NUM_CLASSES_SS) { + g_shared_pool.class_active_slots[class_idx]++; + } + + // Update hint + g_shared_pool.class_hints[class_idx] = ss; + + *ss_out = ss; + *slab_idx_out = claimed_idx; + sp_fix_geometry_if_needed(ss, claimed_idx, class_idx); + + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + if (g_sp_stage_stats_enabled) { + atomic_fetch_add(&g_sp_stage2_hits[class_idx], 1); + } + return 0; // ✅ Stage 2 (lock-free) success + } + + // Claim failed (no UNUSED slots in this meta) - continue to next SuperSlab + } + + // ========== Tension-Based Drain: Try to create EMPTY slots before Stage 3 ========== + // If TLS SLL has accumulated blocks, drain them to enable EMPTY slot detection + // This can avoid allocating new SuperSlabs by reusing EMPTY slots in Stage 1 + // ENV: HAKMEM_TINY_TENSION_DRAIN_ENABLE=0 to disable (default=1) + // ENV: HAKMEM_TINY_TENSION_DRAIN_THRESHOLD=N to set threshold (default=1024) + { + static int tension_drain_enabled = -1; + static uint32_t tension_threshold = 1024; + + if (tension_drain_enabled < 0) { + const char* env = getenv("HAKMEM_TINY_TENSION_DRAIN_ENABLE"); + tension_drain_enabled = (env == NULL || atoi(env) != 0) ? 1 : 0; + + const char* thresh_env = getenv("HAKMEM_TINY_TENSION_DRAIN_THRESHOLD"); + if (thresh_env) { + tension_threshold = (uint32_t)atoi(thresh_env); + if (tension_threshold < 64) tension_threshold = 64; + if (tension_threshold > 65536) tension_threshold = 65536; + } + } + + if (tension_drain_enabled) { + extern __thread TinyTLSSLL g_tls_sll[TINY_NUM_CLASSES]; + extern uint32_t tiny_tls_sll_drain(int class_idx, uint32_t batch_size); + + uint32_t sll_count = (class_idx < TINY_NUM_CLASSES) ? 
g_tls_sll[class_idx].count : 0; + + if (sll_count >= tension_threshold) { + // Drain all blocks to maximize EMPTY slot creation + uint32_t drained = tiny_tls_sll_drain(class_idx, 0); // 0 = drain all + + if (drained > 0) { + // Retry Stage 1 (EMPTY reuse) after drain + // Some slabs might have become EMPTY (meta->used == 0) + goto stage1_retry_after_tension_drain; + } + } + } + } + + // ========== Stage 3: Mutex-protected fallback (new SuperSlab allocation) ========== + // All existing SuperSlabs have no UNUSED slots → need new SuperSlab + // P0 instrumentation: count lock acquisitions + lock_stats_init(); + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_acquire_count, 1); + atomic_fetch_add(&g_lock_acquire_slab_count, 1); + } + + pthread_mutex_lock(&g_shared_pool.alloc_lock); + + // ========== Stage 3: Get new SuperSlab ========== + // Try LRU cache first, then mmap + SuperSlab* new_ss = NULL; + + // Stage 3a: Try LRU cache + extern SuperSlab* hak_ss_lru_pop(uint8_t size_class); + new_ss = hak_ss_lru_pop((uint8_t)class_idx); + + int from_lru = (new_ss != NULL); + + // Stage 3b: If LRU miss, allocate new SuperSlab + if (!new_ss) { + // Release the alloc_lock to avoid deadlock with registry during superslab_allocate + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + + SuperSlab* allocated_ss = sp_internal_allocate_superslab(); + + // Re-acquire the alloc_lock + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_acquire_count, 1); + atomic_fetch_add(&g_lock_acquire_slab_count, 1); // This is part of acquisition path + } + pthread_mutex_lock(&g_shared_pool.alloc_lock); + + if (!allocated_ss) { + // Allocation failed; return now. + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return -1; // Out of memory + } + + new_ss = allocated_ss; + + // Add newly allocated SuperSlab to the shared pool's internal array + if (g_shared_pool.total_count >= g_shared_pool.capacity) { + shared_pool_ensure_capacity_unlocked(g_shared_pool.total_count + 1); + if (g_shared_pool.total_count >= g_shared_pool.capacity) { + // Pool table expansion failed; leave ss alive (registry-owned), + // but do not treat it as part of shared_pool. + // This is a critical error, return early. + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return -1; + } + } + g_shared_pool.slabs[g_shared_pool.total_count] = new_ss; + g_shared_pool.total_count++; + } + + #if !HAKMEM_BUILD_RELEASE + if (dbg_acquire == 1 && new_ss) { + fprintf(stderr, "[SP_ACQUIRE_STAGE3] class=%d new SuperSlab (ss=%p from_lru=%d)\n", + class_idx, (void*)new_ss, from_lru); + } + #endif + + if (!new_ss) { + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return -1; // ❌ Out of memory + } + + // Before creating a new SuperSlab, consult learning-layer soft cap. + // If current active slots for this class already exceed the policy cap, + // fail early so caller can fall back to legacy backend. 
+ uint32_t limit = sp_class_active_limit(class_idx); + if (limit > 0) { + uint32_t cur = g_shared_pool.class_active_slots[class_idx]; + if (cur >= limit) { + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return -1; // Soft cap reached for this class + } + } + + // Create metadata for this new SuperSlab + SharedSSMeta* new_meta = sp_meta_find_or_create(new_ss); + if (!new_meta) { + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return -1; // ❌ Metadata allocation failed + } + + // Assign first slot to this class + int first_slot = 0; + if (sp_slot_mark_active(new_meta, first_slot, class_idx) != 0) { + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return -1; // ❌ Should not happen + } + + // Update SuperSlab metadata + new_ss->slab_bitmap |= (1u << first_slot); + ss_slab_meta_class_idx_set(new_ss, first_slot, (uint8_t)class_idx); + new_ss->active_slabs = 1; + g_shared_pool.active_count++; + if (class_idx < TINY_NUM_CLASSES_SS) { + g_shared_pool.class_active_slots[class_idx]++; + } + + // Update hint + g_shared_pool.class_hints[class_idx] = new_ss; + + *ss_out = new_ss; + *slab_idx_out = first_slot; + sp_fix_geometry_if_needed(new_ss, first_slot, class_idx); + + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + if (g_sp_stage_stats_enabled) { + atomic_fetch_add(&g_sp_stage3_hits[class_idx], 1); + } + return 0; // ✅ Stage 3 success +} diff --git a/core/hakmem_shared_pool_internal.h b/core/hakmem_shared_pool_internal.h new file mode 100644 index 00000000..0dcec158 --- /dev/null +++ b/core/hakmem_shared_pool_internal.h @@ -0,0 +1,56 @@ +#ifndef HAKMEM_SHARED_POOL_INTERNAL_H +#define HAKMEM_SHARED_POOL_INTERNAL_H + +#include "hakmem_shared_pool.h" +#include "hakmem_tiny_superslab.h" +#include "hakmem_tiny_superslab_constants.h" +#include +#include + +// Global Shared Pool Instance +extern SharedSuperSlabPool g_shared_pool; + +// Lock Statistics +// Counters are defined always to avoid compilation errors in Release build +// (usage is guarded by g_lock_stats_enabled which is 0 in Release) +extern _Atomic uint64_t g_lock_acquire_count; +extern _Atomic uint64_t g_lock_release_count; +extern _Atomic uint64_t g_lock_acquire_slab_count; +extern _Atomic uint64_t g_lock_release_slab_count; +extern int g_lock_stats_enabled; + +#if !HAKMEM_BUILD_RELEASE +void lock_stats_init(void); +#else +static inline void lock_stats_init(void) { + // No-op for release build +} +#endif + +// Stage Statistics +extern _Atomic uint64_t g_sp_stage1_hits[TINY_NUM_CLASSES_SS]; +extern _Atomic uint64_t g_sp_stage2_hits[TINY_NUM_CLASSES_SS]; +extern _Atomic uint64_t g_sp_stage3_hits[TINY_NUM_CLASSES_SS]; +extern int g_sp_stage_stats_enabled; +void sp_stage_stats_init(void); + +// Internal Helpers (Shared between acquire/release/pool) +void shared_pool_ensure_capacity_unlocked(uint32_t min_capacity); +SuperSlab* sp_internal_allocate_superslab(void); + +// Slot & Meta Helpers +int sp_slot_mark_active(SharedSSMeta* meta, int slot_idx, int class_idx); +int sp_slot_mark_empty(SharedSSMeta* meta, int slot_idx); +int sp_slot_claim_lockfree(SharedSSMeta* meta, int class_idx); +SharedSSMeta* sp_meta_find_or_create(SuperSlab* ss); +void sp_meta_sync_slots_from_ss(SharedSSMeta* meta, 
SuperSlab* ss);
+
+// Free List Helpers
+int sp_freelist_push_lockfree(int class_idx, SharedSSMeta* meta, int slot_idx);
+int sp_freelist_pop_lockfree(int class_idx, SharedSSMeta** meta_out, int* slot_idx_out);
+
+// Policy & Geometry Helpers
+uint32_t sp_class_active_limit(int class_idx);
+void sp_fix_geometry_if_needed(SuperSlab* ss, int slab_idx, int class_idx);
+
+#endif // HAKMEM_SHARED_POOL_INTERNAL_H
diff --git a/core/hakmem_shared_pool_release.c b/core/hakmem_shared_pool_release.c
new file mode 100644
index 00000000..a51dfeef
--- /dev/null
+++ b/core/hakmem_shared_pool_release.c
@@ -0,0 +1,179 @@
+#include "hakmem_shared_pool_internal.h"
+#include "hakmem_debug_master.h"
+#include "box/ss_slab_meta_box.h"
+#include "box/ss_hot_cold_box.h"
+
+#include <stdio.h>    /* fprintf/stderr (debug logging) */
+#include <stdlib.h>   /* getenv */
+#include <pthread.h>  /* pthread_mutex_lock/unlock */
+
+void
+shared_pool_release_slab(SuperSlab* ss, int slab_idx)
+{
+    // Phase 12: SP-SLOT Box - Slot-based Release
+    //
+    // Flow:
+    //   1. Validate inputs and check meta->used == 0
+    //   2. Find SharedSSMeta for this SuperSlab
+    //   3. Mark slot ACTIVE → EMPTY
+    //   4. Push to per-class free list (enables same-class reuse)
+    //   5. If all slots EMPTY → superslab_free() → LRU cache
+
+    if (!ss) {
+        return;
+    }
+    if (slab_idx < 0 || slab_idx >= SLABS_PER_SUPERSLAB_MAX) {
+        return;
+    }
+
+    // Debug logging
+#if !HAKMEM_BUILD_RELEASE
+    static int dbg = -1;
+    if (__builtin_expect(dbg == -1, 0)) {
+        const char* e = getenv("HAKMEM_SS_FREE_DEBUG");
+        dbg = (e && *e && *e != '0') ? 1 : 0;
+    }
+#else
+    static const int dbg = 0;
+#endif
+
+    // P0 instrumentation: count lock acquisitions
+    lock_stats_init();
+    if (g_lock_stats_enabled == 1) {
+        atomic_fetch_add(&g_lock_acquire_count, 1);
+        atomic_fetch_add(&g_lock_release_slab_count, 1);
+    }
+
+    pthread_mutex_lock(&g_shared_pool.alloc_lock);
+
+    TinySlabMeta* slab_meta = &ss->slabs[slab_idx];
+    if (slab_meta->used != 0) {
+        // Not actually empty; nothing to do
+        if (g_lock_stats_enabled == 1) {
+            atomic_fetch_add(&g_lock_release_count, 1);
+        }
+        pthread_mutex_unlock(&g_shared_pool.alloc_lock);
+        return;
+    }
+
+    uint8_t class_idx = slab_meta->class_idx;
+
+    #if !HAKMEM_BUILD_RELEASE
+    if (dbg == 1) {
+        fprintf(stderr, "[SP_SLOT_RELEASE] ss=%p slab_idx=%d class=%d used=0 (marking EMPTY)\n",
+                (void*)ss, slab_idx, class_idx);
+    }
+    #endif
+
+    // Find SharedSSMeta for this SuperSlab
+    SharedSSMeta* sp_meta = NULL;
+    uint32_t count = atomic_load_explicit(&g_shared_pool.ss_meta_count, memory_order_relaxed);
+    for (uint32_t i = 0; i < count; i++) {
+        // RACE FIX: Load pointer atomically
+        SuperSlab* meta_ss = atomic_load_explicit(&g_shared_pool.ss_metadata[i].ss, memory_order_relaxed);
+        if (meta_ss == ss) {
+            sp_meta = &g_shared_pool.ss_metadata[i];
+            break;
+        }
+    }
+
+    if (!sp_meta) {
+        // SuperSlab not in SP-SLOT system yet - create metadata
+        sp_meta = sp_meta_find_or_create(ss);
+        if (!sp_meta) {
+            pthread_mutex_unlock(&g_shared_pool.alloc_lock);
+            return;  // Failed to create metadata
+        }
+    }
+
+    // Mark slot as EMPTY (ACTIVE → EMPTY)
+    uint32_t slab_bit = (1u << slab_idx);
+    SlotState slot_state = atomic_load_explicit(
+        &sp_meta->slots[slab_idx].state,
+        memory_order_acquire);
+    if (slot_state != SLOT_ACTIVE && (ss->slab_bitmap & slab_bit)) {
+        // Legacy path import: rebuild slot states from SuperSlab bitmap/class_map
+        sp_meta_sync_slots_from_ss(sp_meta, ss);
+        slot_state = atomic_load_explicit(
+            &sp_meta->slots[slab_idx].state,
+            memory_order_acquire);
+    }
+
+    if (slot_state != SLOT_ACTIVE || sp_slot_mark_empty(sp_meta, slab_idx) != 0) {
+        if (g_lock_stats_enabled == 1) {
+
atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return; // Slot wasn't ACTIVE + } + + // Update SuperSlab metadata + uint32_t bit = (1u << slab_idx); + if (ss->slab_bitmap & bit) { + ss->slab_bitmap &= ~bit; + slab_meta->class_idx = 255; // UNASSIGNED + // P1.1: Mark class_map as UNASSIGNED when releasing slab + ss->class_map[slab_idx] = 255; + + if (ss->active_slabs > 0) { + ss->active_slabs--; + if (ss->active_slabs == 0 && g_shared_pool.active_count > 0) { + g_shared_pool.active_count--; + } + } + if (class_idx < TINY_NUM_CLASSES_SS && + g_shared_pool.class_active_slots[class_idx] > 0) { + g_shared_pool.class_active_slots[class_idx]--; + } + } + + // P0-4: Push to lock-free per-class free list (enables reuse by same class) + // Note: push BEFORE releasing mutex (slot state already updated under lock) + if (class_idx < TINY_NUM_CLASSES_SS) { + sp_freelist_push_lockfree(class_idx, sp_meta, slab_idx); + + #if !HAKMEM_BUILD_RELEASE + if (dbg == 1) { + fprintf(stderr, "[SP_SLOT_FREELIST_LOCKFREE] class=%d pushed slot (ss=%p slab=%d) active_slots=%u/%u\n", + class_idx, (void*)ss, slab_idx, + sp_meta->active_slots, sp_meta->total_slots); + } + #endif + } + + // Check if SuperSlab is now completely empty (all slots EMPTY or UNUSED) + if (sp_meta->active_slots == 0) { + #if !HAKMEM_BUILD_RELEASE + if (dbg == 1) { + fprintf(stderr, "[SP_SLOT_COMPLETELY_EMPTY] ss=%p active_slots=0 (calling superslab_free)\n", + (void*)ss); + } + #endif + + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + + // RACE FIX: Set meta->ss to NULL BEFORE unlocking mutex + // This prevents Stage 2 from accessing freed SuperSlab + atomic_store_explicit(&sp_meta->ss, NULL, memory_order_release); + + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + + // Remove from legacy backend list (if present) to prevent dangling pointers + extern void remove_superslab_from_legacy_head(SuperSlab* ss); + remove_superslab_from_legacy_head(ss); + + // Free SuperSlab: + // 1. Try LRU cache (hak_ss_lru_push) - lazy deallocation + // 2. Or munmap if LRU is full - eager deallocation + extern void superslab_free(SuperSlab* ss); + superslab_free(ss); + return; + } + + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); +} diff --git a/core/hakmem_super_registry.h b/core/hakmem_super_registry.h index 0ded3f0a..1b45a1e8 100644 --- a/core/hakmem_super_registry.h +++ b/core/hakmem_super_registry.h @@ -24,7 +24,7 @@ // Increased from 4096 to 32768 to avoid registry exhaustion under // high-churn microbenchmarks (e.g., larson with many active SuperSlabs). // Still a power of two for fast masking. -#define SUPER_REG_SIZE 262144 // Power of 2 for fast modulo (8x larger for workloads) +#define SUPER_REG_SIZE 1048576 // Power of 2 for fast modulo (1M entries) #define SUPER_REG_MASK (SUPER_REG_SIZE - 1) #define SUPER_MAX_PROBE 32 // Linear probing limit (increased from 8 for Phase 15 fix)
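
The patch defines the SP-SLOT acquire/release pair but does not show a caller. The sketch below illustrates only the intended calling pattern under the invariants documented above; `tiny_backend_refill_sketch`, `tiny_backend_retire_slab_sketch`, and `superslab_carve_block` are hypothetical names, while `shared_pool_acquire_slab` / `shared_pool_release_slab` use the signatures declared in this patch.

```c
/* Minimal caller sketch for the SP-SLOT pair (assumed helper names are
 * illustrative, not part of the patch). */
#include "hakmem_shared_pool.h"   /* shared_pool_acquire_slab / shared_pool_release_slab */

static void* tiny_backend_refill_sketch(int class_idx)
{
    SuperSlab* ss = NULL;
    int slab_idx = -1;

    /* 3-stage acquire: Stage 0.5/1 reuse EMPTY slots, Stage 2 claims UNUSED
     * slots lock-free, Stage 3 falls back to LRU pop or a fresh SuperSlab. */
    if (shared_pool_acquire_slab(class_idx, &ss, &slab_idx) != 0) {
        /* Soft cap reached or OOM: caller falls back to the legacy backend. */
        return NULL;
    }

    /* Invariant on success: ss != NULL and the slab's class matches class_idx. */
    return superslab_carve_block(ss, slab_idx);   /* hypothetical carve helper */
}

static void tiny_backend_retire_slab_sketch(SuperSlab* ss, int slab_idx)
{
    /* Call only once the slab's meta->used has dropped to 0; the release path
     * marks the slot EMPTY, pushes it to the per-class free list, and frees
     * the SuperSlab (LRU push or munmap) when every slot is EMPTY/UNUSED. */
    shared_pool_release_slab(ss, slab_idx);
}
```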