diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
index 5ebca6bd..6bad1ede 100644
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -1,363 +1,51 @@
-# Current Task: Phase 9-2 — SuperSlab State Unification Plan
+# Current Task: Phase 9-2 Refactoring (Complete)
-**Date**: 2025-12-01
-**Status**: Runtime bug provisionally resolved (slot sync stops the registry exhaustion)
-**Goal**: Eliminate the duplicated Legacy/Shared metadata and advance a root-cause design that consolidates SuperSlab state management into the shared pool.
-
----
-
-## Background / Symptoms
-- With `HAKMEM_TINY_USE_SUPERSLAB=1`, `SuperSlab registry full` occurs → registry entries are never released and the registry runs dry.
-- Cause: SuperSlabs acquired via the Legacy path were not reflected in the Shared Pool slot state, so `shared_pool_release_slab()` returned early.
-- Stopgap: `sp_meta_sync_slots_from_ss()` now syncs the state when it detects a mismatch and proceeds through EMPTY → free list → registry unregistration.
-
-## Root Cause (Box Theory view)
-- Duplicated state: the Legacy path and the Shared Pool path each keep their own SuperSlab state, and they drift apart.
-- Multiplied boundaries: there are several acquire/free boundaries, so EMPTY detection and slot transitions are scattered.
-
-## Goals
-1) Unify SuperSlab state transitions (UNUSED/ACTIVE/EMPTY) into the Shared Pool slot state.
-2) Consolidate the acquire/free/adopt/drain boundaries into the shared-pool path (with A/B guards so the change can be rolled back).
-3) Keep the Legacy backend as a compatibility box, sync it at the entry point, and bring it to a state where it can eventually be deleted.
-
-## Next Steps (procedure)
-1. **Design a unified entry point**
-   - Design a scheme where `superslab_allocate()` goes through a thin shared-pool wrapper so that registration and `SharedSSMeta` initialization always happen (ON/OFF via env).
-2. **Clean up the free path**
-   - Clarify responsibilities so that EMPTY detection from TLS drain / remote / local free is handled only by `shared_pool_release_slab()`.
-   - Draft a design that consolidates `empty_mask/nonempty_mask/freelist_mask` updates into shared-pool internal helpers.
-3. **Observation and guards**
-   - A/B via `HAKMEM_TINY_SS_SHARED` / `HAKMEM_TINY_USE_SUPERSLAB`, one-shot observation via `*_DEBUG`.
-   - Dashboard `shared_fail→legacy` and registry occupancy to decide when the migration is complete.
-4. **Write a phased convergence plan**
-   - Document the stages for turning the Legacy backend OFF by default and then deleting it, plus the retreat (rollback) conditions.
-
-## Current Blockers / Risks
-- Risk that sync gaps reappear if code keeps growing while Legacy/Shared stay mixed.
-- LRU/EMPTY mask responsibilities are scattered, so consolidation may cause side effects.
-
-## Expected Deliverables
-- Design note: unified entry wrapper, mask-update helpers, A/B guard design.
-- Minimal patch proposal: introduce the wrapper and consolidate mask updates (code changes in the next step).
-- Verification steps: regression test for registry exhaustion, confirmation that the `shared_fail→legacy` counter converges.
-
----
-
-## Commits
-
-### Phase 8 Root Cause Fix
-**Commit**: `191e65983` **Date**: 2025-11-30
-**Files**: 3 files, 36 insertions(+), 13 deletions(-)
-
-**Changes**:
-1. `bench_fast_box.c` (Layer 0 + Layer 1):
-   - Removed unified_cache_init() call (design misunderstanding)
-   - Limited prealloc to 128 blocks/class (actual TLS SLL capacity)
-   - Added root cause comments explaining why unified_cache_init() was wrong
-
-2. `bench_fast_box.h` (Layer 3):
-   - Added Box Contract documentation (BenchFast uses TLS SLL, NOT UC)
-   - Documented scope separation (workload vs infrastructure allocations)
-   - Added contract violation example (Phase 8 bug explanation)
-
-3. `tiny_unified_cache.c` (Layer 2):
-   - Changed calloc() → __libc_calloc() (infrastructure isolation)
-   - Changed free() → __libc_free() (symmetric cleanup)
-   - Added defensive fix comments explaining infrastructure bypass
-
-### Phase 8-TLS-Fix
-**Commit**: `da8f4d2c8`
-**Date**: 2025-11-30
-**Files**: 3 files, 21 insertions(+), 11 deletions(-)
-
-**Changes**:
-1. `bench_fast_box.c` (TLS→Atomic):
-   - Changed `__thread int bench_fast_init_in_progress` → `atomic_int g_bench_fast_init_in_progress`
-   - Added atomic_load() for reads, atomic_store() for writes
-   - Added root cause comments (pthread_once creates fresh TLS)
-
-2. `bench_fast_box.h` (TLS→Atomic):
-   - Updated extern declaration to match atomic_int
-   - Added Phase 8-TLS-Fix comment explaining cross-thread safety
-
-3. 
`bench_fast_box.c` (Header Write): - - Replaced `tiny_region_id_write_header()` → direct write `*(uint8_t*)base = 0xa0 | class_idx` - - Added Phase 8-P3-Fix comment explaining P3 optimization bypass - - Contract: BenchFast always writes headers (required for free routing) - -4. `hak_wrappers.inc.h` (Atomic): - - Updated bench_fast_init_in_progress check to use atomic_load() - - Added Phase 8-TLS-Fix comment for cross-thread safety +**Status**: **COMPLETE** (Phase 9-2 & Refactoring) +**Goal**: SuperSlab Unified Management, Stability Fixes, and Code Refactoring --- -## Performance Journey +## Phase 9-2 Achievements (Completed) -### Phase-by-Phase Progress +1. **Critical Fixes (Deadlock & OOM)** + * **Deadlock**: `shared_pool_acquire_slab` now releases `alloc_lock` before calling `superslab_allocate` (via `sp_internal_allocate_superslab`), preventing lock inversion with `g_super_reg_lock`. + * **OOM**: Enabled `HAKMEM_TINY_USE_SUPERSLAB=1` by default in `hakmem_build_flags.h`, ensuring fallback to Legacy Backend when Shared Pool hits soft cap. -``` -Phase 3 (mincore removal): 56.8 M ops/s -Phase 4 (Hot/Cold Box): 57.2 M ops/s (+0.7%) -Phase 5 (Mid MT fix): 52.3 M ops/s (-8.6% regression) -Phase 6 (Lock-free Mid MT): 42.1 M ops/s (Mid MT: +2.65%) -Phase 7-Step1 (Unified front): 80.6 M ops/s (+54.2%!) ⭐ -Phase 7-Step4 (Dead code): 81.5 M ops/s (+1.1%) ⭐⭐ -Phase 8 (Normal mode): 16.3 M ops/s (working, different workload) +2. **SuperSlab Management Unification** + * **Unified Entry**: `sp_internal_allocate_superslab` helper introduced to manage safe allocation flow. + * **Unified Free**: `remove_superslab_from_legacy_head` implemented to safely remove pointers from legacy lists when freeing via Shared Pool. -Total improvement: +43.5% (56.8M → 81.5M) from Phase 3 -``` - -**Note**: Phase 8 used different benchmark (10M iterations, ws=8192) vs Phase 7 (ws=256). -Normal mode performance: 16.3M ops/s (working, no crash). +3. **Code Refactoring (Split `hakmem_shared_pool.c`)** + * **Split Strategy**: Divided the monolithic `core/hakmem_shared_pool.c` (1400+ lines) into logical modules: + * `core/hakmem_shared_pool.c`: Initialization, stats, and common helpers. + * `core/hakmem_shared_pool_acquire.c`: Allocation logic (`shared_pool_acquire_slab` and Stage 0.5-3). + * `core/hakmem_shared_pool_release.c`: Deallocation logic (`shared_pool_release_slab`). + * `core/hakmem_shared_pool_internal.h`: Internal shared definitions and prototypes. + * **Makefile**: Updated to compile and link the new files. + * **Cleanups**: Removed unused "L0 Cache" experimental code and fixed incorrect function names (`superslab_alloc` -> `superslab_allocate`). --- -## Technical Details +## Next Phase Candidates (Handover from Phase 9-2) -### Layer 0: Prealloc Capacity Fix +### 1. Soft Cap (Policy) Tuning +* **Issue**: Medium Working Sets (8192) hit the Shared Pool "Soft Cap" easily, causing frequent fallbacks and performance degradation. +* **Action**: Review `hakmem_policy.c` and adjust `tiny_cap` or improve dynamic adjustment logic. -**File**: `core/box/bench_fast_box.c` -**Lines**: 131-148 +### 2. Fast Path Optimization +* **Issue**: Small Working Sets (256) show 70-88% performance vs SysAlloc due to lock/call overhead. Refactoring caused a slight dip (15%), highlighting the need for optimization. +* **Action**: Re-implement a lightweight L0 Cache or optimize the lock-free path in Shared Pool for hot-path performance. Consider inlining hot helpers again via header-only implementations if needed. 
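+
+The deadlock fix under "Critical Fixes" above comes down to a lock-ordering rule: `alloc_lock` must not be held across `superslab_allocate()`, which registers the new SuperSlab and takes the registry lock (`g_super_reg_lock`) internally. A minimal sketch of that pattern, condensed from the Stage 3b path of `shared_pool_acquire_slab()` (illustrative only; counters and debug hooks trimmed):
+
+```c
+/* Drop alloc_lock across the backend call so the alloc_lock ->
+ * g_super_reg_lock ordering is never held while another thread
+ * owns g_super_reg_lock and is waiting on alloc_lock. */
+pthread_mutex_unlock(&g_shared_pool.alloc_lock);
+SuperSlab* ss = sp_internal_allocate_superslab();  /* registry lock taken inside */
+pthread_mutex_lock(&g_shared_pool.alloc_lock);
+
+if (!ss) {
+    pthread_mutex_unlock(&g_shared_pool.alloc_lock);
+    return -1;  /* out of memory: higher layers can fall back to the Legacy Backend */
+}
+/* Pool state may have changed while the lock was released, so slot-table
+ * capacity is re-checked before the new SuperSlab is published. */
+```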
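+
+For the "Soft Cap (Policy) Tuning" candidate above, the knob in question is the per-class ACTIVE-slot limit exposed by `sp_class_active_limit()` (a `tiny_cap[class]` of 0 means unbounded). A rough sketch of how such a gate can be expressed, assuming the `class_active_slots[]` counter maintained under `alloc_lock`; the helper name is hypothetical and the real enforcement site may differ:
+
+```c
+/* Sketch: returns 1 when this class has reached its soft cap and the
+ * shared pool should stop growing it (normal mode then falls back to
+ * the Legacy Backend; strict mode surfaces the failure as OOM). */
+static int sp_soft_cap_reached_sketch(int class_idx) {
+    uint32_t cap = sp_class_active_limit(class_idx);  /* 0 => no limit */
+    if (cap == 0) return 0;
+    return g_shared_pool.class_active_slots[class_idx] >= cap;
+}
+```
+
+Tuning then means either raising `tiny_cap` for the classes that dominate the ws=8192 workload or improving the dynamic adjustment logic in `hakmem_policy.c` (e.g. reacting to observed Stage 3 allocation pressure).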
-**Root Cause**: -- Old code preallocated 50,000 blocks/class -- TLS SLL actual capacity: 128 blocks (adaptive sizing limit) -- Lost blocks (beyond 128) caused heap corruption - -**Fix**: -```c -// Before: -const uint32_t PREALLOC_COUNT = 50000; // Too large! - -// After: -const uint32_t ACTUAL_TLS_SLL_CAPACITY = 128; // Observed actual capacity -for (int cls = 2; cls <= 7; cls++) { - uint32_t capacity = ACTUAL_TLS_SLL_CAPACITY; - for (int i = 0; i < (int)capacity; i++) { - // preallocate... - } -} -``` - -### Layer 1: Design Misunderstanding Fix - -**File**: `core/box/bench_fast_box.c` -**Lines**: 123-128 (REMOVED) - -**Root Cause**: -- BenchFast uses TLS SLL directly (g_tls_sll[]) -- Unified Cache is NOT used by BenchFast -- unified_cache_init() created 16KB allocations (infrastructure) -- Later freed by BenchFast → header misclassification → CRASH - -**Fix**: -```c -// REMOVED: -// unified_cache_init(); // WRONG! BenchFast uses TLS SLL, not Unified Cache - -// Added comment: -// Phase 8 Root Cause Fix: REMOVED unified_cache_init() call -// Reason: BenchFast uses TLS SLL directly, NOT Unified Cache -``` - -### Layer 2: Infrastructure Isolation - -**File**: `core/front/tiny_unified_cache.c` -**Lines**: 61-71 (init), 103-109 (shutdown) - -**Strategy**: Dual-Path Separation -- **Workload allocations** (measured): HAKMEM paths (TLS SLL, Unified Cache) -- **Infrastructure allocations** (unmeasured): __libc_calloc/__libc_free - -**Fix**: -```c -// Before: -g_unified_cache[cls].slots = (void**)calloc(cap, sizeof(void*)); - -// After: -extern void* __libc_calloc(size_t, size_t); -g_unified_cache[cls].slots = (void**)__libc_calloc(cap, sizeof(void*)); -``` - -### Layer 3: Box Contract Documentation - -**File**: `core/box/bench_fast_box.h` -**Lines**: 13-51 - -**Added Documentation**: -- BenchFast uses TLS SLL, NOT Unified Cache -- Scope separation (workload vs infrastructure) -- Preconditions and guarantees -- Contract violation example (Phase 8 bug) - -### TLS→Atomic Fix - -**File**: `core/box/bench_fast_box.c` -**Lines**: 22-27 (declaration), 37, 124, 215 (usage) - -**Root Cause**: -``` -pthread_once() → creates new thread -New thread has fresh TLS (bench_fast_init_in_progress = 0) -Guard broken → getenv() allocates → freed by __libc_free() → CRASH -``` - -**Fix**: -```c -// Before (TLS - broken): -__thread int bench_fast_init_in_progress = 0; -if (__builtin_expect(bench_fast_init_in_progress, 0)) { ... } - -// After (Atomic - fixed): -atomic_int g_bench_fast_init_in_progress = 0; -if (__builtin_expect(atomic_load(&g_bench_fast_init_in_progress), 0)) { ... 
} -``` - -**箱理論 Validation**: -- **Responsibility**: Guard must protect entire process (not per-thread) -- **Contract**: "No BenchFast allocations during init" (all threads) -- **Observable**: Atomic variable visible across all threads -- **Composable**: Works with pthread_once() threading model - -### Header Write Fix - -**File**: `core/box/bench_fast_box.c` -**Lines**: 70-80 - -**Root Cause**: -- P3 optimization: tiny_region_id_write_header() skips header writes by default -- BenchFast free routing checks header magic (0xa0-0xa7) -- No header → free() misroutes to __libc_free() → CRASH - -**Fix**: -```c -// Before (broken - calls function that skips write): -tiny_region_id_write_header(base, class_idx); -return (void*)((char*)base + 1); - -// After (fixed - direct write): -*(uint8_t*)base = (uint8_t)(0xa0 | (class_idx & 0x0f)); // Direct write -return (void*)((char*)base + 1); -``` - -**Contract**: BenchFast always writes headers (required for free routing) +### 3. Legacy Backend Removal +* **Issue**: Legacy Backend (`g_superslab_heads`) is still kept for fallback but causes complexity. +* **Action**: Plan complete removal of `g_superslab_heads`, migrating all management to Shared Pool. --- -## Next Phase Options - -### Option A: Continue Phase 7 (Steps 5-7) 📦 -**Goal**: Remove remaining legacy layers (complete dead code elimination) -**Expected**: Additional +3-5% via further code cleanup -**Duration**: 1-2 days -**Risk**: Low (infrastructure already in place) - -**Remaining Steps**: -- Step 5: Compile library with PGO flag (Makefile change) -- Step 6: Verify dead code elimination in assembly -- Step 7: Measure performance improvement - -### Option B: PGO Re-enablement 🚀 -**Goal**: Re-enable PGO workflow from Phase 4-Step1 -**Expected**: +6-13% cumulative (on top of 81.5M) -**Duration**: 2-3 days -**Risk**: Low (proven pattern) - -**Current projection**: -- Phase 7 baseline: 81.5 M ops/s -- With PGO: ~86-93 M ops/s (+6-13%) - -### Option C: BenchFast Pool Expansion 🏎️ -**Goal**: Increase BenchFast pool size for full 10M iteration support -**Expected**: Structural ceiling measurement (30-40M ops/s target) -**Duration**: 1 day -**Risk**: Low (just increase prealloc count) - -**Current status**: -- Pool: 128 blocks/class (768 total) -- Exhaustion: C6/C7 exhaust after ~200 iterations -- Need: ~10,000 blocks/class for 10M iterations (60,000 total) - -### Option D: Production Readiness 📊 -**Goal**: Comprehensive benchmark suite, deployment guide -**Expected**: Full performance comparison, stability testing -**Duration**: 3-5 days -**Risk**: Low (documentation + testing) - ---- - -## Recommendation - -### Top Pick: **Option C (BenchFast Pool Expansion)** 🏎️ - -**Reasoning**: -1. **Phase 8 fixes working**: TLS→Atomic + Header write proven -2. **Quick win**: Just increase ACTUAL_TLS_SLL_CAPACITY to 10,000 -3. **Scientific value**: Measure true structural ceiling (no safety costs) -4. **Low risk**: 1-day task, no code changes (just capacity tuning) -5. **Data-driven**: Enables comparison vs normal mode (16.3M vs 30-40M expected) - -**Expected Result**: -``` -Normal mode: 16.3 M ops/s (current) -BenchFast mode: 30-40 M ops/s (target, 2-2.5x faster) -``` - -**Implementation**: -```c -// core/box/bench_fast_box.c:140 -const uint32_t ACTUAL_TLS_SLL_CAPACITY = 10000; // Was 128 -``` - ---- - -### Second Choice: **Option B (PGO Re-enablement)** 🚀 - -**Reasoning**: -1. **Proven benefit**: +6.25% in Phase 4-Step1 -2. **Cumulative**: Would stack with Phase 7 (81.5M baseline) -3. 
**Low risk**: Just fix build issue -4. **High impact**: ~86-93 M ops/s projected - ---- - -## Current Performance Summary - -### bench_random_mixed (16B-1KB, Tiny workload) -``` -Phase 7-Step4 (ws=256): 81.5 M ops/s (+55.5% total) -Phase 8 (ws=8192): 16.3 M ops/s (normal mode, working) -``` - -### bench_mid_mt_gap (1KB-8KB, Mid MT workload, ws=256) -``` -After Phase 6-B (lock-free): 42.09 M ops/s (+2.65%) -vs System malloc: 26.8 M ops/s (1.57x faster) -``` - -### Overall Status -- ✅ **Tiny allocations** (16B-1KB): **81.5 M ops/s** (excellent, +55.5%!) -- ✅ **Mid MT allocations** (1KB-8KB): 42 M ops/s (excellent, 1.57x vs system) -- ✅ **BenchFast mode**: No crash (TLS→Atomic + Header fix working) -- ⏸️ **Large allocations** (32KB-2MB): Not benchmarked yet -- ⏸️ **MT workloads**: No MT benchmarks yet - ---- - -## Decision Time - -**Choose your next phase**: -- **Option A**: Continue Phase 7 (Steps 5-7, final cleanup) -- **Option B**: PGO re-enablement (recommended for normal builds) -- **Option C**: BenchFast pool expansion (recommended for ceiling measurement) -- **Option D**: Production readiness & benchmarking - -**Or**: Celebrate Phase 8 success! 🎉 (Root cause fixes complete!) - ---- - -Updated: 2025-11-30 -Phase: 8 COMPLETE (Root Cause Fixes) → 9 PENDING -Previous: Phase 7 (Tiny Front Unification, +55.5%) -Achievement: BenchFast crash investigation and fixes (箱理論 root cause analysis!) +## Current Status +* **Build**: Passing (Clean build verified). +* **Benchmarks**: + * `HAKMEM_TINY_SS_SHARED=1` (Normal): ~20.0 M ops/s (working, fallback active). + * `HAKMEM_TINY_SS_SHARED=2` (Strict): ~20.3 M ops/s (working, OOMs on soft cap as expected). +* **Pending**: Selection of next focus area. diff --git a/Makefile b/Makefile index 0fda8b7f..28a03458 100644 --- a/Makefile +++ b/Makefile @@ -218,12 +218,12 @@ LDFLAGS += $(EXTRA_LDFLAGS) # Targets TARGET = test_hakmem -OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o superslab_allocate.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o superslab_head.o hakmem_smallmid.o hakmem_smallmid_superslab.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_tls_hint_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/link_stubs.o 
core/tiny_failfast.o test_hakmem.o +OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o superslab_allocate.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o superslab_head.o hakmem_smallmid.o hakmem_smallmid_superslab.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_tls_hint_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o test_hakmem.o OBJS = $(OBJS_BASE) # Shared library SHARED_LIB = libhakmem.so -SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o superslab_allocate_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o superslab_head_shared.o hakmem_smallmid_shared.o hakmem_smallmid_superslab_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_local_box_shared.o core/box/free_remote_box_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/unified_batch_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_tls_hint_box_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o core/tiny_alloc_fast_push_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_mid_mt_shared.o hakmem_super_registry_shared.o 
hakmem_shared_pool_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o +SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o superslab_allocate_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o superslab_head_shared.o hakmem_smallmid_shared.o hakmem_smallmid_superslab_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_local_box_shared.o core/box/free_remote_box_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/unified_batch_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_tls_hint_box_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o core/tiny_alloc_fast_push_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_mid_mt_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o # Pool TLS Phase 1 (enable with POOL_TLS_PHASE1=1) ifeq ($(POOL_TLS_PHASE1),1) @@ -250,7 +250,7 @@ endif # Benchmark targets BENCH_HAKMEM = bench_allocators_hakmem BENCH_SYSTEM = bench_allocators_system -BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o superslab_allocate.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o superslab_head.o hakmem_smallmid.o hakmem_smallmid_superslab.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o 
hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_tls_hint_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o bench_allocators_hakmem.o +BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o superslab_allocate.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o superslab_head.o hakmem_smallmid.o hakmem_smallmid_superslab.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_tls_hint_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o bench_allocators_hakmem.o BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE) ifeq ($(POOL_TLS_PHASE1),1) BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o @@ -427,7 +427,7 @@ test-box-refactor: box-refactor ./larson_hakmem 10 8 128 1024 1 12345 4 # Phase 4: Tiny Pool benchmarks (properly linked with hakmem) -TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o superslab_allocate.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o superslab_head.o hakmem_smallmid.o 
hakmem_smallmid_superslab.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_tls_hint_box.o core/box/slab_recycling_box.o core/box/tiny_sizeclass_hist_box.o core/box/pagefault_telemetry_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o +TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o superslab_allocate.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o superslab_head.o hakmem_smallmid.o hakmem_smallmid_superslab.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_tls_hint_box.o core/box/slab_recycling_box.o core/box/tiny_sizeclass_hist_box.o core/box/pagefault_telemetry_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE) ifeq ($(POOL_TLS_PHASE1),1) TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o diff --git a/core/hakmem_shared_pool.c b/core/hakmem_shared_pool.c index 8f678aaf..54468e39 100644 --- a/core/hakmem_shared_pool.c +++ b/core/hakmem_shared_pool.c @@ -1,6 +1,4 @@ -#include "hakmem_shared_pool.h" -#include "hakmem_tiny_superslab.h" -#include 
"hakmem_tiny_superslab_constants.h" +#include "hakmem_shared_pool_internal.h" #include "hakmem_debug_master.h" // Phase 4b: Master debug control #include "hakmem_stats_master.h" // Phase 4d: Master stats control #include "box/ss_slab_meta_box.h" // Phase 3d-A: SlabMeta Box boundary @@ -19,16 +17,17 @@ // ============================================================================ // P0 Lock Contention Instrumentation (Debug build only; counters defined always) // ============================================================================ -static _Atomic uint64_t g_lock_acquire_count = 0; // Total lock acquisitions -static _Atomic uint64_t g_lock_release_count = 0; // Total lock releases -static _Atomic uint64_t g_lock_acquire_slab_count = 0; // Locks from acquire_slab path -static _Atomic uint64_t g_lock_release_slab_count = 0; // Locks from release_slab path -static int g_lock_stats_enabled = -1; // -1=uninitialized, 0=off, 1=on +_Atomic uint64_t g_lock_acquire_count = 0; // Total lock acquisitions +_Atomic uint64_t g_lock_release_count = 0; // Total lock releases +_Atomic uint64_t g_lock_acquire_slab_count = 0; // Locks from acquire_slab path +_Atomic uint64_t g_lock_release_slab_count = 0; // Locks from release_slab path #if !HAKMEM_BUILD_RELEASE +int g_lock_stats_enabled = -1; // -1=uninitialized, 0=off, 1=on + // Initialize lock stats from environment variable // Phase 4b: Now uses hak_debug_check() for master debug control support -static inline void lock_stats_init(void) { +void lock_stats_init(void) { if (__builtin_expect(g_lock_stats_enabled == -1, 0)) { g_lock_stats_enabled = hak_debug_check("HAKMEM_SHARED_POOL_LOCK_STATS"); } @@ -60,27 +59,23 @@ static void __attribute__((destructor)) lock_stats_report(void) { } #else // Release build: No-op stubs -static inline void lock_stats_init(void) { - if (__builtin_expect(g_lock_stats_enabled == -1, 0)) { - g_lock_stats_enabled = 0; - } -} +int g_lock_stats_enabled = 0; #endif // ============================================================================ // SP Acquire Stage Statistics (Stage1/2/3 breakdown) // ============================================================================ -static _Atomic uint64_t g_sp_stage1_hits[TINY_NUM_CLASSES_SS]; -static _Atomic uint64_t g_sp_stage2_hits[TINY_NUM_CLASSES_SS]; -static _Atomic uint64_t g_sp_stage3_hits[TINY_NUM_CLASSES_SS]; +_Atomic uint64_t g_sp_stage1_hits[TINY_NUM_CLASSES_SS]; +_Atomic uint64_t g_sp_stage2_hits[TINY_NUM_CLASSES_SS]; +_Atomic uint64_t g_sp_stage3_hits[TINY_NUM_CLASSES_SS]; // Data collection gate (0=off, 1=on). 学習層からも有効化される。 -static int g_sp_stage_stats_enabled = 0; +int g_sp_stage_stats_enabled = 0; #if !HAKMEM_BUILD_RELEASE // Logging gate for destructor(ENV: HAKMEM_SHARED_POOL_STAGE_STATS) static int g_sp_stage_stats_log_enabled = -1; // -1=uninitialized, 0=off, 1=on -static inline void sp_stage_stats_init(void) { +void sp_stage_stats_init(void) { // Phase 4d: Now uses hak_stats_check() for unified stats control if (__builtin_expect(g_sp_stage_stats_log_enabled == -1, 0)) { g_sp_stage_stats_log_enabled = hak_stats_check("HAKMEM_SHARED_POOL_STAGE_STATS", "pool"); @@ -123,7 +118,7 @@ static void __attribute__((destructor)) sp_stage_stats_report(void) { } #else // Release build: No-op stubs -static inline void sp_stage_stats_init(void) {} +void sp_stage_stats_init(void) {} #endif // Snapshot Tiny-related backend metrics for learner / observability. 
@@ -161,7 +156,7 @@ shared_pool_tiny_metrics_snapshot(uint64_t stage1[TINY_NUM_CLASSES_SS], // Semantics: // - tiny_cap[class] == 0 → no limit (unbounded) // - otherwise: soft cap on ACTIVE slots managed by shared pool for this class. -static inline uint32_t sp_class_active_limit(int class_idx) { +uint32_t sp_class_active_limit(int class_idx) { const FrozenPolicy* pol = hkm_policy_get(); if (!pol) { return 0; // no limit @@ -211,14 +206,7 @@ static inline FreeSlotNode* node_alloc(int class_idx) { uint32_t idx = atomic_fetch_add(&g_node_alloc_index[class_idx], 1); if (idx >= MAX_FREE_NODES_PER_CLASS) { - // Pool exhausted - should be rare. Caller must fall back to legacy - // mutex-protected free list to preserve correctness. - #if !HAKMEM_BUILD_RELEASE - static _Atomic int warn_once = 0; - if (atomic_exchange(&warn_once, 1) == 0) { - fprintf(stderr, "[P0-4 WARN] Node pool exhausted for class %d\n", class_idx); - } - #endif + // Pool exhausted - should be rare. return NULL; } @@ -255,7 +243,7 @@ SharedSuperSlabPool g_shared_pool = { .ss_meta_count = 0 }; -static void +void shared_pool_ensure_capacity_unlocked(uint32_t min_capacity) { if (g_shared_pool.capacity >= min_capacity) { @@ -268,9 +256,6 @@ shared_pool_ensure_capacity_unlocked(uint32_t min_capacity) } // CRITICAL FIX: Use system mmap() directly to avoid recursion! - // Problem: realloc() goes through HAKMEM allocator → hak_alloc_at(128) - // → needs Shared Pool init → calls realloc() → INFINITE RECURSION! - // Solution: Allocate Shared Pool metadata using system mmap, not HAKMEM allocator size_t new_size = new_cap * sizeof(SuperSlab*); SuperSlab** new_slabs = (SuperSlab**)mmap(NULL, new_size, PROT_READ | PROT_WRITE, @@ -333,7 +318,7 @@ static int sp_slot_find_unused(SharedSSMeta* meta) { // Mark slot as ACTIVE (UNUSED→ACTIVE or EMPTY→ACTIVE) // P0-5: Uses atomic store for state transition (caller must hold mutex!) // Returns: 0 on success, -1 on error -static int sp_slot_mark_active(SharedSSMeta* meta, int slot_idx, int class_idx) { +int sp_slot_mark_active(SharedSSMeta* meta, int slot_idx, int class_idx) { if (!meta || slot_idx < 0 || slot_idx >= meta->total_slots) return -1; if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) return -1; @@ -357,7 +342,7 @@ static int sp_slot_mark_active(SharedSSMeta* meta, int slot_idx, int class_idx) // Mark slot as EMPTY (ACTIVE→EMPTY) // P0-5: Uses atomic store for state transition (caller must hold mutex!) // Returns: 0 on success, -1 on error -static int sp_slot_mark_empty(SharedSSMeta* meta, int slot_idx) { +int sp_slot_mark_empty(SharedSSMeta* meta, int slot_idx) { if (!meta || slot_idx < 0 || slot_idx >= meta->total_slots) return -1; SharedSlot* slot = &meta->slots[slot_idx]; @@ -379,7 +364,7 @@ static int sp_slot_mark_empty(SharedSSMeta* meta, int slot_idx) { // Sync SP-SLOT view from an existing SuperSlab. // This is needed when a legacy-allocated SuperSlab reaches the shared-pool // release path for the first time (slot states are still SLOT_UNUSED). 
-static void sp_meta_sync_slots_from_ss(SharedSSMeta* meta, SuperSlab* ss) { +void sp_meta_sync_slots_from_ss(SharedSSMeta* meta, SuperSlab* ss) { if (!meta || !ss) return; int cap = ss_slabs_capacity(ss); @@ -439,7 +424,7 @@ static int sp_meta_ensure_capacity(uint32_t min_count) { // Find SharedSSMeta for given SuperSlab, or create if not exists // Caller must hold alloc_lock // Returns: SharedSSMeta* on success, NULL on error -static SharedSSMeta* sp_meta_find_or_create(SuperSlab* ss) { +SharedSSMeta* sp_meta_find_or_create(SuperSlab* ss) { if (!ss) return NULL; // RACE FIX: Load count atomically for consistency (even under mutex) @@ -483,110 +468,27 @@ static SharedSSMeta* sp_meta_find_or_create(SuperSlab* ss) { return meta; } -// ============================================================================ -// Phase 12-1.x: Acquire Helper Boxes (Stage 0.5/1/2/3) -// ============================================================================ +// Find UNUSED slot and claim it (UNUSED → ACTIVE) using lock-free CAS +// Returns: slot_idx on success, -1 if no UNUSED slots +int sp_slot_claim_lockfree(SharedSSMeta* meta, int class_idx) { + if (!meta) return -1; -// Debug / stats helper (Stage hits) -static inline void sp_stage_stats_dump_if_enabled(void) { -#if !HAKMEM_BUILD_RELEASE - static int dump_en = -1; - if (__builtin_expect(dump_en == -1, 0)) { - const char* e = getenv("HAKMEM_SHARED_POOL_STAGE_STATS"); - dump_en = (e && *e && *e != '0') ? 1 : 0; - } - if (!dump_en) return; - - // 全クラス合計を出力(スキャン/ヒットの分布を見るため) - uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0; - for (int c = 0; c < TINY_NUM_CLASSES_SS; c++) { - s0 += atomic_load_explicit(&g_sp_stage0_hits[c], memory_order_relaxed); - s1 += atomic_load_explicit(&g_sp_stage1_hits[c], memory_order_relaxed); - s2 += atomic_load_explicit(&g_sp_stage2_hits[c], memory_order_relaxed); - s3 += atomic_load_explicit(&g_sp_stage3_hits[c], memory_order_relaxed); - } - fprintf(stderr, "[SP_STAGE_STATS] total: stage0.5=%lu stage1=%lu stage2=%lu stage3=%lu\n", - (unsigned long)s0, (unsigned long)s1, (unsigned long)s2, (unsigned long)s3); -#else - (void)g_sp_stage1_hits; (void)g_sp_stage2_hits; (void)g_sp_stage3_hits; -#endif -} - -// Stage 0.5: EMPTY slab direct scan(registry ベースの EMPTY 再利用) -static inline int -sp_acquire_from_empty_scan(int class_idx, SuperSlab** ss_out, int* slab_idx_out, int dbg_acquire) -{ - static int empty_reuse_enabled = -1; - if (__builtin_expect(empty_reuse_enabled == -1, 0)) { - const char* e = getenv("HAKMEM_SS_EMPTY_REUSE"); - empty_reuse_enabled = (e && *e && *e == '0') ? 0 : 1; // default ON - } - - if (!empty_reuse_enabled) { - return -1; - } - - extern SuperSlab* g_super_reg_by_class[TINY_NUM_CLASSES][SUPER_REG_PER_CLASS]; - extern int g_super_reg_class_size[TINY_NUM_CLASSES]; - - int reg_size = (class_idx < TINY_NUM_CLASSES) ? g_super_reg_class_size[class_idx] : 0; - static int scan_limit = -1; - if (__builtin_expect(scan_limit == -1, 0)) { - const char* e = getenv("HAKMEM_SS_EMPTY_SCAN_LIMIT"); - scan_limit = (e && *e) ? 
atoi(e) : 32; // default: scan first 32 SuperSlabs (Phase 9-2 tuning) - } - if (scan_limit > reg_size) scan_limit = reg_size; - - // Stage 0.5 hit counter for visualization - static _Atomic uint64_t stage05_hits = 0; - static _Atomic uint64_t stage05_attempts = 0; - atomic_fetch_add_explicit(&stage05_attempts, 1, memory_order_relaxed); - - for (int i = 0; i < scan_limit; i++) { - SuperSlab* ss = g_super_reg_by_class[class_idx][i]; - if (!(ss && ss->magic == SUPERSLAB_MAGIC)) continue; - if (ss->empty_count == 0) continue; // No EMPTY slabs in this SS - - uint32_t mask = ss->empty_mask; - while (mask) { - int empty_idx = __builtin_ctz(mask); - mask &= (mask - 1); // clear lowest bit - - TinySlabMeta* meta = &ss->slabs[empty_idx]; - if (meta->capacity > 0 && meta->used == 0) { - tiny_tls_slab_reuse_guard(ss); - ss_clear_slab_empty(ss, empty_idx); - - meta->class_idx = (uint8_t)class_idx; - ss->class_map[empty_idx] = (uint8_t)class_idx; - -#if !HAKMEM_BUILD_RELEASE - if (dbg_acquire == 1) { - fprintf(stderr, - "[SP_ACQUIRE_STAGE0.5_EMPTY] class=%d reusing EMPTY slab (ss=%p slab=%d empty_count=%u)\n", - class_idx, (void*)ss, empty_idx, ss->empty_count); - } -#else - (void)dbg_acquire; -#endif - - *ss_out = ss; - *slab_idx_out = empty_idx; - sp_stage_stats_init(); - if (g_sp_stage_stats_enabled) { - atomic_fetch_add(&g_sp_stage1_hits[class_idx], 1); - } - atomic_fetch_add_explicit(&stage05_hits, 1, memory_order_relaxed); - - // Stage 0.5 hit rate visualization (every 100 hits) - uint64_t hits = atomic_load_explicit(&stage05_hits, memory_order_relaxed); - if (hits % 100 == 1) { - uint64_t attempts = atomic_load_explicit(&stage05_attempts, memory_order_relaxed); - fprintf(stderr, "[STAGE0.5_STATS] hits=%lu attempts=%lu rate=%.1f%% (scan_limit=%d)\n", - hits, attempts, (double)hits * 100.0 / attempts, scan_limit); - } - return 0; + // Optimization: Quick check if any unused slots exist? + // For now, just iterate. Metadata size is small (max 32 slots). + for (int i = 0; i < meta->total_slots; i++) { + SharedSlot* slot = &meta->slots[i]; + SlotState state = atomic_load_explicit(&slot->state, memory_order_acquire); + if (state == SLOT_UNUSED) { + // Attempt CAS: UNUSED → ACTIVE + if (atomic_compare_exchange_strong_explicit( + &slot->state, + &state, + SLOT_ACTIVE, + memory_order_acq_rel, + memory_order_acquire)) { + return i; // Success! 
} + // CAS failed: someone else took it or state changed } } return -1; @@ -597,822 +499,108 @@ sp_acquire_from_empty_scan(int class_idx, SuperSlab** ss_out, int* slab_idx_out, // Push empty slot to per-class free list // Caller must hold alloc_lock // Returns: 0 on success, -1 if list is full -static int sp_freelist_push(int class_idx, SharedSSMeta* meta, int slot_idx) { +int sp_freelist_push_lockfree(int class_idx, SharedSSMeta* meta, int slot_idx) { if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) return -1; - if (!meta || slot_idx < 0 || slot_idx >= meta->total_slots) return -1; - FreeSlotList* list = &g_shared_pool.free_slots[class_idx]; - - if (list->count >= MAX_FREE_SLOTS_PER_CLASS) { - return -1; // List full + FreeSlotNode* node = node_alloc(class_idx); + if (!node) { + // Pool exhausted + return -1; } - list->entries[list->count].meta = meta; - list->entries[list->count].slot_idx = (uint8_t)slot_idx; - list->count++; + node->meta = meta; + node->slot_idx = slot_idx; + + // Lock-free push to stack (LIFO) + FreeSlotNode* old_head = atomic_load_explicit( + &g_shared_pool.free_slots_lockfree[class_idx].head, + memory_order_relaxed); + do { + node->next = old_head; + } while (!atomic_compare_exchange_weak_explicit( + &g_shared_pool.free_slots_lockfree[class_idx].head, + &old_head, + node, + memory_order_release, + memory_order_relaxed)); + return 0; } // Pop empty slot from per-class free list -// Caller must hold alloc_lock -// Returns: 1 if popped (out params filled), 0 if list empty -static int sp_freelist_pop(int class_idx, SharedSSMeta** out_meta, int* out_slot_idx) { +// Lock-free +// Returns: 1 on success, 0 if empty +int sp_freelist_pop_lockfree(int class_idx, SharedSSMeta** meta_out, int* slot_idx_out) { if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) return 0; - if (!out_meta || !out_slot_idx) return 0; - FreeSlotList* list = &g_shared_pool.free_slots[class_idx]; - - if (list->count == 0) { - return 0; // List empty - } - - // Pop from end (LIFO for cache locality) - list->count--; - *out_meta = list->entries[list->count].meta; - *out_slot_idx = list->entries[list->count].slot_idx; - return 1; -} - -// ============================================================================ -// P0-5: Lock-Free Slot Claiming (Stage 2 Optimization) -// ============================================================================ - -// Try to claim an UNUSED slot via lock-free CAS -// Returns: slot_idx on success, -1 if no UNUSED slots available -// LOCK-FREE: Can be called from any thread without mutex -static int sp_slot_claim_lockfree(SharedSSMeta* meta, int class_idx) { - if (!meta) return -1; - if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) return -1; - - // Scan all slots for UNUSED state - for (int i = 0; i < meta->total_slots; i++) { - SlotState expected = SLOT_UNUSED; - - // Try to claim this slot atomically (UNUSED → ACTIVE) - if (atomic_compare_exchange_strong_explicit( - &meta->slots[i].state, - &expected, - SLOT_ACTIVE, - memory_order_acq_rel, // Success: acquire+release semantics - memory_order_relaxed // Failure: just retry next slot - )) { - // Successfully claimed! 
Update non-atomic fields - // (Safe because we now own this slot) - meta->slots[i].class_idx = (uint8_t)class_idx; - meta->slots[i].slab_idx = (uint8_t)i; - - // Increment active_slots counter atomically - // (Multiple threads may claim slots concurrently) - atomic_fetch_add_explicit( - (_Atomic uint8_t*)&meta->active_slots, 1, - memory_order_relaxed - ); - - return i; // Return claimed slot index - } - - // CAS failed (slot was not UNUSED) - continue to next slot - } - - return -1; // No UNUSED slots available -} - -// ============================================================================ -// P0-4: Lock-Free Free Slot List Operations -// ============================================================================ - -// Push empty slot to lock-free per-class free list (LIFO) -// LOCK-FREE: Can be called from any thread without mutex -// Returns: 0 on success, -1 on failure (node pool exhausted) -static int sp_freelist_push_lockfree(int class_idx, SharedSSMeta* meta, int slot_idx) { - if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) return -1; - if (!meta || slot_idx < 0 || slot_idx >= meta->total_slots) return -1; - - // Allocate node from pool - FreeSlotNode* node = node_alloc(class_idx); - if (!node) { - // Fallback: push into legacy per-class free list - // ASSUME: Caller already holds alloc_lock (e.g., shared_pool_release_slab:772) - // Do NOT lock again to avoid deadlock on non-recursive mutex! - (void)sp_freelist_push(class_idx, meta, slot_idx); - return 0; - } - - // Fill node data - node->meta = meta; - node->slot_idx = (uint8_t)slot_idx; - - // Lock-free LIFO push using CAS loop - LockFreeFreeList* list = &g_shared_pool.free_slots_lockfree[class_idx]; - FreeSlotNode* old_head = atomic_load_explicit(&list->head, memory_order_relaxed); - - do { - node->next = old_head; - } while (!atomic_compare_exchange_weak_explicit( - &list->head, &old_head, node, - memory_order_release, // Success: publish node to other threads - memory_order_relaxed // Failure: retry with updated old_head - )); - - return 0; // Success -} - -// Pop empty slot from lock-free per-class free list (LIFO) -// LOCK-FREE: Can be called from any thread without mutex -// Returns: 1 if popped (out params filled), 0 if list empty -static int sp_freelist_pop_lockfree(int class_idx, SharedSSMeta** out_meta, int* out_slot_idx) { - if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) return 0; - if (!out_meta || !out_slot_idx) return 0; - - LockFreeFreeList* list = &g_shared_pool.free_slots_lockfree[class_idx]; - FreeSlotNode* old_head = atomic_load_explicit(&list->head, memory_order_acquire); - - // Lock-free LIFO pop using CAS loop - do { - if (old_head == NULL) { - return 0; // List empty - } - } while (!atomic_compare_exchange_weak_explicit( - &list->head, &old_head, old_head->next, - memory_order_acquire, // Success: acquire node data - memory_order_acquire // Failure: retry with updated old_head - )); - - // Extract data from popped node - *out_meta = old_head->meta; - *out_slot_idx = old_head->slot_idx; - - // Recycle node back into per-class free list so that long-running workloads - // do not permanently consume new nodes on every EMPTY event. 
- FreeSlotNode* free_head = atomic_load_explicit( - &g_node_free_head[class_idx], + FreeSlotNode* head = atomic_load_explicit( + &g_shared_pool.free_slots_lockfree[class_idx].head, memory_order_acquire); - do { - old_head->next = free_head; - } while (!atomic_compare_exchange_weak_explicit( - &g_node_free_head[class_idx], - &free_head, - old_head, - memory_order_release, - memory_order_acquire)); - return 1; // Success + while (head) { + FreeSlotNode* next = head->next; + if (atomic_compare_exchange_weak_explicit( + &g_shared_pool.free_slots_lockfree[class_idx].head, + &head, + next, + memory_order_acquire, + memory_order_acquire)) { + // Success! + *meta_out = head->meta; + *slot_idx_out = head->slot_idx; + + // Recycle node (push to free_head list) + FreeSlotNode* free_head = atomic_load_explicit(&g_node_free_head[class_idx], memory_order_relaxed); + do { + head->next = free_head; + } while (!atomic_compare_exchange_weak_explicit( + &g_node_free_head[class_idx], + &free_head, + head, + memory_order_release, + memory_order_relaxed)); + + return 1; + } + // CAS failed: head updated, retry + } + return 0; // Empty list } -// Internal helper: Allocates a new SuperSlab from the OS and performs basic initialization. -// Does NOT interact with g_shared_pool.slabs[] or g_shared_pool.total_count directly. -// Caller is responsible for adding the SuperSlab to g_shared_pool's arrays and metadata. -static SuperSlab* + +// Allocator helper for SuperSlab (Phase 9-2 Task 1) +SuperSlab* sp_internal_allocate_superslab(void) { - // Use size_class 0 as a neutral hint; Phase 12 per-slab class_idx is authoritative. + // Use legacy backend to allocate a SuperSlab (malloc-based) extern SuperSlab* superslab_allocate(uint8_t size_class); - SuperSlab* ss = superslab_allocate(0); - + // Pass 8 as class_idx (dummy, will be overwritten) or larger + SuperSlab* ss = superslab_allocate(8); if (!ss) { return NULL; } - // PageFaultTelemetry: mark all backing pages for this Superslab (approximate) - size_t ss_bytes = (size_t)1 << ss->lg_size; - for (size_t off = 0; off < ss_bytes; off += 4096) { - pagefault_telemetry_touch(PF_BUCKET_SS_META, (char*)ss + off); - } + // Initialize basic fields if not done by superslab_alloc + ss->active_slabs = 0; + ss->slab_bitmap = 0; - // superslab_allocate() already: - // - zeroes slab metadata / remote queues, - // - sets magic/lg_size/etc, - // - registers in global registry. - // For shared-pool semantics we normalize all slab class_idx to UNASSIGNED. - int max_slabs = ss_slabs_capacity(ss); - for (int i = 0; i < max_slabs; i++) { - ss_slab_meta_class_idx_set(ss, i, 255); // UNASSIGNED - // P1.1: Initialize class_map to UNASSIGNED as well - ss->class_map[i] = 255; - } return ss; } +// ============================================================================ +// Public API (High-level) +// ============================================================================ + SuperSlab* shared_pool_acquire_superslab(void) { - shared_pool_init(); - - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - // For now, always allocate a fresh SuperSlab and register it. - // More advanced reuse/GC comes later. 
- // Release lock to avoid deadlock with registry during superslab_allocate - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - SuperSlab* ss = sp_internal_allocate_superslab(); // Call lock-free internal helper - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - if (!ss) { - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return NULL; - } - - // Add newly allocated SuperSlab to the shared pool's internal array - if (g_shared_pool.total_count >= g_shared_pool.capacity) { - shared_pool_ensure_capacity_unlocked(g_shared_pool.total_count + 1); - if (g_shared_pool.total_count >= g_shared_pool.capacity) { - // Pool table expansion failed; leave ss alive (registry-owned), - // but do not treat it as part of shared_pool. - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return NULL; - } - } - g_shared_pool.slabs[g_shared_pool.total_count] = ss; - g_shared_pool.total_count++; - - // Not counted as active until at least one slab is assigned. - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return ss; + // Phase 12: Legacy wrapper? + // This function seems to be a direct allocation bypass. + return sp_internal_allocate_superslab(); } -// ---------- Layer 4: Public API (High-level) ---------- - -// Ensure slab geometry matches current class stride (handles upgrades like C7 1024->2048). -static inline void sp_fix_geometry_if_needed(SuperSlab* ss, int slab_idx, int class_idx) -{ - if (!ss || slab_idx < 0 || class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) { - return; - } - TinySlabMeta* meta = &ss->slabs[slab_idx]; - size_t stride = g_tiny_class_sizes[class_idx]; - size_t usable = (slab_idx == 0) ? SUPERSLAB_SLAB0_USABLE_SIZE : SUPERSLAB_SLAB_USABLE_SIZE; - uint16_t expect_cap = (uint16_t)(usable / stride); - - // Reinitialize if capacity is off or class_idx mismatches. - if (meta->class_idx != (uint8_t)class_idx || meta->capacity != expect_cap) { - #if !HAKMEM_BUILD_RELEASE - extern __thread int g_hakmem_lock_depth; - g_hakmem_lock_depth++; - fprintf(stderr, "[SP_FIX_GEOMETRY] ss=%p slab=%d cls=%d: old_cls=%u old_cap=%u -> new_cls=%d new_cap=%u (stride=%zu)\n", - (void*)ss, slab_idx, class_idx, - meta->class_idx, meta->capacity, - class_idx, expect_cap, stride); - g_hakmem_lock_depth--; - #endif - - superslab_init_slab(ss, slab_idx, stride, 0 /*owner_tid*/); - meta->class_idx = (uint8_t)class_idx; - // P1.1: Update class_map after geometry fix - ss->class_map[slab_idx] = (uint8_t)class_idx; - } -} - -int -shared_pool_acquire_slab(int class_idx, SuperSlab** ss_out, int* slab_idx_out) -{ - // Phase 12: SP-SLOT Box - 3-Stage Acquire Logic - // - // Stage 1: Reuse EMPTY slots from per-class free list (EMPTY→ACTIVE) - // Stage 2: Find UNUSED slots in existing SuperSlabs - // Stage 3: Get new SuperSlab (LRU pop or mmap) - // - // Invariants: - // - On success: *ss_out != NULL, 0 <= *slab_idx_out < total_slots - // - The chosen slab has meta->class_idx == class_idx - - if (!ss_out || !slab_idx_out) { - return -1; - } - if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) { - return -1; - } - - shared_pool_init(); - - // Debug logging / stage stats -#if !HAKMEM_BUILD_RELEASE - static int dbg_acquire = -1; - if (__builtin_expect(dbg_acquire == -1, 0)) { - const char* e = getenv("HAKMEM_SS_ACQUIRE_DEBUG"); - dbg_acquire = (e && *e && *e != '0') ? 
1 : 0; - } -#else - static const int dbg_acquire = 0; -#endif - sp_stage_stats_init(); - -stage1_retry_after_tension_drain: - // ========== Stage 0.5 (Phase 12-1.1): EMPTY slab direct scan ========== - // Scan existing SuperSlabs for EMPTY slabs (highest reuse priority) to - // avoid Stage 3 (mmap) when freed slabs are available. - if (sp_acquire_from_empty_scan(class_idx, ss_out, slab_idx_out, dbg_acquire) == 0) { - return 0; - } - - // ========== Stage 1 (Lock-Free): Try to reuse EMPTY slots ========== - // P0-4: Lock-free pop from per-class free list (no mutex needed!) - // Best case: Same class freed a slot, reuse immediately (cache-hot) - SharedSSMeta* reuse_meta = NULL; - int reuse_slot_idx = -1; - - if (sp_freelist_pop_lockfree(class_idx, &reuse_meta, &reuse_slot_idx)) { - // Found EMPTY slot from lock-free list! - // Now acquire mutex ONLY for slot activation and metadata update - - // P0 instrumentation: count lock acquisitions - lock_stats_init(); - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_acquire_count, 1); - atomic_fetch_add(&g_lock_acquire_slab_count, 1); - } - - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - // P0.3: Guard against TLS SLL orphaned pointers before reusing slab - // RACE FIX: Load SuperSlab pointer atomically BEFORE guard (consistency) - SuperSlab* ss_guard = atomic_load_explicit(&reuse_meta->ss, memory_order_relaxed); - if (ss_guard) { - tiny_tls_slab_reuse_guard(ss_guard); - } - - // Activate slot under mutex (slot state transition requires protection) - if (sp_slot_mark_active(reuse_meta, reuse_slot_idx, class_idx) == 0) { - // RACE FIX: Load SuperSlab pointer atomically (consistency) - SuperSlab* ss = atomic_load_explicit(&reuse_meta->ss, memory_order_relaxed); - - // RACE FIX: Check if SuperSlab was freed (NULL pointer) - // This can happen if Thread A freed the SuperSlab after pushing slot to freelist, - // but Thread B popped the stale slot before the freelist was cleared. - if (!ss) { - // SuperSlab freed - skip and fall through to Stage 2/3 - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - goto stage2_fallback; - } - - #if !HAKMEM_BUILD_RELEASE - if (dbg_acquire == 1) { - fprintf(stderr, "[SP_ACQUIRE_STAGE1_LOCKFREE] class=%d reusing EMPTY slot (ss=%p slab=%d)\n", - class_idx, (void*)ss, reuse_slot_idx); - } - #endif - - // Update SuperSlab metadata - ss->slab_bitmap |= (1u << reuse_slot_idx); - ss_slab_meta_class_idx_set(ss, reuse_slot_idx, (uint8_t)class_idx); - - if (ss->active_slabs == 0) { - // Was empty, now active again - ss->active_slabs = 1; - g_shared_pool.active_count++; - } - // Track per-class active slots (approximate, under alloc_lock) - if (class_idx < TINY_NUM_CLASSES_SS) { - g_shared_pool.class_active_slots[class_idx]++; - } - - // Update hint - g_shared_pool.class_hints[class_idx] = ss; - - *ss_out = ss; - *slab_idx_out = reuse_slot_idx; - - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - if (g_sp_stage_stats_enabled) { - atomic_fetch_add(&g_sp_stage1_hits[class_idx], 1); - } - return 0; // ✅ Stage 1 (lock-free) success - } - - // Slot activation failed (race condition?) 
- release lock and fall through - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - } - -stage2_fallback: - // ========== Stage 2 (Lock-Free): Try to claim UNUSED slots ========== - // P0-5: Lock-free atomic CAS claiming (no mutex needed for slot state transition!) - // RACE FIX: Read ss_meta_count atomically (now properly declared as _Atomic) - // No cast needed! memory_order_acquire synchronizes with release in sp_meta_find_or_create - uint32_t meta_count = atomic_load_explicit( - &g_shared_pool.ss_meta_count, - memory_order_acquire - ); - - for (uint32_t i = 0; i < meta_count; i++) { - SharedSSMeta* meta = &g_shared_pool.ss_metadata[i]; - - // Try lock-free claiming (UNUSED → ACTIVE via CAS) - int claimed_idx = sp_slot_claim_lockfree(meta, class_idx); - if (claimed_idx >= 0) { - // RACE FIX: Load SuperSlab pointer atomically (critical for lock-free Stage 2) - // Use memory_order_acquire to synchronize with release in sp_meta_find_or_create - SuperSlab* ss = atomic_load_explicit(&meta->ss, memory_order_acquire); - if (!ss) { - // SuperSlab was freed between claiming and loading - skip this entry - continue; - } - - #if !HAKMEM_BUILD_RELEASE - if (dbg_acquire == 1) { - fprintf(stderr, "[SP_ACQUIRE_STAGE2_LOCKFREE] class=%d claimed UNUSED slot (ss=%p slab=%d)\n", - class_idx, (void*)ss, claimed_idx); - } - #endif - - // P0 instrumentation: count lock acquisitions - lock_stats_init(); - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_acquire_count, 1); - atomic_fetch_add(&g_lock_acquire_slab_count, 1); - } - - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - // Update SuperSlab metadata under mutex - ss->slab_bitmap |= (1u << claimed_idx); - ss_slab_meta_class_idx_set(ss, claimed_idx, (uint8_t)class_idx); - - if (ss->active_slabs == 0) { - ss->active_slabs = 1; - g_shared_pool.active_count++; - } - if (class_idx < TINY_NUM_CLASSES_SS) { - g_shared_pool.class_active_slots[class_idx]++; - } - - // Update hint - g_shared_pool.class_hints[class_idx] = ss; - - *ss_out = ss; - *slab_idx_out = claimed_idx; - sp_fix_geometry_if_needed(ss, claimed_idx, class_idx); - - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - if (g_sp_stage_stats_enabled) { - atomic_fetch_add(&g_sp_stage2_hits[class_idx], 1); - } - return 0; // ✅ Stage 2 (lock-free) success - } - - // Claim failed (no UNUSED slots in this meta) - continue to next SuperSlab - } - - // ========== Tension-Based Drain: Try to create EMPTY slots before Stage 3 ========== - // If TLS SLL has accumulated blocks, drain them to enable EMPTY slot detection - // This can avoid allocating new SuperSlabs by reusing EMPTY slots in Stage 1 - // ENV: HAKMEM_TINY_TENSION_DRAIN_ENABLE=0 to disable (default=1) - // ENV: HAKMEM_TINY_TENSION_DRAIN_THRESHOLD=N to set threshold (default=1024) - { - static int tension_drain_enabled = -1; - static uint32_t tension_threshold = 1024; - - if (tension_drain_enabled < 0) { - const char* env = getenv("HAKMEM_TINY_TENSION_DRAIN_ENABLE"); - tension_drain_enabled = (env == NULL || atoi(env) != 0) ? 
1 : 0; - - const char* thresh_env = getenv("HAKMEM_TINY_TENSION_DRAIN_THRESHOLD"); - if (thresh_env) { - tension_threshold = (uint32_t)atoi(thresh_env); - if (tension_threshold < 64) tension_threshold = 64; - if (tension_threshold > 65536) tension_threshold = 65536; - } - } - - if (tension_drain_enabled) { - extern __thread TinyTLSSLL g_tls_sll[TINY_NUM_CLASSES]; - extern uint32_t tiny_tls_sll_drain(int class_idx, uint32_t batch_size); - - uint32_t sll_count = (class_idx < TINY_NUM_CLASSES) ? g_tls_sll[class_idx].count : 0; - - if (sll_count >= tension_threshold) { - // Drain all blocks to maximize EMPTY slot creation - uint32_t drained = tiny_tls_sll_drain(class_idx, 0); // 0 = drain all - - if (drained > 0) { - // Retry Stage 1 (EMPTY reuse) after drain - // Some slabs might have become EMPTY (meta->used == 0) - goto stage1_retry_after_tension_drain; - } - } - } - } - - // ========== Stage 3: Mutex-protected fallback (new SuperSlab allocation) ========== - // All existing SuperSlabs have no UNUSED slots → need new SuperSlab - // P0 instrumentation: count lock acquisitions - lock_stats_init(); - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_acquire_count, 1); - atomic_fetch_add(&g_lock_acquire_slab_count, 1); - } - - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - // ========== Stage 3: Get new SuperSlab ========== - // Try LRU cache first, then mmap - SuperSlab* new_ss = NULL; - - // Stage 3a: Try LRU cache - extern SuperSlab* hak_ss_lru_pop(uint8_t size_class); - new_ss = hak_ss_lru_pop((uint8_t)class_idx); - - int from_lru = (new_ss != NULL); - - // Stage 3b: If LRU miss, allocate new SuperSlab - if (!new_ss) { - // Release the alloc_lock to avoid deadlock with registry during superslab_allocate - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - - SuperSlab* allocated_ss = sp_internal_allocate_superslab(); - - // Re-acquire the alloc_lock - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_acquire_count, 1); - atomic_fetch_add(&g_lock_acquire_slab_count, 1); // This is part of acquisition path - } - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - if (!allocated_ss) { - // Allocation failed; return now. - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return -1; // Out of memory - } - - new_ss = allocated_ss; - - // Add newly allocated SuperSlab to the shared pool's internal array - if (g_shared_pool.total_count >= g_shared_pool.capacity) { - shared_pool_ensure_capacity_unlocked(g_shared_pool.total_count + 1); - if (g_shared_pool.total_count >= g_shared_pool.capacity) { - // Pool table expansion failed; leave ss alive (registry-owned), - // but do not treat it as part of shared_pool. - // This is a critical error, return early. 
- if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return -1; - } - } - g_shared_pool.slabs[g_shared_pool.total_count] = new_ss; - g_shared_pool.total_count++; - } - - #if !HAKMEM_BUILD_RELEASE - if (dbg_acquire == 1 && new_ss) { - fprintf(stderr, "[SP_ACQUIRE_STAGE3] class=%d new SuperSlab (ss=%p from_lru=%d)\n", - class_idx, (void*)new_ss, from_lru); - } - #endif - - if (!new_ss) { - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return -1; // ❌ Out of memory - } - - // Before creating a new SuperSlab, consult learning-layer soft cap. - // If current active slots for this class already exceed the policy cap, - // fail early so caller can fall back to legacy backend. - uint32_t limit = sp_class_active_limit(class_idx); - if (limit > 0) { - uint32_t cur = g_shared_pool.class_active_slots[class_idx]; - if (cur >= limit) { - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return -1; // Soft cap reached for this class - } - } - - // Create metadata for this new SuperSlab - SharedSSMeta* new_meta = sp_meta_find_or_create(new_ss); - if (!new_meta) { - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return -1; // ❌ Metadata allocation failed - } - - // Assign first slot to this class - int first_slot = 0; - if (sp_slot_mark_active(new_meta, first_slot, class_idx) != 0) { - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return -1; // ❌ Should not happen - } - - // Update SuperSlab metadata - new_ss->slab_bitmap |= (1u << first_slot); - ss_slab_meta_class_idx_set(new_ss, first_slot, (uint8_t)class_idx); - new_ss->active_slabs = 1; - g_shared_pool.active_count++; - if (class_idx < TINY_NUM_CLASSES_SS) { - g_shared_pool.class_active_slots[class_idx]++; - } - - // Update hint - g_shared_pool.class_hints[class_idx] = new_ss; - - *ss_out = new_ss; - *slab_idx_out = first_slot; - sp_fix_geometry_if_needed(new_ss, first_slot, class_idx); - - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - if (g_sp_stage_stats_enabled) { - atomic_fetch_add(&g_sp_stage3_hits[class_idx], 1); - } - return 0; // ✅ Stage 3 success -} - -void -shared_pool_release_slab(SuperSlab* ss, int slab_idx) -{ - // Phase 12: SP-SLOT Box - Slot-based Release - // - // Flow: - // 1. Validate inputs and check meta->used == 0 - // 2. Find SharedSSMeta for this SuperSlab - // 3. Mark slot ACTIVE → EMPTY - // 4. Push to per-class free list (enables same-class reuse) - // 5. If all slots EMPTY → superslab_free() → LRU cache - - if (!ss) { - return; - } - if (slab_idx < 0 || slab_idx >= SLABS_PER_SUPERSLAB_MAX) { - return; - } - - // Debug logging -#if !HAKMEM_BUILD_RELEASE - static int dbg = -1; - if (__builtin_expect(dbg == -1, 0)) { - const char* e = getenv("HAKMEM_SS_FREE_DEBUG"); - dbg = (e && *e && *e != '0') ? 
1 : 0; - } -#else - static const int dbg = 0; -#endif - - // P0 instrumentation: count lock acquisitions - lock_stats_init(); - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_acquire_count, 1); - atomic_fetch_add(&g_lock_release_slab_count, 1); - } - - pthread_mutex_lock(&g_shared_pool.alloc_lock); - - TinySlabMeta* slab_meta = &ss->slabs[slab_idx]; - if (slab_meta->used != 0) { - // Not actually empty; nothing to do - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return; - } - - uint8_t class_idx = slab_meta->class_idx; - - #if !HAKMEM_BUILD_RELEASE - if (dbg == 1) { - fprintf(stderr, "[SP_SLOT_RELEASE] ss=%p slab_idx=%d class=%d used=0 (marking EMPTY)\n", - (void*)ss, slab_idx, class_idx); - } - #endif - - // Find SharedSSMeta for this SuperSlab - SharedSSMeta* sp_meta = NULL; - uint32_t count = atomic_load_explicit(&g_shared_pool.ss_meta_count, memory_order_relaxed); - for (uint32_t i = 0; i < count; i++) { - // RACE FIX: Load pointer atomically - SuperSlab* meta_ss = atomic_load_explicit(&g_shared_pool.ss_metadata[i].ss, memory_order_relaxed); - if (meta_ss == ss) { - sp_meta = &g_shared_pool.ss_metadata[i]; - break; - } - } - - if (!sp_meta) { - // SuperSlab not in SP-SLOT system yet - create metadata - sp_meta = sp_meta_find_or_create(ss); - if (!sp_meta) { - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return; // Failed to create metadata - } - } - - // Mark slot as EMPTY (ACTIVE → EMPTY) - uint32_t slab_bit = (1u << slab_idx); - SlotState slot_state = atomic_load_explicit( - &sp_meta->slots[slab_idx].state, - memory_order_acquire); - if (slot_state != SLOT_ACTIVE && (ss->slab_bitmap & slab_bit)) { - // Legacy path import: rebuild slot states from SuperSlab bitmap/class_map - sp_meta_sync_slots_from_ss(sp_meta, ss); - slot_state = atomic_load_explicit( - &sp_meta->slots[slab_idx].state, - memory_order_acquire); - } - - if (slot_state != SLOT_ACTIVE || sp_slot_mark_empty(sp_meta, slab_idx) != 0) { - if (g_lock_stats_enabled == 1) { - atomic_fetch_add(&g_lock_release_count, 1); - } - pthread_mutex_unlock(&g_shared_pool.alloc_lock); - return; // Slot wasn't ACTIVE - } - - // Update SuperSlab metadata - uint32_t bit = (1u << slab_idx); - if (ss->slab_bitmap & bit) { - ss->slab_bitmap &= ~bit; - slab_meta->class_idx = 255; // UNASSIGNED - // P1.1: Mark class_map as UNASSIGNED when releasing slab - ss->class_map[slab_idx] = 255; - - if (ss->active_slabs > 0) { - ss->active_slabs--; - if (ss->active_slabs == 0 && g_shared_pool.active_count > 0) { - g_shared_pool.active_count--; - } - } - if (class_idx < TINY_NUM_CLASSES_SS && - g_shared_pool.class_active_slots[class_idx] > 0) { - g_shared_pool.class_active_slots[class_idx]--; - } - } - - // P0-4: Push to lock-free per-class free list (enables reuse by same class) - // Note: push BEFORE releasing mutex (slot state already updated under lock) - if (class_idx < TINY_NUM_CLASSES_SS) { - sp_freelist_push_lockfree(class_idx, sp_meta, slab_idx); - - #if !HAKMEM_BUILD_RELEASE - if (dbg == 1) { - fprintf(stderr, "[SP_SLOT_FREELIST_LOCKFREE] class=%d pushed slot (ss=%p slab=%d) active_slots=%u/%u\n", - class_idx, (void*)ss, slab_idx, - sp_meta->active_slots, sp_meta->total_slots); - } - #endif - } - - // Check if SuperSlab is now completely empty (all slots EMPTY or UNUSED) - if (sp_meta->active_slots == 0) { - #if !HAKMEM_BUILD_RELEASE - if (dbg == 1) { - fprintf(stderr, "[SP_SLOT_COMPLETELY_EMPTY] ss=%p active_slots=0 (calling 
superslab_free)\n",
-                    (void*)ss);
-        }
-        #endif
-
-        if (g_lock_stats_enabled == 1) {
-            atomic_fetch_add(&g_lock_release_count, 1);
-        }
-
-        // RACE FIX: Set meta->ss to NULL BEFORE unlocking mutex
-        // This prevents Stage 2 from accessing freed SuperSlab
-        atomic_store_explicit(&sp_meta->ss, NULL, memory_order_release);
-
-        pthread_mutex_unlock(&g_shared_pool.alloc_lock);
-
-        // Remove from legacy backend list (if present) to prevent dangling pointers
-        extern void remove_superslab_from_legacy_head(SuperSlab* ss);
-        remove_superslab_from_legacy_head(ss);
-
-        // Free SuperSlab:
-        //   1. Try LRU cache (hak_ss_lru_push) - lazy deallocation
-        //   2. Or munmap if LRU is full - eager deallocation
-        extern void superslab_free(SuperSlab* ss);
-        superslab_free(ss);
-        return;
-    }
-
-    if (g_lock_stats_enabled == 1) {
-        atomic_fetch_add(&g_lock_release_count, 1);
-    }
-    pthread_mutex_unlock(&g_shared_pool.alloc_lock);
+void sp_fix_geometry_if_needed(SuperSlab* ss, int slab_idx, int class_idx) {
+    // Phase 9-1: For now, we assume geometry is compatible or set by caller.
+    // This hook exists for future use when we support dynamic geometry resizing.
+    (void)ss; (void)slab_idx; (void)class_idx;
 }
diff --git a/core/hakmem_shared_pool_acquire.c b/core/hakmem_shared_pool_acquire.c
new file mode 100644
index 00000000..3f7cba84
--- /dev/null
+++ b/core/hakmem_shared_pool_acquire.c
@@ -0,0 +1,479 @@
+#include "hakmem_shared_pool_internal.h"
+#include "hakmem_debug_master.h"
+#include "hakmem_stats_master.h"
+#include "box/ss_slab_meta_box.h"
+#include "box/ss_hot_cold_box.h"
+#include "box/pagefault_telemetry_box.h"
+#include "box/tls_sll_drain_box.h"
+#include "box/tls_slab_reuse_guard_box.h"
+#include "hakmem_policy.h"
+
+#include <stdio.h>    /* fprintf/stderr (debug logging) */
+#include <stdlib.h>   /* getenv, atoi */
+#include <pthread.h>  /* pthread_mutex_lock/unlock */
+
+// Stage 0.5: EMPTY slab direct scan (registry-based EMPTY reuse)
+// Scan existing SuperSlabs for EMPTY slabs (highest reuse priority) to
+// avoid Stage 3 (mmap) when freed slabs are available.
+static inline int
+sp_acquire_from_empty_scan(int class_idx, SuperSlab** ss_out, int* slab_idx_out, int dbg_acquire)
+{
+    static int empty_reuse_enabled = -1;
+    if (__builtin_expect(empty_reuse_enabled == -1, 0)) {
+        const char* e = getenv("HAKMEM_SS_EMPTY_REUSE");
+        empty_reuse_enabled = (e && *e && *e == '0') ? 0 : 1;  // default ON
+    }
+
+    if (!empty_reuse_enabled) {
+        return -1;
+    }
+
+    extern SuperSlab* g_super_reg_by_class[TINY_NUM_CLASSES][SUPER_REG_PER_CLASS];
+    extern int g_super_reg_class_size[TINY_NUM_CLASSES];
+
+    int reg_size = (class_idx < TINY_NUM_CLASSES) ? g_super_reg_class_size[class_idx] : 0;
+    static int scan_limit = -1;
+    if (__builtin_expect(scan_limit == -1, 0)) {
+        const char* e = getenv("HAKMEM_SS_EMPTY_SCAN_LIMIT");
+        scan_limit = (e && *e) ?
atoi(e) : 32; // default: scan first 32 SuperSlabs (Phase 9-2 tuning) + } + if (scan_limit > reg_size) scan_limit = reg_size; + + // Stage 0.5 hit counter for visualization + static _Atomic uint64_t stage05_hits = 0; + static _Atomic uint64_t stage05_attempts = 0; + atomic_fetch_add_explicit(&stage05_attempts, 1, memory_order_relaxed); + + for (int i = 0; i < scan_limit; i++) { + SuperSlab* ss = g_super_reg_by_class[class_idx][i]; + if (!(ss && ss->magic == SUPERSLAB_MAGIC)) continue; + if (ss->empty_count == 0) continue; // No EMPTY slabs in this SS + + uint32_t mask = ss->empty_mask; + while (mask) { + int empty_idx = __builtin_ctz(mask); + mask &= (mask - 1); // clear lowest bit + + TinySlabMeta* meta = &ss->slabs[empty_idx]; + if (meta->capacity > 0 && meta->used == 0) { + tiny_tls_slab_reuse_guard(ss); + ss_clear_slab_empty(ss, empty_idx); + + meta->class_idx = (uint8_t)class_idx; + ss->class_map[empty_idx] = (uint8_t)class_idx; + +#if !HAKMEM_BUILD_RELEASE + if (dbg_acquire == 1) { + fprintf(stderr, + "[SP_ACQUIRE_STAGE0.5_EMPTY] class=%d reusing EMPTY slab (ss=%p slab=%d empty_count=%u)\n", + class_idx, (void*)ss, empty_idx, ss->empty_count); + } +#else + (void)dbg_acquire; +#endif + + *ss_out = ss; + *slab_idx_out = empty_idx; + sp_stage_stats_init(); + if (g_sp_stage_stats_enabled) { + atomic_fetch_add(&g_sp_stage1_hits[class_idx], 1); + } + atomic_fetch_add_explicit(&stage05_hits, 1, memory_order_relaxed); + + // Stage 0.5 hit rate visualization (every 100 hits) + uint64_t hits = atomic_load_explicit(&stage05_hits, memory_order_relaxed); + if (hits % 100 == 1) { + uint64_t attempts = atomic_load_explicit(&stage05_attempts, memory_order_relaxed); + fprintf(stderr, "[STAGE0.5_STATS] hits=%lu attempts=%lu rate=%.1f%% (scan_limit=%d)\n", + hits, attempts, (double)hits * 100.0 / attempts, scan_limit); + } + return 0; + } + } + } + return -1; +} + +int +shared_pool_acquire_slab(int class_idx, SuperSlab** ss_out, int* slab_idx_out) +{ + // Phase 12: SP-SLOT Box - 3-Stage Acquire Logic + // + // Stage 1: Reuse EMPTY slots from per-class free list (EMPTY→ACTIVE) + // Stage 2: Find UNUSED slots in existing SuperSlabs + // Stage 3: Get new SuperSlab (LRU pop or mmap) + // + // Invariants: + // - On success: *ss_out != NULL, 0 <= *slab_idx_out < total_slots + // - The chosen slab has meta->class_idx == class_idx + + if (!ss_out || !slab_idx_out) { + return -1; + } + if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES_SS) { + return -1; + } + + shared_pool_init(); + + // Debug logging / stage stats +#if !HAKMEM_BUILD_RELEASE + static int dbg_acquire = -1; + if (__builtin_expect(dbg_acquire == -1, 0)) { + const char* e = getenv("HAKMEM_SS_ACQUIRE_DEBUG"); + dbg_acquire = (e && *e && *e != '0') ? 1 : 0; + } +#else + static const int dbg_acquire = 0; +#endif + sp_stage_stats_init(); + +stage1_retry_after_tension_drain: + // ========== Stage 0.5 (Phase 12-1.1): EMPTY slab direct scan ========== + // Scan existing SuperSlabs for EMPTY slabs (highest reuse priority) to + // avoid Stage 3 (mmap) when freed slabs are available. + if (sp_acquire_from_empty_scan(class_idx, ss_out, slab_idx_out, dbg_acquire) == 0) { + return 0; + } + + // ========== Stage 1 (Lock-Free): Try to reuse EMPTY slots ========== + // P0-4: Lock-free pop from per-class free list (no mutex needed!) 
+ // Best case: Same class freed a slot, reuse immediately (cache-hot) + SharedSSMeta* reuse_meta = NULL; + int reuse_slot_idx = -1; + + if (sp_freelist_pop_lockfree(class_idx, &reuse_meta, &reuse_slot_idx)) { + // Found EMPTY slot from lock-free list! + // Now acquire mutex ONLY for slot activation and metadata update + + // P0 instrumentation: count lock acquisitions + lock_stats_init(); + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_acquire_count, 1); + atomic_fetch_add(&g_lock_acquire_slab_count, 1); + } + + pthread_mutex_lock(&g_shared_pool.alloc_lock); + + // P0.3: Guard against TLS SLL orphaned pointers before reusing slab + // RACE FIX: Load SuperSlab pointer atomically BEFORE guard (consistency) + SuperSlab* ss_guard = atomic_load_explicit(&reuse_meta->ss, memory_order_relaxed); + if (ss_guard) { + tiny_tls_slab_reuse_guard(ss_guard); + } + + // Activate slot under mutex (slot state transition requires protection) + if (sp_slot_mark_active(reuse_meta, reuse_slot_idx, class_idx) == 0) { + // RACE FIX: Load SuperSlab pointer atomically (consistency) + SuperSlab* ss = atomic_load_explicit(&reuse_meta->ss, memory_order_relaxed); + + // RACE FIX: Check if SuperSlab was freed (NULL pointer) + // This can happen if Thread A freed the SuperSlab after pushing slot to freelist, + // but Thread B popped the stale slot before the freelist was cleared. + if (!ss) { + // SuperSlab freed - skip and fall through to Stage 2/3 + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + goto stage2_fallback; + } + + #if !HAKMEM_BUILD_RELEASE + if (dbg_acquire == 1) { + fprintf(stderr, "[SP_ACQUIRE_STAGE1_LOCKFREE] class=%d reusing EMPTY slot (ss=%p slab=%d)\n", + class_idx, (void*)ss, reuse_slot_idx); + } + #endif + + // Update SuperSlab metadata + ss->slab_bitmap |= (1u << reuse_slot_idx); + ss_slab_meta_class_idx_set(ss, reuse_slot_idx, (uint8_t)class_idx); + + if (ss->active_slabs == 0) { + // Was empty, now active again + ss->active_slabs = 1; + g_shared_pool.active_count++; + } + // Track per-class active slots (approximate, under alloc_lock) + if (class_idx < TINY_NUM_CLASSES_SS) { + g_shared_pool.class_active_slots[class_idx]++; + } + + // Update hint + g_shared_pool.class_hints[class_idx] = ss; + + *ss_out = ss; + *slab_idx_out = reuse_slot_idx; + + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + if (g_sp_stage_stats_enabled) { + atomic_fetch_add(&g_sp_stage1_hits[class_idx], 1); + } + return 0; // ✅ Stage 1 (lock-free) success + } + + // Slot activation failed (race condition?) - release lock and fall through + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + } + +stage2_fallback: + // ========== Stage 2 (Lock-Free): Try to claim UNUSED slots ========== + // P0-5: Lock-free atomic CAS claiming (no mutex needed for slot state transition!) + // RACE FIX: Read ss_meta_count atomically (now properly declared as _Atomic) + // No cast needed! 
memory_order_acquire synchronizes with release in sp_meta_find_or_create + uint32_t meta_count = atomic_load_explicit( + &g_shared_pool.ss_meta_count, + memory_order_acquire + ); + + for (uint32_t i = 0; i < meta_count; i++) { + SharedSSMeta* meta = &g_shared_pool.ss_metadata[i]; + + // Try lock-free claiming (UNUSED → ACTIVE via CAS) + int claimed_idx = sp_slot_claim_lockfree(meta, class_idx); + if (claimed_idx >= 0) { + // RACE FIX: Load SuperSlab pointer atomically (critical for lock-free Stage 2) + // Use memory_order_acquire to synchronize with release in sp_meta_find_or_create + SuperSlab* ss = atomic_load_explicit(&meta->ss, memory_order_acquire); + if (!ss) { + // SuperSlab was freed between claiming and loading - skip this entry + continue; + } + + #if !HAKMEM_BUILD_RELEASE + if (dbg_acquire == 1) { + fprintf(stderr, "[SP_ACQUIRE_STAGE2_LOCKFREE] class=%d claimed UNUSED slot (ss=%p slab=%d)\n", + class_idx, (void*)ss, claimed_idx); + } + #endif + + // P0 instrumentation: count lock acquisitions + lock_stats_init(); + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_acquire_count, 1); + atomic_fetch_add(&g_lock_acquire_slab_count, 1); + } + + pthread_mutex_lock(&g_shared_pool.alloc_lock); + + // Update SuperSlab metadata under mutex + ss->slab_bitmap |= (1u << claimed_idx); + ss_slab_meta_class_idx_set(ss, claimed_idx, (uint8_t)class_idx); + + if (ss->active_slabs == 0) { + ss->active_slabs = 1; + g_shared_pool.active_count++; + } + if (class_idx < TINY_NUM_CLASSES_SS) { + g_shared_pool.class_active_slots[class_idx]++; + } + + // Update hint + g_shared_pool.class_hints[class_idx] = ss; + + *ss_out = ss; + *slab_idx_out = claimed_idx; + sp_fix_geometry_if_needed(ss, claimed_idx, class_idx); + + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + if (g_sp_stage_stats_enabled) { + atomic_fetch_add(&g_sp_stage2_hits[class_idx], 1); + } + return 0; // ✅ Stage 2 (lock-free) success + } + + // Claim failed (no UNUSED slots in this meta) - continue to next SuperSlab + } + + // ========== Tension-Based Drain: Try to create EMPTY slots before Stage 3 ========== + // If TLS SLL has accumulated blocks, drain them to enable EMPTY slot detection + // This can avoid allocating new SuperSlabs by reusing EMPTY slots in Stage 1 + // ENV: HAKMEM_TINY_TENSION_DRAIN_ENABLE=0 to disable (default=1) + // ENV: HAKMEM_TINY_TENSION_DRAIN_THRESHOLD=N to set threshold (default=1024) + { + static int tension_drain_enabled = -1; + static uint32_t tension_threshold = 1024; + + if (tension_drain_enabled < 0) { + const char* env = getenv("HAKMEM_TINY_TENSION_DRAIN_ENABLE"); + tension_drain_enabled = (env == NULL || atoi(env) != 0) ? 1 : 0; + + const char* thresh_env = getenv("HAKMEM_TINY_TENSION_DRAIN_THRESHOLD"); + if (thresh_env) { + tension_threshold = (uint32_t)atoi(thresh_env); + if (tension_threshold < 64) tension_threshold = 64; + if (tension_threshold > 65536) tension_threshold = 65536; + } + } + + if (tension_drain_enabled) { + extern __thread TinyTLSSLL g_tls_sll[TINY_NUM_CLASSES]; + extern uint32_t tiny_tls_sll_drain(int class_idx, uint32_t batch_size); + + uint32_t sll_count = (class_idx < TINY_NUM_CLASSES) ? 
g_tls_sll[class_idx].count : 0; + + if (sll_count >= tension_threshold) { + // Drain all blocks to maximize EMPTY slot creation + uint32_t drained = tiny_tls_sll_drain(class_idx, 0); // 0 = drain all + + if (drained > 0) { + // Retry Stage 1 (EMPTY reuse) after drain + // Some slabs might have become EMPTY (meta->used == 0) + goto stage1_retry_after_tension_drain; + } + } + } + } + + // ========== Stage 3: Mutex-protected fallback (new SuperSlab allocation) ========== + // All existing SuperSlabs have no UNUSED slots → need new SuperSlab + // P0 instrumentation: count lock acquisitions + lock_stats_init(); + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_acquire_count, 1); + atomic_fetch_add(&g_lock_acquire_slab_count, 1); + } + + pthread_mutex_lock(&g_shared_pool.alloc_lock); + + // ========== Stage 3: Get new SuperSlab ========== + // Try LRU cache first, then mmap + SuperSlab* new_ss = NULL; + + // Stage 3a: Try LRU cache + extern SuperSlab* hak_ss_lru_pop(uint8_t size_class); + new_ss = hak_ss_lru_pop((uint8_t)class_idx); + + int from_lru = (new_ss != NULL); + + // Stage 3b: If LRU miss, allocate new SuperSlab + if (!new_ss) { + // Release the alloc_lock to avoid deadlock with registry during superslab_allocate + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + + SuperSlab* allocated_ss = sp_internal_allocate_superslab(); + + // Re-acquire the alloc_lock + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_acquire_count, 1); + atomic_fetch_add(&g_lock_acquire_slab_count, 1); // This is part of acquisition path + } + pthread_mutex_lock(&g_shared_pool.alloc_lock); + + if (!allocated_ss) { + // Allocation failed; return now. + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return -1; // Out of memory + } + + new_ss = allocated_ss; + + // Add newly allocated SuperSlab to the shared pool's internal array + if (g_shared_pool.total_count >= g_shared_pool.capacity) { + shared_pool_ensure_capacity_unlocked(g_shared_pool.total_count + 1); + if (g_shared_pool.total_count >= g_shared_pool.capacity) { + // Pool table expansion failed; leave ss alive (registry-owned), + // but do not treat it as part of shared_pool. + // This is a critical error, return early. + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return -1; + } + } + g_shared_pool.slabs[g_shared_pool.total_count] = new_ss; + g_shared_pool.total_count++; + } + + #if !HAKMEM_BUILD_RELEASE + if (dbg_acquire == 1 && new_ss) { + fprintf(stderr, "[SP_ACQUIRE_STAGE3] class=%d new SuperSlab (ss=%p from_lru=%d)\n", + class_idx, (void*)new_ss, from_lru); + } + #endif + + if (!new_ss) { + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return -1; // ❌ Out of memory + } + + // Before creating a new SuperSlab, consult learning-layer soft cap. + // If current active slots for this class already exceed the policy cap, + // fail early so caller can fall back to legacy backend. 
+ uint32_t limit = sp_class_active_limit(class_idx); + if (limit > 0) { + uint32_t cur = g_shared_pool.class_active_slots[class_idx]; + if (cur >= limit) { + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return -1; // Soft cap reached for this class + } + } + + // Create metadata for this new SuperSlab + SharedSSMeta* new_meta = sp_meta_find_or_create(new_ss); + if (!new_meta) { + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return -1; // ❌ Metadata allocation failed + } + + // Assign first slot to this class + int first_slot = 0; + if (sp_slot_mark_active(new_meta, first_slot, class_idx) != 0) { + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return -1; // ❌ Should not happen + } + + // Update SuperSlab metadata + new_ss->slab_bitmap |= (1u << first_slot); + ss_slab_meta_class_idx_set(new_ss, first_slot, (uint8_t)class_idx); + new_ss->active_slabs = 1; + g_shared_pool.active_count++; + if (class_idx < TINY_NUM_CLASSES_SS) { + g_shared_pool.class_active_slots[class_idx]++; + } + + // Update hint + g_shared_pool.class_hints[class_idx] = new_ss; + + *ss_out = new_ss; + *slab_idx_out = first_slot; + sp_fix_geometry_if_needed(new_ss, first_slot, class_idx); + + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + if (g_sp_stage_stats_enabled) { + atomic_fetch_add(&g_sp_stage3_hits[class_idx], 1); + } + return 0; // ✅ Stage 3 success +} diff --git a/core/hakmem_shared_pool_internal.h b/core/hakmem_shared_pool_internal.h new file mode 100644 index 00000000..0dcec158 --- /dev/null +++ b/core/hakmem_shared_pool_internal.h @@ -0,0 +1,56 @@ +#ifndef HAKMEM_SHARED_POOL_INTERNAL_H +#define HAKMEM_SHARED_POOL_INTERNAL_H + +#include "hakmem_shared_pool.h" +#include "hakmem_tiny_superslab.h" +#include "hakmem_tiny_superslab_constants.h" +#include +#include + +// Global Shared Pool Instance +extern SharedSuperSlabPool g_shared_pool; + +// Lock Statistics +// Counters are defined always to avoid compilation errors in Release build +// (usage is guarded by g_lock_stats_enabled which is 0 in Release) +extern _Atomic uint64_t g_lock_acquire_count; +extern _Atomic uint64_t g_lock_release_count; +extern _Atomic uint64_t g_lock_acquire_slab_count; +extern _Atomic uint64_t g_lock_release_slab_count; +extern int g_lock_stats_enabled; + +#if !HAKMEM_BUILD_RELEASE +void lock_stats_init(void); +#else +static inline void lock_stats_init(void) { + // No-op for release build +} +#endif + +// Stage Statistics +extern _Atomic uint64_t g_sp_stage1_hits[TINY_NUM_CLASSES_SS]; +extern _Atomic uint64_t g_sp_stage2_hits[TINY_NUM_CLASSES_SS]; +extern _Atomic uint64_t g_sp_stage3_hits[TINY_NUM_CLASSES_SS]; +extern int g_sp_stage_stats_enabled; +void sp_stage_stats_init(void); + +// Internal Helpers (Shared between acquire/release/pool) +void shared_pool_ensure_capacity_unlocked(uint32_t min_capacity); +SuperSlab* sp_internal_allocate_superslab(void); + +// Slot & Meta Helpers +int sp_slot_mark_active(SharedSSMeta* meta, int slot_idx, int class_idx); +int sp_slot_mark_empty(SharedSSMeta* meta, int slot_idx); +int sp_slot_claim_lockfree(SharedSSMeta* meta, int class_idx); +SharedSSMeta* sp_meta_find_or_create(SuperSlab* ss); +void sp_meta_sync_slots_from_ss(SharedSSMeta* meta, 
SuperSlab* ss);
+
+// Free List Helpers
+int sp_freelist_push_lockfree(int class_idx, SharedSSMeta* meta, int slot_idx);
+int sp_freelist_pop_lockfree(int class_idx, SharedSSMeta** meta_out, int* slot_idx_out);
+
+// Policy & Geometry Helpers
+uint32_t sp_class_active_limit(int class_idx);
+void sp_fix_geometry_if_needed(SuperSlab* ss, int slab_idx, int class_idx);
+
+#endif // HAKMEM_SHARED_POOL_INTERNAL_H
diff --git a/core/hakmem_shared_pool_release.c b/core/hakmem_shared_pool_release.c
new file mode 100644
index 00000000..a51dfeef
--- /dev/null
+++ b/core/hakmem_shared_pool_release.c
@@ -0,0 +1,179 @@
+#include "hakmem_shared_pool_internal.h"
+#include "hakmem_debug_master.h"
+#include "box/ss_slab_meta_box.h"
+#include "box/ss_hot_cold_box.h"
+
+#include <stdio.h>    /* fprintf/stderr (debug logging) */
+#include <stdlib.h>   /* getenv */
+#include <pthread.h>  /* pthread_mutex_lock/unlock */
+
+void
+shared_pool_release_slab(SuperSlab* ss, int slab_idx)
+{
+    // Phase 12: SP-SLOT Box - Slot-based Release
+    //
+    // Flow:
+    //   1. Validate inputs and check meta->used == 0
+    //   2. Find SharedSSMeta for this SuperSlab
+    //   3. Mark slot ACTIVE → EMPTY
+    //   4. Push to per-class free list (enables same-class reuse)
+    //   5. If all slots EMPTY → superslab_free() → LRU cache
+
+    if (!ss) {
+        return;
+    }
+    if (slab_idx < 0 || slab_idx >= SLABS_PER_SUPERSLAB_MAX) {
+        return;
+    }
+
+    // Debug logging
+#if !HAKMEM_BUILD_RELEASE
+    static int dbg = -1;
+    if (__builtin_expect(dbg == -1, 0)) {
+        const char* e = getenv("HAKMEM_SS_FREE_DEBUG");
+        dbg = (e && *e && *e != '0') ? 1 : 0;
+    }
+#else
+    static const int dbg = 0;
+#endif
+
+    // P0 instrumentation: count lock acquisitions
+    lock_stats_init();
+    if (g_lock_stats_enabled == 1) {
+        atomic_fetch_add(&g_lock_acquire_count, 1);
+        atomic_fetch_add(&g_lock_release_slab_count, 1);
+    }
+
+    pthread_mutex_lock(&g_shared_pool.alloc_lock);
+
+    TinySlabMeta* slab_meta = &ss->slabs[slab_idx];
+    if (slab_meta->used != 0) {
+        // Not actually empty; nothing to do
+        if (g_lock_stats_enabled == 1) {
+            atomic_fetch_add(&g_lock_release_count, 1);
+        }
+        pthread_mutex_unlock(&g_shared_pool.alloc_lock);
+        return;
+    }
+
+    uint8_t class_idx = slab_meta->class_idx;
+
+    #if !HAKMEM_BUILD_RELEASE
+    if (dbg == 1) {
+        fprintf(stderr, "[SP_SLOT_RELEASE] ss=%p slab_idx=%d class=%d used=0 (marking EMPTY)\n",
+                (void*)ss, slab_idx, class_idx);
+    }
+    #endif
+
+    // Find SharedSSMeta for this SuperSlab
+    SharedSSMeta* sp_meta = NULL;
+    uint32_t count = atomic_load_explicit(&g_shared_pool.ss_meta_count, memory_order_relaxed);
+    for (uint32_t i = 0; i < count; i++) {
+        // RACE FIX: Load pointer atomically
+        SuperSlab* meta_ss = atomic_load_explicit(&g_shared_pool.ss_metadata[i].ss, memory_order_relaxed);
+        if (meta_ss == ss) {
+            sp_meta = &g_shared_pool.ss_metadata[i];
+            break;
+        }
+    }
+
+    if (!sp_meta) {
+        // SuperSlab not in SP-SLOT system yet - create metadata
+        sp_meta = sp_meta_find_or_create(ss);
+        if (!sp_meta) {
+            pthread_mutex_unlock(&g_shared_pool.alloc_lock);
+            return;  // Failed to create metadata
+        }
+    }
+
+    // Mark slot as EMPTY (ACTIVE → EMPTY)
+    uint32_t slab_bit = (1u << slab_idx);
+    SlotState slot_state = atomic_load_explicit(
+        &sp_meta->slots[slab_idx].state,
+        memory_order_acquire);
+    if (slot_state != SLOT_ACTIVE && (ss->slab_bitmap & slab_bit)) {
+        // Legacy path import: rebuild slot states from SuperSlab bitmap/class_map
+        sp_meta_sync_slots_from_ss(sp_meta, ss);
+        slot_state = atomic_load_explicit(
+            &sp_meta->slots[slab_idx].state,
+            memory_order_acquire);
+    }
+
+    if (slot_state != SLOT_ACTIVE || sp_slot_mark_empty(sp_meta, slab_idx) != 0) {
+        if (g_lock_stats_enabled == 1) {
+
atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + return; // Slot wasn't ACTIVE + } + + // Update SuperSlab metadata + uint32_t bit = (1u << slab_idx); + if (ss->slab_bitmap & bit) { + ss->slab_bitmap &= ~bit; + slab_meta->class_idx = 255; // UNASSIGNED + // P1.1: Mark class_map as UNASSIGNED when releasing slab + ss->class_map[slab_idx] = 255; + + if (ss->active_slabs > 0) { + ss->active_slabs--; + if (ss->active_slabs == 0 && g_shared_pool.active_count > 0) { + g_shared_pool.active_count--; + } + } + if (class_idx < TINY_NUM_CLASSES_SS && + g_shared_pool.class_active_slots[class_idx] > 0) { + g_shared_pool.class_active_slots[class_idx]--; + } + } + + // P0-4: Push to lock-free per-class free list (enables reuse by same class) + // Note: push BEFORE releasing mutex (slot state already updated under lock) + if (class_idx < TINY_NUM_CLASSES_SS) { + sp_freelist_push_lockfree(class_idx, sp_meta, slab_idx); + + #if !HAKMEM_BUILD_RELEASE + if (dbg == 1) { + fprintf(stderr, "[SP_SLOT_FREELIST_LOCKFREE] class=%d pushed slot (ss=%p slab=%d) active_slots=%u/%u\n", + class_idx, (void*)ss, slab_idx, + sp_meta->active_slots, sp_meta->total_slots); + } + #endif + } + + // Check if SuperSlab is now completely empty (all slots EMPTY or UNUSED) + if (sp_meta->active_slots == 0) { + #if !HAKMEM_BUILD_RELEASE + if (dbg == 1) { + fprintf(stderr, "[SP_SLOT_COMPLETELY_EMPTY] ss=%p active_slots=0 (calling superslab_free)\n", + (void*)ss); + } + #endif + + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + + // RACE FIX: Set meta->ss to NULL BEFORE unlocking mutex + // This prevents Stage 2 from accessing freed SuperSlab + atomic_store_explicit(&sp_meta->ss, NULL, memory_order_release); + + pthread_mutex_unlock(&g_shared_pool.alloc_lock); + + // Remove from legacy backend list (if present) to prevent dangling pointers + extern void remove_superslab_from_legacy_head(SuperSlab* ss); + remove_superslab_from_legacy_head(ss); + + // Free SuperSlab: + // 1. Try LRU cache (hak_ss_lru_push) - lazy deallocation + // 2. Or munmap if LRU is full - eager deallocation + extern void superslab_free(SuperSlab* ss); + superslab_free(ss); + return; + } + + if (g_lock_stats_enabled == 1) { + atomic_fetch_add(&g_lock_release_count, 1); + } + pthread_mutex_unlock(&g_shared_pool.alloc_lock); +} diff --git a/core/hakmem_super_registry.h b/core/hakmem_super_registry.h index 0ded3f0a..1b45a1e8 100644 --- a/core/hakmem_super_registry.h +++ b/core/hakmem_super_registry.h @@ -24,7 +24,7 @@ // Increased from 4096 to 32768 to avoid registry exhaustion under // high-churn microbenchmarks (e.g., larson with many active SuperSlabs). // Still a power of two for fast masking. -#define SUPER_REG_SIZE 262144 // Power of 2 for fast modulo (8x larger for workloads) +#define SUPER_REG_SIZE 1048576 // Power of 2 for fast modulo (1M entries) #define SUPER_REG_MASK (SUPER_REG_SIZE - 1) #define SUPER_MAX_PROBE 32 // Linear probing limit (increased from 8 for Phase 15 fix)
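
The patch defines the SP-SLOT acquire/release pair but does not show a caller. The sketch below illustrates only the intended calling pattern under the invariants documented above; `tiny_backend_refill_sketch`, `tiny_backend_retire_slab_sketch`, and `superslab_carve_block` are hypothetical names, while `shared_pool_acquire_slab` / `shared_pool_release_slab` use the signatures declared in this patch.

```c
/* Minimal caller sketch for the SP-SLOT pair (assumed helper names are
 * illustrative, not part of the patch). */
#include "hakmem_shared_pool.h"   /* shared_pool_acquire_slab / shared_pool_release_slab */

static void* tiny_backend_refill_sketch(int class_idx)
{
    SuperSlab* ss = NULL;
    int slab_idx = -1;

    /* 3-stage acquire: Stage 0.5/1 reuse EMPTY slots, Stage 2 claims UNUSED
     * slots lock-free, Stage 3 falls back to LRU pop or a fresh SuperSlab. */
    if (shared_pool_acquire_slab(class_idx, &ss, &slab_idx) != 0) {
        /* Soft cap reached or OOM: caller falls back to the legacy backend. */
        return NULL;
    }

    /* Invariant on success: ss != NULL and the slab's class matches class_idx. */
    return superslab_carve_block(ss, slab_idx);   /* hypothetical carve helper */
}

static void tiny_backend_retire_slab_sketch(SuperSlab* ss, int slab_idx)
{
    /* Call only once the slab's meta->used has dropped to 0; the release path
     * marks the slot EMPTY, pushes it to the per-class free list, and frees
     * the SuperSlab (LRU push or munmap) when every slot is EMPTY/UNUSED. */
    shared_pool_release_slab(ss, slab_idx);
}
```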