# Mid-Large Lock Contention Analysis (P0-3)

**Date**: 2025-11-14

**Status**: ✅ **Analysis Complete** - Instrumentation reveals critical insights

---

## Executive Summary

Lock contention analysis for `g_shared_pool.alloc_lock` reveals:

- **100% of lock contention comes from `acquire_slab()` (allocation path)**
- **0% from `release_slab()` (the free path is effectively lock-free)**
- **Lock acquisition rate: 0.206% (TLS hit rate: 99.8%)**
- **Contention scales linearly with thread count**

### Key Insight

> **The release path is already lock-free in practice!**
> `release_slab()` only acquires the lock when a slab becomes completely empty,
> but in this workload, slabs stay active throughout execution.

---

## Instrumentation Results

### Test Configuration

- **Benchmark**: `bench_mid_large_mt_hakmem`
- **Workload**: 40,000 iterations per thread, 2KB block size
- **Environment**: `HAKMEM_SHARED_POOL_LOCK_STATS=1`

### 4-Thread Results

```
Throughput:        1,592,036 ops/s
Total operations:  160,000 (4 × 40,000)
Lock acquisitions: 330
Lock rate:         0.206%

--- Breakdown by Code Path ---
acquire_slab(): 330 (100.0%)
release_slab():   0 (0.0%)
```

### 8-Thread Results

```
Throughput:        2,290,621 ops/s
Total operations:  320,000 (8 × 40,000)
Lock acquisitions: 658
Lock rate:         0.206%

--- Breakdown by Code Path ---
acquire_slab(): 658 (100.0%)
release_slab():   0 (0.0%)
```

### Scaling Analysis

| Threads | Ops     | Lock Acq | Lock Rate | Throughput (ops/s) | Scaling |
|---------|---------|----------|-----------|--------------------|---------|
| 4T      | 160,000 | 330      | 0.206%    | 1,592,036          | 1.00x   |
| 8T      | 320,000 | 658      | 0.206%    | 2,290,621          | 1.44x   |

**Observations**:

- Lock acquisitions scale linearly: 8T ≈ 2× 4T (658 vs 330)
- Lock rate is constant at 0.206% across thread counts
- Throughput scaling is sublinear: 1.44x, where 2.0x would be perfect scaling

---

## Root Cause Analysis

### Why 100% acquire_slab()?

`acquire_slab()` is called on a **TLS cache miss**, which happens when:

1. A thread starts with an empty TLS cache
2. A thread's TLS cache is depleted during execution

With a **TLS hit rate of 99.8%**, only 0.2% of operations miss and fall through to the shared pool.

### Why 0% release_slab()?

`release_slab()` acquires the lock only when:

- `slab_meta->used == 0` (the slab has become completely empty)

In this workload:

- Slabs stay active (partially full) throughout the benchmark
- No slab ever becomes completely empty → no lock acquisition
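The shape of this lock-elision pattern is roughly the following. This is a minimal sketch, not the actual hakmem code: `SlabMeta`, `sp_release_to_freelist()`, and the pool struct layout are illustrative assumptions.

```c
// Sketch only: SlabMeta, sp_release_to_freelist(), and the pool layout
// are assumptions for illustration, not the real hakmem definitions.
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t used;  // live blocks remaining in this slab
} SlabMeta;

extern struct { pthread_mutex_t alloc_lock; } g_shared_pool;
void sp_release_to_freelist(SlabMeta* slab_meta);  // hypothetical helper

void release_slab(SlabMeta* slab_meta) {
    // Fast path: decrement the live-block count atomically, no lock.
    // atomic_fetch_sub returns the previous value, so 1 means "now empty".
    if (atomic_fetch_sub(&slab_meta->used, 1) != 1) {
        return;  // Slab still partially full: the lock-free common case.
    }
    // Slow path (never taken in this workload): the slab is now empty,
    // so take the pool lock to hand it back to the shared pool.
    pthread_mutex_lock(&g_shared_pool.alloc_lock);
    sp_release_to_freelist(slab_meta);
    pthread_mutex_unlock(&g_shared_pool.alloc_lock);
}
```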
### Lock Contention Sources (acquire_slab 3-Stage Logic)

```c
pthread_mutex_lock(&g_shared_pool.alloc_lock);

// Stage 1: Reuse EMPTY slots from per-class free list
if (sp_freelist_pop(class_idx, &reuse_meta, &reuse_slot_idx)) { ... }

// Stage 2: Find UNUSED slots in existing SuperSlabs
for (uint32_t i = 0; i < g_shared_pool.ss_meta_count; i++) {
    int unused_idx = sp_slot_find_unused(meta);
    if (unused_idx >= 0) { ... }
}

// Stage 3: Get new SuperSlab (LRU pop or mmap)
SuperSlab* new_ss = hak_ss_lru_pop(...);
if (!new_ss) {
    new_ss = shared_pool_allocate_superslab_unlocked();
}

pthread_mutex_unlock(&g_shared_pool.alloc_lock);
```

**All three stages are protected by a single coarse-grained lock!**

---

## Performance Impact

### Futex Syscall Analysis (from previous strace)

```
futex: 68% of syscall time (209 calls in 4T workload)
```

### Amdahl's Law Estimate

With lock acquisitions at **0.206%** of operations:

- Serial fraction: 0.206%
- Maximum speedup (∞ threads): **1 / 0.00206 ≈ 486x**

If the serial fraction were really only 0.206% of total work, Amdahl's law would predict near-perfect 4T→8T scaling (≈ 2.0x). The observed **1.44x** therefore implies that each lock acquisition costs far more than an ordinary operation (futex sleeps/wakeups plus cache-line transfers), making the *effective* serial fraction much larger than the raw acquisition rate.

**Bottleneck**: The lock serializes all threads inside `acquire_slab()`.

---

## Recommendations (P0-4 Implementation)

### Strategy: Lock-Free Per-Class Free Lists

Replace the `pthread_mutex` with **atomic CAS operations** for:

#### 1. Stage 1: Lock-Free Free List Pop (LIFO stack)

```c
// Current: protected by mutex
if (sp_freelist_pop(class_idx, &reuse_meta, &reuse_slot_idx)) { ... }

// Lock-free: atomic CAS-based stack pop
typedef struct {
    _Atomic(FreeSlotEntry*) head;  // Atomic pointer
} LockFreeFreeList;

FreeSlotEntry* sp_freelist_pop_lockfree(int class_idx) {
    LockFreeFreeList* list = &g_freelists[class_idx];  // hypothetical per-class table
    FreeSlotEntry* old_head = atomic_load(&list->head);
    do {
        if (old_head == NULL) return NULL;  // Empty
    } while (!atomic_compare_exchange_weak(&list->head, &old_head,
                                           old_head->next));
    return old_head;
}
```

#### 2. Stage 2: Lock-Free UNUSED Slot Search

Use **atomic claim operations** on the per-slot state (the linear scan remains, but each claim becomes a CAS):

```c
// Current: linear scan under lock
for (uint32_t i = 0; i < ss_meta_count; i++) {
    int unused_idx = sp_slot_find_unused(meta);
    if (unused_idx >= 0) { ... }
}

// Lock-free: atomic state scan + CAS claim
int sp_claim_unused_slot_lockfree(SharedSSMeta* meta) {
    for (int i = 0; i < meta->total_slots; i++) {
        SlotState expected = SLOT_UNUSED;
        if (atomic_compare_exchange_strong(&meta->slots[i].state,
                                           &expected, SLOT_ACTIVE)) {
            return i;  // Claimed!
        }
    }
    return -1;  // No unused slots
}
```
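One correctness caveat on the Stage 1 pop above: a plain CAS-based stack is exposed to the classic ABA hazard. If the head node is popped, recycled, and pushed back between another thread's `atomic_load` and its CAS, that CAS can succeed against a stale `next` pointer and corrupt the list. A common mitigation pairs the head pointer with a generation counter. The sketch below is an assumption-laden illustration: it requires a lock-free 16-byte CAS (e.g. cmpxchg16b on x86-64), and `TaggedFreeList` / `sp_freelist_pop_aba_safe` are hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct FreeSlotEntry { struct FreeSlotEntry* next; } FreeSlotEntry;

// Head pointer + generation counter, compared-and-swapped as one unit.
// Sketch only: assumes the platform provides a lock-free 16-byte CAS.
typedef struct {
    FreeSlotEntry* ptr;
    uintptr_t      gen;  // bumped on every successful pop
} TaggedHead;

typedef struct {
    _Atomic TaggedHead head;
} TaggedFreeList;

FreeSlotEntry* sp_freelist_pop_aba_safe(TaggedFreeList* list) {
    TaggedHead old_head = atomic_load(&list->head);
    for (;;) {
        if (old_head.ptr == NULL) return NULL;  // Empty
        TaggedHead new_head = { old_head.ptr->next, old_head.gen + 1 };
        // A recycled node re-pushed concurrently carries a different
        // generation, so the stale view fails the CAS and we retry.
        if (atomic_compare_exchange_weak(&list->head, &old_head, new_head))
            return old_head.ptr;
    }
}
```

The Stage 2 claim above is not exposed in the same way: its CAS succeeds only if the slot is `SLOT_UNUSED` at that instant, which is exactly the condition under which claiming it is correct.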
#### 3. Stage 3: Lock-Free SuperSlab Allocation

Use an **atomic counter + CAS** for `ss_meta_count`:

```c
// Current: realloc + capacity check under lock
if (sp_meta_ensure_capacity(g_shared_pool.ss_meta_count + 1) != 0) { ... }

// Lock-free: pre-allocate the metadata array, atomically bump the index
uint32_t idx = atomic_fetch_add(&g_shared_pool.ss_meta_count, 1);
if (idx >= g_shared_pool.ss_meta_capacity) {
    // Fallback: slow path with a mutex for capacity expansion
    pthread_mutex_lock(&g_capacity_lock);
    sp_meta_ensure_capacity(idx + 1);
    pthread_mutex_unlock(&g_capacity_lock);
}
```

### Expected Impact

- **Eliminate 658 mutex acquisitions** (8T workload)
- **Reduce futex share of syscall time from 68% to <5%**
- **Improve 4T→8T scaling from 1.44x to ~1.9x** (closer to linear)
- **Overall throughput: +50-73%** (based on the Task agent estimate)

---

## Implementation Plan (P0-4)

### Phase 1: Lock-Free Free List (Highest Impact)

- **Files**: `core/hakmem_shared_pool.c` (sp_freelist_pop/push)
- **Effort**: 2-3 hours
- **Expected**: +30-40% throughput (eliminates Stage 1 contention)

### Phase 2: Lock-Free Slot Claiming

- **Files**: `core/hakmem_shared_pool.c` (sp_slot_mark_active/empty)
- **Effort**: 3-4 hours
- **Expected**: +15-20% additional (eliminates Stage 2 contention)

### Phase 3: Lock-Free Metadata Growth

- **Files**: `core/hakmem_shared_pool.c` (sp_meta_ensure_capacity)
- **Effort**: 2-3 hours
- **Expected**: +5-10% additional (rare path, low contention)

### Total Expected Improvement

- **Conservative**: +50% (1.59M → 2.4M ops/s, 4T)
- **Optimistic**: +73% (Task agent estimate, from the 1.04M → 1.8M ops/s baseline)

---

## Testing Strategy (P0-5)

### A/B Comparison

1. **Baseline** (mutex): current implementation with stats
2. **Lock-Free** (CAS): after the P0-4 implementation

### Metrics

- Throughput (ops/s) - target: +50-73%
- futex syscalls - target: <10% of syscall time (from 68%)
- Lock acquisitions - target: 0 (fully lock-free)
- Scaling (4T→8T) - target: 1.9x (from 1.44x)

### Validation

- **Correctness**: run under TSan (ThreadSanitizer)
- **Stress test**: 100K iterations, 1-16 threads (see the sketch at the end of this document)
- **Performance**: compare with mimalloc (target: 70-90% of mimalloc throughput)

---

## Conclusion

Lock contention analysis reveals:

- **A single choke point**: the `acquire_slab()` mutex (100% of contention)
- **A lock-free opportunity**: all three stages can be converted to atomic CAS
- **Expected impact**: +50-73% throughput, near-linear scaling

**Next Step**: P0-4 - implement lock-free per-class free lists (CAS-based)

---

## Appendix: Instrumentation Code

### Added to `core/hakmem_shared_pool.c`

```c
// Atomic counters
static _Atomic uint64_t g_lock_acquire_count = 0;
static _Atomic uint64_t g_lock_release_count = 0;
static _Atomic uint64_t g_lock_acquire_slab_count = 0;
static _Atomic uint64_t g_lock_release_slab_count = 0;

// Report at shutdown
static void __attribute__((destructor)) lock_stats_report(void) {
    uint64_t acquires     = atomic_load(&g_lock_acquire_count);
    uint64_t releases     = atomic_load(&g_lock_release_count);
    uint64_t acquire_path = atomic_load(&g_lock_acquire_slab_count);
    uint64_t release_path = atomic_load(&g_lock_release_slab_count);
    double   total        = (double)(acquire_path + release_path);

    fprintf(stderr, "\n=== SHARED POOL LOCK STATISTICS ===\n");
    fprintf(stderr, "Total lock ops: %lu (acquire) + %lu (release)\n",
            acquires, releases);
    fprintf(stderr, "--- Breakdown by Code Path ---\n");
    fprintf(stderr, "acquire_slab(): %lu (%.1f%%)\n",
            acquire_path, total > 0 ? 100.0 * acquire_path / total : 0.0);
    fprintf(stderr, "release_slab(): %lu (%.1f%%)\n",
            release_path, total > 0 ? 100.0 * release_path / total : 0.0);
}
```

### Usage

```bash
export HAKMEM_SHARED_POOL_LOCK_STATS=1
./out/release/bench_mid_large_mt_hakmem 8 40000 2048 42
```
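For the stress-test row of the P0-5 validation matrix, the same benchmark binary can simply be looped over thread counts. A hedged sketch, assuming the argument order `<threads> <iterations> <block_size> <seed>` implied by the usage line above:

```bash
# P0-5 stress sketch: 100K iterations across 1-16 threads.
# Assumes args are: <threads> <iterations> <block_size> <seed>.
export HAKMEM_SHARED_POOL_LOCK_STATS=1
for t in 1 2 4 8 16; do
    ./out/release/bench_mid_large_mt_hakmem "$t" 100000 2048 42
done
```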