287 lines
8.3 KiB
Markdown
287 lines
8.3 KiB
Markdown
|
|
# Mid-Large Lock Contention Analysis (P0-3)
|
|||
|
|
|
|||
|
|
**Date**: 2025-11-14
|
|||
|
|
**Status**: ✅ **Analysis Complete** - Instrumentation reveals critical insights
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Executive Summary
|
|||
|
|
|
|||
|
|
Lock contention analysis for `g_shared_pool.alloc_lock` reveals:
|
|||
|
|
|
|||
|
|
- **100% of lock contention comes from `acquire_slab()` (allocation path)**
|
|||
|
|
- **0% from `release_slab()` (free path is effectively lock-free)**
|
|||
|
|
- **Lock acquisition rate: 0.206% (TLS hit rate: 99.8%)**
|
|||
|
|
- **Contention scales linearly with thread count**
|
|||
|
|
|
|||
|
|
### Key Insight
|
|||
|
|
|
|||
|
|
> **The release path is already lock-free in practice!**
|
|||
|
|
> `release_slab()` only acquires the lock when a slab becomes completely empty,
|
|||
|
|
> but in this workload, slabs stay active throughout execution.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Instrumentation Results
|
|||
|
|
|
|||
|
|
### Test Configuration
|
|||
|
|
- **Benchmark**: `bench_mid_large_mt_hakmem`
|
|||
|
|
- **Workload**: 40,000 iterations per thread, 2KB block size
|
|||
|
|
- **Environment**: `HAKMEM_SHARED_POOL_LOCK_STATS=1`
|
|||
|
|
|
|||
|
|
### 4-Thread Results
|
|||
|
|
```
|
|||
|
|
Throughput: 1,592,036 ops/s
|
|||
|
|
Total operations: 160,000 (4 × 40,000)
|
|||
|
|
Lock acquisitions: 330
|
|||
|
|
Lock rate: 0.206%
|
|||
|
|
|
|||
|
|
--- Breakdown by Code Path ---
|
|||
|
|
acquire_slab(): 330 (100.0%)
|
|||
|
|
release_slab(): 0 (0.0%)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 8-Thread Results
|
|||
|
|
```
|
|||
|
|
Throughput: 2,290,621 ops/s
|
|||
|
|
Total operations: 320,000 (8 × 40,000)
|
|||
|
|
Lock acquisitions: 658
|
|||
|
|
Lock rate: 0.206%
|
|||
|
|
|
|||
|
|
--- Breakdown by Code Path ---
|
|||
|
|
acquire_slab(): 658 (100.0%)
|
|||
|
|
release_slab(): 0 (0.0%)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Scaling Analysis
|
|||
|
|
| Threads | Ops | Lock Acq | Lock Rate | Throughput (ops/s) | Scaling |
|
|||
|
|
|---------|---------|----------|-----------|-------------------|---------|
|
|||
|
|
| 4T | 160,000 | 330 | 0.206% | 1,592,036 | 1.00x |
|
|||
|
|
| 8T | 320,000 | 658 | 0.206% | 2,290,621 | 1.44x |
|
|||
|
|
|
|||
|
|
**Observations**:
|
|||
|
|
- Lock acquisitions scale linearly: 8T ≈ 2× 4T (658 vs 330)
|
|||
|
|
- Lock rate is constant: 0.206% across all thread counts
|
|||
|
|
- Throughput scaling is sublinear: 1.44x (should be 2.0x for perfect scaling)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Root Cause Analysis
|
|||
|
|
|
|||
|
|
### Why 100% acquire_slab()?
|
|||
|
|
|
|||
|
|
`acquire_slab()` is called on **TLS cache miss** (happens when):
|
|||
|
|
1. Thread starts and has empty TLS cache
|
|||
|
|
2. TLS cache is depleted during execution
|
|||
|
|
|
|||
|
|
With **TLS hit rate of 99.8%**, only 0.2% of operations miss and hit the shared pool.
|
|||
|
|
|
|||
|
|
### Why 0% release_slab()?
|
|||
|
|
|
|||
|
|
`release_slab()` acquires lock only when:
|
|||
|
|
- `slab_meta->used == 0` (slab becomes completely empty)
|
|||
|
|
|
|||
|
|
In this workload:
|
|||
|
|
- Slabs stay active (partially full) throughout benchmark
|
|||
|
|
- No slab becomes completely empty → no lock acquisition
|
|||
|
|
|
|||
|
|
### Lock Contention Sources (acquire_slab 3-Stage Logic)
|
|||
|
|
|
|||
|
|
```c
|
|||
|
|
pthread_mutex_lock(&g_shared_pool.alloc_lock);
|
|||
|
|
|
|||
|
|
// Stage 1: Reuse EMPTY slots from per-class free list
|
|||
|
|
if (sp_freelist_pop(class_idx, &reuse_meta, &reuse_slot_idx)) { ... }
|
|||
|
|
|
|||
|
|
// Stage 2: Find UNUSED slots in existing SuperSlabs
|
|||
|
|
for (uint32_t i = 0; i < g_shared_pool.ss_meta_count; i++) {
|
|||
|
|
int unused_idx = sp_slot_find_unused(meta);
|
|||
|
|
if (unused_idx >= 0) { ... }
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
// Stage 3: Get new SuperSlab (LRU pop or mmap)
|
|||
|
|
SuperSlab* new_ss = hak_ss_lru_pop(...);
|
|||
|
|
if (!new_ss) {
|
|||
|
|
new_ss = shared_pool_allocate_superslab_unlocked();
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
pthread_mutex_unlock(&g_shared_pool.alloc_lock);
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**All 3 stages protected by single coarse-grained lock!**
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Performance Impact
|
|||
|
|
|
|||
|
|
### Futex Syscall Analysis (from previous strace)
|
|||
|
|
```
|
|||
|
|
futex: 68% of syscall time (209 calls in 4T workload)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Amdahl's Law Estimate
|
|||
|
|
|
|||
|
|
With lock contention at **0.206%** of operations:
|
|||
|
|
- Serial fraction: 0.206%
|
|||
|
|
- Maximum speedup (∞ threads): **1 / 0.00206 ≈ 486x**
|
|||
|
|
|
|||
|
|
But observed scaling (4T → 8T): **1.44x** (should be 2.0x)
|
|||
|
|
|
|||
|
|
**Bottleneck**: Lock serializes all threads during acquire_slab
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Recommendations (P0-4 Implementation)
|
|||
|
|
|
|||
|
|
### Strategy: Lock-Free Per-Class Free Lists
|
|||
|
|
|
|||
|
|
Replace `pthread_mutex` with **atomic CAS operations** for:
|
|||
|
|
|
|||
|
|
#### 1. Stage 1: Lock-Free Free List Pop (LIFO stack)
|
|||
|
|
```c
|
|||
|
|
// Current: protected by mutex
|
|||
|
|
if (sp_freelist_pop(class_idx, &reuse_meta, &reuse_slot_idx)) { ... }
|
|||
|
|
|
|||
|
|
// Lock-free: atomic CAS-based stack pop
|
|||
|
|
typedef struct {
|
|||
|
|
_Atomic(FreeSlotEntry*) head; // Atomic pointer
|
|||
|
|
} LockFreeFreeList;
|
|||
|
|
|
|||
|
|
FreeSlotEntry* sp_freelist_pop_lockfree(int class_idx) {
|
|||
|
|
FreeSlotEntry* old_head = atomic_load(&list->head);
|
|||
|
|
do {
|
|||
|
|
if (old_head == NULL) return NULL; // Empty
|
|||
|
|
} while (!atomic_compare_exchange_weak(
|
|||
|
|
&list->head, &old_head, old_head->next));
|
|||
|
|
return old_head;
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### 2. Stage 2: Lock-Free UNUSED Slot Search
|
|||
|
|
Use **atomic bit operations** on slab_bitmap:
|
|||
|
|
```c
|
|||
|
|
// Current: linear scan under lock
|
|||
|
|
for (uint32_t i = 0; i < ss_meta_count; i++) {
|
|||
|
|
int unused_idx = sp_slot_find_unused(meta);
|
|||
|
|
if (unused_idx >= 0) { ... }
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
// Lock-free: atomic bitmap scan + CAS claim
|
|||
|
|
int sp_claim_unused_slot_lockfree(SharedSSMeta* meta) {
|
|||
|
|
for (int i = 0; i < meta->total_slots; i++) {
|
|||
|
|
SlotState expected = SLOT_UNUSED;
|
|||
|
|
if (atomic_compare_exchange_strong(
|
|||
|
|
&meta->slots[i].state, &expected, SLOT_ACTIVE)) {
|
|||
|
|
return i; // Claimed!
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
return -1; // No unused slots
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### 3. Stage 3: Lock-Free SuperSlab Allocation
|
|||
|
|
Use **atomic counter + CAS** for ss_meta_count:
|
|||
|
|
```c
|
|||
|
|
// Current: realloc + capacity check under lock
|
|||
|
|
if (sp_meta_ensure_capacity(g_shared_pool.ss_meta_count + 1) != 0) { ... }
|
|||
|
|
|
|||
|
|
// Lock-free: pre-allocate metadata array, atomic index increment
|
|||
|
|
uint32_t idx = atomic_fetch_add(&g_shared_pool.ss_meta_count, 1);
|
|||
|
|
if (idx >= g_shared_pool.ss_meta_capacity) {
|
|||
|
|
// Fallback: slow path with mutex for capacity expansion
|
|||
|
|
pthread_mutex_lock(&g_capacity_lock);
|
|||
|
|
sp_meta_ensure_capacity(idx + 1);
|
|||
|
|
pthread_mutex_unlock(&g_capacity_lock);
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Expected Impact
|
|||
|
|
|
|||
|
|
- **Eliminate 658 mutex acquisitions** (8T workload)
|
|||
|
|
- **Reduce futex syscalls from 68% → <5%**
|
|||
|
|
- **Improve 4T→8T scaling from 1.44x → ~1.9x** (closer to linear)
|
|||
|
|
- **Overall throughput: +50-73%** (based on Task agent estimate)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Implementation Plan (P0-4)
|
|||
|
|
|
|||
|
|
### Phase 1: Lock-Free Free List (Highest Impact)
|
|||
|
|
**Files**: `core/hakmem_shared_pool.c` (sp_freelist_pop/push)
|
|||
|
|
**Effort**: 2-3 hours
|
|||
|
|
**Expected**: +30-40% throughput (eliminates Stage 1 contention)
|
|||
|
|
|
|||
|
|
### Phase 2: Lock-Free Slot Claiming
|
|||
|
|
**Files**: `core/hakmem_shared_pool.c` (sp_slot_mark_active/empty)
|
|||
|
|
**Effort**: 3-4 hours
|
|||
|
|
**Expected**: +15-20% additional (eliminates Stage 2 contention)
|
|||
|
|
|
|||
|
|
### Phase 3: Lock-Free Metadata Growth
|
|||
|
|
**Files**: `core/hakmem_shared_pool.c` (sp_meta_ensure_capacity)
|
|||
|
|
**Effort**: 2-3 hours
|
|||
|
|
**Expected**: +5-10% additional (rare path, low contention)
|
|||
|
|
|
|||
|
|
### Total Expected Improvement
|
|||
|
|
- **Conservative**: +50% (1.59M → 2.4M ops/s, 4T)
|
|||
|
|
- **Optimistic**: +73% (Task agent estimate, 1.04M → 1.8M ops/s baseline)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Testing Strategy (P0-5)
|
|||
|
|
|
|||
|
|
### A/B Comparison
|
|||
|
|
1. **Baseline** (mutex): Current implementation with stats
|
|||
|
|
2. **Lock-Free** (CAS): After P0-4 implementation
|
|||
|
|
|
|||
|
|
### Metrics
|
|||
|
|
- Throughput (ops/s) - target: +50-73%
|
|||
|
|
- futex syscalls - target: <10% (from 68%)
|
|||
|
|
- Lock acquisitions - target: 0 (fully lock-free)
|
|||
|
|
- Scaling (4T→8T) - target: 1.9x (from 1.44x)
|
|||
|
|
|
|||
|
|
### Validation
|
|||
|
|
- **Correctness**: Run with TSan (Thread Sanitizer)
|
|||
|
|
- **Stress test**: 100K iterations, 1-16 threads
|
|||
|
|
- **Performance**: Compare with mimalloc (target: 70-90% of mimalloc)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Conclusion
|
|||
|
|
|
|||
|
|
Lock contention analysis reveals:
|
|||
|
|
- **Single choke point**: `acquire_slab()` mutex (100% of contention)
|
|||
|
|
- **Lock-free opportunity**: All 3 stages can be converted to atomic CAS
|
|||
|
|
- **Expected impact**: +50-73% throughput, near-linear scaling
|
|||
|
|
|
|||
|
|
**Next Step**: P0-4 - Implement lock-free per-class free lists (CAS-based)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Appendix: Instrumentation Code
|
|||
|
|
|
|||
|
|
### Added to `core/hakmem_shared_pool.c`
|
|||
|
|
|
|||
|
|
```c
|
|||
|
|
// Atomic counters
|
|||
|
|
static _Atomic uint64_t g_lock_acquire_count = 0;
|
|||
|
|
static _Atomic uint64_t g_lock_release_count = 0;
|
|||
|
|
static _Atomic uint64_t g_lock_acquire_slab_count = 0;
|
|||
|
|
static _Atomic uint64_t g_lock_release_slab_count = 0;
|
|||
|
|
|
|||
|
|
// Report at shutdown
|
|||
|
|
static void __attribute__((destructor)) lock_stats_report(void) {
|
|||
|
|
fprintf(stderr, "\n=== SHARED POOL LOCK STATISTICS ===\n");
|
|||
|
|
fprintf(stderr, "Total lock ops: %lu (acquire) + %lu (release)\n",
|
|||
|
|
acquires, releases);
|
|||
|
|
fprintf(stderr, "--- Breakdown by Code Path ---\n");
|
|||
|
|
fprintf(stderr, "acquire_slab(): %lu (%.1f%%)\n", acquire_path, ...);
|
|||
|
|
fprintf(stderr, "release_slab(): %lu (%.1f%%)\n", release_path, ...);
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Usage
|
|||
|
|
```bash
|
|||
|
|
export HAKMEM_SHARED_POOL_LOCK_STATS=1
|
|||
|
|
./out/release/bench_mid_large_mt_hakmem 8 40000 2048 42
|
|||
|
|
```
|