hakmem/docs/analysis/MID_LARGE_LOCK_CONTENTION_ANALYSIS.md

# Mid-Large Lock Contention Analysis (P0-3)

**Date**: 2025-11-14
**Status**: ✅ **Analysis Complete** - Instrumentation reveals critical insights

---

## Executive Summary

Lock contention analysis for `g_shared_pool.alloc_lock` reveals:

- **100% of lock contention comes from `acquire_slab()` (allocation path)**
- **0% from `release_slab()` (free path is effectively lock-free)**
- **Lock acquisition rate: 0.206% (TLS hit rate: 99.8%)**
- **Contention scales linearly with thread count**

### Key Insight

> **The release path is already lock-free in practice!**
> `release_slab()` only acquires the lock when a slab becomes completely empty,
> but in this workload, slabs stay active throughout execution.

---

## Instrumentation Results

### Test Configuration
- **Benchmark**: `bench_mid_large_mt_hakmem`
- **Workload**: 40,000 iterations per thread, 2KB block size
- **Environment**: `HAKMEM_SHARED_POOL_LOCK_STATS=1`

### 4-Thread Results
```
Throughput:        1,592,036 ops/s
Total operations:  160,000 (4 × 40,000)
Lock acquisitions: 330
Lock rate:         0.206%

--- Breakdown by Code Path ---
acquire_slab():    330 (100.0%)
release_slab():    0 (0.0%)
```

### 8-Thread Results
```
Throughput:        2,290,621 ops/s
Total operations:  320,000 (8 × 40,000)
Lock acquisitions: 658
Lock rate:         0.206%

--- Breakdown by Code Path ---
acquire_slab():    658 (100.0%)
release_slab():    0 (0.0%)
```

### Scaling Analysis
| Threads | Ops     | Lock Acq | Lock Rate | Throughput (ops/s) | Scaling |
|---------|---------|----------|-----------|-------------------|---------|
| 4T      | 160,000 | 330      | 0.206%    | 1,592,036         | 1.00x   |
| 8T      | 320,000 | 658      | 0.206%    | 2,290,621         | 1.44x   |

**Observations**:
- Lock acquisitions scale linearly: 8T ≈ 2× 4T (658 vs 330)
- Lock rate is constant: 0.206% across all thread counts
- Throughput scaling is sublinear: 1.44x (should be 2.0x for perfect scaling)

---

## Root Cause Analysis

### Why 100% acquire_slab()?

`acquire_slab()` is called on **TLS cache miss** (happens when):
1. Thread starts and has empty TLS cache
2. TLS cache is depleted during execution

With **TLS hit rate of 99.8%**, only 0.2% of operations miss and hit the shared pool.

### Why 0% release_slab()?

`release_slab()` acquires lock only when:
- `slab_meta->used == 0` (slab becomes completely empty)

In this workload:
- Slabs stay active (partially full) throughout benchmark
- No slab becomes completely empty → no lock acquisition

### Lock Contention Sources (acquire_slab 3-Stage Logic)

```c
pthread_mutex_lock(&g_shared_pool.alloc_lock);

// Stage 1: Reuse EMPTY slots from per-class free list
if (sp_freelist_pop(class_idx, &reuse_meta, &reuse_slot_idx)) { ... }

// Stage 2: Find UNUSED slots in existing SuperSlabs
for (uint32_t i = 0; i < g_shared_pool.ss_meta_count; i++) {
    int unused_idx = sp_slot_find_unused(meta);
    if (unused_idx >= 0) { ... }
}

// Stage 3: Get new SuperSlab (LRU pop or mmap)
SuperSlab* new_ss = hak_ss_lru_pop(...);
if (!new_ss) {
    new_ss = shared_pool_allocate_superslab_unlocked();
}

pthread_mutex_unlock(&g_shared_pool.alloc_lock);
```

**All 3 stages protected by single coarse-grained lock!**

---

## Performance Impact

### Futex Syscall Analysis (from previous strace)
```
futex: 68% of syscall time (209 calls in 4T workload)
```

### Amdahl's Law Estimate

With lock contention at **0.206%** of operations:
- Serial fraction: 0.206%
- Maximum speedup (∞ threads): **1 / 0.00206 ≈ 486x**

But observed scaling (4T → 8T): **1.44x** (should be 2.0x)

**Bottleneck**: Lock serializes all threads during acquire_slab

---

## Recommendations (P0-4 Implementation)

### Strategy: Lock-Free Per-Class Free Lists

Replace `pthread_mutex` with **atomic CAS operations** for:

#### 1. Stage 1: Lock-Free Free List Pop (LIFO stack)
```c
// Current: protected by mutex
if (sp_freelist_pop(class_idx, &reuse_meta, &reuse_slot_idx)) { ... }

// Lock-free: atomic CAS-based stack pop
typedef struct {
    _Atomic(FreeSlotEntry*) head;  // Atomic pointer
} LockFreeFreeList;

FreeSlotEntry* sp_freelist_pop_lockfree(int class_idx) {
    FreeSlotEntry* old_head = atomic_load(&list->head);
    do {
        if (old_head == NULL) return NULL;  // Empty
    } while (!atomic_compare_exchange_weak(
        &list->head, &old_head, old_head->next));
    return old_head;
}
```

#### 2. Stage 2: Lock-Free UNUSED Slot Search
Use **atomic bit operations** on slab_bitmap:
```c
// Current: linear scan under lock
for (uint32_t i = 0; i < ss_meta_count; i++) {
    int unused_idx = sp_slot_find_unused(meta);
    if (unused_idx >= 0) { ... }
}

// Lock-free: atomic bitmap scan + CAS claim
int sp_claim_unused_slot_lockfree(SharedSSMeta* meta) {
    for (int i = 0; i < meta->total_slots; i++) {
        SlotState expected = SLOT_UNUSED;
        if (atomic_compare_exchange_strong(
            &meta->slots[i].state, &expected, SLOT_ACTIVE)) {
            return i;  // Claimed!
        }
    }
    return -1;  // No unused slots
}
```

#### 3. Stage 3: Lock-Free SuperSlab Allocation
Use **atomic counter + CAS** for ss_meta_count:
```c
// Current: realloc + capacity check under lock
if (sp_meta_ensure_capacity(g_shared_pool.ss_meta_count + 1) != 0) { ... }

// Lock-free: pre-allocate metadata array, atomic index increment
uint32_t idx = atomic_fetch_add(&g_shared_pool.ss_meta_count, 1);
if (idx >= g_shared_pool.ss_meta_capacity) {
    // Fallback: slow path with mutex for capacity expansion
    pthread_mutex_lock(&g_capacity_lock);
    sp_meta_ensure_capacity(idx + 1);
    pthread_mutex_unlock(&g_capacity_lock);
}
```

### Expected Impact

- **Eliminate 658 mutex acquisitions** (8T workload)
- **Reduce futex syscalls from 68% → <5%**
- **Improve 4T→8T scaling from 1.44x → ~1.9x** (closer to linear)
- **Overall throughput: +50-73%** (based on Task agent estimate)

---

## Implementation Plan (P0-4)

### Phase 1: Lock-Free Free List (Highest Impact)
**Files**: `core/hakmem_shared_pool.c` (sp_freelist_pop/push)
**Effort**: 2-3 hours
**Expected**: +30-40% throughput (eliminates Stage 1 contention)

### Phase 2: Lock-Free Slot Claiming
**Files**: `core/hakmem_shared_pool.c` (sp_slot_mark_active/empty)
**Effort**: 3-4 hours
**Expected**: +15-20% additional (eliminates Stage 2 contention)

### Phase 3: Lock-Free Metadata Growth
**Files**: `core/hakmem_shared_pool.c` (sp_meta_ensure_capacity)
**Effort**: 2-3 hours
**Expected**: +5-10% additional (rare path, low contention)

### Total Expected Improvement
- **Conservative**: +50% (1.59M → 2.4M ops/s, 4T)
- **Optimistic**: +73% (Task agent estimate, 1.04M → 1.8M ops/s baseline)

---

## Testing Strategy (P0-5)

### A/B Comparison
1. **Baseline** (mutex): Current implementation with stats
2. **Lock-Free** (CAS): After P0-4 implementation

### Metrics
- Throughput (ops/s) - target: +50-73%
- futex syscalls - target: <10% (from 68%)
- Lock acquisitions - target: 0 (fully lock-free)
- Scaling (4T→8T) - target: 1.9x (from 1.44x)

### Validation
- **Correctness**: Run with TSan (Thread Sanitizer)
- **Stress test**: 100K iterations, 1-16 threads
- **Performance**: Compare with mimalloc (target: 70-90% of mimalloc)

---

## Conclusion

Lock contention analysis reveals:
- **Single choke point**: `acquire_slab()` mutex (100% of contention)
- **Lock-free opportunity**: All 3 stages can be converted to atomic CAS
- **Expected impact**: +50-73% throughput, near-linear scaling

**Next Step**: P0-4 - Implement lock-free per-class free lists (CAS-based)

---

## Appendix: Instrumentation Code

### Added to `core/hakmem_shared_pool.c`

```c
// Atomic counters
static _Atomic uint64_t g_lock_acquire_count = 0;
static _Atomic uint64_t g_lock_release_count = 0;
static _Atomic uint64_t g_lock_acquire_slab_count = 0;
static _Atomic uint64_t g_lock_release_slab_count = 0;

// Report at shutdown
static void __attribute__((destructor)) lock_stats_report(void) {
    fprintf(stderr, "\n=== SHARED POOL LOCK STATISTICS ===\n");
    fprintf(stderr, "Total lock ops:    %lu (acquire) + %lu (release)\n",
            acquires, releases);
    fprintf(stderr, "--- Breakdown by Code Path ---\n");
    fprintf(stderr, "acquire_slab():    %lu (%.1f%%)\n", acquire_path, ...);
    fprintf(stderr, "release_slab():    %lu (%.1f%%)\n", release_path, ...);
}
```

### Usage
```bash
export HAKMEM_SHARED_POOL_LOCK_STATS=1
./out/release/bench_mid_large_mt_hakmem 8 40000 2048 42
```
-												Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)

## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-26 13:14:18 +09:00
+								# Mid-Large Lock Contention Analysis (P0-3)
 								**Date**: 2025-11-14
 								**Status**: ✅ **Analysis Complete** - Instrumentation reveals critical insights
 								---
 								## Executive Summary
 								Lock contention analysis for `g_shared_pool.alloc_lock` reveals:
 								- **100% of lock contention comes from `acquire_slab()` (allocation path)**
 								- **0% from `release_slab()` (free path is effectively lock-free)**
 								- **Lock acquisition rate: 0.206% (TLS hit rate: 99.8%)**
 								- **Contention scales linearly with thread count**
 								### Key Insight
 								> **The release path is already lock-free in practice!**
 								> `release_slab()` only acquires the lock when a slab becomes completely empty,
 								> but in this workload, slabs stay active throughout execution.
 								---
 								## Instrumentation Results
 								### Test Configuration
 								- **Benchmark**: `bench_mid_large_mt_hakmem`
 								- **Workload**: 40,000 iterations per thread, 2KB block size
 								- **Environment**: `HAKMEM_SHARED_POOL_LOCK_STATS=1`
 								### 4-Thread Results
 								```
 								Throughput:        1,592,036 ops/s
 								Total operations:  160,000 (4 × 40,000)
 								Lock acquisitions: 330
 								Lock rate:         0.206%
 								--- Breakdown by Code Path ---
 								acquire_slab():    330 (100.0%)
 								release_slab():    0 (0.0%)
 								```
 								### 8-Thread Results
 								```
 								Throughput:        2,290,621 ops/s
 								Total operations:  320,000 (8 × 40,000)
 								Lock acquisitions: 658
 								Lock rate:         0.206%
 								--- Breakdown by Code Path ---
 								acquire_slab():    658 (100.0%)
 								release_slab():    0 (0.0%)
 								```
 								### Scaling Analysis
 								| Threads | Ops     | Lock Acq | Lock Rate | Throughput (ops/s) | Scaling |
 								|---------|---------|----------|-----------|-------------------|---------|
 								| 4T      | 160,000 | 330      | 0.206%    | 1,592,036         | 1.00x   |
 								| 8T      | 320,000 | 658      | 0.206%    | 2,290,621         | 1.44x   |
 								**Observations**:
 								- Lock acquisitions scale linearly: 8T ≈ 2× 4T (658 vs 330)
 								- Lock rate is constant: 0.206% across all thread counts
 								- Throughput scaling is sublinear: 1.44x (should be 2.0x for perfect scaling)
 								---
 								## Root Cause Analysis
 								### Why 100% acquire_slab()?
 								`acquire_slab()` is called on **TLS cache miss** (happens when):
 . Thread starts and has empty TLS cache
 . TLS cache is depleted during execution
 								With **TLS hit rate of 99.8%**, only 0.2% of operations miss and hit the shared pool.
 								### Why 0% release_slab()?
 								`release_slab()` acquires lock only when:
 								- `slab_meta->used == 0` (slab becomes completely empty)
 								In this workload:
 								- Slabs stay active (partially full) throughout benchmark
 								- No slab becomes completely empty → no lock acquisition
 								### Lock Contention Sources (acquire_slab 3-Stage Logic)
 								```c
 								pthread_mutex_lock(&g_shared_pool.alloc_lock);
 								// Stage 1: Reuse EMPTY slots from per-class free list
 								if (sp_freelist_pop(class_idx, &reuse_meta, &reuse_slot_idx)) { ... }
 								// Stage 2: Find UNUSED slots in existing SuperSlabs
 								for (uint32_t i = 0; i < g_shared_pool.ss_meta_count; i++) {
 								    int unused_idx = sp_slot_find_unused(meta);
 								    if (unused_idx >= 0) { ... }
 								}
 								// Stage 3: Get new SuperSlab (LRU pop or mmap)
 								SuperSlab* new_ss = hak_ss_lru_pop(...);
 								if (!new_ss) {
 								    new_ss = shared_pool_allocate_superslab_unlocked();
 								}
 								pthread_mutex_unlock(&g_shared_pool.alloc_lock);
 								```
 								**All 3 stages protected by single coarse-grained lock!**
 								---
 								## Performance Impact
 								### Futex Syscall Analysis (from previous strace)
 								```
 								futex: 68% of syscall time (209 calls in 4T workload)
 								```
 								### Amdahl's Law Estimate
 								With lock contention at **0.206%** of operations:
 								- Serial fraction: 0.206%
 								- Maximum speedup (∞ threads): **1 / 0.00206 ≈ 486x**
 								But observed scaling (4T → 8T): **1.44x** (should be 2.0x)
 								**Bottleneck**: Lock serializes all threads during acquire_slab
 								---
 								## Recommendations (P0-4 Implementation)
 								### Strategy: Lock-Free Per-Class Free Lists
 								Replace `pthread_mutex` with **atomic CAS operations** for:
 								#### 1. Stage 1: Lock-Free Free List Pop (LIFO stack)
 								```c
 								// Current: protected by mutex
 								if (sp_freelist_pop(class_idx, &reuse_meta, &reuse_slot_idx)) { ... }
 								// Lock-free: atomic CAS-based stack pop
 								typedef struct {
 								    _Atomic(FreeSlotEntry*) head;  // Atomic pointer
 								} LockFreeFreeList;
 								FreeSlotEntry* sp_freelist_pop_lockfree(int class_idx) {
 								    FreeSlotEntry* old_head = atomic_load(&list->head);
 								    do {
 								        if (old_head == NULL) return NULL;  // Empty
 								    } while (!atomic_compare_exchange_weak(
 								        &list->head, &old_head, old_head->next));
 								    return old_head;
 								}
 								```
 								#### 2. Stage 2: Lock-Free UNUSED Slot Search
 								Use **atomic bit operations** on slab_bitmap:
 								```c
 								// Current: linear scan under lock
 								for (uint32_t i = 0; i < ss_meta_count; i++) {
 								    int unused_idx = sp_slot_find_unused(meta);
 								    if (unused_idx >= 0) { ... }
 								}
 								// Lock-free: atomic bitmap scan + CAS claim
 								int sp_claim_unused_slot_lockfree(SharedSSMeta* meta) {
 								    for (int i = 0; i < meta->total_slots; i++) {
 								        SlotState expected = SLOT_UNUSED;
 								        if (atomic_compare_exchange_strong(
 								            &meta->slots[i].state, &expected, SLOT_ACTIVE)) {
 								            return i;  // Claimed!
 								        }
 								    }
 								    return -1;  // No unused slots
 								}
 								```
 								#### 3. Stage 3: Lock-Free SuperSlab Allocation
 								Use **atomic counter + CAS** for ss_meta_count:
 								```c
 								// Current: realloc + capacity check under lock
 								if (sp_meta_ensure_capacity(g_shared_pool.ss_meta_count + 1) != 0) { ... }
 								// Lock-free: pre-allocate metadata array, atomic index increment
 								uint32_t idx = atomic_fetch_add(&g_shared_pool.ss_meta_count, 1);
 								if (idx >= g_shared_pool.ss_meta_capacity) {
 								    // Fallback: slow path with mutex for capacity expansion
 								    pthread_mutex_lock(&g_capacity_lock);
 								    sp_meta_ensure_capacity(idx + 1);
 								    pthread_mutex_unlock(&g_capacity_lock);
 								}
 								```
 								### Expected Impact
 								- **Eliminate 658 mutex acquisitions** (8T workload)
 								- **Reduce futex syscalls from 68% → <5%**
 								- **Improve 4T→8T scaling from 1.44x → ~1.9x** (closer to linear)
 								- **Overall throughput: +50-73%** (based on Task agent estimate)
 								---
 								## Implementation Plan (P0-4)
 								### Phase 1: Lock-Free Free List (Highest Impact)
 								**Files**: `core/hakmem_shared_pool.c` (sp_freelist_pop/push)
 								**Effort**: 2-3 hours
 								**Expected**: +30-40% throughput (eliminates Stage 1 contention)
 								### Phase 2: Lock-Free Slot Claiming
 								**Files**: `core/hakmem_shared_pool.c` (sp_slot_mark_active/empty)
 								**Effort**: 3-4 hours
 								**Expected**: +15-20% additional (eliminates Stage 2 contention)
 								### Phase 3: Lock-Free Metadata Growth
 								**Files**: `core/hakmem_shared_pool.c` (sp_meta_ensure_capacity)
 								**Effort**: 2-3 hours
 								**Expected**: +5-10% additional (rare path, low contention)
 								### Total Expected Improvement
 								- **Conservative**: +50% (1.59M → 2.4M ops/s, 4T)
 								- **Optimistic**: +73% (Task agent estimate, 1.04M → 1.8M ops/s baseline)
 								---
 								## Testing Strategy (P0-5)
 								### A/B Comparison
 . **Baseline** (mutex): Current implementation with stats
 . **Lock-Free** (CAS): After P0-4 implementation
 								### Metrics
 								- Throughput (ops/s) - target: +50-73%
 								- futex syscalls - target: <10% (from 68%)
 								- Lock acquisitions - target: 0 (fully lock-free)
 								- Scaling (4T→8T) - target: 1.9x (from 1.44x)
 								### Validation
 								- **Correctness**: Run with TSan (Thread Sanitizer)
 								- **Stress test**: 100K iterations, 1-16 threads
 								- **Performance**: Compare with mimalloc (target: 70-90% of mimalloc)
 								---
 								## Conclusion
 								Lock contention analysis reveals:
 								- **Single choke point**: `acquire_slab()` mutex (100% of contention)
 								- **Lock-free opportunity**: All 3 stages can be converted to atomic CAS
 								- **Expected impact**: +50-73% throughput, near-linear scaling
 								**Next Step**: P0-4 - Implement lock-free per-class free lists (CAS-based)
 								---
 								## Appendix: Instrumentation Code
 								### Added to `core/hakmem_shared_pool.c`
 								```c
 								// Atomic counters
 								static _Atomic uint64_t g_lock_acquire_count = 0;
 								static _Atomic uint64_t g_lock_release_count = 0;
 								static _Atomic uint64_t g_lock_acquire_slab_count = 0;
 								static _Atomic uint64_t g_lock_release_slab_count = 0;
 								// Report at shutdown
 								static void __attribute__((destructor)) lock_stats_report(void) {
 								    fprintf(stderr, "\n=== SHARED POOL LOCK STATISTICS ===\n");
 								    fprintf(stderr, "Total lock ops:    %lu (acquire) + %lu (release)\n",
 								            acquires, releases);
 								    fprintf(stderr, "--- Breakdown by Code Path ---\n");
 								    fprintf(stderr, "acquire_slab():    %lu (%.1f%%)\n", acquire_path, ...);
 								    fprintf(stderr, "release_slab():    %lu (%.1f%%)\n", release_path, ...);
 								}
 								```
 								### Usage
 								```bash
 								export HAKMEM_SHARED_POOL_LOCK_STATS=1
 								./out/release/bench_mid_large_mt_hakmem 8 40000 2048 42
 								```