378 lines
12 KiB
Markdown
378 lines
12 KiB
Markdown
|
|
# Phase 7.1 MF1: Lock-Free Freelist Results
|
|||
|
|
**Date**: 2025-10-24
|
|||
|
|
**Goal**: Eliminate 56 mutexes (7 classes × 8 shards) by replacing with lock-free CAS operations
|
|||
|
|
**Expected**: +15-25% improvement
|
|||
|
|
**Actual**: -3% regression ❌
|
|||
|
|
|
|||
|
|
## Summary
|
|||
|
|
|
|||
|
|
Successfully implemented lock-free freelist using atomic CAS operations, eliminating all 56 mutex locks from Mid Pool. However, performance DECREASED by ~3% instead of the expected 15-25% improvement.
|
|||
|
|
|
|||
|
|
This is a valuable finding: **naive lock-free implementations aren't always faster than mutexes**.
|
|||
|
|
|
|||
|
|
## Implementation Details
|
|||
|
|
|
|||
|
|
### Changes Made
|
|||
|
|
|
|||
|
|
1. **Data Structure** (hakmem_pool.c:277-279):
|
|||
|
|
```c
|
|||
|
|
// Before: PoolBlock* freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
|
|||
|
|
// After: atomic_uintptr_t freelist[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
|
|||
|
|
|
|||
|
|
// Removed: PaddedMutex freelist_locks[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
2. **Lock-Free Operations** (hakmem_pool.c:431-556):
|
|||
|
|
- `freelist_pop_lockfree()`: Single-block atomic pop using CAS
|
|||
|
|
- `freelist_push_lockfree()`: Single-block atomic push using CAS
|
|||
|
|
- `freelist_batch_pop_lockfree()`: Batch pop for TLS ring filling
|
|||
|
|
- `drain_remote_lockfree()`: Atomic drain of remote stack to freelist
|
|||
|
|
|
|||
|
|
3. **Call Sites Updated**:
|
|||
|
|
- Line 992-1015: trylock batch pop → lock-free batch pop
|
|||
|
|
- Line 1042-1047: locked pop → lock-free pop
|
|||
|
|
- Line 1058-1083: locked drain & shard stealing → lock-free versions
|
|||
|
|
- Line 1300-1302: locked push → lock-free push
|
|||
|
|
|
|||
|
|
### Code Size
|
|||
|
|
- Added: ~130 LOC (lock-free helper functions)
|
|||
|
|
- Removed: ~50 LOC (mutex lock/unlock calls)
|
|||
|
|
- Modified: ~80 LOC (call site updates)
|
|||
|
|
- Net: +160 LOC
|
|||
|
|
|
|||
|
|
## Benchmark Results
|
|||
|
|
|
|||
|
|
### Mid Pool (larson 10s, 2-32 KiB)
|
|||
|
|
|
|||
|
|
| Threads | Before (P6.25) | After (P7.1) | Change | Expected |
|
|||
|
|
|---------|----------------|--------------|--------|----------|
|
|||
|
|
| 1T | 4.03 M/s | 3.89 M/s | **-3.5%** ❌ | +10% |
|
|||
|
|
| 4T | 13.78 M/s | 13.34 M/s | **-3.2%** ❌ | +15-25% |
|
|||
|
|
|
|||
|
|
**Conclusion**: Lock-free implementation is SLOWER than mutex-based version on both 1T and 4T.
|
|||
|
|
|
|||
|
|
## Root Cause Analysis
|
|||
|
|
|
|||
|
|
### Why Is It Slower?
|
|||
|
|
|
|||
|
|
#### 1. Batch Pop Overhead (High Confidence) 🔥
|
|||
|
|
|
|||
|
|
**Problem**: `freelist_batch_pop_lockfree()` walks the freelist chain INSIDE the CAS retry loop.
|
|||
|
|
|
|||
|
|
```c
|
|||
|
|
do {
|
|||
|
|
old_head = atomic_load(...);
|
|||
|
|
head = (PoolBlock*)old_head;
|
|||
|
|
tail = head;
|
|||
|
|
batch_size = 1;
|
|||
|
|
|
|||
|
|
// PROBLEM: Walking chain inside CAS loop!
|
|||
|
|
while (tail->next && batch_size < max_pop) {
|
|||
|
|
tail = tail->next; // Slow pointer chasing
|
|||
|
|
batch_size++;
|
|||
|
|
}
|
|||
|
|
} while (!atomic_compare_exchange_weak(...)); // If fails, walk again!
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Impact**:
|
|||
|
|
- With 4 threads, CAS contention is high
|
|||
|
|
- Each retry requires re-walking the chain (pointer chasing)
|
|||
|
|
- Example: Walking 32 blocks = 32 cache misses per retry
|
|||
|
|
- With 50% CAS retry rate, this DOUBLES the work
|
|||
|
|
|
|||
|
|
**Mutex Comparison**:
|
|||
|
|
- Mutex-based version walked the chain ONCE under lock
|
|||
|
|
- Lock contention might be high, but no wasted work
|
|||
|
|
|
|||
|
|
#### 2. Cache Line Bouncing (Medium Confidence)
|
|||
|
|
|
|||
|
|
**Problem**: Atomic operations cause more aggressive cache line invalidation than mutexes.
|
|||
|
|
|
|||
|
|
- **Mutexes**: Only bounce when thread acquires/releases lock
|
|||
|
|
- **Atomics**: Every CAS attempt bounces the cache line
|
|||
|
|
|
|||
|
|
With 4 threads hammering the same freelist head, we're bouncing the cache line on EVERY allocation attempt.
|
|||
|
|
|
|||
|
|
#### 3. Single-Thread Overhead (Medium Confidence)
|
|||
|
|
|
|||
|
|
Even 1T is slower (-3.5%), suggesting overhead beyond contention:
|
|||
|
|
|
|||
|
|
- **Memory ordering**: `memory_order_acquire/release` has fence overhead
|
|||
|
|
- **CAS overhead**: Even successful CAS is slower than direct assignment
|
|||
|
|
- **Nonempty mask updates**: More atomic operations for bookkeeping
|
|||
|
|
|
|||
|
|
#### 4. Speculative Execution Barriers (Low Confidence)
|
|||
|
|
|
|||
|
|
Atomic operations with acquire/release semantics create memory barriers that prevent CPU speculation and out-of-order execution.
|
|||
|
|
|
|||
|
|
## What We Learned
|
|||
|
|
|
|||
|
|
### 1. Lock-Free != Always Faster
|
|||
|
|
|
|||
|
|
**Myth**: "Lock-free is always faster than locks"
|
|||
|
|
**Reality**: Lock-free trades lock contention for CAS contention + retry overhead
|
|||
|
|
|
|||
|
|
**When Locks Win**:
|
|||
|
|
- Critical section does significant work (e.g., walking chains)
|
|||
|
|
- Lock holder's work amortizes lock acquisition cost
|
|||
|
|
- Low contention scenarios
|
|||
|
|
|
|||
|
|
**When Lock-Free Wins**:
|
|||
|
|
- Critical section is trivial (e.g., single pointer swap)
|
|||
|
|
- Very high contention on short critical sections
|
|||
|
|
- Need wait-free progress guarantees
|
|||
|
|
|
|||
|
|
### 2. Retry Overhead Is Real
|
|||
|
|
|
|||
|
|
CAS retry loops can do MASSIVE wasted work if:
|
|||
|
|
- Retry operation is expensive (pointer chasing, computation)
|
|||
|
|
- Contention is high (50%+ retry rate)
|
|||
|
|
|
|||
|
|
**Our Case**: Walking 32-block chain with 50% retry rate = 2x overhead
|
|||
|
|
|
|||
|
|
### 3. Memory Ordering Matters
|
|||
|
|
|
|||
|
|
`memory_order_acquire/release` isn't free:
|
|||
|
|
- Creates memory barriers
|
|||
|
|
- Prevents speculation
|
|||
|
|
- Flushes store buffers
|
|||
|
|
|
|||
|
|
For hot paths, might need `memory_order_relaxed` where safe.
|
|||
|
|
|
|||
|
|
## Next Steps
|
|||
|
|
|
|||
|
|
### Option A: Optimize Current Lock-Free Implementation
|
|||
|
|
|
|||
|
|
**A1. Batch Pop Optimization** (Quick, High Impact)
|
|||
|
|
- Walk chain ONCE before CAS loop
|
|||
|
|
- Use versioned pointers (ABA protection) to detect modifications
|
|||
|
|
- Or: Limit batch size to small constant (e.g., 4 blocks) to reduce walk overhead
|
|||
|
|
|
|||
|
|
**A2. Memory Ordering Relaxation** (Quick, Medium Impact)
|
|||
|
|
- Use `memory_order_relaxed` for nonempty mask updates
|
|||
|
|
- Use `memory_order_consume` instead of `acquire` where possible
|
|||
|
|
- Profile to identify safe relaxation points
|
|||
|
|
|
|||
|
|
**A3. Hybrid Approach** (Medium, Medium Impact)
|
|||
|
|
- Keep lock-free for single-block pop/push (fast path)
|
|||
|
|
- Use mutex for batch operations (slow path with complex work)
|
|||
|
|
|
|||
|
|
### Option B: Revert to Mutexes + Different Approach
|
|||
|
|
|
|||
|
|
**B1. Per-Page Sharding** (MF2 from battle plan)
|
|||
|
|
- Like mimalloc: O(1) page lookup from block address
|
|||
|
|
- No shared freelist at all (every page is independent)
|
|||
|
|
- Expected: +50% improvement
|
|||
|
|
- Effort: 20-30 hours
|
|||
|
|
|
|||
|
|
**B2. Reduce Lock Granularity**
|
|||
|
|
- Keep mutexes but reduce from 56 to 7 (one per class, no sharding)
|
|||
|
|
- Or: Single global lock with optimistic lock-free fast path
|
|||
|
|
|
|||
|
|
### Option C: Targeted Lock-Free (Best of Both)
|
|||
|
|
|
|||
|
|
Keep mutexes for batch operations, lock-free for:
|
|||
|
|
- **Remote-free stacks**: Already lock-free, works well ✅
|
|||
|
|
- **Single-block pop/push**: Critical fast path, simple CAS
|
|||
|
|
- **Batch operations**: Keep mutex (complex work under lock is OK)
|
|||
|
|
|
|||
|
|
## Recommendation
|
|||
|
|
|
|||
|
|
**Immediate**: Revert to mutexes, proceed with **MF2 (Per-Page Sharding)**
|
|||
|
|
|
|||
|
|
**Reasoning**:
|
|||
|
|
1. MF2 has higher expected gain (+50%) than optimized lock-free (+10-15%)
|
|||
|
|
2. MF2 eliminates shared freelists entirely (no contention at all)
|
|||
|
|
3. Lock-free optimization is a rabbit hole (diminishing returns)
|
|||
|
|
4. mimalloc's success proves per-page sharding is the right approach
|
|||
|
|
|
|||
|
|
**Timeline**:
|
|||
|
|
- Revert Phase 7.1: 30 min
|
|||
|
|
- Implement MF2: 20-30 hours
|
|||
|
|
- Expected result: 13.78 M/s → 20.7 M/s (70% of mimalloc target!)
|
|||
|
|
|
|||
|
|
## Detailed Benchmark Log
|
|||
|
|
|
|||
|
|
### Phase 6.25 (Before, with mutexes)
|
|||
|
|
```
|
|||
|
|
[Mid 1T] 4.03 M/s
|
|||
|
|
[Mid 4T] 13.78 M/s
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Phase 7.1 (After, lock-free)
|
|||
|
|
```
|
|||
|
|
[Mid 1T Run 1] 3.89 M/s (-3.5%)
|
|||
|
|
[Mid 4T Run 1] 13.71 M/s (-0.5%)
|
|||
|
|
[Mid 4T Run 2] 13.34 M/s (-3.2%)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Average degradation: -3%
|
|||
|
|
|
|||
|
|
## Files Modified
|
|||
|
|
|
|||
|
|
- `hakmem_pool.c`: Core lock-free implementation
|
|||
|
|
- Lines 277-279: Data structure change
|
|||
|
|
- Lines 431-556: Lock-free helper functions
|
|||
|
|
- Lines 751: Initialization update
|
|||
|
|
- Lines 992-1015, 1042-1083, 1300-1302: Call site updates
|
|||
|
|
|
|||
|
|
## Lessons for Future Work
|
|||
|
|
|
|||
|
|
1. **Profile First**: Should have profiled lock contention before assuming locks were the bottleneck
|
|||
|
|
2. **Benchmark Early**: Should have benchmarked simple pop/push first, then batch operations
|
|||
|
|
3. **Incremental**: Should have done lock-free in stages (single-block first, batch second)
|
|||
|
|
4. **Understand Tradeoffs**: Lock-free trades lock contention for CAS contention + retry overhead
|
|||
|
|
|
|||
|
|
**Key Insight**: Sometimes the "obvious" optimization makes things worse. Data-driven optimization > intuition.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
**Status**: Implementation complete, benchmarked, reverted ✅
|
|||
|
|
**Next**: MF2 Per-Page Sharding (mimalloc approach)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
# Phase 7.1.1: Quick Fix - Simplified Batch Pop
|
|||
|
|
|
|||
|
|
**Date**: 2025-10-24 (continued)
|
|||
|
|
**Goal**: Fix batch pop overhead by eliminating chain walking from CAS retry loop
|
|||
|
|
**Hypothesis**: Complex CAS (chain walking) → Simple CAS (repeated single pops)
|
|||
|
|
**Expected**: +5-10% improvement over P7.1
|
|||
|
|
**Actual**: -3.5% further regression ❌❌
|
|||
|
|
|
|||
|
|
## Changes Made
|
|||
|
|
|
|||
|
|
Replaced `freelist_batch_pop_lockfree()` with `freelist_batch_pop_lockfree_simple()`:
|
|||
|
|
|
|||
|
|
```c
|
|||
|
|
// Before (P7.1): Walk chain inside CAS loop
|
|||
|
|
do {
|
|||
|
|
old_head = load();
|
|||
|
|
head = (PoolBlock*)old_head;
|
|||
|
|
tail = head;
|
|||
|
|
// PROBLEM: Walk chain inside retry loop
|
|||
|
|
while (tail->next && batch_size < max_pop) {
|
|||
|
|
tail = tail->next;
|
|||
|
|
batch_size++;
|
|||
|
|
}
|
|||
|
|
} while (!CAS(...));
|
|||
|
|
|
|||
|
|
// After (P7.1.1): Repeated single-block pops
|
|||
|
|
for (int i = 0; i < max_pop; i++) {
|
|||
|
|
PoolBlock* block = freelist_pop_lockfree(...); // Simple CAS
|
|||
|
|
if (!block) break;
|
|||
|
|
ring->items[ring->top++] = block;
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Code Changes**:
|
|||
|
|
- hakmem_pool.c:471-498: New `freelist_batch_pop_lockfree_simple()` function
|
|||
|
|
- hakmem_pool.c:984: Call site updated
|
|||
|
|
|
|||
|
|
## Benchmark Results
|
|||
|
|
|
|||
|
|
| Version | Mid 1T | Mid 4T | vs P6.25 | vs P7.1 |
|
|||
|
|
|---------|--------|--------|----------|---------|
|
|||
|
|
| P6.25 (mutex baseline) | 4.03 M/s | **13.78 M/s** | - | +6.3% |
|
|||
|
|
| P7.1 (complex lock-free) | 3.89 M/s | 13.34 M/s | -3.2% | - |
|
|||
|
|
| **P7.1.1 (simple lock-free)** | - | **12.87 M/s** | **-6.6%** ❌ | **-3.5%** ❌ |
|
|||
|
|
|
|||
|
|
**Run 1**: 12.98 M/s (-5.8% vs P6.25)
|
|||
|
|
**Run 2**: 12.76 M/s (-7.4% vs P6.25)
|
|||
|
|
**Average**: 12.87 M/s
|
|||
|
|
|
|||
|
|
## Root Cause: CAS Contention Multiplied
|
|||
|
|
|
|||
|
|
### Why Simplification Made It Worse
|
|||
|
|
|
|||
|
|
**Hypothesis was wrong**: We thought chain-walking overhead in CAS retry was the problem.
|
|||
|
|
|
|||
|
|
**Reality**:
|
|||
|
|
- **1 Complex CAS** (walk 32 blocks once) < **32 Simple CAS** (contention × 32)
|
|||
|
|
|
|||
|
|
### The Math
|
|||
|
|
|
|||
|
|
**P7.1 Complex Batch Pop**:
|
|||
|
|
- 1 CAS attempt
|
|||
|
|
- 50% retry rate → 2 CAS attempts average
|
|||
|
|
- Each retry: walk 32 blocks again (expensive, but only 2× total)
|
|||
|
|
- Total cost: ~60-80 cycles
|
|||
|
|
|
|||
|
|
**P7.1.1 Simple Batch Pop**:
|
|||
|
|
- 32 CAS attempts (one per block)
|
|||
|
|
- 50% retry rate per CAS → 64 CAS attempts average
|
|||
|
|
- Each CAS: contention + cache line bounce
|
|||
|
|
- Total cost: ~100-150 cycles
|
|||
|
|
|
|||
|
|
**Verdict**: **32× CAS contention >> 1× chain walking overhead**
|
|||
|
|
|
|||
|
|
### Why This Happens
|
|||
|
|
|
|||
|
|
With 4 threads competing:
|
|||
|
|
1. **Thread A**: Pop block 1... (CAS)
|
|||
|
|
2. **Thread B**: Pop block 1... (CAS conflicts with A) → retry
|
|||
|
|
3. **Thread C**: Pop block 1... (CAS conflicts with A/B) → retry
|
|||
|
|
4. **Thread D**: Pop block 1... (CAS conflicts) → retry
|
|||
|
|
5. Repeat 32 times for 32 blocks...
|
|||
|
|
|
|||
|
|
**Result**: Retry storm, cache line bouncing × 32
|
|||
|
|
|
|||
|
|
### Cache Line Analysis
|
|||
|
|
|
|||
|
|
**P7.1** (complex):
|
|||
|
|
- 2 cache line bounces average (1 CAS × 2 retries)
|
|||
|
|
|
|||
|
|
**P7.1.1** (simple):
|
|||
|
|
- 64 cache line bounces average (32 CAS × 2 retries)
|
|||
|
|
|
|||
|
|
**32× worse cache behavior!**
|
|||
|
|
|
|||
|
|
## Final Conclusion
|
|||
|
|
|
|||
|
|
### Lock-Free Is Not Viable For Mid Pool
|
|||
|
|
|
|||
|
|
Both complex and simple lock-free implementations are slower than mutexes because:
|
|||
|
|
|
|||
|
|
1. **Fundamental Design Problem**: Shared freelist with contention
|
|||
|
|
- 4 threads → 1 freelist → inevitable contention
|
|||
|
|
- Lock-free: Contention = retry storm + cache bouncing
|
|||
|
|
- Mutex: Contention = waiting (but no wasted work)
|
|||
|
|
|
|||
|
|
2. **1T Performance**: Lock-free is slower even without contention
|
|||
|
|
- Memory ordering overhead (acquire/release fences)
|
|||
|
|
- CAS instruction overhead (LOCK CMPXCHG)
|
|||
|
|
- Mutexes have optimized fast-path
|
|||
|
|
|
|||
|
|
3. **Batch Operations**: Core use case for Mid Pool
|
|||
|
|
- Lock-free batch = N× contention
|
|||
|
|
- Mutex batch = 1× lock, amortized cost
|
|||
|
|
|
|||
|
|
### Key Insight
|
|||
|
|
|
|||
|
|
**The bottleneck is not LOCKING mechanism, but SHARING itself.**
|
|||
|
|
|
|||
|
|
- Mutexes serialize access to shared data → 1 thread wins, others wait
|
|||
|
|
- Lock-free allows concurrent access → all threads retry → cache thrashing
|
|||
|
|
|
|||
|
|
**Solution**: **Eliminate sharing** (MF2 Per-Page Sharding)
|
|||
|
|
|
|||
|
|
## Recommendation: Revert + MF2
|
|||
|
|
|
|||
|
|
**Action**: Revert to Phase 6.25 (mutex baseline), implement MF2
|
|||
|
|
|
|||
|
|
**MF2 Approach**:
|
|||
|
|
- Each thread owns pages (no sharing)
|
|||
|
|
- O(1) page lookup from block address
|
|||
|
|
- No mutex, no lock-free, no contention
|
|||
|
|
- Expected: +50% (13.78 → 20.7 M/s)
|
|||
|
|
|
|||
|
|
**Timeline**:
|
|||
|
|
- Revert: 15 min
|
|||
|
|
- MF2 implementation: 20-30 hours
|
|||
|
|
- Expected ROI: 1.67-2.5% gain per hour (better than lock-free optimization)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
**Status**: Phase 7.1 + 7.1.1 complete, lessons learned, ready to revert ✅
|
|||
|
|
**Next**: MF2 Per-Page Sharding (mimalloc approach)
|