# Phase 7.2.1: Consumer-Driven Adoption Investigation Results

**Date**: 2025-10-25
**Goal**: Fix performance issues with Consumer-Driven Adoption (CDA) in MF2
**Status**: ❌ **CDA Failed** - Switching to Route S (Owner-Priority Design)

---

## Executive Summary

Consumer-Driven Adoption (CDA) was implemented to handle Larson-style workloads where owner threads stop allocating. **Multiple optimization attempts all failed**, revealing a fundamental design flaw:

**Root Cause**: CDA creates an **"adoption stealing → fragmentation domino"** effect:
1. Thread A allocates from Page X
2. Thread A passes objects to Thread B for processing
3. Thread B frees objects → remote frees accumulate on Page X
4. **Thread C (needing allocation) adopts Page X** from A's pending queue
5. Thread A continues allocating → **creates NEW pages** (its queue is now empty!)

**Result**: Pages fragment across threads through aggressive adoption, and the original owners keep allocating new pages without end.

---

## Attempted Fixes & Results

| Fix Attempt | Page Reuse | Throughput | New Pages | Status |
|-------------|------------|------------|-----------|--------|
| Original (Phase 7.2) | ~40% | 44-58K ops/s | 227K | ❌ Suspected ping-pong |
| + pending_claim | ~40% | 44-58K ops/s | 227K | ❌ No improvement |
| Threshold=4 (batch) | 0.6% | 29K ops/s | 138K | ❌ Pages never enqueued |
| Threshold=2 | 3.6% | 31K ops/s | 138K | ❌ Still too high |
| Threshold=1 + Re-enqueue | 6.2% | 30K ops/s | 230K | ❌ Worse than original |
| **Target (mimalloc)** | >90% | 1M+ ops/s | <50K | 🎯 Goal |

### Key Statistics (Threshold=1 test, 10s run)

```
Page reuses:         14,299  (6.2% of total pages)
New pages allocated: 230,267 (~16x more than reuses!)
Drain attempts:      14,346
Drain successes:     13,870  (96.7% success rate)
Pending enqueued:    16,587
Pending drained:     14,299
Throughput:          30,377 ops/sec
Real time:           15.989s (should be ~10s)
CPU time:            32.919s (4 threads × ~8.2s each)
```

**Diagnosis**: The high drain success rate (96.7%) shows that draining itself works correctly. The problem is the **new page allocation rate**: 230K new pages for roughly 300K operations, i.e. about one new page for every 1.3 operations.
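
The ratios behind this diagnosis can be re-derived from the two page counters in the statistics block. The sketch below is only an illustration: the helper names `reuse_share` and `new_to_reuse_ratio` are hypothetical and do not exist in the MF2 sources.

```c
#include <stdint.h>

/* Hypothetical helpers (not part of MF2): derive page-reuse ratios
 * from the "Page reuses" and "New pages allocated" counters. */

/* Fraction of all page acquisitions (reuses + new pages) served by reuse. */
static double reuse_share(uint64_t reuses, uint64_t new_pages) {
    uint64_t total = reuses + new_pages;
    return total == 0 ? 0.0 : (double)reuses / (double)total;
}

/* Number of new pages allocated for each reused page. */
static double new_to_reuse_ratio(uint64_t reuses, uint64_t new_pages) {
    return reuses == 0 ? 0.0 : (double)new_pages / (double)reuses;
}
```

With the counters from the run above (14,299 reuses, 230,267 new pages), `new_to_reuse_ratio` comes out at roughly 16, and `reuse_share` at roughly 0.06, so about 94% of all page acquisitions were fresh pages.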

---

## Root Cause Analysis (ChatGPT Pro Consultation)

### The Adoption Stealing Problem

**Larson benchmark pattern**:
```
Phase 1 (Alloc):
  Thread A: alloc from Page X → pass to B
  Thread B: alloc from Page Y → pass to A
  (Cross-thread ownership transfer)

Phase 2 (Free):
  Thread A: free B's objects → remote free to Page Y
  Thread B: free A's objects → remote free to Page X
  (Both pages have remote frees)

Phase 3 (Alloc again):
  Thread A: needs alloc
    → checks own pending queue, expecting to find Page X
    → BUT Thread C already adopted Page X! (queue empty)
    → allocates NEW page

  Thread C: was idle, woke up
    → scanned all pending queues
    → adopted Page X from A
    → used the free blocks
    → goes idle again
```

**The vicious cycle**:
1. Owner (A) has pages with remotes in pending queue
2. Adopter (C) scans and steals pages before owner can reuse
3. Owner finds empty queue → allocates new page
4. New page eventually fills → gets remote frees → stolen again
5. Repeat → **endless fragmentation**
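
The cycle can be made concrete with a deliberately simplified, single-threaded model. This is a sketch under strong assumptions, not the MF2 code: each round makes exactly one page pending, and when adoption is enabled the adopter always wins the race against the owner. `SimStats` and `simulate_rounds` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the vicious cycle (hypothetical, not MF2 code). */
typedef struct {
    size_t pending;    /* pages currently in the owner's pending queue */
    size_t new_pages;  /* pages the owner had to allocate fresh */
    size_t reused;     /* pages the owner reclaimed from its own queue */
    size_t stolen;     /* pages adopted by another thread */
} SimStats;

static SimStats simulate_rounds(size_t rounds, bool adoption_enabled) {
    SimStats s = {0, 0, 0, 0};
    for (size_t i = 0; i < rounds; i++) {
        s.pending++;  /* a remote free puts one page on the pending queue */

        /* Worst case: the adopter scans before the owner allocates. */
        if (adoption_enabled && s.pending > 0) {
            s.pending--;
            s.stolen++;
        }

        /* Owner slow path: reuse a pending page if any, else a new page. */
        if (s.pending > 0) {
            s.pending--;
            s.reused++;
        } else {
            s.new_pages++;
        }
    }
    return s;
}
```

In this model the owner allocates one new page per round when adoption is enabled and none when it is disabled. Real runs sit between those extremes, but the direction matches the measured reuse collapse.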

### Threshold Approach Failures

**Threshold=1 (0→1 edge)**:
- ✅ Pages enqueued quickly (every remote free)
- ❌ Adopters steal immediately
- ❌ Owner rarely gets to reuse own pages
- Result: 40% reuse (not good enough)

**Threshold=2-4 (batching)**:
- ✅ Reduces pending queue operations
- ❌ Many pages never reach threshold
- ❌ Pages with 1-3 remotes are **lost entirely**
- Result: 0.6-3.6% reuse (catastrophic)

**Threshold + Re-enqueue after drain**:
- ✅ Catches new remotes during drain
- ❌ Doesn't fix fundamental adoption stealing
- Result: 6.2% reuse (worse than original)
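
Why batching loses pages is easy to see in a minimal sketch of threshold-based enqueueing. It assumes the enqueue decision fires when the remote counter reaches the threshold exactly; `DemoPage` and `remote_free_should_enqueue` are hypothetical and only stand in for the real MF2 remote-free path.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical page header holding only the remote-free counter. */
typedef struct {
    atomic_uint remote_count;
} DemoPage;

/* Record one remote free. Returns true when this free makes the counter
 * reach the threshold, i.e. when the page should be pushed onto the
 * owner's pending queue. threshold == 1 corresponds to the 0→1 edge. */
static bool remote_free_should_enqueue(DemoPage *page, unsigned threshold) {
    unsigned prev = atomic_fetch_add_explicit(&page->remote_count, 1,
                                              memory_order_acq_rel);
    return prev + 1 == threshold;
}
```

With `threshold == 1` the first remote free enqueues the page. With `threshold == 4` a page that only ever receives three remote frees is never enqueued, which matches the "lost entirely" behavior described above.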

---

## Architectural Flaws in CDA

### 1. No Owner Priority (Right-of-First-Refusal)

Current design allows **any thread** to adopt **immediately** when a page enters pending queue.

**Problem**: Owner thread may still be actively allocating and would naturally reuse the page, but adopter gets there first.

**Missing**: Time-based grace period where owner has exclusive access.

### 2. Adoption Too Aggressive

Adoption triggers on **any non-empty pending queue**, regardless of:
- Owner activity level (is owner still allocating?)
- Owner's own partial page availability
- Remote count (is it worth adopting for just 1 block?)

**Problem**: Unnecessary ownership transfers cause:
- Cache line bouncing (page metadata)
- Loss of temporal locality
- Fragmentation across threads

### 3. No Must-Reuse Gate

Current design allows allocating **new pages** even when:
- Own pending queue has reusable pages
- Own full_pages list has pages with remotes

**Problem**: Owner bypasses reuse opportunities, creating new pages unnecessarily.

### 4. Full Pages List Misuse

Drained pages return to `full_pages` list, which is:
- Not scanned frequently
- Requires O(N) scan to find pages with remotes
- Effectively a "graveyard" for partially-free pages

**Problem**: Pages with free blocks get stuck in full_pages, not reused.

---

## Recommended Solution: Route S (Simple & Stable)

### Design Principles

1. **Owner-Only Drain**: Remove cross-thread adoption entirely
2. **Must-Reuse Gate**: Forbid new page allocation when reusable pages exist
3. **Partial List**: Replace full_pages with partial list for fast reuse
4. **0→1 Edge Detection**: Keep immediate pending queue notification

### Core Changes

#### Change 1: Disable CDA

```c
// In mf2_alloc_slow():
// ==== DISABLE Consumer-Driven Adoption ====
// if (mf2_try_adopt_pending(tp, class_idx)) {
//     return mf2_alloc_fast(class_idx, size, site_id);
// }
```

#### Change 2: Must-Reuse Gate (Owner Pending Priority)

```c
// In mf2_alloc_slow(), BEFORE new page allocation:

// CRITICAL: Drain own pending queue FIRST (must-reuse gate).
// The budget bounds the work per slow-path call to 4 pending pages.
for (int budget = 0; budget < 4; budget++) {
    MidPage* pending_page = mf2_dequeue_pending(tp, class_idx);
    if (!pending_page) break;  // Own pending queue is empty

    // Clear the flag so a later remote free can enqueue this page again
    atomic_store_explicit(&pending_page->in_remote_pending, false, memory_order_release);

    if (mf2_try_drain_and_activate(tp, class_idx, pending_page)) {
        atomic_fetch_add(&g_mf2_mustreuse_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}

// Check active page for remotes
// (`page` is the active page for class_idx, obtained earlier in this function)
if (page && page->remote_count > 0) {
    int drained = mf2_drain_remote_frees(page);
    if (drained > 0 && page->freelist) {
        atomic_fetch_add(&g_mf2_active_drain_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}

// Only NOW allocate new page (after exhausting reuse opportunities)
atomic_fetch_add(&g_mf2_new_page_count, 1);
page = mf2_alloc_new_page(class_idx);
```

#### Change 3: Keep Threshold=1 (0→1 Edge)

Already implemented - no change needed.

### Expected Results

| Metric | Before (CDA) | After (Route S) | Improvement |
|--------|--------------|-----------------|-------------|
| Page reuse rate | 6-40% | 70-80% | 2-13x ✅ |
| New pages (10s) | 230K | 50-80K | 3-5x fewer ✅ |
| Throughput | 30-58K ops/s | 60-100K ops/s | 2-3x ✅ |
| Real time | 16s | ~10s | 1.6x faster ✅ |

### Why This Works

1. **Owner always gets first chance** to reuse pages with remotes
2. **No adoption stealing** → no fragmentation domino
3. **Must-reuse gate** physically prevents new page allocation when reuse is possible
4. **Simpler design** → fewer race conditions, easier to debug

---

## Alternative: Route P (Performance - Keep CDA with Safeguards)

If Route S succeeds but we want to re-enable CDA for other workloads:

### Right-of-First-Refusal (RFR)

```c
// On 0→1 edge:
page->rfr_deadline = rdtsc() + RFR_WINDOW_CYCLES;  // Window in TSC cycles, e.g., the equivalent of 200µs

// In adoption (inside the loop over candidate pages):
if (rdtsc() < page->rfr_deadline) {
    continue;  // Owner priority window - skip adoption
}
```

### Adoption Gating (Multi-Condition AND)

Only allow adoption when **ALL** conditions are met:
1. Owner inactive: `now - owner->last_alloc_tsc > IDLE_THRESHOLD` (e.g., 150µs)
2. Owner partial empty: `owner->partial_pages[k] == NULL`
3. Global pressure: `adoptable_count[k] > ADOPTION_THRESHOLD` (e.g., 4)
4. Sufficient remotes: `page->remote_count >= MIN_REMOTE_COUNT` (e.g., 8)
5. RFR expired: `now >= page->rfr_deadline`
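
The five conditions can be folded into a single predicate. The following is a minimal sketch, assuming all timestamps are TSC values already read by the caller; the `GateConfig` and `GateInput` structs are hypothetical containers for the fields named in the list, not existing MF2 types.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knobs mirroring the constants in the list above. */
typedef struct {
    uint64_t idle_threshold;      /* IDLE_THRESHOLD, in TSC cycles */
    size_t   adoption_threshold;  /* ADOPTION_THRESHOLD */
    uint32_t min_remote_count;    /* MIN_REMOTE_COUNT */
} GateConfig;

/* Hypothetical snapshot of the state the gate inspects. */
typedef struct {
    uint64_t owner_last_alloc_tsc;
    bool     owner_has_partial;   /* owner->partial_pages[k] != NULL */
    size_t   adoptable_count;     /* adoptable_count[k] */
    uint32_t page_remote_count;
    uint64_t page_rfr_deadline;
} GateInput;

/* Returns true only if ALL five conditions hold. */
static bool adoption_allowed(const GateConfig *cfg, const GateInput *in,
                             uint64_t now) {
    /* Guard against a timestamp ahead of `now` (unsigned wrap-around). */
    uint64_t idle = now > in->owner_last_alloc_tsc
                        ? now - in->owner_last_alloc_tsc : 0;

    if (idle <= cfg->idle_threshold) return false;                   /* 1 */
    if (in->owner_has_partial) return false;                         /* 2 */
    if (in->adoptable_count <= cfg->adoption_threshold) return false;/* 3 */
    if (in->page_remote_count < cfg->min_remote_count) return false; /* 4 */
    if (now < in->page_rfr_deadline) return false;                   /* 5 */
    return true;
}
```

Ordering the checks from cheapest to most selective is a design choice; since every condition must hold, any order gives the same result.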

### O(1) Adoption Scan

- Use `ready_mask[k]` bitmap for non-empty pending queues
- Budget: maximum 1 page adoption per slow path
- No scanning of empty queues
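
One way such a bitmap could work is sketched below, assuming at most 64 owner threads per size class so that one 64-bit word suffices. `ReadyMask` and its helpers are hypothetical, and `__builtin_ctzll` is a GCC/Clang builtin rather than standard C.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical ready mask for one size class: bit t is set while
 * thread t's pending queue is believed to be non-empty (t < 64). */
typedef struct {
    _Atomic uint64_t bits;
} ReadyMask;

/* Called by a remote freer after pushing onto thread tid's pending queue. */
static void ready_mask_set(ReadyMask *m, unsigned tid) {
    atomic_fetch_or_explicit(&m->bits, UINT64_C(1) << tid,
                             memory_order_release);
}

/* Called once thread tid's pending queue has been emptied. */
static void ready_mask_clear(ReadyMask *m, unsigned tid) {
    atomic_fetch_and_explicit(&m->bits, ~(UINT64_C(1) << tid),
                              memory_order_release);
}

/* Returns the lowest thread index with a non-empty queue, or -1 if none.
 * A single load plus count-trailing-zeros, so no empty queue is visited. */
static int ready_mask_pick(ReadyMask *m) {
    uint64_t v = atomic_load_explicit(&m->bits, memory_order_acquire);
    if (v == 0) return -1;
    return __builtin_ctzll(v);
}
```

The mask is only a hint: a queue may be emptied between the load and the adoption attempt, so the adopter still has to handle an empty dequeue.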

---

## Phase Plan: Route S Implementation

### Phase 1: Minimal Changes (30 minutes)

1. ✅ Disable CDA (comment out `mf2_try_adopt_pending` call)
2. ✅ Verify must-reuse gate order (already mostly correct)
3. ✅ Keep threshold=1 (already set)

**Test**: 10s larson benchmark
**Success Criterion**: Page reuse > 50%, throughput > 60K ops/s

### Phase 2: Partial List (2-3 hours)

1. Add `partial_pages[POOL_NUM_CLASSES]` to MF2_ThreadPages
2. Modify `mf2_try_drain_and_activate()` to use partial list
3. Pop from partial list before active page check

**Test**: Same benchmark
**Success Criterion**: Page reuse > 70%, throughput > 80K ops/s
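
A per-class partial list can be as simple as an intrusive singly linked stack that only the owner thread touches, which gives O(1) push and pop without atomics. The sketch below uses stand-in types (`DemoMidPage`, `DemoThreadPages`, `DEMO_NUM_CLASSES`) because the real `MidPage` and `MF2_ThreadPages` layouts are not shown in this document.

```c
#include <stddef.h>

#define DEMO_NUM_CLASSES 8  /* stand-in for POOL_NUM_CLASSES */

/* Stand-in for MidPage: only the fields this sketch needs. */
typedef struct DemoMidPage {
    struct DemoMidPage *next_partial;  /* intrusive link */
    int free_blocks;
} DemoMidPage;

/* Stand-in for MF2_ThreadPages with the proposed partial_pages array. */
typedef struct {
    DemoMidPage *partial_pages[DEMO_NUM_CLASSES];
} DemoThreadPages;

/* Owner-only: put a drained page with free blocks at the list head. */
static void partial_push(DemoThreadPages *tp, int class_idx,
                         DemoMidPage *page) {
    page->next_partial = tp->partial_pages[class_idx];
    tp->partial_pages[class_idx] = page;
}

/* Owner-only: take the most recently pushed page, or NULL if empty. */
static DemoMidPage *partial_pop(DemoThreadPages *tp, int class_idx) {
    DemoMidPage *page = tp->partial_pages[class_idx];
    if (page) {
        tp->partial_pages[class_idx] = page->next_partial;
        page->next_partial = NULL;
    }
    return page;
}
```

LIFO order hands back the most recently drained page first, which tends to be the one still warm in cache.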

### Phase 3: Event Wakeup (Optional, 4-6 hours)

1. Add futex-based wakeup for 0→1 edge
2. Owner thread sleeps when idle, wakes on remote free
3. Background drain for sleeping owners

**Test**: Same benchmark + long idle periods
**Success Criterion**: Low CPU usage during idle, fast wakeup on activity

---

## Debug Instrumentation (Lightweight Counters)

Add to hakmem_pool.c:
```c
// Must-reuse gate effectiveness
static atomic_uint_fast64_t g_mf2_mustreuse_hit = 0;     // Reused from pending
static atomic_uint_fast64_t g_mf2_active_drain_hit = 0;  // Drained active page
static atomic_uint_fast64_t g_mf2_new_page_forced = 0;   // No reuse possible

// (Route P only) Adoption gating
static atomic_uint_fast64_t g_mf2_adoption_blocked_rfr = 0;
static atomic_uint_fast64_t g_mf2_adoption_blocked_idle = 0;
static atomic_uint_fast64_t g_mf2_adoption_allowed = 0;
```
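
To turn these counters into a single number at the end of a run, they can be read with relaxed loads and combined into a gate hit rate. This is a sketch only: `Mf2CounterSnapshot`, `demo_snapshot`, and `demo_gate_hit_rate` are hypothetical helpers, and the counters are passed in as pointers so the example stays self-contained.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical plain-value copy of the three must-reuse counters. */
typedef struct {
    uint_fast64_t mustreuse_hit;
    uint_fast64_t active_drain_hit;
    uint_fast64_t new_page_forced;
} Mf2CounterSnapshot;

/* Relaxed loads are enough for statistics; the values are not used
 * to synchronize anything. */
static Mf2CounterSnapshot demo_snapshot(atomic_uint_fast64_t *mustreuse,
                                        atomic_uint_fast64_t *active_drain,
                                        atomic_uint_fast64_t *forced) {
    Mf2CounterSnapshot s;
    s.mustreuse_hit    = atomic_load_explicit(mustreuse, memory_order_relaxed);
    s.active_drain_hit = atomic_load_explicit(active_drain, memory_order_relaxed);
    s.new_page_forced  = atomic_load_explicit(forced, memory_order_relaxed);
    return s;
}

/* Share of slow-path calls that were satisfied by reuse. */
static double demo_gate_hit_rate(const Mf2CounterSnapshot *s) {
    uint_fast64_t hits  = s->mustreuse_hit + s->active_drain_hit;
    uint_fast64_t total = hits + s->new_page_forced;
    return total == 0 ? 0.0 : (double)hits / (double)total;
}
```

Called as `demo_snapshot(&g_mf2_mustreuse_hit, &g_mf2_active_drain_hit, &g_mf2_new_page_forced)` at shutdown, the resulting rate could be compared directly against the Phase 1 and Phase 2 success criteria.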

---

## Lessons Learned

### What Worked
- ✅ Pending queue design (lock-free MPSC stack)
- ✅ 0→1 edge detection (low overhead)
- ✅ Remote free tracking (atomic counters)
- ✅ Lock-free drain (exchange entire stack)
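
The first and last items fit together: many producers push with a CAS loop, and the single consumer takes the whole stack with one atomic exchange. The sketch below illustrates that general technique with hypothetical `DemoNode`/`DemoStack` types; it is not a copy of the MF2 pending queue.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct DemoNode {
    struct DemoNode *next;
    int value;
} DemoNode;

typedef struct {
    _Atomic(DemoNode *) head;
} DemoStack;

/* Producer side: any thread may push. The release ordering on success
 * publishes the node's contents to the thread that later drains. */
static void demo_push(DemoStack *s, DemoNode *n) {
    DemoNode *old = atomic_load_explicit(&s->head, memory_order_relaxed);
    do {
        n->next = old;  /* `old` is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Consumer side: detach the entire stack in one exchange and return
 * it as a private list in LIFO order (NULL if the stack was empty). */
static DemoNode *demo_drain_all(DemoStack *s) {
    return atomic_exchange_explicit(&s->head, (DemoNode *)NULL,
                                    memory_order_acquire);
}
```

Because the consumer never pops single nodes, the classic ABA hazard of lock-free stacks does not arise in this pattern.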

### What Failed
- ❌ Consumer-Driven Adoption (too aggressive)
- ❌ Threshold-based batching (wrong trade-off)
- ❌ Full pages list for reuse (too slow)
- ❌ No owner priority (adoption stealing)

### Key Insights
1. **"Good idea on paper" ≠ "works in practice"**
   - CDA sounded great (use pages immediately!) but caused worse fragmentation

2. **Benchmarks reveal real-world patterns**
   - Larson's producer-consumer pattern exposed CDA's fatal flaw

3. **Simplicity wins**
   - Owner-only drain is simpler and likely faster than complex adoption logic

4. **Must-reuse is critical**
   - Without forcing reuse, allocators will always prefer new pages (easier)

---

## References

- PHASE_7.2_MF2_PLAN_2025_10_24.md (original plan)
- ALIGNMENT_FIX_VERIFICATION.md (alignment debugging)
- ChatGPT Pro consultation (2025-10-25)

---

## Next Actions

1. ✅ Document investigation results (this file)
2. ⏳ Implement Route S minimal changes
3. ⏳ Test and measure improvements
4. ⏳ Decide: stop here or add partial list (Phase 2)
5. ⏳ Consider Route P if CDA needed for other workloads

**Status**: Ready to implement Route S Phase 1 ✅