# Phase 7.2.1: Consumer-Driven Adoption Investigation Results
**Date**: 2025-10-25

**Goal**: Fix performance issues with Consumer-Driven Adoption (CDA) in MF2

**Status**: ❌ **CDA Failed** - Switching to Route S (Owner-Priority Design)

---
## Executive Summary
Consumer-Driven Adoption (CDA) was implemented to handle Larson-style workloads where owner threads stop allocating. **Multiple optimization attempts all failed**, revealing a fundamental design flaw:
**Root Cause**: CDA creates an **"adoption stealing → fragmentation domino"** effect:
1. Thread A allocates from Page X
2. Thread A passes objects to Thread B for processing
3. Thread B frees objects → remote frees accumulate on Page X
4. **Thread C (needing allocation) adopts Page X** from A's pending queue
5. Thread A continues allocating → **creates NEW pages** (its queue is now empty!)
**Result**: Aggressive adoption fragments pages across threads, while the original owners keep allocating new pages indefinitely.

---
## Attempted Fixes & Results
| Fix Attempt | Page Reuse | Throughput | New Pages | Status |
|-------------|------------|------------|-----------|--------|
| Original (Phase 7.2) | ~40% | 44-58K ops/s | 227K | ❌ Suspected ping-pong |
| + pending_claim | ~40% | 44-58K ops/s | 227K | ❌ No improvement |
| Threshold=4 (batch) | 0.6% | 29K ops/s | 138K | ❌ Pages never enqueued |
| Threshold=2 | 3.6% | 31K ops/s | 138K | ❌ Still too high |
| Threshold=1 + Re-enqueue | 6.2% | 30K ops/s | 230K | ❌ Worse than original |
| **Target (mimalloc)** | >90% | 1M+ ops/s | <50K | 🎯 Goal |
### Key Statistics (Threshold=1 test, 10s run)
```
Page reuses: 14,299 (6.2% of total pages)
New pages allocated: 230,267 (~16x more than reuses!)
Drain attempts: 14,346
Drain successes: 13,870 (96.7% success rate)
Pending enqueued: 16,587
Pending drained: 14,299
Throughput: 30,377 ops/sec
Real time: 15.989s (should be ~10s)
CPU time: 32.919s (4 threads × ~8.2s each)
```
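The ratios implied by these counters can be recomputed directly. The following is a minimal sketch, not hakmem code: the `Mf2RunStats` struct and the helper names are hypothetical, and the only inputs are the four counters printed above.

```c
#include <assert.h>
#include <stdint.h>

/* Counters from one benchmark run. The struct and helper names are
 * illustrative assumptions; they are not part of hakmem. */
typedef struct {
    uint64_t page_reuses;
    uint64_t new_pages;
    uint64_t drain_attempts;
    uint64_t drain_successes;
} Mf2RunStats;

/* Page reuses expressed as a percentage of new pages. */
static double reuse_pct_of_new(const Mf2RunStats *s) {
    assert(s->new_pages > 0);
    return 100.0 * (double)s->page_reuses / (double)s->new_pages;
}

/* How many new pages were allocated for every page reuse. */
static double new_pages_per_reuse(const Mf2RunStats *s) {
    assert(s->page_reuses > 0);
    return (double)s->new_pages / (double)s->page_reuses;
}

/* Share of drain attempts that succeeded, in percent. */
static double drain_success_pct(const Mf2RunStats *s) {
    assert(s->drain_attempts > 0);
    return 100.0 * (double)s->drain_successes / (double)s->drain_attempts;
}
```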
**Diagnosis**: The high drain success rate (96.7%) shows that draining itself works correctly. The problem is the **new page allocation rate**: 230K new pages for 300K operations, i.e. roughly one new page for every 1.3 operations.

---
## Root Cause Analysis (ChatGPT Pro Consultation)
### The Adoption Stealing Problem
**Larson benchmark pattern**:
```
Phase 1 (Alloc):
  Thread A: alloc from Page X → pass to B
  Thread B: alloc from Page Y → pass to A
  (Cross-thread ownership transfer)
Phase 2 (Free):
  Thread A: free B's objects → remote free to Page Y
  Thread B: free A's objects → remote free to Page X
  (Both pages have remote frees)
Phase 3 (Alloc again):
  Thread A: needs alloc
    → checks own pending queue → Page X is there!
    → BUT Thread C already adopted Page X! (queue empty)
    → allocates NEW page
  Thread C: was idle, woke up
    → scanned all pending queues
    → adopted Page X from A
    → used the free blocks
    → goes idle again
```
**The vicious cycle**:
1. Owner (A) has pages with remotes in pending queue
2. Adopter (C) scans and steals pages before owner can reuse
3. Owner finds empty queue → allocates new page
4. New page eventually fills → gets remote frees → stolen again
5. Repeat → **endless fragmentation**
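The cycle above can be reduced to a toy counting model. The sketch below is not hakmem code: it is single-threaded, every name in it is hypothetical, and it assumes the worst case in which the adopter always reaches a pending page before the owner does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one owner thread. It only counts events and does not
 * model real memory, page contents, or concurrency. */
typedef struct {
    int pending;    /* pages with remote frees waiting in the owner's queue */
    int new_pages;  /* pages the owner had to allocate fresh */
    int reused;     /* pages the owner got back from its own queue */
} OwnerModel;

/* Each round: one of the owner's pages receives remote frees and is
 * enqueued, an adopter optionally steals it, then the owner needs a page. */
static OwnerModel simulate_rounds(int rounds, bool adopter_steals) {
    assert(rounds >= 0);
    OwnerModel m = {0, 0, 0};
    for (int i = 0; i < rounds; i++) {
        m.pending++;                       /* 0->1 edge: page enters the queue */
        if (adopter_steals && m.pending > 0) {
            m.pending--;                   /* adopter takes it first */
        }
        if (m.pending > 0) {               /* owner slow path: reuse if possible */
            m.pending--;
            m.reused++;
        } else {
            m.new_pages++;                 /* queue is empty, allocate a new page */
        }
    }
    return m;
}
```

Under that assumption, `simulate_rounds(n, true)` ends with `n` new pages and no owner reuse, while `simulate_rounds(n, false)` ends with `n` reuses and no new pages. The measured 6-40% reuse rates sit between these two extremes.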
### Threshold Approach Failures
**Threshold=1 (0→1 edge)**:
- ✅ Pages enqueued quickly (every remote free)
- ❌ Adopters steal immediately
- ❌ Owner rarely gets to reuse own pages
- Result: 40% reuse (not good enough)
**Threshold=2-4 (batching)**:
- ✅ Reduces pending queue operations
- ❌ Many pages never reach threshold
- ❌ Pages with 1-3 remotes are **lost entirely**
- Result: 0.6-3.6% reuse (catastrophic)
**Threshold + Re-enqueue after drain**:
- ✅ Catches new remotes during drain
- ❌ Doesn't fix fundamental adoption stealing
- Result: 6.2% reuse (worse than original)
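The reason batching thresholds lose pages can be seen in a small sketch of the enqueue decision. This assumes a per-page atomic remote counter and a single "crossing" check; the type and function names are hypothetical and the real MF2 code may differ. With `threshold = 1` the first remote free triggers the enqueue; with `threshold = 4` a page that only ever receives three remote frees is never enqueued.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical page header: only the field needed for this sketch. */
typedef struct {
    atomic_uint remote_count;   /* remote frees not yet drained */
} SketchPage;

/* Records one remote free. Returns true only for the free that makes the
 * counter reach `threshold`, i.e. when the caller should enqueue the page
 * on the owner's pending queue. */
static bool remote_free_should_enqueue(SketchPage *page, unsigned threshold) {
    assert(threshold >= 1);
    unsigned prev = atomic_fetch_add_explicit(&page->remote_count, 1,
                                              memory_order_acq_rel);
    return prev + 1 == threshold;
}
```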
---
## Architectural Flaws in CDA
### 1. No Owner Priority (Right-of-First-Refusal)
The current design allows **any thread** to adopt a page **immediately** once it enters the pending queue.
**Problem**: Owner thread may still be actively allocating and would naturally reuse the page, but adopter gets there first.
**Missing**: Time-based grace period where owner has exclusive access.
### 2. Adoption Too Aggressive
Adoption triggers on **any non-empty pending queue**, regardless of:
- Owner activity level (is owner still allocating?)
- Owner's own partial page availability
- Remote count (is it worth adopting for just 1 block?)
**Problem**: Unnecessary ownership transfers cause:
- Cache line bouncing (page metadata)
- Loss of temporal locality
- Fragmentation across threads
### 3. No Must-Reuse Gate
Current design allows allocating **new pages** even when:
- Own pending queue has reusable pages
- Own full_pages list has pages with remotes
**Problem**: Owner bypasses reuse opportunities, creating new pages unnecessarily.
### 4. Full Pages List Misuse
Drained pages return to `full_pages` list, which is:
- Not scanned frequently
- Requires O(N) scan to find pages with remotes
- Effectively a "graveyard" for partially-free pages
**Problem**: Pages with free blocks get stuck in `full_pages` and are not reused.

---
## Recommended Solution: Route S (Simple & Stable)
### Design Principles
1. **Owner-Only Drain**: Remove cross-thread adoption entirely
2. **Must-Reuse Gate**: Forbid new page allocation when reusable pages exist
3. **Partial List**: Replace full_pages with partial list for fast reuse
4. **0→1 Edge Detection**: Keep immediate pending queue notification
### Core Changes
#### Change 1: Disable CDA
```c
// In mf2_alloc_slow():
// ==== DISABLE Consumer-Driven Adoption ====
// if (mf2_try_adopt_pending(tp, class_idx)) {
//     return mf2_alloc_fast(class_idx, size, site_id);
// }
```
#### Change 2: Must-Reuse Gate (Owner Pending Priority)
```c
// In mf2_alloc_slow(), BEFORE new page allocation:
// CRITICAL: Drain own pending queue FIRST (must-reuse gate)
for (int budget = 0; budget < 4; budget++) {
    MidPage* pending_page = mf2_dequeue_pending(tp, class_idx);
    if (!pending_page) break;
    atomic_store_explicit(&pending_page->in_remote_pending, false, memory_order_release);
    if (mf2_try_drain_and_activate(tp, class_idx, pending_page)) {
        atomic_fetch_add(&g_mf2_mustreuse_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}
// Check active page for remotes
if (page && page->remote_count > 0) {
    int drained = mf2_drain_remote_frees(page);
    if (drained > 0 && page->freelist) {
        atomic_fetch_add(&g_mf2_active_drain_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}
// Only NOW allocate new page (after exhausting reuse opportunities)
atomic_fetch_add(&g_mf2_new_page_count, 1);
page = mf2_alloc_new_page(class_idx);
```
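The ordering enforced by the gate can be checked in isolation with a self-contained model. The sketch below is an assumption-laden stand-in, not the MF2 implementation: `SketchPage`, `SketchThread`, `sketch_drain`, and `sketch_alloc_slow` are hypothetical names, the pending queue is a plain array, and there is no concurrency. It only reports which of the three sources the slow path would use.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_QUEUE_CAP    8
#define SKETCH_DRAIN_BUDGET 4   /* same budget as the loop above */

typedef struct { int remote_blocks; int free_blocks; } SketchPage;

typedef struct {
    SketchPage *pending[SKETCH_QUEUE_CAP];  /* owner's pending queue (LIFO) */
    int pending_len;
    SketchPage *active;                     /* current allocation page */
} SketchThread;

typedef enum { FROM_PENDING, FROM_ACTIVE_DRAIN, FROM_NEW_PAGE } AllocSource;

/* Moves remote frees onto the local freelist; returns how many moved. */
static int sketch_drain(SketchPage *p) {
    int n = p->remote_blocks;
    p->free_blocks += n;
    p->remote_blocks = 0;
    return n;
}

/* Slow path in must-reuse order: own pending queue, then the active
 * page's remote frees, and only then a new page. */
static AllocSource sketch_alloc_slow(SketchThread *t) {
    assert(t != NULL);
    for (int budget = 0; budget < SKETCH_DRAIN_BUDGET && t->pending_len > 0; budget++) {
        SketchPage *p = t->pending[--t->pending_len];
        if (sketch_drain(p) > 0) {
            t->active = p;
            return FROM_PENDING;
        }
    }
    if (t->active && t->active->remote_blocks > 0 && sketch_drain(t->active) > 0) {
        return FROM_ACTIVE_DRAIN;
    }
    return FROM_NEW_PAGE;
}
```

A thread with one pending page holding remote frees returns `FROM_PENDING`; a thread with an empty queue but remote frees on its active page returns `FROM_ACTIVE_DRAIN`; only a thread with nothing to reclaim returns `FROM_NEW_PAGE`.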
#### Change 3: Keep Threshold=1 (0→1 Edge)
Already implemented - no change needed.
### Expected Results
| Metric | Before (CDA) | After (Route S) | Improvement |
|--------|--------------|-----------------|-------------|
| Page reuse rate | 6-40% | 70-80% | 2-13x ✅ |
| New pages (10s) | 230K | 50-80K | 3-5x fewer ✅ |
| Throughput | 30-58K ops/s | 60-100K ops/s | 2-3x ✅ |
| Real time | 16s | ~10s | 1.6x faster ✅ |
### Why This Works
1. **Owner always gets first chance** to reuse pages with remotes
2. **No adoption stealing** → no fragmentation domino
3. **Must-reuse gate** physically prevents new page allocation when reuse is possible
4. **Simpler design** → fewer race conditions, easier to debug
---
## Alternative: Route P (Performance - Keep CDA with Safeguards)
If Route S succeeds but we want to re-enable CDA for other workloads:
### Right-of-First-Refusal (RFR)
```c
// On 0→1 edge:
page->rfr_deadline = rdtsc() + RFR_WINDOW_CYCLES;  // window in TSC cycles, e.g. the equivalent of ~200µs
// In adoption:
if (rdtsc() < page->rfr_deadline) {
    continue;  // Owner priority window - skip adoption
}
```
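Because the deadline is expressed in timestamp-counter cycles while the window is described in microseconds, a conversion step is needed somewhere. The following sketch shows one way to do it, with the timestamp passed in as a parameter so the logic can be tested without reading a hardware counter. The helper names and the `tsc_hz` calibration value are assumptions, not existing hakmem code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page header: only the RFR field from the snippet above. */
typedef struct {
    uint64_t rfr_deadline;   /* cycle timestamp until which only the owner may drain */
} SketchPage;

/* Converts a window in microseconds to counter cycles. tsc_hz is an
 * assumed, externally calibrated counter frequency. The multiplication
 * can overflow for very large windows; fine for sub-second values. */
static uint64_t us_to_cycles(uint64_t us, uint64_t tsc_hz) {
    assert(tsc_hz > 0);
    return us * tsc_hz / 1000000u;
}

/* Called on the 0->1 edge: starts the owner's exclusive window. */
static void rfr_arm(SketchPage *page, uint64_t now, uint64_t window_cycles) {
    page->rfr_deadline = now + window_cycles;
}

/* Called by a would-be adopter: true once the owner's window has passed. */
static bool rfr_expired(const SketchPage *page, uint64_t now) {
    return now >= page->rfr_deadline;
}
```

For example, on a counter running at an assumed 3 GHz, a 200µs window corresponds to 600,000 cycles.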
### Adoption Gating (Multi-Condition AND)
Only allow adoption when **ALL** conditions met:
1. Owner inactive: `now - owner->last_alloc_tsc > IDLE_THRESHOLD` (e.g., 150µs)
2. Owner partial empty: `owner->partial_pages[k] == NULL`
3. Global pressure: `adoptable_count[k] > ADOPTION_THRESHOLD` (e.g., 4)
4. Sufficient remotes: `page->remote_count >= MIN_REMOTE_COUNT` (e.g., 8)
5. RFR expired: `now >= page->rfr_deadline`
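The five conditions can be written as a single predicate. This is a sketch under stated assumptions: the `AdoptionPolicy` and `AdoptionInputs` structs are hypothetical containers for the values named in the list, all timestamps share one unit, and the caller is assumed to have taken a consistent snapshot of the inputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Tunables from the list above (names are illustrative). */
typedef struct {
    uint64_t idle_threshold;      /* IDLE_THRESHOLD, same unit as timestamps */
    unsigned adoption_threshold;  /* ADOPTION_THRESHOLD, e.g. 4 */
    unsigned min_remote_count;    /* MIN_REMOTE_COUNT, e.g. 8 */
} AdoptionPolicy;

/* Snapshot of the state the adopter inspects for one candidate page. */
typedef struct {
    uint64_t now;
    uint64_t owner_last_alloc;    /* owner->last_alloc_tsc */
    bool     owner_has_partial;   /* owner->partial_pages[k] != NULL */
    unsigned adoptable_count;     /* adoptable_count[k] */
    unsigned page_remote_count;   /* page->remote_count */
    uint64_t page_rfr_deadline;   /* page->rfr_deadline */
} AdoptionInputs;

/* Adoption is allowed only when ALL five conditions hold. */
static bool adoption_allowed(const AdoptionPolicy *p, const AdoptionInputs *in) {
    assert(p != NULL && in != NULL);
    return (in->now - in->owner_last_alloc > p->idle_threshold)   /* 1. owner inactive */
        && !in->owner_has_partial                                 /* 2. owner partial empty */
        && (in->adoptable_count > p->adoption_threshold)          /* 3. global pressure */
        && (in->page_remote_count >= p->min_remote_count)         /* 4. sufficient remotes */
        && (in->now >= in->page_rfr_deadline);                    /* 5. RFR expired */
}
```

Because the conditions are ANDed, a single failing check (for example a page with only 7 remote frees when the minimum is 8) is enough to leave the page with its owner.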
### O(1) Adoption Scan
- Use `ready_mask[k]` bitmap for non-empty pending queues
- Budget: maximum 1 page adoption per slow path
- No scanning of empty queues
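One possible shape for the `ready_mask[k]` bitmap is sketched below. It assumes one bit per owner thread and at most 64 owners per size class, which the text above does not specify, and it uses `__builtin_ctzll`, a GCC/Clang builtin, to find the first non-empty queue without looping. The function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Marks owner `owner`'s pending queue as non-empty. */
static void ready_mask_set(atomic_uint_fast64_t *mask, unsigned owner) {
    assert(owner < 64);
    atomic_fetch_or_explicit(mask, (uint_fast64_t)1 << owner, memory_order_release);
}

/* Marks owner `owner`'s pending queue as empty. */
static void ready_mask_clear(atomic_uint_fast64_t *mask, unsigned owner) {
    assert(owner < 64);
    atomic_fetch_and_explicit(mask, ~((uint_fast64_t)1 << owner), memory_order_release);
}

/* Returns the lowest owner index with a non-empty queue, or -1 if none.
 * Empty queues are never visited. */
static int ready_mask_first(atomic_uint_fast64_t *mask) {
    uint_fast64_t m = atomic_load_explicit(mask, memory_order_acquire);
    if (m == 0) return -1;
    return __builtin_ctzll((unsigned long long)m);
}
```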
---
## Phase Plan: Route S Implementation
### Phase 1: Minimal Changes (30 minutes)
1. ✅ Disable CDA (comment out `mf2_try_adopt_pending` call)
2. ✅ Verify must-reuse gate order (already mostly correct)
3. ✅ Keep threshold=1 (already set)
**Test**: 10s larson benchmark
**Success Criterion**: Page reuse > 50%, throughput > 60K ops/s
### Phase 2: Partial List (2-3 hours)
1. Add `partial_pages[POOL_NUM_CLASSES]` to MF2_ThreadPages
2. Modify `mf2_try_drain_and_activate()` to use partial list
3. Pop from partial list before active page check
**Test**: Same benchmark
**Success Criterion**: Page reuse > 70%, throughput > 80K ops/s
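A partial list only needs O(1) push and pop, since it is touched by the owner thread alone. The sketch below shows an intrusive singly linked version; the `next_partial` field, the `SKETCH_NUM_CLASSES` constant, and the function names are assumptions for illustration, standing in for whatever `MF2_ThreadPages` and `MidPage` end up containing.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NUM_CLASSES 8   /* stand-in for POOL_NUM_CLASSES */

typedef struct SketchPage {
    struct SketchPage *next_partial;  /* intrusive link, owner-only */
    int free_blocks;
} SketchPage;

typedef struct {
    SketchPage *partial_pages[SKETCH_NUM_CLASSES];  /* one list head per class */
} SketchThreadPages;

/* Puts a drained page with free blocks at the head of its class list. */
static void partial_push(SketchThreadPages *tp, int k, SketchPage *p) {
    assert(k >= 0 && k < SKETCH_NUM_CLASSES);
    assert(p != NULL && p->free_blocks > 0);
    p->next_partial = tp->partial_pages[k];
    tp->partial_pages[k] = p;
}

/* Takes the most recently pushed page, or NULL if the list is empty. */
static SketchPage *partial_pop(SketchThreadPages *tp, int k) {
    assert(k >= 0 && k < SKETCH_NUM_CLASSES);
    SketchPage *p = tp->partial_pages[k];
    if (p != NULL) {
        tp->partial_pages[k] = p->next_partial;
        p->next_partial = NULL;
    }
    return p;
}
```

Compared with scanning `full_pages`, the owner never has to walk pages that have nothing to offer: every page on the list is known to have at least one free block when it is pushed.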
### Phase 3: Event Wakeup (Optional, 4-6 hours)
1. Add futex-based wakeup for 0→1 edge
2. Owner thread sleeps when idle, wakes on remote free
3. Background drain for sleeping owners
**Test**: Same benchmark + long idle periods
**Success Criterion**: Low CPU usage during idle, fast wakeup on activity

---
## Debug Instrumentation (Lightweight Counters)
Add to `hakmem_pool.c`:
```c
// Must-reuse gate effectiveness
static atomic_uint_fast64_t g_mf2_mustreuse_hit = 0; // Reused from pending
static atomic_uint_fast64_t g_mf2_active_drain_hit = 0; // Drained active page
static atomic_uint_fast64_t g_mf2_new_page_forced = 0; // No reuse possible
// (Route P only) Adoption gating
static atomic_uint_fast64_t g_mf2_adoption_blocked_rfr = 0;
static atomic_uint_fast64_t g_mf2_adoption_blocked_idle = 0;
static atomic_uint_fast64_t g_mf2_adoption_allowed = 0;
```
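One way to turn the must-reuse counters into a single figure is a reuse-rate helper. This is a sketch: `mf2_reuse_rate_pct` is a hypothetical function, the counters are redeclared here only so the block compiles on its own, and relaxed loads are assumed to be sufficient because the value is used for reporting only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Redeclared locally for the sketch; in hakmem these would be the
 * counters listed above. */
static atomic_uint_fast64_t g_mf2_mustreuse_hit;
static atomic_uint_fast64_t g_mf2_active_drain_hit;
static atomic_uint_fast64_t g_mf2_new_page_forced;

/* Percentage of slow-path page acquisitions served by reuse
 * (pending queue or active-page drain) rather than a new page. */
static double mf2_reuse_rate_pct(void) {
    uint64_t pending = atomic_load_explicit(&g_mf2_mustreuse_hit, memory_order_relaxed);
    uint64_t active  = atomic_load_explicit(&g_mf2_active_drain_hit, memory_order_relaxed);
    uint64_t forced  = atomic_load_explicit(&g_mf2_new_page_forced, memory_order_relaxed);
    uint64_t reused  = pending + active;
    uint64_t total   = reused + forced;
    assert(total >= reused);          /* guards against counter overflow */
    if (total == 0) return 0.0;       /* nothing recorded yet */
    return 100.0 * (double)reused / (double)total;
}
```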
---
## Lessons Learned
### What Worked
- ✅ Pending queue design (lock-free MPSC stack)
- ✅ 0→1 edge detection (low overhead)
- ✅ Remote free tracking (atomic counters)
- ✅ Lock-free drain (exchange entire stack)
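The push and drain operations behind the first and last items can be sketched with C11 atomics. This is an illustration of the general technique (CAS push by many producers, whole-stack exchange by the single consumer), not a copy of the hakmem code; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct RemoteBlock {
    struct RemoteBlock *next;
} RemoteBlock;

typedef struct {
    _Atomic(RemoteBlock *) head;
} RemoteStack;

/* Producer side: any thread may push. Retries until the CAS succeeds. */
static void remote_push(RemoteStack *s, RemoteBlock *b) {
    assert(s != NULL && b != NULL);
    RemoteBlock *old = atomic_load_explicit(&s->head, memory_order_relaxed);
    do {
        b->next = old;   /* on CAS failure, `old` is refreshed with the current head */
    } while (!atomic_compare_exchange_weak_explicit(&s->head, &old, b,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Consumer side: the owner detaches the whole stack in one exchange and
 * then walks the returned list privately. */
static RemoteBlock *remote_drain_all(RemoteStack *s) {
    return atomic_exchange_explicit(&s->head, (RemoteBlock *)NULL,
                                    memory_order_acquire);
}

/* Counts the blocks in a drained (thread-private) list. */
static size_t remote_list_length(const RemoteBlock *list) {
    size_t n = 0;
    for (; list != NULL; list = list->next) n++;
    return n;
}
```

Because the consumer never pops individual nodes with a CAS, this pattern avoids the ABA problem that a general lock-free stack pop would have to handle.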
### What Failed
- ❌ Consumer-Driven Adoption (too aggressive)
- ❌ Threshold-based batching (wrong trade-off)
- ❌ Full pages list for reuse (too slow)
- ❌ No owner priority (adoption stealing)
### Key Insights
1. **"Good idea on paper" ≠ "works in practice"**
- CDA sounded great (use pages immediately!) but caused worse fragmentation
2. **Benchmarks reveal real-world patterns**
- Larson's producer-consumer pattern exposed CDA's fatal flaw
3. **Simplicity wins**
- Owner-only drain is simpler and likely faster than complex adoption logic
4. **Must-reuse is critical**
- Without forcing reuse, allocators will always prefer new pages (easier)
---
## References
- PHASE_7.2_MF2_PLAN_2025_10_24.md (original plan)
- ALIGNMENT_FIX_VERIFICATION.md (alignment debugging)
- ChatGPT Pro consultation (2025-10-25)
---
## Next Actions
1. ✅ Document investigation results (this file)
2. ⏳ Implement Route S minimal changes
3. ⏳ Test and measure improvements
4. ⏳ Decide: stop here or add partial list (Phase 2)
5. ⏳ Consider Route P if CDA needed for other workloads
**Status**: Ready to implement Route S Phase 1 ✅