# Phase 7.2.1: Consumer-Driven Adoption Investigation Results

**Date**: 2025-10-25
**Goal**: Fix performance issues with Consumer-Driven Adoption (CDA) in MF2
**Status**: ❌ **CDA Failed** - Switching to Route S (Owner-Priority Design)

---

## Executive Summary

Consumer-Driven Adoption (CDA) was implemented to handle Larson-style workloads where owner threads stop allocating. **Multiple optimization attempts all failed**, revealing a fundamental design flaw:

**Root Cause**: CDA creates an **"adoption stealing → fragmentation domino"** effect:
1. Thread A allocates from Page X
2. Thread A passes objects to Thread B for processing
3. Thread B frees objects → remote frees accumulate on Page X
4. **Thread C (needing allocation) adopts Page X** from A's pending queue
5. Thread A continues allocating → **creates NEW pages** (its queue is now empty!)

**Result**: Pages fragment across threads via aggressive adoption, and the original owners allocate new pages endlessly.

---

## Attempted Fixes & Results

| Fix Attempt | Page Reuse | Throughput | New Pages | Status |
|-------------|------------|------------|-----------|--------|
| Original (Phase 7.2) | ~40% | 44-58K ops/s | 227K | ❌ Suspected ping-pong |
| + pending_claim | ~40% | 44-58K ops/s | 227K | ❌ No improvement |
| Threshold=4 (batch) | 0.6% | 29K ops/s | 138K | ❌ Pages never enqueued |
| Threshold=2 | 3.6% | 31K ops/s | 138K | ❌ Still too high |
| Threshold=1 + Re-enqueue | 6.2% | 30K ops/s | 230K | ❌ Worse than original |
| **Target (mimalloc)** | >90% | 1M+ ops/s | <50K | 🎯 Goal |

### Key Statistics (Threshold=1 test, 10s run)

```
Page reuses:           14,299 (6.2% of total pages)
New pages allocated:   230,267 (16x more than reuses!)
Drain attempts:        14,346
Drain successes:       13,870 (96.7% success rate)
Pending enqueued:      16,587
Pending drained:       14,299
Throughput:            30,377 ops/sec
Real time:             15.989s (should be ~10s)
CPU time:              32.919s (4 threads × ~8.2s each)
```

**Diagnosis**: High drain success rate (96.7%) proves draining works correctly.
The problem is the **new page allocation rate**: 230K new pages for roughly 300K operations, i.e. only about 1.3 operations per new page.

---

## Root Cause Analysis (ChatGPT Pro Consultation)

### The Adoption Stealing Problem

**Larson benchmark pattern**:
```
Phase 1 (Alloc):
  Thread A: alloc from Page X → pass to B
  Thread B: alloc from Page Y → pass to A
  (Cross-thread ownership transfer)

Phase 2 (Free):
  Thread A: free B's objects → remote free to Page Y
  Thread B: free A's objects → remote free to Page X
  (Both pages have remote frees)

Phase 3 (Alloc again):
  Thread A: needs alloc → checks own pending queue
            → Page X is there!
            → BUT Thread C already adopted Page X! (queue empty)
            → allocates NEW page

  Thread C: was idle, woke up
            → scanned all pending queues
            → adopted Page X from A
            → used the free blocks
            → goes idle again
```

**The vicious cycle**:
1. Owner (A) has pages with remotes in pending queue
2. Adopter (C) scans and steals pages before owner can reuse
3. Owner finds empty queue → allocates new page
4. New page eventually fills → gets remote frees → stolen again
5. Repeat → **endless fragmentation**

### Threshold Approach Failures

**Threshold=1 (0→1 edge)**:
- ✅ Pages enqueued quickly (every remote free)
- ❌ Adopters steal immediately
- ❌ Owner rarely gets to reuse own pages
- Result: 40% reuse (not good enough)

**Threshold=2-4 (batching)**:
- ✅ Reduces pending queue operations
- ❌ Many pages never reach threshold
- ❌ Pages with 1-3 remotes are **lost entirely**
- Result: 0.6-3.6% reuse (catastrophic)

**Threshold + Re-enqueue after drain**:
- ✅ Catches new remotes during drain
- ❌ Doesn't fix fundamental adoption stealing
- Result: 6.2% reuse (worse than original)

---

## Architectural Flaws in CDA

### 1. No Owner Priority (Right-of-First-Refusal)

Current design allows **any thread** to adopt **immediately** when a page enters the pending queue.

**Problem**: Owner thread may still be actively allocating and would naturally reuse the page, but the adopter gets there first.
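To make the race concrete, here is a toy model of the problem. It is single-threaded and deterministic, it is not taken from the MF2 sources, and every `toy_*` name is hypothetical. It assumes the worst-case interleaving, in which an adopter scans after every 0→1 edge and before the owner reaches its slow path.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: the owner's pending queue is reduced to a page counter. */
typedef struct {
    int pending_pages;  /* pages with remote frees waiting in the owner's queue */
    int new_pages;      /* pages the owner had to allocate fresh */
    int reused_pages;   /* pages the owner reused from its own queue */
} ToyOwner;

/* A remote free on a page with no pending remotes (0→1 edge) enqueues it. */
static void toy_remote_free_edge(ToyOwner *o) {
    o->pending_pages++;
}

/* An adopter takes one pending page if adoption is enabled.
 * Returns true when a page was taken away from the owner. */
static bool toy_adopter_scan(ToyOwner *o, bool adoption_enabled) {
    if (!adoption_enabled || o->pending_pages == 0) return false;
    o->pending_pages--;
    return true;
}

/* Owner slow path: reuse an own pending page if any, else allocate a new one. */
static void toy_owner_slow_path(ToyOwner *o) {
    if (o->pending_pages > 0) {
        o->pending_pages--;
        o->reused_pages++;
    } else {
        o->new_pages++;
    }
}

/* Each round: a remote free arrives, the adopter scans, the owner needs a page. */
static ToyOwner toy_run(int rounds, bool adoption_enabled) {
    assert(rounds >= 0);
    ToyOwner o = {0, 0, 0};
    for (int i = 0; i < rounds; i++) {
        toy_remote_free_edge(&o);
        (void)toy_adopter_scan(&o, adoption_enabled);
        toy_owner_slow_path(&o);
    }
    return o;
}
```

Calling `toy_run(100, true)` leaves the owner with no reused pages and one new page per round, while `toy_run(100, false)` reuses every page. The real workload sits somewhere between these two extremes, because adopters do not win every race.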

**Missing**: Time-based grace period where owner has exclusive access.

### 2. Adoption Too Aggressive

Adoption triggers on **any non-empty pending queue**, regardless of:
- Owner activity level (is owner still allocating?)
- Owner's own partial page availability
- Remote count (is it worth adopting for just 1 block?)

**Problem**: Unnecessary ownership transfers cause:
- Cache line bouncing (page metadata)
- Loss of temporal locality
- Fragmentation across threads

### 3. No Must-Reuse Gate

Current design allows allocating **new pages** even when:
- Own pending queue has reusable pages
- Own full_pages list has pages with remotes

**Problem**: Owner bypasses reuse opportunities, creating new pages unnecessarily.

### 4. Full Pages List Misuse

Drained pages return to `full_pages` list, which is:
- Not scanned frequently
- Requires O(N) scan to find pages with remotes
- Effectively a "graveyard" for partially-free pages

**Problem**: Pages with free blocks get stuck in full_pages, not reused.

---

## Recommended Solution: Route S (Simple & Stable)

### Design Principles

1. **Owner-Only Drain**: Remove cross-thread adoption entirely
2. **Must-Reuse Gate**: Forbid new page allocation when reusable pages exist
3. **Partial List**: Replace full_pages with partial list for fast reuse
4. **0→1 Edge Detection**: Keep immediate pending queue notification

### Core Changes

#### Change 1: Disable CDA

```c
// In mf2_alloc_slow():
// ==== DISABLE Consumer-Driven Adoption ====
// if (mf2_try_adopt_pending(tp, class_idx)) {
//     return mf2_alloc_fast(class_idx, size, site_id);
// }
```

#### Change 2: Must-Reuse Gate (Owner Pending Priority)

```c
// In mf2_alloc_slow(), BEFORE new page allocation:

// CRITICAL: Drain own pending queue FIRST (must-reuse gate)
for (int budget = 0; budget < 4; budget++) {
    MidPage* pending_page = mf2_dequeue_pending(tp, class_idx);
    if (!pending_page) break;

    atomic_store_explicit(&pending_page->in_remote_pending, false,
                          memory_order_release);
    if (mf2_try_drain_and_activate(tp, class_idx, pending_page)) {
        atomic_fetch_add(&g_mf2_mustreuse_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}

// Check active page for remotes
if (page && page->remote_count > 0) {
    int drained = mf2_drain_remote_frees(page);
    if (drained > 0 && page->freelist) {
        atomic_fetch_add(&g_mf2_active_drain_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}

// Only NOW allocate new page (after exhausting reuse opportunities)
atomic_fetch_add(&g_mf2_new_page_count, 1);
page = mf2_alloc_new_page(class_idx);
```

#### Change 3: Keep Threshold=1 (0→1 Edge)

Already implemented - no change needed.

### Expected Results

| Metric | Before (CDA) | After (Route S) | Improvement |
|--------|--------------|-----------------|-------------|
| Page reuse rate | 6-40% | 70-80% | 2-13x ✅ |
| New pages (10s) | 230K | 50-80K | 3-5x fewer ✅ |
| Throughput | 30-58K ops/s | 60-100K ops/s | 2-3x ✅ |
| Real time | 16s | ~10s | 1.6x faster ✅ |

### Why This Works

1. **Owner always gets first chance** to reuse pages with remotes
2. **No adoption stealing** → no fragmentation domino
3. **Must-reuse gate** physically prevents new page allocation when reuse is possible
4. **Simpler design** → fewer race conditions, easier to debug

---

## Alternative: Route P (Performance - Keep CDA with Safeguards)

If Route S succeeds but we want to re-enable CDA for other workloads:

### Right-of-First-Refusal (RFR)

```c
// On 0→1 edge:
page->rfr_deadline = rdtsc() + RFR_WINDOW_CYCLES;  // e.g., 200µs

// In adoption:
if (rdtsc() < page->rfr_deadline) {
    continue;  // Owner priority window - skip adoption
}
```

### Adoption Gating (Multi-Condition AND)

Only allow adoption when **ALL** conditions met:
1. Owner inactive: `now - owner->last_alloc_tsc > IDLE_THRESHOLD` (e.g., 150µs)
2. Owner partial empty: `owner->partial_pages[k] == NULL`
3. Global pressure: `adoptable_count[k] > ADOPTION_THRESHOLD` (e.g., 4)
4. Sufficient remotes: `page->remote_count >= MIN_REMOTE_COUNT` (e.g., 8)
5. RFR expired: `now >= page->rfr_deadline`

### O(1) Adoption Scan

- Use `ready_mask[k]` bitmap for non-empty pending queues
- Budget: maximum 1 page adoption per slow path
- No scanning of empty queues

---

## Phase Plan: Route S Implementation

### Phase 1: Minimal Changes (30 minutes)

1. ✅ Disable CDA (comment out `mf2_try_adopt_pending` call)
2. ✅ Verify must-reuse gate order (already mostly correct)
3. ✅ Keep threshold=1 (already set)

**Test**: 10s larson benchmark
**Success Criterion**: Page reuse > 50%, throughput > 60K ops/s

### Phase 2: Partial List (2-3 hours)

1. Add `partial_pages[POOL_NUM_CLASSES]` to MF2_ThreadPages
2. Modify `mf2_try_drain_and_activate()` to use partial list
3. Pop from partial list before active page check

**Test**: Same benchmark
**Success Criterion**: Page reuse > 70%, throughput > 80K ops/s

### Phase 3: Event Wakeup (Optional, 4-6 hours)

1. Add futex-based wakeup for 0→1 edge
2. Owner thread sleeps when idle, wakes on remote free
3. Background drain for sleeping owners

**Test**: Same benchmark + long idle periods
**Success Criterion**: Low CPU usage during idle, fast wakeup on activity

---

## Debug Instrumentation (Lightweight Counters)

Add to hakmem_pool.c:

```c
// Must-reuse gate effectiveness
static atomic_uint_fast64_t g_mf2_mustreuse_hit = 0;     // Reused from pending
static atomic_uint_fast64_t g_mf2_active_drain_hit = 0;  // Drained active page
static atomic_uint_fast64_t g_mf2_new_page_forced = 0;   // No reuse possible

// (Route P only) Adoption gating
static atomic_uint_fast64_t g_mf2_adoption_blocked_rfr = 0;
static atomic_uint_fast64_t g_mf2_adoption_blocked_idle = 0;
static atomic_uint_fast64_t g_mf2_adoption_allowed = 0;
```

---

## Lessons Learned

### What Worked

- ✅ Pending queue design (lock-free MPSC stack)
- ✅ 0→1 edge detection (low overhead)
- ✅ Remote free tracking (atomic counters)
- ✅ Lock-free drain (exchange entire stack)

### What Failed

- ❌ Consumer-Driven Adoption (too aggressive)
- ❌ Threshold-based batching (wrong trade-off)
- ❌ Full pages list for reuse (too slow)
- ❌ No owner priority (adoption stealing)

### Key Insights

1. **"Good idea on paper" ≠ "works in practice"**
   - CDA sounded great (use pages immediately!) but caused worse fragmentation
2. **Benchmarks reveal real-world patterns**
   - Larson's producer-consumer pattern exposed CDA's fatal flaw
3. **Simplicity wins**
   - Owner-only drain is simpler and likely faster than complex adoption logic
4. **Must-reuse is critical**
   - Without forcing reuse, allocators will always prefer new pages (easier)

---

## References

- PHASE_7.2_MF2_PLAN_2025_10_24.md (original plan)
- ALIGNMENT_FIX_VERIFICATION.md (alignment debugging)
- ChatGPT Pro consultation (2025-10-25)

---

## Next Actions

1. ✅ Document investigation results (this file)
2. ⏳ Implement Route S minimal changes
3. ⏳ Test and measure improvements
4. ⏳ Decide: stop here or add partial list (Phase 2)
5. ⏳ Consider Route P if CDA needed for other workloads

**Status**: Ready to implement Route S Phase 1 ✅
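For reference while implementing Route S, the mechanisms this document credits as working (a lock-free MPSC stack for remote frees, 0→1 edge detection, and a drain that exchanges the entire stack) can be sketched in C11 atomics roughly as follows. This is a minimal sketch, not the MF2 implementation: all `Sketch*` / `sketch_*` names are hypothetical, and the edge is detected here from the stack being empty at push time rather than from a separate `remote_count` field, which is an assumption made to keep the example short.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; a freed block stores its own link. */
typedef struct SketchBlock {
    struct SketchBlock *next;
} SketchBlock;

typedef struct {
    _Atomic(SketchBlock *) remote_head;  /* stack of remotely freed blocks */
} SketchPage;

static void sketch_page_init(SketchPage *page) {
    atomic_init(&page->remote_head, (SketchBlock *)NULL);
}

/* Remote free (any thread): push the block with a CAS loop.
 * Returns true on the 0→1 edge, i.e. the stack was empty before this push,
 * which is when the caller would enqueue the page on the owner's pending queue. */
static bool sketch_remote_free(SketchPage *page, SketchBlock *blk) {
    SketchBlock *old =
        atomic_load_explicit(&page->remote_head, memory_order_relaxed);
    do {
        blk->next = old;  /* link to the head we expect to replace */
    } while (!atomic_compare_exchange_weak_explicit(
        &page->remote_head, &old, blk,
        memory_order_release, memory_order_relaxed));
    return old == NULL;
}

/* Drain (owner thread only): take the whole stack with a single exchange.
 * Stores the detached chain in *out and returns how many blocks it holds. */
static unsigned sketch_drain(SketchPage *page, SketchBlock **out) {
    SketchBlock *head = atomic_exchange_explicit(
        &page->remote_head, (SketchBlock *)NULL, memory_order_acquire);
    unsigned n = 0;
    for (SketchBlock *b = head; b != NULL; b = b->next) {
        n++;
    }
    *out = head;
    return n;
}
```

Because the drain swaps in `NULL`, the next remote free after a drain reports an edge again, so a page re-enters the owner's pending queue without any extra bookkeeping.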