hakmem/docs/status/PHASE_7.2.1_CDA_INVESTIGATION_2025_10_25.md

# Phase 7.2.1: Consumer-Driven Adoption Investigation Results

**Date**: 2025-10-25
**Goal**: Fix performance issues with Consumer-Driven Adoption (CDA) in MF2
**Status**: ❌ **CDA Failed** - Switching to Route S (Owner-Priority Design)

---

## Executive Summary

Consumer-Driven Adoption (CDA) was implemented to handle Larson-style workloads where owner threads stop allocating. **Multiple optimization attempts all failed**, revealing a fundamental design flaw:

**Root Cause**: CDA creates an **"adoption stealing → fragmentation domino"** effect:
1. Thread A allocates from Page X
2. Thread A passes objects to Thread B for processing
3. Thread B frees objects → remote frees accumulate on Page X
4. **Thread C (needing allocation) adopts Page X** from A's pending queue
5. Thread A continues allocating → **creates NEW pages** (its queue is now empty!)

**Result**: Aggressive adoption scatters pages across threads, while the original owners keep allocating new pages.

---

## Attempted Fixes & Results

| Fix Attempt | Page Reuse | Throughput | New Pages | Status |
|-------------|------------|------------|-----------|--------|
| Original (Phase 7.2) | ~40% | 44-58K ops/s | 227K | ❌ Suspected ping-pong |
| + pending_claim | ~40% | 44-58K ops/s | 227K | ❌ No improvement |
| Threshold=4 (batch) | 0.6% | 29K ops/s | 138K | ❌ Pages never enqueued |
| Threshold=2 | 3.6% | 31K ops/s | 138K | ❌ Still too high |
| Threshold=1 + Re-enqueue | 6.2% | 30K ops/s | 230K | ❌ Worse than original |
| **Target (mimalloc)** | >90% | 1M+ ops/s | <50K | 🎯 Goal |

### Key Statistics (Threshold=1 test, 10s run)

```
Page reuses:             14,299  (6.2% of total pages)
New pages allocated:    230,267  (~16x more than reuses!)
Drain attempts:          14,346
Drain successes:         13,870  (96.7% success rate)
Pending enqueued:        16,587
Pending drained:         14,299
Throughput:              30,377 ops/sec
Real time:               15.989s  (should be ~10s)
CPU time:                32.919s  (4 threads × ~8.2s each)
```

**Diagnosis**: The high drain success rate (96.7%) shows that draining itself works correctly. The problem is the **new page allocation rate**: 230K new pages for roughly 300K operations, i.e. only about 1.3 operations per new page.
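
The ratios quoted in this section can be re-derived from the raw counters. The following is a minimal sketch; `Mf2RunStats` and the three helper functions are hypothetical names introduced only for illustration and are not part of hakmem.

```c
#include <stdint.h>

/* Hypothetical container for the raw counters from the 10s run. */
typedef struct {
    uint64_t page_reuses;
    uint64_t new_pages;
    uint64_t drain_attempts;
    uint64_t drain_successes;
} Mf2RunStats;

/* Reuses relative to new pages, in percent. */
static double reuse_vs_new_pct(const Mf2RunStats *s) {
    return s->new_pages
        ? 100.0 * (double)s->page_reuses / (double)s->new_pages
        : 0.0;
}

/* How many new pages were allocated for every reused page. */
static double new_pages_per_reuse(const Mf2RunStats *s) {
    return s->page_reuses
        ? (double)s->new_pages / (double)s->page_reuses
        : 0.0;
}

/* Share of drain attempts that succeeded, in percent. */
static double drain_success_pct(const Mf2RunStats *s) {
    return s->drain_attempts
        ? 100.0 * (double)s->drain_successes / (double)s->drain_attempts
        : 0.0;
}
```

Fed with the Threshold=1 counters (14,299 reuses, 230,267 new pages, 14,346 drain attempts, 13,870 successes), these give roughly 6.2%, 16 new pages per reuse, and 96.7%.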

---

## Root Cause Analysis (ChatGPT Pro Consultation)

### The Adoption Stealing Problem

**Larson benchmark pattern**:
```
Phase 1 (Alloc):
  Thread A: alloc from Page X → pass to B
  Thread B: alloc from Page Y → pass to A
  (Cross-thread ownership transfer)

Phase 2 (Free):
  Thread A: free B's objects → remote free to Page Y
  Thread B: free A's objects → remote free to Page X
  (Both pages have remote frees)

Phase 3 (Alloc again):
  Thread A: needs alloc
    → checks own pending queue, expecting Page X
    → BUT Thread C already adopted Page X (queue is empty)
    → allocates NEW page

  Thread C: was idle, woke up
    → scanned all pending queues
    → adopted Page X from A
    → used the free blocks
    → goes idle again
```

**The vicious cycle**:
1. Owner (A) has pages with remotes in pending queue
2. Adopter (C) scans and steals pages before owner can reuse
3. Owner finds empty queue → allocates new page
4. New page eventually fills → gets remote frees → stolen again
5. Repeat → **endless fragmentation**
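
The cycle can be illustrated with a deliberately pessimistic toy model. This sketch is not hakmem code: `toy_run` and `ToyResult` are hypothetical names, the pending queue is reduced to a single slot, and the model assumes the adopter always wins the race for a pending page.

```c
#include <stdbool.h>

/* Outcome of the toy model from the owner's point of view. */
typedef struct {
    int new_pages;   /* pages the owner had to allocate fresh */
    int reuses;      /* pages the owner got back from its own queue */
} ToyResult;

/* One owner, one slow-path allocation per round. After every round the
 * owner's page receives remote frees and lands on its pending queue. */
static ToyResult toy_run(int rounds, bool adoption_enabled) {
    ToyResult r = {0, 0};
    bool pending_has_page = false;   /* owner's pending queue, depth 1 */

    for (int i = 0; i < rounds; i++) {
        /* Adopter scans first and steals whatever is pending. */
        if (adoption_enabled && pending_has_page) {
            pending_has_page = false;
        }
        /* Owner's slow path: reuse if possible, otherwise allocate new. */
        if (pending_has_page) {
            pending_has_page = false;
            r.reuses++;
        } else {
            r.new_pages++;
        }
        /* The page fills, is handed off, and remote frees arrive. */
        pending_has_page = true;
    }
    return r;
}
```

Under these assumptions, 100 rounds with adoption enabled produce 100 new pages and no reuse, while the same run with adoption disabled produces 1 new page and 99 reuses. Real runs sit between these extremes because adopters do not win every race.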

### Threshold Approach Failures

**Threshold=1 (0→1 edge)**:
- ✅ Pages enqueued quickly (every remote free)
- ❌ Adopters steal immediately
- ❌ Owner rarely gets to reuse own pages
- Result: 40% reuse (not good enough)

**Threshold=2-4 (batching)**:
- ✅ Reduces pending queue operations
- ❌ Many pages never reach threshold
- ❌ Pages with 1-3 remotes are **lost entirely**
- Result: 0.6-3.6% reuse (catastrophic)

**Threshold + Re-enqueue after drain**:
- ✅ Catches new remotes during drain
- ❌ Doesn't fix fundamental adoption stealing
- Result: 6.2% reuse (worse than original)
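
Why batching thresholds lose pages can be seen in a small sketch of the enqueue decision. It assumes the page is enqueued only by the remote free that makes the counter reach the threshold exactly; `SketchPage` and `sketch_remote_free` are hypothetical names, not the actual hakmem definitions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical page header: only the field this sketch needs. */
typedef struct {
    atomic_uint remote_count;   /* remote frees since the last drain */
} SketchPage;

/* Record one remote free. Returns true when this free makes the count
 * reach the threshold, i.e. the caller should enqueue the page as
 * pending. With threshold == 1 this is the 0->1 edge. */
static bool sketch_remote_free(SketchPage *p, unsigned threshold) {
    unsigned prev = atomic_fetch_add_explicit(&p->remote_count, 1u,
                                              memory_order_acq_rel);
    return prev + 1u == threshold;
}
```

With `threshold == 4`, a page that only ever receives three remote frees never returns `true` and is therefore never enqueued, which matches the "lost entirely" behavior described above. With `threshold == 1`, the first remote free always triggers the enqueue.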

---

## Architectural Flaws in CDA

### 1. No Owner Priority (Right-of-First-Refusal)

The current design allows **any thread** to adopt a page **immediately** after it enters the pending queue.

**Problem**: The owner thread may still be actively allocating and would naturally reuse the page, but an adopter gets there first.

**Missing**: Time-based grace period where owner has exclusive access.

### 2. Adoption Too Aggressive

Adoption triggers on **any non-empty pending queue**, regardless of:
- Owner activity level (is owner still allocating?)
- Owner's own partial page availability
- Remote count (is it worth adopting for just 1 block?)

**Problem**: Unnecessary ownership transfers cause:
- Cache line bouncing (page metadata)
- Loss of temporal locality
- Fragmentation across threads

### 3. No Must-Reuse Gate

Current design allows allocating **new pages** even when:
- Own pending queue has reusable pages
- Own full_pages list has pages with remotes

**Problem**: Owner bypasses reuse opportunities, creating new pages unnecessarily.

### 4. Full Pages List Misuse

Drained pages return to `full_pages` list, which is:
- Not scanned frequently
- Requires O(N) scan to find pages with remotes
- Effectively a "graveyard" for partially-free pages

**Problem**: Pages with free blocks get stuck in full_pages, not reused.

---

## Recommended Solution: Route S (Simple & Stable)

### Design Principles

1. **Owner-Only Drain**: Remove cross-thread adoption entirely
2. **Must-Reuse Gate**: Forbid new page allocation when reusable pages exist
3. **Partial List**: Replace full_pages with partial list for fast reuse
4. **0→1 Edge Detection**: Keep immediate pending queue notification

### Core Changes

#### Change 1: Disable CDA

```c
// In mf2_alloc_slow():
// ==== DISABLE Consumer-Driven Adoption ====
// if (mf2_try_adopt_pending(tp, class_idx)) {
//     return mf2_alloc_fast(class_idx, size, site_id);
// }
```

#### Change 2: Must-Reuse Gate (Owner Pending Priority)

```c
// In mf2_alloc_slow(), BEFORE new page allocation:

// CRITICAL: Drain own pending queue FIRST (must-reuse gate)
for (int budget = 0; budget < 4; budget++) {
    MidPage* pending_page = mf2_dequeue_pending(tp, class_idx);
    if (!pending_page) break;

    atomic_store_explicit(&pending_page->in_remote_pending, false, memory_order_release);

    if (mf2_try_drain_and_activate(tp, class_idx, pending_page)) {
        atomic_fetch_add(&g_mf2_mustreuse_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}

// Check active page for remotes
if (page && page->remote_count > 0) {
    int drained = mf2_drain_remote_frees(page);
    if (drained > 0 && page->freelist) {
        atomic_fetch_add(&g_mf2_active_drain_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}

// Only NOW allocate new page (after exhausting reuse opportunities)
atomic_fetch_add(&g_mf2_new_page_count, 1);
page = mf2_alloc_new_page(class_idx);
```

#### Change 3: Keep Threshold=1 (0→1 Edge)

Already implemented - no change needed.

### Expected Results

| Metric | Before (CDA) | After (Route S) | Improvement |
|--------|--------------|-----------------|-------------|
| Page reuse rate | 6-40% | 70-80% | 2-13x ✅ |
| New pages (10s) | 230K | 50-80K | 3-5x fewer ✅ |
| Throughput | 30-58K ops/s | 60-100K ops/s | 2-3x ✅ |
| Real time | 16s | ~10s | 1.6x faster ✅ |

### Why This Works

1. **Owner always gets first chance** to reuse pages with remotes
2. **No adoption stealing** → no fragmentation domino
3. **Must-reuse gate** physically prevents new page allocation when reuse is possible
4. **Simpler design** → fewer race conditions, easier to debug

---

## Alternative: Route P (Performance - Keep CDA with Safeguards)

If Route S succeeds but we want to re-enable CDA for other workloads:

### Right-of-First-Refusal (RFR)

```c
// On 0→1 edge:
page->rfr_deadline = rdtsc() + RFR_WINDOW_CYCLES;  // window in TSC cycles, e.g., ~200µs worth

// In adoption:
if (rdtsc() < page->rfr_deadline) {
    continue;  // Owner priority window - skip adoption
}
```

### Adoption Gating (Multi-Condition AND)

Only allow adoption when **ALL** of the following conditions are met:
1. Owner inactive: `now - owner->last_alloc_tsc > IDLE_THRESHOLD` (e.g., 150µs)
2. Owner partial empty: `owner->partial_pages[k] == NULL`
3. Global pressure: `adoptable_count[k] > ADOPTION_THRESHOLD` (e.g., 4)
4. Sufficient remotes: `page->remote_count >= MIN_REMOTE_COUNT` (e.g., 8)
5. RFR expired: `now >= page->rfr_deadline`
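
The five conditions above could be combined into a single predicate, sketched below. All types, field names, and constant values here are illustrative assumptions (the idle threshold is a placeholder cycle count, roughly 150µs at an assumed 3 GHz TSC); only the comparisons themselves come from the list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder tuning values; real values would need measurement. */
#define IDLE_THRESHOLD_CYCLES 450000ULL
#define ADOPTION_THRESHOLD    4u
#define MIN_REMOTE_COUNT      8u

/* Hypothetical per-owner state for one size class. */
typedef struct {
    uint64_t    last_alloc_tsc;  /* timestamp of the owner's last allocation */
    const void *partial_head;    /* owner's partial list head, NULL if empty */
} GateOwner;

/* Hypothetical per-page state. */
typedef struct {
    uint64_t rfr_deadline;       /* end of the owner-priority window */
    uint32_t remote_count;       /* pending remote frees on the page */
} GatePage;

/* Returns true only if every gating condition holds. */
static bool gate_allows_adoption(const GateOwner *owner, const GatePage *page,
                                 uint32_t adoptable_count, uint64_t now) {
    if (now - owner->last_alloc_tsc <= IDLE_THRESHOLD_CYCLES) return false; /* 1: owner active */
    if (owner->partial_head != NULL)                          return false; /* 2: owner has partials */
    if (adoptable_count <= ADOPTION_THRESHOLD)                return false; /* 3: low global pressure */
    if (page->remote_count < MIN_REMOTE_COUNT)                return false; /* 4: too few remotes */
    if (now < page->rfr_deadline)                             return false; /* 5: RFR window open */
    return true;
}
```

Ordering the checks from cheapest to most selective is a design choice, not a requirement; each early return could also bump one of the `g_mf2_adoption_blocked_*` counters.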

### O(1) Adoption Scan

- Use `ready_mask[k]` bitmap for non-empty pending queues
- Budget: maximum 1 page adoption per slow path
- No scanning of empty queues
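
One possible shape for the `ready_mask[k]` bitmap is sketched below. It assumes at most 64 threads per size class, with bit `t` set while thread `t` has a non-empty pending queue. `ReadyMask` and the helper names are hypothetical, and `__builtin_ctzll` is a GCC/Clang builtin rather than standard C.

```c
#include <stdatomic.h>
#include <stdint.h>

/* One mask per size class; bit t means "thread t has pending pages". */
typedef _Atomic uint64_t ReadyMask;

/* Called when a thread's pending queue goes from empty to non-empty. */
static inline void ready_set(ReadyMask *m, unsigned tid) {
    atomic_fetch_or_explicit(m, UINT64_C(1) << tid, memory_order_release);
}

/* Called when a thread's pending queue becomes empty again. */
static inline void ready_clear(ReadyMask *m, unsigned tid) {
    atomic_fetch_and_explicit(m, ~(UINT64_C(1) << tid), memory_order_release);
}

/* Lowest thread index with pending pages, or -1 if there is none.
 * __builtin_ctzll is undefined for 0, hence the explicit check. */
static inline int ready_first(ReadyMask *m) {
    uint64_t bits = atomic_load_explicit(m, memory_order_acquire);
    return bits ? __builtin_ctzll(bits) : -1;
}
```

The mask is only a hint: a queue may be emptied between the load and the dequeue, so the adopter still has to handle a failed dequeue.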

---

## Phase Plan: Route S Implementation

### Phase 1: Minimal Changes (30 minutes)

1. ✅ Disable CDA (comment out `mf2_try_adopt_pending` call)
2. ✅ Verify must-reuse gate order (already mostly correct)
3. ✅ Keep threshold=1 (already set)

**Test**: 10s larson benchmark
**Success Criterion**: Page reuse > 50%, throughput > 60K ops/s

### Phase 2: Partial List (2-3 hours)

1. Add `partial_pages[POOL_NUM_CLASSES]` to MF2_ThreadPages
2. Modify `mf2_try_drain_and_activate()` to use partial list
3. Pop from partial list before active page check

**Test**: Same benchmark
**Success Criterion**: Page reuse > 70%, throughput > 80K ops/s
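
A minimal sketch of what the per-class partial list might look like, assuming an intrusive singly linked list that only the owner thread touches (so no atomics are needed). `SketchMidPage`, `SketchThreadPages`, and `SKETCH_NUM_CLASSES` are stand-ins for the real `MidPage`, `MF2_ThreadPages`, and `POOL_NUM_CLASSES`, whose actual layouts are not shown in this document.

```c
#include <stddef.h>

#define SKETCH_NUM_CLASSES 8   /* placeholder for POOL_NUM_CLASSES */

/* Stand-in page header with an intrusive link for the partial list. */
typedef struct SketchMidPage {
    struct SketchMidPage *next_partial;  /* owner-only link */
    void *freelist;                      /* non-NULL when blocks are free */
} SketchMidPage;

/* Stand-in for the per-thread page state. */
typedef struct {
    SketchMidPage *partial_pages[SKETCH_NUM_CLASSES];
} SketchThreadPages;

/* Owner-only push: called after a drain leaves the page with free blocks. */
static void partial_push(SketchThreadPages *tp, int cls, SketchMidPage *pg) {
    pg->next_partial = tp->partial_pages[cls];
    tp->partial_pages[cls] = pg;
}

/* Owner-only pop in O(1); returns NULL when nothing is reusable. */
static SketchMidPage *partial_pop(SketchThreadPages *tp, int cls) {
    SketchMidPage *pg = tp->partial_pages[cls];
    if (pg != NULL) {
        tp->partial_pages[cls] = pg->next_partial;
        pg->next_partial = NULL;
    }
    return pg;
}
```

Compared with scanning `full_pages`, both operations are constant time, and a page with free blocks is found by the very next slow-path call instead of waiting for a scan.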

### Phase 3: Event Wakeup (Optional, 4-6 hours)

1. Add futex-based wakeup for 0→1 edge
2. Owner thread sleeps when idle, wakes on remote free
3. Background drain for sleeping owners

**Test**: Same benchmark + long idle periods
**Success Criterion**: Low CPU usage during idle, fast wakeup on activity

---

## Debug Instrumentation (Lightweight Counters)

Add to `hakmem_pool.c`:
```c
// Must-reuse gate effectiveness
static atomic_uint_fast64_t g_mf2_mustreuse_hit = 0;      // Reused from pending
static atomic_uint_fast64_t g_mf2_active_drain_hit = 0;  // Drained active page
static atomic_uint_fast64_t g_mf2_new_page_forced = 0;   // No reuse possible

// (Route P only) Adoption gating
static atomic_uint_fast64_t g_mf2_adoption_blocked_rfr = 0;
static atomic_uint_fast64_t g_mf2_adoption_blocked_idle = 0;
static atomic_uint_fast64_t g_mf2_adoption_allowed = 0;
```
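
To turn the must-reuse counters into a single number, a helper along the following lines could be used. This is a sketch: the function names are hypothetical, and it assumes `g_mf2_new_page_forced` is incremented exactly once per new page allocated on the slow path. The helpers take plain values so they can be fed from `atomic_load()` snapshots.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Share of slow-path calls that avoided a new page, in percent. */
static double sketch_reuse_share_pct(uint64_t mustreuse_hit,
                                     uint64_t active_drain_hit,
                                     uint64_t new_page_forced) {
    uint64_t reused = mustreuse_hit + active_drain_hit;
    uint64_t total  = reused + new_page_forced;
    return total ? 100.0 * (double)reused / (double)total : 0.0;
}

/* Print a one-line summary, e.g. from an atexit() handler. */
static void sketch_print_gate_stats(FILE *out,
                                    uint64_t mustreuse_hit,
                                    uint64_t active_drain_hit,
                                    uint64_t new_page_forced) {
    fprintf(out,
            "[MF2 gate] pending_reuse=%" PRIu64 " active_drain=%" PRIu64
            " new_forced=%" PRIu64 " reuse_share=%.1f%%\n",
            mustreuse_hit, active_drain_hit, new_page_forced,
            sketch_reuse_share_pct(mustreuse_hit, active_drain_hit,
                                   new_page_forced));
}
```

A reuse share well below the Phase 1 target would indicate that pages are still not reaching the owner's pending queue in time, independent of raw throughput.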

---

## Lessons Learned

### What Worked
- ✅ Pending queue design (lock-free MPSC stack)
- ✅ 0→1 edge detection (low overhead)
- ✅ Remote free tracking (atomic counters)
- ✅ Lock-free drain (exchange entire stack)
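
The push-then-exchange pattern behind the last two items can be sketched with C11 atomics as follows. This is a generic illustration of the technique, not the hakmem implementation; `SketchNode`, `SketchStack`, and both function names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Intrusive node; a real page or block would embed this link. */
typedef struct SketchNode {
    struct SketchNode *next;
} SketchNode;

/* Multi-producer, single-consumer stack head. */
typedef struct {
    _Atomic(SketchNode *) head;
} SketchStack;

/* Any thread: push one node with a CAS loop. On failure, `old` is
 * refreshed with the current head and the link is rewritten. */
static void sketch_push(SketchStack *s, SketchNode *n) {
    SketchNode *old = atomic_load_explicit(&s->head, memory_order_relaxed);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Owner only: detach the whole chain with one atomic exchange.
 * Returns the chain in LIFO order, or NULL if the stack was empty. */
static SketchNode *sketch_drain_all(SketchStack *s) {
    return atomic_exchange_explicit(&s->head, (SketchNode *)NULL,
                                    memory_order_acquire);
}
```

Because the consumer never pops individual nodes and always takes the entire chain, the classic ABA hazard of lock-free stack pops does not arise in this pattern.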

### What Failed
- ❌ Consumer-Driven Adoption (too aggressive)
- ❌ Threshold-based batching (wrong trade-off)
- ❌ Full pages list for reuse (too slow)
- ❌ No owner priority (adoption stealing)

### Key Insights
1. **"Good idea on paper" ≠ "works in practice"**
   - CDA sounded great (use pages immediately!) but caused worse fragmentation

2. **Benchmarks reveal real-world patterns**
   - Larson's producer-consumer pattern exposed CDA's fatal flaw

3. **Simplicity wins**
   - Owner-only drain is simpler and likely faster than complex adoption logic

4. **Must-reuse is critical**
   - Without forcing reuse, allocators will always prefer new pages (easier)

---

## References

- PHASE_7.2_MF2_PLAN_2025_10_24.md (original plan)
- ALIGNMENT_FIX_VERIFICATION.md (alignment debugging)
- ChatGPT Pro consultation (2025-10-25)

---

## Next Actions

1. ✅ Document investigation results (this file)
2. ⏳ Implement Route S minimal changes
3. ⏳ Test and measure improvements
4. ⏳ Decide: stop here or add partial list (Phase 2)
5. ⏳ Consider Route P if CDA needed for other workloads

**Status**: Ready to implement Route S Phase 1 ✅