Phase 7.2.1: Consumer-Driven Adoption Investigation Results
Date: 2025-10-25
Goal: Fix performance issues with Consumer-Driven Adoption (CDA) in MF2
Status: ❌ CDA Failed - Switching to Route S (Owner-Priority Design)
Executive Summary
Consumer-Driven Adoption (CDA) was implemented to handle Larson-style workloads where owner threads stop allocating. Multiple optimization attempts all failed, revealing a fundamental design flaw:
Root Cause: CDA creates an "adoption stealing → fragmentation domino" effect:
- Thread A allocates from Page X
- Thread A passes objects to Thread B for processing
- Thread B frees objects → remote frees accumulate on Page X
- Thread C (needing allocation) adopts Page X from A's pending queue
- Thread A continues allocating → creates NEW pages (its queue is now empty!)
Result: Pages fragment across threads via aggressive adoption, while the original owners endlessly allocate new pages.
Attempted Fixes & Results
| Fix Attempt | Page Reuse | Throughput | New Pages | Status |
|---|---|---|---|---|
| Original (Phase 7.2) | ~40% | 44-58K ops/s | 227K | ❌ Suspected ping-pong |
| + pending_claim | ~40% | 44-58K ops/s | 227K | ❌ No improvement |
| Threshold=4 (batch) | 0.6% | 29K ops/s | 138K | ❌ Pages never enqueued |
| Threshold=2 | 3.6% | 31K ops/s | 138K | ❌ Still too high |
| Threshold=1 + Re-enqueue | 6.2% | 30K ops/s | 230K | ❌ Worse than original |
| Target (mimalloc) | >90% | 1M+ ops/s | <50K | 🎯 Goal |
Key Statistics (Threshold=1 test, 10s run)
Page reuses: 14,299 (6.2% of total pages)
New pages allocated: 230,267 (94x more than reuses!)
Drain attempts: 14,346
Drain successes: 13,870 (96.7% success rate)
Pending enqueued: 16,587
Pending drained: 14,299
Throughput: 30,377 ops/sec
Real time: 15.989s (should be ~10s)
CPU time: 32.919s (4 threads × ~8.2s each)
Diagnosis: The high drain success rate (96.7%) proves that draining itself works correctly. The problem is the rate of new page allocation: ~230K new pages for ~300K operations, i.e. roughly 1.3 operations per new page, so nearly every allocation lands on a brand-new page instead of a reused one.
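For reference, the ~300K figure is just the measured throughput extrapolated over the nominal 10s of work; the per-page figures then follow directly from the counters above:

Total operations ≈ 30,377 ops/s × 10 s ≈ 304K
New pages per operation ≈ 230,267 / 304K ≈ 0.76 (i.e. ~1.3 operations per new page)
Page reuses vs. new pages ≈ 14,299 / 230,267 ≈ 6.2%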
Root Cause Analysis (ChatGPT Pro Consultation)
The Adoption Stealing Problem
Larson benchmark pattern:
Phase 1 (Alloc):
Thread A: alloc from Page X → pass to B
Thread B: alloc from Page Y → pass to A
(Cross-thread ownership transfer)
Phase 2 (Free):
Thread A: free B's objects → remote free to Page Y
Thread B: free A's objects → remote free to Page X
(Both pages have remote frees)
Phase 3 (Alloc again):
Thread A: needs alloc
→ checks own pending queue → Page X is there!
→ BUT Thread C already adopted Page X! (queue empty)
→ allocates NEW page
Thread C: was idle, woke up
→ scanned all pending queues
→ adopted Page X from A
→ used the free blocks
→ goes idle again
The vicious cycle:
- Owner (A) has pages with remotes in pending queue
- Adopter (C) scans and steals pages before owner can reuse
- Owner finds empty queue → allocates new page
- New page eventually fills → gets remote frees → stolen again
- Repeat → endless fragmentation
Threshold Approach Failures
Threshold=1 (0→1 edge):
- ✅ Pages enqueued quickly (every remote free)
- ❌ Adopters steal immediately
- ❌ Owner rarely gets to reuse own pages
- Result: 40% reuse (not good enough)
Threshold=2-4 (batching):
- ✅ Reduces pending queue operations
- ❌ Many pages never reach threshold
- ❌ Pages with 1-3 remotes are lost entirely
- Result: 0.6-3.6% reuse (catastrophic)
Threshold + Re-enqueue after drain:
- ✅ Catches new remotes during drain
- ❌ Doesn't fix fundamental adoption stealing
- Result: 6.2% reuse (worse than original)
Architectural Flaws in CDA
1. No Owner Priority (Right-of-First-Refusal)
Current design allows any thread to adopt immediately when a page enters pending queue.
Problem: Owner thread may still be actively allocating and would naturally reuse the page, but adopter gets there first.
Missing: Time-based grace period where owner has exclusive access.
2. Adoption Too Aggressive
Adoption triggers on any non-empty pending queue, regardless of:
- Owner activity level (is owner still allocating?)
- Owner's own partial page availability
- Remote count (is it worth adopting for just 1 block?)
Problem: Unnecessary ownership transfers cause:
- Cache line bouncing (page metadata)
- Loss of temporal locality
- Fragmentation across threads
3. No Must-Reuse Gate
Current design allows allocating new pages even when:
- Own pending queue has reusable pages
- Own full_pages list has pages with remotes
Problem: Owner bypasses reuse opportunities, creating new pages unnecessarily.
4. Full Pages List Misuse
Drained pages return to full_pages list, which is:
- Not scanned frequently
- Requires O(N) scan to find pages with remotes
- Effectively a "graveyard" for partially-free pages
Problem: Pages with free blocks get stuck in full_pages, not reused.
Recommended Solution: Route S (Simple & Stable)
Design Principles
- Owner-Only Drain: Remove cross-thread adoption entirely
- Must-Reuse Gate: Forbid new page allocation when reusable pages exist
- Partial List: Replace full_pages with partial list for fast reuse
- 0→1 Edge Detection: Keep immediate pending queue notification
Core Changes
Change 1: Disable CDA
// In mf2_alloc_slow():
// ==== DISABLE Consumer-Driven Adoption ====
// if (mf2_try_adopt_pending(tp, class_idx)) {
// return mf2_alloc_fast(class_idx, size, site_id);
// }
Change 2: Must-Reuse Gate (Owner Pending Priority)
// In mf2_alloc_slow(), BEFORE new page allocation:

// CRITICAL: Drain own pending queue FIRST (must-reuse gate)
for (int budget = 0; budget < 4; budget++) {
    MidPage* pending_page = mf2_dequeue_pending(tp, class_idx);
    if (!pending_page) break;
    atomic_store_explicit(&pending_page->in_remote_pending, false, memory_order_release);
    if (mf2_try_drain_and_activate(tp, class_idx, pending_page)) {
        atomic_fetch_add(&g_mf2_mustreuse_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}

// Check active page for remotes
if (page && page->remote_count > 0) {
    int drained = mf2_drain_remote_frees(page);
    if (drained > 0 && page->freelist) {
        atomic_fetch_add(&g_mf2_active_drain_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}

// Only NOW allocate new page (after exhausting reuse opportunities)
atomic_fetch_add(&g_mf2_new_page_count, 1);
page = mf2_alloc_new_page(class_idx);
Change 3: Keep Threshold=1 (0→1 Edge)
Already implemented - no change needed.
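Since the 0→1 edge mechanism stays, the sketch below shows how it and the exchange-based drain fit together on the remote-free path. This is a minimal illustration, not the existing implementation: the remote_freelist field, the owner_tp/class_idx back-pointers, and the mf2_enqueue_pending helper are assumed names; only remote_count, in_remote_pending, and the MPSC pending queue are described elsewhere in this document.

// Remote-free path (sketch). Every block push is a lock-free Treiber-stack push; only
// the 0->1 transition of remote_count enqueues the page into the owner's pending queue.
static void mf2_remote_free_sketch(MidPage* page, void* block) {
    void* head = atomic_load_explicit(&page->remote_freelist, memory_order_relaxed);
    do {
        *(void**)block = head;                               // block->next = current head
    } while (!atomic_compare_exchange_weak_explicit(&page->remote_freelist, &head, block,
                                                    memory_order_release, memory_order_relaxed));

    if (atomic_fetch_add_explicit(&page->remote_count, 1, memory_order_acq_rel) == 0) {
        bool expected = false;                               // 0->1 edge: notify owner once
        if (atomic_compare_exchange_strong(&page->in_remote_pending, &expected, true)) {
            mf2_enqueue_pending(page->owner_tp, page->class_idx, page);  // hypothetical helper
        }
    }
}

// Drain side (what mf2_drain_remote_frees() does conceptually): take the whole remote
// stack with one exchange, then splice it into the owner-local freelist.
static int mf2_drain_remote_frees_sketch(MidPage* page) {
    void* head = atomic_exchange_explicit(&page->remote_freelist, NULL, memory_order_acquire);
    int drained = 0;
    while (head) {
        void* next = *(void**)head;
        *(void**)head = page->freelist;                      // push onto local freelist
        page->freelist = head;
        head = next;
        drained++;
    }
    atomic_fetch_sub_explicit(&page->remote_count, drained, memory_order_release);
    return drained;
}

The intent is that each page is enqueued at most once until its owner dequeues it, so the pending queue stays short even under heavy remote freeing.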
Expected Results
| Metric | Before (CDA) | After (Route S) | Improvement |
|---|---|---|---|
| Page reuse rate | 6-40% | 70-80% | 2-13x ✅ |
| New pages (10s) | 230K | 50-80K | 3-5x fewer ✅ |
| Throughput | 30-58K ops/s | 60-100K ops/s | 2-3x ✅ |
| Real time | 16s | ~10s | 1.6x faster ✅ |
Why This Works
- Owner always gets first chance to reuse pages with remotes
- No adoption stealing → no fragmentation domino
- Must-reuse gate physically prevents new page allocation when reuse is possible
- Simpler design → fewer race conditions, easier to debug
Alternative: Route P (Performance - Keep CDA with Safeguards)
If Route S succeeds but we want to re-enable CDA for other workloads:
Right-of-First-Refusal (RFR)
// On 0→1 edge:
page->rfr_deadline = rdtsc() + RFR_WINDOW_CYCLES;  // e.g., 200µs

// In adoption:
if (rdtsc() < page->rfr_deadline) {
    continue;  // Owner priority window - skip adoption
}
Adoption Gating (Multi-Condition AND)
Only allow adoption when ALL of the following conditions are met (a combined check is sketched after this list):
- Owner inactive: now - owner->last_alloc_tsc > IDLE_THRESHOLD (e.g., 150µs)
- Owner partial empty: owner->partial_pages[k] == NULL
- Global pressure: adoptable_count[k] > ADOPTION_THRESHOLD (e.g., 4)
- Sufficient remotes: page->remote_count >= MIN_REMOTE_COUNT (e.g., 8)
- RFR expired: now >= page->rfr_deadline
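A minimal sketch of how these conditions could be combined into a single gate on the adoption path; the field and constant names are taken from the list above, but the helper itself and its exact placement are illustrative, not existing code:

// Route P adoption gate (sketch). Returns true only when every condition above holds.
// Threshold values are the examples from the list; tune as needed.
static bool mf2_adoption_allowed_sketch(MF2_ThreadPages* owner, MidPage* page,
                                        int k, uint64_t now /* e.g. rdtsc() */) {
    if (now < page->rfr_deadline)                        return false;  // RFR window still open
    if (now - owner->last_alloc_tsc <= IDLE_THRESHOLD)   return false;  // owner still allocating
    if (owner->partial_pages[k] != NULL)                 return false;  // owner has partials to use
    if (adoptable_count[k] <= ADOPTION_THRESHOLD)        return false;  // no global pressure
    if (page->remote_count < MIN_REMOTE_COUNT)           return false;  // not worth adopting
    return true;                                                        // all gates passed -> adopt
}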
O(1) Adoption Scan
- Use a ready_mask[k] bitmap for non-empty pending queues (see the sketch below)
- Budget: at most 1 page adoption per slow path
- No scanning of empty queues
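One possible shape for the O(1) scan, reading ready_mask as a bitmap over size classes; this interpretation, the global mask, and mf2_dequeue_pending_global are assumptions for illustration only:

static _Atomic uint64_t g_ready_mask;   // bit k set => some pending queue for class k is non-empty

static MidPage* mf2_pick_adoptable_sketch(void) {
    uint64_t mask = atomic_load_explicit(&g_ready_mask, memory_order_acquire);
    if (mask == 0) return NULL;                       // nothing adoptable: no queue walk at all
    int k = __builtin_ctzll(mask);                    // lowest non-empty class (GCC/Clang builtin)
    MidPage* page = mf2_dequeue_pending_global(k);    // hypothetical: pop one page for class k
    if (page == NULL)                                 // queue raced to empty: clear the bit
        atomic_fetch_and_explicit(&g_ready_mask, ~(1ull << k), memory_order_acq_rel);
    return page;                                      // budget: at most one adoption per slow path
}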
Phase Plan: Route S Implementation
Phase 1: Minimal Changes (30 minutes)
- ✅ Disable CDA (comment out the mf2_try_adopt_pending call)
- ✅ Verify must-reuse gate order (already mostly correct)
- ✅ Keep threshold=1 (already set)
Test: 10s larson benchmark
Success Criterion: Page reuse > 50%, throughput > 60K ops/s
Phase 2: Partial List (2-3 hours)
- Add partial_pages[POOL_NUM_CLASSES] to MF2_ThreadPages (push/pop sketched below)
- Modify mf2_try_drain_and_activate() to use the partial list
- Pop from the partial list before the active page check
Test: Same benchmark
Success Criterion: Page reuse > 70%, throughput > 80K ops/s
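A possible shape for the partial list itself. The list is owner-local, so plain pointers suffice; next_partial is an assumed link field on MidPage and the two helper names are illustrative:

// Owner-only partial list per size class: pages that still have (or just regained) free blocks.
static void mf2_partial_push_sketch(MF2_ThreadPages* tp, int k, MidPage* page) {
    page->next_partial = tp->partial_pages[k];   // assumed link field on MidPage
    tp->partial_pages[k] = page;                 // no atomics: only the owner touches this list
}

static MidPage* mf2_partial_pop_sketch(MF2_ThreadPages* tp, int k) {
    MidPage* page = tp->partial_pages[k];
    if (page) tp->partial_pages[k] = page->next_partial;
    return page;                                 // caller drains remotes and activates the page
}

Per the third bullet above, the pop would run in mf2_alloc_slow() before the active-page check, so a previously drained page is reused ahead of any new page allocation.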
Phase 3: Event Wakeup (Optional, 4-6 hours)
- Add futex-based wakeup for 0→1 edge
- Owner thread sleeps when idle, wakes on remote free
- Background drain for sleeping owners
Test: Same benchmark + long idle periods
Success Criterion: Low CPU usage during idle, fast wakeup on activity
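If Phase 3 is attempted, the wakeup could look roughly like this on Linux. This is a sketch, not a design decision: the pending_gen counter, the parking protocol, and the hook points are all assumptions; error handling is omitted.

#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

// Owner parks on a 32-bit "pending generation" word; a remote free bumps it on the
// 0->1 edge and wakes at most one sleeper.
static void mf2_owner_park_sketch(uint32_t* pending_gen, uint32_t seen) {
    // Sleeps only if *pending_gen still equals 'seen' (no notification since we last checked).
    syscall(SYS_futex, pending_gen, FUTEX_WAIT_PRIVATE, seen, NULL, NULL, 0);
}

static void mf2_notify_owner_sketch(uint32_t* pending_gen) {
    __atomic_add_fetch(pending_gen, 1, __ATOMIC_RELEASE);              // publish new pending work
    syscall(SYS_futex, pending_gen, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}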
Debug Instrumentation (Lightweight Counters)
Add to hakmem_pool.c:
// Must-reuse gate effectiveness
static atomic_uint_fast64_t g_mf2_mustreuse_hit = 0; // Reused from pending
static atomic_uint_fast64_t g_mf2_active_drain_hit = 0; // Drained active page
static atomic_uint_fast64_t g_mf2_new_page_forced = 0; // No reuse possible
// (Route P only) Adoption gating
static atomic_uint_fast64_t g_mf2_adoption_blocked_rfr = 0;
static atomic_uint_fast64_t g_mf2_adoption_blocked_idle = 0;
static atomic_uint_fast64_t g_mf2_adoption_allowed = 0;
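A small reporting helper can turn these counters into the page-reuse figure used in the success criteria above. This is a sketch; where it gets called from (an existing stats dump, an atexit hook, etc.) is left open, and it assumes <stdio.h> is available in hakmem_pool.c:

static void mf2_report_reuse_sketch(void) {
    unsigned long long reused  = (unsigned long long)atomic_load(&g_mf2_mustreuse_hit);
    unsigned long long drained = (unsigned long long)atomic_load(&g_mf2_active_drain_hit);
    unsigned long long forced  = (unsigned long long)atomic_load(&g_mf2_new_page_forced);
    unsigned long long total   = reused + drained + forced;
    fprintf(stderr, "[MF2] mustreuse=%llu active_drain=%llu new_forced=%llu reuse=%.1f%%\n",
            reused, drained, forced,
            total ? 100.0 * (double)(reused + drained) / (double)total : 0.0);
}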
Lessons Learned
What Worked
- ✅ Pending queue design (lock-free MPSC stack)
- ✅ 0→1 edge detection (low overhead)
- ✅ Remote free tracking (atomic counters)
- ✅ Lock-free drain (exchange entire stack)
What Failed
- ❌ Consumer-Driven Adoption (too aggressive)
- ❌ Threshold-based batching (wrong trade-off)
- ❌ Full pages list for reuse (too slow)
- ❌ No owner priority (adoption stealing)
Key Insights
- "Good idea on paper" ≠ "works in practice": CDA sounded great (use pages immediately!) but caused worse fragmentation
- Benchmarks reveal real-world patterns: Larson's producer-consumer pattern exposed CDA's fatal flaw
- Simplicity wins: owner-only drain is simpler and likely faster than complex adoption logic
- Must-reuse is critical: without forcing reuse, allocators will always prefer new pages (easier)
References
- PHASE_7.2_MF2_PLAN_2025_10_24.md (original plan)
- ALIGNMENT_FIX_VERIFICATION.md (alignment debugging)
- ChatGPT Pro consultation (2025-10-25)
Next Actions
- ✅ Document investigation results (this file)
- ⏳ Implement Route S minimal changes
- ⏳ Test and measure improvements
- ⏳ Decide: stop here or add partial list (Phase 2)
- ⏳ Consider Route P if CDA needed for other workloads
Status: Ready to implement Route S Phase 1 ✅