
Phase 7.2.1: Consumer-Driven Adoption Investigation Results

Date: 2025-10-25
Goal: Fix performance issues with Consumer-Driven Adoption (CDA) in MF2
Status: CDA Failed - Switching to Route S (Owner-Priority Design)


Executive Summary

Consumer-Driven Adoption (CDA) was implemented to handle Larson-style workloads where owner threads stop allocating. Multiple optimization attempts all failed, revealing a fundamental design flaw:

Root Cause: CDA creates an "adoption stealing → fragmentation domino" effect:

  1. Thread A allocates from Page X
  2. Thread A passes objects to Thread B for processing
  3. Thread B frees objects → remote frees accumulate on Page X
  4. Thread C (needing allocation) adopts Page X from A's pending queue
  5. Thread A continues allocating → creates NEW pages (its queue is now empty!)

Result: Pages fragment across threads via aggressive adoption, while the original owners allocate new pages endlessly.


Attempted Fixes & Results

Fix Attempt                 Page Reuse   Throughput      New Pages   Status
Original (Phase 7.2)        ~40%         44-58K ops/s    227K        Suspected ping-pong
+ pending_claim             ~40%         44-58K ops/s    227K        No improvement
Threshold=4 (batch)         0.6%         29K ops/s       138K        Pages never enqueued
Threshold=2                 3.6%         31K ops/s       138K        Still too high
Threshold=1 + Re-enqueue    6.2%         30K ops/s       230K        Worse than original
Target (mimalloc)           >90%         1M+ ops/s       <50K        🎯 Goal

Key Statistics (Threshold=1 test, 10s run)

Page reuses:             14,299  (6.2% of total pages)
New pages allocated:    230,267  (94x more than reuses!)
Drain attempts:          14,346
Drain successes:         13,870  (96.7% success rate)
Pending enqueued:        16,587
Pending drained:         14,299
Throughput:              30,377 ops/sec
Real time:               15.989s  (should be ~10s)
CPU time:                32.919s  (4 threads × ~8.2s each)

Diagnosis: The high drain success rate (96.7%) proves draining works correctly. The problem is the new page allocation rate: roughly 230K new pages for ~300K operations, i.e. about 1.3 allocations per new page, so nearly every allocation lands on a fresh page instead of a reused one.


Root Cause Analysis (ChatGPT Pro Consultation)

The Adoption Stealing Problem

Larson benchmark pattern:

Phase 1 (Alloc):
  Thread A: alloc from Page X → pass to B
  Thread B: alloc from Page Y → pass to A
  (Cross-thread ownership transfer)

Phase 2 (Free):
  Thread A: free B's objects → remote free to Page Y
  Thread B: free A's objects → remote free to Page X
  (Both pages have remote frees)

Phase 3 (Alloc again):
  Thread A: needs alloc
    → checks own pending queue → Page X is there!
    → BUT Thread C already adopted Page X! (queue empty)
    → allocates NEW page

  Thread C: was idle, woke up
    → scanned all pending queues
    → adopted Page X from A
    → used the free blocks
    → goes idle again

The vicious cycle:

  1. Owner (A) has pages with remotes in pending queue
  2. Adopter (C) scans and steals pages before owner can reuse
  3. Owner finds empty queue → allocates new page
  4. New page eventually fills → gets remote frees → stolen again
  5. Repeat → endless fragmentation

Threshold Approach Failures

Threshold=1 (0→1 edge):

  • Pages enqueued quickly (every remote free)
  • Adopters steal immediately
  • Owner rarely gets to reuse own pages
  • Result: 40% reuse (not good enough)

Threshold=2-4 (batching):

  • Reduces pending queue operations
  • Many pages never reach threshold
  • Pages with 1-3 remotes are lost entirely
  • Result: 0.6-3.6% reuse (catastrophic)

Threshold + Re-enqueue after drain:

  • Catches new remotes during drain
  • Doesn't fix fundamental adoption stealing
  • Result: 6.2% reuse (worse than original)

Architectural Flaws in CDA

1. No Owner Priority (Right-of-First-Refusal)

Current design allows any thread to adopt immediately when a page enters pending queue.

Problem: The owner thread may still be actively allocating and would naturally reuse the page, but an adopter gets there first.

Missing: A time-based grace period during which the owner has exclusive access.

2. Adoption Too Aggressive

Adoption triggers on any non-empty pending queue, regardless of:

  • Owner activity level (is owner still allocating?)
  • Owner's own partial page availability
  • Remote count (is it worth adopting for just 1 block?)

Problem: Unnecessary ownership transfers cause:

  • Cache line bouncing (page metadata)
  • Loss of temporal locality
  • Fragmentation across threads

3. No Must-Reuse Gate

Current design allows allocating new pages even when:

  • Own pending queue has reusable pages
  • Own full_pages list has pages with remotes

Problem: Owner bypasses reuse opportunities, creating new pages unnecessarily.

4. Full Pages List Misuse

Drained pages return to the full_pages list, which is:

  • Not scanned frequently
  • Requires O(N) scan to find pages with remotes
  • Effectively a "graveyard" for partially-free pages

Problem: Pages with free blocks get stuck in full_pages, not reused.


Design Principles

  1. Owner-Only Drain: Remove cross-thread adoption entirely
  2. Must-Reuse Gate: Forbid new page allocation when reusable pages exist
  3. Partial List: Replace full_pages with partial list for fast reuse
  4. 0→1 Edge Detection: Keep immediate pending queue notification

Core Changes

Change 1: Disable CDA

// In mf2_alloc_slow():
// ==== DISABLE Consumer-Driven Adoption ====
// if (mf2_try_adopt_pending(tp, class_idx)) {
//     return mf2_alloc_fast(class_idx, size, site_id);
// }

Change 2: Must-Reuse Gate (Owner Pending Priority)

// In mf2_alloc_slow(), BEFORE new page allocation:

// CRITICAL: Drain own pending queue FIRST (must-reuse gate)
for (int budget = 0; budget < 4; budget++) {
    MidPage* pending_page = mf2_dequeue_pending(tp, class_idx);
    if (!pending_page) break;

    atomic_store_explicit(&pending_page->in_remote_pending, false, memory_order_release);

    if (mf2_try_drain_and_activate(tp, class_idx, pending_page)) {
        atomic_fetch_add(&g_mf2_mustreuse_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}

// Check active page for remotes
if (page && page->remote_count > 0) {
    int drained = mf2_drain_remote_frees(page);
    if (drained > 0 && page->freelist) {
        atomic_fetch_add(&g_mf2_active_drain_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}

// Only NOW allocate new page (after exhausting reuse opportunities)
atomic_fetch_add(&g_mf2_new_page_forced, 1);  // Stats
page = mf2_alloc_new_page(class_idx);

Change 3: Keep Threshold=1 (0→1 Edge)

Already implemented - no change needed.

Expected Results

Metric            Before (CDA)    After (Route S)   Improvement
Page reuse rate   6-40%           70-80%            2-13x
New pages (10s)   230K            50-80K            3-5x fewer
Throughput        30-58K ops/s    60-100K ops/s     2-3x
Real time         16s             ~10s              1.6x faster

Why This Works

  1. Owner always gets first chance to reuse pages with remotes
  2. No adoption stealing → no fragmentation domino
  3. Must-reuse gate physically prevents new page allocation when reuse is possible
  4. Simpler design → fewer race conditions, easier to debug

Alternative: Route P (Performance - Keep CDA with Safeguards)

If Route S succeeds but we want to re-enable CDA for other workloads:

Right-of-First-Refusal (RFR)

// On 0→1 edge:
page->rfr_deadline = rdtsc() + RFR_WINDOW_CYCLES;  // e.g., 200µs

// In adoption:
if (rdtsc() < page->rfr_deadline) {
    continue;  // Owner priority window - skip adoption
}

Adoption Gating (Multi-Condition AND)

Only allow adoption when ALL conditions met:

  1. Owner inactive: now - owner->last_alloc_tsc > IDLE_THRESHOLD (e.g., 150µs)
  2. Owner partial empty: owner->partial_pages[k] == NULL
  3. Global pressure: adoptable_count[k] > ADOPTION_THRESHOLD (e.g., 4)
  4. Sufficient remotes: page->remote_count >= MIN_REMOTE_COUNT (e.g., 8)
  5. RFR expired: now >= page->rfr_deadline
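
As a sketch, the five conditions could be folded into a single predicate checked on the adoption path. mf2_adoption_allowed below is a hypothetical helper (not existing code); the field and constant names follow the list above, while the exact types are assumed:

// Hypothetical Route P gate: adoption is allowed only when ALL conditions hold.
static inline bool mf2_adoption_allowed(MF2_ThreadPages* owner, MidPage* page,
                                        int k, uint64_t now) {
    return (now - owner->last_alloc_tsc > IDLE_THRESHOLD)       // 1. owner inactive
        && (owner->partial_pages[k] == NULL)                    // 2. owner partial empty
        && (adoptable_count[k] > ADOPTION_THRESHOLD)            // 3. global pressure
        && (page->remote_count >= MIN_REMOTE_COUNT)             // 4. sufficient remotes
        && (now >= page->rfr_deadline);                         // 5. RFR expired
}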

O(1) Adoption Scan

  • Use ready_mask[k] bitmap for non-empty pending queues
  • Budget: maximum 1 page adoption per slow path
  • No scanning of empty queues
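
A possible shape for that scan, assuming at most 64 owner slots per size class and a hypothetical mf2_try_adopt_from() helper; ready_mask is kept as one bitmap per class:

// Producer side: mark the owner slot on the 0→1 edge of its pending queue.
static _Atomic uint64_t g_mf2_ready_mask[POOL_NUM_CLASSES];

static inline void mf2_mark_ready(int class_idx, int owner_slot) {
    atomic_fetch_or_explicit(&g_mf2_ready_mask[class_idx],
                             1ULL << owner_slot, memory_order_release);
}

// Consumer side: visit only set bits, adopt at most one page per slow path.
// (Clearing a bit when its queue drains empty is omitted for brevity.)
static MidPage* mf2_scan_ready(int class_idx) {
    uint64_t mask = atomic_load_explicit(&g_mf2_ready_mask[class_idx],
                                         memory_order_acquire);
    while (mask) {
        int slot = __builtin_ctzll(mask);   // index of lowest set bit
        mask &= mask - 1;                   // clear it for the next iteration
        MidPage* page = mf2_try_adopt_from(slot, class_idx);  // hypothetical helper
        if (page) return page;              // budget: 1 adoption per slow path
    }
    return NULL;
}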

Phase Plan: Route S Implementation

Phase 1: Minimal Changes (30 minutes)

  1. Disable CDA (comment out mf2_try_adopt_pending call)
  2. Verify must-reuse gate order (already mostly correct)
  3. Keep threshold=1 (already set)

Test: 10s larson benchmark
Success Criterion: Page reuse > 50%, throughput > 60K ops/s

Phase 2: Partial List (2-3 hours)

  1. Add partial_pages[POOL_NUM_CLASSES] to MF2_ThreadPages
  2. Modify mf2_try_drain_and_activate() to use partial list
  3. Pop from partial list before active page check

Test: Same benchmark
Success Criterion: Page reuse > 70%, throughput > 80K ops/s
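
A minimal sketch of that list, assuming an intrusive next_partial link in MidPage (an illustrative name, not the current layout); the owner pushes drained pages here instead of into full_pages and pops from it before allocating a new page:

// Per Phase 2, step 1 the thread-local struct gains:
//     MidPage* partial_pages[POOL_NUM_CLASSES];   // owner-only, no locking needed

static inline void mf2_partial_push(MF2_ThreadPages* tp, int k, MidPage* page) {
    page->next_partial = tp->partial_pages[k];   // next_partial: assumed intrusive link
    tp->partial_pages[k] = page;
}

static inline MidPage* mf2_partial_pop(MF2_ThreadPages* tp, int k) {
    MidPage* page = tp->partial_pages[k];
    if (page) {
        tp->partial_pages[k] = page->next_partial;
        page->next_partial = NULL;
    }
    return page;
}

Because only the owning thread touches its partial list, plain loads and stores suffice; this is what makes the pop O(1) compared to the O(N) scan of full_pages.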

Phase 3: Event Wakeup (Optional, 4-6 hours)

  1. Add futex-based wakeup for 0→1 edge
  2. Owner thread sleeps when idle, wakes on remote free
  3. Background drain for sleeping owners

Test: Same benchmark + long idle periods
Success Criterion: Low CPU usage during idle, fast wakeup on activity
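
On Linux, the wakeup could be built directly on futex, sketched here against an assumed per-class pending counter (pending_count); this is illustrative, not the planned implementation:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <stdint.h>

// Owner side: sleep while no pending pages exist for this class.
static void mf2_owner_wait(_Atomic uint32_t* pending_count) {
    uint32_t observed = atomic_load_explicit(pending_count, memory_order_acquire);
    while (observed == 0) {
        // Blocks only if *pending_count is still 0 at the time of the call.
        syscall(SYS_futex, pending_count, FUTEX_WAIT_PRIVATE, 0, NULL, NULL, 0);
        observed = atomic_load_explicit(pending_count, memory_order_acquire);
    }
}

// Remote freer side: wake the owner only on the 0→1 transition.
static void mf2_notify_owner(_Atomic uint32_t* pending_count) {
    if (atomic_fetch_add_explicit(pending_count, 1, memory_order_release) == 0) {
        syscall(SYS_futex, pending_count, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
    }
}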


Debug Instrumentation (Lightweight Counters)

Add to hakmem_pool.c:

// Must-reuse gate effectiveness
static atomic_uint_fast64_t g_mf2_mustreuse_hit = 0;      // Reused from pending
static atomic_uint_fast64_t g_mf2_active_drain_hit = 0;  // Drained active page
static atomic_uint_fast64_t g_mf2_new_page_forced = 0;   // No reuse possible

// (Route P only) Adoption gating
static atomic_uint_fast64_t g_mf2_adoption_blocked_rfr = 0;
static atomic_uint_fast64_t g_mf2_adoption_blocked_idle = 0;
static atomic_uint_fast64_t g_mf2_adoption_allowed = 0;
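
These could be dumped once at process exit, e.g. via an atexit hook (a sketch, not an existing reporting path):

#include <stdio.h>
#include <stdlib.h>

static void mf2_dump_debug_counters(void) {
    fprintf(stderr,
            "[mf2] mustreuse_hit=%llu active_drain_hit=%llu new_page_forced=%llu\n",
            (unsigned long long)atomic_load(&g_mf2_mustreuse_hit),
            (unsigned long long)atomic_load(&g_mf2_active_drain_hit),
            (unsigned long long)atomic_load(&g_mf2_new_page_forced));
}

// Register once during allocator init: atexit(mf2_dump_debug_counters);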

Lessons Learned

What Worked

  • Pending queue design (lock-free MPSC stack)
  • 0→1 edge detection (low overhead)
  • Remote free tracking (atomic counters)
  • Lock-free drain (exchange entire stack)
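
For reference, the mechanisms above reduce to roughly the following shape (remote_head is an assumed field name; the real code may differ in details):

// Lock-free MPSC remote-free stack per page, with 0→1 edge detection on push.
static inline bool mf2_remote_push(MidPage* page, void* block) {
    void* head = atomic_load_explicit(&page->remote_head, memory_order_relaxed);
    do {
        *(void**)block = head;   // link the freed block onto the current head
    } while (!atomic_compare_exchange_weak_explicit(&page->remote_head, &head, block,
                                                    memory_order_release,
                                                    memory_order_relaxed));
    return head == NULL;         // true exactly on the 0→1 edge (enqueue to pending)
}

// Owner-side drain: take the whole stack with a single exchange, then walk it privately.
static inline void* mf2_remote_drain(MidPage* page) {
    return atomic_exchange_explicit(&page->remote_head, NULL, memory_order_acquire);
}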

What Failed

  • Consumer-Driven Adoption (too aggressive)
  • Threshold-based batching (wrong trade-off)
  • Full pages list for reuse (too slow)
  • No owner priority (adoption stealing)

Key Insights

  1. "Good idea on paper" ≠ "works in practice"

    • CDA sounded great (use pages immediately!) but caused worse fragmentation
  2. Benchmarks reveal real-world patterns

    • Larson's producer-consumer pattern exposed CDA's fatal flaw
  3. Simplicity wins

    • Owner-only drain is simpler and likely faster than complex adoption logic
  4. Must-reuse is critical

    • Without forcing reuse, allocators will always prefer new pages (easier)

References

  • PHASE_7.2_MF2_PLAN_2025_10_24.md (original plan)
  • ALIGNMENT_FIX_VERIFICATION.md (alignment debugging)
  • ChatGPT Pro consultation (2025-10-25)

Next Actions

  1. Document investigation results (this file)
  2. Implement Route S minimal changes
  3. Test and measure improvements
  4. Decide: stop here or add partial list (Phase 2)
  5. Consider Route P if CDA needed for other workloads

Status: Ready to implement Route S Phase 1