
Phase 7.2.1: Consumer-Driven Adoption Investigation Results

Date: 2025-10-25
Goal: Fix performance issues with Consumer-Driven Adoption (CDA) in MF2
Status: CDA Failed - Switching to Route S (Owner-Priority Design)


Executive Summary

Consumer-Driven Adoption (CDA) was implemented to handle Larson-style workloads where owner threads stop allocating. Multiple optimization attempts all failed, revealing a fundamental design flaw:

Root Cause: CDA creates an "adoption stealing → fragmentation domino" effect:

  1. Thread A allocates from Page X
  2. Thread A passes objects to Thread B for processing
  3. Thread B frees objects → remote frees accumulate on Page X
  4. Thread C (needing allocation) adopts Page X from A's pending queue
  5. Thread A continues allocating → creates NEW pages (its queue is now empty!)

Result: Pages fragment across threads via aggressive adoption, while the original owners allocate new pages endlessly.


Attempted Fixes & Results

Fix Attempt                 Page Reuse   Throughput      New Pages   Status
Original (Phase 7.2)        ~40%         44-58K ops/s    227K        Suspected ping-pong
+ pending_claim             ~40%         44-58K ops/s    227K        No improvement
Threshold=4 (batch)         0.6%         29K ops/s       138K        Pages never enqueued
Threshold=2                 3.6%         31K ops/s       138K        Still too high
Threshold=1 + Re-enqueue    6.2%         30K ops/s       230K        Worse than original
Target (mimalloc)           >90%         1M+ ops/s       <50K        🎯 Goal

Key Statistics (Threshold=1 test, 10s run)

Page reuses:             14,299  (6.2% of total pages)
New pages allocated:    230,267  (94x more than reuses!)
Drain attempts:          14,346
Drain successes:         13,870  (96.7% success rate)
Pending enqueued:        16,587
Pending drained:         14,299
Throughput:              30,377 ops/sec
Real time:               15.989s  (should be ~10s)
CPU time:                32.919s  (4 threads × ~8.2s each)

Diagnosis: The high drain success rate (96.7%) proves draining works correctly. The problem is the new page allocation rate: roughly 230K new pages for ~300K operations, i.e. about 1.3 allocations per new page, so nearly every allocation lands on a fresh page instead of a reused one.


Root Cause Analysis (ChatGPT Pro Consultation)

The Adoption Stealing Problem

Larson benchmark pattern:

Phase 1 (Alloc):
  Thread A: alloc from Page X → pass to B
  Thread B: alloc from Page Y → pass to A
  (Cross-thread ownership transfer)

Phase 2 (Free):
  Thread A: free B's objects → remote free to Page Y
  Thread B: free A's objects → remote free to Page X
  (Both pages have remote frees)

Phase 3 (Alloc again):
  Thread A: needs alloc
    → checks own pending queue → Page X is there!
    → BUT Thread C already adopted Page X! (queue empty)
    → allocates NEW page

  Thread C: was idle, woke up
    → scanned all pending queues
    → adopted Page X from A
    → used the free blocks
    → goes idle again

The vicious cycle:

  1. Owner (A) has pages with remotes in pending queue
  2. Adopter (C) scans and steals pages before owner can reuse
  3. Owner finds empty queue → allocates new page
  4. New page eventually fills → gets remote frees → stolen again
  5. Repeat → endless fragmentation

Threshold Approach Failures

Threshold=1 (0→1 edge):

  • Pages enqueued quickly (every remote free)
  • Adopters steal immediately
  • Owner rarely gets to reuse own pages
  • Result: 40% reuse (not good enough)

Threshold=2-4 (batching):

  • Reduces pending queue operations
  • Many pages never reach threshold
  • Pages with 1-3 remotes are lost entirely
  • Result: 0.6-3.6% reuse (catastrophic)

Threshold + Re-enqueue after drain:

  • Catches new remotes during drain
  • Doesn't fix fundamental adoption stealing
  • Result: 6.2% reuse (worse than original)

Architectural Flaws in CDA

1. No Owner Priority (Right-of-First-Refusal)

Current design allows any thread to adopt immediately when a page enters pending queue.

Problem: The owner thread may still be actively allocating and would naturally reuse the page, but an adopter gets there first.

Missing: A time-based grace period during which the owner has exclusive access.

2. Adoption Too Aggressive

Adoption triggers on any non-empty pending queue, regardless of:

  • Owner activity level (is owner still allocating?)
  • Owner's own partial page availability
  • Remote count (is it worth adopting for just 1 block?)

Problem: Unnecessary ownership transfers cause:

  • Cache line bouncing (page metadata)
  • Loss of temporal locality
  • Fragmentation across threads

3. No Must-Reuse Gate

Current design allows allocating new pages even when:

  • Own pending queue has reusable pages
  • Own full_pages list has pages with remotes

Problem: Owner bypasses reuse opportunities, creating new pages unnecessarily.

4. Full Pages List Misuse

Drained pages return to the full_pages list, which is:

  • Not scanned frequently
  • Requires O(N) scan to find pages with remotes
  • Effectively a "graveyard" for partially-free pages

Problem: Pages with free blocks get stuck in full_pages, not reused.


Design Principles

  1. Owner-Only Drain: Remove cross-thread adoption entirely
  2. Must-Reuse Gate: Forbid new page allocation when reusable pages exist
  3. Partial List: Replace full_pages with partial list for fast reuse
  4. 0→1 Edge Detection: Keep immediate pending queue notification

Core Changes

Change 1: Disable CDA

// In mf2_alloc_slow():
// ==== DISABLE Consumer-Driven Adoption ====
// if (mf2_try_adopt_pending(tp, class_idx)) {
//     return mf2_alloc_fast(class_idx, size, site_id);
// }

Change 2: Must-Reuse Gate (Owner Pending Priority)

// In mf2_alloc_slow(), BEFORE new page allocation:

// CRITICAL: Drain own pending queue FIRST (must-reuse gate)
for (int budget = 0; budget < 4; budget++) {
    MidPage* pending_page = mf2_dequeue_pending(tp, class_idx);
    if (!pending_page) break;

    atomic_store_explicit(&pending_page->in_remote_pending, false, memory_order_release);

    if (mf2_try_drain_and_activate(tp, class_idx, pending_page)) {
        atomic_fetch_add(&g_mf2_mustreuse_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}

// Check active page for remotes
if (page && page->remote_count > 0) {
    int drained = mf2_drain_remote_frees(page);
    if (drained > 0 && page->freelist) {
        atomic_fetch_add(&g_mf2_active_drain_hit, 1);  // Stats
        return mf2_alloc_fast(class_idx, size, site_id);
    }
}

// Only NOW allocate new page (after exhausting reuse opportunities)
atomic_fetch_add(&g_mf2_new_page_forced, 1);  // Stats
page = mf2_alloc_new_page(class_idx);

Change 3: Keep Threshold=1 (0→1 Edge)

Already implemented - no change needed.

Expected Results

Metric            Before (CDA)    After (Route S)   Improvement
Page reuse rate   6-40%           70-80%            2-13x
New pages (10s)   230K            50-80K            3-5x fewer
Throughput        30-58K ops/s    60-100K ops/s     2-3x
Real time         16s             ~10s              1.6x faster

Why This Works

  1. Owner always gets first chance to reuse pages with remotes
  2. No adoption stealing → no fragmentation domino
  3. Must-reuse gate physically prevents new page allocation when reuse is possible
  4. Simpler design → fewer race conditions, easier to debug

Alternative: Route P (Performance - Keep CDA with Safeguards)

If Route S succeeds but we want to re-enable CDA for other workloads:

Right-of-First-Refusal (RFR)

// On 0→1 edge:
page->rfr_deadline = rdtsc() + RFR_WINDOW_CYCLES;  // e.g., 200µs

// In adoption:
if (rdtsc() < page->rfr_deadline) {
    continue;  // Owner priority window - skip adoption
}

Adoption Gating (Multi-Condition AND)

Only allow adoption when ALL conditions met:

  1. Owner inactive: now - owner->last_alloc_tsc > IDLE_THRESHOLD (e.g., 150µs)
  2. Owner partial empty: owner->partial_pages[k] == NULL
  3. Global pressure: adoptable_count[k] > ADOPTION_THRESHOLD (e.g., 4)
  4. Sufficient remotes: page->remote_count >= MIN_REMOTE_COUNT (e.g., 8)
  5. RFR expired: now >= page->rfr_deadline
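
As a sketch, the five conditions could be folded into a single predicate checked on the adoption path. mf2_adoption_allowed below is a hypothetical helper (not existing code); the field and constant names follow the list above, while the exact types are assumed:

// Hypothetical Route P gate: adoption is allowed only when ALL conditions hold.
static inline bool mf2_adoption_allowed(MF2_ThreadPages* owner, MidPage* page,
                                        int k, uint64_t now) {
    return (now - owner->last_alloc_tsc > IDLE_THRESHOLD)       // 1. owner inactive
        && (owner->partial_pages[k] == NULL)                    // 2. owner partial empty
        && (adoptable_count[k] > ADOPTION_THRESHOLD)            // 3. global pressure
        && (page->remote_count >= MIN_REMOTE_COUNT)             // 4. sufficient remotes
        && (now >= page->rfr_deadline);                         // 5. RFR expired
}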

O(1) Adoption Scan

  • Use ready_mask[k] bitmap for non-empty pending queues
  • Budget: maximum 1 page adoption per slow path
  • No scanning of empty queues
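
A possible shape for that scan, assuming at most 64 owner slots per size class and a hypothetical mf2_try_adopt_from() helper; ready_mask is kept as one bitmap per class:

// Producer side: mark the owner slot on the 0→1 edge of its pending queue.
static _Atomic uint64_t g_mf2_ready_mask[POOL_NUM_CLASSES];

static inline void mf2_mark_ready(int class_idx, int owner_slot) {
    atomic_fetch_or_explicit(&g_mf2_ready_mask[class_idx],
                             1ULL << owner_slot, memory_order_release);
}

// Consumer side: visit only set bits, adopt at most one page per slow path.
// (Clearing a bit when its queue drains empty is omitted for brevity.)
static MidPage* mf2_scan_ready(int class_idx) {
    uint64_t mask = atomic_load_explicit(&g_mf2_ready_mask[class_idx],
                                         memory_order_acquire);
    while (mask) {
        int slot = __builtin_ctzll(mask);   // index of lowest set bit
        mask &= mask - 1;                   // clear it for the next iteration
        MidPage* page = mf2_try_adopt_from(slot, class_idx);  // hypothetical helper
        if (page) return page;              // budget: 1 adoption per slow path
    }
    return NULL;
}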

Phase Plan: Route S Implementation

Phase 1: Minimal Changes (30 minutes)

  1. Disable CDA (comment out mf2_try_adopt_pending call)
  2. Verify must-reuse gate order (already mostly correct)
  3. Keep threshold=1 (already set)

Test: 10s larson benchmark
Success Criterion: Page reuse > 50%, throughput > 60K ops/s

Phase 2: Partial List (2-3 hours)

  1. Add partial_pages[POOL_NUM_CLASSES] to MF2_ThreadPages
  2. Modify mf2_try_drain_and_activate() to use partial list
  3. Pop from partial list before active page check

Test: Same benchmark
Success Criterion: Page reuse > 70%, throughput > 80K ops/s
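
A minimal sketch of that list, assuming an intrusive next_partial link in MidPage (an illustrative name, not the current layout); the owner pushes drained pages here instead of into full_pages and pops from it before allocating a new page:

// Per Phase 2, step 1 the thread-local struct gains:
//     MidPage* partial_pages[POOL_NUM_CLASSES];   // owner-only, no locking needed

static inline void mf2_partial_push(MF2_ThreadPages* tp, int k, MidPage* page) {
    page->next_partial = tp->partial_pages[k];   // next_partial: assumed intrusive link
    tp->partial_pages[k] = page;
}

static inline MidPage* mf2_partial_pop(MF2_ThreadPages* tp, int k) {
    MidPage* page = tp->partial_pages[k];
    if (page) {
        tp->partial_pages[k] = page->next_partial;
        page->next_partial = NULL;
    }
    return page;
}

Because only the owning thread touches its partial list, plain loads and stores suffice; this is what makes the pop O(1) compared to the O(N) scan of full_pages.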

Phase 3: Event Wakeup (Optional, 4-6 hours)

  1. Add futex-based wakeup for 0→1 edge
  2. Owner thread sleeps when idle, wakes on remote free
  3. Background drain for sleeping owners

Test: Same benchmark + long idle periods
Success Criterion: Low CPU usage during idle, fast wakeup on activity
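
On Linux, the wakeup could be built directly on futex, sketched here against an assumed per-class pending counter (pending_count); this is illustrative, not the planned implementation:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <stdint.h>

// Owner side: sleep while no pending pages exist for this class.
static void mf2_owner_wait(_Atomic uint32_t* pending_count) {
    uint32_t observed = atomic_load_explicit(pending_count, memory_order_acquire);
    while (observed == 0) {
        // Blocks only if *pending_count is still 0 at the time of the call.
        syscall(SYS_futex, pending_count, FUTEX_WAIT_PRIVATE, 0, NULL, NULL, 0);
        observed = atomic_load_explicit(pending_count, memory_order_acquire);
    }
}

// Remote freer side: wake the owner only on the 0→1 transition.
static void mf2_notify_owner(_Atomic uint32_t* pending_count) {
    if (atomic_fetch_add_explicit(pending_count, 1, memory_order_release) == 0) {
        syscall(SYS_futex, pending_count, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
    }
}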


Debug Instrumentation (Lightweight Counters)

Add to hakmem_pool.c:

// Must-reuse gate effectiveness
static atomic_uint_fast64_t g_mf2_mustreuse_hit = 0;      // Reused from pending
static atomic_uint_fast64_t g_mf2_active_drain_hit = 0;  // Drained active page
static atomic_uint_fast64_t g_mf2_new_page_forced = 0;   // No reuse possible

// (Route P only) Adoption gating
static atomic_uint_fast64_t g_mf2_adoption_blocked_rfr = 0;
static atomic_uint_fast64_t g_mf2_adoption_blocked_idle = 0;
static atomic_uint_fast64_t g_mf2_adoption_allowed = 0;
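
These could be dumped once at process exit, e.g. via an atexit hook (a sketch, not an existing reporting path):

#include <stdio.h>
#include <stdlib.h>

static void mf2_dump_debug_counters(void) {
    fprintf(stderr,
            "[mf2] mustreuse_hit=%llu active_drain_hit=%llu new_page_forced=%llu\n",
            (unsigned long long)atomic_load(&g_mf2_mustreuse_hit),
            (unsigned long long)atomic_load(&g_mf2_active_drain_hit),
            (unsigned long long)atomic_load(&g_mf2_new_page_forced));
}

// Register once during allocator init: atexit(mf2_dump_debug_counters);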

Lessons Learned

What Worked

  • Pending queue design (lock-free MPSC stack)
  • 0→1 edge detection (low overhead)
  • Remote free tracking (atomic counters)
  • Lock-free drain (exchange entire stack)
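
For reference, the mechanisms above reduce to roughly the following shape (remote_head is an assumed field name; the real code may differ in details):

// Lock-free MPSC remote-free stack per page, with 0→1 edge detection on push.
static inline bool mf2_remote_push(MidPage* page, void* block) {
    void* head = atomic_load_explicit(&page->remote_head, memory_order_relaxed);
    do {
        *(void**)block = head;   // link the freed block onto the current head
    } while (!atomic_compare_exchange_weak_explicit(&page->remote_head, &head, block,
                                                    memory_order_release,
                                                    memory_order_relaxed));
    return head == NULL;         // true exactly on the 0→1 edge (enqueue to pending)
}

// Owner-side drain: take the whole stack with a single exchange, then walk it privately.
static inline void* mf2_remote_drain(MidPage* page) {
    return atomic_exchange_explicit(&page->remote_head, NULL, memory_order_acquire);
}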

What Failed

  • Consumer-Driven Adoption (too aggressive)
  • Threshold-based batching (wrong trade-off)
  • Full pages list for reuse (too slow)
  • No owner priority (adoption stealing)

Key Insights

  1. "Good idea on paper" ≠ "works in practice"

    • CDA sounded great (use pages immediately!) but caused worse fragmentation
  2. Benchmarks reveal real-world patterns

    • Larson's producer-consumer pattern exposed CDA's fatal flaw
  3. Simplicity wins

    • Owner-only drain is simpler and likely faster than complex adoption logic
  4. Must-reuse is critical

    • Without forcing reuse, allocators will always prefer new pages (easier)

References

  • PHASE_7.2_MF2_PLAN_2025_10_24.md (original plan)
  • ALIGNMENT_FIX_VERIFICATION.md (alignment debugging)
  • ChatGPT Pro consultation (2025-10-25)

Next Actions

  1. Document investigation results (this file)
  2. Implement Route S minimal changes
  3. Test and measure improvements
  4. Decide: stop here or add partial list (Phase 2)
  5. Consider Route P if CDA needed for other workloads

Status: Ready to implement Route S Phase 1