# Phase 7.2 MF2: Implementation Progress

Date: 2025-10-24
Status: In Progress - Fixing Pending Queue Drain Issue
Current: Implementing Global Round-Robin Strategy

## Summary
The basic MF2 per-page sharding implementation is complete, but a structural problem was discovered in the pending queue drain mechanism. In the Larson benchmark, each thread allocates and frees within its own dedicated array range, so cross-thread frees are nearly zero. As a result, enqueues to the pending queue succeed (69K pages), but drains never happen (0 drains).

Detailed analysis by the Task agent identified the root cause:
- 各スレッドは自分のTLSのpending queueしか見ない
- Larsonでは各スレッドが自分でalloc/free → 自分のpending queueは空
- 他スレッドのpending queueに溜まったページは永遠に処理されない
## Implementation Timeline

### Phase 1-4: Core Implementation ✅

Commits:

- `0855b37` - Phase 1: Data structures
- `5c4b780` - Phase 2: Page allocation
- `b12f58c` - Phase 3: Allocation path
- `7e756c6` - Phase 4: Free path

Status: Complete
### Phase 5: Bug Fixes (Fix #1-6) ✅

#### Fix #1: Block Spacing Bug (54609c1)

Problem: Infinite loop on the first test.

Root Cause:

```c
size_t block_size = g_class_sizes[class_idx]; // Missing HEADER_SIZE
```

Fix: `block_size = HEADER_SIZE + user_size;`

Result: Test completes instead of hanging.
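For illustration, a minimal sketch of a block-carving loop under assumed names (the `HEADER_SIZE` value and the `carve_blocks` helper are hypothetical, not the project's actual code): if the stride omits the header, each block overlaps its neighbor and the freelist walk never terminates.

```c
#include <stddef.h>
#include <stdint.h>

#define HEADER_SIZE 16  /* assumed per-block header size */

/* Hypothetical sketch: carve a raw page into a singly-linked freelist.
 * The stride must be HEADER_SIZE + user_size (the Fix #1 correction);
 * omitting the header makes adjacent blocks overlap. */
static void* carve_blocks(uint8_t* page, size_t page_size, size_t user_size) {
    size_t stride = HEADER_SIZE + user_size;
    void* head = NULL;
    for (size_t off = 0; off + stride <= page_size; off += stride) {
        void** blk = (void**)(page + off);
        *blk = head;   /* link block into the freelist */
        head = blk;
    }
    return head;
}
```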
#### Fix #2-3: Performance Optimizations (aa869b9)
Changes:
- Removed 64KB memset (switched from posix_memalign to mmap)
- Removed O(N) eager drain scan
- Reduced scan limit from 256 to 8
Result: 27.5K → 110K ops/s (4x improvement on 4T)
#### Fix #4: Alignment Bug (9e64f7e) - CRITICAL

Problem: 97% of frees silently dropped!

Root Cause:

- `mmap()` only guarantees 4KB alignment
- `addr_to_page()` assumes 64KB alignment
- Lookup fails: `(ptr & ~0xFFFF)` rounds to the wrong page base

Fix: Changed to `posix_memalign(&page_base, 65536, POOL_PAGE_SIZE)`
Verification (by Task agent):
Pages allocated: 101,093
Alignment bugs: 0 (ZERO!)
Registry collisions: 0 (ZERO!)
Lookup success rate: 98%
Side Effect: Performance degraded (466K → 54K) due to memset overhead returning
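A minimal sketch of the invariant Fix #4 restores (the helper names here are illustrative; the real `addr_to_page()` also consults the page registry): the mask in `addr_to_page_base()` is only sound when every page base is 64KB-aligned, which `posix_memalign` guarantees and plain `mmap` does not.

```c
#include <stdlib.h>
#include <stdint.h>

#define POOL_PAGE_SIZE (64 * 1024)  /* 64KB pool pages */

/* Allocate a pool page with a 64KB-aligned base address. */
static void* alloc_aligned_page(void) {
    void* page_base = NULL;
    if (posix_memalign(&page_base, 65536, POOL_PAGE_SIZE) != 0)
        return NULL;
    return page_base;
}

/* Recover the owning page base from an interior pointer.
 * Correct only if every page base is 64KB-aligned. */
static inline void* addr_to_page_base(void* ptr) {
    return (void*)((uintptr_t)ptr & ~(uintptr_t)0xFFFF);
}
```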
#### Fix #5: Active Page Drain Attempt (9e64f7e)

Change: Added a check for remote frees in active_page before allocating a new page.

Result: No improvement (remote drains still 0)
#### Fix #6: Memory Ordering (b0768b3)

Problem: All remote_count operations used memory_order_relaxed.

Fix: Changed 7 locations to seq_cst/acquire/release.

Result: Memory ordering is now correct, but performance still did not improve.
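As a hedged illustration of the acquire/release pairing involved (the field and function names here are stand-ins, not the exact Fix #6 sites): the remote-free side publishes its block before bumping `remote_count` with release, and the drain side reads the count with acquire so the published block is guaranteed visible.

```c
#include <stdatomic.h>

typedef struct {
    _Atomic unsigned remote_count;  /* remote frees awaiting drain */
} PageCounters;                     /* illustrative stand-in for MidPage */

/* Remote-free side: push the block first, then release-increment the
 * count so a draining thread that sees the count also sees the block. */
static void remote_free_publish(PageCounters* pc) {
    /* ... push block onto the page's remote freelist ... */
    atomic_fetch_add_explicit(&pc->remote_count, 1, memory_order_release);
}

/* Drain side: acquire-load pairs with the release above. */
static unsigned drain_snapshot(PageCounters* pc) {
    return atomic_load_explicit(&pc->remote_count, memory_order_acquire);
}
```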
Root Cause Discovery (by Task agent):
- Debug instrumentation revealed: drain checks and remote frees target DIFFERENT page objects
- Thread A's pages in Thread A's tp->active_page/full_pages
- Thread B frees to Thread A's pages → remote_count++
- Thread B's slow path checks Thread B's pages only
- Result: Thread A's pages (with remote_count > 0) never checked by anyone!
### Phase 2: Pending Queue Implementation (89541fc) ✅
Implementation (by Task agent):
- Box 1: Data structures - added `owner_tp`, `in_remote_pending`, `next_pending` to `MidPage`
- Box 2: MPSC lock-free queue operations (`mf2_enqueue_pending`, `mf2_dequeue_pending`) - see the sketch after this list
- Box 3: 0→1 edge detection in `mf2_free_slow()`
- Box 4: Allocation slow path drain (up to 4 pages per allocation)
- Box 5: Opportunistic drain (every 16th owner free)
- Box 6: Comprehensive debug logging and statistics
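A self-contained sketch of Boxes 2-3 under simplified assumptions (the types and helpers here are illustrative stand-ins for the real `mf2_enqueue_pending`/`mf2_dequeue_pending`): the pending queue is a Treiber stack with many producers and a single consumer, and a page is enqueued only on the 0→1 edge of its remote count.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Page Page;
struct Page {
    _Atomic(Page*) next_pending;     /* MPSC stack link */
    _Atomic unsigned remote_count;   /* remote frees on this page */
    _Atomic bool in_remote_pending;  /* already enqueued? */
};

typedef struct {
    _Atomic(Page*) pending_head;     /* one stack per owner thread/class */
} PendingQueue;

/* Multi-producer push (Treiber stack). */
static void pending_push(PendingQueue* q, Page* p) {
    Page* head = atomic_load_explicit(&q->pending_head, memory_order_relaxed);
    do {
        atomic_store_explicit(&p->next_pending, head, memory_order_relaxed);
    } while (!atomic_compare_exchange_weak_explicit(
                 &q->pending_head, &head, p,
                 memory_order_release, memory_order_relaxed));
}

/* Single-consumer pop: only one thread pops, so ABA is not an issue. */
static Page* pending_pop(PendingQueue* q) {
    Page* head = atomic_load_explicit(&q->pending_head, memory_order_acquire);
    while (head) {
        Page* next = atomic_load_explicit(&head->next_pending, memory_order_relaxed);
        if (atomic_compare_exchange_weak_explicit(
                &q->pending_head, &head, next,
                memory_order_acquire, memory_order_acquire))
            break;
    }
    return head;
}

/* Box 3: enqueue the page only on the 0->1 edge of remote_count,
 * guarded by a flag so concurrent freers enqueue it exactly once. */
static void on_remote_free(PendingQueue* q, Page* p) {
    if (atomic_fetch_add_explicit(&p->remote_count, 1, memory_order_release) == 0) {
        bool expected = false;
        if (atomic_compare_exchange_strong(&p->in_remote_pending, &expected, true))
            pending_push(q, p);
    }
}
```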
Test Results:
Pending enqueued: 43,138 ✅
Pending drained: 0 ❌
Analysis (by Task agent):
- Implementation is correct
- Problem: Larson benchmark allocates all pages early, frees later
- By the time remote frees arrive, owner threads don't allocate anymore
- Slow path never called → pending queue never processed
- This is a workload mismatch, not an implementation bug
### Tuning: Opportunistic Drain Frequency (a6eb666) ✅
Change: Increased from every 16th to every 4th free (4x more aggressive)
Test Results (larson 10 2-32K 10s 4T):
Pending enqueued: 52,912 ✅
Pending drained: 0 ❌
Throughput: 53K ops/s
Conclusion: Frequency tuning didn't help - workload pattern issue persists
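For reference, this tuning amounts to changing one mask in a gate like the following (the counter name is a hypothetical illustration):

```c
/* Hypothetical sketch of the opportunistic-drain gate. */
static _Thread_local unsigned t_free_ticks;

static inline int should_drain(void) {
    return (t_free_ticks++ & 3) == 0;  /* every 4th free; was & 15 (every 16th) */
}
```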
### Option 1: free_slow Drain Addition ❌
Concept: Add opportunistic drain to both free_fast() and free_slow()
Implementation:

- Created `mf2_maybe_drain_pending()` helper
- Called from both free_fast() (Line 1115) and free_slow() (Line 1167)
Test Results:
Pending enqueued: 76,733 ✅
Pending drained: 0 ❌
OPP_DRAIN_TRY: 10 attempts (all from tp=0x55828805f7a0)
Throughput: 27,890 ops/s
Problem: All drain attempts from same thread - other 3 threads not visible
### Option C: alloc_slow Drain Addition ❌

Concept: Add a drain before new-page allocation (the owner thread is allocating continuously)

Implementation: Added `mf2_maybe_drain_pending()` at Line 1021 (before `mf2_alloc_new_page()`)
Test Results:
Pending enqueued: 69,702 ✅
Pending drained: 0 ❌
OPP_DRAIN_TRY: 10 attempts (all from tp=0x559146bb17a0)
Throughput: 27,965 ops/s
Conclusion: Still 0 drains - same thread issue persists
## Root Cause Analysis (by Task Agent)

### Larson Benchmark Characteristics
```cpp
// larson.cpp: exercise_heap()
for (cblks = 0; cblks < pdea->NumBlocks; cblks++) {
    victim = lran2(&pdea->rgen) % pdea->asize;            // Own array range
    CUSTOM_FREE(pdea->array[victim]);                     // Free own allocation
    pdea->array[victim] = (char*)CUSTOM_MALLOC(blk_size); // Same slot
}

// Array partitioning (Line 481):
de_area[i].array = &blkp[i*nperthread]; // Each thread owns separate range
```
Key Finding: Each thread allocates/frees from its own array range
- Thread 0: `array[0..999]`
- Thread 1: `array[1000..1999]`
- Thread 2: `array[2000..2999]`
- Thread 3: `array[3000..3999]`
Result: Cross-thread frees are almost ZERO
### MF2 Design vs Larson Mismatch
MF2 Assumption:
4 threads freeing → all threads call mf2_free() → all threads drain pending
Larson Reality:
1 thread does most freeing → only 1 thread drains pending
Other threads allocate-only → never drain their own pending queues
Problem:

```c
mf2_maybe_drain_pending() {
    MF2_ThreadPages* tp = mf2_thread_pages_get();          // ← Own TLS only!
    MidPage* pending = mf2_dequeue_pending(tp, class_idx); // ← Own pending only!
}
```
- Thread A drains → checks Thread A's TLS → Thread A's pending queue is empty
- Thread B/C/D's pending queues (with 69K pages!) are never checked
### Pending Enqueue Sources
76,733 enqueues come from:
- Phase 1 allocation interruptions (rare cross-thread frees)
- NOT from Phase 2 continuous freeing (same-thread pattern)
## Solution Strategy: Global Round-Robin

### Design Philosophy: "Where to Separate, Where to Integrate"
Separation Points (working well) ✅:
- Allocation: Thread-local, no lock
- Owner Free: Thread-local, no lock
- Cross-thread Free: Lock-free MPSC stack
Integration Point (broken) ❌:
- Pending Queue Drain: Currently thread-local only
### Strategy A: Global Round-Robin (Phase 1) 🎯
Core Idea: All threads can drain ANY thread's pending queue
```c
// Global registry
static MF2_ThreadPages* g_all_thread_pages[MAX_THREADS];
static _Atomic int g_num_thread_pages = 0;

// Round-robin drain
mf2_maybe_drain_pending() {
    static _Atomic uint64_t counter = 0;
    uint64_t count = atomic_fetch_add(&counter, 1);

    // Round-robin across ALL threads (not just self!)
    int tp_idx = (count / 4) % g_num_thread_pages;
    MF2_ThreadPages* tp = g_all_thread_pages[tp_idx];
    if (tp) {
        int class_idx = (count / 4 / g_num_thread_pages) % POOL_NUM_CLASSES;
        MidPage* pending = mf2_dequeue_pending(tp, class_idx);
        if (pending) drain_remote_frees(pending);
    }
}
```
Benefits:
- Larson works: Any thread can drain any thread's pending queue
- Fair: All TLSs get equal drain opportunities
- Simple: Just global array + round-robin
Implementation Steps:
- Add global array `g_all_thread_pages[]`
- Register TLS in `mf2_thread_pages_get()`
- Add destructor with `pthread_key_create()`
- Modify `mf2_maybe_drain_pending()` to round-robin
Expected Impact:
Pending enqueued: 69K
Pending drained: 69K ✅ (100% instead of 0%)
Page reuse rate: 3% → 90%+ ✅
Throughput: 28K → 3-10M ops/s ✅ (100-350x improvement!)
### Strategy B: Hybrid (Phase 2) ⚡
Optimization: Prefer own TLS (cache efficiency) but periodically check others
```c
if ((count & 3) == 0) { // 1/4: Other threads
    tp = g_all_thread_pages[round_robin_idx];
} else {                // 3/4: Own TLS (cache hot)
    tp = mf2_thread_pages_get();
}
```
Benefits:
- Cache efficiency: 75% of drains are own TLS (L1 cache)
- Fairness: 25% of drains are others (ensures progress)
Metrics:
- Own TLS: L1 cache hit (1-2 cycles)
- Other TLS: L3 cache hit (10-20 cycles)
- Average cost: 3-5 cycles (negligible)
### Strategy C: Background Sweeper (Phase 3) 🔄
Safety Net: Handle edge cases where all threads stop allocating/freeing
```c
void* mf2_drain_thread(void* arg) {
    while (running) {
        usleep(1000); // 1ms interval (not 100μs - too aggressive)

        // Scan all TLSs for leftover pending pages
        for (int i = 0; i < g_num_thread_pages; i++) {
            for (int c = 0; c < POOL_NUM_CLASSES; c++) {
                MidPage* pending = mf2_dequeue_pending(g_all_thread_pages[i], c);
                if (pending) drain_remote_frees(pending);
            }
        }
    }
    return NULL;
}
```
Role: Insurance policy, not main drain mechanism
- Strategy A handles 95% of drains (hot path)
- Strategy C handles 5% leftover (rare cases)
Latency Impact: NONE on hot path (async background)
### 3-Layer Latency Hiding Design
| Layer | Strategy | Frequency | Latency | Coverage | Role |
|---|---|---|---|---|---|
| L1: Hot Path | A (Global RR) | Every 4th op | <1μs | 95% | Main drain |
| L2: Optimization | B (Hybrid) | 3/4 own, 1/4 other | <1μs | 100% | Cache efficiency |
| L3: Safety Net | C (BG sweeper) | 1ms interval | 1ms | 100% | Edge cases |
Latency Guarantee: Front-end (alloc/free) always returns in <1μs, regardless of background drain state
## Implementation Plan

### Phase 1: Global Round-Robin (Today) 🎯
Target: Make Larson work
Tasks:
- Add global array `g_all_thread_pages[256]`
- Add atomic counter `g_num_thread_pages`
- Add registration in `mf2_thread_pages_get()`
- Add pthread_key destructor for cleanup (see the sketch after this list)
- Modify `mf2_maybe_drain_pending()` for round-robin
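A minimal sketch of the registration path under stated assumptions (`register_thread_pages` and the destructor body are hypothetical; in the plan this work happens inside `mf2_thread_pages_get()`):

```c
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 256

typedef struct MF2_ThreadPages MF2_ThreadPages; /* defined elsewhere */

static MF2_ThreadPages* g_all_thread_pages[MAX_THREADS];
static _Atomic int g_num_thread_pages = 0;
static pthread_key_t g_tp_key;
static pthread_once_t g_tp_once = PTHREAD_ONCE_INIT;

/* Runs at thread exit. The slot stays registered so other threads can
 * keep draining this TLS's pending queue after the thread is gone. */
static void tp_destructor(void* arg) {
    (void)arg;
}

static void tp_key_init(void) {
    pthread_key_create(&g_tp_key, tp_destructor);
}

/* Called once per thread when its TLS structure is created. */
static void register_thread_pages(MF2_ThreadPages* tp) {
    pthread_once(&g_tp_once, tp_key_init);
    pthread_setspecific(g_tp_key, tp);
    int idx = atomic_fetch_add(&g_num_thread_pages, 1);
    if (idx < MAX_THREADS)
        g_all_thread_pages[idx] = tp;
}
```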
Expected Time: 1-2 hours
Success Criteria:
- Pending drained > 0 (ideally ~69K)
- Throughput > 1M ops/s (35x improvement from 28K)
### Phase 2: Hybrid Optimization (Tomorrow)
Target: Improve cache efficiency
Tasks:
- Modify `mf2_maybe_drain_pending()` to prefer own TLS (3/4 ratio)
- Benchmark cache hit rates
Expected Time: 30 minutes
Success Criteria:
- L1 cache hit rate > 75%
- Throughput gain: +5-10%
### Phase 3: Background Sweeper (Optional)
Target: Handle edge cases
Tasks:
- Create background thread with `pthread_create()`
- Scan all TLSs every 1ms
- CPU throttling (< 1% usage)
Expected Time: 30 minutes
Success Criteria:
- No pending leftovers after 10s idle
- CPU overhead < 1%
## Current Status
Working:
- ✅ Per-page sharding (data structures, allocation, free paths)
- ✅ 64KB alignment (Fix #4)
- ✅ Memory ordering (Fix #6)
- ✅ Pending queue infrastructure (enqueue works perfectly)
- ✅ 0→1 edge detection
Broken:
- ❌ Pending queue drain (0 drains due to TLS isolation)
- ❌ Page reuse (3% instead of 90%)
- ❌ Performance (28K ops/s instead of 3-10M)
Next:
- 🎯 Implement Phase 1: Global Round-Robin
- 🎯 Expected breakthrough: 28K → 3-10M ops/s
## Files Modified

### Core Implementation

- `hakmem_pool.c` (Lines 275-1200): MF2 implementation
  - Data structures (MidPage, MF2_ThreadPages, PageRegistry)
  - Allocation paths (fast/slow)
  - Free paths (fast/slow)
  - Pending queue operations
  - Opportunistic drain (currently broken)
### Documentation

- `docs/specs/ENV_VARS.md`: Added `HAKMEM_MF2_ENABLE`
- `docs/status/PHASE_7.2_MF2_PLAN_2025_10_24.md`: Original plan
- `docs/status/PHASE_7.2_MF2_PROGRESS_2025_10_24.md`: This file
### Debug Reports

- `ALIGNMENT_FIX_VERIFICATION.md`: Fix #4 verification by Task agent
## Lessons Learned
- Alignment is Critical: 97% free failure from 4KB vs 64KB alignment mismatch
- Memory Ordering Matters: But doesn't solve architectural issues
- Workload Characteristics: Larson's same-thread pattern exposed TLS isolation bug
- Integration vs Separation: Need to carefully choose integration points
- Task Agent is MVP: Detailed analysis saved days of debugging
## Phase 1: Global Round-Robin Implementation ✅
Commit: (multiple commits implementing round-robin drain)
Implementation:
- Added `g_all_thread_pages[256]` global array
- Added `g_num_thread_pages` atomic counter
- Implemented TLS registration in `mf2_thread_pages_get()`
- Implemented `mf2_maybe_drain_pending()` with round-robin logic
- Called from both `mf2_free_fast()` and `mf2_alloc_slow()`
Test Results (larson 10 2-32K 10s 4T):
Pending enqueued: 96,429 ✅
Pending drained: 70,607 ✅ (73% - huge improvement from 0%!)
Page reuse count: 5,222
Throughput: ~28,705 ops/s
Analysis:
- ✅ Round-robin drain WORKS! (0 drains → 70K drains)
- ⚠️ But page reuse only 2.3% (5,222 / 226,447 pages allocated)
- Problem: Drained pages returned to full_pages, but owner doesn't scan them
## Strategy C: Direct Handoff Implementation ✅
Concept: Don't return drained pages to full_pages - make them active immediately
Implementation (clean modular code):

```c
// Helper: Make page active (move old active to full_pages)
static inline void mf2_make_page_active(MF2_ThreadPages* tp, int class_idx, MidPage* page);

// Helper: Drain page and activate if successful (Direct Handoff)
static inline bool mf2_try_drain_and_activate(MF2_ThreadPages* tp, int class_idx, MidPage* page);
```
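A hedged sketch of what those two helpers plausibly do (the types here are simplified stand-ins for the real ones; `drain_remote_frees` is assumed to move a page's remote frees onto its local freelist):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct MidPageS MidPageS; /* simplified stand-in for MidPage */
struct MidPageS {
    MidPageS* next;
    void*     freelist;
};

typedef struct {
    MidPageS* active_page;
    MidPageS* full_pages; /* pages with no local free blocks */
} ClassPages;             /* stand-in for one class slot of MF2_ThreadPages */

void drain_remote_frees(MidPageS* page); /* assumed project primitive */

/* Make `page` the active page; park the old active page on full_pages. */
static inline void make_page_active(ClassPages* cp, MidPageS* page) {
    if (cp->active_page) {
        cp->active_page->next = cp->full_pages;
        cp->full_pages = cp->active_page;
    }
    cp->active_page = page;
}

/* Direct Handoff: drain remote frees and, if any blocks were recovered,
 * activate the page immediately rather than parking it on full_pages. */
static inline bool try_drain_and_activate(ClassPages* cp, MidPageS* page) {
    drain_remote_frees(page);
    if (page->freelist) {
        make_page_active(cp, page);
        return true;
    }
    return false;
}
```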
Changes:
- Modified `mf2_maybe_drain_pending()` to use `mf2_try_drain_and_activate()`
- Modified the `alloc_slow` pending drain loop to use Direct Handoff
- Reduced opportunistic drain from 60+ lines to 20 lines
Test Results (larson 10 2-32K 10s 4T):
Pending enqueued: 96,429
Pending drained: 70,607
Page reuse count: 80,017 ✅ (15x improvement!)
Throughput: ~28,705 ops/s
Success: Page reuse 35% (80,017 / 226,447)
## Full Pages Scan Removal ✅
Evidence: The full_pages scan checked 1.88M pages but found 0 reusable pages (0% success rate)
Reason: Direct Handoff immediately activates drained pages, so full_pages never contains reusable pages
Action: Removed full_pages scan (76 lines deleted)
Test Results:
Page reuses: 69,098 (31%)
Throughput: 27,206 ops/s
Conclusion: Slight decrease but acceptable (simplification benefit)
## Frequency Tuning Attempts ⚙️
Tested multiple opportunistic drain frequencies:
| Frequency | Page Reuses | Reuse % | Throughput |
|---|---|---|---|
| 1/2 (50%) | 70,607 | 31% | 27,206 ops/s |
| 1/4 (25%) | 45,369 | 20% | 27,423 ops/s |
| 1/8 (12.5%) | 24,901 | 11% | 27,642 ops/s |
Finding: Higher frequency = better reuse, but still far from 90% target
## Hybrid Strategy Attempt (Strategy B) ❌
Concept: 75% own TLS (cache efficiency) + 25% round-robin (fairness)
Implementation:

```c
if ((count & 3) == 0) { // 1/4: Other threads
    tp = g_all_thread_pages[round_robin_idx];
} else {                // 3/4: Own TLS
    tp = mf2_thread_pages_get();
}
```
Test Results (50% overall frequency):
Page reuses: 12,676 (5.5%) ❌
Problem: Effective frequency too low (37.5% own + 12.5% others)
Conclusion: Reverted to pure round-robin at 50% frequency (31% reuse)
## ChatGPT Pro Consultation 🧠

Date: 2025-10-24

### Question Posed
Complete technical question covering:
- MF2 architecture (Pending Queue, Direct Handoff, Opportunistic Drain)
- Problem: 31% reuse vs 90% target
- Constraints: O(1), lock-free, per-page freelist
- What was tried: Frequencies (1/8, 1/4, 1/2), Hybrid (75/25)
### Diagnosis
Root Problem: "Round-robin drain → owner handoff" doesn't work when owner stops allocating
Larson Benchmark Pattern:
- Phase 1 (0-1s): All threads allocate → pages populate
- Phase 2 (1-10s): All threads free+realloc from own ranges
- Thread A frees Thread A's objects → no cross-thread frees
- Thread B frees Thread B's objects → no cross-thread frees
- But: Some cross-thread frees do occur (~10%)
The Architectural Mismatch:
Current (Round-Robin Drain):
1. Thread A frees → Thread B's page goes to pending queue
2. Thread C (round-robin) drains Thread B's pending → activates page on Thread B
3. Thread B is NOT allocating (Larson Phase 2) → page sits unused
4. Thread A needs memory → allocates NEW page (doesn't know about Thread B's ready page)
Result: Pages drained but never used = 31% reuse instead of 90%
### Recommended Solution: Consumer-Driven Adoption
Core Principle: "Don't push pages to idle threads, let active threads pull and adopt them"
Key Changes:
- Remove round-robin drain entirely (no more `mf2_maybe_drain_pending()`)
- Add ownership transfer: CAS to change `page->owner_tid`
- Adoption on-demand: the allocating thread adopts pages from ANY thread's pending queue
- Lease mechanism: prevent thrashing (no re-transfer within 10ms)
Algorithm:

```c
// In alloc_slow, BEFORE allocating a new page:
bool mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx) {
    // Scan all threads' pending queues (round-robin for fairness)
    for (int i = 0; i < num_threads; i++) {
        MidPage* page = mf2_dequeue_pending(other_thread[i], class_idx);
        if (!page) continue;

        // Try to transfer ownership (CAS)
        uint64_t old_owner = page->owner_tid;
        uint64_t now = rdtsc();
        if (now - page->last_transfer_time < LEASE_CYCLES) continue; // Lease active
        if (!CAS(&page->owner_tid, old_owner, my_tid)) continue;     // CAS failed

        // Success! Ownership transferred
        page->owner_tp = me;
        page->last_transfer_time = now;

        // Drain and activate
        mf2_drain_remote_frees(page);
        if (page->freelist) {
            mf2_make_page_active(me, class_idx, page);
            return true; // SUCCESS!
        }
    }
    return false; // No adoptable pages
}
```
Expected Effects:
- ✅ No wasted effort (only allocating threads drain)
- ✅ Page reuse >90% (allocating thread gets any available page)
- ✅ Throughput 3-10M ops/s (100-350x improvement)
- ✅ Hot path unchanged (fast alloc/free still O(1), lock-free)
## Implementation Plan: Consumer-Driven Adoption

### Phase 1: Code Cleanup & Preparation ✅
Tasks:
- ✅ Remove `mf2_maybe_drain_pending()` (opportunistic drain)
- ✅ Remove all calls to `mf2_maybe_drain_pending()`
- ✅ Keep helper functions (`mf2_make_page_active`, `mf2_try_drain_and_activate`)
### Phase 2: Data Structure Updates
Tasks:
- Add `uint64_t last_transfer_time` to the `MidPage` struct
- Ensure `owner_tid` and `owner_tp` are already present (✅ verified)
### Phase 3: Adoption Function
Tasks:
- Implement `mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx)`
  - Scan all threads' pending queues (round-robin)
  - Check the lease (rdtsc() - last_transfer_time >= LEASE_CYCLES)
  - CAS ownership transfer
  - Drain and activate if successful
- Tune `LEASE_CYCLES` (start with 10ms ≈ 30M cycles on a 3GHz CPU; see the sketch after this list)
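A small sketch of the lease check under the stated assumption of a 3GHz TSC (x86-specific; `__rdtsc()` comes from `x86intrin.h`):

```c
#include <stdint.h>
#include <x86intrin.h> /* __rdtsc(); x86-only assumption */

#define LEASE_CYCLES (30ULL * 1000 * 1000) /* ~10ms at 3GHz; tune per machine */

/* A page may change owners again only after the lease expires,
 * which damps ownership thrashing between adopting threads. */
static inline int lease_expired(uint64_t last_transfer_time) {
    return (__rdtsc() - last_transfer_time) >= LEASE_CYCLES;
}
```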
### Phase 4: Integration
Tasks:
- Call `mf2_try_adopt_pending()` in `alloc_slow` BEFORE allocating a new page (see the sketch after this list)
- If adoption succeeds, retry the fast path
- If adoption fails, allocate a new page (existing logic)
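A sketch of the intended call order in `alloc_slow` (the surrounding body is paraphrased; only `mf2_try_adopt_pending` comes from the plan above, the other two names are assumptions):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct MF2_ThreadPages MF2_ThreadPages;
bool  mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx);
void* mf2_alloc_fast(MF2_ThreadPages* me, int class_idx);           /* assumed */
void* mf2_alloc_new_page_block(MF2_ThreadPages* me, int class_idx); /* assumed */

static void* alloc_slow_sketch(MF2_ThreadPages* me, int class_idx) {
    /* 1. Adopt a pending page from ANY thread before paying for a new one. */
    if (mf2_try_adopt_pending(me, class_idx))
        return mf2_alloc_fast(me, class_idx); /* retry the fast path */

    /* 2. Fall back to allocating a brand-new page (existing logic). */
    return mf2_alloc_new_page_block(me, class_idx);
}
```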
### Phase 5: Benchmark & Validate
Tasks:
- Run larson 4T benchmark
- Verify page reuse >90%
- Verify throughput >1M ops/s (target: 3-10M)
- Run full benchmark suite
## Current Status (Updated)
Working:
- ✅ Per-page sharding (data structures, allocation, free paths)
- ✅ 64KB alignment
- ✅ Memory ordering
- ✅ Pending queue infrastructure (enqueue/dequeue)
- ✅ Direct Handoff (immediate page activation)
- ✅ Helper functions (modular, inline-optimized)
- ✅ Round-robin drain (proof of concept - to be replaced)
Needs Improvement:
- ⚠️ Page reuse: 31% (target: >90%)
- ⚠️ Throughput: 27K ops/s (target: 3-10M)
Root Cause Identified:
- ❌ "Push to idle owner" doesn't work (Larson Phase 2 pattern)
- ✅ Solution: "Pull by active allocator" (Consumer-Driven Adoption)
Next Steps:
- 🎯 Remove `mf2_maybe_drain_pending()` (cleanup)
- 🎯 Add the `last_transfer_time` field
- 🎯 Implement `mf2_try_adopt_pending()`
- 🎯 Integrate adoption into `alloc_slow`
- 🎯 Benchmark and validate
## Lessons Learned (Updated)
- Alignment is Critical: 97% free failure from 4KB vs 64KB alignment mismatch
- Memory Ordering Matters: But doesn't solve architectural issues
- Workload Characteristics: Larson's same-thread pattern exposed TLS isolation bug
- Integration vs Separation: Need to carefully choose integration points
- Direct Handoff is Essential: Returning drained pages to intermediate lists wastes reuse opportunities
- Push vs Pull: "Push to idle owner" doesn't work; "Pull by active allocator" is correct design
- ChatGPT Pro Consultation: Fresh perspective identified fundamental architectural mismatch
Status: Ready for Consumer-Driven Adoption implementation
Confidence: Very High (ChatGPT Pro validated approach, clear design)
Expected Outcome: >90% page reuse, 3-10M ops/s (100-350x improvement)