# Phase 7.2 MF2: Implementation Progress

Date: 2025-10-24
Status: In Progress - Fixing Pending Queue Drain Issue
Current: Implementing Global Round-Robin Strategy

## Summary
The basic MF2 per-page sharding implementation is complete, but a structural problem was discovered in the pending queue drain mechanism. In the Larson benchmark, each thread allocates and frees within its own dedicated array range, so cross-thread frees are nearly zero. As a result, enqueues to the pending queue succeed (69K pages), but drains never happen (0 drains).

Detailed analysis by the Task agent identified the root cause:
- 各スレッドは自分のTLSのpending queueしか見ない
- Larsonでは各スレッドが自分でalloc/free → 自分のpending queueは空
- 他スレッドのpending queueに溜まったページは永遠に処理されない
## Implementation Timeline

### Phase 1-4: Core Implementation ✅

Commits:

- `0855b37` - Phase 1: Data structures
- `5c4b780` - Phase 2: Page allocation
- `b12f58c` - Phase 3: Allocation path
- `7e756c6` - Phase 4: Free path

Status: Complete
### Phase 5: Bug Fixes (Fix #1-6) ✅

#### Fix #1: Block Spacing Bug (54609c1)

Problem: Infinite loop on the first test.

Root Cause:

```c
size_t block_size = g_class_sizes[class_idx]; // Missing HEADER_SIZE
```

Fix: `block_size = HEADER_SIZE + user_size;`

Result: Test completes instead of hanging.
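For illustration, a minimal sketch of a block-carving loop under assumed names (the `HEADER_SIZE` value and the `carve_blocks` helper are hypothetical, not the project's actual code): if the stride omits the header, each block overlaps its neighbor and the freelist walk never terminates.

```c
#include <stddef.h>
#include <stdint.h>

#define HEADER_SIZE 16  /* assumed per-block header size */

/* Hypothetical sketch: carve a raw page into a singly-linked freelist.
 * The stride must be HEADER_SIZE + user_size (the Fix #1 correction);
 * omitting the header makes adjacent blocks overlap. */
static void* carve_blocks(uint8_t* page, size_t page_size, size_t user_size) {
    size_t stride = HEADER_SIZE + user_size;
    void* head = NULL;
    for (size_t off = 0; off + stride <= page_size; off += stride) {
        void** blk = (void**)(page + off);
        *blk = head;   /* link block into the freelist */
        head = blk;
    }
    return head;
}
```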
#### Fix #2-3: Performance Optimizations (aa869b9)
Changes:
- Removed 64KB memset (switched from posix_memalign to mmap)
- Removed O(N) eager drain scan
- Reduced scan limit from 256 to 8
Result: 27.5K → 110K ops/s (4x improvement on 4T)
#### Fix #4: Alignment Bug (9e64f7e) - CRITICAL

Problem: 97% of frees silently dropped!

Root Cause:

- `mmap()` only guarantees 4KB alignment
- `addr_to_page()` assumes 64KB alignment
- Lookup fails: `(ptr & ~0xFFFF)` rounds to the wrong page base

Fix: Changed to `posix_memalign(&page_base, 65536, POOL_PAGE_SIZE)`
Verification (by Task agent):
Pages allocated: 101,093
Alignment bugs: 0 (ZERO!)
Registry collisions: 0 (ZERO!)
Lookup success rate: 98%
Side Effect: Performance degraded (466K → 54K) due to memset overhead returning
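A minimal sketch of the invariant Fix #4 restores (the helper names here are illustrative; the real `addr_to_page()` also consults the page registry): the mask in `addr_to_page_base()` is only sound when every page base is 64KB-aligned, which `posix_memalign` guarantees and plain `mmap` does not.

```c
#include <stdlib.h>
#include <stdint.h>

#define POOL_PAGE_SIZE (64 * 1024)  /* 64KB pool pages */

/* Allocate a pool page with a 64KB-aligned base address. */
static void* alloc_aligned_page(void) {
    void* page_base = NULL;
    if (posix_memalign(&page_base, 65536, POOL_PAGE_SIZE) != 0)
        return NULL;
    return page_base;
}

/* Recover the owning page base from an interior pointer.
 * Correct only if every page base is 64KB-aligned. */
static inline void* addr_to_page_base(void* ptr) {
    return (void*)((uintptr_t)ptr & ~(uintptr_t)0xFFFF);
}
```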
#### Fix #5: Active Page Drain Attempt (9e64f7e)

Change: Added a check for remote frees in active_page before allocating a new page.

Result: No improvement (remote drains still 0)
#### Fix #6: Memory Ordering (b0768b3)

Problem: All remote_count operations used memory_order_relaxed.

Fix: Changed 7 locations to seq_cst/acquire/release.

Result: Memory ordering is now correct, but performance still did not improve.
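As a hedged illustration of the acquire/release pairing involved (the field and function names here are stand-ins, not the exact Fix #6 sites): the remote-free side publishes its block before bumping `remote_count` with release, and the drain side reads the count with acquire so the published block is guaranteed visible.

```c
#include <stdatomic.h>

typedef struct {
    _Atomic unsigned remote_count;  /* remote frees awaiting drain */
} PageCounters;                     /* illustrative stand-in for MidPage */

/* Remote-free side: push the block first, then release-increment the
 * count so a draining thread that sees the count also sees the block. */
static void remote_free_publish(PageCounters* pc) {
    /* ... push block onto the page's remote freelist ... */
    atomic_fetch_add_explicit(&pc->remote_count, 1, memory_order_release);
}

/* Drain side: acquire-load pairs with the release above. */
static unsigned drain_snapshot(PageCounters* pc) {
    return atomic_load_explicit(&pc->remote_count, memory_order_acquire);
}
```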
Root Cause Discovery (by Task agent):
- Debug instrumentation revealed: drain checks and remote frees target DIFFERENT page objects
- Thread A's pages in Thread A's tp->active_page/full_pages
- Thread B frees to Thread A's pages → remote_count++
- Thread B's slow path checks Thread B's pages only
- Result: Thread A's pages (with remote_count > 0) never checked by anyone!
### Phase 2: Pending Queue Implementation (89541fc) ✅
Implementation (by Task agent):
- Box 1: Data structures - added `owner_tp`, `in_remote_pending`, `next_pending` to `MidPage`
- Box 2: MPSC lock-free queue operations (`mf2_enqueue_pending`, `mf2_dequeue_pending`) - see the sketch after this list
- Box 3: 0→1 edge detection in `mf2_free_slow()`
- Box 4: Allocation slow path drain (up to 4 pages per allocation)
- Box 5: Opportunistic drain (every 16th owner free)
- Box 6: Comprehensive debug logging and statistics
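A self-contained sketch of Boxes 2-3 under simplified assumptions (the types and helpers here are illustrative stand-ins for the real `mf2_enqueue_pending`/`mf2_dequeue_pending`): the pending queue is a Treiber stack with many producers and a single consumer, and a page is enqueued only on the 0→1 edge of its remote count.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Page Page;
struct Page {
    _Atomic(Page*) next_pending;     /* MPSC stack link */
    _Atomic unsigned remote_count;   /* remote frees on this page */
    _Atomic bool in_remote_pending;  /* already enqueued? */
};

typedef struct {
    _Atomic(Page*) pending_head;     /* one stack per owner thread/class */
} PendingQueue;

/* Multi-producer push (Treiber stack). */
static void pending_push(PendingQueue* q, Page* p) {
    Page* head = atomic_load_explicit(&q->pending_head, memory_order_relaxed);
    do {
        atomic_store_explicit(&p->next_pending, head, memory_order_relaxed);
    } while (!atomic_compare_exchange_weak_explicit(
                 &q->pending_head, &head, p,
                 memory_order_release, memory_order_relaxed));
}

/* Single-consumer pop: only one thread pops, so ABA is not an issue. */
static Page* pending_pop(PendingQueue* q) {
    Page* head = atomic_load_explicit(&q->pending_head, memory_order_acquire);
    while (head) {
        Page* next = atomic_load_explicit(&head->next_pending, memory_order_relaxed);
        if (atomic_compare_exchange_weak_explicit(
                &q->pending_head, &head, next,
                memory_order_acquire, memory_order_acquire))
            break;
    }
    return head;
}

/* Box 3: enqueue the page only on the 0->1 edge of remote_count,
 * guarded by a flag so concurrent freers enqueue it exactly once. */
static void on_remote_free(PendingQueue* q, Page* p) {
    if (atomic_fetch_add_explicit(&p->remote_count, 1, memory_order_release) == 0) {
        bool expected = false;
        if (atomic_compare_exchange_strong(&p->in_remote_pending, &expected, true))
            pending_push(q, p);
    }
}
```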
Test Results:
Pending enqueued: 43,138 ✅
Pending drained: 0 ❌
Analysis (by Task agent):
- Implementation is correct
- Problem: Larson benchmark allocates all pages early, frees later
- By the time remote frees arrive, owner threads don't allocate anymore
- Slow path never called → pending queue never processed
- This is a workload mismatch, not an implementation bug
### Tuning: Opportunistic Drain Frequency (a6eb666) ✅
Change: Increased from every 16th to every 4th free (4x more aggressive)
Test Results (larson 10 2-32K 10s 4T):
Pending enqueued: 52,912 ✅
Pending drained: 0 ❌
Throughput: 53K ops/s
Conclusion: Frequency tuning didn't help - workload pattern issue persists
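For reference, this tuning amounts to changing one mask in a gate like the following (the counter name is a hypothetical illustration):

```c
/* Hypothetical sketch of the opportunistic-drain gate. */
static _Thread_local unsigned t_free_ticks;

static inline int should_drain(void) {
    return (t_free_ticks++ & 3) == 0;  /* every 4th free; was & 15 (every 16th) */
}
```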
### Option 1: free_slow Drain Addition ❌
Concept: Add opportunistic drain to both free_fast() and free_slow()
Implementation:

- Created `mf2_maybe_drain_pending()` helper
- Called from both free_fast() (Line 1115) and free_slow() (Line 1167)
Test Results:
Pending enqueued: 76,733 ✅
Pending drained: 0 ❌
OPP_DRAIN_TRY: 10 attempts (all from tp=0x55828805f7a0)
Throughput: 27,890 ops/s
Problem: All drain attempts from same thread - other 3 threads not visible
### Option C: alloc_slow Drain Addition ❌

Concept: Add a drain before new-page allocation (the owner thread is allocating continuously)

Implementation: Added `mf2_maybe_drain_pending()` at Line 1021 (before `mf2_alloc_new_page()`)
Test Results:
Pending enqueued: 69,702 ✅
Pending drained: 0 ❌
OPP_DRAIN_TRY: 10 attempts (all from tp=0x559146bb17a0)
Throughput: 27,965 ops/s
Conclusion: Still 0 drains - same thread issue persists
## Root Cause Analysis (by Task Agent)

### Larson Benchmark Characteristics
```cpp
// larson.cpp: exercise_heap()
for (cblks = 0; cblks < pdea->NumBlocks; cblks++) {
    victim = lran2(&pdea->rgen) % pdea->asize;            // Own array range
    CUSTOM_FREE(pdea->array[victim]);                     // Free own allocation
    pdea->array[victim] = (char*)CUSTOM_MALLOC(blk_size); // Same slot
}

// Array partitioning (Line 481):
de_area[i].array = &blkp[i*nperthread]; // Each thread owns separate range
```
Key Finding: Each thread allocates/frees from its own array range
- Thread 0: `array[0..999]`
- Thread 1: `array[1000..1999]`
- Thread 2: `array[2000..2999]`
- Thread 3: `array[3000..3999]`
Result: Cross-thread frees are almost ZERO
### MF2 Design vs Larson Mismatch
MF2 Assumption:
4 threads freeing → all threads call mf2_free() → all threads drain pending
Larson Reality:
1 thread does most freeing → only 1 thread drains pending
Other threads allocate-only → never drain their own pending queues
Problem:

```c
mf2_maybe_drain_pending() {
    MF2_ThreadPages* tp = mf2_thread_pages_get();          // ← Own TLS only!
    MidPage* pending = mf2_dequeue_pending(tp, class_idx); // ← Own pending only!
}
```
- Thread A drains → checks Thread A's TLS → Thread A's pending queue is empty
- Thread B/C/D's pending queues (with 69K pages!) are never checked
### Pending Enqueue Sources
76,733 enqueues come from:
- Phase 1 allocation interruptions (rare cross-thread frees)
- NOT from Phase 2 continuous freeing (same-thread pattern)
## Solution Strategy: Global Round-Robin

### Design Philosophy: "Where to Separate, Where to Integrate"
Separation Points (working well) ✅:
- Allocation: Thread-local, no lock
- Owner Free: Thread-local, no lock
- Cross-thread Free: Lock-free MPSC stack
Integration Point (broken) ❌:
- Pending Queue Drain: Currently thread-local only
### Strategy A: Global Round-Robin (Phase 1) 🎯
Core Idea: All threads can drain ANY thread's pending queue
```c
// Global registry
static MF2_ThreadPages* g_all_thread_pages[MAX_THREADS];
static _Atomic int g_num_thread_pages = 0;

// Round-robin drain
mf2_maybe_drain_pending() {
    static _Atomic uint64_t counter = 0;
    uint64_t count = atomic_fetch_add(&counter, 1);

    // Round-robin across ALL threads (not just self!)
    int tp_idx = (count / 4) % g_num_thread_pages;
    MF2_ThreadPages* tp = g_all_thread_pages[tp_idx];
    if (tp) {
        int class_idx = (count / 4 / g_num_thread_pages) % POOL_NUM_CLASSES;
        MidPage* pending = mf2_dequeue_pending(tp, class_idx);
        if (pending) drain_remote_frees(pending);
    }
}
```
Benefits:
- Larson works: Any thread can drain any thread's pending queue
- Fair: All TLSs get equal drain opportunities
- Simple: Just global array + round-robin
Implementation Steps:
- Add global array `g_all_thread_pages[]`
- Register TLS in `mf2_thread_pages_get()`
- Add destructor with `pthread_key_create()`
- Modify `mf2_maybe_drain_pending()` to round-robin
Expected Impact:
Pending enqueued: 69K
Pending drained: 69K ✅ (100% instead of 0%)
Page reuse rate: 3% → 90%+ ✅
Throughput: 28K → 3-10M ops/s ✅ (100-350x improvement!)
### Strategy B: Hybrid (Phase 2) ⚡
Optimization: Prefer own TLS (cache efficiency) but periodically check others
```c
if ((count & 3) == 0) { // 1/4: Other threads
    tp = g_all_thread_pages[round_robin_idx];
} else {                // 3/4: Own TLS (cache hot)
    tp = mf2_thread_pages_get();
}
```
Benefits:
- Cache efficiency: 75% of drains are own TLS (L1 cache)
- Fairness: 25% of drains are others (ensures progress)
Metrics:
- Own TLS: L1 cache hit (1-2 cycles)
- Other TLS: L3 cache hit (10-20 cycles)
- Average cost: 3-5 cycles (negligible)
### Strategy C: Background Sweeper (Phase 3) 🔄
Safety Net: Handle edge cases where all threads stop allocating/freeing
```c
void* mf2_drain_thread(void* arg) {
    while (running) {
        usleep(1000); // 1ms interval (not 100μs - too aggressive)

        // Scan all TLSs for leftover pending pages
        for (int i = 0; i < g_num_thread_pages; i++) {
            for (int c = 0; c < POOL_NUM_CLASSES; c++) {
                MidPage* pending = mf2_dequeue_pending(g_all_thread_pages[i], c);
                if (pending) drain_remote_frees(pending);
            }
        }
    }
    return NULL;
}
```
Role: Insurance policy, not main drain mechanism
- Strategy A handles 95% of drains (hot path)
- Strategy C handles 5% leftover (rare cases)
Latency Impact: NONE on hot path (async background)
### 3-Layer Latency Hiding Design
| Layer | Strategy | Frequency | Latency | Coverage | Role |
|---|---|---|---|---|---|
| L1: Hot Path | A (Global RR) | Every 4th op | <1μs | 95% | Main drain |
| L2: Optimization | B (Hybrid) | 3/4 own, 1/4 other | <1μs | 100% | Cache efficiency |
| L3: Safety Net | C (BG sweeper) | 1ms interval | 1ms | 100% | Edge cases |
Latency Guarantee: Front-end (alloc/free) always returns in <1μs, regardless of background drain state
## Implementation Plan

### Phase 1: Global Round-Robin (Today) 🎯
Target: Make Larson work
Tasks:
- Add global array `g_all_thread_pages[256]`
- Add atomic counter `g_num_thread_pages`
- Add registration in `mf2_thread_pages_get()`
- Add pthread_key destructor for cleanup (see the sketch after this list)
- Modify `mf2_maybe_drain_pending()` for round-robin
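A minimal sketch of the registration path under stated assumptions (`register_thread_pages` and the destructor body are hypothetical; in the plan this work happens inside `mf2_thread_pages_get()`):

```c
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 256

typedef struct MF2_ThreadPages MF2_ThreadPages; /* defined elsewhere */

static MF2_ThreadPages* g_all_thread_pages[MAX_THREADS];
static _Atomic int g_num_thread_pages = 0;
static pthread_key_t g_tp_key;
static pthread_once_t g_tp_once = PTHREAD_ONCE_INIT;

/* Runs at thread exit. The slot stays registered so other threads can
 * keep draining this TLS's pending queue after the thread is gone. */
static void tp_destructor(void* arg) {
    (void)arg;
}

static void tp_key_init(void) {
    pthread_key_create(&g_tp_key, tp_destructor);
}

/* Called once per thread when its TLS structure is created. */
static void register_thread_pages(MF2_ThreadPages* tp) {
    pthread_once(&g_tp_once, tp_key_init);
    pthread_setspecific(g_tp_key, tp);
    int idx = atomic_fetch_add(&g_num_thread_pages, 1);
    if (idx < MAX_THREADS)
        g_all_thread_pages[idx] = tp;
}
```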
Expected Time: 1-2 hours
Success Criteria:
- Pending drained > 0 (ideally ~69K)
- Throughput > 1M ops/s (35x improvement from 28K)
### Phase 2: Hybrid Optimization (Tomorrow)
Target: Improve cache efficiency
Tasks:
- Modify `mf2_maybe_drain_pending()` to prefer own TLS (3/4 ratio)
- Benchmark cache hit rates
Expected Time: 30 minutes
Success Criteria:
- L1 cache hit rate > 75%
- Throughput gain: +5-10%
### Phase 3: Background Sweeper (Optional)
Target: Handle edge cases
Tasks:
- Create background thread with `pthread_create()`
- Scan all TLSs every 1ms
- CPU throttling (< 1% usage)
Expected Time: 30 minutes
Success Criteria:
- No pending leftovers after 10s idle
- CPU overhead < 1%
## Current Status
Working:
- ✅ Per-page sharding (data structures, allocation, free paths)
- ✅ 64KB alignment (Fix #4)
- ✅ Memory ordering (Fix #6)
- ✅ Pending queue infrastructure (enqueue works perfectly)
- ✅ 0→1 edge detection
Broken:
- ❌ Pending queue drain (0 drains due to TLS isolation)
- ❌ Page reuse (3% instead of 90%)
- ❌ Performance (28K ops/s instead of 3-10M)
Next:
- 🎯 Implement Phase 1: Global Round-Robin
- 🎯 Expected breakthrough: 28K → 3-10M ops/s
## Files Modified

### Core Implementation

- `hakmem_pool.c` (Lines 275-1200): MF2 implementation
  - Data structures (MidPage, MF2_ThreadPages, PageRegistry)
  - Allocation paths (fast/slow)
  - Free paths (fast/slow)
  - Pending queue operations
  - Opportunistic drain (currently broken)
### Documentation

- `docs/specs/ENV_VARS.md`: Added `HAKMEM_MF2_ENABLE`
- `docs/status/PHASE_7.2_MF2_PLAN_2025_10_24.md`: Original plan
- `docs/status/PHASE_7.2_MF2_PROGRESS_2025_10_24.md`: This file
### Debug Reports

- `ALIGNMENT_FIX_VERIFICATION.md`: Fix #4 verification by Task agent
## Lessons Learned
- Alignment is Critical: 97% free failure from 4KB vs 64KB alignment mismatch
- Memory Ordering Matters: But doesn't solve architectural issues
- Workload Characteristics: Larson's same-thread pattern exposed TLS isolation bug
- Integration vs Separation: Need to carefully choose integration points
- Task Agent is MVP: Detailed analysis saved days of debugging
## Phase 1: Global Round-Robin Implementation ✅
Commit: (multiple commits implementing round-robin drain)
Implementation:
- Added `g_all_thread_pages[256]` global array
- Added `g_num_thread_pages` atomic counter
- Implemented TLS registration in `mf2_thread_pages_get()`
- Implemented `mf2_maybe_drain_pending()` with round-robin logic
- Called from both `mf2_free_fast()` and `mf2_alloc_slow()`
Test Results (larson 10 2-32K 10s 4T):
Pending enqueued: 96,429 ✅
Pending drained: 70,607 ✅ (73% - huge improvement from 0%!)
Page reuse count: 5,222
Throughput: ~28,705 ops/s
Analysis:
- ✅ Round-robin drain WORKS! (0 drains → 70K drains)
- ⚠️ But page reuse only 2.3% (5,222 / 226,447 pages allocated)
- Problem: Drained pages returned to full_pages, but owner doesn't scan them
## Strategy C: Direct Handoff Implementation ✅
Concept: Don't return drained pages to full_pages - make them active immediately
Implementation (clean modular code):

```c
// Helper: Make page active (move old active to full_pages)
static inline void mf2_make_page_active(MF2_ThreadPages* tp, int class_idx, MidPage* page);

// Helper: Drain page and activate if successful (Direct Handoff)
static inline bool mf2_try_drain_and_activate(MF2_ThreadPages* tp, int class_idx, MidPage* page);
```
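A hedged sketch of what those two helpers plausibly do (the types here are simplified stand-ins for the real ones; `drain_remote_frees` is assumed to move a page's remote frees onto its local freelist):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct MidPageS MidPageS; /* simplified stand-in for MidPage */
struct MidPageS {
    MidPageS* next;
    void*     freelist;
};

typedef struct {
    MidPageS* active_page;
    MidPageS* full_pages; /* pages with no local free blocks */
} ClassPages;             /* stand-in for one class slot of MF2_ThreadPages */

void drain_remote_frees(MidPageS* page); /* assumed project primitive */

/* Make `page` the active page; park the old active page on full_pages. */
static inline void make_page_active(ClassPages* cp, MidPageS* page) {
    if (cp->active_page) {
        cp->active_page->next = cp->full_pages;
        cp->full_pages = cp->active_page;
    }
    cp->active_page = page;
}

/* Direct Handoff: drain remote frees and, if any blocks were recovered,
 * activate the page immediately rather than parking it on full_pages. */
static inline bool try_drain_and_activate(ClassPages* cp, MidPageS* page) {
    drain_remote_frees(page);
    if (page->freelist) {
        make_page_active(cp, page);
        return true;
    }
    return false;
}
```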
Changes:
- Modified `mf2_maybe_drain_pending()` to use `mf2_try_drain_and_activate()`
- Modified the `alloc_slow` pending drain loop to use Direct Handoff
- Reduced opportunistic drain from 60+ lines to 20 lines
Test Results (larson 10 2-32K 10s 4T):
Pending enqueued: 96,429
Pending drained: 70,607
Page reuse count: 80,017 ✅ (15x improvement!)
Throughput: ~28,705 ops/s
Success: Page reuse 35% (80,017 / 226,447)
## Full Pages Scan Removal ✅
Evidence: The full_pages scan checked 1.88M pages but found 0 reusable pages (0% success rate)
Reason: Direct Handoff immediately activates drained pages, so full_pages never contains reusable pages
Action: Removed full_pages scan (76 lines deleted)
Test Results:
Page reuses: 69,098 (31%)
Throughput: 27,206 ops/s
Conclusion: Slight decrease but acceptable (simplification benefit)
## Frequency Tuning Attempts ⚙️
Tested multiple opportunistic drain frequencies:
| Frequency | Page Reuses | Reuse % | Throughput |
|---|---|---|---|
| 1/2 (50%) | 70,607 | 31% | 27,206 ops/s |
| 1/4 (25%) | 45,369 | 20% | 27,423 ops/s |
| 1/8 (12.5%) | 24,901 | 11% | 27,642 ops/s |
Finding: Higher frequency = better reuse, but still far from 90% target
## Hybrid Strategy Attempt (Strategy B) ❌
Concept: 75% own TLS (cache efficiency) + 25% round-robin (fairness)
Implementation:

```c
if ((count & 3) == 0) { // 1/4: Other threads
    tp = g_all_thread_pages[round_robin_idx];
} else {                // 3/4: Own TLS
    tp = mf2_thread_pages_get();
}
```
Test Results (50% overall frequency):
Page reuses: 12,676 (5.5%) ❌
Problem: Effective frequency too low (37.5% own + 12.5% others)
Conclusion: Reverted to pure round-robin at 50% frequency (31% reuse)
## ChatGPT Pro Consultation 🧠

Date: 2025-10-24

### Question Posed
Complete technical question covering:
- MF2 architecture (Pending Queue, Direct Handoff, Opportunistic Drain)
- Problem: 31% reuse vs 90% target
- Constraints: O(1), lock-free, per-page freelist
- What was tried: Frequencies (1/8, 1/4, 1/2), Hybrid (75/25)
### Diagnosis
Root Problem: "Round-robin drain → owner handoff" doesn't work when owner stops allocating
Larson Benchmark Pattern:
- Phase 1 (0-1s): All threads allocate → pages populate
- Phase 2 (1-10s): All threads free+realloc from own ranges
- Thread A frees Thread A's objects → no cross-thread frees
- Thread B frees Thread B's objects → no cross-thread frees
- But: Some cross-thread frees do occur (~10%)
The Architectural Mismatch:
Current (Round-Robin Drain):
1. Thread A frees → Thread B's page goes to pending queue
2. Thread C (round-robin) drains Thread B's pending → activates page on Thread B
3. Thread B is NOT allocating (Larson Phase 2) → page sits unused
4. Thread A needs memory → allocates NEW page (doesn't know about Thread B's ready page)
Result: Pages drained but never used = 31% reuse instead of 90%
### Recommended Solution: Consumer-Driven Adoption
Core Principle: "Don't push pages to idle threads, let active threads pull and adopt them"
Key Changes:
- Remove round-robin drain entirely (no more `mf2_maybe_drain_pending()`)
- Add ownership transfer: CAS to change `page->owner_tid`
- Adoption on-demand: the allocating thread adopts pages from ANY thread's pending queue
- Lease mechanism: prevent thrashing (no re-transfer within 10ms)
Algorithm:

```c
// In alloc_slow, BEFORE allocating a new page:
bool mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx) {
    // Scan all threads' pending queues (round-robin for fairness)
    for (int i = 0; i < num_threads; i++) {
        MidPage* page = mf2_dequeue_pending(other_thread[i], class_idx);
        if (!page) continue;

        // Try to transfer ownership (CAS)
        uint64_t old_owner = page->owner_tid;
        uint64_t now = rdtsc();
        if (now - page->last_transfer_time < LEASE_CYCLES) continue; // Lease active
        if (!CAS(&page->owner_tid, old_owner, my_tid)) continue;     // CAS failed

        // Success! Ownership transferred
        page->owner_tp = me;
        page->last_transfer_time = now;

        // Drain and activate
        mf2_drain_remote_frees(page);
        if (page->freelist) {
            mf2_make_page_active(me, class_idx, page);
            return true; // SUCCESS!
        }
    }
    return false; // No adoptable pages
}
```
Expected Effects:
- ✅ No wasted effort (only allocating threads drain)
- ✅ Page reuse >90% (allocating thread gets any available page)
- ✅ Throughput 3-10M ops/s (100-350x improvement)
- ✅ Hot path unchanged (fast alloc/free still O(1), lock-free)
## Implementation Plan: Consumer-Driven Adoption

### Phase 1: Code Cleanup & Preparation ✅
Tasks:
- ✅ Remove `mf2_maybe_drain_pending()` (opportunistic drain)
- ✅ Remove all calls to `mf2_maybe_drain_pending()`
- ✅ Keep helper functions (`mf2_make_page_active`, `mf2_try_drain_and_activate`)
### Phase 2: Data Structure Updates
Tasks:
- Add `uint64_t last_transfer_time` to the `MidPage` struct
- Ensure `owner_tid` and `owner_tp` are already present (✅ verified)
### Phase 3: Adoption Function
Tasks:
- Implement `mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx)`
  - Scan all threads' pending queues (round-robin)
  - Check the lease (rdtsc() - last_transfer_time >= LEASE_CYCLES)
  - CAS ownership transfer
  - Drain and activate if successful
- Tune `LEASE_CYCLES` (start with 10ms ≈ 30M cycles on a 3GHz CPU; see the sketch after this list)
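A small sketch of the lease check under the stated assumption of a 3GHz TSC (x86-specific; `__rdtsc()` comes from `x86intrin.h`):

```c
#include <stdint.h>
#include <x86intrin.h> /* __rdtsc(); x86-only assumption */

#define LEASE_CYCLES (30ULL * 1000 * 1000) /* ~10ms at 3GHz; tune per machine */

/* A page may change owners again only after the lease expires,
 * which damps ownership thrashing between adopting threads. */
static inline int lease_expired(uint64_t last_transfer_time) {
    return (__rdtsc() - last_transfer_time) >= LEASE_CYCLES;
}
```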
### Phase 4: Integration
Tasks:
- Call `mf2_try_adopt_pending()` in `alloc_slow` BEFORE allocating a new page (see the sketch after this list)
- If adoption succeeds, retry the fast path
- If adoption fails, allocate a new page (existing logic)
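A sketch of the intended call order in `alloc_slow` (the surrounding body is paraphrased; only `mf2_try_adopt_pending` comes from the plan above, the other two names are assumptions):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct MF2_ThreadPages MF2_ThreadPages;
bool  mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx);
void* mf2_alloc_fast(MF2_ThreadPages* me, int class_idx);           /* assumed */
void* mf2_alloc_new_page_block(MF2_ThreadPages* me, int class_idx); /* assumed */

static void* alloc_slow_sketch(MF2_ThreadPages* me, int class_idx) {
    /* 1. Adopt a pending page from ANY thread before paying for a new one. */
    if (mf2_try_adopt_pending(me, class_idx))
        return mf2_alloc_fast(me, class_idx); /* retry the fast path */

    /* 2. Fall back to allocating a brand-new page (existing logic). */
    return mf2_alloc_new_page_block(me, class_idx);
}
```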
### Phase 5: Benchmark & Validate
Tasks:
- Run larson 4T benchmark
- Verify page reuse >90%
- Verify throughput >1M ops/s (target: 3-10M)
- Run full benchmark suite
## Current Status (Updated)
Working:
- ✅ Per-page sharding (data structures, allocation, free paths)
- ✅ 64KB alignment
- ✅ Memory ordering
- ✅ Pending queue infrastructure (enqueue/dequeue)
- ✅ Direct Handoff (immediate page activation)
- ✅ Helper functions (modular, inline-optimized)
- ✅ Round-robin drain (proof of concept - to be replaced)
Needs Improvement:
- ⚠️ Page reuse: 31% (target: >90%)
- ⚠️ Throughput: 27K ops/s (target: 3-10M)
Root Cause Identified:
- ❌ "Push to idle owner" doesn't work (Larson Phase 2 pattern)
- ✅ Solution: "Pull by active allocator" (Consumer-Driven Adoption)
Next Steps:
- 🎯 Remove `mf2_maybe_drain_pending()` (cleanup)
- 🎯 Add the `last_transfer_time` field
- 🎯 Implement `mf2_try_adopt_pending()`
- 🎯 Integrate adoption into `alloc_slow`
- 🎯 Benchmark and validate
## Lessons Learned (Updated)
- Alignment is Critical: 97% free failure from 4KB vs 64KB alignment mismatch
- Memory Ordering Matters: But doesn't solve architectural issues
- Workload Characteristics: Larson's same-thread pattern exposed TLS isolation bug
- Integration vs Separation: Need to carefully choose integration points
- Direct Handoff is Essential: Returning drained pages to intermediate lists wastes reuse opportunities
- Push vs Pull: "Push to idle owner" doesn't work; "Pull by active allocator" is correct design
- ChatGPT Pro Consultation: Fresh perspective identified fundamental architectural mismatch
Status: Ready for Consumer-Driven Adoption implementation
Confidence: Very High (ChatGPT Pro validated approach, clear design)
Expected Outcome: >90% page reuse, 3-10M ops/s (100-350x improvement)