# Phase 7.2 MF2: Implementation Progress

**Date**: 2025-10-24
**Status**: In Progress - Fixing Pending Queue Drain Issue
**Current**: Implementing Global Round-Robin Strategy

---

## Summary

The core MF2 Per-Page Sharding implementation is complete, but a structural problem was found in the pending queue drain mechanism. In the Larson benchmark, each thread allocates and frees within its own dedicated array range, so cross-thread frees are nearly zero. As a result, enqueues to the pending queue succeed (69K pages), but drains never happen (0 drains).

Detailed analysis by the Task agent identified the root cause:
- Each thread only looks at the pending queue in **its own TLS**
- In Larson, each thread allocs/frees its own blocks → its own pending queue is empty
- Pages accumulated in other threads' pending queues are **never processed**

---

## Implementation Timeline

### Phase 1-4: Core Implementation ✅

**Commits**:
- `0855b37` - Phase 1: Data structures
- `5c4b780` - Phase 2: Page allocation
- `b12f58c` - Phase 3: Allocation path
- `7e756c6` - Phase 4: Free path

**Status**: Complete

---

### Phase 5: Bug Fixes (Fix #1-6) ✅

#### Fix #1: Block Spacing Bug (`54609c1`)

**Problem**: Infinite loop on first test

**Root Cause**:
```c
size_t block_size = g_class_sizes[class_idx];  // Missing HEADER_SIZE
```

**Fix**: `block_size = HEADER_SIZE + user_size;`

**Result**: Test completes instead of hanging

---

#### Fix #2-3: Performance Optimizations (`aa869b9`)

**Changes**:
- Removed 64KB memset (switched from posix_memalign to mmap)
- Removed O(N) eager drain scan
- Reduced scan limit from 256 to 8

**Result**: 27.5K → 110K ops/s (4x improvement on 4T)

---

#### Fix #4: Alignment Bug (`9e64f7e`) - CRITICAL

**Problem**: 97% of frees were silently dropped!

**Root Cause**:
- mmap() only guarantees 4KB alignment
- `addr_to_page()` assumes 64KB alignment
- Lookup fails: `(ptr & ~0xFFFF)` rounds to the wrong page base

**Fix**: Changed to `posix_memalign(&page_base, 65536, POOL_PAGE_SIZE)`

**Verification** (by Task agent):
```
Pages allocated:     101,093
Alignment bugs:      0 (ZERO!)
Registry collisions: 0 (ZERO!)
Lookup success rate: 98%
```

**Side Effect**: Performance degraded (466K → 54K) because the memset overhead returned

---

#### Fix #5: Active Page Drain Attempt (`9e64f7e`)

**Change**: Check active_page for remote frees before allocating a new page

**Result**: No improvement (remote drains still 0)

---

#### Fix #6: Memory Ordering (`b0768b3`)

**Problem**: All remote_count operations used `memory_order_relaxed`

**Fix**: Changed 7 locations to `seq_cst/acquire/release` (see the sketch at the end of this subsection)

**Result**: Memory ordering is now correct, but performance still did not improve

**Root Cause Discovery** (by Task agent):
- Debug instrumentation revealed that drain checks and remote frees target **DIFFERENT page objects**
- Thread A's pages live in Thread A's tp->active_page/full_pages
- Thread B frees to Thread A's pages → remote_count++
- Thread B's slow path checks Thread B's pages only
- Result: Thread A's pages (with remote_count > 0) are never checked by anyone!
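For reference, the ordering Fix #6 converged on looks roughly like the following. This is a minimal sketch, not the actual hakmem_pool.c code: the struct layout and function names here are illustrative stand-ins.

```c
#include <stdatomic.h>
#include <stddef.h>

// Sketch only: illustrative MidPage fields, not the real layout.
typedef struct {
    _Atomic(void*) remote_head;   // lock-free stack of remote-freed blocks
    _Atomic size_t remote_count;  // blocks awaiting drain
} MidPageSketch;

// Remote-free side: publish the block first, then bump remote_count with
// release so the drain side's acquire load also sees the pushed block.
static void remote_free_publish(MidPageSketch* page, void* block) {
    void* old = atomic_load_explicit(&page->remote_head, memory_order_relaxed);
    do {
        *(void**)block = old;  // link the block onto the stack
    } while (!atomic_compare_exchange_weak_explicit(
                 &page->remote_head, &old, block,
                 memory_order_release, memory_order_relaxed));
    atomic_fetch_add_explicit(&page->remote_count, 1, memory_order_release);
}

// Drain side: acquire pairs with the release above - if count > 0 is
// observed, the blocks behind remote_head are visible as well.
static int drain_is_pending(MidPageSketch* page) {
    return atomic_load_explicit(&page->remote_count, memory_order_acquire) > 0;
}
```

With `relaxed` everywhere, a drainer could observe `remote_count > 0` yet read a stale `remote_head`; the acquire/release pairing closes that window. As Fix #6 showed, though, correct ordering does not help if no thread ever looks at the right page.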
---

### Phase 2: Pending Queue Implementation (`89541fc`) ✅

**Implementation** (by Task agent):
- **Box 1**: Data structures - added owner_tp, in_remote_pending, next_pending to MidPage
- **Box 2**: MPSC lock-free queue operations (mf2_enqueue_pending, mf2_dequeue_pending)
- **Box 3**: 0→1 edge detection in mf2_free_slow()
- **Box 4**: Allocation slow path drain (up to 4 pages per allocation)
- **Box 5**: Opportunistic drain (every 16th owner free)
- **Box 6**: Comprehensive debug logging and statistics

**Test Results**:
```
Pending enqueued: 43,138 ✅
Pending drained:  0      ❌
```

**Analysis** (by Task agent):
- The implementation is correct
- Problem: the Larson benchmark allocates all pages early and frees them later
- By the time remote frees arrive, owner threads no longer allocate
- Slow path never called → pending queue never processed
- This is a workload mismatch, not an implementation bug

---

### Tuning: Opportunistic Drain Frequency (`a6eb666`) ✅

**Change**: Increased from every 16th to every 4th free (4x more aggressive)

**Test Results** (larson 10 2-32K 10s 4T):
```
Pending enqueued: 52,912 ✅
Pending drained:  0      ❌
Throughput:       53K ops/s
```

**Conclusion**: Frequency tuning didn't help - the workload-pattern issue persists

---

### Option 1: free_slow Drain Addition ❌

**Concept**: Add opportunistic drain to both `free_fast()` and `free_slow()`

**Implementation**:
- Created `mf2_maybe_drain_pending()` helper
- Called from both free_fast() (Line 1115) and free_slow() (Line 1167)

**Test Results**:
```
Pending enqueued: 76,733 ✅
Pending drained:  0      ❌
OPP_DRAIN_TRY:    10 attempts (all from tp=0x55828805f7a0)
Throughput:       27,890 ops/s
```

**Problem**: All drain attempts come from the same thread - the other 3 threads never appear

---

### Option C: alloc_slow Drain Addition ❌

**Concept**: Drain before allocating a new page (the owner thread is allocating continuously)

**Implementation**: Added `mf2_maybe_drain_pending()` at Line 1021 (before `mf2_alloc_new_page()`)

**Test Results**:
```
Pending enqueued: 69,702 ✅
Pending drained:  0      ❌
OPP_DRAIN_TRY:    10 attempts (all from tp=0x559146bb17a0)
Throughput:       27,965 ops/s
```

**Conclusion**: Still 0 drains - the same-thread issue persists

---

## Root Cause Analysis (by Task Agent)

### Larson Benchmark Characteristics

```cpp
// larson.cpp: exercise_heap()
for (cblks = 0; cblks < pdea->NumBlocks; cblks++) {
    victim = lran2(&pdea->rgen) % pdea->asize;            // Own array range
    CUSTOM_FREE(pdea->array[victim]);                     // Free own allocation
    pdea->array[victim] = (char*)CUSTOM_MALLOC(blk_size); // Same slot
}

// Array partitioning (Line 481):
de_area[i].array = &blkp[i*nperthread];  // Each thread owns a separate range
```

**Key Finding**: Each thread allocates/frees from its own array range
- Thread 0: `array[0..999]`
- Thread 1: `array[1000..1999]`
- Thread 2: `array[2000..2999]`
- Thread 3: `array[3000..3999]`

**Result**: **Cross-thread frees are almost ZERO**

### MF2 Design vs Larson Mismatch

**MF2 Assumption**:
```
4 threads freeing → all threads call mf2_free() → all threads drain pending
```

**Larson Reality**:
```
1 thread does most freeing → only 1 thread drains pending
Other threads allocate-only → never drain their own pending queues
```

**Problem**:
```c
void mf2_maybe_drain_pending(void) {
    MF2_ThreadPages* tp = mf2_thread_pages_get();           // ← Own TLS only!
    MidPage* pending = mf2_dequeue_pending(tp, class_idx);  // ← Own pending only!
}
```
- Thread A drains → checks Thread A's TLS → Thread A's pending queue is empty
- Thread B/C/D's pending queues (with 69K pages!) are **never checked**
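For context, a minimal sketch of the Box 2 queue pair referenced throughout this analysis: an MPSC Treiber-style stack per (owner, class), plus the Box 3 idea that only the 0→1 transition enqueues a page. Field and type names loosely follow the Box 1 description; the actual hakmem_pool.c definitions may differ.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_CLASSES = 8 };  // assumed class count for the sketch

// Illustrative subset of the Box 1 fields.
typedef struct MidPageSk {
    _Atomic(struct MidPageSk*) next_pending;       // link in the pending stack
    atomic_bool                in_remote_pending;  // guards double-enqueue
} MidPageSk;

typedef struct {
    _Atomic(MidPageSk*) pending_head[NUM_CLASSES];  // one MPSC stack per class
} ThreadPagesSk;

// Producers (any remote freer): push the page at most once. The
// in_remote_pending flag implements the 0→1 edge - only the first
// remote free after a drain enqueues the page.
static void enqueue_pending(ThreadPagesSk* owner, int cls, MidPageSk* page) {
    bool expected = false;
    if (!atomic_compare_exchange_strong(&page->in_remote_pending, &expected, true))
        return;  // already queued by another remote freer
    MidPageSk* head = atomic_load_explicit(&owner->pending_head[cls],
                                           memory_order_relaxed);
    do {
        atomic_store_explicit(&page->next_pending, head, memory_order_relaxed);
    } while (!atomic_compare_exchange_weak_explicit(
                 &owner->pending_head[cls], &head, page,
                 memory_order_release, memory_order_relaxed));
}

// Single consumer (the owner): pop one page. Skipping ABA handling is
// safe only because there is exactly one concurrent popper per queue.
static MidPageSk* dequeue_pending(ThreadPagesSk* owner, int cls) {
    MidPageSk* head = atomic_load_explicit(&owner->pending_head[cls],
                                           memory_order_acquire);
    while (head) {
        MidPageSk* next = atomic_load_explicit(&head->next_pending,
                                               memory_order_relaxed);
        if (atomic_compare_exchange_weak_explicit(
                &owner->pending_head[cls], &head, next,
                memory_order_acquire, memory_order_acquire))
            break;  // popped `head`; CAS failure reloads head and retries
    }
    if (head) atomic_store(&head->in_remote_pending, false);  // re-arm 0→1 edge
    return head;
}
```

The single-consumer assumption is exactly what the drain strategies below end up bending: once any thread may pop any owner's queue, the pop side needs more care than this sketch shows.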
### Pending Enqueue Sources

The **76,733 enqueues** come from:
- Phase 1 allocation interruptions (rare cross-thread frees)
- NOT from Phase 2 continuous freeing (same-thread pattern)

---

## Solution Strategy: Global Round-Robin

### Design Philosophy: "Where to Separate, Where to Integrate"

**Separation points** (working well) ✅:
- Allocation: thread-local, no lock
- Owner free: thread-local, no lock
- Cross-thread free: lock-free MPSC stack

**Integration point** (broken) ❌:
- Pending queue drain: currently thread-local only

### Strategy A: Global Round-Robin (Phase 1) 🎯

**Core Idea**: All threads can drain ANY thread's pending queue

```c
// Global registry
static MF2_ThreadPages* g_all_thread_pages[MAX_THREADS];
static _Atomic int g_num_thread_pages = 0;

// Round-robin drain
static void mf2_maybe_drain_pending(void) {
    static _Atomic uint64_t counter = 0;
    uint64_t count = atomic_fetch_add(&counter, 1);

    // Round-robin across ALL threads (not just self!)
    int tp_idx = (count / 4) % g_num_thread_pages;
    MF2_ThreadPages* tp = g_all_thread_pages[tp_idx];
    if (tp) {
        int class_idx = (count / 4 / g_num_thread_pages) % POOL_NUM_CLASSES;
        MidPage* pending = mf2_dequeue_pending(tp, class_idx);
        if (pending) drain_remote_frees(pending);
    }
}
```

**Benefits**:
- Larson works: any thread can drain any thread's pending queue
- Fair: all TLSs get equal drain opportunities
- Simple: just a global array + round-robin

**Implementation Steps**:
1. Add global array `g_all_thread_pages[]`
2. Register the TLS in `mf2_thread_pages_get()` (see the sketch after this section)
3. Add a destructor with `pthread_key_create()`
4. Modify `mf2_maybe_drain_pending()` to round-robin

**Expected Impact**:
```
Pending enqueued: 69K
Pending drained:  69K ✅ (100% instead of 0%)
Page reuse rate:  3% → 90%+ ✅
Throughput:       28K → 3-10M ops/s ✅ (100-350x improvement!)
```
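Steps 2-3 could look like the following. This is a sketch only: `ThreadPagesSk` stands in for the real `MF2_ThreadPages`, and the registry is declared atomic here so slots can be cleared safely at thread exit.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_THREADS 256  // matches the g_all_thread_pages[256] plan

// Illustrative stand-in; the real MF2_ThreadPages lives in hakmem_pool.c.
typedef struct { void* pending_head[8]; } ThreadPagesSk;

static _Atomic(ThreadPagesSk*) g_all_thread_pages[MAX_THREADS];
static _Atomic int g_num_thread_pages = 0;

static pthread_key_t  g_tls_key;
static pthread_once_t g_tls_once = PTHREAD_ONCE_INIT;
static __thread ThreadPagesSk* t_pages = NULL;

// Destructor (step 3): runs at thread exit and clears the registry slot
// so round-robin drainers stop visiting a dead thread's TLS.
static void tls_destroy(void* arg) {
    ThreadPagesSk* tp = (ThreadPagesSk*)arg;
    int n = atomic_load(&g_num_thread_pages);
    for (int i = 0; i < n && i < MAX_THREADS; i++) {
        ThreadPagesSk* expect = tp;
        if (atomic_compare_exchange_strong(&g_all_thread_pages[i], &expect, NULL))
            break;
    }
    // A real implementation also has to hand off any still-pending pages
    // before tp can be freed; the sketch deliberately leaks them.
}

static void tls_key_init(void) { pthread_key_create(&g_tls_key, tls_destroy); }

// Step 2: the first call per thread allocates, registers, and arms the destructor.
ThreadPagesSk* thread_pages_get(void) {
    if (t_pages) return t_pages;  // fast path: already registered
    pthread_once(&g_tls_once, tls_key_init);

    t_pages = calloc(1, sizeof(*t_pages));
    pthread_setspecific(g_tls_key, t_pages);  // arm the destructor

    int idx = atomic_fetch_add(&g_num_thread_pages, 1);
    if (idx < MAX_THREADS)
        atomic_store(&g_all_thread_pages[idx], t_pages);  // publish for drainers
    return t_pages;
}
```

Note that cleared slots stay NULL, which is why the round-robin loop above guards with `if (tp)`.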
---

### Strategy B: Hybrid (Phase 2) ⚡

**Optimization**: Prefer the own TLS (cache efficiency) but periodically check others

```c
if ((count & 3) == 0) {
    // 1/4: Other threads
    tp = g_all_thread_pages[round_robin_idx];
} else {
    // 3/4: Own TLS (cache hot)
    tp = mf2_thread_pages_get();
}
```

**Benefits**:
- Cache efficiency: 75% of drains hit the own TLS (L1 cache)
- Fairness: 25% of drains check others (ensures progress)

**Metrics**:
- Own TLS: L1 cache hit (1-2 cycles)
- Other TLS: L3 cache hit (10-20 cycles)
- Average cost: **3-5 cycles** (negligible)

---

### Strategy C: Background Sweeper (Phase 3) 🔄

**Safety Net**: Handle edge cases where all threads stop allocating/freeing

```c
void* mf2_drain_thread(void* arg) {
    while (running) {
        usleep(1000);  // 1ms interval (not 100μs - too aggressive)

        // Scan all TLSs for leftover pending pages
        for (int i = 0; i < g_num_thread_pages; i++) {
            for (int c = 0; c < POOL_NUM_CLASSES; c++) {
                MidPage* pending = mf2_dequeue_pending(g_all_thread_pages[i], c);
                if (pending) drain_remote_frees(pending);
            }
        }
    }
}
```

**Role**: Insurance policy, not the main drain mechanism
- Strategy A handles 95% of drains (hot path)
- Strategy C handles the 5% leftover (rare cases)

**Latency Impact**: **NONE on the hot path** (async background)

---

## 3-Layer Latency Hiding Design

| Layer | Strategy | Frequency | Latency | Coverage | Role |
|-------|----------|-----------|---------|----------|------|
| **L1: Hot Path** | A (Global RR) | Every 4th op | <1μs | 95% | Main drain |
| **L2: Optimization** | B (Hybrid) | 3/4 own, 1/4 other | <1μs | 100% | Cache efficiency |
| **L3: Safety Net** | C (BG sweeper) | 1ms interval | 1ms | 100% | Edge cases |

**Latency Guarantee**: The front end (alloc/free) always returns in **<1μs**, regardless of background drain state

---

## Implementation Plan

### Phase 1: Global Round-Robin (Today) 🎯

**Target**: Make Larson work

**Tasks**:
1. Add global array `g_all_thread_pages[256]`
2. Add atomic counter `g_num_thread_pages`
3. Add registration in `mf2_thread_pages_get()`
4. Add a pthread_key destructor for cleanup
5. Modify `mf2_maybe_drain_pending()` for round-robin

**Expected Time**: 1-2 hours

**Success Criteria**:
- Pending drained > 0 (ideally ~69K)
- Throughput > 1M ops/s (35x improvement from 28K)

---

### Phase 2: Hybrid Optimization (Tomorrow)

**Target**: Improve cache efficiency

**Tasks**:
1. Modify `mf2_maybe_drain_pending()` to prefer the own TLS (3/4 ratio)
2. Benchmark cache hit rates

**Expected Time**: 30 minutes

**Success Criteria**:
- L1 cache hit rate > 75%
- Throughput gain: +5-10%

---

### Phase 3: Background Sweeper (Optional)

**Target**: Handle edge cases

**Tasks** (see the lifecycle sketch at the end of this subsection):
1. Create a background thread with `pthread_create()`
2. Scan all TLSs every 1ms
3. CPU throttling (< 1% usage)

**Expected Time**: 30 minutes

**Success Criteria**:
- No pending leftovers after 10s idle
- CPU overhead < 1%
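The sweeper wiring for Phase 3 could be as small as the following sketch. It reuses `mf2_drain_thread()` from Strategy C above; the `g_sweeper_running` flag (read as `running` inside the loop) and the start/stop helper names are assumptions for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>

void* mf2_drain_thread(void* arg);             // the Strategy C loop above

static atomic_bool g_sweeper_running = false;  // the loop's `running` flag
static pthread_t   g_sweeper;

// Start the 1ms background sweeper (e.g., from pool init).
static int mf2_sweeper_start(void) {
    atomic_store(&g_sweeper_running, true);
    return pthread_create(&g_sweeper, NULL, mf2_drain_thread, NULL);
}

// Stop it at shutdown. The loop notices the flag on its next 1ms wakeup,
// so the join costs at most one interval and never blocks the hot path.
static void mf2_sweeper_stop(void) {
    atomic_store(&g_sweeper_running, false);
    pthread_join(g_sweeper, NULL);
}
```

Each wakeup does at most `g_num_thread_pages × POOL_NUM_CLASSES` dequeue attempts, which keeps the CPU budget well under the 1% target.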
---

## Current Status

**Working**:
- ✅ Per-page sharding (data structures, allocation, free paths)
- ✅ 64KB alignment (Fix #4)
- ✅ Memory ordering (Fix #6)
- ✅ Pending queue infrastructure (enqueue works perfectly)
- ✅ 0→1 edge detection

**Broken**:
- ❌ Pending queue drain (0 drains due to TLS isolation)
- ❌ Page reuse (3% instead of 90%)
- ❌ Performance (28K ops/s instead of 3-10M)

**Next**:
- 🎯 Implement Phase 1: Global Round-Robin
- 🎯 Expected breakthrough: 28K → 3-10M ops/s

---

## Files Modified

### Core Implementation
- `hakmem_pool.c` (Lines 275-1200): MF2 implementation
  - Data structures (MidPage, MF2_ThreadPages, PageRegistry)
  - Allocation paths (fast/slow)
  - Free paths (fast/slow)
  - Pending queue operations
  - Opportunistic drain (currently broken)

### Documentation
- `docs/specs/ENV_VARS.md`: Added `HAKMEM_MF2_ENABLE`
- `docs/status/PHASE_7.2_MF2_PLAN_2025_10_24.md`: Original plan
- `docs/status/PHASE_7.2_MF2_PROGRESS_2025_10_24.md`: This file

### Debug Reports
- `ALIGNMENT_FIX_VERIFICATION.md`: Fix #4 verification by Task agent

---

## Lessons Learned

1. **Alignment is Critical**: 97% of frees failed due to the 4KB vs 64KB alignment mismatch
2. **Memory Ordering Matters**: But it doesn't solve architectural issues
3. **Workload Characteristics**: Larson's same-thread pattern exposed the TLS isolation bug
4. **Integration vs Separation**: Integration points must be chosen carefully
5. **Task Agent is MVP**: Detailed analysis saved days of debugging

---

## Phase 1: Global Round-Robin Implementation ✅

**Commit**: (multiple commits implementing round-robin drain)

**Implementation**:
1. Added `g_all_thread_pages[256]` global array
2. Added `g_num_thread_pages` atomic counter
3. Implemented TLS registration in `mf2_thread_pages_get()`
4. Implemented `mf2_maybe_drain_pending()` with round-robin logic
5. Called from both `mf2_free_fast()` and `mf2_alloc_slow()`

**Test Results** (larson 10 2-32K 10s 4T):
```
Pending enqueued: 96,429 ✅
Pending drained:  70,607 ✅ (73% - huge improvement from 0%!)
Page reuse count: 5,222
Throughput:       ~28,705 ops/s
```

**Analysis**:
- ✅ Round-robin drain WORKS! (0 drains → 70K drains)
- ⚠️ But page reuse is only 2.3% (5,222 / 226,447 pages allocated)
- Problem: drained pages are returned to full_pages, but the owner never scans them

---

## Strategy C: Direct Handoff Implementation ✅

**Concept**: Don't return drained pages to full_pages - make them **active immediately**

**Implementation** (clean modular code; see the sketch after this section):
```c
// Helper: Make page active (move old active to full_pages)
static inline void mf2_make_page_active(MF2_ThreadPages* tp, int class_idx, MidPage* page);

// Helper: Drain page and activate if successful (Direct Handoff)
static inline bool mf2_try_drain_and_activate(MF2_ThreadPages* tp, int class_idx, MidPage* page);
```

**Changes**:
1. Modified `mf2_maybe_drain_pending()` to use `mf2_try_drain_and_activate()`
2. Modified the `alloc_slow` pending drain loop to use Direct Handoff
3. Reduced the opportunistic drain from 60+ lines to 20 lines

**Test Results** (larson 10 2-32K 10s 4T):
```
Pending enqueued: 96,429
Pending drained:  70,607
Page reuse count: 80,017 ✅ (15x improvement!)
Throughput:       ~28,705 ops/s
```

**Success**: Page reuse 35% (80,017 / 226,447)
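A sketch of what the two helper bodies might look like, assuming per-class `active_page`/`full_pages` lists and a drain routine that folds the remote-free stack into the page-local freelist. All types and the drain function are illustrative stand-ins for the real hakmem_pool.c definitions.

```c
#include <stdbool.h>
#include <stddef.h>

enum { NUM_CLASSES = 8 };  // assumed class count for the sketch

// Illustrative subset of the real structures.
typedef struct MidPageSk {
    struct MidPageSk* next;      // intrusive link for full_pages
    void*             freelist;  // page-local free-block list
} MidPageSk;

typedef struct {
    MidPageSk* active_page[NUM_CLASSES];
    MidPageSk* full_pages[NUM_CLASSES];
} ThreadPagesSk;

void drain_remote_frees_sk(MidPageSk* page);  // remote stack → freelist (elsewhere)

// Helper 1: make `page` the active page for its class; the previous
// active page is parked on full_pages.
static inline void make_page_active(ThreadPagesSk* tp, int cls, MidPageSk* page) {
    MidPageSk* old = tp->active_page[cls];
    if (old) {
        old->next = tp->full_pages[cls];  // push old active onto the full list
        tp->full_pages[cls] = old;
    }
    tp->active_page[cls] = page;
}

// Helper 2 (Direct Handoff): drain the page's remote frees and, if that
// produced allocatable blocks, activate it immediately instead of parking it.
static inline bool try_drain_and_activate(ThreadPagesSk* tp, int cls, MidPageSk* page) {
    drain_remote_frees_sk(page);  // fold remote-free stack into freelist
    if (!page->freelist)
        return false;             // nothing reclaimed; caller keeps looking
    make_page_active(tp, cls, page);
    return true;
}
```

The design choice is the early return: a drained-but-still-empty page never displaces a usable active page, so the handoff only ever improves the allocator's position.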
---

## Full Pages Scan Removal ✅

**Evidence**: The full_pages scan checked 1.88M pages but found **0 pages** (0% success rate)

**Reason**: Direct Handoff immediately activates drained pages, so full_pages never contains reusable pages

**Action**: Removed the full_pages scan (76 lines deleted)

**Test Results**:
```
Page reuses: 69,098 (31%)
Throughput:  27,206 ops/s
```

**Conclusion**: A slight decrease, but acceptable (simplification benefit)

---

## Frequency Tuning Attempts ⚙️

Tested multiple opportunistic drain frequencies:

| Frequency | Page Reuses | Reuse % | Throughput |
|-----------|-------------|---------|------------|
| 1/2 (50%) | 70,607 | 31% | 27,206 ops/s |
| 1/4 (25%) | 45,369 | 20% | 27,423 ops/s |
| 1/8 (12.5%) | 24,901 | 11% | 27,642 ops/s |

**Finding**: Higher frequency = better reuse, but still far from the 90% target

---

## Hybrid Strategy Attempt (Strategy B) ❌

**Concept**: 75% own TLS (cache efficiency) + 25% round-robin (fairness)

**Implementation**:
```c
if ((count & 3) == 0) {
    // 1/4: Other threads
    tp = g_all_thread_pages[round_robin_idx];
} else {
    // 3/4: Own TLS
    tp = mf2_thread_pages_get();
}
```

**Test Results** (50% overall frequency):
```
Page reuses: 12,676 (5.5%) ❌
Problem: Effective frequency too low (37.5% own + 12.5% others)
```

**Conclusion**: Reverted to pure round-robin at 50% frequency (31% reuse)

---

## ChatGPT Pro Consultation 🧠

**Date**: 2025-10-24

### Question Posed

A complete technical question covering:
- MF2 architecture (Pending Queue, Direct Handoff, Opportunistic Drain)
- Problem: 31% reuse vs the 90% target
- Constraints: O(1), lock-free, per-page freelist
- What was tried: frequencies (1/8, 1/4, 1/2), Hybrid (75/25)

### Diagnosis

**Root Problem**: "Round-robin drain → owner handoff" doesn't work when the owner stops allocating

**Larson Benchmark Pattern**:
- **Phase 1** (0-1s): All threads allocate → pages populate
- **Phase 2** (1-10s): All threads free+realloc from their own ranges
  - Thread A frees Thread A's objects → no cross-thread frees
  - Thread B frees Thread B's objects → no cross-thread frees
- **But**: Some cross-thread frees do occur (~10%)

**The Architectural Mismatch**:
```
Current (Round-Robin Drain):
1. Thread A frees → Thread B's page goes to the pending queue
2. Thread C (round-robin) drains Thread B's pending → activates the page on Thread B
3. Thread B is NOT allocating (Larson Phase 2) → the page sits unused
4. Thread A needs memory → allocates a NEW page (doesn't know about Thread B's ready page)
```

**Result**: Pages are drained but never used = 31% reuse instead of 90%

### Recommended Solution: Consumer-Driven Adoption

**Core Principle**: "Don't push pages to idle threads; let active threads **pull** and **adopt** them"

**Key Changes**:
1. **Remove round-robin drain entirely** (no more `mf2_maybe_drain_pending()`)
2. **Add ownership transfer**: CAS to change `page->owner_tid`
3. **Adoption on-demand**: The allocating thread adopts pages from ANY thread's pending queue
4. **Lease mechanism**: Prevent thrashing (no re-transfer within 10ms)
**Algorithm** (sketch in C11 atomics; `num_threads`, `other_thread[]`, `my_tid`, `rdtsc()`, and `LEASE_CYCLES` as named in the consultation):

```c
// In alloc_slow, BEFORE allocating a new page:
bool mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx) {
    // Scan all threads' pending queues (round-robin for fairness)
    for (int i = 0; i < num_threads; i++) {
        MidPage* page = mf2_dequeue_pending(other_thread[i], class_idx);
        if (!page) continue;

        // Lease check: a recently transferred page must not move again yet
        uint64_t now = rdtsc();
        if (now - page->last_transfer_time < LEASE_CYCLES) {
            mf2_enqueue_pending(other_thread[i], class_idx, page);  // put it back
            continue;
        }

        // Try to transfer ownership (CAS on the atomic owner_tid)
        uint64_t old_owner = atomic_load(&page->owner_tid);
        if (!atomic_compare_exchange_strong(&page->owner_tid, &old_owner, my_tid)) {
            mf2_enqueue_pending(other_thread[i], class_idx, page);  // lost the race
            continue;
        }

        // Success! Ownership transferred
        page->owner_tp = me;
        page->last_transfer_time = now;

        // Drain and activate (Direct Handoff)
        mf2_drain_remote_frees(page);
        if (page->freelist) {
            mf2_make_page_active(me, class_idx, page);
            return true;  // SUCCESS!
        }
    }
    return false;  // No adoptable pages
}
```

**Expected Effects**:
- ✅ No wasted effort (only allocating threads drain)
- ✅ Page reuse >90% (the allocating thread gets any available page)
- ✅ Throughput 3-10M ops/s (100-350x improvement)
- ✅ Hot path unchanged (fast alloc/free still O(1), lock-free)

---

## Implementation Plan: Consumer-Driven Adoption

### Phase 1: Code Cleanup & Preparation ✅

**Tasks**:
1. ✅ Remove `mf2_maybe_drain_pending()` (opportunistic drain)
2. ✅ Remove all calls to `mf2_maybe_drain_pending()`
3. ✅ Keep helper functions (`mf2_make_page_active`, `mf2_try_drain_and_activate`)

### Phase 2: Data Structure Updates

**Tasks**:
1. Add `uint64_t last_transfer_time` to the `MidPage` struct
2. Ensure `owner_tid` and `owner_tp` are already present (✅ verified)

### Phase 3: Adoption Function

**Tasks**:
1. Implement `mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx)`
   - Scan all threads' pending queues (round-robin)
   - Check the lease (rdtsc() - last_transfer_time >= LEASE_CYCLES)
   - CAS ownership transfer
   - Drain and activate if successful
2. Tune `LEASE_CYCLES` (start with 10ms = ~30M cycles on a 3GHz CPU)

### Phase 4: Integration

**Tasks**:
1. Call `mf2_try_adopt_pending()` in `alloc_slow` BEFORE allocating a new page
2. If adoption succeeds, retry the fast path
3. If adoption fails, allocate a new page (existing logic)

### Phase 5: Benchmark & Validate

**Tasks**:
1. Run the larson 4T benchmark
2. Verify page reuse >90%
3. Verify throughput >1M ops/s (target: 3-10M)
4. Run the full benchmark suite

---

## Current Status (Updated)

**Working**:
- ✅ Per-page sharding (data structures, allocation, free paths)
- ✅ 64KB alignment
- ✅ Memory ordering
- ✅ Pending queue infrastructure (enqueue/dequeue)
- ✅ Direct Handoff (immediate page activation)
- ✅ Helper functions (modular, inline-optimized)
- ✅ Round-robin drain (proof of concept - to be replaced)

**Needs Improvement**:
- ⚠️ Page reuse: 31% (target: >90%)
- ⚠️ Throughput: 27K ops/s (target: 3-10M)

**Root Cause Identified**:
- ❌ "Push to idle owner" doesn't work (Larson Phase 2 pattern)
- ✅ Solution: "Pull by active allocator" (Consumer-Driven Adoption)

**Next Steps**:
1. 🎯 Remove `mf2_maybe_drain_pending()` (cleanup)
2. 🎯 Add the `last_transfer_time` field
3. 🎯 Implement `mf2_try_adopt_pending()`
4. 🎯 Integrate adoption into `alloc_slow`
5. 🎯 Benchmark and validate

---

## Lessons Learned (Updated)

1. **Alignment is Critical**: 97% of frees failed due to the 4KB vs 64KB alignment mismatch
2. **Memory Ordering Matters**: But it doesn't solve architectural issues
3. **Workload Characteristics**: Larson's same-thread pattern exposed the TLS isolation bug
4. **Integration vs Separation**: Integration points must be chosen carefully
5. **Direct Handoff is Essential**: Returning drained pages to intermediate lists wastes reuse opportunities
6. **Push vs Pull**: "Push to idle owner" doesn't work; "Pull by active allocator" is the correct design
7. **ChatGPT Pro Consultation**: A fresh perspective identified the fundamental architectural mismatch

---

**Status**: Ready for Consumer-Driven Adoption implementation
**Confidence**: Very High (ChatGPT Pro validated the approach, clear design)
**Expected Outcome**: >90% page reuse, 3-10M ops/s (100-350x improvement)