# Phase 7.2 MF2: Per-Page Sharding Implementation Plan

**Date**: 2025-10-24
**Goal**: Eliminate shared freelists by implementing per-page sharding (mimalloc approach)
**Expected**: +50% improvement (13.78 M/s → 20.7 M/s)
**Effort**: 20-30 hours
**Risk**: Medium (major architectural change)

---

## Executive Summary

**Problem**: The current Mid Pool uses shared freelists (7 classes × 8 shards = 56), causing lock contention regardless of the locking mechanism (mutex or lock-free).

**Solution**: **Per-Page Sharding** - each 64KB page has its own independent freelist. No sharing = no contention.

**Key Insight**: The bottleneck is **SHARING**, not the locking mechanism. Eliminate sharing, eliminate contention.

---

## Current Architecture Problems

### Shared Freelist Design

```
Current (P6.24):
┌─────────────────────────────────────────┐
│ Global Freelists (56 total)             │
│ ├─ Class 0 (2KB): 8 shards × mutex      │
│ ├─ Class 1 (4KB): 8 shards × mutex      │
│ ├─ Class 2 (8KB): 8 shards × mutex      │
│ └─ ...                                  │
└─────────────────────────────────────────┘
        ↑   ↑   ↑   ↑
        │   │   │   │    4 threads competing
        └───┴───┴───┘    → Lock contention
```

**Problems**:
1. **Lock Contention**: 4 threads → 1 freelist → serialized access
2. **Cache Line Bouncing**: Mutex or atomic operations bounce cache lines between cores
3. **No Scalability**: More threads = worse contention

### Lock-Free Failed (P7.1)

We tried lock-free CAS operations:
- **Result**: -6.6% regression
- **Reason**: CAS contention + retry overhead > mutex contention
- **Lesson**: A fundamental sharing problem cannot be fixed by going lock-free

---

## MF2 Approach: Per-Page Sharding

### Core Concept

**mimalloc's Secret Sauce**: O(1) page lookup from the block address

```c
// Magic: Block address → Page (bitwise AND with the 64KB page mask)
PageDesc* page = addr_to_page(ptr);  // (ptr & ~0xFFFFULL)
```

This enables:
1. Each page has an independent freelist (no sharing!)
2. O(1) page lookup (no hash table!)
3.
Owner-based optimization (fast path for the owner thread)

### New Architecture

```
MF2 Per-Page Design:
┌─────────────────────────────────────────┐
│ Thread 1 Pages                          │
│ ├─ Page A (2KB): freelist [no lock]     │
│ ├─ Page B (4KB): freelist [no lock]     │
│ └─ ...                                  │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Thread 2 Pages                          │
│ ├─ Page C (2KB): freelist [no lock]     │
│ ├─ Page D (8KB): freelist [no lock]     │
│ └─ ...                                  │
└─────────────────────────────────────────┘

Each thread accesses its own pages → ZERO contention!
```

---

## Data Structures

### Page Descriptor (New)

```c
// Per-page metadata (aligned to 64KB boundary)
typedef struct MidPage {
    // Page identity
    void*    base;                 // Page base address (64KB aligned)
    uint8_t  class_idx;            // Size class (0-6)
    uint8_t  _pad[3];

    // Ownership
    uint64_t owner_tid;            // Owner thread ID

    // Freelist (page-local, no sharing!)
    PoolBlock* freelist;           // Local freelist (owner-only, no lock!)
    uint16_t   free_count;         // Number of free blocks
    uint16_t   capacity;           // Total blocks per page

    // Remote frees (cross-thread, lock-free stack)
    atomic_uintptr_t remote_head;  // Lock-free MPSC stack
    atomic_uint      remote_count; // Count for quick check

    // Lifecycle
    atomic_int in_use;             // Live allocations (for empty detection)
    atomic_int pending_dn;         // DONTNEED queued flag

    // Linkage
    struct MidPage* next_page;     // Next page in the thread's page list
} MidPage;
```

### Thread-Local Page Lists

```c
// Per-thread page lists (one per class)
typedef struct ThreadPages {
    MidPage* active_page[POOL_NUM_CLASSES]; // Current page with free blocks
    MidPage* full_pages[POOL_NUM_CLASSES];  // Full pages (no free blocks)
    int      page_count[POOL_NUM_CLASSES];  // Total pages owned
} ThreadPages;

static __thread ThreadPages* t_pages = NULL;
```

### Global Page Registry

```c
// Global registry for cross-thread free (O(1) lookup)
#define PAGE_REGISTRY_BITS 16  // 64K entries (covers 4GB with 64KB pages)
#define PAGE_REGISTRY_SIZE (1 << PAGE_REGISTRY_BITS)

typedef struct {
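    // NOTE (editorial assumption): with 64KB pages and a 16-bit index, two
    // pages whose addresses differ by a multiple of 4GB alias to the same
    // slot; Risk 1 below proposes chaining if the collision rate matters.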
    MidPage* pages[PAGE_REGISTRY_SIZE]; // Indexed by (addr >> 16) & 0xFFFF
    pthread_mutex_t locks[256];         // Coarse-grained locks for rare updates
} PageRegistry;

static PageRegistry g_page_registry;
```

---

## Core Algorithms

### Allocation Fast Path (Owner Thread)

```c
void* mid_alloc_fast(size_t size) {
    int class_idx = size_to_class(size);
    MidPage* page = t_pages->active_page[class_idx];

    // Fast path: Pop from the page-local freelist (NO LOCK!)
    if (page && page->freelist) {
        PoolBlock* block = page->freelist;
        page->freelist = block->next;
        page->free_count--;
        atomic_fetch_add_explicit(&page->in_use, 1, memory_order_relaxed);
        return (char*)block + HEADER_SIZE;
    }

    // Slow path: Drain remote frees or allocate a new page
    return mid_alloc_slow(class_idx);
}
```

**Key Point**: NO mutex, NO CAS in the fast path (owner-only access)!

### Allocation Slow Path

```c
void* mid_alloc_slow(int class_idx) {
    MidPage* page = t_pages->active_page[class_idx];

    // Try to drain remote frees first (lock-free detach)
    if (page && atomic_load_explicit(&page->remote_count,
                                     memory_order_relaxed) > 0) {
        drain_remote_frees(page);
    }

    // Allocate a new page if the active one is still empty
    if (!page || !page->freelist) {
        page = alloc_new_page(class_idx);
        if (!page) return NULL; // OOM

        // Register page in the global registry, set as active page
        register_page(page);
        t_pages->active_page[class_idx] = page;
    }

    // Pop one block (same steps as the fast path; we cannot simply call
    // mid_alloc_fast() here because it takes a size, not a class index)
    PoolBlock* block = page->freelist;
    page->freelist = block->next;
    page->free_count--;
    atomic_fetch_add_explicit(&page->in_use, 1, memory_order_relaxed);
    return (char*)block + HEADER_SIZE;
}
```

### Free Fast Path (Owner Thread)

```c
void mid_free_fast(void* ptr) {
    // O(1) page lookup (bitwise AND)
    MidPage* page = addr_to_page(ptr);  // (ptr & ~0xFFFFULL)

    // Check if we're the owner (fast path)
    if (page->owner_tid == my_tid()) {
        // Fast: Push to the page-local freelist (NO LOCK!)
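        // Owner-only by construction: no other thread ever touches
        // page->freelist, so a plain non-atomic push is safe here
        // (assumes owner_tid is fixed for the page's lifetime).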
        PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
        block->next = page->freelist;
        page->freelist = block;
        page->free_count++;

        // Decrement in-use; enqueue DONTNEED when the page becomes empty
        int nv = atomic_fetch_sub_explicit(&page->in_use, 1,
                                           memory_order_release) - 1;
        if (nv == 0) {
            enqueue_dontneed(page);
        }
        return;
    }

    // Slow path: Cross-thread free
    mid_free_slow(page, ptr);
}
```

### Free Slow Path (Cross-Thread)

```c
void mid_free_slow(MidPage* page, void* ptr) {
    // Push onto the page's remote stack (lock-free MPSC)
    PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
    uintptr_t old_head = atomic_load_explicit(&page->remote_head,
                                              memory_order_relaxed);
    do {
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &page->remote_head, &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_relaxed));
    atomic_fetch_add_explicit(&page->remote_count, 1, memory_order_relaxed);

    // Decrement in-use
    int nv = atomic_fetch_sub_explicit(&page->in_use, 1,
                                       memory_order_release) - 1;
    if (nv == 0) {
        enqueue_dontneed(page);
    }
}
```

### Page Lookup (O(1))

```c
// Ultra-fast page lookup using address arithmetic
static inline MidPage* addr_to_page(void* addr) {
    // Assume 64KB pages, aligned to a 64KB boundary
    void* page_base = (void*)((uintptr_t)addr & ~0xFFFFULL);

    // Index into the registry
    size_t idx = ((uintptr_t)page_base >> 16) & (PAGE_REGISTRY_SIZE - 1);

    // Direct lookup (no hash collision handling needed if the registry is large enough)
    return g_page_registry.pages[idx];
}
```

---

## Implementation Phases

### Phase 1: Data Structures (4-6h)

**Tasks**:
1. Define `MidPage` struct
2. Define `ThreadPages` struct
3. Define `PageRegistry` struct
4. Initialize global page registry
5. Add TLS for thread pages

**Validation**: Compiles; structures allocated correctly

### Phase 2: Page Allocation (3-4h)

**Tasks**:
1. Implement `alloc_new_page(class_idx)`
2. Implement `register_page(page)`
3. Implement `addr_to_page(ptr)` lookup
4.
Initialize page freelist (build block chain)

**Validation**: Can allocate pages; lookup works

### Phase 3: Allocation Path (4-6h)

**Tasks**:
1. Implement `mid_alloc_fast()` (owner-only)
2. Implement `mid_alloc_slow()` (drain + new page)
3. Implement `drain_remote_frees(page)`
4. Update `hak_pool_try_alloc()` entry point

**Validation**: Single-threaded allocation works

### Phase 4: Free Path (3-4h)

**Tasks**:
1. Implement `mid_free_fast()` (owner-only)
2. Implement `mid_free_slow()` (cross-thread)
3. Update `hak_pool_free()` entry point
4. Implement empty-page DONTNEED

**Validation**: Single-threaded free works

### Phase 5: Multi-Thread Testing (3-4h)

**Tasks**:
1. Test with 2 threads (cross-thread frees)
2. Test with 4 threads (full contention)
3. Fix any races or deadlocks
4. ThreadSanitizer validation

**Validation**: larson benchmark runs without crashes

### Phase 6: Optimization & Tuning (3-5h)

**Tasks**:
1. Optimize page allocation (batch allocate?)
2. Optimize remote drain (batch drain?)
3. Tune page registry size
4. Profile and fix hotspots

**Validation**: Performance meets expectations

---

## Expected Performance

### Baseline (P6.24)

```
Mid 1T:  4.03 M/s
Mid 4T: 13.78 M/s (3.42x scaling)
```

### Target (MF2)

```
Mid 1T:  5.0-6.0 M/s  (+24-49%)  [No lock overhead]
Mid 4T: 20.0-22.0 M/s (+45-60%)  [Zero contention]
```

**vs mimalloc (29.50 M/s)**:
- Expected: 68-75% of mimalloc
- **Success criterion**: >60% (17.70 M/s)

### Why This Will Work

1. **Contention Elimination**: Each thread accesses its own pages
   - Current: 4 threads → 1 freelist → 75% waiting
   - MF2: 4 threads → 4 pages → 0% waiting
2. **Cache Locality**: Page-local data stays in cache
   - Current: Freelist bounces between threads
   - MF2: Page metadata stays in the owner's cache
3.
**Fast Path Optimization**: No synchronization for the owner
   - Current: mutex lock/unlock on every allocation
   - MF2: simple pointer manipulation

---

## Risks & Mitigation

### Risk 1: Page Registry Conflicts

**Risk**: Hash collisions in the page registry

**Mitigation**:
- Use a large registry (64K entries = low collision rate)
- Handle collisions with chaining if needed
- Monitor collision rate; resize if >5%

### Risk 2: Memory Fragmentation

**Risk**: Each thread allocates pages independently → fragmentation

**Mitigation**:
- Page reuse when a thread exits (return pages to a global pool)
- Periodic empty-page cleanup
- Monitor RSS; acceptable if <20% overhead

### Risk 3: Cross-Thread Free Overhead

**Risk**: Remote stack drain might be slow

**Mitigation**:
- Batch drain (drain multiple blocks at once)
- Adaptive drain frequency
- Keep the remote stack lock-free

### Risk 4: Implementation Bugs

**Risk**: Complex concurrency, hard to debug

**Mitigation**:
- Incremental implementation (phase by phase)
- Extensive testing (1T, 2T, 4T, 8T)
- ThreadSanitizer validation
- Fallback plan: revert if bugs prove unfixable

---

## Success Criteria

### Must-Have (P0)
- ✅ Compiles and runs without crashes
- ✅ larson benchmark completes (1T, 4T)
- ✅ No memory leaks (valgrind clean)
- ✅ ThreadSanitizer clean (no data races)
- ✅ Mid 4T > 17.70 M/s (60% of mimalloc)

### Should-Have (P1)
- ✅ Mid 4T > 20.0 M/s (68% of mimalloc)
- ✅ Mid 1T > 5.0 M/s (25% improvement)
- ✅ RSS overhead < 20%

### Nice-to-Have (P2)
- Mid 4T > 22.0 M/s (75% of mimalloc)
- Full-suite performance improvement
- Documentation complete

---

## Timeline

| Phase | Duration | Cumulative |
|-------|----------|------------|
| Phase 1: Data Structures | 4-6h | 4-6h |
| Phase 2: Page Allocation | 3-4h | 7-10h |
| Phase 3: Allocation Path | 4-6h | 11-16h |
| Phase 4: Free Path | 3-4h | 14-20h |
| Phase 5: Multi-Thread Testing | 3-4h | 17-24h |
| Phase 6: Optimization | 3-5h | 20-29h |

**Total**: 20-30 hours (2.5-4 working days)

---

## Next Steps

1.
**Review Plan**: Get user approval
2. **Phase 1 Start**: Data structures
3. **Incremental Build**: Test after each phase
4. **Benchmark Early**: Don't wait until the end

**Let's crush mimalloc!** 🔥

---

**Status**: Plan complete, ready to implement ✅
**Confidence**: High (proven approach, incremental plan)
**Expected Outcome**: 60-75% of mimalloc (SUCCESS!)