Phase 7.2 MF2: Per-Page Sharding Implementation Plan
Date: 2025-10-24
Goal: Eliminate shared freelists by implementing per-page sharding (mimalloc approach)
Expected: +50% improvement (13.78 M/s → 20.7 M/s)
Effort: 20-30 hours
Risk: Medium (major architectural change)
Executive Summary
Problem: The current Mid Pool uses shared freelists (7 classes × 8 shards = 56 lists), causing lock contention regardless of the locking mechanism (mutex or lock-free).
Solution: Per-Page Sharding - Each 64KB page has its own independent freelist. No sharing = no contention.
Key Insight: The bottleneck is SHARING, not the locking mechanism. Eliminate sharing, eliminate contention.
Current Architecture Problems
Shared Freelist Design
Current (P6.24):

```
┌─────────────────────────────────────────┐
│ Global Freelists (56 total)             │
│ ├─ Class 0 (2KB): 8 shards × mutex      │
│ ├─ Class 1 (4KB): 8 shards × mutex      │
│ ├─ Class 2 (8KB): 8 shards × mutex      │
│ └─ ...                                  │
└─────────────────────────────────────────┘
  ↑ ↑ ↑ ↑
  │ │ │ │    4 threads competing
  └─┴─┴─┘  → Lock contention
```
Problems:
- Lock Contention: 4 threads → 1 freelist → serialized access
- Cache Line Bouncing: Mutex or atomic operations bounce cache lines
- No Scalability: More threads = worse contention
Lock-Free Failed (P7.1)
We tried lock-free CAS operations:
- Result: -6.6% regression
- Reason: CAS contention + retry overhead > mutex contention
- Lesson: Lock-free primitives cannot fix a fundamental sharing problem
MF2 Approach: Per-Page Sharding
Core Concept
mimalloc's Secret Sauce: O(1) page lookup from block address
```c
// Magic: block address → page (bitwise AND)
MidPage* page = addr_to_page(ptr);  // (ptr & ~0xFFFF)
```
This enables:
- Each page has independent freelist (no sharing!)
- O(1) page lookup (no hash table!)
- Owner-based optimization (fast path for owner thread)
New Architecture
MF2 Per-Page Design:
```
┌─────────────────────────────────────────┐
│ Thread 1 Pages                          │
│ ├─ Page A (2KB): freelist [no lock]     │
│ ├─ Page B (4KB): freelist [no lock]     │
│ └─ ...                                  │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Thread 2 Pages                          │
│ ├─ Page C (2KB): freelist [no lock]     │
│ ├─ Page D (8KB): freelist [no lock]     │
│ └─ ...                                  │
└─────────────────────────────────────────┘
```
Each thread accesses its own pages → ZERO contention!
Data Structures
Page Descriptor (New)
```c
// Per-page metadata (describes one 64KB-aligned page)
typedef struct MidPage {
    // Page identity
    void*    base;          // Page base address (64KB aligned)
    uint8_t  class_idx;     // Size class (0-6)
    uint8_t  _pad[3];

    // Ownership
    uint64_t owner_tid;     // Owner thread ID

    // Freelist (page-local, no sharing!)
    PoolBlock* freelist;    // Local freelist (owner-only, no lock!)
    uint16_t free_count;    // Number of free blocks
    uint16_t capacity;      // Total blocks per page

    // Remote frees (cross-thread, lock-free stack)
    atomic_uintptr_t remote_head;   // Lock-free MPSC stack
    atomic_uint      remote_count;  // Count for quick check

    // Lifecycle
    atomic_int in_use;      // Live allocations (for empty detection)
    atomic_int pending_dn;  // DONTNEED queued flag

    // Linkage
    struct MidPage* next_page;  // Next page in thread's page list
} MidPage;
```
Thread-Local Page Lists
```c
// Per-thread page lists (one per class)
typedef struct ThreadPages {
    MidPage* active_page[POOL_NUM_CLASSES];  // Current page with free blocks
    MidPage* full_pages[POOL_NUM_CLASSES];   // Full pages (no free blocks)
    int      page_count[POOL_NUM_CLASSES];   // Total pages owned
} ThreadPages;

static __thread ThreadPages* t_pages = NULL;
Global Page Registry
// Global registry for cross-thread free (O(1) lookup)
#define PAGE_REGISTRY_BITS 16  // 64K entries (covers 4GB of 64KB pages)
#define PAGE_REGISTRY_SIZE (1 << PAGE_REGISTRY_BITS)

typedef struct {
    MidPage* pages[PAGE_REGISTRY_SIZE];  // Indexed by (addr >> 16) & 0xFFFF
    pthread_mutex_t locks[256];          // Coarse-grained locks for rare updates
} PageRegistry;

static PageRegistry g_page_registry;
```
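Registration is rare (once per page), so striped mutexes are plenty. A minimal sketch of `register_page()`, assuming the structures above; the stripe selection (`idx & 255`) is an illustrative choice:

```c
// Hedged sketch: publish a page in the global registry so that
// cross-thread frees can find it via addr_to_page().
static void register_page(MidPage* page) {
    size_t idx = ((uintptr_t)page->base >> 16) & (PAGE_REGISTRY_SIZE - 1);
    pthread_mutex_t* lock = &g_page_registry.locks[idx & 255];

    pthread_mutex_lock(lock);
    g_page_registry.pages[idx] = page;  // last writer wins; see Risk 1 for collisions
    pthread_mutex_unlock(lock);
}
```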
Core Algorithms
Allocation Fast Path (Owner Thread)
```c
void* mid_alloc_fast(size_t size) {
    int class_idx = size_to_class(size);
    MidPage* page = t_pages->active_page[class_idx];

    // Fast path: pop from page-local freelist (NO LOCK!)
    if (page && page->freelist) {
        PoolBlock* block = page->freelist;
        page->freelist = block->next;
        page->free_count--;
        atomic_fetch_add_explicit(&page->in_use, 1, memory_order_relaxed);
        return (char*)block + HEADER_SIZE;
    }

    // Slow path: drain remote frees or allocate a new page
    return mid_alloc_slow(class_idx);
}
```
Key Point: NO mutex, NO CAS in fast path (owner-only access)!
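The only other work on the fast path is the class computation. A hedged sketch of `size_to_class()`, assuming pure power-of-two classes starting at 2KB; the actual class table is not specified here and may differ:

```c
// Hedged sketch: map a request size to a class index, assuming
// class 0 = 2KB, class 1 = 4KB, ... (power-of-two doubling).
static inline int size_to_class(size_t size) {
    if (size <= 2048) return 0;
    int c = 64 - __builtin_clzll((unsigned long long)(size - 1)) - 11;  // 2^11 = 2KB
    return (c < POOL_NUM_CLASSES) ? c : -1;  // -1: too large for the Mid Pool
}
```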
Allocation Slow Path
```c
void* mid_alloc_slow(int class_idx) {
    MidPage* page = t_pages->active_page[class_idx];

    // Try to drain remote frees (lock-free detach)
    if (page && atomic_load_explicit(&page->remote_count,
                                     memory_order_relaxed) > 0) {
        drain_remote_frees(page);
    }

    if (!page || !page->freelist) {
        // Allocate a new page
        page = alloc_new_page(class_idx);
        if (!page) return NULL;  // OOM

        // Register page in the global registry
        register_page(page);

        // Set as the active page
        t_pages->active_page[class_idx] = page;
    }

    // Retry: pop directly from the now non-empty page-local freelist
    // (mid_alloc_fast takes a size, not a class, so we do not call it here)
    PoolBlock* block = page->freelist;
    page->freelist = block->next;
    page->free_count--;
    atomic_fetch_add_explicit(&page->in_use, 1, memory_order_relaxed);
    return (char*)block + HEADER_SIZE;
}
```
Free Fast Path (Owner Thread)
```c
void mid_free_fast(void* ptr) {
    // O(1) page lookup (bitwise AND)
    MidPage* page = addr_to_page(ptr);  // (ptr & ~0xFFFF)

    // Check whether we are the owner (fast path)
    if (page->owner_tid == my_tid()) {
        // Fast: push to page-local freelist (NO LOCK!)
        PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
        block->next = page->freelist;
        page->freelist = block;
        page->free_count++;

        // Decrement in-use; queue DONTNEED if the page became empty
        int nv = atomic_fetch_sub_explicit(&page->in_use, 1,
                                           memory_order_release) - 1;
        if (nv == 0) {
            enqueue_dontneed(page);
        }
        return;
    }

    // Slow path: cross-thread free
    mid_free_slow(page, ptr);
}
```
Free Slow Path (Cross-Thread)
```c
void mid_free_slow(MidPage* page, void* ptr) {
    // Push to the page's remote stack (lock-free MPSC)
    PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
    uintptr_t old_head = atomic_load_explicit(&page->remote_head,
                                              memory_order_relaxed);
    do {
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &page->remote_head, &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_relaxed));
    atomic_fetch_add_explicit(&page->remote_count, 1, memory_order_relaxed);

    // Decrement in-use; queue DONTNEED if the page became empty
    int nv = atomic_fetch_sub_explicit(&page->in_use, 1,
                                       memory_order_release) - 1;
    if (nv == 0) {
        enqueue_dontneed(page);
    }
}
```
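`enqueue_dontneed()` is also left undefined above. A heavily hedged sketch: here the `madvise` is issued inline and `pending_dn` prevents double-issue, but since the freelist nodes live inside the page, the real design should queue the page and let the owner perform the reset (this sketch writes owner-only fields, which would be racy when reached from the cross-thread path):

```c
#include <sys/mman.h>

// Hedged sketch: drop the backing memory of an empty page.
static void enqueue_dontneed(MidPage* page) {
    int expected = 0;
    if (!atomic_compare_exchange_strong_explicit(
            &page->pending_dn, &expected, 1,
            memory_order_acq_rel, memory_order_relaxed))
        return;  // already queued

    // The freelist chain lives inside the page; it is gone after
    // DONTNEED and must be rebuilt before the page is reused.
    page->freelist = NULL;
    page->free_count = 0;
    madvise(page->base, 64 * 1024, MADV_DONTNEED);
}
```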
Page Lookup (O(1))
```c
// Ultra-fast page lookup using address arithmetic
static inline MidPage* addr_to_page(void* addr) {
    // Assumes 64KB pages aligned to a 64KB boundary
    void* page_base = (void*)((uintptr_t)addr & ~0xFFFFULL);

    // Index into the registry
    size_t idx = ((uintptr_t)page_base >> 16) & (PAGE_REGISTRY_SIZE - 1);

    // Direct lookup (no collision handling needed if the registry is large enough)
    return g_page_registry.pages[idx];
}
```
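Pages themselves come from `alloc_new_page()`, referenced in the slow path. A hedged sketch: over-allocate with `mmap` to guarantee 64KB alignment, trim the slack, then chain the blocks. `block_size_for_class()` and the off-page descriptor are assumptions; a real implementation may carve pages out of larger aligned segments instead:

```c
#include <sys/mman.h>
#include <stdlib.h>

// Hedged sketch: create one 64KB-aligned page and build its freelist.
static MidPage* alloc_new_page(int class_idx) {
    const size_t page_size = 64 * 1024;

    // Over-allocate 2x, then unmap the slack on both sides.
    char* raw = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;
    char* base = (char*)(((uintptr_t)raw + page_size - 1) & ~(uintptr_t)0xFFFF);
    if (base > raw) munmap(raw, (size_t)(base - raw));
    char* raw_end = raw + page_size * 2;
    char* end = base + page_size;
    if (raw_end > end) munmap(end, (size_t)(raw_end - end));

    MidPage* page = calloc(1, sizeof(MidPage));  // descriptor kept off-page
    if (!page) { munmap(base, page_size); return NULL; }
    page->base      = base;
    page->class_idx = (uint8_t)class_idx;
    page->owner_tid = my_tid();

    // Chain every block of the page into the local freelist.
    size_t bsz = block_size_for_class(class_idx);  // assumed helper, e.g. 2KB for class 0
    page->capacity = (uint16_t)(page_size / bsz);
    for (size_t i = 0; i < page->capacity; i++) {
        PoolBlock* b = (PoolBlock*)(base + i * bsz);
        b->next = page->freelist;
        page->freelist = b;
    }
    page->free_count = page->capacity;
    return page;
}
```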
Implementation Phases
Phase 1: Data Structures (4-6h)
Tasks:
- Define `MidPage` struct
- Define `ThreadPages` struct
- Define `PageRegistry` struct
- Initialize global page registry
- Add TLS for thread pages
Validation: Compiles, structures allocated correctly
Phase 2: Page Allocation (3-4h)
Tasks:
- Implement `alloc_new_page(class_idx)`
- Implement `register_page(page)`
- Implement `addr_to_page(ptr)` lookup
- Initialize page freelist (build block chain)
Validation: Can allocate pages, lookup works
Phase 3: Allocation Path (4-6h)
Tasks:
- Implement `mid_alloc_fast()` (owner-only)
- Implement `mid_alloc_slow()` (drain + new page)
- Implement `drain_remote_frees(page)`
- Update `hak_pool_try_alloc()` entry point
Validation: Single-threaded allocation works
Phase 4: Free Path (3-4h)
Tasks:
- Implement `mid_free_fast()` (owner-only)
- Implement `mid_free_slow()` (cross-thread)
- Update `hak_pool_free()` entry point
- Implement empty-page DONTNEED
Validation: Single-threaded free works
Phase 5: Multi-Thread Testing (3-4h)
Tasks:
- Test with 2 threads (cross-thread frees)
- Test with 4 threads (full contention)
- Fix any races or deadlocks
- ThreadSanitizer validation
Validation: larson benchmark runs without crashes
Phase 6: Optimization & Tuning (3-5h)
Tasks:
- Optimize page allocation (batch allocate?)
- Optimize remote drain (batch drain?)
- Tune page registry size
- Profile and fix hotspots
Validation: Performance meets expectations
Expected Performance
Baseline (P6.24)
Mid 1T: 4.03 M/s
Mid 4T: 13.78 M/s (3.42x scaling)
Target (MF2)
Mid 1T: 5.0-6.0 M/s (+24-49%) [No lock overhead]
Mid 4T: 20.0-22.0 M/s (+45-60%) [Zero contention]
vs mimalloc (29.50 M/s):
- Expected: 68-75% of mimalloc
- Success criterion: >60% (17.70 M/s)
Why This Will Work
- Contention Elimination: each thread accesses its own pages
  - Current: 4 threads → 1 freelist → 75% waiting
  - MF2: 4 threads → 4 pages → 0% waiting
- Cache Locality: page-local data stays in cache
  - Current: freelist bounces between threads
  - MF2: page metadata stays in the owner's cache
- Fast Path Optimization: no synchronization for the owner
  - Current: mutex lock/unlock on every allocation
  - MF2: simple pointer manipulation
Risks & Mitigation
Risk 1: Page Registry Conflicts
Risk: Hash collisions in page registry
Mitigation:
- Use large registry (64K entries = low collision rate)
- Handle collisions with chaining if needed
- Monitor collision rate, resize if >5%
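For the monitoring and chaining mentioned above, the cheap first step is to verify a hit instead of trusting it. A hedged sketch of a collision-checked lookup; the chained fallback is an assumption, not part of the current design:

```c
// Hedged sketch: trust the O(1) slot only if the registered page
// actually covers the address; otherwise signal a collision.
static inline MidPage* addr_to_page_checked(void* addr) {
    void* base = (void*)((uintptr_t)addr & ~0xFFFFULL);
    size_t idx = ((uintptr_t)base >> 16) & (PAGE_REGISTRY_SIZE - 1);
    MidPage* page = g_page_registry.pages[idx];
    if (page && page->base == base) return page;  // common case
    return NULL;  // collision or unregistered: fall back to chaining
}
```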
Risk 2: Memory Fragmentation
Risk: Each thread allocates pages independently → fragmentation
Mitigation:
- Page reuse when thread exits (return pages to global pool)
- Periodic empty page cleanup
- Monitor RSS, acceptable if <20% overhead
Risk 3: Cross-Thread Free Overhead
Risk: Remote stack drain might be slow
Mitigation:
- Batch drain (drain multiple blocks at once)
- Adaptive drain frequency
- Keep remote stack lock-free
Risk 4: Implementation Bugs
Risk: Complex concurrency, hard to debug
Mitigation:
- Incremental implementation (phase by phase)
- Extensive testing (1T, 2T, 4T, 8T)
- ThreadSanitizer validation
- Fallback plan: revert if unfixable bugs
Success Criteria
Must-Have (P0)
- ✅ Compiles and runs without crashes
- ✅ larson benchmark completes (1T, 4T)
- ✅ No memory leaks (valgrind clean)
- ✅ ThreadSanitizer clean (no data races)
- ✅ Mid 4T > 17.70 M/s (60% of mimalloc)
Should-Have (P1)
- ✅ Mid 4T > 20.0 M/s (68% of mimalloc)
- ✅ Mid 1T > 5.0 M/s (25% improvement)
- ✅ RSS overhead < 20%
Nice-to-Have (P2)
- Mid 4T > 22.0 M/s (75% of mimalloc)
- Full suite performance improvement
- Documentation complete
Timeline
| Phase | Duration | Cumulative |
|---|---|---|
| Phase 1: Data Structures | 4-6h | 4-6h |
| Phase 2: Page Allocation | 3-4h | 7-10h |
| Phase 3: Allocation Path | 4-6h | 11-16h |
| Phase 4: Free Path | 3-4h | 14-20h |
| Phase 5: Multi-Thread Testing | 3-4h | 17-24h |
| Phase 6: Optimization | 3-5h | 20-29h |
Total: 20-30 hours (2.5-4 working days)
Next Steps
- Review Plan: Get user approval
- Phase 1 Start: Data structures
- Incremental Build: Test after each phase
- Benchmark Early: Don't wait until end
Let's crush mimalloc! 🔥
Status: Plan complete, ready to implement ✅
Confidence: High (proven approach, incremental plan)
Expected Outcome: 60-75% of mimalloc (SUCCESS!)