# Phase 7.2 MF2: Per-Page Sharding Implementation Plan

**Date**: 2025-10-24

**Goal**: Eliminate shared freelists by implementing per-page sharding (mimalloc approach)

**Expected**: +50% improvement (13.78 M/s → 20.7 M/s)

**Effort**: 20-30 hours

**Risk**: Medium (major architectural change)

---

## Executive Summary

**Problem**: The current Mid Pool uses shared freelists (7 classes × 8 shards = 56), causing lock contention regardless of the locking mechanism (mutex or lock-free).

**Solution**: **Per-Page Sharding**: each 64KB page has its own independent freelist. No sharing = no contention.

**Key Insight**: The bottleneck is **SHARING**, not the locking mechanism. Eliminate sharing, eliminate contention.

---

## Current Architecture Problems

### Shared Freelist Design

```
Current (P6.24):
┌─────────────────────────────────────────┐
│ Global Freelists (56 total)             │
│  ├─ Class 0 (2KB): 8 shards × mutex     │
│  ├─ Class 1 (4KB): 8 shards × mutex     │
│  ├─ Class 2 (8KB): 8 shards × mutex     │
│  └─ ...                                 │
└─────────────────────────────────────────┘
      ↑ ↑ ↑ ↑
      │ │ │ │   4 threads competing
      └─┴─┴─┘ → Lock contention
```

**Problems**:
1. **Lock Contention**: 4 threads → 1 freelist → serialized access
2. **Cache Line Bouncing**: Mutex or atomic operations bounce cache lines
3. **No Scalability**: More threads = worse contention

### Lock-Free Failed (P7.1)

We tried lock-free CAS operations:
- **Result**: -6.6% regression
- **Reason**: CAS contention + retry overhead > mutex contention
- **Lesson**: Can't fix a fundamental sharing problem with lock-free

---

## MF2 Approach: Per-Page Sharding

### Core Concept

**mimalloc's Secret Sauce**: O(1) page lookup from the block address

```c
// Magic: block address → page descriptor (bitwise AND, 64KB-aligned pages)
MidPage* page = addr_to_page(ptr);  // (ptr & ~0xFFFF)
```

This enables:
1. Each page has an independent freelist (no sharing!)
2. O(1) page lookup (no hash table!)
3. Owner-based optimization (fast path for the owner thread)
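
The AND trick is easy to sanity-check in isolation. A minimal helper (the name `mid_page_base` is illustrative, not part of the existing code):

```c
#include <stdint.h>

// Clear the low 16 bits of an address: for 64KB pages allocated on 64KB
// boundaries, this yields the base of the page containing the address.
static inline uintptr_t mid_page_base(uintptr_t addr) {
    return addr & ~(uintptr_t)0xFFFF;
}
```

Every address in `[0x7f12340000, 0x7f1234FFFF]` maps to `0x7f12340000`, so a single AND replaces any hash or tree lookup.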

### New Architecture

```
MF2 Per-Page Design:
┌─────────────────────────────────────────┐
│ Thread 1 Pages                          │
│  ├─ Page A (2KB): freelist [no lock]    │
│  ├─ Page B (4KB): freelist [no lock]    │
│  └─ ...                                 │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Thread 2 Pages                          │
│  ├─ Page C (2KB): freelist [no lock]    │
│  ├─ Page D (8KB): freelist [no lock]    │
│  └─ ...                                 │
└─────────────────────────────────────────┘

Each thread accesses its own pages → ZERO contention!
```

---

## Data Structures

### Page Descriptor (New)

```c
#include <stdatomic.h>
#include <stdint.h>

// Per-page metadata (one descriptor per 64KB page)
typedef struct MidPage {
    // Page identity
    void*    base;           // Page base address (64KB aligned)
    uint8_t  class_idx;      // Size class (0-6)
    uint8_t  _pad[7];        // Align owner_tid to 8 bytes

    // Ownership
    uint64_t owner_tid;      // Owner thread ID

    // Freelist (page-local, no sharing!)
    PoolBlock* freelist;     // Local freelist (owner-only, no lock!)
    uint16_t   free_count;   // Number of free blocks
    uint16_t   capacity;     // Total blocks per page

    // Remote frees (cross-thread, lock-free stack)
    atomic_uintptr_t remote_head;   // Lock-free MPSC stack
    atomic_uint      remote_count;  // Count for quick check

    // Lifecycle
    atomic_int in_use;       // Live allocations (for empty detection)
    atomic_int pending_dn;   // DONTNEED queued flag

    // Linkage
    struct MidPage* next_page;  // Next page in the thread's page list
} MidPage;
```

### Thread-Local Page Lists

```c
// Per-thread page lists (one per class)
typedef struct ThreadPages {
    MidPage* active_page[POOL_NUM_CLASSES];  // Current page with free blocks
    MidPage* full_pages[POOL_NUM_CLASSES];   // Full pages (no free blocks)
    int      page_count[POOL_NUM_CLASSES];   // Total pages owned
} ThreadPages;

static __thread ThreadPages* t_pages = NULL;
```

### Global Page Registry

```c
// Global registry for cross-thread free (O(1) lookup)
#define PAGE_REGISTRY_BITS 16                        // 64K entries (covers 4GB with 64KB pages)
#define PAGE_REGISTRY_SIZE (1 << PAGE_REGISTRY_BITS)

typedef struct {
    MidPage*        pages[PAGE_REGISTRY_SIZE];  // Indexed by (addr >> 16) & 0xFFFF
    pthread_mutex_t locks[256];                 // Coarse-grained locks for rare updates
} PageRegistry;

static PageRegistry g_page_registry;
```
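
Registration is rare (once per page), so the striped mutexes are plenty. A sketch of `register_page()` against the registry above, with minimal stand-in types so it compiles on its own; the striping choice (`idx & 255`) and the helper names are assumptions:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

typedef struct MidPage { void* base; } MidPage;  // stand-in for the full struct

#define PAGE_REGISTRY_BITS 16
#define PAGE_REGISTRY_SIZE (1 << PAGE_REGISTRY_BITS)

typedef struct {
    MidPage*        pages[PAGE_REGISTRY_SIZE];
    pthread_mutex_t locks[256];
} PageRegistry;

static PageRegistry g_page_registry;

static void registry_init(void) {
    for (int i = 0; i < 256; i++)
        pthread_mutex_init(&g_page_registry.locks[i], NULL);
}

// Same index computation as addr_to_page(): page base >> 16, masked
static size_t page_index(void* base) {
    return ((uintptr_t)base >> 16) & (PAGE_REGISTRY_SIZE - 1);
}

static void register_page(MidPage* page) {
    size_t idx = page_index(page->base);
    pthread_mutex_t* lk = &g_page_registry.locks[idx & 255];  // 256-way striping
    pthread_mutex_lock(lk);
    g_page_registry.pages[idx] = page;  // collision handling omitted (see Risk 1)
    pthread_mutex_unlock(lk);
}
```

Unregistering on page release is the same slot write with `NULL` under the same stripe lock.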

---

## Core Algorithms

### Allocation Fast Path (Owner Thread)

```c
void* mid_alloc_fast(size_t size) {
    int class_idx = size_to_class(size);
    MidPage* page = t_pages->active_page[class_idx];

    // Fast path: pop from the page-local freelist (NO LOCK!)
    if (page && page->freelist) {
        PoolBlock* block = page->freelist;
        page->freelist = block->next;
        page->free_count--;
        atomic_fetch_add_explicit(&page->in_use, 1, memory_order_relaxed);
        return (char*)block + HEADER_SIZE;
    }

    // Slow path: drain remote frees or allocate a new page
    return mid_alloc_slow(class_idx);
}
```

**Key Point**: NO mutex, NO CAS in the fast path (owner-only access)!
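
`size_to_class()` is not spelled out in this plan; below is one plausible mapping, assuming the classes double from 2KB (class 0) as in the diagram above. The real Mid Pool mapping may differ:

```c
#include <stddef.h>

// Hypothetical size_to_class(): smallest power-of-two class (2KB, 4KB,
// 8KB, ...) that fits the request. The caller is responsible for rejecting
// sizes above the largest class.
static inline int size_to_class(size_t size) {
    int    cls = 0;
    size_t cap = 2048;  // class 0 holds blocks up to 2KB
    while (cap < size) {
        cap <<= 1;
        cls++;
    }
    return cls;
}
```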

### Allocation Slow Path

```c
void* mid_alloc_slow(int class_idx) {
    MidPage* page = t_pages->active_page[class_idx];

    // Try to drain remote frees first (lock-free)
    if (page && atomic_load_explicit(&page->remote_count, memory_order_relaxed) > 0) {
        drain_remote_frees(page);
        if (page->freelist) goto retry;
    }

    // Allocate a new page
    page = alloc_new_page(class_idx);
    if (!page) return NULL;  // OOM

    // Register the page in the global registry
    register_page(page);

    // Set it as the active page for this class
    t_pages->active_page[class_idx] = page;

retry:;
    // Owner-only pop, identical to the fast path (the freelist is
    // guaranteed non-empty here)
    PoolBlock* block = page->freelist;
    page->freelist = block->next;
    page->free_count--;
    atomic_fetch_add_explicit(&page->in_use, 1, memory_order_relaxed);
    return (char*)block + HEADER_SIZE;
}
```

### Free Fast Path (Owner Thread)

```c
void mid_free_fast(void* ptr) {
    // O(1) page lookup (mask + registry index, see addr_to_page() below)
    MidPage* page = addr_to_page(ptr);

    // Check if we're the owner (fast path)
    if (page->owner_tid == my_tid()) {
        // Fast: push to the page-local freelist (NO LOCK!)
        PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
        block->next = page->freelist;
        page->freelist = block;
        page->free_count++;

        // Decrement in-use; enqueue DONTNEED when the page becomes empty
        int nv = atomic_fetch_sub_explicit(&page->in_use, 1, memory_order_release) - 1;
        if (nv == 0) {
            enqueue_dontneed(page);
        }
        return;
    }

    // Slow path: cross-thread free
    mid_free_slow(page, ptr);
}
```

### Free Slow Path (Cross-Thread)

```c
void mid_free_slow(MidPage* page, void* ptr) {
    // Push onto the page's remote stack (lock-free MPSC)
    PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&page->remote_head, memory_order_acquire);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        &page->remote_head, &old_head, (uintptr_t)block,
        memory_order_release, memory_order_relaxed));

    atomic_fetch_add_explicit(&page->remote_count, 1, memory_order_relaxed);

    // Decrement in-use
    int nv = atomic_fetch_sub_explicit(&page->in_use, 1, memory_order_release) - 1;
    if (nv == 0) {
        enqueue_dontneed(page);
    }
}
```

### Page Lookup (O(1))

```c
// Ultra-fast page lookup using address arithmetic
static inline MidPage* addr_to_page(void* addr) {
    // Assume 64KB pages, aligned to a 64KB boundary
    void* page_base = (void*)((uintptr_t)addr & ~0xFFFFULL);

    // Index into the registry
    size_t idx = ((uintptr_t)page_base >> 16) & (PAGE_REGISTRY_SIZE - 1);

    // Direct lookup; note that page bases 4GB apart alias to the same slot,
    // so collisions must stay rare (see Risk 1 below)
    return g_page_registry.pages[idx];
}
```

---

## Implementation Phases

### Phase 1: Data Structures (4-6h)

**Tasks**:
1. Define the `MidPage` struct
2. Define the `ThreadPages` struct
3. Define the `PageRegistry` struct
4. Initialize the global page registry
5. Add TLS for thread pages

**Validation**: Compiles; structures are allocated correctly

### Phase 2: Page Allocation (3-4h)

**Tasks**:
1. Implement `alloc_new_page(class_idx)`
2. Implement `register_page(page)`
3. Implement the `addr_to_page(ptr)` lookup
4. Initialize the page freelist (build the block chain)

**Validation**: Pages can be allocated; lookup works
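
Task 4 (building the block chain for a fresh page) can be sketched as follows; the back-to-back block layout and the helper name are assumptions for illustration, and any per-page header reservation is omitted:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;

#define MID_PAGE_SIZE (64 * 1024)

// Carve a fresh 64KB page into `capacity` blocks of `block_size` bytes and
// thread them into a singly linked freelist. Linking back-to-front makes
// the freelist pop blocks in ascending address order.
static PoolBlock* build_block_chain(void* page_base, size_t block_size,
                                    uint16_t capacity) {
    PoolBlock* head = NULL;
    for (int i = (int)capacity - 1; i >= 0; i--) {
        PoolBlock* b = (PoolBlock*)((char*)page_base + (size_t)i * block_size);
        b->next = head;
        head = b;
    }
    return head;
}
```

For example, a 64KB page of 2KB blocks yields a 32-block chain starting at the page base.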

### Phase 3: Allocation Path (4-6h)

**Tasks**:
1. Implement `mid_alloc_fast()` (owner-only)
2. Implement `mid_alloc_slow()` (drain + new page)
3. Implement `drain_remote_frees(page)`
4. Update the `hak_pool_try_alloc()` entry point

**Validation**: Single-threaded allocation works

### Phase 4: Free Path (3-4h)

**Tasks**:
1. Implement `mid_free_fast()` (owner-only)
2. Implement `mid_free_slow()` (cross-thread)
3. Update the `hak_pool_free()` entry point
4. Implement empty-page DONTNEED

**Validation**: Single-threaded free works

### Phase 5: Multi-Thread Testing (3-4h)

**Tasks**:
1. Test with 2 threads (cross-thread frees)
2. Test with 4 threads (full contention)
3. Fix any races or deadlocks
4. ThreadSanitizer validation

**Validation**: The larson benchmark runs without crashes

### Phase 6: Optimization & Tuning (3-5h)

**Tasks**:
1. Optimize page allocation (batch allocate?)
2. Optimize remote drain (batch drain?)
3. Tune the page registry size
4. Profile and fix hotspots

**Validation**: Performance meets expectations

---

## Expected Performance

### Baseline (P6.24)

```
Mid 1T:  4.03 M/s
Mid 4T: 13.78 M/s (3.42x scaling)
```

### Target (MF2)

```
Mid 1T:  5.0-6.0 M/s  (+24-49%)  [No lock overhead]
Mid 4T: 20.0-22.0 M/s (+45-60%)  [Zero contention]
```

**vs mimalloc (29.50 M/s)**:
- Expected: 68-75% of mimalloc
- **Success criterion**: >60% (17.70 M/s)

### Why This Will Work

1. **Contention Elimination**: Each thread accesses its own pages
   - Current: 4 threads → 1 freelist → 75% waiting
   - MF2: 4 threads → 4 pages → 0% waiting

2. **Cache Locality**: Page-local data stays in cache
   - Current: Freelist bounces between threads
   - MF2: Page metadata stays in the owner's cache

3. **Fast Path Optimization**: No synchronization for the owner
   - Current: Mutex lock/unlock on every allocation
   - MF2: Simple pointer manipulation

---

## Risks & Mitigation

### Risk 1: Page Registry Conflicts

**Risk**: Hash collisions in the page registry

**Mitigation**:
- Use a large registry (64K entries = low collision rate)
- Handle collisions with chaining if needed
- Monitor the collision rate; resize if >5%

### Risk 2: Memory Fragmentation

**Risk**: Each thread allocates pages independently → fragmentation

**Mitigation**:
- Page reuse when a thread exits (return pages to a global pool)
- Periodic empty-page cleanup
- Monitor RSS; acceptable if <20% overhead

### Risk 3: Cross-Thread Free Overhead

**Risk**: Remote stack drain might be slow

**Mitigation**:
- Batch drain (drain multiple blocks at once)
- Adaptive drain frequency
- Keep the remote stack lock-free

### Risk 4: Implementation Bugs

**Risk**: Complex concurrency, hard to debug

**Mitigation**:
- Incremental implementation (phase by phase)
- Extensive testing (1T, 2T, 4T, 8T)
- ThreadSanitizer validation
- Fallback plan: revert if bugs prove unfixable

---

## Success Criteria

### Must-Have (P0)

- ✅ Compiles and runs without crashes
- ✅ larson benchmark completes (1T, 4T)
- ✅ No memory leaks (valgrind clean)
- ✅ ThreadSanitizer clean (no data races)
- ✅ Mid 4T > 17.70 M/s (60% of mimalloc)

### Should-Have (P1)

- ✅ Mid 4T > 20.0 M/s (68% of mimalloc)
- ✅ Mid 1T > 5.0 M/s (25% improvement)
- ✅ RSS overhead < 20%

### Nice-to-Have (P2)

- Mid 4T > 22.0 M/s (75% of mimalloc)
- Full-suite performance improvement
- Documentation complete

---

## Timeline

| Phase | Duration | Cumulative |
|-------|----------|------------|
| Phase 1: Data Structures | 4-6h | 4-6h |
| Phase 2: Page Allocation | 3-4h | 7-10h |
| Phase 3: Allocation Path | 4-6h | 11-16h |
| Phase 4: Free Path | 3-4h | 14-20h |
| Phase 5: Multi-Thread Testing | 3-4h | 17-24h |
| Phase 6: Optimization | 3-5h | 20-29h |

**Total**: 20-30 hours (2.5-4 working days)

---

## Next Steps

1. **Review Plan**: Get user approval
2. **Phase 1 Start**: Data structures
3. **Incremental Build**: Test after each phase
4. **Benchmark Early**: Don't wait until the end

**Let's crush mimalloc!** 🔥

---

**Status**: Plan complete, ready to implement ✅

**Confidence**: High (proven approach, incremental plan)

**Expected Outcome**: 60-75% of mimalloc (SUCCESS!)