# Phase 7.2 MF2: Per-Page Sharding Implementation Plan
**Date**: 2025-10-24
**Goal**: Eliminate shared freelists by implementing per-page sharding (mimalloc approach)
**Expected**: +50% improvement (13.78 M/s → 20.7 M/s)
**Effort**: 20-30 hours
**Risk**: Medium (major architectural change)
---
## Executive Summary
**Problem**: Current Mid Pool uses shared freelists (7 classes × 8 shards = 56), causing lock contention regardless of locking mechanism (mutex or lock-free).
**Solution**: **Per-Page Sharding** - Each 64KB page has its own independent freelist. No sharing = no contention.
**Key Insight**: The bottleneck is **SHARING**, not the locking mechanism. Eliminate sharing, eliminate contention.
---
## Current Architecture Problems
### Shared Freelist Design
```
Current (P6.24):

┌──────────────────────────────────────┐
│ Global Freelists (56 total)          │
│  ├─ Class 0 (2KB): 8 shards × mutex  │
│  ├─ Class 1 (4KB): 8 shards × mutex  │
│  ├─ Class 2 (8KB): 8 shards × mutex  │
│  └─ ...                              │
└──────────────────────────────────────┘
          ↑    ↑    ↑    ↑
          │    │    │    │    4 threads competing
          └────┴────┴────┘  → Lock contention
```
**Problems**:
1. **Lock Contention**: 4 threads → 1 freelist → serialized access
2. **Cache Line Bouncing**: Mutex or atomic operations bounce cache lines
3. **No Scalability**: More threads = worse contention
### Lock-Free Failed (P7.1)
We tried lock-free CAS operations:
- **Result**: -6.6% regression
- **Reason**: CAS contention + retry overhead > mutex contention
- **Lesson**: Can't fix fundamental sharing problem with lock-free
---
## MF2 Approach: Per-Page Sharding
### Core Concept
**mimalloc's Secret Sauce**: O(1) page lookup from block address
```c
// Magic: block address → page descriptor (bitwise AND + registry index)
MidPage* page = addr_to_page(ptr);  // page base = ptr & ~0xFFFF
```
This enables:
1. Each page has independent freelist (no sharing!)
2. O(1) page lookup (no hash table!)
3. Owner-based optimization (fast path for owner thread)
### New Architecture
```
MF2 Per-Page Design:

┌──────────────────────────────────────┐
│ Thread 1 Pages                       │
│  ├─ Page A (2KB): freelist [no lock] │
│  ├─ Page B (4KB): freelist [no lock] │
│  └─ ...                              │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│ Thread 2 Pages                       │
│  ├─ Page C (2KB): freelist [no lock] │
│  ├─ Page D (8KB): freelist [no lock] │
│  └─ ...                              │
└──────────────────────────────────────┘

Each thread accesses its own pages → ZERO contention!
```
---
## Data Structures
### Page Descriptor (New)
```c
// Per-page metadata (aligned to 64KB boundary)
typedef struct MidPage {
    // Page identity
    void*    base;                  // Page base address (64KB aligned)
    uint8_t  class_idx;             // Size class (0-6)
    uint8_t  _pad[3];

    // Ownership
    uint64_t owner_tid;             // Owner thread ID

    // Freelist (page-local, no sharing!)
    PoolBlock* freelist;            // Local freelist (owner-only, no lock!)
    uint16_t   free_count;          // Number of free blocks
    uint16_t   capacity;            // Total blocks per page

    // Remote frees (cross-thread, lock-free stack)
    atomic_uintptr_t remote_head;   // Lock-free MPSC stack
    atomic_uint      remote_count;  // Count for quick check

    // Lifecycle
    atomic_int in_use;              // Live allocations (for empty detection)
    atomic_int pending_dn;          // DONTNEED queued flag

    // Linkage
    struct MidPage* next_page;      // Next page in thread's page list
} MidPage;
```
### Thread-Local Page Lists
```c
// Per-thread page lists (one per class)
typedef struct ThreadPages {
    MidPage* active_page[POOL_NUM_CLASSES];  // Current page with free blocks
    MidPage* full_pages[POOL_NUM_CLASSES];   // Full pages (no free blocks)
    int      page_count[POOL_NUM_CLASSES];   // Total pages owned
} ThreadPages;

static __thread ThreadPages* t_pages = NULL;
```
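`t_pages` starts out NULL, so each thread needs a one-time lazy initialization before the allocation and free paths below can dereference it. A minimal sketch, assuming a hypothetical `get_thread_pages()` helper; plain `calloc` is used only for illustration, since a real implementation must bootstrap from internal memory to avoid recursing into the allocator:
```c
#include <stdlib.h>  // calloc (illustration only)

// Hypothetical lazy-init helper (sketch): returns this thread's page lists,
// creating them on first use. Zero-initialization leaves every active_page /
// full_pages slot NULL and every page_count at 0.
static ThreadPages* get_thread_pages(void) {
    if (t_pages == NULL) {
        t_pages = (ThreadPages*)calloc(1, sizeof(ThreadPages));
    }
    return t_pages;
}
```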
### Global Page Registry
```c
// Global registry for cross-thread free (O(1) lookup)
#define PAGE_REGISTRY_BITS 16                    // 64K entries (covers 4GB with 64KB pages)
#define PAGE_REGISTRY_SIZE (1 << PAGE_REGISTRY_BITS)

typedef struct {
    MidPage*        pages[PAGE_REGISTRY_SIZE];   // Indexed by (addr >> 16) & 0xFFFF
    pthread_mutex_t locks[256];                  // Coarse-grained locks for rare updates
} PageRegistry;

static PageRegistry g_page_registry;
```
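Registration happens once per 64KB page, so the coarse lock array is ample. A minimal sketch of `register_page()` under that assumption (the mutex array is presumed initialized during Phase 1 setup):
```c
#include <pthread.h>

// Sketch: publish a page in the global registry so any thread can find it
// from a block address. Lock index is the low 8 bits of the registry index
// (256 coarse locks); registration is rare, so contention here is negligible.
static void register_page(MidPage* page) {
    size_t idx = ((uintptr_t)page->base >> 16) & (PAGE_REGISTRY_SIZE - 1);
    pthread_mutex_lock(&g_page_registry.locks[idx & 255]);
    g_page_registry.pages[idx] = page;   // NOTE: addresses aliasing modulo 4GB collide (see Risk 1)
    pthread_mutex_unlock(&g_page_registry.locks[idx & 255]);
}
```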
---
## Core Algorithms
### Allocation Fast Path (Owner Thread)
```c
void* mid_alloc_fast(size_t size) {
    int class_idx = size_to_class(size);
    MidPage* page = t_pages->active_page[class_idx];

    // Fast path: pop from the page-local freelist (NO LOCK!)
    if (page && page->freelist) {
        PoolBlock* block = page->freelist;
        page->freelist = block->next;
        page->free_count--;
        atomic_fetch_add_explicit(&page->in_use, 1, memory_order_relaxed);
        return (char*)block + HEADER_SIZE;
    }

    // Slow path: drain remote frees or allocate a new page
    return mid_alloc_slow(class_idx);
}
```
**Key Point**: NO mutex, NO CAS in fast path (owner-only access)!
### Allocation Slow Path
```c
void* mid_alloc_slow(int class_idx) {
    MidPage* page = t_pages->active_page[class_idx];

    // Try to drain remote frees (lock-free) into the local freelist
    if (page && atomic_load_explicit(&page->remote_count, memory_order_relaxed) > 0) {
        drain_remote_frees(page);
    }

    // Still nothing to hand out: allocate and register a new page
    if (!page || !page->freelist) {
        page = alloc_new_page(class_idx);
        if (!page) return NULL;                    // OOM
        register_page(page);                       // publish in global registry
        t_pages->active_page[class_idx] = page;    // set as active page
    }

    // Pop one block from the (now non-empty) page-local freelist
    PoolBlock* block = page->freelist;
    page->freelist = block->next;
    page->free_count--;
    atomic_fetch_add_explicit(&page->in_use, 1, memory_order_relaxed);
    return (char*)block + HEADER_SIZE;
}
```
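The slow path depends on `drain_remote_frees()`, which is not sketched elsewhere in this plan. A minimal sketch, assuming only the owning thread ever drains (MPSC), so a single atomic exchange detaches the whole remote stack with no CAS loop on the consumer side:
```c
// Sketch: move all remotely-freed blocks onto the owner's local freelist.
// Called only by the owning thread, so no other consumer races on remote_head.
static void drain_remote_frees(MidPage* page) {
    uintptr_t head = atomic_exchange_explicit(&page->remote_head, 0,
                                              memory_order_acquire);
    unsigned drained = 0;
    PoolBlock* block = (PoolBlock*)head;
    while (block) {
        PoolBlock* next = block->next;
        block->next = page->freelist;   // splice onto the local freelist
        page->freelist = block;
        page->free_count++;
        drained++;
        block = next;
    }
    if (drained) {
        atomic_fetch_sub_explicit(&page->remote_count, drained,
                                  memory_order_relaxed);
    }
}
```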
### Free Fast Path (Owner Thread)
```c
void mid_free_fast(void* ptr) {
    // O(1) page lookup (bitwise AND)
    MidPage* page = addr_to_page(ptr);  // (ptr & ~0xFFFF)

    // Check if we're the owner (fast path)
    if (page->owner_tid == my_tid()) {
        // Fast: push to the page-local freelist (NO LOCK!)
        PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
        block->next = page->freelist;
        page->freelist = block;
        page->free_count++;

        // Decrement in-use, enqueue DONTNEED if the page is now empty
        int nv = atomic_fetch_sub_explicit(&page->in_use, 1, memory_order_release) - 1;
        if (nv == 0) {
            enqueue_dontneed(page);
        }
        return;
    }

    // Slow path: cross-thread free
    mid_free_slow(page, ptr);
}
```
### Free Slow Path (Cross-Thread)
```c
void mid_free_slow(MidPage* page, void* ptr) {
    // Push to the page's remote stack (lock-free MPSC)
    PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&page->remote_head, memory_order_acquire);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &page->remote_head, &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_relaxed));
    atomic_fetch_add_explicit(&page->remote_count, 1, memory_order_relaxed);

    // Decrement in-use, enqueue DONTNEED if the page is now empty
    int nv = atomic_fetch_sub_explicit(&page->in_use, 1, memory_order_release) - 1;
    if (nv == 0) {
        enqueue_dontneed(page);
    }
}
```
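`enqueue_dontneed()` is called from both free paths but is not defined in this plan. A minimal sketch under the assumption that it only marks the page via `pending_dn`, and that the owner thread (or a later sweep, a Phase 4/6 decision) performs the actual `madvise(MADV_DONTNEED)` once the page is provably idle and rebuilds the block chain before reuse:
```c
// Sketch: mark an empty page for reclamation. Safe to call from any thread
// because it only flips a flag; the memory is handed back to the OS later by
// the owner, which must also rebuild the freelist (the block links live inside
// the payload and do not survive MADV_DONTNEED).
static void enqueue_dontneed(MidPage* page) {
    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(&page->pending_dn, &expected, 1,
                                                memory_order_acq_rel,
                                                memory_order_relaxed)) {
        // First 0 -> 1 transition since last reuse: this is where a real
        // implementation would hand the page to its reclamation queue.
    }
}
```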
### Page Lookup (O(1))
```c
// Ultra-fast page lookup using address arithmetic
static inline MidPage* addr_to_page(void* addr) {
    // Assume 64KB pages, aligned to a 64KB boundary
    void* page_base = (void*)((uintptr_t)addr & ~0xFFFFULL);

    // Index into the registry
    size_t idx = ((uintptr_t)page_base >> 16) & (PAGE_REGISTRY_SIZE - 1);

    // Direct lookup; only pages whose addresses alias modulo 4GB collide (see Risk 1)
    return g_page_registry.pages[idx];
}
```
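Phase 2's `alloc_new_page()` is the other helper the slow path relies on. A minimal sketch, assuming a hypothetical `class_to_size()` helper (block stride for the class, header included) and a plain `calloc`'d descriptor; a real implementation would carve descriptors from an internal arena and might reserve larger spans to amortize the mmap:
```c
#include <stdint.h>    // uintptr_t
#include <stdlib.h>    // calloc (descriptor allocation is an assumption)
#include <sys/mman.h>  // mmap, munmap

#define MID_PAGE_SIZE (64 * 1024)

// Sketch: carve out one 64KB-aligned page, create its descriptor, and thread
// every block onto the page-local freelist. Registration is left to the caller
// (mid_alloc_slow), matching the flow above.
static MidPage* alloc_new_page(int class_idx) {
    // Over-allocate so a 64KB-aligned region is guaranteed, then trim the ends.
    size_t span = 2 * MID_PAGE_SIZE;
    char* raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;
    char* base = (char*)(((uintptr_t)raw + MID_PAGE_SIZE - 1)
                         & ~(uintptr_t)(MID_PAGE_SIZE - 1));
    if (base > raw) munmap(raw, (size_t)(base - raw));                    // trim head
    if (base + MID_PAGE_SIZE < raw + span)
        munmap(base + MID_PAGE_SIZE,
               (size_t)((raw + span) - (base + MID_PAGE_SIZE)));          // trim tail

    MidPage* page = calloc(1, sizeof(MidPage));   // sketch only; see lead-in
    if (!page) { munmap(base, MID_PAGE_SIZE); return NULL; }

    size_t block_size = class_to_size(class_idx); // hypothetical helper
    page->base      = base;
    page->class_idx = (uint8_t)class_idx;
    page->owner_tid = my_tid();
    page->capacity  = (uint16_t)(MID_PAGE_SIZE / block_size);

    // Build the block chain (LIFO order) on the page-local freelist.
    for (size_t off = 0; off + block_size <= MID_PAGE_SIZE; off += block_size) {
        PoolBlock* block = (PoolBlock*)(base + off);
        block->next = page->freelist;
        page->freelist = block;
        page->free_count++;
    }
    return page;
}
```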
---
## Implementation Phases
### Phase 1: Data Structures (4-6h)
**Tasks**:
1. Define `MidPage` struct
2. Define `ThreadPages` struct
3. Define `PageRegistry` struct
4. Initialize global page registry
5. Add TLS for thread pages
**Validation**: Compiles, structures allocated correctly
### Phase 2: Page Allocation (3-4h)
**Tasks**:
1. Implement `alloc_new_page(class_idx)`
2. Implement `register_page(page)`
3. Implement `addr_to_page(ptr)` lookup
4. Initialize page freelist (build block chain)
**Validation**: Can allocate pages, lookup works
### Phase 3: Allocation Path (4-6h)
**Tasks**:
1. Implement `mid_alloc_fast()` (owner-only)
2. Implement `mid_alloc_slow()` (drain + new page)
3. Implement `drain_remote_frees(page)`
4. Update `hak_pool_try_alloc()` entry point
**Validation**: Single-threaded allocation works
### Phase 4: Free Path (3-4h)
**Tasks**:
1. Implement `mid_free_fast()` (owner-only)
2. Implement `mid_free_slow()` (cross-thread)
3. Update `hak_pool_free()` entry point
4. Implement empty page DONTNEED
**Validation**: Single-threaded free works
### Phase 5: Multi-Thread Testing (3-4h)
**Tasks**:
1. Test with 2 threads (cross-thread frees)
2. Test with 4 threads (full contention)
3. Fix any races or deadlocks
4. ThreadSanitizer validation
**Validation**: larson benchmark runs without crashes
### Phase 6: Optimization & Tuning (3-5h)
**Tasks**:
1. Optimize page allocation (batch allocate?)
2. Optimize remote drain (batch drain?)
3. Tune page registry size
4. Profile and fix hotspots
**Validation**: Performance meets expectations
---
## Expected Performance
### Baseline (P6.24)
```
Mid 1T: 4.03 M/s
Mid 4T: 13.78 M/s (3.42x scaling)
```
### Target (MF2)
```
Mid 1T: 5.0-6.0 M/s (+24-49%) [No lock overhead]
Mid 4T: 20.0-22.0 M/s (+45-60%) [Zero contention]
```
**vs mimalloc (29.50 M/s)**:
- Expected: 68-75% of mimalloc
- **Success criterion**: >60% (17.70 M/s)
### Why This Will Work
1. **Contention Elimination**: Each thread accesses its own pages
   - Current: 4 threads → 1 freelist → 75% waiting
   - MF2: 4 threads → 4 pages → 0% waiting
2. **Cache Locality**: Page-local data stays in cache
   - Current: Freelist bounces between threads
   - MF2: Page metadata stays in the owner's cache
3. **Fast Path Optimization**: No synchronization for the owner
   - Current: mutex lock/unlock on every allocation
   - MF2: simple pointer manipulation
---
## Risks & Mitigation
### Risk 1: Page Registry Conflicts
**Risk**: Hash collisions in page registry
**Mitigation**:
- Use large registry (64K entries = low collision rate)
- Handle collisions with chaining if needed
- Monitor collision rate, resize if >5%
### Risk 2: Memory Fragmentation
**Risk**: Each thread allocates pages independently → fragmentation
**Mitigation**:
- Page reuse when thread exits (return pages to global pool)
- Periodic empty page cleanup
- Monitor RSS, acceptable if <20% overhead
### Risk 3: Cross-Thread Free Overhead
**Risk**: Remote stack drain might be slow
**Mitigation**:
- Batch drain (drain multiple blocks at once)
- Adaptive drain frequency
- Keep remote stack lock-free
### Risk 4: Implementation Bugs
**Risk**: Complex concurrency, hard to debug
**Mitigation**:
- Incremental implementation (phase by phase)
- Extensive testing (1T, 2T, 4T, 8T)
- ThreadSanitizer validation
- Fallback plan: revert if unfixable bugs
---
## Success Criteria
### Must-Have (P0)
- ✅ Compiles and runs without crashes
- ✅ larson benchmark completes (1T, 4T)
- ✅ No memory leaks (valgrind clean)
- ✅ ThreadSanitizer clean (no data races)
- ✅ Mid 4T > 17.70 M/s (60% of mimalloc)
### Should-Have (P1)
- ✅ Mid 4T > 20.0 M/s (68% of mimalloc)
- ✅ Mid 1T > 5.0 M/s (25% improvement)
- ✅ RSS overhead < 20%
### Nice-to-Have (P2)
- Mid 4T > 22.0 M/s (75% of mimalloc)
- Full suite performance improvement
- Documentation complete
---
## Timeline
| Phase | Duration | Cumulative |
|-------|----------|------------|
| Phase 1: Data Structures | 4-6h | 4-6h |
| Phase 2: Page Allocation | 3-4h | 7-10h |
| Phase 3: Allocation Path | 4-6h | 11-16h |
| Phase 4: Free Path | 3-4h | 14-20h |
| Phase 5: Multi-Thread Testing | 3-4h | 17-24h |
| Phase 6: Optimization | 3-5h | 20-29h |
**Total**: 20-30 hours (2.5-4 working days)
---
## Next Steps
1. **Review Plan**: Get user approval
2. **Phase 1 Start**: Data structures
3. **Incremental Build**: Test after each phase
4. **Benchmark Early**: Don't wait until end
**Let's crush mimalloc!** 🔥
---
**Status**: Plan complete, ready to implement ✅
**Confidence**: High (proven approach, incremental plan)
**Expected Outcome**: 60-75% of mimalloc (SUCCESS!)