# Phase 7.2 MF2: Per-Page Sharding Implementation Plan
**Date**: 2025-10-24
**Goal**: Eliminate shared freelists by implementing per-page sharding (mimalloc approach)
**Expected**: +50% improvement (13.78 M/s → 20.7 M/s)
**Effort**: 20-30 hours
**Risk**: Medium (major architectural change)
---
## Executive Summary
**Problem**: Current Mid Pool uses shared freelists (7 classes × 8 shards = 56), causing lock contention regardless of locking mechanism (mutex or lock-free).
**Solution**: **Per-Page Sharding** - Each 64KB page has its own independent freelist. No sharing = no contention.
**Key Insight**: The bottleneck is **SHARING**, not the locking mechanism. Eliminate sharing, eliminate contention.
---
## Current Architecture Problems
### Shared Freelist Design
```
Current (P6.24):

┌──────────────────────────────────────┐
│ Global Freelists (56 total)          │
│  ├─ Class 0 (2KB): 8 shards × mutex  │
│  ├─ Class 1 (4KB): 8 shards × mutex  │
│  ├─ Class 2 (8KB): 8 shards × mutex  │
│  └─ ...                              │
└──────────────────────────────────────┘
          ↑    ↑    ↑    ↑
          │    │    │    │    4 threads competing
          └────┴────┴────┘  → Lock contention
```
**Problems**:
1. **Lock Contention**: 4 threads → 1 freelist → serialized access
2. **Cache Line Bouncing**: Mutex or atomic operations bounce cache lines
3. **No Scalability**: More threads = worse contention
### Lock-Free Failed (P7.1)
We tried lock-free CAS operations:
- **Result**: -6.6% regression
- **Reason**: CAS contention + retry overhead > mutex contention
- **Lesson**: Can't fix fundamental sharing problem with lock-free
---
## MF2 Approach: Per-Page Sharding
### Core Concept
**mimalloc's Secret Sauce**: O(1) page lookup from block address
```c
// Magic: block address → page descriptor (bitwise AND + registry index)
MidPage* page = addr_to_page(ptr);  // page base = ptr & ~0xFFFF
```
This enables:
1. Each page has independent freelist (no sharing!)
2. O(1) page lookup (no hash table!)
3. Owner-based optimization (fast path for owner thread)
### New Architecture
```
MF2 Per-Page Design:

┌──────────────────────────────────────┐
│ Thread 1 Pages                       │
│  ├─ Page A (2KB): freelist [no lock] │
│  ├─ Page B (4KB): freelist [no lock] │
│  └─ ...                              │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│ Thread 2 Pages                       │
│  ├─ Page C (2KB): freelist [no lock] │
│  ├─ Page D (8KB): freelist [no lock] │
│  └─ ...                              │
└──────────────────────────────────────┘

Each thread accesses its own pages → ZERO contention!
```
---
## Data Structures
### Page Descriptor (New)
```c
// Per-page metadata (aligned to 64KB boundary)
typedef struct MidPage {
    // Page identity
    void*    base;                  // Page base address (64KB aligned)
    uint8_t  class_idx;             // Size class (0-6)
    uint8_t  _pad[3];

    // Ownership
    uint64_t owner_tid;             // Owner thread ID

    // Freelist (page-local, no sharing!)
    PoolBlock* freelist;            // Local freelist (owner-only, no lock!)
    uint16_t   free_count;          // Number of free blocks
    uint16_t   capacity;            // Total blocks per page

    // Remote frees (cross-thread, lock-free stack)
    atomic_uintptr_t remote_head;   // Lock-free MPSC stack
    atomic_uint      remote_count;  // Count for quick check

    // Lifecycle
    atomic_int in_use;              // Live allocations (for empty detection)
    atomic_int pending_dn;          // DONTNEED queued flag

    // Linkage
    struct MidPage* next_page;      // Next page in thread's page list
} MidPage;
```
### Thread-Local Page Lists
```c
// Per-thread page lists (one per class)
typedef struct ThreadPages {
    MidPage* active_page[POOL_NUM_CLASSES];  // Current page with free blocks
    MidPage* full_pages[POOL_NUM_CLASSES];   // Full pages (no free blocks)
    int      page_count[POOL_NUM_CLASSES];   // Total pages owned
} ThreadPages;

static __thread ThreadPages* t_pages = NULL;
```
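`t_pages` starts out NULL, so each thread needs a one-time lazy initialization before the allocation and free paths below can dereference it. A minimal sketch, assuming a hypothetical `get_thread_pages()` helper; plain `calloc` is used only for illustration, since a real implementation must bootstrap from internal memory to avoid recursing into the allocator:
```c
#include <stdlib.h>  // calloc (illustration only)

// Hypothetical lazy-init helper (sketch): returns this thread's page lists,
// creating them on first use. Zero-initialization leaves every active_page /
// full_pages slot NULL and every page_count at 0.
static ThreadPages* get_thread_pages(void) {
    if (t_pages == NULL) {
        t_pages = (ThreadPages*)calloc(1, sizeof(ThreadPages));
    }
    return t_pages;
}
```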
### Global Page Registry
```c
// Global registry for cross-thread free (O(1) lookup)
#define PAGE_REGISTRY_BITS 16                    // 64K entries (covers 4GB with 64KB pages)
#define PAGE_REGISTRY_SIZE (1 << PAGE_REGISTRY_BITS)

typedef struct {
    MidPage*        pages[PAGE_REGISTRY_SIZE];   // Indexed by (addr >> 16) & 0xFFFF
    pthread_mutex_t locks[256];                  // Coarse-grained locks for rare updates
} PageRegistry;

static PageRegistry g_page_registry;
```
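Registration happens once per 64KB page, so the coarse lock array is ample. A minimal sketch of `register_page()` under that assumption (the mutex array is presumed initialized during Phase 1 setup):
```c
#include <pthread.h>

// Sketch: publish a page in the global registry so any thread can find it
// from a block address. Lock index is the low 8 bits of the registry index
// (256 coarse locks); registration is rare, so contention here is negligible.
static void register_page(MidPage* page) {
    size_t idx = ((uintptr_t)page->base >> 16) & (PAGE_REGISTRY_SIZE - 1);
    pthread_mutex_lock(&g_page_registry.locks[idx & 255]);
    g_page_registry.pages[idx] = page;   // NOTE: addresses aliasing modulo 4GB collide (see Risk 1)
    pthread_mutex_unlock(&g_page_registry.locks[idx & 255]);
}
```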
---
## Core Algorithms
### Allocation Fast Path (Owner Thread)
```c
void* mid_alloc_fast(size_t size) {
    int class_idx = size_to_class(size);
    MidPage* page = t_pages->active_page[class_idx];

    // Fast path: pop from the page-local freelist (NO LOCK!)
    if (page && page->freelist) {
        PoolBlock* block = page->freelist;
        page->freelist = block->next;
        page->free_count--;
        atomic_fetch_add_explicit(&page->in_use, 1, memory_order_relaxed);
        return (char*)block + HEADER_SIZE;
    }

    // Slow path: drain remote frees or allocate a new page
    return mid_alloc_slow(class_idx);
}
```
**Key Point**: NO mutex, NO CAS in fast path (owner-only access)!
### Allocation Slow Path
```c
void* mid_alloc_slow(int class_idx) {
    MidPage* page = t_pages->active_page[class_idx];

    // Try to drain remote frees (lock-free) into the local freelist
    if (page && atomic_load_explicit(&page->remote_count, memory_order_relaxed) > 0) {
        drain_remote_frees(page);
    }

    // Still nothing to hand out: allocate and register a new page
    if (!page || !page->freelist) {
        page = alloc_new_page(class_idx);
        if (!page) return NULL;                    // OOM
        register_page(page);                       // publish in global registry
        t_pages->active_page[class_idx] = page;    // set as active page
    }

    // Pop one block from the (now non-empty) page-local freelist
    PoolBlock* block = page->freelist;
    page->freelist = block->next;
    page->free_count--;
    atomic_fetch_add_explicit(&page->in_use, 1, memory_order_relaxed);
    return (char*)block + HEADER_SIZE;
}
```
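The slow path depends on `drain_remote_frees()`, which is not sketched elsewhere in this plan. A minimal sketch, assuming only the owning thread ever drains (MPSC), so a single atomic exchange detaches the whole remote stack with no CAS loop on the consumer side:
```c
// Sketch: move all remotely-freed blocks onto the owner's local freelist.
// Called only by the owning thread, so no other consumer races on remote_head.
static void drain_remote_frees(MidPage* page) {
    uintptr_t head = atomic_exchange_explicit(&page->remote_head, 0,
                                              memory_order_acquire);
    unsigned drained = 0;
    PoolBlock* block = (PoolBlock*)head;
    while (block) {
        PoolBlock* next = block->next;
        block->next = page->freelist;   // splice onto the local freelist
        page->freelist = block;
        page->free_count++;
        drained++;
        block = next;
    }
    if (drained) {
        atomic_fetch_sub_explicit(&page->remote_count, drained,
                                  memory_order_relaxed);
    }
}
```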
### Free Fast Path (Owner Thread)
```c
void mid_free_fast(void* ptr) {
    // O(1) page lookup (bitwise AND)
    MidPage* page = addr_to_page(ptr);  // (ptr & ~0xFFFF)

    // Check if we're the owner (fast path)
    if (page->owner_tid == my_tid()) {
        // Fast: push to the page-local freelist (NO LOCK!)
        PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
        block->next = page->freelist;
        page->freelist = block;
        page->free_count++;

        // Decrement in-use, enqueue DONTNEED if the page is now empty
        int nv = atomic_fetch_sub_explicit(&page->in_use, 1, memory_order_release) - 1;
        if (nv == 0) {
            enqueue_dontneed(page);
        }
        return;
    }

    // Slow path: cross-thread free
    mid_free_slow(page, ptr);
}
```
### Free Slow Path (Cross-Thread)
```c
void mid_free_slow(MidPage* page, void* ptr) {
    // Push to the page's remote stack (lock-free MPSC)
    PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&page->remote_head, memory_order_acquire);
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &page->remote_head, &old_head, (uintptr_t)block,
                 memory_order_release, memory_order_relaxed));
    atomic_fetch_add_explicit(&page->remote_count, 1, memory_order_relaxed);

    // Decrement in-use, enqueue DONTNEED if the page is now empty
    int nv = atomic_fetch_sub_explicit(&page->in_use, 1, memory_order_release) - 1;
    if (nv == 0) {
        enqueue_dontneed(page);
    }
}
```
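`enqueue_dontneed()` is called from both free paths but is not defined in this plan. A minimal sketch under the assumption that it only marks the page via `pending_dn`, and that the owner thread (or a later sweep, a Phase 4/6 decision) performs the actual `madvise(MADV_DONTNEED)` once the page is provably idle and rebuilds the block chain before reuse:
```c
// Sketch: mark an empty page for reclamation. Safe to call from any thread
// because it only flips a flag; the memory is handed back to the OS later by
// the owner, which must also rebuild the freelist (the block links live inside
// the payload and do not survive MADV_DONTNEED).
static void enqueue_dontneed(MidPage* page) {
    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(&page->pending_dn, &expected, 1,
                                                memory_order_acq_rel,
                                                memory_order_relaxed)) {
        // First 0 -> 1 transition since last reuse: this is where a real
        // implementation would hand the page to its reclamation queue.
    }
}
```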
### Page Lookup (O(1))
```c
// Ultra-fast page lookup using address arithmetic
static inline MidPage* addr_to_page(void* addr) {
    // Assume 64KB pages, aligned to a 64KB boundary
    void* page_base = (void*)((uintptr_t)addr & ~0xFFFFULL);

    // Index into the registry
    size_t idx = ((uintptr_t)page_base >> 16) & (PAGE_REGISTRY_SIZE - 1);

    // Direct lookup; only pages whose addresses alias modulo 4GB collide (see Risk 1)
    return g_page_registry.pages[idx];
}
```
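Phase 2's `alloc_new_page()` is the other helper the slow path relies on. A minimal sketch, assuming a hypothetical `class_to_size()` helper (block stride for the class, header included) and a plain `calloc`'d descriptor; a real implementation would carve descriptors from an internal arena and might reserve larger spans to amortize the mmap:
```c
#include <stdint.h>    // uintptr_t
#include <stdlib.h>    // calloc (descriptor allocation is an assumption)
#include <sys/mman.h>  // mmap, munmap

#define MID_PAGE_SIZE (64 * 1024)

// Sketch: carve out one 64KB-aligned page, create its descriptor, and thread
// every block onto the page-local freelist. Registration is left to the caller
// (mid_alloc_slow), matching the flow above.
static MidPage* alloc_new_page(int class_idx) {
    // Over-allocate so a 64KB-aligned region is guaranteed, then trim the ends.
    size_t span = 2 * MID_PAGE_SIZE;
    char* raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;
    char* base = (char*)(((uintptr_t)raw + MID_PAGE_SIZE - 1)
                         & ~(uintptr_t)(MID_PAGE_SIZE - 1));
    if (base > raw) munmap(raw, (size_t)(base - raw));                    // trim head
    if (base + MID_PAGE_SIZE < raw + span)
        munmap(base + MID_PAGE_SIZE,
               (size_t)((raw + span) - (base + MID_PAGE_SIZE)));          // trim tail

    MidPage* page = calloc(1, sizeof(MidPage));   // sketch only; see lead-in
    if (!page) { munmap(base, MID_PAGE_SIZE); return NULL; }

    size_t block_size = class_to_size(class_idx); // hypothetical helper
    page->base      = base;
    page->class_idx = (uint8_t)class_idx;
    page->owner_tid = my_tid();
    page->capacity  = (uint16_t)(MID_PAGE_SIZE / block_size);

    // Build the block chain (LIFO order) on the page-local freelist.
    for (size_t off = 0; off + block_size <= MID_PAGE_SIZE; off += block_size) {
        PoolBlock* block = (PoolBlock*)(base + off);
        block->next = page->freelist;
        page->freelist = block;
        page->free_count++;
    }
    return page;
}
```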
---
## Implementation Phases
### Phase 1: Data Structures (4-6h)
**Tasks**:
1. Define `MidPage` struct
2. Define `ThreadPages` struct
3. Define `PageRegistry` struct
4. Initialize global page registry
5. Add TLS for thread pages
**Validation**: Compiles, structures allocated correctly
### Phase 2: Page Allocation (3-4h)
**Tasks**:
1. Implement `alloc_new_page(class_idx)`
2. Implement `register_page(page)`
3. Implement `addr_to_page(ptr)` lookup
4. Initialize page freelist (build block chain)
**Validation**: Can allocate pages, lookup works
### Phase 3: Allocation Path (4-6h)
**Tasks**:
1. Implement `mid_alloc_fast()` (owner-only)
2. Implement `mid_alloc_slow()` (drain + new page)
3. Implement `drain_remote_frees(page)`
4. Update `hak_pool_try_alloc()` entry point
**Validation**: Single-threaded allocation works
### Phase 4: Free Path (3-4h)
**Tasks**:
1. Implement `mid_free_fast()` (owner-only)
2. Implement `mid_free_slow()` (cross-thread)
3. Update `hak_pool_free()` entry point
4. Implement empty page DONTNEED
**Validation**: Single-threaded free works
### Phase 5: Multi-Thread Testing (3-4h)
**Tasks**:
1. Test with 2 threads (cross-thread frees)
2. Test with 4 threads (full contention)
3. Fix any races or deadlocks
4. ThreadSanitizer validation
**Validation**: larson benchmark runs without crashes
### Phase 6: Optimization & Tuning (3-5h)
**Tasks**:
1. Optimize page allocation (batch allocate?)
2. Optimize remote drain (batch drain?)
3. Tune page registry size
4. Profile and fix hotspots
**Validation**: Performance meets expectations
---
## Expected Performance
### Baseline (P6.24)
```
Mid 1T: 4.03 M/s
Mid 4T: 13.78 M/s (3.42x scaling)
```
### Target (MF2)
```
Mid 1T: 5.0-6.0 M/s (+24-49%) [No lock overhead]
Mid 4T: 20.0-22.0 M/s (+45-60%) [Zero contention]
```
**vs mimalloc (29.50 M/s)**:
- Expected: 68-75% of mimalloc
- **Success criterion**: >60% (17.70 M/s)
### Why This Will Work
1. **Contention Elimination**: Each thread accesses its own pages
   - Current: 4 threads → 1 freelist → 75% waiting
   - MF2: 4 threads → 4 pages → 0% waiting
2. **Cache Locality**: Page-local data stays in cache
   - Current: Freelist bounces between threads
   - MF2: Page metadata stays in the owner's cache
3. **Fast Path Optimization**: No synchronization for the owner
   - Current: mutex lock/unlock on every allocation
   - MF2: simple pointer manipulation
---
## Risks & Mitigation
### Risk 1: Page Registry Conflicts
**Risk**: Hash collisions in page registry
**Mitigation**:
- Use large registry (64K entries = low collision rate)
- Handle collisions with chaining if needed
- Monitor collision rate, resize if >5%
### Risk 2: Memory Fragmentation
**Risk**: Each thread allocates pages independently → fragmentation
**Mitigation**:
- Page reuse when thread exits (return pages to global pool)
- Periodic empty page cleanup
- Monitor RSS, acceptable if <20% overhead
### Risk 3: Cross-Thread Free Overhead
**Risk**: Remote stack drain might be slow
**Mitigation**:
- Batch drain (drain multiple blocks at once)
- Adaptive drain frequency
- Keep remote stack lock-free
### Risk 4: Implementation Bugs
**Risk**: Complex concurrency, hard to debug
**Mitigation**:
- Incremental implementation (phase by phase)
- Extensive testing (1T, 2T, 4T, 8T)
- ThreadSanitizer validation
- Fallback plan: revert if unfixable bugs
---
## Success Criteria
### Must-Have (P0)
- ✅ Compiles and runs without crashes
- ✅ larson benchmark completes (1T, 4T)
- ✅ No memory leaks (valgrind clean)
- ✅ ThreadSanitizer clean (no data races)
- ✅ Mid 4T > 17.70 M/s (60% of mimalloc)
### Should-Have (P1)
- ✅ Mid 4T > 20.0 M/s (68% of mimalloc)
- ✅ Mid 1T > 5.0 M/s (25% improvement)
- ✅ RSS overhead < 20%
### Nice-to-Have (P2)
- Mid 4T > 22.0 M/s (75% of mimalloc)
- Full suite performance improvement
- Documentation complete
---
## Timeline
| Phase | Duration | Cumulative |
|-------|----------|------------|
| Phase 1: Data Structures | 4-6h | 4-6h |
| Phase 2: Page Allocation | 3-4h | 7-10h |
| Phase 3: Allocation Path | 4-6h | 11-16h |
| Phase 4: Free Path | 3-4h | 14-20h |
| Phase 5: Multi-Thread Testing | 3-4h | 17-24h |
| Phase 6: Optimization | 3-5h | 20-29h |
**Total**: 20-30 hours (2.5-4 working days)
---
## Next Steps
1. **Review Plan**: Get user approval
2. **Phase 1 Start**: Data structures
3. **Incremental Build**: Test after each phase
4. **Benchmark Early**: Don't wait until end
**Let's crush mimalloc!** 🔥
---
**Status**: Plan complete, ready to implement ✅
**Confidence**: High (proven approach, incremental plan)
**Expected Outcome**: 60-75% of mimalloc (SUCCESS!)