Phase 7.2 MF2: Per-Page Sharding Implementation Plan

Date: 2025-10-24
Goal: Eliminate shared freelists by implementing per-page sharding (the mimalloc approach)
Expected: +50% improvement (13.78 M/s → 20.7 M/s)
Effort: 20-30 hours
Risk: Medium (major architectural change)


Executive Summary

Problem: Current Mid Pool uses shared freelists (7 classes × 8 shards = 56), causing lock contention regardless of locking mechanism (mutex or lock-free).

Solution: Per-Page Sharding - Each 64KB page has its own independent freelist. No sharing = no contention.

Key Insight: The bottleneck is SHARING, not the locking mechanism. Eliminate sharing, eliminate contention.


Current Architecture Problems

Shared Freelist Design

Current (P6.24):
┌─────────────────────────────────────────┐
│ Global Freelists (56 total)            │
│ ├─ Class 0 (2KB): 8 shards × mutex     │
│ ├─ Class 1 (4KB): 8 shards × mutex     │
│ ├─ Class 2 (8KB): 8 shards × mutex     │
│ └─ ...                                  │
└─────────────────────────────────────────┘
        ↑ ↑ ↑ ↑
        │ │ │ │ 4 threads competing
        └─┴─┴─┘ → Lock contention

Problems:

  1. Lock Contention: 4 threads → 1 freelist → serialized access
  2. Cache Line Bouncing: Mutex or atomic operations bounce cache lines
  3. No Scalability: More threads = worse contention

Lock-Free Failed (P7.1)

We tried lock-free CAS operations:

  • Result: -6.6% regression
  • Reason: CAS contention plus retry overhead exceeded the mutex contention it replaced
  • Lesson: lock-free primitives cannot fix a fundamental sharing problem

MF2 Approach: Per-Page Sharding

Core Concept

mimalloc's Secret Sauce: O(1) page lookup from block address

// Magic: Block address → Page (bitwise AND)
PageDesc* page = addr_to_page(ptr);  // (ptr & ~0xFFFF)

This enables:

  1. Each page has independent freelist (no sharing!)
  2. O(1) page lookup (no hash table!)
  3. Owner-based optimization (fast path for owner thread)

New Architecture

MF2 Per-Page Design:
┌─────────────────────────────────────────┐
│ Thread 1 Pages                          │
│ ├─ Page A (2KB): freelist [no lock]    │
│ ├─ Page B (4KB): freelist [no lock]    │
│ └─ ...                                  │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Thread 2 Pages                          │
│ ├─ Page C (2KB): freelist [no lock]    │
│ ├─ Page D (8KB): freelist [no lock]    │
│ └─ ...                                  │
└─────────────────────────────────────────┘

Each thread accesses its own pages → ZERO contention!

Data Structures

Page Descriptor (New)

// Per-page metadata (aligned to 64KB boundary)
typedef struct MidPage {
    // Page identity
    void* base;              // Page base address (64KB aligned)
    uint8_t class_idx;       // Size class (0-6)
    uint8_t _pad[3];

    // Ownership
    uint64_t owner_tid;      // Owner thread ID

    // Freelist (page-local, no sharing!)
    PoolBlock* freelist;     // Local freelist (owner-only, no lock!)
    uint16_t free_count;     // Number of free blocks
    uint16_t capacity;       // Total blocks per page

    // Remote frees (cross-thread, lock-free stack)
    atomic_uintptr_t remote_head;   // Lock-free MPSC stack
    atomic_uint remote_count;       // Count for quick check

    // Lifecycle
    atomic_int in_use;       // Live allocations (for empty detection)
    atomic_int pending_dn;   // DONTNEED queued flag

    // Linkage
    struct MidPage* next_page;  // Next page in thread's page list
} MidPage;

Thread-Local Page Lists

// Per-thread page lists (one per class)
typedef struct ThreadPages {
    MidPage* active_page[POOL_NUM_CLASSES];  // Current page with free blocks
    MidPage* full_pages[POOL_NUM_CLASSES];   // Full pages (no free blocks)
    int page_count[POOL_NUM_CLASSES];        // Total pages owned
} ThreadPages;

static __thread ThreadPages* t_pages = NULL;

Global Page Registry

// Global registry for cross-thread free (O(1) lookup)
#define PAGE_REGISTRY_BITS 16  // 64K entries (covers 4GB with 64KB pages)
#define PAGE_REGISTRY_SIZE (1 << PAGE_REGISTRY_BITS)

typedef struct {
    MidPage* pages[PAGE_REGISTRY_SIZE];  // Indexed by (addr >> 16) & 0xFFFF
    pthread_mutex_t locks[256];          // Coarse-grained locks for rare updates
} PageRegistry;

static PageRegistry g_page_registry;

Core Algorithms

Allocation Fast Path (Owner Thread)

void* mid_alloc_fast(int class_idx) {
    MidPage* page = t_pages->active_page[class_idx];

    // Fast path: Pop from page-local freelist (NO LOCK!)
    if (page && page->freelist) {
        PoolBlock* block = page->freelist;
        page->freelist = block->next;
        page->free_count--;
        atomic_fetch_add_explicit(&page->in_use, 1, memory_order_relaxed);
        return (char*)block + HEADER_SIZE;
    }

    // Slow path: Drain remote frees or allocate new page
    return mid_alloc_slow(class_idx);
}

// Public entry point converts size to a class index once:
void* mid_alloc(size_t size) {
    return mid_alloc_fast(size_to_class(size));
}

Key Point: NO mutex, NO CAS in fast path (owner-only access)!

Allocation Slow Path

void* mid_alloc_slow(int class_idx) {
    MidPage* page = t_pages->active_page[class_idx];

    // Try to drain remote frees (lock-free pop)
    if (page && page->remote_count > 0) {
        drain_remote_frees(page);
        if (page->freelist) {
            // Retry fast path
            return mid_alloc_fast(class_idx);
        }
    }

    // Allocate new page
    page = alloc_new_page(class_idx);
    if (!page) return NULL;  // OOM

    // Register page in global registry
    register_page(page);

    // Set as active page
    t_pages->active_page[class_idx] = page;

    // Retry allocation
    return mid_alloc_fast(class_idx);
}

Free Fast Path (Owner Thread)

void mid_free_fast(void* ptr) {
    // O(1) page lookup (bitwise AND)
    MidPage* page = addr_to_page(ptr);  // (ptr & ~0xFFFF)

    // Check if we're the owner (fast path)
    if (page->owner_tid == my_tid()) {
        // Fast: Push to page-local freelist (NO LOCK!)
        PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
        block->next = page->freelist;
        page->freelist = block;
        page->free_count++;

        // Decrement in-use, enqueue DONTNEED if empty
        int nv = atomic_fetch_sub_explicit(&page->in_use, 1, memory_order_release) - 1;
        if (nv == 0) {
            enqueue_dontneed(page);
        }
        return;
    }

    // Slow path: Cross-thread free
    mid_free_slow(page, ptr);
}

Free Slow Path (Cross-Thread)

void mid_free_slow(MidPage* page, void* ptr) {
    // Push to page's remote stack (lock-free MPSC)
    PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
    uintptr_t old_head = atomic_load_explicit(&page->remote_head,
                                              memory_order_relaxed);
    do {
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        &page->remote_head, &old_head, (uintptr_t)block,
        memory_order_release, memory_order_relaxed));

    atomic_fetch_add_explicit(&page->remote_count, 1, memory_order_relaxed);

    // Decrement in-use
    int nv = atomic_fetch_sub_explicit(&page->in_use, 1, memory_order_release) - 1;
    if (nv == 0) {
        enqueue_dontneed(page);
    }
}

Page Lookup (O(1))

// Ultra-fast page lookup using address arithmetic
static inline MidPage* addr_to_page(void* addr) {
    // Assume 64KB pages, aligned to 64KB boundary
    void* page_base = (void*)((uintptr_t)addr & ~0xFFFFULL);

    // Index into registry
    size_t idx = ((uintptr_t)page_base >> 16) & (PAGE_REGISTRY_SIZE - 1);

    // Direct lookup; distinct pages whose addresses differ by a multiple
    // of 4GB alias the same slot (see Risk 1: registry conflicts)
    return g_page_registry.pages[idx];
}

Implementation Phases

Phase 1: Data Structures (4-6h)

Tasks:

  1. Define MidPage struct
  2. Define ThreadPages struct
  3. Define PageRegistry struct
  4. Initialize global page registry
  5. Add TLS for thread pages

Validation: Compiles, structures allocated correctly

Phase 2: Page Allocation (3-4h)

Tasks:

  1. Implement alloc_new_page(class_idx)
  2. Implement register_page(page)
  3. Implement addr_to_page(ptr) lookup
  4. Initialize page freelist (build block chain)

Validation: Can allocate pages, lookup works

Phase 3: Allocation Path (4-6h)

Tasks:

  1. Implement mid_alloc_fast() (owner-only)
  2. Implement mid_alloc_slow() (drain + new page)
  3. Implement drain_remote_frees(page)
  4. Update hak_pool_try_alloc() entry point

Validation: Single-threaded allocation works

Phase 4: Free Path (3-4h)

Tasks:

  1. Implement mid_free_fast() (owner-only)
  2. Implement mid_free_slow() (cross-thread)
  3. Update hak_pool_free() entry point
  4. Implement empty page DONTNEED

Validation: Single-threaded free works

Phase 5: Multi-Thread Testing (3-4h)

Tasks:

  1. Test with 2 threads (cross-thread frees)
  2. Test with 4 threads (full contention)
  3. Fix any races or deadlocks
  4. ThreadSanitizer validation

Validation: larson benchmark runs without crashes

Phase 6: Optimization & Tuning (3-5h)

Tasks:

  1. Optimize page allocation (batch allocate?)
  2. Optimize remote drain (batch drain?)
  3. Tune page registry size
  4. Profile and fix hotspots

Validation: Performance meets expectations


Expected Performance

Baseline (P6.24)

Mid 1T: 4.03 M/s
Mid 4T: 13.78 M/s  (3.42x scaling)

Target (MF2)

Mid 1T: 5.0-6.0 M/s    (+24-49%)  [No lock overhead]
Mid 4T: 20.0-22.0 M/s  (+45-60%)  [Zero contention]

vs mimalloc (29.50 M/s):

  • Expected: 68-75% of mimalloc
  • Success criterion: >60% (17.70 M/s)

Why This Will Work

  1. Contention Elimination: Each thread accesses own pages

    • Current: 4 threads → 1 freelist → 75% waiting
    • MF2: 4 threads → 4 pages → 0% waiting
  2. Cache Locality: Page-local data stays in cache

    • Current: Freelist bounces between threads
    • MF2: Page metadata stays in owner's cache
  3. Fast Path Optimization: No synchronization for owner

    • Current: mutex lock/unlock every allocation
    • MF2: simple pointer manipulation

Risks & Mitigation

Risk 1: Page Registry Conflicts

Risk: Hash collisions in page registry

Mitigation:

  • Use large registry (64K entries = low collision rate)
  • Handle collisions with chaining if needed
  • Monitor collision rate, resize if >5%

Risk 2: Memory Fragmentation

Risk: Each thread allocates pages independently → fragmentation

Mitigation:

  • Page reuse when thread exits (return pages to global pool)
  • Periodic empty page cleanup
  • Monitor RSS, acceptable if <20% overhead

Risk 3: Cross-Thread Free Overhead

Risk: Remote stack drain might be slow

Mitigation:

  • Batch drain (drain multiple blocks at once)
  • Adaptive drain frequency
  • Keep remote stack lock-free

Risk 4: Implementation Bugs

Risk: Complex concurrency, hard to debug

Mitigation:

  • Incremental implementation (phase by phase)
  • Extensive testing (1T, 2T, 4T, 8T)
  • ThreadSanitizer validation
  • Fallback plan: revert if unfixable bugs

Success Criteria

Must-Have (P0)

  • Compiles and runs without crashes
  • larson benchmark completes (1T, 4T)
  • No memory leaks (valgrind clean)
  • ThreadSanitizer clean (no data races)
  • Mid 4T > 17.70 M/s (60% of mimalloc)

Should-Have (P1)

  • Mid 4T > 20.0 M/s (68% of mimalloc)
  • Mid 1T > 5.0 M/s (25% improvement)
  • RSS overhead < 20%

Nice-to-Have (P2)

  • Mid 4T > 22.0 M/s (75% of mimalloc)
  • Full suite performance improvement
  • Documentation complete

Timeline

| Phase | Duration | Cumulative |
|-------|----------|------------|
| Phase 1: Data Structures | 4-6h | 4-6h |
| Phase 2: Page Allocation | 3-4h | 7-10h |
| Phase 3: Allocation Path | 4-6h | 11-16h |
| Phase 4: Free Path | 3-4h | 14-20h |
| Phase 5: Multi-Thread Testing | 3-4h | 17-24h |
| Phase 6: Optimization | 3-5h | 20-29h |

Total: 20-30 hours (2.5-4 working days)


Next Steps

  1. Review Plan: Get user approval
  2. Phase 1 Start: Data structures
  3. Incremental Build: Test after each phase
  4. Benchmark Early: Don't wait until end

Let's crush mimalloc! 🔥


Status: Plan complete, ready to implement
Confidence: High (proven approach, incremental plan)
Expected Outcome: 60-75% of mimalloc (SUCCESS!)