Phase 7.2 MF2: Per-Page Sharding Implementation Plan

Date: 2025-10-24
Goal: Eliminate shared freelists by implementing per-page sharding (the mimalloc approach)
Expected: +50% improvement (13.78 M/s → 20.7 M/s)
Effort: 20-30 hours
Risk: Medium (major architectural change)


Executive Summary

Problem: Current Mid Pool uses shared freelists (7 classes × 8 shards = 56), causing lock contention regardless of locking mechanism (mutex or lock-free).

Solution: Per-Page Sharding - Each 64KB page has its own independent freelist. No sharing = no contention.

Key Insight: The bottleneck is SHARING, not the locking mechanism. Eliminate sharing, eliminate contention.


Current Architecture Problems

Shared Freelist Design

Current (P6.24):
┌─────────────────────────────────────────┐
│ Global Freelists (56 total)            │
│ ├─ Class 0 (2KB): 8 shards × mutex     │
│ ├─ Class 1 (4KB): 8 shards × mutex     │
│ ├─ Class 2 (8KB): 8 shards × mutex     │
│ └─ ...                                  │
└─────────────────────────────────────────┘
        ↑ ↑ ↑ ↑
        │ │ │ │ 4 threads competing
        └─┴─┴─┘ → Lock contention

Problems:

  1. Lock Contention: 4 threads → 1 freelist → serialized access
  2. Cache Line Bouncing: Mutex or atomic operations bounce cache lines
  3. No Scalability: More threads = worse contention

Lock-Free Failed (P7.1)

We tried lock-free CAS operations:

  • Result: -6.6% regression
  • Reason: CAS contention plus retry overhead exceeded the mutex contention it replaced
  • Lesson: lock-free primitives cannot fix a fundamental sharing problem

MF2 Approach: Per-Page Sharding

Core Concept

mimalloc's Secret Sauce: O(1) page lookup from block address

// Magic: Block address → Page (bitwise AND)
PageDesc* page = addr_to_page(ptr);  // (ptr & ~0xFFFF)

This enables:

  1. Each page has independent freelist (no sharing!)
  2. O(1) page lookup (no hash table!)
  3. Owner-based optimization (fast path for owner thread)

New Architecture

MF2 Per-Page Design:
┌─────────────────────────────────────────┐
│ Thread 1 Pages                          │
│ ├─ Page A (2KB): freelist [no lock]    │
│ ├─ Page B (4KB): freelist [no lock]    │
│ └─ ...                                  │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Thread 2 Pages                          │
│ ├─ Page C (2KB): freelist [no lock]    │
│ ├─ Page D (8KB): freelist [no lock]    │
│ └─ ...                                  │
└─────────────────────────────────────────┘

Each thread accesses its own pages → ZERO contention!

Data Structures

Page Descriptor (New)

// Per-page metadata (aligned to 64KB boundary)
typedef struct MidPage {
    // Page identity
    void* base;              // Page base address (64KB aligned)
    uint8_t class_idx;       // Size class (0-6)
    uint8_t _pad[3];

    // Ownership
    uint64_t owner_tid;      // Owner thread ID

    // Freelist (page-local, no sharing!)
    PoolBlock* freelist;     // Local freelist (owner-only, no lock!)
    uint16_t free_count;     // Number of free blocks
    uint16_t capacity;       // Total blocks per page

    // Remote frees (cross-thread, lock-free stack)
    atomic_uintptr_t remote_head;   // Lock-free MPSC stack
    atomic_uint remote_count;       // Count for quick check

    // Lifecycle
    atomic_int in_use;       // Live allocations (for empty detection)
    atomic_int pending_dn;   // DONTNEED queued flag

    // Linkage
    struct MidPage* next_page;  // Next page in thread's page list
} MidPage;

Thread-Local Page Lists

// Per-thread page lists (one per class)
typedef struct ThreadPages {
    MidPage* active_page[POOL_NUM_CLASSES];  // Current page with free blocks
    MidPage* full_pages[POOL_NUM_CLASSES];   // Full pages (no free blocks)
    int page_count[POOL_NUM_CLASSES];        // Total pages owned
} ThreadPages;

static __thread ThreadPages* t_pages = NULL;

Global Page Registry

// Global registry for cross-thread free (O(1) lookup)
#define PAGE_REGISTRY_BITS 16  // 64K entries (covers 4GB with 64KB pages)
#define PAGE_REGISTRY_SIZE (1 << PAGE_REGISTRY_BITS)

typedef struct {
    MidPage* pages[PAGE_REGISTRY_SIZE];  // Indexed by (addr >> 16) & 0xFFFF
    pthread_mutex_t locks[256];          // Coarse-grained locks for rare updates
} PageRegistry;

static PageRegistry g_page_registry;

Core Algorithms

Allocation Fast Path (Owner Thread)

void* mid_alloc_fast(int class_idx) {
    MidPage* page = t_pages->active_page[class_idx];

    // Fast path: Pop from page-local freelist (NO LOCK!)
    if (page && page->freelist) {
        PoolBlock* block = page->freelist;
        page->freelist = block->next;
        page->free_count--;
        atomic_fetch_add_explicit(&page->in_use, 1, memory_order_relaxed);
        return (char*)block + HEADER_SIZE;
    }

    // Slow path: Drain remote frees or allocate new page
    return mid_alloc_slow(class_idx);
}

// Public entry point converts size to a class index once:
void* mid_alloc(size_t size) {
    return mid_alloc_fast(size_to_class(size));
}

Key Point: NO mutex, NO CAS in fast path (owner-only access)!

Allocation Slow Path

void* mid_alloc_slow(int class_idx) {
    MidPage* page = t_pages->active_page[class_idx];

    // Try to drain remote frees (lock-free pop)
    if (page && page->remote_count > 0) {
        drain_remote_frees(page);
        if (page->freelist) {
            // Retry fast path
            return mid_alloc_fast(class_idx);
        }
    }

    // Allocate new page
    page = alloc_new_page(class_idx);
    if (!page) return NULL;  // OOM

    // Register page in global registry
    register_page(page);

    // Set as active page
    t_pages->active_page[class_idx] = page;

    // Retry allocation
    return mid_alloc_fast(class_idx);
}

Free Fast Path (Owner Thread)

void mid_free_fast(void* ptr) {
    // O(1) page lookup (bitwise AND)
    MidPage* page = addr_to_page(ptr);  // (ptr & ~0xFFFF)

    // Check if we're the owner (fast path)
    if (page->owner_tid == my_tid()) {
        // Fast: Push to page-local freelist (NO LOCK!)
        PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
        block->next = page->freelist;
        page->freelist = block;
        page->free_count++;

        // Decrement in-use, enqueue DONTNEED if empty
        int nv = atomic_fetch_sub_explicit(&page->in_use, 1, memory_order_release) - 1;
        if (nv == 0) {
            enqueue_dontneed(page);
        }
        return;
    }

    // Slow path: Cross-thread free
    mid_free_slow(page, ptr);
}

Free Slow Path (Cross-Thread)

void mid_free_slow(MidPage* page, void* ptr) {
    // Push to page's remote stack (lock-free MPSC)
    PoolBlock* block = (PoolBlock*)((char*)ptr - HEADER_SIZE);
    uintptr_t old_head = atomic_load_explicit(&page->remote_head,
                                              memory_order_relaxed);
    do {
        block->next = (PoolBlock*)old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        &page->remote_head, &old_head, (uintptr_t)block,
        memory_order_release, memory_order_relaxed));

    atomic_fetch_add_explicit(&page->remote_count, 1, memory_order_relaxed);

    // Decrement in-use
    int nv = atomic_fetch_sub_explicit(&page->in_use, 1, memory_order_release) - 1;
    if (nv == 0) {
        enqueue_dontneed(page);
    }
}

Page Lookup (O(1))

// Ultra-fast page lookup using address arithmetic
static inline MidPage* addr_to_page(void* addr) {
    // Assume 64KB pages, aligned to 64KB boundary
    void* page_base = (void*)((uintptr_t)addr & ~0xFFFFULL);

    // Index into registry
    size_t idx = ((uintptr_t)page_base >> 16) & (PAGE_REGISTRY_SIZE - 1);

    // Direct lookup; distinct pages whose addresses differ by a multiple
    // of 4GB alias the same slot (see Risk 1: registry conflicts)
    return g_page_registry.pages[idx];
}

Implementation Phases

Phase 1: Data Structures (4-6h)

Tasks:

  1. Define MidPage struct
  2. Define ThreadPages struct
  3. Define PageRegistry struct
  4. Initialize global page registry
  5. Add TLS for thread pages

Validation: Compiles, structures allocated correctly

Phase 2: Page Allocation (3-4h)

Tasks:

  1. Implement alloc_new_page(class_idx)
  2. Implement register_page(page)
  3. Implement addr_to_page(ptr) lookup
  4. Initialize page freelist (build block chain)

Validation: Can allocate pages, lookup works

Phase 3: Allocation Path (4-6h)

Tasks:

  1. Implement mid_alloc_fast() (owner-only)
  2. Implement mid_alloc_slow() (drain + new page)
  3. Implement drain_remote_frees(page)
  4. Update hak_pool_try_alloc() entry point

Validation: Single-threaded allocation works

Phase 4: Free Path (3-4h)

Tasks:

  1. Implement mid_free_fast() (owner-only)
  2. Implement mid_free_slow() (cross-thread)
  3. Update hak_pool_free() entry point
  4. Implement empty page DONTNEED

Validation: Single-threaded free works

Phase 5: Multi-Thread Testing (3-4h)

Tasks:

  1. Test with 2 threads (cross-thread frees)
  2. Test with 4 threads (full contention)
  3. Fix any races or deadlocks
  4. ThreadSanitizer validation

Validation: larson benchmark runs without crashes

Phase 6: Optimization & Tuning (3-5h)

Tasks:

  1. Optimize page allocation (batch allocate?)
  2. Optimize remote drain (batch drain?)
  3. Tune page registry size
  4. Profile and fix hotspots

Validation: Performance meets expectations


Expected Performance

Baseline (P6.24)

Mid 1T: 4.03 M/s
Mid 4T: 13.78 M/s  (3.42x scaling)

Target (MF2)

Mid 1T: 5.0-6.0 M/s    (+24-49%)  [No lock overhead]
Mid 4T: 20.0-22.0 M/s  (+45-60%)  [Zero contention]

vs mimalloc (29.50 M/s):

  • Expected: 68-75% of mimalloc
  • Success criterion: >60% (17.70 M/s)

Why This Will Work

  1. Contention Elimination: Each thread accesses own pages

    • Current: 4 threads → 1 freelist → 75% waiting
    • MF2: 4 threads → 4 pages → 0% waiting
  2. Cache Locality: Page-local data stays in cache

    • Current: Freelist bounces between threads
    • MF2: Page metadata stays in owner's cache
  3. Fast Path Optimization: No synchronization for owner

    • Current: mutex lock/unlock every allocation
    • MF2: simple pointer manipulation

Risks & Mitigation

Risk 1: Page Registry Conflicts

Risk: Hash collisions in page registry

Mitigation:

  • Use large registry (64K entries = low collision rate)
  • Handle collisions with chaining if needed
  • Monitor collision rate, resize if >5%

Risk 2: Memory Fragmentation

Risk: Each thread allocates pages independently → fragmentation

Mitigation:

  • Page reuse when thread exits (return pages to global pool)
  • Periodic empty page cleanup
  • Monitor RSS, acceptable if <20% overhead

Risk 3: Cross-Thread Free Overhead

Risk: Remote stack drain might be slow

Mitigation:

  • Batch drain (drain multiple blocks at once)
  • Adaptive drain frequency
  • Keep remote stack lock-free

Risk 4: Implementation Bugs

Risk: Complex concurrency, hard to debug

Mitigation:

  • Incremental implementation (phase by phase)
  • Extensive testing (1T, 2T, 4T, 8T)
  • ThreadSanitizer validation
  • Fallback plan: revert if unfixable bugs

Success Criteria

Must-Have (P0)

  • Compiles and runs without crashes
  • larson benchmark completes (1T, 4T)
  • No memory leaks (valgrind clean)
  • ThreadSanitizer clean (no data races)
  • Mid 4T > 17.70 M/s (60% of mimalloc)

Should-Have (P1)

  • Mid 4T > 20.0 M/s (68% of mimalloc)
  • Mid 1T > 5.0 M/s (25% improvement)
  • RSS overhead < 20%

Nice-to-Have (P2)

  • Mid 4T > 22.0 M/s (75% of mimalloc)
  • Full suite performance improvement
  • Documentation complete

Timeline

| Phase | Duration | Cumulative |
|-------|----------|------------|
| Phase 1: Data Structures | 4-6h | 4-6h |
| Phase 2: Page Allocation | 3-4h | 7-10h |
| Phase 3: Allocation Path | 4-6h | 11-16h |
| Phase 4: Free Path | 3-4h | 14-20h |
| Phase 5: Multi-Thread Testing | 3-4h | 17-24h |
| Phase 6: Optimization | 3-5h | 20-29h |

Total: 20-30 hours (2.5-4 working days)


Next Steps

  1. Review Plan: Get user approval
  2. Phase 1 Start: Data structures
  3. Incremental Build: Test after each phase
  4. Benchmark Early: Don't wait until end

Let's crush mimalloc! 🔥


Status: Plan complete, ready to implement
Confidence: High (proven approach, incremental plan)
Expected Outcome: 60-75% of mimalloc (SUCCESS!)