# Phase 7.2 MF2: Implementation Progress

**Date**: 2025-10-24
**Status**: In Progress - Fixing Pending Queue Drain Issue
**Current**: Implementing Global Round-Robin Strategy

---

## Summary

The core MF2 Per-Page Sharding implementation is complete, but a structural problem was found in the pending queue drain mechanism. In the Larson benchmark, each thread allocates and frees within its own dedicated array range, so cross-thread frees are nearly zero. As a result, enqueues to the pending queue succeed (69K pages), but drains never happen (0 drains).

Detailed analysis by the Task agent identified the root cause:
- Each thread only looks at the pending queue in **its own TLS**
- In Larson, each thread allocs/frees its own blocks → its own pending queue is empty
- Pages accumulated in other threads' pending queues are **never processed**

---

## Implementation Timeline

### Phase 1-4: Core Implementation ✅

**Commits**:
- `0855b37` - Phase 1: Data structures
- `5c4b780` - Phase 2: Page allocation
- `b12f58c` - Phase 3: Allocation path
- `7e756c6` - Phase 4: Free path

**Status**: Complete

---

### Phase 5: Bug Fixes (Fix #1-6) ✅

#### Fix #1: Block Spacing Bug (`54609c1`)

**Problem**: Infinite loop on first test

**Root Cause**:
```c
size_t block_size = g_class_sizes[class_idx];  // Missing HEADER_SIZE
```

**Fix**: `block_size = HEADER_SIZE + user_size;`

**Result**: Test completes instead of hanging

---

#### Fix #2-3: Performance Optimizations (`aa869b9`)

**Changes**:
- Removed 64KB memset (switched from posix_memalign to mmap)
- Removed O(N) eager drain scan
- Reduced scan limit from 256 to 8

**Result**: 27.5K → 110K ops/s (4x improvement on 4T)

---

#### Fix #4: Alignment Bug (`9e64f7e`) - CRITICAL

**Problem**: 97% of frees were silently dropped!

**Root Cause**:
- mmap() only guarantees 4KB alignment
- `addr_to_page()` assumes 64KB alignment
- Lookup fails: `(ptr & ~0xFFFF)` rounds to the wrong page base

**Fix**: Changed to `posix_memalign(&page_base, 65536, POOL_PAGE_SIZE)`

**Verification** (by Task agent):
```
Pages allocated:     101,093
Alignment bugs:      0 (ZERO!)
Registry collisions: 0 (ZERO!)
Lookup success rate: 98%
```

**Side Effect**: Performance degraded (466K → 54K) because the memset overhead returned

---

#### Fix #5: Active Page Drain Attempt (`9e64f7e`)

**Change**: Check active_page for remote frees before allocating a new page

**Result**: No improvement (remote drains still 0)

---

#### Fix #6: Memory Ordering (`b0768b3`)

**Problem**: All remote_count operations used `memory_order_relaxed`

**Fix**: Changed 7 locations to `seq_cst/acquire/release` (see the sketch at the end of this subsection)

**Result**: Memory ordering is now correct, but performance still did not improve

**Root Cause Discovery** (by Task agent):
- Debug instrumentation revealed that drain checks and remote frees target **DIFFERENT page objects**
- Thread A's pages live in Thread A's tp->active_page/full_pages
- Thread B frees to Thread A's pages → remote_count++
- Thread B's slow path checks Thread B's pages only
- Result: Thread A's pages (with remote_count > 0) are never checked by anyone!
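For reference, the ordering Fix #6 converged on looks roughly like the following. This is a minimal sketch, not the actual hakmem_pool.c code: the struct layout and function names here are illustrative stand-ins.

```c
#include <stdatomic.h>
#include <stddef.h>

// Sketch only: illustrative MidPage fields, not the real layout.
typedef struct {
    _Atomic(void*) remote_head;   // lock-free stack of remote-freed blocks
    _Atomic size_t remote_count;  // blocks awaiting drain
} MidPageSketch;

// Remote-free side: publish the block first, then bump remote_count with
// release so the drain side's acquire load also sees the pushed block.
static void remote_free_publish(MidPageSketch* page, void* block) {
    void* old = atomic_load_explicit(&page->remote_head, memory_order_relaxed);
    do {
        *(void**)block = old;  // link the block onto the stack
    } while (!atomic_compare_exchange_weak_explicit(
                 &page->remote_head, &old, block,
                 memory_order_release, memory_order_relaxed));
    atomic_fetch_add_explicit(&page->remote_count, 1, memory_order_release);
}

// Drain side: acquire pairs with the release above - if count > 0 is
// observed, the blocks behind remote_head are visible as well.
static int drain_is_pending(MidPageSketch* page) {
    return atomic_load_explicit(&page->remote_count, memory_order_acquire) > 0;
}
```

With `relaxed` everywhere, a drainer could observe `remote_count > 0` yet read a stale `remote_head`; the acquire/release pairing closes that window. As Fix #6 showed, though, correct ordering does not help if no thread ever looks at the right page.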
---

### Phase 2: Pending Queue Implementation (`89541fc`) ✅

**Implementation** (by Task agent):
- **Box 1**: Data structures - added owner_tp, in_remote_pending, next_pending to MidPage
- **Box 2**: MPSC lock-free queue operations (mf2_enqueue_pending, mf2_dequeue_pending)
- **Box 3**: 0→1 edge detection in mf2_free_slow()
- **Box 4**: Allocation slow path drain (up to 4 pages per allocation)
- **Box 5**: Opportunistic drain (every 16th owner free)
- **Box 6**: Comprehensive debug logging and statistics

**Test Results**:
```
Pending enqueued: 43,138 ✅
Pending drained:  0      ❌
```

**Analysis** (by Task agent):
- The implementation is correct
- Problem: the Larson benchmark allocates all pages early and frees them later
- By the time remote frees arrive, owner threads no longer allocate
- Slow path never called → pending queue never processed
- This is a workload mismatch, not an implementation bug

---

### Tuning: Opportunistic Drain Frequency (`a6eb666`) ✅

**Change**: Increased from every 16th to every 4th free (4x more aggressive)

**Test Results** (larson 10 2-32K 10s 4T):
```
Pending enqueued: 52,912 ✅
Pending drained:  0      ❌
Throughput:       53K ops/s
```

**Conclusion**: Frequency tuning didn't help - the workload-pattern issue persists

---

### Option 1: free_slow Drain Addition ❌

**Concept**: Add opportunistic drain to both `free_fast()` and `free_slow()`

**Implementation**:
- Created `mf2_maybe_drain_pending()` helper
- Called from both free_fast() (Line 1115) and free_slow() (Line 1167)

**Test Results**:
```
Pending enqueued: 76,733 ✅
Pending drained:  0      ❌
OPP_DRAIN_TRY:    10 attempts (all from tp=0x55828805f7a0)
Throughput:       27,890 ops/s
```

**Problem**: All drain attempts come from the same thread - the other 3 threads never appear

---

### Option C: alloc_slow Drain Addition ❌

**Concept**: Drain before allocating a new page (the owner thread is allocating continuously)

**Implementation**: Added `mf2_maybe_drain_pending()` at Line 1021 (before `mf2_alloc_new_page()`)

**Test Results**:
```
Pending enqueued: 69,702 ✅
Pending drained:  0      ❌
OPP_DRAIN_TRY:    10 attempts (all from tp=0x559146bb17a0)
Throughput:       27,965 ops/s
```

**Conclusion**: Still 0 drains - the same-thread issue persists

---

## Root Cause Analysis (by Task Agent)

### Larson Benchmark Characteristics

```cpp
// larson.cpp: exercise_heap()
for (cblks = 0; cblks < pdea->NumBlocks; cblks++) {
    victim = lran2(&pdea->rgen) % pdea->asize;            // Own array range
    CUSTOM_FREE(pdea->array[victim]);                     // Free own allocation
    pdea->array[victim] = (char*)CUSTOM_MALLOC(blk_size); // Same slot
}

// Array partitioning (Line 481):
de_area[i].array = &blkp[i*nperthread];  // Each thread owns a separate range
```

**Key Finding**: Each thread allocates/frees from its own array range
- Thread 0: `array[0..999]`
- Thread 1: `array[1000..1999]`
- Thread 2: `array[2000..2999]`
- Thread 3: `array[3000..3999]`

**Result**: **Cross-thread frees are almost ZERO**

### MF2 Design vs Larson Mismatch

**MF2 Assumption**:
```
4 threads freeing → all threads call mf2_free() → all threads drain pending
```

**Larson Reality**:
```
1 thread does most freeing → only 1 thread drains pending
Other threads allocate-only → never drain their own pending queues
```

**Problem**:
```c
void mf2_maybe_drain_pending(void) {
    MF2_ThreadPages* tp = mf2_thread_pages_get();           // ← Own TLS only!
    MidPage* pending = mf2_dequeue_pending(tp, class_idx);  // ← Own pending only!
}
```
- Thread A drains → checks Thread A's TLS → Thread A's pending queue is empty
- Thread B/C/D's pending queues (with 69K pages!) are **never checked**
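For context, a minimal sketch of the Box 2 queue pair referenced throughout this analysis: an MPSC Treiber-style stack per (owner, class), plus the Box 3 idea that only the 0→1 transition enqueues a page. Field and type names loosely follow the Box 1 description; the actual hakmem_pool.c definitions may differ.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_CLASSES = 8 };  // assumed class count for the sketch

// Illustrative subset of the Box 1 fields.
typedef struct MidPageSk {
    _Atomic(struct MidPageSk*) next_pending;       // link in the pending stack
    atomic_bool                in_remote_pending;  // guards double-enqueue
} MidPageSk;

typedef struct {
    _Atomic(MidPageSk*) pending_head[NUM_CLASSES];  // one MPSC stack per class
} ThreadPagesSk;

// Producers (any remote freer): push the page at most once. The
// in_remote_pending flag implements the 0→1 edge - only the first
// remote free after a drain enqueues the page.
static void enqueue_pending(ThreadPagesSk* owner, int cls, MidPageSk* page) {
    bool expected = false;
    if (!atomic_compare_exchange_strong(&page->in_remote_pending, &expected, true))
        return;  // already queued by another remote freer
    MidPageSk* head = atomic_load_explicit(&owner->pending_head[cls],
                                           memory_order_relaxed);
    do {
        atomic_store_explicit(&page->next_pending, head, memory_order_relaxed);
    } while (!atomic_compare_exchange_weak_explicit(
                 &owner->pending_head[cls], &head, page,
                 memory_order_release, memory_order_relaxed));
}

// Single consumer (the owner): pop one page. Skipping ABA handling is
// safe only because there is exactly one concurrent popper per queue.
static MidPageSk* dequeue_pending(ThreadPagesSk* owner, int cls) {
    MidPageSk* head = atomic_load_explicit(&owner->pending_head[cls],
                                           memory_order_acquire);
    while (head) {
        MidPageSk* next = atomic_load_explicit(&head->next_pending,
                                               memory_order_relaxed);
        if (atomic_compare_exchange_weak_explicit(
                &owner->pending_head[cls], &head, next,
                memory_order_acquire, memory_order_acquire))
            break;  // popped `head`; CAS failure reloads head and retries
    }
    if (head) atomic_store(&head->in_remote_pending, false);  // re-arm 0→1 edge
    return head;
}
```

The single-consumer assumption is exactly what the drain strategies below end up bending: once any thread may pop any owner's queue, the pop side needs more care than this sketch shows.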
### Pending Enqueue Sources

The **76,733 enqueues** come from:
- Phase 1 allocation interruptions (rare cross-thread frees)
- NOT from Phase 2 continuous freeing (same-thread pattern)

---

## Solution Strategy: Global Round-Robin

### Design Philosophy: "Where to Separate, Where to Integrate"

**Separation points** (working well) ✅:
- Allocation: thread-local, no lock
- Owner free: thread-local, no lock
- Cross-thread free: lock-free MPSC stack

**Integration point** (broken) ❌:
- Pending queue drain: currently thread-local only

### Strategy A: Global Round-Robin (Phase 1) 🎯

**Core Idea**: All threads can drain ANY thread's pending queue

```c
// Global registry
static MF2_ThreadPages* g_all_thread_pages[MAX_THREADS];
static _Atomic int g_num_thread_pages = 0;

// Round-robin drain
static void mf2_maybe_drain_pending(void) {
    static _Atomic uint64_t counter = 0;
    uint64_t count = atomic_fetch_add(&counter, 1);

    // Round-robin across ALL threads (not just self!)
    int tp_idx = (count / 4) % g_num_thread_pages;
    MF2_ThreadPages* tp = g_all_thread_pages[tp_idx];
    if (tp) {
        int class_idx = (count / 4 / g_num_thread_pages) % POOL_NUM_CLASSES;
        MidPage* pending = mf2_dequeue_pending(tp, class_idx);
        if (pending) drain_remote_frees(pending);
    }
}
```

**Benefits**:
- Larson works: any thread can drain any thread's pending queue
- Fair: all TLSs get equal drain opportunities
- Simple: just a global array + round-robin

**Implementation Steps**:
1. Add global array `g_all_thread_pages[]`
2. Register the TLS in `mf2_thread_pages_get()` (see the sketch after this section)
3. Add a destructor with `pthread_key_create()`
4. Modify `mf2_maybe_drain_pending()` to round-robin

**Expected Impact**:
```
Pending enqueued: 69K
Pending drained:  69K ✅ (100% instead of 0%)
Page reuse rate:  3% → 90%+ ✅
Throughput:       28K → 3-10M ops/s ✅ (100-350x improvement!)
```
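Steps 2-3 could look like the following. This is a sketch only: `ThreadPagesSk` stands in for the real `MF2_ThreadPages`, and the registry is declared atomic here so slots can be cleared safely at thread exit.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_THREADS 256  // matches the g_all_thread_pages[256] plan

// Illustrative stand-in; the real MF2_ThreadPages lives in hakmem_pool.c.
typedef struct { void* pending_head[8]; } ThreadPagesSk;

static _Atomic(ThreadPagesSk*) g_all_thread_pages[MAX_THREADS];
static _Atomic int g_num_thread_pages = 0;

static pthread_key_t  g_tls_key;
static pthread_once_t g_tls_once = PTHREAD_ONCE_INIT;
static __thread ThreadPagesSk* t_pages = NULL;

// Destructor (step 3): runs at thread exit and clears the registry slot
// so round-robin drainers stop visiting a dead thread's TLS.
static void tls_destroy(void* arg) {
    ThreadPagesSk* tp = (ThreadPagesSk*)arg;
    int n = atomic_load(&g_num_thread_pages);
    for (int i = 0; i < n && i < MAX_THREADS; i++) {
        ThreadPagesSk* expect = tp;
        if (atomic_compare_exchange_strong(&g_all_thread_pages[i], &expect, NULL))
            break;
    }
    // A real implementation also has to hand off any still-pending pages
    // before tp can be freed; the sketch deliberately leaks them.
}

static void tls_key_init(void) { pthread_key_create(&g_tls_key, tls_destroy); }

// Step 2: the first call per thread allocates, registers, and arms the destructor.
ThreadPagesSk* thread_pages_get(void) {
    if (t_pages) return t_pages;  // fast path: already registered
    pthread_once(&g_tls_once, tls_key_init);

    t_pages = calloc(1, sizeof(*t_pages));
    pthread_setspecific(g_tls_key, t_pages);  // arm the destructor

    int idx = atomic_fetch_add(&g_num_thread_pages, 1);
    if (idx < MAX_THREADS)
        atomic_store(&g_all_thread_pages[idx], t_pages);  // publish for drainers
    return t_pages;
}
```

Note that cleared slots stay NULL, which is why the round-robin loop above guards with `if (tp)`.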
---

### Strategy B: Hybrid (Phase 2) ⚡

**Optimization**: Prefer the own TLS (cache efficiency) but periodically check others

```c
if ((count & 3) == 0) {
    // 1/4: Other threads
    tp = g_all_thread_pages[round_robin_idx];
} else {
    // 3/4: Own TLS (cache hot)
    tp = mf2_thread_pages_get();
}
```

**Benefits**:
- Cache efficiency: 75% of drains hit the own TLS (L1 cache)
- Fairness: 25% of drains check others (ensures progress)

**Metrics**:
- Own TLS: L1 cache hit (1-2 cycles)
- Other TLS: L3 cache hit (10-20 cycles)
- Average cost: **3-5 cycles** (negligible)

---

### Strategy C: Background Sweeper (Phase 3) 🔄

**Safety Net**: Handle edge cases where all threads stop allocating/freeing

```c
void* mf2_drain_thread(void* arg) {
    while (running) {
        usleep(1000);  // 1ms interval (not 100μs - too aggressive)

        // Scan all TLSs for leftover pending pages
        for (int i = 0; i < g_num_thread_pages; i++) {
            for (int c = 0; c < POOL_NUM_CLASSES; c++) {
                MidPage* pending = mf2_dequeue_pending(g_all_thread_pages[i], c);
                if (pending) drain_remote_frees(pending);
            }
        }
    }
}
```

**Role**: Insurance policy, not the main drain mechanism
- Strategy A handles 95% of drains (hot path)
- Strategy C handles the 5% leftover (rare cases)

**Latency Impact**: **NONE on the hot path** (async background)

---

## 3-Layer Latency Hiding Design

| Layer | Strategy | Frequency | Latency | Coverage | Role |
|-------|----------|-----------|---------|----------|------|
| **L1: Hot Path** | A (Global RR) | Every 4th op | <1μs | 95% | Main drain |
| **L2: Optimization** | B (Hybrid) | 3/4 own, 1/4 other | <1μs | 100% | Cache efficiency |
| **L3: Safety Net** | C (BG sweeper) | 1ms interval | 1ms | 100% | Edge cases |

**Latency Guarantee**: The front end (alloc/free) always returns in **<1μs**, regardless of background drain state

---

## Implementation Plan

### Phase 1: Global Round-Robin (Today) 🎯

**Target**: Make Larson work

**Tasks**:
1. Add global array `g_all_thread_pages[256]`
2. Add atomic counter `g_num_thread_pages`
3. Add registration in `mf2_thread_pages_get()`
4. Add a pthread_key destructor for cleanup
5. Modify `mf2_maybe_drain_pending()` for round-robin

**Expected Time**: 1-2 hours

**Success Criteria**:
- Pending drained > 0 (ideally ~69K)
- Throughput > 1M ops/s (35x improvement from 28K)

---

### Phase 2: Hybrid Optimization (Tomorrow)

**Target**: Improve cache efficiency

**Tasks**:
1. Modify `mf2_maybe_drain_pending()` to prefer the own TLS (3/4 ratio)
2. Benchmark cache hit rates

**Expected Time**: 30 minutes

**Success Criteria**:
- L1 cache hit rate > 75%
- Throughput gain: +5-10%

---

### Phase 3: Background Sweeper (Optional)

**Target**: Handle edge cases

**Tasks** (see the lifecycle sketch at the end of this subsection):
1. Create a background thread with `pthread_create()`
2. Scan all TLSs every 1ms
3. CPU throttling (< 1% usage)

**Expected Time**: 30 minutes

**Success Criteria**:
- No pending leftovers after 10s idle
- CPU overhead < 1%
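The sweeper wiring for Phase 3 could be as small as the following sketch. It reuses `mf2_drain_thread()` from Strategy C above; the `g_sweeper_running` flag (read as `running` inside the loop) and the start/stop helper names are assumptions for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>

void* mf2_drain_thread(void* arg);             // the Strategy C loop above

static atomic_bool g_sweeper_running = false;  // the loop's `running` flag
static pthread_t   g_sweeper;

// Start the 1ms background sweeper (e.g., from pool init).
static int mf2_sweeper_start(void) {
    atomic_store(&g_sweeper_running, true);
    return pthread_create(&g_sweeper, NULL, mf2_drain_thread, NULL);
}

// Stop it at shutdown. The loop notices the flag on its next 1ms wakeup,
// so the join costs at most one interval and never blocks the hot path.
static void mf2_sweeper_stop(void) {
    atomic_store(&g_sweeper_running, false);
    pthread_join(g_sweeper, NULL);
}
```

Each wakeup does at most `g_num_thread_pages × POOL_NUM_CLASSES` dequeue attempts, which keeps the CPU budget well under the 1% target.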
---

## Current Status

**Working**:
- ✅ Per-page sharding (data structures, allocation, free paths)
- ✅ 64KB alignment (Fix #4)
- ✅ Memory ordering (Fix #6)
- ✅ Pending queue infrastructure (enqueue works perfectly)
- ✅ 0→1 edge detection

**Broken**:
- ❌ Pending queue drain (0 drains due to TLS isolation)
- ❌ Page reuse (3% instead of 90%)
- ❌ Performance (28K ops/s instead of 3-10M)

**Next**:
- 🎯 Implement Phase 1: Global Round-Robin
- 🎯 Expected breakthrough: 28K → 3-10M ops/s

---

## Files Modified

### Core Implementation
- `hakmem_pool.c` (Lines 275-1200): MF2 implementation
  - Data structures (MidPage, MF2_ThreadPages, PageRegistry)
  - Allocation paths (fast/slow)
  - Free paths (fast/slow)
  - Pending queue operations
  - Opportunistic drain (currently broken)

### Documentation
- `docs/specs/ENV_VARS.md`: Added `HAKMEM_MF2_ENABLE`
- `docs/status/PHASE_7.2_MF2_PLAN_2025_10_24.md`: Original plan
- `docs/status/PHASE_7.2_MF2_PROGRESS_2025_10_24.md`: This file

### Debug Reports
- `ALIGNMENT_FIX_VERIFICATION.md`: Fix #4 verification by Task agent

---

## Lessons Learned

1. **Alignment is Critical**: 97% of frees failed due to the 4KB vs 64KB alignment mismatch
2. **Memory Ordering Matters**: But it doesn't solve architectural issues
3. **Workload Characteristics**: Larson's same-thread pattern exposed the TLS isolation bug
4. **Integration vs Separation**: Integration points must be chosen carefully
5. **Task Agent is MVP**: Detailed analysis saved days of debugging

---

## Phase 1: Global Round-Robin Implementation ✅

**Commit**: (multiple commits implementing round-robin drain)

**Implementation**:
1. Added `g_all_thread_pages[256]` global array
2. Added `g_num_thread_pages` atomic counter
3. Implemented TLS registration in `mf2_thread_pages_get()`
4. Implemented `mf2_maybe_drain_pending()` with round-robin logic
5. Called from both `mf2_free_fast()` and `mf2_alloc_slow()`

**Test Results** (larson 10 2-32K 10s 4T):
```
Pending enqueued: 96,429 ✅
Pending drained:  70,607 ✅ (73% - huge improvement from 0%!)
Page reuse count: 5,222
Throughput:       ~28,705 ops/s
```

**Analysis**:
- ✅ Round-robin drain WORKS! (0 drains → 70K drains)
- ⚠️ But page reuse is only 2.3% (5,222 / 226,447 pages allocated)
- Problem: drained pages are returned to full_pages, but the owner never scans them

---

## Strategy C: Direct Handoff Implementation ✅

**Concept**: Don't return drained pages to full_pages - make them **active immediately**

**Implementation** (clean modular code; see the sketch after this section):
```c
// Helper: Make page active (move old active to full_pages)
static inline void mf2_make_page_active(MF2_ThreadPages* tp, int class_idx, MidPage* page);

// Helper: Drain page and activate if successful (Direct Handoff)
static inline bool mf2_try_drain_and_activate(MF2_ThreadPages* tp, int class_idx, MidPage* page);
```

**Changes**:
1. Modified `mf2_maybe_drain_pending()` to use `mf2_try_drain_and_activate()`
2. Modified the `alloc_slow` pending drain loop to use Direct Handoff
3. Reduced the opportunistic drain from 60+ lines to 20 lines

**Test Results** (larson 10 2-32K 10s 4T):
```
Pending enqueued: 96,429
Pending drained:  70,607
Page reuse count: 80,017 ✅ (15x improvement!)
Throughput:       ~28,705 ops/s
```

**Success**: Page reuse 35% (80,017 / 226,447)
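A sketch of what the two helper bodies might look like, assuming per-class `active_page`/`full_pages` lists and a drain routine that folds the remote-free stack into the page-local freelist. All types and the drain function are illustrative stand-ins for the real hakmem_pool.c definitions.

```c
#include <stdbool.h>
#include <stddef.h>

enum { NUM_CLASSES = 8 };  // assumed class count for the sketch

// Illustrative subset of the real structures.
typedef struct MidPageSk {
    struct MidPageSk* next;      // intrusive link for full_pages
    void*             freelist;  // page-local free-block list
} MidPageSk;

typedef struct {
    MidPageSk* active_page[NUM_CLASSES];
    MidPageSk* full_pages[NUM_CLASSES];
} ThreadPagesSk;

void drain_remote_frees_sk(MidPageSk* page);  // remote stack → freelist (elsewhere)

// Helper 1: make `page` the active page for its class; the previous
// active page is parked on full_pages.
static inline void make_page_active(ThreadPagesSk* tp, int cls, MidPageSk* page) {
    MidPageSk* old = tp->active_page[cls];
    if (old) {
        old->next = tp->full_pages[cls];  // push old active onto the full list
        tp->full_pages[cls] = old;
    }
    tp->active_page[cls] = page;
}

// Helper 2 (Direct Handoff): drain the page's remote frees and, if that
// produced allocatable blocks, activate it immediately instead of parking it.
static inline bool try_drain_and_activate(ThreadPagesSk* tp, int cls, MidPageSk* page) {
    drain_remote_frees_sk(page);  // fold remote-free stack into freelist
    if (!page->freelist)
        return false;             // nothing reclaimed; caller keeps looking
    make_page_active(tp, cls, page);
    return true;
}
```

The design choice is the early return: a drained-but-still-empty page never displaces a usable active page, so the handoff only ever improves the allocator's position.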
---

## Full Pages Scan Removal ✅

**Evidence**: The full_pages scan checked 1.88M pages but found **0 pages** (0% success rate)

**Reason**: Direct Handoff immediately activates drained pages, so full_pages never contains reusable pages

**Action**: Removed the full_pages scan (76 lines deleted)

**Test Results**:
```
Page reuses: 69,098 (31%)
Throughput:  27,206 ops/s
```

**Conclusion**: A slight decrease, but acceptable (simplification benefit)

---

## Frequency Tuning Attempts ⚙️

Tested multiple opportunistic drain frequencies:

| Frequency | Page Reuses | Reuse % | Throughput |
|-----------|-------------|---------|------------|
| 1/2 (50%) | 70,607 | 31% | 27,206 ops/s |
| 1/4 (25%) | 45,369 | 20% | 27,423 ops/s |
| 1/8 (12.5%) | 24,901 | 11% | 27,642 ops/s |

**Finding**: Higher frequency = better reuse, but still far from the 90% target

---

## Hybrid Strategy Attempt (Strategy B) ❌

**Concept**: 75% own TLS (cache efficiency) + 25% round-robin (fairness)

**Implementation**:
```c
if ((count & 3) == 0) {
    // 1/4: Other threads
    tp = g_all_thread_pages[round_robin_idx];
} else {
    // 3/4: Own TLS
    tp = mf2_thread_pages_get();
}
```

**Test Results** (50% overall frequency):
```
Page reuses: 12,676 (5.5%) ❌
Problem: Effective frequency too low (37.5% own + 12.5% others)
```

**Conclusion**: Reverted to pure round-robin at 50% frequency (31% reuse)

---

## ChatGPT Pro Consultation 🧠

**Date**: 2025-10-24

### Question Posed

A complete technical question covering:
- MF2 architecture (Pending Queue, Direct Handoff, Opportunistic Drain)
- Problem: 31% reuse vs the 90% target
- Constraints: O(1), lock-free, per-page freelist
- What was tried: frequencies (1/8, 1/4, 1/2), Hybrid (75/25)

### Diagnosis

**Root Problem**: "Round-robin drain → owner handoff" doesn't work when the owner stops allocating

**Larson Benchmark Pattern**:
- **Phase 1** (0-1s): All threads allocate → pages populate
- **Phase 2** (1-10s): All threads free+realloc from their own ranges
  - Thread A frees Thread A's objects → no cross-thread frees
  - Thread B frees Thread B's objects → no cross-thread frees
- **But**: Some cross-thread frees do occur (~10%)

**The Architectural Mismatch**:
```
Current (Round-Robin Drain):
1. Thread A frees → Thread B's page goes to the pending queue
2. Thread C (round-robin) drains Thread B's pending → activates the page on Thread B
3. Thread B is NOT allocating (Larson Phase 2) → the page sits unused
4. Thread A needs memory → allocates a NEW page (doesn't know about Thread B's ready page)
```

**Result**: Pages are drained but never used = 31% reuse instead of 90%

### Recommended Solution: Consumer-Driven Adoption

**Core Principle**: "Don't push pages to idle threads; let active threads **pull** and **adopt** them"

**Key Changes**:
1. **Remove round-robin drain entirely** (no more `mf2_maybe_drain_pending()`)
2. **Add ownership transfer**: CAS to change `page->owner_tid`
3. **Adoption on-demand**: The allocating thread adopts pages from ANY thread's pending queue
4. **Lease mechanism**: Prevent thrashing (no re-transfer within 10ms)
**Algorithm** (sketch in C11 atomics; `num_threads`, `other_thread[]`, `my_tid`, `rdtsc()`, and `LEASE_CYCLES` as named in the consultation):

```c
// In alloc_slow, BEFORE allocating a new page:
bool mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx) {
    // Scan all threads' pending queues (round-robin for fairness)
    for (int i = 0; i < num_threads; i++) {
        MidPage* page = mf2_dequeue_pending(other_thread[i], class_idx);
        if (!page) continue;

        // Lease check: a recently transferred page must not move again yet
        uint64_t now = rdtsc();
        if (now - page->last_transfer_time < LEASE_CYCLES) {
            mf2_enqueue_pending(other_thread[i], class_idx, page);  // put it back
            continue;
        }

        // Try to transfer ownership (CAS on the atomic owner_tid)
        uint64_t old_owner = atomic_load(&page->owner_tid);
        if (!atomic_compare_exchange_strong(&page->owner_tid, &old_owner, my_tid)) {
            mf2_enqueue_pending(other_thread[i], class_idx, page);  // lost the race
            continue;
        }

        // Success! Ownership transferred
        page->owner_tp = me;
        page->last_transfer_time = now;

        // Drain and activate (Direct Handoff)
        mf2_drain_remote_frees(page);
        if (page->freelist) {
            mf2_make_page_active(me, class_idx, page);
            return true;  // SUCCESS!
        }
    }
    return false;  // No adoptable pages
}
```

**Expected Effects**:
- ✅ No wasted effort (only allocating threads drain)
- ✅ Page reuse >90% (the allocating thread gets any available page)
- ✅ Throughput 3-10M ops/s (100-350x improvement)
- ✅ Hot path unchanged (fast alloc/free still O(1), lock-free)

---

## Implementation Plan: Consumer-Driven Adoption

### Phase 1: Code Cleanup & Preparation ✅

**Tasks**:
1. ✅ Remove `mf2_maybe_drain_pending()` (opportunistic drain)
2. ✅ Remove all calls to `mf2_maybe_drain_pending()`
3. ✅ Keep helper functions (`mf2_make_page_active`, `mf2_try_drain_and_activate`)

### Phase 2: Data Structure Updates

**Tasks**:
1. Add `uint64_t last_transfer_time` to the `MidPage` struct
2. Ensure `owner_tid` and `owner_tp` are already present (✅ verified)

### Phase 3: Adoption Function

**Tasks**:
1. Implement `mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx)`
   - Scan all threads' pending queues (round-robin)
   - Check the lease (rdtsc() - last_transfer_time >= LEASE_CYCLES)
   - CAS ownership transfer
   - Drain and activate if successful
2. Tune `LEASE_CYCLES` (start with 10ms = ~30M cycles on a 3GHz CPU)

### Phase 4: Integration

**Tasks**:
1. Call `mf2_try_adopt_pending()` in `alloc_slow` BEFORE allocating a new page
2. If adoption succeeds, retry the fast path
3. If adoption fails, allocate a new page (existing logic)

### Phase 5: Benchmark & Validate

**Tasks**:
1. Run the larson 4T benchmark
2. Verify page reuse >90%
3. Verify throughput >1M ops/s (target: 3-10M)
4. Run the full benchmark suite

---

## Current Status (Updated)

**Working**:
- ✅ Per-page sharding (data structures, allocation, free paths)
- ✅ 64KB alignment
- ✅ Memory ordering
- ✅ Pending queue infrastructure (enqueue/dequeue)
- ✅ Direct Handoff (immediate page activation)
- ✅ Helper functions (modular, inline-optimized)
- ✅ Round-robin drain (proof of concept - to be replaced)

**Needs Improvement**:
- ⚠️ Page reuse: 31% (target: >90%)
- ⚠️ Throughput: 27K ops/s (target: 3-10M)

**Root Cause Identified**:
- ❌ "Push to idle owner" doesn't work (Larson Phase 2 pattern)
- ✅ Solution: "Pull by active allocator" (Consumer-Driven Adoption)

**Next Steps**:
1. 🎯 Remove `mf2_maybe_drain_pending()` (cleanup)
2. 🎯 Add the `last_transfer_time` field
3. 🎯 Implement `mf2_try_adopt_pending()`
4. 🎯 Integrate adoption into `alloc_slow`
5. 🎯 Benchmark and validate

---

## Lessons Learned (Updated)

1. **Alignment is Critical**: 97% of frees failed due to the 4KB vs 64KB alignment mismatch
2. **Memory Ordering Matters**: But it doesn't solve architectural issues
3. **Workload Characteristics**: Larson's same-thread pattern exposed the TLS isolation bug
4. **Integration vs Separation**: Integration points must be chosen carefully
5. **Direct Handoff is Essential**: Returning drained pages to intermediate lists wastes reuse opportunities
6. **Push vs Pull**: "Push to idle owner" doesn't work; "Pull by active allocator" is the correct design
7. **ChatGPT Pro Consultation**: A fresh perspective identified the fundamental architectural mismatch

---

**Status**: Ready for Consumer-Driven Adoption implementation
**Confidence**: Very High (ChatGPT Pro validated the approach, clear design)
**Expected Outcome**: >90% page reuse, 3-10M ops/s (100-350x improvement)