# Phase 7.2 MF2: Implementation Progress

**Date**: 2025-10-24
**Status**: In Progress - Fixing Pending Queue Drain Issue
**Current**: Implementing Global Round-Robin Strategy

---

## Summary

The basic MF2 per-page sharding implementation is complete, but we found a structural problem in the pending queue drain mechanism.
In the Larson benchmark, each thread allocates and frees within its own dedicated array range, so cross-thread frees are almost zero.
As a result, enqueues to the pending queue succeed (69K pages), but the drain count stays at 0.

Detailed analysis by the Task agent identified the root cause:
- Each thread only looks at the pending queue in **its own TLS**
- In Larson, each thread allocates and frees its own blocks → its own pending queue is empty
- Pages that accumulate in other threads' pending queues are **never processed**

---

## Implementation Timeline

### Phase 1-4: Core Implementation ✅

**Commits**:
- `0855b37` - Phase 1: Data structures
- `5c4b780` - Phase 2: Page allocation
- `b12f58c` - Phase 3: Allocation path
- `7e756c6` - Phase 4: Free path

**Status**: Complete

---

### Phase 5: Bug Fixes (Fix #1-6) ✅

#### Fix #1: Block Spacing Bug (`54609c1`)

**Problem**: Infinite loop on first test
**Root Cause**:
```c
size_t block_size = g_class_sizes[class_idx];  // Missing HEADER_SIZE
```
**Fix**: `block_size = HEADER_SIZE + user_size;`
**Result**: Test completes instead of hanging
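
The corrected stride can be sketched as follows. This is a minimal illustration, not the code in `hakmem_pool.c`: the values of `HEADER_SIZE` and `PAGE_BYTES` and the helper names are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical values for illustration; the real constants live in hakmem_pool.c. */
#define HEADER_SIZE 16u
#define PAGE_BYTES  65536u

/* Stride between consecutive blocks: header plus user payload (the Fix #1 formula). */
static size_t block_stride(size_t user_size) {
    return HEADER_SIZE + user_size;
}

/* Number of whole blocks that fit in one page at that stride. */
static size_t blocks_per_page(size_t user_size) {
    return PAGE_BYTES / block_stride(user_size);
}

/* Offset of the user pointer for block i, skipping that block's header. */
static size_t user_offset(size_t i, size_t user_size) {
    return i * block_stride(user_size) + HEADER_SIZE;
}
```

If the stride omitted `HEADER_SIZE`, the header of block `i + 1` would overlap the payload of block `i`, which is the spacing error this fix removes.
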

---

#### Fix #2-3: Performance Optimizations (`aa869b9`)

**Changes**:
- Removed 64KB memset (switched from posix_memalign to mmap)
- Removed O(N) eager drain scan
- Reduced scan limit from 256 to 8

**Result**: 27.5K → 110K ops/s (4x improvement on 4T)

---

#### Fix #4: Alignment Bug (`9e64f7e`) - CRITICAL

**Problem**: 97% of frees silently dropped!
**Root Cause**:
- mmap() only guarantees page-size alignment (4KB here)
- `addr_to_page()` assumes 64KB alignment
- Lookup fails: `(ptr & ~0xFFFF)` rounds to the wrong page base

**Fix**: Changed to `posix_memalign(&page_base, 65536, POOL_PAGE_SIZE)`
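
The relationship between the allocation alignment and the mask-based lookup can be sketched as below. This is an illustration under stated assumptions: `page_base_of`, `alloc_aligned_page`, and `lookup_roundtrip_ok` are hypothetical helpers, not functions from `hakmem_pool.c`, and the page size is taken to be 64KB as the mask `0xFFFF` implies.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign under strict -std modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_ALIGN 65536u  /* 64KB, matching the mask 0xFFFF in the text */

/* Round any interior pointer down to its 64KB page base. */
static uintptr_t page_base_of(const void *ptr) {
    return (uintptr_t)ptr & ~(uintptr_t)0xFFFF;
}

/* Allocate one 64KB page whose base is 64KB-aligned; returns NULL on failure. */
static void *alloc_aligned_page(void) {
    void *base = NULL;
    if (posix_memalign(&base, PAGE_ALIGN, PAGE_ALIGN) != 0) return NULL;
    return base;
}

/* 1 if masking the pointer at base+offset recovers the page base, 0 otherwise. */
static int lookup_roundtrip_ok(void *base, size_t offset) {
    return page_base_of((char *)base + offset) == (uintptr_t)base;
}
```

With a base that is only 4KB-aligned, the same mask can round down to an address below the real page base, so the registry lookup misses and the free is dropped.
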

**Verification** (by Task agent):
```
Pages allocated: 101,093
Alignment bugs: 0 (ZERO!)
Registry collisions: 0 (ZERO!)
Lookup success rate: 98%
```

**Side Effect**: Performance degraded (466K → 54K) due to memset overhead returning

---

#### Fix #5: Active Page Drain Attempt (`9e64f7e`)

**Change**: Added a check for remote frees in active_page before allocating a new page
**Result**: No improvement (remote drains still 0)

---

#### Fix #6: Memory Ordering (`b0768b3`)

**Problem**: All remote_count operations used `memory_order_relaxed`
**Fix**: Changed 7 locations to `seq_cst/acquire/release`
**Result**: Memory ordering is now correct, but performance did not improve

**Root Cause Discovery** (by Task agent):
- Debug instrumentation revealed: drain checks and remote frees target **DIFFERENT page objects**
- Thread A's pages in Thread A's tp->active_page/full_pages
- Thread B frees to Thread A's pages → remote_count++
- Thread B's slow path checks Thread B's pages only
- Result: Thread A's pages (with remote_count > 0) never checked by anyone!

---

### Phase 2: Pending Queue Implementation (`89541fc`) ✅

**Implementation** (by Task agent):
- **Box 1**: Data structures - added owner_tp, in_remote_pending, next_pending to MidPage
- **Box 2**: MPSC lock-free queue operations (mf2_enqueue_pending, mf2_dequeue_pending)
- **Box 3**: 0→1 edge detection in mf2_free_slow()
- **Box 4**: Allocation slow path drain (up to 4 pages per allocation)
- **Box 5**: Opportunistic drain (every 16th owner free)
- **Box 6**: Comprehensive debug logging and statistics
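
Boxes 2 and 3 can be sketched as a lock-free stack with 0→1 edge detection. This is a simplified sketch under assumptions: the `Page` and `Owner` types and all function names here are stand-ins rather than the real `MidPage` API, and a LIFO stack is used where the real code may order pages differently.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Page {
    atomic_uint  remote_count;   /* blocks freed by non-owner threads */
    struct Page *next_pending;   /* link in the owner's pending stack */
} Page;

typedef struct {
    _Atomic(Page *) pending_head; /* many producers push, one consumer pops */
} Owner;

static void page_init(Page *p)   { atomic_init(&p->remote_count, 0u); p->next_pending = NULL; }
static void owner_init(Owner *o) { atomic_init(&o->pending_head, (Page *)NULL); }

/* Push a page onto the owner's pending stack (lock-free, multi-producer). */
static void enqueue_pending(Owner *o, Page *p) {
    Page *head = atomic_load_explicit(&o->pending_head, memory_order_relaxed);
    do {
        p->next_pending = head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &o->pending_head, &head, p,
                 memory_order_release, memory_order_relaxed));
}

/* Remote free: only the 0 -> 1 transition enqueues the page.
 * Returns true if this call enqueued it. */
static bool remote_free(Owner *o, Page *p) {
    unsigned old = atomic_fetch_add_explicit(&p->remote_count, 1u,
                                             memory_order_acq_rel);
    if (old == 0) { enqueue_pending(o, p); return true; }
    return false;
}

/* Pop one page, or NULL if the stack is empty. Safe for a single consumer only. */
static Page *dequeue_pending(Owner *o) {
    Page *head = atomic_load_explicit(&o->pending_head, memory_order_acquire);
    while (head && !atomic_compare_exchange_weak_explicit(
                       &o->pending_head, &head, head->next_pending,
                       memory_order_acquire, memory_order_acquire)) {
        /* head was refreshed by the failed CAS; retry */
    }
    return head;
}
```

The edge detection keeps each page in the stack at most once per drain cycle. The pop shown here assumes one consumer at a time; letting several threads pop the same stack would need extra protection against ABA.
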

**Test Results**:
```
Pending enqueued: 43,138 ✅
Pending drained: 0 ❌
```

**Analysis** (by Task agent):
- Implementation is correct
- Problem: Larson benchmark allocates all pages early, frees later
- By the time remote frees arrive, owner threads don't allocate anymore
- Slow path never called → pending queue never processed
- This is a workload mismatch, not an implementation bug

---

### Tuning: Opportunistic Drain Frequency (`a6eb666`) ✅

**Change**: Increased from every 16th to every 4th free (4x more aggressive)

**Test Results** (larson 10 2-32K 10s 4T):
```
Pending enqueued: 52,912 ✅
Pending drained: 0 ❌
Throughput: 53K ops/s
```

**Conclusion**: Frequency tuning didn't help - workload pattern issue persists

---

### Option 1: free_slow Drain Addition ❌

**Concept**: Add opportunistic drain to both `free_fast()` and `free_slow()`

**Implementation**:
- Created `mf2_maybe_drain_pending()` helper
- Called from both free_fast() (Line 1115) and free_slow() (Line 1167)

**Test Results**:
```
Pending enqueued: 76,733 ✅
Pending drained: 0 ❌
OPP_DRAIN_TRY: 10 attempts (all from tp=0x55828805f7a0)
Throughput: 27,890 ops/s
```

**Problem**: All drain attempts came from the same thread - the other 3 threads never appear in the log

---

### Option C: alloc_slow Drain Addition ❌

**Concept**: Add drain before new page allocation (owner thread allocating continuously)

**Implementation**: Added `mf2_maybe_drain_pending()` at Line 1021 (before `mf2_alloc_new_page()`)

**Test Results**:
```
Pending enqueued: 69,702 ✅
Pending drained: 0 ❌
OPP_DRAIN_TRY: 10 attempts (all from tp=0x559146bb17a0)
Throughput: 27,965 ops/s
```

**Conclusion**: Still 0 drains - same thread issue persists

---

## Root Cause Analysis (by Task Agent)

### Larson Benchmark Characteristics

```cpp
// larson.cpp: exercise_heap()
for (cblks=0; cblks<pdea->NumBlocks; cblks++) {
    victim = lran2(&pdea->rgen) % pdea->asize;  // Own array range
    CUSTOM_FREE(pdea->array[victim]);           // Free own allocation
    pdea->array[victim] = (char*)CUSTOM_MALLOC(blk_size);  // Same slot
}

// Array partitioning (Line 481):
de_area[i].array = &blkp[i*nperthread];  // Each thread owns separate range
```

**Key Finding**: Each thread allocates/frees from its own array range
- Thread 0: `array[0..999]`
- Thread 1: `array[1000..1999]`
- Thread 2: `array[2000..2999]`
- Thread 3: `array[3000..3999]`

**Result**: **Cross-thread frees are almost ZERO**

### MF2 Design vs Larson Mismatch

**MF2 Assumption**:
```
4 threads freeing → all threads call mf2_free() → all threads drain pending
```

**Larson Reality**:
```
1 thread does most freeing → only 1 thread drains pending
Other threads allocate-only → never drain their own pending queues
```

**Problem**:
```c
mf2_maybe_drain_pending() {
    MF2_ThreadPages* tp = mf2_thread_pages_get();  // ← Own TLS only!
    MidPage* pending = mf2_dequeue_pending(tp, class_idx);  // ← Own pending only!
}
```

- Thread A drains → checks Thread A's TLS → Thread A's pending queue is empty
- Thread B/C/D's pending queues (with 69K pages!) are **never checked**

### Pending Enqueue Sources

**76,733 enqueues** come from:
- Phase 1 allocation interruptions (rare cross-thread frees)
- NOT from Phase 2 continuous freeing (same-thread pattern)

---

## Solution Strategy: Global Round-Robin

### Design Philosophy: "Where to Separate, Where to Integrate"

**Separation Points** (working well) ✅:
- Allocation: Thread-local, no lock
- Owner Free: Thread-local, no lock
- Cross-thread Free: Lock-free MPSC stack

**Integration Point** (broken) ❌:
- Pending Queue Drain: Currently thread-local only

### Strategy A: Global Round-Robin (Phase 1) 🎯

**Core Idea**: All threads can drain ANY thread's pending queue

```c
// Global registry
static MF2_ThreadPages* g_all_thread_pages[MAX_THREADS];
static _Atomic int g_num_thread_pages = 0;

// Round-robin drain
void mf2_maybe_drain_pending(void) {
    static _Atomic uint64_t counter = 0;
    uint64_t count = atomic_fetch_add(&counter, 1);

    int n = atomic_load(&g_num_thread_pages);
    if (n == 0) return;  // Nothing registered yet (avoids modulo by zero)

    // Round-robin across ALL threads (not just self!)
    int tp_idx = (count / 4) % n;
    MF2_ThreadPages* tp = g_all_thread_pages[tp_idx];

    if (tp) {
        int class_idx = (count / 4 / n) % POOL_NUM_CLASSES;
        MidPage* pending = mf2_dequeue_pending(tp, class_idx);
        if (pending) drain_remote_frees(pending);
    }
}
```

**Benefits**:
- Larson works: Any thread can drain any thread's pending queue
- Fair: All TLSs get equal drain opportunities
- Simple: Just global array + round-robin

**Implementation Steps**:
1. Add global array `g_all_thread_pages[]`
2. Register TLS in `mf2_thread_pages_get()`
3. Add destructor with `pthread_key_create()`
4. Modify `mf2_maybe_drain_pending()` to round-robin
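
Steps 1, 2, and 4 can be sketched in isolation as below. This is a minimal sketch under assumptions: `ThreadPages`, `register_thread_pages`, and `pick_target` are hypothetical stand-ins, `POOL_NUM_CLASSES` is given an arbitrary value, and the `pthread_key_create()` destructor from step 3 is left out.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS      256  /* matches g_all_thread_pages[256] in the plan */
#define POOL_NUM_CLASSES 7    /* hypothetical value for illustration */

typedef struct ThreadPages { int id; } ThreadPages;  /* stand-in for MF2_ThreadPages */

static _Atomic(ThreadPages *) g_all[MAX_THREADS];
static atomic_int g_num = 0;

/* Claim the next slot and publish tp. Returns the slot index, or -1 if full. */
static int register_thread_pages(ThreadPages *tp) {
    int idx = atomic_fetch_add_explicit(&g_num, 1, memory_order_acq_rel);
    if (idx >= MAX_THREADS) return -1;  /* readers clamp the count, see below */
    atomic_store_explicit(&g_all[idx], tp, memory_order_release);
    return idx;
}

/* Pick the drain target for the count-th call: each slot gets 4 consecutive
 * calls, and the class index advances after a full pass over all slots.
 * May return NULL if nothing is registered or a slot is not yet published. */
static ThreadPages *pick_target(uint64_t count, int *class_idx) {
    int n = atomic_load_explicit(&g_num, memory_order_acquire);
    if (n > MAX_THREADS) n = MAX_THREADS;
    if (n <= 0) return NULL;            /* avoid modulo by zero */
    uint64_t step = count / 4;
    *class_idx = (int)((step / (uint64_t)n) % POOL_NUM_CLASSES);
    return atomic_load_explicit(&g_all[step % (uint64_t)n], memory_order_acquire);
}
```

A caller would pass the value returned by its shared call counter as `count` and skip the drain when the result is `NULL`.
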

**Expected Impact**:
```
Pending enqueued: 69K
Pending drained: 69K ✅ (100% instead of 0%)
Page reuse rate: 3% → 90%+ ✅
Throughput: 28K → 3-10M ops/s ✅ (100-350x improvement!)
```

---

### Strategy B: Hybrid (Phase 2) ⚡

**Optimization**: Prefer own TLS (cache efficiency) but periodically check others

```c
if ((count & 3) == 0) {  // 1/4: Other threads
    tp = g_all_thread_pages[round_robin_idx];
} else {  // 3/4: Own TLS (cache hot)
    tp = mf2_thread_pages_get();
}
```

**Benefits**:
- Cache efficiency: 75% of drains are own TLS (L1 cache)
- Fairness: 25% of drains are others (ensures progress)

**Metrics**:
- Own TLS: L1 cache hit (1-2 cycles)
- Other TLS: L3 cache hit (10-20 cycles)
- Average cost: **3-5 cycles** (negligible)
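
As a rough check on the average-cost figure, the per-access costs listed above can be weighted by the 3/4 : 1/4 split. This assumes those cycle counts hold and that no other overhead is involved:

$$
\bar{c} = \tfrac{3}{4}\,c_{\text{own}} + \tfrac{1}{4}\,c_{\text{other}},
\qquad
\tfrac{3}{4}(1) + \tfrac{1}{4}(10) = 3.25
\;\le\; \bar{c} \;\le\;
\tfrac{3}{4}(2) + \tfrac{1}{4}(20) = 6.5
$$

At the midpoints (1.5 and 15 cycles) this gives about 4.9 cycles, so the 3-5 cycle estimate corresponds to the lower and middle part of that range.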

---

### Strategy C: Background Sweeper (Phase 3) 🔄

**Safety Net**: Handle edge cases where all threads stop allocating/freeing

```c
void* mf2_drain_thread(void* arg) {
    (void)arg;
    while (running) {  // `running`: shutdown flag, assumed to be defined elsewhere
        usleep(1000);  // 1ms interval (not 100μs - too aggressive)

        // Scan all TLSs for leftover pending pages
        for (int i = 0; i < g_num_thread_pages; i++) {
            for (int c = 0; c < POOL_NUM_CLASSES; c++) {
                MidPage* pending = mf2_dequeue_pending(g_all_thread_pages[i], c);
                if (pending) drain_remote_frees(pending);
            }
        }
    }
    return NULL;
}
```

**Role**: Insurance policy, not main drain mechanism
- Strategy A handles 95% of drains (hot path)
- Strategy C handles 5% leftover (rare cases)

**Latency Impact**: **NONE on hot path** (async background)

---

## 3-Layer Latency Hiding Design

| Layer | Strategy | Frequency | Latency | Coverage | Role |
|-------|----------|-----------|---------|----------|------|
| **L1: Hot Path** | A (Global RR) | Every 4th op | <1μs | 95% | Main drain |
| **L2: Optimization** | B (Hybrid) | 3/4 own, 1/4 other | <1μs | 100% | Cache efficiency |
| **L3: Safety Net** | C (BG sweeper) | 1ms interval | 1ms | 100% | Edge cases |

**Latency Guarantee**: Front-end (alloc/free) always returns in **<1μs**, regardless of background drain state

---

## Implementation Plan

### Phase 1: Global Round-Robin (Today) 🎯

**Target**: Make Larson work

**Tasks**:
1. Add global array `g_all_thread_pages[256]`
2. Add atomic counter `g_num_thread_pages`
3. Add registration in `mf2_thread_pages_get()`
4. Add pthread_key destructor for cleanup
5. Modify `mf2_maybe_drain_pending()` for round-robin

**Expected Time**: 1-2 hours

**Success Criteria**:
- Pending drained > 0 (ideally ~69K)
- Throughput > 1M ops/s (35x improvement from 28K)

---

### Phase 2: Hybrid Optimization (Tomorrow)

**Target**: Improve cache efficiency

**Tasks**:
1. Modify `mf2_maybe_drain_pending()` to prefer own TLS (3/4 ratio)
2. Benchmark cache hit rates

**Expected Time**: 30 minutes

**Success Criteria**:
- L1 cache hit rate > 75%
- Throughput gain: +5-10%

---

### Phase 3: Background Sweeper (Optional)

**Target**: Handle edge cases

**Tasks**:
1. Create background thread with `pthread_create()`
2. Scan all TLSs every 1ms
3. CPU throttling (< 1% usage)

**Expected Time**: 30 minutes

**Success Criteria**:
- No pending leftovers after 10s idle
- CPU overhead < 1%

---

## Current Status

**Working**:
- ✅ Per-page sharding (data structures, allocation, free paths)
- ✅ 64KB alignment (Fix #4)
- ✅ Memory ordering (Fix #6)
- ✅ Pending queue infrastructure (enqueue works perfectly)
- ✅ 0→1 edge detection

**Broken**:
- ❌ Pending queue drain (0 drains due to TLS isolation)
- ❌ Page reuse (3% instead of 90%)
- ❌ Performance (28K ops/s instead of 3-10M)

**Next**:
- 🎯 Implement Phase 1: Global Round-Robin
- 🎯 Expected breakthrough: 28K → 3-10M ops/s

---

## Files Modified

### Core Implementation
- `hakmem_pool.c` (Lines 275-1200): MF2 implementation
  - Data structures (MidPage, MF2_ThreadPages, PageRegistry)
  - Allocation paths (fast/slow)
  - Free paths (fast/slow)
  - Pending queue operations
  - Opportunistic drain (currently broken)

### Documentation
- `docs/specs/ENV_VARS.md`: Added `HAKMEM_MF2_ENABLE`
- `docs/status/PHASE_7.2_MF2_PLAN_2025_10_24.md`: Original plan
- `docs/status/PHASE_7.2_MF2_PROGRESS_2025_10_24.md`: This file

### Debug Reports
- `ALIGNMENT_FIX_VERIFICATION.md`: Fix #4 verification by Task agent

---

## Lessons Learned

1. **Alignment is Critical**: 97% free failure from 4KB vs 64KB alignment mismatch
2. **Memory Ordering Matters**: But doesn't solve architectural issues
3. **Workload Characteristics**: Larson's same-thread pattern exposed TLS isolation bug
4. **Integration vs Separation**: Need to carefully choose integration points
5. **Task Agent is MVP**: Detailed analysis saved days of debugging

---

## Phase 1: Global Round-Robin Implementation ✅

**Commit**: (multiple commits implementing round-robin drain)

**Implementation**:
1. Added `g_all_thread_pages[256]` global array
2. Added `g_num_thread_pages` atomic counter
3. Implemented TLS registration in `mf2_thread_pages_get()`
4. Implemented `mf2_maybe_drain_pending()` with round-robin logic
5. Called from both `mf2_free_fast()` and `mf2_alloc_slow()`

**Test Results** (larson 10 2-32K 10s 4T):
```
Pending enqueued: 96,429 ✅
Pending drained: 70,607 ✅ (73% - huge improvement from 0%!)
Page reuse count: 5,222
Throughput: ~28,705 ops/s
```

**Analysis**:
- ✅ Round-robin drain WORKS! (0 drains → 70K drains)
- ⚠️ But page reuse only 2.3% (5,222 / 226,447 pages allocated)
- Problem: Drained pages returned to full_pages, but owner doesn't scan them

---

## Strategy C: Direct Handoff Implementation ✅

**Concept**: Don't return drained pages to full_pages - make them **active immediately**

**Implementation** (clean modular code):

```c
// Helper: Make page active (move old active to full_pages)
static inline void mf2_make_page_active(MF2_ThreadPages* tp, int class_idx, MidPage* page);

// Helper: Drain page and activate if successful (Direct Handoff)
static inline bool mf2_try_drain_and_activate(MF2_ThreadPages* tp, int class_idx, MidPage* page);
```
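
One way the two helpers could behave is sketched below. The bodies are assumptions, since only the declarations appear above: the `Page` and `ThreadPages` types are simplified stand-ins, free blocks are modeled as plain counters rather than linked freelists, and the remote counter is non-atomic for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_CLASSES 7  /* hypothetical value for illustration */

typedef struct Page {
    unsigned     free_count;    /* blocks the owner can allocate from */
    unsigned     remote_count;  /* blocks freed by other threads, not yet drained */
    struct Page *next;          /* link in full_pages */
} Page;

typedef struct {
    Page *active_page[NUM_CLASSES];
    Page *full_pages[NUM_CLASSES];
} ThreadPages;

/* Make `page` the active page; the previous active page (if any) moves to full_pages. */
static void make_page_active(ThreadPages *tp, int class_idx, Page *page) {
    Page *old = tp->active_page[class_idx];
    if (old && old != page) {
        old->next = tp->full_pages[class_idx];
        tp->full_pages[class_idx] = old;
    }
    tp->active_page[class_idx] = page;
}

/* Fold remote frees into the local free count; activate only if blocks are available. */
static bool try_drain_and_activate(ThreadPages *tp, int class_idx, Page *page) {
    page->free_count += page->remote_count;
    page->remote_count = 0;
    if (page->free_count == 0) return false;  /* nothing reusable: leave lists untouched */
    make_page_active(tp, class_idx, page);
    return true;
}
```

The point of the handoff is visible in the return path: a page that yields blocks becomes the allocation target at once instead of waiting in `full_pages` for a later scan.
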

**Changes**:
1. Modified `mf2_maybe_drain_pending()` to use `mf2_try_drain_and_activate()`
2. Modified `alloc_slow` pending drain loop to use Direct Handoff
3. Reduced opportunistic drain from 60+ lines to 20 lines

**Test Results** (larson 10 2-32K 10s 4T):
```
Pending enqueued: 96,429
Pending drained: 70,607
Page reuse count: 80,017 ✅ (15x improvement!)
Throughput: ~28,705 ops/s
```

**Success**: Page reuse 35% (80,017 / 226,447)

---

## Full Pages Scan Removal ✅

**Evidence**: Full_pages scan checked 1.88M pages but found **0 pages** (0% success rate)

**Reason**: Direct Handoff immediately activates drained pages, so full_pages never contains reusable pages

**Action**: Removed full_pages scan (76 lines deleted)

**Test Results**:
```
Page reuses: 69,098 (31%)
Throughput: 27,206 ops/s
```

**Conclusion**: Slight decrease but acceptable (simplification benefit)

---

## Frequency Tuning Attempts ⚙️

Tested multiple opportunistic drain frequencies:

| Frequency | Page Reuses | Reuse % | Throughput |
|-----------|-------------|---------|------------|
| 1/2 (50%) | 70,607 | 31% | 27,206 ops/s |
| 1/4 (25%) | 45,369 | 20% | 27,423 ops/s |
| 1/8 (12.5%) | 24,901 | 11% | 27,642 ops/s |

**Finding**: Higher frequency = better reuse, but still far from 90% target

---

## Hybrid Strategy Attempt (Strategy B) ❌

**Concept**: 75% own TLS (cache efficiency) + 25% round-robin (fairness)

**Implementation**:
```c
if ((count & 3) == 0) {  // 1/4: Other threads
    tp = g_all_thread_pages[round_robin_idx];
} else {  // 3/4: Own TLS
    tp = mf2_thread_pages_get();
}
```

**Test Results** (50% overall frequency):
```
Page reuses: 12,676 (5.5%) ❌
Problem: Effective frequency too low (37.5% own + 12.5% others)
```

**Conclusion**: Reverted to pure round-robin at 50% frequency (31% reuse)
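
The effective frequencies quoted in the test output follow from multiplying the overall drain frequency by the hybrid split. This is a worked check, assuming the 50% overall frequency applies before the 3/4 : 1/4 choice is made:

$$
f_{\text{own}} = 0.5 \times 0.75 = 0.375,
\qquad
f_{\text{other}} = 0.5 \times 0.25 = 0.125
$$

Only the 12.5% share reaches other threads' pending queues, compared with the full 50% under pure round-robin.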

---

## ChatGPT Pro Consultation 🧠

**Date**: 2025-10-24

### Question Posed

Complete technical question covering:
- MF2 architecture (Pending Queue, Direct Handoff, Opportunistic Drain)
- Problem: 31% reuse vs 90% target
- Constraints: O(1), lock-free, per-page freelist
- What was tried: Frequencies (1/8, 1/4, 1/2), Hybrid (75/25)

### Diagnosis

**Root Problem**: "Round-robin drain → owner handoff" doesn't work when owner stops allocating

**Larson Benchmark Pattern**:
- **Phase 1** (0-1s): All threads allocate → pages populate
- **Phase 2** (1-10s): All threads free+realloc from own ranges
  - Thread A frees Thread A's objects → no cross-thread frees
  - Thread B frees Thread B's objects → no cross-thread frees
- **But**: Some cross-thread frees do occur (~10%)

**The Architectural Mismatch**:
```
Current (Round-Robin Drain):
1. Thread A frees → Thread B's page goes to pending queue
2. Thread C (round-robin) drains Thread B's pending → activates page on Thread B
3. Thread B is NOT allocating (Larson Phase 2) → page sits unused
4. Thread A needs memory → allocates NEW page (doesn't know about Thread B's ready page)
```

**Result**: Pages drained but never used = 31% reuse instead of 90%

### Recommended Solution: Consumer-Driven Adoption

**Core Principle**: "Don't push pages to idle threads, let active threads **pull** and **adopt** them"

**Key Changes**:
1. **Remove round-robin drain entirely** (no more `mf2_maybe_drain_pending()`)
2. **Add ownership transfer**: CAS to change `page->owner_tid`
3. **Adoption on-demand**: Allocating thread adopts pages from ANY thread's pending queue
4. **Lease mechanism**: Prevent thrashing (re-transfer within 10ms)

**Algorithm**:
```c
// Pseudocode: num_threads, other_thread[], my_tid, CAS() and rdtsc() are placeholders.
// In alloc_slow, BEFORE allocating new page:
bool mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx) {
    // Scan all threads' pending queues (round-robin for fairness)
    for (int i = 0; i < num_threads; i++) {
        MidPage* page = mf2_dequeue_pending(other_thread[i], class_idx);
        if (!page) continue;

        // NOTE: `page` is already dequeued at this point. On either `continue`
        // below it must be re-enqueued (or otherwise tracked), or it is lost.

        // Try to transfer ownership (CAS)
        uint64_t old_owner = page->owner_tid;
        uint64_t now = rdtsc();
        if (now - page->last_transfer_time < LEASE_CYCLES) continue;  // Lease active

        if (!CAS(&page->owner_tid, old_owner, my_tid)) continue;  // CAS failed

        // Success! Ownership transferred
        page->owner_tp = me;
        page->last_transfer_time = now;

        // Drain and activate
        mf2_drain_remote_frees(page);
        if (page->freelist) {
            mf2_make_page_active(me, class_idx, page);
            return true;  // SUCCESS!
        }
    }
    return false;  // No adoptable pages
}
```

**Expected Effects**:
- ✅ No wasted effort (only allocating threads drain)
- ✅ Page reuse >90% (allocating thread gets any available page)
- ✅ Throughput 3-10M ops/s (100-350x improvement)
- ✅ Hot path unchanged (fast alloc/free still O(1), lock-free)
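
The lease check and CAS ownership transfer at the core of the algorithm can be sketched with C11 atomics. This is a minimal sketch, not the planned implementation: `PageOwnership` and `try_claim_page` are hypothetical names, and the caller supplies `now` and `lease` in whatever monotonic time unit it uses (TSC cycles in the plan).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    _Atomic uint64_t owner_tid;           /* current owner's thread id */
    _Atomic uint64_t last_transfer_time;  /* time of the last ownership change */
} PageOwnership;

/* Try to take ownership of a page for my_tid.
 * Fails while the lease is active or if another thread wins the CAS. */
static bool try_claim_page(PageOwnership *p, uint64_t my_tid,
                           uint64_t now, uint64_t lease) {
    uint64_t last = atomic_load_explicit(&p->last_transfer_time,
                                         memory_order_acquire);
    if (now - last < lease) return false;   /* lease still active */

    uint64_t old_owner = atomic_load_explicit(&p->owner_tid,
                                              memory_order_relaxed);
    if (!atomic_compare_exchange_strong_explicit(
            &p->owner_tid, &old_owner, my_tid,
            memory_order_acq_rel, memory_order_relaxed))
        return false;                       /* lost the race to another adopter */

    atomic_store_explicit(&p->last_transfer_time, now, memory_order_release);
    return true;
}
```

Only the thread whose CAS succeeds goes on to drain and activate the page, so two adopters cannot both treat the same page as their own.
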

---

## Implementation Plan: Consumer-Driven Adoption

### Phase 1: Code Cleanup & Preparation ✅

**Tasks**:
1. ✅ Remove `mf2_maybe_drain_pending()` (opportunistic drain)
2. ✅ Remove all calls to `mf2_maybe_drain_pending()`
3. ✅ Keep helper functions (`mf2_make_page_active`, `mf2_try_drain_and_activate`)

### Phase 2: Data Structure Updates

**Tasks**:
1. Add `uint64_t last_transfer_time` to `MidPage` struct
2. Ensure `owner_tid` and `owner_tp` are already present (✅ verified)

### Phase 3: Adoption Function

**Tasks**:
1. Implement `mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx)`
   - Scan all threads' pending queues (round-robin)
   - Check lease (rdtsc() - last_transfer_time >= LEASE_CYCLES)
   - CAS ownership transfer
   - Drain and activate if successful
2. Tune `LEASE_CYCLES` (start with 10ms = ~30M cycles on 3GHz CPU)
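
The starting value for `LEASE_CYCLES` follows from a simple conversion, sketched here. `lease_cycles` is a hypothetical helper, and it assumes the timestamp counter ticks at a constant, known frequency, which may differ from the nominal core clock on real hardware.

```c
#include <stdint.h>

/* Convert a lease duration in milliseconds to timestamp-counter cycles,
 * given the (assumed constant) counter frequency in Hz. */
static uint64_t lease_cycles(uint64_t lease_ms, uint64_t counter_hz) {
    return lease_ms * (counter_hz / 1000u);
}
```

For example, `lease_cycles(10, 3000000000ULL)` yields 30,000,000, matching the ~30M cycles quoted for 10ms at 3GHz.
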

### Phase 4: Integration

**Tasks**:
1. Call `mf2_try_adopt_pending()` in `alloc_slow` BEFORE allocating new page
2. If adoption succeeds, retry fast path
3. If adoption fails, allocate new page (existing logic)

### Phase 5: Benchmark & Validate

**Tasks**:
1. Run larson 4T benchmark
2. Verify page reuse >90%
3. Verify throughput >1M ops/s (target: 3-10M)
4. Run full benchmark suite

---

## Current Status (Updated)

**Working**:
- ✅ Per-page sharding (data structures, allocation, free paths)
- ✅ 64KB alignment
- ✅ Memory ordering
- ✅ Pending queue infrastructure (enqueue/dequeue)
- ✅ Direct Handoff (immediate page activation)
- ✅ Helper functions (modular, inline-optimized)
- ✅ Round-robin drain (proof of concept - to be replaced)

**Needs Improvement**:
- ⚠️ Page reuse: 31% (target: >90%)
- ⚠️ Throughput: 27K ops/s (target: 3-10M)

**Root Cause Identified**:
- ❌ "Push to idle owner" doesn't work (Larson Phase 2 pattern)
- ✅ Solution: "Pull by active allocator" (Consumer-Driven Adoption)

**Next Steps**:
1. 🎯 Remove `mf2_maybe_drain_pending()` (cleanup)
2. 🎯 Add `last_transfer_time` field
3. 🎯 Implement `mf2_try_adopt_pending()`
4. 🎯 Integrate adoption into `alloc_slow`
5. 🎯 Benchmark and validate

---

## Lessons Learned (Updated)

1. **Alignment is Critical**: 97% free failure from 4KB vs 64KB alignment mismatch
2. **Memory Ordering Matters**: But doesn't solve architectural issues
3. **Workload Characteristics**: Larson's same-thread pattern exposed TLS isolation bug
4. **Integration vs Separation**: Need to carefully choose integration points
5. **Direct Handoff is Essential**: Returning drained pages to intermediate lists wastes reuse opportunities
6. **Push vs Pull**: "Push to idle owner" doesn't work; "Pull by active allocator" is correct design
7. **ChatGPT Pro Consultation**: Fresh perspective identified fundamental architectural mismatch

---

**Status**: Ready for Consumer-Driven Adoption implementation
**Confidence**: Very High (ChatGPT Pro validated approach, clear design)
**Expected Outcome**: >90% page reuse, 3-10M ops/s (100-350x improvement)