Files
hakmem/docs/status/PHASE_7.2_MF2_PROGRESS_2025_10_24.md

747 lines
22 KiB
Markdown
Raw Normal View History

# Phase 7.2 MF2: Implementation Progress
**Date**: 2025-10-24
**Status**: In Progress - Fixing Pending Queue Drain Issue
**Current**: Implementing Global Round-Robin Strategy
---
## Summary
MF2 Per-Page Sharding基本実装は完了したが、Pending Queue Drainメカニズムに構造的な問題を発見。
Larsonベンチマークでは各スレッドが自分専用の配列範囲でalloc/freeするため、cross-thread freeがほぼゼロ。
結果として、pending queueへのenqueueは成功69K pagesするが、drainが0回という状況。
Task先生の詳細分析により根本原因を特定
- 各スレッドは**自分のTLS**のpending queueしか見ない
- Larsonでは各スレッドが自分でalloc/free → 自分のpending queueは空
- 他スレッドのpending queueに溜まったページは**永遠に処理されない**
---
## Implementation Timeline
### Phase 1-4: Core Implementation ✅
**Commits**:
- `0855b37` - Phase 1: Data structures
- `5c4b780` - Phase 2: Page allocation
- `b12f58c` - Phase 3: Allocation path
- `7e756c6` - Phase 4: Free path
**Status**: Complete
---
### Phase 5: Bug Fixes (Fix #1-6) ✅
#### Fix #1: Block Spacing Bug (`54609c1`)
**Problem**: Infinite loop on first test
**Root Cause**:
```c
size_t block_size = g_class_sizes[class_idx]; // Missing HEADER_SIZE
```
**Fix**: `block_size = HEADER_SIZE + user_size;`
**Result**: Test completes instead of hanging
---
#### Fix #2-3: Performance Optimizations (`aa869b9`)
**Changes**:
- Removed 64KB memset (switched from posix_memalign to mmap)
- Removed O(N) eager drain scan
- Reduced scan limit from 256 to 8
**Result**: 27.5K → 110K ops/s (4x improvement on 4T)
---
#### Fix #4: Alignment Bug (`9e64f7e`) - CRITICAL
**Problem**: 97% of frees silently dropped!
**Root Cause**:
- mmap() only guarantees 4KB alignment
- `addr_to_page()` assumes 64KB alignment
- Lookup fails: `(ptr & ~0xFFFF)` rounds to wrong page base
**Fix**: Changed to `posix_memalign(&page_base, 65536, POOL_PAGE_SIZE)`
**Verification** (by Task agent):
```
Pages allocated: 101,093
Alignment bugs: 0 (ZERO!)
Registry collisions: 0 (ZERO!)
Lookup success rate: 98%
```
**Side Effect**: Performance degraded (466K → 54K) due to memset overhead returning
---
#### Fix #5: Active Page Drain Attempt (`9e64f7e`)
**Change**: Added check for remote frees in active_page before allocating new
**Result**: No improvement (remote drains still 0)
---
#### Fix #6: Memory Ordering (`b0768b3`)
**Problem**: All remote_count operations used `memory_order_relaxed`
**Fix**: Changed 7 locations to `seq_cst/acquire/release`
**Result**: Memory ordering now perfect, but performance still no improvement
**Root Cause Discovery** (by Task agent):
- Debug instrumentation revealed: drain checks and remote frees target **DIFFERENT page objects**
- Thread A's pages in Thread A's tp->active_page/full_pages
- Thread B frees to Thread A's pages → remote_count++
- Thread B's slow path checks Thread B's pages only
- Result: Thread A's pages (with remote_count > 0) never checked by anyone!
---
### Phase 2: Pending Queue Implementation (`89541fc`) ✅
**Implementation** (by Task agent):
- **Box 1**: Data structures - added owner_tp, in_remote_pending, next_pending to MidPage
- **Box 2**: MPSC lock-free queue operations (mf2_enqueue_pending, mf2_dequeue_pending)
- **Box 3**: 0→1 edge detection in mf2_free_slow()
- **Box 4**: Allocation slow path drain (up to 4 pages per allocation)
- **Box 5**: Opportunistic drain (every 16th owner free)
- **Box 6**: Comprehensive debug logging and statistics
**Test Results**:
```
Pending enqueued: 43,138 ✅
Pending drained: 0 ❌
```
**Analysis** (by Task agent):
- Implementation is correct
- Problem: Larson benchmark allocates all pages early, frees later
- By the time remote frees arrive, owner threads don't allocate anymore
- Slow path never called → pending queue never processed
- This is a workload mismatch, not an implementation bug
---
### Tuning: Opportunistic Drain Frequency (`a6eb666`) ✅
**Change**: Increased from every 16th to every 4th free (4x more aggressive)
**Test Results** (larson 10 2-32K 10s 4T):
```
Pending enqueued: 52,912 ✅
Pending drained: 0 ❌
Throughput: 53K ops/s
```
**Conclusion**: Frequency tuning didn't help - workload pattern issue persists
---
### Option 1: free_slow Drain Addition ❌
**Concept**: Add opportunistic drain to both `free_fast()` and `free_slow()`
**Implementation**:
- Created `mf2_maybe_drain_pending()` helper
- Called from both free_fast() (Line 1115) and free_slow() (Line 1167)
**Test Results**:
```
Pending enqueued: 76,733 ✅
Pending drained: 0 ❌
OPP_DRAIN_TRY: 10 attempts (all from tp=0x55828805f7a0)
Throughput: 27,890 ops/s
```
**Problem**: All drain attempts from same thread - other 3 threads not visible
---
### Option C: alloc_slow Drain Addition ❌
**Concept**: Add drain before new page allocation (owner thread allocating continuously)
**Implementation**: Added `mf2_maybe_drain_pending()` at Line 1021 (before `mf2_alloc_new_page()`)
**Test Results**:
```
Pending enqueued: 69,702 ✅
Pending drained: 0 ❌
OPP_DRAIN_TRY: 10 attempts (all from tp=0x559146bb17a0)
Throughput: 27,965 ops/s
```
**Conclusion**: Still 0 drains - same thread issue persists
---
## Root Cause Analysis (by Task Agent)
### Larson Benchmark Characteristics
```cpp
// larson.cpp: exercise_heap()
for (cblks=0; cblks<pdea->NumBlocks; cblks++) {
victim = lran2(&pdea->rgen) % pdea->asize; // Own array range
CUSTOM_FREE(pdea->array[victim]); // Free own allocation
pdea->array[victim] = (char*)CUSTOM_MALLOC(blk_size); // Same slot
}
// Array partitioning (Line 481):
de_area[i].array = &blkp[i*nperthread]; // Each thread owns separate range
```
**Key Finding**: Each thread allocates/frees from its own array range
- Thread 0: `array[0..999]`
- Thread 1: `array[1000..1999]`
- Thread 2: `array[2000..2999]`
- Thread 3: `array[3000..3999]`
**Result**: **Cross-thread frees are almost ZERO**
### MF2 Design vs Larson Mismatch
**MF2 Assumption**:
```
4 threads freeing → all threads call mf2_free() → all threads drain pending
```
**Larson Reality**:
```
1 thread does most freeing → only 1 thread drains pending
Other threads allocate-only → never drain their own pending queues
```
**Problem**:
```c
mf2_maybe_drain_pending() {
MF2_ThreadPages* tp = mf2_thread_pages_get(); // ← Own TLS only!
MidPage* pending = mf2_dequeue_pending(tp, class_idx); // ← Own pending only!
}
```
- Thread A drains → checks Thread A's TLS → Thread A's pending queue is empty
- Thread B/C/D's pending queues (with 69K pages!) are **never checked**
### Pending Enqueue Sources
**76,733 enqueues** come from:
- Phase 1 allocation interruptions (rare cross-thread frees)
- NOT from Phase 2 continuous freeing (same-thread pattern)
---
## Solution Strategy: Global Round-Robin
### Design Philosophy: "Where to Separate, Where to Integrate"
**Separation Points** (working well) ✅:
- Allocation: Thread-local, no lock
- Owner Free: Thread-local, no lock
- Cross-thread Free: Lock-free MPSC stack
**Integration Point** (broken) ❌:
- Pending Queue Drain: Currently thread-local only
### Strategy A: Global Round-Robin (Phase 1) 🎯
**Core Idea**: All threads can drain ANY thread's pending queue
```c
// Global registry
static MF2_ThreadPages* g_all_thread_pages[MAX_THREADS];
static _Atomic int g_num_thread_pages = 0;
// Round-robin drain
mf2_maybe_drain_pending() {
static _Atomic uint64_t counter = 0;
uint64_t count = atomic_fetch_add(&counter, 1);
// Round-robin across ALL threads (not just self!)
int tp_idx = (count / 4) % g_num_thread_pages;
MF2_ThreadPages* tp = g_all_thread_pages[tp_idx];
if (tp) {
int class_idx = (count / 4 / g_num_thread_pages) % POOL_NUM_CLASSES;
MidPage* pending = mf2_dequeue_pending(tp, class_idx);
if (pending) drain_remote_frees(pending);
}
}
```
**Benefits**:
- Larson works: Any thread can drain any thread's pending queue
- Fair: All TLSs get equal drain opportunities
- Simple: Just global array + round-robin
**Implementation Steps**:
1. Add global array `g_all_thread_pages[]`
2. Register TLS in `mf2_thread_pages_get()`
3. Add destructor with `pthread_key_create()`
4. Modify `mf2_maybe_drain_pending()` to round-robin
**Expected Impact**:
```
Pending enqueued: 69K
Pending drained: 69K ✅ (100% instead of 0%)
Page reuse rate: 3% → 90%+ ✅
Throughput: 28K → 3-10M ops/s ✅ (100-350x improvement!)
```
---
### Strategy B: Hybrid (Phase 2) ⚡
**Optimization**: Prefer own TLS (cache efficiency) but periodically check others
```c
if ((count & 3) == 0) { // 1/4: Other threads
tp = g_all_thread_pages[round_robin_idx];
} else { // 3/4: Own TLS (cache hot)
tp = mf2_thread_pages_get();
}
```
**Benefits**:
- Cache efficiency: 75% of drains are own TLS (L1 cache)
- Fairness: 25% of drains are others (ensures progress)
**Metrics**:
- Own TLS: L1 cache hit (1-2 cycles)
- Other TLS: L3 cache hit (10-20 cycles)
- Average cost: **3-5 cycles** (negligible)
---
### Strategy C: Background Sweeper (Phase 3) 🔄
**Safety Net**: Handle edge cases where all threads stop allocating/freeing
```c
void* mf2_drain_thread(void* arg) {
while (running) {
usleep(1000); // 1ms interval (not 100μs - too aggressive)
// Scan all TLSs for leftover pending pages
for (int i = 0; i < g_num_thread_pages; i++) {
for (int c = 0; c < POOL_NUM_CLASSES; c++) {
MidPage* pending = mf2_dequeue_pending(g_all_thread_pages[i], c);
if (pending) drain_remote_frees(pending);
}
}
}
}
```
**Role**: Insurance policy, not main drain mechanism
- Strategy A handles 95% of drains (hot path)
- Strategy C handles 5% leftover (rare cases)
**Latency Impact**: **NONE on hot path** (async background)
---
## 3-Layer Latency Hiding Design
| Layer | Strategy | Frequency | Latency | Coverage | Role |
|-------|----------|-----------|---------|----------|------|
| **L1: Hot Path** | A (Global RR) | Every 4th op | <1μs | 95% | Main drain |
| **L2: Optimization** | B (Hybrid) | 3/4 own, 1/4 other | <1μs | 100% | Cache efficiency |
| **L3: Safety Net** | C (BG sweeper) | 1ms interval | 1ms | 100% | Edge cases |
**Latency Guarantee**: Front-end (alloc/free) always returns in **<1μs**, regardless of background drain state
---
## Implementation Plan
### Phase 1: Global Round-Robin (Today) 🎯
**Target**: Make Larson work
**Tasks**:
1. Add global array `g_all_thread_pages[256]`
2. Add atomic counter `g_num_thread_pages`
3. Add registration in `mf2_thread_pages_get()`
4. Add pthread_key destructor for cleanup
5. Modify `mf2_maybe_drain_pending()` for round-robin
**Expected Time**: 1-2 hours
**Success Criteria**:
- Pending drained > 0 (ideally ~69K)
- Throughput > 1M ops/s (35x improvement from 28K)
---
### Phase 2: Hybrid Optimization (Tomorrow)
**Target**: Improve cache efficiency
**Tasks**:
1. Modify `mf2_maybe_drain_pending()` to prefer own TLS (3/4 ratio)
2. Benchmark cache hit rates
**Expected Time**: 30 minutes
**Success Criteria**:
- L1 cache hit rate > 75%
- Throughput gain: +5-10%
---
### Phase 3: Background Sweeper (Optional)
**Target**: Handle edge cases
**Tasks**:
1. Create background thread with `pthread_create()`
2. Scan all TLSs every 1ms
3. CPU throttling (< 1% usage)
**Expected Time**: 30 minutes
**Success Criteria**:
- No pending leftovers after 10s idle
- CPU overhead < 1%
---
## Current Status
**Working**:
- ✅ Per-page sharding (data structures, allocation, free paths)
- ✅ 64KB alignment (Fix #4)
- ✅ Memory ordering (Fix #6)
- ✅ Pending queue infrastructure (enqueue works perfectly)
- ✅ 0→1 edge detection
**Broken**:
- ❌ Pending queue drain (0 drains due to TLS isolation)
- ❌ Page reuse (3% instead of 90%)
- ❌ Performance (28K ops/s instead of 3-10M)
**Next**:
- 🎯 Implement Phase 1: Global Round-Robin
- 🎯 Expected breakthrough: 28K → 3-10M ops/s
---
## Files Modified
### Core Implementation
- `hakmem_pool.c` (Lines 275-1200): MF2 implementation
- Data structures (MidPage, MF2_ThreadPages, PageRegistry)
- Allocation paths (fast/slow)
- Free paths (fast/slow)
- Pending queue operations
- Opportunistic drain (currently broken)
### Documentation
- `docs/specs/ENV_VARS.md`: Added `HAKMEM_MF2_ENABLE`
- `docs/status/PHASE_7.2_MF2_PLAN_2025_10_24.md`: Original plan
- `docs/status/PHASE_7.2_MF2_PROGRESS_2025_10_24.md`: This file
### Debug Reports
- `ALIGNMENT_FIX_VERIFICATION.md`: Fix #4 verification by Task agent
---
## Lessons Learned
1. **Alignment is Critical**: 97% free failure from 4KB vs 64KB alignment mismatch
2. **Memory Ordering Matters**: But doesn't solve architectural issues
3. **Workload Characteristics**: Larson's same-thread pattern exposed TLS isolation bug
4. **Integration vs Separation**: Need to carefully choose integration points
5. **Task Agent is MVP**: Detailed analysis saved days of debugging
---
---
## Phase 1: Global Round-Robin Implementation ✅
**Commit**: (multiple commits implementing round-robin drain)
**Implementation**:
1. Added `g_all_thread_pages[256]` global array
2. Added `g_num_thread_pages` atomic counter
3. Implemented TLS registration in `mf2_thread_pages_get()`
4. Implemented `mf2_maybe_drain_pending()` with round-robin logic
5. Called from both `mf2_free_fast()` and `mf2_alloc_slow()`
**Test Results** (larson 10 2-32K 10s 4T):
```
Pending enqueued: 96,429 ✅
Pending drained: 70,607 ✅ (73% - huge improvement from 0%!)
Page reuse count: 5,222
Throughput: ~28,705 ops/s
```
**Analysis**:
- ✅ Round-robin drain WORKS! (0 drains → 70K drains)
- ⚠️ But page reuse only 2.3% (5,222 / 226,447 pages allocated)
- Problem: Drained pages returned to full_pages, but owner doesn't scan them
---
## Strategy C: Direct Handoff Implementation ✅
**Concept**: Don't return drained pages to full_pages - make them **active immediately**
**Implementation** (clean modular code):
```c
// Helper: Make page active (move old active to full_pages)
static inline void mf2_make_page_active(MF2_ThreadPages* tp, int class_idx, MidPage* page);
// Helper: Drain page and activate if successful (Direct Handoff)
static inline bool mf2_try_drain_and_activate(MF2_ThreadPages* tp, int class_idx, MidPage* page);
```
**Changes**:
1. Modified `mf2_maybe_drain_pending()` to use `mf2_try_drain_and_activate()`
2. Modified `alloc_slow` pending drain loop to use Direct Handoff
3. Reduced opportunistic drain from 60+ lines to 20 lines
**Test Results** (larson 10 2-32K 10s 4T):
```
Pending enqueued: 96,429
Pending drained: 70,607
Page reuse count: 80,017 ✅ (15x improvement!)
Throughput: ~28,705 ops/s
```
**Success**: Page reuse 35% (80,017 / 226,447)
---
## Full Pages Scan Removal ✅
**Evidence**: Full_pages scan checked 1.88M pages but found **0 pages** (0% success rate)
**Reason**: Direct Handoff immediately activates drained pages, so full_pages never contains reusable pages
**Action**: Removed full_pages scan (76 lines deleted)
**Test Results**:
```
Page reuses: 69,098 (31%)
Throughput: 27,206 ops/s
```
**Conclusion**: Slight decrease but acceptable (simplification benefit)
---
## Frequency Tuning Attempts ⚙️
Tested multiple opportunistic drain frequencies:
| Frequency | Page Reuses | Reuse % | Throughput |
|-----------|-------------|---------|------------|
| 1/2 (50%) | 70,607 | 31% | 27,206 ops/s |
| 1/4 (25%) | 45,369 | 20% | 27,423 ops/s |
| 1/8 (12.5%) | 24,901 | 11% | 27,642 ops/s |
**Finding**: Higher frequency = better reuse, but still far from 90% target
---
## Hybrid Strategy Attempt (Strategy B) ❌
**Concept**: 75% own TLS (cache efficiency) + 25% round-robin (fairness)
**Implementation**:
```c
if ((count & 3) == 0) { // 1/4: Other threads
tp = g_all_thread_pages[round_robin_idx];
} else { // 3/4: Own TLS
tp = mf2_thread_pages_get();
}
```
**Test Results** (50% overall frequency):
```
Page reuses: 12,676 (5.5%) ❌
Problem: Effective frequency too low (37.5% own + 12.5% others)
```
**Conclusion**: Reverted to pure round-robin at 50% frequency (31% reuse)
---
## ChatGPT Pro Consultation 🧠
**Date**: 2025-10-24
### Question Posed
Complete technical question covering:
- MF2 architecture (Pending Queue, Direct Handoff, Opportunistic Drain)
- Problem: 31% reuse vs 90% target
- Constraints: O(1), lock-free, per-page freelist
- What was tried: Frequencies (1/8, 1/4, 1/2), Hybrid (75/25)
### Diagnosis
**Root Problem**: "Round-robin drain → owner handoff" doesn't work when owner stops allocating
**Larson Benchmark Pattern**:
- **Phase 1** (0-1s): All threads allocate → pages populate
- **Phase 2** (1-10s): All threads free+realloc from own ranges
- Thread A frees Thread A's objects → no cross-thread frees
- Thread B frees Thread B's objects → no cross-thread frees
- **But**: Some cross-thread frees do occur (~10%)
**The Architectural Mismatch**:
```
Current (Round-Robin Drain):
1. Thread A frees → Thread B's page goes to pending queue
2. Thread C (round-robin) drains Thread B's pending → activates page on Thread B
3. Thread B is NOT allocating (Larson Phase 2) → page sits unused
4. Thread A needs memory → allocates NEW page (doesn't know about Thread B's ready page)
```
**Result**: Pages drained but never used = 31% reuse instead of 90%
### Recommended Solution: Consumer-Driven Adoption
**Core Principle**: "Don't push pages to idle threads, let active threads **pull** and **adopt** them"
**Key Changes**:
1. **Remove round-robin drain entirely** (no more `mf2_maybe_drain_pending()`)
2. **Add ownership transfer**: CAS to change `page->owner_tid`
3. **Adoption on-demand**: Allocating thread adopts pages from ANY thread's pending queue
4. **Lease mechanism**: Prevent thrashing (re-transfer within 10ms)
**Algorithm**:
```c
// In alloc_slow, BEFORE allocating new page:
bool mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx) {
// Scan all threads' pending queues (round-robin for fairness)
for (int i = 0; i < num_threads; i++) {
MidPage* page = mf2_dequeue_pending(other_thread[i], class_idx);
if (!page) continue;
// Try to transfer ownership (CAS)
uint64_t old_owner = page->owner_tid;
uint64_t now = rdtsc();
if (now - page->last_transfer_time < LEASE_CYCLES) continue; // Lease active
if (!CAS(&page->owner_tid, old_owner, my_tid)) continue; // CAS failed
// Success! Ownership transferred
page->owner_tp = me;
page->last_transfer_time = now;
// Drain and activate
mf2_drain_remote_frees(page);
if (page->freelist) {
mf2_make_page_active(me, class_idx, page);
return true; // SUCCESS!
}
}
return false; // No adoptable pages
}
```
**Expected Effects**:
- ✅ No wasted effort (only allocating threads drain)
- ✅ Page reuse >90% (allocating thread gets any available page)
- ✅ Throughput 3-10M ops/s (100-350x improvement)
- ✅ Hot path unchanged (fast alloc/free still O(1), lock-free)
---
## Implementation Plan: Consumer-Driven Adoption
### Phase 1: Code Cleanup & Preparation ✅
**Tasks**:
1. ✅ Remove `mf2_maybe_drain_pending()` (opportunistic drain)
2. ✅ Remove all calls to `mf2_maybe_drain_pending()`
3. ✅ Keep helper functions (`mf2_make_page_active`, `mf2_try_drain_and_activate`)
### Phase 2: Data Structure Updates
**Tasks**:
1. Add `uint64_t last_transfer_time` to `MidPage` struct
2. Ensure `owner_tid` and `owner_tp` are already present (✅ verified)
### Phase 3: Adoption Function
**Tasks**:
1. Implement `mf2_try_adopt_pending(MF2_ThreadPages* me, int class_idx)`
- Scan all threads' pending queues (round-robin)
- Check lease (rdtsc() - last_transfer_time >= LEASE_CYCLES)
- CAS ownership transfer
- Drain and activate if successful
2. Tune `LEASE_CYCLES` (start with 10ms = ~30M cycles on 3GHz CPU)
### Phase 4: Integration
**Tasks**:
1. Call `mf2_try_adopt_pending()` in `alloc_slow` BEFORE allocating new page
2. If adoption succeeds, retry fast path
3. If adoption fails, allocate new page (existing logic)
### Phase 5: Benchmark & Validate
**Tasks**:
1. Run larson 4T benchmark
2. Verify page reuse >90%
3. Verify throughput >1M ops/s (target: 3-10M)
4. Run full benchmark suite
---
## Current Status (Updated)
**Working**:
- ✅ Per-page sharding (data structures, allocation, free paths)
- ✅ 64KB alignment
- ✅ Memory ordering
- ✅ Pending queue infrastructure (enqueue/dequeue)
- ✅ Direct Handoff (immediate page activation)
- ✅ Helper functions (modular, inline-optimized)
- ✅ Round-robin drain (proof of concept - to be replaced)
**Needs Improvement**:
- ⚠️ Page reuse: 31% (target: >90%)
- ⚠️ Throughput: 27K ops/s (target: 3-10M)
**Root Cause Identified**:
- ❌ "Push to idle owner" doesn't work (Larson Phase 2 pattern)
- ✅ Solution: "Pull by active allocator" (Consumer-Driven Adoption)
**Next Steps**:
1. 🎯 Remove `mf2_maybe_drain_pending()` (cleanup)
2. 🎯 Add `last_transfer_time` field
3. 🎯 Implement `mf2_try_adopt_pending()`
4. 🎯 Integrate adoption into `alloc_slow`
5. 🎯 Benchmark and validate
---
## Lessons Learned (Updated)
1. **Alignment is Critical**: 97% free failure from 4KB vs 64KB alignment mismatch
2. **Memory Ordering Matters**: But doesn't solve architectural issues
3. **Workload Characteristics**: Larson's same-thread pattern exposed TLS isolation bug
4. **Integration vs Separation**: Need to carefully choose integration points
5. **Direct Handoff is Essential**: Returning drained pages to intermediate lists wastes reuse opportunities
6. **Push vs Pull**: "Push to idle owner" doesn't work; "Pull by active allocator" is correct design
7. **ChatGPT Pro Consultation**: Fresh perspective identified fundamental architectural mismatch
---
**Status**: Ready for Consumer-Driven Adoption implementation
**Confidence**: Very High (ChatGPT Pro validated approach, clear design)
**Expected Outcome**: >90% page reuse, 3-10M ops/s (100-350x improvement)