hakmem/docs/archive/ALIGNMENT_FIX_VERIFICATION.md
Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00

# ALIGNMENT FIX VERIFICATION REPORT
## Status: ✅ ALIGNMENT WORKS | 🚨 NEW BUG FOUND
**Date:** 2025-10-24
**Test:** larson 3 2048 32768 10000 1 12345 1
**Config:** HAKMEM_MF2_ENABLE=1 HAKMEM_WRAP_L2=1
---
## Executive Summary
### Question: Does the alignment fix actually work?
**Answer: YES! Alignment is perfect.**
### Question: Why is performance worse?
**Answer: Different bug - remote frees go to orphaned pages!**
---
## Part 1: Alignment Verification ✅
### Test Results
```
Pages allocated: 101,093
Alignment bugs: 0 (ZERO!)
Registry collisions: 0 (ZERO!)
Lookup success rate: 98% (2% are non-MF2 pages as expected)
```
### Evidence
**Page Allocation (Step 1):**
```
[REGISTER 0] Page 0x7efc3b240000 → idx 15140 (aligned=YES)
[REGISTER 1] Page 0x7efc3b220000 → idx 15138 (aligned=YES)
[REGISTER 2] Page 0x7efc3b200000 → idx 15136 (aligned=YES)
...
[REGISTER 9] Page 0x7efc3b110000 → idx 15121 (aligned=YES)
```
**Page Lookup (Step 3):**
```
[LOOKUP 1] addr=0x7efc34420028 → page_base=0x7efc34420000 → idx=13378
found=YES, page->base=0x7efc34420000, match=YES
[LOOKUP 2] addr=0x7efc34420028 → page_base=0x7efc34420000 → idx=13378
found=YES, page->base=0x7efc34420000, match=YES
```
### Conclusion
**Fix #4 (posix_memalign) WORKS PERFECTLY!**
- All pages are 64KB aligned (no alignment bugs detected)
- Registry lookups succeed when querying MF2-managed memory
- The 97% free failure problem is SOLVED by alignment
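
Why 64KB alignment matters for lookups can be sketched as follows. This is a minimal illustration, not the code in `hakmem_pool.c`: the helper names are hypothetical, and the index function (page number masked to 16 bits) is an assumption that merely happens to reproduce the `idx` values in the logs above. With 64KB-aligned pages, masking the low 16 bits of any interior pointer yields the page base; with 4KB-aligned `mmap` pages the same mask generally lands on an address that was never registered.

```c
#include <assert.h>
#include <stdint.h>

#define MF2_PAGE_SIZE ((uint64_t)64 * 1024)  // 64KB pages, per the report
#define MF2_REG_BITS  16                     // ASSUMPTION: 2^16 registry slots

// Round any interior pointer down to the start of its 64KB page.
static uint64_t mf2_page_base(uint64_t addr) {
    return addr & ~(MF2_PAGE_SIZE - 1);
}

// Hypothetical index function: page number modulo the registry size.
static unsigned mf2_registry_idx(uint64_t page_base) {
    return (unsigned)((page_base >> 16) & ((1u << MF2_REG_BITS) - 1));
}

// A page base is usable for mask-based lookup only if it is 64KB aligned.
static int mf2_is_aligned(uint64_t page_base) {
    return (page_base & (MF2_PAGE_SIZE - 1)) == 0;
}
```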
---
## Part 2: Performance Regression Analysis 🚨
### Expected vs Actual Performance
| Metric | Before Fix #4 (mmap) | After Fix #4 (posix_memalign) | Expected |
|--------|---------------------|-------------------------------|----------|
| Alignment | Broken (4KB) | Perfect (64KB) | Perfect ✅ |
| Free success | 3% | 100% | 100% ✅ |
| Ops/sec | 466K | 54K | 1M+ ❌ |
| Remote drains | Unknown | **0** | >50K ❌ |
### Root Cause: Orphaned Remote Frees
**Problem:**
```
Remote frees sent: 77,041 (51% of all frees!)
Remote drains executed: 0 (ZERO!)
Pages scanned: 860,522
Pages with remotes found: 0 (ZERO!)
```
**Smoking Gun Evidence:**
```
# Pages receiving remote frees:
[REMOTE_FREE 0] → page (base=0x7f3b308e0000), remote_count=1
[REMOTE_FREE 1] → page (base=0x7f3b309f0000), remote_count=1
...
# Pages being scanned in full_pages list:
[FULL_SCAN 0] page (base=0x7f0b2cb10000), remote_count=0
[FULL_SCAN 1] page (base=0x7f0b2caf0000), remote_count=0
...
# Cross-reference result:
ORPHAN RATE: 100% (0/10 pages with remotes are in full_pages!)
```
### The Bug Explained
**Multi-threaded allocation pattern:**
1. Thread A allocates objects from Page X (X is active on Thread A)
2. Thread B allocates objects from Page Y (Y is active on Thread B)
3. Thread A passes objects to Thread B for processing
4. Thread B frees Thread A's objects → **remote free to Page X!**
5. Page X is still ACTIVE on Thread A (not in full_pages yet!)
6. Thread A scans full_pages → **Page X not there!** → no drain!
**Result:** Pages with remote frees remain active on their owner threads,
never get moved to full_pages, never get drained!
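
The mechanism behind steps 4 to 6 can be sketched with a small C11 model. The types and function names below are hypothetical and the real `MidPage` layout is not shown in this report; the sketch only assumes a mimalloc-style split between an owner-only `freelist` and an atomic remote stack. The point is that a remote free only becomes allocatable again once the owner calls the drain function, so a page that is never drained keeps accumulating blocks it cannot hand out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct Block { struct Block* next; } Block;

typedef struct Page {
    Block*          freelist;      // touched by the owner thread only
    _Atomic(Block*) remote_head;   // stack of blocks freed by other threads
    atomic_uint     remote_count;  // number of blocks pending on remote_head
} Page;

// Called by a NON-owner thread: push the block onto the remote stack.
static void page_remote_free(Page* page, Block* b) {
    // Count first, so remote_count never drops below the stack length.
    atomic_fetch_add_explicit(&page->remote_count, 1, memory_order_relaxed);
    Block* old = atomic_load_explicit(&page->remote_head, memory_order_relaxed);
    do {
        b->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &page->remote_head, &old, b,
                 memory_order_release, memory_order_relaxed));
}

// Called by the OWNER thread: take the whole remote stack in one exchange
// and splice it into the local freelist. Returns the number of blocks moved.
static unsigned page_drain_remote(Page* page) {
    Block* b = atomic_exchange_explicit(&page->remote_head, (Block*)NULL,
                                        memory_order_acquire);
    unsigned n = 0;
    while (b) {
        Block* next = b->next;
        b->next = page->freelist;
        page->freelist = b;
        b = next;
        n++;
    }
    atomic_fetch_sub_explicit(&page->remote_count, n, memory_order_relaxed);
    return n;
}
```

In this model, a page that receives remote frees but is never passed to `page_drain_remote` shows `remote_count > 0` with an empty `freelist`, which matches the orphaned state described above.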
### Why Performance Got WORSE After Fix
**Before (mmap - alignment broken):**
- 97% of frees fail silently (page not found in registry)
- Memory leaks, but some lucky 3% get reused
- Performance: 466K ops/sec (bad but not terrible)
**After (posix_memalign - alignment perfect):**
- 100% of frees succeed (remote_count incremented correctly!)
- BUT pages with remotes are orphaned (still active, not in full_pages)
- Owner threads never check their own active pages for remotes!
- Allocator allocates NEW pages instead of draining existing ones
- Performance: 54K ops/sec (MUCH worse!)
---
## Part 3: The Missing Drain Logic
### Current Drain Strategy (BROKEN)
```c
// Drain check #1: Active page
if (active_page->remote_count > 0) {
    drain(active_page);  // ← This NEVER triggers!
}
// Drain check #2: Full pages
for (page in full_pages) {
    if (page->remote_head != 0) {
        drain(page);  // ← Remote pages not here!
    }
}
```
**Why it fails:**
- Active pages never self-check for remotes (assumption: remotes only go to full pages)
- But in a producer-consumer pattern, remotes go to pages that are still ACTIVE on the producer thread!
### What SHOULD Happen
```c
// Option A: Check active page for remotes BEFORE allocating new page
if (active_page && active_page->remote_count > 0) {
    drain(active_page);
    if (active_page->freelist) {
        return alloc_from_active_page();
    }
}
// Option B: Periodic drain of all active pages (background thread)
// Option C: Signal owner thread when remote_count exceeds threshold
```
---
## Part 4: Performance Math
### Allocation Efficiency
```
Total operations: 263,143
New pages allocated: 101,093
Expected reuse: ~90% (mimalloc-style allocator)
Actual reuse: 3% (3,324 / 101,093)
Why so many pages?
→ 77K remote frees sitting in orphaned active pages
→ Never drained (owner thread doesn't check own active page!)
→ Allocator allocates NEW pages instead
→ 2.6 allocs per page (should be ~50+!)
```
### memset Overhead (Secondary Issue)
```
memset overhead: 15μs per page × 101K pages = 1.5 seconds
Total runtime: ~18 seconds
memset impact: ~8% of total runtime
Not the main problem, but still wasteful!
```
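
The figures in this part can be re-derived from the raw counts. The helpers below are a hypothetical sketch for checking the arithmetic, not part of the allocator; the inputs are the numbers quoted above (263,143 operations, 101,093 pages, 3,324 reused pages, 15μs per page, ~18 seconds total). Comparing the resulting allocs-per-page figure with the ~50 target suggests roughly a 19x gap in page utilization.

```c
#include <assert.h>
#include <stdio.h>

// Average number of allocations served per page.
static double allocs_per_page(long ops, long pages) {
    return (double)ops / (double)pages;
}

// Share of pages that were reused, in percent.
static double reuse_pct(long reused, long pages) {
    return 100.0 * (double)reused / (double)pages;
}

// Total time spent in memset, given a per-page cost in microseconds.
static double memset_seconds(double us_per_page, long pages) {
    return us_per_page * 1e-6 * (double)pages;
}

// Print the derived figures for the run described in this report.
static void print_part4_math(void) {
    double app = allocs_per_page(263143, 101093);
    double ms  = memset_seconds(15.0, 101093);
    printf("allocs per page : %.2f\n", app);
    printf("page reuse      : %.1f%%\n", reuse_pct(3324, 101093));
    printf("gap vs 50/page  : %.1fx\n", 50.0 / app);
    printf("memset time     : %.2f s (%.1f%% of 18 s)\n", ms, 100.0 * ms / 18.0);
}
```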
---
## Part 5: Recommended Fixes (Priority Order)
### FIX A: Active Page Drain (CRITICAL)
**Before allocating new page, drain active page if it has remotes:**
```c
// In mf2_alloc_slow(), BEFORE "allocate new page":
if (tp->active_page[class_idx]) {
    MidPage* active = tp->active_page[class_idx];
    if (atomic_load_explicit(&active->remote_count, memory_order_relaxed) > 0) {
        int drained = mf2_drain_remote_frees(active);
        if (drained > 0 && active->freelist) {
            // Success! Retry allocation from active page
            return mf2_alloc_fast(class_idx, size, site_id);
        }
    }
}
```
**Expected impact:** 77K drains instead of 0 → 90%+ page reuse → 10x fewer pages → major speedup!
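
The effect of Fix A on page count can be illustrated with a deliberately simplified, single-threaded toy model. Everything here is hypothetical (`ToyPage`, `ToyHeap`, `toy_run`, and the 50 blocks per page figure), and it assumes the extreme case where every block is freed remotely right after allocation while its page is still active. It does not reproduce the measured numbers; it only shows why an allocator that never drains its active page keeps requesting new pages, while one that drains first can keep reusing the same page.

```c
#include <assert.h>
#include <stdbool.h>

enum { BLOCKS_PER_PAGE = 50 };  // illustrative value only

typedef struct {
    int free_blocks;    // blocks the owner can hand out right now
    int remote_count;   // blocks freed by other threads, not yet drained
} ToyPage;

typedef struct {
    ToyPage active;
    long    pages_allocated;
    bool    drain_active;   // true = Fix A behavior
} ToyHeap;

static void toy_new_page(ToyHeap* h) {
    // The previous active page (and its pending remotes) is abandoned.
    h->active.free_blocks  = BLOCKS_PER_PAGE;
    h->active.remote_count = 0;
    h->pages_allocated++;
}

static void toy_alloc(ToyHeap* h) {
    if (h->active.free_blocks == 0) {
        if (h->drain_active && h->active.remote_count > 0) {
            // Fix A: reclaim remote frees before asking for a new page.
            h->active.free_blocks  = h->active.remote_count;
            h->active.remote_count = 0;
        } else {
            toy_new_page(h);
        }
    }
    h->active.free_blocks--;
}

static void toy_remote_free(ToyHeap* h) {
    h->active.remote_count++;
}

// Run `ops` alloc + remote-free pairs and report how many pages were needed.
static long toy_run(bool drain_active, long ops) {
    ToyHeap h = { {0, 0}, 0, drain_active };
    for (long i = 0; i < ops; i++) {
        toy_alloc(&h);
        toy_remote_free(&h);  // the consumer frees each block immediately
    }
    return h.pages_allocated;
}
```

Under these assumptions, `toy_run(false, 1000)` needs a fresh page every 50 allocations, while `toy_run(true, 1000)` stays on its first page for the whole run.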
### FIX B: Eliminate memset (OPTIMIZATION)
**Use mmap with manual alignment:**
```c
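// Map 128KB so that a 64KB-aligned, 64KB-long page is guaranteed to fit inside
void* raw = mmap(NULL, POOL_PAGE_SIZE * 2, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (raw == MAP_FAILED) {
    return NULL;  // assumption: the caller treats NULL as out-of-memory
}
// Round up to the next 64KB boundary
void* aligned = (void*)(((uintptr_t)raw + 0xFFFF) & ~(uintptr_t)0xFFFF);
// Save raw pointer for munmap later
page->raw_base = raw;
page->base = aligned;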
```
**Benefits:**
- Lazy zero-fill (no memset!)
- Perfect 64KB alignment
- Fast allocation (mmap)
**Tradeoffs:**
- Wastes 64KB per page (128KB allocated, 64KB used)
- More complex cleanup (must munmap raw_base, not aligned base)
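
One common variation on this approach, not taken from the original report, avoids both tradeoffs by unmapping the unused head and tail of the 128KB mapping right away. The sketch below assumes a POSIX system whose page size divides 64KB; `mf2_map_page_trimmed` and `MF2_PAGE_BYTES` are hypothetical names. After trimming, the returned pointer is the only mapping left, so cleanup becomes a plain `munmap(ptr, MF2_PAGE_BYTES)` and no `raw_base` needs to be stored.

```c
#define _DEFAULT_SOURCE  // for MAP_ANONYMOUS under strict -std= modes
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define MF2_PAGE_BYTES ((size_t)64 * 1024)

// Map one zero-filled, 64KB-aligned, 64KB-long region.
// Returns NULL on failure.
static void* mf2_map_page_trimmed(void) {
    size_t span = MF2_PAGE_BYTES * 2;
    void* raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) {
        return NULL;
    }
    uintptr_t start = ((uintptr_t)raw + MF2_PAGE_BYTES - 1)
                      & ~(uintptr_t)(MF2_PAGE_BYTES - 1);
    size_t head = start - (uintptr_t)raw;         // slack before the page
    size_t tail = span - head - MF2_PAGE_BYTES;   // slack after the page
    if (head > 0) {
        munmap(raw, head);
    }
    if (tail > 0) {
        munmap((void*)(start + MF2_PAGE_BYTES), tail);
    }
    return (void*)start;
}
```

The extra `munmap` calls add system-call overhead per page, so whether this beats keeping the slack mapped is something the benchmarks would have to decide.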
### FIX C: Background Drain Thread (OPTIONAL)
**Periodically drain all active pages in all threads:**
```c
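// Requires <stdatomic.h>, <stdbool.h>, <time.h>
void* bg_drain_thread(void* arg) {
    (void)arg;
    // 100ms interval; sleep(0.1) would truncate to sleep(0) and spin
    const struct timespec interval = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
    while (true) {
        nanosleep(&interval, NULL);
        for_each_thread(tp) {  // pseudocode: iterate over every thread's pool
            for (int class_idx = 0; class_idx < NUM_CLASSES; class_idx++) {
                MidPage* page = tp->active_page[class_idx];
                if (page && atomic_load_explicit(&page->remote_count,
                                                 memory_order_relaxed) > 0) {
                    // Assumption: drain writes the owner-only freelist, so this
                    // call must be synchronized with the owner thread.
                    mf2_drain_remote_frees(page);
                }
            }
        }
    }
    return NULL;  // not reached
}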
```
**Benefits:** Drains remotes even if allocation rate is low
**Tradeoffs:** Extra thread overhead, complexity
---
## Part 6: Next Actions
### Immediate (Tonight)
1. ✅ **Alignment verified** - Fix #4 works!
2. 🚨 **Implement Fix A** - Add active page drain before new page allocation
3. 📊 **Re-run benchmarks** - Measure impact
### Short-term (This Week)
4. **Implement Fix B** - Eliminate memset with mmap+manual_align
5. **Full benchmark suite** - larson, sh6bench, sh8bench, etc.
6. **Compare to baseline** - MF2 vs original allocator
### Long-term (Optional)
7. **Implement Fix C** - Background drain thread (if needed)
8. **Profile hot paths** - Identify remaining bottlenecks
9. **Consider alternatives** - If MF2 still underperforms, revert to original
---
## Bottom Line
**ALIGNMENT IS PERFECT!** The posix_memalign fix solved the 97% free failure problem.
**PERFORMANCE IS BAD!** But it's NOT because of alignment - it's because:
1. Remote frees go to orphaned active pages (not drained)
2. memset overhead (1.5s wasted)
**FIX THE DRAIN LOGIC FIRST** - that's where the real problem is!
Without draining active pages, MF2 allocates 30x more pages than it should,
causing catastrophic memory waste and allocation overhead.
---
**Test artifacts saved in:**
- This report: `/home/tomoaki/git/hakmem/ALIGNMENT_FIX_VERIFICATION.md`
- Instrumented code: `hakmem_pool.c` (search for "[ALIGNMENT", "[LOOKUP", etc.)