hakmem/docs/archive/ALIGNMENT_FIX_VERIFICATION.md


ALIGNMENT FIX VERIFICATION REPORT

Status: ALIGNMENT WORKS | 🚨 NEW BUG FOUND

Date: 2025-10-24
Test: larson 3 2048 32768 10000 1 12345 1
Config: HAKMEM_MF2_ENABLE=1 HAKMEM_WRAP_L2=1


Executive Summary

Question: Does the alignment fix actually work?

Answer: YES! Alignment is perfect.

Question: Why is performance worse?

Answer: Different bug - remote frees go to orphaned pages!


Part 1: Alignment Verification

Test Results

Pages allocated:     101,093
Alignment bugs:      0 (ZERO!)
Registry collisions: 0 (ZERO!)
Lookup success rate: 98% (2% are non-MF2 pages as expected)

Evidence

Page Allocation (Step 1):

[REGISTER 0] Page 0x7efc3b240000 → idx 15140 (aligned=YES)
[REGISTER 1] Page 0x7efc3b220000 → idx 15138 (aligned=YES)
[REGISTER 2] Page 0x7efc3b200000 → idx 15136 (aligned=YES)
...
[REGISTER 9] Page 0x7efc3b110000 → idx 15121 (aligned=YES)

Page Lookup (Step 3):

[LOOKUP 1] addr=0x7efc34420028 → page_base=0x7efc34420000 → idx=13378 
           found=YES, page->base=0x7efc34420000, match=YES
[LOOKUP 2] addr=0x7efc34420028 → page_base=0x7efc34420000 → idx=13378 
           found=YES, page->base=0x7efc34420000, match=YES

Conclusion

Fix #4 (posix_memalign) WORKS PERFECTLY!

  • All pages are 64KB aligned (no alignment bugs detected)
  • Registry lookups succeed when querying MF2-managed memory
  • The 97% free failure problem is SOLVED by alignment
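The lookup path in the log lines above can be sketched as two bit operations. This is a minimal sketch, not the code in hakmem_pool.c: the `sketch_` names are hypothetical, and the index formula (page number truncated to 16 bits) is an inference that happens to reproduce every `idx` value logged above.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE ((uintptr_t)64 * 1024)  /* 64KB pages, per the report */

/* Round any interior pointer down to the base of its 64KB page. */
static inline uintptr_t sketch_page_base(uintptr_t addr) {
    return addr & ~(SKETCH_PAGE_SIZE - 1);
}

/* Assumed registry index: 64KB page number, truncated to 16 bits. */
static inline size_t sketch_registry_index(uintptr_t addr) {
    return (size_t)((addr >> 16) & 0xFFFF);
}

/* A page base is usable by the registry only if it sits on a 64KB boundary. */
static inline int sketch_is_aligned(uintptr_t base) {
    return (base & (SKETCH_PAGE_SIZE - 1)) == 0;
}
```

With a base that is only 4KB aligned, `sketch_page_base()` of an interior address would round down to a 64KB boundary that differs from the stored `page->base`, which would explain why lookups failed before Fix #4.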

Part 2: Performance Regression Analysis 🚨

Expected vs Actual Performance

Metric         Before Fix #4 (mmap)   After Fix #4 (posix_memalign)   Expected
Alignment      Broken (4KB)           Perfect (64KB)                  Perfect
Free success   3%                     100%                            100%
Ops/sec        466K                   54K                             1M+
Remote drains  Unknown                0                               >50K

Root Cause: Orphaned Remote Frees

Problem:

Remote frees sent:        77,041 (51% of all frees!)
Remote drains executed:   0      (ZERO!)
Pages scanned:            860,522
Pages with remotes found: 0      (ZERO!)

Smoking Gun Evidence:

# Pages receiving remote frees:
[REMOTE_FREE 0] → page (base=0x7f3b308e0000), remote_count=1
[REMOTE_FREE 1] → page (base=0x7f3b309f0000), remote_count=1
...

# Pages being scanned in full_pages list:
[FULL_SCAN 0] page (base=0x7f0b2cb10000), remote_count=0
[FULL_SCAN 1] page (base=0x7f0b2caf0000), remote_count=0
...

# Cross-reference result:
ORPHAN RATE: 100% (0/10 pages with remotes are in full_pages!)

The Bug Explained

Multi-threaded allocation pattern:

  1. Thread A allocates objects from Page X (X is active on Thread A)
  2. Thread B allocates objects from Page Y (Y is active on Thread B)
  3. Thread A passes objects to Thread B for processing
  4. Thread B frees Thread A's objects → remote free to Page X!
  5. Page X is still ACTIVE on Thread A (not in full_pages yet!)
  6. Thread A scans full_pages → Page X not there! → no drain!

Result: Pages with remote frees remain active on their owner threads,
never get moved to full_pages, never get drained!
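The orphaning can be reproduced in a deliberately simplified, single-threaded model. The struct and function below are hypothetical stand-ins rather than MF2 types. They only encode the rule described above: the drain pass looks at pages on the full list and nowhere else.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page; only the two fields the bug depends on. */
typedef struct SketchPage {
    int  remote_count;  /* frees queued by non-owner threads */
    bool in_full_list;  /* true once the owner has retired the page to full_pages */
} SketchPage;

/* Models the current drain pass: pages that are not on the full list are skipped. */
static int sketch_scan_full_pages(SketchPage *pages, size_t n) {
    int drained = 0;
    for (size_t i = 0; i < n; i++) {
        if (pages[i].in_full_list && pages[i].remote_count > 0) {
            drained += pages[i].remote_count;
            pages[i].remote_count = 0;
        }
    }
    return drained;
}
```

A page with `remote_count = 3` and `in_full_list = false` (Page X in the steps above) survives any number of scans untouched, which matches the 0 drains reported for 860,522 pages scanned.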

Why Performance Got WORSE After Fix

Before (mmap - alignment broken):

  • 97% of frees fail silently (page not found in registry)
  • Memory leaks, but some lucky 3% get reused
  • Performance: 466K ops/sec (bad but not terrible)

After (posix_memalign - alignment perfect):

  • 100% of frees succeed (remote_count incremented correctly!)
  • BUT pages with remotes are orphaned (still active, not in full_pages)
  • Owner threads never check their own active pages for remotes!
  • Allocator allocates NEW pages instead of draining existing ones
  • Performance: 54K ops/sec (MUCH worse!)

Part 3: The Missing Drain Logic

Current Drain Strategy (BROKEN)

// Drain check #1: Active page
if (active_page->remote_count > 0) {
    drain(active_page);  // ← This NEVER triggers!
}

// Drain check #2: Full pages
for (page in full_pages) {
    if (page->remote_head != 0) {
        drain(page);  // ← Remote pages not here!
    }
}

Why it fails:

  • Active pages never self-check for remotes (assumption: remotes only go to full pages)
  • But in a producer-consumer pattern, remote frees land on pages that are still ACTIVE on the producer thread!

What SHOULD Happen

// Option A: Check active page for remotes BEFORE allocating new page
if (active_page && active_page->remote_count > 0) {
    drain(active_page);
    if (active_page->freelist) {
        return alloc_from_active_page();
    }
}

// Option B: Periodic drain of all active pages (background thread)
// Option C: Signal owner thread when remote_count exceeds threshold
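Option C is only named above, so here is one way it might look. This is a sketch under assumptions: the threshold value, the `drain_requested` flag, and all `sketch_` names are hypothetical and do not come from the MF2 code. It also assumes the owner's drain resets `remote_count`, so the threshold can be crossed again later.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_DRAIN_THRESHOLD 32u  /* hypothetical tuning value */

typedef struct SketchSignalPage {
    atomic_uint remote_count;     /* incremented by remote-freeing threads */
    atomic_bool drain_requested;  /* polled by the owner thread */
} SketchSignalPage;

/* Non-owner thread: call after queueing a remote free on this page. */
static void sketch_note_remote_free(SketchSignalPage *page) {
    unsigned prev = atomic_fetch_add_explicit(&page->remote_count, 1u,
                                              memory_order_relaxed);
    /* Raise the flag exactly once, when the count crosses the threshold. */
    if (prev + 1u == SKETCH_DRAIN_THRESHOLD) {
        atomic_store_explicit(&page->drain_requested, true, memory_order_release);
    }
}

/* Owner thread: cheap poll on the allocation path; clears the flag if it was set. */
static bool sketch_owner_should_drain(SketchSignalPage *page) {
    return atomic_exchange_explicit(&page->drain_requested, false,
                                    memory_order_acquire);
}
```

Compared with Option A, this trades a slightly later drain for a single flag read on the owner's path. Whether that is worthwhile here would need measuring.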

Part 4: Performance Math

Allocation Efficiency

Total operations:    263,143
New pages allocated: 101,093
Expected reuse:      ~90% (mimalloc-style allocator)
Actual reuse:        3% (3,324 / 101,093)

Why so many pages?
→ 77K remote frees sitting in orphaned active pages
→ Never drained (owner thread doesn't check own active page!)
→ Allocator allocates NEW pages instead
→ 2.6 allocs per page (should be ~50+!)

memset Overhead (Secondary Issue)

memset overhead: 15μs per page × 101K pages = 1.5 seconds
Total runtime:   ~18 seconds
memset impact:   ~8% of total runtime

Not the main problem, but still wasteful!
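The Part 4 figures can be rechecked with a few lines of arithmetic. The helper names are hypothetical; the inputs are the numbers quoted above.

```c
/* Average operations served per page (the report's "allocs per page"). */
static double sketch_ops_per_page(long total_ops, long pages) {
    return (double)total_ops / (double)pages;
}

/* Share of pages that were reused, in percent. */
static double sketch_reuse_percent(long reused, long pages) {
    return 100.0 * (double)reused / (double)pages;
}

/* Total time spent in memset, in seconds, given a per-page cost in microseconds. */
static double sketch_memset_seconds(double usec_per_page, long pages) {
    return usec_per_page * 1e-6 * (double)pages;
}
```

With 263,143 operations, 101,093 pages, 3,324 reused pages and 15μs per memset, these give roughly 2.6 operations per page, 3.3% reuse and 1.5 seconds of memset, or about 8% of an 18-second run.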

Part 5: Proposed Fixes

FIX A: Active Page Drain (CRITICAL)

Before allocating a new page, drain the active page if it has pending remote frees:

// In mf2_alloc_slow(), BEFORE "allocate new page":
if (tp->active_page[class_idx]) {
    MidPage* active = tp->active_page[class_idx];
    if (atomic_load_explicit(&active->remote_count, memory_order_relaxed) > 0) {
        int drained = mf2_drain_remote_frees(active);
        if (drained > 0 && active->freelist) {
            // Success! Retry allocation from active page
            return mf2_alloc_fast(class_idx, size, site_id);
        }
    }
}

Expected impact: 77K drains instead of 0 → 90%+ page reuse → 10x fewer pages → major speedup!
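Fix A depends on `mf2_drain_remote_frees()`, which this report does not show. The sketch below shows what such a drain might do, assuming remote frees are kept on a lock-free stack, as mimalloc-style allocators commonly do. Every `Sketch`/`sketch_` name is hypothetical and the real MF2 layout may differ.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct SketchBlock { struct SketchBlock *next; } SketchBlock;

typedef struct SketchMidPage {
    SketchBlock *freelist;               /* touched by the owner thread only */
    _Atomic(SketchBlock *) remote_head;  /* lock-free stack of remote frees */
    atomic_uint remote_count;            /* may briefly overstate the stack length */
} SketchMidPage;

/* Non-owner thread: count first, then push, so the counter never underflows. */
static void sketch_remote_free(SketchMidPage *page, SketchBlock *blk) {
    atomic_fetch_add_explicit(&page->remote_count, 1u, memory_order_relaxed);
    SketchBlock *old = atomic_load_explicit(&page->remote_head, memory_order_relaxed);
    do {
        blk->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&page->remote_head, &old, blk,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Owner thread: detach the whole stack in one exchange, then splice it onto
   the local freelist. Returns the number of blocks recovered. */
static int sketch_drain_remote_frees(SketchMidPage *page) {
    SketchBlock *list = atomic_exchange_explicit(&page->remote_head,
                                                 (SketchBlock *)NULL,
                                                 memory_order_acquire);
    int drained = 0;
    while (list != NULL) {
        SketchBlock *next = list->next;
        list->next = page->freelist;
        page->freelist = list;
        list = next;
        drained++;
    }
    if (drained > 0) {
        atomic_fetch_sub_explicit(&page->remote_count, (unsigned)drained,
                                  memory_order_relaxed);
    }
    return drained;
}
```

Because the counter is bumped before the push, the owner can occasionally see `remote_count > 0` and drain nothing. The `drained > 0` check in Fix A already covers that case.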

FIX B: Eliminate memset (OPTIMIZATION)

Use mmap with manual alignment:

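// Map 128KB so that a 64KB-aligned, 64KB-long region always fits inside
void* raw = mmap(NULL, POOL_PAGE_SIZE * 2, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (raw == MAP_FAILED) {
    return NULL;  // mmap signals failure with MAP_FAILED, not NULL (assumes a pointer-returning caller)
}
// Round up to the next 64KB boundary
void* aligned = (void*)(((uintptr_t)raw + 0xFFFF) & ~(uintptr_t)0xFFFF);

// Save raw pointer for munmap later
page->raw_base = raw;
page->base = aligned;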

Benefits:

  • Lazy zero-fill (no memset!)
  • Perfect 64KB alignment
  • Fast allocation (mmap)

Tradeoffs:

  • Wastes 64KB of address space per page (128KB mapped, 64KB used) unless the unused head and tail are unmapped
  • More complex cleanup (must munmap raw_base, not aligned base)
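Both tradeoffs can be reduced by unmapping the unused head and tail right after alignment, a common variant of the over-allocate-and-align technique. The sketch below assumes Linux or a similar POSIX system whose page size divides 64KB. The function names are hypothetical and this is not the MF2 implementation.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define SKETCH_POOL_PAGE_SIZE ((size_t)64 * 1024)

/* Returns a zero-filled, 64KB-aligned, 64KB-long mapping, or NULL on failure. */
static void *sketch_map_aligned_page(void) {
    size_t span = SKETCH_POOL_PAGE_SIZE * 2;
    void *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) {
        return NULL;
    }

    uintptr_t start = (uintptr_t)raw;
    uintptr_t aligned = (start + SKETCH_POOL_PAGE_SIZE - 1)
                        & ~(uintptr_t)(SKETCH_POOL_PAGE_SIZE - 1);
    size_t head = (size_t)(aligned - start);           /* bytes before the aligned page */
    size_t tail = span - head - SKETCH_POOL_PAGE_SIZE; /* bytes after it */
    assert((aligned & (SKETCH_POOL_PAGE_SIZE - 1)) == 0);

    /* Give the unused ends back so only 64KB stays mapped. */
    if (head > 0) {
        munmap(raw, head);
    }
    if (tail > 0) {
        munmap((void *)(aligned + SKETCH_POOL_PAGE_SIZE), tail);
    }
    return (void *)aligned;
}

/* After trimming, the aligned base is the only pointer needed for cleanup. */
static void sketch_unmap_aligned_page(void *page) {
    munmap(page, SKETCH_POOL_PAGE_SIZE);
}
```

With trimming, no `raw_base` has to be stored, at the cost of up to two extra `munmap` calls per page.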

FIX C: Background Drain Thread (OPTIONAL)

Periodically drain all active pages in all threads:

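void* bg_drain_thread(void* arg) {
    (void)arg;
    // 100ms interval; sleep() only accepts whole seconds, so use nanosleep() from <time.h>
    const struct timespec interval = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
    while (true) {
        nanosleep(&interval, NULL);
        for_each_thread(tp) {  // pseudocode: visit every thread's pool state
            for (int class_idx = 0; class_idx < NUM_CLASSES; class_idx++) {
                MidPage* page = tp->active_page[class_idx];
                if (page && atomic_load_explicit(&page->remote_count, memory_order_relaxed) > 0) {
                    mf2_drain_remote_frees(page);
                }
            }
        }
    }
}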

Benefits: Drains remotes even if allocation rate is low
Tradeoffs: Extra thread overhead, complexity


Part 6: Next Actions

Immediate (Tonight)

  1. Alignment verified - Fix #4 works!
  2. 🚨 Implement Fix A - Add active page drain before new page allocation
  3. 📊 Re-run benchmarks - Measure impact

Short-term (This Week)

  1. Implement Fix B - Eliminate memset with mmap+manual_align
  2. Full benchmark suite - larson, sh6bench, sh8bench, etc.
  3. Compare to baseline - MF2 vs original allocator

Long-term (Optional)

  1. Implement Fix C - Background drain thread (if needed)
  2. Profile hot paths - Identify remaining bottlenecks
  3. Consider alternatives - If MF2 still underperforms, revert to original

Bottom Line

ALIGNMENT IS PERFECT! The posix_memalign fix solved the 97% free failure problem.

PERFORMANCE IS BAD! But it's NOT because of alignment - it's because:

  1. Remote frees go to orphaned active pages (not drained)
  2. memset overhead (1.5s wasted)

FIX THE DRAIN LOGIC FIRST - that's where the real problem is!

Without draining active pages, MF2 allocates 30x more pages than it should, causing catastrophic memory waste and allocation overhead.


Test artifacts saved in:

  • This report: /home/tomoaki/git/hakmem/ALIGNMENT_FIX_VERIFICATION.md
  • Instrumented code: hakmem_pool.c (search for "[ALIGNMENT", "[LOOKUP", etc.)