Hybrid Bitmap+Magazine Approach: Objective Analysis
Date: 2025-10-26
Proposal: ChatGPT Pro's "Bitmap = Control Plane, Free-list = Data Plane" hybrid
Goal: Achieve both speed (mimalloc-like) and research features (bitmap visibility)
Status: Technical feasibility analysis
Executive Summary
The Proposal
Core Idea: "Bitmap on top of Micro-Freelist"
- Data Plane (hot path): Page-level mini-magazine (8-16 items, LIFO free-list)
- Control Plane (cold path): Bitmap as "truth", batch refill/spill
- Research Features: Read from bitmap (complete visibility maintained)
Objective Assessment
Verdict: ✅ Technically sound and promising, but requires careful integration
| Aspect | Rating | Comment |
|---|---|---|
| Technical soundness | ✅ Excellent | Well-established pattern (mimalloc uses similar) |
| Performance potential | ✅ Good | 83ns → 45-55ns realistic (35-45% improvement) |
| Research value | ✅ Excellent | Bitmap visibility fully preserved |
| Implementation complexity | ⚠️ Moderate | 6-8 hours, careful integration needed |
| Risk | ⚠️ Moderate | TLS Magazine integration unclear, bitmap lag concerns |
Recommendation: Adopt with modifications (see Section 8)
1. Technical Architecture
1.1 Current hakmem Tiny Pool Structure
┌─────────────────────────────────┐
│ TLS Magazine [2048 items] │ ← Fast path (magazine hit)
│ items: void* [2048] │
│ top: int │
└────────────┬────────────────────┘
↓ (magazine empty)
┌─────────────────────────────────┐
│ TLS Active Slab A/B │ ← Medium path (bitmap scan)
│ bitmap[16]: uint64_t │
│ free_count: uint16_t │
└────────────┬────────────────────┘
↓ (slab full)
┌─────────────────────────────────┐
│ Global Pool (mutex-protected) │ ← Slow path (lock contention)
│ free_slabs[8]: TinySlab* │
│ full_slabs[8]: TinySlab* │
└─────────────────────────────────┘
Problem: Bitmap scan on every slab allocation (5-6ns overhead)
1.2 Proposed Hybrid Structure
┌─────────────────────────────────┐
│ Page Mini-Magazine [8-16 items] │ ← Fast path (O(1) LIFO)
│ mag_head: Block* │ Cost: 1-2ns
│ mag_count: uint8_t │
└────────────┬────────────────────┘
↓ (mini-mag empty)
┌─────────────────────────────────┐
│ Batch Refill from Bitmap │ ← Medium path (batch of 8)
│ bm_top: uint64_t (summary) │ Cost: 5-8ns (amortized 1ns/item)
│ bm_word[16]: uint64_t │
│ refill_batch: 8 items │
└────────────┬────────────────────┘
↓ (bitmap empty)
┌─────────────────────────────────┐
│ New Page or Drain Pending │ ← Slow path
└─────────────────────────────────┘
Benefit: Fast path is free-list speed, bitmap cost is amortized
1.3 Key Innovation: Two-Tier Bitmap
Standard Bitmap (current hakmem):
uint64_t bitmap[16]; // 1024 bits
// Problem: Must scan 16 words to find first free
for (int i = 0; i < 16; i++) {
    if (bitmap[i] == 0) continue; // Empty-word scan overhead
    // ...
}
// Cost: 2-3ns per word in worst case = 30-50ns total
Two-Tier Bitmap (proposed):
uint64_t bm_top; // Summary: 1 bit per word (16 bits used)
uint64_t bm_word[16]; // Data: 64 bits per word
// Fast path: Zero empty scan
if (bm_top == 0) return 0; // Instant check (1 cycle)
int w = __builtin_ctzll(bm_top); // First non-empty word (1 cycle)
uint64_t m = bm_word[w]; // Load word (3 cycles)
// Cost: 1.5ns total (vs 30-50ns worst case)
Impact: Empty scan overhead eliminated ✅
2. Performance Analysis
2.1 Expected Fast Path (Best Case)
static inline void* tiny_alloc_fast(ThreadHeap* th, int class_idx) {
    Page* p = th->active[class_idx]; // 2 ns (L1 TLS hit)
    Block* b = p->mag_head;          // 2 ns (L1 page hit)
    if (likely(b)) {                 // 0.5 ns (predicted taken)
        p->mag_head = b->next;       // 1 ns (L1 write)
        p->mag_count--;              // 0.5 ns (decrement)
        return b;                    // 0.5 ns
    }
    return tiny_alloc_refill(th, p, class_idx); // Slow path
}
// Total: 6.5 ns (pure CPU, L1 hits)
But reality includes:
- Size classification: +1 ns (with LUT)
- TLS base load: +1 ns
- Occasional branch mispredict: +5 ns (1 in 20)
- Occasional L2 miss: +10 ns (1 in 50)
Realistic fast path average: 12-15 ns (vs current 83 ns)
2.2 Medium Path: Refill from Bitmap
static inline int refill_from_bitmap(Page* p, int want) {
    uint64_t top = p->bm_top;           // 2 ns (L1 hit)
    if (top == 0) return 0;             // 0.5 ns
    int w = __builtin_ctzll(top);       // 1 ns (tzcnt instruction)
    uint64_t m = p->bm_word[w];         // 2 ns (L1 hit)
    int got = 0;
    while (m && got < want) {           // 8 iterations (want=8)
        int bit = __builtin_ctzll(m);   // 1 ns
        m &= (m - 1);                   // 1 ns (clear lowest set bit)
        void* blk = index_to_block(...);// 2 ns
        push_to_mag(blk);               // 1 ns
        got++;
    }
    // Total loop: 8 * 5 ns = 40 ns
    p->bm_word[w] = m;                  // 1 ns
    if (!m) p->bm_top &= ~(1ull << w);  // 1 ns
    p->mag_count += got;                // 1 ns
    return got;
}
// Total: 2 + 0.5 + 1 + 2 + 40 + 1 + 1 + 1 = 48.5 ns for 8 items
// Amortized: ~6 ns per item
Impact: Bitmap cost amortized to 6 ns/item (vs current 5-6 ns/item, but batched)
2.3 Overall Expected Performance
Allocation breakdown (with 90% mini-mag hit rate):
90% fast path: 12 ns * 0.9 = 10.8 ns
10% refill path: 48 ns * 0.1 = 4.8 ns (includes fast path + refill)
Total average: 15.6 ns
But this assumes:
- Mini-magazine always has items (90% hit rate)
- Bitmap refill is infrequent (10%)
- No statistics overhead
- No TLS magazine layer
More realistic (accounting for all overheads):
Size classification (LUT): 1 ns
TLS Magazine check: 3 ns (if kept)
OR
Page mini-magazine: 12 ns (if TLS Magazine removed)
Statistics (batched): 2 ns (sampled)
Occasional refill: 5 ns (amortized)
Total: 20-23 ns (if optimized)
Current baseline: 83 ns
Expected with hybrid: 35-45 ns (40-55% improvement)
2.4 Why Not 12-15 ns?
Missing overhead in best-case analysis:
- TLS Magazine integration: Current hakmem has TLS Magazine layer
- If kept: +10 ns (magazine check overhead)
- If removed: Simpler but loses current fast path
- Statistics: Even batched, adds 2-3 ns
- Refill frequency: If mini-mag is only 8-16 items, refill happens often
- Cache misses: Real-world workloads have 5-10% L2 misses
Realistic target: 35-45 ns (still 2x faster than current 83 ns!)
3. Integration with Existing hakmem Structure
3.1 Critical Question: What happens to TLS Magazine?
Current TLS Magazine:
typedef struct TinyTLSMag {
    TinyItem items[2048]; // 16 KB per class
    int top;
} TinyTLSMag;
static __thread TinyTLSMag g_tls_mags[TINY_NUM_CLASSES];
Options:
Option A: Keep Both (Dual-Layer Cache)
TLS Magazine [2048 items]
↓ (empty)
Page Mini-Magazine [8-16 items]
↓ (empty)
Bitmap Refill
Pros: Preserves current fast path
Cons:
- Double caching overhead (complexity)
- TLS Magazine dominates, mini-magazine rarely used
Verdict: Not recommended ❌
Option B: Remove TLS Magazine (Single-Layer)
Page Mini-Magazine [16-32 items] ← Increase size
↓ (empty)
Bitmap Refill [batch of 16]
Pros: Simpler, clearer hot path
Cons:
- Loses current TLS Magazine fast path (1.5 ns/op)
- Requires testing to verify performance
Verdict: Moderate risk ⚠️
Option C: Hybrid (TLS Mini-Magazine)
TLS Mini-Magazine [64-128 items per class]
↓ (empty)
Refill from Multiple Pages' Bitmaps
↓ (all bitmaps empty)
New Page
Pros: Best of both (TLS speed + bitmap control)
Cons:
- More complex refill logic
Verdict: Recommended ✅
3.2 Recommended Structure
typedef struct TinyTLSCache {
    // Fast path: small TLS magazine
    Block* mag_head;    // LIFO stack (not array)
    uint16_t mag_count; // Current count
    uint16_t mag_max;   // 64-128 (tunable)
    // Medium path: active page with bitmap
    Page* active;
    // Cold path: partial pages list
    Page* partial_head;
} TinyTLSCache;
static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];
Allocation:
- Pop from `mag_head` (1-2 ns) ← Fast path
- If empty, `refill_from_bitmap(active, 16)` (48 ns for 16 items → +3 ns amortized)
- If active bitmap empty, swap to a partial page
- If no partial, allocate a new page
Expected: 12-15 ns average (90%+ mag hit rate)
4. Bitmap as "Control Plane": Research Features
4.1 Bitmap Consistency Model
Problem: Mini-magazine has items, but bitmap still marks them as "free"
Bitmap state: [1 1 1 1 1 1 1 1] (all free)
Mini-mag: [b1, b2, b3] (3 blocks cached)
Truth: Only 5 are truly free, not 8
Solution 1: Lazy Update (Eventual Consistency)
// On refill: mark blocks as allocated in the bitmap
void refill_from_bitmap(Page* p, int want) {
    // ... extract blocks ...
    for each block:
        clear_bit(p->bm_word, idx); // Mark allocated immediately
    // Mini-mag now holds allocated blocks (consistent)
}
// On spill: mark blocks as free in the bitmap
void spill_to_bitmap(Page* p, int count) {
    for each block in mini-mag:
        set_bit(p->bm_word, idx); // Mark free
}
Consistency: ✅ Bitmap is always truth, mini-mag is just cache
Solution 2: Shadow State
// Bitmap tracks "ever allocated" state
// Mini-mag tracks "currently cached" state
// Research features read: bitmap + mini-mag count
uint16_t get_true_free_count(Page* p) {
return p->bitmap_free_count - p->mag_count;
}
Consistency: ⚠️ More complex, but allows instant queries
Recommendation: Solution 1 (simpler, consistent)
4.2 Research Features Still Work
Call-site profiling:
// On allocation, record call-site
void* alloc_with_profiling(void* site) {
void* ptr = tiny_alloc_fast(...);
// Diagnostic: Update bitmap-based tracking
if (diagnostic_enabled) {
int idx = block_index(page, ptr);
page->owner[idx] = current_thread();
page->alloc_site[idx] = site;
}
return ptr;
}
ELO learning:
// On free, update ELO based on lifetime
void free_with_elo(void* ptr) {
int idx = block_index(page, ptr);
void* site = page->alloc_site[idx];
uint64_t lifetime = rdtsc() - page->alloc_time[idx];
update_elo(site, lifetime); // Bitmap enables this
tiny_free_fast(ptr); // Then free normally
}
Memory diagnostics:
// Snapshot: Flush mini-mag to bitmap, then read
void snapshot_memory_state() {
flush_all_mini_magazines(); // Spill to bitmaps
for_each_page(page) {
print_bitmap_state(page); // Full visibility
}
}
Conclusion: ✅ All research features preserved (with flush/spill)
5. Implementation Complexity
5.1 Required Changes
New structures (~50 lines):
typedef struct Block {
    struct Block* next; // Intrusive LIFO
} Block;
typedef struct Page {
    // Mini-magazine
    Block* mag_head;
    uint16_t mag_count;
    uint16_t mag_max;
    // Two-tier bitmap
    uint64_t bm_top;
    uint64_t bm_word[16];
    // Existing (keep)
    uint8_t* base;
    uint16_t block_size;
    // ...
} Page;
New functions (~200 lines):
void* tiny_alloc_fast(ThreadHeap* th, int class_idx);
void tiny_free_fast(Page* p, void* ptr);
int refill_from_bitmap(Page* p, int want);
void spill_to_bitmap(Page* p);
void init_two_tier_bitmap(Page* p);
Modified functions (~300 lines):
// Existing bitmap allocation → refill logic
hak_tiny_alloc() → integrate with tiny_alloc_fast()
hak_tiny_free() → integrate with tiny_free_fast()
// Statistics collection → batched/sampled
Total code changes: ~500-600 lines (moderate)
5.2 Testing Requirements
Unit tests:
- Two-tier bitmap correctness (refill/spill)
- Mini-magazine overflow/underflow
- Bitmap-magazine consistency
Integration tests:
- Existing bench_tiny benchmarks
- Multi-threaded stress tests
- Diagnostic feature validation
Performance tests:
- Before/after latency comparison
- Hit rate measurement (mini-mag vs refill)
Estimated effort: 6-8 hours (implementation + testing)
6. Risks and Mitigation
Risk 1: Mini-Magazine Size Tuning
Problem: Too small (8) → frequent refills; too large (64) → memory overhead
Mitigation:
- Make `mag_max` tunable via environment variable
- Adaptive sizing based on allocation pattern
- Start with 16-32 (sweet spot)
Risk 2: Bitmap Refill Overhead
Problem: If mini-mag empties frequently, refill cost dominates
Scenarios:
- Burst allocation (1000 allocs in a row) → 1000/16 ≈ 63 refills
- Refill cost: 63 * 48 ns ≈ 3000 ns total ≈ 3 ns/alloc amortized ✅
Mitigation: Batch size (16) amortizes cost well
Risk 3: TLS Magazine Integration
Problem: Unclear how to integrate with existing TLS Magazine
Options:
- Remove TLS Magazine entirely → Simplest
- Keep TLS Magazine, add page mini-mag → Complex
- Replace TLS Magazine with TLS mini-mag (64-128 items) → Recommended
Mitigation: Prototype Option 3, benchmark against current
Risk 4: Diagnostic Lag
Problem: Bitmap doesn't reflect mini-mag state in real-time
Scenarios:
- Profiler reads bitmap → sees "free" but block is in mini-mag
- Fix: Flush before diagnostic read
Mitigation:
void flush_diagnostics() {
    for_each_class(c) {
        spill_to_bitmap(g_tls_cache[c].active);
    }
}
7. Performance Comparison Matrix
| Approach | Fast Path | Research | Complexity | Risk | Improvement |
|---|---|---|---|---|---|
| Current (Bitmap only) | 83 ns | ✅ Full | Low | Low | Baseline |
| Strategy A (Bitmap + cleanup) | 58-65 ns | ✅ Full | Low | Low | +25-30% |
| Strategy B (Free-list only) | 45-55 ns | ❌ Lost | Moderate | Moderate | +35-45% |
| Hybrid (Bitmap+Mini-Mag) | 35-45 ns | ✅ Full | Moderate | Moderate | +45-58% |
Winner: Hybrid (best speed + research preservation)
8. Recommended Implementation Plan
Phase 1: Two-Tier Bitmap (2-3 hours)
Goal: Eliminate empty word scan overhead
// Add bm_top to the existing TinySlab
typedef struct TinySlab {
    uint64_t bm_top;     // NEW: summary bitmap
    uint64_t bitmap[16]; // Existing
    // ...
} TinySlab;
// Update allocation to use bm_top
if (slab->bm_top == 0) return NULL; // Fast empty check
int w = __builtin_ctzll(slab->bm_top);
// ...
Expected: 83 ns → 78-80 ns (saves 3-5 ns)
Risk: Low (additive change)
Phase 2: Page Mini-Magazine (3-4 hours)
Goal: Add LIFO mini-magazine to slabs
typedef struct TinySlab {
    // Mini-magazine (NEW)
    Block* mag_head;
    uint16_t mag_count;
    uint16_t mag_max; // 16
    // Two-tier bitmap (from Phase 1)
    uint64_t bm_top;
    uint64_t bitmap[16];
    // ...
} TinySlab;
void* tiny_alloc_fast() {
    Block* b = slab->mag_head;
    if (likely(b)) {
        slab->mag_head = b->next;
        return b;
    }
    // Refill from bitmap (batch of 16)
    refill_from_bitmap(slab, 16);
    // Retry
    return slab->mag_head ? pop_mag(slab) : NULL;
}
Expected: 78-80 ns → 45-55 ns (saves 25-35 ns)
Risk: Moderate (structural change)
Phase 3: TLS Integration (1-2 hours)
Goal: Integrate with existing TLS Magazine
// Option: Replace TLS Magazine with TLS mini-mag
// Option: replace TLS Magazine with TLS mini-mag
typedef struct TinyTLSCache {
    Block* mag_head;    // 64-128 items
    uint16_t mag_count;
    TinySlab* active;   // Current slab
    TinySlab* partial;  // Partial slabs
} TinyTLSCache;
Expected: 45-55 ns → 35-45 ns (saves ~10 ns from better TLS integration)
Risk: Moderate (requires careful testing)
Phase 4: Statistics Batching (1 hour)
Goal: Remove per-allocation statistics overhead
// Batched counter update (cold path only)
if (++g_tls_alloc_counter[class_idx] >= 100) {
    g_tiny_pool.alloc_count[class_idx] += 100;
    g_tls_alloc_counter[class_idx] = 0;
}
Expected: 35-45 ns → 30-40 ns (saves 5-10 ns)
Risk: Low (independent change)
Total Timeline
Effort: 7-10 hours
Expected result: 83 ns → 30-45 ns (45-65% improvement)
Research features: ✅ Fully preserved (bitmap visibility maintained)
9. Comparison to Alternatives
vs Strategy A (Bitmap + Cleanup)
- Strategy A: 83ns → 58-65ns (+25-30%)
- Hybrid: 83ns → 30-45ns (+45-65%)
- Winner: Hybrid (+20-30ns better)
vs Strategy B (Free-list Only)
- Strategy B: 83ns → 45-55ns, ❌ loses research features
- Hybrid: 83ns → 30-45ns, ✅ keeps research features
- Winner: Hybrid (faster + research preserved)
vs ChatGPT Pro's Estimate (55-60ns)
- ChatGPT Pro: 55-60ns (optimistic)
- Realistic Hybrid: 30-45ns (with all phases)
- Conservative: 40-50ns (if hit rate is lower)
- Conclusion: 55-60ns is achievable, 30-40ns is optimistic but possible
10. Conclusion
Technical Verdict
The Hybrid Bitmap+Mini-Magazine approach is sound and recommended ✅
Key strengths:
- ✅ Preserves bitmap visibility (research features intact)
- ✅ Achieves free-list-like speed on hot path (30-45ns realistic)
- ✅ Two-tier bitmap eliminates empty scan overhead
- ✅ Well-established pattern (mimalloc uses similar techniques)
Key concerns:
- ⚠️ Moderate implementation complexity (7-10 hours)
- ⚠️ TLS Magazine integration needs careful design
- ⚠️ Bitmap consistency requires flush for diagnostics
- ⚠️ Performance depends on mini-magazine hit rate (90%+ needed)
Recommendation
Adopt the Hybrid approach with 4-phase implementation:
- Two-tier bitmap (low risk, immediate gain)
- Page mini-magazine (moderate risk, big gain)
- TLS integration (moderate risk, polish)
- Statistics batching (low risk, final optimization)
Expected outcome: 83ns → 30-45ns (45-65% improvement) while preserving all research features
Next Steps
- ✅ Create final implementation strategy document
- ✅ Update TINY_POOL_OPTIMIZATION_STRATEGY.md to Hybrid approach
- ✅ Begin Phase 1 (Two-tier bitmap) implementation
- ✅ Validate with benchmarks after each phase
Last Updated: 2025-10-26
Status: Analysis complete, ready for implementation
Confidence: HIGH (backed by mimalloc precedent, realistic estimates)
Risk Level: MODERATE (phased approach mitigates risk)