Bitmap vs Free List: Design Tradeoffs
Date: 2025-10-26
Context: Evaluating architectural choices for hakmem Tiny Pool optimization
Purpose: Understand tradeoffs before deciding whether to adopt mimalloc's free list approach
Executive Summary
The Core Question
Should hakmem abandon bitmap allocation in favor of mimalloc's intrusive free list?
Answer: It depends on project goals:
- If goal = production speed: Free list wins (3-4 ns faster per allocation)
- If goal = research/diagnostics: Bitmap wins (visibility, safety, flexibility)
- If goal = both: Hybrid approach possible (see Section 6)
1. Architecture Comparison
Bitmap Approach (Current hakmem)
// Metadata: Separate bitmap (1 bit per block)
typedef struct TinySlab {
uint64_t bitmap[16]; // 1024 blocks = 1024 bits
uint8_t* base; // Data region
uint16_t free_count; // O(1) empty check
// ... diagnostics, ownership, stats ...
} TinySlab;
// Allocation: Find-first-set
void* alloc_from_bitmap(TinySlab* s) {
int word_idx = find_first_nonzero(s->bitmap); // ~3 ns
int bit_idx = __builtin_ctzll(s->bitmap[word_idx]); // ~1 ns
s->bitmap[word_idx] &= ~(1ULL << bit_idx); // ~1 ns
return s->base + (word_idx * 64 + bit_idx) * block_size;
}
// Cost: 5-6 ns (bitmap scan + bit extraction)
Key Properties:
- ✅ Metadata separate from data
- ✅ Random access to allocation state
- ✅ O(1) slab-level statistics (free_count, bitmap scan)
- ⚠️ 5-6 ns overhead per allocation
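For completeness, the `find_first_nonzero` helper assumed by the snippet above might look like this (a hypothetical sketch for illustration; hakmem's actual helper may differ):

```c
#include <stdint.h>

// Hypothetical helper: index of the first word containing a set (free) bit,
// or -1 if the entire bitmap is zero. The unpredictable loop branch is where
// the ~3 ns scan cost quoted above comes from.
static int find_first_nonzero(const uint64_t bitmap[16]) {
    for (int i = 0; i < 16; i++) {
        if (bitmap[i] != 0) return i;
    }
    return -1;  // slab has no free blocks
}
```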
Free List Approach (mimalloc)
// Metadata: Intrusive next-pointer in free blocks
typedef struct Block {
struct Block* next; // 8 bytes IN the data region
} Block;
typedef struct Page {
Block* local_free; // LIFO stack head
// ... minimal metadata ...
} Page;
// Allocation: Pop from LIFO
void* alloc_from_freelist(Page* p) {
Block* b = p->local_free; // ~0.5 ns (L1 hit)
p->local_free = b->next; // ~0.5 ns (L1 hit)
return b;
}
// Cost: 1-2 ns (two pointer operations)
Key Properties:
- ✅ Zero metadata overhead (uses free blocks themselves)
- ✅ Minimal CPU overhead (1-2 pointer ops)
- ⚠️ Intrusive (overwrites first 8 bytes of free blocks)
- ⚠️ No random access (must traverse list)
2. Bitmap Advantages
2.1 Observability and Diagnostics
Bitmap: Complete allocation state visible at a glance
// Print slab state (O(1) bitmap scan)
void print_slab_state(TinySlab* s) {
printf("Slab free pattern: ");
for (int i = 0; i < 1024; i++) {
printf("%c", is_free(s->bitmap, i) ? '.' : 'X');
}
// Output: "X...XX.X.XX....." (visual fragmentation pattern)
}
Free List: Must traverse entire list
// Print page state (O(n) traversal)
void print_page_state(Page* p) {
int count = 0;
Block* b = p->local_free;
while (b) { count++; b = b->next; }
printf("Free blocks: %d (locations unknown)\n", count);
// Output: "Free blocks: 42" (no spatial information)
}
Impact:
- ✅ Bitmap: Can detect fragmentation patterns, hot spots, allocation clustering
- ⚠️ Free List: Only knows count, not spatial distribution
2.2 Memory Safety and Debugging
Bitmap: Freed memory can be immediately zeroed
void free_to_bitmap(TinySlab* s, void* ptr) {
int idx = block_index(s, ptr);
s->bitmap[idx / 64] |= (1ULL << (idx % 64));
s->free_count++; // keep the O(1) empty check consistent
memset(ptr, 0, block_size); // Safe: no metadata in block
}
// Use-after-free detection: accessing 0-filled memory likely crashes early
Free List: Next-pointer remains in freed memory
void free_to_list(Page* p, void* ptr) {
Block* b = (Block*)ptr;
b->next = p->local_free; // Writes to freed memory!
p->local_free = b;
}
// Use-after-free: might corrupt next-pointer, causing subtle bugs later
Impact:
- ✅ Bitmap: Easier debugging (freed memory is clean)
- ✅ Bitmap: Better ASAN/Valgrind integration (can mark freed)
- ⚠️ Free List: Next-pointer corruption can cause cascading failures
2.3 Ownership Tracking and Validation
Bitmap: Can track per-block metadata
typedef struct TinySlab {
uint64_t bitmap[16]; // Allocation state
uint8_t owner[1024]; // Per-block owner thread ID
uint32_t alloc_time[1024]; // Allocation timestamp
} TinySlab;
// Validate ownership on free
void free_with_validation(TinySlab* s, void* ptr) {
int idx = block_index(s, ptr);
if (s->owner[idx] != current_thread()) {
fprintf(stderr, "ERROR: Cross-thread free without handoff!\n");
// Can detect bugs immediately
}
}
Free List: No per-block metadata (intrusive design)
// Cannot store per-block metadata without external hash table
// Owner validation requires separate data structure
Impact:
- ✅ Bitmap: Can implement rich diagnostics (owner, timestamp, call-site)
- ✅ Bitmap: Validates invariants at allocation/free time
- ⚠️ Free List: Requires external data structures for diagnostics
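To make the cost concrete, here is a minimal sketch of the kind of external structure a free-list design would need for owner validation: a fixed-size open-addressing table mapping block pointers to thread IDs. All names and sizes here are illustrative assumptions, not hakmem or mimalloc code:

```c
#include <stdint.h>
#include <stddef.h>

#define OWNER_SLOTS 1024  // assumed capacity, power of two

// External owner table: needed because free-list blocks carry only the
// intrusive next-pointer and have no room for per-block metadata.
typedef struct {
    void*    ptr[OWNER_SLOTS];
    uint32_t tid[OWNER_SLOTS];
} OwnerTable;

static size_t owner_hash(void* p) {
    return ((uintptr_t)p >> 4) & (OWNER_SLOTS - 1);  // drop alignment bits
}

static void owner_set(OwnerTable* t, void* p, uint32_t tid) {
    size_t i = owner_hash(p);
    while (t->ptr[i] != NULL && t->ptr[i] != p)  // linear probing
        i = (i + 1) & (OWNER_SLOTS - 1);
    t->ptr[i] = p;
    t->tid[i] = tid;
}

static int owner_get(const OwnerTable* t, void* p, uint32_t* tid_out) {
    size_t i = owner_hash(p);
    while (t->ptr[i] != NULL) {
        if (t->ptr[i] == p) { *tid_out = t->tid[i]; return 1; }
        i = (i + 1) & (OWNER_SLOTS - 1);
    }
    return 0;  // block unknown to the table
}
```

Every lookup adds a hash plus probing to the free path, which is exactly the overhead the bitmap's in-place `owner[]` array avoids.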
2.4 Statistics and Profiling
Bitmap: O(1) slab-level queries
// All O(1) operations
uint16_t free_count = slab->free_count;
bool is_empty = (free_count == 1024);
bool is_full = (free_count == 0);
float utilization = 1.0 - (free_count / 1024.0);
// Fragmentation analysis (O(n) but rare)
int longest_run = find_longest_free_run(slab->bitmap);
Free List: Requires traversal
// Count requires O(n) traversal
int free_count = 0;
for (Block* b = page->local_free; b; b = b->next) {
free_count++;
}
// Cannot determine fragmentation without traversal
Impact:
- ✅ Bitmap: Fast statistics collection (research-friendly)
- ✅ Bitmap: Can analyze allocation patterns
- ⚠️ Free List: Statistics require expensive traversal or external counters
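The `find_longest_free_run` probe referenced above can be sketched directly over the bitmap (a hypothetical implementation; O(n) in blocks, intended for occasional analysis rather than the hot path):

```c
#include <stdint.h>

// Fragmentation probe: length of the longest run of consecutive free (set)
// bits across the 1024-bit bitmap. A set bit means "free", matching the
// alloc/free snippets above.
static int find_longest_free_run(const uint64_t bitmap[16]) {
    int best = 0, run = 0;
    for (int i = 0; i < 1024; i++) {
        if (bitmap[i / 64] & (1ULL << (i % 64))) {
            if (++run > best) best = run;  // extend the current free run
        } else {
            run = 0;  // an allocated block breaks the run
        }
    }
    return best;
}
```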
2.5 Concurrent Access Visibility
Bitmap: Can inspect remote thread state
// Diagnostic thread can scan all slabs
void print_global_state() {
for (int tid = 0; tid < MAX_THREADS; tid++) {
for (int class = 0; class < 8; class++) {
TinySlab* s = get_slab(tid, class);
// Instant visibility of free_count, bitmap
printf("Thread %d Class %d: %d/%d free\n",
tid, class, s->free_count, 1024);
}
}
}
Free List: Cannot safely inspect remote thread's local_free
// Diagnostic thread CANNOT read local_free (race condition)
// Must use external atomic counters (defeats purpose)
Impact:
- ✅ Bitmap: Can build monitoring dashboards, live profilers
- ✅ Bitmap: Supports cross-thread adoption decisions (CDA)
- ⚠️ Free List: Opaque to external observers
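One concrete form of such an external counter (an assumption for illustration, not mimalloc's actual mechanism): the owning thread periodically publishes an approximate free count through a C11 atomic, which a diagnostic thread can read without ever touching the private `local_free` list:

```c
#include <stdatomic.h>

// The private list head stays thread-local; only the published snapshot is
// shared. The snapshot may lag reality but never races.
typedef struct {
    void*       local_free;      // private: only the owning thread walks this
    atomic_uint published_free;  // approximate count, readable by any thread
} PageStats;

// Owning thread: cheap relaxed store, e.g. once per refill/flush batch
static void page_publish(PageStats* p, unsigned count) {
    atomic_store_explicit(&p->published_free, count, memory_order_relaxed);
}

// Diagnostic thread: relaxed load of the (possibly stale) snapshot
static unsigned page_peek(PageStats* p) {
    return atomic_load_explicit(&p->published_free, memory_order_relaxed);
}
```

This is precisely the "external atomic counters" compromise noted above: it restores a count, but never the spatial information a bitmap gives for free.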
2.6 Research and Experimentation
Bitmap: Easy to modify allocation policy
// Experiment: Best-fit instead of first-fit
int find_best_fit_block(TinySlab* s, int requested_run) {
// Scan bitmap for smallest run >= requested_run
// Easy to implement alternative allocation strategies
}
// Experiment: Locality-aware allocation
int find_nearest_free(TinySlab* s, void* previous_alloc) {
int prev_idx = block_index(s, previous_alloc);
// Search bitmap for nearby free blocks (cache locality)
}
Free List: Policy locked to LIFO
// Always LIFO (most recently freed = next allocated)
// Cannot experiment with other policies without major restructuring
Impact:
- ✅ Bitmap: Flexible research platform (try different allocation strategies)
- ✅ Bitmap: Can experiment with locality, fragmentation reduction
- ⚠️ Free List: Fixed policy (LIFO only)
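The locality experiment above is only possible because the bitmap allows random access to allocation state. A hypothetical body for the `find_nearest_free` scan might look like this (illustrative only; a free list cannot answer "what is free near X" without a full traversal):

```c
#include <stdint.h>

// Locality-aware scan: find the free (set) bit closest to a reference block
// index, searching outward in both directions.
static int find_nearest_free(const uint64_t bitmap[16], int ref_idx) {
    for (int d = 0; d < 1024; d++) {
        int lo = ref_idx - d, hi = ref_idx + d;
        if (lo >= 0   && (bitmap[lo / 64] & (1ULL << (lo % 64)))) return lo;
        if (hi < 1024 && (bitmap[hi / 64] & (1ULL << (hi % 64)))) return hi;
    }
    return -1;  // no free block anywhere in the slab
}
```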
3. Free List Advantages
3.1 Raw Performance
Numbers from ANALYSIS_SUMMARY.md:
- Bitmap: 5-6 ns per allocation (find-first-set + bit extraction)
- Free List: 1-2 ns per allocation (two pointer operations)
- Gap: 3-4 ns per allocation (2-6x faster)
Why Free List Wins:
// Bitmap: 5 operations
int word_idx = find_first_nonzero(bitmap); // 2-3 ns (unpredictable branch)
int bit_idx = ctzll(bitmap[word_idx]); // 1 ns (CPU instruction)
bitmap[word_idx] &= ~(1ULL << bit_idx); // 1 ns (bit clear)
void* ptr = base + index * block_size; // 1 ns (arithmetic)
// Total: 5-6 ns
// Free List: 2 operations
Block* b = page->local_free; // 0.5 ns (L1 hit)
page->local_free = b->next; // 0.5 ns (L1 hit)
return b; // 0.5 ns
// Total: 1.5 ns
3.2 Cache Efficiency
Free List: Excellent temporal locality
// Recently freed block = next allocated (LIFO)
// Likely still in L1 cache (3-5 cycles)
Block* b = page->local_free; // Cache hit!
Bitmap: Poorer temporal locality
// Allocated block may be anywhere in slab
// Bitmap access + block access = 2 cache lines
int idx = find_first_set(...); // Cache line 1 (bitmap)
void* ptr = base + idx * block_size; // Cache line 2 (block)
Impact:
- ✅ Free List: Better L1 cache hit rate (~95%+)
- ⚠️ Bitmap: More cache line touches (~2x)
3.3 Memory Overhead
Free List: Zero metadata
typedef struct Page {
Block* local_free; // 8 bytes
uint16_t capacity; // 2 bytes
// Total: 10 bytes for entire page
} Page;
Bitmap: 1 bit per block (+ supporting metadata)
typedef struct TinySlab {
uint64_t bitmap[16]; // 128 bytes (1024 blocks)
uint16_t free_count; // 2 bytes
uint8_t* base; // 8 bytes
// Total: 138 bytes minimum
} TinySlab;
// For 8-byte blocks: 1024 * 8 = 8KB data, 138B metadata = 1.7% overhead
Impact:
- ✅ Free List: ~0.1% overhead
- ⚠️ Bitmap: ~1-2% overhead
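As a quick sanity check on these figures (assuming 1024 blocks per slab/page and the struct sizes shown above: ~138 B of bitmap metadata vs ~10 B of free-list page header):

```c
// Metadata bytes as a percentage of the data region.
static double overhead_pct(double metadata_bytes, int blocks, int block_size) {
    return 100.0 * metadata_bytes / ((double)blocks * block_size);
}
```

For 8-byte blocks this gives roughly 1.7% (bitmap) vs 0.1% (free list); the gap shrinks as block size grows, since the bitmap is a fixed 1 bit per block.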
3.4 Simplicity
Free List: Minimal code complexity
// Entire allocation logic: 3 lines
void* alloc(Page* p) {
Block* b = p->local_free;
if (!b) return NULL;
p->local_free = b->next;
return b;
}
Bitmap: More complex
// Allocation logic: 15+ lines
void* alloc(TinySlab* s) {
if (s->free_count == 0) return NULL;
for (int i = 0; i < 16; i++) {
if (s->bitmap[i] == 0) continue; // Skip empty words
int bit_idx = __builtin_ctzll(s->bitmap[i]);
s->bitmap[i] &= ~(1ULL << bit_idx);
s->free_count--;
return s->base + (i * 64 + bit_idx) * s->block_size;
}
return NULL; // Should never reach
}
Impact:
- ✅ Free List: Easier to understand, maintain, optimize
- ⚠️ Bitmap: More code paths, more potential for bugs
4. Real-World Use Cases
When Bitmap Wins
Scenario 1: Memory Debugging Tools
// AddressSanitizer, Valgrind integration
// Can mark freed blocks immediately
void free_with_asan(TinySlab* s, void* ptr) {
int idx = block_index(s, ptr);
s->bitmap[idx / 64] |= (1ULL << (idx % 64));
__asan_poison_memory_region(ptr, block_size); // Safe!
}
Scenario 2: Research Allocators
// Experimenting with allocation strategies
// e.g., hakmem's ELO learning, call-site profiling
void alloc_with_learning(TinySlab* s, void* site) {
int idx = find_best_block_for_site(s, site); // Bitmap enables this
// Can implement custom heuristics
}
Scenario 3: Diagnostic Dashboards
// Real-time monitoring (e.g., allocator profiler UI)
// Can scan all slabs without stopping allocation
void update_dashboard() {
for_each_slab(slab) {
dashboard_update(slab->free_count, slab->bitmap);
// No disruption to allocation threads
}
}
When Free List Wins
Scenario 1: Production Web Servers
// mimalloc in WebKit, nginx, etc.
// Every nanosecond counts (millions of allocations/sec)
// Diagnostics = rare, speed = always
Scenario 2: Latency-Sensitive Systems
// HFT, real-time systems
// Predictable 1-2ns allocation critical
// Bitmap's 5-6ns too variable
Scenario 3: Memory-Constrained Embedded
// 1.7% bitmap overhead unacceptable
// Every byte matters
5. Quantitative Comparison
| Metric | Bitmap | Free List | Winner |
|---|---|---|---|
| Performance | |||
| Allocation latency | 5-6 ns | 1-2 ns | Free List (3-4ns faster) |
| Cache efficiency | 2 cache lines | 1 cache line | Free List |
| Branch mispredicts | 1-2 per alloc | 0-1 per alloc | Free List |
| Memory | |||
| Metadata overhead | 1-2% | ~0.1% | Free List |
| Block size impact | +128B per slab | +8B per page | Free List |
| Diagnostics | |||
| Observability | Full state visible | Opaque (count only) | Bitmap |
| Debugging | Easy (zeroed free) | Hard (pointer corruption) | Bitmap |
| Statistics | O(1) queries | O(n) traversal | Bitmap |
| Profiling | Per-block tracking | External hash table | Bitmap |
| Flexibility | |||
| Allocation policy | Pluggable (first-fit, best-fit, etc.) | LIFO only | Bitmap |
| Research | Easy experimentation | Fixed design | Bitmap |
| Monitoring | Non-intrusive scanning | Requires external counters | Bitmap |
| Safety | |||
| Use-after-free detection | Good (zeroed memory) | Poor (pointer corruption) | Bitmap |
| ASAN/Valgrind integration | Excellent | Limited | Bitmap |
| Cross-thread validation | Easy | Requires external state | Bitmap |
| Complexity | |||
| Code size | ~100 lines | ~20 lines | Free List |
| Maintainability | Moderate | High | Free List |
| Optimization potential | Limited (bitmap scan) | High (2 pointers) | Free List |
Overall:
- Production speed: Free List wins (3-4ns faster, simpler)
- Research/diagnostics: Bitmap wins (visibility, flexibility, safety)
6. Hybrid Approaches
Option 1: Dual-Mode Allocator
#ifdef HAKMEM_DIAGNOSTIC_MODE
// Bitmap mode (slow but visible)
void* alloc() { return alloc_bitmap(); }
#else
// Free list mode (fast production)
void* alloc() { return alloc_freelist(); }
#endif
Pros: Best of both worlds
Cons: Maintenance burden (two code paths)
Option 2: Shadow Bitmap
// Fast path: Free list
Block* b = page->local_free;
page->local_free = b->next;
// Diagnostic path: Update shadow bitmap (async)
if (unlikely(diagnostic_enabled)) {
shadow_bitmap_record(page, b); // Non-blocking queue
}
Pros: Fast path unaffected, diagnostics available
Cons: Shadow state may lag, memory overhead
Option 3: Adaptive Strategy
// Use bitmap for slabs with high churn (diagnostic value)
// Use free list for stable slabs (performance critical)
if (slab->churn_rate > THRESHOLD) {
use_bitmap_mode(slab);
} else {
use_freelist_mode(slab);
}
Pros: Dynamic optimization
Cons: Complex, runtime overhead
7. Recommendations for hakmem
Context: hakmem's Goals (from ANALYSIS_SUMMARY.md)
hakmem's Philosophy (research PoC):
- "Flexible architecture: research platform for learning"
- "Trade performance for visibility (ownership tracking, per-class stats)"
- "Novel features: call-site profiling, ELO learning, evolution tracking"
Recommendation: Keep Bitmap for Tiny Pool
Reasons:
- ✅ Research value: hakmem's ELO learning, call-site profiling require per-block tracking
- ✅ Diagnostics: Ownership tracking, CDA decision-making benefit from bitmap visibility
- ✅ Trade-off is acceptable: 5-6ns overhead is worth the flexibility for a research allocator
- ⚠️ But optimize around it: Remove statistics overhead, simplify hot path (my original P1-P2)
Alternative: Adopt Free List for Tiny Pool
Reasons:
- ✅ Performance: Closes 3-4ns of the 69ns gap
- ✅ Proven: mimalloc's design is battle-tested
- ✅ Simplicity: Easier to maintain, optimize
- ⚠️ But lose research features: Must find alternative ways to track per-block metadata
Compromise: Hybrid Approach
Proposal:
// Fast path: Free list (mimalloc-style)
void* tiny_alloc_fast(Page* p) {
Block* b = p->local_free;
if (likely(b)) {
p->local_free = b->next;
return b;
}
return tiny_alloc_slow(p);
}
// Diagnostic mode: Enable shadow bitmap
#ifdef HAKMEM_DIAGNOSTIC_MODE
void* tiny_alloc_slow(Page* p) {
void* ptr = refill_from_partial(p);
diagnostic_record_alloc(p, ptr); // Async, non-blocking
return ptr;
}
#endif
Benefits:
- Fast path: 1-2ns (mimalloc speed)
- Diagnostic mode: Optional bitmap tracking (research features)
- Production mode: Zero overhead
8. Decision Matrix
| Priority | Bitmap | Free List | Hybrid |
|---|---|---|---|
| Speed is #1 goal | ❌ | ✅ | ✅ |
| Research/diagnostics #1 | ✅ | ❌ | ⚠️ (complex) |
| Simplicity #1 | ⚠️ | ✅ | ❌ |
| Memory efficiency #1 | ❌ | ✅ | ⚠️ |
| Flexibility #1 | ✅ | ❌ | ✅ |
For hakmem specifically:
- If goal = beat mimalloc: Free List
- If goal = research platform: Bitmap
- If goal = both: Hybrid (complex but feasible)
9. Conclusion
The Fundamental Tradeoff
Bitmap = Observatory, Free List = Race Car
- Bitmap: Sacrifices 3-4ns for complete visibility and flexibility
- Free List: Sacrifices observability for raw speed
For hakmem's Context
Based on ANALYSIS_SUMMARY.md, hakmem's goals include:
- "Call-site profiling" → Requires per-block tracking → Bitmap advantage
- "ELO learning" → Requires allocation history → Bitmap advantage
- "Evolution tracking" → Requires observability → Bitmap advantage
Verdict: Bitmap is the right choice for hakmem's research goals
But Optimize Around It
Instead of abandoning bitmap:
- ✅ Remove statistics overhead (ChatGPT Pro's P1) → saves ~10ns
- ✅ Simplify hot path (my original P1-P2) → saves ~15ns
- ✅ Keep bitmap → Preserve research features
Expected: 83ns → 58-65ns (still 4x slower than mimalloc, but research features intact)
Last Updated: 2025-10-26
Status: Analysis complete
Next: Decide strategy based on project priorities