
Bitmap vs Free List: Design Tradeoffs

Date: 2025-10-26
Context: Evaluating architectural choices for hakmem Tiny Pool optimization
Purpose: Understand tradeoffs before deciding whether to adopt mimalloc's free list approach


Executive Summary

The Core Question

Should hakmem abandon bitmap allocation in favor of mimalloc's intrusive free list?

Answer: It depends on project goals:

  • If goal = production speed: Free list wins (3-4 ns faster per allocation)
  • If goal = research/diagnostics: Bitmap wins (visibility, safety, flexibility)
  • If goal = both: Hybrid approach possible (see Section 6)

1. Architecture Comparison

Bitmap Approach (Current hakmem)

// Metadata: Separate bitmap (1 bit per block)
typedef struct TinySlab {
    uint64_t bitmap[16];      // 1024 blocks = 1024 bits
    uint8_t* base;            // Data region
    uint16_t free_count;      // O(1) empty check
    uint16_t block_size;      // Bytes per block
    // ... diagnostics, ownership, stats ...
} TinySlab;

// Allocation: Find-first-set
void* alloc_from_bitmap(TinySlab* s) {
    int word_idx = find_first_nonzero(s->bitmap);  // ~3 ns
    int bit_idx = __builtin_ctzll(s->bitmap[word_idx]);  // ~1 ns
    s->bitmap[word_idx] &= ~(1ULL << bit_idx);     // ~1 ns
    return s->base + (word_idx * 64 + bit_idx) * block_size;
}
// Cost: 5-6 ns (bitmap scan + bit extraction)
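The snippet above leans on a `find_first_nonzero()` helper that is never shown. A minimal sketch (the name comes from the code above; the body is an assumption, not hakmem's actual implementation) is:

```c
#include <stdint.h>

// Assumed helper for alloc_from_bitmap(): index of the first 64-bit word
// containing a set (free) bit, or -1 if the slab is full.
static int find_first_nonzero(const uint64_t bitmap[16]) {
    for (int i = 0; i < 16; i++) {
        if (bitmap[i] != 0) return i;  // data-dependent branch: the ~2-3 ns cost
    }
    return -1;  // no free blocks anywhere in the slab
}
```

The loop's unpredictable exit point is where the branch-misprediction cost quoted later (Section 3.1) comes from.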

Key Properties:

  • Metadata separate from data
  • Random access to allocation state
  • O(1) slab-level statistics (free_count, bitmap scan)
  • ⚠️ 5-6 ns overhead per allocation

Free List Approach (mimalloc)

// Metadata: Intrusive next-pointer in free blocks
typedef struct Block {
    struct Block* next;       // 8 bytes IN the data region
} Block;

typedef struct Page {
    Block* local_free;        // LIFO stack head
    // ... minimal metadata ...
} Page;

// Allocation: Pop from LIFO
void* alloc_from_freelist(Page* p) {
    Block* b = p->local_free;             // ~0.5 ns (L1 hit)
    if (!b) return NULL;                  // empty: take the slow refill path
    p->local_free = b->next;              // ~0.5 ns (L1 hit)
    return b;
}
// Cost: 1-2 ns (two pointer operations)

Key Properties:

  • Zero metadata overhead (uses free blocks themselves)
  • Minimal CPU overhead (1-2 pointer ops)
  • ⚠️ Intrusive (overwrites first 8 bytes of free blocks)
  • ⚠️ No random access (must traverse list)

2. Bitmap Advantages

2.1 Observability and Diagnostics

Bitmap: Complete allocation state visible at a glance

// Print slab state (O(1) bitmap scan)
void print_slab_state(TinySlab* s) {
    printf("Slab free pattern: ");
    for (int i = 0; i < 1024; i++) {
        printf("%c", is_free(s->bitmap, i) ? '.' : 'X');
    }
    // Output: "X...XX.X.XX....." (visual fragmentation pattern)
}

Free List: Must traverse entire list

// Print page state (O(n) traversal)
void print_page_state(Page* p) {
    int count = 0;
    Block* b = p->local_free;
    while (b) { count++; b = b->next; }
    printf("Free blocks: %d (locations unknown)\n", count);
    // Output: "Free blocks: 42" (no spatial information)
}

Impact:

  • Bitmap: Can detect fragmentation patterns, hot spots, allocation clustering
  • ⚠️ Free List: Only knows count, not spatial distribution

2.2 Memory Safety and Debugging

Bitmap: Freed memory can be immediately zeroed

void free_to_bitmap(TinySlab* s, void* ptr) {
    int idx = block_index(s, ptr);
    s->bitmap[idx / 64] |= (1ULL << (idx % 64));
    s->free_count++;
    memset(ptr, 0, s->block_size);  // Safe: no metadata in block
}
// Use-after-free detection: accessing 0-filled memory likely crashes early

Free List: Next-pointer remains in freed memory

void free_to_list(Page* p, void* ptr) {
    Block* b = (Block*)ptr;
    b->next = p->local_free;     // Writes to freed memory!
    p->local_free = b;
}
// Use-after-free: might corrupt next-pointer, causing subtle bugs later

Impact:

  • Bitmap: Easier debugging (freed memory is clean)
  • Bitmap: Better ASAN/Valgrind integration (can mark freed)
  • ⚠️ Free List: Next-pointer corruption can cause cascading failures
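For context, hardened free-list allocators mitigate (but do not eliminate) this: mimalloc encodes its free-list next-pointers with per-page secrets, so a stray write decodes to a wild pointer that validation can catch rather than being followed silently. A hedged sketch of the XOR-encoding idea (illustrative names, not hakmem or mimalloc code):

```c
#include <stdint.h>

// XOR-encode the intrusive next pointer with a per-page secret. A
// use-after-free scribble over the encoded field no longer yields a
// plausible pointer when decoded.
typedef struct Block { struct Block* next; } Block;

static Block* encode_next(const Block* next, uintptr_t secret) {
    return (Block*)((uintptr_t)next ^ secret);
}
static Block* decode_next(const Block* enc, uintptr_t secret) {
    return (Block*)((uintptr_t)enc ^ secret);
}
```

This trades one XOR per alloc/free for detection, but still cannot match the bitmap's property that freed memory contains no metadata at all.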

2.3 Ownership Tracking and Validation

Bitmap: Can track per-block metadata

typedef struct TinySlab {
    uint64_t bitmap[16];       // Allocation state
    uint8_t owner[1024];       // Per-block owner thread ID
    uint32_t alloc_time[1024]; // Allocation timestamp
} TinySlab;

// Validate ownership on free
void free_with_validation(TinySlab* s, void* ptr) {
    int idx = block_index(s, ptr);
    if (s->owner[idx] != current_thread()) {
        fprintf(stderr, "ERROR: Cross-thread free without handoff!\n");
        // Can detect bugs immediately
    }
}

Free List: No per-block metadata (intrusive design)

// Cannot store per-block metadata without external hash table
// Owner validation requires separate data structure

Impact:

  • Bitmap: Can implement rich diagnostics (owner, timestamp, call-site)
  • Bitmap: Validates invariants at allocation/free time
  • ⚠️ Free List: Requires external data structures for diagnostics

2.4 Statistics and Profiling

Bitmap: O(1) slab-level queries

// All O(1) operations
uint16_t free_count = slab->free_count;
bool is_empty = (free_count == 1024);
bool is_full = (free_count == 0);
float utilization = 1.0 - (free_count / 1024.0);

// Fragmentation analysis (O(n) but rare)
int longest_run = find_longest_free_run(slab->bitmap);
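A sketch of the `find_longest_free_run()` probe named above (the name is from the line above; the body is an assumption): the longest stretch of consecutive free (set) bits across the 1024-bit map. It is O(n), but only runs for occasional fragmentation reports, not on the allocation path.

```c
#include <stdint.h>

// Longest run of consecutive free (set) bits in a 1024-bit slab bitmap.
static int find_longest_free_run(const uint64_t bitmap[16]) {
    int longest = 0, run = 0;
    for (int i = 0; i < 1024; i++) {
        if (bitmap[i / 64] & (1ULL << (i % 64))) {
            if (++run > longest) longest = run;  // extend current free run
        } else {
            run = 0;                             // allocated block breaks the run
        }
    }
    return longest;
}
```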

Free List: Requires traversal

// Count requires O(n) traversal
int free_count = 0;
for (Block* b = page->local_free; b; b = b->next) {
    free_count++;
}
// Cannot determine fragmentation without traversal

Impact:

  • Bitmap: Fast statistics collection (research-friendly)
  • Bitmap: Can analyze allocation patterns
  • ⚠️ Free List: Statistics require expensive traversal or external counters

2.5 Concurrent Access Visibility

Bitmap: Can inspect remote thread state

// Diagnostic thread can scan all slabs
void print_global_state() {
    for (int tid = 0; tid < MAX_THREADS; tid++) {
        for (int class = 0; class < 8; class++) {
            TinySlab* s = get_slab(tid, class);
            // Instant visibility of free_count, bitmap
            printf("Thread %d Class %d: %d/%d free\n",
                   tid, class, s->free_count, 1024);
        }
    }
}

Free List: Cannot safely inspect remote thread's local_free

// Diagnostic thread CANNOT read local_free (race condition)
// Must use external atomic counters (defeats purpose)

Impact:

  • Bitmap: Can build monitoring dashboards, live profilers
  • Bitmap: Supports cross-thread adoption decisions (CDA)
  • ⚠️ Free List: Opaque to external observers
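The "external atomic counters" workaround mentioned above can be sketched as follows (field names are assumptions, not hakmem code): the owning thread publishes an advisory free count that a diagnostic thread may read at any time, without ever dereferencing `local_free`.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct Page {
    void* local_free;                  // owner-thread only: never read remotely
    _Atomic uint32_t free_count_hint;  // relaxed, approximate, readable anywhere
} Page;

static void page_count_free(Page* p) {   // owner calls on free()
    atomic_fetch_add_explicit(&p->free_count_hint, 1, memory_order_relaxed);
}
static void page_count_alloc(Page* p) {  // owner calls on alloc()
    atomic_fetch_sub_explicit(&p->free_count_hint, 1, memory_order_relaxed);
}
static uint32_t page_peek_free(const Page* p) {  // safe from any thread
    return atomic_load_explicit(&p->free_count_hint, memory_order_relaxed);
}
```

Note the cost: every alloc/free now carries an atomic RMW, which is exactly the overhead the free list was supposed to avoid, and the hint still conveys a count with no spatial information.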

2.6 Research and Experimentation

Bitmap: Easy to modify allocation policy

// Experiment: Best-fit instead of first-fit
int find_best_fit_block(TinySlab* s, int requested_run) {
    // Scan bitmap for smallest run >= requested_run
    // Easy to implement alternative allocation strategies
}

// Experiment: Locality-aware allocation
int find_nearest_free(TinySlab* s, void* previous_alloc) {
    int prev_idx = block_index(s, previous_alloc);
    // Search bitmap for nearby free blocks (cache locality)
}

Free List: Policy locked to LIFO

// Always LIFO (most recently freed = next allocated)
// Cannot experiment with other policies without major restructuring

Impact:

  • Bitmap: Flexible research platform (try different allocation strategies)
  • Bitmap: Can experiment with locality, fragmentation reduction
  • ⚠️ Free List: Fixed policy (LIFO only)

3. Free List Advantages

3.1 Raw Performance

Numbers from ANALYSIS_SUMMARY.md:

  • Bitmap: 5-6 ns per allocation (find-first-set + bit extraction)
  • Free List: 1-2 ns per allocation (two pointer operations)
  • Gap: 3-4 ns per allocation (2-6x faster)

Why Free List Wins:

// Bitmap: four dependent operations
int word_idx = find_first_nonzero(bitmap);  // 2-3 ns (unpredictable branch)
int bit_idx = ctzll(bitmap[word_idx]);      // 1 ns (CPU instruction)
bitmap[word_idx] &= ~(1ULL << bit_idx);     // 1 ns (bit clear)
void* ptr = base + (word_idx * 64 + bit_idx) * block_size;  // 1 ns (arithmetic)
// Total: ~5-6 ns

// Free List: two pointer operations
Block* b = page->local_free;                // 0.5 ns (L1 hit)
page->local_free = b->next;                 // 0.5 ns (L1 hit)
return b;                                   // 0.5 ns
// Total: ~1.5 ns

3.2 Cache Efficiency

Free List: Excellent temporal locality

// Recently freed block = next allocated (LIFO)
// Likely still in L1 cache (3-5 cycles)
Block* b = page->local_free;  // Cache hit!

Bitmap: Poorer temporal locality

// Allocated block may be anywhere in slab
// Bitmap access + block access = 2 cache lines
int idx = find_first_set(...);      // Cache line 1 (bitmap)
void* ptr = base + idx * block_size; // Cache line 2 (block)

Impact:

  • Free List: Better L1 cache hit rate (~95%+)
  • ⚠️ Bitmap: More cache line touches (~2x)

3.3 Memory Overhead

Free List: Zero metadata

typedef struct Page {
    Block* local_free;   // 8 bytes
    uint16_t capacity;   // 2 bytes
    // Total: 10 bytes for entire page
} Page;

Bitmap: 1 bit per block (+ supporting metadata)

typedef struct TinySlab {
    uint64_t bitmap[16];  // 128 bytes (1024 blocks)
    uint16_t free_count;  // 2 bytes
    uint8_t* base;        // 8 bytes
    // Total: 138 bytes minimum
} TinySlab;
// For 8-byte blocks: 1024 * 8 = 8KB data, 138B metadata = 1.7% overhead

Impact:

  • Free List: ~0.1% overhead
  • ⚠️ Bitmap: ~1-2% overhead

3.4 Simplicity

Free List: Minimal code complexity

// Entire allocation logic: 3 lines
void* alloc(Page* p) {
    Block* b = p->local_free;
    if (!b) return NULL;
    p->local_free = b->next;
    return b;
}

Bitmap: More complex

// Allocation logic: 15+ lines
void* alloc(TinySlab* s) {
    if (s->free_count == 0) return NULL;
    for (int i = 0; i < 16; i++) {
        if (s->bitmap[i] == 0) continue;  // Skip empty words
        int bit_idx = __builtin_ctzll(s->bitmap[i]);
        s->bitmap[i] &= ~(1ULL << bit_idx);
        s->free_count--;
        return s->base + (i * 64 + bit_idx) * s->block_size;
    }
    return NULL;  // Should never reach
}

Impact:

  • Free List: Easier to understand, maintain, optimize
  • ⚠️ Bitmap: More code paths, more potential for bugs

4. Real-World Use Cases

When Bitmap Wins

Scenario 1: Memory Debugging Tools

// AddressSanitizer, Valgrind integration
// Can mark freed blocks immediately
void free_with_asan(TinySlab* s, void* ptr) {
    int idx = block_index(s, ptr);
    s->bitmap[idx / 64] |= (1ULL << (idx % 64));
    __asan_poison_memory_region(ptr, block_size);  // Safe!
}

Scenario 2: Research Allocators

// Experimenting with allocation strategies
// e.g., hakmem's ELO learning, call-site profiling
void alloc_with_learning(TinySlab* s, void* site) {
    int idx = find_best_block_for_site(s, site);  // Bitmap enables this
    // Can implement custom heuristics
}

Scenario 3: Diagnostic Dashboards

// Real-time monitoring (e.g., allocator profiler UI)
// Can scan all slabs without stopping allocation
void update_dashboard() {
    for_each_slab(slab) {
        dashboard_update(slab->free_count, slab->bitmap);
        // No disruption to allocation threads
    }
}

When Free List Wins

Scenario 1: Production Web Servers

// mimalloc in WebKit, nginx, etc.
// Every nanosecond counts (millions of allocations/sec)
// Diagnostics = rare, speed = always

Scenario 2: Latency-Sensitive Systems

// HFT, real-time systems
// Predictable 1-2ns allocation critical
// Bitmap's 5-6ns too variable

Scenario 3: Memory-Constrained Embedded

// 1.7% bitmap overhead unacceptable
// Every byte matters

5. Quantitative Comparison

| Metric | Bitmap | Free List | Winner |
|---|---|---|---|
| **Performance** | | | |
| Allocation latency | 5-6 ns | 1-2 ns | Free List (3-4 ns faster) |
| Cache efficiency | 2 cache lines | 1 cache line | Free List |
| Branch mispredicts | 1-2 per alloc | 0-1 per alloc | Free List |
| **Memory** | | | |
| Metadata overhead | 1-2% | ~0.1% | Free List |
| Block size impact | +128B per slab | +8B per page | Free List |
| **Diagnostics** | | | |
| Observability | Full state visible | Opaque (count only) | Bitmap |
| Debugging | Easy (zeroed free) | Hard (pointer corruption) | Bitmap |
| Statistics | O(1) queries | O(n) traversal | Bitmap |
| Profiling | Per-block tracking | External hash table | Bitmap |
| **Flexibility** | | | |
| Allocation policy | Pluggable (first-fit, best-fit, etc.) | LIFO only | Bitmap |
| Research | Easy experimentation | Fixed design | Bitmap |
| Monitoring | Non-intrusive scanning | Requires external counters | Bitmap |
| **Safety** | | | |
| Use-after-free detection | Good (zeroed memory) | Poor (pointer corruption) | Bitmap |
| ASAN/Valgrind integration | Excellent | Limited | Bitmap |
| Cross-thread validation | Easy | Requires external state | Bitmap |
| **Complexity** | | | |
| Code size | ~100 lines | ~20 lines | Free List |
| Maintainability | Moderate | High | Free List |
| Optimization potential | Limited (bitmap scan) | High (2 pointers) | Free List |

Overall:

  • Production speed: Free List wins (3-4ns faster, simpler)
  • Research/diagnostics: Bitmap wins (visibility, flexibility, safety)

6. Hybrid Approaches

Option 1: Dual-Mode Allocator

#ifdef HAKMEM_DIAGNOSTIC_MODE
    // Bitmap mode (slow but visible)
    void* alloc() { return alloc_bitmap(); }
#else
    // Free list mode (fast production)
    void* alloc() { return alloc_freelist(); }
#endif

Pros: Best of both worlds
Cons: Maintenance burden (two code paths)

Option 2: Shadow Bitmap

// Fast path: Free list
Block* b = page->local_free;
page->local_free = b->next;

// Diagnostic path: Update shadow bitmap (async)
if (unlikely(diagnostic_enabled)) {
    shadow_bitmap_record(page, b);  // Non-blocking queue
}

Pros: Fast path unaffected, diagnostics available
Cons: Shadow state may lag, memory overhead
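One way to make `shadow_bitmap_record()` non-blocking, as the comment above assumes, is a fixed-size single-producer/single-consumer ring: the allocator thread enqueues (page, block) events and a diagnostic thread drains them into the shadow bitmap. Everything below is an assumed sketch, not existing hakmem code.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SHADOW_Q 1024  // must be a power of two
typedef struct { void* page; void* block; } ShadowEvent;

static ShadowEvent shadow_q[SHADOW_Q];
static _Atomic uint32_t shadow_head;  // written only by the producer
static _Atomic uint32_t shadow_tail;  // written only by the consumer

// Returns 1 if recorded, 0 if the queue was full (event dropped, never blocks).
static int shadow_bitmap_record(void* page, void* block) {
    uint32_t h = atomic_load_explicit(&shadow_head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&shadow_tail, memory_order_acquire);
    if (h - t == SHADOW_Q) return 0;                // full: drop, stay wait-free
    shadow_q[h & (SHADOW_Q - 1)] = (ShadowEvent){page, block};
    atomic_store_explicit(&shadow_head, h + 1, memory_order_release);
    return 1;
}
```

Dropping events on overflow is what keeps the fast path wait-free; it is also why the shadow state may lag, as noted above.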

Option 3: Adaptive Strategy

// Use bitmap for slabs with high churn (diagnostic value)
// Use free list for stable slabs (performance critical)
if (slab->churn_rate > THRESHOLD) {
    use_bitmap_mode(slab);
} else {
    use_freelist_mode(slab);
}

Pros: Dynamic optimization
Cons: Complex, runtime overhead
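A sketch of how `slab->churn_rate` might be maintained (the field and constants are assumptions, not existing hakmem code): an exponential moving average of alloc/free events per sampling epoch, updated from a periodic tick.

```c
#include <stdint.h>

typedef struct SlabChurn {
    uint32_t events_this_epoch;  // alloc + free events since last tick
    float    churn_rate;         // smoothed events-per-epoch
} SlabChurn;

static void churn_note_event(SlabChurn* s) { s->events_this_epoch++; }

static void churn_epoch_tick(SlabChurn* s) {  // run from a periodic timer
    const float alpha = 0.25f;                // smoothing factor (assumed)
    s->churn_rate = alpha * (float)s->events_this_epoch
                  + (1.0f - alpha) * s->churn_rate;
    s->events_this_epoch = 0;
}
```

The EWMA damps short bursts, so a slab only flips between bitmap and free-list mode when its churn changes persistently, limiting mode-switch thrash.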


7. Recommendations for hakmem

Context: hakmem's Goals (from ANALYSIS_SUMMARY.md)

hakmem's Philosophy (research PoC):

  • "Flexible architecture: research platform for learning"
  • "Trade performance for visibility (ownership tracking, per-class stats)"
  • "Novel features: call-site profiling, ELO learning, evolution tracking"

Recommendation: Keep Bitmap for Tiny Pool

Reasons:

  1. Research value: hakmem's ELO learning, call-site profiling require per-block tracking
  2. Diagnostics: Ownership tracking, CDA decision-making benefit from bitmap visibility
  3. Trade-off is acceptable: 5-6ns overhead is worth the flexibility for a research allocator
  4. ⚠️ But optimize around it: Remove statistics overhead, simplify hot path (my original P1-P2)

Alternative: Adopt Free List for Tiny Pool

Reasons:

  1. Performance: Closes 3-4ns of the 69ns gap
  2. Proven: mimalloc's design is battle-tested
  3. Simplicity: Easier to maintain, optimize
  4. ⚠️ But lose research features: Must find alternative ways to track per-block metadata

Compromise: Hybrid Approach

Proposal:

// Fast path: Free list (mimalloc-style)
void* tiny_alloc_fast(Page* p) {
    Block* b = p->local_free;
    if (likely(b)) {
        p->local_free = b->next;
        return b;
    }
    return tiny_alloc_slow(p);
}

// Diagnostic mode: Enable shadow bitmap
#ifdef HAKMEM_DIAGNOSTIC_MODE
void* tiny_alloc_slow(Page* p) {
    void* ptr = refill_from_partial(p);
    diagnostic_record_alloc(p, ptr);  // Async, non-blocking
    return ptr;
}
#endif

Benefits:

  • Fast path: 1-2ns (mimalloc speed)
  • Diagnostic mode: Optional bitmap tracking (research features)
  • Production mode: Zero overhead

8. Decision Matrix

| Priority | Bitmap | Free List | Hybrid |
|---|---|---|---|
| Speed is #1 goal | ✗ | ✓ | ✓ |
| Research/diagnostics #1 | ✓ | ✗ | ⚠️ (complex) |
| Simplicity #1 | ✗ | ✓ | ⚠️ |
| Memory efficiency #1 | ⚠️ | ✓ | ✗ |
| Flexibility #1 | ✓ | ✗ | ✓ |

For hakmem specifically:

  • If goal = beat mimalloc: Free List
  • If goal = research platform: Bitmap
  • If goal = both: Hybrid (complex but feasible)

9. Conclusion

The Fundamental Tradeoff

Bitmap = Observatory, Free List = Race Car

  • Bitmap: Sacrifices 3-4ns for complete visibility and flexibility
  • Free List: Sacrifices observability for raw speed

For hakmem's Context

Based on ANALYSIS_SUMMARY.md, hakmem's goals include:

  • "Call-site profiling" → Requires per-block tracking → Bitmap advantage
  • "ELO learning" → Requires allocation history → Bitmap advantage
  • "Evolution tracking" → Requires observability → Bitmap advantage

Verdict: Bitmap is the right choice for hakmem's research goals

But Optimize Around It

Instead of abandoning bitmap:

  1. Remove statistics overhead (ChatGPT Pro's P1) → +10ns
  2. Simplify hot path (my original P1-P2) → +15ns
  3. Keep bitmap → Preserve research features

Expected: 83ns → 58-65ns (still 4x slower than mimalloc, but research features intact)


Last Updated: 2025-10-26
Status: Analysis complete
Next: Decide strategy based on project priorities