Bitmap vs Free List: Design Tradeoffs
Date: 2025-10-26
Context: Evaluating architectural choices for hakmem Tiny Pool optimization
Purpose: Understand tradeoffs before deciding whether to adopt mimalloc's free list approach
Executive Summary
The Core Question
Should hakmem abandon bitmap allocation in favor of mimalloc's intrusive free list?
Answer: It depends on project goals:
- If goal = production speed: Free list wins (3-4 ns faster per allocation)
- If goal = research/diagnostics: Bitmap wins (visibility, safety, flexibility)
- If goal = both: Hybrid approach possible (see Section 6)
1. Architecture Comparison
Bitmap Approach (Current hakmem)
// Metadata: Separate bitmap (1 bit per block)
typedef struct TinySlab {
uint64_t bitmap[16]; // 1024 blocks = 1024 bits
uint8_t* base; // Data region
uint16_t free_count; // O(1) empty check
// ... diagnostics, ownership, stats ...
} TinySlab;
// Allocation: Find-first-set
void* alloc_from_bitmap(TinySlab* s) {
int word_idx = find_first_nonzero(s->bitmap); // ~3 ns
int bit_idx = __builtin_ctzll(s->bitmap[word_idx]); // ~1 ns
s->bitmap[word_idx] &= ~(1ULL << bit_idx); // ~1 ns
return s->base + (word_idx * 64 + bit_idx) * block_size;
}
// Cost: 5-6 ns (bitmap scan + bit extraction)
Key Properties:
- ✅ Metadata separate from data
- ✅ Random access to allocation state
- ✅ O(1) slab-level statistics (free_count, bitmap scan)
- ⚠️ 5-6 ns overhead per allocation
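For completeness, the `find_first_nonzero` helper assumed by the snippet above might look like this (a hypothetical sketch for illustration; hakmem's actual helper may differ):

```c
#include <stdint.h>

// Hypothetical helper: index of the first word containing a set (free) bit,
// or -1 if the entire bitmap is zero. The unpredictable loop branch is where
// the ~3 ns scan cost quoted above comes from.
static int find_first_nonzero(const uint64_t bitmap[16]) {
    for (int i = 0; i < 16; i++) {
        if (bitmap[i] != 0) return i;
    }
    return -1;  // slab has no free blocks
}
```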
Free List Approach (mimalloc)
// Metadata: Intrusive next-pointer in free blocks
typedef struct Block {
struct Block* next; // 8 bytes IN the data region
} Block;
typedef struct Page {
Block* local_free; // LIFO stack head
// ... minimal metadata ...
} Page;
// Allocation: Pop from LIFO
void* alloc_from_freelist(Page* p) {
Block* b = p->local_free; // ~0.5 ns (L1 hit)
p->local_free = b->next; // ~0.5 ns (L1 hit)
return b;
}
// Cost: 1-2 ns (two pointer operations)
Key Properties:
- ✅ Zero metadata overhead (uses free blocks themselves)
- ✅ Minimal CPU overhead (1-2 pointer ops)
- ⚠️ Intrusive (overwrites first 8 bytes of free blocks)
- ⚠️ No random access (must traverse list)
2. Bitmap Advantages
2.1 Observability and Diagnostics
Bitmap: Complete allocation state visible at a glance
// Print slab state (O(1) bitmap scan)
void print_slab_state(TinySlab* s) {
printf("Slab free pattern: ");
for (int i = 0; i < 1024; i++) {
printf("%c", is_free(s->bitmap, i) ? '.' : 'X');
}
// Output: "X...XX.X.XX....." (visual fragmentation pattern)
}
Free List: Must traverse entire list
// Print page state (O(n) traversal)
void print_page_state(Page* p) {
int count = 0;
Block* b = p->local_free;
while (b) { count++; b = b->next; }
printf("Free blocks: %d (locations unknown)\n", count);
// Output: "Free blocks: 42" (no spatial information)
}
Impact:
- ✅ Bitmap: Can detect fragmentation patterns, hot spots, allocation clustering
- ⚠️ Free List: Only knows count, not spatial distribution
2.2 Memory Safety and Debugging
Bitmap: Freed memory can be immediately zeroed
void free_to_bitmap(TinySlab* s, void* ptr) {
int idx = block_index(s, ptr);
s->bitmap[idx / 64] |= (1ULL << (idx % 64));
s->free_count++; // keep the O(1) empty check consistent
memset(ptr, 0, block_size); // Safe: no metadata in block
}
// Use-after-free detection: accessing 0-filled memory likely crashes early
Free List: Next-pointer remains in freed memory
void free_to_list(Page* p, void* ptr) {
Block* b = (Block*)ptr;
b->next = p->local_free; // Writes to freed memory!
p->local_free = b;
}
// Use-after-free: might corrupt next-pointer, causing subtle bugs later
Impact:
- ✅ Bitmap: Easier debugging (freed memory is clean)
- ✅ Bitmap: Better ASAN/Valgrind integration (can mark freed)
- ⚠️ Free List: Next-pointer corruption can cause cascading failures
2.3 Ownership Tracking and Validation
Bitmap: Can track per-block metadata
typedef struct TinySlab {
uint64_t bitmap[16]; // Allocation state
uint8_t owner[1024]; // Per-block owner thread ID
uint32_t alloc_time[1024]; // Allocation timestamp
} TinySlab;
// Validate ownership on free
void free_with_validation(TinySlab* s, void* ptr) {
int idx = block_index(s, ptr);
if (s->owner[idx] != current_thread()) {
fprintf(stderr, "ERROR: Cross-thread free without handoff!\n");
// Can detect bugs immediately
}
}
Free List: No per-block metadata (intrusive design)
// Cannot store per-block metadata without external hash table
// Owner validation requires separate data structure
Impact:
- ✅ Bitmap: Can implement rich diagnostics (owner, timestamp, call-site)
- ✅ Bitmap: Validates invariants at allocation/free time
- ⚠️ Free List: Requires external data structures for diagnostics
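To make the cost concrete, here is a minimal sketch of the kind of external structure a free-list design would need for owner validation: a fixed-size open-addressing table mapping block pointers to thread IDs. All names and sizes here are illustrative assumptions, not hakmem or mimalloc code:

```c
#include <stdint.h>
#include <stddef.h>

#define OWNER_SLOTS 1024  // assumed capacity, power of two

// External owner table: needed because free-list blocks carry only the
// intrusive next-pointer and have no room for per-block metadata.
typedef struct {
    void*    ptr[OWNER_SLOTS];
    uint32_t tid[OWNER_SLOTS];
} OwnerTable;

static size_t owner_hash(void* p) {
    return ((uintptr_t)p >> 4) & (OWNER_SLOTS - 1);  // drop alignment bits
}

static void owner_set(OwnerTable* t, void* p, uint32_t tid) {
    size_t i = owner_hash(p);
    while (t->ptr[i] != NULL && t->ptr[i] != p)  // linear probing
        i = (i + 1) & (OWNER_SLOTS - 1);
    t->ptr[i] = p;
    t->tid[i] = tid;
}

static int owner_get(const OwnerTable* t, void* p, uint32_t* tid_out) {
    size_t i = owner_hash(p);
    while (t->ptr[i] != NULL) {
        if (t->ptr[i] == p) { *tid_out = t->tid[i]; return 1; }
        i = (i + 1) & (OWNER_SLOTS - 1);
    }
    return 0;  // block unknown to the table
}
```

Every lookup adds a hash plus probing to the free path, which is exactly the overhead the bitmap's in-place `owner[]` array avoids.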
2.4 Statistics and Profiling
Bitmap: O(1) slab-level queries
// All O(1) operations
uint16_t free_count = slab->free_count;
bool is_empty = (free_count == 1024);
bool is_full = (free_count == 0);
float utilization = 1.0 - (free_count / 1024.0);
// Fragmentation analysis (O(n) but rare)
int longest_run = find_longest_free_run(slab->bitmap);
Free List: Requires traversal
// Count requires O(n) traversal
int free_count = 0;
for (Block* b = page->local_free; b; b = b->next) {
free_count++;
}
// Cannot determine fragmentation without traversal
Impact:
- ✅ Bitmap: Fast statistics collection (research-friendly)
- ✅ Bitmap: Can analyze allocation patterns
- ⚠️ Free List: Statistics require expensive traversal or external counters
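The `find_longest_free_run` probe referenced above can be sketched directly over the bitmap (a hypothetical implementation; O(n) in blocks, intended for occasional analysis rather than the hot path):

```c
#include <stdint.h>

// Fragmentation probe: length of the longest run of consecutive free (set)
// bits across the 1024-bit bitmap. A set bit means "free", matching the
// alloc/free snippets above.
static int find_longest_free_run(const uint64_t bitmap[16]) {
    int best = 0, run = 0;
    for (int i = 0; i < 1024; i++) {
        if (bitmap[i / 64] & (1ULL << (i % 64))) {
            if (++run > best) best = run;  // extend the current free run
        } else {
            run = 0;  // an allocated block breaks the run
        }
    }
    return best;
}
```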
2.5 Concurrent Access Visibility
Bitmap: Can inspect remote thread state
// Diagnostic thread can scan all slabs
void print_global_state() {
for (int tid = 0; tid < MAX_THREADS; tid++) {
for (int class = 0; class < 8; class++) {
TinySlab* s = get_slab(tid, class);
// Instant visibility of free_count, bitmap
printf("Thread %d Class %d: %d/%d free\n",
tid, class, s->free_count, 1024);
}
}
}
Free List: Cannot safely inspect remote thread's local_free
// Diagnostic thread CANNOT read local_free (race condition)
// Must use external atomic counters (defeats purpose)
Impact:
- ✅ Bitmap: Can build monitoring dashboards, live profilers
- ✅ Bitmap: Supports cross-thread adoption decisions (CDA)
- ⚠️ Free List: Opaque to external observers
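One concrete form of such an external counter (an assumption for illustration, not mimalloc's actual mechanism): the owning thread periodically publishes an approximate free count through a C11 atomic, which a diagnostic thread can read without ever touching the private `local_free` list:

```c
#include <stdatomic.h>

// The private list head stays thread-local; only the published snapshot is
// shared. The snapshot may lag reality but never races.
typedef struct {
    void*       local_free;      // private: only the owning thread walks this
    atomic_uint published_free;  // approximate count, readable by any thread
} PageStats;

// Owning thread: cheap relaxed store, e.g. once per refill/flush batch
static void page_publish(PageStats* p, unsigned count) {
    atomic_store_explicit(&p->published_free, count, memory_order_relaxed);
}

// Diagnostic thread: relaxed load of the (possibly stale) snapshot
static unsigned page_peek(PageStats* p) {
    return atomic_load_explicit(&p->published_free, memory_order_relaxed);
}
```

This is precisely the "external atomic counters" compromise noted above: it restores a count, but never the spatial information a bitmap gives for free.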
2.6 Research and Experimentation
Bitmap: Easy to modify allocation policy
// Experiment: Best-fit instead of first-fit
int find_best_fit_block(TinySlab* s, int requested_run) {
// Scan bitmap for smallest run >= requested_run
// Easy to implement alternative allocation strategies
}
// Experiment: Locality-aware allocation
int find_nearest_free(TinySlab* s, void* previous_alloc) {
int prev_idx = block_index(s, previous_alloc);
// Search bitmap for nearby free blocks (cache locality)
}
Free List: Policy locked to LIFO
// Always LIFO (most recently freed = next allocated)
// Cannot experiment with other policies without major restructuring
Impact:
- ✅ Bitmap: Flexible research platform (try different allocation strategies)
- ✅ Bitmap: Can experiment with locality, fragmentation reduction
- ⚠️ Free List: Fixed policy (LIFO only)
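The locality experiment above is only possible because the bitmap allows random access to allocation state. A hypothetical body for the `find_nearest_free` scan might look like this (illustrative only; a free list cannot answer "what is free near X" without a full traversal):

```c
#include <stdint.h>

// Locality-aware scan: find the free (set) bit closest to a reference block
// index, searching outward in both directions.
static int find_nearest_free(const uint64_t bitmap[16], int ref_idx) {
    for (int d = 0; d < 1024; d++) {
        int lo = ref_idx - d, hi = ref_idx + d;
        if (lo >= 0   && (bitmap[lo / 64] & (1ULL << (lo % 64)))) return lo;
        if (hi < 1024 && (bitmap[hi / 64] & (1ULL << (hi % 64)))) return hi;
    }
    return -1;  // no free block anywhere in the slab
}
```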
3. Free List Advantages
3.1 Raw Performance
Numbers from ANALYSIS_SUMMARY.md:
- Bitmap: 5-6 ns per allocation (find-first-set + bit extraction)
- Free List: 1-2 ns per allocation (two pointer operations)
- Gap: 3-4 ns per allocation (2-6x faster)
Why Free List Wins:
// Bitmap: 5 operations
int word_idx = find_first_nonzero(bitmap); // 2-3 ns (unpredictable branch)
int bit_idx = ctzll(bitmap[word_idx]); // 1 ns (CPU instruction)
bitmap[word_idx] &= ~(1ULL << bit_idx); // 1 ns (bit clear)
void* ptr = base + index * block_size; // 1 ns (arithmetic)
// Total: 5-6 ns
// Free List: 2 operations
Block* b = page->local_free; // 0.5 ns (L1 hit)
page->local_free = b->next; // 0.5 ns (L1 hit)
return b; // 0.5 ns
// Total: 1.5 ns
3.2 Cache Efficiency
Free List: Excellent temporal locality
// Recently freed block = next allocated (LIFO)
// Likely still in L1 cache (3-5 cycles)
Block* b = page->local_free; // Cache hit!
Bitmap: Poorer temporal locality
// Allocated block may be anywhere in slab
// Bitmap access + block access = 2 cache lines
int idx = find_first_set(...); // Cache line 1 (bitmap)
void* ptr = base + idx * block_size; // Cache line 2 (block)
Impact:
- ✅ Free List: Better L1 cache hit rate (~95%+)
- ⚠️ Bitmap: More cache line touches (~2x)
3.3 Memory Overhead
Free List: Zero metadata
typedef struct Page {
Block* local_free; // 8 bytes
uint16_t capacity; // 2 bytes
// Total: 10 bytes for entire page
} Page;
Bitmap: 1 bit per block (+ supporting metadata)
typedef struct TinySlab {
uint64_t bitmap[16]; // 128 bytes (1024 blocks)
uint16_t free_count; // 2 bytes
uint8_t* base; // 8 bytes
// Total: 138 bytes minimum
} TinySlab;
// For 8-byte blocks: 1024 * 8 = 8KB data, 138B metadata = 1.7% overhead
Impact:
- ✅ Free List: ~0.1% overhead
- ⚠️ Bitmap: ~1-2% overhead
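As a quick sanity check on these figures (assuming 1024 blocks per slab/page and the struct sizes shown above: ~138 B of bitmap metadata vs ~10 B of free-list page header):

```c
// Metadata bytes as a percentage of the data region.
static double overhead_pct(double metadata_bytes, int blocks, int block_size) {
    return 100.0 * metadata_bytes / ((double)blocks * block_size);
}
```

For 8-byte blocks this gives roughly 1.7% (bitmap) vs 0.1% (free list); the gap shrinks as block size grows, since the bitmap is a fixed 1 bit per block.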
3.4 Simplicity
Free List: Minimal code complexity
// Entire allocation logic: 3 lines
void* alloc(Page* p) {
Block* b = p->local_free;
if (!b) return NULL;
p->local_free = b->next;
return b;
}
Bitmap: More complex
// Allocation logic: 15+ lines
void* alloc(TinySlab* s) {
if (s->free_count == 0) return NULL;
for (int i = 0; i < 16; i++) {
if (s->bitmap[i] == 0) continue; // Skip empty words
int bit_idx = __builtin_ctzll(s->bitmap[i]);
s->bitmap[i] &= ~(1ULL << bit_idx);
s->free_count--;
return s->base + (i * 64 + bit_idx) * s->block_size;
}
return NULL; // Should never reach
}
Impact:
- ✅ Free List: Easier to understand, maintain, optimize
- ⚠️ Bitmap: More code paths, more potential for bugs
4. Real-World Use Cases
When Bitmap Wins
Scenario 1: Memory Debugging Tools
// AddressSanitizer, Valgrind integration
// Can mark freed blocks immediately
void free_with_asan(TinySlab* s, void* ptr) {
int idx = block_index(s, ptr);
s->bitmap[idx / 64] |= (1ULL << (idx % 64));
__asan_poison_memory_region(ptr, block_size); // Safe!
}
Scenario 2: Research Allocators
// Experimenting with allocation strategies
// e.g., hakmem's ELO learning, call-site profiling
void alloc_with_learning(TinySlab* s, void* site) {
int idx = find_best_block_for_site(s, site); // Bitmap enables this
// Can implement custom heuristics
}
Scenario 3: Diagnostic Dashboards
// Real-time monitoring (e.g., allocator profiler UI)
// Can scan all slabs without stopping allocation
void update_dashboard() {
for_each_slab(slab) {
dashboard_update(slab->free_count, slab->bitmap);
// No disruption to allocation threads
}
}
When Free List Wins
Scenario 1: Production Web Servers
// mimalloc in WebKit, nginx, etc.
// Every nanosecond counts (millions of allocations/sec)
// Diagnostics = rare, speed = always
Scenario 2: Latency-Sensitive Systems
// HFT, real-time systems
// Predictable 1-2ns allocation critical
// Bitmap's 5-6ns too variable
Scenario 3: Memory-Constrained Embedded
// 1.7% bitmap overhead unacceptable
// Every byte matters
5. Quantitative Comparison
| Metric | Bitmap | Free List | Winner |
|---|---|---|---|
| Performance | |||
| Allocation latency | 5-6 ns | 1-2 ns | Free List (3-4ns faster) |
| Cache efficiency | 2 cache lines | 1 cache line | Free List |
| Branch mispredicts | 1-2 per alloc | 0-1 per alloc | Free List |
| Memory | |||
| Metadata overhead | 1-2% | ~0.1% | Free List |
| Block size impact | +128B per slab | +8B per page | Free List |
| Diagnostics | |||
| Observability | Full state visible | Opaque (count only) | Bitmap |
| Debugging | Easy (zeroed free) | Hard (pointer corruption) | Bitmap |
| Statistics | O(1) queries | O(n) traversal | Bitmap |
| Profiling | Per-block tracking | External hash table | Bitmap |
| Flexibility | |||
| Allocation policy | Pluggable (first-fit, best-fit, etc.) | LIFO only | Bitmap |
| Research | Easy experimentation | Fixed design | Bitmap |
| Monitoring | Non-intrusive scanning | Requires external counters | Bitmap |
| Safety | |||
| Use-after-free detection | Good (zeroed memory) | Poor (pointer corruption) | Bitmap |
| ASAN/Valgrind integration | Excellent | Limited | Bitmap |
| Cross-thread validation | Easy | Requires external state | Bitmap |
| Complexity | |||
| Code size | ~100 lines | ~20 lines | Free List |
| Maintainability | Moderate | High | Free List |
| Optimization potential | Limited (bitmap scan) | High (2 pointers) | Free List |
Overall:
- Production speed: Free List wins (3-4ns faster, simpler)
- Research/diagnostics: Bitmap wins (visibility, flexibility, safety)
6. Hybrid Approaches
Option 1: Dual-Mode Allocator
#ifdef HAKMEM_DIAGNOSTIC_MODE
// Bitmap mode (slow but visible)
void* alloc() { return alloc_bitmap(); }
#else
// Free list mode (fast production)
void* alloc() { return alloc_freelist(); }
#endif
Pros: Best of both worlds
Cons: Maintenance burden (two code paths)
Option 2: Shadow Bitmap
// Fast path: Free list
Block* b = page->local_free;
page->local_free = b->next;
// Diagnostic path: Update shadow bitmap (async)
if (unlikely(diagnostic_enabled)) {
shadow_bitmap_record(page, b); // Non-blocking queue
}
Pros: Fast path unaffected, diagnostics available
Cons: Shadow state may lag, memory overhead
Option 3: Adaptive Strategy
// Use bitmap for slabs with high churn (diagnostic value)
// Use free list for stable slabs (performance critical)
if (slab->churn_rate > THRESHOLD) {
use_bitmap_mode(slab);
} else {
use_freelist_mode(slab);
}
Pros: Dynamic optimization
Cons: Complex, runtime overhead
7. Recommendations for hakmem
Context: hakmem's Goals (from ANALYSIS_SUMMARY.md)
hakmem's Philosophy (research PoC):
- "Flexible architecture: research platform for learning"
- "Trade performance for visibility (ownership tracking, per-class stats)"
- "Novel features: call-site profiling, ELO learning, evolution tracking"
Recommendation: Keep Bitmap for Tiny Pool
Reasons:
- ✅ Research value: hakmem's ELO learning, call-site profiling require per-block tracking
- ✅ Diagnostics: Ownership tracking, CDA decision-making benefit from bitmap visibility
- ✅ Trade-off is acceptable: 5-6ns overhead is worth the flexibility for a research allocator
- ⚠️ But optimize around it: Remove statistics overhead, simplify hot path (my original P1-P2)
Alternative: Adopt Free List for Tiny Pool
Reasons:
- ✅ Performance: Closes 3-4ns of the 69ns gap
- ✅ Proven: mimalloc's design is battle-tested
- ✅ Simplicity: Easier to maintain, optimize
- ⚠️ But lose research features: Must find alternative ways to track per-block metadata
Compromise: Hybrid Approach
Proposal:
// Fast path: Free list (mimalloc-style)
void* tiny_alloc_fast(Page* p) {
Block* b = p->local_free;
if (likely(b)) {
p->local_free = b->next;
return b;
}
return tiny_alloc_slow(p);
}
// Diagnostic mode: Enable shadow bitmap
#ifdef HAKMEM_DIAGNOSTIC_MODE
void* tiny_alloc_slow(Page* p) {
void* ptr = refill_from_partial(p);
diagnostic_record_alloc(p, ptr); // Async, non-blocking
return ptr;
}
#endif
Benefits:
- Fast path: 1-2ns (mimalloc speed)
- Diagnostic mode: Optional bitmap tracking (research features)
- Production mode: Zero overhead
8. Decision Matrix
| Priority | Bitmap | Free List | Hybrid |
|---|---|---|---|
| Speed is #1 goal | ❌ | ✅ | ✅ |
| Research/diagnostics #1 | ✅ | ❌ | ⚠️ (complex) |
| Simplicity #1 | ⚠️ | ✅ | ❌ |
| Memory efficiency #1 | ❌ | ✅ | ⚠️ |
| Flexibility #1 | ✅ | ❌ | ✅ |
For hakmem specifically:
- If goal = beat mimalloc: Free List
- If goal = research platform: Bitmap
- If goal = both: Hybrid (complex but feasible)
9. Conclusion
The Fundamental Tradeoff
Bitmap = Observatory, Free List = Race Car
- Bitmap: Sacrifices 3-4ns for complete visibility and flexibility
- Free List: Sacrifices observability for raw speed
For hakmem's Context
Based on ANALYSIS_SUMMARY.md, hakmem's goals include:
- "Call-site profiling" → Requires per-block tracking → Bitmap advantage
- "ELO learning" → Requires allocation history → Bitmap advantage
- "Evolution tracking" → Requires observability → Bitmap advantage
Verdict: Bitmap is the right choice for hakmem's research goals
But Optimize Around It
Instead of abandoning bitmap:
- ✅ Remove statistics overhead (ChatGPT Pro's P1) → saves ~10ns
- ✅ Simplify hot path (my original P1-P2) → saves ~15ns
- ✅ Keep bitmap → Preserve research features
Expected: 83ns → 58-65ns (still 4x slower than mimalloc, but research features intact)
Last Updated: 2025-10-26
Status: Analysis complete
Next: Decide strategy based on project priorities