Hybrid Bitmap+Magazine Approach: Objective Analysis
Date: 2025-10-26
Proposal: ChatGPT Pro's "Bitmap = Control Plane, Free-list = Data Plane" hybrid
Goal: Achieve both speed (mimalloc-like) and research features (bitmap visibility)
Status: Technical feasibility analysis
Executive Summary
The Proposal
Core Idea: "Bitmap on top of Micro-Freelist"
- Data Plane (hot path): Page-level mini-magazine (8-16 items, LIFO free-list)
- Control Plane (cold path): Bitmap as "truth", batch refill/spill
- Research Features: Read from bitmap (complete visibility maintained)
Objective Assessment
Verdict: ✅ Technically sound and promising, but requires careful integration
| Aspect | Rating | Comment |
|---|---|---|
| Technical soundness | ✅ Excellent | Well-established pattern (mimalloc uses similar) |
| Performance potential | ✅ Good | 83ns → 45-55ns realistic (35-45% improvement) |
| Research value | ✅ Excellent | Bitmap visibility fully preserved |
| Implementation complexity | ⚠️ Moderate | 6-8 hours, careful integration needed |
| Risk | ⚠️ Moderate | TLS Magazine integration unclear, bitmap lag concerns |
Recommendation: Adopt with modifications (see Section 8)
1. Technical Architecture
1.1 Current hakmem Tiny Pool Structure
┌─────────────────────────────────┐
│ TLS Magazine [2048 items] │ ← Fast path (magazine hit)
│ items: void* [2048] │
│ top: int │
└────────────┬────────────────────┘
↓ (magazine empty)
┌─────────────────────────────────┐
│ TLS Active Slab A/B │ ← Medium path (bitmap scan)
│ bitmap[16]: uint64_t │
│ free_count: uint16_t │
└────────────┬────────────────────┘
↓ (slab full)
┌─────────────────────────────────┐
│ Global Pool (mutex-protected) │ ← Slow path (lock contention)
│ free_slabs[8]: TinySlab* │
│ full_slabs[8]: TinySlab* │
└─────────────────────────────────┘
Problem: Bitmap scan on every slab allocation (5-6ns overhead)
1.2 Proposed Hybrid Structure
┌─────────────────────────────────┐
│ Page Mini-Magazine [8-16 items] │ ← Fast path (O(1) LIFO)
│ mag_head: Block* │ Cost: 1-2ns
│ mag_count: uint8_t │
└────────────┬────────────────────┘
↓ (mini-mag empty)
┌─────────────────────────────────┐
│ Batch Refill from Bitmap │ ← Medium path (batch of 8)
│ bm_top: uint64_t (summary) │ Cost: 5-8ns (amortized 1ns/item)
│ bm_word[16]: uint64_t │
│ refill_batch: 8 items │
└────────────┬────────────────────┘
↓ (bitmap empty)
┌─────────────────────────────────┐
│ New Page or Drain Pending │ ← Slow path
└─────────────────────────────────┘
Benefit: Fast path is free-list speed, bitmap cost is amortized
1.3 Key Innovation: Two-Tier Bitmap
Standard Bitmap (current hakmem):
uint64_t bitmap[16]; // 1024 bits
// Problem: Must scan 16 words to find first free
for (int i = 0; i < 16; i++) {
    if (bitmap[i] == 0) continue; // Empty-word scan overhead
    // ...
}
// Cost: 2-3ns per word in worst case = 30-50ns total
Two-Tier Bitmap (proposed):
uint64_t bm_top; // Summary: 1 bit per word (16 bits used)
uint64_t bm_word[16]; // Data: 64 bits per word
// Fast path: Zero empty scan
if (bm_top == 0) return 0; // Instant check (1 cycle)
int w = __builtin_ctzll(bm_top); // First non-empty word (1 cycle)
uint64_t m = bm_word[w]; // Load word (3 cycles)
// Cost: 1.5ns total (vs 30-50ns worst case)
Impact: Empty scan overhead eliminated ✅
2. Performance Analysis
2.1 Expected Fast Path (Best Case)
static inline void* tiny_alloc_fast(ThreadHeap* th, int class_idx) {
    Page* p = th->active[class_idx]; // 2 ns (L1 TLS hit)
    Block* b = p->mag_head;          // 2 ns (L1 page hit)
    if (likely(b)) {                 // 0.5 ns (predicted taken)
        p->mag_head = b->next;       // 1 ns (L1 write)
        p->mag_count--;              // 0.5 ns (decrement)
        return b;                    // 0.5 ns
    }
    return tiny_alloc_refill(th, p, class_idx); // Slow path
}
// Total: 6.5 ns (pure CPU, L1 hits)
But reality includes:
- Size classification: +1 ns (with LUT)
- TLS base load: +1 ns
- Occasional branch mispredict: +5 ns (1 in 20)
- Occasional L2 miss: +10 ns (1 in 50)
Realistic fast path average: 12-15 ns (vs current 83 ns)
2.2 Medium Path: Refill from Bitmap
static inline int refill_from_bitmap(Page* p, int want) {
    uint64_t top = p->bm_top;           // 2 ns (L1 hit)
    if (top == 0) return 0;             // 0.5 ns
    int w = __builtin_ctzll(top);       // 1 ns (tzcnt instruction)
    uint64_t m = p->bm_word[w];         // 2 ns (L1 hit)
    int got = 0;
    while (m && got < want) {           // 8 iterations (want=8)
        int bit = __builtin_ctzll(m);   // 1 ns
        m &= (m - 1);                   // 1 ns (clear lowest set bit)
        void* blk = index_to_block(...);// 2 ns
        push_to_mag(blk);               // 1 ns
        got++;
    }
    // Total loop: 8 * 5 ns = 40 ns
    p->bm_word[w] = m;                  // 1 ns
    if (!m) p->bm_top &= ~(1ull << w);  // 1 ns
    p->mag_count += got;                // 1 ns
    return got;
}
// Total: 2 + 0.5 + 1 + 2 + 40 + 1 + 1 + 1 = 48.5 ns for 8 items
// Amortized: ~6 ns per item
Impact: Bitmap cost amortized to 6 ns/item (vs current 5-6 ns/item, but batched)
2.3 Overall Expected Performance
Allocation breakdown (with 90% mini-mag hit rate):
90% fast path: 12 ns * 0.9 = 10.8 ns
10% refill path: 48 ns * 0.1 = 4.8 ns (includes fast path + refill)
Total average: 15.6 ns
But this assumes:
- Mini-magazine always has items (90% hit rate)
- Bitmap refill is infrequent (10%)
- No statistics overhead
- No TLS magazine layer
More realistic (accounting for all overheads):
Size classification (LUT): 1 ns
TLS Magazine check: 3 ns (if kept)
OR
Page mini-magazine: 12 ns (if TLS Magazine removed)
Statistics (batched): 2 ns (sampled)
Occasional refill: 5 ns (amortized)
Total: 20-23 ns (if optimized)
Current baseline: 83 ns
Expected with hybrid: 35-45 ns (40-55% improvement)
2.4 Why Not 12-15 ns?
Missing overhead in best-case analysis:
- TLS Magazine integration: Current hakmem has TLS Magazine layer
- If kept: +10 ns (magazine check overhead)
- If removed: Simpler but loses current fast path
- Statistics: Even batched, adds 2-3 ns
- Refill frequency: If mini-mag is only 8-16 items, refill happens often
- Cache misses: Real-world workloads have 5-10% L2 misses
Realistic target: 35-45 ns (still 2x faster than current 83 ns!)
3. Integration with Existing hakmem Structure
3.1 Critical Question: What happens to TLS Magazine?
Current TLS Magazine:
typedef struct TinyTLSMag {
    TinyItem items[2048]; // 16 KB per class
    int top;
} TinyTLSMag;
static __thread TinyTLSMag g_tls_mags[TINY_NUM_CLASSES];
Options:
Option A: Keep Both (Dual-Layer Cache)
TLS Magazine [2048 items]
↓ (empty)
Page Mini-Magazine [8-16 items]
↓ (empty)
Bitmap Refill
Pros: Preserves current fast path
Cons:
- Double caching overhead (complexity)
- TLS Magazine dominates, mini-magazine rarely used
Verdict: Not recommended ❌
Option B: Remove TLS Magazine (Single-Layer)
Page Mini-Magazine [16-32 items] ← Increase size
↓ (empty)
Bitmap Refill [batch of 16]
Pros: Simpler, clearer hot path
Cons:
- Loses current TLS Magazine fast path (1.5 ns/op)
- Requires testing to verify performance
Verdict: Moderate risk ⚠️
Option C: Hybrid (TLS Mini-Magazine)
TLS Mini-Magazine [64-128 items per class]
↓ (empty)
Refill from Multiple Pages' Bitmaps
↓ (all bitmaps empty)
New Page
Pros: Best of both (TLS speed + bitmap control)
Cons:
- More complex refill logic
Verdict: Recommended ✅
3.2 Recommended Structure
typedef struct TinyTLSCache {
    // Fast path: small TLS magazine
    Block* mag_head;    // LIFO stack (not array)
    uint16_t mag_count; // Current count
    uint16_t mag_max;   // 64-128 (tunable)
    // Medium path: active page with bitmap
    Page* active;
    // Cold path: partial pages list
    Page* partial_head;
} TinyTLSCache;
static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];
Allocation:
- Pop from `mag_head` (1-2 ns) ← Fast path
- If empty, `refill_from_bitmap(active, 16)` (48 ns for 16 items → +3 ns amortized)
- If active bitmap empty, swap to a partial page
- If no partial, allocate a new page
Expected: 12-15 ns average (90%+ mag hit rate)
4. Bitmap as "Control Plane": Research Features
4.1 Bitmap Consistency Model
Problem: Mini-magazine has items, but bitmap still marks them as "free"
Bitmap state: [1 1 1 1 1 1 1 1] (all free)
Mini-mag: [b1, b2, b3] (3 blocks cached)
Truth: Only 5 are truly free, not 8
Solution 1: Lazy Update (Eventual Consistency)
// On refill: mark blocks as allocated in the bitmap
void refill_from_bitmap(Page* p, int want) {
    // ... extract blocks ...
    for each block:
        clear_bit(p->bm_word, idx); // Mark allocated immediately
    // Mini-mag now holds allocated blocks (consistent)
}
// On spill: mark blocks as free in the bitmap
void spill_to_bitmap(Page* p, int count) {
    for each block in mini-mag:
        set_bit(p->bm_word, idx); // Mark free
}
Consistency: ✅ Bitmap is always truth, mini-mag is just cache
Solution 2: Shadow State
// Bitmap tracks "ever allocated" state
// Mini-mag tracks "currently cached" state
// Research features read: bitmap + mini-mag count
uint16_t get_true_free_count(Page* p) {
return p->bitmap_free_count - p->mag_count;
}
Consistency: ⚠️ More complex, but allows instant queries
Recommendation: Solution 1 (simpler, consistent)
4.2 Research Features Still Work
Call-site profiling:
// On allocation, record call-site
void* alloc_with_profiling(void* site) {
void* ptr = tiny_alloc_fast(...);
// Diagnostic: Update bitmap-based tracking
if (diagnostic_enabled) {
int idx = block_index(page, ptr);
page->owner[idx] = current_thread();
page->alloc_site[idx] = site;
}
return ptr;
}
ELO learning:
// On free, update ELO based on lifetime
void free_with_elo(void* ptr) {
int idx = block_index(page, ptr);
void* site = page->alloc_site[idx];
uint64_t lifetime = rdtsc() - page->alloc_time[idx];
update_elo(site, lifetime); // Bitmap enables this
tiny_free_fast(ptr); // Then free normally
}
Memory diagnostics:
// Snapshot: Flush mini-mag to bitmap, then read
void snapshot_memory_state() {
flush_all_mini_magazines(); // Spill to bitmaps
for_each_page(page) {
print_bitmap_state(page); // Full visibility
}
}
Conclusion: ✅ All research features preserved (with flush/spill)
5. Implementation Complexity
5.1 Required Changes
New structures (~50 lines):
typedef struct Block {
    struct Block* next; // Intrusive LIFO
} Block;
typedef struct Page {
    // Mini-magazine
    Block* mag_head;
    uint16_t mag_count;
    uint16_t mag_max;
    // Two-tier bitmap
    uint64_t bm_top;
    uint64_t bm_word[16];
    // Existing (keep)
    uint8_t* base;
    uint16_t block_size;
    // ...
} Page;
New functions (~200 lines):
void* tiny_alloc_fast(ThreadHeap* th, int class_idx);
void tiny_free_fast(Page* p, void* ptr);
int refill_from_bitmap(Page* p, int want);
void spill_to_bitmap(Page* p);
void init_two_tier_bitmap(Page* p);
Modified functions (~300 lines):
// Existing bitmap allocation → refill logic
hak_tiny_alloc() → integrate with tiny_alloc_fast()
hak_tiny_free() → integrate with tiny_free_fast()
// Statistics collection → batched/sampled
Total code changes: ~500-600 lines (moderate)
5.2 Testing Requirements
Unit tests:
- Two-tier bitmap correctness (refill/spill)
- Mini-magazine overflow/underflow
- Bitmap-magazine consistency
Integration tests:
- Existing bench_tiny benchmarks
- Multi-threaded stress tests
- Diagnostic feature validation
Performance tests:
- Before/after latency comparison
- Hit rate measurement (mini-mag vs refill)
Estimated effort: 6-8 hours (implementation + testing)
6. Risks and Mitigation
Risk 1: Mini-Magazine Size Tuning
Problem: Too small (8) → frequent refills; too large (64) → memory overhead
Mitigation:
- Make `mag_max` tunable via environment variable
- Adaptive sizing based on allocation pattern
- Start with 16-32 (sweet spot)
Risk 2: Bitmap Refill Overhead
Problem: If mini-mag empties frequently, refill cost dominates
Scenarios:
- Burst allocation (1000 allocs in a row) → 1000/16 ≈ 63 refills
- Refill cost: 63 * 48 ns ≈ 3000 ns total ≈ 3 ns/alloc amortized ✅
Mitigation: Batch size (16) amortizes cost well
Risk 3: TLS Magazine Integration
Problem: Unclear how to integrate with existing TLS Magazine
Options:
- Remove TLS Magazine entirely → Simplest
- Keep TLS Magazine, add page mini-mag → Complex
- Replace TLS Magazine with TLS mini-mag (64-128 items) → Recommended
Mitigation: Prototype Option 3, benchmark against current
Risk 4: Diagnostic Lag
Problem: Bitmap doesn't reflect mini-mag state in real-time
Scenarios:
- Profiler reads bitmap → sees "free" but block is in mini-mag
- Fix: Flush before diagnostic read
Mitigation:
void flush_diagnostics() {
    for_each_class(c) {
        spill_to_bitmap(g_tls_cache[c].active);
    }
}
7. Performance Comparison Matrix
| Approach | Fast Path | Research | Complexity | Risk | Improvement |
|---|---|---|---|---|---|
| Current (Bitmap only) | 83 ns | ✅ Full | Low | Low | Baseline |
| Strategy A (Bitmap + cleanup) | 58-65 ns | ✅ Full | Low | Low | +25-30% |
| Strategy B (Free-list only) | 45-55 ns | ❌ Lost | Moderate | Moderate | +35-45% |
| Hybrid (Bitmap+Mini-Mag) | 35-45 ns | ✅ Full | Moderate | Moderate | +45-58% |
Winner: Hybrid (best speed + research preservation)
8. Recommended Implementation Plan
Phase 1: Two-Tier Bitmap (2-3 hours)
Goal: Eliminate empty word scan overhead
// Add bm_top to the existing TinySlab
typedef struct TinySlab {
    uint64_t bm_top;     // NEW: summary bitmap
    uint64_t bitmap[16]; // Existing
    // ...
} TinySlab;
// Update allocation to use bm_top
if (slab->bm_top == 0) return NULL; // Fast empty check
int w = __builtin_ctzll(slab->bm_top);
// ...
Expected: 83 ns → 78-80 ns (saves 3-5 ns)
Risk: Low (additive change)
Phase 2: Page Mini-Magazine (3-4 hours)
Goal: Add LIFO mini-magazine to slabs
typedef struct TinySlab {
    // Mini-magazine (NEW)
    Block* mag_head;
    uint16_t mag_count;
    uint16_t mag_max; // 16
    // Two-tier bitmap (from Phase 1)
    uint64_t bm_top;
    uint64_t bitmap[16];
    // ...
} TinySlab;
void* tiny_alloc_fast() {
    Block* b = slab->mag_head;
    if (likely(b)) {
        slab->mag_head = b->next;
        return b;
    }
    // Refill from bitmap (batch of 16)
    refill_from_bitmap(slab, 16);
    // Retry
    return slab->mag_head ? pop_mag(slab) : NULL;
}
Expected: 78-80 ns → 45-55 ns (saves 25-35 ns)
Risk: Moderate (structural change)
Phase 3: TLS Integration (1-2 hours)
Goal: Integrate with existing TLS Magazine
// Option: Replace TLS Magazine with TLS mini-mag
// Option: replace TLS Magazine with TLS mini-mag
typedef struct TinyTLSCache {
    Block* mag_head;    // 64-128 items
    uint16_t mag_count;
    TinySlab* active;   // Current slab
    TinySlab* partial;  // Partial slabs
} TinyTLSCache;
Expected: 45-55 ns → 35-45 ns (saves ~10 ns from better TLS integration)
Risk: Moderate (requires careful testing)
Phase 4: Statistics Batching (1 hour)
Goal: Remove per-allocation statistics overhead
// Batched counter update (cold path only)
if (++g_tls_alloc_counter[class_idx] >= 100) {
    g_tiny_pool.alloc_count[class_idx] += 100;
    g_tls_alloc_counter[class_idx] = 0;
}
Expected: 35-45 ns → 30-40 ns (saves 5-10 ns)
Risk: Low (independent change)
Total Timeline
Effort: 7-10 hours
Expected result: 83 ns → 30-45 ns (45-65% improvement)
Research features: ✅ Fully preserved (bitmap visibility maintained)
9. Comparison to Alternatives
vs Strategy A (Bitmap + Cleanup)
- Strategy A: 83ns → 58-65ns (+25-30%)
- Hybrid: 83ns → 30-45ns (+45-65%)
- Winner: Hybrid (+20-30ns better)
vs Strategy B (Free-list Only)
- Strategy B: 83ns → 45-55ns, ❌ loses research features
- Hybrid: 83ns → 30-45ns, ✅ keeps research features
- Winner: Hybrid (faster + research preserved)
vs ChatGPT Pro's Estimate (55-60ns)
- ChatGPT Pro: 55-60ns (optimistic)
- Realistic Hybrid: 30-45ns (with all phases)
- Conservative: 40-50ns (if hit rate is lower)
- Conclusion: 55-60ns is achievable, 30-40ns is optimistic but possible
10. Conclusion
Technical Verdict
The Hybrid Bitmap+Mini-Magazine approach is sound and recommended ✅
Key strengths:
- ✅ Preserves bitmap visibility (research features intact)
- ✅ Achieves free-list-like speed on hot path (30-45ns realistic)
- ✅ Two-tier bitmap eliminates empty scan overhead
- ✅ Well-established pattern (mimalloc uses similar techniques)
Key concerns:
- ⚠️ Moderate implementation complexity (7-10 hours)
- ⚠️ TLS Magazine integration needs careful design
- ⚠️ Bitmap consistency requires flush for diagnostics
- ⚠️ Performance depends on mini-magazine hit rate (90%+ needed)
Recommendation
Adopt the Hybrid approach with 4-phase implementation:
- Two-tier bitmap (low risk, immediate gain)
- Page mini-magazine (moderate risk, big gain)
- TLS integration (moderate risk, polish)
- Statistics batching (low risk, final optimization)
Expected outcome: 83ns → 30-45ns (45-65% improvement) while preserving all research features
Next Steps
- ✅ Create final implementation strategy document
- ✅ Update TINY_POOL_OPTIMIZATION_STRATEGY.md to Hybrid approach
- ✅ Begin Phase 1 (Two-tier bitmap) implementation
- ✅ Validate with benchmarks after each phase
Last Updated: 2025-10-26
Status: Analysis complete, ready for implementation
Confidence: HIGH (backed by mimalloc precedent, realistic estimates)
Risk Level: MODERATE (phased approach mitigates risk)