# Hybrid Bitmap+Magazine Approach: Objective Analysis

**Date**: 2025-10-26
**Proposal**: ChatGPT Pro's "Bitmap = Control Plane, Free-list = Data Plane" hybrid
**Goal**: Achieve both speed (mimalloc-like) and research features (bitmap visibility)
**Status**: Technical feasibility analysis

---

## Executive Summary

### The Proposal

**Core Idea**: "Bitmap on top of Micro-Freelist"
- **Data Plane (hot path)**: Page-level mini-magazine (8-16 items, LIFO free-list)
- **Control Plane (cold path)**: Bitmap as "truth", batch refill/spill
- **Research Features**: Read from bitmap (complete visibility maintained)

### Objective Assessment

**Verdict**: ✅ **Technically sound and promising, but requires careful integration**

| Aspect | Rating | Comment |
|--------|--------|---------|
| **Technical soundness** | ✅ Excellent | Well-established pattern (mimalloc uses similar per-page free lists) |
| **Performance potential** | ✅ Good | 83 ns → 45-55 ns realistic (35-45% improvement) |
| **Research value** | ✅ Excellent | Bitmap visibility fully preserved |
| **Implementation complexity** | ⚠️ Moderate | 7-10 hours across the four phases in Section 8, careful integration needed |
| **Risk** | ⚠️ Moderate | TLS Magazine integration unclear, bitmap lag concerns |

**Recommendation**: **Adopt with modifications** (see Section 8)

---

## 1. Technical Architecture

### 1.1 Current hakmem Tiny Pool Structure

```
┌─────────────────────────────────┐
│ TLS Magazine [2048 items]       │ ← Fast path (magazine hit)
│   items: void* [2048]           │
│   top: int                      │
└────────────┬────────────────────┘
             ↓ (magazine empty)
┌─────────────────────────────────┐
│ TLS Active Slab A/B             │ ← Medium path (bitmap scan)
│   bitmap[16]: uint64_t          │
│   free_count: uint16_t          │
└────────────┬────────────────────┘
             ↓ (slab full)
┌─────────────────────────────────┐
│ Global Pool (mutex-protected)   │ ← Slow path (lock contention)
│   free_slabs[8]: TinySlab*      │
│   full_slabs[8]: TinySlab*      │
└─────────────────────────────────┘

Problem: Bitmap scan on every slab allocation (5-6ns overhead)
```

### 1.2 Proposed Hybrid Structure

```
┌─────────────────────────────────┐
│ Page Mini-Magazine [8-16 items] │ ← Fast path (O(1) LIFO)
│   mag_head: Block*              │   Cost: 1-2ns
│   mag_count: uint8_t            │
└────────────┬────────────────────┘
             ↓ (mini-mag empty)
┌─────────────────────────────────┐
│ Batch Refill from Bitmap        │ ← Medium path (batch of 8)
│   bm_top: uint64_t (summary)    │   Cost: 5-8ns (amortized 1ns/item)
│   bm_word[16]: uint64_t         │
│   refill_batch: 8 items         │
└────────────┬────────────────────┘
             ↓ (bitmap empty)
┌─────────────────────────────────┐
│ New Page or Drain Pending       │ ← Slow path
└─────────────────────────────────┘

Benefit: Fast path is free-list speed, bitmap cost is amortized
```

### 1.3 Key Innovation: Two-Tier Bitmap

**Standard Bitmap** (current hakmem):
```c
uint64_t bitmap[16];  // 1024 bits
// Problem: Must scan 16 words to find first free
for (int i = 0; i < 16; i++) {
    if (bitmap[i] == 0) continue;  // Empty word scan overhead
    // ...
}
// Cost: 2-3ns per word in worst case = 30-50ns total
```

**Two-Tier Bitmap** (proposed):
```c
uint64_t bm_top;       // Summary: 1 bit per word (16 bits used)
uint64_t bm_word[16];  // Data: 64 bits per word

// Fast path: Zero empty scan
if (bm_top == 0) return 0;         // Instant check (1 cycle)

int w = __builtin_ctzll(bm_top);   // First non-empty word (1 cycle)
uint64_t m = bm_word[w];           // Load word (3 cycles)
// Cost: 1.5ns total (vs 30-50ns worst case)
```
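
The two-tier lookup only works if `bm_top` is kept in step with `bm_word[]` on both allocation and free. The following is a minimal sketch of that bookkeeping, not hakmem's actual code: the `TwoTierBitmap` type and the `tt_init`, `tt_alloc`, and `tt_free` names are hypothetical, a set bit is assumed to mean "free", and `__builtin_ctzll` assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

#define BM_WORDS 16  /* 16 x 64 = 1024 blocks, as in the text */

typedef struct {
    uint64_t bm_top;             /* bit w set => bm_word[w] has a free bit */
    uint64_t bm_word[BM_WORDS];  /* bit set => block is free */
} TwoTierBitmap;

/* Mark all 1024 blocks free and arm every summary bit. */
static void tt_init(TwoTierBitmap *bm) {
    for (int w = 0; w < BM_WORDS; w++) bm->bm_word[w] = ~0ull;
    bm->bm_top = (1ull << BM_WORDS) - 1;
}

/* Take the lowest-numbered free block; returns its index or -1 if none. */
static int tt_alloc(TwoTierBitmap *bm) {
    if (bm->bm_top == 0) return -1;              /* nothing free anywhere */
    int w = __builtin_ctzll(bm->bm_top);         /* first non-empty word */
    uint64_t m = bm->bm_word[w];                 /* non-zero by invariant */
    int bit = __builtin_ctzll(m);
    m &= m - 1;                                  /* clear the lowest set bit */
    bm->bm_word[w] = m;
    if (m == 0) bm->bm_top &= ~(1ull << w);      /* word drained: drop summary bit */
    return w * 64 + bit;
}

/* Return a block to the bitmap and re-arm the summary bit for its word. */
static void tt_free(TwoTierBitmap *bm, int idx) {
    assert(idx >= 0 && idx < BM_WORDS * 64);
    int w = idx >> 6;
    bm->bm_word[w] |= 1ull << (idx & 63);
    bm->bm_top |= 1ull << w;
}
```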

**Impact**: Empty scan overhead eliminated ✅

---

## 2. Performance Analysis

### 2.1 Expected Fast Path (Best Case)

```c
static inline void* tiny_alloc_fast(ThreadHeap* th, int class_idx) {
    Page* p = th->active[class_idx];   // 2 ns (L1 TLS hit)
    Block* b = p->mag_head;            // 2 ns (L1 page hit)
    if (likely(b)) {                   // 0.5 ns (predicted taken)
        p->mag_head = b->next;         // 1 ns (L1 write)
        p->mag_count--;                // 0.5 ns (dec)
        return b;                      // 0.5 ns
    }
    return tiny_alloc_refill(th, p, class_idx);  // Slow path
}
// Total: 6.5 ns (pure CPU, L1 hits)
```

**But reality includes**:
- Size classification: +1 ns (with LUT)
- TLS base load: +1 ns
- Occasional branch mispredict: +5 ns (1 in 20)
- Occasional L2 miss: +10 ns (1 in 50)

**Realistic fast path average**: **12-15 ns** (vs current 83 ns)

### 2.2 Medium Path: Refill from Bitmap

```c
static inline int refill_from_bitmap(Page* p, int want) {
    uint64_t top = p->bm_top;               // 2 ns (L1 hit)
    if (top == 0) return 0;                 // 0.5 ns

    int w = __builtin_ctzll(top);           // 1 ns (tzcnt instruction)
    uint64_t m = p->bm_word[w];             // 2 ns (L1 hit)

    int got = 0;
    while (m && got < want) {               // 8 iterations (want=8)
        int bit = __builtin_ctzll(m);       // 1 ns
        m &= (m - 1);                       // 1 ns (clear bit)
        void* blk = index_to_block(...);    // 2 ns
        push_to_mag(blk);                   // 1 ns
        got++;
    }
    // Total loop: 8 * 5 ns = 40 ns

    p->bm_word[w] = m;                      // 1 ns
    if (!m) p->bm_top &= ~(1ull << w);      // 1 ns
    p->mag_count += got;                    // 1 ns
    return got;
}
// Total: 2 + 0.5 + 1 + 2 + 40 + 1 + 1 + 1 = 48.5 ns for 8 items
// Amortized: 6 ns per item
```
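
The listing above leaves `index_to_block(...)` and `push_to_mag()` elided. A minimal compilable sketch of the same refill step is shown below, under stated assumptions: the `Page` layout follows Section 5.1, block addresses are computed as `base + index * block_size`, the magazine is an intrusive LIFO threaded through the free blocks themselves, and `mag_pop` is a hypothetical helper name. Like the original, it drains only the first non-empty bitmap word, so it may return fewer than `want` blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Block { struct Block *next; } Block;  /* intrusive LIFO link */

typedef struct Page {
    Block   *mag_head;      /* mini-magazine: stack of cached free blocks */
    uint16_t mag_count;
    uint64_t bm_top;        /* summary: bit w set => bm_word[w] != 0 */
    uint64_t bm_word[16];   /* bit set => block is free */
    uint8_t *base;          /* start of the page's block storage */
    uint16_t block_size;    /* must be >= sizeof(Block) */
} Page;

/* Move up to `want` free blocks from the first non-empty bitmap word
 * into the mini-magazine. Returns the number of blocks moved. */
static int refill_from_bitmap(Page *p, int want) {
    assert(p->block_size >= sizeof(Block));
    if (p->bm_top == 0) return 0;
    int w = __builtin_ctzll(p->bm_top);
    uint64_t m = p->bm_word[w];
    int got = 0;
    while (m && got < want) {
        int bit = __builtin_ctzll(m);
        m &= m - 1;                                   /* bit now reads "not free" */
        Block *b = (Block *)(p->base + (size_t)(w * 64 + bit) * p->block_size);
        b->next = p->mag_head;                        /* push onto the LIFO */
        p->mag_head = b;
        got++;
    }
    p->bm_word[w] = m;
    if (!m) p->bm_top &= ~(1ull << w);
    p->mag_count += (uint16_t)got;
    return got;
}

/* Pop one cached block, or NULL when the mini-magazine is empty. */
static void *mag_pop(Page *p) {
    Block *b = p->mag_head;
    if (!b) return NULL;
    p->mag_head = b->next;
    p->mag_count--;
    return b;
}
```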

**Impact**: Bitmap cost is about **6 ns/item**, similar to the current 5-6 ns/item scan, but it is paid in batches off the hot path

### 2.3 Overall Expected Performance

**Allocation breakdown** (with 90% mini-mag hit rate):
```
90% fast path:    12 ns * 0.9 = 10.8 ns
10% refill path:  48 ns * 0.1 =  4.8 ns  (includes fast path + refill)
Total average:                  15.6 ns
```
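
The breakdown above is a two-term weighted average, which makes it easy to see how sensitive the result is to the hit rate. The sketch below reproduces it; the function names are hypothetical. It also computes the hit rate implied by the batch size under one simplifying assumption (a pure allocation stream with no intervening frees, so one allocation per batch misses): a batch of 8 then gives 87.5% and a batch of 16 gives 93.75%, which brackets the 90% used above.

```c
#include <assert.h>

/* Weighted average of the two paths: a fraction hit_rate of allocations
 * take the fast path; the rest pay refill_ns, which already includes
 * the retry pop. */
static double expected_alloc_ns(double hit_rate, double fast_ns, double refill_ns) {
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * fast_ns + (1.0 - hit_rate) * refill_ns;
}

/* Assumption: pure allocation stream, no frees, so exactly one
 * allocation in every `batch` finds the mini-magazine empty. */
static double steady_state_hit_rate(int batch) {
    assert(batch > 0);
    return (double)(batch - 1) / (double)batch;
}
```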

**But this assumes**:
- The mini-magazine has items 90% of the time (90% hit rate)
- Bitmap refill is infrequent (10%)
- No statistics overhead
- No TLS magazine layer

**More realistic** (accounting for all overheads):
```
Size classification (LUT):   1 ns
TLS Magazine check:          3 ns  (if kept)
  OR
Page mini-magazine:         12 ns  (if TLS Magazine removed)
Statistics (batched):        2 ns  (sampled)
Occasional refill:           5 ns  (amortized)
Total:                   20-23 ns  (if optimized)
```

**Current baseline**: 83 ns
**Expected with hybrid**: **35-45 ns** (46-58% improvement)

### 2.4 Why Not 12-15 ns?

**Missing overhead in best-case analysis**:
1. **TLS Magazine integration**: Current hakmem has TLS Magazine layer
   - If kept: +10 ns (magazine check overhead)
   - If removed: Simpler but loses current fast path
2. **Statistics**: Even batched, adds 2-3 ns
3. **Refill frequency**: If mini-mag is only 8-16 items, refill happens often
4. **Cache misses**: Real-world workloads have 5-10% L2 misses

**Realistic target**: **35-45 ns** (still roughly 2x faster than the current 83 ns)

---

## 3. Integration with Existing hakmem Structure

### 3.1 Critical Question: What happens to TLS Magazine?

**Current TLS Magazine**:
```c
typedef struct TinyTLSMag {
    TinyItem items[2048];  // 16 KB per class
    int top;
} TinyTLSMag;
static __thread TinyTLSMag g_tls_mags[TINY_NUM_CLASSES];
```

**Options**:

#### Option A: Keep Both (Dual-Layer Cache)
```
TLS Magazine [2048 items]
    ↓ (empty)
Page Mini-Magazine [8-16 items]
    ↓ (empty)
Bitmap Refill
```

**Pros**: Preserves current fast path
**Cons**:
- Double caching overhead (complexity)
- TLS Magazine dominates, mini-magazine rarely used

**Verdict**: Not recommended ❌

#### Option B: Remove TLS Magazine (Single-Layer)
```
Page Mini-Magazine [16-32 items]  ← Increase size
    ↓ (empty)
Bitmap Refill [batch of 16]
```

**Pros**: Simpler, clearer hot path
**Cons**:
- Loses current TLS Magazine fast path (1.5 ns/op)
- Requires testing to verify performance

**Verdict**: Moderate risk ⚠️

#### Option C: Hybrid (TLS Mini-Magazine)
```
TLS Mini-Magazine [64-128 items per class]
    ↓ (empty)
Refill from Multiple Pages' Bitmaps
    ↓ (all bitmaps empty)
New Page
```

**Pros**: Best of both (TLS speed + bitmap control)
**Cons**:
- More complex refill logic

**Verdict**: Recommended ✅

### 3.2 Recommended Structure

```c
typedef struct TinyTLSCache {
    // Fast path: Small TLS magazine
    Block* mag_head;      // LIFO stack (not array)
    uint16_t mag_count;   // Current count
    uint16_t mag_max;     // 64-128 (tunable)

    // Medium path: Active page with bitmap
    Page* active;

    // Cold path: Partial pages list
    Page* partial_head;
} TinyTLSCache;

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];
```

**Allocation**:
1. Pop from `mag_head` (1-2 ns) ← Fast path
2. If empty, `refill_from_bitmap(active, 16)` (about 88 ns for 16 items by the Section 2.2 cost model of 8.5 ns fixed plus 5 ns per item) → about +5.5 ns amortized
3. If active bitmap empty, swap to partial page
4. If no partial, allocate new page

**Expected**: **12-15 ns average** (90%+ mag hit rate)

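
The four allocation steps can be sketched as a single retry loop. This is an illustration under simplifying assumptions rather than the proposed implementation: each page has a single 64-bit bitmap word (`free_bits`) instead of the two-tier bitmap, pages are backed by `malloc`, an exhausted active page is simply dropped where a real allocator would move it to a full list, and the `page_new`, `refill`, and `tls_alloc` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCKS_PER_PAGE 64  /* one bitmap word per page keeps the sketch short */

typedef struct Block { struct Block *next; } Block;

typedef struct Page {
    struct Page *next;      /* link in the partial-page list */
    uint64_t free_bits;     /* bit set => block is free */
    uint8_t *base;
    uint16_t block_size;
} Page;

typedef struct TinyTLSCache {
    Block   *mag_head;      /* fast path: LIFO magazine */
    uint16_t mag_count;
    Page    *active;        /* medium path: page being refilled from */
    Page    *partial_head;  /* cold path: pages with free blocks left */
} TinyTLSCache;

static Page *page_new(uint16_t block_size) {
    assert(block_size >= sizeof(Block));
    Page *p = malloc(sizeof *p);
    if (!p) return NULL;
    p->base = malloc((size_t)BLOCKS_PER_PAGE * block_size);
    if (!p->base) { free(p); return NULL; }
    p->next = NULL;
    p->free_bits = ~0ull;
    p->block_size = block_size;
    return p;
}

/* Move up to `want` blocks from the active page's bitmap to the magazine. */
static int refill(TinyTLSCache *c, int want) {
    Page *p = c->active;
    int got = 0;
    while (p && p->free_bits && got < want) {
        int bit = __builtin_ctzll(p->free_bits);
        p->free_bits &= p->free_bits - 1;
        Block *b = (Block *)(p->base + (size_t)bit * p->block_size);
        b->next = c->mag_head;
        c->mag_head = b;
        got++;
    }
    c->mag_count += (uint16_t)got;
    return got;
}

static void *tls_alloc(TinyTLSCache *c, uint16_t block_size) {
    for (;;) {
        Block *b = c->mag_head;                  /* step 1: pop from magazine */
        if (b) { c->mag_head = b->next; c->mag_count--; return b; }
        if (refill(c, 16) > 0) continue;         /* step 2: batch refill */
        if (c->partial_head) {                   /* step 3: swap to a partial page */
            c->active = c->partial_head;
            c->partial_head = c->active->next;
            continue;
        }
        c->active = page_new(block_size);        /* step 4: allocate a new page */
        if (!c->active) return NULL;
    }
}
```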

---

## 4. Bitmap as "Control Plane": Research Features

### 4.1 Bitmap Consistency Model

**Problem**: Mini-magazine has items, but bitmap still marks them as "free"

```
Bitmap state: [1 1 1 1 1 1 1 1]  (all free)
Mini-mag:     [b1, b2, b3]       (3 blocks cached)
Truth:        Only 5 are truly free, not 8
```

**Solution 1**: Lazy Update (Eventual Consistency)
```c
// Pseudocode. On refill: mark blocks as allocated in bitmap
int refill_from_bitmap(Page* p, int want) {
    // ... extract blocks ...
    for each block:
        clear_bit(p->bm_word, idx);  // Mark allocated immediately
    // Mini-mag now holds allocated blocks (consistent)
}

// On spill: mark blocks as free in bitmap
void spill_to_bitmap(Page* p) {
    for each block in mini-mag:
        set_bit(p->bm_word, idx);    // Mark free
}
```

**Consistency**: ✅ Bitmap is always truth, mini-mag is just cache. A cleared bit means "handed out or cached in a mini-mag", so an exact free count requires a spill first.

**Solution 2**: Shadow State
```c
// Bitmap tracks "ever allocated" state
// Mini-mag tracks "currently cached" state
// Research features read: bitmap + mini-mag count

uint16_t get_true_free_count(Page* p) {
    return p->bitmap_free_count - p->mag_count;
}
```

**Consistency**: ⚠️ More complex, but allows instant queries

**Recommendation**: **Solution 1** (simpler, consistent)

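
Under Solution 1 the useful invariant is that every block is in exactly one of three places: marked free in the bitmap, cached in the mini-magazine, or handed out to the application. The sketch below makes that checkable. It is an illustration under assumptions, not the proposed code: the page uses a single 64-bit bitmap word, and `block_index`, `mag_pop`, and `true_free_count` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBLOCKS 64  /* one bitmap word keeps the sketch short */

typedef struct Block { struct Block *next; } Block;

typedef struct {
    uint64_t free_bits;   /* control plane: bit set => free and not cached */
    Block   *mag_head;    /* data plane: LIFO of cached blocks */
    uint16_t mag_count;
    uint8_t *base;
    uint16_t block_size;
} Page;

static int block_index(const Page *p, const void *ptr) {
    ptrdiff_t off = (const uint8_t *)ptr - p->base;
    assert(off >= 0 && off < (ptrdiff_t)NBLOCKS * p->block_size);
    return (int)(off / p->block_size);
}

/* Refill: each block's bit is cleared as it enters the magazine. */
static int refill_from_bitmap(Page *p, int want) {
    int got = 0;
    while (p->free_bits && got < want) {
        int idx = __builtin_ctzll(p->free_bits);
        p->free_bits &= p->free_bits - 1;
        Block *b = (Block *)(p->base + (size_t)idx * p->block_size);
        b->next = p->mag_head;
        p->mag_head = b;
        got++;
    }
    p->mag_count += (uint16_t)got;
    return got;
}

static void *mag_pop(Page *p) {
    Block *b = p->mag_head;
    if (!b) return NULL;
    p->mag_head = b->next;
    p->mag_count--;
    return b;
}

/* Spill: return every cached block to the bitmap by setting its bit. */
static void spill_to_bitmap(Page *p) {
    while (p->mag_head) {
        Block *b = p->mag_head;
        p->mag_head = b->next;
        p->free_bits |= 1ull << block_index(p, b);
    }
    p->mag_count = 0;
}

/* Blocks not in application hands: bitmap-free plus magazine-cached. */
static int true_free_count(const Page *p) {
    return __builtin_popcountll(p->free_bits) + p->mag_count;
}
```

After a spill, `mag_count` is zero and the bitmap alone gives the exact free count, which is the state the diagnostic readers in Section 4.2 rely on.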

### 4.2 Research Features Still Work

**Call-site profiling**:
```c
// On allocation, record call-site
void* alloc_with_profiling(void* site) {
    void* ptr = tiny_alloc_fast(...);

    // Diagnostic: Update bitmap-based tracking
    // (page = the Page that owns ptr; lookup elided)
    if (diagnostic_enabled) {
        int idx = block_index(page, ptr);
        page->owner[idx] = current_thread();
        page->alloc_site[idx] = site;
        page->alloc_time[idx] = rdtsc();  // Read back by free_with_elo()
    }
    return ptr;
}
```

**ELO learning**:
```c
// On free, update ELO based on lifetime
void free_with_elo(void* ptr) {
    int idx = block_index(page, ptr);
    void* site = page->alloc_site[idx];
    uint64_t lifetime = rdtsc() - page->alloc_time[idx];

    update_elo(site, lifetime);  // Bitmap enables this

    tiny_free_fast(ptr);  // Then free normally
}
```

**Memory diagnostics**:
```c
// Snapshot: Flush mini-mag to bitmap, then read
void snapshot_memory_state() {
    flush_all_mini_magazines();  // Spill to bitmaps

    for_each_page(page) {
        print_bitmap_state(page);  // Full visibility
    }
}
```

**Conclusion**: ✅ **All research features preserved** (with flush/spill)

---

## 5. Implementation Complexity

### 5.1 Required Changes

**New structures** (~50 lines):
```c
typedef struct Block {
    struct Block* next;  // Intrusive LIFO
} Block;

typedef struct Page {
    // Mini-magazine
    Block* mag_head;
    uint16_t mag_count;
    uint16_t mag_max;

    // Two-tier bitmap
    uint64_t bm_top;
    uint64_t bm_word[16];

    // Existing (keep)
    uint8_t* base;
    uint16_t block_size;
    // ...
} Page;
```

**New functions** (~200 lines):
```c
void* tiny_alloc_fast(ThreadHeap* th, int class_idx);
void tiny_free_fast(Page* p, void* ptr);
int refill_from_bitmap(Page* p, int want);
void spill_to_bitmap(Page* p);
void init_two_tier_bitmap(Page* p);
```

**Modified functions** (~300 lines):
```c
// Existing bitmap allocation → refill logic
hak_tiny_alloc() → integrate with tiny_alloc_fast()
hak_tiny_free() → integrate with tiny_free_fast()
// Statistics collection → batched/sampled
```

**Total code changes**: ~500-600 lines (moderate)

### 5.2 Testing Requirements

**Unit tests**:
- Two-tier bitmap correctness (refill/spill)
- Mini-magazine overflow/underflow
- Bitmap-magazine consistency

**Integration tests**:
- Existing bench_tiny benchmarks
- Multi-threaded stress tests
- Diagnostic feature validation

**Performance tests**:
- Before/after latency comparison
- Hit rate measurement (mini-mag vs refill)

**Estimated effort**: **6-8 hours** (implementation + testing; the phased plan in Section 8 totals 7-10 hours)

---

## 6. Risks and Mitigation

### Risk 1: Mini-Magazine Size Tuning

**Problem**: Too small (8) → frequent refills; too large (64) → memory overhead

**Mitigation**:
- Make `mag_max` tunable via environment variable
- Adaptive sizing based on allocation pattern
- Start with 16-32 (sweet spot)

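
The environment-variable mitigation can be sketched as a small parser that falls back to a default and clamps to a safe range. Everything named here is an assumption for illustration: the variable name `HAKMEM_TINY_MAG_MAX`, the default of 16, and the 8-128 bounds (chosen to span the sizes discussed in this document) are hypothetical, not existing hakmem settings.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse a magazine size from text. Missing or malformed input yields
 * dflt; well-formed values are clamped to [lo, hi]. */
static unsigned parse_mag_max(const char *s, unsigned dflt, unsigned lo, unsigned hi) {
    assert(lo <= dflt && dflt <= hi);
    if (s == NULL || *s == '\0') return dflt;
    char *end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0') return dflt;  /* overflow or junk */
    if (v < (long)lo) return lo;
    if (v > (long)hi) return hi;
    return (unsigned)v;
}

/* Hypothetical variable name; read once at pool initialization. */
static unsigned mag_max_from_env(void) {
    return parse_mag_max(getenv("HAKMEM_TINY_MAG_MAX"), 16u, 8u, 128u);
}
```

Reading the value once at initialization and storing it in `mag_max` keeps `getenv` off the allocation path.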

### Risk 2: Bitmap Refill Overhead

**Problem**: If mini-mag empties frequently, refill cost dominates

**Scenarios**:
- Burst allocation (1000 allocs in a row) → 1000/16 ≈ 63 refills
- Refill cost: 63 × ~88 ns (16 items by the Section 2.2 cost model) ≈ 5.5 µs total ≈ **5.5 ns/alloc amortized** ✅

**Mitigation**: Batch size (16) amortizes cost well

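
The burst scenario can be recomputed for any batch size by extrapolating the Section 2.2 cost model. The extrapolation is an assumption: Section 2.2 only costs a batch of 8 (8.5 ns of fixed work plus 5 ns per extracted block, 48.5 ns in total), and the sketch below applies the same per-block figure to other batch sizes. The function names are hypothetical.

```c
#include <assert.h>

/* Section 2.2 model: ~8.5 ns fixed (loads, stores, branch) plus
 * ~5 ns per block extracted from the bitmap word. */
static double refill_cost_ns(int batch) {
    assert(batch > 0);
    return 8.5 + 5.0 * (double)batch;
}

/* Refills needed to serve a burst of `allocs` allocations (rounded up). */
static int refills_for_burst(int allocs, int batch) {
    assert(allocs >= 0 && batch > 0);
    return (allocs + batch - 1) / batch;
}

/* Refill cost spread over every allocation in the burst. */
static double amortized_refill_ns(int allocs, int batch) {
    assert(allocs > 0);
    return refills_for_burst(allocs, batch) * refill_cost_ns(batch) / (double)allocs;
}
```

Under this model the per-block extraction cost dominates, so moving from a batch of 8 to a batch of 16 lowers the amortized figure only slightly (from roughly 6.1 to 5.6 ns per allocation in a 1000-allocation burst); the larger gain from bigger batches is fewer hot-path misses.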

### Risk 3: TLS Magazine Integration

**Problem**: Unclear how to integrate with existing TLS Magazine

**Options**:
1. Remove TLS Magazine entirely → **Simplest**
2. Keep TLS Magazine, add page mini-mag → **Complex**
3. Replace TLS Magazine with TLS mini-mag (64-128 items) → **Recommended**

**Mitigation**: Prototype Option 3, benchmark against current

### Risk 4: Diagnostic Lag

**Problem**: Bitmap doesn't reflect mini-mag state in real-time

**Scenarios**:
- Profiler reads bitmap → sees "allocated" for a block that is actually idle in the mini-mag (Solution 1 in Section 4.1 clears the bit at refill)
- Fix: Flush before diagnostic read

**Mitigation**:
```c
void flush_diagnostics() {
    for_each_class(c) {
        spill_to_bitmap(g_tls_cache[c].active);
    }
}
```

---

## 7. Performance Comparison Matrix

| Approach | Latency (est.) | Research | Complexity | Risk | Improvement |
|----------|----------------|----------|------------|------|-------------|
| **Current (Bitmap only)** | 83 ns | ✅ Full | Low | Low | Baseline |
| **Strategy A (Bitmap + cleanup)** | 58-65 ns | ✅ Full | Low | Low | +22-30% |
| **Strategy B (Free-list only)** | 45-55 ns | ❌ Lost | Moderate | Moderate | +35-45% |
| **Hybrid (Bitmap+Mini-Mag)** | **35-45 ns** | ✅ Full | Moderate | Moderate | **+45-58%** |

**Winner**: **Hybrid** (best speed + research preservation)

---

## 8. Recommended Implementation Plan

### Phase 1: Two-Tier Bitmap (2-3 hours)

**Goal**: Eliminate empty word scan overhead

```c
// Add bm_top to existing TinySlab
typedef struct TinySlab {
    uint64_t bm_top;      // NEW: Summary bitmap
    uint64_t bitmap[16];  // Existing
    // ...
} TinySlab;

// Update allocation to use bm_top
if (slab->bm_top == 0) return NULL;  // Fast empty check
int w = __builtin_ctzll(slab->bm_top);
// ...
```

**Expected**: 83 ns → 78-80 ns (3-5 ns faster)

**Risk**: Low (additive change)

### Phase 2: Page Mini-Magazine (3-4 hours)

**Goal**: Add LIFO mini-magazine to slabs

```c
typedef struct TinySlab {
    // Mini-magazine (NEW)
    Block* mag_head;
    uint16_t mag_count;
    uint16_t mag_max;  // 16

    // Two-tier bitmap (from Phase 1)
    uint64_t bm_top;
    uint64_t bitmap[16];
    // ...
} TinySlab;

void* tiny_alloc_fast(TinySlab* slab) {
    Block* b = slab->mag_head;
    if (likely(b)) {
        slab->mag_head = b->next;
        slab->mag_count--;  // Keep the count in step with the list
        return b;
    }
    // Refill from bitmap (batch of 16)
    refill_from_bitmap(slab, 16);
    // Retry
    return slab->mag_head ? pop_mag(slab) : NULL;
}
```

**Expected**: 78-80 ns → 45-55 ns (25-35 ns faster)

**Risk**: Moderate (structural change)

### Phase 3: TLS Integration (1-2 hours)

**Goal**: Integrate with existing TLS Magazine

```c
// Option: Replace TLS Magazine with TLS mini-mag
typedef struct TinyTLSCache {
    Block* mag_head;     // 64-128 items
    uint16_t mag_count;
    TinySlab* active;    // Current slab
    TinySlab* partial;   // Partial slabs
} TinyTLSCache;
```

**Expected**: 45-55 ns → 35-45 ns (about 10 ns faster from better TLS integration)

**Risk**: Moderate (requires careful testing)

### Phase 4: Statistics Batching (1 hour)

**Goal**: Remove per-allocation statistics overhead

```c
// Hot path: bump a thread-local counter only.
// Cold path: fold it into the shared counter once per 100 allocations.
if (++g_tls_alloc_counter[class_idx] >= 100) {
    g_tiny_pool.alloc_count[class_idx] += 100;
    g_tls_alloc_counter[class_idx] = 0;
}
```
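
A self-contained version of the batching idea is sketched below. It differs from the snippet above in two labeled ways, both assumptions on my part: the shared counter is a C11 atomic, because a plain `+=` on a global from several threads is a data race, and a hypothetical `stats_flush` folds in the remainder so that a reader can obtain an exact count. `TINY_NUM_CLASSES` is set to 8 here only to make the sketch compile.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TINY_NUM_CLASSES 8   /* assumed value for this sketch */
#define STATS_BATCH 100

static _Atomic uint64_t g_alloc_count[TINY_NUM_CLASSES];              /* shared */
static _Thread_local uint32_t g_tls_alloc_counter[TINY_NUM_CLASSES];  /* per thread */

/* Hot path: one thread-local increment; the shared counter is touched
 * only once per STATS_BATCH allocations. */
static inline void stats_note_alloc(int class_idx) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    if (++g_tls_alloc_counter[class_idx] >= STATS_BATCH) {
        atomic_fetch_add_explicit(&g_alloc_count[class_idx], STATS_BATCH,
                                  memory_order_relaxed);
        g_tls_alloc_counter[class_idx] = 0;
    }
}

/* Fold this thread's pending remainder into the shared counter, for
 * example before a diagnostic snapshot or at thread exit. */
static inline void stats_flush(int class_idx) {
    uint32_t pending = g_tls_alloc_counter[class_idx];
    if (pending) {
        atomic_fetch_add_explicit(&g_alloc_count[class_idx], pending,
                                  memory_order_relaxed);
        g_tls_alloc_counter[class_idx] = 0;
    }
}

/* Shared count; lags each thread by up to STATS_BATCH - 1 until flushed. */
static inline uint64_t stats_read(int class_idx) {
    return atomic_load_explicit(&g_alloc_count[class_idx], memory_order_relaxed);
}
```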

**Expected**: 35-45 ns → 30-40 ns (about 5 ns faster)

**Risk**: Low (independent change)

### Total Timeline

**Effort**: 7-10 hours
**Expected result**: 83 ns → **30-45 ns** (45-65% improvement)
**Research features**: ✅ Fully preserved (bitmap visibility maintained)

---

## 9. Comparison to Alternatives

### vs Strategy A (Bitmap + Cleanup)
- **Strategy A**: 83 ns → 58-65 ns (+22-30%)
- **Hybrid**: 83 ns → 30-45 ns (+45-65%)
- **Winner**: Hybrid (roughly 20-30 ns faster)

### vs Strategy B (Free-list Only)
- **Strategy B**: 83 ns → 45-55 ns, ❌ loses research features
- **Hybrid**: 83 ns → 30-45 ns, ✅ keeps research features
- **Winner**: Hybrid (faster + research preserved)

### vs ChatGPT Pro's Estimate (55-60ns)
- **ChatGPT Pro**: 55-60 ns (conservative relative to this analysis)
- **Realistic Hybrid**: 30-45 ns (with all phases)
- **Conservative**: 40-50 ns (if hit rate is lower)
- **Conclusion**: 55-60 ns is achievable, 30-40 ns is optimistic but possible

---

## 10. Conclusion

### Technical Verdict

**The Hybrid Bitmap+Mini-Magazine approach is sound and recommended** ✅

**Key strengths**:
1. ✅ Preserves bitmap visibility (research features intact)
2. ✅ Achieves free-list-like speed on hot path (30-45 ns realistic)
3. ✅ Two-tier bitmap eliminates empty scan overhead
4. ✅ Well-established pattern (mimalloc uses similar techniques)

**Key concerns**:
1. ⚠️ Moderate implementation complexity (7-10 hours)
2. ⚠️ TLS Magazine integration needs careful design
3. ⚠️ Bitmap consistency requires flush for diagnostics
4. ⚠️ Performance depends on mini-magazine hit rate (90%+ needed)

### Recommendation

**Adopt the Hybrid approach with 4-phase implementation**:
1. Two-tier bitmap (low risk, immediate gain)
2. Page mini-magazine (moderate risk, big gain)
3. TLS integration (moderate risk, polish)
4. Statistics batching (low risk, final optimization)

**Expected outcome**: **83 ns → 30-45 ns** (45-65% improvement) while preserving all research features

### Next Steps

1. ✅ Create final implementation strategy document
2. ✅ Update TINY_POOL_OPTIMIZATION_STRATEGY.md to Hybrid approach
3. ✅ Begin Phase 1 (Two-tier bitmap) implementation
4. ✅ Validate with benchmarks after each phase

---

**Last Updated**: 2025-10-26
**Status**: Analysis complete, ready for implementation
**Confidence**: HIGH (backed by mimalloc precedent, realistic estimates)
**Risk Level**: MODERATE (phased approach mitigates risk)