Bottleneck Analysis Report: hakmem Tiny Pool Allocator
Date: 2025-10-26
Target: hakmem bitmap-based allocator
Baseline: mimalloc (industry standard)
Analyzed by: Deep code analysis + performance modeling
Executive Summary
Top 3 Bottlenecks with Estimated Impact
- TLS Magazine Hierarchy Overhead (HIGH: ~3-5 ns per allocation)
- 3-tier indirection: TLS Magazine → TLS Active Slab → Mini-Magazine → Bitmap
- Each tier adds cache miss risk and branching overhead
- Expected speedup: 30-40% if collapsed to 2-tier
- Two-Tier Bitmap Traversal (HIGH: ~4-6 ns on bitmap path)
- Summary bitmap scan + main bitmap scan + hint_word update
- Cache-friendly but computationally expensive (2x CTZ, 2x bitmap updates)
- Expected speedup: 20-30% if bypassed more often via better caching
- Registry Lookup on Free Path (MEDIUM: ~2-4 ns per free)
- Hash computation + linear probe + validation on every cross-slab free
- Could be eliminated with mimalloc-style pointer arithmetic
- Expected speedup: 15-25% on free-heavy workloads
Performance Gap Analysis
Random Free Pattern (Bitmap's best case):
- hakmem: 68 M ops/sec (14.7 ns/op)
- mimalloc: 176 M ops/sec (5.7 ns/op)
- Gap: 2.6× slower (9 ns difference)
Sequential LIFO Pattern (Free-list's best case):
- hakmem: 102 M ops/sec (9.8 ns/op)
- mimalloc: 942 M ops/sec (1.1 ns/op)
- Gap: 9.2× slower (8.7 ns difference)
Key Insight: Even on favorable patterns (random), we're 2.6× slower. This means the bottleneck is NOT just the bitmap, but the entire allocation architecture.
Expected Total Speedup
- Conservative: 2.0-2.5× (close the 2.6× gap partially)
- Optimistic: 3.0-4.0× (with aggressive optimizations)
- Realistic Target: 2.5× (reaching ~170 M ops/sec on random, ~250 M ops/sec on LIFO)
Critical Path Analysis
Allocation Fast Path Walkthrough
Let me trace the exact execution path for hak_tiny_alloc(16) with step-by-step cycle estimates:
// hakmem_tiny.c:557 - Entry point
void* hak_tiny_alloc(size_t size) {
// Line 558: Initialization check
if (!g_tiny_initialized) hak_tiny_init(); // BRANCH: ~1 cycle (predicted taken once)
// Line 561-562: Wrapper context check
extern int hak_in_wrapper(void);
if (!g_wrap_tiny_enabled && hak_in_wrapper()) // BRANCH: ~1 cycle
return NULL;
// Line 565: Size to class conversion
int class_idx = hak_tiny_size_to_class(size); // INLINE: ~2 cycles (branch chain)
if (class_idx < 0) return NULL; // BRANCH: ~1 cycle
// Line 569-576: SuperSlab path (disabled by default)
if (g_use_superslab) { /* ... */ } // BRANCH: ~1 cycle (not taken)
// Line 650-651: TLS Magazine initialization check
tiny_mag_init_if_needed(class_idx); // INLINE: ~3 cycles (conditional init)
TinyTLSMag* mag = &g_tls_mags[class_idx]; // TLS ACCESS: ~2 cycles
// Line 666-670: TLS Magazine fast path (BEST CASE)
if (mag->top > 0) { // LOAD + BRANCH: ~2 cycles
void* p = mag->items[--mag->top].ptr; // LOAD + DEC + STORE: ~3 cycles
stats_record_alloc(class_idx); // INLINE: ~1 cycle (TLS increment)
return p; // RETURN: ~1 cycle
}
// TOTAL FAST PATH: ~18 cycles (~6 ns @ 3 GHz)
// Line 673-674: TLS Active Slab lookup (MEDIUM PATH)
TinySlab* tls = g_tls_active_slab_a[class_idx]; // TLS ACCESS: ~2 cycles
if (!(tls && tls->free_count > 0)) // LOAD + BRANCH: ~3 cycles
tls = g_tls_active_slab_b[class_idx]; // TLS ACCESS: ~2 cycles (if taken)
if (tls && tls->free_count > 0) { // BRANCH: ~1 cycle
// Line 677-679: Remote drain check
if (atomic_load(&tls->remote_count) >= thresh || rand() & mask) {
tiny_remote_drain_owner(tls); // RARE: ~50-200 cycles (if taken)
}
// Line 682-688: Mini-magazine fast path
if (!mini_mag_is_empty(&tls->mini_mag)) { // LOAD + BRANCH: ~2 cycles
void* p = mini_mag_pop(&tls->mini_mag); // INLINE: ~4 cycles (LIFO pop)
if (p) {
stats_record_alloc(class_idx); // INLINE: ~1 cycle
return p; // RETURN: ~1 cycle
}
}
// MINI-MAG PATH: ~30 cycles (~10 ns)
// Line 691-700: Batch refill from bitmap
if (tls->free_count > 0 && mini_mag_is_empty(&tls->mini_mag)) {
int refilled = batch_refill_from_bitmap(tls, &tls->mini_mag, 16);
// REFILL COST: ~48 ns for 16 items = ~3 ns/item amortized
if (refilled > 0) {
void* p = mini_mag_pop(&tls->mini_mag);
if (p) {
stats_record_alloc(class_idx);
return p;
}
}
}
// REFILL PATH: ~50 cycles (~17 ns) for batch + ~10 ns for next alloc
// Line 703-713: Bitmap scan fallback
if (tls->free_count > 0) {
int block_idx = hak_tiny_find_free_block(tls); // BITMAP SCAN: ~15-20 cycles
if (block_idx >= 0) {
hak_tiny_set_used(tls, block_idx); // BITMAP UPDATE: ~10 cycles
tls->free_count--; // STORE: ~1 cycle
void* p = (char*)tls->base + (block_idx * bs); // COMPUTE: ~3 cycles
stats_record_alloc(class_idx); // INLINE: ~1 cycle
return p; // RETURN: ~1 cycle
}
}
// BITMAP PATH: ~50 cycles (~17 ns)
}
// Line 717-718: Lock and refill from global pool (SLOW PATH)
pthread_mutex_lock(lock); // LOCK: ~30-100 cycles (contended)
// ... slow path: 200-1000 cycles (rare) ...
}
Cycle Count Summary
| Path | Cycles | Latency (ns) | Frequency | Notes |
|---|---|---|---|---|
| TLS Magazine Hit | ~18 | ~6 ns | 60-80% | Best case (cache hit) |
| Mini-Mag Hit | ~30 | ~10 ns | 10-20% | Good case (slab-local) |
| Batch Refill | ~50 | ~17 ns | 5-10% | Amortized 3 ns/item |
| Bitmap Scan | ~50 | ~17 ns | 5-10% | Worst case before lock |
| Global Lock Path | ~300 | ~100 ns | <5% | Very rare (refill) |
Weighted Average: 0.7×6 + 0.15×10 + 0.1×17 + 0.05×100 ≈ 12 ns/op (theoretical). Measured: 9.8-14.7 ns/op, which matches the model.
Comparison with mimalloc's Approach
mimalloc achieves 1.1 ns/op on LIFO pattern by:
- No TLS Magazine Layer: Direct access to thread-local page free-list
- Intrusive Free-List: 1 load + 1 store (2 cycles) vs our 18 cycles
- 2MB Alignment: O(1) pointer→slab via bit-masking (no registry lookup)
- No Bitmap: Free-list only (trades random-access resistance for speed)
hakmem's Architecture:
Allocation Request
↓
TLS Magazine (2048 items) ← 1st tier: ~6 ns (cache hit)
↓ (miss)
TLS Active Slab (2 per class) ← 2nd tier: lookup cost
↓
Mini-Magazine (16-32 items) ← 3rd tier: ~10 ns (LIFO pop)
↓ (miss)
Batch Refill (16 items) ← 4th tier: ~3 ns amortized
↓ (miss)
Bitmap Scan (two-tier) ← 5th tier: ~17 ns (expensive)
↓ (miss)
Global Lock + Slab Allocation ← 6th tier: ~100+ ns (rare)
mimalloc's Architecture:
Allocation Request
↓
Thread-Local Page Free-List ← 1st tier: ~1 ns (1 load + 1 store)
↓ (miss)
Thread-Local Page Queue ← 2nd tier: ~5 ns (page switch)
↓ (miss)
Global Segment Allocation ← 3rd tier: ~50 ns (rare)
Key Difference: mimalloc has 3 tiers, hakmem has 6 tiers. Each tier adds ~2-3 ns overhead.
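To make the tier difference concrete, below is a minimal sketch of an intrusive free-list fast path of the kind mimalloc relies on. The names and types are illustrative only (not mimalloc's actual API); the point is that the freed block itself stores the next pointer, so a hit is just a couple of pointer operations with no metadata lookup.
// Illustrative sketch only (not mimalloc's real API): intrusive free-list fast path
typedef struct FreeBlock { struct FreeBlock* next; } FreeBlock;
typedef struct Page { FreeBlock* free_list; } Page;   // thread-local page

static inline void* page_pop(Page* pg) {
    FreeBlock* b = pg->free_list;      // load head
    if (!b) return NULL;               // miss: fall through to the next tier
    pg->free_list = b->next;           // load next + store head
    return b;
}

static inline void page_push(Page* pg, void* p) {
    FreeBlock* b = (FreeBlock*)p;
    b->next = pg->free_list;           // next pointer reuses the freed block's memory
    pg->free_list = b;
}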
Bottleneck #1: TLS Magazine Hierarchy Overhead
Location
- File: /home/tomoaki/git/hakmem/hakmem_tiny.c
- Lines: 650-714 (allocation fast path)
- Impact: HIGH (affects 100% of allocations)
Code Analysis
// Line 650-651: 1st tier - TLS Magazine
tiny_mag_init_if_needed(class_idx); // ~3 cycles (conditional check)
TinyTLSMag* mag = &g_tls_mags[class_idx]; // ~2 cycles (TLS base + offset)
// Line 666-670: TLS Magazine lookup
if (mag->top > 0) { // ~2 cycles (load + branch)
void* p = mag->items[--mag->top].ptr; // ~3 cycles (array access + decrement)
stats_record_alloc(class_idx); // ~1 cycle (TLS increment)
return p; // ~1 cycle
}
// TOTAL: ~12 cycles for cache hit (BEST CASE)
// Line 673-674: 2nd tier - TLS Active Slab lookup
TinySlab* tls = g_tls_active_slab_a[class_idx]; // ~2 cycles (TLS access)
if (!(tls && tls->free_count > 0)) // ~3 cycles (2 loads + branch)
tls = g_tls_active_slab_b[class_idx]; // ~2 cycles (if miss)
// Line 682-688: 3rd tier - Mini-Magazine
if (!mini_mag_is_empty(&tls->mini_mag)) { // ~2 cycles (load slab->mini_mag.count)
void* p = mini_mag_pop(&tls->mini_mag); // ~4 cycles (LIFO pop: 2 loads + 1 store)
if (p) { stats_record_alloc(class_idx); return p; }
}
// TOTAL: ~13 cycles for mini-mag hit (MEDIUM CASE)
Why It's Slow
- Multiple TLS Accesses: Each tier requires a TLS base lookup + offset calculation
- g_tls_mags[class_idx] → TLS read #1
- g_tls_active_slab_a[class_idx] → TLS read #2
- g_tls_active_slab_b[class_idx] → TLS read #3 (conditional)
- Cost: 2-3 cycles each × 3 = 6-9 cycles overhead
- Cache Line Fragmentation: TLS variables are separate arrays
- g_tls_mags[8] = 128 KB total (2048 items × 8 classes × 8 bytes, 16 KB per class)
- g_tls_active_slab_a[8] = 64 bytes
- g_tls_active_slab_b[8] = 64 bytes
- Cost: Likely span multiple cache lines → potential cache misses
- Branch Misprediction: Multi-tier fallback creates a branch chain
- Magazine empty? → Check active slab A
- Slab A empty? → Check active slab B
- Mini-mag empty? → Refill from bitmap
- Cost: Each mispredicted branch = 10-20 cycles penalty
- Redundant Metadata: Magazine items store {void* ptr} separately from slab pointers
- Magazine item: 8 bytes per pointer (2048 × 8 = 16 KB per class)
- Slab pointers: 8 bytes × 2 per class (16 bytes)
- Cost: Memory overhead reduces cache efficiency
Optimization: Unified TLS Cache Structure
Before (current):
// Separate TLS arrays (fragmented in memory)
static __thread TinyMagItem g_tls_mags[TINY_NUM_CLASSES][TINY_TLS_MAG_CAP];
static __thread TinySlab* g_tls_active_slab_a[TINY_NUM_CLASSES];
static __thread TinySlab* g_tls_active_slab_b[TINY_NUM_CLASSES];
After (proposed):
// Unified per-class TLS structure (cache-line aligned)
typedef struct __attribute__((aligned(64))) {
    // Hot fields (scalars + pointers; together they fit in the first 64-byte cache line)
    uint16_t mag_top;              // Current magazine count
    uint16_t mag_cap;              // Magazine capacity
    uint32_t _pad0;
    TinySlab* active_slab;         // Primary active slab (no A/B split)
    PageMiniMag* mini_mag;         // Direct pointer to slab's mini-mag
    uint64_t last_refill_tsc;      // For adaptive refill timing
    // Warm fields: magazine storage (reduced from 2048 to 32 items, still effective)
    void* mag_items[32];
    // Cold fields: batched statistics
    uint64_t stats_alloc_batch;
    uint64_t stats_free_batch;
} TinyTLSCache;
static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];
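A sketch of what the collapsed fast path could look like on top of this layout. hak_tiny_alloc_fast is a hypothetical name, not an existing function; the refill fall-through would reuse the current slab/refill machinery.
static inline void* hak_tiny_alloc_fast(int class_idx) {
    TinyTLSCache* c = &g_tls_cache[class_idx];   // single TLS access
    uint16_t top = c->mag_top;
    if (top != 0) {                              // one predicted-taken branch
        c->mag_top = (uint16_t)(top - 1);
        c->stats_alloc_batch++;                  // stats live in the same cache line group
        return c->mag_items[top - 1];
    }
    return NULL;  // caller falls back to refill via c->active_slab / c->mini_mag
}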
Benefits:
- Single TLS access: g_tls_cache[class_idx] (not 3 separate lookups)
- Cache-line aligned: all hot fields in the first 64 bytes
- Reduced magazine size: 32 items (not 2048) saves ~15.75 KB per class
- Direct mini-mag pointer: No slab→mini_mag indirection
Expected Speedup: 30-40% (reduce fast path from ~12 cycles to ~7 cycles)
Risk: MEDIUM
- Requires refactoring TLS access patterns throughout codebase
- Magazine size reduction may increase refill frequency (trade-off)
- Need careful testing to ensure no regression on multi-threaded workloads
Bottleneck #2: Two-Tier Bitmap Traversal
Location
- File: /home/tomoaki/git/hakmem/hakmem_tiny.h
- Lines: 235-269 (hak_tiny_find_free_block)
- Impact: HIGH (affects 5-15% of allocations, but expensive when hit)
Code Analysis
// Line 235-269: Two-tier bitmap scan
static inline int hak_tiny_find_free_block(TinySlab* slab) {
const int bw = g_tiny_bitmap_words[slab->class_idx]; // Bitmap words
const int sw = slab->summary_words; // Summary words
if (bw <= 0 || sw <= 0) return -1;
int start_word = slab->hint_word % bw; // Hint optimization
int start_sw = start_word / 64; // Summary word index
int start_sb = start_word % 64; // Summary bit offset
// Line 244-267: Summary bitmap scan (outer loop)
for (int k = 0; k < sw; k++) { // ~sw iterations (1-128)
int idx = start_sw + k;
if (idx >= sw) idx -= sw; // Wrap-around
uint64_t bits = slab->summary[idx]; // LOAD: ~2 cycles
// Mask optimization (skip processed bits)
if (k == 0) {
bits &= (~0ULL) << start_sb; // BITWISE: ~1 cycle
}
if (idx == sw - 1 && (bw % 64) != 0) {
uint64_t mask = (bw % 64) == 64 ? ~0ULL : ((1ULL << (bw % 64)) - 1ULL);
bits &= mask; // BITWISE: ~1 cycle
}
if (bits == 0) continue; // BRANCH: ~1 cycle (often taken)
int woff = __builtin_ctzll(bits); // CTZ #1: ~3 cycles
int word_idx = idx * 64 + woff; // COMPUTE: ~2 cycles
if (word_idx >= bw) continue; // BRANCH: ~1 cycle
// Line 261-266: Main bitmap scan (inner)
uint64_t used = slab->bitmap[word_idx]; // LOAD: ~2 cycles (cache miss risk)
uint64_t free_bits = ~used; // BITWISE: ~1 cycle
if (free_bits == 0) continue; // BRANCH: ~1 cycle (rare)
int bit_idx = __builtin_ctzll(free_bits); // CTZ #2: ~3 cycles
slab->hint_word = (uint16_t)((word_idx + 1) % bw); // UPDATE HINT: ~2 cycles
return word_idx * 64 + bit_idx; // RETURN: ~1 cycle
}
return -1;
}
// TYPICAL COST: 15-20 cycles (1-2 summary iterations, 1 main bitmap access)
// WORST CASE: 50-100 cycles (many summary words scanned, cache misses)
Why It's Slow
- Two-Level Indirection: Summary → Bitmap → Block
- Summary scan: Find word with free bits (~5-10 cycles)
- Main bitmap scan: Find bit within word (~5 cycles)
- Cost: 2× CTZ operations, 2× memory loads
- Cache Miss Risk: Bitmap can be up to 1 KB (128 words × 8 bytes)
- Class 0 (8B): 128 words = 1024 bytes
- Class 1 (16B): 64 words = 512 bytes
- Class 2 (32B): 32 words = 256 bytes
- Cost: Bitmap lines compete with application data in L1 (32 KB) and may be evicted → L2 access (~10-20 cycles)
- Hint Word State: Requires update on every allocation
- Read hint_word (~1 cycle)
- Compute new hint (~2 cycles)
- Write hint_word (~1 cycle)
- Cost: 4 cycles per allocation (not amortized)
- Branch-Heavy Loop: Multiple branches per iteration
- if (bits == 0) continue; (often taken when bitmap is sparse)
- if (word_idx >= bw) continue; (rare safety check)
- if (free_bits == 0) continue; (rare but costly)
- Cost: Branch misprediction = 10-20 cycles each
Optimization #1: Increase Mini-Magazine Capacity
Rationale: Avoid bitmap scan by keeping more items in mini-magazine
Current:
// Line 344: Mini-magazine capacity
uint16_t mag_capacity = (class_idx <= 3) ? 32 : 16;
Proposed:
// Increase capacity to reduce bitmap scan frequency
uint16_t mag_capacity = (class_idx <= 3) ? 64 : 32;
Benefits:
- Fewer bitmap scans (amortized over 64 items instead of 32)
- Better temporal locality (more items cached)
Costs:
- +256 bytes memory per slab (64 × 8 bytes pointers)
- Slightly higher refill cost (64 items vs 32)
Expected Speedup: 10-15% (reduce bitmap scan frequency by 50%)
Risk: LOW (simple parameter change, no logic changes)
Optimization #2: Cache-Aware Bitmap Layout
Rationale: Ensure bitmap fits in L1 cache for hot classes
Current:
// Separate bitmap allocation (may be cache-cold)
slab->bitmap = (uint64_t*)hkm_libc_calloc(bitmap_size, sizeof(uint64_t));
Proposed:
// Embed small bitmaps directly in slab structure
typedef struct TinySlab {
// ... existing fields ...
// Embedded bitmap for small classes (≤256 bytes)
union {
uint64_t* bitmap_ptr; // Large classes: heap-allocated
uint64_t bitmap_embed[32]; // Small classes: embedded (256 bytes)
};
uint8_t bitmap_embedded; // Flag: 1=embedded, 0=heap
} TinySlab;
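A small accessor would keep the rest of the scan code unchanged; this is a sketch against the proposed fields above (slab_bitmap is a hypothetical helper name):
// Returns the bitmap to scan, regardless of where it lives
static inline uint64_t* slab_bitmap(TinySlab* slab) {
    return slab->bitmap_embedded ? slab->bitmap_embed : slab->bitmap_ptr;
}
// e.g. in hak_tiny_find_free_block: uint64_t used = slab_bitmap(slab)[word_idx];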
Benefits:
- Classes 2-7 (32B-1024B): bitmap is ≤256 bytes and fits in the embedded array; classes 0-1 (1 KB / 512 B bitmaps) stay heap-allocated
- Embedded bitmap shares cache lines with the slab metadata (no extra pointer chase)
- No heap allocation for small classes
Expected Speedup: 5-10% (reduce cache misses on bitmap access)
Risk: MEDIUM (requires refactoring bitmap access logic)
Optimization #3: Lazy Summary Bitmap Update
Rationale: Summary bitmap update is expensive on free path
Current:
// Line 199-213: Summary update on every set_used/set_free
static inline void hak_tiny_set_used(TinySlab* slab, int block_idx) {
// ... bitmap update ...
// Update summary (EXPENSIVE)
int sum_word = word_idx / 64;
int sum_bit = word_idx % 64;
uint64_t has_free = ~v;
if (has_free != 0) {
slab->summary[sum_word] |= (1ULL << sum_bit); // WRITE
} else {
slab->summary[sum_word] &= ~(1ULL << sum_bit); // WRITE
}
}
Proposed:
// Lazy summary update (rebuild only when scanning)
static inline void hak_tiny_set_used(TinySlab* slab, int block_idx) {
// ... bitmap update ...
// NO SUMMARY UPDATE (deferred)
}
static inline int hak_tiny_find_free_block(TinySlab* slab) {
// Rebuild summary if stale (rare)
if (slab->summary_stale) {
rebuild_summary_bitmap(slab); // O(N) but rare
slab->summary_stale = 0;
}
// ... existing scan logic ...
}
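A minimal sketch of the deferred rebuild, assuming a summary_stale flag is added to TinySlab and using the same "word has a free bit" convention as the existing hak_tiny_set_used/hak_tiny_set_free. The free path then only sets slab->summary_stale = 1 instead of touching the summary.
static void rebuild_summary_bitmap(TinySlab* slab) {
    const int bw = g_tiny_bitmap_words[slab->class_idx];   // main bitmap words
    const int sw = slab->summary_words;                    // summary words
    for (int i = 0; i < sw; i++) slab->summary[i] = 0;
    for (int w = 0; w < bw; w++) {
        if (~slab->bitmap[w] != 0)                         // word still has at least one free bit
            slab->summary[w / 64] |= (1ULL << (w % 64));
    }
}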
Benefits:
- Eliminate summary update on 95% of operations (free path)
- Summary rebuild cost amortized over many allocations
Expected Speedup: 15-20% on free-heavy workloads
Risk: MEDIUM (requires careful stale bit management)
Bottleneck #3: Registry Lookup on Free Path
Location
- File: /home/tomoaki/git/hakmem/hakmem_tiny.c
- Lines: 1102-1118 (hak_tiny_free)
- Impact: MEDIUM (affects cross-slab frees, ~30-50% of frees)
Code Analysis
// Line 1102-1118: Free path with registry lookup
void hak_tiny_free(void* ptr) {
if (!ptr || !g_tiny_initialized) return;
// Line 1106-1111: SuperSlab fast path (disabled by default)
SuperSlab* ss = ptr_to_superslab(ptr); // BITWISE: ~2 cycles
if (ss && ss->magic == SUPERSLAB_MAGIC) { // LOAD + BRANCH: ~3 cycles
hak_tiny_free_superslab(ptr, ss); // FAST PATH: ~5 ns
return;
}
// Line 1114: Registry lookup (EXPENSIVE)
TinySlab* slab = hak_tiny_owner_slab(ptr); // LOOKUP: ~10-30 cycles
if (!slab) return;
hak_tiny_free_with_slab(ptr, slab); // FREE: ~50-200 cycles
}
// hakmem_tiny.c:395-440 - Registry lookup implementation
TinySlab* hak_tiny_owner_slab(void* ptr) {
if (!ptr || !g_tiny_initialized) return NULL;
if (g_use_registry) {
// O(1) hash table lookup
uintptr_t slab_base = (uintptr_t)ptr & ~(TINY_SLAB_SIZE - 1); // BITWISE: ~2 cycles
TinySlab* slab = registry_lookup(slab_base); // FUNCTION CALL: ~20-50 cycles
if (!slab) return NULL;
// Validation (bounds check)
uintptr_t start = (uintptr_t)slab->base;
uintptr_t end = start + TINY_SLAB_SIZE;
if ((uintptr_t)ptr < start || (uintptr_t)ptr >= end) {
return NULL; // False positive
}
return slab;
} else {
// O(N) linear search (fallback)
for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
pthread_mutex_lock(lock); // LOCK: ~30-100 cycles
// Search free slabs
for (TinySlab* slab = g_tiny_pool.free_slabs[class_idx]; slab; slab = slab->next) {
// ... bounds check ...
}
pthread_mutex_unlock(lock);
}
return NULL;
}
}
// Line 268-288: Registry lookup (hash table linear probe)
static TinySlab* registry_lookup(uintptr_t slab_base) {
int hash = registry_hash(slab_base); // HASH: ~5 cycles
for (int i = 0; i < SLAB_REGISTRY_MAX_PROBE; i++) { // Up to 8 probes
int idx = (hash + i) & SLAB_REGISTRY_MASK; // BITWISE: ~2 cycles
SlabRegistryEntry* entry = &g_slab_registry[idx]; // LOAD: ~2 cycles
if (entry->slab_base == slab_base) { // LOAD + BRANCH: ~3 cycles
TinySlab* owner = entry->owner; // LOAD: ~2 cycles
return owner;
}
if (entry->slab_base == 0) { // LOAD + BRANCH: ~2 cycles
return NULL; // Empty slot
}
}
return NULL;
}
// TYPICAL COST: 20-30 cycles (1-2 probes, cache hit)
// WORST CASE: 50-100 cycles (8 probes, cache miss on registry array)
Why It's Slow
- Hash Computation: the hash itself is simple
- static inline int registry_hash(uintptr_t slab_base) { return (slab_base >> 16) & SLAB_REGISTRY_MASK; }
- Shift + mask = 2 cycles (acceptable)
- BUT: Linear probing on collision adds 10-30 cycles
- Linear Probing: Up to 8 probes on collision
- Each probe: Load + compare + branch (3 cycles × 8 = 24 cycles worst case)
- Registry size: 1024 entries (8 KB array)
- Cost: May span multiple cache lines → cache miss (10-20 cycles penalty)
- Validation Overhead: Bounds check after lookup
- Load slab->base (2 cycles)
- Compute end address (1 cycle)
- Compare twice (2 cycles)
- Cost: 5 cycles per free (not amortized)
- Global Shared State: Registry is shared across all threads
- No cache-line alignment (false sharing risk)
- Lock-free reads → ABA problem potential
- Cost: Atomic load penalties (~5-10 cycles vs normal load)
Optimization #1: Enable SuperSlab by Default
Rationale: SuperSlab has O(1) pointer→slab via 2MB alignment (mimalloc-style)
Current:
// Line 81: SuperSlab disabled by default
static int g_use_superslab = 0; // Runtime toggle
Proposed:
// Enable SuperSlab by default
static int g_use_superslab = 1; // Always on
Benefits:
- Eliminate registry lookup entirely: ptr & ~0x1FFFFF (one AND operation)
- SuperSlab free path: ~5 ns (vs ~10-30 ns on the registry path)
- Better cache locality (2MB aligned pages)
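A minimal sketch of that mapping, relying on the 2 MB size/alignment and the magic-field check already shown in hak_tiny_free; the macro and function names here are assumptions, not the codebase's actual identifiers:
#define SUPERSLAB_SIZE (2UL * 1024 * 1024)   // 2 MB; SuperSlabs are size-aligned

static inline SuperSlab* ptr_to_superslab_masked(void* p) {
    return (SuperSlab*)((uintptr_t)p & ~(uintptr_t)(SUPERSLAB_SIZE - 1));
}
// Caller validates: if (ss->magic == SUPERSLAB_MAGIC) { /* fast free */ }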
Costs:
- 2MB address space per SuperSlab (not physical memory due to lazy allocation)
- Slightly higher memory overhead (metadata at SuperSlab level)
Expected Speedup: 20-30% on free-heavy workloads
Risk: LOW (SuperSlab already implemented and tested in Phase 6.23)
Optimization #2: Cache Last Freed Slab
Rationale: Temporal locality - next free likely from same slab
Proposed:
// Per-thread cache of last freed slab
static __thread TinySlab* t_last_freed_slab[TINY_NUM_CLASSES] = {NULL};
void hak_tiny_free(void* ptr) {
if (!ptr) return;
// Try cached slab first (likely hit)
int class_idx = guess_class_from_size(ptr); // Heuristic
TinySlab* slab = t_last_freed_slab[class_idx];
// Validate pointer is in this slab
if (slab && ptr_in_slab_range(ptr, slab)) {
hak_tiny_free_with_slab(ptr, slab); // FAST PATH: ~5 ns
return;
}
// Fallback to registry lookup (rare)
slab = hak_tiny_owner_slab(ptr);
if (slab) {
t_last_freed_slab[slab->class_idx] = slab; // Update cache
hak_tiny_free_with_slab(ptr, slab);
}
}
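ptr_in_slab_range above is a hypothetical helper; a sketch of the obvious implementation:
static inline int ptr_in_slab_range(void* ptr, TinySlab* slab) {
    uintptr_t start = (uintptr_t)slab->base;
    uintptr_t p     = (uintptr_t)ptr;
    return p >= start && p < start + TINY_SLAB_SIZE;   // two compares, no hashing
}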
Benefits:
- 80-90% cache hit rate (temporal locality)
- Fast path: 2 loads + 2 compares (~5 cycles) vs registry lookup (20-30 cycles)
Expected Speedup: 15-20% on free-heavy workloads
Risk: MEDIUM (requires heuristic for class_idx guessing, may mispredict)
Bottleneck #4: Statistics Collection Overhead
Location
- File: /home/tomoaki/git/hakmem/hakmem_tiny_stats.h
- Lines: 59-73 (stats_record_alloc, stats_record_free)
- Impact: LOW (already optimized with TLS batching, but still ~0.5 ns per op)
Code Analysis
// Line 59-62: Allocation statistics (inline)
static inline void stats_record_alloc(int class_idx) __attribute__((always_inline));
static inline void stats_record_alloc(int class_idx) {
t_alloc_batch[class_idx]++; // TLS INCREMENT: ~0.5-1 cycle
}
// Line 70-73: Free statistics (inline)
static inline void stats_record_free(int class_idx) __attribute__((always_inline));
static inline void stats_record_free(int class_idx) {
t_free_batch[class_idx]++; // TLS INCREMENT: ~0.5-1 cycle
}
Why It's (Slightly) Slow
- TLS Access Overhead: Even TLS has a cost
- TLS base register: %fs on x86-64 (implicit)
- Offset calculation: [%fs + class_idx*4]
- Cost: ~0.5 cycles (not zero!)
- Cache Line Pollution: TLS counters compete for L1 cache
- t_alloc_batch[8] = 32 bytes, t_free_batch[8] = 32 bytes
- Cost: 64 bytes of L1 cache (1 cache line)
- Compiler Optimization Barriers: always_inline cuts both ways
- Forces inlining (good)
- But prevents the compiler from hoisting the increment out of hot loops (bad)
- Cost: increment executed inside the hot loop vs once outside
Optimization: Compile-Time Statistics Toggle
Rationale: Production builds don't need exact counts
Proposed:
#ifdef HAKMEM_ENABLE_STATS
#define STATS_RECORD_ALLOC(cls) t_alloc_batch[cls]++
#define STATS_RECORD_FREE(cls) t_free_batch[cls]++
#else
#define STATS_RECORD_ALLOC(cls) ((void)0)
#define STATS_RECORD_FREE(cls) ((void)0)
#endif
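Call sites in the fast path would switch from the inline helpers to the macros, e.g. (sketch; debug builds define HAKMEM_ENABLE_STATS to keep the counters):
void* p = mag->items[--mag->top].ptr;
STATS_RECORD_ALLOC(class_idx);   // ((void)0) unless HAKMEM_ENABLE_STATS is defined
return p;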
Benefits:
- Zero overhead when stats disabled
- Compiler can optimize away dead code
Expected Speedup: 3-5% (small but measurable)
Risk: VERY LOW (compile-time flag, no runtime impact)
Bottleneck #5: Magazine Spill/Refill Lock Contention
Location
- File: /home/tomoaki/git/hakmem/hakmem_tiny.c
- Lines: 880-939 (magazine spill under class lock)
- Impact: MEDIUM (affects 5-10% of frees when magazine is full)
Code Analysis
// Line 880-939: Magazine spill (class lock held)
if (mag->top < cap) {
// Fast path: push to magazine (no lock)
mag->items[mag->top].ptr = ptr;
mag->top++;
return;
}
// Spill half under class lock
pthread_mutex_t* lock = &g_tiny_class_locks[class_idx].m;
pthread_mutex_lock(lock); // LOCK: ~30-100 cycles (contended)
int spill = cap / 2; // Spill 1024 items (for 2048 cap)
for (int i = 0; i < spill && mag->top > 0; i++) {
TinyMagItem it = mag->items[--mag->top];
TinySlab* owner = hak_tiny_owner_slab(it.ptr); // LOOKUP: ~20-30 cycles × 1024
if (!owner) continue;
// Phase 4.1: Try mini-magazine push (avoid bitmap)
if ((owner == tls_a || owner == tls_b) && !mini_mag_is_full(&owner->mini_mag)) {
mini_mag_push(&owner->mini_mag, it.ptr); // FAST: ~4 cycles
continue;
}
// Slow path: bitmap update
size_t bs = g_tiny_class_sizes[owner->class_idx];
int idx = ((uintptr_t)it.ptr - (uintptr_t)owner->base) / bs; // DIV: ~10 cycles
if (hak_tiny_is_used(owner, idx)) {
hak_tiny_set_free(owner, idx); // BITMAP: ~10 cycles
owner->free_count++;
// ... list management ...
}
}
pthread_mutex_unlock(lock);
// TOTAL SPILL COST: ~50,000-100,000 cycles (1024 items × 50-100 cycles/item) ≈ 17-33 µs @ 3 GHz
// Amortized: ~17-33 ns per free (when a spill happens every ~1000 frees)
Why It's Slow
- Lock Hold Time: Lock held for the entire spill (1024 items)
- Blocks other threads from accessing the class lock
- A spill takes ~17-33 µs → other threads stalled
- Cost: Contention penalty on multi-threaded workloads
- Registry Lookup in Loop: 1024 lookups under the lock
- hak_tiny_owner_slab(it.ptr) called 1024 times
- Each lookup: 20-30 cycles
- Cost: 20,000-30,000 cycles just for lookups
- Division in Hot Loop: Block index calculation uses division
- int idx = ((uintptr_t)it.ptr - (uintptr_t)owner->base) / bs;
- Division is ~10 cycles on modern CPUs (not fully pipelined)
- Cost: 10,000 cycles for 1024 divisions
- Large Spill Batch: 1024 items is too large
- Amortizes lock cost well (good)
- But increases lock hold time (bad)
- Trade-off not optimized
Optimization #1: Reduce Spill Batch Size
Rationale: Smaller batches = shorter lock hold time = less contention
Current:
int spill = cap / 2; // 1024 items for 2048 cap
Proposed:
int spill = 128; // Fixed batch size (not cap-dependent)
Benefits:
- Shorter lock hold time: ~2-4 µs (vs ~17-33 µs)
- Better multi-thread responsiveness
Costs:
- More frequent spills (8× more frequent)
- Slightly higher total lock overhead
Expected Speedup: 10-15% on multi-threaded workloads
Risk: LOW (simple parameter change)
Optimization #2: Lock-Free Spill Stack
Rationale: Avoid lock entirely for spill path
Proposed:
// Per-class global spill stack (lock-free MPSC)
static atomic_uintptr_t g_spill_stack[TINY_NUM_CLASSES];
void magazine_spill_lockfree(int class_idx, void* ptr) {
    // Push onto the lock-free stack (Treiber push); needs <stdatomic.h>
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&g_spill_stack[class_idx], memory_order_acquire);
        *((uintptr_t*)ptr) = old_head;   // Intrusive next-pointer stored in the freed block
    } while (!atomic_compare_exchange_weak_explicit(&g_spill_stack[class_idx], &old_head,
                                                    (uintptr_t)ptr,
                                                    memory_order_release, memory_order_relaxed));
}
// Background thread drains spill stack periodically
void background_drain_spill_stack(void) {
for (int i = 0; i < TINY_NUM_CLASSES; i++) {
uintptr_t head = atomic_exchange_explicit(&g_spill_stack[i], 0, memory_order_acq_rel);
if (!head) continue;
pthread_mutex_lock(&g_tiny_class_locks[i].m);
// ... drain to bitmap ...
pthread_mutex_unlock(&g_tiny_class_locks[i].m);
}
}
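If a dedicated background thread is undesirable, the same drain can be run opportunistically from the existing global-lock refill path while the class lock is already held. A sketch of a per-class variant (the call site and name are assumptions):
static void drain_spill_stack_class_locked(int class_idx) {
    uintptr_t head = atomic_exchange_explicit(&g_spill_stack[class_idx], 0,
                                              memory_order_acq_rel);
    while (head) {
        void* p = (void*)head;
        head = *(uintptr_t*)p;   // next pointer stored in the block itself
        // ... return p to its owner slab's mini-mag or bitmap, as in the locked spill ...
    }
}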
Benefits:
- Zero lock contention on spill path
- Fast atomic CAS (~5-10 cycles)
Costs:
- Requires background thread or periodic drain
- Slightly more complex memory management
Expected Speedup: 20-30% on multi-threaded workloads
Risk: HIGH (requires careful design of background drain mechanism)
Bottleneck #6: Branch Misprediction in Size Class Lookup
Location
- File: /home/tomoaki/git/hakmem/hakmem_tiny.h
- Lines: 159-182 (hak_tiny_size_to_class)
- Impact: LOW (only 1-2 ns per allocation, but called on every allocation)
Code Analysis
// Line 159-182: Size to class lookup (branch chain)
static inline int hak_tiny_size_to_class(size_t size) {
if (size == 0 || size > TINY_MAX_SIZE) return -1; // BRANCH: ~1 cycle
// Branch chain (8 branches for 8 classes)
if (size <= 8) return 0; // BRANCH: ~1 cycle
if (size <= 16) return 1; // BRANCH: ~1 cycle
if (size <= 32) return 2; // BRANCH: ~1 cycle
if (size <= 64) return 3; // BRANCH: ~1 cycle
if (size <= 128) return 4; // BRANCH: ~1 cycle
if (size <= 256) return 5; // BRANCH: ~1 cycle
if (size <= 512) return 6; // BRANCH: ~1 cycle
return 7; // size <= 1024
}
// TYPICAL COST: 3-5 cycles (3-4 branches taken)
// WORST CASE: 8 cycles (all branches checked)
Why It's (Slightly) Slow
- Unpredictable Size Distribution: Branch predictor can't learn the pattern
- Real-world allocation sizes are quasi-random
- Size 16 most common (33%), but others vary
- Cost: ~20-30% branch misprediction rate (~10 cycles penalty)
- Sequential Dependency: Each branch depends on the previous one
- CPU can't parallelize branch evaluation
- Must evaluate branches in order
- Cost: No instruction-level parallelism (ILP)
Optimization: Branchless Lookup Table
Rationale: Use CLZ (count leading zeros) for O(1) class lookup
Proposed:
// Lookup table for size → class (branchless)
static const uint8_t g_size_to_class_table[129] = {
    0,                                              // index 0 unused (size 0 rejected earlier)
    0, 0, 0, 0, 0, 0, 0, 0,                         // size 1-8    → class 0
    1, 1, 1, 1, 1, 1, 1, 1,                         // size 9-16   → class 1
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // size 17-32  → class 2
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // size 33-64  → class 3
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, // size 65-128 → class 4
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
};
static inline int hak_tiny_size_to_class(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return -1;
    // Fast path: direct table lookup for small sizes (valid for 1-128)
    if (size <= 128) {
        return g_size_to_class_table[size];                    // LOAD: ~2 cycles (L1 cache)
    }
    // Slow path: CLZ-based for the remaining power-of-2 classes
    // size 129-256 → class 5, 257-512 → class 6, 513-1024 → class 7
    int clz = __builtin_clzll((unsigned long long)(size - 1)); // CLZ: ~3 cycles
    return 61 - clz;   // 61 - clz == ceil(log2(size)) - 3 for these ranges
}
// TYPICAL COST: 2-3 cycles (table lookup, no branches)
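Since this replaces a correctness-critical mapping, a quick self-check (sketch) can compare the table/CLZ version against the original branch chain over the whole range. The reference function assumes TINY_MAX_SIZE == 1024, as in the branch chain above.
#include <assert.h>

static int size_to_class_ref(size_t size) {        // original branch chain, kept as reference
    if (size == 0 || size > 1024) return -1;
    if (size <= 8) return 0;   if (size <= 16) return 1;
    if (size <= 32) return 2;  if (size <= 64) return 3;
    if (size <= 128) return 4; if (size <= 256) return 5;
    if (size <= 512) return 6; return 7;
}

static void check_size_to_class(void) {
    for (size_t s = 0; s <= 1025; s++)
        assert(hak_tiny_size_to_class(s) == size_to_class_ref(s));
}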
Benefits:
- Branchless for common sizes (8-128B covers 80%+ of allocations)
- Table fits comfortably in L1 cache (~128 bytes, 2-3 cache lines)
- Predictable performance (no branch misprediction)
Expected Speedup: 2-3% (reduce 5 cycles to 2-3 cycles)
Risk: VERY LOW (table is static, no runtime overhead)
Bottleneck #7: Remote Free Drain Overhead
Location
- File: /home/tomoaki/git/hakmem/hakmem_tiny.c
- Lines: 146-184 (tiny_remote_drain_locked)
- Impact: LOW (only affects cross-thread frees, ~10-20% of workloads)
Code Analysis
// Line 146-184: Remote free drain (under class lock)
static void tiny_remote_drain_locked(TinySlab* slab) {
uintptr_t head = atomic_exchange(&slab->remote_head, NULL, memory_order_acq_rel); // ATOMIC: ~10 cycles
unsigned drained = 0;
while (head) { // LOOP: variable iterations
void* p = (void*)head;
head = *((uintptr_t*)p); // LOAD NEXT: ~2 cycles
// Calculate block index
size_t block_size = g_tiny_class_sizes[slab->class_idx]; // LOAD: ~2 cycles
uintptr_t offset = (uintptr_t)p - (uintptr_t)slab->base; // SUBTRACT: ~1 cycle
int block_idx = offset / block_size; // DIVIDE: ~10 cycles
// Skip if already free (idempotent)
if (!hak_tiny_is_used(slab, block_idx)) continue; // BITMAP CHECK: ~5 cycles
hak_tiny_set_free(slab, block_idx); // BITMAP UPDATE: ~10 cycles
int was_full = (slab->free_count == 0); // LOAD: ~1 cycle
slab->free_count++; // INCREMENT: ~1 cycle
if (was_full) {
move_to_free_list(slab->class_idx, slab); // LIST UPDATE: ~20-50 cycles (rare)
}
if (slab->free_count == slab->total_count) {
// ... slab release logic ... (rare)
release_slab(slab); // EXPENSIVE: ~1000 cycles (very rare)
break;
}
g_tiny_pool.free_count[slab->class_idx]++; // GLOBAL INCREMENT: ~1 cycle
drained++;
}
if (drained) atomic_fetch_sub(&slab->remote_count, drained, memory_order_relaxed); // ATOMIC: ~10 cycles
}
// TYPICAL COST: 50-100 cycles per drained block (moderate)
// WORST CASE: 1000+ cycles (slab release)
Why It's Slow
- Division in Loop: Block index calculation uses division
- int block_idx = offset / block_size;
- Division is ~10 cycles (even on modern CPUs)
- Cost: 10 cycles × N remote frees
- Atomic Operations: 2 atomic ops per drain (exchange + fetch_sub)
- atomic_exchange at the start (~10 cycles), atomic_fetch_sub at the end (~10 cycles)
- Cost: 20 cycles overhead (not per-block, but still expensive)
- Bitmap Update: Same as on the allocation path
- hak_tiny_set_free updates both the bitmap and the summary
- Cost: 10 cycles per block
Optimization: Multiplication-Based Division
Rationale: Replace division with multiplication by reciprocal
Current:
int block_idx = offset / block_size; // DIVIDE: ~10 cycles
Proposed:
// Pre-computed reciprocals (magic constants)
static const uint64_t g_tiny_block_reciprocals[TINY_NUM_CLASSES] = {
// Computed as: (1ULL << 48) / block_size
// Allows: block_idx = (offset * reciprocal) >> 48
    (1ULL << 48) / 8,      // Class 0: 8B
    (1ULL << 48) / 16,     // Class 1: 16B
    (1ULL << 48) / 32,     // Class 2: 32B
    (1ULL << 48) / 64,     // Class 3: 64B
    (1ULL << 48) / 128,    // Class 4: 128B
    (1ULL << 48) / 256,    // Class 5: 256B
    (1ULL << 48) / 512,    // Class 6: 512B
    (1ULL << 48) / 1024,   // Class 7: 1024B
};
// Fast division using multiplication
int block_idx = (offset * g_tiny_block_reciprocals[slab->class_idx]) >> 48; // MUL + SHIFT: ~3 cycles
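Because every block size here is a power of two, (1 << 48) / block_size is exact and the mul+shift reproduces the division with no rounding error. A quick self-check (sketch), reusing g_tiny_class_sizes and TINY_SLAB_SIZE from the analysis above; for 64 KB slabs the 64-bit product stays well below overflow:
#include <assert.h>

static void check_block_reciprocals(void) {
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        uint64_t bs = g_tiny_class_sizes[c];
        for (uint64_t off = 0; off < TINY_SLAB_SIZE; off += bs) {
            uint64_t fast = (off * g_tiny_block_reciprocals[c]) >> 48;
            assert(fast == off / bs);   // must match the plain division
        }
    }
}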
Benefits:
- Reduce 10 cycles to 3 cycles per division
- Saves 7 cycles per remote free
Expected Speedup: 5-10% on cross-thread workloads
Risk: VERY LOW (well-known compiler optimization, manually applied)
Profiling Plan
perf Commands to Run
# 1. CPU cycle breakdown (identify hotspots)
perf record -e cycles:u -g ./bench_comprehensive
perf report --stdio --no-children | head -100 > perf_cycles.txt
# 2. Cache miss analysis (L1d, L1i, LLC)
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,\
L1-icache-loads,L1-icache-load-misses,LLC-loads,LLC-load-misses \
./bench_comprehensive
# 3. Branch misprediction rate
perf stat -e cycles,instructions,branches,branch-misses \
./bench_comprehensive
# 4. TLB miss analysis (address translation overhead)
perf stat -e cycles,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
./bench_comprehensive
# 5. Function-level profiling (annotated source)
perf record -e cycles:u --call-graph dwarf ./bench_comprehensive
perf report --stdio --sort symbol --percent-limit 1
# 6. Memory bandwidth utilization
perf stat -e cycles,mem_load_retired.l1_hit,mem_load_retired.l1_miss,\
mem_load_retired.l2_hit,mem_load_retired.l3_hit,mem_load_retired.l3_miss \
./bench_comprehensive
# 7. Allocation-specific hotspots (focus on hak_tiny_alloc)
perf record -e cycles:u --call-graph dwarf ./bench_comprehensive
perf report --stdio | grep "hak_tiny"
Expected Hotspots to Validate
Based on code analysis, we expect to see:
- hak_tiny_find_free_block (15-25% of cycles)
- Two-tier bitmap scan
- CTZ operations
- Cache misses on large bitmaps
- hak_tiny_set_used / hak_tiny_set_free (10-15% of cycles)
- Bitmap updates
- Summary bitmap updates
- Write-heavy (cache line bouncing)
- hak_tiny_owner_slab (10-20% of cycles on free path)
- Registry lookup
- Hash computation
- Linear probing
- tiny_mag_init_if_needed (5-10% of cycles)
- TLS access
- Conditional initialization
- stats_record_alloc / stats_record_free (3-5% of cycles)
- TLS counter increments
- Cache line pollution
Validation Criteria
Cache Miss Rates:
- L1d miss rate: < 5% (good), 5-10% (acceptable), > 10% (poor)
- LLC miss rate: < 1% (good), 1-3% (acceptable), > 3% (poor)
Branch Misprediction:
- Misprediction rate: < 2% (good), 2-5% (acceptable), > 5% (poor)
- Expected: 3-4% (due to unpredictable size classes)
IPC (Instructions Per Cycle):
- IPC: > 2.0 (good), 1.5-2.0 (acceptable), < 1.5 (poor)
- Expected: 1.5-1.8 (memory-bound, not compute-bound)
Function Time Distribution:
- hak_tiny_alloc: 40-60% (hot path)
- hak_tiny_free: 20-30% (warm path)
- hak_tiny_find_free_block: 10-20% (expensive when hit)
- Other: < 10%
Optimization Roadmap
Quick Wins (< 1 hour, Low Risk)
- Enable SuperSlab by Default (Bottleneck #3)
- Change: g_use_superslab = 1;
- Risk: VERY LOW (already implemented)
- Effort: 5 minutes
- Change:
- Disable Statistics in Production (Bottleneck #4)
- Change: Add #ifdef HAKMEM_ENABLE_STATS guards
- Risk: VERY LOW (compile-time flag)
- Effort: 15 minutes
- Change: Add
- Increase Mini-Magazine Capacity (Bottleneck #2)
- Change: mag_capacity = 64 (was 32)
- Risk: LOW (slight memory increase)
- Effort: 5 minutes
- Change:
- Branchless Size Class Lookup (Bottleneck #6)
- Change: Use lookup table for common sizes
- Impact: 2-3% speedup
- Risk: VERY LOW (static table)
- Effort: 30 minutes
Total Expected Speedup: 35-53% (conservative: 1.4-1.5×)
Medium Effort (1-4 hours, Medium Risk)
- Unified TLS Cache Structure (Bottleneck #1)
- Change: Merge TLS arrays into single cache-aligned struct
- Impact: 30-40% speedup on fast path
- Risk: MEDIUM (requires refactoring)
- Effort: 3-4 hours
- Reduce Magazine Spill Batch (Bottleneck #5)
- Change: spill = 128 (was 1024)
- Risk: LOW (parameter tuning)
- Effort: 30 minutes
- Change:
- Cache-Aware Bitmap Layout (Bottleneck #2)
- Change: Embed small bitmaps in slab structure
- Impact: 5-10% speedup
- Risk: MEDIUM (requires struct changes)
- Effort: 2-3 hours
- Multiplication-Based Division (Bottleneck #7)
- Change: Replace division with mul+shift
- Impact: 5-10% speedup on remote frees
- Risk: VERY LOW (well-known optimization)
- Effort: 1 hour
Total Expected Speedup: 50-75% (conservative: 1.5-1.8×)
Major Refactors (> 4 hours, High Risk)
- Lock-Free Spill Stack (Bottleneck #5)
- Change: Use atomic MPSC queue for magazine spill
- Impact: 20-30% speedup on multi-threaded
- Risk: HIGH (complex concurrency)
- Effort: 8-12 hours
- Lazy Summary Bitmap Update (Bottleneck #2)
- Change: Rebuild summary only when scanning
- Impact: 15-20% speedup on free-heavy workloads
- Risk: MEDIUM (requires careful staleness tracking)
- Effort: 4-6 hours
- Collapse TLS Magazine Tiers (Bottleneck #1)
- Change: Merge magazine + mini-mag into single LIFO
- Impact: 40-50% speedup (eliminate tier overhead)
- Risk: HIGH (major architectural change)
- Effort: 12-16 hours
- Full mimalloc-Style Rewrite (All Bottlenecks)
- Change: Replace bitmap with intrusive free-list
- Impact: 5-9× speedup (match mimalloc)
- Risk: VERY HIGH (complete redesign)
- Effort: 40+ hours
Total Expected Speedup: 75-150% (optimistic: 1.8-2.5×)
Risk Assessment Summary
Low Risk Optimizations (Safe to implement immediately)
- SuperSlab enable
- Statistics compile-time toggle
- Mini-mag capacity increase
- Branchless size lookup
- Multiplication division
- Magazine spill batch reduction
Expected: 1.4-1.6× speedup, 2-3 hours effort
Medium Risk Optimizations (Test thoroughly)
- Unified TLS cache structure
- Cache-aware bitmap layout
- Lazy summary update
Expected: 1.6-2.0× speedup, 6-10 hours effort
High Risk Optimizations (Prototype first)
- Lock-free spill stack
- Magazine tier collapse
- Full mimalloc rewrite
Expected: 2.0-9.0× speedup, 20-60 hours effort
Estimated Speedup Summary
Conservative Target (Low + Medium optimizations)
- Random pattern: 68 M ops/sec → 140 M ops/sec (2.0× speedup)
- LIFO pattern: 102 M ops/sec → 200 M ops/sec (2.0× speedup)
- Gap to mimalloc: 2.6× → 1.3× (close 50% of gap)
Optimistic Target (All optimizations)
- Random pattern: 68 M ops/sec → 170 M ops/sec (2.5× speedup)
- LIFO pattern: 102 M ops/sec → 450 M ops/sec (4.4× speedup)
- Gap to mimalloc: 2.6× → 1.0× (match on random, 2× on LIFO)
Conclusion
The hakmem allocator's 2.6× gap to mimalloc on favorable patterns (random free) is primarily due to:
- Architectural overhead: 6-tier allocation hierarchy vs mimalloc's 3-tier
- Bitmap traversal cost: Two-tier scan adds 15-20 cycles even when optimized
- Registry lookup overhead: Hash table lookup adds 20-30 cycles on free path
Quick wins (1-3 hours effort) can achieve 1.4-1.6× speedup. Medium effort (10 hours) can achieve 1.8-2.0× speedup. Full mimalloc-style rewrite (40+ hours) needed to match mimalloc's 1.1 ns/op.
Recommendation: Implement quick wins first (SuperSlab + stats disable + branchless lookup), measure results with perf, then decide if medium-effort optimizations are worth the complexity increase.