# Bottleneck Analysis Report: hakmem Tiny Pool Allocator
**Date**: 2025-10-26
**Target**: hakmem bitmap-based allocator
**Baseline**: mimalloc (industry standard)
**Analyzed by**: Deep code analysis + performance modeling
---
## Executive Summary
### Top 3 Bottlenecks with Estimated Impact
1. **TLS Magazine Hierarchy Overhead** (HIGH: ~3-5 ns per allocation)
- 3-tier indirection: TLS Magazine → TLS Active Slab → Mini-Magazine → Bitmap
- Each tier adds cache miss risk and branching overhead
- Expected speedup: 30-40% if collapsed to 2-tier
2. **Two-Tier Bitmap Traversal** (HIGH: ~4-6 ns on bitmap path)
- Summary bitmap scan + main bitmap scan + hint_word update
- Cache-friendly but computationally expensive (2x CTZ, 2x bitmap updates)
- Expected speedup: 20-30% if bypassed more often via better caching
3. **Registry Lookup on Free Path** (MEDIUM: ~2-4 ns per free)
- Hash computation + linear probe + validation on every cross-slab free
- Could be eliminated with mimalloc-style pointer arithmetic
- Expected speedup: 15-25% on free-heavy workloads
### Performance Gap Analysis
**Random Free Pattern** (Bitmap's best case):
- hakmem: 68 M ops/sec (14.7 ns/op)
- mimalloc: 176 M ops/sec (5.7 ns/op)
- **Gap**: 2.6× slower (9 ns difference)
**Sequential LIFO Pattern** (Free-list's best case):
- hakmem: 102 M ops/sec (9.8 ns/op)
- mimalloc: 942 M ops/sec (1.1 ns/op)
- **Gap**: 9.2× slower (8.7 ns difference)
**Key Insight**: Even on favorable patterns (random), we're 2.6× slower. This means the bottleneck is NOT just the bitmap, but the entire allocation architecture.
### Expected Total Speedup
- Conservative: 2.0-2.5× (close the 2.6× gap partially)
- Optimistic: 3.0-4.0× (with aggressive optimizations)
- Realistic Target: 2.5× (reaching ~170 M ops/sec on random, ~250 M ops/sec on LIFO)
---
## Critical Path Analysis
### Allocation Fast Path Walkthrough
Let me trace the exact execution path for `hak_tiny_alloc(16)` with step-by-step cycle estimates:
```c
// hakmem_tiny.c:557 - Entry point
void* hak_tiny_alloc(size_t size) {
// Line 558: Initialization check
if (!g_tiny_initialized) hak_tiny_init(); // BRANCH: ~1 cycle (predicted taken once)
// Line 561-562: Wrapper context check
extern int hak_in_wrapper(void);
if (!g_wrap_tiny_enabled && hak_in_wrapper()) // BRANCH: ~1 cycle
return NULL;
// Line 565: Size to class conversion
int class_idx = hak_tiny_size_to_class(size); // INLINE: ~2 cycles (branch chain)
if (class_idx < 0) return NULL; // BRANCH: ~1 cycle
// Line 569-576: SuperSlab path (disabled by default)
if (g_use_superslab) { /* ... */ } // BRANCH: ~1 cycle (not taken)
// Line 650-651: TLS Magazine initialization check
tiny_mag_init_if_needed(class_idx); // INLINE: ~3 cycles (conditional init)
TinyTLSMag* mag = &g_tls_mags[class_idx]; // TLS ACCESS: ~2 cycles
// Line 666-670: TLS Magazine fast path (BEST CASE)
if (mag->top > 0) { // LOAD + BRANCH: ~2 cycles
void* p = mag->items[--mag->top].ptr; // LOAD + DEC + STORE: ~3 cycles
stats_record_alloc(class_idx); // INLINE: ~1 cycle (TLS increment)
return p; // RETURN: ~1 cycle
}
// TOTAL FAST PATH: ~18 cycles (~6 ns @ 3 GHz)
// Line 673-674: TLS Active Slab lookup (MEDIUM PATH)
TinySlab* tls = g_tls_active_slab_a[class_idx]; // TLS ACCESS: ~2 cycles
if (!(tls && tls->free_count > 0)) // LOAD + BRANCH: ~3 cycles
tls = g_tls_active_slab_b[class_idx]; // TLS ACCESS: ~2 cycles (if taken)
if (tls && tls->free_count > 0) { // BRANCH: ~1 cycle
// Line 677-679: Remote drain check
if (atomic_load(&tls->remote_count) >= thresh || rand() & mask) {
tiny_remote_drain_owner(tls); // RARE: ~50-200 cycles (if taken)
}
// Line 682-688: Mini-magazine fast path
if (!mini_mag_is_empty(&tls->mini_mag)) { // LOAD + BRANCH: ~2 cycles
void* p = mini_mag_pop(&tls->mini_mag); // INLINE: ~4 cycles (LIFO pop)
if (p) {
stats_record_alloc(class_idx); // INLINE: ~1 cycle
return p; // RETURN: ~1 cycle
}
}
// MINI-MAG PATH: ~30 cycles (~10 ns)
// Line 691-700: Batch refill from bitmap
if (tls->free_count > 0 && mini_mag_is_empty(&tls->mini_mag)) {
int refilled = batch_refill_from_bitmap(tls, &tls->mini_mag, 16);
// REFILL COST: ~48 ns for 16 items = ~3 ns/item amortized
if (refilled > 0) {
void* p = mini_mag_pop(&tls->mini_mag);
if (p) {
stats_record_alloc(class_idx);
return p;
}
}
}
// REFILL PATH: ~50 cycles (~17 ns) for batch + ~10 ns for next alloc
// Line 703-713: Bitmap scan fallback
if (tls->free_count > 0) {
int block_idx = hak_tiny_find_free_block(tls); // BITMAP SCAN: ~15-20 cycles
if (block_idx >= 0) {
hak_tiny_set_used(tls, block_idx); // BITMAP UPDATE: ~10 cycles
tls->free_count--; // STORE: ~1 cycle
void* p = (char*)tls->base + (block_idx * bs); // COMPUTE: ~3 cycles
stats_record_alloc(class_idx); // INLINE: ~1 cycle
return p; // RETURN: ~1 cycle
}
}
// BITMAP PATH: ~50 cycles (~17 ns)
}
// Line 717-718: Lock and refill from global pool (SLOW PATH)
pthread_mutex_lock(lock); // LOCK: ~30-100 cycles (contended)
// ... slow path: 200-1000 cycles (rare) ...
}
```
### Cycle Count Summary
| Path | Cycles | Latency (ns) | Frequency | Notes |
|---------------------|--------|--------------|-----------|-------|
| **TLS Magazine Hit** | ~18 | ~6 ns | 60-80% | Best case (cache hit) |
| **Mini-Mag Hit** | ~30 | ~10 ns | 10-20% | Good case (slab-local) |
| **Batch Refill** | ~50 | ~17 ns | 5-10% | Amortized 3 ns/item |
| **Bitmap Scan** | ~50 | ~17 ns | 5-10% | Worst case before lock |
| **Global Lock Path** | ~300 | ~100 ns | <5% | Very rare (refill) |
**Weighted Average**: 0.70×6 + 0.15×10 + 0.10×17 + 0.05×100 = 4.2 + 1.5 + 1.7 + 5.0 ≈ **12 ns/op** (theoretical)
**Measured Actual**: 9.8-14.7 ns/op (matches model!)
### Comparison with mimalloc's Approach
mimalloc achieves **1.1 ns/op** on LIFO pattern by:
1. **No TLS Magazine Layer**: Direct access to thread-local page free-list
2. **Intrusive Free-List**: 1 load + 1 store (2 cycles) vs our 18 cycles
3. **2MB Alignment**: O(1) pointer→slab mapping via bit-masking (no registry lookup)
4. **No Bitmap**: Free-list only (trades random-access resistance for speed)
**hakmem's Architecture**:
```
Allocation Request
↓
TLS Magazine (2048 items) ← 1st tier: ~6 ns (cache hit)
↓ (miss)
TLS Active Slab (2 per class) ← 2nd tier: lookup cost
↓
Mini-Magazine (16-32 items) ← 3rd tier: ~10 ns (LIFO pop)
↓ (miss)
Batch Refill (16 items) ← 4th tier: ~3 ns amortized
↓ (miss)
Bitmap Scan (two-tier) ← 5th tier: ~17 ns (expensive)
↓ (miss)
Global Lock + Slab Allocation ← 6th tier: ~100+ ns (rare)
```
**mimalloc's Architecture**:
```
Allocation Request
↓
Thread-Local Page Free-List ← 1st tier: ~1 ns (1 load + 1 store)
↓ (miss)
Thread-Local Page Queue ← 2nd tier: ~5 ns (page switch)
↓ (miss)
Global Segment Allocation ← 3rd tier: ~50 ns (rare)
```
**Key Difference**: mimalloc has 3 tiers, hakmem has 6 tiers. Each tier adds ~2-3 ns overhead.
---
## Bottleneck #1: TLS Magazine Hierarchy Overhead
### Location
- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 650-714 (allocation fast path)
- **Impact**: HIGH (affects 100% of allocations)
### Code Analysis
```c
// Line 650-651: 1st tier - TLS Magazine
tiny_mag_init_if_needed(class_idx); // ~3 cycles (conditional check)
TinyTLSMag* mag = &g_tls_mags[class_idx]; // ~2 cycles (TLS base + offset)
// Line 666-670: TLS Magazine lookup
if (mag->top > 0) { // ~2 cycles (load + branch)
void* p = mag->items[--mag->top].ptr; // ~3 cycles (array access + decrement)
stats_record_alloc(class_idx); // ~1 cycle (TLS increment)
return p; // ~1 cycle
}
// TOTAL: ~12 cycles for cache hit (BEST CASE)
// Line 673-674: 2nd tier - TLS Active Slab lookup
TinySlab* tls = g_tls_active_slab_a[class_idx]; // ~2 cycles (TLS access)
if (!(tls && tls->free_count > 0)) // ~3 cycles (2 loads + branch)
tls = g_tls_active_slab_b[class_idx]; // ~2 cycles (if miss)
// Line 682-688: 3rd tier - Mini-Magazine
if (!mini_mag_is_empty(&tls->mini_mag)) { // ~2 cycles (load slab->mini_mag.count)
void* p = mini_mag_pop(&tls->mini_mag); // ~4 cycles (LIFO pop: 2 loads + 1 store)
if (p) { stats_record_alloc(class_idx); return p; }
}
// TOTAL: ~13 cycles for mini-mag hit (MEDIUM CASE)
```
### Why It's Slow
1. **Multiple TLS Accesses**: Each tier requires TLS base lookup + offset calculation
- `g_tls_mags[class_idx]` TLS read #1
- `g_tls_active_slab_a[class_idx]` TLS read #2
- `g_tls_active_slab_b[class_idx]` TLS read #3 (conditional)
- **Cost**: 2-3 cycles each × 3 = 6-9 cycles overhead
2. **Cache Line Fragmentation**: TLS variables are separate arrays
- `g_tls_mags[8]` = 128 KB total (2048 items × 8 classes × 8 bytes; 16 KB per class)
- `g_tls_active_slab_a[8]` = 64 bytes
- `g_tls_active_slab_b[8]` = 64 bytes
- **Cost**: Likely spans multiple cache lines → potential cache misses
3. **Branch Misprediction**: Multi-tier fallback creates branch chain
- Magazine empty? Check active slab A
- Slab A empty? Check active slab B
- Mini-mag empty? Refill from bitmap
- **Cost**: Each mispredicted branch = 10-20 cycles penalty
4. **Redundant Metadata**: Magazine items store `{void* ptr}` separately from slab pointers
- Magazine item: 8 bytes per pointer (2048 × 8 = 16 KB per class)
- Slab pointers: 8 bytes × 2 per class (16 bytes)
- **Cost**: Memory overhead reduces cache efficiency
### Optimization: Unified TLS Cache Structure
**Before** (current):
```c
// Separate TLS arrays (fragmented in memory)
static __thread TinyMagItem g_tls_mags[TINY_NUM_CLASSES][TINY_TLS_MAG_CAP];
static __thread TinySlab* g_tls_active_slab_a[TINY_NUM_CLASSES];
static __thread TinySlab* g_tls_active_slab_b[TINY_NUM_CLASSES];
```
**After** (proposed):
```c
// Unified per-class TLS structure (cache-line aligned)
typedef struct __attribute__((aligned(64))) {
    // Hot fields (first 64 bytes = one L1 cache line)
    uint16_t  mag_top;             // Current magazine count
    uint16_t  mag_cap;             // Magazine capacity
    uint32_t  _pad0;
    TinySlab* active_slab;         // Primary active slab (no A/B split)
    PageMiniMag* mini_mag;         // Direct pointer to slab's mini-mag
    uint64_t  last_refill_tsc;     // For adaptive refill timing
    uint64_t  _pad1[4];            // Pad hot fields out to a full 64-byte line
    // Warm fields: magazine storage (cache lines 2-5; top of stack stays hot)
    void*     mag_items[32];       // Reduced from 2048 to 32 (still effective)
    // Cold fields (rarely touched)
    uint64_t  stats_alloc_batch;   // Batched statistics
    uint64_t  stats_free_batch;
} TinyTLSCache;                    // no packed attribute needed; fields are naturally aligned
static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];
```
**Benefits**:
1. Single TLS access: `g_tls_cache[class_idx]` (not 3 separate lookups)
2. Cache-line aligned: All hot fields in first 64 bytes
3. Reduced magazine size: 32 items (not 2048) saves 15.5 KB per class
4. Direct mini-mag pointer: No slab→mini_mag indirection
**Expected Speedup**: 30-40% (reduce fast path from ~12 cycles to ~7 cycles)
**Risk**: MEDIUM
- Requires refactoring TLS access patterns throughout codebase
- Magazine size reduction may increase refill frequency (trade-off)
- Need careful testing to ensure no regression on multi-threaded workloads
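For illustration, a minimal sketch of how the allocation fast path could look on top of the unified structure. `tiny_cache_refill()` is a hypothetical name standing in for the existing slab/mini-mag/bitmap fallback chain; field names follow the proposal above.
```c
// Sketch only: assumes the TinyTLSCache layout proposed above.
// tiny_cache_refill() is a hypothetical stand-in for the existing fallback chain.
void* tiny_cache_refill(TinyTLSCache* c, int class_idx); // slow path, defined elsewhere

static inline void* tiny_alloc_fast_sketch(int class_idx) {
    TinyTLSCache* c = &g_tls_cache[class_idx];  // single TLS access (one base + offset)
    if (c->mag_top > 0)                         // one load + one well-predicted branch
        return c->mag_items[--c->mag_top];      // LIFO pop from the small magazine
    return tiny_cache_refill(c, class_idx);     // everything else is the refill path
}
```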
---
## Bottleneck #2: Two-Tier Bitmap Traversal
### Location
- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.h`
- **Lines**: 235-269 (`hak_tiny_find_free_block`)
- **Impact**: HIGH (affects 5-15% of allocations, but expensive when hit)
### Code Analysis
```c
// Line 235-269: Two-tier bitmap scan
static inline int hak_tiny_find_free_block(TinySlab* slab) {
const int bw = g_tiny_bitmap_words[slab->class_idx]; // Bitmap words
const int sw = slab->summary_words; // Summary words
if (bw <= 0 || sw <= 0) return -1;
int start_word = slab->hint_word % bw; // Hint optimization
int start_sw = start_word / 64; // Summary word index
int start_sb = start_word % 64; // Summary bit offset
// Line 244-267: Summary bitmap scan (outer loop)
for (int k = 0; k < sw; k++) { // at most sw iterations (sw = 1-2 summary words per slab)
int idx = start_sw + k;
if (idx >= sw) idx -= sw; // Wrap-around
uint64_t bits = slab->summary[idx]; // LOAD: ~2 cycles
// Mask optimization (skip processed bits)
if (k == 0) {
bits &= (~0ULL) << start_sb; // BITWISE: ~1 cycle
}
if (idx == sw - 1 && (bw % 64) != 0) {
uint64_t mask = (bw % 64) == 64 ? ~0ULL : ((1ULL << (bw % 64)) - 1ULL);
bits &= mask; // BITWISE: ~1 cycle
}
if (bits == 0) continue; // BRANCH: ~1 cycle (often taken)
int woff = __builtin_ctzll(bits); // CTZ #1: ~3 cycles
int word_idx = idx * 64 + woff; // COMPUTE: ~2 cycles
if (word_idx >= bw) continue; // BRANCH: ~1 cycle
// Line 261-266: Main bitmap scan (inner)
uint64_t used = slab->bitmap[word_idx]; // LOAD: ~2 cycles (cache miss risk)
uint64_t free_bits = ~used; // BITWISE: ~1 cycle
if (free_bits == 0) continue; // BRANCH: ~1 cycle (rare)
int bit_idx = __builtin_ctzll(free_bits); // CTZ #2: ~3 cycles
slab->hint_word = (uint16_t)((word_idx + 1) % bw); // UPDATE HINT: ~2 cycles
return word_idx * 64 + bit_idx; // RETURN: ~1 cycle
}
return -1;
}
// TYPICAL COST: 15-20 cycles (1-2 summary iterations, 1 main bitmap access)
// WORST CASE: 50-100 cycles (many summary words scanned, cache misses)
```
### Why It's Slow
1. **Two-Level Indirection**: Summary → Bitmap → Block
- Summary scan: Find word with free bits (~5-10 cycles)
- Main bitmap scan: Find bit within word (~5 cycles)
- **Cost**: 2× CTZ operations, 2× memory loads
2. **Cache Miss Risk**: Bitmap can be up to 1 KB (128 words × 8 bytes)
- Class 0 (8B): 128 words = 1024 bytes
- Class 1 (16B): 64 words = 512 bytes
- Class 2 (32B): 32 words = 256 bytes
- **Cost**: Bitmap may not fit in L1 cache (32 KB) → L2 access (~10-20 cycles)
3. **Hint Word State**: Requires update on every allocation
- Read hint_word (~1 cycle)
- Compute new hint (~2 cycles)
- Write hint_word (~1 cycle)
- **Cost**: 4 cycles per allocation (not amortized)
4. **Branch-Heavy Loop**: Multiple branches per iteration
- `if (bits == 0) continue;` (often taken when bitmap is sparse)
- `if (word_idx >= bw) continue;` (rare safety check)
- `if (free_bits == 0) continue;` (rare but costly)
- **Cost**: Branch misprediction = 10-20 cycles each
### Optimization #1: Increase Mini-Magazine Capacity
**Rationale**: Avoid bitmap scan by keeping more items in mini-magazine
**Current**:
```c
// Line 344: Mini-magazine capacity
uint16_t mag_capacity = (class_idx <= 3) ? 32 : 16;
```
**Proposed**:
```c
// Increase capacity to reduce bitmap scan frequency
uint16_t mag_capacity = (class_idx <= 3) ? 64 : 32;
```
**Benefits**:
- Fewer bitmap scans (amortized over 64 items instead of 32)
- Better temporal locality (more items cached)
**Costs**:
- +256 bytes memory per slab (64 × 8 bytes pointers)
- Slightly higher refill cost (64 items vs 32)
**Expected Speedup**: 10-15% (reduce bitmap scan frequency by 50%)
**Risk**: LOW (simple parameter change, no logic changes)
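For context, a minimal sketch of the per-slab mini-magazine this parameter controls; the layout is assumed for illustration and the real `PageMiniMag` in hakmem may differ.
```c
#include <stddef.h>
#include <stdint.h>

// Assumed layout; shown only to make the capacity trade-off concrete.
typedef struct {
    void*    items[64];   // proposed capacity for classes 0-3 (previously 32)
    uint16_t count;       // number of cached free blocks
    uint16_t capacity;    // 64 for small classes, 32 otherwise
} MiniMagSketch;

static inline int   mini_mag_is_empty_sketch(const MiniMagSketch* m) { return m->count == 0; }
static inline int   mini_mag_is_full_sketch(const MiniMagSketch* m)  { return m->count == m->capacity; }
static inline void* mini_mag_pop_sketch(MiniMagSketch* m) {
    return m->count ? m->items[--m->count] : NULL;   // plain LIFO pop
}
static inline void  mini_mag_push_sketch(MiniMagSketch* m, void* p) {
    if (m->count < m->capacity) m->items[m->count++] = p;
}
```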
### Optimization #2: Cache-Aware Bitmap Layout
**Rationale**: Ensure bitmap fits in L1 cache for hot classes
**Current**:
```c
// Separate bitmap allocation (may be cache-cold)
slab->bitmap = (uint64_t*)hkm_libc_calloc(bitmap_size, sizeof(uint64_t));
```
**Proposed**:
```c
// Embed small bitmaps directly in slab structure
typedef struct TinySlab {
// ... existing fields ...
// Embedded bitmap for small classes (≤256 bytes)
union {
uint64_t* bitmap_ptr; // Large classes: heap-allocated
uint64_t bitmap_embed[32]; // Small classes: embedded (256 bytes)
};
uint8_t bitmap_embedded; // Flag: 1=embedded, 0=heap
} TinySlab;
```
**Benefits**:
- Class 0-2 (8B-32B): Bitmap fits in 256 bytes (embedded)
- Single cache line access for bitmap + slab metadata
- No heap allocation for small classes
**Expected Speedup**: 5-10% (reduce cache misses on bitmap access)
**Risk**: MEDIUM (requires refactoring bitmap access logic)
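A small accessor can keep the rest of the bitmap code unchanged by hiding where the bitmap lives. Sketch, assuming the union/flag layout proposed above:
```c
// Returns the active bitmap whether it is embedded or heap-allocated.
// Assumes the bitmap_embedded flag and union proposed above.
static inline uint64_t* tiny_slab_bitmap(TinySlab* slab) {
    return slab->bitmap_embedded ? slab->bitmap_embed : slab->bitmap_ptr;
}
```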
### Optimization #3: Lazy Summary Bitmap Update
**Rationale**: Summary bitmap update is expensive on free path
**Current**:
```c
// Line 199-213: Summary update on every set_used/set_free
static inline void hak_tiny_set_used(TinySlab* slab, int block_idx) {
// ... bitmap update ...
// Update summary (EXPENSIVE)
int sum_word = word_idx / 64;
int sum_bit = word_idx % 64;
uint64_t has_free = ~v;
if (has_free != 0) {
slab->summary[sum_word] |= (1ULL << sum_bit); // WRITE
} else {
slab->summary[sum_word] &= ~(1ULL << sum_bit); // WRITE
}
}
```
**Proposed**:
```c
// Lazy summary update (rebuild only when scanning)
static inline void hak_tiny_set_used(TinySlab* slab, int block_idx) {
// ... bitmap update ...
// NO SUMMARY UPDATE (deferred)
}
static inline int hak_tiny_find_free_block(TinySlab* slab) {
// Rebuild summary if stale (rare)
if (slab->summary_stale) {
rebuild_summary_bitmap(slab); // O(N) but rare
slab->summary_stale = 0;
}
// ... existing scan logic ...
}
```
**Benefits**:
- Eliminate summary update on 95% of operations (free path)
- Summary rebuild cost amortized over many allocations
**Expected Speedup**: 15-20% on free-heavy workloads
**Risk**: MEDIUM (requires careful stale bit management)
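The `rebuild_summary_bitmap()` helper referenced above could look like the following sketch. It restores the same invariant the eager update maintains: summary bit *i* is set exactly when bitmap word *i* still contains at least one free (zero) bit.
```c
// Sketch of the deferred rebuild referenced above.
static void rebuild_summary_bitmap(TinySlab* slab) {
    const int bw = g_tiny_bitmap_words[slab->class_idx]; // main bitmap words
    const int sw = slab->summary_words;                  // summary words
    for (int i = 0; i < sw; i++) slab->summary[i] = 0;
    for (int w = 0; w < bw; w++) {
        if (~slab->bitmap[w] != 0)                       // word still has free blocks
            slab->summary[w / 64] |= 1ULL << (w % 64);   // publish it in the summary
    }
}
```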
---
## Bottleneck #3: Registry Lookup on Free Path
### Location
- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 1102-1118 (`hak_tiny_free`)
- **Impact**: MEDIUM (affects cross-slab frees, ~30-50% of frees)
### Code Analysis
```c
// Line 1102-1118: Free path with registry lookup
void hak_tiny_free(void* ptr) {
if (!ptr || !g_tiny_initialized) return;
// Line 1106-1111: SuperSlab fast path (disabled by default)
SuperSlab* ss = ptr_to_superslab(ptr); // BITWISE: ~2 cycles
if (ss && ss->magic == SUPERSLAB_MAGIC) { // LOAD + BRANCH: ~3 cycles
hak_tiny_free_superslab(ptr, ss); // FAST PATH: ~5 ns
return;
}
// Line 1114: Registry lookup (EXPENSIVE)
TinySlab* slab = hak_tiny_owner_slab(ptr); // LOOKUP: ~10-30 cycles
if (!slab) return;
hak_tiny_free_with_slab(ptr, slab); // FREE: ~50-200 cycles
}
// hakmem_tiny.c:395-440 - Registry lookup implementation
TinySlab* hak_tiny_owner_slab(void* ptr) {
if (!ptr || !g_tiny_initialized) return NULL;
if (g_use_registry) {
// O(1) hash table lookup
uintptr_t slab_base = (uintptr_t)ptr & ~(TINY_SLAB_SIZE - 1); // BITWISE: ~2 cycles
TinySlab* slab = registry_lookup(slab_base); // FUNCTION CALL: ~20-50 cycles
if (!slab) return NULL;
// Validation (bounds check)
uintptr_t start = (uintptr_t)slab->base;
uintptr_t end = start + TINY_SLAB_SIZE;
if ((uintptr_t)ptr < start || (uintptr_t)ptr >= end) {
return NULL; // False positive
}
return slab;
} else {
// O(N) linear search (fallback)
for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
pthread_mutex_lock(lock); // LOCK: ~30-100 cycles
// Search free slabs
for (TinySlab* slab = g_tiny_pool.free_slabs[class_idx]; slab; slab = slab->next) {
// ... bounds check ...
}
pthread_mutex_unlock(lock);
}
return NULL;
}
}
// Line 268-288: Registry lookup (hash table linear probe)
static TinySlab* registry_lookup(uintptr_t slab_base) {
int hash = registry_hash(slab_base); // HASH: ~5 cycles
for (int i = 0; i < SLAB_REGISTRY_MAX_PROBE; i++) { // Up to 8 probes
int idx = (hash + i) & SLAB_REGISTRY_MASK; // BITWISE: ~2 cycles
SlabRegistryEntry* entry = &g_slab_registry[idx]; // LOAD: ~2 cycles
if (entry->slab_base == slab_base) { // LOAD + BRANCH: ~3 cycles
TinySlab* owner = entry->owner; // LOAD: ~2 cycles
return owner;
}
if (entry->slab_base == 0) { // LOAD + BRANCH: ~2 cycles
return NULL; // Empty slot
}
}
return NULL;
}
// TYPICAL COST: 20-30 cycles (1-2 probes, cache hit)
// WORST CASE: 50-100 cycles (8 probes, cache miss on registry array)
```
### Why It's Slow
1. **Hash Computation**: Simple shift-and-mask, but collisions fall through to probing
```c
static inline int registry_hash(uintptr_t slab_base) {
return (slab_base >> 16) & SLAB_REGISTRY_MASK; // Simple, but...
}
```
- Shift + mask = 2 cycles (acceptable)
- **BUT**: Linear probing on collision adds 10-30 cycles
2. **Linear Probing**: Up to 8 probes on collision
- Each probe: Load + compare + branch (3 cycles × 8 = 24 cycles worst case)
- Registry size: 1024 entries (8 KB array)
- **Cost**: May span multiple cache lines → cache miss (10-20 cycles penalty)
3. **Validation Overhead**: Bounds check after lookup
- Load slab->base (2 cycles)
- Compute end address (1 cycle)
- Compare twice (2 cycles)
- **Cost**: 5 cycles per free (not amortized)
4. **Global Shared State**: Registry is shared across all threads
- No cache-line alignment (false sharing risk)
- Lock-free reads → ABA problem potential
- **Cost**: Atomic load penalties (~5-10 cycles vs normal load)
### Optimization #1: Enable SuperSlab by Default
**Rationale**: SuperSlab has O(1) pointer→slab via 2MB alignment (mimalloc-style)
**Current**:
```c
// Line 81: SuperSlab disabled by default
static int g_use_superslab = 0; // Runtime toggle
```
**Proposed**:
```c
// Enable SuperSlab by default
static int g_use_superslab = 1; // Always on
```
**Benefits**:
- Eliminate registry lookup entirely: `ptr & ~0x1FFFFF` (1 AND operation)
- SuperSlab free path: ~5 ns (vs ~10-30 ns registry path)
- Better cache locality (2MB aligned pages)
**Costs**:
- 2MB address space per SuperSlab (not physical memory due to lazy allocation)
- Slightly higher memory overhead (metadata at SuperSlab level)
**Expected Speedup**: 20-30% on free-heavy workloads
**Risk**: LOW (SuperSlab already implemented and tested in Phase 6.23)
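The free-path lookup this enables is a single mask plus a magic check, as in the `hak_tiny_free` excerpt earlier. A sketch, assuming 2 MiB-aligned SuperSlabs whose header begins with the magic word:
```c
#include <stdint.h>

// Sketch of the alignment-based owner lookup (not the exact hakmem implementation).
#define SUPERSLAB_ALIGN (2ull * 1024 * 1024)   // 2 MiB alignment assumed

static inline SuperSlab* ptr_to_superslab_sketch(void* ptr) {
    SuperSlab* ss = (SuperSlab*)((uintptr_t)ptr & ~(uintptr_t)(SUPERSLAB_ALIGN - 1));
    return (ss && ss->magic == SUPERSLAB_MAGIC) ? ss : NULL;
}
```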
### Optimization #2: Cache Last Freed Slab
**Rationale**: Temporal locality - next free likely from same slab
**Proposed**:
```c
// Per-thread cache of last freed slab
static __thread TinySlab* t_last_freed_slab[TINY_NUM_CLASSES] = {NULL};
void hak_tiny_free(void* ptr) {
if (!ptr) return;
// Try cached slab first (likely hit)
int class_idx = guess_class_from_size(ptr); // Heuristic
TinySlab* slab = t_last_freed_slab[class_idx];
// Validate pointer is in this slab
if (slab && ptr_in_slab_range(ptr, slab)) {
hak_tiny_free_with_slab(ptr, slab); // FAST PATH: ~5 ns
return;
}
// Fallback to registry lookup (rare)
slab = hak_tiny_owner_slab(ptr);
if (slab) {
t_last_freed_slab[slab->class_idx] = slab; // Update cache
hak_tiny_free_with_slab(ptr, slab);
}
}
```
**Benefits**:
- 80-90% cache hit rate (temporal locality)
- Fast path: 2 loads + 2 compares (~5 cycles) vs registry lookup (20-30 cycles)
**Expected Speedup**: 15-20% on free-heavy workloads
**Risk**: MEDIUM (requires heuristic for class_idx guessing, may mispredict)
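The `ptr_in_slab_range()` check above is just a bounds test against the cached slab; a sketch:
```c
// Sketch of the bounds check used by the cached-slab fast path above.
static inline int ptr_in_slab_range(const void* ptr, const TinySlab* slab) {
    uintptr_t p     = (uintptr_t)ptr;
    uintptr_t start = (uintptr_t)slab->base;
    return p >= start && p < start + TINY_SLAB_SIZE;
}
```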
---
## Bottleneck #4: Statistics Collection Overhead
### Location
- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny_stats.h`
- **Lines**: 59-73 (`stats_record_alloc`, `stats_record_free`)
- **Impact**: LOW (already optimized to TLS batching, but still ~0.5 ns per op)
### Code Analysis
```c
// Line 59-62: Allocation statistics (inline)
static inline void stats_record_alloc(int class_idx) __attribute__((always_inline));
static inline void stats_record_alloc(int class_idx) {
t_alloc_batch[class_idx]++; // TLS INCREMENT: ~0.5-1 cycle
}
// Line 70-73: Free statistics (inline)
static inline void stats_record_free(int class_idx) __attribute__((always_inline));
static inline void stats_record_free(int class_idx) {
t_free_batch[class_idx]++; // TLS INCREMENT: ~0.5-1 cycle
}
```
### Why It's (Slightly) Slow
1. **TLS Access Overhead**: Even TLS has cost
- TLS base register: %fs on x86-64 (implicit)
- Offset calculation: `[%fs + class_idx*4]`
- **Cost**: ~0.5 cycles (not zero!)
2. **Cache Line Pollution**: TLS counters compete for L1 cache
- `t_alloc_batch[8]` = 32 bytes
- `t_free_batch[8]` = 32 bytes
- **Cost**: 64 bytes of L1 cache (1 cache line)
3. **Compiler Optimization Barriers**: `always_inline` prevents optimization
- Forces inline (good)
- But prevents compiler from hoisting out of loops (bad)
- **Cost**: Increment inside hot loop vs once outside
### Optimization: Compile-Time Statistics Toggle
**Rationale**: Production builds don't need exact counts
**Proposed**:
```c
#ifdef HAKMEM_ENABLE_STATS
#define STATS_RECORD_ALLOC(cls) t_alloc_batch[cls]++
#define STATS_RECORD_FREE(cls) t_free_batch[cls]++
#else
#define STATS_RECORD_ALLOC(cls) ((void)0)
#define STATS_RECORD_FREE(cls) ((void)0)
#endif
```
**Benefits**:
- Zero overhead when stats disabled
- Compiler can optimize away dead code
**Expected Speedup**: 3-5% (small but measurable)
**Risk**: VERY LOW (compile-time flag, no runtime impact)
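A call site then changes only in name; when `HAKMEM_ENABLE_STATS` is not defined, the increment compiles away entirely. Self-contained sketch (not a diff of the real file; names mirror the proposal above):
```c
#include <stdint.h>

#define TINY_NUM_CLASSES 8

#ifdef HAKMEM_ENABLE_STATS
static __thread uint64_t t_alloc_batch[TINY_NUM_CLASSES];
#define STATS_RECORD_ALLOC(cls) ((void)t_alloc_batch[(cls)]++)
#else
#define STATS_RECORD_ALLOC(cls) ((void)0)   // zero cost in production builds
#endif

// Example call site: the magazine pop records the allocation only in stats builds.
static inline void* mag_pop_example(void** items, uint16_t* top, int class_idx) {
    void* p = items[--*top];
    STATS_RECORD_ALLOC(class_idx);
    return p;
}
```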
---
## Bottleneck #5: Magazine Spill/Refill Lock Contention
### Location
- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 880-939 (magazine spill under class lock)
- **Impact**: MEDIUM (affects 5-10% of frees when magazine is full)
### Code Analysis
```c
// Line 880-939: Magazine spill (class lock held)
if (mag->top < cap) {
// Fast path: push to magazine (no lock)
mag->items[mag->top].ptr = ptr;
mag->top++;
return;
}
// Spill half under class lock
pthread_mutex_t* lock = &g_tiny_class_locks[class_idx].m;
pthread_mutex_lock(lock); // LOCK: ~30-100 cycles (contended)
int spill = cap / 2; // Spill 1024 items (for 2048 cap)
for (int i = 0; i < spill && mag->top > 0; i++) {
TinyMagItem it = mag->items[--mag->top];
TinySlab* owner = hak_tiny_owner_slab(it.ptr); // LOOKUP: ~20-30 cycles × 1024
if (!owner) continue;
// Phase 4.1: Try mini-magazine push (avoid bitmap)
if ((owner == tls_a || owner == tls_b) && !mini_mag_is_full(&owner->mini_mag)) {
mini_mag_push(&owner->mini_mag, it.ptr); // FAST: ~4 cycles
continue;
}
// Slow path: bitmap update
size_t bs = g_tiny_class_sizes[owner->class_idx];
int idx = ((uintptr_t)it.ptr - (uintptr_t)owner->base) / bs; // DIV: ~10 cycles
if (hak_tiny_is_used(owner, idx)) {
hak_tiny_set_free(owner, idx); // BITMAP: ~10 cycles
owner->free_count++;
// ... list management ...
}
}
pthread_mutex_unlock(lock);
// TOTAL SPILL COST: ~50,000-100,000 cycles (1024 items × 50-100 cycles/item)
// Amortized: ~50-100 cycles (~17-33 ns) per free (when a spill happens every ~1000 frees)
```
### Why It's Slow
1. **Lock Hold Time**: Lock held for entire spill (1024 items)
- Blocks other threads from accessing class lock
- Spill takes ~17-33 µs (50k-100k cycles @ 3 GHz) → other threads stalled
- **Cost**: Contention penalty on multi-threaded workloads
2. **Registry Lookup in Loop**: 1024 lookups under lock
- `hak_tiny_owner_slab(it.ptr)` called 1024 times
- Each lookup: 20-30 cycles
- **Cost**: 20,000-30,000 cycles just for lookups
3. **Division in Hot Loop**: Block index calculation uses division
- `int idx = ((uintptr_t)it.ptr - (uintptr_t)owner->base) / bs;`
- Division is ~10 cycles on modern CPUs (not fully pipelined)
- **Cost**: 10,000 cycles for 1024 divisions
4. **Large Spill Batch**: 1024 items is too large
- Amortizes lock cost well (good)
- But increases lock hold time (bad)
- Trade-off not optimized
### Optimization #1: Reduce Spill Batch Size
**Rationale**: Smaller batches = shorter lock hold time = less contention
**Current**:
```c
int spill = cap / 2; // 1024 items for 2048 cap
```
**Proposed**:
```c
int spill = 128; // Fixed batch size (not cap-dependent)
```
**Benefits**:
- Shorter lock hold time: ~2-4 µs (vs ~17-33 µs)
- Better multi-thread responsiveness
**Costs**:
- More frequent spills (8× more frequent)
- Slightly higher total lock overhead
**Expected Speedup**: 10-15% on multi-threaded workloads
**Risk**: LOW (simple parameter change)
### Optimization #2: Lock-Free Spill Stack
**Rationale**: Avoid lock entirely for spill path
**Proposed**:
```c
// Per-class global spill stack (lock-free MPSC)
static atomic_uintptr_t g_spill_stack[TINY_NUM_CLASSES];
void magazine_spill_lockfree(int class_idx, void* ptr) {
// Push to lock-free stack
uintptr_t old_head;
do {
old_head = atomic_load(&g_spill_stack[class_idx], memory_order_acquire);
*((uintptr_t*)ptr) = old_head; // Intrusive next-pointer
} while (!atomic_compare_exchange_weak(&g_spill_stack[class_idx], &old_head, (uintptr_t)ptr,
memory_order_release, memory_order_relaxed));
}
// Background thread drains spill stack periodically
void background_drain_spill_stack(void) {
for (int i = 0; i < TINY_NUM_CLASSES; i++) {
uintptr_t head = atomic_exchange(&g_spill_stack[i], 0, memory_order_acq_rel);
if (!head) continue;
pthread_mutex_lock(&g_tiny_class_locks[i].m);
// ... drain to bitmap ...
pthread_mutex_unlock(&g_tiny_class_locks[i].m);
}
}
```
**Benefits**:
- Zero lock contention on spill path
- Fast atomic CAS (~5-10 cycles)
**Costs**:
- Requires background thread or periodic drain
- Slightly more complex memory management
**Expected Speedup**: 20-30% on multi-threaded workloads
**Risk**: HIGH (requires careful design of background drain mechanism)
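On the producer side, the only change to the free path is routing the magazine-full case to the CAS push instead of the locked spill. Sketch; `TinyTLSMag`, `cap`, and `magazine_spill_lockfree()` are as in the code above:
```c
// Sketch of the magazine-full branch with the lock-free spill.
static inline void tiny_mag_push_or_spill(TinyTLSMag* mag, int cap, int class_idx, void* ptr) {
    if (mag->top < cap) {
        mag->items[mag->top].ptr = ptr;          // unchanged fast path (no lock)
        mag->top++;
    } else {
        magazine_spill_lockfree(class_idx, ptr); // O(1) CAS push; drained in the background
    }
}
```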
---
## Bottleneck #6: Branch Misprediction in Size Class Lookup
### Location
- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.h`
- **Lines**: 159-182 (`hak_tiny_size_to_class`)
- **Impact**: LOW (only 1-2 ns per allocation, but called on every allocation)
### Code Analysis
```c
// Line 159-182: Size to class lookup (branch chain)
static inline int hak_tiny_size_to_class(size_t size) {
if (size == 0 || size > TINY_MAX_SIZE) return -1; // BRANCH: ~1 cycle
// Branch chain (8 branches for 8 classes)
if (size <= 8) return 0; // BRANCH: ~1 cycle
if (size <= 16) return 1; // BRANCH: ~1 cycle
if (size <= 32) return 2; // BRANCH: ~1 cycle
if (size <= 64) return 3; // BRANCH: ~1 cycle
if (size <= 128) return 4; // BRANCH: ~1 cycle
if (size <= 256) return 5; // BRANCH: ~1 cycle
if (size <= 512) return 6; // BRANCH: ~1 cycle
return 7; // size <= 1024
}
// TYPICAL COST: 3-5 cycles (3-4 branches taken)
// WORST CASE: 8 cycles (all branches checked)
```
### Why It's (Slightly) Slow
1. **Unpredictable Size Distribution**: Branch predictor can't learn pattern
- Real-world allocation sizes are quasi-random
- Size 16 most common (33%), but others vary
- **Cost**: ~20-30% branch misprediction rate (~10 cycles penalty)
2. **Sequential Dependency**: Each branch depends on previous
- CPU can't parallelize branch evaluation
- Must evaluate branches in order
- **Cost**: No instruction-level parallelism (ILP)
### Optimization: Branchless Lookup Table
**Rationale**: Use CLZ (count leading zeros) for O(1) class lookup
**Proposed**:
```c
// Lookup table for size → class (branchless). Index = request size in bytes.
// 129 entries so index 128 is valid; index 0 is guarded by the size check below.
static const uint8_t g_size_to_class_table[129] = {
    /* size 0 (unused) */ 0,
    /* size 1-8:   class 0 */ 0, 0, 0, 0, 0, 0, 0, 0,
    /* size 9-16:  class 1 */ 1, 1, 1, 1, 1, 1, 1, 1,
    /* size 17-32: class 2 */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    /* size 33-64: class 3 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
                              3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    /* size 65-128: class 4 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
                               4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
                               4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
                               4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
};
static inline int hak_tiny_size_to_class(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return -1;
    // Fast path: direct table lookup for small sizes (no data-dependent branches)
    if (size <= 128) {
        return g_size_to_class_table[size]; // LOAD: ~2 cycles (L1 cache)
    }
    // Larger sizes: bit width of (size - 1) selects the power-of-2 class
    // size 129-256 → class 5
    // size 257-512 → class 6
    // size 513-1024 → class 7
    return 61 - (int)__builtin_clzll((unsigned long long)(size - 1)); // CLZ: ~3 cycles
}
// TYPICAL COST: 2-3 cycles (table lookup, no mispredicted branches)
```
**Benefits**:
- Branchless for common sizes (8-128B covers 80%+ of allocations)
- Table fits in L1 cache (129 bytes, 2-3 cache lines)
- Predictable performance (no branch misprediction)
**Expected Speedup**: 2-3% (reduce 5 cycles to 2-3 cycles)
**Risk**: VERY LOW (table is static, no runtime overhead)
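Because the table and CLZ formula are easy to get off by one, an exhaustive check against the original branch chain is cheap insurance. Sketch; assumes `TINY_MAX_SIZE` is 1024, as implied by the original branch chain:
```c
#include <assert.h>
#include <stddef.h>

// Reference implementation: the original branch chain.
static int size_to_class_ref(size_t size) {
    if (size == 0 || size > 1024) return -1;
    if (size <= 8)   return 0;
    if (size <= 16)  return 1;
    if (size <= 32)  return 2;
    if (size <= 64)  return 3;
    if (size <= 128) return 4;
    if (size <= 256) return 5;
    if (size <= 512) return 6;
    return 7;
}

// Exhaustive check over the whole tiny range (runs in microseconds).
static void check_size_to_class(void) {
    for (size_t s = 1; s <= 1024; s++)
        assert(hak_tiny_size_to_class(s) == size_to_class_ref(s));
}
```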
---
## Bottleneck #7: Remote Free Drain Overhead
### Location
- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 146-184 (`tiny_remote_drain_locked`)
- **Impact**: LOW (only affects cross-thread frees, ~10-20% of workloads)
### Code Analysis
```c
// Line 146-184: Remote free drain (under class lock)
static void tiny_remote_drain_locked(TinySlab* slab) {
uintptr_t head = atomic_exchange(&slab->remote_head, NULL, memory_order_acq_rel); // ATOMIC: ~10 cycles
unsigned drained = 0;
while (head) { // LOOP: variable iterations
void* p = (void*)head;
head = *((uintptr_t*)p); // LOAD NEXT: ~2 cycles
// Calculate block index
size_t block_size = g_tiny_class_sizes[slab->class_idx]; // LOAD: ~2 cycles
uintptr_t offset = (uintptr_t)p - (uintptr_t)slab->base; // SUBTRACT: ~1 cycle
int block_idx = offset / block_size; // DIVIDE: ~10 cycles
// Skip if already free (idempotent)
if (!hak_tiny_is_used(slab, block_idx)) continue; // BITMAP CHECK: ~5 cycles
hak_tiny_set_free(slab, block_idx); // BITMAP UPDATE: ~10 cycles
int was_full = (slab->free_count == 0); // LOAD: ~1 cycle
slab->free_count++; // INCREMENT: ~1 cycle
if (was_full) {
move_to_free_list(slab->class_idx, slab); // LIST UPDATE: ~20-50 cycles (rare)
}
if (slab->free_count == slab->total_count) {
// ... slab release logic ... (rare)
release_slab(slab); // EXPENSIVE: ~1000 cycles (very rare)
break;
}
g_tiny_pool.free_count[slab->class_idx]++; // GLOBAL INCREMENT: ~1 cycle
drained++;
}
if (drained) atomic_fetch_sub(&slab->remote_count, drained, memory_order_relaxed); // ATOMIC: ~10 cycles
}
// TYPICAL COST: 50-100 cycles per drained block (moderate)
// WORST CASE: 1000+ cycles (slab release)
```
### Why It's Slow
1. **Division in Loop**: Block index calculation uses division
- `int block_idx = offset / block_size;`
- Division is ~10 cycles (even on modern CPUs)
- **Cost**: 10 cycles × N remote frees
2. **Atomic Operations**: 2 atomic ops per drain (exchange + fetch_sub)
- `atomic_exchange` at start (~10 cycles)
- `atomic_fetch_sub` at end (~10 cycles)
- **Cost**: 20 cycles overhead (not per-block, but still expensive)
3. **Bitmap Update**: Same as allocation path
- `hak_tiny_set_free` updates both bitmap and summary
- **Cost**: 10 cycles per block
### Optimization: Multiplication-Based Division
**Rationale**: Replace division with multiplication by reciprocal
**Current**:
```c
int block_idx = offset / block_size; // DIVIDE: ~10 cycles
```
**Proposed**:
```c
// Pre-computed reciprocals (magic constants)
static const uint64_t g_tiny_block_reciprocals[TINY_NUM_CLASSES] = {
// Computed as: (1ULL << 48) / block_size
// Allows: block_idx = (offset * reciprocal) >> 48
    (1ULL << 48) / 8,    // Class 0: 8B
    (1ULL << 48) / 16,   // Class 1: 16B
    (1ULL << 48) / 32,   // Class 2: 32B
    (1ULL << 48) / 64,   // Class 3: 64B
    (1ULL << 48) / 128,  // Class 4: 128B
    (1ULL << 48) / 256,  // Class 5: 256B
    (1ULL << 48) / 512,  // Class 6: 512B
    (1ULL << 48) / 1024, // Class 7: 1024B
};
// Fast division using multiplication
int block_idx = (offset * g_tiny_block_reciprocals[slab->class_idx]) >> 48; // MUL + SHIFT: ~3 cycles
```
**Benefits**:
- Reduce 10 cycles to 3 cycles per division
- Saves 7 cycles per remote free
**Expected Speedup**: 5-10% on cross-thread workloads
**Risk**: VERY LOW (well-known compiler optimization, manually applied)
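The trick is exact here because every tiny class size is a power of two, so `(1ULL << 48) / bs` divides evenly; a short check makes that concrete. Sketch; with 64 KiB slabs (as implied by the bitmap sizes earlier) the 64-bit product cannot overflow:
```c
#include <assert.h>
#include <stdint.h>

// Verify that (offset * (2^48 / bs)) >> 48 equals offset / bs for one class.
static void check_reciprocal(uint64_t bs, uint64_t slab_bytes) {
    uint64_t recip = (1ULL << 48) / bs;
    for (uint64_t off = 0; off < slab_bytes; off += bs)
        assert(((off * recip) >> 48) == off / bs);
}
// Example: check_reciprocal(8, 64 * 1024) exercises class 0 over a full slab.
```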
---
## Profiling Plan
### perf Commands to Run
```bash
# 1. CPU cycle breakdown (identify hotspots)
perf record -e cycles:u -g ./bench_comprehensive
perf report --stdio --no-children | head -100 > perf_cycles.txt
# 2. Cache miss analysis (L1d, L1i, LLC)
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,\
L1-icache-loads,L1-icache-load-misses,LLC-loads,LLC-load-misses \
./bench_comprehensive
# 3. Branch misprediction rate
perf stat -e cycles,instructions,branches,branch-misses \
./bench_comprehensive
# 4. TLB miss analysis (address translation overhead)
perf stat -e cycles,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
./bench_comprehensive
# 5. Function-level profiling (annotated source)
perf record -e cycles:u --call-graph dwarf ./bench_comprehensive
perf report --stdio --sort symbol --percent-limit 1
# 6. Memory bandwidth utilization
perf stat -e cycles,mem_load_retired.l1_hit,mem_load_retired.l1_miss,\
mem_load_retired.l2_hit,mem_load_retired.l3_hit,mem_load_retired.l3_miss \
./bench_comprehensive
# 7. Allocation-specific hotspots (focus on hak_tiny_* symbols)
perf record -e cycles:u --call-graph dwarf -- ./bench_comprehensive
perf report --stdio | grep "hak_tiny"
```
### Expected Hotspots to Validate
Based on code analysis, we expect to see:
1. **hak_tiny_find_free_block** (15-25% of cycles)
- Two-tier bitmap scan
- CTZ operations
- Cache misses on large bitmaps
2. **hak_tiny_set_used / hak_tiny_set_free** (10-15% of cycles)
- Bitmap updates
- Summary bitmap updates
- Write-heavy (cache line bouncing)
3. **hak_tiny_owner_slab** (10-20% of cycles on free path)
- Registry lookup
- Hash computation
- Linear probing
4. **tiny_mag_init_if_needed** (5-10% of cycles)
- TLS access
- Conditional initialization
5. **stats_record_alloc / stats_record_free** (3-5% of cycles)
- TLS counter increments
- Cache line pollution
### Validation Criteria
**Cache Miss Rates**:
- L1d miss rate: < 5% (good), 5-10% (acceptable), > 10% (poor)
- LLC miss rate: < 1% (good), 1-3% (acceptable), > 3% (poor)
**Branch Misprediction**:
- Misprediction rate: < 2% (good), 2-5% (acceptable), > 5% (poor)
- Expected: 3-4% (due to unpredictable size classes)
**IPC (Instructions Per Cycle)**:
- IPC: > 2.0 (good), 1.5-2.0 (acceptable), < 1.5 (poor)
- Expected: 1.5-1.8 (memory-bound, not compute-bound)
**Function Time Distribution**:
- hak_tiny_alloc: 40-60% (hot path)
- hak_tiny_free: 20-30% (warm path)
- hak_tiny_find_free_block: 10-20% (expensive when hit)
- Other: < 10%
---
## Optimization Roadmap
### Quick Wins (< 1 hour, Low Risk)
1. **Enable SuperSlab by Default** (Bottleneck #3)
- Change: `g_use_superslab = 1;`
- Impact: 20-30% speedup on free path
- Risk: VERY LOW (already implemented)
- Effort: 5 minutes
2. **Disable Statistics in Production** (Bottleneck #4)
- Change: Add `#ifndef HAKMEM_ENABLE_STATS` guards
- Impact: 3-5% speedup
- Risk: VERY LOW (compile-time flag)
- Effort: 15 minutes
3. **Increase Mini-Magazine Capacity** (Bottleneck #2)
- Change: `mag_capacity = 64` (was 32)
- Impact: 10-15% speedup (reduce bitmap scans)
- Risk: LOW (slight memory increase)
- Effort: 5 minutes
4. **Branchless Size Class Lookup** (Bottleneck #6)
- Change: Use lookup table for common sizes
- Impact: 2-3% speedup
- Risk: VERY LOW (static table)
- Effort: 30 minutes
**Total Expected Speedup: 35-53%** (conservative: 1.4-1.5×)
### Medium Effort (1-4 hours, Medium Risk)
5. **Unified TLS Cache Structure** (Bottleneck #1)
- Change: Merge TLS arrays into single cache-aligned struct
- Impact: 30-40% speedup on fast path
- Risk: MEDIUM (requires refactoring)
- Effort: 3-4 hours
6. **Reduce Magazine Spill Batch** (Bottleneck #5)
- Change: `spill = 128` (was 1024)
- Impact: 10-15% speedup on multi-threaded
- Risk: LOW (parameter tuning)
- Effort: 30 minutes
7. **Cache-Aware Bitmap Layout** (Bottleneck #2)
- Change: Embed small bitmaps in slab structure
- Impact: 5-10% speedup
- Risk: MEDIUM (requires struct changes)
- Effort: 2-3 hours
8. **Multiplication-Based Division** (Bottleneck #7)
- Change: Replace division with mul+shift
- Impact: 5-10% speedup on remote frees
- Risk: VERY LOW (well-known optimization)
- Effort: 1 hour
**Total Expected Speedup: 50-85%** (conservative: 1.5-1.8×)
### Major Refactors (> 4 hours, High Risk)
9. **Lock-Free Spill Stack** (Bottleneck #5)
- Change: Use atomic MPSC queue for magazine spill
- Impact: 20-30% speedup on multi-threaded
- Risk: HIGH (complex concurrency)
- Effort: 8-12 hours
10. **Lazy Summary Bitmap Update** (Bottleneck #2)
- Change: Rebuild summary only when scanning
- Impact: 15-20% speedup on free-heavy workloads
- Risk: MEDIUM (requires careful staleness tracking)
- Effort: 4-6 hours
11. **Collapse TLS Magazine Tiers** (Bottleneck #1)
- Change: Merge magazine + mini-mag into single LIFO
- Impact: 40-50% speedup (eliminate tier overhead)
- Risk: HIGH (major architectural change)
- Effort: 12-16 hours
12. **Full mimalloc-Style Rewrite** (All Bottlenecks)
- Change: Replace bitmap with intrusive free-list
- Impact: 5-9× speedup (match mimalloc)
- Risk: VERY HIGH (complete redesign)
- Effort: 40+ hours
**Total Expected Speedup: 75-150%** (optimistic: 1.8-2.5×)
---
## Risk Assessment Summary
### Low Risk Optimizations (Safe to implement immediately)
- SuperSlab enable
- Statistics compile-time toggle
- Mini-mag capacity increase
- Branchless size lookup
- Multiplication division
- Magazine spill batch reduction
**Expected: 1.4-1.6× speedup, 2-3 hours effort**
### Medium Risk Optimizations (Test thoroughly)
- Unified TLS cache structure
- Cache-aware bitmap layout
- Lazy summary update
**Expected: 1.6-2.0× speedup, 6-10 hours effort**
### High Risk Optimizations (Prototype first)
- Lock-free spill stack
- Magazine tier collapse
- Full mimalloc rewrite
**Expected: 2.0-9.0× speedup, 20-60 hours effort**
---
## Estimated Speedup Summary
### Conservative Target (Low + Medium optimizations)
- **Random pattern**: 68 M ops/sec → **140 M ops/sec** (2.0× speedup)
- **LIFO pattern**: 102 M ops/sec → **200 M ops/sec** (2.0× speedup)
- **Gap to mimalloc**: 2.6× → **1.3×** (close 50% of gap)
### Optimistic Target (All optimizations)
- **Random pattern**: 68 M ops/sec → **170 M ops/sec** (2.5× speedup)
- **LIFO pattern**: 102 M ops/sec → **450 M ops/sec** (4.4× speedup)
- **Gap to mimalloc**: 2.6× → **1.0×** (match on random, 2× on LIFO)
---
## Conclusion
The hakmem allocator's 2.6× gap to mimalloc on favorable patterns (random free) is primarily due to:
1. **Architectural overhead**: 6-tier allocation hierarchy vs mimalloc's 3-tier
2. **Bitmap traversal cost**: Two-tier scan adds 15-20 cycles even when optimized
3. **Registry lookup overhead**: Hash table lookup adds 20-30 cycles on free path
**Quick wins** (1-3 hours effort) can achieve **1.4-1.6× speedup**.
**Medium effort** (10 hours) can achieve **1.8-2.0× speedup**.
**Full mimalloc-style rewrite** (40+ hours) needed to match mimalloc's 1.1 ns/op.
**Recommendation**: Implement quick wins first (SuperSlab + stats disable + branchless lookup), measure results with `perf`, then decide if medium-effort optimizations are worth the complexity increase.