# Bottleneck Analysis Report: hakmem Tiny Pool Allocator

**Date**: 2025-10-26
**Target**: hakmem bitmap-based allocator
**Baseline**: mimalloc (industry standard)
**Analyzed by**: Deep code analysis + performance modeling

---

## Executive Summary

### Top 3 Bottlenecks with Estimated Impact

1. **TLS Magazine Hierarchy Overhead** (HIGH: ~3-5 ns per allocation)
   - 3-tier indirection: TLS Magazine → TLS Active Slab → Mini-Magazine → Bitmap
   - Each tier adds cache-miss risk and branching overhead
   - Expected speedup: 30-40% if collapsed to 2 tiers

2. **Two-Tier Bitmap Traversal** (HIGH: ~4-6 ns on the bitmap path)
   - Summary-bitmap scan + main-bitmap scan + hint_word update
   - Cache-friendly but computationally expensive (2× CTZ, 2× bitmap updates)
   - Expected speedup: 20-30% if bypassed more often via better caching

3. **Registry Lookup on Free Path** (MEDIUM: ~2-4 ns per free)
   - Hash computation + linear probe + validation on every cross-slab free
   - Could be eliminated with mimalloc-style pointer arithmetic
   - Expected speedup: 15-25% on free-heavy workloads

### Performance Gap Analysis

**Random Free Pattern** (bitmap's best case):
- hakmem: 68 M ops/sec (14.7 ns/op)
- mimalloc: 176 M ops/sec (5.7 ns/op)
- **Gap**: 2.6× slower (9 ns difference)

**Sequential LIFO Pattern** (free-list's best case):
- hakmem: 102 M ops/sec (9.8 ns/op)
- mimalloc: 942 M ops/sec (1.1 ns/op)
- **Gap**: 9.2× slower (8.7 ns difference)

**Key Insight**: Even on favorable patterns (random), we are 2.6× slower. The bottleneck is therefore NOT just the bitmap, but the entire allocation architecture.

### Expected Total Speedup

- Conservative: 2.0-2.5× (partially close the 2.6× gap)
- Optimistic: 3.0-4.0× (with aggressive optimizations)
- Realistic target: 2.5× (~170 M ops/sec on random, ~250 M ops/sec on LIFO)

---

## Critical Path Analysis

### Allocation Fast Path Walkthrough

Let me trace the exact execution path for `hak_tiny_alloc(16)` with step-by-step cycle estimates:

```c
// hakmem_tiny.c:557 - Entry point
void* hak_tiny_alloc(size_t size) {
    // Line 558: Initialization check
    if (!g_tiny_initialized) hak_tiny_init();        // BRANCH: ~1 cycle (predicted taken once)

    // Line 561-562: Wrapper context check
    extern int hak_in_wrapper(void);
    if (!g_wrap_tiny_enabled && hak_in_wrapper())    // BRANCH: ~1 cycle
        return NULL;

    // Line 565: Size to class conversion
    int class_idx = hak_tiny_size_to_class(size);    // INLINE: ~2 cycles (branch chain)
    if (class_idx < 0) return NULL;                  // BRANCH: ~1 cycle

    // Line 569-576: SuperSlab path (disabled by default)
    if (g_use_superslab) { /* ... */ }               // BRANCH: ~1 cycle (not taken)

    // Line 650-651: TLS Magazine initialization check
    tiny_mag_init_if_needed(class_idx);              // INLINE: ~3 cycles (conditional init)
    TinyTLSMag* mag = &g_tls_mags[class_idx];        // TLS ACCESS: ~2 cycles

    // Line 666-670: TLS Magazine fast path (BEST CASE)
    if (mag->top > 0) {                              // LOAD + BRANCH: ~2 cycles
        void* p = mag->items[--mag->top].ptr;        // LOAD + DEC + STORE: ~3 cycles
        stats_record_alloc(class_idx);               // INLINE: ~1 cycle (TLS increment)
        return p;                                    // RETURN: ~1 cycle
    }
    // TOTAL FAST PATH: ~18 cycles (~6 ns @ 3 GHz)

    // Line 673-674: TLS Active Slab lookup (MEDIUM PATH)
    TinySlab* tls = g_tls_active_slab_a[class_idx];  // TLS ACCESS: ~2 cycles
    if (!(tls && tls->free_count > 0))               // LOAD + BRANCH: ~3 cycles
        tls = g_tls_active_slab_b[class_idx];        // TLS ACCESS: ~2 cycles (if taken)

    if (tls && tls->free_count > 0) {                // BRANCH: ~1 cycle
        // Line 677-679: Remote drain check
        if (atomic_load(&tls->remote_count) >= thresh || rand() & mask) {
            tiny_remote_drain_owner(tls);            // RARE: ~50-200 cycles (if taken)
        }

        // Line 682-688: Mini-magazine fast path
        if (!mini_mag_is_empty(&tls->mini_mag)) {    // LOAD + BRANCH: ~2 cycles
            void* p = mini_mag_pop(&tls->mini_mag);  // INLINE: ~4 cycles (LIFO pop)
            if (p) {
                stats_record_alloc(class_idx);       // INLINE: ~1 cycle
                return p;                            // RETURN: ~1 cycle
            }
        }
        // MINI-MAG PATH: ~30 cycles (~10 ns)

        // Line 691-700: Batch refill from bitmap
        if (tls->free_count > 0 && mini_mag_is_empty(&tls->mini_mag)) {
            int refilled = batch_refill_from_bitmap(tls, &tls->mini_mag, 16);
            // REFILL COST: ~48 ns for 16 items = ~3 ns/item amortized
            if (refilled > 0) {
                void* p = mini_mag_pop(&tls->mini_mag);
                if (p) {
                    stats_record_alloc(class_idx);
                    return p;
                }
            }
        }
        // REFILL PATH: ~50 cycles (~17 ns) for the batch + ~10 ns for the next alloc

        // Line 703-713: Bitmap scan fallback
        if (tls->free_count > 0) {
            int block_idx = hak_tiny_find_free_block(tls);    // BITMAP SCAN: ~15-20 cycles
            if (block_idx >= 0) {
                hak_tiny_set_used(tls, block_idx);            // BITMAP UPDATE: ~10 cycles
                tls->free_count--;                            // STORE: ~1 cycle
                void* p = (char*)tls->base + (block_idx * bs); // COMPUTE: ~3 cycles
                stats_record_alloc(class_idx);                // INLINE: ~1 cycle
                return p;                                     // RETURN: ~1 cycle
            }
        }
        // BITMAP PATH: ~50 cycles (~17 ns)
    }

    // Line 717-718: Lock and refill from global pool (SLOW PATH)
    pthread_mutex_lock(lock);                        // LOCK: ~30-100 cycles (contended)
    // ... slow path: 200-1000 cycles (rare) ...
}
```

### Cycle Count Summary

| Path                 | Cycles | Latency (ns) | Frequency | Notes                  |
|----------------------|--------|--------------|-----------|------------------------|
| **TLS Magazine Hit** | ~18    | ~6 ns        | 60-80%    | Best case (cache hit)  |
| **Mini-Mag Hit**     | ~30    | ~10 ns       | 10-20%    | Good case (slab-local) |
| **Batch Refill**     | ~50    | ~17 ns       | 5-10%     | Amortized 3 ns/item    |
| **Bitmap Scan**      | ~50    | ~17 ns       | 5-10%     | Worst case before lock |
| **Global Lock Path** | ~300   | ~100 ns      | <5%       | Very rare (refill)     |

**Weighted Average**: 0.7×6 + 0.15×10 + 0.1×17 + 0.05×100 = **~12.4 ns/op** (theoretical)
**Measured Actual**: 9.8-14.7 ns/op (matches the model)

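As a sanity check, the weighted average can be recomputed directly from the path mix in the table. A minimal sketch (the function name is illustrative, not part of hakmem):

```c
#include <assert.h>
#include <math.h>

/* Recompute the frequency-weighted average latency from the table above. */
static double weighted_latency_ns(const double freq[], const double ns[], int n) {
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += freq[i] * ns[i];  /* frequency × per-path latency */
    return total;
}
```

With the mix {0.70, 0.15, 0.10, 0.05} and latencies {6, 10, 17, 100} ns this yields 12.4 ns/op, inside the measured 9.8-14.7 ns/op range.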
### Comparison with mimalloc's Approach

mimalloc achieves **1.1 ns/op** on the LIFO pattern by:

1. **No TLS Magazine Layer**: Direct access to the thread-local page free-list
2. **Intrusive Free-List**: 1 load + 1 store (~2 cycles) vs our ~18 cycles
3. **2MB Alignment**: O(1) pointer→slab via bit-masking (no registry lookup)
4. **No Bitmap**: Free-list only (trades random-access resistance for speed)

**hakmem's Architecture**:
```
Allocation Request
        ↓
TLS Magazine (2048 items)        ← 1st tier: ~6 ns (cache hit)
        ↓ (miss)
TLS Active Slab (2 per class)    ← 2nd tier: lookup cost
        ↓
Mini-Magazine (16-32 items)      ← 3rd tier: ~10 ns (LIFO pop)
        ↓ (miss)
Batch Refill (16 items)          ← 4th tier: ~3 ns amortized
        ↓ (miss)
Bitmap Scan (two-tier)           ← 5th tier: ~17 ns (expensive)
        ↓ (miss)
Global Lock + Slab Allocation    ← 6th tier: ~100+ ns (rare)
```

**mimalloc's Architecture**:
```
Allocation Request
        ↓
Thread-Local Page Free-List      ← 1st tier: ~1 ns (1 load + 1 store)
        ↓ (miss)
Thread-Local Page Queue          ← 2nd tier: ~5 ns (page switch)
        ↓ (miss)
Global Segment Allocation        ← 3rd tier: ~50 ns (rare)
```

**Key Difference**: mimalloc has 3 tiers; hakmem has 6. Each tier adds ~2-3 ns of overhead.

---

## Bottleneck #1: TLS Magazine Hierarchy Overhead

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 650-714 (allocation fast path)
- **Impact**: HIGH (affects 100% of allocations)

### Code Analysis

```c
// Line 650-651: 1st tier - TLS Magazine
tiny_mag_init_if_needed(class_idx);              // ~3 cycles (conditional check)
TinyTLSMag* mag = &g_tls_mags[class_idx];        // ~2 cycles (TLS base + offset)

// Line 666-670: TLS Magazine lookup
if (mag->top > 0) {                              // ~2 cycles (load + branch)
    void* p = mag->items[--mag->top].ptr;        // ~3 cycles (array access + decrement)
    stats_record_alloc(class_idx);               // ~1 cycle (TLS increment)
    return p;                                    // ~1 cycle
}
// TOTAL: ~12 cycles for a cache hit (BEST CASE)

// Line 673-674: 2nd tier - TLS Active Slab lookup
TinySlab* tls = g_tls_active_slab_a[class_idx];  // ~2 cycles (TLS access)
if (!(tls && tls->free_count > 0))               // ~3 cycles (2 loads + branch)
    tls = g_tls_active_slab_b[class_idx];        // ~2 cycles (if miss)

// Line 682-688: 3rd tier - Mini-Magazine
if (!mini_mag_is_empty(&tls->mini_mag)) {        // ~2 cycles (load slab->mini_mag.count)
    void* p = mini_mag_pop(&tls->mini_mag);      // ~4 cycles (LIFO pop: 2 loads + 1 store)
    if (p) { stats_record_alloc(class_idx); return p; }
}
// TOTAL: ~13 cycles for a mini-mag hit (MEDIUM CASE)
```

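The mini-magazine referenced above is a small per-slab LIFO stack. A self-contained sketch matching the stated pop cost of 2 loads + 1 store (the real `PageMiniMag` field layout is assumed here, not taken from the source):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a per-slab LIFO mini-magazine (layout assumed for illustration). */
typedef struct {
    void*    items[32];  /* cached free blocks */
    uint16_t count;      /* number of valid entries */
} PageMiniMagSketch;

static inline int mini_mag_is_empty_sk(const PageMiniMagSketch* m) { return m->count == 0; }

static inline void* mini_mag_pop_sk(PageMiniMagSketch* m) {
    /* 2 loads (count, slot) + 1 store (count) on the hit path */
    return m->count ? m->items[--m->count] : NULL;
}

static inline int mini_mag_push_sk(PageMiniMagSketch* m, void* p) {
    if (m->count >= 32) return 0;  /* full: caller falls back to the bitmap */
    m->items[m->count++] = p;
    return 1;
}
```

Pops return the most recently pushed block first, which is exactly the temporal-locality behavior the medium path exploits.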
### Why It's Slow

1. **Multiple TLS Accesses**: Each tier requires a TLS base lookup + offset calculation
   - `g_tls_mags[class_idx]` → TLS read #1
   - `g_tls_active_slab_a[class_idx]` → TLS read #2
   - `g_tls_active_slab_b[class_idx]` → TLS read #3 (conditional)
   - **Cost**: 2-3 cycles each × 3 = 6-9 cycles of overhead

2. **Cache Line Fragmentation**: TLS variables live in separate arrays
   - `g_tls_mags` = 16 KB per class, 128 KB total (2048 items × 8 bytes × 8 classes)
   - `g_tls_active_slab_a[8]` = 64 bytes
   - `g_tls_active_slab_b[8]` = 64 bytes
   - **Cost**: Spans many cache lines → potential cache misses

3. **Branch Misprediction**: The multi-tier fallback creates a branch chain
   - Magazine empty? → Check active slab A
   - Slab A empty? → Check active slab B
   - Mini-mag empty? → Refill from bitmap
   - **Cost**: 10-20 cycles of penalty per mispredicted branch

4. **Redundant Metadata**: Magazine items store `{void* ptr}` separately from slab pointers
   - Magazine items: 8 bytes per pointer (2048 × 8 = 16 KB per class)
   - Slab pointers: 8 bytes × 2 per class (16 bytes)
   - **Cost**: Memory overhead reduces cache efficiency

### Optimization: Unified TLS Cache Structure

**Before** (current):
```c
// Separate TLS arrays (fragmented in memory)
static __thread TinyMagItem g_tls_mags[TINY_NUM_CLASSES][TINY_TLS_MAG_CAP];
static __thread TinySlab* g_tls_active_slab_a[TINY_NUM_CLASSES];
static __thread TinySlab* g_tls_active_slab_b[TINY_NUM_CLASSES];
```

**After** (proposed):
```c
// Unified per-class TLS structure (cache-line aligned)
typedef struct __attribute__((aligned(64))) {
    // Hot fields first (magazine: 32 slots = 256 bytes = 4 cache lines)
    void*    mag_items[32];      // Reduced from 2048 to 32 (still effective)
    uint16_t mag_top;            // Current magazine count
    uint16_t mag_cap;            // Magazine capacity
    uint32_t _pad0;

    // Warm fields
    TinySlab*    active_slab;    // Primary active slab (no A/B split)
    PageMiniMag* mini_mag;       // Direct pointer to the slab's mini-mag
    uint64_t     last_refill_tsc; // For adaptive refill timing

    // Cold fields
    uint64_t stats_alloc_batch;  // Batched statistics
    uint64_t stats_free_batch;
} TinyTLSCache;

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];
```

**Benefits**:
1. Single TLS access: `g_tls_cache[class_idx]` (not 3 separate lookups)
2. Cache-line aligned: the hot fields are packed at the front of the structure
3. Reduced magazine size: 32 items (not 2048) saves ~15.75 KB per class
4. Direct mini-mag pointer: no slab→mini_mag indirection

**Expected Speedup**: 30-40% (reduce the fast path from ~12 cycles to ~7 cycles)

**Risk**: MEDIUM
- Requires refactoring TLS access patterns throughout the codebase
- The smaller magazine may increase refill frequency (trade-off)
- Needs careful testing to rule out regressions on multi-threaded workloads

---
## Bottleneck #2: Two-Tier Bitmap Traversal

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.h`
- **Lines**: 235-269 (`hak_tiny_find_free_block`)
- **Impact**: HIGH (affects only 5-15% of allocations, but is expensive when hit)

### Code Analysis

```c
// Line 235-269: Two-tier bitmap scan
static inline int hak_tiny_find_free_block(TinySlab* slab) {
    const int bw = g_tiny_bitmap_words[slab->class_idx];  // Bitmap words
    const int sw = slab->summary_words;                   // Summary words
    if (bw <= 0 || sw <= 0) return -1;

    int start_word = slab->hint_word % bw;  // Hint optimization
    int start_sw = start_word / 64;         // Summary word index
    int start_sb = start_word % 64;         // Summary bit offset

    // Line 244-267: Summary bitmap scan (outer loop)
    for (int k = 0; k < sw; k++) {          // ~sw iterations (1-128)
        int idx = start_sw + k;
        if (idx >= sw) idx -= sw;           // Wrap-around
        uint64_t bits = slab->summary[idx]; // LOAD: ~2 cycles

        // Mask optimization (skip processed bits)
        if (k == 0) {
            bits &= (~0ULL) << start_sb;    // BITWISE: ~1 cycle
        }
        if (idx == sw - 1 && (bw % 64) != 0) {
            uint64_t mask = (bw % 64) == 64 ? ~0ULL : ((1ULL << (bw % 64)) - 1ULL);
            bits &= mask;                   // BITWISE: ~1 cycle
        }
        if (bits == 0) continue;            // BRANCH: ~1 cycle (often taken)

        int woff = __builtin_ctzll(bits);   // CTZ #1: ~3 cycles
        int word_idx = idx * 64 + woff;     // COMPUTE: ~2 cycles
        if (word_idx >= bw) continue;       // BRANCH: ~1 cycle

        // Line 261-266: Main bitmap scan (inner)
        uint64_t used = slab->bitmap[word_idx];  // LOAD: ~2 cycles (cache-miss risk)
        uint64_t free_bits = ~used;              // BITWISE: ~1 cycle
        if (free_bits == 0) continue;            // BRANCH: ~1 cycle (rare)

        int bit_idx = __builtin_ctzll(free_bits);           // CTZ #2: ~3 cycles
        slab->hint_word = (uint16_t)((word_idx + 1) % bw);  // UPDATE HINT: ~2 cycles
        return word_idx * 64 + bit_idx;                     // RETURN: ~1 cycle
    }
    return -1;
}
// TYPICAL COST: 15-20 cycles (1-2 summary iterations, 1 main bitmap access)
// WORST CASE: 50-100 cycles (many summary words scanned, cache misses)
```

### Why It's Slow

1. **Two-Level Indirection**: Summary → Bitmap → Block
   - Summary scan: find a word with free bits (~5-10 cycles)
   - Main bitmap scan: find a bit within that word (~5 cycles)
   - **Cost**: 2× CTZ operations, 2× memory loads

2. **Cache Miss Risk**: A bitmap can be up to 1 KB (128 words × 8 bytes)
   - Class 0 (8B): 128 words = 1024 bytes
   - Class 1 (16B): 64 words = 512 bytes
   - Class 2 (32B): 32 words = 256 bytes
   - **Cost**: With many live slabs, bitmaps are evicted from L1 (32 KB) → L2 access (~10-20 cycles)

3. **Hint-Word State**: Requires an update on every allocation
   - Read hint_word (~1 cycle)
   - Compute the new hint (~2 cycles)
   - Write hint_word (~1 cycle)
   - **Cost**: ~4 cycles per allocation (not amortized)

4. **Branch-Heavy Loop**: Multiple branches per iteration
   - `if (bits == 0) continue;` (often taken when the summary is sparse)
   - `if (word_idx >= bw) continue;` (rare safety check)
   - `if (free_bits == 0) continue;` (rare but costly)
   - **Cost**: 10-20 cycles per mispredicted branch

### Optimization #1: Increase Mini-Magazine Capacity

**Rationale**: Avoid the bitmap scan by keeping more items in the mini-magazine

**Current**:
```c
// Line 344: Mini-magazine capacity
uint16_t mag_capacity = (class_idx <= 3) ? 32 : 16;
```

**Proposed**:
```c
// Increase capacity to reduce bitmap scan frequency
uint16_t mag_capacity = (class_idx <= 3) ? 64 : 32;
```

**Benefits**:
- Fewer bitmap scans (cost amortized over 64 items instead of 32)
- Better temporal locality (more items cached)

**Costs**:
- +256 bytes of memory per slab (32 extra 8-byte pointers)
- Slightly higher per-refill cost (64 items vs 32)

**Expected Speedup**: 10-15% (halves the bitmap-scan frequency)

**Risk**: LOW (simple parameter change, no logic changes)

### Optimization #2: Cache-Aware Bitmap Layout

**Rationale**: Ensure the bitmap stays cache-hot for the classes that need it

**Current**:
```c
// Separate bitmap allocation (may be cache-cold)
slab->bitmap = (uint64_t*)hkm_libc_calloc(bitmap_size, sizeof(uint64_t));
```

**Proposed**:
```c
// Embed small bitmaps directly in the slab structure
typedef struct TinySlab {
    // ... existing fields ...

    // Embedded bitmap for classes whose bitmap is ≤256 bytes
    union {
        uint64_t* bitmap_ptr;       // Larger bitmaps: heap-allocated
        uint64_t  bitmap_embed[32]; // Small bitmaps: embedded (256 bytes)
    };
    uint8_t bitmap_embedded;        // Flag: 1=embedded, 0=heap
} TinySlab;
```

**Benefits**:
- Classes 2 and up (≥32B blocks): the bitmap is ≤256 bytes and fits in the embedded array; classes 0-1 keep the heap pointer
- Bitmap and slab metadata sit in adjacent cache lines
- No separate heap allocation for the embedded case

**Expected Speedup**: 5-10% (fewer cache misses on bitmap access)

**Risk**: MEDIUM (requires refactoring the bitmap access logic)

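Call sites should not branch on the flag everywhere, so the union wants a single accessor. A sketch of the idea, with the embedded array shortened to 4 words and a hypothetical struct name so it stands alone:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the proposed embedded-vs-heap bitmap union (names illustrative;
   the real embedded array would be 32 words). */
typedef struct {
    union {
        uint64_t* bitmap_ptr;      /* larger bitmaps: heap-allocated */
        uint64_t  bitmap_embed[4]; /* small bitmaps: embedded */
    };
    uint8_t bitmap_embedded;       /* 1 = embedded, 0 = heap */
} TinySlabSketch;

/* One helper hides where the bitmap lives; scan code stays unchanged. */
static inline uint64_t* slab_bitmap(TinySlabSketch* s) {
    return s->bitmap_embedded ? s->bitmap_embed : s->bitmap_ptr;
}
```

The scan and set_used/set_free paths then call `slab_bitmap()` once and index the returned pointer as before.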
### Optimization #3: Lazy Summary Bitmap Update

**Rationale**: The summary-bitmap update is expensive, especially on the free path

**Current**:
```c
// Line 199-213: Summary update on every set_used/set_free
static inline void hak_tiny_set_used(TinySlab* slab, int block_idx) {
    // ... bitmap update ...

    // Update summary (EXPENSIVE)
    int sum_word = word_idx / 64;
    int sum_bit = word_idx % 64;
    uint64_t has_free = ~v;
    if (has_free != 0) {
        slab->summary[sum_word] |= (1ULL << sum_bit);   // WRITE
    } else {
        slab->summary[sum_word] &= ~(1ULL << sum_bit);  // WRITE
    }
}
```

**Proposed**:
```c
// Lazy summary update (rebuild only when scanning)
static inline void hak_tiny_set_used(TinySlab* slab, int block_idx) {
    // ... bitmap update ...
    // NO SUMMARY UPDATE (deferred; set_free marks slab->summary_stale = 1)
}

static inline int hak_tiny_find_free_block(TinySlab* slab) {
    // Rebuild summary if stale (rare)
    if (slab->summary_stale) {
        rebuild_summary_bitmap(slab); // O(N), but amortized over many allocations
        slab->summary_stale = 0;
    }
    // ... existing scan logic ...
}
```

**Benefits**:
- Eliminates the summary update on ~95% of operations (free path)
- The summary rebuild cost is amortized over many allocations

**Expected Speedup**: 15-20% on free-heavy workloads

**Risk**: MEDIUM (requires careful management of the stale bit)

---
## Bottleneck #3: Registry Lookup on Free Path

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 1102-1118 (`hak_tiny_free`)
- **Impact**: MEDIUM (affects cross-slab frees, ~30-50% of frees)

### Code Analysis

```c
// Line 1102-1118: Free path with registry lookup
void hak_tiny_free(void* ptr) {
    if (!ptr || !g_tiny_initialized) return;

    // Line 1106-1111: SuperSlab fast path (disabled by default)
    SuperSlab* ss = ptr_to_superslab(ptr);       // BITWISE: ~2 cycles
    if (ss && ss->magic == SUPERSLAB_MAGIC) {    // LOAD + BRANCH: ~3 cycles
        hak_tiny_free_superslab(ptr, ss);        // FAST PATH: ~5 ns
        return;
    }

    // Line 1114: Registry lookup (EXPENSIVE)
    TinySlab* slab = hak_tiny_owner_slab(ptr);   // LOOKUP: ~10-30 cycles
    if (!slab) return;

    hak_tiny_free_with_slab(ptr, slab);          // FREE: ~50-200 cycles
}

// hakmem_tiny.c:395-440 - Registry lookup implementation
TinySlab* hak_tiny_owner_slab(void* ptr) {
    if (!ptr || !g_tiny_initialized) return NULL;

    if (g_use_registry) {
        // O(1) hash table lookup
        uintptr_t slab_base = (uintptr_t)ptr & ~(TINY_SLAB_SIZE - 1); // BITWISE: ~2 cycles
        TinySlab* slab = registry_lookup(slab_base);  // FUNCTION CALL: ~20-50 cycles
        if (!slab) return NULL;

        // Validation (bounds check)
        uintptr_t start = (uintptr_t)slab->base;
        uintptr_t end = start + TINY_SLAB_SIZE;
        if ((uintptr_t)ptr < start || (uintptr_t)ptr >= end) {
            return NULL; // False positive
        }
        return slab;
    } else {
        // O(N) linear search (fallback)
        for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
            pthread_mutex_lock(lock);            // LOCK: ~30-100 cycles
            // Search free slabs
            for (TinySlab* slab = g_tiny_pool.free_slabs[class_idx]; slab; slab = slab->next) {
                // ... bounds check ...
            }
            pthread_mutex_unlock(lock);
        }
        return NULL;
    }
}

// Line 268-288: Registry lookup (hash table, linear probe)
static TinySlab* registry_lookup(uintptr_t slab_base) {
    int hash = registry_hash(slab_base);               // HASH: ~5 cycles

    for (int i = 0; i < SLAB_REGISTRY_MAX_PROBE; i++) {   // Up to 8 probes
        int idx = (hash + i) & SLAB_REGISTRY_MASK;        // BITWISE: ~2 cycles
        SlabRegistryEntry* entry = &g_slab_registry[idx]; // LOAD: ~2 cycles

        if (entry->slab_base == slab_base) {           // LOAD + BRANCH: ~3 cycles
            TinySlab* owner = entry->owner;            // LOAD: ~2 cycles
            return owner;
        }

        if (entry->slab_base == 0) {                   // LOAD + BRANCH: ~2 cycles
            return NULL; // Empty slot
        }
    }
    return NULL;
}
// TYPICAL COST: 20-30 cycles (1-2 probes, cache hit)
// WORST CASE: 50-100 cycles (8 probes, cache miss on the registry array)
```

### Why It's Slow

1. **Hash Computation**: The hash itself is cheap
   ```c
   static inline int registry_hash(uintptr_t slab_base) {
       return (slab_base >> 16) & SLAB_REGISTRY_MASK; // Simple, but...
   }
   ```
   - Shift + mask = 2 cycles (acceptable)
   - **BUT**: Linear probing on collision adds 10-30 cycles

2. **Linear Probing**: Up to 8 probes on collision
   - Each probe: load + compare + branch (3 cycles × 8 = 24 cycles worst case)
   - Registry size: 1024 entries (8 KB array)
   - **Cost**: Probes may span multiple cache lines → cache miss (10-20 cycle penalty)

3. **Validation Overhead**: Bounds check after the lookup
   - Load slab->base (2 cycles)
   - Compute the end address (1 cycle)
   - Compare twice (2 cycles)
   - **Cost**: 5 cycles per free (not amortized)

4. **Global Shared State**: The registry is shared across all threads
   - No cache-line alignment (false-sharing risk)
   - Lock-free reads → potential ABA problems
   - **Cost**: Atomic-load penalties (~5-10 cycles vs a normal load)

### Optimization #1: Enable SuperSlab by Default

**Rationale**: SuperSlab gives O(1) pointer→slab via 2MB alignment (mimalloc-style)

**Current**:
```c
// Line 81: SuperSlab disabled by default
static int g_use_superslab = 0; // Runtime toggle
```

**Proposed**:
```c
// Enable SuperSlab by default
static int g_use_superslab = 1; // Always on
```

**Benefits**:
- Eliminates the registry lookup entirely: `ptr & ~0x1FFFFF` (one AND operation)
- SuperSlab free path: ~5 ns (vs ~10-30 ns on the registry path)
- Better cache locality (2MB-aligned pages)

**Costs**:
- 2MB of address space per SuperSlab (not physical memory, thanks to lazy allocation)
- Slightly higher memory overhead (metadata at the SuperSlab level)

**Expected Speedup**: 20-30% on free-heavy workloads

**Risk**: LOW (SuperSlab is already implemented and was tested in Phase 6.23)

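The masking trick above can be shown in isolation. A minimal sketch, assuming SuperSlabs are 2 MiB sized and 2 MiB aligned (the macro name here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define SS_SIZE ((uintptr_t)2 * 1024 * 1024)  /* 2 MiB, assumed size and alignment */

/* O(1) pointer -> SuperSlab base: one AND replaces the whole registry probe. */
static inline uintptr_t superslab_base(uintptr_t ptr) {
    return ptr & ~(SS_SIZE - 1);
}
```

Any interior pointer maps back to its slab base with a single instruction, which is why mimalloc-style segment alignment makes the ownership lookup essentially free.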
### Optimization #2: Cache Last Freed Slab

**Rationale**: Temporal locality: the next free is likely to come from the same slab

**Proposed**:
```c
// Per-thread cache of the last freed slab
static __thread TinySlab* t_last_freed_slab[TINY_NUM_CLASSES] = {NULL};

void hak_tiny_free(void* ptr) {
    if (!ptr) return;

    // Try the cached slab first (likely hit)
    int class_idx = guess_class_for_ptr(ptr);  // Heuristic (see Risk below)
    TinySlab* slab = t_last_freed_slab[class_idx];

    // Validate that the pointer lies in this slab
    if (slab && ptr_in_slab_range(ptr, slab)) {
        hak_tiny_free_with_slab(ptr, slab);    // FAST PATH: ~5 ns
        return;
    }

    // Fall back to the registry lookup (rare)
    slab = hak_tiny_owner_slab(ptr);
    if (slab) {
        t_last_freed_slab[slab->class_idx] = slab; // Update the cache
        hak_tiny_free_with_slab(ptr, slab);
    }
}
```

**Benefits**:
- 80-90% cache-hit rate (temporal locality)
- Fast path: 2 loads + 2 compares (~5 cycles) vs the registry lookup (20-30 cycles)

**Expected Speedup**: 15-20% on free-heavy workloads

**Risk**: MEDIUM (needs a heuristic to guess class_idx from the pointer, which can mispredict)

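The `ptr_in_slab_range` check used by the fast path above is just two compares. A self-contained sketch, assuming a 64 KiB slab span (TINY_SLAB_SIZE is not stated explicitly in this report) and reducing the slab struct to the one field the check needs:

```c
#include <assert.h>
#include <stdint.h>

#define SLAB_SPAN ((uintptr_t)64 * 1024)  /* assumed TINY_SLAB_SIZE */

typedef struct { void* base; } SlabSketch; /* reduced TinySlab for the sketch */

/* Half-open range check: base <= p < base + span. */
static inline int ptr_in_slab_range_sk(const void* p, const SlabSketch* s) {
    uintptr_t a = (uintptr_t)p, b = (uintptr_t)s->base;
    return a >= b && a < b + SLAB_SPAN;
}
```

Because the cached-slab path validates with this check before freeing, a wrong class guess degrades to the registry fallback rather than corrupting a foreign slab.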
---
## Bottleneck #4: Statistics Collection Overhead

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny_stats.h`
- **Lines**: 59-73 (`stats_record_alloc`, `stats_record_free`)
- **Impact**: LOW (already optimized via TLS batching, but still ~0.5 ns per op)

### Code Analysis

```c
// Line 59-62: Allocation statistics (inline)
static inline void stats_record_alloc(int class_idx) __attribute__((always_inline));
static inline void stats_record_alloc(int class_idx) {
    t_alloc_batch[class_idx]++; // TLS INCREMENT: ~0.5-1 cycle
}

// Line 70-73: Free statistics (inline)
static inline void stats_record_free(int class_idx) __attribute__((always_inline));
static inline void stats_record_free(int class_idx) {
    t_free_batch[class_idx]++; // TLS INCREMENT: ~0.5-1 cycle
}
```

### Why It's (Slightly) Slow

1. **TLS Access Overhead**: Even TLS has a cost
   - TLS base register: %fs on x86-64 (implicit)
   - Offset calculation: `[%fs + class_idx*4]`
   - **Cost**: ~0.5 cycles (not zero)

2. **Cache Line Pollution**: TLS counters compete for L1 cache
   - `t_alloc_batch[8]` = 32 bytes
   - `t_free_batch[8]` = 32 bytes
   - **Cost**: 64 bytes of L1 cache (one cache line)

3. **Forced Inlining Everywhere**: `always_inline` cuts both ways
   - Guarantees inlining (good)
   - But keeps an increment inside every hot loop where the compiler might otherwise batch it (bad)
   - **Cost**: An increment per iteration instead of once outside the loop

### Optimization: Compile-Time Statistics Toggle

**Rationale**: Production builds don't need exact counts

**Proposed**:
```c
#ifdef HAKMEM_ENABLE_STATS
#define STATS_RECORD_ALLOC(cls) t_alloc_batch[cls]++
#define STATS_RECORD_FREE(cls)  t_free_batch[cls]++
#else
#define STATS_RECORD_ALLOC(cls) ((void)0)
#define STATS_RECORD_FREE(cls)  ((void)0)
#endif
```

**Benefits**:
- Zero overhead when stats are disabled
- The compiler can optimize away the dead code

**Expected Speedup**: 3-5% (small but measurable)

**Risk**: VERY LOW (compile-time flag, no runtime impact)

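The zero-cost claim is easy to verify: when the flag is absent, the macro expands to `((void)0)` and the counter is never touched. A minimal sketch of such a build (counter name mirrors the proposal; this file is compiled without `HAKMEM_ENABLE_STATS`):

```c
#include <assert.h>

static unsigned long t_alloc_batch[8]; /* stays all-zero in a no-stats build */

#ifdef HAKMEM_ENABLE_STATS
#define STATS_RECORD_ALLOC(cls) (t_alloc_batch[(cls)]++)
#else
#define STATS_RECORD_ALLOC(cls) ((void)0) /* compiles to nothing */
#endif
```

Rebuilding with `-DHAKMEM_ENABLE_STATS` restores the increments without touching any call site.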
---
## Bottleneck #5: Magazine Spill/Refill Lock Contention

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 880-939 (magazine spill under the class lock)
- **Impact**: MEDIUM (affects 5-10% of frees, when the magazine is full)

### Code Analysis

```c
// Line 880-939: Magazine spill (class lock held)
if (mag->top < cap) {
    // Fast path: push to the magazine (no lock)
    mag->items[mag->top].ptr = ptr;
    mag->top++;
    return;
}

// Spill half under the class lock
pthread_mutex_t* lock = &g_tiny_class_locks[class_idx].m;
pthread_mutex_lock(lock);                         // LOCK: ~30-100 cycles (contended)

int spill = cap / 2;                              // Spill 1024 items (for a 2048 cap)

for (int i = 0; i < spill && mag->top > 0; i++) {
    TinyMagItem it = mag->items[--mag->top];
    TinySlab* owner = hak_tiny_owner_slab(it.ptr); // LOOKUP: ~20-30 cycles × 1024
    if (!owner) continue;

    // Phase 4.1: Try a mini-magazine push (avoids the bitmap)
    if ((owner == tls_a || owner == tls_b) && !mini_mag_is_full(&owner->mini_mag)) {
        mini_mag_push(&owner->mini_mag, it.ptr);   // FAST: ~4 cycles
        continue;
    }

    // Slow path: bitmap update
    size_t bs = g_tiny_class_sizes[owner->class_idx];
    int idx = ((uintptr_t)it.ptr - (uintptr_t)owner->base) / bs; // DIV: ~10 cycles
    if (hak_tiny_is_used(owner, idx)) {
        hak_tiny_set_free(owner, idx);             // BITMAP: ~10 cycles
        owner->free_count++;
        // ... list management ...
    }
}

pthread_mutex_unlock(lock);
// TOTAL SPILL COST: ~50,000-100,000 cycles (1024 items × 50-100 cycles/item)
// Amortized: 50-100 ns per free (when a spill happens every ~1000 frees)
```

### Why It's Slow

1. **Lock Hold Time**: The lock is held for the entire spill (1024 items)
   - Blocks other threads from taking the class lock
   - A spill takes ~17-33 µs (50k-100k cycles at 3 GHz) → other threads stall
   - **Cost**: Contention penalty on multi-threaded workloads

2. **Registry Lookups in the Loop**: 1024 lookups under the lock
   - `hak_tiny_owner_slab(it.ptr)` is called 1024 times
   - Each lookup: 20-30 cycles
   - **Cost**: 20,000-30,000 cycles just for the lookups

3. **Division in the Hot Loop**: The block-index calculation uses division
   - `int idx = ((uintptr_t)it.ptr - (uintptr_t)owner->base) / bs;`
   - Division costs ~10 cycles on modern CPUs (not fully pipelined)
   - **Cost**: ~10,000 cycles for 1024 divisions

4. **Large Spill Batch**: 1024 items is too many
   - Amortizes the lock cost well (good)
   - But lengthens the lock hold time (bad)
   - The trade-off has not been tuned

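The division in item 3 can also be removed outright: the tiny class sizes (8, 16, 32, ..., 256) are all powers of two, so a precomputed log2 of the block size turns the divide into a shift. A sketch (helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Division-free block index: with power-of-two block sizes,
   (ptr - base) / bs becomes a right shift by log2(bs). */
static inline int block_index_shift(uintptr_t ptr, uintptr_t base, unsigned log2_bs) {
    return (int)((ptr - base) >> log2_bs);  /* ~1 cycle instead of ~10 for DIV */
}
```

Storing `log2_bs` per class (or per slab) costs one byte and removes ~10,000 cycles from every full spill.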
### Optimization #1: Reduce Spill Batch Size

**Rationale**: Smaller batches = shorter lock hold times = less contention

**Current**:
```c
int spill = cap / 2; // 1024 items for a 2048 cap
```

**Proposed**:
```c
int spill = 128; // Fixed batch size (not cap-dependent)
```

**Benefits**:
- Shorter lock hold time: ~2-4 µs (vs ~17-33 µs for 1024 items)
- Better multi-thread responsiveness

**Costs**:
- More frequent spills (8× more often)
- Slightly higher total lock overhead

**Expected Speedup**: 10-15% on multi-threaded workloads

**Risk**: LOW (simple parameter change)

### Optimization #2: Lock-Free Spill Stack

**Rationale**: Avoid the lock entirely on the spill path

**Proposed**:
```c
// Per-class global spill stack (lock-free Treiber stack; drained by one consumer)
static atomic_uintptr_t g_spill_stack[TINY_NUM_CLASSES];

void magazine_spill_lockfree(int class_idx, void* ptr) {
    // Push onto the lock-free stack (intrusive next-pointer inside the block)
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&g_spill_stack[class_idx], memory_order_acquire);
        *((uintptr_t*)ptr) = old_head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_spill_stack[class_idx], &old_head, (uintptr_t)ptr,
                 memory_order_release, memory_order_relaxed));
}

// A background thread drains the spill stacks periodically
void background_drain_spill_stack(void) {
    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        uintptr_t head = atomic_exchange_explicit(&g_spill_stack[i], 0,
                                                  memory_order_acq_rel);
        if (!head) continue;

        pthread_mutex_lock(&g_tiny_class_locks[i].m);
        // ... drain to bitmap ...
        pthread_mutex_unlock(&g_tiny_class_locks[i].m);
    }
}
```

**Benefits**:
- Zero lock contention on the spill path
- A fast atomic CAS (~5-10 cycles)

**Costs**:
- Requires a background thread or a periodic drain
- Slightly more complex memory management

**Expected Speedup**: 20-30% on multi-threaded workloads

**Risk**: HIGH (requires careful design of the background drain mechanism)

---

## Bottleneck #6: Branch Misprediction in Size Class Lookup

### Location
- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.h`
- **Lines**: 159-182 (`hak_tiny_size_to_class`)
- **Impact**: LOW (only 1-2 ns per allocation, but paid on every allocation)

### Code Analysis

```c
// Line 159-182: Size to class lookup (branch chain)
static inline int hak_tiny_size_to_class(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return -1; // BRANCH: ~1 cycle

    // Branch chain (8 branches for 8 classes)
    if (size <= 8)   return 0; // BRANCH: ~1 cycle
    if (size <= 16)  return 1; // BRANCH: ~1 cycle
    if (size <= 32)  return 2; // BRANCH: ~1 cycle
    if (size <= 64)  return 3; // BRANCH: ~1 cycle
    if (size <= 128) return 4; // BRANCH: ~1 cycle
    if (size <= 256) return 5; // BRANCH: ~1 cycle
    if (size <= 512) return 6; // BRANCH: ~1 cycle
    return 7; // size <= 1024
}
// TYPICAL COST: 3-5 cycles (3-4 branches taken)
// WORST CASE: 8 cycles (all branches checked)
```
### Why It's (Slightly) Slow

1. **Unpredictable Size Distribution**: The branch predictor can't learn the pattern
   - Real-world allocation sizes are quasi-random
   - Size 16 is most common (~33%), but the rest vary
   - **Cost**: ~20-30% branch misprediction rate (~10-cycle penalty each)

2. **Sequential Dependency**: Each branch depends on the previous one
   - The CPU can't evaluate the branches in parallel
   - They must resolve in order
   - **Cost**: No instruction-level parallelism (ILP)
### Optimization: Branchless Lookup Table

**Rationale**: Use a lookup table for small sizes and CLZ (count leading zeros) for O(1) class lookup

**Proposed**:
```c
// Lookup table for size → class (branchless).
// int8_t so -1 is representable; 129 entries so index 128 is in bounds.
static const int8_t g_size_to_class_table[129] = {
    // size 0-7: class -1 (invalid)
    -1, -1, -1, -1, -1, -1, -1, -1,
    // size 8: class 0
    0,
    // size 9-16: class 1
    1, 1, 1, 1, 1, 1, 1, 1,
    // size 17-32: class 2
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    // size 33-64: class 3
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    // size 65-128: class 4
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
};

static inline int hak_tiny_size_to_class(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return -1;

    // Fast path: direct table lookup for small sizes
    if (size <= 128) {
        return g_size_to_class_table[size]; // LOAD: ~2 cycles (L1 cache)
    }

    // Slow path: CLZ-based for the larger power-of-2 classes
    // size 129-256  → class 5
    // size 257-512  → class 6
    // size 513-1024 → class 7
    int clz = __builtin_clzll(size - 1); // CLZ: ~3 cycles
    return 61 - clz; // class = (63 - clz) - 2 for power-of-2 classes
}
// TYPICAL COST: 2-3 cycles (table lookup, no branches)
```

**Benefits**:
- Branchless for common sizes (8-128B covers 80%+ of allocations)
- Table fits in L1 cache (129 bytes ≈ 2 cache lines)
- Predictable performance (no branch misprediction)

**Expected Speedup**: 2-3% (reduce ~5 cycles to 2-3 cycles)

**Risk**: VERY LOW (table is static, no runtime overhead)
---

## Bottleneck #7: Remote Free Drain Overhead

### Location
- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 146-184 (`tiny_remote_drain_locked`)
- **Impact**: LOW (only affects cross-thread frees, ~10-20% of workloads)

### Code Analysis

```c
// Line 146-184: Remote free drain (under class lock)
static void tiny_remote_drain_locked(TinySlab* slab) {
    uintptr_t head = atomic_exchange_explicit(&slab->remote_head, 0,
                                              memory_order_acq_rel); // ATOMIC: ~10 cycles
    unsigned drained = 0;

    while (head) { // LOOP: variable iterations
        void* p = (void*)head;
        head = *((uintptr_t*)p); // LOAD NEXT: ~2 cycles

        // Calculate block index
        size_t block_size = g_tiny_class_sizes[slab->class_idx]; // LOAD: ~2 cycles
        uintptr_t offset = (uintptr_t)p - (uintptr_t)slab->base; // SUBTRACT: ~1 cycle
        int block_idx = offset / block_size; // DIVIDE: ~10 cycles

        // Skip if already free (idempotent)
        if (!hak_tiny_is_used(slab, block_idx)) continue; // BITMAP CHECK: ~5 cycles

        hak_tiny_set_free(slab, block_idx); // BITMAP UPDATE: ~10 cycles

        int was_full = (slab->free_count == 0); // LOAD: ~1 cycle
        slab->free_count++; // INCREMENT: ~1 cycle

        if (was_full) {
            move_to_free_list(slab->class_idx, slab); // LIST UPDATE: ~20-50 cycles (rare)
        }

        if (slab->free_count == slab->total_count) {
            // ... slab release logic ... (rare)
            release_slab(slab); // EXPENSIVE: ~1000 cycles (very rare)
            break;
        }

        g_tiny_pool.free_count[slab->class_idx]++; // GLOBAL INCREMENT: ~1 cycle
        drained++;
    }

    if (drained) atomic_fetch_sub_explicit(&slab->remote_count, drained,
                                           memory_order_relaxed); // ATOMIC: ~10 cycles
}
// TYPICAL COST: 50-100 cycles per drained block (moderate)
// WORST CASE: 1000+ cycles (slab release)
```
### Why It's Slow

1. **Division in Loop**: Block index calculation uses division
   - `int block_idx = offset / block_size;`
   - Division is ~10 cycles (even on modern CPUs)
   - **Cost**: 10 cycles × N remote frees

2. **Atomic Operations**: 2 atomic ops per drain (exchange + fetch_sub)
   - `atomic_exchange` at the start (~10 cycles)
   - `atomic_fetch_sub` at the end (~10 cycles)
   - **Cost**: ~20 cycles overhead (per drain, not per block, but still expensive)

3. **Bitmap Update**: Same as the allocation path
   - `hak_tiny_set_free` updates both the bitmap and the summary
   - **Cost**: ~10 cycles per block
### Optimization: Multiplication-Based Division

**Rationale**: Replace division with multiplication by a pre-computed reciprocal

**Current**:
```c
int block_idx = offset / block_size; // DIVIDE: ~10 cycles
```

**Proposed**:
```c
// Pre-computed reciprocals (magic constants)
// Computed as: (1ULL << 48) / block_size — exact, since every block size
// is a power of two. Allows: block_idx = (offset * reciprocal) >> 48
static const uint64_t g_tiny_block_reciprocals[TINY_NUM_CLASSES] = {
    (1ULL << 48) / 8,    // Class 0: 8B    (0x200000000000)
    (1ULL << 48) / 16,   // Class 1: 16B   (0x100000000000)
    (1ULL << 48) / 32,   // Class 2: 32B   (0x080000000000)
    (1ULL << 48) / 64,   // Class 3: 64B   (0x040000000000)
    (1ULL << 48) / 128,  // Class 4: 128B  (0x020000000000)
    (1ULL << 48) / 256,  // Class 5: 256B  (0x010000000000)
    (1ULL << 48) / 512,  // Class 6: 512B  (0x008000000000)
    (1ULL << 48) / 1024, // Class 7: 1024B (0x004000000000)
};

// Fast division using multiplication
int block_idx = (int)((offset * g_tiny_block_reciprocals[slab->class_idx]) >> 48); // MUL + SHIFT: ~3 cycles
```

**Benefits**:
- Reduce ~10 cycles to ~3 cycles per division
- Saves ~7 cycles per remote free

**Expected Speedup**: 5-10% on cross-thread workloads

**Risk**: VERY LOW (well-known compiler optimization, applied manually)
---

## Profiling Plan

### perf Commands to Run

```bash
# 1. CPU cycle breakdown (identify hotspots)
perf record -e cycles:u -g ./bench_comprehensive
perf report --stdio --no-children | head -100 > perf_cycles.txt

# 2. Cache miss analysis (L1d, L1i, LLC)
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,\
L1-icache-loads,L1-icache-load-misses,LLC-loads,LLC-load-misses \
  ./bench_comprehensive

# 3. Branch misprediction rate
perf stat -e cycles,instructions,branches,branch-misses \
  ./bench_comprehensive

# 4. TLB miss analysis (address translation overhead)
perf stat -e cycles,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
  ./bench_comprehensive

# 5. Function-level profiling (annotated source)
perf record -e cycles:u --call-graph dwarf ./bench_comprehensive
perf report --stdio --sort symbol --percent-limit 1

# 6. Memory bandwidth utilization
perf stat -e cycles,mem_load_retired.l1_hit,mem_load_retired.l1_miss,\
mem_load_retired.l2_hit,mem_load_retired.l3_hit,mem_load_retired.l3_miss \
  ./bench_comprehensive

# 7. Allocation-specific hotspots (focus on hak_tiny_alloc)
# (record first, then filter the report — grepping perf record's
#  terminal output would miss the profile data)
perf record -e cycles:u --call-graph dwarf ./bench_comprehensive
perf report --stdio | grep "hak_tiny"
```
### Expected Hotspots to Validate

Based on code analysis, we expect to see:

1. **hak_tiny_find_free_block** (15-25% of cycles)
   - Two-tier bitmap scan
   - CTZ operations
   - Cache misses on large bitmaps

2. **hak_tiny_set_used / hak_tiny_set_free** (10-15% of cycles)
   - Bitmap updates
   - Summary bitmap updates
   - Write-heavy (cache line bouncing)

3. **hak_tiny_owner_slab** (10-20% of cycles on the free path)
   - Registry lookup
   - Hash computation
   - Linear probing

4. **tiny_mag_init_if_needed** (5-10% of cycles)
   - TLS access
   - Conditional initialization

5. **stats_record_alloc / stats_record_free** (3-5% of cycles)
   - TLS counter increments
   - Cache line pollution
### Validation Criteria

**Cache Miss Rates**:
- L1d miss rate: < 5% (good), 5-10% (acceptable), > 10% (poor)
- LLC miss rate: < 1% (good), 1-3% (acceptable), > 3% (poor)

**Branch Misprediction**:
- Misprediction rate: < 2% (good), 2-5% (acceptable), > 5% (poor)
- Expected: 3-4% (due to unpredictable size classes)

**IPC (Instructions Per Cycle)**:
- IPC: > 2.0 (good), 1.5-2.0 (acceptable), < 1.5 (poor)
- Expected: 1.5-1.8 (memory-bound, not compute-bound)

**Function Time Distribution**:
- hak_tiny_alloc: 40-60% (hot path)
- hak_tiny_free: 20-30% (warm path)
- hak_tiny_find_free_block: 10-20% (expensive when hit)
- Other: < 10%
---

## Optimization Roadmap

### Quick Wins (< 1 hour, Low Risk)

1. **Enable SuperSlab by Default** (Bottleneck #3)
   - Change: `g_use_superslab = 1;`
   - Impact: 20-30% speedup on free path
   - Risk: VERY LOW (already implemented)
   - Effort: 5 minutes

2. **Disable Statistics in Production** (Bottleneck #4)
   - Change: Add `#ifndef HAKMEM_ENABLE_STATS` guards
   - Impact: 3-5% speedup
   - Risk: VERY LOW (compile-time flag)
   - Effort: 15 minutes

3. **Increase Mini-Magazine Capacity** (Bottleneck #2)
   - Change: `mag_capacity = 64` (was 32)
   - Impact: 10-15% speedup (fewer bitmap scans)
   - Risk: LOW (slight memory increase)
   - Effort: 5 minutes

4. **Branchless Size Class Lookup** (Bottleneck #6)
   - Change: Use a lookup table for common sizes
   - Impact: 2-3% speedup
   - Risk: VERY LOW (static table)
   - Effort: 30 minutes

**Total Expected Speedup: 35-53%** (conservative: 1.4-1.5×)
### Medium Effort (1-4 hours, Medium Risk)

5. **Unified TLS Cache Structure** (Bottleneck #1)
   - Change: Merge TLS arrays into a single cache-aligned struct
   - Impact: 30-40% speedup on fast path
   - Risk: MEDIUM (requires refactoring)
   - Effort: 3-4 hours

6. **Reduce Magazine Spill Batch** (Bottleneck #5)
   - Change: `spill = 128` (was 1024)
   - Impact: 10-15% speedup on multi-threaded workloads
   - Risk: LOW (parameter tuning)
   - Effort: 30 minutes

7. **Cache-Aware Bitmap Layout** (Bottleneck #2)
   - Change: Embed small bitmaps in the slab structure
   - Impact: 5-10% speedup
   - Risk: MEDIUM (requires struct changes)
   - Effort: 2-3 hours

8. **Multiplication-Based Division** (Bottleneck #7)
   - Change: Replace division with mul+shift
   - Impact: 5-10% speedup on remote frees
   - Risk: VERY LOW (well-known optimization)
   - Effort: 1 hour

**Total Expected Speedup: 50-85%** (conservative: 1.5-1.8×)
### Major Refactors (> 4 hours, High Risk)

9. **Lock-Free Spill Stack** (Bottleneck #5)
   - Change: Use an atomic MPSC stack for magazine spill
   - Impact: 20-30% speedup on multi-threaded workloads
   - Risk: HIGH (complex concurrency)
   - Effort: 8-12 hours

10. **Lazy Summary Bitmap Update** (Bottleneck #2)
    - Change: Rebuild the summary only when scanning
    - Impact: 15-20% speedup on free-heavy workloads
    - Risk: MEDIUM (requires careful staleness tracking)
    - Effort: 4-6 hours

11. **Collapse TLS Magazine Tiers** (Bottleneck #1)
    - Change: Merge magazine + mini-mag into a single LIFO
    - Impact: 40-50% speedup (eliminates tier overhead)
    - Risk: HIGH (major architectural change)
    - Effort: 12-16 hours

12. **Full mimalloc-Style Rewrite** (All Bottlenecks)
    - Change: Replace the bitmap with an intrusive free list
    - Impact: 5-9× speedup (match mimalloc)
    - Risk: VERY HIGH (complete redesign)
    - Effort: 40+ hours

**Total Expected Speedup: 75-150%** (optimistic: 1.8-2.5×)
---

## Risk Assessment Summary

### Low Risk Optimizations (Safe to implement immediately)

- SuperSlab enable
- Statistics compile-time toggle
- Mini-mag capacity increase
- Branchless size lookup
- Multiplication-based division
- Magazine spill batch reduction

**Expected: 1.4-1.6× speedup, 2-3 hours effort**

### Medium Risk Optimizations (Test thoroughly)

- Unified TLS cache structure
- Cache-aware bitmap layout
- Lazy summary update

**Expected: 1.6-2.0× speedup, 6-10 hours effort**

### High Risk Optimizations (Prototype first)

- Lock-free spill stack
- Magazine tier collapse
- Full mimalloc rewrite

**Expected: 2.0-9.0× speedup, 20-60 hours effort**
---

## Estimated Speedup Summary

### Conservative Target (Low + Medium optimizations)

- **Random pattern**: 68 M ops/sec → **140 M ops/sec** (2.0× speedup)
- **LIFO pattern**: 102 M ops/sec → **200 M ops/sec** (2.0× speedup)
- **Gap to mimalloc**: 2.6× → **1.3×** (close ~50% of the gap)

### Optimistic Target (All optimizations)

- **Random pattern**: 68 M ops/sec → **170 M ops/sec** (2.5× speedup)
- **LIFO pattern**: 102 M ops/sec → **450 M ops/sec** (4.4× speedup)
- **Gap to mimalloc**: 2.6× → **1.0×** (match on random, 2× on LIFO)

---

## Conclusion

The hakmem allocator's 2.6× gap to mimalloc on favorable patterns (random free) is primarily due to:

1. **Architectural overhead**: a 6-tier allocation hierarchy vs mimalloc's 3-tier
2. **Bitmap traversal cost**: the two-tier scan adds 15-20 cycles even when optimized
3. **Registry lookup overhead**: the hash-table lookup adds 20-30 cycles on the free path

**Quick wins** (1-3 hours of effort) can achieve a **1.4-1.6× speedup**.
**Medium effort** (~10 hours) can achieve a **1.8-2.0× speedup**.
A **full mimalloc-style rewrite** (40+ hours) is needed to match mimalloc's ~1.1 ns/op.

**Recommendation**: Implement the quick wins first (SuperSlab + stats disable + branchless lookup), measure the results with `perf`, then decide whether the medium-effort optimizations are worth the added complexity.