Problem: The warm pool had a 0% hit rate (only 1 hit per 3976 misses) despite being implemented, so every cache miss went through an expensive superslab_refill registry scan.

Root Cause Analysis:
- The warm pool was initialized once and received a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- The next refill pushed another single slab, which was immediately exhausted
- The pool oscillated between 0 and 1 items, yielding a 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refills and prefill the pool with 3 additional HOT superslabs before attempting to carve. This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified the unified_cache_refill() cold path to detect an empty pool
- Added a prefill loop: when the pool count == 0, load 3 extra superslabs (sketched below)
- Store the extra slabs in the warm pool, keep 1 in TLS for immediate carving
- Track prefill events in the g_warm_pool_stats[].prefilled counter

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After: C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled; ENV-gated printing via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider an adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+, pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
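A minimal sketch of the prefill-on-empty path described in the change above. The helper names (`warm_pool_pop`, `warm_pool_push`, `superslab_refill_one`, `warm_pool_acquire`) and the `hits`/`misses` stat fields are assumptions for illustration; only the `prefilled` counter and the budget of 3 come from the change itself.

```c
/* Sketch only: secondary prefill on warm-pool miss (helper names are hypothetical). */
typedef struct SuperSlab SuperSlab;

/* Assumed helpers: pop/push a HOT superslab, and pull one superslab via the
 * registry-scanning refill path this change is trying to avoid. */
SuperSlab* warm_pool_pop(int class_idx);
int        warm_pool_push(int class_idx, SuperSlab* ss);
SuperSlab* superslab_refill_one(int class_idx);

struct { long hits, misses, prefilled; } g_warm_pool_stats[64];

#define WARM_POOL_PREFILL_BUDGET 3   /* hardcoded budget from this change */

static SuperSlab* warm_pool_acquire(int class_idx) {
    SuperSlab* ss = warm_pool_pop(class_idx);
    if (ss) {
        g_warm_pool_stats[class_idx].hits++;      /* warm hit: no registry scan */
        return ss;
    }
    g_warm_pool_stats[class_idx].misses++;

    /* Cold path: refill one superslab for immediate carving... */
    ss = superslab_refill_one(class_idx);

    /* ...then prefill extra HOT superslabs so the pool survives the next misses. */
    for (int i = 0; i < WARM_POOL_PREFILL_BUDGET; i++) {
        SuperSlab* extra = superslab_refill_one(class_idx);
        if (!extra || !warm_pool_push(class_idx, extra)) break;
    }
    g_warm_pool_stats[class_idx].prefilled++;
    return ss;   /* caller keeps this one in TLS and carves from it */
}
```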
HAKMEM Performance Profile Analysis: CPU Cycle Bottleneck Investigation
Benchmark: bench_tiny_hot (64-byte allocations, 20M operations)
Date: 2025-12-04
Objective: Identify where HAKMEM spends CPU cycles compared to mimalloc (7.88x slower)
Executive Summary
HAKMEM is 7.88x slower than mimalloc on tiny hot allocations (48.8 vs 6.2 cycles/op). The performance gap comes from 4 main sources:
- Malloc overhead (32.4% of gap): Complex wrapper logic, environment checks, initialization barriers
- Free overhead (29.4% of gap): Multi-layer free path with validation and routing
- Cache refill (15.7% of gap): Expensive superslab metadata lookups and validation
- Infrastructure (22.5% of gap): Cache misses, branch mispredictions, diagnostic code
Key Finding: Cache Miss Penalty Dominates
- 238M cycles lost to cache misses (24.4% of total runtime!)
- HAKMEM has 20.3x more cache misses than mimalloc (1.19M vs 58.7K)
- L1 D-cache misses are 97.7x higher (4.29M vs 43.9K)
Detailed Performance Metrics
Overall Comparison
| Metric | HAKMEM | mimalloc | Ratio |
|---|---|---|---|
| Total Cycles | 975,602,722 | 123,838,496 | 7.88x |
| Total Instructions | 3,782,043,459 | 515,485,797 | 7.34x |
| Cycles per op | 48.8 | 6.2 | 7.88x |
| Instructions per op | 189.1 | 25.8 | 7.34x |
| IPC (inst/cycle) | 3.88 | 4.16 | 0.93x |
| Cache misses | 1,191,800 | 58,727 | 20.29x |
| Cache misses per 1K ops | 59.59 | 2.94 | 20.29x |
| Branch misses | 1,497,133 | 58,943 | 25.40x |
| Branch miss rate | 0.17% | 0.05% | 3.20x |
| L1 D-cache misses | 4,291,649 | 43,913 | 97.73x |
| L1 miss rate | 0.41% | 0.03% | 13.88x |
IPC Analysis
- HAKMEM IPC: 3.88 (good, but memory-bound)
- mimalloc IPC: 4.16 (better, less memory stall)
- Interpretation: Both have high IPC, but HAKMEM is bottlenecked by memory access patterns
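These derived figures follow directly from the raw counters over the 20M-operation run: for HAKMEM, 975,602,722 cycles / 20M ops ≈ 48.8 cycles/op and 3,782,043,459 instructions / 975,602,722 cycles ≈ 3.88 IPC; for mimalloc, 123,838,496 / 20M ≈ 6.2 cycles/op and 515,485,797 / 123,838,496 ≈ 4.16 IPC.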
Function-Level Cycle Breakdown
HAKMEM: Where Cycles Are Spent
| Function | % | Total Cycles | Cycles/op | Category |
|---|---|---|---|---|
| malloc | 33.32% | 325,070,826 | 16.25 | Hot path allocation |
| unified_cache_refill | 13.67% | 133,364,892 | 6.67 | Cache miss handler |
| free.part.0 | 12.22% | 119,218,652 | 5.96 | Free wrapper |
| main (benchmark) | 12.07% | 117,755,248 | 5.89 | Test harness |
| hak_free_at.constprop.0 | 11.55% | 112,682,114 | 5.63 | Free routing |
| hak_tiny_free_fast_v2 | 8.11% | 79,121,380 | 3.96 | Free fast path |
| kernel/other | 9.06% | 88,389,606 | 4.42 | Syscalls, page faults |
| TOTAL | 100% | 975,602,722 | 48.78 |
mimalloc: Where Cycles Are Spent
| Function | % | Total Cycles | Cycles/op | Category |
|---|---|---|---|---|
| operator delete[] | 48.66% | 60,259,812 | 3.01 | Free path |
| malloc | 39.82% | 49,312,489 | 2.47 | Allocation path |
| kernel/other | 6.77% | 8,383,866 | 0.42 | Syscalls, page faults |
| main (benchmark) | 4.75% | 5,882,328 | 0.29 | Test harness |
| TOTAL | 100% | 123,838,496 | 6.19 |
Insight: HAKMEM Fragmentation
- mimalloc concentrates 88.5% of cycles in malloc/free
- HAKMEM spreads across 6 functions (malloc + 3 free variants + refill + wrapper)
- Recommendation: Consolidate hot path to reduce function call overhead
Cache Miss Deep Dive
Cache Misses by Function (HAKMEM)
| Function | % | Cache Misses | Misses/op | Impact |
|---|---|---|---|---|
| malloc | 58.51% | 697,322 | 0.0349 | CRITICAL |
| unified_cache_refill | 29.92% | 356,586 | 0.0178 | HIGH |
| Other | 11.57% | 137,892 | 0.0069 | Low |
Estimated Penalty
- Cache miss penalty: 238,360,000 cycles (assuming ~200 cycles/LLC miss)
- Per operation: 11.9 cycles lost to cache misses
- Percentage of total: 24.4% of all cycles
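Spelled out: 1,191,800 LLC misses × ~200 cycles ≈ 238,360,000 cycles; spread over 20M operations that is ~11.9 cycles/op, and 238.36M / 975.6M total cycles ≈ 24.4%.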
Root Causes
- malloc (58% of cache misses):
  - Pointer chasing through TLS → cache → metadata
  - Multiple indirections: `g_tls_slabs[class_idx]` → `tls->ss` → `tls->meta`
  - Cold metadata access patterns
- unified_cache_refill (30% of cache misses):
  - SuperSlab metadata lookups via `hak_super_lookup(p)`
  - Freelist traversal: `tiny_next_read()` on cold pointers
  - Validation logic: multiple metadata accesses per block
Branch Misprediction Analysis
Branch Misses by Function (HAKMEM)
| Function | % | Branch Misses | Misses/op | Impact |
|---|---|---|---|---|
| malloc | 21.59% | 323,231 | 0.0162 | Moderate |
| unified_cache_refill | 10.35% | 154,953 | 0.0077 | Moderate |
| free.part.0 | 3.80% | 56,891 | 0.0028 | Low |
| main | 3.66% | 54,795 | 0.0027 | (Benchmark) |
| hak_free_at | 3.49% | 52,249 | 0.0026 | Low |
| hak_tiny_free_fast_v2 | 3.11% | 46,560 | 0.0023 | Low |
Estimated Penalty
- Branch miss penalty: 22,456,995 cycles (assuming ~15 cycles/miss)
- Per operation: 1.1 cycles lost to branch misses
- Percentage of total: 2.3% of all cycles
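Spelled out the same way: 1,497,133 branch misses × ~15 cycles ≈ 22,456,995 cycles, i.e. ~1.1 cycles/op over 20M operations and ~2.3% of the 975.6M total.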
Root Causes
- Unpredictable control flow:
  - Environment variable checks: `if (g_wrapper_env)`, `if (g_enable)`
  - Initialization barriers: `if (!g_initialized)`, `if (g_initializing)`
  - Multi-way routing: `if (cache miss) → refill; if (freelist) → pop; else → carve`
- malloc wrapper overhead (lines 7795-78a3 in disassembly):
  - 20+ conditional branches before reaching the fast path
  - Lazy initialization checks
  - Diagnostic tracing (`lock incl g_wrap_malloc_trace_count`)
Top 3 Bottlenecks & Recommendations
🔴 Bottleneck #1: Cache Misses in malloc (16.25 cycles/op, 58% of misses)
Problem:
- Complex TLS access pattern: `g_tls_sll[class_idx].head` requires a cache line load
- Unified cache lookup: `g_unified_cache[class_idx].slots[head]` → second cache line
- Cold metadata: refill triggers `hak_super_lookup()` + metadata traversal
Hot Path Code Flow (from source analysis):
```c
// malloc wrapper → hak_tiny_alloc_fast_wrapper → tiny_alloc_fast
// 1. Check unified cache (cache hit path)
void* p = cache->slots[cache->head];
if (p) {
    cache->head = (cache->head + 1) & cache->mask;  // ← Cache line load
    return p;
}
// 2. Cache miss → unified_cache_refill
unified_cache_refill(class_idx);  // ← Expensive! 6.67 cycles/op
```
Disassembly Evidence (malloc function, lines 7a60-7ac7):
- Multiple indirect loads: `mov %fs:0x0,%r8` (TLS base)
- Pointer arithmetic: `lea -0x47d30(%r8),%rsi` (cache offset calculation)
- Conditional moves: `cmpb $0x2,(%rdx,%rcx,1)` (route check)
- Cache line thrashing on the `cache->slots` array
Recommendations:
- Inline unified_cache_refill for the common case (CRITICAL)
  - Move refill logic inline to eliminate function call overhead
  - Use `__attribute__((always_inline))` or manual inlining
  - Expected gain: ~2-3 cycles/op
- Optimize TLS data layout (HIGH PRIORITY) (see the sketch after this list)
  - Pack hot fields (`cache->head`, `cache->tail`, `cache->slots`) into a single cache line
  - Current: `g_unified_cache[8]` array → 8 separate cache lines
  - Target: hot-path fields in a 64-byte cache line
  - Expected gain: ~3-5 cycles/op, 30-40% fewer misses
- Prefetch next block during refill (MEDIUM)
  ```c
  void* first = out[0];
  __builtin_prefetch(cache->slots[cache->tail + 1], 0, 3);  // Temporal prefetch
  return first;
  ```
  - Expected gain: ~1-2 cycles/op
- Reduce validation overhead (MEDIUM)
  - `unified_refill_validate_base()` calls `hak_super_lookup()` on every block
  - Move to debug-only (`#if !HAKMEM_BUILD_RELEASE`)
  - Expected gain: ~1-2 cycles/op
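As a rough illustration of the "Optimize TLS data layout" item, the hot ring-buffer fields could be packed into one 64-byte line per class. This is a sketch under assumed field names and widths, not the actual `g_unified_cache` definition:

```c
#include <stdint.h>

// Sketch only: hypothetical packed per-class cache layout. Field names mirror
// the ones referenced above (head, tail, mask, slots); real types may differ.
typedef struct __attribute__((aligned(64))) {
    uint16_t head;        /* consumer index               */
    uint16_t tail;        /* producer index               */
    uint16_t mask;        /* capacity - 1 (power of two)  */
    uint16_t class_idx;   /* size class this cache serves */
    void**   slots;       /* ring buffer storage          */
    /* remaining bytes act as padding so two classes never share a line */
} TinyUnifiedCacheHot;

_Static_assert(sizeof(TinyUnifiedCacheHot) == 64,
               "hot fields must occupy exactly one cache line");
```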
🔴 Bottleneck #2: unified_cache_refill (6.67 cycles/op, 30% of misses)
Problem:
- Expensive metadata lookups: `hak_super_lookup(p)` on every freelist node
- Freelist traversal: `tiny_next_read()` requires dereferencing cold pointers
- Validation logic: multiple safety checks per block (lines 384-408 in source)
Hot Path Code (from tiny_unified_cache.c:377-414):
```c
while (produced < room) {
    if (m->freelist) {
        void* p = m->freelist;
        // ❌ EXPENSIVE: Lookup SuperSlab for validation
        SuperSlab* fl_ss = hak_super_lookup(p);           // ← Cache miss!
        int fl_idx = slab_index_for(fl_ss, p);            // ← More metadata access
        // ❌ EXPENSIVE: Dereference next pointer (cold memory)
        void* next_node = tiny_next_read(class_idx, p);   // ← Cache miss!
        // Write header
        *(uint8_t*)p = (uint8_t)(0xa0 | (class_idx & 0x0f));
        m->freelist = next_node;
        out[produced++] = p;
    }
}
```
Recommendations:
- Batch validation (amortize lookup cost) (CRITICAL)
  - Validate the SuperSlab once at the start, not per block
  - Trust freelist integrity within a single refill
  ```c
  SuperSlab* ss_once = hak_super_lookup(m->freelist);
  // Validate ss_once, then skip per-block validation
  while (produced < room && m->freelist) {
      void* p = m->freelist;
      void* next = tiny_next_read(class_idx, p);  // No lookup!
      out[produced++] = p;
      m->freelist = next;
  }
  ```
  - Expected gain: ~2-3 cycles/op
- Prefetch freelist nodes (HIGH PRIORITY)
  ```c
  void* p = m->freelist;
  void* next = tiny_next_read(class_idx, p);
  __builtin_prefetch(next, 0, 3);                             // Prefetch next node
  __builtin_prefetch(tiny_next_read(class_idx, next), 0, 2);  // +2 ahead
  ```
  - Expected gain: ~1-2 cycles/op on miss path
- Increase batch size for hot classes (MEDIUM) (see the sketch after this list)
  - Current: max 128 blocks per refill
  - Proposal: 256 blocks for C0-C3 (tiny sizes)
  - Amortize refill cost over more allocations
  - Expected gain: ~0.5-1 cycles/op
- Remove atomic fence on header write (LOW, risky)
  - Line 422: `__atomic_thread_fence(__ATOMIC_RELEASE)`
  - Only needed for cross-thread visibility
  - Benchmark: the single-threaded case doesn't need the fence
  - Expected gain: ~0.3-0.5 cycles/op
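The batch-size proposal can be expressed as simply as the following sketch; `refill_batch_for_class()` and the enum names are illustrative, with 128 taken from the current cap and 256 from the proposal above:

```c
// Sketch only: per-class refill batch sizing as proposed above.
enum { REFILL_BATCH_DEFAULT = 128, REFILL_BATCH_HOT = 256 };

static inline int refill_batch_for_class(int class_idx) {
    /* C0-C3 are the hottest tiny classes; give them the larger batch. */
    return (class_idx <= 3) ? REFILL_BATCH_HOT : REFILL_BATCH_DEFAULT;
}
```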
🔴 Bottleneck #3: malloc Wrapper Overhead (16.25 cycles/op, excessive branching)
Problem:
- 20+ branches before reaching fast path (disassembly lines 7795-78a3)
- Lazy initialization checks on every call
- Diagnostic tracing with atomic increment
- Environment variable checks
Hot Path Disassembly (malloc, lines 7795-77ba):
```asm
7795:  lock incl 0x190fb78(%rip)        ; ❌ Atomic trace counter (12.33% of cycles!)
779c:  mov   0x190fb6e(%rip),%eax       ; Check g_bench_fast_init_in_progress
77a2:  test  %eax,%eax
77a4:  je    7d90                       ; Branch #1
77aa:  incl  %fs:0xfffffffffffb8354     ; TLS counter increment
77b2:  mov   0x438c8(%rip),%eax         ; Check g_wrapper_env
77b8:  test  %eax,%eax
77ba:  je    7e40                       ; Branch #2
```
Wrapper Code (hakmem_tiny_phase6_wrappers_box.inc:22-79):
```c
void* hak_tiny_alloc_fast_wrapper(size_t size) {
    atomic_fetch_add(&g_alloc_fast_trace, 1, ...);  // ❌ Expensive!
    // ❌ Branch #1: Bench fast mode check
    if (g_bench_fast_front) {
        return tiny_alloc_fast(size);
    }
    atomic_fetch_add(&wrapper_call_count, 1);       // ❌ Atomic again!
    PTR_TRACK_INIT();                               // ❌ Initialization check
    periodic_canary_check(call_num, ...);           // ❌ Periodic check
    // Finally, actual allocation
    void* result = tiny_alloc_fast(size);
    return result;
}
```
Recommendations:
- Compile-time disable diagnostics (CRITICAL) (see the combined sketch after this list)
  - Remove atomic trace counters from the hot path
  - Move them behind `#if HAKMEM_BUILD_RELEASE` guards
  - Expected gain: ~4-6 cycles/op (eliminates 12% overhead)
- Hoist initialization checks (HIGH PRIORITY)
  - Move `PTR_TRACK_INIT()` to library init (once per thread)
  - Cache `g_bench_fast_front` in a thread-local variable
  ```c
  static __thread int g_init_done = 0;
  if (__builtin_expect(!g_init_done, 0)) {
      PTR_TRACK_INIT();
      g_init_done = 1;
  }
  ```
  - Expected gain: ~1-2 cycles/op
- Eliminate wrapper layer for benchmarks (MEDIUM)
  - Direct call to `tiny_alloc_fast()` from `malloc()`
  - Use LTO to inline the wrapper entirely
  - Expected gain: ~1-2 cycles/op (function call overhead)
- Branchless environment checks (LOW)
  - Replace `if (g_wrapper_env)` with bitmask operations
  ```c
  int mask = -(int)g_wrapper_env;  // -1 if true, 0 if false
  result = (mask & diagnostic_path) | (~mask & fast_path);
  ```
  - Expected gain: ~0.3-0.5 cycles/op
Summary: Optimization Roadmap
Immediate Wins (Target: -15 cycles/op, 48.8 → 33.8)
- ✅ Remove atomic trace counters (`lock incl`) → -6 cycles/op
- ✅ Inline `unified_cache_refill` → -3 cycles/op
- ✅ Batch validation in refill → -3 cycles/op
- ✅ Optimize TLS cache layout → -3 cycles/op
Medium-Term (Target: -10 cycles/op, 33.8 → 23.8)
- ✅ Prefetch in refill and malloc → -3 cycles/op
- ✅ Increase batch size for hot classes → -2 cycles/op
- ✅ Consolidate free path (merge 3 functions) → -3 cycles/op
- ✅ Hoist initialization checks → -2 cycles/op
Long-Term (Target: -8 cycles/op, 23.8 → 15.8)
- ✅ Branchless routing logic → -2 cycles/op
- ✅ SIMD batch processing in refill → -3 cycles/op
- ✅ Reduce metadata indirections → -3 cycles/op
Stretch Goal: Match mimalloc (15.8 → 6.2 cycles/op)
- Requires architectural changes (single-layer cache, no validation)
- Trade-off: Safety vs performance
Conclusion
HAKMEM's 7.88x slowdown is primarily due to:
- Cache misses (24.4% of cycles) from multi-layer indirection
- Diagnostic overhead (12%+ of cycles) from atomic counters and tracing
- Function fragmentation (6 hot functions vs mimalloc's 2)
Top Priority Actions:
- Remove atomic trace counters (immediate -6 cycles/op)
- Inline refill + batch validation (-6 cycles/op combined)
- Optimize TLS layout for cache locality (-3 cycles/op)
Expected Impact: -15 cycles/op (48.8 → 33.8, ~30% improvement)
Timeline: 1-2 days of focused optimization work