# HAKMEM Performance Profile Analysis: CPU Cycle Bottleneck Investigation

## Benchmark: bench_tiny_hot (64-byte allocations, 20M operations)

**Date:** 2025-12-04
**Objective:** Identify where HAKMEM spends CPU cycles compared to mimalloc (7.88x slower)

---

## Executive Summary

HAKMEM is **7.88x slower** than mimalloc on tiny hot allocations (48.8 vs 6.2 cycles/op).
The performance gap comes from **4 main sources**:

1. **Malloc overhead** (32.4% of gap): Complex wrapper logic, environment checks, initialization barriers
2. **Free overhead** (29.4% of gap): Multi-layer free path with validation and routing
3. **Cache refill** (15.7% of gap): Expensive SuperSlab metadata lookups and validation
4. **Infrastructure** (22.5% of gap): Cache misses, branch mispredictions, diagnostic code

### Key Finding: Cache Miss Penalty Dominates

- **238M cycles lost to cache misses** (24.4% of total runtime!)
- HAKMEM has **20.3x more cache misses** than mimalloc (1.19M vs 58.7K)
- L1 D-cache misses are **97.7x higher** (4.29M vs 43.9K)

---

## Detailed Performance Metrics

### Overall Comparison

| Metric | HAKMEM | mimalloc | Ratio |
|--------|--------|----------|-------|
| **Total Cycles** | 975,602,722 | 123,838,496 | **7.88x** |
| **Total Instructions** | 3,782,043,459 | 515,485,797 | **7.34x** |
| **Cycles per op** | 48.8 | 6.2 | **7.88x** |
| **Instructions per op** | 189.1 | 25.8 | **7.34x** |
| **IPC (inst/cycle)** | 3.88 | 4.16 | 0.93x |
| **Cache misses** | 1,191,800 | 58,727 | **20.29x** |
| **Cache miss rate** | 59.59‰ | 2.94‰ | **20.29x** |
| **Branch misses** | 1,497,133 | 58,943 | **25.40x** |
| **Branch miss rate** | 0.17% | 0.05% | **3.20x** |
| **L1 D-cache misses** | 4,291,649 | 43,913 | **97.73x** |
| **L1 miss rate** | 0.41% | 0.03% | **13.88x** |

### IPC Analysis

- HAKMEM IPC: **3.88** (good, but memory-bound)
- mimalloc IPC: **4.16** (better, with fewer memory stalls)
- **Interpretation**: Both achieve high IPC, but HAKMEM is bottlenecked by its memory access patterns

---

## Function-Level Cycle Breakdown

### HAKMEM: Where Cycles Are Spent

| Function | % | Total Cycles | Cycles/op | Category |
|----------|---|--------------|-----------|----------|
| **malloc** | 33.32% | 325,070,826 | 16.25 | Hot path allocation |
| **unified_cache_refill** | 13.67% | 133,364,892 | 6.67 | Cache miss handler |
| **free.part.0** | 12.22% | 119,218,652 | 5.96 | Free wrapper |
| **main** (benchmark) | 12.07% | 117,755,248 | 5.89 | Test harness |
| **hak_free_at.constprop.0** | 11.55% | 112,682,114 | 5.63 | Free routing |
| **hak_tiny_free_fast_v2** | 8.11% | 79,121,380 | 3.96 | Free fast path |
| **kernel/other** | 9.06% | 88,389,606 | 4.42 | Syscalls, page faults |
| **TOTAL** | 100% | 975,602,722 | 48.78 | |

### mimalloc: Where Cycles Are Spent

| Function | % | Total Cycles | Cycles/op | Category |
|----------|---|--------------|-----------|----------|
| **operator delete[]** | 48.66% | 60,259,812 | 3.01 | Free path |
| **malloc** | 39.82% | 49,312,489 | 2.47 | Allocation path |
| **kernel/other** | 6.77% | 8,383,866 | 0.42 | Syscalls, page faults |
| **main** (benchmark) | 4.75% | 5,882,328 | 0.29 | Test harness |
| **TOTAL** | 100% | 123,838,496 | 6.19 | |

### Insight: HAKMEM Fragmentation

- mimalloc concentrates 88.5% of cycles in malloc/free
- HAKMEM spreads across **6 functions** (malloc + 3 free variants + refill + wrapper)
- **Recommendation**: Consolidate the hot path to reduce function call overhead

---

## Cache Miss Deep Dive

### Cache Misses by Function (HAKMEM)

| Function | % | Cache Misses | Misses/op | Impact |
|----------|---|--------------|-----------|--------|
| **malloc** | 58.51% | 697,322 | 0.0349 | **CRITICAL** |
| **unified_cache_refill** | 29.92% | 356,586 | 0.0178 | **HIGH** |
| Other | 11.57% | 137,892 | 0.0069 | Low |

### Estimated Penalty

- **Cache miss penalty**: 238,360,000 cycles (1,191,800 misses × ~200 cycles per LLC miss)
- **Per operation**: 11.9 cycles lost to cache misses (238.4M cycles / 20M ops)
- **Percentage of total**: **24.4%** of all cycles

### Root Causes

1. **malloc (58% of cache misses)** (see the dependent-load sketch after this list):
   - Pointer chasing through TLS → cache → metadata
   - Multiple indirections: `g_tls_slabs[class_idx]` → `tls->ss` → `tls->meta`
   - Cold metadata access patterns

2. **unified_cache_refill (30% of cache misses)**:
   - SuperSlab metadata lookups via `hak_super_lookup(p)`
   - Freelist traversal: `tiny_next_read()` on cold pointers
   - Validation logic: multiple metadata accesses per block

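To make item 1 concrete, here is a minimal sketch of the dependent loads behind `g_tls_slabs[class_idx]` → `tls->ss` → `tls->meta`; the struct shapes and the array length are assumptions for illustration, not the actual HAKMEM definitions:

```c
#include <stddef.h>

/* Hypothetical struct shapes matching the identifiers above. */
typedef struct SuperSlab SuperSlab;
typedef struct SlabMeta  { void* freelist; /* ... */ } SlabMeta;
typedef struct TlsSlab   { SuperSlab* ss; SlabMeta* meta; } TlsSlab;

extern __thread TlsSlab g_tls_slabs[8];

/* Each step is a load that depends on the previous one, so the misses
 * serialize: the TLS slot may miss, and the metadata line usually does
 * because it is cold relative to the thread-local state. */
void* indirection_example(int class_idx) {
    TlsSlab*   tls  = &g_tls_slabs[class_idx];  /* TLS base + offset (address math) */
    SuperSlab* ss   = tls->ss;                  /* load 1: thread-local slot        */
    SlabMeta*  meta = tls->meta;                /* load 2: same/adjacent TLS line   */
    if (!ss || !meta) return NULL;
    return meta->freelist;                      /* load 3: cold metadata line       */
}
```

Flattening even one of these levels (for example, caching the freelist head directly in TLS) removes a dependent miss from the path.
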
---
## Branch Misprediction Analysis

### Branch Misses by Function (HAKMEM)

| Function | % | Branch Misses | Misses/op | Impact |
|----------|---|---------------|-----------|--------|
| **malloc** | 21.59% | 323,231 | 0.0162 | Moderate |
| **unified_cache_refill** | 10.35% | 154,953 | 0.0077 | Moderate |
| **free.part.0** | 3.80% | 56,891 | 0.0028 | Low |
| **main** | 3.66% | 54,795 | 0.0027 | (Benchmark) |
| **hak_free_at** | 3.49% | 52,249 | 0.0026 | Low |
| **hak_tiny_free_fast_v2** | 3.11% | 46,560 | 0.0023 | Low |

### Estimated Penalty

- **Branch miss penalty**: 22,456,995 cycles (1,497,133 misses × ~15 cycles/miss)
- **Per operation**: 1.1 cycles lost to branch misses
- **Percentage of total**: **2.3%** of all cycles

### Root Causes

1. **Unpredictable control flow** (a mitigation sketch follows this list):
   - Environment variable checks: `if (g_wrapper_env)`, `if (g_enable)`
   - Initialization barriers: `if (!g_initialized)`, `if (g_initializing)`
   - Multi-way routing: `if (cache miss) → refill; if (freelist) → pop; else → carve`

2. **malloc wrapper overhead** (lines 7795-78a3 in the disassembly):
   - 20+ conditional branches before reaching the fast path
   - Lazy initialization checks
   - Diagnostic tracing (`lock incl g_wrap_malloc_trace_count`)

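One low-risk mitigation for the first cause is to fold the per-call environment and initialization checks into a single cached flag so the hot path keeps one highly predictable branch. A minimal sketch, with placeholder names standing in for `g_wrapper_env`, `g_enable`, and the init flags:

```c
/* Hypothetical: t_fast_path_ok caches the outcome of all per-call checks.
 * 0 = not yet evaluated, 1 = fast path allowed, -1 = fast path disabled. */
static __thread signed char t_fast_path_ok;

static inline int fast_path_enabled(void) {
    signed char s = t_fast_path_ok;
    if (__builtin_expect(s != 0, 1))
        return s > 0;                 /* common case: one well-predicted branch */
    /* Slow path, taken once per thread: evaluate env vars and init state. */
    int ok = 1;                       /* placeholder for the real checks */
    t_fast_path_ok = ok ? 1 : -1;
    return ok;
}
```
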
---
## Top 3 Bottlenecks & Recommendations

### 🔴 Bottleneck #1: Cache Misses in malloc (16.25 cycles/op, 58% of misses)

**Problem:**
- Complex TLS access pattern: `g_tls_sll[class_idx].head` requires a cache line load
- Unified cache lookup: `g_unified_cache[class_idx].slots[head]` → second cache line
- Cold metadata: refill triggers `hak_super_lookup()` + metadata traversal

**Hot Path Code Flow** (from source analysis):

```c
// malloc wrapper → hak_tiny_alloc_fast_wrapper → tiny_alloc_fast

// 1. Check unified cache (cache hit path)
void* p = cache->slots[cache->head];
if (p) {
    cache->head = (cache->head + 1) & cache->mask;  // ← cache line load
    return p;
}

// 2. Cache miss → unified_cache_refill
unified_cache_refill(class_idx);  // ← expensive: 6.67 cycles/op
```

**Disassembly Evidence** (malloc function, lines 7a60-7ac7):
- Multiple indirect loads: `mov %fs:0x0,%r8` (TLS base)
- Pointer arithmetic: `lea -0x47d30(%r8),%rsi` (cache offset calculation)
- Byte compares for routing: `cmpb $0x2,(%rdx,%rcx,1)` (route check)
- Cache line thrashing on the `cache->slots` array

**Recommendations:**

1. **Inline unified_cache_refill for the common case** (CRITICAL)
   - Move the refill logic inline to eliminate function call overhead
   - Use `__attribute__((always_inline))` or manual inlining
   - Expected gain: ~2-3 cycles/op

2. **Optimize TLS data layout** (HIGH PRIORITY, see the sketch after this list)
   - Pack hot fields (`cache->head`, `cache->tail`, `cache->slots`) into a single cache line
   - Current: `g_unified_cache[8]` array → 8 separate cache lines
   - Target: hot-path fields in one 64-byte cache line
   - Expected gain: ~3-5 cycles/op, 30-40% fewer cache misses

3. **Prefetch the next block during refill** (MEDIUM)
   ```c
   void* first = out[0];
   __builtin_prefetch(cache->slots[(cache->tail + 1) & cache->mask], 0, 3);  // temporal prefetch
   return first;
   ```
   - Expected gain: ~1-2 cycles/op

4. **Reduce validation overhead** (MEDIUM)
   - `unified_refill_validate_base()` calls `hak_super_lookup()` on every block
   - Make it debug-only (`#if !HAKMEM_BUILD_RELEASE`)
   - Expected gain: ~1-2 cycles/op

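To make item 2 concrete, a minimal sketch of a packed per-class cache follows; the type name, field names, and slot count are assumptions rather than the current HAKMEM definitions. The point is that `head`, `tail`, `mask`, and the first few slots share one 64-byte line, so a cache hit touches a single line:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical packed layout: the index fields and the first slots of each
 * class share one 64-byte line; the rest of the ring spills onto the
 * following lines, which only the refill path touches. */
typedef struct {
    uint16_t head;        /* next slot to pop                */
    uint16_t tail;        /* next slot to fill during refill */
    uint16_t mask;        /* capacity - 1 (here 127)         */
    uint16_t _pad;
    void*    slots[128];  /* ring buffer; slots[0..6] share the hot line */
} __attribute__((aligned(64))) tiny_unified_cache_packed;

/* One entry per tiny size class, thread-local as in the existing design. */
static __thread tiny_unified_cache_packed g_cache_packed[8];

static_assert(offsetof(tiny_unified_cache_packed, slots) == 8,
              "indices and first slots must share the first cache line");
```

Whether the real `g_unified_cache` can adopt this shape depends on its current slot count and alignment, but any layout that keeps the indices and the slot being read on one line removes a potential miss per allocation.
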
---
### 🔴 Bottleneck #2: unified_cache_refill (6.67 cycles/op, 30% of misses)

**Problem:**
- Expensive metadata lookups: `hak_super_lookup(p)` on every freelist node
- Freelist traversal: `tiny_next_read()` requires dereferencing cold pointers
- Validation logic: multiple safety checks per block (lines 384-408 in source)

**Hot Path Code** (from tiny_unified_cache.c:377-414):

```c
while (produced < room) {
    if (m->freelist) {
        void* p = m->freelist;

        // ❌ EXPENSIVE: look up the SuperSlab just to validate this block
        SuperSlab* fl_ss = hak_super_lookup(p);          // ← cache miss!
        int fl_idx = slab_index_for(fl_ss, p);           // ← more metadata access

        // ❌ EXPENSIVE: dereference the next pointer (cold memory)
        void* next_node = tiny_next_read(class_idx, p);  // ← cache miss!

        // Write header
        *(uint8_t*)p = (uint8_t)(0xa0 | (class_idx & 0x0f));
        m->freelist = next_node;
        out[produced++] = p;
    }
    /* else: carve path for an empty freelist (elided from this excerpt) */
}
```

**Recommendations:**

1. **Batch validation (amortize lookup cost)** (CRITICAL)
   - Validate the SuperSlab once at the start of the refill, not per block
   - Trust freelist integrity within a single refill
   ```c
   SuperSlab* ss_once = hak_super_lookup(m->freelist);
   // Validate ss_once once, then skip per-block validation
   while (produced < room && m->freelist) {
       void* p = m->freelist;
       void* next = tiny_next_read(class_idx, p);  // no per-block lookup
       out[produced++] = p;
       m->freelist = next;
   }
   ```
   - Expected gain: ~2-3 cycles/op

2. **Prefetch freelist nodes** (HIGH PRIORITY)
   ```c
   void* p = m->freelist;
   void* next = tiny_next_read(class_idx, p);
   __builtin_prefetch(next, 0, 3);  // warm the next node while p is handed out
   // Prefetching two nodes ahead would require carrying a look-ahead pointer
   // across iterations; reading next's link here would stall on the load that
   // was just prefetched.
   ```
   - Expected gain: ~1-2 cycles/op on the miss path

3. **Increase batch size for hot classes** (MEDIUM, see the sketch after this list)
   - Current: max 128 blocks per refill
   - Proposal: 256 blocks for C0-C3 (tiny sizes)
   - Amortizes the refill cost over more allocations
   - Expected gain: ~0.5-1 cycles/op

4. **Remove atomic fence on header write** (LOW, risky)
   - Line 422: `__atomic_thread_fence(__ATOMIC_RELEASE)`
   - Only needed for cross-thread visibility
   - The single-threaded benchmark case does not need the fence
   - Expected gain: ~0.3-0.5 cycles/op

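For item 3, a small sketch of per-class refill batch sizes; the table name and helper are hypothetical, not current HAKMEM constants. The four hottest classes get the proposed 256-block batches while the rest keep the existing 128-block limit:

```c
#include <stdint.h>

/* Hypothetical per-class refill batch sizes. */
static const uint16_t k_refill_batch[8] = {
    256, 256, 256, 256,   /* C0-C3: hot tiny classes          */
    128, 128, 128, 128    /* C4-C7: unchanged current maximum */
};

static inline int refill_batch_for(int class_idx, int free_slots) {
    int want = k_refill_batch[class_idx];
    return want < free_slots ? want : free_slots;   /* never overfill the cache */
}
```
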
---
### 🔴 Bottleneck #3: malloc Wrapper Overhead (16.25 cycles/op, excessive branching)

**Problem:**
- 20+ branches before reaching the fast path (disassembly lines 7795-78a3)
- Lazy initialization checks on every call
- Diagnostic tracing with an atomic increment
- Environment variable checks

**Hot Path Disassembly** (malloc, lines 7795-77ba):

```asm
7795: lock incl 0x190fb78(%rip)     ; ❌ Atomic trace counter (12.33% of cycles!)
779c: mov 0x190fb6e(%rip),%eax      ; Check g_bench_fast_init_in_progress
77a2: test %eax,%eax
77a4: je 7d90                       ; Branch #1
77aa: incl %fs:0xfffffffffffb8354   ; TLS counter increment
77b2: mov 0x438c8(%rip),%eax        ; Check g_wrapper_env
77b8: test %eax,%eax
77ba: je 7e40                       ; Branch #2
```

**Wrapper Code** (hakmem_tiny_phase6_wrappers_box.inc:22-79):

```c
void* hak_tiny_alloc_fast_wrapper(size_t size) {
    atomic_fetch_add(&g_alloc_fast_trace, 1, ...);   // ❌ Expensive!

    // ❌ Branch #1: bench fast mode check
    if (g_bench_fast_front) {
        return tiny_alloc_fast(size);
    }

    atomic_fetch_add(&wrapper_call_count, 1);        // ❌ Atomic again!
    PTR_TRACK_INIT();                                // ❌ Initialization check
    periodic_canary_check(call_num, ...);            // ❌ Periodic check

    // Finally, the actual allocation
    void* result = tiny_alloc_fast(size);
    return result;
}
```

**Recommendations:**

1. **Compile-time disable diagnostics** (CRITICAL, see the sketch after this list)
   - Remove atomic trace counters from the hot path
   - Move them behind `#if HAKMEM_BUILD_RELEASE` guards
   - Expected gain: **~4-6 cycles/op** (eliminates the ~12% tracing overhead)

2. **Hoist initialization checks** (HIGH PRIORITY)
   - Move `PTR_TRACK_INIT()` to library init (once per thread)
   - Cache `g_bench_fast_front` in a thread-local variable
   ```c
   static __thread int g_init_done = 0;
   if (__builtin_expect(!g_init_done, 0)) {
       PTR_TRACK_INIT();
       g_init_done = 1;
   }
   ```
   - Expected gain: ~1-2 cycles/op

3. **Eliminate the wrapper layer for benchmarks** (MEDIUM)
   - Call `tiny_alloc_fast()` directly from `malloc()`
   - Use LTO to inline the wrapper entirely
   - Expected gain: ~1-2 cycles/op (function call overhead)

4. **Branchless environment checks** (LOW)
   - Replace `if (g_wrapper_env)` with bitmask selection
   ```c
   intptr_t mask = -(intptr_t)(g_wrapper_env != 0);   // all-ones if set, 0 otherwise
   // Assumes both values are already computed (or cheap) and side-effect free.
   result = (mask & diagnostic_path) | (~mask & fast_path);
   ```
   - Expected gain: ~0.3-0.5 cycles/op

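A minimal sketch of the compile-time guard from item 1: `HAKMEM_BUILD_RELEASE` is the flag this report already proposes, while the macro name, counter type, and declarations below are placeholders, not the existing HAKMEM API:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Declarations assumed from the wrapper code above. */
extern _Atomic unsigned long g_alloc_fast_trace;
void* tiny_alloc_fast(size_t size);

/* Hypothetical guard: release builds compile the trace counter out entirely,
 * removing both the atomic RMW and its surrounding checks from the hot path. */
#if HAKMEM_BUILD_RELEASE
#  define HAK_TRACE_INC(counter) ((void)0)
#else
#  define HAK_TRACE_INC(counter) \
      atomic_fetch_add_explicit(&(counter), 1, memory_order_relaxed)
#endif

void* hak_tiny_alloc_fast_wrapper(size_t size) {
    HAK_TRACE_INC(g_alloc_fast_trace);   /* no code emitted in release builds */
    return tiny_alloc_fast(size);
}
```
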
---
## Summary: Optimization Roadmap

### Immediate Wins (Target: -15 cycles/op, 48.8 → 33.8)

1. ✅ Remove atomic trace counters (`lock incl`) → **-6 cycles/op**
2. ✅ Inline `unified_cache_refill` → **-3 cycles/op**
3. ✅ Batch validation in refill → **-3 cycles/op**
4. ✅ Optimize TLS cache layout → **-3 cycles/op**

### Medium-Term (Target: -10 cycles/op, 33.8 → 23.8)

5. ✅ Prefetch in refill and malloc → **-3 cycles/op**
6. ✅ Increase batch size for hot classes → **-2 cycles/op**
7. ✅ Consolidate the free path (merge 3 functions) → **-3 cycles/op**
8. ✅ Hoist initialization checks → **-2 cycles/op**

### Long-Term (Target: -8 cycles/op, 23.8 → 15.8)

9. ✅ Branchless routing logic → **-2 cycles/op**
10. ✅ SIMD batch processing in refill → **-3 cycles/op**
11. ✅ Reduce metadata indirections → **-3 cycles/op**

### Stretch Goal: Match mimalloc (15.8 → 6.2 cycles/op)

- Requires architectural changes (single-layer cache, no validation); a rough sketch of the target hot path follows below
- Trade-off: safety vs. performance

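Taken together, the roadmap points at a hot path roughly like the sketch below: no trace counters, a single packed cache line per class, and refill behind an unlikely branch. It reuses the packed-cache layout sketched under Bottleneck #1 and a placeholder cold-path function; it is an illustration of the target shape, not existing HAKMEM code:

```c
#include <stdint.h>

/* Same hypothetical packed per-class cache as sketched under Bottleneck #1. */
typedef struct {
    uint16_t head, tail, mask, _pad;
    void*    slots[128];
} __attribute__((aligned(64))) tiny_unified_cache_packed;

extern __thread tiny_unified_cache_packed g_cache_packed[8];
void* tiny_alloc_refill_and_pop(int class_idx);   /* placeholder cold path */

/* Illustrative target hot path: one cache line touched on a hit,
 * no diagnostics, refill kept out of line behind an unlikely branch. */
static inline void* tiny_alloc_fast_target(int class_idx) {
    tiny_unified_cache_packed* c = &g_cache_packed[class_idx];
    void* p = c->slots[c->head];
    if (__builtin_expect(p != 0, 1)) {
        c->slots[c->head] = 0;                        /* mark slot consumed */
        c->head = (uint16_t)((c->head + 1) & c->mask);
        return p;
    }
    return tiny_alloc_refill_and_pop(class_idx);
}
```
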
---
## Conclusion

HAKMEM's 7.88x slowdown is primarily due to:

1. **Cache misses** (24.4% of cycles) from multi-layer indirection
2. **Diagnostic overhead** (12%+ of cycles) from atomic counters and tracing
3. **Function fragmentation** (6 hot functions vs mimalloc's 2)

**Top Priority Actions:**
- Remove atomic trace counters (immediate -6 cycles/op)
- Inline refill + batch validation (-6 cycles/op combined)
- Optimize TLS layout for cache locality (-3 cycles/op)

**Expected Impact:** **-15 cycles/op** (48.8 → 33.8, ~30% improvement)
**Timeline:** 1-2 days of focused optimization work