# Tiny Mixed Workload Performance Analysis

**Date**: 2025-11-01
**Benchmark**: bench_random_mixed (200K cycles, 400 ws, seed=1)
**Size Range**: 8-128B (16 size classes)
**Thread Count**: 1 (single-threaded)

## Executive Summary

HAKMEM is **32% slower** than mimalloc on tiny mixed workloads:

- **HAKMEM**: 16.46 M ops/sec
- **mimalloc**: 24.21 M ops/sec
- **Relative**: 68% of mimalloc speed

**Root Cause**: HAKMEM spends about **3x** as large a share of its CPU time in allocator functions (49%) as mimalloc does (17%).

---

## Performance Breakdown

### Overall Workload Profile

Both allocators spend a large share of time in random number generation (libc `__random`, `__random_r`), which is expected for this benchmark: `__random` accounts for 43.11% of samples (inclusive) under HAKMEM and 61.43% under mimalloc.

The critical difference is in **allocator overhead**:

| Allocator | Total Overhead | malloc/new | free/delete |
|-----------|----------------|------------|-------------|
| mimalloc  | 17%            | 7.35%      | 9.77%       |
| HAKMEM    | 49%            | 27%        | 21.64%      |

---

## HAKMEM Bottlenecks

### 1. `free()` Path (21.64% total, 6.75% self)

**Perf annotate highlights:**

```
9d45: pthread_mutex_lock    (2.29%)  ← Global lock on g_mid_registry
9d53: Binary search loop             ← Search g_mid_registry (Mid range check)
9dad: pthread_mutex_unlock  (3.93%)  ← Unlock
9db5: hak_tiny_owner_slab   (4.98%)  ← SECOND ownership check (Tiny range)
9dce: hak_tiny_free                  ← Actual free
```

**Problems:**

1. **Double ownership check**:
   - First: Global mutex + binary search in `g_mid_registry` (for Mid range 8KB-32KB)
   - Second: `hak_tiny_owner_slab` (for Tiny range 8B-1KB)
2. **Global mutex contention**: Every `free()` locks `g_mid_registry` even for Tiny allocations (8-128B)
3. **Unnecessary Mid lookup**: Mixed workload uses 8-128B sizes, which never hit Mid range

**Impact**:

- mutex lock/unlock: 6.22%
- `hak_pool_mid_lookup` (binary search): 7.08%
- `hak_tiny_owner_slab`: 3.50%
- Total wasted on ownership: ~16.8%

---

### 2. `hak_tiny_alloc_slow` Path (9.33% total, 2.95% self)

**Perf annotate highlights:**

```
Function prologue (push registers): 14.05%  ← Heavy stack frame

Multi-stage fallback:
  1. hotmag_try_refill (if class <= 3)
  2. TLS list check (g_tls_list_enable)
  3. superslab check (g_use_superslab)
  4. Magazine refill
```

**Problems:**

1. **Complex fallback chain**: 4+ branches before actual allocation
2. **TLS variable access overhead**: Multiple `%fs:` prefixed loads
3. **Branch misprediction**: Conditional checks for hotmag, superslab, etc.

**Impact**:

- Function overhead: ~3%
- Fallback logic: ~6%

---

### 3. Allocation Path Summary

| Function                  | Total % | Self % |
|---------------------------|---------|--------|
| `free`                    | 21.64   | 6.75   |
| `hak_tiny_alloc_slow`     | 9.33    | 2.95   |
| `hak_pool_mid_lookup`     | 7.08    | 2.92   |
| `hak_tiny_alloc`          | 7.12    | 2.81   |
| `hak_tiny_owner_slab`     | 3.50    | 2.41   |
| `malloc`                  | 3.67    | 2.41   |
| **Total HAKMEM overhead** | **49%** | -      |

---

## mimalloc Efficiency

mimalloc achieves **17% total allocator overhead** through:

1. **Fast ownership checks**: Likely uses pointer bit patterns or alignment tricks (no mutex)
2. **Simple allocation path**: `mi_malloc` (7.35%) has minimal branching
3. **Lockless thread-local caches**: No global mutex for common cases

---

## Optimization Roadmap

### Priority 1: Eliminate Unnecessary Mid Lookup in `free()` 🔴

**Current flow:**

```
free(ptr)
  ↓
Lock g_mid_registry mutex
  ↓
Binary search g_mid_registry (8KB-32KB check)
  ↓
Unlock g_mid_registry mutex
  ↓
hak_tiny_owner_slab (8B-1KB check)  ← SHOULD BE FIRST!
```

**Proposed flow:**

```
free(ptr)
  ↓
Fast TLS range check (Tiny: 8B-1KB)  ← NO MUTEX
  ↓
If miss: Check Mid registry (with mutex)
  ↓
If miss: Check L25/other
```

**Expected gain**:

- Eliminate 6.22% (mutex) + 7.08% (mid_lookup) = **~13% speedup**
- For Tiny-dominated workloads: **~40% of total overhead removed**

---

### Priority 2: Streamline `free()` Ownership Check 🟡

**Current approach:**

- `hak_tiny_owner_slab`: Linear search through TLS slabs (3.50% overhead)

**Proposed approach** (mimalloc-style):

1. **Alignment check**: `if ((ptr & (SLAB_SIZE - 1)) == 0)` → Not from slab
2. **Range check**: `if (ptr >= tls_heap_start && ptr < tls_heap_end)` → TLS-owned
3. **Slab header**: Direct jump to slab metadata at `(ptr & ~(SLAB_SIZE - 1))`

**Expected gain**:

- Reduce `hak_tiny_owner_slab` from 3.50% → ~0.5%
- Net speedup: **~3%**

---

### Priority 3: Simplify `hak_tiny_alloc_slow` Fallback 🟢

**Current approach:**

- 4-stage fallback: hotmag → TLS list → superslab → magazine

**Proposed approach:**

1. Check TLS magazine first (most common)
2. Check superslab (if enabled)
3. Refill from central depot

**Expected gain**:

- Reduce branching overhead: **~2-3% speedup**

---

### Priority 4: Inline Hot Functions 🟢

Candidates:

- `hak_tiny_alloc` (7.12% total, 2.81% self) → Inline into `malloc`
- `hak_tiny_owner_slab` (3.50% total) → Inline into `free`

**Expected gain**:

- Function call overhead reduction: **~2%**

---

## Comparison with Mid MT Workload

| Metric             | Tiny Mixed (this)   | Mid MT (8KB-32KB)    |
|--------------------|---------------------|----------------------|
| HAKMEM performance | 68% of mimalloc     | 138% of mimalloc     |
| Size range         | 8-128B (16 classes) | 8KB-32KB (3 classes) |
| Thread count       | 1 thread            | 4 threads            |
| HAKMEM strength    | ❌ Weak             | ✅ Strong            |

**Key insight**: HAKMEM's Tiny allocator is not optimized for high-frequency small allocations.
The Mid MT allocator excels due to:

- Fewer size classes (3 vs 16)
- Lock-free TLS caching
- Optimized for 8KB+ allocations

---

## Next Steps

1. ✅ **DONE**: Profile with perf to identify bottlenecks
2. 🔄 **IN PROGRESS**: Document findings (this file)
3. ⏳ **TODO**: Implement Priority 1 optimization (fast TLS range check in `free()`)
4. ⏳ **TODO**: Benchmark improvement
5. ⏳ **TODO**: Iterate on Priority 2-4 optimizations

---

## Raw Perf Data

### HAKMEM Top Functions (>1%)

```
Children    Self  Samples  Symbol
  57.97%   0.00%        0  __libc_start_call_main
  43.11%  27.67%      321  __random
  28.97%  15.14%      179  __random_r
  21.64%   6.75%       79  free
   9.33%   2.95%       35  hak_tiny_alloc_slow.constprop.0
   7.12%   2.81%       33  hak_tiny_alloc
   7.08%   2.92%       35  hak_pool_mid_lookup.constprop.0
   3.67%   2.41%       29  malloc
   3.50%   2.41%       29  hak_tiny_owner_slab
```

### mimalloc Top Functions (>1%)

```
Children    Self  Samples  Symbol
  99.74%   0.00%        0  __libc_start_call_main
  61.43%  41.47%      331  __random
  43.63%  22.50%      177  __random_r
  29.39%  19.88%      157  main
   9.77%   2.42%       19  operator delete[](void*)
   7.35%   1.80%       14  mi_malloc
```

---

## Profiling Commands Used

```bash
# HAKMEM profiling
perf record -o perf_random_mixed_hakmi.data -F 999 -g ./bench_random_mixed_hakmem 200000 400 1
perf report -i perf_random_mixed_hakmi.data --stdio -n --percent-limit 1

# mimalloc profiling
LD_LIBRARY_PATH=./mimalloc-bench/extern/mi/out/release \
  perf record -o perf_random_mixed_mi.data -F 999 -g ./bench_random_mixed_mi 200000 400 1
perf report -i perf_random_mixed_mi.data --stdio -n --percent-limit 1

# Annotate specific functions
perf annotate -i perf_random_mixed_hakmi.data --stdio free
perf annotate -i perf_random_mixed_hakmi.data --stdio "hak_tiny_alloc_slow.constprop.0"
```
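
## Appendix: Sketch of the Tiny-First `free()` Ordering

The ordering proposed under Priority 1, combined with the mask-and-range ownership test from Priority 2, could look roughly like the sketch below. It is a minimal illustration and not HAKMEM's actual code: `SLAB_SIZE` (64 KiB here), `tiny_set_range`, `tiny_owns`, `mid_registry_owns`, `route_free`, and `mid_lookup_count` are all hypothetical names, and the Mid registry is replaced by a stub that only counts how often it would be consulted. It also assumes, as in the Priority 2 proposal, that a slab-aligned address is a slab header and never a user block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, not HAKMEM's real symbols. */
#define SLAB_SIZE ((uintptr_t)64 * 1024) /* assumed power-of-two slab size */
_Static_assert((SLAB_SIZE & (SLAB_SIZE - 1)) == 0,
               "SLAB_SIZE must be a power of two");

typedef enum { FREED_TINY, FREED_MID, FREED_OTHER } free_route_t;

/* Per-thread bounds of the Tiny region: [tls_heap_start, tls_heap_end). */
static _Thread_local uintptr_t tls_heap_start;
static _Thread_local uintptr_t tls_heap_end;

/* Counts how often the (mutex-protected) Mid registry would be consulted. */
unsigned long mid_lookup_count;

/* Record the calling thread's Tiny region. */
void tiny_set_range(const void *start, size_t len) {
    assert(len > 0);
    tls_heap_start = (uintptr_t)start;
    tls_heap_end = tls_heap_start + len;
}

/* Slab metadata would live at the slab-aligned base of the pointer. */
uintptr_t slab_base_of(const void *ptr) {
    return (uintptr_t)ptr & ~(SLAB_SIZE - 1);
}

/* Lock-free ownership test: one mask and two compares, no registry access. */
bool tiny_owns(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    if ((p & (SLAB_SIZE - 1)) == 0) {
        return false; /* slab-aligned address: header, not a user block */
    }
    return p >= tls_heap_start && p < tls_heap_end;
}

/* Stub for the Mid registry search; a real one would lock, search, unlock. */
bool mid_registry_owns(const void *ptr) {
    (void)ptr;
    mid_lookup_count++;
    return false;
}

/* Proposed ordering: Tiny first, Mid only on a miss, then everything else. */
free_route_t route_free(const void *ptr) {
    if (tiny_owns(ptr)) {
        return FREED_TINY;
    }
    if (mid_registry_owns(ptr)) {
        return FREED_MID;
    }
    return FREED_OTHER;
}
```

With this ordering, a pointer inside the thread's Tiny region is classified without touching the Mid registry, so `mid_lookup_count` stays at zero for Tiny-only traffic such as this benchmark. Whether a single contiguous per-thread range is a realistic model of HAKMEM's TLS slabs is an open assumption; if slabs are scattered, the range check would need to be replaced by a lookup in the slab header returned by `slab_base_of`.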