# Tiny Mixed Workload Performance Analysis
**Date**: 2025-11-01
**Benchmark**: bench_random_mixed (200K cycles, 400 ws, seed=1)
**Size Range**: 8-128B (16 size classes)
**Thread Count**: 1 (single-threaded)
## Executive Summary
HAKMEM is **32% slower** than mimalloc on tiny mixed workloads:
- **HAKMEM**: 16.46 M ops/sec
- **mimalloc**: 24.21 M ops/sec
- **Relative**: 68% of mimalloc speed
**Root Cause**: HAKMEM spends about **3x the share of CPU samples** in allocator functions (49%) that mimalloc does (17%).
---
## Performance Breakdown
### Overall Workload Profile
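Both allocators spend a large share of samples in random number generation (libc `__random`, `__random_r`), which is expected for this benchmark: about 43% inclusive for HAKMEM and about 61% for mimalloc, per the raw perf data at the end of this document.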
The critical difference is in **allocator overhead**:
| Allocator | Total Overhead | malloc/new | free/delete |
|-----------|----------------|------------|-------------|
| mimalloc | 17% | 7.35% | 9.77% |
| HAKMEM | 49% | 27% | 21.64% |
---
## HAKMEM Bottlenecks
### 1. `free()` Path (21.64% total, 6.75% self)
**Perf annotate highlights:**
```
9d45: pthread_mutex_lock (2.29%) ← Global lock on g_mid_registry
9d53: Binary search loop ← Search g_mid_registry (Mid range check)
9dad: pthread_mutex_unlock (3.93%) ← Unlock
9db5: hak_tiny_owner_slab (4.98%) ← SECOND ownership check (Tiny range)
9dce: hak_tiny_free ← Actual free
```
**Problems:**
1. **Double ownership check**:
   - First: Global mutex + binary search in `g_mid_registry` (for Mid range 8KB-32KB)
   - Second: `hak_tiny_owner_slab` (for Tiny range 8B-1KB)
2. **Global mutex contention**: Every `free()` locks `g_mid_registry` even for Tiny allocations (8-128B)
3. **Unnecessary Mid lookup**: Mixed workload uses 8-128B sizes, which never hit Mid range
**Impact**:
- mutex lock/unlock: 6.22%
- `hak_pool_mid_lookup` (binary search): 7.08%
- `hak_tiny_owner_slab`: 3.50%
- Total wasted on ownership: ~16.8%
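To make the cost concrete, the sketch below models the first ownership check as the annotate output describes it: a binary search over a sorted registry of address ranges, bracketed by a global mutex. The type `mid_entry_t`, the function `mid_lookup_locked`, and the registry layout are hypothetical stand-ins for illustration, not HAKMEM's actual definitions.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registry entry: one address range owned by the Mid pool. */
typedef struct {
    uintptr_t base;
    size_t    len;
} mid_entry_t;

/* Illustrative stand-ins for g_mid_registry and its lock. */
static mid_entry_t     g_registry[64];      /* kept sorted by base */
static size_t          g_registry_count;
static pthread_mutex_t g_registry_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 1 if ptr lies inside a registered range, 0 otherwise.
 * Every caller pays for lock + search + unlock, even when the answer is 0,
 * which is always the case for the 8-128B blocks in this benchmark. */
static int mid_lookup_locked(const void *ptr)
{
    uintptr_t p = (uintptr_t)ptr;
    int found = 0;

    pthread_mutex_lock(&g_registry_lock);
    size_t lo = 0, hi = g_registry_count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const mid_entry_t *e = &g_registry[mid];
        if (p < e->base) {
            hi = mid;                       /* search the lower half */
        } else if (p - e->base >= e->len) {
            lo = mid + 1;                   /* search the upper half */
        } else {
            found = 1;                      /* base <= p < base + len */
            break;
        }
    }
    pthread_mutex_unlock(&g_registry_lock);
    return found;
}
```

Even uncontended, the lock and unlock are atomic read-modify-write operations on a shared cache line, which is consistent with the lock/unlock pair showing up in a single-threaded profile.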
---
### 2. `hak_tiny_alloc_slow` Path (9.33% total, 2.95% self)
**Perf annotate highlights:**
```
Function prologue (push registers): 14.05% ← Heavy stack frame
Multi-stage fallback:
1. hotmag_try_refill (if class <= 3)
2. TLS list check (g_tls_list_enable)
3. superslab check (g_use_superslab)
4. Magazine refill
```
**Problems:**
1. **Complex fallback chain**: 4+ branches before actual allocation
2. **TLS variable access overhead**: Multiple `%fs:` prefixed loads
3. **Branch misprediction**: Conditional checks for hotmag, superslab, etc.
**Impact**:
- Function overhead: ~3%
- Fallback logic: ~6%
---
### 3. Allocation Path Summary
| Function | Total % | Self % |
|-----------------------------|---------|--------|
| `free` | 21.64 | 6.75 |
| `hak_tiny_alloc_slow` | 9.33 | 2.95 |
| `hak_pool_mid_lookup` | 7.08 | 2.92 |
| `hak_tiny_alloc` | 7.12 | 2.81 |
| `hak_tiny_owner_slab` | 3.50 | 2.41 |
| `malloc` | 3.67 | 2.41 |
| **Total HAKMEM overhead** | **49%** | - |
---
## mimalloc Efficiency
mimalloc achieves **17% total allocator overhead** through:
1. **Fast ownership checks**: Likely uses pointer bit patterns or alignment tricks (no mutex)
2. **Simple allocation path**: `mi_malloc` (7.35%) has minimal branching
3. **Lockless thread-local caches**: No global mutex for common cases
---
## Optimization Roadmap
### Priority 1: Eliminate Unnecessary Mid Lookup in `free()` 🔴
**Current flow:**
```
free(ptr)
  Lock g_mid_registry mutex
  Binary search g_mid_registry (8KB-32KB check)
  Unlock g_mid_registry mutex
  hak_tiny_owner_slab (8B-1KB check) ← SHOULD BE FIRST!
```
**Proposed flow:**
```
free(ptr)
  Fast TLS range check (Tiny: 8B-1KB) ← NO MUTEX
  If miss: Check Mid registry (with mutex)
  If miss: Check L25/other
```
**Expected gain**:
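- Eliminate 6.22% (mutex) + 7.08% (mid_lookup) = **~13% of CPU time**, which is roughly a 15% throughput gain if those cycles disappear entirely (1 / (1 - 0.133) ≈ 1.15)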
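- For Tiny-dominated workloads: about 27% of the 49% allocator overhead removed (13.3 / 49), or about 34% if the 3.50% in `hak_tiny_owner_slab` is also eliminated (16.8 / 49)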
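A minimal sketch of the proposed ordering follows, assuming each thread's Tiny memory can be described by a single contiguous address range. The names `tls_tiny_start`, `tls_tiny_end`, `classify_for_free`, and `mid_registry_contains` are hypothetical, and the Mid lookup is a counting placeholder rather than the real registry search.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread Tiny arena bounds; HAKMEM's real TLS layout may differ. */
static _Thread_local uintptr_t tls_tiny_start;
static _Thread_local uintptr_t tls_tiny_end;

typedef enum { OWNER_TINY, OWNER_MID, OWNER_OTHER } owner_t;

/* Counts how often the slow, mutex-protected Mid lookup is reached. */
static unsigned long g_mid_lookups;

/* Placeholder for the locked registry search; it always misses in this sketch. */
static int mid_registry_contains(const void *ptr)
{
    (void)ptr;
    g_mid_lookups++;
    return 0;
}

static owner_t classify_for_free(const void *ptr)
{
    uintptr_t p = (uintptr_t)ptr;

    /* 1. Lock-free TLS range check: two compares, no shared state. */
    if (p >= tls_tiny_start && p < tls_tiny_end)
        return OWNER_TINY;

    /* 2. Only a Tiny miss pays for the Mid registry. */
    if (mid_registry_contains(ptr))
        return OWNER_MID;

    /* 3. L25 or another allocator. */
    return OWNER_OTHER;
}
```

One caveat with this shape: a block freed by a thread other than its owner misses the TLS range and falls through to the slower checks, so a real implementation needs a correct cross-thread path behind the fast one. The benchmark analyzed here is single-threaded, so that path would not be exercised.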
---
### Priority 2: Streamline `free()` Ownership Check 🟡
**Current approach:**
- `hak_tiny_owner_slab`: Linear search through TLS slabs (3.50% overhead)
**Proposed approach** (mimalloc-style):
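1. **Alignment check**: `if (((uintptr_t)ptr & (SLAB_SIZE - 1)) == 0)` → Not from slab (assuming the header sits at the slab base, a slab-aligned address is the header itself, never a block)
2. **Range check**: `if (ptr >= tls_heap_start && ptr < tls_heap_end)` → TLS-owned
3. **Slab header**: Direct jump to slab metadata at `((uintptr_t)ptr & ~(SLAB_SIZE - 1))`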
**Expected gain**:
- Reduce `hak_tiny_owner_slab` from 3.50% → ~0.5%
- Net speedup: **~3%**
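The three steps above can be sketched as a single constant-time function. This assumes slabs are `SLAB_SIZE`-aligned with the header at the slab base, and that a thread's slabs fall inside one address range. The 64 KiB `SLAB_SIZE`, the `slab_hdr_t` layout, and the name `tiny_owner_slab_fast` are illustrative assumptions, not values taken from HAKMEM.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed slab size; the mask arithmetic requires a power of two. */
#define SLAB_SIZE ((uintptr_t)64 * 1024)

/* Hypothetical header stored at the base of every SLAB_SIZE-aligned slab. */
typedef struct {
    uint32_t class_idx;
    uint32_t used;
} slab_hdr_t;

/* Hypothetical per-thread heap bounds. */
static _Thread_local uintptr_t tls_heap_start;
static _Thread_local uintptr_t tls_heap_end;

/* Returns the owning slab header, or NULL if ptr cannot be a Tiny block
 * owned by this thread. No loop, no lock. */
static slab_hdr_t *tiny_owner_slab_fast(const void *ptr)
{
    uintptr_t p = (uintptr_t)ptr;

    /* Step 1: a slab-aligned address is the header itself, not a block. */
    if ((p & (SLAB_SIZE - 1)) == 0)
        return NULL;

    /* Step 2: reject anything outside this thread's heap range, so the
     * masked address below is never dereferenced for foreign pointers. */
    if (p < tls_heap_start || p >= tls_heap_end)
        return NULL;

    /* Step 3: clear the low bits to land on the slab header. */
    return (slab_hdr_t *)(p & ~(SLAB_SIZE - 1));
}
```

The range check comes before any use of the masked address on purpose: masking an arbitrary pointer can yield an unmapped address, so the header should only be read once ownership is established.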
---
### Priority 3: Simplify `hak_tiny_alloc_slow` Fallback 🟢
**Current approach:**
- 4-stage fallback: hotmag → TLS list → superslab → magazine
**Proposed approach:**
1. Check TLS magazine first (most common)
2. Check superslab (if enabled)
3. Refill from central depot
**Expected gain**:
- Reduce branching overhead: **~2-3% speedup**
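As a rough illustration of the three-stage order, the sketch below uses a tiny LIFO magazine and a bump-allocated depot. Everything here is hypothetical (`magazine_t`, `tiny_alloc_slow_sketch`, the 8-entry magazine, the 32-block depot, and the stubbed superslab stage); it only shows the control flow, not HAKMEM's data structures, and the depot is not thread-safe.

```c
#include <stddef.h>

#define MAG_CAP      8
#define DEPOT_BLOCKS 32

/* Hypothetical per-class TLS magazine: a small LIFO stack of free blocks. */
typedef struct {
    void  *items[MAG_CAP];
    size_t count;
} magazine_t;

static _Thread_local magazine_t tls_mag;
static int g_use_superslab_sketch;          /* stand-in for g_use_superslab */

/* Stubbed stage 2; a real allocator would carve a block from a superslab. */
static void *superslab_alloc_sketch(void) { return NULL; }

/* Stage 3 backing store: hands out each 64-byte block exactly once.
 * Single-threaded sketch; a real depot needs synchronization. */
static char   depot_pool[DEPOT_BLOCKS][64];
static size_t depot_next;

static size_t depot_refill_sketch(magazine_t *m)
{
    while (m->count < MAG_CAP && depot_next < DEPOT_BLOCKS)
        m->items[m->count++] = depot_pool[depot_next++];
    return m->count;
}

static void *tiny_alloc_slow_sketch(void)
{
    /* 1. TLS magazine first: the common case is a single pop. */
    if (tls_mag.count > 0)
        return tls_mag.items[--tls_mag.count];

    /* 2. Superslab, only if enabled. */
    if (g_use_superslab_sketch) {
        void *p = superslab_alloc_sketch();
        if (p != NULL)
            return p;
    }

    /* 3. Refill the magazine from the central depot, then pop. */
    if (depot_refill_sketch(&tls_mag) > 0)
        return tls_mag.items[--tls_mag.count];

    return NULL;                             /* depot exhausted */
}
```

Putting the most likely source first means the common case exits after one well-predicted branch, which is the effect the proposed reordering is after.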
---
### Priority 4: Inline Hot Functions 🟢
Candidates:
- `hak_tiny_alloc` (7.12% total, 2.81% self) → Inline into `malloc`
- `hak_tiny_owner_slab` (3.50% total) → Inline into `free`
**Expected gain**:
- Function call overhead reduction: **~2%**
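One common way to do this with GCC or Clang is to force-inline the short fast path and keep the slow path out of line, so the hot code stays small. The sketch below assumes a TLS free list as the fast path; the macro names and the `*_sketch` functions are hypothetical, and only the `always_inline` and `noinline` attributes and `__builtin_expect` are real compiler features.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#  define HAK_ALWAYS_INLINE static inline __attribute__((always_inline))
#  define HAK_NOINLINE      __attribute__((noinline))
#  define HAK_LIKELY(x)     __builtin_expect(!!(x), 1)
#else
#  define HAK_ALWAYS_INLINE static inline
#  define HAK_NOINLINE
#  define HAK_LIKELY(x)     (x)
#endif

/* Hypothetical intrusive free list kept per thread. */
typedef struct free_node { struct free_node *next; } free_node_t;
static _Thread_local free_node_t *tls_free_head;

/* Slow path stays out of line so it does not bloat every call site. */
HAK_NOINLINE static void *alloc_slow_sketch(void)
{
    return NULL;                 /* placeholder: refill logic would go here */
}

/* Fast path: one load, one branch, one store when the list is non-empty. */
HAK_ALWAYS_INLINE void *alloc_fast_sketch(void)
{
    free_node_t *n = tls_free_head;
    if (HAK_LIKELY(n != NULL)) {
        tls_free_head = n->next;
        return n;
    }
    return alloc_slow_sketch();
}

/* Push the block back onto the thread-local list. */
HAK_ALWAYS_INLINE void free_fast_sketch(void *p)
{
    free_node_t *n = (free_node_t *)p;
    n->next = tls_free_head;
    tls_free_head = n;
}
```

Whether inlining helps here is worth confirming with a fresh profile, since the compiler may already inline these functions at the optimization level the benchmark is built with.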
---
## Comparison with Mid MT Workload
| Metric | Tiny Mixed (this) | Mid MT (8KB-32KB) |
|-----------------------|-------------------|-------------------|
| HAKMEM performance | 68% of mimalloc | 138% of mimalloc |
| Size range | 8-128B (16 classes) | 8KB-32KB (3 classes) |
| Thread count | 1 thread | 4 threads |
| HAKMEM strength | ❌ Weak | ✅ Strong |
**Key insight**: HAKMEM's Tiny allocator is not optimized for high-frequency small allocations. The Mid MT allocator excels due to:
- Fewer size classes (3 vs 16)
- Lock-free TLS caching
- Optimized for 8KB+ allocations
---
## Next Steps
1. **DONE**: Profile with perf to identify bottlenecks
2. 🔄 **IN PROGRESS**: Document findings (this file)
3. **TODO**: Implement Priority 1 optimization (fast TLS range check in `free()`)
4. **TODO**: Benchmark improvement
5. **TODO**: Iterate on Priority 2-4 optimizations
---
## Raw Perf Data
### HAKMEM Top Functions (>1%)
```
Children Self Samples Symbol
57.97% 0.00% 0 __libc_start_call_main
43.11% 27.67% 321 __random
28.97% 15.14% 179 __random_r
21.64% 6.75% 79 free
9.33% 2.95% 35 hak_tiny_alloc_slow.constprop.0
7.12% 2.81% 33 hak_tiny_alloc
7.08% 2.92% 35 hak_pool_mid_lookup.constprop.0
3.67% 2.41% 29 malloc
3.50% 2.41% 29 hak_tiny_owner_slab
```
### mimalloc Top Functions (>1%)
```
Children Self Samples Symbol
99.74% 0.00% 0 __libc_start_call_main
61.43% 41.47% 331 __random
43.63% 22.50% 177 __random_r
29.39% 19.88% 157 main
9.77% 2.42% 19 operator delete[](void*)
7.35% 1.80% 14 mi_malloc
```
---
## Profiling Commands Used
```bash
# HAKMEM profiling
perf record -o perf_random_mixed_hakmi.data -F 999 -g ./bench_random_mixed_hakmem 200000 400 1
perf report -i perf_random_mixed_hakmi.data --stdio -n --percent-limit 1
# mimalloc profiling
LD_LIBRARY_PATH=./mimalloc-bench/extern/mi/out/release \
perf record -o perf_random_mixed_mi.data -F 999 -g ./bench_random_mixed_mi 200000 400 1
perf report -i perf_random_mixed_mi.data --stdio -n --percent-limit 1
# Annotate specific functions
perf annotate -i perf_random_mixed_hakmi.data --stdio free
perf annotate -i perf_random_mixed_hakmi.data --stdio "hak_tiny_alloc_slow.constprop.0"
```