Tiny Mixed Workload Performance Analysis
Date: 2025-11-01
Benchmark: bench_random_mixed (200K cycles, 400 ws, seed=1)
Size Range: 8-128B (16 size classes)
Thread Count: 1 (single-threaded)
Executive Summary
HAKMEM is 32% slower than mimalloc on tiny mixed workloads:
- HAKMEM: 16.46 M ops/sec
- mimalloc: 24.21 M ops/sec
- Relative: 68% of mimalloc speed
Root Cause: HAKMEM spends roughly 3x as large a share of its CPU cycles in allocator functions (49%) as mimalloc does (17%).
Performance Breakdown
Overall Workload Profile
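Both allocators spend a large share of their samples in random number generation (libc __random and __random_r: about 43% under HAKMEM and 61% under mimalloc by the Children column of the raw perf data), which is expected for this benchmark.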
The critical difference is in allocator overhead:
| Allocator | Total Overhead | malloc/new | free/delete |
|---|---|---|---|
| mimalloc | 17% | 7.35% | 9.77% |
| HAKMEM | 49% | 27% | 21.64% |
HAKMEM Bottlenecks
1. free() Path (21.64% total, 6.75% self)
Perf annotate highlights:
9d45: pthread_mutex_lock (2.29%) ← Global lock on g_mid_registry
9d53: Binary search loop ← Search g_mid_registry (Mid range check)
9dad: pthread_mutex_unlock (3.93%) ← Unlock
9db5: hak_tiny_owner_slab (4.98%) ← SECOND ownership check (Tiny range)
9dce: hak_tiny_free ← Actual free
Problems:
- Double ownership check:
  - First: Global mutex + binary search in g_mid_registry (for Mid range 8KB-32KB)
  - Second: hak_tiny_owner_slab (for Tiny range 8B-1KB)
- Global mutex contention: Every free() locks g_mid_registry even for Tiny allocations (8-128B)
- Unnecessary Mid lookup: Mixed workload uses 8-128B sizes, which never hit Mid range
Impact:
- mutex lock/unlock: 6.22%
- hak_pool_mid_lookup (binary search): 7.08%
- hak_tiny_owner_slab: 3.50%
- Total wasted on ownership: ~16.8%
2. hak_tiny_alloc_slow Path (9.33% total, 2.95% self)
Perf annotate highlights:
Function prologue (push registers): 14.05% ← Heavy stack frame
Multi-stage fallback:
1. hotmag_try_refill (if class <= 3)
2. TLS list check (g_tls_list_enable)
3. superslab check (g_use_superslab)
4. Magazine refill
Problems:
- Complex fallback chain: 4+ branches before actual allocation
- TLS variable access overhead: Multiple %fs: prefixed loads
- Branch misprediction: Conditional checks for hotmag, superslab, etc.
Impact:
- Function overhead: ~3%
- Fallback logic: ~6%
3. Allocation Path Summary
| Function | Total % | Self % |
|---|---|---|
| free | 21.64 | 6.75 |
| hak_tiny_alloc_slow | 9.33 | 2.95 |
| hak_pool_mid_lookup | 7.08 | 2.92 |
| hak_tiny_alloc | 7.12 | 2.81 |
| hak_tiny_owner_slab | 3.50 | 2.41 |
| malloc | 3.67 | 2.41 |
| Total HAKMEM overhead | 49% | - |
mimalloc Efficiency
mimalloc achieves 17% total allocator overhead through:
- Fast ownership checks: Likely uses pointer bit patterns or alignment tricks (no mutex)
- Simple allocation path: mi_malloc (7.35%) has minimal branching
- Lockless thread-local caches: No global mutex for common cases
Optimization Roadmap
Priority 1: Eliminate Unnecessary Mid Lookup in free() 🔴
Current flow:
free(ptr)
↓
Lock g_mid_registry mutex
↓
Binary search g_mid_registry (8KB-32KB check)
↓
Unlock g_mid_registry mutex
↓
hak_tiny_owner_slab (8B-1KB check) ← SHOULD BE FIRST!
Proposed flow:
free(ptr)
↓
Fast TLS range check (Tiny: 8B-1KB) ← NO MUTEX
↓
If miss: Check Mid registry (with mutex)
↓
If miss: Check L25/other
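The proposed ordering can be sketched in C as follows. This is a minimal illustration, not HAKMEM's actual code: classify_for_free, tiny_range_set, mid_registry_contains, and the tls_tiny_start / tls_tiny_end bounds are hypothetical names, and the Mid registry lookup is replaced by a stub that only counts calls so the cost of each path is visible.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread Tiny address range. A real allocator would set
 * these when the thread's Tiny heap is reserved. */
static _Thread_local uintptr_t tls_tiny_start;
static _Thread_local uintptr_t tls_tiny_end;

/* Counts how often the (normally mutex-protected) Mid path is taken. */
static unsigned long mid_lookups;

typedef enum { OWNER_TINY, OWNER_MID, OWNER_OTHER } owner_t;

/* Record the address range served by this thread's Tiny heap. */
void tiny_range_set(void *base, size_t len)
{
    tls_tiny_start = (uintptr_t)base;
    tls_tiny_end = tls_tiny_start + len;
}

/* Stand-in for the lock + binary search + unlock on g_mid_registry.
 * This stub never finds anything; it only records that it was called. */
static bool mid_registry_contains(const void *ptr)
{
    (void)ptr;
    mid_lookups++;
    return false;
}

/* Decide which allocator owns ptr, cheapest check first. */
owner_t classify_for_free(const void *ptr)
{
    uintptr_t p = (uintptr_t)ptr;

    /* 1. Two thread-local loads and two compares, no lock taken. */
    if (p >= tls_tiny_start && p < tls_tiny_end)
        return OWNER_TINY;

    /* 2. Only pointers outside the Tiny range pay for the registry. */
    if (mid_registry_contains(ptr))
        return OWNER_MID;

    /* 3. L25 or other allocators. */
    return OWNER_OTHER;
}
```

With a Tiny-only workload such as bench_random_mixed, every pointer would be expected to resolve in the first branch, so mid_lookups stays at zero. One caveat under this assumption: a purely thread-local range check does not recognize Tiny blocks freed by a different thread, so those would still fall through to the slower checks.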
Expected gain:
- Eliminate 6.22% (mutex) + 7.08% (mid_lookup) = ~13.3% of total cycles
- For Tiny-dominated workloads: ~27% of the 49% allocator overhead removed (13.3 / 49)
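As a rough cross-check, the cycle savings can be converted into a throughput estimate. The sketch below assumes the mutex and mid_lookup cycles (6.22% + 7.08%) disappear entirely and the remaining work is unchanged, which is an Amdahl-style estimate rather than a measurement. projected_mops and relative_to are hypothetical helpers; the inputs are the figures quoted in this document.

```c
#include <stdio.h>

/* Throughput after removing a fraction of total cycles, assuming the
 * rest of the work costs exactly what it did before. */
double projected_mops(double baseline_mops, double removed_fraction)
{
    return baseline_mops / (1.0 - removed_fraction);
}

/* Ratio of one throughput figure to a reference figure. */
double relative_to(double mops, double reference_mops)
{
    return mops / reference_mops;
}

/* Print the projection for the Priority 1 numbers. */
void print_priority1_projection(void)
{
    const double hakmem_mops = 16.46;              /* measured, from the summary */
    const double mimalloc_mops = 24.21;            /* measured, from the summary */
    const double removed = (6.22 + 7.08) / 100.0;  /* mutex + mid lookup */

    double after = projected_mops(hakmem_mops, removed);
    printf("projected: %.2f M ops/sec (%.0f%% of mimalloc)\n",
           after, 100.0 * relative_to(after, mimalloc_mops));
}
```

Under those assumptions, 16.46 M ops/sec becomes about 16.46 / 0.867 ≈ 19.0 M ops/sec, or roughly 78% of mimalloc's 24.21 M ops/sec. Treat this as an optimistic bound until the change is benchmarked.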
Priority 2: Streamline free() Ownership Check 🟡
Current approach:
hak_tiny_owner_slab: Linear search through TLS slabs (3.50% overhead)
Proposed approach (mimalloc-style):
- Alignment check: if ((ptr & (SLAB_SIZE - 1)) == 0) → Not from slab
- Range check: if (ptr >= tls_heap_start && ptr < tls_heap_end) → TLS-owned
- Slab header: Direct jump to slab metadata at (ptr & ~(SLAB_SIZE - 1))
Expected gain:
- Reduce hak_tiny_owner_slab from 3.50% → ~0.5%
- Net speedup: ~3%
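The mask-based lookup proposed under Priority 2 can be sketched as below. It assumes slabs are allocated at SLAB_SIZE-aligned addresses with the header at the start of each slab; the 64 KiB SLAB_SIZE, the slab_header_t layout, the magic value, and the helper names are all hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SLAB_SIZE  ((size_t)64 * 1024)  /* assumed; must be a power of two */
#define SLAB_MAGIC 0x51ABu              /* hypothetical header tag */

typedef struct slab_header {
    uint32_t magic;       /* lets callers sanity-check the header */
    uint32_t size_class;  /* size class served by this slab */
} slab_header_t;

/* Round a block pointer down to the start of its slab: O(1), no search. */
static inline slab_header_t *slab_of(const void *ptr)
{
    return (slab_header_t *)((uintptr_t)ptr & ~((uintptr_t)SLAB_SIZE - 1));
}

/* The header occupies the slab boundary, so a pointer that sits exactly
 * on the boundary cannot be a block handed out from the slab. */
static inline int is_slab_aligned(const void *ptr)
{
    return ((uintptr_t)ptr & ((uintptr_t)SLAB_SIZE - 1)) == 0;
}

/* Allocate one SLAB_SIZE-aligned slab and initialize its header. */
slab_header_t *slab_create(uint32_t size_class)
{
    slab_header_t *s = aligned_alloc(SLAB_SIZE, SLAB_SIZE);
    if (s != NULL) {
        s->magic = SLAB_MAGIC;
        s->size_class = size_class;
    }
    return s;
}
```

Dereferencing the result of slab_of is only safe once the pointer is known to lie inside a slab, which is why the range check comes before the header access in the list above.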
Priority 3: Simplify hak_tiny_alloc_slow Fallback 🟢
Current approach:
- 4-stage fallback: hotmag → TLS list → superslab → magazine
Proposed approach:
- Check TLS magazine first (most common)
- Check superslab (if enabled)
- Refill from central depot
Expected gain:
- Reduce branching overhead: ~2-3% speedup
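The three-stage order proposed for Priority 3 might look like the following sketch. It handles a single size class and uses toy stand-ins: magazine_t, tls_mag, use_superslab (standing in for g_use_superslab), superslab_alloc, and the depot are hypothetical, and the superslab stage is a stub that always misses.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAG_CAP      8    /* assumed magazine capacity */
#define DEPOT_BLOCKS 64   /* size of the toy central depot */

typedef struct {
    void *items[MAG_CAP];
    int   count;
} magazine_t;

static _Thread_local magazine_t tls_mag;  /* per-thread magazine */
static bool use_superslab = false;        /* stands in for g_use_superslab */

static char depot_storage[DEPOT_BLOCKS][16];  /* toy depot of 16-byte blocks */
static int  depot_next;                       /* next unused depot block */

/* Stub: a real implementation would carve a block from a superslab. */
static void *superslab_alloc(void)
{
    return NULL;
}

/* Move blocks from the depot into the magazine; returns how many moved. */
static int depot_refill(magazine_t *m)
{
    int moved = 0;
    while (m->count < MAG_CAP && depot_next < DEPOT_BLOCKS) {
        m->items[m->count++] = depot_storage[depot_next++];
        moved++;
    }
    return moved;
}

void *tiny_alloc_slow_sketch(void)
{
    /* Stage 1: thread-local magazine, the most common hit. */
    if (tls_mag.count > 0)
        return tls_mag.items[--tls_mag.count];

    /* Stage 2: superslab, only when enabled. */
    if (use_superslab) {
        void *p = superslab_alloc();
        if (p != NULL)
            return p;
    }

    /* Stage 3: refill the magazine from the central depot, then retry. */
    if (depot_refill(&tls_mag) > 0)
        return tls_mag.items[--tls_mag.count];

    return NULL;  /* depot exhausted */
}
```

In this arrangement the common case costs one thread-local load and one branch, and the depot is touched only once per MAG_CAP allocations.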
Priority 4: Inline Hot Functions 🟢
Candidates:
- hak_tiny_alloc (7.12% total, 2.81% self) → Inline into malloc
- hak_tiny_owner_slab (3.50% total) → Inline into free
Expected gain:
- Function call overhead reduction: ~2%
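One way to force inlining of a small hot helper with GCC or Clang is the always_inline function attribute, sketched below. The HAK_ALWAYS_INLINE macro, TINY_MAX, and the 8-byte-step size-class mapping are assumptions made for illustration (16 classes covering 8-128B, matching the benchmark's size range); they are not taken from HAKMEM's source.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define HAK_ALWAYS_INLINE static inline __attribute__((always_inline))
#else
#define HAK_ALWAYS_INLINE static inline   /* plain hint on other compilers */
#endif

#define TINY_MAX 128  /* assumed upper bound of the benchmark's size range */

/* Map a request size to a class index in 8-byte steps:
 * 1..8 -> 0, 9..16 -> 1, ..., 121..128 -> 15. Returns -1 if out of range. */
HAK_ALWAYS_INLINE int tiny_size_class(size_t size)
{
    if (size == 0 || size > TINY_MAX)
        return -1;
    return (int)((size - 1) / 8);
}

/* Caller in the hot path: the helper body is expanded here, so no call
 * instruction is emitted for tiny_size_class. */
int classify_request(size_t size)
{
    return tiny_size_class(size);
}
```

Forced inlining trades code size for call overhead, so the ~2% estimate above would need to be confirmed with a benchmark run after the change.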
Comparison with Mid MT Workload
| Metric | Tiny Mixed (this) | Mid MT (8KB-32KB) |
|---|---|---|
| HAKMEM performance | 68% of mimalloc | 138% of mimalloc |
| Size range | 8-128B (16 classes) | 8KB-32KB (3 classes) |
| Thread count | 1 thread | 4 threads |
| HAKMEM strength | ❌ Weak | ✅ Strong |
Key insight: HAKMEM's Tiny allocator is not optimized for high-frequency small allocations. The Mid MT allocator excels due to:
- Fewer size classes (3 vs 16)
- Lock-free TLS caching
- Optimized for 8KB+ allocations
Next Steps
- ✅ DONE: Profile with perf to identify bottlenecks
- 🔄 IN PROGRESS: Document findings (this file)
- ⏳ TODO: Implement Priority 1 optimization (fast TLS range check in free())
- ⏳ TODO: Benchmark improvement
- ⏳ TODO: Iterate on Priority 2-4 optimizations
Raw Perf Data
HAKMEM Top Functions (>1%)
Children Self Samples Symbol
57.97% 0.00% 0 __libc_start_call_main
43.11% 27.67% 321 __random
28.97% 15.14% 179 __random_r
21.64% 6.75% 79 free
9.33% 2.95% 35 hak_tiny_alloc_slow.constprop.0
7.12% 2.81% 33 hak_tiny_alloc
7.08% 2.92% 35 hak_pool_mid_lookup.constprop.0
3.67% 2.41% 29 malloc
3.50% 2.41% 29 hak_tiny_owner_slab
mimalloc Top Functions (>1%)
Children Self Samples Symbol
99.74% 0.00% 0 __libc_start_call_main
61.43% 41.47% 331 __random
43.63% 22.50% 177 __random_r
29.39% 19.88% 157 main
9.77% 2.42% 19 operator delete[](void*)
7.35% 1.80% 14 mi_malloc
Profiling Commands Used
# HAKMEM profiling
perf record -o perf_random_mixed_hakmi.data -F 999 -g ./bench_random_mixed_hakmem 200000 400 1
perf report -i perf_random_mixed_hakmi.data --stdio -n --percent-limit 1
# mimalloc profiling
LD_LIBRARY_PATH=./mimalloc-bench/extern/mi/out/release \
perf record -o perf_random_mixed_mi.data -F 999 -g ./bench_random_mixed_mi 200000 400 1
perf report -i perf_random_mixed_mi.data --stdio -n --percent-limit 1
# Annotate specific functions
perf annotate -i perf_random_mixed_hakmi.data --stdio free
perf annotate -i perf_random_mixed_hakmi.data --stdio "hak_tiny_alloc_slow.constprop.0"