Tiny Mixed Workload Performance Analysis

Date: 2025-11-01
Benchmark: bench_random_mixed (200K cycles, 400 ws, seed=1)
Size Range: 8-128B (16 size classes)
Thread Count: 1 (single-threaded)

Executive Summary

HAKMEM is 32% slower than mimalloc on tiny mixed workloads:

  • HAKMEM: 16.46 M ops/sec
  • mimalloc: 24.21 M ops/sec
  • Relative: 68% of mimalloc speed

Root Cause: HAKMEM spends about 49% of its profile samples in allocator functions, versus about 17% for mimalloc, roughly a 3x larger share.
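The two headline ratios follow directly from the figures quoted above. A minimal sketch of the arithmetic, where the helper names are illustrative and not part of HAKMEM:

```c
#include <stdio.h>

/* Ratio of two measurements, e.g. throughput or profile share. */
static double ratio(double a, double b) {
    return a / b;
}

/* Prints the headline ratios using the figures quoted in this report. */
static void print_summary_ratios(void) {
    double rel_speed = ratio(16.46, 24.21); /* HAKMEM vs mimalloc, M ops/sec */
    double overhead  = ratio(49.0, 17.0);   /* allocator share of samples */
    printf("relative speed: %.0f%% of mimalloc\n", rel_speed * 100.0);
    printf("allocator overhead ratio: %.1fx\n", overhead);
}
```

The overhead ratio compares shares of each run's own samples. HAKMEM's run also takes longer overall, so the gap in absolute allocator time is larger than the share ratio alone suggests.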


Performance Breakdown

Overall Workload Profile

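Both allocators spend a large share of time in random number generation (libc __random, __random_r), which is expected for this benchmark: __random accounts for 43.11% of samples in the HAKMEM run and 61.43% in the mimalloc run (see Raw Perf Data).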

The critical difference is in allocator overhead:

Allocator  Total Overhead  malloc/new  free/delete
mimalloc   17%             7.35%       9.77%
HAKMEM     49%             27%         21.64%

HAKMEM Bottlenecks

1. free() Path (21.64% total, 6.75% self)

Perf annotate highlights:

9d45:  pthread_mutex_lock      (2.29%)   ← Global lock on g_mid_registry
9d53:  Binary search loop                ← Search g_mid_registry (Mid range check)
9dad:  pthread_mutex_unlock    (3.93%)   ← Unlock
9db5:  hak_tiny_owner_slab     (4.98%)   ← SECOND ownership check (Tiny range)
9dce:  hak_tiny_free                     ← Actual free

Problems:

  1. Double ownership check:
    • First: Global mutex + binary search in g_mid_registry (for Mid range 8KB-32KB)
    • Second: hak_tiny_owner_slab (for Tiny range 8B-1KB)
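  2. Global mutex overhead: Every free() locks g_mid_registry even for Tiny allocations (8-128B). The benchmark is single-threaded, so this is uncontended lock/unlock cost; with multiple threads it would also become a contention point.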
  3. Unnecessary Mid lookup: Mixed workload uses 8-128B sizes, which never hit Mid range

Impact:

  • mutex lock/unlock: 6.22%
  • hak_pool_mid_lookup (binary search): 7.08%
  • hak_tiny_owner_slab: 3.50%
  • Total wasted on ownership: ~16.8%

2. hak_tiny_alloc_slow Path (9.33% total, 2.95% self)

Perf annotate highlights:

Function prologue (push registers): 14.05%  ← Heavy stack frame
Multi-stage fallback:
  1. hotmag_try_refill (if class <= 3)
  2. TLS list check (g_tls_list_enable)
  3. superslab check (g_use_superslab)
  4. Magazine refill

Problems:

  1. Complex fallback chain: 4+ branches before actual allocation
  2. TLS variable access overhead: Multiple %fs: prefixed loads
  3. Branch misprediction: Conditional checks for hotmag, superslab, etc.

Impact:

  • Function overhead: ~3%
  • Fallback logic: ~6%

3. Allocation Path Summary

Function               Total %  Self %
free                   21.64    6.75
hak_tiny_alloc_slow    9.33     2.95
hak_pool_mid_lookup    7.08     2.92
hak_tiny_alloc         7.12     2.81
hak_tiny_owner_slab    3.50     2.41
malloc                 3.67     2.41
Total HAKMEM overhead  49%      -

mimalloc Efficiency

mimalloc achieves 17% total allocator overhead through:

  1. Fast ownership checks: Likely uses pointer bit patterns or alignment tricks (no mutex)
  2. Simple allocation path: mi_malloc (7.35%) has minimal branching
  3. Lockless thread-local caches: No global mutex for common cases

Optimization Roadmap

Priority 1: Eliminate Unnecessary Mid Lookup in free() 🔴

Current flow:

free(ptr)
  ↓
Lock g_mid_registry mutex
  ↓
Binary search g_mid_registry (8KB-32KB check)
  ↓
Unlock g_mid_registry mutex
  ↓
hak_tiny_owner_slab (8B-1KB check)  ← SHOULD BE FIRST!

Proposed flow:

free(ptr)
  ↓
Fast TLS range check (Tiny: 8B-1KB)  ← NO MUTEX
  ↓
  If miss: Check Mid registry (with mutex)
  ↓
  If miss: Check L25/other
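The proposed ordering can be sketched in C as follows. Every name here (classify_for_free, tls_tiny_start, mid_registry_contains, and so on) is a hypothetical stand-in rather than HAKMEM's real API, and the Mid registry is reduced to a single range so the example stays short:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, not HAKMEM's real API. */
typedef enum { OWNER_TINY, OWNER_MID, OWNER_OTHER } owner_t;

/* Address range covered by this thread's Tiny slabs (assumed contiguous). */
static _Thread_local uintptr_t tls_tiny_start, tls_tiny_end;

/* Stand-in for g_mid_registry: a single range guarded by a global mutex. */
static pthread_mutex_t mid_lock = PTHREAD_MUTEX_INITIALIZER;
static uintptr_t mid_start, mid_end;
static unsigned long mid_lock_count; /* how often the locked path ran */

static bool mid_registry_contains(uintptr_t p) {
    pthread_mutex_lock(&mid_lock);
    mid_lock_count++;
    bool hit = (p >= mid_start && p < mid_end);
    pthread_mutex_unlock(&mid_lock);
    return hit;
}

/* Classify a pointer: cheap TLS check first, locked registry only on a miss. */
static owner_t classify_for_free(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    if (p >= tls_tiny_start && p < tls_tiny_end)
        return OWNER_TINY;          /* common case: no mutex taken */
    if (mid_registry_contains(p))
        return OWNER_MID;
    return OWNER_OTHER;             /* L25 / other allocators */
}
```

In this sketch a Tiny pointer never touches mid_lock, which is the point of the reordering. A block freed by a thread other than its owner would miss the TLS range check and fall through to the slower paths, so a real implementation needs a separate answer for remote frees.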

Expected gain:

  • Eliminate 6.22% (mutex) + 7.08% (mid_lookup) ≈ 13.3% of profile samples on the Tiny path
  • For Tiny-dominated workloads: ~27% of the 49% allocator overhead removed (13.3 / 49)
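A share of samples removed is not the same thing as a throughput gain. As a rough estimate, assuming sample shares approximate time shares, that the mutex and mid_lookup figures do not overlap, and that nothing else changes, the projected throughput is the baseline divided by the remaining fraction of run time. The function name below is illustrative:

```c
/* Projected throughput if a fraction of run time is removed
 * and the rest of the run is unchanged. */
static double projected_throughput(double base_mops, double removed_fraction) {
    return base_mops / (1.0 - removed_fraction);
}
```

With the report's figures, projected_throughput(16.46, 0.133) comes to roughly 19.0 M ops/sec, about 15% above the current 16.46 and still well short of mimalloc's 24.21.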

Priority 2: Streamline free() Ownership Check 🟡

Current approach:

  • hak_tiny_owner_slab: Linear search through TLS slabs (3.50% overhead)

Proposed approach (mimalloc-style):

  1. Alignment check: if ((ptr & (SLAB_SIZE - 1)) == 0) → Not from slab
  2. Range check: if (ptr >= tls_heap_start && ptr < tls_heap_end) → TLS-owned
  3. Slab header: Direct jump to slab metadata at (ptr & ~(SLAB_SIZE - 1))

Expected gain:

  • Reduce hak_tiny_owner_slab from 3.50% → ~0.5%
  • Net speedup: ~3%
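A minimal sketch of the masking approach, under stated assumptions: slabs are SLAB_SIZE-aligned, SLAB_SIZE is a power of two (64 KiB here is a guess, not HAKMEM's actual value), and each slab begins with a header. The struct layout and function names are hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed slab size; must be a power of two. HAKMEM's real value may differ. */
#define SLAB_SIZE ((uintptr_t)64 * 1024)

/* Hypothetical slab header placed at the start of every slab. */
typedef struct slab_header {
    uint32_t class_idx;   /* size class served by this slab */
    uint32_t owner_tid;   /* thread that owns the slab */
} slab_header_t;

/* Round a block pointer down to its slab base: O(1), no search, no lock. */
static slab_header_t *slab_of(const void *ptr) {
    return (slab_header_t *)((uintptr_t)ptr & ~(SLAB_SIZE - 1));
}

/* Assumed reading of the alignment check: a slab-aligned pointer
 * addresses the header itself, so it cannot be a block. */
static int is_block_pointer(const void *ptr) {
    return ((uintptr_t)ptr & (SLAB_SIZE - 1)) != 0;
}

/* Allocate one SLAB_SIZE-aligned slab and fill in its header. */
static slab_header_t *slab_create(uint32_t class_idx, uint32_t tid) {
    slab_header_t *s = aligned_alloc(SLAB_SIZE, SLAB_SIZE);
    if (s) { s->class_idx = class_idx; s->owner_tid = tid; }
    return s;
}
```

The mask is only safe once the pointer is known to lie inside a slab, which is why the range check comes before the header lookup in the list above. Applying slab_of to an arbitrary pointer would read memory that is not a slab header.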

Priority 3: Simplify hak_tiny_alloc_slow Fallback 🟢

Current approach:

  • 4-stage fallback: hotmag → TLS list → superslab → magazine

Proposed approach:

  1. Check TLS magazine first (most common)
  2. Check superslab (if enabled)
  3. Refill from central depot

Expected gain:

  • Reduce branching overhead: ~2-3% speedup
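The three-step order could look like the toy sketch below. The magazine layout, the depot pool, and all function names are invented for illustration. The superslab step is a stub that always reports empty, so the example exercises only the magazine and depot paths:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAG_CAP 64

/* Hypothetical per-thread magazine: a small stack of free blocks. */
typedef struct {
    void  *items[MAG_CAP];
    size_t count;
} magazine_t;

static _Thread_local magazine_t tls_mag;
static bool use_superslab = true;     /* stand-in for g_use_superslab */

/* Toy central depot: 8 blocks of 32 bytes handed out once each. */
static char depot_pool[8][32];
static size_t depot_next;

static size_t depot_refill(magazine_t *m, size_t want) {
    size_t got = 0;
    while (got < want && depot_next < 8 && m->count < MAG_CAP) {
        m->items[m->count++] = depot_pool[depot_next++];
        got++;
    }
    return got;
}

/* Toy superslab source: always empty in this sketch. */
static void *superslab_alloc(void) { return NULL; }

static void *alloc_slow(void) {
    if (tls_mag.count > 0)                    /* 1. TLS magazine */
        return tls_mag.items[--tls_mag.count];
    if (use_superslab) {                      /* 2. superslab, if enabled */
        void *p = superslab_alloc();
        if (p) return p;
    }
    if (depot_refill(&tls_mag, 4) > 0)        /* 3. refill from depot */
        return tls_mag.items[--tls_mag.count];
    return NULL;                              /* nothing left */
}
```

Putting the magazine first means the most frequent case costs one branch and one load, and the refill leaves extra blocks behind so the next few calls also stop at step 1.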

Priority 4: Inline Hot Functions 🟢

Candidates:

  • hak_tiny_alloc (7.12% total, 2.81% self) → Inline into malloc
  • hak_tiny_owner_slab (3.50% total) → Inline into free

Expected gain:

  • Function call overhead reduction: ~2%
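As an illustration of what an inlinable fast path might look like, the sketch below marks the helpers static inline so the compiler is free to fold them into the caller. The 8-byte class step is an assumption inferred from "8-128B (16 size classes)", and the free-list layout and function names are hypothetical:

```c
#include <stddef.h>

/* Assumption: 16 classes in 8-byte steps cover 8..128 bytes. */
#define TINY_MAX_SIZE 128
#define TINY_CLASSES  16

typedef struct free_node { struct free_node *next; } free_node_t;

/* Hypothetical per-thread free list, one head per size class. */
static _Thread_local free_node_t *tls_free_list[TINY_CLASSES];

/* Map a request size to a class index, or -1 if outside the Tiny range. */
static inline int tiny_class_of(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return -1;
    return (int)((size + 7) / 8) - 1;
}

/* Push a free block onto its class list. */
static inline void tiny_push(void *block, int class_idx) {
    free_node_t *n = block;
    n->next = tls_free_list[class_idx];
    tls_free_list[class_idx] = n;
}

/* Pop a block if one is cached; NULL tells the caller to take the slow path. */
static inline void *tiny_alloc_fast(size_t size) {
    int c = tiny_class_of(size);
    if (c < 0) return NULL;
    free_node_t *n = tls_free_list[c];
    if (!n) return NULL;
    tls_free_list[c] = n->next;
    return n;
}
```

The inline keyword is only a hint, so whether a given function is actually inlined should be confirmed in the disassembly or with perf annotate after the change.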

Comparison with Mid MT Workload

Metric              Tiny Mixed (this)    Mid MT (8KB-32KB)
HAKMEM performance  68% of mimalloc      138% of mimalloc
Size range          8-128B (16 classes)  8KB-32KB (3 classes)
Thread count        1 thread             4 threads
HAKMEM strength     Weak                 Strong

Key insight: HAKMEM's Tiny allocator is not optimized for high-frequency small allocations. The Mid MT allocator excels due to:

  • Fewer size classes (3 vs 16)
  • Lock-free TLS caching
  • Optimized for 8KB+ allocations

Next Steps

  1. DONE: Profile with perf to identify bottlenecks
  2. 🔄 IN PROGRESS: Document findings (this file)
  3. TODO: Implement Priority 1 optimization (fast TLS range check in free())
  4. TODO: Benchmark improvement
  5. TODO: Iterate on Priority 2-4 optimizations

Raw Perf Data

HAKMEM Top Functions (>1%)

Children  Self    Samples  Symbol
57.97%    0.00%   0        __libc_start_call_main
43.11%    27.67%  321      __random
28.97%    15.14%  179      __random_r
21.64%    6.75%   79       free
9.33%     2.95%   35       hak_tiny_alloc_slow.constprop.0
7.12%     2.81%   33       hak_tiny_alloc
7.08%     2.92%   35       hak_pool_mid_lookup.constprop.0
3.67%     2.41%   29       malloc
3.50%     2.41%   29       hak_tiny_owner_slab

mimalloc Top Functions (>1%)

Children  Self    Samples  Symbol
99.74%    0.00%   0        __libc_start_call_main
61.43%    41.47%  331      __random
43.63%    22.50%  177      __random_r
29.39%    19.88%  157      main
9.77%     2.42%   19       operator delete[](void*)
7.35%     1.80%   14       mi_malloc

Profiling Commands Used

# HAKMEM profiling
perf record -o perf_random_mixed_hakmi.data -F 999 -g ./bench_random_mixed_hakmem 200000 400 1
perf report -i perf_random_mixed_hakmi.data --stdio -n --percent-limit 1

# mimalloc profiling
LD_LIBRARY_PATH=./mimalloc-bench/extern/mi/out/release \
  perf record -o perf_random_mixed_mi.data -F 999 -g ./bench_random_mixed_mi 200000 400 1
perf report -i perf_random_mixed_mi.data --stdio -n --percent-limit 1

# Annotate specific functions
perf annotate -i perf_random_mixed_hakmi.data --stdio free
perf annotate -i perf_random_mixed_hakmi.data --stdio "hak_tiny_alloc_slow.constprop.0"