Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History

Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-05 12:31:14 +09:00

Tiny Mixed Workload Performance Analysis

Date: 2025-11-01
Benchmark: bench_random_mixed (200K cycles, 400 ws, seed=1)
Size Range: 8-128B (16 size classes)
Thread Count: 1 (single-threaded)

Executive Summary

HAKMEM is 32% slower than mimalloc on tiny mixed workloads:

HAKMEM: 16.46 M ops/sec
mimalloc: 24.21 M ops/sec
Relative: 68% of mimalloc speed

Root Cause: HAKMEM spends nearly 3x as much of its profile in allocator functions (49% of samples) as mimalloc does (17%).

Performance Breakdown

Overall Workload Profile

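Random number generation (libc __random, __random_r) dominates both profiles, which is expected for this benchmark: __random accounts for about 43% of samples under HAKMEM and about 61% under mimalloc (Children column in Raw Perf Data).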

The critical difference is in allocator overhead:

Allocator	Total Overhead	malloc/new	free/delete
mimalloc	17%	7.35%	9.77%
HAKMEM	49%	27%	21.64%

HAKMEM Bottlenecks

1. `free()` Path (21.64% total, 6.75% self)

Perf annotate highlights:

9d45:  pthread_mutex_lock      (2.29%)   ← Global lock on g_mid_registry
9d53:  Binary search loop                ← Search g_mid_registry (Mid range check)
9dad:  pthread_mutex_unlock    (3.93%)   ← Unlock
9db5:  hak_tiny_owner_slab     (4.98%)   ← SECOND ownership check (Tiny range)
9dce:  hak_tiny_free                     ← Actual free

Problems:

1. Double ownership check:
   - First: Global mutex + binary search in g_mid_registry (for Mid range 8KB-32KB)
   - Second: hak_tiny_owner_slab (for Tiny range 8B-1KB)
2. Global mutex contention: Every free() locks g_mid_registry even for Tiny allocations (8-128B)
3. Unnecessary Mid lookup: Mixed workload uses 8-128B sizes, which never hit Mid range

Impact:

mutex lock/unlock: 6.22%
hak_pool_mid_lookup (binary search): 7.08%
hak_tiny_owner_slab: 3.50%
Total wasted on ownership: ~16.8%

2. `hak_tiny_alloc_slow` Path (9.33% total, 2.95% self)

Perf annotate highlights:

Function prologue (push registers): 14.05%  ← Heavy stack frame
Multi-stage fallback:
  1. hotmag_try_refill (if class <= 3)
  2. TLS list check (g_tls_list_enable)
  3. superslab check (g_use_superslab)
  4. Magazine refill

Problems:

Complex fallback chain: 4+ branches before actual allocation
TLS variable access overhead: Multiple %fs: prefixed loads
Branch misprediction: Conditional checks for hotmag, superslab, etc.

Impact:

Function overhead: ~3%
Fallback logic: ~6%

3. Allocation Path Summary

Function	Total %	Self %
`free`	21.64	6.75
`hak_tiny_alloc_slow`	9.33	2.95
`hak_pool_mid_lookup`	7.08	2.92
`hak_tiny_alloc`	7.12	2.81
`hak_tiny_owner_slab`	3.50	2.41
`malloc`	3.67	2.41
Total HAKMEM overhead	49%	-

mimalloc Efficiency

mimalloc achieves 17% total allocator overhead through:

Fast ownership checks: Likely uses pointer bit patterns or alignment tricks (no mutex)
Simple allocation path: mi_malloc (7.35%) has minimal branching
Lockless thread-local caches: No global mutex for common cases
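
The lock-free thread-local cache idea can be illustrated with a minimal C sketch. This is a generic intrusive free list, not mimalloc's actual implementation; the names `free_node`, `t_free_list`, `tls_push`, and `tls_pop` are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* A freed block stores the link to the next free block in its own first bytes. */
struct free_node { struct free_node *next; };

/* One list head per thread: no other thread touches it, so no lock is needed. */
static _Thread_local struct free_node *t_free_list;

/* free() fast path: push the block onto this thread's list. */
static void tls_push(void *block) {
    assert(block != NULL);
    struct free_node *n = block;
    n->next = t_free_list;
    t_free_list = n;
}

/* malloc() fast path: pop the most recently freed block, if any. */
static void *tls_pop(void) {
    struct free_node *n = t_free_list;
    if (n) t_free_list = n->next;
    return n;   /* NULL means the cache is empty and a slower path must refill it */
}
```

Both operations are a handful of loads and stores on thread-private data, which is the property the list above attributes to mimalloc's common case.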

Optimization Roadmap

Priority 1: Eliminate Unnecessary Mid Lookup in `free()` 🔴

Current flow:

free(ptr)
  ↓
Lock g_mid_registry mutex
  ↓
Binary search g_mid_registry (8KB-32KB check)
  ↓
Unlock g_mid_registry mutex
  ↓
hak_tiny_owner_slab (8B-1KB check)  ← SHOULD BE FIRST!

Proposed flow:

free(ptr)
  ↓
Fast TLS range check (Tiny: 8B-1KB)  ← NO MUTEX
  ↓
  If miss: Check Mid registry (with mutex)
  ↓
  If miss: Check L25/other
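
The proposed ordering can be sketched in C as follows. This is an illustration under stated assumptions, not HAKMEM's code: `tiny_owns`, `mid_registry_lookup`, `classify_for_free`, the fixed-size demo regions, and the `g_mid_lock_count` counter are all hypothetical stand-ins for the real Tiny range check and g_mid_registry lookup.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum owner { OWNER_TINY, OWNER_MID, OWNER_OTHER };

static _Thread_local unsigned char t_tiny_arena[4096]; /* pretend Tiny region of this thread */
static unsigned char g_mid_region[4096];               /* pretend Mid region */
static pthread_mutex_t g_mid_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long g_mid_lock_count;                 /* demo only: counts lock acquisitions */

/* Lock-free range check against this thread's Tiny arena. */
static bool tiny_owns(const void *p) {
    uintptr_t a = (uintptr_t)p, lo = (uintptr_t)t_tiny_arena;
    return a >= lo && a < lo + sizeof t_tiny_arena;
}

/* Mutex-guarded lookup, standing in for the binary search in the Mid registry. */
static bool mid_registry_lookup(const void *p) {
    pthread_mutex_lock(&g_mid_lock);
    g_mid_lock_count++;
    uintptr_t a = (uintptr_t)p, lo = (uintptr_t)g_mid_region;
    bool hit = a >= lo && a < lo + sizeof g_mid_region;
    pthread_mutex_unlock(&g_mid_lock);
    return hit;
}

/* Proposed order: Tiny first (no mutex), Mid registry only on a miss. */
static enum owner classify_for_free(const void *p) {
    assert(p != NULL);
    if (tiny_owns(p)) return OWNER_TINY;
    if (mid_registry_lookup(p)) return OWNER_MID;
    return OWNER_OTHER;
}
```

With this order a Tiny pointer is classified without ever touching the mutex; only pointers that miss the Tiny check pay for the registry lookup.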

Expected gain:

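Eliminate 6.22% (mutex) + 7.08% (mid_lookup) = ~13.3% of samples, or roughly a 15% throughput gain if the rest of the profile is unchanged
For Tiny-dominated workloads: about 27% of HAKMEM's 49% allocator overhead removed (about 40% of the 32-point gap to mimalloc's 17%)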
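
As a rough cross-check, the removed sample share can be converted into a throughput estimate. This assumes the mutex and mid_lookup percentages are non-overlapping fractions of total runtime and that nothing else in the profile changes; the perf data does not confirm either assumption.

$$
f = 0.0622 + 0.0708 = 0.133, \qquad
\frac{T_{\text{new}}}{T_{\text{old}}} = \frac{1}{1 - f} = \frac{1}{0.867} \approx 1.15
$$

$$
T_{\text{new}} \approx \frac{16.46}{0.867} \approx 19.0\ \text{M ops/sec} \approx 78\%\ \text{of mimalloc's } 24.21\ \text{M ops/sec}
$$

Under these assumptions Priority 1 alone would close about a third of the throughput gap to mimalloc.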

Priority 2: Streamline `free()` Ownership Check 🟡

Current approach:

hak_tiny_owner_slab: Linear search through TLS slabs (3.50% overhead)

Proposed approach (mimalloc-style):

Alignment check: if ((ptr & (SLAB_SIZE - 1)) == 0) → Not from slab
Range check: if (ptr >= tls_heap_start && ptr < tls_heap_end) → TLS-owned
Slab header: Direct jump to slab metadata at (ptr & ~(SLAB_SIZE - 1))
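
The alignment check and the masked header lookup can be sketched as below. The 64 KiB `SLAB_SIZE`, the `slab_header` layout, and the helpers `slab_create` and `slab_of` are assumptions made for this example; the source does not state HAKMEM's actual slab size or header format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SLAB_SIZE ((size_t)64 * 1024)  /* assumed power-of-two slab size */
#define SLAB_MAGIC 0x51ABu

/* Hypothetical slab header placed at the slab's base address. */
struct slab_header {
    uint32_t magic;       /* marks a valid slab in this sketch */
    uint32_t block_size;  /* size class served by this slab */
};

/* Allocate one slab aligned to its own size so masking finds the header. */
static struct slab_header *slab_create(uint32_t block_size) {
    assert((SLAB_SIZE & (SLAB_SIZE - 1)) == 0);  /* mask trick needs a power of two */
    /* aligned_alloc (C11): the size must be a multiple of the alignment */
    struct slab_header *s = aligned_alloc(SLAB_SIZE, SLAB_SIZE);
    if (!s) return NULL;
    s->magic = SLAB_MAGIC;
    s->block_size = block_size;
    return s;
}

/* Map an interior pointer to its slab header with a single mask. */
static struct slab_header *slab_of(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    if ((p & (SLAB_SIZE - 1)) == 0) return NULL;  /* slab base holds the header, not a block */
    return (struct slab_header *)(p & ~(uintptr_t)(SLAB_SIZE - 1));
}
```

The lookup costs one AND instead of a search. A real allocator would still need the range check from the list above before dereferencing the result, since masking an arbitrary foreign pointer yields an address that may not be a slab at all.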

Expected gain:

Reduce hak_tiny_owner_slab from 3.50% → ~0.5%
Net speedup: ~3%

Priority 3: Simplify `hak_tiny_alloc_slow` Fallback 🟢

Current approach:

4-stage fallback: hotmag → TLS list → superslab → magazine

Proposed approach:

Check TLS magazine first (most common)
Check superslab (if enabled)
Refill from central depot
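
A minimal sketch of the three-step order follows. Everything here is a stand-in: `struct magazine`, `superslab_alloc`, `depot_refill`, and the static `g_depot` array are hypothetical, and the depot stub simply hands out the same demo blocks on every refill. Only the flag name g_use_superslab is taken from the profile above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAG_CAP 8

/* Hypothetical per-thread magazine: a small LIFO stack of ready blocks. */
struct magazine { void *items[MAG_CAP]; int count; };

static _Thread_local struct magazine t_mag;
static bool g_use_superslab = false;         /* flag name as seen in the profile */
static unsigned char g_depot[MAG_CAP][64];   /* stand-in for a central depot */
static int g_depot_refills;                  /* demo only: counts refills */

static void *superslab_alloc(void) { return NULL; }  /* stub: always misses here */

/* Stub refill: fills the magazine from the static demo depot. */
static void depot_refill(struct magazine *m) {
    for (int i = 0; i < MAG_CAP; i++) m->items[i] = g_depot[i];
    m->count = MAG_CAP;
    g_depot_refills++;
}

static void *tiny_alloc_slow(void) {
    if (t_mag.count > 0) return t_mag.items[--t_mag.count];  /* 1. TLS magazine */
    if (g_use_superslab) {                                    /* 2. superslab, if enabled */
        void *p = superslab_alloc();
        if (p) return p;
    }
    depot_refill(&t_mag);                                     /* 3. central depot */
    assert(t_mag.count > 0);
    return t_mag.items[--t_mag.count];
}
```

The common case exits after a single predictable branch; the superslab and depot branches are reached only when the magazine is empty.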

Expected gain:

Reduce branching overhead: ~2-3% speedup

Priority 4: Inline Hot Functions 🟢

Candidates:

hak_tiny_alloc (7.12% total, 2.81% self) → Inline into malloc
hak_tiny_owner_slab (3.50% total) → Inline into free
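
One common way to structure this is an inlined fast path with the slow path kept out of line. The sketch below is an assumption about how that might look, not HAKMEM's code: `tiny_alloc_fast`, `tiny_alloc_slow_path`, `my_malloc_tiny`, and the small `t_cache` array are hypothetical, and the `noinline` attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_CAP 4

static _Thread_local void *t_cache[CACHE_CAP];  /* hypothetical per-thread block cache */
static _Thread_local int t_cache_count;
static unsigned char g_backing[64];             /* stand-in block returned by the slow path */
static int g_slow_calls;                        /* demo only: counts slow-path entries */

/* Slow path kept out of line so the inlined fast path stays small. */
#if defined(__GNUC__)
__attribute__((noinline))
#endif
static void *tiny_alloc_slow_path(void) {
    g_slow_calls++;
    return g_backing;
}

/* Fast path: small enough that the compiler can fold it into the caller. */
static inline void *tiny_alloc_fast(void) {
    assert(t_cache_count <= CACHE_CAP);
    if (t_cache_count > 0) return t_cache[--t_cache_count];
    return tiny_alloc_slow_path();
}

/* Public entry point: the fast path is expanded here, with no extra call. */
void *my_malloc_tiny(void) { return tiny_alloc_fast(); }
```

Whether the compiler actually inlines the fast path depends on optimization level and heuristics, so the expected gain should be confirmed by re-profiling.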

Expected gain:

Function call overhead reduction: ~2%

Comparison with Mid MT Workload

Metric	Tiny Mixed (this)	Mid MT (8KB-32KB)
HAKMEM performance	68% of mimalloc	138% of mimalloc
Size range	8-128B (16 classes)	8KB-32KB (3 classes)
Thread count	1 thread	4 threads
HAKMEM strength	❌ Weak	✅ Strong

Key insight: HAKMEM's Tiny allocator is not optimized for high-frequency small allocations. The Mid MT allocator excels due to:

Fewer size classes (3 vs 16)
Lock-free TLS caching
Optimized for 8KB+ allocations

Next Steps

✅ DONE: Profile with perf to identify bottlenecks
🔄 IN PROGRESS: Document findings (this file)
⏳ TODO: Implement Priority 1 optimization (fast TLS range check in free())
⏳ TODO: Benchmark improvement
⏳ TODO: Iterate on Priority 2-4 optimizations

Raw Perf Data

HAKMEM Top Functions (>1%)

Children  Self    Samples  Symbol
57.97%    0.00%   0        __libc_start_call_main
43.11%    27.67%  321      __random
28.97%    15.14%  179      __random_r
21.64%    6.75%   79       free
9.33%     2.95%   35       hak_tiny_alloc_slow.constprop.0
7.12%     2.81%   33       hak_tiny_alloc
7.08%     2.92%   35       hak_pool_mid_lookup.constprop.0
3.67%     2.41%   29       malloc
3.50%     2.41%   29       hak_tiny_owner_slab

mimalloc Top Functions (>1%)

Children  Self    Samples  Symbol
99.74%    0.00%   0        __libc_start_call_main
61.43%    41.47%  331      __random
43.63%    22.50%  177      __random_r
29.39%    19.88%  157      main
9.77%     2.42%   19       operator delete[](void*)
7.35%     1.80%   14       mi_malloc

Profiling Commands Used

# HAKMEM profiling
perf record -o perf_random_mixed_hakmi.data -F 999 -g ./bench_random_mixed_hakmem 200000 400 1
perf report -i perf_random_mixed_hakmi.data --stdio -n --percent-limit 1

# mimalloc profiling
LD_LIBRARY_PATH=./mimalloc-bench/extern/mi/out/release \
  perf record -o perf_random_mixed_mi.data -F 999 -g ./bench_random_mixed_mi 200000 400 1
perf report -i perf_random_mixed_mi.data --stdio -n --percent-limit 1

# Annotate specific functions
perf annotate -i perf_random_mixed_hakmi.data --stdio free
perf annotate -i perf_random_mixed_hakmi.data --stdio "hak_tiny_alloc_slow.constprop.0"
