Commit: Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When warm pool becomes empty, we now do multiple superslab_refills and prefill
the pool with 3 additional HOT superslabs before attempting to carve. This
builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs (see the sketch after this list)
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
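
A minimal sketch of the secondary-prefill cold path described above; the helpers superslab_refill_one(), warm_pool_count(), warm_pool_push(), and carve_block() are illustrative names (only g_warm_pool_stats[].prefilled is named by this change):

#include <stdint.h>

/* Assumed internals; prototypes are placeholders for illustration. */
typedef struct SuperSlab SuperSlab;
typedef struct { uint64_t hits, misses, prefilled; } WarmPoolStats;
extern WarmPoolStats g_warm_pool_stats[];
SuperSlab* superslab_refill_one(int class_idx);   /* one registry scan + refill */
int        warm_pool_count(int class_idx);
void       warm_pool_push(int class_idx, SuperSlab* ss);
void*      carve_block(SuperSlab* ss, int class_idx);

#define WARM_POOL_PREFILL_BUDGET 3   /* hardcoded budget, per the change above */

static void* cold_path_alloc_with_prefill(int class_idx) {
    if (warm_pool_count(class_idx) == 0) {
        /* Pool is empty: park extra HOT superslabs before carving. */
        for (int i = 0; i < WARM_POOL_PREFILL_BUDGET; i++) {
            SuperSlab* extra = superslab_refill_one(class_idx);
            if (!extra) break;                   /* registry has nothing left */
            warm_pool_push(class_idx, extra);    /* serve future misses from here */
        }
        g_warm_pool_stats[class_idx].prefilled++;  /* one prefill event */
    }
    /* Keep one superslab in TLS and carve from it immediately. */
    SuperSlab* ss = superslab_refill_one(class_idx);
    return ss ? carve_block(ss, class_idx) : NULL;
}

The point is that a single miss now buys a small working set of superslabs, so the next few misses are served from the warm pool instead of triggering another registry scan.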

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1
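
A sketch of the ENV-gated printing; the counter layout and output format are assumptions, while the HAKMEM_WARM_POOL_STATS=1 gate and the .prefilled counter come from this change:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct { uint64_t hits, misses, prefilled; } WarmPoolStats;
extern WarmPoolStats g_warm_pool_stats[8];   /* 8 tiny classes assumed */

static void warm_pool_print_stats_if_enabled(void) {
    const char* env = getenv("HAKMEM_WARM_POOL_STATS");
    if (!env || env[0] != '1') return;       /* stats always compiled, printing opt-in */
    for (int c = 0; c < 8; c++) {
        fprintf(stderr, "warm_pool C%d: hits=%llu misses=%llu prefilled=%llu\n", c,
                (unsigned long long)g_warm_pool_stats[c].hits,
                (unsigned long long)g_warm_pool_stats[c].misses,
                (unsigned long long)g_warm_pool_stats[c].prefilled);
    }
}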

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates (sketched below)
- Validate at larger allocation counts (10M+ pending registry size fix)
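
One possible shape for the adaptive budget idea above; the helper and its thresholds are purely illustrative, with 3 matching the current hardcoded default:

#include <stdint.h>

/* Pick a per-class prefill budget from observed warm-pool hit/miss counters. */
static inline int warm_pool_prefill_budget(uint64_t hits, uint64_t misses) {
    uint64_t total = hits + misses;
    if (total < 1024) return 3;          /* too few samples: keep the default */
    double hit_rate = (double)hits / (double)total;
    if (hit_rate < 0.25) return 6;       /* starved pool: prefill more superslabs */
    if (hit_rate > 0.75) return 2;       /* healthy pool: prefill fewer */
    return 3;
}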


HAKMEM Performance Profile Analysis: CPU Cycle Bottleneck Investigation

Benchmark: bench_tiny_hot (64-byte allocations, 20M operations)

Date: 2025-12-04
Objective: Identify where HAKMEM spends CPU cycles compared to mimalloc (7.88x slower)


Executive Summary

HAKMEM is 7.88x slower than mimalloc on tiny hot allocations (48.8 vs 6.2 cycles/op). The performance gap comes from 4 main sources:

  1. Malloc overhead (32.4% of gap): Complex wrapper logic, environment checks, initialization barriers
  2. Free overhead (29.4% of gap): Multi-layer free path with validation and routing
  3. Cache refill (15.7% of gap): Expensive superslab metadata lookups and validation
  4. Infrastructure (22.5% of gap): Cache misses, branch mispredictions, diagnostic code

Key Finding: Cache Miss Penalty Dominates

  • 238M cycles lost to cache misses (24.4% of total runtime!)
  • HAKMEM has 20.3x more cache misses than mimalloc (1.19M vs 58.7K)
  • L1 D-cache misses are 97.7x higher (4.29M vs 43.9K)

Detailed Performance Metrics

Overall Comparison

| Metric | HAKMEM | mimalloc | Ratio |
| --- | --- | --- | --- |
| Total cycles | 975,602,722 | 123,838,496 | 7.88x |
| Total instructions | 3,782,043,459 | 515,485,797 | 7.34x |
| Cycles per op | 48.8 | 6.2 | 7.88x |
| Instructions per op | 189.1 | 25.8 | 7.34x |
| IPC (inst/cycle) | 3.88 | 4.16 | 0.93x |
| Cache misses | 1,191,800 | 58,727 | 20.29x |
| Cache miss rate (‰ of ops) | 59.59‰ | 2.94‰ | 20.29x |
| Branch misses | 1,497,133 | 58,943 | 25.40x |
| Branch miss rate | 0.17% | 0.05% | 3.20x |
| L1 D-cache misses | 4,291,649 | 43,913 | 97.73x |
| L1 miss rate | 0.41% | 0.03% | 13.88x |

IPC Analysis

  • HAKMEM IPC: 3.88 (good, but memory-bound)
  • mimalloc IPC: 4.16 (better, less memory stall)
  • Interpretation: Both have high IPC, but HAKMEM is bottlenecked by memory access patterns

Function-Level Cycle Breakdown

HAKMEM: Where Cycles Are Spent

| Function | % Total | Cycles | Cycles/op | Category |
| --- | --- | --- | --- | --- |
| malloc | 33.32% | 325,070,826 | 16.25 | Hot path allocation |
| unified_cache_refill | 13.67% | 133,364,892 | 6.67 | Cache miss handler |
| free.part.0 | 12.22% | 119,218,652 | 5.96 | Free wrapper |
| main (benchmark) | 12.07% | 117,755,248 | 5.89 | Test harness |
| hak_free_at.constprop.0 | 11.55% | 112,682,114 | 5.63 | Free routing |
| hak_tiny_free_fast_v2 | 8.11% | 79,121,380 | 3.96 | Free fast path |
| kernel/other | 9.06% | 88,389,606 | 4.42 | Syscalls, page faults |
| TOTAL | 100% | 975,602,722 | 48.78 | |

mimalloc: Where Cycles Are Spent

| Function | % Total | Cycles | Cycles/op | Category |
| --- | --- | --- | --- | --- |
| operator delete[] | 48.66% | 60,259,812 | 3.01 | Free path |
| malloc | 39.82% | 49,312,489 | 2.47 | Allocation path |
| kernel/other | 6.77% | 8,383,866 | 0.42 | Syscalls, page faults |
| main (benchmark) | 4.75% | 5,882,328 | 0.29 | Test harness |
| TOTAL | 100% | 123,838,496 | 6.19 | |

Insight: HAKMEM Fragmentation

  • mimalloc concentrates 88.5% of cycles in malloc/free
  • HAKMEM spreads across 6 functions (malloc + 3 free variants + refill + wrapper)
  • Recommendation: Consolidate hot path to reduce function call overhead

Cache Miss Deep Dive

Cache Misses by Function (HAKMEM)

| Function | % | Cache Misses | Misses/op | Impact |
| --- | --- | --- | --- | --- |
| malloc | 58.51% | 697,322 | 0.0349 | CRITICAL |
| unified_cache_refill | 29.92% | 356,586 | 0.0178 | HIGH |
| Other | 11.57% | 137,892 | 0.0069 | Low |

Estimated Penalty

  • Cache miss penalty: 238,360,000 cycles (assuming ~200 cycles/LLC miss)
  • Per operation: 11.9 cycles lost to cache misses
  • Percentage of total: 24.4% of all cycles
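
The figures above follow directly from the counters; a standalone arithmetic check using the same ~200 cycles/LLC-miss assumption:

#include <stdio.h>

int main(void) {
    const double misses       = 1191800.0;     /* HAKMEM cache misses (measured) */
    const double miss_cost    = 200.0;         /* assumed cycles per LLC miss */
    const double ops          = 20e6;          /* benchmark operations */
    const double total_cycles = 975602722.0;   /* HAKMEM total cycles (measured) */

    double penalty = misses * miss_cost;
    printf("penalty cycles: %.0f\n", penalty);                          /* 238,360,000 */
    printf("cycles per op : %.1f\n", penalty / ops);                    /* 11.9 */
    printf("share of total: %.1f%%\n", 100.0 * penalty / total_cycles); /* 24.4 */
    return 0;
}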

Root Causes

  1. malloc (58% of cache misses):

    • Pointer chasing through TLS → cache → metadata
    • Multiple indirections: g_tls_slabs[class_idx] → tls->ss → tls->meta
    • Cold metadata access patterns
  2. unified_cache_refill (30% of cache misses):

    • SuperSlab metadata lookups via hak_super_lookup(p)
    • Freelist traversal: tiny_next_read() on cold pointers
    • Validation logic: Multiple metadata accesses per block

Branch Misprediction Analysis

Branch Misses by Function (HAKMEM)

| Function | % | Branch Misses | Misses/op | Impact |
| --- | --- | --- | --- | --- |
| malloc | 21.59% | 323,231 | 0.0162 | Moderate |
| unified_cache_refill | 10.35% | 154,953 | 0.0077 | Moderate |
| free.part.0 | 3.80% | 56,891 | 0.0028 | Low |
| main | 3.66% | 54,795 | 0.0027 | (Benchmark) |
| hak_free_at | 3.49% | 52,249 | 0.0026 | Low |
| hak_tiny_free_fast_v2 | 3.11% | 46,560 | 0.0023 | Low |

Estimated Penalty

  • Branch miss penalty: 22,456,995 cycles (assuming ~15 cycles/miss)
  • Per operation: 1.1 cycles lost to branch misses
  • Percentage of total: 2.3% of all cycles

Root Causes

  1. Unpredictable control flow:

    • Environment variable checks: if (g_wrapper_env), if (g_enable)
    • Initialization barriers: if (!g_initialized), if (g_initializing)
    • Multi-way routing: if (cache miss) → refill; if (freelist) → pop; else → carve
  2. malloc wrapper overhead (lines 7795-78a3 in disassembly):

    • 20+ conditional branches before reaching fast path
    • Lazy initialization checks
    • Diagnostic tracing (lock incl g_wrap_malloc_trace_count)

Top 3 Bottlenecks & Recommendations

🔴 Bottleneck #1: Cache Misses in malloc (16.25 cycles/op, 58% of misses)

Problem:

  • Complex TLS access pattern: g_tls_sll[class_idx].head requires cache line load
  • Unified cache lookup: g_unified_cache[class_idx].slots[head] → second cache line
  • Cold metadata: Refill triggers hak_super_lookup() + metadata traversal

Hot Path Code Flow (from source analysis):

// malloc wrapper → hak_tiny_alloc_fast_wrapper → tiny_alloc_fast
// 1. Check unified cache (cache hit path)
void* p = cache->slots[cache->head];
if (p) {
    cache->head = (cache->head + 1) & cache->mask;  // ← Cache line load
    return p;
}
// 2. Cache miss → unified_cache_refill
unified_cache_refill(class_idx);  // ← Expensive! 6.67 cycles/op

Disassembly Evidence (malloc function, lines 7a60-7ac7):

  • Multiple indirect loads: mov %fs:0x0,%r8 (TLS base)
  • Pointer arithmetic: lea -0x47d30(%r8),%rsi (cache offset calculation)
  • Conditional routing checks: cmpb $0x2,(%rdx,%rcx,1) (route check)
  • Cache line thrashing on cache->slots array

Recommendations:

  1. Inline unified_cache_refill for common case (CRITICAL)

    • Move refill logic inline to eliminate function call overhead
    • Use __attribute__((always_inline)) or manual inlining
    • Expected gain: ~2-3 cycles/op
  2. Optimize TLS data layout (HIGH PRIORITY)

    • Pack hot fields (cache->head, cache->tail, cache->slots) into single cache line
    • Current: g_unified_cache[8] array → 8 separate cache lines
    • Target: Hot path fields in a single 64-byte cache line (layout sketch after this list)
    • Expected gain: ~3-5 cycles/op, reduce misses by 30-40%
  3. Prefetch next block during refill (MEDIUM)

    void* first = out[0];
    __builtin_prefetch(cache->slots[cache->tail + 1], 0, 3);  // Temporal prefetch
    return first;
    
    • Expected gain: ~1-2 cycles/op
  4. Reduce validation overhead (MEDIUM)

    • unified_refill_validate_base() calls hak_super_lookup() on every block
    • Move to debug-only (#if !HAKMEM_BUILD_RELEASE)
    • Expected gain: ~1-2 cycles/op
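
A layout sketch for recommendation #2; field names follow the text, while the field widths and per-class arrangement are assumptions:

#include <stdint.h>

/* Keep the per-class fields touched on every allocation inside one 64-byte line. */
typedef struct __attribute__((aligned(64))) {
    uint16_t head;        /* consumer index, read/written on the hit path */
    uint16_t tail;        /* producer index, written by refill */
    uint16_t mask;        /* capacity - 1 (power of two) */
    uint16_t _pad;
    void**   slots;       /* backing ring of cached blocks */
    /* remaining bytes of the line available for other hot per-class state */
} UnifiedCacheHot;

static __thread UnifiedCacheHot g_unified_cache_hot[8];   /* one line per class */

With head, mask, and the slots pointer on a single line, the hit path shown above touches one metadata line plus the slot entry itself.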

🔴 Bottleneck #2: unified_cache_refill (6.67 cycles/op, 30% of misses)

Problem:

  • Expensive metadata lookups: hak_super_lookup(p) on every freelist node
  • Freelist traversal: tiny_next_read() requires dereferencing cold pointers
  • Validation logic: Multiple safety checks per block (lines 384-408 in source)

Hot Path Code (from tiny_unified_cache.c:377-414):

while (produced < room) {
    if (m->freelist) {
        void* p = m->freelist;

        // ❌ EXPENSIVE: Lookup SuperSlab for validation
        SuperSlab* fl_ss = hak_super_lookup(p);  // ← Cache miss!
        int fl_idx = slab_index_for(fl_ss, p);   // ← More metadata access

        // ❌ EXPENSIVE: Dereference next pointer (cold memory)
        void* next_node = tiny_next_read(class_idx, p);  // ← Cache miss!

        // Write header
        *(uint8_t*)p = (uint8_t)(0xa0 | (class_idx & 0x0f));
        m->freelist = next_node;
        out[produced++] = p;
    }
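    // else: freelist empty → carve path (elided from this excerpt)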
}

Recommendations:

  1. Batch validation (amortize lookup cost) (CRITICAL)

    • Validate SuperSlab once at start, not per block
    • Trust freelist integrity within single refill
    SuperSlab* ss_once = hak_super_lookup(m->freelist);
    // Validate ss_once, then skip per-block validation
    while (produced < room && m->freelist) {
        void* p = m->freelist;
        void* next = tiny_next_read(class_idx, p);  // No lookup!
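        // header tag write (as in the original loop) omitted here for brevity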
        out[produced++] = p;
        m->freelist = next;
    }
    
    • Expected gain: ~2-3 cycles/op
  2. Prefetch freelist nodes (HIGH PRIORITY)

    void* p = m->freelist;
    void* next = tiny_next_read(class_idx, p);
    __builtin_prefetch(next, 0, 3);  // Prefetch next node
    __builtin_prefetch(tiny_next_read(class_idx, next), 0, 2);  // +2 ahead
    
    • Expected gain: ~1-2 cycles/op on miss path
  3. Increase batch size for hot classes (MEDIUM)

    • Current: Max 128 blocks per refill
    • Proposal: 256 blocks for C0-C3 (tiny sizes)
    • Amortize refill cost over more allocations
    • Expected gain: ~0.5-1 cycles/op
  4. Remove atomic fence on header write (LOW, risky)

    • Line 422: __atomic_thread_fence(__ATOMIC_RELEASE)
    • Only needed for cross-thread visibility
    • Benchmark: Single-threaded case doesn't need fence
    • Expected gain: ~0.3-0.5 cycles/op
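
A sketch of how recommendation #4 could be gated; HAKMEM_SINGLE_THREADED is a hypothetical build flag used only for illustration:

/* Skip the release fence when no other thread can observe the refilled blocks. */
static inline void refill_publish_fence(void) {
#if defined(HAKMEM_SINGLE_THREADED)
    /* single-threaded build/benchmark: header writes need no ordering guarantee */
#else
    __atomic_thread_fence(__ATOMIC_RELEASE);
#endif
}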

🔴 Bottleneck #3: malloc Wrapper Overhead (16.25 cycles/op, excessive branching)

Problem:

  • 20+ branches before reaching fast path (disassembly lines 7795-78a3)
  • Lazy initialization checks on every call
  • Diagnostic tracing with atomic increment
  • Environment variable checks

Hot Path Disassembly (malloc, lines 7795-77ba):

7795: lock incl 0x190fb78(%rip)  ; ❌ Atomic trace counter (12.33% of cycles!)
779c: mov 0x190fb6e(%rip),%eax   ; Check g_bench_fast_init_in_progress
77a2: test %eax,%eax
77a4: je 7d90                    ; Branch #1
77aa: incl %fs:0xfffffffffffb8354 ; TLS counter increment
77b2: mov 0x438c8(%rip),%eax     ; Check g_wrapper_env
77b8: test %eax,%eax
77ba: je 7e40                    ; Branch #2

Wrapper Code (hakmem_tiny_phase6_wrappers_box.inc:22-79):

void* hak_tiny_alloc_fast_wrapper(size_t size) {
    atomic_fetch_add(&g_alloc_fast_trace, 1, ...);  // ❌ Expensive!

    // ❌ Branch #1: Bench fast mode check
    if (g_bench_fast_front) {
        return tiny_alloc_fast(size);
    }

    atomic_fetch_add(&wrapper_call_count, 1);  // ❌ Atomic again!
    PTR_TRACK_INIT();  // ❌ Initialization check
    periodic_canary_check(call_num, ...);  // ❌ Periodic check

    // Finally, actual allocation
    void* result = tiny_alloc_fast(size);
    return result;
}

Recommendations:

  1. Compile-time disable diagnostics (CRITICAL)

    • Remove atomic trace counters in hot path
    • Move to #if HAKMEM_BUILD_RELEASE guards (macro sketch after this list)
    • Expected gain: ~4-6 cycles/op (eliminates 12% overhead)
  2. Hoist initialization checks (HIGH PRIORITY)

    • Move PTR_TRACK_INIT() to library init (once per thread)
    • Cache g_bench_fast_front in thread-local variable
    static __thread int g_init_done = 0;
    if (__builtin_expect(!g_init_done, 0)) {
        PTR_TRACK_INIT();
        g_init_done = 1;
    }
    
    • Expected gain: ~1-2 cycles/op
  3. Eliminate wrapper layer for benchmarks (MEDIUM)

    • Direct call to tiny_alloc_fast() from malloc()
    • Use LTO to inline wrapper entirely
    • Expected gain: ~1-2 cycles/op (function call overhead)
  4. Branchless environment checks (LOW)

    • Replace if (g_wrapper_env) with bitmask operations
    int mask = -(int)g_wrapper_env;  // -1 if true, 0 if false
    result = (mask & diagnostic_path) | (~mask & fast_path);
    
    • Expected gain: ~0.3-0.5 cycles/op
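
For recommendation #1, a sketch of compiling the counters out; HAK_TRACE_INC is a hypothetical macro, while HAKMEM_BUILD_RELEASE is the guard already referenced above:

#include <stdatomic.h>

#if defined(HAKMEM_BUILD_RELEASE) && HAKMEM_BUILD_RELEASE
#  define HAK_TRACE_INC(ctr) ((void)0)   /* no 'lock incl' in release builds */
#else
#  define HAK_TRACE_INC(ctr) \
      atomic_fetch_add_explicit(&(ctr), 1, memory_order_relaxed)
#endif

The wrapper's atomic_fetch_add calls would then reduce to HAK_TRACE_INC(...) and disappear from the release hot path.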

Summary: Optimization Roadmap

Immediate Wins (Target: -15 cycles/op, 48.8 → 33.8)

  1. Remove atomic trace counters (lock incl) → -6 cycles/op
  2. Inline unified_cache_refill → -3 cycles/op
  3. Batch validation in refill → -3 cycles/op
  4. Optimize TLS cache layout → -3 cycles/op

Medium-Term (Target: -10 cycles/op, 33.8 → 23.8)

  1. Prefetch in refill and malloc → -3 cycles/op
  2. Increase batch size for hot classes → -2 cycles/op
  3. Consolidate free path (merge 3 functions) → -3 cycles/op
  4. Hoist initialization checks → -2 cycles/op

Long-Term (Target: -8 cycles/op, 23.8 → 15.8)

  1. Branchless routing logic → -2 cycles/op
  2. SIMD batch processing in refill → -3 cycles/op
  3. Reduce metadata indirections → -3 cycles/op

Stretch Goal: Match mimalloc (15.8 → 6.2 cycles/op)

  • Requires architectural changes (single-layer cache, no validation)
  • Trade-off: Safety vs performance

Conclusion

HAKMEM's 7.88x slowdown is primarily due to:

  1. Cache misses (24.4% of cycles) from multi-layer indirection
  2. Diagnostic overhead (12%+ of cycles) from atomic counters and tracing
  3. Function fragmentation (6 hot functions vs mimalloc's 2)

Top Priority Actions:

  • Remove atomic trace counters (immediate -6 cycles/op)
  • Inline refill + batch validation (-6 cycles/op combined)
  • Optimize TLS layout for cache locality (-3 cycles/op)

Expected Impact: -15 cycles/op (48.8 → 33.8, ~30% improvement)
Timeline: 1-2 days of focused optimization work