Commit: Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When warm pool becomes empty, we now do multiple superslab_refills and prefill
the pool with 3 additional HOT superslabs before attempting to carve. This
builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs (see the sketch after this list)
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
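
A minimal sketch of the secondary-prefill cold path described above; the helpers superslab_refill_one(), warm_pool_count(), warm_pool_push(), and carve_block() are illustrative names (only g_warm_pool_stats[].prefilled is named by this change):

#include <stdint.h>

/* Assumed internals; prototypes are placeholders for illustration. */
typedef struct SuperSlab SuperSlab;
typedef struct { uint64_t hits, misses, prefilled; } WarmPoolStats;
extern WarmPoolStats g_warm_pool_stats[];
SuperSlab* superslab_refill_one(int class_idx);   /* one registry scan + refill */
int        warm_pool_count(int class_idx);
void       warm_pool_push(int class_idx, SuperSlab* ss);
void*      carve_block(SuperSlab* ss, int class_idx);

#define WARM_POOL_PREFILL_BUDGET 3   /* hardcoded budget, per the change above */

static void* cold_path_alloc_with_prefill(int class_idx) {
    if (warm_pool_count(class_idx) == 0) {
        /* Pool is empty: park extra HOT superslabs before carving. */
        for (int i = 0; i < WARM_POOL_PREFILL_BUDGET; i++) {
            SuperSlab* extra = superslab_refill_one(class_idx);
            if (!extra) break;                   /* registry has nothing left */
            warm_pool_push(class_idx, extra);    /* serve future misses from here */
        }
        g_warm_pool_stats[class_idx].prefilled++;  /* one prefill event */
    }
    /* Keep one superslab in TLS and carve from it immediately. */
    SuperSlab* ss = superslab_refill_one(class_idx);
    return ss ? carve_block(ss, class_idx) : NULL;
}

The point is that a single miss now buys a small working set of superslabs, so the next few misses are served from the warm pool instead of triggering another registry scan.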

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1
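
A sketch of the ENV-gated printing; the counter layout and output format are assumptions, while the HAKMEM_WARM_POOL_STATS=1 gate and the .prefilled counter come from this change:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct { uint64_t hits, misses, prefilled; } WarmPoolStats;
extern WarmPoolStats g_warm_pool_stats[8];   /* 8 tiny classes assumed */

static void warm_pool_print_stats_if_enabled(void) {
    const char* env = getenv("HAKMEM_WARM_POOL_STATS");
    if (!env || env[0] != '1') return;       /* stats always compiled, printing opt-in */
    for (int c = 0; c < 8; c++) {
        fprintf(stderr, "warm_pool C%d: hits=%llu misses=%llu prefilled=%llu\n", c,
                (unsigned long long)g_warm_pool_stats[c].hits,
                (unsigned long long)g_warm_pool_stats[c].misses,
                (unsigned long long)g_warm_pool_stats[c].prefilled);
    }
}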

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates (sketched below)
- Validate at larger allocation counts (10M+ pending registry size fix)
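
One possible shape for the adaptive budget idea above; the helper and its thresholds are purely illustrative, with 3 matching the current hardcoded default:

#include <stdint.h>

/* Pick a per-class prefill budget from observed warm-pool hit/miss counters. */
static inline int warm_pool_prefill_budget(uint64_t hits, uint64_t misses) {
    uint64_t total = hits + misses;
    if (total < 1024) return 3;          /* too few samples: keep the default */
    double hit_rate = (double)hits / (double)total;
    if (hit_rate < 0.25) return 6;       /* starved pool: prefill more superslabs */
    if (hit_rate > 0.75) return 2;       /* healthy pool: prefill fewer */
    return 3;
}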


HAKMEM Performance Profile Analysis: CPU Cycle Bottleneck Investigation

Benchmark: bench_tiny_hot (64-byte allocations, 20M operations)

Date: 2025-12-04
Objective: Identify where HAKMEM spends CPU cycles compared to mimalloc (7.88x slower)


Executive Summary

HAKMEM is 7.88x slower than mimalloc on tiny hot allocations (48.8 vs 6.2 cycles/op). The performance gap comes from 4 main sources:

  1. Malloc overhead (32.4% of gap): Complex wrapper logic, environment checks, initialization barriers
  2. Free overhead (29.4% of gap): Multi-layer free path with validation and routing
  3. Cache refill (15.7% of gap): Expensive superslab metadata lookups and validation
  4. Infrastructure (22.5% of gap): Cache misses, branch mispredictions, diagnostic code

Key Finding: Cache Miss Penalty Dominates

  • 238M cycles lost to cache misses (24.4% of total runtime!)
  • HAKMEM has 20.3x more cache misses than mimalloc (1.19M vs 58.7K)
  • L1 D-cache misses are 97.7x higher (4.29M vs 43.9K)

Detailed Performance Metrics

Overall Comparison

| Metric | HAKMEM | mimalloc | Ratio |
| --- | --- | --- | --- |
| Total cycles | 975,602,722 | 123,838,496 | 7.88x |
| Total instructions | 3,782,043,459 | 515,485,797 | 7.34x |
| Cycles per op | 48.8 | 6.2 | 7.88x |
| Instructions per op | 189.1 | 25.8 | 7.34x |
| IPC (inst/cycle) | 3.88 | 4.16 | 0.93x |
| Cache misses | 1,191,800 | 58,727 | 20.29x |
| Cache miss rate (‰ of ops) | 59.59‰ | 2.94‰ | 20.29x |
| Branch misses | 1,497,133 | 58,943 | 25.40x |
| Branch miss rate | 0.17% | 0.05% | 3.20x |
| L1 D-cache misses | 4,291,649 | 43,913 | 97.73x |
| L1 miss rate | 0.41% | 0.03% | 13.88x |

IPC Analysis

  • HAKMEM IPC: 3.88 (good, but memory-bound)
  • mimalloc IPC: 4.16 (better, less memory stall)
  • Interpretation: Both have high IPC, but HAKMEM is bottlenecked by memory access patterns

Function-Level Cycle Breakdown

HAKMEM: Where Cycles Are Spent

| Function | % Total | Cycles | Cycles/op | Category |
| --- | --- | --- | --- | --- |
| malloc | 33.32% | 325,070,826 | 16.25 | Hot path allocation |
| unified_cache_refill | 13.67% | 133,364,892 | 6.67 | Cache miss handler |
| free.part.0 | 12.22% | 119,218,652 | 5.96 | Free wrapper |
| main (benchmark) | 12.07% | 117,755,248 | 5.89 | Test harness |
| hak_free_at.constprop.0 | 11.55% | 112,682,114 | 5.63 | Free routing |
| hak_tiny_free_fast_v2 | 8.11% | 79,121,380 | 3.96 | Free fast path |
| kernel/other | 9.06% | 88,389,606 | 4.42 | Syscalls, page faults |
| TOTAL | 100% | 975,602,722 | 48.78 | |

mimalloc: Where Cycles Are Spent

| Function | % Total | Cycles | Cycles/op | Category |
| --- | --- | --- | --- | --- |
| operator delete[] | 48.66% | 60,259,812 | 3.01 | Free path |
| malloc | 39.82% | 49,312,489 | 2.47 | Allocation path |
| kernel/other | 6.77% | 8,383,866 | 0.42 | Syscalls, page faults |
| main (benchmark) | 4.75% | 5,882,328 | 0.29 | Test harness |
| TOTAL | 100% | 123,838,496 | 6.19 | |

Insight: HAKMEM Fragmentation

  • mimalloc concentrates 88.5% of cycles in malloc/free
  • HAKMEM spreads across 6 functions (malloc + 3 free variants + refill + wrapper)
  • Recommendation: Consolidate hot path to reduce function call overhead

Cache Miss Deep Dive

Cache Misses by Function (HAKMEM)

| Function | % | Cache Misses | Misses/op | Impact |
| --- | --- | --- | --- | --- |
| malloc | 58.51% | 697,322 | 0.0349 | CRITICAL |
| unified_cache_refill | 29.92% | 356,586 | 0.0178 | HIGH |
| Other | 11.57% | 137,892 | 0.0069 | Low |

Estimated Penalty

  • Cache miss penalty: 238,360,000 cycles (assuming ~200 cycles/LLC miss)
  • Per operation: 11.9 cycles lost to cache misses
  • Percentage of total: 24.4% of all cycles
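
The figures above follow directly from the counters; a standalone arithmetic check using the same ~200 cycles/LLC-miss assumption:

#include <stdio.h>

int main(void) {
    const double misses       = 1191800.0;     /* HAKMEM cache misses (measured) */
    const double miss_cost    = 200.0;         /* assumed cycles per LLC miss */
    const double ops          = 20e6;          /* benchmark operations */
    const double total_cycles = 975602722.0;   /* HAKMEM total cycles (measured) */

    double penalty = misses * miss_cost;
    printf("penalty cycles: %.0f\n", penalty);                          /* 238,360,000 */
    printf("cycles per op : %.1f\n", penalty / ops);                    /* 11.9 */
    printf("share of total: %.1f%%\n", 100.0 * penalty / total_cycles); /* 24.4 */
    return 0;
}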

Root Causes

  1. malloc (58% of cache misses):

    • Pointer chasing through TLS → cache → metadata
    • Multiple indirections: g_tls_slabs[class_idx] → tls->ss → tls->meta
    • Cold metadata access patterns
  2. unified_cache_refill (30% of cache misses):

    • SuperSlab metadata lookups via hak_super_lookup(p)
    • Freelist traversal: tiny_next_read() on cold pointers
    • Validation logic: Multiple metadata accesses per block

Branch Misprediction Analysis

Branch Misses by Function (HAKMEM)

| Function | % | Branch Misses | Misses/op | Impact |
| --- | --- | --- | --- | --- |
| malloc | 21.59% | 323,231 | 0.0162 | Moderate |
| unified_cache_refill | 10.35% | 154,953 | 0.0077 | Moderate |
| free.part.0 | 3.80% | 56,891 | 0.0028 | Low |
| main | 3.66% | 54,795 | 0.0027 | (Benchmark) |
| hak_free_at | 3.49% | 52,249 | 0.0026 | Low |
| hak_tiny_free_fast_v2 | 3.11% | 46,560 | 0.0023 | Low |

Estimated Penalty

  • Branch miss penalty: 22,456,995 cycles (assuming ~15 cycles/miss)
  • Per operation: 1.1 cycles lost to branch misses
  • Percentage of total: 2.3% of all cycles

Root Causes

  1. Unpredictable control flow:

    • Environment variable checks: if (g_wrapper_env), if (g_enable)
    • Initialization barriers: if (!g_initialized), if (g_initializing)
    • Multi-way routing: if (cache miss) → refill; if (freelist) → pop; else → carve
  2. malloc wrapper overhead (lines 7795-78a3 in disassembly):

    • 20+ conditional branches before reaching fast path
    • Lazy initialization checks
    • Diagnostic tracing (lock incl g_wrap_malloc_trace_count)

Top 3 Bottlenecks & Recommendations

🔴 Bottleneck #1: Cache Misses in malloc (16.25 cycles/op, 58% of misses)

Problem:

  • Complex TLS access pattern: g_tls_sll[class_idx].head requires cache line load
  • Unified cache lookup: g_unified_cache[class_idx].slots[head] → second cache line
  • Cold metadata: Refill triggers hak_super_lookup() + metadata traversal

Hot Path Code Flow (from source analysis):

// malloc wrapper → hak_tiny_alloc_fast_wrapper → tiny_alloc_fast
// 1. Check unified cache (cache hit path)
void* p = cache->slots[cache->head];
if (p) {
    cache->head = (cache->head + 1) & cache->mask;  // ← Cache line load
    return p;
}
// 2. Cache miss → unified_cache_refill
unified_cache_refill(class_idx);  // ← Expensive! 6.67 cycles/op

Disassembly Evidence (malloc function, lines 7a60-7ac7):

  • Multiple indirect loads: mov %fs:0x0,%r8 (TLS base)
  • Pointer arithmetic: lea -0x47d30(%r8),%rsi (cache offset calculation)
  • Conditional routing checks: cmpb $0x2,(%rdx,%rcx,1) (route check)
  • Cache line thrashing on cache->slots array

Recommendations:

  1. Inline unified_cache_refill for common case (CRITICAL)

    • Move refill logic inline to eliminate function call overhead
    • Use __attribute__((always_inline)) or manual inlining
    • Expected gain: ~2-3 cycles/op
  2. Optimize TLS data layout (HIGH PRIORITY)

    • Pack hot fields (cache->head, cache->tail, cache->slots) into single cache line
    • Current: g_unified_cache[8] array → 8 separate cache lines
    • Target: Hot path fields in a single 64-byte cache line (layout sketch after this list)
    • Expected gain: ~3-5 cycles/op, reduce misses by 30-40%
  3. Prefetch next block during refill (MEDIUM)

    void* first = out[0];
    __builtin_prefetch(cache->slots[cache->tail + 1], 0, 3);  // Temporal prefetch
    return first;
    
    • Expected gain: ~1-2 cycles/op
  4. Reduce validation overhead (MEDIUM)

    • unified_refill_validate_base() calls hak_super_lookup() on every block
    • Move to debug-only (#if !HAKMEM_BUILD_RELEASE)
    • Expected gain: ~1-2 cycles/op
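
A layout sketch for recommendation #2; field names follow the text, while the field widths and per-class arrangement are assumptions:

#include <stdint.h>

/* Keep the per-class fields touched on every allocation inside one 64-byte line. */
typedef struct __attribute__((aligned(64))) {
    uint16_t head;        /* consumer index, read/written on the hit path */
    uint16_t tail;        /* producer index, written by refill */
    uint16_t mask;        /* capacity - 1 (power of two) */
    uint16_t _pad;
    void**   slots;       /* backing ring of cached blocks */
    /* remaining bytes of the line available for other hot per-class state */
} UnifiedCacheHot;

static __thread UnifiedCacheHot g_unified_cache_hot[8];   /* one line per class */

With head, mask, and the slots pointer on a single line, the hit path shown above touches one metadata line plus the slot entry itself.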

🔴 Bottleneck #2: unified_cache_refill (6.67 cycles/op, 30% of misses)

Problem:

  • Expensive metadata lookups: hak_super_lookup(p) on every freelist node
  • Freelist traversal: tiny_next_read() requires dereferencing cold pointers
  • Validation logic: Multiple safety checks per block (lines 384-408 in source)

Hot Path Code (from tiny_unified_cache.c:377-414):

while (produced < room) {
    if (m->freelist) {
        void* p = m->freelist;

        // ❌ EXPENSIVE: Lookup SuperSlab for validation
        SuperSlab* fl_ss = hak_super_lookup(p);  // ← Cache miss!
        int fl_idx = slab_index_for(fl_ss, p);   // ← More metadata access

        // ❌ EXPENSIVE: Dereference next pointer (cold memory)
        void* next_node = tiny_next_read(class_idx, p);  // ← Cache miss!

        // Write header
        *(uint8_t*)p = (uint8_t)(0xa0 | (class_idx & 0x0f));
        m->freelist = next_node;
        out[produced++] = p;
    }
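    // else: freelist empty → carve path (elided from this excerpt)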
}

Recommendations:

  1. Batch validation (amortize lookup cost) (CRITICAL)

    • Validate SuperSlab once at start, not per block
    • Trust freelist integrity within single refill
    SuperSlab* ss_once = hak_super_lookup(m->freelist);
    // Validate ss_once, then skip per-block validation
    while (produced < room && m->freelist) {
        void* p = m->freelist;
        void* next = tiny_next_read(class_idx, p);  // No lookup!
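        // header tag write (as in the original loop) omitted here for brevity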
        out[produced++] = p;
        m->freelist = next;
    }
    
    • Expected gain: ~2-3 cycles/op
  2. Prefetch freelist nodes (HIGH PRIORITY)

    void* p = m->freelist;
    void* next = tiny_next_read(class_idx, p);
    __builtin_prefetch(next, 0, 3);  // Prefetch next node
    __builtin_prefetch(tiny_next_read(class_idx, next), 0, 2);  // +2 ahead
    
    • Expected gain: ~1-2 cycles/op on miss path
  3. Increase batch size for hot classes (MEDIUM)

    • Current: Max 128 blocks per refill
    • Proposal: 256 blocks for C0-C3 (tiny sizes)
    • Amortize refill cost over more allocations
    • Expected gain: ~0.5-1 cycles/op
  4. Remove atomic fence on header write (LOW, risky)

    • Line 422: __atomic_thread_fence(__ATOMIC_RELEASE)
    • Only needed for cross-thread visibility
    • Benchmark: Single-threaded case doesn't need fence
    • Expected gain: ~0.3-0.5 cycles/op
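
A sketch of how recommendation #4 could be gated; HAKMEM_SINGLE_THREADED is a hypothetical build flag used only for illustration:

/* Skip the release fence when no other thread can observe the refilled blocks. */
static inline void refill_publish_fence(void) {
#if defined(HAKMEM_SINGLE_THREADED)
    /* single-threaded build/benchmark: header writes need no ordering guarantee */
#else
    __atomic_thread_fence(__ATOMIC_RELEASE);
#endif
}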

🔴 Bottleneck #3: malloc Wrapper Overhead (16.25 cycles/op, excessive branching)

Problem:

  • 20+ branches before reaching fast path (disassembly lines 7795-78a3)
  • Lazy initialization checks on every call
  • Diagnostic tracing with atomic increment
  • Environment variable checks

Hot Path Disassembly (malloc, lines 7795-77ba):

7795: lock incl 0x190fb78(%rip)  ; ❌ Atomic trace counter (12.33% of cycles!)
779c: mov 0x190fb6e(%rip),%eax   ; Check g_bench_fast_init_in_progress
77a2: test %eax,%eax
77a4: je 7d90                    ; Branch #1
77aa: incl %fs:0xfffffffffffb8354 ; TLS counter increment
77b2: mov 0x438c8(%rip),%eax     ; Check g_wrapper_env
77b8: test %eax,%eax
77ba: je 7e40                    ; Branch #2

Wrapper Code (hakmem_tiny_phase6_wrappers_box.inc:22-79):

void* hak_tiny_alloc_fast_wrapper(size_t size) {
    atomic_fetch_add(&g_alloc_fast_trace, 1, ...);  // ❌ Expensive!

    // ❌ Branch #1: Bench fast mode check
    if (g_bench_fast_front) {
        return tiny_alloc_fast(size);
    }

    atomic_fetch_add(&wrapper_call_count, 1);  // ❌ Atomic again!
    PTR_TRACK_INIT();  // ❌ Initialization check
    periodic_canary_check(call_num, ...);  // ❌ Periodic check

    // Finally, actual allocation
    void* result = tiny_alloc_fast(size);
    return result;
}

Recommendations:

  1. Compile-time disable diagnostics (CRITICAL)

    • Remove atomic trace counters in hot path
    • Move to #if HAKMEM_BUILD_RELEASE guards (macro sketch after this list)
    • Expected gain: ~4-6 cycles/op (eliminates 12% overhead)
  2. Hoist initialization checks (HIGH PRIORITY)

    • Move PTR_TRACK_INIT() to library init (once per thread)
    • Cache g_bench_fast_front in thread-local variable
    static __thread int g_init_done = 0;
    if (__builtin_expect(!g_init_done, 0)) {
        PTR_TRACK_INIT();
        g_init_done = 1;
    }
    
    • Expected gain: ~1-2 cycles/op
  3. Eliminate wrapper layer for benchmarks (MEDIUM)

    • Direct call to tiny_alloc_fast() from malloc()
    • Use LTO to inline wrapper entirely
    • Expected gain: ~1-2 cycles/op (function call overhead)
  4. Branchless environment checks (LOW)

    • Replace if (g_wrapper_env) with bitmask operations
    int mask = -(int)g_wrapper_env;  // -1 if true, 0 if false
    result = (mask & diagnostic_path) | (~mask & fast_path);
    
    • Expected gain: ~0.3-0.5 cycles/op
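
For recommendation #1, a sketch of compiling the counters out; HAK_TRACE_INC is a hypothetical macro, while HAKMEM_BUILD_RELEASE is the guard already referenced above:

#include <stdatomic.h>

#if defined(HAKMEM_BUILD_RELEASE) && HAKMEM_BUILD_RELEASE
#  define HAK_TRACE_INC(ctr) ((void)0)   /* no 'lock incl' in release builds */
#else
#  define HAK_TRACE_INC(ctr) \
      atomic_fetch_add_explicit(&(ctr), 1, memory_order_relaxed)
#endif

The wrapper's atomic_fetch_add calls would then reduce to HAK_TRACE_INC(...) and disappear from the release hot path.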

Summary: Optimization Roadmap

Immediate Wins (Target: -15 cycles/op, 48.8 → 33.8)

  1. Remove atomic trace counters (lock incl) → -6 cycles/op
  2. Inline unified_cache_refill → -3 cycles/op
  3. Batch validation in refill → -3 cycles/op
  4. Optimize TLS cache layout → -3 cycles/op

Medium-Term (Target: -10 cycles/op, 33.8 → 23.8)

  1. Prefetch in refill and malloc → -3 cycles/op
  2. Increase batch size for hot classes → -2 cycles/op
  3. Consolidate free path (merge 3 functions) → -3 cycles/op
  4. Hoist initialization checks → -2 cycles/op

Long-Term (Target: -8 cycles/op, 23.8 → 15.8)

  1. Branchless routing logic → -2 cycles/op
  2. SIMD batch processing in refill → -3 cycles/op
  3. Reduce metadata indirections → -3 cycles/op

Stretch Goal: Match mimalloc (15.8 → 6.2 cycles/op)

  • Requires architectural changes (single-layer cache, no validation)
  • Trade-off: Safety vs performance

Conclusion

HAKMEM's 7.88x slowdown is primarily due to:

  1. Cache misses (24.4% of cycles) from multi-layer indirection
  2. Diagnostic overhead (12%+ of cycles) from atomic counters and tracing
  3. Function fragmentation (6 hot functions vs mimalloc's 2)

Top Priority Actions:

  • Remove atomic trace counters (immediate -6 cycles/op)
  • Inline refill + batch validation (-6 cycles/op combined)
  • Optimize TLS layout for cache locality (-3 cycles/op)

Expected Impact: -15 cycles/op (48.8 → 33.8, ~30% improvement)
Timeline: 1-2 days of focused optimization work