# HAKMEM Performance Profile Analysis: CPU Cycle Bottleneck Investigation
## Benchmark: bench_tiny_hot (64-byte allocations, 20M operations)
**Date:** 2025-12-04
**Objective:** Identify where HAKMEM spends CPU cycles compared to mimalloc (7.88x slower)
---
## Executive Summary
HAKMEM is **7.88x slower** than mimalloc on tiny hot allocations (48.8 vs 6.2 cycles/op).
The performance gap comes from **4 main sources**:
1. **Malloc overhead** (32.4% of gap): Complex wrapper logic, environment checks, initialization barriers
2. **Free overhead** (29.4% of gap): Multi-layer free path with validation and routing
3. **Cache refill** (15.7% of gap): Expensive superslab metadata lookups and validation
4. **Infrastructure** (22.5% of gap): Cache misses, branch mispredictions, diagnostic code
### Key Finding: Cache Miss Penalty Dominates
- **238M cycles lost to cache misses** (24.4% of total runtime!)
- HAKMEM has **20.3x more cache misses** than mimalloc (1.19M vs 58.7K)
- L1 D-cache misses are **97.7x higher** (4.29M vs 43.9K)
---
## Detailed Performance Metrics
### Overall Comparison
| Metric | HAKMEM | mimalloc | Ratio |
|--------|--------|----------|-------|
| **Total Cycles** | 975,602,722 | 123,838,496 | **7.88x** |
| **Total Instructions** | 3,782,043,459 | 515,485,797 | **7.34x** |
| **Cycles per op** | 48.8 | 6.2 | **7.88x** |
| **Instructions per op** | 189.1 | 25.8 | **7.34x** |
| **IPC (inst/cycle)** | 3.88 | 4.16 | 0.93x |
| **Cache misses** | 1,191,800 | 58,727 | **20.29x** |
| **Cache misses per 1K ops** | 59.59 | 2.94 | **20.29x** |
| **Branch misses** | 1,497,133 | 58,943 | **25.40x** |
| **Branch miss rate** | 0.17% | 0.05% | **3.20x** |
| **L1 D-cache misses** | 4,291,649 | 43,913 | **97.73x** |
| **L1 miss rate** | 0.41% | 0.03% | **13.88x** |
### IPC Analysis
- HAKMEM IPC: **3.88** (good, but memory-bound)
- mimalloc IPC: **4.16** (better, fewer memory stalls)
- **Interpretation**: Both have high IPC, but HAKMEM is bottlenecked by memory access patterns
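For reference, the IPC figures follow directly from the cycle and instruction counts in the comparison table above:
```
HAKMEM:   3,782,043,459 instructions / 975,602,722 cycles ≈ 3.88 IPC
mimalloc:   515,485,797 instructions / 123,838,496 cycles ≈ 4.16 IPC
```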
---
## Function-Level Cycle Breakdown
### HAKMEM: Where Cycles Are Spent
| Function | % | Total Cycles | Cycles/op | Category |
|----------|---|-------------|-----------|----------|
| **malloc** | 33.32% | 325,070,826 | 16.25 | Hot path allocation |
| **unified_cache_refill** | 13.67% | 133,364,892 | 6.67 | Cache miss handler |
| **free.part.0** | 12.22% | 119,218,652 | 5.96 | Free wrapper |
| **main** (benchmark) | 12.07% | 117,755,248 | 5.89 | Test harness |
| **hak_free_at.constprop.0** | 11.55% | 112,682,114 | 5.63 | Free routing |
| **hak_tiny_free_fast_v2** | 8.11% | 79,121,380 | 3.96 | Free fast path |
| **kernel/other** | 9.06% | 88,389,606 | 4.42 | Syscalls, page faults |
| **TOTAL** | 100% | 975,602,722 | 48.78 | |
### mimalloc: Where Cycles Are Spent
| Function | % | Total Cycles | Cycles/op | Category |
|----------|---|-------------|-----------|----------|
| **operator delete[]** | 48.66% | 60,259,812 | 3.01 | Free path |
| **malloc** | 39.82% | 49,312,489 | 2.47 | Allocation path |
| **kernel/other** | 6.77% | 8,383,866 | 0.42 | Syscalls, page faults |
| **main** (benchmark) | 4.75% | 5,882,328 | 0.29 | Test harness |
| **TOTAL** | 100% | 123,838,496 | 6.19 | |
### Insight: HAKMEM Fragmentation
- mimalloc concentrates 88.5% of cycles in malloc/free
- HAKMEM spreads its cycles across **6 functions** (malloc, 3 free variants, refill, and the wrapper)
- **Recommendation**: Consolidate hot path to reduce function call overhead
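A minimal sketch of that consolidation, assuming the free-path helpers can be made always-inline behind the single exported `free()`; the helper names and elided bodies are illustrative, not HAKMEM's current code:
```c
#include <stddef.h>

/* Illustrative only: fold the three free-path functions from the table above
 * into one exported free() so the compiler keeps the whole path in one frame.
 * Helper names are hypothetical and their bodies are elided. */
static inline __attribute__((always_inline)) int tiny_free_fast_inline(void* p) {
    (void)p;       /* ... hak_tiny_free_fast_v2 fast-path body ... */
    return 0;      /* nonzero = handled on the fast path */
}

static inline __attribute__((always_inline)) void free_route_inline(void* p) {
    (void)p;       /* ... hak_free_at / free.part.0 routing body ... */
}

void free(void* p) {
    if (p == NULL) return;
    if (tiny_free_fast_inline(p)) return;  /* tiny fast path, no call overhead */
    free_route_inline(p);                  /* routing / slow path, still inlined */
}
```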
---
## Cache Miss Deep Dive
### Cache Misses by Function (HAKMEM)
| Function | % | Cache Misses | Misses/op | Impact |
|----------|---|--------------|-----------|--------|
| **malloc** | 58.51% | 697,322 | 0.0349 | **CRITICAL** |
| **unified_cache_refill** | 29.92% | 356,586 | 0.0178 | **HIGH** |
| Other | 11.57% | 137,892 | 0.0069 | Low |
### Estimated Penalty
- **Cache miss penalty**: 238,360,000 cycles (assuming ~200 cycles/LLC miss)
- **Per operation**: 11.9 cycles lost to cache misses
- **Percentage of total**: **24.4%** of all cycles
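For reference, the penalty estimate is just the miss count from the table scaled by the assumed ~200-cycle LLC miss cost:
```
1,191,800 misses × 200 cycles ≈ 238,360,000 cycles
238,360,000 / 20,000,000 ops  ≈ 11.9 cycles/op
238,360,000 / 975,602,722     ≈ 24.4% of total cycles
```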
### Root Causes
1. **malloc (58% of cache misses)**:
- Pointer chasing through TLS → cache → metadata
- Multiple indirections: `g_tls_slabs[class_idx]` → `tls->ss` → `tls->meta`
- Cold metadata access patterns
2. **unified_cache_refill (30% of cache misses)**:
- SuperSlab metadata lookups via `hak_super_lookup(p)`
- Freelist traversal: `tiny_next_read()` on cold pointers
- Validation logic: Multiple metadata accesses per block
---
## Branch Misprediction Analysis
### Branch Misses by Function (HAKMEM)
| Function | % | Branch Misses | Misses/op | Impact |
|----------|---|---------------|-----------|--------|
| **malloc** | 21.59% | 323,231 | 0.0162 | Moderate |
| **unified_cache_refill** | 10.35% | 154,953 | 0.0077 | Moderate |
| **free.part.0** | 3.80% | 56,891 | 0.0028 | Low |
| **main** | 3.66% | 54,795 | 0.0027 | (Benchmark) |
| **hak_free_at** | 3.49% | 52,249 | 0.0026 | Low |
| **hak_tiny_free_fast_v2** | 3.11% | 46,560 | 0.0023 | Low |
### Estimated Penalty
- **Branch miss penalty**: 22,456,995 cycles (assuming ~15 cycles/miss)
- **Per operation**: 1.1 cycles lost to branch misses
- **Percentage of total**: **2.3%** of all cycles
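The same model applied to branch misses (at ~15 cycles each):
```
1,497,133 misses × 15 cycles ≈ 22,456,995 cycles
22,456,995 / 20,000,000 ops  ≈ 1.1 cycles/op
22,456,995 / 975,602,722     ≈ 2.3% of total cycles
```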
### Root Causes
1. **Unpredictable control flow**:
- Environment variable checks: `if (g_wrapper_env)`, `if (g_enable)`
- Initialization barriers: `if (!g_initialized)`, `if (g_initializing)`
- Multi-way routing: `if (cache miss) → refill; if (freelist) → pop; else → carve`
2. **malloc wrapper overhead** (disassembly addresses 0x7795-0x78a3):
- 20+ conditional branches before reaching fast path
- Lazy initialization checks
- Diagnostic tracing (`lock incl g_wrap_malloc_trace_count`)
---
## Top 3 Bottlenecks & Recommendations
### 🔴 Bottleneck #1: Cache Misses in malloc (16.25 cycles/op, 58% of misses)
**Problem:**
- Complex TLS access pattern: `g_tls_sll[class_idx].head` requires cache line load
- Unified cache lookup: `g_unified_cache[class_idx].slots[head]` → second cache line
- Cold metadata: Refill triggers `hak_super_lookup()` + metadata traversal
**Hot Path Code Flow** (from source analysis):
```c
// malloc wrapper → hak_tiny_alloc_fast_wrapper → tiny_alloc_fast
// 1. Check unified cache (cache hit path)
void* p = cache->slots[cache->head];
if (p) {
    cache->head = (cache->head + 1) & cache->mask;  // ← Cache line load
    return p;
}
// 2. Cache miss → unified_cache_refill
unified_cache_refill(class_idx);  // ← Expensive! 6.67 cycles/op
```
**Disassembly Evidence** (malloc function, addresses 0x7a60-0x7ac7):
- Multiple indirect loads: `mov %fs:0x0,%r8` (TLS base)
- Pointer arithmetic: `lea -0x47d30(%r8),%rsi` (cache offset calculation)
- Conditional moves: `cmpb $0x2,(%rdx,%rcx,1)` (route check)
- Cache line thrashing on `cache->slots` array
**Recommendations:**
1. **Inline unified_cache_refill for common case** (CRITICAL)
- Move refill logic inline to eliminate function call overhead
- Use `__attribute__((always_inline))` or manual inlining
- Expected gain: ~2-3 cycles/op
2. **Optimize TLS data layout** (HIGH PRIORITY; layout sketch after this list)
- Pack hot fields (`cache->head`, `cache->tail`, `cache->slots`) into single cache line
- Current: `g_unified_cache[8]` array → 8 separate cache lines
- Target: Hot path fields in 64-byte cache line
- Expected gain: ~3-5 cycles/op, reduce misses by 30-40%
3. **Prefetch next block during refill** (MEDIUM)
```c
void* first = out[0];
// Prefetch the block in the next ring slot so the following alloc hits L1
__builtin_prefetch(cache->slots[(cache->tail + 1) & cache->mask], 0, 3);
return first;
```
- Expected gain: ~1-2 cycles/op
4. **Reduce validation overhead** (MEDIUM)
- `unified_refill_validate_base()` calls `hak_super_lookup()` on every block
- Move to debug-only (`#if !HAKMEM_BUILD_RELEASE`)
- Expected gain: ~1-2 cycles/op
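As a concrete illustration of recommendation 2 above, a minimal layout sketch follows. The field names and exact packing are assumptions for illustration, not HAKMEM's actual `g_unified_cache` definition; the point is that everything the hit path touches fits in one 64-byte line per class.
```c
#include <stdint.h>

/* Hypothetical per-class cache descriptor packed into a single cache line. */
typedef struct {
    uint16_t head;       /* next slot to pop on the hit path          */
    uint16_t tail;       /* next slot to fill during refill           */
    uint16_t mask;       /* capacity - 1 (power-of-two ring)          */
    uint16_t class_idx;  /* size class this cache serves              */
    void**   slots;      /* ring buffer of cached block pointers      */
    /* remaining bytes: cold counters / padding up to 64 bytes        */
} __attribute__((aligned(64))) TinyUnifiedCacheLine;

static __thread TinyUnifiedCacheLine g_unified_cache_sketch[8];
```
With this packing, a cache hit touches the line holding `head`/`mask`/`slots` plus the line holding the slot entry itself, instead of scattering those fields across separate lines.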
---
### 🔴 Bottleneck #2: unified_cache_refill (6.67 cycles/op, 30% of misses)
**Problem:**
- Expensive metadata lookups: `hak_super_lookup(p)` on every freelist node
- Freelist traversal: `tiny_next_read()` requires dereferencing cold pointers
- Validation logic: Multiple safety checks per block (lines 384-408 in source)
**Hot Path Code** (from tiny_unified_cache.c:377-414):
```c
while (produced < room) {
    if (m->freelist) {
        void* p = m->freelist;
        // ❌ EXPENSIVE: Lookup SuperSlab for validation
        SuperSlab* fl_ss = hak_super_lookup(p);          // ← Cache miss!
        int fl_idx = slab_index_for(fl_ss, p);           // ← More metadata access
        // ❌ EXPENSIVE: Dereference next pointer (cold memory)
        void* next_node = tiny_next_read(class_idx, p);  // ← Cache miss!
        // Write header
        *(uint8_t*)p = (uint8_t)(0xa0 | (class_idx & 0x0f));
        m->freelist = next_node;
        out[produced++] = p;
    }
}
```
**Recommendations:**
1. **Batch validation (amortize lookup cost)** (CRITICAL)
- Validate SuperSlab once at start, not per block
- Trust freelist integrity within single refill
```c
SuperSlab* ss_once = hak_super_lookup(m->freelist);
// Validate ss_once once, then skip per-block validation
while (produced < room && m->freelist) {
    void* p = m->freelist;
    void* next = tiny_next_read(class_idx, p);            // no per-block lookup
    *(uint8_t*)p = (uint8_t)(0xa0 | (class_idx & 0x0f));  // header write kept from the original loop
    out[produced++] = p;
    m->freelist = next;
}
```
- Expected gain: ~2-3 cycles/op
2. **Prefetch freelist nodes** (HIGH PRIORITY)
```c
void* p = m->freelist;
void* next = tiny_next_read(class_idx, p);
__builtin_prefetch(next, 0, 3);                                 // prefetch next node
if (next)
    __builtin_prefetch(tiny_next_read(class_idx, next), 0, 2);  // +2 ahead
```
- Expected gain: ~1-2 cycles/op on miss path
3. **Increase batch size for hot classes** (MEDIUM; config sketch after this list)
- Current: Max 128 blocks per refill
- Proposal: 256 blocks for C0-C3 (tiny sizes)
- Amortize refill cost over more allocations
- Expected gain: ~0.5-1 cycles/op
4. **Remove atomic fence on header write** (LOW, risky)
- Line 422: `__atomic_thread_fence(__ATOMIC_RELEASE)`
- Only needed for cross-thread visibility
- Benchmark: Single-threaded case doesn't need fence
- Expected gain: ~0.3-0.5 cycles/op
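A minimal sketch of recommendation 3 above, assuming a per-class table is an acceptable way to carry the batch size; the constant names are hypothetical and the current refill code may take its batch size differently.
```c
/* Hypothetical per-class refill batch sizes: larger batches for the hottest
 * tiny classes (C0-C3), the current 128 elsewhere. Names are illustrative. */
#define TINY_NUM_CLASSES 8

static const int k_refill_batch[TINY_NUM_CLASSES] = {
    256, 256, 256, 256,   /* C0-C3: tiny hot classes, amortize refill further */
    128, 128, 128, 128    /* C4-C7: keep the existing batch size */
};

static inline int refill_batch_for(int class_idx) {
    return k_refill_batch[class_idx];
}
```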
---
### 🔴 Bottleneck #3: malloc Wrapper Overhead (16.25 cycles/op, excessive branching)
**Problem:**
- 20+ branches before reaching the fast path (disassembly addresses 0x7795-0x78a3)
- Lazy initialization checks on every call
- Diagnostic tracing with atomic increment
- Environment variable checks
**Hot Path Disassembly** (malloc, addresses 0x7795-0x77ba):
```asm
7795: lock incl 0x190fb78(%rip) ; ❌ Atomic trace counter (12.33% of cycles!)
779c: mov 0x190fb6e(%rip),%eax ; Check g_bench_fast_init_in_progress
77a2: test %eax,%eax
77a4: je 7d90 ; Branch #1
77aa: incl %fs:0xfffffffffffb8354 ; TLS counter increment
77b2: mov 0x438c8(%rip),%eax ; Check g_wrapper_env
77b8: test %eax,%eax
77ba: je 7e40 ; Branch #2
```
**Wrapper Code** (hakmem_tiny_phase6_wrappers_box.inc:22-79):
```c
void* hak_tiny_alloc_fast_wrapper(size_t size) {
    atomic_fetch_add(&g_alloc_fast_trace, 1, ...);  // ❌ Expensive!
    // ❌ Branch #1: Bench fast mode check
    if (g_bench_fast_front) {
        return tiny_alloc_fast(size);
    }
    atomic_fetch_add(&wrapper_call_count, 1);       // ❌ Atomic again!
    PTR_TRACK_INIT();                               // ❌ Initialization check
    periodic_canary_check(call_num, ...);           // ❌ Periodic check
    // Finally, actual allocation
    void* result = tiny_alloc_fast(size);
    return result;
}
```
**Recommendations:**
1. **Disable diagnostics at compile time** (CRITICAL; guard sketch after this list)
- Remove atomic trace counters in hot path
- Move to `#if HAKMEM_BUILD_RELEASE` guards
- Expected gain: **~4-6 cycles/op** (eliminates 12% overhead)
2. **Hoist initialization checks** (HIGH PRIORITY)
- Move `PTR_TRACK_INIT()` to library init (once per thread)
- Cache `g_bench_fast_front` in thread-local variable
```c
static __thread int g_init_done = 0;
if (__builtin_expect(!g_init_done, 0)) {
    PTR_TRACK_INIT();
    g_init_done = 1;
}
```
- Expected gain: ~1-2 cycles/op
3. **Eliminate wrapper layer for benchmarks** (MEDIUM)
- Direct call to `tiny_alloc_fast()` from `malloc()`
- Use LTO to inline wrapper entirely
- Expected gain: ~1-2 cycles/op (function call overhead)
4. **Branchless environment checks** (LOW)
- Replace `if (g_wrapper_env)` with bitmask operations
```c
intptr_t mask = -(intptr_t)(g_wrapper_env != 0);  // all-ones if set, 0 otherwise
result = (void*)((mask & (intptr_t)diagnostic_path) | (~mask & (intptr_t)fast_path));
```
- Expected gain: ~0.3-0.5 cycles/op
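To make recommendation 1 concrete, a guard along these lines could be used; `HAKMEM_BUILD_RELEASE` is the flag already mentioned above, while the macro name `HAK_TRACE_COUNT` is hypothetical.
```c
/* Sketch: compile the trace counter out of release builds so the hot path
 * carries no atomic increment. HAK_TRACE_COUNT is an illustrative name. */
#if HAKMEM_BUILD_RELEASE
#  define HAK_TRACE_COUNT(counter)  ((void)0)             /* no-op in release */
#else
#  define HAK_TRACE_COUNT(counter) \
      __atomic_fetch_add(&(counter), 1, __ATOMIC_RELAXED) /* debug-only */
#endif

/* Hot path then becomes: HAK_TRACE_COUNT(g_alloc_fast_trace); */
```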
---
## Summary: Optimization Roadmap
### Immediate Wins (Target: -15 cycles/op, 48.8 → 33.8)
1. ✅ Remove atomic trace counters (`lock incl`) → **-6 cycles/op**
2. ✅ Inline `unified_cache_refill` → **-3 cycles/op**
3. ✅ Batch validation in refill → **-3 cycles/op**
4. ✅ Optimize TLS cache layout → **-3 cycles/op**
### Medium-Term (Target: -10 cycles/op, 33.8 → 23.8)
5. ✅ Prefetch in refill and malloc → **-3 cycles/op**
6. ✅ Increase batch size for hot classes → **-2 cycles/op**
7. ✅ Consolidate free path (merge 3 functions) → **-3 cycles/op**
8. ✅ Hoist initialization checks → **-2 cycles/op**
### Long-Term (Target: -8 cycles/op, 23.8 → 15.8)
9. ✅ Branchless routing logic → **-2 cycles/op**
10. ✅ SIMD batch processing in refill → **-3 cycles/op**
11. ✅ Reduce metadata indirections → **-3 cycles/op**
### Stretch Goal: Match mimalloc (15.8 → 6.2 cycles/op)
- Requires architectural changes (single-layer cache, no validation)
- Trade-off: Safety vs performance
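For scale, a mimalloc-style single-layer fast path looks roughly like the sketch below. All names are hypothetical; this shows what the stretch goal trades validation away for, not existing HAKMEM code.
```c
/* Illustration of the stretch goal: a single-layer, validation-free TLS
 * free-list pop, roughly what a mimalloc-class tiny fast path reduces to. */
#define NUM_TINY_CLASSES 8

static __thread void* g_tls_freelist[NUM_TINY_CLASSES];

void* alloc_slow_path(int class_idx);  /* refill from the backend */

static inline void* alloc_fast(int class_idx) {
    void* p = g_tls_freelist[class_idx];            /* one TLS load             */
    if (__builtin_expect(p != NULL, 1)) {
        g_tls_freelist[class_idx] = *(void**)p;     /* pop: next ptr in block   */
        return p;                                   /* no lookup, no validation */
    }
    return alloc_slow_path(class_idx);              /* miss: refill             */
}
```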
---
## Conclusion
HAKMEM's 7.88x slowdown is primarily due to:
1. **Cache misses** (24.4% of cycles) from multi-layer indirection
2. **Diagnostic overhead** (12%+ of cycles) from atomic counters and tracing
3. **Function fragmentation** (6 hot functions vs mimalloc's 2)
**Top Priority Actions:**
- Remove atomic trace counters (immediate -6 cycles/op)
- Inline refill + batch validation (-6 cycles/op combined)
- Optimize TLS layout for cache locality (-3 cycles/op)
**Expected Impact:** **-15 cycles/op** (48.8 → 33.8, ~30% improvement)
**Timeline:** 1-2 days of focused optimization work