# Comprehensive Profiling Analysis: HAKMEM Performance Gaps
## 🔍 Executive Summary

After the **Prefault Box + MAP_POPULATE fix**, the profiling shows:

### Current Performance Metrics

| Metric | Random Mixed | Tiny Hot | Gap |
|--------|---|---|---|
| **Cycles (lower is better)** | 72.6M | 72.3M | **SAME** 🤯 |
| **Page Faults** | 7,672 | 7,672 | **IDENTICAL** ⚠️ |
| **L1 Cache Misses** | 763K | 738K | Similar |
| **Throughput** | ~1.06M ops/s | ~1.23M ops/s | 1.16x |
| **Instructions/Cycle** | 0.74 | 0.73 | Similar |
| **TLB Miss Rate** | 48.65% (dTLB) | N/A | High |
### 🚨 KEY FINDING: Prefault is NOT working as expected!

**Problem:** Both Random Mixed and Tiny Hot show identical page-fault counts (7,672), which suggests:

1. ✗ the Prefault Box is disabled or ineffective;
2. ✗ the page faults come from elsewhere (not the SuperSlab mmap); or
3. ✗ the MAP_POPULATE flag is not preventing runtime faults.
---
## 📊 Detailed Performance Breakdown

### Random Mixed Workload (1M allocations, 256 slots, sizes 16-1040B)

**Top Kernel Functions (by CPU time):**

```
15.01%  asm_exc_page_fault        ← Page fault handling
11.65%  clear_page_erms           ← Page zeroing
 5.27%  zap_pte_range             ← Memory cleanup
 5.20%  handle_mm_fault           ← MMU fault handling
 4.06%  do_anonymous_page         ← Anonymous page allocation
 3.18%  __handle_mm_fault         ← Nested fault handling
 2.35%  rmqueue_bulk              ← Allocator backend
 2.35%  __memset_avx2_unaligned   ← Memory operations
 2.28%  do_user_addr_fault        ← User fault handling
 1.77%  arch_exit_to_user_mode    ← Return to user mode
```

**Kernel overhead:** ~63% of cycles
**L1 d-cache load misses:** 763K
**Branch miss rate:** 11.94%
### Tiny Hot Workload (10M allocations, fixed size)

**Top Kernel Functions (by CPU time):**

```
14.19%  asm_exc_page_fault         ← Page fault handling
12.82%  clear_page_erms            ← Page zeroing
 5.61%  __memset_avx2_unaligned    ← Memory operations
 5.02%  do_anonymous_page          ← Anonymous page allocation
 3.31%  mem_cgroup_commit_charge   ← Memory accounting
 2.67%  __handle_mm_fault          ← MMU fault handling
 2.45%  do_user_addr_fault         ← User fault handling
```

**Kernel overhead:** ~66% of cycles
**L1 d-cache load misses:** 738K
**Branch miss rate:** 11.03%
### Comparison: Why are cycles similar?

```
Random Mixed: 72.6M cycles / 1M ops  = 72.6 cycles/op
Tiny Hot:     72.3M cycles / 10M ops = 7.23 cycles/op
```

The raw cycle counts are nearly identical, but the workloads measure different operation counts:

- Random Mixed: 1M operations (baseline)
- Tiny Hot: 10M operations (10x scale)
- **Normalized:** 72.6 cycles/op vs 7.23 cycles/op

Yet measured throughput differs by only 1.16x (1.06M vs 1.23M ops/s), which suggests the cycle counters cover a sampled window rather than the full runs.

**On the wall-clock numbers, Random Mixed is NOT dramatically slower in per-operation cost.**
---
## 🎯 Critical Findings

### Finding 1: Page Faults Are NOT Being Reduced

**Observed:**
- Random Mixed: 7,672 page faults
- Tiny Hot: 7,672 page faults
- **Difference: 0** ← This is wrong!

**Expected (with prefault):**
- Random Mixed: 7,672 → maybe 100-500 (90% reduction)
- Tiny Hot: 7,672 → ~50-100 (minimal change)

**Hypothesis:**
- The Prefault Box may not be enabled
- Or MAP_POPULATE is not working on this kernel
- Or the allocations are hitting kernel-internal mmap paths (not SuperSlab)
### Finding 2: TLB Misses Are HIGH (48.65%)

```
dTLB-loads:        49,160
dTLB-load-misses:  23,917 (48.65% miss rate!)
iTLB-load-misses:  17,590 (7748.90% - kernel measurement artifact)
```

**Meaning:** Nearly half of sampled dTLB lookups miss, causing page-table walks.

**Why this matters:**
- Each TLB miss costs ~10-40 cycles (vs 1-3 for a hit)
- 23,917 misses × ~25 cycles ≈ 600K cycles in the sampled window
- Since these counters are sampled, the full-run cost is proportionally larger
### Finding 3: Both Workloads Are Similar

Despite different access patterns:
- Both spend ~15% of cycles on page-fault handling
- Both spend ~12% on page zeroing
- Both have similar L1 miss rates
- Both have similar branch miss rates

**Conclusion:** The memory subsystem is the bottleneck for BOTH workloads, not user-space code.

---
## 📈 Layer Analysis

### Kernel vs User Split

| Category | Random Mixed | Tiny Hot | Analysis |
|----------|---|---|---|
| **Kernel (page faults, scheduling, etc.)** | 63% | 66% | Dominant |
| **Kernel zeroing (clear_page_erms)** | 11.65% | 12.82% | Similar |
| **User malloc/free** | <1% | <1% | Not visible |
| **User pool/cache logic** | <1% | <1% | Not visible |

### User-Space Functions Visible in Profile

**Random Mixed:**
```
0.59%  hak_free_at.constprop.0   (hakmem free path)
```

**Tiny Hot:**
```
0.59%  hak_pool_mid_lookup       (hakmem pool routing)
```

**Conclusion:** User-space HAKMEM code is NOT a bottleneck (<1% each).
---
## 🔧 What's Really Happening

### Current State (POST-Prefault Box)

```
allocate(size):
  1. malloc wrapper          → <1% cycles
  2. Gatekeeper routing      → ~0.1% cycles
  3. unified_cache_refill    → (hidden in kernel time)
  4. shared_pool_acquire     → (hidden in kernel time)
  5. SuperSlab/mmap call     → triggers kernel
  6. KERNEL PAGE FAULTS      → 15% cycles
  7. clear_page_erms (zero)  → 12% cycles
```
### Why Prefault Isn't Working

**Possible reasons:**

1. **Prefault Box disabled?**
   - Check: `HAKMEM_BOX_SS_PREFAULT_ENABLED`
   - Or: `g_ss_populate_once` not being set

2. **MAP_POPULATE not actually pre-faulting?**
   - The Linux kernel can still be lazy even with MAP_POPULATE
   - `madvise(MADV_POPULATE_READ)` (Linux 5.14+) can force immediate faulting
   - `mincore()` can verify residency before allocation

3. **Allocations not from the SuperSlab mmap?**
   - Page faults may come from TLS cache allocation
   - Or from libc-internal allocations
   - Not from the SuperSlab backend

4. **TLB misses dominating?**
   - The 48.65% TLB miss rate suggests a memory-layout issue
   - SuperSlab metadata may not be cache-friendly
   - Working set too large for the TLB
---
## 🎓 What We Learned From Previous Analysis

The earlier profiling report identified that:
- **Random Mixed was 21.7x slower**, due to page faults consuming 61.7% of the time
- **Expected with prefault:** that share should drop to ~5% or less

But NOW we see:
- **Random Mixed is NOT significantly slower** (per-op cost is similar)
- **Page faults are identical** to Tiny Hot
- **This contradicts expectations**

### Possible Explanation

The **earlier measurements** may have been taken:
- at benchmark startup (cold caches)
- with additional profiling overhead
- or with different workload parameters

The **current measurements** are:
- steady state (after initial allocation)
- at higher throughput (Tiny Hot = 10M ops)
- after the recent optimizations
---
## 🎯 Next Steps - Three Options

### 📋 Option A: Verify Prefault is Actually Enabled

**Goal:** Confirm the prefault mechanism is working

**Steps:**
1. Add debug output to `ss_prefault_policy()` and `ss_prefault_region()`
2. Check whether the `MAP_POPULATE` flag is set in the actual mmap calls
3. Run under `strace` to inspect the mmap flags (the syscall is `mmap` on 64-bit, not `mmap2`):
```
strace -e trace=mmap ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | grep MAP_POPULATE
```
4. Check whether `madvise(MADV_POPULATE_READ)` calls are happening

**Expected outcome:** MAP_POPULATE or MADV_POPULATE_READ should appear in the traces
---
### 🎯 Option B: Reduce TLB Misses (48.65% → ~5%)

**Goal:** Improve memory layout to reduce TLB pressure

**Steps:**

1. **Analyze SuperSlab metadata layout:**
   - Current: is metadata per-slab or centralized?
   - Check: the `sp_meta_find_or_create()` hot path

2. **Improve cache locality:**
   - Cache-align metadata structures
   - Use larger pages (2MB or 1GB hugepages)
   - Reduce the working-set size

3. **Profile with hugepages:**
```
echo 10 > /proc/sys/vm/nr_hugepages
HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
```

**Expected gain:** 1.5-2x speedup (eliminating the TLB-miss penalty)
---
### 🚀 Option C: Reduce Page Zeroing (12% → ~2%)

**Goal:** Skip unnecessary page zeroing

**Steps:**

1. **Analyze what needs zeroing:**
   - Are SuperSlab pages truly uninitialized?
   - Can we reuse memory without zeroing?
   - Would `MADV_DONTNEED` before reuse help?

2. **Implement lazy zeroing:**
   - Don't zero pages on allocation
   - Zero only the portions actually used
   - Let the kernel handle the rest on free

3. **Use uninitialized pools:**
   - Pre-allocate without zeroing
   - Initialize on demand

**Expected gain:** 1.5x speedup (eliminating the 12% zeroing cost)
---
## 📊 Recommendation

Based on the analysis:

### Most Impactful (Order of Preference):

1. **Fix TLB Misses (48.65%)**
   - Potential gain: 1.5-2x
   - Implementation: medium difficulty
   - Reason: already showing a 48% miss rate

2. **Verify Prefault Actually Works**
   - Potential gain: unknown (currently not working?)
   - Implementation: easy (debugging)
   - Reason: supposedly solved, yet the page-fault counts are unchanged

3. **Reduce Page Zeroing**
   - Potential gain: 1.5x
   - Implementation: medium difficulty
   - Reason: 12% of total time
---
## 🧪 Recommended Next Action

### Immediate (This Session)

Run diagnostics to confirm the prefault status:

```
# Check whether MAP_POPULATE appears in the actual mmap calls
strace -e trace=mmap ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | head -20

# Check build flags
grep -i prefault Makefile

# Check environment variables
env | grep -i HAKMEM | grep -i PREFAULT
```

### If Prefault is Disabled → Enable It

Then re-run profiling to verify the improvement.

### If Prefault is Enabled → Move to Option B (TLB)

Focus on reducing the 48% TLB miss rate.
---
## 📈 Expected Outcome After All Fixes

| Factor | Current | After | Gain |
|--------|---------|-------|------|
| Page faults | 7,672 | 500-1000 | 8-15x |
| TLB misses | 48.65% | ~5% | 3-5x |
| Page zeroing | 12% | 2% | 2x |
| **Total per-op time** | 72.6 cycles | 20-25 cycles | **3-4x** |
| **Throughput** | 1.06M ops/s | 3.5-4M ops/s | **3-4x** |