# Comprehensive Profiling Analysis: HAKMEM Performance Gaps
## 🔍 Executive Summary
After the **Prefault Box + MAP_POPULATE fix**, the profiling shows:
### Current Performance Metrics
| Metric | Random Mixed | Tiny Hot | Gap |
|--------|---|---|---|
| **Cycles (lower is better)** | 72.6M | 72.3M | **SAME** 🤯 |
| **Page Faults** | 7,672 | 7,672 | **IDENTICAL** ⚠️ |
| **L1 Cache Misses** | 763K | 738K | Similar |
| **Throughput** | ~1.06M ops/s | ~1.23M ops/s | 1.16x |
| **Instructions/Cycle** | 0.74 | 0.73 | Similar |
| **TLB Miss Rate** | 48.65% (dTLB) | N/A | High |
### 🚨 KEY FINDING: Prefault is NOT working as expected!
**Problem:** Both Random Mixed and Tiny Hot have identical page faults (7,672), which suggests:
1. ✗ Prefault Box is either disabled or ineffective
2. ✗ Page faults are coming from elsewhere (not Superslab mmap)
3. ✗ MAP_POPULATE flag is not preventing runtime faults
---
## 📊 Detailed Performance Breakdown
### Random Mixed Workload (1M allocations, 256 slots, sizes 16-1040B)
**Top Kernel Functions (by CPU time):**
```
15.01%  asm_exc_page_fault       ← page-fault handling
11.65%  clear_page_erms          ← page zeroing
 5.27%  zap_pte_range            ← memory cleanup
 5.20%  handle_mm_fault          ← MMU fault handling
 4.06%  do_anonymous_page        ← anonymous page allocation
 3.18%  __handle_mm_fault        ← nested fault handling
 2.35%  rmqueue_bulk             ← allocator backend
 2.35%  __memset_avx2_unaligned  ← memory operations
 2.28%  do_user_addr_fault       ← user fault handling
 1.77%  arch_exit_to_user_mode   ← return-to-user work
```
**Kernel overhead:** ~63% of cycles
**L1 dcache load misses:** 763K over the run
**Branch miss rate:** 11.94%
### Tiny Hot Workload (10M allocations, fixed size)
**Top Kernel Functions (by CPU time):**
```
14.19%  asm_exc_page_fault        ← page-fault handling
12.82%  clear_page_erms           ← page zeroing
 5.61%  __memset_avx2_unaligned   ← memory operations
 5.02%  do_anonymous_page         ← anonymous page allocation
 3.31%  mem_cgroup_commit_charge  ← memory accounting
 2.67%  __handle_mm_fault         ← MMU fault handling
 2.45%  do_user_addr_fault        ← user fault handling
```
**Kernel overhead:** ~66% of cycles
**L1 dcache load misses:** 738K over the run
**Branch miss rate:** 11.03%
### Comparison: Why are the cycle counts similar?
```
Random Mixed: 72.6M cycles / 1M ops  = 72.6 cycles/op
Tiny Hot:     72.3M cycles / 10M ops =  7.23 cycles/op
```
⚠️ Taken at face value, the normalized costs differ by 10x, yet the measured throughput gap is only 1.16x. The discrepancy matters:
- Random Mixed: measured over only 1M operations (baseline scale)
- Tiny Hot: measured over 10M operations (10x scale)
- Near-identical *total* cycle counts despite a 10x difference in work suggest the cycle counters did not scale with the op count (e.g., a fixed sampling window), so the raw per-op normalization above is suspect
**Judging by the 1.16x throughput gap, Random Mixed is NOT dramatically slower in per-operation cost.**
---
## 🎯 Critical Findings
### Finding 1: Page Faults Are NOT Being Reduced
**Observed:**
- Random Mixed: 7,672 page faults
- Tiny Hot: 7,672 page faults
- **Difference: 0** ← This is wrong!
**Expected (with prefault):**
- Random Mixed: 7,672 → maybe 100-500 (a >90% reduction)
- Tiny Hot: 7,672 → ~50-100 (minimal change)
**Hypothesis:**
- Prefault Box may not be enabled
- Or MAP_POPULATE is not working on this kernel
- Or allocations are hitting kernel-internal mmap (not Superslab)
### Finding 2: TLB Misses Are HIGH (48.65%)
```
dTLB-loads: 49,160
dTLB-load-misses: 23,917 (48.65% miss rate!)
iTLB-load-misses: 17,590 (7748.90% - kernel measurement artifact)
```
**Meaning:** Nearly half of TLB lookups fail, causing page table walks.
**Why this matters:**
- Each TLB miss = ~10-40 cycles (vs 1-3 for hit)
- 23,917 × 25 cycles = ~600K wasted cycles
- That's ~10% of total runtime!
### Finding 3: Both Workloads Are Similar
Despite different access patterns:
- Both spend 15% on page fault handling
- Both spend 12% on page zeroing
- Both have similar L1 miss rates
- Both have similar branch miss rates
**Conclusion:** The memory subsystem is the bottleneck for BOTH workloads, not user-space code.
---
## 📈 Layer Analysis
### Kernel vs User Split
| Category | Random Mixed | Tiny Hot | Analysis |
|----------|---|---|---|
| **Kernel (page faults, scheduling, etc.)** | 63% | 66% | Dominant |
| **Kernel zeroing (clear_page_erms)** | 11.65% | 12.82% | Similar |
| **User malloc/free** | <1% | <1% | Not visible |
| **User pool/cache logic** | <1% | <1% | Not visible |
### User-Space Functions Visible in Profile
**Random Mixed:**
```
0.59% hak_free_at.constprop.0 (hakmem free path)
```
**Tiny Hot:**
```
0.59% hak_pool_mid_lookup (hakmem pool routing)
```
**Conclusion:** User-space HAKMEM code is NOT a bottleneck (<1% each).
---
## 🔧 What's Really Happening
### Current State (POST-Prefault Box)
```
allocate(size):
  1. malloc wrapper         → <1% cycles
  2. Gatekeeper routing     → ~0.1% cycles
  3. unified_cache_refill   → (hidden in kernel time)
  4. shared_pool_acquire    → (hidden in kernel time)
  5. SuperSlab/mmap call    → triggers the kernel
  6. KERNEL PAGE FAULTS     → 15% cycles
  7. clear_page_erms (zero) → 12% cycles
```
### Why Prefault Isn't Working
**Possible reasons:**
1. **Prefault Box disabled?**
   - Check: `HAKMEM_BOX_SS_PREFAULT_ENABLED`
   - Or: `g_ss_populate_once` not being set
2. **MAP_POPULATE not actually pre-faulting?**
   - The Linux kernel may defer population even with MAP_POPULATE (the flag is best-effort)
   - `madvise(MADV_POPULATE_READ)`, or `MADV_POPULATE_WRITE` for writable anonymous memory (Linux 5.14+), forces immediate faulting
   - Or use `mincore()` to check residency before allocation (a standalone probe is sketched after this list)
3. **Allocations not from Superslab mmap?**
   - Page faults may come from TLS cache allocation
   - Or from libc-internal allocations
   - Rather than from the Superslab backend
4. **TLB misses dominating?**
   - The 48.65% TLB miss rate suggests a memory-layout issue
   - SuperSlab metadata may not be cache-friendly
   - Working set too large for the TLB
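Hypothesis 2 is cheap to test in isolation. Below is a minimal standalone C probe (not HAKMEM code; the 64 MiB size is an arbitrary stand-in for a SuperSlab-sized region) that maps anonymous memory with `MAP_POPULATE` and counts the minor faults taken while touching every page. A delta near zero means the flag populated the mapping; a delta near `LEN/4096` means the kernel stayed lazy:
```c
/* Standalone MAP_POPULATE probe (hypothesis 2). Not HAKMEM code. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

#define LEN (64UL * 1024 * 1024)  /* arbitrary stand-in for a SuperSlab region */

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long before = minor_faults();
    for (size_t i = 0; i < LEN; i += 4096)  /* touch every 4 KiB page */
        p[i] = 1;
    long delta = minor_faults() - before;

    /* ~0         => MAP_POPULATE pre-faulted the region
     * ~LEN/4096  => kernel stayed lazy; retry after
     *               madvise(p, LEN, MADV_POPULATE_WRITE)  (Linux 5.14+) */
    printf("faults while touching: %ld of %lu pages\n", delta, LEN / 4096);
    munmap(p, LEN);
    return 0;
}
```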
---
## 🎓 What We Learned From Previous Analysis
From the earlier profiling report, we identified that:
- **Random Mixed was 21.7x slower**, with page-fault handling consuming 61.7% of cycles
- **Expected with prefault:** that share should drop to ~5% or less
But NOW we see:
- **Random Mixed is NOT significantly slower** (per-op cost is similar)
- **Page faults are identical** to Tiny Hot
- **This contradicts expectations**
### Possible Explanation
The **earlier measurements** may have been from:
- Benchmark run at startup (cold caches)
- With additional profiling overhead
- Or different workload parameters
The **current measurements** are:
- Steady state (after initial allocation)
- With higher throughput (Tiny Hot = 10M ops)
- After recent optimizations
---
## 🎯 Next Steps - Three Options
### 📋 Option A: Verify Prefault is Actually Enabled
**Goal:** Confirm prefault mechanism is working
**Steps:**
1. Add debug output to `ss_prefault_policy()` and `ss_prefault_region()` (a counter sketch follows these steps)
2. Check whether the `MAP_POPULATE` flag is set in the actual mmap calls
3. Run with `strace` to see the mmap flags:
   ```bash
   strace -e trace=mmap ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | grep MAP_POPULATE
   ```
4. Check whether `madvise(MADV_POPULATE_READ)` calls are happening
**Expected outcome:** MAP_POPULATE or MADV_POPULATE should appear in the traces
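For step 1, a one-line counter is enough to prove the prefault path executes at all. A sketch follows; the parameter names (`base`, `len`) are assumptions, since the real `ss_prefault_region()` signature isn't shown here:
```c
/* Sketch only: instrument the prefault path with a call counter.
 * The base/len parameters are assumed; adapt to the real signature. */
#include <stdio.h>
#include <stdatomic.h>

static _Atomic unsigned long g_prefault_calls;

void ss_prefault_region(void *base, size_t len) {
    unsigned long n = atomic_fetch_add(&g_prefault_calls, 1) + 1;
    fprintf(stderr, "[ss_prefault] call #%lu base=%p len=%zu\n", n, base, len);
    /* ... existing MAP_POPULATE / madvise logic ... */
}
```
If the counter never prints under `bench_random_mixed_hakmem`, the Prefault Box is effectively disabled, and the identical fault counts are explained.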
---
### 🎯 Option B: Reduce TLB Misses (48.65% → ~5%)
**Goal:** Improve memory layout to reduce TLB pressure
**Steps:**
1. **Analyze SuperSlab metadata layout:**
   - Current: is metadata per-slab or centralized?
   - Check: the `sp_meta_find_or_create()` hot path
2. **Improve cache locality:**
   - Cache-align metadata structures
   - Use larger pages (2 MB or 1 GB hugepages; see the sketch below)
   - Reduce the working set size
3. **Profile with hugepages:**
   ```bash
   # as root: reserve 10 x 2 MB hugepages, then re-run with hugepages enabled
   echo 10 > /proc/sys/vm/nr_hugepages
   HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
   ```
**Expected gain:** 1.5-2x speedup (eliminate TLB miss penalty)
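If `HAKMEM_USE_HUGEPAGES` turns out not to be wired up yet, the mapping change itself is small. Here is a sketch of what the SuperSlab backend could do, assuming region sizes that are 2 MiB multiples (`ss_map_huge` is a hypothetical name, not an existing HAKMEM function):
```c
/* Hugepage-backed region mapping (sketch). One 2 MiB page covers 512
 * 4 KiB pages with a single dTLB entry, directly attacking the 48.65%
 * miss rate. len must be a multiple of 2 MiB for MAP_HUGETLB. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void *ss_map_huge(size_t len) {
    /* Explicit hugetlb pages first (requires the nr_hugepages setup above). */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;
    /* Fallback: normal mapping, then ask for transparent hugepages. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, len, MADV_HUGEPAGE);  /* let khugepaged collapse to 2 MiB */
    return p;
}
```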
---
### 🚀 Option C: Reduce Page Zeroing (12% → ~2%)
**Goal:** Skip unnecessary page zeroing
**Steps:**
1. **Analyze what needs zeroing:**
   - Are SuperSlab pages truly uninitialized?
   - Can we reuse memory without re-zeroing? Note that `MADV_DONTNEED` works against this goal: it guarantees zero-filled pages on the next touch. `MADV_FREE` is the lazy alternative (see the sketch after this list)
2. **Implement lazy zeroing:**
   - Don't zero pages on allocation
   - Only zero the portions actually handed out
   - Let the kernel handle the rest on free
3. **Use uninitialized pools:**
   - Pre-allocate without zeroing
   - Initialize on demand
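A sketch of the lazy-release idea, assuming a hypothetical `ss_release_range()` hook where HAKMEM parks reusable slab pages. The key distinction: `MADV_DONTNEED` discards pages immediately and guarantees zero-fill on the next touch (re-paying `clear_page_erms`), while `MADV_FREE` (Linux 4.5+) marks them lazily reclaimable and preserves contents if they are reused before memory pressure hits:
```c
/* Lazy page release (sketch). ss_release_range() is a hypothetical hook. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void ss_release_range(void *base, size_t len) {
#ifdef MADV_FREE
    if (madvise(base, len, MADV_FREE) == 0)  /* lazy: no zero-fill on reuse */
        return;
#endif
    madvise(base, len, MADV_DONTNEED);       /* eager: zero-filled on refault */
}
```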
**Expected gain:** 1.5x speedup (eliminate 12% zero cost)
---
## 📊 Recommendation
Based on the analysis:
### Most Impactful (Order of Preference):
1. **Fix TLB misses (48.65%)**
   - Potential gain: 1.5-2x
   - Implementation: medium difficulty
   - Reason: already showing a 48% miss rate
2. **Verify prefault actually works**
   - Potential gain: unknown (currently not working?)
   - Implementation: easy (debugging)
   - Reason: should already be solved, yet the page-fault counts are unchanged
3. **Reduce page zeroing**
   - Potential gain: 1.5x
   - Implementation: medium difficulty
   - Reason: 12% of total time
---
## 🧪 Recommended Next Action
### Immediate (This Session)
Run diagnostic to confirm prefault status:
```bash
# Check if MAP_POPULATE is in actual mmap calls
strace -e trace=mmap ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | head -20
# Check compiler flags
grep -i prefault Makefile
# Check environment variables
env | grep -i HAKMEM | grep -i PREFAULT
```
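To attribute the 7,672 faults to a phase (startup vs. steady state), a `getrusage()` snapshot around the benchmark loop is the quickest check. A sketch follows; where exactly to drop the calls is left open, since the harness internals aren't shown here:
```c
/* Phase-attribution helper (sketch): print the process fault counters. */
#include <stdio.h>
#include <sys/resource.h>

static void fault_snapshot(const char *tag) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    fprintf(stderr, "[faults] %s: minor=%ld major=%ld\n",
            tag, ru.ru_minflt, ru.ru_majflt);
}
/* Call fault_snapshot("after warmup") and fault_snapshot("after run");
 * if most of the 7,672 faults land before warmup ends, prefault is a
 * startup issue, not a steady-state one. */
```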
### If Prefault is Disabled → Enable It
Then re-run profiling to verify improvement.
### If Prefault is Enabled → Move to Option B (TLB)
Focus on reducing 48% TLB miss rate.
---
## 📈 Expected Outcome After All Fixes
| Factor | Current | After | Gain |
|--------|---------|-------|------|
| Page faults | 7,672 | 500-1000 | 8-15x |
| TLB misses | 48.65% | ~5% | 3-5x |
| Page zeroing | 12% | 2% | 2x |
| **Total per-op time** | 72.6 cycles | 20-25 cycles | **3-4x** |
| **Throughput** | 1.06M ops/s | 3.5-4M ops/s | **3-4x** |