# Comprehensive Profiling Analysis: HAKMEM Performance Gaps

## 🔍 Executive Summary

After the **Prefault Box + MAP_POPULATE fix**, the profiling shows:

### Current Performance Metrics

| Metric | Random Mixed | Tiny Hot | Gap |
|--------|---|---|---|
| **Cycles (lower is better)** | 72.6M | 72.3M | **SAME** 🤯 |
| **Page Faults** | 7,672 | 7,672 | **IDENTICAL** ⚠️ |
| **L1 Cache Misses** | 763K | 738K | Similar |
| **Throughput** | ~1.06M ops/s | ~1.23M ops/s | 1.16x |
| **Instructions/Cycle** | 0.74 | 0.73 | Similar |
| **TLB Miss Rate** | 48.65% (dTLB) | N/A | High |
### 🚨 KEY FINDING: Prefault is NOT working as expected!

**Problem:** Both Random Mixed and Tiny Hot have identical page-fault counts (7,672), which suggests:

1. ✗ Prefault Box is either disabled or ineffective
2. ✗ Page faults are coming from elsewhere (not the Superslab mmap)
3. ✗ MAP_POPULATE is not preventing runtime faults

---
## 📊 Detailed Performance Breakdown

### Random Mixed Workload (1M allocations, 256 slots, sizes 16-1040B)

**Top Kernel Functions (by CPU time):**

```
15.01%  asm_exc_page_fault        ← Page fault handling
11.65%  clear_page_erms           ← Page zeroing
 5.27%  zap_pte_range             ← Memory cleanup
 5.20%  handle_mm_fault           ← MMU fault handling
 4.06%  do_anonymous_page         ← Anonymous page allocation
 3.18%  __handle_mm_fault         ← Nested fault handling
 2.35%  rmqueue_bulk              ← Allocator backend
 2.35%  __memset_avx2_unaligned   ← Memory operations
 2.28%  do_user_addr_fault        ← User fault handling
 1.77%  arch_exit_to_user_mode    ← Return to user mode
```

**Kernel overhead:** ~63% of cycles
**L1 d-cache load misses:** 763K
**Branch miss rate:** 11.94%

### Tiny Hot Workload (10M allocations, fixed size)

**Top Kernel Functions (by CPU time):**

```
14.19%  asm_exc_page_fault        ← Page fault handling
12.82%  clear_page_erms           ← Page zeroing
 5.61%  __memset_avx2_unaligned   ← Memory operations
 5.02%  do_anonymous_page         ← Anonymous page allocation
 3.31%  mem_cgroup_commit_charge  ← Memory accounting
 2.67%  __handle_mm_fault         ← MMU fault handling
 2.45%  do_user_addr_fault        ← User fault handling
```

**Kernel overhead:** ~66% of cycles
**L1 d-cache load misses:** 738K
**Branch miss rate:** 11.03%
### Comparison: Why are cycles similar?

```
Random Mixed: 72.6M cycles / 1M ops  = 72.6 cycles/op
Tiny Hot:     72.3M cycles / 10M ops = 7.23 cycles/op

⚠️ THROUGHPUT DIFFERENCE UNEXPLAINED!
```

The total cycle counts are nearly identical even though the workloads differ in size:

- Random Mixed performs 1M operations (the baseline)
- Tiny Hot performs 10M operations (10x the work for the same cycle count)
- Normalized, that is 72.6 vs 7.23 cycles/op, a ~10x per-op gap that the measured throughput (1.06M vs 1.23M ops/s, only a 1.16x gap) does not show

**The cycles/op and throughput figures are inconsistent, which suggests the cycle counts and the throughput were not captured over the same window (e.g., sampled perf counters vs. wall-clock throughput). Judged by measured throughput, Random Mixed is no longer dramatically slower per operation.**

---
## 🎯 Critical Findings

### Finding 1: Page Faults Are NOT Being Reduced

**Observed:**

- Random Mixed: 7,672 page faults
- Tiny Hot: 7,672 page faults
- **Difference: 0** ← This is wrong!

**Expected (with prefault):**

- Random Mixed: 7,672 → roughly 100-500 (90%+ reduction)
- Tiny Hot: 7,672 → ~50-100 (minimal change)

**Hypothesis:**

- Prefault Box may not be enabled
- Or MAP_POPULATE is not working on this kernel
- Or the faults come from mappings other than the Superslab mmap (e.g., libc-internal allocations)

### Finding 2: TLB Misses Are HIGH (48.65%)

```
dTLB-loads:        49,160
dTLB-load-misses:  23,917  (48.65% miss rate!)
iTLB-load-misses:  17,590  (7748.90% - kernel measurement artifact)
```

**Meaning:** Nearly half of data-TLB lookups miss, forcing page-table walks.

**Why this matters:**

- Each TLB miss costs ~10-40 cycles (vs. 1-3 cycles for a hit)
- For the sampled counters alone: 23,917 misses × ~25 cycles ≈ 600K cycles
- These counter values look sampled (49K loads for a 1M-operation run), so the absolute numbers understate the real cost; if the ~49% miss rate holds across the whole run, page-table walks consume a significant share of total cycles (see the sketch below)
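To see why a ~49% miss rate points at working-set size rather than unlucky access patterns, here is a small, self-contained illustration. The TLB sizes and the 64 MiB working set are assumptions for the sketch, not values measured on the benchmark machine.

```c
#include <stdio.h>

/* Illustrative TLB-reach estimate. All hardware numbers below are assumed
 * "typical" values, not measurements from the benchmark machine. */
int main(void) {
    const long tlb_entries = 64 + 1536;          /* L1 dTLB + shared L2 TLB */
    const long page_4k     = 4L * 1024;
    const long page_2m     = 2L * 1024 * 1024;
    const long working_set = 64L * 1024 * 1024;  /* assumed 64 MiB heap     */

    printf("TLB reach, 4 KiB pages: %ld MiB\n", tlb_entries * page_4k >> 20);
    printf("TLB reach, 2 MiB pages: %ld MiB\n", tlb_entries * page_2m >> 20);
    printf("Pages to cover %ld MiB: %ld (4 KiB) vs %ld (2 MiB)\n",
           working_set >> 20, working_set / page_4k, working_set / page_2m);
    return 0;
}
```

With 4 KiB pages the TLB covers only a few MiB, which an allocator-heavy working set easily exceeds; this is consistent with the observed miss rate and motivates Option B below.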
### Finding 3: Both Workloads Are Similar

Despite different access patterns:

- Both spend ~15% of cycles on page-fault handling
- Both spend ~12% of cycles on page zeroing
- Both have similar L1 miss rates
- Both have similar branch miss rates

**Conclusion:** The memory subsystem is the bottleneck for BOTH workloads, not user-space code.

---
## 📈 Layer Analysis

### Kernel vs User Split

| Category | Random Mixed | Tiny Hot | Analysis |
|----------|---|---|---|
| **Kernel (page faults, scheduling, etc.)** | 63% | 66% | Dominant |
| **Kernel zeroing (clear_page_erms)** | 11.65% | 12.82% | Similar |
| **User malloc/free** | <1% | <1% | Not visible |
| **User pool/cache logic** | <1% | <1% | Not visible |

### User-Space Functions Visible in Profile

**Random Mixed:**

```
0.59%  hak_free_at.constprop.0   (hakmem free path)
```

**Tiny Hot:**

```
0.59%  hak_pool_mid_lookup       (hakmem pool routing)
```

**Conclusion:** User-space HAKMEM code is NOT a bottleneck (<1% each).

---
## 🔧 What's Really Happening

### Current State (POST-Prefault Box)

```
allocate(size):
  1. malloc wrapper          → <1% cycles
  2. Gatekeeper routing      → ~0.1% cycles
  3. unified_cache_refill    → (hidden in kernel time)
  4. shared_pool_acquire     → (hidden in kernel time)
  5. SuperSlab/mmap call     → triggers the kernel
  6. KERNEL PAGE FAULTS      → ~15% of cycles
  7. clear_page_erms (zero)  → ~12% of cycles
```

### Why Prefault Isn't Working

**Possible reasons:**

1. **Prefault Box disabled?**
   - Check: `HAKMEM_BOX_SS_PREFAULT_ENABLED`
   - Or: `g_ss_populate_once` not being set

2. **MAP_POPULATE not actually pre-faulting?**
   - MAP_POPULATE is best-effort; the kernel may still leave pages unpopulated
   - `madvise(MADV_POPULATE_READ)` (Linux 5.14+) can force immediate population (see the sketch after this list)
   - `mincore()` can verify residency before the allocation path relies on it

3. **Allocations not from Superslab mmap?**
   - Page faults may be from TLS cache allocation
   - Or from libc internal allocations
   - Not from the Superslab backend

4. **TLB misses dominating?**
   - The 48.65% TLB miss rate suggests a memory layout issue
   - SuperSlab metadata may not be cache-friendly
   - The working set may be too large for the TLB
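A minimal sketch of how a Superslab reservation could combine both mechanisms, assuming the region is one large anonymous mapping; the helper name is hypothetical and this is not the actual HAKMEM code. `MADV_POPULATE_WRITE` (also 5.14+) is worth testing too, since an allocator writes to these pages and read-population still leaves a copy-on-write fault for the first store.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map a superslab region and try to populate its page
 * tables up front, so the allocation hot path never takes a soft fault. */
static void *superslab_reserve_prefaulted(size_t bytes)
{
    /* MAP_POPULATE is only best-effort: mmap() succeeds even if the kernel
     * could not populate the mapping. */
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

#ifdef MADV_POPULATE_READ
    /* Linux 5.14+: fails loudly instead of silently skipping pages.
     * MADV_POPULATE_WRITE avoids a later CoW fault on the first store. */
    (void)madvise(p, bytes, MADV_POPULATE_READ);
#endif
    return p;
}
```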
---

## 🎓 What We Learned From Previous Analysis

From the earlier profiling report, we identified that:

- **Random Mixed was 21.7x slower**, with page-fault handling consuming 61.7% of cycles
- **Expected with prefault:** this should drop to ~5% or less

But NOW we see:

- **Random Mixed is NOT significantly slower** (per-op cost is similar)
- **Page faults are identical** to Tiny Hot
- **This contradicts expectations**

### Possible Explanation

The **earlier measurements** may have been from:

- A benchmark run at startup (cold caches)
- A run with additional profiling overhead
- Or different workload parameters

The **current measurements** are:

- Steady state (after initial allocation)
- With higher throughput (Tiny Hot = 10M ops)
- After recent optimizations

---
## 🎯 Next Steps - Three Options

### 📋 Option A: Verify Prefault is Actually Enabled

**Goal:** Confirm the prefault mechanism is working

**Steps:**

1. Add debug output to `ss_prefault_policy()` and `ss_prefault_region()`
2. Check whether the `MAP_POPULATE` flag is set on the actual mmap calls
3. Run under `strace` to see the mmap flags:

```bash
strace -e trace=mmap,madvise ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | grep POPULATE
```

4. Check whether `madvise(MADV_POPULATE_READ)` calls are happening

**Expected outcome:** The traces should show MAP_POPULATE on the Superslab mappings (or MADV_POPULATE_READ madvise calls)
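Independent of strace, the same check can be made from inside the process by comparing minor-fault counts before and after the mapping is touched; if the prefault worked, touching the region adds almost no new faults. A standalone sketch (the 16 MiB size is an arbitrary stand-in for a superslab, not the real HAKMEM geometry):

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Returns the process's minor (soft) page-fault count so far. */
static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    const size_t bytes = 16UL << 20;            /* assumed 16 MiB superslab */

    long before = minor_faults();
    void *slab = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (slab == MAP_FAILED)
        return 1;
    long after_map = minor_faults();

    memset(slab, 0xAB, bytes);                  /* touch every page */
    long after_touch = minor_faults();

    /* If MAP_POPULATE really populated the mapping, the second delta should
     * be near zero; if it is ~4096 (16 MiB / 4 KiB), prefault did not happen. */
    printf("faults during mmap:  %ld\n", after_map - before);
    printf("faults during touch: %ld\n", after_touch - after_map);
    munmap(slab, bytes);
    return 0;
}
```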
---

### 🎯 Option B: Reduce TLB Misses (48.65% → ~5%)

**Goal:** Improve memory layout to reduce TLB pressure

**Steps:**

1. **Analyze SuperSlab metadata layout:**
   - Current: Is metadata per-slab or centralized?
   - Check: the `sp_meta_find_or_create()` hot path

2. **Improve cache locality:**
   - Cache-align metadata structures
   - Use larger pages (2MB or 1GB hugepages)
   - Reduce working set size

3. **Profile with hugepages:**

```bash
# Reserve 2 MiB hugepages (requires root)
echo 10 > /proc/sys/vm/nr_hugepages
HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
```
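On the allocator side, a sketch of how the Superslab backing could request 2 MiB pages; the helper name is hypothetical, `bytes` must be a multiple of 2 MiB for MAP_HUGETLB, and wiring it behind `HAKMEM_USE_HUGEPAGES` is an assumption rather than existing behavior.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: back a superslab with 2 MiB pages so that e.g. a
 * 64 MiB region needs 32 TLB entries instead of 16384. */
static void *superslab_reserve_huge(size_t bytes)
{
    /* Explicit hugetlb pages; requires the nr_hugepages reservation above
     * and bytes being a multiple of the huge page size. */
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;

    /* Fallback: plain mapping plus a transparent-huge-page hint; the kernel
     * can only honor it for 2 MiB-aligned, 2 MiB-sized chunks. */
    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_HUGEPAGE
    (void)madvise(p, bytes, MADV_HUGEPAGE);
#endif
    return p;
}
```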
**Expected gain:** 1.5-2x speedup (eliminates the TLB-miss penalty)

---

### 🚀 Option C: Reduce Page Zeroing (12% → ~2%)

**Goal:** Skip unnecessary page zeroing

**Steps:**

1. **Analyze what needs zeroing:**
   - Are SuperSlab pages truly uninitialized on first use?
   - Can memory be reused without a round trip through the kernel, which re-zeroes it? (see the sketch after this list)
   - Does returning pages with `MADV_DONTNEED` force a zero-fill fault on the next touch (`MADV_FREE` would avoid this)?

2. **Implement lazy zeroing:**
   - Don't zero pages on allocation
   - Only zero the portions actually used
   - Let the kernel handle the rest on free

3. **Use uninitialized pools:**
   - Pre-allocate without zeroing
   - Initialize on demand
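A minimal sketch of the reuse idea under stated assumptions: a hypothetical per-process slab free list (none of these names are real HAKMEM structures) that parks freed slabs with `MADV_FREE` instead of unmapping them. Reuse then touches warm, already-mapped pages and skips both the fault and `clear_page_erms`; only if the kernel reclaimed the pages under memory pressure does the next touch pay a zero-fill fault again.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical slab free list (not the actual HAKMEM structures). */
struct slab { struct slab *next; size_t bytes; };
static struct slab *g_free_slabs;

/* Park a freed slab instead of handing it back to the kernel. */
static void slab_release(void *base, size_t bytes)
{
    long pg = sysconf(_SC_PAGESIZE);
    struct slab *s = base;              /* header lives in the first page */
    s->bytes = bytes;
    s->next  = g_free_slabs;
    g_free_slabs = s;
#ifdef MADV_FREE
    /* Let the kernel reclaim the payload pages if memory gets tight, but
     * keep the header page resident so the free-list link survives. */
    if (bytes > (size_t)pg)
        (void)madvise((char *)base + pg, bytes - pg, MADV_FREE);
#endif
}

/* Reuse a parked slab when possible; note the contents are stale, NOT zero. */
static void *slab_acquire(size_t bytes)
{
    if (g_free_slabs && g_free_slabs->bytes >= bytes) {
        struct slab *s = g_free_slabs;
        g_free_slabs = s->next;
        return s;
    }
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```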
**Expected gain:** 1.5x speedup (eliminates the ~12% zeroing cost)

---

## 📊 Recommendation

Based on the analysis:

### Most Impactful (Order of Preference):

1. **Fix TLB Misses (48.65%)**
   - Potential gain: 1.5-2x
   - Implementation: medium difficulty
   - Reason: the profile already shows a ~48% miss rate

2. **Verify Prefault Actually Works**
   - Potential gain: unknown (it currently appears not to work)
   - Implementation: easy (debugging)
   - Reason: the fix should already be in place, yet page-fault counts are unchanged

3. **Reduce Page Zeroing**
   - Potential gain: 1.5x
   - Implementation: medium difficulty
   - Reason: ~12% of total cycles

---
## 🧪 Recommended Next Action

### Immediate (This Session)

Run diagnostics to confirm the prefault status:

```bash
# Check whether MAP_POPULATE / MADV_POPULATE_READ appear on the actual calls
strace -e trace=mmap,madvise ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | head -20

# Check build flags
grep -i prefault Makefile

# Check environment variables
env | grep -i HAKMEM | grep -i PREFAULT
```

### If Prefault is Disabled → Enable It

Then re-run the profiling to verify the improvement.

### If Prefault is Enabled → Move to Option B (TLB)

Focus on reducing the ~48% TLB miss rate.

---
## 📈 Expected Outcome After All Fixes

| Factor | Current | After | Gain |
|--------|---------|-------|------|
| Page faults | 7,672 | 500-1000 | 8-15x |
| TLB misses | 48.65% | ~5% | 3-5x |
| Page zeroing | 12% | 2% | 2x |
| **Total per-op time** | 72.6 cycles | 20-25 cycles | **3-4x** |
| **Throughput** | 1.06M ops/s | 3.5-4M ops/s | **3-4x** |