# Comprehensive Profiling Analysis: HAKMEM Performance Gaps
## 🔍 Executive Summary

After the **Prefault Box + MAP_POPULATE fix**, the profiling shows:

### Current Performance Metrics

| Metric | Random Mixed | Tiny Hot | Gap |
|--------|---|---|---|
| **Cycles (lower is better)** | 72.6M | 72.3M | **SAME** 🤯 |
| **Page Faults** | 7,672 | 7,672 | **IDENTICAL** ⚠️ |
| **L1 Cache Misses** | 763K | 738K | Similar |
| **Throughput** | ~1.06M ops/s | ~1.23M ops/s | 1.16x |
| **Instructions/Cycle** | 0.74 | 0.73 | Similar |
| **TLB Miss Rate** | 48.65% (dTLB) | N/A | High |
### 🚨 KEY FINDING: Prefault is NOT working as expected!

**Problem:** Both Random Mixed and Tiny Hot show identical page-fault counts (7,672), which suggests:

1. ✗ the Prefault Box is disabled or ineffective;
2. ✗ the page faults come from elsewhere (not the SuperSlab mmap); or
3. ✗ the MAP_POPULATE flag is not preventing runtime faults.
---
## 📊 Detailed Performance Breakdown

### Random Mixed Workload (1M allocations, 256 slots, sizes 16-1040B)

**Top Kernel Functions (by CPU time):**

```
15.01%  asm_exc_page_fault        ← Page fault handling
11.65%  clear_page_erms           ← Page zeroing
 5.27%  zap_pte_range             ← Memory cleanup
 5.20%  handle_mm_fault           ← MMU fault handling
 4.06%  do_anonymous_page         ← Anonymous page allocation
 3.18%  __handle_mm_fault         ← Nested fault handling
 2.35%  rmqueue_bulk              ← Allocator backend
 2.35%  __memset_avx2_unaligned   ← Memory operations
 2.28%  do_user_addr_fault        ← User fault handling
 1.77%  arch_exit_to_user_mode    ← Return to user mode
```

**Kernel overhead:** ~63% of cycles
**L1 d-cache load misses:** 763K
**Branch miss rate:** 11.94%
### Tiny Hot Workload (10M allocations, fixed size)

**Top Kernel Functions (by CPU time):**

```
14.19%  asm_exc_page_fault         ← Page fault handling
12.82%  clear_page_erms            ← Page zeroing
 5.61%  __memset_avx2_unaligned    ← Memory operations
 5.02%  do_anonymous_page          ← Anonymous page allocation
 3.31%  mem_cgroup_commit_charge   ← Memory accounting
 2.67%  __handle_mm_fault          ← MMU fault handling
 2.45%  do_user_addr_fault         ← User fault handling
```

**Kernel overhead:** ~66% of cycles
**L1 d-cache load misses:** 738K
**Branch miss rate:** 11.03%
### Comparison: Why are cycles similar?

```
Random Mixed: 72.6M cycles / 1M ops  = 72.6 cycles/op
Tiny Hot:     72.3M cycles / 10M ops = 7.23 cycles/op
```

The raw cycle counts are nearly identical, but the workloads measure different operation counts:

- Random Mixed: 1M operations (baseline)
- Tiny Hot: 10M operations (10x scale)
- **Normalized:** 72.6 cycles/op vs 7.23 cycles/op

Yet measured throughput differs by only 1.16x (1.06M vs 1.23M ops/s), which suggests the cycle counters cover a sampled window rather than the full runs.

**On the wall-clock numbers, Random Mixed is NOT dramatically slower in per-operation cost.**
---
## 🎯 Critical Findings

### Finding 1: Page Faults Are NOT Being Reduced

**Observed:**
- Random Mixed: 7,672 page faults
- Tiny Hot: 7,672 page faults
- **Difference: 0** ← This is wrong!

**Expected (with prefault):**
- Random Mixed: 7,672 → maybe 100-500 (90% reduction)
- Tiny Hot: 7,672 → ~50-100 (minimal change)

**Hypothesis:**
- The Prefault Box may not be enabled
- Or MAP_POPULATE is not working on this kernel
- Or the allocations are hitting kernel-internal mmap paths (not SuperSlab)
### Finding 2: TLB Misses Are HIGH (48.65%)

```
dTLB-loads:        49,160
dTLB-load-misses:  23,917 (48.65% miss rate!)
iTLB-load-misses:  17,590 (7748.90% - kernel measurement artifact)
```

**Meaning:** Nearly half of sampled dTLB lookups miss, causing page-table walks.

**Why this matters:**
- Each TLB miss costs ~10-40 cycles (vs 1-3 for a hit)
- 23,917 misses × ~25 cycles ≈ 600K cycles in the sampled window
- Since these counters are sampled, the full-run cost is proportionally larger
### Finding 3: Both Workloads Are Similar

Despite different access patterns:
- Both spend ~15% of cycles on page-fault handling
- Both spend ~12% on page zeroing
- Both have similar L1 miss rates
- Both have similar branch miss rates

**Conclusion:** The memory subsystem is the bottleneck for BOTH workloads, not user-space code.

---
## 📈 Layer Analysis

### Kernel vs User Split

| Category | Random Mixed | Tiny Hot | Analysis |
|----------|---|---|---|
| **Kernel (page faults, scheduling, etc.)** | 63% | 66% | Dominant |
| **Kernel zeroing (clear_page_erms)** | 11.65% | 12.82% | Similar |
| **User malloc/free** | <1% | <1% | Not visible |
| **User pool/cache logic** | <1% | <1% | Not visible |

### User-Space Functions Visible in Profile

**Random Mixed:**
```
0.59%  hak_free_at.constprop.0   (hakmem free path)
```

**Tiny Hot:**
```
0.59%  hak_pool_mid_lookup       (hakmem pool routing)
```

**Conclusion:** User-space HAKMEM code is NOT a bottleneck (<1% each).
---
## 🔧 What's Really Happening

### Current State (POST-Prefault Box)

```
allocate(size):
  1. malloc wrapper          → <1% cycles
  2. Gatekeeper routing      → ~0.1% cycles
  3. unified_cache_refill    → (hidden in kernel time)
  4. shared_pool_acquire     → (hidden in kernel time)
  5. SuperSlab/mmap call     → triggers kernel
  6. KERNEL PAGE FAULTS      → 15% cycles
  7. clear_page_erms (zero)  → 12% cycles
```
### Why Prefault Isn't Working

**Possible reasons:**

1. **Prefault Box disabled?**
   - Check: `HAKMEM_BOX_SS_PREFAULT_ENABLED`
   - Or: `g_ss_populate_once` not being set

2. **MAP_POPULATE not actually pre-faulting?**
   - The Linux kernel can still be lazy even with MAP_POPULATE
   - `madvise(MADV_POPULATE_READ)` (Linux 5.14+) can force immediate faulting
   - `mincore()` can verify residency before allocation

3. **Allocations not from the SuperSlab mmap?**
   - Page faults may come from TLS cache allocation
   - Or from libc-internal allocations
   - Not from the SuperSlab backend

4. **TLB misses dominating?**
   - The 48.65% TLB miss rate suggests a memory-layout issue
   - SuperSlab metadata may not be cache-friendly
   - Working set too large for the TLB
---
## 🎓 What We Learned From Previous Analysis

The earlier profiling report identified that:
- **Random Mixed was 21.7x slower**, due to page faults consuming 61.7% of the time
- **Expected with prefault:** that share should drop to ~5% or less

But NOW we see:
- **Random Mixed is NOT significantly slower** (per-op cost is similar)
- **Page faults are identical** to Tiny Hot
- **This contradicts expectations**

### Possible Explanation

The **earlier measurements** may have been taken:
- at benchmark startup (cold caches)
- with additional profiling overhead
- or with different workload parameters

The **current measurements** are:
- steady state (after initial allocation)
- at higher throughput (Tiny Hot = 10M ops)
- after the recent optimizations
---
## 🎯 Next Steps - Three Options

### 📋 Option A: Verify Prefault is Actually Enabled

**Goal:** Confirm the prefault mechanism is working

**Steps:**
1. Add debug output to `ss_prefault_policy()` and `ss_prefault_region()`
2. Check whether the `MAP_POPULATE` flag is set in the actual mmap calls
3. Run under `strace` to inspect the mmap flags (the syscall is `mmap` on 64-bit, not `mmap2`):
```
strace -e trace=mmap ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | grep MAP_POPULATE
```
4. Check whether `madvise(MADV_POPULATE_READ)` calls are happening

**Expected outcome:** MAP_POPULATE or MADV_POPULATE_READ should appear in the traces
---
### 🎯 Option B: Reduce TLB Misses (48.65% → ~5%)

**Goal:** Improve memory layout to reduce TLB pressure

**Steps:**

1. **Analyze SuperSlab metadata layout:**
   - Current: is metadata per-slab or centralized?
   - Check: the `sp_meta_find_or_create()` hot path

2. **Improve cache locality:**
   - Cache-align metadata structures
   - Use larger pages (2MB or 1GB hugepages)
   - Reduce the working-set size

3. **Profile with hugepages:**
```
echo 10 > /proc/sys/vm/nr_hugepages
HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
```

**Expected gain:** 1.5-2x speedup (eliminating the TLB-miss penalty)
---
### 🚀 Option C: Reduce Page Zeroing (12% → ~2%)

**Goal:** Skip unnecessary page zeroing

**Steps:**

1. **Analyze what needs zeroing:**
   - Are SuperSlab pages truly uninitialized?
   - Can we reuse memory without zeroing?
   - Would `MADV_DONTNEED` before reuse help?

2. **Implement lazy zeroing:**
   - Don't zero pages on allocation
   - Zero only the portions actually used
   - Let the kernel handle the rest on free

3. **Use uninitialized pools:**
   - Pre-allocate without zeroing
   - Initialize on demand

**Expected gain:** 1.5x speedup (eliminating the 12% zeroing cost)
---
## 📊 Recommendation

Based on the analysis:

### Most Impactful (Order of Preference):

1. **Fix TLB Misses (48.65%)**
   - Potential gain: 1.5-2x
   - Implementation: medium difficulty
   - Reason: already showing a 48% miss rate

2. **Verify Prefault Actually Works**
   - Potential gain: unknown (currently not working?)
   - Implementation: easy (debugging)
   - Reason: supposedly solved, yet the page-fault counts are unchanged

3. **Reduce Page Zeroing**
   - Potential gain: 1.5x
   - Implementation: medium difficulty
   - Reason: 12% of total time
---
## 🧪 Recommended Next Action

### Immediate (This Session)

Run diagnostics to confirm the prefault status:

```
# Check whether MAP_POPULATE appears in the actual mmap calls
strace -e trace=mmap ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | head -20

# Check build flags
grep -i prefault Makefile

# Check environment variables
env | grep -i HAKMEM | grep -i PREFAULT
```

### If Prefault is Disabled → Enable It

Then re-run profiling to verify the improvement.

### If Prefault is Enabled → Move to Option B (TLB)

Focus on reducing the 48% TLB miss rate.
---
## 📈 Expected Outcome After All Fixes

| Factor | Current | After | Gain |
|--------|---------|-------|------|
| Page faults | 7,672 | 500-1000 | 8-15x |
| TLB misses | 48.65% | ~5% | 3-5x |
| Page zeroing | 12% | 2% | 2x |
| **Total per-op time** | 72.6 cycles | 20-25 cycles | **3-4x** |
| **Throughput** | 1.06M ops/s | 3.5-4M ops/s | **3-4x** |