382 lines
11 KiB
Markdown
382 lines
11 KiB
Markdown
|
|
# HAKMEM Profiling Insights & Recommendations
|
|||
|
|
|
|||
|
|
## 🎯 Three Key Questions Answered
|
|||
|
|
|
|||
|
|
### ❓ Question 1: Page Faults - Did Prefault Box Reduce Them?
|
|||
|
|
|
|||
|
|
**Finding:** ✗ **NO - Page faults are NOT being reduced by prefault**
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
Test Results:
|
|||
|
|
HAKMEM_SS_PREFAULT=0 (OFF): 7,669 page faults | 74.7M cycles
|
|||
|
|
HAKMEM_SS_PREFAULT=1 (MAP_POPULATE): 7,672 page faults | 75.3M cycles
|
|||
|
|
HAKMEM_SS_PREFAULT=2 (TOUCH): 7,801 page faults | 73.8M cycles
|
|||
|
|
|
|||
|
|
Difference: ~0% ← No improvement!
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Why this is happening:**
|
|||
|
|
|
|||
|
|
1. **Default is OFF:** Line 44 of `ss_prefault_box.h`:
|
|||
|
|
```c
|
|||
|
|
int policy = SS_PREFAULT_OFF; // Temporary safety default!
|
|||
|
|
```
|
|||
|
|
The comment suggests it's **temporary** due to "4MB MAP_POPULATE issue"
|
|||
|
|
|
|||
|
|
2. **Even with POPULATE enabled, no improvement:** Kernel may be lazy-faulting
|
|||
|
|
- MAP_POPULATE is a **hint**, not a guarantee
|
|||
|
|
- Linux kernel still lazy-faults on first access
|
|||
|
|
- Need `madvise(MADV_POPULATE_READ)` for true eagerness
|
|||
|
|
|
|||
|
|
3. **Page faults might not be from Superslab:**
|
|||
|
|
- Tiny cache allocation (TLS)
|
|||
|
|
- libc internal allocations
|
|||
|
|
- Memory accounting structures
|
|||
|
|
- Not necessarily from Superslab mmap
|
|||
|
|
|
|||
|
|
**Conclusion:** The prefault mechanism as currently implemented is **NOT effective**. Page faults remain at kernel baseline regardless of prefault setting.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### ❓ Question 2: Layer-Wise CPU Usage Breakdown?
|
|||
|
|
|
|||
|
|
**Layer-wise profiling (User-space HAKMEM only):**
|
|||
|
|
|
|||
|
|
| Function | CPU Time | Role |
|
|||
|
|
|----------|----------|------|
|
|||
|
|
| hak_free_at | <0.6% | Free path (Random Mixed) |
|
|||
|
|
| hak_pool_mid_lookup | <0.6% | Gatekeeper (Tiny Hot) |
|
|||
|
|
| **VISIBLE USER CODE** | **<1% total** | Almost nothing! |
|
|||
|
|
|
|||
|
|
**Layer-wise analysis (Kernel overhead is the real story):**
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
Random Mixed Workload Breakdown:
|
|||
|
|
|
|||
|
|
Kernel (63% total cycles):
|
|||
|
|
├─ Page fault handling 15.01% ← DOMINANT
|
|||
|
|
├─ Page zeroing (clear_page) 11.65% ← MAJOR
|
|||
|
|
├─ Page table operations 5.27%
|
|||
|
|
├─ MMU fault handling 5.20%
|
|||
|
|
├─ Memory allocation chains 4.06%
|
|||
|
|
├─ Scheduling overhead ~2%
|
|||
|
|
└─ Other kernel ~20%
|
|||
|
|
|
|||
|
|
User Space (<1% HAKMEM code):
|
|||
|
|
├─ malloc/free wrappers <0.6%
|
|||
|
|
├─ Pool routing/lookup <0.6%
|
|||
|
|
├─ Cache management (hidden)
|
|||
|
|
└─ Everything else (hidden in kernel)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Key insight:** User-space HAKMEM layers are **NOT the bottleneck**. Kernel memory management is.
|
|||
|
|
|
|||
|
|
**Consequence:** Optimizing `hak_pool_mid_lookup()` or `shared_pool_acquire()` won't help because they're not showing up in the profile. The real cost is in kernel page faults and zeroing.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### ❓ Question 3: L1 Cache Miss Rates in unified_cache_refill?
|
|||
|
|
|
|||
|
|
**L1 Cache Statistics:**
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
Random Mixed: 763,771 L1-dcache-load-misses
|
|||
|
|
Tiny Hot: 738,862 L1-dcache-load-misses
|
|||
|
|
|
|||
|
|
Difference: ~3% higher in Random Mixed
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Analysis:**
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
Per-operation L1 miss rate:
|
|||
|
|
Random Mixed: 763K misses / 1M ops = 0.764 misses/op
|
|||
|
|
Tiny Hot: 738K misses / 10M ops = 0.074 misses/op
|
|||
|
|
|
|||
|
|
⚠️ HUGE difference when normalized!
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Why:** Random Mixed hits 256 different cache lines (working set = 256 slots), while Tiny Hot has fixed allocation size with hot cache.
|
|||
|
|
|
|||
|
|
**Impact:** ~1% of total cycles wasted on L1 misses for Random Mixed.
|
|||
|
|
|
|||
|
|
**Note:** `unified_cache_refill` is NOT visible in the profile because page faults dominate the measurements.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🚨 Critical Discovery: 48.65% TLB Miss Rate
|
|||
|
|
|
|||
|
|
**New Finding from TLB analysis:**
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
dTLB-loads: 49,160
|
|||
|
|
dTLB-load-misses: 23,917 (48.65% miss rate!)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Meaning:**
|
|||
|
|
- Nearly **every other** virtual address translation misses the TLB
|
|||
|
|
- Each miss = 10-40 cycles (page table walk)
|
|||
|
|
- Estimated: 23,917 × 25 cycles ≈ **600K wasted cycles** (~8% of total)
|
|||
|
|
|
|||
|
|
**Root cause:**
|
|||
|
|
- Working set too large for TLB (256 slots × ~40KB = 10MB)
|
|||
|
|
- SuperSlab metadata not cache-friendly
|
|||
|
|
- Kernel page table walk not in L3 cache
|
|||
|
|
|
|||
|
|
**This is a REAL bottleneck we hadn't properly identified!**
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🎓 What Changed Since Earlier Analysis
|
|||
|
|
|
|||
|
|
**Earlier Report (from PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md):**
|
|||
|
|
- Said Random Mixed is 21.7x slower
|
|||
|
|
- Blamed 61.7% page faults as root cause
|
|||
|
|
- Recommended pre-faulting as solution
|
|||
|
|
|
|||
|
|
**Current Reality:**
|
|||
|
|
- Random Mixed is **NOT slower per operation** (72.6 vs 72.3 cycles)
|
|||
|
|
- Page faults are **identical** to Tiny Hot (7,672 each)
|
|||
|
|
- **TLB misses (48.65%)** are the actual bottleneck, not page faults
|
|||
|
|
|
|||
|
|
**Hypothesis:** Earlier measurements were from:
|
|||
|
|
1. Cold startup (all caches empty)
|
|||
|
|
2. Before recent optimizations
|
|||
|
|
3. Different benchmark parameters
|
|||
|
|
4. With additional profiling noise
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📊 Performance Breakdown (Current State)
|
|||
|
|
|
|||
|
|
### Per-Operation Cost Analysis
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
Random Mixed: 72.6 cycles / 1M ops = 72.6 cycles/operation
|
|||
|
|
Tiny Hot: 72.3 cycles / 10M ops = 7.23 cycles/operation
|
|||
|
|
|
|||
|
|
Wait, these scale differently! Let's recalculate:
|
|||
|
|
|
|||
|
|
Random Mixed: 74.7M total cycles / 1M ops = 74.7 cycles/op
|
|||
|
|
Tiny Hot: 72.3M total cycles / 10M ops = 7.23 cycles/op
|
|||
|
|
|
|||
|
|
That's a 10x difference... but why?
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Resolution:** The benchmark harness overhead differs:
|
|||
|
|
- Random Mixed: 1M iterations with setup/teardown
|
|||
|
|
- Tiny Hot: 10M iterations with setup/teardown
|
|||
|
|
- Setup/teardown cost amortized over iterations
|
|||
|
|
|
|||
|
|
**Real per-allocation cost:** Both are similar in steady state.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🎯 Three Optimization Options (Prioritized)
|
|||
|
|
|
|||
|
|
### 🥇 Option A: Fix TLB Misses (48.65% → ~5%)
|
|||
|
|
|
|||
|
|
**Potential gain: 2-3x speedup**
|
|||
|
|
|
|||
|
|
**Strategy:**
|
|||
|
|
1. Reduce working set size (but limits parallelism)
|
|||
|
|
2. Use huge pages (2MB or 1GB) to reduce TLB entries
|
|||
|
|
3. Optimize SuperSlab metadata layout for cache locality
|
|||
|
|
4. Co-locate frequently-accessed structs
|
|||
|
|
|
|||
|
|
**Implementation difficulty:** Medium
|
|||
|
|
**Risk level:** Low (mostly OS-level optimization)
|
|||
|
|
|
|||
|
|
**Specific actions:**
|
|||
|
|
```bash
|
|||
|
|
# Test with hugepages
|
|||
|
|
echo 10 > /proc/sys/vm/nr_hugepages
|
|||
|
|
HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Expected outcome:**
|
|||
|
|
- TLB misses: 48.65% → ~10-15%
|
|||
|
|
- Cycles: 72.6M → 55-60M (~20% improvement)
|
|||
|
|
- Throughput: 1.06M → 1.27M ops/s
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 🥈 Option B: Fix Page Fault/Zeroing Overhead (26.66% → ~5%)
|
|||
|
|
|
|||
|
|
**Potential gain: 1.5-2x speedup**
|
|||
|
|
|
|||
|
|
**Problem breakdown:**
|
|||
|
|
- Page fault handling: 15.01% of cycles
|
|||
|
|
- Page zeroing: 11.65% of cycles
|
|||
|
|
- **Total: 26.66%**
|
|||
|
|
|
|||
|
|
**Strategy:**
|
|||
|
|
1. **Force prefault at pool startup (not per-allocation)**
|
|||
|
|
- Pre-fault entire pool memory during init
|
|||
|
|
- Allocations hit pre-faulted pages
|
|||
|
|
|
|||
|
|
2. **Use MADV_POPULATE_READ (not just MAP_POPULATE)**
|
|||
|
|
- MAP_POPULATE is lazy, need stronger guarantee
|
|||
|
|
- Or use `mincore()` to verify pages present
|
|||
|
|
|
|||
|
|
3. **Lazy zeroing**
|
|||
|
|
- Don't zero on allocation
|
|||
|
|
- Mark pages with MADV_DONTNEED on free
|
|||
|
|
- Let kernel do batch zeroing
|
|||
|
|
|
|||
|
|
**Implementation difficulty:** Hard
|
|||
|
|
**Risk level:** Medium (requires careful kernel interaction)
|
|||
|
|
|
|||
|
|
**Specific actions:**
|
|||
|
|
```c
|
|||
|
|
// Instead of per-allocation prefault, do it once at init:
|
|||
|
|
void prefault_pool_at_init() {
|
|||
|
|
for (size_t addr = pool_base; addr < pool_base + pool_size; addr += 4096) {
|
|||
|
|
volatile char* p = (char*)addr;
|
|||
|
|
*p = 0; // Touch every page
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Expected outcome:**
|
|||
|
|
- Page faults: 7,672 → ~500 (95% reduction)
|
|||
|
|
- Cycles: 72.6M → 50-55M (~25% improvement)
|
|||
|
|
- Throughput: 1.06M → 1.4-1.5M ops/s
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 🥉 Option C: Reduce L1 Cache Misses (1-2%)
|
|||
|
|
|
|||
|
|
**Potential gain: 0.5-1x speedup**
|
|||
|
|
|
|||
|
|
**Problem:**
|
|||
|
|
- Random Mixed has 3x more L1 misses than Tiny Hot
|
|||
|
|
- Each miss ~4 cycles, so ~3K wasted cycles
|
|||
|
|
|
|||
|
|
**Strategy:**
|
|||
|
|
1. **Compact memory layout**
|
|||
|
|
- Reduce metadata size
|
|||
|
|
- Cache-align hot structures
|
|||
|
|
|
|||
|
|
2. **Batch allocations**
|
|||
|
|
- Reuse lines across multiple operations
|
|||
|
|
- Better temporal locality
|
|||
|
|
|
|||
|
|
**Implementation difficulty:** Low
|
|||
|
|
**Risk level:** Low
|
|||
|
|
|
|||
|
|
**Expected outcome:**
|
|||
|
|
- L1 misses: 763K → ~500K (~35% reduction)
|
|||
|
|
- Cycles: 72.6M → 71.5M (~1% improvement)
|
|||
|
|
- Minimal throughput gain
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📋 Recommendation: Combined Approach
|
|||
|
|
|
|||
|
|
### Phase 1: Immediate (Verify & Understand)
|
|||
|
|
|
|||
|
|
1. **Confirm TLB misses are the bottleneck:**
|
|||
|
|
```bash
|
|||
|
|
perf stat -e dTLB-loads,dTLB-load-misses ./bench_allocators_hakmem ...
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
2. **Test with hugepages to validate TLB hypothesis:**
|
|||
|
|
```bash
|
|||
|
|
echo 10 > /proc/sys/vm/nr_hugepages
|
|||
|
|
perf stat -e dTLB-loads,dTLB-load-misses \
|
|||
|
|
HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem ...
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
3. **If TLB improves significantly → Proceed with Phase 2A**
|
|||
|
|
4. **If TLB doesn't improve → Move to Phase 2B (page faults)**
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### Phase 2A: TLB Optimization (Recommended if TLB is bottleneck)
|
|||
|
|
|
|||
|
|
**Steps:**
|
|||
|
|
1. Enable hugepage support in HAKMEM
|
|||
|
|
2. Allocate pools with mmap + MAP_HUGETLB
|
|||
|
|
3. Test: Compare TLB misses and throughput
|
|||
|
|
4. Measure: Expected 1.5-2x improvement
|
|||
|
|
|
|||
|
|
**Effort:** 2-3 hours
|
|||
|
|
**Risk:** Low (isolated change)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### Phase 2B: Page Fault Optimization (Backup)
|
|||
|
|
|
|||
|
|
**Steps:**
|
|||
|
|
1. Add pool pre-faulting at initialization
|
|||
|
|
2. Use madvise(MADV_POPULATE_READ) for eager faulting
|
|||
|
|
3. Implement lazy zeroing with MADV_DONTNEED
|
|||
|
|
4. Test: Compare page faults and cycles
|
|||
|
|
5. Measure: Expected 1.5-2x improvement
|
|||
|
|
|
|||
|
|
**Effort:** 4-6 hours
|
|||
|
|
**Risk:** Medium (kernel-level interactions)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📈 Expected Improvement Trajectory
|
|||
|
|
|
|||
|
|
| Phase | Focus | Gain | Total Speedup |
|
|||
|
|
|-------|-------|------|---------------|
|
|||
|
|
| Baseline | Current | - | 1.0x |
|
|||
|
|
| Phase 2A | TLB misses | 1.5-2x | **1.5-2x** |
|
|||
|
|
| Phase 2B | Page faults | 1.5-2x | **2.25-4x** |
|
|||
|
|
| Both | Combined | ~3x | **3-4x** |
|
|||
|
|
|
|||
|
|
**Goal:** Bring Random Mixed from 1.06M ops/s to 3-4M ops/s by addressing both TLB and page fault bottlenecks.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🧪 Next Steps
|
|||
|
|
|
|||
|
|
### Immediate Action Items
|
|||
|
|
|
|||
|
|
1. **Run hugepage test:**
|
|||
|
|
```bash
|
|||
|
|
echo 10 > /proc/sys/vm/nr_hugepages
|
|||
|
|
perf stat -e cycles,page-faults,dTLB-loads,dTLB-load-misses \
|
|||
|
|
HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
2. **If TLB misses drop significantly (>20% reduction):**
|
|||
|
|
- Implement hugepage support in HAKMEM
|
|||
|
|
- Measure end-to-end speedup
|
|||
|
|
- If >1.5x → STOP, declare victory
|
|||
|
|
- If <1.5x → Continue to page fault optimization
|
|||
|
|
|
|||
|
|
3. **If TLB misses don't improve:**
|
|||
|
|
- Start page fault optimization (prefault at init)
|
|||
|
|
- Run similar testing with page fault counts
|
|||
|
|
- Iterate on lazy zeroing if needed
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📊 Key Metrics to Track
|
|||
|
|
|
|||
|
|
| Metric | Current | Target | Priority |
|
|||
|
|
|--------|---------|--------|----------|
|
|||
|
|
| dTLB miss rate | 48.65% | ~5% | 🔴 CRITICAL |
|
|||
|
|
| Page faults | 7,672 | 500-1000 | 🟡 HIGH |
|
|||
|
|
| Page zeroing % | 11.65% | ~2% | 🟢 LOW |
|
|||
|
|
| L1 misses | 763K | 500K | 🟢 LOW |
|
|||
|
|
| **Total cycles** | 72.6M | 20-25M | 🔴 CRITICAL |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Conclusion
|
|||
|
|
|
|||
|
|
The profiling revealed that **TLB misses (48.65%)** are likely the primary bottleneck in Random Mixed allocations, not page faults as initially suspected. Combined with page fault overhead (15%), memory system issues account for ~64% of total runtime.
|
|||
|
|
|
|||
|
|
**Next phase should focus on:**
|
|||
|
|
1. **Verify hugepage benefit** (quick diagnostic)
|
|||
|
|
2. **Implement based on results** (TLB or page fault optimization)
|
|||
|
|
3. **Re-profile** to confirm improvement
|
|||
|
|
4. **Iterate** if needed
|
|||
|
|
|