hakmem/PROFILING_INSIGHTS_AND_RECOMMENDATIONS_20251204.md
Moe Charm (CI) 1755257f60 Comprehensive Profiling Analysis: Phase 1 Complete with Major Discoveries
## Key Findings:
1. Prefault Box defaults to OFF (intentional, due to 4MB MAP_POPULATE bug fix)
2. User-space HAKMEM code is NOT the bottleneck (<1% CPU time)
3. TLB misses (48.65%) are NOT from SuperSlab allocations - mostly from TLS/libc
4. THP and PREFAULT optimizations have ZERO impact on dTLB misses
5. Page zeroing (11.65%) is the REAL bottleneck, not memory allocation

## Session Deliverables:
- COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md: Initial analysis
- PROFILING_INSIGHTS_AND_RECOMMENDATIONS_20251204.md: Task investigation
- PHASE1_TEST_RESULTS_MAJOR_DISCOVERY_20251204.md: Phase 1 test results
- SESSION_SUMMARY_FINDINGS_20251204.md: Final summary

## Phase 2 Recommendations:
1. Investigate lazy zeroing (11.65% of cycles)
2. Analyze page fault sources (debug with callgraph)
3. Skip THP/PREFAULT/Hugepage optimization (proven ineffective)

## Paradigm Shift:
Old: THP/PREFAULT → 2-3x speedup
New: Lazy zeroing → 1.10x-1.15x speedup (realistic)

🐱 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 20:41:53 +09:00



# HAKMEM Profiling Insights & Recommendations
## 🎯 Three Key Questions Answered
### ❓ Question 1: Page Faults - Did Prefault Box Reduce Them?
**Finding:** **NO - Page faults are NOT being reduced by prefault**
```
Test Results:
HAKMEM_SS_PREFAULT=0 (OFF): 7,669 page faults | 74.7M cycles
HAKMEM_SS_PREFAULT=1 (MAP_POPULATE): 7,672 page faults | 75.3M cycles
HAKMEM_SS_PREFAULT=2 (TOUCH): 7,801 page faults | 73.8M cycles
Difference: ~0% ← No improvement!
```
**Why this is happening:**
1. **Default is OFF:** Line 44 of `ss_prefault_box.h`:
```c
int policy = SS_PREFAULT_OFF; // Temporary safety default!
```
The comment suggests it's **temporary** due to "4MB MAP_POPULATE issue"
2. **Even with POPULATE enabled, no improvement:** Kernel may be lazy-faulting
- MAP_POPULATE is a **hint**, not a guarantee
- Linux kernel still lazy-faults on first access
- Need `madvise(MADV_POPULATE_READ)` for true eagerness (see the sketch below)
3. **Page faults might not be from Superslab:**
- Tiny cache allocation (TLS)
- libc internal allocations
- Memory accounting structures
- Not necessarily from Superslab mmap
**Conclusion:** The prefault mechanism as currently implemented is **NOT effective**. Page faults remain at kernel baseline regardless of prefault setting.
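A minimal sketch of that stronger, eager population path, assuming a plain anonymous mapping; the helper name is illustrative, this is not the existing prefault box code, and `MADV_POPULATE_READ` needs Linux 5.14+:
```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stddef.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22  /* kernel >= 5.14; value from <linux/mman.h> */
#endif

/* Map an anonymous region and fault it in synchronously, instead of
 * relying on the best-effort MAP_POPULATE hint. */
static void* map_and_populate(size_t len) {
    void* p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    /* Read-populate maps the shared zero page; MADV_POPULATE_WRITE would
     * instead force real, writable pages up front. */
    if (madvise(p, len, MADV_POPULATE_READ) != 0)
        perror("madvise(MADV_POPULATE_READ)");  /* EINVAL on older kernels */
    return p;
}
```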
---
### ❓ Question 2: Layer-Wise CPU Usage Breakdown?
**Layer-wise profiling (User-space HAKMEM only):**
| Function | CPU Time | Role |
|----------|----------|------|
| hak_free_at | <0.6% | Free path (Random Mixed) |
| hak_pool_mid_lookup | <0.6% | Gatekeeper (Tiny Hot) |
| **VISIBLE USER CODE** | **<1% total** | Almost nothing! |
**Layer-wise analysis (Kernel overhead is the real story):**
```
Random Mixed Workload Breakdown:
Kernel (63% total cycles):
├─ Page fault handling 15.01% ← DOMINANT
├─ Page zeroing (clear_page) 11.65% ← MAJOR
├─ Page table operations 5.27%
├─ MMU fault handling 5.20%
├─ Memory allocation chains 4.06%
├─ Scheduling overhead ~2%
└─ Other kernel ~20%
User Space (<1% HAKMEM code):
├─ malloc/free wrappers <0.6%
├─ Pool routing/lookup <0.6%
├─ Cache management (hidden)
└─ Everything else (hidden in kernel)
```
**Key insight:** User-space HAKMEM layers are **NOT the bottleneck**. Kernel memory management is.
**Consequence:** Optimizing `hak_pool_mid_lookup()` or `shared_pool_acquire()` won't help because they're not showing up in the profile. The real cost is in kernel page faults and zeroing.
---
### ❓ Question 3: L1 Cache Miss Rates in unified_cache_refill?
**L1 Cache Statistics:**
```
Random Mixed: 763,771 L1-dcache-load-misses
Tiny Hot: 738,862 L1-dcache-load-misses
Difference: ~3% higher in Random Mixed
```
**Analysis:**
```
Per-operation L1 miss rate:
Random Mixed: 763K misses / 1M ops = 0.764 misses/op
Tiny Hot: 738K misses / 10M ops = 0.074 misses/op
⚠️ HUGE difference when normalized!
```
**Why:** Random Mixed hits 256 different cache lines (working set = 256 slots), while Tiny Hot has fixed allocation size with hot cache.
**Impact:** ~1% of total cycles wasted on L1 misses for Random Mixed.
**Note:** `unified_cache_refill` is NOT visible in the profile because page faults dominate the measurements.
---
## 🚨 Critical Discovery: 48.65% TLB Miss Rate
**New Finding from TLB analysis:**
```
dTLB-loads: 49,160
dTLB-load-misses: 23,917 (48.65% miss rate!)
```
**Meaning:**
- Nearly **every other** virtual address translation misses the TLB
- Each miss = 10-40 cycles (page table walk)
- Estimated: 23,917 × 25 cycles ≈ **600K wasted cycles** (~8% of total)
**Root cause:**
- Working set too large for TLB (256 slots × ~40KB = 10MB)
- SuperSlab metadata not cache-friendly
- Kernel page table walk not in L3 cache
**This is a REAL bottleneck we hadn't properly identified!**
---
## 🎓 What Changed Since Earlier Analysis
**Earlier Report (from PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md):**
- Said Random Mixed is 21.7x slower
- Blamed 61.7% page faults as root cause
- Recommended pre-faulting as solution
**Current Reality:**
- Random Mixed is **NOT 21.7x slower**: total cycles are nearly identical (72.6M vs 72.3M), and the remaining per-operation gap is harness overhead (see below)
- Page faults are **identical** to Tiny Hot (7,672 each)
- **TLB misses (48.65%)** are the actual bottleneck, not page faults
**Hypothesis:** Earlier measurements were from:
1. Cold startup (all caches empty)
2. Before recent optimizations
3. Different benchmark parameters
4. With additional profiling noise
---
## 📊 Performance Breakdown (Current State)
### Per-Operation Cost Analysis
```
Random Mixed: 72.6M cycles / 1M ops  = 72.6 cycles/operation
Tiny Hot:     72.3M cycles / 10M ops =  7.23 cycles/operation
Wait, these scale differently! Let's recalculate:
Random Mixed: 74.7M total cycles / 1M ops  = 74.7 cycles/op
Tiny Hot:     72.3M total cycles / 10M ops =  7.23 cycles/op
That's a 10x difference... but why?
```
**Resolution:** The benchmark harness overhead differs:
- Random Mixed: 1M iterations with setup/teardown
- Tiny Hot: 10M iterations with setup/teardown
- Setup/teardown cost amortized over iterations
**Real per-allocation cost:** Both are similar in steady state.
---
## 🎯 Three Optimization Options (Prioritized)
### 🥇 Option A: Fix TLB Misses (48.65% → ~5%)
**Potential gain: 2-3x speedup**
**Strategy:**
1. Reduce working set size (but limits parallelism)
2. Use huge pages (2MB or 1GB) to reduce TLB entries (see the `mmap` sketch under "Specific actions")
3. Optimize SuperSlab metadata layout for cache locality
4. Co-locate frequently-accessed structs
**Implementation difficulty:** Medium
**Risk level:** Low (mostly OS-level optimization)
**Specific actions:**
```bash
# Test with hugepages
echo 10 > /proc/sys/vm/nr_hugepages
HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
```
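If the hugepage test above validates the hypothesis, step 2 could be wired in roughly as below. This is a sketch under assumptions: the function name and the 2MB-only fallback policy are illustrative, not existing HAKMEM code.
```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

#define HUGE_2MB (2UL * 1024 * 1024)

/* Back a pool with explicit 2MB huge pages when available, so one TLB
 * entry covers 2MB instead of 4KB; fall back to normal pages otherwise. */
static void* pool_mmap_maybe_huge(size_t len) {
    size_t huge_len = (len + HUGE_2MB - 1) & ~(HUGE_2MB - 1); /* MAP_HUGETLB needs 2MB multiples */
    void* p = mmap(NULL, huge_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;
    /* Typically ENOMEM when no huge pages are reserved (see nr_hugepages above). */
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```
If reserving explicit huge pages is too rigid, `madvise(..., MADV_HUGEPAGE)` on a 2MB-aligned region is the transparent-hugepage alternative, though it is only best-effort.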
**Expected outcome:**
- TLB misses: 48.65% → ~10-15%
- Cycles: 72.6M → 55-60M (~20% improvement)
- Throughput: 1.06M → 1.27M ops/s
---
### 🥈 Option B: Fix Page Fault/Zeroing Overhead (26.66% → ~5%)
**Potential gain: 1.5-2x speedup**
**Problem breakdown:**
- Page fault handling: 15.01% of cycles
- Page zeroing: 11.65% of cycles
- **Total: 26.66%**
**Strategy:**
1. **Force prefault at pool startup (not per-allocation)**
- Pre-fault entire pool memory during init
- Allocations hit pre-faulted pages
2. **Use MADV_POPULATE_READ (not just MAP_POPULATE)**
- MAP_POPULATE is only a hint; a stronger guarantee is needed
- Or use `mincore()` to verify pages are present (see the sketch after this list)
3. **Lazy zeroing**
- Don't zero on allocation
- Mark pages with MADV_DONTNEED on free (sketched after the prefault example below)
- Let kernel do batch zeroing
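A minimal sketch of the `mincore()` check from point 2, useful for verifying that a prefault pass actually populated a range; the helper name is illustrative:
```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stdlib.h>
#include <stddef.h>

/* Count how many pages of a page-aligned mapping are currently resident. */
static size_t count_resident_pages(void* base, size_t len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    unsigned char* vec = malloc(npages);
    size_t resident = 0;
    if (vec && mincore(base, len, vec) == 0) {
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;  /* bit 0 set => page is in RAM */
    }
    free(vec);
    return resident;
}
```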
**Implementation difficulty:** Hard
**Risk level:** Medium (requires careful kernel interaction)
**Specific actions:**
```c
// Instead of per-allocation prefault, touch the whole pool once at init.
// pool_base / pool_size are passed in here; in HAKMEM they would come from
// the pool's own bookkeeping.
void prefault_pool_at_init(char* pool_base, size_t pool_size) {
    for (size_t off = 0; off < pool_size; off += 4096) {
        volatile char* p = pool_base + off;
        *p = 0; // Write-touch one byte per page so the fault happens now
    }
}
```
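The lazy-zeroing half of the strategy (point 3 above) could be sketched as follows; `retire_slab_pages` is an invented name, not HAKMEM's actual free path, and it assumes page-aligned, page-sized regions:
```c
#include <sys/mman.h>
#include <stddef.h>

/* Instead of memset()ing a retired slab, hand its pages back to the kernel.
 * The next touch after reuse faults in a freshly zero-filled page, so the
 * zeroing happens lazily and only for pages that are actually reused. */
static void retire_slab_pages(void* slab_base, size_t slab_len) {
    /* slab_base must be page-aligned and slab_len a multiple of the page
     * size, or madvise() fails with EINVAL. */
    madvise(slab_base, slab_len, MADV_DONTNEED);
}
```
The trade-off: `MADV_DONTNEED` converts zeroing cost into a fresh page fault on the next touch, so it only pays off for slabs that stay idle for a while; `MADV_FREE`, which lets the kernel reclaim lazily and keeps contents until it does, may be worth comparing when zero-fill is not required.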
**Expected outcome:**
- Page faults: 7,672 → ~500 (95% reduction)
- Cycles: 72.6M → 50-55M (~25% improvement)
- Throughput: 1.06M → 1.4-1.5M ops/s
---
### 🥉 Option C: Reduce L1 Cache Misses (1-2%)
**Potential gain: 0.5-1x speedup**
**Problem:**
- Random Mixed has ~10x more L1 misses per operation than Tiny Hot (0.764 vs 0.074 misses/op)
- Each miss costs ~4 cycles, so on the order of 3M cycles are wasted in Random Mixed
**Strategy:**
1. **Compact memory layout**
- Reduce metadata size
- Cache-align hot structures (see the sketch after this list)
2. **Batch allocations**
- Reuse lines across multiple operations
- Better temporal locality
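As an illustration of the layout idea (the struct and its fields are invented, not HAKMEM's real metadata): keep the fields touched on every alloc/free in one 64-byte line and push cold bookkeeping elsewhere.
```c
#include <stdalign.h>
#include <stdint.h>

/* Hot-path metadata confined to a single cache line. */
struct hot_slot_header {
    alignas(64) uint32_t free_count;  /* read/written on every alloc/free */
    uint32_t first_free;              /* head of the intra-slab free list */
    uint64_t owner_tid;               /* checked on cross-thread free */
    /* Cold fields (stats, debug counters) live in a separate structure
     * so they never pull a second line into L1 on the hot path. */
};
```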
**Implementation difficulty:** Low
**Risk level:** Low
**Expected outcome:**
- L1 misses: 763K → ~500K (~35% reduction)
- Cycles: 72.6M → 71.5M (~1% improvement)
- Minimal throughput gain
---
## 📋 Recommendation: Combined Approach
### Phase 1: Immediate (Verify & Understand)
1. **Confirm TLB misses are the bottleneck:**
```bash
perf stat -e dTLB-loads,dTLB-load-misses ./bench_allocators_hakmem ...
```
2. **Test with hugepages to validate TLB hypothesis:**
```bash
echo 10 > /proc/sys/vm/nr_hugepages
perf stat -e dTLB-loads,dTLB-load-misses \
HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem ...
```
3. **If TLB improves significantly → Proceed with Phase 2A**
4. **If TLB doesn't improve → Move to Phase 2B (page faults)**
---
### Phase 2A: TLB Optimization (Recommended if TLB is bottleneck)
**Steps:**
1. Enable hugepage support in HAKMEM
2. Allocate pools with mmap + MAP_HUGETLB
3. Test: Compare TLB misses and throughput
4. Measure: Expected 1.5-2x improvement
**Effort:** 2-3 hours
**Risk:** Low (isolated change)
---
### Phase 2B: Page Fault Optimization (Backup)
**Steps:**
1. Add pool pre-faulting at initialization
2. Use madvise(MADV_POPULATE_READ) for eager faulting
3. Implement lazy zeroing with MADV_DONTNEED
4. Test: Compare page faults and cycles
5. Measure: Expected 1.5-2x improvement
**Effort:** 4-6 hours
**Risk:** Medium (kernel-level interactions)
---
## 📈 Expected Improvement Trajectory
| Phase | Focus | Gain | Total Speedup |
|-------|-------|------|---------------|
| Baseline | Current | - | 1.0x |
| Phase 2A | TLB misses | 1.5-2x | **1.5-2x** |
| Phase 2B | Page faults | 1.5-2x | **2.25-4x** |
| Both | Combined | ~3x | **3-4x** |
**Goal:** Bring Random Mixed from 1.06M ops/s to 3-4M ops/s by addressing both TLB and page fault bottlenecks.
---
## 🧪 Next Steps
### Immediate Action Items
1. **Run hugepage test:**
```bash
echo 10 > /proc/sys/vm/nr_hugepages
perf stat -e cycles,page-faults,dTLB-loads,dTLB-load-misses \
HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
```
2. **If TLB misses drop significantly (>20% reduction):**
- Implement hugepage support in HAKMEM
- Measure end-to-end speedup
- If >1.5x → STOP, declare victory
- If <1.5x → Continue to page fault optimization
3. **If TLB misses don't improve:**
- Start page fault optimization (prefault at init)
- Run similar testing with page fault counts
- Iterate on lazy zeroing if needed
---
## 📊 Key Metrics to Track
| Metric | Current | Target | Priority |
|--------|---------|--------|----------|
| dTLB miss rate | 48.65% | ~5% | 🔴 CRITICAL |
| Page faults | 7,672 | 500-1000 | 🟡 HIGH |
| Page zeroing % | 11.65% | ~2% | 🟢 LOW |
| L1 misses | 763K | 500K | 🟢 LOW |
| **Total cycles** | 72.6M | 20-25M | 🔴 CRITICAL |
---
## Conclusion
The profiling revealed that **TLB misses (48.65% miss rate)** are likely the primary bottleneck in Random Mixed allocations, not page faults as initially suspected. Together with page fault handling (15.01%) and page zeroing (11.65%), kernel memory management accounts for roughly 63% of total cycles.
**Next phase should focus on:**
1. **Verify hugepage benefit** (quick diagnostic)
2. **Implement based on results** (TLB or page fault optimization)
3. **Re-profile** to confirm improvement
4. **Iterate** if needed