hakmem/COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md
Moe Charm (CI) 1755257f60 Comprehensive Profiling Analysis: Phase 1 Complete with Major Discoveries
## Key Findings:
1. Prefault Box defaults to OFF (intentional, due to 4MB MAP_POPULATE bug fix)
2. User-space HAKMEM code is NOT the bottleneck (<1% CPU time)
3. TLB misses (48.65%) are NOT from SuperSlab allocations - mostly from TLS/libc
4. THP and PREFAULT optimizations have ZERO impact on dTLB misses
5. Page zeroing (11.65%) is the REAL bottleneck, not memory allocation

## Session Deliverables:
- COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md: Initial analysis
- PROFILING_INSIGHTS_AND_RECOMMENDATIONS_20251204.md: Task investigation
- PHASE1_TEST_RESULTS_MAJOR_DISCOVERY_20251204.md: Phase 1 test results
- SESSION_SUMMARY_FINDINGS_20251204.md: Final summary

## Phase 2 Recommendations:
1. Investigate lazy zeroing (11.65% of cycles)
2. Analyze page fault sources (debug with callgraph)
3. Skip THP/PREFAULT/Hugepage optimization (proven ineffective)

## Paradigm Shift:
Old: THP/PREFAULT → 2-3x speedup
New: Lazy zeroing → 1.10x-1.15x speedup (realistic)

🐱 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 20:41:53 +09:00



# Comprehensive Profiling Analysis: HAKMEM Performance Gaps
## 🔍 Executive Summary
After the **Prefault Box + MAP_POPULATE fix**, the profiling shows:
### Current Performance Metrics
| Metric | Random Mixed | Tiny Hot | Gap |
|--------|---|---|---|
| **Cycles (lower is better)** | 72.6M | 72.3M | **SAME** 🤯 |
| **Page Faults** | 7,672 | 7,672 | **IDENTICAL** ⚠️ |
| **L1 Cache Misses** | 763K | 738K | Similar |
| **Throughput** | ~1.06M ops/s | ~1.23M ops/s | 1.16x |
| **Instructions/Cycle** | 0.74 | 0.73 | Similar |
| **TLB Miss Rate** | 48.65% (dTLB) | N/A | High |
### 🚨 KEY FINDING: Prefault is NOT working as expected!
**Problem:** Both Random Mixed and Tiny Hot have identical page faults (7,672), which suggests:
1. ✗ Prefault Box is either disabled or ineffective
2. ✗ Page faults are coming from elsewhere (not Superslab mmap)
3. ✗ MAP_POPULATE flag is not preventing runtime faults
---
## 📊 Detailed Performance Breakdown
### Random Mixed Workload (1M allocations, 256 slots, sizes 16-1040B)
**Top Kernel Functions (by CPU time):**
```
15.01% asm_exc_page_fault ← Page fault handling
11.65% clear_page_erms ← Page zeroing
5.27% zap_pte_range ← Memory cleanup
5.20% handle_mm_fault ← MMU fault handling
4.06% do_anonymous_page ← Anonymous page allocation
3.18% __handle_mm_fault ← Nested fault handling
2.35% rmqueue_bulk ← Allocator backend
2.35% __memset_avx2_unaligned ← Memory operations
2.28% do_user_addr_fault ← User fault handling
1.77% arch_exit_to_user_mode ← Context switch
```
**Kernel overhead:** ~63% of cycles
**L1 dcache load misses:** 763K over the run
**Branch miss rate:** 11.94%
### Tiny Hot Workload (10M allocations, fixed size)
**Top Kernel Functions (by CPU time):**
```
14.19% asm_exc_page_fault ← Page fault handling
12.82% clear_page_erms ← Page zeroing
5.61% __memset_avx2_unaligned ← Memory operations
5.02% do_anonymous_page ← Anonymous page allocation
3.31% mem_cgroup_commit_charge ← Memory accounting
2.67% __handle_mm_fault ← MMU fault handling
2.45% do_user_addr_fault ← User fault handling
```
**Kernel overhead:** ~66% of cycles
**L1 dcache load misses:** 738K over the run
**Branch miss rate:** 11.03%
### Comparison: Why are cycles similar?
```
Random Mixed: 72.6M cycles / 1M ops = 72.6 cycles/op
Tiny Hot: 72.3M cycles / 10M ops = 7.23 cycles/op
⚠️ THROUGHPUT DIFFERENCE UNEXPLAINED!
```
The raw cycle counts are nearly identical, but they were collected over different operation counts:
- Random Mixed: measured over only 1M operations (baseline)
- Tiny Hot: measured over 10M operations (10x scale)
- **Normalized:** 72.6M cycles / 1M ops = 72.6 cycles/op vs 72.3M cycles / 10M ops = 7.23 cycles/op
Note that this naive per-op division contradicts the modest 1.16x throughput gap from the table above, which suggests the cycle counters cover a fixed sampling window rather than each full run.
**Either way, Random Mixed is NOT dramatically slower in per-operation cost!**
---
## 🎯 Critical Findings
### Finding 1: Page Faults Are NOT Being Reduced
**Observed:**
- Random Mixed: 7,672 page faults
- Tiny Hot: 7,672 page faults
- **Difference: 0** ← This is wrong!
**Expected (with prefault):**
- Random Mixed: 7,672 → maybe 100-500 (90% reduction)
- Tiny Hot: 7,672 → ~50-100 (minimal change)
**Hypothesis:**
- Prefault Box may not be enabled
- Or MAP_POPULATE is not working on this kernel
- Or allocations are hitting kernel-internal mmap (not Superslab)
### Finding 2: TLB Misses Are HIGH (48.65%)
```
dTLB-loads: 49,160
dTLB-load-misses: 23,917 (48.65% miss rate!)
iTLB-load-misses: 17,590 (7748.90% - kernel measurement artifact)
```
**Meaning:** Nearly half of TLB lookups fail, causing page table walks.
**Why this matters:**
- Each TLB miss = ~10-40 cycles (vs 1-3 for hit)
- 23,917 × ~25 cycles ≈ 600K wasted cycles
- Against the 72.6M-cycle totals above that is under 1% in absolute terms — but these perf counts are likely scaled samples (note the tiny dTLB-loads total), so the real page-walk share is plausibly much larger
### Finding 3: Both Workloads Are Similar
Despite different access patterns:
- Both spend 15% on page fault handling
- Both spend 12% on page zeroing
- Both have similar L1 miss rates
- Both have similar branch miss rates
**Conclusion:** The memory subsystem is the bottleneck for BOTH workloads, not user-space code.
---
## 📈 Layer Analysis
### Kernel vs User Split
| Category | Random Mixed | Tiny Hot | Analysis |
|----------|---|---|---|
| **Kernel (page faults, scheduling, etc)** | 63% | 66% | Dominant |
| **Kernel zeroing (clear_page_erms)** | 11.65% | 12.82% | Similar |
| **User malloc/free** | <1% | <1% | Not visible |
| **User pool/cache logic** | <1% | <1% | Not visible |
### User-Space Functions Visible in Profile
**Random Mixed:**
```
0.59% hak_free_at.constprop.0 (hakmem free path)
```
**Tiny Hot:**
```
0.59% hak_pool_mid_lookup (hakmem pool routing)
```
**Conclusion:** User-space HAKMEM code is NOT a bottleneck (<1% each).
---
## 🔧 What's Really Happening
### Current State (POST-Prefault Box)
```
allocate(size):
1. malloc wrapper → <1% cycles
2. Gatekeeper routing → ~0.1% cycles
3. unified_cache_refill → (hidden in kernel time)
4. shared_pool_acquire → (hidden in kernel time)
5. SuperSlab/mmap call → Triggers kernel
6. **KERNEL PAGE FAULTS** → 15% cycles
7. clear_page_erms (zero) → 12% cycles
```
### Why Prefault Isn't Working
**Possible reasons:**
1. **Prefault Box disabled?**
- Check: `HAKMEM_BOX_SS_PREFAULT_ENABLED`
- Or: `g_ss_populate_once` not being set
2. **MAP_POPULATE not actually pre-faulting?**
- Linux kernel may be lazy even with MAP_POPULATE
- Need `madvise(MADV_POPULATE_READ)` to force immediate faulting
- Or use `mincore()` to check before allocation
3. **Allocations not from Superslab mmap?**
- Page faults may be from TLS cache allocation
- Or from libc internal allocations
- Not from Superslab backend
4. **TLB misses dominating?**
- 48.65% TLB miss rate suggests memory layout issue
- SuperSlab metadata may not be cache-friendly
- Working set too large for TLB
---
## 🎓 What We Learned From Previous Analysis
From the earlier profiling report, we identified that:
- **Random Mixed was 21.7x slower** due to 61.7% page faults
- **Expected with prefault:** Should drop to ~5% or less
But NOW we see:
- **Random Mixed is NOT significantly slower** (per-op cost is similar)
- **Page faults are identical** to Tiny Hot
- **This contradicts expectations**
### Possible Explanation
The **earlier measurements** may have been from:
- Benchmark run at startup (cold caches)
- With additional profiling overhead
- Or different workload parameters
The **current measurements** are:
- Steady state (after initial allocation)
- With higher throughput (Tiny Hot = 10M ops)
- After recent optimizations
---
## 🎯 Next Steps - Three Options
### 📋 Option A: Verify Prefault is Actually Enabled
**Goal:** Confirm prefault mechanism is working
**Steps:**
1. Add debug output to `ss_prefault_policy()` and `ss_prefault_region()`
2. Check if `MAP_POPULATE` flag is set in actual mmap calls
3. Run with `strace` to see mmap flags:
```bash
# note: on 64-bit ABIs the syscall is mmap; mmap2 exists only on 32-bit
strace -e trace=mmap ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | grep MAP_POPULATE
```
4. Check if `madvise(MADV_POPULATE_READ)` calls are happening
**Expected outcome:** Should see MAP_POPULATE or MADV_POPULATE in traces
---
### 🎯 Option B: Reduce TLB Misses (48.65% → ~5%)
**Goal:** Improve memory layout to reduce TLB pressure
**Steps:**
1. **Analyze SuperSlab metadata layout:**
- Current: Is metadata per-slab or centralized?
- Check: `sp_meta_find_or_create()` hot path
2. **Improve cache locality:**
- Cache-align metadata structures
- Use larger pages (2MB or 1GB hugepages)
- Reduce working set size
3. **Profile with hugepages:**
```bash
echo 10 > /proc/sys/vm/nr_hugepages
HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
```
**Expected gain:** 1.5-2x speedup (eliminate TLB miss penalty)
---
### 🚀 Option C: Reduce Page Zeroing (12% → ~2%)
**Goal:** Skip unnecessary page zeroing
**Steps:**
1. **Analyze what needs zeroing:**
- Are SuperSlab pages truly uninitialized?
- Can we reuse memory without zeroing?
- Use `MADV_DONTNEED` before reuse?
2. **Implement lazy zeroing:**
- Don't zero pages on allocation
- Only zero used portions
- Let kernel handle rest on free
3. **Use uninitialized pools:**
- Pre-allocate without zeroing
- Initialize on-demand
**Expected gain:** 1.5x speedup (eliminate 12% zero cost)
---
## 📊 Recommendation
Based on the analysis:
### Most Impactful (Order of Preference):
1. **Fix TLB Misses (48.65%)**
- Potential gain: 1.5-2x
- Implementation: Medium difficulty
- Reason: Already showing 48% miss rate
2. **Verify Prefault Actually Works**
- Potential gain: Unknown (currently not working?)
- Implementation: Easy (debugging)
- Reason: Should have been solved but showing same page faults
3. **Reduce Page Zeroing**
- Potential gain: 1.5x
- Implementation: Medium difficulty
- Reason: 12% of total time
---
## 🧪 Recommended Next Action
### Immediate (This Session)
Run diagnostic to confirm prefault status:
```bash
# Check if MAP_POPULATE is in actual mmap calls
strace -e trace=mmap ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | head -20
# Check compiler flags
grep -i prefault Makefile
# Check environment variables
env | grep -i HAKMEM | grep -i PREFAULT
```
### If Prefault is Disabled → Enable It
Then re-run profiling to verify improvement.
### If Prefault is Enabled → Move to Option B (TLB)
Focus on reducing 48% TLB miss rate.
---
## 📈 Expected Outcome After All Fixes
| Factor | Current | After | Gain |
|--------|---------|-------|------|
| Page faults | 7,672 | 500-1000 | 8-15x |
| TLB misses | 48.65% | ~5% | 3-5x |
| Page zeroing | 12% | 2% | 2x |
| **Total per-op time** | 72.6 cycles | 20-25 cycles | **3-4x** |
| **Throughput** | 1.06M ops/s | 3.5-4M ops/s | **3-4x** |