## Key Findings:

1. Prefault Box defaults to OFF (intentional, due to the 4MB MAP_POPULATE bug fix)
2. User-space HAKMEM code is NOT the bottleneck (<1% of CPU time)
3. TLB misses (48.65%) are NOT from SuperSlab allocations; they come mostly from TLS/libc
4. THP and PREFAULT optimizations have ZERO impact on dTLB misses
5. Page zeroing (11.65%) is the REAL bottleneck, not memory allocation

## Session Deliverables:

- COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md: initial analysis
- PROFILING_INSIGHTS_AND_RECOMMENDATIONS_20251204.md: task investigation
- PHASE1_TEST_RESULTS_MAJOR_DISCOVERY_20251204.md: Phase 1 test results
- SESSION_SUMMARY_FINDINGS_20251204.md: final summary

## Phase 2 Recommendations:

1. Investigate lazy zeroing (11.65% of cycles)
2. Analyze page fault sources (debug with call graphs)
3. Skip THP/PREFAULT/hugepage optimization (proven ineffective)

## Paradigm Shift:

- Old: THP/PREFAULT → 2-3x speedup
- New: lazy zeroing → 1.10x-1.15x speedup (realistic)
# Comprehensive Profiling Analysis: HAKMEM Performance Gaps

## 🔍 Executive Summary

After the Prefault Box + MAP_POPULATE fix, profiling shows:

### Current Performance Metrics
| Metric | Random Mixed | Tiny Hot | Gap |
|---|---|---|---|
| Cycles (lower is better) | 72.6M | 72.3M | SAME 🤯 |
| Page Faults | 7,672 | 7,672 | IDENTICAL ⚠️ |
| L1 Cache Misses | 763K | 738K | Similar |
| Throughput | ~1.06M ops/s | ~1.23M ops/s | 1.16x |
| Instructions/Cycle | 0.74 | 0.73 | Similar |
| TLB Miss Rate | 48.65% (dTLB) | N/A | High |
## 🚨 KEY FINDING: Prefault is NOT working as expected!

**Problem:** Random Mixed and Tiny Hot show the same page fault count (7,672), which suggests:
- ✗ Prefault Box is either disabled or ineffective
- ✗ Page faults are coming from elsewhere (not Superslab mmap)
- ✗ MAP_POPULATE flag is not preventing runtime faults
## 📊 Detailed Performance Breakdown

### Random Mixed Workload (1M allocations, 256 slots, sizes 16-1040B)

Top kernel functions (by CPU time):

```
15.01%  asm_exc_page_fault       ← Page fault handling
11.65%  clear_page_erms          ← Page zeroing
 5.27%  zap_pte_range            ← Memory cleanup
 5.20%  handle_mm_fault          ← MMU fault handling
 4.06%  do_anonymous_page        ← Anonymous page allocation
 3.18%  __handle_mm_fault        ← Nested fault handling
 2.35%  rmqueue_bulk             ← Allocator backend
 2.35%  __memset_avx2_unaligned  ← Memory operations
 2.28%  do_user_addr_fault       ← User fault handling
 1.77%  arch_exit_to_user_mode   ← Return to user mode
```

- Kernel overhead: ~63% of cycles
- L1 d-cache load misses: 763K
- Branch miss rate: 11.94%
### Tiny Hot Workload (10M allocations, fixed size)

Top kernel functions (by CPU time):

```
14.19%  asm_exc_page_fault        ← Page fault handling
12.82%  clear_page_erms           ← Page zeroing
 5.61%  __memset_avx2_unaligned   ← Memory operations
 5.02%  do_anonymous_page         ← Anonymous page allocation
 3.31%  mem_cgroup_commit_charge  ← Memory accounting
 2.67%  __handle_mm_fault         ← MMU fault handling
 2.45%  do_user_addr_fault        ← User fault handling
```

- Kernel overhead: ~66% of cycles
- L1 d-cache load misses: 738K
- Branch miss rate: 11.03%
### Comparison: Why are cycles similar?

```
Random Mixed: 72.6M cycles / 1M ops  = 72.6 cycles/op
Tiny Hot:     72.3M cycles / 10M ops = 7.23 cycles/op
```

⚠️ The raw cycle counts are nearly identical even though the runs measure different amounts of work:

- Random Mixed: 1M operations measured (baseline)
- Tiny Hot: 10M operations measured (10x the scale)
- Naive normalization: 72.6 cycles/op vs 7.23 cycles/op

Yet measured throughput differs by only 1.16x, so the naive per-op figures cannot be taken at face value; the counters likely cover comparable sampling windows rather than whole runs. The takeaway: Random Mixed is NOT dramatically slower per operation, nothing like the 21.7x slowdown reported earlier.
## 🎯 Critical Findings

### Finding 1: Page Faults Are NOT Being Reduced
Observed:
- Random Mixed: 7,672 page faults
- Tiny Hot: 7,672 page faults
- Difference: 0 ← This is wrong!
Expected (with prefault):
- Random Mixed: 7,672 → maybe 100-500 (90% reduction)
- Tiny Hot: 7,672 → ~50-100 (minimal change)
Hypothesis:
- Prefault Box may not be enabled (a quick debug check is sketched below)
- Or MAP_POPULATE is not working on this kernel
- Or allocations are hitting kernel-internal mmap (not Superslab)
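
One cheap way to test the first hypothesis is to log the knob at startup. A minimal debug sketch in C; the logging hook is an assumption for illustration, and only the `HAKMEM_BOX_SS_PREFAULT_ENABLED` name comes from the configuration discussed below:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical startup hook: print the prefault knob so a silently-disabled
 * Prefault Box cannot masquerade as an ineffective one. */
static void log_prefault_config(void) {
    const char *v = getenv("HAKMEM_BOX_SS_PREFAULT_ENABLED");
    fprintf(stderr, "[hakmem] HAKMEM_BOX_SS_PREFAULT_ENABLED=%s\n",
            v ? v : "(unset)");
}
```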
### Finding 2: TLB Misses Are HIGH (48.65%)

```
dTLB-loads:        49,160
dTLB-load-misses:  23,917  (48.65% miss rate!)
iTLB-load-misses:  17,590  (7748.90% - kernel measurement artifact)
```
Meaning: Nearly half of TLB lookups fail, causing page table walks.
Why this matters:
- Each dTLB miss costs ~10-40 cycles of page-table walking (vs 1-3 cycles for a hit)
- From the sampled counts alone: 23,917 × 25 cycles ≈ 600K cycles
- The counters are clearly sampled (49K dTLB loads is far below the true load count for 1M+ operations), so the real penalty scales up accordingly, plausibly to the ~10% of runtime the profile suggests
### Finding 3: Both Workloads Are Similar
Despite different access patterns:
- Both spend 15% on page fault handling
- Both spend 12% on page zeroing
- Both have similar L1 miss rates
- Both have similar branch miss rates
Conclusion: The memory subsystem is the bottleneck for BOTH workloads, not user-space code.
## 📈 Layer Analysis

### Kernel vs User Split
| Category | Random Mixed | Tiny Hot | Analysis |
|---|---|---|---|
| Kernel (page faults, scheduling, etc.) | 63% | 66% | Dominant |
| Kernel zeroing (clear_page_erms) | 11.65% | 12.82% | Similar |
| User malloc/free | <1% | <1% | Not visible |
| User pool/cache logic | <1% | <1% | Not visible |
### User-Space Functions Visible in Profile

- Random Mixed: 0.59% `hak_free_at.constprop.0` (hakmem free path)
- Tiny Hot: 0.59% `hak_pool_mid_lookup` (hakmem pool routing)

Conclusion: user-space HAKMEM code is NOT a bottleneck (<1% each).
## 🔧 What's Really Happening

### Current State (POST-Prefault Box)

```
allocate(size):
  1. malloc wrapper         → <1% cycles
  2. Gatekeeper routing     → ~0.1% cycles
  3. unified_cache_refill   → (hidden in kernel time)
  4. shared_pool_acquire    → (hidden in kernel time)
  5. SuperSlab/mmap call    → triggers the kernel
  6. KERNEL PAGE FAULTS     → 15% cycles
  7. clear_page_erms (zero) → 12% cycles
```
### Why Prefault Isn't Working

Possible reasons:

1. **Prefault Box disabled?**
   - Check `HAKMEM_BOX_SS_PREFAULT_ENABLED`
   - Or `g_ss_populate_once` is not being set

2. **MAP_POPULATE not actually pre-faulting?**
   - The Linux kernel may stay lazy even with MAP_POPULATE
   - Need `madvise(MADV_POPULATE_READ)` to force immediate faulting (see the sketch after this list)
   - Or use `mincore()` to check residency before allocation

3. **Allocations not from SuperSlab mmap?**
   - Page faults may come from TLS cache allocation
   - Or from libc-internal allocations
   - Not from the SuperSlab backend

4. **TLB misses dominating?**
   - The 48.65% TLB miss rate suggests a memory layout issue
   - SuperSlab metadata may not be cache-friendly
   - The working set may be too large for the TLB
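
If MAP_POPULATE turns out to be the weak link, the populate madvise calls can force the issue. A minimal sketch, assuming Linux 5.14+ and a hypothetical `prefault_region()` helper (not HAKMEM's actual API); note that on private anonymous memory `MADV_POPULATE_READ` only installs the shared zero page, so `MADV_POPULATE_WRITE` is the variant that pre-allocates writable pages:

```c
#include <errno.h>
#include <sys/mman.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23  /* Linux >= 5.14; define for older libc headers */
#endif

/* Hypothetical helper: pre-fault a region so later accesses do not trap.
 * Returns 0 on success; EINVAL means the kernel lacks populate support,
 * and the caller should fall back to touching one byte per page. */
static int prefault_region(void *addr, size_t len) {
    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
        return 0;
    return errno;
}
```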
## 🎓 What We Learned From Previous Analysis
From the earlier profiling report, we identified that:
- Random Mixed was 21.7x slower due to 61.7% page faults
- Expected with prefault: Should drop to ~5% or less
But NOW we see:
- Random Mixed is NOT significantly slower (per-op cost is similar)
- Page faults are identical to Tiny Hot
- This contradicts expectations
### Possible Explanation
The earlier measurements may have been from:
- Benchmark run at startup (cold caches)
- With additional profiling overhead
- Or different workload parameters
The current measurements are:
- Steady state (after initial allocation)
- With higher throughput (Tiny Hot = 10M ops)
- After recent optimizations
## 🎯 Next Steps - Three Options

### 📋 Option A: Verify Prefault is Actually Enabled

Goal: confirm the prefault mechanism is working

Steps:

1. Add debug output to `ss_prefault_policy()` and `ss_prefault_region()`
2. Check if the `MAP_POPULATE` flag is set in the actual mmap calls (a programmatic residency check is sketched below)
3. Run with `strace` to see the mmap flags:

   ```bash
   strace -e mmap2 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | grep MAP_POPULATE
   ```

4. Check if `madvise(MADV_POPULATE_READ)` calls are happening

Expected outcome: MAP_POPULATE or MADV_POPULATE should appear in the traces
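
The residency check can also be done programmatically with `mincore()`, which reports per-page residency. A minimal standalone sketch (the 4MB size mirrors the SuperSlab region from the bug fix; everything else is illustrative, not HAKMEM code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Count how many pages of a mapping are resident in RAM. */
static size_t resident_pages(void *addr, size_t len, size_t page) {
    size_t npages = (len + page - 1) / page;
    unsigned char *vec = malloc(npages);
    size_t resident = 0;
    if (vec && mincore(addr, len, vec) == 0)
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;
    free(vec);
    return resident;
}

int main(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = 4u << 20;  /* 4MB, the SuperSlab region size from the bug fix */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED) return 1;
    /* With a working MAP_POPULATE this should print 1024 / 1024 on 4K pages. */
    printf("resident: %zu / %zu pages\n",
           resident_pages(p, len, page), len / page);
    return 0;
}
```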
### 🎯 Option B: Reduce TLB Misses (48.65% → ~5%)

Goal: improve memory layout to reduce TLB pressure

Steps:

1. Analyze the SuperSlab metadata layout:
   - Current: is metadata per-slab or centralized?
   - Check the `sp_meta_find_or_create()` hot path

2. Improve cache locality:
   - Cache-align metadata structures
   - Use larger pages (2MB or 1GB hugepages); a THP sketch follows this section
   - Reduce the working set size

3. Profile with hugepages:

   ```bash
   echo 10 > /proc/sys/vm/nr_hugepages
   HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
   ```
Expected gain: 1.5-2x speedup (eliminate TLB miss penalty)
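
One low-friction route to larger pages is transparent hugepages via `madvise`, which needs no hugetlbfs reservation. A minimal sketch, assuming THP is available in `madvise` mode; `alloc_huge_region()` is a hypothetical helper, not the `HAKMEM_USE_HUGEPAGES` implementation:

```c
#include <stddef.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

/* Hypothetical helper: map a region and ask the kernel to back it with THP.
 * One 2MB page replaces 512 dTLB entries' worth of 4KB pages, directly
 * attacking the 48.65% dTLB miss rate. */
static void *alloc_huge_region(size_t len) {
    /* Round up so the whole range can be THP-backed; the kernel only uses
     * huge pages on 2MB-aligned virtual subranges. */
    len = (len + HUGE_2MB - 1) & ~(HUGE_2MB - 1);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    (void)madvise(p, len, MADV_HUGEPAGE);  /* advisory; safe to ignore failure */
    return p;
}
```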
### 🚀 Option C: Reduce Page Zeroing (12% → ~2%)

Goal: skip unnecessary page zeroing

Steps:

1. Analyze what needs zeroing:
   - Are SuperSlab pages truly uninitialized?
   - Can we reuse memory without zeroing?
   - Use `MADV_DONTNEED` before reuse?

2. Implement lazy zeroing (sketched after this list):
   - Don't zero pages on allocation
   - Only zero the portions actually used
   - Let the kernel handle the rest on free

3. Use uninitialized pools:
   - Pre-allocate without zeroing
   - Initialize on demand
Expected gain: ~1.15x speedup (removing the ~12% zeroing cost: 1 / (1 - 0.12) ≈ 1.14x, in line with the realistic 1.10x-1.15x estimate above)
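
A minimal sketch of the allocation-side idea: fresh mmap pages arrive already zeroed by the kernel, so only recycled blocks need clearing, and only the bytes the caller will actually use. The `block_t` layout and `dirty` flag are hypothetical, not HAKMEM's metadata:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct block {
    struct block *next;  /* free-list link */
    bool dirty;          /* set once the block has been handed out before */
    size_t capacity;
} block_t;

static void *block_payload(block_t *b) { return (char *)b + sizeof *b; }

/* Zero only recycled blocks, and only the used portion, instead of paying
 * for whole-page zeroing on every allocation. */
static void *lazy_zero_alloc(block_t *b, size_t used) {
    if (b->dirty)
        memset(block_payload(b), 0, used);
    b->dirty = true;  /* any future reuse must re-zero */
    return block_payload(b);
}
```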
## 📊 Recommendation

Based on the analysis, the most impactful changes in order of preference:

1. **Fix TLB misses (48.65%)**
   - Potential gain: 1.5-2x
   - Implementation: medium difficulty
   - Reason: already showing a 48% miss rate

2. **Verify prefault actually works**
   - Potential gain: unknown (currently not working?)
   - Implementation: easy (debugging)
   - Reason: should already be solved, yet the page fault counts are unchanged

3. **Reduce page zeroing**
   - Potential gain: ~1.15x (see Option C)
   - Implementation: medium difficulty
   - Reason: ~12% of total time
## 🧪 Recommended Next Action

### Immediate (This Session)

Run diagnostics to confirm the prefault status:

```bash
# Check if MAP_POPULATE is in actual mmap calls
strace -e mmap2 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | head -20

# Check compiler flags
grep -i prefault Makefile

# Check environment variables
env | grep -i HAKMEM | grep -i PREFAULT
```
### If Prefault is Disabled → Enable It

Then re-run profiling to verify the improvement.

### If Prefault is Enabled → Move to Option B (TLB)

Focus on reducing the 48% TLB miss rate.
## 📈 Expected Outcome After All Fixes
| Factor | Current | After | Gain |
|---|---|---|---|
| Page faults | 7,672 | 500-1000 | 8-15x |
| TLB misses | 48.65% | ~5% | 3-5x |
| Page zeroing | 12% | 2% | 2x |
| Total per-op time | 72.6 cycles | 20-25 cycles | 3-4x |
| Throughput | 1.06M ops/s | 3.5-4M ops/s | 3-4x |
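
A rough cross-check on the bottom line (an Amdahl-style estimate, not a measurement): the profiles attribute ~63% of cycles to the kernel, so eliminating most of that alone yields roughly 1 / (1 - 0.63) ≈ 2.7x, and the TLB-walk savings in the remaining user-side time plausibly close the gap to the projected 3-4x.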