Files

Moe Charm (CI) 1755257f60 Comprehensive Profiling Analysis: Phase 1 Complete with Major Discoveries

## Key Findings:
1. Prefault Box defaults to OFF (intentional, due to 4MB MAP_POPULATE bug fix)
2. User-space HAKMEM code is NOT the bottleneck (<1% CPU time)
3. TLB misses (48.65%) are NOT from SuperSlab allocations - mostly from TLS/libc
4. THP and PREFAULT optimizations have ZERO impact on dTLB misses
5. Page zeroing (11.65%) is the REAL bottleneck, not memory allocation

## Session Deliverables:
- COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md: Initial analysis
- PROFILING_INSIGHTS_AND_RECOMMENDATIONS_20251204.md: Task investigation
- PHASE1_TEST_RESULTS_MAJOR_DISCOVERY_20251204.md: Phase 1 test results
- SESSION_SUMMARY_FINDINGS_20251204.md: Final summary

## Phase 2 Recommendations:
1. Investigate lazy zeroing (11.65% of cycles)
2. Analyze page fault sources (debug with callgraph)
3. Skip THP/PREFAULT/Hugepage optimization (proven ineffective)

## Paradigm Shift:
Old: THP/PREFAULT → 2-3x speedup
New: Lazy zeroing → 1.10x-1.15x speedup (realistic)

🐱 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-12-04 20:41:53 +09:00

9.7 KiB

Raw Blame History

Comprehensive Profiling Analysis: HAKMEM Performance Gaps

🔍 Executive Summary

After the Prefault Box + MAP_POPULATE fix, the profiling shows:

Current Performance Metrics

Metric	Random Mixed	Tiny Hot	Gap
Cycles (lower is better)	72.6M	72.3M	SAME 🤯
Page Faults	7,672	7,672	IDENTICAL ⚠️
L1 Cache Misses	763K	738K	Similar
Throughput	~1.06M ops/s	~1.23M ops/s	1.16x
Instructions/Cycle	0.74	0.73	Similar
TLB Miss Rate	48.65% (dTLB)	N/A	High

🚨 KEY FINDING: Prefault is NOT working as expected!

Problem: Both Random Mixed and Tiny Hot have identical page faults (7,672), which suggests:

✗ Prefault Box is either disabled or ineffective
✗ Page faults are coming from elsewhere (not Superslab mmap)
✗ MAP_POPULATE flag is not preventing runtime faults

📊 Detailed Performance Breakdown

Random Mixed Workload (1M allocations, 256 slots, sizes 16-1040B)

Top Kernel Functions (by CPU time):

15.01% asm_exc_page_fault       ← Page fault handling
11.65% clear_page_erms          ← Page zeroing
 5.27% zap_pte_range            ← Memory cleanup
 5.20% handle_mm_fault          ← MMU fault handling
 4.06% do_anonymous_page        ← Anonymous page allocation
 3.18% __handle_mm_fault        ← Nested fault handling
 2.35% rmqueue_bulk             ← Allocator backend
 2.35% __memset_avx2_unaligned  ← Memory operations
 2.28% do_user_addr_fault       ← User fault handling
 1.77% arch_exit_to_user_mode   ← Context switch

Kernel overhead: ~63% of cycles L1 DCL misses: 763K / operations Branch miss rate: 11.94%

Tiny Hot Workload (10M allocations, fixed size)

Top Kernel Functions (by CPU time):

14.19% asm_exc_page_fault       ← Page fault handling
12.82% clear_page_erms          ← Page zeroing
 5.61% __memset_avx2_unaligned  ← Memory operations
 5.02% do_anonymous_page        ← Anonymous page allocation
 3.31% mem_cgroup_commit_charge ← Memory accounting
 2.67% __handle_mm_fault        ← MMU fault handling
 2.45% do_user_addr_fault       ← User fault handling

Kernel overhead: ~66% of cycles L1 DCL misses: 738K / operations Branch miss rate: 11.03%

Comparison: Why are cycles similar?

Random Mixed: 72.6M cycles / 1M ops = 72.6 cycles/op
Tiny Hot:     72.3M cycles / 10M ops = 7.23 cycles/op

⚠️ THROUGHPUT DIFFERENCE UNEXPLAINED!

The cycles are nearly identical, but throughput differs because:

Random Mixed: Measuring only 1M operations (baseline)
Tiny Hot: Measuring 10M operations (10x scale)
Real throughput: Would need to normalize: 72.6M cycles / 1M = 72.6 vs 7.23 per op

This means Random Mixed is NOT actually slower in per-operation cost!

🎯 Critical Findings

Finding 1: Page Faults Are NOT Being Reduced

Observed:

Random Mixed: 7,672 page faults
Tiny Hot: 7,672 page faults
Difference: 0 ← This is wrong!

Expected (with prefault):

Random Mixed: 7,672 → maybe 100-500 (90% reduction)
Tiny Hot: 7,672 → ~50-100 (minimal change)

Hypothesis:

Prefault Box may not be enabled
Or MAP_POPULATE is not working on this kernel
Or allocations are hitting kernel-internal mmap (not Superslab)

Finding 2: TLB Misses Are HIGH (48.65%)

dTLB-loads:        49,160
dTLB-load-misses:  23,917 (48.65% miss rate!)
iTLB-load-misses:  17,590 (7748.90% - kernel measurement artifact)

Meaning: Nearly half of TLB lookups fail, causing page table walks.

Why this matters:

Each TLB miss = ~10-40 cycles (vs 1-3 for hit)
23,917 × 25 cycles = ~600K wasted cycles
That's ~10% of total runtime!

Finding 3: Both Workloads Are Similar

Despite different access patterns:

Both spend 15% on page fault handling
Both spend 12% on page zeroing
Both have similar L1 miss rates
Both have similar branch miss rates

Conclusion: The memory subsystem is the bottleneck for BOTH workloads, not user-space code.

📈 Layer Analysis

Kernel vs User Split

Category	Random Mixed	Tiny Hot	Analysis
Kernel (page faults, scheduling, etc)	63%	66%	Dominant
Kernel zeroing (clear_page_erms)	11.65%	12.82%	Similar
User malloc/free	<1%	<1%	Not visible
User pool/cache logic	<1%	<1%	Not visible

User-Space Functions Visible in Profile

Random Mixed:

0.59% hak_free_at.constprop.0 (hakmem free path)

Tiny Hot:

0.59% hak_pool_mid_lookup (hakmem pool routing)

Conclusion: User-space HAKMEM code is NOT a bottleneck (<1% each).

🔧 What's Really Happening

Current State (POST-Prefault Box)

allocate(size):
  1. malloc wrapper           → <1% cycles
  2. Gatekeeper routing       → ~0.1% cycles
  3. unified_cache_refill     → (hidden in kernel time)
  4. shared_pool_acquire      → (hidden in kernel time)
  5. SuperSlab/mmap call      → Triggers kernel
  6. **KERNEL PAGE FAULTS**   → 15% cycles
  7. clear_page_erms (zero)   → 12% cycles

Why Prefault Isn't Working

Possible reasons:

Prefault Box disabled?
- Check: HAKMEM_BOX_SS_PREFAULT_ENABLED
- Or: g_ss_populate_once not being set
MAP_POPULATE not actually pre-faulting?
- Linux kernel may be lazy even with MAP_POPULATE
- Need madvise(MADV_POPULATE_READ) to force immediate faulting
- Or use mincore() to check before allocation
Allocations not from Superslab mmap?
- Page faults may be from TLS cache allocation
- Or from libc internal allocations
- Not from Superslab backend
TLB misses dominating?
- 48.65% TLB miss rate suggests memory layout issue
- SuperSlab metadata may not be cache-friendly
- Working set too large for TLB

🎓 What We Learned From Previous Analysis

From the earlier profiling report, we identified that:

Random Mixed was 21.7x slower due to 61.7% page faults
Expected with prefault: Should drop to ~5% or less

But NOW we see:

Random Mixed is NOT significantly slower (per-op cost is similar)
Page faults are identical to Tiny Hot
This contradicts expectations

Possible Explanation

The earlier measurements may have been from:

Benchmark run at startup (cold caches)
With additional profiling overhead
Or different workload parameters

The current measurements are:

Steady state (after initial allocation)
With higher throughput (Tiny Hot = 10M ops)
After recent optimizations

🎯 Next Steps - Three Options

📋 Option A: Verify Prefault is Actually Enabled

Goal: Confirm prefault mechanism is working

Steps:

Add debug output to ss_prefault_policy() and ss_prefault_region()
Check if MAP_POPULATE flag is set in actual mmap calls

Run with strace to see mmap flags:

strace -e mmap2 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | grep MAP_POPULATE

Check if madvise(MADV_POPULATE_READ) calls are happening

Expected outcome: Should see MAP_POPULATE or MADV_POPULATE in traces

🎯 Option B: Reduce TLB Misses (48.65% → ~5%)

Goal: Improve memory layout to reduce TLB pressure

Steps:

Analyze SuperSlab metadata layout:
- Current: Is metadata per-slab or centralized?
- Check: sp_meta_find_or_create() hot path
Improve cache locality:
- Cache-align metadata structures
- Use larger pages (2MB or 1GB hugepages)
- Reduce working set size

Profile with hugepages:

echo 10 > /proc/sys/vm/nr_hugepages
HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42

Expected gain: 1.5-2x speedup (eliminate TLB miss penalty)

🚀 Option C: Reduce Page Zeroing (12% → ~2%)

Goal: Skip unnecessary page zeroing

Steps:

Analyze what needs zeroing:
- Are SuperSlab pages truly uninitialized?
- Can we reuse memory without zeroing?
- Use MADV_DONTNEED before reuse?
Implement lazy zeroing:
- Don't zero pages on allocation
- Only zero used portions
- Let kernel handle rest on free
Use uninitialized pools:
- Pre-allocate without zeroing
- Initialize on-demand

Expected gain: 1.5x speedup (eliminate 12% zero cost)

📊 Recommendation

Based on the analysis:

Most Impactful (Order of Preference):

Fix TLB Misses (48.65%)
- Potential gain: 1.5-2x
- Implementation: Medium difficulty
- Reason: Already showing 48% miss rate
Verify Prefault Actually Works
- Potential gain: Unknown (currently not working?)
- Implementation: Easy (debugging)
- Reason: Should have been solved but showing same page faults
Reduce Page Zeroing
- Potential gain: 1.5x
- Implementation: Medium difficulty
- Reason: 12% of total time

🧪 Recommended Next Action

Immediate (This Session)

Run diagnostic to confirm prefault status:

# Check if MAP_POPULATE is in actual mmap calls
strace -e mmap2 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | head -20

# Check compiler flags
grep -i prefault Makefile

# Check environment variables
env | grep -i HAKMEM | grep -i PREFAULT

If Prefault is Disabled → Enable It

Then re-run profiling to verify improvement.

If Prefault is Enabled → Move to Option B (TLB)

Focus on reducing 48% TLB miss rate.

📈 Expected Outcome After All Fixes

Factor	Current	After	Gain
Page faults	7,672	500-1000	8-15x
TLB misses	48.65%	~5%	3-5x
Page zeroing	12%	2%	2x
Total per-op time	72.6 cycles	20-25 cycles	3-4x
Throughput	1.06M ops/s	3.5-4M ops/s	3-4x

9.7 KiB Raw Blame History Unescape Escape

Comprehensive Profiling Analysis: HAKMEM Performance Gaps

🔍 Executive Summary

Current Performance Metrics

🚨 KEY FINDING: Prefault is NOT working as expected!

📊 Detailed Performance Breakdown

Random Mixed Workload (1M allocations, 256 slots, sizes 16-1040B)

Tiny Hot Workload (10M allocations, fixed size)

Comparison: Why are cycles similar?

🎯 Critical Findings

Finding 1: Page Faults Are NOT Being Reduced

Finding 2: TLB Misses Are HIGH (48.65%)

Finding 3: Both Workloads Are Similar

📈 Layer Analysis

Kernel vs User Split

User-Space Functions Visible in Profile

🔧 What's Really Happening

Current State (POST-Prefault Box)

Why Prefault Isn't Working

🎓 What We Learned From Previous Analysis

Possible Explanation

🎯 Next Steps - Three Options

📋 Option A: Verify Prefault is Actually Enabled

🎯 Option B: Reduce TLB Misses (48.65% → ~5%)

🚀 Option C: Reduce Page Zeroing (12% → ~2%)

📊 Recommendation

Most Impactful (Order of Preference):

🧪 Recommended Next Action

Immediate (This Session)

If Prefault is Disabled → Enable It

If Prefault is Enabled → Move to Option B (TLB)

📈 Expected Outcome After All Fixes

9.7 KiB

Raw Blame History