Files
hakmem/COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md
Moe Charm (CI) 1755257f60 Comprehensive Profiling Analysis: Phase 1 Complete with Major Discoveries
## Key Findings:
1. Prefault Box defaults to OFF (intentional, due to 4MB MAP_POPULATE bug fix)
2. User-space HAKMEM code is NOT the bottleneck (<1% CPU time)
3. TLB misses (48.65%) are NOT from SuperSlab allocations - mostly from TLS/libc
4. THP and PREFAULT optimizations have ZERO impact on dTLB misses
5. Page zeroing (11.65%) is the REAL bottleneck, not memory allocation

## Session Deliverables:
- COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md: Initial analysis
- PROFILING_INSIGHTS_AND_RECOMMENDATIONS_20251204.md: Task investigation
- PHASE1_TEST_RESULTS_MAJOR_DISCOVERY_20251204.md: Phase 1 test results
- SESSION_SUMMARY_FINDINGS_20251204.md: Final summary

## Phase 2 Recommendations:
1. Investigate lazy zeroing (11.65% of cycles)
2. Analyze page fault sources (debug with callgraph)
3. Skip THP/PREFAULT/Hugepage optimization (proven ineffective)

## Paradigm Shift:
Old: THP/PREFAULT → 2-3x speedup
New: Lazy zeroing → 1.10x-1.15x speedup (realistic)

🐱 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 20:41:53 +09:00

9.7 KiB
Raw Blame History

Comprehensive Profiling Analysis: HAKMEM Performance Gaps

🔍 Executive Summary

After the Prefault Box + MAP_POPULATE fix, the profiling shows:

Current Performance Metrics

Metric Random Mixed Tiny Hot Gap
Cycles (lower is better) 72.6M 72.3M SAME 🤯
Page Faults 7,672 7,672 IDENTICAL ⚠️
L1 Cache Misses 763K 738K Similar
Throughput ~1.06M ops/s ~1.23M ops/s 1.16x
Instructions/Cycle 0.74 0.73 Similar
TLB Miss Rate 48.65% (dTLB) N/A High

🚨 KEY FINDING: Prefault is NOT working as expected!

Problem: Both Random Mixed and Tiny Hot have identical page faults (7,672), which suggests:

  1. ✗ Prefault Box is either disabled or ineffective
  2. ✗ Page faults are coming from elsewhere (not Superslab mmap)
  3. ✗ MAP_POPULATE flag is not preventing runtime faults

📊 Detailed Performance Breakdown

Random Mixed Workload (1M allocations, 256 slots, sizes 16-1040B)

Top Kernel Functions (by CPU time):

15.01% asm_exc_page_fault       ← Page fault handling
11.65% clear_page_erms          ← Page zeroing
 5.27% zap_pte_range            ← Memory cleanup
 5.20% handle_mm_fault          ← MMU fault handling
 4.06% do_anonymous_page        ← Anonymous page allocation
 3.18% __handle_mm_fault        ← Nested fault handling
 2.35% rmqueue_bulk             ← Allocator backend
 2.35% __memset_avx2_unaligned  ← Memory operations
 2.28% do_user_addr_fault       ← User fault handling
 1.77% arch_exit_to_user_mode   ← Context switch

Kernel overhead: ~63% of cycles L1 DCL misses: 763K / operations Branch miss rate: 11.94%

Tiny Hot Workload (10M allocations, fixed size)

Top Kernel Functions (by CPU time):

14.19% asm_exc_page_fault       ← Page fault handling
12.82% clear_page_erms          ← Page zeroing
 5.61% __memset_avx2_unaligned  ← Memory operations
 5.02% do_anonymous_page        ← Anonymous page allocation
 3.31% mem_cgroup_commit_charge ← Memory accounting
 2.67% __handle_mm_fault        ← MMU fault handling
 2.45% do_user_addr_fault       ← User fault handling

Kernel overhead: ~66% of cycles L1 DCL misses: 738K / operations Branch miss rate: 11.03%

Comparison: Why are cycles similar?

Random Mixed: 72.6M cycles / 1M ops = 72.6 cycles/op
Tiny Hot:     72.3M cycles / 10M ops = 7.23 cycles/op

⚠️ THROUGHPUT DIFFERENCE UNEXPLAINED!

The cycles are nearly identical, but throughput differs because:

  • Random Mixed: Measuring only 1M operations (baseline)
  • Tiny Hot: Measuring 10M operations (10x scale)
  • Real throughput: Would need to normalize: 72.6M cycles / 1M = 72.6 vs 7.23 per op

This means Random Mixed is NOT actually slower in per-operation cost!


🎯 Critical Findings

Finding 1: Page Faults Are NOT Being Reduced

Observed:

  • Random Mixed: 7,672 page faults
  • Tiny Hot: 7,672 page faults
  • Difference: 0 ← This is wrong!

Expected (with prefault):

  • Random Mixed: 7,672 → maybe 100-500 (90% reduction)
  • Tiny Hot: 7,672 → ~50-100 (minimal change)

Hypothesis:

  • Prefault Box may not be enabled
  • Or MAP_POPULATE is not working on this kernel
  • Or allocations are hitting kernel-internal mmap (not Superslab)

Finding 2: TLB Misses Are HIGH (48.65%)

dTLB-loads:        49,160
dTLB-load-misses:  23,917 (48.65% miss rate!)
iTLB-load-misses:  17,590 (7748.90% - kernel measurement artifact)

Meaning: Nearly half of TLB lookups fail, causing page table walks.

Why this matters:

  • Each TLB miss = ~10-40 cycles (vs 1-3 for hit)
  • 23,917 × 25 cycles = ~600K wasted cycles
  • That's ~10% of total runtime!

Finding 3: Both Workloads Are Similar

Despite different access patterns:

  • Both spend 15% on page fault handling
  • Both spend 12% on page zeroing
  • Both have similar L1 miss rates
  • Both have similar branch miss rates

Conclusion: The memory subsystem is the bottleneck for BOTH workloads, not user-space code.


📈 Layer Analysis

Kernel vs User Split

Category Random Mixed Tiny Hot Analysis
Kernel (page faults, scheduling, etc) 63% 66% Dominant
Kernel zeroing (clear_page_erms) 11.65% 12.82% Similar
User malloc/free <1% <1% Not visible
User pool/cache logic <1% <1% Not visible

User-Space Functions Visible in Profile

Random Mixed:

0.59% hak_free_at.constprop.0 (hakmem free path)

Tiny Hot:

0.59% hak_pool_mid_lookup (hakmem pool routing)

Conclusion: User-space HAKMEM code is NOT a bottleneck (<1% each).


🔧 What's Really Happening

Current State (POST-Prefault Box)

allocate(size):
  1. malloc wrapper           → <1% cycles
  2. Gatekeeper routing       → ~0.1% cycles
  3. unified_cache_refill     → (hidden in kernel time)
  4. shared_pool_acquire      → (hidden in kernel time)
  5. SuperSlab/mmap call      → Triggers kernel
  6. **KERNEL PAGE FAULTS**   → 15% cycles
  7. clear_page_erms (zero)   → 12% cycles

Why Prefault Isn't Working

Possible reasons:

  1. Prefault Box disabled?

    • Check: HAKMEM_BOX_SS_PREFAULT_ENABLED
    • Or: g_ss_populate_once not being set
  2. MAP_POPULATE not actually pre-faulting?

    • Linux kernel may be lazy even with MAP_POPULATE
    • Need madvise(MADV_POPULATE_READ) to force immediate faulting
    • Or use mincore() to check before allocation
  3. Allocations not from Superslab mmap?

    • Page faults may be from TLS cache allocation
    • Or from libc internal allocations
    • Not from Superslab backend
  4. TLB misses dominating?

    • 48.65% TLB miss rate suggests memory layout issue
    • SuperSlab metadata may not be cache-friendly
    • Working set too large for TLB

🎓 What We Learned From Previous Analysis

From the earlier profiling report, we identified that:

  • Random Mixed was 21.7x slower due to 61.7% page faults
  • Expected with prefault: Should drop to ~5% or less

But NOW we see:

  • Random Mixed is NOT significantly slower (per-op cost is similar)
  • Page faults are identical to Tiny Hot
  • This contradicts expectations

Possible Explanation

The earlier measurements may have been from:

  • Benchmark run at startup (cold caches)
  • With additional profiling overhead
  • Or different workload parameters

The current measurements are:

  • Steady state (after initial allocation)
  • With higher throughput (Tiny Hot = 10M ops)
  • After recent optimizations

🎯 Next Steps - Three Options

📋 Option A: Verify Prefault is Actually Enabled

Goal: Confirm prefault mechanism is working

Steps:

  1. Add debug output to ss_prefault_policy() and ss_prefault_region()
  2. Check if MAP_POPULATE flag is set in actual mmap calls
  3. Run with strace to see mmap flags:
    strace -e mmap2 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | grep MAP_POPULATE
    
  4. Check if madvise(MADV_POPULATE_READ) calls are happening

Expected outcome: Should see MAP_POPULATE or MADV_POPULATE in traces


🎯 Option B: Reduce TLB Misses (48.65% → ~5%)

Goal: Improve memory layout to reduce TLB pressure

Steps:

  1. Analyze SuperSlab metadata layout:

    • Current: Is metadata per-slab or centralized?
    • Check: sp_meta_find_or_create() hot path
  2. Improve cache locality:

    • Cache-align metadata structures
    • Use larger pages (2MB or 1GB hugepages)
    • Reduce working set size
  3. Profile with hugepages:

    echo 10 > /proc/sys/vm/nr_hugepages
    HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
    

Expected gain: 1.5-2x speedup (eliminate TLB miss penalty)


🚀 Option C: Reduce Page Zeroing (12% → ~2%)

Goal: Skip unnecessary page zeroing

Steps:

  1. Analyze what needs zeroing:

    • Are SuperSlab pages truly uninitialized?
    • Can we reuse memory without zeroing?
    • Use MADV_DONTNEED before reuse?
  2. Implement lazy zeroing:

    • Don't zero pages on allocation
    • Only zero used portions
    • Let kernel handle rest on free
  3. Use uninitialized pools:

    • Pre-allocate without zeroing
    • Initialize on-demand

Expected gain: 1.5x speedup (eliminate 12% zero cost)


📊 Recommendation

Based on the analysis:

Most Impactful (Order of Preference):

  1. Fix TLB Misses (48.65%)

    • Potential gain: 1.5-2x
    • Implementation: Medium difficulty
    • Reason: Already showing 48% miss rate
  2. Verify Prefault Actually Works

    • Potential gain: Unknown (currently not working?)
    • Implementation: Easy (debugging)
    • Reason: Should have been solved but showing same page faults
  3. Reduce Page Zeroing

    • Potential gain: 1.5x
    • Implementation: Medium difficulty
    • Reason: 12% of total time

Immediate (This Session)

Run diagnostic to confirm prefault status:

# Check if MAP_POPULATE is in actual mmap calls
strace -e mmap2 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | head -20

# Check compiler flags
grep -i prefault Makefile

# Check environment variables
env | grep -i HAKMEM | grep -i PREFAULT

If Prefault is Disabled → Enable It

Then re-run profiling to verify improvement.

If Prefault is Enabled → Move to Option B (TLB)

Focus on reducing 48% TLB miss rate.


📈 Expected Outcome After All Fixes

Factor Current After Gain
Page faults 7,672 500-1000 8-15x
TLB misses 48.65% ~5% 3-5x
Page zeroing 12% 2% 2x
Total per-op time 72.6 cycles 20-25 cycles 3-4x
Throughput 1.06M ops/s 3.5-4M ops/s 3-4x