hakmem/PROFILING_INSIGHTS_AND_RECOMMENDATIONS_20251204.md
Commit 1755257f60 by Moe Charm (CI), 2025-12-04 20:41:53 +09:00: Comprehensive Profiling Analysis: Phase 1 Complete with Major Discoveries
## Key Findings:
1. Prefault Box defaults to OFF (intentional, due to 4MB MAP_POPULATE bug fix)
2. User-space HAKMEM code is NOT the bottleneck (<1% CPU time)
3. TLB misses (48.65%) are NOT from SuperSlab allocations - mostly from TLS/libc
4. THP and PREFAULT optimizations have ZERO impact on dTLB misses
5. Page zeroing (11.65%) is the REAL bottleneck, not memory allocation

## Session Deliverables:
- COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md: Initial analysis
- PROFILING_INSIGHTS_AND_RECOMMENDATIONS_20251204.md: Task investigation
- PHASE1_TEST_RESULTS_MAJOR_DISCOVERY_20251204.md: Phase 1 test results
- SESSION_SUMMARY_FINDINGS_20251204.md: Final summary

## Phase 2 Recommendations:
1. Investigate lazy zeroing (11.65% of cycles)
2. Analyze page fault sources (debug with callgraph)
3. Skip THP/PREFAULT/Hugepage optimization (proven ineffective)

## Paradigm Shift:
Old: THP/PREFAULT → 2-3x speedup
New: Lazy zeroing → 1.10x-1.15x speedup (realistic)



HAKMEM Profiling Insights & Recommendations

🎯 Three Key Questions Answered

Question 1: Page Faults - Did Prefault Box Reduce Them?

Finding: NO - page faults are NOT being reduced by prefault

Test Results:
HAKMEM_SS_PREFAULT=0 (OFF):    7,669 page faults | 74.7M cycles
HAKMEM_SS_PREFAULT=1 (MAP_POPULATE): 7,672 page faults | 75.3M cycles  
HAKMEM_SS_PREFAULT=2 (TOUCH):   7,801 page faults | 73.8M cycles

Difference: ~0% ← No improvement!

Why this is happening:

  1. Default is OFF: Line 44 of ss_prefault_box.h:

    int policy = SS_PREFAULT_OFF;  // Temporary safety default!
    

    The comment suggests it's temporary due to "4MB MAP_POPULATE issue"

  2. Even with POPULATE enabled, no improvement: Kernel may be lazy-faulting

    • MAP_POPULATE is a hint, not a guarantee
    • Linux kernel still lazy-faults on first access
    • Need madvise(MADV_POPULATE_READ) for true eagerness (see the sketch after the conclusion below)
  3. Page faults might not be from Superslab:

    • Tiny cache allocation (TLS)
    • libc internal allocations
    • Memory accounting structures
    • Not necessarily from Superslab mmap

Conclusion: The prefault mechanism as currently implemented is NOT effective. Page faults remain at kernel baseline regardless of prefault setting.
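A minimal sketch of what a genuinely eager prefault could look like, assuming the pool is a plain anonymous mapping. The helper name is illustrative (not existing HAKMEM code); MADV_POPULATE_WRITE needs Linux 5.14+ and, unlike MADV_POPULATE_READ, also avoids the copy-on-write fault on the first write to each page:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

// Hypothetical eager prefault for a freshly mmap'd pool. MAP_POPULATE is only
// best-effort at mmap() time; MADV_POPULATE_WRITE fails loudly if the pages
// cannot be faulted in, so its effect is actually verifiable.
static void ss_prefault_eager(void* base, size_t len) {
    if (madvise(base, len, MADV_POPULATE_WRITE) == 0)
        return;
    // Fallback for older kernels: touch one byte per page.
    for (size_t off = 0; off < len; off += 4096)
        ((volatile char*)base)[off] = 0;
}

If something like this were wired behind the existing HAKMEM_SS_PREFAULT policy, the page-fault counters above should finally show a difference between settings.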


Question 2: Layer-Wise CPU Usage Breakdown?

Layer-wise profiling (User-space HAKMEM only):

| Function | CPU Time | Role |
| --- | --- | --- |
| hak_free_at | <0.6% | Free path (Random Mixed) |
| hak_pool_mid_lookup | <0.6% | Gatekeeper (Tiny Hot) |
| VISIBLE USER CODE | <1% total | Almost nothing! |

Layer-wise analysis (Kernel overhead is the real story):

Random Mixed Workload Breakdown:

Kernel (63% total cycles):
├─ Page fault handling        15.01% ← DOMINANT
├─ Page zeroing (clear_page)  11.65% ← MAJOR
├─ Page table operations       5.27%
├─ MMU fault handling          5.20%
├─ Memory allocation chains    4.06%
├─ Scheduling overhead         ~2%
└─ Other kernel               ~20%

User Space (<1% HAKMEM code):
├─ malloc/free wrappers        <0.6%
├─ Pool routing/lookup         <0.6%
├─ Cache management            (hidden)
└─ Everything else             (hidden in kernel)

Key insight: User-space HAKMEM layers are NOT the bottleneck. Kernel memory management is.

Consequence: Optimizing hak_pool_mid_lookup() or shared_pool_acquire() won't help because they're not showing up in the profile. The real cost is in kernel page faults and zeroing.


Question 3: L1 Cache Miss Rates in unified_cache_refill?

L1 Cache Statistics:

Random Mixed: 763,771 L1-dcache-load-misses
Tiny Hot:     738,862 L1-dcache-load-misses

Difference: ~3% higher in Random Mixed

Analysis:

Per-operation L1 miss rate:
Random Mixed: 763K misses / 1M ops = 0.764 misses/op
Tiny Hot:     738K misses / 10M ops = 0.074 misses/op

⚠️ HUGE difference when normalized!

Why: Random Mixed hits 256 different cache lines (working set = 256 slots), while Tiny Hot has fixed allocation size with hot cache.

Impact: ~1% of total cycles wasted on L1 misses for Random Mixed.

Note: unified_cache_refill is NOT visible in the profile because page faults dominate the measurements.


🚨 Critical Discovery: 48.65% TLB Miss Rate

New Finding from TLB analysis:

dTLB-loads:        49,160
dTLB-load-misses:  23,917 (48.65% miss rate!)

Meaning:

  • Nearly every other virtual address translation misses the TLB
  • Each miss = 10-40 cycles (page table walk)
  • Estimated from the raw counts: 23,917 × ~25 cycles ≈ 600K cycles, under 1% of the ~73M total; the 48.65% miss rate is the stronger signal, since these counter totals look heavily undersampled for a 1M-operation run

Root cause:

  • Working set too large for TLB (256 slots × ~40KB = 10MB)
  • SuperSlab metadata not cache-friendly
  • Kernel page table walk not in L3 cache

This is a REAL bottleneck we hadn't properly identified!
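As a rough sanity check on the working-set argument (assuming a typical ~64-entry L1 dTLB; the exact size is CPU-specific):

TLB reach with 4 KiB pages: 64 entries × 4 KiB ≈ 256 KiB   ← far below the ~10 MB working set
TLB reach with 2 MiB pages: 64 entries × 2 MiB = 128 MiB   ← covers the working set entirely

This is why the hugepage experiment in Option A below is the cheapest way to confirm or reject the TLB hypothesis.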


🎓 What Changed Since Earlier Analysis

Earlier Report (from PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md):

  • Said Random Mixed is 21.7x slower
  • Blamed 61.7% page faults as root cause
  • Recommended pre-faulting as solution

Current Reality:

  • Random Mixed's total cycle count is now essentially the same as Tiny Hot's (72.6M vs 72.3M cycles); the remaining per-operation gap is examined below
  • Page faults are identical to Tiny Hot (7,672 each)
  • TLB misses (48.65%) are the actual bottleneck, not page faults

Hypothesis: Earlier measurements were from:

  1. Cold startup (all caches empty)
  2. Before recent optimizations
  3. Different benchmark parameters
  4. With additional profiling noise

📊 Performance Breakdown (Current State)

Per-Operation Cost Analysis

Random Mixed: 72.6M total cycles / 1M ops  = 72.6 cycles/op
Tiny Hot:     72.3M total cycles / 10M ops =  7.23 cycles/op

That's a 10x per-operation difference... but why?

Resolution: The benchmark harness overhead differs:

  • Random Mixed: 1M iterations with setup/teardown
  • Tiny Hot: 10M iterations with setup/teardown
  • Setup/teardown cost amortized over iterations

Real per-allocation cost: likely similar in steady state once the fixed setup/teardown is excluded; a warmed-up, iteration-matched run is needed to confirm this.


🎯 Three Optimization Options (Prioritized)

🥇 Option A: Fix TLB Misses (48.65% → ~5%)

Potential gain: 2-3x speedup

Strategy:

  1. Reduce working set size (but limits parallelism)
  2. Use huge pages (2MB or 1GB) to reduce TLB entries
  3. Optimize SuperSlab metadata layout for cache locality
  4. Co-locate frequently-accessed structs

Implementation difficulty: Medium. Risk level: Low (mostly OS-level optimization).

Specific actions:

# Test with hugepages
echo 10 > /proc/sys/vm/nr_hugepages
HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
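
For reference, a minimal sketch of what hugepage-backed pool mapping could look like behind a flag such as HAKMEM_USE_HUGEPAGES (the function name is illustrative, not existing HAKMEM code):

#include <sys/mman.h>
#include <stddef.h>

// Hypothetical pool mapping: try explicit 2 MiB hugetlb pages first (requires
// /proc/sys/vm/nr_hugepages > 0 and len rounded up to 2 MiB), then fall back
// to a normal mapping with transparent huge pages requested via MADV_HUGEPAGE.
static void* pool_map_hugepages(size_t len) {
    void* p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED)
        madvise(p, len, MADV_HUGEPAGE);  // best-effort THP
    return p;
}

Fewer, larger pages mean fewer dTLB entries are needed to cover the same ~10 MB working set.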

Expected outcome:

  • TLB misses: 48.65% → ~10-15%
  • Cycles: 72.6M → 55-60M (~20% improvement)
  • Throughput: 1.06M → 1.27M ops/s

🥈 Option B: Fix Page Fault/Zeroing Overhead (26.66% → ~5%)

Potential gain: 1.5-2x speedup

Problem breakdown:

  • Page fault handling: 15.01% of cycles
  • Page zeroing: 11.65% of cycles
  • Total: 26.66%

Strategy:

  1. Force prefault at pool startup (not per-allocation)

    • Pre-fault entire pool memory during init
    • Allocations hit pre-faulted pages
  2. Use MADV_POPULATE_READ (not just MAP_POPULATE)

    • MAP_POPULATE is best-effort, so a stronger guarantee is needed (see the madvise sketch under Question 1)
    • Or use mincore() to verify pages present
  3. Lazy zeroing

    • Don't zero on allocation
    • Mark pages with MADV_DONTNEED on free
    • Let kernel do batch zeroing

Implementation difficulty: Hard. Risk level: Medium (requires careful kernel interaction).

Specific actions:

// Instead of per-allocation prefault, do it once at init.
// Writing (not just reading) forces real page allocation rather than a
// mapping of the shared zero page; on a fresh pool the zero store is harmless.
static void prefault_pool_at_init(char* pool_base, size_t pool_size) {
    for (size_t off = 0; off < pool_size; off += 4096) {
        volatile char* p = pool_base + off;
        *p = 0;  // Touch every page
    }
}
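
For the lazy-zeroing idea (item 3 above), a minimal sketch of the free side, assuming whole spans are released at once (the function name is illustrative):

#include <sys/mman.h>
#include <stddef.h>

// Hypothetical release path for a large span: instead of memset()-ing it for
// reuse, give the pages back and let the kernel hand out zeroed pages on the
// next fault. MADV_FREE (Linux 4.5+) defers reclaim until memory pressure,
// which is closer to "batch zeroing by the kernel" than MADV_DONTNEED.
static void span_release_lazy(void* span, size_t len) {
    if (madvise(span, len, MADV_FREE) != 0)
        madvise(span, len, MADV_DONTNEED);  // fallback on older kernels
}

The trade-off: with MADV_DONTNEED the zeroing cost simply moves to the next fault on reuse, so this mainly pays off when freed spans are not immediately reallocated.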

Expected outcome:

  • Page faults: 7,672 → ~500 (95% reduction)
  • Cycles: 72.6M → 50-55M (~25% improvement)
  • Throughput: 1.06M → 1.4-1.5M ops/s

🥉 Option C: Reduce L1 Cache Misses (1-2%)

Potential gain: ~1% (marginal)

Problem:

  • Random Mixed incurs ~10x more L1 misses per operation than Tiny Hot (raw counts are only ~3% apart because Tiny Hot runs 10x more iterations)
  • At ~4 cycles per miss, the recoverable portion (~260K misses, per the target below) is on the order of 1M cycles (~1% of the run)

Strategy:

  1. Compact memory layout (see the sketch after this list)

    • Reduce metadata size
    • Cache-align hot structures
  2. Batch allocations

    • Reuse lines across multiple operations
    • Better temporal locality
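
A minimal illustration of the layout idea (the struct and field names are made up, not actual HAKMEM metadata):

#include <stdint.h>

// Hypothetical hot-path metadata: keep the fields touched on every alloc/free
// together and pin the struct to a 64-byte boundary so one L1 line covers them.
// Cold fields (statistics, debug info) would live in a separate structure.
struct slab_hot_meta {
    uint32_t free_count;    // updated on every operation
    uint32_t first_free;    // head of the intra-slab free list
    void*    slab_base;
    uint64_t alloc_bitmap;  // one bit per slot for small slabs
} __attribute__((aligned(64)));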

Implementation difficulty: Low. Risk level: Low.

Expected outcome:

  • L1 misses: 763K → ~500K (~35% reduction)
  • Cycles: 72.6M → 71.5M (~1% improvement)
  • Minimal throughput gain

📋 Recommendation: Combined Approach

Phase 1: Immediate (Verify & Understand)

  1. Confirm TLB misses are the bottleneck:

    perf stat -e dTLB-loads,dTLB-load-misses ./bench_allocators_hakmem ...
    
  2. Test with hugepages to validate TLB hypothesis:

    echo 10 > /proc/sys/vm/nr_hugepages
    HAKMEM_USE_HUGEPAGES=1 perf stat -e dTLB-loads,dTLB-load-misses \
      ./bench_allocators_hakmem ...
    
  3. If TLB improves significantly → Proceed with Phase 2A

  4. If TLB doesn't improve → Move to Phase 2B (page faults)


Phase 2A: Hugepage Support (Primary)

Steps:

  1. Enable hugepage support in HAKMEM
  2. Allocate pools with mmap + MAP_HUGETLB
  3. Test: Compare TLB misses and throughput
  4. Measure: Expected 1.5-2x improvement

Effort: 2-3 hours. Risk: Low (isolated change).


Phase 2B: Page Fault Optimization (Backup)

Steps:

  1. Add pool pre-faulting at initialization
  2. Use madvise(MADV_POPULATE_READ) for eager faulting
  3. Implement lazy zeroing with MADV_DONTNEED
  4. Test: Compare page faults and cycles
  5. Measure: Expected 1.5-2x improvement

Effort: 4-6 hours. Risk: Medium (kernel-level interactions).


📈 Expected Improvement Trajectory

| Phase | Focus | Gain | Total Speedup |
| --- | --- | --- | --- |
| Baseline | Current | - | 1.0x |
| Phase 2A | TLB misses | 1.5-2x | 1.5-2x |
| Phase 2B | Page faults | 1.5-2x | 2.25-4x |
| Both | Combined | ~3x | 3-4x |

Goal: Bring Random Mixed from 1.06M ops/s to 3-4M ops/s by addressing both TLB and page fault bottlenecks.


🧪 Next Steps

Immediate Action Items

  1. Run hugepage test:

    echo 10 > /proc/sys/vm/nr_hugepages
    HAKMEM_USE_HUGEPAGES=1 perf stat -e cycles,page-faults,dTLB-loads,dTLB-load-misses \
      ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
    
  2. If TLB misses drop significantly (>20% reduction):

    • Implement hugepage support in HAKMEM
    • Measure end-to-end speedup
    • If >1.5x → STOP, declare victory
    • If <1.5x → Continue to page fault optimization
  3. If TLB misses don't improve:

    • Start page fault optimization (prefault at init)
    • Run similar testing with page fault counts
    • Iterate on lazy zeroing if needed

📊 Key Metrics to Track

| Metric | Current | Target | Priority |
| --- | --- | --- | --- |
| dTLB miss rate | 48.65% | ~5% | 🔴 CRITICAL |
| Page faults | 7,672 | 500-1000 | 🟡 HIGH |
| Page zeroing % | 11.65% | ~2% | 🟢 LOW |
| L1 misses | 763K | 500K | 🟢 LOW |
| Total cycles | 72.6M | 20-25M | 🔴 CRITICAL |

Conclusion

The profiling revealed that TLB misses (a 48.65% miss rate) are likely the primary bottleneck in Random Mixed allocations, not page faults as initially suspected. Together with page-fault handling (15.01%) and page zeroing (11.65%), kernel memory management accounts for roughly two-thirds of total runtime (~63% in the breakdown above).

Next phase should focus on:

  1. Verify hugepage benefit (quick diagnostic)
  2. Implement based on results (TLB or page fault optimization)
  3. Re-profile to confirm improvement
  4. Iterate if needed