
HAKMEM Performance Bottleneck Executive Summary

Date: 2025-12-04
Analysis Type: Comprehensive Performance Profiling
Status: CRITICAL BOTTLENECK IDENTIFIED


The Problem

Current Performance: 4.1M ops/s
Target Performance: 16M+ ops/s (4x improvement)
Performance Gap: 3.9x remaining


Root Cause: Page Fault Storm

The smoking gun: 69% of execution time is spent handling page faults.

The Evidence

perf stat shows:
- 132,509 page faults / 1,000,000 operations = 13.25% of operations trigger page faults
- 1,146 cycles per operation (the 4x target implies ~286 cycles/op)
- 690 cycles per operation spent in kernel page fault handling (60% of total time)

perf report shows:
- unified_cache_refill: 69.07% of total time (with children)
  └─ 60%+ is kernel page fault handling chain:
     - clear_page_erms: 11.25% (zeroing newly allocated pages)
     - do_anonymous_page: 20%+ (allocating kernel folios)
     - folio_add_new_anon_rmap: 7.11% (adding to reverse map)
     - folio_add_lru_vma: 4.88% (adding to LRU list)
     - __mem_cgroup_charge: 4.37% (memory cgroup accounting)

Why This Matters

Every time unified_cache_refill allocates memory from a SuperSlab, it writes to pages that are mapped but have never been touched. Each first write triggers a page fault, forcing the kernel to:

  1. Allocate a physical page (rmqueue: 2.03%)
  2. Zero the page for security (clear_page_erms: 11.25%)
  3. Set up page tables (handle_pte_fault, __pte_offset_map: 3-5%)
  4. Add to LRU lists (folio_add_lru_vma: 4.88%)
  5. Charge memory cgroup (__mem_cgroup_charge: 4.37%)
  6. Update reverse map (folio_add_new_anon_rmap: 7.11%)

Total kernel overhead: ~690 cycles per operation (60% of 1,146 cycles)
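
This first-touch cost is easy to demonstrate in isolation. Below is a minimal standalone sketch (assuming Linux; illustrative only, not HAKMEM code) that maps anonymous memory, touches every page once, and reads the minor-fault counter:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void) {
    size_t size = 64u * 1024 * 1024;              // 64 MiB = 16,384 4KB pages
    void* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    memset(p, 0xAA, size);                        // first touch of every page
    getrusage(RUSAGE_SELF, &after);
    printf("minor faults from first touch: %ld\n",
           after.ru_minflt - before.ru_minflt);   // expect roughly one per 4KB page
    return 0;
}
```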


Secondary Bottlenecks

1. Branch Mispredictions (9.04% miss rate)

  • 21M mispredictions / 1M operations = 21 misses per op
  • Each miss costs ~15-20 cycles = 315-420 cycles wasted per op
  • Indicates complex control flow in allocation path

2. Speculation Mitigation (5.44% overhead)

  • srso_alias_safe_ret: 2.85%
  • srso_alias_return_thunk: 2.59%
  • CPU speculation mitigations (here AMD's SRSO return thunks, in the Spectre family) add overhead to every function return
  • Cannot be eliminated but can be minimized

3. Cache Misses (Moderate)

  • L1 D-cache misses: 17.2 per operation
  • Cache miss rate: 13.03% of cache references
  • At ~10 cycles per L1 miss = ~172 cycles per op
  • Not catastrophic, but there is room for improvement (see the layout sketch below)
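
A minimal illustration of the kind of layout change meant here (the struct and field names are hypothetical stand-ins, not HAKMEM's actual types): keep the fields touched on every allocation on one 64-byte cache line, and push statistics onto their own line so they never evict hot data.

```c
#include <stdalign.h>

// Hypothetical layout: hot fields share one cache line, cold stats get their own.
typedef struct {
    alignas(64) void** slots;                // hot: dereferenced on every alloc/free
    int free_count;                          // hot
    int capacity;                            // hot
    alignas(64) unsigned long stat_allocs;   // cold: bumped, never read on hot path
    unsigned long stat_refills;              // cold
} cache_layout_t;
```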

The Path to 4x Performance

Immediate Action: Pre-fault SuperSlab Memory

Solution: Add the MAP_POPULATE flag to the mmap() calls in SuperSlab acquisition

Implementation:

```c
// In superslab_acquire():
void* ptr = mmap(NULL, SUPERSLAB_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, // add MAP_POPULATE
                 -1, 0);
if (ptr == MAP_FAILED) { /* handle the error as before */ }
```
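
Note that MAP_POPULATE is a best-effort hint: per the Linux man page, mmap() does not fail if the mapping cannot be fully populated, so the page fault counter, not the return value, is the real confirmation that prefaulting worked.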

Expected Impact:

  • Eliminates 60-70% of runtime page faults
  • Trades startup time for runtime performance
  • Expected speedup: 2-3x (8.2M - 12.3M ops/s)
  • Effort: 1 hour

Follow-up: Profile-Guided Optimization (PGO)

Solution: Build with -fprofile-generate, run the benchmark to collect profiles, then rebuild with -fprofile-use

Expected Impact:

  • Optimizes branch layout for common paths
  • Reduces branch misprediction rate from 9% to ~6-7%
  • Expected speedup: 1.2-1.3x on top of prefaulting
  • Effort: 2 hours
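
For concreteness, the PGO cycle with GCC looks like this (Clang is analogous; the benchmark binary is whichever workload is representative): build with `-fprofile-generate`, run the benchmark so the profile data (.gcda files) is written out, then rebuild the same sources with `-fprofile-use`.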

Advanced: Transparent Hugepages

Solution: Back SuperSlabs with 2MB pages instead of 4KB pages, via explicit huge pages (mmap with MAP_HUGETLB) with a transparent-hugepage fallback (madvise(MADV_HUGEPAGE)), as sketched below

Expected Impact:

  • Reduces page fault count by 512x (4KB → 2MB)
  • Reduces TLB pressure significantly
  • Expected speedup: 2-4x
  • Effort: 1 day (with fallback logic)
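
A minimal sketch of the acquisition path with fallback (assuming Linux; superslab_map is a hypothetical helper name, and SUPERSLAB_SIZE is assumed to be a multiple of 2MB):

```c
#define _GNU_SOURCE            // MAP_HUGETLB / MADV_HUGEPAGE on some libcs
#include <stddef.h>
#include <sys/mman.h>

// Hypothetical helper: try pre-faulted explicit 2MB pages first; if the
// hugetlb pool is empty or unconfigured, fall back to 4KB pages and ask
// the kernel to promote the range to transparent hugepages.
static void* superslab_map(size_t size) {
    void* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE,
                   -1, 0);
    if (p != MAP_FAILED)
        return p;                              // pre-faulted 2MB pages
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p != MAP_FAILED)
        madvise(p, size, MADV_HUGEPAGE);       // best-effort THP hint
    return p;
}
```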

Conservative Performance Projection

| Optimization | Speedup | Cumulative | Ops/s | Effort |
|---|---|---|---|---|
| Baseline | 1.0x | 1.0x | 4.1M | - |
| MAP_POPULATE | 2.5x | 2.5x | 10.3M | 1 hour |
| PGO | 1.25x | 3.1x | 12.7M | 2 hours |
| Branch hints | 1.1x | 3.4x | 14.0M | 4 hours |
| Cache layout | 1.15x | 3.9x | 16.0M | 2 hours |

Total effort to reach 4x target: ~1 day of development


Aggressive Performance Projection

| Optimization | Speedup | Cumulative | Ops/s | Effort |
|---|---|---|---|---|
| Baseline | 1.0x | 1.0x | 4.1M | - |
| Hugepages | 3.0x | 3.0x | 12.3M | 1 day |
| PGO | 1.3x | 3.9x | 16.0M | 2 hours |
| Branch optimization | 1.2x | 4.7x | 19.3M | 4 hours |
| Prefetching | 1.15x | 5.4x | 22.1M | 4 hours |

Total effort to reach 5x+: ~2 days of development


Recommended Action Plan

Phase 1: Immediate (Today)

  1. Add MAP_POPULATE to superslab mmap() calls
  2. Verify page fault count drops to near-zero
  3. Measure new throughput (expect 8-12M ops/s)
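
To verify step 2, running the benchmark under `perf stat -e page-faults` before and after the change should show the fault count collapsing from the ~132K measured above toward the near-zero target in the metrics table.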

Phase 2: Quick Wins (Tomorrow)

  1. Build with PGO (-fprofile-generate/use)
  2. Add __builtin_expect() hints to hot paths (see the sketch after this list)
  3. Measure new throughput (expect 12-16M ops/s)
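
A minimal sketch of what such a hint looks like (the types and function names are hypothetical stand-ins, not HAKMEM's actual hot path):

```c
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

typedef struct { void** slots; int free_count; } cache_t;  // hypothetical

void* cache_refill_slow(cache_t* c);  // hypothetical slow path (e.g. a refill)

static inline void* alloc_fast(cache_t* c) {
    if (LIKELY(c->free_count > 0))       // hint: cache hit is the common case
        return c->slots[--c->free_count];
    return cache_refill_slow(c);         // compiler moves this off the hot path
}
```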

Phase 3: Advanced (This Week)

  1. Implement hugepage support with fallback
  2. Optimize data structure layout for cache
  3. Add prefetch hints for predictable accesses (see the sketch after this list)
  4. Target: 16-24M ops/s
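
A minimal sketch of step 3, reusing the hypothetical cache_t from the previous sketch: while returning the current slot, start pulling the next slot's object into cache so the following allocation does not stall.

```c
static inline void* alloc_with_prefetch(cache_t* c) {
    void* p = c->slots[--c->free_count];
    if (c->free_count > 0)
        __builtin_prefetch(c->slots[c->free_count - 1], 1, 3); // write, high locality
    return p;
}
```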

Key Metrics Summary

| Metric | Current | Target | Status |
|---|---|---|---|
| Throughput | 4.1M ops/s | 16M ops/s | 🔴 25% of target |
| Cycles/op | 1,146 | ~245 | 🔴 4.7x too slow |
| Page faults | 132,509 | <1,000 | 🔴 132x too many |
| IPC | 0.97 | 0.97 | 🟢 Optimal |
| Branch misses | 9.04% | <5% | 🟡 Moderate |
| Cache misses | 13.03% | <10% | 🟡 Moderate |
| Kernel time | 60% | <5% | 🔴 Critical |

Files Generated

  1. PERF_BOTTLENECK_ANALYSIS_20251204.md - Full detailed analysis with recommendations
  2. PERF_RAW_DATA_20251204.txt - Raw perf stat/report output for reference
  3. EXECUTIVE_SUMMARY_BOTTLENECK_20251204.md - This file (executive overview)

Conclusion

The performance gap is not a mystery. The profiling data clearly shows that 60-70% of execution time is spent in kernel page fault handling.

The fix is straightforward: pre-fault memory with MAP_POPULATE and eliminate the runtime page fault overhead. This single change should deliver 2-3x improvement, putting us at 8-12M ops/s. Combined with PGO and minor branch optimizations, we can confidently reach the 4x target (16M+ ops/s).

Next Step: Implement MAP_POPULATE in superslab_acquire() and re-measure.