# HAKMEM Performance Bottleneck Executive Summary
**Date**: 2025-12-04
**Analysis Type**: Comprehensive Performance Profiling
**Status**: CRITICAL BOTTLENECK IDENTIFIED
---
## The Problem
**Current Performance**: 4.1M ops/s
**Target Performance**: 16M+ ops/s (4x improvement)
**Performance Gap**: 3.9x remaining
---
## Root Cause: Page Fault Storm
**The smoking gun**: 69% of execution time is spent handling page faults.
### The Evidence
```
perf stat:
- 132,509 page faults / 1,000,000 operations → 13.25% of operations trigger a page fault
- 1,146 cycles per operation (vs. ~286 cycles at the 4x target)
- ~690 cycles per operation spent in kernel page-fault handling (~60% of total time)

perf report:
- unified_cache_refill: 69.07% of total time (with children)
  └─ 60%+ is the kernel page-fault handling chain:
     - clear_page_erms: 11.25% (zeroing newly allocated pages)
     - do_anonymous_page: 20%+ (allocating kernel folios)
     - folio_add_new_anon_rmap: 7.11% (adding to reverse map)
     - folio_add_lru_vma: 4.88% (adding to LRU list)
     - __mem_cgroup_charge: 4.37% (memory cgroup accounting)
```
### Why This Matters
Every time `unified_cache_refill` allocates memory from a SuperSlab, it writes to
previously unmapped memory. This triggers a page fault, forcing the kernel to:
1. **Allocate a physical page** (rmqueue: 2.03%)
2. **Zero the page for security** (clear_page_erms: 11.25%)
3. **Set up page tables** (handle_pte_fault, __pte_offset_map: 3-5%)
4. **Add to LRU lists** (folio_add_lru_vma: 4.88%)
5. **Charge memory cgroup** (__mem_cgroup_charge: 4.37%)
6. **Update reverse map** (folio_add_new_anon_rmap: 7.11%)
**Total kernel overhead**: ~690 cycles per operation (60% of 1,146 cycles)
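This cost is easy to reproduce from user space. A minimal sketch (the helper names are illustrative, not part of HAKMEM) that first-touches fresh anonymous pages and reads the process minor-fault counter via `getrusage()`:

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Cumulative minor (soft) page-fault count for this process. */
static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

/* Map `npages` fresh anonymous pages, then write one byte to each page.
 * Returns the number of minor faults triggered by the touch loop alone:
 * on Linux, roughly one fault per page -- the storm described above. */
long fault_delta_first_touch(size_t npages) {
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *p = mmap(NULL, npages * pagesz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    long before = minor_faults();
    for (size_t i = 0; i < npages; i++)
        p[i * pagesz] = 1;              /* first touch: one fault per page */
    long delta = minor_faults() - before;

    munmap((void *)p, npages * pagesz);
    return delta;
}
```

Every first write to a fresh anonymous page pays the full kernel chain above, which is what `unified_cache_refill` keeps doing.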
---
## Secondary Bottlenecks
### 1. Branch Mispredictions (9.04% miss rate)
- 21M mispredictions / 1M operations = 21 misses per op
- Each miss costs ~15-20 cycles = 315-420 cycles wasted per op
- Indicates complex control flow in allocation path
### 2. Speculation Mitigation (5.44% overhead)
- srso_alias_safe_ret: 2.85%
- srso_alias_return_thunk: 2.59%
- CPU security features (Spectre/Meltdown) add indirect branch overhead
- Cannot be eliminated but can be minimized
### 3. Cache Misses (Moderate)
- L1 D-cache misses: 17.2 per operation
- Cache miss rate: 13.03% of cache references
- At ~10 cycles per L1 miss = ~172 cycles per op
- Not catastrophic but room for improvement
---
## The Path to 4x Performance
### Immediate Action: Pre-fault SuperSlab Memory
**Solution**: Add `MAP_POPULATE` flag to `mmap()` calls in SuperSlab acquisition
**Implementation**:
```c
// In superslab_acquire():
void *ptr = mmap(NULL, SUPERSLAB_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,  // add MAP_POPULATE
                 -1, 0);
if (ptr == MAP_FAILED) {
    // handle failure in the caller (fall back or propagate the error)
    return NULL;
}
```
**Expected Impact**:
- Eliminates the runtime page faults that consume 60-70% of execution time
- Trades startup time for runtime performance
- **Expected speedup: 2-3x (8.2M - 12.3M ops/s)**
- **Effort: 1 hour**
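The impact is directly verifiable. A sketch (hypothetical helper names, not HAKMEM code) showing that with `MAP_POPULATE` the touch loop itself no longer faults, because the kernel prefaults the whole range inside `mmap()`:

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

static long minor_faults_now(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

/* Map `npages` with MAP_POPULATE, then touch every page.
 * The faults are paid once inside mmap(), so the touch loop should
 * observe (almost) no minor faults -- the runtime storm disappears. */
long populated_touch_faults(size_t npages) {
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *p = mmap(NULL, npages * pagesz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    long before = minor_faults_now();
    for (size_t i = 0; i < npages; i++)
        p[i * pagesz] = 1;              /* already resident: no fault expected */
    long delta = minor_faults_now() - before;

    munmap((void *)p, npages * pagesz);
    return delta;
}
```

Comparing this delta against the first-touch case is the quickest sanity check before re-running the full benchmark.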
### Follow-up: Profile-Guided Optimization (PGO)
**Solution**: Build with `-fprofile-generate`, run benchmark, rebuild with `-fprofile-use`
**Expected Impact**:
- Optimizes branch layout for common paths
- Reduces branch misprediction rate from 9% to ~6-7%
- **Expected speedup: 1.2-1.3x on top of prefaulting**
- **Effort: 2 hours**
### Advanced: Transparent Hugepages
**Solution**: Use `mmap(MAP_HUGETLB)` for 2MB pages instead of 4KB pages
**Expected Impact**:
- Reduces page fault count by 512x (4KB → 2MB)
- Reduces TLB pressure significantly
- **Expected speedup: 2-4x**
- **Effort: 1 day (with fallback logic)**
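A hedged sketch of the fallback logic (the function name is illustrative and HAKMEM's real acquisition path may differ; explicit hugepages assume the system has some reserved):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Try explicit 2 MiB hugepages first, fall back to regular 4 KiB pages.
 * `size` should be a multiple of 2 MiB for the hugetlb case. */
void *superslab_map_huge(size_t size) {
    /* Explicit hugepages: fails with ENOMEM unless hugepages are
     * reserved (e.g. via /proc/sys/vm/nr_hugepages). */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;

    /* Fallback: regular pages, with a best-effort transparent-hugepage hint. */
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, size, MADV_HUGEPAGE);   /* advisory only; safe to ignore failure */
    return p;
}
```

The `MADV_HUGEPAGE` fallback keeps the allocator working on systems with no reserved hugepages while still letting THP recover most of the TLB benefit.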
---
## Conservative Performance Projection
| Optimization | Speedup | Cumulative | Ops/s | Effort |
|-------------|---------|------------|-------|--------|
| Baseline | 1.0x | 1.0x | 4.1M | - |
| MAP_POPULATE | 2.5x | 2.5x | 10.3M | 1 hour |
| PGO | 1.25x | 3.1x | 12.7M | 2 hours |
| Branch hints | 1.1x | 3.4x | 14.0M | 4 hours |
| Cache layout | 1.15x | 3.9x | **16.0M** | 2 hours |
**Total effort to reach 4x target**: ~1 day of development
---
## Aggressive Performance Projection
| Optimization | Speedup | Cumulative | Ops/s | Effort |
|-------------|---------|------------|-------|--------|
| Baseline | 1.0x | 1.0x | 4.1M | - |
| Hugepages | 3.0x | 3.0x | 12.3M | 1 day |
| PGO | 1.3x | 3.9x | 16.0M | 2 hours |
| Branch optimization | 1.2x | 4.7x | 19.3M | 4 hours |
| Prefetching | 1.15x | 5.4x | **22.1M** | 4 hours |
**Total effort to reach 5x+**: ~2 days of development
---
## Recommended Action Plan
### Phase 1: Immediate (Today)
1. Add MAP_POPULATE to superslab mmap() calls
2. Verify page fault count drops to near-zero
3. Measure new throughput (expect 8-12M ops/s)
### Phase 2: Quick Wins (Tomorrow)
1. Build with PGO (-fprofile-generate/use)
2. Add __builtin_expect() hints to hot paths
3. Measure new throughput (expect 12-16M ops/s)
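The hints from step 2 can look like this (the cache structure is a stand-in, not HAKMEM's actual layout):

```c
#include <stddef.h>

/* __builtin_expect tells GCC/Clang which branch to lay out as the
 * fall-through (common) path, cutting mispredictions on the hot path. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

typedef struct {
    void  *slots[64];
    size_t count;
} tiny_cache_t;

/* Hot path: cache hit. Cold path: return NULL so the caller refills. */
void *cache_pop(tiny_cache_t *c) {
    if (LIKELY(c->count > 0))
        return c->slots[--c->count];
    return NULL;   /* miss: slow path (e.g. a refill call) */
}
```

PGO generally subsumes manual hints on paths the profile covers; hand-placed `LIKELY`/`UNLIKELY` still help on paths the training run misses.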
### Phase 3: Advanced (This Week)
1. Implement hugepage support with fallback
2. Optimize data structure layout for cache
3. Add prefetch hints for predictable accesses
4. Target: 16-24M ops/s
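Step 3's prefetch hints, sketched on an illustrative pointer-chasing walk (not HAKMEM's data structures):

```c
#include <stddef.h>

/* __builtin_prefetch is a GCC/Clang hint; it affects latency only,
 * never results. */
typedef struct node {
    struct node *next;
    long value;
} node_t;

long sum_list(const node_t *n) {
    long sum = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next, /*rw=*/0, /*locality=*/1);
        sum += n->value;   /* overlap the fetch of n->next with this work */
        n = n->next;
    }
    return sum;
}
```

Prefetching only pays off when the next address is known several dozen cycles before it is dereferenced, so it suits the allocator's predictable slab scans rather than random free-list hops.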
---
## Key Metrics Summary
| Metric | Current | Target | Status |
|--------|---------|--------|--------|
| Throughput | 4.1M ops/s | 16M ops/s | 🔴 25% of target |
| Cycles/op | 1,146 | ~286 | 🔴 4.0x too slow |
| Page faults | 132,509 | <1,000 | 🔴 132x too many |
| IPC | 0.97 | 0.97 | 🟢 Optimal |
| Branch misses | 9.04% | <5% | 🟡 Moderate |
| Cache misses | 13.03% | <10% | 🟡 Moderate |
| Kernel time | 60% | <5% | 🔴 Critical |
---
## Files Generated
1. **PERF_BOTTLENECK_ANALYSIS_20251204.md** - Full detailed analysis with recommendations
2. **PERF_RAW_DATA_20251204.txt** - Raw perf stat/report output for reference
3. **EXECUTIVE_SUMMARY_BOTTLENECK_20251204.md** - This file (executive overview)
---
## Conclusion
The performance gap is **not a mystery**. The profiling data clearly shows that
**60-70% of execution time is spent in kernel page fault handling**.
The fix is straightforward: **pre-fault memory with MAP_POPULATE** and eliminate
the runtime page fault overhead. This single change should deliver 2-3x improvement,
putting us at 8-12M ops/s. Combined with PGO and minor branch optimizations,
we can confidently reach the 4x target (16M+ ops/s).
**Next Step**: Implement MAP_POPULATE in superslab_acquire() and re-measure.