Moe Charm (CI) acc64f2438 Phase ML1: Reduce Pool v1's 89.73% memset overhead (+15.34% improvement)
## Summary
- Fixed the setenv segfault in bench_profile.h with ChatGPT's help (switched to going through RTLD_NEXT)
- Added core/box/pool_zero_mode_box.h: manages ZERO_MODE uniformly through the cached ENV lookup
- memset in core/hakmem_pool.c is now controlled by the zero mode (full/header/off)
- A/B test result: +15.34% improvement with ZERO_MODE=header (1M iterations, C6-heavy)

## Files Modified
- core/box/pool_api.inc.h: includes pool_zero_mode_box.h
- core/bench_profile.h: glibc setenv → malloc+putenv (avoids the segfault)
- core/hakmem_pool.c: zero-mode lookup and control logic
- core/box/pool_zero_mode_box.h (new): enum/getter
- CURRENT_TASK.md: records the Phase ML1 results

## Test Results
| Iterations | ZERO_MODE=full | ZERO_MODE=header | Improvement |
|-----------|----------------|-----------------|------------|
| 10K       | 3.06 M ops/s   | 3.17 M ops/s    | +3.65%     |
| 1M        | 23.71 M ops/s  | 27.34 M ops/s   | **+15.34%** |

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-10 09:08:18 +09:00


HAKMEM Performance Bottleneck Analysis Report
==============================================
Date: 2025-12-04
Current Performance: 4.1M ops/s
Target Performance: 16M+ ops/s (4x improvement)
Performance Gap: 3.9x remaining
When reproducing the mid/smallmid C6-heavy benchmarks, start from the `C6_HEAVY_LEGACY_POOLV1` preset in `docs/analysis/ENV_PROFILE_PRESETS.md`.
## KEY METRICS SUMMARY
### Hardware Performance Counters (3-run average):
- Total Cycles: 1,146M cycles (0.281s @ ~4.08 GHz)
- Instructions: 1,109M instructions
- IPC (Instructions Per Cycle): 0.97 (GOOD - near optimal)
- Branches: 231.7M
- Branch Misses: 21.0M (9.04% miss rate - MODERATE)
- Cache References: 50.9M
- Cache Misses: 6.6M (13.03% miss rate - MODERATE)
- L1 D-cache Load Misses: 17.2M
### Per-Operation Breakdown (1M operations):
- Cycles per op: 1,146 cycles/op
- Instructions per op: 1,109 instructions/op
- L1 misses per op: 17.2 per op
- Page faults: 132,509 total (0.132 per op)
### System-Level Metrics:
- Page Faults: 132,509 (448K/sec)
- Minor Faults: 132,509 (all minor, no major faults)
- Context Switches: 29 (negligible)
- CPU Migrations: 8 (negligible)
- Task Clock: 295.67ms (99.7% CPU utilization)
### Syscall Overhead:
- Total Syscalls: 2,017
- mmap: 1,016 calls (36.41% time)
- munmap: 995 calls (63.48% time)
- mprotect: 5 calls
- madvise: 1 call
- Total Syscall Time: 13.8ms (4.8% of total runtime)
## TOP 10 HOTTEST FUNCTIONS (Self Time)
1. clear_page_erms [kernel]: 7.05% (11.25% with children)
- Kernel zeroing newly allocated pages
- This is page fault handling overhead
2. unified_cache_refill [hakmem]: 4.37%
- Main allocation hot path in HAKMEM
- Triggers page faults on first touch
3. do_anonymous_page [kernel]: 4.38%
- Anonymous page allocation in kernel
- Part of page fault handling
4. __handle_mm_fault [kernel]: 3.80%
- Memory management fault handler
- Core of page fault processing
5. srso_alias_safe_ret [kernel]: 2.85%
- CPU speculation mitigation overhead
- Retpoline-style security overhead
6. asm_exc_page_fault [kernel]: 2.68%
- Page fault exception entry
- Low-level page fault handling
7. srso_alias_return_thunk [kernel]: 2.59%
- More speculation mitigation
- Security overhead (Spectre/Meltdown)
8. __mod_lruvec_state [kernel]: 2.27%
- LRU (page cache) stat tracking
- Memory accounting overhead
9. __lruvec_stat_mod_folio [kernel]: 2.26%
- More LRU statistics
- Memory accounting
10. rmqueue [kernel]: 2.03%
- Page allocation from buddy allocator
- Kernel memory allocation
## CRITICAL BOTTLENECK ANALYSIS
### Primary Bottleneck: Page Fault Handling (69% of total time)
The perf profile shows that **69.07%** of execution time is spent in unified_cache_refill
and its children, with the vast majority (60%+) spent in kernel page fault handling:
- asm_exc_page_fault → exc_page_fault → do_user_addr_fault → handle_mm_fault
- The call chain shows: 68.96% of time is in page fault handling
**Root Cause**: The benchmark is triggering page faults on every cache refill operation.
Breaking down the 69% time spent:
1. Page fault overhead: ~60% (kernel handling)
- clear_page_erms: 11.25% (zeroing pages)
- do_anonymous_page: 20%+ (allocating folios)
- folio_add_new_anon_rmap: 7.11% (adding to reverse map)
- folio_add_lru_vma: 4.88% (adding to LRU)
- __mem_cgroup_charge: 4.37% (memory cgroup accounting)
- Page table operations: 2-3%
2. Unified cache refill logic: ~4.37% (user space)
3. Other kernel overhead: ~5%
### Secondary Bottlenecks:
1. **Memory Zeroing (11.25%)**
- clear_page_erms takes 11.25% of total time
- Kernel zeroes newly allocated pages for security
- 132,509 page faults × 4KB = ~515MB of memory touched
- At 4.1M ops/s, that's 515MB in 0.25s = 2GB/s zeroing bandwidth
2. **Memory Cgroup Accounting (4.37%)**
- __mem_cgroup_charge and related functions
- Per-page memory accounting overhead
- LRU statistics tracking
3. **Speculation Mitigation (5.44%)**
- srso_alias_safe_ret (2.85%) + srso_alias_return_thunk (2.59%)
- CPU security mitigations (Spectre/Meltdown)
- Indirect branch overhead
4. **User-space Allocation (6-8%)**
- free: 1.40%
- malloc: 1.36%
- shared_pool_acquire_slab: 1.31%
- unified_cache_refill: 4.37%
5. **Branch Mispredictions (moderate)**
- 9.04% branch miss rate
- 21M mispredictions / 1M ops = 21 misses per operation
- Each miss ~15-20 cycles = 315-420 cycles/op wasted
## WHY WE'RE AT 4.1M OPS/S INSTEAD OF 16M+
**Fundamental Issue: Page Fault Storm**
The current implementation is triggering page faults on nearly every cache refill:
- 132,509 page faults / 1,000,000 operations = 13.25% of operations trigger page faults
- Page fault handling costs ~690 cycles per operation on average (0.6 × 1,146 cycles/op ≈ 687 cycles); at 0.13 faults per op, that is roughly 5,200 cycles per individual fault
**Time Budget Analysis** (at 4.08 GHz):
- Current: 1,146 cycles/op → 4.1M ops/s
- Target: ~245 cycles/op → 16M ops/s
**Where the 900 extra cycles go**:
1. Page fault handling: ~690 cycles/op (76% of overhead)
2. Branch mispredictions: ~315-420 cycles/op (35-46% of overhead)
3. Cache misses: ~170 cycles/op (17.2 L1 misses × 10 cycles)
4. Speculation mitigation: ~60 cycles/op
5. Other kernel overhead: ~100 cycles/op
**The Math Doesn't Add Up to 4x**:
- If we eliminate ALL page faults (690 cycles), we'd be at 456 cycles/op → 8.9M ops/s (2.2x)
- If we also eliminate branch misses (315 cycles), we'd be at 141 cycles/op → 28.9M ops/s (7x!)
- If we cut cache misses in half, we'd save another 85 cycles
The **overlapping penalties** mean these don't sum linearly, but the analysis shows:
- Page faults are the #1 bottleneck (60-70% of time)
- Branch mispredictions are significant (9% miss rate)
- Cache misses are moderate but not catastrophic
## SPECIFIC OBSERVATIONS
### 1. Cache Refill Pattern
From the unified_cache_refill annotation at offset 0x26f7:
```asm
26f7: mov %dil,0x0(%rbp) # 17.27% of samples (HOTTEST instruction)
26fb: incb 0x11(%r15) # 3.31% (updating metadata)
```
This suggests the hot path is writing to newly allocated memory (triggering page faults).
### 2. Working Set Size
- Benchmark uses ws=256 slots
- Size range: 16-1024 bytes
- Average ~520 bytes per allocation
- Total working set: ~130KB (fits in L2, but spans many pages)
### 3. Allocation Pattern
- 50/50 malloc/free distribution
- Random replacement (xorshift PRNG)
- This creates maximum memory fragmentation and poor locality
## RECOMMENDATIONS FOR NEXT OPTIMIZATION PHASE
### Priority 1: Eliminate Page Fault Overhead (Target: 2-3x improvement)
**Option A: Pre-fault Memory (Immediate - 1 hour)**
- Use madvise(MADV_WILLNEED) or mmap(MAP_POPULATE) to pre-fault SuperSlab pages
- Add MAP_POPULATE to superslab_acquire() mmap calls
- This will trade startup time for runtime performance
- Expected: Eliminate 60-70% of page faults → 2-3x improvement
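A minimal sketch of this option, assuming the acquisition path wraps a plain anonymous mmap (`superslab_acquire_region` and its flag set are illustrative, not HAKMEM's actual code):
```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative sketch: pre-fault a SuperSlab region at acquisition time.
 * MAP_POPULATE asks the kernel to populate the page tables up front, so
 * first-touch page faults move out of the allocation hot path. */
static void *superslab_acquire_region(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```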
**Option B: Implement madvise(MADV_FREE) / MADV_DONTNEED Cycling (Medium - 4 hours)**
- Keep physical pages resident but mark them clean
- Avoid repeated zeroing on reuse
- Requires careful lifecycle management
- Expected: 30-50% improvement
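A sketch of the retire side, assuming slabs stay mapped between uses (`superslab_retire` is a hypothetical name):
```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative sketch: instead of munmap()ing an idle slab, mark it
 * MADV_FREE.  The kernel may reclaim the pages under memory pressure,
 * but if it does not, reuse touches already-resident pages and skips
 * both the page fault and the kernel's re-zeroing. */
static void superslab_retire(void *base, size_t len)
{
#ifdef MADV_FREE
    madvise(base, len, MADV_FREE);      /* lazy reclaim, cheap reuse */
#else
    madvise(base, len, MADV_DONTNEED);  /* eager fallback: pages are dropped
                                           and fault back in zero-filled */
#endif
}
```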
**Option C: Use Hugepages (Medium-High complexity - 1 day)**
- mmap with MAP_HUGETLB to use 2MB pages
- Reduces page fault count by 512x (4KB → 2MB)
- Reduces TLB pressure significantly
- Expected: 2-4x improvement
- Risk: May increase memory waste for small allocations
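One possible shape of the acquire-with-fallback path (hypothetical function name; `len` must be a multiple of the 2MB hugepage size for the first call to succeed):
```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative sketch: try 2MB hugepages first; if none are reserved on
 * the system, the mmap fails (typically ENOMEM) and we fall back to
 * normal 4KB pages. */
static void *superslab_acquire_huge(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)  /* no hugepages available: use 4KB pages */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```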
### Priority 2: Reduce Branch Mispredictions (Target: 1.5x improvement)
**Option A: Profile-Guided Optimization (Easy - 2 hours)**
- Build with -fprofile-generate, run benchmark, rebuild with -fprofile-use
- Helps compiler optimize branch layout
- Expected: 20-30% improvement
**Option B: Simplify Cache Refill Logic (Medium - 1 day)**
- Review unified_cache_refill control flow
- Reduce conditional branches in hot path
- Use __builtin_expect() for likely/unlikely hints
- Expected: 15-25% improvement
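The shape of such hints, on an illustrative cache type (not HAKMEM's real structures):
```c
#include <stddef.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Illustrative cache type; the real HAKMEM structures differ. */
typedef struct { void *slots[64]; int count; } tiny_cache_t;

int cache_refill(tiny_cache_t *c);  /* slow path, defined elsewhere */

static inline void *cache_alloc(tiny_cache_t *c)
{
    if (LIKELY(c->count > 0))           /* common case: pop from cache */
        return c->slots[--c->count];
    if (UNLIKELY(cache_refill(c) <= 0)) /* rare case: refill failed */
        return NULL;
    return c->slots[--c->count];
}
```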
**Option C: Add Fast Path for Common Cases (Medium - 4 hours)**
- Special-case the most common allocation sizes
- Bypass complex logic for hot sizes
- Expected: 20-30% improvement on typical workloads
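For example, a single-load size-class lookup for the hot 16-1024 byte range (names and table are illustrative):
```c
#include <stddef.h>

int size_to_class_slow(size_t size);  /* general classifier, elsewhere */

/* Illustrative sketch: sizes 1..1024 resolve with one table load,
 * bypassing the general classifier; the table is filled once at init. */
static unsigned char size_class_lut[(1024 >> 4) + 1];

static inline int size_to_class(size_t size)
{
    if (size - 1 < 1024)                 /* 1..1024 (size_t wraps for 0) */
        return size_class_lut[(size + 15) >> 4];
    return size_to_class_slow(size);     /* cold path for larger sizes */
}
```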
### Priority 3: Improve Cache Locality (Target: 1.2-1.5x improvement)
**Option A: Optimize Data Structure Layout (Easy - 2 hours)**
- Pack hot fields together in cache lines
- Align structures to cache line boundaries
- Add __attribute__((aligned(64))) to hot structures
- Expected: 10-20% improvement
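For instance (field names are illustrative, not HAKMEM's actual metadata layout):
```c
#include <stdint.h>

/* Illustrative sketch: keep the fields the alloc/free path touches in
 * the first cache line, push cold fields behind them, and align the
 * struct so two hot structures never share a line (no false sharing). */
typedef struct __attribute__((aligned(64))) {
    /* hot: read/written on every alloc/free */
    void    *free_list;
    uint32_t count;
    uint32_t slot_size;
    /* cold: touched only on slab acquire/release */
    void    *slab_base;
    uint64_t bitmap_words[4];
} slab_meta_t;
```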
**Option B: Prefetch Optimization (Medium - 4 hours)**
- Add __builtin_prefetch() for predictable access patterns
- Prefetch next slab metadata during allocation
- Expected: 15-25% improvement
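A sketch of the pattern (the cache type and the "next slab" pointer are illustrative):
```c
/* Illustrative cache type, as in the earlier sketch. */
typedef struct { void *slots[64]; int count; } tiny_cache_t;

/* Illustrative sketch: while popping the current block, prefetch the
 * metadata of the slab the next refill is expected to come from, so it
 * is already in cache by the time the refill path dereferences it. */
static inline void *cache_pop_with_prefetch(tiny_cache_t *c,
                                            const void *next_slab_meta)
{
    __builtin_prefetch(next_slab_meta, 0, 3); /* read, high temporal locality */
    return c->slots[--c->count];
}
```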
### Priority 4: Reduce Kernel Overhead (Target: 1.1-1.2x improvement)
**Option A: Batch Operations (Hard - 2 days)**
- Batch multiple allocations into single mmap() call
- Reduce syscall frequency
- Expected: 10-15% improvement
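One possible shape, assuming slabs can be carved out of a larger reservation (all names and sizes are illustrative, and the real path would need locking):
```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative sketch: reserve one large arena up front and carve slabs
 * from it with a bump pointer, turning many mmap() calls into one.
 * Single-threaded for brevity. */
enum { ARENA_BYTES = 64u << 20, SLAB_BYTES = 64u << 10 };

static char  *arena_base;
static size_t arena_used;

static void *slab_carve(void)
{
    if (!arena_base) {
        void *p = mmap(NULL, ARENA_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        arena_base = p;
    }
    if (arena_used + SLAB_BYTES > ARENA_BYTES)
        return NULL;                  /* arena exhausted: acquire another */
    void *slab = arena_base + arena_used;
    arena_used += SLAB_BYTES;
    return slab;
}
```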
**Option B: Disable Memory Cgroup Accounting (Config - immediate)**
- Run with cgroup v1 or disable memory controller
- Saves ~4% overhead
- Not practical for production but useful for profiling
## IMMEDIATE NEXT STEPS (Recommended Priority)
1. **URGENT: Pre-fault SuperSlab Memory** (1 hour work, 2-3x gain)
- Add MAP_POPULATE to mmap() in superslab acquisition
- Modify: core/superslab/*.c (superslab_acquire functions)
- Test: Run bench_random_mixed_hakmem and verify page fault count drops
2. **Profile-Guided Optimization** (2 hours, 20-30% gain)
- Build with PGO flags
- Run representative workload
- Rebuild with profile data
3. **Hugepage Support** (1 day, 2-4x gain)
- Add MAP_HUGETLB flag to superslab mmap
- Add fallback for systems without hugepage support
- Test memory usage impact
4. **Branch Optimization** (4 hours, 15-25% gain)
- Add __builtin_expect() hints to unified_cache_refill
- Simplify hot path conditionals
- Reorder checks for common case first
**Conservative Estimate**: With just priorities #1 and #2, we could reach:
- Current: 4.1M ops/s
- After prefaulting: 8.2-12.3M ops/s (2-3x)
- After PGO: 9.8-16.0M ops/s (1.2-1.3x more)
- **Final: ~10-16M ops/s (2.4x - 4x total improvement)**
**Aggressive Estimate**: With hugepages + PGO + branch optimization:
- **Final: 16-24M ops/s (4-6x improvement)**
## CONCLUSION
The primary bottleneck is **kernel page fault handling**, consuming 60-70% of execution time.
This is because the benchmark triggers page faults on nearly every cache refill operation,
forcing the kernel to:
1. Zero new pages (11% of time)
2. Set up page tables (3-5% of time)
3. Add pages to LRU and memory cgroups (12% of time)
4. Manage folios and reverse mappings (10% of time)
**The path to 4x performance is clear**:
1. Eliminate page faults with MAP_POPULATE or hugepages (2-3x gain)
2. Reduce branch mispredictions with PGO (1.2-1.3x gain)
3. Optimize cache locality (1.1-1.2x gain)
Combined, these optimizations should easily achieve the 4x target (4.1M → 16M+ ops/s).