Moe Charm (CI) acc64f2438 Phase ML1: Reduce Pool v1's 89.73% memset overhead (+15.34% improvement)
## Summary
- Fixed the setenv segfault in bench_profile.h with ChatGPT's help (switched to going through RTLD_NEXT)
- Added core/box/pool_zero_mode_box.h: manages ZERO_MODE uniformly through the cached ENV lookup
- memset in core/hakmem_pool.c is now controlled by the zero mode (full/header/off)
- A/B test result: +15.34% improvement with ZERO_MODE=header (1M iterations, C6-heavy)

## Files Modified
- core/box/pool_api.inc.h: includes pool_zero_mode_box.h
- core/bench_profile.h: glibc setenv → malloc+putenv (avoids the segfault)
- core/hakmem_pool.c: zero-mode lookup and control logic
- core/box/pool_zero_mode_box.h (new): enum/getter
- CURRENT_TASK.md: records the Phase ML1 results

## Test Results
| Iterations | ZERO_MODE=full | ZERO_MODE=header | Improvement |
|-----------|----------------|-----------------|------------|
| 10K       | 3.06 M ops/s   | 3.17 M ops/s    | +3.65%     |
| 1M        | 23.71 M ops/s  | 27.34 M ops/s   | **+15.34%** |

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-10 09:08:18 +09:00


HAKMEM Performance Bottleneck Analysis Report
==============================================
Date: 2025-12-04
Current Performance: 4.1M ops/s
Target Performance: 16M+ ops/s (4x improvement)
Performance Gap: 3.9x remaining
When reproducing the mid/smallmid C6-heavy benchmarks, start from the `C6_HEAVY_LEGACY_POOLV1` preset in `docs/analysis/ENV_PROFILE_PRESETS.md`.
## KEY METRICS SUMMARY
### Hardware Performance Counters (3-run average):
- Total Cycles: 1,146M cycles (0.281s @ ~4.08 GHz)
- Instructions: 1,109M instructions
- IPC (Instructions Per Cycle): 0.97 (GOOD - near optimal)
- Branches: 231.7M
- Branch Misses: 21.0M (9.04% miss rate - MODERATE)
- Cache References: 50.9M
- Cache Misses: 6.6M (13.03% miss rate - MODERATE)
- L1 D-cache Load Misses: 17.2M
### Per-Operation Breakdown (1M operations):
- Cycles per op: 1,146 cycles/op
- Instructions per op: 1,109 instructions/op
- L1 misses per op: 17.2 per op
- Page faults: 132,509 total (0.132 per op)
### System-Level Metrics:
- Page Faults: 132,509 (448K/sec)
- Minor Faults: 132,509 (all minor, no major faults)
- Context Switches: 29 (negligible)
- CPU Migrations: 8 (negligible)
- Task Clock: 295.67ms (99.7% CPU utilization)
### Syscall Overhead:
- Total Syscalls: 2,017
- mmap: 1,016 calls (36.41% time)
- munmap: 995 calls (63.48% time)
- mprotect: 5 calls
- madvise: 1 call
- Total Syscall Time: 13.8ms (4.8% of total runtime)
## TOP 10 HOTTEST FUNCTIONS (Self Time)
1. clear_page_erms [kernel]: 7.05% (11.25% with children)
- Kernel zeroing newly allocated pages
- This is page fault handling overhead
2. unified_cache_refill [hakmem]: 4.37%
- Main allocation hot path in HAKMEM
- Triggers page faults on first touch
3. do_anonymous_page [kernel]: 4.38%
- Anonymous page allocation in kernel
- Part of page fault handling
4. __handle_mm_fault [kernel]: 3.80%
- Memory management fault handler
- Core of page fault processing
5. srso_alias_safe_ret [kernel]: 2.85%
- CPU speculation mitigation overhead
- Retpoline-style security overhead
6. asm_exc_page_fault [kernel]: 2.68%
- Page fault exception entry
- Low-level page fault handling
7. srso_alias_return_thunk [kernel]: 2.59%
- More speculation mitigation
- Security overhead (Spectre/Meltdown)
8. __mod_lruvec_state [kernel]: 2.27%
- LRU (page cache) stat tracking
- Memory accounting overhead
9. __lruvec_stat_mod_folio [kernel]: 2.26%
- More LRU statistics
- Memory accounting
10. rmqueue [kernel]: 2.03%
- Page allocation from buddy allocator
- Kernel memory allocation
## CRITICAL BOTTLENECK ANALYSIS
### Primary Bottleneck: Page Fault Handling (69% of total time)
The perf profile shows that **69.07%** of execution time is spent in unified_cache_refill
and its children, with the vast majority (60%+) spent in kernel page fault handling:
- asm_exc_page_fault → exc_page_fault → do_user_addr_fault → handle_mm_fault
- The call chain shows: 68.96% of time is in page fault handling
**Root Cause**: The benchmark is triggering page faults on every cache refill operation.
Breaking down the 69% time spent:
1. Page fault overhead: ~60% (kernel handling)
- clear_page_erms: 11.25% (zeroing pages)
- do_anonymous_page: 20%+ (allocating folios)
- folio_add_new_anon_rmap: 7.11% (adding to reverse map)
- folio_add_lru_vma: 4.88% (adding to LRU)
- __mem_cgroup_charge: 4.37% (memory cgroup accounting)
- Page table operations: 2-3%
2. Unified cache refill logic: ~4.37% (user space)
3. Other kernel overhead: ~5%
### Secondary Bottlenecks:
1. **Memory Zeroing (11.25%)**
- clear_page_erms takes 11.25% of total time
- Kernel zeroes newly allocated pages for security
- 132,509 page faults × 4KB = ~515MB of memory touched
- At 4.1M ops/s, that's 515MB in 0.25s = 2GB/s zeroing bandwidth
2. **Memory Cgroup Accounting (4.37%)**
- __mem_cgroup_charge and related functions
- Per-page memory accounting overhead
- LRU statistics tracking
3. **Speculation Mitigation (5.44%)**
- srso_alias_safe_ret (2.85%) + srso_alias_return_thunk (2.59%)
- CPU security mitigations (Spectre/Meltdown)
- Indirect branch overhead
4. **User-space Allocation (6-8%)**
- free: 1.40%
- malloc: 1.36%
- shared_pool_acquire_slab: 1.31%
- unified_cache_refill: 4.37%
5. **Branch Mispredictions (moderate)**
- 9.04% branch miss rate
- 21M mispredictions / 1M ops = 21 misses per operation
- Each miss ~15-20 cycles = 315-420 cycles/op wasted
## WHY WE'RE AT 4.1M OPS/S INSTEAD OF 16M+
**Fundamental Issue: Page Fault Storm**
The current implementation is triggering page faults on nearly every cache refill:
- 132,509 page faults / 1,000,000 operations = 13.25% of operations trigger page faults
- Page fault handling costs ~690 cycles per operation on average (0.6 × 1,146 cycles/op ≈ 687 cycles); at 0.13 faults per op, that is roughly 5,200 cycles per individual fault
**Time Budget Analysis** (at 4.08 GHz):
- Current: 1,146 cycles/op → 4.1M ops/s
- Target: ~245 cycles/op → 16M ops/s
**Where the 900 extra cycles go**:
1. Page fault handling: ~690 cycles/op (76% of overhead)
2. Branch mispredictions: ~315-420 cycles/op (35-46% of overhead)
3. Cache misses: ~170 cycles/op (17.2 L1 misses × 10 cycles)
4. Speculation mitigation: ~60 cycles/op
5. Other kernel overhead: ~100 cycles/op
**The Math Doesn't Add Up to 4x**:
- If we eliminate ALL page faults (690 cycles), we'd be at 456 cycles/op → 8.9M ops/s (2.2x)
- If we also eliminate branch misses (315 cycles), we'd be at 141 cycles/op → 28.9M ops/s (7x!)
- If we cut cache misses in half, we'd save another 85 cycles
The **overlapping penalties** mean these don't sum linearly, but the analysis shows:
- Page faults are the #1 bottleneck (60-70% of time)
- Branch mispredictions are significant (9% miss rate)
- Cache misses are moderate but not catastrophic
## SPECIFIC OBSERVATIONS
### 1. Cache Refill Pattern
From the unified_cache_refill annotation at offset 0x26f7:
```asm
26f7: mov %dil,0x0(%rbp) # 17.27% of samples (HOTTEST instruction)
26fb: incb 0x11(%r15) # 3.31% (updating metadata)
```
This suggests the hot path is writing to newly allocated memory (triggering page faults).
### 2. Working Set Size
- Benchmark uses ws=256 slots
- Size range: 16-1024 bytes
- Average ~520 bytes per allocation
- Total working set: ~130KB (fits in L2, but spans many pages)
### 3. Allocation Pattern
- 50/50 malloc/free distribution
- Random replacement (xorshift PRNG)
- This creates maximum memory fragmentation and poor locality
## RECOMMENDATIONS FOR NEXT OPTIMIZATION PHASE
### Priority 1: Eliminate Page Fault Overhead (Target: 2-3x improvement)
**Option A: Pre-fault Memory (Immediate - 1 hour)**
- Use madvise(MADV_WILLNEED) or mmap(MAP_POPULATE) to pre-fault SuperSlab pages
- Add MAP_POPULATE to superslab_acquire() mmap calls
- This will trade startup time for runtime performance
- Expected: Eliminate 60-70% of page faults → 2-3x improvement
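A minimal sketch of this option, assuming the acquisition path wraps a plain anonymous mmap (`superslab_acquire_region` and its flag set are illustrative, not HAKMEM's actual code):
```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative sketch: pre-fault a SuperSlab region at acquisition time.
 * MAP_POPULATE asks the kernel to populate the page tables up front, so
 * first-touch page faults move out of the allocation hot path. */
static void *superslab_acquire_region(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```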
**Option B: Implement madvise(MADV_FREE) / MADV_DONTNEED Cycling (Medium - 4 hours)**
- Keep physical pages resident but mark them clean
- Avoid repeated zeroing on reuse
- Requires careful lifecycle management
- Expected: 30-50% improvement
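A sketch of the retire side, assuming slabs stay mapped between uses (`superslab_retire` is a hypothetical name):
```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative sketch: instead of munmap()ing an idle slab, mark it
 * MADV_FREE.  The kernel may reclaim the pages under memory pressure,
 * but if it does not, reuse touches already-resident pages and skips
 * both the page fault and the kernel's re-zeroing. */
static void superslab_retire(void *base, size_t len)
{
#ifdef MADV_FREE
    madvise(base, len, MADV_FREE);      /* lazy reclaim, cheap reuse */
#else
    madvise(base, len, MADV_DONTNEED);  /* eager fallback: pages are dropped
                                           and fault back in zero-filled */
#endif
}
```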
**Option C: Use Hugepages (Medium-High complexity - 1 day)**
- mmap with MAP_HUGETLB to use 2MB pages
- Reduces page fault count by 512x (4KB → 2MB)
- Reduces TLB pressure significantly
- Expected: 2-4x improvement
- Risk: May increase memory waste for small allocations
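One possible shape of the acquire-with-fallback path (hypothetical function name; `len` must be a multiple of the 2MB hugepage size for the first call to succeed):
```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative sketch: try 2MB hugepages first; if none are reserved on
 * the system, the mmap fails (typically ENOMEM) and we fall back to
 * normal 4KB pages. */
static void *superslab_acquire_huge(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)  /* no hugepages available: use 4KB pages */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```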
### Priority 2: Reduce Branch Mispredictions (Target: 1.5x improvement)
**Option A: Profile-Guided Optimization (Easy - 2 hours)**
- Build with -fprofile-generate, run benchmark, rebuild with -fprofile-use
- Helps compiler optimize branch layout
- Expected: 20-30% improvement
**Option B: Simplify Cache Refill Logic (Medium - 1 day)**
- Review unified_cache_refill control flow
- Reduce conditional branches in hot path
- Use __builtin_expect() for likely/unlikely hints
- Expected: 15-25% improvement
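The shape of such hints, on an illustrative cache type (not HAKMEM's real structures):
```c
#include <stddef.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Illustrative cache type; the real HAKMEM structures differ. */
typedef struct { void *slots[64]; int count; } tiny_cache_t;

int cache_refill(tiny_cache_t *c);  /* slow path, defined elsewhere */

static inline void *cache_alloc(tiny_cache_t *c)
{
    if (LIKELY(c->count > 0))           /* common case: pop from cache */
        return c->slots[--c->count];
    if (UNLIKELY(cache_refill(c) <= 0)) /* rare case: refill failed */
        return NULL;
    return c->slots[--c->count];
}
```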
**Option C: Add Fast Path for Common Cases (Medium - 4 hours)**
- Special-case the most common allocation sizes
- Bypass complex logic for hot sizes
- Expected: 20-30% improvement on typical workloads
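For example, a single-load size-class lookup for the hot 16-1024 byte range (names and table are illustrative):
```c
#include <stddef.h>

int size_to_class_slow(size_t size);  /* general classifier, elsewhere */

/* Illustrative sketch: sizes 1..1024 resolve with one table load,
 * bypassing the general classifier; the table is filled once at init. */
static unsigned char size_class_lut[(1024 >> 4) + 1];

static inline int size_to_class(size_t size)
{
    if (size - 1 < 1024)                 /* 1..1024 (size_t wraps for 0) */
        return size_class_lut[(size + 15) >> 4];
    return size_to_class_slow(size);     /* cold path for larger sizes */
}
```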
### Priority 3: Improve Cache Locality (Target: 1.2-1.5x improvement)
**Option A: Optimize Data Structure Layout (Easy - 2 hours)**
- Pack hot fields together in cache lines
- Align structures to cache line boundaries
- Add __attribute__((aligned(64))) to hot structures
- Expected: 10-20% improvement
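For instance (field names are illustrative, not HAKMEM's actual metadata layout):
```c
#include <stdint.h>

/* Illustrative sketch: keep the fields the alloc/free path touches in
 * the first cache line, push cold fields behind them, and align the
 * struct so two hot structures never share a line (no false sharing). */
typedef struct __attribute__((aligned(64))) {
    /* hot: read/written on every alloc/free */
    void    *free_list;
    uint32_t count;
    uint32_t slot_size;
    /* cold: touched only on slab acquire/release */
    void    *slab_base;
    uint64_t bitmap_words[4];
} slab_meta_t;
```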
**Option B: Prefetch Optimization (Medium - 4 hours)**
- Add __builtin_prefetch() for predictable access patterns
- Prefetch next slab metadata during allocation
- Expected: 15-25% improvement
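A sketch of the pattern (the cache type and the "next slab" pointer are illustrative):
```c
/* Illustrative cache type, as in the earlier sketch. */
typedef struct { void *slots[64]; int count; } tiny_cache_t;

/* Illustrative sketch: while popping the current block, prefetch the
 * metadata of the slab the next refill is expected to come from, so it
 * is already in cache by the time the refill path dereferences it. */
static inline void *cache_pop_with_prefetch(tiny_cache_t *c,
                                            const void *next_slab_meta)
{
    __builtin_prefetch(next_slab_meta, 0, 3); /* read, high temporal locality */
    return c->slots[--c->count];
}
```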
### Priority 4: Reduce Kernel Overhead (Target: 1.1-1.2x improvement)
**Option A: Batch Operations (Hard - 2 days)**
- Batch multiple allocations into single mmap() call
- Reduce syscall frequency
- Expected: 10-15% improvement
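One possible shape, assuming slabs can be carved out of a larger reservation (all names and sizes are illustrative, and the real path would need locking):
```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative sketch: reserve one large arena up front and carve slabs
 * from it with a bump pointer, turning many mmap() calls into one.
 * Single-threaded for brevity. */
enum { ARENA_BYTES = 64u << 20, SLAB_BYTES = 64u << 10 };

static char  *arena_base;
static size_t arena_used;

static void *slab_carve(void)
{
    if (!arena_base) {
        void *p = mmap(NULL, ARENA_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        arena_base = p;
    }
    if (arena_used + SLAB_BYTES > ARENA_BYTES)
        return NULL;                  /* arena exhausted: acquire another */
    void *slab = arena_base + arena_used;
    arena_used += SLAB_BYTES;
    return slab;
}
```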
**Option B: Disable Memory Cgroup Accounting (Config - immediate)**
- Run with cgroup v1 or disable memory controller
- Saves ~4% overhead
- Not practical for production but useful for profiling
## IMMEDIATE NEXT STEPS (Recommended Priority)
1. **URGENT: Pre-fault SuperSlab Memory** (1 hour work, 2-3x gain)
- Add MAP_POPULATE to mmap() in superslab acquisition
- Modify: core/superslab/*.c (superslab_acquire functions)
- Test: Run bench_random_mixed_hakmem and verify page fault count drops
2. **Profile-Guided Optimization** (2 hours, 20-30% gain)
- Build with PGO flags
- Run representative workload
- Rebuild with profile data
3. **Hugepage Support** (1 day, 2-4x gain)
- Add MAP_HUGETLB flag to superslab mmap
- Add fallback for systems without hugepage support
- Test memory usage impact
4. **Branch Optimization** (4 hours, 15-25% gain)
- Add __builtin_expect() hints to unified_cache_refill
- Simplify hot path conditionals
- Reorder checks for common case first
**Conservative Estimate**: With just priorities #1 and #2, we could reach:
- Current: 4.1M ops/s
- After prefaulting: 8.2-12.3M ops/s (2-3x)
- After PGO: 9.8-16.0M ops/s (1.2-1.3x more)
- **Final: ~10-16M ops/s (2.4x - 4x total improvement)**
**Aggressive Estimate**: With hugepages + PGO + branch optimization:
- **Final: 16-24M ops/s (4-6x improvement)**
## CONCLUSION
The primary bottleneck is **kernel page fault handling**, consuming 60-70% of execution time.
This is because the benchmark triggers page faults on nearly every cache refill operation,
forcing the kernel to:
1. Zero new pages (11% of time)
2. Set up page tables (3-5% of time)
3. Add pages to LRU and memory cgroups (12% of time)
4. Manage folios and reverse mappings (10% of time)
**The path to 4x performance is clear**:
1. Eliminate page faults with MAP_POPULATE or hugepages (2-3x gain)
2. Reduce branch mispredictions with PGO (1.2-1.3x gain)
3. Optimize cache locality (1.1-1.2x gain)
Combined, these optimizations should easily achieve the 4x target (4.1M → 16M+ ops/s).