HAKMEM Performance Bottleneck Analysis Report
==============================================

Date: 2025-12-04
Current Performance: 4.1M ops/s
Target Performance: 16M+ ops/s (4x improvement)
Performance Gap: 3.9x remaining

When reproducing the mid/smallmid (C6-heavy) benchmark, start from the `C6_HEAVY_LEGACY_POOLV1` preset in `docs/analysis/ENV_PROFILE_PRESETS.md`.

## KEY METRICS SUMMARY

### Hardware Performance Counters (3-run average):
- Total Cycles: 1,146M cycles (0.281s @ ~4.08 GHz)
- Instructions: 1,109M instructions
- IPC (Instructions Per Cycle): 0.97 (GOOD - near optimal)
- Branches: 231.7M
- Branch Misses: 21.0M (9.04% miss rate - MODERATE)
- Cache References: 50.9M
- Cache Misses: 6.6M (13.03% miss rate - MODERATE)
- L1 D-cache Load Misses: 17.2M

### Per-Operation Breakdown (1M operations):
- Cycles per op: 1,146
- Instructions per op: 1,109
- L1 misses per op: 17.2
- Page faults: 132,509 total (0.132 per op)

### System-Level Metrics:
- Page Faults: 132,509 (448K/sec)
- Minor Faults: 132,509 (all minor, no major faults)
- Context Switches: 29 (negligible)
- CPU Migrations: 8 (negligible)
- Task Clock: 295.67ms (99.7% CPU utilization)

### Syscall Overhead:
- Total Syscalls: 2,017
- mmap: 1,016 calls (36.41% of syscall time)
- munmap: 995 calls (63.48% of syscall time)
- mprotect: 5 calls
- madvise: 1 call
- Total Syscall Time: 13.8ms (4.8% of total runtime)
## TOP 10 HOTTEST FUNCTIONS (Self Time)

1. clear_page_erms [kernel]: 7.05% (11.25% with children)
   - Kernel zeroing newly allocated pages
   - This is page fault handling overhead

2. unified_cache_refill [hakmem]: 4.37%
   - Main allocation hot path in HAKMEM
   - Triggers page faults on first touch

3. do_anonymous_page [kernel]: 4.38%
   - Anonymous page allocation in kernel
   - Part of page fault handling

4. __handle_mm_fault [kernel]: 3.80%
   - Memory management fault handler
   - Core of page fault processing

5. srso_alias_safe_ret [kernel]: 2.85%
   - CPU speculation mitigation overhead
   - Retpoline-style security overhead

6. asm_exc_page_fault [kernel]: 2.68%
   - Page fault exception entry
   - Low-level page fault handling

7. srso_alias_return_thunk [kernel]: 2.59%
   - More speculation mitigation
   - Security overhead (Spectre/Meltdown)

8. __mod_lruvec_state [kernel]: 2.27%
   - LRU (page cache) stat tracking
   - Memory accounting overhead

9. __lruvec_stat_mod_folio [kernel]: 2.26%
   - More LRU statistics
   - Memory accounting

10. rmqueue [kernel]: 2.03%
    - Page allocation from buddy allocator
    - Kernel memory allocation
## CRITICAL BOTTLENECK ANALYSIS

### Primary Bottleneck: Page Fault Handling (69% of total time)

The perf profile shows that **69.07%** of execution time is spent in unified_cache_refill and its children, with the vast majority (60%+) spent in kernel page fault handling:

- asm_exc_page_fault → exc_page_fault → do_user_addr_fault → handle_mm_fault
- The call chain shows 68.96% of time in page fault handling

**Root Cause**: The benchmark is triggering page faults on every cache refill operation.

Breaking down the 69% time spent:

1. Page fault overhead: ~60% (kernel handling)
   - clear_page_erms: 11.25% (zeroing pages)
   - do_anonymous_page: 20%+ (allocating folios)
   - folio_add_new_anon_rmap: 7.11% (adding to reverse map)
   - folio_add_lru_vma: 4.88% (adding to LRU)
   - __mem_cgroup_charge: 4.37% (memory cgroup accounting)
   - Page table operations: 2-3%

2. Unified cache refill logic: ~4.37% (user space)

3. Other kernel overhead: ~5%
### Secondary Bottlenecks:

1. **Memory Zeroing (11.25%)**
   - clear_page_erms takes 11.25% of total time
   - Kernel zeroes newly allocated pages for security
   - 132,509 page faults × 4KB = ~515MB of memory touched
   - At 4.1M ops/s, that's 515MB in 0.25s = 2GB/s zeroing bandwidth

2. **Memory Cgroup Accounting (4.37%)**
   - __mem_cgroup_charge and related functions
   - Per-page memory accounting overhead
   - LRU statistics tracking

3. **Speculation Mitigation (5.44%)**
   - srso_alias_safe_ret (2.85%) + srso_alias_return_thunk (2.59%)
   - CPU security mitigations (Spectre/Meltdown)
   - Indirect branch overhead

4. **User-space Allocation (6-8%)**
   - free: 1.40%
   - malloc: 1.36%
   - shared_pool_acquire_slab: 1.31%
   - unified_cache_refill: 4.37%

5. **Branch Mispredictions (moderate)**
   - 9.04% branch miss rate
   - 21M mispredictions / 1M ops = 21 misses per operation
   - Each miss ~15-20 cycles = 315-420 cycles/op wasted
## WHY WE'RE AT 4.1M OPS/S INSTEAD OF 16M+

**Fundamental Issue: Page Fault Storm**

The current implementation triggers page faults on nearly every cache refill:
- 132,509 page faults / 1,000,000 operations = 13.25% of operations trigger a page fault
- Page fault handling adds ~690 cycles of overhead per operation (~60% of the 1,146 cycles/op); at 0.132 faults/op that is roughly 5,200 cycles per actual fault

**Time Budget Analysis** (at 4.08 GHz):
- Current: 1,146 cycles/op → 4.1M ops/s
- Target: ~245 cycles/op → 16M ops/s

**Where the ~900 extra cycles go**:
1. Page fault handling: ~690 cycles/op (76% of overhead)
2. Branch mispredictions: ~315-420 cycles/op (35-47% of overhead)
3. Cache misses: ~170 cycles/op (17.2 L1 misses × 10 cycles)
4. Speculation mitigation: ~60 cycles/op
5. Other kernel overhead: ~100 cycles/op

**The Math Doesn't Add Up to 4x**:
- If we eliminate ALL page faults (690 cycles), we'd be at 456 cycles/op → 8.9M ops/s (2.2x)
- If we also eliminate branch misses (315 cycles), we'd be at 141 cycles/op → 28.9M ops/s (7x!)
- If we cut cache misses in half, we'd save another ~85 cycles

The **overlapping penalties** mean these don't sum linearly, but the analysis shows:
- Page faults are the #1 bottleneck (60-70% of time)
- Branch mispredictions are significant (9% miss rate)
- Cache misses are moderate but not catastrophic
## SPECIFIC OBSERVATIONS

### 1. Cache Refill Pattern
From the unified_cache_refill annotation at offset 26f7:
```asm
26f7: mov  %dil,0x0(%rbp)   # 17.27% of samples (HOTTEST instruction)
26fb: incb 0x11(%r15)       # 3.31% (updating metadata)
```
This suggests the hot path is writing to newly allocated memory (triggering page faults).

### 2. Working Set Size
- Benchmark uses ws=256 slots
- Size range: 16-1024 bytes
- Average ~520 bytes per allocation
- Total working set: ~130KB (fits in L2, but spans many pages)

### 3. Allocation Pattern
- 50/50 malloc/free distribution
- Random replacement (xorshift PRNG)
- This creates maximum memory fragmentation and poor locality
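The access pattern above can be sketched as follows. This is an illustrative xorshift64 PRNG with the commonly used shift constants, not necessarily the benchmark's exact variant; `pick_slot` shows how a random slot out of ws=256 is chosen on every operation, which is what defeats locality.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative xorshift64 step (common 13/7/17 constants). */
static uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Pick one of ws=256 working-set slots uniformly at random;
 * successive picks land on unrelated cache lines and pages. */
static unsigned pick_slot(uint64_t *state) {
    return (unsigned)(xorshift64(state) & 255u);
}
```

Because each op frees or reallocates a random slot, consecutive operations share almost no spatial locality, which is consistent with the moderate cache-miss rates measured above.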

## RECOMMENDATIONS FOR NEXT OPTIMIZATION PHASE

### Priority 1: Eliminate Page Fault Overhead (Target: 2-3x improvement)

**Option A: Pre-fault Memory (Immediate - 1 hour)**
- Use madvise(MADV_WILLNEED) or mmap(MAP_POPULATE) to pre-fault SuperSlab pages
- Add MAP_POPULATE to superslab_acquire() mmap calls
- This trades startup time for runtime performance
- Expected: eliminates 60-70% of page faults → 2-3x improvement
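A minimal sketch of the proposed change, assuming the real edit lands inside `superslab_acquire()`; `superslab_prefault_map` is a hypothetical helper name, not the existing API:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper showing the one-flag change: MAP_POPULATE
 * pre-faults every page at map time, so first-touch writes in the
 * hot path take no soft page fault. */
static void *superslab_prefault_map(size_t bytes) {
    void *p = mmap(NULL, bytes,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                   -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```

The cost moves to map time: the mmap call itself becomes slower, which is why this trades startup latency for steady-state throughput.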

**Option B: Implement madvise(MADV_FREE) / MADV_DONTNEED Cycling (Medium - 4 hours)**
- Keep physical pages resident but mark them clean
- Avoid repeated zeroing on reuse
- Requires careful lifecycle management
- Expected: 30-50% improvement
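A sketch of the cycling idea, assuming a hypothetical `slab_retire()` helper: instead of munmap()ing a cold slab, mark its pages reclaimable so the mapping survives and reuse avoids the full mmap/fault/zero cycle.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical retire path for a cold slab. */
static int slab_retire(void *base, size_t bytes) {
#ifdef MADV_FREE
    /* Lazy: the kernel reclaims these pages only under memory
     * pressure; if the slab is reused first, the pages are still
     * resident and no new fault or zeroing occurs. */
    if (madvise(base, bytes, MADV_FREE) == 0)
        return 0;
#endif
    /* Fallback: drop the pages now; the next touch faults in a fresh
     * zero page, still cheaper than a munmap + mmap round trip. */
    return madvise(base, bytes, MADV_DONTNEED);
}
```

The lifecycle caveat from the list above applies: after reclaim the contents read back as zero, so the allocator must not assume retired slabs keep their data.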

**Option C: Use Hugepages (Medium-High complexity - 1 day)**
- mmap with MAP_HUGETLB to use 2MB pages
- Reduces page fault count by up to 512x (4KB → 2MB)
- Reduces TLB pressure significantly
- Expected: 2-4x improvement
- Risk: may increase memory waste for small allocations
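A sketch of the hugepage path with the fallback this plan calls for; `superslab_map` is a hypothetical name and 2MB pages are assumed:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical acquisition path: try 2MB hugepages first, fall back
 * to normal 4KB pages when no hugepage pool is configured. */
static void *superslab_map(size_t bytes) {
#ifdef MAP_HUGETLB
    /* Hugepage attempt: 512x fewer faults and far less TLB pressure.
     * Fails (ENOMEM) unless pages are reserved via vm.nr_hugepages. */
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;
#endif
    /* Fallback keeps the allocator working on unconfigured systems. */
    void *q = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (q == MAP_FAILED) ? NULL : q;
}
```

Note that MAP_HUGETLB lengths should be multiples of the huge page size (2MB here), which fits the SuperSlab granularity.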

### Priority 2: Reduce Branch Mispredictions (Target: 1.5x improvement)

**Option A: Profile-Guided Optimization (Easy - 2 hours)**
- Build with -fprofile-generate, run the benchmark, rebuild with -fprofile-use
- Helps the compiler optimize branch layout
- Expected: 20-30% improvement

**Option B: Simplify Cache Refill Logic (Medium - 1 day)**
- Review unified_cache_refill control flow
- Reduce conditional branches in the hot path
- Use __builtin_expect() for likely/unlikely hints
- Expected: 15-25% improvement
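The hint style can be sketched as below; `cache_pop` is an illustrative stand-in for the branch structure inside unified_cache_refill, not the real code:

```c
/* Conventional wrappers around the GCC/Clang builtin. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Illustrative: pop one slot index from a per-thread cache.
 * The hot case (cache non-empty) is first and predicted taken;
 * the corrupted-state check is marked as the truly cold path. */
static int cache_pop(int *count, int *out_slot) {
    if (LIKELY(*count > 0)) {
        *out_slot = --(*count);
        return 1;                /* served from cache */
    }
    if (UNLIKELY(*count < 0))
        return -1;               /* corrupted state */
    return 0;                    /* empty: caller runs the refill */
}
```

The hints mainly influence block layout: the compiler keeps the likely path fall-through and straight-line, which helps both the branch predictor and the instruction cache.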

**Option C: Add Fast Path for Common Cases (Medium - 4 hours)**
- Special-case the most common allocation sizes
- Bypass complex logic for hot sizes
- Expected: 20-30% improvement on typical workloads
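One way to structure such a fast path, sketched with hypothetical names (`g_fast_bins`, `hak_alloc_general` are illustrative, not HAKMEM's actual internals): a per-size-class free list covering the hot 16-1024B range, with a single range check before the general allocator.

```c
#include <stddef.h>
#include <stdlib.h>

#define NUM_CLASSES 64                 /* 1024B / 16B granularity */
static void *g_fast_bins[NUM_CLASSES]; /* singly linked free lists */

static void *hak_alloc_general(size_t n) { return malloc(n); }

static void *hak_alloc(size_t n) {
    if (n - 1 < 1024) {                  /* covers 1..1024 in one test */
        size_t cls = (n + 15) / 16 - 1;  /* 16B-granular class index */
        void *head = g_fast_bins[cls];
        if (head) {                      /* fast path: pop the list */
            g_fast_bins[cls] = *(void **)head;
            return head;
        }
    }
    return hak_alloc_general(n);         /* slow/general path */
}

static void hak_free_fast(void *p, size_t n) {
    size_t cls = (n + 15) / 16 - 1;
    *(void **)p = g_fast_bins[cls];      /* push onto class list */
    g_fast_bins[cls] = p;
}
```

The `n - 1 < 1024` trick folds the `n != 0` and `n <= 1024` checks into one unsigned comparison, keeping the hot-size path to a couple of well-predicted branches.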

### Priority 3: Improve Cache Locality (Target: 1.2-1.5x improvement)

**Option A: Optimize Data Structure Layout (Easy - 2 hours)**
- Pack hot fields together in cache lines
- Align structures to cache line boundaries
- Add __attribute__((aligned(64))) to hot structures
- Expected: 10-20% improvement
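A layout sketch under the stated assumptions; the field names are illustrative, not the actual HAKMEM slab header:

```c
#include <stdint.h>

/* Fields touched on every alloc/free packed into one 64B cache line. */
struct slab_hot {
    void    *free_list;   /* popped/pushed on every alloc/free */
    uint32_t used;        /* bumped on every alloc/free */
    uint32_t capacity;    /* read on the refill decision */
} __attribute__((aligned(64)));

/* Cold bookkeeping lives on a separate line so reading or writing it
 * never evicts the hot line from L1. */
struct slab_cold {
    uint64_t total_allocs;
    uint64_t owner_tid;
} __attribute__((aligned(64)));
```

Splitting hot and cold fields this way also avoids false sharing if another thread occasionally updates the cold statistics.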

**Option B: Prefetch Optimization (Medium - 4 hours)**
- Add __builtin_prefetch() for predictable access patterns
- Prefetch next slab metadata during allocation
- Expected: 15-25% improvement
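The metadata prefetch can be sketched as below; `slab_meta` and its `next` field are illustrative, assuming slabs are chained in a list:

```c
#include <stddef.h>

/* Illustrative slab metadata with a link to the next slab. */
struct slab_meta { struct slab_meta *next; unsigned free_count; };

/* While the current slab is still being serviced, start pulling the
 * next slab's metadata into cache so the eventual switch is a hit. */
static struct slab_meta *advance_slab(struct slab_meta *cur) {
    struct slab_meta *nxt = cur->next;
    if (nxt)
        /* rw=0 (read), locality=3 (keep in all cache levels) */
        __builtin_prefetch(nxt, 0, 3);
    return nxt;
}
```

Prefetching only pays off when there is enough independent work between the prefetch and the use; issuing it during the current allocation gives the load time to complete.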

### Priority 4: Reduce Kernel Overhead (Target: 1.1-1.2x improvement)

**Option A: Batch Operations (Hard - 2 days)**
- Batch multiple allocations into a single mmap() call
- Reduce syscall frequency
- Expected: 10-15% improvement
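One possible shape for the batching, with hypothetical `arena_*` names and sizes: reserve one large region up front and carve slabs from it with pointer bumps, so the 1,016 mmap / 995 munmap pairs measured above collapse toward one syscall each way.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define ARENA_BYTES ((size_t)64 << 20)  /* one 64MB reservation (illustrative) */
#define SLAB_BYTES  ((size_t)64 << 10)  /* 64KB slabs carved from it */

static char *g_arena, *g_arena_end, *g_bump;

/* One syscall reserves space for ~1,024 slabs. */
static int arena_init(void) {
    void *p = mmap(NULL, ARENA_BYTES, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    g_arena = g_bump = p;
    g_arena_end = g_arena + ARENA_BYTES;
    return 0;
}

/* Common path: a pointer bump, no syscall at all. */
static void *arena_slab_acquire(void) {
    if (g_bump + SLAB_BYTES > g_arena_end)
        return NULL;   /* arena exhausted: map a new arena here */
    void *slab = g_bump;
    g_bump += SLAB_BYTES;
    return slab;
}
```

A production version would need thread safety and a release path, but the syscall-count arithmetic is the point of the sketch.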

**Option B: Disable Memory Cgroup Accounting (Config - immediate)**
- Run with cgroup v1 or disable the memory controller
- Saves ~4% overhead
- Not practical for production, but useful for profiling
## IMMEDIATE NEXT STEPS (Recommended Priority)

1. **URGENT: Pre-fault SuperSlab Memory** (1 hour work, 2-3x gain)
   - Add MAP_POPULATE to mmap() in superslab acquisition
   - Modify: core/superslab/*.c (superslab_acquire functions)
   - Test: run bench_random_mixed_hakmem and verify the page fault count drops

2. **Profile-Guided Optimization** (2 hours, 20-30% gain)
   - Build with PGO flags
   - Run a representative workload
   - Rebuild with the profile data

3. **Hugepage Support** (1 day, 2-4x gain)
   - Add the MAP_HUGETLB flag to superslab mmap
   - Add a fallback for systems without hugepage support
   - Test memory usage impact

4. **Branch Optimization** (4 hours, 15-25% gain)
   - Add __builtin_expect() hints to unified_cache_refill
   - Simplify hot path conditionals
   - Reorder checks for the common case first

**Conservative Estimate**: With just priorities #1 and #2, we could reach:
- Current: 4.1M ops/s
- After prefaulting: 8.2-12.3M ops/s (2-3x)
- After PGO: 9.8-16.0M ops/s (1.2-1.3x more)
- **Final: ~10-16M ops/s (2.4x-4x total improvement)**

**Aggressive Estimate**: With hugepages + PGO + branch optimization:
- **Final: 16-24M ops/s (4-6x improvement)**
## CONCLUSION

The primary bottleneck is **kernel page fault handling**, consuming 60-70% of execution time. This is because the benchmark triggers page faults on nearly every cache refill operation, forcing the kernel to:

1. Zero new pages (11% of time)
2. Set up page tables (3-5% of time)
3. Add pages to LRU and memory cgroups (12% of time)
4. Manage folios and reverse mappings (10% of time)

**The path to 4x performance is clear**:

1. Eliminate page faults with MAP_POPULATE or hugepages (2-3x gain)
2. Reduce branch mispredictions with PGO (1.2-1.3x gain)
3. Optimize cache locality (1.1-1.2x gain)

Combined, these optimizations should achieve the 4x target (4.1M → 16M+ ops/s).