## Summary

- Fixed the setenv segfault in bench_profile.h identified with ChatGPT (switched to going through RTLD_NEXT)
- Added core/box/pool_zero_mode_box.h: ZERO_MODE is now managed in one place via the ENV cache
- core/hakmem_pool.c now gates memset on the zero mode (full/header/off)
- A/B test result: +15.34% improvement with ZERO_MODE=header (1M iterations, C6-heavy)

## Files Modified

- core/box/pool_api.inc.h: include pool_zero_mode_box.h
- core/bench_profile.h: glibc setenv → malloc+putenv (avoids the segfault)
- core/hakmem_pool.c: zero-mode lookup and control logic
- core/box/pool_zero_mode_box.h (new): enum/getter
- CURRENT_TASK.md: recorded Phase ML1 results

## Test Results

| Iterations | ZERO_MODE=full | ZERO_MODE=header | Improvement |
|------------|----------------|------------------|-------------|
| 10K | 3.06 M ops/s | 3.17 M ops/s | +3.65% |
| 1M | 23.71 M ops/s | 27.34 M ops/s | **+15.34%** |

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
HAKMEM Performance Bottleneck Analysis Report
Date: 2025-12-04
Current Performance: 4.1M ops/s
Target Performance: 16M+ ops/s (4x improvement)
Performance Gap: 3.9x remaining
When reproducing the mid/smallmid (C6-heavy) benchmark, start from the C6_HEAVY_LEGACY_POOLV1 preset in docs/analysis/ENV_PROFILE_PRESETS.md.
KEY METRICS SUMMARY
Hardware Performance Counters (3-run average):
- Total Cycles: 1,146M cycles (0.281s @ ~4.08 GHz)
- Instructions: 1,109M instructions
- IPC (Instructions Per Cycle): 0.97 (GOOD - near optimal)
- Branches: 231.7M
- Branch Misses: 21.0M (9.04% miss rate - MODERATE)
- Cache References: 50.9M
- Cache Misses: 6.6M (13.03% miss rate - MODERATE)
- L1 D-cache Load Misses: 17.2M
Per-Operation Breakdown (1M operations):
- Cycles per op: 1,146 cycles/op
- Instructions per op: 1,109 instructions/op
- L1 misses per op: 17.2 per op
- Page faults: 132,509 total (0.132 per op)
System-Level Metrics:
- Page Faults: 132,509 (448K/sec)
- Minor Faults: 132,509 (all minor, no major faults)
- Context Switches: 29 (negligible)
- CPU Migrations: 8 (negligible)
- Task Clock: 295.67ms (99.7% CPU utilization)
Syscall Overhead:
- Total Syscalls: 2,017
- mmap: 1,016 calls (36.41% time)
- munmap: 995 calls (63.48% time)
- mprotect: 5 calls
- madvise: 1 call
- Total Syscall Time: 13.8ms (4.8% of total runtime)
TOP 10 HOTTEST FUNCTIONS (Self Time)
1. clear_page_erms [kernel]: 7.05% (11.25% with children)
   - Kernel zeroing newly allocated pages
   - This is page fault handling overhead
2. unified_cache_refill [hakmem]: 4.37%
   - Main allocation hot path in HAKMEM
   - Triggers page faults on first touch
3. do_anonymous_page [kernel]: 4.38%
   - Anonymous page allocation in kernel
   - Part of page fault handling
4. __handle_mm_fault [kernel]: 3.80%
   - Memory management fault handler
   - Core of page fault processing
5. srso_alias_safe_ret [kernel]: 2.85%
   - CPU speculation mitigation overhead
   - Return-thunk overhead from the AMD SRSO ("Inception") mitigation
6. asm_exc_page_fault [kernel]: 2.68%
   - Page fault exception entry
   - Low-level page fault handling
7. srso_alias_return_thunk [kernel]: 2.59%
   - More speculation mitigation
   - Security overhead (SRSO return thunks)
8. __mod_lruvec_state [kernel]: 2.27%
   - LRU (page cache) stat tracking
   - Memory accounting overhead
9. __lruvec_stat_mod_folio [kernel]: 2.26%
   - More LRU statistics
   - Memory accounting
10. rmqueue [kernel]: 2.03%
    - Page allocation from buddy allocator
    - Kernel memory allocation
CRITICAL BOTTLENECK ANALYSIS
Primary Bottleneck: Page Fault Handling (69% of total time)
The perf profile shows that 69.07% of execution time is spent in unified_cache_refill and its children, with the vast majority (60%+) spent in kernel page fault handling:
- asm_exc_page_fault → exc_page_fault → do_user_addr_fault → handle_mm_fault
- The call chain shows: 68.96% of time is in page fault handling
Root Cause: The benchmark is triggering page faults on every cache refill operation.
Breaking down the 69% time spent:
1. Page fault overhead: ~60% (kernel handling)
   - clear_page_erms: 11.25% (zeroing pages)
   - do_anonymous_page: 20%+ (allocating folios)
   - folio_add_new_anon_rmap: 7.11% (adding to reverse map)
   - folio_add_lru_vma: 4.88% (adding to LRU)
   - __mem_cgroup_charge: 4.37% (memory cgroup accounting)
   - Page table operations: 2-3%
2. Unified cache refill logic: ~4.37% (user space)
3. Other kernel overhead: ~5%
Secondary Bottlenecks:
1. Memory Zeroing (11.25%)
   - clear_page_erms takes 11.25% of total time
   - Kernel zeroes newly allocated pages for security
   - 132,509 page faults × 4KB = ~515MB of memory touched
   - At 4.1M ops/s, that's 515MB in ~0.25s = ~2GB/s of zeroing bandwidth
2. Memory Cgroup Accounting (4.37%)
   - __mem_cgroup_charge and related functions
   - Per-page memory accounting overhead
   - LRU statistics tracking
3. Speculation Mitigation (5.44%)
   - srso_alias_safe_ret (2.85%) + srso_alias_return_thunk (2.59%)
   - CPU security mitigation (AMD SRSO return thunks)
   - Indirect branch overhead
4. User-space Allocation (~8.4%)
   - free: 1.40%
   - malloc: 1.36%
   - shared_pool_acquire_slab: 1.31%
   - unified_cache_refill: 4.37%
5. Branch Mispredictions (moderate)
   - 9.04% branch miss rate
   - 21M mispredictions / 1M ops = 21 misses per operation
   - At ~15-20 cycles each, that's 315-420 cycles/op wasted
WHY WE'RE AT 4.1M OPS/S INSTEAD OF 16M+
Fundamental Issue: Page Fault Storm
The current implementation is triggering page faults on nearly every cache refill:
- 132,509 page faults / 1,000,000 operations = 13.25% of operations trigger a page fault
- Page fault handling costs ~690 cycles per operation on average (60% of 1,146 cycles/op); spread over 0.13 faults per op, that is roughly 5,200 cycles per actual fault
Time Budget Analysis (at 4.08 GHz):
- Current: 1,146 cycles/op → 4.1M ops/s
- Target: ~245 cycles/op → 16M ops/s
Where the 900 extra cycles go:
- Page fault handling: ~690 cycles/op (76% of overhead)
- Branch mispredictions: ~315-420 cycles/op (35-46% of overhead)
- Cache misses: ~170 cycles/op (17.2 L1 misses × 10 cycles)
- Speculation mitigation: ~60 cycles/op
- Other kernel overhead: ~100 cycles/op
The Math Doesn't Add Up to 4x:
- If we eliminate ALL page faults (690 cycles), we'd be at 456 cycles/op → 8.9M ops/s (2.2x)
- If we also eliminate branch misses (315 cycles), we'd be at 141 cycles/op → 28.9M ops/s (7x!)
- If we cut cache misses in half, we'd save another 85 cycles
The overlapping penalties mean these don't sum linearly, but the analysis shows:
- Page faults are the #1 bottleneck (60-70% of time)
- Branch mispredictions are significant (9% miss rate)
- Cache misses are moderate but not catastrophic
SPECIFIC OBSERVATIONS
1. Cache Refill Pattern
From the unified_cache_refill perf annotation (instruction address 26f7):
  26f7: mov %dil,0x0(%rbp)   # 17.27% of samples (hottest instruction)
  26fb: incb 0x11(%r15)      # 3.31% (updating metadata)
This suggests the hot path is writing to newly allocated memory, triggering first-touch page faults.
2. Working Set Size
- Benchmark uses ws=256 slots
- Size range: 16-1024 bytes
- Average ~520 bytes per allocation
- Total working set: ~130KB (fits in L2, but spans many pages)
3. Allocation Pattern
- 50/50 malloc/free distribution
- Random replacement (xorshift PRNG)
- This creates maximum memory fragmentation and poor locality
RECOMMENDATIONS FOR NEXT OPTIMIZATION PHASE
Priority 1: Eliminate Page Fault Overhead (Target: 2-3x improvement)
Option A: Pre-fault Memory (Immediate - 1 hour)
- Use madvise(MADV_WILLNEED) or mmap(MAP_POPULATE) to pre-fault SuperSlab pages
- Add MAP_POPULATE to superslab_acquire() mmap calls
- This will trade startup time for runtime performance
- Expected: Eliminate 60-70% of page faults → 2-3x improvement
Option B: Implement madvise(MADV_FREE) / MADV_DONTNEED Cycling (Medium - 4 hours)
- Keep physical pages resident but mark them clean
- Avoid repeated zeroing on reuse
- Requires careful lifecycle management
- Expected: 30-50% improvement
Option C: Use Hugepages (Medium-High complexity - 1 day)
- mmap with MAP_HUGETLB to use 2MB pages
- Reduces page fault count by 512x (4KB → 2MB)
- Reduces TLB pressure significantly
- Expected: 2-4x improvement
- Risk: May increase memory waste for small allocations
Priority 2: Reduce Branch Mispredictions (Target: 1.5x improvement)
Option A: Profile-Guided Optimization (Easy - 2 hours)
- Build with -fprofile-generate, run benchmark, rebuild with -fprofile-use
- Helps compiler optimize branch layout
- Expected: 20-30% improvement
Option B: Simplify Cache Refill Logic (Medium - 1 day)
- Review unified_cache_refill control flow
- Reduce conditional branches in hot path
- Use __builtin_expect() for likely/unlikely hints
- Expected: 15-25% improvement
Option C: Add Fast Path for Common Cases (Medium - 4 hours)
- Special-case the most common allocation sizes
- Bypass complex logic for hot sizes
- Expected: 20-30% improvement on typical workloads
Priority 3: Improve Cache Locality (Target: 1.2-1.5x improvement)
Option A: Optimize Data Structure Layout (Easy - 2 hours)
- Pack hot fields together in cache lines
- Align structures to cache line boundaries
- Add attribute((aligned(64))) to hot structures
- Expected: 10-20% improvement
Option B: Prefetch Optimization (Medium - 4 hours)
- Add __builtin_prefetch() for predictable access patterns
- Prefetch next slab metadata during allocation
- Expected: 15-25% improvement
Priority 4: Reduce Kernel Overhead (Target: 1.1-1.2x improvement)
Option A: Batch Operations (Hard - 2 days)
- Batch multiple allocations into single mmap() call
- Reduce syscall frequency
- Expected: 10-15% improvement
Option B: Disable Memory Cgroup Accounting (Config - immediate)
- Run with cgroup v1 or disable memory controller
- Saves ~4% overhead
- Not practical for production but useful for profiling
IMMEDIATE NEXT STEPS (Recommended Priority)
1. URGENT: Pre-fault SuperSlab Memory (1 hour work, 2-3x gain)
   - Add MAP_POPULATE to mmap() in superslab acquisition
   - Modify: core/superslab/*.c (superslab_acquire functions)
   - Test: Run bench_random_mixed_hakmem and verify the page fault count drops
2. Profile-Guided Optimization (2 hours, 20-30% gain)
   - Build with PGO flags
   - Run a representative workload
   - Rebuild with the profile data
3. Hugepage Support (1 day, 2-4x gain)
   - Add the MAP_HUGETLB flag to superslab mmap
   - Add a fallback for systems without hugepage support
   - Test the memory usage impact
4. Branch Optimization (4 hours, 15-25% gain)
   - Add __builtin_expect() hints to unified_cache_refill
   - Simplify hot path conditionals
   - Reorder checks so the common case comes first
Conservative Estimate: With just priorities #1 and #2, we could reach:
- Current: 4.1M ops/s
- After prefaulting: 8.2-12.3M ops/s (2-3x)
- After PGO: 9.8-16.0M ops/s (1.2x more)
- Final: ~10-16M ops/s (2.4x - 4x total improvement)
Aggressive Estimate: With hugepages + PGO + branch optimization:
- Final: 16-24M ops/s (4-6x improvement)
CONCLUSION
The primary bottleneck is kernel page fault handling, consuming 60-70% of execution time. This is because the benchmark triggers page faults on nearly every cache refill operation, forcing the kernel to:
- Zero new pages (11% of time)
- Set up page tables (3-5% of time)
- Add pages to LRU and memory cgroups (12% of time)
- Manage folios and reverse mappings (10% of time)
The path to 4x performance is clear:
- Eliminate page faults with MAP_POPULATE or hugepages (2-3x gain)
- Reduce branch mispredictions with PGO (1.2-1.3x gain)
- Optimize cache locality (1.1-1.2x gain)
Combined, these optimizations should easily achieve the 4x target (4.1M → 16M+ ops/s).