hakmem/PERF_BOTTLENECK_ANALYSIS_20251204.md
Moe Charm (CI) acc64f2438 Phase ML1: reduce Pool v1's 89.73% memset overhead (+15.34% improvement)
## Summary
- Fixed the setenv segfault in bench_profile.h with ChatGPT's help (switched to going through RTLD_NEXT)
- Added core/box/pool_zero_mode_box.h: ZERO_MODE is now managed in one place via a cached ENV lookup
- core/hakmem_pool.c now gates its memset on the zero mode (full/header/off)
- A/B test result: +15.34% improvement with ZERO_MODE=header (1M iterations, C6-heavy)
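The enum/getter pattern described above can be sketched as follows. This is a minimal illustration, not the real pool_zero_mode_box.h: the enum constants, getter name, and env-var spellings are assumptions based on the full/header/off modes the summary mentions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the pool_zero_mode_box.h enum/getter: the mode
 * is read from the environment once and cached, so the hot path pays
 * only a predictable branch, not a getenv() call per allocation. */
typedef enum {
    POOL_ZERO_FULL,   /* memset the whole block (old behavior) */
    POOL_ZERO_HEADER, /* zero only the header (+15.34% in the A/B test) */
    POOL_ZERO_OFF     /* no zeroing at all */
} pool_zero_mode_t;

static pool_zero_mode_t pool_zero_mode(void) {
    static int cached = -1;                    /* -1 = not read yet */
    if (cached < 0) {
        const char *s = getenv("ZERO_MODE");
        if (s && strcmp(s, "header") == 0)     cached = POOL_ZERO_HEADER;
        else if (s && strcmp(s, "off") == 0)   cached = POOL_ZERO_OFF;
        else                                   cached = POOL_ZERO_FULL;
    }
    return (pool_zero_mode_t)cached;
}
```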

## Files Modified
- core/box/pool_api.inc.h: include pool_zero_mode_box.h
- core/bench_profile.h: glibc setenv → malloc+putenv (avoids the segfault)
- core/hakmem_pool.c: zero-mode lookup and control logic
- core/box/pool_zero_mode_box.h (new): enum/getter
- CURRENT_TASK.md: recorded the Phase ML1 results
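The malloc+putenv replacement for glibc setenv can look like the sketch below. The function name is hypothetical, and the rationale in the comment (setenv allocating through the interposed allocator during bootstrap) is an assumption consistent with, but not confirmed by, the commit summary.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the bench_profile.h fix: instead of setenv(),
 * build the "NAME=value" string ourselves and hand it to putenv().
 * The buffer must outlive the call (putenv keeps the pointer), so it is
 * intentionally never freed. */
static int bench_set_env(const char *name, const char *value) {
    size_t n = strlen(name) + 1 + strlen(value) + 1;  /* NAME=value\0 */
    char *kv = malloc(n);
    if (!kv) return -1;
    snprintf(kv, n, "%s=%s", name, value);
    return putenv(kv);  /* environment now owns kv */
}
```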

## Test Results
| Iterations | ZERO_MODE=full | ZERO_MODE=header | Improvement |
|-----------|----------------|-----------------|------------|
| 10K       | 3.06 M ops/s   | 3.17 M ops/s    | +3.65%     |
| 1M        | 23.71 M ops/s  | 27.34 M ops/s   | **+15.34%** |

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-10 09:08:18 +09:00


HAKMEM Performance Bottleneck Analysis Report

Date: 2025-12-04
Current Performance: 4.1M ops/s
Target Performance: 16M+ ops/s (4x improvement)
Performance Gap: 3.9x remaining

To reproduce the mid/smallmid C6-heavy benchmark, start from the C6_HEAVY_LEGACY_POOLV1 preset in docs/analysis/ENV_PROFILE_PRESETS.md.

KEY METRICS SUMMARY

Hardware Performance Counters (3-run average):

  • Total Cycles: 1,146M cycles (0.281s @ ~4.08 GHz)
  • Instructions: 1,109M instructions
  • IPC (Instructions Per Cycle): 0.97 (GOOD - near optimal)
  • Branches: 231.7M
  • Branch Misses: 21.0M (9.04% miss rate - MODERATE)
  • Cache References: 50.9M
  • Cache Misses: 6.6M (13.03% miss rate - MODERATE)
  • L1 D-cache Load Misses: 17.2M

Per-Operation Breakdown (1M operations):

  • Cycles per op: 1,146 cycles/op
  • Instructions per op: 1,109 instructions/op
  • L1 misses per op: 17.2 per op
  • Page faults: 132,509 total (0.132 per op)

System-Level Metrics:

  • Page Faults: 132,509 (448K/sec)
  • Minor Faults: 132,509 (all minor, no major faults)
  • Context Switches: 29 (negligible)
  • CPU Migrations: 8 (negligible)
  • Task Clock: 295.67ms (99.7% CPU utilization)

Syscall Overhead:

  • Total Syscalls: 2,017
  • mmap: 1,016 calls (36.41% time)
  • munmap: 995 calls (63.48% time)
  • mprotect: 5 calls
  • madvise: 1 call
  • Total Syscall Time: 13.8ms (4.8% of total runtime)

TOP 10 HOTTEST FUNCTIONS (Self Time)

  1. clear_page_erms [kernel]: 7.05% (11.25% with children)

    • Kernel zeroing newly allocated pages
    • This is page fault handling overhead
  2. unified_cache_refill [hakmem]: 4.37%

    • Main allocation hot path in HAKMEM
    • Triggers page faults on first touch
  3. do_anonymous_page [kernel]: 4.38%

    • Anonymous page allocation in kernel
    • Part of page fault handling
  4. __handle_mm_fault [kernel]: 3.80%

    • Memory management fault handler
    • Core of page fault processing
  5. srso_alias_safe_ret [kernel]: 2.85%

    • CPU speculation mitigation overhead
    • Retpoline-style security overhead
  6. asm_exc_page_fault [kernel]: 2.68%

    • Page fault exception entry
    • Low-level page fault handling
  7. srso_alias_return_thunk [kernel]: 2.59%

    • More speculation mitigation
    • Security overhead (Spectre/Meltdown)
  8. __mod_lruvec_state [kernel]: 2.27%

    • LRU (page cache) stat tracking
    • Memory accounting overhead
  9. __lruvec_stat_mod_folio [kernel]: 2.26%

    • More LRU statistics
    • Memory accounting
  10. rmqueue [kernel]: 2.03%

    • Page allocation from buddy allocator
    • Kernel memory allocation

CRITICAL BOTTLENECK ANALYSIS

Primary Bottleneck: Page Fault Handling (69% of total time)

The perf profile shows that 69.07% of execution time is spent in unified_cache_refill and its children, with the vast majority (60%+) spent in kernel page fault handling:

  • asm_exc_page_fault → exc_page_fault → do_user_addr_fault → handle_mm_fault
  • The call chain shows: 68.96% of time is in page fault handling

Root Cause: The benchmark is triggering page faults on every cache refill operation.

Breaking down the 69% time spent:

  1. Page fault overhead: ~60% (kernel handling)

    • clear_page_erms: 11.25% (zeroing pages)
    • do_anonymous_page: 20%+ (allocating folios)
    • folio_add_new_anon_rmap: 7.11% (adding to reverse map)
    • folio_add_lru_vma: 4.88% (adding to LRU)
    • __mem_cgroup_charge: 4.37% (memory cgroup accounting)
    • Page table operations: 2-3%
  2. Unified cache refill logic: ~4.37% (user space)

  3. Other kernel overhead: ~5%

Secondary Bottlenecks:

  1. Memory Zeroing (11.25%)

    • clear_page_erms takes 11.25% of total time
    • Kernel zeroes newly allocated pages for security
    • 132,509 page faults × 4KB = ~515MB of memory touched
    • At 4.1M ops/s, that's 515MB in 0.25s = 2GB/s zeroing bandwidth
  2. Memory Cgroup Accounting (4.37%)

    • __mem_cgroup_charge and related functions
    • Per-page memory accounting overhead
    • LRU statistics tracking
  3. Speculation Mitigation (5.44%)

    • srso_alias_safe_ret (2.85%) + srso_alias_return_thunk (2.59%)
    • CPU security mitigations (Spectre/Meltdown)
    • Indirect branch overhead
  4. User-space Allocation (6-8%)

    • free: 1.40%
    • malloc: 1.36%
    • shared_pool_acquire_slab: 1.31%
    • unified_cache_refill: 4.37%
  5. Branch Mispredictions (moderate)

    • 9.04% branch miss rate
    • 21M mispredictions / 1M ops = 21 misses per operation
    • Each miss ~15-20 cycles = 315-420 cycles/op wasted

WHY WE'RE AT 4.1M OPS/S INSTEAD OF 16M+

Fundamental Issue: Page Fault Storm

The current implementation is triggering page faults on nearly every cache refill:

  • 132,509 page faults / 1,000,000 operations = 13.25% of operations trigger a page fault
  • Page fault handling costs ~690 cycles per operation on average (60% of 1,146 cycles/op ≈ 687 cycles); spread over 0.1325 faults per op, that is roughly 5,200 cycles per individual fault
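The arithmetic behind these per-op figures can be sanity-checked directly; this is a sketch using only the report's own numbers (60% page-fault share, 1,146 cycles/op, 132,509 faults per 1M ops), not new measurements.

```c
/* Reproduce the report's page-fault cycle budget. */

/* Average page-fault cost per operation: 60% of 1,146 cycles/op. */
static double pf_cycles_per_op(void) {
    return 0.60 * 1146.0;                       /* ~688 cycles/op */
}

/* Cost per individual fault: per-op cost divided by faults per op. */
static double cycles_per_fault(void) {
    double faults_per_op = 132509.0 / 1000000.0; /* 0.1325 faults/op */
    return pf_cycles_per_op() / faults_per_op;   /* ~5,200 cycles */
}
```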

Time Budget Analysis (at 4.08 GHz):

  • Current: 1,146 cycles/op → 4.1M ops/s
  • Target: ~245 cycles/op → 16M ops/s

Where the 900 extra cycles go:

  1. Page fault handling: ~690 cycles/op (76% of overhead)
  2. Branch mispredictions: ~315-420 cycles/op (35-46% of overhead)
  3. Cache misses: ~170 cycles/op (17.2 L1 misses × 10 cycles)
  4. Speculation mitigation: ~60 cycles/op
  5. Other kernel overhead: ~100 cycles/op

The Math Doesn't Add Up to 4x:

  • If we eliminate ALL page faults (690 cycles), we'd be at 456 cycles/op → 8.9M ops/s (2.2x)
  • If we also eliminate branch misses (315 cycles), we'd be at 141 cycles/op → 28.9M ops/s (7x!)
  • If we cut cache misses in half, we'd save another 85 cycles

The overlapping penalties mean these don't sum linearly, but the analysis shows:

  • Page faults are the #1 bottleneck (60-70% of time)
  • Branch mispredictions are significant (9% miss rate)
  • Cache misses are moderate but not catastrophic

SPECIFIC OBSERVATIONS

1. Cache Refill Pattern

From unified_cache_refill annotation at line 26f7:

26f7:   mov    %dil,0x0(%rbp)    # 17.27% of samples (HOTTEST instruction)
26fb:   incb   0x11(%r15)        # 3.31% (updating metadata)

This suggests the hot path is writing to newly allocated memory (triggering page faults).

2. Working Set Size

  • Benchmark uses ws=256 slots
  • Size range: 16-1024 bytes
  • Average ~520 bytes per allocation
  • Total working set: ~130KB (fits in L2, but spans many pages)

3. Allocation Pattern

  • 50/50 malloc/free distribution
  • Random replacement (xorshift PRNG)
  • This creates maximum memory fragmentation and poor locality

RECOMMENDATIONS FOR NEXT OPTIMIZATION PHASE

Priority 1: Eliminate Page Fault Overhead (Target: 2-3x improvement)

Option A: Pre-fault Memory (Immediate - 1 hour)

  • Use madvise(MADV_WILLNEED) or mmap(MAP_POPULATE) to pre-fault SuperSlab pages
  • Add MAP_POPULATE to superslab_acquire() mmap calls
  • This will trade startup time for runtime performance
  • Expected: Eliminate 60-70% of page faults → 2-3x improvement

Option B: Implement madvise(MADV_FREE) / MADV_DONTNEED Cycling (Medium - 4 hours)

  • Keep physical pages resident but mark them clean
  • Avoid repeated zeroing on reuse
  • Requires careful lifecycle management
  • Expected: 30-50% improvement

Option C: Use Hugepages (Medium-High complexity - 1 day)

  • mmap with MAP_HUGETLB to use 2MB pages
  • Reduces page fault count by 512x (4KB → 2MB)
  • Reduces TLB pressure significantly
  • Expected: 2-4x improvement
  • Risk: May increase memory waste for small allocations

Priority 2: Reduce Branch Mispredictions (Target: 1.5x improvement)

Option A: Profile-Guided Optimization (Easy - 2 hours)

  • Build with -fprofile-generate, run benchmark, rebuild with -fprofile-use
  • Helps compiler optimize branch layout
  • Expected: 20-30% improvement

Option B: Simplify Cache Refill Logic (Medium - 1 day)

  • Review unified_cache_refill control flow
  • Reduce conditional branches in hot path
  • Use __builtin_expect() for likely/unlikely hints
  • Expected: 15-25% improvement
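The hint pattern can be sketched like this; the macro names follow the usual Linux-kernel convention, and the refill logic is illustrative rather than HAKMEM's real unified_cache_refill:

```c
/* Annotate the refill fast path so the compiler lays out the common
 * (cache-hit) case as the straight-line fall-through. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Pop one slot from a per-thread cache; returns 0 when a refill is
 * needed (the rare, out-of-line case). */
static int cache_pop(int *cache_count, void **cache, void **out) {
    if (likely(*cache_count > 0)) {    /* common case: cache hit */
        *out = cache[--*cache_count];
        return 1;
    }
    return 0;                          /* unlikely: caller refills */
}
```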

Option C: Add Fast Path for Common Cases (Medium - 4 hours)

  • Special-case the most common allocation sizes
  • Bypass complex logic for hot sizes
  • Expected: 20-30% improvement on typical workloads

Priority 3: Improve Cache Locality (Target: 1.2-1.5x improvement)

Option A: Optimize Data Structure Layout (Easy - 2 hours)

  • Pack hot fields together in cache lines
  • Align structures to cache line boundaries
    • Add __attribute__((aligned(64))) to hot structures
  • Expected: 10-20% improvement
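A minimal sketch of the layout idea; the struct and field names are illustrative, not HAKMEM's real data structures:

```c
#include <stddef.h>

/* Keep the fields touched on every alloc/free together in one 64-byte
 * cache line and pin the struct to a line boundary, so a single L1 line
 * fill serves the whole hot path. Cold statistics should live in a
 * separate structure so they do not share (and evict) this line. */
struct hot_cache {
    void   **slots;      /* hot: read on every pop */
    unsigned count;      /* hot: bumped on every op */
    unsigned capacity;   /* hot: checked on push */
} __attribute__((aligned(64)));
```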

Option B: Prefetch Optimization (Medium - 4 hours)

  • Add __builtin_prefetch() for predictable access patterns
  • Prefetch next slab metadata during allocation
  • Expected: 15-25% improvement
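Option B can be sketched as below; the slab_meta struct and its fields are illustrative assumptions:

```c
/* Prefetch the next slab's metadata while the current allocation is
 * still being finished, hiding part of the L1 miss latency behind
 * useful work. */
struct slab_meta {
    struct slab_meta *next;   /* next slab in the refill list */
    unsigned free_count;      /* slots still free in this slab */
};

static void advance_with_prefetch(struct slab_meta *cur) {
    if (cur && cur->next)
        /* 0 = read access, 3 = high temporal locality (keep in L1) */
        __builtin_prefetch(cur->next, 0, 3);
}
```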

Priority 4: Reduce Kernel Overhead (Target: 1.1-1.2x improvement)

Option A: Batch Operations (Hard - 2 days)

  • Batch multiple allocations into single mmap() call
  • Reduce syscall frequency
  • Expected: 10-15% improvement

Option B: Disable Memory Cgroup Accounting (Config - immediate)

  • Run with cgroup v1 or disable memory controller
  • Saves ~4% overhead
  • Not practical for production but useful for profiling
RECOMMENDED ACTION SEQUENCE

  1. URGENT: Pre-fault SuperSlab Memory (1 hour work, 2-3x gain)

    • Add MAP_POPULATE to mmap() in superslab acquisition
    • Modify: core/superslab/*.c (superslab_acquire functions)
    • Test: Run bench_random_mixed_hakmem and verify page fault count drops
  2. Profile-Guided Optimization (2 hours, 20-30% gain)

    • Build with PGO flags
    • Run representative workload
    • Rebuild with profile data
  3. Hugepage Support (1 day, 2-4x gain)

    • Add MAP_HUGETLB flag to superslab mmap
    • Add fallback for systems without hugepage support
    • Test memory usage impact
  4. Branch Optimization (4 hours, 15-25% gain)

    • Add __builtin_expect() hints to unified_cache_refill
    • Simplify hot path conditionals
    • Reorder checks for common case first

Conservative Estimate: With just priorities #1 and #2, we could reach:

  • Current: 4.1M ops/s
  • After prefaulting: 8.2-12.3M ops/s (2-3x)
  • After PGO: 9.8-16.0M ops/s (1.2x more)
  • Final: ~10-16M ops/s (2.4x - 4x total improvement)

Aggressive Estimate: With hugepages + PGO + branch optimization:

  • Final: 16-24M ops/s (4-6x improvement)

CONCLUSION

The primary bottleneck is kernel page fault handling, consuming 60-70% of execution time. This is because the benchmark triggers page faults on nearly every cache refill operation, forcing the kernel to:

  1. Zero new pages (11% of time)
  2. Set up page tables (3-5% of time)
  3. Add pages to LRU and memory cgroups (12% of time)
  4. Manage folios and reverse mappings (10% of time)

The path to 4x performance is clear:

  1. Eliminate page faults with MAP_POPULATE or hugepages (2-3x gain)
  2. Reduce branch mispredictions with PGO (1.2-1.3x gain)
  3. Optimize cache locality (1.1-1.2x gain)

Combined, these optimizations should easily achieve the 4x target (4.1M → 16M+ ops/s).