hakmem/PERF_BOTTLENECK_ANALYSIS_20251204.md
Moe Charm (CI) acc64f2438 Phase ML1: reduce Pool v1's 89.73% memset overhead (+15.34% improvement)
## Summary
- Fixed the setenv segfault in bench_profile.h with ChatGPT's help (switched to going through RTLD_NEXT)
- Added core/box/pool_zero_mode_box.h: ZERO_MODE is now managed in one place via a cached ENV lookup
- core/hakmem_pool.c now gates its memset on the zero mode (full/header/off)
- A/B test result: +15.34% improvement with ZERO_MODE=header (1M iterations, C6-heavy)
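The enum/getter pattern described above can be sketched as follows. This is a minimal illustration, not the real pool_zero_mode_box.h: the enum constants, getter name, and env-var spellings are assumptions based on the full/header/off modes the summary mentions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the pool_zero_mode_box.h enum/getter: the mode
 * is read from the environment once and cached, so the hot path pays
 * only a predictable branch, not a getenv() call per allocation. */
typedef enum {
    POOL_ZERO_FULL,   /* memset the whole block (old behavior) */
    POOL_ZERO_HEADER, /* zero only the header (+15.34% in the A/B test) */
    POOL_ZERO_OFF     /* no zeroing at all */
} pool_zero_mode_t;

static pool_zero_mode_t pool_zero_mode(void) {
    static int cached = -1;                    /* -1 = not read yet */
    if (cached < 0) {
        const char *s = getenv("ZERO_MODE");
        if (s && strcmp(s, "header") == 0)     cached = POOL_ZERO_HEADER;
        else if (s && strcmp(s, "off") == 0)   cached = POOL_ZERO_OFF;
        else                                   cached = POOL_ZERO_FULL;
    }
    return (pool_zero_mode_t)cached;
}
```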

## Files Modified
- core/box/pool_api.inc.h: include pool_zero_mode_box.h
- core/bench_profile.h: glibc setenv → malloc+putenv (avoids the segfault)
- core/hakmem_pool.c: zero-mode lookup and control logic
- core/box/pool_zero_mode_box.h (new): enum/getter
- CURRENT_TASK.md: recorded the Phase ML1 results
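The malloc+putenv replacement for glibc setenv can look like the sketch below. The function name is hypothetical, and the rationale in the comment (setenv allocating through the interposed allocator during bootstrap) is an assumption consistent with, but not confirmed by, the commit summary.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the bench_profile.h fix: instead of setenv(),
 * build the "NAME=value" string ourselves and hand it to putenv().
 * The buffer must outlive the call (putenv keeps the pointer), so it is
 * intentionally never freed. */
static int bench_set_env(const char *name, const char *value) {
    size_t n = strlen(name) + 1 + strlen(value) + 1;  /* NAME=value\0 */
    char *kv = malloc(n);
    if (!kv) return -1;
    snprintf(kv, n, "%s=%s", name, value);
    return putenv(kv);  /* environment now owns kv */
}
```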

## Test Results
| Iterations | ZERO_MODE=full | ZERO_MODE=header | Improvement |
|-----------|----------------|-----------------|------------|
| 10K       | 3.06 M ops/s   | 3.17 M ops/s    | +3.65%     |
| 1M        | 23.71 M ops/s  | 27.34 M ops/s   | **+15.34%** |

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-10 09:08:18 +09:00


HAKMEM Performance Bottleneck Analysis Report

Date: 2025-12-04
Current Performance: 4.1M ops/s
Target Performance: 16M+ ops/s (4x improvement)
Performance Gap: 3.9x remaining

To reproduce the mid/smallmid C6-heavy benchmark, start from the C6_HEAVY_LEGACY_POOLV1 preset in docs/analysis/ENV_PROFILE_PRESETS.md.

KEY METRICS SUMMARY

Hardware Performance Counters (3-run average):

  • Total Cycles: 1,146M cycles (0.281s @ ~4.08 GHz)
  • Instructions: 1,109M instructions
  • IPC (Instructions Per Cycle): 0.97 (GOOD - near optimal)
  • Branches: 231.7M
  • Branch Misses: 21.0M (9.04% miss rate - MODERATE)
  • Cache References: 50.9M
  • Cache Misses: 6.6M (13.03% miss rate - MODERATE)
  • L1 D-cache Load Misses: 17.2M

Per-Operation Breakdown (1M operations):

  • Cycles per op: 1,146 cycles/op
  • Instructions per op: 1,109 instructions/op
  • L1 misses per op: 17.2 per op
  • Page faults: 132,509 total (0.132 per op)

System-Level Metrics:

  • Page Faults: 132,509 (448K/sec)
  • Minor Faults: 132,509 (all minor, no major faults)
  • Context Switches: 29 (negligible)
  • CPU Migrations: 8 (negligible)
  • Task Clock: 295.67ms (99.7% CPU utilization)

Syscall Overhead:

  • Total Syscalls: 2,017
  • mmap: 1,016 calls (36.41% time)
  • munmap: 995 calls (63.48% time)
  • mprotect: 5 calls
  • madvise: 1 call
  • Total Syscall Time: 13.8ms (4.8% of total runtime)

TOP 10 HOTTEST FUNCTIONS (Self Time)

  1. clear_page_erms [kernel]: 7.05% (11.25% with children)

    • Kernel zeroing newly allocated pages
    • This is page fault handling overhead
  2. unified_cache_refill [hakmem]: 4.37%

    • Main allocation hot path in HAKMEM
    • Triggers page faults on first touch
  3. do_anonymous_page [kernel]: 4.38%

    • Anonymous page allocation in kernel
    • Part of page fault handling
  4. __handle_mm_fault [kernel]: 3.80%

    • Memory management fault handler
    • Core of page fault processing
  5. srso_alias_safe_ret [kernel]: 2.85%

    • CPU speculation mitigation overhead
    • Retpoline-style security overhead
  6. asm_exc_page_fault [kernel]: 2.68%

    • Page fault exception entry
    • Low-level page fault handling
  7. srso_alias_return_thunk [kernel]: 2.59%

    • More speculation mitigation
    • Security overhead (Spectre/Meltdown)
  8. __mod_lruvec_state [kernel]: 2.27%

    • LRU (page cache) stat tracking
    • Memory accounting overhead
  9. __lruvec_stat_mod_folio [kernel]: 2.26%

    • More LRU statistics
    • Memory accounting
  10. rmqueue [kernel]: 2.03%

    • Page allocation from buddy allocator
    • Kernel memory allocation

CRITICAL BOTTLENECK ANALYSIS

Primary Bottleneck: Page Fault Handling (69% of total time)

The perf profile shows that 69.07% of execution time is spent in unified_cache_refill and its children, with the vast majority (60%+) spent in kernel page fault handling:

  • asm_exc_page_fault → exc_page_fault → do_user_addr_fault → handle_mm_fault
  • The call chain shows: 68.96% of time is in page fault handling

Root Cause: The benchmark is triggering page faults on every cache refill operation.

Breaking down the 69% time spent:

  1. Page fault overhead: ~60% (kernel handling)

    • clear_page_erms: 11.25% (zeroing pages)
    • do_anonymous_page: 20%+ (allocating folios)
    • folio_add_new_anon_rmap: 7.11% (adding to reverse map)
    • folio_add_lru_vma: 4.88% (adding to LRU)
    • __mem_cgroup_charge: 4.37% (memory cgroup accounting)
    • Page table operations: 2-3%
  2. Unified cache refill logic: ~4.37% (user space)

  3. Other kernel overhead: ~5%

Secondary Bottlenecks:

  1. Memory Zeroing (11.25%)

    • clear_page_erms takes 11.25% of total time
    • Kernel zeroes newly allocated pages for security
    • 132,509 page faults × 4KB = ~515MB of memory touched
    • At 4.1M ops/s, that's 515MB in 0.25s = 2GB/s zeroing bandwidth
  2. Memory Cgroup Accounting (4.37%)

    • __mem_cgroup_charge and related functions
    • Per-page memory accounting overhead
    • LRU statistics tracking
  3. Speculation Mitigation (5.44%)

    • srso_alias_safe_ret (2.85%) + srso_alias_return_thunk (2.59%)
    • CPU security mitigations (Spectre/Meltdown)
    • Indirect branch overhead
  4. User-space Allocation (6-8%)

    • free: 1.40%
    • malloc: 1.36%
    • shared_pool_acquire_slab: 1.31%
    • unified_cache_refill: 4.37%
  5. Branch Mispredictions (moderate)

    • 9.04% branch miss rate
    • 21M mispredictions / 1M ops = 21 misses per operation
    • Each miss ~15-20 cycles = 315-420 cycles/op wasted

WHY WE'RE AT 4.1M OPS/S INSTEAD OF 16M+

Fundamental Issue: Page Fault Storm

The current implementation is triggering page faults on nearly every cache refill:

  • 132,509 page faults / 1,000,000 operations = 13.25% of operations trigger a page fault
  • Page fault handling costs ~690 cycles per operation on average (60% of 1,146 cycles/op ≈ 687 cycles); spread over 0.1325 faults per op, that is roughly 5,200 cycles per individual fault
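The arithmetic behind these per-op figures can be sanity-checked directly; this is a sketch using only the report's own numbers (60% page-fault share, 1,146 cycles/op, 132,509 faults per 1M ops), not new measurements.

```c
/* Reproduce the report's page-fault cycle budget. */

/* Average page-fault cost per operation: 60% of 1,146 cycles/op. */
static double pf_cycles_per_op(void) {
    return 0.60 * 1146.0;                       /* ~688 cycles/op */
}

/* Cost per individual fault: per-op cost divided by faults per op. */
static double cycles_per_fault(void) {
    double faults_per_op = 132509.0 / 1000000.0; /* 0.1325 faults/op */
    return pf_cycles_per_op() / faults_per_op;   /* ~5,200 cycles */
}
```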

Time Budget Analysis (at 4.08 GHz):

  • Current: 1,146 cycles/op → 4.1M ops/s
  • Target: ~245 cycles/op → 16M ops/s

Where the 900 extra cycles go:

  1. Page fault handling: ~690 cycles/op (76% of overhead)
  2. Branch mispredictions: ~315-420 cycles/op (35-46% of overhead)
  3. Cache misses: ~170 cycles/op (17.2 L1 misses × 10 cycles)
  4. Speculation mitigation: ~60 cycles/op
  5. Other kernel overhead: ~100 cycles/op

The Math Doesn't Add Up to 4x:

  • If we eliminate ALL page faults (690 cycles), we'd be at 456 cycles/op → 8.9M ops/s (2.2x)
  • If we also eliminate branch misses (315 cycles), we'd be at 141 cycles/op → 28.9M ops/s (7x!)
  • If we cut cache misses in half, we'd save another 85 cycles

The overlapping penalties mean these don't sum linearly, but the analysis shows:

  • Page faults are the #1 bottleneck (60-70% of time)
  • Branch mispredictions are significant (9% miss rate)
  • Cache misses are moderate but not catastrophic

SPECIFIC OBSERVATIONS

1. Cache Refill Pattern

From unified_cache_refill annotation at line 26f7:

26f7:   mov    %dil,0x0(%rbp)    # 17.27% of samples (HOTTEST instruction)
26fb:   incb   0x11(%r15)        # 3.31% (updating metadata)

This suggests the hot path is writing to newly allocated memory (triggering page faults).

2. Working Set Size

  • Benchmark uses ws=256 slots
  • Size range: 16-1024 bytes
  • Average ~520 bytes per allocation
  • Total working set: ~130KB (fits in L2, but spans many pages)

3. Allocation Pattern

  • 50/50 malloc/free distribution
  • Random replacement (xorshift PRNG)
  • This creates maximum memory fragmentation and poor locality

RECOMMENDATIONS FOR NEXT OPTIMIZATION PHASE

Priority 1: Eliminate Page Fault Overhead (Target: 2-3x improvement)

Option A: Pre-fault Memory (Immediate - 1 hour)

  • Use madvise(MADV_WILLNEED) or mmap(MAP_POPULATE) to pre-fault SuperSlab pages
  • Add MAP_POPULATE to superslab_acquire() mmap calls
  • This will trade startup time for runtime performance
  • Expected: Eliminate 60-70% of page faults → 2-3x improvement

Option B: Implement madvise(MADV_FREE) / MADV_DONTNEED Cycling (Medium - 4 hours)

  • Keep physical pages resident but mark them clean
  • Avoid repeated zeroing on reuse
  • Requires careful lifecycle management
  • Expected: 30-50% improvement

Option C: Use Hugepages (Medium-High complexity - 1 day)

  • mmap with MAP_HUGETLB to use 2MB pages
  • Reduces page fault count by 512x (4KB → 2MB)
  • Reduces TLB pressure significantly
  • Expected: 2-4x improvement
  • Risk: May increase memory waste for small allocations

Priority 2: Reduce Branch Mispredictions (Target: 1.5x improvement)

Option A: Profile-Guided Optimization (Easy - 2 hours)

  • Build with -fprofile-generate, run benchmark, rebuild with -fprofile-use
  • Helps compiler optimize branch layout
  • Expected: 20-30% improvement

Option B: Simplify Cache Refill Logic (Medium - 1 day)

  • Review unified_cache_refill control flow
  • Reduce conditional branches in hot path
  • Use __builtin_expect() for likely/unlikely hints
  • Expected: 15-25% improvement
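The hint pattern can be sketched like this; the macro names follow the usual Linux-kernel convention, and the refill logic is illustrative rather than HAKMEM's real unified_cache_refill:

```c
/* Annotate the refill fast path so the compiler lays out the common
 * (cache-hit) case as the straight-line fall-through. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Pop one slot from a per-thread cache; returns 0 when a refill is
 * needed (the rare, out-of-line case). */
static int cache_pop(int *cache_count, void **cache, void **out) {
    if (likely(*cache_count > 0)) {    /* common case: cache hit */
        *out = cache[--*cache_count];
        return 1;
    }
    return 0;                          /* unlikely: caller refills */
}
```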

Option C: Add Fast Path for Common Cases (Medium - 4 hours)

  • Special-case the most common allocation sizes
  • Bypass complex logic for hot sizes
  • Expected: 20-30% improvement on typical workloads

Priority 3: Improve Cache Locality (Target: 1.2-1.5x improvement)

Option A: Optimize Data Structure Layout (Easy - 2 hours)

  • Pack hot fields together in cache lines
  • Align structures to cache line boundaries
    • Add __attribute__((aligned(64))) to hot structures
  • Expected: 10-20% improvement
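A minimal sketch of the layout idea; the struct and field names are illustrative, not HAKMEM's real data structures:

```c
#include <stddef.h>

/* Keep the fields touched on every alloc/free together in one 64-byte
 * cache line and pin the struct to a line boundary, so a single L1 line
 * fill serves the whole hot path. Cold statistics should live in a
 * separate structure so they do not share (and evict) this line. */
struct hot_cache {
    void   **slots;      /* hot: read on every pop */
    unsigned count;      /* hot: bumped on every op */
    unsigned capacity;   /* hot: checked on push */
} __attribute__((aligned(64)));
```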

Option B: Prefetch Optimization (Medium - 4 hours)

  • Add __builtin_prefetch() for predictable access patterns
  • Prefetch next slab metadata during allocation
  • Expected: 15-25% improvement
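Option B can be sketched as below; the slab_meta struct and its fields are illustrative assumptions:

```c
/* Prefetch the next slab's metadata while the current allocation is
 * still being finished, hiding part of the L1 miss latency behind
 * useful work. */
struct slab_meta {
    struct slab_meta *next;   /* next slab in the refill list */
    unsigned free_count;      /* slots still free in this slab */
};

static void advance_with_prefetch(struct slab_meta *cur) {
    if (cur && cur->next)
        /* 0 = read access, 3 = high temporal locality (keep in L1) */
        __builtin_prefetch(cur->next, 0, 3);
}
```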

Priority 4: Reduce Kernel Overhead (Target: 1.1-1.2x improvement)

Option A: Batch Operations (Hard - 2 days)

  • Batch multiple allocations into single mmap() call
  • Reduce syscall frequency
  • Expected: 10-15% improvement

Option B: Disable Memory Cgroup Accounting (Config - immediate)

  • Run with cgroup v1 or disable memory controller
  • Saves ~4% overhead
  • Not practical for production but useful for profiling
RECOMMENDED ACTION SEQUENCE

  1. URGENT: Pre-fault SuperSlab Memory (1 hour work, 2-3x gain)

    • Add MAP_POPULATE to mmap() in superslab acquisition
    • Modify: core/superslab/*.c (superslab_acquire functions)
    • Test: Run bench_random_mixed_hakmem and verify page fault count drops
  2. Profile-Guided Optimization (2 hours, 20-30% gain)

    • Build with PGO flags
    • Run representative workload
    • Rebuild with profile data
  3. Hugepage Support (1 day, 2-4x gain)

    • Add MAP_HUGETLB flag to superslab mmap
    • Add fallback for systems without hugepage support
    • Test memory usage impact
  4. Branch Optimization (4 hours, 15-25% gain)

    • Add __builtin_expect() hints to unified_cache_refill
    • Simplify hot path conditionals
    • Reorder checks for common case first

Conservative Estimate: With just priorities #1 and #2, we could reach:

  • Current: 4.1M ops/s
  • After prefaulting: 8.2-12.3M ops/s (2-3x)
  • After PGO: 9.8-16.0M ops/s (1.2x more)
  • Final: ~10-16M ops/s (2.4x - 4x total improvement)

Aggressive Estimate: With hugepages + PGO + branch optimization:

  • Final: 16-24M ops/s (4-6x improvement)

CONCLUSION

The primary bottleneck is kernel page fault handling, consuming 60-70% of execution time. This is because the benchmark triggers page faults on nearly every cache refill operation, forcing the kernel to:

  1. Zero new pages (11% of time)
  2. Set up page tables (3-5% of time)
  3. Add pages to LRU and memory cgroups (12% of time)
  4. Manage folios and reverse mappings (10% of time)

The path to 4x performance is clear:

  1. Eliminate page faults with MAP_POPULATE or hugepages (2-3x gain)
  2. Reduce branch mispredictions with PGO (1.2-1.3x gain)
  3. Optimize cache locality (1.1-1.2x gain)

Combined, these optimizations should easily achieve the 4x target (4.1M → 16M+ ops/s).