hakmem/PERF_INDEX.md
Moe Charm (CI) 5685c2f4c9 Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now call superslab_refill multiple times
and prefill the pool with 3 additional HOT superslabs before attempting to
carve. This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
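
The prefill logic in the list above can be sketched roughly as follows. This is a toy model, not the HAKMEM code: `warm_pool_t`, `warm_pool_acquire`, and `mock_superslab_refill` are hypothetical names, slabs are represented as plain integers, and only the constants 16 and 3 are taken from this commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; the real HAKMEM structures differ. */
#define WARM_POOL_MAX  16  /* mirrors TINY_WARM_POOL_MAX_PER_CLASS */
#define PREFILL_BUDGET 3   /* extra superslabs loaded when the pool is empty */

typedef struct {
    int slabs[WARM_POOL_MAX];  /* stand-in for SuperSlab pointers */
    int count;                 /* number of slabs currently pooled */
    unsigned long hits, misses, prefilled;
} warm_pool_t;

/* Stand-in for the expensive registry scan; hands out fresh slab ids. */
static int next_slab_id = 1;
static int mock_superslab_refill(void) { return next_slab_id++; }

/* Returns the slab to carve from, prefilling the pool on a miss. */
static int warm_pool_acquire(warm_pool_t *p) {
    assert(p != NULL);
    if (p->count > 0) {            /* warm hit: no registry scan needed */
        p->hits++;
        return p->slabs[--p->count];
    }
    p->misses++;
    int hot = mock_superslab_refill();  /* kept for immediate carving */
    for (int i = 0; i < PREFILL_BUDGET && p->count < WARM_POOL_MAX; i++) {
        p->slabs[p->count++] = mock_superslab_refill();
        p->prefilled++;
    }
    return hot;
}
```

In this model every miss is followed by up to three hits, so the pool no longer oscillates between 0 and 1 items. Real hit rates depend on how slabs are exhausted and returned, which the sketch does not model.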

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 23:31:54 +09:00


HAKMEM Performance Profiling Index

Date: 2025-12-04
Profiler: Linux perf (6.8.12)
Benchmarks: bench_random_mixed_hakmem vs bench_tiny_hot_hakmem


Quick Start

TL;DR: What's the bottleneck?

Answer: Kernel page faults (61.7% of cycles) from on-demand mmap allocations.

Fix: Pre-fault SuperSlabs at startup → expected 10-15x speedup.


Available Reports

1. PERF_SUMMARY_TABLE.txt (20KB)

Quick reference table with cycle breakdowns, top functions, and recommendations.

Use when: You need a fast overview with numbers.

cat PERF_SUMMARY_TABLE.txt

Key sections:

  • Performance comparison table
  • Cycle breakdown by layer (random_mixed vs tiny_hot)
  • Top 10 functions by CPU time
  • Actionable recommendations with expected gains

2. PERF_PROFILING_ANSWERS.md (16KB)

Answers to specific questions from the profiling request.

Use when: You want direct answers to:

  • What % of cycles are in wrappers?
  • Is unified_cache_refill being called frequently?
  • Is shared_pool_acquire being called?
  • Is registry lookup visible?
  • Where are the 22x slowdown cycles spent?

less PERF_PROFILING_ANSWERS.md

Key sections:

  • Q&A format (5 main questions)
  • Top functions with cache/branch miss data
  • Unexpected bottlenecks flagged
  • Layer-by-layer optimization recommendations

3. PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md (14KB)

Comprehensive layer-by-layer analysis with detailed explanations.

Use when: You need deep understanding of:

  • Why each layer contributes to the gap
  • Root cause analysis (kernel page faults)
  • Optimization strategies with implementation details

less PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md

Key sections:

  • Executive summary
  • Detailed cycle breakdown (random_mixed vs tiny_hot)
  • Layer-by-layer analysis (6 layers)
  • Performance gap analysis
  • Actionable recommendations (7 priorities)
  • Expected results after optimization

Key Findings Summary

Performance Gap

  • bench_tiny_hot: 89M ops/s (baseline)
  • bench_random_mixed: 4.1M ops/s
  • Gap: 21.7x slower

Root Cause: Kernel Page Faults (61.7%)

Random sizes (16-1040B)
    ↓
Unified Cache misses
    ↓
unified_cache_refill (2.3%)
    ↓
shared_pool_acquire (3.3%)
    ↓
SuperSlab mmap (2MB chunks)
    ↓
512 page faults per slab (61.7% cycles!)
    ↓
clear_page_erms (6.9% - zeroing)

User-Space Hotspots (only 11% of total)

  1. Shared Pool: 3.3% (mutex locks)
  2. Wrappers: 3.7% (malloc/free entry)
  3. Unified Cache: 2.3% (triggers page faults)
  4. Other: 1.7%

Tiny Hot (for comparison)

  • 70% user-space, 30% kernel (inverted!)
  • 0.5% page faults (122x less than random_mixed)
  • Free path dominates (43%) due to safe ownership checks

Top 3 Optimization Priorities

Priority 1: Pre-fault SuperSlabs (10-15x gain)

Problem: 61.7% of cycles in kernel page faults
Solution: Pre-allocate and fault-in 2MB slabs at startup
Expected: 4.1M → 41M ops/s
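
A minimal sketch of what pre-faulting one slab could look like, assuming Linux and an anonymous private mapping. `prefault_slab` and `SLAB_SIZE` are illustrative names, not HAKMEM identifiers. The sketch touches one byte per page so the kernel faults and zeroes the pages up front; passing `MAP_POPULATE` to `mmap` is a Linux-specific alternative that asks the kernel to do the same at map time.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SLAB_SIZE (2u * 1024u * 1024u)  /* 2MB chunk, as in the report */

/* Map one slab and write to every page so the page faults happen now
 * rather than on the allocation hot path. Returns the mapping, or NULL
 * on failure, and stores the number of pages touched. */
static void *prefault_slab(size_t *pages_touched) {
    assert(pages_touched != NULL);
    *pages_touched = 0;

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return NULL;

    void *mem = mmap(NULL, SLAB_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return NULL;

    /* volatile keeps the compiler from eliding the stores */
    volatile unsigned char *p = mem;
    for (size_t off = 0; off < SLAB_SIZE; off += (size_t)page) {
        p[off] = 0;
        (*pages_touched)++;
    }
    return mem;
}
```

With 4KB pages the loop touches 2MB / 4KB = 512 pages, which matches the 512 page faults per slab noted earlier. Whether moving that cost to startup yields the projected gain depends on how many slabs the workload actually uses.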

Priority 2: Lock-Free Shared Pool (2-4x gain)

Problem: 3.3% of cycles in mutex locks
Solution: Atomic CAS for free list
Expected: Contributes to 2x overall gain
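
One common way to build a CAS-based free list is a Treiber stack. The sketch below uses C11 `<stdatomic.h>`; `free_node_t`, `lockfree_list_t`, `lf_push`, and `lf_pop` are hypothetical names, and this is not the shared pool's actual design. Note that this simple pop is exposed to the ABA problem when several threads pop concurrently, so a production version would need tagged pointers, hazard pointers, or a similar scheme.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct free_node {
    struct free_node *next;
} free_node_t;

typedef struct {
    _Atomic(free_node_t *) head;
} lockfree_list_t;

/* Push: link the node to the current head, then try to swing head to it. */
static void lf_push(lockfree_list_t *l, free_node_t *n) {
    assert(l != NULL && n != NULL);
    free_node_t *old = atomic_load_explicit(&l->head, memory_order_relaxed);
    do {
        n->next = old;  /* 'old' is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
                 &l->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Pop: try to swing head to head->next. Returns NULL if the list is empty.
 * Caveat: reading old->next is only safe if nodes are never unmapped, and
 * the CAS can succeed wrongly under ABA with concurrent poppers. */
static free_node_t *lf_pop(lockfree_list_t *l) {
    assert(l != NULL);
    free_node_t *old = atomic_load_explicit(&l->head, memory_order_acquire);
    while (old != NULL &&
           !atomic_compare_exchange_weak_explicit(
               &l->head, &old, old->next,
               memory_order_acquire, memory_order_acquire)) {
        /* CAS failed and reloaded 'old'; retry with the new head. */
    }
    return old;
}
```

The list is LIFO, so the most recently freed entry is reused first, which tends to keep recently used memory warm in cache.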

Priority 3: Increase Unified Cache (2x fewer refills)

Problem: High miss rate → frequent refills
Solution: 64-128 blocks per class (currently 16-32)
Expected: 50% fewer refills
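
The "50% fewer refills" estimate follows from a simple model, sketched here under a strong assumption: each refill loads exactly `capacity` blocks and no frees return blocks to the cache in between. `refills_needed` is a hypothetical helper, not a HAKMEM function.

```c
#include <assert.h>
#include <stddef.h>

/* Refills needed to serve n_allocs allocations when each refill loads
 * 'capacity' blocks and nothing is freed back to the cache in between. */
static size_t refills_needed(size_t n_allocs, size_t capacity) {
    assert(capacity > 0);
    return (n_allocs + capacity - 1) / capacity;  /* ceiling division */
}
```

Under this model, 1M allocations need 31,250 refills at 32 blocks per class and 15,625 at 64, which is the 2x reduction in the heading. Real workloads interleave frees, so the measured reduction may differ.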


Expected Performance After Optimizations

Stage                    Random Mixed   Gain   vs Tiny Hot
Current                  4.1M ops/s     -      21.7x slower
After P1 (Pre-fault)     35M ops/s      8.5x   2.5x slower
After P1-2 (Lock-free)   45M ops/s      11x    2.0x slower
After P1-3 (Cache)       55M ops/s      13x    1.6x slower
After All (P1-7)         60M ops/s      15x    1.5x slower

Target: Within 1.5-2x of Tiny Hot, which is acceptable given the inherent complexity of handling varied allocation sizes. The figures above are projections, not measurements.
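
The Gain and vs Tiny Hot columns in the table above are plain ratios against the two measured baselines (4.1M and 89M ops/s). The helper names below are illustrative only and show how each cell is derived:

```c
#include <assert.h>

#define CURRENT_MOPS  4.1   /* bench_random_mixed today, in M ops/s */
#define TINY_HOT_MOPS 89.0  /* bench_tiny_hot baseline, in M ops/s */

/* "Gain" column: projected throughput relative to today's random_mixed. */
static double gain_over_current(double projected_mops) {
    assert(projected_mops > 0.0);
    return projected_mops / CURRENT_MOPS;
}

/* "vs Tiny Hot" column: how many times slower than tiny_hot. */
static double gap_to_tiny_hot(double projected_mops) {
    assert(projected_mops > 0.0);
    return TINY_HOT_MOPS / projected_mops;
}
```

For example, the P1 row is 35 / 4.1 ≈ 8.5x gain and 89 / 35 ≈ 2.5x slower than Tiny Hot.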


How to Reproduce

1. Build benchmarks

make bench_random_mixed_hakmem
make bench_tiny_hot_hakmem

2. Run without profiling (baseline)

HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_random_mixed_hakmem 1000000 256 42
HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_tiny_hot_hakmem 1000000

3. Profile with perf

# Random mixed
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
  -o perf_random_mixed.data -- \
  ./bench_random_mixed_hakmem 1000000 256 42

# Tiny hot
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
  -o perf_tiny_hot.data -- \
  ./bench_tiny_hot_hakmem 1000000

4. Analyze results

perf report --stdio -i perf_random_mixed.data --no-children --sort symbol --percent-limit 0.5
perf report --stdio -i perf_tiny_hot.data --no-children --sort symbol --percent-limit 0.5

File Locations

All reports are in: /mnt/workdisk/public_share/hakmem/

PERF_SUMMARY_TABLE.txt                     - Quick reference (20KB)
PERF_PROFILING_ANSWERS.md                  - Q&A format (16KB)
PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md  - Detailed analysis (14KB)
PERF_INDEX.md                              - This file (index)

Contact

For questions about this profiling analysis, see:

  • Original request: Questions 1-7 in profiling task
  • Implementation recommendations: PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md

Generated by: Linux perf + manual analysis
Date: 2025-12-04
Version: HAKMEM Phase 20+ (latest)