
Moe Charm (CI) 5685c2f4c9 Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)

Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refill calls and
prefill the pool with 3 additional HOT superslabs before attempting to carve. This
builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
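
The prefill behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the HAKMEM source: the types `WarmPool` and `FakeRegistry`, the functions `warm_pool_acquire` and `fake_refill`, and the callback standing in for superslab_refill are all hypothetical names. Only the capacity of 16 and the prefill budget of 3 come from the commit message.

```c
#include <stddef.h>

/* Hypothetical stand-ins: the real HAKMEM types are not shown here. */
#define WARM_POOL_MAX_PER_CLASS 16  /* mirrors TINY_WARM_POOL_MAX_PER_CLASS */
#define WARM_PREFILL_BUDGET      3  /* extra superslabs loaded on an empty pool */

typedef struct SuperSlab { int id; } SuperSlab;  /* toy payload for the sketch */

typedef struct {
    SuperSlab    *slabs[WARM_POOL_MAX_PER_CLASS];
    int           count;
    unsigned long hits, misses, prefilled;
} WarmPool;

/* Callback standing in for the expensive registry scan (superslab_refill). */
typedef SuperSlab *(*refill_fn)(void *ctx);

/* Return a slab to carve from. On a hit, pop from the pool. On a miss, run
 * the slow refill once for the caller and then up to WARM_PREFILL_BUDGET
 * more times to stock the (empty) pool. */
static SuperSlab *warm_pool_acquire(WarmPool *p, refill_fn refill, void *ctx)
{
    if (p->count > 0) {
        p->hits++;
        return p->slabs[--p->count];
    }
    p->misses++;

    SuperSlab *current = refill(ctx);   /* kept by the caller for carving */
    if (current == NULL)
        return NULL;

    int added = 0;
    while (added < WARM_PREFILL_BUDGET && p->count < WARM_POOL_MAX_PER_CLASS) {
        SuperSlab *extra = refill(ctx);
        if (extra == NULL)
            break;                      /* nothing more available */
        p->slabs[p->count++] = extra;
        added++;
    }
    if (added > 0)
        p->prefilled++;                 /* one prefill event, as in the stats */
    return current;
}

/* Toy refill source for experimenting: hands out slabs from a fixed array. */
typedef struct { SuperSlab store[8]; int next; } FakeRegistry;

static SuperSlab *fake_refill(void *ctx)
{
    FakeRegistry *r = ctx;
    if (r->next >= 8)
        return NULL;
    r->store[r->next].id = r->next;
    return &r->store[r->next++];
}
```

With this shape, the first miss on an empty pool costs four refills but lets the next three acquisitions be served from the pool, which is the working-set effect the commit describes.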

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)
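
The hit rates above are consistent with hits / (hits + misses). A small helper, under that assumption (the name `hit_rate_percent` is illustrative and not taken from the HAKMEM source), reproduces them:

```c
/* Hit rate in percent: hits / (hits + misses) * 100.
 * Returns 0 when there were no lookups at all. */
static double hit_rate_percent(unsigned long hits, unsigned long misses)
{
    unsigned long total = hits + misses;
    return total == 0 ? 0.0 : 100.0 * (double)hits / (double)total;
}
```

For the "After" row, 3929 hits and 3143 misses give 3929 / 7072, which rounds to the reported 55.6%.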

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1
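
ENV-gated printing of this kind is commonly done with a getenv check. The sketch below assumes the flag is considered on only when the variable is exactly "1". The function names are hypothetical, and whether HAKMEM caches the lookup is not stated here.

```c
#include <stdlib.h>
#include <string.h>

/* Pure helper: 1 when the value is exactly "1", else 0 (NULL counts as off). */
static int stats_flag_is_on(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Reads HAKMEM_WARM_POOL_STATS from the environment on every call. */
static int warm_pool_stats_enabled(void)
{
    return stats_flag_is_on(getenv("HAKMEM_WARM_POOL_STATS"));
}
```

Keeping the counters always compiled and gating only the printing means the flag can be flipped at run time without rebuilding.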

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-12-04 23:31:54 +09:00


HAKMEM Performance Profiling Index

Date: 2025-12-04
Profiler: Linux perf (6.8.12)
Benchmarks: bench_random_mixed_hakmem vs bench_tiny_hot_hakmem

Quick Start

TL;DR: What's the bottleneck?

Answer: Kernel page faults (61.7% of cycles) from on-demand mmap allocations.

Fix: Pre-fault SuperSlabs at startup → expected 10-15x speedup.

Available Reports

1. PERF_SUMMARY_TABLE.txt (20KB)

Quick reference table with cycle breakdowns, top functions, and recommendations.

Use when: You need a fast overview with numbers.

cat PERF_SUMMARY_TABLE.txt

Key sections:

Performance comparison table
Cycle breakdown by layer (random_mixed vs tiny_hot)
Top 10 functions by CPU time
Actionable recommendations with expected gains

2. PERF_PROFILING_ANSWERS.md (16KB)

Answers to specific questions from the profiling request.

Use when: You want direct answers to:

What % of cycles are in wrappers?
Is unified_cache_refill being called frequently?
Is shared_pool_acquire being called?
Is registry lookup visible?
Where are the 22x slowdown cycles spent?

less PERF_PROFILING_ANSWERS.md

Key sections:

Q&A format (5 main questions)
Top functions with cache/branch miss data
Unexpected bottlenecks flagged
Layer-by-layer optimization recommendations

3. PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md (14KB)

Comprehensive layer-by-layer analysis with detailed explanations.

Use when: You need deep understanding of:

Why each layer contributes to the gap
Root cause analysis (kernel page faults)
Optimization strategies with implementation details

less PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md

Key sections:

Executive summary
Detailed cycle breakdown (random_mixed vs tiny_hot)
Layer-by-layer analysis (6 layers)
Performance gap analysis
Actionable recommendations (7 priorities)
Expected results after optimization

Key Findings Summary

Performance Gap

bench_tiny_hot: 89M ops/s (baseline)
bench_random_mixed: 4.1M ops/s
Gap: 21.7x slower

Root Cause: Kernel Page Faults (61.7%)

Random sizes (16-1040B)
    ↓
Unified Cache misses
    ↓
unified_cache_refill (2.3%)
    ↓
shared_pool_acquire (3.3%)
    ↓
SuperSlab mmap (2MB chunks)
    ↓
512 page faults per slab (61.7% cycles!)
    ↓
clear_page_erms (6.9% - zeroing)

User-Space Hotspots (only 11% of total)

Shared Pool: 3.3% (mutex locks)
Wrappers: 3.7% (malloc/free entry)
Unified Cache: 2.3% (triggers page faults)
Other: 1.7%

Tiny Hot (for comparison)

70% user-space, 30% kernel (inverted!)
0.5% page faults (122x less than random_mixed)
Free path dominates (43%) due to safe ownership checks

Top 3 Optimization Priorities

Priority 1: Pre-fault SuperSlabs (10-15x gain)

Problem: 61.7% of cycles in kernel page faults
Solution: Pre-allocate and fault-in 2MB slabs at startup
Expected: 4.1M → 41M ops/s
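
One way to approach Priority 1 is to map a slab-sized region and touch one byte per page so the faults happen at startup rather than on the allocation path. The following is a sketch under assumptions, not the HAKMEM implementation: `prefault_region` and `SUPERSLAB_SIZE` are illustrative names, and only the 2MB size comes from the profile.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SUPERSLAB_SIZE ((size_t)2 * 1024 * 1024)  /* 2MB, per the profile */

/* Map `size` bytes of anonymous memory and write one byte per page so the
 * kernel faults every page in immediately. Returns the mapping, or NULL on
 * failure, and stores the number of pages touched in *pages_out. */
static void *prefault_region(size_t size, size_t *pages_out)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;

    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    volatile unsigned char *p = base;
    size_t touched = 0;
    for (size_t off = 0; off < size; off += (size_t)page) {
        p[off] = 0;   /* the first write to each page triggers its fault */
        touched++;
    }
    if (pages_out != NULL)
        *pages_out = touched;
    return base;
}
```

With 4KB pages this touches 512 pages per 2MB region, matching the per-slab fault count in the chain above. On Linux, passing MAP_POPULATE to mmap is an alternative that asks the kernel to populate the mapping up front.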

Priority 2: Lock-Free Shared Pool (2-4x gain)

Problem: 3.3% of cycles in mutex locks
Solution: Atomic CAS for free list
Expected: Contributes to 2x overall gain
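
A CAS-based free list is often written as a Treiber stack with C11 atomics. The sketch below uses hypothetical names (`FreeNode`, `LockFreeList`, `lf_push`, `lf_pop`) and does not reflect the actual shared pool layout. It also does not address the ABA problem, which a production version would need to handle, for example with tagged pointers or deferred reclamation.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct FreeNode {
    struct FreeNode *next;
} FreeNode;

typedef struct {
    _Atomic(FreeNode *) head;
} LockFreeList;

/* Push: link the node in front of the current head, retrying if another
 * thread changed the head in between. */
static void lf_push(LockFreeList *l, FreeNode *n)
{
    FreeNode *old = atomic_load_explicit(&l->head, memory_order_relaxed);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &l->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Pop: swing the head to head->next. Returns NULL when the list is empty.
 * Caveat: reading old->next is only safe if nodes are never freed or reused
 * while another thread may still hold `old` (the ABA hazard). */
static FreeNode *lf_pop(LockFreeList *l)
{
    FreeNode *old = atomic_load_explicit(&l->head, memory_order_acquire);
    while (old != NULL &&
           !atomic_compare_exchange_weak_explicit(
               &l->head, &old, old->next,
               memory_order_acquire, memory_order_acquire)) {
        /* A failed CAS reloads `old` with the current head; retry. */
    }
    return old;
}
```

Single-threaded, the list behaves as a plain LIFO stack; the retry loops only matter under contention.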

Priority 3: Increase Unified Cache (2x fewer refills)

Problem: High miss rate → frequent refills
Solution: 64-128 blocks per class (currently 16-32)
Expected: 50% fewer refills

Expected Performance After Optimizations

Stage                    Random Mixed   Gain    vs Tiny Hot
Current                  4.1 M ops/s    -       21.7x slower
After P1 (Pre-fault)     35 M ops/s     8.5x    2.5x slower
After P1-2 (Lock-free)   45 M ops/s     11x     2.0x slower
After P1-3 (Cache)       55 M ops/s     13x     1.6x slower
After All (P1-7)         60 M ops/s     15x     1.5x slower

Target: ending up within 1.5-2x of Tiny Hot is acceptable given the inherent complexity of handling varied allocation sizes. These figures are projections, not measurements.
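
The Gain and vs Tiny Hot columns follow from two ratios against the measured 4.1M and 89M ops/s figures. A small sketch (function names are illustrative) for recomputing them when new measurements come in:

```c
/* Speedup of a projected throughput over the current one (both in M ops/s). */
static double gain_over_current(double projected_mops, double current_mops)
{
    return projected_mops / current_mops;
}

/* How many times slower a throughput is than the Tiny Hot baseline. */
static double gap_vs_tiny_hot(double mops, double tiny_hot_mops)
{
    return tiny_hot_mops / mops;
}
```

For example, the "After P1" row is 35 / 4.1 for the gain and 89 / 35 for the remaining gap.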

How to Reproduce

1. Build benchmarks

make bench_random_mixed_hakmem
make bench_tiny_hot_hakmem

2. Run without profiling (baseline)

HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_random_mixed_hakmem 1000000 256 42
HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_tiny_hot_hakmem 1000000

3. Profile with perf

# Random mixed
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
  -o perf_random_mixed.data -- \
  ./bench_random_mixed_hakmem 1000000 256 42

# Tiny hot
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
  -o perf_tiny_hot.data -- \
  ./bench_tiny_hot_hakmem 1000000

4. Analyze results

perf report --stdio -i perf_random_mixed.data --no-children --sort symbol --percent-limit 0.5
perf report --stdio -i perf_tiny_hot.data --no-children --sort symbol --percent-limit 0.5

File Locations

All reports are in: /mnt/workdisk/public_share/hakmem/

PERF_SUMMARY_TABLE.txt                     - Quick reference (20KB)
PERF_PROFILING_ANSWERS.md                  - Q&A format (16KB)
PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md  - Detailed analysis (14KB)
PERF_INDEX.md                              - This file (index)

Contact

For questions about this profiling analysis, see:

Original request: Questions 1-7 in profiling task
Implementation recommendations: PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md

Generated by: Linux perf + manual analysis
Date: 2025-12-04
Version: HAKMEM Phase 20+ (latest)
