Problem: Warm pool had a 0% hit rate (only 1 hit against 3976 misses) despite being implemented, so every cache miss went through the expensive superslab_refill registry scan.

Root Cause Analysis:
- Warm pool was initialized once and received a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- The next refill pushed another single slab, which was immediately exhausted
- The pool oscillated between 0 and 1 items, yielding a 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now do multiple superslab_refills and prefill the pool with 3 additional HOT superslabs before attempting to carve. This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect an empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs
- Store the extra slabs in the warm pool, keep 1 in TLS for immediate carving
- Track prefill events in the g_warm_pool_stats[].prefilled counter

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After: C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functions as intended, with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (could be made tunable via env var later if needed)
- Statistics are always compiled in; printing is gated by the environment variable HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider an adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+, pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
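The prefill logic described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual HAKMEM code: the `SuperSlab` and `WarmPool` types, the `refill_fn` callback (standing in for superslab_refill), and `demo_refill` are hypothetical names introduced here. Only the pool capacity of 16, the prefill budget of 3, and the hits/misses/prefilled counters come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the commit message. */
#define TINY_WARM_POOL_MAX_PER_CLASS 16
#define WARM_POOL_PREFILL_BUDGET 3

/* Hypothetical stand-in for a superslab; the real type is not shown. */
typedef struct SuperSlab { int id; } SuperSlab;

/* Hypothetical per-class warm pool with the counters named in the commit. */
typedef struct {
    SuperSlab *slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;
    uint64_t hits, misses, prefilled;
} WarmPool;

/* Hypothetical callback standing in for superslab_refill(). */
typedef SuperSlab *(*refill_fn)(void *ctx);

void warm_pool_init(WarmPool *p) {
    p->count = 0;
    p->hits = p->misses = p->prefilled = 0;
}

SuperSlab *warm_pool_pop(WarmPool *p) {
    return p->count > 0 ? p->slabs[--p->count] : NULL;
}

int warm_pool_push(WarmPool *p, SuperSlab *ss) {
    if (p->count >= TINY_WARM_POOL_MAX_PER_CLASS) return 0;
    p->slabs[p->count++] = ss;
    return 1;
}

/* Returns the slab to carve from, or NULL if the refill fails. */
SuperSlab *warm_pool_acquire(WarmPool *p, refill_fn refill, void *ctx) {
    SuperSlab *ss = warm_pool_pop(p);
    if (ss) { p->hits++; return ss; }      /* warm hit: no registry scan */

    p->misses++;
    ss = refill(ctx);                      /* kept for immediate carving */
    if (!ss) return NULL;

    /* Secondary prefill: the pool is empty here, so the pushes fit. */
    for (int i = 0; i < WARM_POOL_PREFILL_BUDGET; i++) {
        SuperSlab *extra = refill(ctx);
        if (!extra || !warm_pool_push(p, extra)) break;
        p->prefilled++;
    }
    return ss;
}

/* Hit rate in percent: hits / (hits + misses). */
double warm_pool_hit_rate(const WarmPool *p) {
    uint64_t total = p->hits + p->misses;
    return total ? 100.0 * (double)p->hits / (double)total : 0.0;
}

/* Demo-only refill: hands out slabs from a fixed array; ctx is an int counter. */
SuperSlab *demo_refill(void *ctx) {
    static SuperSlab backing[64];
    int *next = (int *)ctx;
    if (*next >= 64) return NULL;
    backing[*next].id = *next;
    return &backing[(*next)++];
}
```

With this shape, the first acquire on an empty pool counts as one miss and leaves three slabs behind, so the following acquires are hits until the pool drains again. Plugging the reported counters into `warm_pool_hit_rate` (3929 hits, 3143 misses) gives roughly 55.6%, matching the figure above.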
HAKMEM Performance Profiling Index
Date: 2025-12-04
Profiler: Linux perf (6.8.12)
Benchmarks: bench_random_mixed_hakmem vs bench_tiny_hot_hakmem
Quick Start
TL;DR: What's the bottleneck?
Answer: Kernel page faults (61.7% of cycles) from on-demand mmap allocations.
Fix: Pre-fault SuperSlabs at startup → expected 10-15x speedup.
Available Reports
1. PERF_SUMMARY_TABLE.txt (20KB)
Quick reference table with cycle breakdowns, top functions, and recommendations.
Use when: You need a fast overview with numbers.
cat PERF_SUMMARY_TABLE.txt
Key sections:
- Performance comparison table
- Cycle breakdown by layer (random_mixed vs tiny_hot)
- Top 10 functions by CPU time
- Actionable recommendations with expected gains
2. PERF_PROFILING_ANSWERS.md (16KB)
Answers to specific questions from the profiling request.
Use when: You want direct answers to:
- What % of cycles are in wrappers?
- Is unified_cache_refill being called frequently?
- Is shared_pool_acquire being called?
- Is registry lookup visible?
- Where are the 22x slowdown cycles spent?
less PERF_PROFILING_ANSWERS.md
Key sections:
- Q&A format (5 main questions)
- Top functions with cache/branch miss data
- Unexpected bottlenecks flagged
- Layer-by-layer optimization recommendations
3. PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md (14KB)
Comprehensive layer-by-layer analysis with detailed explanations.
Use when: You need deep understanding of:
- Why each layer contributes to the gap
- Root cause analysis (kernel page faults)
- Optimization strategies with implementation details
less PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md
Key sections:
- Executive summary
- Detailed cycle breakdown (random_mixed vs tiny_hot)
- Layer-by-layer analysis (6 layers)
- Performance gap analysis
- Actionable recommendations (7 priorities)
- Expected results after optimization
Key Findings Summary
Performance Gap
- bench_tiny_hot: 89M ops/s (baseline)
- bench_random_mixed: 4.1M ops/s
- Gap: 21.7x slower
Root Cause: Kernel Page Faults (61.7%)
Random sizes (16-1040B)
↓
Unified Cache misses
↓
unified_cache_refill (2.3%)
↓
shared_pool_acquire (3.3%)
↓
SuperSlab mmap (2MB chunks)
↓
512 page faults per slab (61.7% cycles!)
↓
clear_page_erms (6.9% - zeroing)
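The "512 page faults per slab" figure in the chain above follows from dividing the mapping size by the page size. A small sketch of that arithmetic, assuming 4 KiB base pages (the report does not state the page size; 512 is what 2MB / 4KB yields) and no transparent huge pages; `first_touch_faults` is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on first-touch page faults for an on-demand anonymous
 * mapping: one fault per page, rounding up for a partial last page. */
size_t first_touch_faults(size_t mapping_bytes, size_t page_bytes) {
    assert(page_bytes > 0);
    return (mapping_bytes + page_bytes - 1) / page_bytes;
}
```

For a 2MB SuperSlab and 4096-byte pages this gives 512, one fault for every page the allocator touches for the first time.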
User-Space Hotspots (only 11% of total)
- Shared Pool: 3.3% (mutex locks)
- Wrappers: 3.7% (malloc/free entry)
- Unified Cache: 2.3% (triggers page faults)
- Other: 1.7%
Tiny Hot (for comparison)
- 70% user-space, 30% kernel (inverted!)
- 0.5% of cycles in page faults (about 122x less than random_mixed)
- Free path dominates (43%) due to safe ownership checks
Top 3 Optimization Priorities
Priority 1: Pre-fault SuperSlabs (10-15x gain)
Problem: 61.7% of cycles in kernel page faults
Solution: Pre-allocate and fault-in 2MB slabs at startup
Expected: 4.1M → 41M ops/s
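One possible way to pre-fault a 2MB slab is sketched below. This is an assumption about how Priority 1 could be implemented, not the HAKMEM code: `superslab_map_prefaulted` is a hypothetical name, and the 4096-byte page size used in the fallback is assumed. On Linux, `MAP_POPULATE` asks the kernel to fault pages in during `mmap()` (best effort); elsewhere the sketch touches one byte per page instead.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define SUPERSLAB_SIZE (2u * 1024u * 1024u)  /* 2MB chunks, per the report */
#define ASSUMED_PAGE_SIZE 4096u               /* assumption, see lead-in */

/* Maps an anonymous region and faults its pages in up front.
 * Returns NULL on failure. */
void *superslab_map_prefaulted(size_t size) {
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE;  /* Linux: populate page tables inside mmap() */
#endif
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED) return NULL;
#ifndef MAP_POPULATE
    /* Fallback: touch one byte per page so the faults happen now. */
    volatile unsigned char *b = p;
    for (size_t off = 0; off < size; off += ASSUMED_PAGE_SIZE) b[off] = 0;
#endif
    return p;
}
```

Calling this at startup moves the page-fault cost out of the allocation hot path, at the price of a higher resident set size for slabs that may never be used.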
Priority 2: Lock-Free Shared Pool (2-4x gain)
Problem: 3.3% of cycles in mutex locks
Solution: Atomic CAS for free list
Expected: Contributes to 2x overall gain
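A CAS-based free list is commonly built as a Treiber stack. The sketch below uses C11 `<stdatomic.h>` and illustrates the general technique only; `FreeNode`, `LockFreeList`, `lf_push`, and `lf_pop` are hypothetical names, and the shared pool's real data layout is not shown in this report. It deliberately ignores the ABA problem and safe memory reclamation, both of which a production version would need to handle (for example with tagged pointers or hazard pointers).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical intrusive free-list node. */
typedef struct FreeNode { struct FreeNode *next; } FreeNode;

typedef struct { _Atomic(FreeNode *) head; } LockFreeList;

/* Push: link the node to the current head, then swing head with CAS. */
void lf_push(LockFreeList *l, FreeNode *n) {
    FreeNode *old = atomic_load_explicit(&l->head, memory_order_relaxed);
    do {
        n->next = old;  /* old is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
                 &l->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Pop: swing head to head->next with CAS; returns NULL when empty.
 * Not ABA-safe: assumes nodes are not freed and reused concurrently. */
FreeNode *lf_pop(LockFreeList *l) {
    FreeNode *old = atomic_load_explicit(&l->head, memory_order_acquire);
    while (old != NULL &&
           !atomic_compare_exchange_weak_explicit(
               &l->head, &old, old->next,
               memory_order_acquire, memory_order_acquire)) {
        /* CAS failed: old now holds the latest head, retry. */
    }
    return old;
}
```

The list is LIFO, so the most recently freed slab is handed out first, which tends to keep recently used memory hot.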
Priority 3: Increase Unified Cache (2x fewer refills)
Problem: High miss rate → frequent refills
Solution: 64-128 blocks per class (currently 16-32)
Expected: 50% fewer refills
Expected Performance After Optimizations
| Stage | Random Mixed | Gain | vs Tiny Hot |
|---|---|---|---|
| Current | 4.1 M ops/s | - | 21.7x slower |
| After P1 (Pre-fault) | 35 M ops/s | 8.5x | 2.5x slower |
| After P1-2 (Lock-free) | 45 M ops/s | 11x | 2.0x slower |
| After P1-3 (Cache) | 55 M ops/s | 13x | 1.6x slower |
| After All (P1-7) | 60 M ops/s | 15x | 1.5x slower |
Target: Ending up within 1.5-2x of Tiny Hot would be acceptable, given the inherent complexity of handling varied allocation sizes.
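The Gain and vs Tiny Hot columns in the table are plain throughput ratios. A short sketch of that arithmetic, with hypothetical helper names, using the 89M ops/s Tiny Hot baseline and the 4.1M ops/s Random Mixed figure from this report:

```c
#include <assert.h>

/* Gain column: projected throughput divided by current throughput. */
double gain_over_current(double projected_mops, double current_mops) {
    assert(current_mops > 0.0);
    return projected_mops / current_mops;
}

/* "vs Tiny Hot" column: reference throughput divided by measured throughput. */
double slowdown_vs_reference(double reference_mops, double measured_mops) {
    assert(measured_mops > 0.0);
    return reference_mops / measured_mops;
}
```

For example, `gain_over_current(35.0, 4.1)` is about 8.5 and `slowdown_vs_reference(89.0, 35.0)` is about 2.5, matching the "After P1" row; `slowdown_vs_reference(89.0, 4.1)` reproduces the 21.7x gap.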
How to Reproduce
1. Build benchmarks
make bench_random_mixed_hakmem
make bench_tiny_hot_hakmem
2. Run without profiling (baseline)
HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_random_mixed_hakmem 1000000 256 42
HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_tiny_hot_hakmem 1000000
3. Profile with perf
# Random mixed
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
-o perf_random_mixed.data -- \
./bench_random_mixed_hakmem 1000000 256 42
# Tiny hot
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
-o perf_tiny_hot.data -- \
./bench_tiny_hot_hakmem 1000000
4. Analyze results
perf report --stdio -i perf_random_mixed.data --no-children --sort symbol --percent-limit 0.5
perf report --stdio -i perf_tiny_hot.data --no-children --sort symbol --percent-limit 0.5
File Locations
All reports are in: /mnt/workdisk/public_share/hakmem/
PERF_SUMMARY_TABLE.txt - Quick reference (20KB)
PERF_PROFILING_ANSWERS.md - Q&A format (16KB)
PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md - Detailed analysis (14KB)
PERF_INDEX.md - This file (index)
Contact
For questions about this profiling analysis, see:
- Original request: Questions 1-7 in profiling task
- Implementation recommendations: PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md
Generated by: Linux perf + manual analysis
Date: 2025-12-04
Version: HAKMEM Phase 20+ (latest)