Problem: Warm pool had a 0% hit rate (only 1 hit per 3976 misses) despite being implemented, causing all cache misses to go through expensive superslab_refill registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding a 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now do multiple superslab_refills and prefill the pool with 3 additional HOT superslabs before attempting to carve. This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect an empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After: C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (could be made tunable via an env var later if needed)
- Statistics are always compiled in; printing is gated by the env var HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

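The prefill behaviour described in this commit can be sketched as a small, self-contained model. This is only an illustration under stated assumptions, not the HAKMEM implementation: `Slab`, `WarmPool`, `registry_scan()`, `refill()` and `hit_rate_percent()` are hypothetical names, and the registry scan is replaced by a counter over a static array. Only the pool cap of 16 and the prefill budget of 3 come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define WARM_POOL_MAX  16  /* mirrors TINY_WARM_POOL_MAX_PER_CLASS */
#define PREFILL_BUDGET 3   /* extra slabs loaded when the pool is empty */

typedef struct { int id; } Slab;  /* hypothetical stand-in for a superslab */

typedef struct {
    Slab *items[WARM_POOL_MAX];
    int count;
    unsigned long hits, misses, prefilled;
} WarmPool;

/* Hypothetical stand-in for the expensive registry scan. */
static Slab g_slabs[64];
static int g_next_slab;

static Slab *registry_scan(void) {
    if (g_next_slab >= 64) return NULL;
    g_slabs[g_next_slab].id = g_next_slab;
    return &g_slabs[g_next_slab++];
}

static int warm_pool_push(WarmPool *p, Slab *s) {
    if (p->count >= WARM_POOL_MAX) return 0;
    p->items[p->count++] = s;
    return 1;
}

static Slab *warm_pool_pop(WarmPool *p) {
    return p->count > 0 ? p->items[--p->count] : NULL;
}

/* Cold-path refill: serve from the warm pool if possible; on an empty pool,
 * scan once for the slab to carve from and up to PREFILL_BUDGET more times
 * to stock the pool for later refills. */
static Slab *refill(WarmPool *p) {
    assert(p != NULL);
    Slab *s = warm_pool_pop(p);
    if (s) { p->hits++; return s; }

    p->misses++;
    s = registry_scan();
    if (!s) return NULL;

    int added = 0;
    for (int i = 0; i < PREFILL_BUDGET; i++) {
        Slab *extra = registry_scan();
        if (!extra || !warm_pool_push(p, extra)) break;
        added++;
    }
    if (added > 0) p->prefilled++;
    return s;  /* caller keeps this one for immediate carving */
}

/* Hit rate as reported in the results: hits / (hits + misses). */
static double hit_rate_percent(unsigned long hits, unsigned long misses) {
    unsigned long total = hits + misses;
    return total ? 100.0 * (double)hits / (double)total : 0.0;
}
```

In this toy model every miss is followed by three hits, so it does not reproduce the measured 55.6%; that figure depends on the real allocation pattern. `hit_rate_percent(3929, 3143)` does evaluate to roughly 55.6, matching the "After" line above.
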
# HAKMEM Performance Profiling Index

**Date:** 2025-12-04
**Profiler:** Linux perf (6.8.12)
**Benchmarks:** bench_random_mixed_hakmem vs bench_tiny_hot_hakmem

---

## Quick Start

### TL;DR: What's the bottleneck?

**Answer:** Kernel page faults (61.7% of cycles) from on-demand mmap allocations.

**Fix:** Pre-fault SuperSlabs at startup → expected 10-15x speedup.

---

## Available Reports

### 1. PERF_SUMMARY_TABLE.txt (20KB)
**Quick reference table** with cycle breakdowns, top functions, and recommendations.

**Use when:** You need a fast overview with numbers.

```bash
cat PERF_SUMMARY_TABLE.txt
```

Key sections:
- Performance comparison table
- Cycle breakdown by layer (random_mixed vs tiny_hot)
- Top 10 functions by CPU time
- Actionable recommendations with expected gains

---

### 2. PERF_PROFILING_ANSWERS.md (16KB)
**Answers to specific questions** from the profiling request.

**Use when:** You want direct answers to:
- What % of cycles are in wrappers?
- Is unified_cache_refill being called frequently?
- Is shared_pool_acquire being called?
- Is registry lookup visible?
- Where are the 22x slowdown cycles spent?

```bash
less PERF_PROFILING_ANSWERS.md
```

Key sections:
- Q&A format (5 main questions)
- Top functions with cache/branch miss data
- Unexpected bottlenecks flagged
- Layer-by-layer optimization recommendations

---

### 3. PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md (14KB)
**Comprehensive layer-by-layer analysis** with detailed explanations.

**Use when:** You need deep understanding of:
- Why each layer contributes to the gap
- Root cause analysis (kernel page faults)
- Optimization strategies with implementation details

```bash
less PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md
```

Key sections:
- Executive summary
- Detailed cycle breakdown (random_mixed vs tiny_hot)
- Layer-by-layer analysis (6 layers)
- Performance gap analysis
- Actionable recommendations (7 priorities)
- Expected results after optimization

---

## Key Findings Summary

### Performance Gap
- **bench_tiny_hot:** 89M ops/s (baseline)
- **bench_random_mixed:** 4.1M ops/s
- **Gap:** 21.7x slower

### Root Cause: Kernel Page Faults (61.7%)
```
Random sizes (16-1040B)
    ↓
Unified Cache misses
    ↓
unified_cache_refill (2.3%)
    ↓
shared_pool_acquire (3.3%)
    ↓
SuperSlab mmap (2MB chunks)
    ↓
512 page faults per slab (61.7% cycles!)
    ↓
clear_page_erms (6.9% - zeroing)
```

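The "512 page faults per slab" figure in the chain above follows from dividing the slab size by the page size. A minimal worked sketch, assuming 4 KiB base pages and no transparent huge pages (neither is stated in this report), with `faults_per_slab` as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Number of first-touch page faults needed to back `slab_bytes` of fresh
 * anonymous memory, assuming one fault per `page_bytes` page. */
static size_t faults_per_slab(size_t slab_bytes, size_t page_bytes) {
    assert(page_bytes > 0);
    return (slab_bytes + page_bytes - 1) / page_bytes;  /* round up */
}
```

With a 2 MiB slab and 4096-byte pages, `faults_per_slab((size_t)2 << 20, 4096)` gives 512, so every freshly mapped SuperSlab costs up to 512 kernel entries before its memory is fully usable.
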
### User-Space Hotspots (only 11% of total)
1. **Shared Pool:** 3.3% (mutex locks)
2. **Wrappers:** 3.7% (malloc/free entry)
3. **Unified Cache:** 2.3% (triggers page faults)
4. **Other:** 1.7%

### Tiny Hot (for comparison)
- **70% user-space, 30% kernel** (inverted!)
- **0.5% page faults** (122x less than random_mixed)
- Free path dominates (43%) due to safe ownership checks

---

## Top 3 Optimization Priorities

### Priority 1: Pre-fault SuperSlabs (10-15x gain)
**Problem:** 61.7% of cycles in kernel page faults
**Solution:** Pre-allocate and fault-in 2MB slabs at startup
**Expected:** 4.1M → 41M ops/s (10x; the staged table under "Expected Performance After Optimizations" uses a more conservative 35M ops/s, 8.5x)

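The "pre-allocate and fault-in" step could look roughly like the sketch below. It is an illustration under assumptions, not HAKMEM code: `slab_map_prefaulted` and `SLAB_BYTES` are hypothetical names, and the slab is faulted in by writing one byte per page. On Linux, passing `MAP_POPULATE` to `mmap` is an alternative way to ask the kernel to populate the mapping up front.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std=c11 */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SLAB_BYTES ((size_t)2 << 20)  /* 2 MiB, the SuperSlab size quoted above */

/* Map `bytes` of anonymous memory and touch every page so the page faults
 * are paid here (e.g. at startup) instead of on the allocation hot path.
 * Returns NULL if the mapping fails. */
static void *slab_map_prefaulted(size_t bytes) {
    assert(bytes > 0);
    void *mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return NULL;

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) page = 4096;  /* fallback assumption */

    /* volatile keeps the compiler from dropping the "useless" stores */
    volatile unsigned char *p = mem;
    for (size_t off = 0; off < bytes; off += (size_t)page)
        p[off] = 0;
    return mem;
}
```

A startup routine would call `slab_map_prefaulted(SLAB_BYTES)` once per slab it wants ready and release each with `munmap(ptr, SLAB_BYTES)`. How many slabs to pre-fault is a memory-versus-latency trade-off that this report does not quantify.
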
### Priority 2: Lock-Free Shared Pool (2-4x gain)
**Problem:** 3.3% of cycles in mutex locks
**Solution:** Atomic CAS for free list
**Expected:** Contributes to 2x overall gain

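An "atomic CAS for free list" design is commonly built as a Treiber stack. The sketch below shows that general technique with C11 atomics; it is not the shared pool's actual data structure, and `FreeNode`, `FreeList`, `freelist_push` and `freelist_pop` are hypothetical names. The pop side as written is exposed to the ABA problem when nodes are recycled concurrently, so a production version would likely need tagged pointers, hazard pointers, or a similar scheme.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct FreeNode { struct FreeNode *next; } FreeNode;
typedef struct { _Atomic(FreeNode *) head; } FreeList;

/* Push: link the node in front of the current head, retrying if another
 * thread changed the head between our read and the CAS. */
static void freelist_push(FreeList *fl, FreeNode *n) {
    assert(fl != NULL && n != NULL);
    FreeNode *old = atomic_load_explicit(&fl->head, memory_order_relaxed);
    do {
        n->next = old;  /* `old` is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
                 &fl->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Pop: swing the head to head->next. Returns NULL when the list is empty.
 * NOTE: not ABA-safe; see the caveat in the text above. */
static FreeNode *freelist_pop(FreeList *fl) {
    assert(fl != NULL);
    FreeNode *old = atomic_load_explicit(&fl->head, memory_order_acquire);
    while (old && !atomic_compare_exchange_weak_explicit(
                      &fl->head, &old, old->next,
                      memory_order_acquire, memory_order_acquire)) {
        /* `old` now holds the latest head; retry */
    }
    return old;
}
```

Compared with a mutex, the uncontended path is a single CAS and there is no lock to hold across a preemption. Whether that recovers a meaningful share of the 3.3% attributed to locking would need to be measured.
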
### Priority 3: Increase Unified Cache (2x fewer refills)
**Problem:** High miss rate → frequent refills
**Solution:** 64-128 blocks per class (currently 16-32)
**Expected:** 50% fewer refills

---

## Expected Performance After Optimizations

| Stage | Random Mixed | Cumulative Gain | vs Tiny Hot |
|-------|-------------|------|-------------|
| Current | 4.1 M ops/s | - | 21.7x slower |
| After P1 (Pre-fault) | 35 M ops/s | 8.5x | 2.5x slower |
| After P1-2 (Lock-free) | 45 M ops/s | 11x | 2.0x slower |
| After P1-3 (Cache) | 55 M ops/s | 13x | 1.6x slower |
| **After All (P1-7)** | **60 M ops/s** | **15x** | **1.5x slower** |

**Target:** These figures are projections, not measurements. Landing within 1.5-2x of Tiny Hot is acceptable given the inherent complexity of handling varied allocation sizes.

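The gain and "vs Tiny Hot" columns in the table are simple ratios against the two measured throughputs (4.1M and 89M ops/s). A small sketch for re-deriving them when the projections change; the function names are hypothetical:

```c
#include <assert.h>

/* Speedup of a projected throughput over the current one (same units). */
static double gain_over_current(double projected, double current) {
    assert(current > 0.0);
    return projected / current;
}

/* How many times slower a projected throughput is than the Tiny Hot baseline. */
static double gap_vs_tiny_hot(double tiny_hot, double projected) {
    assert(projected > 0.0);
    return tiny_hot / projected;
}
```

For the "After P1" row, `gain_over_current(35.0, 4.1)` is about 8.5 and `gap_vs_tiny_hot(89.0, 35.0)` is about 2.5, which matches the table.
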
---

## How to Reproduce

### 1. Build benchmarks
```bash
make bench_random_mixed_hakmem
make bench_tiny_hot_hakmem
```

### 2. Run without profiling (baseline)
```bash
HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_random_mixed_hakmem 1000000 256 42
HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_tiny_hot_hakmem 1000000
```

### 3. Profile with perf
```bash
# Random mixed
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
  -o perf_random_mixed.data -- \
  ./bench_random_mixed_hakmem 1000000 256 42

# Tiny hot
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
  -o perf_tiny_hot.data -- \
  ./bench_tiny_hot_hakmem 1000000
```

### 4. Analyze results
```bash
perf report --stdio -i perf_random_mixed.data --no-children --sort symbol --percent-limit 0.5
perf report --stdio -i perf_tiny_hot.data --no-children --sort symbol --percent-limit 0.5
```

---

## File Locations

All reports are in: `/mnt/workdisk/public_share/hakmem/`

```
PERF_SUMMARY_TABLE.txt                     - Quick reference (20KB)
PERF_PROFILING_ANSWERS.md                  - Q&A format (16KB)
PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md  - Detailed analysis (14KB)
PERF_INDEX.md                              - This file (index)
```

---

## Contact

For questions about this profiling analysis, see:
- Original request: Questions 1-7 in profiling task
- Implementation recommendations: PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md

---

**Generated by:** Linux perf + manual analysis
**Date:** 2025-12-04
**Version:** HAKMEM Phase 20+ (latest)