# HAKMEM Performance Profiling Index
**Date:** 2025-12-04
**Profiler:** Linux perf (6.8.12)
**Benchmarks:** bench_random_mixed_hakmem vs bench_tiny_hot_hakmem
---
## Quick Start
### TL;DR: What's the bottleneck?
**Answer:** Kernel page faults (61.7% of cycles) from on-demand mmap allocations.
**Fix:** Pre-fault SuperSlabs at startup → expected 10-15x speedup.
---
## Available Reports
### 1. PERF_SUMMARY_TABLE.txt (20KB)
**Quick reference table** with cycle breakdowns, top functions, and recommendations.
**Use when:** You need a fast overview with numbers.
```bash
cat PERF_SUMMARY_TABLE.txt
```
Key sections:
- Performance comparison table
- Cycle breakdown by layer (random_mixed vs tiny_hot)
- Top 10 functions by CPU time
- Actionable recommendations with expected gains
---
### 2. PERF_PROFILING_ANSWERS.md (16KB)
**Answers to specific questions** from the profiling request.
**Use when:** You want direct answers to:
- What % of cycles are in wrappers?
- Is unified_cache_refill being called frequently?
- Is shared_pool_acquire being called?
- Is registry lookup visible?
- Where are the 22x slowdown cycles spent?
```bash
less PERF_PROFILING_ANSWERS.md
```
Key sections:
- Q&A format (5 main questions)
- Top functions with cache/branch miss data
- Unexpected bottlenecks flagged
- Layer-by-layer optimization recommendations
---
### 3. PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md (14KB)
**Comprehensive layer-by-layer analysis** with detailed explanations.
**Use when:** You need deep understanding of:
- Why each layer contributes to the gap
- Root cause analysis (kernel page faults)
- Optimization strategies with implementation details
```bash
less PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md
```
Key sections:
- Executive summary
- Detailed cycle breakdown (random_mixed vs tiny_hot)
- Layer-by-layer analysis (6 layers)
- Performance gap analysis
- Actionable recommendations (7 priorities)
- Expected results after optimization
---
## Key Findings Summary
### Performance Gap
- **bench_tiny_hot:** 89M ops/s (baseline)
- **bench_random_mixed:** 4.1M ops/s
- **Gap:** 21.7x slower
### Root Cause: Kernel Page Faults (61.7%)
```
Random sizes (16-1040B)
  → Unified Cache miss
    → unified_cache_refill (2.3%)
      → shared_pool_acquire (3.3%)
        → SuperSlab mmap (2MB chunks)
          → 512 page faults per slab = 2MB / 4KB pages (61.7% of cycles!)
            → clear_page_erms (6.9% - kernel page zeroing)
```
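The fault count follows from demand paging: an anonymous 2MB mapping is created lazily, so the first touch of each 4KB page traps into the kernel (which allocates and zeroes the page), and 2MB / 4KB = 512 faults per slab. A minimal standalone C illustration of the pattern (not HAKMEM code):
```c
#define _GNU_SOURCE
#include <sys/mman.h>

/* Illustration only: an anonymous mmap is mapped lazily, so the first
 * write to each 4 KiB page takes a minor fault that allocates and zeroes
 * the page in the kernel. 2 MiB / 4 KiB = 512 faults per SuperSlab. */
int main(void) {
    char *slab = mmap(NULL, 2u << 20, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (slab == MAP_FAILED)
        return 1;
    for (unsigned off = 0; off < (2u << 20); off += 4096)
        slab[off] = 1;  /* each first touch is one page fault */
    return 0;
}
```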
### User-Space Hotspots (only 11% of total)
1. **Shared Pool:** 3.3% (mutex locks)
2. **Wrappers:** 3.7% (malloc/free entry)
3. **Unified Cache:** 2.3% (triggers page faults)
4. **Other:** 1.7%
### Tiny Hot (for comparison)
- **70% user-space, 30% kernel** (inverted!)
- **0.5% page faults** (vs 61.7% in random_mixed, roughly 120x less)
- Free path dominates (43%) due to safe ownership checks
---
## Top 3 Optimization Priorities
### Priority 1: Pre-fault SuperSlabs (10-15x gain)
**Problem:** 61.7% of cycles in kernel page faults
**Solution:** Pre-allocate and fault-in 2MB slabs at startup
**Expected:** 4.1M → 41M ops/s
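A minimal sketch of the pre-fault approach, assuming a Linux target; `superslab_alloc_prefaulted` and `SUPERSLAB_SIZE` are illustrative names, not HAKMEM's actual identifiers:
```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define SUPERSLAB_SIZE (2u << 20)  /* 2 MiB, per the profile above */

/* Sketch: allocate a SuperSlab with its pages faulted in up front.
 * MAP_POPULATE asks the kernel to pre-fault the mapping; the touch loop
 * is a fallback because the flag is advisory in some configurations. */
static void *superslab_alloc_prefaulted(void) {
    void *p = mmap(NULL, SUPERSLAB_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    /* First-touch every 4 KiB page so no fault remains on the hot path. */
    for (size_t off = 0; off < SUPERSLAB_SIZE; off += 4096)
        ((volatile char *)p)[off] = 0;
    return p;
}
```
This moves the 512 faults per slab from the allocation hot path to startup (or to a background refill thread), which is where the projected 10-15x comes from.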
### Priority 2: Lock-Free Shared Pool (2-4x gain)
**Problem:** 3.3% of cycles in mutex locks
**Solution:** Atomic CAS for free list
**Expected:** Contributes to 2x overall gain
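A minimal sketch of the CAS-based free list as a Treiber stack with C11 atomics; the type and function names are illustrative, and a production version must also address the ABA problem (e.g., with tagged pointers or hazard pointers):
```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative lock-free free list (Treiber stack), not HAKMEM's types. */
typedef struct free_node { struct free_node *next; } free_node;

static _Atomic(free_node *) g_pool_head;

static void pool_push(free_node *n) {
    free_node *old = atomic_load_explicit(&g_pool_head, memory_order_relaxed);
    do {
        n->next = old;  /* on CAS failure, 'old' is reloaded automatically */
    } while (!atomic_compare_exchange_weak_explicit(
        &g_pool_head, &old, n,
        memory_order_release, memory_order_relaxed));
}

static free_node *pool_pop(void) {
    free_node *old = atomic_load_explicit(&g_pool_head, memory_order_acquire);
    while (old && !atomic_compare_exchange_weak_explicit(
                      &g_pool_head, &old, old->next,
                      memory_order_acquire, memory_order_acquire))
        ;  /* retry with the reloaded head */
    return old;  /* NULL if the pool is empty */
}
```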
### Priority 3: Increase Unified Cache (2x fewer refills)
**Problem:** High miss rate → frequent refills
**Solution:** 64-128 blocks per class (currently 16-32)
**Expected:** 50% fewer refills
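Illustrative only (the real constant name and values are assumptions based on the numbers above):
```c
/* With N blocks cached per size class, one refill amortizes over N
 * allocations; going from 32 to 64 blocks halves the refill rate
 * (and hence the page-fault-triggering rate) for that class, at the
 * cost of a larger per-thread cache footprint. */
enum { UNIFIED_CACHE_BLOCKS_PER_CLASS = 128 };  /* was 16-32 per the profile */
```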
---
## Expected Performance After Optimizations
| Stage | Random Mixed | Gain | vs Tiny Hot |
|-------|-------------|------|-------------|
| Current | 4.1 M ops/s | - | 21.7x slower |
| After P1 (Pre-fault) | 35 M ops/s | 8.5x | 2.5x slower |
| After P1-2 (Lock-free) | 45 M ops/s | 11x | 2.0x slower |
| After P1-3 (Cache) | 55 M ops/s | 13x | 1.6x slower |
| **After All (P1-7)** | **60 M ops/s** | **15x** | **1.5x slower** |
**Target:** Staying within 1.5-2x of Tiny Hot is acceptable given the inherent complexity of handling varied allocation sizes.
---
## How to Reproduce
### 1. Build benchmarks
```bash
make bench_random_mixed_hakmem
make bench_tiny_hot_hakmem
```
### 2. Run without profiling (baseline)
```bash
HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_random_mixed_hakmem 1000000 256 42
HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_tiny_hot_hakmem 1000000
```
### 3. Profile with perf
```bash
# Random mixed
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
-o perf_random_mixed.data -- \
./bench_random_mixed_hakmem 1000000 256 42
# Tiny hot
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
-o perf_tiny_hot.data -- \
./bench_tiny_hot_hakmem 1000000
```
### 4. Analyze results
```bash
perf report --stdio -i perf_random_mixed.data --no-children --sort symbol --percent-limit 0.5
perf report --stdio -i perf_tiny_hot.data --no-children --sort symbol --percent-limit 0.5
```
---
## File Locations
All reports are in: `/mnt/workdisk/public_share/hakmem/`
```
PERF_SUMMARY_TABLE.txt - Quick reference (20KB)
PERF_PROFILING_ANSWERS.md - Q&A format (16KB)
PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md - Detailed analysis (14KB)
PERF_INDEX.md - This file (index)
```
---
## Contact
For questions about this profiling analysis, see:
- Original request: Questions 1-7 in profiling task
- Implementation recommendations: PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md
---
**Generated by:** Linux perf + manual analysis
**Date:** 2025-12-04
**Version:** HAKMEM Phase 20+ (latest)