# HAKMEM Performance Profiling Index

**Date:** 2025-12-04
**Profiler:** Linux perf (6.8.12)
**Benchmarks:** bench_random_mixed_hakmem vs bench_tiny_hot_hakmem

---

## Quick Start

### TL;DR: What's the bottleneck?

**Answer:** Kernel page faults (61.7% of cycles) from on-demand mmap allocations.

**Fix:** Pre-fault SuperSlabs at startup → expected 10-15x speedup.

---

## Available Reports

### 1. PERF_SUMMARY_TABLE.txt (20KB)
**Quick reference table** with cycle breakdowns, top functions, and recommendations.

**Use when:** You need a fast overview with numbers.

```bash
cat PERF_SUMMARY_TABLE.txt
```

Key sections:
- Performance comparison table
- Cycle breakdown by layer (random_mixed vs tiny_hot)
- Top 10 functions by CPU time
- Actionable recommendations with expected gains

---

### 2. PERF_PROFILING_ANSWERS.md (16KB)
**Answers to specific questions** from the profiling request.

**Use when:** You want direct answers to:
- What % of cycles are in wrappers?
- Is unified_cache_refill being called frequently?
- Is shared_pool_acquire being called?
- Is registry lookup visible?
- Where are the 22x slowdown cycles spent?

```bash
less PERF_PROFILING_ANSWERS.md
```

Key sections:
- Q&A format (5 main questions)
- Top functions with cache/branch miss data
- Unexpected bottlenecks flagged
- Layer-by-layer optimization recommendations

---

### 3. PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md (14KB)
**Comprehensive layer-by-layer analysis** with detailed explanations.

**Use when:** You need deep understanding of:
- Why each layer contributes to the gap
- Root cause analysis (kernel page faults)
- Optimization strategies with implementation details

```bash
less PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md
```

Key sections:
- Executive summary
- Detailed cycle breakdown (random_mixed vs tiny_hot)
- Layer-by-layer analysis (6 layers)
- Performance gap analysis
- Actionable recommendations (7 priorities)
- Expected results after optimization

---

## Key Findings Summary

### Performance Gap
- **bench_tiny_hot:** 89M ops/s (baseline)
- **bench_random_mixed:** 4.1M ops/s
- **Gap:** 21.7x slower

### Root Cause: Kernel Page Faults (61.7%)
```
Random sizes (16-1040B)
    ↓
Unified Cache misses
    ↓
unified_cache_refill (2.3%)
    ↓
shared_pool_acquire (3.3%)
    ↓
SuperSlab mmap (2MB chunks)
    ↓
512 page faults per slab (61.7% cycles!)
    ↓
clear_page_erms (6.9% - zeroing)
```

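As a quick check on the 512 figure in the chain above: a 2MB SuperSlab split into 4 KiB pages gives 512 pages, each of which faults on first touch. The 4 KiB page size is an assumption (it is the usual x86-64 base page size; this index does not state it), and `faults_per_slab` is a hypothetical helper for illustration, not a HAKMEM function.

```c
#include <assert.h>
#include <stddef.h>

/* Number of first-touch page faults needed to populate one slab,
 * assuming every page is touched once and no huge pages are in use. */
static size_t faults_per_slab(size_t slab_bytes, size_t page_bytes)
{
    assert(page_bytes != 0);
    /* Round up so a partially used trailing page still counts. */
    return (slab_bytes + page_bytes - 1) / page_bytes;
}
```

With `slab_bytes = 2 * 1024 * 1024` and `page_bytes = 4096`, the helper returns 512, matching the figure in the diagram.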
### User-Space Hotspots (only 11% of total)
1. **Shared Pool:** 3.3% (mutex locks)
2. **Wrappers:** 3.7% (malloc/free entry)
3. **Unified Cache:** 2.3% (triggers page faults)
4. **Other:** 1.7%

### Tiny Hot (for comparison)
- **70% user-space, 30% kernel** (inverted!)
- **0.5% page faults** (about 122x fewer than random_mixed)
- Free path dominates (43%) due to safe ownership checks

---

## Top 3 Optimization Priorities

### Priority 1: Pre-fault SuperSlabs (10-15x gain)
**Problem:** 61.7% of cycles in kernel page faults
**Solution:** Pre-allocate and fault-in 2MB slabs at startup
**Expected:** 4.1M → 41M ops/s

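A minimal sketch of the pre-faulting idea, assuming Linux, anonymous private mappings, and the 2MB slab size quoted above. `prefault_slabs` and `SLAB_BYTES` are illustrative names rather than HAKMEM's actual API, and the sketch ignores the 2MB alignment that a real SuperSlab would likely require.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SLAB_BYTES (2u * 1024u * 1024u)   /* 2MB SuperSlab size from the report */

/* Map `count` slabs as one region and write to every page, so the page
 * faults are paid here at startup and not on the allocation hot path.
 * Returns NULL if the mapping fails. */
static void *prefault_slabs(size_t count)
{
    assert(count > 0);
    size_t len = count * (size_t)SLAB_BYTES;

    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        page = 4096;             /* fallback assumption: 4 KiB pages */

    /* The first write to each page forces the kernel to back it now. */
    volatile unsigned char *p = base;
    for (size_t off = 0; off < len; off += (size_t)page)
        p[off] = 0;

    return base;
}
```

On Linux, passing `MAP_POPULATE` to `mmap` is an alternative that asks the kernel to populate the pages during the call itself, which would replace the explicit touch loop. Either way, the startup cost and resident memory grow with the number of slabs pre-faulted, so the count is a tuning decision.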
### Priority 2: Lock-Free Shared Pool (2-4x gain)
**Problem:** 3.3% of cycles in mutex locks
**Solution:** Atomic CAS for free list
**Expected:** Contributes to 2x overall gain

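One way the "atomic CAS for free list" idea could look is a Treiber-style stack built on C11 atomics. This is a sketch under assumptions, not the shared pool's real data structure: `free_node`, `free_list`, `free_list_push`, and `free_list_pop` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Intrusive node: the link lives inside the free block itself. */
typedef struct free_node {
    struct free_node *next;
} free_node;

typedef struct {
    _Atomic(free_node *) head;
} free_list;

/* Push a node with a CAS loop instead of taking a mutex. */
static void free_list_push(free_list *fl, free_node *n)
{
    assert(fl != NULL && n != NULL);
    free_node *old = atomic_load_explicit(&fl->head, memory_order_relaxed);
    do {
        n->next = old;           /* on CAS failure, `old` holds the new head */
    } while (!atomic_compare_exchange_weak_explicit(
                 &fl->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Pop the most recently pushed node, or return NULL if the list is empty. */
static free_node *free_list_pop(free_list *fl)
{
    assert(fl != NULL);
    free_node *old = atomic_load_explicit(&fl->head, memory_order_acquire);
    while (old != NULL &&
           !atomic_compare_exchange_weak_explicit(
               &fl->head, &old, old->next,
               memory_order_acquire, memory_order_acquire)) {
        /* retry with the refreshed `old` */
    }
    return old;
}
```

A plain CAS pop like this is exposed to the ABA problem when nodes are reused concurrently, so a production version would need a mitigation such as a version-tagged head or hazard pointers. That extra complexity is part of the cost to weigh against the 3.3% of cycles currently spent in mutex locks.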
### Priority 3: Increase Unified Cache (2x fewer refills)
**Problem:** High miss rate → frequent refills
**Solution:** 64-128 blocks per class (currently 16-32)
**Expected:** 50% fewer refills

---

## Expected Performance After Optimizations

| Stage | Random Mixed | Gain vs Current | Gap to Tiny Hot |
|-------|-------------|------|-------------|
| Current | 4.1 M ops/s | - | 21.7x slower |
| After P1 (Pre-fault) | 35 M ops/s | 8.5x | 2.5x slower |
| After P1-2 (Lock-free) | 45 M ops/s | 11x | 2.0x slower |
| After P1-3 (Cache) | 55 M ops/s | 13x | 1.6x slower |
| **After All (P1-7)** | **60 M ops/s** | **15x** | **1.5x slower** |

**Target:** Landing within 1.5-2x of Tiny Hot is acceptable given the inherent complexity of handling varied allocation sizes.

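The Gain and Tiny Hot columns in the table above follow directly from the two measured throughputs (4.1M and 89M ops/s). A small sketch of that arithmetic, useful for re-deriving the columns if the projections change; the macro and function names are hypothetical:

```c
#include <assert.h>

#define TINY_HOT_MOPS 89.0   /* bench_tiny_hot baseline from this report */
#define CURRENT_MOPS   4.1   /* bench_random_mixed as measured today */

/* Speedup of a projected random_mixed throughput over the current one. */
static double gain_over_current(double mops)
{
    assert(mops > 0.0);
    return mops / CURRENT_MOPS;
}

/* How many times slower than Tiny Hot a given throughput still is. */
static double gap_to_tiny_hot(double mops)
{
    assert(mops > 0.0);
    return TINY_HOT_MOPS / mops;
}
```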

---

## How to Reproduce

### 1. Build benchmarks
```bash
make bench_random_mixed_hakmem
make bench_tiny_hot_hakmem
```

### 2. Run without profiling (baseline)
```bash
HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_random_mixed_hakmem 1000000 256 42
HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_tiny_hot_hakmem 1000000
```

### 3. Profile with perf
```bash
# Random mixed
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
  -o perf_random_mixed.data -- \
  ./bench_random_mixed_hakmem 1000000 256 42

# Tiny hot
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
  -o perf_tiny_hot.data -- \
  ./bench_tiny_hot_hakmem 1000000
```

### 4. Analyze results
```bash
perf report --stdio -i perf_random_mixed.data --no-children --sort symbol --percent-limit 0.5
perf report --stdio -i perf_tiny_hot.data --no-children --sort symbol --percent-limit 0.5
```

---

## File Locations

All reports are in: `/mnt/workdisk/public_share/hakmem/`

```
PERF_SUMMARY_TABLE.txt                          - Quick reference (20KB)
PERF_PROFILING_ANSWERS.md                       - Q&A format (16KB)
PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md       - Detailed analysis (14KB)
PERF_INDEX.md                                   - This file (index)
```

---

## Contact

For questions about this profiling analysis, see:
- Original request: Questions 1-7 in profiling task
- Implementation recommendations: PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md

---

**Generated by:** Linux perf + manual analysis
**Date:** 2025-12-04
**Version:** HAKMEM Phase 20+ (latest)