# Perf Baseline: Front-Direct Mode (Post-SEGV Fix)

**Date**: 2025-11-14
**Commit**: 696aa7c0b (SEGV fix with mincore() safety checks)
**Test**: `bench_random_mixed_hakmem 200000 4096 1234567`
**Mode**: `HAKMEM_TINY_FRONT_DIRECT=1`

---

## 📊 Performance Summary

### Throughput

```
HAKMEM (Front-Direct):  563K ops/s  (0.355s for 200K iterations)
System malloc:          ~90M ops/s  (estimated)
Gap:                    160x slower (0.63% of target)
```

**Regression Alert**: Phase 11 achieved 9.38M ops/s (before SEGV fix)
**Current**: 563K ops/s → **-94% regression** (mincore() overhead)

---

## 🔥 Hotspot Analysis

### Syscall Statistics (200K iterations)

| Syscall | Count | Time (s) | % of syscall time | Impact |
|---------|-------|----------|-------------------|--------|
| **munmap** | 3,214 | 0.0258 | 47.4% | ❌ **CRITICAL** |
| **mmap** | 3,241 | 0.0149 | 27.4% | ❌ **CRITICAL** |
| **madvise** | 1,591 | 0.0072 | 13.3% | ⚠️ High |
| **mincore** | 1,591 | 0.0060 | 11.0% | ⚠️ High (SEGV fix overhead) |
| Other | 143 | 0.0006 | 1.0% | ✓ OK |
| **Total** | **9,780** | 0.0544 | 100% | |

**Key Findings**:
1. **mmap/munmap churn**: 6,455 calls (74.8% of syscall time)
   - Root cause: SuperSlab aggressive deallocation
   - Expected: ~100-200 calls (mimalloc-style pooling)
   - **Gap**: 32-65x excessive syscalls

2. **mincore() overhead**: 1,591 calls (11.0% of syscall time)
   - Added by SEGV fix (commit 696aa7c0b)
   - Called on EVERY unknown pointer in free wrapper
   - **Optimization needed**: Cache result, skip for known patterns
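
The percentages in the table are shares of traced syscall time, not of the benchmark's wall-clock time. The sketch below recomputes the totals from the table rows and relates them to the 0.355 s run. It assumes the syscall trace and the throughput measurement describe comparable runs, which the report does not state; the type and function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the syscall table above (values copied from the report). */
typedef struct {
    const char *name;
    long        count;
    double      seconds;
} SyscallRow;

static const SyscallRow k_rows[] = {
    {"munmap",  3214, 0.0258},
    {"mmap",    3241, 0.0149},
    {"madvise", 1591, 0.0072},
    {"mincore", 1591, 0.0060},
    {"other",    143, 0.0006},
};
static const size_t k_row_count = sizeof k_rows / sizeof k_rows[0];

/* Sum of the per-syscall call counts. */
static long total_count(void) {
    long n = 0;
    for (size_t i = 0; i < k_row_count; i++) n += k_rows[i].count;
    return n;
}

/* Sum of the per-syscall times in seconds. */
static double total_seconds(void) {
    double s = 0.0;
    for (size_t i = 0; i < k_row_count; i++) s += k_rows[i].seconds;
    return s;
}

/* Fraction `part / whole`, guarding against a zero denominator. */
static double share(double part, double whole) {
    return whole > 0.0 ? part / whole : 0.0;
}
```

For example, `share(total_seconds(), 0.355)` relates the traced syscall time to the measured run time, and `share(k_rows[3].seconds, total_seconds())` reproduces the mincore row's share of syscall time.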

---

## 📈 Hardware Performance Counters

| Counter | Value | Notes |
|---------|-------|-------|
| **Cycles** | 826M | |
| **Instructions** | 847M | |
| **IPC** | 1.03 | ⚠️ Low (target: 2-4) |
| **Branches** | 177M | |
| **Branch misses** | 12.1M | 6.82% miss rate (✓ OK) |
| **Cache refs** | 53.3M | |
| **Cache misses** | 8.7M | 16.32% miss rate (⚠️ High) |
| **Page faults** | 59,659 | ⚠️ High (0.30 per iteration) |

**Performance Issues**:
1. **Low IPC (1.03)**: Memory stalls dominating (cache misses, TLB pressure)
2. **High cache miss rate (16.32%)**: Pointer chasing, poor locality
3. **Page faults (59.7K)**: mmap/munmap churn; each freshly mapped page faults on first touch, and each unmap invalidates TLB entries
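
The derived columns in the counter table (IPC, miss rates, faults per iteration) are simple ratios of the raw counters. A minimal sketch of those ratios follows; the struct and function names are hypothetical, and the inputs are the rounded values from the table, so results can differ from the reported percentages in the last digit.

```c
#include <assert.h>

/* Raw counter values as reported in the table above (rounded). */
typedef struct {
    double cycles;
    double instructions;
    double branches;
    double branch_misses;
    double cache_refs;
    double cache_misses;
    double page_faults;
} PerfCounters;

/* Generic ratio helper that avoids dividing by zero. */
static double ratio(double part, double whole) {
    return whole > 0.0 ? part / whole : 0.0;
}

/* Instructions retired per cycle. */
static double ipc(const PerfCounters *c) {
    return ratio(c->instructions, c->cycles);
}

/* Fraction of cache references that missed. */
static double cache_miss_rate(const PerfCounters *c) {
    return ratio(c->cache_misses, c->cache_refs);
}

/* Fraction of branches that were mispredicted. */
static double branch_miss_rate(const PerfCounters *c) {
    return ratio(c->branch_misses, c->branches);
}

/* Page faults per benchmark iteration. */
static double faults_per_iteration(const PerfCounters *c, double iterations) {
    return ratio(c->page_faults, iterations);
}
```

Filling a `PerfCounters` with 826e6 cycles, 847e6 instructions, 53.3e6 cache refs, 8.7e6 cache misses, and 59659 page faults over 200000 iterations gives values close to the IPC, cache miss rate, and faults-per-iteration figures in the table.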

---

## 🎯 Bottleneck Ranking (by Impact)

### **Box 1: SuperSlab/Shared Pool (CRITICAL - 74.8% syscall time)**

**Symptoms**:
- mmap: 3,241 calls
- munmap: 3,214 calls
- madvise: 1,591 calls
- Total: 8,046 syscalls (82% of all syscalls)

**Root Cause**: Phase 9 Lazy Deallocation **NOT working**
- Hypothesis: LRU cache too small, prewarm insufficient
- Expected behavior: Reuse SuperSlabs, minimal syscalls
- Actual: Aggressive deallocation (the gap vs. mimalloc)

**Attack Plan**:
1. **Immediate**: Verify LRU cache is active
   - Check `g_ss_lru_*` counters
   - ENV: `HAKMEM_SS_LRU_DEBUG=1`
2. **Phase 12 Design**: Shared SuperSlab Pool (mimalloc-style)
   - 1 SuperSlab serves multiple size classes
   - Dynamic slab allocation
   - Target: 877 SuperSlabs → 100-200 (-77% to -89%)

**Expected Impact**: +1500% (74.8% → ~5% of syscall time)
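
The "reuse SuperSlabs, minimal syscalls" behavior can be illustrated with a small retention cache: released regions are parked instead of unmapped and are handed back on the next acquire. This is only a sketch of the idea, not HAKMEM's Phase 9 code. All names (`ss_acquire`, `ss_release`, `SS_SIZE`, `SS_CACHE_CAP`) and the 2 MiB size are assumptions, the cache is LIFO rather than a true LRU, and it is not thread-safe.

```c
#define _DEFAULT_SOURCE           /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define SS_SIZE      ((size_t)2 << 20)  /* assumed SuperSlab size: 2 MiB */
#define SS_CACHE_CAP 8                  /* assumed number of retained regions */

static void  *ss_cache[SS_CACHE_CAP];   /* parked regions, most recent last */
static size_t ss_cache_len;
static size_t ss_mmap_calls;            /* counters to observe syscall churn */
static size_t ss_munmap_calls;

/* Return a region, preferring a parked one over a fresh mmap(). */
static void *ss_acquire(void) {
    if (ss_cache_len > 0) {
        return ss_cache[--ss_cache_len];        /* reuse: no syscall */
    }
    void *p = mmap(NULL, SS_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return NULL;
    }
    ss_mmap_calls++;
    return p;
}

/* Park a region if there is room; only unmap when the cache is full. */
static void ss_release(void *p) {
    if (p == NULL) {
        return;
    }
    if (ss_cache_len < SS_CACHE_CAP) {
        ss_cache[ss_cache_len++] = p;           /* retain: no syscall */
        return;
    }
    munmap(p, SS_SIZE);
    ss_munmap_calls++;
}
```

With a cache like this, an acquire/release/acquire sequence issues one `mmap` and no `munmap`. If the debug counters show that pattern is not happening in the real allocator, that would support the "LRU cache too small" hypothesis.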

---

### **Box 2: mincore() Overhead (MODERATE - 11.0% syscall time)**

**Symptoms**:
- mincore: 1,591 calls (11.0% of syscall time)
- Added by SEGV fix (commit 696aa7c0b)
- Called on EVERY external pointer in free wrapper

**Root Cause**: No caching, no fast-path for known patterns

**Attack Plan**:
1. **Optimization A**: Cache mincore() result per page
   - TLS cache: `last_checked_page → is_mapped`
   - Hit rate estimate: 90-95% (same page repeated)
2. **Optimization B**: Skip mincore() for known ranges
   - Check if ptr in expected range (heap, stack, mmap areas)
   - Use `/proc/self/maps` on init
3. **Optimization C**: Remove from classify_ptr()
   - Already done (Step 3 removed AllocHeader probe)
   - Only free wrapper needs it

**Expected Impact**: +12-15% (11.0% → ~1% of syscall time)
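
Optimization B could look roughly like the following: snapshot the mapped address ranges from `/proc/self/maps` once at init, then test pointers against that snapshot before falling back to mincore(). This is a Linux-only sketch under stated assumptions. The names (`maps_snapshot`, `ptr_in_snapshot`, `MAX_RANGES`) are hypothetical, and the snapshot goes stale as soon as anything is mapped or unmapped, so a real version would need to refresh it or treat a hit as a hint only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_RANGES 1024   /* assumed upper bound on tracked mappings */

typedef struct {
    uintptr_t start;      /* inclusive */
    uintptr_t end;        /* exclusive */
} MapRange;

static MapRange g_ranges[MAX_RANGES];
static size_t   g_range_count;

/* Read /proc/self/maps and record each "start-end" range.
 * Returns the number of ranges stored, or -1 if the file cannot be opened. */
static int maps_snapshot(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL) {
        return -1;
    }
    char line[512];
    g_range_count = 0;
    while (g_range_count < MAX_RANGES && fgets(line, sizeof line, f) != NULL) {
        unsigned long start, end;
        if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
            g_ranges[g_range_count].start = (uintptr_t)start;
            g_ranges[g_range_count].end   = (uintptr_t)end;
            g_range_count++;
        }
        /* Discard the tail of an over-long line so it is not parsed as a new entry. */
        if (strchr(line, '\n') == NULL) {
            int ch;
            while ((ch = fgetc(f)) != EOF && ch != '\n') {
            }
        }
    }
    fclose(f);
    return (int)g_range_count;
}

/* Return 1 if p fell inside a mapping at snapshot time, else 0. */
static int ptr_in_snapshot(const void *p) {
    uintptr_t addr = (uintptr_t)p;
    for (size_t i = 0; i < g_range_count; i++) {
        if (addr >= g_ranges[i].start && addr < g_ranges[i].end) {
            return 1;
        }
    }
    return 0;
}
```

A linear scan is used for brevity. Since the kernel lists mappings in ascending address order, a binary search over the same array would be a natural refinement if the range count is large.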

---

### **Box 3: Front Cache Miss (LOW - visible in cache stats)**

**Symptoms**:
- Cache miss rate: 16.32%
- IPC: 1.03 (low, memory-bound)

**Attack Plan** (after Box 1/2 fixed):
1. Check FastCache hit rate
   - ENV: `HAKMEM_FRONT_STATS=1`
   - Target: >90% hit rate
2. Tune FC capacity/refill size
   - ENV: `HAKMEM_FC_CAP=256` (2x current)
   - ENV: `HAKMEM_FC_REFILL=32` (2x current)

**Expected Impact**: +5-10% (after syscall fixes)

---

## 🚀 Optimization Priority

### **Phase A: SuperSlab Churn Fix (Target: +1500%)**

```bash
# Step 1: Diagnose LRU
export HAKMEM_SS_LRU_DEBUG=1
export HAKMEM_SS_PREWARM_DEBUG=1
./bench_random_mixed_hakmem 200000 4096 1234567

# Step 2: Tune LRU size
export HAKMEM_SS_LRU_SIZE=128   # Current: unknown
export HAKMEM_SS_PREWARM=64     # Current: unknown

# Step 3: Design Phase 12 Shared Pool
# - Implement mimalloc-style dynamic slab allocation
# - Target: 6,455 syscalls → ~100 (-98%)
```
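
### **Phase B: mincore() Optimization (Target: +12-15%)**

```c
#define _DEFAULT_SOURCE          /* exposes mincore() on glibc */
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Step 1: Page cache (TLS) */
static __thread struct {
    void* page;
    int   is_mapped;
} g_mincore_cache = {NULL, 0};

/* Step 2: Fast-path check (helper name is illustrative) */
static int page_is_mapped(const void* ptr) {
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
    void* page = (void*)((uintptr_t)ptr & ~(page_size - 1));

    if (page == g_mincore_cache.page) {
        return g_mincore_cache.is_mapped;              /* Cache hit */
    }
    unsigned char vec;
    /* mincore() returns 0 for a mapped range, -1 (ENOMEM) for an unmapped one */
    int is_mapped = (mincore(page, page_size, &vec) == 0);   /* Syscall */
    g_mincore_cache.page = page;
    g_mincore_cache.is_mapped = is_mapped;
    return is_mapped;
}

/* Note: the cached entry must be invalidated when that page is unmapped. */
/* Expected: 1,591 → ~100 calls (-94%) */
```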

### **Phase C: Front Tuning (Target: +5-10%)**

```bash
# After Phase A/B complete
export HAKMEM_FC_CAP=256
export HAKMEM_FC_REFILL=32
export HAKMEM_FRONT_STATS=1
```

---

## 📋 Immediate Action Items

1. **[ultrathink/ChatGPT]** Review this report
2. **[Task 1]** Diagnose why Phase 9 LRU is not working
   - Run with `HAKMEM_SS_LRU_DEBUG=1`
   - Check LRU hit/miss counters
3. **[Task 2]** Design mincore() page cache
   - TLS cache (page → is_mapped)
   - Measure hit rate
4. **[Task 3]** Implement Phase 12 Shared SuperSlab Pool
   - Design doc: mimalloc-style dynamic allocation
   - Target: 877 → 100-200 SuperSlabs

---

## 🎯 Target Performance (After Optimizations)

```
Current:  563K ops/s
Target:   70-90M ops/s (System malloc: 90M)
Gap:      124-160x
Required: +12,400-15,900% improvement

Phase A (SuperSlab): +1500% → 8.5M ops/s  (9.4% of target)
Phase B (mincore):   +15%   → 10.0M ops/s (11.1% of target)
Phase C (Front):     +10%   → 11.0M ops/s (12.2% of target)
Phase D (??):        Need ~6.4-8.2x more (11.0M → 70-90M ops/s)
```

**Note**: Current performance is **worse than Phase 11** (9.38M → 563K)
**Root cause**: mincore() added in SEGV fix (1,591 syscalls)
**Priority**: Fix mincore() overhead FIRST (Phase B), then SuperSlab (Phase A)