# Perf Baseline: Front-Direct Mode (Post-SEGV Fix)

**Date**: 2025-11-14
**Commit**: 696aa7c0b (SEGV fix with mincore() safety checks)
**Test**: `bench_random_mixed_hakmem 200000 4096 1234567`
**Mode**: `HAKMEM_TINY_FRONT_DIRECT=1`

---

## 📊 Performance Summary

### Throughput

```
HAKMEM (Front-Direct):  563K ops/s   (0.355s for 200K iterations)
System malloc:          ~90M ops/s   (estimated)
Gap:                    160x slower  (0.63% of target)
```

**Regression Alert**: Phase 11 achieved 9.38M ops/s (before SEGV fix)
**Current**: 563K ops/s → **-94% regression** (attributed to mincore() overhead)

---

## 🔥 Hotspot Analysis

### Syscall Statistics (200K iterations)

| Syscall | Count | Time (s) | % Time | Impact |
|---------|-------|----------|--------|--------|
| **munmap** | 3,214 | 0.0258 | 47.4% | ❌ **CRITICAL** |
| **mmap** | 3,241 | 0.0149 | 27.4% | ❌ **CRITICAL** |
| **madvise** | 1,591 | 0.0072 | 13.3% | ⚠️ High |
| **mincore** | 1,591 | 0.0060 | 11.0% | ⚠️ High (SEGV fix overhead) |
| Other | 143 | 0.0006 | 1.0% | ✓ OK |
| **Total** | **9,780** | 0.0544 | 100% | |

**Key Findings**:

1. **mmap/munmap churn**: 6,455 calls (74.8% of syscall time)
   - Root cause: SuperSlab aggressive deallocation
   - Expected: ~100-200 calls (mimalloc-style pooling)
   - **Gap**: 32-65x excessive syscalls

2. **mincore() overhead**: 1,591 calls (11.0% of syscall time)
   - Added by SEGV fix (commit 696aa7c0b)
   - Called on EVERY unknown pointer in free wrapper
   - **Optimization needed**: Cache result, skip for known patterns

---

## 📈 Hardware Performance Counters

| Counter | Value | Notes |
|---------|-------|-------|
| **Cycles** | 826M | |
| **Instructions** | 847M | |
| **IPC** | 1.03 | ⚠️ Low (target: 2-4) |
| **Branches** | 177M | |
| **Branch misses** | 12.1M | 6.82% miss rate (✓ OK) |
| **Cache refs** | 53.3M | |
| **Cache misses** | 8.7M | 16.32% miss rate (⚠️ High) |
| **Page faults** | 59,659 | ⚠️ High (0.30 per iteration) |

**Performance Issues**:

1. **Low IPC (1.03)**: Memory stalls dominating (cache misses, TLB pressure)
2. **High cache miss rate (16.32%)**: Pointer chasing, poor locality
3. **Page faults (59K)**: mmap/munmap churn causing TLB thrashing

---

## 🎯 Bottleneck Ranking (by Impact)

### **Box 1: SuperSlab/Shared Pool (CRITICAL - 74.8% syscall time)**

**Symptoms**:
- mmap: 3,241 calls
- munmap: 3,214 calls
- madvise: 1,591 calls
- Total: 8,046 syscalls (82% of all syscalls)

**Root Cause**: Phase 9 Lazy Deallocation **NOT working**
- Hypothesis: LRU cache too small, prewarm insufficient
- Expected behavior: Reuse SuperSlabs, minimal syscalls
- Actual: Aggressive deallocation (mimalloc gap)

**Attack Plan**:

1. **Immediate**: Verify LRU cache is active
   - Check `g_ss_lru_*` counters
   - ENV: `HAKMEM_SS_LRU_DEBUG=1`

2. **Phase 12 Design**: Shared SuperSlab Pool (mimalloc-style)
   - 1 SuperSlab serves multiple size classes
   - Dynamic slab allocation
   - Target: 877 SuperSlabs → 100-200 (-77% to -89%)

**Expected Impact**: +1500% (74.8% → ~5%)

---

### **Box 2: mincore() Overhead (MODERATE - 11.0% syscall time)**

**Symptoms**:
- mincore: 1,591 calls (11.0% of syscall time)
- Added by SEGV fix (commit 696aa7c0b)
- Called on EVERY external pointer in free wrapper

**Root Cause**: No caching, no fast-path for known patterns

**Attack Plan**:

1. **Optimization A**: Cache mincore() result per page
   - TLS cache: `last_checked_page → is_mapped`
   - Hit rate estimate: 90-95% (same page repeated)

2. **Optimization B**: Skip mincore() for known ranges
   - Check if ptr in expected range (heap, stack, mmap areas)
   - Use `/proc/self/maps` on init

3. **Optimization C**: Remove from classify_ptr()
   - Already done (Step 3 removed AllocHeader probe)
   - Only free wrapper needs it

**Expected Impact**: +12-15% (11.0% → ~1%)

---

### **Box 3: Front Cache Miss (LOW - visible in cache stats)**

**Symptoms**:
- Cache miss rate: 16.32%
- IPC: 1.03 (low, memory-bound)

**Attack Plan** (after Box 1/2 fixed):

1. Check FastCache hit rate
   - ENV: `HAKMEM_FRONT_STATS=1`
   - Target: >90% hit rate

2. Tune FC capacity/refill size
   - ENV: `HAKMEM_FC_CAP=256` (2x current)
   - ENV: `HAKMEM_FC_REFILL=32` (2x current)

**Expected Impact**: +5-10% (after syscall fixes)

---

## 🚀 Optimization Priority

### **Phase A: SuperSlab Churn Fix (Target: +1500%)**

```bash
# Step 1: Diagnose LRU
export HAKMEM_SS_LRU_DEBUG=1
export HAKMEM_SS_PREWARM_DEBUG=1
./bench_random_mixed_hakmem 200000 4096 1234567

# Step 2: Tune LRU size
export HAKMEM_SS_LRU_SIZE=128  # Current: unknown
export HAKMEM_SS_PREWARM=64    # Current: unknown

# Step 3: Design Phase 12 Shared Pool
# - Implement mimalloc-style dynamic slab allocation
# - Target: 6,455 syscalls → ~100 (-98%)
```

### **Phase B: mincore() Optimization (Target: +12-15%)**

```c
// Step 1: Page cache (TLS)
static __thread struct {
    void* page;
    int is_mapped;
} g_mincore_cache = {NULL, 0};

// Step 2: Fast-path check (page = ptr rounded down to a page boundary)
if (page == g_mincore_cache.page) {
    is_mapped = g_mincore_cache.is_mapped;  // Cache hit
} else {
    // Syscall: mincore() returns 0 when the range is mapped, -1 otherwise
    is_mapped = (mincore(...) == 0);
    g_mincore_cache.page = page;
    g_mincore_cache.is_mapped = is_mapped;
}

// Expected: 1,591 → ~100 calls (-94%)
```

### **Phase C: Front Tuning (Target: +5-10%)**

```bash
# After Phase A/B complete
export HAKMEM_FC_CAP=256
export HAKMEM_FC_REFILL=32
export HAKMEM_FRONT_STATS=1
```

---

## 📋 Immediate Action Items

1. **[ultrathink/ChatGPT]** Review this report
2. **[Task 1]** Diagnose why Phase 9 LRU is not working
   - Run with `HAKMEM_SS_LRU_DEBUG=1`
   - Check LRU hit/miss counters
3. **[Task 2]** Design mincore() page cache
   - TLS cache (page → is_mapped)
   - Measure hit rate
4. **[Task 3]** Implement Phase 12 Shared SuperSlab Pool
   - Design doc: mimalloc-style dynamic allocation
   - Target: 877 → 100-200 SuperSlabs

---

## 🎯 Target Performance (After Optimizations)

```
Current:    563K ops/s
Target:     70-90M ops/s (System malloc: 90M)
Gap:        124-160x
Required:   +12,400-15,900% improvement

Phase A (SuperSlab):  +1500%  → 8.5M ops/s   (9.4% of target)
Phase B (mincore):    +15%    → 10.0M ops/s  (11.1% of target)
Phase C (Front):      +10%    → 11.0M ops/s  (12.2% of target)
Phase D (??):         Need more (+650-750%)
```

**Note**: Current performance is **worse than Phase 11** (9.38M → 563K)
**Root cause**: mincore() added in SEGV fix (1,591 syscalls)
**Priority**: Fix mincore() overhead FIRST (Phase B), then SuperSlab (Phase A)
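
### Sketch: TLS mincore() page cache with hit-rate counters

Since Phase B is first in line and Task 2 asks for a measurable hit rate, the following is a minimal, self-contained sketch of a one-entry TLS page cache in front of `mincore()`. All names (`mincore_cache_t`, `t_mc_cache`, `page_is_mapped_cached`, `mincore_cache_invalidate`) are hypothetical and are not HAKMEM's actual symbols; the invalidation hook is an assumption about how staleness might be handled, not something the current code is known to do.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical one-entry cache; one instance per thread. */
typedef struct {
    uintptr_t     page;      /* page-aligned address last probed (0 = empty) */
    int           is_mapped; /* result of that probe */
    unsigned long hits;      /* lookups answered from the cache */
    unsigned long misses;    /* lookups that fell through to mincore() */
} mincore_cache_t;

static __thread mincore_cache_t t_mc_cache;

/* Returns 1 if the page containing ptr is mapped, 0 otherwise. */
static int page_is_mapped_cached(const void *ptr)
{
    static long ps;                       /* page size, looked up once */
    if (ps == 0) ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);

    uintptr_t page = (uintptr_t)ptr & ~((uintptr_t)ps - 1);

    /* Fast path: same page as the previous probe on this thread. */
    if (page != 0 && page == t_mc_cache.page) {
        t_mc_cache.hits++;
        return t_mc_cache.is_mapped;
    }

    /* Slow path: mincore() returns 0 when the range is mapped and
     * -1 (errno ENOMEM) when it contains unmapped memory. */
    unsigned char vec;
    int mapped = (mincore((void *)page, (size_t)ps, &vec) == 0);

    t_mc_cache.page = page;
    t_mc_cache.is_mapped = mapped;
    t_mc_cache.misses++;
    return mapped;
}

/* Drops the calling thread's cached entry. Assumed to be called after
 * any munmap(), since a stale "mapped" answer would defeat the SEGV fix. */
static void mincore_cache_invalidate(void)
{
    t_mc_cache.page = 0;
}
```

The hit rate is `hits / (hits + misses)`, which can be compared against the 90-95% estimate in Box 2. One caveat with this design: the cache is thread-local, so `mincore_cache_invalidate()` only clears the calling thread's entry. With ~3,200 `munmap` calls per run, a page unmapped by another thread could still be reported as mapped, so a real implementation would likely need a global epoch or similar check before trusting a hit.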