# HAKMEM Performance Profiling Index
**Date:** 2025-12-04
**Profiler:** Linux perf (6.8.12)
**Benchmarks:** bench_random_mixed_hakmem vs bench_tiny_hot_hakmem
---
## Quick Start
### TL;DR: What's the bottleneck?
**Answer:** Kernel page faults (61.7% of cycles) from on-demand mmap allocations.
**Fix:** Pre-fault SuperSlabs at startup → expected 10-15x speedup.
---
## Available Reports
### 1. PERF_SUMMARY_TABLE.txt (20KB)
**Quick reference table** with cycle breakdowns, top functions, and recommendations.
**Use when:** You need a fast overview with numbers.
```bash
cat PERF_SUMMARY_TABLE.txt
```
Key sections:
- Performance comparison table
- Cycle breakdown by layer (random_mixed vs tiny_hot)
- Top 10 functions by CPU time
- Actionable recommendations with expected gains
---
### 2. PERF_PROFILING_ANSWERS.md (16KB)
**Answers to specific questions** from the profiling request.
**Use when:** You want direct answers to:
- What % of cycles are in wrappers?
- Is unified_cache_refill being called frequently?
- Is shared_pool_acquire being called?
- Is registry lookup visible?
- Where are the 22x slowdown cycles spent?
```bash
less PERF_PROFILING_ANSWERS.md
```
Key sections:
- Q&A format (5 main questions)
- Top functions with cache/branch miss data
- Unexpected bottlenecks flagged
- Layer-by-layer optimization recommendations
---
### 3. PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md (14KB)
**Comprehensive layer-by-layer analysis** with detailed explanations.
**Use when:** You need deep understanding of:
- Why each layer contributes to the gap
- Root cause analysis (kernel page faults)
- Optimization strategies with implementation details
```bash
less PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md
```
Key sections:
- Executive summary
- Detailed cycle breakdown (random_mixed vs tiny_hot)
- Layer-by-layer analysis (6 layers)
- Performance gap analysis
- Actionable recommendations (7 priorities)
- Expected results after optimization
---
## Key Findings Summary
### Performance Gap
- **bench_tiny_hot:** 89M ops/s (baseline)
- **bench_random_mixed:** 4.1M ops/s
- **Gap:** 21.7x slower
### Root Cause: Kernel Page Faults (61.7%)
```
Random sizes (16-1040B)
  → Unified Cache miss
    → unified_cache_refill (2.3%)
      → shared_pool_acquire (3.3%)
        → SuperSlab mmap (2MB chunks)
          → 512 page faults per slab = 2MB / 4KB pages (61.7% of cycles!)
            → clear_page_erms (6.9% - kernel page zeroing)
```
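The fault count follows from demand paging: an anonymous 2MB mapping is created lazily, so the first touch of each 4KB page traps into the kernel (which allocates and zeroes the page), and 2MB / 4KB = 512 faults per slab. A minimal standalone C illustration of the pattern (not HAKMEM code):
```c
#define _GNU_SOURCE
#include <sys/mman.h>

/* Illustration only: an anonymous mmap is mapped lazily, so the first
 * write to each 4 KiB page takes a minor fault that allocates and zeroes
 * the page in the kernel. 2 MiB / 4 KiB = 512 faults per SuperSlab. */
int main(void) {
    char *slab = mmap(NULL, 2u << 20, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (slab == MAP_FAILED)
        return 1;
    for (unsigned off = 0; off < (2u << 20); off += 4096)
        slab[off] = 1;  /* each first touch is one page fault */
    return 0;
}
```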
### User-Space Hotspots (only 11% of total)
1. **Shared Pool:** 3.3% (mutex locks)
2. **Wrappers:** 3.7% (malloc/free entry)
3. **Unified Cache:** 2.3% (triggers page faults)
4. **Other:** 1.7%
### Tiny Hot (for comparison)
- **70% user-space, 30% kernel** (inverted!)
- **0.5% page faults** (vs 61.7% in random_mixed, roughly 120x less)
- Free path dominates (43%) due to safe ownership checks
---
## Top 3 Optimization Priorities
### Priority 1: Pre-fault SuperSlabs (10-15x gain)
**Problem:** 61.7% of cycles in kernel page faults
**Solution:** Pre-allocate and fault-in 2MB slabs at startup
**Expected:** 4.1M → 41M ops/s
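A minimal sketch of the pre-fault approach, assuming a Linux target; `superslab_alloc_prefaulted` and `SUPERSLAB_SIZE` are illustrative names, not HAKMEM's actual identifiers:
```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define SUPERSLAB_SIZE (2u << 20)  /* 2 MiB, per the profile above */

/* Sketch: allocate a SuperSlab with its pages faulted in up front.
 * MAP_POPULATE asks the kernel to pre-fault the mapping; the touch loop
 * is a fallback because the flag is advisory in some configurations. */
static void *superslab_alloc_prefaulted(void) {
    void *p = mmap(NULL, SUPERSLAB_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    /* First-touch every 4 KiB page so no fault remains on the hot path. */
    for (size_t off = 0; off < SUPERSLAB_SIZE; off += 4096)
        ((volatile char *)p)[off] = 0;
    return p;
}
```
This moves the 512 faults per slab from the allocation hot path to startup (or to a background refill thread), which is where the projected 10-15x comes from.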
### Priority 2: Lock-Free Shared Pool (2-4x gain)
**Problem:** 3.3% of cycles in mutex locks
**Solution:** Atomic CAS for free list
**Expected:** Contributes to 2x overall gain
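A minimal sketch of the CAS-based free list as a Treiber stack with C11 atomics; the type and function names are illustrative, and a production version must also address the ABA problem (e.g., with tagged pointers or hazard pointers):
```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative lock-free free list (Treiber stack), not HAKMEM's types. */
typedef struct free_node { struct free_node *next; } free_node;

static _Atomic(free_node *) g_pool_head;

static void pool_push(free_node *n) {
    free_node *old = atomic_load_explicit(&g_pool_head, memory_order_relaxed);
    do {
        n->next = old;  /* on CAS failure, 'old' is reloaded automatically */
    } while (!atomic_compare_exchange_weak_explicit(
        &g_pool_head, &old, n,
        memory_order_release, memory_order_relaxed));
}

static free_node *pool_pop(void) {
    free_node *old = atomic_load_explicit(&g_pool_head, memory_order_acquire);
    while (old && !atomic_compare_exchange_weak_explicit(
                      &g_pool_head, &old, old->next,
                      memory_order_acquire, memory_order_acquire))
        ;  /* retry with the reloaded head */
    return old;  /* NULL if the pool is empty */
}
```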
### Priority 3: Increase Unified Cache (2x fewer refills)
**Problem:** High miss rate → frequent refills
**Solution:** 64-128 blocks per class (currently 16-32)
**Expected:** 50% fewer refills
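Illustrative only (the real constant name and values are assumptions based on the numbers above):
```c
/* With N blocks cached per size class, one refill amortizes over N
 * allocations; going from 32 to 64 blocks halves the refill rate
 * (and hence the page-fault-triggering rate) for that class, at the
 * cost of a larger per-thread cache footprint. */
enum { UNIFIED_CACHE_BLOCKS_PER_CLASS = 128 };  /* was 16-32 per the profile */
```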
---
## Expected Performance After Optimizations
| Stage | Random Mixed | Gain | vs Tiny Hot |
|-------|-------------|------|-------------|
| Current | 4.1 M ops/s | - | 21.7x slower |
| After P1 (Pre-fault) | 35 M ops/s | 8.5x | 2.5x slower |
| After P1-2 (Lock-free) | 45 M ops/s | 11x | 2.0x slower |
| After P1-3 (Cache) | 55 M ops/s | 13x | 1.6x slower |
| **After All (P1-7)** | **60 M ops/s** | **15x** | **1.5x slower** |
**Target:** Staying within 1.5-2x of Tiny Hot is acceptable given the inherent complexity of handling varied allocation sizes.
---
## How to Reproduce
### 1. Build benchmarks
```bash
make bench_random_mixed_hakmem
make bench_tiny_hot_hakmem
```
### 2. Run without profiling (baseline)
```bash
HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_random_mixed_hakmem 1000000 256 42
HAKMEM_MODE=balanced HAKMEM_QUIET=1 ./bench_tiny_hot_hakmem 1000000
```
### 3. Profile with perf
```bash
# Random mixed
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
-o perf_random_mixed.data -- \
./bench_random_mixed_hakmem 1000000 256 42
# Tiny hot
perf record -e cycles,instructions,cache-misses,branch-misses -c 1000 -g --call-graph dwarf \
-o perf_tiny_hot.data -- \
./bench_tiny_hot_hakmem 1000000
```
### 4. Analyze results
```bash
perf report --stdio -i perf_random_mixed.data --no-children --sort symbol --percent-limit 0.5
perf report --stdio -i perf_tiny_hot.data --no-children --sort symbol --percent-limit 0.5
```
---
## File Locations
All reports are in: `/mnt/workdisk/public_share/hakmem/`
```
PERF_SUMMARY_TABLE.txt - Quick reference (20KB)
PERF_PROFILING_ANSWERS.md - Q&A format (16KB)
PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md - Detailed analysis (14KB)
PERF_INDEX.md - This file (index)
```
---
## Contact
For questions about this profiling analysis, see:
- Original request: Questions 1-7 in profiling task
- Implementation recommendations: PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md
---
**Generated by:** Linux perf + manual analysis
**Date:** 2025-12-04
**Version:** HAKMEM Phase 20+ (latest)