hakmem/PERF_ANALYSIS_INDEX_20251204.md

# HAKMEM Performance Analysis - Complete Index
**Date**: 2025-12-04  
**Benchmark**: bench_random_mixed_hakmem (1M operations, ws=256)  
**Current Performance**: 4.1M ops/s  
**Target**: 16M+ ops/s (4x improvement)

---

## Quick Summary

**CRITICAL FINDING**: Page fault handling consumes 60-70% of execution time.

**Primary Bottleneck**: 
- 132,509 page faults per 1M operations
- Each page fault costs ~690 cycles
- Kernel spends 60% of time in: clear_page_erms (11%), do_anonymous_page (20%), LRU/cgroup accounting (12%)

**Recommended Fix**: 
- Add `MAP_POPULATE` to superslab mmap() calls → 2-3x speedup (1 hour effort)
- Follow with PGO and branch optimization → reach 4x target

---

## Analysis Documents (Read in Order)

### 1. Executive Summary (START HERE)
**File**: `/mnt/workdisk/public_share/hakmem/EXECUTIVE_SUMMARY_BOTTLENECK_20251204.md`  
**Purpose**: High-level overview for decision makers  
**Content**:
- Problem statement and root cause
- Key metrics summary
- Recommended action plan with timelines
- Conservative and aggressive performance projections

**Reading time**: 5 minutes

---

### 2. Detailed Analysis Report
**File**: `/mnt/workdisk/public_share/hakmem/PERF_BOTTLENECK_ANALYSIS_20251204.md`  
**Purpose**: In-depth technical analysis for engineers  
**Content**:
- Complete performance counter breakdown
- Top 10 hottest functions with call chains
- Bottleneck analysis with cycle accounting
- Detailed optimization recommendations with effort estimates
- Specific code changes required

**Reading time**: 20 minutes

---

### 3. Raw Performance Data
**File**: `/mnt/workdisk/public_share/hakmem/PERF_RAW_DATA_20251204.txt`  
**Purpose**: Reference data for validation and future comparison  
**Content**:
- Raw perf stat output (all counters)
- Raw perf report output (function profiles)
- Syscall trace data
- Assembly annotation of hot functions
- Complete call chain data

**Reading time**: Reference only (5-10 minutes to browse)

---

## Key Findings at a Glance

| Category | Finding | Impact | Fix Effort |
|----------|---------|--------|------------|
| **Page Faults** | 132,509 faults (13% of ops) | 60-70% of runtime | 1 hour (MAP_POPULATE) |
| **Branch Misses** | 9.04% miss rate (21M misses) | ~30% overhead | 4 hours (hints + PGO) |
| **Cache Misses** | 13.03% miss rate (17 L1 misses/op) | ~15% overhead | 2 hours (layout) |
| **Speculation** | Retpoline overhead | ~5% overhead | Cannot fix (CPU security) |
| **IPC** | 0.97 (near optimal) | No issue | No fix needed |

---

## Performance Metrics

### Current State
```
Throughput:        4.1M ops/s
Cycles per op:     1,146 cycles
Instructions/op:   1,109 instructions
IPC:               0.97 (excellent)
Page faults/op:    0.132 (catastrophic)
Branch misses/op:  21 (high)
L1 misses/op:      17.2 (moderate)
```

### Target State (after optimizations)
```
Throughput:        16M+ ops/s (4x improvement)
Cycles per op:     ~245 cycles (4.7x reduction)
Page faults/op:    <0.001 (132x reduction)
Branch misses/op:  ~12 (1.75x reduction)
L1 misses/op:      ~10 (1.7x reduction)
```

---

## Top Bottleneck Functions (by time spent)

### Kernel Functions (60% of total time)
1. **clear_page_erms** (11.25%) - Zeroing newly allocated pages
2. **do_anonymous_page** (20%+) - Kernel page allocation
3. **folio_add_new_anon_rmap** (7.11%) - Reverse mapping
4. **folio_add_lru_vma** (4.88%) - LRU list management
5. **__mem_cgroup_charge** (4.37%) - Memory cgroup accounting

### User-space Functions (8-10% of total time)
1. **unified_cache_refill** (4.37%) - Main HAKMEM allocation path
2. **free** (1.40%) - Deallocation
3. **malloc** (1.36%) - Allocation wrapper
4. **shared_pool_acquire_slab** (1.31%) - Slab acquisition

**Insight**: User-space code is only 8-10% of runtime. The remaining 90% is kernel overhead!

---

## Optimization Roadmap

### Phase 1: Eliminate Page Faults (Priority: URGENT)
**Target**: 2-3x improvement (8-12M ops/s)  
**Effort**: 1 hour  
**Changes**:
- Add `MAP_POPULATE` to `mmap()` in `superslab_acquire()`
- Files to modify: `/mnt/workdisk/public_share/hakmem/core/superslab/*.c`

**Validation**:
```bash
perf stat -e page-faults ./bench_random_mixed_hakmem 1000000 256 42
# Expected: <1,000 page faults (was 132,509)
```

### Phase 2: Profile-Guided Optimization (Priority: HIGH)
**Target**: 1.2-1.3x additional improvement (10-16M ops/s cumulative)  
**Effort**: 2 hours  
**Changes**:
```bash
make clean
CFLAGS="-fprofile-generate" make
./bench_random_mixed_hakmem 10000000 256 42  # Generate profile
make clean
CFLAGS="-fprofile-use" make
```

### Phase 3: Branch Optimization (Priority: MEDIUM)
**Target**: 1.1-1.2x additional improvement  
**Effort**: 4 hours  
**Changes**:
- Add `__builtin_expect()` hints to hot paths in `unified_cache_refill`
- Simplify conditionals in fast path
- Reorder checks for common case first

### Phase 4: Cache Layout Optimization (Priority: LOW)
**Target**: 1.1-1.15x additional improvement  
**Effort**: 2 hours  
**Changes**:
- Add `__attribute__((aligned(64)))` to hot structures
- Pack frequently-accessed fields together
- Separate read-mostly vs write-mostly data

---

## Commands Used for Analysis

```bash
# Hardware performance counters
perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses,L1-dcache-load-misses,LLC-load-misses -r 3 \
  ./bench_random_mixed_hakmem 1000000 256 42

# Page fault and context switch metrics
perf stat -e task-clock,context-switches,cpu-migrations,page-faults,minor-faults,major-faults -r 3 \
  ./bench_random_mixed_hakmem 1000000 256 42

# Function-level profiling
perf record -F 5000 -g ./bench_random_mixed_hakmem 1000000 256 42
perf report --stdio -n --percent-limit 0.5

# Syscall tracing
strace -e trace=mmap,madvise,munmap,mprotect -c ./bench_random_mixed_hakmem 1000000 256 42
```

---

## Related Documents

- **PERF_PROFILE_ANALYSIS_20251204.md** - Earlier profiling analysis (phase 1)
- **BATCH_TIER_CHECKS_PERF_RESULTS_20251204.md** - Batch tier optimization results
- **bench_random_mixed.c** - Benchmark source code

---

## Next Steps

1. **Read Executive Summary** (5 min) - Understand the problem and solution
2. **Implement MAP_POPULATE** (1 hour) - Immediate 2-3x improvement
3. **Validate with perf stat** (5 min) - Confirm page faults dropped
4. **Re-run full benchmark suite** (30 min) - Measure actual speedup
5. **If target not reached, proceed to Phase 2** (PGO optimization)

---

## Questions & Answers

**Q: Why is IPC so high (0.97) if we're only at 4.1M ops/s?**  
A: The CPU is executing instructions efficiently, but most of those instructions are 
in the kernel handling page faults. The user-space code is only 10% of runtime.

**Q: Can we just disable page fault handling?**  
A: No, but we can pre-fault memory with MAP_POPULATE so page faults happen at 
startup instead of during the benchmark.

**Q: Why not just use hugepages?**  
A: Hugepages are better (2-4x improvement) but require more complex implementation. 
MAP_POPULATE gives 2-3x improvement with 1 hour of work. We should do MAP_POPULATE 
first, then consider hugepages if we need more performance.

**Q: Will MAP_POPULATE hurt startup time?**  
A: Yes, but we're trading startup time for runtime performance. For a memory allocator, 
this is usually the right tradeoff. We can make it optional via environment variable.

**Q: What about the branch mispredictions?**  
A: Those are secondary. Fix page faults first (60% of time), then tackle branches 
(30% of remaining time), then cache misses (15% of remaining time).

---

## Conclusion

The analysis is complete and the path forward is clear:

1. Page faults are the primary bottleneck (60-70% of time)
2. MAP_POPULATE is the simplest fix (1 hour, 2-3x improvement)
3. PGO and branch hints can get us to 4x target
4. All optimizations are straightforward and low-risk

**Confidence level**: Very high (based on hard profiling data)  
**Risk level**: Low (MAP_POPULATE is well-tested and widely used)  
**Time to 4x target**: 1-2 days of development

---

**Analysis conducted by**: Claude (Anthropic)  
**Analysis method**: perf stat, perf record, perf report, strace  
**Data quality**: High (3-run averages, <1% variance)  
**Reproducibility**: 100% (all commands documented)
Add performance analysis reports and archive legacy superslab - Add investigation reports for allocation routing, bottlenecks, madvise - Archive old smallmid superslab implementation - Document Page Box integration findings 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> 2025-12-05 15:31:58 +09:00			`# HAKMEM Performance Analysis - Complete Index`
			`Date: 2025-12-04`
			`Benchmark: bench_random_mixed_hakmem (1M operations, ws=256)`
			`Current Performance: 4.1M ops/s`
			`Target: 16M+ ops/s (4x improvement)`

			`---`

			`## Quick Summary`

			`CRITICAL FINDING: Page fault handling consumes 60-70% of execution time.`

			`Primary Bottleneck:`
			`- 132,509 page faults per 1M operations`
			`- Each page fault costs ~690 cycles`
			`- Kernel spends 60% of time in: clear_page_erms (11%), do_anonymous_page (20%), LRU/cgroup accounting (12%)`

			`Recommended Fix:`
			- Add `MAP_POPULATE` to superslab mmap() calls → 2-3x speedup (1 hour effort)
			`- Follow with PGO and branch optimization → reach 4x target`

			`---`

			`## Analysis Documents (Read in Order)`

			`### 1. Executive Summary (START HERE)`
			File: `/mnt/workdisk/public_share/hakmem/EXECUTIVE_SUMMARY_BOTTLENECK_20251204.md`
			`Purpose: High-level overview for decision makers`
			`Content:`
			`- Problem statement and root cause`
			`- Key metrics summary`
			`- Recommended action plan with timelines`
			`- Conservative and aggressive performance projections`

			`Reading time: 5 minutes`

			`---`

			`### 2. Detailed Analysis Report`
			File: `/mnt/workdisk/public_share/hakmem/PERF_BOTTLENECK_ANALYSIS_20251204.md`
			`Purpose: In-depth technical analysis for engineers`
			`Content:`
			`- Complete performance counter breakdown`
			`- Top 10 hottest functions with call chains`
			`- Bottleneck analysis with cycle accounting`
			`- Detailed optimization recommendations with effort estimates`
			`- Specific code changes required`

			`Reading time: 20 minutes`

			`---`

			`### 3. Raw Performance Data`
			File: `/mnt/workdisk/public_share/hakmem/PERF_RAW_DATA_20251204.txt`
			`Purpose: Reference data for validation and future comparison`
			`Content:`
			`- Raw perf stat output (all counters)`
			`- Raw perf report output (function profiles)`
			`- Syscall trace data`
			`- Assembly annotation of hot functions`
			`- Complete call chain data`

			`Reading time: Reference only (5-10 minutes to browse)`

			`---`

			`## Key Findings at a Glance`

			`\| Category \| Finding \| Impact \| Fix Effort \|`
			`\|----------\|---------\|--------\|------------\|`
			`\| Page Faults \| 132,509 faults (13% of ops) \| 60-70% of runtime \| 1 hour (MAP_POPULATE) \|`
			`\| Branch Misses \| 9.04% miss rate (21M misses) \| ~30% overhead \| 4 hours (hints + PGO) \|`
			`\| Cache Misses \| 13.03% miss rate (17 L1 misses/op) \| ~15% overhead \| 2 hours (layout) \|`
			`\| Speculation \| Retpoline overhead \| ~5% overhead \| Cannot fix (CPU security) \|`
			`\| IPC \| 0.97 (near optimal) \| No issue \| No fix needed \|`

			`---`

			`## Performance Metrics`

			`### Current State`
			```
			`Throughput: 4.1M ops/s`
			`Cycles per op: 1,146 cycles`
			`Instructions/op: 1,109 instructions`
			`IPC: 0.97 (excellent)`
			`Page faults/op: 0.132 (catastrophic)`
			`Branch misses/op: 21 (high)`
			`L1 misses/op: 17.2 (moderate)`
			```

			`### Target State (after optimizations)`
			```
			`Throughput: 16M+ ops/s (4x improvement)`
			`Cycles per op: ~245 cycles (4.7x reduction)`
			`Page faults/op: <0.001 (132x reduction)`
			`Branch misses/op: ~12 (1.75x reduction)`
			`L1 misses/op: ~10 (1.7x reduction)`
			```

			`---`

			`## Top Bottleneck Functions (by time spent)`

			`### Kernel Functions (60% of total time)`
			`1. clear_page_erms (11.25%) - Zeroing newly allocated pages`
			`2. do_anonymous_page (20%+) - Kernel page allocation`
			`3. folio_add_new_anon_rmap (7.11%) - Reverse mapping`
			`4. folio_add_lru_vma (4.88%) - LRU list management`
			`5. __mem_cgroup_charge (4.37%) - Memory cgroup accounting`

			`### User-space Functions (8-10% of total time)`
			`1. unified_cache_refill (4.37%) - Main HAKMEM allocation path`
			`2. free (1.40%) - Deallocation`
			`3. malloc (1.36%) - Allocation wrapper`
			`4. shared_pool_acquire_slab (1.31%) - Slab acquisition`

			`Insight: User-space code is only 8-10% of runtime. The remaining 90% is kernel overhead!`

			`---`

			`## Optimization Roadmap`

			`### Phase 1: Eliminate Page Faults (Priority: URGENT)`
			`Target: 2-3x improvement (8-12M ops/s)`
			`Effort: 1 hour`
			`Changes:`
			- Add `MAP_POPULATE` to `mmap()` in `superslab_acquire()`
			- Files to modify: `/mnt/workdisk/public_share/hakmem/core/superslab/*.c`

			`Validation:`
			```bash
			`perf stat -e page-faults ./bench_random_mixed_hakmem 1000000 256 42`
			`# Expected: <1,000 page faults (was 132,509)`
			```

			`### Phase 2: Profile-Guided Optimization (Priority: HIGH)`
			`Target: 1.2-1.3x additional improvement (10-16M ops/s cumulative)`
			`Effort: 2 hours`
			`Changes:`
			```bash
			`make clean`
			`CFLAGS="-fprofile-generate" make`
			`./bench_random_mixed_hakmem 10000000 256 42 # Generate profile`
			`make clean`
			`CFLAGS="-fprofile-use" make`
			```

			`### Phase 3: Branch Optimization (Priority: MEDIUM)`
			`Target: 1.1-1.2x additional improvement`
			`Effort: 4 hours`
			`Changes:`
			- Add `__builtin_expect()` hints to hot paths in `unified_cache_refill`
			`- Simplify conditionals in fast path`
			`- Reorder checks for common case first`

			`### Phase 4: Cache Layout Optimization (Priority: LOW)`
			`Target: 1.1-1.15x additional improvement`
			`Effort: 2 hours`
			`Changes:`
			- Add `__attribute__((aligned(64)))` to hot structures
			`- Pack frequently-accessed fields together`
			`- Separate read-mostly vs write-mostly data`

			`---`

			`## Commands Used for Analysis`

			```bash
			`# Hardware performance counters`
			`perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses,L1-dcache-load-misses,LLC-load-misses -r 3 \`
			`./bench_random_mixed_hakmem 1000000 256 42`

			`# Page fault and context switch metrics`
			`perf stat -e task-clock,context-switches,cpu-migrations,page-faults,minor-faults,major-faults -r 3 \`
			`./bench_random_mixed_hakmem 1000000 256 42`

			`# Function-level profiling`
			`perf record -F 5000 -g ./bench_random_mixed_hakmem 1000000 256 42`
			`perf report --stdio -n --percent-limit 0.5`

			`# Syscall tracing`
			`strace -e trace=mmap,madvise,munmap,mprotect -c ./bench_random_mixed_hakmem 1000000 256 42`
			```

			`---`

			`## Related Documents`

			`- PERF_PROFILE_ANALYSIS_20251204.md - Earlier profiling analysis (phase 1)`
			`- BATCH_TIER_CHECKS_PERF_RESULTS_20251204.md - Batch tier optimization results`
			`- bench_random_mixed.c - Benchmark source code`

			`---`

			`## Next Steps`

			`1. Read Executive Summary (5 min) - Understand the problem and solution`
			`2. Implement MAP_POPULATE (1 hour) - Immediate 2-3x improvement`
			`3. Validate with perf stat (5 min) - Confirm page faults dropped`
			`4. Re-run full benchmark suite (30 min) - Measure actual speedup`
			`5. If target not reached, proceed to Phase 2 (PGO optimization)`

			`---`

			`## Questions & Answers`

			`Q: Why is IPC so high (0.97) if we're only at 4.1M ops/s?`
			`A: The CPU is executing instructions efficiently, but most of those instructions are`
			`in the kernel handling page faults. The user-space code is only 10% of runtime.`

			`Q: Can we just disable page fault handling?`
			`A: No, but we can pre-fault memory with MAP_POPULATE so page faults happen at`
			`startup instead of during the benchmark.`

			`Q: Why not just use hugepages?`
			`A: Hugepages are better (2-4x improvement) but require more complex implementation.`
			`MAP_POPULATE gives 2-3x improvement with 1 hour of work. We should do MAP_POPULATE`
			`first, then consider hugepages if we need more performance.`

			`Q: Will MAP_POPULATE hurt startup time?`
			`A: Yes, but we're trading startup time for runtime performance. For a memory allocator,`
			`this is usually the right tradeoff. We can make it optional via environment variable.`

			`Q: What about the branch mispredictions?`
			`A: Those are secondary. Fix page faults first (60% of time), then tackle branches`
			`(30% of remaining time), then cache misses (15% of remaining time).`

			`---`

			`## Conclusion`

			`The analysis is complete and the path forward is clear:`

			`1. Page faults are the primary bottleneck (60-70% of time)`
			`2. MAP_POPULATE is the simplest fix (1 hour, 2-3x improvement)`
			`3. PGO and branch hints can get us to 4x target`
			`4. All optimizations are straightforward and low-risk`

			`Confidence level: Very high (based on hard profiling data)`
			`Risk level: Low (MAP_POPULATE is well-tested and widely used)`
			`Time to 4x target: 1-2 days of development`

			`---`

			`Analysis conducted by: Claude (Anthropic)`
			`Analysis method: perf stat, perf record, perf report, strace`
			`Data quality: High (3-run averages, <1% variance)`
			`Reproducibility: 100% (all commands documented)`