
# HAKMEM Allocator - Phase 2 Performance Analysis

## Quick Summary

| Metric | Phase 1 | Phase 2 | Change |
|--------|---------|---------|--------|
| **Throughput** | 72M ops/s | 79.8M ops/s | **+10.8%** ✓ |
| Cycles | 78.6M | 72.2M | -8.1% ✓ |
| Instructions | 167M | 153M | -8.4% ✓ |
| Branches | 36M | 23M | **-36%** ✓ |
| Branch Misses | 921K (2.56%) | 1.02M (4.43%) | +11% count, +73% rate ✗ |
| L3 Cache Misses | 173K (9.28%) | 216K (10.28%) | +25% ✗ |
| dTLB Misses | N/A | 41 (0.01%) | **Excellent!** ✓ |
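
The Change column mixes two kinds of figures: relative changes in raw counts and relative changes in rates. The following is a minimal sketch of that arithmetic using only the numbers in the table; the helper names are illustrative and are not part of HAKMEM.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

/* Event rate in percent, e.g. branch misses per executed branch. */
static double rate_pct(double events, double total) {
    assert(total != 0.0);
    return events / total * 100.0;
}

/* Instructions retired per cycle. */
static double ipc(double instructions, double cycles) {
    assert(cycles != 0.0);
    return instructions / cycles;
}

/*
 * With the table's values:
 *   pct_change(72.0, 79.8)   is about +10.8  (throughput)
 *   rate_pct(921e3, 36e6)    is about  2.56  (Phase 1 branch miss rate)
 *   rate_pct(1.02e6, 23e6)   is about  4.43  (Phase 2 branch miss rate)
 *   pct_change(921e3, 1.02e6) is about +10.7 (miss count)
 *   pct_change(2.56, 4.43)   is about +73    (miss rate)
 *   ipc(167e6, 78.6e6) and ipc(153e6, 72.2e6) are both about 2.12
 */
```

Read this way, the miss count rose by roughly 11% while the branch count fell by 36%, which is what pushes the miss rate up by 73%. IPC is essentially unchanged, so the throughput gain comes from executing fewer instructions rather than from executing them faster.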

## Top 5 Hotspots (Phase 2, 628 samples)

1. **malloc()** - 36.51% CPU time
   - Function overhead (prologue/epilogue): ~18%
   - Lock operations: 5.05%
   - Initialization checks: ~15%

2. **main()** - 30.51% CPU time
   - Benchmark loop overhead (not allocator)

3. **free()** - 19.66% CPU time
   - Lock operations: 3.29%
   - Cached variable checks: ~15%
   - Function overhead: ~10%

4. **clear_page_erms (kernel)** - 9.31% CPU time
   - Page fault handling

5. **irqentry_exit_to_user_mode (kernel)** - 5.33% CPU time
   - Kernel exit overhead

## Phase 3 Optimization Targets (Ranked by Impact)

### 🔥 Priority 1: Fast-Path Inlining (Expected: +5-8%)
**Target**: Reduce malloc/free from 56% → ~33% CPU time
- Inline hot paths to eliminate function call overhead
- Remove stats counters from production builds
- Cache initialization state in TLS
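
As a rough illustration of the inlining idea, the sketch below keeps a tiny always-inlined fast path and pushes everything else into a non-inlined slow path. It assumes GCC or Clang attributes, and `free_node`, `tls_free_head`, `alloc_fast`, `alloc_slow`, and `free_fast` are hypothetical names, not HAKMEM's actual structures or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-thread free list for a single size class. */
typedef struct free_node { struct free_node *next; } free_node;

static _Thread_local free_node *tls_free_head;

/* Slow path kept out of line so the inlined fast path stays small. */
__attribute__((noinline))
static void *alloc_slow(size_t size) {
    /* Stand-in for a refill from a SuperSlab; this sketch just asks libc. */
    return malloc(size < sizeof(free_node) ? sizeof(free_node) : size);
}

/* Fast path: one TLS load, one predictable branch, one TLS store. */
static inline __attribute__((always_inline))
void *alloc_fast(size_t size) {
    free_node *n = tls_free_head;
    if (__builtin_expect(n != NULL, 1)) {
        tls_free_head = n->next;
        return n;
    }
    return alloc_slow(size);
}

/* Free pushes the block back onto the per-thread list. */
static inline __attribute__((always_inline))
void free_fast(void *p) {
    assert(p != NULL);
    free_node *n = p;
    n->next = tls_free_head;
    tls_free_head = n;
}
```

Whether this recovers the prologue/epilogue cost seen in the profile depends on how the exported `malloc`/`free` symbols wrap the inlined helpers, so the expected gain should be confirmed with a fresh `perf annotate` run.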

### 🔥 Priority 2: Branch Optimization (Expected: +3-5%)
**Target**: Reduce branch misses from 1.02M → <700K
- Apply Profile-Guided Optimization (PGO)
- Add LIKELY/UNLIKELY hints
- Reduce branches in fast path from ~15 to 5-7
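
One way to read the last two bullets together is sketched below: `LIKELY`/`UNLIKELY` wrap the GCC/Clang builtin `__builtin_expect`, and a chain of size comparisons is replaced with arithmetic so the fast path has a single range check. `TINY_MAX`, the 16-byte class spacing, and the function names are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

#define TINY_MAX 256u  /* hypothetical largest size served by the fast path */

static_assert(TINY_MAX % 16u == 0, "classes are spaced 16 bytes apart");

/* Branch-free class index: sizes 1..16 map to 0, 17..32 map to 1, and so on. */
static inline unsigned size_class(size_t size) {
    return (unsigned)((size + 15u) >> 4) - 1u;
}

/* Returns the size class, or -1 when the request must take the slow path. */
static inline int classify(size_t size) {
    /* size - 1 wraps to SIZE_MAX for size == 0, so one unsigned compare
     * rejects both zero and oversized requests. */
    if (UNLIKELY(size - 1u >= TINY_MAX))
        return -1;
    return (int)size_class(size);
}
```

The hints only affect code layout and static prediction; PGO would normally supply the same information from measured data, so the two are alternatives more than they are additive.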

### 🔥 Priority 3: Cache Optimization (Expected: +2-4%)
**Target**: Reduce L3 misses from 216K → <180K
- Align hot structures to cache lines
- Add prefetching in allocation path
- Compact metadata structures
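
The alignment and prefetching bullets could look roughly like the following sketch. It assumes a 64-byte cache line and GCC/Clang's `__builtin_prefetch`; `tiny_class_hot`, `block`, and `pop_with_prefetch` are hypothetical and do not reflect HAKMEM's real metadata layout.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size; typical for x86-64 */

typedef struct block { struct block *next; } block;

/* Hypothetical hot metadata for one size class, padded to exactly one
 * cache line so neighbouring classes never share a line. */
typedef struct {
    alignas(CACHE_LINE) block *free_head;
    unsigned count;
    unsigned capacity;
} tiny_class_hot;

static_assert(sizeof(tiny_class_hot) == CACHE_LINE, "one line per class");
static_assert(alignof(tiny_class_hot) == CACHE_LINE, "line-aligned");

/* Pop the head block and start loading the next one, which the following
 * allocation will dereference. */
static inline block *pop_with_prefetch(tiny_class_hot *c) {
    block *b = c->free_head;
    if (b != NULL) {
        c->free_head = b->next;
        c->count--;
        if (b->next != NULL)
            __builtin_prefetch(b->next, 0, 3);  /* read, keep in all levels */
    }
    return b;
}
```

Prefetching helps only when the next block is not already cached, so it is worth checking the L3 miss count before and after rather than assuming a gain.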

### 🎯 Priority 4: Remove Init Overhead (Expected: +2-3%)
- Cache g_initialized/g_enable checks in TLS
- Use constructor attributes more aggressively
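
A minimal sketch of both bullets, assuming GCC/Clang's `constructor` attribute: initialization runs before `main()`, and each thread folds the two global checks into a single cached flag. `g_initialized` and `g_enable` are the names used above, but their definitions here, and `t_ready`, `hak_init_ctor`, and `hak_ready`, are stand-ins.

```c
#include <assert.h>

#define LIKELY(x) __builtin_expect(!!(x), 1)

/* Stand-ins for the globals named in the text. */
static int g_initialized;
static int g_enable;

/* Per-thread cached answer: 0 = not checked yet, 1 = allocator usable. */
static _Thread_local int t_ready;

/* Runs before main(). Priorities 0-100 are reserved for the implementation,
 * so 101 is the earliest slot available to user code. */
__attribute__((constructor(101)))
static void hak_init_ctor(void) {
    g_enable = 1;       /* stand-in for real configuration parsing */
    g_initialized = 1;
}

/* Cold path: handles a call that arrives before the constructor has run. */
__attribute__((noinline))
static int hak_ready_slow(void) {
    if (!g_initialized)
        hak_init_ctor();
    t_ready = g_enable;
    return t_ready;
}

/* Hot path: one TLS load and one branch instead of two global checks. */
static inline int hak_ready(void) {
    if (LIKELY(t_ready))
        return 1;
    return hak_ready_slow();
}
```

The slow path cannot be dropped entirely, because `malloc` may be called from other constructors or from the dynamic loader before `hak_init_ctor` runs.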

### 🎯 Priority 5: Reduce Lock Contention (Expected: +1-2%)
- Move stats to TLS, aggregate periodically
- Eliminate atomic ops from fast path
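
Moving stats to TLS with periodic aggregation might be sketched as below. The flush interval and all names (`FLUSH_EVERY`, `g_alloc_count`, `t_alloc_pending`, `stat_alloc`, `stat_flush`) are assumptions; the point is that the fast path touches only a thread-local counter and the shared atomic is updated once per batch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FLUSH_EVERY 1024u  /* hypothetical batch size */

static_assert(FLUSH_EVERY > 0, "batch size must be positive");

static _Atomic uint64_t g_alloc_count;          /* shared, rarely touched */
static _Thread_local uint32_t t_alloc_pending;  /* private, touched per op */

/* Fast path: a plain thread-local increment; one relaxed atomic add
 * every FLUSH_EVERY calls. */
static inline void stat_alloc(void) {
    if (++t_alloc_pending == FLUSH_EVERY) {
        atomic_fetch_add_explicit(&g_alloc_count, FLUSH_EVERY,
                                  memory_order_relaxed);
        t_alloc_pending = 0;
    }
}

/* Call at thread exit or before reporting so no counts are lost. */
static void stat_flush(void) {
    atomic_fetch_add_explicit(&g_alloc_count, t_alloc_pending,
                              memory_order_relaxed);
    t_alloc_pending = 0;
}

/* Total flushed so far; pending counts in other threads are not included. */
static uint64_t stat_total(void) {
    return atomic_load_explicit(&g_alloc_count, memory_order_relaxed);
}
```

The reported total lags by up to `FLUSH_EVERY - 1` operations per thread, which is usually acceptable for statistics but not for anything the allocator relies on for correctness.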

### 🎯 Priority 6: Optimize TLS Operations (Expected: +1-2%)
- Reduce TLS reads/writes from ~10 to ~4 per operation
- Cache TLS values in registers
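
One common way to cut TLS traffic is to group the per-thread fields into one struct and resolve its address once per operation, so the remaining accesses are ordinary loads and stores through a pointer the compiler can keep in a register. The sketch below assumes that approach; `tls_cache`, `t_cache`, `NUM_CLASSES`, `push_block`, and `pop_block` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CLASSES 8u  /* hypothetical number of size classes */

/* All per-class thread-local state lives in one struct. */
typedef struct {
    void    *head;   /* singly linked free list; next pointer stored in block */
    uint32_t count;  /* blocks currently cached */
    uint32_t hits;   /* successful pops */
} tls_cache;

static _Thread_local tls_cache t_cache[NUM_CLASSES];

static inline void push_block(unsigned cls, void *p) {
    assert(cls < NUM_CLASSES && p != NULL);
    tls_cache *c = &t_cache[cls];  /* single TLS address computation */
    *(void **)p = c->head;
    c->head = p;
    c->count++;
}

static inline void *pop_block(unsigned cls) {
    assert(cls < NUM_CLASSES);
    tls_cache *c = &t_cache[cls];  /* single TLS address computation */
    void *p = c->head;
    if (p != NULL) {
        c->head = *(void **)p;
        c->count--;
        c->hits++;
    }
    return p;
}
```

How many TLS accesses this actually saves depends on the TLS model in use (for example, initial-exec versus global-dynamic in a shared library), so the ~10 to ~4 target should be checked against the generated assembly.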

## Expected Phase 3 Results

**Target Throughput**: 87-95M ops/s (+9-19% improvement)

| Metric | Phase 2 | Phase 3 Target | Change |
|--------|---------|----------------|--------|
| Throughput | 79.8M ops/s | 87-95M ops/s | +9-19% |
| malloc CPU | 36.51% | ~22% | -40% |
| free CPU | 19.66% | ~11% | -44% |
| Branch miss rate | 4.43% | <3% | -32% |
| L3 cache miss rate | 10.28% | <8% | -22% |

## Key Insights

### ✅ What Worked in Phase 2
1. **SuperSlab size increase** (64KB → 512KB): Dramatically reduced branches (-36%)
2. **Amortized initialization**: memset overhead dropped from 6.41% → 1.77%
3. **Virtual memory optimization**: dTLB miss rate is excellent (0.01%)

### ❌ What Needs Work
1. **Branch prediction**: Miss rate rose 73% (2.56% → 4.43%) despite 36% fewer branches
2. **Cache pressure**: Larger SuperSlabs increased L3 misses
3. **Function overhead**: malloc/free dominate CPU time (56%)

### 🤔 Surprising Findings
1. **Cross-calling pattern**: malloc/free call each other 8-12% of the time
   - Thread-local cache flushing
   - Deferred release operations
   - May benefit from batching

2. **Kernel overhead increased**: clear_page_erms went from 2.23% → 9.31%
   - May need page pre-faulting strategy

3. **Main loop visible**: 30.51% CPU time
   - Benchmark overhead, not allocator
   - Real allocator overhead is ~56% (malloc + free)
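
For the page pre-faulting idea in finding 2, one option on Linux is to request populated pages at mapping time with `MAP_POPULATE`. The sketch below assumes SuperSlabs are obtained with anonymous `mmap` and uses the 512KB size mentioned above; `superslab_map` and `superslab_unmap` are hypothetical helpers, not HAKMEM functions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* exposes MAP_ANONYMOUS and MAP_POPULATE on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define SUPERSLAB_SIZE (512u * 1024u)  /* 512KB, as in Phase 2 */

/* Map one SuperSlab. With prefault != 0 the kernel populates the pages
 * during mmap() instead of on first touch (Linux-specific flag). */
static void *superslab_map(int prefault) {
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    if (prefault)
        flags |= MAP_POPULATE;
    void *p = mmap(NULL, SUPERSLAB_SIZE, PROT_READ | PROT_WRITE,
                   flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Returns 0 on success, -1 on failure, like munmap(). */
static int superslab_unmap(void *p) {
    assert(p != NULL);
    return munmap(p, SUPERSLAB_SIZE);
}
```

Pre-faulting does not remove the page-zeroing work that `clear_page_erms` represents; it moves that work out of the allocation fast path and into the mapping call, at the cost of higher resident memory for pages that may never be used.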

## Files Generated

- `perf_phase2_stats.txt` - perf stat -d output
- `perf_phase2_symbols.txt` - Symbol-level hotspots
- `perf_phase2_callgraph.txt` - Call graph analysis
- `perf_phase2_detailed.txt` - Detailed counter breakdown
- `perf_malloc_annotate.txt` - Assembly annotation for malloc()
- `perf_free_annotate.txt` - Assembly annotation for free()
- `perf_analysis_summary.txt` - Detailed comparison with Phase 1
- `phase3_recommendations.txt` - Complete optimization roadmap

## How to Use This Data

### For Quick Reference
```bash
cat perf_phase2_stats.txt        # See overall metrics
cat perf_phase2_symbols.txt      # See top functions
```

### For Deep Analysis
```bash
cat perf_malloc_annotate.txt     # See assembly-level hotspots in malloc
cat perf_free_annotate.txt       # See assembly-level hotspots in free
cat perf_analysis_summary.txt    # See Phase 1 vs Phase 2 comparison
```

### For Planning Phase 3
```bash
cat phase3_recommendations.txt   # See ranked optimization opportunities
```

### To Re-run Analysis
```bash
# Quick stat
perf stat -d ./bench_random_mixed_hakmem 1000000 256 42

# Detailed profiling
perf record -F 9999 -g ./bench_random_mixed_hakmem 5000000 256 42
perf report --stdio --no-children --sort symbol
```

## Next Steps

1. **Week 1**: Implement fast-path inlining + remove stats locks (Expected: +8-10%)
2. **Week 2**: Apply PGO + branch hints (Expected: +3-5%)
3. **Week 3**: Cache line alignment + prefetching (Expected: +2-4%)
4. **Week 4**: TLS optimization + polish (Expected: +1-3%)

**Total Expected** (sum of the weekly estimates): +14-22% improvement → **91-97M ops/s**, above the 87-95M ops/s (+9-19%) target given under Expected Phase 3 Results
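
The total is obtained by adding the four weekly ranges. The sketch below reproduces that sum and, for comparison, shows what the same ranges give if the gains multiply instead; the helper names are illustrative, and the multiplicative figure is a derived comparison, not a number from the measurements.

```c
#include <assert.h>

/* Sum of independent percentage gains (the method used for the total). */
static double additive(const double *gains_pct, int n) {
    assert(n > 0);
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += gains_pct[i];
    return sum;
}

/* Combined gain if each step multiplies the previous throughput. */
static double compounded(const double *gains_pct, int n) {
    assert(n > 0);
    double factor = 1.0;
    for (int i = 0; i < n; i++)
        factor *= 1.0 + gains_pct[i] / 100.0;
    return (factor - 1.0) * 100.0;
}

/* Throughput after applying a percentage gain to a baseline. */
static double project(double base_mops, double gain_pct) {
    return base_mops * (1.0 + gain_pct / 100.0);
}

/*
 * Weekly low estimates  {8, 3, 2, 1}:  additive 14, compounded about 14.6
 * Weekly high estimates {10, 5, 4, 3}: additive 22, compounded about 23.7
 * project(79.8, 14) is about 91.0 and project(79.8, 22) is about 97.4.
 */
```

Either way, the estimates assume the optimizations do not overlap; in practice inlining, PGO, and branch hints target some of the same cycles, so the combined gain is likely to land below the sum.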

---

Generated: 2025-11-28
Phase: 2 → 3 transition
Baseline: 72M ops/s → Current: 79.8M ops/s → Target: 87-95M ops/s