hakmem/docs/analysis/PERF_ANALYSIS_INDEX.md

# HAKMEM Tiny Pool - Performance Analysis Index

**Date**: 2025-10-26
**Session**: Post-getenv Fix Analysis
**Status**: Analysis Complete - Optimization Recommended

---

## Quick Navigation

### For Immediate Action
- **[OPTIMIZATION_NEXT_STEPS.md](./OPTIMIZATION_NEXT_STEPS.md)** - Implementation guide for next optimization
- **[PERF_SUMMARY.txt](./PERF_SUMMARY.txt)** - One-page executive summary

### For Detailed Review
- **[PERF_POST_GETENV_ANALYSIS.md](./PERF_POST_GETENV_ANALYSIS.md)** - Complete analysis with Q&A
- **[BOTTLENECK_COMPARISON.txt](./BOTTLENECK_COMPARISON.txt)** - Before/after comparison

### Raw Performance Data
- `perf_post_getenv.data` - Perf recording (1 GB)
- `perf_post_getenv_report.txt` - Top functions report
- `perf_post_getenv_annotate.txt` - Annotated assembly

---

## Executive Summary

### Achievement
- **Eliminated getenv bottleneck**: 43.96% CPU → 0%
- **Performance improvement**: +86% to +173% (60 → 120-164 M ops/sec)
- **Now FASTER than glibc**: +15% to +57%

### Current Status
- **New #1 Bottleneck**: hak_tiny_alloc (22.75% CPU)
- **Verdict**: Worth optimizing (2.27x the 10% threshold)
- **Next Target**: Reduce hak_tiny_alloc to ~10% CPU

### Recommendation
**OPTIMIZE NEXT BOTTLENECK** - Clear path to 180-250 M ops/sec (+70% to +140% vs glibc)

---

## File Descriptions

### Analysis Documents

#### PERF_POST_GETENV_ANALYSIS.md (11 KB)
**Purpose**: Comprehensive post-getenv performance analysis
**Contains**:
- Q1: NEW #1 Bottleneck identification (hak_tiny_alloc 22.75%)
- Q2: Top 5 hotspots ranking
- Q3: Optimization worthiness assessment
- Q4: Root cause analysis and proposed fixes
- Before/after comparison table
- Final recommendation with justification

**Key Finding**: hak_tiny_alloc at 22.75% is 2.27x the 10% threshold → Optimize!

#### OPTIMIZATION_NEXT_STEPS.md (7 KB)
**Purpose**: Actionable implementation guide
**Contains**:
- Root cause breakdown from perf annotate
- 4-phase optimization strategy (prioritized)
- Implementation plan with time estimates
- Success criteria and validation commands
- Risk assessment
- Code examples and snippets

**Start Here**: If you're ready to implement optimizations

#### PERF_SUMMARY.txt (2.6 KB)
**Purpose**: Quick reference card
**Contains**:
- Performance journey (4 phases)
- Optimization roadmap
- Key metrics comparison
- Next steps recommendation

**Use Case**: Quick briefing or status check

#### BOTTLENECK_COMPARISON.txt (4.4 KB)
**Purpose**: Side-by-side before/after analysis
**Contains**:
- Top 10 CPU consumers comparison
- Critical observations (4 key insights)
- Performance trajectory visualization
- Decision matrix (6 criteria)
- Next bottleneck recommendation

**Use Case**: Understanding impact of getenv fix

---

## Key Metrics at a Glance

| Metric | Before (getenv bug) | After (fixed) | Change |
|--------|---------------------|---------------|---------|
| **Performance** | 60 M ops/sec | 120-164 M ops/sec | +86-173% |
| **vs glibc** | 43% slower | 15-57% faster | HUGE WIN |
| **Top bottleneck** | getenv 43.96% | hak_tiny_alloc 22.75% | Different |
| **Allocator CPU** | ~69% | ~51% | -18 pp |
| **Wasted CPU** | 44% (getenv) | 0% | -44 pp |

---

## Top 5 Current Bottlenecks

| Rank | Function | CPU (Self) | Status | Action |
|------|----------|-----------|---------|--------|
| 1 | hak_tiny_alloc | 22.75% | ⚠ HIGH | OPTIMIZE |
| 2 | __random | 14.00% | ℹ INFO | Benchmark overhead |
| 3 | mid_desc_lookup | 12.55% | ⚠ MED | Consider optimizing |
| 4 | hak_tiny_owner_slab | 9.09% | ✓ OK | Below threshold |
| 5 | hak_free_at | 11.08% (incl. children) | ℹ INFO | Children time |

**Primary Target**: hak_tiny_alloc (22.75%) - 2.27x the 10% threshold

---

## Optimization Roadmap

### Phase 7.2.5: Eliminate getenv ✓ COMPLETE
- **Status**: Done
- **Impact**: -43.96% CPU, +86-173% throughput
- **Achievement**: 60 → 120-164 M ops/sec
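
The fix itself is not reproduced in this index. As a minimal sketch of the general technique, assuming the flag was previously read with `getenv` on every allocation, the value can be read once and cached. `g_wrap_tiny_cached`, `wrap_tiny_read_env`, and `wrap_tiny_enabled` are hypothetical names; only the `HAKMEM_WRAP_TINY` variable appears in this document.

```c
#include <stdlib.h>

/* Hypothetical cache for the HAKMEM_WRAP_TINY setting.
 * -1 = not read yet, 0 = disabled, 1 = enabled. */
static int g_wrap_tiny_cached = -1;

/* Slow path: touches the environment, then stores the result. */
static int wrap_tiny_read_env(void) {
    const char *v = getenv("HAKMEM_WRAP_TINY");
    g_wrap_tiny_cached = (v != NULL && v[0] == '1') ? 1 : 0;
    return g_wrap_tiny_cached;
}

/* Hot path: one load and one compare once the value is cached. */
static inline int wrap_tiny_enabled(void) {
    int cached = g_wrap_tiny_cached;
    return cached >= 0 ? cached : wrap_tiny_read_env();
}
```

The unsynchronized store is tolerable only if every thread would write the same value; a production version would more likely set the flag from an init routine or use an atomic.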

### Phase 7.2.6: Optimize hak_tiny_alloc ← NEXT
- **Target**: 22.75% → ~10% CPU
- **Method**: Inline fast path, reduce stack usage, cache globals in TLS
- **Expected**: +50-70% throughput (→ 180-220 M ops/sec)
- **Effort**: 2-4 hours

### Phase 7.2.7: Optimize mid_desc_lookup (Optional)
- **Target**: 12.55% → ~6% CPU
- **Method**: Smaller hash table, prefetching
- **Expected**: +10-20% additional throughput
- **Effort**: 1-2 hours
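
The internals of `mid_desc_lookup` are not shown here, so the following is only a sketch of what "smaller hash table, prefetching" could look like. It assumes a direct-mapped table keyed by page address and a caller that looks up several addresses in a row; `desc_entry_t`, `g_desc_table`, `desc_bucket`, `desc_insert`, and `desc_lookup_batch` are all hypothetical. `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

#define DESC_BUCKETS 256   /* hypothetical: small enough to stay cache-resident */

typedef struct {
    uintptr_t page;        /* 0 marks an empty slot */
    int       desc;
} desc_entry_t;

static desc_entry_t g_desc_table[DESC_BUCKETS];

/* Multiplicative hash; the top 8 bits select one of 256 buckets. */
static inline size_t desc_bucket(uintptr_t page) {
    return (size_t)(((unsigned long long)page * 0x9E3779B97F4A7C15ull) >> 56);
}

/* Direct-mapped insert: a colliding entry is simply overwritten. */
static void desc_insert(uintptr_t page, int desc) {
    g_desc_table[desc_bucket(page)] = (desc_entry_t){ page, desc };
}

/* Looks up n pages; while entry i is examined, the bucket for entry
 * i + 1 is prefetched so its cache line is already in flight. */
static void desc_lookup_batch(const uintptr_t *pages, int *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(&g_desc_table[desc_bucket(pages[i + 1])]);
        const desc_entry_t *e = &g_desc_table[desc_bucket(pages[i])];
        out[i] = (pages[i] != 0 && e->page == pages[i]) ? e->desc : -1;
    }
}
```

Prefetching only pays off when there is other work between issuing the prefetch and using the line, so whether it helps here would need to be confirmed with a fresh perf run.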

### Phase 7.2.8: Ship It!
- **Condition**: All bottlenecks <10%
- **Expected Performance**: 200-250 M ops/sec (roughly 1.9-2.4x glibc)
- **Status**: Enable g_wrap_tiny_enabled = 1 by default

---

## Root Cause: hak_tiny_alloc (22.75% CPU)

### Hotspot Breakdown

1. **Heavy stack usage** (10.5% CPU)
   - 88 bytes allocated
   - Multiple stack reads/writes
   - Register spilling

2. **Repeated global reads** (7.2% CPU)
   - g_tiny_initialized (3.52%)
   - g_wrap_tiny_enabled (0.28%)
   - Should cache in TLS

3. **Complex control flow** (5.0% CPU)
   - Size validation branches
   - Magazine refill in main path
   - Should separate fast/slow paths

### Hottest Instructions (from perf annotate)

```asm
3.71%:  push %r14                       ← Register pressure
3.52%:  mov g_tiny_initialized,%r14d    ← Global read
3.53%:  mov 0x1c(%rsp),%ebp            ← Stack read
3.33%:  cmpq $0x80,0x10(%rsp)          ← Size check
3.06%:  mov %rbp,0x38(%rsp)            ← Stack write
```

---

## Proposed Solution

### 1. Inline Fast Path (Priority: HIGH)
**Impact**: -5 to -7% CPU
**Effort**: 2-3 hours

Create inline `hak_tiny_alloc_fast()`:
- Quick size validation
- Direct TLS magazine access
- Fast path for magazine hit (common case)
- Delegate to slow path only for refill
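
The four points above can be sketched as follows. This is an illustration under stated assumptions, not the HAKMEM implementation: the magazine layout (`tiny_mag_t`, `t_mags`, `TINY_MAG_CAP`), the size-class mapping, and the `malloc`-based refill are stand-ins. The 128-byte limit matches the `cmpq $0x80` size check in the annotated assembly, and the four classes follow the 16B/32B/64B/128B size classes mentioned under Benchmark Results. `__thread` and `__attribute__((noinline))` are GCC/Clang extensions.

```c
#include <stddef.h>
#include <stdlib.h>

#define TINY_MAX_SIZE    128   /* from the cmpq $0x80 size check */
#define TINY_NUM_CLASSES 4     /* assumed classes: 16, 32, 64, 128 bytes */
#define TINY_MAG_CAP     64    /* hypothetical magazine capacity */

typedef struct {
    void *items[TINY_MAG_CAP];
    int   top;                 /* number of cached blocks */
} tiny_mag_t;

/* One magazine per size class, per thread. */
static __thread tiny_mag_t t_mags[TINY_NUM_CLASSES];

/* Slow path, kept out of line so the fast path stays small.
 * The malloc refill is a stand-in for carving blocks from a slab. */
__attribute__((noinline))
static void *hak_tiny_alloc_slow(int cls) {
    static const size_t class_size[TINY_NUM_CLASSES] = {16, 32, 64, 128};
    tiny_mag_t *mag = &t_mags[cls];
    while (mag->top < TINY_MAG_CAP / 2) {
        void *p = malloc(class_size[cls]);
        if (p == NULL) break;
        mag->items[mag->top++] = p;
    }
    return mag->top > 0 ? mag->items[--mag->top] : NULL;
}

static inline int tiny_size_class(size_t size) {
    if (size <= 16) return 0;
    if (size <= 32) return 1;
    if (size <= 64) return 2;
    return 3;
}

/* Fast path: one size check, one TLS access, one pop on a magazine hit. */
static inline void *hak_tiny_alloc_fast(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return NULL;  /* not a tiny request */
    int cls = tiny_size_class(size);
    tiny_mag_t *mag = &t_mags[cls];
    if (mag->top > 0) return mag->items[--mag->top];     /* common case */
    return hak_tiny_alloc_slow(cls);                      /* refill only */
}
```

Returning `NULL` for out-of-range sizes is a simplification; a real caller would fall through to the next allocator tier instead.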

### 2. Reduce Stack Usage (Priority: MEDIUM)
**Impact**: -3 to -4% CPU
**Effort**: 1-2 hours

Reduce from 88 → <32 bytes:
- Fewer local variables
- Pass in registers where possible
- Move rarely-used locals to slow path
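
A generic sketch of the last point, moving rarely used locals into the slow path: the scratch array below lives only in the `noinline` refill function, so the common-path frame does not have to reserve space for it. All names (`g_cache`, `g_arena`, `alloc_refill`, `alloc_block`) are hypothetical and the static arena is a stand-in for real slab memory.

```c
#include <stddef.h>

#define CACHE_CAP  8
#define BLOCK_SIZE 16

static void  *g_cache[CACHE_CAP];
static int    g_cache_top;
static char   g_arena[CACHE_CAP * BLOCK_SIZE * 4];  /* stand-in backing store */
static size_t g_arena_used;

/* Rare path: the scratch array (64 bytes on x86-64) belongs to this frame.
 * noinline stops the compiler from folding it back into the caller.
 * Only called when the cache is empty, so the pushes below cannot overflow. */
__attribute__((noinline))
static void *alloc_refill(void) {
    void *scratch[CACHE_CAP];
    int n = 0;
    while (n < CACHE_CAP && g_arena_used + BLOCK_SIZE <= sizeof g_arena) {
        scratch[n++] = &g_arena[g_arena_used];
        g_arena_used += BLOCK_SIZE;
    }
    if (n == 0) return NULL;              /* backing store exhausted */
    for (int i = 1; i < n; i++)
        g_cache[g_cache_top++] = scratch[i];
    return scratch[0];
}

/* Common path: no large locals, so the frame stays small. */
static void *alloc_block(void) {
    if (g_cache_top > 0) return g_cache[--g_cache_top];
    return alloc_refill();
}
```

Whether this reaches the <32 byte target depends on what the compiler does with the remaining locals, so the frame size is worth re-checking in the annotated assembly afterwards.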

### 3. Cache Globals in TLS (Priority: LOW)
**Impact**: -2 to -3% CPU
**Effort**: 1 hour

Cache g_tiny_initialized and g_wrap_tiny_enabled in TLS:
- Read once on TLS init
- Avoid repeated global reads (currently 3.52% + 0.28% = 3.8% CPU)
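
One possible shape for this, assuming both flags are set once during startup and never cleared afterwards: fold them into a single per-thread state byte. `g_tiny_initialized` and `g_wrap_tiny_enabled` are the globals named above, with assumed `int` types; `t_tiny_state`, `tiny_state_refresh`, and `tiny_path_usable` are hypothetical.

```c
/* Globals named in the analysis; the int types are an assumption. */
static int g_tiny_initialized;
static int g_wrap_tiny_enabled;

/* Per-thread snapshot: 0 = not sampled yet, 1 = usable, 2 = not usable. */
static __thread unsigned char t_tiny_state;

/* Slow path: reads both globals. The answer is latched only after
 * initialization has finished, so early callers keep resampling. */
__attribute__((noinline))
static int tiny_state_refresh(void) {
    int ok = g_tiny_initialized && g_wrap_tiny_enabled;
    if (g_tiny_initialized) t_tiny_state = ok ? 1 : 2;
    return ok;
}

/* Hot path: a single TLS byte replaces two global reads. */
static inline int tiny_path_usable(void) {
    unsigned char s = t_tiny_state;
    if (s != 0) return s == 1;
    return tiny_state_refresh();
}
```

Once latched, a thread no longer observes later changes to either global, which is why the set-once assumption matters.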

**Total Expected**: -10 to -14 percentage points of CPU (22.75% → roughly 9-13%)

---

## Success Criteria

After optimization, verify:
- [ ] hak_tiny_alloc CPU: 22.75% → <12%
- [ ] Total throughput: 120-164 M → 180-250 M ops/sec
- [ ] Faster than glibc: +70% to +140% (vs current +15-57%)
- [ ] No correctness regressions
- [ ] No new bottleneck >15%

---

## Files to Review/Modify

### Source Code
- `/home/tomoaki/git/hakmem/hakmem_pool.c` - Main implementation
- `/home/tomoaki/git/hakmem/hakmem_pool.h` - Add inline fast path

### Performance Data
- `/home/tomoaki/git/hakmem/perf_post_getenv.data` - Current perf recording
- `/home/tomoaki/git/hakmem/perf_post_getenv_annotate.txt` - Assembly hotspots

### Benchmarks
- `/home/tomoaki/git/hakmem/bench_comprehensive_hakmem` - Test binary
- Run with: `HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem`

---

## Timeline

### Completed (Today)
- [x] Collect fresh perf data post-getenv fix
- [x] Identify new #1 bottleneck (hak_tiny_alloc)
- [x] Analyze root causes via perf annotate
- [x] Compare before/after getenv fix
- [x] Make optimization recommendation
- [x] Create implementation guide

### Next Session (2-4 hours)
- [ ] Implement inline fast path
- [ ] Reduce stack usage
- [ ] Benchmark and validate
- [ ] Collect new perf data
- [ ] Assess if further optimization needed

### Future (Optional, 1-2 hours)
- [ ] Optimize mid_desc_lookup (12.55%)
- [ ] Final validation
- [ ] Enable tiny pool by default
- [ ] Ship it!

---

## Questions?

**Q: Should we stop optimizing and ship now?**
A: No. hak_tiny_alloc at 22.75% is 2.27x the 10% threshold. This is a clear optimization opportunity with high ROI (50-70% gain for 2-4 hours of work).

**Q: What if optimization doesn't work?**
A: Low risk. We can always revert. Current performance (120-164 M ops/sec) already beats glibc, so we're not making it worse.

**Q: How do we know when to stop?**
A: When the top bottleneck falls below 10%, or when effort exceeds returns. It is currently at 22.75%, so we are not there yet.

**Q: What about the other bottlenecks?**
A: mid_desc_lookup (12.55%) is a secondary target if time permits. hak_tiny_owner_slab (9.09%) is below the 10% threshold and acceptable.

---

## Additional Resources

### Previous Analysis (For Context)
- `PERF_ANALYSIS_RESULTS.md` - Original analysis that identified getenv bug
- `perf_report.txt` - Old data (with getenv bug)
- `perf_annotate_*.txt` - Old annotations

### Benchmark Results
See PERF_POST_GETENV_ANALYSIS.md section "Supporting Data" for:
- Per-test throughput breakdown
- Size class performance (16B, 32B, 64B, 128B)
- Comparison with glibc baseline

---

## Contact

**Project**: HAKMEM Memory Allocator
**Repository**: /home/tomoaki/git/hakmem
**Analysis Date**: 2025-10-26
**Analyst**: Claude Code (Anthropic)

---

**Last Updated**: 2025-10-26 09:08 JST
**Status**: Ready for Phase 7.2.6 Implementation