# HAKMEM Tiny Pool - Performance Analysis Index
**Date**: 2025-10-26
**Session**: Post-getenv Fix Analysis
**Status**: Analysis Complete - Optimization Recommended
---
## Quick Navigation
### For Immediate Action
- **[OPTIMIZATION_NEXT_STEPS.md](./OPTIMIZATION_NEXT_STEPS.md)** - Implementation guide for next optimization
- **[PERF_SUMMARY.txt](./PERF_SUMMARY.txt)** - One-page executive summary
### For Detailed Review
- **[PERF_POST_GETENV_ANALYSIS.md](./PERF_POST_GETENV_ANALYSIS.md)** - Complete analysis with Q&A
- **[BOTTLENECK_COMPARISON.txt](./BOTTLENECK_COMPARISON.txt)** - Before/after comparison
### Raw Performance Data
- `perf_post_getenv.data` - Perf recording (1 GB)
- `perf_post_getenv_report.txt` - Top functions report
- `perf_post_getenv_annotate.txt` - Annotated assembly
---
## Executive Summary
### Achievement
- **Eliminated getenv bottleneck**: 43.96% CPU → 0%
- **Performance improvement**: +86% to +173% (60 → 120-164 M ops/sec)
- **Now FASTER than glibc**: +15% to +57%
### Current Status
- **New #1 Bottleneck**: hak_tiny_alloc (22.75% CPU)
- **Verdict**: Worth optimizing (2.27x the 10% threshold)
- **Next Target**: Reduce hak_tiny_alloc to ~10% CPU
### Recommendation
**OPTIMIZE NEXT BOTTLENECK** - Clear path to 180-250 M ops/sec (about 1.7-2.4x glibc, matching the +70% to +140% target)
---
## File Descriptions
### Analysis Documents
#### PERF_POST_GETENV_ANALYSIS.md (11 KB)
**Purpose**: Comprehensive post-getenv performance analysis
**Contains**:
- Q1: NEW #1 Bottleneck identification (hak_tiny_alloc 22.75%)
- Q2: Top 5 hotspots ranking
- Q3: Optimization worthiness assessment
- Q4: Root cause analysis and proposed fixes
- Before/after comparison table
- Final recommendation with justification
**Key Finding**: hak_tiny_alloc at 22.75% is 2.27x the 10% threshold → Optimize!
#### OPTIMIZATION_NEXT_STEPS.md (7 KB)
**Purpose**: Actionable implementation guide
**Contains**:
- Root cause breakdown from perf annotate
- 4-phase optimization strategy (prioritized)
- Implementation plan with time estimates
- Success criteria and validation commands
- Risk assessment
- Code examples and snippets
**Start Here**: If you're ready to implement optimizations
#### PERF_SUMMARY.txt (2.6 KB)
**Purpose**: Quick reference card
**Contains**:
- Performance journey (4 phases)
- Optimization roadmap
- Key metrics comparison
- Next steps recommendation
**Use Case**: Quick briefing or status check
#### BOTTLENECK_COMPARISON.txt (4.4 KB)
**Purpose**: Side-by-side before/after analysis
**Contains**:
- Top 10 CPU consumers comparison
- Critical observations (4 key insights)
- Performance trajectory visualization
- Decision matrix (6 criteria)
- Next bottleneck recommendation
**Use Case**: Understanding impact of getenv fix
---
## Key Metrics at a Glance
| Metric | Before (getenv bug) | After (fixed) | Change |
|--------|---------------------|---------------|---------|
| **Performance** | 60 M ops/sec | 120-164 M ops/sec | +86-173% |
| **vs glibc** | 43% slower | 15-57% faster | HUGE WIN |
| **Top bottleneck** | getenv 43.96% | hak_tiny_alloc 22.75% | Shifted to allocator |
| **Allocator CPU** | ~69% | ~51% | -18 pp |
| **Wasted CPU** | 44% (getenv) | 0% | -44 pp |
---
## Top 5 Current Bottlenecks
| Rank | Function | CPU (Self unless noted) | Status | Action |
|------|----------|-----------|---------|--------|
| 1 | hak_tiny_alloc | 22.75% | ⚠ HIGH | OPTIMIZE |
| 2 | __random | 14.00% | INFO | Benchmark overhead |
| 3 | mid_desc_lookup | 12.55% | ⚠ MED | Consider optimizing |
| 4 | hak_tiny_owner_slab | 9.09% | ✓ OK | Below threshold |
| 5 | hak_free_at | 11.08% (children) | INFO | Children time, not self time |
**Primary Target**: hak_tiny_alloc (22.75%) - 2.27x the 10% threshold
---
## Optimization Roadmap
### Phase 7.2.5: Eliminate getenv ✓ COMPLETE
- **Status**: Done
- **Impact**: -43.96% CPU, +86-173% throughput
- **Achievement**: 60 → 120-164 M ops/sec
### Phase 7.2.6: Optimize hak_tiny_alloc ← NEXT
- **Target**: 22.75% → ~10% CPU
- **Method**: Inline fast path, reduce stack, cache TLS
- **Expected**: +50-70% throughput (→ 180-220 M ops/sec)
- **Effort**: 2-4 hours
### Phase 7.2.7: Optimize mid_desc_lookup (Optional)
- **Target**: 12.55% → ~6% CPU
- **Method**: Smaller hash table, prefetching
- **Expected**: +10-20% additional throughput
- **Effort**: 1-2 hours
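
The "smaller hash table, prefetching" idea can be sketched as follows. This is only an illustration under stated assumptions: the internals of `mid_desc_lookup` are not shown in this index, so the table layout, `desc_entry_t`, `desc_hash`, `desc_insert`, `desc_prefetch`, and `desc_lookup` are all hypothetical names. The sketch uses a small power-of-two open-addressing table (so the whole table stays cache-resident) and the GCC/Clang builtin `__builtin_prefetch` to request the bucket's cache line before it is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: a small power-of-two table so the whole thing fits in L1/L2. */
#define DESC_TABLE_SIZE 256u

typedef struct {
    uintptr_t page;        /* key; 0 marks an empty slot */
    int       size_class;  /* value */
} desc_entry_t;

static desc_entry_t g_desc_table[DESC_TABLE_SIZE];

/* Multiplicative hash reduced to a table index. */
static inline uint32_t desc_hash(uintptr_t page) {
    uint64_t h = (uint64_t)page * 0x9E3779B97F4A7C15ull;
    return (uint32_t)(h >> 32) & (DESC_TABLE_SIZE - 1);
}

/* Insert or update with linear probing. Returns 1 on success, 0 if full. */
static int desc_insert(uintptr_t page, int size_class) {
    assert(page != 0);  /* 0 is reserved as the empty marker */
    uint32_t i = desc_hash(page);
    for (uint32_t n = 0; n < DESC_TABLE_SIZE; n++) {
        if (g_desc_table[i].page == 0 || g_desc_table[i].page == page) {
            g_desc_table[i].page = page;
            g_desc_table[i].size_class = size_class;
            return 1;
        }
        i = (i + 1) & (DESC_TABLE_SIZE - 1);
    }
    return 0;
}

/* Issue the prefetch early (read access, high temporal locality). */
static inline void desc_prefetch(uintptr_t page) {
    __builtin_prefetch(&g_desc_table[desc_hash(page)], 0, 3);
}

/* Returns the size class, or -1 if the page is not registered. */
static int desc_lookup(uintptr_t page) {
    uint32_t i = desc_hash(page);
    for (uint32_t n = 0; n < DESC_TABLE_SIZE; n++) {
        if (g_desc_table[i].page == page) return g_desc_table[i].size_class;
        if (g_desc_table[i].page == 0) return -1;
        i = (i + 1) & (DESC_TABLE_SIZE - 1);
    }
    return -1;
}
```

In a free path, one would call `desc_prefetch(page)` as soon as the page address is known, do unrelated work, and call `desc_lookup(page)` afterwards. Whether this actually helps would need to be confirmed with a fresh perf recording.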
### Phase 7.2.8: Ship It!
- **Condition**: All bottlenecks <10%
- **Expected Performance**: 200-250 M ops/sec (about 1.9-2.4x glibc)
- **Action**: Enable `g_wrap_tiny_enabled = 1` by default
---
## Root Cause: hak_tiny_alloc (22.75% CPU)
### Hotspot Breakdown
1. **Heavy stack usage** (10.5% CPU)
   - 88 bytes allocated
   - Multiple stack reads/writes
   - Register spilling
2. **Repeated global reads** (7.2% CPU)
   - g_tiny_initialized (3.52%)
   - g_wrap_tiny_enabled (0.28%)
   - Should cache in TLS
3. **Complex control flow** (5.0% CPU)
   - Size validation branches
   - Magazine refill in main path
   - Should separate fast/slow paths
### Hottest Instructions (from perf annotate)
```asm
 3.71%:  push   %r14                       # register pressure
 3.52%:  mov    g_tiny_initialized,%r14d   # global read
 3.53%:  mov    0x1c(%rsp),%ebp            # stack read
 3.33%:  cmpq   $0x80,0x10(%rsp)           # size check
 3.06%:  mov    %rbp,0x38(%rsp)            # stack write
```
---
## Proposed Solution
### 1. Inline Fast Path (Priority: HIGH)
**Impact**: -5 to -7% CPU
**Effort**: 2-3 hours
Create inline `hak_tiny_alloc_fast()`:
- Quick size validation
- Direct TLS magazine access
- Fast path for magazine hit (common case)
- Delegate to slow path only for refill
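
A minimal sketch of the fast/slow split described above, not the actual HAKMEM code. Only the name `hak_tiny_alloc_fast` comes from this document; the magazine layout (`tiny_magazine_t`), the four size classes (16/32/64/128 bytes), `tiny_size_class`, `tiny_magazine_push`, and the stubbed `hak_tiny_alloc_slow` are assumptions made for illustration. It relies on the GCC/Clang extensions `__builtin_expect` and `__attribute__((noinline, cold))`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants; 0x80 matches the size check seen in perf annotate. */
#define TINY_MAX_SIZE    128u
#define TINY_NUM_CLASSES 4u    /* assumed classes: 16, 32, 64, 128 bytes */
#define TINY_MAG_CAP     64u

/* Hypothetical per-thread magazine: a small LIFO stack of free blocks. */
typedef struct {
    void    *items[TINY_MAG_CAP];
    uint32_t count;
} tiny_magazine_t;

static _Thread_local tiny_magazine_t t_mags[TINY_NUM_CLASSES];

/* Branch-free class index: <=16 -> 0, <=32 -> 1, <=64 -> 2, <=128 -> 3. */
static inline uint32_t tiny_size_class(size_t size) {
    return (uint32_t)((size > 16) + (size > 32) + (size > 64));
}

/* Slow path stub. A real allocator would refill the magazine here;
 * this placeholder just reports failure. Kept out of line and cold. */
__attribute__((noinline, cold))
static void *hak_tiny_alloc_slow(size_t size) {
    (void)size;
    return NULL;
}

/* Hypothetical free-side helper so the sketch can be exercised.
 * Returns 1 if the block was cached, 0 otherwise. */
static inline int tiny_magazine_push(size_t size, void *ptr) {
    assert(ptr != NULL);
    if (size == 0 || size > TINY_MAX_SIZE) return 0;
    tiny_magazine_t *mag = &t_mags[tiny_size_class(size)];
    if (mag->count == TINY_MAG_CAP) return 0;
    mag->items[mag->count++] = ptr;
    return 1;
}

/* Fast path: one size check, one TLS access, one pop on a magazine hit. */
static inline void *hak_tiny_alloc_fast(size_t size) {
    if (__builtin_expect(size == 0 || size > TINY_MAX_SIZE, 0))
        return hak_tiny_alloc_slow(size);
    tiny_magazine_t *mag = &t_mags[tiny_size_class(size)];
    if (__builtin_expect(mag->count > 0, 1))
        return mag->items[--mag->count];
    return hak_tiny_alloc_slow(size);
}
```

Because the refill logic lives in a separate non-inlined function, the inlined fast path needs only a handful of registers, which is the property the proposal is after.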
### 2. Reduce Stack Usage (Priority: MEDIUM)
**Impact**: -3 to -4% CPU
**Effort**: 1-2 hours
Reduce from 88 → <32 bytes:
- Fewer local variables
- Pass in registers where possible
- Move rarely-used locals to slow path
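
One way to picture "move rarely-used locals to slow path" is sketched below. All names (`mag_t`, `alloc_refill`, `alloc_small_frame`, `REFILL_BATCH`) are hypothetical and the refill logic is invented for illustration. The point is only that the large `batch` array lives in the frame of a `noinline` function, so the hot function does not have to reserve stack space for it. The real frame sizes depend on the compiler and flags and should be checked in the perf annotate output.

```c
#include <assert.h>
#include <stddef.h>

#define MAG_CAP      64u
#define REFILL_BATCH 16u

/* Hypothetical magazine of cached free blocks. */
typedef struct {
    void    *slots[MAG_CAP];
    unsigned count;
} mag_t;

/* Slow path: owns the large local (16 pointers = 128 bytes on x86-64).
 * noinline keeps this frame separate from the caller's frame. */
__attribute__((noinline))
static void *alloc_refill(mag_t *mag, char *arena, size_t obj_size) {
    void *batch[REFILL_BATCH];
    assert(mag->count + (REFILL_BATCH - 1) <= MAG_CAP);
    /* Carve REFILL_BATCH objects out of the arena. */
    for (unsigned i = 0; i < REFILL_BATCH; i++)
        batch[i] = arena + (size_t)i * obj_size;
    /* Cache all but the first; the first is returned to the caller. */
    for (unsigned i = 1; i < REFILL_BATCH; i++)
        mag->slots[mag->count++] = batch[i];
    return batch[0];
}

/* Hot path: no arrays, no spills beyond what the pop itself needs. */
static inline void *alloc_small_frame(mag_t *mag, char *arena, size_t obj_size) {
    if (mag->count > 0)
        return mag->slots[--mag->count];
    return alloc_refill(mag, arena, obj_size);
}
```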
### 3. Cache Globals in TLS (Priority: LOW)
**Impact**: -2 to -3% CPU
**Effort**: 1 hour
Cache g_tiny_initialized and g_wrap_tiny_enabled in TLS:
- Read once on TLS init
- Avoid repeated global reads (3.8% CPU saved)
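
A minimal sketch of the TLS caching idea, assuming both globals are plain `int` flags (their real types are not shown in this index). `tiny_tls_flags_t`, `tiny_tls_flags_load`, and `tiny_fast_path_allowed` are hypothetical names. The snapshot is frozen only once initialization has completed, so a thread that starts early keeps re-reading the globals until then.

```c
#include <assert.h>
#include <stdbool.h>

/* Globals named in the analysis; the int type is an assumption. */
static int g_tiny_initialized  = 0;
static int g_wrap_tiny_enabled = 0;

/* Hypothetical per-thread snapshot of the two flags. */
typedef struct {
    bool loaded;        /* snapshot is frozen and may be trusted */
    bool initialized;
    bool wrap_enabled;
} tiny_tls_flags_t;

static _Thread_local tiny_tls_flags_t t_flags;

/* Read the globals once and freeze the snapshot only after init is done. */
static void tiny_tls_flags_load(void) {
    t_flags.initialized  = (g_tiny_initialized != 0);
    t_flags.wrap_enabled = (g_wrap_tiny_enabled != 0);
    t_flags.loaded       = t_flags.initialized;
}

/* Hot-path check: after the first successful load this touches TLS only. */
static inline bool tiny_fast_path_allowed(void) {
    if (!t_flags.loaded)
        tiny_tls_flags_load();
    /* Invariant: a frozen snapshot always has initialized set. */
    assert(!t_flags.loaded || t_flags.initialized);
    return t_flags.initialized && t_flags.wrap_enabled;
}
```

The trade-off is staleness: once a thread has frozen its snapshot, a later change to `g_wrap_tiny_enabled` is not seen by that thread. That is acceptable only if the flag is set once at startup, which is an assumption here.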
**Total Expected**: -10 to -14% CPU reduction, the sum of the three estimates above (22.75% → ~10%)
---
## Success Criteria
After optimization, verify:
- [ ] hak_tiny_alloc CPU: 22.75% → <12%
- [ ] Total throughput: 120-164 M → 180-250 M ops/sec
- [ ] Faster than glibc: +70% to +140% (vs current +15-57%)
- [ ] No correctness regressions
- [ ] No new bottleneck >15%
---
## Files to Review/Modify
### Source Code
- `/home/tomoaki/git/hakmem/hakmem_pool.c` - Main implementation
- `/home/tomoaki/git/hakmem/hakmem_pool.h` - Add inline fast path
### Performance Data
- `/home/tomoaki/git/hakmem/perf_post_getenv.data` - Current perf recording
- `/home/tomoaki/git/hakmem/perf_post_getenv_annotate.txt` - Assembly hotspots
### Benchmarks
- `/home/tomoaki/git/hakmem/bench_comprehensive_hakmem` - Test binary
- Run with: `HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem`
---
## Timeline
### Completed (Today)
- [x] Collect fresh perf data post-getenv fix
- [x] Identify new #1 bottleneck (hak_tiny_alloc)
- [x] Analyze root causes via perf annotate
- [x] Compare before/after getenv fix
- [x] Make optimization recommendation
- [x] Create implementation guide
### Next Session (2-4 hours)
- [ ] Implement inline fast path
- [ ] Reduce stack usage
- [ ] Benchmark and validate
- [ ] Collect new perf data
- [ ] Assess if further optimization needed
### Future (Optional, 1-2 hours)
- [ ] Optimize mid_desc_lookup (12.55%)
- [ ] Final validation
- [ ] Enable tiny pool by default
- [ ] Ship it!
---
## Questions?
**Q: Should we stop optimizing and ship now?**
A: No. hak_tiny_alloc at 22.75% is 2.27x the 10% threshold. This is a clear optimization opportunity with high ROI (50-70% gain for 2-4 hours of work).
**Q: What if optimization doesn't work?**
A: Low risk. The change can always be reverted, and current performance (120-164 M ops/sec) already beats glibc, so a failed attempt costs only the time spent.
**Q: How do we know when to stop?**
A: When top bottleneck falls below 10%, or when effort exceeds returns. Currently at 22.75%, so not there yet.
**Q: What about the other bottlenecks?**
A: mid_desc_lookup (12.55%) is the secondary target if time permits. hak_tiny_owner_slab (9.09%) is below the 10% threshold and is acceptable as is.
---
## Additional Resources
### Previous Analysis (For Context)
- `PERF_ANALYSIS_RESULTS.md` - Original analysis that identified getenv bug
- `perf_report.txt` - Old data (with getenv bug)
- `perf_annotate_*.txt` - Old annotations
### Benchmark Results
See PERF_POST_GETENV_ANALYSIS.md section "Supporting Data" for:
- Per-test throughput breakdown
- Size class performance (16B, 32B, 64B, 128B)
- Comparison with glibc baseline
---
## Contact
**Project**: HAKMEM Memory Allocator
**Repository**: /home/tomoaki/git/hakmem
**Analysis Date**: 2025-10-26
**Analyst**: Claude Code (Anthropic)
---
**Last Updated**: 2025-10-26 09:08 JST
**Status**: Ready for Phase 7.2.6 Implementation