# HAKMEM Tiny Pool - Performance Analysis Index

**Date**: 2025-10-26
**Session**: Post-getenv Fix Analysis
**Status**: Analysis Complete - Optimization Recommended

---

## Quick Navigation

### For Immediate Action
- **[OPTIMIZATION_NEXT_STEPS.md](./OPTIMIZATION_NEXT_STEPS.md)** - Implementation guide for next optimization
- **[PERF_SUMMARY.txt](./PERF_SUMMARY.txt)** - One-page executive summary

### For Detailed Review
- **[PERF_POST_GETENV_ANALYSIS.md](./PERF_POST_GETENV_ANALYSIS.md)** - Complete analysis with Q&A
- **[BOTTLENECK_COMPARISON.txt](./BOTTLENECK_COMPARISON.txt)** - Before/after comparison

### Raw Performance Data
- `perf_post_getenv.data` - Perf recording (1 GB)
- `perf_post_getenv_report.txt` - Top functions report
- `perf_post_getenv_annotate.txt` - Annotated assembly

---

## Executive Summary

### Achievement
- **Eliminated getenv bottleneck**: 43.96% CPU → 0%
- **Performance improvement**: +86% to +173% (60 → 120-164 M ops/sec)
- **Now FASTER than glibc**: +15% to +57%

### Current Status
- **New #1 Bottleneck**: hak_tiny_alloc (22.75% CPU)
- **Verdict**: Worth optimizing (2.27x the 10% threshold)
- **Next Target**: Reduce hak_tiny_alloc to ~10% CPU

### Recommendation
**OPTIMIZE NEXT BOTTLENECK** - Clear path to 180-250 M ops/sec (2-3x glibc)

---

## File Descriptions

### Analysis Documents

#### PERF_POST_GETENV_ANALYSIS.md (11 KB)
**Purpose**: Comprehensive post-getenv performance analysis
**Contains**:
- Q1: NEW #1 Bottleneck identification (hak_tiny_alloc 22.75%)
- Q2: Top 5 hotspots ranking
- Q3: Optimization worthiness assessment
- Q4: Root cause analysis and proposed fixes
- Before/after comparison table
- Final recommendation with justification

**Key Finding**: hak_tiny_alloc at 22.75% is 2.27x the 10% threshold → Optimize!

#### OPTIMIZATION_NEXT_STEPS.md (7 KB)
**Purpose**: Actionable implementation guide
**Contains**:
- Root cause breakdown from perf annotate
- 4-phase optimization strategy (prioritized)
- Implementation plan with time estimates
- Success criteria and validation commands
- Risk assessment
- Code examples and snippets

**Start Here**: If you're ready to implement optimizations

#### PERF_SUMMARY.txt (2.6 KB)
**Purpose**: Quick reference card
**Contains**:
- Performance journey (4 phases)
- Optimization roadmap
- Key metrics comparison
- Next steps recommendation

**Use Case**: Quick briefing or status check

#### BOTTLENECK_COMPARISON.txt (4.4 KB)
**Purpose**: Side-by-side before/after analysis
**Contains**:
- Top 10 CPU consumers comparison
- Critical observations (4 key insights)
- Performance trajectory visualization
- Decision matrix (6 criteria)
- Next bottleneck recommendation

**Use Case**: Understanding impact of getenv fix

---

## Key Metrics at a Glance

| Metric | Before (getenv bug) | After (fixed) | Change |
|--------|---------------------|---------------|--------|
| **Performance** | 60 M ops/sec | 120-164 M ops/sec | +86-173% |
| **vs glibc** | 43% slower | 15-57% faster | HUGE WIN |
| **Top bottleneck** | getenv 43.96% | hak_tiny_alloc 22.75% | getenv eliminated |
| **Allocator CPU** | ~69% | ~51% | -18 pp |
| **Wasted CPU** | 44% (getenv) | 0% | -44 pp |

---

## Top 5 Current Bottlenecks

| Rank | Function | CPU | Status | Action |
|------|----------|-----|--------|--------|
| 1 | hak_tiny_alloc | 22.75% (self) | ⚠ HIGH | OPTIMIZE |
| 2 | __random | 14.00% (self) | ℹ INFO | Benchmark overhead |
| 3 | mid_desc_lookup | 12.55% (self) | ⚠ MED | Consider optimizing |
| 4 | hak_tiny_owner_slab | 9.09% (self) | ✓ OK | Below threshold |
| 5 | hak_free_at | 11.08% (children) | ℹ INFO | Children time |

**Primary Target**: hak_tiny_alloc (22.75%) - 2.27x the 10% threshold

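
The threshold rule applied to the table above can be written out as a small C sketch. The struct, the helper names, and the choice to leave out `__random` (benchmark overhead) and `hak_free_at` (children time) are assumptions made for illustration. Only the percentages and the 10% threshold come from this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Threshold used throughout this report: a function is an optimization
 * target when its self CPU share exceeds 10%. */
#define HOTSPOT_THRESHOLD_PCT 10.0

/* Hypothetical record type for one row of the hotspot table. */
typedef struct {
    const char *name;
    double      self_cpu_pct;
} hotspot_t;

/* Allocator self-time rows copied from the table above. */
static const hotspot_t k_hotspots[] = {
    { "hak_tiny_alloc",      22.75 },
    { "mid_desc_lookup",     12.55 },
    { "hak_tiny_owner_slab",  9.09 },
};

/* How many times the threshold a given CPU share represents. */
static inline double threshold_ratio(double pct) {
    return pct / HOTSPOT_THRESHOLD_PCT;
}

static inline bool above_threshold(double pct) {
    return pct > HOTSPOT_THRESHOLD_PCT;
}

/* Count the entries that still exceed the threshold. */
static inline size_t count_targets(const hotspot_t *h, size_t n) {
    size_t targets = 0;
    for (size_t i = 0; i < n; i++) {
        if (above_threshold(h[i].self_cpu_pct)) {
            targets++;
        }
    }
    return targets;
}
```

With these inputs, `threshold_ratio(22.75)` gives the 2.27x figure quoted for hak_tiny_alloc, and two of the three allocator functions remain above the threshold.
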

---

## Optimization Roadmap

### Phase 7.2.5: Eliminate getenv ✓ COMPLETE
- **Status**: Done
- **Impact**: -43.96% CPU, +86-173% throughput
- **Achievement**: 60 → 120-164 M ops/sec

### Phase 7.2.6: Optimize hak_tiny_alloc ← NEXT
- **Target**: 22.75% → ~10% CPU
- **Method**: Inline fast path, reduce stack, cache TLS
- **Expected**: +50-70% throughput (→ 180-220 M ops/sec)
- **Effort**: 2-4 hours

### Phase 7.2.7: Optimize mid_desc_lookup (Optional)
- **Target**: 12.55% → ~6% CPU
- **Method**: Smaller hash table, prefetching
- **Expected**: +10-20% additional throughput
- **Effort**: 1-2 hours

### Phase 7.2.8: Ship It!
- **Condition**: All bottlenecks <10%
- **Expected Performance**: 200-250 M ops/sec (2-3x glibc)
- **Action**: Enable g_wrap_tiny_enabled = 1 by default

---

## Root Cause: hak_tiny_alloc (22.75% CPU)

### Hotspot Breakdown

1. **Heavy stack usage** (10.5% CPU)
   - 88 bytes allocated
   - Multiple stack reads/writes
   - Register spilling

2. **Repeated global reads** (7.2% CPU)
   - g_tiny_initialized (3.52%)
   - g_wrap_tiny_enabled (0.28%)
   - Should cache in TLS

3. **Complex control flow** (5.0% CPU)
   - Size validation branches
   - Magazine refill in main path
   - Should separate fast/slow paths

### Hottest Instructions (from perf annotate)

```asm
3.71%:  push   %r14                      ← Register pressure
3.52%:  mov    g_tiny_initialized,%r14d  ← Global read
3.53%:  mov    0x1c(%rsp),%ebp           ← Stack read
3.33%:  cmpq   $0x80,0x10(%rsp)          ← Size check
3.06%:  mov    %rbp,0x38(%rsp)           ← Stack write
```

---

## Proposed Solution

### 1. Inline Fast Path (Priority: HIGH)
**Impact**: -5 to -7% CPU
**Effort**: 2-3 hours

Create inline `hak_tiny_alloc_fast()`:
- Quick size validation
- Direct TLS magazine access
- Fast path for magazine hit (common case)
- Delegate to slow path only for refill
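
A minimal sketch of the fast/slow split described above, not HAKMEM's actual code. The magazine layout, the 16-byte size-class steps, the capacity constants, and the `malloc`-backed refill are all assumptions made so the example stands alone. Only the name `hak_tiny_alloc_fast()` and the 128-byte limit (the `cmpq $0x80` size check) come from this report. `__builtin_expect` and the `cold`/`noinline` attributes are GCC/Clang extensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative constants; the real tiny pool may use different values. */
#define TINY_MAX_SIZE    128u  /* matches the 0x80 size check in the annotate output */
#define TINY_NUM_CLASSES 8u    /* assumed: 16-byte steps (16, 32, ..., 128) */
#define TINY_MAG_CAP     64u   /* assumed magazine capacity */

/* Hypothetical per-thread magazine: a small stack of ready-to-use blocks. */
typedef struct {
    void    *items[TINY_MAG_CAP];
    unsigned count;
} tiny_magazine_t;

static _Thread_local tiny_magazine_t t_mag[TINY_NUM_CLASSES];

/* Map a size in 1..128 to a class index 0..7. */
static inline unsigned tiny_size_class(size_t size) {
    return (unsigned)((size - 1) / 16);
}

/* Slow path: refill the magazine, then pop one block. Kept out of line and
 * marked cold so its locals and branches stay out of the hot path.
 * malloc() stands in for the real slab refill. */
__attribute__((noinline, cold))
static void *hak_tiny_alloc_slow(unsigned cls) {
    tiny_magazine_t *mag = &t_mag[cls];
    size_t block_size = (size_t)(cls + 1) * 16;
    while (mag->count < TINY_MAG_CAP / 2) {
        void *p = malloc(block_size);
        if (p == NULL) {
            break;
        }
        mag->items[mag->count++] = p;
    }
    return mag->count > 0 ? mag->items[--mag->count] : NULL;
}

/* Fast path: one size check, one TLS access, one pop on a magazine hit. */
static inline void *hak_tiny_alloc_fast(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) {
        return NULL;  /* not a tiny request; the caller falls back */
    }
    unsigned cls = tiny_size_class(size);
    assert(cls < TINY_NUM_CLASSES);
    tiny_magazine_t *mag = &t_mag[cls];
    if (__builtin_expect(mag->count > 0, 1)) {
        return mag->items[--mag->count];  /* common case: magazine hit */
    }
    return hak_tiny_alloc_slow(cls);
}
```

The point of the split is that the inlined function touches only `size`, `cls`, and `mag`, so the compiler has little reason to spill registers, while the refill loop and its locals live in a separate cold function.
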

### 2. Reduce Stack Usage (Priority: MEDIUM)
**Impact**: -3 to -4% CPU
**Effort**: 1-2 hours

Reduce from 88 → <32 bytes:
- Fewer local variables
- Pass in registers where possible
- Move rarely-used locals to slow path

### 3. Cache Globals in TLS (Priority: LOW)
**Impact**: -2 to -3% CPU
**Effort**: 1 hour

Cache g_tiny_initialized and g_wrap_tiny_enabled in TLS:
- Read once on TLS init
- Avoid repeated global reads (these two reads account for 3.8% CPU today)
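
One way the TLS caching could look, as a sketch under stated assumptions. The globals are named in this report, but their types, the `tiny_tls_flags_t` struct, and the `tiny_enabled_cached()` helper are hypothetical. The sketch also assumes neither flag changes once initialization has finished, since a cached copy would otherwise go stale.

```c
#include <assert.h>
#include <stdbool.h>

/* Globals named in this report; declared here with assumed types so the
 * example compiles on its own. */
static int g_tiny_initialized  = 0;
static int g_wrap_tiny_enabled = 0;

/* Hypothetical per-thread snapshot of the two flags. */
typedef struct {
    bool loaded;      /* has this thread latched the globals yet? */
    bool tiny_ready;  /* initialized and wrapping enabled */
} tiny_tls_flags_t;

static _Thread_local tiny_tls_flags_t t_flags;

/* Returns whether the tiny pool may be used on this thread. The globals are
 * read only until initialization is observed; after that the answer comes
 * from the thread-local copy. */
static inline bool tiny_enabled_cached(void) {
    if (__builtin_expect(!t_flags.loaded, 0)) {
        if (!g_tiny_initialized) {
            return false;  /* do not latch a pre-init value */
        }
        t_flags.tiny_ready = (g_wrap_tiny_enabled != 0);
        t_flags.loaded = true;
    }
    return t_flags.tiny_ready;
}
```

Whether a thread-local load is actually cheaper than the current global load depends on the platform and TLS model, so the saving should be confirmed with a fresh perf run rather than assumed.
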

**Total Expected**: -10 to -14 percentage points of CPU (22.75% → ~10%), the sum of the three impact ranges above

---

## Success Criteria

After optimization, verify:
- [ ] hak_tiny_alloc CPU: 22.75% → <12%
- [ ] Total throughput: 120-164 M → 180-250 M ops/sec
- [ ] Faster than glibc: +70% to +140% (vs current +15-57%)
- [ ] No correctness regressions
- [ ] No new bottleneck >15%

---

## Files to Review/Modify

### Source Code
- `/home/tomoaki/git/hakmem/hakmem_pool.c` - Main implementation
- `/home/tomoaki/git/hakmem/hakmem_pool.h` - Add inline fast path

### Performance Data
- `/home/tomoaki/git/hakmem/perf_post_getenv.data` - Current perf recording
- `/home/tomoaki/git/hakmem/perf_post_getenv_annotate.txt` - Assembly hotspots

### Benchmarks
- `/home/tomoaki/git/hakmem/bench_comprehensive_hakmem` - Test binary
- Run with: `HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem`

---

## Timeline

### Completed (Today)
- [x] Collect fresh perf data post-getenv fix
- [x] Identify new #1 bottleneck (hak_tiny_alloc)
- [x] Analyze root causes via perf annotate
- [x] Compare before/after getenv fix
- [x] Make optimization recommendation
- [x] Create implementation guide

### Next Session (2-4 hours)
- [ ] Implement inline fast path
- [ ] Reduce stack usage
- [ ] Benchmark and validate
- [ ] Collect new perf data
- [ ] Assess if further optimization needed

### Future (Optional, 1-2 hours)
- [ ] Optimize mid_desc_lookup (12.55%)
- [ ] Final validation
- [ ] Enable tiny pool by default
- [ ] Ship it!

---

## Questions?

**Q: Should we stop optimizing and ship now?**
A: No. hak_tiny_alloc at 22.75% is 2.27x the 10% threshold. This is a clear optimization opportunity with high ROI (an expected 50-70% gain for 2-4 hours of work).

**Q: What if optimization doesn't work?**
A: Low risk. We can always revert. Current performance (120-164 M ops/sec) already beats glibc, so reverting still leaves us ahead.

**Q: How do we know when to stop?**
A: When the top bottleneck falls below 10%, or when the effort exceeds the returns. It is currently at 22.75%, so we are not there yet.

**Q: What about the other bottlenecks?**
A: mid_desc_lookup (12.55%) is the secondary target if time permits. hak_tiny_owner_slab (9.09%) is below the 10% threshold and acceptable.

---

## Additional Resources

### Previous Analysis (For Context)
- `PERF_ANALYSIS_RESULTS.md` - Original analysis that identified getenv bug
- `perf_report.txt` - Old data (with getenv bug)
- `perf_annotate_*.txt` - Old annotations

### Benchmark Results
See PERF_POST_GETENV_ANALYSIS.md section "Supporting Data" for:
- Per-test throughput breakdown
- Size class performance (16B, 32B, 64B, 128B)
- Comparison with glibc baseline

---

## Contact

**Project**: HAKMEM Memory Allocator
**Repository**: /home/tomoaki/git/hakmem
**Analysis Date**: 2025-10-26
**Analyst**: Claude Code (Anthropic)

---

**Last Updated**: 2025-10-26 09:08 JST
**Status**: Ready for Phase 7.2.6 Implementation