HAKMEM Tiny Pool - Performance Analysis Index
Date: 2025-10-26
Session: Post-getenv Fix Analysis
Status: Analysis Complete - Optimization Recommended
Quick Navigation
For Immediate Action
- OPTIMIZATION_NEXT_STEPS.md - Implementation guide for next optimization
- PERF_SUMMARY.txt - One-page executive summary
For Detailed Review
- PERF_POST_GETENV_ANALYSIS.md - Complete analysis with Q&A
- BOTTLENECK_COMPARISON.txt - Before/after comparison
Raw Performance Data
- perf_post_getenv.data - Perf recording (1 GB)
- perf_post_getenv_report.txt - Top functions report
- perf_post_getenv_annotate.txt - Annotated assembly
Executive Summary
Achievement
- Eliminated getenv bottleneck: 43.96% CPU → 0%
- Performance improvement: +86% to +173% (60 → 120-164 M ops/sec)
- Now FASTER than glibc: +15% to +57%
Current Status
- New #1 Bottleneck: hak_tiny_alloc (22.75% CPU)
- Verdict: Worth optimizing (2.27x the 10% threshold)
- Next Target: Reduce hak_tiny_alloc to ~10% CPU
Recommendation
OPTIMIZE NEXT BOTTLENECK - Clear path to 180-250 M ops/sec (2-3x glibc)
File Descriptions
Analysis Documents
PERF_POST_GETENV_ANALYSIS.md (11 KB)
Purpose: Comprehensive post-getenv performance analysis
Contains:
- Q1: NEW #1 Bottleneck identification (hak_tiny_alloc 22.75%)
- Q2: Top 5 hotspots ranking
- Q3: Optimization worthiness assessment
- Q4: Root cause analysis and proposed fixes
- Before/after comparison table
- Final recommendation with justification
Key Finding: hak_tiny_alloc at 22.75% is 2.27x the 10% threshold → Optimize!
OPTIMIZATION_NEXT_STEPS.md (7 KB)
Purpose: Actionable implementation guide
Contains:
- Root cause breakdown from perf annotate
- 4-phase optimization strategy (prioritized)
- Implementation plan with time estimates
- Success criteria and validation commands
- Risk assessment
- Code examples and snippets
Start Here: If you're ready to implement optimizations
PERF_SUMMARY.txt (2.6 KB)
Purpose: Quick reference card
Contains:
- Performance journey (4 phases)
- Optimization roadmap
- Key metrics comparison
- Next steps recommendation
Use Case: Quick briefing or status check
BOTTLENECK_COMPARISON.txt (4.4 KB)
Purpose: Side-by-side before/after analysis
Contains:
- Top 10 CPU consumers comparison
- Critical observations (4 key insights)
- Performance trajectory visualization
- Decision matrix (6 criteria)
- Next bottleneck recommendation
Use Case: Understanding impact of getenv fix
Key Metrics at a Glance
| Metric | Before (getenv bug) | After (fixed) | Change |
|---|---|---|---|
| Performance | 60 M ops/sec | 120-164 M ops/sec | +86-173% |
| vs glibc | 43% slower | 15-57% faster | HUGE WIN |
| Top bottleneck | getenv 43.96% | hak_tiny_alloc 22.75% | Different |
| Allocator CPU | ~69% | ~51% | -18 points |
| Wasted CPU | 44% (getenv) | 0% | -44 points |
Top 5 Current Bottlenecks
| Rank | Function | CPU (Self) | Status | Action |
|---|---|---|---|---|
| 1 | hak_tiny_alloc | 22.75% | ⚠ HIGH | OPTIMIZE |
| 2 | __random | 14.00% | ℹ INFO | Benchmark overhead |
| 3 | mid_desc_lookup | 12.55% | ⚠ MED | Consider optimizing |
| 4 | hak_tiny_owner_slab | 9.09% | ✓ OK | Below threshold |
| 5 | hak_free_at | 11.08% (children) | ℹ INFO | Children time |
Primary Target: hak_tiny_alloc (22.75%) - 2.27x the 10% threshold
Optimization Roadmap
Phase 7.2.5: Eliminate getenv ✓ COMPLETE
- Status: Done
- Impact: -43.96% CPU, +86-173% throughput
- Achievement: 60 → 120-164 M ops/sec
Phase 7.2.6: Optimize hak_tiny_alloc ← NEXT
- Target: 22.75% → ~10% CPU
- Method: Inline fast path, reduce stack, cache TLS
- Expected: +50-70% throughput (→ 180-220 M ops/sec)
- Effort: 2-4 hours
Phase 7.2.7: Optimize mid_desc_lookup (Optional)
- Target: 12.55% → ~6% CPU
- Method: Smaller hash table, prefetching
- Expected: +10-20% additional throughput
- Effort: 1-2 hours
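As a rough illustration of the "smaller hash table, prefetching" idea, the sketch below uses a 1024-entry open-addressing table (16 KiB on a 64-bit target, small enough to stay cache-resident on many CPUs) and issues a prefetch for the bucket before the lookup runs. Everything here is an assumption made for illustration: the real mid_desc_lookup data structure is not shown in this index, and the names mid_desc_t, mid_desc_find, mid_desc_insert, mid_desc_prefetch, the 4 KiB page key, and the table size are hypothetical. `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes: 2^10 buckets, 4 KiB pages. */
#define MID_DESC_BITS    10u
#define MID_DESC_BUCKETS (1u << MID_DESC_BITS)
#define MID_PAGE_SHIFT   12u

typedef struct {
    uintptr_t page;       /* page number; 0 marks an empty slot */
    uint32_t  class_idx;  /* payload for the demo */
} mid_desc_t;

static mid_desc_t g_mid_desc[MID_DESC_BUCKETS];

static inline uintptr_t mid_page_of(const void *ptr) {
    return (uintptr_t)ptr >> MID_PAGE_SHIFT;
}

/* Multiplicative hash; keeps the top MID_DESC_BITS bits of the product. */
static inline uint32_t mid_desc_hash(uintptr_t page) {
    return ((uint32_t)page * 0x9E3779B1u) >> (32u - MID_DESC_BITS);
}

/* Call early (for example on entry to free) so the bucket's cache line
 * is already in flight when the lookup happens. */
static inline void mid_desc_prefetch(const void *ptr) {
    __builtin_prefetch(&g_mid_desc[mid_desc_hash(mid_page_of(ptr))], 0, 3);
}

/* Insert or update with linear probing. Returns 0 on success, -1 if the
 * table is full or the address falls in page 0 (reserved as "empty"). */
static int mid_desc_insert(const void *ptr, uint32_t class_idx) {
    uintptr_t page = mid_page_of(ptr);
    if (page == 0) return -1;
    uint32_t i = mid_desc_hash(page);
    for (uint32_t n = 0; n < MID_DESC_BUCKETS; n++) {
        mid_desc_t *d = &g_mid_desc[i];
        if (d->page == 0 || d->page == page) {
            d->page = page;
            d->class_idx = class_idx;
            return 0;
        }
        i = (i + 1) & (MID_DESC_BUCKETS - 1);
    }
    return -1;
}

/* Find the descriptor covering ptr, or NULL if its page is not registered. */
static const mid_desc_t *mid_desc_find(const void *ptr) {
    uintptr_t page = mid_page_of(ptr);
    uint32_t i = mid_desc_hash(page);
    for (uint32_t n = 0; n < MID_DESC_BUCKETS; n++) {
        const mid_desc_t *d = &g_mid_desc[i];
        if (d->page == 0) return NULL;
        if (d->page == page) return d;
        i = (i + 1) & (MID_DESC_BUCKETS - 1);
    }
    return NULL;
}
```

Whether the prefetch pays off depends on how much independent work sits between the prefetch and the lookup, so it should be judged against fresh perf data rather than assumed.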
Phase 7.2.8: Ship It!
- Condition: All bottlenecks <10%
- Expected Performance: 200-250 M ops/sec (2-3x glibc)
- Status: Enable g_wrap_tiny_enabled = 1 by default
Root Cause: hak_tiny_alloc (22.75% CPU)
Hotspot Breakdown
1. Heavy stack usage (10.5% CPU)
   - 88 bytes allocated
   - Multiple stack reads/writes
   - Register spilling
2. Repeated global reads (7.2% CPU)
   - g_tiny_initialized (3.52%)
   - g_wrap_tiny_enabled (0.28%)
   - Should cache in TLS
3. Complex control flow (5.0% CPU)
   - Size validation branches
   - Magazine refill in main path
   - Should separate fast/slow paths
Hottest Instructions (from perf annotate)
3.71%: push %r14 ← Register pressure
3.52%: mov g_tiny_initialized,%r14d ← Global read
3.53%: mov 0x1c(%rsp),%ebp ← Stack read
3.33%: cmpq $0x80,0x10(%rsp) ← Size check
3.06%: mov %rbp,0x38(%rsp) ← Stack write
Proposed Solution
1. Inline Fast Path (Priority: HIGH)
Impact: -5 to -7% CPU
Effort: 2-3 hours
Create inline hak_tiny_alloc_fast():
- Quick size validation
- Direct TLS magazine access
- Fast path for magazine hit (common case)
- Delegate to slow path only for refill
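The four points above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the HAKMEM implementation: the magazine layout, the four size classes (16, 32, 64, 128 bytes), the capacity and refill batch, and the helper names tiny_size_class, hak_tiny_alloc_slow and t_slow_calls are all hypothetical, and malloc() merely stands in for the real refill source. It relies on GCC/Clang extensions (`__builtin_expect`, `__attribute__((noinline))`) and C11 `_Thread_local`.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* All names and sizes below are hypothetical, not the HAKMEM layout. */
#define TINY_MAX_SIZE     128u  /* mirrors the cmpq $0x80 size check */
#define TINY_NUM_CLASSES  4u    /* assumed classes: 16, 32, 64, 128 bytes */
#define TINY_MAG_CAP      64u
#define TINY_REFILL_BATCH 8u

typedef struct {
    void    *items[TINY_MAG_CAP];
    uint32_t count;
} tiny_magazine_t;

static _Thread_local tiny_magazine_t t_mag[TINY_NUM_CLASSES];
static _Thread_local unsigned long   t_slow_calls; /* demo counter only */

/* Map a size in 1..128 to class 0..3. */
static inline unsigned tiny_size_class(size_t size) {
    if (size <= 16) return 0;
    if (size <= 32) return 1;
    if (size <= 64) return 2;
    return 3;
}

/* Slow path: refill the magazine, then hand out one block.
 * malloc() is a stand-in for the real slab refill. */
__attribute__((noinline))
static void *hak_tiny_alloc_slow(size_t size) {
    static const size_t class_bytes[TINY_NUM_CLASSES] = {16, 32, 64, 128};
    unsigned cls = tiny_size_class(size);
    tiny_magazine_t *m = &t_mag[cls];

    t_slow_calls++;
    while (m->count < TINY_REFILL_BATCH) {
        void *p = malloc(class_bytes[cls]);
        if (p == NULL) break;
        m->items[m->count++] = p;
    }
    return m->count ? m->items[--m->count] : NULL;
}

/* Fast path: one size check, one TLS magazine access, one pop. */
static inline void *hak_tiny_alloc_fast(size_t size) {
    if (__builtin_expect(size == 0 || size > TINY_MAX_SIZE, 0))
        return NULL;                      /* not a tiny request */
    tiny_magazine_t *m = &t_mag[tiny_size_class(size)];
    if (__builtin_expect(m->count > 0, 1))
        return m->items[--m->count];      /* magazine hit (common case) */
    return hak_tiny_alloc_slow(size);     /* miss: refill out of line */
}
```

In this sketch the first request for a class takes the slow path once and later requests are served from the magazine. Keeping the refill in a separate noinline function also keeps its locals out of the fast path's stack frame.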
2. Reduce Stack Usage (Priority: MEDIUM)
Impact: -3 to -4% CPU
Effort: 1-2 hours
Reduce from 88 → <32 bytes:
- Fewer local variables
- Pass in registers where possible
- Move rarely-used locals to slow path
3. Cache Globals in TLS (Priority: LOW)
Impact: -2 to -3% CPU
Effort: 1 hour
Cache g_tiny_initialized and g_wrap_tiny_enabled in TLS:
- Read once on TLS init
- Avoid repeated global reads (3.8% CPU saved)
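One possible shape for the TLS cache is a single per-thread flag that latches once both globals have been seen set, so the hot path tests one thread-local byte instead of two globals. This is a sketch under assumptions: the definitions of the two globals below are stand-ins, t_tiny_ready, tiny_ready and tiny_ready_refresh are hypothetical names, and the latch is only valid if neither flag is ever cleared after initialization. It also differs slightly from "read once on TLS init": it re-reads the globals until they are set, then stops.

```c
#include <stdbool.h>

/* Stand-in definitions for the two globals named in the analysis. */
int g_tiny_initialized  = 0;
int g_wrap_tiny_enabled = 0;

/* Hypothetical per-thread latch: true once both flags were seen set. */
static _Thread_local bool t_tiny_ready;

/* Cold path: consult the globals and latch the result if both are set. */
__attribute__((noinline))
static bool tiny_ready_refresh(void) {
    t_tiny_ready = g_tiny_initialized && g_wrap_tiny_enabled;
    return t_tiny_ready;
}

/* Hot path: a single thread-local read once the latch is set. */
static inline bool tiny_ready(void) {
    if (__builtin_expect(t_tiny_ready, 1))
        return true;
    return tiny_ready_refresh();
}
```

If g_wrap_tiny_enabled can be switched off at runtime, the latch would need an invalidation mechanism. The CPU saving itself is the estimate quoted above and should be confirmed with a new perf recording.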
Total Expected: -10 to -14% CPU reduction, summing the three items above (22.75% → ~10%)
Success Criteria
After optimization, verify:
- hak_tiny_alloc CPU: 22.75% → <12%
- Total throughput: 120-164 M → 180-250 M ops/sec
- Faster than glibc: +70% to +140% (vs current +15-57%)
- No correctness regressions
- No new bottleneck >15%
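The numeric criteria can be written down as a small check, which makes the thresholds explicit when comparing runs. This is only a sketch: perf_result_t and meets_success_criteria are hypothetical names, the values must be filled in by hand from the perf report and benchmark output, and the correctness criterion is outside its scope.

```c
#include <stdbool.h>

/* Hypothetical record of one measured run. */
typedef struct {
    double tiny_alloc_cpu_pct;  /* hak_tiny_alloc self CPU, in percent */
    double max_other_cpu_pct;   /* largest other single hotspot, in percent */
    double throughput_mops;     /* hakmem throughput, M ops/sec */
    double glibc_mops;          /* glibc baseline, same benchmark, M ops/sec */
} perf_result_t;

/* Returns true when every numeric criterion from the list above holds. */
static bool meets_success_criteria(const perf_result_t *r) {
    double vs_glibc_pct = (r->throughput_mops / r->glibc_mops - 1.0) * 100.0;
    return r->tiny_alloc_cpu_pct < 12.0   /* hak_tiny_alloc below 12% */
        && r->throughput_mops >= 180.0    /* at least 180 M ops/sec */
        && vs_glibc_pct >= 70.0           /* at least +70% over glibc */
        && r->max_other_cpu_pct <= 15.0;  /* no new bottleneck above 15% */
}
```

For example, a run at 22.75% hak_tiny_alloc CPU and 164 M ops/sec fails the check, while a hypothetical run at 10% CPU and 200 M ops/sec against a glibc baseline of about 104 M ops/sec passes. The 104 figure is an assumption back-computed from the +15% to +57% range, not a number stated in this index.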
Files to Review/Modify
Source Code
- /home/tomoaki/git/hakmem/hakmem_pool.c - Main implementation
- /home/tomoaki/git/hakmem/hakmem_pool.h - Add inline fast path
Performance Data
- /home/tomoaki/git/hakmem/perf_post_getenv.data - Current perf recording
- /home/tomoaki/git/hakmem/perf_post_getenv_annotate.txt - Assembly hotspots
Benchmarks
- /home/tomoaki/git/hakmem/bench_comprehensive_hakmem - Test binary
- Run with: HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem
Timeline
Completed (Today)
- Collect fresh perf data post-getenv fix
- Identify new #1 bottleneck (hak_tiny_alloc)
- Analyze root causes via perf annotate
- Compare before/after getenv fix
- Make optimization recommendation
- Create implementation guide
Next Session (2-4 hours)
- Implement inline fast path
- Reduce stack usage
- Benchmark and validate
- Collect new perf data
- Assess if further optimization needed
Future (Optional, 1-2 hours)
- Optimize mid_desc_lookup (12.55%)
- Final validation
- Enable tiny pool by default
- Ship it!
Questions?
Q: Should we stop optimizing and ship now?
A: No. hak_tiny_alloc at 22.75% is 2.27x the 10% threshold. This is a clear optimization opportunity with high ROI (50-70% gain for 2-4 hours of work).
Q: What if optimization doesn't work?
A: Low risk. We can always revert. Current performance (120-164 M ops/sec) already beats glibc, so a revert still leaves us ahead.
Q: How do we know when to stop?
A: When the top bottleneck falls below 10%, or when effort exceeds returns. It is currently at 22.75%, so we are not there yet.
Q: What about the other bottlenecks?
A: mid_desc_lookup (12.55%) is a secondary target if time permits. hak_tiny_owner_slab (9.09%) is below the 10% threshold and acceptable.
Additional Resources
Previous Analysis (For Context)
- PERF_ANALYSIS_RESULTS.md - Original analysis that identified getenv bug
- perf_report.txt - Old data (with getenv bug)
- perf_annotate_*.txt - Old annotations
Benchmark Results
See PERF_POST_GETENV_ANALYSIS.md section "Supporting Data" for:
- Per-test throughput breakdown
- Size class performance (16B, 32B, 64B, 128B)
- Comparison with glibc baseline
Contact
Project: HAKMEM Memory Allocator
Repository: /home/tomoaki/git/hakmem
Analysis Date: 2025-10-26
Analyst: Claude Code (Anthropic)
Last Updated: 2025-10-26 09:08 JST
Status: Ready for Phase 7.2.6 Implementation