# HAKMEM Tiny Pool - Performance Analysis Index

**Date**: 2025-10-26
**Session**: Post-getenv Fix Analysis
**Status**: Analysis Complete - Optimization Recommended

---

## Quick Navigation

### For Immediate Action
- **[OPTIMIZATION_NEXT_STEPS.md](./OPTIMIZATION_NEXT_STEPS.md)** - Implementation guide for next optimization
- **[PERF_SUMMARY.txt](./PERF_SUMMARY.txt)** - One-page executive summary

### For Detailed Review
- **[PERF_POST_GETENV_ANALYSIS.md](./PERF_POST_GETENV_ANALYSIS.md)** - Complete analysis with Q&A
- **[BOTTLENECK_COMPARISON.txt](./BOTTLENECK_COMPARISON.txt)** - Before/after comparison

### Raw Performance Data
- `perf_post_getenv.data` - Perf recording (1 GB)
- `perf_post_getenv_report.txt` - Top functions report
- `perf_post_getenv_annotate.txt` - Annotated assembly

---

## Executive Summary

### Achievement
- **Eliminated getenv bottleneck**: 43.96% CPU → 0%
- **Performance improvement**: +86% to +173% (60 → 120-164 M ops/sec)
- **Now FASTER than glibc**: +15% to +57%

### Current Status
- **New #1 Bottleneck**: hak_tiny_alloc (22.75% CPU)
- **Verdict**: Worth optimizing (2.27x above 10% threshold)
- **Next Target**: Reduce hak_tiny_alloc to ~10% CPU

### Recommendation
**OPTIMIZE NEXT BOTTLENECK** - Clear path to 180-250 M ops/sec (2-3x glibc)

---

## File Descriptions

### Analysis Documents

#### PERF_POST_GETENV_ANALYSIS.md (11 KB)
**Purpose**: Comprehensive post-getenv performance analysis

**Contains**:
- Q1: NEW #1 Bottleneck identification (hak_tiny_alloc 22.75%)
- Q2: Top 5 hotspots ranking
- Q3: Optimization worthiness assessment
- Q4: Root cause analysis and proposed fixes
- Before/after comparison table
- Final recommendation with justification

**Key Finding**: hak_tiny_alloc at 22.75% is 2.27x above 10% threshold → Optimize!

#### OPTIMIZATION_NEXT_STEPS.md (7 KB)
**Purpose**: Actionable implementation guide

**Contains**:
- Root cause breakdown from perf annotate
- 4-phase optimization strategy (prioritized)
- Implementation plan with time estimates
- Success criteria and validation commands
- Risk assessment
- Code examples and snippets

**Start Here**: If you're ready to implement optimizations

#### PERF_SUMMARY.txt (2.6 KB)
**Purpose**: Quick reference card

**Contains**:
- Performance journey (4 phases)
- Optimization roadmap
- Key metrics comparison
- Next steps recommendation

**Use Case**: Quick briefing or status check

#### BOTTLENECK_COMPARISON.txt (4.4 KB)
**Purpose**: Side-by-side before/after analysis

**Contains**:
- Top 10 CPU consumers comparison
- Critical observations (4 key insights)
- Performance trajectory visualization
- Decision matrix (6 criteria)
- Next bottleneck recommendation

**Use Case**: Understanding impact of getenv fix

---

## Key Metrics at a Glance

| Metric | Before (getenv bug) | After (fixed) | Change |
|--------|---------------------|---------------|--------|
| **Performance** | 60 M ops/sec | 120-164 M ops/sec | +86-173% |
| **vs glibc** | 43% slower | 15-57% faster | HUGE WIN |
| **Top bottleneck** | getenv 43.96% | hak_tiny_alloc 22.75% | Different |
| **Allocator CPU** | ~69% | ~51% | -18 points |
| **Wasted CPU** | 44% (getenv) | 0% | -44 points |

---

## Top 5 Current Bottlenecks

| Rank | Function | CPU (Self) | Status | Action |
|------|----------|------------|--------|--------|
| 1 | hak_tiny_alloc | 22.75% | ⚠ HIGH | OPTIMIZE |
| 2 | __random | 14.00% | ℹ INFO | Benchmark overhead |
| 3 | mid_desc_lookup | 12.55% | ⚠ MED | Consider optimizing |
| 4 | hak_tiny_owner_slab | 9.09% | ✓ OK | Below threshold |
| 5 | hak_free_at | 11.08% | ℹ INFO | Children time |

**Primary Target**: hak_tiny_alloc (22.75%) - 2.27x above 10% threshold

---

## Optimization Roadmap

### Phase 7.2.5: Eliminate getenv ✓ COMPLETE
- **Status**: Done
- **Impact**: -43.96% CPU, +86-173% throughput
- **Achievement**: 60 → 120-164 M ops/sec

### Phase 7.2.6: Optimize hak_tiny_alloc ← NEXT
- **Target**: 22.75% → ~10% CPU
- **Method**: Inline fast path, reduce stack, cache TLS
- **Expected**: +50-70% throughput (→ 180-220 M ops/sec)
- **Effort**: 2-4 hours

### Phase 7.2.7: Optimize mid_desc_lookup (Optional)
- **Target**: 12.55% → ~6% CPU
- **Method**: Smaller hash table, prefetching
- **Expected**: +10-20% additional throughput
- **Effort**: 1-2 hours

### Phase 7.2.8: Ship It!
- **Condition**: All bottlenecks <10%
- **Expected Performance**: 200-250 M ops/sec (2-3x glibc)
- **Status**: Enable g_wrap_tiny_enabled = 1 by default

---

## Root Cause: hak_tiny_alloc (22.75% CPU)

### Hotspot Breakdown

1. **Heavy stack usage** (10.5% CPU)
   - 88 bytes allocated
   - Multiple stack reads/writes
   - Register spilling

2. **Repeated global reads** (7.2% CPU)
   - g_tiny_initialized (3.52%)
   - g_wrap_tiny_enabled (0.28%)
   - Should cache in TLS

3. **Complex control flow** (5.0% CPU)
   - Size validation branches
   - Magazine refill in main path
   - Should separate fast/slow paths

### Hottest Instructions (from perf annotate)

```asm
3.71%: push %r14                      ← Register pressure
3.52%: mov  g_tiny_initialized,%r14d  ← Global read
3.53%: mov  0x1c(%rsp),%ebp           ← Stack read
3.33%: cmpq $0x80,0x10(%rsp)          ← Size check
3.06%: mov  %rbp,0x38(%rsp)           ← Stack write
```

---

## Proposed Solution

### 1. Inline Fast Path (Priority: HIGH)
**Impact**: -5 to -7% CPU
**Effort**: 2-3 hours

Create inline `hak_tiny_alloc_fast()`:
- Quick size validation
- Direct TLS magazine access
- Fast path for magazine hit (common case)
- Delegate to slow path only for refill

### 2. Reduce Stack Usage (Priority: MEDIUM)
**Impact**: -3 to -4% CPU
**Effort**: 1-2 hours

Reduce from 88 → <32 bytes:
- Fewer local variables
- Pass in registers where possible
- Move rarely-used locals to slow path

### 3. Cache Globals in TLS (Priority: LOW)
**Impact**: -2 to -3% CPU
**Effort**: 1 hour

Cache g_tiny_initialized and g_wrap_tiny_enabled in TLS:
- Read once on TLS init
- Avoid repeated global reads (3.8% CPU saved)

**Total Expected**: -10 to -15% CPU reduction (22.75% → ~10%)

---

## Success Criteria

After optimization, verify:
- [ ] hak_tiny_alloc CPU: 22.75% → <12%
- [ ] Total throughput: 120-164 M → 180-250 M ops/sec
- [ ] Faster than glibc: +70% to +140% (vs current +15-57%)
- [ ] No correctness regressions
- [ ] No new bottleneck >15%

---

## Files to Review/Modify

### Source Code
- `/home/tomoaki/git/hakmem/hakmem_pool.c` - Main implementation
- `/home/tomoaki/git/hakmem/hakmem_pool.h` - Add inline fast path

### Performance Data
- `/home/tomoaki/git/hakmem/perf_post_getenv.data` - Current perf recording
- `/home/tomoaki/git/hakmem/perf_post_getenv_annotate.txt` - Assembly hotspots

### Benchmarks
- `/home/tomoaki/git/hakmem/bench_comprehensive_hakmem` - Test binary
- Run with: `HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem`

---

## Timeline

### Completed (Today)
- [x] Collect fresh perf data post-getenv fix
- [x] Identify new #1 bottleneck (hak_tiny_alloc)
- [x] Analyze root causes via perf annotate
- [x] Compare before/after getenv fix
- [x] Make optimization recommendation
- [x] Create implementation guide

### Next Session (2-4 hours)
- [ ] Implement inline fast path
- [ ] Reduce stack usage
- [ ] Benchmark and validate
- [ ] Collect new perf data
- [ ] Assess if further optimization needed

### Future (Optional, 1-2 hours)
- [ ] Optimize mid_desc_lookup (12.55%)
- [ ] Final validation
- [ ] Enable tiny pool by default
- [ ] Ship it!

---

## Questions?

**Q: Should we stop optimizing and ship now?**
A: No. hak_tiny_alloc at 22.75% is 2.27x above threshold. Clear optimization opportunity with high ROI (50-70% gain for 2-4 hours work).

**Q: What if optimization doesn't work?**
A: Low risk. We can always revert.
Current performance (120-164 M ops/sec) already beats glibc, so we're not making it worse.

**Q: How do we know when to stop?**
A: When top bottleneck falls below 10%, or when effort exceeds returns. Currently at 22.75%, so not there yet.

**Q: What about the other bottlenecks?**
A: mid_desc_lookup (12.55%) is secondary target if time permits. hak_tiny_owner_slab (9.09%) is below 10% threshold and acceptable.

---

## Additional Resources

### Previous Analysis (For Context)
- `PERF_ANALYSIS_RESULTS.md` - Original analysis that identified getenv bug
- `perf_report.txt` - Old data (with getenv bug)
- `perf_annotate_*.txt` - Old annotations

### Benchmark Results
See PERF_POST_GETENV_ANALYSIS.md section "Supporting Data" for:
- Per-test throughput breakdown
- Size class performance (16B, 32B, 64B, 128B)
- Comparison with glibc baseline

---

## Contact

**Project**: HAKMEM Memory Allocator
**Repository**: /home/tomoaki/git/hakmem
**Analysis Date**: 2025-10-26
**Analyst**: Claude Code (Anthropic)

---

**Last Updated**: 2025-10-26 09:08 JST
**Status**: Ready for Phase 7.2.6 Implementation
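
---

## Appendix: Fast-Path Sketch

The inline fast path and TLS caching listed under Proposed Solution could look roughly like the sketch below. It is an illustration under stated assumptions, not the HAKMEM implementation: only `hak_tiny_alloc_fast`, `g_tiny_initialized`, `g_wrap_tiny_enabled`, and the 128-byte size limit (the `cmpq $0x80` check) come from this document. The magazine layout, the 16-byte class granularity, the names `tiny_tls_t`, `tiny_magazine_t`, `hak_tiny_alloc_slow`, and the use of `malloc` as a backing store are all hypothetical. `__attribute__((noinline))` is GCC/Clang-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TINY_MAX_SIZE 128u  /* from the cmpq $0x80 size check in perf annotate */
#define TINY_CLASSES  8u    /* hypothetical: 16-byte granularity up to 128 B */
#define TINY_MAG_CAP  8u    /* hypothetical magazine capacity */

/* Hypothetical per-class magazine: a small LIFO stack of free blocks. */
typedef struct {
    void    *slots[TINY_MAG_CAP];
    unsigned count;
} tiny_magazine_t;

/* Hypothetical per-thread state; `enabled` caches both globals. */
typedef struct {
    int             ready;    /* set once the globals have been sampled */
    int             enabled;  /* g_tiny_initialized && g_wrap_tiny_enabled */
    tiny_magazine_t mags[TINY_CLASSES];
} tiny_tls_t;

/* Stand-ins for the real globals, which live in hakmem_pool.c. */
static int g_tiny_initialized  = 1;
static int g_wrap_tiny_enabled = 1;

static _Thread_local tiny_tls_t t_tiny;  /* zero-initialized per thread */

/* Maps sizes 1..128 to classes 0..7 (assumed mapping). */
static inline unsigned tiny_class_of(size_t size) {
    return (unsigned)((size - 1) >> 4);
}

/* Slow path: samples the globals once per thread and refills the magazine.
 * Kept out of line so its locals do not enlarge the fast path's frame. */
__attribute__((noinline))
static void *hak_tiny_alloc_slow(size_t size) {
    if (!t_tiny.ready) {
        t_tiny.enabled = g_tiny_initialized && g_wrap_tiny_enabled;
        t_tiny.ready = 1;
    }
    if (!t_tiny.enabled || size - 1 >= TINY_MAX_SIZE)
        return NULL;  /* caller would fall back to another allocator */

    unsigned cls = tiny_class_of(size);
    tiny_magazine_t *mag = &t_tiny.mags[cls];
    size_t block = ((size_t)cls + 1) * 16;

    /* malloc is only a placeholder for the real slab refill. */
    while (mag->count < TINY_MAG_CAP) {
        void *p = malloc(block);
        if (!p) break;
        mag->slots[mag->count++] = p;
    }
    assert(mag->count <= TINY_MAG_CAP);
    return mag->count ? mag->slots[--mag->count] : NULL;
}

/* Fast path: one TLS flag read, one size check, one magazine pop. */
static inline void *hak_tiny_alloc_fast(size_t size) {
    /* size - 1 wraps to SIZE_MAX for size == 0, so zero is rejected too. */
    if (t_tiny.enabled && size - 1 < TINY_MAX_SIZE) {
        tiny_magazine_t *mag = &t_tiny.mags[tiny_class_of(size)];
        if (mag->count > 0)
            return mag->slots[--mag->count];
    }
    return hak_tiny_alloc_slow(size);
}
```

The first call on a thread takes the slow path because `t_tiny.enabled` starts at zero; later calls that hit a non-empty magazine never read the globals. Whether this recovers the CPU share estimated above would still need to be confirmed with a fresh perf recording.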