354 lines
10 KiB
Markdown
354 lines
10 KiB
Markdown
|
|
# Post-getenv Fix Performance Analysis
|
||
|
|
|
||
|
|
**Date**: 2025-10-26
|
||
|
|
**Context**: Analysis of performance after fixing the getenv bottleneck
|
||
|
|
**Achievement**: 86% speedup (60 M ops/sec → 120-164 M ops/sec)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Executive Summary
|
||
|
|
|
||
|
|
**VERDICT: OPTIMIZE NEXT BOTTLENECK**
|
||
|
|
|
||
|
|
The getenv fix was hugely successful (48% CPU → ~0%), but revealed that **hak_tiny_alloc is now the #1 bottleneck at 22.75% CPU**. This is well above the 10% threshold and represents a clear optimization opportunity.
|
||
|
|
|
||
|
|
**Recommendation**: Optimize hak_tiny_alloc before enabling tiny pool by default.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Part 1: Top Bottleneck Identification
|
||
|
|
|
||
|
|
### Q1: What is the NEW #1 Bottleneck?
|
||
|
|
|
||
|
|
```
|
||
|
|
Function Name: hak_tiny_alloc
|
||
|
|
CPU Time (Self): 22.75%
|
||
|
|
File: hakmem_pool.c
|
||
|
|
Location: 0x14ec0 <hak_tiny_alloc>
|
||
|
|
Type: Actual CPU time (not just call overhead)
|
||
|
|
```
|
||
|
|
|
||
|
|
**Key Hotspot Instructions** (from perf annotate):
|
||
|
|
- `3.52%`: `mov 0x14a263(%rip),%r14d # g_tiny_initialized` - Global read
|
||
|
|
- `3.71%`: `push %r14` - Register spill
|
||
|
|
- `3.53%`: `mov 0x1c(%rsp),%ebp` - Stack access
|
||
|
|
- `3.33%`: `cmpq $0x80,0x10(%rsp)` - Size comparison
|
||
|
|
- `3.06%`: `mov %rbp,0x38(%rsp)` - More stack writes
|
||
|
|
|
||
|
|
**Analysis**: Heavy register pressure and stack usage. The function has significant preamble overhead.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Q2: Top 5 Hotspots (Post-getenv Fix)
|
||
|
|
|
||
|
|
Based on **Self CPU%** (actual time spent in function, not children):
|
||
|
|
|
||
|
|
```
|
||
|
|
1. hak_tiny_alloc: 22.75% ← NEW #1 BOTTLENECK
|
||
|
|
2. __random: 14.00% ← Benchmark overhead (rand() calls)
|
||
|
|
3. mid_desc_lookup: 12.55% ← Hash table lookup for mid-size pool
|
||
|
|
4. hak_tiny_owner_slab: 9.09% ← Slab ownership lookup
|
||
|
|
5. hak_free_at: 11.08% ← Free path overhead (children time, but some self)
|
||
|
|
```
|
||
|
|
|
||
|
|
**Allocation-specific bottlenecks** (excluding benchmark rand()):
|
||
|
|
1. hak_tiny_alloc: 22.75%
|
||
|
|
2. mid_desc_lookup: 12.55%
|
||
|
|
3. hak_tiny_owner_slab: 9.09%
|
||
|
|
|
||
|
|
Total allocator CPU after removing getenv: **~44% self time** in core allocator functions.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Q3: Is Optimization Worth It?
|
||
|
|
|
||
|
|
**Decision Criteria Check**:
|
||
|
|
- Top bottleneck CPU%: **22.75%**
|
||
|
|
- Threshold: 10%
|
||
|
|
- **Result: 22.75% >> 10% → WORTH OPTIMIZING**
|
||
|
|
|
||
|
|
**Justification**:
|
||
|
|
- hak_tiny_alloc is 2.27x above the threshold
|
||
|
|
- It's a core allocation path (called millions of times)
|
||
|
|
- Already achieving 120-164 M ops/sec; could reach 150-200+ M ops/sec with optimization
|
||
|
|
- Second bottleneck (mid_desc_lookup at 12.55%) is also above threshold
|
||
|
|
|
||
|
|
**Recommendation**: **[OPTIMIZE]** - Don't stop yet, there's clear low-hanging fruit.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Part 3: Before/After Comparison Table
|
||
|
|
|
||
|
|
| Function | Old % (with getenv) | New % (post-getenv) | Change | Notes |
|
||
|
|
|----------|---------------------|---------------------|---------|-------|
|
||
|
|
| **getenv + strcmp** | **43.96%** | **~0.00%** | **-43.96%** | ELIMINATED! |
|
||
|
|
| hak_tiny_alloc | 10.16% (Children) | **22.75%** (Self) | **+12.59%** | Now visible as #1 bottleneck |
|
||
|
|
| __random | 14.00% | 14.00% | 0.00% | Benchmark overhead (unchanged) |
|
||
|
|
| mid_desc_lookup | 7.58% (Children) | **12.55%** (Self) | **+4.97%** | More visible now |
|
||
|
|
| hak_tiny_owner_slab | 5.21% (Children) | **9.09%** (Self) | **+3.88%** | More visible now |
|
||
|
|
| hak_pool_mid_lookup | ~2.06% | 2.06% (Children) | ~0.00% | Unchanged |
|
||
|
|
| hak_elo_get_threshold | N/A | 3.27% | +3.27% | Newly visible |
|
||
|
|
|
||
|
|
**Key Insights**:
|
||
|
|
1. **getenv elimination was massive**: Freed up ~44% CPU
|
||
|
|
2. **Allocator functions now dominate**: hak_tiny_alloc, mid_desc_lookup, hak_tiny_owner_slab are the new hotspots
|
||
|
|
3. **Good news**: No single overwhelming bottleneck - performance is more balanced
|
||
|
|
4. **Bad news**: hak_tiny_alloc at 22.75% is still quite high
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Part 4: Root Cause Analysis of hak_tiny_alloc
|
||
|
|
|
||
|
|
### Hotspot Breakdown (from perf annotate)
|
||
|
|
|
||
|
|
**Top expensive operations in hak_tiny_alloc**:
|
||
|
|
|
||
|
|
1. **Global variable reads** (7.23% total):
|
||
|
|
- `3.52%`: Read `g_tiny_initialized`
|
||
|
|
- `3.71%`: Register pressure (push %r14)
|
||
|
|
|
||
|
|
2. **Stack operations** (10.45% total):
|
||
|
|
- `3.53%`: `mov 0x1c(%rsp),%ebp`
|
||
|
|
- `3.33%`: `cmpq $0x80,0x10(%rsp)`
|
||
|
|
- `3.06%`: `mov %rbp,0x38(%rsp)`
|
||
|
|
- `0.59%`: Other stack accesses
|
||
|
|
|
||
|
|
3. **Branching/conditionals** (2.51% total):
|
||
|
|
- `0.28%`: `test %r13d,%r13d` (wrap_tiny_enabled check)
|
||
|
|
- `0.60%`: `test %r14d,%r14d` (initialized check)
|
||
|
|
- Other branch costs
|
||
|
|
|
||
|
|
4. **Hash/index computation** (3.13% total):
|
||
|
|
- `3.06%`: `lzcnt` for bin index calculation
|
||
|
|
|
||
|
|
### Root Causes
|
||
|
|
|
||
|
|
1. **Heavy stack usage**: Function uses 0x58 (88) bytes of stack
|
||
|
|
- Suggests many local variables
|
||
|
|
- Register spilling due to pressure
|
||
|
|
- Could benefit from inlining or refactoring
|
||
|
|
|
||
|
|
2. **Repeated global reads**:
|
||
|
|
- `g_tiny_initialized`, `g_wrap_tiny_enabled` read on every call
|
||
|
|
- Should be cached or checked once
|
||
|
|
|
||
|
|
3. **Complex control flow**:
|
||
|
|
- Multiple early exit paths
|
||
|
|
- Size class calculation overhead
|
||
|
|
- Magazine/superslab logic adds branches
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Part 4: Optimization Recommendations
|
||
|
|
|
||
|
|
### Option A: Optimize hak_tiny_alloc (RECOMMENDED)
|
||
|
|
|
||
|
|
**Target**: Reduce hak_tiny_alloc from 22.75% to ~10-12%
|
||
|
|
|
||
|
|
**Proposed Optimizations** (Priority Order):
|
||
|
|
|
||
|
|
#### 1. **Inline Fast Path** (Expected: -5-7% CPU)
|
||
|
|
**Complexity**: Medium
|
||
|
|
**Impact**: High
|
||
|
|
|
||
|
|
- Create `hak_tiny_alloc_fast()` inline function for common case
|
||
|
|
- Move size validation and bin calculation inline
|
||
|
|
- Only call full `hak_tiny_alloc()` for slow path (empty magazines, initialization)
|
||
|
|
|
||
|
|
```c
|
||
|
|
static inline void* hak_tiny_alloc_fast(size_t size) {
|
||
|
|
if (size > 1024) return NULL; // Fast rejection
|
||
|
|
|
||
|
|
// Cache globals (compiler should optimize)
|
||
|
|
if (!g_tiny_initialized) return hak_tiny_alloc(size);
|
||
|
|
if (!g_wrap_tiny_enabled) return hak_tiny_alloc(size);
|
||
|
|
|
||
|
|
// Inline bin calculation
|
||
|
|
unsigned bin = SIZE_TO_BIN_FAST(size);
|
||
|
|
mag_t* mag = TLS_GET_MAG(bin);
|
||
|
|
|
||
|
|
if (mag && mag->count > 0) {
|
||
|
|
return mag->objects[--mag->count]; // Fast path!
|
||
|
|
}
|
||
|
|
|
||
|
|
return hak_tiny_alloc(size); // Slow path
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
#### 2. **Reduce Stack Usage** (Expected: -3-4% CPU)
|
||
|
|
**Complexity**: Low
|
||
|
|
**Impact**: Medium
|
||
|
|
|
||
|
|
- Current: 88 bytes (0x58) of stack
|
||
|
|
- Target: <32 bytes
|
||
|
|
- Use fewer local variables
|
||
|
|
- Pass parameters in registers where possible
|
||
|
|
|
||
|
|
#### 3. **Cache Global Flags in TLS** (Expected: -2-3% CPU)
|
||
|
|
**Complexity**: Low
|
||
|
|
**Impact**: Low-Medium
|
||
|
|
|
||
|
|
```c
|
||
|
|
// In TLS structure
|
||
|
|
struct tls_cache {
|
||
|
|
bool tiny_initialized;
|
||
|
|
bool wrap_enabled;
|
||
|
|
mag_t* mags[NUM_BINS];
|
||
|
|
};
|
||
|
|
|
||
|
|
// Read once on TLS init, avoid global reads
|
||
|
|
```
|
||
|
|
|
||
|
|
#### 4. **Optimize lzcnt Path** (Expected: -1-2% CPU)
|
||
|
|
**Complexity**: Medium
|
||
|
|
**Impact**: Low
|
||
|
|
|
||
|
|
- Use lookup table for small sizes (≤128 bytes)
|
||
|
|
- Only use lzcnt for larger allocations
|
||
|
|
|
||
|
|
**Total Expected Impact**: -11 to -16% CPU reduction
|
||
|
|
**New hak_tiny_alloc CPU**: ~7-12% (acceptable)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
#### 5. **BONUS: Optimize mid_desc_lookup** (Expected: -4-6% CPU)
|
||
|
|
**Complexity**: Medium
|
||
|
|
**Impact**: Medium
|
||
|
|
|
||
|
|
**Current**: 12.55% CPU - hash table lookup for mid-size pool
|
||
|
|
|
||
|
|
**Hottest instruction** (45.74% of mid_desc_lookup time):
|
||
|
|
```asm
|
||
|
|
9029: mov (%rcx,%rbp,8),%rax # 45.74% - Cache miss on hash table lookup
|
||
|
|
```
|
||
|
|
|
||
|
|
**Root cause**: Hash table bucket read causes cache misses
|
||
|
|
|
||
|
|
**Optimization**:
|
||
|
|
- Use smaller hash table (better cache locality)
|
||
|
|
- Prefetch next bucket during hash computation
|
||
|
|
- Consider direct mapped cache for recent lookups
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Option B: Done - Enable Tiny Pool Default
|
||
|
|
|
||
|
|
**Reason**: Current performance (120-164 M ops/sec) already beats glibc (105 M ops/sec)
|
||
|
|
|
||
|
|
**Arguments for stopping**:
|
||
|
|
- 86% improvement already achieved
|
||
|
|
- Beats competitive allocator (glibc)
|
||
|
|
- Could ship as "good enough"
|
||
|
|
|
||
|
|
**Arguments against**:
|
||
|
|
- Still have 22.75% bottleneck (well above 10% threshold)
|
||
|
|
- Could achieve 50-70% additional improvement with moderate effort
|
||
|
|
- Would dominate glibc by even wider margin (150-200 M ops/sec possible)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Part 5: Final Recommendation
|
||
|
|
|
||
|
|
### RECOMMENDATION: **OPTION A - Optimize Next Bottleneck**
|
||
|
|
|
||
|
|
**Bottleneck**: hak_tiny_alloc (22.75% CPU)
|
||
|
|
**Expected gain**: 50-70% additional speedup
|
||
|
|
**Effort**: Medium (2-4 hours of work)
|
||
|
|
**Timeline**: Same day
|
||
|
|
|
||
|
|
### Implementation Plan
|
||
|
|
|
||
|
|
**Phase 1: Quick Wins** (1-2 hours)
|
||
|
|
1. Inline fast path for hak_tiny_alloc
|
||
|
|
2. Reduce stack usage from 88 → 32 bytes
|
||
|
|
3. Expected: 120-164 M → 160-220 M ops/sec
|
||
|
|
|
||
|
|
**Phase 2: Medium Optimizations** (1-2 hours)
|
||
|
|
4. Cache globals in TLS
|
||
|
|
5. Optimize size-to-bin calculation with lookup table
|
||
|
|
6. Expected: Additional 10-20% gain
|
||
|
|
|
||
|
|
**Phase 3: Polish** (Optional, 1 hour)
|
||
|
|
7. Optimize mid_desc_lookup hash table
|
||
|
|
8. Expected: Additional 5-10% gain
|
||
|
|
|
||
|
|
**Target Performance**: 180-250 M ops/sec (2-3x faster than glibc)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Supporting Data
|
||
|
|
|
||
|
|
### Benchmark Results (Post-getenv Fix)
|
||
|
|
|
||
|
|
```
|
||
|
|
Test 1 (LIFO 16B): 118.21 M ops/sec
|
||
|
|
Test 2 (FIFO 16B): 119.19 M ops/sec
|
||
|
|
Test 3 (Random 16B): 78.65 M ops/sec ← Bottlenecked by rand()
|
||
|
|
Test 4 (Interleaved): 117.50 M ops/sec
|
||
|
|
Test 6 (Long-lived): 115.58 M ops/sec
|
||
|
|
|
||
|
|
32B tests: 61-84 M ops/sec
|
||
|
|
64B tests: 86-140 M ops/sec
|
||
|
|
128B tests: 78-114 M ops/sec
|
||
|
|
Mixed sizes: 162.07 M ops/sec ← BEST!
|
||
|
|
|
||
|
|
Average: ~110 M ops/sec
|
||
|
|
Peak: 164 M ops/sec (mixed sizes)
|
||
|
|
Glibc baseline: 105 M ops/sec
|
||
|
|
```
|
||
|
|
|
||
|
|
**Current standing**: 5-57% faster than glibc (size-dependent)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Perf Data Excerpts
|
||
|
|
|
||
|
|
### New Top Functions (Self CPU%)
|
||
|
|
```
|
||
|
|
22.75% hak_tiny_alloc ← #1 Target
|
||
|
|
14.00% __random ← Benchmark overhead
|
||
|
|
12.55% mid_desc_lookup ← #2 Target
|
||
|
|
9.09% hak_tiny_owner_slab ← #3 Target
|
||
|
|
11.08% hak_free_at (children) ← Composite
|
||
|
|
3.27% hak_elo_get_threshold
|
||
|
|
2.06% hak_pool_mid_lookup
|
||
|
|
1.79% hak_l25_lookup
|
||
|
|
```
|
||
|
|
|
||
|
|
### hak_tiny_alloc Hottest Instructions
|
||
|
|
```
|
||
|
|
3.71%: push %r14 ← Register pressure
|
||
|
|
3.52%: mov g_tiny_initialized,%r14d ← Global read
|
||
|
|
3.53%: mov 0x1c(%rsp),%ebp ← Stack read
|
||
|
|
3.33%: cmpq $0x80,0x10(%rsp) ← Size check
|
||
|
|
3.06%: mov %rbp,0x38(%rsp) ← Stack write
|
||
|
|
```
|
||
|
|
|
||
|
|
### mid_desc_lookup Hottest Instruction
|
||
|
|
```
|
||
|
|
45.74%: mov (%rcx,%rbp,8),%rax ← Hash table lookup (cache miss!)
|
||
|
|
```
|
||
|
|
|
||
|
|
This single instruction accounts for **5.74% of total CPU** (45.74% of 12.55%)!
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Conclusion
|
||
|
|
|
||
|
|
**Stop or Continue?**: **CONTINUE OPTIMIZING**
|
||
|
|
|
||
|
|
The getenv fix was a massive win, but we're leaving significant performance on the table:
|
||
|
|
- hak_tiny_alloc: 22.75% (can reduce to ~10%)
|
||
|
|
- mid_desc_lookup: 12.55% (can reduce to ~6-8%)
|
||
|
|
- Combined potential: 50-70% additional speedup
|
||
|
|
|
||
|
|
**With optimizations, HAKMEM tiny pool could reach 180-250 M ops/sec** - making it 2-3x faster than glibc instead of just 1.5x.
|
||
|
|
|
||
|
|
**Effort is justified** given:
|
||
|
|
1. Clear bottlenecks above 10% threshold
|
||
|
|
2. Medium complexity (not diminishing returns yet)
|
||
|
|
3. High impact potential
|
||
|
|
4. Clean optimization opportunities (inlining, caching, lookup tables)
|
||
|
|
|
||
|
|
**Let's do Phase 1 quick wins and reassess!**
|