# HAKMEM Tiny Pool - Next Optimization Steps

## Quick Status

**Current Performance**: 120-164 M ops/sec (15-57% faster than glibc)
**Target Performance**: 180-250 M ops/sec (70-140% faster than glibc)
**Gap**: 40-86 M ops/sec achievable with next optimization

---

## Bottleneck Identified: hak_tiny_alloc (22.75% CPU)

### Root Causes (from perf annotate)

1. **Heavy stack usage** (10.5% CPU):
   - 88 bytes (0x58) of stack allocated
   - Multiple stack reads/writes per call
   - Register spilling due to pressure

2. **Repeated global reads** (7.2% CPU):
   - `g_tiny_initialized` read every call (3.52%)
   - `g_wrap_tiny_enabled` read every call (0.28%)
   - Should cache in TLS or check once

3. **Complex control flow** (5.0% CPU):
   - Multiple branches for size validation
   - Magazine refill logic in main path
   - Should separate fast/slow paths

### Optimization Strategy (Priority Order)

#### 1. Inline Fast Path (HIGH PRIORITY)
**Impact**: -5 to -7% CPU
**Effort**: 2-3 hours
**Complexity**: Medium

Create `hak_tiny_alloc_fast()` inline helper:
```c
static inline void* hak_tiny_alloc_fast(size_t size) {
    // Quick validation
    if (UNLIKELY(size > MAX_TINY_SIZE))
        return hak_tiny_alloc(size);

    // Inline bin calculation
    unsigned bin = size_to_bin_fast(size);

    // Fast path: TLS magazine hit
    mag_t* mag = &g_tls_mags[bin];
    if (LIKELY(mag->count > 0))
        return mag->objects[--mag->count];

    // Slow path: refill magazine
    return hak_tiny_alloc_slow(size, bin);
}
```

**Key optimizations**:
- Move size validation inline (avoid function call for common case)
- Direct TLS magazine access (no indirection)
- Separate slow path (refill) to different function (better branch prediction)
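
The fast/slow split described above can be tried in isolation. The following is a minimal, self-contained sketch, not HAKMEM's implementation: every `sketch_` name, the single-bin magazine, its capacity, and the use of `malloc` as a stand-in for a slab refill are assumptions made for illustration. It relies on the GCC/Clang extensions `__builtin_expect` and `__attribute__((noinline, cold))`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* All names here are hypothetical stand-ins, not HAKMEM definitions. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

enum { SKETCH_MAG_CAP = 8, SKETCH_OBJ_SIZE = 64 };

typedef struct {
    unsigned count;                   /* objects currently cached */
    void*    objects[SKETCH_MAG_CAP]; /* LIFO stack of free objects */
} sketch_mag_t;

static _Thread_local sketch_mag_t sketch_mag;     /* one bin only, for brevity */
static _Thread_local unsigned     sketch_refills; /* counts slow-path entries */

/* Slow path: kept out of line so the inlined fast path stays tiny. */
__attribute__((noinline, cold))
static void* sketch_alloc_slow(void) {
    assert(sketch_mag.count == 0);    /* only entered on a magazine miss */
    sketch_refills++;
    while (sketch_mag.count < SKETCH_MAG_CAP) {
        void* p = malloc(SKETCH_OBJ_SIZE); /* stand-in for a slab refill */
        if (p == NULL)
            break;
        sketch_mag.objects[sketch_mag.count++] = p;
    }
    if (sketch_mag.count == 0)
        return NULL;                  /* refill failed entirely */
    return sketch_mag.objects[--sketch_mag.count];
}

/* Fast path: one TLS load, one compare, one pop. */
static inline void* sketch_alloc_fast(void) {
    if (LIKELY(sketch_mag.count > 0))
        return sketch_mag.objects[--sketch_mag.count];
    return sketch_alloc_slow();
}
```

With a capacity of 8, the first call takes the slow path once and the next seven calls are served from the magazine without touching `sketch_alloc_slow()`.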

#### 2. Reduce Stack Usage (MEDIUM PRIORITY)
**Impact**: -3 to -4% CPU
**Effort**: 1-2 hours
**Complexity**: Low

Current: 88 bytes stack
Target: <32 bytes stack

**Actions**:
- Audit local variables in `hak_tiny_alloc()`
- Use fewer temporaries
- Pass calculated values as function params (use registers)
- Move rarely-used locals to slow path
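
The last two actions can be illustrated with a small sketch. It is an assumption-laden example rather than the real allocator: the `sketch_` names, the batch size, and the static arena are hypothetical. The point is only that the bulky temporary lives in the non-inlined refill function, while the hot function keeps scalars and forwards already-computed values as arguments. If you build with GCC, its `-fstack-usage` option can report per-function frame sizes to confirm the effect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names; this only illustrates the frame-size technique. */
enum { SKETCH_BATCH = 16, SKETCH_MAX_OBJ = 32 };

static void*  sketch_cache[SKETCH_BATCH];   /* stand-in for a magazine */
static size_t sketch_cached;
static char   sketch_arena[SKETCH_BATCH * SKETCH_MAX_OBJ];

/* Cold path: the temporary array exists only in this frame. */
__attribute__((noinline))
static void* sketch_refill(size_t bin, size_t obj_size) {
    void* batch[SKETCH_BATCH];              /* bulky local, slow path only */
    (void)bin;                              /* a real refill would pick a slab by bin */
    assert(obj_size > 0 && obj_size <= SKETCH_MAX_OBJ);
    for (size_t i = 0; i < SKETCH_BATCH; i++)
        batch[i] = sketch_arena + i * obj_size;
    memcpy(sketch_cache, batch, sizeof batch);
    sketch_cached = SKETCH_BATCH;
    return sketch_cache[--sketch_cached];
}

/* Hot path: only scalars, and bin/obj_size arrive as arguments
 * instead of being recomputed or spilled to the stack. */
static inline void* sketch_alloc(size_t bin, size_t obj_size) {
    if (sketch_cached > 0)
        return sketch_cache[--sketch_cached];
    return sketch_refill(bin, obj_size);
}
```

Because `sketch_refill()` is marked `noinline`, the compiler cannot merge its frame back into the caller, so the hot function's frame is not inflated by the batch array.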

#### 3. Cache Globals in TLS (LOW PRIORITY)
**Impact**: -2 to -3% CPU
**Effort**: 1 hour
**Complexity**: Low

Instead of reading `g_tiny_initialized` on every call:
```c
// In TLS init
tls->tiny_available = g_tiny_initialized && g_wrap_tiny_enabled;

// In allocation fast path
if (UNLIKELY(!tls->tiny_available))
    return fallback_alloc(size);
```

**Benefit**: Removes 2 global reads per allocation (3.8% CPU saved)
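
One way to make the cached flag safe against threads that start before initialization finishes is to latch it only once both globals are set. The sketch below is an illustration under assumptions: the globals are modeled as plain `int`s (their real types are not given in this document), the `sketch_` names are hypothetical, and it assumes both flags only ever change from 0 to 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Names taken from the text; the int types are an assumption. */
static int g_tiny_initialized;
static int g_wrap_tiny_enabled;

/* Hypothetical per-thread cache of "both flags are set". */
static _Thread_local bool sketch_tiny_available;

static inline bool sketch_tiny_usable(void) {
    /* Hot path: a single thread-local read once the flag is latched. */
    if (__builtin_expect(sketch_tiny_available, 1))
        return true;
    /* Cold path: re-read the globals until both are set, then latch. */
    sketch_tiny_available = g_tiny_initialized && g_wrap_tiny_enabled;
    return sketch_tiny_available;
}
```

The trade-off is that a latched thread no longer notices if a flag is later cleared, so this only fits flags that are never turned off at runtime.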

#### 4. Optimize Size-to-Bin (OPTIONAL)
**Impact**: -1 to -2% CPU
**Effort**: 1 hour
**Complexity**: Medium

Current: Uses `lzcnt` (3.06% CPU in that instruction alone)

**Option A**: Lookup table for sizes ≤128 bytes
```c
// 129 entries so that index 128 (sz == 128) stays in bounds
static const uint8_t size_to_bin_table[129] = {
    0, 0, 0, 0, 0, 0, 0, 0,  // 0-7   → bin 0
    1, 1, 1, 1, 1, 1, 1, 1,  // 8-15  → bin 1
    2, 2, 2, 2, 2, 2, 2, 2,  // 16-23 → bin 2
    // ...
};

static inline unsigned size_to_bin_fast(size_t sz) {
    if (sz <= 128)
        return size_to_bin_table[sz];
    return size_to_bin_lzcnt(sz);  // Fallback
}
```

**Option B**: Use multiply-shift instead of lzcnt
- May be faster on older CPUs
- Profile to confirm
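
Whichever option is chosen, the fast mapping can be checked exhaustively against the existing one before it ships. The sketch below shows that idea only. The bin layout it uses (power-of-two classes starting at 8 bytes, so 1-8 → bin 0, 9-16 → bin 1, and so on) is an assumption based on the size classes listed under Risk Assessment and may not match the real bins or the table above; all `sketch_` names are hypothetical. The table is generated from the reference function rather than typed by hand, and it uses the GCC/Clang builtin `__builtin_clzll`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_TABLE_MAX = 128 };

/* Reference mapping (assumed layout): classes 8, 16, 32, ... */
static inline unsigned sketch_bin_clz(size_t sz) {
    if (sz <= 8)
        return 0;
    /* Bits needed to represent sz - 1, i.e. ceil(log2(sz)) for sz > 1. */
    unsigned bits = 64u - (unsigned)__builtin_clzll((unsigned long long)(sz - 1));
    return bits - 3u;
}

/* One entry per size 0..128 inclusive, hence SKETCH_TABLE_MAX + 1 entries. */
static uint8_t sketch_bin_table[SKETCH_TABLE_MAX + 1];

static void sketch_bin_table_init(void) {
    for (size_t sz = 0; sz <= SKETCH_TABLE_MAX; sz++)
        sketch_bin_table[sz] = (uint8_t)sketch_bin_clz(sz);
}

static inline unsigned sketch_bin_fast(size_t sz) {
    if (sz <= SKETCH_TABLE_MAX)
        return sketch_bin_table[sz];
    return sketch_bin_clz(sz);        /* fallback for larger sizes */
}

/* Exhaustive comparison: returns how many sizes map differently. */
static size_t sketch_bin_mismatches(size_t max_size) {
    size_t bad = 0;
    for (size_t sz = 0; sz <= max_size; sz++)
        if (sketch_bin_fast(sz) != sketch_bin_clz(sz))
            bad++;
    return bad;
}
```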

---

## Implementation Plan

### Phase 1: Quick Wins (2-3 hours) ← START HERE
- [ ] Inline fast path (`hak_tiny_alloc_fast()`)
- [ ] Reduce stack usage (88 → 32 bytes)
- Expected: 120-164 M → 160-220 M ops/sec

### Phase 2: Polish (1-2 hours)
- [ ] Cache globals in TLS
- [ ] Optimize size-to-bin with lookup table
- Expected: Additional 10-20% gain

### Phase 3: Validate (30 min)
- [ ] Run comprehensive benchmark
- [ ] Collect new perf data
- [ ] Verify hak_tiny_alloc reduced from 22.75% → ~10%

**Total time**: 3-6 hours
**Total impact**: +50-70% throughput improvement

---

## Success Criteria

After optimization, verify:
- [ ] hak_tiny_alloc CPU: 22.75% → <12% (primary goal)
- [ ] Total throughput: 120-164 M → 180-250 M ops/sec
- [ ] Faster than glibc: +70% to +140% (vs current +15-57%)
- [ ] No correctness regressions (all tests pass)
- [ ] No new bottleneck >10% (except benchmark overhead like rand())
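
The first two criteria can be related with a rough Amdahl-style estimate. The helper below is a sketch under stated assumptions: the perf percentages are treated as shares of total benchmark time, and all work outside the one function is assumed unchanged. Under those assumptions, a drop from 22.75% to 10% corresponds to roughly a 1.17x speedup, and removing the function entirely to roughly 1.29x. That suggests the throughput target depends on more than this one symbol's own samples (for example, callers getting cheaper once the fast path is inlined), so it is worth measuring rather than extrapolating. The function name is hypothetical.

```c
#include <assert.h>

/* Whole-benchmark speedup when one function's share of total time
 * drops from old_share to new_share and all other work is unchanged.
 * Derivation: (1 - old_share) * T_old == (1 - new_share) * T_new,
 * so T_old / T_new == (1 - new_share) / (1 - old_share). */
static double sketch_speedup(double old_share, double new_share) {
    assert(old_share >= 0.0 && old_share < 1.0);
    assert(new_share >= 0.0 && new_share < 1.0);
    return (1.0 - new_share) / (1.0 - old_share);
}
```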

---

## Files to Modify

**Primary**:
- `/home/tomoaki/git/hakmem/hakmem_pool.c` - Main allocation logic
  - `hak_tiny_alloc()` - Current slow implementation
  - Add `hak_tiny_alloc_fast()` - New inline fast path
  - Add `hak_tiny_alloc_slow()` - Refill logic

**Supporting**:
- `/home/tomoaki/git/hakmem/hakmem_pool.h` - Add inline function
- Check if size-to-bin logic is in separate function

---

## Validation Commands

After implementing optimizations:

```bash
# Rebuild
make clean && make bench_comprehensive_hakmem

# Benchmark
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem

# Perf analysis
HAKMEM_WRAP_TINY=1 perf record -g --call-graph dwarf \
  -o perf_post_tiny_opt.data ./bench_comprehensive_hakmem

perf report --stdio -i perf_post_tiny_opt.data --sort=symbol \
  | grep -E "^\s+[0-9]+\.[0-9]+%"
```

Check that:
1. hak_tiny_alloc is below 12%
2. No new bottleneck above 15% has appeared
3. Throughput improved by 40-80 M ops/sec

---

## Risk Assessment

**Low Risk**:
- Inline fast path: Only affects common path, easy to test
- Stack reduction: Mechanical refactoring
- TLS caching: Simple optimization

**Medium Risk**:
- Size-to-bin lookup table: Need to ensure correctness of mapping
- Separation of fast/slow paths: Need to ensure all cases are covered

**Mitigation**:
- Test all size classes (8, 16, 32, 64, 128, 256, 512, 1024 bytes)
- Run comprehensive benchmark suite
- Verify with perf that CPU time moved to the expected places
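
The size-class test in the first mitigation item could look like the sketch below. It is a generic harness under assumptions, not an existing HAKMEM test: the `sketch_` names are hypothetical, the allocator under test is passed in as function pointers, and the number of live blocks per size is arbitrary. It probes each class size plus its two neighbors to catch off-by-one binning, fills every block with a distinct byte, and verifies the fills afterwards so that overlapping blocks show up as clobbered data. Calling `sketch_check_classes(malloc, free)` exercises the harness itself, and the HAKMEM entry points would be passed the same way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef void* (*sketch_alloc_fn)(size_t);
typedef void  (*sketch_free_fn)(void*);

/* Size classes as listed in the mitigation item above. */
static const size_t sketch_classes[] = {8, 16, 32, 64, 128, 256, 512, 1024};
enum {
    SKETCH_NCLASS = sizeof sketch_classes / sizeof sketch_classes[0],
    SKETCH_LIVE   = 64   /* blocks kept alive at once per probed size */
};

/* Returns 0 on success, or the first probed size that failed. */
static size_t sketch_check_classes(sketch_alloc_fn alloc, sketch_free_fn release) {
    assert(alloc != NULL && release != NULL);
    for (size_t c = 0; c < SKETCH_NCLASS; c++) {
        /* Probe class - 1, class, class + 1. */
        for (size_t sz = sketch_classes[c] - 1; sz <= sketch_classes[c] + 1; sz++) {
            unsigned char* live[SKETCH_LIVE];
            size_t n = 0;
            int ok = 1;
            while (n < SKETCH_LIVE) {
                unsigned char* p = alloc(sz);
                if (p == NULL) { ok = 0; break; }
                memset(p, (int)(n + 1), sz);   /* distinct fill per block */
                live[n++] = p;
            }
            for (size_t i = 0; i < n; i++) {
                /* A clobbered first or last byte means two blocks overlap. */
                if (live[i][0] != (unsigned char)(i + 1) ||
                    live[i][sz - 1] != (unsigned char)(i + 1))
                    ok = 0;
                release(live[i]);
            }
            if (!ok)
                return sz;
        }
    }
    return 0;
}
```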

---

## Reference Data

### Current Hottest Instructions in hak_tiny_alloc

```asm
3.71%: push %r14                      ← Register spill
3.52%: mov  g_tiny_initialized,%r14d  ← Global read
3.53%: mov  0x1c(%rsp),%ebp           ← Stack read
3.33%: cmpq $0x80,0x10(%rsp)          ← Size check
3.06%: mov  %rbp,0x38(%rsp)           ← Stack write
```

After optimization, expect:
- Push/pop overhead: reduced (fewer registers needed in fast path)
- Global reads: eliminated (cached in TLS)
- Stack access: reduced (smaller frame)
- Size checks: inlined (better prediction)

### Expected New Bottlenecks After Fix

If hak_tiny_alloc drops to ~10%, the next hottest will be:
1. mid_desc_lookup: 12.55% (may become #1)
2. hak_tiny_owner_slab: 9.09% (close to threshold)

Both are below 15%, so diminishing returns are starting to apply.

---

## Questions Before Starting?

1. Do we want to target 10% or 12% for hak_tiny_alloc?
   - 10%: More aggressive, may need all optimizations
   - 12%: Conservative, Phase 1 may be enough

2. Should we also tackle mid_desc_lookup (12.55%)?
   - Could do both in one session
   - Hash table optimization is independent

3. When to stop optimizing?
   - When top bottleneck <10%?
   - When beating glibc by 2x (target achieved)?
   - When effort exceeds returns?

**Recommendation**: Do Phase 1 (fast path inline), measure, then decide on Phase 2.

---

## Ready to Start?

Next command to run:
```bash
# Read current implementation
grep -A 200 "hak_tiny_alloc" /home/tomoaki/git/hakmem/hakmem_pool.c
```

Then start implementing the inline fast path!