# HAKMEM Tiny Pool - Next Optimization Steps

## Quick Status

**Current Performance**: 120-164 M ops/sec (15-57% faster than glibc)
**Target Performance**: 180-250 M ops/sec (70-140% faster than glibc)
**Gap**: 40-86 M ops/sec, achievable with the next round of optimizations

---

## Bottleneck Identified: hak_tiny_alloc (22.75% CPU)

### Root Causes (from perf annotate)

1. **Heavy stack usage** (10.5% CPU):
   - 88 bytes (0x58) of stack allocated
   - Multiple stack reads/writes per call
   - Register spilling due to pressure

2. **Repeated global reads** (7.2% CPU):
   - `g_tiny_initialized` read on every call (3.52%)
   - `g_wrap_tiny_enabled` read on every call (0.28%)
   - Should be cached in TLS or checked once

3. **Complex control flow** (5.0% CPU):
   - Multiple branches for size validation
   - Magazine refill logic sits in the main path
   - Should be split into separate fast and slow paths

### Optimization Strategy (Priority Order)

#### 1. Inline Fast Path (HIGH PRIORITY)

**Impact**: -5 to -7% CPU
**Effort**: 2-3 hours
**Complexity**: Medium

Create a `hak_tiny_alloc_fast()` inline helper:

```c
static inline void* hak_tiny_alloc_fast(size_t size) {
    // Quick validation
    if (UNLIKELY(size > MAX_TINY_SIZE))
        return hak_tiny_alloc(size);

    // Inline bin calculation
    unsigned bin = size_to_bin_fast(size);

    // Fast path: TLS magazine hit
    mag_t* mag = &g_tls_mags[bin];
    if (LIKELY(mag->count > 0))
        return mag->objects[--mag->count];

    // Slow path: refill magazine
    return hak_tiny_alloc_slow(size, bin);
}
```

**Key optimizations**:
- Move size validation inline (avoid a function call in the common case)
- Direct TLS magazine access (no indirection)
- Separate the slow path (refill) into its own function for better branch prediction (a hedged sketch of the slow path appears at the end of this document)

#### 2. Reduce Stack Usage (MEDIUM PRIORITY)

**Impact**: -3 to -4% CPU
**Effort**: 1-2 hours
**Complexity**: Low

Current: 88 bytes of stack
Target: <32 bytes of stack

**Actions**:
- Audit local variables in `hak_tiny_alloc()`
- Use fewer temporaries
- Pass calculated values as function parameters (keep them in registers)
- Move rarely-used locals to the slow path

#### 3. Cache Globals in TLS (LOW PRIORITY)

**Impact**: -2 to -3% CPU
**Effort**: 1 hour
**Complexity**: Low

Instead of reading `g_tiny_initialized` on every call:

```c
// In TLS init
tls->tiny_available = g_tiny_initialized && g_wrap_tiny_enabled;

// In allocation fast path
if (UNLIKELY(!tls->tiny_available))
    return fallback_alloc(size);
```

**Benefit**: Removes two global reads per allocation (3.8% CPU saved)

#### 4. Optimize Size-to-Bin (OPTIONAL)

**Impact**: -1 to -2% CPU
**Effort**: 1 hour
**Complexity**: Medium

Current: uses `lzcnt` (3.06% CPU in that instruction alone)

**Option A**: Lookup table for sizes ≤128 bytes

```c
// 129 entries so that index 128 (the largest covered size) stays in bounds
static const uint8_t size_to_bin_table[129] = {
    0, 0, 0, 0, 0, 0, 0, 0,  // 0-7   → bin 0
    1, 1, 1, 1, 1, 1, 1, 1,  // 8-15  → bin 1
    2, 2, 2, 2, 2, 2, 2, 2,  // 16-23 → bin 2
    // ...
};

static inline unsigned size_to_bin_fast(size_t sz) {
    if (sz <= 128) return size_to_bin_table[sz];
    return size_to_bin_lzcnt(sz);  // Fallback for larger sizes
}
```
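The Risk Assessment below flags this table's mapping as a correctness risk. A cheap guard is a debug-only self-check that compares the table against the existing lzcnt path once at startup. A minimal sketch, assuming `size_to_bin_lzcnt()` reflects the current production logic:

```c
#include <assert.h>

#ifndef NDEBUG
// Debug-only: confirm the lookup table agrees with the lzcnt
// implementation for every size it covers. Call once during init.
static void verify_size_to_bin_table(void) {
    for (size_t sz = 1; sz <= 128; sz++)
        assert(size_to_bin_table[sz] == size_to_bin_lzcnt(sz));
}
#endif
```

Wiring this into the existing test run would catch an off-by-one in the table before it can skew a benchmark.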
**Option B**: Use multiply-shift instead of lzcnt
- May be faster on older CPUs
- Profile to confirm

---

## Implementation Plan

### Phase 1: Quick Wins (2-3 hours) ← START HERE
- [ ] Inline fast path (`hak_tiny_alloc_fast()`)
- [ ] Reduce stack usage (88 → 32 bytes)
- Expected: 120-164 M → 160-220 M ops/sec

### Phase 2: Polish (1-2 hours)
- [ ] Cache globals in TLS
- [ ] Optimize size-to-bin with the lookup table
- Expected: additional 10-20% gain

### Phase 3: Validate (30 min)
- [ ] Run the comprehensive benchmark
- [ ] Collect new perf data
- [ ] Verify hak_tiny_alloc dropped from 22.75% to ~10%

**Total time**: 3-6 hours
**Total impact**: +50-70% throughput improvement

---

## Success Criteria

After optimization, verify:
- [ ] hak_tiny_alloc CPU: 22.75% → <12% (primary goal)
- [ ] Total throughput: 120-164 M → 180-250 M ops/sec
- [ ] Faster than glibc: +70% to +140% (vs. current +15-57%)
- [ ] No correctness regressions (all tests pass)
- [ ] No new bottleneck above 10% (except benchmark overhead such as rand())

---

## Files to Modify

**Primary**:
- `/home/tomoaki/git/hakmem/hakmem_pool.c` - Main allocation logic
  - `hak_tiny_alloc()` - Current slow implementation
  - Add `hak_tiny_alloc_fast()` - New inline fast path
  - Add `hak_tiny_alloc_slow()` - Refill logic

**Supporting**:
- `/home/tomoaki/git/hakmem/hakmem_pool.h` - Add the inline function
  - Check whether the size-to-bin logic lives in a separate function

---

## Validation Commands

After implementing the optimizations:

```bash
# Rebuild
make clean && make bench_comprehensive_hakmem

# Benchmark
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem

# Perf analysis
HAKMEM_WRAP_TINY=1 perf record -g --call-graph dwarf \
    -o perf_post_tiny_opt.data ./bench_comprehensive_hakmem
perf report --stdio -i perf_post_tiny_opt.data --sort=symbol \
    | grep -E "^\s+[0-9]+\.[0-9]+%"
```

Check that:
1. hak_tiny_alloc is below 12%
2. No new bottleneck appeared above 15%
3. Throughput improved by 40-80 M ops/sec

---

## Risk Assessment

**Low Risk**:
- Inline fast path: only affects the common path, easy to test
- Stack reduction: mechanical refactoring
- TLS caching: simple optimization

**Medium Risk**:
- Size-to-bin lookup table: need to ensure the mapping is correct
- Fast/slow path separation: need to ensure all cases are covered

**Mitigation**:
- Test all size classes (8, 16, 32, 64, 128, 256, 512, 1024 bytes); a smoke-test sketch appears after the Questions section below
- Run the comprehensive benchmark suite
- Verify with perf that CPU time moved to the expected places

---

## Reference Data

### Current Hottest Instructions in hak_tiny_alloc

```asm
3.71%: push %r14                      ← Register spill
3.52%: mov  g_tiny_initialized,%r14d  ← Global read
3.53%: mov  0x1c(%rsp),%ebp           ← Stack read
3.33%: cmpq $0x80,0x10(%rsp)          ← Size check
3.06%: mov  %rbp,0x38(%rsp)           ← Stack write
```

After optimization, expect:
- Push/pop overhead: reduced (fewer registers needed in the fast path)
- Global reads: eliminated (cached in TLS)
- Stack access: reduced (smaller frame)
- Size checks: inlined (better prediction)

### Expected New Bottlenecks After the Fix

If hak_tiny_alloc drops to ~10%, the next hottest functions will be:
1. mid_desc_lookup: 12.55% (may become #1)
2. hak_tiny_owner_slab: 9.09% (close to threshold)

Both are below 15%, so diminishing returns are starting to apply.

---

## Questions Before Starting?

1. Do we target 10% or 12% for hak_tiny_alloc?
   - 10%: more aggressive, may need all of the optimizations
   - 12%: conservative, Phase 1 may be enough
2. Should we also tackle mid_desc_lookup (12.55%)?
   - Could do both in one session
   - The hash table optimization is independent
3. When do we stop optimizing?
   - When the top bottleneck is below 10%?
   - When beating glibc by 2x (target achieved)?
   - When effort exceeds returns?

**Recommendation**: Do Phase 1 (fast-path inlining), measure, then decide on Phase 2.
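Whichever phase is chosen, the size-class check from the Mitigation list can be automated before any refactoring starts. A minimal smoke-test sketch follows; note that `hak_tiny_free()` is an assumed name for the free-side entry point and the prototypes are hypothetical, so adjust both to the real API in `hakmem_pool.h`:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical prototypes; in practice, include hakmem_pool.h instead.
void* hak_tiny_alloc(size_t size);
void  hak_tiny_free(void* ptr);  // Assumed free entry point

// Allocate, fully write, and free every size class from the risk checklist.
static void smoke_test_size_classes(void) {
    static const size_t sizes[] = {8, 16, 32, 64, 128, 256, 512, 1024};
    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        void* p = hak_tiny_alloc(sizes[i]);
        assert(p != NULL);
        memset(p, 0xAB, sizes[i]);  // Touch every byte to catch bad bin sizing
        hak_tiny_free(p);
    }
}
```

Touching the entire block on each allocation is the cheap way to surface a size-to-bin mapping error as a crash under the test suite rather than as silent corruption in a benchmark.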
---

## Ready to Start?

Next command to run:

```bash
# Read the current implementation
grep -A 200 "hak_tiny_alloc" /home/tomoaki/git/hakmem/hakmem_pool.c
```

Then start implementing the inline fast path!
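As a concrete starting point for that work, here is a hedged sketch of what the slow-path half of the split might look like. The refill machinery (`refill_from_slab()`, `MAG_REFILL_COUNT`) is a placeholder for whatever the real slab/depot layer provides; `mag_t`, `g_tls_mags`, and `fallback_alloc()` follow the shapes used in the snippets above:

```c
// Placeholder constant: how many objects to pull per refill. Assumes
// mag->objects has capacity for at least this many entries.
#define MAG_REFILL_COUNT 32

// Hypothetical helper: pulls up to `want` objects from the shared slab
// layer into `out`, returning how many it actually obtained.
size_t refill_from_slab(unsigned bin, void** out, size_t want);

// Cold path: refill the TLS magazine, then retry the pop. Marked noinline
// so the hot path stays small and register-light.
__attribute__((noinline))
static void* hak_tiny_alloc_slow(size_t size, unsigned bin) {
    mag_t* mag = &g_tls_mags[bin];

    mag->count = refill_from_slab(bin, mag->objects, MAG_REFILL_COUNT);
    if (LIKELY(mag->count > 0))
        return mag->objects[--mag->count];

    // Slab layer is empty too: fall back to the general allocator.
    return fallback_alloc(size);
}
```

Keeping the refill amount moderate (a placeholder 32 here) bounds the latency spike on a magazine miss while still amortizing the slab-layer cost over many fast-path hits.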