hakmem/docs/archive/OPTIMIZATION_NEXT_STEPS.md

# HAKMEM Tiny Pool - Next Optimization Steps

## Quick Status

**Current Performance**: 120-164 M ops/sec (15-57% faster than glibc)
**Target Performance**: 180-250 M ops/sec (70-140% faster than glibc)
**Gap**: 60-86 M ops/sec, to be closed by the next round of optimizations

---

## Bottleneck Identified: hak_tiny_alloc (22.75% CPU)

### Root Causes (from perf annotate)

1. **Heavy stack usage** (10.5% CPU):
   - 88 bytes (0x58) of stack allocated
   - Multiple stack reads/writes per call
   - Register spilling due to pressure

2. **Repeated global reads** (7.2% CPU):
   - `g_tiny_initialized` read every call (3.52%)
   - `g_wrap_tiny_enabled` read every call (0.28%)
   - Should cache in TLS or check once

3. **Complex control flow** (5.0% CPU):
   - Multiple branches for size validation
   - Magazine refill logic in main path
   - Should separate fast/slow paths

### Optimization Strategy (Priority Order)

#### 1. Inline Fast Path (HIGH PRIORITY)
**Impact**: -5 to -7% CPU
**Effort**: 2-3 hours
**Complexity**: Medium

Create `hak_tiny_alloc_fast()` inline helper:
```c
static inline void* hak_tiny_alloc_fast(size_t size) {
    // Quick validation
    if (UNLIKELY(size > MAX_TINY_SIZE)) 
        return hak_tiny_alloc(size);
    
    // Inline bin calculation
    unsigned bin = size_to_bin_fast(size);
    
    // Fast path: TLS magazine hit
    mag_t* mag = &g_tls_mags[bin];
    if (LIKELY(mag->count > 0))
        return mag->objects[--mag->count];
    
    // Slow path: refill magazine
    return hak_tiny_alloc_slow(size, bin);
}
```

**Key optimizations**:
- Move size validation inline (avoid function call for common case)
- Direct TLS magazine access (no indirection)
- Separate slow path (refill) to different function (better branch prediction)
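The helper above depends on several names that are not defined in this note. The following is a minimal self-contained sketch of the fast/slow split, assuming GCC or Clang builtins and power-of-two size classes from 8 to 1024 bytes. `MAG_CAP`, `TINY_BINS`, the `mag_t` layout, and the use of `malloc` as the refill source and large-size fallback are illustrative assumptions, not the actual HAKMEM definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Branch hints; __builtin_expect is a GCC/Clang builtin. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical limits; the real values live in hakmem_pool.c. */
#define MAX_TINY_SIZE 1024
#define TINY_BINS     8      /* classes of 8, 16, ..., 1024 bytes */
#define MAG_CAP       64

typedef struct {
    unsigned count;              /* number of cached objects */
    void*    objects[MAG_CAP];   /* LIFO stack of free objects */
} mag_t;

static __thread mag_t g_tls_mags[TINY_BINS];

/* Power-of-two bins: 0..8 -> 0, 9..16 -> 1, ..., 513..1024 -> 7. */
static inline unsigned size_to_bin_fast(size_t size) {
    if (size <= 8) return 0;
    return 29u - (unsigned)__builtin_clz((unsigned)(size - 1));
}

/* Cold path: refill half a magazine in one batch, then pop one object.
   malloc stands in for the real slab refill. */
static __attribute__((noinline, cold))
void* hak_tiny_alloc_slow(size_t size, unsigned bin) {
    (void)size;
    assert(bin < TINY_BINS);
    mag_t* mag = &g_tls_mags[bin];
    size_t class_size = (size_t)8 << bin;
    while (mag->count < MAG_CAP / 2) {
        void* p = malloc(class_size);
        if (p == NULL) break;
        mag->objects[mag->count++] = p;
    }
    return mag->count ? mag->objects[--mag->count] : NULL;
}

/* Hot path: one compare, one bin computation, one TLS magazine pop. */
static inline void* hak_tiny_alloc_fast(size_t size) {
    if (UNLIKELY(size > MAX_TINY_SIZE))
        return malloc(size);     /* stand-in for the general allocator */
    unsigned bin = size_to_bin_fast(size);
    mag_t* mag = &g_tls_mags[bin];
    if (LIKELY(mag->count > 0))
        return mag->objects[--mag->count];
    return hak_tiny_alloc_slow(size, bin);
}
```

Marking the refill function `noinline` and `cold` keeps its code and stack frame out of the inlined hot path, which is the point of the split.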

#### 2. Reduce Stack Usage (MEDIUM PRIORITY)
**Impact**: -3 to -4% CPU
**Effort**: 1-2 hours
**Complexity**: Low

Current: 88 bytes stack
Target: <32 bytes stack

**Actions**:
- Audit local variables in `hak_tiny_alloc()`
- Use fewer temporaries
- Pass calculated values as function params (use registers)
- Move rarely-used locals to slow path
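A sketch of the last action, moving rarely-used locals to the slow path. The names (`tiny_cache_t`, `tiny_pop`, `tiny_refill_and_pop`, `g_backing`) are hypothetical and the static backing array only stands in for a real refill source. Frame sizes can be checked per function with GCC's `-fstack-usage` option.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LIKELY(x) __builtin_expect(!!(x), 1)

/* Hypothetical per-thread cache of ready objects for one size class. */
typedef struct {
    unsigned count;
    void*    slots[16];
} tiny_cache_t;

static unsigned char g_backing[16 * 64];   /* stand-in backing store */

/* Cold path: the scratch array of 16 pointers lives in this frame, so
   its stack space is only reserved when a refill actually happens. */
static __attribute__((noinline, cold))
void* tiny_refill_and_pop(tiny_cache_t* c) {
    assert(c != NULL);
    void* batch[16];                       /* rarely-used local */
    for (unsigned i = 0; i < 16; i++)
        batch[i] = &g_backing[i * 64];
    memcpy(c->slots, batch, sizeof batch);
    c->count = 16;
    return c->slots[--c->count];
}

/* Hot path: no scratch buffer, only values that fit in registers. */
static inline void* tiny_pop(tiny_cache_t* c) {
    if (LIKELY(c->count > 0))
        return c->slots[--c->count];
    return tiny_refill_and_pop(c);
}
```

Because the cold function is not inlined, the hot function's frame no longer has to account for `batch`.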

#### 3. Cache Globals in TLS (LOW PRIORITY)
**Impact**: -2 to -3% CPU
**Effort**: 1 hour
**Complexity**: Low

Instead of reading `g_tiny_initialized` on every call:
```c
// In TLS init
tls->tiny_available = g_tiny_initialized && g_wrap_tiny_enabled;

// In allocation fast path
if (UNLIKELY(!tls->tiny_available))
    return fallback_alloc(size);
```

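**Benefit**: Removes 2 global reads per allocation. Those two loads account for 3.8% CPU in the profile (3.52% + 0.28%); one TLS read replaces them, so the net saving is estimated at 2-3%.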
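One possible shape for the cached flag, as a sketch only. It caches just the "available" state, so a thread that runs before initialization completes keeps re-checking the globals instead of caching a stale "off". `t_tiny_on`, `tiny_available()`, and `tiny_check_globals()` are hypothetical names; the two globals are declared locally so the example compiles on its own.

```c
#include <assert.h>

#define LIKELY(x) __builtin_expect(!!(x), 1)

/* Process-wide flags, named as in the profile; defined here only so
   that the sketch is self-contained. */
static int g_tiny_initialized  = 0;
static int g_wrap_tiny_enabled = 0;

/* Per-thread cache: 0 until this thread has seen both flags set. */
static __thread unsigned char t_tiny_on;

/* Slow path: reads the globals and caches a positive result. */
static __attribute__((noinline)) int tiny_check_globals(void) {
    assert(!t_tiny_on);   /* only reached while the cache is unset */
    if (g_tiny_initialized && g_wrap_tiny_enabled) {
        t_tiny_on = 1;
        return 1;
    }
    return 0;
}

/* Fast path: a single TLS byte read once the cache is set. */
static inline int tiny_available(void) {
    if (LIKELY(t_tiny_on))
        return 1;
    return tiny_check_globals();
}
```

The trade-off: once a thread has cached the positive state, later changes to either global are not observed by that thread. This is only safe if both flags are set once and never cleared.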

#### 4. Optimize Size-to-Bin (OPTIONAL)
**Impact**: -1 to -2% CPU
**Effort**: 1 hour
**Complexity**: Medium

Current: Uses `lzcnt` (3.06% CPU in that instruction alone)

**Option A**: Lookup table for sizes ≤128 bytes
```c
static const uint8_t size_to_bin_table[129] = {  // one entry per size 0..128
    0, 0, 0, 0, 0, 0, 0, 0,  // 0-7   → bin 0
    1, 1, 1, 1, 1, 1, 1, 1,  // 8-15  → bin 1
    2, 2, 2, 2, 2, 2, 2, 2,  // 16-23 → bin 2
    // ... (entries must match size_to_bin_lzcnt for every size)
};

static inline unsigned size_to_bin_fast(size_t sz) {
    if (sz <= 128)  // index 128 is in bounds: the table has 129 entries
        return size_to_bin_table[sz];
    return size_to_bin_lzcnt(sz);  // Fallback
}
```

**Option B**: Use multiply-shift instead of lzcnt
- May be faster on older CPUs
- Profile to confirm
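For Option A, the table can be generated from the existing mapping instead of typed by hand, then checked with a property test. The sketch below assumes power-of-two classes starting at 8 bytes (so sizes 0..8 map to bin 0, 9..16 to bin 1, and so on). `size_to_bin_ref()` is a stand-in for the real `lzcnt`-based function, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_MAX 128u

/* Reference mapping under the power-of-two assumption. */
static unsigned size_to_bin_ref(size_t sz) {
    if (sz <= 8) return 0;
    return 29u - (unsigned)__builtin_clz((unsigned)(sz - 1));
}

/* One entry per size 0..128 inclusive, hence TABLE_MAX + 1 entries. */
static uint8_t g_size_to_bin[TABLE_MAX + 1];

/* Fill the table once at startup from the reference mapping. */
static void size_to_bin_table_init(void) {
    for (size_t sz = 0; sz <= TABLE_MAX; sz++) {
        assert(size_to_bin_ref(sz) <= UINT8_MAX);
        g_size_to_bin[sz] = (uint8_t)size_to_bin_ref(sz);
    }
}

static inline unsigned size_to_bin_lookup(size_t sz) {
    return sz <= TABLE_MAX ? g_size_to_bin[sz] : size_to_bin_ref(sz);
}

/* Property check: the chosen class must fit the request, and the next
   smaller class must not. Returns 1 when every size passes. */
static int size_to_bin_table_ok(void) {
    for (size_t sz = 1; sz <= TABLE_MAX; sz++) {
        unsigned bin = size_to_bin_lookup(sz);
        size_t cls = (size_t)8 << bin;
        if (cls < sz) return 0;
        if (bin > 0 && (cls >> 1) >= sz) return 0;
    }
    return 1;
}
```

The property check does not depend on how the table was produced, so it also works if the table is later replaced by a hand-written `static const` array.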

---

## Implementation Plan

### Phase 1: Quick Wins (2-3 hours) ← START HERE
- [ ] Inline fast path (`hak_tiny_alloc_fast()`)
- [ ] Reduce stack usage (88 → <32 bytes)
- Expected: 120-164 M → 160-220 M ops/sec

### Phase 2: Polish (1-2 hours)
- [ ] Cache globals in TLS
- [ ] Optimize size-to-bin with lookup table
- Expected: Additional 10-20% gain

### Phase 3: Validate (30 min)
- [ ] Run comprehensive benchmark
- [ ] Collect new perf data
- [ ] Verify hak_tiny_alloc reduced from 22.75% → ~10%

**Total time**: 3-6 hours
**Total impact**: +50-70% throughput improvement

---

## Success Criteria

After optimization, verify:
- [ ] hak_tiny_alloc CPU: 22.75% → <12% (primary goal)
- [ ] Total throughput: 120-164 M → 180-250 M ops/sec
- [ ] Faster than glibc: +70% to +140% (vs current +15-57%)
- [ ] No correctness regressions (all tests pass)
- [ ] No new bottleneck >10% (except benchmark overhead like rand())
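The numeric criteria can be checked mechanically. A small sketch, assuming throughput is in M ops/sec and the glibc baseline comes from the same benchmark run; `bench_result_t` and `meets_targets()` are hypothetical helpers, not part of the benchmark.

```c
#include <assert.h>

/* Percent by which throughput `ours` exceeds `baseline` (same units). */
static double percent_faster(double ours, double baseline) {
    assert(baseline > 0.0);
    return (ours / baseline - 1.0) * 100.0;
}

/* Baseline throughput implied by "ours is pct% faster than baseline". */
static double implied_baseline(double ours, double pct) {
    return ours / (1.0 + pct / 100.0);
}

/* Hypothetical record of one benchmark run. */
typedef struct {
    double alloc_cpu_pct;  /* hak_tiny_alloc share from perf, in percent */
    double mops;           /* HAKMEM throughput, M ops/sec */
    double glibc_mops;     /* glibc throughput, M ops/sec */
} bench_result_t;

/* Returns 1 when the three numeric criteria above are all met. */
static int meets_targets(const bench_result_t* r) {
    return r->alloc_cpu_pct < 12.0
        && r->mops >= 180.0
        && percent_faster(r->mops, r->glibc_mops) >= 70.0;
}
```

Both ends of the current range imply the same glibc baseline of roughly 104 M ops/sec (120 / 1.15 and 164 / 1.57). Against that baseline, 180 and 250 M ops/sec correspond to about 72% and 139% faster, which agrees with the stated 70-140% target.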

---

## Files to Modify

**Primary**:
- `/home/tomoaki/git/hakmem/hakmem_pool.c` - Main allocation logic
  - `hak_tiny_alloc()` - Current slow implementation
  - Add `hak_tiny_alloc_fast()` - New inline fast path
  - Add `hak_tiny_alloc_slow()` - Refill logic

**Supporting**:
- `/home/tomoaki/git/hakmem/hakmem_pool.h` - Add inline function
- Check if size-to-bin logic is in separate function

---

## Validation Commands

After implementing optimizations:

```bash
# Rebuild
make clean && make bench_comprehensive_hakmem

# Benchmark
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem

# Perf analysis
HAKMEM_WRAP_TINY=1 perf record -g --call-graph dwarf \
  -o perf_post_tiny_opt.data ./bench_comprehensive_hakmem

perf report --stdio -i perf_post_tiny_opt.data --sort=symbol \
  | grep -E "^\s+[0-9]+\.[0-9]+%"
```

Check that:
1. hak_tiny_alloc < 12%
2. No new bottleneck appeared >15%
3. Throughput improved by 60-86 M ops/sec (reaching the 180-250 M ops/sec target)

---

## Risk Assessment

**Low Risk**:
- Inline fast path: Only affects common path, easy to test
- Stack reduction: Mechanical refactoring
- TLS caching: Simple optimization

**Medium Risk**:
- Size-to-bin lookup table: Need to ensure correctness of mapping
- Separation of fast/slow paths: Need to ensure all cases covered

**Mitigation**:
- Test all size classes (8, 16, 32, 64, 128, 256, 512, 1024 bytes)
- Run comprehensive benchmark suite
- Verify with perf that CPU moved to expected places
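A size-class smoke test along the lines of the first mitigation item. It takes the allocator as function pointers so it can run against any allocate/free pair; the signature of the real HAKMEM entry points is an assumption here. It keeps several blocks live per size and fills each with a distinct byte so that overlapping blocks show up as corrupted patterns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef void* (*alloc_fn)(size_t);
typedef void  (*free_fn)(void*);

#define LIVE 32  /* blocks kept alive per size to expose overlap */

/* Returns 0 on success, or the first request size that failed. */
static size_t tiny_smoke_test(alloc_fn al, free_fn fr) {
    static const size_t classes[] = {8, 16, 32, 64, 128, 256, 512, 1024};
    assert(al != NULL && fr != NULL);
    for (size_t c = 0; c < sizeof classes / sizeof classes[0]; c++) {
        /* Exercise each class at its upper edge and one byte below it. */
        for (size_t sz = classes[c] - 1; sz <= classes[c]; sz++) {
            unsigned char* blk[LIVE];
            size_t got = 0;
            size_t bad = 0;
            for (; got < LIVE; got++) {
                blk[got] = al(sz);
                if (blk[got] == NULL) { bad = sz; break; }
                memset(blk[got], (int)(got + 1), sz);  /* unique fill */
            }
            /* Every block must still hold its own fill byte. */
            for (size_t i = 0; i < got && !bad; i++)
                for (size_t b = 0; b < sz; b++)
                    if (blk[i][b] != (unsigned char)(i + 1)) { bad = sz; break; }
            for (size_t i = 0; i < got; i++)
                fr(blk[i]);
            if (bad) return bad;
        }
    }
    return 0;
}
```

Running it first against `malloc`/`free` confirms that the test itself is sound before pointing it at the new fast path.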

---

## Reference Data

### Current Hottest Instructions in hak_tiny_alloc

```asm
3.71%:  push %r14                       ← Register spill
3.52%:  mov g_tiny_initialized,%r14d    ← Global read
3.53%:  mov 0x1c(%rsp),%ebp            ← Stack read
3.33%:  cmpq $0x80,0x10(%rsp)          ← Size check
3.06%:  mov %rbp,0x38(%rsp)            ← Stack write
```

After optimization, expect:
- Push/pop overhead: reduced (fewer registers needed in fast path)
- Global reads: eliminated (cached in TLS)
- Stack access: reduced (smaller frame)
- Size checks: inlined (better prediction)

### Expected New Bottlenecks After Fix

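If hak_tiny_alloc drops to ~10%, the next hottest functions will be:
1. mid_desc_lookup: 12.55% (may become #1)
2. hak_tiny_owner_slab: 9.09% (close to the 10% threshold)

These are the current shares; they will rise somewhat once hak_tiny_alloc shrinks, because every share is measured against a smaller total. Both should still stay below 15%, so diminishing returns are starting to apply.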
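The effect of a smaller total on the other shares can be estimated with a simple model. The sketch assumes the perf percentages are shares of total benchmark cycles and that every other function costs the same as before; both are simplifications, and the helper names are hypothetical.

```c
#include <assert.h>

/* Fraction of the original run time that remains when one function's
   share falls from old_share (of the old total) to new_share (of the
   new total). Shares are fractions in [0, 1). */
static double remaining_time(double old_share, double new_share) {
    assert(new_share >= 0.0 && new_share <= old_share && old_share < 1.0);
    return (1.0 - old_share) / (1.0 - new_share);
}

/* Share of an untouched function once the total has shrunk. */
static double rescaled_share(double share, double remaining) {
    return share / remaining;
}

/* Throughput multiplier implied by the shorter run time. */
static double throughput_gain(double remaining) {
    return 1.0 / remaining;
}
```

For a drop from 22.75% to 10%, about 0.858 of the original run time remains. Under this model mid_desc_lookup moves from 12.55% to about 14.6% and hak_tiny_owner_slab from 9.09% to about 10.6%. The same model gives a throughput gain of about 1.17x from this function alone, so larger gains would have to come from costs it does not capture, such as call overhead currently attributed to callers.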

---

## Questions Before Starting?

1. Do we want to target 10% or 12% for hak_tiny_alloc?
   - 10%: More aggressive, may need all optimizations
   - 12%: Conservative, Phase 1 may be enough

2. Should we also tackle mid_desc_lookup (12.55%)?
   - Could do both in one session
   - Hash table optimization is independent

3. When to stop optimizing?
   - When top bottleneck <10%? 
   - When beating glibc by 2x (target achieved)?
   - When effort exceeds returns?

**Recommendation**: Do Phase 1 (fast path inline), measure, then decide on Phase 2.

---

## Ready to Start?

Next command to run:
```bash
# Read current implementation
grep -A 200 "hak_tiny_alloc" /home/tomoaki/git/hakmem/hakmem_pool.c
```

Then start implementing the inline fast path!