HAKMEM Tiny Pool - Next Optimization Steps
Quick Status
Current Performance: 120-164 M ops/sec (15-57% faster than glibc)
Target Performance: 180-250 M ops/sec (70-140% faster than glibc)
Gap: 40-86 M ops/sec, achievable with the next optimization
Bottleneck Identified: hak_tiny_alloc (22.75% CPU)
Root Causes (from perf annotate)
- Heavy stack usage (10.5% CPU):
  - 88 bytes (0x58) of stack allocated
  - Multiple stack reads/writes per call
  - Register spilling due to pressure
- Repeated global reads (7.2% CPU):
  - g_tiny_initialized read every call (3.52%)
  - g_wrap_tiny_enabled read every call (0.28%)
  - Should cache in TLS or check once
- Complex control flow (5.0% CPU):
  - Multiple branches for size validation
  - Magazine refill logic in the main path
  - Should separate fast/slow paths
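The per-instruction breakdown behind these numbers can be reproduced with perf annotate (assuming a perf.data from an earlier perf record run of the benchmark, like the one under Validation Commands below):
# Per-instruction CPU breakdown for the hot symbol (reads ./perf.data)
perf annotate --stdio hak_tiny_alloc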
Optimization Strategy (Priority Order)
1. Inline Fast Path (HIGH PRIORITY)
Impact: -5 to -7% CPU
Effort: 2-3 hours
Complexity: Medium
Create hak_tiny_alloc_fast() inline helper:
static inline void* hak_tiny_alloc_fast(size_t size) {
    // Quick validation
    if (UNLIKELY(size > MAX_TINY_SIZE))
        return hak_tiny_alloc(size);
    // Inline bin calculation
    unsigned bin = size_to_bin_fast(size);
    // Fast path: TLS magazine hit
    mag_t* mag = &g_tls_mags[bin];
    if (LIKELY(mag->count > 0))
        return mag->objects[--mag->count];
    // Slow path: refill magazine
    return hak_tiny_alloc_slow(size, bin);
}
Key optimizations:
- Move size validation inline (avoid function call for common case)
- Direct TLS magazine access (no indirection)
- Separate slow path (refill) to different function (better branch prediction)
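A possible shape for that slow path (a minimal sketch: refill_magazine() and its return convention are assumptions, not existing HAKMEM APIs; fallback_alloc() is the name used in section 3 below):
static void* hak_tiny_alloc_slow(size_t size, unsigned bin) {
    mag_t* mag = &g_tls_mags[bin];
    // Cold path: a larger frame and extra locals are acceptable here.
    if (refill_magazine(mag, bin) == 0)   // hypothetical refill; returns objects obtained
        return fallback_alloc(size);      // nothing available: defer to the general allocator
    return mag->objects[--mag->count];
}
Keeping this out-of-line (not inline) is deliberate: the compiler can then keep hak_tiny_alloc_fast()'s frame minimal.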
2. Reduce Stack Usage (MEDIUM PRIORITY)
Impact: -3 to -4% CPU
Effort: 1-2 hours
Complexity: Low
Current: 88 bytes of stack
Target: <32 bytes of stack
Actions:
- Audit local variables in hak_tiny_alloc()
- Use fewer temporaries
- Pass calculated values as function params (use registers)
- Move rarely-used locals to the slow path (see the sketch below)
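A minimal illustration of the last action (every name here is hypothetical, for illustration only):
// Before: a cold 64-byte buffer would sit in the hot frame on every call.
// After: it lives in the refill helper, so the hot frame stays small.
static void* refill_cold(unsigned bin) {
    char diag[64];              // cold local, now confined to the cold frame
    (void)bin; (void)diag;      // stand-ins for real refill bookkeeping
    return 0;
}

static inline void* hot_stub(unsigned bin, void* cached) {
    if (cached) return cached;  // hot frame keeps only two live values
    return refill_cold(bin);    // cold work (and its locals) delegated
}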
3. Cache Globals in TLS (LOW PRIORITY)
Impact: -2 to -3% CPU
Effort: 1 hour
Complexity: Low
Instead of reading g_tiny_initialized on every call:
// In TLS init
tls->tiny_available = g_tiny_initialized && g_wrap_tiny_enabled;

// In allocation fast path
if (UNLIKELY(!tls->tiny_available))
    return fallback_alloc(size);
Benefit: Removes 2 global reads per allocation (3.8% CPU saved)
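One possible declaration for that cached flag (a sketch; the struct and field names are assumptions, and it presumes both globals are effectively constant once initialization completes):
extern int g_tiny_initialized;   // existing globals, assumed int here
extern int g_wrap_tiny_enabled;

typedef struct hak_tls {
    int tiny_available;          // both globals sampled once per thread
    /* existing per-thread state (magazines, etc.) */
} hak_tls_t;

static __thread hak_tls_t g_tls_state;

static void hak_tls_init(void) {
    g_tls_state.tiny_available = g_tiny_initialized && g_wrap_tiny_enabled;
}
If either global can change after a thread initializes its TLS, the cached flag needs an invalidation path; the 3.8% saving above assumes they do not.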
4. Optimize Size-to-Bin (OPTIONAL)
Impact: -1 to -2% CPU
Effort: 1 hour
Complexity: Medium
Current: Uses lzcnt (3.06% CPU in that instruction alone)
Option A: Lookup table for sizes ≤128 bytes
static const uint8_t size_to_bin_table[129] = {   // 129 entries: sizes 0-128 inclusive
    0, 0, 0, 0, 0, 0, 0, 0,   // 0-7   → bin 0
    1, 1, 1, 1, 1, 1, 1, 1,   // 8-15  → bin 1
    2, 2, 2, 2, 2, 2, 2, 2,   // 16-23 → bin 2
    // ...
};

static inline unsigned size_to_bin_fast(size_t sz) {
    if (sz <= 128)
        return size_to_bin_table[sz];   // a 128-entry table would overrun at sz == 128
    return size_to_bin_lzcnt(sz);       // Fallback
}
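Hand-writing all 129 entries invites transcription errors; one alternative (a sketch reusing the size_to_bin_lzcnt name from above, and assuming an init hook exists to call this from) derives the table from the existing mapping at startup:
static uint8_t size_to_bin_table[129];       // non-const variant, filled once at init

static void init_size_to_bin_table(void) {
    for (size_t sz = 1; sz <= 128; sz++)
        size_to_bin_table[sz] = (uint8_t)size_to_bin_lzcnt(sz);
    size_to_bin_table[0] = size_to_bin_table[1];   // treat size 0 like size 1
}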
Option B: Use multiply-shift instead of lzcnt
- May be faster on older CPUs
- Profile to confirm
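One well-known multiply-shift replacement for the leading-zero count is the de Bruijn log2 (a sketch; for nonzero v it computes floor(log2(v)), i.e. 31 - lzcnt, so whatever size_to_bin_lzcnt builds on top of lzcnt can build on this instead):
#include <stdint.h>

// Saturate v's low bits, then a de Bruijn multiply packs a unique 5-bit
// pattern into the top bits, which indexes a 32-entry floor(log2) table.
static inline unsigned log2_floor_u32(uint32_t v) {
    static const unsigned tab[32] = {
         0,  9,  1, 10, 13, 21,  2, 29, 11, 14, 16, 18, 22, 25,  3, 30,
         8, 12, 20, 28, 15, 17, 24,  7, 19, 27, 23,  6, 26,  5,  4, 31
    };
    v |= v >> 1;  v |= v >> 2;  v |= v >> 4;
    v |= v >> 8;  v |= v >> 16;
    return tab[(v * 0x07C4ACDDu) >> 27];
}
This is equivalent to 31 - __builtin_clz(v) for v != 0; only profiling will show whether the multiply beats lzcnt on the target CPUs.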
Implementation Plan
Phase 1: Quick Wins (2-3 hours) ← START HERE
- Inline fast path (hak_tiny_alloc_fast())
- Reduce stack usage (88 → 32 bytes)
- Expected: 120-164 M → 160-220 M ops/sec
Phase 2: Polish (1-2 hours)
- Cache globals in TLS
- Optimize size-to-bin with lookup table
- Expected: Additional 10-20% gain
Phase 3: Validate (30 min)
- Run comprehensive benchmark
- Collect new perf data
- Verify hak_tiny_alloc reduced from 22.75% → ~10%
Total time: 3-6 hours
Total impact: +50-70% throughput improvement
Success Criteria
After optimization, verify:
- hak_tiny_alloc CPU: 22.75% → <12% (primary goal)
- Total throughput: 120-164 M → 180-250 M ops/sec
- Faster than glibc: +70% to +140% (vs current +15-57%)
- No correctness regressions (all tests pass)
- No new bottleneck >10% (except benchmark overhead like rand())
Files to Modify
Primary:
- /home/tomoaki/git/hakmem/hakmem_pool.c - Main allocation logic
  - hak_tiny_alloc() - Current slow implementation
  - Add hak_tiny_alloc_fast() - New inline fast path
  - Add hak_tiny_alloc_slow() - Refill logic
Supporting:
- /home/tomoaki/git/hakmem/hakmem_pool.h - Add the inline function
- Check whether the size-to-bin logic lives in a separate function
Validation Commands
After implementing optimizations:
# Rebuild
make clean && make bench_comprehensive_hakmem
# Benchmark
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem
# Perf analysis
HAKMEM_WRAP_TINY=1 perf record -g --call-graph dwarf \
-o perf_post_tiny_opt.data ./bench_comprehensive_hakmem
perf report --stdio -i perf_post_tiny_opt.data --sort=symbol \
| grep -E "^\s+[0-9]+\.[0-9]+%"
Check that:
- hak_tiny_alloc < 12%
- No new bottleneck appeared >15%
- Throughput improved by 40-80 M ops/sec
Risk Assessment
Low Risk:
- Inline fast path: Only affects common path, easy to test
- Stack reduction: Mechanical refactoring
- TLS caching: Simple optimization
Medium Risk:
- Size-to-bin lookup table: Need to ensure correctness of mapping
- Separation of fast/slow paths: Need to ensure all cases covered
Mitigation:
- Test all size classes (8, 16, 32, 64, 128, 256, 512, 1024 bytes)
- Run comprehensive benchmark suite
- Verify with perf that CPU moved to expected places
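For the size-class sweep, a minimal smoke test might look like this (a sketch: hak_tiny_alloc is from this document, hak_tiny_free is a hypothetical name for the matching release call):
#include <assert.h>
#include <stddef.h>
#include <string.h>

static void smoke_test_size_classes(void) {
    static const size_t sizes[] = {8, 16, 32, 64, 128, 256, 512, 1024};
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        void* p = hak_tiny_alloc(sizes[i]);
        assert(p != NULL);
        memset(p, 0xAB, sizes[i]);   // touch every byte to catch an undersized bin
        hak_tiny_free(p);            // hypothetical; substitute the pool's real free
    }
}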
Reference Data
Current Hottest Instructions in hak_tiny_alloc
3.71%: push %r14                     ← Register spill
3.53%: mov 0x1c(%rsp),%ebp           ← Stack read
3.52%: mov g_tiny_initialized,%r14d  ← Global read
3.33%: cmpq $0x80,0x10(%rsp)         ← Size check
3.06%: mov %rbp,0x38(%rsp)           ← Stack write
After optimization, expect:
- Push/pop overhead: reduced (fewer registers needed in fast path)
- Global reads: eliminated (cached in TLS)
- Stack access: reduced (smaller frame)
- Size checks: inlined (better prediction)
Expected New Bottlenecks After Fix
If hak_tiny_alloc drops to ~10%, next hottest will be:
- mid_desc_lookup: 12.55% (may become #1)
- hak_tiny_owner_slab: 9.09% (close to threshold)
But both are below 15%, so diminishing returns start to apply.
Questions Before Starting?
1. Do we want to target 10% or 12% for hak_tiny_alloc?
   - 10%: More aggressive, may need all optimizations
   - 12%: Conservative, Phase 1 may be enough
2. Should we also tackle mid_desc_lookup (12.55%)?
   - Could do both in one session
   - Hash table optimization is independent
3. When to stop optimizing?
   - When the top bottleneck is <10%?
   - When we beat glibc by 2x (target achieved)?
   - When effort exceeds returns?
Recommendation: Do Phase 1 (fast path inline), measure, then decide on Phase 2.
Ready to Start?
Next command to run:
# Read the current implementation
grep -A 200 "hak_tiny_alloc" /home/tomoaki/git/hakmem/hakmem_pool.c
Then start implementing the inline fast path!