HAKMEM Tiny Pool - Next Optimization Steps

Quick Status

Current Performance: 120-164 M ops/sec (15-57% faster than glibc)
Target Performance: 180-250 M ops/sec (70-140% faster than glibc)
Gap: 40-86 M ops/sec achievable with next optimization


Bottleneck Identified: hak_tiny_alloc (22.75% CPU)

Root Causes (from perf annotate)

  1. Heavy stack usage (10.5% CPU):

    • 88 bytes (0x58) of stack allocated
    • Multiple stack reads/writes per call
    • Register spilling due to pressure
  2. Repeated global reads (7.2% CPU):

    • g_tiny_initialized read every call (3.52%)
    • g_wrap_tiny_enabled read every call (0.28%)
    • Should cache in TLS or check once
  3. Complex control flow (5.0% CPU):

    • Multiple branches for size validation
    • Magazine refill logic in main path
    • Should separate fast/slow paths

Optimization Strategy (Priority Order)

1. Inline Fast Path (HIGH PRIORITY)

Impact: -5 to -7% CPU
Effort: 2-3 hours
Complexity: Medium

Create hak_tiny_alloc_fast() inline helper:

static inline void* hak_tiny_alloc_fast(size_t size) {
    // Quick validation
    if (UNLIKELY(size > MAX_TINY_SIZE)) 
        return hak_tiny_alloc(size);
    
    // Inline bin calculation
    unsigned bin = size_to_bin_fast(size);
    
    // Fast path: TLS magazine hit
    mag_t* mag = &g_tls_mags[bin];
    if (LIKELY(mag->count > 0))
        return mag->objects[--mag->count];
    
    // Slow path: refill magazine
    return hak_tiny_alloc_slow(size, bin);
}

Key optimizations:

  • Move size validation inline (avoid function call for common case)
  • Direct TLS magazine access (no indirection)
  • Separate slow path (refill) into a different function for better branch prediction (see the sketch below)
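
A minimal sketch of the separated refill path that pairs with hak_tiny_alloc_fast() above. mag_refill() is a hypothetical helper (not an existing hakmem API) standing in for the magazine refill logic; fallback_alloc() is the same placeholder used in the TLS-caching sketch further down:

static void* hak_tiny_alloc_slow(size_t size, unsigned bin) {
    mag_t* mag = &g_tls_mags[bin];

    // Pull a batch of objects from the shared pool into the TLS magazine
    // (hypothetical helper; the real refill logic lives in hakmem_pool.c).
    if (!mag_refill(mag, bin))
        return fallback_alloc(size);  // pool exhausted; take the general path

    return mag->objects[--mag->count];  // non-empty after a successful refill
}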

2. Reduce Stack Usage (MEDIUM PRIORITY)

Impact: -3 to -4% CPU
Effort: 1-2 hours
Complexity: Low

Current: 88 bytes stack
Target: <32 bytes stack

Actions:

  • Audit local variables in hak_tiny_alloc()
  • Use fewer temporaries
  • Pass calculated values as function params (use registers)
  • Move rarely-used locals to the slow path (see the sketch below)
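
One way to enforce the last point (illustrative; attribute spellings are GCC/Clang-specific): mark the slow path noinline so its locals never count against the fast path's frame, and cold so it moves off the hot I-cache path.

// noinline keeps refill temporaries out of hak_tiny_alloc_fast()'s frame;
// cold places the function in the unlikely text section.
__attribute__((noinline, cold))
static void* hak_tiny_alloc_slow(size_t size, unsigned bin);

Frame sizes can then be verified with GCC's -fstack-usage flag, which writes per-function stack usage to a .su file next to each object file, making the 88 → <32 byte target checkable after each refactor.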

3. Cache Globals in TLS (LOW PRIORITY)

Impact: -2 to -3% CPU
Effort: 1 hour
Complexity: Low

Instead of reading g_tiny_initialized on every call:

// In TLS init
tls->tiny_available = g_tiny_initialized && g_wrap_tiny_enabled;

// In allocation fast path
if (UNLIKELY(!tls->tiny_available))
    return fallback_alloc(size);

Benefit: Removes 2 global reads per allocation (3.8% CPU saved)
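
A slightly fuller sketch of the cached flag, assuming a per-thread struct; the struct and field names are illustrative, not hakmem's actual TLS layout:

typedef struct hak_tls {
    int tiny_available;  // cached: g_tiny_initialized && g_wrap_tiny_enabled
    /* ... existing per-thread state ... */
} hak_tls_t;

static __thread hak_tls_t g_tls;

static void hak_tls_init(void) {
    // Sample the globals once per thread instead of on every allocation.
    g_tls.tiny_available = g_tiny_initialized && g_wrap_tiny_enabled;
}

One caveat: if either global can change after a thread's TLS is set up (e.g., lazy initialization completing later), the cached flag needs a refresh point, such as re-checking the globals on the slow path.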

4. Optimize Size-to-Bin (OPTIONAL)

Impact: -1 to -2% CPU
Effort: 1 hour
Complexity: Medium

Current: Uses lzcnt (3.06% CPU in that instruction alone)

Option A: Lookup table for sizes ≤128 bytes

static const uint8_t size_to_bin_table[129] = {  // 129 entries so index 128 is valid
    0, 0, 0, 0, 0, 0, 0, 0,  // 0-7   → bin 0
    1, 1, 1, 1, 1, 1, 1, 1,  // 8-15  → bin 1
    2, 2, 2, 2, 2, 2, 2, 2,  // 16-23 → bin 2
    // ...
};

static inline unsigned size_to_bin_fast(size_t sz) {
    if (sz <= 128)
        return size_to_bin_table[sz];
    return size_to_bin_lzcnt(sz);  // Fallback
}

Option B: Use multiply-shift instead of lzcnt

  • May be faster on older CPUs
  • Profile to confirm (a de Bruijn sketch follows below)
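
For reference, one common multiply-shift formulation is the 32-bit de Bruijn floor-log2 from Bit Twiddling Hacks. The final bin mapping below assumes power-of-two classes starting at 8 bytes (8 → bin 0, 16 → bin 1, ..., 1024 → bin 7), which may not match hakmem's actual bin layout:

// floor(log2(v)) without lzcnt: standard de Bruijn table and constant.
static inline unsigned log2_floor32(uint32_t v) {
    static const uint8_t tab[32] = {
        0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
    };
    v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16;
    return tab[(v * 0x07C4ACDDu) >> 27];
}

// ceil(log2(sz)) - 3, clamped so sizes 1-8 land in bin 0.
// Assumes 1 <= sz <= MAX_TINY_SIZE (guarded by the caller).
static inline unsigned size_to_bin_mulshift(size_t sz) {
    return log2_floor32(((uint32_t)sz - 1) | 7) - 2;
}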

Implementation Plan

Phase 1: Quick Wins (2-3 hours) ← START HERE

  • Inline fast path (hak_tiny_alloc_fast())
  • Reduce stack usage (88 → 32 bytes)
  • Expected: 120-164 M → 160-220 M ops/sec

Phase 2: Polish (1-2 hours)

  • Cache globals in TLS
  • Optimize size-to-bin with lookup table
  • Expected: Additional 10-20% gain

Phase 3: Validate (30 min)

  • Run comprehensive benchmark
  • Collect new perf data
  • Verify hak_tiny_alloc reduced from 22.75% → ~10%

Total time: 3-6 hours
Total impact: +50-70% throughput improvement


Success Criteria

After optimization, verify:

  • hak_tiny_alloc CPU: 22.75% → <12% (primary goal)
  • Total throughput: 120-164 M → 180-250 M ops/sec
  • Faster than glibc: +70% to +140% (vs current +15-57%)
  • No correctness regressions (all tests pass)
  • No new bottleneck >10% (except benchmark overhead like rand())

Files to Modify

Primary:

  • /home/tomoaki/git/hakmem/hakmem_pool.c - Main allocation logic
    • hak_tiny_alloc() - Current slow implementation
    • Add hak_tiny_alloc_fast() - New inline fast path
    • Add hak_tiny_alloc_slow() - Refill logic

Supporting:

  • /home/tomoaki/git/hakmem/hakmem_pool.h - Add inline function
  • Check whether the size-to-bin logic lives in a separate function

Validation Commands

After implementing optimizations:

# Rebuild
make clean && make bench_comprehensive_hakmem

# Benchmark
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem

# Perf analysis
HAKMEM_WRAP_TINY=1 perf record -g --call-graph dwarf \
  -o perf_post_tiny_opt.data ./bench_comprehensive_hakmem

perf report --stdio -i perf_post_tiny_opt.data --sort=symbol \
  | grep -E "^\s+[0-9]+\.[0-9]+%"

Check that:

  1. hak_tiny_alloc < 12%
  2. No new bottleneck appeared >15%
  3. Throughput improved by 40-80 M ops/sec

Risk Assessment

Low Risk:

  • Inline fast path: Only affects common path, easy to test
  • Stack reduction: Mechanical refactoring
  • TLS caching: Simple optimization

Medium Risk:

  • Size-to-bin lookup table: Need to ensure correctness of mapping
  • Separation of fast/slow paths: Need to ensure all cases covered

Mitigation:

  • Test all size classes (8, 16, 32, 64, 128, 256, 512, 1024 bytes); see the smoke-test sketch after this list
  • Run comprehensive benchmark suite
  • Verify with perf that CPU moved to expected places
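
A minimal smoke test along those lines, assuming hak_tiny_free() as the matching free entry point (hypothetical name):

#include <assert.h>
#include <string.h>

static void test_all_tiny_classes(void) {
    static const size_t sizes[] = {8, 16, 32, 64, 128, 256, 512, 1024};
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        void* p = hak_tiny_alloc_fast(sizes[i]);
        assert(p != NULL);
        memset(p, 0xAB, sizes[i]);  // touch every byte to catch undersized blocks
        hak_tiny_free(p);           // assumed free entry point
    }
}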

Reference Data

Current Hottest Instructions in hak_tiny_alloc

3.71%:  push %r14                        Register spill
3.52%:  mov g_tiny_initialized,%r14d     Global read
3.53%:  mov 0x1c(%rsp),%ebp             Stack read
3.33%:  cmpq $0x80,0x10(%rsp)           Size check
3.06%:  mov %rbp,0x38(%rsp)             Stack write

After optimization, expect:

  • Push/pop overhead: reduced (fewer registers needed in fast path)
  • Global reads: eliminated (cached in TLS)
  • Stack access: reduced (smaller frame)
  • Size checks: inlined (better prediction)

Expected New Bottlenecks After Fix

If hak_tiny_alloc drops to ~10%, next hottest will be:

  1. mid_desc_lookup: 12.55% (may become #1)
  2. hak_tiny_owner_slab: 9.09% (close to threshold)

But both are below 15%, so diminishing returns are starting to apply.


Questions Before Starting?

  1. Do we want to target 10% or 12% for hak_tiny_alloc?

    • 10%: More aggressive, may need all optimizations
    • 12%: Conservative, Phase 1 may be enough
  2. Should we also tackle mid_desc_lookup (12.55%)?

    • Could do both in one session
    • Hash table optimization is independent
  3. When to stop optimizing?

    • When top bottleneck <10%?
    • When beating glibc by 2x (target achieved)?
    • When effort exceeds returns?

Recommendation: Do Phase 1 (fast path inline), measure, then decide on Phase 2.


Ready to Start?

Next command to run:

# Read current implementation
grep -A 200 "hak_tiny_alloc" /home/tomoaki/git/hakmem/hakmem_pool.c

Then start implementing inline fast path!