HAKMEM Tiny Pool - Next Optimization Steps
Quick Status
Current Performance: 120-164 M ops/sec (15-57% faster than glibc)
Target Performance: 180-250 M ops/sec (70-140% faster than glibc)
Gap: 40-86 M ops/sec, achievable with the next optimization
Bottleneck Identified: hak_tiny_alloc (22.75% CPU)
Root Causes (from perf annotate)
- Heavy stack usage (10.5% CPU):
  - 88 bytes (0x58) of stack allocated
  - Multiple stack reads/writes per call
  - Register spilling due to pressure
- Repeated global reads (7.2% CPU):
  - g_tiny_initialized read every call (3.52%)
  - g_wrap_tiny_enabled read every call (0.28%)
  - Should cache in TLS or check once
- Complex control flow (5.0% CPU):
  - Multiple branches for size validation
  - Magazine refill logic in the main path
  - Should separate fast/slow paths
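The per-instruction breakdown behind these numbers can be reproduced with perf annotate (assuming a perf.data from an earlier perf record run of the benchmark, like the one under Validation Commands below):
# Per-instruction CPU breakdown for the hot symbol (reads ./perf.data)
perf annotate --stdio hak_tiny_alloc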
Optimization Strategy (Priority Order)
1. Inline Fast Path (HIGH PRIORITY)
Impact: -5 to -7% CPU
Effort: 2-3 hours
Complexity: Medium
Create hak_tiny_alloc_fast() inline helper:
static inline void* hak_tiny_alloc_fast(size_t size) {
    // Quick validation
    if (UNLIKELY(size > MAX_TINY_SIZE))
        return hak_tiny_alloc(size);
    // Inline bin calculation
    unsigned bin = size_to_bin_fast(size);
    // Fast path: TLS magazine hit
    mag_t* mag = &g_tls_mags[bin];
    if (LIKELY(mag->count > 0))
        return mag->objects[--mag->count];
    // Slow path: refill magazine
    return hak_tiny_alloc_slow(size, bin);
}
Key optimizations:
- Move size validation inline (avoid function call for common case)
- Direct TLS magazine access (no indirection)
- Separate slow path (refill) to different function (better branch prediction)
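A possible shape for that slow path (a minimal sketch: refill_magazine() and its return convention are assumptions, not existing HAKMEM APIs; fallback_alloc() is the name used in section 3 below):
static void* hak_tiny_alloc_slow(size_t size, unsigned bin) {
    mag_t* mag = &g_tls_mags[bin];
    // Cold path: a larger frame and extra locals are acceptable here.
    if (refill_magazine(mag, bin) == 0)   // hypothetical refill; returns objects obtained
        return fallback_alloc(size);      // nothing available: defer to the general allocator
    return mag->objects[--mag->count];
}
Keeping this out-of-line (not inline) is deliberate: the compiler can then keep hak_tiny_alloc_fast()'s frame minimal.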
2. Reduce Stack Usage (MEDIUM PRIORITY)
Impact: -3 to -4% CPU
Effort: 1-2 hours
Complexity: Low
Current: 88 bytes of stack
Target: <32 bytes of stack
Actions:
- Audit local variables in hak_tiny_alloc()
- Use fewer temporaries
- Pass calculated values as function params (use registers)
- Move rarely-used locals to the slow path (see the sketch below)
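A minimal illustration of the last action (every name here is hypothetical, for illustration only):
// Before: a cold 64-byte buffer would sit in the hot frame on every call.
// After: it lives in the refill helper, so the hot frame stays small.
static void* refill_cold(unsigned bin) {
    char diag[64];              // cold local, now confined to the cold frame
    (void)bin; (void)diag;      // stand-ins for real refill bookkeeping
    return 0;
}

static inline void* hot_stub(unsigned bin, void* cached) {
    if (cached) return cached;  // hot frame keeps only two live values
    return refill_cold(bin);    // cold work (and its locals) delegated
}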
3. Cache Globals in TLS (LOW PRIORITY)
Impact: -2 to -3% CPU
Effort: 1 hour
Complexity: Low
Instead of reading g_tiny_initialized on every call:
// In TLS init
tls->tiny_available = g_tiny_initialized && g_wrap_tiny_enabled;

// In allocation fast path
if (UNLIKELY(!tls->tiny_available))
    return fallback_alloc(size);
Benefit: Removes 2 global reads per allocation (3.8% CPU saved)
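One possible declaration for that cached flag (a sketch; the struct and field names are assumptions, and it presumes both globals are effectively constant once initialization completes):
extern int g_tiny_initialized;   // existing globals, assumed int here
extern int g_wrap_tiny_enabled;

typedef struct hak_tls {
    int tiny_available;          // both globals sampled once per thread
    /* existing per-thread state (magazines, etc.) */
} hak_tls_t;

static __thread hak_tls_t g_tls_state;

static void hak_tls_init(void) {
    g_tls_state.tiny_available = g_tiny_initialized && g_wrap_tiny_enabled;
}
If either global can change after a thread initializes its TLS, the cached flag needs an invalidation path; the 3.8% saving above assumes they do not.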
4. Optimize Size-to-Bin (OPTIONAL)
Impact: -1 to -2% CPU
Effort: 1 hour
Complexity: Medium
Current: Uses lzcnt (3.06% CPU in that instruction alone)
Option A: Lookup table for sizes ≤128 bytes
static const uint8_t size_to_bin_table[129] = {   // 129 entries: sizes 0-128 inclusive
    0, 0, 0, 0, 0, 0, 0, 0,   // 0-7   → bin 0
    1, 1, 1, 1, 1, 1, 1, 1,   // 8-15  → bin 1
    2, 2, 2, 2, 2, 2, 2, 2,   // 16-23 → bin 2
    // ...
};

static inline unsigned size_to_bin_fast(size_t sz) {
    if (sz <= 128)
        return size_to_bin_table[sz];   // a 128-entry table would overrun at sz == 128
    return size_to_bin_lzcnt(sz);       // Fallback
}
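Hand-writing all 129 entries invites transcription errors; one alternative (a sketch reusing the size_to_bin_lzcnt name from above, and assuming an init hook exists to call this from) derives the table from the existing mapping at startup:
static uint8_t size_to_bin_table[129];       // non-const variant, filled once at init

static void init_size_to_bin_table(void) {
    for (size_t sz = 1; sz <= 128; sz++)
        size_to_bin_table[sz] = (uint8_t)size_to_bin_lzcnt(sz);
    size_to_bin_table[0] = size_to_bin_table[1];   // treat size 0 like size 1
}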
Option B: Use multiply-shift instead of lzcnt
- May be faster on older CPUs
- Profile to confirm
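One well-known multiply-shift replacement for the leading-zero count is the de Bruijn log2 (a sketch; for nonzero v it computes floor(log2(v)), i.e. 31 - lzcnt, so whatever size_to_bin_lzcnt builds on top of lzcnt can build on this instead):
#include <stdint.h>

// Saturate v's low bits, then a de Bruijn multiply packs a unique 5-bit
// pattern into the top bits, which indexes a 32-entry floor(log2) table.
static inline unsigned log2_floor_u32(uint32_t v) {
    static const unsigned tab[32] = {
         0,  9,  1, 10, 13, 21,  2, 29, 11, 14, 16, 18, 22, 25,  3, 30,
         8, 12, 20, 28, 15, 17, 24,  7, 19, 27, 23,  6, 26,  5,  4, 31
    };
    v |= v >> 1;  v |= v >> 2;  v |= v >> 4;
    v |= v >> 8;  v |= v >> 16;
    return tab[(v * 0x07C4ACDDu) >> 27];
}
This is equivalent to 31 - __builtin_clz(v) for v != 0; only profiling will show whether the multiply beats lzcnt on the target CPUs.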
Implementation Plan
Phase 1: Quick Wins (2-3 hours) ← START HERE
- Inline fast path (hak_tiny_alloc_fast())
- Reduce stack usage (88 → 32 bytes)
- Expected: 120-164 M → 160-220 M ops/sec
Phase 2: Polish (1-2 hours)
- Cache globals in TLS
- Optimize size-to-bin with lookup table
- Expected: Additional 10-20% gain
Phase 3: Validate (30 min)
- Run comprehensive benchmark
- Collect new perf data
- Verify hak_tiny_alloc reduced from 22.75% → ~10%
Total time: 3-6 hours
Total impact: +50-70% throughput improvement
Success Criteria
After optimization, verify:
- hak_tiny_alloc CPU: 22.75% → <12% (primary goal)
- Total throughput: 120-164 M → 180-250 M ops/sec
- Faster than glibc: +70% to +140% (vs current +15-57%)
- No correctness regressions (all tests pass)
- No new bottleneck >10% (except benchmark overhead like rand())
Files to Modify
Primary:
- /home/tomoaki/git/hakmem/hakmem_pool.c - Main allocation logic
  - hak_tiny_alloc() - Current slow implementation
  - Add hak_tiny_alloc_fast() - New inline fast path
  - Add hak_tiny_alloc_slow() - Refill logic
Supporting:
- /home/tomoaki/git/hakmem/hakmem_pool.h - Add the inline function
- Check whether the size-to-bin logic lives in a separate function
Validation Commands
After implementing optimizations:
# Rebuild
make clean && make bench_comprehensive_hakmem
# Benchmark
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem
# Perf analysis
HAKMEM_WRAP_TINY=1 perf record -g --call-graph dwarf \
-o perf_post_tiny_opt.data ./bench_comprehensive_hakmem
perf report --stdio -i perf_post_tiny_opt.data --sort=symbol \
| grep -E "^\s+[0-9]+\.[0-9]+%"
Check that:
- hak_tiny_alloc < 12%
- No new bottleneck appeared >15%
- Throughput improved by 40-80 M ops/sec
Risk Assessment
Low Risk:
- Inline fast path: Only affects common path, easy to test
- Stack reduction: Mechanical refactoring
- TLS caching: Simple optimization
Medium Risk:
- Size-to-bin lookup table: Need to ensure correctness of mapping
- Separation of fast/slow paths: Need to ensure all cases covered
Mitigation:
- Test all size classes (8, 16, 32, 64, 128, 256, 512, 1024 bytes)
- Run comprehensive benchmark suite
- Verify with perf that CPU moved to expected places
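For the size-class sweep, a minimal smoke test might look like this (a sketch: hak_tiny_alloc is from this document, hak_tiny_free is a hypothetical name for the matching release call):
#include <assert.h>
#include <stddef.h>
#include <string.h>

static void smoke_test_size_classes(void) {
    static const size_t sizes[] = {8, 16, 32, 64, 128, 256, 512, 1024};
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        void* p = hak_tiny_alloc(sizes[i]);
        assert(p != NULL);
        memset(p, 0xAB, sizes[i]);   // touch every byte to catch an undersized bin
        hak_tiny_free(p);            // hypothetical; substitute the pool's real free
    }
}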
Reference Data
Current Hottest Instructions in hak_tiny_alloc
3.71%: push %r14                     ← Register spill
3.53%: mov 0x1c(%rsp),%ebp           ← Stack read
3.52%: mov g_tiny_initialized,%r14d  ← Global read
3.33%: cmpq $0x80,0x10(%rsp)         ← Size check
3.06%: mov %rbp,0x38(%rsp)           ← Stack write
After optimization, expect:
- Push/pop overhead: reduced (fewer registers needed in fast path)
- Global reads: eliminated (cached in TLS)
- Stack access: reduced (smaller frame)
- Size checks: inlined (better prediction)
Expected New Bottlenecks After Fix
If hak_tiny_alloc drops to ~10%, next hottest will be:
- mid_desc_lookup: 12.55% (may become #1)
- hak_tiny_owner_slab: 9.09% (close to threshold)
But both are below 15%, so diminishing returns start to apply.
Questions Before Starting?
1. Do we want to target 10% or 12% for hak_tiny_alloc?
   - 10%: More aggressive, may need all optimizations
   - 12%: Conservative, Phase 1 may be enough
2. Should we also tackle mid_desc_lookup (12.55%)?
   - Could do both in one session
   - Hash table optimization is independent
3. When to stop optimizing?
   - When the top bottleneck is <10%?
   - When we beat glibc by 2x (target achieved)?
   - When effort exceeds returns?
Recommendation: Do Phase 1 (fast path inline), measure, then decide on Phase 2.
Ready to Start?
Next command to run:
# Read the current implementation
grep -A 200 "hak_tiny_alloc" /home/tomoaki/git/hakmem/hakmem_pool.c
Then start implementing the inline fast path!