hakmem/INLINING_BENCHMARK_INDEX.md
Moe Charm (CI) 5685c2f4c9 Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refill
calls and prefill the pool with 3 additional HOT superslabs before attempting
to carve. This builds a working set of slabs that can sustain allocation
pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra superlslabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
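The cold-path change described above can be sketched roughly as follows. This is a minimal illustration, not hakmem's actual code: `warm_pool_t`, `superslab_refill_stub`, and `unified_cache_refill_cold` are hypothetical names standing in for the real API.

```c
#include <stddef.h>

#define WARM_POOL_MAX_PER_CLASS 16   /* mirrors TINY_WARM_POOL_MAX_PER_CLASS */
#define PREFILL_BUDGET 3             /* extra HOT superslabs loaded per empty-pool miss */

typedef struct { void *slabs[WARM_POOL_MAX_PER_CLASS]; int count; } warm_pool_t;
typedef struct { long prefilled; } warm_pool_stats_t;

/* Stand-in for the expensive registry-scanning refill; hands out
 * distinct dummy superslab pointers. */
static void *superslab_refill_stub(void) {
    static char backing[64][1];
    static int next = 0;
    return backing[next++ % 64];
}

/* Cold path: if the pool is empty, prefill it with PREFILL_BUDGET extra
 * superslabs, then return one more slab for immediate TLS carving. */
void *unified_cache_refill_cold(warm_pool_t *pool, warm_pool_stats_t *st) {
    if (pool->count == 0) {
        for (int i = 0; i < PREFILL_BUDGET && pool->count < WARM_POOL_MAX_PER_CLASS; i++)
            pool->slabs[pool->count++] = superslab_refill_stub();
        st->prefilled++;   /* tracked as g_warm_pool_stats[].prefilled */
    }
    return superslab_refill_stub();  /* kept in TLS for immediate carving */
}
```

Subsequent misses that find a non-empty pool pop from it instead of triggering another prefill, which is what lets the pool sustain a working set.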

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)
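For clarity, the hit rate above is hits / (hits + misses); a one-line helper checks the C7 figures:

```c
/* Warm-pool hit rate as a percentage of cache misses served. */
double hit_rate_pct(long hits, long misses) {
    return 100.0 * (double)hits / (double)(hits + misses);
}
```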

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- Statistics are always compiled in; printing is ENV-gated via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 23:31:54 +09:00


Gatekeeper Inlining Optimization - Benchmark Index

Date: 2025-12-04
Status: COMPLETED - OPTIMIZATION VALIDATED


Quick Summary

The __attribute__((always_inline)) optimization on Gatekeeper functions is EFFECTIVE and VALIDATED:

  • Throughput: +10.57% improvement (Test 1)
  • CPU Cycles: -2.13% reduction (statistically significant)
  • Cache Misses: -13.53% reduction (statistically significant)

Recommendation: KEEP the inlining optimization


Documentation

Primary Reports

  1. BENCHMARK_SUMMARY.txt (14KB)

    • Quick reference with all key metrics
    • Best for: Command-line viewing, sharing results
    • Location: /mnt/workdisk/public_share/hakmem/BENCHMARK_SUMMARY.txt
  2. GATEKEEPER_INLINING_BENCHMARK_REPORT.md (15KB)

    • Comprehensive markdown report with tables and analysis
    • Best for: GitHub, documentation, detailed review
    • Location: /mnt/workdisk/public_share/hakmem/GATEKEEPER_INLINING_BENCHMARK_REPORT.md

Generated Artifacts

Binaries

  • bench_allocators_hakmem.with_inline (354KB)

    • BUILD A: With __attribute__((always_inline))
    • Optimized binary
  • bench_allocators_hakmem.no_inline (350KB)

    • BUILD B: Without forced inlining (baseline)
    • Used for A/B comparison

Scripts

  • analyze_results.py (13KB)

    • Python statistical analysis script
    • Computes means, std dev, CV, t-tests
    • Run: python3 analyze_results.py
  • run_benchmark.sh

    • Standard benchmark runner (5 iterations)
    • Usage: ./run_benchmark.sh <binary> <name> [iterations]
  • run_benchmark_conservative.sh

    • Conservative profile benchmark runner
    • Sets HAKMEM_TINY_PROFILE=conservative and HAKMEM_SS_PREFAULT=0
  • run_perf.sh

    • Perf counter collection script
    • Measures cycles, cache-misses, L1-dcache-load-misses

Key Results at a Glance

| Metric | WITH Inlining | WITHOUT Inlining | Improvement |
|---|---|---|---|
| Test 1 Throughput | 1,055,159 ops/s | 954,265 ops/s | +10.57% |
| Test 2 Throughput | 1,095,292 ops/s | 1,054,294 ops/s | +3.89% |
| CPU Cycles | 71,522,202 | 73,076,160 | -2.13% * |
| Cache Misses | 256,020 | 296,074 | -13.53% * |

\* = Statistically significant (p < 0.05)


Modified Files

The following files were modified to add __attribute__((always_inline)):

  1. core/box/tiny_alloc_gate_box.h (Line 139)

    static __attribute__((always_inline)) void* tiny_alloc_gate_fast(size_t size)
    
  2. core/box/tiny_free_gate_box.h (Line 131)

    static __attribute__((always_inline)) int tiny_free_gate_try_fast(void* user_ptr)
    

Statistical Validation

Significant Results (p < 0.05)

  • CPU Cycles: t = 2.823, df = 5.76
  • Cache Misses: t = 3.177, df = 5.73

These metrics passed statistical significance testing with 5 samples.

Variance Analysis

BUILD A (WITH inlining) shows consistently lower variance:

  • CPU Cycles CV: 0.75% vs 1.52% (50% improvement)
  • Cache Misses CV: 4.74% vs 8.60% (45% improvement)
  • Test 2 Throughput CV: 11.26% vs 19.18% (41% improvement)

Reproducing Results

Build Both Binaries

# BUILD A (WITH inlining) - already built
make clean
CFLAGS="-O3 -march=native" make bench_allocators_hakmem
cp bench_allocators_hakmem bench_allocators_hakmem.with_inline

# BUILD B (WITHOUT inlining)
# Remove __attribute__((always_inline)) from:
#   - core/box/tiny_alloc_gate_box.h:139
#   - core/box/tiny_free_gate_box.h:131
make clean
CFLAGS="-O3 -march=native" make bench_allocators_hakmem
cp bench_allocators_hakmem bench_allocators_hakmem.no_inline

Run Benchmarks

# Test 1: Standard workload
./run_benchmark.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_benchmark.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5

# Test 2: Conservative profile
./run_benchmark_conservative.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_benchmark_conservative.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5

# Perf counters
./run_perf.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_perf.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5

Analyze Results

python3 analyze_results.py

Next Steps

With the Gatekeeper inlining optimization validated and in place, the recommended next optimization is:

Batch Tier Checks

Goal: Reduce overhead of per-allocation route policy lookups

Approach:

  1. Batch route policy checks for multiple allocations
  2. Cache tier decisions in TLS
  3. Amortize lookup overhead across multiple operations

Expected Benefit: Additional 1-3% throughput improvement
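Since this optimization is not yet implemented, here is only a minimal sketch of step 2 (caching tier decisions in TLS); every name in it is hypothetical:

```c
#define NUM_CLASSES 8
enum tier { TIER_UNKNOWN = 0, TIER_TINY, TIER_SMALL };

static long g_policy_lookups = 0;  /* instrumentation for this sketch */

/* Stand-in for the per-allocation route-policy lookup being amortized. */
static enum tier route_policy_lookup(int cls) {
    g_policy_lookups++;
    return (cls < 4) ? TIER_TINY : TIER_SMALL;
}

/* One cached decision per size class; zero-initialized to TIER_UNKNOWN,
 * so the first allocation in a class pays the lookup and later ones
 * are served from thread-local storage. */
static _Thread_local enum tier tls_tier_cache[NUM_CLASSES];

enum tier tier_for_class(int cls) {
    if (tls_tier_cache[cls] == TIER_UNKNOWN)
        tls_tier_cache[cls] = route_policy_lookup(cls);
    return tls_tier_cache[cls];
}
```

A real implementation would also need an invalidation path for when the route policy changes, which this sketch omits.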


References

  • Original optimization request: Gatekeeper inlining analysis
  • Benchmark workload: bench_random_mixed_hakmem 1000000 256 42
  • Test parameters: 5 iterations per configuration after 1 warmup
  • Statistical method: Welch's t-test (α = 0.05)

Generated: 2025-12-04
System: Linux 6.8.0-87-generic
Compiler: GCC with -O3 -march=native -flto