Warm Pool Secondary Prefill - Commit Summary
Problem: The warm pool had a 0% hit rate (only 1 hit per 3,976 misses) despite being implemented, so every cache miss went through an expensive superslab_refill registry scan.
Root Cause Analysis:
- The warm pool was initialized once and received a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- The next refill pushed another single slab, which was immediately exhausted
- The pool oscillated between 0 and 1 items, yielding a 0% hit rate
Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refills and prefill the pool with 3 additional HOT superslabs before attempting to carve. This builds a working set of slabs that can sustain allocation pressure (sketched below).
Implementation Details:
- Modified the unified_cache_refill() cold path to detect an empty pool
- Added a prefill loop: when the pool count == 0, load 3 extra superslabs
- Store the extra slabs in the warm pool; keep 1 in TLS for immediate carving
- Track prefill events in the g_warm_pool_stats[].prefilled counter
Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After: C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs the 4.07M baseline)
- Stability: consistent 55.6% hit rate at 5M allocations (4.102M ops/s)
Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- The registry scan is avoided on 55.6% of cache misses (significant savings)
- The warm pool now functions as intended, with strong locality
Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via an env var if needed later)
- All statistics are always compiled; printing is ENV-gated via HAKMEM_WARM_POOL_STATS=1
Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider an adaptive prefill budget based on per-class hit rates
- Validate at larger allocation counts (10M+, pending the registry size fix)
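A minimal sketch of that cold path, under stated assumptions: unified_cache_refill(), superslab_refill(), and g_warm_pool_stats[].prefilled are names from the summary above; the warm_pool_count()/warm_pool_push() helpers, the struct layout, and the 64-class bound are illustrative placeholders, not the actual implementation.

```c
#include <stddef.h>

/* Hypothetical declarations for illustration only. */
typedef struct superslab superslab_t;
superslab_t *superslab_refill(int size_class);         /* expensive registry scan */
int  warm_pool_count(int size_class);
void warm_pool_push(int size_class, superslab_t *s);
extern struct warm_pool_stats { unsigned long prefilled; } g_warm_pool_stats[64];

enum { WARM_POOL_PREFILL_BUDGET = 3 };  /* hardcoded budget per the summary */

static superslab_t *unified_cache_refill(int size_class)
{
    /* Cold path: one refill for immediate carving... */
    superslab_t *slab = superslab_refill(size_class);
    if (!slab) return NULL;

    /* ...then, if the warm pool is empty, prefill it with extra HOT
     * superslabs so subsequent misses hit the pool, not the registry. */
    if (warm_pool_count(size_class) == 0) {
        for (int i = 0; i < WARM_POOL_PREFILL_BUDGET; i++) {
            superslab_t *extra = superslab_refill(size_class);
            if (!extra) break;
            warm_pool_push(size_class, extra);
        }
        g_warm_pool_stats[size_class].prefilled++;  /* track prefill events */
    }
    return slab;  /* this slab stays in TLS for immediate carving */
}
```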
Gatekeeper Inlining Optimization - Benchmark Index
Date: 2025-12-04
Status: ✅ COMPLETED - OPTIMIZATION VALIDATED
Quick Summary
The __attribute__((always_inline)) optimization on Gatekeeper functions is EFFECTIVE and VALIDATED:
- Throughput: +10.57% improvement (Test 1)
- CPU Cycles: -2.13% reduction (statistically significant)
- Cache Misses: -13.53% reduction (statistically significant)
Recommendation: ✅ KEEP the inlining optimization
Documentation
Primary Reports
- BENCHMARK_SUMMARY.txt (14KB)
  - Quick reference with all key metrics
  - Best for: Command-line viewing, sharing results
  - Location: /mnt/workdisk/public_share/hakmem/BENCHMARK_SUMMARY.txt
- GATEKEEPER_INLINING_BENCHMARK_REPORT.md (15KB)
  - Comprehensive markdown report with tables and analysis
  - Best for: GitHub, documentation, detailed review
  - Location: /mnt/workdisk/public_share/hakmem/GATEKEEPER_INLINING_BENCHMARK_REPORT.md
Generated Artifacts
Binaries
- bench_allocators_hakmem.with_inline (354KB)
  - BUILD A: With __attribute__((always_inline)) - optimized binary
- bench_allocators_hakmem.no_inline (350KB)
  - BUILD B: Without forced inlining (baseline)
  - Used for A/B comparison
Scripts
- analyze_results.py (13KB)
  - Python statistical analysis script
  - Computes means, std dev, CV, t-tests
  - Run: python3 analyze_results.py
- run_benchmark.sh
  - Standard benchmark runner (5 iterations)
  - Usage: ./run_benchmark.sh <binary> <name> [iterations]
- run_benchmark_conservative.sh
  - Conservative profile benchmark runner
  - Sets HAKMEM_TINY_PROFILE=conservative and HAKMEM_SS_PREFAULT=0
- run_perf.sh
  - Perf counter collection script
  - Measures cycles, cache-misses, L1-dcache-load-misses
Key Results at a Glance
| Metric | WITH Inlining | WITHOUT Inlining | Improvement |
|---|---|---|---|
| Test 1 Throughput | 1,055,159 ops/s | 954,265 ops/s | +10.57% |
| Test 2 Throughput | 1,095,292 ops/s | 1,054,294 ops/s | +3.89% |
| CPU Cycles | 71,522,202 | 73,076,160 | -2.13% ⭐ |
| Cache Misses | 256,020 | 296,074 | -13.53% ⭐ |
⭐ = Statistically significant (p < 0.05)
Modified Files
The following files were modified to add __attribute__((always_inline)):
- core/box/tiny_alloc_gate_box.h (Line 139): static __attribute__((always_inline)) void* tiny_alloc_gate_fast(size_t size)
- core/box/tiny_free_gate_box.h (Line 131): static __attribute__((always_inline)) int tiny_free_gate_try_fast(void* user_ptr)
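As a minimal, self-contained sketch of the pattern: only the two signatures above come from the report; the bodies here are placeholders, and pairing the attribute with the inline keyword follows the usual GCC convention.

```c
#include <stddef.h>

/* Forced-inlining pattern. Bodies are placeholders for illustration;
 * the real implementations live in the two box headers above. */
static inline __attribute__((always_inline)) void *
tiny_alloc_gate_fast(size_t size)
{
    (void)size;
    return NULL;   /* placeholder: the real gate routes the allocation */
}

static inline __attribute__((always_inline)) int
tiny_free_gate_try_fast(void *user_ptr)
{
    (void)user_ptr;
    return 0;      /* placeholder: 0 = not handled by the fast path */
}
```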
Statistical Validation
Significant Results (p < 0.05)
- CPU Cycles: t = 2.823, df = 5.76 ✅
- Cache Misses: t = 3.177, df = 5.73 ✅
These metrics passed statistical significance testing with 5 samples.
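For reference, the statistic is Welch's two-sample t (the method named in the References), whose Welch-Satterthwaite approximation explains the fractional degrees of freedom:

$$
t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^2}{\dfrac{(s_A^2/n_A)^2}{n_A-1} + \dfrac{(s_B^2/n_B)^2}{n_B-1}}
$$

With $n_A = n_B = 5$ runs per build, $\nu$ falls below the pooled $n_A + n_B - 2 = 8$ whenever the two variances differ, which is why the df values above (5.76, 5.73) are fractional.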
Variance Analysis
BUILD A (WITH inlining) shows consistently lower variance:
- CPU Cycles CV: 0.75% vs 1.52% (50% improvement)
- Cache Misses CV: 4.74% vs 8.60% (45% improvement)
- Test 2 Throughput CV: 11.26% vs 19.18% (41% improvement)
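Here CV is the coefficient of variation, the run-to-run standard deviation normalized by the mean:

$$
\mathrm{CV} = \frac{s}{\bar{x}} \times 100\%
$$

So BUILD A's 0.75% cycle CV means its cycle counts vary by well under 1% of the mean across runs.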
Reproducing Results
Build Both Binaries
# BUILD A (WITH inlining) - already built
make clean
CFLAGS="-O3 -march=native" make bench_allocators_hakmem
cp bench_allocators_hakmem bench_allocators_hakmem.with_inline
# BUILD B (WITHOUT inlining)
# Remove __attribute__((always_inline)) from:
# - core/box/tiny_alloc_gate_box.h:139
# - core/box/tiny_free_gate_box.h:131
make clean
CFLAGS="-O3 -march=native" make bench_allocators_hakmem
cp bench_allocators_hakmem bench_allocators_hakmem.no_inline
Run Benchmarks
# Test 1: Standard workload
./run_benchmark.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_benchmark.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5
# Test 2: Conservative profile
./run_benchmark_conservative.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_benchmark_conservative.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5
# Perf counters
./run_perf.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_perf.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5
Analyze Results
python3 analyze_results.py
Next Steps
With the Gatekeeper inlining optimization validated and in place, the recommended next optimization is:
Batch Tier Checks
Goal: Reduce overhead of per-allocation route policy lookups
Approach:
- Batch route policy checks for multiple allocations
- Cache tier decisions in TLS (see the sketch after this list)
- Amortize lookup overhead across multiple operations
Expected Benefit: Additional 1-3% throughput improvement
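A minimal sketch of the TLS-cached tier decision idea, under stated assumptions: tier_cache_entry_t, route_policy_lookup(), and g_route_policy_generation are all hypothetical names invented for illustration, not existing code in this repository.

```c
#include <stddef.h>
#include <stdint.h>

#define TIER_CACHE_SLOTS 64  /* one slot per size class; illustrative bound */

typedef struct {
    uint32_t generation;  /* snapshot of the policy generation at fill time */
    uint8_t  tier;        /* cached routing decision */
    uint8_t  valid;
} tier_cache_entry_t;

static __thread tier_cache_entry_t t_tier_cache[TIER_CACHE_SLOTS];
static uint32_t g_route_policy_generation;       /* bumped on policy updates */

uint8_t route_policy_lookup(size_t size_class);  /* the slow per-alloc lookup */

/* Fast path returns the cached decision; the slow lookup runs once per
 * class per policy generation, amortizing its cost over many allocations. */
static inline uint8_t tier_for_class(size_t size_class)
{
    tier_cache_entry_t *e = &t_tier_cache[size_class % TIER_CACHE_SLOTS];
    uint32_t gen = __atomic_load_n(&g_route_policy_generation, __ATOMIC_RELAXED);
    if (e->valid && e->generation == gen)
        return e->tier;                          /* hit: no lookup */
    e->tier = route_policy_lookup(size_class);   /* miss: look up, then cache */
    e->generation = gen;
    e->valid = 1;
    return e->tier;
}
```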
References
- Original optimization request: Gatekeeper inlining analysis
- Benchmark workload: bench_random_mixed_hakmem 1000000 256 42
- Test parameters: 5 iterations per configuration after 1 warmup
- Statistical method: Welch's t-test (α = 0.05)
Generated: 2025-12-04
System: Linux 6.8.0-87-generic
Compiler: GCC with -O3 -march=native -flto