hakmem/INLINING_BENCHMARK_INDEX.md
Moe Charm (CI) 5685c2f4c9 Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refill
calls and prefill the pool with 3 additional HOT superslabs before attempting
to carve. This builds a working set of slabs that can sustain allocation
pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
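The cold-path change described above can be sketched as follows. This is an
illustrative model, not the actual hakmem code: warm_pool_t and
cache_refill_cold are stand-in names, and superslab_refill is simulated rather
than scanning a real registry.

```c
#include <stddef.h>

#define WARM_POOL_MAX  16  /* mirrors TINY_WARM_POOL_MAX_PER_CLASS */
#define PREFILL_BUDGET  3  /* extra slabs loaded on an empty-pool miss */

typedef struct {
    void *slabs[WARM_POOL_MAX];
    int   count;
    long  prefilled;       /* counts prefill events, like g_warm_pool_stats */
} warm_pool_t;

/* Stand-in for the expensive registry scan that produces a fresh slab. */
static void *superslab_refill(void) {
    static int slab_storage[64];
    static int next = 0;
    return &slab_storage[next++ % 64];
}

static void warm_pool_push(warm_pool_t *p, void *slab) {
    if (p->count < WARM_POOL_MAX)
        p->slabs[p->count++] = slab;
}

/* Cold path: on an empty pool, prefill PREFILL_BUDGET slabs so later
 * misses hit the pool, then hand back one more slab for immediate carving. */
void *cache_refill_cold(warm_pool_t *p) {
    if (p->count == 0) {
        for (int i = 0; i < PREFILL_BUDGET; i++)
            warm_pool_push(p, superslab_refill());
        p->prefilled++;
    }
    return superslab_refill();  /* kept out of the pool, carved right away */
}
```

After one cold-path call on an empty pool, the pool holds 3 slabs and one
prefill event has been recorded, which is the oscillation fix: the pool no
longer swings between 0 and 1 items.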

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1
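The always-compiled, ENV-gated reporting pattern might look like this sketch.
Only the HAKMEM_WARM_POOL_STATS=1 gate comes from the text above; the struct
layout and function names are assumptions, not the real hakmem symbols.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Counters are always maintained; printing is opt-in via the environment. */
typedef struct { long hits, misses, prefilled; } warm_pool_stats_t;

int warm_pool_stats_enabled(void) {
    const char *v = getenv("HAKMEM_WARM_POOL_STATS");
    return v != NULL && strcmp(v, "1") == 0;
}

void warm_pool_stats_report(const warm_pool_stats_t *s) {
    if (!warm_pool_stats_enabled())
        return;  /* near-zero cost unless explicitly requested */
    double total = (double)(s->hits + s->misses);
    fprintf(stderr, "warm pool: hits=%ld misses=%ld prefilled=%ld hit_rate=%.1f%%\n",
            s->hits, s->misses, s->prefilled,
            total > 0 ? 100.0 * (double)s->hits / total : 0.0);
}
```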

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 23:31:54 +09:00


# Gatekeeper Inlining Optimization - Benchmark Index
**Date**: 2025-12-04
**Status**: ✅ COMPLETED - OPTIMIZATION VALIDATED
---
## Quick Summary
The `__attribute__((always_inline))` optimization on Gatekeeper functions is **EFFECTIVE and VALIDATED**:
- **Throughput**: +10.57% improvement (Test 1)
- **CPU Cycles**: -2.13% reduction (statistically significant)
- **Cache Misses**: -13.53% reduction (statistically significant)
**Recommendation**: ✅ **KEEP** the inlining optimization
---
## Documentation
### Primary Reports
1. **BENCHMARK_SUMMARY.txt** (14KB)
- Quick reference with all key metrics
- Best for: Command-line viewing, sharing results
- Location: `/mnt/workdisk/public_share/hakmem/BENCHMARK_SUMMARY.txt`
2. **GATEKEEPER_INLINING_BENCHMARK_REPORT.md** (15KB)
- Comprehensive markdown report with tables and analysis
- Best for: GitHub, documentation, detailed review
- Location: `/mnt/workdisk/public_share/hakmem/GATEKEEPER_INLINING_BENCHMARK_REPORT.md`
---
## Generated Artifacts
### Binaries
- **bench_allocators_hakmem.with_inline** (354KB)
- BUILD A: With `__attribute__((always_inline))`
- Optimized binary
- **bench_allocators_hakmem.no_inline** (350KB)
- BUILD B: Without forced inlining (baseline)
- Used for A/B comparison
### Scripts
- **analyze_results.py** (13KB)
- Python statistical analysis script
- Computes means, std dev, CV, t-tests
- Run: `python3 analyze_results.py`
- **run_benchmark.sh**
- Standard benchmark runner (5 iterations)
- Usage: `./run_benchmark.sh <binary> <name> [iterations]`
- **run_benchmark_conservative.sh**
- Conservative profile benchmark runner
- Sets `HAKMEM_TINY_PROFILE=conservative` and `HAKMEM_SS_PREFAULT=0`
- **run_perf.sh**
- Perf counter collection script
- Measures cycles, cache-misses, L1-dcache-load-misses
---
## Key Results at a Glance
| Metric | WITH Inlining | WITHOUT Inlining | Improvement |
|--------|-------------:|----------------:|------------:|
| **Test 1 Throughput** | 1,055,159 ops/s | 954,265 ops/s | **+10.57%** |
| **Test 2 Throughput** | 1,095,292 ops/s | 1,054,294 ops/s | **+3.89%** |
| **CPU Cycles** | 71,522,202 | 73,076,160 | **-2.13%** ⭐ |
| **Cache Misses** | 256,020 | 296,074 | **-13.53%** ⭐ |
⭐ = Statistically significant (p < 0.05)
---
## Modified Files
The following files were modified to add `__attribute__((always_inline))`:
1. **core/box/tiny_alloc_gate_box.h** (Line 139)
```c
static __attribute__((always_inline)) void* tiny_alloc_gate_fast(size_t size)
```
2. **core/box/tiny_free_gate_box.h** (Line 131)
```c
static __attribute__((always_inline)) int tiny_free_gate_try_fast(void* user_ptr)
```
---
## Statistical Validation
### Significant Results (p < 0.05)
- **CPU Cycles**: t = 2.823, df = 5.76 ✅
- **Cache Misses**: t = 3.177, df = 5.73 ✅
These metrics passed statistical significance testing with 5 samples.
### Variance Analysis
BUILD A (WITH inlining) shows **consistently lower variance**:
- CPU Cycles CV: 0.75% vs 1.52% (50% improvement)
- Cache Misses CV: 4.74% vs 8.60% (45% improvement)
- Test 2 Throughput CV: 11.26% vs 19.18% (41% improvement)
---
## Reproducing Results
### Build Both Binaries
```bash
# BUILD A (WITH inlining) - already built
make clean
CFLAGS="-O3 -march=native" make bench_allocators_hakmem
cp bench_allocators_hakmem bench_allocators_hakmem.with_inline
# BUILD B (WITHOUT inlining)
# Remove __attribute__((always_inline)) from:
# - core/box/tiny_alloc_gate_box.h:139
# - core/box/tiny_free_gate_box.h:131
make clean
CFLAGS="-O3 -march=native" make bench_allocators_hakmem
cp bench_allocators_hakmem bench_allocators_hakmem.no_inline
```
### Run Benchmarks
```bash
# Test 1: Standard workload
./run_benchmark.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_benchmark.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5
# Test 2: Conservative profile
./run_benchmark_conservative.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_benchmark_conservative.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5
# Perf counters
./run_perf.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_perf.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5
```
### Analyze Results
```bash
python3 analyze_results.py
```
---
## Next Steps
With the Gatekeeper inlining optimization validated and in place, the recommended next optimization is:
### **Batch Tier Checks**
**Goal**: Reduce overhead of per-allocation route policy lookups
**Approach**:
1. Batch route policy checks for multiple allocations
2. Cache tier decisions in TLS
3. Amortize lookup overhead across multiple operations
**Expected Benefit**: Additional 1-3% throughput improvement
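One possible shape for the "cache tier decisions in TLS" step is sketched
below. All names here (`route_policy_lookup`, `tier_for_class`) are invented
for illustration, not hakmem APIs; the point is only that each thread resolves
a size class's tier once and then reuses the cached answer.

```c
#define NUM_CLASSES 8

static long g_policy_lookups = 0;  /* counts trips through the slow path */

/* Stand-in for the per-allocation route policy lookup being amortized;
 * the real policy table and tier semantics are not shown here. */
static int route_policy_lookup(int size_class) {
    g_policy_lookups++;
    return size_class < 4 ? 0 : 1;  /* e.g. tier 0 = tiny, tier 1 = small */
}

/* TLS memo: per-thread, per-class cached tier (valid flag 0 = not cached). */
static _Thread_local int tls_tier_valid[NUM_CLASSES];
static _Thread_local int tls_tier_cache[NUM_CLASSES];

int tier_for_class(int size_class) {
    if (!tls_tier_valid[size_class]) {
        tls_tier_cache[size_class] = route_policy_lookup(size_class);
        tls_tier_valid[size_class] = 1;
    }
    return tls_tier_cache[size_class];
}

long policy_lookup_count(void) { return g_policy_lookups; }
```

With this shape, a thread making many allocations in one class pays for the
policy lookup once, which is the amortization the approach above aims for.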
---
## References
- Original optimization request: Gatekeeper inlining analysis
- Benchmark workload: `bench_random_mixed_hakmem 1000000 256 42`
- Test parameters: 5 iterations per configuration after 1 warmup
- Statistical method: Welch's t-test (α = 0.05)
---
**Generated**: 2025-12-04
**System**: Linux 6.8.0-87-generic
**Compiler**: GCC with -O3 -march=native -flto