# Gatekeeper Inlining Optimization - Benchmark Index

**Date**: 2025-12-04
**Status**: ✅ COMPLETED - OPTIMIZATION VALIDATED

---

## Quick Summary

The `__attribute__((always_inline))` optimization on Gatekeeper functions is **EFFECTIVE and VALIDATED**:

- **Throughput**: +10.57% (Test 1) and +3.89% (Test 2)
- **CPU Cycles**: -2.13% (statistically significant)
- **Cache Misses**: -13.53% (statistically significant)

**Recommendation**: ✅ **KEEP** the inlining optimization

---

## Documentation

### Primary Reports

1. **BENCHMARK_SUMMARY.txt** (14KB)
   - Quick reference with all key metrics
   - Best for: Command-line viewing, sharing results
   - Location: `/mnt/workdisk/public_share/hakmem/BENCHMARK_SUMMARY.txt`

2. **GATEKEEPER_INLINING_BENCHMARK_REPORT.md** (15KB)
   - Comprehensive markdown report with tables and analysis
   - Best for: GitHub, documentation, detailed review
   - Location: `/mnt/workdisk/public_share/hakmem/GATEKEEPER_INLINING_BENCHMARK_REPORT.md`

---

## Generated Artifacts

### Binaries

- **bench_allocators_hakmem.with_inline** (354KB)
  - BUILD A: With `__attribute__((always_inline))`
  - Optimized binary

- **bench_allocators_hakmem.no_inline** (350KB)
  - BUILD B: Without forced inlining (baseline)
  - Used for A/B comparison

### Scripts

- **analyze_results.py** (13KB)
  - Python statistical analysis script
  - Computes means, std dev, CV, t-tests
  - Run: `python3 analyze_results.py`

- **run_benchmark.sh**
  - Standard benchmark runner (5 iterations)
  - Usage: `./run_benchmark.sh <binary> <name> [iterations]`

- **run_benchmark_conservative.sh**
  - Conservative profile benchmark runner
  - Sets `HAKMEM_TINY_PROFILE=conservative` and `HAKMEM_SS_PREFAULT=0`

- **run_perf.sh**
  - Perf counter collection script
  - Measures cycles, cache-misses, L1-dcache-load-misses
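The contents of `run_perf.sh` are not reproduced in this index. As a rough sketch only, assuming the counters are collected with `perf stat -x,` (CSV output), the per-run values could be parsed as below. The function name `parse_perf_csv` and the sample lines are hypothetical, and the exact column layout can differ between perf versions.

```python
import csv
import io

def parse_perf_csv(text):
    """Return {event_name: counter_value} from `perf stat -x,` style output."""
    counters = {}
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines, comments, and rows too short to hold value,unit,event.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        value, _unit, event = row[0], row[1], row[2]
        # perf prints "<not counted>" or "<not supported>" instead of a number.
        if not value.isdigit():
            continue
        counters[event] = int(value)
    return counters

# Hypothetical sample output; real values depend on the machine and the run.
sample = """\
71000000,,cycles,20000000,100.00,,
250000,,cache-misses,20000000,100.00,,
<not supported>,,L1-dcache-load-misses,0,100.00,,
"""
print(parse_perf_csv(sample))
```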

---

## Key Results at a Glance

| Metric | WITH Inlining | WITHOUT Inlining | Change |
|--------|-------------:|----------------:|------------:|
| **Test 1 Throughput** | 1,055,159 ops/s | 954,265 ops/s | **+10.57%** |
| **Test 2 Throughput** | 1,095,292 ops/s | 1,054,294 ops/s | **+3.89%** |
| **CPU Cycles** | 71,522,202 | 73,076,160 | **-2.13%** ⭐ |
| **Cache Misses** | 256,020 | 296,074 | **-13.53%** ⭐ |

⭐ = Statistically significant (p < 0.05)
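The percentages in the last column follow from the two value columns as (WITH - WITHOUT) / WITHOUT. A minimal sketch that recomputes them from the table is shown here; the helper name `percent_change` is illustrative and is not taken from `analyze_results.py`.

```python
def percent_change(with_inline, without_inline):
    """Relative change of the inlined build versus the baseline, in percent."""
    return (with_inline - without_inline) / without_inline * 100.0

# (WITH inlining, WITHOUT inlining) means copied from the table above.
results = {
    "Test 1 Throughput": (1_055_159, 954_265),
    "Test 2 Throughput": (1_095_292, 1_054_294),
    "CPU Cycles": (71_522_202, 73_076_160),
    "Cache Misses": (256_020, 296_074),
}

for metric, (a, b) in results.items():
    print(f"{metric}: {percent_change(a, b):+.2f}%")
```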

---

## Modified Files

The following files were modified to add `__attribute__((always_inline))`:

1. **core/box/tiny_alloc_gate_box.h** (Line 139)
   ```c
   static __attribute__((always_inline)) void* tiny_alloc_gate_fast(size_t size)
   ```

2. **core/box/tiny_free_gate_box.h** (Line 131)
   ```c
   static __attribute__((always_inline)) int tiny_free_gate_try_fast(void* user_ptr)
   ```

---

## Statistical Validation

### Significant Results (p < 0.05)

- **CPU Cycles**: t = 2.823, df = 5.76 ✅
- **Cache Misses**: t = 3.177, df = 5.73 ✅

Both metrics are statistically significant at α = 0.05 under Welch's t-test, with 5 samples per build.
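The fractional degrees of freedom reported above are what the Welch-Satterthwaite approximation produces for two samples of 5 with unequal variances. The sketch below shows how t and df are computed in that test. It is not the code from `analyze_results.py`, and the sample values are hypothetical placeholders, not the measured data.

```python
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / (se2_a + se2_b) ** 0.5
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical measurements (5 runs per build), for illustration only.
baseline = [10, 12, 14, 16, 18]
inlined = [9, 10, 11, 12, 13]
t, df = welch_t(baseline, inlined)
print(f"t = {t:.3f}, df = {df:.2f}")
```

For a two-sided test at α = 0.05, the critical value of t is 2.571 at df = 5 and 2.447 at df = 6, so both reported statistics (2.823 and 3.177) exceed the threshold for df ≈ 5.7. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the p-value directly.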

### Variance Analysis

BUILD A (WITH inlining) shows **consistently lower run-to-run variance**, measured as the coefficient of variation (CV):
- CPU Cycles CV: 0.75% vs 1.52% (50% lower)
- Cache Misses CV: 4.74% vs 8.60% (45% lower)
- Test 2 Throughput CV: 11.26% vs 19.18% (41% lower)
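CV is the sample standard deviation divided by the mean, and the figures in parentheses are the relative difference between the two CVs. A minimal sketch of both calculations is given here, assuming the sample (n - 1) standard deviation. The helper names are illustrative and are not taken from `analyze_results.py`.

```python
from statistics import mean, stdev

def coefficient_of_variation(samples):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(samples) / mean(samples) * 100.0

def relative_reduction(cv_with, cv_without):
    """How much lower the inlined build's CV is, in percent of the baseline CV."""
    return (cv_without - cv_with) / cv_without * 100.0

# CV pairs (WITH inlining, WITHOUT inlining) copied from the list above.
reported = {
    "CPU Cycles": (0.75, 1.52),
    "Cache Misses": (4.74, 8.60),
    "Test 2 Throughput": (11.26, 19.18),
}
for metric, (cv_a, cv_b) in reported.items():
    print(f"{metric}: {relative_reduction(cv_a, cv_b):.1f}% lower CV")
```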

---

## Reproducing Results

### Build Both Binaries

```bash
# BUILD A (WITH inlining) - already built
make clean
CFLAGS="-O3 -march=native" make bench_allocators_hakmem
cp bench_allocators_hakmem bench_allocators_hakmem.with_inline

# BUILD B (WITHOUT inlining)
# Remove __attribute__((always_inline)) from:
#   - core/box/tiny_alloc_gate_box.h:139
#   - core/box/tiny_free_gate_box.h:131
make clean
CFLAGS="-O3 -march=native" make bench_allocators_hakmem
cp bench_allocators_hakmem bench_allocators_hakmem.no_inline
```

### Run Benchmarks

```bash
# Test 1: Standard workload
./run_benchmark.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_benchmark.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5

# Test 2: Conservative profile
./run_benchmark_conservative.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_benchmark_conservative.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5

# Perf counters
./run_perf.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_perf.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5
```

### Analyze Results

```bash
python3 analyze_results.py
```

---

## Next Steps

With the Gatekeeper inlining optimization validated and in place, the recommended next optimization is:

### **Batch Tier Checks**

**Goal**: Reduce the overhead of per-allocation route policy lookups

**Approach**:
1. Batch route policy checks for multiple allocations
2. Cache tier decisions in TLS
3. Amortize lookup overhead across multiple operations

**Expected Benefit**: Additional 1-3% throughput improvement
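The approach above can be illustrated with a small sketch, written in Python for brevity even though the real change would live in the C allocator. It caches the tier decision per size class in thread-local storage and repeats the real lookup only once every `REFRESH_INTERVAL` operations. All names here (`lookup_route_policy`, `cached_route`, `REFRESH_INTERVAL`) and the "fast"/"slow" tiers are hypothetical and do not come from the hakmem sources. The final line prints how many real lookups 1000 calls required.

```python
import threading

REFRESH_INTERVAL = 64  # hypothetical: re-check the policy once per 64 allocations

_tls = threading.local()
lookup_count = 0  # instrumentation for this sketch only (not thread-safe)

def lookup_route_policy(size_class):
    """Hypothetical stand-in for the expensive per-allocation policy lookup."""
    global lookup_count
    lookup_count += 1
    return "fast" if size_class < 8 else "slow"

def cached_route(size_class):
    """Return the tier for size_class, refreshing the per-thread cache periodically."""
    cache = getattr(_tls, "cache", None)
    if cache is None:
        cache = _tls.cache = {}
    entry = cache.get(size_class)
    if entry is None or entry[1] == 0:
        # Cache miss or budget exhausted: do the real lookup and reset the budget.
        entry = [lookup_route_policy(size_class), REFRESH_INTERVAL]
        cache[size_class] = entry
    entry[1] -= 1
    return entry[0]

for _ in range(1000):
    cached_route(3)
print(lookup_count)
```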

---

## References

- Original optimization request: Gatekeeper inlining analysis
- Benchmark workload: `bench_random_mixed_hakmem 1000000 256 42`
- Test parameters: 5 iterations per configuration after 1 warmup
- Statistical method: Welch's t-test (α = 0.05)

---

**Generated**: 2025-12-04
**System**: Linux 6.8.0-87-generic
**Compiler**: GCC with -O3 -march=native -flto