# Gatekeeper Inlining Optimization - Benchmark Index
**Date**: 2025-12-04
**Status**: ✅ COMPLETED - OPTIMIZATION VALIDATED
---
## Quick Summary
The `__attribute__((always_inline))` optimization on Gatekeeper functions is **EFFECTIVE and VALIDATED**:
- **Throughput**: +10.57% improvement (Test 1)
- **CPU Cycles**: -2.13% reduction (statistically significant)
- **Cache Misses**: -13.53% reduction (statistically significant)
**Recommendation**: ✅ **KEEP** the inlining optimization
---
## Documentation
### Primary Reports
1. **BENCHMARK_SUMMARY.txt** (14KB)
- Quick reference with all key metrics
- Best for: Command-line viewing, sharing results
- Location: `/mnt/workdisk/public_share/hakmem/BENCHMARK_SUMMARY.txt`
2. **GATEKEEPER_INLINING_BENCHMARK_REPORT.md** (15KB)
- Comprehensive markdown report with tables and analysis
- Best for: GitHub, documentation, detailed review
- Location: `/mnt/workdisk/public_share/hakmem/GATEKEEPER_INLINING_BENCHMARK_REPORT.md`
---
## Generated Artifacts
### Binaries
- **bench_allocators_hakmem.with_inline** (354KB)
- BUILD A: With `__attribute__((always_inline))`
- Optimized binary
- **bench_allocators_hakmem.no_inline** (350KB)
- BUILD B: Without forced inlining (baseline)
- Used for A/B comparison
### Scripts
- **analyze_results.py** (13KB)
- Python statistical analysis script
- Computes means, std dev, CV, t-tests
- Run: `python3 analyze_results.py`
- **run_benchmark.sh**
- Standard benchmark runner (5 iterations)
- Usage: `./run_benchmark.sh <binary> <name> [iterations]`
- **run_benchmark_conservative.sh**
- Conservative profile benchmark runner
- Sets `HAKMEM_TINY_PROFILE=conservative` and `HAKMEM_SS_PREFAULT=0`
- **run_perf.sh**
- Perf counter collection script
- Measures cycles, cache-misses, L1-dcache-load-misses
---
## Key Results at a Glance
| Metric | WITH Inlining | WITHOUT Inlining | Improvement |
|--------|-------------:|----------------:|------------:|
| **Test 1 Throughput** | 1,055,159 ops/s | 954,265 ops/s | **+10.57%** |
| **Test 2 Throughput** | 1,095,292 ops/s | 1,054,294 ops/s | **+3.89%** |
| **CPU Cycles** | 71,522,202 | 73,076,160 | **-2.13%** ⭐ |
| **Cache Misses** | 256,020 | 296,074 | **-13.53%** ⭐ |
⭐ = Statistically significant (p < 0.05)
---
## Modified Files
The following files were modified to add `__attribute__((always_inline))`:
1. **core/box/tiny_alloc_gate_box.h** (Line 139)
```c
static __attribute__((always_inline)) void* tiny_alloc_gate_fast(size_t size)
```
2. **core/box/tiny_free_gate_box.h** (Line 131)
```c
static __attribute__((always_inline)) int tiny_free_gate_try_fast(void* user_ptr)
```
---
## Statistical Validation
### Significant Results (p < 0.05)
- **CPU Cycles**: t = 2.823, df = 5.76 ✅
- **Cache Misses**: t = 3.177, df = 5.73 ✅
Both metrics passed Welch's t-test at α = 0.05 with 5 samples per configuration; a sketch of the computation follows the variance analysis below.
### Variance Analysis
BUILD A (WITH inlining) shows **consistently lower variance**:
- CPU Cycles CV: 0.75% vs 1.52% (50% improvement)
- Cache Misses CV: 4.74% vs 8.60% (45% improvement)
- Test 2 Throughput CV: 11.26% vs 19.18% (41% improvement)
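The statistics quoted above (Welch's t, Welch–Satterthwaite df, CV) can be reproduced with a few lines of C. This is a standalone sketch, not the code in `analyze_results.py`, and the sample arrays are placeholders chosen only to be in the right ballpark of the reported cycle counts:

```c
/* Sketch of Welch's t-test and CV. Build: cc welch.c -lm
 * Sample arrays are illustrative placeholders, NOT the measured data. */
#include <math.h>
#include <stdio.h>

static double mean(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s / n;
}

static double variance(const double *x, int n) {   /* sample variance (n-1) */
    double m = mean(x, n), s = 0.0;
    for (int i = 0; i < n; i++) s += (x[i] - m) * (x[i] - m);
    return s / (n - 1);
}

int main(void) {
    /* 5 runs per build; placeholder cycle counts near the reported means. */
    double a[5] = { 7.10e7, 7.12e7, 7.15e7, 7.18e7, 7.21e7 };  /* BUILD A */
    double b[5] = { 7.16e7, 7.24e7, 7.31e7, 7.38e7, 7.45e7 };  /* BUILD B */
    int na = 5, nb = 5;

    double va = variance(a, na) / na, vb = variance(b, nb) / nb;
    /* baseline minus optimized, so t > 0 when the optimized build is lower */
    double t  = (mean(b, nb) - mean(a, na)) / sqrt(va + vb);
    double df = (va + vb) * (va + vb) /
                (va * va / (na - 1) + vb * vb / (nb - 1));  /* Welch-Satterthwaite */
    double cv = sqrt(variance(a, na)) / mean(a, na) * 100.0;  /* CV of BUILD A, % */

    printf("t = %.3f, df = %.2f, CV(A) = %.2f%%\n", t, df, cv);
    return 0;
}
```

Note that df comes out fractional (here ≈ 5–6 with 5 samples per side), which is why the report lists df = 5.76 rather than the pooled-test value of 8.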
---
## Reproducing Results
### Build Both Binaries
```bash
# BUILD A (WITH inlining) - already built
make clean
CFLAGS="-O3 -march=native" make bench_allocators_hakmem
cp bench_allocators_hakmem bench_allocators_hakmem.with_inline
# BUILD B (WITHOUT inlining)
# Remove __attribute__((always_inline)) from:
# - core/box/tiny_alloc_gate_box.h:139
# - core/box/tiny_free_gate_box.h:131
make clean
CFLAGS="-O3 -march=native" make bench_allocators_hakmem
cp bench_allocators_hakmem bench_allocators_hakmem.no_inline
```
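The BUILD B step above requires editing two headers by hand. A hypothetical alternative (not currently in the tree) is to guard the attribute behind a macro, so a single `-D` flag switches builds:

```c
/* Hypothetical macro guard (not in the current headers): lets
 * -DHAKMEM_NO_FORCE_INLINE produce BUILD B without source edits. */
#ifdef HAKMEM_NO_FORCE_INLINE
#  define HAK_ALWAYS_INLINE                       /* plain static: compiler decides */
#else
#  define HAK_ALWAYS_INLINE __attribute__((always_inline))
#endif

/* core/box/tiny_alloc_gate_box.h:139 would then read: */
static HAK_ALWAYS_INLINE void* tiny_alloc_gate_fast(size_t size);
```

With that guard in place, BUILD B would become `CFLAGS="-O3 -march=native -DHAKMEM_NO_FORCE_INLINE" make bench_allocators_hakmem`.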
### Run Benchmarks
```bash
# Test 1: Standard workload
./run_benchmark.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_benchmark.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5
# Test 2: Conservative profile
./run_benchmark_conservative.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_benchmark_conservative.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5
# Perf counters
./run_perf.sh ./bench_allocators_hakmem.with_inline "WITH_INLINE" 5
./run_perf.sh ./bench_allocators_hakmem.no_inline "NO_INLINE" 5
```
### Analyze Results
```bash
python3 analyze_results.py
```
---
## Next Steps
With the Gatekeeper inlining optimization validated and in place, the recommended next optimization is:
### **Batch Tier Checks**
**Goal**: Reduce overhead of per-allocation route policy lookups
**Approach**:
1. Batch route policy checks for multiple allocations
2. Cache tier decisions in TLS (see the sketch below)
3. Amortize lookup overhead across multiple operations
**Expected Benefit**: Additional 1-3% throughput improvement
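As a rough illustration of step 2, here is a minimal sketch of a TLS tier-decision cache. All names (`tier_t`, `route_policy_lookup`, `tier_for_class`, `g_tier_cache`) are hypothetical and do not correspond to actual HAKMEM APIs:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t tier_t;

#define TIER_CACHE_CLASSES 64
#define TIER_UNKNOWN       ((tier_t)0xFF)

/* Stub standing in for the real per-allocation route policy scan;
 * purely illustrative. */
static tier_t route_policy_lookup(size_t size_class) {
    return size_class < 32 ? 1 : 2;   /* placeholder decision */
}

/* Per-thread cache of tier decisions, indexed by size class. */
static __thread tier_t g_tier_cache[TIER_CACHE_CLASSES];
static __thread int    g_tier_cache_ready;

static inline tier_t tier_for_class(size_t size_class) {
    if (!g_tier_cache_ready) {
        for (int i = 0; i < TIER_CACHE_CLASSES; i++)
            g_tier_cache[i] = TIER_UNKNOWN;
        g_tier_cache_ready = 1;
    }
    tier_t t = g_tier_cache[size_class];
    if (t == TIER_UNKNOWN) {
        t = route_policy_lookup(size_class);   /* slow path: paid once per thread/class */
        g_tier_cache[size_class] = t;
    }
    return t;                                  /* fast path: one TLS array load */
}
```

This amortizes the lookup to once per thread per size class; a real implementation would also need an invalidation hook (e.g., a generation counter bumped whenever the route policy changes).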
---
## References
- Original optimization request: Gatekeeper inlining analysis
- Benchmark workload: `bench_random_mixed_hakmem 1000000 256 42`
- Test parameters: 5 iterations per configuration after 1 warmup
- Statistical method: Welch's t-test (α = 0.05)
---
**Generated**: 2025-12-04
**System**: Linux 6.8.0-87-generic
**Compiler**: GCC with -O3 -march=native -flto