# Phase 6-A Benchmark Results
**Date**: 2025-11-29
**Change**: Disable SuperSlab lookup debug validation in RELEASE builds
**File**: `core/tiny_region_id.h:199-239`
**Guard**: `#if !HAKMEM_BUILD_RELEASE` around `hak_super_lookup()` call
**Reason**: perf profiling showed 15.84% CPU cost on allocation hot path (debug-only validation)
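For reference, the change amounts to wrapping the validation call in a build-mode guard. A minimal sketch of the pattern (only `HAKMEM_BUILD_RELEASE` and `hak_super_lookup()` are taken from the actual change; the wrapper function, the lookup signature, and the assert are illustrative stand-ins, not the real code in `core/tiny_region_id.h`):
```c
/* Sketch of the Phase 6-A guard pattern (illustrative, not hakmem source). */
#include <assert.h>

void* hak_super_lookup(void* ptr);    /* real declaration lives in hakmem headers */

static inline void write_header_sketch(void* ptr)
{
#if !HAKMEM_BUILD_RELEASE
    /* Debug-only validation: confirm ptr belongs to a registered SuperSlab.
     * The whole lookup disappears when HAKMEM_BUILD_RELEASE=1. */
    void* ss = hak_super_lookup(ptr);
    assert(ss != NULL);
#else
    (void)ptr;                        /* release build: no validation cost */
#endif
}
```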
---
## Executive Summary
Phase 6-A implementation successfully removes debug validation overhead in release builds, but the measured performance impact is **significantly smaller** than predicted:
- **Expected**: +12-15% (random_mixed), +8-10% (mid_mt_gap)
- **Actual (best 3 of 5)**: +1.67% (random_mixed), +1.33% (mid_mt_gap)
- **Actual (excluding warmup)**: +4.07% (random_mixed), +1.97% (mid_mt_gap)
**Recommendation**: HOLD on commit. Investigate discrepancy between perf analysis (15.84% CPU) and benchmark results (~1-4% improvement).
---
## Benchmark Configuration
### Build Configurations
#### Baseline (Before Phase 6-A)
```bash
make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup ALWAYS enabled (no guard in code yet)
```
#### Phase 6-A (After)
```bash
git stash pop # Restore Phase 6-A changes
make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup DISABLED (guarded by #if !HAKMEM_BUILD_RELEASE)
```
### Benchmark Parameters
- **Iterations**: 1,000,000 operations per run
- **Working Set**: 256 blocks
- **Seed**: 42 (reproducible)
- **Runs**: 5 per configuration
- **Suppression**: `2>/dev/null` to exclude debug output noise
---
## Raw Results
### bench_random_mixed (Tiny workload, 16B-1KB)
#### Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)
```
Run 1: 53.81 M ops/s
Run 2: 53.25 M ops/s
Run 3: 53.56 M ops/s
Run 4: 49.41 M ops/s
Run 5: 51.41 M ops/s
Average: 52.29 M ops/s
Stdev: 1.86 M ops/s
```
#### Phase 6-A (Release build, SuperSlab lookup DISABLED)
```
Run 1: 39.11 M ops/s ⚠️ OUTLIER (warmup)
Run 2: 53.30 M ops/s
Run 3: 56.28 M ops/s
Run 4: 52.79 M ops/s
Run 5: 53.72 M ops/s
Average: 51.04 M ops/s (all runs)
Stdev: 6.80 M ops/s (high due to outlier)
Average (excl. Run 1): 54.02 M ops/s
```
**Outlier Analysis**: Run 1 is 27.6% slower than the average of runs 2-5, indicating a warmup/cache-cold issue.
---
### bench_mid_mt_gap (Mid MT workload, 1KB-8KB)
#### Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)
```
Run 1: 41.70 M ops/s
Run 2: 37.39 M ops/s
Run 3: 40.91 M ops/s
Run 4: 40.53 M ops/s
Run 5: 40.56 M ops/s
Average: 40.22 M ops/s
Stdev: 1.65 M ops/s
```
#### Phase 6-A (Release build, SuperSlab lookup DISABLED)
```
Run 1: 41.49 M ops/s
Run 2: 41.81 M ops/s
Run 3: 41.51 M ops/s
Run 4: 38.43 M ops/s
Run 5: 40.78 M ops/s
Average: 40.80 M ops/s
Stdev: 1.38 M ops/s
```
**Variance Analysis**: Both baseline and Phase 6-A show similar variance (~3-4 M ops/s spread), suggesting measurement noise is inherent to this benchmark.
---
## Statistical Analysis
### Comparison 1: All Runs (Conservative)
| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|-----------|----------|-----------|----------|----------|----------|--------|
| random_mixed | 52.29 M | 51.04 M | -1.25 M | **-2.39%** | +12-15% | ❌ FAIL |
| mid_mt_gap | 40.22 M | 40.80 M | +0.59 M | **+1.46%** | +8-10% | ❌ FAIL |
### Comparison 2: Excluding First Run (Warmup Correction)
| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|-----------|----------|-----------|----------|----------|----------|--------|
| random_mixed | 51.91 M | 54.02 M | +2.11 M | **+4.07%** | +12-15% | ⚠️ PARTIAL |
| mid_mt_gap | 39.85 M | 40.63 M | +0.78 M | **+1.97%** | +8-10% | ❌ FAIL |
### Comparison 3: Best 3 of 5 (Peak Performance)
| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|-----------|----------|-----------|----------|----------|----------|--------|
| random_mixed | 53.54 M | 54.43 M | +0.89 M | **+1.67%** | +12-15% | ❌ FAIL |
| mid_mt_gap | 41.06 M | 41.60 M | +0.54 M | **+1.33%** | +8-10% | ❌ FAIL |
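For reference, the means, sample standard deviations, and relative deltas in these tables follow the usual formulas; a small standalone helper like the one below (not part of the repo) reproduces, for example, the Comparison 1 row for random_mixed from the raw runs:
```c
/* stats.c - reproduce the mean / sample-stdev / relative-delta figures above.
 * Build: gcc -O2 -o stats stats.c -lm   (helper only, not part of hakmem) */
#include <math.h>
#include <stdio.h>

static double mean(const double* v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += v[i];
    return s / n;
}

static double stdev(const double* v, int n) {          /* sample stdev (n-1) */
    double m = mean(v, n), s = 0.0;
    for (int i = 0; i < n; i++) s += (v[i] - m) * (v[i] - m);
    return sqrt(s / (n - 1));
}

int main(void) {
    const double base[]  = {53.81, 53.25, 53.56, 49.41, 51.41};   /* baseline random_mixed */
    const double phase[] = {39.11, 53.30, 56.28, 52.79, 53.72};   /* Phase 6-A random_mixed */
    double mb = mean(base, 5), mp = mean(phase, 5);
    printf("baseline : %.2f M ops/s (stdev %.2f)\n", mb, stdev(base, 5));
    printf("phase 6-A: %.2f M ops/s (stdev %.2f)\n", mp, stdev(phase, 5));
    printf("relative : %+.2f%%\n", 100.0 * (mp - mb) / mb);
    return 0;
}
```
Swapping in the other run sets reproduces the remaining rows.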
---
## Performance Summary
### Overall Results (Best 3 of 5 method)
- **random_mixed**: 53.54 → 54.43 M ops/s (+1.67%)
- **mid_mt_gap**: 41.06 → 41.60 M ops/s (+1.33%)
### vs Predictions
- **random_mixed**: Expected +12-15%, Actual +1.67% → **FAIL** (roughly 7-9x smaller than expected)
- **mid_mt_gap**: Expected +8-10%, Actual +1.33% → **FAIL** (6-7x smaller than expected)
### Interpretation
Phase 6-A shows **small, practically negligible** performance improvements that sit at or below the benchmarks' noise floor:
- Excluding warmup: +4.07% (random_mixed), +1.97% (mid_mt_gap)
- Best 3 of 5: +1.67% (random_mixed), +1.33% (mid_mt_gap)
- All runs: -2.39% (random_mixed), +1.46% (mid_mt_gap)
The improvements are **roughly an order of magnitude smaller** than expected based on the perf analysis.
---
## Root Cause Analysis
### Why the Discrepancy?
The perf profile showed `hak_super_lookup()` consuming **15.84% of CPU time**, yet removing it yields only **~1-4% improvement**. Possible explanations:
#### 1. **Compiler Optimization (Most Likely)**
The compiler may already be optimizing away the `hak_super_lookup()` call in release builds:
- **Dead Code Elimination**: The result of `hak_super_lookup()` is only consumed by debug logging, so the compiler can drop the call when that logging is disabled
- **Inlining + Constant Propagation**: With LTO, the compiler sees the result is unused
- **Evidence**: Phase 6-A guard has minimal impact, suggesting code was already "free"
**Action**: Examine assembly output to verify whether `hak_super_lookup()` is present in the baseline build
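A toy example (none of this is hakmem code) of how a visible, side-effect-free callee whose result feeds only debug code gets removed entirely at -O3:
```c
/* dce_demo.c - toy illustration of hypothesis #1 (nothing here is hakmem code).
 * With the debug consumer compiled out, GCC/Clang at -O3 typically remove the
 * call entirely when the callee is visible and side-effect free:
 *   gcc -O3 -S dce_demo.c && grep lookup dce_demo.s   # often no trace remains
 */
#include <stdint.h>

static inline uint64_t lookup(uint64_t x) {   /* stand-in for hak_super_lookup() */
    return (x * 2654435761u) >> 7;            /* pure computation, no side effects */
}

uint64_t hot_path(uint64_t p) {
    uint64_t r = lookup(p);                   /* result only consumed by debug code */
#if defined(DEBUG_VALIDATE)
    if (r == 0) __builtin_trap();
#else
    (void)r;                                  /* release build: result is unused ... */
#endif
    return p + 1;                             /* ... so the lookup is dead code */
}
```
If the baseline assembly shows the same behavior for `hak_super_lookup()`, hypothesis #1 holds and the Phase 6-A guard merely makes the elimination explicit.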
#### 2. **Perf Sampling Bias**
The perf profile may have been captured during a different workload phase:
- Different allocation patterns (class distribution)
- Different cache states (cold vs. hot)
- Different thread counts (single vs. multi-threaded)
**Action**: Re-run perf on the exact benchmark workload to verify 15.84% claim
#### 3. **Measurement Noise**
The benchmarks show high variance:
- random_mixed: 1.86 M stdev (3.6% of mean)
- mid_mt_gap: 1.65 M stdev (4.1% of mean)
The measured improvements (+1-4%) are within **1-2 standard deviations** of noise.
**Action**: Run longer benchmarks (10M+ operations) to reduce noise
#### 4. **Lookup Already Cache-Friendly**
The SuperSlab registry lookup may be highly cache-efficient in these workloads:
- Small working set (256 blocks) fits in L1/L2 cache
- Registry entries for active SuperSlabs are hot
- Cost is much lower than perf's 15.84% suggests
**Action**: Benchmark with larger working sets (e.g., 4K+ blocks) to stress the cache
#### 5. **Wrong Hot Path**
The perf profile showed 15.84% CPU in `hak_super_lookup()`, but this may not be on the **allocation hot path** that these benchmarks exercise:
- The call is in `tiny_region_id_write_header()` (allocation)
- Benchmarks mix alloc+free, free path may dominate
- Perf may have sampled during a malloc-heavy phase
**Action**: Isolate allocation-only benchmark (no frees) to verify
---
## Recommendations
### Immediate Actions
1. **HOLD** on committing Phase 6-A until investigation completes
- Current results don't justify the change
- Risk: code churn without measurable benefit
2. **Verify Compiler Behavior**
```bash
# Generate assembly for the baseline build
# (-x c makes gcc compile the header as a C translation unit; extra -I flags
#  may be needed to satisfy its includes)
gcc -S -x c -DHAKMEM_BUILD_RELEASE=1 -O3 -o baseline.s core/tiny_region_id.h
# Check whether hak_super_lookup appears
grep "hak_super_lookup" baseline.s
# If absent: the compiler already eliminated it (explains the minimal improvement)
# If present: something else is going on
# Caveat: if the call was inlined, the symbol may not appear even though the work remains
```
3. **Re-run Perf on Benchmark Workload**
```bash
# Build baseline without Phase 6-A (keep -g so perf can symbolize samples)
git stash
make clean && make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem
# Profile the same benchmark workload (longer run for more samples)
perf record -g ./bench_random_mixed_hakmem 10000000 256 42
perf report --stdio | grep -A20 "hak_super_lookup"
# Verify whether the 15.84% claim holds for this workload
```
4. **Longer Benchmark Runs**
```bash
# 100M operations to reduce noise
for i in 1 2 3 4 5; do
    ./bench_random_mixed_hakmem 100000000 256 42 2>/dev/null
done
```
### Long-Term Considerations
If investigation reveals:
#### Scenario A: Compiler Already Optimized
- **Decision**: Commit Phase 6-A for code cleanliness (no harm, no foul)
- **Rationale**: Explicitly documents debug-only code, prevents future confusion
- **Benefit**: Future-proof if compiler behavior changes
#### Scenario B: Perf Was Wrong
- **Decision**: Discard Phase 6-A, update perf methodology
- **Rationale**: The 15.84% CPU claim was based on flawed profiling
- **Action**: Document correct perf sampling procedure
#### Scenario C: Benchmark Doesn't Stress Hot Path
- **Decision**: Commit Phase 6-A, improve benchmark coverage
- **Rationale**: Real workloads may show the expected gains
- **Action**: Add allocation-heavy benchmark (e.g., 90% malloc, 10% free)
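A rough standalone sketch of such a driver (illustrative only: it uses plain `malloc`/`free` as stand-ins, and the file name, working-set size, and loop structure below are assumptions rather than anything from the repo):
```c
/* alloc_heavy_sketch.c - allocation-heavy driver sketch (illustrative only;
 * a real version would call hakmem's allocator and reuse the existing harness's
 * iteration / working-set / seed parameters).
 * Build: gcc -O3 -o alloc_heavy alloc_heavy_sketch.c */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { WS = 65536 };                         /* max live blocks before a drain */

int main(void) {
    const long iters = 10 * 1000 * 1000;
    static void* slots[WS];
    size_t live = 0;
    uint64_t rng = 42;                       /* xorshift64 state, seed 42 */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        if (live == WS) {                    /* cap reached: drain half in a burst   */
            while (live > WS / 2) free(slots[--live]);
        }
        if ((rng % 10) != 0 || live == 0) {  /* ~90% of steps: allocate 16B..1KB     */
            slots[live++] = malloc(16 + ((rng >> 8) % 1008));
        } else {                             /* ~10% of steps: free the newest block */
            free(slots[--live]);
        }
    }
    while (live > 0) free(slots[--live]);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f M ops/s\n", iters / sec / 1e6);   /* throughput over the step loop */
    return 0;
}
```
Because memory stays bounded, total frees necessarily equal total mallocs over the run; the 90/10 split describes the per-step choice, and the periodic drain absorbs the difference.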
#### Scenario D: Measurement Noise Dominates
- **Decision**: Commit Phase 6-A if longer runs show >5% improvement
- **Rationale**: Noise can hide real improvements
- **Action**: Use mimalloc-bench suite for more stable measurements
---
## Next Steps
### Phase 6-B: Conditional Path Forward
**Option 1: Investigate First (Recommended)**
1. Run assembly analysis (1 hour)
2. Re-run perf on benchmark (2 hours)
3. Run longer benchmarks (4 hours)
4. Make data-driven decision
**Option 2: Commit Anyway**
- Rationale: Code is cleaner, no measurable harm
- Risk: Future confusion if optimization isn't actually needed
**Option 3: Discard Phase 6-A**
- Rationale: No measurable benefit, not worth the churn
- Risk: Miss real optimization if measurement was flawed
---
## Appendix: Full Benchmark Output
### Baseline - bench_random_mixed
```
=== Baseline: bench_random_mixed (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput = 53806309 ops/s [iter=1000000 ws=256] time=0.019s
Run 2: Throughput = 53246568 ops/s [iter=1000000 ws=256] time=0.019s
Run 3: Throughput = 53563123 ops/s [iter=1000000 ws=256] time=0.019s
Run 4: Throughput = 49409566 ops/s [iter=1000000 ws=256] time=0.020s
Run 5: Throughput = 51412515 ops/s [iter=1000000 ws=256] time=0.019s
```
### Phase 6-A - bench_random_mixed
```
=== Phase 6-A: bench_random_mixed (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput = 39111201 ops/s [iter=1000000 ws=256] time=0.026s
Run 2: Throughput = 53296242 ops/s [iter=1000000 ws=256] time=0.019s
Run 3: Throughput = 56279982 ops/s [iter=1000000 ws=256] time=0.018s
Run 4: Throughput = 52790754 ops/s [iter=1000000 ws=256] time=0.019s
Run 5: Throughput = 53715992 ops/s [iter=1000000 ws=256] time=0.019s
```
### Baseline - bench_mid_mt_gap
```
=== Baseline: bench_mid_mt_gap (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput = 41.70 M operations per second, relative time: 0.023979 s.
Run 2: Throughput = 37.39 M operations per second, relative time: 0.026745 s.
Run 3: Throughput = 40.91 M operations per second, relative time: 0.024445 s.
Run 4: Throughput = 40.53 M operations per second, relative time: 0.024671 s.
Run 5: Throughput = 40.56 M operations per second, relative time: 0.024657 s.
```
### Phase 6-A - bench_mid_mt_gap
```
=== Phase 6-A: bench_mid_mt_gap (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput = 41.49 M operations per second, relative time: 0.024103 s.
Run 2: Throughput = 41.81 M operations per second, relative time: 0.023917 s.
Run 3: Throughput = 41.51 M operations per second, relative time: 0.024089 s.
Run 4: Throughput = 38.43 M operations per second, relative time: 0.026019 s.
Run 5: Throughput = 40.78 M operations per second, relative time: 0.024524 s.
```
---
## Conclusion
Phase 6-A successfully implements the intended optimization (disabling the SuperSlab lookup in release builds), but the measured performance impact (+1-4%) is **roughly an order of magnitude smaller** than the expected +12-15% based on the perf analysis.
**Critical Question**: Why does removing code that perf claims costs 15.84% CPU only yield 1-4% improvement?
**Most Likely Answer**: The compiler was already optimizing away the `hak_super_lookup()` call in release builds through dead code elimination, since its result is only used for debug assertions.
**Recommended Action**: Investigate before committing. If the compiler was already optimizing, Phase 6-A is still valuable for code clarity and future-proofing, but the performance claim needs correction.