# Phase 6-A Benchmark Results

**Date**: 2025-11-29
**Change**: Disable SuperSlab lookup debug validation in RELEASE builds
**File**: `core/tiny_region_id.h:199-239`
**Guard**: `#if !HAKMEM_BUILD_RELEASE` around `hak_super_lookup()` call
**Reason**: perf profiling showed 15.84% CPU cost on allocation hot path (debug-only validation)

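The sketch below shows the intended shape of the change. It is illustrative only: apart from `HAKMEM_BUILD_RELEASE` and `hak_super_lookup()`, the names are assumed stand-ins, and the actual code at `core/tiny_region_id.h:199-239` may differ.

```c
/* Sketch of the Phase 6-A change (illustrative only -- not the actual contents
 * of core/tiny_region_id.h; every name except HAKMEM_BUILD_RELEASE and
 * hak_super_lookup() is an assumed stand-in). */
typedef struct hak_super_slab hak_super_slab_t;        /* opaque, assumed */
hak_super_slab_t *hak_super_lookup(const void *ptr);   /* registry lookup, signature assumed */

static inline void tiny_region_id_write_header(void *block)
{
#if !HAKMEM_BUILD_RELEASE
    /* Debug builds only: cross-check the block against the SuperSlab registry.
     * With -DHAKMEM_BUILD_RELEASE=1 (the Makefile default) this call is compiled out. */
    hak_super_slab_t *ss = hak_super_lookup(block);
    (void)ss;   /* the real code presumably asserts/logs on a mismatch */
#endif
    /* ... the header write itself is unchanged by Phase 6-A ... */
    (void)block;
}
```
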
---

## Executive Summary

The Phase 6-A implementation successfully removes the debug-only validation from release builds, but the measured performance impact is **significantly smaller** than predicted:

- **Expected**: +12-15% (random_mixed), +8-10% (mid_mt_gap)
- **Actual (best 3 of 5)**: +1.67% (random_mixed), +1.33% (mid_mt_gap)
- **Actual (excluding warmup)**: +4.07% (random_mixed), +1.97% (mid_mt_gap)

**Recommendation**: HOLD on commit. Investigate the discrepancy between the perf analysis (15.84% CPU) and the benchmark results (~1-4% improvement).

---

## Benchmark Configuration

### Build Configurations

#### Baseline (Before Phase 6-A)
```bash
make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup ALWAYS enabled (no guard in code yet)
```

#### Phase 6-A (After)
```bash
git stash pop # Restore Phase 6-A changes
make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup DISABLED (guarded by #if !HAKMEM_BUILD_RELEASE)
```

### Benchmark Parameters
- **Iterations**: 1,000,000 operations per run
- **Working Set**: 256 blocks
- **Seed**: 42 (reproducible)
- **Runs**: 5 per configuration
- **Suppression**: `2>/dev/null` to exclude debug output noise

---

## Raw Results

### bench_random_mixed (Tiny workload, 16B-1KB)

#### Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)
```
Run 1: 53.81 M ops/s
Run 2: 53.25 M ops/s
Run 3: 53.56 M ops/s
Run 4: 49.41 M ops/s
Run 5: 51.41 M ops/s
Average: 52.29 M ops/s
Stdev: 1.86 M ops/s
```

#### Phase 6-A (Release build, SuperSlab lookup DISABLED)
```
Run 1: 39.11 M ops/s ⚠️ OUTLIER (warmup)
Run 2: 53.30 M ops/s
Run 3: 56.28 M ops/s
Run 4: 52.79 M ops/s
Run 5: 53.72 M ops/s
Average: 51.04 M ops/s (all runs)
Stdev: 6.80 M ops/s (high due to outlier)
Average (excl. Run 1): 54.02 M ops/s
```

**Outlier Analysis**: Run 1 is 27.6% slower than the average of runs 2-5, indicating a warmup/cache-cold issue.

---

### bench_mid_mt_gap (Mid MT workload, 1KB-8KB)

#### Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)
```
Run 1: 41.70 M ops/s
Run 2: 37.39 M ops/s
Run 3: 40.91 M ops/s
Run 4: 40.53 M ops/s
Run 5: 40.56 M ops/s
Average: 40.22 M ops/s
Stdev: 1.65 M ops/s
```

#### Phase 6-A (Release build, SuperSlab lookup DISABLED)
```
Run 1: 41.49 M ops/s
Run 2: 41.81 M ops/s
Run 3: 41.51 M ops/s
Run 4: 38.43 M ops/s
Run 5: 40.78 M ops/s
Average: 40.80 M ops/s
Stdev: 1.38 M ops/s
```

**Variance Analysis**: Both baseline and Phase 6-A show similar variance (~3-4 M ops/s spread), suggesting measurement noise is inherent to this benchmark.

---

## Statistical Analysis

### Comparison 1: All Runs (Conservative)

| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|-----------|----------|-----------|----------|----------|----------|--------|
| random_mixed | 52.29 M | 51.04 M | -1.25 M | **-2.39%** | +12-15% | ❌ FAIL |
| mid_mt_gap | 40.22 M | 40.80 M | +0.59 M | **+1.46%** | +8-10% | ❌ FAIL |

### Comparison 2: Excluding First Run (Warmup Correction)

| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|-----------|----------|-----------|----------|----------|----------|--------|
| random_mixed | 51.91 M | 54.02 M | +2.11 M | **+4.07%** | +12-15% | ⚠️ PARTIAL |
| mid_mt_gap | 39.85 M | 40.63 M | +0.78 M | **+1.97%** | +8-10% | ❌ FAIL |

### Comparison 3: Best 3 of 5 (Peak Performance)

| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|-----------|----------|-----------|----------|----------|----------|--------|
| random_mixed | 53.54 M | 54.43 M | +0.89 M | **+1.67%** | +12-15% | ❌ FAIL |
| mid_mt_gap | 41.06 M | 41.60 M | +0.54 M | **+1.33%** | +8-10% | ❌ FAIL |

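For reference, the standalone C program below (not part of the repo) reproduces the aggregate figures above from the unrounded random_mixed throughputs listed in the appendix: mean, sample standard deviation (the `Stdev` lines), and the relative deltas used in these tables. The mid_mt_gap and best-3-of-5 rows follow the same recipe.

```c
/* stats_check.c -- standalone sanity check for the tables above (not repo code).
 * Build: cc -O2 -o stats_check stats_check.c -lm */
#include <math.h>
#include <stdio.h>

static double mean(const double *v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += v[i];
    return s / n;
}

/* Sample standard deviation (n - 1 denominator), matching the report's Stdev lines. */
static double stdev(const double *v, int n) {
    double m = mean(v, n), ss = 0.0;
    for (int i = 0; i < n; i++) ss += (v[i] - m) * (v[i] - m);
    return sqrt(ss / (n - 1));
}

int main(void) {
    /* Unrounded random_mixed throughputs (M ops/s) from the appendix. */
    const double base[5] = {53.806309, 53.246568, 53.563123, 49.409566, 51.412515};
    const double ph6a[5] = {39.111201, 53.296242, 56.279982, 52.790754, 53.715992};

    /* Comparison 1: all runs. */
    double mb = mean(base, 5), mp = mean(ph6a, 5);
    printf("all runs:    base %.2f (sd %.2f)  6A %.2f (sd %.2f)  rel %+.2f%%\n",
           mb, stdev(base, 5), mp, stdev(ph6a, 5), (mp / mb - 1.0) * 100.0);

    /* Comparison 2: drop the first (warmup) run from each set. */
    double mb2 = mean(base + 1, 4), mp2 = mean(ph6a + 1, 4);
    printf("excl. run 1: base %.2f  6A %.2f  rel %+.2f%%\n",
           mb2, mp2, (mp2 / mb2 - 1.0) * 100.0);
    return 0;
}
```
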
---

## Performance Summary

### Overall Results (Best 3 of 5 method)
- **random_mixed**: 53.54 → 54.43 M ops/s (+1.67%)
- **mid_mt_gap**: 41.06 → 41.60 M ops/s (+1.33%)

### vs Predictions
- **random_mixed**: Expected +12-15%, Actual +1.67% → **FAIL** (roughly 7-9x smaller than expected)
- **mid_mt_gap**: Expected +8-10%, Actual +1.33% → **FAIL** (6-7x smaller than expected)

### Interpretation
Phase 6-A shows **small and practically negligible** performance improvements, of the same order as the run-to-run noise:
- Excluding warmup: +4.07% (random_mixed), +1.97% (mid_mt_gap)
- Best 3 of 5: +1.67% (random_mixed), +1.33% (mid_mt_gap)
- All runs: -2.39% (random_mixed), +1.46% (mid_mt_gap)

The improvements are **roughly 7-9x smaller** than expected based on the perf analysis.

---

## Root Cause Analysis

### Why the Discrepancy?

The perf profile showed `hak_super_lookup()` consuming **15.84% of CPU time**, yet removing it yields only **~1-4% improvement**. Possible explanations:

#### 1. **Compiler Optimization (Most Likely)**
The compiler may already be optimizing away the `hak_super_lookup()` call in release builds:
- **Dead Code Elimination**: The result of `hak_super_lookup()` is only used for debug logging
- **Inlining + Constant Propagation**: With LTO, the compiler sees the result is unused
- **Evidence**: The Phase 6-A guard has minimal impact, suggesting the call was already effectively "free"

**Action**: Examine the assembly output to verify whether `hak_super_lookup()` is present in the baseline build (the sketch below shows the pattern in question)

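As a concrete reference for that check, the snippet below is a hypothetical illustration of the pre-Phase 6-A shape. Every name except `hak_super_lookup()` and `HAKMEM_BUILD_RELEASE` is a stand-in; the real code may differ.

```c
/* Hypothetical pre-Phase 6-A pattern (illustrative stand-ins, not the real API).
 * In a release build the compiler can delete the lookup only if it can prove
 * the call has no observable side effects -- e.g. the function is defined as
 * static inline in a header, or is visible through LTO.  An opaque external
 * call would normally be kept, which is what the assembly check is meant to
 * distinguish. */
typedef struct hak_super_slab hak_super_slab_t;
hak_super_slab_t *hak_super_lookup(const void *ptr);   /* assumed signature */
void hak_debug_report_invalid(const void *ptr);        /* illustrative stand-in */

#if !HAKMEM_BUILD_RELEASE
#  define HAK_DEBUG_VALIDATE(ss, p) do { if (!(ss)) hak_debug_report_invalid(p); } while (0)
#else
#  define HAK_DEBUG_VALIDATE(ss, p) ((void)(ss))       /* expands to nothing useful in release */
#endif

static inline void write_header_validate(void *block)  /* illustrative stand-in */
{
    /* Baseline: the lookup itself is unconditional; only its consumer is debug-only. */
    hak_super_slab_t *ss = hak_super_lookup(block);     /* the 15.84% call in the perf profile */
    HAK_DEBUG_VALIDATE(ss, block);                      /* result feeds only debug validation */
}
```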

#### 2. **Perf Sampling Bias**
The perf profile may have been captured during a different workload phase:
- Different allocation patterns (class distribution)
- Different cache states (cold vs. hot)
- Different thread counts (single vs. multi-threaded)

**Action**: Re-run perf on the exact benchmark workload to verify the 15.84% claim

#### 3. **Measurement Noise**
The benchmarks show high variance:
- random_mixed: 1.86 M stdev (3.6% of mean)
- mid_mt_gap: 1.65 M stdev (4.1% of mean)

The measured improvements (+1-4%) are within **1-2 standard deviations** of noise.

**Action**: Run longer benchmarks (10M+ operations) to reduce noise

#### 4. **Lookup Already Cache-Friendly**
The SuperSlab registry lookup may be highly cache-efficient in these workloads:
- Small working set (256 blocks) fits in L1/L2 cache
- Registry entries for active SuperSlabs are hot
- Cost is much lower than perf's 15.84% suggests

**Action**: Benchmark with larger working sets (4KB+) to stress the cache

#### 5. **Wrong Hot Path**
The perf profile showed 15.84% CPU in `hak_super_lookup()`, but this may not be on the **allocation hot path** that these benchmarks exercise:
- The call is in `tiny_region_id_write_header()` (allocation)
- The benchmarks mix alloc and free, so the free path may dominate
- Perf may have sampled during a malloc-heavy phase

**Action**: Isolate an allocation-only benchmark (no frees) to verify; a minimal sketch follows below

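The sketch below is a starting point for that action. It is not part of the existing bench suite; the iteration count, seed, and 16B-1KB size range mirror the random_mixed parameters above, and the CLI is illustrative.

```c
/* bench_alloc_only.c -- illustrative allocation-only microbenchmark (not repo code).
 * Link or preload the allocator under test as appropriate; glibc gives a baseline.
 * Build: cc -O2 -o bench_alloc_only bench_alloc_only.c
 * Note: 1M allocations of ~0.5 KB average keep roughly 0.5 GB live; reduce iters
 * if memory is tight. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    long iters = (argc > 1) ? atol(argv[1]) : 1000000;     /* default matches the runs above */
    unsigned seed = (argc > 2) ? (unsigned)atol(argv[2]) : 42;
    srand(seed);

    void **blocks = malloc((size_t)iters * sizeof(void *));
    if (!blocks) return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        size_t sz = 16 + (size_t)(rand() % 1009);          /* 16B..1KB, like random_mixed */
        blocks[i] = malloc(sz);                            /* allocation path only */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("alloc-only: %.2f M ops/s (%ld allocs in %.3fs)\n", iters / secs / 1e6, iters, secs);

    for (long i = 0; i < iters; i++) free(blocks[i]);      /* cleanup outside timing */
    free(blocks);
    return 0;
}
```

Because nothing is freed inside the timed loop, allocator state differs from the mixed benchmarks; the numbers only isolate the allocation path that `tiny_region_id_write_header()` sits on and are not directly comparable to the tables above.
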
---

## Recommendations

### Immediate Actions

1. **HOLD** on committing Phase 6-A until investigation completes
   - Current results don't justify the change
   - Risk: code churn without measurable benefit

2. **Verify Compiler Behavior**
   ```bash
   # Generate assembly for the baseline build
   # (-x c forces the header to be compiled as a C translation unit; add -I paths as needed)
   gcc -x c -S -DHAKMEM_BUILD_RELEASE=1 -O3 -o baseline.s core/tiny_region_id.h

   # Check if hak_super_lookup appears
   grep "hak_super_lookup" baseline.s
   # Caveat: a static inline that is unused in this TU may not show up here at all;
   # disassembling the built benchmark (objdump -d ./bench_random_mixed_hakmem) is the more direct check

   # If absent: compiler already eliminated it (explains minimal improvement)
   # If present: something else is going on
   ```

3. **Re-run Perf on Benchmark Workload**
   ```bash
   # Build baseline without Phase 6-A
   git stash
   make clean && make bench_random_mixed_hakmem

   # Profile the exact benchmark
   perf record -g ./bench_random_mixed_hakmem 10000000 256 42
   perf report --stdio | grep -A20 "hak_super_lookup"

   # Verify if 15.84% claim holds for this workload
   ```

4. **Longer Benchmark Runs**
   ```bash
   # 100M operations to reduce noise
   for i in 1 2 3 4 5; do
     ./bench_random_mixed_hakmem 100000000 256 42 2>/dev/null
   done
   ```

### Long-Term Considerations

If investigation reveals:

#### Scenario A: Compiler Already Optimized
- **Decision**: Commit Phase 6-A for code cleanliness (no harm, no foul)
- **Rationale**: Explicitly documents debug-only code, prevents future confusion
- **Benefit**: Future-proof if compiler behavior changes

#### Scenario B: Perf Was Wrong
- **Decision**: Discard Phase 6-A, update perf methodology
- **Rationale**: The 15.84% CPU claim was based on flawed profiling
- **Action**: Document correct perf sampling procedure

#### Scenario C: Benchmark Doesn't Stress Hot Path
- **Decision**: Commit Phase 6-A, improve benchmark coverage
- **Rationale**: Real workloads may show the expected gains
- **Action**: Add an allocation-heavy benchmark (e.g., 90% malloc, 10% free); see the sketch below

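A minimal sketch of that allocation-heavy mix, under the stated 90% malloc / 10% free assumption (illustrative only, not part of the existing suite):

```c
/* bench_alloc_heavy.c -- illustrative 90% malloc / 10% free mix (not repo code).
 * Build: cc -O2 -o bench_alloc_heavy bench_alloc_heavy.c
 * Note: frees lag allocations, so live memory grows over the run (~0.4 GB at these defaults). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const long iters = 1000000;                        /* matches the runs above */
    srand(42);                                         /* seed 42, as in the report */

    void **live = calloc((size_t)iters, sizeof(void *));
    long nlive = 0;
    if (!live) return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        if (nlive == 0 || (rand() % 10) != 0) {        /* ~90% of ops: allocate */
            size_t sz = 16 + (size_t)(rand() % 1009);  /* 16B..1KB */
            live[nlive++] = malloc(sz);
        } else {                                       /* ~10% of ops: free a random live block */
            long k = rand() % nlive;
            free(live[k]);
            live[k] = live[--nlive];
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("alloc-heavy (90/10): %.2f M ops/s in %.3fs\n", iters / secs / 1e6, secs);

    for (long i = 0; i < nlive; i++) free(live[i]);    /* cleanup outside timing */
    free(live);
    return 0;
}
```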

#### Scenario D: Measurement Noise Dominates
- **Decision**: Commit Phase 6-A if longer runs show >5% improvement
- **Rationale**: Noise can hide real improvements
- **Action**: Use mimalloc-bench suite for more stable measurements

---

## Next Steps

### Phase 6-B: Conditional Path Forward

**Option 1: Investigate First (Recommended)**
1. Run assembly analysis (1 hour)
2. Re-run perf on benchmark (2 hours)
3. Run longer benchmarks (4 hours)
4. Make data-driven decision

**Option 2: Commit Anyway**
- Rationale: Code is cleaner, no measurable harm
- Risk: Future confusion if optimization isn't actually needed

**Option 3: Discard Phase 6-A**
- Rationale: No measurable benefit, not worth the churn
- Risk: Miss real optimization if measurement was flawed

---

## Appendix: Full Benchmark Output

### Baseline - bench_random_mixed
```
=== Baseline: bench_random_mixed (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput = 53806309 ops/s [iter=1000000 ws=256] time=0.019s
Run 2: Throughput = 53246568 ops/s [iter=1000000 ws=256] time=0.019s
Run 3: Throughput = 53563123 ops/s [iter=1000000 ws=256] time=0.019s
Run 4: Throughput = 49409566 ops/s [iter=1000000 ws=256] time=0.020s
Run 5: Throughput = 51412515 ops/s [iter=1000000 ws=256] time=0.019s
```

### Phase 6-A - bench_random_mixed
```
=== Phase 6-A: bench_random_mixed (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput = 39111201 ops/s [iter=1000000 ws=256] time=0.026s
Run 2: Throughput = 53296242 ops/s [iter=1000000 ws=256] time=0.019s
Run 3: Throughput = 56279982 ops/s [iter=1000000 ws=256] time=0.018s
Run 4: Throughput = 52790754 ops/s [iter=1000000 ws=256] time=0.019s
Run 5: Throughput = 53715992 ops/s [iter=1000000 ws=256] time=0.019s
```

### Baseline - bench_mid_mt_gap
```
=== Baseline: bench_mid_mt_gap (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput = 41.70 M operations per second, relative time: 0.023979 s.
Run 2: Throughput = 37.39 M operations per second, relative time: 0.026745 s.
Run 3: Throughput = 40.91 M operations per second, relative time: 0.024445 s.
Run 4: Throughput = 40.53 M operations per second, relative time: 0.024671 s.
Run 5: Throughput = 40.56 M operations per second, relative time: 0.024657 s.
```

### Phase 6-A - bench_mid_mt_gap
```
=== Phase 6-A: bench_mid_mt_gap (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput = 41.49 M operations per second, relative time: 0.024103 s.
Run 2: Throughput = 41.81 M operations per second, relative time: 0.023917 s.
Run 3: Throughput = 41.51 M operations per second, relative time: 0.024089 s.
Run 4: Throughput = 38.43 M operations per second, relative time: 0.026019 s.
Run 5: Throughput = 40.78 M operations per second, relative time: 0.024524 s.
```

---

## Conclusion

Phase 6-A successfully implements the intended change (disabling the SuperSlab lookup in release builds), but the measured performance impact (+1-4%) is **far smaller** than the expected +12-15% derived from the perf analysis.

**Critical Question**: Why does removing code that perf claims costs 15.84% of CPU yield only a 1-4% improvement?

**Most Likely Answer**: The compiler was already optimizing away the `hak_super_lookup()` call in release builds through dead code elimination, since its result is only consumed by debug-only validation.

**Recommended Action**: Investigate before committing. If the compiler was already eliminating the call, Phase 6-A is still valuable for code clarity and future-proofing, but the performance claim needs correction.