# Batch Tier Checks Performance Measurement Results

**Date:** 2025-12-04
**Optimization:** Phase A-2 - Batch Tier Checks (Reduce Atomic Operations)
**Benchmark:** bench_allocators_hakmem --scenario mixed --iterations 100

---

## Executive Summary

**RESULT: REGRESSION DETECTED - the optimization does NOT achieve the expected +5-10% improvement**

The Batch Tier Checks optimization, designed to reduce atomic operations in the tiny-allocation hot path by batching tier checks, shows a **-0.87% performance regression** at the default batch size (B=64) and a **-2.30% regression** with aggressive batching (B=256).

**Key Findings:**
- **Throughput:** Baseline (B=1) outperforms both B=64 (-0.87%) and B=256 (-2.30%)
- **Cache Performance:** B=64 shows -11% cache misses (good) but +0.85% CPU cycles (bad)
- **Consistency:** B=256 has the best consistency (CV=3.58%) but the worst throughput
- **Verdict:** The optimization introduces overhead that exceeds the atomic-operation savings

**Recommendation:** **DO NOT PROCEED** to Phase A-3. Investigate the root cause and consider alternative approaches.

---

## Test Configuration

### Test Parameters

```
Benchmark: bench_allocators_hakmem
Workload: mixed (16B, 512B, 8KB, 128KB, 1KB allocations)
Iterations: 100 per run
Runs per config: 10
Platform: Linux 6.8.0-87-generic, x86-64
Compiler: gcc with -O3 -flto -march=native
```

### Configurations Tested

| Config | Batch Size | Description | Atomic Op Reduction |
|--------|------------|-------------|---------------------|
| **Test A** | B=1 | Baseline (no batching) | 0% (every check) |
| **Test B** | B=64 | Optimized (conservative) | 98.4% (1 per 64 checks) |
| **Test C** | B=256 | Aggressive (max batching) | 99.6% (1 per 256 checks) |

---

## Performance Results

### Throughput Comparison

| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) |
|--------|---------------:|-----------------:|-------------------:|
| **Average ops/s** | **1,482,889.9** | 1,469,952.3 | 1,448,726.5 |
| Std Dev ops/s | 76,386.4 | 79,114.8 | 51,886.6 |
| Min ops/s | 1,343,540.7 | 1,359,677.3 | 1,365,118.3 |
| Max ops/s | 1,615,938.8 | 1,589,416.6 | 1,543,813.0 |
| CV (%) | 5.15% | 5.38% | 3.58% |

**Improvement Analysis:**
- **B=64 vs B=1:** **-0.87%** (-12,938 ops/s) **[REGRESSION]**
- **B=256 vs B=1:** **-2.30%** (-34,163 ops/s) **[REGRESSION]**
- **B=256 vs B=64:** -1.44% (-21,226 ops/s)
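
Each delta follows directly from the average throughput figures above; for B=64, for example:

$$
\Delta_{B=64} = \frac{1{,}469{,}952.3 - 1{,}482{,}889.9}{1{,}482{,}889.9} \times 100\% \approx -0.87\%
$$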

### CPU Cycles & Cache Performance

| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) | B=64 vs B=1 | B=256 vs B=1 |
|--------|---------------:|-----------------:|-------------------:|------------:|-------------:|
| **CPU Cycles** | 2,349,670,806 | 2,369,727,585 | 2,703,167,708 | **+0.85%** | **+15.04%** |
| **Cache Misses** | 9,672,579 | 8,605,566 | 10,100,798 | **-11.03%** | **+4.43%** |
| **L1 Cache Misses** | 26,465,121 | 26,297,329 | 28,928,265 | **-0.63%** | **+9.31%** |

**Analysis:**
- B=64 reduces cache misses by 11% (expected from fewer atomic ops)
- However, CPU cycles **increase** by 0.85% (unexpected; they should decrease)
- B=256 shows a severe regression: +15% cycles, +4.4% cache misses
- L1 cache behavior is roughly neutral for B=64 and worse for B=256

### Variance & Consistency

| Config | CV (%) | Interpretation |
|--------|-------:|----------------|
| Baseline (B=1) | 5.15% | Good consistency |
| Optimized (B=64) | 5.38% | Slightly worse |
| Aggressive (B=256) | 3.58% | Best consistency |
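
Throughout this report, CV denotes the coefficient of variation of the 10 runs per configuration; the values in the throughput table are consistent with the sample (n-1) standard deviation:

$$
\mathrm{CV} = \frac{s}{\bar{x}} \times 100\%, \qquad \text{e.g.}\quad \frac{76{,}386.4}{1{,}482{,}889.9} \times 100\% \approx 5.15\% \quad (B{=}1)
$$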

---

## Detailed Analysis

### 1. Why Did the Optimization Fail?

**Expected Behavior:**
- Reduce atomic operations from 5% of allocations to 0.08% (a 64x reduction)
- Save ~0.2-0.4 cycles per allocation
- Achieve a +5-10% throughput improvement

**Actual Behavior:**
- Cache misses decreased by 11% (confirming the atomic-op reduction)
- CPU cycles **increased** by 0.85% (unexpected overhead)
- Net throughput **decreased** by 0.87%

**Root Cause Hypothesis** (a sketch of the suspected hot-path shape follows this list):

1. **Thread-local state overhead:** The batch counter and cached tier result add TLS storage and access overhead
   - `g_tier_batch_state[TINY_NUM_CLASSES]` is accessed on every cache miss
   - The modulo operation `(state->refill_count % batch)` may be expensive
   - Branch misprediction on `if ((state->refill_count % batch) == 0)`

2. **Cache pressure:** The batch state array may evict more useful data from cache
   - 8 bytes × 32 classes = 256 bytes of TLS state
   - This competes with actual allocation metadata in the L1 cache

3. **False sharing:** Multiple threads may access different elements of the same cache line
   - Though TLS mitigates this, the benchmark may have threading effects

4. **Batch size mismatch:** B=64 may not align with actual cache-miss patterns
   - If cache misses are clustered, batching provides no benefit
   - If cache hits dominate, the batch check is rarely needed
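
For concreteness, a minimal sketch of what this hypothesized hot path looks like, reconstructed from the identifiers cited above; the struct layout and the names `tiny_tier_check_batched` and `tier_check_atomic` are illustrative assumptions, not the actual implementation:

```c
#include <stdint.h>

#define TINY_NUM_CLASSES 32

/* Assumed per-class TLS batch state: a counter plus the cached tier
 * result (8 bytes x 32 classes = 256 bytes of TLS, as estimated above). */
typedef struct {
    uint32_t refill_count;  /* incremented on every tier check */
    uint32_t cached_tier;   /* tier status reused between real checks */
} tier_batch_state_t;

static _Thread_local tier_batch_state_t g_tier_batch_state[TINY_NUM_CLASSES];

/* Stand-in for the real atomic tier read this optimization tries to avoid. */
extern uint32_t tier_check_atomic(int class_idx);

/* Suspected hot-path shape: the modulo and the branch execute on every
 * call, so their cost is paid even when the atomic load is skipped. */
static inline uint32_t tiny_tier_check_batched(int class_idx, uint32_t batch)
{
    tier_batch_state_t *state = &g_tier_batch_state[class_idx];
    if ((state->refill_count % batch) == 0) {  /* mispredictable branch */
        state->cached_tier = tier_check_atomic(class_idx);
    }
    state->refill_count++;
    return state->cached_tier;  /* may be stale for up to `batch` operations */
}
```

With B=1 the modulo is always zero and the atomic read runs every time (the baseline); with B=64 the atomic read runs once per 64 checks, but the TLS access, modulo, and branch remain on every call.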

### 2. Why Is B=256 Even Worse?

Aggressive batching (B=256) shows a severe regression (+15% cycles):

- **Longer staleness period:** Tier status can be stale for up to 256 operations
- **More allocations from DRAINING SuperSlabs:** Serving allocations from a SuperSlab that is already DRAINING causes additional work
- **Increased memory pressure:** More operations complete before a DRAINING SuperSlab is discovered

### 3. Positive Observations

Despite the regression, some aspects worked:

1. **Cache miss reduction:** B=64 achieved -11% cache misses (atomic ops were genuinely reduced)
2. **Consistency improvement:** B=256 has the lowest variance (CV=3.58%)
3. **Code correctness:** No crashes or correctness issues were observed

---

## Success Criteria Checklist

| Criterion | Expected | Actual | Status |
|-----------|----------|--------|--------|
| B=64 shows +5-10% improvement | +5-10% | **-0.87%** | **FAIL** |
| Cycles reduced as expected | -5% | **+0.85%** | **FAIL** |
| Cache behavior improves or stays neutral | Neutral | -11% misses (good), but +0.85% cycles (bad) | **MIXED** |
| Variance acceptable (<15%) | <15% | 5.38% | **PASS** |
| No correctness issues | None | None | **PASS** |

**Overall: FAIL - the optimization does not achieve the expected improvement**

---

## Comparison: JSON Workload (Invalid Baseline)

**Note:** Initial measurements used the wrong workload (JSON = 64KB allocations), which does NOT exercise the tiny-allocation path where batch tier checks apply.

Results from the JSON workload (for reference only):
- All configs showed ~1,070,000 ops/s (nearly identical)
- No difference was possible, because 64KB allocations use the L2.5 pool, not the Shared Pool
- This confirms the optimization is specific to tiny allocations (<2KB)

---

## Recommendations

### Immediate Actions

1. **DO NOT PROCEED to Phase A-3** (Shared Pool Stage Optimization)
   - The current optimization shows a regression, not an improvement
   - The root cause must be understood before adding more complexity

2. **INVESTIGATE overhead sources:**
   - Profile the cost of the modulo operation
   - Check TLS access patterns
   - Measure the branch misprediction rate
   - Analyze cache line behavior

3. **CONSIDER alternative approaches** (a sketch combining the first two follows this list):
   - Use power-of-2 batch sizes for a cheaper modulo (bit masking)
   - Precompute the batch size at compile time (removing getenv overhead)
   - Try smaller batch sizes (B=16, B=32) for better locality
   - Use a per-thread batch counter instead of a per-class counter
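
A minimal sketch of the first two ideas together; the `TIER_BATCH` constant and `tier_check_due` helper are hypothetical names, not the current implementation:

```c
#include <stdint.h>

/* Fix the batch size at compile time (no getenv anywhere near the hot
 * path) and require a power of two, so the compiler reduces the modulo
 * to a single AND against a constant mask. */
#define TIER_BATCH 64u
_Static_assert((TIER_BATCH & (TIER_BATCH - 1u)) == 0u,
               "TIER_BATCH must be a power of two");

static inline int tier_check_due(uint32_t refill_count)
{
    /* (count % TIER_BATCH) == 0  becomes  (count & (TIER_BATCH - 1)) == 0 */
    return (refill_count & (TIER_BATCH - 1u)) == 0u;
}
```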

### Future Experiments

If investigating further:

1. **Test different batch sizes:** B=16, B=32, B=128
2. **Optimize the modulo operation:** Use `(count & (batch-1))` for power-of-2 batch sizes, as sketched above
3. **Reduce the TLS footprint:** A single per-thread counter instead of one per class (see the sketch after this list)
4. **Profile-guided optimization:** Use perf to identify hotspots
5. **Test with different workloads:**
   - Pure tiny allocations (16B-2KB only)
   - A workload with a high cache-miss rate
   - A multi-threaded workload
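
A hypothetical sketch of experiment 3: one counter shared by all size classes, shrinking the counter storage to a single TLS word (the cached tier results would remain per-class). The trade-off is that a hot size class can consume most of the batch window, so other classes refresh their tier status at unpredictable intervals:

```c
#include <stdint.h>

/* Single per-thread check counter instead of one per size class. */
static _Thread_local uint32_t g_tier_check_count;

static inline int tier_check_due_global(uint32_t batch_mask)  /* mask = batch - 1 */
{
    return (g_tier_check_count++ & batch_mask) == 0u;
}
```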

### Alternative Optimization Strategies

Since batch tier checks failed, consider:

1. **Shared Pool Stage Optimization (Phase A-3):** May still be viable independently
2. **SuperSlab-level caching:** Cache the entire SuperSlab pointer instead of the tier status
3. **Lockless shared pool:** Remove atomic operations entirely via per-thread pools
4. **Lazy tier checking:** Only check the tier on an actual allocation failure

---

## Raw Data

### Baseline (B=1) - 10 Runs

```
1,615,938.8 ops/s
1,424,832.0 ops/s
1,415,710.5 ops/s
1,531,173.0 ops/s
1,524,721.8 ops/s
1,343,540.7 ops/s
1,520,723.1 ops/s
1,520,476.5 ops/s
1,464,046.2 ops/s
1,467,736.3 ops/s
```

### Optimized (B=64) - 10 Runs

```
1,394,566.7 ops/s
1,422,447.5 ops/s
1,556,167.0 ops/s
1,447,934.5 ops/s
1,359,677.3 ops/s
1,436,005.2 ops/s
1,568,456.7 ops/s
1,423,222.2 ops/s
1,589,416.6 ops/s
1,501,629.6 ops/s
```

### Aggressive (B=256) - 10 Runs

```
1,543,813.0 ops/s
1,436,644.9 ops/s
1,479,174.7 ops/s
1,428,092.3 ops/s
1,419,232.7 ops/s
1,422,254.4 ops/s
1,510,832.1 ops/s
1,417,032.7 ops/s
1,465,069.6 ops/s
1,365,118.3 ops/s
```
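
The summary statistics can be reproduced from these runs. A minimal self-contained C sketch (the original analysis used Python 3; any reimplementation works) that prints the baseline mean, sample standard deviation, and CV matching the throughput table within rounding; compile with `cc stats.c -lm`:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Baseline (B=1) runs, copied from the raw data above. */
    const double runs[] = {
        1615938.8, 1424832.0, 1415710.5, 1531173.0, 1524721.8,
        1343540.7, 1520723.1, 1520476.5, 1464046.2, 1467736.3,
    };
    const int n = sizeof runs / sizeof runs[0];

    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += runs[i];
    const double mean = sum / n;

    double ss = 0.0;
    for (int i = 0; i < n; i++) ss += (runs[i] - mean) * (runs[i] - mean);
    const double stddev = sqrt(ss / (n - 1));  /* sample std dev, as in the table */

    printf("mean   = %.1f ops/s\n", mean);              /* ~1,482,889.9 */
    printf("stddev = %.1f ops/s\n", stddev);            /* ~76,386.4    */
    printf("CV     = %.2f%%\n", 100.0 * stddev / mean); /* ~5.15%       */
    return 0;
}
```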

---

## Conclusion

The Batch Tier Checks optimization, while theoretically sound, **fails to achieve the expected +5-10% throughput improvement** in practice. The -0.87% regression suggests that the overhead of maintaining batch state and performing modulo operations exceeds the savings from reduced atomic operations.

**Key Takeaway:** Not all theoretically beneficial optimizations translate into real-world performance gains. The cost of the bookkeeping itself (TLS state, modulo arithmetic, extra branches) can exceed the savings from reduced atomic operations, especially when those operations are already infrequent (5% of allocations).

**Next Steps:** Investigate the root cause, optimize the implementation, or abandon this approach in favor of alternative optimization strategies.

---

**Report Generated:** 2025-12-04
**Analysis Tool:** Python 3 statistical analysis
**Benchmark Framework:** bench_allocators_hakmem (hakmem custom benchmarking suite)