# Batch Tier Checks Performance Measurement Results

**Date:** 2025-12-04
**Optimization:** Phase A-2 - Batch Tier Checks (Reduce Atomic Operations)
**Benchmark:** bench_allocators_hakmem --scenario mixed --iterations 100

---

## Executive Summary

**RESULT: REGRESSION DETECTED - the optimization does NOT achieve the expected +5-10% improvement.**

The Batch Tier Checks optimization, designed to reduce atomic operations in the tiny-allocation hot path by batching tier checks, shows a **-0.87% performance regression** with the default batch size (B=64) and a **-2.30% regression** with aggressive batching (B=256).

**Key Findings:**

- **Throughput:** Baseline (B=1) outperforms both B=64 (-0.87%) and B=256 (-2.30%)
- **Cache performance:** B=64 reduces cache misses by 11% (good) but increases CPU cycles by 0.85% (bad)
- **Consistency:** B=256 has the best run-to-run consistency (CV = 3.58%) but the worst throughput
- **Verdict:** The bookkeeping overhead introduced by batching exceeds the atomic-operation savings

**Recommendation:** **DO NOT PROCEED** to Phase A-3. Investigate the root cause and consider alternative approaches first.

---

## Test Configuration

### Test Parameters

```
Benchmark: bench_allocators_hakmem
Workload: mixed (16B, 512B, 8KB, 128KB, 1KB allocations)
Iterations: 100 per run
Runs per config: 10
Platform: Linux 6.8.0-87-generic, x86-64
Compiler: gcc with -O3 -flto -march=native
```

### Configurations Tested

| Config | Batch Size | Description | Atomic Op Reduction |
|--------|------------|-------------|---------------------|
| **Test A** | B=1 | Baseline (no batching) | 0% (every check) |
| **Test B** | B=64 | Optimized (conservative) | 98.4% (1 per 64 checks) |
| **Test C** | B=256 | Aggressive (max batching) | 99.6% (1 per 256 checks) |
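
The "Atomic Op Reduction" column follows directly from the batch size: with a batch of $B$ refills, only one check in every $B$ performs the atomic load.

$$\text{reduction} = 1 - \frac{1}{B}; \qquad B = 64: 1 - \frac{1}{64} \approx 98.4\%, \qquad B = 256: 1 - \frac{1}{256} \approx 99.6\%$$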
---

## Performance Results

### Throughput Comparison

| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) |
|--------|---------------:|------------------:|-------------------:|
| **Average ops/s** | **1,482,889.9** | 1,469,952.3 | 1,448,726.5 |
| Std Dev ops/s | 76,386.4 | 79,114.8 | 51,886.6 |
| Min ops/s | 1,343,540.7 | 1,359,677.3 | 1,365,118.3 |
| Max ops/s | 1,615,938.8 | 1,589,416.6 | 1,543,813.0 |
| CV (%) | 5.15% | 5.38% | 3.58% |

**Improvement Analysis** (relative to the B=1 baseline; formulas below):

- **B=64 vs B=1:** **-0.87%** (-12,938 ops/s) **[REGRESSION]**
- **B=256 vs B=1:** **-2.30%** (-34,163 ops/s) **[REGRESSION]**
- **B=256 vs B=64:** -1.44% (-21,226 ops/s)
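
The relative changes and CV values above are computed as

$$\Delta_{B} = \frac{\text{ops/s}_{B} - \text{ops/s}_{B=1}}{\text{ops/s}_{B=1}}, \qquad \mathrm{CV} = \frac{\text{std dev}}{\text{mean}}$$

so, for example, $\Delta_{64} = (1{,}469{,}952.3 - 1{,}482{,}889.9) / 1{,}482{,}889.9 \approx -0.87\%$ and $\mathrm{CV}_{B=1} = 76{,}386.4 / 1{,}482{,}889.9 \approx 5.15\%$.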
### CPU Cycles & Cache Performance

| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) | B=64 vs B=1 | B=256 vs B=1 |
|--------|---------------:|------------------:|-------------------:|------------:|-------------:|
| **CPU Cycles** | 2,349,670,806 | 2,369,727,585 | 2,703,167,708 | **+0.85%** | **+15.04%** |
| **Cache Misses** | 9,672,579 | 8,605,566 | 10,100,798 | **-11.03%** | **+4.43%** |
| **L1 Cache Misses** | 26,465,121 | 26,297,329 | 28,928,265 | **-0.63%** | **+9.31%** |

**Analysis:**

- B=64 reduces cache misses by 11% (expected from fewer atomic ops)
- However, CPU cycles **increase** by 0.85% (unexpected - they should decrease)
- B=256 shows a severe regression: +15% cycles, +4.4% cache misses
- L1 cache behavior is mostly neutral for B=64, worse for B=256

### Variance & Consistency

| Config | CV (%) | Interpretation |
|--------|-------:|----------------|
| Baseline (B=1) | 5.15% | Good consistency |
| Optimized (B=64) | 5.38% | Slightly worse |
| Aggressive (B=256) | 3.58% | Best consistency |

---

## Detailed Analysis

### 1. Why Did the Optimization Fail?

**Expected Behavior:**

- Reduce atomic operations from 5% of allocations to 0.08% (a 64x reduction)
- Save ~0.2-0.4 cycles per allocation
- Achieve a +5-10% throughput improvement

**Actual Behavior:**

- Cache misses decreased by 11% (confirming that atomic operations were reduced)
- CPU cycles **increased** by 0.85% (unexpected overhead)
- Net throughput **decreased** by 0.87%

**Root Cause Hypothesis** (a sketch of the hot-path check in question follows this list):

1. **Thread-local state overhead:** the batch counter and cached tier result add TLS storage and access overhead
   - `g_tier_batch_state[TINY_NUM_CLASSES]` is accessed on every cache miss
   - The modulo operation `(state->refill_count % batch)` may be expensive
   - Branch misprediction on `if ((state->refill_count % batch) == 0)`
2. **Cache pressure:** the batch state array may evict more useful data from cache
   - 8 bytes × 32 classes = 256 bytes of TLS state
   - This competes with actual allocation metadata in L1 cache
3. **False sharing:** multiple threads may access different elements of the same cache line
   - Though TLS mitigates this, the benchmark may still show threading effects
4. **Batch size mismatch:** B=64 may not align with actual cache-miss patterns
   - If cache misses are clustered, batching provides no benefit
   - If cache hits dominate, the batch check is rarely needed
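
For reference, here is a minimal sketch of the batched tier check described in hypothesis 1. It is reconstructed from the identifiers quoted above (`g_tier_batch_state`, `refill_count`, `batch`, `TINY_NUM_CLASSES`); the struct layout, the function name, and the `tier_check_atomic` callback are illustrative assumptions, not the actual implementation.

```c
#include <stdint.h>

#ifndef TINY_NUM_CLASSES
#define TINY_NUM_CLASSES 32   /* 32 size classes, per the cache-pressure estimate above */
#endif

/* Hypothetical per-class TLS state: ~8 bytes per class, 256 bytes total. */
typedef struct {
    uint32_t refill_count;   /* incremented on every cache-miss refill */
    uint32_t cached_tier;    /* tier result reused between refreshes */
} tier_batch_state_t;

static __thread tier_batch_state_t g_tier_batch_state[TINY_NUM_CLASSES];

/* tier_check_atomic stands in for the real atomic tier-status load. */
static inline uint32_t tier_check_batched(int class_idx, uint32_t batch,
                                          uint32_t (*tier_check_atomic)(int)) {
    tier_batch_state_t *state = &g_tier_batch_state[class_idx];
    /* The modulo and branch below are the per-refill bookkeeping that the
     * measurements suggest costs more than the atomic load it avoids. */
    if ((state->refill_count % batch) == 0) {
        state->cached_tier = tier_check_atomic(class_idx);  /* refresh via atomic load */
    }
    state->refill_count++;
    return state->cached_tier;  /* may be stale for up to batch-1 refills */
}
```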
### 2. Why Is B=256 Even Worse?

The aggressive batching (B=256) shows a severe regression (+15% cycles):

- **Longer staleness window:** tier status can be stale for up to 256 operations
- **More allocations from DRAINING SuperSlabs:** serving allocations from slabs that are already DRAINING causes additional work
- **Increased memory pressure:** more operations complete before the allocator discovers that a SuperSlab is DRAINING

### 3. Positive Observations

Despite the regression, some aspects worked:

1. **Cache miss reduction:** B=64 achieved -11% cache misses (the atomic operations were indeed reduced)
2. **Consistency improvement:** B=256 has the lowest variance (CV = 3.58%)
3. **Code correctness:** no crashes or correctness issues were observed

---

## Success Criteria Checklist

| Criterion | Expected | Actual | Status |
|-----------|----------|--------|--------|
| B=64 shows +5-10% improvement | +5-10% | **-0.87%** | **FAIL** |
| Cycles reduced as expected | -5% | **+0.85%** | **FAIL** |
| Cache behavior improves or neutral | Neutral | -11% misses (good), but +0.85% cycles (bad) | **MIXED** |
| Variance acceptable (<15%) | <15% | 5.38% | **PASS** |
| No correctness issues | None | None | **PASS** |

**Overall: FAIL - the optimization does not achieve the expected improvement.**

---

## Comparison: JSON Workload (Invalid Baseline)

**Note:** Initial measurements used the wrong workload (JSON = 64KB allocations), which does NOT exercise the tiny-allocation path where batch tier checks apply.

Results from the JSON workload (for reference only):

- All configs showed ~1,070,000 ops/s (nearly identical)
- No improvement is expected, because 64KB allocations use the L2.5 pool, not the Shared Pool
- This confirms the optimization is specific to tiny allocations (<2KB)

---

## Recommendations

### Immediate Actions

1. **DO NOT PROCEED to Phase A-3** (Shared Pool Stage Optimization)
   - The current optimization shows a regression, not an improvement
   - The root cause must be understood before adding more complexity
2. **INVESTIGATE overhead sources:**
   - Profile the cost of the modulo operation
   - Check TLS access patterns
   - Measure the branch misprediction rate
   - Analyze cache-line behavior
3. **CONSIDER alternative approaches** (see the sketch after this list):
   - Use power-of-2 batch sizes for a cheaper modulo (bit masking)
   - Precompute the batch size at compile time (remove the getenv overhead)
   - Try smaller batch sizes (B=16, B=32) for better locality
   - Use a per-thread batch counter instead of a per-class counter
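
A minimal sketch of the bit-masking idea (illustrative names, assuming the batch size is constrained to a power of two):

```c
#include <assert.h>
#include <stdint.h>

/* When batch is a power of two, the hot-path "(count % batch) == 0" test
 * reduces to a bit mask, avoiding the integer division behind '%'. */
static inline int tier_refresh_due(uint32_t refill_count, uint32_t batch) {
    assert(batch != 0 && (batch & (batch - 1)) == 0);  /* batch must be 2^k */
    return (refill_count & (batch - 1)) == 0;          /* equivalent to %, for 2^k */
}
```

Note that if the batch size were a compile-time constant, the compiler would already strength-reduce the modulo; the mask mainly helps while the size is read from an environment variable at runtime.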
### Future Experiments

If investigating further:

1. **Test different batch sizes:** B=16, B=32, B=128
2. **Optimize the modulo operation:** use `(count & (batch-1))` for power-of-2 batch sizes (see the sketch above)
3. **Reduce the TLS footprint:** a single counter per thread instead of one per class
4. **Profile-guided optimization:** use perf to identify hotspots
5. **Test with different workloads:**
   - Pure tiny allocations (16B-2KB only)
   - A workload with a high cache-miss rate
   - A multi-threaded workload

### Alternative Optimization Strategies

Since batch tier checks failed, consider:

1. **Shared Pool Stage Optimization (Phase A-3):** may still be viable independently
2. **SuperSlab-level caching:** cache the entire SuperSlab pointer instead of the tier status
3. **Lockless shared pool:** remove atomic operations entirely via per-thread pools
4. **Lazy tier checking:** only check the tier on an actual allocation failure (sketched below)
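
A rough sketch of the "lazy tier checking" shape; all helper names are hypothetical placeholders for the allocator's real carve, tier-refresh, and refill routines:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the allocator's fast-path carve, tier refresh,
 * and refill routines. */
void *try_carve_from_current_slab(int class_idx);
void  refresh_tier_status_atomic(int class_idx);
void *refill_and_retry(int class_idx);

/* Lazy tier checking: the common path performs no tier check at all; the tier
 * status is only refreshed after a carve attempt actually fails. */
void *tiny_alloc_lazy(int class_idx) {
    void *p = try_carve_from_current_slab(class_idx);
    if (p != NULL)
        return p;                            /* fast path: no atomics, no counters */
    refresh_tier_status_atomic(class_idx);   /* slow path only */
    return refill_and_retry(class_idx);
}
```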
---

## Raw Data

### Baseline (B=1) - 10 Runs

```
1,615,938.8 ops/s
1,424,832.0 ops/s
1,415,710.5 ops/s
1,531,173.0 ops/s
1,524,721.8 ops/s
1,343,540.7 ops/s
1,520,723.1 ops/s
1,520,476.5 ops/s
1,464,046.2 ops/s
1,467,736.3 ops/s
```

### Optimized (B=64) - 10 Runs

```
1,394,566.7 ops/s
1,422,447.5 ops/s
1,556,167.0 ops/s
1,447,934.5 ops/s
1,359,677.3 ops/s
1,436,005.2 ops/s
1,568,456.7 ops/s
1,423,222.2 ops/s
1,589,416.6 ops/s
1,501,629.6 ops/s
```

### Aggressive (B=256) - 10 Runs

```
1,543,813.0 ops/s
1,436,644.9 ops/s
1,479,174.7 ops/s
1,428,092.3 ops/s
1,419,232.7 ops/s
1,422,254.4 ops/s
1,510,832.1 ops/s
1,417,032.7 ops/s
1,465,069.6 ops/s
1,365,118.3 ops/s
```
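
The summary statistics in the throughput table can be recomputed from these raw runs. The report's own analysis used a Python 3 script (see the footer); the C program below is an illustrative sketch, shown for the baseline runs:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Baseline (B=1) runs in ops/s, copied from the raw data above. */
    const double runs[] = {
        1615938.8, 1424832.0, 1415710.5, 1531173.0, 1524721.8,
        1343540.7, 1520723.1, 1520476.5, 1464046.2, 1467736.3
    };
    const int n = (int)(sizeof runs / sizeof runs[0]);

    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += runs[i];
    const double mean = sum / n;

    double ss = 0.0;
    for (int i = 0; i < n; i++) ss += (runs[i] - mean) * (runs[i] - mean);
    const double stddev = sqrt(ss / (n - 1));  /* sample std dev, as in the table */

    /* Should print approximately: mean=1482889.9 stddev=76386.4 cv=5.15% */
    printf("mean=%.1f stddev=%.1f cv=%.2f%%\n", mean, stddev, 100.0 * stddev / mean);
    return 0;
}
```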
---

## Conclusion

The Batch Tier Checks optimization, while theoretically sound, **fails to achieve the expected +5-10% throughput improvement** in practice. The -0.87% regression suggests that the overhead of maintaining the batch state and performing the modulo operation exceeds the savings from reduced atomic operations.

**Key Takeaway:** Not every theoretically beneficial optimization translates into a real-world performance gain. The bookkeeping overhead (TLS state, modulo, branches) can exceed the savings from reduced atomic operations, especially when those operations are already infrequent (about 5% of allocations).

**Next Steps:** Investigate the root cause, optimize the implementation, or abandon this approach in favor of the alternative optimization strategies listed above.

---

**Report Generated:** 2025-12-04
**Analysis Tool:** Python 3 statistical analysis
**Benchmark Framework:** bench_allocators_hakmem (hakmem custom benchmarking suite)