# Batch Tier Checks Performance Measurement Results

**Date:** 2025-12-04
**Optimization:** Phase A-2 - Batch Tier Checks (Reduce Atomic Operations)
**Benchmark:** bench_allocators_hakmem --scenario mixed --iterations 100

---

## Executive Summary

**RESULT: REGRESSION DETECTED - Optimization does NOT achieve +5-10% improvement**

The Batch Tier Checks optimization, designed to reduce atomic operations in the tiny allocation hot path by batching tier checks, shows a **-0.87% performance regression** with the default batch size (B=64) and a **-2.30% regression** with aggressive batching (B=256).

**Key Findings:**

- **Throughput:** Baseline (B=1) outperforms both B=64 (-0.87%) and B=256 (-2.30%)
- **Cache Performance:** B=64 shows -11% cache misses (good), but +0.85% CPU cycles (bad)
- **Consistency:** B=256 has the best consistency (CV=3.58%), but the worst throughput
- **Verdict:** The optimization introduces overhead that exceeds the atomic operation savings

**Recommendation:** **DO NOT PROCEED** to Phase A-3. Investigate the root cause and consider alternative approaches.

---

## Test Configuration

### Test Parameters

```
Benchmark:       bench_allocators_hakmem
Workload:        mixed (16B, 512B, 8KB, 128KB, 1KB allocations)
Iterations:      100 per run
Runs per config: 10
Platform:        Linux 6.8.0-87-generic, x86-64
Compiler:        gcc with -O3 -flto -march=native
```

### Configurations Tested

| Config | Batch Size | Description | Atomic Op Reduction |
|--------|------------|-------------|---------------------|
| **Test A** | B=1 | Baseline (no batching) | 0% (every check) |
| **Test B** | B=64 | Optimized (conservative) | 98.4% (1 per 64 checks) |
| **Test C** | B=256 | Aggressive (max batching) | 99.6% (1 per 256 checks) |

---

## Performance Results

### Throughput Comparison

| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) |
|--------|---------------:|-----------------:|-------------------:|
| **Average ops/s** | **1,482,889.9** | 1,469,952.3 | 1,448,726.5 |
| Std Dev ops/s | 76,386.4 | 79,114.8 | 51,886.6 |
| Min ops/s | 1,343,540.7 | 1,359,677.3 | 1,365,118.3 |
| Max ops/s | 1,615,938.8 | 1,589,416.6 | 1,543,813.0 |
| CV (%) | 5.15% | 5.38% | 3.58% |

**Improvement Analysis:**

- **B=64 vs B=1:** **-0.87%** (-12,938 ops/s) **[REGRESSION]**
- **B=256 vs B=1:** **-2.30%** (-34,163 ops/s) **[REGRESSION]**
- **B=256 vs B=64:** -1.44% (-21,226 ops/s)

### CPU Cycles & Cache Performance

| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) | B=64 vs B=1 | B=256 vs B=1 |
|--------|---------------:|-----------------:|-------------------:|------------:|-------------:|
| **CPU Cycles** | 2,349,670,806 | 2,369,727,585 | 2,703,167,708 | **+0.85%** | **+15.04%** |
| **Cache Misses** | 9,672,579 | 8,605,566 | 10,100,798 | **-11.03%** | **+4.43%** |
| **L1 Cache Misses** | 26,465,121 | 26,297,329 | 28,928,265 | **-0.63%** | **+9.31%** |

**Analysis:**

- B=64 reduces cache misses by 11% (expected from fewer atomic ops)
- However, CPU cycles **increase** by 0.85% (unexpected - they should decrease)
- B=256 shows a severe regression: +15% cycles, +4.4% cache misses
- L1 cache behavior is mostly neutral for B=64, worse for B=256

### Variance & Consistency

| Config | CV (%) | Interpretation |
|--------|-------:|----------------|
| Baseline (B=1) | 5.15% | Good consistency |
| Optimized (B=64) | 5.38% | Slightly worse |
| Aggressive (B=256) | 3.58% | Best consistency |

---

## Detailed Analysis

### 1. Why Did the Optimization Fail?

**Expected Behavior:**

- Reduce atomic operations from 5% of allocations to 0.08% (a 64x reduction)
- Save ~0.2-0.4 cycles per allocation
- Achieve +5-10% throughput improvement

**Actual Behavior:**

- Cache misses decreased by 11% (confirms the atomic op reduction)
- CPU cycles **increased** by 0.85% (unexpected overhead)
- Net throughput **decreased** by 0.87%

**Root Cause Hypothesis** (a sketch of the suspected hot path follows this list):

1. **Thread-local state overhead:** The batch counter and cached tier result add TLS storage and access overhead
   - `g_tier_batch_state[TINY_NUM_CLASSES]` is accessed on every cache miss
   - The modulo operation `(state->refill_count % batch)` may be expensive
   - Branch misprediction on `if ((state->refill_count % batch) == 0)`
2. **Cache pressure:** The batch state array may evict more useful data from cache
   - 8 bytes × 32 classes = 256 bytes of TLS state
   - This competes with actual allocation metadata in L1 cache
3. **False sharing:** Multiple threads may access different elements of the same cache line
   - Though TLS mitigates this, the benchmark may have threading effects
4. **Batch size mismatch:** B=64 may not align with actual cache miss patterns
   - If cache misses are clustered, batching provides no benefit
   - If cache hits dominate, the batch check is rarely needed
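To make hypothesis 1 concrete, here is a minimal sketch of what the batched hot path plausibly looks like, reconstructed only from the identifiers named above (`g_tier_batch_state`, `refill_count`, the modulo check). The struct layout, the `g_class_tier` word, and `tier_check_batched` itself are hypothetical stand-ins, not the actual implementation:

```c
#include <stdatomic.h>
#include <stdint.h>

#define TINY_NUM_CLASSES 32            /* implied by "8 bytes x 32 classes" above */

typedef struct {
    uint32_t refill_count;             /* per-class check counter */
    uint32_t cached_tier;              /* tier status from the last real check */
} tier_batch_state_t;                  /* 8 bytes per class, 256 B of TLS total */

static _Thread_local tier_batch_state_t g_tier_batch_state[TINY_NUM_CLASSES];

/* Stand-in for the shared per-class tier word that other threads update. */
static _Atomic uint32_t g_class_tier[TINY_NUM_CLASSES];

/* Pay for the atomic load only once per `batch` calls; return a cached
 * (possibly stale) tier value the rest of the time. */
static inline uint32_t tier_check_batched(unsigned cls, uint32_t batch)
{
    tier_batch_state_t *state = &g_tier_batch_state[cls];

    /* The suspected overhead: this modulo (an integer divide when `batch`
     * is a runtime value) and its branch execute on EVERY call, while the
     * atomic load they amortize was already rare (~5% of allocations). */
    if ((state->refill_count % batch) == 0)
        state->cached_tier = atomic_load_explicit(&g_class_tier[cls],
                                                  memory_order_acquire);
    state->refill_count++;
    return state->cached_tier;         /* may be stale for up to batch-1 calls */
}
```

If the real code looks anything like this, the cycle regression is plausible: a divide-based modulo costs on the order of 20+ cycles on x86-64 and is paid unconditionally, whereas the acquire load it amortizes is cheap when uncontended.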
### 2. Why Is B=256 Even Worse?

The aggressive batching (B=256) shows a severe regression (+15% cycles):

- **Longer staleness period:** Tier status can be stale for up to 256 operations
- **More allocations from DRAINING SuperSlabs:** This causes additional work
- **Increased memory pressure:** More operations pass before a SuperSlab is discovered to be DRAINING

### 3. Positive Observations

Despite the regression, some aspects worked:

1. **Cache miss reduction:** B=64 achieved -11% cache misses (atomic ops were reduced)
2. **Consistency improvement:** B=256 has the lowest variance (CV=3.58%)
3. **Code correctness:** No crashes or correctness issues observed

---

## Success Criteria Checklist

| Criterion | Expected | Actual | Status |
|-----------|----------|--------|--------|
| B=64 shows +5-10% improvement | +5-10% | **-0.87%** | **FAIL** |
| Cycles reduced as expected | -5% | **+0.85%** | **FAIL** |
| Cache behavior improves or stays neutral | Neutral | -11% misses (good), but +0.85% cycles (bad) | **MIXED** |
| Variance acceptable (<15%) | <15% | 5.38% | **PASS** |
| No correctness issues | None | None | **PASS** |

**Overall: FAIL - Optimization does not achieve the expected improvement**

---

## Comparison: JSON Workload (Invalid Baseline)

**Note:** Initial measurements used the wrong workload (JSON = 64KB allocations), which does NOT exercise the tiny allocation path where batch tier checks apply.

Results from the JSON workload (for reference only):

- All configs showed ~1,070,000 ops/s (nearly identical)
- No improvement, because 64KB allocations use the L2.5 pool, not the Shared Pool
- This confirms the optimization is specific to tiny allocations (<2KB)

---

## Recommendations

### Immediate Actions

1. **DO NOT PROCEED to Phase A-3** (Shared Pool Stage Optimization)
   - The current optimization shows a regression, not an improvement
   - The root cause must be understood before adding more complexity
2. **INVESTIGATE overhead sources:**
   - Profile the modulo operation cost
   - Check TLS access patterns
   - Measure the branch misprediction rate
   - Analyze cache line behavior
3. **CONSIDER alternative approaches:**
   - Use power-of-2 batch sizes for a cheaper modulo (bit masking)
   - Precompute the batch size at compile time (remove getenv overhead)
   - Try smaller batch sizes (B=16, B=32) for better locality
   - Use a per-thread batch counter instead of a per-class counter

### Future Experiments

If investigating further:

1. **Test different batch sizes:** B=16, B=32, B=128
2. **Optimize the modulo operation:** Use `(count & (batch-1))` for power-of-2 batch sizes (sketched after this list)
3. **Reduce TLS footprint:** A single global counter instead of per-class counters
4. **Profile-guided optimization:** Use perf to identify hotspots
5. **Test with different workloads:**
   - Pure tiny allocations (16B-2KB only)
   - High cache miss rate workload
   - Multi-threaded workload
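Expanding on experiment 2: because the batch size is currently read from the environment at run time (per the getenv note above), gcc cannot strength-reduce the `%` into an AND on its own, so passing a precomputed mask does it explicitly. This sketch reuses the hypothetical `tier_batch_state_t` / `g_tier_batch_state` / `g_class_tier` declarations from the earlier sketch:

```c
/* Requires the batch size to be a power of two. Callers compute
 * batch_mask = batch - 1 once at startup, after validating the
 * power-of-two property with (batch & (batch - 1)) == 0. */
static inline uint32_t tier_check_masked(unsigned cls, uint32_t batch_mask)
{
    tier_batch_state_t *state = &g_tier_batch_state[cls];

    /* A single AND replaces the integer divide that `%` needs for a
     * runtime divisor; the branch itself is unchanged. */
    if ((state->refill_count & batch_mask) == 0)
        state->cached_tier = atomic_load_explicit(&g_class_tier[cls],
                                                  memory_order_acquire);
    state->refill_count++;
    return state->cached_tier;
}
```

This removes only the divide cost; the TLS access and the branch remain, so it tests hypothesis 1's modulo component in isolation.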
### Alternative Optimization Strategies

Since batch tier checks failed, consider:

1. **Shared Pool Stage Optimization (Phase A-3):** May still be viable independently
2. **SuperSlab-level caching:** Cache the entire SuperSlab pointer instead of the tier status
3. **Lockless shared pool:** Remove atomic operations entirely via per-thread pools
4. **Lazy tier checking:** Only check the tier on actual allocation failure

---

## Raw Data

### Baseline (B=1) - 10 Runs

```
1,615,938.8 ops/s
1,424,832.0 ops/s
1,415,710.5 ops/s
1,531,173.0 ops/s
1,524,721.8 ops/s
1,343,540.7 ops/s
1,520,723.1 ops/s
1,520,476.5 ops/s
1,464,046.2 ops/s
1,467,736.3 ops/s
```

### Optimized (B=64) - 10 Runs

```
1,394,566.7 ops/s
1,422,447.5 ops/s
1,556,167.0 ops/s
1,447,934.5 ops/s
1,359,677.3 ops/s
1,436,005.2 ops/s
1,568,456.7 ops/s
1,423,222.2 ops/s
1,589,416.6 ops/s
1,501,629.6 ops/s
```

### Aggressive (B=256) - 10 Runs

```
1,543,813.0 ops/s
1,436,644.9 ops/s
1,479,174.7 ops/s
1,428,092.3 ops/s
1,419,232.7 ops/s
1,422,254.4 ops/s
1,510,832.1 ops/s
1,417,032.7 ops/s
1,465,069.6 ops/s
1,365,118.3 ops/s
```

---

## Conclusion

The Batch Tier Checks optimization, while theoretically sound, **fails to achieve the expected +5-10% throughput improvement** in practice. The -0.87% regression suggests that the overhead of maintaining batch state and performing modulo operations exceeds the savings from reduced atomic operations.

**Key Takeaway:** Not all theoretically beneficial optimizations translate to real-world performance gains. The cost of bookkeeping (TLS state, modulo, branches) can exceed the savings from reduced atomic operations, especially when those operations are already infrequent (5% of allocations).

**Next Steps:** Investigate the root cause, optimize the implementation, or abandon this approach in favor of alternative optimization strategies.

---

**Report Generated:** 2025-12-04
**Analysis Tool:** Python 3 statistical analysis
**Benchmark Framework:** bench_allocators_hakmem (hakmem custom benchmarking suite)
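---

## Appendix: Recomputing the Summary Statistics

As a cross-check on the tables above, here is a minimal standalone C sketch that recomputes the baseline mean, sample standard deviation, and CV from the Raw Data section. It is illustrative only; the actual figures were produced by the Python 3 analysis tool noted in the footer:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Baseline (B=1) runs from the Raw Data section, in ops/s. */
    const double runs[] = {
        1615938.8, 1424832.0, 1415710.5, 1531173.0, 1524721.8,
        1343540.7, 1520723.1, 1520476.5, 1464046.2, 1467736.3
    };
    const int n = sizeof runs / sizeof runs[0];

    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += runs[i];
    double mean = sum / n;

    double sq = 0.0;
    for (int i = 0; i < n; i++) sq += (runs[i] - mean) * (runs[i] - mean);
    double sd = sqrt(sq / (n - 1));    /* sample std dev; CV = sd / mean */

    /* Should print mean=1482889.9 sd=76386.4 cv=5.15%,
     * matching the Throughput Comparison table above. */
    printf("mean=%.1f sd=%.1f cv=%.2f%%\n", mean, sd, sd / mean * 100.0);
    return 0;
}
```

Note the `n - 1` divisor: the table's Std Dev of 76,386.4 is reproduced only with the sample (not population) standard deviation.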