# Batch Tier Checks Performance Measurement Results

**Date:** 2025-12-04
**Optimization:** Phase A-2 - Batch Tier Checks (Reduce Atomic Operations)
**Benchmark:** bench_allocators_hakmem --scenario mixed --iterations 100

---

## Executive Summary

**RESULT: REGRESSION DETECTED - the optimization does NOT achieve the expected +5-10% improvement.**

The Batch Tier Checks optimization, designed to reduce atomic operations in the tiny-allocation hot path by batching tier checks, shows a **-0.87% performance regression** with the default batch size (B=64) and a **-2.30% regression** with aggressive batching (B=256).

**Key Findings:**

- **Throughput:** Baseline (B=1) outperforms both B=64 (-0.87%) and B=256 (-2.30%)
- **Cache performance:** B=64 reduces cache misses by 11% (good) but increases CPU cycles by 0.85% (bad)
- **Consistency:** B=256 has the best run-to-run consistency (CV = 3.58%) but the worst throughput
- **Verdict:** The bookkeeping overhead introduced by batching exceeds the atomic-operation savings

**Recommendation:** **DO NOT PROCEED** to Phase A-3. Investigate the root cause and consider alternative approaches first.

---

## Test Configuration

### Test Parameters

```
Benchmark: bench_allocators_hakmem
Workload: mixed (16B, 512B, 8KB, 128KB, 1KB allocations)
Iterations: 100 per run
Runs per config: 10
Platform: Linux 6.8.0-87-generic, x86-64
Compiler: gcc with -O3 -flto -march=native
```

### Configurations Tested

| Config | Batch Size | Description | Atomic Op Reduction |
|--------|------------|-------------|---------------------|
| **Test A** | B=1 | Baseline (no batching) | 0% (every check) |
| **Test B** | B=64 | Optimized (conservative) | 98.4% (1 per 64 checks) |
| **Test C** | B=256 | Aggressive (max batching) | 99.6% (1 per 256 checks) |
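
The "Atomic Op Reduction" column follows directly from the batch size: with a batch of $B$ refills, only one check in every $B$ performs the atomic load.

$$\text{reduction} = 1 - \frac{1}{B}; \qquad B = 64: 1 - \frac{1}{64} \approx 98.4\%, \qquad B = 256: 1 - \frac{1}{256} \approx 99.6\%$$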
---

## Performance Results

### Throughput Comparison

| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) |
|--------|---------------:|------------------:|-------------------:|
| **Average ops/s** | **1,482,889.9** | 1,469,952.3 | 1,448,726.5 |
| Std Dev ops/s | 76,386.4 | 79,114.8 | 51,886.6 |
| Min ops/s | 1,343,540.7 | 1,359,677.3 | 1,365,118.3 |
| Max ops/s | 1,615,938.8 | 1,589,416.6 | 1,543,813.0 |
| CV (%) | 5.15% | 5.38% | 3.58% |

**Improvement Analysis** (relative to the B=1 baseline; formulas below):

- **B=64 vs B=1:** **-0.87%** (-12,938 ops/s) **[REGRESSION]**
- **B=256 vs B=1:** **-2.30%** (-34,163 ops/s) **[REGRESSION]**
- **B=256 vs B=64:** -1.44% (-21,226 ops/s)
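
The relative changes and CV values above are computed as

$$\Delta_{B} = \frac{\text{ops/s}_{B} - \text{ops/s}_{B=1}}{\text{ops/s}_{B=1}}, \qquad \mathrm{CV} = \frac{\text{std dev}}{\text{mean}}$$

so, for example, $\Delta_{64} = (1{,}469{,}952.3 - 1{,}482{,}889.9) / 1{,}482{,}889.9 \approx -0.87\%$ and $\mathrm{CV}_{B=1} = 76{,}386.4 / 1{,}482{,}889.9 \approx 5.15\%$.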
### CPU Cycles & Cache Performance

| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) | B=64 vs B=1 | B=256 vs B=1 |
|--------|---------------:|------------------:|-------------------:|------------:|-------------:|
| **CPU Cycles** | 2,349,670,806 | 2,369,727,585 | 2,703,167,708 | **+0.85%** | **+15.04%** |
| **Cache Misses** | 9,672,579 | 8,605,566 | 10,100,798 | **-11.03%** | **+4.43%** |
| **L1 Cache Misses** | 26,465,121 | 26,297,329 | 28,928,265 | **-0.63%** | **+9.31%** |

**Analysis:**

- B=64 reduces cache misses by 11% (expected from fewer atomic ops)
- However, CPU cycles **increase** by 0.85% (unexpected - they should decrease)
- B=256 shows a severe regression: +15% cycles, +4.4% cache misses
- L1 cache behavior is mostly neutral for B=64, worse for B=256

### Variance & Consistency

| Config | CV (%) | Interpretation |
|--------|-------:|----------------|
| Baseline (B=1) | 5.15% | Good consistency |
| Optimized (B=64) | 5.38% | Slightly worse |
| Aggressive (B=256) | 3.58% | Best consistency |

---

## Detailed Analysis

### 1. Why Did the Optimization Fail?

**Expected Behavior:**

- Reduce atomic operations from 5% of allocations to 0.08% (a 64x reduction)
- Save ~0.2-0.4 cycles per allocation
- Achieve a +5-10% throughput improvement

**Actual Behavior:**

- Cache misses decreased by 11% (confirming that atomic operations were reduced)
- CPU cycles **increased** by 0.85% (unexpected overhead)
- Net throughput **decreased** by 0.87%

**Root Cause Hypothesis** (a sketch of the hot-path check in question follows this list):

1. **Thread-local state overhead:** the batch counter and cached tier result add TLS storage and access overhead
   - `g_tier_batch_state[TINY_NUM_CLASSES]` is accessed on every cache miss
   - The modulo operation `(state->refill_count % batch)` may be expensive
   - Branch misprediction on `if ((state->refill_count % batch) == 0)`
2. **Cache pressure:** the batch state array may evict more useful data from cache
   - 8 bytes × 32 classes = 256 bytes of TLS state
   - This competes with actual allocation metadata in L1 cache
3. **False sharing:** multiple threads may access different elements of the same cache line
   - Though TLS mitigates this, the benchmark may still show threading effects
4. **Batch size mismatch:** B=64 may not align with actual cache-miss patterns
   - If cache misses are clustered, batching provides no benefit
   - If cache hits dominate, the batch check is rarely needed
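
For reference, here is a minimal sketch of the batched tier check described in hypothesis 1. It is reconstructed from the identifiers quoted above (`g_tier_batch_state`, `refill_count`, `batch`, `TINY_NUM_CLASSES`); the struct layout, the function name, and the `tier_check_atomic` callback are illustrative assumptions, not the actual implementation.

```c
#include <stdint.h>

#ifndef TINY_NUM_CLASSES
#define TINY_NUM_CLASSES 32   /* 32 size classes, per the cache-pressure estimate above */
#endif

/* Hypothetical per-class TLS state: ~8 bytes per class, 256 bytes total. */
typedef struct {
    uint32_t refill_count;   /* incremented on every cache-miss refill */
    uint32_t cached_tier;    /* tier result reused between refreshes */
} tier_batch_state_t;

static __thread tier_batch_state_t g_tier_batch_state[TINY_NUM_CLASSES];

/* tier_check_atomic stands in for the real atomic tier-status load. */
static inline uint32_t tier_check_batched(int class_idx, uint32_t batch,
                                          uint32_t (*tier_check_atomic)(int)) {
    tier_batch_state_t *state = &g_tier_batch_state[class_idx];
    /* The modulo and branch below are the per-refill bookkeeping that the
     * measurements suggest costs more than the atomic load it avoids. */
    if ((state->refill_count % batch) == 0) {
        state->cached_tier = tier_check_atomic(class_idx);  /* refresh via atomic load */
    }
    state->refill_count++;
    return state->cached_tier;  /* may be stale for up to batch-1 refills */
}
```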
### 2. Why Is B=256 Even Worse?

The aggressive batching (B=256) shows a severe regression (+15% cycles):

- **Longer staleness window:** tier status can be stale for up to 256 operations
- **More allocations from DRAINING SuperSlabs:** serving allocations from slabs that are already DRAINING causes additional work
- **Increased memory pressure:** more operations complete before the allocator discovers that a SuperSlab is DRAINING

### 3. Positive Observations

Despite the regression, some aspects worked:

1. **Cache miss reduction:** B=64 achieved -11% cache misses (the atomic operations were indeed reduced)
2. **Consistency improvement:** B=256 has the lowest variance (CV = 3.58%)
3. **Code correctness:** no crashes or correctness issues were observed

---

## Success Criteria Checklist

| Criterion | Expected | Actual | Status |
|-----------|----------|--------|--------|
| B=64 shows +5-10% improvement | +5-10% | **-0.87%** | **FAIL** |
| Cycles reduced as expected | -5% | **+0.85%** | **FAIL** |
| Cache behavior improves or neutral | Neutral | -11% misses (good), but +0.85% cycles (bad) | **MIXED** |
| Variance acceptable (<15%) | <15% | 5.38% | **PASS** |
| No correctness issues | None | None | **PASS** |

**Overall: FAIL - the optimization does not achieve the expected improvement.**

---

## Comparison: JSON Workload (Invalid Baseline)

**Note:** Initial measurements used the wrong workload (JSON = 64KB allocations), which does NOT exercise the tiny-allocation path where batch tier checks apply.

Results from the JSON workload (for reference only):

- All configs showed ~1,070,000 ops/s (nearly identical)
- No improvement is expected, because 64KB allocations use the L2.5 pool, not the Shared Pool
- This confirms the optimization is specific to tiny allocations (<2KB)

---

## Recommendations

### Immediate Actions

1. **DO NOT PROCEED to Phase A-3** (Shared Pool Stage Optimization)
   - The current optimization shows a regression, not an improvement
   - The root cause must be understood before adding more complexity
2. **INVESTIGATE overhead sources:**
   - Profile the cost of the modulo operation
   - Check TLS access patterns
   - Measure the branch misprediction rate
   - Analyze cache-line behavior
3. **CONSIDER alternative approaches** (see the sketch after this list):
   - Use power-of-2 batch sizes for a cheaper modulo (bit masking)
   - Precompute the batch size at compile time (remove the getenv overhead)
   - Try smaller batch sizes (B=16, B=32) for better locality
   - Use a per-thread batch counter instead of a per-class counter
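
A minimal sketch of the bit-masking idea (illustrative names, assuming the batch size is constrained to a power of two):

```c
#include <assert.h>
#include <stdint.h>

/* When batch is a power of two, the hot-path "(count % batch) == 0" test
 * reduces to a bit mask, avoiding the integer division behind '%'. */
static inline int tier_refresh_due(uint32_t refill_count, uint32_t batch) {
    assert(batch != 0 && (batch & (batch - 1)) == 0);  /* batch must be 2^k */
    return (refill_count & (batch - 1)) == 0;          /* equivalent to %, for 2^k */
}
```

Note that if the batch size were a compile-time constant, the compiler would already strength-reduce the modulo; the mask mainly helps while the size is read from an environment variable at runtime.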
### Future Experiments

If investigating further:

1. **Test different batch sizes:** B=16, B=32, B=128
2. **Optimize the modulo operation:** use `(count & (batch-1))` for power-of-2 batch sizes (see the sketch above)
3. **Reduce the TLS footprint:** a single counter per thread instead of one per class
4. **Profile-guided optimization:** use perf to identify hotspots
5. **Test with different workloads:**
   - Pure tiny allocations (16B-2KB only)
   - A workload with a high cache-miss rate
   - A multi-threaded workload

### Alternative Optimization Strategies

Since batch tier checks failed, consider:

1. **Shared Pool Stage Optimization (Phase A-3):** may still be viable independently
2. **SuperSlab-level caching:** cache the entire SuperSlab pointer instead of the tier status
3. **Lockless shared pool:** remove atomic operations entirely via per-thread pools
4. **Lazy tier checking:** only check the tier on an actual allocation failure (sketched below)
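
A rough sketch of the "lazy tier checking" shape; all helper names are hypothetical placeholders for the allocator's real carve, tier-refresh, and refill routines:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the allocator's fast-path carve, tier refresh,
 * and refill routines. */
void *try_carve_from_current_slab(int class_idx);
void  refresh_tier_status_atomic(int class_idx);
void *refill_and_retry(int class_idx);

/* Lazy tier checking: the common path performs no tier check at all; the tier
 * status is only refreshed after a carve attempt actually fails. */
void *tiny_alloc_lazy(int class_idx) {
    void *p = try_carve_from_current_slab(class_idx);
    if (p != NULL)
        return p;                            /* fast path: no atomics, no counters */
    refresh_tier_status_atomic(class_idx);   /* slow path only */
    return refill_and_retry(class_idx);
}
```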
---

## Raw Data

### Baseline (B=1) - 10 Runs

```
1,615,938.8 ops/s
1,424,832.0 ops/s
1,415,710.5 ops/s
1,531,173.0 ops/s
1,524,721.8 ops/s
1,343,540.7 ops/s
1,520,723.1 ops/s
1,520,476.5 ops/s
1,464,046.2 ops/s
1,467,736.3 ops/s
```

### Optimized (B=64) - 10 Runs

```
1,394,566.7 ops/s
1,422,447.5 ops/s
1,556,167.0 ops/s
1,447,934.5 ops/s
1,359,677.3 ops/s
1,436,005.2 ops/s
1,568,456.7 ops/s
1,423,222.2 ops/s
1,589,416.6 ops/s
1,501,629.6 ops/s
```

### Aggressive (B=256) - 10 Runs

```
1,543,813.0 ops/s
1,436,644.9 ops/s
1,479,174.7 ops/s
1,428,092.3 ops/s
1,419,232.7 ops/s
1,422,254.4 ops/s
1,510,832.1 ops/s
1,417,032.7 ops/s
1,465,069.6 ops/s
1,365,118.3 ops/s
```
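
The summary statistics in the throughput table can be recomputed from these raw runs. The report's own analysis used a Python 3 script (see the footer); the C program below is an illustrative sketch, shown for the baseline runs:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Baseline (B=1) runs in ops/s, copied from the raw data above. */
    const double runs[] = {
        1615938.8, 1424832.0, 1415710.5, 1531173.0, 1524721.8,
        1343540.7, 1520723.1, 1520476.5, 1464046.2, 1467736.3
    };
    const int n = (int)(sizeof runs / sizeof runs[0]);

    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += runs[i];
    const double mean = sum / n;

    double ss = 0.0;
    for (int i = 0; i < n; i++) ss += (runs[i] - mean) * (runs[i] - mean);
    const double stddev = sqrt(ss / (n - 1));  /* sample std dev, as in the table */

    /* Should print approximately: mean=1482889.9 stddev=76386.4 cv=5.15% */
    printf("mean=%.1f stddev=%.1f cv=%.2f%%\n", mean, stddev, 100.0 * stddev / mean);
    return 0;
}
```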
---

## Conclusion

The Batch Tier Checks optimization, while theoretically sound, **fails to achieve the expected +5-10% throughput improvement** in practice. The -0.87% regression suggests that the overhead of maintaining the batch state and performing the modulo operation exceeds the savings from reduced atomic operations.

**Key Takeaway:** Not every theoretically beneficial optimization translates into a real-world performance gain. The bookkeeping overhead (TLS state, modulo, branches) can exceed the savings from reduced atomic operations, especially when those operations are already infrequent (about 5% of allocations).

**Next Steps:** Investigate the root cause, optimize the implementation, or abandon this approach in favor of the alternative optimization strategies listed above.

---

**Report Generated:** 2025-12-04
**Analysis Tool:** Python 3 statistical analysis
**Benchmark Framework:** bench_allocators_hakmem (hakmem custom benchmarking suite)