# Batch Tier Checks Performance Measurement Results

**Date:** 2025-12-04
**Optimization:** Phase A-2 - Batch Tier Checks (Reduce Atomic Operations)
**Benchmark:** bench_allocators_hakmem --scenario mixed --iterations 100

---

## Executive Summary

**RESULT: REGRESSION DETECTED - the optimization does NOT achieve the expected +5-10% improvement**

The Batch Tier Checks optimization, designed to reduce atomic operations in the tiny-allocation hot path by batching tier checks, shows a **-0.87% performance regression** at the default batch size (B=64) and a **-2.30% regression** with aggressive batching (B=256).

**Key Findings:**
- **Throughput:** Baseline (B=1) outperforms both B=64 (-0.87%) and B=256 (-2.30%)
- **Cache Performance:** B=64 shows -11% cache misses (good) but +0.85% CPU cycles (bad)
- **Consistency:** B=256 has the best consistency (CV=3.58%) but the worst throughput
- **Verdict:** The optimization introduces overhead that exceeds the atomic-operation savings

**Recommendation:** **DO NOT PROCEED** to Phase A-3. Investigate the root cause and consider alternative approaches.

---

## Test Configuration

### Test Parameters

```
Benchmark: bench_allocators_hakmem
Workload: mixed (16B, 512B, 8KB, 128KB, 1KB allocations)
Iterations: 100 per run
Runs per config: 10
Platform: Linux 6.8.0-87-generic, x86-64
Compiler: gcc with -O3 -flto -march=native
```

### Configurations Tested

| Config | Batch Size | Description | Atomic Op Reduction |
|--------|------------|-------------|---------------------|
| **Test A** | B=1 | Baseline (no batching) | 0% (every check) |
| **Test B** | B=64 | Optimized (conservative) | 98.4% (1 per 64 checks) |
| **Test C** | B=256 | Aggressive (max batching) | 99.6% (1 per 256 checks) |

---

## Performance Results

### Throughput Comparison

| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) |
|--------|---------------:|-----------------:|-------------------:|
| **Average ops/s** | **1,482,889.9** | 1,469,952.3 | 1,448,726.5 |
| Std Dev ops/s | 76,386.4 | 79,114.8 | 51,886.6 |
| Min ops/s | 1,343,540.7 | 1,359,677.3 | 1,365,118.3 |
| Max ops/s | 1,615,938.8 | 1,589,416.6 | 1,543,813.0 |
| CV (%) | 5.15% | 5.38% | 3.58% |

**Improvement Analysis:**
- **B=64 vs B=1:** **-0.87%** (-12,938 ops/s) **[REGRESSION]**
- **B=256 vs B=1:** **-2.30%** (-34,163 ops/s) **[REGRESSION]**
- **B=256 vs B=64:** -1.44% (-21,226 ops/s)
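
Each delta follows directly from the average throughput figures above; for B=64, for example:

$$
\Delta_{B=64} = \frac{1{,}469{,}952.3 - 1{,}482{,}889.9}{1{,}482{,}889.9} \times 100\% \approx -0.87\%
$$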

### CPU Cycles & Cache Performance

| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) | B=64 vs B=1 | B=256 vs B=1 |
|--------|---------------:|-----------------:|-------------------:|------------:|-------------:|
| **CPU Cycles** | 2,349,670,806 | 2,369,727,585 | 2,703,167,708 | **+0.85%** | **+15.04%** |
| **Cache Misses** | 9,672,579 | 8,605,566 | 10,100,798 | **-11.03%** | **+4.43%** |
| **L1 Cache Misses** | 26,465,121 | 26,297,329 | 28,928,265 | **-0.63%** | **+9.31%** |

**Analysis:**
- B=64 reduces cache misses by 11% (expected from fewer atomic ops)
- However, CPU cycles **increase** by 0.85% (unexpected; they should decrease)
- B=256 shows a severe regression: +15% cycles, +4.4% cache misses
- L1 cache behavior is roughly neutral for B=64 and worse for B=256

### Variance & Consistency

| Config | CV (%) | Interpretation |
|--------|-------:|----------------|
| Baseline (B=1) | 5.15% | Good consistency |
| Optimized (B=64) | 5.38% | Slightly worse |
| Aggressive (B=256) | 3.58% | Best consistency |
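
Throughout this report, CV denotes the coefficient of variation of the 10 runs per configuration; the values in the throughput table are consistent with the sample (n-1) standard deviation:

$$
\mathrm{CV} = \frac{s}{\bar{x}} \times 100\%, \qquad \text{e.g.}\quad \frac{76{,}386.4}{1{,}482{,}889.9} \times 100\% \approx 5.15\% \quad (B{=}1)
$$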

---

## Detailed Analysis

### 1. Why Did the Optimization Fail?

**Expected Behavior:**
- Reduce atomic operations from 5% of allocations to 0.08% (a 64x reduction)
- Save ~0.2-0.4 cycles per allocation
- Achieve a +5-10% throughput improvement

**Actual Behavior:**
- Cache misses decreased by 11% (confirming the atomic-op reduction)
- CPU cycles **increased** by 0.85% (unexpected overhead)
- Net throughput **decreased** by 0.87%

**Root Cause Hypothesis** (a sketch of the suspected hot-path shape follows this list):

1. **Thread-local state overhead:** The batch counter and cached tier result add TLS storage and access overhead
   - `g_tier_batch_state[TINY_NUM_CLASSES]` is accessed on every cache miss
   - The modulo operation `(state->refill_count % batch)` may be expensive
   - Branch misprediction on `if ((state->refill_count % batch) == 0)`

2. **Cache pressure:** The batch state array may evict more useful data from cache
   - 8 bytes × 32 classes = 256 bytes of TLS state
   - This competes with actual allocation metadata in the L1 cache

3. **False sharing:** Multiple threads may access different elements of the same cache line
   - Though TLS mitigates this, the benchmark may have threading effects

4. **Batch size mismatch:** B=64 may not align with actual cache-miss patterns
   - If cache misses are clustered, batching provides no benefit
   - If cache hits dominate, the batch check is rarely needed
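
For concreteness, a minimal sketch of what this hypothesized hot path looks like, reconstructed from the identifiers cited above; the struct layout and the names `tiny_tier_check_batched` and `tier_check_atomic` are illustrative assumptions, not the actual implementation:

```c
#include <stdint.h>

#define TINY_NUM_CLASSES 32

/* Assumed per-class TLS batch state: a counter plus the cached tier
 * result (8 bytes x 32 classes = 256 bytes of TLS, as estimated above). */
typedef struct {
    uint32_t refill_count;  /* incremented on every tier check */
    uint32_t cached_tier;   /* tier status reused between real checks */
} tier_batch_state_t;

static _Thread_local tier_batch_state_t g_tier_batch_state[TINY_NUM_CLASSES];

/* Stand-in for the real atomic tier read this optimization tries to avoid. */
extern uint32_t tier_check_atomic(int class_idx);

/* Suspected hot-path shape: the modulo and the branch execute on every
 * call, so their cost is paid even when the atomic load is skipped. */
static inline uint32_t tiny_tier_check_batched(int class_idx, uint32_t batch)
{
    tier_batch_state_t *state = &g_tier_batch_state[class_idx];
    if ((state->refill_count % batch) == 0) {  /* mispredictable branch */
        state->cached_tier = tier_check_atomic(class_idx);
    }
    state->refill_count++;
    return state->cached_tier;  /* may be stale for up to `batch` operations */
}
```

With B=1 the modulo is always zero and the atomic read runs every time (the baseline); with B=64 the atomic read runs once per 64 checks, but the TLS access, modulo, and branch remain on every call.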

### 2. Why Is B=256 Even Worse?

Aggressive batching (B=256) shows a severe regression (+15% cycles):

- **Longer staleness period:** Tier status can be stale for up to 256 operations
- **More allocations from DRAINING SuperSlabs:** Serving allocations from a SuperSlab that is already DRAINING causes additional work
- **Increased memory pressure:** More operations complete before a DRAINING SuperSlab is discovered

### 3. Positive Observations

Despite the regression, some aspects worked:

1. **Cache miss reduction:** B=64 achieved -11% cache misses (atomic ops were genuinely reduced)
2. **Consistency improvement:** B=256 has the lowest variance (CV=3.58%)
3. **Code correctness:** No crashes or correctness issues were observed

---

## Success Criteria Checklist

| Criterion | Expected | Actual | Status |
|-----------|----------|--------|--------|
| B=64 shows +5-10% improvement | +5-10% | **-0.87%** | **FAIL** |
| Cycles reduced as expected | -5% | **+0.85%** | **FAIL** |
| Cache behavior improves or stays neutral | Neutral | -11% misses (good), but +0.85% cycles (bad) | **MIXED** |
| Variance acceptable (<15%) | <15% | 5.38% | **PASS** |
| No correctness issues | None | None | **PASS** |

**Overall: FAIL - the optimization does not achieve the expected improvement**

---

## Comparison: JSON Workload (Invalid Baseline)

**Note:** Initial measurements used the wrong workload (JSON = 64KB allocations), which does NOT exercise the tiny-allocation path where batch tier checks apply.

Results from the JSON workload (for reference only):
- All configs showed ~1,070,000 ops/s (nearly identical)
- No difference was possible, because 64KB allocations use the L2.5 pool, not the Shared Pool
- This confirms the optimization is specific to tiny allocations (<2KB)

---

## Recommendations

### Immediate Actions

1. **DO NOT PROCEED to Phase A-3** (Shared Pool Stage Optimization)
   - The current optimization shows a regression, not an improvement
   - The root cause must be understood before adding more complexity

2. **INVESTIGATE overhead sources:**
   - Profile the cost of the modulo operation
   - Check TLS access patterns
   - Measure the branch misprediction rate
   - Analyze cache line behavior

3. **CONSIDER alternative approaches** (a sketch combining the first two follows this list):
   - Use power-of-2 batch sizes for a cheaper modulo (bit masking)
   - Precompute the batch size at compile time (removing getenv overhead)
   - Try smaller batch sizes (B=16, B=32) for better locality
   - Use a per-thread batch counter instead of a per-class counter
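
A minimal sketch of the first two ideas together; the `TIER_BATCH` constant and `tier_check_due` helper are hypothetical names, not the current implementation:

```c
#include <stdint.h>

/* Fix the batch size at compile time (no getenv anywhere near the hot
 * path) and require a power of two, so the compiler reduces the modulo
 * to a single AND against a constant mask. */
#define TIER_BATCH 64u
_Static_assert((TIER_BATCH & (TIER_BATCH - 1u)) == 0u,
               "TIER_BATCH must be a power of two");

static inline int tier_check_due(uint32_t refill_count)
{
    /* (count % TIER_BATCH) == 0  becomes  (count & (TIER_BATCH - 1)) == 0 */
    return (refill_count & (TIER_BATCH - 1u)) == 0u;
}
```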

### Future Experiments

If investigating further:

1. **Test different batch sizes:** B=16, B=32, B=128
2. **Optimize the modulo operation:** Use `(count & (batch-1))` for power-of-2 batch sizes, as sketched above
3. **Reduce the TLS footprint:** A single per-thread counter instead of one per class (see the sketch after this list)
4. **Profile-guided optimization:** Use perf to identify hotspots
5. **Test with different workloads:**
   - Pure tiny allocations (16B-2KB only)
   - A workload with a high cache-miss rate
   - A multi-threaded workload
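
A hypothetical sketch of experiment 3: one counter shared by all size classes, shrinking the counter storage to a single TLS word (the cached tier results would remain per-class). The trade-off is that a hot size class can consume most of the batch window, so other classes refresh their tier status at unpredictable intervals:

```c
#include <stdint.h>

/* Single per-thread check counter instead of one per size class. */
static _Thread_local uint32_t g_tier_check_count;

static inline int tier_check_due_global(uint32_t batch_mask)  /* mask = batch - 1 */
{
    return (g_tier_check_count++ & batch_mask) == 0u;
}
```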

### Alternative Optimization Strategies

Since batch tier checks failed, consider:

1. **Shared Pool Stage Optimization (Phase A-3):** May still be viable independently
2. **SuperSlab-level caching:** Cache the entire SuperSlab pointer instead of the tier status
3. **Lockless shared pool:** Remove atomic operations entirely via per-thread pools
4. **Lazy tier checking:** Only check the tier on an actual allocation failure

---

## Raw Data

### Baseline (B=1) - 10 Runs

```
1,615,938.8 ops/s
1,424,832.0 ops/s
1,415,710.5 ops/s
1,531,173.0 ops/s
1,524,721.8 ops/s
1,343,540.7 ops/s
1,520,723.1 ops/s
1,520,476.5 ops/s
1,464,046.2 ops/s
1,467,736.3 ops/s
```

### Optimized (B=64) - 10 Runs

```
1,394,566.7 ops/s
1,422,447.5 ops/s
1,556,167.0 ops/s
1,447,934.5 ops/s
1,359,677.3 ops/s
1,436,005.2 ops/s
1,568,456.7 ops/s
1,423,222.2 ops/s
1,589,416.6 ops/s
1,501,629.6 ops/s
```

### Aggressive (B=256) - 10 Runs

```
1,543,813.0 ops/s
1,436,644.9 ops/s
1,479,174.7 ops/s
1,428,092.3 ops/s
1,419,232.7 ops/s
1,422,254.4 ops/s
1,510,832.1 ops/s
1,417,032.7 ops/s
1,465,069.6 ops/s
1,365,118.3 ops/s
```
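
The summary statistics can be reproduced from these runs. A minimal self-contained C sketch (the original analysis used Python 3; any reimplementation works) that prints the baseline mean, sample standard deviation, and CV matching the throughput table within rounding; compile with `cc stats.c -lm`:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Baseline (B=1) runs, copied from the raw data above. */
    const double runs[] = {
        1615938.8, 1424832.0, 1415710.5, 1531173.0, 1524721.8,
        1343540.7, 1520723.1, 1520476.5, 1464046.2, 1467736.3,
    };
    const int n = sizeof runs / sizeof runs[0];

    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += runs[i];
    const double mean = sum / n;

    double ss = 0.0;
    for (int i = 0; i < n; i++) ss += (runs[i] - mean) * (runs[i] - mean);
    const double stddev = sqrt(ss / (n - 1));  /* sample std dev, as in the table */

    printf("mean   = %.1f ops/s\n", mean);              /* ~1,482,889.9 */
    printf("stddev = %.1f ops/s\n", stddev);            /* ~76,386.4    */
    printf("CV     = %.2f%%\n", 100.0 * stddev / mean); /* ~5.15%       */
    return 0;
}
```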

---

## Conclusion

The Batch Tier Checks optimization, while theoretically sound, **fails to achieve the expected +5-10% throughput improvement** in practice. The -0.87% regression suggests that the overhead of maintaining batch state and performing modulo operations exceeds the savings from reduced atomic operations.

**Key Takeaway:** Not all theoretically beneficial optimizations translate into real-world performance gains. The cost of the bookkeeping itself (TLS state, modulo arithmetic, extra branches) can exceed the savings from reduced atomic operations, especially when those operations are already infrequent (5% of allocations).

**Next Steps:** Investigate the root cause, optimize the implementation, or abandon this approach in favor of alternative optimization strategies.

---

**Report Generated:** 2025-12-04
**Analysis Tool:** Python 3 statistical analysis
**Benchmark Framework:** bench_allocators_hakmem (hakmem custom benchmarking suite)