hakmem/BATCH_TIER_CHECKS_PERF_RESULTS_20251204.md
Commit 5685c2f4c9 by Moe Charm (CI): Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refills and
prefill the pool with 3 additional HOT superslabs before attempting to carve.
This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when the pool count == 0, load 3 extra superslabs (see the sketch below)
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
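
The cold path change can be pictured roughly as follows. This is a minimal sketch based on the description above, not the actual hakmem code: the helper names (`warm_pool_count`, `warm_pool_push`) and the wrapper function are assumptions; only `superslab_refill`, `g_warm_pool_stats[].prefilled`, and the budget of 3 come from the commit itself.

```c
/* Sketch only: hypothetical types and helper signatures inferred from the
 * commit description; not the actual hakmem code. */
typedef struct superslab superslab_t;

superslab_t *superslab_refill(int size_class);                 /* existing slow path (registry scan) */
int          warm_pool_count(int size_class);                  /* assumed helper */
void         warm_pool_push(int size_class, superslab_t *ss);  /* assumed helper */
extern struct { unsigned long prefilled; } g_warm_pool_stats[];

#define WARM_POOL_PREFILL_BUDGET 3  /* extra HOT superslabs loaded when the pool is empty */

static superslab_t *refill_with_secondary_prefill(int size_class) {
    /* One refill for immediate carving (this slab stays in TLS). */
    superslab_t *ss = superslab_refill(size_class);
    if (!ss) return NULL;

    /* Pool is empty: load a few extra superslabs so the next misses skip the registry scan. */
    if (warm_pool_count(size_class) == 0) {
        for (int i = 0; i < WARM_POOL_PREFILL_BUDGET; i++) {
            superslab_t *extra = superslab_refill(size_class);
            if (!extra) break;
            warm_pool_push(size_class, extra);      /* bounded by TINY_WARM_POOL_MAX_PER_CLASS */
        }
        g_warm_pool_stats[size_class].prefilled++;  /* stats always compiled; printing is ENV-gated */
    }
    return ss;
}
```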

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+, pending a registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Commit date: 2025-12-04 23:31:54 +09:00

# Batch Tier Checks Performance Measurement Results
**Date:** 2025-12-04
**Optimization:** Phase A-2 - Batch Tier Checks (Reduce Atomic Operations)
**Benchmark:** bench_allocators_hakmem --scenario mixed --iterations 100
---
## Executive Summary
**RESULT: REGRESSION DETECTED - Optimization does NOT achieve +5-10% improvement**
The Batch Tier Checks optimization, designed to reduce atomic operations in the tiny allocation hot path by batching tier checks, shows a **-0.87% performance regression** with the default batch size (B=64) and **-2.30% regression** with aggressive batching (B=256).
**Key Findings:**
- **Throughput:** Baseline (B=1) outperforms both B=64 (-0.87%) and B=256 (-2.30%)
- **Cache Performance:** B=64 shows -11% cache misses (good), but +0.85% CPU cycles (bad)
- **Consistency:** B=256 has best consistency (CV=3.58%), but worst throughput
- **Verdict:** The optimization introduces overhead that exceeds the atomic operation savings
**Recommendation:** **DO NOT PROCEED** to Phase A-3. Investigate root cause and consider alternative approaches.
---
## Test Configuration
### Test Parameters
```
Benchmark: bench_allocators_hakmem
Workload: mixed (16B, 512B, 8KB, 128KB, 1KB allocations)
Iterations: 100 per run
Runs per config: 10
Platform: Linux 6.8.0-87-generic, x86-64
Compiler: gcc with -O3 -flto -march=native
```
### Configurations Tested
| Config | Batch Size | Description | Atomic Op Reduction |
|--------|------------|-------------|---------------------|
| **Test A** | B=1 | Baseline (no batching) | 0% (every check) |
| **Test B** | B=64 | Optimized (conservative) | 98.4% (1 per 64 checks) |
| **Test C** | B=256 | Aggressive (max batching) | 99.6% (1 per 256 checks) |
---
## Performance Results
### Throughput Comparison
| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) |
|--------|---------------:|------------------:|-------------------:|
| **Average ops/s** | **1,482,889.9** | 1,469,952.3 | 1,448,726.5 |
| Std Dev ops/s | 76,386.4 | 79,114.8 | 51,886.6 |
| Min ops/s | 1,343,540.7 | 1,359,677.3 | 1,365,118.3 |
| Max ops/s | 1,615,938.8 | 1,589,416.6 | 1,543,813.0 |
| CV (%) | 5.15% | 5.38% | 3.58% |
**Improvement Analysis:**
- **B=64 vs B=1:** **-0.87%** (-12,938 ops/s) **[REGRESSION]**
- **B=256 vs B=1:** **-2.30%** (-34,163 ops/s) **[REGRESSION]**
- **B=256 vs B=64:** -1.44% (-21,226 ops/s)
### CPU Cycles & Cache Performance
| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) | B=64 vs B=1 | B=256 vs B=1 |
|--------|---------------:|------------------:|-------------------:|------------:|-------------:|
| **CPU Cycles** | 2,349,670,806 | 2,369,727,585 | 2,703,167,708 | **+0.85%** | **+15.04%** |
| **Cache Misses** | 9,672,579 | 8,605,566 | 10,100,798 | **-11.03%** | **+4.43%** |
| **L1 Cache Misses** | 26,465,121 | 26,297,329 | 28,928,265 | **-0.63%** | **+9.31%** |
**Analysis:**
- B=64 reduces cache misses by 11% (expected from fewer atomic ops)
- However, CPU cycles **increase** by 0.85% (unexpected - should decrease)
- B=256 shows severe regression: +15% cycles, +4.4% cache misses
- L1 cache behavior is mostly neutral for B=64, worse for B=256
### Variance & Consistency
| Config | CV (%) | Interpretation |
|--------|-------:|----------------|
| Baseline (B=1) | 5.15% | Good consistency |
| Optimized (B=64) | 5.38% | Slightly worse |
| Aggressive (B=256) | 3.58% | Best consistency |
---
## Detailed Analysis
### 1. Why Did the Optimization Fail?
**Expected Behavior:**
- Reduce atomic operations from 5% of allocations to 0.08% (64x reduction)
- Save ~0.2-0.4 cycles per allocation
- Achieve +5-10% throughput improvement
**Actual Behavior:**
- Cache misses decreased by 11% (confirms atomic op reduction)
- CPU cycles **increased** by 0.85% (unexpected overhead)
- Net throughput **decreased** by 0.87%
**Root Cause Hypothesis** (the batched-check pattern is sketched after this list):
1. **Thread-local state overhead:** The batch counter and cached tier result add TLS storage and access overhead
- `g_tier_batch_state[TINY_NUM_CLASSES]` is accessed on every cache miss
- Modulo operation `(state->refill_count % batch)` may be expensive
- Branch misprediction on `if ((state->refill_count % batch) == 0)`
2. **Cache pressure:** The batch state array may evict more useful data from cache
- 8 bytes × 32 classes = 256 bytes of TLS state
- This competes with actual allocation metadata in L1 cache
3. **False sharing:** Multiple threads may access different elements of the same cache line
- Thread-local storage makes false sharing on `g_tier_batch_state` itself unlikely, but the benchmark may still have other threading effects
4. **Batch size mismatch:** B=64 may not align with actual cache miss patterns
- If cache misses are clustered, batching provides no benefit
- If cache hits dominate, the batch check is rarely needed
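For context, the pattern under test looks roughly like the sketch below. The field and function names are assumed; only `g_tier_batch_state`, `TINY_NUM_CLASSES`, and the `refill_count % batch` check appear in the text above. The relevant point for the hypotheses is that even when the atomic load is skipped, every cache miss still pays for the TLS access, the modulo, and a branch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-class, thread-local batch state (8 bytes per class,
 * i.e. 256 bytes of TLS for 32 classes). Names and layout assumed. */
typedef struct {
    uint32_t refill_count;   /* cache misses seen for this class */
    int      cached_tier;    /* tier status from the last real check */
} tier_batch_state_t;

#define TINY_NUM_CLASSES 32
static _Thread_local tier_batch_state_t g_tier_batch_state[TINY_NUM_CLASSES];

extern _Atomic int g_superslab_tier[TINY_NUM_CLASSES];  /* shared tier status (name assumed) */

/* Batched tier check: only 1 in `batch` cache misses touches the atomic. */
static int tier_check_batched(int cls, uint32_t batch) {
    tier_batch_state_t *state = &g_tier_batch_state[cls];
    if ((state->refill_count % batch) == 0) {            /* modulo + branch on every miss */
        state->cached_tier =
            atomic_load_explicit(&g_superslab_tier[cls], memory_order_acquire);
    }
    state->refill_count++;
    return state->cached_tier;                           /* may be stale for up to `batch` misses */
}
```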
### 2. Why Is B=256 Even Worse?
The aggressive batching (B=256) shows severe regression (+15% cycles):
- **Longer staleness period:** Tier status can be stale for up to 256 operations
- **More allocations from DRAINING SuperSlabs:** This causes additional work
- **Increased memory pressure:** More operations before discovering SuperSlab is DRAINING
### 3. Positive Observations
Despite the regression, some aspects worked:
1. **Cache miss reduction:** B=64 achieved -11% cache misses (atomic ops were reduced)
2. **Consistency improvement:** B=256 has lowest variance (CV=3.58%)
3. **Code correctness:** No crashes or correctness issues observed
---
## Success Criteria Checklist
| Criterion | Expected | Actual | Status |
|-----------|----------|--------|--------|
| B=64 shows +5-10% improvement | +5-10% | **-0.87%** | **FAIL** |
| Cycles reduced as expected | -5% | **+0.85%** | **FAIL** |
| Cache behavior improves or neutral | Neutral | -11% misses (good), but +0.85% cycles (bad) | **MIXED** |
| Variance acceptable (<15%) | <15% | 5.38% | **PASS** |
| No correctness issues | None | None | **PASS** |
**Overall: FAIL - Optimization does not achieve expected improvement**
---
## Comparison: JSON Workload (Invalid Baseline)
**Note:** Initial measurements used the wrong workload (JSON = 64KB allocations), which does NOT exercise the tiny allocation path where batch tier checks apply.
Results from JSON workload (for reference only):
- All configs showed ~1,070,000 ops/s (nearly identical)
- No improvement because 64KB allocations use L2.5 pool, not Shared Pool
- This confirms the optimization is specific to tiny allocations (<2KB)
---
## Recommendations
### Immediate Actions
1. **DO NOT PROCEED to Phase A-3** (Shared Pool Stage Optimization)
- Current optimization shows regression, not improvement
- Need to understand root cause before adding more complexity
2. **INVESTIGATE overhead sources:**
- Profile the modulo operation cost
- Check TLS access patterns
- Measure branch misprediction rate
- Analyze cache line behavior
3. **CONSIDER alternative approaches:**
- Use power-of-2 batch sizes for cheaper modulo (bit masking)
- Precompute batch size at compile time (remove getenv overhead)
- Try smaller batch sizes (B=16, B=32) for better locality
- Use per-thread batch counter instead of per-class counter
### Future Experiments
If investigating further:
1. **Test different batch sizes:** B=16, B=32, B=128
2. **Optimize modulo operation:** Use `(count & (batch - 1))` for power-of-two batch sizes (see the snippet after this list)
3. **Reduce TLS footprint:** Single global counter instead of per-class
4. **Profile-guided optimization:** Use perf to identify hotspots
5. **Test with different workloads:**
- Pure tiny allocations (16B-2KB only)
- High cache miss rate workload
- Multi-threaded workload
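For experiment 2, a minimal sketch of the masked check, assuming the batch size is constrained to a power of two and the mask is computed once at startup rather than per call:

```c
#include <stdbool.h>
#include <stdint.h>

/* batch_mask = batch - 1, with batch a power of two (16, 32, 64, ...).
 * Computed once at init, so the hot path is a single AND plus compare
 * instead of an integer modulo. */
static inline bool tier_check_due(uint32_t refill_count, uint32_t batch_mask) {
    return (refill_count & batch_mask) == 0;
}
```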
### Alternative Optimization Strategies
Since batch tier checks failed, consider:
1. **Shared Pool Stage Optimization (Phase A-3):** May still be viable independently
2. **Superslab-level caching:** Cache entire SuperSlab pointer instead of tier status
3. **Lockless shared pool:** Remove atomic operations entirely via per-thread pools
4. **Lazy tier checking:** Only check tier on actual allocation failure
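As a purely illustrative sketch of option 4 (names and signatures assumed, not the real hakmem API), the tier status would only be consulted after a carve attempt fails, rather than on every refill:

```c
/* Sketch of "lazy tier checking": all declarations below are assumptions. */
typedef struct superslab superslab_t;

void *superslab_try_carve(superslab_t *ss, int cls);   /* NULL when the slab is exhausted */
int   superslab_tier_check(superslab_t *ss);           /* the atomic check being deferred */
superslab_t *superslab_refill(int cls);

static void *tiny_alloc_lazy(superslab_t **cur, int cls) {
    if (*cur) {
        void *p = superslab_try_carve(*cur, cls);
        if (p) return p;                    /* fast path: no tier check at all */
        (void)superslab_tier_check(*cur);   /* only now pay for the atomic load */
    }
    *cur = superslab_refill(cls);           /* pick a new SuperSlab */
    return *cur ? superslab_try_carve(*cur, cls) : NULL;
}
```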
---
## Raw Data
### Baseline (B=1) - 10 Runs
```
1,615,938.8 ops/s
1,424,832.0 ops/s
1,415,710.5 ops/s
1,531,173.0 ops/s
1,524,721.8 ops/s
1,343,540.7 ops/s
1,520,723.1 ops/s
1,520,476.5 ops/s
1,464,046.2 ops/s
1,467,736.3 ops/s
```
### Optimized (B=64) - 10 Runs
```
1,394,566.7 ops/s
1,422,447.5 ops/s
1,556,167.0 ops/s
1,447,934.5 ops/s
1,359,677.3 ops/s
1,436,005.2 ops/s
1,568,456.7 ops/s
1,423,222.2 ops/s
1,589,416.6 ops/s
1,501,629.6 ops/s
```
### Aggressive (B=256) - 10 Runs
```
1,543,813.0 ops/s
1,436,644.9 ops/s
1,479,174.7 ops/s
1,428,092.3 ops/s
1,419,232.7 ops/s
1,422,254.4 ops/s
1,510,832.1 ops/s
1,417,032.7 ops/s
1,465,069.6 ops/s
1,365,118.3 ops/s
```
---
## Conclusion
The Batch Tier Checks optimization, while theoretically sound, **fails to achieve the expected +5-10% throughput improvement** in practice. The -0.87% regression suggests that the overhead of maintaining batch state and performing modulo operations exceeds the savings from reduced atomic operations.
**Key Takeaway:** Not all theoretically beneficial optimizations translate to real-world performance gains. The overhead of bookkeeping (TLS state, modulo, branches) can exceed the savings from reduced atomic operations, especially when those operations are already infrequent (5% of allocations).
**Next Steps:** Investigate root cause, optimize the implementation, or abandon this approach in favor of alternative optimization strategies.
---
**Report Generated:** 2025-12-04
**Analysis Tool:** Python 3 statistical analysis
**Benchmark Framework:** bench_allocators_hakmem (hakmem custom benchmarking suite)