Problem: Warm pool had a 0% hit rate (only 1 hit per 3976 misses) despite being implemented, causing all cache misses to go through expensive superslab_refill registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- The next refill would push another single slab, which was immediately exhausted
- The pool would oscillate between 0 and 1 items, yielding a 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refills and prefill the pool with 3 additional HOT superslabs before attempting to carve. This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified the unified_cache_refill() cold path to detect an empty pool
- Added a prefill loop: when pool count == 0, load 3 extra superslabs (sketched below)
- Store the extra slabs in the warm pool, keep 1 in TLS for immediate carving
- Track prefill events in the g_warm_pool_stats[].prefilled counter

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After: C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled; ENV-gated printing via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider an adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+, pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
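A minimal sketch of the prefill cold path described in the notes above, assuming hypothetical helper names (`warm_pool_count`, `warm_pool_push`, `carve_block`); only `unified_cache_refill()`, `superslab_refill()`, the `g_warm_pool_stats[].prefilled` counter, and the budget of 3 come from the notes themselves.

```c
#include <stddef.h>

#define WARM_PREFILL_BUDGET 3                 /* hardcoded prefill budget from the notes */

typedef struct { unsigned long prefilled; } warm_pool_stats_t;   /* only the field used here */
extern warm_pool_stats_t g_warm_pool_stats[];                    /* per-class stats counter */

/* Hypothetical helpers (illustrative names, not the hakmem API): */
extern int   warm_pool_count(int cls);
extern void  warm_pool_push(int cls, void *superslab);
extern void *superslab_refill(int cls);       /* expensive registry scan */
extern void *carve_block(void *superslab, int cls);

/* Cold path of unified_cache_refill(): when the warm pool runs dry, prefill it
 * with extra HOT superslabs so it stops oscillating between 0 and 1 items. */
static void *unified_cache_refill_cold(int cls)
{
    if (warm_pool_count(cls) == 0) {
        for (int i = 0; i < WARM_PREFILL_BUDGET; i++) {
            void *ss = superslab_refill(cls);
            if (ss == NULL)
                break;
            warm_pool_push(cls, ss);
            g_warm_pool_stats[cls].prefilled++;   /* prefill event counter */
        }
    }
    /* One more superslab stays in TLS for immediate carving. */
    void *tls_ss = superslab_refill(cls);
    return tls_ss ? carve_block(tls_ss, cls) : NULL;
}
```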
Batch Tier Checks Performance Measurement Results
Date: 2025-12-04
Optimization: Phase A-2 - Batch Tier Checks (Reduce Atomic Operations)
Benchmark: bench_allocators_hakmem --scenario mixed --iterations 100
Executive Summary
RESULT: REGRESSION DETECTED - Optimization does NOT achieve +5-10% improvement
The Batch Tier Checks optimization, designed to reduce atomic operations in the tiny allocation hot path by batching tier checks, shows a -0.87% performance regression with the default batch size (B=64) and a -2.30% regression with aggressive batching (B=256).
Key Findings:
- Throughput: Baseline (B=1) outperforms both B=64 (-0.87%) and B=256 (-2.30%)
- Cache Performance: B=64 shows -11% cache misses (good), but +0.85% CPU cycles (bad)
- Consistency: B=256 has best consistency (CV=3.58%), but worst throughput
- Verdict: The optimization introduces overhead that exceeds the atomic operation savings
Recommendation: DO NOT PROCEED to Phase A-3. Investigate root cause and consider alternative approaches.
Test Configuration
Test Parameters
Benchmark: bench_allocators_hakmem
Workload: mixed (16B, 512B, 8KB, 128KB, 1KB allocations)
Iterations: 100 per run
Runs per config: 10
Platform: Linux 6.8.0-87-generic, x86-64
Compiler: gcc with -O3 -flto -march=native
Configurations Tested
| Config | Batch Size | Description | Atomic Op Reduction |
|---|---|---|---|
| Test A | B=1 | Baseline (no batching) | 0% (every check) |
| Test B | B=64 | Optimized (conservative) | 98.4% (1 per 64 checks) |
| Test C | B=256 | Aggressive (max batching) | 99.6% (1 per 256 checks) |
Performance Results
Throughput Comparison
| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) |
|---|---|---|---|
| Average ops/s | 1,482,889.9 | 1,469,952.3 | 1,448,726.5 |
| Std Dev ops/s | 76,386.4 | 79,114.8 | 51,886.6 |
| Min ops/s | 1,343,540.7 | 1,359,677.3 | 1,365,118.3 |
| Max ops/s | 1,615,938.8 | 1,589,416.6 | 1,543,813.0 |
| CV (%) | 5.15% | 5.38% | 3.58% |
Improvement Analysis:
- B=64 vs B=1: -0.87% (-12,938 ops/s) [REGRESSION]
- B=256 vs B=1: -2.30% (-34,163 ops/s) [REGRESSION]
- B=256 vs B=64: -1.44% (-21,226 ops/s)
CPU Cycles & Cache Performance
| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) | B=64 vs B=1 | B=256 vs B=1 |
|---|---|---|---|---|---|
| CPU Cycles | 2,349,670,806 | 2,369,727,585 | 2,703,167,708 | +0.85% | +15.04% |
| Cache Misses | 9,672,579 | 8,605,566 | 10,100,798 | -11.03% | +4.43% |
| L1 Cache Misses | 26,465,121 | 26,297,329 | 28,928,265 | -0.63% | +9.31% |
Analysis:
- B=64 reduces cache misses by 11% (expected from fewer atomic ops)
- However, CPU cycles increase by 0.85% (unexpected - should decrease)
- B=256 shows severe regression: +15% cycles, +4.4% cache misses
- L1 cache behavior is mostly neutral for B=64, worse for B=256
Variance & Consistency
| Config | CV (%) | Interpretation |
|---|---|---|
| Baseline (B=1) | 5.15% | Good consistency |
| Optimized (B=64) | 5.38% | Slightly worse |
| Aggressive (B=256) | 3.58% | Best consistency |
Detailed Analysis
1. Why Did the Optimization Fail?
Expected Behavior:
- Reduce atomic operations from 5% of allocations to 0.08% (64x reduction)
- Save ~0.2-0.4 cycles per allocation
- Achieve +5-10% throughput improvement
Actual Behavior:
- Cache misses decreased by 11% (confirms atomic op reduction)
- CPU cycles increased by 0.85% (unexpected overhead)
- Net throughput decreased by 0.87%
Root Cause Hypothesis:
- Thread-local state overhead: The batch counter and cached tier result add TLS storage and access overhead (see the sketch after this list)
  - `g_tier_batch_state[TINY_NUM_CLASSES]` is accessed on every cache miss
  - Modulo operation `(state->refill_count % batch)` may be expensive
  - Branch misprediction on `if ((state->refill_count % batch) == 0)`
- Cache pressure: The batch state array may evict more useful data from cache
  - 8 bytes × 32 classes = 256 bytes of TLS state
  - This competes with actual allocation metadata in L1 cache
- False sharing: Multiple threads may access different elements of the same cache line
  - Though TLS mitigates this, the benchmark may have threading effects
- Batch size mismatch: B=64 may not align with actual cache miss patterns
  - If cache misses are clustered, batching provides no benefit
  - If cache hits dominate, the batch check is rarely needed
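To make the suspected per-miss overhead concrete, here is a reconstructed sketch of the batched check, built only from the identifiers quoted above. The struct layout, the 32-class constant, and the `load_tier_atomic` helper are assumptions drawn from this analysis, not the actual hakmem source.

```c
#define TINY_NUM_CLASSES 32                  /* 32 tiny classes, per the estimate above */

typedef struct {
    unsigned refill_count;                   /* bumped on every cache miss */
    int      cached_tier;                    /* tier result reused between checks */
} tier_batch_state_t;                        /* 8 bytes x 32 classes = 256 B of TLS */

static __thread tier_batch_state_t g_tier_batch_state[TINY_NUM_CLASSES];

extern int load_tier_atomic(int cls);        /* hypothetical: the actual atomic tier load */

/* Only 1 in `batch` misses pays the atomic load (B=64 cuts tier checks from
 * ~5% of allocations to about 0.08%), but the TLS access, the modulo, and the
 * branch below run on every miss, which is the suspected source of the +0.85%
 * cycle increase. */
static int tier_check_batched(int cls, unsigned batch)   /* batch = 1, 64, or 256 */
{
    tier_batch_state_t *state = &g_tier_batch_state[cls];
    /* The first miss and every batch-th miss thereafter refresh the cached tier. */
    if ((state->refill_count % batch) == 0)
        state->cached_tier = load_tier_atomic(cls);
    state->refill_count++;
    return state->cached_tier;
}
```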
2. Why Is B=256 Even Worse?
The aggressive batching (B=256) shows severe regression (+15% cycles):
- Longer staleness period: Tier status can be stale for up to 256 operations
- More allocations from DRAINING SuperSlabs: This causes additional work
- Increased memory pressure: More operations complete before a SuperSlab is discovered to be DRAINING
3. Positive Observations
Despite the regression, some aspects worked:
- Cache miss reduction: B=64 achieved -11% cache misses (atomic ops were reduced)
- Consistency improvement: B=256 has lowest variance (CV=3.58%)
- Code correctness: No crashes or correctness issues observed
Success Criteria Checklist
| Criterion | Expected | Actual | Status |
|---|---|---|---|
| B=64 shows +5-10% improvement | +5-10% | -0.87% | FAIL |
| Cycles reduced as expected | -5% | +0.85% | FAIL |
| Cache behavior improves or neutral | Neutral | -11% misses (good), but +0.85% cycles (bad) | MIXED |
| Variance acceptable (<15%) | <15% | 5.38% | PASS |
| No correctness issues | None | None | PASS |
Overall: FAIL - Optimization does not achieve expected improvement
Comparison: JSON Workload (Invalid Baseline)
Note: Initial measurements used the wrong workload (JSON = 64KB allocations), which does NOT exercise the tiny allocation path where batch tier checks apply.
Results from JSON workload (for reference only):
- All configs showed ~1,070,000 ops/s (nearly identical)
- No improvement because 64KB allocations use L2.5 pool, not Shared Pool
- This confirms the optimization is specific to tiny allocations (<2KB)
Recommendations
Immediate Actions
- DO NOT PROCEED to Phase A-3 (Shared Pool Stage Optimization)
  - Current optimization shows regression, not improvement
  - Need to understand root cause before adding more complexity
- INVESTIGATE overhead sources:
  - Profile the modulo operation cost
  - Check TLS access patterns
  - Measure branch misprediction rate
  - Analyze cache line behavior
- CONSIDER alternative approaches:
  - Use power-of-2 batch sizes for cheaper modulo (bit masking)
  - Precompute batch size at compile time (remove getenv overhead)
  - Try smaller batch sizes (B=16, B=32) for better locality
  - Use per-thread batch counter instead of per-class counter
Future Experiments
If investigating further:
- Test different batch sizes: B=16, B=32, B=128
- Optimize modulo operation: Use `(count & (batch-1))` for power-of-2 batch sizes (see the sketch after this list)
- Reduce TLS footprint: Single global counter instead of per-class
- Profile-guided optimization: Use perf to identify hotspots
- Test with different workloads:
  - Pure tiny allocations (16B-2KB only)
  - High cache miss rate workload
  - Multi-threaded workload
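As a concrete form of the power-of-2 suggestion in both lists above, here is a minimal sketch: with a power-of-two batch the modulo collapses to a single AND of the low bits, and fixing the batch at compile time removes the getenv-derived divisor. The macro name is illustrative only.

```c
#define TIER_CHECK_BATCH 64u                 /* compile-time constant instead of getenv() */

/* Equivalent to (count % TIER_CHECK_BATCH) == 0 whenever the batch is a power
 * of two (16, 32, 64, 256, ...), but compiles to a mask-and-test instead of a
 * division/modulo. */
static inline int is_batch_boundary(unsigned count)
{
    return (count & (TIER_CHECK_BATCH - 1u)) == 0u;
}
```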
Alternative Optimization Strategies
Since batch tier checks failed, consider:
- Shared Pool Stage Optimization (Phase A-3): May still be viable independently
- Superslab-level caching: Cache entire SuperSlab pointer instead of tier status
- Lockless shared pool: Remove atomic operations entirely via per-thread pools
- Lazy tier checking: Only check tier on actual allocation failure (sketched below)
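Of these, lazy tier checking is simple to sketch. Under the stated assumptions (every name below is hypothetical, not a hakmem API), the idea is to drop the periodic check entirely and consult the tier only when a carve attempt fails, so the common path performs no atomic loads and no batch bookkeeping.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef enum { TIER_HOT, TIER_WARM, TIER_DRAINING } tier_t;

typedef struct {
    _Atomic int tier;                        /* published tier status (tier_t values) */
    /* ... cached SuperSlab pointer, carve cursor, etc. ... */
} class_state_t;

/* Hypothetical helpers: carve from the currently bound SuperSlab (NULL when it
 * is exhausted) and bind a fresh one. */
extern void *try_carve(class_state_t *cs);
extern void  rebind_superslab(class_state_t *cs);

/* Lazy tier checking: no per-miss counter, no modulo, no periodic atomic load.
 * The tier is read only after a carve fails, which is exactly when the answer
 * can change what happens next. */
static void *alloc_lazy_tier(class_state_t *cs)
{
    void *p = try_carve(cs);
    if (p != NULL)
        return p;                            /* fast path: zero atomics */

    if (atomic_load_explicit(&cs->tier, memory_order_acquire) == TIER_DRAINING)
        rebind_superslab(cs);                /* abandon the draining SuperSlab */

    return try_carve(cs);
}
```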
Raw Data
Baseline (B=1) - 10 Runs
1,615,938.8 ops/s
1,424,832.0 ops/s
1,415,710.5 ops/s
1,531,173.0 ops/s
1,524,721.8 ops/s
1,343,540.7 ops/s
1,520,723.1 ops/s
1,520,476.5 ops/s
1,464,046.2 ops/s
1,467,736.3 ops/s
Optimized (B=64) - 10 Runs
1,394,566.7 ops/s
1,422,447.5 ops/s
1,556,167.0 ops/s
1,447,934.5 ops/s
1,359,677.3 ops/s
1,436,005.2 ops/s
1,568,456.7 ops/s
1,423,222.2 ops/s
1,589,416.6 ops/s
1,501,629.6 ops/s
Aggressive (B=256) - 10 Runs
1,543,813.0 ops/s
1,436,644.9 ops/s
1,479,174.7 ops/s
1,428,092.3 ops/s
1,419,232.7 ops/s
1,422,254.4 ops/s
1,510,832.1 ops/s
1,417,032.7 ops/s
1,465,069.6 ops/s
1,365,118.3 ops/s
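The summary rows above can be reproduced directly from these runs; the tables use the sample standard deviation (n - 1 in the denominator) and CV = stddev / mean. A minimal check for the baseline series:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Baseline (B=1) runs, ops/s, copied from the raw data above. */
    const double runs[] = {
        1615938.8, 1424832.0, 1415710.5, 1531173.0, 1524721.8,
        1343540.7, 1520723.1, 1520476.5, 1464046.2, 1467736.3,
    };
    const int n = (int)(sizeof runs / sizeof runs[0]);

    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += runs[i];
    const double mean = sum / n;

    double ss = 0.0;
    for (int i = 0; i < n; i++) ss += (runs[i] - mean) * (runs[i] - mean);
    const double stddev = sqrt(ss / (n - 1));   /* sample standard deviation */

    /* Expected output: mean=1482889.9 stddev=76386.4 CV=5.15% */
    printf("mean=%.1f stddev=%.1f CV=%.2f%%\n", mean, stddev, 100.0 * stddev / mean);
    return 0;
}
```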
Conclusion
The Batch Tier Checks optimization, while theoretically sound, fails to achieve the expected +5-10% throughput improvement in practice. The -0.87% regression suggests that the overhead of maintaining batch state and performing modulo operations exceeds the savings from reduced atomic operations.
Key Takeaway: Not all theoretically beneficial optimizations translate to real-world performance gains. The overhead of bookkeeping (TLS state, modulo, branches) can exceed the savings from reduced atomic operations, especially when those operations are already infrequent (5% of allocations).
Next Steps: Investigate root cause, optimize the implementation, or abandon this approach in favor of alternative optimization strategies.
Report Generated: 2025-12-04
Analysis Tool: Python 3 statistical analysis
Benchmark Framework: bench_allocators_hakmem (hakmem custom benchmarking suite)