Problem: Warm pool had a 0% hit rate (only 1 hit per 3976 misses) despite being implemented, causing all cache misses to go through expensive superslab_refill registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- The next refill would push another single slab, which was immediately exhausted
- The pool would oscillate between 0 and 1 items, yielding a 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refills and prefill the pool with 3 additional HOT superslabs before attempting to carve. This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified the unified_cache_refill() cold path to detect an empty pool
- Added a prefill loop: when pool count == 0, load 3 extra superslabs (sketched below)
- Store the extra slabs in the warm pool, keep 1 in TLS for immediate carving
- Track prefill events in the g_warm_pool_stats[].prefilled counter

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After: C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled; ENV-gated printing via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider an adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+, pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
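A minimal sketch of the prefill cold path described in the notes above, assuming hypothetical helper names (`warm_pool_count`, `warm_pool_push`, `carve_block`); only `unified_cache_refill()`, `superslab_refill()`, the `g_warm_pool_stats[].prefilled` counter, and the budget of 3 come from the notes themselves.

```c
#include <stddef.h>

#define WARM_PREFILL_BUDGET 3                 /* hardcoded prefill budget from the notes */

typedef struct { unsigned long prefilled; } warm_pool_stats_t;   /* only the field used here */
extern warm_pool_stats_t g_warm_pool_stats[];                    /* per-class stats counter */

/* Hypothetical helpers (illustrative names, not the hakmem API): */
extern int   warm_pool_count(int cls);
extern void  warm_pool_push(int cls, void *superslab);
extern void *superslab_refill(int cls);       /* expensive registry scan */
extern void *carve_block(void *superslab, int cls);

/* Cold path of unified_cache_refill(): when the warm pool runs dry, prefill it
 * with extra HOT superslabs so it stops oscillating between 0 and 1 items. */
static void *unified_cache_refill_cold(int cls)
{
    if (warm_pool_count(cls) == 0) {
        for (int i = 0; i < WARM_PREFILL_BUDGET; i++) {
            void *ss = superslab_refill(cls);
            if (ss == NULL)
                break;
            warm_pool_push(cls, ss);
            g_warm_pool_stats[cls].prefilled++;   /* prefill event counter */
        }
    }
    /* One more superslab stays in TLS for immediate carving. */
    void *tls_ss = superslab_refill(cls);
    return tls_ss ? carve_block(tls_ss, cls) : NULL;
}
```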
Batch Tier Checks Performance Measurement Results
Date: 2025-12-04
Optimization: Phase A-2 - Batch Tier Checks (Reduce Atomic Operations)
Benchmark: bench_allocators_hakmem --scenario mixed --iterations 100
Executive Summary
RESULT: REGRESSION DETECTED - Optimization does NOT achieve +5-10% improvement
The Batch Tier Checks optimization, designed to reduce atomic operations in the tiny allocation hot path by batching tier checks, shows a -0.87% performance regression with the default batch size (B=64) and a -2.30% regression with aggressive batching (B=256).
Key Findings:
- Throughput: Baseline (B=1) outperforms both B=64 (-0.87%) and B=256 (-2.30%)
- Cache Performance: B=64 shows -11% cache misses (good), but +0.85% CPU cycles (bad)
- Consistency: B=256 has best consistency (CV=3.58%), but worst throughput
- Verdict: The optimization introduces overhead that exceeds the atomic operation savings
Recommendation: DO NOT PROCEED to Phase A-3. Investigate root cause and consider alternative approaches.
Test Configuration
Test Parameters
Benchmark: bench_allocators_hakmem
Workload: mixed (16B, 512B, 8KB, 128KB, 1KB allocations)
Iterations: 100 per run
Runs per config: 10
Platform: Linux 6.8.0-87-generic, x86-64
Compiler: gcc with -O3 -flto -march=native
Configurations Tested
| Config | Batch Size | Description | Atomic Op Reduction |
|---|---|---|---|
| Test A | B=1 | Baseline (no batching) | 0% (every check) |
| Test B | B=64 | Optimized (conservative) | 98.4% (1 per 64 checks) |
| Test C | B=256 | Aggressive (max batching) | 99.6% (1 per 256 checks) |
Performance Results
Throughput Comparison
| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) |
|---|---|---|---|
| Average ops/s | 1,482,889.9 | 1,469,952.3 | 1,448,726.5 |
| Std Dev ops/s | 76,386.4 | 79,114.8 | 51,886.6 |
| Min ops/s | 1,343,540.7 | 1,359,677.3 | 1,365,118.3 |
| Max ops/s | 1,615,938.8 | 1,589,416.6 | 1,543,813.0 |
| CV (%) | 5.15% | 5.38% | 3.58% |
Improvement Analysis:
- B=64 vs B=1: -0.87% (-12,938 ops/s) [REGRESSION]
- B=256 vs B=1: -2.30% (-34,163 ops/s) [REGRESSION]
- B=256 vs B=64: -1.44% (-21,226 ops/s)
CPU Cycles & Cache Performance
| Metric | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) | B=64 vs B=1 | B=256 vs B=1 |
|---|---|---|---|---|---|
| CPU Cycles | 2,349,670,806 | 2,369,727,585 | 2,703,167,708 | +0.85% | +15.04% |
| Cache Misses | 9,672,579 | 8,605,566 | 10,100,798 | -11.03% | +4.43% |
| L1 Cache Misses | 26,465,121 | 26,297,329 | 28,928,265 | -0.63% | +9.31% |
Analysis:
- B=64 reduces cache misses by 11% (expected from fewer atomic ops)
- However, CPU cycles increase by 0.85% (unexpected - should decrease)
- B=256 shows severe regression: +15% cycles, +4.4% cache misses
- L1 cache behavior is mostly neutral for B=64, worse for B=256
Variance & Consistency
| Config | CV (%) | Interpretation |
|---|---|---|
| Baseline (B=1) | 5.15% | Good consistency |
| Optimized (B=64) | 5.38% | Slightly worse |
| Aggressive (B=256) | 3.58% | Best consistency |
Detailed Analysis
1. Why Did the Optimization Fail?
Expected Behavior:
- Reduce atomic operations from 5% of allocations to 0.08% (64x reduction)
- Save ~0.2-0.4 cycles per allocation
- Achieve +5-10% throughput improvement
Actual Behavior:
- Cache misses decreased by 11% (confirms atomic op reduction)
- CPU cycles increased by 0.85% (unexpected overhead)
- Net throughput decreased by 0.87%
Root Cause Hypothesis:
- Thread-local state overhead: The batch counter and cached tier result add TLS storage and access overhead (see the sketch after this list)
  - `g_tier_batch_state[TINY_NUM_CLASSES]` is accessed on every cache miss
  - Modulo operation `(state->refill_count % batch)` may be expensive
  - Branch misprediction on `if ((state->refill_count % batch) == 0)`
- Cache pressure: The batch state array may evict more useful data from cache
  - 8 bytes × 32 classes = 256 bytes of TLS state
  - This competes with actual allocation metadata in L1 cache
- False sharing: Multiple threads may access different elements of the same cache line
  - Though TLS mitigates this, the benchmark may have threading effects
- Batch size mismatch: B=64 may not align with actual cache miss patterns
  - If cache misses are clustered, batching provides no benefit
  - If cache hits dominate, the batch check is rarely needed
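To make the suspected per-miss overhead concrete, here is a reconstructed sketch of the batched check, built only from the identifiers quoted above. The struct layout, the 32-class constant, and the `load_tier_atomic` helper are assumptions drawn from this analysis, not the actual hakmem source.

```c
#define TINY_NUM_CLASSES 32                  /* 32 tiny classes, per the estimate above */

typedef struct {
    unsigned refill_count;                   /* bumped on every cache miss */
    int      cached_tier;                    /* tier result reused between checks */
} tier_batch_state_t;                        /* 8 bytes x 32 classes = 256 B of TLS */

static __thread tier_batch_state_t g_tier_batch_state[TINY_NUM_CLASSES];

extern int load_tier_atomic(int cls);        /* hypothetical: the actual atomic tier load */

/* Only 1 in `batch` misses pays the atomic load (B=64 cuts tier checks from
 * ~5% of allocations to about 0.08%), but the TLS access, the modulo, and the
 * branch below run on every miss, which is the suspected source of the +0.85%
 * cycle increase. */
static int tier_check_batched(int cls, unsigned batch)   /* batch = 1, 64, or 256 */
{
    tier_batch_state_t *state = &g_tier_batch_state[cls];
    /* The first miss and every batch-th miss thereafter refresh the cached tier. */
    if ((state->refill_count % batch) == 0)
        state->cached_tier = load_tier_atomic(cls);
    state->refill_count++;
    return state->cached_tier;
}
```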
2. Why Is B=256 Even Worse?
The aggressive batching (B=256) shows severe regression (+15% cycles):
- Longer staleness period: Tier status can be stale for up to 256 operations
- More allocations from DRAINING SuperSlabs: This causes additional work
- Increased memory pressure: More operations complete before a SuperSlab is discovered to be DRAINING
3. Positive Observations
Despite the regression, some aspects worked:
- Cache miss reduction: B=64 achieved -11% cache misses (atomic ops were reduced)
- Consistency improvement: B=256 has lowest variance (CV=3.58%)
- Code correctness: No crashes or correctness issues observed
Success Criteria Checklist
| Criterion | Expected | Actual | Status |
|---|---|---|---|
| B=64 shows +5-10% improvement | +5-10% | -0.87% | FAIL |
| Cycles reduced as expected | -5% | +0.85% | FAIL |
| Cache behavior improves or neutral | Neutral | -11% misses (good), but +0.85% cycles (bad) | MIXED |
| Variance acceptable (<15%) | <15% | 5.38% | PASS |
| No correctness issues | None | None | PASS |
Overall: FAIL - Optimization does not achieve expected improvement
Comparison: JSON Workload (Invalid Baseline)
Note: Initial measurements used the wrong workload (JSON = 64KB allocations), which does NOT exercise the tiny allocation path where batch tier checks apply.
Results from JSON workload (for reference only):
- All configs showed ~1,070,000 ops/s (nearly identical)
- No improvement because 64KB allocations use L2.5 pool, not Shared Pool
- This confirms the optimization is specific to tiny allocations (<2KB)
Recommendations
Immediate Actions
- DO NOT PROCEED to Phase A-3 (Shared Pool Stage Optimization)
  - Current optimization shows regression, not improvement
  - Need to understand root cause before adding more complexity
- INVESTIGATE overhead sources:
  - Profile the modulo operation cost
  - Check TLS access patterns
  - Measure branch misprediction rate
  - Analyze cache line behavior
- CONSIDER alternative approaches:
  - Use power-of-2 batch sizes for cheaper modulo (bit masking)
  - Precompute batch size at compile time (remove getenv overhead)
  - Try smaller batch sizes (B=16, B=32) for better locality
  - Use per-thread batch counter instead of per-class counter
Future Experiments
If investigating further:
- Test different batch sizes: B=16, B=32, B=128
- Optimize modulo operation: Use `(count & (batch-1))` for power-of-2 batch sizes (see the sketch after this list)
- Reduce TLS footprint: Single global counter instead of per-class
- Profile-guided optimization: Use perf to identify hotspots
- Test with different workloads:
  - Pure tiny allocations (16B-2KB only)
  - High cache miss rate workload
  - Multi-threaded workload
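As a concrete form of the power-of-2 suggestion in both lists above, here is a minimal sketch: with a power-of-two batch the modulo collapses to a single AND of the low bits, and fixing the batch at compile time removes the getenv-derived divisor. The macro name is illustrative only.

```c
#define TIER_CHECK_BATCH 64u                 /* compile-time constant instead of getenv() */

/* Equivalent to (count % TIER_CHECK_BATCH) == 0 whenever the batch is a power
 * of two (16, 32, 64, 256, ...), but compiles to a mask-and-test instead of a
 * division/modulo. */
static inline int is_batch_boundary(unsigned count)
{
    return (count & (TIER_CHECK_BATCH - 1u)) == 0u;
}
```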
Alternative Optimization Strategies
Since batch tier checks failed, consider:
- Shared Pool Stage Optimization (Phase A-3): May still be viable independently
- Superslab-level caching: Cache entire SuperSlab pointer instead of tier status
- Lockless shared pool: Remove atomic operations entirely via per-thread pools
- Lazy tier checking: Only check tier on actual allocation failure (sketched below)
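Of these, lazy tier checking is simple to sketch. Under the stated assumptions (every name below is hypothetical, not a hakmem API), the idea is to drop the periodic check entirely and consult the tier only when a carve attempt fails, so the common path performs no atomic loads and no batch bookkeeping.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef enum { TIER_HOT, TIER_WARM, TIER_DRAINING } tier_t;

typedef struct {
    _Atomic int tier;                        /* published tier status (tier_t values) */
    /* ... cached SuperSlab pointer, carve cursor, etc. ... */
} class_state_t;

/* Hypothetical helpers: carve from the currently bound SuperSlab (NULL when it
 * is exhausted) and bind a fresh one. */
extern void *try_carve(class_state_t *cs);
extern void  rebind_superslab(class_state_t *cs);

/* Lazy tier checking: no per-miss counter, no modulo, no periodic atomic load.
 * The tier is read only after a carve fails, which is exactly when the answer
 * can change what happens next. */
static void *alloc_lazy_tier(class_state_t *cs)
{
    void *p = try_carve(cs);
    if (p != NULL)
        return p;                            /* fast path: zero atomics */

    if (atomic_load_explicit(&cs->tier, memory_order_acquire) == TIER_DRAINING)
        rebind_superslab(cs);                /* abandon the draining SuperSlab */

    return try_carve(cs);
}
```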
Raw Data
Baseline (B=1) - 10 Runs
1,615,938.8 ops/s
1,424,832.0 ops/s
1,415,710.5 ops/s
1,531,173.0 ops/s
1,524,721.8 ops/s
1,343,540.7 ops/s
1,520,723.1 ops/s
1,520,476.5 ops/s
1,464,046.2 ops/s
1,467,736.3 ops/s
Optimized (B=64) - 10 Runs
1,394,566.7 ops/s
1,422,447.5 ops/s
1,556,167.0 ops/s
1,447,934.5 ops/s
1,359,677.3 ops/s
1,436,005.2 ops/s
1,568,456.7 ops/s
1,423,222.2 ops/s
1,589,416.6 ops/s
1,501,629.6 ops/s
Aggressive (B=256) - 10 Runs
1,543,813.0 ops/s
1,436,644.9 ops/s
1,479,174.7 ops/s
1,428,092.3 ops/s
1,419,232.7 ops/s
1,422,254.4 ops/s
1,510,832.1 ops/s
1,417,032.7 ops/s
1,465,069.6 ops/s
1,365,118.3 ops/s
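The summary rows above can be reproduced directly from these runs; the tables use the sample standard deviation (n - 1 in the denominator) and CV = stddev / mean. A minimal check for the baseline series:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Baseline (B=1) runs, ops/s, copied from the raw data above. */
    const double runs[] = {
        1615938.8, 1424832.0, 1415710.5, 1531173.0, 1524721.8,
        1343540.7, 1520723.1, 1520476.5, 1464046.2, 1467736.3,
    };
    const int n = (int)(sizeof runs / sizeof runs[0]);

    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += runs[i];
    const double mean = sum / n;

    double ss = 0.0;
    for (int i = 0; i < n; i++) ss += (runs[i] - mean) * (runs[i] - mean);
    const double stddev = sqrt(ss / (n - 1));   /* sample standard deviation */

    /* Expected output: mean=1482889.9 stddev=76386.4 CV=5.15% */
    printf("mean=%.1f stddev=%.1f CV=%.2f%%\n", mean, stddev, 100.0 * stddev / mean);
    return 0;
}
```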
Conclusion
The Batch Tier Checks optimization, while theoretically sound, fails to achieve the expected +5-10% throughput improvement in practice. The -0.87% regression suggests that the overhead of maintaining batch state and performing modulo operations exceeds the savings from reduced atomic operations.
Key Takeaway: Not all theoretically beneficial optimizations translate to real-world performance gains. The overhead of bookkeeping (TLS state, modulo, branches) can exceed the savings from reduced atomic operations, especially when those operations are already infrequent (5% of allocations).
Next Steps: Investigate root cause, optimize the implementation, or abandon this approach in favor of alternative optimization strategies.
Report Generated: 2025-12-04
Analysis Tool: Python 3 statistical analysis
Benchmark Framework: bench_allocators_hakmem (hakmem custom benchmarking suite)