hakmem/BATCH_TIER_CHECKS_PERF_RESULTS_20251204.md
Commit 5685c2f4c9 by Moe Charm (CI): Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: The warm pool had a 0% hit rate (only 1 hit against 3976 misses) despite being
implemented, so effectively every cache miss went through an expensive
superslab_refill registry scan.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refills and prefill
the pool with 3 additional HOT superslabs before attempting to carve. This
builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs (sketched after this list)
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
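
A minimal sketch of this cold path, assuming hypothetical helpers (warm_pool_count, warm_pool_push, carve_from); only unified_cache_refill, superslab_refill, and g_warm_pool_stats are named above:

    // Sketch of the secondary prefill in unified_cache_refill()'s cold path.
    #define PREFILL_BUDGET 3  // hardcoded budget of 3 extra superslabs

    static void *unified_cache_refill_cold(int class_idx) {
        if (warm_pool_count(class_idx) == 0) {
            // Empty pool: load extra HOT superslabs so the pool can absorb
            // the next several misses instead of oscillating between 0 and 1.
            for (int i = 0; i < PREFILL_BUDGET; i++) {
                SuperSlab *ss = superslab_refill(class_idx);
                if (!ss) break;                        // registry exhausted
                warm_pool_push(class_idx, ss);
            }
            g_warm_pool_stats[class_idx].prefilled++;  // track prefill events
        }
        SuperSlab *current = superslab_refill(class_idx);  // kept in TLS
        return carve_from(current, class_idx);             // immediate carving
    }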

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- Statistics are always compiled in; printing is ENV-gated via HAKMEM_WARM_POOL_STATS=1 (see the sketch below)
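
A minimal sketch of the ENV gate, assuming hits/misses counters alongside the prefilled field (only .prefilled is named above):

    #include <stdio.h>
    #include <stdlib.h>

    // Sketch of ENV-gated stats printing; counters are always maintained,
    // output is opt-in via HAKMEM_WARM_POOL_STATS=1.
    static void warm_pool_stats_dump(void) {
        const char *flag = getenv("HAKMEM_WARM_POOL_STATS");
        if (!flag || flag[0] != '1') return;  // printing disabled by default
        for (int c = 0; c < TINY_NUM_CLASSES; c++) {
            fprintf(stderr, "class %d: hits=%llu misses=%llu prefilled=%llu\n", c,
                    (unsigned long long)g_warm_pool_stats[c].hits,
                    (unsigned long long)g_warm_pool_stats[c].misses,
                    (unsigned long long)g_warm_pool_stats[c].prefilled);
        }
    }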

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates (sketched below)
- Validate at larger allocation counts (10M+ pending registry size fix)
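
For the adaptive-budget idea, a hypothetical sketch of what a per-class policy could look like (nothing here is implemented; the hit/miss fields are assumptions):

    #include <stdint.h>

    // Hypothetical adaptive budget: grow the prefill budget for classes
    // whose pool keeps missing, shrink it where hits dominate. Any result
    // would still be capped by TINY_WARM_POOL_MAX_PER_CLASS.
    static int adaptive_prefill_budget(int class_idx) {
        uint64_t hits   = g_warm_pool_stats[class_idx].hits;
        uint64_t misses = g_warm_pool_stats[class_idx].misses;
        uint64_t total  = hits + misses;
        if (total < 1024) return 3;        // too little data: keep default
        double hit_rate = (double)hits / (double)total;
        if (hit_rate < 0.25) return 6;     // starved: prefill more
        if (hit_rate > 0.75) return 2;     // mostly warm: prefill less
        return 3;
    }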

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 23:31:54 +09:00


Batch Tier Checks Performance Measurement Results

Date: 2025-12-04
Optimization: Phase A-2 - Batch Tier Checks (Reduce Atomic Operations)
Benchmark: bench_allocators_hakmem --scenario mixed --iterations 100


Executive Summary

RESULT: REGRESSION DETECTED - the optimization does NOT achieve the expected +5-10% improvement

The Batch Tier Checks optimization, designed to reduce atomic operations in the tiny allocation hot path by batching tier checks, shows a -0.87% performance regression with the default batch size (B=64) and -2.30% regression with aggressive batching (B=256).

Key Findings:

  • Throughput: Baseline (B=1) outperforms both B=64 (-0.87%) and B=256 (-2.30%)
  • Cache Performance: B=64 shows -11% cache misses (good), but +0.85% CPU cycles (bad)
  • Consistency: B=256 has best consistency (CV=3.58%), but worst throughput
  • Verdict: The optimization introduces overhead that exceeds the atomic operation savings

Recommendation: DO NOT PROCEED to Phase A-3. Investigate root cause and consider alternative approaches.


Test Configuration

Test Parameters

Benchmark: bench_allocators_hakmem
Workload:  mixed (16B, 512B, 8KB, 128KB, 1KB allocations)
Iterations: 100 per run
Runs per config: 10
Platform: Linux 6.8.0-87-generic, x86-64
Compiler: gcc with -O3 -flto -march=native

Configurations Tested

Config | Batch Size | Description               | Atomic Op Reduction
Test A | B=1        | Baseline (no batching)    | 0% (every check)
Test B | B=64       | Optimized (conservative)  | 98.4% (1 per 64 checks)
Test C | B=256      | Aggressive (max batching) | 99.6% (1 per 256 checks)

Performance Results

Throughput Comparison

Metric        | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256)
Average ops/s | 1,482,889.9    | 1,469,952.3      | 1,448,726.5
Std Dev ops/s | 76,386.4       | 79,114.8         | 51,886.6
Min ops/s     | 1,343,540.7    | 1,359,677.3      | 1,365,118.3
Max ops/s     | 1,615,938.8    | 1,589,416.6      | 1,543,813.0
CV (%)        | 5.15%          | 5.38%            | 3.58%

Improvement Analysis:

  • B=64 vs B=1: -0.87% (-12,938 ops/s) [REGRESSION]
  • B=256 vs B=1: -2.30% (-34,163 ops/s) [REGRESSION]
  • B=256 vs B=64: -1.44% (-21,226 ops/s)

CPU Cycles & Cache Performance

Metric          | Baseline (B=1) | Optimized (B=64) | Aggressive (B=256) | B=64 vs B=1 | B=256 vs B=1
CPU Cycles      | 2,349,670,806  | 2,369,727,585    | 2,703,167,708      | +0.85%      | +15.04%
Cache Misses    | 9,672,579      | 8,605,566        | 10,100,798         | -11.03%     | +4.43%
L1 Cache Misses | 26,465,121     | 26,297,329       | 28,928,265         | -0.63%      | +9.31%

Analysis:

  • B=64 reduces cache misses by 11% (expected from fewer atomic ops)
  • However, CPU cycles increase by 0.85% (unexpected - should decrease)
  • B=256 shows severe regression: +15% cycles, +4.4% cache misses
  • L1 cache behavior is mostly neutral for B=64, worse for B=256

Variance & Consistency

Config             | CV (%) | Interpretation
Baseline (B=1)     | 5.15%  | Good consistency
Optimized (B=64)   | 5.38%  | Slightly worse
Aggressive (B=256) | 3.58%  | Best consistency

(CV is the coefficient of variation, std dev / mean; e.g. 76,386.4 / 1,482,889.9 ≈ 5.15% for the baseline.)

Detailed Analysis

1. Why Did the Optimization Fail?

Expected Behavior:

  • Reduce atomic operations from 5% of allocations to 0.08% (64x reduction)
  • Save ~0.2-0.4 cycles per allocation
  • Achieve +5-10% throughput improvement

Actual Behavior:

  • Cache misses decreased by 11% (confirms atomic op reduction)
  • CPU cycles increased by 0.85% (unexpected overhead)
  • Net throughput decreased by 0.87%

Root Cause Hypothesis:

  1. Thread-local state overhead: The batch counter and cached tier result add TLS storage and access overhead (see the sketch after this list)

    • g_tier_batch_state[TINY_NUM_CLASSES] is accessed on every cache miss
    • Modulo operation (state->refill_count % batch) may be expensive
    • Branch misprediction on if ((state->refill_count % batch) == 0)
  2. Cache pressure: The batch state array may evict more useful data from cache

    • 8 bytes × 32 classes = 256 bytes of TLS state
    • This competes with actual allocation metadata in L1 cache
  3. False sharing: Multiple threads may access different elements of the same cache line

    • Though TLS mitigates this, the benchmark may have threading effects
  4. Batch size mismatch: B=64 may not align with actual cache miss patterns

    • If cache misses are clustered, batching provides no benefit
    • If cache hits dominate, the batch check is rarely needed
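
For reference, a reconstruction of the batched check implied by the names above (the struct layout matches the 8-bytes-per-class figure; tier_check_atomic is an illustrative stand-in for the real atomic tier load):

    #include <stdint.h>

    typedef struct {
        uint32_t refill_count;  // cache misses seen for this class
        int32_t  cached_tier;   // cached tier result (8 bytes per class)
    } TierBatchState;

    static __thread TierBatchState g_tier_batch_state[TINY_NUM_CLASSES];

    static int tier_check_batched(int class_idx, uint32_t batch) {
        TierBatchState *state = &g_tier_batch_state[class_idx];
        if ((state->refill_count % batch) == 0) {               // suspected mispredicted branch
            state->cached_tier = tier_check_atomic(class_idx);  // the real atomic check
        }
        state->refill_count++;
        return state->cached_tier;  // may be stale for up to `batch` misses
    }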

2. Why Is B=256 Even Worse?

The aggressive batching (B=256) shows severe regression (+15% cycles):

  • Longer staleness period: Tier status can be stale for up to 256 operations
  • More allocations from DRAINING SuperSlabs: This causes additional work
  • Increased memory pressure: More operations before discovering SuperSlab is DRAINING

3. Positive Observations

Despite the regression, some aspects worked:

  1. Cache miss reduction: B=64 achieved -11% cache misses (atomic ops were reduced)
  2. Consistency improvement: B=256 has lowest variance (CV=3.58%)
  3. Code correctness: No crashes or correctness issues observed

Success Criteria Checklist

Criterion                          | Expected | Actual                                   | Status
B=64 shows +5-10% improvement      | +5-10%   | -0.87%                                   | FAIL
Cycles reduced as expected         | -5%      | +0.85%                                   | FAIL
Cache behavior improves or neutral | Neutral  | -11% misses (good), +0.85% cycles (bad)  | MIXED
Variance acceptable (<15%)         | <15%     | 5.38%                                    | PASS
No correctness issues              | None     | None                                     | PASS

Overall: FAIL - Optimization does not achieve expected improvement


Comparison: JSON Workload (Invalid Baseline)

Note: Initial measurements used the wrong workload (JSON = 64KB allocations), which does NOT exercise the tiny allocation path where batch tier checks apply.

Results from JSON workload (for reference only):

  • All configs showed ~1,070,000 ops/s (nearly identical)
  • No improvement because 64KB allocations use L2.5 pool, not Shared Pool
  • This confirms the optimization is specific to tiny allocations (<2KB)

Recommendations

Immediate Actions

  1. DO NOT PROCEED to Phase A-3 (Shared Pool Stage Optimization)

    • Current optimization shows regression, not improvement
    • Need to understand root cause before adding more complexity
  2. INVESTIGATE overhead sources:

    • Profile the modulo operation cost
    • Check TLS access patterns
    • Measure branch misprediction rate
    • Analyze cache line behavior
  3. CONSIDER alternative approaches:

    • Use power-of-2 batch sizes for cheaper modulo (bit masking; see the sketch after this list)
    • Precompute batch size at compile time (remove getenv overhead)
    • Try smaller batch sizes (B=16, B=32) for better locality
    • Use per-thread batch counter instead of per-class counter
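
The bit-masking idea is mechanical; a sketch against the TierBatchState reconstruction shown earlier, assuming `batch` is validated as a power of two at init:

    // Power-of-2 batch: the per-miss modulo becomes a bit mask.
    // mask is computed once (e.g. at pool init), not per miss.
    uint32_t mask = batch - 1u;                 // requires batch to be a power of 2
    if ((state->refill_count & mask) == 0) {    // same schedule as `% batch`, no division
        state->cached_tier = tier_check_atomic(class_idx);
    }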

Future Experiments

If investigating further:

  1. Test different batch sizes: B=16, B=32, B=128
  2. Optimize modulo operation: Use (count & (batch-1)) for power-of-2
  3. Reduce TLS footprint: Single global counter instead of per-class
  4. Profile-guided optimization: Use perf to identify hotspots
  5. Test with different workloads:
    • Pure tiny allocations (16B-2KB only)
    • High cache miss rate workload
    • Multi-threaded workload

Alternative Optimization Strategies

Since batch tier checks failed, consider:

  1. Shared Pool Stage Optimization (Phase A-3): May still be viable independently
  2. Superslab-level caching: Cache entire SuperSlab pointer instead of tier status
  3. Lockless shared pool: Remove atomic operations entirely via per-thread pools
  4. Lazy tier checking: Only check tier on actual allocation failure
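
As an illustration of the last idea, with hypothetical helper names (try_carve, alloc_slow_path):

    // Lazy tier checking: no per-miss counter or cached tier at all.
    // The atomic tier check runs only after a carve attempt fails,
    // which is the only point where a stale tier can actually hurt.
    static void *alloc_lazy_tier(int class_idx) {
        void *p = try_carve(class_idx);     // fast path: no tier bookkeeping
        if (p) return p;
        tier_check_atomic(class_idx);       // resync tier state on failure only
        return alloc_slow_path(class_idx);  // then take the normal slow path
    }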

Raw Data

Baseline (B=1) - 10 Runs

1,615,938.8 ops/s
1,424,832.0 ops/s
1,415,710.5 ops/s
1,531,173.0 ops/s
1,524,721.8 ops/s
1,343,540.7 ops/s
1,520,723.1 ops/s
1,520,476.5 ops/s
1,464,046.2 ops/s
1,467,736.3 ops/s

Optimized (B=64) - 10 Runs

1,394,566.7 ops/s
1,422,447.5 ops/s
1,556,167.0 ops/s
1,447,934.5 ops/s
1,359,677.3 ops/s
1,436,005.2 ops/s
1,568,456.7 ops/s
1,423,222.2 ops/s
1,589,416.6 ops/s
1,501,629.6 ops/s

Aggressive (B=256) - 10 Runs

1,543,813.0 ops/s
1,436,644.9 ops/s
1,479,174.7 ops/s
1,428,092.3 ops/s
1,419,232.7 ops/s
1,422,254.4 ops/s
1,510,832.1 ops/s
1,417,032.7 ops/s
1,465,069.6 ops/s
1,365,118.3 ops/s

Conclusion

The Batch Tier Checks optimization, while theoretically sound, fails to achieve the expected +5-10% throughput improvement in practice. The -0.87% regression suggests that the overhead of maintaining batch state and performing modulo operations exceeds the savings from reduced atomic operations.

Key Takeaway: Not all theoretically beneficial optimizations translate to real-world performance gains. The overhead of bookkeeping (TLS state, modulo, branches) can exceed the savings from reduced atomic operations, especially when those operations are already infrequent (5% of allocations).

Next Steps: Investigate root cause, optimize the implementation, or abandon this approach in favor of alternative optimization strategies.


Report Generated: 2025-12-04
Analysis Tool: Python 3 statistical analysis
Benchmark Framework: bench_allocators_hakmem (hakmem custom benchmarking suite)