Ring Size Analysis: Executive Summary
Problem
Ring=64 shows conflicting results between benchmarks:
- mid_large_mt: +3.3% (36.04M → 37.22M ops/s) ✅
- random_mixed: -5.4% (22.5M → 21.29M ops/s) ❌
Why does the SAME parameter help one benchmark but hurt another?
Root Cause
POOL_TLS_RING_CAP affects ONLY L2 Pool (8-32KB allocations):
| Benchmark | Size Range | Pool Used | Ring Impact |
|---|---|---|---|
| mid_large_mt | 8-32KB | L2 Pool | ✅ Direct benefit |
| random_mixed | 8-128B | Tiny Pool | ❌ Indirect penalty |
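To make the routing in the table concrete, here is a minimal sketch of a size-to-pool dispatch. It assumes the size ranges given in the Solution comments (Tiny ≤1KB, L2 2-32KB, L2.5 64KB-1MB). The names `pool_kind_t` and `pool_for_size` are hypothetical and are not taken from the hakmem sources; sizes that this summary does not assign to a pool are reported as `POOL_OTHER`.

```c
#include <stddef.h>

/* Hypothetical names for illustration; not taken from the hakmem sources. */
typedef enum {
    POOL_TINY,  /* <= 1KB: freelist-based, no ring */
    POOL_L2,    /* 2KB..32KB: TLS ring, sized by the ring capacity */
    POOL_L25,   /* 64KB..1MB: TLS ring */
    POOL_OTHER  /* sizes this summary does not assign to a pool */
} pool_kind_t;

/* Map a request size in bytes to the pool that would serve it,
 * using the size ranges quoted in this summary (an assumption). */
static pool_kind_t pool_for_size(size_t n) {
    const size_t KB = 1024;
    if (n <= 1 * KB) return POOL_TINY;
    if (n >= 2 * KB && n <= 32 * KB) return POOL_L2;
    if (n >= 64 * KB && n <= 1024 * KB) return POOL_L25;
    return POOL_OTHER;
}
```

Under these assumptions every random_mixed request (8-128B) maps to `POOL_TINY`, so that benchmark never touches a ring; it only shares the cache with the L2 Pool's TLS.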
Mechanism:
- Ring=64 grows L2 Pool TLS from 980B (Ring=16) → 3,668B (+274%)
- Tiny Pool has NO ring (uses freelist, ~640B)
- Larger L2 TLS evicts Tiny Pool data from L1 cache
- random_mixed then pays roughly 3× higher access latency whenever Tiny Pool data is served from L2 cache instead of L1
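The two reported footprints can be turned into a per-slot cost. The sketch below assumes that 980B is the Ring=16 baseline and that the L2 Pool TLS grows linearly with ring capacity; neither is stated explicitly here, and the function names are hypothetical.

```c
#include <stddef.h>

/* Figures reported above for the L2 Pool's per-thread TLS footprint.
 * ASSUMPTION: 980B is the Ring=16 baseline. */
enum { RING_SMALL = 16, RING_LARGE = 64 };
enum { TLS_BYTES_SMALL = 980, TLS_BYTES_LARGE = 3668 };

/* Bytes added per extra ring slot, ASSUMING size is linear in capacity. */
static size_t bytes_per_ring_slot(void) {
    return (size_t)(TLS_BYTES_LARGE - TLS_BYTES_SMALL) /
           (size_t)(RING_LARGE - RING_SMALL);
}

/* Capacity-independent remainder under the same linear assumption. */
static size_t fixed_tls_bytes(void) {
    return (size_t)TLS_BYTES_SMALL - (size_t)RING_SMALL * bytes_per_ring_slot();
}
```

With these inputs the model gives 56 bytes per ring slot plus 84 bytes of fixed state, which reproduces both reported sizes exactly. Treat it as a fit to two data points, not as the actual struct layout.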
Solution
Use separate ring sizes per pool:
// L2 Pool (mid-size 2-32KB)
#define POOL_L2_RING_CAP 48 // Balanced performance + cache fit
// L2.5 Pool (large 64KB-1MB)
#define POOL_L25_RING_CAP 16 // Optimal for infrequent large allocs
// Tiny Pool (tiny ≤1KB)
// No ring - uses freelist (unchanged)
Expected Results
| Metric | Ring=16 | Ring=64 | L2=48, L25=16 | vs Ring=64 |
|---|---|---|---|---|
| mid_large_mt | 36.04M | 37.22M | 36.8M | -1.1% |
| random_mixed | 22.5M | 21.29M | 22.5M | +5.7% ✅ |
| Average | 29.27M | 29.26M | 29.65M | +1.3% ✅ |
| TLS/thread | 2.36 KB | 5.05 KB | 3.4 KB | -33% ✅ |
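Net effect: random_mixed is expected to return to its Ring=16 level (+5.7% vs Ring=64), while mid_large_mt keeps most of its gain (+2.1% vs Ring=16, -1.1% vs Ring=64).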
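The percentages in the table follow from the throughput figures. As a check, this small sketch recomputes them; the helper names `pct_change` and `mean2` are illustrative only.

```c
/* Relative change from `from` to `to`, in percent. */
static double pct_change(double from, double to) {
    return (to - from) / from * 100.0;
}

/* Arithmetic mean of two throughput figures (M ops/s). */
static double mean2(double a, double b) {
    return (a + b) / 2.0;
}
```

For example, `pct_change(37.22, 36.8)` is about -1.1 and `pct_change(21.29, 22.5)` is about +5.7, matching the last column. `mean2(36.8, 22.5)` gives the 29.65M average, and `pct_change(36.04, 36.8)` shows the split configuration still sits about 2.1% above Ring=16 on mid_large_mt.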
Implementation
3 simple changes:
- hakmem_pool.c: Replace POOL_TLS_RING_CAP → POOL_L2_RING_CAP (48)
- hakmem_l25_pool.c: Replace POOL_TLS_RING_CAP → POOL_L25_RING_CAP (16)
- Makefile: Add -DPOOL_L2_RING_CAP=48 -DPOOL_L25_RING_CAP=16
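One way the per-pool macros could be wired up is sketched below. The `#ifndef` guards let the Makefile's `-D` flags override the in-source defaults. The ring struct layout and the `l2_ring_push` / `l2_ring_pop` helpers are hypothetical, written only to show each pool sizing its own array; the real hakmem ring structures may differ.

```c
#include <stddef.h>

/* Defaults apply only when the build does not pass -DPOOL_L2_RING_CAP=...
 * or -DPOOL_L25_RING_CAP=... on the command line. */
#ifndef POOL_L2_RING_CAP
#define POOL_L2_RING_CAP 48
#endif
#ifndef POOL_L25_RING_CAP
#define POOL_L25_RING_CAP 16
#endif

_Static_assert(POOL_L2_RING_CAP > 0 && POOL_L25_RING_CAP > 0,
               "ring capacities must be positive");

/* Hypothetical per-thread rings: each pool sizes its own slot array. */
typedef struct { void *slot[POOL_L2_RING_CAP];  int top; } l2_ring_t;
typedef struct { void *slot[POOL_L25_RING_CAP]; int top; } l25_ring_t;

/* Cache a freed block; returns 0 when the ring is full so the caller
 * can fall back to a slower shared path. */
static int l2_ring_push(l2_ring_t *r, void *p) {
    if (r->top >= POOL_L2_RING_CAP) return 0;
    r->slot[r->top++] = p;
    return 1;
}

/* Reuse the most recently cached block, or NULL when the ring is empty. */
static void *l2_ring_pop(l2_ring_t *r) {
    return r->top > 0 ? r->slot[--r->top] : NULL;
}
```

Keeping the capacities as compile-time constants means the arrays stay inline in TLS with no extra indirection, at the cost of a rebuild for each tuning experiment.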
Time: ~30 minutes coding + 2 hours testing
Key Insights
- Pool isolation: Different benchmarks use completely different pools
- TLS pollution: Unused pool TLS evicts active pool data from cache
- Cache is king: L1 cache pressure explains >5% performance swings
- Separate tuning: Per-pool optimization is essential for mixed workloads
Files
- RING_SIZE_DEEP_ANALYSIS.md - Full technical analysis (10 sections)
- RING_SIZE_SOLUTION.md - Step-by-step implementation guide
- RING_SIZE_SUMMARY.md - This executive summary