Problem: The warm pool had a 0% hit rate (1 hit per 3976 misses) despite being implemented, so every cache miss went through an expensive superslab_refill registry scan.

Root Cause Analysis:
- The warm pool was initialized once and received a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- The next refill pushed another single slab, which was immediately exhausted
- The pool oscillated between 0 and 1 items, yielding a 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refills and prefill the pool with 3 additional HOT SuperSlabs before attempting to carve. This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified the unified_cache_refill() cold path to detect an empty pool
- Added a prefill loop: when pool count == 0, load 3 extra SuperSlabs
- Store the extra slabs in the warm pool; keep 1 in TLS for immediate carving
- Track prefill events in the g_warm_pool_stats[].prefilled counter

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After: C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided on 55.6% of cache misses (significant savings)
- Warm pool now functions as intended, with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via an env var later if needed)
- Statistics are always compiled in; printing is ENV-gated via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider an adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+, pending registry size fix)
Warm Pool Architecture - Visual Summary & Decision Framework
2025-12-04
🎯 The Core Problem
Current Random Mixed Performance: 1.06M ops/s
What's happening on EVERY CACHE MISS (~5% of allocations):
malloc_tiny_fast(size)
↓
tiny_cold_refill_and_alloc() ← Called ~53,000 times per 1M allocs
↓
unified_cache_refill()
↓
Linear registry scan (O(N)) ← BOTTLENECK!
├─ Search per-class registry
├─ Check tier of each SuperSlab
├─ Find first HOT one
├─ Cost: 50-100 cycles per miss
└─ Impact: ~5% of ops doing expensive work
↓
Carve ~64 blocks (fast)
↓
Return first block
Total cache miss cost: ~500-1000 cycles per miss
Amortized: ~5-10 cycles per object
Multiplied over 5% misses: SIGNIFICANT OVERHEAD
💡 The Warm Pool Solution
BEFORE (Current):
Cache miss → Registry scan (O(N)) → Find HOT → Carve → Return
AFTER (Warm Pool):
Cache miss → Warm pool pop (O(1)) → Already HOT → Carve → Return
↑
Pre-allocated SuperSlabs
stored per-thread
(TLS)
The Warm Pool Concept
Per-thread data structure:
g_tiny_warm_pool[TINY_NUM_CLASSES]: // For each size class
.slabs[]: // Array of pre-allocated SuperSlabs
.count: // How many are in pool
.capacity: // Max capacity (typically 4)
For a 64-byte allocation (class 2):
If warm_pool[2].count > 0: ← FAST PATH
Pop ss = warm_pool[2].slabs[--count]
Carve blocks
Return
Cost: ~50 cycles per batch (5 per object)
Else: ← FALLBACK
Registry scan (old path)
Cost: ~500 cycles per batch
(But RARE because pool is usually full)
📊 Performance Model
Current (Registry Scan Every Miss)
Scenario: 1M allocations, 5% cache miss rate = 50,000 misses
Hot path (95%): 950,000 allocs × 25 cycles = 23.75M cycles
Warm path (5%): 50,000 batches × 1000 cycles = 50M cycles
Other overhead: ~15M cycles
─────────────────────────────────────────────────
Total: ~88.75M cycles
~1.06M ops/s
Proposed (Warm Pool, 90% Hit)
Scenario: 1M allocations, 5% cache miss rate
Hot path (95%): 950,000 allocs × 25 cycles = 23.75M cycles
Warm path (5%):
├─ 90% warm pool hits: 45,000 batches × 100 cycles = 4.5M cycles
├─ 10% registry fallbacks: 5,000 batches × 1000 cycles = 5M cycles
└─ Sub-total: 9.5M cycles (vs 50M before)
Other overhead: ~15M cycles
─────────────────────────────────────────────────
Total: ~48M cycles
~1.46M ops/s (+38%)
With Additional Optimizations (Lock-free, Batched Tier Checks)
Hot path (95%): 950,000 allocs × 25 cycles = 23.75M cycles
Warm path (5%):
├─ 95% warm pool hits: 47,500 batches × 75 cycles = 3.56M cycles
├─ 5% registry fallbacks: 2,500 batches × 800 cycles = 2M cycles
└─ Sub-total: 5.56M cycles
Other overhead: ~10M cycles
─────────────────────────────────────────────────
Total: ~39M cycles
~1.79M ops/s (+69%)
Further optimizations (per-thread pools, batch pre-alloc):
Potential ceiling: ~2.5-3.0M ops/s (+135-180%)
🔄 Warm Pool Data Flow
Thread Startup
Thread calls malloc() for first time:
↓
Check if warm_pool[class].capacity == 0:
├─ YES → Initialize warm pools
│ ├─ Set capacity = 4 per class
│ ├─ Allocate array space (TLS, ~128KB total)
│ ├─ Try to pre-populate from LRU cache
│ │ ├─ Success: Get 2-3 SuperSlabs per class from LRU
│ │ └─ Fail: Leave empty (will populate on cold alloc)
│ └─ Ready!
│
└─ NO → Already initialized, continue
First allocation:
├─ HOT: Unified cache hit → Return (99% of time)
│
└─ WARM (on cache miss):
├─ warm_pool_pop(class) returns SuperSlab
├─ If NULL (pool empty, rare):
│ └─ Fall back to registry scan
└─ Carve & return
Steady State Execution
For each allocation:
malloc(size)
├─ size → class_idx
│
├─ HOT: Unified cache hit (head != tail)?
│ └─ YES (95%): Return immediately
│
└─ WARM: Unified cache miss (head == tail)?
├─ Call unified_cache_refill(class_idx)
│ ├─ SuperSlab ss = tiny_warm_pool_pop(class_idx)
│ ├─ If ss != NULL (90% of misses):
│ │ ├─ Carve ~64 blocks from ss
│ │ ├─ Refill Unified Cache array
│ │ └─ Return first block
│ │
│ └─ Else (10% of misses):
│ ├─ Fall back to registry scan (COLD path)
│ ├─ Find HOT SuperSlab in per-class registry
│ ├─ Allocate new if not found (mmap)
│ ├─ Carve blocks + refill warm pool
│ └─ Return first block
│
└─ Return USER pointer
Free Path Integration
free(ptr)
├─ tiny_hot_free_fast()
│ ├─ Push to TLS SLL (99% of time)
│ └─ Return
│
└─ (On SLL full, triggered once per ~256 frees)
├─ Batch drain SLL to SuperSlab freelist
├─ When SuperSlab becomes empty:
│ ├─ Remove from refill registry
│ ├─ Push to LRU cache (NOT warm pool)
│ │ (LRU will eventually evict or reuse)
│ └─ When LRU reuses: add to warm pool
│
└─ Return
Warm Pool Replenishment (Background)
When warm_pool[class].count drops below threshold (1):
├─ Called from cold allocation path (rare)
│
├─ For next 2-3 SuperSlabs in registry:
│ ├─ Check if tier is still HOT
│ ├─ Add to warm pool (up to capacity)
│ └─ Continue registry scan
│
└─ Restore warm pool for next miss
No explicit background thread needed!
Warm pool is refilled as side effect of cold allocs.
⚡ Implementation Complexity vs Gain
Low Complexity (Recommended)
Effort: 200-300 lines of code
Time: 2-3 developer-days
Risk: Low
Changes:
1. Create tiny_warm_pool.h header (~50 lines)
2. Declare __thread warm pools (~10 lines)
3. Modify unified_cache_refill() (~100 lines)
- Try warm_pool_pop() first
- On success: carve & return
- On fail: registry scan (existing code path)
4. Add initialization in malloc (~20 lines)
5. Add cleanup on thread exit (~10 lines)
Expected gain: +40-50% (1.06M → 1.5M ops/s)
Risk: Very low (warm pool is additive, fallback to registry always works)
Medium Complexity (Phase 2)
Effort: 500-700 lines of code
Time: 5-7 developer-days
Risk: Medium
Changes:
1. Lock-free warm pool using CAS
2. Batched tier transition checks
3. Per-thread allocation pool
4. Background warm pool refill thread
Expected gain: +70-100% (1.06M → 1.8-2.1M ops/s)
Risk: Medium (requires careful synchronization)
High Complexity (Phase 3)
Effort: 1000+ lines
Time: 2-3 weeks
Risk: High
Changes:
1. Comprehensive redesign with three separate pools per thread
2. Lock-free fast path for entire allocation
3. Per-size-class threads for refill
4. Complex tier management
Expected gain: +150-200% (1.06M → 2.5-3.2M ops/s)
Risk: High (major architectural changes, potential correctness issues)
🎓 Why 10x is Hard (But 2x is Feasible)
The 80x Gap: Random Mixed vs Tiny Hot
Tiny Hot: 89M ops/s
├─ Single fixed size (16 bytes)
├─ L1 cache perfect hit
├─ No pool lookup
├─ No routing
├─ No page faults
└─ Ideal case
Random Mixed: 1.06M ops/s
├─ 256 different sizes
├─ L1 cache misses
├─ Pool routing needed
├─ Registry lookup on miss
├─ ~7,600 page faults
└─ Real-world case
Difference: 83x
Can we close this gap?
- Warm pool optimization: +40-50% (to 1.5-1.6M)
- Lock-free pools: +20-30% (to 1.8-2.0M)
- Per-thread pools: +10-15% (to 2.0-2.3M)
- Other tuning: +5-10% (to 2.1-2.5M)
──────────────────────────────────
Total realistic: 2.0-2.5x (still 35-40x below Tiny Hot)
Why not 10x?
1. Fundamental overhead: 256 size classes (not 1)
2. Working set: Page faults (~7,600) are unavoidable
3. Routing: Pool lookup adds cycles (can't eliminate)
4. Tier management: Utilization tracking costs (necessary for correctness)
5. Memory: 2MB SuperSlab fragmentation (not tunable)
The 10x gap is ARCHITECTURAL, not a bug.
📋 Implementation Phases
✅ Phase 1: Basic Warm Pool (THIS PROPOSAL)
- Goal: +40-50% improvement (1.06M → 1.5M ops/s)
- Scope: Warm pool data structure + unified_cache_refill() integration
- Risk: Low
- Timeline: 2-3 days
- Recommended: YES (high ROI)
⏳ Phase 2: Advanced Optimizations (Optional)
- Goal: +20-30% additional (1.5M → 1.8-2.0M ops/s)
- Scope: Lock-free pools, batched tier checks, per-thread refill
- Risk: Medium
- Timeline: 1-2 weeks
- Recommended: Maybe (depends on user requirements)
❌ Phase 3: Architectural Redesign (Not Recommended)
- Goal: +100%+ improvement (2.0M+ ops/s)
- Scope: Major rewrite of allocation path
- Risk: High
- Timeline: 3-4 weeks
- Recommended: No (diminishing returns, high risk)
🔐 Safety & Correctness
Thread Safety
Warm pool is thread-local (__thread):
✓ No locks needed
✓ No atomic operations
✓ No synchronization required
✓ Safe for all threads
Fallback path:
✓ Registry scan (existing code, proven)
✓ Always works if warm pool empty
✓ Correctness guaranteed
Memory Safety
SuperSlab ownership:
✓ Warm pool only holds SuperSlabs we own
✓ Tier/Guard checks catch invalid cases
✓ On tier change (HOT→DRAINING): removed from pool
✓ Validation on periodic tier checks (batched)
Object layout:
✓ No change to object headers
✓ No change to allocation metadata
✓ Warm pool is transparent to user code
Tier Transitions
If SuperSlab changes tier (HOT → DRAINING):
├─ Current: Caught on next registry scan
├─ Proposed: Caught on next batch tier check
├─ Rare case (only if working set shrinks)
└─ Fallback: Registry scan still works
Validation strategy:
├─ Periodic (batched) tier validation
├─ On cold path (always validates)
├─ Accept small window of stale data
└─ Correctness preserved
📊 Success Metrics
Warm Pool Metrics to Track
While running Random Mixed benchmark:
Per-thread warm pool statistics:
├─ Pool capacity: 4 per class (128 total for 32 classes)
├─ Pool hit rate: 85-95% (target: > 90%)
├─ Pool miss rate: 5-15% (fallback to registry)
└─ Pool push rate: On cold alloc (should be rare)
Cache refill metrics:
├─ Warm pool refills: ~45,000 (90% of misses)
├─ Registry fallbacks: ~5,000 (10% of misses)
└─ Cold allocations: 10-100 (very rare)
Performance metrics:
├─ Total ops/s: 1.5M+ (target: +40% from 1.06M)
├─ Ops per cycle: 0.05+ (from 0.015 baseline)
└─ Cache miss overhead: Reduced by 80%+
Regression Tests
Ensure no degradation:
✓ Tiny Hot: 89M ops/s (unchanged)
✓ Tiny Cold: No regression expected
✓ Tiny Middle: No regression expected
✓ Memory correctness: All tests pass
✓ Multithreaded: No race conditions
✓ Thread safety: Concurrent access safe
🚀 Recommended Next Steps
Step 1: Agree on Scope
- Accept Phase 1 (warm pool) proposal
- Defer Phase 2 (advanced optimizations) to later
- Do not attempt Phase 3 (architectural rewrite)
Step 2: Create Warm Pool Implementation
- Create core/front/tiny_warm_pool.h
- Implement data structures and operations
- Write inline functions for hot operations
Step 3: Integrate with Unified Cache
- Modify unified_cache_refill() to use the warm pool
- Add initialization logic
- Test correctness
Step 4: Benchmark & Validate
- Run Random Mixed benchmark
- Measure ops/s improvement (target: 1.5M+)
- Profile warm pool hit rate (target: > 90%)
- Verify no regression in other workloads
Step 5: Iterate & Refine
- If hit rate < 80%: Increase warm pool size
- If hit rate > 95%: Reduce warm pool size (save memory)
- If performance < 1.4M ops/s: Review bottlenecks
🎯 Conclusion
Warm pool implementation offers:
- High ROI (40-50% improvement with 200-300 lines of code)
- Low risk (fallback to proven registry scan path)
- Incremental approach (doesn't require full redesign)
- Clear success criteria (ops/s improvement, hit rate tracking)
Expected outcome:
- Random Mixed: 1.06M → 1.5M+ ops/s (+40%)
- Tiny Hot: 89M ops/s (unchanged)
- Total system: Better performance for real-world workloads
Path to further improvements:
- Phase 2 (advanced): +20-30% more (1.8-2.0M ops/s)
- Phase 3 (redesign): Not recommended (high effort, limited gain)
Recommendation: Implement Phase 1 warm pool. Re-evaluate after measuring actual performance.
Document Status: Ready for implementation
Review & Approval: Required before starting code changes