hakmem/WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
Moe Charm (CI) 5685c2f4c9 Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, the cold path now performs multiple superslab_refills and prefills
the pool with 3 additional HOT SuperSlabs before attempting to carve. This
builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra SuperSlabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
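The secondary-prefill idea can be sketched in C. This is a minimal illustration, not the actual hakmem code: the struct layout, `NUM_CLASSES` value, and the `superslab_refill` stub are assumptions; only `TINY_WARM_POOL_MAX_PER_CLASS`, the budget of 3, and the `prefilled` counter come from the commit description.

```c
#include <assert.h>
#include <stddef.h>

#define TINY_WARM_POOL_MAX_PER_CLASS 16   /* raised from 4 to 16 per the commit */
#define WARM_POOL_PREFILL_BUDGET 3        /* hardcoded prefill budget */
#define NUM_CLASSES 8                     /* illustrative, not the real count */

typedef struct SuperSlab { int id; } SuperSlab;   /* stand-in for the real type */

typedef struct {
    SuperSlab *slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;
    long prefilled;                       /* mirrors g_warm_pool_stats[].prefilled */
} TinyWarmPool;

static __thread TinyWarmPool g_warm_pool[NUM_CLASSES];

/* Stub for the expensive refill; the real one scans the per-class
 * registry for a HOT SuperSlab (or mmaps a new one). */
static SuperSlab *superslab_refill(int class_idx) {
    static SuperSlab backing[64];
    static int next;
    (void)class_idx;
    return &backing[next++ % 64];
}

/* Cold path: when the pool is empty, pull PREFILL_BUDGET extra
 * SuperSlabs into the pool, then one more that stays in TLS for
 * immediate carving. */
static SuperSlab *cold_refill_with_prefill(int class_idx) {
    TinyWarmPool *pool = &g_warm_pool[class_idx];
    if (pool->count == 0) {
        for (int i = 0; i < WARM_POOL_PREFILL_BUDGET; i++) {
            if (pool->count >= TINY_WARM_POOL_MAX_PER_CLASS) break;
            pool->slabs[pool->count++] = superslab_refill(class_idx);
        }
        pool->prefilled++;                /* count the prefill event */
    }
    return superslab_refill(class_idx);   /* kept out of the pool for carving */
}
```

Subsequent misses pop from the now-populated pool instead of rescanning, which is what lifts the hit rate off zero.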

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 23:31:54 +09:00


# Warm Pool Architecture - Visual Summary & Decision Framework
## 2025-12-04
---
## 🎯 The Core Problem
```
Current Random Mixed Performance: 1.06M ops/s
What's happening on EVERY CACHE MISS (~5% of allocations):
malloc_tiny_fast(size)
  └─ tiny_cold_refill_and_alloc()          ← Called ~53,000 times per 1M allocs
       └─ unified_cache_refill()
            ├─ Linear registry scan (O(N)) ← BOTTLENECK!
            │    ├─ Search per-class registry
            │    ├─ Check tier of each SuperSlab
            │    ├─ Find first HOT one
            │    ├─ Cost: 50-100 cycles per miss
            │    └─ Impact: ~5% of ops doing expensive work
            ├─ Carve ~64 blocks (fast)
            └─ Return first block
Total cache miss cost: ~500-1000 cycles per miss
Amortized: ~5-10 cycles per object
Multiplied over 5% misses: SIGNIFICANT OVERHEAD
```
---
## 💡 The Warm Pool Solution
```
BEFORE (Current):
Cache miss → Registry scan (O(N)) → Find HOT → Carve → Return
AFTER (Warm Pool):
Cache miss → Warm pool pop (O(1)) → Already HOT → Carve → Return
              (warm pool = pre-allocated SuperSlabs stored per-thread in TLS)
```
### The Warm Pool Concept
```
Per-thread data structure:
g_tiny_warm_pool[TINY_NUM_CLASSES]:   // One pool per size class
  .slabs[]     // Array of pre-allocated SuperSlabs
  .count       // How many are currently in the pool
  .capacity    // Max capacity (typically 4)

For a 64-byte allocation (class 2):
  If warm_pool[2].count > 0:          ← FAST PATH
    ss = warm_pool[2].slabs[--count]
    Carve blocks
    Return
    Cost: ~50 cycles per batch (5 per object)
  Else:                               ← FALLBACK
    Registry scan (old path)
    Cost: ~500 cycles per batch
    (But RARE because the pool is usually non-empty)
```
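The structure above can be sketched in C. Field names follow the pseudocode; the inline pop/push helpers are assumed names, and the real hakmem API may differ:

```c
#include <assert.h>
#include <stddef.h>

#define TINY_NUM_CLASSES 32
#define WARM_POOL_CAPACITY 4

typedef struct SuperSlab { int id; } SuperSlab;   /* stand-in for the real type */

typedef struct {
    SuperSlab *slabs[WARM_POOL_CAPACITY];
    int count;
} TinyWarmPool;

/* Thread-local: no locks or atomics needed on the fast path. */
static __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];

/* O(1) pop: the fast path taken on a unified-cache miss. */
static inline SuperSlab *tiny_warm_pool_pop(int class_idx) {
    TinyWarmPool *p = &g_tiny_warm_pool[class_idx];
    return p->count > 0 ? p->slabs[--p->count] : NULL;  /* NULL → registry scan */
}

/* O(1) push: called when a refill finds more HOT slabs than it needs. */
static inline int tiny_warm_pool_push(int class_idx, SuperSlab *ss) {
    TinyWarmPool *p = &g_tiny_warm_pool[class_idx];
    if (p->count >= WARM_POOL_CAPACITY) return 0;       /* pool full, rejected */
    p->slabs[p->count++] = ss;
    return 1;
}
```

The pool is LIFO, so the most recently touched SuperSlab (warmest in cache) is reused first.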
---
## 📊 Performance Model
### Current (Registry Scan Every Miss)
```
Scenario: 1M allocations, 5% cache miss rate = 50,000 misses
Hot path (95%): 950,000 allocs × 25 cycles = 23.75M cycles
Warm path (5%): 50,000 batches × ~630 cycles = ~31.6M cycles
                (midpoint of the ~500-1000 cycles per miss above)
Other overhead: ~15M cycles
─────────────────────────────────────────────────
Total: ~70.4M cycles
~1.06M ops/s (measured)
```
### Proposed (Warm Pool, 90% Hit)
```
Scenario: 1M allocations, 5% cache miss rate
Hot path (95%): 950,000 allocs × 25 cycles = 23.75M cycles
Warm path (5%):
├─ 90% warm pool hits: 45,000 batches × 100 cycles = 4.5M cycles
├─ 10% registry fallbacks: 5,000 batches × 1000 cycles = 5M cycles
└─ Sub-total: 9.5M cycles (vs ~31.6M before)
Other overhead: ~15M cycles
─────────────────────────────────────────────────
Total: ~48.25M cycles
~1.55M ops/s (+46%)
```
### With Additional Optimizations (Lock-free, Batched Tier Checks)
```
Hot path (95%): 950,000 allocs × 25 cycles = 23.75M cycles
Warm path (5%):
├─ 95% warm pool hits: 47,500 batches × 75 cycles = 3.56M cycles
├─ 5% registry fallbacks: 2,500 batches × 800 cycles = 2M cycles
└─ Sub-total: 5.56M cycles
Other overhead: ~10M cycles
─────────────────────────────────────────────────
Total: ~39.3M cycles
~1.90M ops/s (+79%)
Further optimizations (per-thread pools, batch pre-alloc):
Potential ceiling: ~2.5-3.0M ops/s (+135-180%)
```
---
## 🔄 Warm Pool Data Flow
### Thread Startup
```
Thread calls malloc() for first time:
Check if warm_pool[class].capacity == 0:
├─ YES → Initialize warm pools
│ ├─ Set capacity = 4 per class
│ ├─ Allocate array space (TLS, ~128KB total)
│ ├─ Try to pre-populate from LRU cache
│ │ ├─ Success: Get 2-3 SuperSlabs per class from LRU
│ │ └─ Fail: Leave empty (will populate on cold alloc)
│ └─ Ready!
└─ NO → Already initialized, continue
First allocation:
├─ HOT: Unified cache hit → Return (99% of time)
└─ WARM (on cache miss):
├─ warm_pool_pop(class) returns SuperSlab
├─ If NULL (pool empty, rare):
│ └─ Fall back to registry scan
└─ Carve & return
```
### Steady State Execution
```
For each allocation:
malloc(size)
├─ size → class_idx
├─ HOT: Unified cache hit (head != tail)?
│ └─ YES (95%): Return immediately
└─ WARM: Unified cache miss (head == tail)?
├─ Call unified_cache_refill(class_idx)
│ ├─ SuperSlab ss = tiny_warm_pool_pop(class_idx)
│ ├─ If ss != NULL (90% of misses):
│ │ ├─ Carve ~64 blocks from ss
│ │ ├─ Refill Unified Cache array
│ │ └─ Return first block
│ │
│ └─ Else (10% of misses):
│ ├─ Fall back to registry scan (COLD path)
│ ├─ Find HOT SuperSlab in per-class registry
│ ├─ Allocate new if not found (mmap)
│ ├─ Carve blocks + refill warm pool
│ └─ Return first block
└─ Return USER pointer
```
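The refill decision above can be sketched as follows, with the registry scan and block carving stubbed out. The names and signatures here are assumptions for illustration, not hakmem's actual API:

```c
#include <assert.h>
#include <stddef.h>

typedef struct SuperSlab { int id; } SuperSlab;

#define WARM_POOL_CAPACITY 4
typedef struct { SuperSlab *slabs[WARM_POOL_CAPACITY]; int count; } TinyWarmPool;
static __thread TinyWarmPool g_tiny_warm_pool[32];

static SuperSlab *tiny_warm_pool_pop(int class_idx) {
    TinyWarmPool *p = &g_tiny_warm_pool[class_idx];
    return p->count > 0 ? p->slabs[--p->count] : NULL;
}

/* Stub for the existing cold path (registry scan, mmap if none found). */
static SuperSlab *registry_scan_refill(int class_idx) {
    static SuperSlab fallback;
    (void)class_idx;
    return &fallback;
}

/* Stub carver: hands out fake 64-byte blocks from a static arena. */
static int carve_blocks(SuperSlab *ss, void **out, int want) {
    static char arena[64 * 64];
    (void)ss;
    for (int i = 0; i < want; i++)
        out[i] = &arena[(i % 64) * 64];
    return want;
}

/* Warm path: O(1) pool pop first; O(N) registry scan only if the pool
 * is empty. Carved blocks refill the unified cache, one is returned. */
static void *unified_cache_refill(int class_idx, void **cache, int *ncached) {
    SuperSlab *ss = tiny_warm_pool_pop(class_idx);   /* hit on ~90% of misses */
    if (ss == NULL)
        ss = registry_scan_refill(class_idx);        /* COLD fallback */
    int n = carve_blocks(ss, cache, 64);             /* carve ~64 blocks */
    *ncached = n - 1;                                /* rest stays cached */
    return cache[n - 1];                             /* first block to caller */
}
```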
### Free Path Integration
```
free(ptr)
├─ tiny_hot_free_fast()
│ ├─ Push to TLS SLL (99% of time)
│ └─ Return
└─ (On SLL full, triggered once per ~256 frees)
├─ Batch drain SLL to SuperSlab freelist
├─ When SuperSlab becomes empty:
│ ├─ Remove from refill registry
│ ├─ Push to LRU cache (NOT warm pool)
│ │ (LRU will eventually evict or reuse)
│ └─ When LRU reuses: add to warm pool
└─ Return
```
### Warm Pool Replenishment (Background)
```
When warm_pool[class].count drops below threshold (1):
├─ Called from cold allocation path (rare)
├─ For next 2-3 SuperSlabs in registry:
│ ├─ Check if tier is still HOT
│ ├─ Add to warm pool (up to capacity)
│ └─ Continue registry scan
└─ Restore warm pool for next miss
No explicit background thread needed!
Warm pool is refilled as side effect of cold allocs.
```
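The replenishment side effect can be sketched like this, assuming a simple array-backed per-class registry and a tier enum (both illustrative; the real registry structure is not shown in this document):

```c
#include <assert.h>
#include <stddef.h>

enum Tier { TIER_HOT, TIER_DRAINING, TIER_COLD };
typedef struct SuperSlab { enum Tier tier; } SuperSlab;

#define WARM_POOL_CAPACITY 4
typedef struct { SuperSlab *slabs[WARM_POOL_CAPACITY]; int count; } TinyWarmPool;
static __thread TinyWarmPool g_tiny_warm_pool[32];

/* Called from the cold allocation path: while the registry is already
 * being walked, top the warm pool up with the next few still-HOT
 * SuperSlabs. No background thread needed. */
static void tiny_warm_pool_replenish(int class_idx,
                                     SuperSlab **registry, int nreg,
                                     int budget) {
    TinyWarmPool *p = &g_tiny_warm_pool[class_idx];
    for (int i = 0; i < nreg && budget > 0; i++) {
        if (p->count >= WARM_POOL_CAPACITY) break;    /* pool restored */
        if (registry[i]->tier != TIER_HOT) continue;  /* skip DRAINING/COLD */
        p->slabs[p->count++] = registry[i];
        budget--;
    }
}
```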
---
## ⚡ Implementation Complexity vs Gain
### Low Complexity (Recommended)
```
Effort: 200-300 lines of code
Time: 2-3 developer-days
Risk: Low
Changes:
1. Create tiny_warm_pool.h header (~50 lines)
2. Declare __thread warm pools (~10 lines)
3. Modify unified_cache_refill() (~100 lines)
- Try warm_pool_pop() first
- On success: carve & return
- On fail: registry scan (existing code path)
4. Add initialization in malloc (~20 lines)
5. Add cleanup on thread exit (~10 lines)
Expected gain: +40-50% (1.06M → 1.5M ops/s)
Risk: Very low (warm pool is additive, fallback to registry always works)
```
### Medium Complexity (Phase 2)
```
Effort: 500-700 lines of code
Time: 5-7 developer-days
Risk: Medium
Changes:
1. Lock-free warm pool using CAS
2. Batched tier transition checks
3. Per-thread allocation pool
4. Background warm pool refill thread
Expected gain: +70-100% (1.06M → 1.8-2.1M ops/s)
Risk: Medium (requires careful synchronization)
```
### High Complexity (Phase 3)
```
Effort: 1000+ lines
Time: 2-3 weeks
Risk: High
Changes:
1. Comprehensive redesign with three separate pools per thread
2. Lock-free fast path for entire allocation
3. Per-size-class threads for refill
4. Complex tier management
Expected gain: +150-200% (1.06M → 2.5-3.2M ops/s)
Risk: High (major architectural changes, potential correctness issues)
```
---
## 🎓 Why 10x is Hard (But 2x is Feasible)
### The 80x Gap: Random Mixed vs Tiny Hot
```
Tiny Hot: 89M ops/s
├─ Single fixed size (16 bytes)
├─ L1 cache perfect hit
├─ No pool lookup
├─ No routing
├─ No page faults
└─ Ideal case
Random Mixed: 1.06M ops/s
├─ 256 different sizes
├─ L1 cache misses
├─ Pool routing needed
├─ Registry lookup on miss
├─ ~7,600 page faults
└─ Real-world case
Difference: ~84x
Can we close this gap?
- Warm pool optimization: +40-50% (to 1.5-1.6M)
- Lock-free pools: +20-30% (to 1.8-2.0M)
- Per-thread pools: +10-15% (to 2.0-2.3M)
- Other tuning: +5-10% (to 2.1-2.5M)
──────────────────────────────────
Total realistic: 2.0-2.5x (still ~35-45x below Tiny Hot)
Why not 10x?
1. Fundamental overhead: 256 size classes (not 1)
2. Working set: Page faults (~7,600) are unavoidable
3. Routing: Pool lookup adds cycles (can't eliminate)
4. Tier management: Utilization tracking costs (necessary for correctness)
5. Memory: 2MB SuperSlab fragmentation (not tunable)
The gap is ARCHITECTURAL, not a bug; closing it 10x would require a redesign.
```
---
## 📋 Implementation Phases
### ✅ Phase 1: Basic Warm Pool (THIS PROPOSAL)
- **Goal:** +40-50% improvement (1.06M → 1.5M ops/s)
- **Scope:** Warm pool data structure + unified_cache_refill() integration
- **Risk:** Low
- **Timeline:** 2-3 days
- **Recommended:** YES (high ROI)
### ⏳ Phase 2: Advanced Optimizations (Optional)
- **Goal:** +20-30% additional (1.5M → 1.8-2.0M ops/s)
- **Scope:** Lock-free pools, batched tier checks, per-thread refill
- **Risk:** Medium
- **Timeline:** 1-2 weeks
- **Recommended:** Maybe (depends on user requirements)
### ❌ Phase 3: Architectural Redesign (Not Recommended)
- **Goal:** +100%+ improvement (2.0M+ ops/s)
- **Scope:** Major rewrite of allocation path
- **Risk:** High
- **Timeline:** 3-4 weeks
- **Recommended:** No (diminishing returns, high risk)
---
## 🔐 Safety & Correctness
### Thread Safety
```
Warm pool is thread-local (__thread):
✓ No locks needed
✓ No atomic operations
✓ No synchronization required
✓ Safe for all threads
Fallback path:
✓ Registry scan (existing code, proven)
✓ Always works if warm pool empty
✓ Correctness guaranteed
```
### Memory Safety
```
SuperSlab ownership:
✓ Warm pool only holds SuperSlabs we own
✓ Tier/Guard checks catch invalid cases
✓ On tier change (HOT→DRAINING): removed from pool
✓ Validation on periodic tier checks (batched)
Object layout:
✓ No change to object headers
✓ No change to allocation metadata
✓ Warm pool is transparent to user code
```
### Tier Transitions
```
If SuperSlab changes tier (HOT → DRAINING):
├─ Current: Caught on next registry scan
├─ Proposed: Caught on next batch tier check
├─ Rare case (only if working set shrinks)
└─ Fallback: Registry scan still works
Validation strategy:
├─ Periodic (batched) tier validation
├─ On cold path (always validates)
├─ Accept small window of stale data
└─ Correctness preserved
```
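The validation strategy above can be sketched as a tier-checking pop that silently drops stale entries. The helper names are assumptions; the real batched validation may be structured differently:

```c
#include <assert.h>
#include <stddef.h>

enum Tier { TIER_HOT, TIER_DRAINING };
typedef struct SuperSlab { enum Tier tier; } SuperSlab;

#define WARM_POOL_CAPACITY 4
typedef struct { SuperSlab *slabs[WARM_POOL_CAPACITY]; int count; } TinyWarmPool;
static __thread TinyWarmPool g_tiny_warm_pool[32];

static int tiny_warm_pool_push(int class_idx, SuperSlab *ss) {
    TinyWarmPool *p = &g_tiny_warm_pool[class_idx];
    if (p->count >= WARM_POOL_CAPACITY) return 0;
    p->slabs[p->count++] = ss;
    return 1;
}

/* Pop with tier validation: a slab that went HOT → DRAINING while
 * parked in the pool is discarded and the next candidate is tried.
 * NULL means "fall back to the registry scan". */
static SuperSlab *tiny_warm_pool_pop_validated(int class_idx) {
    TinyWarmPool *p = &g_tiny_warm_pool[class_idx];
    while (p->count > 0) {
        SuperSlab *ss = p->slabs[--p->count];
        if (ss->tier == TIER_HOT)
            return ss;                 /* still HOT: safe to carve */
        /* stale (DRAINING) entry dropped; try the next one */
    }
    return NULL;
}
```

Because the fallback always re-validates, the small window of stale pool data costs only a wasted pop, never a correctness violation.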
---
## 📊 Success Metrics
### Warm Pool Metrics to Track
```
While running Random Mixed benchmark:
Per-thread warm pool statistics:
├─ Pool capacity: 4 per class (128 total for 32 classes)
├─ Pool hit rate: 85-95% (target: > 90%)
├─ Pool miss rate: 5-15% (fallback to registry)
└─ Pool push rate: On cold alloc (should be rare)
Cache refill metrics:
├─ Warm pool refills: ~45,000 (90% of misses)
├─ Registry fallbacks: ~5,000 (10% of misses)
└─ Cold allocations: 10-100 (very rare)
Performance metrics:
├─ Total ops/s: 1.5M+ (target: +40% from 1.06M)
├─ Ops per cycle: 0.05+ (from 0.015 baseline)
└─ Cache miss overhead: Reduced by 80%+
```
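A sketch of the ENV-gated statistics described above and in the commit message. The `g_warm_pool_stats` name and the `HAKMEM_WARM_POOL_STATS=1` gate come from the commit; the field layout and print format are assumptions:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define TINY_NUM_CLASSES 32

typedef struct {
    unsigned long hits, misses, prefilled;   /* always counted, cheap */
} WarmPoolStats;

static WarmPoolStats g_warm_pool_stats[TINY_NUM_CLASSES];

static double warm_pool_hit_rate(const WarmPoolStats *s) {
    unsigned long total = s->hits + s->misses;
    return total ? (double)s->hits / (double)total : 0.0;
}

/* Statistics are always compiled in; printing is gated on the env var
 * so production runs pay no I/O cost. */
static void warm_pool_stats_dump(void) {
    if (!getenv("HAKMEM_WARM_POOL_STATS")) return;
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        const WarmPoolStats *s = &g_warm_pool_stats[c];
        if (s->hits + s->misses == 0) continue;      /* unused class */
        fprintf(stderr, "C%d hits=%lu misses=%lu hit_rate=%.1f%%\n",
                c, s->hits, s->misses, 100.0 * warm_pool_hit_rate(s));
    }
}
```

With the commit's C7 figures (hits=3929, misses=3143) this reports a 55.6% hit rate.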
### Regression Tests
```
Ensure no degradation:
✓ Tiny Hot: 89M ops/s (unchanged)
✓ Tiny Cold: No regression expected
✓ Tiny Middle: No regression expected
✓ Memory correctness: All tests pass
✓ Multithreaded: No race conditions
✓ Thread safety: Concurrent access safe
```
---
## 🚀 Recommended Next Steps
### Step 1: Agree on Scope
- [ ] Accept Phase 1 (warm pool) proposal
- [ ] Defer Phase 2 (advanced optimizations) to later
- [ ] Do not attempt Phase 3 (architectural rewrite)
### Step 2: Create Warm Pool Implementation
- [ ] Create `core/front/tiny_warm_pool.h`
- [ ] Implement data structures and operations
- [ ] Write inline functions for hot operations
### Step 3: Integrate with Unified Cache
- [ ] Modify `unified_cache_refill()` to use warm pool
- [ ] Add initialization logic
- [ ] Test correctness
### Step 4: Benchmark & Validate
- [ ] Run Random Mixed benchmark
- [ ] Measure ops/s improvement (target: 1.5M+)
- [ ] Profile warm pool hit rate (target: > 90%)
- [ ] Verify no regression in other workloads
### Step 5: Iterate & Refine
- [ ] If hit rate < 80%: Increase warm pool size
- [ ] If hit rate > 95%: Reduce warm pool size (save memory)
- [ ] If performance < 1.4M ops/s: Review bottlenecks
---
## 🎯 Conclusion
**Warm pool implementation offers:**
- High ROI (40-50% improvement with 200-300 lines of code)
- Low risk (fallback to proven registry scan path)
- Incremental approach (doesn't require full redesign)
- Clear success criteria (ops/s improvement, hit rate tracking)
**Expected outcome:**
- Random Mixed: 1.06M → 1.5M+ ops/s (+40%)
- Tiny Hot: 89M ops/s (unchanged)
- Total system: Better performance for real-world workloads
**Path to further improvements:**
- Phase 2 (advanced): +20-30% more (1.8-2.0M ops/s)
- Phase 3 (redesign): Not recommended (high effort, limited gain)
**Recommendation:** Implement Phase 1 warm pool. Re-evaluate after measuring actual performance.
---
**Document Status:** Ready for implementation
**Review & Approval:** Required before starting code changes