hakmem/WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
**Commit: Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)**

**Problem:** The warm pool had a 0% hit rate (only 1 hit per 3,976 misses) despite being implemented, so every cache miss went through an expensive `superslab_refill` registry scan.

**Root cause analysis:**
- The warm pool was initialized once and received a single slab after each refill.
- When that slab was exhausted, it was discarded (not pushed back).
- The next refill pushed another single slab, which was immediately exhausted.
- The pool oscillated between 0 and 1 items, yielding a 0% hit rate.

**Solution: secondary prefill on cache miss.** When the warm pool becomes empty, we now run multiple `superslab_refill`s and prefill the pool with 3 additional HOT SuperSlabs before attempting to carve. This builds a working set of slabs that can sustain allocation pressure.

**Implementation details:**
- Modified the `unified_cache_refill()` cold path to detect an empty pool.
- Added a prefill loop: when the pool count is 0, load 3 extra SuperSlabs.
- Store the extra slabs in the warm pool; keep 1 in TLS for immediate carving.
- Track prefill events in the `g_warm_pool_stats[].prefilled` counter.

**Results (1M Random Mixed 256B allocations):**
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After: C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs. the 4.07M baseline)
- Stability: consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

**Performance impact:**
- No regression: throughput remained stable at ~4.1M ops/s.
- The registry scan is avoided on 55.6% of cache misses (significant savings).
- The warm pool now functions as intended, with strong locality.

**Configuration:**
- `TINY_WARM_POOL_MAX_PER_CLASS` increased from 4 to 16 to support prefill.
- Prefill budget hardcoded to 3 (tunable via an env var if needed later).
- Statistics are always compiled; printing is gated by `HAKMEM_WARM_POOL_STATS=1`.

**Next steps:**
- Monitor for further optimization opportunities (prefill budget tuning).
- Consider an adaptive prefill budget based on class-specific hit rates.
- Validate at larger allocation counts (10M+, pending the registry size fix).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 23:31:54 +09:00
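The secondary-prefill cold path described in the commit message can be sketched as below. This is a minimal, testable model: `superslab_refill()` is a counter-based stub standing in for the real registry scan, and `PREFILL_BUDGET`, `cold_refill_with_prefill`, and the pool layout are illustrative names, not the actual hakmem identifiers.

```c
#include <assert.h>
#include <stddef.h>

#define PREFILL_BUDGET 3   /* extra HOT SuperSlabs loaded when the pool is empty */
#define POOL_CAP 16        /* mirrors TINY_WARM_POOL_MAX_PER_CLASS = 16 */

typedef struct SuperSlab SuperSlab;  /* opaque in this sketch */

static SuperSlab *pool[POOL_CAP];
static int pool_count;
static int refill_calls;   /* counts expensive registry scans (stub) */

static SuperSlab *superslab_refill(void) {   /* stub: always succeeds */
    ++refill_calls;
    return (SuperSlab *)(size_t)(0x1000 * refill_calls);
}

/* On a cache miss with an empty pool, do 1 + PREFILL_BUDGET refills:
   keep one slab for immediate carving, park the rest in the pool so
   the next few misses are O(1) warm-pool hits. */
static SuperSlab *cold_refill_with_prefill(void) {
    if (pool_count > 0)
        return pool[--pool_count];           /* warm hit: no registry scan */
    SuperSlab *ss = superslab_refill();      /* slab we will carve now */
    for (int i = 0; i < PREFILL_BUDGET && pool_count < POOL_CAP; i++) {
        SuperSlab *extra = superslab_refill();
        if (!extra) break;                   /* registry exhausted */
        pool[pool_count++] = extra;
    }
    return ss;
}
```

This is why the pool no longer oscillates between 0 and 1 items: each empty-pool miss stocks three follow-up slabs instead of none.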
# Warm Pool Architecture - Visual Summary & Decision Framework
## 2025-12-04
---
## 🎯 The Core Problem
```
Current Random Mixed Performance: 1.06M ops/s
What's happening on EVERY CACHE MISS (~5% of allocations):
malloc_tiny_fast(size)
tiny_cold_refill_and_alloc() ← Called ~53,000 times per 1M allocs
unified_cache_refill()
Linear registry scan (O(N)) ← BOTTLENECK!
├─ Search per-class registry
├─ Check tier of each SuperSlab
├─ Find first HOT one
├─ Cost: 50-100 cycles per miss
└─ Impact: ~5% of ops doing expensive work
Carve ~64 blocks (fast)
Return first block
Total cache miss cost: ~500-1000 cycles per miss
Amortized: ~5-10 cycles per object
Multiplied over 5% misses: SIGNIFICANT OVERHEAD
```
---
## 💡 The Warm Pool Solution
```
BEFORE (Current):
Cache miss → Registry scan (O(N)) → Find HOT → Carve → Return
AFTER (Warm Pool):
Cache miss → Warm pool pop (O(1)) → Already HOT → Carve → Return
Pre-allocated SuperSlabs
stored per-thread
(TLS)
```
### The Warm Pool Concept
```
Per-thread data structure:
g_tiny_warm_pool[TINY_NUM_CLASSES]: // For each size class
.slabs[]: // Array of pre-allocated SuperSlabs
.count: // How many are in pool
.capacity: // Max capacity (typically 4)
For a 64-byte allocation (class 2):
If warm_pool[2].count > 0: ← FAST PATH
Pop ss = warm_pool[2].slabs[--count]
Carve blocks
Return
Cost: ~50 cycles per batch (<1 cycle per object amortized over ~64 blocks)
Else: ← FALLBACK
Registry scan (old path)
Cost: ~500 cycles per batch
(But RARE because pool is usually full)
```
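The per-thread structure and the fast-path pop sketched above could look like the following in C. All identifiers (`TinyWarmPool`, `tiny_warm_pool_pop`, `tiny_warm_pool_push`) are illustrative stand-ins for the real hakmem names; the point is that both operations are O(1) and lock-free because the pool is thread-local.

```c
#include <assert.h>
#include <stddef.h>

#define TINY_NUM_CLASSES 32
#define TINY_WARM_POOL_MAX_PER_CLASS 16

typedef struct SuperSlab SuperSlab;  /* opaque in this sketch */

typedef struct {
    SuperSlab *slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;                       /* how many slabs are currently pooled */
} TinyWarmPool;

/* One pool per size class, thread-local: no locks, no atomics. */
static __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];

/* O(1) pop: returns a pooled HOT SuperSlab, or NULL when empty
   (caller then falls back to the registry scan). */
static SuperSlab *tiny_warm_pool_pop(int class_idx) {
    TinyWarmPool *p = &g_tiny_warm_pool[class_idx];
    if (p->count == 0) return NULL;
    return p->slabs[--p->count];
}

/* O(1) push: silently drops the slab if the pool is already full. */
static int tiny_warm_pool_push(int class_idx, SuperSlab *ss) {
    TinyWarmPool *p = &g_tiny_warm_pool[class_idx];
    if (p->count >= TINY_WARM_POOL_MAX_PER_CLASS) return 0;
    p->slabs[p->count++] = ss;
    return 1;
}
```

LIFO order is deliberate: the most recently pushed slab is the most likely to still be resident in cache.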
---
## 📊 Performance Model
### Current (Registry Scan Every Miss)
```
Scenario: 1M allocations, 5% cache miss rate = 50,000 misses
Hot path (95%): 950,000 allocs × 25 cycles = 23.75M cycles
Warm path (5%): 50,000 batches × 1000 cycles = 50M cycles
Other overhead: ~15M cycles
─────────────────────────────────────────────────
Total: ~88.75M cycles
~1.06M ops/s
```
### Proposed (Warm Pool, 90% Hit)
```
Scenario: 1M allocations, 5% cache miss rate
Hot path (95%): 950,000 allocs × 25 cycles = 23.75M cycles
Warm path (5%):
├─ 90% warm pool hits: 45,000 batches × 100 cycles = 4.5M cycles
├─ 10% registry fallbacks: 5,000 batches × 1000 cycles = 5M cycles
└─ Sub-total: 9.5M cycles (vs 50M before)
Other overhead: ~15M cycles
─────────────────────────────────────────────────
Total: ~48M cycles
~1.46M ops/s (+38%)
```
### With Additional Optimizations (Lock-free, Batched Tier Checks)
```
Hot path (95%): 950,000 allocs × 25 cycles = 23.75M cycles
Warm path (5%):
├─ 95% warm pool hits: 47,500 batches × 75 cycles = 3.56M cycles
├─ 5% registry fallbacks: 2,500 batches × 800 cycles = 2M cycles
└─ Sub-total: 5.56M cycles
Other overhead: ~10M cycles
─────────────────────────────────────────────────
Total: ~39M cycles
~1.79M ops/s (+69%)
Further optimizations (per-thread pools, batch pre-alloc):
Potential ceiling: ~2.5-3.0M ops/s (+135-180%)
```
---
## 🔄 Warm Pool Data Flow
### Thread Startup
```
Thread calls malloc() for first time:
Check if warm_pool[class].capacity == 0:
├─ YES → Initialize warm pools
│ ├─ Set capacity = 4 per class
│ ├─ Allocate array space (TLS pointer arrays, a few KB total)
│ ├─ Try to pre-populate from LRU cache
│ │ ├─ Success: Get 2-3 SuperSlabs per class from LRU
│ │ └─ Fail: Leave empty (will populate on cold alloc)
│ └─ Ready!
└─ NO → Already initialized, continue
First allocation:
├─ HOT: Unified cache hit → Return (99% of time)
└─ WARM (on cache miss):
├─ warm_pool_pop(class) returns SuperSlab
├─ If NULL (pool empty, rare):
│ └─ Fall back to registry scan
└─ Carve & return
```
### Steady State Execution
```
For each allocation:
malloc(size)
├─ size → class_idx
├─ HOT: Unified cache hit (head != tail)?
│ └─ YES (95%): Return immediately
└─ WARM: Unified cache miss (head == tail)?
├─ Call unified_cache_refill(class_idx)
│ ├─ SuperSlab ss = tiny_warm_pool_pop(class_idx)
│ ├─ If ss != NULL (90% of misses):
│ │ ├─ Carve ~64 blocks from ss
│ │ ├─ Refill Unified Cache array
│ │ └─ Return first block
│ │
│ └─ Else (10% of misses):
│ ├─ Fall back to registry scan (COLD path)
│ ├─ Find HOT SuperSlab in per-class registry
│ ├─ Allocate new if not found (mmap)
│ ├─ Carve blocks + refill warm pool
│ └─ Return first block
└─ Return USER pointer
```
### Free Path Integration
```
free(ptr)
├─ tiny_hot_free_fast()
│ ├─ Push to TLS SLL (99% of time)
│ └─ Return
└─ (On SLL full, triggered once per ~256 frees)
├─ Batch drain SLL to SuperSlab freelist
├─ When SuperSlab becomes empty:
│ ├─ Remove from refill registry
│ ├─ Push to LRU cache (NOT warm pool)
│ │ (LRU will eventually evict or reuse)
│ └─ When LRU reuses: add to warm pool
└─ Return
```
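The free-path batching above (push to a thread-local singly-linked list, drain once per ~256 frees) can be sketched as follows. Names are illustrative, and the drain is a counting stub: the real code walks the list and returns blocks to the owning SuperSlab's freelist.

```c
#include <assert.h>
#include <stddef.h>

typedef struct FreeNode { struct FreeNode *next; } FreeNode;

#define SLL_DRAIN_THRESHOLD 256

static __thread FreeNode *g_sll_head;   /* thread-local free list (SLL) */
static __thread int g_sll_count;
static int g_drains;                    /* how many batch drains happened */

static void sll_drain_to_superslab(void) {
    /* Real code returns each block to its SuperSlab freelist;
       here we just discard the batch and count it. */
    g_sll_head = NULL;
    g_sll_count = 0;
    ++g_drains;
}

static void tiny_hot_free_fast(void *ptr) {
    FreeNode *n = (FreeNode *)ptr;
    n->next = g_sll_head;               /* O(1) push: 99% of frees end here */
    g_sll_head = n;
    if (++g_sll_count >= SLL_DRAIN_THRESHOLD)
        sll_drain_to_superslab();       /* triggered ~once per 256 frees */
}
```

Note that, per the flow above, emptied SuperSlabs go to the LRU cache on drain, not directly into the warm pool.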
### Warm Pool Replenishment (Background)
```
When warm_pool[class].count drops below threshold (1):
├─ Called from cold allocation path (rare)
├─ For next 2-3 SuperSlabs in registry:
│ ├─ Check if tier is still HOT
│ ├─ Add to warm pool (up to capacity)
│ └─ Continue registry scan
└─ Restore warm pool for next miss
No explicit background thread needed!
Warm pool is refilled as side effect of cold allocs.
```
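The replenish-as-a-side-effect idea above can be sketched as a registry scan that keeps walking past the first HOT slab and parks the next few HOT ones in the warm pool. The tier enum, array-based registry, and function name are illustrative stand-ins for the real registry structures.

```c
#include <assert.h>
#include <stddef.h>

enum tier { TIER_HOT, TIER_DRAINING };

typedef struct { enum tier tier; } SuperSlab;

#define POOL_CAP 4
static SuperSlab *pool[POOL_CAP];
static int pool_count;

/* Return the first HOT slab for immediate carving, then opportunistically
   pool up to `extra` more HOT slabs so the next few misses are warm hits. */
static SuperSlab *registry_scan_and_replenish(SuperSlab *reg, int n, int extra) {
    SuperSlab *first = NULL;
    for (int i = 0; i < n; i++) {
        if (reg[i].tier != TIER_HOT) continue;   /* skip DRAINING slabs */
        if (!first) { first = &reg[i]; continue; }
        if (extra > 0 && pool_count < POOL_CAP) {
            pool[pool_count++] = &reg[i];
            --extra;
        } else {
            break;                               /* budget spent: stop early */
        }
    }
    return first;
}
```

Because this runs only on the already-cold fallback path, the replenishment cost is paid where a full scan was happening anyway; no background thread is required.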
---
## ⚡ Implementation Complexity vs Gain
### Low Complexity (Recommended)
```
Effort: 200-300 lines of code
Time: 2-3 developer-days
Risk: Low
Changes:
1. Create tiny_warm_pool.h header (~50 lines)
2. Declare __thread warm pools (~10 lines)
3. Modify unified_cache_refill() (~100 lines)
- Try warm_pool_pop() first
- On success: carve & return
- On fail: registry scan (existing code path)
4. Add initialization in malloc (~20 lines)
5. Add cleanup on thread exit (~10 lines)
Expected gain: +40-50% (1.06M → 1.5M ops/s)
Risk: Very low (warm pool is additive, fallback to registry always works)
```
### Medium Complexity (Phase 2)
```
Effort: 500-700 lines of code
Time: 5-7 developer-days
Risk: Medium
Changes:
1. Lock-free warm pool using CAS
2. Batched tier transition checks
3. Per-thread allocation pool
4. Background warm pool refill thread
Expected gain: +70-100% (1.06M → 1.8-2.1M ops/s)
Risk: Medium (requires careful synchronization)
```
### High Complexity (Phase 3)
```
Effort: 1000+ lines
Time: 2-3 weeks
Risk: High
Changes:
1. Comprehensive redesign with three separate pools per thread
2. Lock-free fast path for entire allocation
3. Per-size-class threads for refill
4. Complex tier management
Expected gain: +150-200% (1.06M → 2.5-3.2M ops/s)
Risk: High (major architectural changes, potential correctness issues)
```
---
## 🎓 Why 10x is Hard (But 2x is Feasible)
### The 80x Gap: Random Mixed vs Tiny Hot
```
Tiny Hot: 89M ops/s
├─ Single fixed size (16 bytes)
├─ L1 cache perfect hit
├─ No pool lookup
├─ No routing
├─ No page faults
└─ Ideal case
Random Mixed: 1.06M ops/s
├─ 256 different sizes
├─ L1 cache misses
├─ Pool routing needed
├─ Registry lookup on miss
├─ ~7,600 page faults
└─ Real-world case
Difference: 83x
Can we close this gap?
- Warm pool optimization: +40-50% (to 1.5-1.6M)
- Lock-free pools: +20-30% (to 1.8-2.0M)
- Per-thread pools: +10-15% (to 2.0-2.3M)
- Other tuning: +5-10% (to 2.1-2.5M)
──────────────────────────────────
Total realistic: 2.0-2.5x (still 35-40x below Tiny Hot)
Why not 10x?
1. Fundamental overhead: 256 size classes (not 1)
2. Working set: Page faults (~7,600) are unavoidable
3. Routing: Pool lookup adds cycles (can't eliminate)
4. Tier management: Utilization tracking costs (necessary for correctness)
5. Memory: 2MB SuperSlab fragmentation (not tunable)
The 10x gap is ARCHITECTURAL, not a bug.
```
---
## 📋 Implementation Phases
### ✅ Phase 1: Basic Warm Pool (THIS PROPOSAL)
- **Goal:** +40-50% improvement (1.06M → 1.5M ops/s)
- **Scope:** Warm pool data structure + unified_cache_refill() integration
- **Risk:** Low
- **Timeline:** 2-3 days
- **Recommended:** YES (high ROI)
### ⏳ Phase 2: Advanced Optimizations (Optional)
- **Goal:** +20-30% additional (1.5M → 1.8-2.0M ops/s)
- **Scope:** Lock-free pools, batched tier checks, per-thread refill
- **Risk:** Medium
- **Timeline:** 1-2 weeks
- **Recommended:** Maybe (depends on user requirements)
### ❌ Phase 3: Architectural Redesign (Not Recommended)
- **Goal:** +100%+ improvement (2.0M+ ops/s)
- **Scope:** Major rewrite of allocation path
- **Risk:** High
- **Timeline:** 3-4 weeks
- **Recommended:** No (diminishing returns, high risk)
---
## 🔐 Safety & Correctness
### Thread Safety
```
Warm pool is thread-local (__thread):
✓ No locks needed
✓ No atomic operations
✓ No synchronization required
✓ Safe for all threads
Fallback path:
✓ Registry scan (existing code, proven)
✓ Always works if warm pool empty
✓ Correctness guaranteed
```
### Memory Safety
```
SuperSlab ownership:
✓ Warm pool only holds SuperSlabs we own
✓ Tier/Guard checks catch invalid cases
✓ On tier change (HOT→DRAINING): removed from pool
✓ Validation on periodic tier checks (batched)
Object layout:
✓ No change to object headers
✓ No change to allocation metadata
✓ Warm pool is transparent to user code
```
### Tier Transitions
```
If SuperSlab changes tier (HOT → DRAINING):
├─ Current: Caught on next registry scan
├─ Proposed: Caught on next batch tier check
├─ Rare case (only if working set shrinks)
└─ Fallback: Registry scan still works
Validation strategy:
├─ Periodic (batched) tier validation
├─ On cold path (always validates)
├─ Accept small window of stale data
└─ Correctness preserved
```
---
## 📊 Success Metrics
### Warm Pool Metrics to Track
```
While running Random Mixed benchmark:
Per-thread warm pool statistics:
├─ Pool capacity: 4 per class (128 total for 32 classes)
├─ Pool hit rate: 85-95% (target: > 90%)
├─ Pool miss rate: 5-15% (fallback to registry)
└─ Pool push rate: On cold alloc (should be rare)
Cache refill metrics:
├─ Warm pool refills: ~45,000 (90% of misses)
├─ Registry fallbacks: ~5,000 (10% of misses)
└─ Cold allocations: 10-100 (very rare)
Performance metrics:
├─ Total ops/s: 1.5M+ (target: +40% from 1.06M)
├─ Ops per cycle: ~0.02 (from ~0.015 baseline)
└─ Cache miss overhead: Reduced by 80%+
```
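The always-compiled, env-gated statistics mentioned in the commit message (`HAKMEM_WARM_POOL_STATS=1`) could be structured as below. The struct fields and function names are illustrative; only the env-var name comes from the source.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    unsigned long hits;        /* warm pool pop succeeded           */
    unsigned long misses;      /* fell back to registry scan        */
    unsigned long prefilled;   /* slabs parked by secondary prefill */
} WarmPoolStats;

static WarmPoolStats g_warm_pool_stats[32];   /* one entry per size class */

static double warm_pool_hit_rate(const WarmPoolStats *s) {
    unsigned long total = s->hits + s->misses;
    return total ? (double)s->hits / (double)total : 0.0;
}

/* Stats are tracked unconditionally; printing happens only when the
   user sets HAKMEM_WARM_POOL_STATS=1, so the hot path stays clean. */
static void warm_pool_stats_dump(void) {
    const char *env = getenv("HAKMEM_WARM_POOL_STATS");
    if (!env || env[0] != '1') return;
    for (int c = 0; c < 32; c++) {
        const WarmPoolStats *s = &g_warm_pool_stats[c];
        if (s->hits + s->misses == 0) continue;   /* skip idle classes */
        printf("C%d hits=%lu misses=%lu hit_rate=%.1f%%\n",
               c, s->hits, s->misses, 100.0 * warm_pool_hit_rate(s));
    }
}
```

Feeding in the commit's measured counters (C7 hits=3929, misses=3143) reproduces its reported 55.6% hit rate.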
### Regression Tests
```
Ensure no degradation:
✓ Tiny Hot: 89M ops/s (unchanged)
✓ Tiny Cold: No regression expected
✓ Tiny Middle: No regression expected
✓ Memory correctness: All tests pass
✓ Multithreaded: No race conditions
✓ Thread safety: Concurrent access safe
```
---
## 🚀 Recommended Next Steps
### Step 1: Agree on Scope
- [ ] Accept Phase 1 (warm pool) proposal
- [ ] Defer Phase 2 (advanced optimizations) to later
- [ ] Do not attempt Phase 3 (architectural rewrite)
### Step 2: Create Warm Pool Implementation
- [ ] Create `core/front/tiny_warm_pool.h`
- [ ] Implement data structures and operations
- [ ] Write inline functions for hot operations
### Step 3: Integrate with Unified Cache
- [ ] Modify `unified_cache_refill()` to use warm pool
- [ ] Add initialization logic
- [ ] Test correctness
### Step 4: Benchmark & Validate
- [ ] Run Random Mixed benchmark
- [ ] Measure ops/s improvement (target: 1.5M+)
- [ ] Profile warm pool hit rate (target: > 90%)
- [ ] Verify no regression in other workloads
### Step 5: Iterate & Refine
- [ ] If hit rate < 80%: Increase warm pool size
- [ ] If hit rate > 95%: Reduce warm pool size (save memory)
- [ ] If performance < 1.4M ops/s: Review bottlenecks
---
## 🎯 Conclusion
**Warm pool implementation offers:**
- High ROI (40-50% improvement with 200-300 lines of code)
- Low risk (fallback to proven registry scan path)
- Incremental approach (doesn't require full redesign)
- Clear success criteria (ops/s improvement, hit rate tracking)
**Expected outcome:**
- Random Mixed: 1.06M → 1.5M+ ops/s (+40%)
- Tiny Hot: 89M ops/s (unchanged)
- Total system: Better performance for real-world workloads
**Path to further improvements:**
- Phase 2 (advanced): +20-30% more (1.8-2.0M ops/s)
- Phase 3 (redesign): Not recommended (high effort, limited gain)
**Recommendation:** Implement Phase 1 warm pool. Re-evaluate after measuring actual performance.
---
**Document Status:** Ready for implementation
**Review & Approval:** Required before starting code changes