# HAKMEM Architectural Restructuring for 10x Performance - Implementation Proposal
## 2025-12-04
---
## 📊 Executive Summary
**Goal:** Achieve 10x performance improvement on Random Mixed allocations (1.06M → 10.6M ops/s) by restructuring allocator to separate HOT/WARM/COLD execution paths.
**Current Performance Gap:**
```
Random Mixed: 1.06M ops/s (current baseline)
Tiny Hot: 89M ops/s (reference - different workload)
Goal: 10.6M ops/s (10x from baseline)
```
**Key Discovery:** The current architecture already has HOT/WARM separation (via the Unified Cache), but inefficiencies in the WARM path prevent it from scaling:
1. **Registry scan on cache miss** (O(N) search through per-class registry)
2. **Per-allocation tier checks** (atomic operations, not batched)
3. **Lack of pre-warmed SuperSlab pools** (must allocate/initialize on miss)
4. **Global registry contention** (mutex-protected writes)
---
## 🔍 Current Architecture Analysis
### Existing Two-Speed Foundation
HAKMEM **already implements** a two-tier design:
```
HOT PATH (95%+ allocations):
  malloc_tiny_fast()
    → tiny_hot_alloc_fast()
    → Unified Cache pop (TLS, 2-3 cache misses)
    → Return USER pointer
  Cost: ~20-30 CPU cycles

WARM PATH (1-5% cache misses):
  malloc_tiny_fast()
    → tiny_cold_refill_and_alloc()
    → unified_cache_refill()
    → Per-class registry scan (find HOT SuperSlab)
    → Tier check (is HOT)
    → Carve ~64 blocks
    → Refill Unified Cache
    → Return USER pointer
  Cost: ~500-1000 cycles per batch (~5-10 per object amortized)
```
### Performance Bottlenecks in WARM Path
**Bottleneck 1: Registry Scan (O(N))**
- Current: Linear search through per-class registry to find HOT SuperSlab
- Cost: 50-100 cycles per refill
- Happens on EVERY cache miss (~1-5% of allocations)
- Files: `core/hakmem_super_registry.h`, `core/front/tiny_unified_cache.h` (unified_cache_refill function)
**Bottleneck 2: Per-Allocation Tier Checks**
- Current: `ss_tier_is_hot(ss)` is called on every refill, and once per candidate SuperSlab during the registry scan
- Should be: Tier checks batched across many refills (see Phase 4)
- Cost: Atomic operations whose cost is never amortized
- File: `core/box/ss_tier_box.h`
**Bottleneck 3: Global Registry Contention**
- Current: Mutex-protected registry insert on SuperSlab alloc
- File: `core/hakmem_super_registry.h` (hak_super_registry_insert)
- Lock: `g_super_reg_lock`
**Bottleneck 4: SuperSlab Initialization Overhead**
- Current: Full allocation + initialization on cache miss → cold path
- Cost: ~1000+ cycles (mmap, metadata setup, registry insert)
- Should be: Pre-allocated from LRU cache or warm pool
---
## 💡 Proposed Three-Tier Architecture
### Tier 1: HOT (95%+ allocations)
```c
// Path: TLS Unified Cache hit
// Cost: ~20-30 cycles (unchanged)
// Characteristics:
// - No registry access
// - No Tier/Guard calls
// - No locks
// - Branch-free (or 1-branch pipeline hits)
Path:
1. Read TLS Unified Cache (TLS access, 1 cache miss)
2. Pop from array (array access, 1 cache miss)
3. Update head pointer (1 store)
4. Return USER pointer (0 additional branches for hit)
Total: 2-3 cache misses, ~20-30 cycles
```
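For concreteness, a minimal sketch of the HOT-path pop described above. The struct layout and names (`slots`, `head`, `tail`, `tiny_hot_alloc_sketch`) are illustrative assumptions, not the actual `core/front/tiny_unified_cache.h` layout.
```c
#include <stdint.h>
// TINY_NUM_CLASSES is assumed to come from hakmem_tiny_config.h (as in Task 1.1).
typedef struct {
    void**   slots;  // ring of USER pointers, refilled ~64 at a time
    uint32_t head;   // next slot to pop
    uint32_t tail;   // one past the last valid slot
} TinyUnifiedCacheSketch;

static __thread TinyUnifiedCacheSketch g_ucache_sketch[TINY_NUM_CLASSES];

// HOT path: one predictable branch, ~2-3 cache misses (TLS base + slot array).
static inline void* tiny_hot_alloc_sketch(int class_idx) {
    TinyUnifiedCacheSketch* c = &g_ucache_sketch[class_idx];
    if (c->head != c->tail) {
        return c->slots[c->head++];  // pop + advance head, return USER pointer
    }
    return NULL;  // miss → caller falls through to the WARM path
}
```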
### Tier 2: WARM (1-5% cache misses)
**NEW: Per-Thread Warm Pool**
```c
// Path: Unified Cache miss → Pop from per-thread warm pool
// Cost: ~50-100 cycles per batch (5-10 per object amortized)
// Characteristics:
// - No global registry scan
// - Pre-qualified SuperSlabs (already HOT)
// - Batched tier transitions (not per-object)
// - Minimal lock contention
Data Structure:
__thread SuperSlab* g_warm_pool_head[TINY_NUM_CLASSES];
__thread int g_warm_pool_count[TINY_NUM_CLASSES];
__thread int g_warm_pool_capacity[TINY_NUM_CLASSES];
Path:
1. Detect Unified Cache miss (head == tail)
2. Check warm pool (TLS access, no lock)
   a. If warm_pool_count > 0:
      ├─ Pop SuperSlab from warm_pool_head (O(1))
      ├─ Use existing SuperSlab (no mmap)
      ├─ Carve ~64 blocks (amortized cost)
      ├─ Refill Unified Cache
      ├─ (Optional) Batch tier check after ~64 pops
      └─ Return first block
   b. If warm_pool_count == 0:
      └─ Fall through to COLD (rare)
Total: ~50-100 cycles per batch
```
### Tier 3: COLD (<0.1% special cases)
```c
// Path: Warm pool exhausted, error, or special handling
// Cost: ~1000-10000 cycles per SuperSlab (rare)
// Characteristics:
// - Full SuperSlab allocation (mmap)
// - Registry insert (mutex-protected write)
// - Tier initialization
// - Guard validation
Path:
1. Warm pool exhausted
2. Allocate new SuperSlab (mmap via ss_os_acquire_box)
3. Insert into global registry (mutex-protected)
4. Initialize TinySlabMeta + metadata
5. Add to per-class registry
6. Carve blocks + refill both Unified Cache and warm pool
7. Return first block
```
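A hedged sketch of steps 2-7 above, including the step-6 idea of priming the warm pool while we are already in the COLD path. `superslab_alloc_cold()` and `unified_cache_pop()` are hypothetical stand-ins for the real mmap/registry/cache helpers; only `carve_blocks_from_superslab()` and `tiny_warm_pool_push()` appear elsewhere in this proposal.
```c
SuperSlab* superslab_alloc_cold(int class_idx);  // assumed: mmap + registry insert + tier/meta init
void carve_blocks_from_superslab(SuperSlab* ss, int class_idx, void* cache);
void* unified_cache_pop(int class_idx);          // assumed: pop one USER pointer from the TLS cache

static void* cold_alloc_with_warm_prefill(int class_idx, void* cache) {
    SuperSlab* ss = superslab_alloc_cold(class_idx);     // ~1000+ cycles, but rare (<0.1%)
    if (!ss) return NULL;
    carve_blocks_from_superslab(ss, class_idx, cache);   // refill the Unified Cache
    SuperSlab* extra = superslab_alloc_cold(class_idx);  // optional: one spare slab
    if (extra) tiny_warm_pool_push(class_idx, extra);    // so the next miss stays WARM
    return unified_cache_pop(class_idx);                 // return the first block
}
```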
---
## 🔧 Implementation Plan
### Phase 1: Design & Data Structures (THIS DOCUMENT)
**Task 1.1: Define Warm Pool Data Structure**
```c
// File: core/front/tiny_warm_pool.h (NEW)
//
// Per-thread warm pool for pre-allocated SuperSlabs
// Reduces registry scan cost on cache miss
#ifndef HAK_TINY_WARM_POOL_H
#define HAK_TINY_WARM_POOL_H
#include <stdint.h>
#include "../hakmem_tiny_config.h"
#include "../superslab/superslab_types.h"
// Maximum warm SuperSlabs per thread (tunable)
#define TINY_WARM_POOL_MAX_PER_CLASS 4
typedef struct {
    SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;
    int capacity;
} TinyWarmPool;
// Per-thread warm pools (one per class)
extern __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];
// Operations:
// - tiny_warm_pool_init() → Initialize at thread startup
// - tiny_warm_pool_push() → Add SuperSlab to warm pool
// - tiny_warm_pool_pop() → Remove SuperSlab from warm pool (O(1))
// - tiny_warm_pool_drain() → Return all to LRU on thread exit
// - tiny_warm_pool_refill() → Batch refill from LRU cache
#endif
```
**Task 1.2: Define Warm Pool Operations**
```c
// Lazy initialization (once per thread)
static inline void tiny_warm_pool_init_once(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->capacity == 0) {
        pool->capacity = TINY_WARM_POOL_MAX_PER_CLASS;
        pool->count = 0;
        // Allocate initial SuperSlabs on demand (COLD path)
    }
}

// O(1) pop from warm pool
static inline SuperSlab* tiny_warm_pool_pop(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count > 0) {
        return pool->slabs[--pool->count]; // Pop from end
    }
    return NULL; // Pool empty → fall through to COLD
}

// O(1) push to warm pool
static inline void tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count < pool->capacity) {
        pool->slabs[pool->count++] = ss;
    } else {
        // Pool full → return to LRU cache or free
        ss_cache_put(ss); // Return to global LRU
    }
}
```
### Phase 2: Implement Warm Pool Initialization
**Task 2.1: Thread Startup Integration**
- Initialize warm pools on first malloc call
- Pre-populate from LRU cache (if available)
- Fall back to cold allocation if needed
**Task 2.2: Batch Refill Strategy**
- On thread startup: Allocate ~2-3 SuperSlabs per class to warm pool
- On cache miss: Pop from warm pool (no registry scan)
- On warm pool depletion: Allocate 1-2 more in cold path
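A minimal sketch of the `tiny_warm_pool_refill()` operation listed in Task 1.1, under the assumption that the global LRU cache exposes a getter (`ss_cache_get()`, hypothetical counterpart of the `ss_cache_put()` used above) and that a cold allocator (`superslab_alloc_cold()`, also hypothetical) backs it up:
```c
SuperSlab* ss_cache_get(int class_idx);          // assumed: take a slab from the global LRU cache
SuperSlab* superslab_alloc_cold(int class_idx);  // assumed: full mmap + registry insert

static void tiny_warm_pool_refill(int class_idx, int want) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    while (pool->count < pool->capacity && want-- > 0) {
        SuperSlab* ss = ss_cache_get(class_idx);   // cheap path: reuse a cached slab
        if (!ss) {
            if (pool->count > 0) break;            // never cold-alloc just to stockpile
            ss = superslab_alloc_cold(class_idx);  // rare: pool is completely dry
            if (!ss) break;
        }
        tiny_warm_pool_push(class_idx, ss);
    }
}
```
Per Task 2.2, thread startup would call this with `want = 2..3` per class, and the cold path with `want = 1..2` once the pool is depleted.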
### Phase 3: Modify unified_cache_refill()
**Current Implementation** (Registry Scan):
```c
void unified_cache_refill(int class_idx) {
    // Linear search through per-class registry
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* ss = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(ss)) { // ← Tier check (5-10 cycles)
            // Carve blocks
            carve_blocks_from_superslab(ss, class_idx, cache);
            return;
        }
    }
    // Not found → cold path (allocate new SuperSlab)
}
```
**Proposed Implementation** (Warm Pool First):
```c
void unified_cache_refill(int class_idx) {
    // 1. Try warm pool first (no lock, O(1))
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        // SuperSlab already HOT (pre-qualified), no tier check needed
        carve_blocks_from_superslab(ss, class_idx, cache);
        return;
    }
    // 2. Fall back to registry scan (only if warm pool empty)
    //    (TINY_WARM_POOL_MAX_PER_CLASS = 4, so this rarely happens)
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* cand = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(cand)) {
            carve_blocks_from_superslab(cand, class_idx, cache);
            // Refill warm pool on success
            for (int j = 0; j < 2; j++) {
                SuperSlab* extra = find_next_hot_slab(class_idx, i);
                if (extra) {
                    tiny_warm_pool_push(class_idx, extra);
                    i++;
                }
            }
            return;
        }
    }
    // 3. Cold path (allocate new SuperSlab)
    allocate_new_superslab(class_idx, cache);
}
```
### Phase 4: Batched Tier Transition Checks
**Current:** Tier check on every refill (5-10 cycles)
**Proposed:** Batch tier checks once per N operations
```c
// Per-thread tier check counter (no atomics needed; work is done in batches)
static __thread uint32_t g_tier_check_counter = 0;
#define TIER_CHECK_BATCH_SIZE 256

void tier_check_maybe_batch(int class_idx) {
    if (++g_tier_check_counter % TIER_CHECK_BATCH_SIZE == 0) {
        // Batch check: sample a few SuperSlabs from the per-class registry
        int n = g_super_reg_by_class_count[class_idx];
        for (int i = 0; i < 10 && n > 0; i++) { // Sample up to 10 SuperSlabs
            SuperSlab* ss = g_super_reg_by_class[class_idx][rand() % n];
            if (!ss_tier_is_hot(ss)) {
                // Demote from warm pool if present
                // (Cost: one sampling pass per 256 refills)
            }
        }
    }
}
```
### Phase 5: LRU Cache Integration
**How Warm Pool Gets Replenished:**
1. **Startup:** Pre-populate warm pools from LRU cache
2. **During execution:** On cold path alloc, add extra SuperSlab to warm pool
3. **Periodic:** Background thread refills warm pools when < threshold
4. **On free:** When SuperSlab becomes empty, add to LRU cache (not warm pool)
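A minimal sketch of rule 4, assuming a hypothetical emptiness predicate `superslab_is_empty()`; `ss_cache_put()` is the same LRU return already used by `tiny_warm_pool_push()`:
```c
int superslab_is_empty(SuperSlab* ss);  // assumed: no live blocks remain in any slab

static void tiny_free_superslab_hook(SuperSlab* ss) {
    if (superslab_is_empty(ss)) {
        // Fully idle slabs return to the global LRU cache, never to a
        // thread-local warm pool, so reclaimed memory stays shared process-wide.
        ss_cache_put(ss);
    }
}
```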
---
## 📈 Expected Performance Impact
### Current Baseline
```
Random Mixed: 1.06M ops/s
Breakdown:
- 95% cache hits (HOT): ~1.007M ops/s (clean, 2-3 cache misses)
- 5% cache misses (WARM): ~0.053M ops/s (registry scan + refill)
```
### After Warm Pool Implementation
```
Estimated: 1.5-1.8M ops/s (+40-70%)
Breakdown:
- 95% cache hits (HOT): ~1.007M ops/s (unchanged, 2-3 cache misses)
- 5% cache misses (WARM): ~0.15-0.20M ops/s (warm pool, O(1) pop)
(vs 0.053M before)
Improvement mechanism:
- Remove registry O(N) scan → O(1) warm pool pop
- Reduce per-refill cost: ~500 cycles → ~50 cycles
- Expected per-miss speedup: ~10x
- Applied to the ~5% of operations that miss the cache (≈0.053M ops/s of the current baseline)
- Actual gain: 1.06M × 0.05 × 9 = 0.477M
- Total: 1.06M + 0.477M = 1.537M ops/s (+45%)
```
### Path to 10x
The optimizations planned here can realistically achieve:
- **Warm pool optimization:** +40-70% (this proposal)
- **Lock-free refill path:** +10-20% (phase 2)
- **Batch tier transitions:** +5-10% (phase 2)
- **Reduced syscall overhead:** +5% (phase 3)
- **Total realistic: 2.0-2.5x** (not 10x)
**To reach 10x improvement, would need:**
1. Dedicated per-thread allocation pools (reduce lock contention)
2. Batch pre-allocation strategy (reduce per-op overhead)
3. Size class coalescing (reduce routing complexity)
4. Or: Change workload pattern (batch allocations)
---
## ⚠️ Implementation Risks & Mitigations
### Risk 1: Thread-Local Storage Bloat
**Risk:** Adding warm pool increases per-thread memory usage
**Mitigation:**
- Allocate warm pool lazily
- Limit to 4-8 SuperSlabs per class (128KB per thread max)
- Default: 4 slots per class → 128KB total (acceptable)
### Risk 2: Warm Pool Invalidation
**Risk:** SuperSlabs become DRAINING/FREE unexpectedly
**Mitigation:**
- Periodic validation during batch tier checks
- Accept occasional validation error (rare, correctness not affected)
- Fallback to registry scan if warm pool slot invalid
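One way to implement that fallback, sketched with the operations already defined in Phase 1 (`tiny_warm_pool_pop()`, `ss_cache_put()`) plus the existing `ss_tier_is_hot()` check; the wrapper name is illustrative:
```c
static inline SuperSlab* tiny_warm_pool_pop_validated(int class_idx) {
    SuperSlab* ss;
    while ((ss = tiny_warm_pool_pop(class_idx)) != NULL) {
        if (ss_tier_is_hot(ss)) return ss;  // common case: slab is still HOT
        ss_cache_put(ss);                   // went DRAINING/FREE → hand back to the LRU
    }
    return NULL;  // pool empty or fully stale → caller falls back to the registry scan
}
```
Because the pool holds at most TINY_WARM_POOL_MAX_PER_CLASS entries, the worst case is a handful of extra tier checks per miss.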
### Risk 3: Stale SuperSlabs
**Risk:** Warm pool holds SuperSlabs that should be freed
**Mitigation:**
- LRU-based eviction from warm pool
- Maximum hold time: 60s (configurable)
- On thread exit: drain warm pool back to LRU cache
### Risk 4: Initialization Race
**Risk:** Multiple threads initialize warm pools simultaneously
**Mitigation:**
- Use `__thread` thread-local storage (each thread only touches its own pool, so there is no cross-thread race)
- Lazy initialization with check-then-set
- No atomic operations needed (per-thread)
---
## 🔄 Integration Checklist
### Pre-Implementation
- [ ] Review current unified_cache_refill() implementation
- [ ] Identify all places where SuperSlab allocation happens
- [ ] Audit Tier system for validation requirements
- [ ] Measure current registry scan cost in micro-benchmark
### Phase 1: Warm Pool Infrastructure
- [ ] Create `core/front/tiny_warm_pool.h` with data structures
- [ ] Implement warm_pool_init(), pop(), push() operations
- [ ] Add __thread variable declarations
- [ ] Write unit tests for warm pool operations
- [ ] Verify no TLS bloat (profile memory usage)
### Phase 2: Integration Points
- [ ] Modify malloc_tiny_fast() to initialize warm pools
- [ ] Integrate warm_pool_pop() in unified_cache_refill()
- [ ] Implement warm_pool_push() in cold allocation path
- [ ] Add initialization on first malloc
- [ ] Handle thread exit cleanup
### Phase 3: Testing
- [ ] Micro-benchmark: warm pool pop (should be O(1), 2-3 cycles)
- [ ] Benchmark Random Mixed: measure ops/s improvement
- [ ] Benchmark Tiny Hot: verify no regression (should be unchanged)
- [ ] Stress test: concurrent threads + warm pool refill
- [ ] Correctness: verify all objects properly allocated/freed
### Phase 4: Profiling & Optimization
- [ ] Profile hot path (should still be 20-30 cycles)
- [ ] Profile warm path (should be reduced to 50-100 cycles)
- [ ] Measure registry scan reduction
- [ ] Identify any remaining bottlenecks
### Phase 5: Documentation
- [ ] Update comments in unified_cache_refill()
- [ ] Document warm pool design in README
- [ ] Add environment variables (if needed)
- [ ] Document tier check batching strategy
---
## 📊 Metrics to Track
### Pre-Implementation
```
Baseline Random Mixed:
- Ops/sec: 1.06M
- L1 cache misses: ~763K per 1M ops
- Page faults: ~7,674
- CPU cycles: ~70.4M
```
### Post-Implementation Targets
```
After warm pool:
- Ops/sec: 1.5-1.8M (+40-70%)
- L1 cache misses: Similar or slightly reduced
- Page faults: Same (~7,674)
- CPU cycles: ~45-50M (30% reduction)
Warm path breakdown:
- Warm pool hit: 50-100 cycles per batch
- Registry fallback: 200-300 cycles (rare)
- Cold alloc: 1000-5000 cycles (very rare)
```
---
## 💾 Files to Create/Modify
### New Files
- `core/front/tiny_warm_pool.h` - Warm pool data structures & operations
### Modified Files
1. `core/front/malloc_tiny_fast.h`
- Initialize warm pools on first call
- Document three-tier routing
2. `core/front/tiny_unified_cache.h`
- Modify unified_cache_refill() to use warm pool first
- Add warm pool replenishment logic
3. `core/box/ss_tier_box.h`
- Add batched tier check strategy
- Document validation requirements
4. `core/hakmem_tiny.h` or `core/front/malloc_tiny_fast.h`
- Add environment variables:
- `HAKMEM_WARM_POOL_SIZE` (default: 4)
- `HAKMEM_WARM_POOL_REFILL_THRESHOLD` (default: 1)
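A small sketch of how those two knobs could be read once at startup; the names match the list above, while the parsing helper itself is an assumption:
```c
#include <stdlib.h>

static int env_int_or(const char* name, int dflt) {
    const char* s = getenv(name);
    return (s && *s) ? atoi(s) : dflt;
}

static int g_warm_pool_size;              // slots per class (capacity)
static int g_warm_pool_refill_threshold;  // refill when count drops below this

static void warm_pool_read_env(void) {
    g_warm_pool_size             = env_int_or("HAKMEM_WARM_POOL_SIZE", 4);
    g_warm_pool_refill_threshold = env_int_or("HAKMEM_WARM_POOL_REFILL_THRESHOLD", 1);
}
```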
### Configuration Files
- Add warm pool parameters to benchmark configuration
- Update profiling tools to measure warm pool effectiveness
---
## 🎯 Success Criteria
**Must Have:**
1. Warm pool implementation reduces registry scan cost by 80%+
2. Random Mixed ops/s increases to 1.5M+ (40%+ improvement)
3. Tiny Hot ops/s unchanged (no regression)
4. All allocations remain correct (no memory corruption)
5. No thread-local storage bloat (< 200KB per thread)
**Nice to Have:**
1. Random Mixed reaches 2M+ ops/s (90%+ improvement)
2. Warm pool hit rate > 90% (rarely fall back to registry)
3. L1 cache misses reduced by 10%+
4. Per-free cost unchanged (no regression)
**Not in Scope (separate PR):**
1. Lock-free refill path (requires CAS-based warm pool)
2. Per-thread allocation pools (requires larger redesign)
3. Hugepages support (already tested, no gain)
---
## 📝 Next Steps
1. **Review this proposal** with the team
2. **Approve scope & success criteria**
3. **Begin Phase 1 implementation** (warm pool header file)
4. **Integrate with unified_cache_refill()**
5. **Benchmark and measure improvements**
6. **Iterate based on profiling results**
---
## 🔗 References
- Current Profiling: `COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md`
- Session Summary: `FINAL_SESSION_REPORT_20251204.md`
- Box Architecture: `core/box/` directory
- Unified Cache: `core/front/tiny_unified_cache.h`
- Registry: `core/hakmem_super_registry.h`
- Tier System: `core/box/ss_tier_box.h`