# Warm Pool Implementation - Quick-Start Guide
## 2025-12-04
---
## 🎯 TL;DR
**Objective:** Add per-thread warm SuperSlab pools to eliminate registry scan on cache miss.
**Expected Result:** +40-50% performance (1.06M → 1.5M+ ops/s)
**Code Changes:** ~300 lines total
- 1 new header file (80 lines)
- 3 files modified (unified_cache, malloc_tiny_fast, superslab_registry)
**Time Estimate:** 2-3 days
---
## 📋 Implementation Roadmap
### Step 1: Create Warm Pool Header (30 mins)
**File:** `core/front/tiny_warm_pool.h` (NEW)
```c
#ifndef HAK_TINY_WARM_POOL_H
#define HAK_TINY_WARM_POOL_H

#include <stdint.h>
#include "../hakmem_tiny_config.h"
#include "../superslab/superslab_types.h"

// Maximum warm SuperSlabs per thread per class
#define TINY_WARM_POOL_MAX_PER_CLASS 4

typedef struct {
    SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int32_t count;
} TinyWarmPool;

// Per-thread warm pool (one per class)
extern __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];

// Initialize once per thread (lazy)
static inline void tiny_warm_pool_init_once(void) {
    static __thread int initialized = 0;
    if (!initialized) {
        for (int i = 0; i < TINY_NUM_CLASSES; i++) {
            g_tiny_warm_pool[i].count = 0;
        }
        initialized = 1;
    }
}

// O(1) pop from warm pool
// Returns: SuperSlab* (not NULL if pool has items)
static inline SuperSlab* tiny_warm_pool_pop(int class_idx) {
    if (g_tiny_warm_pool[class_idx].count > 0) {
        return g_tiny_warm_pool[class_idx].slabs[--g_tiny_warm_pool[class_idx].count];
    }
    return NULL;
}

// O(1) push to warm pool
// Returns: 1 if pushed, 0 if pool full (caller should free to LRU)
static inline int tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    if (g_tiny_warm_pool[class_idx].count < TINY_WARM_POOL_MAX_PER_CLASS) {
        g_tiny_warm_pool[class_idx].slabs[g_tiny_warm_pool[class_idx].count++] = ss;
        return 1;
    }
    return 0;
}

// Get current count (for metrics)
static inline int tiny_warm_pool_count(int class_idx) {
    return g_tiny_warm_pool[class_idx].count;
}

#endif // HAK_TINY_WARM_POOL_H
```
### Step 2: Declare Thread-Local Variable (5 mins)
**File:** `core/hakmem_tiny.c` (or a new `core/front/tiny_warm_pool.c`)
The header from Step 1 only declares the pool with `extern`; define the thread-local array in exactly one source file:
```c
#include "tiny_warm_pool.h"
// Per-thread warm pools (one array per class)
__thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES] = {0};
```
### Step 3: Modify unified_cache_refill() (60 mins)
**File:** `core/front/tiny_unified_cache.h`
**Current Implementation:**
```c
static inline void unified_cache_refill(int class_idx) {
    // Find first HOT SuperSlab in per-class registry
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* ss = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(ss)) {
            // Carve and refill cache
            carve_blocks_from_superslab(ss, class_idx,
                                        &g_unified_cache[class_idx]);
            return;
        }
    }
    // Not found → cold path (allocate new SuperSlab)
    allocate_new_superslab_and_carve(class_idx);
}
```
**New Implementation (with Warm Pool):**
```c
#include "tiny_warm_pool.h"

static inline void unified_cache_refill(int class_idx) {
    // 1. Initialize warm pool on first use (per-thread)
    tiny_warm_pool_init_once();

    // 2. Try warm pool first (no locks, O(1))
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        // SuperSlab already HOT (pre-qualified)
        // No tier check needed, just carve
        carve_blocks_from_superslab(ss, class_idx,
                                    &g_unified_cache[class_idx]);
        return;
    }

    // 3. Fall back to registry scan (only if warm pool empty)
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* candidate = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(candidate)) {
            // Carve blocks
            carve_blocks_from_superslab(candidate, class_idx,
                                        &g_unified_cache[class_idx]);
            // Refill warm pool for the next miss
            // (look ahead: push up to 2 more HOT SuperSlabs)
            for (int j = i + 1; j < g_super_reg_by_class_count[class_idx] && j < i + 3; j++) {
                SuperSlab* extra = g_super_reg_by_class[class_idx][j];
                if (ss_tier_is_hot(extra)) {
                    tiny_warm_pool_push(class_idx, extra);
                }
            }
            return;
        }
    }

    // 4. Registry exhausted → cold path (allocate new SuperSlab)
    allocate_new_superslab_and_carve(class_idx);
}
```
### Step 4: Initialize Warm Pool in malloc_tiny_fast() (20 mins)
**File:** `core/front/malloc_tiny_fast.h`
Ensure warm pool is initialized on first malloc call:
```c
// In malloc_tiny_fast() or tiny_hot_alloc_fast()
// (need_init is a per-thread "not yet initialized" flag maintained by the caller;
//  tiny_warm_pool_init_once() also guards itself internally):
if (__builtin_expect(g_tiny_warm_pool[0].count == 0 && need_init, 0)) {
    tiny_warm_pool_init_once();
}
```
Or simpler: Let `unified_cache_refill()` call `tiny_warm_pool_init_once()` (as shown in Step 3).
### Step 5: Add to SuperSlab Cleanup (30 mins)
**File:** `core/hakmem_super_registry.h` or `core/hakmem_tiny.h`
When a SuperSlab becomes empty (no active objects), add it to the warm pool if there is room:
```c
// In ss_slab_meta free path (when last object freed):
if (ss_slab_meta_active_count(slab_meta) == 0) {
    // SuperSlab is now empty
    SuperSlab* ss = ss_from_slab_meta(slab_meta);
    int class_idx = ss_slab_meta_class_get(slab_meta);

    // Try to add to warm pool for next allocation
    if (!tiny_warm_pool_push(class_idx, ss)) {
        // Warm pool full, return to LRU cache
        ss_cache_put(ss);
    }
}
```
### Step 6: Add Optional Environment Variables (15 mins)
**File:** `core/hakmem_tiny.h` or `core/front/tiny_warm_pool.h`
```c
#include <stdlib.h>  // getenv, atoi

// Check warm pool size via environment (for tuning)
static inline int warm_pool_max_per_class(void) {
    static int max = -1;
    if (max == -1) {
        const char* env = getenv("HAKMEM_WARM_POOL_SIZE");
        if (env) {
            max = atoi(env);
            // Clamp to the slabs[] capacity to avoid overflowing the array
            if (max < 1 || max > TINY_WARM_POOL_MAX_PER_CLASS) max = TINY_WARM_POOL_MAX_PER_CLASS;
        } else {
            max = TINY_WARM_POOL_MAX_PER_CLASS;
        }
    }
    return max;
}

// Replace the Step 1 tiny_warm_pool_push() with this capacity-aware version:
static inline int tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    int capacity = warm_pool_max_per_class();
    if (g_tiny_warm_pool[class_idx].count < capacity) {
        g_tiny_warm_pool[class_idx].slabs[g_tiny_warm_pool[class_idx].count++] = ss;
        return 1;
    }
    return 0;
}
```
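For example, shrinking the pool during the Random Mixed benchmark to see how sensitive the hit rate is (note that values above `TINY_WARM_POOL_MAX_PER_CLASS` are clamped to the array capacity, so going beyond 4 also means raising that constant):
```bash
HAKMEM_WARM_POOL_SIZE=2 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
```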
---
## 🔍 Testing Checklist
### Unit Tests
```c
// In test/test_warm_pool.c (NEW)
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include "tiny_warm_pool.h"

// Thread bodies for the isolation test (see sketch after this block)
void* thread_func_1(void* arg);
void* thread_func_2(void* arg);

void test_warm_pool_pop_empty() {
    // Verify pop on empty pool returns NULL
    SuperSlab* ss = tiny_warm_pool_pop(0);
    assert(ss == NULL);
}

void test_warm_pool_push_pop() {
    // Verify push then pop returns the same pointer
    SuperSlab* test_ss = (SuperSlab*)0x123456;
    tiny_warm_pool_push(0, test_ss);
    SuperSlab* popped = tiny_warm_pool_pop(0);
    assert(popped == test_ss);
}

void test_warm_pool_capacity() {
    // Verify pool respects capacity
    for (int i = 0; i < TINY_WARM_POOL_MAX_PER_CLASS + 1; i++) {
        SuperSlab* ss = (SuperSlab*)malloc(sizeof(SuperSlab));
        int pushed = tiny_warm_pool_push(0, ss);
        if (i < TINY_WARM_POOL_MAX_PER_CLASS) {
            assert(pushed == 1); // Should succeed
        } else {
            assert(pushed == 0); // Should fail when full
        }
    }
}

void test_warm_pool_per_thread() {
    // Verify thread isolation
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_func_1, NULL);
    pthread_create(&t2, NULL, thread_func_2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    // Each thread should have independent warm pools
}
```
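The per-thread test above relies on `thread_func_1` / `thread_func_2`, which this guide does not define. A minimal sketch, assuming each thread only needs to prove that it sees its own (initially empty) pool; the sentinel pointer values are arbitrary test markers, and the functions must appear (or be declared) before `test_warm_pool_per_thread()` in the test file:
```c
// Hypothetical thread bodies for test_warm_pool_per_thread():
// each thread pushes its own sentinel into class 0 and must get it back,
// showing the TLS pools of the two threads do not interfere.
void* thread_func_1(void* arg) {
    (void)arg;
    SuperSlab* mine = (SuperSlab*)0x1111;
    assert(tiny_warm_pool_pop(0) == NULL);   // fresh thread: pool starts empty
    tiny_warm_pool_push(0, mine);
    assert(tiny_warm_pool_pop(0) == mine);   // sees only its own pool
    return NULL;
}

void* thread_func_2(void* arg) {
    (void)arg;
    SuperSlab* mine = (SuperSlab*)0x2222;
    assert(tiny_warm_pool_pop(0) == NULL);
    tiny_warm_pool_push(0, mine);
    assert(tiny_warm_pool_pop(0) == mine);
    return NULL;
}
```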
### Integration Tests
```bash
# Run existing benchmark suite
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
# Compare before/after:
# Before: 1.06M ops/s
# After:  1.5M+ ops/s (target +40%)
# Run other benchmarks to verify no regression
./bench_allocators_hakmem bench_tiny_hot # Should be ~89M ops/s
./bench_allocators_hakmem bench_tiny_cold # Should be similar
./bench_allocators_hakmem bench_random_mid # Should improve
```
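Optionally, repeat the Random Mixed run across a few seeds to confirm the gain is not seed-specific (this assumes the trailing `42` above is the RNG seed):
```bash
for seed in 1 7 42 1337; do
    ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 "$seed"
done
```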
### Performance Metrics
```bash
# With perf profiling
HAKMEM_WARM_POOL_SIZE=4 perf record -F 5000 -e cycles \
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
# Expected to see:
# - Fewer unified_cache_refill calls
# - Reduced registry scan overhead
# - Increased warm pool pop hits
```
---
## 📊 Success Criteria
| Metric | Current | Target | Status |
|--------|---------|--------|--------|
| Random Mixed ops/s | 1.06M | 1.5M+ | ✓ Target |
| Warm pool hit rate | N/A | > 90% | ✓ New metric |
| Tiny Hot ops/s | 89M | 89M | ✓ No regression |
| Memory per thread | ~256KB | < 400KB | Acceptable |
| All tests pass | ✓ | ✓ | ✓ Verify |
---
## 🚀 Quick Build & Test
```bash
# After code changes, compile and test:
cd /mnt/workdisk/public_share/hakmem
# Build
make clean && make
# Test warm pool directly
make test_warm_pool
./test_warm_pool
# Benchmark
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
# Profile
perf record -F 5000 -e cycles \
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
perf report
```
---
## 🔧 Debugging Tips
### Verify Warm Pool is Active
Add debug output to warm pool operations:
```c
#include <stdio.h>  // fprintf

#if !HAKMEM_BUILD_RELEASE
// Debug wrapper: same semantics as tiny_warm_pool_pop(), but logs hits.
// Returns the popped SuperSlab so the caller can still carve from it.
static SuperSlab* warm_pool_pop_debug(int class_idx) {
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        fprintf(stderr, "[WarmPool] Pop class=%d, count=%d\n",
                class_idx, g_tiny_warm_pool[class_idx].count);
    }
    return ss;
}
#endif
```
### Check Warm Pool Hit Rate
```c
// Per-thread counters (no atomics needed; each thread counts its own pool)
__thread uint64_t g_warm_pool_hits = 0;
__thread uint64_t g_warm_pool_misses = 0;

// In unified_cache_refill():
SuperSlab* ss = tiny_warm_pool_pop(class_idx);
if (ss) {
    g_warm_pool_hits++;    // Hit: carve from ss
} else {
    g_warm_pool_misses++;  // Miss: fall back to registry scan
}

// Print at end of benchmark
fprintf(stderr, "Warm pool: %lu hits, %lu misses (%.1f%% hit rate)\n",
        (unsigned long)g_warm_pool_hits, (unsigned long)g_warm_pool_misses,
        100.0 * g_warm_pool_hits / (g_warm_pool_hits + g_warm_pool_misses));
```
### Measure Registry Scan Reduction
Profile before/after to verify:
- Fewer calls to registry scan loop
- Reduced cycles in `unified_cache_refill()`
- Increased warm pool pop calls
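One way to quantify this is `perf diff` over a before/after pair of recordings (a sketch using the same benchmark invocation as above; rebuild with the warm pool between the two runs):
```bash
# Baseline profile (without warm pool)
perf record -o perf.before.data -F 5000 -e cycles -- \
    ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
# ... rebuild with the warm pool enabled, then record again ...
perf record -o perf.after.data -F 5000 -e cycles -- \
    ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
# Per-symbol cycle deltas; unified_cache_refill should shrink noticeably
perf diff perf.before.data perf.after.data
```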
---
## 📝 Commit Message Template
```
Add warm pool optimization for 40% performance improvement
- New: tiny_warm_pool.h with per-thread SuperSlab pools
- Modify: unified_cache_refill() to use warm pool (O(1) pop)
- Modify: SuperSlab cleanup to add to warm pool
- Env: HAKMEM_WARM_POOL_SIZE for tuning (default: 4)
Benefits:
- Eliminates registry O(N) scan on cache miss
- 40-50% improvement on Random Mixed (1.06M → 1.5M+ ops/s)
- No regression in other workloads
- Minimal per-thread memory overhead (<200KB)
Testing:
- Unit tests for warm pool operations
- Benchmark validation: Random Mixed +40%
- No regression in Tiny Hot, Tiny Cold
- Thread safety verified
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
```
---
## 🎓 Key Design Decisions
### Why 4 SuperSlabs per Class?
```
Trade-off: Working set size vs warm pool effectiveness
Too small (1-2):
- Less memory: ✓
- High miss rate: ✗ (frequently falls back to registry)
Right size (4):
- Memory: ~8-16 KB per class × 32 classes ≈ 256-512 KB
- Hit rate: ~90% (captures typical working set)
- Sweet spot: ✓
Too large (8+):
- More memory: ✗ (unnecessary TLS bloat)
- Marginal benefit: ✗ (diminishing returns)
```
### Why Thread-Local Storage?
```
Options:
1. Global pool (lock-protected) → Contention
2. Per-thread pool (TLS) → No locks, thread-safe ✓
3. Hybrid (mostly TLS) → Complexity
Chosen: Per-thread TLS
- Fast path: No locks
- Correctness: Thread-safe by design
- Simplicity: No synchronization needed
```
### Why Batched Tier Check?
```
Current: Check tier on every refill (expensive)
Proposed: Check tier periodically (every 64 pops)
Cost:
- Rare case: SuperSlab changes tier while in warm pool
- Detection: Caught on next batch check (within ~64 operations)
- Fallback: Registry scan still validates
Benefit:
- Reduces unnecessary tier checks
- Improves cache refill performance
```
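Note that this batching is listed here as a design option, not as one of the implementation steps above. A minimal sketch of how it could sit on top of the pop path (the `g_warm_pool_pop_ticks` counter and the 64-pop interval are illustrative assumptions, not existing code):
```c
// Sketch: re-validate pooled SuperSlabs only every 64 pops instead of on
// every refill; reuses tiny_warm_pool_pop() and ss_tier_is_hot() from above.
static __thread uint32_t g_warm_pool_pop_ticks = 0;  // hypothetical counter

static inline SuperSlab* tiny_warm_pool_pop_checked(int class_idx) {
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (!ss) return NULL;

    // Periodic batch check: the tier test runs on 1 out of 64 pops.
    if ((++g_warm_pool_pop_ticks & 63u) == 0 && !ss_tier_is_hot(ss)) {
        // SuperSlab cooled down while pooled: treat this as a miss so the
        // caller falls back to the registry scan, which re-validates tiers.
        return NULL;
    }
    return ss;
}
```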
---
## 📚 Related Files
**Core Implementation:**
- `core/front/tiny_warm_pool.h` (NEW - this guide)
- `core/front/tiny_unified_cache.h` (MODIFY - call warm pool)
- `core/front/malloc_tiny_fast.h` (MODIFY - init warm pool)
**Supporting:**
- `core/hakmem_super_registry.h` (UNDERSTAND - how registry works)
- `core/box/ss_tier_box.h` (UNDERSTAND - tier management)
- `core/superslab/superslab_types.h` (REFERENCE - SuperSlab struct)
**Testing:**
- `bench_allocators_hakmem` (BENCHMARK)
- `test/test_*.c` (ADD warm pool tests)
---
## ✅ Implementation Checklist
- [ ] Create `core/front/tiny_warm_pool.h`
- [ ] Declare `__thread g_tiny_warm_pool[]`
- [ ] Modify `unified_cache_refill()` in `tiny_unified_cache.h`
- [ ] Add `tiny_warm_pool_init_once()` call in malloc hot path
- [ ] Add warm pool push on SuperSlab cleanup
- [ ] Add optional environment variable tuning
- [ ] Write unit tests for warm pool operations
- [ ] Compile and verify no errors
- [ ] Run benchmark: Random Mixed ops/s improvement
- [ ] Verify no regression in other workloads
- [ ] Measure warm pool hit rate (target > 90%)
- [ ] Profile CPU cycles (target ~40-50% reduction)
- [ ] Create commit with summary above
- [ ] Update documentation if needed
---
## 📞 Questions or Issues?
If you encounter:
1. **Compilation errors:** Check includes, particularly `superslab_types.h`
2. **Low hit rate (<80%):** Increase pool size via `HAKMEM_WARM_POOL_SIZE` (raise `TINY_WARM_POOL_MAX_PER_CLASS` too if you need more than 4 slots)
3. **Memory bloat:** Verify pool size is <= 4 slots per class
4. **No performance gain:** Check warm pool is actually being used (add debug output)
5. **Regression in other tests:** Verify registry fallback path still works
---
**Status:** Ready to implement
**Expected Timeline:** 2-3 development days
**Estimated Performance Gain:** +40-50% (1.06M → 1.5M+ ops/s)