# Warm Pool Implementation - Quick-Start Guide
## 2025-12-04

---

## 🎯 TL;DR

**Objective:** Add per-thread warm SuperSlab pools to eliminate the registry scan on cache miss.

**Expected Result:** +40-50% performance (1.06M → 1.5M+ ops/s)

**Code Changes:** ~300 lines total
- 1 new header file (80 lines)
- 3 files modified (unified_cache, malloc_tiny_fast, superslab_registry)

**Time Estimate:** 2-3 days

---

## 📋 Implementation Roadmap

### Step 1: Create Warm Pool Header (30 mins)

**File:** `core/front/tiny_warm_pool.h` (NEW)

```c
#ifndef HAK_TINY_WARM_POOL_H
#define HAK_TINY_WARM_POOL_H

#include <stddef.h>  // NULL
#include <stdint.h>  // int32_t
#include "../hakmem_tiny_config.h"
#include "../superslab/superslab_types.h"

// Maximum warm SuperSlabs per thread per class
#define TINY_WARM_POOL_MAX_PER_CLASS 4

typedef struct {
    SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int32_t count;
} TinyWarmPool;

// Per-thread warm pool (one per class)
extern __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];

// Initialize once per thread (lazy)
static inline void tiny_warm_pool_init_once(void) {
    static __thread int initialized = 0;
    if (!initialized) {
        for (int i = 0; i < TINY_NUM_CLASSES; i++) {
            g_tiny_warm_pool[i].count = 0;
        }
        initialized = 1;
    }
}

// O(1) pop from warm pool (LIFO)
// Returns: the most recently pushed SuperSlab, or NULL if the pool is empty
static inline SuperSlab* tiny_warm_pool_pop(int class_idx) {
    if (g_tiny_warm_pool[class_idx].count > 0) {
        return g_tiny_warm_pool[class_idx].slabs[--g_tiny_warm_pool[class_idx].count];
    }
    return NULL;
}

// O(1) push to warm pool
// Returns: 1 if pushed, 0 if pool full (caller should free to LRU)
static inline int tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    if (g_tiny_warm_pool[class_idx].count < TINY_WARM_POOL_MAX_PER_CLASS) {
        g_tiny_warm_pool[class_idx].slabs[g_tiny_warm_pool[class_idx].count++] = ss;
        return 1;
    }
    return 0;
}

// Get current count (for metrics)
static inline int tiny_warm_pool_count(int class_idx) {
    return g_tiny_warm_pool[class_idx].count;
}

#endif // HAK_TINY_WARM_POOL_H
```

### Step 2: Declare Thread-Local Variable (5 mins)

**File:** `core/front/malloc_tiny_fast.h` (or `tiny_warm_pool.h`)

The header from Step 1 only carries the `extern` declaration. Define the variable in exactly one source file (e.g., `core/hakmem_tiny.c` or a new `core/front/tiny_warm_pool.c`):

```c
#include "tiny_warm_pool.h"

// Per-thread warm pools (one array per class)
__thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES] = {0};
```

### Step 3: Modify unified_cache_refill() (60 mins)

**File:** `core/front/tiny_unified_cache.h`

**Current Implementation:**

```c
static inline void unified_cache_refill(int class_idx) {
    // Find first HOT SuperSlab in per-class registry
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* ss = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(ss)) {
            // Carve and refill cache
            carve_blocks_from_superslab(ss, class_idx, &g_unified_cache[class_idx]);
            return;
        }
    }
    // Not found → cold path (allocate new SuperSlab)
    allocate_new_superslab_and_carve(class_idx);
}
```

**New Implementation (with Warm Pool):**

```c
#include "tiny_warm_pool.h"

static inline void unified_cache_refill(int class_idx) {
    // 1. Initialize warm pool on first use (per-thread)
    tiny_warm_pool_init_once();

    // 2. Try warm pool first (no locks, O(1))
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        // SuperSlab already HOT (pre-qualified)
        // No tier check needed, just carve
        carve_blocks_from_superslab(ss, class_idx, &g_unified_cache[class_idx]);
        return;
    }

    // 3. Fall back to registry scan (only if warm pool empty)
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* candidate = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(candidate)) {
            // Carve blocks
            carve_blocks_from_superslab(candidate, class_idx, &g_unified_cache[class_idx]);

            // Refill warm pool for next miss
            // (look ahead at the next 2 registry entries and keep the HOT ones)
            for (int j = i + 1; j < g_super_reg_by_class_count[class_idx] && j < i + 3; j++) {
                SuperSlab* extra = g_super_reg_by_class[class_idx][j];
                if (ss_tier_is_hot(extra)) {
                    tiny_warm_pool_push(class_idx, extra);
                }
            }
            return;
        }
    }

    // 4. Registry exhausted → cold path (allocate new SuperSlab)
    allocate_new_superslab_and_carve(class_idx);
}
```

### Step 4: Initialize Warm Pool in malloc_tiny_fast() (20 mins)

**File:** `core/front/malloc_tiny_fast.h`

Ensure the warm pool is initialized on the first malloc call:

```c
// In malloc_tiny_fast() or tiny_hot_alloc_fast():
// `need_init` stands for whatever per-thread "not yet initialized" flag
// the hot path already tracks.
if (__builtin_expect(g_tiny_warm_pool[0].count == 0 && need_init, 0)) {
    tiny_warm_pool_init_once();
}
```

Or simpler: let `unified_cache_refill()` call `tiny_warm_pool_init_once()` (as shown in Step 3).
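The refill order from Step 3 (warm pool first, then registry scan with look-ahead, then cold path) can be tried out in isolation. The sketch below is a minimal, self-contained model and not the allocator's code: every `Demo*`/`demo_*` name, the `hot` flag, and the array sizes are hypothetical stand-ins for `SuperSlab`, `ss_tier_is_hot()`, and the real per-class registry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real types live in superslab_types.h. */
#define DEMO_NUM_CLASSES   4
#define DEMO_WARM_POOL_MAX 4

typedef struct { int hot; } DemoSuperSlab;   /* hot != 0 models ss_tier_is_hot() */

typedef struct {
    DemoSuperSlab* slabs[DEMO_WARM_POOL_MAX];
    int32_t count;
} DemoWarmPool;

/* One pool per class, private to each thread (zero-initialized TLS). */
static __thread DemoWarmPool demo_pool[DEMO_NUM_CLASSES];

typedef enum { DEMO_WARM_HIT, DEMO_REGISTRY_HIT, DEMO_COLD_PATH } DemoRefillSource;

/* LIFO pop: returns NULL when the pool for this class is empty. */
static DemoSuperSlab* demo_pop(int class_idx) {
    DemoWarmPool* p = &demo_pool[class_idx];
    return (p->count > 0) ? p->slabs[--p->count] : NULL;
}

/* Push: returns 1 on success, 0 when the pool is already full. */
static int demo_push(int class_idx, DemoSuperSlab* ss) {
    DemoWarmPool* p = &demo_pool[class_idx];
    if (p->count >= DEMO_WARM_POOL_MAX) return 0;
    p->slabs[p->count++] = ss;
    return 1;
}

/* Models unified_cache_refill(): reports where the slab came from. */
static DemoRefillSource demo_refill(int class_idx, DemoSuperSlab* registry,
                                    int registry_len, DemoSuperSlab** out) {
    assert(class_idx >= 0 && class_idx < DEMO_NUM_CLASSES);

    /* 1. Warm pool first: O(1), no scan. */
    DemoSuperSlab* ss = demo_pop(class_idx);
    if (ss) { *out = ss; return DEMO_WARM_HIT; }

    /* 2. Registry scan: take the first hot slab, then look at the next
     *    two entries and stash the hot ones for later misses. */
    for (int i = 0; i < registry_len; i++) {
        if (!registry[i].hot) continue;
        *out = &registry[i];
        for (int j = i + 1; j < registry_len && j < i + 3; j++) {
            if (registry[j].hot) (void)demo_push(class_idx, &registry[j]);
        }
        return DEMO_REGISTRY_HIT;
    }

    /* 3. Nothing hot: the caller would allocate a new SuperSlab. */
    *out = NULL;
    return DEMO_COLD_PATH;
}
```

With a registry of `{cold, hot, hot, hot}`, the first `demo_refill()` call scans the registry and primes the pool with the two following hot slabs; the next two calls are warm-pool hits served in LIFO order, and only the fourth call scans again.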
### Step 5: Add to SuperSlab Cleanup (30 mins)

**File:** `core/hakmem_super_registry.h` or `core/hakmem_tiny.h`

When a SuperSlab becomes empty (no active objects), add it to the warm pool if there is room:

```c
// In ss_slab_meta free path (when last object freed):
if (ss_slab_meta_active_count(slab_meta) == 0) {
    // SuperSlab is now empty
    SuperSlab* ss = ss_from_slab_meta(slab_meta);
    int class_idx = ss_slab_meta_class_get(slab_meta);

    // Try to add to warm pool for next allocation
    if (!tiny_warm_pool_push(class_idx, ss)) {
        // Warm pool full, return to LRU cache
        ss_cache_put(ss);
    }
}
```

### Step 6: Add Optional Environment Variables (15 mins)

**File:** `core/hakmem_tiny.h` or `core/front/tiny_warm_pool.h`

```c
#include <stdlib.h>  // getenv, atoi

// Check warm pool size via environment (for tuning)
static inline int warm_pool_max_per_class(void) {
    static int max = -1;
    if (max == -1) {
        const char* env = getenv("HAKMEM_WARM_POOL_SIZE");
        if (env) {
            max = atoi(env);
            // slabs[] has TINY_WARM_POOL_MAX_PER_CLASS entries, so a larger
            // value would overflow the array; fall back to the default.
            if (max < 1 || max > TINY_WARM_POOL_MAX_PER_CLASS) max = TINY_WARM_POOL_MAX_PER_CLASS;
        } else {
            max = TINY_WARM_POOL_MAX_PER_CLASS;
        }
    }
    return max;
}

// Use in tiny_warm_pool_push():
static inline int tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    int capacity = warm_pool_max_per_class();
    if (g_tiny_warm_pool[class_idx].count < capacity) {
        g_tiny_warm_pool[class_idx].slabs[g_tiny_warm_pool[class_idx].count++] = ss;
        return 1;
    }
    return 0;
}
```

---

## 🔍 Testing Checklist

### Unit Tests

```c
// In test/test_warm_pool.c (NEW)
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include "tiny_warm_pool.h"

void test_warm_pool_pop_empty(void) {
    // Verify pop on empty returns NULL
    SuperSlab* ss = tiny_warm_pool_pop(0);
    assert(ss == NULL);
}

void test_warm_pool_push_pop(void) {
    // Verify push then pop returns same pointer (never dereferenced)
    SuperSlab* test_ss = (SuperSlab*)0x123456;
    tiny_warm_pool_push(0, test_ss);
    SuperSlab* popped = tiny_warm_pool_pop(0);
    assert(popped == test_ss);
}

void test_warm_pool_capacity(void) {
    // Verify pool respects capacity (assumes the class-0 pool starts empty)
    for (int i = 0; i < TINY_WARM_POOL_MAX_PER_CLASS + 1; i++) {
        SuperSlab* ss = (SuperSlab*)malloc(sizeof(SuperSlab));
        int pushed = tiny_warm_pool_push(0, ss);
        if (i < TINY_WARM_POOL_MAX_PER_CLASS) {
            assert(pushed == 1);  // Should succeed
        } else {
            assert(pushed == 0);  // Should fail when full
            free(ss);             // Rejected slab is still ours to release
        }
    }
    // Drain the pool so later tests start empty and nothing leaks
    SuperSlab* ss;
    while ((ss = tiny_warm_pool_pop(0)) != NULL) free(ss);
}

void test_warm_pool_per_thread(void) {
    // Verify thread isolation
    // (thread_func_1 / thread_func_2 are to be supplied by the test file)
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_func_1, NULL);
    pthread_create(&t2, NULL, thread_func_2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    // Each thread should have independent warm pools
}
```

### Integration Tests

```bash
# Run existing benchmark suite
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42

# Compare before/after:
#   Before: 1.06M ops/s
#   After:  1.5M+ ops/s (target +40%)

# Run other benchmarks to verify no regression
./bench_allocators_hakmem bench_tiny_hot    # Should be ~89M ops/s
./bench_allocators_hakmem bench_tiny_cold   # Should be similar
./bench_allocators_hakmem bench_random_mid  # Should improve
```

### Performance Metrics

```bash
# With perf profiling
HAKMEM_WARM_POOL_SIZE=4 perf record -F 5000 -e cycles \
  ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42

# Expected to see:
# - Fewer samples in unified_cache_refill
# - Reduced registry scan overhead
# - Increased warm pool pop hits
```

---

## 📊 Success Criteria

| Metric | Current | Target | Status |
|--------|---------|--------|--------|
| Random Mixed ops/s | 1.06M | 1.5M+ | ✓ Target |
| Warm pool hit rate | N/A | > 90% | ✓ New metric |
| Tiny Hot ops/s | 89M | 89M | ✓ No regression |
| Memory per thread | ~256KB | < 400KB | ✓ Acceptable |
| All tests pass | ✓ | ✓ | ✓ Verify |

---

## 🚀 Quick Build & Test

```bash
# After code changes, compile and test:
cd /mnt/workdisk/public_share/hakmem

# Build
make clean && make

# Test warm pool directly
make test_warm_pool
./test_warm_pool

# Benchmark
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42

# Profile
perf record -F 5000 -e cycles \
  ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
perf report
```

---

## 🔧 Debugging Tips

### Verify Warm Pool is Active

Add debug output to warm pool operations:

```c
#include <stdio.h>

#if !HAKMEM_BUILD_RELEASE
// Debug wrapper: logs each successful pop and still hands the SuperSlab back
static SuperSlab* warm_pool_pop_debug(int class_idx) {
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        fprintf(stderr, "[WarmPool] Pop class=%d, count=%d\n",
                class_idx, (int)g_tiny_warm_pool[class_idx].count);
    }
    return ss;
}
#endif
```

### Check Warm Pool Hit Rate

```c
#include <stdint.h>
#include <stdio.h>

// Per-thread counters (thread-local, so no atomics needed)
__thread uint64_t g_warm_pool_hits = 0;
__thread uint64_t g_warm_pool_misses = 0;

// Add to refill, right after the pop in unified_cache_refill():
SuperSlab* ss = tiny_warm_pool_pop(class_idx);
if (ss) {
    g_warm_pool_hits++;    // Hit
} else {
    g_warm_pool_misses++;  // Miss
}

// Print at end of benchmark (reports the calling thread's counters)
uint64_t total = g_warm_pool_hits + g_warm_pool_misses;
fprintf(stderr, "Warm pool: %llu hits, %llu misses (%.1f%% hit rate)\n",
        (unsigned long long)g_warm_pool_hits,
        (unsigned long long)g_warm_pool_misses,
        total ? 100.0 * (double)g_warm_pool_hits / (double)total : 0.0);
```

### Measure Registry Scan Reduction

Profile before/after to verify:
- Fewer calls to registry scan loop
- Reduced cycles in `unified_cache_refill()`
- Increased warm pool pop calls

---

## 📝 Commit Message Template

```
Add warm pool optimization for 40% performance improvement

- New: tiny_warm_pool.h with per-thread SuperSlab pools
- Modify: unified_cache_refill() to use warm pool (O(1) pop)
- Modify: SuperSlab cleanup to add to warm pool
- Env: HAKMEM_WARM_POOL_SIZE for tuning (default: 4)

Benefits:
- Eliminates registry O(N) scan on cache miss
- 40-50% improvement on Random Mixed (1.06M → 1.5M+ ops/s)
- No regression in other workloads
- Minimal per-thread memory overhead (<200KB)

Testing:
- Unit tests for warm pool operations
- Benchmark validation: Random Mixed +40%
- No regression in Tiny Hot, Tiny Cold
- Thread safety verified

🤖 Generated with Claude Code

Co-Authored-By: Claude
```

---

## 🎓 Key Design Decisions

### Why 4 SuperSlabs per Class?
```
Trade-off: Working set size vs warm pool effectiveness

Too small (1-2):
  - Less memory: ✓
  - High miss rate: ✗ (frequently falls back to registry)

Right size (4):
  - Memory: ~8-32 KB per class × 32 classes = 256-1024 KB
  - Hit rate: ~90% (captures typical working set)
  - Sweet spot: ✓

Too large (8+):
  - More memory: ✗ (unnecessary TLS bloat)
  - Marginal benefit: ✗ (diminishing returns)
```

### Why Thread-Local Storage?

```
Options:
  1. Global pool (lock-protected) → Contention
  2. Per-thread pool (TLS)        → No locks, thread-safe ✓
  3. Hybrid (mostly TLS)          → Complexity

Chosen: Per-thread TLS
  - Fast path: No locks
  - Correctness: Thread-safe by design
  - Simplicity: No synchronization needed
```

### Why Batched Tier Check?

```
Current:  Check tier on every refill (expensive)
Proposed: Check tier periodically (every 64 pops)

Cost:
  - Rare case: SuperSlab changes tier while in warm pool
  - Detection: Caught on next batch check (at most 64 pops later)
  - Fallback: Registry scan still validates

Benefit:
  - Reduces unnecessary tier checks
  - Improves cache refill performance
```

---

## 📚 Related Files

**Core Implementation:**
- `core/front/tiny_warm_pool.h` (NEW - this guide)
- `core/front/tiny_unified_cache.h` (MODIFY - call warm pool)
- `core/front/malloc_tiny_fast.h` (MODIFY - init warm pool)

**Supporting:**
- `core/hakmem_super_registry.h` (UNDERSTAND - how registry works)
- `core/box/ss_tier_box.h` (UNDERSTAND - tier management)
- `core/superslab/superslab_types.h` (REFERENCE - SuperSlab struct)

**Testing:**
- `bench_allocators_hakmem` (BENCHMARK)
- `test/test_*.c` (ADD warm pool tests)

---

## ✅ Implementation Checklist

- [ ] Create `core/front/tiny_warm_pool.h`
- [ ] Declare `__thread g_tiny_warm_pool[]`
- [ ] Modify `unified_cache_refill()` in `tiny_unified_cache.h`
- [ ] Add `tiny_warm_pool_init_once()` call in malloc hot path
- [ ] Add warm pool push on SuperSlab cleanup
- [ ] Add optional environment variable tuning
- [ ] Write unit tests for warm pool operations
- [ ] Compile and verify no errors
- [ ] Run benchmark: Random Mixed ops/s improvement
- [ ] Verify no regression in other workloads
- [ ] Measure warm pool hit rate (target > 90%)
- [ ] Profile CPU cycles (target ~40-50% reduction)
- [ ] Create commit with summary above
- [ ] Update documentation if needed

---

## 📞 Questions or Issues?

If you encounter:

1. **Compilation errors:** Check includes, particularly `superslab_types.h`
2. **Low hit rate (<80%):** Increase pool size via `HAKMEM_WARM_POOL_SIZE` (going above the compiled `TINY_WARM_POOL_MAX_PER_CLASS` also requires raising that constant)
3. **Memory bloat:** Verify pool size is <= 4 slots per class
4. **No performance gain:** Check that the warm pool is actually being used (add debug output)
5. **Regression in other tests:** Verify the registry fallback path still works

---

**Status:** Ready to implement
**Expected Timeline:** 2-3 development days
**Estimated Performance Gain:** +40-50% (1.06M → 1.5M+ ops/s)
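
For troubleshooting items 2 and 4 above, two small helpers can make the tuning loop less error-prone: one that parses the `HAKMEM_WARM_POOL_SIZE` value defensively, and one that turns hit/miss counters into a percentage without dividing by zero. This is a minimal sketch under assumptions; `demo_parse_pool_size` and `demo_hit_rate_percent` are hypothetical names that do not exist in the codebase, and `strtol` is swapped in for the `atoi` call used in Step 6 so that trailing garbage can be rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a pool-size override (hypothetical helper).
 * Returns compiled_max when the text is missing, not a clean decimal
 * number, or outside [1, compiled_max]; otherwise returns the value. */
static int demo_parse_pool_size(const char* text, int compiled_max) {
    assert(compiled_max >= 1);
    if (text == NULL) return compiled_max;

    char* end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);

    /* Reject empty input, trailing characters, and overflow. */
    if (end == text || *end != '\0' || errno == ERANGE) return compiled_max;
    /* Never exceed the compiled array size. */
    if (value < 1 || value > compiled_max) return compiled_max;
    return (int)value;
}

/* Hit rate in percent (hypothetical helper); 0.0 when nothing was counted. */
static double demo_hit_rate_percent(uint64_t hits, uint64_t misses) {
    uint64_t total = hits + misses;
    if (total == 0) return 0.0;
    return 100.0 * (double)hits / (double)total;
}
```

A caller might use `demo_parse_pool_size(getenv("HAKMEM_WARM_POOL_SIZE"), TINY_WARM_POOL_MAX_PER_CLASS)` once at startup and compare `demo_hit_rate_percent(...)` against the 90% target after a benchmark run.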