Problem: Warm pool had a ~0% hit rate (1 hit per 3976 misses) despite being implemented, so every cache miss went through an expensive superslab_refill registry scan.

Root Cause Analysis:
- Warm pool was initialized once and received a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- The next refill pushed another single slab, which was immediately exhausted
- The pool oscillated between 0 and 1 items, yielding a ~0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refills and prefill the pool with 3 additional HOT SuperSlabs before attempting to carve. This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect an empty pool
- Added prefill loop: when pool count == 0, load 3 extra SuperSlabs
- Store the extra slabs in the warm pool, keep 1 in TLS for immediate carving
- Track prefill events in the g_warm_pool_stats[].prefilled counter

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After: C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput stayed stable at ~4.1M ops/s
- Registry scan avoided on 55.6% of cache misses (significant savings)
- Warm pool now functions as intended, with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled; ENV-gated printing via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider an adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
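For clarity, a minimal sketch of the secondary-prefill cold path the commit above describes. The helper allocate_hot_superslab() and the budget macro are hypothetical shorthand for the real superslab_refill machinery; tiny_warm_pool_push(), tiny_warm_pool_count(), and g_warm_pool_stats[].prefilled are the names used in the commit.

// Hypothetical sketch of the secondary prefill in the refill cold path.
#define WARM_POOL_PREFILL_BUDGET 3  // hardcoded budget, per the commit

static SuperSlab* cold_path_refill(int class_idx) {
    SuperSlab* first = allocate_hot_superslab(class_idx);  // kept in TLS, carved immediately
    if (!first) return NULL;

    // Pool ran dry: stock it with extra HOT SuperSlabs so the next few
    // misses are O(1) pops instead of registry scans.
    if (tiny_warm_pool_count(class_idx) == 0) {
        for (int i = 0; i < WARM_POOL_PREFILL_BUDGET; i++) {
            SuperSlab* extra = allocate_hot_superslab(class_idx);
            if (!extra || !tiny_warm_pool_push(class_idx, extra)) break;
            g_warm_pool_stats[class_idx].prefilled++;  // always compiled; printed via HAKMEM_WARM_POOL_STATS=1
        }
    }
    return first;
}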
Warm Pool Implementation - Quick-Start Guide
2025-12-04
🎯 TL;DR
Objective: Add per-thread warm SuperSlab pools to eliminate registry scan on cache miss.
Expected Result: +40-50% performance (1.06M → 1.5M+ ops/s)
Code Changes: ~300 lines total
- 1 new header file (80 lines)
- 3 files modified (unified_cache, malloc_tiny_fast, superslab_registry)
Time Estimate: 2-3 days
📋 Implementation Roadmap
Step 1: Create Warm Pool Header (30 mins)
File: core/front/tiny_warm_pool.h (NEW)
#ifndef HAK_TINY_WARM_POOL_H
#define HAK_TINY_WARM_POOL_H
#include <stdint.h>
#include "../hakmem_tiny_config.h"
#include "../superslab/superslab_types.h"
// Maximum warm SuperSlabs per thread per class
#define TINY_WARM_POOL_MAX_PER_CLASS 4
typedef struct {
SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
int32_t count;
} TinyWarmPool;
// Per-thread warm pool (one per class)
extern __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];
// Initialize once per thread (lazy)
static inline void tiny_warm_pool_init_once(void) {
static __thread int initialized = 0;
if (!initialized) {
for (int i = 0; i < TINY_NUM_CLASSES; i++) {
g_tiny_warm_pool[i].count = 0;
}
initialized = 1;
}
}
// O(1) pop from warm pool
// Returns: a SuperSlab* if the pool has items, NULL if empty
static inline SuperSlab* tiny_warm_pool_pop(int class_idx) {
if (g_tiny_warm_pool[class_idx].count > 0) {
return g_tiny_warm_pool[class_idx].slabs[--g_tiny_warm_pool[class_idx].count];
}
return NULL;
}
// O(1) push to warm pool
// Returns: 1 if pushed, 0 if pool full (caller should free to LRU)
static inline int tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
if (g_tiny_warm_pool[class_idx].count < TINY_WARM_POOL_MAX_PER_CLASS) {
g_tiny_warm_pool[class_idx].slabs[g_tiny_warm_pool[class_idx].count++] = ss;
return 1;
}
return 0;
}
// Get current count (for metrics)
static inline int tiny_warm_pool_count(int class_idx) {
return g_tiny_warm_pool[class_idx].count;
}
#endif // HAK_TINY_WARM_POOL_H
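Note the pool behaves as a LIFO stack: pop returns the most recently pushed SuperSlab, which is the one most likely to still be cache-warm. A minimal illustration (the pointers are sentinels for illustration, not real slabs):

// LIFO behavior: last pushed, first popped.
SuperSlab* a = (SuperSlab*)0x1000;  // sentinel values, illustration only
SuperSlab* b = (SuperSlab*)0x2000;
tiny_warm_pool_init_once();
tiny_warm_pool_push(3, a);
tiny_warm_pool_push(3, b);
// tiny_warm_pool_pop(3) -> b    (most recently pushed, likely cache-warm)
// tiny_warm_pool_pop(3) -> a
// tiny_warm_pool_pop(3) -> NULL (empty again)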
Step 2: Declare Thread-Local Variable (5 mins)
File: one .c file (the extern declaration already lives in tiny_warm_pool.h from Step 1)
Add the definition to an appropriate source file (e.g., core/hakmem_tiny.c or a new core/front/tiny_warm_pool.c):
#include "tiny_warm_pool.h"
// Per-thread warm pools (one array per class)
__thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES] = {0};
Step 3: Modify unified_cache_refill() (60 mins)
File: core/front/tiny_unified_cache.h
Current Implementation:
static inline void unified_cache_refill(int class_idx) {
// Find first HOT SuperSlab in per-class registry
for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
SuperSlab* ss = g_super_reg_by_class[class_idx][i];
if (ss_tier_is_hot(ss)) {
// Carve and refill cache
carve_blocks_from_superslab(ss, class_idx,
&g_unified_cache[class_idx]);
return;
}
}
// Not found → cold path (allocate new SuperSlab)
allocate_new_superslab_and_carve(class_idx);
}
New Implementation (with Warm Pool):
#include "tiny_warm_pool.h"
static inline void unified_cache_refill(int class_idx) {
// 1. Initialize warm pool on first use (per-thread)
tiny_warm_pool_init_once();
// 2. Try warm pool first (no locks, O(1))
SuperSlab* ss = tiny_warm_pool_pop(class_idx);
if (ss) {
// SuperSlab already HOT (pre-qualified)
// No tier check needed, just carve
carve_blocks_from_superslab(ss, class_idx,
&g_unified_cache[class_idx]);
return;
}
// 3. Fall back to registry scan (only if warm pool empty)
for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
SuperSlab* candidate = g_super_reg_by_class[class_idx][i];
if (ss_tier_is_hot(candidate)) {
// Carve blocks
carve_blocks_from_superslab(candidate, class_idx,
&g_unified_cache[class_idx]);
            // Refill warm pool for the next miss:
            // look ahead at up to 2 more registry entries for HOT SuperSlabs
            for (int j = i + 1; j < g_super_reg_by_class_count[class_idx] && j < i + 3; j++) {
SuperSlab* extra = g_super_reg_by_class[class_idx][j];
if (ss_tier_is_hot(extra)) {
tiny_warm_pool_push(class_idx, extra);
}
}
return;
}
}
// 4. Registry exhausted → cold path (allocate new SuperSlab)
allocate_new_superslab_and_carve(class_idx);
}
Step 4: Initialize Warm Pool in malloc_tiny_fast() (20 mins)
File: core/front/malloc_tiny_fast.h
Ensure warm pool is initialized on first malloc call:
// In malloc_tiny_fast() or tiny_hot_alloc_fast():
// tiny_warm_pool_init_once() is idempotent (guarded by a __thread flag),
// so it can be called unconditionally on the hot path; no separate
// "need_init" bookkeeping is required.
tiny_warm_pool_init_once();
Or simpler: Let unified_cache_refill() call tiny_warm_pool_init_once() (as shown in Step 3).
Step 5: Add to SuperSlab Cleanup (30 mins)
File: core/hakmem_super_registry.h or core/hakmem_tiny.h
When a SuperSlab becomes empty (no active objects), add it to warm pool if room:
// In ss_slab_meta free path (when last object freed):
if (ss_slab_meta_active_count(slab_meta) == 0) {
// SuperSlab is now empty
SuperSlab* ss = ss_from_slab_meta(slab_meta);
int class_idx = ss_slab_meta_class_get(slab_meta);
// Try to add to warm pool for next allocation
if (!tiny_warm_pool_push(class_idx, ss)) {
// Warm pool full, return to LRU cache
ss_cache_put(ss);
}
}
Step 6: Add Optional Environment Variables (15 mins)
File: core/hakmem_tiny.h or core/front/tiny_warm_pool.h
#include <stdlib.h>  // getenv, atoi

// Check warm pool size via environment (for tuning).
// NOTE: the value is clamped to TINY_WARM_POOL_MAX_PER_CLASS, the
// compile-time size of the slabs[] array; accepting a larger runtime
// value would overflow it. Raise the macro if you need a bigger cap.
static inline int warm_pool_max_per_class(void) {
    static int max = -1;
    if (max == -1) {
        const char* env = getenv("HAKMEM_WARM_POOL_SIZE");
        int v = env ? atoi(env) : TINY_WARM_POOL_MAX_PER_CLASS;
        if (v < 1 || v > TINY_WARM_POOL_MAX_PER_CLASS) {
            v = TINY_WARM_POOL_MAX_PER_CLASS;
        }
        max = v;
    }
    return max;
}
// Use in tiny_warm_pool_push():
static inline int tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
int capacity = warm_pool_max_per_class();
if (g_tiny_warm_pool[class_idx].count < capacity) {
g_tiny_warm_pool[class_idx].slabs[g_tiny_warm_pool[class_idx].count++] = ss;
return 1;
}
return 0;
}
🔍 Testing Checklist
Unit Tests
// In test/test_warm_pool.c (NEW)
void test_warm_pool_pop_empty() {
// Verify pop on empty returns NULL
SuperSlab* ss = tiny_warm_pool_pop(0);
assert(ss == NULL);
}
void test_warm_pool_push_pop() {
// Verify push then pop returns same
SuperSlab* test_ss = (SuperSlab*)0x123456;
tiny_warm_pool_push(0, test_ss);
SuperSlab* popped = tiny_warm_pool_pop(0);
assert(popped == test_ss);
}
void test_warm_pool_capacity() {
// Verify pool respects capacity
for (int i = 0; i < TINY_WARM_POOL_MAX_PER_CLASS + 1; i++) {
SuperSlab* ss = (SuperSlab*)malloc(sizeof(SuperSlab));
int pushed = tiny_warm_pool_push(0, ss);
if (i < TINY_WARM_POOL_MAX_PER_CLASS) {
assert(pushed == 1); // Should succeed
} else {
assert(pushed == 0); // Should fail when full
}
}
}
void test_warm_pool_per_thread() {
// Verify thread isolation
pthread_t t1, t2;
pthread_create(&t1, NULL, thread_func_1, NULL);
pthread_create(&t2, NULL, thread_func_2, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
// Each thread should have independent warm pools
}
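thread_func_1 and thread_func_2 are left undefined above; a minimal sketch, where each thread checks that its own __thread pool starts empty and that a sentinel pushed in one thread never appears in the other:

#include <assert.h>
#include "tiny_warm_pool.h"

// Illustrative thread bodies for test_warm_pool_per_thread().
static void* thread_func_1(void* arg) {
    (void)arg;
    tiny_warm_pool_init_once();
    assert(tiny_warm_pool_pop(0) == NULL);       // this thread's pool starts empty
    tiny_warm_pool_push(0, (SuperSlab*)0x1111);  // sentinel, visible to this thread only
    assert(tiny_warm_pool_count(0) == 1);
    return NULL;
}

static void* thread_func_2(void* arg) {
    (void)arg;
    tiny_warm_pool_init_once();
    assert(tiny_warm_pool_pop(0) == NULL);       // must not see thread 1's push
    return NULL;
}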
Integration Tests
# Run existing benchmark suite
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
# Compare before/after:
#   Before: 1.06M ops/s
#   After:  1.5M+ ops/s (target +40%)
# Run other benchmarks to verify no regression
./bench_allocators_hakmem bench_tiny_hot # Should be ~89M ops/s
./bench_allocators_hakmem bench_tiny_cold # Should be similar
./bench_allocators_hakmem bench_random_mid # Should improve
Performance Metrics
# With perf profiling
HAKMEM_WARM_POOL_SIZE=4 perf record -F 5000 -e cycles \
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
# Expected to see:
# - Fewer unified_cache_refill calls
# - Reduced registry scan overhead
# - Increased warm pool pop hits
📊 Success Criteria
| Metric | Current | Target | Status |
|---|---|---|---|
| Random Mixed ops/s | 1.06M | 1.5M+ | ✓ Target |
| Warm pool hit rate | N/A | > 90% | ✓ New metric |
| Tiny Hot ops/s | 89M | 89M | ✓ No regression |
| Memory per thread | ~256KB | < 400KB | ✓ Acceptable |
| All tests pass | ✓ | ✓ | ✓ Verify |
🚀 Quick Build & Test
# After code changes, compile and test:
cd /mnt/workdisk/public_share/hakmem
# Build
make clean && make
# Test warm pool directly
make test_warm_pool
./test_warm_pool
# Benchmark
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
# Profile
perf record -F 5000 -e cycles \
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
perf report
🔧 Debugging Tips
Verify Warm Pool is Active
Add debug output to warm pool operations:
#if !HAKMEM_BUILD_RELEASE
static SuperSlab* warm_pool_pop_debug(int class_idx) {
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        fprintf(stderr, "[WarmPool] Pop class=%d, count=%d\n",
                class_idx, g_tiny_warm_pool[class_idx].count);
    }
    return ss;  // return the slab so this can drop in for tiny_warm_pool_pop()
}
#endif
Check Warm Pool Hit Rate
// Per-thread counters (__thread, so no atomics needed)
__thread uint64_t g_warm_pool_hits = 0;
__thread uint64_t g_warm_pool_misses = 0;
// Add to refill
if (tiny_warm_pool_pop(...)) {
g_warm_pool_hits++; // Hit
} else {
g_warm_pool_misses++; // Miss
}
// Print at end of benchmark (guard against division by zero)
uint64_t total = g_warm_pool_hits + g_warm_pool_misses;
if (total) {
    fprintf(stderr, "Warm pool: %lu hits, %lu misses (%.1f%% hit rate)\n",
            (unsigned long)g_warm_pool_hits, (unsigned long)g_warm_pool_misses,
            100.0 * g_warm_pool_hits / total);
}
Measure Registry Scan Reduction
Profile before/after to verify:
- Fewer calls to the registry scan loop
- Reduced cycles in unified_cache_refill()
- Increased warm pool pop calls
📝 Commit Message Template
Add warm pool optimization for 40% performance improvement
- New: tiny_warm_pool.h with per-thread SuperSlab pools
- Modify: unified_cache_refill() to use warm pool (O(1) pop)
- Modify: SuperSlab cleanup to add to warm pool
- Env: HAKMEM_WARM_POOL_SIZE for tuning (default: 4)
Benefits:
- Eliminates registry O(N) scan on cache miss
- 40-50% improvement on Random Mixed (1.06M → 1.5M+ ops/s)
- No regression in other workloads
- Minimal per-thread memory overhead (<200KB)
Testing:
- Unit tests for warm pool operations
- Benchmark validation: Random Mixed +40%
- No regression in Tiny Hot, Tiny Cold
- Thread safety verified
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
🎓 Key Design Decisions
Why 4 SuperSlabs per Class?
Trade-off: Working set size vs warm pool effectiveness
Too small (1-2):
- Less memory: ✓
- High miss rate: ✗ (frequently falls back to registry)
Right size (4):
- Memory: ~8-16 KB per class × 32 classes ≈ 256-512 KB (see the sizing check below)
- Hit rate: ~90% (captures typical working set)
- Sweet spot: ✓
Too large (8+):
- More memory: ✗ (unnecessary TLS bloat)
- Marginal benefit: ✗ (diminishing returns)
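Note these memory figures count the SuperSlabs the pool keeps referenced; the TLS bookkeeping itself is tiny. A quick sanity check (assuming 64-bit pointers and the 32 classes used above):

#include <stdio.h>
#include "tiny_warm_pool.h"

int main(void) {
    // slabs[4] = 32 B, plus int32_t count and padding: ~40 B per class.
    // Across 32 classes that is ~1.3 KB of TLS per thread, negligible
    // next to the SuperSlabs the pool merely points at.
    printf("TinyWarmPool: %zu B per class, %zu B per thread\n",
           sizeof(TinyWarmPool), sizeof(TinyWarmPool) * TINY_NUM_CLASSES);
    return 0;
}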
Why Thread-Local Storage?
Options:
1. Global pool (lock-protected) → Contention
2. Per-thread pool (TLS) → No locks, thread-safe ✓
3. Hybrid (mostly TLS) → Complexity
Chosen: Per-thread TLS
- Fast path: No locks
- Correctness: Thread-safe by design
- Simplicity: No synchronization needed
Why Batched Tier Check?
Current: Check tier on every refill (expensive)
Proposed: Check tier periodically (every 64 pops; see the sketch below)
Cost:
- Rare case: SuperSlab changes tier while in warm pool
- Detection: Caught on next batch check (~50 operations later)
- Fallback: Registry scan still validates
Benefit:
- Reduces unnecessary tier checks
- Improves cache refill performance
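A minimal sketch of the proposed batch check, assuming a hypothetical per-class pop counter; ss_tier_is_hot() and ss_cache_put() are the helpers already used in Steps 3 and 5:

// Re-validate tier only on every 64th pop; in between, trust that
// entries were HOT when pushed.
static __thread uint32_t g_warm_pop_epoch[TINY_NUM_CLASSES];

static inline SuperSlab* warm_pool_pop_checked(int class_idx) {
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (!ss) return NULL;
    if ((++g_warm_pop_epoch[class_idx] & 63u) == 0 && !ss_tier_is_hot(ss)) {
        ss_cache_put(ss);  // slab cooled while pooled: hand it back (as in Step 5)
        return NULL;       // caller falls through to the registry scan
    }
    return ss;
}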
📚 Related Files
Core Implementation:
- core/front/tiny_warm_pool.h (NEW - this guide)
- core/front/tiny_unified_cache.h (MODIFY - call warm pool)
- core/front/malloc_tiny_fast.h (MODIFY - init warm pool)

Supporting:
- core/hakmem_super_registry.h (UNDERSTAND - how registry works)
- core/box/ss_tier_box.h (UNDERSTAND - tier management)
- core/superslab/superslab_types.h (REFERENCE - SuperSlab struct)

Testing:
- bench_allocators_hakmem (BENCHMARK)
- test/test_*.c (ADD warm pool tests)
✅ Implementation Checklist
- Create core/front/tiny_warm_pool.h
- Declare __thread g_tiny_warm_pool[]
- Modify unified_cache_refill() in tiny_unified_cache.h
- Add tiny_warm_pool_init_once() call in malloc hot path
- Add warm pool push on SuperSlab cleanup
- Add optional environment variable tuning
- Write unit tests for warm pool operations
- Compile and verify no errors
- Run benchmark: Random Mixed ops/s improvement
- Verify no regression in other workloads
- Measure warm pool hit rate (target > 90%)
- Profile CPU cycles (target ~40-50% reduction)
- Create commit with summary above
- Update documentation if needed
📞 Questions or Issues?
If you encounter:
- Compilation errors: Check includes, particularly superslab_types.h
- Low hit rate (<80%): Increase pool size via HAKMEM_WARM_POOL_SIZE (raise TINY_WARM_POOL_MAX_PER_CLASS if you need more than the compile-time cap)
- Memory bloat: Verify pool size is <= 4 slots per class
- No performance gain: Check warm pool is actually being used (add debug output)
- Regression in other tests: Verify registry fallback path still works
Status: Ready to implement
Expected Timeline: 2-3 development days
Estimated Performance Gain: +40-50% (1.06M → 1.5M+ ops/s)