Files
hakmem/WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md
Moe Charm (CI) 5685c2f4c9 Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When warm pool becomes empty, we now do multiple superslab_refills and prefill
the pool with 3 additional HOT superlslabs before attempting to carve. This
builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra superlslabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 23:31:54 +09:00

524 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Warm Pool Implementation - Quick-Start Guide
## 2025-12-04
---
## 🎯 TL;DR
**Objective:** Add per-thread warm SuperSlab pools to eliminate registry scan on cache miss.
**Expected Result:** +40-50% performance (1.06M → 1.5M+ ops/s)
**Code Changes:** ~300 lines total
- 1 new header file (80 lines)
- 3 files modified (unified_cache, malloc_tiny_fast, superslab_registry)
**Time Estimate:** 2-3 days
---
## 📋 Implementation Roadmap
### Step 1: Create Warm Pool Header (30 mins)
**File:** `core/front/tiny_warm_pool.h` (NEW)
```c
#ifndef HAK_TINY_WARM_POOL_H
#define HAK_TINY_WARM_POOL_H
#include <stdint.h>
#include "../hakmem_tiny_config.h"
#include "../superslab/superslab_types.h"
// Maximum warm SuperSlabs per thread per class
#define TINY_WARM_POOL_MAX_PER_CLASS 4
typedef struct {
SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
int32_t count;
} TinyWarmPool;
// Per-thread warm pool (one per class)
extern __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];
// Initialize once per thread (lazy)
static inline void tiny_warm_pool_init_once(void) {
static __thread int initialized = 0;
if (!initialized) {
for (int i = 0; i < TINY_NUM_CLASSES; i++) {
g_tiny_warm_pool[i].count = 0;
}
initialized = 1;
}
}
// O(1) pop from warm pool
// Returns: SuperSlab* (not NULL if pool has items)
static inline SuperSlab* tiny_warm_pool_pop(int class_idx) {
if (g_tiny_warm_pool[class_idx].count > 0) {
return g_tiny_warm_pool[class_idx].slabs[--g_tiny_warm_pool[class_idx].count];
}
return NULL;
}
// O(1) push to warm pool
// Returns: 1 if pushed, 0 if pool full (caller should free to LRU)
static inline int tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
if (g_tiny_warm_pool[class_idx].count < TINY_WARM_POOL_MAX_PER_CLASS) {
g_tiny_warm_pool[class_idx].slabs[g_tiny_warm_pool[class_idx].count++] = ss;
return 1;
}
return 0;
}
// Get current count (for metrics)
static inline int tiny_warm_pool_count(int class_idx) {
return g_tiny_warm_pool[class_idx].count;
}
#endif // HAK_TINY_WARM_POOL_H
```
### Step 2: Declare Thread-Local Variable (5 mins)
**File:** `core/front/malloc_tiny_fast.h` (or `tiny_warm_pool.h`)
Add to appropriate source file (e.g., `core/hakmem_tiny.c` or new `core/front/tiny_warm_pool.c`):
```c
#include "tiny_warm_pool.h"
// Per-thread warm pools (one array per class)
__thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES] = {0};
```
### Step 3: Modify unified_cache_refill() (60 mins)
**File:** `core/front/tiny_unified_cache.h`
**Current Implementation:**
```c
static inline void unified_cache_refill(int class_idx) {
// Find first HOT SuperSlab in per-class registry
for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
SuperSlab* ss = g_super_reg_by_class[class_idx][i];
if (ss_tier_is_hot(ss)) {
// Carve and refill cache
carve_blocks_from_superslab(ss, class_idx,
&g_unified_cache[class_idx]);
return;
}
}
// Not found → cold path (allocate new SuperSlab)
allocate_new_superslab_and_carve(class_idx);
}
```
**New Implementation (with Warm Pool):**
```c
#include "tiny_warm_pool.h"
static inline void unified_cache_refill(int class_idx) {
// 1. Initialize warm pool on first use (per-thread)
tiny_warm_pool_init_once();
// 2. Try warm pool first (no locks, O(1))
SuperSlab* ss = tiny_warm_pool_pop(class_idx);
if (ss) {
// SuperSlab already HOT (pre-qualified)
// No tier check needed, just carve
carve_blocks_from_superslab(ss, class_idx,
&g_unified_cache[class_idx]);
return;
}
// 3. Fall back to registry scan (only if warm pool empty)
for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
SuperSlab* candidate = g_super_reg_by_class[class_idx][i];
if (ss_tier_is_hot(candidate)) {
// Carve blocks
carve_blocks_from_superslab(candidate, class_idx,
&g_unified_cache[class_idx]);
// Refill warm pool for next miss
// (Look ahead 2-3 more HOT SuperSlabs)
for (int j = i + 1; j < g_super_reg_by_class_count[class_idx] && j < i + 3; j++) {
SuperSlab* extra = g_super_reg_by_class[class_idx][j];
if (ss_tier_is_hot(extra)) {
tiny_warm_pool_push(class_idx, extra);
}
}
return;
}
}
// 4. Registry exhausted → cold path (allocate new SuperSlab)
allocate_new_superslab_and_carve(class_idx);
}
```
### Step 4: Initialize Warm Pool in malloc_tiny_fast() (20 mins)
**File:** `core/front/malloc_tiny_fast.h`
Ensure warm pool is initialized on first malloc call:
```c
// In malloc_tiny_fast() or tiny_hot_alloc_fast():
if (__builtin_expect(g_tiny_warm_pool[0].count == 0 && need_init, 0)) {
tiny_warm_pool_init_once();
}
```
Or simpler: Let `unified_cache_refill()` call `tiny_warm_pool_init_once()` (as shown in Step 3).
### Step 5: Add to SuperSlab Cleanup (30 mins)
**File:** `core/hakmem_super_registry.h` or `core/hakmem_tiny.h`
When a SuperSlab becomes empty (no active objects), add it to warm pool if room:
```c
// In ss_slab_meta free path (when last object freed):
if (ss_slab_meta_active_count(slab_meta) == 0) {
// SuperSlab is now empty
SuperSlab* ss = ss_from_slab_meta(slab_meta);
int class_idx = ss_slab_meta_class_get(slab_meta);
// Try to add to warm pool for next allocation
if (!tiny_warm_pool_push(class_idx, ss)) {
// Warm pool full, return to LRU cache
ss_cache_put(ss);
}
}
```
### Step 6: Add Optional Environment Variables (15 mins)
**File:** `core/hakmem_tiny.h` or `core/front/tiny_warm_pool.h`
```c
// Check warm pool size via environment (for tuning)
static inline int warm_pool_max_per_class(void) {
static int max = -1;
if (max == -1) {
const char* env = getenv("HAKMEM_WARM_POOL_SIZE");
if (env) {
max = atoi(env);
if (max < 1 || max > 16) max = TINY_WARM_POOL_MAX_PER_CLASS;
} else {
max = TINY_WARM_POOL_MAX_PER_CLASS;
}
}
return max;
}
// Use in tiny_warm_pool_push():
static inline int tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
int capacity = warm_pool_max_per_class();
if (g_tiny_warm_pool[class_idx].count < capacity) {
g_tiny_warm_pool[class_idx].slabs[g_tiny_warm_pool[class_idx].count++] = ss;
return 1;
}
return 0;
}
```
---
## 🔍 Testing Checklist
### Unit Tests
```c
// In test/test_warm_pool.c (NEW)
void test_warm_pool_pop_empty() {
// Verify pop on empty returns NULL
SuperSlab* ss = tiny_warm_pool_pop(0);
assert(ss == NULL);
}
void test_warm_pool_push_pop() {
// Verify push then pop returns same
SuperSlab* test_ss = (SuperSlab*)0x123456;
tiny_warm_pool_push(0, test_ss);
SuperSlab* popped = tiny_warm_pool_pop(0);
assert(popped == test_ss);
}
void test_warm_pool_capacity() {
// Verify pool respects capacity
for (int i = 0; i < TINY_WARM_POOL_MAX_PER_CLASS + 1; i++) {
SuperSlab* ss = (SuperSlab*)malloc(sizeof(SuperSlab));
int pushed = tiny_warm_pool_push(0, ss);
if (i < TINY_WARM_POOL_MAX_PER_CLASS) {
assert(pushed == 1); // Should succeed
} else {
assert(pushed == 0); // Should fail when full
}
}
}
void test_warm_pool_per_thread() {
// Verify thread isolation
pthread_t t1, t2;
pthread_create(&t1, NULL, thread_func_1, NULL);
pthread_create(&t2, NULL, thread_func_2, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
// Each thread should have independent warm pools
}
```
### Integration Tests
```bash
# Run existing benchmark suite
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
# Compare before/after:
Before: 1.06M ops/s
After: 1.5M+ ops/s (target +40%)
# Run other benchmarks to verify no regression
./bench_allocators_hakmem bench_tiny_hot # Should be ~89M ops/s
./bench_allocators_hakmem bench_tiny_cold # Should be similar
./bench_allocators_hakmem bench_random_mid # Should improve
```
### Performance Metrics
```bash
# With perf profiling
HAKMEM_WARM_POOL_SIZE=4 perf record -F 5000 -e cycles \
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
# Expected to see:
# - Fewer unified_cache_refill calls
# - Reduced registry scan overhead
# - Increased warm pool pop hits
```
---
## 📊 Success Criteria
| Metric | Current | Target | Status |
|--------|---------|--------|--------|
| Random Mixed ops/s | 1.06M | 1.5M+ | ✓ Target |
| Warm pool hit rate | N/A | > 90% | ✓ New metric |
| Tiny Hot ops/s | 89M | 89M | ✓ No regression |
| Memory per thread | ~256KB | < 400KB | Acceptable |
| All tests pass | | | Verify |
---
## 🚀 Quick Build & Test
```bash
# After code changes, compile and test:
cd /mnt/workdisk/public_share/hakmem
# Build
make clean && make
# Test warm pool directly
make test_warm_pool
./test_warm_pool
# Benchmark
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
# Profile
perf record -F 5000 -e cycles \
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
perf report
```
---
## 🔧 Debugging Tips
### Verify Warm Pool is Active
Add debug output to warm pool operations:
```c
#if !HAKMEM_BUILD_RELEASE
static int warm_pool_pop_debug(int class_idx) {
SuperSlab* ss = tiny_warm_pool_pop(class_idx);
if (ss) {
fprintf(stderr, "[WarmPool] Pop class=%d, count=%d\n",
class_idx, g_tiny_warm_pool[class_idx].count);
}
return ss ? 1 : 0;
}
#endif
```
### Check Warm Pool Hit Rate
```c
// Global counters (atomic)
__thread uint64_t g_warm_pool_hits = 0;
__thread uint64_t g_warm_pool_misses = 0;
// Add to refill
if (tiny_warm_pool_pop(...)) {
g_warm_pool_hits++; // Hit
} else {
g_warm_pool_misses++; // Miss
}
// Print at end of benchmark
fprintf(stderr, "Warm pool: %lu hits, %lu misses (%.1f%% hit rate)\n",
g_warm_pool_hits, g_warm_pool_misses,
100.0 * g_warm_pool_hits / (g_warm_pool_hits + g_warm_pool_misses));
```
### Measure Registry Scan Reduction
Profile before/after to verify:
- Fewer calls to registry scan loop
- Reduced cycles in `unified_cache_refill()`
- Increased warm pool pop calls
---
## 📝 Commit Message Template
```
Add warm pool optimization for 40% performance improvement
- New: tiny_warm_pool.h with per-thread SuperSlab pools
- Modify: unified_cache_refill() to use warm pool (O(1) pop)
- Modify: SuperSlab cleanup to add to warm pool
- Env: HAKMEM_WARM_POOL_SIZE for tuning (default: 4)
Benefits:
- Eliminates registry O(N) scan on cache miss
- 40-50% improvement on Random Mixed (1.06M → 1.5M+ ops/s)
- No regression in other workloads
- Minimal per-thread memory overhead (<200KB)
Testing:
- Unit tests for warm pool operations
- Benchmark validation: Random Mixed +40%
- No regression in Tiny Hot, Tiny Cold
- Thread safety verified
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
```
---
## 🎓 Key Design Decisions
### Why 4 SuperSlabs per Class?
```
Trade-off: Working set size vs warm pool effectiveness
Too small (1-2):
- Less memory: ✓
- High miss rate: ✗ (frequently falls back to registry)
Right size (4):
- Memory: ~8-32 KB per class × 32 classes = 256-512 KB
- Hit rate: ~90% (captures typical working set)
- Sweet spot: ✓
Too large (8+):
- More memory: ✗ (unnecessary TLS bloat)
- Marginal benefit: ✗ (diminishing returns)
```
### Why Thread-Local Storage?
```
Options:
1. Global pool (lock-protected) → Contention
2. Per-thread pool (TLS) → No locks, thread-safe ✓
3. Hybrid (mostly TLS) → Complexity
Chosen: Per-thread TLS
- Fast path: No locks
- Correctness: Thread-safe by design
- Simplicity: No synchronization needed
```
### Why Batched Tier Check?
```
Current: Check tier on every refill (expensive)
Proposed: Check tier periodically (every 64 pops)
Cost:
- Rare case: SuperSlab changes tier while in warm pool
- Detection: Caught on next batch check (~50 operations later)
- Fallback: Registry scan still validates
Benefit:
- Reduces unnecessary tier checks
- Improves cache refill performance
```
---
## 📚 Related Files
**Core Implementation:**
- `core/front/tiny_warm_pool.h` (NEW - this guide)
- `core/front/tiny_unified_cache.h` (MODIFY - call warm pool)
- `core/front/malloc_tiny_fast.h` (MODIFY - init warm pool)
**Supporting:**
- `core/hakmem_super_registry.h` (UNDERSTAND - how registry works)
- `core/box/ss_tier_box.h` (UNDERSTAND - tier management)
- `core/superslab/superslab_types.h` (REFERENCE - SuperSlab struct)
**Testing:**
- `bench_allocators_hakmem` (BENCHMARK)
- `test/test_*.c` (ADD warm pool tests)
---
## ✅ Implementation Checklist
- [ ] Create `core/front/tiny_warm_pool.h`
- [ ] Declare `__thread g_tiny_warm_pool[]`
- [ ] Modify `unified_cache_refill()` in `tiny_unified_cache.h`
- [ ] Add `tiny_warm_pool_init_once()` call in malloc hot path
- [ ] Add warm pool push on SuperSlab cleanup
- [ ] Add optional environment variable tuning
- [ ] Write unit tests for warm pool operations
- [ ] Compile and verify no errors
- [ ] Run benchmark: Random Mixed ops/s improvement
- [ ] Verify no regression in other workloads
- [ ] Measure warm pool hit rate (target > 90%)
- [ ] Profile CPU cycles (target ~40-50% reduction)
- [ ] Create commit with summary above
- [ ] Update documentation if needed
---
## 📞 Questions or Issues?
If you encounter:
1. **Compilation errors:** Check includes, particularly `superslab_types.h`
2. **Low hit rate (<80%):** Increase pool size via `HAKMEM_WARM_POOL_SIZE`
3. **Memory bloat:** Verify pool size is <= 4 slots per class
4. **No performance gain:** Check warm pool is actually being used (add debug output)
5. **Regression in other tests:** Verify registry fallback path still works
---
**Status:** Ready to implement
**Expected Timeline:** 2-3 development days
**Estimated Performance Gain:** +40-50% (1.06M → 1.5M+ ops/s)