Commit 5685c2f4c9 (Moe Charm (CI), 2025-12-04 23:31:54 +09:00): Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refills and
prefill the pool with 3 additional HOT SuperSlabs before attempting to carve.
This builds a working set of slabs that can sustain allocation pressure.

Implementation Details (sketched below):
- Modified unified_cache_refill() cold path to detect an empty pool
- Added prefill loop: when pool count == 0, load 3 extra SuperSlabs
- Store the extra slabs in the warm pool; keep 1 in TLS for immediate carving
- Track prefill events in the g_warm_pool_stats[].prefilled counter
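
A minimal sketch of that cold-path prefill, assuming the illustrative names superslab_refill() and g_warm_pool_stats (tiny_warm_pool_push()/tiny_warm_pool_count() are defined later in this guide); the actual HAKMEM internals may differ:

// Hypothetical sketch of the secondary prefill in unified_cache_refill()'s
// cold path; superslab_refill() and g_warm_pool_stats are assumed names.
#define WARM_POOL_PREFILL_BUDGET 3

if (tiny_warm_pool_count(class_idx) == 0) {
    for (int i = 0; i < WARM_POOL_PREFILL_BUDGET; i++) {
        SuperSlab* extra = superslab_refill(class_idx);     // expensive path
        if (!extra) break;                                  // nothing to load
        if (!tiny_warm_pool_push(class_idx, extra)) break;  // pool full
        g_warm_pool_stats[class_idx].prefilled++;
    }
}
// One more refill is kept in TLS and carved immediately.

The budget of 3 matches the hardcoded value noted under Configuration below.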

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- Statistics are always compiled in; printing is ENV-gated via HAKMEM_WARM_POOL_STATS=1 (see the sketch below)
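
A hedged sketch of what that ENV-gated dump could look like (warm_pool_stats_dump() and the field layout of g_warm_pool_stats are assumptions, not the verified implementation):

// Hypothetical stats dump; counter field names are assumed.
static void warm_pool_stats_dump(void) {
    const char* env = getenv("HAKMEM_WARM_POOL_STATS");
    if (!env || env[0] != '1') return;  // printing is opt-in
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        fprintf(stderr, "[WarmPool] class=%d hits=%llu misses=%llu prefilled=%llu\n",
            c,
            (unsigned long long)g_warm_pool_stats[c].hits,
            (unsigned long long)g_warm_pool_stats[c].misses,
            (unsigned long long)g_warm_pool_stats[c].prefilled);
    }
}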

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates (see the sketch after this list)
- Validate at larger allocation counts (10M+ pending registry size fix)
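
One possible shape for that adaptive budget, purely illustrative (thresholds, the sample floor, and the g_warm_pool_stats fields are assumptions):

// Hypothetical per-class adaptive prefill budget driven by hit rate.
static inline int warm_pool_prefill_budget(int class_idx) {
    uint64_t hits  = g_warm_pool_stats[class_idx].hits;
    uint64_t total = hits + g_warm_pool_stats[class_idx].misses;
    if (total < 1024) return 3;                 // too few samples: default
    double rate = (double)hits / (double)total;
    if (rate < 0.40) return 6;                  // struggling: prefill deeper
    if (rate > 0.80) return 2;                  // healthy: prefill less
    return 3;
}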

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>


Warm Pool Implementation - Quick-Start Guide

2025-12-04


🎯 TL;DR

Objective: Add per-thread warm SuperSlab pools to eliminate registry scan on cache miss.

Expected Result: +40-50% performance (1.06M → 1.5M+ ops/s)

Code Changes: ~300 lines total

  • 1 new header file (80 lines)
  • 3 files modified (unified_cache, malloc_tiny_fast, superslab_registry)

Time Estimate: 2-3 days


📋 Implementation Roadmap

Step 1: Create Warm Pool Header (30 mins)

File: core/front/tiny_warm_pool.h (NEW)

#ifndef HAK_TINY_WARM_POOL_H
#define HAK_TINY_WARM_POOL_H

#include <stdint.h>
#include "../hakmem_tiny_config.h"
#include "../superslab/superslab_types.h"

// Maximum warm SuperSlabs per thread per class
#define TINY_WARM_POOL_MAX_PER_CLASS 4

typedef struct {
    SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int32_t count;
} TinyWarmPool;

// Per-thread warm pool (one per class)
extern __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];

// Initialize once per thread (lazy)
// Caveat: `initialized` is local to this static inline function, so each
// translation unit that inlines it gets its own copy; a later call from a
// second TU would re-zero the counts and silently drop pooled slabs (they
// stay in the registry, so no memory leaks, but warmth is lost). Storing
// the flag in TinyWarmPool itself avoids this.
static inline void tiny_warm_pool_init_once(void) {
    static __thread int initialized = 0;
    if (!initialized) {
        for (int i = 0; i < TINY_NUM_CLASSES; i++) {
            g_tiny_warm_pool[i].count = 0;
        }
        initialized = 1;
    }
}

// O(1) pop from warm pool
// Returns: a SuperSlab* if the pool has items, NULL otherwise
static inline SuperSlab* tiny_warm_pool_pop(int class_idx) {
    if (g_tiny_warm_pool[class_idx].count > 0) {
        return g_tiny_warm_pool[class_idx].slabs[--g_tiny_warm_pool[class_idx].count];
    }
    return NULL;
}

// O(1) push to warm pool
// Returns: 1 if pushed, 0 if pool full (caller should free to LRU)
static inline int tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    if (g_tiny_warm_pool[class_idx].count < TINY_WARM_POOL_MAX_PER_CLASS) {
        g_tiny_warm_pool[class_idx].slabs[g_tiny_warm_pool[class_idx].count++] = ss;
        return 1;
    }
    return 0;
}

// Get current count (for metrics)
static inline int tiny_warm_pool_count(int class_idx) {
    return g_tiny_warm_pool[class_idx].count;
}

#endif // HAK_TINY_WARM_POOL_H

Step 2: Declare Thread-Local Variable (5 mins)

File: core/hakmem_tiny.c or a new core/front/tiny_warm_pool.c

The __thread definition must live in exactly one source file; tiny_warm_pool.h keeps only the extern declaration:

#include "tiny_warm_pool.h"

// Per-thread warm pools (one array per class)
__thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES] = {0};

Step 3: Modify unified_cache_refill() (60 mins)

File: core/front/tiny_unified_cache.h

Current Implementation:

static inline void unified_cache_refill(int class_idx) {
    // Find first HOT SuperSlab in per-class registry
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* ss = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(ss)) {
            // Carve and refill cache
            carve_blocks_from_superslab(ss, class_idx,
                &g_unified_cache[class_idx]);
            return;
        }
    }
    // Not found → cold path (allocate new SuperSlab)
    allocate_new_superslab_and_carve(class_idx);
}

New Implementation (with Warm Pool):

#include "tiny_warm_pool.h"

static inline void unified_cache_refill(int class_idx) {
    // 1. Initialize warm pool on first use (per-thread)
    tiny_warm_pool_init_once();

    // 2. Try warm pool first (no locks, O(1))
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        // SuperSlab already HOT (pre-qualified)
        // No tier check needed, just carve
        carve_blocks_from_superslab(ss, class_idx,
            &g_unified_cache[class_idx]);
        return;
    }

    // 3. Fall back to registry scan (only if warm pool empty)
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* candidate = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(candidate)) {
            // Carve blocks
            carve_blocks_from_superslab(candidate, class_idx,
                &g_unified_cache[class_idx]);

            // Refill warm pool for next miss
            // (look ahead at up to 2 more HOT SuperSlabs)
            for (int j = i + 1; j < g_super_reg_by_class_count[class_idx] && j < i + 3; j++) {
                SuperSlab* extra = g_super_reg_by_class[class_idx][j];
                if (ss_tier_is_hot(extra)) {
                    tiny_warm_pool_push(class_idx, extra);
                }
            }
            return;
        }
    }

    // 4. Registry exhausted → cold path (allocate new SuperSlab)
    allocate_new_superslab_and_carve(class_idx);
}

Step 4: Initialize Warm Pool in malloc_tiny_fast() (20 mins)

File: core/front/malloc_tiny_fast.h

Ensure warm pool is initialized on first malloc call:

// In malloc_tiny_fast() or tiny_hot_alloc_fast():
// need_init stands for whatever per-thread "first call" flag the hot path
// already maintains (count == 0 alone cannot distinguish "uninitialized"
// from "currently empty").
if (__builtin_expect(need_init, 0)) {
    tiny_warm_pool_init_once();
}

Or simpler: Let unified_cache_refill() call tiny_warm_pool_init_once() (as shown in Step 3).

Step 5: Add to SuperSlab Cleanup (30 mins)

File: core/hakmem_super_registry.h or core/hakmem_tiny.h

When a SuperSlab becomes empty (no active objects), add it to warm pool if room:

// In ss_slab_meta free path (when last object freed):
if (ss_slab_meta_active_count(slab_meta) == 0) {
    // SuperSlab is now empty
    SuperSlab* ss = ss_from_slab_meta(slab_meta);
    int class_idx = ss_slab_meta_class_get(slab_meta);

    // Try to add to warm pool for next allocation
    if (!tiny_warm_pool_push(class_idx, ss)) {
        // Warm pool full, return to LRU cache
        ss_cache_put(ss);
    }
}

Step 6: Add Optional Environment Variables (15 mins)

File: core/hakmem_tiny.h or core/front/tiny_warm_pool.h

// Check warm pool size via environment (for tuning).
// Requires <stdlib.h> for getenv()/atoi(). The value is clamped to
// TINY_WARM_POOL_MAX_PER_CLASS: TinyWarmPool.slabs[] is statically sized
// to that bound, so accepting a larger env value would overflow the array.
static inline int warm_pool_max_per_class(void) {
    static int max = -1;
    if (max == -1) {
        const char* env = getenv("HAKMEM_WARM_POOL_SIZE");
        max = env ? atoi(env) : TINY_WARM_POOL_MAX_PER_CLASS;
        if (max < 1 || max > TINY_WARM_POOL_MAX_PER_CLASS)
            max = TINY_WARM_POOL_MAX_PER_CLASS;
    }
    return max;
}

// Use in tiny_warm_pool_push():
static inline int tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    int capacity = warm_pool_max_per_class();
    if (g_tiny_warm_pool[class_idx].count < capacity) {
        g_tiny_warm_pool[class_idx].slabs[g_tiny_warm_pool[class_idx].count++] = ss;
        return 1;
    }
    return 0;
}

🔍 Testing Checklist

Unit Tests

// In test/test_warm_pool.c (NEW); adjust the include path to your tree
#include <assert.h>
#include <stdlib.h>
#include <pthread.h>
#include "../core/front/tiny_warm_pool.h"

void test_warm_pool_pop_empty() {
    // Verify pop on empty returns NULL
    SuperSlab* ss = tiny_warm_pool_pop(0);
    assert(ss == NULL);
}

void test_warm_pool_push_pop() {
    // Verify push then pop returns same
    SuperSlab* test_ss = (SuperSlab*)0x123456;
    tiny_warm_pool_push(0, test_ss);
    SuperSlab* popped = tiny_warm_pool_pop(0);
    assert(popped == test_ss);
}

void test_warm_pool_capacity() {
    // Verify pool respects capacity
    for (int i = 0; i < TINY_WARM_POOL_MAX_PER_CLASS + 1; i++) {
        SuperSlab* ss = (SuperSlab*)malloc(sizeof(SuperSlab));
        int pushed = tiny_warm_pool_push(0, ss);
        if (i < TINY_WARM_POOL_MAX_PER_CLASS) {
            assert(pushed == 1);  // Should succeed
        } else {
            assert(pushed == 0);  // Should fail when full
        }
    }
}

void test_warm_pool_per_thread() {
    // Verify thread isolation: thread_func_1/thread_func_2 (helpers, not
    // shown) each push a distinct sentinel pointer into class 0 and assert
    // they only ever pop their own value back.
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_func_1, NULL);
    pthread_create(&t2, NULL, thread_func_2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}

Integration Tests

# Run existing benchmark suite
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42

# Compare before/after:
#   Before: 1.06M ops/s
#   After:  1.5M+ ops/s (target +40%)

# Run other benchmarks to verify no regression
./bench_allocators_hakmem bench_tiny_hot      # Should be ~89M ops/s
./bench_allocators_hakmem bench_tiny_cold     # Should be similar
./bench_allocators_hakmem bench_random_mid    # Should improve

Performance Metrics

# With perf profiling
HAKMEM_WARM_POOL_SIZE=4 perf record -F 5000 -e cycles \
  ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42

# Expected to see:
# - Fewer unified_cache_refill calls
# - Reduced registry scan overhead
# - Increased warm pool pop hits

📊 Success Criteria

Metric               Current   Target    Status
Random Mixed ops/s   1.06M     1.5M+     ✓ Target
Warm pool hit rate   N/A       > 90%     ✓ New metric
Tiny Hot ops/s       89M       89M       ✓ No regression
Memory per thread    ~256KB    < 400KB   ✓ Acceptable
All tests pass       —         —         ✓ Verify

🚀 Quick Build & Test

# After code changes, compile and test:

cd /mnt/workdisk/public_share/hakmem

# Build
make clean && make

# Test warm pool directly
make test_warm_pool
./test_warm_pool

# Benchmark
./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42

# Profile
perf record -F 5000 -e cycles \
  ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
perf report

🔧 Debugging Tips

Verify Warm Pool is Active

Add debug output to warm pool operations:

#if !HAKMEM_BUILD_RELEASE
// Debug wrapper around tiny_warm_pool_pop(): logs the pop and returns the
// slab so the caller can still carve it.
static SuperSlab* warm_pool_pop_debug(int class_idx) {
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        fprintf(stderr, "[WarmPool] Pop class=%d, count=%d\n",
            class_idx, g_tiny_warm_pool[class_idx].count);
    }
    return ss;
}
#endif

Check Warm Pool Hit Rate

// Per-thread counters (no atomics needed; aggregate across threads if required)
__thread uint64_t g_warm_pool_hits = 0;
__thread uint64_t g_warm_pool_misses = 0;

// Add to refill
if (tiny_warm_pool_pop(...)) {
    g_warm_pool_hits++;  // Hit
} else {
    g_warm_pool_misses++;  // Miss
}

// Print at end of benchmark (guard against division by zero)
uint64_t total = g_warm_pool_hits + g_warm_pool_misses;
fprintf(stderr, "Warm pool: %llu hits, %llu misses (%.1f%% hit rate)\n",
    (unsigned long long)g_warm_pool_hits,
    (unsigned long long)g_warm_pool_misses,
    total ? 100.0 * g_warm_pool_hits / total : 0.0);

Measure Registry Scan Reduction

Profile before/after to verify:

  • Fewer calls to registry scan loop
  • Reduced cycles in unified_cache_refill()
  • Increased warm pool pop calls

📝 Commit Message Template

Add warm pool optimization for 40% performance improvement

- New: tiny_warm_pool.h with per-thread SuperSlab pools
- Modify: unified_cache_refill() to use warm pool (O(1) pop)
- Modify: SuperSlab cleanup to add to warm pool
- Env: HAKMEM_WARM_POOL_SIZE for tuning (default: 4)

Benefits:
  - Eliminates registry O(N) scan on cache miss
  - 40-50% improvement on Random Mixed (1.06M → 1.5M+ ops/s)
  - No regression in other workloads
  - Minimal per-thread memory overhead (<200KB)

Testing:
  - Unit tests for warm pool operations
  - Benchmark validation: Random Mixed +40%
  - No regression in Tiny Hot, Tiny Cold
  - Thread safety verified

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>

🎓 Key Design Decisions

Why 4 SuperSlabs per Class?

Trade-off: Working set size vs warm pool effectiveness

Too small (1-2):
  - Less memory: ✓
  - High miss rate: ✗ (frequently falls back to registry)

Right size (4):
  - Memory: ~8-32 KB per class × 32 classes = 256 KB-1 MB
  - Hit rate: ~90% (captures typical working set)
  - Sweet spot: ✓

Too large (8+):
  - More memory: ✗ (unnecessary TLS bloat)
  - Marginal benefit: ✗ (diminishing returns)

Why Thread-Local Storage?

Options:
1. Global pool (lock-protected) → Contention
2. Per-thread pool (TLS) → No locks, thread-safe ✓
3. Hybrid (mostly TLS) → Complexity

Chosen: Per-thread TLS
  - Fast path: No locks
  - Correctness: Thread-safe by design
  - Simplicity: No synchronization needed

Why Batched Tier Check?

Current: Check tier on every refill (expensive)
Proposed: Check tier periodically (every 64 pops)

Cost:
  - Rare case: SuperSlab changes tier while in warm pool
  - Detection: Caught on the next batch check (up to ~64 operations later)
  - Fallback: Registry scan still validates

Benefit:
  - Reduces unnecessary tier checks
  - Improves cache refill performance (see the sketch below)
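
A minimal sketch of that batched check, assuming ss_tier_is_hot() and the LRU return path ss_cache_put() from Step 5 (the per-class pop counter is illustrative):

// Hypothetical batched tier check: re-validate only every 64th pop.
static inline SuperSlab* tiny_warm_pool_pop_batched(int class_idx) {
    static __thread uint32_t pops[TINY_NUM_CLASSES];
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss && ((++pops[class_idx] & 63u) == 0)) {
        if (!ss_tier_is_hot(ss)) {
            ss_cache_put(ss);  // demoted while pooled: return to LRU
            return NULL;       // caller falls back to the registry scan
        }
    }
    return ss;
}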

Core Implementation:

  • core/front/tiny_warm_pool.h (NEW - this guide)
  • core/front/tiny_unified_cache.h (MODIFY - call warm pool)
  • core/front/malloc_tiny_fast.h (MODIFY - init warm pool)

Supporting:

  • core/hakmem_super_registry.h (UNDERSTAND - how registry works)
  • core/box/ss_tier_box.h (UNDERSTAND - tier management)
  • core/superslab/superslab_types.h (REFERENCE - SuperSlab struct)

Testing:

  • bench_allocators_hakmem (BENCHMARK)
  • test/test_*.c (ADD warm pool tests)

Implementation Checklist

  • Create core/front/tiny_warm_pool.h
  • Declare __thread g_tiny_warm_pool[]
  • Modify unified_cache_refill() in tiny_unified_cache.h
  • Add tiny_warm_pool_init_once() call in malloc hot path
  • Add warm pool push on SuperSlab cleanup
  • Add optional environment variable tuning
  • Write unit tests for warm pool operations
  • Compile and verify no errors
  • Run benchmark: Random Mixed ops/s improvement
  • Verify no regression in other workloads
  • Measure warm pool hit rate (target > 90%)
  • Profile CPU cycles (target ~40-50% reduction)
  • Create commit with summary above
  • Update documentation if needed

📞 Questions or Issues?

If you encounter:

  1. Compilation errors: Check includes, particularly superslab_types.h
  2. Low hit rate (<80%): Increase pool size via HAKMEM_WARM_POOL_SIZE (raise TINY_WARM_POOL_MAX_PER_CLASS too if you need more than its compile-time bound)
  3. Memory bloat: Verify pool size is <= 4 slots per class
  4. No performance gain: Check warm pool is actually being used (add debug output)
  5. Regression in other tests: Verify registry fallback path still works

Status: Ready to implement
Expected Timeline: 2-3 development days
Estimated Performance Gain: +40-50% (1.06M → 1.5M+ ops/s)