hakmem/ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md
Commit 5685c2f4c9 by Moe Charm (CI): Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: The warm pool had a ~0% hit rate (1 hit vs. 3976 misses) despite being
implemented, so effectively every cache miss went through an expensive
superslab_refill registry scan.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refill
calls and prefill the pool with 3 additional HOT SuperSlabs before attempting
to carve. This builds a working set of slabs that can sustain allocation
pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra SuperSlabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
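
A minimal sketch of the cold-path change (illustrative only; tiny_warm_pool_count, tiny_warm_pool_push, and superslab_refill are assumed helper names, and the actual code in unified_cache_refill() may differ):

// Cold path: pool is empty, so build a working set before carving.
if (tiny_warm_pool_count(class_idx) == 0) {
    for (int i = 0; i < 3; i++) {                   // prefill budget = 3
        SuperSlab* extra = superslab_refill(class_idx);
        if (!extra) break;                          // refill failed → stop
        tiny_warm_pool_push(class_idx, extra);      // stash for future misses
        g_warm_pool_stats[class_idx].prefilled++;   // track prefill events
    }
}
SuperSlab* ss = superslab_refill(class_idx);        // keep one in TLS and carve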

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- Statistics are always compiled in; printing is ENV-gated via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)

2025-12-04 23:31:54 +09:00


HAKMEM Architectural Restructuring for 10x Performance - Implementation Proposal

2025-12-04


📊 Executive Summary

Goal: Achieve 10x performance improvement on Random Mixed allocations (1.06M → 10.6M ops/s) by restructuring allocator to separate HOT/WARM/COLD execution paths.

Current Performance Gap:

Random Mixed:  1.06M ops/s  (current baseline)
Tiny Hot:      89M ops/s    (reference - different workload)
Goal:          10.6M ops/s  (10x from baseline)

Key Discovery: Current architecture already has HOT/WARM separation (via Unified Cache), but inefficiencies in WARM path prevent scaling:

  1. Registry scan on cache miss (O(N) search through per-class registry)
  2. Per-allocation tier checks (atomic operations, not batched)
  3. Lack of pre-warmed SuperSlab pools (must allocate/initialize on miss)
  4. Global registry contention (mutex-protected writes)

🔍 Current Architecture Analysis

Existing Two-Speed Foundation

HAKMEM already implements a two-tier design:

HOT PATH (95%+ allocations):
  malloc_tiny_fast()
    → tiny_hot_alloc_fast()
       → Unified Cache pop (TLS, 2-3 cache misses)
       → Return USER pointer
  Cost: ~20-30 CPU cycles

WARM PATH (1-5% cache misses):
  malloc_tiny_fast()
    → tiny_cold_refill_and_alloc()
       → unified_cache_refill()
          → Per-class registry scan (find HOT SuperSlab)
          → Tier check (is HOT)
          → Carve ~64 blocks
          → Refill Unified Cache
       → Return USER pointer
  Cost: ~500-1000 cycles per batch (~5-10 per object amortized)
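
A condensed sketch of this two-speed dispatch (illustrative; tiny_size_to_class is an assumed size-to-class helper, and the real signatures may differ):

// Two-speed front end: the TLS cache hit is the overwhelmingly common case.
void* malloc_tiny_fast(size_t size) {
    int class_idx = tiny_size_to_class(size);       // assumed helper
    void* p = tiny_hot_alloc_fast(class_idx);       // HOT: unified cache pop
    if (__builtin_expect(p != NULL, 1)) return p;   // ~20-30 cycles
    return tiny_cold_refill_and_alloc(class_idx);   // WARM/COLD: batched refill
}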

Performance Bottlenecks in WARM Path

Bottleneck 1: Registry Scan (O(N))

  • Current: Linear search through per-class registry to find HOT SuperSlab
  • Cost: 50-100 cycles per refill
  • Happens on EVERY cache miss (~1-5% of allocations)
  • Files: core/hakmem_super_registry.h, core/front/tiny_unified_cache.h (unified_cache_refill function)

Bottleneck 2: Per-Allocation Tier Checks

  • Current: ss_tier_is_hot(ss) is called on every refill (atomic load)
  • Should be: batch tier checks across many refills so their cost is amortized
  • Cost: atomic operations on the miss path, not amortized
  • File: core/box/ss_tier_box.h

Bottleneck 3: Global Registry Contention

  • Current: Mutex-protected registry insert on SuperSlab alloc
  • File: core/hakmem_super_registry.h (hak_super_registry_insert)
  • Lock: g_super_reg_lock

Bottleneck 4: SuperSlab Initialization Overhead

  • Current: Full allocation + initialization on cache miss → cold path
  • Cost: ~1000+ cycles (mmap, metadata setup, registry insert)
  • Should be: Pre-allocated from LRU cache or warm pool

💡 Proposed Three-Tier Architecture

Tier 1: HOT (95%+ allocations)

// Path: TLS Unified Cache hit
// Cost: ~20-30 cycles (unchanged)
// Characteristics:
//   - No registry access
//   - No Tier/Guard calls
//   - No locks
//   - Branch-free (or 1-branch pipeline hits)

Path:
  1. Read TLS Unified Cache (TLS access, 1 cache miss)
  2. Pop from array (array access, 1 cache miss)
  3. Update head pointer (1 store)
  4. Return USER pointer (0 additional branches for hit)

Total: 2-3 cache misses, ~20-30 cycles
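
For concreteness, the hit path boils down to the following (a sketch assuming an array-backed ring; g_unified_cache and its field names are illustrative):

// Tier 1 pop: no registry access, no tier/guard calls, no locks.
static inline void* unified_cache_pop(int class_idx) {
    TinyUnifiedCache* c = &g_unified_cache[class_idx]; // TLS read (1 cache miss)
    if (c->head == c->tail) return NULL;               // empty → WARM path
    void* p = c->slots[c->head & (c->cap - 1)];        // array read (1 cache miss)
    c->head++;                                         // 1 store
    return p;                                          // USER pointer
}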

Tier 2: WARM (1-5% cache misses)

NEW: Per-Thread Warm Pool

// Path: Unified Cache miss → Pop from per-thread warm pool
// Cost: ~50-100 cycles per batch (5-10 per object amortized)
// Characteristics:
//   - No global registry scan
//   - Pre-qualified SuperSlabs (already HOT)
//   - Batched tier transitions (not per-object)
//   - Minimal lock contention

Data Structure:
  __thread SuperSlab* g_warm_pool_head[TINY_NUM_CLASSES];
  __thread int       g_warm_pool_count[TINY_NUM_CLASSES];
  __thread int       g_warm_pool_capacity[TINY_NUM_CLASSES];

Path:
  1. Detect Unified Cache miss (head == tail)
  2. Check warm pool (TLS access, no lock)
     a. If warm_pool_count > 0:
        ├─ Pop SuperSlab from warm_pool_head (O(1))
        ├─ Use existing SuperSlab (no mmap)
        ├─ Carve ~64 blocks (amortized cost)
        ├─ Refill Unified Cache
        ├─ (Optional) Batch tier check after ~64 pops
        └─ Return first block

     b. If warm_pool_count == 0:
        └─ Fall through to COLD (rare)

Total: ~50-100 cycles per batch

Tier 3: COLD (<0.1% special cases)

// Path: Warm pool exhausted, error, or special handling
// Cost: ~1000-10000 cycles per SuperSlab (rare)
// Characteristics:
//   - Full SuperSlab allocation (mmap)
//   - Registry insert (mutex-protected write)
//   - Tier initialization
//   - Guard validation

Path:
  1. Warm pool exhausted
  2. Allocate new SuperSlab (mmap via ss_os_acquire_box)
  3. Insert into global registry (mutex-protected)
  4. Initialize TinySlabMeta + metadata
  5. Add to per-class registry
  6. Carve blocks + refill both Unified Cache and warm pool
  7. Return first block
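
In code, steps 2-4 are roughly as follows (hak_super_registry_insert is the existing registry entry point cited above; ss_os_acquire and tiny_slab_meta_init are illustrative names, with the mmap going through ss_os_acquire_box per step 2):

// Tier 3: allocate and register a brand-new SuperSlab (rare).
static SuperSlab* tiny_cold_alloc_superslab(int class_idx) {
    SuperSlab* ss = ss_os_acquire(class_idx);   // step 2: mmap-backed acquisition
    if (!ss) return NULL;                       // OS refused → propagate failure
    hak_super_registry_insert(ss);              // step 3: mutex-protected insert
    tiny_slab_meta_init(ss, class_idx);         // step 4: TinySlabMeta + metadata
    return ss;                                  // steps 5-7 done by the caller
}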

🔧 Implementation Plan

Phase 1: Design & Data Structures (THIS DOCUMENT)

Task 1.1: Define Warm Pool Data Structure

// File: core/front/tiny_warm_pool.h (NEW)
//
// Per-thread warm pool for pre-allocated SuperSlabs
// Reduces registry scan cost on cache miss

#ifndef HAK_TINY_WARM_POOL_H
#define HAK_TINY_WARM_POOL_H

#include <stdint.h>
#include "../hakmem_tiny_config.h"
#include "../superslab/superslab_types.h"

// Maximum warm SuperSlabs per thread (tunable)
#define TINY_WARM_POOL_MAX_PER_CLASS 4

typedef struct {
    SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;
    int capacity;
} TinyWarmPool;

// Per-thread warm pools (one per class)
extern __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];

// Operations:
// - tiny_warm_pool_init() → Initialize at thread startup
// - tiny_warm_pool_push() → Add SuperSlab to warm pool
// - tiny_warm_pool_pop() → Remove SuperSlab from warm pool (O(1))
// - tiny_warm_pool_drain() → Return all to LRU on thread exit
// - tiny_warm_pool_refill() → Batch refill from LRU cache

#endif

Task 1.2: Define Warm Pool Operations

// Lazy initialization (once per thread)
static inline void tiny_warm_pool_init_once(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->capacity == 0) {
        pool->capacity = TINY_WARM_POOL_MAX_PER_CLASS;
        pool->count = 0;
        // Allocate initial SuperSlabs on demand (COLD path)
    }
}

// O(1) pop from warm pool
static inline SuperSlab* tiny_warm_pool_pop(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count > 0) {
        return pool->slabs[--pool->count];  // Pop from end
    }
    return NULL;  // Pool empty → fall through to COLD
}

// O(1) push to warm pool
static inline void tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count < pool->capacity) {
        pool->slabs[pool->count++] = ss;
    } else {
        // Pool full → return to LRU cache or free
        ss_cache_put(ss);  // Return to global LRU
    }
}
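
The drain operation listed in the Task 1.1 header comment is then symmetric with push (a sketch; ss_cache_put is the same LRU hook used above):

// Drain on thread exit: hand every held SuperSlab back to the global LRU.
static inline void tiny_warm_pool_drain(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    while (pool->count > 0) {
        ss_cache_put(pool->slabs[--pool->count]);  // return to global LRU cache
    }
}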

Phase 2: Implement Warm Pool Initialization

Task 2.1: Thread Startup Integration

  • Initialize warm pools on first malloc call
  • Pre-populate from LRU cache (if available)
  • Fall back to cold allocation if needed

Task 2.2: Batch Refill Strategy

  • On thread startup: Allocate ~2-3 SuperSlabs per class to warm pool
  • On cache miss: Pop from warm pool (no registry scan)
  • On warm pool depletion: Allocate 1-2 more in cold path
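
Combining Tasks 2.1 and 2.2, startup prefill might look like this (a sketch; ss_cache_get is an assumed LRU-pop counterpart to ss_cache_put):

// Pre-populate a class's warm pool from the global LRU cache.
static void tiny_warm_pool_prefill(int class_idx, int budget) {
    tiny_warm_pool_init_once(class_idx);
    for (int i = 0; i < budget; i++) {            // budget ≈ 2-3 per class
        SuperSlab* ss = ss_cache_get(class_idx);  // assumed LRU pop; NULL if empty
        if (!ss) break;                           // LRU empty → rely on COLD path
        tiny_warm_pool_push(class_idx, ss);
    }
}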

Phase 3: Modify unified_cache_refill()

Current Implementation (Registry Scan):

void unified_cache_refill(int class_idx) {
    // Linear search through per-class registry
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* ss = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(ss)) {  // ← Tier check (5-10 cycles)
            // Carve blocks
            carve_blocks_from_superslab(ss, class_idx, cache);
            return;
        }
    }
    // Not found → cold path (allocate new SuperSlab)
}

Proposed Implementation (Warm Pool First):

void unified_cache_refill(int class_idx) {
    // 1. Try warm pool first (no lock, O(1))
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        // SuperSlab already HOT (pre-qualified), no tier check needed
        carve_blocks_from_superslab(ss, class_idx, cache);
        return;
    }

    // 2. Fall back to registry scan (only if warm pool is empty)
    // (TINY_WARM_POOL_MAX_PER_CLASS = 4, so this rarely happens)
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* cand = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(cand)) {
            carve_blocks_from_superslab(cand, class_idx, cache);
            // Refill warm pool with up to 2 more HOT slabs beyond index i
            for (int j = 0; j < 2; j++) {
                SuperSlab* extra = find_next_hot_slab(class_idx, ++i);
                if (!extra) break;
                tiny_warm_pool_push(class_idx, extra);
            }
            return;
        }
    }

    // 3. Cold path (allocate new SuperSlab)
    allocate_new_superslab(class_idx, cache);
}

Phase 4: Batched Tier Transition Checks

Current: Tier check on every refill (5-10 cycles)
Proposed: Batch tier checks once per N operations

// Per-thread tier check counter; run a batched validation every N refills
static __thread uint32_t g_tier_check_counter = 0;
#define TIER_CHECK_BATCH_SIZE 256

void tier_check_maybe_batch(int class_idx) {
    if (++g_tier_check_counter % TIER_CHECK_BATCH_SIZE == 0) {
        // Batch check: sample SuperSlabs from the per-class registry
        int n = g_super_reg_by_class_count[class_idx];
        for (int i = 0; i < 10 && n > 0; i++) {  // Sample 10 SuperSlabs
            SuperSlab* ss = g_super_reg_by_class[class_idx][rand() % n];
            if (!ss_tier_is_hot(ss)) {
                // Demote from warm pool if present
                // (Cost: one batch of checks per 256 refills)
            }
        }
    }
}

Phase 5: LRU Cache Integration

How Warm Pool Gets Replenished:

  1. Startup: Pre-populate warm pools from LRU cache
  2. During execution: On cold path alloc, add extra SuperSlab to warm pool
  3. Periodic: Background thread refills warm pools when < threshold
  4. On free: When SuperSlab becomes empty, add to LRU cache (not warm pool)
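
The free-path hook in item 4 can be as small as the following (a sketch; ss_is_empty is an assumed emptiness predicate):

// When the last block of a SuperSlab is freed, route it to the global LRU
// cache rather than any thread's warm pool, keeping warm pools HOT-only.
static inline void tiny_superslab_on_last_free(SuperSlab* ss) {
    if (ss_is_empty(ss)) {   // assumed: no live blocks remain
        ss_cache_put(ss);    // any thread can later prefill from the LRU
    }
}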

📈 Expected Performance Impact

Current Baseline

Random Mixed: 1.06M ops/s
Breakdown:
  - 95% cache hits (HOT):     ~1.007M ops/s (clean, 2-3 cache misses)
  - 5% cache misses (WARM):   ~0.053M ops/s (registry scan + refill)

After Warm Pool Implementation

Estimated: 1.5-1.8M ops/s (+40-70%)

Breakdown:
  - 95% cache hits (HOT):       ~1.007M ops/s (unchanged, 2-3 cache misses)
  - 5% cache misses (WARM):     ~0.15-0.20M ops/s (warm pool, O(1) pop)
                                 (vs 0.053M before)

Improvement mechanism:
  - Replace the O(N) registry scan with an O(1) warm pool pop
  - Reduce per-refill cost: ~500 cycles → ~50 cycles (~10x per miss)
  - Rough model: each of the 5% miss-path operations becomes ~10x faster,
    contributing roughly 1.06M × 0.05 × 9 ≈ 0.48M additional ops/s
  - Total: 1.06M + 0.48M ≈ 1.54M ops/s (+45%)
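
As a sanity check, a weighted-cycle model using the per-path estimates above (≈25 cycles per hit, ≈500 cycles per miss before, ≈50 after; these are assumptions, not measurements) gives

$$\bar{c}_{\mathrm{before}} = 0.95 \cdot 25 + 0.05 \cdot 500 \approx 48.8, \qquad \bar{c}_{\mathrm{after}} = 0.95 \cdot 25 + 0.05 \cdot 50 \approx 26.3$$

i.e. a ceiling of about $48.8 / 26.3 \approx 1.86\times$, suggesting the +45% figure above is the conservative end of the estimate.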

Path to 10x

Current efforts can achieve:

  • Warm pool optimization: +40-70% (this proposal)
  • Lock-free refill path: +10-20% (phase 2)
  • Batch tier transitions: +5-10% (phase 2)
  • Reduced syscall overhead: +5% (phase 3)
  • Total realistic: 2.0-2.5x (not 10x)

To reach 10x improvement, would need:

  1. Dedicated per-thread allocation pools (reduce lock contention)
  2. Batch pre-allocation strategy (reduce per-op overhead)
  3. Size class coalescing (reduce routing complexity)
  4. Or: Change workload pattern (batch allocations)

⚠️ Implementation Risks & Mitigations

Risk 1: Thread-Local Storage Bloat

Risk: Adding a warm pool increases per-thread memory usage.
Mitigation:

  • Allocate warm pool lazily
  • Limit to 4-8 SuperSlabs per class (128KB per thread max)
  • Default: 4 slots per class → 128KB total (acceptable)

Risk 2: Warm Pool Invalidation

Risk: SuperSlabs become DRAINING/FREE unexpectedly.
Mitigation:

  • Periodic validation during batch tier checks
  • Accept occasional validation error (rare, correctness not affected)
  • Fallback to registry scan if warm pool slot invalid

Risk 3: Stale SuperSlabs

Risk: Warm pool holds SuperSlabs that should be freed.
Mitigation:

  • LRU-based eviction from warm pool
  • Maximum hold time: 60s (configurable)
  • On thread exit: drain warm pool back to LRU cache

Risk 4: Initialization Race

Risk: Multiple threads initialize warm pools simultaneously.
Mitigation:

  • Use __thread storage (inherently per-thread, so no cross-thread races)
  • Lazy initialization with check-then-set
  • No atomic operations needed (per-thread)

🔄 Integration Checklist

Pre-Implementation

  • Review current unified_cache_refill() implementation
  • Identify all places where SuperSlab allocation happens
  • Audit Tier system for validation requirements
  • Measure current registry scan cost in micro-benchmark

Phase 1: Warm Pool Infrastructure

  • Create core/front/tiny_warm_pool.h with data structures
  • Implement warm_pool_init(), pop(), push() operations
  • Add __thread variable declarations
  • Write unit tests for warm pool operations
  • Verify no TLS bloat (profile memory usage)

Phase 2: Integration Points

  • Modify malloc_tiny_fast() to initialize warm pools
  • Integrate warm_pool_pop() in unified_cache_refill()
  • Implement warm_pool_push() in cold allocation path
  • Add initialization on first malloc
  • Handle thread exit cleanup

Phase 3: Testing

  • Micro-benchmark: warm pool pop (should be O(1), 2-3 cycles; see the sketch after this list)
  • Benchmark Random Mixed: measure ops/s improvement
  • Benchmark Tiny Hot: verify no regression (should be unchanged)
  • Stress test: concurrent threads + warm pool refill
  • Correctness: verify all objects properly allocated/freed
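
A minimal pop micro-benchmark for the first item (a sketch using clock_gettime; assumes the Phase 1 warm pool API and a pre-filled pool):

#include <stdio.h>
#include <time.h>

// Measure average pop+push cost; expect a handful of nanoseconds per iteration.
static void bench_warm_pool_pop(int class_idx, int iters) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        SuperSlab* ss = tiny_warm_pool_pop(class_idx);
        if (ss) tiny_warm_pool_push(class_idx, ss);  // keep pool population stable
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("warm pool pop+push: %.2f ns/iter\n", ns / iters);
}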

Phase 4: Profiling & Optimization

  • Profile hot path (should still be 20-30 cycles)
  • Profile warm path (should be reduced to 50-100 cycles)
  • Measure registry scan reduction
  • Identify any remaining bottlenecks

Phase 5: Documentation

  • Update comments in unified_cache_refill()
  • Document warm pool design in README
  • Add environment variables (if needed)
  • Document tier check batching strategy

📊 Metrics to Track

Pre-Implementation

Baseline Random Mixed:
  - Ops/sec: 1.06M
  - L1 cache misses: ~763K per 1M ops
  - Page faults: ~7,674
  - CPU cycles: ~70.4M

Post-Implementation Targets

After warm pool:
  - Ops/sec: 1.5-1.8M (+40-70%)
  - L1 cache misses: Similar or slightly reduced
  - Page faults: Same (~7,674)
  - CPU cycles: ~45-50M (30% reduction)

  Warm path breakdown:
    - Warm pool hit: 50-100 cycles per batch
    - Registry fallback: 200-300 cycles (rare)
    - Cold alloc: 1000-5000 cycles (very rare)

💾 Files to Create/Modify

New Files

  • core/front/tiny_warm_pool.h - Warm pool data structures & operations

Modified Files

  1. core/front/malloc_tiny_fast.h

    • Initialize warm pools on first call
    • Document three-tier routing
  2. core/front/tiny_unified_cache.h

    • Modify unified_cache_refill() to use warm pool first
    • Add warm pool replenishment logic
  3. core/box/ss_tier_box.h

    • Add batched tier check strategy
    • Document validation requirements
  4. core/hakmem_tiny.h or core/front/malloc_tiny_fast.h

    • Add environment variables:
      • HAKMEM_WARM_POOL_SIZE (default: 4)
      • HAKMEM_WARM_POOL_REFILL_THRESHOLD (default: 1)
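
Parsing for the variables above can follow the usual getenv pattern (a sketch; hak_env_int is an illustrative helper name):

#include <stdlib.h>

// Read an integer tunable from the environment, falling back to a default.
static int hak_env_int(const char* name, int dflt) {
    const char* s = getenv(name);
    return (s && *s) ? atoi(s) : dflt;
}

// At warm pool init:
//   int cap       = hak_env_int("HAKMEM_WARM_POOL_SIZE", 4);
//   int refill_at = hak_env_int("HAKMEM_WARM_POOL_REFILL_THRESHOLD", 1);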

Configuration Files

  • Add warm pool parameters to benchmark configuration
  • Update profiling tools to measure warm pool effectiveness

🎯 Success Criteria

Must Have:

  1. Warm pool implementation reduces registry scan cost by 80%+
  2. Random Mixed ops/s increases to 1.5M+ (40%+ improvement)
  3. Tiny Hot ops/s unchanged (no regression)
  4. All allocations remain correct (no memory corruption)
  5. No thread-local storage bloat (< 200KB per thread)

Nice to Have:

  1. Random Mixed reaches 2M+ ops/s (90%+ improvement)
  2. Warm pool hit rate > 90% (rarely fall back to registry)
  3. L1 cache misses reduced by 10%+
  4. Per-free cost unchanged (no regression)

Not in Scope (separate PR):

  1. Lock-free refill path (requires CAS-based warm pool)
  2. Per-thread allocation pools (requires larger redesign)
  3. Hugepages support (already tested, no gain)

📝 Next Steps

  1. Review this proposal with the team
  2. Approve scope & success criteria
  3. Begin Phase 1 implementation (warm pool header file)
  4. Integrate with unified_cache_refill()
  5. Benchmark and measure improvements
  6. Iterate based on profiling results

🔗 References

  • Current Profiling: COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md
  • Session Summary: FINAL_SESSION_REPORT_20251204.md
  • Box Architecture: core/box/ directory
  • Unified Cache: core/front/tiny_unified_cache.h
  • Registry: core/hakmem_super_registry.h
  • Tier System: core/box/ss_tier_box.h