# HAKMEM Architectural Restructuring for 10x Performance - Implementation Proposal

## 2025-12-04

---

## 📊 Executive Summary

**Goal:** Achieve a 10x performance improvement on Random Mixed allocations (1.06M → 10.6M ops/s) by restructuring the allocator to separate HOT/WARM/COLD execution paths.

**Current Performance Gap:**

```
Random Mixed: 1.06M ops/s   (current baseline)
Tiny Hot:     89M ops/s     (reference - different workload)
Goal:         10.6M ops/s   (10x from baseline)
```

**Key Discovery:** The current architecture already has HOT/WARM separation (via the Unified Cache), but inefficiencies in the WARM path prevent scaling:

1. **Registry scan on cache miss** (O(N) search through the per-class registry)
2. **Per-allocation tier checks** (atomic operations, not batched)
3. **Lack of pre-warmed SuperSlab pools** (must allocate/initialize on miss)
4. **Global registry contention** (mutex-protected writes)

---

## 🔍 Current Architecture Analysis

### Existing Two-Speed Foundation

HAKMEM **already implements** a two-tier design:

```
HOT PATH (95%+ of allocations):
  malloc_tiny_fast()
    → tiny_hot_alloc_fast()
    → Unified Cache pop (TLS, 2-3 cache misses)
    → Return USER pointer
  Cost: ~20-30 CPU cycles

WARM PATH (1-5% cache misses):
  malloc_tiny_fast()
    → tiny_cold_refill_and_alloc()
    → unified_cache_refill()
    → Per-class registry scan (find HOT SuperSlab)
    → Tier check (is HOT)
    → Carve ~64 blocks
    → Refill Unified Cache
    → Return USER pointer
  Cost: ~500-1000 cycles per batch (~5-10 per object amortized)
```

### Performance Bottlenecks in WARM Path

**Bottleneck 1: Registry Scan (O(N))**
- Current: Linear search through the per-class registry to find a HOT SuperSlab
- Cost: 50-100 cycles per refill
- Happens on EVERY cache miss (~1-5% of allocations)
- Files: `core/hakmem_super_registry.h`, `core/front/tiny_unified_cache.h` (unified_cache_refill function)

**Bottleneck 2: Per-Allocation Tier Checks**
- Current: `ss_tier_is_hot(ss)` is called on every refill
- Should be: Multiple tier checks batched together
- Cost: Atomic operations, not amortized
- File: `core/box/ss_tier_box.h`

**Bottleneck 3: Global Registry Contention**
- Current: Mutex-protected registry insert on SuperSlab alloc
- File: `core/hakmem_super_registry.h` (hak_super_registry_insert)
- Lock: `g_super_reg_lock`

**Bottleneck 4: SuperSlab Initialization Overhead**
- Current: Full allocation + initialization on cache miss → cold path
- Cost: ~1000+ cycles (mmap, metadata setup, registry insert)
- Should be: Pre-allocated from the LRU cache or a warm pool

---

## 💡 Proposed Three-Tier Architecture

### Tier 1: HOT (95%+ of allocations)

```c
// Path: TLS Unified Cache hit
// Cost: ~20-30 cycles (unchanged)
// Characteristics:
//   - No registry access
//   - No Tier/Guard calls
//   - No locks
//   - Branch-free (or a single well-predicted branch)

Path:
  1. Read TLS Unified Cache (TLS access, 1 cache miss)
  2. Pop from array (array access, 1 cache miss)
  3. Update head pointer (1 store)
  4. Return USER pointer (0 additional branches on a hit)

Total: 2-3 cache misses, ~20-30 cycles
```
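To make the HOT-path cost concrete, here is a minimal sketch of the Tier 1 pop, assuming a per-class TLS array cache. `TinyUnifiedCacheSketch`, its fields, and `tiny_hot_pop_sketch` are hypothetical names for illustration; the real layout lives in `core/front/tiny_unified_cache.h` and almost certainly differs.

```c
#include <stdint.h>

// Hypothetical layout for illustration only; not the real Unified Cache.
typedef struct {
    void**   slots;   // array of cached USER pointers
    uint32_t head;    // next slot to pop
    uint32_t tail;    // end of the valid region
} TinyUnifiedCacheSketch;

extern __thread TinyUnifiedCacheSketch g_unified_cache_sketch[32 /* e.g. TINY_NUM_CLASSES */];

// Tier 1 pop: TLS read + array pop + head update, one predictable branch.
static inline void* tiny_hot_pop_sketch(int class_idx) {
    TinyUnifiedCacheSketch* c = &g_unified_cache_sketch[class_idx];
    if (c->head == c->tail)        // miss → caller falls through to WARM
        return NULL;
    return c->slots[c->head++];    // the entire HOT-path memory traffic
}
```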
### Tier 2: WARM (1-5% cache misses)

**NEW: Per-Thread Warm Pool**

```c
// Path: Unified Cache miss → Pop from per-thread warm pool
// Cost: ~50-100 cycles per batch (5-10 per object amortized)
// Characteristics:
//   - No global registry scan
//   - Pre-qualified SuperSlabs (already HOT)
//   - Batched tier transitions (not per-object)
//   - Minimal lock contention

Data Structure:
  __thread SuperSlab* g_warm_pool_head[TINY_NUM_CLASSES];
  __thread int g_warm_pool_count[TINY_NUM_CLASSES];
  __thread int g_warm_pool_capacity[TINY_NUM_CLASSES];

Path:
  1. Detect Unified Cache miss (head == tail)
  2. Check warm pool (TLS access, no lock)
     a. If warm_pool_count > 0:
        ├─ Pop SuperSlab from warm_pool_head (O(1))
        ├─ Use existing SuperSlab (no mmap)
        ├─ Carve ~64 blocks (amortized cost)
        ├─ Refill Unified Cache
        ├─ (Optional) Batch tier check after ~64 pops
        └─ Return first block
     b. If warm_pool_count == 0:
        └─ Fall through to COLD (rare)

Total: ~50-100 cycles per batch
```

### Tier 3: COLD (<0.1% special cases)

```c
// Path: Warm pool exhausted, error, or special handling
// Cost: ~1000-10000 cycles per SuperSlab (rare)
// Characteristics:
//   - Full SuperSlab allocation (mmap)
//   - Registry insert (mutex-protected write)
//   - Tier initialization
//   - Guard validation

Path:
  1. Warm pool exhausted
  2. Allocate new SuperSlab (mmap via ss_os_acquire_box)
  3. Insert into global registry (mutex-protected)
  4. Initialize TinySlabMeta + metadata
  5. Add to per-class registry
  6. Carve blocks + refill both Unified Cache and warm pool
  7. Return first block
```

---

## 🔧 Implementation Plan

### Phase 1: Design & Data Structures (THIS DOCUMENT)

**Task 1.1: Define Warm Pool Data Structure**

```c
// File: core/front/tiny_warm_pool.h (NEW)
//
// Per-thread warm pool for pre-allocated SuperSlabs.
// Reduces registry scan cost on cache miss.

#ifndef HAK_TINY_WARM_POOL_H
#define HAK_TINY_WARM_POOL_H

#include <stddef.h>  /* assumption: the original system include was garbled */
#include "../hakmem_tiny_config.h"
#include "../superslab/superslab_types.h"

// Maximum warm SuperSlabs per thread (tunable)
#define TINY_WARM_POOL_MAX_PER_CLASS 4

typedef struct {
    SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;
    int capacity;
} TinyWarmPool;

// Per-thread warm pools (one per class)
extern __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];

// Operations:
//   - tiny_warm_pool_init()   → Initialize at thread startup
//   - tiny_warm_pool_push()   → Add SuperSlab to warm pool
//   - tiny_warm_pool_pop()    → Remove SuperSlab from warm pool (O(1))
//   - tiny_warm_pool_drain()  → Return all to LRU on thread exit
//   - tiny_warm_pool_refill() → Batch refill from LRU cache

#endif
```

**Task 1.2: Define Warm Pool Operations**

```c
// Lazy initialization (once per thread)
static inline void tiny_warm_pool_init_once(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->capacity == 0) {
        pool->capacity = TINY_WARM_POOL_MAX_PER_CLASS;
        pool->count = 0;
        // Actual SuperSlabs are allocated on demand (COLD path)
    }
}

// O(1) pop from warm pool
static inline SuperSlab* tiny_warm_pool_pop(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count > 0) {
        return pool->slabs[--pool->count];  // Pop from end
    }
    return NULL;  // Pool empty → fall through to COLD
}

// O(1) push to warm pool
static inline void tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count < pool->capacity) {
        pool->slabs[pool->count++] = ss;
    } else {
        // Pool full → return to LRU cache or free
        ss_cache_put(ss);  // Return to global LRU
    }
}
```
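The header in Task 1.1 also lists `tiny_warm_pool_drain()`, which Task 1.2 does not yet implement. A hedged sketch follows, building on the Task 1.1/1.2 definitions; wiring it through a pthread destructor key is an assumption about the integration point, not existing HAKMEM code.

```c
#include <pthread.h>

// Sketch: return all warm SuperSlabs to the global LRU cache so they are
// not leaked when the owning thread exits.
static void tiny_warm_pool_drain_all(void* unused) {
    (void)unused;
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        TinyWarmPool* pool = &g_tiny_warm_pool[c];
        while (pool->count > 0) {
            ss_cache_put(pool->slabs[--pool->count]);  // back to LRU
        }
    }
}

// One-time destructor registration (e.g., from the first malloc call).
static pthread_key_t  g_warm_pool_key;
static pthread_once_t g_warm_pool_once = PTHREAD_ONCE_INIT;

static void tiny_warm_pool_key_init(void) {
    pthread_key_create(&g_warm_pool_key, tiny_warm_pool_drain_all);
}

static inline void tiny_warm_pool_register_thread(void) {
    pthread_once(&g_warm_pool_once, tiny_warm_pool_key_init);
    // Any non-NULL value makes the destructor fire for this thread.
    pthread_setspecific(g_warm_pool_key, (void*)1);
}
```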
### Phase 2: Implement Warm Pool Initialization

**Task 2.1: Thread Startup Integration**
- Initialize warm pools on the first malloc call
- Pre-populate from the LRU cache (if available)
- Fall back to cold allocation if needed

**Task 2.2: Batch Refill Strategy**
- On thread startup: allocate ~2-3 SuperSlabs per class into the warm pool
- On cache miss: pop from the warm pool (no registry scan)
- On warm pool depletion: allocate 1-2 more in the cold path

### Phase 3: Modify unified_cache_refill()

**Current Implementation** (Registry Scan):

```c
// NOTE: 'cache' = this thread's Unified Cache for class_idx (elided here).
void unified_cache_refill(int class_idx) {
    // Linear search through the per-class registry
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* ss = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(ss)) {  // ← Tier check (5-10 cycles)
            // Carve blocks
            carve_blocks_from_superslab(ss, class_idx, cache);
            return;
        }
    }
    // Not found → cold path (allocate new SuperSlab)
}
```

**Proposed Implementation** (Warm Pool First):

```c
void unified_cache_refill(int class_idx) {
    // 1. Try the warm pool first (no lock, O(1))
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        // SuperSlab is pre-qualified as HOT; no tier check needed
        carve_blocks_from_superslab(ss, class_idx, cache);
        return;
    }

    // 2. Fall back to the registry scan (only when the warm pool is empty;
    //    with TINY_WARM_POOL_MAX_PER_CLASS = 4 this is rare)
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* cand = g_super_reg_by_class[class_idx][i];  // renamed: avoid shadowing 'ss'
        if (ss_tier_is_hot(cand)) {
            carve_blocks_from_superslab(cand, class_idx, cache);
            // Refill the warm pool while we are scanning anyway
            for (int j = 0; j < 2; j++) {
                SuperSlab* extra = find_next_hot_slab(class_idx, i);
                if (extra) {
                    tiny_warm_pool_push(class_idx, extra);
                    i++;
                }
            }
            return;
        }
    }

    // 3. Cold path (allocate new SuperSlab)
    allocate_new_superslab(class_idx, cache);
}
```

### Phase 4: Batched Tier Transition Checks

**Current:** tier check on every refill (5-10 cycles)

**Proposed:** batch tier checks once per N operations

```c
// Per-thread counter; tier state is only re-validated every
// TIER_CHECK_BATCH_SIZE operations instead of on every refill.
static __thread uint32_t g_tier_check_counter = 0;
#define TIER_CHECK_BATCH_SIZE 256

void tier_check_maybe_batch(int class_idx) {
    if (++g_tier_check_counter % TIER_CHECK_BATCH_SIZE == 0) {
        // Batch check: sample SuperSlabs from the per-class registry
        int n = g_super_reg_by_class_count[class_idx];
        for (int i = 0; i < 10 && n > 0; i++) {  // Sample up to 10 SuperSlabs
            SuperSlab* ss = g_super_reg_by_class[class_idx][rand() % n];
            if (!ss_tier_is_hot(ss)) {
                // Demote from the warm pool if present
                // (Cost: ~1 atomic per 256 operations)
            }
        }
    }
}
```

### Phase 5: LRU Cache Integration

**How the Warm Pool Gets Replenished** (see the sketch after this list):

1. **Startup:** Pre-populate warm pools from the LRU cache
2. **During execution:** On a cold-path alloc, add an extra SuperSlab to the warm pool
3. **Periodic:** A background thread refills warm pools when below a threshold
4. **On free:** When a SuperSlab becomes empty, add it to the LRU cache (not the warm pool)
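A hedged sketch of the batch-refill operation behind steps 1 and 3. `ss_cache_get()` is assumed here as the LRU counterpart of the `ss_cache_put()` used in Task 1.2 and may not match the real LRU cache API.

```c
// Sketch: top up one class's warm pool from the global LRU cache.
// ss_cache_get() is an assumed API: it returns a reusable SuperSlab for
// class_idx, or NULL when the LRU cache has none available.
static inline void tiny_warm_pool_refill_from_lru(int class_idx, int target) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (target > pool->capacity)
        target = pool->capacity;
    while (pool->count < target) {
        SuperSlab* ss = ss_cache_get(class_idx);
        if (!ss)
            break;  // LRU empty → leave the rest to the COLD path
        pool->slabs[pool->count++] = ss;
    }
}

// Usage (thread startup, Task 2.2): pre-populate ~2 slabs per class.
//   for (int c = 0; c < TINY_NUM_CLASSES; c++)
//       tiny_warm_pool_refill_from_lru(c, 2);
```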
---

## 📈 Expected Performance Impact

### Current Baseline

```
Random Mixed: 1.06M ops/s

Breakdown:
- 95% cache hits (HOT):   ~1.007M ops/s (clean, 2-3 cache misses)
- 5% cache misses (WARM): ~0.053M ops/s (registry scan + refill)
```

### After Warm Pool Implementation

```
Estimated: 1.5-1.8M ops/s (+40-70%)

Breakdown:
- 95% cache hits (HOT):   ~1.007M ops/s (unchanged, 2-3 cache misses)
- 5% cache misses (WARM): ~0.15-0.20M ops/s (warm pool, O(1) pop)
                          (vs 0.053M before)

Improvement mechanism:
- Remove the registry O(N) scan → O(1) warm pool pop
- Reduce per-refill cost: ~500 cycles → ~50 cycles
- Expected per-miss speedup: ~10x
- Rough additive estimate: the 5% WARM share speeds up ~10x,
  adding ≈ 1.06M × 0.05 × 9 = 0.477M ops/s
- Total: 1.06M + 0.477M ≈ 1.54M ops/s (+45%)
```

### Path to 10x

The optimizations in scope can realistically achieve:

- **Warm pool optimization:** +40-70% (this proposal)
- **Lock-free refill path:** +10-20% (phase 2)
- **Batch tier transitions:** +5-10% (phase 2)
- **Reduced syscall overhead:** +5% (phase 3)
- **Realistic total: 2.0-2.5x** (not 10x)

**Reaching a full 10x would additionally require:**

1. Dedicated per-thread allocation pools (reduce lock contention)
2. A batch pre-allocation strategy (reduce per-op overhead)
3. Size class coalescing (reduce routing complexity)
4. Or: a change in workload pattern (batch allocations)

---

## ⚠️ Implementation Risks & Mitigations

### Risk 1: Thread-Local Storage Bloat

**Risk:** Adding the warm pool increases per-thread memory usage

**Mitigation:**
- Allocate the warm pool lazily
- Limit to 4-8 SuperSlabs per class (128KB per thread max)
- Default: 4 slots per class → 128KB total (acceptable)

### Risk 2: Warm Pool Invalidation

**Risk:** SuperSlabs become DRAINING/FREE unexpectedly

**Mitigation:**
- Periodic validation during batch tier checks
- Accept the occasional invalid slot (rare; correctness is not affected)
- Fall back to the registry scan if a warm pool slot is invalid

### Risk 3: Stale SuperSlabs

**Risk:** The warm pool holds SuperSlabs that should be freed

**Mitigation:**
- LRU-based eviction from the warm pool
- Maximum hold time: 60s (configurable)
- On thread exit: drain the warm pool back to the LRU cache

### Risk 4: Initialization Race

**Risk:** Multiple threads initialize warm pools simultaneously

**Mitigation:**
- Use `__thread` storage (each thread operates on its own instance)
- Lazy initialization with check-then-set
- No atomic operations needed (per-thread data)

---

## 🔄 Integration Checklist

### Pre-Implementation
- [ ] Review the current unified_cache_refill() implementation
- [ ] Identify all places where SuperSlab allocation happens
- [ ] Audit the Tier system for validation requirements
- [ ] Measure the current registry scan cost in a micro-benchmark

### Phase 1: Warm Pool Infrastructure
- [ ] Create `core/front/tiny_warm_pool.h` with data structures
- [ ] Implement warm_pool_init(), pop(), push() operations
- [ ] Add __thread variable declarations
- [ ] Write unit tests for warm pool operations
- [ ] Verify no TLS bloat (profile memory usage)

### Phase 2: Integration Points
- [ ] Modify malloc_tiny_fast() to initialize warm pools
- [ ] Integrate warm_pool_pop() in unified_cache_refill()
- [ ] Implement warm_pool_push() in the cold allocation path
- [ ] Add initialization on first malloc
- [ ] Handle thread exit cleanup

### Phase 3: Testing (a micro-benchmark sketch follows this checklist)
- [ ] Micro-benchmark: warm pool pop (should be O(1), a few cycles)
- [ ] Benchmark Random Mixed: measure ops/s improvement
- [ ] Benchmark Tiny Hot: verify no regression (should be unchanged)
- [ ] Stress test: concurrent threads + warm pool refill
- [ ] Correctness: verify all objects are properly allocated/freed
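A minimal sketch of the pop micro-benchmark from the checklist above, using `clock_gettime`. It relies on the Task 1.2 operations; the harness name, the dummy pointer, and the iteration count are illustrative, and converting ns to cycles depends on the CPU frequency.

```c
#include <time.h>

// Sketch: time warm-pool push+pop pairs for one class; reports ns per pair.
// The dummy pointer is never dereferenced (push/pop only store it), and the
// pool is initialized first so push never overflows into ss_cache_put().
static double bench_warm_pool_pop(int class_idx, long iters) {
    tiny_warm_pool_init_once(class_idx);
    SuperSlab* dummy = (SuperSlab*)0x1000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        tiny_warm_pool_push(class_idx, dummy);  // count: 0 → 1
        (void)tiny_warm_pool_pop(class_idx);    // count: 1 → 0
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}

// Usage: bench_warm_pool_pop(0, 10 * 1000 * 1000) on an idle core.
```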
### Phase 4: Profiling & Optimization
- [ ] Profile the hot path (should still be 20-30 cycles)
- [ ] Profile the warm path (should drop to 50-100 cycles)
- [ ] Measure the registry scan reduction
- [ ] Identify any remaining bottlenecks

### Phase 5: Documentation
- [ ] Update comments in unified_cache_refill()
- [ ] Document the warm pool design in the README
- [ ] Add environment variables (if needed)
- [ ] Document the tier check batching strategy

---

## 📊 Metrics to Track

### Pre-Implementation

```
Baseline Random Mixed:
- Ops/sec:         1.06M
- L1 cache misses: ~763K per 1M ops
- Page faults:     ~7,674
- CPU cycles:      ~70.4M
```

### Post-Implementation Targets

```
After warm pool:
- Ops/sec:         1.5-1.8M (+40-70%)
- L1 cache misses: similar or slightly reduced
- Page faults:     same (~7,674)
- CPU cycles:      ~45-50M (30% reduction)

Warm path breakdown:
- Warm pool hit:     50-100 cycles per batch
- Registry fallback: 200-300 cycles (rare)
- Cold alloc:        1000-5000 cycles (very rare)
```

---

## 💾 Files to Create/Modify

### New Files
- `core/front/tiny_warm_pool.h` - warm pool data structures & operations

### Modified Files

1. `core/front/malloc_tiny_fast.h`
   - Initialize warm pools on first call
   - Document three-tier routing
2. `core/front/tiny_unified_cache.h`
   - Modify unified_cache_refill() to use the warm pool first
   - Add warm pool replenishment logic
3. `core/box/ss_tier_box.h`
   - Add the batched tier check strategy
   - Document validation requirements
4. `core/hakmem_tiny.h` or `core/front/malloc_tiny_fast.h`
   - Add environment variables:
     - `HAKMEM_WARM_POOL_SIZE` (default: 4)
     - `HAKMEM_WARM_POOL_REFILL_THRESHOLD` (default: 1)

### Configuration Files
- Add warm pool parameters to the benchmark configuration
- Update profiling tools to measure warm pool effectiveness

---

## 🎯 Success Criteria

✅ **Must Have:**
1. Warm pool implementation reduces registry scan cost by 80%+
2. Random Mixed ops/s increases to 1.5M+ (40%+ improvement)
3. Tiny Hot ops/s unchanged (no regression)
4. All allocations remain correct (no memory corruption)
5. No thread-local storage bloat (< 200KB per thread)

✅ **Nice to Have:**
1. Random Mixed reaches 2M+ ops/s (90%+ improvement)
2. Warm pool hit rate > 90% (rarely falls back to the registry)
3. L1 cache misses reduced by 10%+
4. Per-free cost unchanged (no regression)

❌ **Not in Scope (separate PR):**
1. Lock-free refill path (requires a CAS-based warm pool)
2. Per-thread allocation pools (requires a larger redesign)
3. Hugepages support (already tested, no gain)

---

## 📝 Next Steps

1. **Review this proposal** with the team
2. **Approve scope & success criteria**
3. **Begin Phase 1 implementation** (warm pool header file)
4. **Integrate with unified_cache_refill()**
5. **Benchmark and measure improvements**
6. **Iterate based on profiling results**

---

## 🔗 References

- Current Profiling: `COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md`
- Session Summary: `FINAL_SESSION_REPORT_20251204.md`
- Box Architecture: `core/box/` directory
- Unified Cache: `core/front/tiny_unified_cache.h`
- Registry: `core/hakmem_super_registry.h`
- Tier System: `core/box/ss_tier_box.h`