# HAKMEM Architectural Restructuring for 10x Performance - Implementation Proposal

## 2025-12-04

---

## 📊 Executive Summary

**Goal:** Achieve a 10x performance improvement on Random Mixed allocations (1.06M → 10.6M ops/s) by restructuring the allocator to separate HOT/WARM/COLD execution paths.

**Current Performance Gap:**

```
Random Mixed: 1.06M ops/s   (current baseline)
Tiny Hot:     89M ops/s     (reference - different workload)
Goal:         10.6M ops/s   (10x from baseline)
```

**Key Discovery:** The current architecture already has HOT/WARM separation (via the Unified Cache), but inefficiencies in the WARM path prevent scaling:

1. **Registry scan on cache miss** (O(N) search through the per-class registry)
2. **Per-allocation tier checks** (atomic operations, not batched)
3. **Lack of pre-warmed SuperSlab pools** (must allocate/initialize on miss)
4. **Global registry contention** (mutex-protected writes)

---

## 🔍 Current Architecture Analysis

### Existing Two-Speed Foundation

HAKMEM **already implements** a two-tier design:

```
HOT PATH (95%+ of allocations):
  malloc_tiny_fast()
    → tiny_hot_alloc_fast()
    → Unified Cache pop (TLS, 2-3 cache misses)
    → Return USER pointer
  Cost: ~20-30 CPU cycles

WARM PATH (1-5% cache misses):
  malloc_tiny_fast()
    → tiny_cold_refill_and_alloc()
    → unified_cache_refill()
    → Per-class registry scan (find HOT SuperSlab)
    → Tier check (is HOT)
    → Carve ~64 blocks
    → Refill Unified Cache
    → Return USER pointer
  Cost: ~500-1000 cycles per batch (~5-10 per object amortized)
```

### Performance Bottlenecks in WARM Path

**Bottleneck 1: Registry Scan (O(N))**
- Current: Linear search through the per-class registry to find a HOT SuperSlab
- Cost: 50-100 cycles per refill
- Happens on EVERY cache miss (~1-5% of allocations)
- Files: `core/hakmem_super_registry.h`, `core/front/tiny_unified_cache.h` (unified_cache_refill function)

**Bottleneck 2: Per-Allocation Tier Checks**
- Current: `ss_tier_is_hot(ss)` is called on every refill
- Should be: Multiple tier checks batched together
- Cost: Atomic operations, not amortized
- File: `core/box/ss_tier_box.h`

**Bottleneck 3: Global Registry Contention**
- Current: Mutex-protected registry insert on SuperSlab alloc
- File: `core/hakmem_super_registry.h` (hak_super_registry_insert)
- Lock: `g_super_reg_lock`

**Bottleneck 4: SuperSlab Initialization Overhead**
- Current: Full allocation + initialization on cache miss → cold path
- Cost: ~1000+ cycles (mmap, metadata setup, registry insert)
- Should be: Pre-allocated from the LRU cache or a warm pool

---

## 💡 Proposed Three-Tier Architecture

### Tier 1: HOT (95%+ of allocations)

```c
// Path: TLS Unified Cache hit
// Cost: ~20-30 cycles (unchanged)
// Characteristics:
//   - No registry access
//   - No Tier/Guard calls
//   - No locks
//   - Branch-free (or a single well-predicted branch)

Path:
  1. Read TLS Unified Cache (TLS access, 1 cache miss)
  2. Pop from array (array access, 1 cache miss)
  3. Update head pointer (1 store)
  4. Return USER pointer (0 additional branches on a hit)

Total: 2-3 cache misses, ~20-30 cycles
```
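To make the HOT-path cost concrete, here is a minimal sketch of the Tier 1 pop, assuming a per-class TLS array cache. `TinyUnifiedCacheSketch`, its fields, and `tiny_hot_pop_sketch` are hypothetical names for illustration; the real layout lives in `core/front/tiny_unified_cache.h` and almost certainly differs.

```c
#include <stdint.h>

// Hypothetical layout for illustration only; not the real Unified Cache.
typedef struct {
    void**   slots;   // array of cached USER pointers
    uint32_t head;    // next slot to pop
    uint32_t tail;    // end of the valid region
} TinyUnifiedCacheSketch;

extern __thread TinyUnifiedCacheSketch g_unified_cache_sketch[32 /* e.g. TINY_NUM_CLASSES */];

// Tier 1 pop: TLS read + array pop + head update, one predictable branch.
static inline void* tiny_hot_pop_sketch(int class_idx) {
    TinyUnifiedCacheSketch* c = &g_unified_cache_sketch[class_idx];
    if (c->head == c->tail)        // miss → caller falls through to WARM
        return NULL;
    return c->slots[c->head++];    // the entire HOT-path memory traffic
}
```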
### Tier 2: WARM (1-5% cache misses)

**NEW: Per-Thread Warm Pool**

```c
// Path: Unified Cache miss → Pop from per-thread warm pool
// Cost: ~50-100 cycles per batch (5-10 per object amortized)
// Characteristics:
//   - No global registry scan
//   - Pre-qualified SuperSlabs (already HOT)
//   - Batched tier transitions (not per-object)
//   - Minimal lock contention

Data Structure:
  __thread SuperSlab* g_warm_pool_head[TINY_NUM_CLASSES];
  __thread int g_warm_pool_count[TINY_NUM_CLASSES];
  __thread int g_warm_pool_capacity[TINY_NUM_CLASSES];

Path:
  1. Detect Unified Cache miss (head == tail)
  2. Check warm pool (TLS access, no lock)
     a. If warm_pool_count > 0:
        ├─ Pop SuperSlab from warm_pool_head (O(1))
        ├─ Use existing SuperSlab (no mmap)
        ├─ Carve ~64 blocks (amortized cost)
        ├─ Refill Unified Cache
        ├─ (Optional) Batch tier check after ~64 pops
        └─ Return first block
     b. If warm_pool_count == 0:
        └─ Fall through to COLD (rare)

Total: ~50-100 cycles per batch
```

### Tier 3: COLD (<0.1% special cases)

```c
// Path: Warm pool exhausted, error, or special handling
// Cost: ~1000-10000 cycles per SuperSlab (rare)
// Characteristics:
//   - Full SuperSlab allocation (mmap)
//   - Registry insert (mutex-protected write)
//   - Tier initialization
//   - Guard validation

Path:
  1. Warm pool exhausted
  2. Allocate new SuperSlab (mmap via ss_os_acquire_box)
  3. Insert into global registry (mutex-protected)
  4. Initialize TinySlabMeta + metadata
  5. Add to per-class registry
  6. Carve blocks + refill both Unified Cache and warm pool
  7. Return first block
```

---

## 🔧 Implementation Plan

### Phase 1: Design & Data Structures (THIS DOCUMENT)

**Task 1.1: Define Warm Pool Data Structure**

```c
// File: core/front/tiny_warm_pool.h (NEW)
//
// Per-thread warm pool for pre-allocated SuperSlabs.
// Reduces registry scan cost on cache miss.

#ifndef HAK_TINY_WARM_POOL_H
#define HAK_TINY_WARM_POOL_H

#include <stddef.h>  /* assumption: the original system include was garbled */
#include "../hakmem_tiny_config.h"
#include "../superslab/superslab_types.h"

// Maximum warm SuperSlabs per thread (tunable)
#define TINY_WARM_POOL_MAX_PER_CLASS 4

typedef struct {
    SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;
    int capacity;
} TinyWarmPool;

// Per-thread warm pools (one per class)
extern __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];

// Operations:
//   - tiny_warm_pool_init()   → Initialize at thread startup
//   - tiny_warm_pool_push()   → Add SuperSlab to warm pool
//   - tiny_warm_pool_pop()    → Remove SuperSlab from warm pool (O(1))
//   - tiny_warm_pool_drain()  → Return all to LRU on thread exit
//   - tiny_warm_pool_refill() → Batch refill from LRU cache

#endif
```

**Task 1.2: Define Warm Pool Operations**

```c
// Lazy initialization (once per thread)
static inline void tiny_warm_pool_init_once(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->capacity == 0) {
        pool->capacity = TINY_WARM_POOL_MAX_PER_CLASS;
        pool->count = 0;
        // Actual SuperSlabs are allocated on demand (COLD path)
    }
}

// O(1) pop from warm pool
static inline SuperSlab* tiny_warm_pool_pop(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count > 0) {
        return pool->slabs[--pool->count];  // Pop from end
    }
    return NULL;  // Pool empty → fall through to COLD
}

// O(1) push to warm pool
static inline void tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count < pool->capacity) {
        pool->slabs[pool->count++] = ss;
    } else {
        // Pool full → return to LRU cache or free
        ss_cache_put(ss);  // Return to global LRU
    }
}
```
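The header in Task 1.1 also lists `tiny_warm_pool_drain()`, which Task 1.2 does not yet implement. A hedged sketch follows, building on the Task 1.1/1.2 definitions; wiring it through a pthread destructor key is an assumption about the integration point, not existing HAKMEM code.

```c
#include <pthread.h>

// Sketch: return all warm SuperSlabs to the global LRU cache so they are
// not leaked when the owning thread exits.
static void tiny_warm_pool_drain_all(void* unused) {
    (void)unused;
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        TinyWarmPool* pool = &g_tiny_warm_pool[c];
        while (pool->count > 0) {
            ss_cache_put(pool->slabs[--pool->count]);  // back to LRU
        }
    }
}

// One-time destructor registration (e.g., from the first malloc call).
static pthread_key_t  g_warm_pool_key;
static pthread_once_t g_warm_pool_once = PTHREAD_ONCE_INIT;

static void tiny_warm_pool_key_init(void) {
    pthread_key_create(&g_warm_pool_key, tiny_warm_pool_drain_all);
}

static inline void tiny_warm_pool_register_thread(void) {
    pthread_once(&g_warm_pool_once, tiny_warm_pool_key_init);
    // Any non-NULL value makes the destructor fire for this thread.
    pthread_setspecific(g_warm_pool_key, (void*)1);
}
```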
### Phase 2: Implement Warm Pool Initialization

**Task 2.1: Thread Startup Integration**
- Initialize warm pools on the first malloc call
- Pre-populate from the LRU cache (if available)
- Fall back to cold allocation if needed

**Task 2.2: Batch Refill Strategy**
- On thread startup: allocate ~2-3 SuperSlabs per class into the warm pool
- On cache miss: pop from the warm pool (no registry scan)
- On warm pool depletion: allocate 1-2 more in the cold path

### Phase 3: Modify unified_cache_refill()

**Current Implementation** (Registry Scan):

```c
// NOTE: 'cache' = this thread's Unified Cache for class_idx (elided here).
void unified_cache_refill(int class_idx) {
    // Linear search through the per-class registry
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* ss = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(ss)) {  // ← Tier check (5-10 cycles)
            // Carve blocks
            carve_blocks_from_superslab(ss, class_idx, cache);
            return;
        }
    }
    // Not found → cold path (allocate new SuperSlab)
}
```

**Proposed Implementation** (Warm Pool First):

```c
void unified_cache_refill(int class_idx) {
    // 1. Try the warm pool first (no lock, O(1))
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        // SuperSlab is pre-qualified as HOT; no tier check needed
        carve_blocks_from_superslab(ss, class_idx, cache);
        return;
    }

    // 2. Fall back to the registry scan (only when the warm pool is empty;
    //    with TINY_WARM_POOL_MAX_PER_CLASS = 4 this is rare)
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* cand = g_super_reg_by_class[class_idx][i];  // renamed: avoid shadowing 'ss'
        if (ss_tier_is_hot(cand)) {
            carve_blocks_from_superslab(cand, class_idx, cache);
            // Refill the warm pool while we are scanning anyway
            for (int j = 0; j < 2; j++) {
                SuperSlab* extra = find_next_hot_slab(class_idx, i);
                if (extra) {
                    tiny_warm_pool_push(class_idx, extra);
                    i++;
                }
            }
            return;
        }
    }

    // 3. Cold path (allocate new SuperSlab)
    allocate_new_superslab(class_idx, cache);
}
```

### Phase 4: Batched Tier Transition Checks

**Current:** tier check on every refill (5-10 cycles)

**Proposed:** batch tier checks once per N operations

```c
// Per-thread counter; tier state is only re-validated every
// TIER_CHECK_BATCH_SIZE operations instead of on every refill.
static __thread uint32_t g_tier_check_counter = 0;
#define TIER_CHECK_BATCH_SIZE 256

void tier_check_maybe_batch(int class_idx) {
    if (++g_tier_check_counter % TIER_CHECK_BATCH_SIZE == 0) {
        // Batch check: sample SuperSlabs from the per-class registry
        int n = g_super_reg_by_class_count[class_idx];
        for (int i = 0; i < 10 && n > 0; i++) {  // Sample up to 10 SuperSlabs
            SuperSlab* ss = g_super_reg_by_class[class_idx][rand() % n];
            if (!ss_tier_is_hot(ss)) {
                // Demote from the warm pool if present
                // (Cost: ~1 atomic per 256 operations)
            }
        }
    }
}
```

### Phase 5: LRU Cache Integration

**How the Warm Pool Gets Replenished** (see the sketch after this list):

1. **Startup:** Pre-populate warm pools from the LRU cache
2. **During execution:** On a cold-path alloc, add an extra SuperSlab to the warm pool
3. **Periodic:** A background thread refills warm pools when below a threshold
4. **On free:** When a SuperSlab becomes empty, add it to the LRU cache (not the warm pool)
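A hedged sketch of the batch-refill operation behind steps 1 and 3. `ss_cache_get()` is assumed here as the LRU counterpart of the `ss_cache_put()` used in Task 1.2 and may not match the real LRU cache API.

```c
// Sketch: top up one class's warm pool from the global LRU cache.
// ss_cache_get() is an assumed API: it returns a reusable SuperSlab for
// class_idx, or NULL when the LRU cache has none available.
static inline void tiny_warm_pool_refill_from_lru(int class_idx, int target) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (target > pool->capacity)
        target = pool->capacity;
    while (pool->count < target) {
        SuperSlab* ss = ss_cache_get(class_idx);
        if (!ss)
            break;  // LRU empty → leave the rest to the COLD path
        pool->slabs[pool->count++] = ss;
    }
}

// Usage (thread startup, Task 2.2): pre-populate ~2 slabs per class.
//   for (int c = 0; c < TINY_NUM_CLASSES; c++)
//       tiny_warm_pool_refill_from_lru(c, 2);
```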
---

## 📈 Expected Performance Impact

### Current Baseline

```
Random Mixed: 1.06M ops/s

Breakdown:
- 95% cache hits (HOT):   ~1.007M ops/s (clean, 2-3 cache misses)
- 5% cache misses (WARM): ~0.053M ops/s (registry scan + refill)
```

### After Warm Pool Implementation

```
Estimated: 1.5-1.8M ops/s (+40-70%)

Breakdown:
- 95% cache hits (HOT):   ~1.007M ops/s (unchanged, 2-3 cache misses)
- 5% cache misses (WARM): ~0.15-0.20M ops/s (warm pool, O(1) pop)
                          (vs 0.053M before)

Improvement mechanism:
- Remove the registry O(N) scan → O(1) warm pool pop
- Reduce per-refill cost: ~500 cycles → ~50 cycles
- Expected per-miss speedup: ~10x
- Rough additive estimate: the 5% WARM share speeds up ~10x,
  adding ≈ 1.06M × 0.05 × 9 = 0.477M ops/s
- Total: 1.06M + 0.477M ≈ 1.54M ops/s (+45%)
```

### Path to 10x

The optimizations in scope can realistically achieve:

- **Warm pool optimization:** +40-70% (this proposal)
- **Lock-free refill path:** +10-20% (phase 2)
- **Batch tier transitions:** +5-10% (phase 2)
- **Reduced syscall overhead:** +5% (phase 3)
- **Realistic total: 2.0-2.5x** (not 10x)

**Reaching a full 10x would additionally require:**

1. Dedicated per-thread allocation pools (reduce lock contention)
2. A batch pre-allocation strategy (reduce per-op overhead)
3. Size class coalescing (reduce routing complexity)
4. Or: a change in workload pattern (batch allocations)

---

## ⚠️ Implementation Risks & Mitigations

### Risk 1: Thread-Local Storage Bloat

**Risk:** Adding the warm pool increases per-thread memory usage

**Mitigation:**
- Allocate the warm pool lazily
- Limit to 4-8 SuperSlabs per class (128KB per thread max)
- Default: 4 slots per class → 128KB total (acceptable)

### Risk 2: Warm Pool Invalidation

**Risk:** SuperSlabs become DRAINING/FREE unexpectedly

**Mitigation:**
- Periodic validation during batch tier checks
- Accept the occasional invalid slot (rare; correctness is not affected)
- Fall back to the registry scan if a warm pool slot is invalid

### Risk 3: Stale SuperSlabs

**Risk:** The warm pool holds SuperSlabs that should be freed

**Mitigation:**
- LRU-based eviction from the warm pool
- Maximum hold time: 60s (configurable)
- On thread exit: drain the warm pool back to the LRU cache

### Risk 4: Initialization Race

**Risk:** Multiple threads initialize warm pools simultaneously

**Mitigation:**
- Use `__thread` storage (each thread operates on its own instance)
- Lazy initialization with check-then-set
- No atomic operations needed (per-thread data)

---

## 🔄 Integration Checklist

### Pre-Implementation
- [ ] Review the current unified_cache_refill() implementation
- [ ] Identify all places where SuperSlab allocation happens
- [ ] Audit the Tier system for validation requirements
- [ ] Measure the current registry scan cost in a micro-benchmark

### Phase 1: Warm Pool Infrastructure
- [ ] Create `core/front/tiny_warm_pool.h` with data structures
- [ ] Implement warm_pool_init(), pop(), push() operations
- [ ] Add __thread variable declarations
- [ ] Write unit tests for warm pool operations
- [ ] Verify no TLS bloat (profile memory usage)

### Phase 2: Integration Points
- [ ] Modify malloc_tiny_fast() to initialize warm pools
- [ ] Integrate warm_pool_pop() in unified_cache_refill()
- [ ] Implement warm_pool_push() in the cold allocation path
- [ ] Add initialization on first malloc
- [ ] Handle thread exit cleanup

### Phase 3: Testing (a micro-benchmark sketch follows this checklist)
- [ ] Micro-benchmark: warm pool pop (should be O(1), a few cycles)
- [ ] Benchmark Random Mixed: measure ops/s improvement
- [ ] Benchmark Tiny Hot: verify no regression (should be unchanged)
- [ ] Stress test: concurrent threads + warm pool refill
- [ ] Correctness: verify all objects are properly allocated/freed
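A minimal sketch of the pop micro-benchmark from the checklist above, using `clock_gettime`. It relies on the Task 1.2 operations; the harness name, the dummy pointer, and the iteration count are illustrative, and converting ns to cycles depends on the CPU frequency.

```c
#include <time.h>

// Sketch: time warm-pool push+pop pairs for one class; reports ns per pair.
// The dummy pointer is never dereferenced (push/pop only store it), and the
// pool is initialized first so push never overflows into ss_cache_put().
static double bench_warm_pool_pop(int class_idx, long iters) {
    tiny_warm_pool_init_once(class_idx);
    SuperSlab* dummy = (SuperSlab*)0x1000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        tiny_warm_pool_push(class_idx, dummy);  // count: 0 → 1
        (void)tiny_warm_pool_pop(class_idx);    // count: 1 → 0
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}

// Usage: bench_warm_pool_pop(0, 10 * 1000 * 1000) on an idle core.
```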
### Phase 4: Profiling & Optimization
- [ ] Profile the hot path (should still be 20-30 cycles)
- [ ] Profile the warm path (should drop to 50-100 cycles)
- [ ] Measure the registry scan reduction
- [ ] Identify any remaining bottlenecks

### Phase 5: Documentation
- [ ] Update comments in unified_cache_refill()
- [ ] Document the warm pool design in the README
- [ ] Add environment variables (if needed)
- [ ] Document the tier check batching strategy

---

## 📊 Metrics to Track

### Pre-Implementation

```
Baseline Random Mixed:
- Ops/sec:         1.06M
- L1 cache misses: ~763K per 1M ops
- Page faults:     ~7,674
- CPU cycles:      ~70.4M
```

### Post-Implementation Targets

```
After warm pool:
- Ops/sec:         1.5-1.8M (+40-70%)
- L1 cache misses: similar or slightly reduced
- Page faults:     same (~7,674)
- CPU cycles:      ~45-50M (30% reduction)

Warm path breakdown:
- Warm pool hit:     50-100 cycles per batch
- Registry fallback: 200-300 cycles (rare)
- Cold alloc:        1000-5000 cycles (very rare)
```

---

## 💾 Files to Create/Modify

### New Files
- `core/front/tiny_warm_pool.h` - warm pool data structures & operations

### Modified Files

1. `core/front/malloc_tiny_fast.h`
   - Initialize warm pools on first call
   - Document three-tier routing
2. `core/front/tiny_unified_cache.h`
   - Modify unified_cache_refill() to use the warm pool first
   - Add warm pool replenishment logic
3. `core/box/ss_tier_box.h`
   - Add the batched tier check strategy
   - Document validation requirements
4. `core/hakmem_tiny.h` or `core/front/malloc_tiny_fast.h`
   - Add environment variables:
     - `HAKMEM_WARM_POOL_SIZE` (default: 4)
     - `HAKMEM_WARM_POOL_REFILL_THRESHOLD` (default: 1)

### Configuration Files
- Add warm pool parameters to the benchmark configuration
- Update profiling tools to measure warm pool effectiveness

---

## 🎯 Success Criteria

✅ **Must Have:**
1. Warm pool implementation reduces registry scan cost by 80%+
2. Random Mixed ops/s increases to 1.5M+ (40%+ improvement)
3. Tiny Hot ops/s unchanged (no regression)
4. All allocations remain correct (no memory corruption)
5. No thread-local storage bloat (< 200KB per thread)

✅ **Nice to Have:**
1. Random Mixed reaches 2M+ ops/s (90%+ improvement)
2. Warm pool hit rate > 90% (rarely falls back to the registry)
3. L1 cache misses reduced by 10%+
4. Per-free cost unchanged (no regression)

❌ **Not in Scope (separate PR):**
1. Lock-free refill path (requires a CAS-based warm pool)
2. Per-thread allocation pools (requires a larger redesign)
3. Hugepages support (already tested, no gain)

---

## 📝 Next Steps

1. **Review this proposal** with the team
2. **Approve scope & success criteria**
3. **Begin Phase 1 implementation** (warm pool header file)
4. **Integrate with unified_cache_refill()**
5. **Benchmark and measure improvements**
6. **Iterate based on profiling results**

---

## 🔗 References

- Current Profiling: `COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md`
- Session Summary: `FINAL_SESSION_REPORT_20251204.md`
- Box Architecture: `core/box/` directory
- Unified Cache: `core/front/tiny_unified_cache.h`
- Registry: `core/hakmem_super_registry.h`
- Tier System: `core/box/ss_tier_box.h`