# HAKMEM Architectural Restructuring for 10x Performance - Implementation Proposal

## 2025-12-04

---

## 📊 Executive Summary

**Goal:** Achieve a 10x performance improvement on Random Mixed allocations (1.06M → 10.6M ops/s) by restructuring the allocator to separate HOT/WARM/COLD execution paths.

**Current Performance Gap:**

```
Random Mixed: 1.06M ops/s  (current baseline)
Tiny Hot:     89M ops/s    (reference - different workload)
Goal:         10.6M ops/s  (10x from baseline)
```

**Key Discovery:** The current architecture already has HOT/WARM separation (via the Unified Cache), but inefficiencies in the WARM path prevent scaling:

1. **Registry scan on cache miss** (O(N) search through the per-class registry)
2. **Per-allocation tier checks** (atomic operations, not batched)
3. **No pre-warmed SuperSlab pools** (must allocate and initialize on a miss)
4. **Global registry contention** (mutex-protected writes)

---

## 🔍 Current Architecture Analysis

### Existing Two-Speed Foundation

HAKMEM **already implements** a two-tier design:

```
HOT PATH (95%+ of allocations):
  malloc_tiny_fast()
    → tiny_hot_alloc_fast()
    → Unified Cache pop (TLS, 2-3 cache misses)
    → Return USER pointer
  Cost: ~20-30 CPU cycles

WARM PATH (1-5% cache misses):
  malloc_tiny_fast()
    → tiny_cold_refill_and_alloc()
    → unified_cache_refill()
    → Per-class registry scan (find a HOT SuperSlab)
    → Tier check (is HOT?)
    → Carve ~64 blocks
    → Refill Unified Cache
    → Return USER pointer
  Cost: ~500-1000 cycles per batch (~5-10 per object amortized)
```

### Performance Bottlenecks in WARM Path

**Bottleneck 1: Registry Scan (O(N))**
- Current: Linear search through the per-class registry to find a HOT SuperSlab
- Cost: 50-100 cycles per refill
- Happens on EVERY cache miss (~1-5% of allocations)
- Files: `core/hakmem_super_registry.h`, `core/front/tiny_unified_cache.h` (unified_cache_refill)

**Bottleneck 2: Per-Allocation Tier Checks**
- Current: `ss_tier_is_hot(ss)` is called once per batch (during refill)
- Should be: Multiple tier checks batched together
- Cost: Atomic operations, not amortized
- File: `core/box/ss_tier_box.h`

**Bottleneck 3: Global Registry Contention**
- Current: Mutex-protected registry insert on SuperSlab allocation
- File: `core/hakmem_super_registry.h` (hak_super_registry_insert)
- Lock: `g_super_reg_lock`

**Bottleneck 4: SuperSlab Initialization Overhead**
- Current: Full allocation + initialization on cache miss → cold path
- Cost: ~1000+ cycles (mmap, metadata setup, registry insert)
- Should be: Pre-allocated from the LRU cache or a warm pool

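To make Bottleneck 1 concrete, here is a toy model (mock names and data, not HAKMEM code) contrasting the O(N) scan with the O(1) pop this proposal substitutes for it:

```c
#include <assert.h>
#include <stddef.h>

#define REG_SLOTS 64

// Mock per-class registry: 1 = HOT SuperSlab, 0 = not HOT.
static int g_mock_registry[REG_SLOTS];

// O(N) refill: probe slots until a HOT slab is found.
// Writes the probe count so the cost is observable.
static int registry_scan(int* probes) {
    *probes = 0;
    for (int i = 0; i < REG_SLOTS; i++) {
        (*probes)++;
        if (g_mock_registry[i]) return i;
    }
    return -1;  // no HOT slab -> cold path
}

// O(1) warm-pool refill: the pool only ever holds pre-qualified HOT
// slabs, so a single pop replaces the whole scan.
static int g_mock_pool[4];
static int g_mock_pool_count = 0;

static int warm_pool_pop_mock(int* probes) {
    *probes = 1;
    return g_mock_pool_count > 0 ? g_mock_pool[--g_mock_pool_count] : -1;
}
```

With the first HOT slab sitting at slot 40, the scan touches 41 slots while the pool pop touches one; the 50-100 cycle refill cost tracks that probe count.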
---

## 💡 Proposed Three-Tier Architecture

### Tier 1: HOT (95%+ of allocations)

```c
// Path: TLS Unified Cache hit
// Cost: ~20-30 cycles (unchanged)
// Characteristics:
// - No registry access
// - No Tier/Guard calls
// - No locks
// - Branch-free (or a single well-predicted branch on a hit)

Path:
1. Read TLS Unified Cache (TLS access, 1 cache miss)
2. Pop from array (array access, 1 cache miss)
3. Update head pointer (1 store)
4. Return USER pointer (0 additional branches on a hit)

Total: 2-3 cache misses, ~20-30 cycles
```
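The four numbered steps can be sketched as a minimal TLS array cache. `UnifiedCache`, `g_unified_cache`, and the function names below are illustrative for this sketch, not the actual HAKMEM identifiers:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_CAPACITY 64

// Illustrative per-class TLS cache: a plain array stack of user pointers.
typedef struct {
    void* slots[CACHE_CAPACITY];
    int   count;  // number of cached blocks
} UnifiedCache;

static __thread UnifiedCache g_unified_cache;  // one class shown for brevity

// HOT path: on a hit this is one TLS read, one array load, and one
// store to count. No locks, no registry, no tier checks.
static inline void* cache_pop(UnifiedCache* c) {
    if (c->count > 0)
        return c->slots[--c->count];
    return NULL;  // miss -> WARM/COLD refill
}

// Free side: push back into the cache while there is room.
static inline int cache_push(UnifiedCache* c, void* p) {
    if (c->count < CACHE_CAPACITY) {
        c->slots[c->count++] = p;
        return 1;
    }
    return 0;  // cache full -> slower path
}
```

On a hit, the only work is a TLS address computation, one bounds check, and an array load, which is where the 20-30 cycle figure comes from.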

### Tier 2: WARM (1-5% cache misses)

**NEW: Per-Thread Warm Pool**

```c
// Path: Unified Cache miss → Pop from per-thread warm pool
// Cost: ~50-100 cycles per batch (5-10 per object amortized)
// Characteristics:
// - No global registry scan
// - Pre-qualified SuperSlabs (already HOT)
// - Batched tier transitions (not per-object)
// - Minimal lock contention

Data Structure:
__thread SuperSlab* g_warm_pool_head[TINY_NUM_CLASSES];
__thread int        g_warm_pool_count[TINY_NUM_CLASSES];
__thread int        g_warm_pool_capacity[TINY_NUM_CLASSES];

Path:
1. Detect Unified Cache miss (head == tail)
2. Check warm pool (TLS access, no lock)
   a. If warm_pool_count > 0:
      ├─ Pop SuperSlab from warm_pool_head (O(1))
      ├─ Use existing SuperSlab (no mmap)
      ├─ Carve ~64 blocks (amortized cost)
      ├─ Refill Unified Cache
      ├─ (Optional) Batch tier check after ~64 pops
      └─ Return first block
   b. If warm_pool_count == 0:
      └─ Fall through to COLD (rare)

Total: ~50-100 cycles per batch
```

### Tier 3: COLD (<0.1% special cases)

```c
// Path: Warm pool exhausted, error, or special handling
// Cost: ~1000-10000 cycles per SuperSlab (rare)
// Characteristics:
// - Full SuperSlab allocation (mmap)
// - Registry insert (mutex-protected write)
// - Tier initialization
// - Guard validation

Path:
1. Warm pool exhausted
2. Allocate new SuperSlab (mmap via ss_os_acquire_box)
3. Insert into global registry (mutex-protected)
4. Initialize TinySlabMeta + metadata
5. Add to per-class registry
6. Carve blocks + refill both Unified Cache and warm pool
7. Return first block
```

---

## 🔧 Implementation Plan

### Phase 1: Design & Data Structures (THIS DOCUMENT)

**Task 1.1: Define Warm Pool Data Structure**

```c
// File: core/front/tiny_warm_pool.h (NEW)
//
// Per-thread warm pool of pre-allocated SuperSlabs.
// Reduces registry scan cost on cache miss.

#ifndef HAK_TINY_WARM_POOL_H
#define HAK_TINY_WARM_POOL_H

#include <stdint.h>
#include "../hakmem_tiny_config.h"
#include "../superslab/superslab_types.h"

// Maximum warm SuperSlabs per thread (tunable)
#define TINY_WARM_POOL_MAX_PER_CLASS 4

typedef struct {
    SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;
    int capacity;
} TinyWarmPool;

// Per-thread warm pools (one per class)
extern __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];

// Operations:
// - tiny_warm_pool_init()   → Initialize at thread startup
// - tiny_warm_pool_push()   → Add SuperSlab to warm pool (O(1))
// - tiny_warm_pool_pop()    → Remove SuperSlab from warm pool (O(1))
// - tiny_warm_pool_drain()  → Return all to LRU on thread exit
// - tiny_warm_pool_refill() → Batch refill from LRU cache

#endif  // HAK_TINY_WARM_POOL_H
```

**Task 1.2: Define Warm Pool Operations**

```c
// Lazy initialization (once per thread)
static inline void tiny_warm_pool_init_once(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->capacity == 0) {
        pool->capacity = TINY_WARM_POOL_MAX_PER_CLASS;
        pool->count = 0;
        // Initial SuperSlabs are allocated on demand (COLD path)
    }
}

// O(1) pop from warm pool
static inline SuperSlab* tiny_warm_pool_pop(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count > 0) {
        return pool->slabs[--pool->count];  // Pop from end
    }
    return NULL;  // Pool empty → fall through to COLD
}

// O(1) push to warm pool
static inline void tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count < pool->capacity) {
        pool->slabs[pool->count++] = ss;
    } else {
        // Pool full → return to global LRU cache
        ss_cache_put(ss);
    }
}
```
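Task 1.1's operation list also names `tiny_warm_pool_drain()`, which is not shown above. A self-contained sketch, with `ss_cache_put()` reduced to a counting stub so the behavior is testable, could look like:

```c
#include <assert.h>
#include <stddef.h>

#define TINY_WARM_POOL_MAX_PER_CLASS 4
#define TINY_NUM_CLASSES 8  // illustrative value for this sketch

typedef struct SuperSlab SuperSlab;  // opaque here

typedef struct {
    SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;
    int capacity;
} TinyWarmPool;

static __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];

// Stand-in for the real LRU hand-off; counts calls for the test below.
static int g_lru_puts = 0;
static void ss_cache_put(SuperSlab* ss) { (void)ss; g_lru_puts++; }

// Drain: hand every warm SuperSlab back to the global LRU cache on
// thread exit, so slabs are not stranded in a dead thread's TLS.
static inline void tiny_warm_pool_drain(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    while (pool->count > 0) {
        ss_cache_put(pool->slabs[--pool->count]);
    }
}
```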

### Phase 2: Implement Warm Pool Initialization

**Task 2.1: Thread Startup Integration**
- Initialize warm pools on the first malloc call
- Pre-populate from the LRU cache (if available)
- Fall back to cold allocation if needed

**Task 2.2: Batch Refill Strategy**
- On thread startup: Move ~2-3 SuperSlabs per class into the warm pool
- On cache miss: Pop from the warm pool (no registry scan)
- On warm pool depletion: Allocate 1-2 more in the cold path
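Task 2.2's startup refill could be expressed as below. `ss_cache_get()` is assumed here as an LRU pop that returns NULL when empty (the real HAKMEM entry point may differ); it is stubbed with two slabs so the early-stop behavior is visible:

```c
#include <assert.h>
#include <stddef.h>

#define TINY_WARM_POOL_MAX_PER_CLASS 4

typedef struct SuperSlab SuperSlab;  // opaque here

typedef struct {
    SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;
    int capacity;
} TinyWarmPool;

// Stub LRU holding two slabs, standing in for the real global cache.
static SuperSlab* g_lru_stub[2] = { (SuperSlab*)0x1000, (SuperSlab*)0x2000 };
static int g_lru_len = 2;
static SuperSlab* ss_cache_get(int class_idx) {
    (void)class_idx;
    return g_lru_len > 0 ? g_lru_stub[--g_lru_len] : NULL;
}

// Refill the warm pool up to `target` slabs from the global LRU cache.
// Stops early when the LRU is empty so startup never forces a cold
// mmap; returns the number of slabs actually added.
static inline int tiny_warm_pool_refill(TinyWarmPool* pool,
                                        int class_idx, int target) {
    int added = 0;
    if (target > pool->capacity) target = pool->capacity;
    while (pool->count < target) {
        SuperSlab* ss = ss_cache_get(class_idx);
        if (!ss) break;  // LRU empty -> stay lazy, allocate later if needed
        pool->slabs[pool->count++] = ss;
        added++;
    }
    return added;
}
```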
### Phase 3: Modify unified_cache_refill()

**Current Implementation** (registry scan; `cache` below refers to the caller's TLS Unified Cache):

```c
void unified_cache_refill(int class_idx) {
    // Linear search through the per-class registry
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* ss = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(ss)) {  // ← Tier check (5-10 cycles)
            // Carve blocks
            carve_blocks_from_superslab(ss, class_idx, cache);
            return;
        }
    }
    // Not found → cold path (allocate new SuperSlab)
}
```

**Proposed Implementation** (Warm Pool First):

```c
void unified_cache_refill(int class_idx) {
    // 1. Try the warm pool first (no lock, O(1))
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        // SuperSlab already HOT (pre-qualified), no tier check needed
        carve_blocks_from_superslab(ss, class_idx, cache);
        return;
    }

    // 2. Fall back to the registry scan (only if the warm pool is empty)
    //    (TINY_WARM_POOL_MAX_PER_CLASS = 4, so this rarely happens)
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* cand = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(cand)) {
            carve_blocks_from_superslab(cand, class_idx, cache);
            // Refill the warm pool on success
            for (int j = 0; j < 2; j++) {
                SuperSlab* extra = find_next_hot_slab(class_idx, i);
                if (extra) {
                    tiny_warm_pool_push(class_idx, extra);
                    i++;
                }
            }
            return;
        }
    }

    // 3. Cold path (allocate a new SuperSlab)
    allocate_new_superslab(class_idx, cache);
}
```

### Phase 4: Batched Tier Transition Checks

**Current:** Tier check on every refill (5-10 cycles)
**Proposed:** Batch tier checks once per N operations

```c
// Per-thread check counter; tier state is only sampled periodically
static __thread uint32_t g_tier_check_counter = 0;
#define TIER_CHECK_BATCH_SIZE 256

void tier_check_maybe_batch(int class_idx) {
    if (++g_tier_check_counter % TIER_CHECK_BATCH_SIZE == 0) {
        // Batch check: sample SuperSlabs in the per-class registry
        int n = g_super_reg_by_class_count[class_idx];
        if (n == 0) return;
        for (int i = 0; i < 10; i++) {  // Sample 10 SuperSlabs
            SuperSlab* ss = g_super_reg_by_class[class_idx][rand() % n];
            if (!ss_tier_is_hot(ss)) {
                // Demote from warm pool if present
                // (Cost amortized to ~1 check per 256 operations)
            }
        }
    }
}
```

### Phase 5: LRU Cache Integration

**How the Warm Pool Gets Replenished:**

1. **Startup:** Pre-populate warm pools from the LRU cache
2. **During execution:** On a cold-path allocation, add an extra SuperSlab to the warm pool
3. **Periodic:** A background thread refills warm pools when below a threshold
4. **On free:** When a SuperSlab becomes empty, add it to the LRU cache (not the warm pool)

---

## 📈 Expected Performance Impact

### Current Baseline
```
Random Mixed: 1.06M ops/s
Breakdown:
- 95% cache hits (HOT):   ~1.007M ops/s (clean, 2-3 cache misses)
- 5% cache misses (WARM): ~0.053M ops/s (registry scan + refill)
```

### After Warm Pool Implementation
```
Estimated: 1.5-1.8M ops/s (+40-70%)

Breakdown:
- 95% cache hits (HOT):   ~1.007M ops/s (unchanged, 2-3 cache misses)
- 5% cache misses (WARM): ~0.15-0.20M ops/s (warm pool, O(1) pop)
                          (vs 0.053M before)

Improvement mechanism:
- Remove the registry O(N) scan → O(1) warm pool pop
- Reduce per-refill cost: ~500 cycles → ~50 cycles
- Expected per-miss speedup: ~10x
- A ~10x faster miss path adds ~9x the miss share's throughput:
  gain  = 1.06M × 0.05 × 9 = 0.477M
  total = 1.06M + 0.477M ≈ 1.54M ops/s (+45%)
```
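The +45% figure falls out of a simple throughput model, written out here so the arithmetic can be checked mechanically:

```c
#include <assert.h>

// Model behind the estimate: the miss share of ops gets a ~10x faster
// refill, contributing (speedup - 1) extra multiples of its baseline
// throughput, while the hit share is unchanged.
static double warm_pool_estimate(double baseline_ops,
                                 double miss_share,
                                 double miss_speedup) {
    double gain = baseline_ops * miss_share * (miss_speedup - 1.0);
    return baseline_ops + gain;
}
```

With a 1.06M baseline, a 5% miss share, and a 10x per-miss speedup, this yields about 1.54M ops/s, matching the estimate above.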

### Path to 10x

Current efforts can realistically achieve:
- **Warm pool optimization:** +40-70% (this proposal)
- **Lock-free refill path:** +10-20% (phase 2)
- **Batch tier transitions:** +5-10% (phase 2)
- **Reduced syscall overhead:** +5% (phase 3)
- **Total realistic: 2.0-2.5x** (not 10x)

**To reach a 10x improvement, we would need:**
1. Dedicated per-thread allocation pools (reduce lock contention)
2. A batch pre-allocation strategy (reduce per-op overhead)
3. Size class coalescing (reduce routing complexity)
4. Or: a changed workload pattern (batch allocations)

---

## ⚠️ Implementation Risks & Mitigations

### Risk 1: Thread-Local Storage Bloat
**Risk:** Adding the warm pool increases per-thread memory usage
**Mitigation:**
- Allocate the warm pool lazily
- Limit to 4-8 SuperSlabs per class
- Default: 4 slots per class → ~128KB per thread total (acceptable)

### Risk 2: Warm Pool Invalidation
**Risk:** SuperSlabs become DRAINING/FREE unexpectedly
**Mitigation:**
- Periodic validation during batch tier checks
- Accept an occasional stale entry (rare, correctness not affected)
- Fall back to the registry scan if a warm pool slot is invalid

### Risk 3: Stale SuperSlabs
**Risk:** The warm pool holds SuperSlabs that should be freed
**Mitigation:**
- LRU-based eviction from the warm pool
- Maximum hold time: 60s (configurable)
- On thread exit: drain the warm pool back to the LRU cache

### Risk 4: Initialization Race
**Risk:** Multiple threads initialize warm pools simultaneously
**Mitigation:**
- Use `__thread` storage (each thread sees only its own pool)
- Lazy initialization with check-then-set
- No atomic operations needed (strictly per-thread state)

---

## 🔄 Integration Checklist

### Pre-Implementation
- [ ] Review the current unified_cache_refill() implementation
- [ ] Identify all places where SuperSlab allocation happens
- [ ] Audit the Tier system for validation requirements
- [ ] Measure current registry scan cost in a micro-benchmark

### Phase 1: Warm Pool Infrastructure
- [ ] Create `core/front/tiny_warm_pool.h` with data structures
- [ ] Implement warm_pool_init(), pop(), push() operations
- [ ] Add `__thread` variable declarations
- [ ] Write unit tests for warm pool operations
- [ ] Verify no TLS bloat (profile memory usage)

### Phase 2: Integration Points
- [ ] Modify malloc_tiny_fast() to initialize warm pools
- [ ] Integrate warm_pool_pop() in unified_cache_refill()
- [ ] Call warm_pool_push() in the cold allocation path
- [ ] Add initialization on first malloc
- [ ] Handle thread exit cleanup

### Phase 3: Testing
- [ ] Micro-benchmark: warm pool pop (should be O(1), a few cycles)
- [ ] Benchmark Random Mixed: measure ops/s improvement
- [ ] Benchmark Tiny Hot: verify no regression (should be unchanged)
- [ ] Stress test: concurrent threads + warm pool refill
- [ ] Correctness: verify all objects are properly allocated/freed

### Phase 4: Profiling & Optimization
- [ ] Profile the hot path (should still be 20-30 cycles)
- [ ] Profile the warm path (should drop to 50-100 cycles)
- [ ] Measure the registry scan reduction
- [ ] Identify any remaining bottlenecks

### Phase 5: Documentation
- [ ] Update comments in unified_cache_refill()
- [ ] Document the warm pool design in the README
- [ ] Add environment variables (if needed)
- [ ] Document the tier check batching strategy

---

## 📊 Metrics to Track

### Pre-Implementation
```
Baseline Random Mixed:
- Ops/sec:         1.06M
- L1 cache misses: ~763K per 1M ops
- Page faults:     ~7,674
- CPU cycles:      ~70.4M
```

### Post-Implementation Targets
```
After warm pool:
- Ops/sec:         1.5-1.8M (+40-70%)
- L1 cache misses: Similar or slightly reduced
- Page faults:     Same (~7,674)
- CPU cycles:      ~45-50M (~30% reduction)

Warm path breakdown:
- Warm pool hit:     50-100 cycles per batch
- Registry fallback: 200-300 cycles (rare)
- Cold alloc:        1000-5000 cycles (very rare)
```

---

## 💾 Files to Create/Modify

### New Files
- `core/front/tiny_warm_pool.h` - Warm pool data structures & operations

### Modified Files
1. `core/front/malloc_tiny_fast.h`
   - Initialize warm pools on first call
   - Document three-tier routing

2. `core/front/tiny_unified_cache.h`
   - Modify unified_cache_refill() to try the warm pool first
   - Add warm pool replenishment logic

3. `core/box/ss_tier_box.h`
   - Add the batched tier check strategy
   - Document validation requirements

4. `core/hakmem_tiny.h` or `core/front/malloc_tiny_fast.h`
   - Add environment variables:
     - `HAKMEM_WARM_POOL_SIZE` (default: 4)
     - `HAKMEM_WARM_POOL_REFILL_THRESHOLD` (default: 1)
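The two variables above could be read once at allocator init in the usual way. `hak_env_int` is a hypothetical helper for this sketch, not an existing HAKMEM function, and the clamp bounds are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

// Parse an integer environment variable with a default, clamped to
// [lo, hi] so a bad value cannot blow up TLS usage.
static int hak_env_int(const char* name, int dflt, int lo, int hi) {
    const char* s = getenv(name);
    if (!s || !*s) return dflt;
    int v = atoi(s);
    if (v < lo) v = lo;
    if (v > hi) v = hi;
    return v;
}

// Read once at allocator init; defaults match the table above.
static int warm_pool_size;
static int warm_pool_refill_threshold;

static void warm_pool_read_env(void) {
    warm_pool_size = hak_env_int("HAKMEM_WARM_POOL_SIZE", 4, 1, 16);
    warm_pool_refill_threshold =
        hak_env_int("HAKMEM_WARM_POOL_REFILL_THRESHOLD", 1, 0, 8);
}
```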

### Configuration Files
- Add warm pool parameters to the benchmark configuration
- Update profiling tools to measure warm pool effectiveness

---

## 🎯 Success Criteria

✅ **Must Have:**
1. Warm pool implementation reduces registry scan cost by 80%+
2. Random Mixed ops/s increases to 1.5M+ (40%+ improvement)
3. Tiny Hot ops/s unchanged (no regression)
4. All allocations remain correct (no memory corruption)
5. No thread-local storage bloat (< 200KB per thread)

✅ **Nice to Have:**
1. Random Mixed reaches 2M+ ops/s (90%+ improvement)
2. Warm pool hit rate > 90% (rarely falls back to the registry)
3. L1 cache misses reduced by 10%+
4. Per-free cost unchanged (no regression)

❌ **Not in Scope (separate PR):**
1. Lock-free refill path (requires a CAS-based warm pool)
2. Per-thread allocation pools (requires a larger redesign)
3. Hugepage support (already tested, no gain)

---

## 📝 Next Steps

1. **Review this proposal** with the team
2. **Approve scope & success criteria**
3. **Begin Phase 1 implementation** (warm pool header file)
4. **Integrate with unified_cache_refill()**
5. **Benchmark and measure improvements**
6. **Iterate based on profiling results**

---

## 🔗 References

- Current Profiling: `COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md`
- Session Summary: `FINAL_SESSION_REPORT_20251204.md`
- Box Architecture: `core/box/` directory
- Unified Cache: `core/front/tiny_unified_cache.h`
- Registry: `core/hakmem_super_registry.h`
- Tier System: `core/box/ss_tier_box.h`