# HAKMEM Architectural Restructuring for 10x Performance - Implementation Proposal
## 2025-12-04
---
## 📊 Executive Summary
**Goal:** Achieve 10x performance improvement on Random Mixed allocations (1.06M → 10.6M ops/s) by restructuring allocator to separate HOT/WARM/COLD execution paths.
**Current Performance Gap:**
```
Random Mixed: 1.06M ops/s (current baseline)
Tiny Hot: 89M ops/s (reference - different workload)
Goal: 10.6M ops/s (10x from baseline)
```
**Key Discovery:** The current architecture already has HOT/WARM separation (via the Unified Cache), but inefficiencies in the WARM path prevent it from scaling:
1. **Registry scan on cache miss** (O(N) search through per-class registry)
2. **Per-allocation tier checks** (atomic operations, not batched)
3. **Lack of pre-warmed SuperSlab pools** (must allocate/initialize on miss)
4. **Global registry contention** (mutex-protected writes)
---
## 🔍 Current Architecture Analysis
### Existing Two-Speed Foundation
HAKMEM **already implements** a two-tier design:
```
HOT PATH (95%+ allocations):
  malloc_tiny_fast()
    → tiny_hot_alloc_fast()
    → Unified Cache pop (TLS, 2-3 cache misses)
    → Return USER pointer
  Cost: ~20-30 CPU cycles

WARM PATH (1-5% cache misses):
  malloc_tiny_fast()
    → tiny_cold_refill_and_alloc()
    → unified_cache_refill()
    → Per-class registry scan (find HOT SuperSlab)
    → Tier check (is HOT)
    → Carve ~64 blocks
    → Refill Unified Cache
    → Return USER pointer
  Cost: ~500-1000 cycles per batch (~5-10 per object amortized)
```
### Performance Bottlenecks in WARM Path
**Bottleneck 1: Registry Scan (O(N))**
- Current: Linear search through per-class registry to find HOT SuperSlab
- Cost: 50-100 cycles per refill
- Happens on EVERY cache miss (~1-5% of allocations)
- Files: `core/hakmem_super_registry.h`, `core/front/tiny_unified_cache.h` (unified_cache_refill function)
**Bottleneck 2: Per-Allocation Tier Checks**
- Current: `ss_tier_is_hot(ss)` is called on every refill, and once per candidate SuperSlab during the registry scan
- Should be: Tier checks batched across many refills (see Phase 4)
- Cost: Atomic operations whose cost is never amortized
- File: `core/box/ss_tier_box.h`
**Bottleneck 3: Global Registry Contention**
- Current: Mutex-protected registry insert on SuperSlab alloc
- File: `core/hakmem_super_registry.h` (hak_super_registry_insert)
- Lock: `g_super_reg_lock`
**Bottleneck 4: SuperSlab Initialization Overhead**
- Current: Full allocation + initialization on cache miss → cold path
- Cost: ~1000+ cycles (mmap, metadata setup, registry insert)
- Should be: Pre-allocated from LRU cache or warm pool
---
## 💡 Proposed Three-Tier Architecture
### Tier 1: HOT (95%+ allocations)
```c
// Path: TLS Unified Cache hit
// Cost: ~20-30 cycles (unchanged)
// Characteristics:
// - No registry access
// - No Tier/Guard calls
// - No locks
// - Branch-free (or 1-branch pipeline hits)
Path:
1. Read TLS Unified Cache (TLS access, 1 cache miss)
2. Pop from array (array access, 1 cache miss)
3. Update head pointer (1 store)
4. Return USER pointer (0 additional branches for hit)
Total: 2-3 cache misses, ~20-30 cycles
```
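For concreteness, a minimal sketch of the HOT-path pop described above. The struct layout and names (`slots`, `head`, `tail`, `tiny_hot_alloc_sketch`) are illustrative assumptions, not the actual `core/front/tiny_unified_cache.h` layout.
```c
#include <stdint.h>
// TINY_NUM_CLASSES is assumed to come from hakmem_tiny_config.h (as in Task 1.1).
typedef struct {
    void**   slots;  // ring of USER pointers, refilled ~64 at a time
    uint32_t head;   // next slot to pop
    uint32_t tail;   // one past the last valid slot
} TinyUnifiedCacheSketch;

static __thread TinyUnifiedCacheSketch g_ucache_sketch[TINY_NUM_CLASSES];

// HOT path: one predictable branch, ~2-3 cache misses (TLS base + slot array).
static inline void* tiny_hot_alloc_sketch(int class_idx) {
    TinyUnifiedCacheSketch* c = &g_ucache_sketch[class_idx];
    if (c->head != c->tail) {
        return c->slots[c->head++];  // pop + advance head, return USER pointer
    }
    return NULL;  // miss → caller falls through to the WARM path
}
```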
### Tier 2: WARM (1-5% cache misses)
**NEW: Per-Thread Warm Pool**
```c
// Path: Unified Cache miss → Pop from per-thread warm pool
// Cost: ~50-100 cycles per batch (5-10 per object amortized)
// Characteristics:
// - No global registry scan
// - Pre-qualified SuperSlabs (already HOT)
// - Batched tier transitions (not per-object)
// - Minimal lock contention
Data Structure:
__thread SuperSlab* g_warm_pool_head[TINY_NUM_CLASSES];
__thread int g_warm_pool_count[TINY_NUM_CLASSES];
__thread int g_warm_pool_capacity[TINY_NUM_CLASSES];
Path:
1. Detect Unified Cache miss (head == tail)
2. Check warm pool (TLS access, no lock)
   a. If warm_pool_count > 0:
      ├─ Pop SuperSlab from warm_pool_head (O(1))
      ├─ Use existing SuperSlab (no mmap)
      ├─ Carve ~64 blocks (amortized cost)
      ├─ Refill Unified Cache
      ├─ (Optional) Batch tier check after ~64 pops
      └─ Return first block
   b. If warm_pool_count == 0:
      └─ Fall through to COLD (rare)
Total: ~50-100 cycles per batch
```
### Tier 3: COLD (<0.1% special cases)
```c
// Path: Warm pool exhausted, error, or special handling
// Cost: ~1000-10000 cycles per SuperSlab (rare)
// Characteristics:
// - Full SuperSlab allocation (mmap)
// - Registry insert (mutex-protected write)
// - Tier initialization
// - Guard validation
Path:
1. Warm pool exhausted
2. Allocate new SuperSlab (mmap via ss_os_acquire_box)
3. Insert into global registry (mutex-protected)
4. Initialize TinySlabMeta + metadata
5. Add to per-class registry
6. Carve blocks + refill both Unified Cache and warm pool
7. Return first block
```
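A hedged sketch of steps 2-7 above, including the step-6 idea of priming the warm pool while we are already in the COLD path. `superslab_alloc_cold()` and `unified_cache_pop()` are hypothetical stand-ins for the real mmap/registry/cache helpers; only `carve_blocks_from_superslab()` and `tiny_warm_pool_push()` appear elsewhere in this proposal.
```c
SuperSlab* superslab_alloc_cold(int class_idx);  // assumed: mmap + registry insert + tier/meta init
void carve_blocks_from_superslab(SuperSlab* ss, int class_idx, void* cache);
void* unified_cache_pop(int class_idx);          // assumed: pop one USER pointer from the TLS cache

static void* cold_alloc_with_warm_prefill(int class_idx, void* cache) {
    SuperSlab* ss = superslab_alloc_cold(class_idx);     // ~1000+ cycles, but rare (<0.1%)
    if (!ss) return NULL;
    carve_blocks_from_superslab(ss, class_idx, cache);   // refill the Unified Cache
    SuperSlab* extra = superslab_alloc_cold(class_idx);  // optional: one spare slab
    if (extra) tiny_warm_pool_push(class_idx, extra);    // so the next miss stays WARM
    return unified_cache_pop(class_idx);                 // return the first block
}
```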
---
## 🔧 Implementation Plan
### Phase 1: Design & Data Structures (THIS DOCUMENT)
**Task 1.1: Define Warm Pool Data Structure**
```c
// File: core/front/tiny_warm_pool.h (NEW)
//
// Per-thread warm pool for pre-allocated SuperSlabs
// Reduces registry scan cost on cache miss
#ifndef HAK_TINY_WARM_POOL_H
#define HAK_TINY_WARM_POOL_H
#include <stdint.h>
#include "../hakmem_tiny_config.h"
#include "../superslab/superslab_types.h"
// Maximum warm SuperSlabs per thread (tunable)
#define TINY_WARM_POOL_MAX_PER_CLASS 4
typedef struct {
    SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;
    int capacity;
} TinyWarmPool;
// Per-thread warm pools (one per class)
extern __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];
// Operations:
// - tiny_warm_pool_init() → Initialize at thread startup
// - tiny_warm_pool_push() → Add SuperSlab to warm pool
// - tiny_warm_pool_pop() → Remove SuperSlab from warm pool (O(1))
// - tiny_warm_pool_drain() → Return all to LRU on thread exit
// - tiny_warm_pool_refill() → Batch refill from LRU cache
#endif
```
**Task 1.2: Define Warm Pool Operations**
```c
// Lazy initialization (once per thread)
static inline void tiny_warm_pool_init_once(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->capacity == 0) {
        pool->capacity = TINY_WARM_POOL_MAX_PER_CLASS;
        pool->count = 0;
        // Allocate initial SuperSlabs on demand (COLD path)
    }
}

// O(1) pop from warm pool
static inline SuperSlab* tiny_warm_pool_pop(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count > 0) {
        return pool->slabs[--pool->count]; // Pop from end
    }
    return NULL; // Pool empty → fall through to COLD
}

// O(1) push to warm pool
static inline void tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count < pool->capacity) {
        pool->slabs[pool->count++] = ss;
    } else {
        // Pool full → return to LRU cache or free
        ss_cache_put(ss); // Return to global LRU
    }
}
```
### Phase 2: Implement Warm Pool Initialization
**Task 2.1: Thread Startup Integration**
- Initialize warm pools on first malloc call
- Pre-populate from LRU cache (if available)
- Fall back to cold allocation if needed
**Task 2.2: Batch Refill Strategy**
- On thread startup: Allocate ~2-3 SuperSlabs per class to warm pool
- On cache miss: Pop from warm pool (no registry scan)
- On warm pool depletion: Allocate 1-2 more in cold path
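A minimal sketch of the `tiny_warm_pool_refill()` operation listed in Task 1.1, under the assumption that the global LRU cache exposes a getter (`ss_cache_get()`, hypothetical counterpart of the `ss_cache_put()` used above) and that a cold allocator (`superslab_alloc_cold()`, also hypothetical) backs it up:
```c
SuperSlab* ss_cache_get(int class_idx);          // assumed: take a slab from the global LRU cache
SuperSlab* superslab_alloc_cold(int class_idx);  // assumed: full mmap + registry insert

static void tiny_warm_pool_refill(int class_idx, int want) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    while (pool->count < pool->capacity && want-- > 0) {
        SuperSlab* ss = ss_cache_get(class_idx);   // cheap path: reuse a cached slab
        if (!ss) {
            if (pool->count > 0) break;            // never cold-alloc just to stockpile
            ss = superslab_alloc_cold(class_idx);  // rare: pool is completely dry
            if (!ss) break;
        }
        tiny_warm_pool_push(class_idx, ss);
    }
}
```
Per Task 2.2, thread startup would call this with `want = 2..3` per class, and the cold path with `want = 1..2` once the pool is depleted.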
### Phase 3: Modify unified_cache_refill()
**Current Implementation** (Registry Scan):
```c
void unified_cache_refill(int class_idx) {
    // Linear search through per-class registry
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* ss = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(ss)) { // ← Tier check (5-10 cycles)
            // Carve blocks
            carve_blocks_from_superslab(ss, class_idx, cache);
            return;
        }
    }
    // Not found → cold path (allocate new SuperSlab)
}
```
**Proposed Implementation** (Warm Pool First):
```c
void unified_cache_refill(int class_idx) {
    // 1. Try warm pool first (no lock, O(1))
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        // SuperSlab already HOT (pre-qualified), no tier check needed
        carve_blocks_from_superslab(ss, class_idx, cache);
        return;
    }
    // 2. Fall back to registry scan (only if warm pool empty)
    //    (TINY_WARM_POOL_MAX_PER_CLASS = 4, so this rarely happens)
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* cand = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(cand)) {
            carve_blocks_from_superslab(cand, class_idx, cache);
            // Refill warm pool on success
            for (int j = 0; j < 2; j++) {
                SuperSlab* extra = find_next_hot_slab(class_idx, i);
                if (extra) {
                    tiny_warm_pool_push(class_idx, extra);
                    i++;
                }
            }
            return;
        }
    }
    // 3. Cold path (allocate new SuperSlab)
    allocate_new_superslab(class_idx, cache);
}
```
### Phase 4: Batched Tier Transition Checks
**Current:** Tier check on every refill (5-10 cycles)
**Proposed:** Batch tier checks once per N operations
```c
// Per-thread tier check counter (no atomics needed; work is done in batches)
static __thread uint32_t g_tier_check_counter = 0;
#define TIER_CHECK_BATCH_SIZE 256

void tier_check_maybe_batch(int class_idx) {
    if (++g_tier_check_counter % TIER_CHECK_BATCH_SIZE == 0) {
        // Batch check: sample a few SuperSlabs from the per-class registry
        int n = g_super_reg_by_class_count[class_idx];
        for (int i = 0; i < 10 && n > 0; i++) { // Sample up to 10 SuperSlabs
            SuperSlab* ss = g_super_reg_by_class[class_idx][rand() % n];
            if (!ss_tier_is_hot(ss)) {
                // Demote from warm pool if present
                // (Cost: one sampling pass per 256 refills)
            }
        }
    }
}
```
### Phase 5: LRU Cache Integration
**How Warm Pool Gets Replenished:**
1. **Startup:** Pre-populate warm pools from LRU cache
2. **During execution:** On cold path alloc, add extra SuperSlab to warm pool
3. **Periodic:** Background thread refills warm pools when < threshold
4. **On free:** When SuperSlab becomes empty, add to LRU cache (not warm pool)
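A minimal sketch of rule 4, assuming a hypothetical emptiness predicate `superslab_is_empty()`; `ss_cache_put()` is the same LRU return already used by `tiny_warm_pool_push()`:
```c
int superslab_is_empty(SuperSlab* ss);  // assumed: no live blocks remain in any slab

static void tiny_free_superslab_hook(SuperSlab* ss) {
    if (superslab_is_empty(ss)) {
        // Fully idle slabs return to the global LRU cache, never to a
        // thread-local warm pool, so reclaimed memory stays shared process-wide.
        ss_cache_put(ss);
    }
}
```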
---
## 📈 Expected Performance Impact
### Current Baseline
```
Random Mixed: 1.06M ops/s
Breakdown:
- 95% cache hits (HOT): ~1.007M ops/s (clean, 2-3 cache misses)
- 5% cache misses (WARM): ~0.053M ops/s (registry scan + refill)
```
### After Warm Pool Implementation
```
Estimated: 1.5-1.8M ops/s (+40-70%)
Breakdown:
- 95% cache hits (HOT): ~1.007M ops/s (unchanged, 2-3 cache misses)
- 5% cache misses (WARM): ~0.15-0.20M ops/s (warm pool, O(1) pop)
(vs 0.053M before)
Improvement mechanism:
- Remove registry O(N) scan → O(1) warm pool pop
- Reduce per-refill cost: ~500 cycles → ~50 cycles
- Expected per-miss speedup: ~10x
- Applied to the ~5% of operations that miss the cache (≈0.053M ops/s of the current baseline)
- Actual gain: 1.06M × 0.05 × 9 = 0.477M
- Total: 1.06M + 0.477M = 1.537M ops/s (+45%)
```
### Path to 10x
The optimizations planned here can realistically achieve:
- **Warm pool optimization:** +40-70% (this proposal)
- **Lock-free refill path:** +10-20% (phase 2)
- **Batch tier transitions:** +5-10% (phase 2)
- **Reduced syscall overhead:** +5% (phase 3)
- **Total realistic: 2.0-2.5x** (not 10x)
**To reach 10x improvement, would need:**
1. Dedicated per-thread allocation pools (reduce lock contention)
2. Batch pre-allocation strategy (reduce per-op overhead)
3. Size class coalescing (reduce routing complexity)
4. Or: Change workload pattern (batch allocations)
---
## ⚠️ Implementation Risks & Mitigations
### Risk 1: Thread-Local Storage Bloat
**Risk:** Adding warm pool increases per-thread memory usage
**Mitigation:**
- Allocate warm pool lazily
- Limit to 4-8 SuperSlabs per class (128KB per thread max)
- Default: 4 slots per class → 128KB total (acceptable)
### Risk 2: Warm Pool Invalidation
**Risk:** SuperSlabs become DRAINING/FREE unexpectedly
**Mitigation:**
- Periodic validation during batch tier checks
- Accept occasional validation error (rare, correctness not affected)
- Fallback to registry scan if warm pool slot invalid
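One way to implement that fallback, sketched with the operations already defined in Phase 1 (`tiny_warm_pool_pop()`, `ss_cache_put()`) plus the existing `ss_tier_is_hot()` check; the wrapper name is illustrative:
```c
static inline SuperSlab* tiny_warm_pool_pop_validated(int class_idx) {
    SuperSlab* ss;
    while ((ss = tiny_warm_pool_pop(class_idx)) != NULL) {
        if (ss_tier_is_hot(ss)) return ss;  // common case: slab is still HOT
        ss_cache_put(ss);                   // went DRAINING/FREE → hand back to the LRU
    }
    return NULL;  // pool empty or fully stale → caller falls back to the registry scan
}
```
Because the pool holds at most TINY_WARM_POOL_MAX_PER_CLASS entries, the worst case is a handful of extra tier checks per miss.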
### Risk 3: Stale SuperSlabs
**Risk:** Warm pool holds SuperSlabs that should be freed
**Mitigation:**
- LRU-based eviction from warm pool
- Maximum hold time: 60s (configurable)
- On thread exit: drain warm pool back to LRU cache
### Risk 4: Initialization Race
**Risk:** Multiple threads initialize warm pools simultaneously
**Mitigation:**
- Use `__thread` thread-local storage (each thread only touches its own pool, so there is no cross-thread race)
- Lazy initialization with check-then-set
- No atomic operations needed (per-thread)
---
## 🔄 Integration Checklist
### Pre-Implementation
- [ ] Review current unified_cache_refill() implementation
- [ ] Identify all places where SuperSlab allocation happens
- [ ] Audit Tier system for validation requirements
- [ ] Measure current registry scan cost in micro-benchmark
### Phase 1: Warm Pool Infrastructure
- [ ] Create `core/front/tiny_warm_pool.h` with data structures
- [ ] Implement warm_pool_init(), pop(), push() operations
- [ ] Add __thread variable declarations
- [ ] Write unit tests for warm pool operations
- [ ] Verify no TLS bloat (profile memory usage)
### Phase 2: Integration Points
- [ ] Modify malloc_tiny_fast() to initialize warm pools
- [ ] Integrate warm_pool_pop() in unified_cache_refill()
- [ ] Implement warm_pool_push() in cold allocation path
- [ ] Add initialization on first malloc
- [ ] Handle thread exit cleanup
### Phase 3: Testing
- [ ] Micro-benchmark: warm pool pop (should be O(1), 2-3 cycles)
- [ ] Benchmark Random Mixed: measure ops/s improvement
- [ ] Benchmark Tiny Hot: verify no regression (should be unchanged)
- [ ] Stress test: concurrent threads + warm pool refill
- [ ] Correctness: verify all objects properly allocated/freed
### Phase 4: Profiling & Optimization
- [ ] Profile hot path (should still be 20-30 cycles)
- [ ] Profile warm path (should be reduced to 50-100 cycles)
- [ ] Measure registry scan reduction
- [ ] Identify any remaining bottlenecks
### Phase 5: Documentation
- [ ] Update comments in unified_cache_refill()
- [ ] Document warm pool design in README
- [ ] Add environment variables (if needed)
- [ ] Document tier check batching strategy
---
## 📊 Metrics to Track
### Pre-Implementation
```
Baseline Random Mixed:
- Ops/sec: 1.06M
- L1 cache misses: ~763K per 1M ops
- Page faults: ~7,674
- CPU cycles: ~70.4M
```
### Post-Implementation Targets
```
After warm pool:
- Ops/sec: 1.5-1.8M (+40-70%)
- L1 cache misses: Similar or slightly reduced
- Page faults: Same (~7,674)
- CPU cycles: ~45-50M (30% reduction)
Warm path breakdown:
- Warm pool hit: 50-100 cycles per batch
- Registry fallback: 200-300 cycles (rare)
- Cold alloc: 1000-5000 cycles (very rare)
```
---
## 💾 Files to Create/Modify
### New Files
- `core/front/tiny_warm_pool.h` - Warm pool data structures & operations
### Modified Files
1. `core/front/malloc_tiny_fast.h`
- Initialize warm pools on first call
- Document three-tier routing
2. `core/front/tiny_unified_cache.h`
- Modify unified_cache_refill() to use warm pool first
- Add warm pool replenishment logic
3. `core/box/ss_tier_box.h`
- Add batched tier check strategy
- Document validation requirements
4. `core/hakmem_tiny.h` or `core/front/malloc_tiny_fast.h`
- Add environment variables:
- `HAKMEM_WARM_POOL_SIZE` (default: 4)
- `HAKMEM_WARM_POOL_REFILL_THRESHOLD` (default: 1)
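A small sketch of how those two knobs could be read once at startup; the names match the list above, while the parsing helper itself is an assumption:
```c
#include <stdlib.h>

static int env_int_or(const char* name, int dflt) {
    const char* s = getenv(name);
    return (s && *s) ? atoi(s) : dflt;
}

static int g_warm_pool_size;              // slots per class (capacity)
static int g_warm_pool_refill_threshold;  // refill when count drops below this

static void warm_pool_read_env(void) {
    g_warm_pool_size             = env_int_or("HAKMEM_WARM_POOL_SIZE", 4);
    g_warm_pool_refill_threshold = env_int_or("HAKMEM_WARM_POOL_REFILL_THRESHOLD", 1);
}
```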
### Configuration Files
- Add warm pool parameters to benchmark configuration
- Update profiling tools to measure warm pool effectiveness
---
## 🎯 Success Criteria
**Must Have:**
1. Warm pool implementation reduces registry scan cost by 80%+
2. Random Mixed ops/s increases to 1.5M+ (40%+ improvement)
3. Tiny Hot ops/s unchanged (no regression)
4. All allocations remain correct (no memory corruption)
5. No thread-local storage bloat (< 200KB per thread)
**Nice to Have:**
1. Random Mixed reaches 2M+ ops/s (90%+ improvement)
2. Warm pool hit rate > 90% (rarely fall back to registry)
3. L1 cache misses reduced by 10%+
4. Per-free cost unchanged (no regression)
**Not in Scope (separate PR):**
1. Lock-free refill path (requires CAS-based warm pool)
2. Per-thread allocation pools (requires larger redesign)
3. Hugepages support (already tested, no gain)
---
## 📝 Next Steps
1. **Review this proposal** with the team
2. **Approve scope & success criteria**
3. **Begin Phase 1 implementation** (warm pool header file)
4. **Integrate with unified_cache_refill()**
5. **Benchmark and measure improvements**
6. **Iterate based on profiling results**
---
## 🔗 References
- Current Profiling: `COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md`
- Session Summary: `FINAL_SESSION_REPORT_20251204.md`
- Box Architecture: `core/box/` directory
- Unified Cache: `core/front/tiny_unified_cache.h`
- Registry: `core/hakmem_super_registry.h`
- Tier System: `core/box/ss_tier_box.h`