Files
hakmem/ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md
Moe Charm (CI) 5685c2f4c9 Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refill
calls and prefill the pool with 3 additional HOT SuperSlabs before attempting
to carve. This builds a working set of slabs that can sustain allocation
pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra SuperSlabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
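
In sketch form (illustrative only; helper names such as superslab_refill_one()
stand in for the real refill and warm-pool routines):

```c
// Sketch of the secondary-prefill cold path; names are stand-ins.
if (g_tiny_warm_pool[class_idx].count == 0) {
    for (int i = 0; i < 3; i++) {                   // prefill budget = 3
        SuperSlab* extra = superslab_refill_one(class_idx);
        if (!extra) break;                          // registry/OS exhausted
        tiny_warm_pool_push(class_idx, extra);      // park for future misses
    }
    g_warm_pool_stats[class_idx].prefilled++;       // stats always compiled
}
SuperSlab* ss = superslab_refill_one(class_idx);    // keep 1 for immediate carving
```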

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 23:31:54 +09:00

# HAKMEM Architectural Restructuring for 10x Performance - Implementation Proposal
## 2025-12-04
---
## 📊 Executive Summary
**Goal:** Achieve a 10x performance improvement on Random Mixed allocations (1.06M → 10.6M ops/s) by restructuring the allocator to separate HOT/WARM/COLD execution paths.
**Current Performance Gap:**
```
Random Mixed: 1.06M ops/s (current baseline)
Tiny Hot: 89M ops/s (reference - different workload)
Goal: 10.6M ops/s (10x from baseline)
```
**Key Discovery:** The current architecture already has HOT/WARM separation (via the Unified Cache), but inefficiencies in the WARM path prevent scaling:
1. **Registry scan on cache miss** (O(N) search through per-class registry)
2. **Per-refill tier checks** (atomic operations, not amortized)
3. **Lack of pre-warmed SuperSlab pools** (must allocate/initialize on miss)
4. **Global registry contention** (mutex-protected writes)
---
## 🔍 Current Architecture Analysis
### Existing Two-Speed Foundation
HAKMEM **already implements** a two-tier design:
```
HOT PATH (95%+ allocations):
  malloc_tiny_fast()
    → tiny_hot_alloc_fast()
    → Unified Cache pop (TLS, 2-3 cache misses)
    → Return USER pointer
  Cost: ~20-30 CPU cycles

WARM PATH (1-5% cache misses):
  malloc_tiny_fast()
    → tiny_cold_refill_and_alloc()
    → unified_cache_refill()
      → Per-class registry scan (find HOT SuperSlab)
      → Tier check (is HOT)
      → Carve ~64 blocks
      → Refill Unified Cache
    → Return USER pointer
  Cost: ~500-1000 cycles per batch (~5-10 per object amortized)
```
### Performance Bottlenecks in WARM Path
**Bottleneck 1: Registry Scan (O(N))**
- Current: Linear search through per-class registry to find HOT SuperSlab
- Cost: 50-100 cycles per refill
- Happens on EVERY cache miss (~1-5% of allocations)
- Files: `core/hakmem_super_registry.h`, `core/front/tiny_unified_cache.h` (unified_cache_refill function)
**Bottleneck 2: Per-Refill Tier Checks**
- Current: `ss_tier_is_hot(ss)` is called on every refill
- Should be: tier checks batched and amortized across many refills
- Cost: atomic operations whose cost is never amortized
- File: `core/box/ss_tier_box.h`
**Bottleneck 3: Global Registry Contention**
- Current: Mutex-protected registry insert on SuperSlab alloc
- File: `core/hakmem_super_registry.h` (hak_super_registry_insert)
- Lock: `g_super_reg_lock`
**Bottleneck 4: SuperSlab Initialization Overhead**
- Current: Full allocation + initialization on cache miss → cold path
- Cost: ~1000+ cycles (mmap, metadata setup, registry insert)
- Should be: Pre-allocated from LRU cache or warm pool
---
## 💡 Proposed Three-Tier Architecture
### Tier 1: HOT (95%+ allocations)
```c
// Path: TLS Unified Cache hit
// Cost: ~20-30 cycles (unchanged)
// Characteristics:
// - No registry access
// - No Tier/Guard calls
// - No locks
// - Branch-free (or 1-branch pipeline hits)
Path:
1. Read TLS Unified Cache (TLS access, 1 cache miss)
2. Pop from array (array access, 1 cache miss)
3. Update head pointer (1 store)
4. Return USER pointer (0 additional branches for hit)
Total: 2-3 cache misses, ~20-30 cycles
```
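For illustration, a minimal ring-buffer pop consistent with the steps above (a sketch: the struct layout and names here are assumptions, not the actual `core/front/tiny_unified_cache.h` implementation):
```c
#include <stdint.h>

// Assumed layout, for illustration only.
typedef struct {
    void**   slots;            // ready-to-return USER pointers (TLS-owned)
    uint32_t head, tail, mask; // ring cursors; capacity is a power of two
} UnifiedCacheSketch;

static inline void* unified_cache_pop_sketch(UnifiedCacheSketch* uc) {
    if (uc->head == uc->tail) return NULL;   // empty → fall to WARM refill
    return uc->slots[uc->head++ & uc->mask]; // one load + one cursor store
}
```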
### Tier 2: WARM (1-5% cache misses)
**NEW: Per-Thread Warm Pool**
```c
// Path: Unified Cache miss → Pop from per-thread warm pool
// Cost: ~50-100 cycles per batch (5-10 per object amortized)
// Characteristics:
// - No global registry scan
// - Pre-qualified SuperSlabs (already HOT)
// - Batched tier transitions (not per-object)
// - Minimal lock contention
Data Structure:
  __thread SuperSlab* g_warm_pool_head[TINY_NUM_CLASSES];
  __thread int        g_warm_pool_count[TINY_NUM_CLASSES];
  __thread int        g_warm_pool_capacity[TINY_NUM_CLASSES];

Path:
  1. Detect Unified Cache miss (head == tail)
  2. Check warm pool (TLS access, no lock)
     a. If warm_pool_count > 0:
        ├─ Pop SuperSlab from warm_pool_head (O(1))
        ├─ Use existing SuperSlab (no mmap)
        ├─ Carve ~64 blocks (amortized cost)
        ├─ Refill Unified Cache
        ├─ (Optional) Batch tier check after ~64 pops
        └─ Return first block
     b. If warm_pool_count == 0:
        └─ Fall through to COLD (rare)

Total: ~50-100 cycles per batch
```
### Tier 3: COLD (<0.1% special cases)
```c
// Path: Warm pool exhausted, error, or special handling
// Cost: ~1000-10000 cycles per SuperSlab (rare)
// Characteristics:
// - Full SuperSlab allocation (mmap)
// - Registry insert (mutex-protected write)
// - Tier initialization
// - Guard validation
Path:
1. Warm pool exhausted
2. Allocate new SuperSlab (mmap via ss_os_acquire_box)
3. Insert into global registry (mutex-protected)
4. Initialize TinySlabMeta + metadata
5. Add to per-class registry
6. Carve blocks + refill both Unified Cache and warm pool
7. Return first block
```
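A sketch of steps 2-6 in code (`ss_os_acquire_box`, `hak_super_registry_insert`, and `g_super_reg_lock` are named in this proposal; the remaining helpers and all signatures are assumptions):
```c
#include <pthread.h>

// COLD-path sketch; exact signatures are illustrative.
static SuperSlab* allocate_new_superslab(int class_idx, void* cache) {
    SuperSlab* ss = ss_os_acquire_box(class_idx);      // step 2: mmap-backed acquire
    if (!ss) return NULL;

    pthread_mutex_lock(&g_super_reg_lock);             // step 3: global registry
    hak_super_registry_insert(ss);
    pthread_mutex_unlock(&g_super_reg_lock);

    superslab_init_tiny_meta(ss, class_idx);           // step 4 (assumed helper)
    super_reg_by_class_add(class_idx, ss);             // step 5 (assumed helper)

    carve_blocks_from_superslab(ss, class_idx, cache); // step 6: refill Unified Cache
    tiny_warm_pool_push(class_idx, ss);                // step 6: seed warm pool too
    return ss;
}
```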
---
## 🔧 Implementation Plan
### Phase 1: Design & Data Structures (THIS DOCUMENT)
**Task 1.1: Define Warm Pool Data Structure**
```c
// File: core/front/tiny_warm_pool.h (NEW)
//
// Per-thread warm pool for pre-allocated SuperSlabs
// Reduces registry scan cost on cache miss
#ifndef HAK_TINY_WARM_POOL_H
#define HAK_TINY_WARM_POOL_H
#include <stdint.h>
#include "../hakmem_tiny_config.h"
#include "../superslab/superslab_types.h"
// Maximum warm SuperSlabs per thread (tunable)
#define TINY_WARM_POOL_MAX_PER_CLASS 4
typedef struct {
    SuperSlab* slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;
    int capacity;
} TinyWarmPool;
// Per-thread warm pools (one per class)
extern __thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES];
// Operations:
// - tiny_warm_pool_init() → Initialize at thread startup
// - tiny_warm_pool_push() → Add SuperSlab to warm pool
// - tiny_warm_pool_pop() → Remove SuperSlab from warm pool (O(1))
// - tiny_warm_pool_drain() → Return all to LRU on thread exit
// - tiny_warm_pool_refill() → Batch refill from LRU cache
#endif
```
**Task 1.2: Define Warm Pool Operations**
```c
// Lazy initialization (once per thread)
static inline void tiny_warm_pool_init_once(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->capacity == 0) {
        pool->capacity = TINY_WARM_POOL_MAX_PER_CLASS;
        pool->count = 0;
        // Allocate initial SuperSlabs on demand (COLD path)
    }
}

// O(1) pop from warm pool
static inline SuperSlab* tiny_warm_pool_pop(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count > 0) {
        return pool->slabs[--pool->count]; // Pop from end
    }
    return NULL; // Pool empty → fall through to COLD
}

// O(1) push to warm pool
static inline void tiny_warm_pool_push(int class_idx, SuperSlab* ss) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    if (pool->count < pool->capacity) {
        pool->slabs[pool->count++] = ss;
    } else {
        // Pool full → return to LRU cache or free
        ss_cache_put(ss); // Return to global LRU
    }
}
```
### Phase 2: Implement Warm Pool Initialization
**Task 2.1: Thread Startup Integration**
- Initialize warm pools on first malloc call
- Pre-populate from LRU cache (if available)
- Fall back to cold allocation if needed
**Task 2.2: Batch Refill Strategy**
- On thread startup: Allocate ~2-3 SuperSlabs per class to warm pool
- On cache miss: Pop from warm pool (no registry scan)
- On warm pool depletion: Allocate 1-2 more in cold path
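A sketch of the startup pre-population under these tasks (`ss_cache_get()` is a hypothetical LRU getter, symmetric to the `ss_cache_put()` used in Phase 1):
```c
// Sketch: pre-populate a thread's warm pool, preferring the global LRU
// cache over fresh mmap-backed allocation.
static inline void tiny_warm_pool_prepopulate(int class_idx, int target) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    while (pool->count < target && pool->count < pool->capacity) {
        SuperSlab* ss = ss_cache_get(class_idx); // assumed LRU getter
        if (!ss) break; // LRU empty → cold path tops up on first miss
        pool->slabs[pool->count++] = ss;
    }
}
```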
### Phase 3: Modify unified_cache_refill()
**Current Implementation** (Registry Scan):
```c
void unified_cache_refill(int class_idx) {
    // Linear search through per-class registry
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* ss = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(ss)) { // ← Tier check (5-10 cycles)
            // Carve blocks
            carve_blocks_from_superslab(ss, class_idx, cache);
            return;
        }
    }
    // Not found → cold path (allocate new SuperSlab)
}
```
**Proposed Implementation** (Warm Pool First):
```c
void unified_cache_refill(int class_idx) {
    // 1. Try warm pool first (no lock, O(1))
    SuperSlab* ss = tiny_warm_pool_pop(class_idx);
    if (ss) {
        // SuperSlab already HOT (pre-qualified), no tier check needed
        carve_blocks_from_superslab(ss, class_idx, cache);
        return;
    }
    // 2. Fall back to registry scan (only if warm pool empty)
    //    (TINY_WARM_POOL_MAX_PER_CLASS = 4, so this rarely happens)
    for (int i = 0; i < g_super_reg_by_class_count[class_idx]; i++) {
        SuperSlab* cand = g_super_reg_by_class[class_idx][i];
        if (ss_tier_is_hot(cand)) {
            carve_blocks_from_superslab(cand, class_idx, cache);
            // Refill warm pool on success
            for (int j = 0; j < 2; j++) {
                SuperSlab* extra = find_next_hot_slab(class_idx, i);
                if (extra) {
                    tiny_warm_pool_push(class_idx, extra);
                    i++;
                }
            }
            return;
        }
    }
    // 3. Cold path (allocate new SuperSlab)
    allocate_new_superslab(class_idx, cache);
}
```
### Phase 4: Batched Tier Transition Checks
**Current:** Tier check on every refill (5-10 cycles)
**Proposed:** Batch tier checks once per N operations
```c
// Per-thread tier check counter; validation is amortized over many refills
static __thread uint32_t g_tier_check_counter = 0;
#define TIER_CHECK_BATCH_SIZE 256

void tier_check_maybe_batch(int class_idx) {
    if (++g_tier_check_counter % TIER_CHECK_BATCH_SIZE == 0) {
        // Batch check: sample SuperSlabs from the per-class registry
        int n = g_super_reg_by_class_count[class_idx];
        for (int i = 0; i < 10 && n > 0; i++) { // Sample 10 SuperSlabs
            SuperSlab* ss = g_super_reg_by_class[class_idx][rand() % n];
            if (!ss_tier_is_hot(ss)) {
                // Demote from warm pool if present
                // (Cost: 1 atomic per 256 operations)
            }
        }
    }
}
```
### Phase 5: LRU Cache Integration
**How Warm Pool Gets Replenished:**
1. **Startup:** Pre-populate warm pools from LRU cache
2. **During execution:** On cold path alloc, add extra SuperSlab to warm pool
3. **Periodic:** Background thread refills warm pools when < threshold
4. **On free:** When SuperSlab becomes empty, add to LRU cache (not warm pool)
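A sketch of the free-path hand-off in point 4 (the empty-slab hook is assumed; `ss_cache_put()` is the LRU insert referenced in Phase 1):
```c
// Sketch: empty SuperSlabs return to the global LRU cache, never to a
// warm pool, so warm pools only ever hold pre-qualified HOT SuperSlabs.
static inline void tiny_on_superslab_empty(SuperSlab* ss) {
    ss_cache_put(ss); // a thread's warm pool can later refill from here
}
```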
---
## 📈 Expected Performance Impact
### Current Baseline
```
Random Mixed: 1.06M ops/s
Breakdown:
- 95% cache hits (HOT): ~1.007M ops/s (clean, 2-3 cache misses)
- 5% cache misses (WARM): ~0.053M ops/s (registry scan + refill)
```
### After Warm Pool Implementation
```
Estimated: 1.5-1.8M ops/s (+40-70%)
Breakdown:
- 95% cache hits (HOT): ~1.007M ops/s (unchanged, 2-3 cache misses)
- 5% cache misses (WARM): ~0.15-0.20M ops/s (warm pool, O(1) pop)
(vs 0.053M before)
Improvement mechanism:
- Remove registry O(N) scan → O(1) warm pool pop
- Reduce per-refill cost: ~500 cycles → ~50 cycles
- Expected per-miss speedup: ~10x
- Applied to the ~5% of operations that miss: gain = 1.06M × 0.05 × 9 = 0.477M ops/s
- Total: 1.06M + 0.477M = 1.537M ops/s (+45%)
```
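The same estimate as an explicit cost model (a back-of-envelope check, assuming a miss costs ~10x a hit before the change and ~1x after):
```latex
T_{\text{before}} = 0.95\,t_{\text{hit}} + 0.05 \cdot 10\,t_{\text{hit}} = 1.45\,t_{\text{hit}}
\qquad
T_{\text{after}} = 0.95\,t_{\text{hit}} + 0.05 \cdot 1\,t_{\text{hit}} = 1.00\,t_{\text{hit}}
```
Throughput then scales by T_before / T_after = 1.45x, i.e. 1.06M × 1.45 ≈ 1.54M ops/s, consistent with the +45% above.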
### Path to 10x
Current efforts can achieve:
- **Warm pool optimization:** +40-70% (this proposal)
- **Lock-free refill path:** +10-20% (phase 2)
- **Batch tier transitions:** +5-10% (phase 2)
- **Reduced syscall overhead:** +5% (phase 3)
- **Total realistic: 2.0-2.5x** (not 10x)
**To reach 10x improvement, would need:**
1. Dedicated per-thread allocation pools (reduce lock contention)
2. Batch pre-allocation strategy (reduce per-op overhead)
3. Size class coalescing (reduce routing complexity)
4. Or: Change workload pattern (batch allocations)
---
## ⚠️ Implementation Risks & Mitigations
### Risk 1: Thread-Local Storage Bloat
**Risk:** Adding warm pool increases per-thread memory usage
**Mitigation:**
- Allocate warm pool lazily
- Limit to 4-8 SuperSlabs per class (128KB per thread max)
- Default: 4 slots per class (~128KB total per thread, acceptable)
### Risk 2: Warm Pool Invalidation
**Risk:** SuperSlabs become DRAINING/FREE unexpectedly
**Mitigation:**
- Periodic validation during batch tier checks
- Accept the occasional stale entry (rare; correctness is unaffected)
- Fall back to registry scan if a warm pool slot is invalid (see the sketch below)
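One way to realize that fallback, sketched with the Phase 1 helpers (`ss_tier_is_hot()` from `core/box/ss_tier_box.h`; `ss_cache_put()` as before):
```c
// Sketch: defensive pop that re-validates tier at use time; stale slabs
// are recycled to the LRU instead of being handed to the carver.
static inline SuperSlab* tiny_warm_pool_pop_validated(int class_idx) {
    TinyWarmPool* pool = &g_tiny_warm_pool[class_idx];
    while (pool->count > 0) {
        SuperSlab* ss = pool->slabs[--pool->count];
        if (ss_tier_is_hot(ss)) return ss; // still HOT → use it
        ss_cache_put(ss);                  // went DRAINING/FREE → recycle
    }
    return NULL; // empty or all stale → registry scan / COLD path
}
```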
### Risk 3: Stale SuperSlabs
**Risk:** Warm pool holds SuperSlabs that should be freed
**Mitigation:**
- LRU-based eviction from warm pool
- Maximum hold time: 60s (configurable)
- On thread exit: drain the warm pool back to the LRU cache (sketch below)
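The drain itself is small; a sketch (the hook point, e.g. a pthread key destructor, is left open):
```c
// Sketch: return every pooled SuperSlab to the global LRU on thread exit
// so nothing is stranded in dead TLS.
static inline void tiny_warm_pool_drain(void) {
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        TinyWarmPool* pool = &g_tiny_warm_pool[c];
        while (pool->count > 0) {
            ss_cache_put(pool->slabs[--pool->count]);
        }
    }
}
```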
### Risk 4: Initialization Race
**Risk:** Multiple threads initialize warm pools simultaneously
**Mitigation:**
- Use `__thread` storage (each thread owns and initializes only its own pools)
- Lazy initialization with check-then-set
- No atomic operations needed (per-thread)
---
## 🔄 Integration Checklist
### Pre-Implementation
- [ ] Review current unified_cache_refill() implementation
- [ ] Identify all places where SuperSlab allocation happens
- [ ] Audit Tier system for validation requirements
- [ ] Measure current registry scan cost in micro-benchmark
### Phase 1: Warm Pool Infrastructure
- [ ] Create `core/front/tiny_warm_pool.h` with data structures
- [ ] Implement warm_pool_init(), pop(), push() operations
- [ ] Add __thread variable declarations
- [ ] Write unit tests for warm pool operations
- [ ] Verify no TLS bloat (profile memory usage)
### Phase 2: Integration Points
- [ ] Modify malloc_tiny_fast() to initialize warm pools
- [ ] Integrate warm_pool_pop() in unified_cache_refill()
- [ ] Implement warm_pool_push() in cold allocation path
- [ ] Add initialization on first malloc
- [ ] Handle thread exit cleanup
### Phase 3: Testing
- [ ] Micro-benchmark: warm pool pop (should be O(1), 2-3 cycles)
- [ ] Benchmark Random Mixed: measure ops/s improvement
- [ ] Benchmark Tiny Hot: verify no regression (should be unchanged)
- [ ] Stress test: concurrent threads + warm pool refill
- [ ] Correctness: verify all objects properly allocated/freed
### Phase 4: Profiling & Optimization
- [ ] Profile hot path (should still be 20-30 cycles)
- [ ] Profile warm path (should be reduced to 50-100 cycles)
- [ ] Measure registry scan reduction
- [ ] Identify any remaining bottlenecks
### Phase 5: Documentation
- [ ] Update comments in unified_cache_refill()
- [ ] Document warm pool design in README
- [ ] Add environment variables (if needed)
- [ ] Document tier check batching strategy
---
## 📊 Metrics to Track
### Pre-Implementation
```
Baseline Random Mixed:
- Ops/sec: 1.06M
- L1 cache misses: ~763K per 1M ops
- Page faults: ~7,674
- CPU cycles: ~70.4M
```
### Post-Implementation Targets
```
After warm pool:
- Ops/sec: 1.5-1.8M (+40-70%)
- L1 cache misses: Similar or slightly reduced
- Page faults: Same (~7,674)
- CPU cycles: ~45-50M (30% reduction)
Warm path breakdown:
- Warm pool hit: 50-100 cycles per batch
- Registry fallback: 200-300 cycles (rare)
- Cold alloc: 1000-5000 cycles (very rare)
```
---
## 💾 Files to Create/Modify
### New Files
- `core/front/tiny_warm_pool.h` - Warm pool data structures & operations
### Modified Files
1. `core/front/malloc_tiny_fast.h`
- Initialize warm pools on first call
- Document three-tier routing
2. `core/front/tiny_unified_cache.h`
- Modify unified_cache_refill() to use warm pool first
- Add warm pool replenishment logic
3. `core/box/ss_tier_box.h`
- Add batched tier check strategy
- Document validation requirements
4. `core/hakmem_tiny.h` or `core/front/malloc_tiny_fast.h`
- Add environment variables:
- `HAKMEM_WARM_POOL_SIZE` (default: 4)
- `HAKMEM_WARM_POOL_REFILL_THRESHOLD` (default: 1)
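A minimal parsing sketch for these knobs (getenv-based; defaults match the proposal, and the size is clamped to the compiled-in capacity):
```c
#include <stdlib.h>

static int g_warm_pool_size = 4;             // HAKMEM_WARM_POOL_SIZE
static int g_warm_pool_refill_threshold = 1; // HAKMEM_WARM_POOL_REFILL_THRESHOLD

static void warm_pool_read_env(void) {
    const char* s = getenv("HAKMEM_WARM_POOL_SIZE");
    if (s && *s) g_warm_pool_size = atoi(s);
    const char* t = getenv("HAKMEM_WARM_POOL_REFILL_THRESHOLD");
    if (t && *t) g_warm_pool_refill_threshold = atoi(t);
    // Clamp so the fixed-size slab array cannot overflow.
    if (g_warm_pool_size < 1) g_warm_pool_size = 1;
    if (g_warm_pool_size > TINY_WARM_POOL_MAX_PER_CLASS)
        g_warm_pool_size = TINY_WARM_POOL_MAX_PER_CLASS;
}
```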
### Configuration Files
- Add warm pool parameters to benchmark configuration
- Update profiling tools to measure warm pool effectiveness
---
## 🎯 Success Criteria
**Must Have:**
1. Warm pool implementation reduces registry scan cost by 80%+
2. Random Mixed ops/s increases to 1.5M+ (40%+ improvement)
3. Tiny Hot ops/s unchanged (no regression)
4. All allocations remain correct (no memory corruption)
5. No thread-local storage bloat (< 200KB per thread)
**Nice to Have:**
1. Random Mixed reaches 2M+ ops/s (90%+ improvement)
2. Warm pool hit rate > 90% (rarely fall back to registry)
3. L1 cache misses reduced by 10%+
4. Per-free cost unchanged (no regression)
**Not in Scope (separate PR):**
1. Lock-free refill path (requires CAS-based warm pool)
2. Per-thread allocation pools (requires larger redesign)
3. Hugepages support (already tested, no gain)
---
## 📝 Next Steps
1. **Review this proposal** with the team
2. **Approve scope & success criteria**
3. **Begin Phase 1 implementation** (warm pool header file)
4. **Integrate with unified_cache_refill()**
5. **Benchmark and measure improvements**
6. **Iterate based on profiling results**
---
## 🔗 References
- Current Profiling: `COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md`
- Session Summary: `FINAL_SESSION_REPORT_20251204.md`
- Box Architecture: `core/box/` directory
- Unified Cache: `core/front/tiny_unified_cache.h`
- Registry: `core/hakmem_super_registry.h`
- Tier System: `core/box/ss_tier_box.h`