# Phase 9 LRU Architecture Issue - Root Cause Analysis

**Date**: 2025-11-14
**Discovery**: Task B-1 Investigation
**Impact**: ❌ **CRITICAL** - Phase 9 Lazy Deallocation completely non-functional

---

## Executive Summary

The Phase 9 LRU cache for SuperSlab reuse is **architecturally unreachable** during normal operation, because the TLS SLL fast path prevents the `meta->used == 0` condition from ever being reached.

**Result**:
- LRU cache never populated (0% utilization)
- SuperSlabs never reused (100% mmap/munmap churn)
- Syscall overhead: 6,455 calls per 200K iterations (74.8% of total time)
- Performance impact: **-94% regression** (9.38M → 563K ops/s)

---

## Root Cause Chain

### 1. Free Path Architecture

**Fast Path (95-99% of frees):**
```c
// core/tiny_free_fast_v2.inc.h
hak_tiny_free_fast_v2(ptr) {
    tls_sll_push(class_idx, base);  // ← Does NOT decrement meta->used
}
```

**Slow Path (1-5% of frees):**
```c
// core/tiny_superslab_free.inc.h
tiny_free_local_box() {
    meta->used--;  // ← ONLY here is meta->used decremented
}
```

### 2. The Accounting Gap

**Physical reality**: blocks freed to the TLS SLL are available for reuse.
**Slab accounting**: those same blocks are still counted as "used" (`meta->used` unchanged).

**Consequence**: slabs never appear empty → SuperSlabs never freed → LRU never used.
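To make the gap concrete, here is a minimal, self-contained toy model (hypothetical types and names, not hakmem's real structures): every block is pushed back through the fast-path free, yet the slab's `used` counter never returns to 0.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model of the accounting gap (hypothetical, not hakmem's real types). */
typedef struct { int used; } SlabMeta;
typedef struct Node { struct Node *next; } Node;

static SlabMeta meta = {0};
static Node    *tls_sll_head = NULL;  /* thread-local freelist (simplified) */
static Node     blocks[4];            /* pretend slab with 4 blocks */

static Node *alloc_block(int i) { meta.used++; return &blocks[i]; }

/* Fast-path free: push to the SLL, deliberately leaving meta.used alone. */
static void fast_free(Node *b) { b->next = tls_sll_head; tls_sll_head = b; }

int main(void) {
    for (int i = 0; i < 4; i++) fast_free(alloc_block(i));
    /* All 4 blocks are physically free (parked in the SLL), yet: */
    printf("used = %d, SLL non-empty = %d\n", meta.used, tls_sll_head != NULL);
    assert(meta.used == 4);  /* the empty check `used == 0` can never fire */
    return 0;
}
```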
### 3. Empty Detection Code Path

```c
// core/tiny_superslab_free.inc.h:211 (local free)
if (meta->used == 0) {
    shared_pool_release_slab(ss, slab_idx);  // ← NEVER REACHED
}

// core/hakmem_shared_pool.c:298
if (ss->active_slabs == 0) {
    superslab_free(ss);  // ← NEVER REACHED
}

// core/hakmem_tiny_superslab.c:1016
void superslab_free(SuperSlab* ss) {
    int lru_cached = hak_ss_lru_push(ss);  // ← NEVER CALLED
}
```

### 4. Experimental Evidence

**Test**: `bench_random_mixed_hakmem 200000 4096 1234567`

**Observations**:
```bash
export HAKMEM_SS_LRU_DEBUG=1
export HAKMEM_SS_FREE_DEBUG=1

# Results (200K iterations):
[LRU_POP] class=X (miss): 877 times  ← LRU lookup attempts
[LRU_PUSH]:                 0 times  ← NEVER populated
[SS_FREE]:                  0 times  ← NEVER called
[SS_EMPTY]:                 0 times  ← meta->used never reached 0
```

**Syscall Impact**:
```
mmap:   3,241 calls (27.4% time)
munmap: 3,214 calls (47.4% time)
Total:  6,455 syscalls (74.8% time)  ← Should be ~100 with LRU working
```

---

## Why This Happens

### TLS SLL Design Rationale

**Purpose**: ultra-fast free path (3-5 instructions)
**Tradeoff**: no slab accounting updates

**Lifecycle**:
1. Block allocated from slab: `meta->used++`
2. Block freed to TLS SLL: `meta->used` UNCHANGED
3. Block reallocated from TLS SLL: `meta->used` UNCHANGED
4. Cycle repeats indefinitely

**Drain Behavior**:
- The `bench_random_mixed` drain phase frees all blocks
- But TLS SLL cleanup (`hakmem_tiny_lifecycle.inc:162-170`) drains to `tls_list`, NOT back to slabs
- `meta->used` is never decremented
- Slabs are never reported as empty

### Benchmark Characteristics

`bench_random_mixed.c`:
- Working set: 4,096 slots (random alloc/free)
- Size range: 16-1040 bytes
- Pattern: blocks cycle through the TLS SLL
- **Never reaches `meta->used == 0` during the main loop**

---

## Impact Analysis

### Performance Regression

| Metric | Phase 11 (Before) | Current (After SEGV Fix) | Change |
|--------|-------------------|--------------------------|--------|
| Throughput | 9.38M ops/s | 563K ops/s | **-94%** |
| mmap calls | ~800-900 | 3,241 | +260-305% |
| munmap calls | ~800-900 | 3,214 | +257-302% |
| LRU hits | Expected high | **0** | -100% |

**Root Causes**:
1. **Primary (74.8% time)**: LRU not working → mmap/munmap churn
2. **Secondary (11.0% time)**: mincore() SEGV fix overhead

### Design Validity

**Phase 9 LRU implementation**: ✅ **Functionally correct**
- `hak_ss_lru_push()`: works as designed
- `hak_ss_lru_pop()`: works as designed
- Cache eviction: works as designed

**Phase 9 architecture**: ❌ **Fundamentally incompatible** with the TLS SLL fast path

---

## Solution Options

### Option A: Decrement `meta->used` in Fast Path ❌

**Approach**: Modify `tls_sll_push()` to decrement `meta->used`

**Problems**:
- Requires a SuperSlab lookup (expensive)
- Defeats the fast path's purpose (3-5 instructions → 50+ instructions)
- Cache misses, branch mispredicts

**Verdict**: Not viable

---

### Option B: Periodic TLS SLL Drain to Slabs ✅ **RECOMMENDED**

**Approach**:
- Drain the TLS SLL back to slab freelists periodically (e.g., every 1K frees)
- Decrement `meta->used` via `tiny_free_local_box()`
- Allow slab empty detection

**Implementation** (a sketch of the drain routine itself follows the block):
```c
static __thread uint32_t g_tls_sll_drain_counter[TINY_NUM_CLASSES] = {0};

void tls_sll_push(int class_idx, void* base) {
    // Fast path: push to SLL
    // ... existing code ...

    // Periodic drain
    if (++g_tls_sll_drain_counter[class_idx] >= 1024) {
        tls_sll_drain_to_slabs(class_idx);
        g_tls_sll_drain_counter[class_idx] = 0;
    }
}
```
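A minimal sketch of `tls_sll_drain_to_slabs()`, assuming a `tls_sll_pop()` accessor exists and that `tiny_free_local_box()` can be invoked per block; both signatures are assumptions, and the real ones in `tiny_superslab_free.inc.h` may differ:

```c
#include <stddef.h>

/* Assumed hooks into the existing allocator (hypothetical signatures). */
extern void *tls_sll_pop(int class_idx);
extern void  tiny_free_local_box(int class_idx, void *base);

/* Sketch: pop parked blocks off the TLS SLL and route them through the
 * slow-path free, which is the only place meta->used is decremented. */
static void tls_sll_drain_to_slabs(int class_idx) {
    /* Bounded batch so the pause injected into the free path stays small. */
    for (int i = 0; i < 1024; i++) {
        void *base = tls_sll_pop(class_idx);
        if (base == NULL) break;               /* SLL empty: done */
        tiny_free_local_box(class_idx, base);  /* meta->used--; may cascade
                                                  into slab/SuperSlab release */
    }
}
```

The batch bound keeps the worst-case pause on the free path predictable; a full drain can remain reserved for thread shutdown.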
**Benefits**:
- Fast path stays fast (99.9% of frees)
- Slow-path drain (0.1% of frees) updates `meta->used`
- Enables slab empty detection
- LRU cache becomes functional

**Expected Impact**:
- mmap/munmap: 6,455 → ~100-200 calls (-96-97%)
- Throughput: 563K → 8-10M ops/s (+1,300-1,700%)

---

### Option C: Separate Accounting ⚠️

**Approach**: Track "logical used" (including TLS SLL) separately from "physical used"

**Problems**:
- Complex and error-prone
- Requires atomic operations (slow)
- Hard to keep the two counts consistent

**Verdict**: Not recommended

---

### Option D: Accept Current Behavior ❌

**Approach**: Use the LRU cache only for shutdown/cleanup, not at runtime

**Problems**:
- Defeats Phase 9's purpose (lazy deallocation)
- Leaves the 74.8% syscall overhead unfixed
- Performance remains regressed by 94%

**Verdict**: Not acceptable

---

## Recommendation

**Implement Option B: Periodic TLS SLL Drain**

### Phase 12 Design

1. **Add drain trigger** in `tls_sll_push()`
   - Every 1,024 frees (tunable via ENV)
   - Drain TLS SLL → slab freelist
   - Decrement `meta->used` properly
2. **Enable slab empty detection**
   - `meta->used == 0` now reachable
   - `shared_pool_release_slab()` called
   - `superslab_free()` → `hak_ss_lru_push()` called
3. **LRU cache becomes functional**
   - SuperSlabs reused from cache
   - mmap/munmap reduced by 96-97%
   - Syscall overhead: 74.8% → ~5%

### Expected Performance

```
Current: 563K ops/s  (0.63% of System malloc)
After:   8-10M ops/s (9-11% of System malloc)
Gain:    +1,300-1,700%
```

**Remaining gap to System malloc (90M ops/s)**:
- Still need +800-1,000% additional optimization
- Focus areas: front cache hit rate, branch prediction, cache locality

---

## Action Items

1. **[URGENT]** Implement TLS SLL periodic drain (Option B)
2. **[HIGH]** Add ENV tuning: `HAKMEM_TLS_SLL_DRAIN_INTERVAL=1024` (see the appendix sketch below)
3. **[HIGH]** Re-measure with `strace -c` (expect -96% mmap/munmap)
4. **[MEDIUM]** Fix prewarm crash (separate investigation)
5. **[MEDIUM]** Document the architectural tradeoff in design docs

---

## Lessons Learned

1. **Fast path optimizations can disable architectural features**
   - TLS SLL fast path → LRU cache unreachable
   - Periodic cleanup is needed to restore functionality
2. **Accounting consistency is critical**
   - `meta->used` must reflect the true state
   - Buffering (TLS SLL) creates an accounting gap
3. **Integration testing is needed**
   - Phase 9 LRU tested in isolation: ✅ works
   - Phase 9 LRU + TLS SLL integration: ❌ broken
   - End-to-end benchmarks are needed
4. **Performance monitoring is essential**
   - An LRU hit rate of 0% should have triggered an alert
   - The syscall count regression should have been caught earlier

---

## Files Involved

- `/mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h` - fast path (no `meta->used` update)
- `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_free.inc.h` - slow path (`meta->used--`)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_shared_pool.c` - empty detection
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c` - `superslab_free()`
- `/mnt/workdisk/public_share/hakmem/core/hakmem_super_registry.c` - LRU cache implementation

---

## Conclusion

The Phase 9 LRU cache is **functionally correct** but **architecturally unreachable**, because the TLS SLL fast path never updates `meta->used`.

**Fix**: Implement a periodic TLS SLL drain to restore slab accounting consistency and enable LRU cache utilization.

**Expected Impact**: +1,300-1,700% throughput improvement (563K → 8-10M ops/s)
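---

## Appendix: ENV Knob Sketch (Action Item 2)

A minimal sketch of the `HAKMEM_TLS_SLL_DRAIN_INTERVAL` knob. The variable name and init function are hypothetical (hakmem's existing ENV-parsing helpers, if any, should be used instead); only the ENV name and the 1,024 default come from this document.

```c
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical global: resolved once at init; default matches the 1,024
 * drain interval proposed in Option B. */
static uint32_t g_tls_sll_drain_interval = 1024;

static void tls_sll_drain_interval_init(void) {
    const char *s = getenv("HAKMEM_TLS_SLL_DRAIN_INTERVAL");
    if (s) {
        long v = strtol(s, NULL, 10);
        if (v > 0 && v <= (1L << 20))  /* reject zero/negative/absurd values */
            g_tls_sll_drain_interval = (uint32_t)v;
    }
}
```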