# Phase 9 LRU Architecture Issue - Root Cause Analysis
**Date**: 2025-11-14
**Discovery**: Task B-1 Investigation
**Impact**: ❌ **CRITICAL** - Phase 9 Lazy Deallocation completely non-functional
---
## Executive Summary
The Phase 9 LRU cache for SuperSlab reuse is **architecturally unreachable** during normal operation: the TLS SLL fast path prevents the `meta->used == 0` condition from ever being reached.
**Result**:
- LRU cache never populated (0% utilization)
- SuperSlabs never reused (100% mmap/munmap churn)
- Syscall overhead: 6,455 calls per 200K iterations (74.8% of total time)
- Performance impact: **-94% regression** (9.38M → 563K ops/s)
---
## Root Cause Chain
### 1. Free Path Architecture
**Fast Path (95-99% of frees):**
```c
// core/tiny_free_fast_v2.inc.h
hak_tiny_free_fast_v2(ptr) {
    tls_sll_push(class_idx, base);  // ← does NOT decrement meta->used
}
```
**Slow Path (1-5% of frees):**
```c
// core/tiny_superslab_free.inc.h
tiny_free_local_box() {
    meta->used--;  // ← ONLY here is meta->used decremented
}
```
### 2. The Accounting Gap
**Physical Reality**: Blocks freed to TLS SLL (available for reuse)
**Slab Accounting**: Blocks still counted as "used" (`meta->used` unchanged)
**Consequence**: Slabs never appear empty → SuperSlabs never freed → LRU never used
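The gap can be reproduced outside the allocator. A minimal standalone simulation (the types below are stand-ins invented for illustration, not hakmem code):
```c
#include <stdio.h>
#include <stddef.h>

typedef struct { int used; } SlabMeta;            /* stand-in for slab meta */
typedef struct Node { struct Node* next; } Node;  /* stand-in for TLS SLL   */

int main(void) {
    SlabMeta meta = { .used = 0 };
    Node blocks[4] = {0};
    Node* tls_sll = NULL;

    for (int i = 0; i < 4; i++) meta.used++;      /* alloc: meta->used++ */

    for (int i = 0; i < 4; i++) {                 /* fast-path free:        */
        blocks[i].next = tls_sll;                 /*   push onto TLS SLL,   */
        tls_sll = &blocks[i];                     /*   meta->used untouched */
    }

    /* All four blocks are physically free, yet the slab still looks full: */
    printf("meta.used = %d (the `used == 0` check can never fire)\n", meta.used);
    return 0;
}
```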
### 3. Empty Detection Code Path
```c
// core/tiny_superslab_free.inc.h:211 (local free)
if (meta->used == 0) {
    shared_pool_release_slab(ss, slab_idx);  // ← NEVER REACHED
}

// core/hakmem_shared_pool.c:298
if (ss->active_slabs == 0) {
    superslab_free(ss);  // ← NEVER REACHED
}

// core/hakmem_tiny_superslab.c:1016
void superslab_free(SuperSlab* ss) {
    int lru_cached = hak_ss_lru_push(ss);  // ← NEVER CALLED
}
```
### 4. Experimental Evidence
**Test**: `bench_random_mixed_hakmem 200000 4096 1234567`
**Observations**:
```bash
export HAKMEM_SS_LRU_DEBUG=1
export HAKMEM_SS_FREE_DEBUG=1
# Results (200K iterations):
[LRU_POP] class=X (miss): 877 times ← LRU lookup attempts
[LRU_PUSH]: 0 times ← NEVER populated
[SS_FREE]: 0 times ← NEVER called
[SS_EMPTY]: 0 times ← meta->used never reached 0
```
**Syscall Impact**:
```
mmap: 3,241 calls (27.4% time)
munmap: 3,214 calls (47.4% time)
Total: 6,455 syscalls (74.8% time) ← Should be ~100 with LRU working
```
---
## Why This Happens
### TLS SLL Design Rationale
**Purpose**: Ultra-fast free path (3-5 instructions)
**Tradeoff**: No slab accounting updates
**Lifecycle**:
1. Block allocated from slab: `meta->used++`
2. Block freed to TLS SLL: `meta->used` UNCHANGED
3. Block reallocated from TLS SLL: `meta->used` UNCHANGED
4. Cycle repeats infinitely
**Drain Behavior**:
- `bench_random_mixed` drain phase frees all blocks
- But TLS SLL cleanup (`hakmem_tiny_lifecycle.inc:162-170`) drains to `tls_list`, NOT back to slabs
- `meta->used` never decremented
- Slabs never reported as empty
### Benchmark Characteristics
`bench_random_mixed.c`:
- Working set: 4,096 slots (random alloc/free)
- Size range: 16-1040 bytes
- Pattern: Blocks cycle through TLS SLL
- **Never reaches `meta->used == 0` during the main loop** (see the sketch below)
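A hypothetical reconstruction of that pattern against plain `malloc`/`free` (slot count, iteration count, and size range are taken from the text above; the real benchmark drives hakmem's entry points):
```c
#include <stdlib.h>

#define SLOTS 4096

int main(void) {
    void* slot[SLOTS] = {0};
    srand(1234567);

    for (long i = 0; i < 200000; i++) {
        int s = rand() % SLOTS;
        if (slot[s]) {
            free(slot[s]);          /* fast path: block parks in the TLS SLL */
            slot[s] = NULL;
        } else {
            size_t sz = 16 + (size_t)(rand() % 1025);  /* 16-1040 bytes */
            slot[s] = malloc(sz);   /* often served straight from the SLL */
        }
    }

    /* Drain phase: these frees land in the TLS SLL too, so slabs still
     * never report empty. */
    for (int s = 0; s < SLOTS; s++) free(slot[s]);
    return 0;
}
```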
---
## Impact Analysis
### Performance Regression
| Metric | Phase 11 (Before) | Current (After SEGV Fix) | Change |
|--------|-------------------|--------------------------|--------|
| Throughput | 9.38M ops/s | 563K ops/s | **-94%** |
| mmap calls | ~800-900 | 3,241 | +260-305% |
| munmap calls | ~800-900 | 3,214 | +257-302% |
| LRU hits | Expected high | **0** | -100% |
**Root Causes**:
1. **Primary (74.8% time)**: LRU not working → mmap/munmap churn
2. **Secondary (11.0% time)**: mincore() SEGV fix overhead
### Design Validity
**Phase 9 LRU Implementation**: ✅ **Functionally Correct**
- `hak_ss_lru_push()`: Works as designed
- `hak_ss_lru_pop()`: Works as designed
- Cache eviction: Works as designed
**Phase 9 Architecture**: ❌ **Fundamentally Incompatible** with TLS SLL fast path
---
## Solution Options
### Option A: Decrement `meta->used` in Fast Path ❌
**Approach**: Modify `tls_sll_push()` to decrement `meta->used`
**Problem**:
- Requires SuperSlab lookup (expensive)
- Defeats fast path purpose (3-5 instructions → 50+ instructions)
- Cache misses, branch mispredicts
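For scale, roughly what Option A would bolt onto every free (all declarations below are hypothetical, written only to make the added work visible):
```c
#include <stdint.h>

typedef struct SuperSlab SuperSlab;
typedef struct { uint32_t used; } SlabMeta;

/* Hypothetical lookups Option A would need on EVERY free: */
SuperSlab* superslab_lookup(void* base);             /* registry walk     */
SlabMeta*  slab_meta_of(SuperSlab* ss, void* base);  /* index math + load */

void tls_sll_push_option_a(int class_idx, void* base) {
    /* ... existing 3-5 instruction SLL push ... */

    /* Added per-free cost: a pointer→SuperSlab lookup, a likely-cold
     * metadata load, a decrement, and a branch. */
    SlabMeta* meta = slab_meta_of(superslab_lookup(base), base);
    meta->used--;
    (void)class_idx;
}
```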
**Verdict**: Not viable
---
### Option B: Periodic TLS SLL Drain to Slabs ✅ **RECOMMENDED**
**Approach**:
- Drain TLS SLL back to slab freelists periodically (e.g., every 1K frees)
- Decrement `meta->used` via `tiny_free_local_box()`
- Allow slab empty detection
**Implementation**:
```c
static __thread uint32_t g_tls_sll_drain_counter[TINY_NUM_CLASSES] = {0};

void tls_sll_push(int class_idx, void* base) {
    // Fast path: push to SLL
    // ... existing code ...

    // Periodic drain
    if (++g_tls_sll_drain_counter[class_idx] >= 1024) {
        tls_sll_drain_to_slabs(class_idx);
        g_tls_sll_drain_counter[class_idx] = 0;
    }
}
```
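The snippet calls `tls_sll_drain_to_slabs()` without defining it. One possible shape, sketched under assumptions (`tls_sll_pop()` and the batch size of 64 are inventions for illustration; `tiny_free_local_box()` is the existing slow-path free named above, whose real signature may differ):
```c
static void tls_sll_drain_to_slabs(int class_idx) {
    // Drain a bounded batch so the occasional slow free stays predictable.
    for (int i = 0; i < 64; i++) {
        void* base = tls_sll_pop(class_idx);  // take one block off the SLL
        if (!base) break;                     // SLL empty -- nothing to do

        // Route through the slow-path free: this decrements meta->used,
        // and when a slab finally hits used == 0 the release/LRU chain
        // from section 3 can fire.
        tiny_free_local_box(base);
    }
}
```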
**Benefits**:
- Fast path stays fast (99.9% of frees)
- Slow path drain (0.1% of frees) updates `meta->used`
- Enables slab empty detection
- LRU cache becomes functional
**Expected Impact**:
- mmap/munmap: 6,455 → ~100-200 calls (-96-97%)
- Throughput: 563K → 8-10M ops/s (+1,300-1,700%)
---
### Option C: Separate Accounting ⚠️
**Approach**: Track "logical used" (includes TLS SLL) vs "physical used"
**Problem**:
- Complex, error-prone
- Atomic operations required (slow)
- Hard to maintain consistency
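One possible reading of the dual counters (hypothetical types; the point is the atomic RMW every fast-path free would need):
```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    uint32_t         logical_used;   /* current meta->used semantics:      */
                                     /* TLS-SLL-parked blocks still count  */
    _Atomic uint32_t physical_used;  /* excludes blocks parked in TLS SLLs */
} SlabMetaDual;

static inline void option_c_fast_free(SlabMetaDual* m) {
    /* The owning slab may be touched from other threads, so keeping
     * physical_used honest costs an atomic RMW on EVERY free -- exactly
     * the overhead the TLS SLL fast path exists to avoid. */
    atomic_fetch_sub_explicit(&m->physical_used, 1, memory_order_relaxed);
}
```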
**Verdict**: Not recommended
---
### Option D: Accept Current Behavior ❌
**Approach**: LRU cache only for shutdown/cleanup, not runtime
**Problem**:
- Defeats Phase 9 purpose (lazy deallocation)
- Leaves 74.8% syscall overhead unfixed
- Performance remains -94% regressed
**Verdict**: Not acceptable
---
## Recommendation
**Implement Option B: Periodic TLS SLL Drain**
### Phase 12 Design
1. **Add drain trigger** in `tls_sll_push()`
- Every 1,024 frees (tunable via ENV; see the sketch after this list)
- Drain TLS SLL → slab freelist
- Decrement `meta->used` properly
2. **Enable slab empty detection**
- `meta->used == 0` now reachable
- `shared_pool_release_slab()` called
- `superslab_free()` → `hak_ss_lru_push()` called
3. **LRU cache becomes functional**
- SuperSlabs reused from cache
- mmap/munmap reduced by 96-97%
- Syscall overhead: 74.8% → ~5%
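A minimal sketch of the ENV knob from step 1 (function and variable names here are assumptions; only the `HAKMEM_TLS_SLL_DRAIN_INTERVAL` name comes from the action items below):
```c
#include <stdint.h>
#include <stdlib.h>

static uint32_t g_tls_sll_drain_interval = 1024;  /* Option B default */

/* Call once at startup (e.g., from allocator init). */
static void tls_sll_drain_interval_init(void) {
    const char* s = getenv("HAKMEM_TLS_SLL_DRAIN_INTERVAL");
    if (s && *s) {
        unsigned long v = strtoul(s, NULL, 10);
        if (v > 0 && v <= UINT32_MAX) g_tls_sll_drain_interval = (uint32_t)v;
    }
}
```
The drain check in `tls_sll_push()` would then compare against `g_tls_sll_drain_interval` instead of the hard-coded 1024.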
### Expected Performance
```
Current: 563K ops/s (0.63% of System malloc)
After: 8-10M ops/s (9-11% of System malloc)
Gain: +1,300-1,700%
```
**Remaining gap to System malloc (90M ops/s)**:
- Still need +800-1,000% additional optimization
- Focus areas: Front cache hit rate, branch prediction, cache locality
---
## Action Items
1. **[URGENT]** Implement TLS SLL periodic drain (Option B)
2. **[HIGH]** Add ENV tuning: `HAKMEM_TLS_SLL_DRAIN_INTERVAL=1024`
3. **[HIGH]** Re-measure with `strace -c` (expect -96% mmap/munmap)
4. **[MEDIUM]** Fix prewarm crash (separate investigation)
5. **[MEDIUM]** Document architectural tradeoff in design docs
---
## Lessons Learned
1. **Fast path optimizations can disable architectural features**
- TLS SLL fast path → LRU cache unreachable
- Need periodic cleanup to restore functionality
2. **Accounting consistency is critical**
- `meta->used` must reflect true state
- Buffering (TLS SLL) creates accounting gap
3. **Integration testing needed**
- Phase 9 LRU tested in isolation: ✅ Works
- Phase 9 LRU + TLS SLL integration: ❌ Broken
- Need end-to-end benchmarks
4. **Performance monitoring essential**
- LRU hit rate = 0% should have triggered an alert (see the sketch below)
- Syscall count regression should have been caught earlier
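Hypothetical counters (not existing hakmem stats) that would have surfaced this immediately, namely many pops against zero pushes:
```c
#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long g_lru_pushes, g_lru_hits, g_lru_misses;

/* Call periodically or at shutdown. */
static void lru_sanity_report(void) {
    unsigned long pops = g_lru_hits + g_lru_misses;
    if (pops >= 100 && g_lru_pushes == 0) {
        fprintf(stderr, "[LRU] WARNING: %lu pops but 0 pushes -- "
                        "cache path may be unreachable\n", pops);
    }
}
```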
---
## Files Involved
- `/mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h` - Fast path (no `meta->used` update)
- `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_free.inc.h` - Slow path (`meta->used--`)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_shared_pool.c` - Empty detection
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c` - `superslab_free()`
- `/mnt/workdisk/public_share/hakmem/core/hakmem_super_registry.c` - LRU cache implementation
---
## Conclusion
Phase 9 LRU cache is **functionally correct** but **architecturally unreachable** due to TLS SLL fast path not updating `meta->used`.
**Fix**: Implement periodic TLS SLL drain to restore slab accounting consistency and enable LRU cache utilization.
**Expected Impact**: +1,300-1,700% throughput improvement (563K → 8-10M ops/s)