# Phase 9 LRU Architecture Issue - Root Cause Analysis

**Date**: 2025-11-14
**Discovery**: Task B-1 Investigation
**Impact**: ❌ **CRITICAL** - Phase 9 Lazy Deallocation completely non-functional

---

## Executive Summary

The Phase 9 LRU cache for SuperSlab reuse is **architecturally unreachable** during normal operation: the TLS SLL fast path frees blocks without decrementing `meta->used`, so the `meta->used == 0` condition that feeds the cache never occurs.

**Result**:
- LRU cache never populated (0% utilization)
- SuperSlabs never reused (100% mmap/munmap churn)
- Syscall overhead: 6,455 calls per 200K iterations (74.8% of total time)
- Performance impact: **-94% regression** (9.38M → 563K ops/s)

---

## Root Cause Chain

### 1. Free Path Architecture

**Fast Path (95-99% of frees):**
```c
// core/tiny_free_fast_v2.inc.h
hak_tiny_free_fast_v2(ptr) {
    tls_sll_push(class_idx, base);  // ← Does NOT decrement meta->used
}
```

**Slow Path (1-5% of frees):**
```c
// core/tiny_superslab_free.inc.h
tiny_free_local_box() {
    meta->used--;  // ← ONLY here is meta->used decremented
}
```

### 2. The Accounting Gap

**Physical Reality**: Blocks freed to TLS SLL (available for reuse)
**Slab Accounting**: Blocks still counted as "used" (`meta->used` unchanged)

**Consequence**: Slabs never appear empty → SuperSlabs never freed → LRU never used
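
The gap can be reproduced in a toy model. The sketch below is not hakmem code: `ToySlab`, `ToySll`, and the `toy_*` functions are hypothetical stand-ins for a slab's `meta->used` counter and a thread-local free list. It allocates `n` blocks, frees all of them through the fast path, and returns the counter.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; none of these names exist in hakmem. */
typedef struct ToyBlock { struct ToyBlock *next; } ToyBlock;

typedef struct {
    unsigned used;      /* plays the role of meta->used */
    unsigned capacity;  /* number of blocks the slab can hand out */
    unsigned carved;    /* blocks handed out from the slab so far */
    ToyBlock blocks[8];
} ToySlab;

typedef struct {
    ToyBlock *head;     /* thread-local singly linked list of freed blocks */
} ToySll;

/* Allocation: prefer the free list; only a fresh carve touches `used`. */
static ToyBlock *toy_alloc(ToySlab *slab, ToySll *sll) {
    if (sll->head != NULL) {
        ToyBlock *b = sll->head;
        sll->head = b->next;
        return b;                 /* `used` unchanged */
    }
    if (slab->carved == slab->capacity) return NULL;
    slab->used++;
    return &slab->blocks[slab->carved++];
}

/* Fast-path free: push onto the list, never touch `used`. */
static void toy_free_fast(ToySll *sll, ToyBlock *b) {
    b->next = sll->head;
    sll->head = b;
}

/* Allocate n blocks, free every one via the fast path, report `used`. */
unsigned toy_used_after_churn(unsigned n) {
    ToySlab slab = { .used = 0, .capacity = 8, .carved = 0 };
    ToySll sll = { NULL };
    ToyBlock *live[8];
    assert(n <= 8);
    for (unsigned i = 0; i < n; i++) live[i] = toy_alloc(&slab, &sll);
    for (unsigned i = 0; i < n; i++) toy_free_fast(&sll, live[i]);
    return slab.used;
}
```

With `n = 4`, `toy_used_after_churn(4)` returns 4: every block is on the free list and reusable, yet the slab still looks fully used, so a check for `used == 0` cannot fire.
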

### 3. Empty Detection Code Path

```c
// core/tiny_superslab_free.inc.h:211 (local free)
if (meta->used == 0) {
    shared_pool_release_slab(ss, slab_idx);  // ← NEVER REACHED
}

// core/hakmem_shared_pool.c:298
if (ss->active_slabs == 0) {
    superslab_free(ss);  // ← NEVER REACHED
}

// core/hakmem_tiny_superslab.c:1016
void superslab_free(SuperSlab* ss) {
    int lru_cached = hak_ss_lru_push(ss);  // ← NEVER CALLED
}
```
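
The three checks form a cascade in which each step depends on the one before it. The following sketch models that cascade with hypothetical names (`ToySuperSlab` and the `toy_*` functions are illustrative, not hakmem APIs) and assumes a SuperSlab with four slabs holding two live blocks each. The LRU flag is set only if every slab's counter actually reaches zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these names exist in hakmem. */
typedef struct {
    unsigned used[4];       /* per-slab counter, like meta->used */
    unsigned active_slabs;  /* slabs that still hold live blocks */
    bool in_lru;            /* set when the SuperSlab reaches the LRU cache */
} ToySuperSlab;

/* Step 3: freeing the SuperSlab is what populates the LRU cache. */
static void toy_superslab_free(ToySuperSlab *ss) {
    ss->in_lru = true;
}

/* Step 2: releasing the last active slab frees the SuperSlab. */
static void toy_release_slab(ToySuperSlab *ss) {
    if (--ss->active_slabs == 0) toy_superslab_free(ss);
}

/* Step 1: a slow-path free decrements the counter and detects "empty". */
void toy_free_local(ToySuperSlab *ss, int slab_idx) {
    assert(ss->used[slab_idx] > 0);
    if (--ss->used[slab_idx] == 0) toy_release_slab(ss);
}

/* Apply `frees_per_slab` slow-path frees (at most 2) to each slab. */
bool toy_cascade_reaches_lru(unsigned frees_per_slab) {
    ToySuperSlab ss = { {2, 2, 2, 2}, 4, false };
    for (int i = 0; i < 4; i++)
        for (unsigned k = 0; k < frees_per_slab; k++)
            toy_free_local(&ss, i);
    return ss.in_lru;
}
```

`toy_cascade_reaches_lru(2)` returns `true`, while `toy_cascade_reaches_lru(1)` returns `false`: if even one decrement per slab is skipped, nothing downstream runs.
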

### 4. Experimental Evidence

**Test**: `bench_random_mixed_hakmem 200000 4096 1234567`

**Observations**:
```bash
export HAKMEM_SS_LRU_DEBUG=1
export HAKMEM_SS_FREE_DEBUG=1

# Results (200K iterations):
# [LRU_POP] class=X (miss): 877 times   ← LRU lookup attempts
# [LRU_PUSH]:               0 times     ← NEVER populated
# [SS_FREE]:                0 times     ← NEVER called
# [SS_EMPTY]:               0 times     ← meta->used never reached 0
```

**Syscall Impact**:
```
mmap:    3,241 calls (27.4% time)
munmap:  3,214 calls (47.4% time)
Total:   6,455 syscalls (74.8% time)  ← Should be ~100 with LRU working
```

---

## Why This Happens

### TLS SLL Design Rationale

**Purpose**: Ultra-fast free path (3-5 instructions)
**Tradeoff**: No slab accounting updates

**Lifecycle**:
1. Block allocated from slab: `meta->used++`
2. Block freed to TLS SLL: `meta->used` UNCHANGED
3. Block reallocated from TLS SLL: `meta->used` UNCHANGED
4. Cycle repeats indefinitely

**Drain Behavior**:
- The `bench_random_mixed` drain phase frees all blocks
- But TLS SLL cleanup (`hakmem_tiny_lifecycle.inc:162-170`) drains to `tls_list`, NOT back to slabs
- `meta->used` is never decremented
- Slabs are never reported as empty

### Benchmark Characteristics

`bench_random_mixed.c`:
- Working set: 4,096 slots (random alloc/free)
- Size range: 16-1040 bytes
- Pattern: Blocks cycle through TLS SLL
- **Never reaches `meta->used == 0` during main loop**

---

## Impact Analysis

### Performance Regression

| Metric | Phase 11 (Before) | Current (After SEGV Fix) | Change |
|--------|-------------------|--------------------------|--------|
| Throughput | 9.38M ops/s | 563K ops/s | **-94%** |
| mmap calls | ~800-900 | 3,241 | +260-305% |
| munmap calls | ~800-900 | 3,214 | +257-302% |
| LRU hits | Expected high | **0** | -100% |

**Root Causes**:
1. **Primary (74.8% time)**: LRU not working → mmap/munmap churn
2. **Secondary (11.0% time)**: mincore() SEGV fix overhead

### Design Validity

**Phase 9 LRU Implementation**: ✅ **Functionally Correct**
- `hak_ss_lru_push()`: Works as designed
- `hak_ss_lru_pop()`: Works as designed
- Cache eviction: Works as designed

**Phase 9 Architecture**: ❌ **Fundamentally Incompatible** with TLS SLL fast path

---

## Solution Options

### Option A: Decrement `meta->used` in Fast Path ❌

**Approach**: Modify `tls_sll_push()` to decrement `meta->used`

**Problem**:
- Requires SuperSlab lookup (expensive)
- Defeats fast path purpose (3-5 instructions → 50+ instructions)
- Cache misses, branch mispredicts

**Verdict**: Not viable

---

### Option B: Periodic TLS SLL Drain to Slabs ✅ **RECOMMENDED**

**Approach**:
- Drain TLS SLL back to slab freelists periodically (e.g., every 1K frees)
- Decrement `meta->used` via `tiny_free_local_box()`
- Allow slab empty detection

**Implementation**:
```c
static __thread uint32_t g_tls_sll_drain_counter[TINY_NUM_CLASSES] = {0};

void tls_sll_push(int class_idx, void* base) {
    // Fast path: push to SLL
    // ... existing code ...

    // Periodic drain
    if (++g_tls_sll_drain_counter[class_idx] >= 1024) {
        tls_sll_drain_to_slabs(class_idx);
        g_tls_sll_drain_counter[class_idx] = 0;
    }
}
```

**Benefits**:
- Fast path stays fast (99.9% of frees)
- Slow path drain (0.1% of frees) updates `meta->used`
- Enables slab empty detection
- LRU cache becomes functional

**Expected Impact**:
- mmap/munmap: 6,455 → ~100-200 calls (about -97% to -98%)
- Throughput: 563K → 8-10M ops/s (+1,300-1,700%)
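
To make the drain mechanics concrete, here is a minimal self-contained sketch of a counter-triggered drain. It is a toy under stated assumptions, not the hakmem implementation: all names (`ToyTls`, `ToyMeta`, `toy_sll_push`, `toy_drain_to_slab`) are hypothetical, every block is assumed to belong to a single slab (the real drain would have to find each block's owning slab), and the interval is 4 instead of 1,024 so the effect is visible with a handful of frees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DRAIN_INTERVAL 4u  /* the proposal uses 1,024; kept small for the demo */

/* Toy model only; none of these names exist in hakmem. */
typedef struct ToyNode { struct ToyNode *next; } ToyNode;

typedef struct {
    uint32_t used;          /* plays the role of meta->used */
    uint32_t empty_events;  /* how often `used` reached 0 */
} ToyMeta;

typedef struct {
    ToyNode *head;           /* thread-local free list */
    uint32_t drain_counter;  /* frees since the last drain */
    ToyMeta *meta;           /* assumption: one slab owns every block */
} ToyTls;

/* Slow part: hand every buffered block back and fix up the accounting. */
static void toy_drain_to_slab(ToyTls *tls) {
    while (tls->head != NULL) {
        tls->head = tls->head->next;
        assert(tls->meta->used > 0);
        if (--tls->meta->used == 0) tls->meta->empty_events++;
    }
}

/* Fast part: push, bump the counter, drain once per interval. */
static void toy_sll_push(ToyTls *tls, ToyNode *n) {
    n->next = tls->head;
    tls->head = n;
    if (++tls->drain_counter >= TOY_DRAIN_INTERVAL) {
        toy_drain_to_slab(tls);
        tls->drain_counter = 0;
    }
}

/* Free all n blocks of a slab with used == n; report empty detections. */
uint32_t toy_free_all(uint32_t n) {
    ToyNode nodes[16];
    ToyMeta meta = { n, 0 };
    ToyTls tls = { NULL, 0, &meta };
    assert(n <= 16);
    for (uint32_t i = 0; i < n; i++) toy_sll_push(&tls, &nodes[i]);
    return meta.empty_events;
}
```

`toy_free_all(8)` returns 1, because the second drain brings `used` to zero. `toy_free_all(6)` returns 0: the last two frees are still buffered when the function returns. This suggests a design point to keep in mind: a counter-only trigger can leave up to one interval of frees unaccounted, so a slab may stay "non-empty" until the next drain.
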

---

### Option C: Separate Accounting ⚠️

**Approach**: Track "logical used" (includes TLS SLL) vs "physical used"

**Problem**:
- Complex, error-prone
- Atomic operations required (slow)
- Hard to maintain consistency

**Verdict**: Not recommended

---

### Option D: Accept Current Behavior ❌

**Approach**: LRU cache only for shutdown/cleanup, not runtime

**Problem**:
- Defeats Phase 9 purpose (lazy deallocation)
- Leaves 74.8% syscall overhead unfixed
- Performance remains -94% regressed

**Verdict**: Not acceptable

---

## Recommendation

**Implement Option B: Periodic TLS SLL Drain**

### Phase 12 Design

1. **Add drain trigger** in `tls_sll_push()`
   - Every 1,024 frees (tunable via ENV)
   - Drain TLS SLL → slab freelist
   - Decrement `meta->used` properly

2. **Enable slab empty detection**
   - `meta->used == 0` now reachable
   - `shared_pool_release_slab()` called
   - `superslab_free()` → `hak_ss_lru_push()` called

3. **LRU cache becomes functional**
   - SuperSlabs reused from cache
   - mmap/munmap reduced by about 97-98%
   - Syscall overhead: 74.8% → ~5%

### Expected Performance

```
Current:  563K ops/s (0.63% of System malloc)
After:    8-10M ops/s (9-11% of System malloc)
Gain:     +1,300-1,700%
```

**Remaining gap to System malloc (90M ops/s)**:
- Still need +800-1,000% additional optimization
- Focus areas: Front cache hit rate, branch prediction, cache locality
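
The projected figures can be checked with two small helpers. This is only an arithmetic sketch; `pct_change` and `pct_of` are illustrative names, and the inputs are the throughput numbers quoted in this section.

```c
#include <assert.h>

/* Percent change from `before` to `after`; 100 -> 150 gives +50.0. */
double pct_change(double before, double after) {
    return (after - before) / before * 100.0;
}

/* `part` expressed as a percentage of `whole`; 1 of 4 gives 25.0. */
double pct_of(double part, double whole) {
    return part / whole * 100.0;
}
```

Using these, 563K → 8M ops/s is about +1,321% and 563K → 10M ops/s is about +1,676%, which brackets the quoted +1,300-1,700% gain. Against a 90M ops/s baseline, 563K ops/s is about 0.63%, and closing the gap from 8-10M ops/s needs a further +800% to +1,025%.
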

---

## Action Items

1. **[URGENT]** Implement TLS SLL periodic drain (Option B)
2. **[HIGH]** Add ENV tuning: `HAKMEM_TLS_SLL_DRAIN_INTERVAL=1024`
3. **[HIGH]** Re-measure with `strace -c` (expect about -97% mmap/munmap calls)
4. **[MEDIUM]** Fix prewarm crash (separate investigation)
5. **[MEDIUM]** Document architectural tradeoff in design docs
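
Action item 2 could be implemented along the following lines. This is a sketch, not existing hakmem code: `parse_drain_interval` and `drain_interval_from_env` are hypothetical names, and only the variable name `HAKMEM_TLS_SLL_DRAIN_INTERVAL` and the default of 1,024 come from this document. It falls back to the default for anything that is not a plain positive decimal number.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DRAIN_INTERVAL_DEFAULT 1024u

/* Parse a drain interval; return the default for NULL, empty, signed,
 * non-numeric, zero, or out-of-range input. Hypothetical helper. */
uint32_t parse_drain_interval(const char *s) {
    if (s == NULL || s[0] < '0' || s[0] > '9') return DRAIN_INTERVAL_DEFAULT;
    char *end = NULL;
    errno = 0;
    unsigned long v = strtoul(s, &end, 10);
    if (errno != 0 || *end != '\0' || v == 0 || v > UINT32_MAX)
        return DRAIN_INTERVAL_DEFAULT;
    return (uint32_t)v;
}

/* Read the interval once at startup and cache the result, so the
 * fast path never calls getenv(). Hypothetical helper. */
uint32_t drain_interval_from_env(void) {
    return parse_drain_interval(getenv("HAKMEM_TLS_SLL_DRAIN_INTERVAL"));
}
```

Rejecting zero matters here: an interval of 0 would make every free take the drain path, which is the cost Option A was rejected for.
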

---

## Lessons Learned

1. **Fast path optimizations can disable architectural features**
   - TLS SLL fast path → LRU cache unreachable
   - Need periodic cleanup to restore functionality

2. **Accounting consistency is critical**
   - `meta->used` must reflect true state
   - Buffering (TLS SLL) creates accounting gap

3. **Integration testing needed**
   - Phase 9 LRU tested in isolation: ✅ Works
   - Phase 9 LRU + TLS SLL integration: ❌ Broken
   - Need end-to-end benchmarks

4. **Performance monitoring essential**
   - LRU hit rate = 0% should have triggered alert
   - Syscall count regression should have been caught earlier
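
A monitoring check for lesson 4 can be very small. The sketch below assumes counters for LRU pops and pushes are available; `ToyLruStats` and `toy_lru_never_populated` are hypothetical names, not part of hakmem. It flags a cache that is being consulted but never filled, which is the signature seen in the debug output (877 pop misses, 0 pushes).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters; hakmem may track these differently or not at all. */
typedef struct {
    uint64_t lru_pop_hits;
    uint64_t lru_pop_misses;
    uint64_t lru_pushes;
} ToyLruStats;

/* True when the cache has seen at least `min_lookups` lookups
 * but has never received a single push. */
bool toy_lru_never_populated(const ToyLruStats *s, uint64_t min_lookups) {
    uint64_t lookups = s->lru_pop_hits + s->lru_pop_misses;
    return lookups >= min_lookups && s->lru_pushes == 0;
}
```

The `min_lookups` threshold avoids false alarms during startup, when a few misses before the first push are expected.
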

---

## Files Involved

- `/mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h` - Fast path (no `meta->used` update)
- `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_free.inc.h` - Slow path (`meta->used--`)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_shared_pool.c` - Empty detection
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c` - `superslab_free()`
- `/mnt/workdisk/public_share/hakmem/core/hakmem_super_registry.c` - LRU cache implementation

---

## Conclusion

The Phase 9 LRU cache is **functionally correct** but **architecturally unreachable**, because the TLS SLL fast path does not update `meta->used`.

**Fix**: Implement periodic TLS SLL drain to restore slab accounting consistency and enable LRU cache utilization.

**Expected Impact**: +1,300-1,700% throughput improvement (563K → 8-10M ops/s)