# Phase 9 LRU Architecture Issue - Root Cause Analysis
**Date**: 2025-11-14
**Discovery**: Task B-1 Investigation
**Impact**: ❌ **CRITICAL** - Phase 9 Lazy Deallocation completely non-functional
---
## Executive Summary
The Phase 9 LRU cache for SuperSlab reuse is **architecturally unreachable** during normal operation: the TLS SLL fast path prevents the `meta->used == 0` condition from ever being reached.
**Result**:
- LRU cache never populated (0% utilization)
- SuperSlabs never reused (100% mmap/munmap churn)
- Syscall overhead: 6,455 calls per 200K iterations (74.8% of total time)
- Performance impact: **-94% regression** (9.38M → 563K ops/s)
---
## Root Cause Chain
### 1. Free Path Architecture
**Fast Path (95-99% of frees):**
```c
// core/tiny_free_fast_v2.inc.h
hak_tiny_free_fast_v2(ptr) {
    tls_sll_push(class_idx, base);  // ← does NOT decrement meta->used
}
```
**Slow Path (1-5% of frees):**
```c
// core/tiny_superslab_free.inc.h
tiny_free_local_box() {
    meta->used--;  // ← ONLY here is meta->used decremented
}
```
### 2. The Accounting Gap
**Physical Reality**: Blocks freed to TLS SLL (available for reuse)
**Slab Accounting**: Blocks still counted as "used" (`meta->used` unchanged)
**Consequence**: Slabs never appear empty → SuperSlabs never freed → LRU never used
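The gap can be reproduced outside the allocator. A minimal standalone simulation (the types below are stand-ins invented for illustration, not hakmem code):
```c
#include <stdio.h>
#include <stddef.h>

typedef struct { int used; } SlabMeta;            /* stand-in for slab meta */
typedef struct Node { struct Node* next; } Node;  /* stand-in for TLS SLL   */

int main(void) {
    SlabMeta meta = { .used = 0 };
    Node blocks[4] = {0};
    Node* tls_sll = NULL;

    for (int i = 0; i < 4; i++) meta.used++;      /* alloc: meta->used++ */

    for (int i = 0; i < 4; i++) {                 /* fast-path free:        */
        blocks[i].next = tls_sll;                 /*   push onto TLS SLL,   */
        tls_sll = &blocks[i];                     /*   meta->used untouched */
    }

    /* All four blocks are physically free, yet the slab still looks full: */
    printf("meta.used = %d (the `used == 0` check can never fire)\n", meta.used);
    return 0;
}
```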
### 3. Empty Detection Code Path
```c
// core/tiny_superslab_free.inc.h:211 (local free)
if (meta->used == 0) {
    shared_pool_release_slab(ss, slab_idx);  // ← NEVER REACHED
}

// core/hakmem_shared_pool.c:298
if (ss->active_slabs == 0) {
    superslab_free(ss);  // ← NEVER REACHED
}

// core/hakmem_tiny_superslab.c:1016
void superslab_free(SuperSlab* ss) {
    int lru_cached = hak_ss_lru_push(ss);  // ← NEVER CALLED
}
```
### 4. Experimental Evidence
**Test**: `bench_random_mixed_hakmem 200000 4096 1234567`
**Observations**:
```bash
export HAKMEM_SS_LRU_DEBUG=1
export HAKMEM_SS_FREE_DEBUG=1
# Results (200K iterations):
[LRU_POP] class=X (miss): 877 times ← LRU lookup attempts
[LRU_PUSH]: 0 times ← NEVER populated
[SS_FREE]: 0 times ← NEVER called
[SS_EMPTY]: 0 times ← meta->used never reached 0
```
**Syscall Impact**:
```
mmap: 3,241 calls (27.4% time)
munmap: 3,214 calls (47.4% time)
Total: 6,455 syscalls (74.8% time) ← Should be ~100 with LRU working
```
---
## Why This Happens
### TLS SLL Design Rationale
**Purpose**: Ultra-fast free path (3-5 instructions)
**Tradeoff**: No slab accounting updates
**Lifecycle**:
1. Block allocated from slab: `meta->used++`
2. Block freed to TLS SLL: `meta->used` UNCHANGED
3. Block reallocated from TLS SLL: `meta->used` UNCHANGED
4. Cycle repeats infinitely
**Drain Behavior**:
- `bench_random_mixed` drain phase frees all blocks
- But TLS SLL cleanup (`hakmem_tiny_lifecycle.inc:162-170`) drains to `tls_list`, NOT back to slabs
- `meta->used` never decremented
- Slabs never reported as empty
### Benchmark Characteristics
`bench_random_mixed.c`:
- Working set: 4,096 slots (random alloc/free)
- Size range: 16-1040 bytes
- Pattern: Blocks cycle through TLS SLL
- **Never reaches `meta->used == 0` during the main loop** (see the sketch below)
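A hypothetical reconstruction of that pattern against plain `malloc`/`free` (slot count, iteration count, and size range are taken from the text above; the real benchmark drives hakmem's entry points):
```c
#include <stdlib.h>

#define SLOTS 4096

int main(void) {
    void* slot[SLOTS] = {0};
    srand(1234567);

    for (long i = 0; i < 200000; i++) {
        int s = rand() % SLOTS;
        if (slot[s]) {
            free(slot[s]);          /* fast path: block parks in the TLS SLL */
            slot[s] = NULL;
        } else {
            size_t sz = 16 + (size_t)(rand() % 1025);  /* 16-1040 bytes */
            slot[s] = malloc(sz);   /* often served straight from the SLL */
        }
    }

    /* Drain phase: these frees land in the TLS SLL too, so slabs still
     * never report empty. */
    for (int s = 0; s < SLOTS; s++) free(slot[s]);
    return 0;
}
```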
---
## Impact Analysis
### Performance Regression
| Metric | Phase 11 (Before) | Current (After SEGV Fix) | Change |
|--------|-------------------|--------------------------|--------|
| Throughput | 9.38M ops/s | 563K ops/s | **-94%** |
| mmap calls | ~800-900 | 3,241 | +260-305% |
| munmap calls | ~800-900 | 3,214 | +257-302% |
| LRU hits | Expected high | **0** | -100% |
**Root Causes**:
1. **Primary (74.8% time)**: LRU not working → mmap/munmap churn
2. **Secondary (11.0% time)**: mincore() SEGV fix overhead
### Design Validity
**Phase 9 LRU Implementation**: ✅ **Functionally Correct**
- `hak_ss_lru_push()`: Works as designed
- `hak_ss_lru_pop()`: Works as designed
- Cache eviction: Works as designed
**Phase 9 Architecture**: ❌ **Fundamentally Incompatible** with TLS SLL fast path
---
## Solution Options
### Option A: Decrement `meta->used` in Fast Path ❌
**Approach**: Modify `tls_sll_push()` to decrement `meta->used`
**Problem**:
- Requires SuperSlab lookup (expensive)
- Defeats fast path purpose (3-5 instructions → 50+ instructions)
- Cache misses, branch mispredicts
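For scale, roughly what Option A would bolt onto every free (all declarations below are hypothetical, written only to make the added work visible):
```c
#include <stdint.h>

typedef struct SuperSlab SuperSlab;
typedef struct { uint32_t used; } SlabMeta;

/* Hypothetical lookups Option A would need on EVERY free: */
SuperSlab* superslab_lookup(void* base);             /* registry walk     */
SlabMeta*  slab_meta_of(SuperSlab* ss, void* base);  /* index math + load */

void tls_sll_push_option_a(int class_idx, void* base) {
    /* ... existing 3-5 instruction SLL push ... */

    /* Added per-free cost: a pointer→SuperSlab lookup, a likely-cold
     * metadata load, a decrement, and a branch. */
    SlabMeta* meta = slab_meta_of(superslab_lookup(base), base);
    meta->used--;
    (void)class_idx;
}
```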
**Verdict**: Not viable
---
### Option B: Periodic TLS SLL Drain to Slabs ✅ **RECOMMENDED**
**Approach**:
- Drain TLS SLL back to slab freelists periodically (e.g., every 1K frees)
- Decrement `meta->used` via `tiny_free_local_box()`
- Allow slab empty detection
**Implementation**:
```c
static __thread uint32_t g_tls_sll_drain_counter[TINY_NUM_CLASSES] = {0};

void tls_sll_push(int class_idx, void* base) {
    // Fast path: push to SLL
    // ... existing code ...

    // Periodic drain
    if (++g_tls_sll_drain_counter[class_idx] >= 1024) {
        tls_sll_drain_to_slabs(class_idx);
        g_tls_sll_drain_counter[class_idx] = 0;
    }
}
```
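The snippet calls `tls_sll_drain_to_slabs()` without defining it. One possible shape, sketched under assumptions (`tls_sll_pop()` and the batch size of 64 are inventions for illustration; `tiny_free_local_box()` is the existing slow-path free named above, whose real signature may differ):
```c
static void tls_sll_drain_to_slabs(int class_idx) {
    // Drain a bounded batch so the occasional slow free stays predictable.
    for (int i = 0; i < 64; i++) {
        void* base = tls_sll_pop(class_idx);  // take one block off the SLL
        if (!base) break;                     // SLL empty -- nothing to do

        // Route through the slow-path free: this decrements meta->used,
        // and when a slab finally hits used == 0 the release/LRU chain
        // from section 3 can fire.
        tiny_free_local_box(base);
    }
}
```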
**Benefits**:
- Fast path stays fast (99.9% of frees)
- Slow path drain (0.1% of frees) updates `meta->used`
- Enables slab empty detection
- LRU cache becomes functional
**Expected Impact**:
- mmap/munmap: 6,455 → ~100-200 calls (-96-97%)
- Throughput: 563K → 8-10M ops/s (+1,300-1,700%)
---
### Option C: Separate Accounting ⚠️
**Approach**: Track "logical used" (includes TLS SLL) vs "physical used"
**Problem**:
- Complex, error-prone
- Atomic operations required (slow)
- Hard to maintain consistency
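One possible reading of the dual counters (hypothetical types; the point is the atomic RMW every fast-path free would need):
```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    uint32_t         logical_used;   /* current meta->used semantics:      */
                                     /* TLS-SLL-parked blocks still count  */
    _Atomic uint32_t physical_used;  /* excludes blocks parked in TLS SLLs */
} SlabMetaDual;

static inline void option_c_fast_free(SlabMetaDual* m) {
    /* The owning slab may be touched from other threads, so keeping
     * physical_used honest costs an atomic RMW on EVERY free -- exactly
     * the overhead the TLS SLL fast path exists to avoid. */
    atomic_fetch_sub_explicit(&m->physical_used, 1, memory_order_relaxed);
}
```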
**Verdict**: Not recommended
---
### Option D: Accept Current Behavior ❌
**Approach**: LRU cache only for shutdown/cleanup, not runtime
**Problem**:
- Defeats Phase 9 purpose (lazy deallocation)
- Leaves 74.8% syscall overhead unfixed
- Performance remains -94% regressed
**Verdict**: Not acceptable
---
## Recommendation
**Implement Option B: Periodic TLS SLL Drain**
### Phase 12 Design
1. **Add drain trigger** in `tls_sll_push()`
- Every 1,024 frees (tunable via ENV; see the sketch after this list)
- Drain TLS SLL → slab freelist
- Decrement `meta->used` properly
2. **Enable slab empty detection**
- `meta->used == 0` now reachable
- `shared_pool_release_slab()` called
- `superslab_free()` → `hak_ss_lru_push()` called
3. **LRU cache becomes functional**
- SuperSlabs reused from cache
- mmap/munmap reduced by 96-97%
- Syscall overhead: 74.8% → ~5%
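A minimal sketch of the ENV knob from step 1 (function and variable names here are assumptions; only the `HAKMEM_TLS_SLL_DRAIN_INTERVAL` name comes from the action items below):
```c
#include <stdint.h>
#include <stdlib.h>

static uint32_t g_tls_sll_drain_interval = 1024;  /* Option B default */

/* Call once at startup (e.g., from allocator init). */
static void tls_sll_drain_interval_init(void) {
    const char* s = getenv("HAKMEM_TLS_SLL_DRAIN_INTERVAL");
    if (s && *s) {
        unsigned long v = strtoul(s, NULL, 10);
        if (v > 0 && v <= UINT32_MAX) g_tls_sll_drain_interval = (uint32_t)v;
    }
}
```
The drain check in `tls_sll_push()` would then compare against `g_tls_sll_drain_interval` instead of the hard-coded 1024.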
### Expected Performance
```
Current: 563K ops/s (0.63% of System malloc)
After: 8-10M ops/s (9-11% of System malloc)
Gain: +1,300-1,700%
```
**Remaining gap to System malloc (90M ops/s)**:
- Still need +800-1,000% additional optimization
- Focus areas: Front cache hit rate, branch prediction, cache locality
---
## Action Items
1. **[URGENT]** Implement TLS SLL periodic drain (Option B)
2. **[HIGH]** Add ENV tuning: `HAKMEM_TLS_SLL_DRAIN_INTERVAL=1024`
3. **[HIGH]** Re-measure with `strace -c` (expect -96% mmap/munmap)
4. **[MEDIUM]** Fix prewarm crash (separate investigation)
5. **[MEDIUM]** Document architectural tradeoff in design docs
---
## Lessons Learned
1. **Fast path optimizations can disable architectural features**
- TLS SLL fast path → LRU cache unreachable
- Need periodic cleanup to restore functionality
2. **Accounting consistency is critical**
- `meta->used` must reflect true state
- Buffering (TLS SLL) creates accounting gap
3. **Integration testing needed**
- Phase 9 LRU tested in isolation: ✅ Works
- Phase 9 LRU + TLS SLL integration: ❌ Broken
- Need end-to-end benchmarks
4. **Performance monitoring essential**
- LRU hit rate = 0% should have triggered an alert (see the sketch below)
- Syscall count regression should have been caught earlier
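Hypothetical counters (not existing hakmem stats) that would have surfaced this immediately, namely many pops against zero pushes:
```c
#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long g_lru_pushes, g_lru_hits, g_lru_misses;

/* Call periodically or at shutdown. */
static void lru_sanity_report(void) {
    unsigned long pops = g_lru_hits + g_lru_misses;
    if (pops >= 100 && g_lru_pushes == 0) {
        fprintf(stderr, "[LRU] WARNING: %lu pops but 0 pushes -- "
                        "cache path may be unreachable\n", pops);
    }
}
```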
---
## Files Involved
- `/mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h` - Fast path (no `meta->used` update)
- `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_free.inc.h` - Slow path (`meta->used--`)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_shared_pool.c` - Empty detection
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c` - `superslab_free()`
- `/mnt/workdisk/public_share/hakmem/core/hakmem_super_registry.c` - LRU cache implementation
---
## Conclusion
Phase 9 LRU cache is **functionally correct** but **architecturally unreachable** due to TLS SLL fast path not updating `meta->used`.
**Fix**: Implement periodic TLS SLL drain to restore slab accounting consistency and enable LRU cache utilization.
**Expected Impact**: +1,300-1,700% throughput improvement (563K → 8-10M ops/s)