# Phase 15 v1: UnifiedCache FIFO→LIFO (Stack) A/B Test Results

**Date:** 2025-12-15
**Benchmark:** Mixed (16–1024B) + C7-only (1025–2048B) 10-run cleanenv
**Target:** Transform existing UnifiedCache from FIFO ring to LIFO stack
**Expected ROI:** +5–10% (design estimate, cache locality improvement)
**GO Threshold:** +1.0% mean improvement

---

## 1. Implementation Summary

Phase 15 v1 transforms the existing array-based UnifiedCache from FIFO (ring buffer) to LIFO (stack) layout.

**Key Changes:**
- **Patch 1**: L0 ENV gate box (`tiny_unified_lifo_env_box.{h,c}`)
- **Patch 2**: L1 LIFO operations (`tiny_unified_lifo_box.h`)
- **Patch 3**: Hot path integration (`tiny_front_hot_box.h` - alloc/free both)
- **Patch 4**: Makefile updates (added `.o` files)
- **Patch 5**: `bench_profile.h` refresh sync

**Design:**
- Reuses existing `TinyUnifiedCache.slots[]` array (no intrusive pointers)
- `tail` treated as stack top (depth), `head` unused (always 0)
- Mode check at function entry (once per call)
- No wrap-around (`mask` unused in LIFO mode)

**ENV Control:**
```bash
export HAKMEM_TINY_UNIFIED_LIFO=0  # Baseline (FIFO)
export HAKMEM_TINY_UNIFIED_LIFO=1  # Optimized (LIFO)
```
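
The design points and the ENV gate above can be sketched roughly as follows. This is a minimal illustration, not the HAKMEM source: the `DemoCache` layout, `DEMO_CAP`, and every `demo_*` function are assumptions made for the example. Only the `slots[]`, `head`, `tail`, and `mask` field names and the `HAKMEM_TINY_UNIFIED_LIFO` variable come from the description above. The sketch shows both modes sharing one array, with `tail` acting as the stack depth in LIFO mode and a single mode check at function entry.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_CAP 8u  /* hypothetical capacity; a power of two so `mask` works */

/* Hypothetical stand-in for TinyUnifiedCache; the real layout may differ. */
typedef struct {
    void    *slots[DEMO_CAP];
    uint16_t head;  /* FIFO: consumer index; LIFO: unused, stays 0 */
    uint16_t tail;  /* FIFO: producer index; LIFO: stack depth     */
    uint16_t mask;  /* FIFO: DEMO_CAP - 1;   LIFO: unused          */
} DemoCache;

/* ENV gate: read HAKMEM_TINY_UNIFIED_LIFO once, then reuse the cached value. */
static inline int demo_lifo_enabled(void) {
    static int cached = -1;  /* -1 means "not read yet" */
    if (cached < 0) {
        const char *v = getenv("HAKMEM_TINY_UNIFIED_LIFO");
        cached = (v != NULL && v[0] == '1') ? 1 : 0;
    }
    return cached;
}

/* FIFO ring: indices wrap through `mask`; one slot stays empty to mark "full". */
static inline int demo_fifo_push(DemoCache *c, void *p) {
    uint16_t next = (uint16_t)((c->tail + 1u) & c->mask);
    if (next == c->head) return 0;  /* ring full */
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}

static inline void *demo_fifo_pop(DemoCache *c) {
    if (c->head == c->tail) return NULL;  /* ring empty */
    void *p = c->slots[c->head];
    c->head = (uint16_t)((c->head + 1u) & c->mask);
    return p;
}

/* LIFO stack: `tail` is the depth, so there is no wrap-around and no mask. */
static inline int demo_lifo_push(DemoCache *c, void *p) {
    if (c->tail >= DEMO_CAP) return 0;  /* stack full */
    c->slots[c->tail++] = p;
    return 1;
}

static inline void *demo_lifo_pop(DemoCache *c) {
    if (c->tail == 0) return NULL;  /* stack empty */
    return c->slots[--c->tail];
}

/* Entry points: one mode check per call, then dispatch to the chosen layout. */
static inline int demo_cache_push(DemoCache *c, void *p) {
    return demo_lifo_enabled() ? demo_lifo_push(c, p) : demo_fifo_push(c, p);
}

static inline void *demo_cache_pop(DemoCache *c) {
    return demo_lifo_enabled() ? demo_lifo_pop(c) : demo_fifo_pop(c);
}
```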

**Bonus Fix:**
- Discovered and fixed pre-existing LTO linkage bug for `tiny_c7_preserve_header_enabled()` (Phase 13/14 latent issue)
- Converted static inline to extern declaration + non-inline implementation
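
The shape of that conversion can be pictured with a single-file sketch. Everything here is illustrative: the `demo_c7_preserve_header_enabled` name, the `DEMO_C7_PRESERVE_HEADER` variable, the function body, and the file names in the comments are assumptions, since the report only names `tiny_c7_preserve_header_enabled()` and the static-inline-to-extern change.

```c
#include <stdlib.h>

/* ---- header part (hypothetical demo_c7_env.h) ----
 * Before the change, the function would have been defined here as
 * `static inline`, so each translation unit including the header
 * compiled its own internal-linkage copy.
 * After the change, the header carries only this declaration. */
extern int demo_c7_preserve_header_enabled(void);

/* ---- source part (hypothetical demo_c7_env.c) ----
 * The single non-inline definition with external linkage.
 * The environment variable name is made up for this sketch. */
int demo_c7_preserve_header_enabled(void) {
    static int cached = -1;  /* -1 means "not read yet" */
    if (cached < 0) {
        const char *v = getenv("DEMO_C7_PRESERVE_HEADER");
        cached = (v != NULL && v[0] == '1') ? 1 : 0;
    }
    return cached;
}
```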

---

## 2. A/B Test Results

### Mixed (16–1024B):
- **Baseline (LIFO=0):** 52,965,966 ops/s
- **Optimized (LIFO=1):** 52,593,948 ops/s
- **Delta:** **-0.70%** (regression)

### C7-only (1025–2048B):
- **Baseline (LIFO=0):** 78,010,783 ops/s
- **Optimized (LIFO=1):** 78,335,509 ops/s
- **Delta:** **+0.42%** (slight improvement)
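
The deltas above follow from (optimized - baseline) / baseline, and the short sketch below reproduces them from the reported ops/s figures and compares each against the +1.0% GO threshold. The helper names (`delta_percent`, `is_go`, `report`) are hypothetical and not part of the benchmark tooling. With the numbers above, `report` yields -0.70% for Mixed and +0.42% for C7-only, both below the threshold.

```c
#include <stdio.h>

/* Relative change of `optimized` versus `baseline`, in percent. */
static double delta_percent(double baseline, double optimized) {
    return (optimized - baseline) / baseline * 100.0;
}

/* GO when the mean improvement reaches the threshold (+1.0% in this phase). */
static int is_go(double delta_pct, double threshold_pct) {
    return delta_pct >= threshold_pct;
}

/* Prints one result line and returns the delta so callers can reuse it. */
static double report(const char *name, double baseline, double optimized) {
    double d = delta_percent(baseline, optimized);
    printf("%-8s %+.2f%% -> %s\n", name, d,
           is_go(d, 1.0) ? "GO" : "below GO threshold");
    return d;
}
```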

---

## 3. Verdict: NEUTRAL

**Result:** Mixed -0.70%, C7-only +0.42% (both below GO threshold)

**Comparison to Phase 14:**
- Phase 14 v1 (tcache free-side only): Mixed +0.20% (NEUTRAL)
- Phase 14 v2 (tcache alloc+free): Mixed +0.08%, C7-only -0.39% (NEUTRAL)
- Phase 15 v1 (FIFO→LIFO): Mixed -0.70%, C7-only +0.42% (NEUTRAL)

**Root Cause:**
1. **Mode check overhead**: Entry-point `tiny_unified_lifo_enabled()` call adds branch
2. **Minimal locality delta**: LIFO vs FIFO temporal locality difference is small in practice
3. **Existing optimization**: FIFO ring implementation already well-optimized
4. **Cache warming**: TLS cache pre-warming reduces locality sensitivity

---

## 4. Recommendation: Freeze as Research Box

**Decision:** Freeze Phase 15 v1 as a research box (`HAKMEM_TINY_UNIFIED_LIFO=0` by default, i.e. OFF)

**Rationale:**
- Neither LIFO nor FIFO shows a significant advantage
- Mode switching overhead outweighs potential locality gains
- Existing FIFO ring is simple and already fast

**Next:** Explore alternative approaches:
- Hybrid strategies (per-class mode selection)
- Batch operations (reduce per-call overhead)
- Hardware prefetch hints (explicit locality control)