hakmem/docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_AB_TEST_RESULTS.md
Moe Charm (CI) 87fa27518c Phase 15 v1: UnifiedCache FIFO→LIFO NEUTRAL (-0.70% Mixed, +0.42% C7)
Transform existing array-based UnifiedCache from FIFO ring to LIFO stack.

A/B Results:
- Mixed (16-1024B): -0.70% (52,965,966 → 52,593,948 ops/s)
- C7-only (1025-2048B): +0.42% (78,010,783 → 78,335,509 ops/s)

Verdict: NEUTRAL (both below +1.0% GO threshold) - freeze as research box

Implementation:
- L0 ENV gate: tiny_unified_lifo_env_box.{h,c} (HAKMEM_TINY_UNIFIED_LIFO=0/1)
- L1 LIFO ops: tiny_unified_lifo_box.h (unified_cache_try_pop/push_lifo)
- L2 integration: tiny_front_hot_box.h (mode check at entry)
- Reuses existing slots[] array (no intrusive pointers)

Root Causes:
1. Mode check overhead (tiny_unified_lifo_enabled() call)
2. Minimal LIFO vs FIFO locality delta in practice
3. Existing FIFO ring already well-optimized

Bonus Fix: LTO bug for tiny_c7_preserve_header_enabled() (Phase 13/14 latent issue)
- Converted static inline to extern + non-inline implementation
- Fixes undefined reference during LTO linking

Design: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_DESIGN.md
Results: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_AB_TEST_RESULTS.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 02:19:26 +09:00


Phase 15 v1: UnifiedCache FIFO→LIFO (Stack) A/B Test Results

Date: 2025-12-15
Benchmark: Mixed (16-1024B) + C7-only (1025-2048B), 10-run cleanenv
Target: Transform existing UnifiedCache from FIFO ring to LIFO stack
Expected ROI: +5-10% (design estimate, cache locality improvement)
GO Threshold: +1.0% mean improvement


1. Implementation Summary

Phase 15 v1 transforms the existing array-based UnifiedCache from FIFO (ring buffer) to LIFO (stack) layout.

Key Changes:

  • Patch 1: L0 ENV gate box (tiny_unified_lifo_env_box.{h,c})
  • Patch 2: L1 LIFO operations (tiny_unified_lifo_box.h)
  • Patch 3: Hot path integration (tiny_front_hot_box.h - alloc/free both)
  • Patch 4: Makefile updates (added .o files)
  • Patch 5: bench_profile.h refresh sync

Design:

  • Reuses existing TinyUnifiedCache.slots[] array (no intrusive pointers)
  • tail treated as stack top (depth), head unused (always 0)
  • Mode check at function entry (once per call)
  • No wrap-around (mask unused in LIFO mode)
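The design points above can be sketched as a plain array stack. This is a minimal illustration, not the code in tiny_unified_lifo_box.h: the struct `SketchCache`, the capacity `SKETCH_CAPACITY`, the field widths, and the `sketch_*` function names are all assumptions made for this example. Only the field names slots[], head, tail, and mask come from this document.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for TinyUnifiedCache. The real layout is not shown
 * in this document; capacity and field widths are assumed. */
#define SKETCH_CAPACITY 8u

typedef struct {
    void    *slots[SKETCH_CAPACITY]; /* same array a FIFO ring would use */
    uint16_t head;                   /* unused in LIFO mode, stays 0 */
    uint16_t tail;                   /* LIFO mode: stack depth (next free index) */
    uint16_t mask;                   /* FIFO wrap-around mask, unused in LIFO mode */
} SketchCache;

static inline void sketch_cache_init(SketchCache *c) {
    c->head = 0;
    c->tail = 0;
    c->mask = (uint16_t)(SKETCH_CAPACITY - 1u);
}

/* Push: store at slots[tail], then grow the stack. Returns false when full. */
static inline bool sketch_push_lifo(SketchCache *c, void *p) {
    if (c->tail >= SKETCH_CAPACITY) {
        return false;
    }
    c->slots[c->tail++] = p;
    return true;
}

/* Pop: return the most recently pushed pointer, or NULL when empty. */
static inline void *sketch_pop_lifo(SketchCache *c) {
    if (c->tail == 0) {
        return NULL;
    }
    return c->slots[--c->tail];
}
```

Because the stack only ever touches indices 0..tail-1, no masking or wrap-around arithmetic is needed, which is the property the "No wrap-around" bullet refers to.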

ENV Control:

export HAKMEM_TINY_UNIFIED_LIFO=0  # Baseline (FIFO)
export HAKMEM_TINY_UNIFIED_LIFO=1  # Optimized (LIFO)
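For illustration, an ENV gate of this kind is often implemented as a lazily cached getenv() lookup. The sketch below is an assumption about the general shape only: the names `sketch_parse_flag`, `sketch_lifo_enabled`, and `sketch_lifo_cached` are hypothetical, and the actual tiny_unified_lifo_enabled() in tiny_unified_lifo_env_box.{h,c} may parse or cache the variable differently.

```c
#include <stdlib.h>

/* Hypothetical cache for the gate value: -1 means "not read yet".
 * This sketch is not thread-safe; a real gate may use TLS or atomics. */
static int sketch_lifo_cached = -1;

/* Treat exactly "1" as enabled; unset or anything else means baseline FIFO.
 * (Assumed parsing rule, chosen for this example.) */
static int sketch_parse_flag(const char *v) {
    return (v != NULL && v[0] == '1' && v[1] == '\0') ? 1 : 0;
}

/* Read HAKMEM_TINY_UNIFIED_LIFO once, then serve the cached value. */
static int sketch_lifo_enabled(void) {
    if (sketch_lifo_cached < 0) {
        sketch_lifo_cached = sketch_parse_flag(getenv("HAKMEM_TINY_UNIFIED_LIFO"));
    }
    return sketch_lifo_cached;
}
```

Even with caching, every hot-path call still pays for the load and branch on the cached flag, which is the mode check overhead discussed in the verdict.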

Bonus Fix:

  • Discovered and fixed pre-existing LTO linkage bug for tiny_c7_preserve_header_enabled() (Phase 13/14 latent issue)
  • Converted static inline to extern declaration + non-inline implementation

2. A/B Test Results

Mixed (16-1024B):

  • Baseline (LIFO=0): 52,965,966 ops/s
  • Optimized (LIFO=1): 52,593,948 ops/s
  • Delta: -0.70% (regression)

C7-only (1025-2048B):

  • Baseline (LIFO=0): 78,010,783 ops/s
  • Optimized (LIFO=1): 78,335,509 ops/s
  • Delta: +0.42% (slight improvement)
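The deltas above follow from (optimized - baseline) / baseline. A minimal sketch of that arithmetic, together with a comparison against the +1.0% GO threshold, is shown here. The helper names and the two-value `Verdict` enum are hypothetical and are not part of the benchmark tooling.

```c
/* Hypothetical verdict type: this document only names GO and NEUTRAL. */
typedef enum { VERDICT_GO, VERDICT_NEUTRAL } Verdict;

/* Relative change of optimized vs. baseline throughput, in percent. */
static double delta_percent(double baseline, double optimized) {
    return (optimized - baseline) / baseline * 100.0;
}

/* GO only when the improvement reaches the threshold (here +1.0%). */
static Verdict classify(double delta_pct, double go_threshold_pct) {
    return (delta_pct >= go_threshold_pct) ? VERDICT_GO : VERDICT_NEUTRAL;
}
```

Applied to the figures above, `delta_percent(52965966.0, 52593948.0)` is about -0.70 and `delta_percent(78010783.0, 78335509.0)` is about +0.42. Both fall below the 1.0 threshold.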

3. Verdict: NEUTRAL

Result: Mixed -0.70%, C7-only +0.42% (both below the +1.0% GO threshold)

Comparison to Phase 14:

  • Phase 14 v1 (tcache free-side only): Mixed +0.20% (NEUTRAL)
  • Phase 14 v2 (tcache alloc+free): Mixed +0.08%, C7-only -0.39% (NEUTRAL)
  • Phase 15 v1 (FIFO→LIFO): Mixed -0.70%, C7-only +0.42% (NEUTRAL)

Root Cause:

  1. Mode check overhead: The entry-point tiny_unified_lifo_enabled() call adds a branch to every alloc and free
  2. Minimal locality delta: LIFO vs FIFO temporal locality difference is small in practice
  3. Existing optimization: FIFO ring implementation already well-optimized
  4. Cache warming: TLS cache pre-warming reduces locality sensitivity

4. Recommendation: Freeze as Research Box

Decision: Freeze Phase 15 v1 as research box (HAKMEM_TINY_UNIFIED_LIFO=0 default, OFF)

Rationale:

  • Neither LIFO nor FIFO shows a significant advantage (both deltas are within ±1%)
  • On Mixed, the mode check overhead outweighs any locality gain; on C7-only, the +0.42% gain stays below the GO threshold
  • Existing FIFO ring is simple and already fast

Next: Explore alternative approaches:

  • Hybrid strategies (per-class mode selection)
  • Batch operations (reduce per-call overhead)
  • Hardware prefetch hints (explicit locality control)