# Phase 74: UnifiedCache hit-path structural optimization - Results

**Status**: 🔴 P1 (LOCALIZE) FROZEN (NEUTRAL -0.87%)

## Summary

Phase 74 investigated **unified_cache_push/pop** hit-path optimizations, targeting +1-3% via instruction/branch reduction (the lesson from Phase 73).

**P1 (LOCALIZE)** attempted to reduce dependency chains by loading `head/tail/mask` into locals, but was **frozen at NEUTRAL (-0.87%)** due to a cache-miss increase.

---

## Phase 74-1: LOCALIZE (ENV-gated, runtime branch)

**Goal**: Load `head/tail/mask` once into locals to avoid reload dependency chains.

**Implementation**:
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
- Runtime branch at entry: `if (tiny_uc_localize_enabled()) { ... }`

**Results** (10-run A/B):

| Metric | LOCALIZE=0 | LOCALIZE=1 | Delta |
|--------|------------|------------|-------|
| throughput | 57.43 M ops/s | 57.72 M ops/s | **+0.50%** |
| instructions | 4,583M | 4,615M | **+0.7%** |
| branches | 1,276M | 1,281M | **+0.4%** |
| cache-misses | 560K | 461K | -17.7% |

**Diagnosis**: Runtime branch overhead dominated. Instructions and branches **increased**, contrary to the intent of LOCALIZE.

**Judgment**: **NEUTRAL (+0.50%, within the ±1.0% threshold)** → Proceed to Phase 74-2 (compile-time gate).

---

## Phase 74-2: LOCALIZE (compile-time gate, no runtime branch)

**Goal**: Eliminate the runtime branch to isolate the performance of LOCALIZE itself.

**Implementation**:
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Compile-time gate: `#if HAKMEM_TINY_UC_LOCALIZE_COMPILED` (no runtime branch)

**Results** (10-run A/B via `layout_tax_forensics_box.sh`):

| Metric | Baseline (=0) | Treatment (=1) | Delta |
|--------|---------------|----------------|-------|
| **throughput** | 58.90 M ops/s | 58.39 M ops/s | **-0.87%** |
| cycles | 1,553M | 1,548M | -0.3% |
| **instructions** | 2,748M | 2,733M | **-0.6%** |
| **branches** | 632M | 617M | **-2.3%** |
| **cache-misses** | 707K | 1,316K | **+86%** |
| dTLB-load-misses | 46K | 33K | -28% |

**Analysis**:

1. **Runtime branch overhead removed** → instructions/branches improved (-0.6%/-2.3%) ✓
2. **LOCALIZE itself is effective** → dependency chain reduction confirmed ✓
3. **But cache-misses +86%** → suspected register pressure / spill / worse access pattern
4. **Net result: -0.87%** → the cache-miss increase dominates the instruction/branch savings

**Phase 74-1 vs 74-2 comparison**:
- 74-1 (runtime branch): instructions +0.7%, branches +0.4% → **loses to branch overhead**
- 74-2 (compile-time): instructions -0.6%, branches -2.3% → **LOCALIZE itself wins**
- But cache-misses +86% cancels the gain → **total NEUTRAL**

**Judgment**: **NEUTRAL (-0.87%, below +1.0% GO threshold)** → **P1 FROZEN**

---

## Root Cause (Phase 74-2)

**Why cache-misses increased (+86%)**:

1. **Register pressure hypothesis**: Loading `head/tail/mask` into locals increases the number of live registers
   - Compiler may spill to stack → more memory traffic
   - `cache->slots[head]` may lose a prefetch opportunity
2. **Access pattern change**: a direct `cache->head` load may benefit from compiler optimizations
   - Storing to a local breaks dependency tracking?
   - Memory alias analysis degraded?

**Evidence**:
- dTLB-misses decreased (-28%) → data layout not the issue
- L1-dcache-load-misses similar → not a TLB/page issue
- cache-misses (+86%) is the PRIMARY BLOCKER

---

## Lessons Learned

1. **Runtime branch tax is real**: Phase 74-1 showed a +0.7% instruction increase from the ENV gate
2. **LOCALIZE itself works**: Phase 74-2 confirmed -2.3% branches once the runtime branch was removed
3. **Register pressure matters**: Even when instruction count drops, cache behavior can dominate
4. **This optimization path has low ROI**: Dependency chain reduction is fragile to cache effects

**Conclusion**: P1 (LOCALIZE) frozen. Move to **P0 (FASTAPI)** (different approach: move branches outside the hot loop).

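
To make the Phase 74-2 comparison concrete, here is a minimal sketch of what a compile-time-gated LOCALIZE variant could look like. It is not the actual `core/front/tiny_unified_cache.h` code: the `uc_ring_t` layout and the `uc_ring_init` / `uc_push` / `uc_pop` names are hypothetical and exist only for illustration. Only the `HAKMEM_TINY_UC_LOCALIZE_COMPILED` flag name comes from this phase.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
#define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0  /* default OFF, matching the frozen state */
#endif

/* Hypothetical ring layout; the real unified cache struct may differ. */
typedef struct {
    void   **slots;
    uint16_t head;  /* next slot to pop */
    uint16_t tail;  /* next slot to push */
    uint16_t mask;  /* capacity - 1 (capacity is a power of two) */
} uc_ring_t;

static inline void uc_ring_init(uc_ring_t *c, void **slots, unsigned capacity) {
    /* Masked indexing only works for power-of-two capacities. */
    assert(capacity >= 2 && (capacity & (capacity - 1)) == 0);
    c->slots = slots;
    c->head = 0;
    c->tail = 0;
    c->mask = (uint16_t)(capacity - 1);
}

/* Returns 1 on success, 0 if the ring is full (one slot is kept empty). */
static inline int uc_push(uc_ring_t *c, void *p) {
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    /* LOCALIZE: read each field once into a local, then work on the locals. */
    uint16_t head = c->head, tail = c->tail, mask = c->mask;
    uint16_t next = (uint16_t)((tail + 1) & mask);
    if (next == head) return 0;
    c->slots[tail] = p;
    c->tail = next;
    return 1;
#else
    /* Baseline: access the fields through the struct pointer each time. */
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
#endif
}

/* Returns the oldest pointer, or NULL if the ring is empty. */
static inline void *uc_pop(uc_ring_t *c) {
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    uint16_t head = c->head, tail = c->tail, mask = c->mask;
    if (head == tail) return NULL;
    void *p = c->slots[head];
    c->head = (uint16_t)((head + 1) & mask);
    return p;
#else
    if (c->head == c->tail) return NULL;
    void *p = c->slots[c->head];
    c->head = (uint16_t)((c->head + 1) & c->mask);
    return p;
#endif
}
```

Building with `-DHAKMEM_TINY_UC_LOCALIZE_COMPILED=1` selects the localized branch. Both variants behave identically, so in this sketch any measured difference would come from code generation (register allocation, spills, load scheduling) and not from the logic.
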

---

## P1 (LOCALIZE) - Frozen State

**Files**:
- `core/hakmem_build_flags.h`: `HAKMEM_TINY_UC_LOCALIZE_COMPILED` (default 0)
- `core/box/tiny_unified_cache_hitpath_env_box.h`: ENV gate (unused after 74-2)
- `core/front/tiny_unified_cache.h`: compile-time `#if` blocks

**Default behavior**: LOCALIZE=0 (original implementation)

**Rollback**: No action needed (default OFF)

---

## Next Steps

**Phase 74-3: P0 (FASTAPI)**

**Goal**: Move `unified_cache_enabled()` / `lazy-init` / `stats` checks **outside** the hot loop.

**Approach**:
- Create `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
- Assume: "valid/enabled/no-stats" at caller side
- Fail-fast: fallback to slow path on unexpected state
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)

**Expected benefit**: +1-2% via branch reduction (different axis than P1)

**GO threshold**: +1.0% (strict, structural change)

---

## Artifacts

- **Design**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- **Instructions**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- **Results**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md` (this file)
- **Forensics output**: `./results/layout_tax_forensics/` (Phase 74-2 perf data)

---

## Timeline

- Phase 74-1: ENV-gated LOCALIZE → **NEUTRAL (+0.50%)**
- Phase 74-2: Compile-time LOCALIZE → **NEUTRAL (-0.87%)** → **P1 FROZEN**
- Phase 74-3: P0 (FASTAPI) → (next)
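
As a rough illustration of the FASTAPI idea planned for Phase 74-3 (check "enabled / initialized / no-stats" once at the caller, then run a check-free push in the loop), the sketch below may help. Everything in it is an assumption made for illustration: the `fa_cache_t` layout, the `g_fa_enabled` / `g_fa_stats` globals standing in for the real enable and stats checks, and the `fa_push` / `fa_push_fast` / `fa_push_batch` names. It is not the planned `unified_cache_push_fast()` implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache layout, for illustration only. */
typedef struct {
    void   **slots;
    uint16_t head, tail, mask;  /* mask = capacity - 1, capacity is a power of two */
    bool     initialized;       /* stand-in for the lazy-init state */
    uint64_t stat_push;         /* stand-in for a stats counter */
} fa_cache_t;

static bool g_fa_enabled = true;   /* stand-in for unified_cache_enabled() */
static bool g_fa_stats   = false;  /* stand-in for a stats toggle */

/* General path: every call pays for the enabled/init/stats checks. */
static int fa_push(fa_cache_t *c, void *p) {
    if (!g_fa_enabled) return 0;
    if (!c->initialized) return 0;  /* real code would lazy-init here */
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;  /* full */
    c->slots[c->tail] = p;
    c->tail = next;
    if (g_fa_stats) c->stat_push++;
    return 1;
}

/* Fast path: the caller has already established enabled/initialized/no-stats.
 * Returning 0 tells the caller to fall back to the general path. */
static inline int fa_push_fast(fa_cache_t *c, void *p) {
    assert(c->initialized);         /* debug-only check of the caller's contract */
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}

/* Caller side: hoist the checks out of the hot loop.
 * Returns how many pointers were pushed. */
static size_t fa_push_batch(fa_cache_t *c, void **ptrs, size_t n) {
    size_t i = 0;
    if (g_fa_enabled && c->initialized && !g_fa_stats) {
        for (; i < n; i++)
            if (!fa_push_fast(c, ptrs[i])) break;
        return i;
    }
    for (; i < n; i++)
        if (!fa_push(c, ptrs[i])) break;
    return i;
}
```

The design point is that the three state checks run once per batch instead of once per push. Whether that translates into the expected +1-2% is exactly what the Phase 74-3 A/B run would need to show.
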