# Phase 74: UnifiedCache hit-path structural optimization - Results
**Status**: 🔴 P1 (LOCALIZE) FROZEN (NEUTRAL -0.87%)
## Summary
Phase 74 investigated **unified_cache_push/pop** hit-path optimizations to achieve +1-3% via instruction/branch reduction (the lesson from Phase 73).
**P1 (LOCALIZE)** attempted to reduce dependency chains by loading `head/tail/mask` into locals, but was **frozen at NEUTRAL (-0.87%)** due to cache-miss increase.
---
## Phase 74-1: LOCALIZE (ENV-gated, runtime branch)
**Goal**: Load `head/tail/mask` once into locals to avoid reload dependency chains.
**Implementation**:
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
- Runtime branch at entry: `if (tiny_uc_localize_enabled()) { ... }`
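
A minimal sketch of what an ENV-gated runtime branch of this kind might look like. Only the ENV name and `tiny_uc_localize_enabled()` come from this document; the `uc_sketch_t` layout, the lazy-cached flag, and `uc_sketch_push()` are assumptions for illustration, not the actual hakmem code. Even with the `getenv()` result cached, the flag test itself runs on every call.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical ring layout; the real unified cache struct may differ. */
typedef struct {
    void   **slots;
    uint16_t head, tail, mask;
} uc_sketch_t;

static int g_uc_localize = -1;  /* -1 = ENV not read yet (assumed caching scheme) */

static inline int tiny_uc_localize_enabled(void) {
    if (g_uc_localize < 0) {
        const char *e = getenv("HAKMEM_TINY_UC_LOCALIZE");
        g_uc_localize = (e != NULL && e[0] == '1') ? 1 : 0;
    }
    return g_uc_localize;
}

/* Returns 1 on success, 0 when the ring is full. */
static int uc_sketch_push(uc_sketch_t *c, void *p) {
    if (tiny_uc_localize_enabled()) {           /* runtime branch on every call */
        const uint16_t tail = c->tail, mask = c->mask;
        const uint16_t next = (uint16_t)((tail + 1) & mask);
        if (next == c->head) return 0;
        c->slots[tail] = p;
        c->tail = next;
        return 1;
    }
    /* Original path: fields are read through the pointer each time. */
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}
```

Both arms behave identically; the gate only selects how the fields are loaded, so any extra instructions in this variant come from the gate itself.
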
**Results** (10-run A/B):
| Metric | LOCALIZE=0 | LOCALIZE=1 | Delta |
|--------|------------|------------|-------|
| throughput | 57.43 M ops/s | 57.72 M ops/s | **+0.50%** |
| instructions | 4,583M | 4,615M | **+0.7%** |
| branches | 1,276M | 1,281M | **+0.4%** |
| cache-misses | 560K | 461K | -17.7% |
**Diagnosis**: Runtime branch overhead dominated. Instructions/branches **increased** despite LOCALIZE intent.
**Judgment**: **NEUTRAL (+0.50%, within the ±1.0% threshold)** → Proceed to Phase 74-2 (compile-time gate).
---
## Phase 74-2: LOCALIZE (compile-time gate, no runtime branch)
**Goal**: Eliminate the runtime branch to isolate the performance of LOCALIZE itself.
**Implementation**:
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Compile-time gate: `#if HAKMEM_TINY_UC_LOCALIZE_COMPILED` (no runtime branch)
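
The compile-time variant can be sketched as follows. The flag name and the `head/tail/mask` fields are taken from this document; the `uc_ring_t` struct and `uc_ring_pop()` are hypothetical stand-ins, so treat this as an illustration of the technique rather than the code in `core/front/tiny_unified_cache.h`.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
#define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0   /* default 0, as stated above */
#endif

/* Hypothetical ring layout, for illustration only. */
typedef struct {
    void   **slots;
    uint16_t head, tail, mask;
} uc_ring_t;

/* Returns the popped pointer, or NULL when the ring is empty. */
static inline void *uc_ring_pop(uc_ring_t *c) {
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    /* LOCALIZE: load each field once into a local, then work on the locals. */
    const uint16_t head = c->head, tail = c->tail, mask = c->mask;
    if (head == tail) return NULL;
    void *p = c->slots[head];
    c->head = (uint16_t)((head + 1) & mask);
    return p;
#else
    /* Baseline: fields are accessed through the pointer at each use. */
    if (c->head == c->tail) return NULL;
    void *p = c->slots[c->head];
    c->head = (uint16_t)((c->head + 1) & c->mask);
    return p;
#endif
}
```

Because the preprocessor selects exactly one arm, the compiled hit path contains no gate check at all, which is what lets the A/B run measure LOCALIZE in isolation.
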
**Results** (10-run A/B via `layout_tax_forensics_box.sh`):
| Metric | Baseline (=0) | Treatment (=1) | Delta |
|--------|---------------|----------------|-------|
| **throughput** | 58.90 M ops/s | 58.39 M ops/s | **-0.87%** |
| cycles | 1,553M | 1,548M | -0.3% |
| **instructions** | 2,748M | 2,733M | **-0.6%** |
| **branches** | 632M | 617M | **-2.3%** |
| **cache-misses** | 707K | 1,316K | **+86%** |
| dTLB-load-misses | 46K | 33K | -28% |
**Analysis**:
1. **Runtime branch overhead removed** → instructions/branches improved (-0.6%/-2.3%) ✓
2. **LOCALIZE itself is effective** → dependency chain reduction confirmed ✓
3. **But cache-misses +86%** → register pressure / spill / worse access pattern
4. **Net result: -0.87%** → cache-miss increase dominates instruction/branch savings
**Phase 74-1 vs 74-2 comparison**:
- 74-1 (runtime branch): instructions +0.7%, branches +0.4% → **runtime branch overhead outweighs the gain**
- 74-2 (compile-time): instructions -0.6%, branches -2.3% → **LOCALIZE itself wins**
- But the +86% cache-miss increase cancels the gain → **total NEUTRAL**
**Judgment**: **NEUTRAL (-0.87%, below +1.0% GO threshold)** → **P1 FROZEN**
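
The judgment arithmetic can be reproduced with a small helper. This is a sketch under assumptions: the +1.0% GO threshold and the ±1.0% NEUTRAL band come from this document, while the `JUDGE_NO_GO` label for results at or below -1.0% and all function names are hypothetical.

```c
/* Relative change of treatment vs. baseline, in percent. */
static double delta_pct(double baseline, double treatment) {
    return (treatment - baseline) / baseline * 100.0;
}

typedef enum {
    JUDGE_NO_GO   = -1,  /* hypothetical label for <= -1.0% */
    JUDGE_NEUTRAL = 0,   /* inside the +/-1.0% band */
    JUDGE_GO      = 1    /* >= +1.0% */
} judgment_t;

static judgment_t judge(double delta) {
    if (delta >= 1.0) return JUDGE_GO;
    if (delta > -1.0) return JUDGE_NEUTRAL;
    return JUDGE_NO_GO;
}
```

For example, `delta_pct(58.90, 58.39)` is about -0.87 and `delta_pct(57.43, 57.72)` is about +0.50, so both phases land inside the NEUTRAL band.
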
---
## Root Cause (Phase 74-2)
**Why cache-misses increased (+86%)**:
1. **Register pressure hypothesis**: Loading `head/tail/mask` into locals increases live registers
   - Compiler may spill to stack → more memory traffic
   - `cache->slots[head]` may lose prefetch opportunity
2. **Access pattern change**: `cache->head` direct load may benefit from compiler optimizations
   - Storing to local breaks dependency tracking?
   - Memory alias analysis degraded?
**Evidence**:
- dTLB-load-misses decreased (-28%) → not a TLB/page issue
- L1-dcache-load-misses similar → data layout not the issue
- cache-misses (+86%) is the PRIMARY BLOCKER
---
## Lessons Learned
1. **Runtime branch tax is real**: Phase 74-1 showed +0.7% instruction increase from ENV gate
2. **LOCALIZE itself works**: Phase 74-2 confirmed -2.3% branches once the runtime branch was removed
3. **Register pressure matters**: Even when instruction count drops, cache behavior can dominate
4. **This optimization path has low ROI**: Dependency chain reduction is fragile to cache effects
**Conclusion**: P1 (LOCALIZE) frozen. Move to **P0 (FASTAPI)** (different approach: move branches outside hot loop).
---
## P1 (LOCALIZE) - Frozen State
**Files**:
- `core/hakmem_build_flags.h`: `HAKMEM_TINY_UC_LOCALIZE_COMPILED` (default 0)
- `core/box/tiny_unified_cache_hitpath_env_box.h`: ENV gate (unused after 74-2)
- `core/front/tiny_unified_cache.h`: compile-time `#if` blocks
**Default behavior**: LOCALIZE=0 (original implementation)
**Rollback**: No action needed (default OFF)
---
## Next Steps
**Phase 74-3: P0 (FASTAPI)**
**Goal**: Move `unified_cache_enabled()` / `lazy-init` / `stats` checks **outside** hot loop.
**Approach**:
- Create `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
- Assume: "valid/enabled/no-stats" at caller side
- Fail-fast: fallback to slow path on unexpected state
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
**Expected benefit**: +1-2% via branch reduction (different axis than P1)
**GO threshold**: +1.0% (strict, structural change)
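
A rough sketch of the planned FASTAPI split, assuming a simple ring layout. The real `unified_cache_push_fast()` does not exist yet, so every name and signature here (`uc_fast_ring_t`, `uc_push_fast_sketch()`, `uc_push_slow_sketch()`, `uc_push_batch_sketch()`, the `enabled` field standing in for `unified_cache_enabled()`) is hypothetical. The point is only that the caller validates state once, outside the loop, and the fast path carries no such checks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ring; `enabled` stands in for unified_cache_enabled(). */
typedef struct {
    void   **slots;
    uint16_t head, tail, mask;
    int      enabled;
} uc_fast_ring_t;

/* Slow path: keeps the enabled/initialized checks on every call. */
static int uc_push_slow_sketch(uc_fast_ring_t *c, void *p) {
    if (!c->enabled || c->slots == NULL) return 0;
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}

/* Fast path: assumes the caller already verified "valid/enabled". */
static inline int uc_push_fast_sketch(uc_fast_ring_t *c, void *p) {
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;   /* full: let the caller fall back */
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}

/* Caller: checks hoisted out of the hot loop. Returns items pushed. */
static size_t uc_push_batch_sketch(uc_fast_ring_t *c, void **items, size_t n) {
    size_t pushed = 0;
    if (c->enabled && c->slots != NULL) {
        while (pushed < n && uc_push_fast_sketch(c, items[pushed])) pushed++;
    }
    /* Fallback: the slow path handles the remainder or an unexpected state. */
    while (pushed < n && uc_push_slow_sketch(c, items[pushed])) pushed++;
    return pushed;
}
```

In this sketch the hot loop contains only the full-ring test, while the enabled and initialization checks execute once per batch instead of once per push.
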
---
## Artifacts
- **Design**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- **Instructions**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- **Results**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md` (this file)
- **Forensics output**: `./results/layout_tax_forensics/` (Phase 74-2 perf data)
---
## Timeline
- Phase 74-1: ENV-gated LOCALIZE → **NEUTRAL (+0.50%)**
- Phase 74-2: Compile-time LOCALIZE → **NEUTRAL (-0.87%)** → **P1 FROZEN**
- Phase 74-3: P0 (FASTAPI) → (next)