Phase 74-1/74-2: UnifiedCache LOCALIZE optimization (P1 frozen, NEUTRAL -0.87%)
Phase 74-1 (ENV-gated LOCALIZE):
- Result: +0.50% (NEUTRAL)
- Runtime branch overhead caused instructions/branches to increase
- Diagnosed: Branch tax dominates intended optimization

Phase 74-2 (compile-time LOCALIZE):
- Result: -0.87% (NEUTRAL, P1 frozen)
- Removed runtime branch → instructions -0.6%, branches -2.3% ✓
- But cache-misses +86% (register pressure/spill) → net loss
- Conclusion: LOCALIZE itself works, but is fragile to cache effects

Key finding:
- Dependency chain reduction (LOCALIZE) has low ROI due to cache-miss sensitivity
- P1 (LOCALIZE) frozen at default OFF
- Next: Phase 74-3 (P0: FASTAPI) - move branches outside hot loop

Files:
- core/hakmem_build_flags.h: HAKMEM_TINY_UC_LOCALIZE_COMPILED flag
- core/box/tiny_unified_cache_hitpath_env_box.h: ENV gate (frozen)
- core/front/tiny_unified_cache.h: compile-time #if blocks
- docs/analysis/PHASE74_*: Design, instructions, results
- CURRENT_TASK.md: P1 frozen, P0 next instructions

Also includes:
- Phase 69 refill tuning results (archived docs)
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 69 baseline update
- PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md: Route banner docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@ -0,0 +1,140 @@
# Phase 74: UnifiedCache hit-path structural optimization - Results

**Status**: 🔴 P1 (LOCALIZE) FROZEN (NEUTRAL -0.87%)

## Summary

Phase 74 investigated hit-path optimizations for **unified_cache_push/pop**, targeting +1-3% throughput via instruction/branch reduction (the lesson from Phase 73).

**P1 (LOCALIZE)** attempted to shorten dependency chains by loading `head/tail/mask` into locals, but was **frozen at NEUTRAL (-0.87%)** because cache-misses increased.

---

## Phase 74-1: LOCALIZE (ENV-gated, runtime branch)

**Goal**: Load `head/tail/mask` once into locals to avoid reload dependency chains.

**Implementation**:
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
- Runtime branch at entry: `if (tiny_uc_localize_enabled()) { ... }`
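
The ENV-gated variant can be sketched roughly as follows. This is a minimal illustration, not the actual hakmem code: the struct layout, the `uc_env_sketch_t` / `uc_env_sketch_pop` names, and the lazy-cached `getenv` lookup are assumptions; only `tiny_uc_localize_enabled()`, `HAKMEM_TINY_UC_LOCALIZE`, and the `head/tail/mask` fields come from this document. It shows why the gate itself adds a load, a compare, and a branch to every call, even when the localized body is shorter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical ring-buffer layout; only head/tail/mask/slots are named in the doc. */
typedef struct {
    void   **slots;
    unsigned head, tail, mask;
} uc_env_sketch_t;

/* Assumed implementation: read the ENV once, cache the answer (not thread-safe). */
static int g_uc_localize = -1; /* -1 = not yet read */

static inline int tiny_uc_localize_enabled(void) {
    if (g_uc_localize < 0) {
        const char *e = getenv("HAKMEM_TINY_UC_LOCALIZE");
        g_uc_localize = (e != NULL && e[0] == '1') ? 1 : 0;
    }
    return g_uc_localize;
}

static void *uc_env_sketch_pop(uc_env_sketch_t *c) {
    assert(c != NULL && c->slots != NULL);
    if (tiny_uc_localize_enabled()) {          /* runtime gate: paid on every call */
        unsigned head = c->head, tail = c->tail, mask = c->mask;
        if (head == tail) return NULL;         /* empty */
        void *p = c->slots[head];
        c->head = (head + 1) & mask;
        return p;
    }
    /* Original form: fields are re-read through the pointer. */
    if (c->head == c->tail) return NULL;
    void *p = c->slots[c->head];
    c->head = (c->head + 1) & c->mask;
    return p;
}
```

Both arms return the same values; the only difference is how often `c->head`, `c->tail`, and `c->mask` are loaded, plus the cost of the gate in front of them.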

**Results** (10-run A/B):

| Metric | LOCALIZE=0 | LOCALIZE=1 | Delta |
|--------|------------|------------|-------|
| throughput | 57.43 M ops/s | 57.72 M ops/s | **+0.50%** |
| instructions | 4,583M | 4,615M | **+0.7%** |
| branches | 1,276M | 1,281M | **+0.4%** |
| cache-misses | 560K | 461K | -17.7% |

**Diagnosis**: Runtime branch overhead dominated. Instructions/branches **increased** despite LOCALIZE intent.

**Judgment**: **NEUTRAL (+0.50%, ±1.0% threshold)** → Proceed to Phase 74-2 (compile-time gate).

---

## Phase 74-2: LOCALIZE (compile-time gate, no runtime branch)

**Goal**: Eliminate the runtime branch to isolate the performance of LOCALIZE itself.

**Implementation**:
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Compile-time gate: `#if HAKMEM_TINY_UC_LOCALIZE_COMPILED` (no runtime branch)
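
A compile-time gate of this kind might look like the sketch below. The flag name is from this document; the struct layout and the `uc_ct_sketch_t` / `uc_ct_sketch_push` names are hypothetical, and the real `#if` blocks in `core/front/tiny_unified_cache.h` may be shaped differently. The point is that the preprocessor selects exactly one body, so the compiled function contains no gate branch at all.

```c
#include <assert.h>
#include <stddef.h>

#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
#define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0 /* default OFF, matching the doc */
#endif

/* Hypothetical ring-buffer layout (one slot is kept free to tell full from empty). */
typedef struct {
    void   **slots;
    unsigned head, tail, mask;
} uc_ct_sketch_t;

/* Returns 1 on success, 0 when the ring is full. */
static inline int uc_ct_sketch_push(uc_ct_sketch_t *c, void *p) {
    assert(c != NULL && c->slots != NULL);
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    /* LOCALIZE: load each field once into a local. */
    unsigned head = c->head, tail = c->tail, mask = c->mask;
    unsigned next = (tail + 1) & mask;
    if (next == head) return 0;
    c->slots[tail] = p;
    c->tail = next;
    return 1;
#else
    /* Original: fields are read through the pointer at each use. */
    unsigned next = (c->tail + 1) & c->mask;
    if (next == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
#endif
}
```

Building once with `-DHAKMEM_TINY_UC_LOCALIZE_COMPILED=0` and once with `=1` gives two binaries to A/B, which is the trade-off of this design: no per-call cost, but no switching without a rebuild.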

**Results** (10-run A/B via `layout_tax_forensics_box.sh`):

| Metric | Baseline (=0) | Treatment (=1) | Delta |
|--------|---------------|----------------|-------|
| **throughput** | 58.90 M ops/s | 58.39 M ops/s | **-0.87%** |
| cycles | 1,553M | 1,548M | -0.3% |
| **instructions** | 2,748M | 2,733M | **-0.6%** |
| **branches** | 632M | 617M | **-2.3%** |
| **cache-misses** | 707K | 1,316K | **+86%** |
| dTLB-load-misses | 46K | 33K | -28% |
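
For reference, the Delta column is the relative change of treatment against baseline. A small helper (the name `pct_delta` is ours, not part of the forensics script) reproduces the throughput and cache-miss rows from the rounded values in the table:

```c
#include <assert.h>

/* Relative change of `treatment` versus `baseline`, in percent. */
static double pct_delta(double baseline, double treatment) {
    assert(baseline != 0.0);
    return (treatment - baseline) / baseline * 100.0;
}
```

With the table's inputs, `pct_delta(58.90, 58.39)` is about -0.87 and `pct_delta(707.0, 1316.0)` is about +86.1. Rows computed from the rounded "M" counts can differ from the reported delta in the last digit, since the report was presumably derived from unrounded counters.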

**Analysis**:
1. **Runtime branch overhead removed** → instructions/branches improved (-0.6%/-2.3%) ✓
2. **LOCALIZE itself is effective** → dependency chain reduction confirmed ✓
3. **But cache-misses +86%** → register pressure / spill / worse access pattern
4. **Net result: -0.87%** → cache-miss increase dominates instruction/branch savings

**Phase 74-1 vs 74-2 comparison**:
- 74-1 (runtime branch): instructions +0.7%, branches +0.4% → **branch overhead loses**
- 74-2 (compile-time): instructions -0.6%, branches -2.3% → **LOCALIZE itself wins**
- But cache-misses +86% cancels out the gain → **total NEUTRAL**

**Judgment**: **NEUTRAL (-0.87%, below +1.0% GO threshold)** → **P1 FROZEN**
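
The judgment rule used in both phases can be written down as a tiny classifier. This is a sketch under assumptions: the document only states the ±1.0% NEUTRAL band and the +1.0% GO threshold, so the NO-GO label for results at or below -1.0%, the inclusive comparisons, and all names here are ours.

```c
#include <assert.h>

typedef enum { JUDGE_NO_GO, JUDGE_NEUTRAL, JUDGE_GO } judgment_t;

/* Classify a throughput delta (percent) against a symmetric threshold (percent). */
static judgment_t judge(double delta_pct, double threshold_pct) {
    assert(threshold_pct > 0.0);
    if (delta_pct >= threshold_pct)  return JUDGE_GO;     /* e.g. +1.0% or better */
    if (delta_pct <= -threshold_pct) return JUDGE_NO_GO;  /* assumed lower bound  */
    return JUDGE_NEUTRAL;                                 /* inside the band      */
}
```

Under this rule both Phase 74-1 (+0.50%) and Phase 74-2 (-0.87%) fall inside the band and classify as NEUTRAL.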

---

## Root Cause (Phase 74-2)

**Why cache-misses increased (+86%)**:

1. **Register pressure hypothesis**: Loading `head/tail/mask` into locals increases the number of live registers
   - Compiler may spill to stack → more memory traffic
   - `cache->slots[head]` may lose a prefetch opportunity
2. **Access pattern change**: a direct `cache->head` load may benefit from compiler optimizations
   - Storing to a local breaks dependency tracking?
   - Memory alias analysis degraded?

**Evidence**:
- dTLB-load-misses decreased (-28%) → not a TLB/page issue
- L1-dcache-load-misses similar → L1 data layout is not the issue
- cache-misses (+86%) is the PRIMARY BLOCKER

---

## Lessons Learned

1. **Runtime branch tax is real**: Phase 74-1 showed a +0.7% instruction increase from the ENV gate
2. **LOCALIZE itself works**: Phase 74-2 confirmed -2.3% branches once the runtime branch was removed
3. **Register pressure matters**: Even when instruction count drops, cache behavior can dominate
4. **This optimization path has low ROI**: Dependency chain reduction is fragile to cache effects

**Conclusion**: P1 (LOCALIZE) frozen. Move to **P0 (FASTAPI)** (different approach: move branches outside hot loop).

---

## P1 (LOCALIZE) - Frozen State

**Files**:
- `core/hakmem_build_flags.h`: `HAKMEM_TINY_UC_LOCALIZE_COMPILED` (default 0)
- `core/box/tiny_unified_cache_hitpath_env_box.h`: ENV gate (unused after 74-2)
- `core/front/tiny_unified_cache.h`: compile-time `#if` blocks

**Default behavior**: LOCALIZE=0 (original implementation)

**Rollback**: No action needed (default OFF)

---

## Next Steps

**Phase 74-3: P0 (FASTAPI)**

**Goal**: Move the `unified_cache_enabled()` / `lazy-init` / `stats` checks **outside** the hot loop.

**Approach**:
- Create `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
- Assume "valid/enabled/no-stats" is guaranteed at the caller side
- Fail-fast: fall back to the slow path on unexpected state
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
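
One possible shape for this split is sketched below. Phase 74-3 is not implemented in this commit, so everything except the `unified_cache_push_fast` name is hypothetical: the struct, the `enabled` field, the `uc_fast_sketch_*` helpers, and the signature are assumptions made for illustration. The idea shown is that the caller validates state once, then the loop body calls a fast function that contains only the ring-buffer work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache state; `enabled` stands in for the enabled/lazy-init checks. */
typedef struct {
    void   **slots;
    unsigned head, tail, mask;
    int      enabled;
} uc_fast_sketch_t;

/* Slow path: re-validates state on every call (stand-in for the existing API). */
static int uc_fast_sketch_push_slow(uc_fast_sketch_t *c, void *p) {
    if (c == NULL || !c->enabled || c->slots == NULL) return 0;
    unsigned next = (c->tail + 1) & c->mask;
    if (next == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}

/* Fast path: the caller guarantees valid/enabled/no-stats, so no checks here. */
static inline int unified_cache_push_fast(uc_fast_sketch_t *c, void *p) {
    unsigned next = (c->tail + 1) & c->mask;
    if (next == c->head) return 0; /* full */
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}

/* Pushes up to n items; returns how many were accepted. */
static size_t uc_fast_sketch_push_batch(uc_fast_sketch_t *c, void **items, size_t n) {
    size_t i = 0;
    if (c != NULL && c->enabled && c->slots != NULL) {
        /* State checked once, outside the hot loop. */
        for (; i < n; i++)
            if (!unified_cache_push_fast(c, items[i])) break;
        return i;
    }
    /* Unexpected state: fall back to the fully checked slow path. */
    for (; i < n; i++)
        if (!uc_fast_sketch_push_slow(c, items[i])) break;
    return i;
}
```

Whether the real hot loop has a natural place to hoist the checks is exactly what Phase 74-3 would need to establish; the batch function here only illustrates the hoisting pattern.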

**Expected benefit**: +1-2% via branch reduction (different axis than P1)

**GO threshold**: +1.0% (strict, structural change)

---

## Artifacts

- **Design**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- **Instructions**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- **Results**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md` (this file)
- **Forensics output**: `./results/layout_tax_forensics/` (Phase 74-2 perf data)

---

## Timeline

- Phase 74-1: ENV-gated LOCALIZE → **NEUTRAL (+0.50%)**
- Phase 74-2: Compile-time LOCALIZE → **NEUTRAL (-0.87%)** → **P1 FROZEN**
- Phase 74-3: P0 (FASTAPI) → (next)