# Phase 74: UnifiedCache hit-path structural optimization - Results

**Status**: 🔴 P1 (LOCALIZE) FROZEN (NEUTRAL -0.87%)

## Summary

Phase 74 investigated hit-path optimizations for **unified_cache_push/pop**, aiming for +1-3% via instruction/branch reduction (the lesson from Phase 73).

**P1 (LOCALIZE)** attempted to shorten dependency chains by loading `head/tail/mask` into locals, but was **frozen at NEUTRAL (-0.87%)** because cache-misses increased.

---

## Phase 74-1: LOCALIZE (ENV-gated, runtime branch)

**Goal**: Load `head/tail/mask` once into locals to avoid reload dependency chains.

**Implementation**:
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
- Runtime branch at entry: `if (tiny_uc_localize_enabled()) { ... }`

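A minimal sketch of what the ENV-gated variant might look like, assuming a power-of-two ring buffer. The struct `uc_ring_sketch_t`, its field widths, and `uc_sketch_pop()` are hypothetical stand-ins for illustration; only `HAKMEM_TINY_UC_LOCALIZE`, `tiny_uc_localize_enabled()`, and the `head/tail/mask` names come from this document, and the body of `tiny_uc_localize_enabled()` shown here (a cached `getenv`) is a guess.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical ring layout; the real struct is in core/front/tiny_unified_cache.h. */
typedef struct {
    void   **slots;
    uint16_t head;   /* next slot to pop */
    uint16_t tail;   /* next slot to push */
    uint16_t mask;   /* capacity - 1 (capacity assumed to be a power of two) */
} uc_ring_sketch_t;

/* ENV gate, read once and cached (assumed behavior; default 0). */
static int tiny_uc_localize_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        const char *e = getenv("HAKMEM_TINY_UC_LOCALIZE");
        cached = (e != NULL && e[0] == '1') ? 1 : 0;
    }
    return cached;
}

/* Pop one pointer, or NULL when the ring is empty. */
static void *uc_sketch_pop(uc_ring_sketch_t *c) {
    if (tiny_uc_localize_enabled()) {   /* extra runtime branch on every call */
        /* LOCALIZE: load each field once, then work on the locals. */
        uint16_t head = c->head, tail = c->tail, mask = c->mask;
        if (head == tail) return NULL;
        void *p = c->slots[head];
        c->head = (uint16_t)((head + 1) & mask);
        return p;
    }
    /* Original shape: fields are read through the pointer each time. */
    if (c->head == c->tail) return NULL;
    void *p = c->slots[c->head];
    c->head = (uint16_t)((c->head + 1) & c->mask);
    return p;
}
```

Both paths return the same values; the gate only selects which code shape runs, which is why the gate's own branch and call show up in the instruction and branch counts.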
**Results** (10-run A/B):

| Metric | LOCALIZE=0 | LOCALIZE=1 | Delta |
|--------|------------|------------|-------|
| throughput | 57.43 M ops/s | 57.72 M ops/s | **+0.50%** |
| instructions | 4,583M | 4,615M | **+0.7%** |
| branches | 1,276M | 1,281M | **+0.4%** |
| cache-misses | 560K | 461K | -17.7% |

**Diagnosis**: The runtime branch overhead dominated. Instructions and branches **increased**, even though LOCALIZE was intended to reduce them.

**Judgment**: **NEUTRAL (+0.50%, within the ±1.0% threshold)** → Proceed to Phase 74-2 (compile-time gate).

---

## Phase 74-2: LOCALIZE (compile-time gate, no runtime branch)

**Goal**: Eliminate the runtime branch to isolate the performance of LOCALIZE itself.

**Implementation**:
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Compile-time gate: `#if HAKMEM_TINY_UC_LOCALIZE_COMPILED` (no runtime branch)

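A minimal sketch of the compile-time gate, under the same assumptions as a generic power-of-two ring buffer. The struct `uc_ring_ct_sketch_t` and the function `uc_sketch_push()` are hypothetical; only the `HAKMEM_TINY_UC_LOCALIZE_COMPILED` flag and the `#if` form come from this document.

```c
#include <stddef.h>
#include <stdint.h>

/* Build flag named in this document; default 0 keeps the original code path. */
#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
#define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0
#endif

/* Hypothetical ring layout, for illustration only. */
typedef struct {
    void   **slots;
    uint16_t head;
    uint16_t tail;
    uint16_t mask;   /* capacity - 1 (capacity assumed to be a power of two) */
} uc_ring_ct_sketch_t;

/* Returns 1 on success, 0 when the ring is full (one slot is kept empty). */
static int uc_sketch_push(uc_ring_ct_sketch_t *c, void *p) {
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    /* LOCALIZE: one load per field, no runtime gate. */
    uint16_t head = c->head, tail = c->tail, mask = c->mask;
    uint16_t next = (uint16_t)((tail + 1) & mask);
    if (next == head) return 0;
    c->slots[tail] = p;
    c->tail = next;
    return 1;
#else
    /* Original shape: fields are read through the pointer. */
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
#endif
}
```

Because the preprocessor discards the unselected body, the baseline and treatment binaries differ only in the code shape under test, at the cost of needing two builds for each A/B run.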
**Results** (10-run A/B via `layout_tax_forensics_box.sh`):

| Metric | Baseline (=0) | Treatment (=1) | Delta |
|--------|---------------|----------------|-------|
| **throughput** | 58.90 M ops/s | 58.39 M ops/s | **-0.87%** |
| cycles | 1,553M | 1,548M | -0.3% |
| **instructions** | 2,748M | 2,733M | **-0.6%** |
| **branches** | 632M | 617M | **-2.3%** |
| **cache-misses** | 707K | 1,316K | **+86%** |
| dTLB-load-misses | 46K | 33K | -28% |

**Analysis**:
1. **Runtime branch overhead removed** → instructions/branches improved (-0.6%/-2.3%) ✓
2. **LOCALIZE itself is effective** → the dependency-chain reduction shows up as fewer instructions and branches ✓
3. **But cache-misses +86%** → suspected register pressure / spills / a worse access pattern
4. **Net result: -0.87%** → the cache-miss increase outweighs the instruction/branch savings

**Phase 74-1 vs 74-2 comparison**:
- 74-1 (runtime branch): instructions +0.7%, branches +0.4% → **branch overhead outweighs the gain**
- 74-2 (compile-time): instructions -0.6%, branches -2.3% → **LOCALIZE itself wins**
- But cache-misses +86% cancels the gain → **total NEUTRAL**

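The throughput deltas and the NEUTRAL classification follow from simple arithmetic, sketched below. The function names are hypothetical, and the `NO-GO` label for a regression beyond the threshold is an assumption: this document only names GO (at or above +1.0%) and NEUTRAL (within ±1.0%).

```c
/* Relative change of treatment vs. baseline, in percent. */
static double delta_pct(double baseline, double treatment) {
    return (treatment - baseline) / baseline * 100.0;
}

/* Classify a throughput delta against a symmetric threshold (in percent).
 * "NO-GO" is an assumed label; the document itself only uses GO and NEUTRAL. */
static const char *judge(double delta, double threshold) {
    if (delta >= threshold)  return "GO";
    if (delta <= -threshold) return "NO-GO";
    return "NEUTRAL";
}

/* Phase 74-1: delta_pct(57.43, 57.72) is about +0.50 -> NEUTRAL at 1.0
 * Phase 74-2: delta_pct(58.90, 58.39) is about -0.87 -> NEUTRAL at 1.0 */
```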
**Judgment**: **NEUTRAL (-0.87%, below +1.0% GO threshold)** → **P1 FROZEN**

---

## Root Cause (Phase 74-2)

**Why cache-misses increased (+86%)**:

1. **Register pressure hypothesis**: Loading `head/tail/mask` into locals increases the number of live registers
   - Compiler may spill to stack → more memory traffic
   - `cache->slots[head]` may lose a prefetch opportunity
2. **Access pattern change**: The direct `cache->head` load may benefit from compiler optimizations
   - Storing to a local breaks dependency tracking?
   - Memory alias analysis degraded?

**Evidence**:
- dTLB-load-misses decreased (-28%) → not a TLB/page issue
- L1-dcache-load-misses similar → data layout is not the issue
- cache-misses (+86%) is the PRIMARY BLOCKER

---

## Lessons Learned

1. **Runtime branch tax is real**: Phase 74-1 showed a +0.7% instruction increase from the ENV gate
2. **LOCALIZE itself works**: Phase 74-2 confirmed -2.3% branches once the runtime branch was removed
3. **Register pressure matters**: Even when the instruction count drops, cache behavior can dominate
4. **This optimization path has low ROI**: Dependency chain reduction is fragile to cache effects

**Conclusion**: P1 (LOCALIZE) frozen. Move to **P0 (FASTAPI)** (different approach: move branches outside the hot loop).

---

## P1 (LOCALIZE) - Frozen State

**Files**:
- `core/hakmem_build_flags.h`: `HAKMEM_TINY_UC_LOCALIZE_COMPILED` (default 0)
- `core/box/tiny_unified_cache_hitpath_env_box.h`: ENV gate (unused after 74-2)
- `core/front/tiny_unified_cache.h`: compile-time `#if` blocks

**Default behavior**: LOCALIZE=0 (original implementation)

**Rollback**: No action needed (default OFF)

---

## Next Steps

**Phase 74-3: P0 (FASTAPI)**

**Goal**: Move `unified_cache_enabled()` / `lazy-init` / `stats` checks **outside** the hot loop.

**Approach**:
- Create `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
- Assume: "valid/enabled/no-stats" at caller side
- Fail-fast: fallback to slow path on unexpected state
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)

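One possible shape for the FASTAPI split, as a rough sketch only: Phase 74-3 is not implemented yet, so everything below is hypothetical, including the struct, its flag fields, and every function name carrying a `_sketch` suffix. It illustrates the idea of checking "valid/enabled/no-stats" once at the caller and falling back to the fully checked path otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache state; the flag fields stand in for
 * unified_cache_enabled(), lazy-init, and stats checks. */
typedef struct {
    void   **slots;
    uint16_t head, tail, mask;
    int      enabled;
    int      initialized;
    int      stats_on;
    uint64_t stat_pushes;
} uc_fastapi_sketch_t;

/* Slow path: every check stays inside the call. */
static int uc_push_slow_sketch(uc_fastapi_sketch_t *c, void *p) {
    if (!c->enabled || !c->initialized) return 0;
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;            /* full */
    c->slots[c->tail] = p;
    c->tail = next;
    if (c->stats_on) c->stat_pushes++;
    return 1;
}

/* Fast path: the caller guarantees enabled/initialized/no-stats. */
static inline int uc_push_fast_sketch(uc_fastapi_sketch_t *c, void *p) {
    uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;            /* full */
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}

/* Caller side: state checks are hoisted out of the hot loop.
 * Returns how many pointers were pushed. */
static size_t uc_push_many_sketch(uc_fastapi_sketch_t *c, void **ptrs, size_t n) {
    size_t pushed = 0;
    if (c->enabled && c->initialized && !c->stats_on) {
        for (; pushed < n; pushed++)
            if (!uc_push_fast_sketch(c, ptrs[pushed])) break;
        return pushed;
    }
    /* Unexpected state: fall back to the fully checked path. */
    for (; pushed < n; pushed++)
        if (!uc_push_slow_sketch(c, ptrs[pushed])) break;
    return pushed;
}
```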
**Expected benefit**: +1-2% via branch reduction (different axis than P1)

**GO threshold**: +1.0% (strict, structural change)

---

## Artifacts

- **Design**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- **Instructions**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- **Results**: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md` (this file)
- **Forensics output**: `./results/layout_tax_forensics/` (Phase 74-2 perf data)

---

## Timeline

- Phase 74-1: ENV-gated LOCALIZE → **NEUTRAL (+0.50%)**
- Phase 74-2: Compile-time LOCALIZE → **NEUTRAL (-0.87%)** → **P1 FROZEN**
- Phase 74-3: P0 (FASTAPI) → (next)