Phase 74: UnifiedCache hit-path structural optimization - Results
Status: 🔴 P1 (LOCALIZE) FROZEN (NEUTRAL -0.87%)
Summary
Phase 74 investigated hit-path optimizations for unified_cache_push/pop, targeting +1-3% throughput via instruction/branch reduction (the lesson from Phase 73).
P1 (LOCALIZE) attempted to reduce dependency chains by loading head/tail/mask into locals, but was frozen at NEUTRAL (-0.87%) because cache-misses increased.
Phase 74-1: LOCALIZE (ENV-gated, runtime branch)
Goal: Load head/tail/mask once into locals to avoid reload dependency chains.
Implementation:
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1` (default 0)
- Runtime branch at entry: `if (tiny_uc_localize_enabled()) { ... }`
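To make the 74-1 shape concrete, here is a minimal sketch of an ENV-gated push. The struct layout, the function `uc_push`, and the body of `tiny_uc_localize_enabled()` are assumptions for illustration; only the ENV name and the idea of loading head/tail/mask into locals come from this document.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical ring-buffer layout; only the field names follow the text. */
typedef struct {
    void   **slots;
    uint16_t head;   /* pop index */
    uint16_t tail;   /* push index */
    uint16_t mask;   /* capacity - 1, capacity is a power of two */
} UnifiedCache;

/* Assumed gate body: read HAKMEM_TINY_UC_LOCALIZE once, default 0 (OFF). */
static int tiny_uc_localize_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        const char *e = getenv("HAKMEM_TINY_UC_LOCALIZE");
        cached = (e != NULL && e[0] == '1') ? 1 : 0;
    }
    return cached;
}

/* Returns 1 on success, 0 when the ring is full. */
static int uc_push(UnifiedCache *c, void *p) {
    if (tiny_uc_localize_enabled()) {   /* runtime branch paid on every call */
        /* LOCALIZE: load each field once, then work on the locals. */
        uint16_t head = c->head, tail = c->tail, mask = c->mask;
        uint16_t next = (uint16_t)((tail + 1) & mask);
        if (next == head) return 0;
        c->slots[tail] = p;
        c->tail = next;
        return 1;
    }
    /* Original path: fields are re-read through the pointer. */
    if (((c->tail + 1) & c->mask) == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = (uint16_t)((c->tail + 1) & c->mask);
    return 1;
}
```

Both arms behave identically. The only difference is how often the fields are loaded, which is why the extra gate check can outweigh the saving.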
Results (10-run A/B):
| Metric | LOCALIZE=0 | LOCALIZE=1 | Delta |
|---|---|---|---|
| throughput | 57.43 M ops/s | 57.72 M ops/s | +0.50% |
| instructions | 4,583M | 4,615M | +0.7% |
| branches | 1,276M | 1,281M | +0.4% |
| cache-misses | 560K | 461K | -17.7% |
Diagnosis: Runtime branch overhead dominated. Instructions/branches increased despite LOCALIZE intent.
Judgment: NEUTRAL (+0.50%, ±1.0% threshold) → Proceed to Phase 74-2 (compile-time gate).
Phase 74-2: LOCALIZE (compile-time gate, no runtime branch)
Goal: Eliminate the runtime branch to isolate the performance of LOCALIZE itself.
Implementation:
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Compile-time gate: `#if HAKMEM_TINY_UC_LOCALIZE_COMPILED` (no runtime branch)
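A compile-time gate of this kind might look like the sketch below. The `UcRing` type and `uc_pop` function are hypothetical; only the flag name and its default of 0 are taken from this document. The preprocessor selects one variant, so the compiled function carries no gate check at all.

```c
#include <stdint.h>
#include <stddef.h>

#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
#define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0   /* default OFF, as stated above */
#endif

/* Hypothetical ring layout used only for this illustration. */
typedef struct {
    void   **slots;
    uint16_t head, tail, mask;
} UcRing;

/* Pops one pointer, or returns NULL when the ring is empty. */
static void *uc_pop(UcRing *c) {
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    /* LOCALIZE variant: each field is loaded once into a local. */
    uint16_t head = c->head, tail = c->tail, mask = c->mask;
    if (head == tail) return NULL;
    void *p = c->slots[head];
    c->head = (uint16_t)((head + 1) & mask);
    return p;
#else
    /* Original variant: fields are accessed through the pointer. */
    if (c->head == c->tail) return NULL;
    void *p = c->slots[c->head];
    c->head = (uint16_t)((c->head + 1) & c->mask);
    return p;
#endif
}
```

Building with `-DHAKMEM_TINY_UC_LOCALIZE_COMPILED=1` would select the treatment variant. Whether the real build passes the flag this way is an assumption.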
Results (10-run A/B via layout_tax_forensics_box.sh):
| Metric | Baseline (=0) | Treatment (=1) | Delta |
|---|---|---|---|
| throughput | 58.90 M ops/s | 58.39 M ops/s | -0.87% |
| cycles | 1,553M | 1,548M | -0.3% |
| instructions | 2,748M | 2,733M | -0.6% |
| branches | 632M | 617M | -2.3% |
| cache-misses | 707K | 1,316K | +86% |
| dTLB-load-misses | 46K | 33K | -28% |
Analysis:
- Runtime branch overhead removed → instructions/branches improved (-0.6%/-2.3%) ✓
- LOCALIZE itself is effective → dependency chain reduction confirmed ✓
- But cache-misses +86% → register pressure / spill / worse access pattern
- Net result: -0.87% → cache-miss increase dominates instruction/branch savings
Phase 74-1 vs 74-2 comparison:
- 74-1 (runtime branch): instructions +0.7%, branches +0.4% → branch overhead loses
- 74-2 (compile-time): instructions -0.6%, branches -2.3% → LOCALIZE itself wins
- But cache-misses +86% cancels out → total NEUTRAL
Judgment: NEUTRAL (-0.87%, below +1.0% GO threshold) → P1 FROZEN
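The judgment rule used in both phases can be written down as a small helper. This is a sketch under assumptions: the +1.0% GO threshold and the ±1.0% neutral band come from this document, while the `Verdict` enum, the function names, and the symmetric NO-GO cutoff at -1.0% are illustrative.

```c
typedef enum { VERDICT_NO_GO, VERDICT_NEUTRAL, VERDICT_GO } Verdict;

/* Relative change of treatment vs. baseline, in percent. */
static double delta_pct(double baseline, double treatment) {
    return (treatment - baseline) / baseline * 100.0;
}

/* GO at or above +1.0% (from the text).
 * The NO-GO cutoff at -1.0% is an assumed mirror of the GO threshold. */
static Verdict judge(double pct) {
    if (pct >= 1.0)  return VERDICT_GO;
    if (pct <= -1.0) return VERDICT_NO_GO;
    return VERDICT_NEUTRAL;
}
```

Applied to the throughput rows above, `delta_pct(58.90, 58.39)` is about -0.87 and `delta_pct(57.43, 57.72)` is about +0.50, so both phases land in the neutral band.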
Root Cause (Phase 74-2)
Why cache-misses increased (+86%):
- Register pressure hypothesis:
  - Loading `head/tail/mask` into locals increases live registers
  - Compiler may spill to stack → more memory traffic
  - `cache->slots[head]` may lose prefetch opportunity
- Access pattern change:
  - `cache->head` direct load may benefit from compiler optimizations
  - Storing to local breaks dependency tracking?
  - Memory alias analysis degraded?
Evidence:
- dTLB-misses decreased (-28%) → not a TLB/page issue
- L1-dcache-load-misses similar → data layout not the issue
- cache-misses (+86%) is the PRIMARY BLOCKER
Lessons Learned
- Runtime branch tax is real: Phase 74-1 showed +0.7% instruction increase from ENV gate
- LOCALIZE itself works: Phase 74-2 confirmed -2.3% branches once the runtime branch was removed
- Register pressure matters: Even when instruction count drops, cache behavior can dominate
- This optimization path has low ROI: Dependency chain reduction is fragile to cache effects
Conclusion: P1 (LOCALIZE) frozen. Move to P0 (FASTAPI) (different approach: move branches outside hot loop).
P1 (LOCALIZE) - Frozen State
Files:
- `core/hakmem_build_flags.h`: `HAKMEM_TINY_UC_LOCALIZE_COMPILED` (default 0)
- `core/box/tiny_unified_cache_hitpath_env_box.h`: ENV gate (unused after 74-2)
- `core/front/tiny_unified_cache.h`: compile-time `#if` blocks

Default behavior: LOCALIZE=0 (original implementation)
Rollback: No action needed (default OFF)
Next Steps
Phase 74-3: P0 (FASTAPI)
Goal: Move unified_cache_enabled() / lazy-init / stats checks outside hot loop.
Approach:
- Create `unified_cache_push_fast()` / `unified_cache_pop_fast()` APIs
- Assume: "valid/enabled/no-stats" at caller side
- Fail-fast: fallback to slow path on unexpected state
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
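One possible shape for the planned fast API is sketched below. Phase 74-3 is not implemented yet, so everything here is an assumption: the `UcFast` struct, its `ready` flag, the slow-path stand-in `uc_pop_slow`, the `uc_drain` caller, and the signature of `unified_cache_pop_fast()`. Only the function name and the fail-fast idea come from the approach above. The `HAKMEM_TINY_UC_FASTAPI` gate is left out to keep the sketch short.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical cache layout. `ready` stands for "valid/enabled/no-stats". */
typedef struct {
    void   **slots;
    uint16_t head, tail, mask;
    uint8_t  ready;
} UcFast;

/* Slow path: keeps the defensive checks (stand-in for the existing pop). */
static void *uc_pop_slow(UcFast *c) {
    if (c == NULL || c->slots == NULL) return NULL;
    if (c->head == c->tail) return NULL;
    void *p = c->slots[c->head];
    c->head = (uint16_t)((c->head + 1) & c->mask);
    return p;
}

/* Fast path: the caller is expected to have validated the cache already.
 * A single guard falls back to the slow path on unexpected state. */
static void *unified_cache_pop_fast(UcFast *c) {
    if (!c->ready) return uc_pop_slow(c);   /* fail-fast fallback */
    uint16_t head = c->head;
    if (head == c->tail) return NULL;
    void *p = c->slots[head];
    c->head = (uint16_t)((head + 1) & c->mask);
    return p;
}

/* Hypothetical caller: validate once, then loop on the fast API. */
static size_t uc_drain(UcFast *c, void **out, size_t n) {
    size_t got = 0;
    if (c == NULL || c->slots == NULL) return 0;   /* hoisted check */
    while (got < n) {
        void *p = unified_cache_pop_fast(c);
        if (p == NULL) break;
        out[got++] = p;
    }
    return got;
}
```

The point of the design is that the enabled, lazy-init, and stats checks run once per batch in the caller rather than once per element in the loop.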
Expected benefit: +1-2% via branch reduction (different axis than P1)
GO threshold: +1.0% (strict, structural change)
Artifacts
- Design: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- Instructions: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- Results: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md` (this file)
- Forensics output: `./results/layout_tax_forensics/` (Phase 74-2 perf data)
Timeline
- Phase 74-1: ENV-gated LOCALIZE → NEUTRAL (+0.50%)
- Phase 74-2: Compile-time LOCALIZE → NEUTRAL (-0.87%) → P1 FROZEN
- Phase 74-3: P0 (FASTAPI) → (next)