
Moe Charm (CI) e9b97e9d8e Phase 74-1/74-2: UnifiedCache LOCALIZE optimization (P1 frozen, NEUTRAL -0.87%)

Phase 74-1 (ENV-gated LOCALIZE):
- Result: +0.50% (NEUTRAL)
- Runtime branch overhead caused instructions/branches to increase
- Diagnosed: Branch tax dominates intended optimization

Phase 74-2 (compile-time LOCALIZE):
- Result: -0.87% (NEUTRAL, P1 frozen)
- Removed runtime branch → instructions -0.6%, branches -2.3% ✓
- But cache-misses +86% (register pressure/spill) → net loss
- Conclusion: LOCALIZE itself works, but is fragile to cache effects

Key finding:
- Dependency chain reduction (LOCALIZE) has low ROI due to cache-miss sensitivity
- P1 (LOCALIZE) frozen at default OFF
- Next: Phase 74-3 (P0: FASTAPI) - move branches outside hot loop

Files:
- core/hakmem_build_flags.h: HAKMEM_TINY_UC_LOCALIZE_COMPILED flag
- core/box/tiny_unified_cache_hitpath_env_box.h: ENV gate (frozen)
- core/front/tiny_unified_cache.h: compile-time #if blocks
- docs/analysis/PHASE74_*: Design, instructions, results
- CURRENT_TASK.md: P1 frozen, P0 next instructions

Also includes:
- Phase 69 refill tuning results (archived docs)
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 69 baseline update
- PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md: Route banner docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-18 07:47:44 +09:00


Phase 74: UnifiedCache hit-path structural optimization - Results

Status: 🔴 P1 (LOCALIZE) FROZEN (NEUTRAL -0.87%)

Summary

Phase 74 investigated unified_cache_push/pop hit-path optimizations, targeting +1-3% throughput via instruction/branch reduction (the lesson from Phase 73).

P1 (LOCALIZE) attempted to reduce dependency chains by loading head/tail/mask into locals, but was frozen at NEUTRAL (-0.87%) due to cache-miss increase.

Phase 74-1: LOCALIZE (ENV-gated, runtime branch)

Goal: Load head/tail/mask once into locals to avoid reload dependency chains.

Implementation:

ENV: HAKMEM_TINY_UC_LOCALIZE=0/1 (default 0)
Runtime branch at entry: if (tiny_uc_localize_enabled()) { ... }
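The ENV-gated variant can be sketched as follows. This is a minimal illustration, not the project's code: the struct layout, `uc_sketch_t`, and the `uc_pop_*` helpers are hypothetical, and only the `HAKMEM_TINY_UC_LOCALIZE` variable and the `tiny_uc_localize_enabled()` name come from the text above. It shows where the per-call runtime branch sits relative to the two pop bodies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical ring-buffer layout; the real cache struct may differ. */
typedef struct {
    uint16_t head;    /* index of the next slot to pop */
    uint16_t tail;    /* index of the next slot to push */
    uint16_t mask;    /* capacity - 1 (capacity is a power of two) */
    void   **slots;
} uc_sketch_t;

/* Reads HAKMEM_TINY_UC_LOCALIZE once and caches the result (default 0). */
static int tiny_uc_localize_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        const char *v = getenv("HAKMEM_TINY_UC_LOCALIZE");
        cached = (v != NULL && v[0] == '1') ? 1 : 0;
    }
    return cached;
}

/* Original form: every access goes through the cache pointer. */
static void *uc_pop_direct(uc_sketch_t *c) {
    if (c->head == c->tail) return NULL;              /* empty */
    void *p = c->slots[c->head];
    c->head = (uint16_t)((c->head + 1) & c->mask);
    return p;
}

/* LOCALIZE form: head/tail/mask are loaded once into locals. */
static void *uc_pop_localized(uc_sketch_t *c) {
    const uint16_t head = c->head, tail = c->tail, mask = c->mask;
    if (head == tail) return NULL;                    /* empty */
    void *p = c->slots[head];
    c->head = (uint16_t)((head + 1) & mask);
    return p;
}

/* Phase 74-1 shape: the ENV check is a branch paid on every call. */
static void *uc_pop_sketch(uc_sketch_t *c) {
    assert(c != NULL && c->slots != NULL);
    if (tiny_uc_localize_enabled()) return uc_pop_localized(c);
    return uc_pop_direct(c);
}
```

Both bodies return the same pointers; the only difference is whether the index fields are re-read through `c` or held in locals, plus the extra gate branch in `uc_pop_sketch()`.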

Results (10-run A/B):

Metric	LOCALIZE=0	LOCALIZE=1	Delta
throughput	57.43 M ops/s	57.72 M ops/s	+0.50%
instructions	4,583M	4,615M	+0.7%
branches	1,276M	1,281M	+0.4%
cache-misses	560K	461K	-17.7%

Diagnosis: Runtime branch overhead dominated. Instructions and branches increased, the opposite of what LOCALIZE was meant to achieve.

Judgment: NEUTRAL (+0.50%, ±1.0% threshold) → Proceed to Phase 74-2 (compile-time gate).
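The judgment rule can be written down as a small helper. This is a sketch under assumptions: the names `delta_pct`, `judge`, and `judge_t` are hypothetical, and treating a delta at or below the negative threshold as NO-GO is an assumption, since the text only states the ±1.0% NEUTRAL band and the +1.0% GO threshold.

```c
#include <assert.h>

typedef enum { JUDGE_NO_GO = -1, JUDGE_NEUTRAL = 0, JUDGE_GO = 1 } judge_t;

/* Relative throughput change of treatment vs. baseline, in percent. */
static double delta_pct(double baseline, double treatment) {
    assert(baseline > 0.0);
    return (treatment - baseline) / baseline * 100.0;
}

/* GO at >= +threshold, NO-GO at <= -threshold (assumed), otherwise NEUTRAL. */
static judge_t judge(double baseline, double treatment, double threshold_pct) {
    const double d = delta_pct(baseline, treatment);
    if (d >= threshold_pct)  return JUDGE_GO;
    if (d <= -threshold_pct) return JUDGE_NO_GO;
    return JUDGE_NEUTRAL;
}
```

With the Phase 74-1 throughput figures, `judge(57.43, 57.72, 1.0)` lands inside the band and yields `JUDGE_NEUTRAL`.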

Phase 74-2: LOCALIZE (compile-time gate, no runtime branch)

Goal: Eliminate the runtime branch to isolate the performance of LOCALIZE itself.

Implementation:

Build flag: HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1 (default 0)
Compile-time gate: #if HAKMEM_TINY_UC_LOCALIZE_COMPILED (no runtime branch)
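A compile-time gate of this kind might look like the sketch below. Only the `HAKMEM_TINY_UC_LOCALIZE_COMPILED` flag name is taken from the text; the struct layout and `uc_push_sketch()` are hypothetical stand-ins, and the full-buffer convention (one slot kept empty) is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
#define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0   /* default OFF, as stated above */
#endif

/* Hypothetical ring-buffer layout; the real cache struct may differ. */
typedef struct {
    uint16_t head, tail, mask;   /* mask = capacity - 1 */
    void   **slots;
} uc_sketch_t;

/* Returns 1 on success, 0 when the ring is full. */
static int uc_push_sketch(uc_sketch_t *c, void *p) {
    assert((c->mask & (c->mask + 1)) == 0);   /* capacity must be a power of two */
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    /* LOCALIZE: load the fields once, then work on locals only. */
    const uint16_t head = c->head, tail = c->tail, mask = c->mask;
    const uint16_t next = (uint16_t)((tail + 1) & mask);
    if (next == head) return 0;
    c->slots[tail] = p;
    c->tail = next;
#else
    /* Original: access the fields through the cache pointer each time. */
    const uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = next;
#endif
    return 1;
}
```

Because the preprocessor selects one body, the compiled function contains no gate branch at all; building with `-DHAKMEM_TINY_UC_LOCALIZE_COMPILED=1` switches to the treatment variant.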

Results (10-run A/B via layout_tax_forensics_box.sh):

Metric	Baseline (=0)	Treatment (=1)	Delta
throughput	58.90 M ops/s	58.39 M ops/s	-0.87%
cycles	1,553M	1,548M	-0.3%
instructions	2,748M	2,733M	-0.6%
branches	632M	617M	-2.3%
cache-misses	707K	1,316K	+86%
dTLB-load-misses	46K	33K	-28%

Analysis:

Runtime branch overhead removed → instructions/branches improved (-0.6%/-2.3%) ✓
LOCALIZE itself is effective → dependency chain reduction confirmed ✓
But cache-misses rose +86% → suspected register pressure / spill / worse access pattern
Net result: -0.87% → the cache-miss increase outweighs the instruction/branch savings

Phase 74-1 vs 74-2 comparison:

74-1 (runtime branch): instructions +0.7%, branches +0.4% → branch overhead loses
74-2 (compile-time): instructions -0.6%, branches -2.3% → LOCALIZE itself wins
But cache-misses +86% cancels out → total NEUTRAL

Judgment: NEUTRAL (-0.87%, below +1.0% GO threshold) → P1 FROZEN

Root Cause (Phase 74-2)

Why cache-misses increased (+86%):

Register pressure hypothesis: Loading head/tail/mask into locals increases live registers
- Compiler may spill to stack → more memory traffic
- cache->slots[head] may lose prefetch opportunity
Access pattern change: cache->head direct load may benefit from compiler optimizations
- Storing to local breaks dependency tracking?
- Memory alias analysis degraded?

Evidence:

dTLB-load-misses decreased (-28%) → not a TLB/page issue
L1-dcache-load-misses similar → L1 data layout not the issue
cache-misses (+86%) is the PRIMARY BLOCKER

Lessons Learned

Runtime branch tax is real: Phase 74-1 showed +0.7% instruction increase from ENV gate
LOCALIZE itself works: Phase 74-2 confirmed -2.3% branches once the runtime branch was removed
Register pressure matters: Even when instruction count drops, cache behavior can dominate
This optimization path has low ROI: Dependency chain reduction is fragile to cache effects

Conclusion: P1 (LOCALIZE) frozen. Move to P0 (FASTAPI), a different approach that moves branches outside the hot loop.

P1 (LOCALIZE) - Frozen State

Files:

core/hakmem_build_flags.h: HAKMEM_TINY_UC_LOCALIZE_COMPILED (default 0)
core/box/tiny_unified_cache_hitpath_env_box.h: ENV gate (unused after 74-2)
core/front/tiny_unified_cache.h: compile-time #if blocks

Default behavior: LOCALIZE=0 (original implementation)
Rollback: No action needed (default OFF)

Next Steps

Phase 74-3: P0 (FASTAPI)

Goal: Move unified_cache_enabled() / lazy-init / stats checks outside hot loop.

Approach:

Create unified_cache_push_fast() / unified_cache_pop_fast() APIs
Assume the caller has already established "valid/enabled/no-stats"
Fail-fast: fallback to slow path on unexpected state
ENV gate: HAKMEM_TINY_UC_FASTAPI=0/1 (default 0, research box)
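One possible shape for this approach is sketched below. Phase 74-3 is not implemented yet, so everything here is an assumption: the struct fields, the `uc_push_*` names (standing in for the planned unified_cache_push_fast()), and treating a full ring as the "unexpected state" that triggers the fallback. The ENV gate is omitted to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache layout and state flags, for illustration only. */
typedef struct {
    uint16_t head, tail, mask;   /* mask = capacity - 1 */
    void   **slots;              /* NULL until lazy-init has run */
    int      enabled;            /* stand-in for unified_cache_enabled() */
    int      stats_on;           /* stand-in for the stats check */
} uc_sketch_t;

/* Slow path: re-validates state on every call. */
static int uc_push_slow(uc_sketch_t *c, void *p) {
    if (!c->enabled || c->slots == NULL) return 0;
    const uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;                   /* full */
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}

/* Fast path: assumes the caller already checked valid/enabled/no-stats. */
static inline int uc_push_fast(uc_sketch_t *c, void *p) {
    const uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return uc_push_slow(c, p);  /* unexpected state: fall back */
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}

/* Caller hoists the checks out of the loop, then uses the fast API inside it. */
static size_t uc_push_many(uc_sketch_t *c, void **ptrs, size_t n) {
    assert(c != NULL);
    size_t pushed = 0;
    if (c->enabled && c->slots != NULL && !c->stats_on) {
        while (pushed < n && uc_push_fast(c, ptrs[pushed])) pushed++;
    } else {
        while (pushed < n && uc_push_slow(c, ptrs[pushed])) pushed++;
    }
    return pushed;
}
```

The enabled/lazy-init/stats checks run once per batch in `uc_push_many()` instead of once per element, which is the branch reduction this phase is meant to measure.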

Expected benefit: +1-2% via branch reduction (different axis than P1)

GO threshold: +1.0% (strict, structural change)

Artifacts

Design: docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md
Instructions: docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md
Results: docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md (this file)
Forensics output: ./results/layout_tax_forensics/ (Phase 74-2 perf data)

Timeline

Phase 74-1: ENV-gated LOCALIZE → NEUTRAL (+0.50%)
Phase 74-2: Compile-time LOCALIZE → NEUTRAL (-0.87%) → P1 FROZEN
Phase 74-3: P0 (FASTAPI) → (next)
