hakmem/docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md
Moe Charm (CI) e9b97e9d8e Phase 74-1/74-2: UnifiedCache LOCALIZE optimization (P1 frozen, NEUTRAL -0.87%)
Phase 74-1 (ENV-gated LOCALIZE):
- Result: +0.50% (NEUTRAL)
- Runtime branch overhead caused instructions/branches to increase
- Diagnosed: Branch tax dominates intended optimization

Phase 74-2 (compile-time LOCALIZE):
- Result: -0.87% (NEUTRAL, P1 frozen)
- Removed runtime branch → instructions -0.6%, branches -2.3% ✓
- But cache-misses +86% (register pressure/spill) → net loss
- Conclusion: LOCALIZE itself works, but is fragile to cache effects

Key finding:
- Dependency chain reduction (LOCALIZE) has low ROI due to cache-miss sensitivity
- P1 (LOCALIZE) frozen at default OFF
- Next: Phase 74-3 (P0: FASTAPI) - move branches outside hot loop

Files:
- core/hakmem_build_flags.h: HAKMEM_TINY_UC_LOCALIZE_COMPILED flag
- core/box/tiny_unified_cache_hitpath_env_box.h: ENV gate (frozen)
- core/front/tiny_unified_cache.h: compile-time #if blocks
- docs/analysis/PHASE74_*: Design, instructions, results
- CURRENT_TASK.md: P1 frozen, P0 next instructions

Also includes:
- Phase 69 refill tuning results (archived docs)
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 69 baseline update
- PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md: Route banner docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-18 07:47:44 +09:00


Phase 74: UnifiedCache hit-path structural optimization - Results

Status: 🔴 P1 (LOCALIZE) FROZEN (NEUTRAL -0.87%)

Summary

Phase 74 investigated unified_cache_push/pop hit-path optimizations, aiming for a +1-3% throughput gain via instruction/branch reduction (the lesson learned from Phase 73).

P1 (LOCALIZE) attempted to reduce dependency chains by loading head/tail/mask into locals, but was frozen at NEUTRAL (-0.87%) due to cache-miss increase.


Phase 74-1: LOCALIZE (ENV-gated, runtime branch)

Goal: Load head/tail/mask once into locals to avoid reload dependency chains.

Implementation:

  • ENV: HAKMEM_TINY_UC_LOCALIZE=0/1 (default 0)
  • Runtime branch at entry: if (tiny_uc_localize_enabled()) { ... }
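The shape of the ENV-gated variant can be sketched as follows. This is a minimal illustration, not the code in core/front/tiny_unified_cache.h: the struct layout, `uc_env_cache_t`, `uc_env_init`, `uc_env_pop`, and the cached flag `g_uc_localize` are hypothetical names chosen for this example, and the real `tiny_uc_localize_enabled()` may be implemented differently. Only the ENV variable name and the entry branch come from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical ring-buffer cache; the real UnifiedCache layout may differ. */
typedef struct {
    void   **slots;
    uint16_t head;   /* next index to pop */
    uint16_t tail;   /* next index to push */
    uint16_t mask;   /* capacity - 1 (capacity is a power of two) */
} uc_env_cache_t;

static void uc_env_init(uc_env_cache_t *c, void **slots, uint16_t capacity) {
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    c->slots = slots;
    c->head = 0;
    c->tail = 0;
    c->mask = (uint16_t)(capacity - 1);
}

/* Cached value of HAKMEM_TINY_UC_LOCALIZE; -1 means "not read yet".
 * Not thread-safe; good enough for a sketch. */
static int g_uc_localize = -1;

static inline int tiny_uc_localize_enabled(void) {
    if (g_uc_localize < 0) {
        const char *e = getenv("HAKMEM_TINY_UC_LOCALIZE");
        g_uc_localize = (e != NULL && e[0] == '1') ? 1 : 0;
    }
    return g_uc_localize;
}

static inline void *uc_env_pop(uc_env_cache_t *c) {
    if (tiny_uc_localize_enabled()) {   /* runtime branch paid on every call */
        /* LOCALIZE: read each field once into a local. */
        const uint16_t head = c->head, tail = c->tail, mask = c->mask;
        if (head == tail) return NULL;  /* empty */
        void *p = c->slots[head];
        c->head = (uint16_t)((head + 1) & mask);
        return p;
    }
    /* Original path: fields are re-read through the pointer. */
    if (c->head == c->tail) return NULL;
    void *p = c->slots[c->head];
    c->head = (uint16_t)((c->head + 1) & c->mask);
    return p;
}
```

Both arms return the same result; the only difference is how often the fields are loaded, plus the extra flag test at entry.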

Results (10-run A/B):

| Metric | LOCALIZE=0 | LOCALIZE=1 | Delta |
|---|---|---|---|
| throughput | 57.43 M ops/s | 57.72 M ops/s | +0.50% |
| instructions | 4,583M | 4,615M | +0.7% |
| branches | 1,276M | 1,281M | +0.4% |
| cache-misses | 560K | 461K | -17.7% |

Diagnosis: Runtime branch overhead dominated. Instructions/branches increased despite LOCALIZE intent.

Judgment: NEUTRAL (+0.50%, ±1.0% threshold) → Proceed to Phase 74-2 (compile-time gate).
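The delta and judgment arithmetic can be written out as a small helper. This is a sketch under stated assumptions: `ab_delta_pct`, `ab_judge`, and the `ab_verdict_t` enum are hypothetical names, and treating anything at or below -1.0% as NO-GO is an assumption made here for symmetry. The text above only states the ±1.0% NEUTRAL band.

```c
#include <assert.h>

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_verdict_t;

/* Relative change of treatment versus baseline, in percent. */
static double ab_delta_pct(double baseline, double treatment) {
    assert(baseline > 0.0);
    return (treatment - baseline) / baseline * 100.0;
}

/* Classify a throughput delta against the ±1.0% band. */
static ab_verdict_t ab_judge(double delta_pct) {
    if (delta_pct >= 1.0)  return AB_GO;
    if (delta_pct <= -1.0) return AB_NO_GO;  /* assumption: symmetric band */
    return AB_NEUTRAL;
}
```

For the throughput row, `ab_delta_pct(57.43, 57.72)` evaluates to roughly +0.50, which `ab_judge` classifies as `AB_NEUTRAL`.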


Phase 74-2: LOCALIZE (compile-time gate, no runtime branch)

Goal: Eliminate the runtime branch to isolate the performance of LOCALIZE itself.

Implementation:

  • Build flag: HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1 (default 0)
  • Compile-time gate: #if HAKMEM_TINY_UC_LOCALIZE_COMPILED (no runtime branch)
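A compile-time gate of this kind might look like the sketch below. The flag name `HAKMEM_TINY_UC_LOCALIZE_COMPILED` is from this document; the struct, `uc_ct_init`, and `uc_ct_push` are hypothetical stand-ins, and the full/empty convention (one slot kept free) is an assumption for the example rather than the real cache's behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef HAKMEM_TINY_UC_LOCALIZE_COMPILED
#define HAKMEM_TINY_UC_LOCALIZE_COMPILED 0   /* default OFF */
#endif

/* Hypothetical ring-buffer cache; the real layout may differ. */
typedef struct {
    void   **slots;
    uint16_t head, tail, mask;
} uc_ct_cache_t;

static void uc_ct_init(uc_ct_cache_t *c, void **slots, uint16_t capacity) {
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    c->slots = slots;
    c->head = 0;
    c->tail = 0;
    c->mask = (uint16_t)(capacity - 1);
}

/* Returns 1 on success, 0 when the ring is full (one slot stays free). */
static inline int uc_ct_push(uc_ct_cache_t *c, void *p) {
#if HAKMEM_TINY_UC_LOCALIZE_COMPILED
    /* LOCALIZE: one load per field, no runtime flag test. */
    const uint16_t head = c->head, tail = c->tail, mask = c->mask;
    const uint16_t next = (uint16_t)((tail + 1) & mask);
    if (next == head) return 0;
    c->slots[tail] = p;
    c->tail = next;
    return 1;
#else
    /* Original path: fields are re-read through the pointer. */
    const uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
#endif
}
```

Because the preprocessor selects one arm, the binary contains no flag check at all, which is what lets the A/B comparison measure LOCALIZE in isolation. Building with `-DHAKMEM_TINY_UC_LOCALIZE_COMPILED=1` selects the treatment arm.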

Results (10-run A/B via layout_tax_forensics_box.sh):

| Metric | Baseline (=0) | Treatment (=1) | Delta |
|---|---|---|---|
| throughput | 58.90 M ops/s | 58.39 M ops/s | -0.87% |
| cycles | 1,553M | 1,548M | -0.3% |
| instructions | 2,748M | 2,733M | -0.6% |
| branches | 632M | 617M | -2.3% |
| cache-misses | 707K | 1,316K | +86% |
| dTLB-load-misses | 46K | 33K | -28% |

Analysis:

  1. Runtime branch overhead removed → instructions/branches improved (-0.6%/-2.3%) ✓
  2. LOCALIZE itself is effective → dependency chain reduction confirmed ✓
  3. But cache-misses rose +86% → suspected register pressure / spill / worse access pattern
  4. Net result: -0.87% → the cache-miss increase outweighs the instruction/branch savings

Phase 74-1 vs 74-2 comparison:

  • 74-1 (runtime branch): instructions +0.7%, branches +0.4% → branch overhead outweighs the LOCALIZE gain
  • 74-2 (compile-time): instructions -0.6%, branches -2.3% → LOCALIZE itself pays off
  • But the +86% cache-miss increase cancels that gain → overall NEUTRAL

Judgment: NEUTRAL (-0.87%, below +1.0% GO threshold) → P1 FROZEN


Root Cause (Phase 74-2)

Why cache-misses increased (+86%):

  1. Register pressure hypothesis: Loading head/tail/mask into locals increases live registers
    • Compiler may spill to stack → more memory traffic
    • cache->slots[head] may lose prefetch opportunity
  2. Access pattern change: cache->head direct load may benefit from compiler optimizations
    • Storing to local breaks dependency tracking?
    • Memory alias analysis degraded?

Evidence:

  • dTLB-load-misses decreased (-28%) → not a TLB/page issue
  • L1-dcache-load-misses similar → data layout not the issue
  • cache-misses (+86%) is the PRIMARY BLOCKER

Lessons Learned

  1. Runtime branch tax is real: Phase 74-1 showed +0.7% instruction increase from ENV gate
  2. LOCALIZE itself works: Phase 74-2 confirmed -2.3% branches once the runtime branch was removed
  3. Register pressure matters: Even when instruction count drops, cache behavior can dominate
  4. This optimization path has low ROI: Dependency chain reduction is fragile to cache effects

Conclusion: P1 (LOCALIZE) frozen. Move to P0 (FASTAPI) (different approach: move branches outside hot loop).


P1 (LOCALIZE) - Frozen State

Files:

  • core/hakmem_build_flags.h: HAKMEM_TINY_UC_LOCALIZE_COMPILED (default 0)
  • core/box/tiny_unified_cache_hitpath_env_box.h: ENV gate (unused after 74-2)
  • core/front/tiny_unified_cache.h: compile-time #if blocks

Default behavior: LOCALIZE=0 (original implementation)

Rollback: No action needed (default OFF)


Next Steps

Phase 74-3: P0 (FASTAPI)

Goal: Move unified_cache_enabled() / lazy-init / stats checks outside hot loop.

Approach:

  • Create unified_cache_push_fast() / unified_cache_pop_fast() APIs
  • Assume: "valid/enabled/no-stats" at caller side
  • Fail-fast: fallback to slow path on unexpected state
  • ENV gate: HAKMEM_TINY_UC_FASTAPI=0/1 (default 0, research box)
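The approach above could take roughly the following shape. Phase 74-3 is not implemented yet, so everything here is a hypothetical sketch: the struct, the `g_uc_enabled` / `g_uc_stats_on` flags (stand-ins for unified_cache_enabled() and the stats check), and the `_sketch` function names are illustrative and are not the planned signatures of unified_cache_push_fast().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache with a "lazy init done" flag. */
typedef struct {
    void   **slots;
    uint16_t head, tail, mask;
    uint8_t  ready;
} uc_fast_cache_t;

static int    g_uc_enabled    = 1;  /* stand-in for unified_cache_enabled() */
static int    g_uc_stats_on   = 0;  /* stand-in for the stats check */
static size_t g_uc_push_count = 0;

static void uc_fast_init_sketch(uc_fast_cache_t *c, void **slots, uint16_t capacity) {
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    c->slots = slots;
    c->head = 0;
    c->tail = 0;
    c->mask = (uint16_t)(capacity - 1);
    c->ready = 1;
}

/* Fast path: assumes the caller already verified enabled/ready/no-stats. */
static inline int uc_push_fast_sketch(uc_fast_cache_t *c, void *p) {
    const uint16_t next = (uint16_t)((c->tail + 1) & c->mask);
    if (next == c->head) return 0;     /* full */
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}

/* Slow path: performs every check on every call. */
static int uc_push_slow_sketch(uc_fast_cache_t *c, void *p) {
    if (!g_uc_enabled) return 0;
    if (!c->ready) return 0;           /* real code would lazy-init here */
    if (g_uc_stats_on) g_uc_push_count++;
    return uc_push_fast_sketch(c, p);
}

/* Caller side: checks are hoisted out of the loop; any unexpected
 * state falls back to the slow path. Returns the number pushed. */
static size_t uc_push_batch_sketch(uc_fast_cache_t *c, void **ptrs, size_t n) {
    size_t i = 0;
    if (g_uc_enabled && c->ready && !g_uc_stats_on) {
        while (i < n && uc_push_fast_sketch(c, ptrs[i])) i++;
        return i;
    }
    while (i < n && uc_push_slow_sketch(c, ptrs[i])) i++;
    return i;
}
```

The point of the split is that the three state checks run once per batch instead of once per push, which is a different axis from the dependency-chain reduction tried in P1.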

Expected benefit: +1-2% via branch reduction (different axis than P1)

GO threshold: +1.0% (strict, structural change)


Artifacts

  • Design: docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md
  • Instructions: docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md
  • Results: docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md (this file)
  • Forensics output: ./results/layout_tax_forensics/ (Phase 74-2 perf data)

Timeline

  • Phase 74-1: ENV-gated LOCALIZE → NEUTRAL (+0.50%)
  • Phase 74-2: Compile-time LOCALIZE → NEUTRAL (-0.87%) → P1 FROZEN
  • Phase 74-3: P0 (FASTAPI) → (next)