# Phase 67A: Layout Tax Forensics — SSOT **Status**: 🟡 ACTIVE (Foundation document) **Objective**: Create a reproducible diagnostic framework for layout tax regression (the "削ると遅い" problem). When code changes reduce binary size but hurt performance, pinpoint root cause in one measurement pass. --- ## Executive Summary Layout tax is the phenomenon where **code removal, optimization, or restructuring** → reduced binary size BUT increased latency. This document provides: 1. **Measurement protocol** (`scripts/box/layout_tax_forensics_box.sh`) 2. **Diagnostic decision tree** (symptoms → root causes) 3. **Remediation strategies** for each failure mode 4. **Historical case study**: Phase 64 (-4.05% NO-GO) --- ## 1. Measurement Protocol ### Quick Start ```bash # Compare baseline (Phase 68 PGO) vs treatment (e.g., Phase 64 attempt) ./scripts/box/layout_tax_forensics_box.sh \ ./bench_random_mixed_hakmem_minimal_pgo \ ./bench_random_mixed_hakmem_fast_pruned ``` **Output**: - `results/layout_tax_forensics/baseline_throughput.txt` — 10-run baseline - `results/layout_tax_forensics/treatment_throughput.txt` — 10-run treatment - `results/layout_tax_forensics/baseline_perf.txt` — perf stat (baseline) - `results/layout_tax_forensics/treatment_perf.txt` — perf stat (treatment) - `results/layout_tax_forensics/layout_tax_forensics_summary.txt` — summary ### Metrics Collected | Metric | Unit | What It Measures | Layout Tax Signal | |--------|------|------------------|-------------------| | **cycles** | M | Total CPU cycles | Baseline denominator | | **instructions** | M | Executed instructions | Efficiency of algorithm | | **IPC** | ratio | Instructions per cycle | Pipeline efficiency | | **branches** | M | Branch instructions | Control flow complexity | | **branch-misses** | M | Branch prediction failures | Front-end stall risk | | **cache-misses (L1-D)** | M | L1 data cache misses | Memory subsystem pressure | | **cache-misses (LLC)** | M | Last-level cache misses | DRAM latency hits | | **iTLB-load-misses** | M | Instruction TLB misses | Code locality degradation | | **dTLB-load-misses** | M | Data TLB misses | Data layout dispersal | --- ## 2. Decision Tree: Diagnosis → Remediation ### Performance Delta Classification ``` Δ Throughput ├─ > +1.0% → GO (improvement, apply to baseline) ├─ ±1.0% → NEUTRAL (measure noise, investigate if concern) └─ < -1.0% → NO-GO (regression detected, diagnose) ``` ### NO-GO Root Cause Diagnosis When `Δ < -1.0%`, measure the following **per-cycle cost deltas**: ``` Δ% in perf metrics (normalized by cycles): ├─ IPC drops >3% → **I-cache miss / code layout dispersal** ├─ branch-miss ↑ >10% → **Branch prediction penalty** ├─ L1-dcache-miss ↑ >15% → **Data layout fragmentation** ├─ LLC-miss ↑ >50% → **Reduced working set locality** ├─ iTLB-miss ↑ >100% → **Code page table thrashing** └─ dTLB-miss ↑ >100% → **Data page table contention** ``` --- ## 3. Root Cause → Remediation Mapping ### A. IPC Degradation (Code Layout Tax) **Symptom**: IPC drops, instructions count same/similar, but **cycles increase**. **Root Causes**: - Code interleaving / function reordering (I-cache misses) - Jump misprediction in hot loops - Branch alignment issues **Remediation**: - **Keep-out strategy** (✓ recommended): Do not remove/move hot functions - **Compiler fix**: Re-enable `-fno-toplevel-reorder` or PGO (already applied) - **Measurement**: Use `perf record -b` to sample branch targets **Reference**: Phase 64 DCE attempt (-4.05% from IPC 2.05 → 1.98) --- ### B. Branch Prediction Miss Spike **Symptom**: `branch-misses` increases >10% (conditional branches mis-predicted). **Root Causes**: - Hot loop unrolled/rewritten, branch history table (BHT) loss - Pattern change in conditional jumps - Code reordering disrupts branch predictor bias **Remediation**: - Keep loop structure intact - Avoid aggressive loop unroll without profile guidance - Verify with `perf record -c10000 --event branches:ppp` --- ### C. Data TLB Misses (Memory Layout Tax) **Symptom**: `dTLB-load-misses` increases >100%, data cache misses stable. **Root Causes**: - Data structure relayout (e.g., pool reorganization) - Larger data working set per cycle - Unfortunate data alignment boundaries **Remediation**: - Preserve existing struct layouts in hot paths - Use compile-time box boundaries for data (similar to code boxes) - Profile with `perf record -e dTLB-load-misses` + `perf report --stdio` --- ### D. L1-D Cache Miss Spike **Symptom**: `L1-dcache-load-misses` increases >15%, indicating data reuse penalty. **Root Causes**: - Tiny allocator free-list structure changed (cache line conflict) - Metadata layout modified - Data prefetch pattern disrupted **Remediation**: - Maintain existing cache-line alignment of hot metadata - Use perf to profile hot data access patterns: `perf mem --phys` - Consider splitting cache-hot vs cache-cold data paths --- ### E. Instruction TLB Thrashing **Symptom**: `iTLB-load-misses` increases >100%. **Root Causes**: - Code section grew beyond 2MB, crossing HUGE_PAGES boundary - Function reordering disrupted TLB entry reuse - New code section lacks alignment **Remediation**: - Keep code section <2MB (use `size binary` to verify) - Maintain compile-out (not physical removal) for research changes - Align hot code sections to page boundaries --- ## 4. Case Study: Phase 64 (Backend Pruning, -4.05%) **Attempt**: Remove unused backend code paths (DCE / dead-code elimination). **Symptom**: Throughput dropped -4.05%. **Forensics Output**: ``` Metric Delta Root Cause ───────────────────────────────── IPC 2.05→1.98 (-3.4%) Code reordering after DCE Cycles ↑ +4.2% More cycles needed per instruction Instructions ≈ 0% Same algorithm complexity branch-misses ↑ +8% Stronger branch prediction penalty Diagnosis: Hot path functions (tiny_c7_ultra_alloc, tiny_region_id_write_header) re-linked by linker after code removal, I-cache misses increased. ``` **Remediation Decision**: Keep as **compile-out only** (gate function with #if). - ✓ Maintains binary layout - ✓ Research changes can be cleanly reverted - ✗ Binary size not reduced - Verdict: **Trade-off accepted** for reproducibility and avoiding layout tax. --- ## 5. Operational Guidelines ### When to Use This Box - **New optimization attempt shows NO-GO**: Run forensics to get root cause - **Code removal approved**: Measure forensics BEFORE and AFTER link - **Performance regression unexplained**: Forensics disambiguates algorithmic vs. layout ### When to Skip - Changes that explicitly avoid binary layout (e.g., constant tuning) - Algorithmic improvements verified with algorithmic complexity analysis - Compiler version changes (measure separately) ### Escalation Path 1. **Small regression (-1% to -2%)**: Investigate, usually layout-fixable 2. **Medium regression (-2% to -5%)**: Likely layout tax, use forensics 3. **Large regression (>-5%)**: Likely algorithmic, check Phase 64-style DCE issues --- ## 6. Metrics Interpretation Guide ### Quick Reference: Which Metric to Check First | Binary Change | Primary Metric | Secondary | |----------------|----------------|-----------| | Code removed/compressed | IPC, iTLB | branch-misses | | Data structure reordered | dTLB, L1-dcache | cycles/instruction | | Loop optimized | branch-misses | iTLB | | Inlining changed | IPC, iTLB, branch | cycles | | Allocation path modified | dTLB, L1-dcache | LLC-misses | --- ## 7. Integration with Box Theory **Key Principle**: Layout tax is an **artifact of link-time reordering**, not algorithmic complexity. - **Box Rule**: Keep all code behind gates (compile-out, not physical removal) - **Reversibility**: Research changes must not alter binary layout when disabled - **Measurement**: Always compare against baseline **with gate disabled** (same layout) This forensics framework validates these rules operationally. --- ## Next Steps 1. **Immediate**: Use this template to diagnose Phase 64 retrospectively 2. **Phase 67b**: When attempting inline/unroll tuning, measure forensics first 3. **Phase 69+**: Before any -5% target structural changes, establish forensics baseline --- ## Artifacts - `scripts/box/layout_tax_forensics_box.sh` — Measurement harness - `results/layout_tax_forensics/` — Output logs and metrics - Phase 64 retrospective (TBD) --- **Status**: 🟢 READY FOR OPERATIONAL USE (as of Phase 68 completion)