# Phase 67A: Layout Tax Forensics — SSOT
**Status**: 🟡 ACTIVE (Foundation document)
**Objective**: Create a reproducible diagnostic framework for layout tax regressions (the "trim it and it gets slower" problem). When a code change reduces binary size but hurts performance, pinpoint the root cause in one measurement pass.
---
## Executive Summary
Layout tax is the phenomenon where **code removal, optimization, or restructuring** reduces binary size but increases latency. This document provides:
1. **Measurement protocol** (`scripts/box/layout_tax_forensics_box.sh`)
2. **Diagnostic decision tree** (symptoms → root causes)
3. **Remediation strategies** for each failure mode
4. **Historical case study**: Phase 64 (-4.05% NO-GO)
---
## 1. Measurement Protocol
### Quick Start
```bash
# Compare baseline (Phase 68 PGO) vs treatment (e.g., Phase 64 attempt)
./scripts/box/layout_tax_forensics_box.sh \
./bench_random_mixed_hakmem_minimal_pgo \
./bench_random_mixed_hakmem_fast_pruned
```
**Output**:
- `results/layout_tax_forensics/baseline_throughput.txt` — 10-run baseline
- `results/layout_tax_forensics/treatment_throughput.txt` — 10-run treatment
- `results/layout_tax_forensics/baseline_perf.txt` — perf stat (baseline)
- `results/layout_tax_forensics/treatment_perf.txt` — perf stat (treatment)
- `results/layout_tax_forensics/layout_tax_forensics_summary.txt` — summary
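The two throughput files can be reduced to a single percentage delta. The following is a minimal sketch, assuming each file holds one numeric throughput sample per line; the actual format written by `layout_tax_forensics_box.sh` is not specified here, so the parsing may need adjusting. `mean_of` and `pct_delta` are hypothetical helper names, not part of the harness.

```bash
# Hypothetical helpers; assume one numeric throughput sample per line.

# Print the arithmetic mean of the purely numeric lines in a file.
mean_of() {
    awk '$1 ~ /^[0-9.]+$/ { sum += $1; n++ }
         END { if (n > 0) printf "%.4f\n", sum / n }' "$1"
}

# Print the relative change from baseline ($1) to treatment ($2), in percent.
pct_delta() {
    awk -v b="$1" -v t="$2" 'BEGIN { printf "%+.2f\n", (t - b) / b * 100 }'
}

# Usage (paths from the output list above):
#   base=$(mean_of results/layout_tax_forensics/baseline_throughput.txt)
#   treat=$(mean_of results/layout_tax_forensics/treatment_throughput.txt)
#   pct_delta "$base" "$treat"
```

Comparing means keeps the sketch short; with only 10 runs per side, checking the spread of the samples as well would make a small delta easier to trust.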
### Metrics Collected
| Metric | Unit | What It Measures | Layout Tax Signal |
|--------|------|------------------|-------------------|
| **cycles** | M | Total CPU cycles | Baseline denominator |
| **instructions** | M | Executed instructions | Efficiency of algorithm |
| **IPC** | ratio | Instructions per cycle | Pipeline efficiency |
| **branches** | M | Branch instructions | Control flow complexity |
| **branch-misses** | M | Branch prediction failures | Front-end stall risk |
| **cache-misses (L1-D)** | M | L1 data cache misses | Memory subsystem pressure |
| **cache-misses (LLC)** | M | Last-level cache misses | DRAM latency hits |
| **iTLB-load-misses** | M | Instruction TLB misses | Code locality degradation |
| **dTLB-load-misses** | M | Data TLB misses | Data layout dispersal |
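For reference, the counters in the table map onto standard `perf stat` event aliases. The sketch below only prints the command rather than running it. The event list is an assumption about what the harness collects, and some aliases (notably the LLC and TLB ones) are not available on every CPU or kernel, so check `perf list` first. `PERF_EVENTS` and `perf_stat_cmd` are hypothetical names.

```bash
# Assumed event list; verify each alias with `perf list` on the target machine.
PERF_EVENTS="cycles,instructions,branches,branch-misses,L1-dcache-load-misses,LLC-load-misses,iTLB-load-misses,dTLB-load-misses"

# Print (without executing) the perf stat command for a benchmark binary.
perf_stat_cmd() {
    printf 'perf stat -e %s -- %s\n' "$PERF_EVENTS" "$1"
}

perf_stat_cmd ./bench_random_mixed_hakmem_minimal_pgo
```

IPC is not a separate event: `perf stat` derives it from `instructions` and `cycles` when both are counted in the same run.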
---
## 2. Decision Tree: Diagnosis → Remediation
### Performance Delta Classification
```
Δ Throughput
├─ > +1.0% → GO (improvement, apply to baseline)
├─ ±1.0% → NEUTRAL (measurement noise, investigate if concerned)
└─ < -1.0% → NO-GO (regression detected, diagnose)
```
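The classification above can be expressed as a small function. This is a sketch with a hypothetical name (`classify_delta`); it treats exactly ±1.0% as NEUTRAL, which is one reading of the bands and may differ from what the harness does at the boundary.

```bash
# Classify a throughput delta given in percent (e.g. -4.05).
# Boundary values (exactly 1.0 or -1.0) are treated as NEUTRAL here.
classify_delta() {
    awk -v d="$1" 'BEGIN {
        if (d > 1.0)       print "GO"
        else if (d < -1.0) print "NO-GO"
        else               print "NEUTRAL"
    }'
}

classify_delta -4.05   # the Phase 64 result
```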
### NO-GO Root Cause Diagnosis
When `Δ < -1.0%`, measure the following **per-cycle cost deltas**:
```
Δ% in perf metrics (normalized by cycles):
├─ IPC drops >3% → **I-cache miss / code layout dispersal**
├─ branch-miss ↑ >10% → **Branch prediction penalty**
├─ L1-dcache-miss ↑ >15% → **Data layout fragmentation**
├─ LLC-miss ↑ >50% → **Reduced working set locality**
├─ iTLB-miss ↑ >100% → **Code page table thrashing**
└─ dTLB-miss ↑ >100% → **Data page table contention**
```
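The thresholds in this tree can be checked mechanically once the percentage deltas are known. The sketch below assumes the six deltas have already been computed and are passed as arguments in a fixed order; `diagnose` is a hypothetical helper, and it reports every threshold that is exceeded rather than stopping at the first.

```bash
# diagnose IPC BRANCH_MISS L1D_MISS LLC_MISS ITLB_MISS DTLB_MISS
# All arguments are percent changes (treatment vs baseline).
diagnose() {
    awk -v ipc="$1" -v br="$2" -v l1d="$3" -v llc="$4" -v itlb="$5" -v dtlb="$6" 'BEGIN {
        n = 0
        if (ipc  < -3)  { print "I-cache miss / code layout dispersal"; n++ }
        if (br   > 10)  { print "Branch prediction penalty"; n++ }
        if (l1d  > 15)  { print "Data layout fragmentation"; n++ }
        if (llc  > 50)  { print "Reduced working set locality"; n++ }
        if (itlb > 100) { print "Code page table thrashing"; n++ }
        if (dtlb > 100) { print "Data page table contention"; n++ }
        if (n == 0)       print "no threshold exceeded"
    }'
}

# IPC -3.4% and branch-misses +8%; the remaining zeros are placeholders.
diagnose -3.4 8 0 0 0 0
```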
---
## 3. Root Cause → Remediation Mapping
### A. IPC Degradation (Code Layout Tax)
**Symptom**: IPC drops while the instruction count stays roughly the same, so **cycles increase**.
**Root Causes**:
- Code interleaving / function reordering (I-cache misses)
- Jump misprediction in hot loops
- Branch alignment issues
**Remediation**:
- **Keep-out strategy** (✓ recommended): Do not remove/move hot functions
- **Compiler fix**: Re-enable `-fno-toplevel-reorder` or PGO (already applied)
- **Measurement**: Use `perf record -b` to sample branch targets
**Reference**: Phase 64 DCE attempt (-4.05% throughput; IPC fell from 2.05 to 1.98)
---
### B. Branch Prediction Miss Spike
**Symptom**: `branch-misses` increases >10% (conditional branches mis-predicted).
**Root Causes**:
- Hot loop unrolled or rewritten, losing learned branch history table (BHT) state
- Pattern change in conditional jumps
- Code reordering disrupts branch predictor bias
**Remediation**:
- Keep loop structure intact
- Avoid aggressive loop unroll without profile guidance
- Verify with `perf record -c10000 --event branches:ppp`
---
### C. Data TLB Misses (Memory Layout Tax)
**Symptom**: `dTLB-load-misses` increases >100%, data cache misses stable.
**Root Causes**:
- Data structure relayout (e.g., pool reorganization)
- Larger data working set per cycle
- Unfortunate data alignment boundaries
**Remediation**:
- Preserve existing struct layouts in hot paths
- Use compile-time box boundaries for data (similar to code boxes)
- Profile with `perf record -e dTLB-load-misses` + `perf report --stdio`
---
### D. L1-D Cache Miss Spike
**Symptom**: `L1-dcache-load-misses` increases >15%, indicating data reuse penalty.
**Root Causes**:
- Tiny allocator free-list structure changed (cache line conflict)
- Metadata layout modified
- Data prefetch pattern disrupted
**Remediation**:
- Maintain existing cache-line alignment of hot metadata
- Use perf to profile hot data access patterns: `perf mem record` followed by `perf mem report` (add `--phys-data` to sample physical addresses)
- Consider splitting cache-hot vs cache-cold data paths
---
### E. Instruction TLB Thrashing
**Symptom**: `iTLB-load-misses` increases >100%.
**Root Causes**:
- Code section grew beyond 2MB, crossing HUGE_PAGES boundary
- Function reordering disrupted TLB entry reuse
- New code section lacks alignment
**Remediation**:
- Keep code section <2MB (use `size binary` to verify)
- Maintain compile-out (not physical removal) for research changes
- Align hot code sections to page boundaries
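The 2 MB check can be scripted around `size`. The sketch below assumes GNU `size` in its default Berkeley format, where the second line carries the numbers and the first column is `text`. That column also includes read-only data, so it overstates the pure code size slightly. `text_fits_2mb` is a hypothetical helper name.

```bash
# Read Berkeley-format `size` output on stdin and report whether the
# text column is below 2 MiB (2097152 bytes).
text_fits_2mb() {
    awk 'NR == 2 {
        if ($1 + 0 < 2 * 1024 * 1024) print "OK " $1
        else                          print "OVER " $1
    }'
}

# Usage:
#   size ./bench_random_mixed_hakmem_minimal_pgo | text_fits_2mb
```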
---
## 4. Case Study: Phase 64 (Backend Pruning, -4.05%)
**Attempt**: Remove unused backend code paths (DCE / dead-code elimination).
**Symptom**: Throughput dropped -4.05%.
**Forensics Output**:
```
Metric          Delta                Root Cause
────────────────────────────────────────────────────────────────────────
IPC             2.05→1.98 (-3.4%)    Code reordering after DCE
Cycles          ↑ +4.2%              More cycles needed per instruction
Instructions    ≈ 0%                 Same algorithm complexity
branch-misses   ↑ +8%                Stronger branch prediction penalty

Diagnosis: Hot path functions (tiny_c7_ultra_alloc, tiny_region_id_write_header)
re-linked by linker after code removal, I-cache misses increased.
```
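As a sanity check on these figures: since IPC = instructions / cycles, an IPC drop at a constant instruction count implies a matching rise in cycles. The sketch below redoes that arithmetic with a hypothetical `pct_change` helper.

```bash
# Percent change from an old value ($1) to a new value ($2), one decimal place.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

pct_change 2.05 1.98   # IPC change
pct_change 1.98 2.05   # cycle change implied by that IPC drop at a constant instruction count
```

The implied cycle increase comes out somewhat below the reported +4.2%. One possible explanation is that the instruction count rose slightly rather than staying exactly flat, or that the gap is run-to-run noise; the data here does not say which.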
**Remediation Decision**: Keep as **compile-out only** (gate the function with `#if`).
- ✓ Maintains binary layout
- ✓ Research changes can be cleanly reverted
- ✗ Binary size not reduced
- Verdict: **Trade-off accepted** for reproducibility and avoiding layout tax.
---
## 5. Operational Guidelines
### When to Use This Box
- **New optimization attempt shows NO-GO**: Run forensics to get root cause
- **Code removal approved**: Measure forensics BEFORE and AFTER link
- **Performance regression unexplained**: Forensics disambiguates algorithmic vs. layout
### When to Skip
- Changes that do not affect binary layout (e.g., constant tuning)
- Algorithmic improvements verified with algorithmic complexity analysis
- Compiler version changes (measure separately)
### Escalation Path
1. **Small regression (-1% to -2%)**: Investigate, usually layout-fixable
2. **Medium regression (-2% to -5%)**: Likely layout tax, use forensics
3. **Large regression (worse than -5%)**: Likely algorithmic, check Phase 64-style DCE issues
---
## 6. Metrics Interpretation Guide
### Quick Reference: Which Metric to Check First
| Binary Change | Primary Metric | Secondary |
|----------------|----------------|-----------|
| Code removed/compressed | IPC, iTLB | branch-misses |
| Data structure reordered | dTLB, L1-dcache | cycles/instruction |
| Loop optimized | branch-misses | iTLB |
| Inlining changed | IPC, iTLB, branch | cycles |
| Allocation path modified | dTLB, L1-dcache | LLC-misses |
---
## 7. Integration with Box Theory
**Key Principle**: Layout tax is an **artifact of link-time reordering**, not algorithmic complexity.
- **Box Rule**: Keep all code behind gates (compile-out, not physical removal)
- **Reversibility**: Research changes must not alter binary layout when disabled
- **Measurement**: Always compare against baseline **with gate disabled** (same layout)
This forensics framework validates these rules operationally.
---
## Next Steps
1. **Immediate**: Use this template to diagnose Phase 64 retrospectively
2. **Phase 67b**: When attempting inline/unroll tuning, measure forensics first
3. **Phase 69+**: Before any -5% target structural changes, establish forensics baseline
---
## Artifacts
- `scripts/box/layout_tax_forensics_box.sh` — Measurement harness
- `results/layout_tax_forensics/` — Output logs and metrics
- Phase 64 retrospective (TBD)
---
**Status**: 🟢 READY FOR OPERATIONAL USE (as of Phase 68 completion)