# Phase 64: Backend Pruning via Compile-time Constants (DCE)

**Status**: ❌ NO-GO (Regression: -4.05%)

## Executive Summary

Phase 64 attempted to optimize hakmem by making the unused backend allocation paths (MID_V3, POOL_V2) unreachable at compile time, enabling LTO Dead Code Elimination (DCE) to remove them entirely from the binary. The target was a **+5-10% performance gain** via code-size reduction and improved I-cache locality.

**Result**: The strategy achieved a significant instruction reduction (-26%, from 3.87B to 2.87B per benchmark run) but produced a **-4.05% throughput regression** on the Mixed workload, failing the +2.0% GO threshold.
## Implementation

### Build Flags Added
- `HAKMEM_FAST_PROFILE_PRUNE_BACKENDS=1`: Master switch that activates backend pruning

### Code Changes
1. **hak_alloc_api.inc.h** (lines 83-120): Wrapped the MID_V3 alloc dispatch in `#if !HAKMEM_FAST_PROFILE_PRUNE_BACKENDS` (see the consolidated sketch after this list)
2. **hak_free_api.inc.h** (lines 242-283): Wrapped the MID_V3 free dispatch (both SSOT=1 and SSOT=0 paths)
3. **mid_hotbox_v3_env_box.h** (lines 15-33): Added compile-time constant `mid_v3_enabled()` returning 0
4. **pool_config_box.h** (lines 20-33): Added compile-time constant `hak_pool_v2_enabled()` returning 0
5. **learner_env_box.h** (lines 18-20): Added the pruning flag to the learning-layer disable condition
6. **Makefile** (lines 672-680): Added target `bench_random_mixed_hakmem_fast_pruned`
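
A consolidated, self-contained sketch of the gating pattern behind items 1-4. The flag name and `mid_v3_enabled()` come from this report; the stand-in backends, the size cutoffs, and `hak_alloc_sketch()` are illustrative assumptions, not hakmem's actual source.

```c
#include <stddef.h>
#include <stdlib.h>

/* Master switch; the pruned benchmark target builds with
 * -DHAKMEM_FAST_PROFILE_PRUNE_BACKENDS=1. Default off here. */
#ifndef HAKMEM_FAST_PROFILE_PRUNE_BACKENDS
#define HAKMEM_FAST_PROFILE_PRUNE_BACKENDS 0
#endif

/* Stand-in backends so the sketch compiles on its own. */
void* tiny_alloc(size_t sz)   { return malloc(sz); }
void* mid_v3_alloc(size_t sz) { return malloc(sz); }

/* With pruning enabled this folds to the constant 0, so every branch it
 * guards is provably dead and LTO can drop the MID_V3 code behind it. */
static inline int mid_v3_enabled(void) {
#if HAKMEM_FAST_PROFILE_PRUNE_BACKENDS
    return 0;
#else
    return 1;   /* the real env box consults runtime configuration here */
#endif
}

/* Dispatch wrapped as in items 1-2: in the pruned build the MID_V3 arm
 * disappears at preprocessing time and every request takes the Tiny path. */
void* hak_alloc_sketch(size_t sz) {
#if !HAKMEM_FAST_PROFILE_PRUNE_BACKENDS
    if (mid_v3_enabled() && sz > 256 && sz <= 768)   /* hypothetical mid-size lane */
        return mid_v3_alloc(sz);
#endif
    return tiny_alloc(sz);
}
```

`hak_pool_v2_enabled()` (item 4) follows the same constant-folding shape for POOL_V2.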
## A/B Test Results (10 runs each)

### Baseline: bench_random_mixed_hakmem_minimal
```
Run 1: 60,022,164 ops/s
Run 2: 57,772,821 ops/s
Run 3: 59,633,856 ops/s
Run 4: 60,658,837 ops/s
Run 5: 58,595,231 ops/s
Run 6: 59,376,766 ops/s
Run 7: 58,661,246 ops/s
Run 8: 58,110,953 ops/s
Run 9: 58,952,756 ops/s
Run 10: 59,331,245 ops/s

Average: 59,111,588 ops/s
Median: 59,142,000 ops/s
Stdev: 875,766 ops/s
Range: 57,772,821 - 60,658,837 ops/s
```
### Treatment: bench_random_mixed_hakmem_fast_pruned
```
Run 1: 55,339,952 ops/s
Run 2: 56,847,444 ops/s
Run 3: 58,161,283 ops/s
Run 4: 58,645,002 ops/s
Run 5: 55,615,903 ops/s
Run 6: 55,984,988 ops/s
Run 7: 56,979,027 ops/s
Run 8: 55,851,054 ops/s
Run 9: 57,196,418 ops/s
Run 10: 56,529,372 ops/s

Average: 56,715,044 ops/s
Median: 56,688,408 ops/s
Stdev: 1,082,600 ops/s
Range: 55,339,952 - 58,645,002 ops/s
```
### Performance Delta
- **Average Change**: -4.05% ❌ (derivation below)
- **Median Change**: -4.15% ❌
- **GO Threshold**: +2.0%
- **Verdict**: NO-GO (regression exceeds the negative tolerance)
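
Derivation from the run averages and medians above:

```
Average: 56,715,044 / 59,111,588 - 1 = -4.05%
Median:  56,688,408 / 59,142,000 - 1 = -4.15%
```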
## Performance Counter Analysis (perf stat, 5 runs each)

### Baseline: bench_random_mixed_hakmem_minimal
```
Cycles: 1,703,775,790 (baseline)
Instructions: 3,866,028,123 (baseline)
IPC: 2.27 insns/cycle
Branches: 945,213,995
Branch-misses: 23,682,440 (2.51% of branches)
Cache-misses: 420,262
```

### Treatment: bench_random_mixed_hakmem_fast_pruned
```
Cycles: 1,608,678,889 (-5.6% vs baseline)
Instructions: 2,870,328,700 (-25.8% vs baseline) ✓
IPC: 1.78 insns/cycle (-21.6%)
Branches: 629,997,382 (-33.3% vs baseline) ✓
Branch-misses: 23,622,772 (-0.3% in count; miss rate 3.75% vs 2.51% baseline)
Cache-misses: 501,446 (+19.3% vs baseline)
```
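
For reference, the derived figures above follow directly from the raw counters (IPC = instructions / cycles, miss rate = branch-misses / branches):

```
Baseline IPC:        3,866,028,123 / 1,703,775,790 ≈ 2.27
Treatment IPC:       2,870,328,700 / 1,608,678,889 ≈ 1.78

Baseline miss rate:  23,682,440 / 945,213,995 ≈ 2.51%
Treatment miss rate: 23,622,772 / 629,997,382 ≈ 3.75%
```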
## Analysis

### Success: Instruction Reduction
The compile-time backend pruning achieved excellent dead code elimination:
- **-26% instruction count**: Massive reduction from 3.87B to 2.87B instructions per run
- **-33% branch count**: Reduction from 945M to 630M branches per run
- **-5.6% cycle count**: Modest cycle reduction despite heavy pruning

This confirms that LTO DCE is working correctly and removing the MID_V3 and POOL_V2 code paths.

### Failure: Throughput Regression
Despite the massive code reduction, throughput regressed by 4.05%, indicating:

**Hypothesis 1: Bad I-Cache Locality**
- The treatment executes fewer branches (-33%) but has a higher branch-miss rate (3.75% vs 2.51%)
- This suggests the code layout produced by the LTO link is worse
- The remaining critical paths may have been scattered across memory
- Similar to the Phase 62A "layout tax" pattern

**Hypothesis 2: Critical Path Changed**
- IPC dropped from 2.27 to 1.78 instructions/cycle (-21.6%)
- This indicates the CPU executes the pruned code less efficiently
- The cache hierarchy may be stressed despite fewer instructions (confirmed: +19% cache-misses)
- Reduced instruction diversity may confuse branch prediction

**Hypothesis 3: Microarchitecture Sensitivity**
- The pruned code path may have different memory access patterns
- Allocations now route through a different backend mix (everything lands in Tiny)
- Contention on TLS caches may be higher without MID_V3 relieving pressure
### Why +5-10% Didn't Materialize
The expected +5-10% gain assumed:
1. Code size reduction → I-cache improvement ✗ (negated by the layout tax)
2. Fewer branches → Better prediction ✗ (branch-miss rate increased)
3. Simplified dispatch logic → Reduced overhead ✗ (IPC decreased)

The Mixed workload (257-768B allocations) benefits from MID_V3's specialized TLS lane caching. With MID_V3 removed, all of those allocations now route through the Tiny fast path, which:
- May reduce TLS cache efficiency
- Increases contention on shared structures
- Affects memory layout and I-cache behavior
## Related Patterns
### Phase 62A: "Layout Tax" Pattern
- Phase 62A (C7 ULTRA Alloc DepChain Trim): -0.71% regression
- Both phases showed code-size improvements alongside IPC/layout deterioration
- This confirms that LTO + function-level optimizations can create a layout tax

### Successful Similar Phases
- None found that achieved code elimination and a performance gain simultaneously
## Recommendations
### Path Forward Options

**Option A: Abandon Backend Pruning (Recommended)**
- The layout-tax pattern is consistent across phases
- Removing code paths without architectural restructuring doesn't help
- Focus on algorithmic improvements instead

**Option B: Research Backend Pruning + Linker Optimizations**
- Try `--gc-sections` + section reordering (Phase 18 was NO-GO, but in a different context); function-granularity GC requires compiling with `-ffunction-sections`/`-fdata-sections`
- Experiment with PGO-guided section layout
- May require a significant research investment

**Option C: Profile-Guided Backend Selection**
- Instead of compile-time removal, use runtime profiling to select the optimal backend
- Keep both MID_V3 and Tiny, but bias allocation based on the profile (a rough sketch follows these bullets)
- Trades size for flexibility (likely not worth it)
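
One possible shape of Option C, as a heavily hedged sketch: a runtime counter biases the 257-768B range toward MID_V3 only once that range has proven hot. Everything here (names, threshold, the stand-in backends) is invented for illustration; it is not hakmem code and not compiler PGO.

```c
#include <stddef.h>
#include <stdlib.h>

static void* tiny_alloc(size_t sz)   { return malloc(sz); }  /* stand-in */
static void* mid_v3_alloc(size_t sz) { return malloc(sz); }  /* stand-in */

/* Per-thread count of mid-size requests seen so far (a crude "profile"). */
static _Thread_local unsigned long mid_range_hits;

void* biased_alloc(size_t sz) {
    if (sz > 256 && sz <= 768) {
        /* Only divert to the MID_V3 lane once the mid-size range is
         * demonstrably hot; cold workloads keep the single Tiny path. */
        if (++mid_range_hits > 4096)
            return mid_v3_alloc(sz);
    }
    return tiny_alloc(sz);
}
```

Whether such a bias recovers MID_V3's TLS-lane benefit without reintroducing the dispatch overhead would itself need an A/B run.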
## Conclusion

Phase 64 successfully implemented compile-time backend pruning and achieved a 26% instruction reduction through LTO DCE. However, the strategy backfired due to the layout tax and microarchitecture sensitivity, producing a -4.05% throughput regression.

This phase validates an important insight: **code elimination alone is insufficient**. Hakmem's performance depends on:
1. **Hot path efficiency** (IPC, branch prediction)
2. **Memory layout** (I-cache, D-cache)
3. **Architectural symmetry** (balanced pathways reduce contention)

Removing entire backends disrupts this balance, despite reducing the instruction count.

---

**Artifacts**:
- Baseline: `bench_random_mixed_hakmem_minimal` (BENCH_MINIMAL=1)
- Treatment: `bench_random_mixed_hakmem_fast_pruned` (BENCH_MINIMAL=1 + FAST_PROFILE_FIXED=1 + FAST_PROFILE_PRUNE_BACKENDS=1)

**Next Phase**: Return to algorithm-level optimizations, or investigate why IPC dropped despite the simpler code.
|