# Phase 64: Backend Pruning via Compile-time Constants (DCE)

**Status**: ❌ NO-GO (Regression: -4.05%)

## Executive Summary

Phase 64 attempted to optimize hakmem by making the unused backend allocation paths (MID_V3, POOL_V2) unreachable at compile time, enabling LTO Dead Code Elimination (DCE) to remove them entirely from the binary. The target was a **+5-10% performance gain** via code-size reduction and improved I-cache locality.

**Result**: The strategy achieved a significant instruction reduction (-26%, from 3.87B to 2.87B per benchmark run) but produced a **-4.05% throughput regression** on the Mixed workload, failing the +2.0% GO threshold.
## Implementation

### Build Flags Added
- `HAKMEM_FAST_PROFILE_PRUNE_BACKENDS=1`: Master switch that activates backend pruning

### Code Changes
1. **hak_alloc_api.inc.h** (lines 83-120): Wrapped the MID_V3 alloc dispatch in `#if !HAKMEM_FAST_PROFILE_PRUNE_BACKENDS` (see the consolidated sketch after this list)
2. **hak_free_api.inc.h** (lines 242-283): Wrapped the MID_V3 free dispatch (both SSOT=1 and SSOT=0 paths)
3. **mid_hotbox_v3_env_box.h** (lines 15-33): Added compile-time constant `mid_v3_enabled()` returning 0
4. **pool_config_box.h** (lines 20-33): Added compile-time constant `hak_pool_v2_enabled()` returning 0
5. **learner_env_box.h** (lines 18-20): Added the pruning flag to the learning-layer disable condition
6. **Makefile** (lines 672-680): Added target `bench_random_mixed_hakmem_fast_pruned`
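
A consolidated, self-contained sketch of the gating pattern behind items 1-4. The flag name and `mid_v3_enabled()` come from this report; the stand-in backends, the size cutoffs, and `hak_alloc_sketch()` are illustrative assumptions, not hakmem's actual source.

```c
#include <stddef.h>
#include <stdlib.h>

/* Master switch; the pruned benchmark target builds with
 * -DHAKMEM_FAST_PROFILE_PRUNE_BACKENDS=1. Default off here. */
#ifndef HAKMEM_FAST_PROFILE_PRUNE_BACKENDS
#define HAKMEM_FAST_PROFILE_PRUNE_BACKENDS 0
#endif

/* Stand-in backends so the sketch compiles on its own. */
void* tiny_alloc(size_t sz)   { return malloc(sz); }
void* mid_v3_alloc(size_t sz) { return malloc(sz); }

/* With pruning enabled this folds to the constant 0, so every branch it
 * guards is provably dead and LTO can drop the MID_V3 code behind it. */
static inline int mid_v3_enabled(void) {
#if HAKMEM_FAST_PROFILE_PRUNE_BACKENDS
    return 0;
#else
    return 1;   /* the real env box consults runtime configuration here */
#endif
}

/* Dispatch wrapped as in items 1-2: in the pruned build the MID_V3 arm
 * disappears at preprocessing time and every request takes the Tiny path. */
void* hak_alloc_sketch(size_t sz) {
#if !HAKMEM_FAST_PROFILE_PRUNE_BACKENDS
    if (mid_v3_enabled() && sz > 256 && sz <= 768)   /* hypothetical mid-size lane */
        return mid_v3_alloc(sz);
#endif
    return tiny_alloc(sz);
}
```

`hak_pool_v2_enabled()` (item 4) follows the same constant-folding shape for POOL_V2.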
## A/B Test Results (10 runs each)

### Baseline: bench_random_mixed_hakmem_minimal
```
Run 1: 60,022,164 ops/s
Run 2: 57,772,821 ops/s
Run 3: 59,633,856 ops/s
Run 4: 60,658,837 ops/s
Run 5: 58,595,231 ops/s
Run 6: 59,376,766 ops/s
Run 7: 58,661,246 ops/s
Run 8: 58,110,953 ops/s
Run 9: 58,952,756 ops/s
Run 10: 59,331,245 ops/s

Average: 59,111,588 ops/s
Median: 59,142,000 ops/s
Stdev: 875,766 ops/s
Range: 57,772,821 - 60,658,837 ops/s
```
### Treatment: bench_random_mixed_hakmem_fast_pruned
```
Run 1: 55,339,952 ops/s
Run 2: 56,847,444 ops/s
Run 3: 58,161,283 ops/s
Run 4: 58,645,002 ops/s
Run 5: 55,615,903 ops/s
Run 6: 55,984,988 ops/s
Run 7: 56,979,027 ops/s
Run 8: 55,851,054 ops/s
Run 9: 57,196,418 ops/s
Run 10: 56,529,372 ops/s

Average: 56,715,044 ops/s
Median: 56,688,408 ops/s
Stdev: 1,082,600 ops/s
Range: 55,339,952 - 58,645,002 ops/s
```
### Performance Delta
- **Average Change**: -4.05% ❌ (derivation below)
- **Median Change**: -4.15% ❌
- **GO Threshold**: +2.0%
- **Verdict**: NO-GO (regression exceeds the negative tolerance)
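
Derivation from the run averages and medians above:

```
Average: 56,715,044 / 59,111,588 - 1 = -4.05%
Median:  56,688,408 / 59,142,000 - 1 = -4.15%
```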
## Performance Counter Analysis (perf stat, 5 runs each)

### Baseline: bench_random_mixed_hakmem_minimal
```
Cycles: 1,703,775,790 (baseline)
Instructions: 3,866,028,123 (baseline)
IPC: 2.27 insns/cycle
Branches: 945,213,995
Branch-misses: 23,682,440 (2.51% of branches)
Cache-misses: 420,262
```

### Treatment: bench_random_mixed_hakmem_fast_pruned
```
Cycles: 1,608,678,889 (-5.6% vs baseline)
Instructions: 2,870,328,700 (-25.8% vs baseline) ✓
IPC: 1.78 insns/cycle (-21.6%)
Branches: 629,997,382 (-33.3% vs baseline) ✓
Branch-misses: 23,622,772 (-0.3% in count; miss rate 3.75% vs 2.51% baseline)
Cache-misses: 501,446 (+19.3% vs baseline)
```
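
For reference, the derived figures above follow directly from the raw counters (IPC = instructions / cycles, miss rate = branch-misses / branches):

```
Baseline IPC:        3,866,028,123 / 1,703,775,790 ≈ 2.27
Treatment IPC:       2,870,328,700 / 1,608,678,889 ≈ 1.78

Baseline miss rate:  23,682,440 / 945,213,995 ≈ 2.51%
Treatment miss rate: 23,622,772 / 629,997,382 ≈ 3.75%
```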
## Analysis

### Success: Instruction Reduction
The compile-time backend pruning achieved excellent dead code elimination:
- **-26% instruction count**: Massive reduction from 3.87B to 2.87B instructions per run
- **-33% branch count**: Reduction from 945M to 630M branches per run
- **-5.6% cycle count**: Modest cycle reduction despite heavy pruning

This confirms that LTO DCE is working correctly and removing the MID_V3 and POOL_V2 code paths.

### Failure: Throughput Regression
Despite the massive code reduction, throughput regressed by 4.05%, indicating:

**Hypothesis 1: Bad I-Cache Locality**
- The treatment executes fewer branches (-33%) but has a higher branch-miss rate (3.75% vs 2.51%)
- This suggests the code layout produced by the LTO link is worse
- The remaining critical paths may have been scattered across memory
- Similar to the Phase 62A "layout tax" pattern

**Hypothesis 2: Critical Path Changed**
- IPC dropped from 2.27 to 1.78 instructions/cycle (-21.6%)
- This indicates the CPU executes the pruned code less efficiently
- The cache hierarchy may be stressed despite fewer instructions (confirmed: +19% cache-misses)
- Reduced instruction diversity may confuse branch prediction

**Hypothesis 3: Microarchitecture Sensitivity**
- The pruned code path may have different memory access patterns
- Allocations now route through a different backend mix (everything lands in Tiny)
- Contention on TLS caches may be higher without MID_V3 relieving pressure
### Why +5-10% Didn't Materialize
The expected +5-10% gain assumed:
1. Code size reduction → I-cache improvement ✗ (negated by the layout tax)
2. Fewer branches → Better prediction ✗ (branch-miss rate increased)
3. Simplified dispatch logic → Reduced overhead ✗ (IPC decreased)

The Mixed workload (257-768B allocations) benefits from MID_V3's specialized TLS lane caching. With MID_V3 removed, all of those allocations now route through the Tiny fast path, which:
- May reduce TLS cache efficiency
- Increases contention on shared structures
- Affects memory layout and I-cache behavior
## Related Patterns
### Phase 62A: "Layout Tax" Pattern
- Phase 62A (C7 ULTRA Alloc DepChain Trim): -0.71% regression
- Both phases showed code-size improvements alongside IPC/layout deterioration
- This confirms that LTO + function-level optimizations can create a layout tax

### Successful Similar Phases
- None found that achieved code elimination and a performance gain simultaneously
## Recommendations
### Path Forward Options

**Option A: Abandon Backend Pruning (Recommended)**
- The layout-tax pattern is consistent across phases
- Removing code paths without architectural restructuring doesn't help
- Focus on algorithmic improvements instead

**Option B: Research Backend Pruning + Linker Optimizations**
- Try `--gc-sections` + section reordering (Phase 18 was NO-GO, but in a different context); function-granularity GC requires compiling with `-ffunction-sections`/`-fdata-sections`
- Experiment with PGO-guided section layout
- May require a significant research investment

**Option C: Profile-Guided Backend Selection**
- Instead of compile-time removal, use runtime profiling to select the optimal backend
- Keep both MID_V3 and Tiny, but bias allocation based on the profile (a rough sketch follows these bullets)
- Trades size for flexibility (likely not worth it)
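
One possible shape of Option C, as a heavily hedged sketch: a runtime counter biases the 257-768B range toward MID_V3 only once that range has proven hot. Everything here (names, threshold, the stand-in backends) is invented for illustration; it is not hakmem code and not compiler PGO.

```c
#include <stddef.h>
#include <stdlib.h>

static void* tiny_alloc(size_t sz)   { return malloc(sz); }  /* stand-in */
static void* mid_v3_alloc(size_t sz) { return malloc(sz); }  /* stand-in */

/* Per-thread count of mid-size requests seen so far (a crude "profile"). */
static _Thread_local unsigned long mid_range_hits;

void* biased_alloc(size_t sz) {
    if (sz > 256 && sz <= 768) {
        /* Only divert to the MID_V3 lane once the mid-size range is
         * demonstrably hot; cold workloads keep the single Tiny path. */
        if (++mid_range_hits > 4096)
            return mid_v3_alloc(sz);
    }
    return tiny_alloc(sz);
}
```

Whether such a bias recovers MID_V3's TLS-lane benefit without reintroducing the dispatch overhead would itself need an A/B run.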
## Conclusion

Phase 64 successfully implemented compile-time backend pruning and achieved a 26% instruction reduction through LTO DCE. However, the strategy backfired due to the layout tax and microarchitecture sensitivity, producing a -4.05% throughput regression.

This phase validates an important insight: **code elimination alone is insufficient**. Hakmem's performance depends on:
1. **Hot path efficiency** (IPC, branch prediction)
2. **Memory layout** (I-cache, D-cache)
3. **Architectural symmetry** (balanced pathways reduce contention)

Removing entire backends disrupts this balance, despite reducing the instruction count.

---

**Artifacts**:
- Baseline: `bench_random_mixed_hakmem_minimal` (BENCH_MINIMAL=1)
- Treatment: `bench_random_mixed_hakmem_fast_pruned` (BENCH_MINIMAL=1 + FAST_PROFILE_FIXED=1 + FAST_PROFILE_PRUNE_BACKENDS=1)

**Next Phase**: Return to algorithm-level optimizations, or investigate why IPC dropped despite the simpler code.
|