Branch Prediction Optimization Investigation Report
Date: 2025-11-09
Author: Claude Code Analysis
Context: HAKMEM Phase 7 + Pool TLS Performance Investigation
Executive Summary
Problem: HAKMEM has 10.89% branch-miss rate vs System malloc's 3.5-3.9% (3x worse)
Root Cause Discovery: The problem is NOT just misprediction rate, but TOTAL BRANCH COUNT:
- HAKMEM: 17,098,340 branches (10.84% miss)
- System malloc: 2,006,962 branches (4.56% miss)
- HAKMEM executes 8.5x MORE branches than System malloc!
Impact:
- Branch misprediction overhead: ~1.8M misses × 15-20 cycles = 27-36M cycles wasted
- Total execution: 17M branches vs System's 2M → 8.5x more branches to execute
- Potential gain: 40-60% performance improvement with recommended optimizations
Critical Finding: HAKMEM_BUILD_RELEASE is NOT defined → All debug code is running in production builds!
1. Performance Hotspot Analysis
1.1 Perf Statistics (256B allocations, 100K iterations)
| Metric | HAKMEM | System malloc | Ratio |
|---|---|---|---|
| Branches | 17,098,340 | 2,006,962 | 8.5x |
| Branch-misses | 1,854,018 | 91,497 | 20.3x |
| Branch-miss rate | 10.84% | 4.56% | 2.4x |
| L1-dcache loads | 31,307,492 | 4,610,155 | 6.8x |
| L1-dcache misses | 1,063,117 | 44,773 | 23.7x |
| L1 miss rate | 3.40% | 0.97% | 3.5x |
| Cycles | ~83M | ~10M | 8.3x |
| Time | 0.103s | 0.003s | 34x slower |
Key insight: HAKMEM is not just suffering from poor branch prediction, but is executing 8.5x more branches than System malloc!
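The derived columns in the table can be re-computed from the raw counters. The sketch below is only a cross-check; the struct and function names are hypothetical and are not part of HAKMEM.

```c
#include <assert.h>
#include <stdio.h>

/* Raw perf counters copied from the table in section 1.1. */
typedef struct {
    double branches;
    double branch_misses;
    double l1_loads;
    double l1_misses;
} perf_counts;

static const perf_counts HAKMEM_PERF = {17098340.0, 1854018.0, 31307492.0, 1063117.0};
static const perf_counts SYSTEM_PERF = {2006962.0, 91497.0, 4610155.0, 44773.0};

/* Misses as a percentage of total events. */
double miss_rate_pct(double misses, double total) {
    assert(total > 0.0);
    return 100.0 * misses / total;
}

/* Print the derived columns so they can be compared with the table. */
void print_derived_columns(void) {
    printf("branch ratio:      %.1fx\n",
           HAKMEM_PERF.branches / SYSTEM_PERF.branches);
    printf("branch-miss ratio: %.1fx\n",
           HAKMEM_PERF.branch_misses / SYSTEM_PERF.branch_misses);
    printf("branch-miss rate:  %.2f%% vs %.2f%%\n",
           miss_rate_pct(HAKMEM_PERF.branch_misses, HAKMEM_PERF.branches),
           miss_rate_pct(SYSTEM_PERF.branch_misses, SYSTEM_PERF.branches));
    printf("L1 miss rate:      %.2f%% vs %.2f%%\n",
           miss_rate_pct(HAKMEM_PERF.l1_misses, HAKMEM_PERF.l1_loads),
           miss_rate_pct(SYSTEM_PERF.l1_misses, SYSTEM_PERF.l1_loads));
}
```

Calling print_derived_columns() from a small main() should reproduce the 8.5x and 20.3x ratios and the 10.84% / 4.56% miss rates listed in the table.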
1.2 Branch Count by Component
Source file analysis:
| File | Branch Statements | Critical Issues |
|---|---|---|
| tiny_alloc_fast.inc.h | 79 | 8 debug guards, 3 getenv() calls, SFC/SLL dual-layer |
| hak_free_api.inc.h | 38 | Pool TLS + Phase 7 dual dispatch, multiple lookups |
| hakmem_tiny_refill_p0.inc.h | ~40 | Complex precedence logic, 2 getenv() calls, validation |
| tiny_refill_opt.h | ~20 | Corruption checks, guard functions |
Total: ~177 branch statements in hot path vs System malloc's ~5 branches
2. Branch Count Analysis: Allocation Path
2.1 Fast Path: tiny_alloc_fast() (lines 454-497)
Layer 0: SFC (Super Front Cache) - Lines 177-200
// Branch 1-2: Check if SFC enabled (TLS cache check)
if (!sfc_check_done) { /* getenv() + init */ }  // COLD
if (sfc_is_enabled) {  // HOT
    // Branch 3: Try SFC
    void* ptr = sfc_alloc(class_idx);  // → 2 branches inside
    if (ptr != NULL) { /* hit */ }  // HOT
}
Branches: 5-6 (3 external + 2-3 in sfc_alloc)
Layer 1: SLL (TLS Freelist) - Lines 204-259
// Branch 4: Check if SLL enabled
if (g_tls_sll_enable) {  // HOT
    // Branch 5: Try SLL pop
    void* head = g_tls_sll_head[class_idx];
    if (head != NULL) {  // HOT
        // Branch 6-7: Corruption debug (ONLY if failfast ≥ 2)
        if (tiny_refill_failfast_level() >= 2) {  // DEBUG
            /* alignment validation (2 branches) */
        }
        // Branch 8-9: Validate next pointer
        void* next = *(void**)head;
        if (tiny_refill_failfast_level() >= 2) {  // DEBUG
            /* next pointer validation (2 branches) */
        }
        // Branch 10: Count update
        if (g_tls_sll_count[class_idx] > 0) {  // HOT
            g_tls_sll_count[class_idx]--;
        }
        // Branch 11: Profiling (DEBUG)
#if !HAKMEM_BUILD_RELEASE
        if (start) { /* rdtsc tracking */ }  // DEBUG
#endif
        return head;  // SUCCESS
    }
}
Branches: 11-15 (2 unconditional + 5-9 conditional debug)
Total allocation fast path: 16-21 branches vs System tcache's 1-2 branches
2.2 Refill Path: tiny_alloc_fast_refill() (lines 321-436)
Phase 2b capacity check:
// Branch 1: Check available capacity
int available_capacity = get_available_capacity(class_idx);
if (available_capacity <= 0) { return 0; }
Refill count precedence logic (lines 338-363):
// Branch 2: First-time init check
if (cnt == 0) {  // COLD (once per class per thread)
    // Branch 3-6: Complex precedence logic
    if (g_refill_count_class[class_idx] > 0) { /* ... */ }
    else if (class_idx <= 3 && g_refill_count_hot > 0) { /* ... */ }
    else if (class_idx >= 4 && g_refill_count_mid > 0) { /* ... */ }
    else if (g_refill_count_global > 0) { /* ... */ }
    // Branch 7-8: Clamping
    if (v < 8) v = 8;
    if (v > 256) v = 256;
}
Total refill path: 10-15 branches (one-time init + runtime checks)
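The precedence logic above can be isolated into one pure function that runs once per class, which keeps the branches out of the steady-state refill path. This is a minimal sketch: the function name, the parameter list, and the `fallback` argument are assumptions for illustration, and only the ordering and the 8-256 clamp come from the excerpt.

```c
#include <assert.h>

#define REFILL_MIN 8
#define REFILL_MAX 256

/* Resolve the refill count for one class, in precedence order:
 * per-class override > hot/mid group override > global override > fallback.
 * A value <= 0 means "not set". All names here are hypothetical. */
int resolve_refill_count(int class_idx, int per_class, int hot, int mid,
                         int global, int fallback) {
    assert(class_idx >= 0);
    int v = fallback;
    if (per_class > 0)                  v = per_class;
    else if (class_idx <= 3 && hot > 0) v = hot;
    else if (class_idx >= 4 && mid > 0) v = mid;
    else if (global > 0)                v = global;

    /* Clamp to the same range as the excerpt above. */
    if (v < REFILL_MIN) v = REFILL_MIN;
    if (v > REFILL_MAX) v = REFILL_MAX;
    return v;
}
```

Caching the result per class means the four-way chain and the two clamps are paid once, not on each refill.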
3. Branch Count Analysis: Free Path
3.1 Free Path: hak_free_at() (hak_free_api.inc.h)
Pool TLS dispatch (lines 81-110):
#ifdef HAKMEM_POOL_TLS_PHASE1
// Branch 1: Page boundary check
#if !HAKMEM_TINY_SAFE_FREE
if (((uintptr_t)header_addr & 0xFFF) == 0) {  // 0.1% frequency
    // Branch 2: Memory readable check (mincore syscall)
    if (!hak_is_memory_readable(header_addr)) { goto skip_pool_tls; }
}
#endif
// Branch 3: Magic check
if ((header & 0xF0) == POOL_MAGIC) {
    pool_free(ptr);
    goto done;
}
#endif
Branches: 3 (optimized with hybrid mincore)
Phase 7 dual-header dispatch (lines 112-167):
// Branch 4: Try 1-byte Tiny header
if (hak_tiny_free_fast_v2(ptr)) {  // → 3-5 branches inside
    goto done;
}
// Branch 5: Page boundary check for 16-byte header
if (offset_in_page < HEADER_SIZE) {
    // Branch 6: Memory readable check
    if (!hak_is_memory_readable(raw)) { goto slow_path; }
}
// Branch 7: 16-byte header magic check
if (hdr->magic == HAKMEM_MAGIC) {
    // Branch 8: Method dispatch
    if (hdr->method == ALLOC_METHOD_MALLOC) { /* ... */ }
}
Branches: 8-10 (including 3-5 inside hak_tiny_free_fast_v2)
Mid/L25 lookup (lines 196-206):
// Branch 9-10: Mid/L25 registry lookups
if (hak_pool_mid_lookup(ptr, &mid_sz)) { /* ... */ }
if (hak_l25_lookup(ptr, &l25_sz)) { /* ... */ }
Branches: 2
Total free path: 13-15 branches vs System tcache's 2-3 branches
4. Root Cause Analysis
4.1 CRITICAL: Debug Code in Production Builds
Finding: HAKMEM_BUILD_RELEASE is NOT defined anywhere in Makefile
Impact: All debug code runs in production:
| Debug Guard | Location | Frequency | Overhead |
|---|---|---|---|
| !HAKMEM_BUILD_RELEASE | tiny_alloc_fast.inc.h:171 | Every allocation | 2-3 branches |
| !HAKMEM_BUILD_RELEASE | tiny_alloc_fast.inc.h:191-196 | Every allocation | 1 branch + rdtsc |
| !HAKMEM_BUILD_RELEASE | tiny_alloc_fast.inc.h:250-256 | Every allocation | 1 branch + rdtsc |
| !HAKMEM_BUILD_RELEASE | tiny_alloc_fast.inc.h:324-326 | Every refill | 1 branch + rdtsc |
| !HAKMEM_BUILD_RELEASE | tiny_alloc_fast.inc.h:427-433 | Every refill | 1 branch + rdtsc |
| !HAKMEM_BUILD_RELEASE | tiny_free_fast_v2.inc.h:99-104 | Every free | 1 branch + capacity check |
| !HAKMEM_BUILD_RELEASE | hak_free_api.inc.h:118-120 | Every free | 1 function call |
| trc_refill_guard_enabled() | tiny_refill_opt.h:61-75 | Every splice | 1 branch + getenv |
Total overhead: 8-12 branches + 6 rdtsc calls + 2 getenv calls per allocation/free cycle
Expected impact of fixing: -40-50% total branches
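The reason an undefined HAKMEM_BUILD_RELEASE silently enables debug code is a preprocessor rule: inside #if, any identifier that is not a defined macro evaluates to 0, so `#if !HAKMEM_BUILD_RELEASE` is true. The sketch below demonstrates this and shows one possible guard, an explicit default in a central header. The helper function names are hypothetical.

```c
#include <stdio.h>

/* Assumption: a default like this would live in one shared config header.
 * It makes the build mode explicit instead of relying on the rule that an
 * undefined identifier evaluates to 0 inside #if. */
#ifndef HAKMEM_BUILD_RELEASE
#define HAKMEM_BUILD_RELEASE 0
#endif

/* Returns 1 when debug-only code is compiled in, 0 in a release build. */
int hakmem_debug_code_compiled_in(void) {
#if !HAKMEM_BUILD_RELEASE
    return 1;  /* taken whenever the flag is absent or set to 0 */
#else
    return 0;
#endif
}

/* Print the effective build mode once, e.g. at allocator init. */
void hakmem_report_build_mode(void) {
    printf("HAKMEM_BUILD_RELEASE=%d, debug code %s\n",
           HAKMEM_BUILD_RELEASE,
           hakmem_debug_code_compiled_in() ? "compiled in" : "compiled out");
}
```

Compiling with GCC's -Wundef also reports any #if that tests an undefined identifier, which would have flagged the missing flag at build time.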
4.2 HIGH: getenv() Calls in Hot Path
Finding: 3 lazy-initialized getenv() calls in hot path
| Location | Variable | Call Frequency | Fix |
|---|---|---|---|
| tiny_alloc_fast.inc.h:104 | HAKMEM_TINY_PROFILE | Every allocation (if -1) | Cache in global var at init |
| hakmem_tiny_refill_p0.inc.h:68 | HAKMEM_TINY_REFILL_COUNT_HOT | Every refill (class ≤ 3) | Pre-compute at init |
| hakmem_tiny_refill_p0.inc.h:78 | HAKMEM_TINY_REFILL_COUNT_MID | Every refill (class ≥ 4) | Pre-compute at init |
Impact:
- getenv() costs ~50-100 cycles (a linear scan of the environment with string comparisons; it runs in user space and makes no syscall)
- Adds 2-3 branches per call (null check, lazy init, result check)
- Total: 6-9 branches + 150-300 cycles on first access per thread
Expected impact of fixing: -10-15% branches, -5-10% cycles
4.3 MEDIUM: Complex Multi-Layer Cache
Current architecture:
Allocation: Size check (1 branch) → SFC, Layer 0 (5-6 branches) → SLL, Layer 1 (11-15 branches) → SuperSlab → Refill (20-30 branches)
System malloc tcache:
Allocation: Size check (1 branch) → TLS cache (1-2 branches) → ptmalloc2
Problem: HAKMEM has 3 layers (SFC → SLL → SuperSlab) vs System's 1 layer (tcache)
Why SFC is redundant:
- SLL already provides TLS freelist (same design as tcache)
- SFC adds 5-6 branches with minimal benefit
- Pre-warming (Phase 7 Task 3) already boosted SLL hit rate to 95%+
Expected impact of removing SFC: -5-10% branches, simpler code
4.4 MEDIUM: Excessive Validation in Hot Path
Corruption checks (lines 208-235 in tiny_alloc_fast.inc.h):
if (tiny_refill_failfast_level() >= 2) {  // getenv() call!
    // Alignment validation
    if (((uintptr_t)head % blk) != 0) {
        fprintf(stderr, "[TLS_SLL_CORRUPT] ...");
        abort();
    }
    // Next pointer validation
    if (next != NULL && ((uintptr_t)next % blk) != 0) {
        fprintf(stderr, "[ALLOC_POP_CORRUPT] ...");
        abort();
    }
}
Impact:
- 1 getenv() call per thread (lazy init) = ~100 cycles
- 5-7 branches per allocation when enabled
- fprintf/abort paths add cold code to the hot function and extra branches for the predictor to track
Solution: Move to compile-time flag (e.g., HAKMEM_DEBUG_VALIDATION) instead of runtime check
Expected impact: -5-10% branches when disabled
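A compile-time flag of the kind proposed here could look like the following sketch. HAKMEM_DEBUG_VALIDATION is the name suggested in the text; the `sll_pop` function, its signature, and the log tag are assumptions for illustration, not the existing HAKMEM code.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef HAKMEM_DEBUG_VALIDATION
#define HAKMEM_DEBUG_VALIDATION 0  /* off unless the build opts in */
#endif

/* Pop the head of a singly linked freelist whose blocks are blk bytes apart.
 * The alignment checks compile to nothing when HAKMEM_DEBUG_VALIDATION is 0,
 * so the release hot path keeps only the NULL test. */
void* sll_pop(void** head_slot, size_t blk) {
    void* head = *head_slot;
    if (head == NULL) return NULL;
    void* next = *(void**)head;  /* next pointer is stored in the block */
#if HAKMEM_DEBUG_VALIDATION
    if (((uintptr_t)head % blk) != 0 ||
        (next != NULL && ((uintptr_t)next % blk) != 0)) {
        fprintf(stderr, "[SLL_CORRUPT] head=%p next=%p\n", head, next);
        abort();
    }
#else
    (void)blk;  /* unused when validation is compiled out */
#endif
    *head_slot = next;
    return head;
}
```

A debug build would pass -DHAKMEM_DEBUG_VALIDATION=1; the trade-off is that validation can no longer be switched on at runtime without a rebuild.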
5. Optimization Recommendations (Ranked by Impact/Risk)
5.1 CRITICAL FIX: Enable Release Mode (0 risk, 40-50% impact)
Action: Add -DHAKMEM_BUILD_RELEASE=1 to production build flags
Implementation:
# Makefile
HAKMEM_RELEASE_FLAGS = -DHAKMEM_BUILD_RELEASE=1 -DNDEBUG -O3 -flto
release: CFLAGS += $(HAKMEM_RELEASE_FLAGS)
release: all
Changes enabled:
- Removes 8 !HAKMEM_BUILD_RELEASE guards → -8-12 branches
- Disables rdtsc profiling → -6 rdtsc calls
- Disables corruption validation → -5-10 branches
- Enables LTO and aggressive optimization
Expected result:
- -40-50% total branches (17M → 8.5-10M)
- -20-30% cycles (better inlining, constant folding)
- +30-50% performance (overall)
A/B test command:
# Before
make bench_random_mixed_hakmem
./bench_random_mixed_hakmem 100000 256 42
# After (clean first so every object is rebuilt with the flag)
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
./bench_random_mixed_hakmem 100000 256 42
5.2 HIGH PRIORITY: Pre-compute Env Vars at Init (Low risk, 10-15% impact)
Action: Move getenv() calls from hot path to global init
Current (lazy init in hot path):
// SLOW: the -1 check runs on every allocation/refill; getenv() runs on first use
if (g_tiny_profile_enabled == -1) {
    const char* env = getenv("HAKMEM_TINY_PROFILE");  // 50-100 cycles!
    g_tiny_profile_enabled = (env && *env && *env != '0') ? 1 : 0;
}
Fixed (pre-compute at init):
// hakmem_init.c (runs once at startup)
void hakmem_tiny_init_config(void) {
    // Profile mode
    const char* env = getenv("HAKMEM_TINY_PROFILE");
    g_tiny_profile_enabled = (env && *env && *env != '0') ? 1 : 0;
    // Refill counts
    const char* hot_env = getenv("HAKMEM_TINY_REFILL_COUNT_HOT");
    g_refill_count_hot = hot_env ? atoi(hot_env) : HAKMEM_TINY_REFILL_DEFAULT;
    const char* mid_env = getenv("HAKMEM_TINY_REFILL_COUNT_MID");
    g_refill_count_mid = mid_env ? atoi(mid_env) : HAKMEM_TINY_REFILL_DEFAULT;
}
Expected result:
- -6-9 branches (3 getenv lazy-init patterns)
- -150-300 cycles on first access per thread
- +5-10% performance (cleaner hot path)
Files to modify:
- core/tiny_alloc_fast.inc.h:104 - Remove lazy init
- core/hakmem_tiny_refill_p0.inc.h:66-84 - Remove lazy init
- core/hakmem_init.c - Add global init function
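Once parsing moves to init, it can afford to be stricter than atoi(), which returns 0 for non-numeric input and does not report errors. The helper below is a hypothetical sketch (its name and signature are not from HAKMEM): it takes the string returned by getenv(), rejects malformed values, and clamps the result.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a count from an environment string (as returned by getenv()).
 * Returns fallback when s is NULL, empty, or not a clean decimal integer;
 * otherwise returns the value clamped to [lo, hi]. Hypothetical helper. */
int parse_env_count(const char* s, int fallback, int lo, int hi) {
    if (s == NULL || *s == '\0') return fallback;
    char* end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);
    /* Reject overflow, no digits, or trailing garbage such as "32x". */
    if (errno != 0 || end == s || *end != '\0') return fallback;
    if (v < lo) return lo;
    if (v > hi) return hi;
    return (int)v;
}
```

In the init function this would be used as `g_refill_count_hot = parse_env_count(getenv("HAKMEM_TINY_REFILL_COUNT_HOT"), HAKMEM_TINY_REFILL_DEFAULT, 8, 256);`, assuming the 8-256 clamp from the refill path is the intended range.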
5.3 MEDIUM PRIORITY: Simplify Cache Layers (Medium risk, 5-10% impact)
Option A: Remove SFC Layer (Recommended)
Rationale:
- SFC adds 5-6 branches with minimal benefit
- SLL already provides TLS freelist (same as System tcache)
- Phase 7 Task 3 pre-warming gives SLL 95%+ hit rate
- Three cache layers = unnecessary complexity
Implementation:
// Remove SFC entirely, use only SLL
// (assumes class_idx >= 0; the oversized-request check is omitted here)
static inline void* tiny_alloc_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    // Layer 1: TLS freelist (SLL) - DIRECT ACCESS
    void* head = g_tls_sll_head[class_idx];
    if (head != NULL) {
        g_tls_sll_head[class_idx] = *(void**)head;
        g_tls_sll_count[class_idx]--;
        return head;  // 3 instructions, 1-2 branches!
    }
    // Refill from SuperSlab, then retry the pop once
    if (tiny_alloc_fast_refill(class_idx) > 0) {
        head = g_tls_sll_head[class_idx];
        if (head != NULL) {
            g_tls_sll_head[class_idx] = *(void**)head;
            g_tls_sll_count[class_idx]--;
            return head;
        }
    }
    return hak_tiny_alloc_slow(size, class_idx);
}
Expected result:
- -5-10% branches (remove SFC layer)
- Simpler code (easier to debug/maintain)
- Same or better performance (fewer layers = less overhead)
Option B: Unified TLS Cache (Higher risk, 10-20% impact)
Design: Single TLS cache with adaptive sizing (like mimalloc)
// Per-class TLS cache with adaptive capacity
typedef struct TinyTLSCache {
    void* head;
    uint32_t count;
    uint32_t capacity;  // Adaptive: 16-256
} TinyTLSCache;
static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];
Expected result:
- -10-20% branches (unified design)
- Better cache utilization (adaptive sizing)
- Matches System malloc architecture
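To make Option B concrete, the following sketch shows push, pop, and one possible capacity policy on such a struct. Everything beyond the three struct fields is an assumption: the TINY_NUM_CLASSES value, the function names, and the doubling policy are illustrative only, and `__thread` is the GCC/Clang thread-local extension already used in the report.

```c
#include <stddef.h>
#include <stdint.h>

#define TINY_NUM_CLASSES 8  /* assumption: the real value lives in HAKMEM headers */
#define TLS_CAP_MIN 16
#define TLS_CAP_MAX 256

typedef struct TinyTLSCache {
    void*    head;      /* LIFO freelist; next pointer stored in the block */
    uint32_t count;     /* blocks currently cached */
    uint32_t capacity;  /* adaptive limit, 16-256 */
} TinyTLSCache;

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];

/* Pop one block, or NULL when the cache is empty (1 branch). */
void* tls_cache_pop(int class_idx) {
    TinyTLSCache* c = &g_tls_cache[class_idx];
    void* head = c->head;
    if (head == NULL) return NULL;
    c->head = *(void**)head;
    c->count--;
    return head;
}

/* Push one block; returns 0 when the cache is full (1 branch). */
int tls_cache_push(int class_idx, void* ptr) {
    TinyTLSCache* c = &g_tls_cache[class_idx];
    if (c->count >= c->capacity) return 0;
    *(void**)ptr = c->head;
    c->head = ptr;
    c->count++;
    return 1;
}

/* Hypothetical policy: start at 16 and double up to 256, e.g. after a miss. */
void tls_cache_grow(int class_idx) {
    TinyTLSCache* c = &g_tls_cache[class_idx];
    uint32_t next = (c->capacity == 0) ? TLS_CAP_MIN : c->capacity * 2;
    c->capacity = (next > TLS_CAP_MAX) ? TLS_CAP_MAX : next;
}
```

Keeping head, count, and capacity in one struct places the state for a class on a single cache line, which is one reason a unified design may reduce L1 misses as well as branches.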
5.4 LOW PRIORITY: Branch Hint Tuning (Low risk, 2-5% impact)
Action: Optimize __builtin_expect hints based on profiling
Current issues:
- Some hints are incorrect (e.g., SFC disabled in production)
- Missing hints on hot branches
Recommended changes:
// Line 184: SFC is DISABLED in most production builds
if (__builtin_expect(sfc_is_enabled, 1)) { // WRONG!
// Fix:
if (__builtin_expect(sfc_is_enabled, 0)) { // Expect disabled
// Line 208: Corruption checks are rare in production
if (__builtin_expect(tiny_refill_failfast_level() >= 2, 0)) { // CORRECT
// Line 457: Size > 1KB is common in mixed workloads
if (__builtin_expect(class_idx < 0, 0)) { // May be wrong for some workloads
Expected result:
- -2-5% branch-misses (better prediction)
- +2-5% performance (reduced pipeline stalls)
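Hints are easier to audit when they go through named macros instead of raw __builtin_expect calls. The sketch below assumes GCC or Clang and uses hypothetical macro and function names; the `!!(x)` normalizes any truthy value to 1 so the hint compares against the expected constant correctly.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define HAK_LIKELY(x)   __builtin_expect(!!(x), 1)
#define HAK_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define HAK_LIKELY(x)   (x)  /* no-op fallback for other compilers */
#define HAK_UNLIKELY(x) (x)
#endif

/* Pop from a freelist; with a 95%+ hit rate, an empty list is the rare case. */
void* hinted_pop(void** head_slot) {
    void* head = *head_slot;
    if (HAK_UNLIKELY(head == NULL)) return NULL;
    *head_slot = *(void**)head;
    return head;
}
```

The hint only affects code layout and static prediction, so it is worth confirming each one against perf data for the workload in question before committing it.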
6. Expected Results Summary
6.1 Cumulative Impact (All Optimizations)
| Optimization | Branch Reduction | Cycle Reduction | Risk | Effort |
|---|---|---|---|---|
| Enable Release Mode | -40-50% | -20-30% | None | 1 line |
| Pre-compute Env Vars | -10-15% | -5-10% | Low | 1 day |
| Remove SFC Layer | -5-10% | -5-10% | Medium | 2 days |
| Branch Hint Tuning | -2-5% | -2-5% | Low | 1 day |
| TOTAL | -50-65% | -30-45% | Low | 4-5 days |
Projected final results:
- Branches: 17M → 6-8.5M (vs System's 2M)
- Branch-miss rate: 10.84% → 6-8% (vs System's 4.56%)
- Throughput: Current → +40-80% improvement
Target: 70-90% of System malloc performance (currently ~3% of System)
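The TOTAL row is smaller than the sum of the individual rows, which is consistent with reading each percentage as applying to the branches that remain after the previous step. The report does not state this, so the sketch below is an assumption about how the figures combine; the function name is hypothetical.

```c
#include <assert.h>

/* Fraction of branches remaining after applying each reduction in sequence.
 * reductions[i] is a fraction in [0, 1), e.g. 0.40 for "-40%". */
double remaining_fraction(const double* reductions, int n) {
    double remaining = 1.0;
    for (int i = 0; i < n; i++) {
        assert(reductions[i] >= 0.0 && reductions[i] < 1.0);
        remaining *= (1.0 - reductions[i]);
    }
    return remaining;
}
```

With the low ends of the four rows (0.40, 0.10, 0.05, 0.02) about 50% of branches remain, and with the high ends (0.50, 0.15, 0.10, 0.05) about 36% remain. Applied to 17.1M branches, that gives roughly 8.6M and 6.2M, in line with the projected 6-8.5M.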
6.2 Quick Win: Release Mode Only
Minimal change, maximum impact:
# Add one line to Makefile
CFLAGS += -DHAKMEM_BUILD_RELEASE=1
# Rebuild
make clean && make bench_random_mixed_hakmem
# Test
./bench_random_mixed_hakmem 100000 256 42
Expected:
- -40-50% branches (17M → 8.5-10M)
- +30-50% performance (immediate)
- 0 code changes (just a flag)
7. A/B Test Plan
7.1 Baseline Measurement
# Measure current performance
perf stat -e branch-misses,branches,cycles,instructions \
./bench_random_mixed_hakmem 100000 256 42
# Output:
# branches: 17,098,340
# branch-misses: 1,854,018 (10.84%)
# cycles: ~83M
7.2 Test 1: Release Mode
# Build with release flag
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
# Measure
perf stat -e branch-misses,branches,cycles,instructions \
./bench_random_mixed_hakmem 100000 256 42
# Expected:
# branches: ~9M (-47%)
# branch-misses: ~700K (7.8%)
# cycles: ~60M (-27%)
7.3 Test 2: Release + Pre-compute Env
# Implement env var pre-computation (see 5.2)
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
# Expected:
# branches: ~8M (-53%)
# branch-misses: ~600K (7.5%)
# cycles: ~55M (-33%)
7.4 Test 3: Release + Pre-compute + Remove SFC
# Remove SFC layer (see 5.3)
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
# Expected:
# branches: ~7M (-59%)
# branch-misses: ~500K (7.1%)
# cycles: ~50M (-40%)
7.5 Success Criteria
| Metric | Current | Target | Stretch Goal |
|---|---|---|---|
| Branches | 17M | <10M | <8M |
| Branch-miss rate | 10.84% | <8% | <7% |
| vs System malloc | 8.5x slower | <5x slower | <3x slower |
| Throughput | 1.07M ops/s | >2M ops/s | >3M ops/s |
8. Comparison with System Malloc Strategy
8.1 System malloc tcache (glibc 2.26+)
Design:
// Simplified sketch of the tcache fast paths, not verbatim glibc source
// Allocation (2-3 instructions, 1-2 branches)
void* tcache_get(size_t size) {
    int tc_idx = csize2tidx(size);  // Size to index (no branch)
    tcache_entry* e = tcache->entries[tc_idx];
    if (e != NULL) {  // BRANCH 1
        tcache->entries[tc_idx] = e->next;
        tcache->counts[tc_idx]--;
        return (void*)e;
    }
    return _int_malloc(av, size);  // Slow path (av: the thread's arena)
}
// Free (2 instructions, 1 branch)
void tcache_put(void* ptr, size_t size) {
    int tc_idx = csize2tidx(size);  // Size to index (no branch)
    if (tcache->counts[tc_idx] < mp_.tcache_count) {  // BRANCH 1 (per-bin limit, default 7)
        tcache_entry* e = (tcache_entry*)ptr;
        e->next = tcache->entries[tc_idx];
        tcache->entries[tc_idx] = e;
        tcache->counts[tc_idx]++;
    }
    // Else: fall back to _int_free
}
Key insights:
- 1-2 branches total (vs HAKMEM's 16-21)
- No validation in fast path
- No debug guards in production
- Single TLS cache layer (vs HAKMEM's 3 layers)
- No getenv() calls in the hot path (limits such as the per-bin count are read once at startup through tunables)
8.2 mimalloc
Design:
// Allocation (3-4 instructions, 1-2 branches)
// Simplified sketch; helper names are illustrative, not the exact mimalloc API
void* mi_malloc(size_t size) {
    mi_page_t* page = _mi_page_fast();  // TLS page cache
    if (mi_likely(page != NULL)) {  // BRANCH 1
        void* p = page->free;
        if (mi_likely(p != NULL)) {  // BRANCH 2
            page->free = mi_ptr_decode(p);
            return p;
        }
    }
    return mi_malloc_generic(NULL, size);  // Slow path
}
Key insights:
- 2 branches total (vs HAKMEM's 16-21)
- Inline header metadata (similar to HAKMEM Phase 7)
- No debug overhead in release builds
- Simple TLS structure (page + free pointer)
9. Conclusion
Root Cause: HAKMEM executes 8.5x more branches than System malloc due to:
- Debug code running in production (HAKMEM_BUILD_RELEASE not defined)
- Complex multi-layer cache (SFC → SLL → SuperSlab)
- Runtime env var checks in hot path
- Excessive validation and profiling
Immediate Action (1 line change):
CFLAGS += -DHAKMEM_BUILD_RELEASE=1 # Expected: +30-50% performance
Full Fix (4-5 days work):
- Enable release mode
- Pre-compute env vars at init
- Remove redundant SFC layer
- Optimize branch hints
Expected Result:
- -50-65% branches (17M → 6-8.5M)
- -30-45% cycles
- +40-80% throughput
- 70-90% of System malloc performance (vs current 3%)
Next Steps:
- ✅ Enable HAKMEM_BUILD_RELEASE=1 (immediate)
- Run A/B tests (measure impact)
- Implement env var pre-computation (1 day)
- Evaluate SFC removal (2 days)
- Re-measure and iterate
Appendix A: Detailed Branch Inventory
Allocation Path (tiny_alloc_fast.inc.h)
| Line | Branch | Frequency | Type | Fix |
|---|---|---|---|---|
| 177-182 | SFC check done | Cold (once/thread) | Init | Pre-compute |
| 184 | SFC enabled | Hot | Runtime | Remove SFC |
| 186 | SFC ptr != NULL | Hot | Fast path | Keep (necessary) |
| 204 | SLL enabled | Hot | Runtime | Make compile-time |
| 206 | SLL head != NULL | Hot | Fast path | Keep (necessary) |
| 208 | Failfast ≥ 2 | Hot | Debug | Remove in release |
| 211-216 | Alignment check | Hot | Debug | Remove in release |
| 225 | Failfast ≥ 2 | Hot | Debug | Remove in release |
| 227-234 | Next validation | Hot | Debug | Remove in release |
| 241 | Count > 0 | Hot | Unnecessary | Remove |
| 171-173 | Profile enabled | Hot | Debug | Remove in release |
| 250-256 | Profile rdtsc | Hot | Debug | Remove in release |
Total: 16-21 branches → Target: 2-3 branches (roughly 80-90% reduction)
Refill Path (hakmem_tiny_refill_p0.inc.h)
| Line | Branch | Frequency | Type | Fix |
|---|---|---|---|---|
| 33 | !g_use_superslab | Cold | Config | Remove check |
| 41 | !tls->ss | Hot | Refill | Keep (necessary) |
| 46 | !meta | Hot | Refill | Keep (necessary) |
| 56 | room <= 0 | Hot | Capacity | Keep (necessary) |
| 66-73 | Hot override | Cold | Env var | Pre-compute |
| 76-83 | Mid override | Cold | Env var | Pre-compute |
| 116-119 | Remote drain | Hot | Optimization | Keep |
| 138 | Capacity check | Hot | Refill | Keep (necessary) |
Total: 10-15 branches → Target: 5-8 branches (40-50% reduction)
End of Report