# Branch Prediction Optimization - Quick Start Guide

**TL;DR:** HAKMEM has a 10.84% branch-miss rate (3x worse than System malloc's 3.5%) because it executes **8.5x MORE branches** (17M vs 2M) due to debug code running in production.

---

## Immediate Fix (1 Minute)

**Add this ONE flag to your build command:**

```bash
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
```

**Expected result:** +30-50% performance improvement

---

## Quick Win A/B Test

### Before (Current)

```bash
make clean
make bench_random_mixed_hakmem
perf stat -e branch-misses,branches ./bench_random_mixed_hakmem 100000 256 42

# Results:
#   branches:      17,098,340
#   branch-misses: 1,854,018 (10.84%)
#   time:          0.103s
```

### After (Release Mode)

```bash
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
perf stat -e branch-misses,branches ./bench_random_mixed_hakmem 100000 256 42

# Expected:
#   branches:      ~9M (-47%)
#   branch-misses: ~700K (7.8%)
#   time:          ~0.060s (42% less time)
```

---

## Top 4 Optimizations (Ranked by Impact/Risk)

### 1. Enable Release Mode ⚡ (0 risk, 40-50% impact)

**Action:** Add `-DHAKMEM_BUILD_RELEASE=1` to build flags

**Why:** Currently ALL debug code runs in production:
- 8 debug guards (`!HAKMEM_BUILD_RELEASE`)
- 6 rdtsc profiling calls
- 5-10 corruption validation branches
- All removed with one flag!

**Effort:** 1 line change
**Impact:** -40-50% branches, +30-50% performance

---

### 2. Pre-compute Env Vars 📊 (Low risk, 10-15% impact)

**Action:** Move getenv() from hot path to init

**Current problem:**

```c
// This check runs on EVERY allocation; the getenv() call behind it costs 50-100 cycles
if (g_tiny_profile_enabled == -1) {
    const char* env = getenv("HAKMEM_TINY_PROFILE");
    g_tiny_profile_enabled = (env && *env) ? 1 : 0;
}
```

**Fix:**

```c
// In hakmem_init.c (runs ONCE at startup)
void hakmem_tiny_init_config(void) {
    const char* env = getenv("HAKMEM_TINY_PROFILE");
    g_tiny_profile_enabled = (env && *env) ? 1 : 0;
    // Pre-compute all env vars here
}
```

**Files to modify:**
- `core/tiny_alloc_fast.inc.h:104`
- `core/hakmem_tiny_refill_p0.inc.h:66-84`

**Effort:** 1 day
**Impact:** -10-15% branches, +5-10% performance

---

### 3. Remove SFC Layer 🗑️ (Medium risk, 5-10% impact)

**Action:** Use only SLL (TLS freelist), remove SFC (Super Front Cache)

**Why redundant:**
- SLL already provides TLS freelist (same as System tcache)
- Phase 7 pre-warming gives SLL 95%+ hit rate
- SFC adds 5-6 branches with minimal benefit
- System malloc has 1 layer, HAKMEM has 3!

**Current:**

```
Allocation: SFC   →  SLL     →  SuperSlab
            5-6br    11-15br    20-30br
```

**Simplified:**

```
Allocation: SLL   →  SuperSlab
            2-3br    20-30br
```

**Effort:** 2 days
**Impact:** -5-10% branches, simpler code

---

### 4. Branch Hint Tuning 🎯 (Low risk, 2-5% impact)

**Action:** Fix incorrect `__builtin_expect` hints

**Examples:**

```c
// WRONG: SFC is disabled in most builds
if (__builtin_expect(sfc_is_enabled, 1)) {

// FIX:
if (__builtin_expect(sfc_is_enabled, 0)) {
```

**Effort:** 1 day
**Impact:** -2-5% branch-misses

---

## Performance Roadmap

| Phase | Branches | Branch-miss% | Throughput | Effort |
|-------|----------|--------------|------------|--------|
| **Current** | 17M | 10.84% | 1.07M ops/s | - |
| **+Release Mode** | 9M | 7.8% | 1.6M ops/s | 1 line |
| **+Pre-compute Env** | 8M | 7.5% | 1.8M ops/s | +1 day |
| **+Remove SFC** | 7M | 7.1% | 2.0M ops/s | +2 days |
| **+Hint Tuning** | 6.5M | 6.8% | 2.2M ops/s | +1 day |
| **System malloc** | 2M | 4.56% | 36M ops/s | - |

**Target:** 70-90% of System malloc performance (currently ~3%)

---

## Root Cause: 8.5x More Branches Than System Malloc

**The problem is NOT just misprediction rate, but TOTAL BRANCH COUNT:**

| Component | HAKMEM Branches | System Branches | Ratio |
|-----------|----------------|-----------------|-------|
| **Allocation** | 16-21 | 1-2 | **10x** |
| **Free** | 13-15 | 2-3 | **5x** |
| **Refill** | 10-15 | N/A | ∞ |
| **Total (100K allocs)** | 17M | 2M | **8.5x** |

**Why so many branches?**
1. ❌ Debug code in production (8 guards)
2. ❌ Multi-layer cache (SFC → SLL → SuperSlab)
3. ❌ Runtime env var checks (3 getenv() calls)
4. ❌ Excessive validation (alignment, corruption)

---

## System Malloc Reference (glibc tcache)

**Allocation (1-2 branches, 2-3 instructions):**

```c
// Simplified sketch of the tcache fast path (not the literal glibc source)
void* tcache_get(size_t size) {
    int tc_idx = csize2tidx(size);
    tcache_entry* e = tcache->entries[tc_idx];
    if (e != NULL) {                        // BRANCH 1: cache hit
        tcache->entries[tc_idx] = e->next;
        return (void*)e;
    }
    return _int_malloc(av, size);           // Slow path: fall back to the arena `av`
}
```

**Key differences:**
- ✅ 1 branch (vs HAKMEM's 16-21)
- ✅ No validation
- ✅ No debug guards
- ✅ Single cache layer
- ✅ No env var checks

---

## Makefile Integration (Recommended)

Add a release build target:

```makefile
# Makefile

# Release build flags
HAKMEM_RELEASE_FLAGS = -DHAKMEM_BUILD_RELEASE=1 -DNDEBUG -O3 -flto

# Release targets
release: CFLAGS += $(HAKMEM_RELEASE_FLAGS)
release: all

bench-release: CFLAGS += $(HAKMEM_RELEASE_FLAGS)
bench-release: bench_random_mixed_hakmem larson_hakmem
```

**Usage:**

```bash
make release          # Build all in release mode
make bench-release    # Build benchmarks in release mode
./bench_random_mixed_hakmem 100000 256 42
```

---

## Detailed Analysis

See full report: `BRANCH_PREDICTION_OPTIMIZATION_REPORT.md`

**Key sections:**
- Section 1: Performance hotspot analysis (perf data)
- Section 2: Branch count by component (detailed breakdown)
- Section 4: Root cause analysis (why 8.5x more branches)
- Section 5: Optimization recommendations (ranked by impact/risk)
- Section 7: A/B test plan (measurement protocol)

---

## Contact

For questions or discussion:
- See: `BRANCH_PREDICTION_OPTIMIZATION_REPORT.md` (comprehensive analysis)
- Context: Phase 7 (header-based fast free) + Pool TLS Phase 1
- Date: 2025-11-09
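
---

## Appendix: Checking the A/B Numbers

When comparing the Before and After `perf stat` runs, it helps to derive the percentages from the raw counters instead of reading them off by eye. The sketch below is a minimal helper for that arithmetic. Both function names are hypothetical and are not part of HAKMEM; they only restate the formulas used in this guide (miss rate = misses / branches, change = (after - before) / before).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of HAKMEM):
 * branch-miss rate in percent, i.e. 100 * misses / branches. */
static double branch_miss_rate_pct(uint64_t misses, uint64_t branches) {
    assert(branches > 0);  /* the rate is undefined for a run with no branches */
    return 100.0 * (double)misses / (double)branches;
}

/* Hypothetical helper (not part of HAKMEM):
 * relative change in percent from `before` to `after`.
 * A negative result means the value went down. */
static double relative_change_pct(double before, double after) {
    assert(before != 0.0);  /* cannot express a change relative to zero */
    return 100.0 * (after - before) / before;
}
```

With the Before counters from this guide, `branch_miss_rate_pct(1854018, 17098340)` comes out at roughly 10.84, matching the perf output. `relative_change_pct(17e6, 9e6)` gives about -47, the expected branch reduction in release mode, and `relative_change_pct(0.103, 0.060)` gives about -42, the expected drop in run time.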