# Branch Prediction Optimization - Quick Start Guide

**TL;DR:** HAKMEM has a 10.84% branch-miss rate (about 3x worse than System malloc's 3.5%) because it executes **8.5x more branches** (17M vs 2M): debug code is still compiled into production builds.

---

## Immediate Fix (1 Minute)

**Pass the release flag on your build command line:**

```bash
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
```

In GNU Make, a `CFLAGS=` given on the command line overrides ordinary `CFLAGS` assignments inside the Makefile, so append any other flags the build needs (include paths, warnings) to the same string.

**Expected result:** +30-50% performance improvement

---

## Quick Win A/B Test

### Before (Current)
```bash
make clean
make bench_random_mixed_hakmem
perf stat -e branch-misses,branches ./bench_random_mixed_hakmem 100000 256 42

# Results:
#   branches:      17,098,340
#   branch-misses: 1,854,018 (10.84%)
#   time:          0.103s
```

### After (Release Mode)
```bash
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
perf stat -e branch-misses,branches ./bench_random_mixed_hakmem 100000 256 42

# Expected:
#   branches:      ~9M (-47%)
#   branch-misses: ~700K (7.8%)
#   time:          ~0.060s (-42% wall time)
```
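
The percentages quoted for the two runs follow directly from the raw `perf stat` counters. A minimal sketch of that arithmetic in C is shown here; the function names are illustrative and not part of HAKMEM.

```c
/* Branch-miss rate in percent: misses / branches * 100.
 * Returns 0 when no branches were counted. */
double miss_rate_pct(unsigned long long misses, unsigned long long branches) {
    return branches ? 100.0 * (double)misses / (double)branches : 0.0;
}

/* Relative change from `before` to `after` in percent.
 * A negative result means a reduction. */
double change_pct(double before, double after) {
    return before != 0.0 ? 100.0 * (after - before) / before : 0.0;
}
```

With the numbers above, `miss_rate_pct(1854018, 17098340)` comes out at roughly 10.84, and `change_pct(0.103, 0.060)` at roughly -42.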

---

## Top 4 Optimizations (Ranked by Impact/Risk)

### 1. Enable Release Mode ⚡ (0 risk, 40-50% impact)
**Action:** Add `-DHAKMEM_BUILD_RELEASE=1` to build flags

**Why:** Without the flag, ALL debug code is compiled into production builds:
- 8 debug guards (`!HAKMEM_BUILD_RELEASE`)
- 6 rdtsc profiling calls
- 5-10 corruption validation branches

All of these are removed with one flag.

**Effort:** 1 line change
**Impact:** 40-50% fewer branches, +30-50% performance
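
To illustrate how a `!HAKMEM_BUILD_RELEASE` guard removes branches at compile time, here is a minimal sketch. The node type, the counter, and `freelist_pop` are hypothetical names for illustration; only the `HAKMEM_BUILD_RELEASE` macro comes from this guide, and treating an undefined flag as a debug build is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: an undefined flag means a debug build. */
#ifndef HAKMEM_BUILD_RELEASE
#define HAKMEM_BUILD_RELEASE 0
#endif

/* Hypothetical freelist node and debug counter, for illustration only. */
typedef struct node { struct node *next; } node_t;
unsigned long g_debug_checks = 0;

/* Pops the head of a singly linked freelist. The validation block exists
 * only in debug builds; with -DHAKMEM_BUILD_RELEASE=1 the preprocessor
 * drops it, leaving one branch (the NULL test) on the release path. */
void *freelist_pop(node_t **head) {
    node_t *n = *head;
    if (n == NULL) {
        return NULL;
    }
#if !HAKMEM_BUILD_RELEASE
    g_debug_checks++;                               /* debug-only bookkeeping */
    if (((uintptr_t)n & (sizeof(void *) - 1)) != 0) {
        return NULL;                                /* misaligned: treat as corrupt */
    }
#endif
    *head = n->next;
    return n;
}
```

Building the same file with and without `-DHAKMEM_BUILD_RELEASE=1` and comparing `perf stat` output is one way to confirm that the guarded code is really gone.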

---

### 2. Pre-compute Env Vars 📊 (Low risk, 10-15% impact)
**Action:** Move getenv() from hot path to init

**Current problem:**
```c
// This lazy-init check sits on the hot path, so its branch runs on EVERY allocation.
// getenv() itself (50-100 cycles) runs only while the flag is still -1.
if (g_tiny_profile_enabled == -1) {
    const char* env = getenv("HAKMEM_TINY_PROFILE");
    g_tiny_profile_enabled = (env && *env) ? 1 : 0;
}
```

**Fix:**
```c
// In hakmem_init.c (runs ONCE at startup)
void hakmem_tiny_init_config(void) {
    const char* env = getenv("HAKMEM_TINY_PROFILE");
    g_tiny_profile_enabled = (env && *env) ? 1 : 0;

    // Pre-compute all env vars here
}
```

**Files to modify:**
- `core/tiny_alloc_fast.inc.h:104`
- `core/hakmem_tiny_refill_p0.inc.h:66-84`

**Effort:** 1 day
**Impact:** 10-15% fewer branches, +5-10% performance
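
One possible shape for the init-time version is sketched below. It assumes a GCC/Clang toolchain (for `__attribute__((constructor))`); `hak_env_flag` and `tiny_alloc_should_profile` are hypothetical helpers, and the project may well prefer an explicit init call over a constructor.

```c
#include <stdlib.h>

/* Config flag read by the hot path; 0 until the init function runs. */
int g_tiny_profile_enabled = 0;

/* Treats an unset or empty variable as "off", matching the
 * (env && *env) test used in the snippets above. */
int hak_env_flag(const char *value) {
    return (value != NULL && *value != '\0') ? 1 : 0;
}

/* With GCC/Clang this runs once before main(). Assumption: nothing on
 * the allocation path needs the flag before constructors have run. */
__attribute__((constructor))
void hakmem_tiny_init_config(void) {
    g_tiny_profile_enabled = hak_env_flag(getenv("HAKMEM_TINY_PROFILE"));
}

/* Hot path: a plain load, with no "== -1" lazy-init check. */
int tiny_alloc_should_profile(void) {
    return g_tiny_profile_enabled;
}
```

Because an allocator can be called before constructors run, the default value (0 here) should be a safe setting for any early allocations.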

---

### 3. Remove SFC Layer 🗑️ (Medium risk, 5-10% impact)
**Action:** Use only SLL (TLS freelist), remove SFC (Super Front Cache)

**Why redundant:**
- SLL already provides a TLS freelist (the same role as System malloc's tcache)
- Phase 7 pre-warming gives SLL a 95%+ hit rate
- SFC adds 5-6 branches with minimal benefit
- System malloc has 1 layer, HAKMEM has 3!

**Current:**
```
Allocation: SFC    →  SLL      →  SuperSlab
            5-6br     11-15br     20-30br
```

**Simplified:**
```
Allocation: SLL    →  SuperSlab
            2-3br     20-30br
```

**Effort:** 2 days
**Impact:** 5-10% fewer branches, simpler code
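
The simplified two-layer path can be sketched as a per-thread singly linked list with a direct fallback. Everything in this block is hypothetical (the class count, the function names, and the stub that stands in for the SuperSlab refill); it shows the shape of the idea, not HAKMEM's actual code.

```c
#include <stddef.h>

#define HAK_NUM_CLASSES 8   /* hypothetical number of size classes */

typedef struct sll_node { struct sll_node *next; } sll_node_t;

/* One freelist head per size class, private to each thread (C11). */
static _Thread_local sll_node_t *g_sll_head[HAK_NUM_CLASSES];

/* Free path: push the block onto its class list. No branches. */
void sll_push(int cls, void *block) {
    sll_node_t *n = (sll_node_t *)block;
    n->next = g_sll_head[cls];
    g_sll_head[cls] = n;
}

/* Stand-in for the SuperSlab refill path; always misses in this sketch. */
void *superslab_refill_stub(int cls) {
    (void)cls;
    return NULL;
}

/* Alloc path: one branch on a hit, then straight to the backing layer
 * on a miss. There is no SFC lookup in between. */
void *sll_alloc(int cls) {
    sll_node_t *n = g_sll_head[cls];
    if (n != NULL) {
        g_sll_head[cls] = n->next;
        return n;
    }
    return superslab_refill_stub(cls);
}
```

The list is LIFO, so the most recently freed block is handed out first, which tends to keep it warm in cache.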

---

### 4. Branch Hint Tuning 🎯 (Low risk, 2-5% impact)
**Action:** Fix incorrect `__builtin_expect` hints

**Examples:**
```c
// WRONG: SFC is disabled in most builds, so "expected true" is the wrong hint
if (__builtin_expect(sfc_is_enabled, 1)) { /* SFC path */ }

// FIX: mark the SFC path as unlikely
if (__builtin_expect(sfc_is_enabled, 0)) { /* SFC path */ }
```

**Effort:** 1 day
**Impact:** 2-5% fewer branch misses
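
A common way to keep such hints readable is a pair of wrapper macros. The sketch below assumes GCC or Clang; `HAK_LIKELY`, `HAK_UNLIKELY`, and `pick_layer` are hypothetical names, and `sfc_is_enabled` is modelled here as a plain global.

```c
/* Wrapper macros around the GCC/Clang builtin. The !! folds any
 * non-zero value to 1 so the hint compares against the right constant. */
#define HAK_LIKELY(x)   __builtin_expect(!!(x), 1)
#define HAK_UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Assumed to be off in most builds, as stated above. */
int sfc_is_enabled = 0;

/* Returns which layer would serve the request: 1 = SFC, 0 = SLL.
 * The hint only influences code layout, never the result. */
int pick_layer(void) {
    if (HAK_UNLIKELY(sfc_is_enabled)) {
        return 1;   /* rare path; compilers typically move it out of line */
    }
    return 0;       /* common path falls through */
}
```

A wrong hint does not change behavior, which is why these mistakes go unnoticed until a profile such as `perf stat -e branch-misses` is checked.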

---

## Performance Roadmap

| Phase | Branches | Branch-miss% | Throughput | Effort |
|-------|----------|--------------|------------|--------|
| **Current** | 17M | 10.84% | 1.07M ops/s | - |
| **+Release Mode** | 9M | 7.8% | 1.6M ops/s | 1 line |
| **+Pre-compute Env** | 8M | 7.5% | 1.8M ops/s | +1 day |
| **+Remove SFC** | 7M | 7.1% | 2.0M ops/s | +2 days |
| **+Hint Tuning** | 6.5M | 6.8% | 2.2M ops/s | +1 day |
| **System malloc** | 2M | 4.56% | 36M ops/s | - |

**Target:** 70-90% of System malloc performance (currently ~3%; the four steps above are projected to reach ~6%)

---

## Root Cause: 8.5x More Branches Than System Malloc

**The problem is NOT just misprediction rate, but TOTAL BRANCH COUNT:**

| Component | HAKMEM Branches | System Branches | Ratio |
|-----------|----------------|-----------------|-------|
| **Allocation** | 16-21 | 1-2 | **10x** |
| **Free** | 13-15 | 2-3 | **5x** |
| **Refill** | 10-15 | N/A | ∞ |
| **Total (100K allocs)** | 17M | 2M | **8.5x** |

**Why so many branches?**
1. ❌ Debug code in production (8 guards)
2. ❌ Multi-layer cache (SFC → SLL → SuperSlab)
3. ❌ Runtime env var checks (3 getenv() calls)
4. ❌ Excessive validation (alignment, corruption)

---

## System Malloc Reference (glibc tcache)

**Allocation fast path (1-2 branches, a handful of instructions):**
```c
// Simplified sketch of the tcache fast path (not the literal glibc source)
void* tcache_get(size_t size) {
    int tc_idx = csize2tidx(size);
    tcache_entry* e = tcache->entries[tc_idx];
    if (e != NULL) {                         // BRANCH 1
        tcache->entries[tc_idx] = e->next;
        return (void*)e;
    }
    return _int_malloc(av, size);            // slow path; av is the arena
}
```

**Key differences:**
- ✅ 1 branch (vs HAKMEM's 16-21)
- ✅ No validation
- ✅ No debug guards
- ✅ Single cache layer
- ✅ No env var checks

---

## Makefile Integration (Recommended)

Add release build targets:

```makefile
# Makefile

# Release build flags (-flto must also reach the link step to take effect)
HAKMEM_RELEASE_FLAGS = -DHAKMEM_BUILD_RELEASE=1 -DNDEBUG -O3 -flto

# Release targets
.PHONY: release bench-release

release: CFLAGS += $(HAKMEM_RELEASE_FLAGS)
release: all

bench-release: CFLAGS += $(HAKMEM_RELEASE_FLAGS)
bench-release: bench_random_mixed_hakmem larson_hakmem
```

**Usage:**
```bash
make clean          # make does not track flag changes, so drop old objects first
make release        # Build all in release mode
make bench-release  # Build benchmarks in release mode
./bench_random_mixed_hakmem 100000 256 42
```

---

## Detailed Analysis

See full report: `BRANCH_PREDICTION_OPTIMIZATION_REPORT.md`

**Key sections:**
- Section 1: Performance hotspot analysis (perf data)
- Section 2: Branch count by component (detailed breakdown)
- Section 4: Root cause analysis (why 8.5x more branches)
- Section 5: Optimization recommendations (ranked by impact/risk)
- Section 7: A/B test plan (measurement protocol)

---

## Contact

For questions or discussion:
- See: `BRANCH_PREDICTION_OPTIMIZATION_REPORT.md` (comprehensive analysis)
- Context: Phase 7 (header-based fast free) + Pool TLS Phase 1
- Date: 2025-11-09