# Branch Prediction Optimization Investigation Report
**Date:** 2025-11-09

**Author:** Claude Code Analysis

**Context:** HAKMEM Phase 7 + Pool TLS Performance Investigation

---
## Executive Summary
**Problem:** HAKMEM has **10.89% branch-miss rate** vs System malloc's **3.5-3.9%** (3x worse)

**Root Cause Discovery:** The problem is **NOT just misprediction rate**, but **TOTAL BRANCH COUNT**:

- HAKMEM: **17,098,340 branches** (10.84% miss)
- System malloc: **2,006,962 branches** (4.56% miss)
- **HAKMEM executes 8.5x MORE branches than System malloc!**

**Impact:**

- Branch misprediction overhead: ~1.8M misses × 15-20 cycles = **27-36M cycles wasted**
- Total execution: 17M branches vs System's 2M → **8x more branch overhead**
- **Potential gain: 40-60% performance improvement** with recommended optimizations

**Critical Finding:** `HAKMEM_BUILD_RELEASE` is **NOT defined** → All debug code is running in production builds!

---
## 1. Performance Hotspot Analysis
### 1.1 Perf Statistics (256B allocations, 100K iterations)
| Metric | HAKMEM | System malloc | Ratio |
|--------|--------|---------------|-------|
| **Branches** | 17,098,340 | 2,006,962 | **8.5x** |
| **Branch-misses** | 1,854,018 | 91,497 | **20.3x** |
| **Branch-miss rate** | 10.84% | 4.56% | **2.4x** |
| **L1-dcache loads** | 31,307,492 | 4,610,155 | **6.8x** |
| **L1-dcache misses** | 1,063,117 | 44,773 | **23.7x** |
| **L1 miss rate** | 3.40% | 0.97% | **3.5x** |
| **Cycles** | ~83M | ~10M | **8.3x** |
| **Time** | 0.103s | 0.003s | **34x slower** |
**Key insight:** HAKMEM is not just suffering from poor branch prediction, but is executing **8.5x more branches** than System malloc!
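The ratios in the table follow directly from the raw counters. As a sanity check, the sketch below recomputes a miss rate, a counter ratio, and the misprediction cost estimate from the Executive Summary. The helper names (`perf_rate`, `perf_ratio`, `mispredict_cycles`) are illustrative assumptions and are not part of HAKMEM.

```c
#include <stdint.h>

/* Hypothetical helpers for re-deriving the table's ratios from raw perf counters. */

/* Fraction of events that missed, e.g. branch-misses / branches. */
static double perf_rate(uint64_t misses, uint64_t total) {
    return total ? (double)misses / (double)total : 0.0;
}

/* How many times larger counter a is than counter b. */
static double perf_ratio(uint64_t a, uint64_t b) {
    return b ? (double)a / (double)b : 0.0;
}

/* Estimated cycles lost to mispredictions at an assumed per-miss penalty. */
static uint64_t mispredict_cycles(uint64_t misses, uint64_t penalty_cycles) {
    return misses * penalty_cycles;
}
```

With the counters above, `perf_rate(1854018, 17098340)` is about 0.108, `perf_ratio(17098340, 2006962)` is about 8.5, and `mispredict_cycles(1854018, 15)` is about 27.8M cycles, which matches the lower bound quoted in the summary.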
### 1.2 Branch Count by Component
**Source file analysis:**
| File | Branch Statements | Critical Issues |
|------|-------------------|-----------------|
| `tiny_alloc_fast.inc.h` | **79** | 8 debug guards, 3 getenv() calls, SFC/SLL dual-layer |
| `hak_free_api.inc.h` | **38** | Pool TLS + Phase 7 dual dispatch, multiple lookups |
| `hakmem_tiny_refill_p0.inc.h` | **~40** | Complex precedence logic, 2 getenv() calls, validation |
| `tiny_refill_opt.h` | **~20** | Corruption checks, guard functions |
**Total: ~177 branch statements in hot path** vs System malloc's **~5 branches**
---
## 2. Branch Count Analysis: Allocation Path
### 2.1 Fast Path: `tiny_alloc_fast()` (lines 454-497)
**Layer 0: SFC (Super Front Cache)** - Lines 177-200
```c
// Branch 1-2: Check if SFC enabled (TLS cache check)
if (!sfc_check_done) { /* getenv() + init */ }  // COLD
if (sfc_is_enabled) {                           // HOT
    // Branch 3: Try SFC
    void* ptr = sfc_alloc(class_idx);           // → 2 branches inside
    if (ptr != NULL) { /* hit */ }              // HOT
}
```
**Branches: 5-6** (3 external + 2-3 in sfc_alloc)
**Layer 1: SLL (TLS Freelist)** - Lines 204-259
```c
// Branch 4: Check if SLL enabled
if (g_tls_sll_enable) {                              // HOT
    // Branch 5: Try SLL pop
    void* head = g_tls_sll_head[class_idx];
    if (head != NULL) {                              // HOT
        // Branch 6-7: Corruption debug (ONLY if failfast ≥ 2)
        if (tiny_refill_failfast_level() >= 2) {     // DEBUG
            /* alignment validation (2 branches) */
        }
        // Branch 8-9: Validate next pointer
        void* next = *(void**)head;
        if (tiny_refill_failfast_level() >= 2) {     // DEBUG
            /* next pointer validation (2 branches) */
        }
        // Branch 10: Count update
        if (g_tls_sll_count[class_idx] > 0) {        // HOT
            g_tls_sll_count[class_idx]--;
        }
        // Branch 11: Profiling (DEBUG)
#if !HAKMEM_BUILD_RELEASE
        if (start) { /* rdtsc tracking */ }          // DEBUG
#endif
        return head;                                 // SUCCESS
    }
}
```
**Branches: 11-15** (2 unconditional + 5-9 conditional debug)
**Total allocation fast path: 16-21 branches** vs System tcache's **1-2 branches**
### 2.2 Refill Path: `tiny_alloc_fast_refill()` (lines 321-436)
**Phase 2b capacity check:**
```c
// Branch 1: Check available capacity
int available_capacity = get_available_capacity(class_idx);
if (available_capacity <= 0) { return 0; }
```
**Refill count precedence logic (lines 338-363):**
```c
// Branch 2: First-time init check
if (cnt == 0) {  // COLD (once per class per thread)
    // Branch 3-6: Complex precedence logic
    if (g_refill_count_class[class_idx] > 0) { /* ... */ }
    else if (class_idx <= 3 && g_refill_count_hot > 0) { /* ... */ }
    else if (class_idx >= 4 && g_refill_count_mid > 0) { /* ... */ }
    else if (g_refill_count_global > 0) { /* ... */ }
    // Branch 7-8: Clamping
    if (v < 8) v = 8;
    if (v > 256) v = 256;
}
```
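Because this chain only runs on first use, its result can be computed once and cached. A minimal sketch of the same precedence as a pure function follows. The function name, its parameters, and the `fallback` default are assumptions for illustration, not HAKMEM's actual code; a value of 0 or less is treated as "not configured".

```c
/* Hypothetical stand-alone version of the precedence chain shown above.
   A value <= 0 means "not configured"; `fallback` is an assumed default. */
static int resolve_refill_count(int class_idx, int per_class, int hot,
                                int mid, int global, int fallback) {
    int v = fallback;
    if (per_class > 0)                  v = per_class;  /* 1. per-class override */
    else if (class_idx <= 3 && hot > 0) v = hot;        /* 2. hot classes (0-3)  */
    else if (class_idx >= 4 && mid > 0) v = mid;        /* 3. mid classes (4+)   */
    else if (global > 0)                v = global;     /* 4. global override    */
    if (v < 8)   v = 8;                                 /* clamp to [8, 256]     */
    if (v > 256) v = 256;
    return v;
}
```

Keeping the logic free of globals makes it easy to unit-test and to call from a one-time init routine.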
**Total refill path: 10-15 branches** (one-time init + runtime checks)
---
## 3. Branch Count Analysis: Free Path
### 3.1 Free Path: `hak_free_at()` (hak_free_api.inc.h)
**Pool TLS dispatch (lines 81-110):**
```c
#ifdef HAKMEM_POOL_TLS_PHASE1
    // Branch 1: Page boundary check
#if !HAKMEM_TINY_SAFE_FREE
    if (((uintptr_t)header_addr & 0xFFF) == 0) {  // 0.1% frequency
        // Branch 2: Memory readable check (mincore syscall)
        if (!hak_is_memory_readable(header_addr)) { goto skip_pool_tls; }
    }
#endif
    // Branch 3: Magic check
    if ((header & 0xF0) == POOL_MAGIC) {
        pool_free(ptr);
        goto done;
    }
#endif
```
**Branches: 3** (optimized with hybrid mincore)
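The page-boundary filter is what keeps the `mincore`-based readability probe off the common path: only addresses whose low 12 bits are zero pay for it. A minimal sketch of that predicate, assuming 4 KiB pages (the helper name `hak_on_page_boundary` and the `HAK_PAGE_MASK` macro are hypothetical):

```c
#include <stdint.h>

#define HAK_PAGE_MASK 0xFFFu  /* low 12 bits: offset within an assumed 4 KiB page */

/* Returns 1 when addr sits exactly on a 4 KiB page boundary, mirroring the
   (addr & 0xFFF) == 0 test in the dispatch code above. */
static inline int hak_on_page_boundary(const void* addr) {
    return ((uintptr_t)addr & HAK_PAGE_MASK) == 0;
}
```

Any other address skips the probe entirely, so the expensive check is confined to a small fraction of frees.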
**Phase 7 dual-header dispatch (lines 112-167):**
```c
// Branch 4: Try 1-byte Tiny header
if (hak_tiny_free_fast_v2(ptr)) {  // → 3-5 branches inside
    goto done;
}
// Branch 5: Page boundary check for 16-byte header
if (offset_in_page < HEADER_SIZE) {
    // Branch 6: Memory readable check
    if (!hak_is_memory_readable(raw)) { goto slow_path; }
}
// Branch 7: 16-byte header magic check
if (hdr->magic == HAKMEM_MAGIC) {
    // Branch 8: Method dispatch
    if (hdr->method == ALLOC_METHOD_MALLOC) { /* ... */ }
}
```
**Branches: 8-10** (including 3-5 inside hak_tiny_free_fast_v2)
**Mid/L25 lookup (lines 196-206):**
```c
// Branch 9-10: Mid/L25 registry lookups
if (hak_pool_mid_lookup(ptr, &mid_sz)) { /* ... */ }
if (hak_l25_lookup(ptr, &l25_sz)) { /* ... */ }
```
**Branches: 2**
**Total free path: 13-15 branches** vs System tcache's **2-3 branches**
---
## 4. Root Cause Analysis
### 4.1 CRITICAL: Debug Code in Production Builds
**Finding:** `HAKMEM_BUILD_RELEASE` is **NOT defined anywhere** in Makefile
**Impact:** All debug code runs in production:
| Debug Guard | Location | Frequency | Overhead |
|-------------|----------|-----------|----------|
| `!HAKMEM_BUILD_RELEASE` | `tiny_alloc_fast.inc.h:171` | Every allocation | 2-3 branches |
| `!HAKMEM_BUILD_RELEASE` | `tiny_alloc_fast.inc.h:191-196` | Every allocation | 1 branch + rdtsc |
| `!HAKMEM_BUILD_RELEASE` | `tiny_alloc_fast.inc.h:250-256` | Every allocation | 1 branch + rdtsc |
| `!HAKMEM_BUILD_RELEASE` | `tiny_alloc_fast.inc.h:324-326` | Every refill | 1 branch + rdtsc |
| `!HAKMEM_BUILD_RELEASE` | `tiny_alloc_fast.inc.h:427-433` | Every refill | 1 branch + rdtsc |
| `!HAKMEM_BUILD_RELEASE` | `tiny_free_fast_v2.inc.h:99-104` | Every free | 1 branch + capacity check |
| `!HAKMEM_BUILD_RELEASE` | `hak_free_api.inc.h:118-120` | Every free | 1 function call |
| `trc_refill_guard_enabled()` | `tiny_refill_opt.h:61-75` | Every splice | 1 branch + getenv |
**Total overhead: 8-12 branches + 6 rdtsc calls + 2 getenv calls per allocation/free cycle**
**Expected impact of fixing:** **-40-50% total branches**
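The reason an undefined flag enables the debug code is a preprocessor rule: inside `#if`, an identifier that is not defined as a macro evaluates to 0, so `!HAKMEM_BUILD_RELEASE` becomes 1. The sketch below shows the effect in isolation; the `HAK_DEBUG_PATHS_COMPILED` macro and the probe function are hypothetical names used only for this illustration.

```c
/* An identifier that is not defined as a macro evaluates to 0 inside #if,
   so with no -DHAKMEM_BUILD_RELEASE flag the debug branch below is selected. */
#if !HAKMEM_BUILD_RELEASE
#  define HAK_DEBUG_PATHS_COMPILED 1
#else
#  define HAK_DEBUG_PATHS_COMPILED 0
#endif

/* Hypothetical probe: reports whether debug-only code was compiled in. */
static int hak_debug_paths_compiled(void) {
    return HAK_DEBUG_PATHS_COMPILED;
}
```

Building with `-DHAKMEM_BUILD_RELEASE=1` flips the result to 0, which is a cheap way to confirm that the flag actually reaches the compiler.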
### 4.2 HIGH: getenv() Calls in Hot Path
**Finding:** 3 lazy-initialized getenv() calls in hot path
| Location | Variable | Call Frequency | Fix |
|----------|----------|----------------|-----|
| `tiny_alloc_fast.inc.h:104` | `HAKMEM_TINY_PROFILE` | Every allocation (if -1) | Cache in global var at init |
| `hakmem_tiny_refill_p0.inc.h:68` | `HAKMEM_TINY_REFILL_COUNT_HOT` | Every refill (class ≤ 3) | Pre-compute at init |
| `hakmem_tiny_refill_p0.inc.h:78` | `HAKMEM_TINY_REFILL_COUNT_MID` | Every refill (class ≥ 4) | Pre-compute at init |
**Impact:**
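- getenv() costs roughly 50-100 cycles per call: it scans the process environment with string comparisons (no syscall is involved, but the scan is linear in the number of variables)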
- Adds 2-3 branches per call (null check, lazy init, result check)
- Total: **6-9 branches + 150-300 cycles** on first access per thread
**Expected impact of fixing:** **-10-15% branches, -5-10% cycles**
### 4.3 MEDIUM: Complex Multi-Layer Cache
**Current architecture:**
```
Allocation: Size check → SFC (Layer 0) → SLL (Layer 1)  → SuperSlab → Refill
            1 branch     5-6 branches    11-15 branches   20-30 branches
```
**System malloc tcache:**
```
Allocation: Size check → TLS cache    → ptmalloc2
            1 branch     1-2 branches
```
**Problem:** HAKMEM has **3 layers** (SFC → SLL → SuperSlab) vs System's **1 layer** (tcache)
**Why SFC is redundant:**
- SLL already provides TLS freelist (same design as tcache)
- SFC adds 5-6 branches with minimal benefit
- Pre-warming (Phase 7 Task 3) already boosted SLL hit rate to 95%+
**Expected impact of removing SFC:** **-5-10% branches, simpler code**
### 4.4 MEDIUM: Excessive Validation in Hot Path
**Corruption checks (lines 208-235 in tiny_alloc_fast.inc.h):**
```c
if (tiny_refill_failfast_level() >= 2) {  // lazily calls getenv() on first use
    // Alignment validation
    if (((uintptr_t)head % blk) != 0) {
        fprintf(stderr, "[TLS_SLL_CORRUPT] ...");
        abort();
    }
    // Next pointer validation
    if (next != NULL && ((uintptr_t)next % blk) != 0) {
        fprintf(stderr, "[ALLOC_POP_CORRUPT] ...");
        abort();
    }
}
```
**Impact:**
- 1 getenv() call per thread (lazy init) = ~100 cycles
- 5-7 branches per allocation when enabled
- fprintf/abort paths confuse branch predictor
**Solution:** Move to compile-time flag (e.g., `HAKMEM_DEBUG_VALIDATION`) instead of runtime check
**Expected impact:** **-5-10% branches when disabled**
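A compile-time gate removes both the level check and the validation branches from builds that do not ask for them. The following is a minimal sketch under the assumption that `HAKMEM_DEBUG_VALIDATION` defaults to 0; the wrapper name `hak_validate_block` is hypothetical and the message text is abbreviated.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef HAKMEM_DEBUG_VALIDATION
#define HAKMEM_DEBUG_VALIDATION 0   /* assumed default: checks compiled out */
#endif

/* Hypothetical wrapper around a freelist pop result. */
static inline void* hak_validate_block(void* head, size_t blk) {
#if HAKMEM_DEBUG_VALIDATION
    /* Debug builds: abort on a block that is not aligned to its class size. */
    if (((uintptr_t)head % blk) != 0) {
        fprintf(stderr, "[TLS_SLL_CORRUPT] misaligned block %p\n", head);
        abort();
    }
#else
    (void)blk;  /* no branch, no call: the check does not exist in this build */
#endif
    return head;
}
```

With the flag at 0 the function reduces to returning its argument, so the compiler can inline it away; `-DHAKMEM_DEBUG_VALIDATION=1` restores the fail-fast behavior for debugging sessions.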
---
## 5. Optimization Recommendations (Ranked by Impact/Risk)
### 5.1 CRITICAL FIX: Enable Release Mode (0 risk, 40-50% impact)
**Action:** Add `-DHAKMEM_BUILD_RELEASE=1` to production build flags
**Implementation:**
```makefile
# Makefile
HAKMEM_RELEASE_FLAGS = -DHAKMEM_BUILD_RELEASE=1 -DNDEBUG -O3 -flto
release: CFLAGS += $(HAKMEM_RELEASE_FLAGS)
release: all
```
**Changes enabled:**
- Removes 8 `!HAKMEM_BUILD_RELEASE` guards → **-8-12 branches**
- Disables rdtsc profiling → **-6 rdtsc calls**
- Disables corruption validation → **-5-10 branches**
- Enables LTO and aggressive optimization
**Expected result:**
- **-40-50% total branches** (17M → 8.5-10M)
- **-20-30% cycles** (better inlining, constant folding)
- **+30-50% performance** (overall)
**A/B test command:**
```bash
# Before
make bench_random_mixed_hakmem
./bench_random_mixed_hakmem 100000 256 42
# After (assumes the Makefile turns HAKMEM_BUILD_RELEASE=1 into -DHAKMEM_BUILD_RELEASE=1)
make clean
make HAKMEM_BUILD_RELEASE=1 bench_random_mixed_hakmem
./bench_random_mixed_hakmem 100000 256 42
```
---
### 5.2 HIGH PRIORITY: Pre-compute Env Vars at Init (Low risk, 10-15% impact)
**Action:** Move getenv() calls from hot path to global init
**Current (lazy init in hot path):**
```c
// SLOW: the -1 check runs on every allocation/refill; getenv() runs on first use
if (g_tiny_profile_enabled == -1) {
    const char* env = getenv("HAKMEM_TINY_PROFILE");  // 50-100 cycles!
    g_tiny_profile_enabled = (env && *env && *env != '0') ? 1 : 0;
}
```
**Fixed (pre-compute at init):**
```c
// hakmem_init.c (runs once at startup)
void hakmem_tiny_init_config(void) {
    // Profile mode
    const char* env = getenv("HAKMEM_TINY_PROFILE");
    g_tiny_profile_enabled = (env && *env && *env != '0') ? 1 : 0;
    // Refill counts
    const char* hot_env = getenv("HAKMEM_TINY_REFILL_COUNT_HOT");
    g_refill_count_hot = hot_env ? atoi(hot_env) : HAKMEM_TINY_REFILL_DEFAULT;
    const char* mid_env = getenv("HAKMEM_TINY_REFILL_COUNT_MID");
    g_refill_count_mid = mid_env ? atoi(mid_env) : HAKMEM_TINY_REFILL_DEFAULT;
}
```
**Expected result:**
- **-6-9 branches** (3 getenv lazy-init patterns)
- **-150-300 cycles** on first access per thread
- **+5-10% performance** (cleaner hot path)
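Since the parsing now happens once at init, it can afford to be stricter than `atoi`, which returns 0 for non-numeric input and has undefined behavior on overflow. A possible sketch using `strtol`, with the same 8 to 256 clamp that the refill path applies, is shown below; the helper name `hak_parse_refill_count` is an assumption, not an existing HAKMEM function.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a refill count from an env string.
   Returns `fallback` for missing, empty, malformed, or out-of-range text;
   otherwise clamps the parsed value to [8, 256]. */
static int hak_parse_refill_count(const char* text, int fallback) {
    if (text == NULL || *text == '\0') return fallback;
    char* end = NULL;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0') return fallback;
    if (v < 8)   v = 8;
    if (v > 256) v = 256;
    return (int)v;
}
```

An init routine could then call, for example, `hak_parse_refill_count(getenv("HAKMEM_TINY_REFILL_COUNT_HOT"), HAKMEM_TINY_REFILL_DEFAULT)` and store the result in the corresponding global.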
**Files to modify:**
- `core/tiny_alloc_fast.inc.h:104` - Remove lazy init
- `core/hakmem_tiny_refill_p0.inc.h:66-84` - Remove lazy init
- `core/hakmem_init.c` - Add global init function
---
### 5.3 MEDIUM PRIORITY: Simplify Cache Layers (Medium risk, 5-10% impact)
**Option A: Remove SFC Layer (Recommended)**
**Rationale:**
- SFC adds 5-6 branches with minimal benefit
- SLL already provides TLS freelist (same as System tcache)
- Phase 7 Task 3 pre-warming gives SLL 95%+ hit rate
- Three cache layers = unnecessary complexity
**Implementation:**
```c
// Remove SFC entirely, use only SLL
static inline void* tiny_alloc_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    // Layer 1: TLS freelist (SLL) - DIRECT ACCESS
    void* head = g_tls_sll_head[class_idx];
    if (head != NULL) {
        g_tls_sll_head[class_idx] = *(void**)head;
        g_tls_sll_count[class_idx]--;
        return head;  // 3 instructions, 1-2 branches!
    }
    // Refill from SuperSlab
    if (tiny_alloc_fast_refill(class_idx) > 0) {
        head = g_tls_sll_head[class_idx];
        // ... retry pop
    }
    return hak_tiny_alloc_slow(size, class_idx);
}
```
**Expected result:**
- **-5-10% branches** (remove SFC layer)
- **Simpler code** (easier to debug/maintain)
- **Same or better performance** (fewer layers = less overhead)
**Option B: Unified TLS Cache (Higher risk, 10-20% impact)**
**Design:** Single TLS cache with adaptive sizing (like mimalloc)
```c
// Per-class TLS cache with adaptive capacity
typedef struct TinyTLSCache {
    void* head;
    uint32_t count;
    uint32_t capacity;  // Adaptive: 16-256
} TinyTLSCache;
static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];
```
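To make the single-layer design concrete, here is a minimal sketch of push and pop on such a cache. It redefines the struct so that it compiles on its own; `TINY_NUM_CLASSES = 8`, the function names, and the policy of rejecting a push when the cache is full are all assumptions for illustration. It relies on the GCC/Clang `__thread` extension, as the snippet above does.

```c
#include <stddef.h>
#include <stdint.h>

#define TINY_NUM_CLASSES 8  /* assumed class count for this sketch */

typedef struct TinyTLSCache {
    void*    head;      /* intrusive singly linked freelist */
    uint32_t count;     /* blocks currently cached */
    uint32_t capacity;  /* adaptive limit, e.g. 16-256 */
} TinyTLSCache;

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];

/* Pop: one load, one branch on the hit path. Returns NULL when empty. */
static inline void* tls_cache_pop(int cls) {
    TinyTLSCache* c = &g_tls_cache[cls];
    void* head = c->head;
    if (head != NULL) {
        c->head = *(void**)head;  /* next pointer lives in the free block */
        c->count--;
    }
    return head;
}

/* Push: returns 0 when the cache is full so the caller can spill elsewhere. */
static inline int tls_cache_push(int cls, void* block) {
    TinyTLSCache* c = &g_tls_cache[cls];
    if (c->count >= c->capacity) return 0;
    *(void**)block = c->head;
    c->head = block;
    c->count++;
    return 1;
}
```

The hit path of `tls_cache_pop` has a single data-dependent branch, which is the property the tcache comparison in Section 8 highlights.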
**Expected result:**
- **-10-20% branches** (unified design)
- **Better cache utilization** (adaptive sizing)
- **Matches System malloc architecture**
---
### 5.4 LOW PRIORITY: Branch Hint Tuning (Low risk, 2-5% impact)
**Action:** Optimize `__builtin_expect` hints based on profiling
**Current issues:**
- Some hints are incorrect (e.g., SFC disabled in production)
- Missing hints on hot branches
**Recommended changes:**
```c
// Line 184: SFC is DISABLED in most production builds
if (__builtin_expect(sfc_is_enabled, 1)) { // WRONG!
// Fix:
if (__builtin_expect(sfc_is_enabled, 0)) { // Expect disabled
// Line 208: Corruption checks are rare in production
if (__builtin_expect(tiny_refill_failfast_level() >= 2, 0)) { // CORRECT
// Line 457: Size > 1KB is common in mixed workloads
if (__builtin_expect(class_idx < 0, 0)) { // May be wrong for some workloads
```
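Hints are easier to audit when they go through a pair of named macros rather than raw `__builtin_expect` calls. The sketch below shows one common pattern; the `HAK_LIKELY`/`HAK_UNLIKELY` macros, the `hak_layer` enum, and `hak_pick_layer` are hypothetical names, and the expectation that SFC is off follows the assumption stated above.

```c
#if defined(__GNUC__) || defined(__clang__)
#  define HAK_LIKELY(x)   __builtin_expect(!!(x), 1)
#  define HAK_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define HAK_LIKELY(x)   (!!(x))   /* no hint available: plain condition */
#  define HAK_UNLIKELY(x) (!!(x))
#endif

/* Hypothetical dispatcher: reports which layer would serve the request. */
enum hak_layer { HAK_LAYER_SLOW = 0, HAK_LAYER_SFC = 1, HAK_LAYER_SLL = 2 };

static enum hak_layer hak_pick_layer(int class_idx, int sfc_enabled) {
    if (HAK_UNLIKELY(class_idx < 0)) return HAK_LAYER_SLOW;  /* oversized request */
    if (HAK_UNLIKELY(sfc_enabled))   return HAK_LAYER_SFC;   /* SFC expected off */
    return HAK_LAYER_SLL;                                    /* expected hot path */
}
```

The hints only affect code layout, never the result, so a wrong hint costs performance but not correctness. That makes them safe to tune from profile data.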
**Expected result:**
- **-2-5% branch-misses** (better prediction)
- **+2-5% performance** (reduced pipeline stalls)
---
## 6. Expected Results Summary
### 6.1 Cumulative Impact (All Optimizations)
| Optimization | Branch Reduction | Cycle Reduction | Risk | Effort |
|--------------|------------------|-----------------|------|--------|
| **Enable Release Mode** | -40-50% | -20-30% | None | 1 line |
| **Pre-compute Env Vars** | -10-15% | -5-10% | Low | 1 day |
| **Remove SFC Layer** | -5-10% | -5-10% | Medium | 2 days |
| **Branch Hint Tuning** | -2-5% | -2-5% | Low | 1 day |
| **TOTAL** | **-50-65%** | **-30-45%** | Low | 4-5 days |
**Projected final results:**
- **Branches:** 17M → **6-8.5M** (vs System's 2M)
- **Branch-miss rate:** 10.84% → **6-8%** (vs System's 4.56%)
- **Throughput:** Current → **+40-80% improvement**
**Target:** **70-90% of System malloc performance** (currently ~3% of System)
---
### 6.2 Quick Win: Release Mode Only
**Minimal change, maximum impact:**
```bash
# Add one line to Makefile
CFLAGS += -DHAKMEM_BUILD_RELEASE=1
# Rebuild
make clean && make bench_random_mixed_hakmem
# Test
./bench_random_mixed_hakmem 100000 256 42
```
**Expected:**
- **-40-50% branches** (17M → 8.5-10M)
- **+30-50% performance** (immediate)
- **0 code changes** (just a flag)
---
## 7. A/B Test Plan
### 7.1 Baseline Measurement
```bash
# Measure current performance
perf stat -e branch-misses,branches,cycles,instructions \
./bench_random_mixed_hakmem 100000 256 42
# Output:
# branches: 17,098,340
# branch-misses: 1,854,018 (10.84%)
# cycles: ~83M
```
### 7.2 Test 1: Release Mode
```bash
# Build with release flag
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
# Measure
perf stat -e branch-misses,branches,cycles,instructions \
./bench_random_mixed_hakmem 100000 256 42
# Expected:
# branches: ~9M (-47%)
# branch-misses: ~700K (7.8%)
# cycles: ~60M (-27%)
```
### 7.3 Test 2: Release + Pre-compute Env
```bash
# Implement env var pre-computation (see 5.2)
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
# Expected:
# branches: ~8M (-53%)
# branch-misses: ~600K (7.5%)
# cycles: ~55M (-33%)
```
### 7.4 Test 3: Release + Pre-compute + Remove SFC
```bash
# Remove SFC layer (see 5.3)
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
# Expected:
# branches: ~7M (-59%)
# branch-misses: ~500K (7.1%)
# cycles: ~50M (-40%)
```
### 7.5 Success Criteria
| Metric | Current | Target | Stretch Goal |
|--------|---------|--------|--------------|
| **Branches** | 17M | <10M | <8M |
| **Branch-miss rate** | 10.84% | <8% | <7% |
| **vs System malloc** | 8.5x slower | <5x slower | <3x slower |
| **Throughput** | 1.07M ops/s | >2M ops/s | >3M ops/s |
---
## 8. Comparison with System Malloc Strategy
### 8.1 System malloc tcache (glibc 2.26+)
**Design:**
```c
// Simplified paraphrase of the glibc tcache fast path (not the verbatim source)
// Allocation (2-3 instructions, 1-2 branches)
void* tcache_get(size_t size) {
    int tc_idx = csize2tidx(size);                    // Size to index (no branch)
    tcache_entry* e = tcache->entries[tc_idx];
    if (e != NULL) {                                  // BRANCH 1
        tcache->entries[tc_idx] = e->next;
        tcache->counts[tc_idx]--;
        return (void*)e;
    }
    return _int_malloc(av, size);                     // Slow path (av = the thread's arena)
}
// Free (2 instructions, 1 branch)
void tcache_put(void* ptr, size_t size) {
    int tc_idx = csize2tidx(size);                    // Size to index (no branch)
    if (tcache->counts[tc_idx] < mp_.tcache_count) {  // BRANCH 1 (per-bin entry limit)
        tcache_entry* e = (tcache_entry*)ptr;
        e->next = tcache->entries[tc_idx];
        tcache->entries[tc_idx] = e;
        tcache->counts[tc_idx]++;
    }
    // Else: fall back to _int_free
}
```
**Key insights:**
- **1-2 branches total** (vs HAKMEM's 16-21)
- **No validation** in fast path
- **No debug guards** in production
- **Single TLS cache layer** (vs HAKMEM's 3 layers)
- **No getenv() calls** on the allocation path (tunables are read once at startup)
### 8.2 mimalloc
**Design:**
```c
// Simplified sketch of mimalloc's small-allocation fast path; helper names
// such as _mi_page_fast() are illustrative, not the library's actual API
// Allocation (3-4 instructions, 1-2 branches)
void* mi_malloc(size_t size) {
    mi_page_t* page = _mi_page_fast();       // TLS page cache
    if (mi_likely(page != NULL)) {           // BRANCH 1
        void* p = page->free;
        if (mi_likely(p != NULL)) {          // BRANCH 2
            page->free = mi_ptr_decode(p);
            return p;
        }
    }
    return mi_malloc_generic(NULL, size);    // Slow path
}
```
**Key insights:**
- **2 branches total** (vs HAKMEM's 16-21)
- **Metadata kept in page/segment structures** rather than per-block headers (unlike HAKMEM Phase 7's inline headers)
- **No debug overhead** in release builds
- **Simple TLS structure** (page + free pointer)
---
## 9. Conclusion
**Root Cause:** HAKMEM executes **8.5x more branches** than System malloc due to:
1. Debug code running in production (`HAKMEM_BUILD_RELEASE` not defined)
2. Complex multi-layer cache (SFC → SLL → SuperSlab)
3. Runtime env var checks in hot path
4. Excessive validation and profiling
**Immediate Action (1 line change):**
```makefile
CFLAGS += -DHAKMEM_BUILD_RELEASE=1 # Expected: +30-50% performance
```
**Full Fix (4-5 days work):**
- Enable release mode
- Pre-compute env vars at init
- Remove redundant SFC layer
- Optimize branch hints
**Expected Result:**
- **-50-65% branches** (17M → 6-8.5M)
- **-30-45% cycles**
- **+40-80% throughput**
- **70-90% of System malloc performance** (vs current 3%)
**Next Steps:**
1. ✅ Enable `HAKMEM_BUILD_RELEASE=1` (immediate)
2. Run A/B tests (measure impact)
3. Implement env var pre-computation (1 day)
4. Evaluate SFC removal (2 days)
5. Re-measure and iterate
---
## Appendix A: Detailed Branch Inventory
### Allocation Path (tiny_alloc_fast.inc.h)
| Line | Branch | Frequency | Type | Fix |
|------|--------|-----------|------|-----|
| 177-182 | SFC check done | Cold (once/thread) | Init | Pre-compute |
| 184 | SFC enabled | Hot | Runtime | Remove SFC |
| 186 | SFC ptr != NULL | Hot | Fast path | Keep (necessary) |
| 204 | SLL enabled | Hot | Runtime | Make compile-time |
| 206 | SLL head != NULL | Hot | Fast path | Keep (necessary) |
| 208 | Failfast ≥ 2 | Hot | Debug | Remove in release |
| 211-216 | Alignment check | Hot | Debug | Remove in release |
| 225 | Failfast ≥ 2 | Hot | Debug | Remove in release |
| 227-234 | Next validation | Hot | Debug | Remove in release |
| 241 | Count > 0 | Hot | Unnecessary | Remove |
| 171-173 | Profile enabled | Hot | Debug | Remove in release |
| 250-256 | Profile rdtsc | Hot | Debug | Remove in release |
**Total: 16-21 branches** → **Target: 2-3 branches** (roughly 80-90% reduction)
### Refill Path (hakmem_tiny_refill_p0.inc.h)
| Line | Branch | Frequency | Type | Fix |
|------|--------|-----------|------|-----|
| 33 | !g_use_superslab | Cold | Config | Remove check |
| 41 | !tls->ss | Hot | Refill | Keep (necessary) |
| 46 | !meta | Hot | Refill | Keep (necessary) |
| 56 | room <= 0 | Hot | Capacity | Keep (necessary) |
| 66-73 | Hot override | Cold | Env var | Pre-compute |
| 76-83 | Mid override | Cold | Env var | Pre-compute |
| 116-119 | Remote drain | Hot | Optimization | Keep |
| 138 | Capacity check | Hot | Refill | Keep (necessary) |
**Total: 10-15 branches** → **Target: 5-8 branches** (40-50% reduction)

---
**End of Report**