Branch Prediction Optimization - Quick Start Guide
TL;DR: HAKMEM has a 10.84% branch-miss rate (about 3x worse than System malloc's 3.5%) and executes 8.5x more branches (17M vs 2M), largely because debug code is running in production.
Immediate Fix (1 Minute)
Add this ONE line to your build command:
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
Expected result: +30-50% performance improvement
Quick Win A/B Test
Before (Current)
make clean
make bench_random_mixed_hakmem
perf stat -e branch-misses,branches ./bench_random_mixed_hakmem 100000 256 42
# Results:
# branches: 17,098,340
# branch-misses: 1,854,018 (10.84%)
# time: 0.103s
After (Release Mode)
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
perf stat -e branch-misses,branches ./bench_random_mixed_hakmem 100000 256 42
# Expected:
# branches: ~9M (-47%)
# branch-misses: ~700K (7.8%)
# time: ~0.060s (~42% less wall time)
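The percentages in the two perf runs above can be recomputed from the raw counters. The helpers below are a minimal sketch for doing that when comparing your own A/B runs; the function names are hypothetical and are not part of HAKMEM.

```c
#include <stdint.h>

/* Branch-miss rate in percent, from the two perf counters. */
static double branch_miss_rate_pct(uint64_t misses, uint64_t branches) {
    if (branches == 0) return 0.0;           /* avoid division by zero */
    return 100.0 * (double)misses / (double)branches;
}

/* Reduction in wall time, in percent, between two runs. */
static double time_reduction_pct(double before_s, double after_s) {
    if (before_s <= 0.0) return 0.0;
    return 100.0 * (before_s - after_s) / before_s;
}
```

With the "Before" counters, `branch_miss_rate_pct(1854018, 17098340)` gives about 10.84, and `time_reduction_pct(0.103, 0.060)` gives about 42, matching the figures quoted above.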
Top 4 Optimizations (Ranked by Impact/Risk)
1. Enable Release Mode ⚡ (0 risk, 40-50% impact)
Action: Add -DHAKMEM_BUILD_RELEASE=1 to build flags
Why: Currently ALL debug code runs in production:
- 8 debug guards (!HAKMEM_BUILD_RELEASE)
- 6 rdtsc profiling calls
- 5-10 corruption validation branches
- All removed with one flag!
Effort: 1 line change
Impact: 40-50% fewer branches, +30-50% performance
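To illustrate why one flag removes so many branches, here is a minimal sketch of a debug guard. It assumes HAKMEM_BUILD_RELEASE defaults to 0 when not set on the command line; the macro HAK_DEBUG_LOG and the function hak_check_aligned are hypothetical names for illustration, not HAKMEM's actual code.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: an undefined flag means a debug build. */
#ifndef HAKMEM_BUILD_RELEASE
#define HAKMEM_BUILD_RELEASE 0
#endif

#if !HAKMEM_BUILD_RELEASE
#define HAK_DEBUG_LOG(...) fprintf(stderr, __VA_ARGS__)
#else
#define HAK_DEBUG_LOG(...) ((void)0)   /* compiles to nothing */
#endif

/* Hypothetical validation helper: returns 1 if the pointer passes. */
static inline int hak_check_aligned(const void *p) {
#if !HAKMEM_BUILD_RELEASE
    /* Debug builds pay for this branch on every call. */
    if (((uintptr_t)p & 7u) != 0) {
        HAK_DEBUG_LOG("misaligned pointer %p\n", p);
        return 0;
    }
#else
    (void)p;                           /* release: no check, no branch */
#endif
    return 1;
}
```

Built with -DHAKMEM_BUILD_RELEASE=1, the function body reduces to `return 1`, so the compiler can drop the check and its branch entirely.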
2. Pre-compute Env Vars 📊 (Low risk, 10-15% impact)
Action: Move getenv() from hot path to init
Current problem:
// This check runs on EVERY allocation; the first call also pays for getenv() (50-100 cycles)
if (g_tiny_profile_enabled == -1) {
    const char* env = getenv("HAKMEM_TINY_PROFILE");
    g_tiny_profile_enabled = (env && *env) ? 1 : 0;
}
Fix:
// In hakmem_init.c (runs ONCE at startup)
void hakmem_tiny_init_config(void) {
    const char* env = getenv("HAKMEM_TINY_PROFILE");
    g_tiny_profile_enabled = (env && *env) ? 1 : 0;
    // Pre-compute all env vars here
}
Files to modify:
- core/tiny_alloc_fast.inc.h:104
- core/hakmem_tiny_refill_p0.inc.h:66-84
Effort: 1 day
Impact: 10-15% fewer branches, +5-10% performance
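Once the flag is computed at init, the hot path no longer needs the -1 sentinel check. The sketch below shows one way this could look. hak_env_flag and tiny_profile_enabled are hypothetical helpers added for illustration; only hakmem_tiny_init_config, g_tiny_profile_enabled, and HAKMEM_TINY_PROFILE come from the snippets above.

```c
#include <stdlib.h>

/* Plain 0/1 flag: no -1 "uninitialized" state to test on the hot path. */
static int g_tiny_profile_enabled = 0;

/* Hypothetical helper: a variable counts as "on" if set and non-empty. */
static int hak_env_flag(const char *value) {
    return (value && *value) ? 1 : 0;
}

/* Runs once at startup, before any allocation. */
void hakmem_tiny_init_config(void) {
    g_tiny_profile_enabled = hak_env_flag(getenv("HAKMEM_TINY_PROFILE"));
}

/* Hot path: a single load, no getenv() and no lazy-init branch. */
static inline int tiny_profile_enabled(void) {
    return g_tiny_profile_enabled;
}
```

This only works if hakmem_tiny_init_config() is guaranteed to run before the first allocation. If that ordering cannot be guaranteed, the lazy check has to stay.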
3. Remove SFC Layer 🗑️ (Medium risk, 5-10% impact)
Action: Use only SLL (TLS freelist), remove SFC (Super Front Cache)
Why redundant:
- SLL already provides TLS freelist (same as System tcache)
- Phase 7 pre-warming gives SLL 95%+ hit rate
- SFC adds 5-6 branches with minimal benefit
- System malloc has 1 layer, HAKMEM has 3!
Current:
Allocation: SFC    →  SLL      →  SuperSlab
            5-6br     11-15br     20-30br
Simplified:
Allocation: SLL    →  SuperSlab
            2-3br     20-30br
Effort: 2 days
Impact: 5-10% fewer branches, simpler code
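As a rough picture of the simplified two-layer path, the sketch below pops from a thread-local singly linked freelist and falls through to a slow path on a miss. All names here (sll_node, g_sll_head, NUM_CLASSES, slow_refill, tiny_alloc, tiny_free) are hypothetical, and slow_refill uses plain malloc as a stand-in for the SuperSlab refill. It is not HAKMEM's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NUM_CLASSES 8                      /* hypothetical size-class count */

typedef struct sll_node { struct sll_node *next; } sll_node;

/* One freelist head per size class, per thread. */
static _Thread_local sll_node *g_sll_head[NUM_CLASSES];

/* Stand-in for the SuperSlab refill path (assumption: 8-byte class step). */
static void *slow_refill(int cls) {
    return malloc((size_t)(cls + 1) * 8);
}

static inline void *tiny_alloc(int cls) {
    assert(cls >= 0 && cls < NUM_CLASSES); /* removed by -DNDEBUG */
    sll_node *n = g_sll_head[cls];
    if (__builtin_expect(n != NULL, 1)) {  /* the only hot-path branch */
        g_sll_head[cls] = n->next;
        return n;
    }
    return slow_refill(cls);
}

static inline void tiny_free(void *p, int cls) {
    sll_node *n = p;                       /* reuse the block as a list node */
    n->next = g_sll_head[cls];
    g_sll_head[cls] = n;
}
```

A block freed with tiny_free is returned by the next tiny_alloc for the same class on the same thread, which is the hit case the SLL layer is meant to serve.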
4. Branch Hint Tuning 🎯 (Low risk, 2-5% impact)
Action: Fix incorrect __builtin_expect hints
Examples:
// WRONG: SFC is disabled in most builds, so hinting "likely" mispredicts
if (__builtin_expect(sfc_is_enabled, 1)) { /* SFC path */ }
// FIX: hint the SFC path as unlikely
if (__builtin_expect(sfc_is_enabled, 0)) { /* SFC path */ }
Effort: 1 day
Impact: 2-5% fewer branch-misses
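Wrapping the builtin in named macros makes wrong hints easier to spot in review. The following is a minimal sketch for GCC or Clang; the HAK_LIKELY and HAK_UNLIKELY macros and the pick_cache_layer function are hypothetical names, and sfc_is_enabled is assumed to be a plain int flag that is normally 0.

```c
/* Hypothetical wrappers; !! normalizes any non-zero value to 1. */
#define HAK_LIKELY(x)   __builtin_expect(!!(x), 1)
#define HAK_UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Assumption: SFC is off in most builds, as stated above. */
static int sfc_is_enabled = 0;

/* Returns 1 for the SFC path, 0 for the SLL path. */
static int pick_cache_layer(void) {
    if (HAK_UNLIKELY(sfc_is_enabled)) {
        return 1;   /* rare: laid out off the fall-through path */
    }
    return 0;       /* common: SLL fast path */
}
```

The hint changes code layout, not behavior: the function returns the same value either way, so a wrong hint shows up only in perf counters.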
Performance Roadmap
| Phase | Branches | Branch-miss% | Throughput | Effort |
|---|---|---|---|---|
| Current | 17M | 10.84% | 1.07M ops/s | - |
| +Release Mode | 9M | 7.8% | 1.6M ops/s | 1 line |
| +Pre-compute Env | 8M | 7.5% | 1.8M ops/s | +1 day |
| +Remove SFC | 7M | 7.1% | 2.0M ops/s | +2 days |
| +Hint Tuning | 6.5M | 6.8% | 2.2M ops/s | +1 day |
| System malloc | 2M | 4.56% | 36M ops/s | - |
Target: 70-90% of System malloc performance (currently ~3%)
Root Cause: 8.5x More Branches Than System Malloc
The problem is NOT just misprediction rate, but TOTAL BRANCH COUNT:
| Component | HAKMEM Branches | System Branches | Ratio |
|---|---|---|---|
| Allocation | 16-21 | 1-2 | 10x |
| Free | 13-15 | 2-3 | 5x |
| Refill | 10-15 | N/A | ∞ |
| Total (100K allocs) | 17M | 2M | 8.5x |
Why so many branches?
- ❌ Debug code in production (8 guards)
- ❌ Multi-layer cache (SFC → SLL → SuperSlab)
- ❌ Runtime env var checks (3 getenv() calls)
- ❌ Excessive validation (alignment, corruption)
System Malloc Reference (glibc tcache)
Allocation (1-2 branches, 2-3 instructions):
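// Simplified sketch of the tcache fast path (not the verbatim glibc source)
void* tcache_get(size_t size) {
    int tc_idx = csize2tidx(size);             // map size to a tcache bin index
    tcache_entry* e = tcache->entries[tc_idx];
    if (e != NULL) {                           // BRANCH 1: bin has a cached chunk
        tcache->entries[tc_idx] = e->next;     // pop it from the bin
        return (void*)e;
    }
    return _int_malloc(av, size);              // slow path: fall back to the arena (av)
}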
Key differences:
- ✅ 1 branch (vs HAKMEM's 16-21)
- ✅ No validation
- ✅ No debug guards
- ✅ Single cache layer
- ✅ No env var checks
Makefile Integration (Recommended)
Add release build target:
# Makefile
# Release build flags
HAKMEM_RELEASE_FLAGS = -DHAKMEM_BUILD_RELEASE=1 -DNDEBUG -O3 -flto
# Release targets
release: CFLAGS += $(HAKMEM_RELEASE_FLAGS)
release: all
bench-release: CFLAGS += $(HAKMEM_RELEASE_FLAGS)
bench-release: bench_random_mixed_hakmem larson_hakmem
Usage:
make release # Build all in release mode
make bench-release # Build benchmarks in release mode
./bench_random_mixed_hakmem 100000 256 42
Detailed Analysis
See full report: BRANCH_PREDICTION_OPTIMIZATION_REPORT.md
Key sections:
- Section 1: Performance hotspot analysis (perf data)
- Section 2: Branch count by component (detailed breakdown)
- Section 4: Root cause analysis (why 8.5x more branches)
- Section 5: Optimization recommendations (ranked by impact/risk)
- Section 7: A/B test plan (measurement protocol)
Contact
For questions or discussion:
- See: BRANCH_PREDICTION_OPTIMIZATION_REPORT.md (comprehensive analysis)
- Context: Phase 7 (header-based fast free) + Pool TLS Phase 1
- Date: 2025-11-09