
Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)

Phase 1 complete: environment variable cleanup + fprintf debug guards

ENV variables removed (BG/HotMag family):
- core/hakmem_tiny_init.inc: HotMag ENV removed (~131 lines)
- core/hakmem_tiny_bg_spill.c: BG spill ENV removed
- core/tiny_refill.h: BG remote replaced with fixed values
- core/hakmem_tiny_slow.inc: BG refs removed

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

Documentation cleanup:
- 328 markdown files deleted (old reports, duplicate docs)

Performance check:
- Larson: 52.35M ops/s (previously 52.8M; stable ✅)
- No functional impact from the ENV cleanup
- Some debug output remains (to be addressed in the next phase)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-26 14:45:26 +09:00


Branch Prediction Optimization Investigation Report

Date: 2025-11-09

Author: Claude Code Analysis

Context: HAKMEM Phase 7 + Pool TLS Performance Investigation

Executive Summary

Problem: HAKMEM has a ~10.9% branch-miss rate vs System malloc's 3.5-3.9% (about 3x worse); the perf run in Section 1.1 measured 10.84% vs 4.56% (2.4x).

Root Cause Discovery: The problem is NOT just misprediction rate, but TOTAL BRANCH COUNT:

HAKMEM: 17,098,340 branches (10.84% miss)
System malloc: 2,006,962 branches (4.56% miss)
HAKMEM executes 8.5x MORE branches than System malloc!

Impact:

Branch misprediction overhead: ~1.8M misses × 15-20 cycles = 27-36M cycles wasted
Total execution: 17M branches vs System's 2M → 8.5x more branch overhead
Potential gain: 40-80% throughput improvement with the recommended optimizations (see Section 6.1)

Critical Finding: HAKMEM_BUILD_RELEASE is NOT defined → All debug code is running in production builds!

1. Performance Hotspot Analysis

1.1 Perf Statistics (256B allocations, 100K iterations)

| Metric | HAKMEM | System malloc | Ratio |
|---|---|---|---|
| Branches | 17,098,340 | 2,006,962 | 8.5x |
| Branch-misses | 1,854,018 | 91,497 | 20.3x |
| Branch-miss rate | 10.84% | 4.56% | 2.4x |
| L1-dcache loads | 31,307,492 | 4,610,155 | 6.8x |
| L1-dcache misses | 1,063,117 | 44,773 | 23.7x |
| L1 miss rate | 3.40% | 0.97% | 3.5x |
| Cycles | ~83M | ~10M | 8.3x |
| Time | 0.103s | 0.003s | 34x slower |

Key insight: HAKMEM is not just suffering from poor branch prediction, but is executing 8.5x more branches than System malloc!
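The ratios in the table, and the misprediction cost quoted in the Executive Summary, follow directly from the raw counters. A minimal sketch that recomputes them is shown here; the helper names are hypothetical, and the 15-20 cycle penalty is the report's estimate rather than a measured value.

```c
#include <stdint.h>

/* Hypothetical helpers for re-deriving the table; not part of HAKMEM. */

/* Branch-miss rate as a percentage of all executed branches. */
static double branch_miss_rate_pct(uint64_t misses, uint64_t branches) {
    return 100.0 * (double)misses / (double)branches;
}

/* How many times larger HAKMEM's counter is than System malloc's. */
static double counter_ratio(uint64_t hakmem, uint64_t system_malloc) {
    return (double)hakmem / (double)system_malloc;
}

/* Cycles lost to mispredictions for an assumed per-miss penalty. */
static uint64_t miss_penalty_cycles(uint64_t misses, uint64_t cycles_per_miss) {
    return misses * cycles_per_miss;
}
```

With the exact count of 1,854,018 misses, `miss_penalty_cycles` gives about 27.8M cycles at 15 cycles per miss and about 37.1M at 20. The 27-36M range in the summary comes from rounding the miss count down to 1.8M.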

1.2 Branch Count by Component

Source file analysis:

| File | Branch Statements | Critical Issues |
|---|---|---|
| `tiny_alloc_fast.inc.h` | 79 | 8 debug guards, 3 getenv() calls, SFC/SLL dual-layer |
| `hak_free_api.inc.h` | 38 | Pool TLS + Phase 7 dual dispatch, multiple lookups |
| `hakmem_tiny_refill_p0.inc.h` | ~40 | Complex precedence logic, 2 getenv() calls, validation |
| `tiny_refill_opt.h` | ~20 | Corruption checks, guard functions |

Total: ~177 branch statements in hot path vs System malloc's ~5 branches

2. Branch Count Analysis: Allocation Path

2.1 Fast Path: `tiny_alloc_fast()` (lines 454-497)

Layer 0: SFC (Super Front Cache) - Lines 177-200

// Branch 1-2: Check if SFC enabled (TLS cache check)
if (!sfc_check_done) { /* getenv() + init */ }  // COLD
if (sfc_is_enabled) {                            // HOT
    // Branch 3: Try SFC
    void* ptr = sfc_alloc(class_idx);            // → 2 branches inside
    if (ptr != NULL) { /* hit */ }               // HOT
}

Branches: 5-6 (3 external + 2-3 in sfc_alloc)

Layer 1: SLL (TLS Freelist) - Lines 204-259

// Branch 4: Check if SLL enabled
if (g_tls_sll_enable) {                          // HOT
    // Branch 5: Try SLL pop
    void* head = g_tls_sll_head[class_idx];
    if (head != NULL) {                          // HOT
        // Branch 6-7: Corruption debug (ONLY if failfast ≥ 2)
        if (tiny_refill_failfast_level() >= 2) { // DEBUG
            /* alignment validation (2 branches) */
        }

        // Branch 8-9: Validate next pointer
        void* next = *(void**)head;
        if (tiny_refill_failfast_level() >= 2) { // DEBUG
            /* next pointer validation (2 branches) */
        }

        // Branch 10: Count update
        if (g_tls_sll_count[class_idx] > 0) {   // HOT
            g_tls_sll_count[class_idx]--;
        }

        // Branch 11: Profiling (DEBUG)
        #if !HAKMEM_BUILD_RELEASE
        if (start) { /* rdtsc tracking */ }      // DEBUG
        #endif

        return head;  // SUCCESS
    }
}

Branches: 11-15 (2 unconditional + 5-9 conditional debug)

Total allocation fast path: 16-21 branches vs System tcache's 1-2 branches

2.2 Refill Path: `tiny_alloc_fast_refill()` (lines 321-436)

Phase 2b capacity check:

// Branch 1: Check available capacity
int available_capacity = get_available_capacity(class_idx);
if (available_capacity <= 0) { return 0; }

Refill count precedence logic (lines 338-363):

// Branch 2: First-time init check
if (cnt == 0) {  // COLD (once per class per thread)
    // Branch 3-6: Complex precedence logic
    if (g_refill_count_class[class_idx] > 0) { /* ... */ }
    else if (class_idx <= 3 && g_refill_count_hot > 0) { /* ... */ }
    else if (class_idx >= 4 && g_refill_count_mid > 0) { /* ... */ }
    else if (g_refill_count_global > 0) { /* ... */ }

    // Branch 7-8: Clamping
    if (v < 8) v = 8;
    if (v > 256) v = 256;
}

Total refill path: 10-15 branches (one-time init + runtime checks)
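The precedence chain and clamping above can be isolated into a pure function, which makes the order of overrides easy to test. This is a sketch only: the function name and the `fallback` parameter are hypothetical, and the excerpt does not show what value is used when no override is set.

```c
/* Bounds taken from the clamping branches shown above. */
enum { DEMO_REFILL_MIN = 8, DEMO_REFILL_MAX = 256 };

/*
 * Resolve the refill count for one size class.
 * Precedence: per-class override > hot/mid override > global override.
 * A value <= 0 means "not set". `fallback` is an assumption: the
 * excerpt does not show the default used when nothing is set.
 */
static int demo_resolve_refill_count(int class_idx, int per_class,
                                     int hot, int mid, int global,
                                     int fallback) {
    int v = fallback;
    if (per_class > 0)                  v = per_class;
    else if (class_idx <= 3 && hot > 0) v = hot;
    else if (class_idx >= 4 && mid > 0) v = mid;
    else if (global > 0)                v = global;

    /* Clamp to the range enforced by branches 7-8. */
    if (v < DEMO_REFILL_MIN) v = DEMO_REFILL_MIN;
    if (v > DEMO_REFILL_MAX) v = DEMO_REFILL_MAX;
    return v;
}
```

Because the result depends only on configuration, it can be computed once per class and cached, leaving the `cnt == 0` test as the only check on the refill path.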

3. Branch Count Analysis: Free Path

3.1 Free Path: `hak_free_at()` (hak_free_api.inc.h)

Pool TLS dispatch (lines 81-110):

#ifdef HAKMEM_POOL_TLS_PHASE1
    // Branch 1: Page boundary check
    #if !HAKMEM_TINY_SAFE_FREE
    if (((uintptr_t)header_addr & 0xFFF) == 0) {  // 0.1% frequency
        // Branch 2: Memory readable check (mincore syscall)
        if (!hak_is_memory_readable(header_addr)) { goto skip_pool_tls; }
    }
    #endif

    // Branch 3: Magic check
    if ((header & 0xF0) == POOL_MAGIC) {
        pool_free(ptr);
        goto done;
    }
#endif

Branches: 3 (optimized with hybrid mincore)
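The two cheap tests in this dispatch are plain bit operations. The sketch below isolates them so they can be checked on their own; `DEMO_POOL_MAGIC` is a made-up value (the report does not give `POOL_MAGIC`), and the 4 KiB page size is inferred from the `0xFFF` mask.

```c
#include <stdint.h>

/* Hypothetical values: the report does not state POOL_MAGIC. */
#define DEMO_POOL_MAGIC 0xA0u
#define DEMO_PAGE_MASK  0xFFFu   /* assumes 4 KiB pages, as the 0xFFF mask implies */

/* True when the high nibble of the 1-byte header carries the pool magic. */
static int demo_header_is_pool(uint8_t header) {
    return (header & 0xF0u) == DEMO_POOL_MAGIC;
}

/*
 * True when the header address sits on a page boundary, the only case
 * in which the excerpt pays for the readability check.
 */
static int demo_needs_readable_check(uintptr_t header_addr) {
    return (header_addr & DEMO_PAGE_MASK) == 0;
}
```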

Phase 7 dual-header dispatch (lines 112-167):

// Branch 4: Try 1-byte Tiny header
if (hak_tiny_free_fast_v2(ptr)) {  // → 3-5 branches inside
    goto done;
}

// Branch 5: Page boundary check for 16-byte header
if (offset_in_page < HEADER_SIZE) {
    // Branch 6: Memory readable check
    if (!hak_is_memory_readable(raw)) { goto slow_path; }
}

// Branch 7: 16-byte header magic check
if (hdr->magic == HAKMEM_MAGIC) {
    // Branch 8: Method dispatch
    if (hdr->method == ALLOC_METHOD_MALLOC) { /* ... */ }
}

Branches: 8-10 (including 3-5 inside hak_tiny_free_fast_v2)

Mid/L25 lookup (lines 196-206):

// Branch 9-10: Mid/L25 registry lookups
if (hak_pool_mid_lookup(ptr, &mid_sz)) { /* ... */ }
if (hak_l25_lookup(ptr, &l25_sz)) { /* ... */ }

Branches: 2

Total free path: 13-15 branches vs System tcache's 2-3 branches

4. Root Cause Analysis

4.1 CRITICAL: Debug Code in Production Builds

Finding: HAKMEM_BUILD_RELEASE is NOT defined anywhere in Makefile

Impact: All debug code runs in production:

| Debug Guard | Location | Frequency | Overhead |
|---|---|---|---|
| `!HAKMEM_BUILD_RELEASE` | `tiny_alloc_fast.inc.h:171` | Every allocation | 2-3 branches |
| `!HAKMEM_BUILD_RELEASE` | `tiny_alloc_fast.inc.h:191-196` | Every allocation | 1 branch + rdtsc |
| `!HAKMEM_BUILD_RELEASE` | `tiny_alloc_fast.inc.h:250-256` | Every allocation | 1 branch + rdtsc |
| `!HAKMEM_BUILD_RELEASE` | `tiny_alloc_fast.inc.h:324-326` | Every refill | 1 branch + rdtsc |
| `!HAKMEM_BUILD_RELEASE` | `tiny_alloc_fast.inc.h:427-433` | Every refill | 1 branch + rdtsc |
| `!HAKMEM_BUILD_RELEASE` | `tiny_free_fast_v2.inc.h:99-104` | Every free | 1 branch + capacity check |
| `!HAKMEM_BUILD_RELEASE` | `hak_free_api.inc.h:118-120` | Every free | 1 function call |
| `trc_refill_guard_enabled()` | `tiny_refill_opt.h:61-75` | Every splice | 1 branch + getenv |

Total overhead: 8-12 branches + 6 rdtsc calls + 2 getenv calls per allocation/free cycle

Expected impact of fixing: -40-50% total branches
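The reason an undefined `HAKMEM_BUILD_RELEASE` leaves debug code enabled is a preprocessor rule: inside `#if`, an identifier that is not defined as a macro evaluates to 0, so `#if !HAKMEM_BUILD_RELEASE` is true unless the build passes `-DHAKMEM_BUILD_RELEASE=1`. A minimal sketch (the function name is hypothetical):

```c
/*
 * Returns 1 when debug-only code is compiled in, 0 otherwise.
 * With no -DHAKMEM_BUILD_RELEASE=1 on the command line the macro is
 * undefined, evaluates to 0 inside #if, and the debug branch is kept.
 */
static int demo_debug_code_compiled_in(void) {
#if !HAKMEM_BUILD_RELEASE
    return 1;   /* debug build, or the release flag is missing */
#else
    return 0;   /* -DHAKMEM_BUILD_RELEASE=1 was passed */
#endif
}
```

Compiling the same file with `-DHAKMEM_BUILD_RELEASE=1` flips the result to 0. On GCC and Clang, `-Wundef` turns the silent default into a warning, which would have surfaced the missing flag.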

4.2 HIGH: getenv() Calls in Hot Path

Finding: 3 lazy-initialized getenv() calls in hot path

| Location | Variable | Call Frequency | Fix |
|---|---|---|---|
| `tiny_alloc_fast.inc.h:104` | `HAKMEM_TINY_PROFILE` | Every allocation (if -1) | Cache in global var at init |
| `hakmem_tiny_refill_p0.inc.h:68` | `HAKMEM_TINY_REFILL_COUNT_HOT` | Every refill (class ≤ 3) | Pre-compute at init |
| `hakmem_tiny_refill_p0.inc.h:78` | `HAKMEM_TINY_REFILL_COUNT_MID` | Every refill (class ≥ 4) | Pre-compute at init |

Impact:

getenv() is ~50-100 cycles (a linear scan of the environment with string comparisons; it runs entirely in user space, with no syscall)
Adds 2-3 branches per call (null check, lazy init, result check)
Total: 6-9 branches + 150-300 cycles on first access per thread

Expected impact of fixing: -10-15% branches, -5-10% cycles

4.3 MEDIUM: Complex Multi-Layer Cache

Current architecture:

Allocation: Size check → SFC (Layer 0) → SLL (Layer 1) → SuperSlab → Refill
            1 branch     5-6 branches     11-15 branches   20-30 branches

System malloc tcache:

Allocation: Size check → TLS cache → ptmalloc2
            1 branch     1-2 branches

Problem: HAKMEM has 3 layers (SFC → SLL → SuperSlab) vs System's 1 layer (tcache)

Why SFC is redundant:

SLL already provides TLS freelist (same design as tcache)
SFC adds 5-6 branches with minimal benefit
Pre-warming (Phase 7 Task 3) already boosted SLL hit rate to 95%+

Expected impact of removing SFC: -5-10% branches, simpler code

4.4 MEDIUM: Excessive Validation in Hot Path

Corruption checks (lines 208-235 in tiny_alloc_fast.inc.h):

if (tiny_refill_failfast_level() >= 2) {  // getenv() call!
    // Alignment validation
    if (((uintptr_t)head % blk) != 0) {
        fprintf(stderr, "[TLS_SLL_CORRUPT] ...");
        abort();
    }

    // Next pointer validation
    if (next != NULL && ((uintptr_t)next % blk) != 0) {
        fprintf(stderr, "[ALLOC_POP_CORRUPT] ...");
        abort();
    }
}

Impact:

1 getenv() call per thread (lazy init) = ~100 cycles
5-7 branches per allocation when enabled
fprintf/abort paths confuse branch predictor

Solution: Move to compile-time flag (e.g., HAKMEM_DEBUG_VALIDATION) instead of runtime check

Expected impact: -5-10% branches when disabled
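One way to express the compile-time switch is a validation macro that expands to nothing when the flag is off. The sketch below uses the flag name suggested above; the macro and helper names are hypothetical and are not taken from the HAKMEM sources.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Flag name from the suggestion above; defaults to off. */
#ifndef HAKMEM_DEBUG_VALIDATION
#define HAKMEM_DEBUG_VALIDATION 0
#endif

/* Pure alignment test used by the check; blk must be non-zero. */
static int demo_is_block_aligned(const void* p, size_t blk) {
    return ((uintptr_t)p % blk) == 0;
}

#if HAKMEM_DEBUG_VALIDATION
/* Debug builds: report the misaligned block and abort. */
#define DEMO_VALIDATE_BLOCK(p, blk, tag)                              \
    do {                                                              \
        if (!demo_is_block_aligned((p), (blk))) {                     \
            fprintf(stderr, "[%s] misaligned block %p\n",             \
                    (tag), (void*)(p));                               \
            abort();                                                  \
        }                                                             \
    } while (0)
#else
/* Release builds: no branch and no call; arguments are not evaluated. */
#define DEMO_VALIDATE_BLOCK(p, blk, tag) ((void)0)
#endif
```

Since the arguments are not evaluated when the flag is off, they must not carry side effects that the hot path relies on.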

5. Optimization Recommendations (Ranked by Impact/Risk)

5.1 CRITICAL FIX: Enable Release Mode (0 risk, 40-50% impact)

Action: Add -DHAKMEM_BUILD_RELEASE=1 to production build flags

Implementation:

# Makefile
HAKMEM_RELEASE_FLAGS = -DHAKMEM_BUILD_RELEASE=1 -DNDEBUG -O3 -flto

release: CFLAGS += $(HAKMEM_RELEASE_FLAGS)
release: all

Changes enabled:

Removes 8 !HAKMEM_BUILD_RELEASE guards → -8-12 branches
Disables rdtsc profiling → -6 rdtsc calls
Disables corruption validation → -5-10 branches
Enables LTO and aggressive optimization

Expected result:

-40-50% total branches (17M → 8.5-10M)
-20-30% cycles (better inlining, constant folding)
+30-50% performance (overall)

A/B test command:

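# Before
make clean
make bench_random_mixed_hakmem
./bench_random_mixed_hakmem 100000 256 42

# After (pass the define through CFLAGS; a bare HAKMEM_BUILD_RELEASE=1 make
# variable has no effect unless the Makefile forwards it to the compiler)
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem
./bench_random_mixed_hakmem 100000 256 42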

5.2 HIGH PRIORITY: Pre-compute Env Vars at Init (Low risk, 10-15% impact)

Action: Move getenv() calls from hot path to global init

Current (lazy init in hot path):

// SLOW: the -1 check runs on every allocation/refill; getenv() itself runs only on first use
if (g_tiny_profile_enabled == -1) {
    const char* env = getenv("HAKMEM_TINY_PROFILE");  // 50-100 cycles!
    g_tiny_profile_enabled = (env && *env && *env != '0') ? 1 : 0;
}

Fixed (pre-compute at init):

// hakmem_init.c (runs once at startup)
void hakmem_tiny_init_config(void) {
    // Profile mode
    const char* env = getenv("HAKMEM_TINY_PROFILE");
    g_tiny_profile_enabled = (env && *env && *env != '0') ? 1 : 0;

    // Refill counts
    const char* hot_env = getenv("HAKMEM_TINY_REFILL_COUNT_HOT");
    g_refill_count_hot = hot_env ? atoi(hot_env) : HAKMEM_TINY_REFILL_DEFAULT;

    const char* mid_env = getenv("HAKMEM_TINY_REFILL_COUNT_MID");
    g_refill_count_mid = mid_env ? atoi(mid_env) : HAKMEM_TINY_REFILL_DEFAULT;
}

Expected result:

-6-9 branches (3 getenv lazy-init patterns)
-150-300 cycles on first access per thread
+5-10% performance (cleaner hot path)

Files to modify:

core/tiny_alloc_fast.inc.h:104 - Remove lazy init
core/hakmem_tiny_refill_p0.inc.h:66-84 - Remove lazy init
core/hakmem_init.c - Add global init function
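When the parsing moves to init, it may be worth replacing `atoi()`, which cannot report malformed input, with a checked `strtol()` call. The following is a sketch under that assumption; both function names are hypothetical, and only the environment variable name comes from the report.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a positive integer setting from an environment string.
 * Returns `fallback` when the string is NULL, empty, non-numeric,
 * has trailing characters, is out of int range, or is not positive.
 */
static int demo_parse_positive_int(const char* s, int fallback) {
    if (s == NULL || *s == '\0') return fallback;

    char* end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);

    if (errno != 0 || end == s || *end != '\0') return fallback;
    if (v <= 0 || v > INT_MAX) return fallback;
    return (int)v;
}

/* Intended to be called once at startup, outside the hot path. */
static int demo_read_refill_count_hot(int fallback) {
    return demo_parse_positive_int(getenv("HAKMEM_TINY_REFILL_COUNT_HOT"),
                                   fallback);
}
```

A rejected value falls back to the default instead of silently becoming 0, so a typo in the environment cannot disable refills.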

5.3 MEDIUM PRIORITY: Simplify Cache Layers (Medium risk, 5-10% impact)

Option A: Remove SFC Layer (Recommended)

Rationale:

SFC adds 5-6 branches with minimal benefit
SLL already provides TLS freelist (same as System tcache)
Phase 7 Task 3 pre-warming gives SLL 95%+ hit rate
Three cache layers = unnecessary complexity

Implementation:

// Remove SFC entirely, use only SLL
static inline void* tiny_alloc_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);

    // Layer 1: TLS freelist (SLL) - DIRECT ACCESS
    void* head = g_tls_sll_head[class_idx];
    if (head != NULL) {
        g_tls_sll_head[class_idx] = *(void**)head;
        g_tls_sll_count[class_idx]--;
        return head;  // 3 instructions, 1-2 branches!
    }

    // Refill from SuperSlab
    if (tiny_alloc_fast_refill(class_idx) > 0) {
        head = g_tls_sll_head[class_idx];
        // ... retry pop
    }

    return hak_tiny_alloc_slow(size, class_idx);
}

Expected result:

-5-10% branches (remove SFC layer)
Simpler code (easier to debug/maintain)
Same or better performance (fewer layers = less overhead)
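To make the single-layer idea concrete, here is a self-contained sketch of an intrusive per-thread freelist with one branch on the pop path. All `demo_` names and the class count are hypothetical, and `__thread` is the GCC/Clang spelling used elsewhere in this report.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_CLASSES 8   /* hypothetical class count */

/* Per-thread freelist heads and counts, one slot per size class. */
static __thread void*    demo_sll_head[DEMO_NUM_CLASSES];
static __thread uint32_t demo_sll_count[DEMO_NUM_CLASSES];

/* Push a free block: its first word stores the previous head. */
static void demo_sll_push(int class_idx, void* block) {
    *(void**)block = demo_sll_head[class_idx];
    demo_sll_head[class_idx] = block;
    demo_sll_count[class_idx]++;
}

/* Pop the most recently freed block, or NULL when the list is empty. */
static void* demo_sll_pop(int class_idx) {
    void* head = demo_sll_head[class_idx];
    if (head == NULL) return NULL;            /* the only hot-path branch */
    demo_sll_head[class_idx] = *(void**)head; /* next pointer lives in the block */
    demo_sll_count[class_idx]--;
    return head;
}
```

The list is LIFO, so the block freed last is handed out first, which tends to keep recently touched memory in cache.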

Option B: Unified TLS Cache (Higher risk, 10-20% impact)

Design: Single TLS cache with adaptive sizing (like mimalloc)

// Per-class TLS cache with adaptive capacity
typedef struct TinyTLSCache {
    void* head;
    uint32_t count;
    uint32_t capacity;  // Adaptive: 16-256
} TinyTLSCache;  // typedef so the bare name is valid in C

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];

Expected result:

-10-20% branches (unified design)
Better cache utilization (adaptive sizing)
Matches System malloc architecture

5.4 LOW PRIORITY: Branch Hint Tuning (Low risk, 2-5% impact)

Action: Optimize __builtin_expect hints based on profiling

Current issues:

Some hints are incorrect (e.g., SFC disabled in production)
Missing hints on hot branches

Recommended changes:

// Line 184: SFC is DISABLED in most production builds
if (__builtin_expect(sfc_is_enabled, 1)) {  // WRONG!
// Fix:
if (__builtin_expect(sfc_is_enabled, 0)) {  // Expect disabled

// Line 208: Corruption checks are rare in production
if (__builtin_expect(tiny_refill_failfast_level() >= 2, 0)) {  // CORRECT

// Line 457: Size > 1KB is common in mixed workloads
if (__builtin_expect(class_idx < 0, 0)) {  // May be wrong for some workloads

Expected result:

-2-5% branch-misses (better prediction)
+2-5% performance (reduced pipeline stalls)
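Branch hints are usually wrapped in a pair of macros so that the expected direction is visible at the call site. A minimal sketch follows, assuming GCC or Clang (`__builtin_expect` is their builtin); the macro and function names are hypothetical.

```c
/* GCC/Clang builtin; other compilers would need a fallback definition. */
#define DEMO_LIKELY(x)   __builtin_expect(!!(x), 1)
#define DEMO_UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: the empty-slot case is marked as the rare one. */
static int demo_pop_or_default(const int* slot, int fallback) {
    if (DEMO_UNLIKELY(slot == 0)) {
        return fallback;   /* cold path */
    }
    return *slot;          /* hot path, falls through */
}
```

The `!!` normalizes any non-zero value to 1 so that it can match the expected value. A hint only influences code layout; a wrong hint leaves the result unchanged but can make the hot path slower, which is why the report ties hints to profiling data.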

6. Expected Results Summary

6.1 Cumulative Impact (All Optimizations)

| Optimization | Branch Reduction | Cycle Reduction | Risk | Effort |
|---|---|---|---|---|
| Enable Release Mode | -40-50% | -20-30% | None | 1 line |
| Pre-compute Env Vars | -10-15% | -5-10% | Low | 1 day |
| Remove SFC Layer | -5-10% | -5-10% | Medium | 2 days |
| Branch Hint Tuning | -2-5% | -2-5% | Low | 1 day |
| TOTAL | -50-65% | -30-45% | Low | 4-5 days |
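The TOTAL row is lower than the straight sum of the rows above it (57-80% for branches). One reading, which is an assumption since the report does not state how the total was derived, is that each reduction applies to what remains after the previous one. A sketch of that calculation (the function name is hypothetical):

```c
#include <stddef.h>

/*
 * Combine fractional reductions (0.40 means -40%).
 * Each step applies to what remains after the previous one:
 * remaining = (1 - r0) * (1 - r1) * ...
 */
static double demo_compound_reduction(const double* r, size_t n) {
    double remaining = 1.0;
    for (size_t i = 0; i < n; i++) {
        remaining *= 1.0 - r[i];
    }
    return 1.0 - remaining;
}
```

Under that assumption, the branch column gives roughly 50% at the low end (0.40, 0.10, 0.05, 0.02) and roughly 64% at the high end (0.50, 0.15, 0.10, 0.05), close to the table's 50-65%.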

Projected final results:

Branches: 17M → 6-8.5M (vs System's 2M)
Branch-miss rate: 10.84% → 6-8% (vs System's 4.56%)
Throughput: Current → +40-80% improvement

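Long-term target: 70-90% of System malloc performance. Current wall time is ~3% of System (0.103s vs 0.003s), so the +40-80% projected above would reach only about 4-5% of System; closing the rest of the gap requires gains beyond the optimizations listed in this report.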

6.2 Quick Win: Release Mode Only

Minimal change, maximum impact:

# Add one line to Makefile
CFLAGS += -DHAKMEM_BUILD_RELEASE=1

# Rebuild
make clean && make bench_random_mixed_hakmem

# Test
./bench_random_mixed_hakmem 100000 256 42

Expected:

-40-50% branches (17M → 8.5-10M)
+30-50% performance (immediate)
0 code changes (just a flag)

7. A/B Test Plan

7.1 Baseline Measurement

# Measure current performance
perf stat -e branch-misses,branches,cycles,instructions \
  ./bench_random_mixed_hakmem 100000 256 42

# Output:
# branches:       17,098,340
# branch-misses:   1,854,018 (10.84%)
# cycles:         ~83M

7.2 Test 1: Release Mode

# Build with release flag
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem

# Measure
perf stat -e branch-misses,branches,cycles,instructions \
  ./bench_random_mixed_hakmem 100000 256 42

# Expected:
# branches:       ~9M (-47%)
# branch-misses:  ~700K (7.8%)
# cycles:         ~60M (-27%)

7.3 Test 2: Release + Pre-compute Env

# Implement env var pre-computation (see 5.2)
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem

# Expected:
# branches:       ~8M (-53%)
# branch-misses:  ~600K (7.5%)
# cycles:         ~55M (-33%)

7.4 Test 3: Release + Pre-compute + Remove SFC

# Remove SFC layer (see 5.3)
make clean
make CFLAGS="-DHAKMEM_BUILD_RELEASE=1 -O3" bench_random_mixed_hakmem

# Expected:
# branches:       ~7M (-59%)
# branch-misses:  ~500K (7.1%)
# cycles:         ~50M (-40%)

7.5 Success Criteria

| Metric | Current | Target | Stretch Goal |
|---|---|---|---|
| Branches | 17M | <10M | <8M |
| Branch-miss rate | 10.84% | <8% | <7% |
| vs System malloc | 8.5x slower | <5x slower | <3x slower |
| Throughput | 1.07M ops/s | >2M ops/s | >3M ops/s |

8. Comparison with System Malloc Strategy

8.1 System malloc tcache (glibc 2.26+)

Design:

// Simplified pseudocode of the tcache fast path (not verbatim glibc source)
// Allocation (2-3 instructions, 1-2 branches)
void* tcache_get(size_t size) {
    int tc_idx = csize2tidx(size);  // Size to index (no branch)
    tcache_entry* e = tcache->entries[tc_idx];
    if (e != NULL) {  // BRANCH 1
        tcache->entries[tc_idx] = e->next;
        tcache->counts[tc_idx]--;   // Keep the per-bin count in sync
        return (void*)e;
    }
    return _int_malloc(av, size);  // Slow path (av: the thread's arena)
}

// Free (2 instructions, 1 branch)
void tcache_put(void* ptr, size_t size) {
    int tc_idx = csize2tidx(size);  // Size to index (no branch)
    if (tcache->counts[tc_idx] < mp_.tcache_count) {  // BRANCH 1 (per-bin limit, default 7)
        tcache_entry* e = (tcache_entry*)ptr;
        e->next = tcache->entries[tc_idx];
        tcache->entries[tc_idx] = e;
        tcache->counts[tc_idx]++;
    }
    // Else: fall back to _int_free
}

Key insights:

1-2 branches total (vs HAKMEM's 16-21)
No validation in fast path
No debug guards in production
Single TLS cache layer (vs HAKMEM's 3 layers)
No getenv() calls in the fast path (tunables are read once at startup)

8.2 mimalloc

Design:

// Allocation (3-4 instructions, 1-2 branches)
// Simplified pseudocode; helper names are illustrative and may not match mimalloc's internals
void* mi_malloc(size_t size) {
    mi_page_t* page = _mi_page_fast();  // TLS page cache
    if (mi_likely(page != NULL)) {  // BRANCH 1
        void* p = page->free;
        if (mi_likely(p != NULL)) {  // BRANCH 2
            page->free = mi_ptr_decode(p);
            return p;
        }
    }
    return mi_malloc_generic(NULL, size);  // Slow path
}

Key insights:

2 branches total (vs HAKMEM's 16-21)
Inline header metadata (similar to HAKMEM Phase 7)
No debug overhead in release builds
Simple TLS structure (page + free pointer)

9. Conclusion

Root Cause: HAKMEM executes 8.5x more branches than System malloc due to:

Debug code running in production (HAKMEM_BUILD_RELEASE not defined)
Complex multi-layer cache (SFC → SLL → SuperSlab)
Runtime env var checks in hot path
Excessive validation and profiling

Immediate Action (1 line change):

CFLAGS += -DHAKMEM_BUILD_RELEASE=1  # Expected: +30-50% performance

Full Fix (4-5 days work):

Enable release mode
Pre-compute env vars at init
Remove redundant SFC layer
Optimize branch hints

Expected Result:

-50-65% branches (17M → 6-8.5M)
-30-45% cycles
+40-80% throughput
Long-term target: 70-90% of System malloc performance (vs current ~3%)

Next Steps:

✅ Enable HAKMEM_BUILD_RELEASE=1 (immediate)
Run A/B tests (measure impact)
Implement env var pre-computation (1 day)
Evaluate SFC removal (2 days)
Re-measure and iterate

Appendix A: Detailed Branch Inventory

Allocation Path (tiny_alloc_fast.inc.h)

| Line | Branch | Frequency | Type | Fix |
|---|---|---|---|---|
| 177-182 | SFC check done | Cold (once/thread) | Init | Pre-compute |
| 184 | SFC enabled | Hot | Runtime | Remove SFC |
| 186 | SFC ptr != NULL | Hot | Fast path | Keep (necessary) |
| 204 | SLL enabled | Hot | Runtime | Make compile-time |
| 206 | SLL head != NULL | Hot | Fast path | Keep (necessary) |
| 208 | Failfast ≥ 2 | Hot | Debug | Remove in release |
| 211-216 | Alignment check | Hot | Debug | Remove in release |
| 225 | Failfast ≥ 2 | Hot | Debug | Remove in release |
| 227-234 | Next validation | Hot | Debug | Remove in release |
| 241 | Count > 0 | Hot | Unnecessary | Remove |
| 171-173 | Profile enabled | Hot | Debug | Remove in release |
| 250-256 | Profile rdtsc | Hot | Debug | Remove in release |

Total: 16-21 branches → Target: 2-3 branches (roughly 80-90% reduction)

Refill Path (hakmem_tiny_refill_p0.inc.h)

| Line | Branch | Frequency | Type | Fix |
|---|---|---|---|---|
| 33 | !g_use_superslab | Cold | Config | Remove check |
| 41 | !tls->ss | Hot | Refill | Keep (necessary) |
| 46 | !meta | Hot | Refill | Keep (necessary) |
| 56 | room <= 0 | Hot | Capacity | Keep (necessary) |
| 66-73 | Hot override | Cold | Env var | Pre-compute |
| 76-83 | Mid override | Cold | Env var | Pre-compute |
| 116-119 | Remote drain | Hot | Optimization | Keep |
| 138 | Capacity check | Hot | Refill | Keep (necessary) |

Total: 10-15 branches → Target: 5-8 branches (40-50% reduction)

End of Report
