# Commit 67fb15f35f: Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized
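
The guard pattern applied above is sketched below. This is a minimal illustration only: it assumes `HAKMEM_BUILD_RELEASE` is defined to 0 or 1 by the build system, and the function name and message text are invented for the example rather than taken from `core/hakmem_shared_pool.c`.

```c
/* Illustrative only: the real call sites and messages live in
 * core/hakmem_shared_pool.c; HAKMEM_BUILD_RELEASE is assumed to be
 * defined (0 or 1) by build.sh. */
#include <stdio.h>

static void sp_warn_node_pool_exhausted(int class_idx)
{
#if !HAKMEM_BUILD_RELEASE
    /* Debug-only diagnostic: compiled out of release builds entirely,
     * so the hot path pays no fprintf/stderr cost. */
    fprintf(stderr, "[SP] node pool exhausted (class=%d)\n", class_idx);
#else
    (void)class_idx;  /* keep release builds warning-free */
#endif
}
```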

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```


# HAKMEM Bottleneck Analysis Report

- **Date**: 2025-11-14
- **Phase**: Post SP-SLOT Box Implementation
- **Objective**: Identify next optimization targets to close the gap with System malloc / mimalloc


## Executive Summary

Comprehensive performance analysis reveals a 10x gap versus System malloc for the Tiny allocator and a 22x gap for the Mid-Large allocator. Primary bottlenecks identified: syscall overhead (futex: 68% of syscall time), frontend cache misses, and Mid-Large allocator failure.

### Performance Gaps (Current State)

| Allocator | Tiny (random_mixed) | Mid-Large MT (8-32KB) |
|---|---|---|
| System malloc | 51.9M ops/s (100%) | 5.4M ops/s (100%) |
| mimalloc | 57.5M ops/s (111%) | 24.2M ops/s (448%) |
| HAKMEM (best) | 5.2M ops/s (10%) | 0.24M ops/s (4.4%) |
| Gap | -90% (10x slower) | -95.6% (22x slower) |

Urgent: Mid-Large allocator requires immediate attention (97x slower than mimalloc).


## 1. Benchmark Results: Current State

### 1.1 Random Mixed (Tiny Allocator: 16B-1KB)

Test Configuration:

  • 200K iterations
  • Working set: 4,096 slots
  • Size range: 16-1040 bytes (C0-C7 classes)

Results:

| Variant | spec_mask | fast_cap | Throughput | vs System | vs mimalloc |
|---|---|---|---|---|---|
| System malloc | - | - | 51.9M ops/s | 100% | 90% |
| mimalloc | - | - | 57.5M ops/s | 111% | 100% |
| HAKMEM | 0 | 8 | 3.6M ops/s | 6.9% | 6.3% |
| HAKMEM | 0 | 16 | 4.6M ops/s | 8.9% | 8.0% |
| HAKMEM | 0 | 32 | 5.2M ops/s | 10.0% | 9.0% |
| HAKMEM | 0x0F | 32 | 5.18M ops/s | 10.0% | 9.0% |

Key Findings:

  • Best HAKMEM config: fast_cap=32, spec_mask=0 → 5.2M ops/s
  • Gap: 10x slower than System, 11x slower than mimalloc
  • spec_mask effect: Negligible (<1% difference)
  • fast_cap scaling: 8→16 (+28%), 16→32 (+13%)

### 1.2 Mid-Large MT (8-32KB Allocations)

Test Configuration:

  • 2 threads
  • 40K cycles
  • Working set: 2,048 slots

Results:

| Allocator | Throughput | vs System | vs mimalloc |
|---|---|---|---|
| System malloc | 5.4M ops/s | 100% | 22% |
| mimalloc | 24.2M ops/s | 448% | 100% |
| HAKMEM (base) | 0.243M ops/s | 4.4% | 1.0% |
| HAKMEM (no bigcache) | 0.251M ops/s | 4.6% | 1.0% |

Critical Issue:

```
[ALLOC] 33KB: hkm_ace_alloc returned (nil)  ← Repeated failures
```

Gap: 22x slower than System, 97x slower than mimalloc 💀

Root Cause: hkm_ace_alloc consistently returns NULL → Mid-Large allocator not functioning properly.


## 2. Syscall Analysis (strace)

### 2.1 System Call Distribution (200K iterations)

| Syscall | Calls | % Time | usec/call | Category |
|---|---|---|---|---|
| futex | 36 | 68.18% | 1,970 | Synchronization ⚠️ |
| munmap | 1,665 | 11.60% | 7 | SS deallocation |
| mmap | 1,692 | 7.28% | 4 | SS allocation |
| madvise | 1,591 | 6.85% | 4 | Memory advice |
| mincore | 1,574 | 5.51% | 3 | Page existence check |
| Other | 1,141 | 0.57% | - | Misc |
| Total | 6,703 | 100% | 15 (avg) | |

### 2.2 Key Observations

Unexpected: futex Dominates (68% time)

  • 36 futex calls consuming 68.18% of syscall time
  • 1,970 usec/call (extremely slow!)
  • Context: bench_random_mixed is single-threaded
  • Hypothesis: Contention in shared pool lock (pthread_mutex_lock in shared_pool_acquire_slab)

SP-SLOT Impact Confirmed:

```
Before SP-SLOT: mmap (3,241) + munmap (3,214) = 6,455 calls
After SP-SLOT:  mmap (1,692) + munmap (1,665) = 3,357 calls
Reduction:      -48% (-3,098 calls) ✅
```

Remaining syscall overhead:

  • madvise: 1,591 calls (6.85% time) - from other allocators?
  • mincore: 1,574 calls (5.51% time) - still present despite Phase 9 removal?

## 3. SP-SLOT Box Effectiveness Review

### 3.1 SuperSlab Allocation Reduction

Measured with debug logging (HAKMEM_SS_ACQUIRE_DEBUG=1):

| Metric | Before SP-SLOT | After SP-SLOT | Improvement |
|---|---|---|---|
| New SuperSlabs (Stage 3) | 877 (200K iters) | 72 (200K iters) | -92% 🎉 |
| Syscalls (mmap+munmap) | 6,455 | 3,357 | -48% |
| Throughput | 563K ops/s | 1.30M ops/s | +131% |

### 3.2 Allocation Stage Distribution (50K iterations)

| Stage | Description | Count | % |
|---|---|---|---|
| Stage 1 | EMPTY slot reuse (per-class free list) | 105 | 4.6% |
| Stage 2 | UNUSED slot reuse (multi-class sharing) | 2,117 | 92.4% |
| Stage 3 | New SuperSlab (mmap) | 69 | 3.0% |
| Total | | 2,291 | 100% |

Key Insight: Stage 2 (UNUSED reuse) is dominant, proving multi-class SuperSlab sharing works.


## 4. Identified Bottlenecks (Priority Order)

### Priority 1: Mid-Large Allocator Failure 🔥

- **Impact**: 97x slower than mimalloc
- **Symptom**: hkm_ace_alloc returns NULL
- **Evidence**:

```
[ALLOC] 33KB: TINY_MAX_SIZE=1024, threshold=524288, condition=1
[ALLOC] 33KB: Calling hkm_ace_alloc
[ALLOC] 33KB: hkm_ace_alloc returned (nil)  ← Repeated failures
```

Root Cause Hypothesis:

  • Pool TLS arena not initialized?
  • Threshold logic preventing 8-32KB allocations?
  • Bug in hkm_ace_alloc path?

Action Required: Immediate investigation (blocking)


### Priority 2: futex Overhead (68% of syscall time) ⚠️

- **Impact**: 68.18% of syscall time (1,970 usec/call)
- **Symptom**: Excessive lock contention in the shared pool
- **Root Cause**:

```c
// core/hakmem_shared_pool.c:343
pthread_mutex_lock(&g_shared_pool.alloc_lock);   // ← Contention point?
```

Hypothesis:

  • shared_pool_acquire_slab() called frequently (2,291 times / 50K iters)
  • Lock held too long (metadata scans, dynamic array growth)
  • Contention even in single-threaded workload (TLS drain threads?)

Potential Solutions:

  1. Lock-free fast path: Per-class lock-free pop from free lists (Stage 1)
  2. Reduce lock scope: Move metadata scans outside critical section
  3. Batch acquire: Acquire multiple slabs per lock acquisition (see the sketch below)
  4. Per-class locks: Replace global lock with per-class locks

Expected Impact: -50-80% reduction in futex time
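
Solution 3 (batch acquire) is not elaborated in Step 2 below, so here is a rough sketch of the idea. `shared_pool_acquire_batch`, `sp_acquire_one_locked`, `TinySlab`, and `SP_ACQUIRE_BATCH` are assumed names, not the actual HAKMEM API; only `g_shared_pool.alloc_lock` is the lock shown under Root Cause above.

```c
// Hypothetical "batch acquire" sketch: amortize one alloc_lock round-trip
// over several slab acquisitions. sp_acquire_one_locked(), TinySlab and
// SP_ACQUIRE_BATCH are assumed names for illustration only.
#define SP_ACQUIRE_BATCH 4

static size_t shared_pool_acquire_batch(int class_idx, TinySlab** out, size_t want)
{
    if (want > SP_ACQUIRE_BATCH) want = SP_ACQUIRE_BATCH;

    size_t got = 0;
    pthread_mutex_lock(&g_shared_pool.alloc_lock);
    while (got < want) {
        TinySlab* slab = sp_acquire_one_locked(class_idx);  // lock already held
        if (slab == NULL) break;          // pool exhausted: stop early
        out[got++] = slab;
    }
    pthread_mutex_unlock(&g_shared_pool.alloc_lock);
    return got;                           // surplus slabs stay in the caller's TLS cache
}
```

The point is that the per-acquisition futex cost is divided by the batch size, at the price of holding the lock slightly longer per acquisition.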


### Priority 3: Frontend Cache Miss Rate

- **Impact**: Driving backend allocation frequency (2,291 acquires / 50K iters = 4.6%)
- **Current Config**: fast_cap=32 (best performance)
- **Evidence**: fast_cap scaling (8→16: +28%, 16→32: +13%)

Hypothesis:

  • TLS cache capacity too small for working set (4,096 slots)
  • Refill batch size suboptimal
  • Specialize mask (0x0F) shows no benefit (<1% difference)

Potential Solutions:

  1. Increase fast_cap: Test 64 / 128 (diminishing returns expected)
  2. Tune refill batch: Current 64 (HAKMEM_TINY_REFILL_COUNT_HOT) → test 128 / 256
  3. Class-specific tuning: Hot classes (C6, C7) get larger caches

Expected Impact: +10-20% throughput (backend call reduction)
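
The fast_cap / refill knobs listed above are ENV-driven. As a hedged illustration of how such overrides might be read at init: the variable names, defaults, and clamp ranges below are assumptions; only the ENV names `HAKMEM_TINY_FAST_CAP` and `HAKMEM_TINY_REFILL_COUNT_HOT` come from this report.

```c
// Hypothetical sketch of reading the ENV-driven tuning knobs at init.
#include <stdlib.h>

static int env_int(const char* name, int fallback, int lo, int hi)
{
    const char* s = getenv(name);
    if (s == NULL || *s == '\0') return fallback;
    long v = strtol(s, NULL, 10);
    if (v < lo) v = lo;
    if (v > hi) v = hi;
    return (int)v;
}

static int g_tiny_fast_cap;
static int g_tiny_refill_hot;

static void tiny_frontend_tuning_init(void)
{
    g_tiny_fast_cap   = env_int("HAKMEM_TINY_FAST_CAP", 32, 8, 128);
    g_tiny_refill_hot = env_int("HAKMEM_TINY_REFILL_COUNT_HOT", 64, 16, 256);
}
```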


### Priority 4: Remaining syscall Overhead (mmap/munmap/madvise/mincore)

- **Impact**: 30.59% of syscall time (3,357 mmap/munmap + 1,591 madvise + 1,574 mincore)
- **Status**: Significantly improved vs pre-SP-SLOT (-48% mmap/munmap)

Remaining Issues:

  1. madvise (1,591 calls): Where are these coming from?

    • Pool TLS arena (8-52KB)?
    • Mid-Large allocator (broken)?
    • Other internal structures?
  2. mincore (1,574 calls): Still present despite Phase 9 removal claim

    • Source location unknown
    • May be from other allocators or debug paths

Action Required: Trace source of madvise/mincore calls


## 5. Performance Evolution Timeline

### Historical Performance Progression

| Phase | Optimization | Throughput | vs Baseline | vs System |
|---|---|---|---|---|
| Baseline (Phase 8) | - | 563K ops/s | +0% | 1.1% |
| Phase 9 (LRU + mincore removal) | Lazy deallocation | 9.71M ops/s | +1,625% | 18.7% |
| Phase 10 (TLS/SFC tuning) | Frontend expansion | 9.89M ops/s | +1,657% | 19.0% |
| Phase 11 (Prewarm) | Startup SS allocation | 9.38M ops/s | +1,566% | 18.1% |
| Phase 12-A (TLS SLL Drain) | Periodic drain | 6.1M ops/s | +984% | 11.8% |
| Phase 12-B (SP-SLOT Box) | Per-slot management | 1.30M ops/s | +131% | 2.5% |
| Current (optimized ENV) | fast_cap=32 | 5.2M ops/s | +824% | 10.0% |

Note: Discrepancy between Phase 12-B (1.30M) and Current (5.2M) due to ENV configuration:

  • Default: No ENV → 1.30M ops/s
  • Optimized: HAKMEM_TINY_FAST_CAP=32 + other flags → 5.2M ops/s

## 6. Working Set Sensitivity

Test Results (fast_cap=32, spec_mask=0):

| Cycles | WS | Throughput | vs ws=4096 |
|---|---|---|---|
| 200K | 4,096 | 5.2M ops/s | 100% (baseline) |
| 200K | 8,192 | 4.0M ops/s | -23% |
| 400K | 4,096 | 5.3M ops/s | +2% |
| 400K | 8,192 | 4.7M ops/s | -10% |

Observation: 23% performance drop when working set doubles (4K→8K)

Hypothesis:

  • Larger working set → more backend allocation calls
  • TLS cache misses increase
  • SuperSlab churn increases (more Stage 3 allocations)

Implication: Current frontend cache size (fast_cap=32) insufficient for large working sets.


## 7. Recommended Next Steps

### Step 1: Fix Mid-Large Allocator (URGENT) 🔥

- **Priority**: P0 (Blocking)
- **Impact**: 97x gap with mimalloc
- **Effort**: Medium

Tasks:

  1. Investigate hkm_ace_alloc NULL returns
  2. Check Pool TLS arena initialization
  3. Verify threshold logic for 8-32KB allocations
  4. Add debug logging to trace allocation path (see the sketch below)

Success Criteria: Mid-Large throughput >1M ops/s (current: 0.24M)
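
A minimal sketch of the trace logging task 4 refers to, assuming a malloc-like signature for `hkm_ace_alloc` and a hypothetical wrapper plus `HAKMEM_ACE_TRACE` switch; the `[ALLOC]` format mirrors the log lines quoted in Section 4.

```c
// Hypothetical trace wrapper around the hkm_ace_alloc call site.
// Only hkm_ace_alloc is a real HAKMEM symbol; its malloc-like signature,
// the env switch, and the wrapper name are assumptions for this sketch.
#include <stdio.h>
#include <stdlib.h>

extern void* hkm_ace_alloc(size_t size);

static void* hkm_ace_alloc_traced(size_t size)
{
    static int trace = -1;
    if (trace < 0) trace = (getenv("HAKMEM_ACE_TRACE") != NULL);

    void* p = hkm_ace_alloc(size);
#if !HAKMEM_BUILD_RELEASE
    if (trace && p == NULL) {
        // Log only the failure path so the hot path stays quiet.
        fprintf(stderr, "[ALLOC] %zuKB: hkm_ace_alloc returned (nil)\n",
                size / 1024);
    }
#endif
    return p;
}
```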


### Step 2: Optimize Shared Pool Lock Contention

- **Priority**: P1 (High)
- **Impact**: 68% of syscall time
- **Effort**: Medium

Options (in order of risk):

A) Lock-free Stage 1 (Low Risk):

```c
// Per-class atomic LIFO for EMPTY slot reuse
_Atomic(FreeSlotEntry*) g_free_list_heads[TINY_NUM_CLASSES];

// Lock-free pop (Stage 1 fast path)
FreeSlotEntry* sp_freelist_pop_lockfree(int class_idx) {
    FreeSlotEntry* head = atomic_load(&g_free_list_heads[class_idx]);
    while (head != NULL) {
        if (atomic_compare_exchange_weak(&g_free_list_heads[class_idx], &head, head->next)) {
            return head;
        }
    }
    return NULL;  // Fall back to locked Stage 2/3
}
```

Expected: -50% futex overhead (Stage 1 hit rate: 4.6% → lock-free)
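
For completeness, the matching lock-free push would be a standard Treiber-stack push. This is an assumption about how slots would be returned to the per-class list, not existing HAKMEM code, and like the pop above it ignores ABA (tagged pointers or an epoch scheme would be needed in production).

```c
// Treiber-stack push matching the lock-free pop above. Reuses the
// FreeSlotEntry type and g_free_list_heads array declared in Option A.
// ABA hazards are ignored for brevity, as in the pop sketch.
#include <stdatomic.h>

void sp_freelist_push_lockfree(int class_idx, FreeSlotEntry* node) {
    FreeSlotEntry* head = atomic_load(&g_free_list_heads[class_idx]);
    do {
        node->next = head;   // link the node on top of the current head
    } while (!atomic_compare_exchange_weak(&g_free_list_heads[class_idx],
                                           &head, node));
}
```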

B) Reduce Lock Scope (Medium Risk):

```c
// Move metadata scan outside lock
int candidate_slot = sp_meta_scan_unlocked();  // Read-only
pthread_mutex_lock(&g_shared_pool.alloc_lock);
if (sp_slot_try_claim(candidate_slot)) {  // Quick CAS
    // Success
}
pthread_mutex_unlock(&g_shared_pool.alloc_lock);
```

Expected: -30% futex overhead (reduce lock hold time)

C) Per-Class Locks (High Risk):

```c
pthread_mutex_t g_class_locks[TINY_NUM_CLASSES];  // Replace global lock
```

- **Expected**: -80% futex overhead (eliminate cross-class contention)
- **Risk**: Complexity increase, potential deadlocks

Recommendation: Start with Option A (lowest risk, measurable impact).


### Step 3: TLS Drain Interval Tuning (Low Risk)

- **Priority**: P2 (Medium)
- **Impact**: TBD (experimental)
- **Effort**: Low (ENV-only A/B testing)

Current: 1,024 frees/class (HAKMEM_TINY_SLL_DRAIN_INTERVAL=1024)
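
As a mental model for what this knob controls, here is a minimal sketch of a per-class free counter that triggers a TLS drain once the interval is reached. Only the ENV name comes from this report; `tls_sll_drain_class`, the counter layout, and the class count are assumptions.

```c
// Sketch of the drain-interval mechanism being tuned in this step.
#include <stdlib.h>

#define TINY_NUM_CLASSES 8   // assumption for the sketch (C0-C7)

static __thread unsigned g_free_count[TINY_NUM_CLASSES];
static unsigned g_drain_interval = 1024;   // HAKMEM_TINY_SLL_DRAIN_INTERVAL default

extern void tls_sll_drain_class(int class_idx);  // assumed: returns blocks to shared pool

static void tiny_free_note(int class_idx)
{
    if (++g_free_count[class_idx] >= g_drain_interval) {
        g_free_count[class_idx] = 0;
        tls_sll_drain_class(class_idx);   // larger interval => fewer drains and syscalls
    }
}

static void drain_interval_init(void)
{
    const char* s = getenv("HAKMEM_TINY_SLL_DRAIN_INTERVAL");
    if (s && *s) g_drain_interval = (unsigned)strtoul(s, NULL, 10);
}
```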

Experiment Matrix:

| Interval | Expected Impact |
|---|---|
| 512 | -50% drain overhead, +syscalls (more frequent SS release) |
| 2,048 | +100% drain overhead, -syscalls (less frequent SS release) |
| 4,096 | +300% drain overhead, --syscalls (minimal SS release) |

Metrics to Track:

  • Throughput (ops/s)
  • mmap/munmap count (strace)
  • TLS SLL drain frequency (debug log)

Success Criteria: Find optimal balance (throughput > 5.5M ops/s, syscalls < 3,000)


### Step 4: Frontend Cache Tuning (Medium Risk)

- **Priority**: P3 (Low)
- **Impact**: +10-20% expected
- **Effort**: Low (ENV-only A/B testing)

Current Best: fast_cap=32

Experiment Matrix:

| fast_cap | refill_count_hot | Expected Impact |
|---|---|---|
| 64 | 64 | +5-10% (diminishing returns) |
| 64 | 128 | +10-15% (better batch refill) |
| 128 | 128 | +15-20% (max cache size) |

Metrics to Track:

  • Throughput (ops/s)
  • Stage 3 frequency (debug log)
  • Working set sensitivity (ws=8192 test)

Success Criteria: Throughput > 6M ops/s on ws=4096, <10% drop on ws=8192


### Step 5: Trace Remaining Syscalls (Investigation)

- **Priority**: P4 (Low)
- **Impact**: TBD
- **Effort**: Low

Questions:

  1. madvise (1,591 calls): Where are these from?

    • Add debug logging to all madvise() call sites
    • Check Pool TLS arena, Mid-Large allocator
  2. mincore (1,574 calls): Why still present?

    • Grep codebase for mincore calls
    • Check if Phase 9 removal was incomplete

Tools:

```bash
# Trace madvise source
strace -e trace=madvise -k ./bench_random_mixed_hakmem 200000 4096 1234567

# Grep for mincore
grep -r "mincore" core/ --include="*.c" --include="*.h"
```

## 8. Risk Assessment

| Optimization | Impact | Effort | Risk | Recommendation |
|---|---|---|---|---|
| Mid-Large Fix | +++++ | ++ | Low | DO NOW 🔥 |
| Lock-free Stage 1 | +++ | ++ | Low | DO NEXT |
| Drain Interval Tune | ++ | + | Low | DO NEXT |
| Frontend Cache Tune | ++ | + | Low | DO AFTER |
| Reduce Lock Scope | +++ | +++ | Med | Consider |
| Per-Class Locks | ++++ | ++++ | High | Avoid (complex) |
| Trace Syscalls | ? | + | Low | Background task |

## 9. Expected Performance Targets

### Short-Term (1-2 weeks)

| Metric | Current | Target | Strategy |
|---|---|---|---|
| Mid-Large throughput | 0.24M ops/s | >1M ops/s | Fix hkm_ace_alloc |
| Tiny throughput (ws=4096) | 5.2M ops/s | >7M ops/s | Lock-free + drain tune |
| futex overhead | 68% | <30% | Lock-free Stage 1 |
| mmap+munmap | 3,357 | <2,500 | Drain interval tune |

### Medium-Term (1-2 months)

| Metric | Current | Target | Strategy |
|---|---|---|---|
| Tiny throughput (ws=4096) | 5.2M ops/s | >15M ops/s | Full optimization |
| vs System malloc | 10% | >25% | Close gap by 15pp |
| vs mimalloc | 9% | >20% | Close gap by 11pp |

### Long-Term (3-6 months)

| Metric | Current | Target | Strategy |
|---|---|---|---|
| Tiny throughput | 5.2M ops/s | >40M ops/s | Architectural overhaul |
| vs System malloc | 10% | >70% | Competitive performance |
| vs mimalloc | 9% | >60% | Industry-standard |

## 10. Lessons Learned

### 1. ENV Configuration is Critical

- **Discovery**: Default (1.30M) vs Optimized (5.2M) = +300% gap
- **Lesson**: Always document and automate optimal ENV settings
- **Action**: Create scripts/bench_optimal_env.sh with best-known config

### 2. Mid-Large Allocator Broken

- **Discovery**: 97x slower than mimalloc, NULL returns
- **Lesson**: Integration testing insufficient (bench suite doesn't cover 8-32KB properly)
- **Action**: Add bench_mid_large_single_thread.sh to CI suite

### 3. futex Overhead Unexpected

- **Discovery**: 68% of syscall time in a single-threaded workload
- **Lesson**: Shared pool global lock is a bottleneck even without contention
- **Action**: Profile lock hold time, consider lock-free paths

### 4. SP-SLOT Stage 2 Dominates

- **Discovery**: 92.4% of allocations reuse UNUSED slots (Stage 2)
- **Lesson**: Multi-class sharing >> per-class free lists
- **Action**: Optimize Stage 2 path (lock-free metadata scan?)


## 11. Conclusion

Current State:

  • SP-SLOT Box successfully reduced SuperSlab churn by 92%
  • Syscall overhead reduced by 48% (mmap+munmap)
  • ⚠️ Still 10x slower than System malloc (Tiny)
  • 🔥 Mid-Large allocator critically broken (97x slower than mimalloc)

Next Priorities:

  1. Fix Mid-Large allocator (P0, blocking)
  2. Optimize shared pool lock (P1, 68% syscall time)
  3. Tune drain interval (P2, low-risk improvement)
  4. Tune frontend cache (P3, diminishing returns)

Expected Impact (short-term):

  • Mid-Large: 0.24M → >1M ops/s (+316%)
  • Tiny: 5.2M → >7M ops/s (+35%)
  • futex overhead: 68% → <30% (-56%)

Long-Term Vision:

  • Close gap to 70% of System malloc performance (40M ops/s target)
  • Competitive with industry-standard allocators (mimalloc, jemalloc)

- **Report Generated**: 2025-11-14
- **Tool**: Claude Code
- **Phase**: Post SP-SLOT Box Implementation
- **Status**: Analysis Complete, Ready for Implementation