# HAKMEM Bottleneck Analysis Report

**Date**: 2025-11-14
**Phase**: Post SP-SLOT Box Implementation
**Objective**: Identify the next optimization targets to close the gap with System malloc / mimalloc

---

## Executive Summary

A comprehensive performance analysis reveals a **10x gap with System malloc** on the Tiny allocator and a **22x gap** on the Mid-Large allocator. The primary bottlenecks identified are **syscall overhead** (futex: 68% of syscall time), **frontend cache misses**, and a **Mid-Large allocator failure**.

### Performance Gaps (Current State)

| Allocator | Tiny (random_mixed) | Mid-Large MT (8-32KB) |
|-----------|---------------------|-----------------------|
| **System malloc** | 51.9M ops/s (100%) | 5.4M ops/s (100%) |
| **mimalloc** | 57.5M ops/s (111%) | 24.2M ops/s (448%) |
| **HAKMEM (best)** | 5.2M ops/s (**10%**) | 0.24M ops/s (**4.4%**) |
| **Gap** | **-90% (10x slower)** | **-95.6% (22x slower)** |

**Urgent**: The Mid-Large allocator requires immediate attention (97x slower than mimalloc).

---

## 1. Benchmark Results: Current State

### 1.1 Random Mixed (Tiny Allocator: 16B-1KB)

**Test Configuration**:
- 200K iterations
- Working set: 4,096 slots
- Size range: 16-1040 bytes (C0-C7 classes)

**Results**:

| Variant | spec_mask | fast_cap | Throughput | vs System | vs mimalloc |
|---------|-----------|----------|------------|-----------|-------------|
| **System malloc** | - | - | 51.9M ops/s | 100% | 90% |
| **mimalloc** | - | - | 57.5M ops/s | 111% | 100% |
| **HAKMEM** | 0 | 8 | 3.6M ops/s | 6.9% | 6.3% |
| **HAKMEM** | 0 | 16 | 4.6M ops/s | 8.9% | 8.0% |
| **HAKMEM** | 0 | **32** | **5.2M ops/s** | **10.0%** | **9.0%** |
| **HAKMEM** | 0x0F | 32 | 5.18M ops/s | 10.0% | 9.0% |

**Key Findings**:
- **Best HAKMEM config**: fast_cap=32, spec_mask=0 → **5.2M ops/s**
- **Gap**: 10x slower than System, 11x slower than mimalloc
- **spec_mask effect**: Negligible (<1% difference)
- **fast_cap scaling**: 8→16 (+28%), 16→32 (+13%)

### 1.2 Mid-Large MT (8-32KB Allocations)

**Test Configuration**:
- 2 threads
- 40K cycles
- Working set: 2,048 slots

**Results**:

| Allocator | Throughput | vs System | vs mimalloc |
|-----------|------------|-----------|-------------|
| **System malloc** | 5.4M ops/s | 100% | 22% |
| **mimalloc** | 24.2M ops/s | 448% | 100% |
| **HAKMEM (base)** | 0.243M ops/s | **4.4%** | **1.0%** |
| **HAKMEM (no bigcache)** | 0.251M ops/s | 4.6% | 1.0% |

**Critical Issue**:
```
[ALLOC] 33KB: hkm_ace_alloc returned (nil)   ← Repeated failures
```

**Gap**: 22x slower than System, **97x slower than mimalloc** 💀

**Root Cause**: `hkm_ace_alloc` consistently returns NULL → the Mid-Large allocator is not functioning properly.

---

## 2. Syscall Analysis (strace)

### 2.1 System Call Distribution (200K iterations)

| Syscall | Calls | % Time | usec/call | Category |
|---------|-------|--------|-----------|----------|
| **futex** | 36 | **68.18%** | 1,970 | Synchronization ⚠️ |
| **munmap** | 1,665 | 11.60% | 7 | SS deallocation |
| **mmap** | 1,692 | 7.28% | 4 | SS allocation |
| **madvise** | 1,591 | 6.85% | 4 | Memory advice |
| **mincore** | 1,574 | 5.51% | 3 | Page existence check |
| **Other** | 1,141 | 0.57% | - | Misc |
| **Total** | **6,703** | 100% | 15 (avg) | |

### 2.2 Key Observations

**Unexpected: futex dominates (68% of syscall time)**
- **36 futex calls** consuming **68.18% of syscall time**
- **1,970 usec/call** (extremely slow!)
- **Context**: `bench_random_mixed` is **single-threaded**
- **Hypothesis**: Contention on the shared pool lock (`pthread_mutex_lock` in `shared_pool_acquire_slab`)
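The futex attribution can be cross-checked directly with strace. The sketch below is one way to reproduce the summary above and capture a user-space stack for each futex call (`-k` requires an strace built with stack-unwinding support); the benchmark invocation mirrors the one used in the Tools block of Step 5, and the output file names are illustrative.

```bash
# Reproduce the per-syscall summary (calls, % time, usec/call)
strace -f -c -o syscall_summary.txt \
    ./bench_random_mixed_hakmem 200000 4096 1234567

# Capture a user-space stack trace for every futex call to confirm whether
# pthread_mutex_lock in shared_pool_acquire_slab is the dominant caller
strace -f -e trace=futex -k -o futex_stacks.txt \
    ./bench_random_mixed_hakmem 200000 4096 1234567
```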
**SP-SLOT Impact Confirmed**:
```
Before SP-SLOT: mmap (3,241) + munmap (3,214) = 6,455 calls
After SP-SLOT:  mmap (1,692) + munmap (1,665) = 3,357 calls
Reduction:      -48% (-3,098 calls) ✅
```

**Remaining syscall overhead**:
- **madvise**: 1,591 calls (6.85% time) - from other allocators?
- **mincore**: 1,574 calls (5.51% time) - still present despite Phase 9 removal?

---

## 3. SP-SLOT Box Effectiveness Review

### 3.1 SuperSlab Allocation Reduction

**Measured with debug logging** (`HAKMEM_SS_ACQUIRE_DEBUG=1`):

| Metric | Before SP-SLOT | After SP-SLOT | Improvement |
|--------|----------------|---------------|-------------|
| **New SuperSlabs** (Stage 3) | 877 (200K iters) | 72 (200K iters) | **-92%** 🎉 |
| **Syscalls (mmap+munmap)** | 6,455 | 3,357 | **-48%** |
| **Throughput** | 563K ops/s | 1.30M ops/s | **+131%** |

### 3.2 Allocation Stage Distribution (50K iterations)

| Stage | Description | Count | % |
|-------|-------------|-------|---|
| **Stage 1** | EMPTY slot reuse (per-class free list) | 105 | 4.6% |
| **Stage 2** | **UNUSED slot reuse (multi-class sharing)** | **2,117** | **92.4%** ✅ |
| **Stage 3** | New SuperSlab (mmap) | 69 | 3.0% |
| **Total** | | 2,291 | 100% |

**Key Insight**: Stage 2 (UNUSED reuse) is dominant, proving **multi-class SuperSlab sharing works**.

---

## 4. Identified Bottlenecks (Priority Order)

### Priority 1: Mid-Large Allocator Failure 🔥

**Impact**: 97x slower than mimalloc
**Symptom**: `hkm_ace_alloc` returns NULL

**Evidence**:
```
[ALLOC] 33KB: TINY_MAX_SIZE=1024, threshold=524288, condition=1
[ALLOC] 33KB: Calling hkm_ace_alloc
[ALLOC] 33KB: hkm_ace_alloc returned (nil)   ← Repeated failures
```

**Root Cause Hypothesis**:
- Pool TLS arena not initialized?
- Threshold logic preventing 8-32KB allocations?
- Bug in `hkm_ace_alloc` path?

**Action Required**: Immediate investigation (blocking)

---

### Priority 2: futex Overhead (68% syscall time) ⚠️

**Impact**: 68.18% of syscall time (1,970 usec/call)
**Symptom**: Excessive lock contention in shared pool

**Root Cause**:
```c
// core/hakmem_shared_pool.c:343
pthread_mutex_lock(&g_shared_pool.alloc_lock);  /* ← Contention point? */
```

**Hypothesis**:
- `shared_pool_acquire_slab()` called frequently (2,291 times / 50K iters)
- Lock held too long (metadata scans, dynamic array growth)
- Contention even in single-threaded workload (TLS drain threads?)

**Potential Solutions**:
1. **Lock-free fast path**: Per-class lock-free pop from free lists (Stage 1)
2. **Reduce lock scope**: Move metadata scans outside critical section
3. **Batch acquire**: Acquire multiple slabs per lock acquisition
4. **Per-class locks**: Replace global lock with per-class locks

**Expected Impact**: -50-80% reduction in futex time

---

### Priority 3: Frontend Cache Miss Rate

**Impact**: Driving backend allocation frequency (2,291 acquires / 50K iters = 4.6%)
**Current Config**: fast_cap=32 (best performance)
**Evidence**: fast_cap scaling (8→16: +28%, 16→32: +13%)

**Hypothesis**:
- TLS cache capacity too small for working set (4,096 slots)
- Refill batch size suboptimal
- Specialize mask (0x0F) shows no benefit (<1% difference)

**Potential Solutions**:
1. **Increase fast_cap**: Test 64 / 128 (diminishing returns expected)
2. **Tune refill batch**: Current 64 (HAKMEM_TINY_REFILL_COUNT_HOT) → test 128 / 256
3. **Class-specific tuning**: Hot classes (C6, C7) get larger caches

**Expected Impact**: +10-20% throughput (backend call reduction)
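Solutions 1 and 2 are ENV-only and can be A/B tested without code changes (the corresponding experiment matrix appears in Step 4 below). A minimal sweep sketch, assuming `HAKMEM_TINY_FAST_CAP` and `HAKMEM_TINY_REFILL_COUNT_HOT` are the environment variables referenced elsewhere in this report and that the benchmark prints an ops/s figure:

```bash
# Sweep frontend cache capacity (fast_cap) and hot-refill batch size.
# Env var names are the ones quoted in this report; the "ops/s" grep assumes
# the benchmark prints throughput in that form.
for cap in 32 64 128; do
    for refill in 64 128 256; do
        echo "=== fast_cap=$cap refill_hot=$refill ==="
        HAKMEM_TINY_FAST_CAP=$cap \
        HAKMEM_TINY_REFILL_COUNT_HOT=$refill \
            ./bench_random_mixed_hakmem 200000 4096 1234567 | grep -i "ops/s"
    done
done
```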
---

### Priority 4: Remaining syscall Overhead (mmap/munmap/madvise/mincore)

**Impact**: 30.59% syscall time (3,357 mmap/munmap + 1,591 madvise + 1,574 mincore)
**Status**: Significantly improved vs pre-SP-SLOT (-48% mmap/munmap)

**Remaining Issues**:
1. **madvise (1,591 calls)**: Where are these coming from?
   - Pool TLS arena (8-52KB)?
   - Mid-Large allocator (broken)?
   - Other internal structures?
2. **mincore (1,574 calls)**: Still present despite Phase 9 removal claim
   - Source location unknown
   - May be from other allocators or debug paths

**Action Required**: Trace source of madvise/mincore calls

---

## 5. Performance Evolution Timeline

### Historical Performance Progression

| Phase | Optimization | Throughput | vs Baseline | vs System |
|-------|--------------|------------|-------------|-----------|
| **Baseline** (Phase 8) | - | 563K ops/s | +0% | 1.1% |
| **Phase 9** (LRU + mincore removal) | Lazy deallocation | 9.71M ops/s | +1,625% | 18.7% |
| **Phase 10** (TLS/SFC tuning) | Frontend expansion | 9.89M ops/s | +1,657% | 19.0% |
| **Phase 11** (Prewarm) | Startup SS allocation | 9.38M ops/s | +1,566% | 18.1% |
| **Phase 12-A** (TLS SLL Drain) | Periodic drain | 6.1M ops/s | +984% | 11.8% |
| **Phase 12-B** (SP-SLOT Box) | Per-slot management | 1.30M ops/s | +131% | 2.5% |
| **Current (optimized ENV)** | fast_cap=32 | **5.2M ops/s** | **+824%** | **10.0%** |

**Note**: Discrepancy between Phase 12-B (1.30M) and Current (5.2M) is due to **ENV configuration**:
- Default: No ENV → 1.30M ops/s
- Optimized: `HAKMEM_TINY_FAST_CAP=32` + other flags → 5.2M ops/s

---

## 6. Working Set Sensitivity

**Test Results** (fast_cap=32, spec_mask=0):

| Cycles | WS | Throughput | vs ws=4096 |
|--------|-----|------------|------------|
| 200K | 4,096 | 5.2M ops/s | 100% (baseline) |
| 200K | 8,192 | 4.0M ops/s | -23% |
| 400K | 4,096 | 5.3M ops/s | +2% |
| 400K | 8,192 | 4.7M ops/s | -10% |

**Observation**: **23% performance drop** when working set doubles (4K→8K)

**Hypothesis**:
- Larger working set → more backend allocation calls
- TLS cache misses increase
- SuperSlab churn increases (more Stage 3 allocations)

**Implication**: Current frontend cache size (fast_cap=32) is insufficient for large working sets.

---

## 7. Recommended Next Steps (Priority Order)

### Step 1: Fix Mid-Large Allocator (URGENT) 🔥

**Priority**: P0 (Blocking)
**Impact**: 97x gap with mimalloc
**Effort**: Medium

**Tasks**:
1. Investigate `hkm_ace_alloc` NULL returns
2. Check Pool TLS arena initialization
3. Verify threshold logic for 8-32KB allocations
4. Add debug logging to trace allocation path

**Success Criteria**: Mid-Large throughput >1M ops/s (current: 0.24M)
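For tasks 1 and 4, a quick way to see what `hkm_ace_alloc` returns and where it bails out is to break on it in a debug build. This is only a sketch: `bench_mid_large_mt_hakmem` is a placeholder name for the Mid-Large MT benchmark binary, its arguments are omitted, and the function is assumed to be visible to the debugger.

```bash
# Placeholder binary name for the Mid-Large MT benchmark (debug build, args omitted).
# Break on hkm_ace_alloc, show the call stack at the first hit, then run to the
# function's return and print its value; repeat or add a breakpoint condition
# to reach a failing 33KB allocation.
gdb -batch \
    -ex 'break hkm_ace_alloc' \
    -ex 'run' \
    -ex 'bt' \
    -ex 'finish' \
    ./bench_mid_large_mt_hakmem
```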
---

### Step 2: Optimize Shared Pool Lock Contention

**Priority**: P1 (High)
**Impact**: 68% syscall time
**Effort**: Medium

**Options** (in order of risk):

**A) Lock-free Stage 1 (Low Risk)**:
```c
#include <stdatomic.h>

// Per-class atomic LIFO for EMPTY slot reuse
_Atomic(FreeSlotEntry*) g_free_list_heads[TINY_NUM_CLASSES];

// Lock-free pop (Stage 1 fast path)
FreeSlotEntry* sp_freelist_pop_lockfree(int class_idx) {
    FreeSlotEntry* head = atomic_load(&g_free_list_heads[class_idx]);
    while (head != NULL) {
        if (atomic_compare_exchange_weak(&g_free_list_heads[class_idx],
                                         &head, head->next)) {
            return head;
        }
    }
    return NULL;  // Fall back to locked Stage 2/3
}
```
**Expected**: -50% futex overhead (Stage 1 hit rate: 4.6% → lock-free)

**B) Reduce Lock Scope (Medium Risk)**:
```c
// Move metadata scan outside the lock
int candidate_slot = sp_meta_scan_unlocked();   // Read-only
pthread_mutex_lock(&g_shared_pool.alloc_lock);
if (sp_slot_try_claim(candidate_slot)) {        // Quick CAS
    // Success
}
pthread_mutex_unlock(&g_shared_pool.alloc_lock);
```
**Expected**: -30% futex overhead (reduce lock hold time)

**C) Per-Class Locks (High Risk)**:
```c
pthread_mutex_t g_class_locks[TINY_NUM_CLASSES];  // Replace global lock
```
**Expected**: -80% futex overhead (eliminate cross-class contention)
**Risk**: Complexity increase, potential deadlocks

**Recommendation**: Start with **Option A** (lowest risk, measurable impact).

---

### Step 3: TLS Drain Interval Tuning (Low Risk)

**Priority**: P2 (Medium)
**Impact**: TBD (experimental)
**Effort**: Low (ENV-only A/B testing)

**Current**: 1,024 frees/class (`HAKMEM_TINY_SLL_DRAIN_INTERVAL=1024`)

**Experiment Matrix**:

| Interval | Expected Impact |
|----------|-----------------|
| 512 | -50% drain overhead, +syscalls (more frequent SS release) |
| 2,048 | +100% drain overhead, -syscalls (less frequent SS release) |
| 4,096 | +300% drain overhead, --syscalls (minimal SS release) |

**Metrics to Track**:
- Throughput (ops/s)
- mmap/munmap count (strace)
- TLS SLL drain frequency (debug log)

**Success Criteria**: Find optimal balance (throughput > 5.5M ops/s, syscalls < 3,000)

A driver-script sketch for this sweep is shown after Step 5.

---

### Step 4: Frontend Cache Tuning (Medium Risk)

**Priority**: P3 (Low)
**Impact**: +10-20% expected
**Effort**: Low (ENV-only A/B testing)

**Current Best**: fast_cap=32

**Experiment Matrix**:

| fast_cap | refill_count_hot | Expected Impact |
|----------|------------------|-----------------|
| 64 | 64 | +5-10% (diminishing returns) |
| 64 | 128 | +10-15% (better batch refill) |
| 128 | 128 | +15-20% (max cache size) |

**Metrics to Track**:
- Throughput (ops/s)
- Stage 3 frequency (debug log)
- Working set sensitivity (ws=8192 test)

**Success Criteria**: Throughput > 6M ops/s on ws=4096, <10% drop on ws=8192

---

### Step 5: Trace Remaining Syscalls (Investigation)

**Priority**: P4 (Low)
**Impact**: TBD
**Effort**: Low

**Questions**:
1. **madvise (1,591 calls)**: Where are these from?
   - Add debug logging to all `madvise()` call sites
   - Check Pool TLS arena, Mid-Large allocator
2. **mincore (1,574 calls)**: Why still present?
   - Grep codebase for `mincore` calls
   - Check if Phase 9 removal was incomplete

**Tools**:
```bash
# Trace madvise source
strace -e trace=madvise -k ./bench_random_mixed_hakmem 200000 4096 1234567

# Grep for mincore
grep -r "mincore" core/ --include="*.c" --include="*.h"
```
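The driver sketch referenced in Step 3: it assumes `HAKMEM_TINY_SLL_DRAIN_INTERVAL` is the environment variable quoted there, reuses the benchmark invocation from the Tools block above, and uses `strace -c` summaries to collect the mmap/munmap counts listed under Metrics to Track. The output paths and the "ops/s" grep are illustrative.

```bash
# Sweep the TLS SLL drain interval and record throughput plus syscall counts.
# Assumes the benchmark prints an "ops/s" figure; strace -c summaries are kept
# per interval so mmap/munmap counts can be compared against the 3,357 baseline.
for interval in 512 1024 2048 4096; do
    echo "=== HAKMEM_TINY_SLL_DRAIN_INTERVAL=$interval ==="
    HAKMEM_TINY_SLL_DRAIN_INTERVAL=$interval \
        strace -f -c -o "/tmp/drain_${interval}.strace" \
        ./bench_random_mixed_hakmem 200000 4096 1234567 | grep -i "ops/s"
    grep -E "mmap|munmap" "/tmp/drain_${interval}.strace"
done
```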
---

## 8. Risk Assessment

| Optimization | Impact | Effort | Risk | Recommendation |
|--------------|--------|--------|------|----------------|
| **Mid-Large Fix** | +++++ | ++ | Low | **DO NOW** 🔥 |
| **Lock-free Stage 1** | +++ | ++ | Low | **DO NEXT** ✅ |
| **Drain Interval Tune** | ++ | + | Low | **DO NEXT** ✅ |
| **Frontend Cache Tune** | ++ | + | Low | **DO AFTER** |
| **Reduce Lock Scope** | +++ | +++ | Med | Consider |
| **Per-Class Locks** | ++++ | ++++ | High | Avoid (complex) |
| **Trace Syscalls** | ? | + | Low | Background task |

---

## 9. Expected Performance Targets

### Short-Term (1-2 weeks)

| Metric | Current | Target | Strategy |
|--------|---------|--------|----------|
| **Mid-Large throughput** | 0.24M ops/s | **>1M ops/s** | Fix `hkm_ace_alloc` |
| **Tiny throughput (ws=4096)** | 5.2M ops/s | **>7M ops/s** | Lock-free + drain tune |
| **futex overhead** | 68% | **<30%** | Lock-free Stage 1 |
| **mmap+munmap** | 3,357 | **<2,500** | Drain interval tune |

### Medium-Term (1-2 months)

| Metric | Current | Target | Strategy |
|--------|---------|--------|----------|
| **Tiny throughput (ws=4096)** | 5.2M ops/s | **>15M ops/s** | Full optimization |
| **vs System malloc** | 10% | **>25%** | Close gap by 15pp |
| **vs mimalloc** | 9% | **>20%** | Close gap by 11pp |

### Long-Term (3-6 months)

| Metric | Current | Target | Strategy |
|--------|---------|--------|----------|
| **Tiny throughput** | 5.2M ops/s | **>40M ops/s** | Architectural overhaul |
| **vs System malloc** | 10% | **>70%** | Competitive performance |
| **vs mimalloc** | 9% | **>60%** | Industry-standard |

---

## 10. Lessons Learned

### 1. ENV Configuration is Critical

**Discovery**: Default (1.30M) vs Optimized (5.2M) = **+300% gap**
**Lesson**: Always document and automate optimal ENV settings
**Action**: Create `scripts/bench_optimal_env.sh` with best-known config

### 2. Mid-Large Allocator Broken

**Discovery**: 97x slower than mimalloc, NULL returns
**Lesson**: Integration testing insufficient (bench suite doesn't cover 8-32KB properly)
**Action**: Add `bench_mid_large_single_thread.sh` to CI suite

### 3. futex Overhead Unexpected

**Discovery**: 68% of syscall time in a single-threaded workload
**Lesson**: Shared pool global lock is a bottleneck even without contention
**Action**: Profile lock hold time, consider lock-free paths

### 4. SP-SLOT Stage 2 Dominates

**Discovery**: 92.4% of allocations reuse UNUSED slots (Stage 2)
**Lesson**: Multi-class sharing >> per-class free lists
**Action**: Optimize Stage 2 path (lock-free metadata scan?)

---

## 11. Conclusion

**Current State**:
- ✅ SP-SLOT Box successfully reduced SuperSlab churn by 92%
- ✅ Syscall overhead reduced by 48% (mmap+munmap)
- ⚠️ Still 10x slower than System malloc (Tiny)
- 🔥 Mid-Large allocator critically broken (97x slower than mimalloc)

**Next Priorities**:
1. **Fix Mid-Large allocator** (P0, blocking)
2. **Optimize shared pool lock** (P1, 68% syscall time)
3. **Tune drain interval** (P2, low-risk improvement)
4. **Tune frontend cache** (P3, diminishing returns)

**Expected Impact** (short-term):
- Mid-Large: 0.24M → >1M ops/s (+316%)
- Tiny: 5.2M → >7M ops/s (+35%)
- futex overhead: 68% → <30% (-56%)

**Long-Term Vision**:
- Close the gap to 70% of System malloc performance (40M ops/s target)
- Competitive with industry-standard allocators (mimalloc, jemalloc)

---

**Report Generated**: 2025-11-14
**Tool**: Claude Code
**Phase**: Post SP-SLOT Box Implementation
**Status**: ✅ Analysis Complete, Ready for Implementation