# HAKMEM Bottleneck Analysis Report
**Date**: 2025-11-14
**Phase**: Post SP-SLOT Box Implementation
**Objective**: Identify next optimization targets to close gap with System malloc / mimalloc
---
## Executive Summary
Comprehensive performance analysis reveals a **10x gap vs. System malloc** (Tiny allocator) and a **22x gap** (Mid-Large allocator). Primary bottlenecks identified: **syscall overhead** (futex: 68% of syscall time), **frontend cache misses**, and **Mid-Large allocator failure**.
### Performance Gaps (Current State)
| Allocator | Tiny (random_mixed) | Mid-Large MT (8-32KB) |
|-----------|---------------------|----------------------|
| **System malloc** | 51.9M ops/s (100%) | 5.4M ops/s (100%) |
| **mimalloc** | 57.5M ops/s (111%) | 24.2M ops/s (448%) |
| **HAKMEM (best)** | 5.2M ops/s (**10%**) | 0.24M ops/s (**4.4%**) |
| **Gap** | **-90% (10x slower)** | **-95.6% (22x slower)** |
**Urgent**: Mid-Large allocator requires immediate attention (97x slower than mimalloc).
---
## 1. Benchmark Results: Current State
### 1.1 Random Mixed (Tiny Allocator: 16B-1KB)
**Test Configuration**:
- 200K iterations
- Working set: 4,096 slots
- Size range: 16-1040 bytes (C0-C7 classes)
**Results**:
| Variant | spec_mask | fast_cap | Throughput | vs System | vs mimalloc |
|---------|-----------|----------|------------|-----------|-------------|
| **System malloc** | - | - | 51.9M ops/s | 100% | 90% |
| **mimalloc** | - | - | 57.5M ops/s | 111% | 100% |
| **HAKMEM** | 0 | 8 | 3.6M ops/s | 6.9% | 6.3% |
| **HAKMEM** | 0 | 16 | 4.6M ops/s | 8.9% | 8.0% |
| **HAKMEM** | 0 | **32** | **5.2M ops/s** | **10.0%** | **9.0%** |
| **HAKMEM** | 0x0F | 32 | 5.18M ops/s | 10.0% | 9.0% |
**Key Findings**:
- **Best HAKMEM config**: fast_cap=32, spec_mask=0 → **5.2M ops/s**
- **Gap**: 10x slower than System, 11x slower than mimalloc
- **spec_mask effect**: Negligible (<1% difference)
- **fast_cap scaling**: 8→16 (+28%), 16→32 (+13%)
### 1.2 Mid-Large MT (8-32KB Allocations)
**Test Configuration**:
- 2 threads
- 40K cycles
- Working set: 2,048 slots
**Results**:
| Allocator | Throughput | vs System | vs mimalloc |
|-----------|------------|-----------|-------------|
| **System malloc** | 5.4M ops/s | 100% | 22% |
| **mimalloc** | 24.2M ops/s | 448% | 100% |
| **HAKMEM (base)** | 0.243M ops/s | **4.4%** | **1.0%** |
| **HAKMEM (no bigcache)** | 0.251M ops/s | 4.6% | 1.0% |
**Critical Issue**:
```
[ALLOC] 33KB: hkm_ace_alloc returned (nil) ← Repeated failures
```
**Gap**: 22x slower than System, **97x slower than mimalloc** 💀
**Root Cause**: `hkm_ace_alloc` consistently returns NULL → Mid-Large allocator not functioning properly.
---
## 2. Syscall Analysis (strace)
### 2.1 System Call Distribution (200K iterations)
| Syscall | Calls | % Time | usec/call | Category |
|---------|-------|--------|-----------|----------|
| **futex** | 36 | **68.18%** | 1,970 | Synchronization |
| **munmap** | 1,665 | 11.60% | 7 | SS deallocation |
| **mmap** | 1,692 | 7.28% | 4 | SS allocation |
| **madvise** | 1,591 | 6.85% | 4 | Memory advice |
| **mincore** | 1,574 | 5.51% | 3 | Page existence check |
| **Other** | 1,141 | 0.57% | - | Misc |
| **Total** | **6,703** | 100% | 15 (avg) | |
### 2.2 Key Observations
**Unexpected: futex Dominates (68% time)**
- **36 futex calls** consuming **68.18% of syscall time**
- **1,970 usec/call** (extremely slow!)
- **Context**: `bench_random_mixed` is **single-threaded**
- **Hypothesis**: Contention in shared pool lock (`pthread_mutex_lock` in `shared_pool_acquire_slab`)
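One quick way to test this hypothesis is to count contended acquisitions directly. The sketch below is illustrative only (wrapper and counter names are not existing HAKMEM code):
```c
#include <pthread.h>
#include <stdatomic.h>

// Counts how often the shared-pool lock is already held when we try to take it.
// In a truly uncontended single-threaded run this should stay at zero.
static _Atomic unsigned long g_alloc_lock_contended = 0;

static void alloc_lock_acquire(pthread_mutex_t* m) {
    if (pthread_mutex_trylock(m) != 0) {       // lock already held elsewhere
        atomic_fetch_add(&g_alloc_lock_contended, 1);
        pthread_mutex_lock(m);                 // fall back to blocking acquire
    }
}
```
If the counter stays at zero while futex time remains high, the futex calls are coming from somewhere other than this mutex (e.g. background/drain threads).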
**SP-SLOT Impact Confirmed**:
```
Before SP-SLOT: mmap (3,241) + munmap (3,214) = 6,455 calls
After SP-SLOT: mmap (1,692) + munmap (1,665) = 3,357 calls
Reduction: -48% (-3,098 calls) ✅
```
**Remaining syscall overhead**:
- **madvise**: 1,591 calls (6.85% time) - from other allocators?
- **mincore**: 1,574 calls (5.51% time) - still present despite Phase 9 removal?
---
## 3. SP-SLOT Box Effectiveness Review
### 3.1 SuperSlab Allocation Reduction
**Measured with debug logging** (`HAKMEM_SS_ACQUIRE_DEBUG=1`):
| Metric | Before SP-SLOT | After SP-SLOT | Improvement |
|--------|----------------|---------------|-------------|
| **New SuperSlabs** (Stage 3) | 877 (200K iters) | 72 (200K iters) | **-92%** 🎉 |
| **Syscalls (mmap+munmap)** | 6,455 | 3,357 | **-48%** |
| **Throughput** | 563K ops/s | 1.30M ops/s | **+131%** |
### 3.2 Allocation Stage Distribution (50K iterations)
| Stage | Description | Count | % |
|-------|-------------|-------|---|
| **Stage 1** | EMPTY slot reuse (per-class free list) | 105 | 4.6% |
| **Stage 2** | **UNUSED slot reuse (multi-class sharing)** | **2,117** | **92.4%** |
| **Stage 3** | New SuperSlab (mmap) | 69 | 3.0% |
| **Total** | | 2,291 | 100% |
**Key Insight**: Stage 2 (UNUSED reuse) is dominant, proving **multi-class SuperSlab sharing works**.
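The three stages can be read as a simple fallthrough from cheapest to most expensive source. The sketch below only illustrates the cascade described in the table; the function and type names are placeholders, not the actual HAKMEM API:
```c
// Illustrative acquire cascade: per-class free list first, mmap-backed SuperSlab last.
typedef struct SlabSlot SlabSlot;

SlabSlot* stage1_pop_empty(int class_idx);      // EMPTY slot, per-class free list
SlabSlot* stage2_claim_unused(int class_idx);   // UNUSED slot, multi-class sharing
SlabSlot* stage3_new_superslab(int class_idx);  // new SuperSlab via mmap

static SlabSlot* acquire_slab_slot(int class_idx) {
    SlabSlot* s;
    if ((s = stage1_pop_empty(class_idx)) != NULL)    return s;
    if ((s = stage2_claim_unused(class_idx)) != NULL) return s;
    return stage3_new_superslab(class_idx);
}
```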
---
## 4. Identified Bottlenecks (Priority Order)
### Priority 1: Mid-Large Allocator Failure 🔥
**Impact**: 97x slower than mimalloc
**Symptom**: `hkm_ace_alloc` returns NULL
**Evidence**:
```
[ALLOC] 33KB: TINY_MAX_SIZE=1024, threshold=524288, condition=1
[ALLOC] 33KB: Calling hkm_ace_alloc
[ALLOC] 33KB: hkm_ace_alloc returned (nil) ← Repeated failures
```
**Root Cause Hypothesis**:
- Pool TLS arena not initialized?
- Threshold logic preventing 8-32KB allocations?
- Bug in `hkm_ace_alloc` path?
**Action Required**: Immediate investigation (blocking)
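A quick first step is to instrument the failing call itself and log the requested size plus the return value at the boundary. The prototype below is an assumption (check the real declaration in core/), and the wrapper name is illustrative:
```c
#include <stdio.h>
#include <stddef.h>

void* hkm_ace_alloc(size_t size);   // assumed prototype; verify against the real header

// Debug-only wrapper: report every NULL return with the requested size so the
// failing path (arena init? threshold logic?) can be narrowed down.
static void* hkm_ace_alloc_traced(size_t size) {
    void* p = hkm_ace_alloc(size);
#if !HAKMEM_BUILD_RELEASE           // compiled out of release builds
    if (p == NULL) {
        fprintf(stderr, "[ACE] alloc(%zu) returned NULL\n", size);
    }
#endif
    return p;
}
```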
---
### Priority 2: futex Overhead (68% syscall time) ⚠️
**Impact**: 68.18% of syscall time (1,970 usec/call)
**Symptom**: Excessive lock contention in shared pool
**Root Cause**:
```c
// core/hakmem_shared_pool.c:343
pthread_mutex_lock(&g_shared_pool.alloc_lock);  // ← Contention point?
```
**Hypothesis**:
- `shared_pool_acquire_slab()` called frequently (2,291 times / 50K iters)
- Lock held too long (metadata scans, dynamic array growth)
- Contention even in single-threaded workload (TLS drain threads?)
**Potential Solutions**:
1. **Lock-free fast path**: Per-class lock-free pop from free lists (Stage 1)
2. **Reduce lock scope**: Move metadata scans outside critical section
3. **Batch acquire**: Acquire multiple slabs per lock acquisition (see the sketch after this list)
4. **Per-class locks**: Replace global lock with per-class locks
**Expected Impact**: -50-80% reduction in futex time
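Solution 3 (batch acquire) could look roughly like the sketch below. `SlabRef`, `sp_acquire_up_to`, and the TLS batch buffer are illustrative names; a standalone mutex stands in for `g_shared_pool.alloc_lock`:
```c
#include <pthread.h>
#include <stddef.h>

#define TINY_NUM_CLASSES 8          // C0-C7 per Section 1.1
#define ACQUIRE_BATCH    4          // slabs taken per lock acquisition (tunable)

typedef void* SlabRef;              // placeholder for the real slab handle

// Hypothetical locked helper: pops up to 'max' slabs of class_idx, returns count.
int sp_acquire_up_to(int class_idx, SlabRef* out, int max);

static pthread_mutex_t g_alloc_lock = PTHREAD_MUTEX_INITIALIZER; // stands in for g_shared_pool.alloc_lock
static __thread SlabRef tls_batch[TINY_NUM_CLASSES][ACQUIRE_BATCH];
static __thread int     tls_batch_count[TINY_NUM_CLASSES];

static SlabRef shared_pool_acquire_batched(int class_idx) {
    if (tls_batch_count[class_idx] == 0) {
        pthread_mutex_lock(&g_alloc_lock);
        tls_batch_count[class_idx] =
            sp_acquire_up_to(class_idx, tls_batch[class_idx], ACQUIRE_BATCH);
        pthread_mutex_unlock(&g_alloc_lock);
        if (tls_batch_count[class_idx] == 0)
            return NULL;            // nothing available: caller falls back to Stage 3
    }
    return tls_batch[class_idx][--tls_batch_count[class_idx]];
}
```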
---
### Priority 3: Frontend Cache Miss Rate
**Impact**: Driving backend allocation frequency (2,291 acquires / 50K iters = 4.6%)
**Current Config**: fast_cap=32 (best performance)
**Evidence**: fast_cap scaling (8→16: +28%, 16→32: +13%)
**Hypothesis**:
- TLS cache capacity too small for working set (4,096 slots)
- Refill batch size suboptimal
- Specialize mask (0x0F) shows no benefit (<1% difference)
**Potential Solutions**:
1. **Increase fast_cap**: Test 64 / 128 (diminishing returns expected)
2. **Tune refill batch**: Current 64 (`HAKMEM_TINY_REFILL_COUNT_HOT`) → test 128 / 256
3. **Class-specific tuning**: Hot classes (C6, C7) get larger caches (see the sketch after this list)
**Expected Impact**: +10-20% throughput (backend call reduction)
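For option 3, the idea is simply that the fast-cache capacity stops being a single global knob. A minimal sketch (array name and values are illustrative, not HAKMEM's current configuration):
```c
#include <stdint.h>

// Illustrative per-class TLS fast-cache capacities: hot classes (C6, C7) get
// more room than cold ones instead of one global fast_cap.
static const uint16_t k_fast_cap_by_class[8] = {
/*  C0  C1  C2  C3  C4  C5  C6  C7 */
    16, 16, 16, 16, 32, 32, 64, 64
};
```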
---
### Priority 4: Remaining syscall Overhead (mmap/munmap/madvise/mincore)
**Impact**: 30.59% syscall time (3,357 mmap/munmap + 1,591 madvise + 1,574 mincore)
**Status**: Significantly improved vs pre-SP-SLOT (-48% mmap/munmap)
**Remaining Issues**:
1. **madvise (1,591 calls)**: Where are these coming from?
- Pool TLS arena (8-52KB)?
- Mid-Large allocator (broken)?
- Other internal structures?
2. **mincore (1,574 calls)**: Still present despite Phase 9 removal claim
- Source location unknown
- May be from other allocators or debug paths
**Action Required**: Trace source of madvise/mincore calls
---
## 5. Performance Evolution Timeline
### Historical Performance Progression
| Phase | Optimization | Throughput | vs Baseline | vs System |
|-------|--------------|------------|-------------|-----------|
| **Baseline** (Phase 8) | - | 563K ops/s | +0% | 1.1% |
| **Phase 9** (LRU + mincore removal) | Lazy deallocation | 9.71M ops/s | +1,625% | 18.7% |
| **Phase 10** (TLS/SFC tuning) | Frontend expansion | 9.89M ops/s | +1,657% | 19.0% |
| **Phase 11** (Prewarm) | Startup SS allocation | 9.38M ops/s | +1,566% | 18.1% |
| **Phase 12-A** (TLS SLL Drain) | Periodic drain | 6.1M ops/s | +984% | 11.8% |
| **Phase 12-B** (SP-SLOT Box) | Per-slot management | 1.30M ops/s | +131% | 2.5% |
| **Current (optimized ENV)** | fast_cap=32 | **5.2M ops/s** | **+824%** | **10.0%** |
**Note**: Discrepancy between Phase 12-B (1.30M) and Current (5.2M) due to **ENV configuration**:
- Default: no ENV → 1.30M ops/s
- Optimized: `HAKMEM_TINY_FAST_CAP=32` + other flags → 5.2M ops/s
---
## 6. Working Set Sensitivity
**Test Results** (fast_cap=32, spec_mask=0):
| Cycles | WS | Throughput | vs ws=4096 |
|--------|-----|------------|------------|
| 200K | 4,096 | 5.2M ops/s | 100% (baseline) |
| 200K | 8,192 | 4.0M ops/s | -23% |
| 400K | 4,096 | 5.3M ops/s | +2% |
| 400K | 8,192 | 4.7M ops/s | -10% |
**Observation**: **23% performance drop** when working set doubles (4K→8K)
**Hypothesis**:
- Larger working set → more backend allocation calls
- TLS cache misses increase
- SuperSlab churn increases (more Stage 3 allocations)
**Implication**: Current frontend cache size (fast_cap=32) insufficient for large working sets.
---
## 7. Recommended Next Steps (Priority Order)
### Step 1: Fix Mid-Large Allocator (URGENT) 🔥
**Priority**: P0 (Blocking)
**Impact**: 97x gap with mimalloc
**Effort**: Medium
**Tasks**:
1. Investigate `hkm_ace_alloc` NULL returns
2. Check Pool TLS arena initialization
3. Verify threshold logic for 8-32KB allocations
4. Add debug logging to trace allocation path
**Success Criteria**: Mid-Large throughput >1M ops/s (current: 0.24M)
---
### Step 2: Optimize Shared Pool Lock Contention
**Priority**: P1 (High)
**Impact**: 68% syscall time
**Effort**: Medium
**Options** (in order of risk):
**A) Lock-free Stage 1 (Low Risk)**:
```c
// Per-class atomic LIFO for EMPTY slot reuse
_Atomic(FreeSlotEntry*) g_free_list_heads[TINY_NUM_CLASSES];

// Lock-free pop (Stage 1 fast path)
FreeSlotEntry* sp_freelist_pop_lockfree(int class_idx) {
    FreeSlotEntry* head = atomic_load(&g_free_list_heads[class_idx]);
    while (head != NULL) {
        if (atomic_compare_exchange_weak(&g_free_list_heads[class_idx],
                                         &head, head->next)) {
            return head;
        }
    }
    return NULL; // Fall back to locked Stage 2/3
}
```
**Expected**: -50% futex overhead (Stage 1 hit rate: 4.6% → lock-free)
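Option A also needs the matching release-side push. A sketch using the same hypothetical names as above (note that a production-grade pop would also need ABA protection, e.g. a tagged or versioned head pointer):
```c
// Lock-free push: link the released entry in front of the current head.
// Push is the safe direction; the ABA hazard lives on the pop side.
void sp_freelist_push_lockfree(int class_idx, FreeSlotEntry* node) {
    FreeSlotEntry* head = atomic_load(&g_free_list_heads[class_idx]);
    do {
        node->next = head;
    } while (!atomic_compare_exchange_weak(&g_free_list_heads[class_idx],
                                           &head, node));
}
```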
**B) Reduce Lock Scope (Medium Risk)**:
```c
// Move metadata scan outside lock
int candidate_slot = sp_meta_scan_unlocked(); // Read-only
pthread_mutex_lock(&g_shared_pool.alloc_lock);
if (sp_slot_try_claim(candidate_slot)) { // Quick CAS
    // Success
}
pthread_mutex_unlock(&g_shared_pool.alloc_lock);
```
**Expected**: -30% futex overhead (reduce lock hold time)
**C) Per-Class Locks (High Risk)**:
```c
pthread_mutex_t g_class_locks[TINY_NUM_CLASSES]; // Replace global lock
```
**Expected**: -80% futex overhead (eliminate cross-class contention)
**Risk**: Complexity increase, potential deadlocks
**Recommendation**: Start with **Option A** (lowest risk, measurable impact).
---
### Step 3: TLS Drain Interval Tuning (Low Risk)
**Priority**: P2 (Medium)
**Impact**: TBD (experimental)
**Effort**: Low (ENV-only A/B testing)
**Current**: 1,024 frees/class (`HAKMEM_TINY_SLL_DRAIN_INTERVAL=1024`)
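For context, an interval-based drain typically just counts frees per class and flushes the TLS free list every N frees. The sketch below is illustrative only (names and structure are assumptions; `HAKMEM_TINY_SLL_DRAIN_INTERVAL` is the knob being swept):
```c
#define TINY_NUM_CLASSES 8                      // C0-C7

static __thread unsigned tls_free_count[TINY_NUM_CLASSES];
static unsigned g_drain_interval = 1024;        // read from HAKMEM_TINY_SLL_DRAIN_INTERVAL

void tls_sll_drain(int class_idx);              // hypothetical: return cached blocks to the backend

static inline void tiny_free_tick(int class_idx) {
    if (++tls_free_count[class_idx] >= g_drain_interval) {
        tls_free_count[class_idx] = 0;
        tls_sll_drain(class_idx);               // flush this class's TLS free list
    }
}
```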
**Experiment Matrix**:
| Interval | Expected Impact |
|----------|-----------------|
| 512 | -50% drain overhead, +syscalls (more frequent SS release) |
| 2,048 | +100% drain overhead, -syscalls (less frequent SS release) |
| 4,096 | +300% drain overhead, --syscalls (minimal SS release) |
**Metrics to Track**:
- Throughput (ops/s)
- mmap/munmap count (strace)
- TLS SLL drain frequency (debug log)
**Success Criteria**: Find optimal balance (throughput > 5.5M ops/s, syscalls < 3,000)
---
### Step 4: Frontend Cache Tuning (Medium Risk)
**Priority**: P3 (Low)
**Impact**: +10-20% expected
**Effort**: Low (ENV-only A/B testing)
**Current Best**: fast_cap=32
**Experiment Matrix**:
| fast_cap | refill_count_hot | Expected Impact |
|----------|------------------|-----------------|
| 64 | 64 | +5-10% (diminishing returns) |
| 64 | 128 | +10-15% (better batch refill) |
| 128 | 128 | +15-20% (max cache size) |
**Metrics to Track**:
- Throughput (ops/s)
- Stage 3 frequency (debug log)
- Working set sensitivity (ws=8192 test)
**Success Criteria**: Throughput > 6M ops/s on ws=4096, <10% drop on ws=8192
---
### Step 5: Trace Remaining Syscalls (Investigation)
**Priority**: P4 (Low)
**Impact**: TBD
**Effort**: Low
**Questions**:
1. **madvise (1,591 calls)**: Where are these from?
- Add debug logging to all `madvise()` call sites
- Check Pool TLS arena, Mid-Large allocator
2. **mincore (1,574 calls)**: Why still present?
- Grep codebase for `mincore` calls
- Check if Phase 9 removal was incomplete
**Tools**:
```bash
# Trace madvise source
strace -e trace=madvise -k ./bench_random_mixed_hakmem 200000 4096 1234567
# Grep for mincore
grep -r "mincore" core/ --include="*.c" --include="*.h"
```
---
## 8. Risk Assessment
| Optimization | Impact | Effort | Risk | Recommendation |
|--------------|--------|--------|------|----------------|
| **Mid-Large Fix** | +++++ | ++ | Low | **DO NOW** 🔥 |
| **Lock-free Stage 1** | +++ | ++ | Low | **DO NEXT** |
| **Drain Interval Tune** | ++ | + | Low | **DO NEXT** |
| **Frontend Cache Tune** | ++ | + | Low | **DO AFTER** |
| **Reduce Lock Scope** | +++ | +++ | Med | Consider |
| **Per-Class Locks** | ++++ | ++++ | High | Avoid (complex) |
| **Trace Syscalls** | ? | + | Low | Background task |
---
## 9. Expected Performance Targets
### Short-Term (1-2 weeks)
| Metric | Current | Target | Strategy |
|--------|---------|--------|----------|
| **Mid-Large throughput** | 0.24M ops/s | **>1M ops/s** | Fix `hkm_ace_alloc` |
| **Tiny throughput (ws=4096)** | 5.2M ops/s | **>7M ops/s** | Lock-free + drain tune |
| **futex overhead** | 68% | **<30%** | Lock-free Stage 1 |
| **mmap+munmap** | 3,357 | **<2,500** | Drain interval tune |
### Medium-Term (1-2 months)
| Metric | Current | Target | Strategy |
|--------|---------|--------|----------|
| **Tiny throughput (ws=4096)** | 5.2M ops/s | **>15M ops/s** | Full optimization |
| **vs System malloc** | 10% | **>25%** | Close gap by 15pp |
| **vs mimalloc** | 9% | **>20%** | Close gap by 11pp |
### Long-Term (3-6 months)
| Metric | Current | Target | Strategy |
|--------|---------|--------|----------|
| **Tiny throughput** | 5.2M ops/s | **>40M ops/s** | Architectural overhaul |
| **vs System malloc** | 10% | **>70%** | Competitive performance |
| **vs mimalloc** | 9% | **>60%** | Industry-standard |
---
## 10. Lessons Learned
### 1. ENV Configuration is Critical
**Discovery**: Default (1.30M) vs Optimized (5.2M) = **+300% gap**
**Lesson**: Always document and automate optimal ENV settings
**Action**: Create `scripts/bench_optimal_env.sh` with best-known config
### 2. Mid-Large Allocator Broken
**Discovery**: 97x slower than mimalloc, NULL returns
**Lesson**: Integration testing insufficient (bench suite doesn't cover 8-32KB properly)
**Action**: Add `bench_mid_large_single_thread.sh` to CI suite
### 3. futex Overhead Unexpected
**Discovery**: 68% time in single-threaded workload
**Lesson**: Shared pool global lock is a bottleneck even without contention
**Action**: Profile lock hold time, consider lock-free paths
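A hedged sketch of the hold-time measurement (wrapper names are illustrative; this is not existing HAKMEM instrumentation):
```c
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static _Atomic unsigned long long g_lock_hold_ns;  // total time spent holding the lock
static __thread struct timespec   tls_lock_t0;     // per-thread acquisition timestamp

static void pool_lock(pthread_mutex_t* m) {
    pthread_mutex_lock(m);
    clock_gettime(CLOCK_MONOTONIC, &tls_lock_t0);
}

static void pool_unlock(pthread_mutex_t* m) {
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long long held = (long long)(t1.tv_sec - tls_lock_t0.tv_sec) * 1000000000LL
                   + (t1.tv_nsec - tls_lock_t0.tv_nsec);
    atomic_fetch_add(&g_lock_hold_ns, (unsigned long long)held);
    pthread_mutex_unlock(m);
}
```
Comparing the accumulated hold time against total run time shows whether the lock is held too long or simply taken too often.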
### 4. SP-SLOT Stage 2 Dominates
**Discovery**: 92.4% of allocations reuse UNUSED slots (Stage 2)
**Lesson**: Multi-class sharing >> per-class free lists
**Action**: Optimize Stage 2 path (lock-free metadata scan?)
---
## 11. Conclusion
**Current State**:
- ✅ SP-SLOT Box successfully reduced SuperSlab churn by 92%
- ✅ Syscall overhead reduced by 48% (mmap+munmap)
- ⚠️ Still 10x slower than System malloc (Tiny)
- 🔥 Mid-Large allocator critically broken (97x slower than mimalloc)
**Next Priorities**:
1. **Fix Mid-Large allocator** (P0, blocking)
2. **Optimize shared pool lock** (P1, 68% syscall time)
3. **Tune drain interval** (P2, low-risk improvement)
4. **Tune frontend cache** (P3, diminishing returns)
**Expected Impact** (short-term):
- Mid-Large: 0.24M → >1M ops/s (+316%)
- Tiny: 5.2M → >7M ops/s (+35%)
- futex overhead: 68% → <30% (-56%)
**Long-Term Vision**:
- Close gap to 70% of System malloc performance (40M ops/s target)
- Competitive with industry-standard allocators (mimalloc, jemalloc)
---
**Report Generated**: 2025-11-14
**Tool**: Claude Code
**Phase**: Post SP-SLOT Box Implementation
**Status**: Analysis Complete, Ready for Implementation