## Changes ### 1. core/page_arena.c - Removed init failure message (lines 25-27) - error is handled by returning early - All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks ### 2. core/hakmem.c - Wrapped SIGSEGV handler init message (line 72) - CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs ### 3. core/hakmem_shared_pool.c - Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE: - Node pool exhaustion warning (line 252) - SP_META_CAPACITY_ERROR warning (line 421) - SP_FIX_GEOMETRY debug logging (line 745) - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865) - SP_ACQUIRE_STAGE0_L0 debug logging (line 803) - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922) - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996) - SP_ACQUIRE_STAGE3 debug logging (line 1116) - SP_SLOT_RELEASE debug logging (line 1245) - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305) - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316) - Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized ## Performance Validation Before: 51M ops/s (with debug fprintf overhead) After: 49.1M ops/s (consistent performance, fprintf removed from hot paths) ## Build & Test ```bash ./build.sh larson_hakmem ./out/release/larson_hakmem 1 5 1 1000 100 10000 42 # Result: 49.1M ops/s ``` Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
511 lines
16 KiB
Markdown
511 lines
16 KiB
Markdown
# HAKMEM Bottleneck Analysis Report
|
||
|
||
**Date**: 2025-11-14
|
||
**Phase**: Post SP-SLOT Box Implementation
|
||
**Objective**: Identify next optimization targets to close gap with System malloc / mimalloc
|
||
|
||
---
|
||
|
||
## Executive Summary
|
||
|
||
Comprehensive performance analysis reveals **10x gap with System malloc** (Tiny allocator) and **22x gap** (Mid-Large allocator). Primary bottlenecks identified: **syscall overhead** (futex: 68% time), **Frontend cache misses**, and **Mid-Large allocator failure**.
|
||
|
||
### Performance Gaps (Current State)
|
||
|
||
| Allocator | Tiny (random_mixed) | Mid-Large MT (8-32KB) |
|
||
|-----------|---------------------|----------------------|
|
||
| **System malloc** | 51.9M ops/s (100%) | 5.4M ops/s (100%) |
|
||
| **mimalloc** | 57.5M ops/s (111%) | 24.2M ops/s (448%) |
|
||
| **HAKMEM (best)** | 5.2M ops/s (**10%**) | 0.24M ops/s (**4.4%**) |
|
||
| **Gap** | **-90% (10x slower)** | **-95.6% (22x slower)** |
|
||
|
||
**Urgent**: Mid-Large allocator requires immediate attention (97x slower than mimalloc).
|
||
|
||
---
|
||
|
||
## 1. Benchmark Results: Current State
|
||
|
||
### 1.1 Random Mixed (Tiny Allocator: 16B-1KB)
|
||
|
||
**Test Configuration**:
|
||
- 200K iterations
|
||
- Working set: 4,096 slots
|
||
- Size range: 16-1040 bytes (C0-C7 classes)
|
||
|
||
**Results**:
|
||
|
||
| Variant | spec_mask | fast_cap | Throughput | vs System | vs mimalloc |
|
||
|---------|-----------|----------|------------|-----------|-------------|
|
||
| **System malloc** | - | - | 51.9M ops/s | 100% | 90% |
|
||
| **mimalloc** | - | - | 57.5M ops/s | 111% | 100% |
|
||
| **HAKMEM** | 0 | 8 | 3.6M ops/s | 6.9% | 6.3% |
|
||
| **HAKMEM** | 0 | 16 | 4.6M ops/s | 8.9% | 8.0% |
|
||
| **HAKMEM** | 0 | **32** | **5.2M ops/s** | **10.0%** | **9.0%** |
|
||
| **HAKMEM** | 0x0F | 32 | 5.18M ops/s | 10.0% | 9.0% |
|
||
|
||
**Key Findings**:
|
||
- **Best HAKMEM config**: fast_cap=32, spec_mask=0 → **5.2M ops/s**
|
||
- **Gap**: 10x slower than System, 11x slower than mimalloc
|
||
- **spec_mask effect**: Negligible (<1% difference)
|
||
- **fast_cap scaling**: 8→16 (+28%), 16→32 (+13%)
|
||
|
||
### 1.2 Mid-Large MT (8-32KB Allocations)
|
||
|
||
**Test Configuration**:
|
||
- 2 threads
|
||
- 40K cycles
|
||
- Working set: 2,048 slots
|
||
|
||
**Results**:
|
||
|
||
| Allocator | Throughput | vs System | vs mimalloc |
|
||
|-----------|------------|-----------|-------------|
|
||
| **System malloc** | 5.4M ops/s | 100% | 22% |
|
||
| **mimalloc** | 24.2M ops/s | 448% | 100% |
|
||
| **HAKMEM (base)** | 0.243M ops/s | **4.4%** | **1.0%** |
|
||
| **HAKMEM (no bigcache)** | 0.251M ops/s | 4.6% | 1.0% |
|
||
|
||
**Critical Issue**:
|
||
```
|
||
[ALLOC] 33KB: hkm_ace_alloc returned (nil) ← Repeated failures
|
||
```
|
||
|
||
**Gap**: 22x slower than System, **97x slower than mimalloc** 💀
|
||
|
||
**Root Cause**: `hkm_ace_alloc` consistently returns NULL → Mid-Large allocator not functioning properly.
|
||
|
||
---
|
||
|
||
## 2. Syscall Analysis (strace)
|
||
|
||
### 2.1 System Call Distribution (200K iterations)
|
||
|
||
| Syscall | Calls | % Time | usec/call | Category |
|
||
|---------|-------|--------|-----------|----------|
|
||
| **futex** | 36 | **68.18%** | 1,970 | Synchronization ⚠️ |
|
||
| **munmap** | 1,665 | 11.60% | 7 | SS deallocation |
|
||
| **mmap** | 1,692 | 7.28% | 4 | SS allocation |
|
||
| **madvise** | 1,591 | 6.85% | 4 | Memory advice |
|
||
| **mincore** | 1,574 | 5.51% | 3 | Page existence check |
|
||
| **Other** | 1,141 | 0.57% | - | Misc |
|
||
| **Total** | **6,703** | 100% | 15 (avg) | |
|
||
|
||
### 2.2 Key Observations
|
||
|
||
**Unexpected: futex Dominates (68% time)**
|
||
- **36 futex calls** consuming **68.18% of syscall time**
|
||
- **1,970 usec/call** (extremely slow!)
|
||
- **Context**: `bench_random_mixed` is **single-threaded**
|
||
- **Hypothesis**: Contention in shared pool lock (`pthread_mutex_lock` in `shared_pool_acquire_slab`)
|
||
|
||
**SP-SLOT Impact Confirmed**:
|
||
```
|
||
Before SP-SLOT: mmap (3,241) + munmap (3,214) = 6,455 calls
|
||
After SP-SLOT: mmap (1,692) + munmap (1,665) = 3,357 calls
|
||
Reduction: -48% (-3,098 calls) ✅
|
||
```
|
||
|
||
**Remaining syscall overhead**:
|
||
- **madvise**: 1,591 calls (6.85% time) - from other allocators?
|
||
- **mincore**: 1,574 calls (5.51% time) - still present despite Phase 9 removal?
|
||
|
||
---
|
||
|
||
## 3. SP-SLOT Box Effectiveness Review
|
||
|
||
### 3.1 SuperSlab Allocation Reduction
|
||
|
||
**Measured with debug logging** (`HAKMEM_SS_ACQUIRE_DEBUG=1`):
|
||
|
||
| Metric | Before SP-SLOT | After SP-SLOT | Improvement |
|
||
|--------|----------------|---------------|-------------|
|
||
| **New SuperSlabs** (Stage 3) | 877 (200K iters) | 72 (200K iters) | **-92%** 🎉 |
|
||
| **Syscalls (mmap+munmap)** | 6,455 | 3,357 | **-48%** |
|
||
| **Throughput** | 563K ops/s | 1.30M ops/s | **+131%** |
|
||
|
||
### 3.2 Allocation Stage Distribution (50K iterations)
|
||
|
||
| Stage | Description | Count | % |
|
||
|-------|-------------|-------|---|
|
||
| **Stage 1** | EMPTY slot reuse (per-class free list) | 105 | 4.6% |
|
||
| **Stage 2** | **UNUSED slot reuse (multi-class sharing)** | **2,117** | **92.4%** ✅ |
|
||
| **Stage 3** | New SuperSlab (mmap) | 69 | 3.0% |
|
||
| **Total** | | 2,291 | 100% |
|
||
|
||
**Key Insight**: Stage 2 (UNUSED reuse) is dominant, proving **multi-class SuperSlab sharing works**.
|
||
|
||
---
|
||
|
||
## 4. Identified Bottlenecks (Priority Order)
|
||
|
||
### Priority 1: Mid-Large Allocator Failure 🔥
|
||
|
||
**Impact**: 97x slower than mimalloc
|
||
**Symptom**: `hkm_ace_alloc` returns NULL
|
||
**Evidence**:
|
||
```
|
||
[ALLOC] 33KB: TINY_MAX_SIZE=1024, threshold=524288, condition=1
|
||
[ALLOC] 33KB: Calling hkm_ace_alloc
|
||
[ALLOC] 33KB: hkm_ace_alloc returned (nil) ← Repeated failures
|
||
```
|
||
|
||
**Root Cause Hypothesis**:
|
||
- Pool TLS arena not initialized?
|
||
- Threshold logic preventing 8-32KB allocations?
|
||
- Bug in `hkm_ace_alloc` path?
|
||
|
||
**Action Required**: Immediate investigation (blocking)
|
||
|
||
---
|
||
|
||
### Priority 2: futex Overhead (68% syscall time) ⚠️
|
||
|
||
**Impact**: 68.18% of syscall time (1,970 usec/call)
|
||
**Symptom**: Excessive lock contention in shared pool
|
||
**Root Cause**:
|
||
```c
|
||
// core/hakmem_shared_pool.c:343
|
||
pthread_mutex_lock(&g_shared_pool.alloc_lock); ← Contention point?
|
||
```
|
||
|
||
**Hypothesis**:
|
||
- `shared_pool_acquire_slab()` called frequently (2,291 times / 50K iters)
|
||
- Lock held too long (metadata scans, dynamic array growth)
|
||
- Contention even in single-threaded workload (TLS drain threads?)
|
||
|
||
**Potential Solutions**:
|
||
1. **Lock-free fast path**: Per-class lock-free pop from free lists (Stage 1)
|
||
2. **Reduce lock scope**: Move metadata scans outside critical section
|
||
3. **Batch acquire**: Acquire multiple slabs per lock acquisition
|
||
4. **Per-class locks**: Replace global lock with per-class locks
|
||
|
||
**Expected Impact**: -50-80% reduction in futex time
|
||
|
||
---
|
||
|
||
### Priority 3: Frontend Cache Miss Rate
|
||
|
||
**Impact**: Driving backend allocation frequency (2,291 acquires / 50K iters = 4.6%)
|
||
**Current Config**: fast_cap=32 (best performance)
|
||
**Evidence**: fast_cap scaling (8→16: +28%, 16→32: +13%)
|
||
|
||
**Hypothesis**:
|
||
- TLS cache capacity too small for working set (4,096 slots)
|
||
- Refill batch size suboptimal
|
||
- Specialize mask (0x0F) shows no benefit (<1% difference)
|
||
|
||
**Potential Solutions**:
|
||
1. **Increase fast_cap**: Test 64 / 128 (diminishing returns expected)
|
||
2. **Tune refill batch**: Current 64 (HAKMEM_TINY_REFILL_COUNT_HOT) → test 128 / 256
|
||
3. **Class-specific tuning**: Hot classes (C6, C7) get larger caches
|
||
|
||
**Expected Impact**: +10-20% throughput (backend call reduction)
|
||
|
||
---
|
||
|
||
### Priority 4: Remaining syscall Overhead (mmap/munmap/madvise/mincore)
|
||
|
||
**Impact**: 30.59% syscall time (3,357 mmap/munmap + 1,591 madvise + 1,574 mincore)
|
||
**Status**: Significantly improved vs pre-SP-SLOT (-48% mmap/munmap)
|
||
|
||
**Remaining Issues**:
|
||
1. **madvise (1,591 calls)**: Where are these coming from?
|
||
- Pool TLS arena (8-52KB)?
|
||
- Mid-Large allocator (broken)?
|
||
- Other internal structures?
|
||
|
||
2. **mincore (1,574 calls)**: Still present despite Phase 9 removal claim
|
||
- Source location unknown
|
||
- May be from other allocators or debug paths
|
||
|
||
**Action Required**: Trace source of madvise/mincore calls
|
||
|
||
---
|
||
|
||
## 5. Performance Evolution Timeline
|
||
|
||
### Historical Performance Progression
|
||
|
||
| Phase | Optimization | Throughput | vs Baseline | vs System |
|
||
|-------|--------------|------------|-------------|-----------|
|
||
| **Baseline** (Phase 8) | - | 563K ops/s | +0% | 1.1% |
|
||
| **Phase 9** (LRU + mincore removal) | Lazy deallocation | 9.71M ops/s | +1,625% | 18.7% |
|
||
| **Phase 10** (TLS/SFC tuning) | Frontend expansion | 9.89M ops/s | +1,657% | 19.0% |
|
||
| **Phase 11** (Prewarm) | Startup SS allocation | 9.38M ops/s | +1,566% | 18.1% |
|
||
| **Phase 12-A** (TLS SLL Drain) | Periodic drain | 6.1M ops/s | +984% | 11.8% |
|
||
| **Phase 12-B** (SP-SLOT Box) | Per-slot management | 1.30M ops/s | +131% | 2.5% |
|
||
| **Current (optimized ENV)** | fast_cap=32 | **5.2M ops/s** | **+824%** | **10.0%** |
|
||
|
||
**Note**: Discrepancy between Phase 12-B (1.30M) and Current (5.2M) due to **ENV configuration**:
|
||
- Default: No ENV → 1.30M ops/s
|
||
- Optimized: `HAKMEM_TINY_FAST_CAP=32 + other flags` → 5.2M ops/s
|
||
|
||
---
|
||
|
||
## 6. Working Set Sensitivity
|
||
|
||
**Test Results** (fast_cap=32, spec_mask=0):
|
||
|
||
| Cycles | WS | Throughput | vs ws=4096 |
|
||
|--------|-----|------------|------------|
|
||
| 200K | 4,096 | 5.2M ops/s | 100% (baseline) |
|
||
| 200K | 8,192 | 4.0M ops/s | -23% |
|
||
| 400K | 4,096 | 5.3M ops/s | +2% |
|
||
| 400K | 8,192 | 4.7M ops/s | -10% |
|
||
|
||
**Observation**: **23% performance drop** when working set doubles (4K→8K)
|
||
|
||
**Hypothesis**:
|
||
- Larger working set → more backend allocation calls
|
||
- TLS cache misses increase
|
||
- SuperSlab churn increases (more Stage 3 allocations)
|
||
|
||
**Implication**: Current frontend cache size (fast_cap=32) insufficient for large working sets.
|
||
|
||
---
|
||
|
||
## 7. Recommended Next Steps (Priority Order)
|
||
|
||
### Step 1: Fix Mid-Large Allocator (URGENT) 🔥
|
||
|
||
**Priority**: P0 (Blocking)
|
||
**Impact**: 97x gap with mimalloc
|
||
**Effort**: Medium
|
||
|
||
**Tasks**:
|
||
1. Investigate `hkm_ace_alloc` NULL returns
|
||
2. Check Pool TLS arena initialization
|
||
3. Verify threshold logic for 8-32KB allocations
|
||
4. Add debug logging to trace allocation path
|
||
|
||
**Success Criteria**: Mid-Large throughput >1M ops/s (current: 0.24M)
|
||
|
||
---
|
||
|
||
### Step 2: Optimize Shared Pool Lock Contention
|
||
|
||
**Priority**: P1 (High)
|
||
**Impact**: 68% syscall time
|
||
**Effort**: Medium
|
||
|
||
**Options** (in order of risk):
|
||
|
||
**A) Lock-free Stage 1 (Low Risk)**:
|
||
```c
|
||
// Per-class atomic LIFO for EMPTY slot reuse
|
||
_Atomic(FreeSlotEntry*) g_free_list_heads[TINY_NUM_CLASSES];
|
||
|
||
// Lock-free pop (Stage 1 fast path)
|
||
FreeSlotEntry* sp_freelist_pop_lockfree(int class_idx) {
|
||
FreeSlotEntry* head = atomic_load(&g_free_list_heads[class_idx]);
|
||
while (head != NULL) {
|
||
if (atomic_compare_exchange_weak(&g_free_list_heads[class_idx], &head, head->next)) {
|
||
return head;
|
||
}
|
||
}
|
||
return NULL; // Fall back to locked Stage 2/3
|
||
}
|
||
```
|
||
|
||
**Expected**: -50% futex overhead (Stage 1 hit rate: 4.6% → lock-free)
|
||
|
||
**B) Reduce Lock Scope (Medium Risk)**:
|
||
```c
|
||
// Move metadata scan outside lock
|
||
int candidate_slot = sp_meta_scan_unlocked(); // Read-only
|
||
pthread_mutex_lock(&g_shared_pool.alloc_lock);
|
||
if (sp_slot_try_claim(candidate_slot)) { // Quick CAS
|
||
// Success
|
||
}
|
||
pthread_mutex_unlock(&g_shared_pool.alloc_lock);
|
||
```
|
||
|
||
**Expected**: -30% futex overhead (reduce lock hold time)
|
||
|
||
**C) Per-Class Locks (High Risk)**:
|
||
```c
|
||
pthread_mutex_t g_class_locks[TINY_NUM_CLASSES]; // Replace global lock
|
||
```
|
||
|
||
**Expected**: -80% futex overhead (eliminate cross-class contention)
|
||
**Risk**: Complexity increase, potential deadlocks
|
||
|
||
**Recommendation**: Start with **Option A** (lowest risk, measurable impact).
|
||
|
||
---
|
||
|
||
### Step 3: TLS Drain Interval Tuning (Low Risk)
|
||
|
||
**Priority**: P2 (Medium)
|
||
**Impact**: TBD (experimental)
|
||
**Effort**: Low (ENV-only A/B testing)
|
||
|
||
**Current**: 1,024 frees/class (`HAKMEM_TINY_SLL_DRAIN_INTERVAL=1024`)
|
||
|
||
**Experiment Matrix**:
|
||
| Interval | Expected Impact |
|
||
|----------|-----------------|
|
||
| 512 | -50% drain overhead, +syscalls (more frequent SS release) |
|
||
| 2,048 | +100% drain overhead, -syscalls (less frequent SS release) |
|
||
| 4,096 | +300% drain overhead, --syscalls (minimal SS release) |
|
||
|
||
**Metrics to Track**:
|
||
- Throughput (ops/s)
|
||
- mmap/munmap count (strace)
|
||
- TLS SLL drain frequency (debug log)
|
||
|
||
**Success Criteria**: Find optimal balance (throughput > 5.5M ops/s, syscalls < 3,000)
|
||
|
||
---
|
||
|
||
### Step 4: Frontend Cache Tuning (Medium Risk)
|
||
|
||
**Priority**: P3 (Low)
|
||
**Impact**: +10-20% expected
|
||
**Effort**: Low (ENV-only A/B testing)
|
||
|
||
**Current Best**: fast_cap=32
|
||
|
||
**Experiment Matrix**:
|
||
| fast_cap | refill_count_hot | Expected Impact |
|
||
|----------|------------------|-----------------|
|
||
| 64 | 64 | +5-10% (diminishing returns) |
|
||
| 64 | 128 | +10-15% (better batch refill) |
|
||
| 128 | 128 | +15-20% (max cache size) |
|
||
|
||
**Metrics to Track**:
|
||
- Throughput (ops/s)
|
||
- Stage 3 frequency (debug log)
|
||
- Working set sensitivity (ws=8192 test)
|
||
|
||
**Success Criteria**: Throughput > 6M ops/s on ws=4096, <10% drop on ws=8192
|
||
|
||
---
|
||
|
||
### Step 5: Trace Remaining Syscalls (Investigation)
|
||
|
||
**Priority**: P4 (Low)
|
||
**Impact**: TBD
|
||
**Effort**: Low
|
||
|
||
**Questions**:
|
||
1. **madvise (1,591 calls)**: Where are these from?
|
||
- Add debug logging to all `madvise()` call sites
|
||
- Check Pool TLS arena, Mid-Large allocator
|
||
|
||
2. **mincore (1,574 calls)**: Why still present?
|
||
- Grep codebase for `mincore` calls
|
||
- Check if Phase 9 removal was incomplete
|
||
|
||
**Tools**:
|
||
```bash
|
||
# Trace madvise source
|
||
strace -e trace=madvise -k ./bench_random_mixed_hakmem 200000 4096 1234567
|
||
|
||
# Grep for mincore
|
||
grep -r "mincore" core/ --include="*.c" --include="*.h"
|
||
```
|
||
|
||
---
|
||
|
||
## 8. Risk Assessment
|
||
|
||
| Optimization | Impact | Effort | Risk | Recommendation |
|
||
|--------------|--------|--------|------|----------------|
|
||
| **Mid-Large Fix** | +++++ | ++ | Low | **DO NOW** 🔥 |
|
||
| **Lock-free Stage 1** | +++ | ++ | Low | **DO NEXT** ✅ |
|
||
| **Drain Interval Tune** | ++ | + | Low | **DO NEXT** ✅ |
|
||
| **Frontend Cache Tune** | ++ | + | Low | **DO AFTER** |
|
||
| **Reduce Lock Scope** | +++ | +++ | Med | Consider |
|
||
| **Per-Class Locks** | ++++ | ++++ | High | Avoid (complex) |
|
||
| **Trace Syscalls** | ? | + | Low | Background task |
|
||
|
||
---
|
||
|
||
## 9. Expected Performance Targets
|
||
|
||
### Short-Term (1-2 weeks)
|
||
|
||
| Metric | Current | Target | Strategy |
|
||
|--------|---------|--------|----------|
|
||
| **Mid-Large throughput** | 0.24M ops/s | **>1M ops/s** | Fix `hkm_ace_alloc` |
|
||
| **Tiny throughput (ws=4096)** | 5.2M ops/s | **>7M ops/s** | Lock-free + drain tune |
|
||
| **futex overhead** | 68% | **<30%** | Lock-free Stage 1 |
|
||
| **mmap+munmap** | 3,357 | **<2,500** | Drain interval tune |
|
||
|
||
### Medium-Term (1-2 months)
|
||
|
||
| Metric | Current | Target | Strategy |
|
||
|--------|---------|--------|----------|
|
||
| **Tiny throughput (ws=4096)** | 5.2M ops/s | **>15M ops/s** | Full optimization |
|
||
| **vs System malloc** | 10% | **>25%** | Close gap by 15pp |
|
||
| **vs mimalloc** | 9% | **>20%** | Close gap by 11pp |
|
||
|
||
### Long-Term (3-6 months)
|
||
|
||
| Metric | Current | Target | Strategy |
|
||
|--------|---------|--------|----------|
|
||
| **Tiny throughput** | 5.2M ops/s | **>40M ops/s** | Architectural overhaul |
|
||
| **vs System malloc** | 10% | **>70%** | Competitive performance |
|
||
| **vs mimalloc** | 9% | **>60%** | Industry-standard |
|
||
|
||
---
|
||
|
||
## 10. Lessons Learned
|
||
|
||
### 1. ENV Configuration is Critical
|
||
|
||
**Discovery**: Default (1.30M) vs Optimized (5.2M) = **+300% gap**
|
||
**Lesson**: Always document and automate optimal ENV settings
|
||
**Action**: Create `scripts/bench_optimal_env.sh` with best-known config
|
||
|
||
### 2. Mid-Large Allocator Broken
|
||
|
||
**Discovery**: 97x slower than mimalloc, NULL returns
|
||
**Lesson**: Integration testing insufficient (bench suite doesn't cover 8-32KB properly)
|
||
**Action**: Add `bench_mid_large_single_thread.sh` to CI suite
|
||
|
||
### 3. futex Overhead Unexpected
|
||
|
||
**Discovery**: 68% time in single-threaded workload
|
||
**Lesson**: Shared pool global lock is a bottleneck even without contention
|
||
**Action**: Profile lock hold time, consider lock-free paths
|
||
|
||
### 4. SP-SLOT Stage 2 Dominates
|
||
|
||
**Discovery**: 92.4% of allocations reuse UNUSED slots (Stage 2)
|
||
**Lesson**: Multi-class sharing >> per-class free lists
|
||
**Action**: Optimize Stage 2 path (lock-free metadata scan?)
|
||
|
||
---
|
||
|
||
## 11. Conclusion
|
||
|
||
**Current State**:
|
||
- ✅ SP-SLOT Box successfully reduced SuperSlab churn by 92%
|
||
- ✅ Syscall overhead reduced by 48% (mmap+munmap)
|
||
- ⚠️ Still 10x slower than System malloc (Tiny)
|
||
- 🔥 Mid-Large allocator critically broken (97x slower than mimalloc)
|
||
|
||
**Next Priorities**:
|
||
1. **Fix Mid-Large allocator** (P0, blocking)
|
||
2. **Optimize shared pool lock** (P1, 68% syscall time)
|
||
3. **Tune drain interval** (P2, low-risk improvement)
|
||
4. **Tune frontend cache** (P3, diminishing returns)
|
||
|
||
**Expected Impact** (short-term):
|
||
- Mid-Large: 0.24M → >1M ops/s (+316%)
|
||
- Tiny: 5.2M → >7M ops/s (+35%)
|
||
- futex overhead: 68% → <30% (-56%)
|
||
|
||
**Long-Term Vision**:
|
||
- Close gap to 70% of System malloc performance (40M ops/s target)
|
||
- Competitive with industry-standard allocators (mimalloc, jemalloc)
|
||
|
||
---
|
||
|
||
**Report Generated**: 2025-11-14
|
||
**Tool**: Claude Code
|
||
**Phase**: Post SP-SLOT Box Implementation
|
||
**Status**: ✅ Analysis Complete, Ready for Implementation
|