# HAKMEM Bottleneck Analysis Report
**Date**: 2025-11-14
**Phase**: Post SP-SLOT Box Implementation
**Objective**: Identify next optimization targets to close gap with System malloc / mimalloc
---
## Executive Summary
Comprehensive performance analysis reveals a **10x gap with System malloc** on the Tiny allocator and a **22x gap** on the Mid-Large allocator. The primary bottlenecks identified are **syscall overhead** (futex: 68% of syscall time), **Frontend cache misses**, and **Mid-Large allocator failure**.
### Performance Gaps (Current State)
| Allocator | Tiny (random_mixed) | Mid-Large MT (8-32KB) |
|-----------|---------------------|----------------------|
| **System malloc** | 51.9M ops/s (100%) | 5.4M ops/s (100%) |
| **mimalloc** | 57.5M ops/s (111%) | 24.2M ops/s (448%) |
| **HAKMEM (best)** | 5.2M ops/s (**10%**) | 0.24M ops/s (**4.4%**) |
| **Gap** | **-90% (10x slower)** | **-95.6% (22x slower)** |
**Urgent**: Mid-Large allocator requires immediate attention (97x slower than mimalloc).
---
## 1. Benchmark Results: Current State
### 1.1 Random Mixed (Tiny Allocator: 16B-1KB)
**Test Configuration**:
- 200K iterations
- Working set: 4,096 slots
- Size range: 16-1040 bytes (C0-C7 classes)
**Results**:
| Variant | spec_mask | fast_cap | Throughput | vs System | vs mimalloc |
|---------|-----------|----------|------------|-----------|-------------|
| **System malloc** | - | - | 51.9M ops/s | 100% | 90% |
| **mimalloc** | - | - | 57.5M ops/s | 111% | 100% |
| **HAKMEM** | 0 | 8 | 3.6M ops/s | 6.9% | 6.3% |
| **HAKMEM** | 0 | 16 | 4.6M ops/s | 8.9% | 8.0% |
| **HAKMEM** | 0 | **32** | **5.2M ops/s** | **10.0%** | **9.0%** |
| **HAKMEM** | 0x0F | 32 | 5.18M ops/s | 10.0% | 9.0% |
**Key Findings**:
- **Best HAKMEM config**: fast_cap=32, spec_mask=0 → **5.2M ops/s**
- **Gap**: 10x slower than System, 11x slower than mimalloc
- **spec_mask effect**: Negligible (<1% difference)
- **fast_cap scaling**: 8→16 (+28%), 16→32 (+13%)
### 1.2 Mid-Large MT (8-32KB Allocations)
**Test Configuration**:
- 2 threads
- 40K cycles
- Working set: 2,048 slots
**Results**:
| Allocator | Throughput | vs System | vs mimalloc |
|-----------|------------|-----------|-------------|
| **System malloc** | 5.4M ops/s | 100% | 22% |
| **mimalloc** | 24.2M ops/s | 448% | 100% |
| **HAKMEM (base)** | 0.243M ops/s | **4.4%** | **1.0%** |
| **HAKMEM (no bigcache)** | 0.251M ops/s | 4.6% | 1.0% |
**Critical Issue**:
```
[ALLOC] 33KB: hkm_ace_alloc returned (nil) ← Repeated failures
```
**Gap**: 22x slower than System, **97x slower than mimalloc** 💀
**Root Cause**: `hkm_ace_alloc` consistently returns NULL → Mid-Large allocator not functioning properly.
---
## 2. Syscall Analysis (strace)
### 2.1 System Call Distribution (200K iterations)
| Syscall | Calls | % Time | usec/call | Category |
|---------|-------|--------|-----------|----------|
| **futex** | 36 | **68.18%** | 1,970 | Synchronization ⚠️ |
| **munmap** | 1,665 | 11.60% | 7 | SS deallocation |
| **mmap** | 1,692 | 7.28% | 4 | SS allocation |
| **madvise** | 1,591 | 6.85% | 4 | Memory advice |
| **mincore** | 1,574 | 5.51% | 3 | Page existence check |
| **Other** | 1,141 | 0.57% | - | Misc |
| **Total** | **6,703** | 100% | 15 (avg) | |
### 2.2 Key Observations
**Unexpected: futex Dominates (68% time)**
- **36 futex calls** consuming **68.18% of syscall time**
- **1,970 usec/call** (extremely slow!)
- **Context**: `bench_random_mixed` is **single-threaded**
- **Hypothesis**: Contention in shared pool lock (`pthread_mutex_lock` in `shared_pool_acquire_slab`)
**SP-SLOT Impact Confirmed**:
```
Before SP-SLOT: mmap (3,241) + munmap (3,214) = 6,455 calls
After SP-SLOT: mmap (1,692) + munmap (1,665) = 3,357 calls
Reduction: -48% (-3,098 calls) ✅
```
**Remaining syscall overhead**:
- **madvise**: 1,591 calls (6.85% time) - from other allocators?
- **mincore**: 1,574 calls (5.51% time) - still present despite Phase 9 removal?
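For reference, a per-syscall summary like the one in 2.1 can be reproduced with strace's counting mode. This is a command sketch; the benchmark arguments mirror those used in Step 5:
```bash
# Aggregate syscall counts and time for the single-threaded Tiny benchmark
strace -c -f ./bench_random_mixed_hakmem 200000 4096 1234567
```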
---
## 3. SP-SLOT Box Effectiveness Review
### 3.1 SuperSlab Allocation Reduction
**Measured with debug logging** (`HAKMEM_SS_ACQUIRE_DEBUG=1`):
| Metric | Before SP-SLOT | After SP-SLOT | Improvement |
|--------|----------------|---------------|-------------|
| **New SuperSlabs** (Stage 3) | 877 (200K iters) | 72 (200K iters) | **-92%** 🎉 |
| **Syscalls (mmap+munmap)** | 6,455 | 3,357 | **-48%** |
| **Throughput** | 563K ops/s | 1.30M ops/s | **+131%** |
### 3.2 Allocation Stage Distribution (50K iterations)
| Stage | Description | Count | % |
|-------|-------------|-------|---|
| **Stage 1** | EMPTY slot reuse (per-class free list) | 105 | 4.6% |
| **Stage 2** | **UNUSED slot reuse (multi-class sharing)** | **2,117** | **92.4%** ✅ |
| **Stage 3** | New SuperSlab (mmap) | 69 | 3.0% |
| **Total** | | 2,291 | 100% |
**Key Insight**: Stage 2 (UNUSED reuse) is dominant, proving **multi-class SuperSlab sharing works**.
---
## 4. Identified Bottlenecks (Priority Order)
### Priority 1: Mid-Large Allocator Failure 🔥
**Impact**: 97x slower than mimalloc
**Symptom**: `hkm_ace_alloc` returns NULL
**Evidence**:
```
[ALLOC] 33KB: TINY_MAX_SIZE=1024, threshold=524288, condition=1
[ALLOC] 33KB: Calling hkm_ace_alloc
[ALLOC] 33KB: hkm_ace_alloc returned (nil) ← Repeated failures
```
**Root Cause Hypothesis**:
- Pool TLS arena not initialized?
- Threshold logic preventing 8-32KB allocations?
- Bug in `hkm_ace_alloc` path?
**Action Required**: Immediate investigation (blocking)
---
### Priority 2: futex Overhead (68% syscall time) ⚠️
**Impact**: 68.18% of syscall time (1,970 usec/call)
**Symptom**: Excessive lock contention in shared pool
**Root Cause**:
```c
// core/hakmem_shared_pool.c:343
pthread_mutex_lock(&g_shared_pool.alloc_lock);  // ← Contention point?
```
**Hypothesis**:
- `shared_pool_acquire_slab()` called frequently (2,291 times / 50K iters)
- Lock held too long (metadata scans, dynamic array growth)
- Contention even in single-threaded workload (TLS drain threads?)
**Potential Solutions**:
1. **Lock-free fast path**: Per-class lock-free pop from free lists (Stage 1)
2. **Reduce lock scope**: Move metadata scans outside critical section
3. **Batch acquire**: Acquire multiple slabs per lock acquisition (see the sketch below)
4. **Per-class locks**: Replace global lock with per-class locks
**Expected Impact**: -50-80% reduction in futex time
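Of these, solution 3 (batch acquire) is the only one not sketched later in Step 2. A minimal, hypothetical illustration of the idea follows; the helper names, the `Slab` type, and the TLS stash are placeholders, not existing HAKMEM APIs:
```c
#include <pthread.h>
#include <stddef.h>

#define SP_ACQUIRE_BATCH 4

typedef struct Slab Slab;   /* opaque slab handle (placeholder) */

extern pthread_mutex_t g_shared_pool_alloc_lock;          /* stand-in for g_shared_pool.alloc_lock */
extern Slab* sp_acquire_one_locked(int class_idx);        /* stand-in for the existing locked acquire body */
extern void  tls_slab_cache_push(int class_idx, Slab* s); /* hypothetical per-thread stash */

/* Acquire several slabs under one lock acquisition; return the first,
 * park the rest in a TLS cache so subsequent acquires skip the lock. */
Slab* sp_acquire_batched(int class_idx) {
    Slab* first = NULL;
    pthread_mutex_lock(&g_shared_pool_alloc_lock);
    for (int i = 0; i < SP_ACQUIRE_BATCH; i++) {
        Slab* s = sp_acquire_one_locked(class_idx);
        if (!s) break;                  /* pool exhausted: stop early */
        if (!first) first = s;          /* caller gets the first slab */
        else tls_slab_cache_push(class_idx, s);
    }
    pthread_mutex_unlock(&g_shared_pool_alloc_lock);
    return first;                       /* NULL falls through to Stage 3 (mmap) */
}
```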
---
### Priority 3: Frontend Cache Miss Rate
**Impact**: Driving backend allocation frequency (2,291 acquires / 50K iters = 4.6%)
**Current Config**: fast_cap=32 (best performance)
**Evidence**: fast_cap scaling (8→16: +28%, 16→32: +13%)
**Hypothesis**:
- TLS cache capacity too small for working set (4,096 slots)
- Refill batch size suboptimal
- Specialize mask (0x0F) shows no benefit (<1% difference)
**Potential Solutions**:
1. **Increase fast_cap**: Test 64 / 128 (diminishing returns expected)
2. **Tune refill batch**: Current 64 (HAKMEM_TINY_REFILL_COUNT_HOT) → test 128 / 256
3. **Class-specific tuning**: Hot classes (C6, C7) get larger caches
**Expected Impact**: +10-20% throughput (backend call reduction)
---
### Priority 4: Remaining syscall Overhead (mmap/munmap/madvise/mincore)
**Impact**: 30.59% syscall time (3,357 mmap/munmap + 1,591 madvise + 1,574 mincore)
**Status**: Significantly improved vs pre-SP-SLOT (-48% mmap/munmap)
**Remaining Issues**:
1. **madvise (1,591 calls)**: Where are these coming from?
- Pool TLS arena (8-52KB)?
- Mid-Large allocator (broken)?
- Other internal structures?
2. **mincore (1,574 calls)**: Still present despite Phase 9 removal claim
- Source location unknown
- May be from other allocators or debug paths
**Action Required**: Trace source of madvise/mincore calls
---
## 5. Performance Evolution Timeline
### Historical Performance Progression
| Phase | Optimization | Throughput | vs Baseline | vs System |
|-------|--------------|------------|-------------|-----------|
| **Baseline** (Phase 8) | - | 563K ops/s | +0% | 1.1% |
| **Phase 9** (LRU + mincore removal) | Lazy deallocation | 9.71M ops/s | +1,625% | 18.7% |
| **Phase 10** (TLS/SFC tuning) | Frontend expansion | 9.89M ops/s | +1,657% | 19.0% |
| **Phase 11** (Prewarm) | Startup SS allocation | 9.38M ops/s | +1,566% | 18.1% |
| **Phase 12-A** (TLS SLL Drain) | Periodic drain | 6.1M ops/s | +984% | 11.8% |
| **Phase 12-B** (SP-SLOT Box) | Per-slot management | 1.30M ops/s | +131% | 2.5% |
| **Current (optimized ENV)** | fast_cap=32 | **5.2M ops/s** | **+824%** | **10.0%** |
**Note**: The discrepancy between Phase 12-B (1.30M) and Current (5.2M) is due to **ENV configuration**:
- Default: No ENV → 1.30M ops/s
- Optimized: `HAKMEM_TINY_FAST_CAP=32 + other flags` → 5.2M ops/s
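For reproducibility, the two configurations can be compared as below. Only `HAKMEM_TINY_FAST_CAP` is named in this report; the remaining optimized flags are not enumerated here, so this sketch shows only the documented one:
```bash
# Default ENV (~1.30M ops/s)
./bench_random_mixed_hakmem 200000 4096 1234567

# Optimized ENV (~5.2M ops/s with the full flag set; only the documented flag shown)
HAKMEM_TINY_FAST_CAP=32 ./bench_random_mixed_hakmem 200000 4096 1234567
```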
---
## 6. Working Set Sensitivity
**Test Results** (fast_cap=32, spec_mask=0):
| Cycles | WS | Throughput | vs ws=4096 |
|--------|-----|------------|------------|
| 200K | 4,096 | 5.2M ops/s | 100% (baseline) |
| 200K | 8,192 | 4.0M ops/s | -23% |
| 400K | 4,096 | 5.3M ops/s | +2% |
| 400K | 8,192 | 4.7M ops/s | -10% |
**Observation**: **23% performance drop** when working set doubles (4K→8K)
**Hypothesis**:
- Larger working set → more backend allocation calls
- TLS cache misses increase
- SuperSlab churn increases (more Stage 3 allocations)
**Implication**: Current frontend cache size (fast_cap=32) insufficient for large working sets.
---
## 7. Recommended Next Steps (Priority Order)
### Step 1: Fix Mid-Large Allocator (URGENT) 🔥
**Priority**: P0 (Blocking)
**Impact**: 97x gap with mimalloc
**Effort**: Medium
**Tasks**:
1. Investigate `hkm_ace_alloc` NULL returns
2. Check Pool TLS arena initialization
3. Verify threshold logic for 8-32KB allocations
4. Add debug logging to trace the allocation path (see the sketch below)
**Success Criteria**: Mid-Large throughput >1M ops/s (current: 0.24M)
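A minimal sketch of the Task 4 instrumentation, assuming a debug-only build guard (`HAKMEM_BUILD_RELEASE`) and a `void* hkm_ace_alloc(size_t)` signature; both are assumptions, and the wrapper name is illustrative:
```c
#include <stdio.h>
#include <stddef.h>

extern void* hkm_ace_alloc(size_t size);   /* assumed signature of the Mid-Large entry point */

/* Logs NULL returns on debug builds only, keeping fprintf out of release hot paths. */
static inline void* hkm_ace_alloc_traced(size_t size) {
    void* p = hkm_ace_alloc(size);
#if !HAKMEM_BUILD_RELEASE
    if (p == NULL) {
        fprintf(stderr, "[ALLOC] %zu bytes: hkm_ace_alloc returned NULL\n", size);
    }
#endif
    return p;
}
```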
---
### Step 2: Optimize Shared Pool Lock Contention
**Priority**: P1 (High)
**Impact**: 68% syscall time
**Effort**: Medium
**Options** (in order of risk):
**A) Lock-free Stage 1 (Low Risk)**:
```c
#include <stdatomic.h>   // C11 atomics for the lock-free LIFO

// Per-class atomic LIFO head for EMPTY slot reuse
// (FreeSlotEntry is assumed to carry a `next` pointer)
_Atomic(FreeSlotEntry*) g_free_list_heads[TINY_NUM_CLASSES];

// Lock-free pop (Stage 1 fast path)
FreeSlotEntry* sp_freelist_pop_lockfree(int class_idx) {
    FreeSlotEntry* head = atomic_load(&g_free_list_heads[class_idx]);
    while (head != NULL) {
        // On CAS failure, `head` is reloaded automatically, so the loop retries
        if (atomic_compare_exchange_weak(&g_free_list_heads[class_idx],
                                         &head, head->next)) {
            return head;
        }
    }
    return NULL; // Empty: fall back to the locked Stage 2/3 path
}
```
**Expected**: -50% futex overhead (Stage 1 hit rate: 4.6% → lock-free)
**B) Reduce Lock Scope (Medium Risk)**:
```c
// Scan metadata for a candidate slot outside the lock (read-only)
int candidate_slot = sp_meta_scan_unlocked();
pthread_mutex_lock(&g_shared_pool.alloc_lock);
if (sp_slot_try_claim(candidate_slot)) {  // Quick CAS-style claim under the lock
    // Success: slot is ours
}
pthread_mutex_unlock(&g_shared_pool.alloc_lock);
```
**Expected**: -30% futex overhead (reduce lock hold time)
**C) Per-Class Locks (High Risk)**:
```c
pthread_mutex_t g_class_locks[TINY_NUM_CLASSES]; // Replace global lock
```
**Expected**: -80% futex overhead (eliminate cross-class contention)
**Risk**: Complexity increase, potential deadlocks
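A minimal illustration of Option C, assuming the acquire body can be partitioned cleanly by class; `sp_acquire_for_class_locked()` and the `SlabSlot` type are placeholders:
```c
#include <pthread.h>

#define TINY_NUM_CLASSES 8                 /* assumed: size classes C0-C7 */

typedef struct SlabSlot SlabSlot;          /* opaque slot handle (placeholder) */

extern SlabSlot* sp_acquire_for_class_locked(int class_idx);  /* stand-in for the existing locked body */

static pthread_mutex_t g_class_locks[TINY_NUM_CLASSES] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Acquire serialized only against the same size class, so threads
 * allocating different classes no longer contend on a global lock. */
SlabSlot* sp_acquire_per_class(int class_idx) {
    pthread_mutex_lock(&g_class_locks[class_idx]);
    SlabSlot* slot = sp_acquire_for_class_locked(class_idx);
    pthread_mutex_unlock(&g_class_locks[class_idx]);
    return slot;
}
```
Note that Stage 2 deliberately reuses slots across classes (92.4% of acquisitions per section 3.2), so any cross-class path would need a fixed lock-ordering discipline; that is the deadlock risk flagged above.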
**Recommendation**: Start with **Option A** (lowest risk, measurable impact).
---
### Step 3: TLS Drain Interval Tuning (Low Risk)
**Priority**: P2 (Medium)
**Impact**: TBD (experimental)
**Effort**: Low (ENV-only A/B testing)
**Current**: 1,024 frees/class (`HAKMEM_TINY_SLL_DRAIN_INTERVAL=1024`)
**Experiment Matrix**:
| Interval | Expected Impact |
|----------|-----------------|
| 512 | -50% drain overhead, +syscalls (more frequent SS release) |
| 2,048 | +100% drain overhead, -syscalls (less frequent SS release) |
| 4,096 | +300% drain overhead, --syscalls (minimal SS release) |
**Metrics to Track**:
- Throughput (ops/s)
- mmap/munmap count (strace)
- TLS SLL drain frequency (debug log)
**Success Criteria**: Find optimal balance (throughput > 5.5M ops/s, syscalls < 3,000)
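A sketch of the A/B harness for this matrix, assuming the interval is controlled by the `HAKMEM_TINY_SLL_DRAIN_INTERVAL` environment variable noted above and that the benchmark takes (iterations, working-set, seed) as in Step 5:
```bash
#!/usr/bin/env bash
# Sweep the TLS drain interval and capture throughput plus syscall counts.
for interval in 512 1024 2048 4096; do
  echo "=== HAKMEM_TINY_SLL_DRAIN_INTERVAL=$interval ==="
  HAKMEM_TINY_SLL_DRAIN_INTERVAL=$interval \
    ./bench_random_mixed_hakmem 200000 4096 1234567
  # Separate run for syscall counts: strace itself perturbs throughput
  HAKMEM_TINY_SLL_DRAIN_INTERVAL=$interval \
    strace -c -f -o "strace_drain_${interval}.txt" \
    ./bench_random_mixed_hakmem 200000 4096 1234567
done
```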
---
### Step 4: Frontend Cache Tuning (Medium Risk)
**Priority**: P3 (Low)
**Impact**: +10-20% expected
**Effort**: Low (ENV-only A/B testing)
**Current Best**: fast_cap=32
**Experiment Matrix**:
| fast_cap | refill_count_hot | Expected Impact |
|----------|------------------|-----------------|
| 64 | 64 | +5-10% (diminishing returns) |
| 64 | 128 | +10-15% (better batch refill) |
| 128 | 128 | +15-20% (max cache size) |
**Metrics to Track**:
- Throughput (ops/s)
- Stage 3 frequency (debug log)
- Working set sensitivity (ws=8192 test)
**Success Criteria**: Throughput > 6M ops/s on ws=4096, <10% drop on ws=8192
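A sketch of the corresponding sweep, using the two tunables named in this report (`HAKMEM_TINY_FAST_CAP`, `HAKMEM_TINY_REFILL_COUNT_HOT`); the benchmark argument order (iterations, working-set, seed) is assumed from Step 5:
```bash
#!/usr/bin/env bash
# Frontend cache A/B matrix: fast_cap x refill batch size, at two working-set sizes.
for fast_cap in 32 64 128; do
  for refill in 64 128; do
    echo "=== fast_cap=$fast_cap refill=$refill ==="
    for ws in 4096 8192; do
      HAKMEM_TINY_FAST_CAP=$fast_cap \
      HAKMEM_TINY_REFILL_COUNT_HOT=$refill \
        ./bench_random_mixed_hakmem 200000 "$ws" 1234567
    done
  done
done
```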
---
### Step 5: Trace Remaining Syscalls (Investigation)
**Priority**: P4 (Low)
**Impact**: TBD
**Effort**: Low
**Questions**:
1. **madvise (1,591 calls)**: Where are these from?
- Add debug logging to all `madvise()` call sites
- Check Pool TLS arena, Mid-Large allocator
2. **mincore (1,574 calls)**: Why still present?
- Grep codebase for `mincore` calls
- Check if Phase 9 removal was incomplete
**Tools**:
```bash
# Trace madvise source
strace -e trace=madvise -k ./bench_random_mixed_hakmem 200000 4096 1234567
# Grep for mincore
grep -r "mincore" core/ --include="*.c" --include="*.h"
```
---
## 8. Risk Assessment
| Optimization | Impact | Effort | Risk | Recommendation |
|--------------|--------|--------|------|----------------|
| **Mid-Large Fix** | +++++ | ++ | Low | **DO NOW** 🔥 |
| **Lock-free Stage 1** | +++ | ++ | Low | **DO NEXT** ✅ |
| **Drain Interval Tune** | ++ | + | Low | **DO NEXT** ✅ |
| **Frontend Cache Tune** | ++ | + | Low | **DO AFTER** |
| **Reduce Lock Scope** | +++ | +++ | Med | Consider |
| **Per-Class Locks** | ++++ | ++++ | High | Avoid (complex) |
| **Trace Syscalls** | ? | + | Low | Background task |
---
## 9. Expected Performance Targets
### Short-Term (1-2 weeks)
| Metric | Current | Target | Strategy |
|--------|---------|--------|----------|
| **Mid-Large throughput** | 0.24M ops/s | **>1M ops/s** | Fix `hkm_ace_alloc` |
| **Tiny throughput (ws=4096)** | 5.2M ops/s | **>7M ops/s** | Lock-free + drain tune |
| **futex overhead** | 68% | **<30%** | Lock-free Stage 1 |
| **mmap+munmap** | 3,357 | **<2,500** | Drain interval tune |
### Medium-Term (1-2 months)
| Metric | Current | Target | Strategy |
|--------|---------|--------|----------|
| **Tiny throughput (ws=4096)** | 5.2M ops/s | **>15M ops/s** | Full optimization |
| **vs System malloc** | 10% | **>25%** | Close gap by 15pp |
| **vs mimalloc** | 9% | **>20%** | Close gap by 11pp |
### Long-Term (3-6 months)
| Metric | Current | Target | Strategy |
|--------|---------|--------|----------|
| **Tiny throughput** | 5.2M ops/s | **>40M ops/s** | Architectural overhaul |
| **vs System malloc** | 10% | **>70%** | Competitive performance |
| **vs mimalloc** | 9% | **>60%** | Industry-standard |
---
## 10. Lessons Learned
### 1. ENV Configuration is Critical
**Discovery**: Default (1.30M) vs Optimized (5.2M) = **+300% gap**
**Lesson**: Always document and automate optimal ENV settings
**Action**: Create `scripts/bench_optimal_env.sh` with best-known config
### 2. Mid-Large Allocator Broken
**Discovery**: 97x slower than mimalloc, NULL returns
**Lesson**: Integration testing insufficient (bench suite doesn't cover 8-32KB properly)
**Action**: Add `bench_mid_large_single_thread.sh` to CI suite
### 3. futex Overhead Unexpected
**Discovery**: 68% time in single-threaded workload
**Lesson**: Shared pool global lock is a bottleneck even without contention
**Action**: Profile lock hold time, consider lock-free paths
### 4. SP-SLOT Stage 2 Dominates
**Discovery**: 92.4% of allocations reuse UNUSED slots (Stage 2)
**Lesson**: Multi-class sharing >> per-class free lists
**Action**: Optimize Stage 2 path (lock-free metadata scan?)
---
## 11. Conclusion
**Current State**:
- ✅ SP-SLOT Box successfully reduced SuperSlab churn by 92%
- ✅ Syscall overhead reduced by 48% (mmap+munmap)
- ⚠️ Still 10x slower than System malloc (Tiny)
- 🔥 Mid-Large allocator critically broken (97x slower than mimalloc)
**Next Priorities**:
1. **Fix Mid-Large allocator** (P0, blocking)
2. **Optimize shared pool lock** (P1, 68% syscall time)
3. **Tune drain interval** (P2, low-risk improvement)
4. **Tune frontend cache** (P3, diminishing returns)
**Expected Impact** (short-term):
- Mid-Large: 0.24M → >1M ops/s (+316%)
- Tiny: 5.2M → >7M ops/s (+35%)
- futex overhead: 68% → <30% (-56%)
**Long-Term Vision**:
- Close gap to 70% of System malloc performance (40M ops/s target)
- Competitive with industry-standard allocators (mimalloc, jemalloc)
---
**Report Generated**: 2025-11-14
**Tool**: Claude Code
**Phase**: Post SP-SLOT Box Implementation
**Status**: ✅ Analysis Complete, Ready for Implementation