# HAKMEM Bottleneck Analysis Report
**Date**: 2025-11-14
**Phase**: Post SP-SLOT Box Implementation
**Objective**: Identify next optimization targets to close gap with System malloc / mimalloc
---
## Executive Summary
Comprehensive performance analysis reveals a **10x gap with System malloc** on the Tiny allocator and a **22x gap** on the Mid-Large allocator. The primary bottlenecks identified are **syscall overhead** (futex: 68% of syscall time), **Frontend cache misses**, and **Mid-Large allocator failure**.
### Performance Gaps (Current State)
| Allocator | Tiny (random_mixed) | Mid-Large MT (8-32KB) |
|-----------|---------------------|----------------------|
| **System malloc** | 51.9M ops/s (100%) | 5.4M ops/s (100%) |
| **mimalloc** | 57.5M ops/s (111%) | 24.2M ops/s (448%) |
| **HAKMEM (best)** | 5.2M ops/s (**10%**) | 0.24M ops/s (**4.4%**) |
| **Gap** | **-90% (10x slower)** | **-95.6% (22x slower)** |
**Urgent**: Mid-Large allocator requires immediate attention (97x slower than mimalloc).
---
## 1. Benchmark Results: Current State
### 1.1 Random Mixed (Tiny Allocator: 16B-1KB)
**Test Configuration**:
- 200K iterations
- Working set: 4,096 slots
- Size range: 16-1040 bytes (C0-C7 classes)
**Results**:
| Variant | spec_mask | fast_cap | Throughput | vs System | vs mimalloc |
|---------|-----------|----------|------------|-----------|-------------|
| **System malloc** | - | - | 51.9M ops/s | 100% | 90% |
| **mimalloc** | - | - | 57.5M ops/s | 111% | 100% |
| **HAKMEM** | 0 | 8 | 3.6M ops/s | 6.9% | 6.3% |
| **HAKMEM** | 0 | 16 | 4.6M ops/s | 8.9% | 8.0% |
| **HAKMEM** | 0 | **32** | **5.2M ops/s** | **10.0%** | **9.0%** |
| **HAKMEM** | 0x0F | 32 | 5.18M ops/s | 10.0% | 9.0% |
**Key Findings**:
- **Best HAKMEM config**: fast_cap=32, spec_mask=0 → **5.2M ops/s**
- **Gap**: 10x slower than System, 11x slower than mimalloc
- **spec_mask effect**: Negligible (<1% difference)
- **fast_cap scaling**: 8→16 (+28%), 16→32 (+13%)
### 1.2 Mid-Large MT (8-32KB Allocations)
**Test Configuration**:
- 2 threads
- 40K cycles
- Working set: 2,048 slots
**Results**:
| Allocator | Throughput | vs System | vs mimalloc |
|-----------|------------|-----------|-------------|
| **System malloc** | 5.4M ops/s | 100% | 22% |
| **mimalloc** | 24.2M ops/s | 448% | 100% |
| **HAKMEM (base)** | 0.243M ops/s | **4.4%** | **1.0%** |
| **HAKMEM (no bigcache)** | 0.251M ops/s | 4.6% | 1.0% |
**Critical Issue**:
```
[ALLOC] 33KB: hkm_ace_alloc returned (nil) ← Repeated failures
```
**Gap**: 22x slower than System, **97x slower than mimalloc** 💀
**Root Cause**: `hkm_ace_alloc` consistently returns NULL → Mid-Large allocator not functioning properly.
---
## 2. Syscall Analysis (strace)
### 2.1 System Call Distribution (200K iterations)
| Syscall | Calls | % Time | usec/call | Category |
|---------|-------|--------|-----------|----------|
| **futex** | 36 | **68.18%** | 1,970 | Synchronization ⚠️ |
| **munmap** | 1,665 | 11.60% | 7 | SS deallocation |
| **mmap** | 1,692 | 7.28% | 4 | SS allocation |
| **madvise** | 1,591 | 6.85% | 4 | Memory advice |
| **mincore** | 1,574 | 5.51% | 3 | Page existence check |
| **Other** | 1,141 | 0.57% | - | Misc |
| **Total** | **6,703** | 100% | 15 (avg) | |
### 2.2 Key Observations
**Unexpected: futex Dominates (68% time)**
- **36 futex calls** consuming **68.18% of syscall time**
- **1,970 usec/call** (extremely slow!)
- **Context**: `bench_random_mixed` is **single-threaded**
- **Hypothesis**: Contention in shared pool lock (`pthread_mutex_lock` in `shared_pool_acquire_slab`)
**SP-SLOT Impact Confirmed**:
```
Before SP-SLOT: mmap (3,241) + munmap (3,214) = 6,455 calls
After SP-SLOT: mmap (1,692) + munmap (1,665) = 3,357 calls
Reduction: -48% (-3,098 calls) ✅
```
**Remaining syscall overhead**:
- **madvise**: 1,591 calls (6.85% time) - from other allocators?
- **mincore**: 1,574 calls (5.51% time) - still present despite Phase 9 removal?
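For reference, a per-syscall summary like the one in 2.1 can be reproduced with strace's counting mode. This is a command sketch; the benchmark arguments mirror those used in Step 5:
```bash
# Aggregate syscall counts and time for the single-threaded Tiny benchmark
strace -c -f ./bench_random_mixed_hakmem 200000 4096 1234567
```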
---
## 3. SP-SLOT Box Effectiveness Review
### 3.1 SuperSlab Allocation Reduction
**Measured with debug logging** (`HAKMEM_SS_ACQUIRE_DEBUG=1`):
| Metric | Before SP-SLOT | After SP-SLOT | Improvement |
|--------|----------------|---------------|-------------|
| **New SuperSlabs** (Stage 3) | 877 (200K iters) | 72 (200K iters) | **-92%** 🎉 |
| **Syscalls (mmap+munmap)** | 6,455 | 3,357 | **-48%** |
| **Throughput** | 563K ops/s | 1.30M ops/s | **+131%** |
### 3.2 Allocation Stage Distribution (50K iterations)
| Stage | Description | Count | % |
|-------|-------------|-------|---|
| **Stage 1** | EMPTY slot reuse (per-class free list) | 105 | 4.6% |
| **Stage 2** | **UNUSED slot reuse (multi-class sharing)** | **2,117** | **92.4%** ✅ |
| **Stage 3** | New SuperSlab (mmap) | 69 | 3.0% |
| **Total** | | 2,291 | 100% |
**Key Insight**: Stage 2 (UNUSED reuse) is dominant, proving **multi-class SuperSlab sharing works**.
---
## 4. Identified Bottlenecks (Priority Order)
### Priority 1: Mid-Large Allocator Failure 🔥
**Impact**: 97x slower than mimalloc
**Symptom**: `hkm_ace_alloc` returns NULL
**Evidence**:
```
[ALLOC] 33KB: TINY_MAX_SIZE=1024, threshold=524288, condition=1
[ALLOC] 33KB: Calling hkm_ace_alloc
[ALLOC] 33KB: hkm_ace_alloc returned (nil) ← Repeated failures
```
**Root Cause Hypothesis**:
- Pool TLS arena not initialized?
- Threshold logic preventing 8-32KB allocations?
- Bug in `hkm_ace_alloc` path?
**Action Required**: Immediate investigation (blocking)
---
### Priority 2: futex Overhead (68% syscall time) ⚠️
**Impact**: 68.18% of syscall time (1,970 usec/call)
**Symptom**: Excessive lock contention in shared pool
**Root Cause**:
```c
// core/hakmem_shared_pool.c:343
pthread_mutex_lock(&g_shared_pool.alloc_lock);  // ← Contention point?
```
**Hypothesis**:
- `shared_pool_acquire_slab()` called frequently (2,291 times / 50K iters)
- Lock held too long (metadata scans, dynamic array growth)
- Contention even in single-threaded workload (TLS drain threads?)
**Potential Solutions**:
1. **Lock-free fast path**: Per-class lock-free pop from free lists (Stage 1)
2. **Reduce lock scope**: Move metadata scans outside critical section
3. **Batch acquire**: Acquire multiple slabs per lock acquisition (see the sketch below)
4. **Per-class locks**: Replace global lock with per-class locks
**Expected Impact**: -50-80% reduction in futex time
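Of these, solution 3 (batch acquire) is the only one not sketched later in Step 2. A minimal, hypothetical illustration of the idea follows; the helper names, the `Slab` type, and the TLS stash are placeholders, not existing HAKMEM APIs:
```c
#include <pthread.h>
#include <stddef.h>

#define SP_ACQUIRE_BATCH 4

typedef struct Slab Slab;   /* opaque slab handle (placeholder) */

extern pthread_mutex_t g_shared_pool_alloc_lock;          /* stand-in for g_shared_pool.alloc_lock */
extern Slab* sp_acquire_one_locked(int class_idx);        /* stand-in for the existing locked acquire body */
extern void  tls_slab_cache_push(int class_idx, Slab* s); /* hypothetical per-thread stash */

/* Acquire several slabs under one lock acquisition; return the first,
 * park the rest in a TLS cache so subsequent acquires skip the lock. */
Slab* sp_acquire_batched(int class_idx) {
    Slab* first = NULL;
    pthread_mutex_lock(&g_shared_pool_alloc_lock);
    for (int i = 0; i < SP_ACQUIRE_BATCH; i++) {
        Slab* s = sp_acquire_one_locked(class_idx);
        if (!s) break;                  /* pool exhausted: stop early */
        if (!first) first = s;          /* caller gets the first slab */
        else tls_slab_cache_push(class_idx, s);
    }
    pthread_mutex_unlock(&g_shared_pool_alloc_lock);
    return first;                       /* NULL falls through to Stage 3 (mmap) */
}
```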
---
### Priority 3: Frontend Cache Miss Rate
**Impact**: Driving backend allocation frequency (2,291 acquires / 50K iters = 4.6%)
**Current Config**: fast_cap=32 (best performance)
**Evidence**: fast_cap scaling (8→16: +28%, 16→32: +13%)
**Hypothesis**:
- TLS cache capacity too small for working set (4,096 slots)
- Refill batch size suboptimal
- Specialize mask (0x0F) shows no benefit (<1% difference)
**Potential Solutions**:
1. **Increase fast_cap**: Test 64 / 128 (diminishing returns expected)
2. **Tune refill batch**: Current 64 (HAKMEM_TINY_REFILL_COUNT_HOT) → test 128 / 256
3. **Class-specific tuning**: Hot classes (C6, C7) get larger caches
**Expected Impact**: +10-20% throughput (backend call reduction)
---
### Priority 4: Remaining syscall Overhead (mmap/munmap/madvise/mincore)
**Impact**: 30.59% syscall time (3,357 mmap/munmap + 1,591 madvise + 1,574 mincore)
**Status**: Significantly improved vs pre-SP-SLOT (-48% mmap/munmap)
**Remaining Issues**:
1. **madvise (1,591 calls)**: Where are these coming from?
- Pool TLS arena (8-52KB)?
- Mid-Large allocator (broken)?
- Other internal structures?
2. **mincore (1,574 calls)**: Still present despite Phase 9 removal claim
- Source location unknown
- May be from other allocators or debug paths
**Action Required**: Trace source of madvise/mincore calls
---
## 5. Performance Evolution Timeline
### Historical Performance Progression
| Phase | Optimization | Throughput | vs Baseline | vs System |
|-------|--------------|------------|-------------|-----------|
| **Baseline** (Phase 8) | - | 563K ops/s | +0% | 1.1% |
| **Phase 9** (LRU + mincore removal) | Lazy deallocation | 9.71M ops/s | +1,625% | 18.7% |
| **Phase 10** (TLS/SFC tuning) | Frontend expansion | 9.89M ops/s | +1,657% | 19.0% |
| **Phase 11** (Prewarm) | Startup SS allocation | 9.38M ops/s | +1,566% | 18.1% |
| **Phase 12-A** (TLS SLL Drain) | Periodic drain | 6.1M ops/s | +984% | 11.8% |
| **Phase 12-B** (SP-SLOT Box) | Per-slot management | 1.30M ops/s | +131% | 2.5% |
| **Current (optimized ENV)** | fast_cap=32 | **5.2M ops/s** | **+824%** | **10.0%** |
**Note**: The discrepancy between Phase 12-B (1.30M) and Current (5.2M) is due to **ENV configuration**:
- Default: No ENV → 1.30M ops/s
- Optimized: `HAKMEM_TINY_FAST_CAP=32 + other flags` → 5.2M ops/s
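For reproducibility, the two configurations can be compared as below. Only `HAKMEM_TINY_FAST_CAP` is named in this report; the remaining optimized flags are not enumerated here, so this sketch shows only the documented one:
```bash
# Default ENV (~1.30M ops/s)
./bench_random_mixed_hakmem 200000 4096 1234567

# Optimized ENV (~5.2M ops/s with the full flag set; only the documented flag shown)
HAKMEM_TINY_FAST_CAP=32 ./bench_random_mixed_hakmem 200000 4096 1234567
```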
---
## 6. Working Set Sensitivity
**Test Results** (fast_cap=32, spec_mask=0):
| Cycles | WS | Throughput | vs ws=4096 |
|--------|-----|------------|------------|
| 200K | 4,096 | 5.2M ops/s | 100% (baseline) |
| 200K | 8,192 | 4.0M ops/s | -23% |
| 400K | 4,096 | 5.3M ops/s | +2% |
| 400K | 8,192 | 4.7M ops/s | -10% |
**Observation**: **23% performance drop** when working set doubles (4K→8K)
**Hypothesis**:
- Larger working set → more backend allocation calls
- TLS cache misses increase
- SuperSlab churn increases (more Stage 3 allocations)
**Implication**: Current frontend cache size (fast_cap=32) insufficient for large working sets.
---
## 7. Recommended Next Steps (Priority Order)
### Step 1: Fix Mid-Large Allocator (URGENT) 🔥
**Priority**: P0 (Blocking)
**Impact**: 97x gap with mimalloc
**Effort**: Medium
**Tasks**:
1. Investigate `hkm_ace_alloc` NULL returns
2. Check Pool TLS arena initialization
3. Verify threshold logic for 8-32KB allocations
4. Add debug logging to trace the allocation path (see the sketch below)
**Success Criteria**: Mid-Large throughput >1M ops/s (current: 0.24M)
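A minimal sketch of the Task 4 instrumentation, assuming a debug-only build guard (`HAKMEM_BUILD_RELEASE`) and a `void* hkm_ace_alloc(size_t)` signature; both are assumptions, and the wrapper name is illustrative:
```c
#include <stdio.h>
#include <stddef.h>

extern void* hkm_ace_alloc(size_t size);   /* assumed signature of the Mid-Large entry point */

/* Logs NULL returns on debug builds only, keeping fprintf out of release hot paths. */
static inline void* hkm_ace_alloc_traced(size_t size) {
    void* p = hkm_ace_alloc(size);
#if !HAKMEM_BUILD_RELEASE
    if (p == NULL) {
        fprintf(stderr, "[ALLOC] %zu bytes: hkm_ace_alloc returned NULL\n", size);
    }
#endif
    return p;
}
```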
---
### Step 2: Optimize Shared Pool Lock Contention
**Priority**: P1 (High)
**Impact**: 68% syscall time
**Effort**: Medium
**Options** (in order of risk):
**A) Lock-free Stage 1 (Low Risk)**:
```c
#include <stdatomic.h>   // C11 atomics for the lock-free LIFO

// Per-class atomic LIFO head for EMPTY slot reuse
// (FreeSlotEntry is assumed to carry a `next` pointer)
_Atomic(FreeSlotEntry*) g_free_list_heads[TINY_NUM_CLASSES];

// Lock-free pop (Stage 1 fast path)
FreeSlotEntry* sp_freelist_pop_lockfree(int class_idx) {
    FreeSlotEntry* head = atomic_load(&g_free_list_heads[class_idx]);
    while (head != NULL) {
        // On CAS failure, `head` is reloaded automatically, so the loop retries
        if (atomic_compare_exchange_weak(&g_free_list_heads[class_idx],
                                         &head, head->next)) {
            return head;
        }
    }
    return NULL; // Empty: fall back to the locked Stage 2/3 path
}
```
**Expected**: -50% futex overhead (Stage 1 hit rate: 4.6% → lock-free)
**B) Reduce Lock Scope (Medium Risk)**:
```c
// Scan metadata for a candidate slot outside the lock (read-only)
int candidate_slot = sp_meta_scan_unlocked();
pthread_mutex_lock(&g_shared_pool.alloc_lock);
if (sp_slot_try_claim(candidate_slot)) {  // Quick CAS-style claim under the lock
    // Success: slot is ours
}
pthread_mutex_unlock(&g_shared_pool.alloc_lock);
```
**Expected**: -30% futex overhead (reduce lock hold time)
**C) Per-Class Locks (High Risk)**:
```c
pthread_mutex_t g_class_locks[TINY_NUM_CLASSES]; // Replace global lock
```
**Expected**: -80% futex overhead (eliminate cross-class contention)
**Risk**: Complexity increase, potential deadlocks
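A minimal illustration of Option C, assuming the acquire body can be partitioned cleanly by class; `sp_acquire_for_class_locked()` and the `SlabSlot` type are placeholders:
```c
#include <pthread.h>

#define TINY_NUM_CLASSES 8                 /* assumed: size classes C0-C7 */

typedef struct SlabSlot SlabSlot;          /* opaque slot handle (placeholder) */

extern SlabSlot* sp_acquire_for_class_locked(int class_idx);  /* stand-in for the existing locked body */

static pthread_mutex_t g_class_locks[TINY_NUM_CLASSES] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Acquire serialized only against the same size class, so threads
 * allocating different classes no longer contend on a global lock. */
SlabSlot* sp_acquire_per_class(int class_idx) {
    pthread_mutex_lock(&g_class_locks[class_idx]);
    SlabSlot* slot = sp_acquire_for_class_locked(class_idx);
    pthread_mutex_unlock(&g_class_locks[class_idx]);
    return slot;
}
```
Note that Stage 2 deliberately reuses slots across classes (92.4% of acquisitions per section 3.2), so any cross-class path would need a fixed lock-ordering discipline; that is the deadlock risk flagged above.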
**Recommendation**: Start with **Option A** (lowest risk, measurable impact).
---
### Step 3: TLS Drain Interval Tuning (Low Risk)
**Priority**: P2 (Medium)
**Impact**: TBD (experimental)
**Effort**: Low (ENV-only A/B testing)
**Current**: 1,024 frees/class (`HAKMEM_TINY_SLL_DRAIN_INTERVAL=1024`)
**Experiment Matrix**:
| Interval | Expected Impact |
|----------|-----------------|
| 512 | -50% drain overhead, +syscalls (more frequent SS release) |
| 2,048 | +100% drain overhead, -syscalls (less frequent SS release) |
| 4,096 | +300% drain overhead, --syscalls (minimal SS release) |
**Metrics to Track**:
- Throughput (ops/s)
- mmap/munmap count (strace)
- TLS SLL drain frequency (debug log)
**Success Criteria**: Find optimal balance (throughput > 5.5M ops/s, syscalls < 3,000)
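A sketch of the A/B harness for this matrix, assuming the interval is controlled by the `HAKMEM_TINY_SLL_DRAIN_INTERVAL` environment variable noted above and that the benchmark takes (iterations, working-set, seed) as in Step 5:
```bash
#!/usr/bin/env bash
# Sweep the TLS drain interval and capture throughput plus syscall counts.
for interval in 512 1024 2048 4096; do
  echo "=== HAKMEM_TINY_SLL_DRAIN_INTERVAL=$interval ==="
  HAKMEM_TINY_SLL_DRAIN_INTERVAL=$interval \
    ./bench_random_mixed_hakmem 200000 4096 1234567
  # Separate run for syscall counts: strace itself perturbs throughput
  HAKMEM_TINY_SLL_DRAIN_INTERVAL=$interval \
    strace -c -f -o "strace_drain_${interval}.txt" \
    ./bench_random_mixed_hakmem 200000 4096 1234567
done
```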
---
### Step 4: Frontend Cache Tuning (Medium Risk)
**Priority**: P3 (Low)
**Impact**: +10-20% expected
**Effort**: Low (ENV-only A/B testing)
**Current Best**: fast_cap=32
**Experiment Matrix**:
| fast_cap | refill_count_hot | Expected Impact |
|----------|------------------|-----------------|
| 64 | 64 | +5-10% (diminishing returns) |
| 64 | 128 | +10-15% (better batch refill) |
| 128 | 128 | +15-20% (max cache size) |
**Metrics to Track**:
- Throughput (ops/s)
- Stage 3 frequency (debug log)
- Working set sensitivity (ws=8192 test)
**Success Criteria**: Throughput > 6M ops/s on ws=4096, <10% drop on ws=8192
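A sketch of the corresponding sweep, using the two tunables named in this report (`HAKMEM_TINY_FAST_CAP`, `HAKMEM_TINY_REFILL_COUNT_HOT`); the benchmark argument order (iterations, working-set, seed) is assumed from Step 5:
```bash
#!/usr/bin/env bash
# Frontend cache A/B matrix: fast_cap x refill batch size, at two working-set sizes.
for fast_cap in 32 64 128; do
  for refill in 64 128; do
    echo "=== fast_cap=$fast_cap refill=$refill ==="
    for ws in 4096 8192; do
      HAKMEM_TINY_FAST_CAP=$fast_cap \
      HAKMEM_TINY_REFILL_COUNT_HOT=$refill \
        ./bench_random_mixed_hakmem 200000 "$ws" 1234567
    done
  done
done
```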
---
### Step 5: Trace Remaining Syscalls (Investigation)
**Priority**: P4 (Low)
**Impact**: TBD
**Effort**: Low
**Questions**:
1. **madvise (1,591 calls)**: Where are these from?
- Add debug logging to all `madvise()` call sites
- Check Pool TLS arena, Mid-Large allocator
2. **mincore (1,574 calls)**: Why still present?
- Grep codebase for `mincore` calls
- Check if Phase 9 removal was incomplete
**Tools**:
```bash
# Trace madvise source
strace -e trace=madvise -k ./bench_random_mixed_hakmem 200000 4096 1234567
# Grep for mincore
grep -r "mincore" core/ --include="*.c" --include="*.h"
```
---
## 8. Risk Assessment
| Optimization | Impact | Effort | Risk | Recommendation |
|--------------|--------|--------|------|----------------|
| **Mid-Large Fix** | +++++ | ++ | Low | **DO NOW** 🔥 |
| **Lock-free Stage 1** | +++ | ++ | Low | **DO NEXT** ✅ |
| **Drain Interval Tune** | ++ | + | Low | **DO NEXT** ✅ |
| **Frontend Cache Tune** | ++ | + | Low | **DO AFTER** |
| **Reduce Lock Scope** | +++ | +++ | Med | Consider |
| **Per-Class Locks** | ++++ | ++++ | High | Avoid (complex) |
| **Trace Syscalls** | ? | + | Low | Background task |
---
## 9. Expected Performance Targets
### Short-Term (1-2 weeks)
| Metric | Current | Target | Strategy |
|--------|---------|--------|----------|
| **Mid-Large throughput** | 0.24M ops/s | **>1M ops/s** | Fix `hkm_ace_alloc` |
| **Tiny throughput (ws=4096)** | 5.2M ops/s | **>7M ops/s** | Lock-free + drain tune |
| **futex overhead** | 68% | **<30%** | Lock-free Stage 1 |
| **mmap+munmap** | 3,357 | **<2,500** | Drain interval tune |
### Medium-Term (1-2 months)
| Metric | Current | Target | Strategy |
|--------|---------|--------|----------|
| **Tiny throughput (ws=4096)** | 5.2M ops/s | **>15M ops/s** | Full optimization |
| **vs System malloc** | 10% | **>25%** | Close gap by 15pp |
| **vs mimalloc** | 9% | **>20%** | Close gap by 11pp |
### Long-Term (3-6 months)
| Metric | Current | Target | Strategy |
|--------|---------|--------|----------|
| **Tiny throughput** | 5.2M ops/s | **>40M ops/s** | Architectural overhaul |
| **vs System malloc** | 10% | **>70%** | Competitive performance |
| **vs mimalloc** | 9% | **>60%** | Industry-standard |
---
## 10. Lessons Learned
### 1. ENV Configuration is Critical
**Discovery**: Default (1.30M) vs Optimized (5.2M) = **+300% gap**
**Lesson**: Always document and automate optimal ENV settings
**Action**: Create `scripts/bench_optimal_env.sh` with best-known config
### 2. Mid-Large Allocator Broken
**Discovery**: 97x slower than mimalloc, NULL returns
**Lesson**: Integration testing insufficient (bench suite doesn't cover 8-32KB properly)
**Action**: Add `bench_mid_large_single_thread.sh` to CI suite
### 3. futex Overhead Unexpected
**Discovery**: 68% time in single-threaded workload
**Lesson**: Shared pool global lock is a bottleneck even without contention
**Action**: Profile lock hold time, consider lock-free paths
### 4. SP-SLOT Stage 2 Dominates
**Discovery**: 92.4% of allocations reuse UNUSED slots (Stage 2)
**Lesson**: Multi-class sharing >> per-class free lists
**Action**: Optimize Stage 2 path (lock-free metadata scan?)
---
## 11. Conclusion
**Current State**:
- ✅ SP-SLOT Box successfully reduced SuperSlab churn by 92%
- ✅ Syscall overhead reduced by 48% (mmap+munmap)
- ⚠️ Still 10x slower than System malloc (Tiny)
- 🔥 Mid-Large allocator critically broken (97x slower than mimalloc)
**Next Priorities**:
1. **Fix Mid-Large allocator** (P0, blocking)
2. **Optimize shared pool lock** (P1, 68% syscall time)
3. **Tune drain interval** (P2, low-risk improvement)
4. **Tune frontend cache** (P3, diminishing returns)
**Expected Impact** (short-term):
- Mid-Large: 0.24M → >1M ops/s (+316%)
- Tiny: 5.2M → >7M ops/s (+35%)
- futex overhead: 68% → <30% (-56%)
**Long-Term Vision**:
- Close gap to 70% of System malloc performance (40M ops/s target)
- Competitive with industry-standard allocators (mimalloc, jemalloc)
---
**Report Generated**: 2025-11-14
**Tool**: Claude Code
**Phase**: Post SP-SLOT Box Implementation
**Status**: ✅ Analysis Complete, Ready for Implementation