# superslab_refill Bottleneck Analysis

**Function:** `superslab_refill()` in `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_free.inc:650-888`
**CPU Time:** 28.56% (perf report)
**Status:** 🔴 **CRITICAL BOTTLENECK**

---

## Function Complexity Analysis

### Code Statistics

- **Lines of code:** 238 lines (650-888)
- **Branches:** ~15 major decision points
- **Loops:** 4 nested loops
- **Atomic operations:** ~10+ atomic loads/stores
- **Function calls:** ~15 helper functions

**Complexity Score:** 🔥🔥🔥🔥🔥 (Extremely complex for a "refill" operation)

---

## Path Analysis: What superslab_refill Does

### Path 1: Adopt from Publish/Subscribe (Lines 686-750) ⭐⭐⭐⭐

**Condition:** `g_ss_adopt_en == 1` (auto-enabled if remote frees are seen)

**Steps:**
1. Check the cooldown period (lines 688-694)
2. Call `ss_partial_adopt(class_idx)` (line 696)
3. **Loop 1:** Scan adopted SS slabs (lines 701-710)
   - Load remote counts atomically
   - Calculate the best score
4. Try to acquire the best slab atomically (line 714)
5. Drain the remote freelist (line 716)
6. Check whether it is safe to bind (line 734)
7. Bind the TLS slab (line 736)

**Atomic operations:** 3-5 per slab × up to 32 slabs = **96-160 atomic ops**

**Cost estimate:** 🔥🔥🔥🔥 **HIGH** (multi-threaded workloads only)

---

### Path 2: Reuse Existing SS Freelist (Lines 753-792) ⭐⭐⭐⭐⭐

**Condition:** `tls->ss != NULL` and a slab has a freelist

**Steps:**
1. Get the slab capacity (line 756)
2. **Loop 2:** Scan all slabs (lines 757-792)
   - Check whether `slabs[i].freelist` exists (line 763)
   - Try to acquire the slab atomically (line 765)
   - Drain the remote freelist if needed (line 768)
   - Check safe to bind (line 783)
   - Bind the TLS slab (line 785)

**Worst case:** Scan all 32 slabs, attempting an acquire on each

**Atomic operations:** 1-3 per slab × 32 = **32-96 atomic ops**

**Cost estimate:** 🔥🔥🔥🔥🔥 **VERY HIGH** (most common path in Larson!)
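As a stand-alone illustration of Loop 2's shape (the struct and helper names below are simplified stand-ins, not the actual definitions in `hakmem_tiny_free.inc`), the scan reduces to a linear walk over the slab array:

```c
#include <assert.h>
#include <stddef.h>

#define SLABS_PER_SS 32

/* Simplified stand-ins for the real slab metadata. */
typedef struct { void *freelist; } SlabStub;
typedef struct { SlabStub slabs[SLABS_PER_SS]; } SuperSlabStub;

/* Loop 2 in miniature: O(n) walk to find the first slab with a freelist.
 * The real loop additionally performs an atomic try-acquire, a remote
 * drain, and a safe-to-bind check on each candidate, which is where the
 * per-slab atomic cost comes from. */
static int find_slab_with_freelist(const SuperSlabStub *ss, int cap) {
    for (int i = 0; i < cap; i++) {
        if (ss->slabs[i].freelist != NULL) return i;
    }
    return -1;  /* no slab currently has a freelist */
}
```

Even in this stripped-down form, the worst case touches all 32 entries; the atomics in the real loop multiply that cost.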
**Why this is THE bottleneck:**
- This loop runs on EVERY refill
- Larson has 4 threads × frequent allocations
- Each thread scans its own SS trying to find a freelist
- Atomic operations cause cache-line ping-pong between threads

---

### Path 3: Use Virgin Slab (Lines 794-810) ⭐⭐⭐

**Condition:** `tls->ss->active_slabs < capacity`

**Steps:**
1. Call `superslab_find_free_slab(tls->ss)` (line 797)
   - **Bitmap scan** to find an unused slab
2. Call `superslab_init_slab()` (line 802)
   - Initialize metadata
   - Set up freelist/bitmap
3. Bind the TLS slab (line 805)

**Cost estimate:** 🔥🔥🔥 **MEDIUM** (bitmap scan + init)

---

### Path 4: Registry Adoption (Lines 812-843) ⭐⭐⭐⭐

**Condition:** `!tls->ss` (no SuperSlab yet)

**Steps:**
1. **Loop 3:** Scan the registry (lines 818-842)
   - Load entry atomically (line 820)
   - Check magic (line 823)
   - Check size class (line 824)
   - **Loop 4:** Scan slabs in the SS (lines 828-840)
     - Try acquire (line 830)
     - Drain remote (line 832)
     - Check safe to bind (line 833)

**Worst case:** Scan 256 registry entries × 32 slabs each

**Atomic operations:** **Thousands**

**Cost estimate:** 🔥🔥🔥🔥🔥 **CATASTROPHIC** (if hit)

---

### Path 5: Must-Adopt Gate (Lines 845-849) ⭐⭐

**Condition:** Before allocating a new SS

**Steps:**
1. Call `tiny_must_adopt_gate(class_idx, tls)`
   - Attempts sticky/hot/bench/mailbox/registry adoption

**Cost estimate:** 🔥🔥 **LOW-MEDIUM** (fast-path optimization)

---

### Path 6: Allocate New SuperSlab (Lines 851-887) ⭐⭐⭐⭐⭐

**Condition:** All other paths failed

**Steps:**
1. Call `superslab_allocate(class_idx)` (line 852)
   - **mmap() syscall** to allocate a 1MB SuperSlab
2. Initialize the first slab (line 876)
3. Bind the TLS slab (line 880)
4. Update refcounts (lines 882-885)

**Cost estimate:** 🔥🔥🔥🔥🔥 **CATASTROPHIC** (syscall!)

**Why this is expensive:**
- mmap() is a kernel syscall (~1000+ cycles)
- Page fault on first access
- TLB pressure

---

## Bottleneck Hypothesis

### Primary Suspects (in order of likelihood)

#### 1. Path 2: Freelist Scan Loop (Lines 757-792) 🥇

**Evidence:**
- Runs on EVERY refill
- Scans up to 32 slabs linearly
- Multiple atomic operations per slab
- Cache-line bouncing between threads

**Why Larson hits this:**
- Larson does frequent alloc/free
- Freelists exist after the first warmup
- Every refill scans the same SS repeatedly

**Estimated CPU contribution:** **15-20% of total CPU**

---

#### 2. Atomic Operations (Throughout) 🥈

**Count:**
- Path 1: 96-160 atomic ops
- Path 2: 32-96 atomic ops
- Path 4: thousands of atomic ops

**Why expensive:**
- Each atomic op = cache-coherency traffic
- 4 threads × frequent operations = contention
- AMD Ryzen (the test system) has slower atomics than Intel

**Estimated CPU contribution:** **5-8% of total CPU**

---

#### 3. Path 6: mmap() Syscalls 🥉

**Evidence:**
- OOM messages in the logs suggest Path 6 is hit occasionally
- Each mmap() costs ~1000 cycles minimum
- Page faults add another ~1000 cycles

**Frequency:**
- Larson runs for 2 seconds
- 4 threads × allocation rate = high turnover
- But: SuperSlabs are 1MB (reusable for many allocations)

**Estimated CPU contribution:** **2-5% of total CPU**

---

#### 4. Registry Scan (Path 4) ⚠️

**Evidence:**
- Only runs if `!tls->ss` (rare after warmup)
- But: if hit, it scans 256 entries × 32 slabs = **massive**

**Estimated CPU contribution:** **0-3% of total CPU** (depends on hit rate)

---

## Optimization Opportunities

### 🔥 P0: Eliminate Freelist Scan Loop (Path 2)

**Current:**
```c
for (int i = 0; i < tls_cap; i++) {
    if (tls->ss->slabs[i].freelist) {
        // Try to acquire, drain, bind...
    }
}
```

**Problem:**
- O(n) scan where n = 32 slabs
- Linear search on every refill
- Repeated checks of the same slabs

**Solutions:**

#### Option A: Freelist Bitmap (Best) ⭐⭐⭐⭐⭐
```c
// Add to SuperSlab struct:
uint32_t freelist_bitmap;  // bit i = 1 if slabs[i].freelist != NULL

// In superslab_refill:
uint32_t fl_bits = tls->ss->freelist_bitmap;
if (fl_bits) {
    int idx = __builtin_ctz(fl_bits);  // Find first set bit (1-2 cycles!)
    // Try to acquire slab[idx]...
}
```

**Benefits:**
- O(1) find instead of O(n) scan
- No atomic ops unless a freelist exists
- **Estimated speedup:** 10-15% total CPU

**Risks:**
- Need to maintain the bitmap on free/alloc
- Possible race conditions (can use atomics or accept false positives)

---

#### Option B: Last-Known-Good Index ⭐⭐⭐
```c
// Add to TinyTLSSlab:
uint8_t last_freelist_idx;

// In superslab_refill:
int start = tls->last_freelist_idx;
for (int i = 0; i < tls_cap; i++) {
    int idx = (start + i) % tls_cap;  // Round-robin
    if (tls->ss->slabs[idx].freelist) {
        tls->last_freelist_idx = idx;
        // Try to acquire...
    }
}
```

**Benefits:**
- Likely to hit on the first try (temporal locality)
- No additional atomics
- **Estimated speedup:** 5-8% total CPU

**Risks:**
- Still O(n) worst case
- May not help if freelists are sparse

---

#### Option C: Intrusive Freelist of Slabs ⭐⭐⭐⭐
```c
// Add to SuperSlab:
int8_t first_freelist_slab;  // -1 = none, else index

// Add to TinySlabMeta:
int8_t next_freelist_slab;  // Intrusive linked list

// In superslab_refill:
int idx = tls->ss->first_freelist_slab;
if (idx >= 0) {
    // Try to acquire slab[idx]...
}
```

**Benefits:**
- O(1) lookup
- No scanning
- **Estimated speedup:** 12-18% total CPU

**Risks:**
- Complex to maintain
- Intrusive list management on every free
- Possible corruption if not careful

---

### 🔥 P1: Reduce Atomic Operations

**Current hotspots:**
- `slab_try_acquire()` - CAS operation
- `atomic_load_explicit(&remote_heads[s], ...)` - cache coherency
- `atomic_load_explicit(&remote_counts[s], ...)` - cache coherency

**Solutions:**

#### Option A: Batch Acquire Attempts ⭐⭐⭐
```c
// Instead of acquire → drain → release → retry,
// try multiple slabs and pick the best BEFORE acquiring
uint32_t scores[32];
for (int i = 0; i < tls_cap; i++) {
    scores[i] = tls->ss->slabs[i].freelist ? 1 : 0;  // No atomics!
}
int best = find_max_index(scores);
// Now acquire only the best one
SlabHandle h = slab_try_acquire(tls->ss, best, self_tid);
```

**Benefits:**
- Reduce atomic ops from 32-96 to 1-3
- **Estimated speedup:** 3-5% total CPU

---

#### Option B: Relaxed Memory Ordering ⭐⭐
```c
// Change:
atomic_load_explicit(&remote_heads[s], memory_order_acquire)
// To:
atomic_load_explicit(&remote_heads[s], memory_order_relaxed)
```

**Benefits:**
- Cheaper than acquire (no fence)
- Safe if we re-check before binding

**Risks:**
- Requires careful analysis of race conditions

---

### 🔥 P2: Optimize Path 6 (mmap)

**Solutions:**

#### Option A: SuperSlab Pool / Freelancer ⭐⭐⭐⭐
```c
// Pre-allocate a pool of SuperSlabs
SuperSlab* g_ss_pool[128];  // Pre-mmap'd and ready
int g_ss_pool_head = 0;

// In superslab_allocate:
if (g_ss_pool_head > 0) {
    return g_ss_pool[--g_ss_pool_head];  // O(1)!
}
// Fallback to mmap if the pool is empty
```

**Benefits:**
- Amortize the mmap cost
- No syscalls in the hot path
- **Estimated speedup:** 2-4% total CPU

---

#### Option B: Background Refill Thread ⭐⭐⭐⭐⭐
```c
// Dedicated thread to refill the SS pool
void* bg_refill_thread(void* arg) {
    while (1) {
        if (g_ss_pool_head < 64) {
            SuperSlab* ss = mmap(...);
            g_ss_pool[g_ss_pool_head++] = ss;
        }
        usleep(1000);  // Sleep 1ms
    }
}
```

**Benefits:**
- ZERO mmap cost in the allocation path
- **Estimated speedup:** 2-5% total CPU

**Risks:**
- Thread overhead
- Complexity

---

### 🔥 P3: Fast Path Bypass

**Idea:** Avoid superslab_refill entirely for hot classes

#### Option A: TLS Freelist Pre-warming ⭐⭐⭐⭐
```c
// On thread init, pre-fill the TLS freelists
void thread_init() {
    for (int cls = 0; cls < 4; cls++) {      // Hot classes
        sll_refill_batch_from_ss(cls, 128);  // Fill to capacity
    }
}
```

**Benefits:**
- Reduces refill frequency
- **Estimated speedup:** 5-10% total CPU (indirect)

---

## Profiling TODO

To confirm the hypotheses, instrument superslab_refill:

```c
static SuperSlab* superslab_refill(int class_idx) {
    uint64_t t0 = rdtsc();
    uint64_t t_adopt = 0, t_freelist = 0, t_virgin = 0, t_mmap = 0;
    int path_taken = 0;
    SuperSlab* ss = NULL;  // declared up front so the gotos don't skip it
                           // (each path sets ss before jumping to done)

    // Path 1: Adopt
    uint64_t t1 = rdtsc();
    if (g_ss_adopt_en) {
        // ... adopt logic ...
        if (adopted) { path_taken = 1; goto done; }
    }
    t_adopt = rdtsc() - t1;

    // Path 2: Freelist scan
    t1 = rdtsc();
    if (tls->ss) {
        for (int i = 0; i < tls_cap; i++) {
            // ... scan logic ...
            if (found) { path_taken = 2; goto done; }
        }
    }
    t_freelist = rdtsc() - t1;

    // Path 3: Virgin slab
    t1 = rdtsc();
    if (tls->ss && tls->ss->active_slabs < tls_cap) {
        // ... virgin logic ...
        if (found) { path_taken = 3; goto done; }
    }
    t_virgin = rdtsc() - t1;

    // Path 6: mmap
    t1 = rdtsc();
    ss = superslab_allocate(class_idx);
    t_mmap = rdtsc() - t1;
    path_taken = 6;

done:;
    uint64_t total = rdtsc() - t0;
    fprintf(stderr,
            "[REFILL] cls=%d path=%d total=%lu adopt=%lu freelist=%lu virgin=%lu mmap=%lu\n",
            class_idx, path_taken, total, t_adopt, t_freelist, t_virgin, t_mmap);
    return ss;
}
```

**Run** (the awk splits the `total=N` field so the sums are numeric, keyed by path):
```bash
./larson_hakmem ... 2>&1 | grep REFILL | \
  awk '{split($4, a, "="); sum[$3] += a[2]} END {for (p in sum) print p, sum[p]}' | \
  sort -k2 -rn
```

**Expected output:**
```
path=2 12500000000   ← Freelist scan dominates
path=6  3200000000   ← mmap is expensive but rare
path=3   500000000   ← Virgin slabs
path=1   100000000   ← Adopt (if enabled)
```

---

## Recommended Implementation Order

### Sprint 1 (This Week): Quick Wins
1. ✅ Profile superslab_refill with rdtsc instrumentation
2. ✅ Confirm Path 2 (freelist scan) is dominant
3. ✅ Implement Option A: Freelist Bitmap
4. ✅ A/B test: expect +10-15% throughput

### Sprint 2 (Next Week): Atomic Optimization
1. ✅ Implement relaxed memory ordering where safe
2. ✅ Batch acquire attempts (reduce atomics)
3. ✅ A/B test: expect +3-5% throughput

### Sprint 3 (Week 3): Path 6 Optimization
1. ✅ Implement the SuperSlab pool
2. ✅ Optional: background refill thread
3. ✅ A/B test: expect +2-4% throughput

### Total Expected Gain
```
Baseline:       4.19 M ops/s
After Sprint 1: 4.62-4.82 M ops/s (+10-15%)
After Sprint 2: 4.76-5.06 M ops/s (+14-21%)
After Sprint 3: 4.85-5.27 M ops/s (+16-26%)
```

**Conservative estimate:** **+15-20% total** from superslab_refill optimization alone.

Combined with other optimizations (cache tuning, etc.), **System malloc parity** (135 M ops/s) remains distant, but Tiny can approach **60-70 M ops/s** (40-50% of System).
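The freelist bitmap planned for Sprint 1 (P0 Option A) also needs maintenance hooks on the free and alloc paths, which the risk note for that option flags. A minimal sketch using C11 atomics follows; the struct and function names here are hypothetical, not the actual hakmem definitions:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bitmap field added to SuperSlab: bit i set means
 * slabs[i].freelist is (probably) non-empty. */
typedef struct { _Atomic uint32_t freelist_bitmap; } SSBitmapStub;

/* Free path: slab idx just gained a freelist entry. */
static void ss_mark_freelist(SSBitmapStub *ss, int idx) {
    atomic_fetch_or_explicit(&ss->freelist_bitmap, 1u << idx,
                             memory_order_release);
}

/* Alloc path: slab idx's freelist was drained to empty. */
static void ss_clear_freelist(SSBitmapStub *ss, int idx) {
    atomic_fetch_and_explicit(&ss->freelist_bitmap, ~(1u << idx),
                              memory_order_release);
}

/* Refill path: O(1) lookup replacing the 32-slab scan.
 * __builtin_ctz is GCC/Clang-specific. */
static int ss_first_freelist(SSBitmapStub *ss) {
    uint32_t bits = atomic_load_explicit(&ss->freelist_bitmap,
                                         memory_order_acquire);
    return bits ? __builtin_ctz(bits) : -1;
}
```

Because the bitmap is only a hint (a bit can go stale between the load and the acquire), the refill path must still verify the freelist after acquiring the slab, which matches the "accept false positives" option noted under P0's risks.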
---

## Conclusion

**superslab_refill is a 238-line monster** with:
- 15+ branches
- 4 nested loops
- 100+ atomic operations (worst case)
- Syscall overhead (mmap)

**The #1 sub-bottleneck is Path 2 (freelist scan):**
- O(n) scan of 32 slabs
- Runs on EVERY refill
- Multiple atomics per slab
- **Est. 15-20% of total CPU time**

**Immediate action:** Implement the freelist bitmap for O(1) slab discovery.

**Long-term vision:** Eliminate superslab_refill from the hot path entirely (background refill, pre-warmed slabs).

---

**Next:** See `PHASE1_EXECUTIVE_SUMMARY.md` for the action plan.