# superslab_refill Bottleneck Analysis

**Function:** `superslab_refill()` in `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_free.inc:650-888`

**CPU Time:** 28.56% (perf report)

**Status:** 🔴 **CRITICAL BOTTLENECK**

---
## Function Complexity Analysis

### Code Statistics

- **Lines of code:** 238 lines (650-888)
- **Branches:** ~15 major decision points
- **Loops:** 4 nested loops
- **Atomic operations:** ~10+ atomic loads/stores
- **Function calls:** ~15 helper functions

**Complexity Score:** 🔥🔥🔥🔥🔥 (extremely complex for a "refill" operation)

---
## Path Analysis: What superslab_refill Does

### Path 1: Adopt from Publish/Subscribe (Lines 686-750) ⭐⭐⭐⭐

**Condition:** `g_ss_adopt_en == 1` (auto-enabled once remote frees are seen)

**Steps:**
1. Check the cooldown period (lines 688-694)
2. Call `ss_partial_adopt(class_idx)` (line 696)
3. **Loop 1:** Scan the adopted SS's slabs (lines 701-710)
   - Load remote counts atomically
   - Calculate the best score
4. Try to acquire the best slab atomically (line 714)
5. Drain its remote freelist (line 716)
6. Check that it is safe to bind (line 734)
7. Bind the TLS slab (line 736)

**Atomic operations:** 3-5 per slab × up to 32 slabs = **96-160 atomic ops**

**Cost estimate:** 🔥🔥🔥🔥 **HIGH** (multi-threaded workloads only)
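The scan-and-score shape of steps 3-4 can be sketched as follows. This is a minimal model, not the real implementation: it assumes a per-slab remote-free counter array like the `remote_counts[]` referenced elsewhere in this document, and the helper name is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SLABS_PER_SS 32

/* Hypothetical per-slab remote-free counters, standing in for the
 * real SuperSlab fields. */
static _Atomic uint32_t remote_counts[SLABS_PER_SS];

/* Pick the slab with the most remotely freed blocks.  Note the cost
 * profile: one atomic load per slab, so the scoring scan alone issues
 * up to 32 atomic loads before any CAS is attempted. */
static int best_adopt_candidate(void) {
    int best = -1;
    uint32_t best_score = 0;
    for (int s = 0; s < SLABS_PER_SS; s++) {
        uint32_t score = atomic_load_explicit(&remote_counts[s],
                                              memory_order_relaxed);
        if (score > best_score) { best_score = score; best = s; }
    }
    return best; /* -1 means nothing worth adopting */
}
```

Only the winner is then acquired with a CAS, which is why the per-slab loads dominate the atomic-op count above.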
---

### Path 2: Reuse Existing SS Freelist (Lines 753-792) ⭐⭐⭐⭐⭐

**Condition:** `tls->ss != NULL` and some slab has a freelist

**Steps:**
1. Get the slab capacity (line 756)
2. **Loop 2:** Scan all slabs (lines 757-792)
   - Check whether `slabs[i].freelist` exists (line 763)
   - Try to acquire the slab atomically (line 765)
   - Drain its remote freelist if needed (line 768)
   - Check that it is safe to bind (line 783)
   - Bind the TLS slab (line 785)

**Worst case:** Scan all 32 slabs and attempt an acquire on each.

**Atomic operations:** 1-3 per slab × 32 = **32-96 atomic ops**

**Cost estimate:** 🔥🔥🔥🔥🔥 **VERY HIGH** (most common path in Larson!)

**Why this is THE bottleneck:**
- This loop runs on EVERY refill
- Larson runs 4 threads with frequent allocations
- Each thread scans its own SS looking for a freelist
- The atomic operations cause cache-line ping-pong between threads

---
### Path 3: Use Virgin Slab (Lines 794-810) ⭐⭐⭐

**Condition:** `tls->ss->active_slabs < capacity`

**Steps:**
1. Call `superslab_find_free_slab(tls->ss)` (line 797)
   - **Bitmap scan** to find an unused slab
2. Call `superslab_init_slab()` (line 802)
   - Initialize metadata
   - Set up the freelist/bitmap
3. Bind the TLS slab (line 805)

**Cost estimate:** 🔥🔥🔥 **MEDIUM** (bitmap scan + init)
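With at most 32 slabs per SuperSlab, the "bitmap scan" in step 1 need not be a loop at all. A sketch of the idea, assuming a 32-bit occupancy word (the field layout and helper name are assumptions, not the actual `superslab_find_free_slab` implementation):

```c
#include <stdint.h>

/* Hypothetical occupancy bitmap: bit i set = slabs[i] already initialized.
 * Finding a virgin slab is then a single count-trailing-zeros over the
 * inverted word instead of a loop. */
static int find_free_slab(uint32_t used_bitmap) {
    uint32_t free_bits = ~used_bitmap;
    if (free_bits == 0) return -1;   /* all 32 slabs in use */
    return __builtin_ctz(free_bits); /* index of first virgin slab */
}
```

If the real scan is already of this form, Path 3's cost is dominated by `superslab_init_slab()`, not the find.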
---

### Path 4: Registry Adoption (Lines 812-843) ⭐⭐⭐⭐

**Condition:** `!tls->ss` (no SuperSlab yet)

**Steps:**
1. **Loop 3:** Scan the registry (lines 818-842)
   - Load each entry atomically (line 820)
   - Check the magic (line 823)
   - Check the size class (line 824)
   - **Loop 4:** Scan the slabs in that SS (lines 828-840)
     - Try to acquire (line 830)
     - Drain remote frees (line 832)
     - Check that it is safe to bind (line 833)

**Worst case:** Scan 256 registry entries × 32 slabs each = 8192 acquire attempts.

**Atomic operations:** **Thousands**

**Cost estimate:** 🔥🔥🔥🔥🔥 **CATASTROPHIC** (if hit)

---

### Path 5: Must-Adopt Gate (Lines 845-849) ⭐⭐

**Condition:** Runs before allocating a new SS.

**Steps:**
1. Call `tiny_must_adopt_gate(class_idx, tls)`
   - Attempts sticky/hot/bench/mailbox/registry adoption

**Cost estimate:** 🔥🔥 **LOW-MEDIUM** (fast-path optimization)

---
### Path 6: Allocate New SuperSlab (Lines 851-887) ⭐⭐⭐⭐⭐

**Condition:** All other paths failed.

**Steps:**
1. Call `superslab_allocate(class_idx)` (line 852)
   - **mmap() syscall** to allocate a 1MB SuperSlab
2. Initialize the first slab (line 876)
3. Bind the TLS slab (line 880)
4. Update refcounts (lines 882-885)

**Cost estimate:** 🔥🔥🔥🔥🔥 **CATASTROPHIC** (syscall!)

**Why this is expensive:**
- mmap() is a kernel syscall (~1000+ cycles)
- Page faults on first access
- TLB pressure
---

## Bottleneck Hypothesis

### Primary Suspects (in order of likelihood)

#### 1. Path 2: Freelist Scan Loop (Lines 757-792) 🥇

**Evidence:**
- Runs on EVERY refill
- Scans up to 32 slabs linearly
- Multiple atomic operations per slab
- Cache lines bounce between threads

**Why Larson hits this:**
- Larson does frequent alloc/free
- Freelists exist after the first warmup
- Every refill scans the same SS repeatedly

**Estimated CPU contribution:** **15-20% of total CPU**

---

#### 2. Atomic Operations (Throughout) 🥈

**Count:**
- Path 1: 96-160 atomic ops
- Path 2: 32-96 atomic ops
- Path 4: thousands of atomic ops

**Why expensive:**
- Each atomic op generates cache-coherency traffic
- 4 threads × frequent operations = contention
- AMD Ryzen (the test system) tends to have slower contended atomics than comparable Intel parts

**Estimated CPU contribution:** **5-8% of total CPU**

---

#### 3. Path 6: mmap() Syscalls 🥉

**Evidence:**
- OOM messages in the logs suggest Path 6 is hit occasionally
- Each mmap() costs ~1000 cycles minimum
- Page faults add roughly another ~1000 cycles each

**Frequency:**
- Larson runs for 2 seconds
- 4 threads × a high allocation rate = high turnover
- But: SuperSlabs are 1MB, so each one is reused for many allocations

**Estimated CPU contribution:** **2-5% of total CPU**

---

#### 4. Registry Scan (Path 4) ⚠️

**Evidence:**
- Only runs when `!tls->ss` (rare after warmup)
- But if hit, it scans 256 entries × 32 slabs = **massive**

**Estimated CPU contribution:** **0-3% of total CPU** (depends on hit rate)

---
## Optimization Opportunities

### 🔥 P0: Eliminate the Freelist Scan Loop (Path 2)

**Current:**
```c
for (int i = 0; i < tls_cap; i++) {
    if (tls->ss->slabs[i].freelist) {
        // Try to acquire, drain, bind...
    }
}
```

**Problem:**
- O(n) scan where n = 32 slabs
- Linear search on every refill
- Repeated checks of the same slabs

**Solutions:**

#### Option A: Freelist Bitmap (Best) ⭐⭐⭐⭐⭐
```c
// Add to the SuperSlab struct:
uint32_t freelist_bitmap; // bit i = 1 if slabs[i].freelist != NULL

// In superslab_refill:
uint32_t fl_bits = tls->ss->freelist_bitmap;
if (fl_bits) {
    int idx = __builtin_ctz(fl_bits); // Find first set bit (1-2 cycles!)
    // Try to acquire slab[idx]...
}
```

**Benefits:**
- O(1) find instead of an O(n) scan
- No atomic ops unless a freelist actually exists
- **Estimated speedup:** 10-15% total CPU

**Risks:**
- The bitmap must be maintained on every free/alloc
- Possible race conditions (use atomics, or accept false positives)
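The maintenance risk is manageable: each transition is a single atomic read-modify-write. A minimal sketch, assuming the `freelist_bitmap` field from Option A; the helper names are hypothetical:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the per-SuperSlab field from Option A. */
static _Atomic uint32_t freelist_bitmap;

/* First block freed into slab i: publish the bit with one
 * lock-prefixed OR. */
static void mark_slab_has_freelist(int i) {
    atomic_fetch_or_explicit(&freelist_bitmap, 1u << i,
                             memory_order_release);
}

/* Refill drained slab i's freelist: retire the bit.  A stale set bit
 * is only a false positive -- the refill re-checks the real freelist
 * pointer after acquiring the slab, so relaxed ordering suffices. */
static void mark_slab_freelist_empty(int i) {
    atomic_fetch_and_explicit(&freelist_bitmap, ~(1u << i),
                              memory_order_relaxed);
}
```

Treating the bitmap as a hint (false positives allowed, false negatives avoided) keeps the free path down to one extra atomic OR.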
---

#### Option B: Last-Known-Good Index ⭐⭐⭐
```c
// Add to TinyTLSSlab:
uint8_t last_freelist_idx;

// In superslab_refill:
int start = tls->last_freelist_idx;
for (int i = 0; i < tls_cap; i++) {
    int idx = (start + i) % tls_cap; // Round-robin from the last hit
    if (tls->ss->slabs[idx].freelist) {
        tls->last_freelist_idx = idx;
        // Try to acquire...
    }
}
```

**Benefits:**
- Likely to hit on the first try (temporal locality)
- No additional atomics
- **Estimated speedup:** 5-8% total CPU

**Risks:**
- Still O(n) in the worst case
- May not help if freelists are sparse

---

#### Option C: Intrusive Freelist of Slabs ⭐⭐⭐⭐
```c
// Add to SuperSlab:
int8_t first_freelist_slab; // -1 = none, else slab index
// Add to TinySlabMeta:
int8_t next_freelist_slab;  // Intrusive linked list

// In superslab_refill:
int idx = tls->ss->first_freelist_slab;
if (idx >= 0) {
    // Try to acquire slab[idx]...
}
```

**Benefits:**
- O(1) lookup
- No scanning at all
- **Estimated speedup:** 12-18% total CPU

**Risks:**
- Complex to maintain
- Intrusive-list management on every free
- Possible corruption if not done carefully
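To make the O(1) claim concrete, the list operations can be sketched single-threaded as below. This is deliberately simplified: in the real allocator these updates would need to happen under the slab owner's exclusive access (or via CAS on the head), and the helper names are hypothetical.

```c
#include <stdint.h>

#define SLABS_PER_SS 32

/* Hypothetical intrusive-list state, per Option C. */
static int8_t first_freelist_slab = -1;         /* list head, -1 = empty */
static int8_t next_freelist_slab[SLABS_PER_SS]; /* per-slab link field */

/* Called when slab i gains its first freed block: push onto the head. */
static void freelist_list_push(int8_t i) {
    next_freelist_slab[i] = first_freelist_slab;
    first_freelist_slab = i;
}

/* Called by refill: O(1) pop instead of a 32-slab scan. */
static int8_t freelist_list_pop(void) {
    int8_t i = first_freelist_slab;
    if (i >= 0) first_freelist_slab = next_freelist_slab[i];
    return i;
}
```

The "complex to maintain" risk is exactly the push side: every free must detect the empty-to-nonempty transition to know when to push.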
---

### 🔥 P1: Reduce Atomic Operations

**Current hotspots:**
- `slab_try_acquire()` - CAS operation
- `atomic_load_explicit(&remote_heads[s], ...)` - cache-coherency traffic
- `atomic_load_explicit(&remote_counts[s], ...)` - cache-coherency traffic

**Solutions:**

#### Option A: Batch Acquire Attempts ⭐⭐⭐
```c
// Instead of acquire → drain → release → retry, score all slabs with
// plain reads and pick the best BEFORE acquiring anything.
uint32_t scores[32];
for (int i = 0; i < tls_cap; i++) {
    scores[i] = tls->ss->slabs[i].freelist ? 1 : 0; // Plain loads, no atomics
}
int best = find_max_index(scores, tls_cap); // hypothetical helper: index of max score
// Now acquire only the best one
SlabHandle h = slab_try_acquire(tls->ss, best, self_tid);
```

**Benefits:**
- Reduces atomic ops from 32-96 to 1-3
- **Estimated speedup:** 3-5% total CPU

---

#### Option B: Relaxed Memory Ordering ⭐⭐
```c
// Change:
atomic_load_explicit(&remote_heads[s], memory_order_acquire)
// To:
atomic_load_explicit(&remote_heads[s], memory_order_relaxed)
```

**Benefits:**
- Cheaper than an acquire load (no ordering fence)
- Safe if we re-check with acquire semantics before binding

**Risks:**
- Requires careful analysis of race conditions
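The "re-check before binding" discipline corresponds to a relaxed peek followed by an explicit acquire fence only on the path that commits. A minimal sketch (the variable name mirrors the `remote_heads` hotspot above; treat it as an assumption):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for one entry of the remote_heads[] array. */
static _Atomic(void *) remote_head;

/* Cheap peek: a relaxed load tells us *whether* remote work exists.
 * Only when we decide to bind do we pay for the acquire fence, which
 * orders the subsequent reads of the freed blocks' contents against
 * the remote thread's release-store of the head. */
static void *peek_then_acquire(void) {
    void *h = atomic_load_explicit(&remote_head, memory_order_relaxed);
    if (h == NULL) return NULL; /* common case: no fence at all */
    atomic_thread_fence(memory_order_acquire);
    return h;
}
```

This keeps the frequent "nothing to drain" case fence-free while preserving the happens-before edge on the rare success path.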
---

### 🔥 P2: Optimize Path 6 (mmap)

**Solutions:**

#### Option A: SuperSlab Pool ⭐⭐⭐⭐
```c
// Pre-allocate a pool of SuperSlabs.
// NOTE: this sketch is single-threaded; a real pool needs a lock or an
// atomic head (e.g. a CAS pop), since multiple threads refill at once.
SuperSlab* g_ss_pool[128]; // Pre-mmap'd and ready
int g_ss_pool_head = 0;

// In superslab_allocate:
if (g_ss_pool_head > 0) {
    return g_ss_pool[--g_ss_pool_head]; // O(1), no syscall
}
// Fall back to mmap if the pool is empty
```

**Benefits:**
- Amortizes the mmap cost
- No syscalls in the hot path
- **Estimated speedup:** 2-4% total CPU

---

#### Option B: Background Refill Thread ⭐⭐⭐⭐⭐
```c
// Dedicated thread keeps the SS pool topped up.
// Same caveat as Option A: pool-head updates must be synchronized
// with consumers in a real implementation.
void* bg_refill_thread(void* arg) {
    (void)arg;
    while (1) {
        if (g_ss_pool_head < 64) {
            SuperSlab* ss = mmap(...); // elided: same flags as superslab_allocate
            g_ss_pool[g_ss_pool_head++] = ss;
        }
        usleep(1000); // Sleep 1ms
    }
    return NULL;
}
```

**Benefits:**
- Zero mmap cost in the allocation path
- **Estimated speedup:** 2-5% total CPU

**Risks:**
- Thread overhead
- Complexity
---

### 🔥 P3: Fast Path Bypass

**Idea:** Avoid superslab_refill entirely for hot classes.

#### Option A: TLS Freelist Pre-warming ⭐⭐⭐⭐
```c
// On thread init, pre-fill the TLS freelists for the hot classes
void thread_init() {
    for (int cls = 0; cls < 4; cls++) { // Hot classes
        sll_refill_batch_from_ss(cls, 128); // Fill to capacity
    }
}
```

**Benefits:**
- Reduces refill frequency
- **Estimated speedup:** 5-10% total CPU (indirect)

---

## Profiling TODO

To confirm these hypotheses, instrument `superslab_refill` with per-path timers:
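(The instrumentation relies on an `rdtsc()` helper that is not defined elsewhere in this file; on x86-64 with GCC/Clang it can be provided via the compiler intrinsic. Both the platform and the compiler are assumptions here.)

```c
#include <stdint.h>
#include <x86intrin.h>

/* Cycle counter for coarse per-path timing.  __rdtsc() is the
 * GCC/Clang intrinsic for the RDTSC instruction.  It is not
 * serializing, which is fine for attributing milliseconds of run
 * time to paths, but too noisy for timing single instructions. */
static inline uint64_t rdtsc(void) {
    return __rdtsc();
}
```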
```c
static SuperSlab* superslab_refill(int class_idx) {
    uint64_t t0 = rdtsc();

    uint64_t t_adopt = 0, t_freelist = 0, t_virgin = 0, t_mmap = 0;
    int path_taken = 0;
    SuperSlab* ss = NULL; // set by whichever path succeeds

    // Path 1: Adopt
    uint64_t t1 = rdtsc();
    if (g_ss_adopt_en) {
        // ... adopt logic (sets ss on success) ...
        if (adopted) { path_taken = 1; goto done; }
    }
    t_adopt = rdtsc() - t1;

    // Path 2: Freelist scan
    t1 = rdtsc();
    if (tls->ss) {
        for (int i = 0; i < tls_cap; i++) {
            // ... scan logic (sets ss on success) ...
            if (found) { path_taken = 2; goto done; }
        }
    }
    t_freelist = rdtsc() - t1;

    // Path 3: Virgin slab
    t1 = rdtsc();
    if (tls->ss && tls->ss->active_slabs < tls_cap) {
        // ... virgin logic (sets ss on success) ...
        if (found) { path_taken = 3; goto done; }
    }
    t_virgin = rdtsc() - t1;

    // Path 6: mmap
    t1 = rdtsc();
    ss = superslab_allocate(class_idx);
    t_mmap = rdtsc() - t1;
    path_taken = 6;

done:;
    uint64_t total = rdtsc() - t0;
    fprintf(stderr, "[REFILL] cls=%d path=%d total=%lu adopt=%lu freelist=%lu virgin=%lu mmap=%lu\n",
            class_idx, path_taken, total, t_adopt, t_freelist, t_virgin, t_mmap);
    return ss;
}
```

**Run:** (the awk below splits the `path=N` and `total=N` fields on `=` so the totals sum numerically)
```bash
./larson_hakmem ... 2>&1 | grep REFILL | \
  awk '{split($3, p, "="); split($4, t, "="); sum["path=" p[2]] += t[2]}
       END {for (k in sum) print k, sum[k]}' | sort -k2 -rn
```
**Expected output (shape):**
```
path=2 12500000000   ← Freelist scan dominates
path=6  3200000000   ← mmap is expensive but rare
path=3   500000000   ← Virgin slabs
path=1   100000000   ← Adopt (if enabled)
```

---
## Recommended Implementation Order

### Sprint 1 (This Week): Quick Wins
1. ✅ Profile superslab_refill with rdtsc instrumentation
2. ✅ Confirm Path 2 (freelist scan) is dominant
3. ✅ Implement Option A: freelist bitmap
4. ✅ A/B test: expect +10-15% throughput

### Sprint 2 (Next Week): Atomic Optimization
1. ✅ Implement relaxed memory ordering where safe
2. ✅ Batch acquire attempts (reduce atomics)
3. ✅ A/B test: expect +3-5% throughput

### Sprint 3 (Week 3): Path 6 Optimization
1. ✅ Implement the SuperSlab pool
2. ✅ Optional: background refill thread
3. ✅ A/B test: expect +2-4% throughput

### Total Expected Gain
```
Baseline:        4.19 M ops/s
After Sprint 1:  4.61-4.82 M ops/s (+10-15%)
After Sprint 2:  4.78-5.07 M ops/s (+14-21%)
After Sprint 3:  4.86-5.28 M ops/s (+16-26%)
```

**Conservative estimate:** **+15-20% total** from superslab_refill optimization alone.
Combined with other optimizations (cache tuning, etc.), **System malloc parity** (135 M ops/s) is still distant, but Tiny can plausibly approach **60-70 M ops/s** (roughly 44-52% of System).

---
## Conclusion

**superslab_refill is a 238-line monster** with:
- 15+ branches
- 4 nested loops
- 100+ atomic operations (worst case)
- Syscall overhead (mmap)

**The #1 sub-bottleneck is Path 2 (the freelist scan):**
- O(n) scan of 32 slabs
- Runs on EVERY refill
- Multiple atomics per slab
- **Est. 15-20% of total CPU time**

**Immediate action:** Implement the freelist bitmap for O(1) slab discovery.

**Long-term vision:** Eliminate superslab_refill from the hot path entirely (background refill, pre-warmed slabs).

---

**Next:** See `PHASE1_EXECUTIVE_SUMMARY.md` for the action plan.