# Phase 1 Quick Wins - Executive Summary

**TL;DR:** The REFILL_COUNT optimization failed because we optimized the wrong thing. The real bottleneck is `superslab_refill` (28.56% of CPU time), not refill frequency.

---

## The Numbers

| REFILL_COUNT | Throughput | L1d Miss Rate | Verdict |
|--------------|------------|---------------|---------|
| **32** | **4.19 M ops/s** | **12.88%** | ✅ **OPTIMAL** |
| 64 | 3.89 M ops/s | 14.12% | ❌ -7.2% |
| 128 | 2.68 M ops/s | 16.08% | ❌ -36% |

---

## Root Causes

### 1. superslab_refill Is the Bottleneck (28.56% CPU) ⭐⭐⭐⭐⭐

```
perf report (REFILL_COUNT=32):
  28.56%  superslab_refill    ← THIS IS THE PROBLEM
   3.10%  [kernel] (various)
  ...
```

**Impact:** Even if we eliminated ALL refill overhead, the maximum possible gain is 28.56%. In reality, we made it worse.

### 2. Cache Pollution from Large Batches ⭐⭐⭐⭐

```
REFILL_COUNT=32:  L1d miss rate = 12.88%
REFILL_COUNT=128: L1d miss rate = 16.08% (+25% worse!)
```

**Why:**

- 128 blocks × 128 bytes = 16 KB per batch
- L1d cache = 32 KB total
- Batch + working set > L1 capacity
- **Result:** more cache misses, slower performance

### 3. Refill Frequency Already Low ⭐⭐⭐

**Larson benchmark characteristics:**

- FIFO pattern with 1024 chunks per thread
- High TLS freelist hit rate
- Refills are **rare**, not frequent

**Implication:** Reducing refill frequency has minimal impact when refills are already uncommon.

### 4. memset Is NOT in the Hot Path ⭐

**Search results:**

```
memset found only in:
  hakmem_tiny_init.inc   (one-time init)
  hakmem_tiny_intel.inc  (debug ring init)
```

**Conclusion:** memset removal would have **ZERO** impact on allocation performance.

---

## Why Task Teacher's +31% Projection Failed

**Expected:**

```
REFILL 32→128: reduce calls by 4x → +31% speedup
```

**Reality:**

```
REFILL 32→128: -36% slowdown
```

**Mistakes:**

1. ❌ Assumed refill is cheap (it's 28.56% of CPU)
2. ❌ Assumed refills are frequent (they're rare in Larson)
3. ❌ Ignored cache effects (L1d misses +25%)
4. ❌ Relied on a Larson-specific pattern (not generalizable)

---

## Immediate Actions

### ✅ DO THIS NOW

1. **Keep REFILL_COUNT=32** (optimal for Larson)
2. **Focus on superslab_refill optimization** (28.56% CPU → biggest win)
3. **Profile superslab_refill internals:**
   - Bitmap scanning
   - mmap syscalls
   - Metadata initialization

### ❌ DO NOT DO THIS

1. **DO NOT increase REFILL_COUNT to 64+** (causes cache pollution)
2. **DO NOT optimize memset** (not in the hot path; a waste of time)
3. **DO NOT trust Larson alone** (we need diverse benchmarks)

---

## Next Steps (Priority Order)

### 🔥 P0: superslab_refill Deep Dive (This Week)

**Hypothesis:** 28.56% of CPU in a single function is unacceptable. Break it down:

```c
superslab_refill() {
    /* Profile each step: */
    /* 1. Bitmap scan to find a free slab   ← how much time? */
    /* 2. mmap() for a new SuperSlab        ← how much time? */
    /* 3. Metadata initialization           ← how much time? */
    /* 4. Slab carving / freelist setup     ← how much time? */
}
```

**Tools:**

```bash
perf record -e cycles -g --call-graph=dwarf -- ./larson_hakmem ...
perf report --stdio -g --no-children | grep superslab
```

**Expected outcome:** identify the sub-bottleneck, then get a 10-20% speedup by optimizing it.

---

### 🔥 P1: Cache-Aware Refill (Next Week)

**Goal:** Reduce the L1d miss rate from 12.88% to below 10%.

**Approach:**

1. Limit batch size so batch + working set fit in L1d
   - Current: REFILL_COUNT=32 (4 KB for the 128 B class)
   - Test: REFILL_COUNT=16 (2 KB)
   - Hypothesis: smaller batches → fewer misses
2. Prefetching
   - Prefetch the next batch while using the current one
   - Reduces the cache-miss penalty
3. Adaptive batch sizing
   - Small batches when the working set is large
   - Large batches when the working set is small

---

### 🔥 P2: Benchmark Diversity (Next 2 Weeks)

**Problem:** Larson is NOT representative.

**Larson characteristics:**

- FIFO allocation pattern
- Fixed working set (1024 chunks)
- Predictable sizes (8-128 B)
- High freelist hit rate

**Need to test:**

1. **Random allocation/free** (not FIFO)
2. **Bursty allocations** (malloc storms)
3. **Mixed lifetimes** (long-lived + short-lived)
4. **Variable sizes** (less predictable)

**Hypothesis:** Other patterns may have different bottlenecks; refill frequency might matter more there.

---

### 🔥 P3: Fast Path Simplification (Phase 6 Goal)

**Long-term vision:** eliminate superslab_refill from the hot path.

**Approach:**

1. Background refill thread
   - Keep freelists pre-filled
   - Allocation never waits for superslab_refill
2. Lock-free slab exchange
   - Reduce atomic operations
   - Faster refill when it is needed
3. System tcache study
   - Understand why System malloc's fast path is 3-4 instructions
   - Adopt proven patterns

---

## Key Metrics to Track

### Performance

- **Throughput:** 4.19 M ops/s (Larson baseline)
- **superslab_refill CPU:** 28.56% → target <10%
- **L1d miss rate:** 12.88% → target <10%
- **IPC:** 1.93 → maintain or improve

### Health

- **Stability:** results should be consistent (±2%)
- **Memory usage:** monitor RSS growth
- **Fragmentation:** track over time

---

## Data-Driven Checklist

Before ANY optimization:

- [ ] Profile with `perf record -g`
- [ ] Identify the TOP bottleneck (>5% CPU)
- [ ] Verify with `perf stat` (cache, branches, IPC)
- [ ] Test with MULTIPLE benchmarks (not just Larson)
- [ ] Document baseline metrics
- [ ] A/B test changes (at least 3 runs each)
- [ ] Verify improvements are statistically significant

**Rule:** If perf doesn't show it, don't optimize it.

---

## Lessons Learned

1. **Profile first, optimize second**
   - Task Teacher's intuition was wrong
   - Data revealed superslab_refill as the real bottleneck
2. **Cache effects can reverse gains**
   - More batching ≠ always faster
   - L1 cache is precious (32 KB)
3. **Benchmarks lie**
   - Larson has special properties (FIFO, stable working set)
   - Real workloads may differ significantly
4. **Measure, don't guess**
   - The memset "optimization" would have been wasted effort
   - perf shows what actually matters

---

## Final Recommendation

**STOP** optimizing refill frequency.

**START** optimizing superslab_refill.

The data is clear: superslab_refill accounts for 28.56% of CPU time. That is where the wins are.

---

**Questions? See the full report:** `PHASE1_REFILL_INVESTIGATION.md`