# Phase 1 Quick Wins - Executive Summary

**TL;DR:** The REFILL_COUNT optimization failed because we optimized the wrong thing. The real bottleneck is `superslab_refill` (28.56% of CPU), not refill frequency.

---

## The Numbers

| REFILL_COUNT | Throughput | L1d Miss Rate | Verdict |
|--------------|------------|---------------|---------|
| **32** | **4.19 M/s** | **12.88%** | ✅ **OPTIMAL** |
| 64 | 3.89 M/s | 14.12% | ❌ -7.2% |
| 128 | 2.68 M/s | 16.08% | ❌ -36% |

---

## Root Causes

### 1. superslab_refill Is the Bottleneck (28.56% CPU) ⭐⭐⭐⭐⭐

```
perf report (REFILL_COUNT=32):
  28.56%  superslab_refill   ← THIS IS THE PROBLEM
   3.10%  [kernel] (various)
  ...
```

**Impact:** Even if we eliminated ALL refill overhead, the maximum possible gain is 28.56%. In reality, we made it worse.

### 2. Cache Pollution from Large Batches ⭐⭐⭐⭐

```
REFILL_COUNT=32:  L1d miss rate = 12.88%
REFILL_COUNT=128: L1d miss rate = 16.08% (+25% worse!)
```

**Why:**
- 128 blocks × 128 bytes = 16 KB
- L1 cache = 32 KB total
- Batch + working set > L1 capacity
- **Result:** more cache misses, slower performance

### 3. Refill Frequency Is Already Low ⭐⭐⭐

**Larson benchmark characteristics:**
- FIFO pattern with 1024 chunks per thread
- High TLS freelist hit rate
- Refills are **rare**, not frequent

**Implication:** Reducing refill frequency has minimal impact when refills are already uncommon.

### 4. memset Is NOT in the Hot Path ⭐

**Search results:**

```bash
memset found in:
- hakmem_tiny_init.inc   (one-time init)
- hakmem_tiny_intel.inc  (debug ring init)
```

**Conclusion:** Removing memset would have **ZERO** impact on allocation performance.

---

## Why Task Teacher's +31% Projection Failed

**Expected:**

```
REFILL 32→128: reduce calls by 4x → +31% speedup
```

**Reality:**

```
REFILL 32→128: -36% slowdown
```

**Mistakes:**
1. ❌ Assumed refill is cheap (it's 28.56% of CPU)
2. ❌ Assumed refills are frequent (they're rare in Larson)
3. ❌ Ignored cache effects (L1d misses +25% worse)
4. ❌ Used a Larson-specific pattern (not generalizable)

---

## Immediate Actions

### ✅ DO THIS NOW

1. **Keep REFILL_COUNT=32** (optimal for Larson)
2. **Focus on superslab_refill optimization** (28.56% CPU → biggest win)
3. **Profile superslab_refill internals:**
   - Bitmap scanning
   - mmap syscalls
   - Metadata initialization

### ❌ DO NOT DO THIS

1. **DO NOT increase REFILL_COUNT to 64+** (causes cache pollution)
2. **DO NOT optimize memset** (not in the hot path; a waste of time)
3. **DO NOT trust Larson alone** (we need diverse benchmarks)

---

## Next Steps (Priority Order)

### 🔥 P0: superslab_refill Deep Dive (This Week)

**Hypothesis:** 28.56% of CPU in one function is unacceptable. Break it down:

```c
superslab_refill() {
    // Profile each step:
    // 1. Bitmap scan to find a free slab  ← how much time?
    // 2. mmap() for a new SuperSlab       ← how much time?
    // 3. Metadata initialization          ← how much time?
    // 4. Slab carving / freelist setup    ← how much time?
}
```

**Tools:**

```bash
perf record -e cycles -g --call-graph=dwarf -- ./larson_hakmem ...
perf report --stdio -g --no-children | grep superslab
```

**Expected outcome:** Find the sub-bottleneck; optimizing it should yield a 10-20% speedup.

---

### 🔥 P1: Cache-Aware Refill (Next Week)

**Goal:** Reduce the L1d miss rate from 12.88% to <10%.

**Approach:**
1. Limit batch size so it fits in L1 alongside the working set
   - Current: REFILL_COUNT=32 (4 KB for the 128 B class)
   - Test: REFILL_COUNT=16 (2 KB)
   - Hypothesis: smaller batches = fewer misses
2. Prefetching
   - Prefetch the next batch while using the current batch
   - Reduces the cache-miss penalty
3. Adaptive batch sizing
   - Small batches when the working set is large
   - Large batches when the working set is small

---

### 🔥 P2: Benchmark Diversity (Next 2 Weeks)

**Problem:** Larson is NOT representative.

**Larson characteristics:**
- FIFO allocation pattern
- Fixed working set (1024 chunks)
- Predictable sizes (8-128 B)
- High freelist hit rate

**Need to test:**
1. **Random allocation/free** (not FIFO)
2. **Bursty allocations** (malloc storms)
3. **Mixed lifetimes** (long-lived + short-lived)
4. **Variable sizes** (less predictable)

**Hypothesis:** Other patterns may have different bottlenecks (refill frequency might matter more).

---

### 🔥 P3: Fast-Path Simplification (Phase 6 Goal)

**Long-term vision:** Eliminate superslab_refill from the hot path.

**Approach:**
1. Background refill thread
   - Keep freelists pre-filled
   - Allocation never waits for superslab_refill
2. Lock-free slab exchange
   - Reduce atomic operations
   - Faster refill when needed
3. System tcache study
   - Understand why System malloc's fast path is 3-4 instructions
   - Adopt proven patterns

---

## Key Metrics to Track

### Performance
- **Throughput:** 4.19 M ops/s (Larson baseline)
- **superslab_refill CPU:** 28.56% → target <10%
- **L1d miss rate:** 12.88% → target <10%
- **IPC:** 1.93 → maintain or improve

### Health
- **Stability:** results should be consistent (±2%)
- **Memory usage:** monitor RSS growth
- **Fragmentation:** track over time

---

## Data-Driven Checklist

Before ANY optimization:

- [ ] Profile with `perf record -g`
- [ ] Identify the TOP bottleneck (>5% CPU)
- [ ] Verify with `perf stat` (cache, branches, IPC)
- [ ] Test with MULTIPLE benchmarks (not just Larson)
- [ ] Document baseline metrics
- [ ] A/B test changes (at least 3 runs each)
- [ ] Verify improvements are statistically significant

**Rule:** If perf doesn't show it, don't optimize it.

---

## Lessons Learned

1. **Profile first, optimize second**
   - Task Teacher's intuition was wrong
   - Data revealed superslab_refill as the real bottleneck
2. **Cache effects can reverse gains**
   - More batching ≠ always faster
   - L1 cache is precious (32 KB)
3. **Benchmarks lie**
   - Larson has special properties (FIFO, stable working set)
   - Real workloads may differ significantly
4. **Measure, don't guess**
   - The memset "optimization" would have been wasted effort
   - perf shows what actually matters

---

## Final Recommendation

**STOP** optimizing refill frequency. **START** optimizing superslab_refill.

The data is clear: superslab_refill is 28.56% of CPU time. That's where the wins are.

---

**Questions? See full report:** `PHASE1_REFILL_INVESTIGATION.md`