# Phase 1 Quick Wins - Executive Summary

**TL;DR:** The REFILL_COUNT optimization failed because we optimized the wrong thing. The real bottleneck is `superslab_refill` (28.56% of CPU time), not refill frequency.

---

## The Numbers

| REFILL_COUNT | Throughput | L1d Miss Rate | Verdict |
|--------------|------------|---------------|---------|
| **32** | **4.19 M ops/s** | **12.88%** | ✅ **OPTIMAL** |
| 64 | 3.89 M ops/s | 14.12% | ❌ -7.2% |
| 128 | 2.68 M ops/s | 16.08% | ❌ -36% |

---

## Root Causes

### 1. superslab_refill Is the Bottleneck (28.56% CPU) ⭐⭐⭐⭐⭐

```
perf report (REFILL_COUNT=32):
  28.56%  superslab_refill    ← THIS IS THE PROBLEM
   3.10%  [kernel] (various)
  ...
```

**Impact:** Even if we eliminated ALL refill overhead, the maximum possible gain is 28.56%. In reality, we made it worse.

### 2. Cache Pollution from Large Batches ⭐⭐⭐⭐

```
REFILL_COUNT=32:  L1d miss rate = 12.88%
REFILL_COUNT=128: L1d miss rate = 16.08% (+25% worse!)
```

**Why:**

- 128 blocks × 128 bytes = 16 KB per batch
- L1d cache = 32 KB total
- Batch + working set > L1 capacity
- **Result:** more cache misses, slower performance

### 3. Refill Frequency Already Low ⭐⭐⭐

**Larson benchmark characteristics:**

- FIFO pattern with 1024 chunks per thread
- High TLS freelist hit rate
- Refills are **rare**, not frequent

**Implication:** Reducing refill frequency has minimal impact when refills are already uncommon.

### 4. memset Is NOT in the Hot Path ⭐

**Search results:**

```
memset found only in:
  hakmem_tiny_init.inc   (one-time init)
  hakmem_tiny_intel.inc  (debug ring init)
```

**Conclusion:** memset removal would have **ZERO** impact on allocation performance.

---

## Why Task Teacher's +31% Projection Failed

**Expected:**

```
REFILL 32→128: reduce calls by 4x → +31% speedup
```

**Reality:**

```
REFILL 32→128: -36% slowdown
```

**Mistakes:**

1. ❌ Assumed refill is cheap (it's 28.56% of CPU)
2. ❌ Assumed refills are frequent (they're rare in Larson)
3. ❌ Ignored cache effects (L1d misses +25%)
4. ❌ Relied on a Larson-specific pattern (not generalizable)

---

## Immediate Actions

### ✅ DO THIS NOW

1. **Keep REFILL_COUNT=32** (optimal for Larson)
2. **Focus on superslab_refill optimization** (28.56% CPU → biggest win)
3. **Profile superslab_refill internals:**
   - Bitmap scanning
   - mmap syscalls
   - Metadata initialization

### ❌ DO NOT DO THIS

1. **DO NOT increase REFILL_COUNT to 64+** (causes cache pollution)
2. **DO NOT optimize memset** (not in the hot path; a waste of time)
3. **DO NOT trust Larson alone** (we need diverse benchmarks)

---

## Next Steps (Priority Order)

### 🔥 P0: superslab_refill Deep Dive (This Week)

**Hypothesis:** 28.56% of CPU in a single function is unacceptable. Break it down:

```c
superslab_refill() {
    /* Profile each step: */
    /* 1. Bitmap scan to find a free slab   ← how much time? */
    /* 2. mmap() for a new SuperSlab        ← how much time? */
    /* 3. Metadata initialization           ← how much time? */
    /* 4. Slab carving / freelist setup     ← how much time? */
}
```

**Tools:**

```bash
perf record -e cycles -g --call-graph=dwarf -- ./larson_hakmem ...
perf report --stdio -g --no-children | grep superslab
```

**Expected outcome:** identify the sub-bottleneck, then get a 10-20% speedup by optimizing it.

---

### 🔥 P1: Cache-Aware Refill (Next Week)

**Goal:** Reduce the L1d miss rate from 12.88% to below 10%.

**Approach:**

1. Limit batch size so batch + working set fit in L1d
   - Current: REFILL_COUNT=32 (4 KB for the 128 B class)
   - Test: REFILL_COUNT=16 (2 KB)
   - Hypothesis: smaller batches → fewer misses
2. Prefetching
   - Prefetch the next batch while using the current one
   - Reduces the cache-miss penalty
3. Adaptive batch sizing
   - Small batches when the working set is large
   - Large batches when the working set is small

---

### 🔥 P2: Benchmark Diversity (Next 2 Weeks)

**Problem:** Larson is NOT representative.

**Larson characteristics:**

- FIFO allocation pattern
- Fixed working set (1024 chunks)
- Predictable sizes (8-128 B)
- High freelist hit rate

**Need to test:**

1. **Random allocation/free** (not FIFO)
2. **Bursty allocations** (malloc storms)
3. **Mixed lifetimes** (long-lived + short-lived)
4. **Variable sizes** (less predictable)

**Hypothesis:** Other patterns may have different bottlenecks; refill frequency might matter more there.

---

### 🔥 P3: Fast Path Simplification (Phase 6 Goal)

**Long-term vision:** eliminate superslab_refill from the hot path.

**Approach:**

1. Background refill thread
   - Keep freelists pre-filled
   - Allocation never waits for superslab_refill
2. Lock-free slab exchange
   - Reduce atomic operations
   - Faster refill when it is needed
3. System tcache study
   - Understand why System malloc's fast path is 3-4 instructions
   - Adopt proven patterns

---

## Key Metrics to Track

### Performance

- **Throughput:** 4.19 M ops/s (Larson baseline)
- **superslab_refill CPU:** 28.56% → target <10%
- **L1d miss rate:** 12.88% → target <10%
- **IPC:** 1.93 → maintain or improve

### Health

- **Stability:** results should be consistent (±2%)
- **Memory usage:** monitor RSS growth
- **Fragmentation:** track over time

---

## Data-Driven Checklist

Before ANY optimization:

- [ ] Profile with `perf record -g`
- [ ] Identify the TOP bottleneck (>5% CPU)
- [ ] Verify with `perf stat` (cache, branches, IPC)
- [ ] Test with MULTIPLE benchmarks (not just Larson)
- [ ] Document baseline metrics
- [ ] A/B test changes (at least 3 runs each)
- [ ] Verify improvements are statistically significant

**Rule:** If perf doesn't show it, don't optimize it.

---

## Lessons Learned

1. **Profile first, optimize second**
   - Task Teacher's intuition was wrong
   - Data revealed superslab_refill as the real bottleneck
2. **Cache effects can reverse gains**
   - More batching ≠ always faster
   - L1 cache is precious (32 KB)
3. **Benchmarks lie**
   - Larson has special properties (FIFO, stable working set)
   - Real workloads may differ significantly
4. **Measure, don't guess**
   - The memset "optimization" would have been wasted effort
   - perf shows what actually matters

---

## Final Recommendation

**STOP** optimizing refill frequency.

**START** optimizing superslab_refill.

The data is clear: superslab_refill accounts for 28.56% of CPU time. That is where the wins are.

---

**Questions? See the full report:** `PHASE1_REFILL_INVESTIGATION.md`