hakmem/PHASE1_EXECUTIVE_SUMMARY.md
Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00


# Phase 1 Quick Wins - Executive Summary
**TL;DR:** REFILL_COUNT optimization failed because we optimized the wrong thing. The real bottleneck is `superslab_refill` (28.56% CPU), not refill frequency.

---
## The Numbers
| REFILL_COUNT | Throughput | L1d Miss Rate | Verdict |
|--------------|------------|---------------|---------|
| **32** | **4.19 M/s** | **12.88%** | ✅ **OPTIMAL** |
| 64 | 3.89 M/s | 14.12% | ❌ -7.2% |
| 128 | 2.68 M/s | 16.08% | ❌ -36% |
---
## Root Causes
### 1. superslab_refill is the Bottleneck (28.56% CPU) ⭐⭐⭐⭐⭐
```
perf report (REFILL_COUNT=32):
28.56% superslab_refill ← THIS IS THE PROBLEM
3.10% [kernel] (various)
...
```
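**Impact:** Even if we eliminated ALL of the cost of `superslab_refill`, run time would drop by at most 28.56% (a throughput ceiling of roughly 1.4×). In practice, raising REFILL_COUNT made things worse.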
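
The ceiling on any gain here follows from Amdahl's law: removing a function that takes 28.56% of run time cannot speed the whole program up by more than 1 / (1 - 0.2856). A minimal sketch for checking such ceilings before committing to an optimization; the function names are illustrative and not part of hakmem:

```c
#include <assert.h>

/* Overall speedup when a fraction `f` of run time is made `k` times faster.
 * Amdahl's law: 1 / ((1 - f) + f / k). */
double amdahl_speedup(double f, double k)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(k > 0.0);
    return 1.0 / ((1.0 - f) + f / k);
}

/* Upper bound on the speedup: the fraction `f` is removed entirely. */
double amdahl_limit(double f)
{
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

With `f = 0.2856`, `amdahl_limit` returns about 1.40, and making the function twice as fast (`k = 2`) yields about 1.17 overall.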
### 2. Cache Pollution from Large Batches ⭐⭐⭐⭐
```
REFILL_COUNT=32: L1d miss rate = 12.88%
REFILL_COUNT=128: L1d miss rate = 16.08% (+25% relative, +3.2 points)
```
**Why:**
- 128 blocks × 128 bytes = 16 KB per batch
- L1 cache = 32 KB total, so a single batch takes half of it
- Batch + working set > L1 capacity
- **Result:** More cache misses, slower performance
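
The footprint arithmetic above can be reproduced for any size class and batch size. A small sketch, assuming the 32 KB cache size quoted in this report; the helper names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched by one refill batch: `count` blocks of `block_size` bytes. */
size_t refill_batch_bytes(size_t count, size_t block_size)
{
    return count * block_size;
}

/* Percentage of the cache that one batch would occupy (truncated). */
unsigned refill_batch_cache_percent(size_t count, size_t block_size,
                                    size_t cache_bytes)
{
    assert(cache_bytes > 0);
    return (unsigned)(refill_batch_bytes(count, block_size) * 100 / cache_bytes);
}
```

For the 128-byte class and a 32 KB cache, this gives 12% at REFILL_COUNT=32 (12.5 truncated by integer division) and 50% at REFILL_COUNT=128.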
### 3. Refill Frequency Already Low ⭐⭐⭐
**Larson benchmark characteristics:**
- FIFO pattern with 1024 chunks per thread
- High TLS freelist hit rate
- Refills are **rare**, not frequent
**Implication:** Reducing refill frequency has minimal impact when refills are already uncommon.
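
One way to confirm how rare refills are is to count fast-path hits and refill calls per thread. The sketch below uses hypothetical counter names; it is not hakmem's actual debug-counter code, and it assumes a C11 compiler for `_Thread_local`:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-thread counters; the names are illustrative only. */
typedef struct {
    uint64_t freelist_hits; /* allocations served from the TLS freelist */
    uint64_t refills;       /* allocations that had to take the refill path */
} alloc_counters;

static _Thread_local alloc_counters g_counters;

void count_freelist_hit(void) { g_counters.freelist_hits++; }
void count_refill(void)       { g_counters.refills++; }

/* Refills per one million allocations on the calling thread (0 if none yet). */
uint64_t refills_per_million(void)
{
    uint64_t total = g_counters.freelist_hits + g_counters.refills;
    if (total == 0)
        return 0;
    return g_counters.refills * 1000000u / total;
}
```

A low refills-per-million value alongside a high CPU share for the refill function would indicate that each individual refill is expensive, rather than that refills happen too often.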
### 4. memset is NOT in Hot Path ⭐
**Search results:**
```
memset found in:
- hakmem_tiny_init.inc (one-time init)
- hakmem_tiny_intel.inc (debug ring init)
```
**Conclusion:** memset removal would have **ZERO** impact on allocation performance.
---
## Why Task Teacher's +31% Projection Failed
**Expected:**
```
REFILL 32→128: reduce calls by 4x → +31% speedup
```
**Reality:**
```
REFILL 32→128: -36% slowdown
```
**Mistakes:**
1. ❌ Assumed each refill is cheap (in aggregate `superslab_refill` takes 28.56% of CPU)
2. ❌ Assumed refills are frequent (they're rare in Larson)
3. ❌ Ignored cache effects (L1d miss rate +25% relative)
4. ❌ Reasoned from a Larson-specific pattern (not generalizable)
---
## Immediate Actions
### ✅ DO THIS NOW
1. **Keep REFILL_COUNT=32** (optimal for Larson)
2. **Focus on superslab_refill optimization** (28.56% CPU → biggest win)
3. **Profile superslab_refill internals:**
- Bitmap scanning
- mmap syscalls
- Metadata initialization
### ❌ DO NOT DO THIS
1. **DO NOT increase REFILL_COUNT to 64+** (causes cache pollution)
2. **DO NOT optimize memset** (not in hot path, waste of time)
3. **DO NOT trust Larson alone** (need diverse benchmarks)
---
## Next Steps (Priority Order)
### 🔥 P0: `superslab_refill` Deep Dive (This Week)
**Hypothesis:** 28.56% CPU in one function is unacceptable. Break it down:
```c
/* Outline of the steps to time; not the real function body. */
superslab_refill() {
    // Profile each step:
    // 1. Bitmap scan to find free slab    ← How much time?
    // 2. mmap() for new SuperSlab         ← How much time?
    // 3. Metadata initialization          ← How much time?
    // 4. Slab carving / freelist setup    ← How much time?
}
```
**Tools:**
```bash
perf record -e cycles -g --call-graph=dwarf -- ./larson_hakmem ...
perf report --stdio -g --no-children | grep superslab
```
**Expected outcome:** Find sub-bottleneck, get 10-20% speedup by optimizing it.
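
If `perf` cannot separate the four steps (for example because they are inlined), manual stage timers are one fallback. The sketch below is an assumption about how that could look: the stage labels mirror the outline above, none of the names exist in hakmem, and the accumulator is a plain global, so it is only meaningful for single-threaded runs. Each `clock_gettime` call adds overhead of its own, so this suits coarse stages rather than the allocation fast path.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std=c11 */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stage labels matching the four steps in the outline. */
enum refill_stage {
    STAGE_BITMAP_SCAN,
    STAGE_MMAP,
    STAGE_METADATA_INIT,
    STAGE_CARVE,
    STAGE_COUNT
};

static uint64_t g_stage_ns[STAGE_COUNT]; /* accumulated nanoseconds per stage */

/* Monotonic timestamp in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Add an explicit duration to a stage. */
void stage_add_ns(enum refill_stage stage, uint64_t ns)
{
    assert((int)stage >= 0 && (int)stage < STAGE_COUNT);
    g_stage_ns[stage] += ns;
}

/* Add the time elapsed since `start_ns` to a stage. */
void stage_add_since(enum refill_stage stage, uint64_t start_ns)
{
    stage_add_ns(stage, now_ns() - start_ns);
}

/* Share of all recorded time spent in `stage`, in percent (truncated). */
unsigned stage_percent(enum refill_stage stage)
{
    assert((int)stage >= 0 && (int)stage < STAGE_COUNT);
    uint64_t total = 0;
    for (int i = 0; i < STAGE_COUNT; i++)
        total += g_stage_ns[i];
    return total ? (unsigned)(g_stage_ns[stage] * 100 / total) : 0;
}
```
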
---
### 🔥 P1: Cache-Aware Refill (Next Week)
**Goal:** Reduce L1d miss rate from 12.88% to <10%
**Approach:**
1. Limit batch size to fit in L1 with working set
- Current: REFILL_COUNT=32 (4KB for 128B class)
- Test: REFILL_COUNT=16 (2KB)
- Hypothesis: Smaller batches = fewer misses
2. Prefetching
- Prefetch next batch while using current batch
- Reduces cache miss penalty
3. Adaptive batch sizing
- Small batches when working set is large
- Large batches when working set is small
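
The batch-size limit and adaptive sizing ideas above could be combined roughly as follows. This is a sketch under stated assumptions: the caller can estimate the working-set size, the bounds 8 and 128 are placeholders to be tuned from measurements, and the function names are hypothetical. `__builtin_prefetch` is a GCC/Clang builtin, so the prefetch helper is compiler-specific.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bounds; tune from measurements. */
enum { REFILL_MIN = 8, REFILL_MAX = 128 };

/*
 * Pick a refill count so that batch + working set fits in the cache budget.
 * All inputs are estimates supplied by the caller.
 */
size_t choose_refill_count(size_t cache_bytes, size_t working_set_bytes,
                           size_t block_size)
{
    assert(block_size > 0);
    if (working_set_bytes >= cache_bytes)
        return REFILL_MIN;            /* no headroom: use the smallest batch */
    size_t count = (cache_bytes - working_set_bytes) / block_size;
    if (count < REFILL_MIN)
        return REFILL_MIN;
    if (count > REFILL_MAX)
        return REFILL_MAX;
    return count;
}

/* Hint that `next` will be written soon and should stay in cache. */
void prefetch_block(const void *next)
{
    __builtin_prefetch(next, 1 /* prepare for write */, 3 /* high locality */);
}
```
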
---
### 🔥 P2: Benchmark Diversity (Next 2 Weeks)
**Problem:** Larson is NOT representative of general workloads
**Larson characteristics:**
- FIFO allocation pattern
- Fixed working set (1024 chunks)
- Predictable sizes (8-128B)
- High freelist hit rate
**Need to test:**
1. **Random allocation/free** (not FIFO)
2. **Bursty allocations** (malloc storms)
3. **Mixed lifetime** (long-lived + short-lived)
4. **Variable sizes** (less predictable)
**Hypothesis:** Other patterns may have different bottlenecks (refill frequency might matter more).
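
As a starting point for the first and fourth items, a random (non-FIFO) churn driver with variable sizes might look like the sketch below. It is not an existing hakmem benchmark; the function name and parameters are assumptions, and it exercises whichever `malloc`/`free` the binary is linked or preloaded against.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* xorshift32 PRNG: small and deterministic, adequate for a workload driver. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/*
 * Random churn: pick a random slot, free whatever it holds, then allocate a
 * random size in [min_size, max_size]. Returns the number of successful
 * allocations. `seed` must be non-zero.
 */
size_t random_churn(size_t slots, size_t iterations,
                    size_t min_size, size_t max_size, uint32_t seed)
{
    assert(slots > 0 && seed != 0 && min_size > 0 && min_size <= max_size);
    void **live = calloc(slots, sizeof *live);
    if (!live)
        return 0;

    size_t allocated = 0;
    for (size_t i = 0; i < iterations; i++) {
        size_t slot = xorshift32(&seed) % slots;
        size_t size = min_size + xorshift32(&seed) % (max_size - min_size + 1);
        free(live[slot]);                     /* free(NULL) is a no-op */
        live[slot] = malloc(size);
        if (live[slot]) {
            *(volatile char *)live[slot] = 1; /* touch the block */
            allocated++;
        }
    }
    for (size_t s = 0; s < slots; s++)
        free(live[s]);
    free(live);
    return allocated;
}
```
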
---
### 🔥 P3: Fast Path Simplification (Phase 6 Goal)
**Long-term vision:** Eliminate superslab_refill from hot path
**Approach:**
1. Background refill thread
- Keep freelists pre-filled
- Allocation never waits for superslab_refill
2. Lock-free slab exchange
- Reduce atomic operations
- Faster refill when needed
3. System tcache study
- Understand why the system malloc (tcache) fast path takes only 3-4 instructions
- Adopt proven patterns
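
To make the fast-path goal concrete, here is a tcache-style per-thread free list in its simplest form. It is written in the spirit of that design and is neither glibc's code nor hakmem's; it handles a single size class, the names are hypothetical, and `_Thread_local` requires C11. The point is that pop and push reduce to a load, a null check, and a pointer update, with no locks or atomics.

```c
#include <assert.h>
#include <stddef.h>

/* A free block stores the pointer to the next free block in its first bytes. */
typedef struct free_block {
    struct free_block *next;
} free_block;

/* Hypothetical per-thread cache for one size class. */
static _Thread_local free_block *tls_head;

/* Fast-path allocation: pop the head. NULL means the caller must refill. */
void *tls_pop(void)
{
    free_block *b = tls_head;
    if (b)
        tls_head = b->next;
    return b;
}

/* Fast-path free: push the block onto the head of the list. */
void tls_push(void *p)
{
    free_block *b = p;
    b->next = tls_head;
    tls_head = b;
}
```
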
---
## Key Metrics to Track
### Performance
- **Throughput:** 4.19 M ops/s (Larson baseline)
- **superslab_refill CPU:** 28.56% → target <10%
- **L1d miss rate:** 12.88% → target <10%
- **IPC:** 1.93 → maintain or improve
### Health
- **Stability:** Results should be consistent across runs (within 2%)
- **Memory usage:** Monitor RSS growth
- **Fragmentation:** Track over time
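
The stability target can be checked mechanically. The sketch below assumes the 2% figure is a run-to-run tolerance and measures it as (max - min) / mean over repeated throughput samples; both that definition and the function names are assumptions, not something this report specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Relative spread of n throughput samples: (max - min) / mean. */
double relative_spread(const double *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    double min = samples[0], max = samples[0], sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < min) min = samples[i];
        if (samples[i] > max) max = samples[i];
        sum += samples[i];
    }
    double mean = sum / (double)n;
    assert(mean > 0.0);
    return (max - min) / mean;
}

/* 1 if all runs lie within `tolerance` (e.g. 0.02 for 2%), else 0. */
int runs_are_stable(const double *samples, size_t n, double tolerance)
{
    return relative_spread(samples, n) <= tolerance;
}
```
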
---
## Data-Driven Checklist
Before ANY optimization:
- [ ] Profile with `perf record -g`
- [ ] Identify TOP bottleneck (>5% CPU)
- [ ] Verify with `perf stat` (cache, branches, IPC)
- [ ] Test with MULTIPLE benchmarks (not just Larson)
- [ ] Document baseline metrics
- [ ] A/B test changes (at least 3 runs each)
- [ ] Verify improvements are statistically significant
**Rule:** If perf doesn't show it, don't optimize it.
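
For the last two checklist items, one common choice is Welch's t-test on the per-run throughputs of the baseline and the candidate. The sketch below computes only the squared t statistic; picking Welch's test is an assumption made here, not something the report prescribes. Compare the result against the square of the critical value for the chosen significance level and degrees of freedom.

```c
#include <assert.h>
#include <stddef.h>

static double mean_of(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Unbiased sample variance (divides by n - 1); requires n >= 2. */
static double variance_of(const double *x, size_t n, double mean)
{
    double ss = 0.0;
    for (size_t i = 0; i < n; i++)
        ss += (x[i] - mean) * (x[i] - mean);
    return ss / (double)(n - 1);
}

/*
 * Squared Welch t statistic for two sets of runs:
 *   t^2 = (mean_a - mean_b)^2 / (var_a / n_a + var_b / n_b)
 * Returned squared so that no sqrt (and no libm) is needed.
 */
double welch_t_squared(const double *a, size_t na, const double *b, size_t nb)
{
    assert(na >= 2 && nb >= 2);
    double ma = mean_of(a, na), mb = mean_of(b, nb);
    double se2 = variance_of(a, na, ma) / (double)na
               + variance_of(b, nb, mb) / (double)nb;
    assert(se2 > 0.0);
    return (ma - mb) * (ma - mb) / se2;
}
```
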
---
## Lessons Learned
1. **Profile first, optimize second**
- Task Teacher's intuition was wrong
- Data revealed superslab_refill as real bottleneck
2. **Cache effects can reverse gains**
- More batching ≠ always faster
- L1 cache is precious (32 KB)
3. **Benchmarks lie**
- Larson has special properties (FIFO, stable working set)
- Real workloads may differ significantly
4. **Measure, don't guess**
- memset "optimization" would have been wasted effort
- perf shows what actually matters
---
## Final Recommendation
**STOP** optimizing refill frequency.
**START** optimizing superslab_refill.
The data is clear: superslab_refill is 28.56% of CPU time. That's where the wins are.
---
**Questions? See full report:** `PHASE1_REFILL_INVESTIGATION.md`