Phase 1 Quick Wins - Executive Summary
TL;DR: REFILL_COUNT optimization failed because we optimized the wrong thing. The real bottleneck is superslab_refill (28.56% CPU), not refill frequency.
The Numbers
| REFILL_COUNT | Throughput | L1d Miss Rate | Verdict |
|---|---|---|---|
| 32 | 4.19 M/s | 12.88% | ✅ OPTIMAL |
| 64 | 3.89 M/s | 14.12% | ❌ -7.2% |
| 128 | 2.68 M/s | 16.08% | ❌ -36% |
Root Causes
1. superslab_refill is the Bottleneck (28.56% CPU) ⭐⭐⭐⭐⭐
perf report (REFILL_COUNT=32):
28.56% superslab_refill ← THIS IS THE PROBLEM
3.10% [kernel] (various)
...
Impact: Even if we eliminate ALL refill overhead, max gain is 28.56%. In reality, we made it worse.
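In throughput terms, Amdahl's law caps the win. Removing a fraction $f = 0.2856$ of total execution time yields at most

$$\text{speedup}_{\max} = \frac{1}{1-f} = \frac{1}{1-0.2856} \approx 1.40\times$$

so even a perfect, zero-cost refill is bounded at roughly +40% throughput, and anything short of full elimination lands well below that.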
2. Cache Pollution from Large Batches ⭐⭐⭐⭐
REFILL_COUNT=32: L1d miss rate = 12.88%
REFILL_COUNT=128: L1d miss rate = 16.08% (+25% worse!)
Why:
- 128 blocks × 128 bytes = 16 KB
- L1 cache = 32 KB total
- Batch + working set > L1 capacity
- Result: More cache misses, slower performance
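The arithmetic is easy to sanity-check. A minimal sketch (the 32 KB L1d and 128 B block size are the figures cited above, not values read from the allocator's headers):

```c
#include <stdio.h>

/* Back-of-the-envelope L1d footprint of one refill batch (illustrative). */
int main(void) {
    const size_t l1d_bytes   = 32 * 1024;          /* L1d size cited above */
    const size_t block_bytes = 128;                /* size class under test */
    const size_t counts[]    = {16, 32, 64, 128};  /* candidate REFILL_COUNTs */

    for (size_t i = 0; i < sizeof counts / sizeof counts[0]; i++) {
        size_t batch = counts[i] * block_bytes;
        printf("REFILL_COUNT=%3zu -> batch = %5zu B (%4.1f%% of L1d)\n",
               counts[i], batch, 100.0 * batch / l1d_bytes);
    }
    return 0;
}
```

At REFILL_COUNT=128 the batch alone claims half of L1d before the benchmark's own working set is counted, which matches the observed jump in miss rate.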
3. Refill Frequency Already Low ⭐⭐⭐
Larson benchmark characteristics:
- FIFO pattern with 1024 chunks per thread
- High TLS freelist hit rate
- Refills are rare, not frequent
Implication: Reducing refill frequency has minimal impact when refills are already uncommon.
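The shape of the fast path explains why. A hedged sketch of a generic TLS-freelist allocator (not hakmem's actual code; `tls_free_list` and `superslab_refill_batch` are illustrative names):

```c
/* Generic TLS fast path: the refill only runs when the list is empty. */
typedef struct block { struct block *next; } block_t;

static __thread block_t *tls_free_list;        /* per-thread freelist head */

extern block_t *superslab_refill_batch(void);  /* hypothetical slow path */

void *tiny_alloc(void) {
    block_t *b = tls_free_list;
    if (b) {                        /* hot path: one load, one store */
        tls_free_list = b->next;
        return b;
    }
    /* Cold path. In a FIFO workload like Larson, frees repopulate the
     * list faster than allocations drain it, so we rarely get here. */
    tls_free_list = superslab_refill_batch();
    if ((b = tls_free_list) != NULL)
        tls_free_list = b->next;
    return b;
}
```

When the cold branch is almost never taken, changing how much it fetches per call barely moves the needle, which is exactly what the REFILL_COUNT sweep showed.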
4. memset is NOT in Hot Path ⭐
Search results:
memset found in:
- hakmem_tiny_init.inc (one-time init)
- hakmem_tiny_intel.inc (debug ring init)
Conclusion: memset removal would have ZERO impact on allocation performance.
Why Task Teacher's +31% Projection Failed
Expected:
REFILL 32→128: reduce calls by 4x → +31% speedup
Reality:
REFILL 32→128: -36% slowdown
Mistakes:
- ❌ Assumed refill is cheap (it's 28.56% of CPU)
- ❌ Assumed refills are frequent (they're rare in Larson)
- ❌ Ignored cache effects (L1d misses +25%)
- ❌ Used Larson-specific pattern (not generalizable)
Immediate Actions
✅ DO THIS NOW
- Keep REFILL_COUNT=32 (optimal for Larson)
- Focus on superslab_refill optimization (28.56% CPU → biggest win)
- Profile superslab_refill internals:
- Bitmap scanning
- mmap syscalls
- Metadata initialization
❌ DO NOT DO THIS
- DO NOT increase REFILL_COUNT to 64+ (causes cache pollution)
- DO NOT optimize memset (not in hot path, waste of time)
- DO NOT trust Larson alone (need diverse benchmarks)
Next Steps (Priority Order)
🔥 P0: Superslab_refill Deep Dive (This Week)
Hypothesis: 28.56% CPU in one function is unacceptable. Break it down:
```c
superslab_refill() {
    // Profile each step:
    // 1. Bitmap scan to find a free slab    ← how much time?
    // 2. mmap() for a new SuperSlab         ← how much time?
    // 3. Metadata initialization            ← how much time?
    // 4. Slab carving / freelist setup      ← how much time?
}
```
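One cheap way to get per-step numbers before (or alongside) perf is to bracket each phase with a monotonic clock. A sketch; the four accumulators mirror the suspected phases above, and the phase functions named in the comment are placeholders:

```c
#include <stdint.h>
#include <time.h>

/* Nanosecond timestamp for coarse phase-level attribution. */
static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* One accumulator per suspected sub-bottleneck (names illustrative). */
static uint64_t ns_bitmap_scan, ns_mmap, ns_metadata, ns_carve;

#define TIMED(acc, stmt) do {        \
        uint64_t t0_ = now_ns();     \
        stmt;                        \
        (acc) += now_ns() - t0_;     \
    } while (0)

/* Inside superslab_refill() this would look like:
 *     TIMED(ns_bitmap_scan, slab = find_free_slab_in_bitmap());
 *     TIMED(ns_mmap,        base = map_new_superslab());
 *     TIMED(ns_metadata,    init_superslab_metadata(base));
 *     TIMED(ns_carve,       carve_slab_into_freelist(slab));
 * Dump the four counters at exit to see which phase dominates. */
```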
Tools:
perf record -e cycles -g --call-graph=dwarf -- ./larson_hakmem ...
perf report --stdio -g --no-children | grep superslab
Expected outcome: Find sub-bottleneck, get 10-20% speedup by optimizing it.
🔥 P1: Cache-Aware Refill (Next Week)
Goal: Reduce L1d miss rate from 12.88% to <10%
Approach:
1. Limit batch size to fit in L1 alongside the working set
   - Current: REFILL_COUNT=32 (4 KB for the 128 B class)
   - Test: REFILL_COUNT=16 (2 KB)
   - Hypothesis: smaller batches = fewer misses
2. Prefetching (see the sketch after this list)
   - Prefetch the next batch while using the current batch
   - Reduces the cache-miss penalty
3. Adaptive batch sizing
   - Small batches when the working set is large
   - Large batches when the working set is small
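For the prefetching item, a minimal sketch using the GCC/Clang builtin (assumes batch blocks are linked through a `next` pointer; `warm_batch` is an illustrative name, not hakmem's refill code):

```c
/* Walk a freshly refilled batch, prefetching one node ahead so the
 * next block's cache line is in flight while the current one is used. */
typedef struct block { struct block *next; } block_t;

void warm_batch(block_t *head) {
    for (block_t *b = head; b; b = b->next) {
        if (b->next)
            __builtin_prefetch(b->next, 1 /* for write */, 3 /* keep in L1 */);
        /* ... hand b to the TLS freelist / caller here ... */
    }
}
```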
🔥 P2: Benchmark Diversity (Next 2 Weeks)
Problem: Larson is NOT representative
Larson characteristics:
- FIFO allocation pattern
- Fixed working set (1024 chunks)
- Predictable sizes (8-128B)
- High freelist hit rate
Need to test:
- Random allocation/free (not FIFO)
- Bursty allocations (malloc storms)
- Mixed lifetime (long-lived + short-lived)
- Variable sizes (less predictable)
Hypothesis: Other patterns may have different bottlenecks (refill frequency might matter more).
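As a starting point, a minimal random-order stressor to contrast with Larson's FIFO pattern (a sketch; it calls plain malloc/free so it can be pointed at any allocator via LD_PRELOAD):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

/* Random alloc/free: keep SLOTS live pointers; each step frees and
 * reallocates a random slot at a random size in 8..128 B. Unlike
 * Larson's FIFO, both order and lifetime are unpredictable. */
enum { SLOTS = 1024, STEPS = 10 * 1000 * 1000 };

int main(void) {
    static void *slot[SLOTS];
    unsigned seed = 12345;
    for (int i = 0; i < STEPS; i++) {
        unsigned s  = rand_r(&seed) % SLOTS;
        size_t   sz = 8 + rand_r(&seed) % 121;   /* 8..128 bytes */
        free(slot[s]);                           /* free(NULL) is a no-op */
        if (!(slot[s] = malloc(sz))) return 1;
    }
    for (unsigned s = 0; s < SLOTS; s++) free(slot[s]);
    puts("done");
    return 0;
}
```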
🔥 P3: Fast Path Simplification (Phase 6 Goal)
Long-term vision: Eliminate superslab_refill from hot path
Approach:
1. Background refill thread (a rough sketch follows this list)
   - Keep freelists pre-filled
   - Allocation never waits for superslab_refill
2. Lock-free slab exchange
   - Reduce atomic operations
   - Faster refill when needed
3. System tcache study
   - Understand why System malloc's fast path is only 3-4 instructions
   - Adopt proven patterns
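A rough shape for the background-refill idea, under heavy assumptions (single shared depot, pthreads, illustrative names; a real design would want per-size-class depots and lock-free handoff):

```c
#include <pthread.h>

typedef struct block { struct block *next; } block_t;

extern block_t *superslab_refill_batch(void);   /* the expensive call */

enum { DEPOT_TARGET = 4 };                      /* batches kept ready */

static pthread_mutex_t depot_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  depot_low  = PTHREAD_COND_INITIALIZER;
static block_t *depot[DEPOT_TARGET];            /* pre-filled batch heads */
static int      depot_count;

/* Background thread: pays the superslab_refill cost off the hot path. */
static void *refill_worker(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&depot_lock);
        while (depot_count >= DEPOT_TARGET)
            pthread_cond_wait(&depot_low, &depot_lock);
        pthread_mutex_unlock(&depot_lock);

        block_t *batch = superslab_refill_batch();  /* slow, no lock held */

        pthread_mutex_lock(&depot_lock);
        depot[depot_count++] = batch;
        pthread_mutex_unlock(&depot_lock);
    }
    return NULL;
}

/* Allocating threads grab a whole pre-filled batch instead of refilling. */
block_t *take_prefilled_batch(void) {
    pthread_mutex_lock(&depot_lock);
    block_t *batch = depot_count ? depot[--depot_count] : NULL;
    pthread_cond_signal(&depot_low);            /* tell the worker to top up */
    pthread_mutex_unlock(&depot_lock);
    return batch;   /* NULL -> caller falls back to inline refill */
}
```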
Key Metrics to Track
Performance
- Throughput: 4.19 M ops/s (Larson baseline)
- superslab_refill CPU: 28.56% → target <10%
- L1d miss rate: 12.88% → target <10%
- IPC: 1.93 → maintain or improve
Health
- Stability: Results should be consistent (±2%)
- Memory usage: Monitor RSS growth
- Fragmentation: Track over time
Data-Driven Checklist
Before ANY optimization:
- Profile with `perf record -g`
- Identify the TOP bottleneck (>5% CPU)
- Verify with `perf stat` (cache, branches, IPC)
- Test with MULTIPLE benchmarks (not just Larson)
- Document baseline metrics
- A/B test changes (at least 3 runs each)
- Verify improvements are statistically significant
Rule: If perf doesn't show it, don't optimize it.
Lessons Learned
1. Profile first, optimize second
   - Task Teacher's intuition was wrong
   - Data revealed superslab_refill as the real bottleneck
2. Cache effects can reverse gains
   - More batching ≠ always faster
   - L1 cache is precious (32 KB)
3. Benchmarks lie
   - Larson has special properties (FIFO, stable working set)
   - Real workloads may differ significantly
4. Measure, don't guess
   - The memset "optimization" would have been wasted effort
   - perf shows what actually matters
Final Recommendation
STOP optimizing refill frequency. START optimizing superslab_refill.
The data is clear: superslab_refill is 28.56% of CPU time. That's where the wins are.
Questions? See full report: PHASE1_REFILL_INVESTIGATION.md