hakmem/docs/analysis/PHASE1_EXECUTIVE_SUMMARY.md
Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
Phase 1 complete: environment variable cleanup + fprintf debug guards

ENV variables removed (BG/HotMag family):
- core/hakmem_tiny_init.inc: removed HotMag ENV handling (~131 lines)
- core/hakmem_tiny_bg_spill.c: removed BG spill ENV handling
- core/tiny_refill.h: BG remote settings replaced with fixed values
- core/hakmem_tiny_slow.inc: removed BG references

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

Documentation cleanup:
- Deleted 328 markdown files (old reports and duplicate docs)

Performance check:
- Larson: 52.35M ops/s (previous run 52.8M; stable)
- No functional impact from the ENV cleanup
- Some debug output remains (to be handled in the next phase)



# Phase 1 Quick Wins - Executive Summary
**TL;DR:** REFILL_COUNT optimization failed because we optimized the wrong thing. The real bottleneck is `superslab_refill` (28.56% CPU), not refill frequency.
---
## The Numbers
| REFILL_COUNT | Throughput | L1d Miss Rate | Verdict |
|--------------|------------|---------------|---------|
| **32** | **4.19 M/s** | **12.88%** | ✅ **OPTIMAL** |
| 64 | 3.89 M/s | 14.12% | ❌ -7.2% |
| 128 | 2.68 M/s | 16.08% | ❌ -36% |
---
## Root Causes
### 1. superslab_refill is the Bottleneck (28.56% CPU) ⭐⭐⭐⭐⭐
```
perf report (REFILL_COUNT=32):
28.56% superslab_refill ← THIS IS THE PROBLEM
3.10% [kernel] (various)
...
```
**Impact:** Even if we eliminate ALL refill overhead, max gain is 28.56%. In reality, we made it worse.
### 2. Cache Pollution from Large Batches ⭐⭐⭐⭐
```
REFILL_COUNT=32: L1d miss rate = 12.88%
REFILL_COUNT=128: L1d miss rate = 16.08% (+25% worse!)
```
**Why:**
- 128 blocks × 128 bytes = 16 KB
- L1 cache = 32 KB total
- Batch + working set > L1 capacity
- **Result:** More cache misses, slower performance
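To make the arithmetic concrete, here is a minimal sketch (illustrative only, not hakmem code; the L1d size constant mirrors the 32 KB figure above) that computes the refill batch footprint per size class:
```c
/* Illustrative only: batch footprint = blocks per refill * block size,
 * compared against a 32 KB L1d (the figure quoted above). */
#include <stdio.h>

#define L1D_BYTES (32u * 1024u)

static unsigned batch_footprint(unsigned refill_count, unsigned block_size) {
    return refill_count * block_size;
}

int main(void) {
    /* 128 B class: 32 blocks -> 4 KB (12.5% of L1d), 128 blocks -> 16 KB (50% of L1d) */
    printf("REFILL_COUNT=32:  %u bytes\n", batch_footprint(32, 128));
    printf("REFILL_COUNT=128: %u bytes\n", batch_footprint(128, 128));
    return 0;
}
```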
### 3. Refill Frequency Already Low ⭐⭐⭐
**Larson benchmark characteristics:**
- FIFO pattern with 1024 chunks per thread
- High TLS freelist hit rate
- Refills are **rare**, not frequent
**Implication:** Reducing refill frequency has minimal impact when refills are already uncommon.
### 4. memset is NOT in Hot Path ⭐
**Search results:**
```bash
memset found in:
- hakmem_tiny_init.inc (one-time init)
- hakmem_tiny_intel.inc (debug ring init)
```
**Conclusion:** memset removal would have **ZERO** impact on allocation performance.
---
## Why Task Teacher's +31% Projection Failed
**Expected:**
```
REFILL 32→128: reduce calls by 4x → +31% speedup
```
**Reality:**
```
REFILL 32→128: -36% slowdown
```
**Mistakes:**
1. ❌ Assumed refill is cheap (it's 28.56% of CPU)
2. ❌ Assumed refills are frequent (they're rare in Larson)
3. ❌ Ignored cache effects (L1d misses +25%)
4. ❌ Used Larson-specific pattern (not generalizable)
---
## Immediate Actions
### ✅ DO THIS NOW
1. **Keep REFILL_COUNT=32** (optimal for Larson)
2. **Focus on superslab_refill optimization** (28.56% CPU → biggest win)
3. **Profile superslab_refill internals:**
- Bitmap scanning
- mmap syscalls
- Metadata initialization
### ❌ DO NOT DO THIS
1. **DO NOT increase REFILL_COUNT to 64+** (causes cache pollution)
2. **DO NOT optimize memset** (not in hot path, waste of time)
3. **DO NOT trust Larson alone** (need diverse benchmarks)
---
## Next Steps (Priority Order)
### 🔥 P0: Superslab_refill Deep Dive (This Week)
**Hypothesis:** 28.56% CPU in one function is unacceptable. Break it down:
```c
superslab_refill() {
    // Profile each step:
    // 1. Bitmap scan to find a free slab   -> how much time?
    // 2. mmap() for a new SuperSlab        -> how much time?
    // 3. Metadata initialization           -> how much time?
    // 4. Slab carving / freelist setup     -> how much time?
}
```
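To break those steps down, a minimal instrumentation sketch follows (assumptions: x86 with `__rdtsc` from `<x86intrin.h>`; the step names, `g_refill_cycles`, and the helper calls in the usage comment are hypothetical, not hakmem APIs):
```c
/* Minimal per-step timing sketch, not hakmem's actual refill code.
 * Wrap each superslab_refill step in TIMED_STEP and dump the counters at
 * exit to see which step dominates the 28.56%. Not thread-safe as written. */
#include <x86intrin.h>
#include <stdint.h>

enum { STEP_BITMAP_SCAN, STEP_MMAP, STEP_META_INIT, STEP_CARVE, STEP_COUNT };
static uint64_t g_refill_cycles[STEP_COUNT];

#define TIMED_STEP(step, expr)                       \
    do {                                             \
        uint64_t t0_ = __rdtsc();                    \
        expr;                                        \
        g_refill_cycles[(step)] += __rdtsc() - t0_;  \
    } while (0)

/* Usage inside the refill path (function names are placeholders):
 *   TIMED_STEP(STEP_BITMAP_SCAN, slab = bitmap_scan_for_free_slab(ss));
 *   TIMED_STEP(STEP_MMAP,        ss   = map_new_superslab());
 *   TIMED_STEP(STEP_META_INIT,   init_superslab_metadata(ss));
 *   TIMED_STEP(STEP_CARVE,       carve_slab_into_freelist(slab));          */
```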
**Tools:**
```bash
perf record -e cycles -g --call-graph=dwarf -- ./larson_hakmem ...
perf report --stdio -g --no-children | grep superslab
```
**Expected outcome:** Find sub-bottleneck, get 10-20% speedup by optimizing it.
---
### 🔥 P1: Cache-Aware Refill (Next Week)
**Goal:** Reduce L1d miss rate from 12.88% to <10%
**Approach:**
1. Limit batch size so that batch + working set fits in L1
- Current: REFILL_COUNT=32 (4KB for 128B class)
- Test: REFILL_COUNT=16 (2KB)
- Hypothesis: Smaller batches = fewer misses
2. Prefetching (see the sketch after this list)
- Prefetch next batch while using current batch
- Reduces cache miss penalty
3. Adaptive batch sizing
- Small batches when working set is large
- Large batches when working set is small
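For item 2, a minimal prefetch sketch (illustrative; `blocks`, `n`, and the freelist layout are placeholders, not hakmem's actual refill loop, and `__builtin_prefetch` assumes GCC/Clang):
```c
/* Illustrative refill loop: while linking block i into the TLS freelist,
 * prefetch block i+1 for write so its cache line arrives before we touch it. */
#include <stddef.h>

static void push_batch_with_prefetch(void **freelist_head, void **blocks, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(blocks[i + 1], 1 /* write */, 3 /* keep in L1 */);
        *(void **)blocks[i] = *freelist_head;   /* link block into freelist */
        *freelist_head = blocks[i];
    }
}
```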
---
### 🔥 P2: Benchmark Diversity (Next 2 Weeks)
**Problem:** Larson is NOT representative
**Larson characteristics:**
- FIFO allocation pattern
- Fixed working set (1024 chunks)
- Predictable sizes (8-128B)
- High freelist hit rate
**Need to test:**
1. **Random allocation/free** (not FIFO; see the sketch below)
2. **Bursty allocations** (malloc storms)
3. **Mixed lifetime** (long-lived + short-lived)
4. **Variable sizes** (less predictable)
**Hypothesis:** Other patterns may have different bottlenecks (refill frequency might matter more).
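For pattern 1, a minimal random alloc/free harness sketch (illustrative; slot count, iteration count, and size range are arbitrary choices, and it calls plain `malloc`/`free`):
```c
/* Illustrative stress loop: random slots, random 8-128 B sizes, no FIFO order. */
#include <stdlib.h>

#define SLOTS 4096
#define ITERS 10000000L

int main(void) {
    static void *slot[SLOTS];   /* zero-initialized, so the first free() per slot is a no-op */
    srand(42);
    for (long i = 0; i < ITERS; i++) {
        int s = rand() % SLOTS;
        free(slot[s]);
        slot[s] = malloc((size_t)(8 + rand() % 121));   /* sizes 8..128 bytes */
    }
    for (int s = 0; s < SLOTS; s++) free(slot[s]);
    return 0;
}
```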
---
### 🔥 P3: Fast Path Simplification (Phase 6 Goal)
**Long-term vision:** Eliminate superslab_refill from the hot path
**Approach:**
1. Background refill thread (minimal sketch after this list)
- Keep freelists pre-filled
- Allocation never waits for superslab_refill
2. Lock-free slab exchange
- Reduce atomic operations
- Faster refill when needed
3. System tcache study
- Understand why System malloc's tcache fast path is only 3-4 instructions
- Adopt proven patterns
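For approach 1, a minimal background-refill sketch (design illustration only; `NUM_CLASSES`, the water marks, `g_cached_blocks`, and `refill_class` are hypothetical names, not hakmem APIs):
```c
/* Illustrative design: a helper thread tops up per-class caches whenever they
 * fall below a low-water mark, so the allocation fast path never has to call
 * superslab_refill inline. Start it once at allocator init with pthread_create. */
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

#define NUM_CLASSES   8
#define LOW_WATER     16
#define REFILL_TARGET 32

static atomic_int g_cached_blocks[NUM_CLASSES];

static void refill_class(int cls, int want) {   /* placeholder for the real refill */
    (void)cls; (void)want;
}

static void *bg_refill_thread(void *arg) {
    (void)arg;
    for (;;) {
        for (int cls = 0; cls < NUM_CLASSES; cls++) {
            int have = atomic_load_explicit(&g_cached_blocks[cls], memory_order_relaxed);
            if (have < LOW_WATER)
                refill_class(cls, REFILL_TARGET - have);
        }
        usleep(100);   /* polling; a real design would use futex/condvar wakeups */
    }
    return NULL;
}
```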
---
## Key Metrics to Track
### Performance
- **Throughput:** 4.19 M ops/s (Larson baseline)
- **superslab_refill CPU:** 28.56% → target <10%
- **L1d miss rate:** 12.88% → target <10%
- **IPC:** 1.93 → maintain or improve
### Health
- **Stability:** Results should be consistent across runs (within ±2%)
- **Memory usage:** Monitor RSS growth
- **Fragmentation:** Track over time
---
## Data-Driven Checklist
Before ANY optimization:
- [ ] Profile with `perf record -g`
- [ ] Identify TOP bottleneck (>5% CPU)
- [ ] Verify with `perf stat` (cache, branches, IPC)
- [ ] Test with MULTIPLE benchmarks (not just Larson)
- [ ] Document baseline metrics
- [ ] A/B test changes (at least 3 runs each; see the sketch below)
- [ ] Verify improvements are statistically significant
**Rule:** If perf doesn't show it, don't optimize it.
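For the A/B item above, a minimal noise-check sketch (illustrative; the run values are made-up examples and the 2x-noise threshold is a rough rule of thumb, not a formal significance test):
```c
/* Illustrative A/B check: compare the mean throughput delta against the
 * combined run-to-run standard deviation of the two variants. */
#include <math.h>
#include <stdio.h>

static double mean(const double *x, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) s += x[i];
    return s / n;
}

static double stdev(const double *x, int n) {
    double m = mean(x, n), s = 0;
    for (int i = 0; i < n; i++) s += (x[i] - m) * (x[i] - m);
    return sqrt(s / (n - 1));
}

int main(void) {
    double base[] = {4.19, 4.17, 4.21};   /* M ops/s, baseline runs (example values) */
    double cand[] = {4.30, 4.28, 4.33};   /* M ops/s, candidate runs (example values) */
    double delta  = mean(cand, 3) - mean(base, 3);
    double noise  = stdev(base, 3) + stdev(cand, 3);
    printf("delta=%.3f M/s, noise=%.3f M/s -> %s\n",
           delta, noise, fabs(delta) > 2 * noise ? "likely real" : "within noise");
    return 0;
}
```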
---
## Lessons Learned
1. **Profile first, optimize second**
- Task Teacher's intuition was wrong
- Data revealed superslab_refill as real bottleneck
2. **Cache effects can reverse gains**
- More batching ≠ always faster
- L1 cache is precious (32 KB)
3. **Benchmarks lie**
- Larson has special properties (FIFO, stable working set)
- Real workloads may differ significantly
4. **Measure, don't guess**
- memset "optimization" would have been wasted effort
- perf shows what actually matters
---
## Final Recommendation
**STOP** optimizing refill frequency.
**START** optimizing superslab_refill.
The data is clear: superslab_refill is 28.56% of CPU time. That's where the wins are.
---
**Questions? See full report:** `PHASE1_REFILL_INVESTIGATION.md`