Phase 1 Quick Wins - Executive Summary
TL;DR: REFILL_COUNT optimization failed because we optimized the wrong thing. The real bottleneck is superslab_refill (28.56% CPU), not refill frequency.
The Numbers
| REFILL_COUNT | Throughput | L1d Miss Rate | Verdict |
|---|---|---|---|
| 32 | 4.19 M/s | 12.88% | ✅ OPTIMAL |
| 64 | 3.89 M/s | 14.12% | ❌ -7.2% |
| 128 | 2.68 M/s | 16.08% | ❌ -36% |
Root Causes
1. superslab_refill is the Bottleneck (28.56% CPU) ⭐⭐⭐⭐⭐
perf report (REFILL_COUNT=32):
28.56% superslab_refill ← THIS IS THE PROBLEM
3.10% [kernel] (various)
...
Impact: Even if ALL refill overhead were eliminated, total CPU time would shrink by at most 28.56% (roughly a 1.4x throughput ceiling). In reality, we made it worse.
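For reference, the best-case bound follows from Amdahl's law, with superslab_refill's 28.56% as the fraction being removed:

$$ S_{\max} = \frac{1}{1 - 0.2856} \approx 1.40 $$

That is at most +40% throughput in the ideal case, and nothing beyond it from refill-frequency tuning alone.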
2. Cache Pollution from Large Batches ⭐⭐⭐⭐
REFILL_COUNT=32: L1d miss rate = 12.88%
REFILL_COUNT=128: L1d miss rate = 16.08% (+25% worse!)
Why:
- 128 blocks × 128 bytes = 16 KB
- L1 cache = 32 KB total
- Batch + working set > L1 capacity
- Result: More cache misses, slower performance
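The footprint arithmetic above, as a tiny standalone check. The 32 KiB L1d figure and block sizes come from this report; the helper name and program structure are illustrative only:

```c
#include <stdio.h>

/* Illustrative check: does a refill batch leave room in L1d for the
 * benchmark's working set? Names and constants are assumptions for this
 * sketch, not the actual hakmem configuration symbols. */
enum { L1D_BYTES = 32 * 1024 };

static void report(int refill_count, int block_size) {
    int batch = refill_count * block_size;
    printf("REFILL_COUNT=%3d, block=%3d B -> batch=%5d B (%.0f%% of L1d)\n",
           refill_count, block_size, batch, 100.0 * batch / L1D_BYTES);
}

int main(void) {
    report(32, 128);   /*  4 KiB: most of L1d left for the working set */
    report(64, 128);   /*  8 KiB */
    report(128, 128);  /* 16 KiB: half of L1d gone before any user data */
    return 0;
}
```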
3. Refill Frequency Already Low ⭐⭐⭐
Larson benchmark characteristics:
- FIFO pattern with 1024 chunks per thread
- High TLS freelist hit rate
- Refills are rare, not frequent
Implication: Reducing refill frequency has minimal impact when refills are already uncommon.
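One way to confirm this on any workload is to count refills against fast-path hits. A minimal sketch with hypothetical hook points (none of these symbols exist in hakmem):

```c
/* Hypothetical instrumentation: per-thread counters for fast-path hits
 * vs. refill calls, dumped e.g. at thread exit. Not actual hakmem code. */
#include <stdio.h>

static __thread unsigned long tls_alloc_fast_hits;
static __thread unsigned long tls_refill_calls;

static inline void count_fast_hit(void) { tls_alloc_fast_hits++; }
static inline void count_refill(void)   { tls_refill_calls++; }

/* Call from a thread-exit hook (e.g. a pthread_key destructor). */
static void dump_refill_ratio(void) {
    unsigned long total = tls_alloc_fast_hits + tls_refill_calls;
    if (total)
        fprintf(stderr, "refills: %lu / %lu allocs (%.4f%%)\n",
                tls_refill_calls, total,
                100.0 * tls_refill_calls / (double)total);
}
```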
4. memset is NOT in Hot Path ⭐
Search results:
memset found in:
- hakmem_tiny_init.inc (one-time init)
- hakmem_tiny_intel.inc (debug ring init)
Conclusion: memset removal would have ZERO impact on allocation performance.
Why Task Teacher's +31% Projection Failed
Expected:
REFILL 32→128: reduce calls by 4x → +31% speedup
Reality:
REFILL 32→128: -36% slowdown
Mistakes:
- ❌ Assumed refill is cheap (it's 28.56% of CPU)
- ❌ Assumed refills are frequent (they're rare in Larson)
- ❌ Ignored cache effects (L1d misses +25%)
- ❌ Used Larson-specific pattern (not generalizable)
Immediate Actions
✅ DO THIS NOW
- Keep REFILL_COUNT=32 (optimal for Larson)
- Focus on superslab_refill optimization (28.56% CPU → biggest win)
- Profile superslab_refill internals:
- Bitmap scanning
- mmap syscalls
- Metadata initialization
❌ DO NOT DO THIS
- DO NOT increase REFILL_COUNT to 64+ (causes cache pollution)
- DO NOT optimize memset (not in hot path, waste of time)
- DO NOT trust Larson alone (need diverse benchmarks)
Next Steps (Priority Order)
🔥 P0: Superslab_refill Deep Dive (This Week)
Hypothesis: 28.56% CPU in one function is unacceptable. Break it down:
superslab_refill() {
    /* Profile each step: */
    /* 1. Bitmap scan to find a free slab   ← how much time? */
    /* 2. mmap() for a new SuperSlab        ← how much time? */
    /* 3. Metadata initialization           ← how much time? */
    /* 4. Slab carving / freelist setup     ← how much time? */
}
Tools:
perf record -e cycles -g --call-graph=dwarf -- ./larson_hakmem ...
perf report --stdio -g --no-children | grep superslab
Expected outcome: Find sub-bottleneck, get 10-20% speedup by optimizing it.
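If perf's attribution inside the function turns out to be too coarse, manual per-phase timing is an alternative. A minimal sketch, where the phase names mirror the breakdown above and every function name in the usage comment is hypothetical:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

enum { PHASE_BITMAP, PHASE_MMAP, PHASE_META, PHASE_CARVE, PHASE_COUNT };
static __thread uint64_t phase_ns[PHASE_COUNT];

/* Wrap one step of the refill path and accumulate its wall time. */
#define TIMED(phase, expr)                      \
    do {                                        \
        uint64_t t0_ = now_ns();                \
        expr;                                   \
        phase_ns[(phase)] += now_ns() - t0_;    \
    } while (0)

static void dump_phase_ns(void) {
    static const char *name[PHASE_COUNT] = {"bitmap", "mmap", "meta", "carve"};
    for (int i = 0; i < PHASE_COUNT; i++)
        fprintf(stderr, "%-6s %llu ns\n", name[i],
                (unsigned long long)phase_ns[i]);
}

/* Usage inside the refill path (all callee names hypothetical):
 *   TIMED(PHASE_BITMAP, slab = scan_bitmap_for_free_slab(ss));
 *   TIMED(PHASE_MMAP,   ss   = map_new_superslab());
 *   TIMED(PHASE_META,   init_superslab_metadata(ss));
 *   TIMED(PHASE_CARVE,  carve_slab_into_freelist(slab));
 */
```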
🔥 P1: Cache-Aware Refill (Next Week)
Goal: Reduce L1d miss rate from 12.88% to <10%
Approach:
1. Limit batch size so it fits in L1 alongside the working set
   - Current: REFILL_COUNT=32 (4 KB for the 128 B class)
   - Test: REFILL_COUNT=16 (2 KB)
   - Hypothesis: smaller batches = fewer misses
2. Prefetching (see the sketch after this list)
   - Prefetch the next batch while the current batch is being consumed
   - Reduces the cache-miss penalty
3. Adaptive batch sizing
   - Small batches when the working set is large
   - Large batches when the working set is small
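A minimal sketch of the prefetch idea, assuming the refill batch is handed out as a singly linked freelist; the struct layout and function name are illustrative, not hakmem's actual types:

```c
/* While block N is being handed out, touch block N+1 so it is already
 * in L1 by the time the next allocation asks for it. */
typedef struct block { struct block *next; } block_t;

static inline void *pop_with_prefetch(block_t **head) {
    block_t *b = *head;
    if (b) {
        *head = b->next;
        if (b->next)
            __builtin_prefetch(b->next, 1 /* for write */, 3 /* keep in L1 */);
    }
    return b;
}
```

The same pop site is where an adaptive policy could cap the batch size when recent miss rates climb.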
🔥 P2: Benchmark Diversity (Next 2 Weeks)
Problem: Larson is NOT representative
Larson characteristics:
- FIFO allocation pattern
- Fixed working set (1024 chunks)
- Predictable sizes (8-128B)
- High freelist hit rate
Need to test:
- Random allocation/free (not FIFO)
- Bursty allocations (malloc storms)
- Mixed lifetime (long-lived + short-lived)
- Variable sizes (less predictable)
Hypothesis: Other patterns may have different bottlenecks (refill frequency might matter more).
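A minimal sketch of the first missing pattern (random alloc/free with mixed lifetimes); slot count, size range, and iteration count are arbitrary choices for illustration:

```c
/* Random alloc/free microbenchmark: keep a table of live pointers and,
 * on each step, replace a random slot with a random-sized allocation.
 * This defeats the FIFO reuse pattern that Larson exhibits. */
#include <stdlib.h>

#define SLOTS  4096
#define ITERS  10000000L
#define MAX_SZ 128

int main(void) {
    static void *slot[SLOTS] = {0};
    srand(12345);
    for (long i = 0; i < ITERS; i++) {
        unsigned s = (unsigned)rand() % SLOTS;
        free(slot[s]);                                /* free(NULL) is a no-op */
        slot[s] = malloc(8 + (size_t)rand() % (MAX_SZ - 7));
    }
    for (unsigned s = 0; s < SLOTS; s++)
        free(slot[s]);
    return 0;
}
```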
🔥 P3: Fast Path Simplification (Phase 6 Goal)
Long-term vision: Eliminate superslab_refill from hot path
Approach:
1. Background refill thread (a minimal sketch follows this list)
   - Keep freelists pre-filled
   - Allocation never waits for superslab_refill
2. Lock-free slab exchange
   - Reduce atomic operations
   - Faster refill when needed
3. System tcache study
   - Understand why System malloc's fast path is 3-4 instructions
   - Adopt proven patterns
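As a rough illustration of the background-refill idea (a sketch under assumed names, not a design for hakmem): a helper thread keeps a bounded stash topped up, so the allocation path only falls back to the slow refill when the stash is momentarily empty.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define STASH_CAP 256

typedef struct {
    void *items[STASH_CAP];
    atomic_uint head, tail;   /* single-producer / single-consumer ring */
} stash_t;

/* Allocation path: take a pre-carved block if one is available. */
static bool stash_pop(stash_t *s, void **out) {
    unsigned h = atomic_load_explicit(&s->head, memory_order_relaxed);
    if (h == atomic_load_explicit(&s->tail, memory_order_acquire))
        return false;         /* empty: caller uses the slow path */
    *out = s->items[h % STASH_CAP];
    atomic_store_explicit(&s->head, h + 1, memory_order_release);
    return true;
}

/* Background thread: push newly carved blocks until the ring is full. */
static bool stash_push(stash_t *s, void *p) {
    unsigned t = atomic_load_explicit(&s->tail, memory_order_relaxed);
    if (t - atomic_load_explicit(&s->head, memory_order_acquire) == STASH_CAP)
        return false;         /* full: producer backs off */
    s->items[t % STASH_CAP] = p;
    atomic_store_explicit(&s->tail, t + 1, memory_order_release);
    return true;
}

/* slow_refill_one() stands in for whatever carves a block out of a
 * SuperSlab today (hypothetical name). Start the worker with e.g.
 * pthread_create(&tid, NULL, refill_worker, &stash). */
extern void *slow_refill_one(void);

static void *refill_worker(void *arg) {
    stash_t *s = (stash_t *)arg;
    for (;;) {
        void *p = slow_refill_one();
        while (!stash_push(s, p))
            sched_yield();    /* stash full: let consumers drain it */
    }
    return NULL;
}
```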
Key Metrics to Track
Performance
- Throughput: 4.19 M ops/s (Larson baseline)
- superslab_refill CPU: 28.56% → target <10%
- L1d miss rate: 12.88% → target <10%
- IPC: 1.93 → maintain or improve
Health
- Stability: Results should be consistent (±2%)
- Memory usage: Monitor RSS growth
- Fragmentation: Track over time
Data-Driven Checklist
Before ANY optimization:
- Profile with perf record -g
- Identify the TOP bottleneck (>5% CPU)
- Verify with perf stat (cache, branches, IPC)
- Test with MULTIPLE benchmarks (not just Larson)
- Document baseline metrics
- A/B test changes (at least 3 runs each)
- Verify improvements are statistically significant (see the sketch below)
Rule: If perf doesn't show it, don't optimize it.
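As one simple way to apply the last two checklist items, a sketch that compares two sets of runs by mean and standard deviation; the throughput numbers are placeholders, not measured data:

```c
/* Crude A/B check: report mean ± sample stddev for baseline and candidate
 * runs, and only treat the change as real if the intervals do not overlap. */
#include <math.h>
#include <stdio.h>

static void stats(const double *x, int n, double *mean, double *sd) {
    double s = 0, ss = 0;
    for (int i = 0; i < n; i++) { s += x[i]; ss += x[i] * x[i]; }
    *mean = s / n;
    *sd = sqrt((ss - s * s / n) / (n - 1));
}

int main(void) {
    double base[] = {4.19, 4.21, 4.17};   /* M ops/s, placeholder runs */
    double cand[] = {4.30, 4.28, 4.33};   /* placeholder runs */
    double mb, sb, mc, sc;
    stats(base, 3, &mb, &sb);
    stats(cand, 3, &mc, &sc);
    printf("baseline:  %.3f ± %.3f\ncandidate: %.3f ± %.3f\n", mb, sb, mc, sc);
    printf("non-overlapping: %s\n", (mc - sc > mb + sb) ? "yes" : "no");
    return 0;
}
```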
Lessons Learned
1. Profile first, optimize second
   - Task Teacher's intuition was wrong
   - The data revealed superslab_refill as the real bottleneck
2. Cache effects can reverse gains
   - More batching ≠ always faster
   - L1 cache is precious (32 KB)
3. Benchmarks lie
   - Larson has special properties (FIFO, stable working set)
   - Real workloads may differ significantly
4. Measure, don't guess
   - The memset "optimization" would have been wasted effort
   - perf shows what actually matters
Final Recommendation
STOP optimizing refill frequency. START optimizing superslab_refill.
The data is clear: superslab_refill is 28.56% of CPU time. That's where the wins are.
Questions? See full report: PHASE1_REFILL_INVESTIGATION.md