Phase 1 Quick Wins - Executive Summary
TL;DR: REFILL_COUNT optimization failed because we optimized the wrong thing. The real bottleneck is superslab_refill (28.56% CPU), not refill frequency.
The Numbers
| REFILL_COUNT | Throughput | L1d Miss Rate | Verdict |
|---|---|---|---|
| 32 | 4.19 M/s | 12.88% | ✅ OPTIMAL |
| 64 | 3.89 M/s | 14.12% | ❌ -7.2% |
| 128 | 2.68 M/s | 16.08% | ❌ -36% |
Root Causes
1. superslab_refill is the Bottleneck (28.56% CPU) ⭐⭐⭐⭐⭐
perf report (REFILL_COUNT=32):
28.56% superslab_refill ← THIS IS THE PROBLEM
3.10% [kernel] (various)
...
Impact: Even if we eliminate ALL refill overhead, max gain is 28.56%. In reality, we made it worse.
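In throughput terms, Amdahl's law caps the win. Removing a fraction $f = 0.2856$ of total execution time yields at most

$$\text{speedup}_{\max} = \frac{1}{1-f} = \frac{1}{1-0.2856} \approx 1.40\times$$

so even a perfect, zero-cost refill is bounded at roughly +40% throughput, and anything short of full elimination lands well below that.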
2. Cache Pollution from Large Batches ⭐⭐⭐⭐
REFILL_COUNT=32: L1d miss rate = 12.88%
REFILL_COUNT=128: L1d miss rate = 16.08% (+25% worse!)
Why:
- 128 blocks × 128 bytes = 16 KB
- L1 cache = 32 KB total
- Batch + working set > L1 capacity
- Result: More cache misses, slower performance
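The arithmetic is easy to sanity-check. A minimal sketch (the 32 KB L1d and 128 B block size are the figures cited above, not values read from the allocator's headers):

```c
#include <stdio.h>

/* Back-of-the-envelope L1d footprint of one refill batch (illustrative). */
int main(void) {
    const size_t l1d_bytes   = 32 * 1024;          /* L1d size cited above */
    const size_t block_bytes = 128;                /* size class under test */
    const size_t counts[]    = {16, 32, 64, 128};  /* candidate REFILL_COUNTs */

    for (size_t i = 0; i < sizeof counts / sizeof counts[0]; i++) {
        size_t batch = counts[i] * block_bytes;
        printf("REFILL_COUNT=%3zu -> batch = %5zu B (%4.1f%% of L1d)\n",
               counts[i], batch, 100.0 * batch / l1d_bytes);
    }
    return 0;
}
```

At REFILL_COUNT=128 the batch alone claims half of L1d before the benchmark's own working set is counted, which matches the observed jump in miss rate.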
3. Refill Frequency Already Low ⭐⭐⭐
Larson benchmark characteristics:
- FIFO pattern with 1024 chunks per thread
- High TLS freelist hit rate
- Refills are rare, not frequent
Implication: Reducing refill frequency has minimal impact when refills are already uncommon.
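The shape of the fast path explains why. A hedged sketch of a generic TLS-freelist allocator (not hakmem's actual code; `tls_free_list` and `superslab_refill_batch` are illustrative names):

```c
/* Generic TLS fast path: the refill only runs when the list is empty. */
typedef struct block { struct block *next; } block_t;

static __thread block_t *tls_free_list;        /* per-thread freelist head */

extern block_t *superslab_refill_batch(void);  /* hypothetical slow path */

void *tiny_alloc(void) {
    block_t *b = tls_free_list;
    if (b) {                        /* hot path: one load, one store */
        tls_free_list = b->next;
        return b;
    }
    /* Cold path. In a FIFO workload like Larson, frees repopulate the
     * list faster than allocations drain it, so we rarely get here. */
    tls_free_list = superslab_refill_batch();
    if ((b = tls_free_list) != NULL)
        tls_free_list = b->next;
    return b;
}
```

When the cold branch is almost never taken, changing how much it fetches per call barely moves the needle, which is exactly what the REFILL_COUNT sweep showed.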
4. memset is NOT in Hot Path ⭐
Search results:
memset found in:
- hakmem_tiny_init.inc (one-time init)
- hakmem_tiny_intel.inc (debug ring init)
Conclusion: memset removal would have ZERO impact on allocation performance.
Why Task Teacher's +31% Projection Failed
Expected:
REFILL 32→128: reduce calls by 4x → +31% speedup
Reality:
REFILL 32→128: -36% slowdown
Mistakes:
- ❌ Assumed refill is cheap (it's 28.56% of CPU)
- ❌ Assumed refills are frequent (they're rare in Larson)
- ❌ Ignored cache effects (L1d misses +25%)
- ❌ Used Larson-specific pattern (not generalizable)
Immediate Actions
✅ DO THIS NOW
- Keep REFILL_COUNT=32 (optimal for Larson)
- Focus on superslab_refill optimization (28.56% CPU → biggest win)
- Profile superslab_refill internals:
- Bitmap scanning
- mmap syscalls
- Metadata initialization
❌ DO NOT DO THIS
- DO NOT increase REFILL_COUNT to 64+ (causes cache pollution)
- DO NOT optimize memset (not in hot path, waste of time)
- DO NOT trust Larson alone (need diverse benchmarks)
Next Steps (Priority Order)
🔥 P0: Superslab_refill Deep Dive (This Week)
Hypothesis: 28.56% CPU in one function is unacceptable. Break it down:
```c
superslab_refill() {
    // Profile each step:
    // 1. Bitmap scan to find a free slab    ← how much time?
    // 2. mmap() for a new SuperSlab         ← how much time?
    // 3. Metadata initialization            ← how much time?
    // 4. Slab carving / freelist setup      ← how much time?
}
```
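One cheap way to get per-step numbers before (or alongside) perf is to bracket each phase with a monotonic clock. A sketch; the four accumulators mirror the suspected phases above, and the phase functions named in the comment are placeholders:

```c
#include <stdint.h>
#include <time.h>

/* Nanosecond timestamp for coarse phase-level attribution. */
static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* One accumulator per suspected sub-bottleneck (names illustrative). */
static uint64_t ns_bitmap_scan, ns_mmap, ns_metadata, ns_carve;

#define TIMED(acc, stmt) do {        \
        uint64_t t0_ = now_ns();     \
        stmt;                        \
        (acc) += now_ns() - t0_;     \
    } while (0)

/* Inside superslab_refill() this would look like:
 *     TIMED(ns_bitmap_scan, slab = find_free_slab_in_bitmap());
 *     TIMED(ns_mmap,        base = map_new_superslab());
 *     TIMED(ns_metadata,    init_superslab_metadata(base));
 *     TIMED(ns_carve,       carve_slab_into_freelist(slab));
 * Dump the four counters at exit to see which phase dominates. */
```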
Tools:
perf record -e cycles -g --call-graph=dwarf -- ./larson_hakmem ...
perf report --stdio -g --no-children | grep superslab
Expected outcome: Find sub-bottleneck, get 10-20% speedup by optimizing it.
🔥 P1: Cache-Aware Refill (Next Week)
Goal: Reduce L1d miss rate from 12.88% to <10%
Approach:
1. Limit batch size to fit in L1 alongside the working set
   - Current: REFILL_COUNT=32 (4 KB for the 128 B class)
   - Test: REFILL_COUNT=16 (2 KB)
   - Hypothesis: smaller batches = fewer misses
2. Prefetching (see the sketch after this list)
   - Prefetch the next batch while using the current batch
   - Reduces the cache-miss penalty
3. Adaptive batch sizing
   - Small batches when the working set is large
   - Large batches when the working set is small
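For the prefetching item, a minimal sketch using the GCC/Clang builtin (assumes batch blocks are linked through a `next` pointer; `warm_batch` is an illustrative name, not hakmem's refill code):

```c
/* Walk a freshly refilled batch, prefetching one node ahead so the
 * next block's cache line is in flight while the current one is used. */
typedef struct block { struct block *next; } block_t;

void warm_batch(block_t *head) {
    for (block_t *b = head; b; b = b->next) {
        if (b->next)
            __builtin_prefetch(b->next, 1 /* for write */, 3 /* keep in L1 */);
        /* ... hand b to the TLS freelist / caller here ... */
    }
}
```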
🔥 P2: Benchmark Diversity (Next 2 Weeks)
Problem: Larson is NOT representative
Larson characteristics:
- FIFO allocation pattern
- Fixed working set (1024 chunks)
- Predictable sizes (8-128B)
- High freelist hit rate
Need to test:
- Random allocation/free (not FIFO)
- Bursty allocations (malloc storms)
- Mixed lifetime (long-lived + short-lived)
- Variable sizes (less predictable)
Hypothesis: Other patterns may have different bottlenecks (refill frequency might matter more).
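As a starting point, a minimal random-order stressor to contrast with Larson's FIFO pattern (a sketch; it calls plain malloc/free so it can be pointed at any allocator via LD_PRELOAD):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

/* Random alloc/free: keep SLOTS live pointers; each step frees and
 * reallocates a random slot at a random size in 8..128 B. Unlike
 * Larson's FIFO, both order and lifetime are unpredictable. */
enum { SLOTS = 1024, STEPS = 10 * 1000 * 1000 };

int main(void) {
    static void *slot[SLOTS];
    unsigned seed = 12345;
    for (int i = 0; i < STEPS; i++) {
        unsigned s  = rand_r(&seed) % SLOTS;
        size_t   sz = 8 + rand_r(&seed) % 121;   /* 8..128 bytes */
        free(slot[s]);                           /* free(NULL) is a no-op */
        if (!(slot[s] = malloc(sz))) return 1;
    }
    for (unsigned s = 0; s < SLOTS; s++) free(slot[s]);
    puts("done");
    return 0;
}
```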
🔥 P3: Fast Path Simplification (Phase 6 Goal)
Long-term vision: Eliminate superslab_refill from hot path
Approach:
1. Background refill thread (a rough sketch follows this list)
   - Keep freelists pre-filled
   - Allocation never waits for superslab_refill
2. Lock-free slab exchange
   - Reduce atomic operations
   - Faster refill when needed
3. System tcache study
   - Understand why System malloc's fast path is only 3-4 instructions
   - Adopt proven patterns
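A rough shape for the background-refill idea, under heavy assumptions (single shared depot, pthreads, illustrative names; a real design would want per-size-class depots and lock-free handoff):

```c
#include <pthread.h>

typedef struct block { struct block *next; } block_t;

extern block_t *superslab_refill_batch(void);   /* the expensive call */

enum { DEPOT_TARGET = 4 };                      /* batches kept ready */

static pthread_mutex_t depot_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  depot_low  = PTHREAD_COND_INITIALIZER;
static block_t *depot[DEPOT_TARGET];            /* pre-filled batch heads */
static int      depot_count;

/* Background thread: pays the superslab_refill cost off the hot path. */
static void *refill_worker(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&depot_lock);
        while (depot_count >= DEPOT_TARGET)
            pthread_cond_wait(&depot_low, &depot_lock);
        pthread_mutex_unlock(&depot_lock);

        block_t *batch = superslab_refill_batch();  /* slow, no lock held */

        pthread_mutex_lock(&depot_lock);
        depot[depot_count++] = batch;
        pthread_mutex_unlock(&depot_lock);
    }
    return NULL;
}

/* Allocating threads grab a whole pre-filled batch instead of refilling. */
block_t *take_prefilled_batch(void) {
    pthread_mutex_lock(&depot_lock);
    block_t *batch = depot_count ? depot[--depot_count] : NULL;
    pthread_cond_signal(&depot_low);            /* tell the worker to top up */
    pthread_mutex_unlock(&depot_lock);
    return batch;   /* NULL -> caller falls back to inline refill */
}
```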
Key Metrics to Track
Performance
- Throughput: 4.19 M ops/s (Larson baseline)
- superslab_refill CPU: 28.56% → target <10%
- L1d miss rate: 12.88% → target <10%
- IPC: 1.93 → maintain or improve
Health
- Stability: Results should be consistent (±2%)
- Memory usage: Monitor RSS growth
- Fragmentation: Track over time
Data-Driven Checklist
Before ANY optimization:
- Profile with `perf record -g`
- Identify the TOP bottleneck (>5% CPU)
- Verify with `perf stat` (cache, branches, IPC)
- Test with MULTIPLE benchmarks (not just Larson)
- Document baseline metrics
- A/B test changes (at least 3 runs each)
- Verify improvements are statistically significant
Rule: If perf doesn't show it, don't optimize it.
Lessons Learned
1. Profile first, optimize second
   - Task Teacher's intuition was wrong
   - Data revealed superslab_refill as the real bottleneck
2. Cache effects can reverse gains
   - More batching ≠ always faster
   - L1 cache is precious (32 KB)
3. Benchmarks lie
   - Larson has special properties (FIFO, stable working set)
   - Real workloads may differ significantly
4. Measure, don't guess
   - The memset "optimization" would have been wasted effort
   - perf shows what actually matters
Final Recommendation
STOP optimizing refill frequency. START optimizing superslab_refill.
The data is clear: superslab_refill is 28.56% of CPU time. That's where the wins are.
Questions? See full report: PHASE1_REFILL_INVESTIGATION.md