hakmem/docs/status/PHASE1_EXECUTIVE_SUMMARY.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51.0 M ops/s (with debug fprintf overhead)
After:  49.1 M ops/s (within run-to-run variance; debug fprintf compiled out of hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

2025-11-26 13:14:18 +09:00


# Phase 1 Quick Wins - Executive Summary
**TL;DR:** REFILL_COUNT optimization failed because we optimized the wrong thing. The real bottleneck is `superslab_refill` (28.56% CPU), not refill frequency.
---
## The Numbers
| REFILL_COUNT | Throughput | L1d Miss Rate | Verdict |
|--------------|------------|---------------|---------|
| **32** | **4.19 M/s** | **12.88%** | ✅ **OPTIMAL** |
| 64 | 3.89 M/s | 14.12% | ❌ -7.2% |
| 128 | 2.68 M/s | 16.08% | ❌ -36% |
---
## Root Causes
### 1. superslab_refill is the Bottleneck (28.56% CPU) ⭐⭐⭐⭐⭐
```
perf report (REFILL_COUNT=32):
28.56% superslab_refill ← THIS IS THE PROBLEM
3.10% [kernel] (various)
...
```
**Impact:** Even if we eliminate ALL refill overhead, max gain is 28.56%. In reality, we made it worse.
### 2. Cache Pollution from Large Batches ⭐⭐⭐⭐
```
REFILL_COUNT=32: L1d miss rate = 12.88%
REFILL_COUNT=128: L1d miss rate = 16.08% (+25% worse!)
```
**Why:**
- 128 blocks × 128 bytes = 16 KB
- L1 cache = 32 KB total
- Batch + working set > L1 capacity
- **Result:** More cache misses, slower performance
### 3. Refill Frequency Already Low ⭐⭐⭐
**Larson benchmark characteristics:**
- FIFO pattern with 1024 chunks per thread
- High TLS freelist hit rate
- Refills are **rare**, not frequent
**Implication:** Reducing refill frequency has minimal impact when refills are already uncommon.
### 4. memset is NOT in Hot Path ⭐
**Search results:**
```bash
memset found in:
- hakmem_tiny_init.inc (one-time init)
- hakmem_tiny_intel.inc (debug ring init)
```
**Conclusion:** memset removal would have **ZERO** impact on allocation performance.
---
## Why Task Teacher's +31% Projection Failed
**Expected:**
```
REFILL 32→128: reduce calls by 4x → +31% speedup
```
**Reality:**
```
REFILL 32→128: -36% slowdown
```
**Mistakes:**
1. ❌ Assumed refill is cheap (it's 28.56% of CPU)
2. ❌ Assumed refills are frequent (they're rare in Larson)
3. ❌ Ignored cache effects (L1d misses +25%)
4. ❌ Used Larson-specific pattern (not generalizable)
---
## Immediate Actions
### ✅ DO THIS NOW
1. **Keep REFILL_COUNT=32** (optimal for Larson)
2. **Focus on superslab_refill optimization** (28.56% CPU → biggest win)
3. **Profile superslab_refill internals:**
- Bitmap scanning
- mmap syscalls
- Metadata initialization
### ❌ DO NOT DO THIS
1. **DO NOT increase REFILL_COUNT to 64+** (causes cache pollution)
2. **DO NOT optimize memset** (not in hot path, waste of time)
3. **DO NOT trust Larson alone** (need diverse benchmarks)
---
## Next Steps (Priority Order)
### 🔥 P0: Superslab_refill Deep Dive (This Week)
**Hypothesis:** 28.56% CPU in one function is unacceptable. Break it down:
```c
superslab_refill() {
    // Profile each step:
    // 1. Bitmap scan to find a free slab   -- how much time?
    // 2. mmap() for a new SuperSlab        -- how much time?
    // 3. Metadata initialization           -- how much time?
    // 4. Slab carving / freelist setup     -- how much time?
}
```
**Tools:**
```bash
perf record -e cycles -g --call-graph=dwarf -- ./larson_hakmem ...
perf report --stdio -g --no-children | grep superslab
```
**Expected outcome:** Find sub-bottleneck, get 10-20% speedup by optimizing it.
---
### 🔥 P1: Cache-Aware Refill (Next Week)
**Goal:** Reduce L1d miss rate from 12.88% to <10%
**Approach:**
1. Limit batch size to fit in L1 with working set
- Current: REFILL_COUNT=32 (4KB for 128B class)
- Test: REFILL_COUNT=16 (2KB)
- Hypothesis: Smaller batches = fewer misses
2. Prefetching
- Prefetch next batch while using current batch
- Reduces cache miss penalty
3. Adaptive batch sizing
- Small batches when working set is large
- Large batches when working set is small
---
### 🔥 P2: Benchmark Diversity (Next 2 Weeks)
**Problem:** Larson is NOT representative
**Larson characteristics:**
- FIFO allocation pattern
- Fixed working set (1024 chunks)
- Predictable sizes (8-128B)
- High freelist hit rate
**Need to test:**
1. **Random allocation/free** (not FIFO)
2. **Bursty allocations** (malloc storms)
3. **Mixed lifetime** (long-lived + short-lived)
4. **Variable sizes** (less predictable)
**Hypothesis:** Other patterns may have different bottlenecks (refill frequency might matter more).
---
### 🔥 P3: Fast Path Simplification (Phase 6 Goal)
**Long-term vision:** Eliminate superslab_refill from hot path
**Approach:**
1. Background refill thread
- Keep freelists pre-filled
- Allocation never waits for superslab_refill
2. Lock-free slab exchange
- Reduce atomic operations
- Faster refill when needed
3. System tcache study
- Understand why System malloc is 3-4 instructions
- Adopt proven patterns
---
## Key Metrics to Track
### Performance
- **Throughput:** 4.19 M ops/s (Larson baseline)
- **superslab_refill CPU:** 28.56% → target <10%
- **L1d miss rate:** 12.88% → target <10%
- **IPC:** 1.93 → maintain or improve
### Health
- **Stability:** Results should be consistent (±2%)
- **Memory usage:** Monitor RSS growth
- **Fragmentation:** Track over time
---
## Data-Driven Checklist
Before ANY optimization:
- [ ] Profile with `perf record -g`
- [ ] Identify TOP bottleneck (>5% CPU)
- [ ] Verify with `perf stat` (cache, branches, IPC)
- [ ] Test with MULTIPLE benchmarks (not just Larson)
- [ ] Document baseline metrics
- [ ] A/B test changes (at least 3 runs each)
- [ ] Verify improvements are statistically significant
**Rule:** If perf doesn't show it, don't optimize it.
---
## Lessons Learned
1. **Profile first, optimize second**
- Task Teacher's intuition was wrong
- Data revealed superslab_refill as real bottleneck
2. **Cache effects can reverse gains**
- More batching ≠ always faster
- L1 cache is precious (32 KB)
3. **Benchmarks lie**
- Larson has special properties (FIFO, stable working set)
- Real workloads may differ significantly
4. **Measure, don't guess**
- memset "optimization" would have been wasted effort
- perf shows what actually matters
---
## Final Recommendation
**STOP** optimizing refill frequency.
**START** optimizing superslab_refill.
The data is clear: superslab_refill is 28.56% of CPU time. That's where the wins are.
---
**Questions? See full report:** `PHASE1_REFILL_INVESTIGATION.md`