# Phase 7 Full Benchmark Suite Execution Plan
**Date**: 2025-11-08
**Phase**: 7-1.3 (HEADER_CLASSIDX=1 optimization)
**Current Status**: Partial results available (Larson 1T: 2.63M ops/s, bench_random_mixed 128B: 17.7M ops/s)
**Goal**: Comprehensive performance evaluation across ALL benchmark patterns
---
## Executive Summary
### Available Benchmarks (5 categories)
1. **Larson** - Multi-threaded stress test (8-128B, mimalloc-bench derived)
2. **Random Mixed** - Single-threaded random allocation (16-8192B)
3. **Mid-Large MT** - Multi-threaded mid-size (8-32KB)
4. **VM Mixed** - Large allocations (512KB-2MB, L2.5/L2 test)
5. **Tiny Hot** - Hot path micro-benchmark (8-64B, LIFO)
### Current Build Status (Phase 7 = HEADER_CLASSIDX=1)
All benchmarks were built with HEADER_CLASSIDX=1 on 2025-11-07/08:
- `larson_hakmem` (2025-11-08 11:48)
- `bench_random_mixed_hakmem` (2025-11-08 11:48)
- `bench_mid_large_mt_hakmem` (2025-11-07 18:42)
- `bench_tiny_hot_hakmem` (2025-11-07 18:03)
- `bench_vm_mixed_hakmem` (2025-11-07 18:03)
**Note**: The Makefile has `HAKMEM_TINY_HEADER_CLASSIDX=1` permanently enabled (lines 99-100).
---
## Execution Plan
### Phase 1: Verify Build Status (5 minutes)
**Verify HEADER_CLASSIDX=1 is enabled:**
```bash
# Check Makefile flag
grep "HAKMEM_TINY_HEADER_CLASSIDX" Makefile
# Verify all binaries are up-to-date
make -n bench_random_mixed_hakmem bench_tiny_hot_hakmem \
bench_mid_large_mt_hakmem bench_vm_mixed_hakmem \
larson_hakmem
```
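As a complement to the `make -n` check, the following is a minimal sketch of a helper that confirms the expected binaries exist and are executable before any run starts. The function name `check_binaries` is hypothetical and not part of the repository.

```bash
# Hypothetical helper: report any expected binary that is missing or not executable.
# Usage: check_binaries DIR NAME...
check_binaries() {
  cb_dir=$1; shift
  cb_missing=0
  for cb_name in "$@"; do
    if [ ! -x "$cb_dir/$cb_name" ]; then
      echo "MISSING: $cb_name"
      cb_missing=$((cb_missing + 1))
    fi
  done
  [ "$cb_missing" -eq 0 ]   # exit status 0 only if every binary was found
}

# Example (run from the repository root):
# check_binaries . larson_hakmem bench_random_mixed_hakmem \
#   bench_mid_large_mt_hakmem bench_vm_mixed_hakmem bench_tiny_hot_hakmem
```

A non-zero exit status means at least one binary needs a rebuild.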
**If rebuild needed:**
```bash
# Clean rebuild with HEADER_CLASSIDX=1 (already default)
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system bench_random_mixed_mi \
bench_tiny_hot_hakmem bench_tiny_hot_system bench_tiny_hot_mi \
bench_mid_large_mt_hakmem bench_mid_large_mt_system bench_mid_large_mt_mi \
bench_vm_mixed_hakmem bench_vm_mixed_system \
larson_hakmem larson_system larson_mi
```
**Time**: ~3-5 minutes (if rebuild needed)
---
### Phase 2: Quick Sanity Test (2 minutes)
**Test each benchmark runs successfully:**
```bash
# Larson (1T, 1 second)
./larson_hakmem 1 8 128 1024 1 12345 1
# Random Mixed (small run)
./bench_random_mixed_hakmem 1000 128 1234567
# Mid-Large MT (2 threads, small)
./bench_mid_large_mt_hakmem 2 1000 2048 42
# VM Mixed (small)
./bench_vm_mixed_hakmem 100 256 424242
# Tiny Hot (small)
./bench_tiny_hot_hakmem 32 10 1000
```
**Expected**: All benchmarks run without SEGV/crashes.
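
To make crashes easy to spot, each sanity command can be wrapped in a small classifier. This is a sketch only: `run_sanity` is a hypothetical name, and it relies on the shell convention that a process killed by signal N reports exit status 128+N (139 for SIGSEGV on Linux).

```bash
# Hypothetical helper: run a command quietly and classify how it ended.
# Prints "ok", "fail(<code>)" or "signal(<n>)".
run_sanity() {
  rs_rc=0
  "$@" >/dev/null 2>&1 || rs_rc=$?
  if [ "$rs_rc" -eq 0 ]; then
    echo "ok"
  elif [ "$rs_rc" -gt 128 ]; then
    echo "signal($((rs_rc - 128)))"   # e.g. 11 = SIGSEGV
  else
    echo "fail($rs_rc)"
  fi
}

# Example:
# run_sanity ./larson_hakmem 1 8 128 1024 1 12345 1
```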
---
### Phase 3: Full Benchmark Suite Execution
#### Option A: Automated Suite Runner (RECOMMENDED) ⭐
**Use existing bench_suite_matrix.sh:**
```bash
# This runs ALL benchmarks (random_mixed, mid_large_mt, vm_mixed, tiny_hot)
# across system/mimalloc/HAKMEM variants
./scripts/bench_suite_matrix.sh
```
**Output**:
- CSV: `bench_results/suite/<timestamp>/results.csv`
- Raw logs: `bench_results/suite/<timestamp>/raw/*.out`
**Time**: ~15-20 minutes
**Coverage**:
- Random Mixed: 2 cycles × 2 ws × 3 variants = 12 runs
- Mid-Large MT: 2 threads × 3 variants = 6 runs
- VM Mixed: 2 cycles × 2 variants = 4 runs (system + hakmem only)
- Tiny Hot: 2 sizes × 3 variants = 6 runs
**Total**: 28 benchmark runs
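
The run count follows from multiplying parameter combinations by allocator variants. The arithmetic can be checked with a short sketch (variable names are illustrative):

```bash
# Runs per benchmark = parameter combinations x allocator variants
random_mixed=$((2 * 2 * 3))   # 2 cycle counts x 2 working sets x 3 variants
mid_large_mt=$((2 * 3))       # 2 thread counts x 3 variants
vm_mixed=$((2 * 2))           # 2 cycle counts x 2 variants (system + hakmem)
tiny_hot=$((2 * 3))           # 2 sizes x 3 variants
total=$((random_mixed + mid_large_mt + vm_mixed + tiny_hot))
echo "total runs: $total"
```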
---
#### Option B: Individual Benchmark Scripts (Detailed Analysis)
If you need more control or want to run A/B tests with environment variables:
##### 3.1 Larson Benchmark (Multi-threaded Stress)
**Basic run (1T, 4T, 8T):**
```bash
# 1 thread, 10 seconds
HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 1
# 4 threads, 10 seconds (CRITICAL: test multi-thread stability)
HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 4
# 8 threads, 10 seconds
HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 8
```
**A/B test with environment variables:**
```bash
# Use automated script (includes PGO)
./scripts/bench_larson_1t_ab.sh
```
**Output**: `bench_results/larson_ab/<timestamp>/results.csv`
**Time**: ~20-30 minutes (includes PGO build)
**Key Metrics**:
- Throughput (ops/s)
- Stability (4T should not crash - see Phase 6-2.3 active counter fix)
---
##### 3.2 Random Mixed (Single-threaded, Mixed Sizes)
**Basic run:**
```bash
# 400K cycles, 8192B working set
HAKMEM_WRAP_TINY=1 ./bench_random_mixed_hakmem 400000 8192 1234567
./bench_random_mixed_system 400000 8192 1234567
./bench_random_mixed_mi 400000 8192 1234567
```
**A/B test with environment variables:**
```bash
# Runs 5 repetitions, median calculation
./scripts/bench_random_mixed_ab.sh
```
**Output**: `bench_results/random_mixed_ab/<timestamp>/results.csv`
**Time**: ~15-20 minutes (5 reps × multiple configs)
**Key Metrics**:
- Throughput (ops/s) across different working set sizes
- SPECIALIZE_MASK impact (0 vs 0x0F)
- FAST_CAP impact (8 vs 16 vs 32)
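
The A/B script reports a median over 5 repetitions. When aggregating raw throughput numbers by hand, the same statistic can be computed with a sketch like the one below. The `median` function is hypothetical and is not taken from `bench_random_mixed_ab.sh`.

```bash
# Hypothetical helper: print the median of the numbers given as arguments.
median() {
  printf '%s\n' "$@" | LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]                   # odd count: middle value
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2   # even count: mean of the two middle values
    }'
}

# Example with made-up throughput values (M ops/s):
# median 17.7 16.9 18.2 17.1 17.5
```

The median is less sensitive than the mean to a single slow run caused by system noise.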
---
##### 3.3 Mid-Large MT (Multi-threaded, 8-32KB)
**Basic run:**
```bash
# 4 threads, 40K cycles, 2KB working set
HAKMEM_WRAP_TINY=1 ./bench_mid_large_mt_hakmem 4 40000 2048 42
./bench_mid_large_mt_system 4 40000 2048 42
./bench_mid_large_mt_mi 4 40000 2048 42
```
**A/B test:**
```bash
./scripts/bench_mid_large_mt_ab.sh
```
**Output**: `bench_results/mid_large_mt_ab/<timestamp>/results.csv`
**Time**: ~10-15 minutes
**Key Metrics**:
- Multi-threaded performance (2T vs 4T)
- HAKMEM's SuperSlab efficiency (expected: strong performance here)
**Note**: Previous results showed HAKMEM underperforming here (suite/20251107: 2.1M vs system 8.7M ops/s).
This is unexpected given the earlier Mid-Large benchmark success (+108% on 2025-11-02).
Investigate whether this is a regression or a different test pattern.
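
To put the gap in the note in the same terms as the later comparisons, here is a minimal sketch of the relative-difference formula `(hakmem / baseline - 1) * 100`. The function name `pct_delta` is hypothetical.

```bash
# Hypothetical helper: relative difference of HAKMEM vs a baseline, in percent.
pct_delta() {
  awk -v h="$1" -v s="$2" 'BEGIN { printf "%+.1f%%\n", (h / s - 1) * 100 }'
}

pct_delta 2.1 8.7   # suite/20251107 figures from the note: prints -75.9%
```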
---
##### 3.4 VM Mixed (Large Allocations, 512KB-2MB)
**Basic run:**
```bash
# 20K cycles, 256 working set
HAKMEM_BIGCACHE_L25=1 HAKMEM_WRAP_TINY=1 ./bench_vm_mixed_hakmem 20000 256 424242
./bench_vm_mixed_system 20000 256 424242
```
**Time**: ~5 minutes
**Key Metrics**:
- L2.5 cache effectiveness (BIGCACHE_L25=1 vs 0)
- Large allocation performance
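
Comparing `BIGCACHE_L25=1` against `0` means running the same command once per setting. A small sketch of such a sweep follows; `ab_env` is a hypothetical helper, not an existing script.

```bash
# Hypothetical helper: run a command once for each value of an environment variable.
# Usage: ab_env VAR "VAL1 VAL2 ..." command [args...]
ab_env() {
  ab_var=$1; ab_vals=$2; shift 2
  for ab_val in $ab_vals; do
    echo "== ${ab_var}=${ab_val}"
    env "${ab_var}=${ab_val}" "$@"
  done
}

# Example:
# ab_env HAKMEM_BIGCACHE_L25 "0 1" \
#   env HAKMEM_WRAP_TINY=1 ./bench_vm_mixed_hakmem 20000 256 424242
```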
---
##### 3.5 Tiny Hot (Hot Path Micro-benchmark)
**Basic run:**
```bash
# 32B, 100 batch, 60K cycles
HAKMEM_WRAP_TINY=1 ./bench_tiny_hot_hakmem 32 100 60000
./bench_tiny_hot_system 32 100 60000
./bench_tiny_hot_mi 32 100 60000
# 64B
HAKMEM_WRAP_TINY=1 ./bench_tiny_hot_hakmem 64 100 60000
./bench_tiny_hot_system 64 100 60000
./bench_tiny_hot_mi 64 100 60000
```
**Time**: ~5 minutes
**Key Metrics**:
- Hot path efficiency (direct TLS cache access)
- Expected weakness (Phase 6 analysis: -60% vs system)
---
### Phase 4: Analysis and Comparison
#### 4.1 Extract Results from Suite Run
```bash
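# Get latest suite results
latest=$(ls -td bench_results/suite/* | head -1)
cat "${latest}/results.csv"
# Quick comparison: mean throughput per benchmark and variant
# (columns assumed: $1=benchmark, $2=variant, $4=throughput)
awk -F, 'NR>1 {
  sum[$1,$2] += $4
  n[$1,$2]++            # count rows per (benchmark, variant), not per benchmark
  bench[$1] = 1
} END {
  for (b in bench) {
    h = n[b,"hakmem"] ? sum[b,"hakmem"]/n[b,"hakmem"] : 0
    s = n[b,"system"] ? sum[b,"system"]/n[b,"system"] : 0
    m = n[b,"mi"]     ? sum[b,"mi"]/n[b,"mi"]         : 0
    printf "%s: HAKMEM=%.2fM system=%.2fM mi=%.2fM", b, h/1e6, s/1e6, m/1e6
    # Skip ratios for missing variants (e.g. VM Mixed has no mimalloc run)
    if (s > 0) printf " (vs_sys=%+.1f%%)", (h/s-1)*100
    if (m > 0) printf " (vs_mi=%+.1f%%)", (h/m-1)*100
    printf "\n"
  }
}' "${latest}/results.csv"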
```
#### 4.2 Key Comparisons
**Phase 7 vs System malloc:**
```bash
# Extract HAKMEM vs system for each benchmark
awk -F, 'NR>1 && ($2=="hakmem" || $2=="system") {
key=$1 "," $3
if ($2=="hakmem") h[key]=$4
if ($2=="system") s[key]=$4
} END {
for (k in h) {
if (s[k]) {
pct = (h[k]/s[k] - 1) * 100
printf "%s: %.2fM vs %.2fM (%+.1f%%)\n", k, h[k]/1e6, s[k]/1e6, pct
}
}
}' "${latest}/results.csv" | sort
```
**Phase 7 vs mimalloc:**
```bash
# Similar for mimalloc comparison
awk -F, 'NR>1 && ($2=="hakmem" || $2=="mi") {
key=$1 "," $3
if ($2=="hakmem") h[key]=$4
if ($2=="mi") m[key]=$4
} END {
for (k in h) {
if (m[k]) {
pct = (h[k]/m[k] - 1) * 100
printf "%s: %.2fM vs %.2fM (%+.1f%%)\n", k, h[k]/1e6, m[k]/1e6, pct
}
}
}' "${latest}/results.csv" | sort
```
#### 4.3 Generate Summary Report
```bash
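# Create comprehensive summary
# (unquoted delimiter so that $(date ...) and ${latest} are expanded;
#  assumes latest is still set from section 4.1)
cat > PHASE7_RESULTS_SUMMARY.md << REPORT
# Phase 7 Benchmark Results Summary
## Test Configuration
- Phase: 7-1.3 (HEADER_CLASSIDX=1)
- Date: $(date +%Y-%m-%d)
- Suite: $(basename "${latest}")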
## Overall Results
### Random Mixed (16-8192B, single-threaded)
[Insert results here]
### Mid-Large MT (8-32KB, multi-threaded)
[Insert results here]
### VM Mixed (512KB-2MB, large allocations)
[Insert results here]
### Tiny Hot (8-64B, hot path micro)
[Insert results here]
### Larson (8-128B, multi-threaded stress)
[Insert results here]
## Analysis
### Strengths
[Areas where HAKMEM outperforms]
### Weaknesses
[Areas where HAKMEM underperforms]
### Comparison with Previous Phases
[Phase 6 vs Phase 7 delta]
## Bottleneck Identification
[Performance profiling with perf]
REPORT
```
---
### Phase 5: Performance Profiling (Optional, if bottlenecks found)
**Profile hot paths with perf:**
```bash
# Profile random_mixed (if slow)
perf record -g --call-graph dwarf -- \
./bench_random_mixed_hakmem 400000 8192 1234567
perf report --stdio > perf_random_mixed_phase7.txt
# Profile larson 1T
perf record -g --call-graph dwarf -- \
./larson_hakmem 10 8 128 1024 1 12345 1
perf report --stdio > perf_larson_1t_phase7.txt
```
**Compare with Phase 6:**
```bash
# If you have Phase 6 binaries saved, run side-by-side
# and compare perf reports
```
---
## Expected Results & Analysis Strategy
### Baseline Expectations (from Phase 6 analysis)
#### Strong Areas (Expected +50% to +171% vs System)
1. **Mid-Large (8-32KB)**: HAKMEM's SuperSlab should dominate
- Expected: +100% to +150% vs system
- Phase 7 improvement target: Maintain or improve
2. **Large Allocations (VM Mixed)**: L2.5 layer efficiency
- Expected: Competitive or slight win vs system
#### Weak Areas (Expected -20% to -60% vs System)
1. **Tiny (≤128B)**: Structural weakness identified in Phase 6
- Expected: -40% to -60% vs system
- Phase 7 HEADER_CLASSIDX may help: +10-20% improvement
2. **Random Mixed**: Magazine layer overhead
- Expected: -20% to -50% vs system
- Phase 7 target: Reduce gap
3. **Larson Multi-thread**: Contention issues
- Expected: Variable (1T: ok, 4T+: risk of crashes)
- Phase 7 critical: Verify 4T stability (active counter fix)
### What to Look For
#### Phase 7 Improvements (HEADER_CLASSIDX=1)
- **Tiny allocations**: +10-30% improvement (fewer header loads)
- **Random mixed**: +15-25% improvement (class_idx in header)
- **Cache efficiency**: Better locality (1-byte header vs 2-byte)
#### Red Flags
- **Mid-Large regression**: Should NOT regress (HEADER_CLASSIDX doesn't affect mid-large path)
- **4T+ crashes in Larson**: Active counter bug should be fixed (Phase 6-2.3)
- **Severe regression (>20%)**: Investigate immediately
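
The ">20%" rule can be applied mechanically once Phase 6 and Phase 7 numbers sit side by side. The sketch below assumes a hypothetical three-column input (`benchmark,phase6_ops,phase7_ops`). This is not the format of `results.csv`, and `flag_regressions` is an illustrative name.

```bash
# Hypothetical helper: read "benchmark,phase6_ops,phase7_ops" lines on stdin and
# print benchmarks whose throughput dropped by more than THRESHOLD percent (default 20).
flag_regressions() {
  awk -F, -v t="${1:-20}" '
    $2 > 0 {
      delta = ($3 / $2 - 1) * 100
      if (delta < -t) printf "REGRESSION %s %+.1f%%\n", $1, delta
    }'
}

# Example with made-up numbers:
# printf '%s\n' 'mid_large_mt,8000000,6000000' 'tiny_hot,5000000,5500000' | flag_regressions
```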
#### Bottleneck Identification
If Phase 7 results are disappointing:
1. **Run perf** on slow benchmarks
2. **Compare with Phase 6** perf profiles (if available)
3. **Check hot paths**:
- `tiny_alloc_fast()` - Should be 3-4 instructions
- `tiny_free_fast()` - Should be fast header check
- `superslab_refill()` - Should use P0 ctz optimization
---
## Time Estimates
### Minimal Run (Option A: Suite Script Only)
- Build verification: 2 min
- Sanity test: 2 min
- Suite execution: 15-20 min
- Quick analysis: 5 min
- **Total: ~25-30 minutes**
### Comprehensive Run (Option B: All Individual Scripts)
- Build verification: 2 min
- Sanity test: 2 min
- Larson A/B: 25 min
- Random Mixed A/B: 20 min
- Mid-Large MT A/B: 15 min
- VM Mixed: 5 min
- Tiny Hot: 5 min
- Analysis & report: 15 min
- **Total: ~90 minutes (1.5 hours)**
### With Performance Profiling
- Add: ~20-30 min per benchmark
- **Total: ~2-3 hours**
---
## Recommended Execution Order
### Quick Assessment (30 minutes)
1. ✅ Verify build status
2. ✅ Run suite script (bench_suite_matrix.sh)
3. ✅ Generate quick comparison
4. 🔍 Identify major wins/losses
5. 📝 Decide if deep dive needed
### Deep Analysis (if needed, +60 minutes)
1. 🔬 Run individual A/B scripts for problem areas
2. 📊 Profile with perf
3. 📝 Compare with Phase 6 baseline
4. 💡 Generate actionable insights
---
## Output Organization
```
bench_results/
├── suite/
│ └── <timestamp>/
│ ├── results.csv # All benchmarks, all variants
│ └── raw/*.out # Raw logs
├── random_mixed_ab/
│ └── <timestamp>/
│ ├── results.csv # A/B test results
│ └── raw/*.txt # Per-run data
├── larson_ab/
│ └── <timestamp>/
│ ├── results.csv
│ └── raw/*.out
├── mid_large_mt_ab/
│ └── <timestamp>/
│ ├── results.csv
│ └── raw/*.out
└── ...
# Analysis reports
PHASE7_RESULTS_SUMMARY.md # High-level summary
PHASE7_DETAILED_ANALYSIS.md # Deep dive (if needed)
perf_*.txt # Performance profiles
```
---
## Next Steps After Benchmark
### If Phase 7 Shows Strong Results (+30-50% overall)
1. ✅ Commit and document improvements
2. 🎯 Focus on remaining weak areas (Tiny allocations)
3. 📢 Prepare performance summary for stakeholders
### If Phase 7 Shows Modest Results (+10-20% overall)
1. 🔍 Identify specific bottlenecks (perf profiling)
2. 🧪 Test individual optimizations in isolation
3. 📊 Compare with Phase 6 to ensure no regressions
### If Phase 7 Shows Regressions (any area -10% or worse)
1. 🚨 Immediate investigation
2. 🔄 Bisect to find regression point
3. 🧪 Consider reverting HEADER_CLASSIDX if severe
---
## Quick Reference Commands
```bash
# Full suite (automated)
./scripts/bench_suite_matrix.sh
# Individual benchmarks (quick test)
./larson_hakmem 1 8 128 1024 1 12345 1
./bench_random_mixed_hakmem 400000 8192 1234567
./bench_mid_large_mt_hakmem 4 40000 2048 42
./bench_vm_mixed_hakmem 20000 256 424242
./bench_tiny_hot_hakmem 32 100 60000
# A/B tests (environment variable sweeps)
./scripts/bench_larson_1t_ab.sh
./scripts/bench_random_mixed_ab.sh
./scripts/bench_mid_large_mt_ab.sh
# Latest results
ls -td bench_results/suite/* | head -1
cat $(ls -td bench_results/suite/* | head -1)/results.csv
# Performance profiling
perf record -g --call-graph dwarf -- ./bench_random_mixed_hakmem 400000 8192 1234567
perf report --stdio > perf_output.txt
```
---
## Key Success Metrics
### Primary Goal: Overall Improvement
- **Target**: +20-30% average throughput vs Phase 6
- **Minimum**: No regressions in mid-large (HAKMEM's strength)
### Secondary Goals:
1. **Stability**: 4T+ Larson runs without crashes
2. **Tiny improvement**: -40% to -50% vs system (from -60%)
3. **Random mixed improvement**: -10% to -20% vs system (from -30%+)
### Stretch Goals:
1. **Mid-large dominance**: Maintain +100% vs system
2. **Overall parity**: Match or beat system malloc on average
3. **Consistency**: No severe outliers (no single test <50% of system)
---
**Document Version**: 1.0
**Created**: 2025-11-08
**Author**: Claude (Task Agent)
**Status**: Ready for execution