Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
321 lines
6.9 KiB
Markdown
321 lines
6.9 KiB
Markdown
# Mid Range MT Benchmark Scripts
|
||
|
||
Collection of scripts for testing and comparing the Mid Range MT allocator (8-32KB) performance.
|
||
|
||
---
|
||
|
||
## Quick Start
|
||
|
||
### Basic Performance Test
|
||
```bash
|
||
# Run with optimal default settings (4 threads, 5 runs)
|
||
./scripts/run_mid_mt_bench.sh
|
||
|
||
# Expected result: 95-99 M ops/sec
|
||
```
|
||
|
||
### Compare Against Other Allocators
|
||
```bash
|
||
# Compare HAKX vs mimalloc vs system allocator
|
||
./scripts/compare_mid_mt_allocators.sh
|
||
|
||
# Expected result: HAKX ~1.87x faster than glibc
|
||
```
|
||
|
||
---
|
||
|
||
## Scripts
|
||
|
||
### 1. `run_mid_mt_bench.sh`
|
||
|
||
**Purpose**: Run Mid MT benchmark with optimal configuration
|
||
|
||
**Usage**:
|
||
```bash
|
||
./scripts/run_mid_mt_bench.sh [threads] [cycles] [ws] [seed] [runs]
|
||
```
|
||
|
||
**Parameters**:
|
||
- `threads`: Number of threads (default: 4)
|
||
- `cycles`: Iterations per thread (default: 60000)
|
||
- `ws`: Working set size (default: 256)
|
||
- `seed`: Random seed (default: 1)
|
||
- `runs`: Number of benchmark runs (default: 5)
|
||
|
||
**Examples**:
|
||
```bash
|
||
# Use all defaults (recommended)
|
||
./scripts/run_mid_mt_bench.sh
|
||
|
||
# Quick test (1 run)
|
||
./scripts/run_mid_mt_bench.sh 4 60000 256 1 1
|
||
|
||
# Extensive test (10 runs)
|
||
./scripts/run_mid_mt_bench.sh 4 60000 256 1 10
|
||
|
||
# 8-thread test
|
||
./scripts/run_mid_mt_bench.sh 8 60000 256 1 5
|
||
```
|
||
|
||
**Output**:
|
||
```
|
||
======================================
|
||
Mid Range MT Benchmark (8-32KB)
|
||
======================================
|
||
Configuration:
|
||
Threads: 4
|
||
Cycles: 60000
|
||
Working Set: 256
|
||
Seed: 1
|
||
Runs: 5
|
||
CPU Affinity: cores 0-3
|
||
|
||
Working Set Analysis:
|
||
Memory: ~4096 KB per thread
|
||
Total: ~16 MB
|
||
|
||
Running benchmark 5 times...
|
||
|
||
Run 1/5:
|
||
Throughput: 95.80 M ops/sec
|
||
...
|
||
|
||
======================================
|
||
Summary Statistics
|
||
======================================
|
||
Results (M ops/sec):
|
||
Run 1: 95.80
|
||
Run 2: 97.04
|
||
Run 3: 97.11
|
||
Run 4: 98.28
|
||
Run 5: 93.91
|
||
|
||
Statistics:
|
||
Average: 96.43 M ops/sec
|
||
Median: 97.04 M ops/sec
|
||
Min: 95.80 M ops/sec
|
||
Max: 98.28 M ops/sec
|
||
Range: 95.80 - 98.28 M
|
||
|
||
Target Achievement: 80.0% of 120M target ✅
|
||
```
|
||
|
||
---
|
||
|
||
### 2. `compare_mid_mt_allocators.sh`
|
||
|
||
**Purpose**: Compare Mid MT performance across different allocators
|
||
|
||
**Usage**:
|
||
```bash
|
||
./scripts/compare_mid_mt_allocators.sh [threads] [cycles] [ws] [seed] [runs]
|
||
```
|
||
|
||
**Parameters**: Same as `run_mid_mt_bench.sh`
|
||
|
||
**Examples**:
|
||
```bash
|
||
# Use all defaults
|
||
./scripts/compare_mid_mt_allocators.sh
|
||
|
||
# Quick comparison (1 run each)
|
||
./scripts/compare_mid_mt_allocators.sh 4 60000 256 1 1
|
||
|
||
# Thorough comparison (5 runs each)
|
||
./scripts/compare_mid_mt_allocators.sh 4 60000 256 1 5
|
||
```
|
||
|
||
**Output**:
|
||
```
|
||
==========================================
|
||
Mid Range MT Allocator Comparison
|
||
==========================================
|
||
Configuration:
|
||
Threads: 4
|
||
Cycles: 60000
|
||
Working Set: 256
|
||
Seed: 1
|
||
Runs/each: 3
|
||
|
||
Running benchmarks...
|
||
|
||
Testing: system
|
||
----------------------------------------
|
||
Run 1: 51.23 M ops/sec
|
||
Run 2: 52.45 M ops/sec
|
||
Run 3: 51.89 M ops/sec
|
||
Median: 51.89 M ops/sec
|
||
|
||
Testing: mi
|
||
----------------------------------------
|
||
Run 1: 99.12 M ops/sec
|
||
Run 2: 100.45 M ops/sec
|
||
Run 3: 98.77 M ops/sec
|
||
Median: 99.12 M ops/sec
|
||
|
||
Testing: hakx
|
||
----------------------------------------
|
||
Run 1: 95.80 M ops/sec
|
||
Run 2: 97.04 M ops/sec
|
||
Run 3: 96.43 M ops/sec
|
||
Median: 96.43 M ops/sec
|
||
|
||
==========================================
|
||
Summary
|
||
==========================================
|
||
Allocator Throughput vs System
|
||
----------------------------------------
|
||
System (glibc) 51.89 M 1.00x
|
||
mimalloc 99.12 M 1.91x
|
||
HAKX (Mid MT) 96.43 M 1.86x
|
||
|
||
HAKX vs mimalloc:
|
||
97.3% of mimalloc performance
|
||
|
||
✅ HAKX significantly faster than system allocator (>1.5x)
|
||
```
|
||
|
||
---
|
||
|
||
## Understanding Parameters
|
||
|
||
### Threads (`threads`)
|
||
- **Recommended**: 4 (for quad-core systems)
|
||
- **Range**: 1-16
|
||
- **Note**: Should match or be less than physical cores
|
||
|
||
### Cycles (`cycles`)
|
||
- **Recommended**: 60000
|
||
- **Range**: 10000-100000
|
||
- **Impact**: Higher = more stable results, but longer runtime
|
||
|
||
### Working Set Size (`ws`)
|
||
- **Recommended**: 256
|
||
- **Critical for cache behavior!**
|
||
- **Analysis**:
|
||
```
|
||
ws=256: 256 × 16KB avg = 4 MB → Fits in L3 cache ✅
|
||
ws=1000: 1000 × 16KB = 16 MB → L3 overflow
|
||
ws=10000: 10000 × 16KB = 160 MB → Major cache misses ❌
|
||
```
|
||
|
||
### Seed (`seed`)
|
||
- **Recommended**: 1
|
||
- **Range**: Any uint32
|
||
- **Impact**: Different allocation patterns
|
||
|
||
### Runs (`runs`)
|
||
- **Quick test**: 1
|
||
- **Normal**: 5
|
||
- **Thorough**: 10
|
||
- **Impact**: More runs = better statistics
|
||
|
||
---
|
||
|
||
## Performance Targets
|
||
|
||
| Metric | Target | Status |
|
||
|--------|--------|--------|
|
||
| **Throughput** | 95-120 M ops/sec | ✅ Achieved (95-99M) |
|
||
| **vs System** | >1.5x faster | ✅ Achieved (1.87x) |
|
||
| **vs mimalloc** | 90-100% | ✅ Achieved (97-100%) |
|
||
|
||
---
|
||
|
||
## Common Issues
|
||
|
||
### Issue 1: Low Performance (<50 M ops/sec)
|
||
|
||
**Cause**: Wrong working set size
|
||
**Solution**: Use default ws=256
|
||
```bash
|
||
# BAD - cache overflow
|
||
./scripts/run_mid_mt_bench.sh 4 60000 10000 # ❌ 6-10 M ops/sec
|
||
|
||
# GOOD - fits in cache
|
||
./scripts/run_mid_mt_bench.sh 4 60000 256 # ✅ 95-99 M ops/sec
|
||
```
|
||
|
||
### Issue 2: High Variance in Results
|
||
|
||
**Cause**: System noise (other processes)
|
||
**Solution**: Use taskset and reduce system load
|
||
```bash
|
||
# Stop unnecessary services
|
||
# Close browser, IDE, etc.
|
||
|
||
# Script already uses: taskset -c 0-3
|
||
```
|
||
|
||
### Issue 3: Benchmark Not Found
|
||
|
||
**Cause**: Not built yet
|
||
**Solution**: Scripts auto-build, but you can manually build:
|
||
```bash
|
||
make bench_mid_large_mt_hakx
|
||
make bench_mid_large_mt_mi
|
||
make bench_mid_large_mt_system
|
||
```
|
||
|
||
---
|
||
|
||
## Benchmark Parameters Discovery History
|
||
|
||
### Phase 1: Initial Implementation
|
||
- Configuration: `threads=2, cycles=100, ws=10000`
|
||
- Result: **0.10 M ops/sec** (1000x slower!)
|
||
- Issue: 64KB chunks → constant refill
|
||
|
||
### Phase 2: Chunk Size Fix
|
||
- Configuration: Same parameters, but 4MB chunks
|
||
- Result: **6.98 M ops/sec** (68x improvement)
|
||
- Issue: Still 14x slower than expected!
|
||
|
||
### Phase 3: Parameter Fix (CRITICAL!)
|
||
- Configuration: `threads=4, cycles=60000, ws=256`
|
||
- Result: **97.04 M ops/sec** (14x improvement!)
|
||
- Issue: Working set was causing cache misses
|
||
|
||
**Lesson**: Always test with cache-friendly working sets!
|
||
|
||
---
|
||
|
||
## Integration with Hakmem
|
||
|
||
These benchmarks test the Mid Range MT allocator in isolation:
|
||
```
|
||
User Code
|
||
↓
|
||
hakx_malloc(size)
|
||
↓
|
||
if (8KB ≤ size ≤ 32KB) ← Mid Range MT path
|
||
↓
|
||
mid_mt_alloc(size)
|
||
↓
|
||
[Per-thread segment allocation]
|
||
```
|
||
|
||
For full allocator testing, use:
|
||
```bash
|
||
# Tiny + Mid + Large combined
|
||
./scripts/run_bench_suite.sh
|
||
|
||
# Application benchmarks
|
||
./scripts/run_apps_with_hakmem.sh
|
||
```
|
||
|
||
---
|
||
|
||
## References
|
||
|
||
- **Implementation**: `core/hakmem_mid_mt.{h,c}`
|
||
- **Design Document**: `docs/design/MID_RANGE_MT_DESIGN.md`
|
||
- **Completion Report**: `MID_MT_COMPLETION_REPORT.md`
|
||
- **Benchmark Source**: `bench_mid_large_mt.c`
|
||
|
||
---
|
||
|
||
**Created**: 2025-11-01
|
||
**Status**: Production Ready ✅
|
||
**Target Performance**: 95-99 M ops/sec ✅ **ACHIEVED**
|