# Phase 6.13: mimalloc-bench Integration
**Priority**: P0 (MUST-HAVE)
**Estimated Time**: 3-5 hours
**Goal**: Validate TLS multi-threaded benefit + diverse workload coverage

---

## Quick Start (30 Minutes)

### Step 1: Clone mimalloc-bench
```bash
cd /tmp
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-bench-env.sh all
```

**Expected output**: Builds 20+ benchmark executables in `./out/bench/*/`

---

### Step 2: Build hakmem.so
```bash
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc

# Add shared library target to Makefile
cat >> Makefile << 'EOF'

# Shared library for LD_PRELOAD
shared: libhakmem.so

libhakmem.so: hakmem.o hakmem_pool.o hakmem_site_rules.o hakmem_tiny.o
	$(CC) -shared -o $@ $^ $(CFLAGS) -fPIC

hakmem.o: hakmem.c hakmem.h
	$(CC) $(CFLAGS) -fPIC -c hakmem.c

hakmem_pool.o: hakmem_pool.c hakmem_pool.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_pool.c

hakmem_site_rules.o: hakmem_site_rules.c hakmem_site_rules.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_site_rules.c

hakmem_tiny.o: hakmem_tiny.c hakmem_tiny.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_tiny.c
EOF

# Build shared library
make shared

# Verify
ls -lh libhakmem.so
```

**Expected output**: `libhakmem.so` (~100-200KB)

---

### Step 3: Run Initial Benchmarks (1-2 Hours)

#### Test 1: cfrac (single-threaded, 24B-400B allocations)
```bash
cd /tmp/mimalloc-bench

# Baseline (system allocator)
./out/bench/cfrac/cfrac 17
# Expected: ~0.5-1.0 seconds

# mimalloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/cfrac/cfrac 17
# Expected: ~0.3-0.5 seconds

# hakmem
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/cfrac/cfrac 17
# Expected: ~0.6-1.0 seconds (within 2x of mimalloc)
```

**Success Criteria**: hakmem within 2x of mimalloc (single-threaded overhead acceptable)
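
The commands above do not report elapsed time themselves, so comparing against the expected seconds needs some timing wrapper. A minimal sketch follows. The helper name `time_with` is hypothetical (not part of hakmem or mimalloc-bench), and it assumes GNU `date` (for `%N`) and `awk` are available.

```bash
# time_with PRELOAD CMD [ARGS...]
# Runs CMD with LD_PRELOAD set to PRELOAD (an empty string means the system
# allocator) and prints the elapsed wall-clock time in seconds.
# Hypothetical helper; the command's own stdout is discarded and its
# exit status is not propagated.
time_with() {
    local preload="$1"
    shift
    local start end
    start=$(date +%s.%N)
    if [ -n "$preload" ]; then
        LD_PRELOAD="$preload" "$@" > /dev/null
    else
        "$@" > /dev/null
    fi
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
}

# Example use (paths as in the commands above):
#   time_with "" ./out/bench/cfrac/cfrac 17
#   time_with /usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/cfrac/cfrac 17
```

Running each configuration several times and taking the minimum or median would reduce noise; the sketch times a single run only.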

---

#### Test 2: larson (multi-threaded, 10B-1KB allocations)
```bash
# 1 thread (baseline)
./out/bench/larson/larson 1 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 1 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 1 1000 10000

# 4 threads (TLS validation)
./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 4 1000 10000

# 16 threads (TLS scaling)
./out/bench/larson/larson 16 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 16 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 16 1000 10000
```

**Success Criteria**:
- ✅ 1 thread: hakmem +5-10% overhead (TLS overhead expected)
- ✅ 4 threads: hakmem ≥20% faster (TLS benefit)
- ✅ 16 threads: hakmem ≥40% faster (TLS scaling)

---

#### Test 3: threadtest (multi-threaded, 64B-4KB allocations)
```bash
# Same as larson, but different allocation pattern
./out/bench/threadtest/threadtest 1 1000000
./out/bench/threadtest/threadtest 4 1000000
./out/bench/threadtest/threadtest 16 1000000

# With LD_PRELOAD (same as above)
```
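
The "With LD_PRELOAD (same as above)" step expands to nine runs (three allocators × three thread counts). One way to enumerate them is sketched below. The function name `threadtest_matrix` and the variable name `MIMALLOC_LIB` are assumptions introduced for illustration; the library paths are copied from the earlier steps. The function only prints the command lines, so it can be inspected before anything is executed.

```bash
# Library paths as used in Tests 1 and 2.
MIMALLOC_LIB=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2
HAKMEM_LIB=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so

# Print one threadtest command per (thread count, allocator) pair.
# An empty library entry stands for the system allocator (no LD_PRELOAD).
threadtest_matrix() {
    local threads lib
    for threads in 1 4 16; do
        for lib in "" "$MIMALLOC_LIB" "$HAKMEM_LIB"; do
            if [ -n "$lib" ]; then
                printf 'LD_PRELOAD=%s ' "$lib"
            fi
            printf './out/bench/threadtest/threadtest %s 1000000\n' "$threads"
        done
    done
}
```

Calling `threadtest_matrix` lists the commands; piping the output to `bash` from `/tmp/mimalloc-bench` would run them in order.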

---

## Analysis (1 Hour)

### Collect Results
Create a table in `BENCHMARK_PHASE_6.13.md`:

```markdown
| Benchmark | Threads | System | mimalloc | hakmem | hakmem vs mimalloc |
|-----------|---------|--------|----------|--------|--------------------|
| cfrac | 1 | 1.00s | 0.45s | 0.68s | +51% |
| larson | 1 | 2.50s | 1.80s | 1.95s | +8% |
| larson | 4 | 8.00s | 3.20s | 3.50s | +9% |
| larson | 16 | 28.0s | 10.5s | 12.0s | +14% |
| threadtest | 1 | 1.20s | 0.80s | 0.88s | +10% |
| threadtest | 4 | 4.00s | 1.50s | 1.70s | +13% |
| threadtest | 16 | 14.0s | 5.00s | 6.20s | +24% |
```

**Note**: Replace with actual measured values!
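
The "hakmem vs mimalloc" column is the relative difference in run time, `(hakmem - mimalloc) / mimalloc`, so a positive value means hakmem is slower. A small sketch for filling it in from measured values is shown here. The function name `pct_diff` is hypothetical, and `awk` is assumed to be installed.

```bash
# pct_diff BASE VALUE
# Prints the signed percent difference of VALUE relative to BASE,
# rounded to the nearest integer (e.g. "+51%" or "-20%").
pct_diff() {
    awk -v base="$1" -v val="$2" \
        'BEGIN { printf "%+.0f%%\n", (val - base) / base * 100 }'
}

# Placeholder row from the table above: mimalloc 0.45s, hakmem 0.68s
pct_diff 0.45 0.68   # prints +51%
```

The same helper covers "before vs after" comparisons: `pct_diff 2.50 2.00` prints `-20%`, meaning a 20% reduction in run time.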

---

### TLS Validation Decision

**Criteria**:
- ✅ **Keep TLS**: If 4-thread benefit ≥ 20% AND 16-thread benefit ≥ 40%
  - Example: larson 4-thread is 2.50s (no TLS) → 2.00s (TLS) = -20% ✅
  - Example: larson 16-thread is 8.50s (no TLS) → 5.10s (TLS) = -40% ✅

- ⚠️ **Make conditional**: If benefit exists but < 20% at 4 threads
  - Implement compile-time flag: `HAKMEM_MULTITHREAD=1`

- ❌ **Revert TLS**: If no benefit at 4+ threads (unlikely)
  - Revert Phase 6.12.1 Step 2 changes
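
The decision rule can be written down as a small script so it is applied the same way to every measurement. This is only a sketch: the function name `tls_decision` is hypothetical, and one case is an assumption because the criteria do not cover it. When the 4-thread benefit reaches 20% but the 16-thread benefit stays under 40%, the sketch reports "conditional".

```bash
# tls_decision NO_TLS_4 TLS_4 NO_TLS_16 TLS_16   (run times in seconds)
# Prints "keep", "conditional", or "revert".
tls_decision() {
    awk -v a4="$1" -v b4="$2" -v a16="$3" -v b16="$4" 'BEGIN {
        # Percent run-time reduction with TLS at each thread count.
        gain4  = (a4  - b4)  / a4  * 100
        gain16 = (a16 - b16) / a16 * 100
        eps = 1e-9   # tolerance so that exactly 20% / 40% passes despite rounding

        if (gain4 >= 20 - eps && gain16 >= 40 - eps)
            print "keep"
        else if (gain4 > eps || gain16 > eps)
            print "conditional"   # assumption: any partial benefit lands here
        else
            print "revert"
    }'
}

# The two larson examples from the criteria:
tls_decision 2.50 2.00 8.50 5.10   # prints keep
```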

---

## Troubleshooting

### Issue 1: libhakmem.so not found
```bash
# Check file exists
ls -lh /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so

# Check ldd
ldd /tmp/mimalloc-bench/out/bench/cfrac/cfrac

# Try absolute path
export HAKMEM_LIB=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
LD_PRELOAD=$HAKMEM_LIB ./out/bench/cfrac/cfrac 17
```

---

### Issue 2: Segfault or crashes
```bash
# Debug with gdb (set LD_PRELOAD for the benchmark only, so gdb itself keeps the system allocator)
gdb -ex "set environment LD_PRELOAD $HAKMEM_LIB" --args ./out/bench/cfrac/cfrac 17
(gdb) run
(gdb) bt

# Check for missing symbols (-w matches whole names, so hak_malloc etc. are excluded)
nm -D "$HAKMEM_LIB" | grep -wE 'malloc|free|calloc|realloc'
# Should see: malloc, free, calloc, realloc
```
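
Reading the `nm` output by eye is easy to get wrong, so the symbol check can also be scripted. The sketch below reads `nm -D --defined-only` output on stdin and prints each of the four required symbols that is not defined. The function name `missing_alloc_symbols` is hypothetical, and it assumes symbols appear unversioned as `<address> <type> <name>`.

```bash
# missing_alloc_symbols
# Reads nm output on stdin and prints every required allocator entry point
# that is not present as a defined text (T) or weak (W) symbol.
missing_alloc_symbols() {
    awk '
        BEGIN { n = split("malloc free calloc realloc", want, " ") }
        # Record names of defined text/weak symbols.
        NF == 3 && $2 ~ /^[TtWw]$/ { have[$3] = 1 }
        END {
            for (i = 1; i <= n; i++)
                if (!(want[i] in have)) print want[i]
        }
    '
}

# Example use:
#   nm -D --defined-only "$HAKMEM_LIB" | missing_alloc_symbols
```

Empty output means all four entry points are exported; any printed name is a function that would still resolve to the system allocator under `LD_PRELOAD`.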

---

### Issue 3: Performance worse than expected
```bash
# Check THP is enabled
cat /sys/kernel/mm/transparent_hugepage/enabled
# Should be: [always] or [madvise]

# Check CPU frequency scaling
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be: performance (not powersave)

# Disable CPU frequency scaling
sudo cpupower frequency-set -g performance
```
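
The THP file lists every mode and marks the active one with square brackets, for example `always [madvise] never`. A small sketch for extracting the active mode and checking it against the expectation above is given here. The function names `active_option` and `thp_ok` are hypothetical.

```bash
# active_option
# Reads a sysfs-style option line on stdin and prints the bracketed (active) entry.
active_option() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# thp_ok
# Succeeds if the active mode read from stdin is "always" or "madvise".
thp_ok() {
    case "$(active_option)" in
        always|madvise) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use:
#   thp_ok < /sys/kernel/mm/transparent_hugepage/enabled || echo "THP is disabled"
```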

---

## Next Steps

### If TLS validation succeeds (expected)
→ **Phase 6.14**: Expand to 10+ benchmarks (espresso, barnes, cache-scratch, etc.)

### If TLS validation fails (unlikely)
→ **Phase 6.13.1**: Revert TLS or make conditional (compile-time flag)

### Always
→ **Phase 6.16**: Fix Tiny Pool overhead (7,871ns → <200ns target)

---

## Appendix: Makefile Integration (Optional, 2 Hours)

**Goal**: Integrate hakmem into mimalloc-bench's automated runner (./run-all.sh)

### Step 1: Edit bench.sh
```bash
cd /tmp/mimalloc-bench

# Backup original
cp bench.sh bench.sh.backup

# Add hakmem
cat >> bench.sh << 'EOF'

# hakmem allocator
if [[ "$1" == "hakmem" ]]; then
    export LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
    shift
    exec "$@"
fi
EOF
```

### Step 2: Add to ALLOCATORS list
```bash
# Edit run-all.sh
# Find line: ALLOCATORS="mimalloc jemalloc tcmalloc"
# Change to: ALLOCATORS="mimalloc jemalloc tcmalloc hakmem"
```

### Step 3: Run automated comparison
```bash
./run-all.sh -b cfrac,larson,threadtest -a mimalloc,jemalloc,hakmem -t 1,4,16
```

**Output**: CSV file with all results (easy to compare)

---

**End of Phase 6.13 Guide**

This guide provides a step-by-step implementation plan for mimalloc-bench integration. Start with the Quick Start section (30 minutes) to validate basic functionality, then proceed to full benchmarking (1-2 hours) and analysis (1 hour).