# Phase 6.13: mimalloc-bench Integration
**Priority**: P0 (MUST-HAVE)
**Estimated Time**: 3-5 hours
**Goal**: Validate TLS multi-threaded benefit + diverse workload coverage
---
## Quick Start (30 Minutes)
### Step 1: Clone mimalloc-bench
```bash
cd /tmp
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
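# Upstream's build script is build-bench-env.sh; "all" builds the allocators and benchmarks
./build-bench-env.sh all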
```
**Expected output**: Builds 20+ benchmark executables in `./out/bench/*/`
---
### Step 2: Build hakmem.so
```bash
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
# Add shared library target to Makefile
cat >> Makefile << 'EOF'

# Shared library for LD_PRELOAD (recipe lines must start with a TAB)
.PHONY: shared
shared: libhakmem.so

libhakmem.so: hakmem.o hakmem_pool.o hakmem_site_rules.o hakmem_tiny.o
	$(CC) -shared -fPIC -o $@ $^ $(CFLAGS)

hakmem.o: hakmem.c hakmem.h
	$(CC) $(CFLAGS) -fPIC -c hakmem.c

hakmem_pool.o: hakmem_pool.c hakmem_pool.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_pool.c

hakmem_site_rules.o: hakmem_site_rules.c hakmem_site_rules.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_site_rules.c

hakmem_tiny.o: hakmem_tiny.c hakmem_tiny.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_tiny.c
EOF
# Build shared library
make shared
# Verify
ls -lh libhakmem.so
```
**Expected output**: `libhakmem.so` (~100-200KB)
---
### Step 3: Run Initial Benchmarks (1-2 Hours)
#### Test 1: cfrac (single-threaded, 24B-400B allocations)
```bash
cd /tmp/mimalloc-bench
# Baseline (system allocator)
./out/bench/cfrac/cfrac 17
# Expected: ~0.5-1.0 seconds
# mimalloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/cfrac/cfrac 17
# Expected: ~0.3-0.5 seconds
# hakmem
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/cfrac/cfrac 17
# Expected: ~0.6-1.0 seconds (within 2x of mimalloc)
```
**Success Criteria**: hakmem within 2x of mimalloc (single-threaded overhead acceptable)
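The 2x criterion can be checked mechanically once the wall-clock times are recorded. The following is a minimal sketch: `within_factor` is a hypothetical helper name, and the two times are the placeholder values used elsewhere in this guide, not measurements.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when candidate <= factor * reference.
# All times are wall-clock seconds; factor defaults to 2.
within_factor() {
    local candidate="$1" reference="$2" factor="${3:-2}"
    awk -v c="$candidate" -v r="$reference" -v f="$factor" \
        'BEGIN { exit !(c <= f * r) }'
}

# Placeholder times only; substitute your own measurements.
if within_factor 0.68 0.45 2; then
    echo "PASS: hakmem within 2x of mimalloc"
else
    echo "FAIL: hakmem slower than 2x mimalloc"
fi
```

Using `awk` for the comparison avoids the integer-only arithmetic of the shell's built-in `[ ... ]` tests.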
---
#### Test 2: larson (multi-threaded, 10B-1KB allocations)
```bash
# 1 thread (baseline)
./out/bench/larson/larson 1 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 1 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 1 1000 10000
# 4 threads (TLS validation)
./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 4 1000 10000
# 16 threads (TLS scaling)
./out/bench/larson/larson 16 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 16 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 16 1000 10000
```
**Success Criteria**:
- ✅ 1 thread: hakmem +5-10% overhead (TLS overhead expected)
- ✅ 4 threads: hakmem 20% faster (TLS benefit)
- ✅ 16 threads: hakmem 40% faster (TLS scaling)
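The nine larson runs above follow a regular threads × allocator pattern, so they can be generated rather than typed. This sketch only prints the command lines for review; `larson_matrix` is a hypothetical helper, and the library paths are the ones used in this guide and may differ on your machine.

```bash
#!/usr/bin/env bash
# Paths copied from this guide; adjust for your machine.
MIMALLOC_LIB=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2
HAKMEM_LIB=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
LARSON=./out/bench/larson/larson

# Hypothetical helper: print one command line per (threads, allocator) pair.
# An empty preload value means the system allocator.
larson_matrix() {
    local threads preload
    for threads in 1 4 16; do
        for preload in "" "$MIMALLOC_LIB" "$HAKMEM_LIB"; do
            if [ -n "$preload" ]; then
                echo "LD_PRELOAD=$preload $LARSON $threads 1000 10000"
            else
                echo "$LARSON $threads 1000 10000"
            fi
        done
    done
}

larson_matrix   # review the nine commands before running them
```

Each printed line can be pasted into a terminal, prefixed with `time`, from the `/tmp/mimalloc-bench` directory.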
---
#### Test 3: threadtest (multi-threaded, 64B-4KB allocations)
```bash
# Same as larson, but different allocation pattern
./out/bench/threadtest/threadtest 1 1000000
./out/bench/threadtest/threadtest 4 1000000
./out/bench/threadtest/threadtest 16 1000000
# Repeat each run with LD_PRELOAD set to the mimalloc and hakmem libraries, as in Test 2
```
---
## Analysis (1 Hour)
### Collect Results
Create a table in `BENCHMARK_PHASE_6.13.md`:
```markdown
| Benchmark | Threads | System | mimalloc | hakmem | hakmem vs mimalloc |
|-----------|---------|--------|----------|--------|--------------------|
| cfrac | 1 | 1.00s | 0.45s | 0.68s | +51% |
| larson | 1 | 2.50s | 1.80s | 1.95s | +8% |
| larson | 4 | 8.00s | 3.20s | 3.50s | +9% |
| larson | 16 | 28.0s | 10.5s | 12.0s | +14% |
| threadtest | 1 | 1.20s | 0.80s | 0.88s | +10% |
| threadtest | 4 | 4.00s | 1.50s | 1.70s | +13% |
| threadtest | 16 | 14.0s | 5.00s | 6.20s | +24% |
```
**Note**: Replace with actual measured values!
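The last column is the relative runtime difference, `(hakmem / mimalloc - 1) * 100`, rounded to a whole percent. A small sketch for filling it in; `pct_diff` is a hypothetical helper, and the inputs shown are the placeholder values from the sample table.

```bash
#!/usr/bin/env bash
# Hypothetical helper: signed percentage by which hakmem's time differs
# from mimalloc's. Positive means hakmem is slower.
pct_diff() {
    awk -v h="$1" -v m="$2" 'BEGIN { printf "%+.0f%%\n", (h / m - 1) * 100 }'
}

pct_diff 0.68 0.45   # cfrac sample row
pct_diff 12.0 10.5   # larson, 16 threads, sample row
```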
---
### TLS Validation Decision
**Criteria**:
- ✅ **Keep TLS**: If 4-thread benefit ≥ 20% AND 16-thread benefit ≥ 40%
  - Example: larson 4-thread is 2.50s (no TLS) → 2.00s (TLS) = -20% ✅
  - Example: larson 16-thread is 8.50s (no TLS) → 5.10s (TLS) = -40% ✅
- ⚠️ **Make conditional**: If benefit exists but < 20% at 4 threads
  - Implement compile-time flag: `HAKMEM_MULTITHREAD=1`
- ❌ **Revert TLS**: If no benefit at 4+ threads (unlikely)
  - Revert Phase 6.12.1 Step 2 changes
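The decision rule can be written down as a small function so that it is applied the same way to every benchmark. This is a sketch under two assumptions: the thresholds are inclusive, matching the worked examples where exactly 20% and 40% pass, and any case the criteria do not cover (for example a strong 4-thread result with a weak 16-thread result) falls into "conditional". `tls_decision` is a hypothetical helper that takes the runtime reduction in percent at 4 and 16 threads.

```bash
#!/usr/bin/env bash
# Hypothetical helper. Arguments: runtime reduction in percent at 4 threads
# and at 16 threads (positive = TLS build is faster than the no-TLS build).
tls_decision() {
    awk -v b4="$1" -v b16="$2" 'BEGIN {
        if (b4 >= 20 && b16 >= 40)    print "keep"
        else if (b4 > 0 || b16 > 0)   print "conditional"   # assumption: covers unlisted cases
        else                          print "revert"
    }'
}

tls_decision 20 40   # the worked larson examples
tls_decision 12 25
tls_decision 0 -3
```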
---
## Troubleshooting
### Issue 1: libhakmem.so not found
```bash
# Check file exists
ls -lh /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
# Check ldd
ldd /tmp/mimalloc-bench/out/bench/cfrac/cfrac
# Try absolute path
export HAKMEM_LIB=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
LD_PRELOAD=$HAKMEM_LIB ./out/bench/cfrac/cfrac 17
```
---
### Issue 2: Segfault or crashes
```bash
# Debug with gdb (set LD_PRELOAD for the benchmark only, not for gdb itself)
gdb -ex "set environment LD_PRELOAD $HAKMEM_LIB" --args ./out/bench/cfrac/cfrac 17
(gdb) run
(gdb) bt
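# Check for missing symbols (a plain `grep malloc` would not match free, calloc, or realloc)
nm -D --defined-only libhakmem.so | grep -E ' (malloc|free|calloc|realloc)$'
# Should see: malloc, free, calloc, realloc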
```
---
### Issue 3: Performance worse than expected
```bash
# Check THP is enabled
cat /sys/kernel/mm/transparent_hugepage/enabled
# Should be: [always] or [madvise]
# Check CPU frequency scaling
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be: performance (not powersave)
# Switch to the performance governor to reduce frequency-scaling noise
sudo cpupower frequency-set -g performance
```
---
## Next Steps
### If TLS validation succeeds (expected)
**Phase 6.14**: Expand to 10+ benchmarks (espresso, barnes, cache-scratch, etc.)
### If TLS validation fails (unlikely)
**Phase 6.13.1**: Revert TLS or make conditional (compile-time flag)
### Always
**Phase 6.16**: Fix Tiny Pool overhead (7,871ns → <200ns target)
---
## Appendix: Benchmark Runner Integration (Optional, 2 Hours)
**Goal**: Integrate hakmem into mimalloc-bench's automated runner (./run-all.sh)
### Step 1: Edit bench.sh
```bash
cd /tmp/mimalloc-bench
# Backup original
cp bench.sh bench.sh.backup
# Add hakmem
cat >> bench.sh << 'EOF'
# hakmem allocator
if [[ "$1" == "hakmem" ]]; then
    export LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
    shift
    exec "$@"
fi
EOF
```
### Step 2: Add to ALLOCATORS list
```bash
# Edit run-all.sh
# Find line: ALLOCATORS="mimalloc jemalloc tcmalloc"
# Change to: ALLOCATORS="mimalloc jemalloc tcmalloc hakmem"
```
### Step 3: Run automated comparison
```bash
./run-all.sh -b cfrac,larson,threadtest -a mimalloc,jemalloc,hakmem -t 1,4,16
```
**Output**: CSV file with all results (easy to compare)
---
**End of Phase 6.13 Guide**
This guide provides a step-by-step implementation plan for mimalloc-bench integration. Start with the Quick Start section (30 minutes) to validate basic functionality, then proceed to full benchmarking (1-2 hours) and analysis (1 hour).