# Phase 6.13: mimalloc-bench Integration

**Priority**: P0 (MUST-HAVE)

**Estimated Time**: 3-5 hours

**Goal**: Validate TLS multi-threaded benefit + diverse workload coverage

---

## Quick Start (30 Minutes)

### Step 1: Clone mimalloc-bench

```bash
cd /tmp
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh
```

**Expected output**: Builds 20+ benchmark executables in `./out/bench/*/`

---

### Step 2: Build libhakmem.so

```bash
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc

# Add shared library target to Makefile
# (recipe lines below must start with a TAB character, not spaces)
cat >> Makefile << 'EOF'

# Shared library for LD_PRELOAD
.PHONY: shared
shared: libhakmem.so

libhakmem.so: hakmem.o hakmem_pool.o hakmem_site_rules.o hakmem_tiny.o
	$(CC) -shared -o $@ $^ $(CFLAGS) -fPIC

hakmem.o: hakmem.c hakmem.h
	$(CC) $(CFLAGS) -fPIC -c hakmem.c

hakmem_pool.o: hakmem_pool.c hakmem_pool.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_pool.c

hakmem_site_rules.o: hakmem_site_rules.c hakmem_site_rules.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_site_rules.c

hakmem_tiny.o: hakmem_tiny.c hakmem_tiny.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_tiny.c
EOF

# Build shared library
make shared

# Verify
ls -lh libhakmem.so
```

**Expected output**: `libhakmem.so` (~100-200KB)

---

### Step 3: Run Initial Benchmarks (1-2 Hours)

#### Test 1: cfrac (single-threaded, 24B-400B allocations)

```bash
cd /tmp/mimalloc-bench

# Baseline (system allocator)
./out/bench/cfrac/cfrac 17
# Expected: ~0.5-1.0 seconds

# mimalloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/cfrac/cfrac 17
# Expected: ~0.3-0.5 seconds

# hakmem
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/cfrac/cfrac 17
# Expected: ~0.6-1.0 seconds (within 2x of mimalloc)
```

**Success Criteria**: hakmem within 2x of mimalloc (single-threaded overhead acceptable)

---

#### Test 2: larson (multi-threaded, 10B-1KB allocations)

```bash
# 1 thread (baseline)
./out/bench/larson/larson 1 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 1 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 1 1000 10000

# 4 threads (TLS validation)
./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 4 1000 10000

# 16 threads (TLS scaling)
./out/bench/larson/larson 16 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 16 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 16 1000 10000
```

**Success Criteria**:
- ✅ 1 thread: hakmem shows 5-10% overhead (TLS overhead expected)
- ✅ 4 threads: hakmem at least 20% faster (TLS benefit)
- ✅ 16 threads: hakmem at least 40% faster (TLS scaling)

---

#### Test 3: threadtest (multi-threaded, 64B-4KB allocations)

```bash
# Same as larson, but different allocation pattern
./out/bench/threadtest/threadtest 1 1000000
./out/bench/threadtest/threadtest 4 1000000
./out/bench/threadtest/threadtest 16 1000000

# With LD_PRELOAD (same as above)
```

---

## Analysis (1 Hour)

### Collect Results

Create a table in `BENCHMARK_PHASE_6.13.md`:

```markdown
| Benchmark  | Threads | System | mimalloc | hakmem | hakmem vs mimalloc |
|------------|---------|--------|----------|--------|--------------------|
| cfrac      | 1       | 1.00s  | 0.45s    | 0.68s  | +51%               |
| larson     | 1       | 2.50s  | 1.80s    | 1.95s  | +8%                |
| larson     | 4       | 8.00s  | 3.20s    | 3.50s  | +9%                |
| larson     | 16      | 28.0s  | 10.5s    | 12.0s  | +14%               |
| threadtest | 1       | 1.20s  | 0.80s    | 0.88s  | +10%               |
| threadtest | 4       | 4.00s  | 1.50s    | 1.70s  | +13%               |
| threadtest | 16      | 14.0s  | 5.00s    | 6.20s  | +24%               |
```

**Note**: The values above are placeholders. Replace them with actual measured values!

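
The `hakmem vs mimalloc` column is the relative difference `(hakmem - mimalloc) / mimalloc`, rounded to a whole percent. A minimal sketch of computing it from two timings is shown here; the helper name `pct_diff` is hypothetical (not part of hakmem or mimalloc-bench), and it assumes `awk` is available and that both arguments are wall-clock times in seconds.

```shell
#!/usr/bin/env bash
# pct_diff CANDIDATE BASELINE  (hypothetical helper, times in seconds)
# Prints the signed relative difference of CANDIDATE vs BASELINE as a
# whole percent. A positive value means the candidate is slower.
pct_diff() {
    awk -v c="$1" -v b="$2" 'BEGIN {
        if (b <= 0) { print "n/a"; exit }      # avoid division by zero
        printf "%+.0f%%\n", (c - b) / b * 100  # e.g. +51%
    }'
}

# Placeholder cfrac row from the table: hakmem 0.68s vs mimalloc 0.45s
pct_diff 0.68 0.45   # prints +51%
```

The same helper works for the TLS comparison later in this guide: pass the TLS timing as the candidate and the no-TLS timing as the baseline, so a negative result indicates a speedup.
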

---

### TLS Validation Decision

**Criteria**:
- ✅ **Keep TLS**: If 4-thread benefit ≥ 20% AND 16-thread benefit ≥ 40%
  - Example: larson 4-thread is 2.50s (no TLS) → 2.00s (TLS) = -20% ✅
  - Example: larson 16-thread is 8.50s (no TLS) → 5.10s (TLS) = -40% ✅
- ⚠️ **Make conditional**: If benefit exists but < 20% at 4 threads
  - Implement compile-time flag: `HAKMEM_MULTITHREAD=1`
- ❌ **Revert TLS**: If no benefit at 4+ threads (unlikely)
  - Revert Phase 6.12.1 Step 2 changes

---

## Troubleshooting

### Issue 1: libhakmem.so not found

```bash
# Check file exists
ls -lh /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so

# Check the benchmark's library dependencies
ldd /tmp/mimalloc-bench/out/bench/cfrac/cfrac

# Try absolute path
export HAKMEM_LIB=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
LD_PRELOAD="$HAKMEM_LIB" ./out/bench/cfrac/cfrac 17
```

---

### Issue 2: Segfault or crashes

```bash
# Debug with gdb. Set LD_PRELOAD inside gdb so that only the benchmark
# (not gdb itself) loads libhakmem.so.
gdb -ex "set environment LD_PRELOAD $HAKMEM_LIB" --args ./out/bench/cfrac/cfrac 17
# At the (gdb) prompt:
#   run
#   bt

# Check for missing symbols
nm -D "$HAKMEM_LIB" | grep -wE 'malloc|free|calloc|realloc'
# Should see: malloc, free, calloc, realloc
```

---

### Issue 3: Performance worse than expected

```bash
# Check THP is enabled
cat /sys/kernel/mm/transparent_hugepage/enabled
# Should be: [always] or [madvise]

# Check CPU frequency scaling
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be: performance (not powersave)

# Switch to the performance governor
sudo cpupower frequency-set -g performance
```

---

## Next Steps

### If TLS validation succeeds (expected)
→ **Phase 6.14**: Expand to 10+ benchmarks (espresso, barnes, cache-scratch, etc.)

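
Which branch of these next steps applies follows from the TLS Validation Decision criteria. The sketch below applies those thresholds to four larson timings. The function name `tls_decision` is hypothetical, and two details are assumptions rather than rules stated in this guide: exactly 20% and 40% count as passing (as in the worked larson examples), and any measurable benefit that misses the "keep" thresholds is reported as `conditional`.

```shell
#!/usr/bin/env bash
# tls_decision NO_TLS_4T TLS_4T NO_TLS_16T TLS_16T  (hypothetical helper)
# All arguments are wall-clock seconds. Prints keep, conditional, or revert,
# followed by the measured gains.
tls_decision() {
    awk -v a4="$1" -v b4="$2" -v a16="$3" -v b16="$4" 'BEGIN {
        if (a4 <= 0 || a16 <= 0) { print "n/a"; exit }
        eps    = 1e-9                         # tolerate float rounding at the thresholds
        gain4  = (a4  - b4)  / a4  * 100      # percent time saved at 4 threads
        gain16 = (a16 - b16) / a16 * 100      # percent time saved at 16 threads
        if (gain4 >= 20 - eps && gain16 >= 40 - eps) verdict = "keep"
        else if (gain4 > eps || gain16 > eps)        verdict = "conditional"  # assumption
        else                                         verdict = "revert"
        printf "%s (4T: %.1f%%, 16T: %.1f%%)\n", verdict, gain4, gain16
    }'
}

# Worked examples from the decision criteria: 2.50s -> 2.00s and 8.50s -> 5.10s
tls_decision 2.50 2.00 8.50 5.10   # prints: keep (4T: 20.0%, 16T: 40.0%)
```

A `keep` verdict corresponds to Phase 6.14; `conditional` or `revert` corresponds to Phase 6.13.1.
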

### If TLS validation fails (unlikely)
→ **Phase 6.13.1**: Revert TLS or make conditional (compile-time flag)

### Always
→ **Phase 6.16**: Fix Tiny Pool overhead (7,871ns → <200ns target)

---

## Appendix: Automated Runner Integration (Optional, 2 Hours)

**Goal**: Integrate hakmem into mimalloc-bench's automated runner (`./run-all.sh`)

### Step 1: Edit bench.sh

```bash
cd /tmp/mimalloc-bench

# Backup original
cp bench.sh bench.sh.backup

# Add hakmem
cat >> bench.sh << 'EOF'

# hakmem allocator
if [[ "$1" == "hakmem" ]]; then
    export LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
    shift
    exec "$@"
fi
EOF
```

### Step 2: Add to ALLOCATORS list

```bash
# Edit run-all.sh
# Find line:  ALLOCATORS="mimalloc jemalloc tcmalloc"
# Change to:  ALLOCATORS="mimalloc jemalloc tcmalloc hakmem"
```

### Step 3: Run automated comparison

```bash
./run-all.sh -b cfrac,larson,threadtest -a mimalloc,jemalloc,hakmem -t 1,4,16
```

**Output**: CSV file with all results (easy to compare)

---

**End of Phase 6.13 Guide**

This guide provides a step-by-step implementation plan for mimalloc-bench integration. Start with the Quick Start section (30 minutes) to validate basic functionality, then proceed to full benchmarking (1-2 hours) and analysis (1 hour).