Phase 6.13: mimalloc-bench Integration
Priority: P0 (MUST-HAVE)
Estimated Time: 3-5 hours
Goal: Validate TLS multi-threaded benefit + diverse workload coverage
Quick Start (30 Minutes)
Step 1: Clone mimalloc-bench
cd /tmp
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh
Expected output: Builds 20+ benchmark executables in ./out/bench/*/
Step 2: Build hakmem.so
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
# Add shared library target to Makefile
cat >> Makefile << 'EOF'
# Shared library for LD_PRELOAD
# (recipe lines below must start with a tab character)
shared: libhakmem.so
libhakmem.so: hakmem.o hakmem_pool.o hakmem_site_rules.o hakmem_tiny.o
	$(CC) -shared -o $@ $^ $(CFLAGS) -fPIC
hakmem.o: hakmem.c hakmem.h
	$(CC) $(CFLAGS) -fPIC -c hakmem.c
hakmem_pool.o: hakmem_pool.c hakmem_pool.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_pool.c
hakmem_site_rules.o: hakmem_site_rules.c hakmem_site_rules.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_site_rules.c
hakmem_tiny.o: hakmem_tiny.c hakmem_tiny.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_tiny.c
EOF
# Build shared library
make shared
# Verify
ls -lh libhakmem.so
Expected output: libhakmem.so (~100-200KB)
Step 3: Run Initial Benchmarks (1-2 Hours)
Test 1: cfrac (single-threaded, 24B-400B allocations)
cd /tmp/mimalloc-bench
# Baseline (system allocator)
./out/bench/cfrac/cfrac 17
# Expected: ~0.5-1.0 seconds
# mimalloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/cfrac/cfrac 17
# Expected: ~0.3-0.5 seconds
# hakmem
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/cfrac/cfrac 17
# Expected: ~0.6-1.0 seconds (within 2x of mimalloc)
Success Criteria: hakmem within 2x of mimalloc (single-threaded overhead acceptable)
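The "within 2x" criterion can be checked mechanically once both timings are in hand. The following is a minimal sketch; the helper name `within_factor` is hypothetical, and the timings passed to it are placeholders rather than measurements.

```shell
# within_factor HAKMEM_SECONDS MIMALLOC_SECONDS [FACTOR]
# Prints PASS if the hakmem time is at most FACTOR (default 2) times
# the mimalloc time, otherwise FAIL. Hypothetical helper, not part of hakmem.
within_factor() {
    awk -v h="$1" -v m="$2" -v f="${3:-2}" 'BEGIN {
        if (h <= f * m) print "PASS"; else print "FAIL"
    }'
}

# Placeholder timings in seconds (replace with measured values).
within_factor 0.68 0.45   # 0.68 <= 2 * 0.45
within_factor 1.00 0.45   # 1.00 >  2 * 0.45
```

Passing a third argument changes the allowed factor, for example `within_factor 0.68 0.45 1.5`.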
Test 2: larson (multi-threaded, 10B-1KB allocations)
# 1 thread (baseline)
./out/bench/larson/larson 1 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 1 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 1 1000 10000
# 4 threads (TLS validation)
./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 4 1000 10000
# 16 threads (TLS scaling)
./out/bench/larson/larson 16 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 16 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 16 1000 10000
Success Criteria:
- ✅ 1 thread: hakmem 5-10% slower (TLS overhead expected)
- ✅ 4 threads: hakmem 20% faster (TLS benefit)
- ✅ 16 threads: hakmem 40% faster (TLS scaling)
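The nine larson invocations above follow one pattern (three thread counts times three allocators), so they can be generated rather than typed. This is a sketch only: the function name `larson_matrix` and the two `*_LIB` variables are introduced here for illustration, and the library paths are simply the ones used in the commands above.

```shell
# Library paths copied from the commands above; adjust for your machine.
MIMALLOC_LIB=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2
HAKMEM_LIB=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so

# larson_matrix: print one "label: command" line per (threads, allocator) pair.
# It only prints the commands; nothing is executed.
larson_matrix() {
    for t in 1 4 16; do
        bench="./out/bench/larson/larson $t 1000 10000"
        echo "system-$t: $bench"
        echo "mimalloc-$t: LD_PRELOAD=$MIMALLOC_LIB $bench"
        echo "hakmem-$t: LD_PRELOAD=$HAKMEM_LIB $bench"
    done
}

larson_matrix
```

Each printed command can then be run from /tmp/mimalloc-bench under a timer of your choice, which keeps the labels and the measured times aligned when filling in the results table.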
Test 3: threadtest (multi-threaded, 64B-4KB allocations)
# Same as larson, but different allocation pattern
./out/bench/threadtest/threadtest 1 1000000
./out/bench/threadtest/threadtest 4 1000000
./out/bench/threadtest/threadtest 16 1000000
# Repeat each run with LD_PRELOAD set to libmimalloc.so.2 and then libhakmem.so, as in Test 2
Analysis (1 Hour)
Collect Results
Create a table in BENCHMARK_PHASE_6.13.md:
| Benchmark | Threads | System | mimalloc | hakmem | hakmem vs mimalloc |
|-----------|---------|--------|----------|--------|--------------------|
| cfrac | 1 | 1.00s | 0.45s | 0.68s | +51% |
| larson | 1 | 2.50s | 1.80s | 1.95s | +8% |
| larson | 4 | 8.00s | 3.20s | 3.50s | +9% |
| larson | 16 | 28.0s | 10.5s | 12.0s | +14% |
| threadtest | 1 | 1.20s | 0.80s | 0.88s | +10% |
| threadtest | 4 | 4.00s | 1.50s | 1.70s | +13% |
| threadtest | 16 | 14.0s | 5.00s | 6.20s | +24% |
Note: Replace with actual measured values!
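The last column is the relative difference between the hakmem and mimalloc times, (hakmem - mimalloc) / mimalloc, rounded to a whole percent. A small sketch for computing it; the function name `pct_vs_mimalloc` is hypothetical, and the inputs below are the placeholder values from the table, not measurements.

```shell
# pct_vs_mimalloc MIMALLOC_SECONDS HAKMEM_SECONDS
# Prints the signed relative difference in percent.
# Positive means hakmem is slower than mimalloc.
pct_vs_mimalloc() {
    awk -v m="$1" -v h="$2" 'BEGIN { printf "%+.0f%%\n", (h - m) / m * 100 }'
}

pct_vs_mimalloc 0.45 0.68   # cfrac row
pct_vs_mimalloc 10.5 12.0   # larson, 16 threads row
```

A negative result would indicate that hakmem finished faster than mimalloc for that row.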
TLS Validation Decision
Criteria:
- ✅ Keep TLS: If the 4-thread benefit is at least 20% AND the 16-thread benefit is at least 40%
  - Example: larson 4-thread is 2.50s (no TLS) → 2.00s (TLS) = -20% ✅
  - Example: larson 16-thread is 8.50s (no TLS) → 5.10s (TLS) = -40% ✅
- ⚠️ Make conditional: If a benefit exists but is < 20% at 4 threads
  - Implement compile-time flag: HAKMEM_MULTITHREAD=1
- ❌ Revert TLS: If no benefit at 4+ threads (unlikely)
  - Revert Phase 6.12.1 Step 2 changes
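The decision rule can be written down as a small function so that it is applied the same way to every benchmark. This is a sketch under two assumptions that the criteria do not spell out: the thresholds are treated as inclusive (the worked examples mark exactly 20% and 40% as passing), and a case with a 4-thread benefit of 20% or more but a 16-thread benefit below 40% is classified as "conditional". The function name `tls_decision` is hypothetical.

```shell
# tls_decision BENEFIT_4T BENEFIT_16T
# Arguments are the TLS benefit in percent (positive = faster with TLS).
# Prints keep, conditional, or revert.
tls_decision() {
    awk -v b4="$1" -v b16="$2" 'BEGIN {
        if (b4 >= 20 && b16 >= 40)   print "keep"
        else if (b4 > 0 || b16 > 0)  print "conditional"   # some benefit, below target
        else                         print "revert"        # no benefit at 4+ threads
    }'
}

tls_decision 20 40   # the two worked examples above
tls_decision 8 15    # benefit exists but under 20% at 4 threads
tls_decision 0 -3    # no benefit
```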
Troubleshooting
Issue 1: libhakmem.so not found
# Check file exists
ls -lh /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
# Check ldd
ldd /tmp/mimalloc-bench/out/bench/cfrac/cfrac
# Try absolute path
export HAKMEM_LIB=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
LD_PRELOAD=$HAKMEM_LIB ./out/bench/cfrac/cfrac 17
Issue 2: Segfault or crashes
# Debug with gdb (preload hakmem into the benchmark only, not into gdb itself)
gdb -ex "set exec-wrapper env LD_PRELOAD=$HAKMEM_LIB" --args ./out/bench/cfrac/cfrac 17
(gdb) run
(gdb) bt
# Check for missing symbols
nm -D "$HAKMEM_LIB" | grep -E ' T (malloc|free|calloc|realloc)$'
# Should see: malloc, free, calloc, realloc
Issue 3: Performance worse than expected
# Check THP is enabled
cat /sys/kernel/mm/transparent_hugepage/enabled
# Should be: [always] or [madvise]
# Check CPU frequency scaling
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be: performance (not powersave)
# Switch to the performance governor if it is not already set
sudo cpupower frequency-set -g performance
Next Steps
If TLS validation succeeds (expected)
→ Phase 6.14: Expand to 10+ benchmarks (espresso, barnes, cache-scratch, etc.)
If TLS validation fails (unlikely)
→ Phase 6.13.1: Revert TLS or make conditional (compile-time flag)
Always
→ Phase 6.16: Fix Tiny Pool overhead (7,871ns → <200ns target)
Appendix: Makefile Integration (Optional, 2 Hours)
Goal: Integrate hakmem into mimalloc-bench's automated runner (./run-all.sh)
Step 1: Edit bench.sh
cd /tmp/mimalloc-bench
# Backup original
cp bench.sh bench.sh.backup
# Add hakmem
cat >> bench.sh << 'EOF'
# hakmem allocator
if [[ "$1" == "hakmem" ]]; then
  export LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
  shift
  exec "$@"
fi
EOF
Step 2: Add to ALLOCATORS list
# Edit run-all.sh
# Find line: ALLOCATORS="mimalloc jemalloc tcmalloc"
# Change to: ALLOCATORS="mimalloc jemalloc tcmalloc hakmem"
Step 3: Run automated comparison
./run-all.sh -b cfrac,larson,threadtest -a mimalloc,jemalloc,hakmem -t 1,4,16
Output: CSV file with all results (easy to compare)
End of Phase 6.13 Guide
This guide provides a step-by-step implementation plan for mimalloc-bench integration. Start with the Quick Start section (30 minutes) to validate basic functionality, then proceed to full benchmarking (1-2 hours) and analysis (1 hour).