
Phase 6.13: mimalloc-bench Integration

Priority: P0 (MUST-HAVE)
Estimated Time: 3-5 hours
Goal: Validate TLS multi-threaded benefit + diverse workload coverage


Quick Start (30 Minutes)

Step 1: Clone mimalloc-bench

cd /tmp
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-bench-env.sh all

Expected output: Builds 20+ benchmark executables in ./out/bench/*/


Step 2: Build hakmem.so

cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc

# Add shared library target to Makefile
cat >> Makefile << 'EOF'

# Shared library for LD_PRELOAD
.PHONY: shared
shared: libhakmem.so

libhakmem.so: hakmem.o hakmem_pool.o hakmem_site_rules.o hakmem_tiny.o
	$(CC) -shared -o $@ $^ $(CFLAGS) -fPIC

hakmem.o: hakmem.c hakmem.h
	$(CC) $(CFLAGS) -fPIC -c hakmem.c

hakmem_pool.o: hakmem_pool.c hakmem_pool.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_pool.c

hakmem_site_rules.o: hakmem_site_rules.c hakmem_site_rules.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_site_rules.c

hakmem_tiny.o: hakmem_tiny.c hakmem_tiny.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_tiny.c
EOF

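# Build shared library (remove stale objects first so they are recompiled with -fPIC)
rm -f hakmem.o hakmem_pool.o hakmem_site_rules.o hakmem_tiny.o
make shared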

# Verify
ls -lh libhakmem.so

Expected output: libhakmem.so (~100-200KB)


Step 3: Run Initial Benchmarks (1-2 Hours)

Test 1: cfrac (single-threaded, 24B-400B allocations)

cd /tmp/mimalloc-bench

# Baseline (system allocator)
./out/bench/cfrac/cfrac 17
# Expected: ~0.5-1.0 seconds

# mimalloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/cfrac/cfrac 17
# Expected: ~0.3-0.5 seconds

# hakmem
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/cfrac/cfrac 17
# Expected: ~0.6-1.0 seconds (within 2x of mimalloc)

Success Criteria: hakmem within 2x of mimalloc (single-threaded overhead acceptable)
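
The 2x criterion can be checked mechanically once the timings are in hand. The following is a minimal sketch, not part of hakmem or mimalloc-bench: `within_factor` is a hypothetical helper name, and the sample timings are the placeholder cfrac values from the results table later in this guide.

```shell
#!/bin/sh
# within_factor CANDIDATE BASELINE FACTOR
# Exit status 0 when CANDIDATE <= FACTOR * BASELINE, 1 otherwise.
# (Hypothetical helper; times are wall-clock seconds.)
within_factor() {
  awk -v c="$1" -v b="$2" -v f="$3" 'BEGIN { exit !(c <= f * b) }'
}

# Placeholder timings: hakmem 0.68s, mimalloc 0.45s, allowed factor 2.
if within_factor 0.68 0.45 2; then
  echo "PASS: hakmem within 2x of mimalloc"
else
  echo "FAIL: hakmem slower than 2x mimalloc"
fi
```

Using awk for the comparison avoids the shell's lack of floating-point arithmetic.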


Test 2: larson (multi-threaded, 10B-1KB allocations)

# 1 thread (baseline)
./out/bench/larson/larson 1 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 1 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 1 1000 10000

# 4 threads (TLS validation)
./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 4 1000 10000

# 16 threads (TLS scaling)
./out/bench/larson/larson 16 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 16 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 16 1000 10000

Success Criteria:

  • 1 thread: hakmem run time 5-10% higher (TLS overhead expected)
  • 4 threads: hakmem run time 20% lower (TLS benefit)
  • 16 threads: hakmem run time 40% lower (TLS scaling)
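
The nine larson invocations above follow one pattern (three thread counts times three allocators), so they can be generated rather than typed. This is a sketch under the assumption that the library paths are the ones used in this guide; `larson_matrix` is a hypothetical helper that only prints the commands.

```shell
#!/bin/sh
# Library paths as used in this guide; adjust to the local installation.
MIMALLOC_LIB=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2
HAKMEM_LIB=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so

# Print (do not run) one larson command per thread count and allocator.
# An empty library entry stands for the system allocator.
larson_matrix() (
  for threads in 1 4 16; do
    for lib in "" "$MIMALLOC_LIB" "$HAKMEM_LIB"; do
      cmd="./out/bench/larson/larson $threads 1000 10000"
      if [ -n "$lib" ]; then
        cmd="LD_PRELOAD=$lib $cmd"
      fi
      printf '%s\n' "$cmd"
    done
  done
)

larson_matrix
```

Printing the commands first makes it easy to review them before running them from /tmp/mimalloc-bench.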

Test 3: threadtest (multi-threaded, 64B-4KB allocations)

# Same as larson, but different allocation pattern
./out/bench/threadtest/threadtest 1 1000000
./out/bench/threadtest/threadtest 4 1000000
./out/bench/threadtest/threadtest 16 1000000

# Repeat each run with LD_PRELOAD set to libmimalloc.so.2 and to libhakmem.so, as in Test 2

Analysis (1 Hour)

Collect Results

Create a table in BENCHMARK_PHASE_6.13.md:

| Benchmark | Threads | System | mimalloc | hakmem | hakmem vs mimalloc |
|-----------|---------|--------|----------|--------|--------------------|
| cfrac | 1 | 1.00s | 0.45s | 0.68s | +51% |
| larson | 1 | 2.50s | 1.80s | 1.95s | +8% |
| larson | 4 | 8.00s | 3.20s | 3.50s | +9% |
| larson | 16 | 28.0s | 10.5s | 12.0s | +14% |
| threadtest | 1 | 1.20s | 0.80s | 0.88s | +10% |
| threadtest | 4 | 4.00s | 1.50s | 1.70s | +13% |
| threadtest | 16 | 14.0s | 5.00s | 6.20s | +24% |

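Note: The values above are placeholders; replace them with actual measured values. In the last column, a positive percentage means hakmem took longer than mimalloc.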
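
The last column of the table is the relative difference in run time, (hakmem / mimalloc - 1) × 100. Below is a small sketch of that arithmetic; `pct_vs` is a hypothetical helper name, and the inputs are the placeholder values from the table.

```shell
#!/bin/sh
# pct_vs TIME REFERENCE
# Print TIME relative to REFERENCE as a signed, rounded percentage.
# (Hypothetical helper; positive means TIME is slower than REFERENCE.)
pct_vs() {
  awk -v x="$1" -v ref="$2" 'BEGIN { printf "%+.0f%%\n", (x / ref - 1) * 100 }'
}

pct_vs 0.68 0.45   # cfrac row: prints +51%
pct_vs 1.95 1.80   # larson, 1 thread: prints +8%
```

The same helper works for the TLS comparison in the next section, for example `pct_vs 2.00 2.50` for a run that dropped from 2.50s to 2.00s.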


TLS Validation Decision

Criteria:

  • Keep TLS: If 4-thread benefit ≥ 20% AND 16-thread benefit ≥ 40% (run time reduction, TLS vs no TLS)

    • Example: larson 4-thread is 2.50s (no TLS) → 2.00s (TLS) = -20%
    • Example: larson 16-thread is 8.50s (no TLS) → 5.10s (TLS) = -40%
  • ⚠️ Make conditional: If benefit exists but < 20% at 4 threads

    • Implement compile-time flag: HAKMEM_MULTITHREAD=1
  • Revert TLS: If no benefit at 4+ threads (unlikely)

    • Revert Phase 6.12.1 Step 2 changes
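
The decision rule can be written down as a small function. This is a sketch with assumptions: `tls_decision` is a hypothetical name, benefits are whole-number percentages of run time saved, and the thresholds are treated as inclusive because the worked examples land exactly on 20% and 40%. The criteria do not say what to do when the 4-thread target is met but the 16-thread target is not, so that case is reported as `unspecified`.

```shell
#!/bin/sh
# tls_decision BENEFIT_4T BENEFIT_16T
# Arguments are integer percentages of run time saved by TLS
# (positive = TLS is faster). Hypothetical helper.
tls_decision() (
  b4=$1
  b16=$2
  if [ "$b4" -ge 20 ] && [ "$b16" -ge 40 ]; then
    echo keep
  elif [ "$b4" -gt 0 ] && [ "$b4" -lt 20 ]; then
    echo conditional
  elif [ "$b4" -le 0 ] && [ "$b16" -le 0 ]; then
    echo revert
  else
    # Combination not covered by the criteria in this guide.
    echo unspecified
  fi
)

tls_decision 20 40   # the two larson examples above: prints keep
```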

Troubleshooting

Issue 1: libhakmem.so not found

# Check file exists (use the absolute path)
export HAKMEM_LIB=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
ls -lh "$HAKMEM_LIB"

# Check that the dynamic loader picks up the preloaded library
LD_PRELOAD=$HAKMEM_LIB ldd /tmp/mimalloc-bench/out/bench/cfrac/cfrac | grep hakmem

# Run with the absolute path
LD_PRELOAD=$HAKMEM_LIB ./out/bench/cfrac/cfrac 17

Issue 2: Segfault or crashes

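# Debug with gdb (preload hakmem into the benchmark only, not into gdb itself)
gdb --args ./out/bench/cfrac/cfrac 17
(gdb) set environment LD_PRELOAD /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
(gdb) run
(gdb) bt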

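# Check for missing symbols (-w matches whole names, so calloc and realloc are found too)
nm -D --defined-only libhakmem.so | grep -wE 'malloc|free|calloc|realloc'
# Should see: malloc, free, calloc, realloc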
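
The symbol check can also report exactly which entry points are missing. The sketch below assumes the usual three-column `nm -D` output (address, type, name); `missing_alloc_syms` is a hypothetical helper, and the sample input is made up for illustration.

```shell
#!/bin/sh
# Read `nm -D --defined-only` output on stdin and print every allocator
# entry point that is NOT defined. Hypothetical helper.
missing_alloc_syms() {
  awk '
    $NF == "malloc" || $NF == "free" || $NF == "calloc" || $NF == "realloc" {
      seen[$NF] = 1
    }
    END {
      n = split("malloc free calloc realloc", want, " ")
      for (i = 1; i <= n; i++)
        if (!(want[i] in seen)) print want[i]
    }'
}

# Made-up sample: a library that exports only malloc and free.
missing_alloc_syms <<'EOF'
0000000000001139 T malloc
0000000000001150 T free
EOF
```

Against the real library this would be run as `nm -D --defined-only libhakmem.so | missing_alloc_syms`; empty output means all four symbols are exported.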

Issue 3: Performance worse than expected

# Check THP is enabled
cat /sys/kernel/mm/transparent_hugepage/enabled
# Should be: [always] or [madvise]

# Check CPU frequency scaling
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be: performance (not powersave)

# Switch to the performance governor (avoids frequency scaling during runs)
sudo cpupower frequency-set -g performance
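
The THP file lists every mode and marks the active one with square brackets, for example `always [madvise] never`. A minimal sketch for pulling out the active mode and checking it against the expectation above follows; `active_choice` and `thp_ok` are hypothetical helper names.

```shell
#!/bin/sh
# Print the bracketed (active) entry of a sysfs choice string.
active_choice() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Exit status 0 when the active THP mode is always or madvise.
thp_ok() {
  case "$(active_choice "$1")" in
    always|madvise) return 0 ;;
    *) return 1 ;;
  esac
}

active_choice "always [madvise] never"   # prints madvise
```

On a real system the argument would come from the file itself, for example `thp_ok "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"`.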

Next Steps

If TLS validation succeeds (expected)

Phase 6.14: Expand to 10+ benchmarks (espresso, barnes, cache-scratch, etc.)

If TLS validation fails (unlikely)

Phase 6.13.1: Revert TLS or make conditional (compile-time flag)

Always

Phase 6.16: Fix Tiny Pool overhead (7,871ns → <200ns target)


Appendix: Benchmark Runner Integration (Optional, 2 Hours)

Goal: Integrate hakmem into mimalloc-bench's automated runner (./run-all.sh)

Step 1: Edit bench.sh

cd /tmp/mimalloc-bench

# Backup original
cp bench.sh bench.sh.backup

# Add hakmem
cat >> bench.sh << 'EOF'

# hakmem allocator
if [[ "$1" == "hakmem" ]]; then
  export LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
  shift
  exec "$@"
fi
EOF

Step 2: Add to ALLOCATORS list

# Edit run-all.sh
# Find line: ALLOCATORS="mimalloc jemalloc tcmalloc"
# Change to: ALLOCATORS="mimalloc jemalloc tcmalloc hakmem"

Step 3: Run automated comparison

./run-all.sh -b cfrac,larson,threadtest -a mimalloc,jemalloc,hakmem -t 1,4,16

Output: CSV file with all results (easy to compare)


End of Phase 6.13 Guide

This guide provides a step-by-step implementation plan for mimalloc-bench integration. Start with the Quick Start section (30 minutes) to validate basic functionality, then proceed to full benchmarking (1-2 hours) and analysis (1 hour).