# Phase 6.13: mimalloc-bench Integration
**Priority**: P0 (MUST-HAVE)
**Estimated Time**: 3-5 hours
**Goal**: Validate TLS multi-threaded benefit + diverse workload coverage

---

## Quick Start (30 Minutes)

### Step 1: Clone mimalloc-bench
```bash
cd /tmp
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
# Upstream build script; "all" builds the allocators and the benchmarks
./build-bench-env.sh all
```

**Expected output**: Builds 20+ benchmark executables in `./out/bench/*/`

---

### Step 2: Build hakmem.so
```bash
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc

# Add shared library target to Makefile
cat >> Makefile << 'EOF'

# Shared library for LD_PRELOAD
shared: libhakmem.so

libhakmem.so: hakmem.o hakmem_pool.o hakmem_site_rules.o hakmem_tiny.o
	$(CC) -shared -o $@ $^ $(CFLAGS) -fPIC

hakmem.o: hakmem.c hakmem.h
	$(CC) $(CFLAGS) -fPIC -c hakmem.c

hakmem_pool.o: hakmem_pool.c hakmem_pool.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_pool.c

hakmem_site_rules.o: hakmem_site_rules.c hakmem_site_rules.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_site_rules.c

hakmem_tiny.o: hakmem_tiny.c hakmem_tiny.h
	$(CC) $(CFLAGS) -fPIC -c hakmem_tiny.c
EOF

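# Remove stale objects that may have been compiled without -fPIC, then build
rm -f hakmem.o hakmem_pool.o hakmem_site_rules.o hakmem_tiny.o
make shared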

# Verify
ls -lh libhakmem.so
```

**Expected output**: `libhakmem.so` (~100-200KB)

---

### Step 3: Run Initial Benchmarks (1-2 Hours)

#### Test 1: cfrac (single-threaded, 24B-400B allocations)
```bash
cd /tmp/mimalloc-bench

# Baseline (system allocator)
./out/bench/cfrac/cfrac 17
# Expected: ~0.5-1.0 seconds

# mimalloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/cfrac/cfrac 17
# Expected: ~0.3-0.5 seconds

# hakmem
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/cfrac/cfrac 17
# Expected: ~0.6-1.0 seconds (within 2x of mimalloc)
```

**Success Criteria**: hakmem within 2x of mimalloc (single-threaded overhead acceptable)
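
The 2x criterion can be checked mechanically once both timings are recorded. The following is a minimal sketch, assuming the inputs are wall-clock seconds copied by hand from the runs above; `within_factor` is a hypothetical helper name, not part of hakmem or mimalloc-bench, and the numbers are illustrative rather than measured.

```bash
# within_factor <candidate_s> <reference_s> <max_factor>
# Hypothetical helper: succeeds (exit 0) when candidate <= reference * max_factor.
within_factor() {
  awk -v c="$1" -v r="$2" -v f="$3" 'BEGIN { exit !(c <= r * f) }'
}

# Illustrative timings only (not measurements): hakmem 0.68s vs mimalloc 0.45s
if within_factor 0.68 0.45 2; then
  echo "cfrac: PASS"
else
  echo "cfrac: FAIL"
fi
```

Using `awk` for the comparison avoids the integer-only arithmetic of the shell.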

---

#### Test 2: larson (multi-threaded, 10B-1KB allocations)
```bash
# 1 thread (baseline)
./out/bench/larson/larson 1 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 1 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 1 1000 10000

# 4 threads (TLS validation)
./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 4 1000 10000

# 16 threads (TLS scaling)
./out/bench/larson/larson 16 1000 10000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/larson/larson 16 1000 10000
LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/larson/larson 16 1000 10000
```

**Success Criteria**:
- ✅ 1 thread: hakmem at most 5-10% slower (TLS overhead expected)
- ✅ 4 threads: hakmem at least 20% faster (TLS benefit)
- ✅ 16 threads: hakmem at least 40% faster (TLS scaling)
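
Single runs of a multi-threaded benchmark tend to be noisy, so percentage thresholds like these are easier to judge from repeated runs. The following is a minimal sketch of a timing wrapper, assuming GNU `date` (for `%N`) and `awk` are available; `best_of` is a hypothetical helper name, and taking the fastest of several runs is one common convention rather than something this plan prescribes.

```bash
# best_of <runs> <command...>
# Hypothetical helper: runs the command <runs> times and prints the fastest
# wall-clock time in seconds. Requires GNU date for the %N (nanoseconds) field.
best_of() {
  local runs=$1 i=0 best="" start end elapsed
  shift
  while [ "$i" -lt "$runs" ]; do
    start=$(date +%s.%N)
    "$@" > /dev/null 2>&1
    end=$(date +%s.%N)
    elapsed=$(awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f", e - s }')
    if [ -z "$best" ] || awk -v a="$elapsed" -v b="$best" 'BEGIN { exit !(a < b) }'; then
      best=$elapsed
    fi
    i=$((i + 1))
  done
  echo "$best"
}

# Harmless demonstration: prints the best of three timings of `sleep 0.1`
best_of 3 sleep 0.1

# Against the benchmark, pass LD_PRELOAD through `env` so that only the
# benchmark process is preloaded, not the date/awk calls inside the helper:
#   best_of 3 env LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 4 1000 10000
```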

---

#### Test 3: threadtest (multi-threaded, 64B-4KB allocations)
```bash
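# Same thread counts as larson, but a different allocation pattern
for t in 1 4 16; do
  ./out/bench/threadtest/threadtest "$t" 1000000
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./out/bench/threadtest/threadtest "$t" 1000000
  LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so ./out/bench/threadtest/threadtest "$t" 1000000
done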
```

---

## Analysis (1 Hour)

### Collect Results
Create a table in `BENCHMARK_PHASE_6.13.md`:

```markdown
| Benchmark | Threads | System | mimalloc | hakmem | hakmem vs mimalloc |
|-----------|---------|--------|----------|--------|--------------------|
| cfrac | 1 | 1.00s | 0.45s | 0.68s | +51% |
| larson | 1 | 2.50s | 1.80s | 1.95s | +8% |
| larson | 4 | 8.00s | 3.20s | 3.50s | +9% |
| larson | 16 | 28.0s | 10.5s | 12.0s | +14% |
| threadtest | 1 | 1.20s | 0.80s | 0.88s | +10% |
| threadtest | 4 | 4.00s | 1.50s | 1.70s | +13% |
| threadtest | 16 | 14.0s | 5.00s | 6.20s | +24% |
```

**Note**: The values above are placeholders that show the expected format. Replace them with actual measured values!
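
The last column is the relative run-time difference, `(hakmem / mimalloc - 1) * 100`, rounded to a whole percent. The following is a minimal sketch that formats one row from three timings; `table_row` is a hypothetical helper name, and the sample call reuses the placeholder cfrac values from the table, not real measurements.

```bash
# table_row <benchmark> <threads> <system_s> <mimalloc_s> <hakmem_s>
# Hypothetical helper: prints one Markdown row, deriving the last column as
# (hakmem / mimalloc - 1) * 100, rounded to a whole percent.
table_row() {
  awk -v b="$1" -v t="$2" -v s="$3" -v m="$4" -v h="$5" 'BEGIN {
    printf "| %s | %s | %.2fs | %.2fs | %.2fs | %+.0f%% |\n", b, t, s, m, h, (h / m - 1) * 100
  }'
}

table_row cfrac 1 1.00 0.45 0.68
# → | cfrac | 1 | 1.00s | 0.45s | 0.68s | +51% |
```

A positive value means hakmem took longer than mimalloc; a negative value means it was faster.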

---

### TLS Validation Decision

**Criteria**:
- ✅ **Keep TLS**: If 4-thread benefit ≥ 20% AND 16-thread benefit ≥ 40%
  - Example: larson 4-thread is 2.50s (no TLS) → 2.00s (TLS) = -20% ✅
  - Example: larson 16-thread is 8.50s (no TLS) → 5.10s (TLS) = -40% ✅

- ⚠️ **Make conditional**: If benefit exists but < 20% at 4 threads
  - Implement compile-time flag: `HAKMEM_MULTITHREAD=1`

- ❌ **Revert TLS**: If no benefit at 4+ threads (unlikely)
  - Revert Phase 6.12.1 Step 2 changes
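
The decision rule above can be sketched as a small function over four timings (no-TLS and TLS builds at 4 and 16 threads). `tls_decision` is a hypothetical helper name. Two points are assumptions of this sketch rather than part of the criteria: the thresholds are treated as inclusive, matching the worked examples, and any combination the criteria do not cover (for example, a large 4-thread benefit with a small 16-thread benefit) is reported as `review`.

```bash
# tls_decision <noTLS_4t_s> <TLS_4t_s> <noTLS_16t_s> <TLS_16t_s>
# Hypothetical helper: prints keep, conditional, revert, or review.
tls_decision() {
  awk -v n4="$1" -v t4="$2" -v n16="$3" -v t16="$4" 'BEGIN {
    eps = 1e-9                        # tolerance for floating-point rounding
    b4  = (n4 - t4) / n4 * 100        # percent run-time reduction at 4 threads
    b16 = (n16 - t16) / n16 * 100     # percent run-time reduction at 16 threads
    if (b4 >= 20 - eps && b16 >= 40 - eps) { print "keep" }
    else if (b4 > eps && b4 < 20 - eps)    { print "conditional" }
    else if (b4 <= eps && b16 <= eps)      { print "revert" }
    else                                   { print "review" }  # not covered by the criteria
  }'
}

tls_decision 2.50 2.00 8.50 5.10   # the worked examples from the criteria
# → keep
```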

---

## Troubleshooting

### Issue 1: libhakmem.so not found
```bash
# Check that the file exists
ls -lh /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so

# Always preload via an absolute path
export HAKMEM_LIB=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so

# Confirm the loader picks it up (libhakmem.so should appear in the list)
LD_PRELOAD=$HAKMEM_LIB ldd /tmp/mimalloc-bench/out/bench/cfrac/cfrac

LD_PRELOAD=$HAKMEM_LIB ./out/bench/cfrac/cfrac 17
```

---

### Issue 2: Segfault or crashes
```bash
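# Debug with gdb (set LD_PRELOAD inside gdb so that gdb itself is not preloaded)
gdb -ex "set environment LD_PRELOAD $HAKMEM_LIB" --args ./out/bench/cfrac/cfrac 17
(gdb) run
(gdb) bt

# Check for missing symbols
nm -D --defined-only "$HAKMEM_LIB" | grep -wE 'malloc|free|calloc|realloc'
# Should see: malloc, free, calloc, realloc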
```
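
Scanning the `nm` output by eye is easy to get wrong when other symbols also contain "malloc". The following is a minimal sketch that reports which of the four entry points are missing; `missing_alloc_symbols` is a hypothetical helper name, and it assumes the usual three-column `nm` output (address, type, name) with exported functions marked `T` or `W`.

```bash
# missing_alloc_symbols: reads `nm -D --defined-only` output on stdin and
# prints each of malloc/free/calloc/realloc that is not exported (T or W).
# Hypothetical helper name; prints nothing when all four are present.
missing_alloc_symbols() {
  awk '
    $2 == "T" || $2 == "W" { sub(/@.*/, "", $3); seen[$3] = 1 }
    END {
      n = split("malloc free calloc realloc", want, " ")
      for (i = 1; i <= n; i++) if (!(want[i] in seen)) print want[i]
    }'
}

# Demonstration on canned nm output that lacks calloc and realloc
printf '%s\n' '0000000000001139 T malloc' '0000000000001200 T free' | missing_alloc_symbols

# Against the real library:
#   nm -D --defined-only "$HAKMEM_LIB" | missing_alloc_symbols
```

If `calloc` or `realloc` is missing, calls to them fall through to the system allocator while `free` is still intercepted, which is a plausible source of crashes.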

---

### Issue 3: Performance worse than expected
```bash
# Check THP is enabled
cat /sys/kernel/mm/transparent_hugepage/enabled
# Should be: [always] or [madvise]

# Check CPU frequency scaling
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be: performance (not powersave)

# Pin the governor to performance to avoid frequency scaling during runs
sudo cpupower frequency-set -g performance
```
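
The THP file lists every mode on one line and marks the active one with brackets, so a pre-flight script needs to extract the bracketed value. The following is a minimal sketch; `active_setting` is a hypothetical helper name.

```bash
# active_setting: reads a sysfs-style option line on stdin, such as
# "always [madvise] never", and prints the bracketed (active) value.
active_setting() {
  sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

echo 'always [madvise] never' | active_setting
# → madvise

# On a live system:
#   active_setting < /sys/kernel/mm/transparent_hugepage/enabled
```

The `scaling_governor` file contains only the active governor name, so it can be compared directly without this helper.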

---

## Next Steps

### If TLS validation succeeds (expected)
→ **Phase 6.14**: Expand to 10+ benchmarks (espresso, barnes, cache-scratch, etc.)

### If TLS validation fails (unlikely)
→ **Phase 6.13.1**: Revert TLS or make conditional (compile-time flag)

### Always
→ **Phase 6.16**: Fix Tiny Pool overhead (7,871ns → <200ns target)

---

## Appendix: Benchmark Runner Integration (Optional, 2 Hours)

**Goal**: Integrate hakmem into mimalloc-bench's automated runner (./run-all.sh)

### Step 1: Edit bench.sh
```bash
cd /tmp/mimalloc-bench

# Backup original
cp bench.sh bench.sh.backup

# Add hakmem
cat >> bench.sh << 'EOF'

# hakmem allocator
if [[ "$1" == "hakmem" ]]; then
  export LD_PRELOAD=/home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so
  shift
  exec "$@"
fi
EOF
```

### Step 2: Add to ALLOCATORS list
```bash
# Edit run-all.sh
# Find line: ALLOCATORS="mimalloc jemalloc tcmalloc"
# Change to: ALLOCATORS="mimalloc jemalloc tcmalloc hakmem"
```

### Step 3: Run automated comparison
```bash
./run-all.sh -b cfrac,larson,threadtest -a mimalloc,jemalloc,hakmem -t 1,4,16
```

**Output**: CSV file with all results (easy to compare)

---

**End of Phase 6.13 Guide**

This guide provides a step-by-step implementation plan for mimalloc-bench integration. Start with the Quick Start section (30 minutes) to validate basic functionality, then proceed to full benchmarking (1-2 hours) and analysis (1 hour).