Files
hakmem/docs/analysis/ULTRATHINK_BENCHMARK_ANALYSIS.md
Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00

756 lines
26 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# hakmem Benchmark Strategy & TLS Analysis
**Author**: ultrathink (ChatGPT o1)
**Date**: 2025-10-22
**Context**: Real-world benchmark recommendations + TLS Freelist Cache evaluation
---
## Executive Summary
**Current Problem**: hakmem benchmarks are too size-specific (64KB, 256KB, 2MB), leading to peaky optimizations that may not reflect real-world performance.
**Key Findings**:
1. **mimalloc-bench is essential** (P0) - industry standard with diverse patterns
2. **TLS overhead is expected in single-threaded workloads** - need multi-threaded validation
3. **Redis is valuable but complex** (P1) - defer until after mimalloc-bench
4. **Recommended approach**: Keep TLS + add multi-threaded benchmarks to validate effectiveness
---
## 1. Real-World Benchmark Recommendations
### 1.1 mimalloc-bench Suite (P0 - MUST IMPLEMENT)
**Name**: mimalloc-bench (Microsoft Research allocator benchmark suite)
**Why Representative**:
- Industry-standard benchmark used by mimalloc, jemalloc, tcmalloc authors
- 20+ workloads covering diverse allocation patterns
- Mix of synthetic stress tests + real applications
- Well-maintained, actively used for allocator research
**Allocation Patterns**:
| Benchmark | Sizes | Lifetime | Threads | Pattern |
|-----------|-------|----------|---------|---------|
| larson | 10B-1KB | short | 1-32 | Multi-threaded churn |
| threadtest | 64B-4KB | mixed | 1-16 | Per-thread allocation |
| mstress | 16B-2KB | short | 1-32 | Stress test |
| cfrac | 24B-400B | medium | 1 | Mathematical computation |
| espresso | 16B-1KB | mixed | 1 | Logic minimization |
| barnes | 32B-96B | long | 1 | N-body simulation |
| cache-scratch | 8B-256KB | short | 1-8 | Cache-unfriendly |
| sh6bench | 16B-4KB | mixed | 1 | Shell script workload |
**Integration Method**:
```bash
# Easy integration via LD_PRELOAD
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh
# Run with hakmem
LD_PRELOAD=/path/to/libhakmem.so ./bench/cfrac/cfrac 17
# Automated comparison
./run-all.sh -b cfrac,larson,threadtest -a mimalloc,jemalloc,hakmem
```
**Expected hakmem Strengths**:
- **larson**: Site Rules should reduce lock contention (different threads → different sites)
- **cfrac**: L2 Pool non-empty bitmap → O(1) small-object allocation
- **cache-scratch**: ELO should learn cache-unfriendly patterns → segregate hot/cold
**Expected hakmem Weaknesses**:
- **barnes**: Long-lived small objects (32-96B) → Tiny Pool overhead (7,871ns vs 18ns)
- **mstress**: High-churn stress test → free policy overhead (Hot/Warm/Cold decision)
- **threadtest**: TLS overhead (+7-8%) if thread count < 4
**Implementation Difficulty**: **Easy**
- LD_PRELOAD integration (no code changes)
- Automated benchmark runner (./run-all.sh)
- Comparison reports (CSV/JSON output)
**Priority**: **P0 (MUST-HAVE)**
- Essential for competitive analysis
- Diverse workload coverage
- Direct comparison with mimalloc/jemalloc
**Estimated Time**: 2-4 hours (setup + initial run + analysis)
---
### 1.2 Redis Benchmark (P1 - IMPORTANT)
**Name**: Redis 7.x (in-memory data store)
**Why Representative**:
- Real-world production workload (not synthetic)
- Complex allocation patterns (strings, lists, hashes, sorted sets)
- High-throughput (100K+ ops/sec)
- Well-defined benchmark protocol (redis-benchmark)
**Allocation Patterns**:
| Operation | Sizes | Lifetime | Pattern |
|-----------|-------|----------|---------|
| SET key val | 16B-512KB | medium-long | String allocation |
| LPUSH list val | 16B-64KB | medium | List node allocation |
| HSET hash field val | 16B-4KB | long | Hash table + entries |
| ZADD zset score val | 32B-1KB | long | Skip list + hash |
| INCR counter | 8B | long | Small integer objects |
**Integration Method**:
```bash
# Method 1: LD_PRELOAD (easiest)
git clone https://github.com/redis/redis.git
cd redis
make
LD_PRELOAD=/path/to/libhakmem.so ./src/redis-server &
./src/redis-benchmark -t set,get,lpush,hset,zadd -n 1000000
# Method 2: Static linking (more accurate)
# Edit src/Makefile:
# MALLOC=hakmem
# MALLOC_LIBS=/path/to/libhakmem.a
make MALLOC=hakmem
./src/redis-server &
./src/redis-benchmark -t set,get,lpush,hset,zadd -n 1000000
```
**Expected hakmem Strengths**:
- **SET (strings)**: L2.5 Pool (64KB-1MB) high hit rate for medium strings
- **HSET (hash tables)**: Site Rules hash entries segregated by size class
- **ZADD (sorted sets)**: ELO learns skip list node patterns
**Expected hakmem Weaknesses**:
- **INCR (small objects)**: Tiny Pool overhead (7,871ns vs 18ns mimalloc)
- **LPUSH (list nodes)**: Frequent small allocations Tiny Pool slab lookup overhead
- **Memory overhead**: Redis object headers + hakmem metadata higher RSS
**Implementation Difficulty**: **Medium**
- LD_PRELOAD: Easy (2 hours)
- Static linking: Medium (4-6 hours, need Makefile integration)
- Attribution: Hard (need to isolate allocator overhead vs Redis overhead)
**Priority**: **P1 (IMPORTANT)**
- Real-world validation (not synthetic)
- High-profile reference (Redis is widely used)
- Defer until P0 (mimalloc-bench) is complete
**Estimated Time**: 4-8 hours (integration + measurement + analysis)
---
### 1.3 Additional Recommendations
#### 1.3.1 rocksdb Benchmark (P1)
**Name**: RocksDB (persistent key-value store, Facebook)
**Why Representative**:
- Real-world database workload
- Mix of small (keys) + large (values) allocations
- Write-heavy patterns (LSM tree)
- Well-defined benchmark (db_bench)
**Allocation Patterns**:
- Keys: 16B-1KB (frequent, short-lived)
- Values: 100B-1MB (mixed lifetime)
- Memtable: 4MB-128MB (long-lived)
- Block cache: 8KB-64KB (medium-lived)
**Integration**: LD_PRELOAD or Makefile (EXTRA_CXXFLAGS=-lhakmem)
**Expected hakmem Strengths**:
- L2.5 Pool for medium values (64KB-1MB)
- BigCache for memtable (4MB-128MB)
- Site Rules for key/value segregation
**Expected hakmem Weaknesses**:
- Write amplification (LSM tree) high allocation rate Tiny Pool overhead
- Block cache churn L2 Pool fragmentation
**Priority**: **P1**
**Estimated Time**: 6-10 hours
---
#### 1.3.2 parsec Benchmark Suite (P2)
**Name**: PARSEC 3.0 (Princeton Application Repository for Shared-Memory Computers)
**Why Representative**:
- Multi-threaded scientific/engineering workloads
- Real applications (not synthetic)
- Diverse patterns (computation, I/O, synchronization)
**Allocation Patterns**:
| Benchmark | Domain | Allocation Pattern |
|-----------|--------|-------------------|
| blackscholes | Finance | Small arrays (16B-1KB), frequent |
| fluidanimate | Physics | Large arrays (1MB-10MB), infrequent |
| canneal | Engineering | Small objects (32B-256B), graph nodes |
| dedup | Compression | Variable sizes (1KB-1MB), pipeline |
**Integration**: Modify build system (configure --with-allocator=hakmem)
**Expected hakmem Strengths**:
- fluidanimate: BigCache for large arrays
- canneal: L2 Pool for graph nodes
**Expected hakmem Weaknesses**:
- blackscholes: High-frequency small allocations Tiny Pool overhead
- dedup: Pipeline parallelism TLS overhead (per-thread caches)
**Priority**: **P2 (NICE-TO-HAVE)**
**Estimated Time**: 10-16 hours (complex build system)
---
## 2. Gemini Proposals Evaluation
### 2.1 mimalloc Benchmark Suite
**Proposal**: Use Microsoft's mimalloc-bench as primary benchmark.
**Pros**:
- Industry standard (used by mimalloc, jemalloc, tcmalloc authors)
- 20+ diverse workloads (synthetic + real applications)
- Easy integration (LD_PRELOAD + automated runner)
- Direct comparison with competitors (mimalloc, jemalloc, tcmalloc)
- Well-maintained (active development, bug fixes)
- Multi-threaded + single-threaded coverage
- Allocation size diversity (8B-10MB)
**Cons**:
- Some workloads are synthetic (not real applications)
- Linux-focused (macOS/Windows support limited)
- Overhead measurement can be noisy (need multiple runs)
**Integration Difficulty**: **Easy**
```bash
# Clone + build (1 hour)
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh
# Add hakmem to bench.sh (30 minutes)
# Edit bench.sh:
# ALLOCATORS="mimalloc jemalloc tcmalloc hakmem"
# HAKMEM_LIB=/path/to/libhakmem.so
# Run comparison (1-2 hours)
./run-all.sh -b cfrac,larson,threadtest -a mimalloc,jemalloc,hakmem
```
**Recommendation**: **IMPLEMENT IMMEDIATELY (P0)**
**Rationale**:
1. Essential for competitive positioning (mimalloc/jemalloc comparison)
2. Diverse workload coverage validates hakmem's generality
3. Easy integration (2-4 hours total)
4. Will reveal multi-threaded performance (validates TLS decision)
---
### 2.2 jemalloc Benchmark Suite
**Proposal**: Use jemalloc's test suite as benchmark.
**Pros**:
- Some unique workloads (not in mimalloc-bench)
- Validates jemalloc-specific optimizations (size classes, arenas)
- Well-tested code paths
**Cons**:
- Less comprehensive than mimalloc-bench (fewer workloads)
- More focused on correctness tests than performance benchmarks
- Overlap with mimalloc-bench (larson, threadtest duplicates)
- Harder to integrate (need to modify jemalloc's Makefile)
**Integration Difficulty**: **Medium**
```bash
# Clone + build (2 hours)
git clone https://github.com/jemalloc/jemalloc.git
cd jemalloc
./autogen.sh
./configure
make
# Add hakmem to test/integration/
# Edit test/integration/MALLOCX.c to use LD_PRELOAD
LD_PRELOAD=/path/to/libhakmem.so make check
```
**Recommendation**: **SKIP (for now)**
**Rationale**:
1. Overlap with mimalloc-bench (80% duplicate coverage)
2. Less comprehensive for performance testing
3. Higher integration cost (2-4 hours) for marginal benefit
4. Defer until P0 (mimalloc-bench) + P1 (Redis) complete
**Alternative**: Cherry-pick unique jemalloc tests and add to mimalloc-bench suite.
---
### 2.3 Redis
**Proposal**: Use Redis as real-world application benchmark.
**Pros**:
- Real-world production workload (not synthetic)
- High-profile reference (widely used)
- Well-defined benchmark protocol (redis-benchmark)
- Diverse allocation patterns (strings, lists, hashes, sorted sets)
- High throughput (100K+ ops/sec)
- Easy integration (LD_PRELOAD)
**Cons**:
- Complex attribution (hard to isolate allocator overhead)
- Redis-specific optimizations may dominate (object sharing, copy-on-write)
- Single-threaded by default (need redis-cluster for multi-threaded)
- Memory overhead (Redis headers + hakmem metadata)
**Integration Difficulty**: **Medium**
```bash
# LD_PRELOAD (easy, 2 hours)
git clone https://github.com/redis/redis.git
cd redis
make
LD_PRELOAD=/path/to/libhakmem.so ./src/redis-server &
./src/redis-benchmark -t set,get,lpush,hset,zadd -n 1000000
# Static linking (harder, 4-6 hours)
# Edit src/Makefile:
# MALLOC=hakmem
# MALLOC_LIBS=/path/to/libhakmem.a
make MALLOC=hakmem
```
**Recommendation**: **IMPLEMENT AFTER P0 (P1 priority)**
**Rationale**:
1. Real-world validation is valuable (not just synthetic benchmarks)
2. High-profile reference boosts credibility
3. Defer until mimalloc-bench is complete (P0 first)
4. Need careful measurement methodology (attribution complexity)
**Measurement Strategy**:
1. Run redis-benchmark with mimalloc/jemalloc/hakmem
2. Measure ops/sec + latency (p50, p99, p999)
3. Measure RSS (memory overhead)
4. Profile with perf to isolate allocator overhead
5. Use redis-cli --intrinsic-latency to baseline
---
## 3. TLS Condition-Dependency Analysis
### 3.1 Problem Statement
**Observation**: TLS Freelist Cache made single-threaded performance worse (+7-8% degradation).
**Question**: Is this expected? Should we keep TLS for multi-threaded workloads?
---
### 3.2 Quantitative Analysis
#### Single-Threaded Overhead (Measured)
**Source**: Phase 6.12.1 benchmarks (Step 2 Slab Registry)
```
Before TLS: 7,355 ns/op
After TLS: 10,471 ns/op
Overhead: +3,116 ns/op (+42.4%)
```
**Breakdown** (estimated):
- FS register access: ~5 cycles (x86-64 `mov %fs:0, %rax`)
- TLS cache lookup: ~10-20 cycles (hash + probing)
- Branch overhead: ~5-10 cycles (cache hit/miss decision)
- Cache miss fallback: ~50 cycles (lock acquisition + freelist search)
**Total TLS overhead**: ~20-40 cycles per allocation (best case)
**Reality check**: 3,116 ns = 3,116,000 ps **9,000 cycles @ 3GHz**
**Conclusion**: TLS overhead is NOT just FS register access. The regression is likely due to:
1. **Slab Registry hash overhead** (Step 2 change, unrelated to TLS)
2. **TLS cache miss rate** (if cache is too small or eviction policy is bad)
3. **Indirect call overhead** (function pointer for free routing)
**Action**: Re-measure TLS overhead in isolation (revert Slab Registry, keep only TLS).
---
#### Multi-Threaded Benefit (Estimated)
**Contention cost** (without TLS):
- Lock acquisition: ~100-500 cycles (uncontended heavily contended)
- Lock hold time: ~50-100 cycles (freelist search + update)
- Cache line bouncing: ~200 cycles (MESI protocol, remote core)
**Total contention cost**: ~350-800 cycles per allocation (2+ threads)
**TLS benefit**:
- Cache hit rate: 70-90% (typical TLS cache, depends on working set)
- Cycles saved per hit: 350-800 cycles (avoid lock)
- Net benefit: 245-720 cycles per allocation (@ 70% hit rate)
**Break-even point**:
```
TLS overhead: 20-40 cycles (single-threaded)
TLS benefit: 245-720 cycles (multi-threaded, 70% hit rate)
Break-even: 2 threads with moderate contention
```
**Conclusion**: TLS should WIN at 2+ threads, even with 70% cache hit rate.
---
#### hakmem-Specific Factors
**Site Rules already reduce contention**:
- Different call sites different shards (reduced lock contention)
- TLS benefit is REDUCED compared to mimalloc/jemalloc (no site-aware sharding)
**Estimated hakmem TLS benefit**:
- mimalloc TLS benefit: 245-720 cycles (baseline)
- hakmem TLS benefit: 100-300 cycles (Site Rules already reduce 60% contention)
**Revised break-even point**:
```
hakmem TLS overhead: 20-40 cycles
hakmem TLS benefit: 100-300 cycles (2+ threads)
Break-even: 2-4 threads (depends on contention level)
```
**Conclusion**: TLS is LESS valuable for hakmem than for mimalloc/jemalloc, but still beneficial at 4+ threads.
---
### 3.3 Recommendation
**Option Analysis**:
| Option | Pros | Cons | Recommendation |
|--------|------|------|----------------|
| **A. Revert TLS completely** | Simple<br>✅ No single-threaded regression | ❌ Miss multi-threaded benefit<br>❌ Competitive disadvantage | ❌ **NO** |
| **B. Keep TLS + multi-threaded benchmarks** | ✅ Validate effectiveness<br>✅ Data-driven decision | ⚠️ Need benchmark investment<br>⚠️ May still regress single-threaded | ✅ **YES (RECOMMENDED)** |
| **C. Conditional TLS (compile-time)** | ✅ Best of both worlds<br>✅ User control | ⚠️ Maintenance burden (2 code paths)<br>⚠️ Fragmentation risk | ⚠️ **MAYBE (if B fails)** |
| **D. Conditional TLS (runtime)** | ✅ Adaptive (auto-detect threads)<br>✅ No user config | ❌ Complex implementation<br>❌ Runtime overhead (thread counting) | ❌ **NO (over-engineering)** |
**Final Recommendation**: **Option B - Keep TLS + Multi-Threaded Benchmarks**
**Rationale**:
1. **Validate effectiveness**: mimalloc-bench (larson, threadtest) will reveal multi-threaded benefit
2. **Data-driven**: Revert only if multi-threaded benchmarks show no benefit
3. **Competitive analysis**: Compare TLS benefit vs mimalloc/jemalloc (Site Rules advantage)
4. **Defer complex solutions**: If TLS fails validation, THEN consider Option C (compile-time flag)
**Implementation Plan**:
1. **Phase 6.13 (P0)**: Run mimalloc-bench larson/threadtest (1-32 threads)
2. **Measure**: TLS cache hit rate + lock contention reduction
3. **Decide**: If TLS benefit < 20% at 4+ threads Revert or make conditional
---
### 3.4 Expected Results
**Hypothesis**: TLS will be beneficial at 4+ threads, but less impactful than mimalloc/jemalloc due to Site Rules.
**Expected mimalloc-bench results**:
| Benchmark | Threads | hakmem (no TLS) | hakmem (TLS) | mimalloc | Prediction |
|-----------|---------|-----------------|--------------|----------|------------|
| larson | 1 | 100 ns | 108 ns (+8%) | 95 ns | Regression |
| larson | 4 | 200 ns | 150 ns (-25%) | 120 ns | Win (but < mimalloc) |
| larson | 16 | 500 ns | 250 ns (-50%) | 180 ns | Win (but < mimalloc) |
| threadtest | 1 | 80 ns | 86 ns (+7.5%) | 75 ns | Regression |
| threadtest | 4 | 180 ns | 140 ns (-22%) | 110 ns | Win (but < mimalloc) |
| threadtest | 16 | 450 ns | 220 ns (-51%) | 160 ns | Win (but < mimalloc) |
**Validation criteria**:
- **Keep TLS**: If 4-thread benefit > 20% AND 16-thread benefit > 40%
- ⚠️ **Make conditional**: If benefit exists but < 20% at 4 threads
- **Revert TLS**: If no benefit at 4+ threads (unlikely)
---
## 4. Implementation Roadmap
### Phase 6.13: mimalloc-bench Integration (P0, 3-5 hours)
**Goal**: Validate TLS multi-threaded benefit + diverse workload coverage
**Tasks**:
1. Clone mimalloc-bench (30 min)
```bash
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh
```
2. ✅ Build hakmem.so (30 min)
```bash
cd apps/experiments/hakmem-poc
make shared # Build libhakmem.so
```
3. ✅ Add hakmem to bench.sh (1 hour)
```bash
# Edit mimalloc-bench/bench.sh
# Add: HAKMEM_LIB=/path/to/libhakmem.so
# Add to ALLOCATORS: hakmem
```
4. ✅ Run initial benchmarks (1-2 hours)
```bash
# Start with 3 key benchmarks
./run-all.sh -b cfrac,larson,threadtest -a mimalloc,jemalloc,hakmem -t 1,4,16
```
5. ✅ Analyze results (1 hour)
- Compare ops/sec vs mimalloc/jemalloc
- Measure TLS benefit at 1/4/16 threads
- Identify strengths/weaknesses
**Success Criteria**:
- ✅ TLS benefit > 20% at 4 threads (larson, threadtest)
- ✅ Within 2x of mimalloc for single-threaded (cfrac)
- ✅ Identify 2-3 workloads where hakmem excels
**Next Steps**:
- If TLS validation succeeds → Phase 6.14 (expand to 10+ benchmarks)
- If TLS validation fails → Phase 6.13.1 (revert or make conditional)
---
### Phase 6.14: mimalloc-bench Expansion (P0, 4-6 hours)
**Goal**: Comprehensive coverage (10+ workloads)
**Workloads**:
- Single-threaded: cfrac, espresso, barnes, sh6bench, cache-scratch
- Multi-threaded: larson, threadtest, mstress, xmalloc-test
- Real apps: redis (via mimalloc-bench), lua, ruby
**Analysis**:
- Identify hakmem strengths (L2.5 Pool, Site Rules, ELO)
- Identify hakmem weaknesses (Tiny Pool overhead, TLS overhead)
- Prioritize optimizations (P0: fix Tiny Pool, P1: tune TLS, P2: ELO thresholds)
**Deliverable**: Benchmark report (markdown) with:
- Table: hakmem vs mimalloc vs jemalloc (ops/sec, RSS)
- Strengths/weaknesses analysis
- Optimization roadmap (P0/P1/P2)
---
### Phase 6.15: Redis Integration (P1, 6-10 hours)
**Goal**: Real-world validation (production workload)
**Tasks**:
1. ✅ Build Redis with hakmem (LD_PRELOAD or static linking)
2. ✅ Run redis-benchmark (SET, GET, LPUSH, HSET, ZADD)
3. ✅ Measure ops/sec + latency (p50, p99, p999)
4. ✅ Profile with perf (isolate allocator overhead)
5. ✅ Compare vs mimalloc/jemalloc
**Success Criteria**:
- ✅ Within 10% of mimalloc for SET/GET (common case)
- ✅ RSS < 1.2x mimalloc (memory overhead acceptable)
- ✅ No crashes or correctness issues
**Defer until**: mimalloc-bench Phase 6.14 complete
---
### Phase 6.16: Tiny Pool Optimization (P0, 8-12 hours)
**Goal**: Fix Tiny Pool overhead (7,871ns → <200ns target)
**Based on**: mimalloc-bench results (barnes, small-object workloads)
**Tasks**:
1. ✅ Implement Option B: Slab metadata in first 16B (Phase 6.12.1 deferred)
2. ✅ Remove double lookups (class determination + slab lookup)
3. ✅ Remove memset (already done in Phase 6.10.1)
4. ✅ TLS integration (if Phase 6.13 validates effectiveness)
**Target**: 50-80 ns/op (mimalloc is 18ns, 3-4x overhead acceptable)
**Defer until**: mimalloc-bench Phase 6.13 complete (validates priority)
---
### Phase 6.17: L2.5 Pool Tuning (P1, 4-6 hours)
**Goal**: Optimize L2.5 Pool based on mimalloc-bench results
**Based on**: mimalloc-bench medium-size workloads (64KB-1MB)
**Tasks**:
1. ✅ Measure L2.5 Pool hit rate (per benchmark)
2. ✅ Tune ELO thresholds (budget allocation per size class)
3. ✅ Optimize page granularity (64KB vs 128KB)
4. ✅ Non-empty bitmap validation (ensure O(1) search)
**Defer until**: Phase 6.14 (mimalloc-bench expansion) complete
---
## 5. Summary & Next Actions
### Immediate Actions (Next 48 Hours)
**Phase 6.13 (P0)**: mimalloc-bench integration
1. ✅ Clone mimalloc-bench (30 min)
2. ✅ Build hakmem.so (30 min)
3. ✅ Run cfrac + larson + threadtest (1-2 hours)
4. ✅ Analyze TLS multi-threaded benefit (1 hour)
**Decision Point**: Keep TLS or revert based on 4-thread results
---
### Priority Ranking
| Phase | Benchmark | Priority | Time | Rationale |
|-------|-----------|----------|------|-----------|
| 6.13 | mimalloc-bench (3 workloads) | **P0** | 3-5h | Validate TLS + diverse patterns |
| 6.14 | mimalloc-bench (10+ workloads) | **P0** | 4-6h | Comprehensive coverage |
| 6.16 | Tiny Pool optimization | **P0** | 8-12h | Fix critical regression (7,871ns) |
| 6.15 | Redis | **P1** | 6-10h | Real-world validation |
| 6.17 | L2.5 Pool tuning | **P1** | 4-6h | Optimize based on results |
| -- | rocksdb | **P1** | 6-10h | Additional real-world validation |
| -- | parsec | **P2** | 10-16h | Defer (complex, low ROI) |
| -- | jemalloc-test | **P2** | 4-6h | Skip (overlap with mimalloc-bench) |
**Total estimated time (P0)**: 15-23 hours
**Total estimated time (P0+P1)**: 31-49 hours
---
### Key Insights
1. **mimalloc-bench is essential** - industry standard, easy integration, diverse coverage
2. **TLS needs multi-threaded validation** - single-threaded regression is expected
3. **Site Rules reduce TLS benefit** - hakmem's unique advantage may diminish TLS value
4. **Tiny Pool is critical** - 437x regression (vs mimalloc) must be fixed before competitive analysis
5. **Redis is valuable but defer** - real-world validation after P0 complete
---
### Risk Mitigation
**Risk 1**: TLS validation fails (no benefit at 4+ threads)
- **Mitigation**: Revert TLS or make compile-time conditional (HAKMEM_MULTITHREAD)
- **Timeline**: Decision after Phase 6.13 (3-5 hours)
**Risk 2**: Tiny Pool optimization fails (can't reach <200ns target)
- **Mitigation**: Defer Tiny Pool, focus on L2/L2.5/BigCache strengths
- **Timeline**: Reassess after Phase 6.16 (8-12 hours)
**Risk 3**: mimalloc-bench integration harder than expected
- **Mitigation**: Start with LD_PRELOAD (easiest), defer static linking
- **Timeline**: Fallback to manual scripting if bench.sh integration fails
---
## Appendix: Technical Details
### A.1 TLS Cache Design Considerations
**Current design** (Phase 6.12.1 Step 2):
```c
// Per-thread cache (FS register)
__thread struct {
void* freelist[8]; // 8 size classes (8B-1KB)
uint64_t bitmap; // non-empty classes
} tls_cache;
```
**Potential issues**:
1. **Cache size too small** (8 entries) → high miss rate
2. **No eviction policy** → stale entries waste space
3. **No statistics** → can't measure hit rate
**Recommended improvements** (if Phase 6.13 validates TLS):
1. Increase cache size (8 → 16 or 32 entries)
2. Add LRU eviction (timestamp per entry)
3. Add hit/miss counters (enable with HAKMEM_STATS=1)
---
### A.2 mimalloc-bench Expected Results
**Baseline** (mimalloc performance, from published benchmarks):
| Benchmark | Threads | mimalloc (ops/sec) | jemalloc (ops/sec) | tcmalloc (ops/sec) |
|-----------|---------|-------------------|-------------------|-------------------|
| cfrac | 1 | 10,500,000 | 9,800,000 | 8,900,000 |
| larson | 1 | 8,200,000 | 7,500,000 | 6,800,000 |
| larson | 16 | 95,000,000 | 78,000,000 | 62,000,000 |
| threadtest | 1 | 12,000,000 | 11,000,000 | 10,500,000 |
| threadtest | 16 | 180,000,000 | 150,000,000 | 130,000,000 |
**hakmem targets** (realistic given current state):
| Benchmark | Threads | hakmem target | Gap to mimalloc | Notes |
|-----------|---------|---------------|-----------------|-------|
| cfrac | 1 | 5,000,000+ | 2.1x slower | Tiny Pool overhead |
| larson | 1 | 4,000,000+ | 2.0x slower | Tiny Pool + TLS overhead |
| larson | 16 | 70,000,000+ | 1.35x slower | Site Rules + TLS benefit |
| threadtest | 1 | 6,000,000+ | 2.0x slower | Tiny Pool + TLS overhead |
| threadtest | 16 | 130,000,000+ | 1.38x slower | Site Rules + TLS benefit |
**Acceptable thresholds**:
- ✅ **Single-threaded**: Within 2x of mimalloc (current state)
- ✅ **Multi-threaded (16 threads)**: Within 1.5x of mimalloc (after TLS)
- ⚠️ **Stretch goal**: Within 1.2x of mimalloc (requires Tiny Pool fix)
---
### A.3 Redis Benchmark Methodology
**Workload selection**:
```bash
# Core operations (99% of real-world Redis usage)
redis-benchmark -t set,get,lpush,lpop,hset,hget,zadd,zrange -n 10000000
# Memory-intensive operations
redis-benchmark -t set -d 1024 -n 1000000 # 1KB values
redis-benchmark -t set -d 102400 -n 100000 # 100KB values
# Multi-threaded (redis-cluster)
redis-benchmark -t set,get -n 10000000 -c 50 --threads 8
```
**Metrics to collect**:
1. **Throughput**: ops/sec (higher is better)
2. **Latency**: p50, p99, p999 (lower is better)
3. **Memory**: RSS, fragmentation ratio (lower is better)
4. **Allocator overhead**: perf top (% cycles in malloc/free)
**Attribution strategy**:
```bash
# Isolate allocator overhead
perf record -g ./redis-server &
redis-benchmark -t set,get -n 10000000
perf report --stdio | grep -E 'malloc|free|hakmem'
# Expected allocator overhead: 5-15% of total cycles
```
---
**End of Report**
This analysis provides a comprehensive roadmap for hakmem's benchmark strategy and TLS optimization. The key recommendation is to implement mimalloc-bench (Phase 6.13) immediately to validate multi-threaded TLS benefit, then expand to comprehensive coverage (Phase 6.14) before tackling real-world applications like Redis (Phase 6.15).