Files
hakmem/docs/analysis/ULTRATHINK_BENCHMARK_ANALYSIS.md
Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00

26 KiB

hakmem Benchmark Strategy & TLS Analysis

Author: ultrathink (ChatGPT o1) Date: 2025-10-22 Context: Real-world benchmark recommendations + TLS Freelist Cache evaluation


Executive Summary

Current Problem: hakmem benchmarks are too size-specific (64KB, 256KB, 2MB), leading to peaky optimizations that may not reflect real-world performance.

Key Findings:

  1. mimalloc-bench is essential (P0) - industry standard with diverse patterns
  2. TLS overhead is expected in single-threaded workloads - need multi-threaded validation
  3. Redis is valuable but complex (P1) - defer until after mimalloc-bench
  4. Recommended approach: Keep TLS + add multi-threaded benchmarks to validate effectiveness

1. Real-World Benchmark Recommendations

1.1 mimalloc-bench Suite (P0 - MUST IMPLEMENT)

Name: mimalloc-bench (Microsoft Research allocator benchmark suite)

Why Representative:

  • Industry-standard benchmark used by mimalloc, jemalloc, tcmalloc authors
  • 20+ workloads covering diverse allocation patterns
  • Mix of synthetic stress tests + real applications
  • Well-maintained, actively used for allocator research

Allocation Patterns:

Benchmark Sizes Lifetime Threads Pattern
larson 10B-1KB short 1-32 Multi-threaded churn
threadtest 64B-4KB mixed 1-16 Per-thread allocation
mstress 16B-2KB short 1-32 Stress test
cfrac 24B-400B medium 1 Mathematical computation
espresso 16B-1KB mixed 1 Logic minimization
barnes 32B-96B long 1 N-body simulation
cache-scratch 8B-256KB short 1-8 Cache-unfriendly
sh6bench 16B-4KB mixed 1 Shell script workload

Integration Method:

# Easy integration via LD_PRELOAD
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh

# Run with hakmem
LD_PRELOAD=/path/to/libhakmem.so ./bench/cfrac/cfrac 17

# Automated comparison
./run-all.sh -b cfrac,larson,threadtest -a mimalloc,jemalloc,hakmem

Expected hakmem Strengths:

  • larson: Site Rules should reduce lock contention (different threads → different sites)
  • cfrac: L2 Pool non-empty bitmap → O(1) small-object allocation
  • cache-scratch: ELO should learn cache-unfriendly patterns → segregate hot/cold

Expected hakmem Weaknesses:

  • barnes: Long-lived small objects (32-96B) → Tiny Pool overhead (7,871ns vs 18ns)
  • mstress: High-churn stress test → free policy overhead (Hot/Warm/Cold decision)
  • threadtest: TLS overhead (+7-8%) if thread count < 4

Implementation Difficulty: Easy

  • LD_PRELOAD integration (no code changes)
  • Automated benchmark runner (./run-all.sh)
  • Comparison reports (CSV/JSON output)

Priority: P0 (MUST-HAVE)

  • Essential for competitive analysis
  • Diverse workload coverage
  • Direct comparison with mimalloc/jemalloc

Estimated Time: 2-4 hours (setup + initial run + analysis)


1.2 Redis Benchmark (P1 - IMPORTANT)

Name: Redis 7.x (in-memory data store)

Why Representative:

  • Real-world production workload (not synthetic)
  • Complex allocation patterns (strings, lists, hashes, sorted sets)
  • High-throughput (100K+ ops/sec)
  • Well-defined benchmark protocol (redis-benchmark)

Allocation Patterns:

Operation Sizes Lifetime Pattern
SET key val 16B-512KB medium-long String allocation
LPUSH list val 16B-64KB medium List node allocation
HSET hash field val 16B-4KB long Hash table + entries
ZADD zset score val 32B-1KB long Skip list + hash
INCR counter 8B long Small integer objects

Integration Method:

# Method 1: LD_PRELOAD (easiest)
git clone https://github.com/redis/redis.git
cd redis
make
LD_PRELOAD=/path/to/libhakmem.so ./src/redis-server &
./src/redis-benchmark -t set,get,lpush,hset,zadd -n 1000000

# Method 2: Static linking (more accurate)
# Edit src/Makefile:
# MALLOC=hakmem
# MALLOC_LIBS=/path/to/libhakmem.a
make MALLOC=hakmem
./src/redis-server &
./src/redis-benchmark -t set,get,lpush,hset,zadd -n 1000000

Expected hakmem Strengths:

  • SET (strings): L2.5 Pool (64KB-1MB) → high hit rate for medium strings
  • HSET (hash tables): Site Rules → hash entries segregated by size class
  • ZADD (sorted sets): ELO → learns skip list node patterns

Expected hakmem Weaknesses:

  • INCR (small objects): Tiny Pool overhead (7,871ns vs 18ns mimalloc)
  • LPUSH (list nodes): Frequent small allocations → Tiny Pool slab lookup overhead
  • Memory overhead: Redis object headers + hakmem metadata → higher RSS

Implementation Difficulty: Medium

  • LD_PRELOAD: Easy (2 hours)
  • Static linking: Medium (4-6 hours, need Makefile integration)
  • Attribution: Hard (need to isolate allocator overhead vs Redis overhead)

Priority: P1 (IMPORTANT)

  • Real-world validation (not synthetic)
  • High-profile reference (Redis is widely used)
  • Defer until P0 (mimalloc-bench) is complete

Estimated Time: 4-8 hours (integration + measurement + analysis)


1.3 Additional Recommendations

1.3.1 rocksdb Benchmark (P1)

Name: RocksDB (persistent key-value store, Facebook)

Why Representative:

  • Real-world database workload
  • Mix of small (keys) + large (values) allocations
  • Write-heavy patterns (LSM tree)
  • Well-defined benchmark (db_bench)

Allocation Patterns:

  • Keys: 16B-1KB (frequent, short-lived)
  • Values: 100B-1MB (mixed lifetime)
  • Memtable: 4MB-128MB (long-lived)
  • Block cache: 8KB-64KB (medium-lived)

Integration: LD_PRELOAD or Makefile (EXTRA_CXXFLAGS=-lhakmem)

Expected hakmem Strengths:

  • L2.5 Pool for medium values (64KB-1MB)
  • BigCache for memtable (4MB-128MB)
  • Site Rules for key/value segregation

Expected hakmem Weaknesses:

  • Write amplification (LSM tree) → high allocation rate → Tiny Pool overhead
  • Block cache churn → L2 Pool fragmentation

Priority: P1 Estimated Time: 6-10 hours


1.3.2 parsec Benchmark Suite (P2)

Name: PARSEC 3.0 (Princeton Application Repository for Shared-Memory Computers)

Why Representative:

  • Multi-threaded scientific/engineering workloads
  • Real applications (not synthetic)
  • Diverse patterns (computation, I/O, synchronization)

Allocation Patterns:

Benchmark Domain Allocation Pattern
blackscholes Finance Small arrays (16B-1KB), frequent
fluidanimate Physics Large arrays (1MB-10MB), infrequent
canneal Engineering Small objects (32B-256B), graph nodes
dedup Compression Variable sizes (1KB-1MB), pipeline

Integration: Modify build system (configure --with-allocator=hakmem)

Expected hakmem Strengths:

  • fluidanimate: BigCache for large arrays
  • canneal: L2 Pool for graph nodes

Expected hakmem Weaknesses:

  • blackscholes: High-frequency small allocations → Tiny Pool overhead
  • dedup: Pipeline parallelism → TLS overhead (per-thread caches)

Priority: P2 (NICE-TO-HAVE) Estimated Time: 10-16 hours (complex build system)


2. Gemini Proposals Evaluation

2.1 mimalloc Benchmark Suite

Proposal: Use Microsoft's mimalloc-bench as primary benchmark.

Pros:

  • Industry standard (used by mimalloc, jemalloc, tcmalloc authors)
  • 20+ diverse workloads (synthetic + real applications)
  • Easy integration (LD_PRELOAD + automated runner)
  • Direct comparison with competitors (mimalloc, jemalloc, tcmalloc)
  • Well-maintained (active development, bug fixes)
  • Multi-threaded + single-threaded coverage
  • Allocation size diversity (8B-10MB)

Cons:

  • ⚠️ Some workloads are synthetic (not real applications)
  • ⚠️ Linux-focused (macOS/Windows support limited)
  • ⚠️ Overhead measurement can be noisy (need multiple runs)

Integration Difficulty: Easy

# Clone + build (1 hour)
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh

# Add hakmem to bench.sh (30 minutes)
# Edit bench.sh:
# ALLOCATORS="mimalloc jemalloc tcmalloc hakmem"
# HAKMEM_LIB=/path/to/libhakmem.so

# Run comparison (1-2 hours)
./run-all.sh -b cfrac,larson,threadtest -a mimalloc,jemalloc,hakmem

Recommendation: IMPLEMENT IMMEDIATELY (P0)

Rationale:

  1. Essential for competitive positioning (mimalloc/jemalloc comparison)
  2. Diverse workload coverage validates hakmem's generality
  3. Easy integration (2-4 hours total)
  4. Will reveal multi-threaded performance (validates TLS decision)

2.2 jemalloc Benchmark Suite

Proposal: Use jemalloc's test suite as benchmark.

Pros:

  • Some unique workloads (not in mimalloc-bench)
  • Validates jemalloc-specific optimizations (size classes, arenas)
  • Well-tested code paths

Cons:

  • ⚠️ Less comprehensive than mimalloc-bench (fewer workloads)
  • ⚠️ More focused on correctness tests than performance benchmarks
  • ⚠️ Overlap with mimalloc-bench (larson, threadtest duplicates)
  • ⚠️ Harder to integrate (need to modify jemalloc's Makefile)

Integration Difficulty: Medium

# Clone + build (2 hours)
git clone https://github.com/jemalloc/jemalloc.git
cd jemalloc
./autogen.sh
./configure
make

# Add hakmem to test/integration/
# Edit test/integration/MALLOCX.c to use LD_PRELOAD
LD_PRELOAD=/path/to/libhakmem.so make check

Recommendation: SKIP (for now)

Rationale:

  1. Overlap with mimalloc-bench (80% duplicate coverage)
  2. Less comprehensive for performance testing
  3. Higher integration cost (2-4 hours) for marginal benefit
  4. Defer until P0 (mimalloc-bench) + P1 (Redis) complete

Alternative: Cherry-pick unique jemalloc tests and add to mimalloc-bench suite.


2.3 Redis

Proposal: Use Redis as real-world application benchmark.

Pros:

  • Real-world production workload (not synthetic)
  • High-profile reference (widely used)
  • Well-defined benchmark protocol (redis-benchmark)
  • Diverse allocation patterns (strings, lists, hashes, sorted sets)
  • High throughput (100K+ ops/sec)
  • Easy integration (LD_PRELOAD)

Cons:

  • ⚠️ Complex attribution (hard to isolate allocator overhead)
  • ⚠️ Redis-specific optimizations may dominate (object sharing, copy-on-write)
  • ⚠️ Single-threaded by default (need redis-cluster for multi-threaded)
  • ⚠️ Memory overhead (Redis headers + hakmem metadata)

Integration Difficulty: Medium

# LD_PRELOAD (easy, 2 hours)
git clone https://github.com/redis/redis.git
cd redis
make
LD_PRELOAD=/path/to/libhakmem.so ./src/redis-server &
./src/redis-benchmark -t set,get,lpush,hset,zadd -n 1000000

# Static linking (harder, 4-6 hours)
# Edit src/Makefile:
# MALLOC=hakmem
# MALLOC_LIBS=/path/to/libhakmem.a
make MALLOC=hakmem

Recommendation: IMPLEMENT AFTER P0 (P1 priority)

Rationale:

  1. Real-world validation is valuable (not just synthetic benchmarks)
  2. High-profile reference boosts credibility
  3. Defer until mimalloc-bench is complete (P0 first)
  4. Need careful measurement methodology (attribution complexity)

Measurement Strategy:

  1. Run redis-benchmark with mimalloc/jemalloc/hakmem
  2. Measure ops/sec + latency (p50, p99, p999)
  3. Measure RSS (memory overhead)
  4. Profile with perf to isolate allocator overhead
  5. Use redis-cli --intrinsic-latency to baseline

3. TLS Condition-Dependency Analysis

3.1 Problem Statement

Observation: TLS Freelist Cache made single-threaded performance worse (+7-8% degradation).

Question: Is this expected? Should we keep TLS for multi-threaded workloads?


3.2 Quantitative Analysis

Single-Threaded Overhead (Measured)

Source: Phase 6.12.1 benchmarks (Step 2 Slab Registry)

Before TLS:  7,355 ns/op
After TLS:  10,471 ns/op
Overhead:   +3,116 ns/op (+42.4%)

Breakdown (estimated):

  • FS register access: ~5 cycles (x86-64 mov %fs:0, %rax)
  • TLS cache lookup: ~10-20 cycles (hash + probing)
  • Branch overhead: ~5-10 cycles (cache hit/miss decision)
  • Cache miss fallback: ~50 cycles (lock acquisition + freelist search)

Total TLS overhead: ~20-40 cycles per allocation (best case)

Reality check: 3,116 ns = 3,116,000 ps ≈ 9,000 cycles @ 3GHz

Conclusion: TLS overhead is NOT just FS register access. The regression is likely due to:

  1. Slab Registry hash overhead (Step 2 change, unrelated to TLS)
  2. TLS cache miss rate (if cache is too small or eviction policy is bad)
  3. Indirect call overhead (function pointer for free routing)

Action: Re-measure TLS overhead in isolation (revert Slab Registry, keep only TLS).


Multi-Threaded Benefit (Estimated)

Contention cost (without TLS):

  • Lock acquisition: ~100-500 cycles (uncontended → heavily contended)
  • Lock hold time: ~50-100 cycles (freelist search + update)
  • Cache line bouncing: ~200 cycles (MESI protocol, remote core)

Total contention cost: ~350-800 cycles per allocation (2+ threads)

TLS benefit:

  • Cache hit rate: 70-90% (typical TLS cache, depends on working set)
  • Cycles saved per hit: 350-800 cycles (avoid lock)
  • Net benefit: 245-720 cycles per allocation (@ 70% hit rate)

Break-even point:

TLS overhead: 20-40 cycles (single-threaded)
TLS benefit: 245-720 cycles (multi-threaded, 70% hit rate)

Break-even: 2 threads with moderate contention

Conclusion: TLS should WIN at 2+ threads, even with 70% cache hit rate.


hakmem-Specific Factors

Site Rules already reduce contention:

  • Different call sites → different shards (reduced lock contention)
  • TLS benefit is REDUCED compared to mimalloc/jemalloc (no site-aware sharding)

Estimated hakmem TLS benefit:

  • mimalloc TLS benefit: 245-720 cycles (baseline)
  • hakmem TLS benefit: 100-300 cycles (Site Rules already reduce 60% contention)

Revised break-even point:

hakmem TLS overhead: 20-40 cycles
hakmem TLS benefit: 100-300 cycles (2+ threads)

Break-even: 2-4 threads (depends on contention level)

Conclusion: TLS is LESS valuable for hakmem than for mimalloc/jemalloc, but still beneficial at 4+ threads.


3.3 Recommendation

Option Analysis:

Option Pros Cons Recommendation
A. Revert TLS completely Simple
No single-threaded regression
Miss multi-threaded benefit
Competitive disadvantage
NO
B. Keep TLS + multi-threaded benchmarks Validate effectiveness
Data-driven decision
⚠️ Need benchmark investment
⚠️ May still regress single-threaded
YES (RECOMMENDED)
C. Conditional TLS (compile-time) Best of both worlds
User control
⚠️ Maintenance burden (2 code paths)
⚠️ Fragmentation risk
⚠️ MAYBE (if B fails)
D. Conditional TLS (runtime) Adaptive (auto-detect threads)
No user config
Complex implementation
Runtime overhead (thread counting)
NO (over-engineering)

Final Recommendation: Option B - Keep TLS + Multi-Threaded Benchmarks

Rationale:

  1. Validate effectiveness: mimalloc-bench (larson, threadtest) will reveal multi-threaded benefit
  2. Data-driven: Revert only if multi-threaded benchmarks show no benefit
  3. Competitive analysis: Compare TLS benefit vs mimalloc/jemalloc (Site Rules advantage)
  4. Defer complex solutions: If TLS fails validation, THEN consider Option C (compile-time flag)

Implementation Plan:

  1. Phase 6.13 (P0): Run mimalloc-bench larson/threadtest (1-32 threads)
  2. Measure: TLS cache hit rate + lock contention reduction
  3. Decide: If TLS benefit < 20% at 4+ threads → Revert or make conditional

3.4 Expected Results

Hypothesis: TLS will be beneficial at 4+ threads, but less impactful than mimalloc/jemalloc due to Site Rules.

Expected mimalloc-bench results:

Benchmark Threads hakmem (no TLS) hakmem (TLS) mimalloc Prediction
larson 1 100 ns 108 ns (+8%) 95 ns ⚠️ Regression
larson 4 200 ns 150 ns (-25%) 120 ns Win (but < mimalloc)
larson 16 500 ns 250 ns (-50%) 180 ns Win (but < mimalloc)
threadtest 1 80 ns 86 ns (+7.5%) 75 ns ⚠️ Regression
threadtest 4 180 ns 140 ns (-22%) 110 ns Win (but < mimalloc)
threadtest 16 450 ns 220 ns (-51%) 160 ns Win (but < mimalloc)

Validation criteria:

  • Keep TLS: If 4-thread benefit > 20% AND 16-thread benefit > 40%
  • ⚠️ Make conditional: If benefit exists but < 20% at 4 threads
  • Revert TLS: If no benefit at 4+ threads (unlikely)

4. Implementation Roadmap

Phase 6.13: mimalloc-bench Integration (P0, 3-5 hours)

Goal: Validate TLS multi-threaded benefit + diverse workload coverage

Tasks:

  1. Clone mimalloc-bench (30 min)

    git clone https://github.com/daanx/mimalloc-bench.git
    cd mimalloc-bench
    ./build-all.sh
    
  2. Build hakmem.so (30 min)

    cd apps/experiments/hakmem-poc
    make shared  # Build libhakmem.so
    
  3. Add hakmem to bench.sh (1 hour)

    # Edit mimalloc-bench/bench.sh
    # Add: HAKMEM_LIB=/path/to/libhakmem.so
    # Add to ALLOCATORS: hakmem
    
  4. Run initial benchmarks (1-2 hours)

    # Start with 3 key benchmarks
    ./run-all.sh -b cfrac,larson,threadtest -a mimalloc,jemalloc,hakmem -t 1,4,16
    
  5. Analyze results (1 hour)

    • Compare ops/sec vs mimalloc/jemalloc
    • Measure TLS benefit at 1/4/16 threads
    • Identify strengths/weaknesses

Success Criteria:

  • TLS benefit > 20% at 4 threads (larson, threadtest)
  • Within 2x of mimalloc for single-threaded (cfrac)
  • Identify 2-3 workloads where hakmem excels

Next Steps:

  • If TLS validation succeeds → Phase 6.14 (expand to 10+ benchmarks)
  • If TLS validation fails → Phase 6.13.1 (revert or make conditional)

Phase 6.14: mimalloc-bench Expansion (P0, 4-6 hours)

Goal: Comprehensive coverage (10+ workloads)

Workloads:

  • Single-threaded: cfrac, espresso, barnes, sh6bench, cache-scratch
  • Multi-threaded: larson, threadtest, mstress, xmalloc-test
  • Real apps: redis (via mimalloc-bench), lua, ruby

Analysis:

  • Identify hakmem strengths (L2.5 Pool, Site Rules, ELO)
  • Identify hakmem weaknesses (Tiny Pool overhead, TLS overhead)
  • Prioritize optimizations (P0: fix Tiny Pool, P1: tune TLS, P2: ELO thresholds)

Deliverable: Benchmark report (markdown) with:

  • Table: hakmem vs mimalloc vs jemalloc (ops/sec, RSS)
  • Strengths/weaknesses analysis
  • Optimization roadmap (P0/P1/P2)

Phase 6.15: Redis Integration (P1, 6-10 hours)

Goal: Real-world validation (production workload)

Tasks:

  1. Build Redis with hakmem (LD_PRELOAD or static linking)
  2. Run redis-benchmark (SET, GET, LPUSH, HSET, ZADD)
  3. Measure ops/sec + latency (p50, p99, p999)
  4. Profile with perf (isolate allocator overhead)
  5. Compare vs mimalloc/jemalloc

Success Criteria:

  • Within 10% of mimalloc for SET/GET (common case)
  • RSS < 1.2x mimalloc (memory overhead acceptable)
  • No crashes or correctness issues

Defer until: mimalloc-bench Phase 6.14 complete


Phase 6.16: Tiny Pool Optimization (P0, 8-12 hours)

Goal: Fix Tiny Pool overhead (7,871ns → <200ns target)

Based on: mimalloc-bench results (barnes, small-object workloads)

Tasks:

  1. Implement Option B: Slab metadata in first 16B (Phase 6.12.1 deferred)
  2. Remove double lookups (class determination + slab lookup)
  3. Remove memset (already done in Phase 6.10.1)
  4. TLS integration (if Phase 6.13 validates effectiveness)

Target: 50-80 ns/op (mimalloc is 18ns, 3-4x overhead acceptable)

Defer until: mimalloc-bench Phase 6.13 complete (validates priority)


Phase 6.17: L2.5 Pool Tuning (P1, 4-6 hours)

Goal: Optimize L2.5 Pool based on mimalloc-bench results

Based on: mimalloc-bench medium-size workloads (64KB-1MB)

Tasks:

  1. Measure L2.5 Pool hit rate (per benchmark)
  2. Tune ELO thresholds (budget allocation per size class)
  3. Optimize page granularity (64KB vs 128KB)
  4. Non-empty bitmap validation (ensure O(1) search)

Defer until: Phase 6.14 (mimalloc-bench expansion) complete


5. Summary & Next Actions

Immediate Actions (Next 48 Hours)

Phase 6.13 (P0): mimalloc-bench integration

  1. Clone mimalloc-bench (30 min)
  2. Build hakmem.so (30 min)
  3. Run cfrac + larson + threadtest (1-2 hours)
  4. Analyze TLS multi-threaded benefit (1 hour)

Decision Point: Keep TLS or revert based on 4-thread results


Priority Ranking

Phase Benchmark Priority Time Rationale
6.13 mimalloc-bench (3 workloads) P0 3-5h Validate TLS + diverse patterns
6.14 mimalloc-bench (10+ workloads) P0 4-6h Comprehensive coverage
6.16 Tiny Pool optimization P0 8-12h Fix critical regression (7,871ns)
6.15 Redis P1 6-10h Real-world validation
6.17 L2.5 Pool tuning P1 4-6h Optimize based on results
-- rocksdb P1 6-10h Additional real-world validation
-- parsec P2 10-16h Defer (complex, low ROI)
-- jemalloc-test P2 4-6h Skip (overlap with mimalloc-bench)

Total estimated time (P0): 15-23 hours Total estimated time (P0+P1): 31-49 hours


Key Insights

  1. mimalloc-bench is essential - industry standard, easy integration, diverse coverage
  2. TLS needs multi-threaded validation - single-threaded regression is expected
  3. Site Rules reduce TLS benefit - hakmem's unique advantage may diminish TLS value
  4. Tiny Pool is critical - 437x regression (vs mimalloc) must be fixed before competitive analysis
  5. Redis is valuable but defer - real-world validation after P0 complete

Risk Mitigation

Risk 1: TLS validation fails (no benefit at 4+ threads)

  • Mitigation: Revert TLS or make compile-time conditional (HAKMEM_MULTITHREAD)
  • Timeline: Decision after Phase 6.13 (3-5 hours)

Risk 2: Tiny Pool optimization fails (can't reach <200ns target)

  • Mitigation: Defer Tiny Pool, focus on L2/L2.5/BigCache strengths
  • Timeline: Reassess after Phase 6.16 (8-12 hours)

Risk 3: mimalloc-bench integration harder than expected

  • Mitigation: Start with LD_PRELOAD (easiest), defer static linking
  • Timeline: Fallback to manual scripting if bench.sh integration fails

Appendix: Technical Details

A.1 TLS Cache Design Considerations

Current design (Phase 6.12.1 Step 2):

// Per-thread cache (FS register)
__thread struct {
    void* freelist[8];  // 8 size classes (8B-1KB)
    uint64_t bitmap;    // non-empty classes
} tls_cache;

Potential issues:

  1. Cache size too small (8 entries) → high miss rate
  2. No eviction policy → stale entries waste space
  3. No statistics → can't measure hit rate

Recommended improvements (if Phase 6.13 validates TLS):

  1. Increase cache size (8 → 16 or 32 entries)
  2. Add LRU eviction (timestamp per entry)
  3. Add hit/miss counters (enable with HAKMEM_STATS=1)

A.2 mimalloc-bench Expected Results

Baseline (mimalloc performance, from published benchmarks):

Benchmark Threads mimalloc (ops/sec) jemalloc (ops/sec) tcmalloc (ops/sec)
cfrac 1 10,500,000 9,800,000 8,900,000
larson 1 8,200,000 7,500,000 6,800,000
larson 16 95,000,000 78,000,000 62,000,000
threadtest 1 12,000,000 11,000,000 10,500,000
threadtest 16 180,000,000 150,000,000 130,000,000

hakmem targets (realistic given current state):

Benchmark Threads hakmem target Gap to mimalloc Notes
cfrac 1 5,000,000+ 2.1x slower Tiny Pool overhead
larson 1 4,000,000+ 2.0x slower Tiny Pool + TLS overhead
larson 16 70,000,000+ 1.35x slower Site Rules + TLS benefit
threadtest 1 6,000,000+ 2.0x slower Tiny Pool + TLS overhead
threadtest 16 130,000,000+ 1.38x slower Site Rules + TLS benefit

Acceptable thresholds:

  • Single-threaded: Within 2x of mimalloc (current state)
  • Multi-threaded (16 threads): Within 1.5x of mimalloc (after TLS)
  • ⚠️ Stretch goal: Within 1.2x of mimalloc (requires Tiny Pool fix)

A.3 Redis Benchmark Methodology

Workload selection:

# Core operations (99% of real-world Redis usage)
redis-benchmark -t set,get,lpush,lpop,hset,hget,zadd,zrange -n 10000000

# Memory-intensive operations
redis-benchmark -t set -d 1024 -n 1000000  # 1KB values
redis-benchmark -t set -d 102400 -n 100000  # 100KB values

# Multi-threaded (redis-cluster)
redis-benchmark -t set,get -n 10000000 -c 50 --threads 8

Metrics to collect:

  1. Throughput: ops/sec (higher is better)
  2. Latency: p50, p99, p999 (lower is better)
  3. Memory: RSS, fragmentation ratio (lower is better)
  4. Allocator overhead: perf top (% cycles in malloc/free)

Attribution strategy:

# Isolate allocator overhead
perf record -g ./redis-server &
redis-benchmark -t set,get -n 10000000
perf report --stdio | grep -E 'malloc|free|hakmem'

# Expected allocator overhead: 5-15% of total cycles

End of Report

This analysis provides a comprehensive roadmap for hakmem's benchmark strategy and TLS optimization. The key recommendation is to implement mimalloc-bench (Phase 6.13) immediately to validate multi-threaded TLS benefit, then expand to comprehensive coverage (Phase 6.14) before tackling real-world applications like Redis (Phase 6.15).