hakmem/docs/analysis/COMPREHENSIVE_BENCHMARK_ANALYSIS.md
Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History
2025-11-05 12:31:14 +09:00


Comprehensive Benchmark Analysis

Bitmap vs Free-List Trade-offs

Date: 2025-10-26

Purpose: Evaluate hakmem's bitmap approach across multiple allocation patterns to identify strengths and weaknesses


Executive Summary

After discovering that all previous benchmarks had actually measured glibc malloc rather than hakmem (make's implicit rules built the binaries without linking hakmem), we rebuilt the benchmarking infrastructure and ran comprehensive tests across 6 allocation patterns.

Key Finding: Hakmem's bitmap approach degrades the least when blocks are freed in random order, which supports the design for non-sequential workloads. In absolute terms, however, hakmem remains roughly 2.6× to 9× slower than mimalloc on the 16B results.


Test Methodology

Benchmark Suite: bench_comprehensive.c

6 test patterns × 4 size classes (16B, 32B, 64B, 128B):

  1. Sequential LIFO - Allocate 100 blocks, free in reverse order (best case for free-lists)
  2. Sequential FIFO - Allocate 100 blocks, free in same order
  3. Random Free - Allocate 100 blocks, free in shuffled order (bitmap advantage test)
  4. Interleaved - Alternating alloc/free cycles
  5. Mixed Sizes - 8B, 16B, 32B, 64B mixed allocation
  6. Long-lived vs Short-lived - Keep 50% allocated, churn the rest
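As an illustration of pattern 3, here is a minimal sketch of one Random Free round. It is not taken from bench_comprehensive.c: the helper names (`xorshift32`, `shuffle`, `run_random_free`) and the fixed seed are assumptions made for the example, and the real benchmark adds timing and repetition around a loop like this.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCKS 100  /* blocks per round, as in the pattern description */

/* Small deterministic PRNG (xorshift32) so that runs are repeatable. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Fisher-Yates shuffle of an array of pointers. */
static void shuffle(void **items, size_t n, uint32_t *rng) {
    if (n < 2) return;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = xorshift32(rng) % (i + 1);
        void *tmp = items[i];
        items[i] = items[j];
        items[j] = tmp;
    }
}

/* One round of the Random Free pattern: allocate, shuffle, free.
 * Returns the number of blocks that were allocated and freed. */
static size_t run_random_free(size_t block_size, uint32_t seed) {
    assert(seed != 0);  /* xorshift32 never leaves the all-zero state */
    void *blocks[BLOCKS];
    uint32_t rng = seed;
    size_t allocated = 0;

    for (size_t i = 0; i < BLOCKS; i++) {
        blocks[i] = malloc(block_size);
        if (blocks[i] == NULL) break;
        allocated++;
    }
    shuffle(blocks, allocated, &rng);  /* free order no longer matches alloc order */
    for (size_t i = 0; i < allocated; i++) {
        free(blocks[i]);
    }
    return allocated;
}
```

Calling `run_random_free(16, 12345u)` exercises the 16B size class; swapping the shuffle for a reverse or forward walk gives the LIFO and FIFO patterns.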

Allocators Tested

  • hakmem: Bitmap-based with two-tier structure
  • glibc malloc: Binned free-list (system default)
  • mimalloc: Free-list allocator with sharded per-page free lists (labeled "Magazine-based" in the benchmark output)

Verification

All binaries verified with verify_bench.sh:

```
$ ./verify_bench.sh ./bench_comprehensive_hakmem
✅ hakmem symbols: 119
✅ Binary size: 156KB
✅ Verification PASSED
```

Results: 16B Allocations (Representative)

Sequential LIFO (Best case for free-lists)

| Allocator | Throughput | Latency | vs hakmem |
|---|---|---|---|
| hakmem | 102 M ops/sec | 9.8 ns/op | 1.0× |
| glibc | 365 M ops/sec | 2.7 ns/op | 3.6× |
| mimalloc | 943 M ops/sec | 1.1 ns/op | 9.2× |

Random Free (Bitmap advantage test)

| Allocator | Throughput | Latency | vs hakmem | Degradation from LIFO |
|---|---|---|---|---|
| hakmem | 68 M ops/sec | 14.7 ns/op | 1.0× | 34% |
| glibc | 138 M ops/sec | 7.2 ns/op | 2.0× | 62% |
| mimalloc | 176 M ops/sec | 5.7 ns/op | 2.6× | 81% |

Key Insight: Hakmem degrades the least under random patterns:

  • hakmem: 66% of sequential performance
  • glibc: 38% of sequential performance
  • mimalloc: 19% of sequential performance
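The retention figures above are the ratio of Random Free throughput to Sequential LIFO throughput. The sketch below shows that arithmetic; the function name `retained_pct` is hypothetical, and the inputs in the usage note are the 16B raw results listed at the end of this document.

```c
#include <assert.h>

/* Percentage of Sequential LIFO throughput that is retained under
 * Random Free. Degradation is 100 minus this value. */
static double retained_pct(double lifo_mops, double random_mops) {
    assert(lifo_mops > 0.0);
    return 100.0 * random_mops / lifo_mops;
}
```

With the raw numbers, `retained_pct(364.96, 138.89)` is about 38.1 for glibc and `retained_pct(943.40, 175.44)` is about 18.6 for mimalloc. For hakmem, `retained_pct(102.00, 68.03)` is about 66.7, which the tables quote as 66%.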

Pattern-by-Pattern Analysis

1. Sequential LIFO

Winner: mimalloc (9.2× faster than hakmem)

Analysis: Free-list allocators excel here because LIFO order matches their intrusive linked-list structure: the just-freed block sits at the head of the list and becomes the next allocation, so it is usually still hot in cache.
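To make the free-list fast path concrete, here is a generic sketch of an intrusive LIFO free list. It is a textbook illustration and not the code of glibc or mimalloc; the names `free_node`, `free_list`, `fl_push` and `fl_pop` are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* A freed block stores the link to the next free block in its own bytes,
 * so every block must be at least sizeof(void *) large. */
typedef struct free_node {
    struct free_node *next;
} free_node;

typedef struct {
    free_node *head;
} free_list;

/* free(): the block becomes the new head of the list. */
static void fl_push(free_list *fl, void *block) {
    assert(block != NULL);
    free_node *node = (free_node *)block;
    node->next = fl->head;
    fl->head = node;
}

/* malloc(): hand out the most recently freed block, or NULL if empty. */
static void *fl_pop(free_list *fl) {
    free_node *node = fl->head;
    if (node == NULL) return NULL;
    fl->head = node->next;
    return node;
}
```

A pop reads the head block that the preceding push just wrote, which is why a strict LIFO pattern keeps this path in cache.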

Hakmem's bitmap requires:

  • Bitmap scan (even if empty-word detection is O(1))
  • Bit manipulation
  • Pointer arithmetic
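The three steps listed above can be sketched for a single 64-slot bitmap word. This is an illustration under stated assumptions, not hakmem's implementation: the struct layout, the names `bitmap_slab`, `bm_alloc` and `bm_free`, and the convention that a set bit means "free" are all hypothetical. `__builtin_ctzll` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 64

typedef struct {
    uint64_t free_bits;    /* assumed convention: bit i set => slot i is free */
    unsigned char *base;   /* start of the slot storage */
    size_t slot_size;      /* bytes per slot */
} bitmap_slab;

static void *bm_alloc(bitmap_slab *s) {
    if (s->free_bits == 0) return NULL;                       /* slab is full */
    unsigned idx = (unsigned)__builtin_ctzll(s->free_bits);   /* 1. bitmap scan */
    s->free_bits &= s->free_bits - 1;                         /* 2. clear lowest set bit */
    return s->base + (size_t)idx * s->slot_size;              /* 3. pointer arithmetic */
}

static void bm_free(bitmap_slab *s, void *p) {
    size_t off = (size_t)((unsigned char *)p - s->base);
    assert(off % s->slot_size == 0 && off / s->slot_size < SLOTS);
    s->free_bits |= (uint64_t)1 << (off / s->slot_size);      /* order-independent */
}
```

Compared with a free-list pop (one load and one store), each allocation here performs a scan, a bit update and an address computation.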

2. Sequential FIFO

Winner: mimalloc (8.4× faster than hakmem)

Analysis: Similar to LIFO, though slightly worse for free-lists because FIFO order disrupts cache locality. Hakmem's bitmap is order-independent, so performance is similar to LIFO.

3. Random Free (Bitmap Advantage)

Winner: mimalloc (2.6× faster than hakmem)

Analysis: This is where bitmap shines relatively:

  • Hakmem: 34% degradation (66% of LIFO performance)
  • glibc: 62% degradation (38% of LIFO performance)
  • mimalloc: 81% degradation (19% of LIFO performance)

Why bitmap resists degradation:

  • Free order doesn't matter - just flip a bit
  • Two-tier bitmap structure: summary bitmap + detail bitmap
  • Empty-word detection is still O(1) regardless of fragmentation
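A two-tier bitmap of this kind can be sketched as follows. The layout is an assumption made for illustration (one 64-bit summary word over 64 detail words, with set bits meaning "free"); hakmem's actual structure, sizes and bit polarity may differ, and the names `two_tier_bitmap`, `tt_init`, `tt_alloc` and `tt_free` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define WORDS 64

typedef struct {
    uint64_t summary;        /* bit w set => detail[w] has at least one free slot */
    uint64_t detail[WORDS];  /* bit b set => slot w*64 + b is free */
} two_tier_bitmap;

static void tt_init(two_tier_bitmap *bm) {
    bm->summary = ~(uint64_t)0;
    for (int w = 0; w < WORDS; w++) bm->detail[w] = ~(uint64_t)0;
}

/* Returns a free slot index, or -1 if every slot is in use. */
static int tt_alloc(two_tier_bitmap *bm) {
    if (bm->summary == 0) return -1;
    int w = __builtin_ctzll(bm->summary);     /* skip exhausted words in one step */
    int b = __builtin_ctzll(bm->detail[w]);   /* non-zero because summary bit w is set */
    bm->detail[w] &= bm->detail[w] - 1;
    if (bm->detail[w] == 0) bm->summary &= ~((uint64_t)1 << w);
    return w * 64 + b;
}

/* Freeing is two bit-sets, whatever order the slots are released in. */
static void tt_free(two_tier_bitmap *bm, int slot) {
    assert(slot >= 0 && slot < WORDS * 64);
    bm->detail[slot / 64] |= (uint64_t)1 << (slot % 64);
    bm->summary |= (uint64_t)1 << (slot / 64);
}
```

The cost of `tt_alloc` is two count-trailing-zeros operations no matter how the free slots are scattered, which is the property the list above describes.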

Why free-lists degrade badly:

  • Random free breaks LIFO order
  • List traversal becomes unpredictable
  • Cache thrashing on widely scattered allocations

4. Interleaved Alloc/Free

Winner: mimalloc (7.8× faster than hakmem)

Analysis: Frequent switching favors free-lists with hot cache. Bitmap's amortization strategy (batch refill) doesn't help here.

5. Mixed Sizes

Winner: mimalloc (9.1× faster than hakmem)

Analysis: Multiple size classes stress hakmem's TLS magazine selection logic. mimalloc keeps separate per-size-class free lists in each thread-local heap, which avoids contention.

6. Long-lived vs Short-lived

Winner: mimalloc (8.5× faster than hakmem)

Analysis: Steady-state churning favors free-lists. Hakmem's bitmap doesn't distinguish between long-lived and short-lived allocations.


Bitmap vs Free-List Trade-offs

Bitmap Advantages

  1. Order Independence: Performance degrades less under random free patterns (34% for hakmem vs 62-81% for the free-list allocators)
  2. Visibility: Bitmap provides instant fragmentation insight for diagnostics
  3. Batch Refill: Can amortize bitmap scan across multiple allocations (16 items/scan)
  4. Predictability: O(1) empty-word detection regardless of fragmentation
  5. Research Value: Easy to instrument and analyze allocation patterns
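The batch refill mentioned in item 3 can be sketched as a loop that drains up to 16 free slot indices from one bitmap word into a small array. The function name `batch_refill`, the output array standing in for a magazine, and the "set bit means free" convention are assumptions made for this example, not hakmem's code.

```c
#include <assert.h>
#include <stdint.h>

#define REFILL_BATCH 16  /* items per scan, as quoted above */

/* Moves up to max_items free slot indices from *free_bits into out[].
 * Returns how many indices were taken. */
static int batch_refill(uint64_t *free_bits, int out[], int max_items) {
    assert(max_items >= 0 && max_items <= REFILL_BATCH);
    int n = 0;
    while (n < max_items && *free_bits != 0) {
        out[n++] = __builtin_ctzll(*free_bits);  /* lowest free slot */
        *free_bits &= *free_bits - 1;            /* mark it as taken */
    }
    return n;
}
```

The following allocations can then be served from `out[]` without touching the bitmap, which is where the amortization comes from.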

Free-List Advantages

  1. LIFO Fast Path: Just-freed block is next allocation (perfect cache locality)
  2. Zero Metadata: Intrusive next-pointer reuses allocated space
  3. Simple Push/Pop: Single pointer assignment vs bit manipulation
  4. Proven: Battle-tested in production allocators (jemalloc, mimalloc, tcmalloc)

Bitmap Disadvantages

  1. Baseline Overhead: Even with empty-word detection, bitmap scan is slower than free-list pop
  2. Bit Manipulation Cost: Extract, shift, and combine operations add latency
  3. Two-Tier Complexity: Summary + detail bitmap adds indirection
  4. Cold Cache: Bitmap memory separate from allocated memory

Free-List Disadvantages

  1. Random Pattern Degradation: 62-81% performance loss under random frees
  2. Fragmentation Blindness: Can't see allocation patterns without traversal
  3. Cache Unpredictability: Scattered allocations break LIFO order

Performance Gap Analysis

Why is hakmem still 2.6× slower on favorable patterns?

Even on Random Free (bitmap's best case), hakmem is 2.6× slower than mimalloc. The bitmap isn't the only bottleneck:

Potential bottlenecks (requires profiling):

  1. TLS Magazine Overhead:

    • 3-tier hierarchy (TLS → Page Mini-Mag → Bitmap)
    • Each tier has bounds checks and fallback logic
  2. Statistics Collection:

    • Even batched stats have overhead
    • Consider disabling in release builds
  3. Batch Refill Logic:

    • 16-item refill amortizes scan, but adds complexity
    • May not be worth it for bursty workloads
  4. Two-Tier Bitmap Traversal:

    • Summary bitmap scan → detail bitmap scan
    • Two levels of indirection
  5. Cache Effects:

    • Bitmap memory is separate from allocated memory
    • Free-lists keep everything hot in L1

Conclusions

Is Bitmap Worth It?

For Research: Yes

  • Visibility and diagnostics are invaluable
  • Order-independent performance is a unique advantage
  • Easy to instrument and analyze

For Production: ⚠️ Depends

  • If workload is random/unpredictable: bitmap degrades less
  • If workload is sequential/LIFO: free-list allocators are 3.6× (glibc) to 9.2× (mimalloc) faster
  • If absolute performance matters: mimalloc wins

Next Steps

  1. Profile hakmem on Random Free pattern (bench_tiny.c)

    • Identify true bottlenecks beyond bitmap
    • Use perf record -g to find hot paths
  2. Consider Hybrid Approach:

    • Free-list for LIFO fast path (top 8-16 items)
    • Bitmap for overflow and diagnostics
    • Best of both worlds?
  3. Measure Statistics Overhead:

    • Build with stats disabled
    • Quantify cost of instrumentation
  4. Optimize Two-Tier Bitmap:

    • Can we flatten to single tier for small slabs?
    • SIMD instructions for bitmap scan?
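The hybrid idea in step 2 could look roughly like the sketch below: a small LIFO stack of recently freed slots in front of a bitmap that absorbs overflow. This is a speculative illustration of the proposal, not existing hakmem code; the names `hybrid_slab`, `hy_alloc` and `hy_free`, the capacity of 8, and the single 64-slot word are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define HOT_CAP 8  /* size of the LIFO fast path (the proposal suggests 8-16) */

typedef struct {
    int hot[HOT_CAP];    /* stack of recently freed slot indices */
    int hot_count;
    uint64_t free_bits;  /* bit i set => slot i is free and not in hot[] */
} hybrid_slab;

/* Returns a slot index, or -1 if nothing is free. */
static int hy_alloc(hybrid_slab *s) {
    if (s->hot_count > 0) return s->hot[--s->hot_count];  /* LIFO fast path */
    if (s->free_bits == 0) return -1;
    int idx = __builtin_ctzll(s->free_bits);              /* bitmap slow path */
    s->free_bits &= s->free_bits - 1;
    return idx;
}

static void hy_free(hybrid_slab *s, int idx) {
    assert(idx >= 0 && idx < 64);
    if (s->hot_count < HOT_CAP) {
        s->hot[s->hot_count++] = idx;         /* keep it hot for the next alloc */
    } else {
        s->free_bits |= (uint64_t)1 << idx;   /* overflow goes to the bitmap */
    }
}
```

One trade-off of this design is that slots parked in `hot[]` do not appear as free in the bitmap, so any diagnostics that read the bitmap would need to account for them.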

Benchmark Commands

Build

```sh
make clean
make bench_comprehensive_hakmem
make bench_comprehensive_system
./verify_bench.sh ./bench_comprehensive_hakmem
```

Run

```sh
# hakmem (bitmap)
./bench_comprehensive_hakmem > results_hakmem.txt

# glibc (system malloc)
./bench_comprehensive_system > results_glibc.txt

# mimalloc (magazine-based)
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 \
  ./bench_comprehensive_system > results_mimalloc.txt
```

Raw Results (16B allocations)

```
========================================
hakmem (Bitmap-based)
========================================
Sequential LIFO:   102.00 M ops/sec (9.80 ns/op)
Sequential FIFO:    97.09 M ops/sec (10.30 ns/op)
Random Free:        68.03 M ops/sec (14.70 ns/op)  ← 66% of LIFO
Interleaved:        91.74 M ops/sec (10.90 ns/op)
Mixed Sizes:        99.01 M ops/sec (10.10 ns/op)
Long-lived:         95.24 M ops/sec (10.50 ns/op)

========================================
glibc malloc (Free-list)
========================================
Sequential LIFO:   364.96 M ops/sec (2.74 ns/op)
Sequential FIFO:   357.14 M ops/sec (2.80 ns/op)
Random Free:       138.89 M ops/sec (7.20 ns/op)  ← 38% of LIFO
Interleaved:       333.33 M ops/sec (3.00 ns/op)
Mixed Sizes:       344.83 M ops/sec (2.90 ns/op)
Long-lived:        350.88 M ops/sec (2.85 ns/op)

========================================
mimalloc (Magazine-based)
========================================
Sequential LIFO:   943.40 M ops/sec (1.06 ns/op)
Sequential FIFO:   900.90 M ops/sec (1.11 ns/op)
Random Free:       175.44 M ops/sec (5.70 ns/op)  ← 19% of LIFO
Interleaved:       800.00 M ops/sec (1.25 ns/op)
Mixed Sizes:       909.09 M ops/sec (1.10 ns/op)
Long-lived:        869.57 M ops/sec (1.15 ns/op)
```
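The two columns in the raw results are related by a fixed conversion: one operation per nanosecond is 1000 M ops/sec. The helper below (a hypothetical name, `mops_from_ns`) shows the arithmetic and can be used to sanity-check a results line.

```c
#include <assert.h>

/* Converts latency in ns/op to throughput in M ops/sec.
 * 1 s = 1e9 ns, so ops/sec = 1e9 / ns and M ops/sec = 1000 / ns. */
static double mops_from_ns(double ns_per_op) {
    assert(ns_per_op > 0.0);
    return 1000.0 / ns_per_op;
}
```

For example, `mops_from_ns(2.74)` is about 364.96 (glibc, Sequential LIFO) and `mops_from_ns(14.70)` is about 68.03 (hakmem, Random Free).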

Appendix: Verification Checklist

Before any benchmark:

  1. make clean
  2. make bench_comprehensive_hakmem
  3. ./verify_bench.sh ./bench_comprehensive_hakmem
    • Expect: 119 hakmem symbols
    • Expect: Binary size > 150KB
  4. Run benchmark
  5. Document results in this file

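NEVER run make <target> for a target that has no rule in the Makefile. If a matching <target>.c exists, make silently falls back to its implicit rules, compiles that file on its own, and links it against glibc without hakmem!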