hakmem/docs/analysis/OVERHEAD_ANALYSIS_PLAN.md

hakmem Overhead Analysis Plan (Phase 6.7 Preparation)

Gap: hakmem-evolving (37,602 ns) vs mimalloc (19,964 ns) = +88.3%


🎯 Overhead Candidates (by priority)

P0: Critical Path Overhead

  1. BigCache lookup (runs on every call)

    • Hash table lookup for site_id
    • Size class matching
    • Slot iteration
    • Estimated cost: 50-100 ns
  2. ELO strategy selection (LEARN mode)

    • hak_elo_select_strategy(): softmax calculation
    • Probability computation across 12 strategies
    • Random number generation
    • Estimated cost: 100-200 ns
  3. Header read/write

    • Read/write of AllocHeader (32 bytes)
    • Magic verification
    • Estimated cost: 10-20 ns
  4. Atomic tick counter

    • atomic_fetch_add(&tick_counter, 1)
    • Every allocation
    • Estimated cost: 5-10 ns
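
To make the header cost concrete, the following is a minimal sketch of a 32-byte header with a magic check. The field layout, the `SKETCH_MAGIC` value, and the function names are assumptions for illustration; the plan only states that AllocHeader is 32 bytes and that a magic value is verified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAGIC 0x48414B4Du /* hypothetical tag, not hakmem's real value */

/* Hypothetical layout: only the 32-byte total comes from the plan. */
typedef struct {
    uint32_t magic;    /* checked on every free */
    uint32_t method;   /* how the block was obtained (assumed field) */
    uint64_t size;     /* requested size in bytes */
    uint64_t site_id;  /* call-site identifier */
    uint64_t reserved; /* padding up to 32 bytes */
} AllocHeaderSketch;

_Static_assert(sizeof(AllocHeaderSketch) == 32, "header sketch must be 32 bytes");

/* Write a header at `raw` and return the user pointer just past it. */
void *header_write(void *raw, uint64_t size, uint64_t site_id)
{
    assert(raw != NULL);
    AllocHeaderSketch *h = raw;
    h->magic = SKETCH_MAGIC;
    h->method = 0;
    h->size = size;
    h->site_id = site_id;
    h->reserved = 0;
    return h + 1;
}

/* Step back from a user pointer and verify the magic.
 * Returns NULL when the magic does not match. */
const AllocHeaderSketch *header_check(const void *user)
{
    if (user == NULL) return NULL;
    const AllocHeaderSketch *h = (const AllocHeaderSketch *)user - 1;
    return h->magic == SKETCH_MAGIC ? h : NULL;
}
```

Each allocation writes five fields and each free reads at least one, so the cost is roughly one cache line touched per operation when the header is 32-byte aligned, which fits the small estimate above.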

P1: Syscall Overhead

  1. mmap/munmap

    • System call overhead
    • TLB flush
    • Page table updates
    • Estimated cost: 1,000-5,000 ns (syscall dependent)
  2. Page faults

    • First touch of mmap'd memory
    • Soft page faults
    • Estimated cost: 100-500 ns per page
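
The syscall and page-fault estimates above can be checked directly with a small timing harness. The sketch below is not part of hakmem; `MapTiming` and `time_map_cycle` are hypothetical names. It assumes Linux or another POSIX system with `MAP_ANONYMOUS`, and times one mmap, one first touch of every page, and one munmap.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

typedef struct {
    uint64_t map_ns;   /* time spent in mmap */
    uint64_t touch_ns; /* time spent on first touch of every page */
    uint64_t unmap_ns; /* time spent in munmap */
    size_t pages;      /* number of pages touched */
} MapTiming;

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time one mmap / first-touch / munmap cycle for `len` bytes.
 * Returns 0 on success, -1 if a system call fails. */
int time_map_cycle(size_t len, MapTiming *out)
{
    assert(len > 0 && out != NULL);
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0) return -1;

    uint64_t t0 = now_ns();
    void *m = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED) return -1;
    uint64_t t1 = now_ns();

    /* Write one byte per page so each page takes its first-touch fault. */
    volatile unsigned char *p = m;
    size_t pages = 0;
    for (size_t off = 0; off < len; off += (size_t)page_size) {
        p[off] = 1;
        pages++;
    }
    uint64_t t2 = now_ns();

    if (munmap(m, len) != 0) return -1;
    uint64_t t3 = now_ns();

    out->map_ns = t1 - t0;
    out->touch_ns = t2 - t1;
    out->unmap_ns = t3 - t2;
    out->pages = pages;
    return 0;
}
```

Dividing `touch_ns` by `pages` gives a per-page figure to compare against the 100-500 ns estimate. A single run is noisy, so averaging over many cycles is advisable.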

P2: Other Overhead

  1. Evolution lifecycle

    • hak_evo_tick() (every 1024 allocs)
    • hak_evo_record_size() (every alloc)
    • Estimated cost: 5-10 ns
  2. Batch madvise

    • Batch add/flush overhead
    • Estimated cost: Amortized, should be near-zero
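
The per-allocation tick with a periodic slow path can be sketched as follows. This is an illustration under assumptions, not hakmem's code: `tick_and_check` and `EVO_TICK_INTERVAL` are hypothetical names, and relaxed memory ordering is assumed to be enough for a statistics counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define EVO_TICK_INTERVAL 1024u /* "every 1024 allocs" from the plan */

_Static_assert((EVO_TICK_INTERVAL & (EVO_TICK_INTERVAL - 1)) == 0,
               "interval must be a power of two for the mask test");

static atomic_uint_fast64_t tick_counter; /* zero-initialized */

/* Bump the global counter once per allocation. Returns true on every
 * EVO_TICK_INTERVAL-th call, when the caller would run the periodic
 * evolution step. */
bool tick_and_check(void)
{
    uint64_t prev = atomic_fetch_add_explicit(&tick_counter, 1,
                                              memory_order_relaxed);
    return ((prev + 1) & (EVO_TICK_INTERVAL - 1)) == 0;
}
```

The fast path is one atomic add and one mask test. Under multi-threaded load the shared counter's cache line can bounce between cores, so the 5-10 ns estimate is best read as a single-threaded figure.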

🔬 Measurement Strategy

Phase 1: Feature Isolation

Test configurations (environment variables):

  1. Baseline: All features ON (current)
  2. No BigCache: HAKMEM_DISABLE_BIGCACHE=1
  3. No ELO: HAKMEM_DISABLE_ELO=1 (use fixed threshold)
  4. Frozen mode: HAKMEM_EVO_POLICY=frozen (skip learning)
  5. Minimal: BigCache + ELO + Evolution all OFF
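
The configurations above map naturally onto a small set of feature flags read once at startup. The sketch below is an assumption about how such parsing might look: the variable names come from the list, but `FeatureConfig`, the rule that only the exact value `1` disables a feature, and the exact match on `frozen` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    bool bigcache;     /* BigCache lookup enabled */
    bool elo;          /* ELO strategy selection enabled */
    bool evo_learning; /* evolution learning enabled (false when frozen) */
} FeatureConfig;

/* Pure parsing step: each argument is the raw value of one environment
 * variable, or NULL when the variable is unset. */
FeatureConfig feature_config_parse(const char *disable_bigcache,
                                   const char *disable_elo,
                                   const char *evo_policy)
{
    FeatureConfig cfg;
    cfg.bigcache = !(disable_bigcache != NULL && strcmp(disable_bigcache, "1") == 0);
    cfg.elo = !(disable_elo != NULL && strcmp(disable_elo, "1") == 0);
    cfg.evo_learning = !(evo_policy != NULL && strcmp(evo_policy, "frozen") == 0);
    return cfg;
}

/* Read the three variables named in the test configurations. */
FeatureConfig feature_config_from_env(void)
{
    return feature_config_parse(getenv("HAKMEM_DISABLE_BIGCACHE"),
                                getenv("HAKMEM_DISABLE_ELO"),
                                getenv("HAKMEM_EVO_POLICY"));
}
```

Reading the flags once and caching the result keeps `getenv` off the allocation hot path, so the measurement is not skewed by the switch itself.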

Expected results:

  • If "No BigCache" → -100ns: BigCache overhead = 100ns
  • If "No ELO" → -200ns: ELO overhead = 200ns
  • If "Minimal" → -500ns: Total feature overhead = 500ns
  • Remaining gap (~17,000 ns) → syscall/page fault overhead
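
The attribution arithmetic behind these expectations can be written out as a small helper. This is a sketch with hypothetical names (`GapBreakdown`, `attribute_gap`). The 37,602 ns and 19,964 ns figures come from the gap stated at the top of this plan, while the 37,102 ns "Minimal" latency in the usage note is the assumed -500 ns outcome, not a measurement.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int64_t total_gap_ns; /* baseline minus reference allocator */
    int64_t feature_ns;   /* removed by switching features off */
    int64_t residual_ns;  /* what remains: syscalls, page faults, design */
    int64_t gap_permille; /* total gap relative to reference, in 1/1000 */
} GapBreakdown;

/* Split the latency gap into a feature part and a residual part. */
GapBreakdown attribute_gap(int64_t baseline_ns, int64_t minimal_ns,
                           int64_t reference_ns)
{
    assert(reference_ns > 0);
    GapBreakdown g;
    g.total_gap_ns = baseline_ns - reference_ns;
    g.feature_ns = baseline_ns - minimal_ns;
    g.residual_ns = minimal_ns - reference_ns;
    g.gap_permille = g.total_gap_ns * 1000 / reference_ns;
    return g;
}
```

Calling `attribute_gap(37602, 37102, 19964)` gives a total gap of 17,638 ns (883 per mille, that is +88.3%), of which 500 ns is feature overhead and 17,138 ns is residual, matching the "~17,000 ns" figure above.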

Phase 2: Profiling

# Compile with debug symbols
make clean && make CFLAGS="-g -O2"

# Run with perf
perf record -g ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100
perf report

# Look for:
- hak_alloc_at() time breakdown
- hak_bigcache_try_get() cost
- hak_elo_select_strategy() cost
- mmap/munmap syscall time

Phase 3: Syscall Analysis

# Count syscalls
strace -c ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10

# Compare with mimalloc
strace -c -o hakmem.strace ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10
strace -c -o mimalloc.strace ./bench_allocators --allocator mimalloc --scenario vm --iterations 10

diff hakmem.strace mimalloc.strace

🎯 Expected Findings

Hypothesis 1: BigCache overhead = 5-10%

  • Hash lookup + slot iteration
  • Negligible compared to total gap

Hypothesis 2: ELO overhead = 5-10%

  • Softmax calculation
  • Can be eliminated in FROZEN mode

Hypothesis 3: mmap/munmap overhead = 60-70%

  • System call overhead
  • Page fault overhead
  • This is the main gap
  • Solution: Reduce the number of mmap/munmap calls (BigCache already does this)

Hypothesis 4: Remaining gap = mimalloc's slab allocator

  • mimalloc uses slab allocator for 2MB
  • Pre-allocated, no syscalls
  • hakmem uses mmap per allocation (first miss)
  • Can't compete without similar architecture

💡 Optimization Ideas (Phase 6.7+)

  1. FROZEN mode by default (after learning)

    • Zero ELO overhead
    • Expected improvement: about 5%
  2. BigCache optimization

    • Direct indexing instead of linear search
    • Expected improvement: about 5%
  3. Pre-allocated arena (Phase 7?)

    • mmap large arena once
    • Suballocate from arena
    • Avoid per-allocation syscalls
    • Target: about 50% improvement
  4. Header optimization

    • Reduce AllocHeader size (32 → 16 bytes?)
    • Use bit packing
    • Expected improvement: about 2%
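
The pre-allocated arena idea (item 3) can be sketched as a bump allocator over a single mmap'd region. This is a minimal illustration under assumptions, not a proposed hakmem design: the `Arena` type and function names are hypothetical, there is no per-block free, no thread safety, and it assumes a POSIX system with `MAP_ANONYMOUS`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

typedef struct {
    unsigned char *base; /* start of the mapped region */
    size_t capacity;     /* total bytes mapped */
    size_t used;         /* bytes handed out so far */
} Arena;

/* Map the whole arena with one system call. Returns 0 on success. */
int arena_init(Arena *a, size_t capacity)
{
    assert(a != NULL && capacity > 0);
    void *m = mmap(NULL, capacity, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED) return -1;
    a->base = m;
    a->capacity = capacity;
    a->used = 0;
    return 0;
}

/* Carve `size` bytes out of the arena, aligned to `align` (a power of
 * two). No system call is made. Returns NULL when the arena is full. */
void *arena_alloc(Arena *a, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (a->used + align - 1) & ~(align - 1);
    if (start > a->capacity || size > a->capacity - start) return NULL;
    a->used = start + size;
    return a->base + start;
}

/* Release the whole arena with one munmap. */
void arena_destroy(Arena *a)
{
    munmap(a->base, a->capacity);
    a->base = NULL;
    a->capacity = 0;
    a->used = 0;
}
```

After `arena_init`, every `arena_alloc` is a few arithmetic operations, which is the per-allocation syscall saving the idea targets. Pages still take a first-touch fault the first time they are written, so the page-fault cost listed under P1 is amortized across reuse rather than removed.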

📊 Success Metrics

  • Phase 6.7 Goal: Identify top 3 overhead sources
  • Phase 7 Goal: Reduce gap to +40% (vs +88% now)
  • Phase 8 Goal: Reduce gap to +20% (competitive)

Realistic limit: Cannot beat mimalloc without slab allocator

  • mimalloc: Industry-standard, 10+ years of optimization
  • hakmem: Research PoC, 2 months of development
  • Target: Within 20-30% is acceptable for PoC