
Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History

2025-11-05 12:31:14 +09:00


hakmem Overhead Analysis Plan (Phase 6.7 Preparation)

Gap: hakmem-evolving (37,602 ns) vs mimalloc (19,964 ns) = +88.3%

🎯 Overhead Candidates (in priority order)

P0: Critical Path Overhead

BigCache lookup (runs on every allocation)
- Hash table lookup for site_id
- Size class matching
- Slot iteration
- Estimated cost: 50-100 ns
ELO strategy selection (LEARN mode)
- hak_elo_select_strategy(): softmax calculation
- Probability calculation over 12 strategies
- Random number generation
- Estimated cost: 100-200 ns
Header read/write
- Read/write of AllocHeader (32 bytes)
- Magic verification
- Estimated cost: 10-20 ns
Atomic tick counter
- atomic_fetch_add(&tick_counter, 1)
- Every allocation
- Estimated cost: 5-10 ns
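
To make the three BigCache steps concrete (hash the site id, match the size class, iterate the slots), here is a minimal sketch. It is not the hakmem implementation: every name, the bucket and slot counts, and the hash function are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: none of these names or sizes come from hakmem. */
#define SKETCH_SITES 64        /* hash buckets, indexed by hashed site id */
#define SKETCH_SLOTS 4         /* cached blocks per bucket */

typedef struct {
    uintptr_t site_id;         /* call site associated with the block */
    int       size_class;      /* size class of the cached block */
    void     *ptr;             /* NULL when the slot is empty */
} sketch_slot;

static sketch_slot sketch_cache[SKETCH_SITES][SKETCH_SLOTS];

/* Step 1: hash the site id to a bucket (assumed hash, for illustration). */
static size_t sketch_hash(uintptr_t site_id) {
    return (size_t)((site_id >> 4) % SKETCH_SITES);
}

/* Steps 2 and 3: iterate the slots, matching site id and size class. */
static void *sketch_try_get(uintptr_t site_id, int size_class) {
    sketch_slot *bucket = sketch_cache[sketch_hash(site_id)];
    for (int i = 0; i < SKETCH_SLOTS; i++) {
        if (bucket[i].ptr && bucket[i].site_id == site_id &&
            bucket[i].size_class == size_class) {
            void *hit = bucket[i].ptr;
            bucket[i].ptr = NULL;      /* the slot is now empty */
            return hit;
        }
    }
    return NULL;                       /* miss: caller takes the slow path */
}

/* Store a freed block; returns 1 on success, 0 if the bucket is full. */
static int sketch_put(uintptr_t site_id, int size_class, void *ptr) {
    sketch_slot *bucket = sketch_cache[sketch_hash(site_id)];
    for (int i = 0; i < SKETCH_SLOTS; i++) {
        if (!bucket[i].ptr) {
            bucket[i] = (sketch_slot){ site_id, size_class, ptr };
            return 1;
        }
    }
    return 0;
}
```

Even in this small form, a lookup costs one hash plus up to SKETCH_SLOTS comparisons, which is the kind of per-allocation work the 50-100 ns estimate is meant to cover.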

P1: Syscall Overhead

mmap/munmap
- System call overhead
- TLB flush
- Page table updates
- Estimated cost: 1,000-5,000 ns (syscall dependent)
Page faults
- First touch of mmap'd memory
- Soft page faults
- Estimated cost: 100-500 ns per page

P2: Other Overhead

Evolution lifecycle
- hak_evo_tick() (every 1024 allocs)
- hak_evo_record_size() (every alloc)
- Estimated cost: 5-10 ns
Batch madvise
- Batch add/flush overhead
- Estimated cost: Amortized, should be near-zero
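
The per-allocation atomic increment and the "every 1024 allocs" cadence can be pictured together as below. This is a sketch under assumptions: the source does not say that the tick counter drives hak_evo_tick(), and all sketch_* names are hypothetical stand-ins.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_TICK_INTERVAL 1024u   /* "every 1024 allocs"; must be a power of two */

static atomic_uint_fast64_t sketch_tick_counter;  /* zero-initialized */
static uint64_t sketch_evo_ticks;                 /* illustration only, not thread-safe */

/* Stand-in for the periodic evolution work. */
static void sketch_evo_tick(void) {
    sketch_evo_ticks++;
}

/* Called on every allocation: one atomic add, plus rare periodic work. */
static void sketch_on_alloc(void) {
    /* atomic_fetch_add returns the previous value, so add 1 for the new count. */
    uint64_t n = atomic_fetch_add(&sketch_tick_counter, 1) + 1;
    if ((n & (SKETCH_TICK_INTERVAL - 1)) == 0) {
        sketch_evo_tick();
    }
}
```

With this shape, the hot path pays for the atomic add and one mask-and-compare, while the periodic work is amortized over 1024 allocations.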

🔬 Measurement Strategy

Phase 1: Feature Isolation

Test configurations (environment variables):

Baseline: All features ON (current)
No BigCache: HAKMEM_DISABLE_BIGCACHE=1
No ELO: HAKMEM_DISABLE_ELO=1 (use fixed threshold)
Frozen mode: HAKMEM_EVO_POLICY=frozen (skip learning)
Minimal: BigCache + ELO + Evolution all OFF
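
As an illustration of how such switches might be read at startup, the sketch below parses the three environment variables named above. The variable names come from this plan; the parsing rules (the exact string "1" disables a feature, the exact string "frozen" selects frozen mode) and all sketch_* names are assumptions, not hakmem's actual behavior.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical config struct; hakmem's real one may differ. */
typedef struct {
    int bigcache_enabled;
    int elo_enabled;
    int evo_frozen;
} sketch_config;

/* Assumed rule: a flag counts as set only when its value is exactly "1". */
static int sketch_env_is_set(const char *name) {
    const char *v = getenv(name);
    return v != NULL && strcmp(v, "1") == 0;
}

static sketch_config sketch_load_config(void) {
    sketch_config c;
    const char *policy = getenv("HAKMEM_EVO_POLICY");
    c.bigcache_enabled = !sketch_env_is_set("HAKMEM_DISABLE_BIGCACHE");
    c.elo_enabled      = !sketch_env_is_set("HAKMEM_DISABLE_ELO");
    c.evo_frozen       = policy != NULL && strcmp(policy, "frozen") == 0;
    return c;
}
```

Reading the flags once into a struct keeps getenv() off the allocation path, so the isolation runs do not add measurement overhead of their own.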

Expected results:

If "No BigCache" → -100ns: BigCache overhead = 100ns
If "No ELO" → -200ns: ELO overhead = 200ns
If "Minimal" → -500ns: Total feature overhead = 500ns
Remaining gap (~17,000 ns) → syscall/page fault overhead
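
The attribution arithmetic behind these expectations can be written down directly. The helper below is a hypothetical sketch: it splits the gap into the part removed by the "Minimal" configuration and the part that remains against the reference allocator.

```c
/* All latencies are in nanoseconds per operation. */
typedef struct {
    double feature_overhead_ns;   /* baseline minus minimal configuration */
    double remaining_gap_ns;      /* minimal configuration minus reference */
} sketch_attribution;

static sketch_attribution sketch_attribute(double baseline_ns,
                                           double minimal_ns,
                                           double reference_ns) {
    sketch_attribution a;
    a.feature_overhead_ns = baseline_ns - minimal_ns;
    a.remaining_gap_ns    = minimal_ns - reference_ns;
    return a;
}
```

For example, with the measured 37,602 ns baseline, a hypothetical "Minimal" result of 37,102 ns (the -500 ns case above), and mimalloc at 19,964 ns, the feature overhead is 500 ns and the remaining gap is 17,138 ns, which matches the "~17,000 ns" figure.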

Phase 2: Profiling

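# Compile with debug symbols and frame pointers
# (perf record -g unwinds via frame pointers by default)
make clean && make CFLAGS="-g -O2 -fno-omit-frame-pointer"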

# Run with perf
perf record -g ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100
perf report

# Look for:
#   - hak_alloc_at() time breakdown
#   - hak_bigcache_try_get() cost
#   - hak_elo_select_strategy() cost
#   - mmap/munmap syscall time

Phase 3: Syscall Analysis

# Count syscalls
strace -c ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10

# Compare with mimalloc
strace -c -o hakmem.strace ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10
strace -c -o mimalloc.strace ./bench_allocators --allocator mimalloc --scenario vm --iterations 10

diff hakmem.strace mimalloc.strace

🎯 Expected Findings

Hypothesis 1: BigCache overhead = 5-10%

Hash lookup + slot iteration
Negligible compared to total gap

Hypothesis 2: ELO overhead = 5-10%

Softmax calculation
Can be eliminated in FROZEN mode

Hypothesis 3: mmap/munmap overhead = 60-70%

System call overhead
Page fault overhead
This is the main gap
Solution: Reduce the number of mmap/munmap calls (BigCache already does this)

Hypothesis 4: Remaining gap = mimalloc's slab allocator

mimalloc uses a slab allocator for 2MB
Pre-allocated, so no syscalls
hakmem calls mmap per allocation (on first miss)
Cannot compete without a similar architecture

💡 Optimization Ideas (Phase 6.7+)

FROZEN mode by default (after learning)
- Zero ELO overhead
- -5% improvement
BigCache optimization
- Direct indexing instead of linear search
- -5% improvement
Pre-allocated arena (Phase 7?)
- mmap large arena once
- Suballocate from arena
- Avoid per-allocation syscalls
- Target: -50% improvement
Header optimization
- Reduce AllocHeader size (32 → 16 bytes?)
- Use bit packing
- -2% improvement
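
The pre-allocated arena idea (mmap once, then suballocate) can be sketched as a bump allocator. This is a minimal illustration under assumptions, not a proposed hakmem design: the sketch_* names are hypothetical, blocks are never reused after free, and it assumes a Linux-like system where MAP_ANONYMOUS is available.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

typedef struct {
    uint8_t *base;   /* start of the mapped region */
    size_t   size;   /* total bytes mapped */
    size_t   used;   /* bytes handed out so far */
} sketch_arena;

/* One mmap for the whole arena; returns 0 on success, -1 on failure. */
static int sketch_arena_init(sketch_arena *a, size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    a->base = p;
    a->size = size;
    a->used = 0;
    return 0;
}

/* Bump allocation, rounded up to 16 bytes; no syscall on this path. */
static void *sketch_arena_alloc(sketch_arena *a, size_t n) {
    size_t aligned = (n + 15u) & ~(size_t)15u;
    if (aligned < n || aligned > a->size - a->used) return NULL;
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

/* One munmap releases everything at once. */
static void sketch_arena_destroy(sketch_arena *a) {
    munmap(a->base, a->size);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

The point of the design is that syscall cost is paid once at init and once at destroy, while each allocation is a bounds check and an addition. Page faults on first touch still occur, so this addresses the mmap/munmap cost rather than the page-fault cost.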

📊 Success Metrics

Phase 6.7 Goal: Identify top 3 overhead sources
Phase 7 Goal: Reduce gap to +40% (vs +88% now)
Phase 8 Goal: Reduce gap to +20% (competitive)

Realistic limit: Cannot beat mimalloc without a slab allocator

mimalloc: Industry-standard, years of optimization
hakmem: Research PoC, 2 months of development
Target: Within 20-30% is acceptable for PoC
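
To translate the percentage goals into absolute latencies, the two helpers below convert between a gap percentage and nanoseconds relative to the mimalloc measurement. The function names are hypothetical; the inputs are the figures quoted at the top of this plan.

```c
/* Gap of a measurement relative to the reference, in percent. */
static double sketch_gap_percent(double measured_ns, double reference_ns) {
    return (measured_ns - reference_ns) / reference_ns * 100.0;
}

/* Latency that corresponds to a given gap percentage over the reference. */
static double sketch_target_ns(double reference_ns, double gap_percent) {
    return reference_ns * (1.0 + gap_percent / 100.0);
}
```

With mimalloc at 19,964 ns, the current 37,602 ns works out to a gap of about +88.3%, the Phase 7 goal of +40% corresponds to roughly 27,950 ns, and the Phase 8 goal of +20% to roughly 23,957 ns.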
