hakmem/docs/analysis/OVERHEAD_ANALYSIS_PLAN.md

# hakmem Overhead Analysis Plan (Phase 6.7 Preparation)

**Gap**: hakmem-evolving (37,602 ns) vs mimalloc (19,964 ns) = **+88.3%**

---

## 🎯 Overhead Candidates (by Priority)

### P0: Critical Path Overhead

1. **BigCache lookup** (runs on every allocation)
   - Hash table lookup for site_id
   - Size class matching
   - Slot iteration
   - **Estimated cost**: 50-100 ns

2. **ELO strategy selection** (LEARN mode)
   - `hak_elo_select_strategy()`: softmax calculation
   - Probability computation across 12 strategies
   - Random number generation
   - **Estimated cost**: 100-200 ns

3. **Header read/write**
   - Read/write of the AllocHeader (32 bytes)
   - Magic verification
   - **Estimated cost**: 10-20 ns

4. **Atomic tick counter**
   - `atomic_fetch_add(&tick_counter, 1)`
   - Every allocation
   - **Estimated cost**: 5-10 ns
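
To make the header cost in item 3 concrete, here is a minimal sketch of a 32-byte header with a magic check. Only the 32-byte size and the existence of a magic verification come from this plan; the field names, the magic value, and the helper functions (`hdr_init`, `hdr_from_user`) are assumptions for illustration, not hakmem's actual layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic value ("HAKM" in ASCII); the real constant is not given in this plan. */
#define HAK_MAGIC_SKETCH 0x48414B4Du

/* Hypothetical field layout; only the 32-byte total is taken from the plan. */
typedef struct {
    uint32_t magic;     /* checked before the header is trusted */
    uint32_t method;    /* assumed: how the block was obtained (malloc, mmap, ...) */
    uint64_t size;      /* user-requested size */
    uint64_t site_id;   /* assumed: call-site key used for the cache lookup */
    uint64_t reserved;  /* pads the struct to 32 bytes */
} AllocHeaderSketch;

static_assert(sizeof(AllocHeaderSketch) == 32, "header must stay 32 bytes");

/* Write a header at the start of a raw block and return the user pointer. */
static void *hdr_init(void *raw, size_t size, uint64_t site_id) {
    AllocHeaderSketch *h = (AllocHeaderSketch *)raw;
    h->magic = HAK_MAGIC_SKETCH;
    h->method = 0;
    h->size = size;
    h->site_id = site_id;
    h->reserved = 0;
    return (char *)raw + sizeof(AllocHeaderSketch);
}

/* Recover the header from a user pointer; returns NULL if the magic does not match. */
static AllocHeaderSketch *hdr_from_user(void *user) {
    AllocHeaderSketch *h =
        (AllocHeaderSketch *)((char *)user - sizeof(AllocHeaderSketch));
    return h->magic == HAK_MAGIC_SKETCH ? h : NULL;
}
```

In this sketch the whole header fits in half of a 64-byte cache line, so the read on free and the write on alloc each touch one line at most when the block is suitably aligned, which is in keeping with the small per-call estimate above.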

### P1: Syscall Overhead

5. **mmap/munmap**
   - System call overhead
   - TLB flush
   - Page table updates
   - **Estimated cost**: 1,000-5,000 ns (depends on the syscall)

6. **Page faults**
   - First touch of mmap'd memory
   - Soft page faults
   - **Estimated cost**: 100-500 ns per page

### P2: Other Overhead

7. **Evolution lifecycle**
   - `hak_evo_tick()` (every 1024 allocs)
   - `hak_evo_record_size()` (every alloc)
   - **Estimated cost**: 5-10 ns

8. **Batch madvise**
   - Batch add/flush overhead
   - **Estimated cost**: Amortized, should be near-zero
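
The per-allocation tick (item 4) and the every-1024-allocations lifecycle call (item 7) can be pictured together as below. This is a sketch under assumptions: the names `on_alloc_tick`, `EVO_TICK_INTERVAL`, and `evo_ticks_run` are hypothetical, the slow path is a counter standing in for the real `hak_evo_tick()` work, and relaxed memory ordering is an assumption (the plan shows a plain `atomic_fetch_add`).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define EVO_TICK_INTERVAL 1024u  /* power of two, matching "every 1024 allocs" */

static atomic_uint_fast64_t tick_counter;  /* zero-initialized at program start */
static uint64_t evo_ticks_run;             /* stand-in for the slow path; not thread-safe, demo only */

/* Called once per allocation; returns 1 when the periodic tick fired. */
static int on_alloc_tick(void) {
    /* Assumption: relaxed ordering suffices because the counter orders no other data. */
    uint64_t n = atomic_fetch_add_explicit(&tick_counter, 1,
                                           memory_order_relaxed) + 1;
    if ((n & (EVO_TICK_INTERVAL - 1)) != 0)
        return 0;        /* fast path: one atomic add and one mask test */
    evo_ticks_run++;     /* slow path: taken once per 1024 calls */
    return 1;
}
```

With this shape the slow path is amortized over 1024 allocations, so the steady-state cost is dominated by the atomic add itself. Under contention from several threads that add can become noticeably more expensive than the single-threaded estimate.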

---

## 🔬 Measurement Strategy

### Phase 1: Feature Isolation

Test configurations (environment variables):
1. **Baseline**: All features ON (current)
2. **No BigCache**: `HAKMEM_DISABLE_BIGCACHE=1`
3. **No ELO**: `HAKMEM_DISABLE_ELO=1` (use fixed threshold)
4. **Frozen mode**: `HAKMEM_EVO_POLICY=frozen` (skip learning)
5. **Minimal**: BigCache + ELO + Evolution all OFF

**Expected results**:
- If "No BigCache" → -100ns: BigCache overhead = 100ns
- If "No ELO" → -200ns: ELO overhead = 200ns
- If "Minimal" → -500ns: Total feature overhead = 500ns
- Remaining gap (~17,000 ns) → syscall/page fault overhead
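
The attribution arithmetic behind these expectations is plain subtraction. The sketch below spells it out; the function names are hypothetical, and the 37,102 ns "Minimal" timing used in the usage note is the illustrative -500 ns case from the list above, not a measurement.

```c
#include <assert.h>

/* Overhead of one feature in ns: baseline time minus time with the feature disabled. */
static double feature_overhead_ns(double baseline_ns, double disabled_ns) {
    return baseline_ns - disabled_ns;
}

/* Gap that remains against the reference allocator once all features are off. */
static double remaining_gap_ns(double minimal_ns, double reference_ns) {
    return minimal_ns - reference_ns;
}

/* Relative gap in percent, the form used for the headline figure. */
static double gap_percent(double ours_ns, double reference_ns) {
    assert(reference_ns > 0.0);
    return (ours_ns - reference_ns) / reference_ns * 100.0;
}
```

For example, `gap_percent(37602, 19964)` reproduces the +88.3% headline, and if "Minimal" came in at 37,102 ns, `remaining_gap_ns(37102, 19964)` would give 17,138 ns, which is the roughly 17,000 ns attributed to syscalls and page faults.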

### Phase 2: Profiling

```bash
# Compile with debug symbols
make clean && make CFLAGS="-g -O2"

# Run with perf
perf record -g ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100
perf report

# Look for:
#   - hak_alloc_at() time breakdown
#   - hak_bigcache_try_get() cost
#   - hak_elo_select_strategy() cost
#   - mmap/munmap syscall time
```

### Phase 3: Syscall Analysis

```bash
# Count syscalls
strace -c ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10

# Compare with mimalloc
strace -c -o hakmem.strace ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10
strace -c -o mimalloc.strace ./bench_allocators --allocator mimalloc --scenario vm --iterations 10

diff hakmem.strace mimalloc.strace
```

---

## 🎯 Expected Findings

**Hypothesis 1: BigCache overhead = 5-10%**
- Hash lookup + slot iteration
- Negligible compared to total gap

**Hypothesis 2: ELO overhead = 5-10%**
- Softmax calculation
- Can be eliminated in FROZEN mode

**Hypothesis 3: mmap/munmap overhead = 60-70%**
- System call overhead
- Page fault overhead
- **This is the main gap**
- Solution: Reduce the number of mmap/munmap calls (BigCache already does this on cache hits)

**Hypothesis 4: Remaining gap = mimalloc's slab allocator**
- mimalloc serves 2MB allocations from a slab allocator
- Pre-allocated, no syscalls
- hakmem uses mmap per allocation (first miss)
- **Can't compete without similar architecture**
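
The "pre-allocated, no syscalls" behaviour in Hypothesis 4 can be illustrated with a bump arena: one `mmap` up front, then pointer arithmetic per allocation. This is a deliberately minimal sketch, not mimalloc's design and not a proposed hakmem implementation; `ArenaSketch` and its functions are hypothetical names, and per-block free and thread safety are left out.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1  /* exposes MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

typedef struct {
    unsigned char *base;  /* start of the mmap'd region */
    size_t cap;           /* total bytes reserved */
    size_t used;          /* bump offset */
} ArenaSketch;

/* Reserve the whole arena with a single mmap call. Returns 0 on success. */
static int arena_init(ArenaSketch *a, size_t cap) {
    void *p = mmap(NULL, cap, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    a->base = p;
    a->cap = cap;
    a->used = 0;
    return 0;
}

/* Bump-allocate size bytes aligned to align (a power of two); NULL when full. */
static void *arena_alloc(ArenaSketch *a, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (a->used + align - 1) & ~(align - 1);
    if (start > a->cap || size > a->cap - start) return NULL;
    a->used = start + size;
    return a->base + start;  /* no syscall on this path */
}

/* Release everything with a single munmap. */
static void arena_destroy(ArenaSketch *a) {
    munmap(a->base, a->cap);
    a->base = NULL;
    a->cap = a->used = 0;
}
```

The allocation path here makes no system call, but first-touch page faults on the arena's pages still occur, so this addresses the syscall share of the gap rather than the page-fault share.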

---

## 💡 Optimization Ideas (Phase 6.7+)

1. **FROZEN mode by default** (after learning)
   - Zero ELO overhead
   - -5% improvement

2. **BigCache optimization**
   - Direct indexing instead of linear search
   - -5% improvement

3. **Pre-allocated arena** (Phase 7?)
   - mmap large arena once
   - Suballocate from arena
   - Avoid per-allocation syscalls
   - Target: -50% improvement

4. **Header optimization**
   - Reduce AllocHeader size (32 → 16 bytes?)
   - Use bit packing
   - -2% improvement
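
For idea 2, "direct indexing instead of linear search" could look like the sketch below: one slot per (site bucket, size class), found by masking and a bit scan rather than by iterating slots. Everything here is an assumption for illustration: the table dimensions, the power-of-two size classes starting at 1 MiB, and the names `bc_class`, `bc_put`, and `bc_try_get` are hypothetical and not BigCache's actual structure. `__builtin_ctzll` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BC_SITES     64u  /* hypothetical: power-of-two number of site buckets */
#define BC_CLASSES   4u   /* hypothetical: 1, 2, 4, 8 MiB classes */
#define BC_MIN_SHIFT 20u  /* smallest cached class is 1 MiB */

typedef struct { void *ptr; size_t size; } BcSlot;
static BcSlot bc_table[BC_SITES][BC_CLASSES];

/* Map a size to its class index, or -1 if it is not cacheable. */
static int bc_class(size_t size) {
    if (size == 0 || (size & (size - 1)) != 0) return -1;  /* exact powers of two only */
    unsigned shift = (unsigned)__builtin_ctzll((unsigned long long)size);
    if (shift < BC_MIN_SHIFT || shift >= BC_MIN_SHIFT + BC_CLASSES) return -1;
    return (int)(shift - BC_MIN_SHIFT);
}

/* O(1) lookup with no slot iteration. Returns NULL on a miss. */
static void *bc_try_get(uint64_t site_id, size_t size) {
    int c = bc_class(size);
    if (c < 0) return NULL;
    BcSlot *s = &bc_table[site_id & (BC_SITES - 1)][c];
    void *p = s->ptr;
    s->ptr = NULL;  /* the slot is emptied on a hit */
    return p;
}

/* Store a freed block; returns 0 if the slot is occupied or the size is not cacheable. */
static int bc_put(uint64_t site_id, size_t size, void *ptr) {
    int c = bc_class(size);
    if (c < 0) return 0;
    BcSlot *s = &bc_table[site_id & (BC_SITES - 1)][c];
    if (s->ptr != NULL) return 0;
    s->ptr = ptr;
    s->size = size;
    return 1;
}
```

The trade-off in this shape is capacity: a single slot per bucket and class makes the lookup constant-time, but a second free to the same slot is rejected and would fall through to `munmap`.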

---

## 📊 Success Metrics

- **Phase 6.7 Goal**: Identify top 3 overhead sources
- **Phase 7 Goal**: Reduce gap to +40% (vs +88% now)
- **Phase 8 Goal**: Reduce gap to +20% (competitive)

**Realistic limit**: Cannot beat mimalloc without slab allocator
- mimalloc: Industry-standard, 10+ years of optimization
- hakmem: Research PoC, 2 months of development
- **Target: Within 20-30% is acceptable for PoC**