# hakmem Overhead Analysis Plan (Phase 6.7 Preparation)

**Gap**: hakmem-evolving (37,602 ns) vs mimalloc (19,964 ns) = **+88.3%**

---

## 🎯 Overhead Candidates (by priority)

### P0: Critical Path Overhead

1. **BigCache lookup** (runs on every allocation)
   - Hash table lookup for site_id
   - Size class matching
   - Slot iteration
   - **Estimated cost**: 50-100 ns

2. **ELO strategy selection** (LEARN mode)
   - `hak_elo_select_strategy()`: softmax calculation
   - Probability computation across 12 strategies
   - Random number generation
   - **Estimated cost**: 100-200 ns

3. **Header read/write**
   - Read/write of AllocHeader (32 bytes)
   - Magic verification
   - **Estimated cost**: 10-20 ns

4. **Atomic tick counter**
   - `atomic_fetch_add(&tick_counter, 1)`
   - Every allocation
   - **Estimated cost**: 5-10 ns

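
To make item 3 concrete, here is a minimal sketch of a 32-byte header with a magic check. Only the 32-byte size and the magic verification come from this plan; the struct name `AllocHeaderSketch`, its field layout, the magic value, and both helper functions are assumptions made for illustration and may not match the real `AllocHeader`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic value; the real one is not given in this plan. */
#define HAK_MAGIC_SKETCH 0x48414B4Du  /* "HAKM" in ASCII */

/* Hypothetical layout; only the total size (32 bytes) comes from the plan. */
typedef struct {
    uint32_t magic;     /* integrity tag verified on every header read */
    uint32_t method;    /* how the block was obtained (assumed field) */
    uint64_t size;      /* requested size in bytes (assumed field) */
    uint64_t site_id;   /* call-site identifier (assumed field) */
    uint64_t reserved;  /* padding up to 32 bytes */
} AllocHeaderSketch;

static_assert(sizeof(AllocHeaderSketch) == 32, "header must be 32 bytes");

/* Write the header at the start of a raw block and return the user pointer. */
static inline void *header_write(void *raw, size_t size, uint64_t site_id) {
    AllocHeaderSketch *h = (AllocHeaderSketch *)raw;
    h->magic = HAK_MAGIC_SKETCH;
    h->method = 0;
    h->size = size;
    h->site_id = site_id;
    h->reserved = 0;
    return (char *)raw + sizeof(AllocHeaderSketch);
}

/* Step back from the user pointer and verify the magic; NULL if it does not match. */
static inline const AllocHeaderSketch *header_check(const void *user) {
    const AllocHeaderSketch *h =
        (const AllocHeaderSketch *)((const char *)user - sizeof(AllocHeaderSketch));
    return h->magic == HAK_MAGIC_SKETCH ? h : NULL;
}
```

In this layout the write touches 32 bytes and the check is one load and one compare. A 32-byte header fits in a single 64-byte cache line only when the block start is suitably aligned, so alignment is worth checking when measuring this cost.
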
### P1: Syscall Overhead

5. **mmap/munmap**
   - System call overhead
   - TLB flush
   - Page table updates
   - **Estimated cost**: 1,000-5,000 ns (syscall dependent)

6. **Page faults**
   - First touch of mmap'd memory
   - Soft page faults
   - **Estimated cost**: 100-500 ns per page

### P2: Other Overhead

7. **Evolution lifecycle**
   - `hak_evo_tick()` (every 1024 allocs)
   - `hak_evo_record_size()` (every alloc)
   - **Estimated cost**: 5-10 ns

8. **Batch madvise**
   - Batch add/flush overhead
   - **Estimated cost**: Amortized, should be near-zero

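
Items 4 and 7 interact: a counter is bumped on every allocation, and the lifecycle tick fires once per 1024 allocations. The sketch below shows one way that pattern can be written. The 1024 interval and the `atomic_fetch_add` on `tick_counter` come from this plan; the function `on_alloc_tick`, the `evo_ticks_run` stand-in for `hak_evo_tick()`, and the relaxed memory ordering are assumptions, not the existing implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define EVO_TICK_INTERVAL 1024u  /* from the plan: hak_evo_tick() every 1024 allocs */

/* A power-of-two interval lets the compiler turn the modulo into a bit mask. */
static_assert((EVO_TICK_INTERVAL & (EVO_TICK_INTERVAL - 1)) == 0,
              "interval must be a power of two");

static atomic_uint_fast64_t tick_counter;   /* zero-initialized (static storage) */
static atomic_uint_fast64_t evo_ticks_run;  /* hypothetical stand-in for hak_evo_tick() work */

/* Called once per allocation; returns 1 when this call triggered a lifecycle tick. */
static inline int on_alloc_tick(void) {
    /* Assumption: relaxed ordering suffices because the counter publishes no other data. */
    uint_fast64_t prev =
        atomic_fetch_add_explicit(&tick_counter, 1, memory_order_relaxed);
    if ((prev + 1) % EVO_TICK_INTERVAL == 0) {
        atomic_fetch_add_explicit(&evo_ticks_run, 1, memory_order_relaxed);
        return 1;
    }
    return 0;
}
```

On x86 a relaxed and a sequentially consistent fetch-add typically compile to the same locked instruction, so the ordering choice mainly matters on weakly ordered architectures. Under multi-threaded load the shared cache line itself may dominate the cost; a per-thread counter would be one way to test that.
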

---

## 🔬 Measurement Strategy

### Phase 1: Feature Isolation

Test configurations (environment variables):
1. **Baseline**: All features ON (current)
2. **No BigCache**: `HAKMEM_DISABLE_BIGCACHE=1`
3. **No ELO**: `HAKMEM_DISABLE_ELO=1` (use fixed threshold)
4. **Frozen mode**: `HAKMEM_EVO_POLICY=frozen` (skip learning)
5. **Minimal**: BigCache + ELO + Evolution all OFF

**Expected results**:
- If "No BigCache" runs 100 ns faster than Baseline: BigCache overhead = 100 ns
- If "No ELO" runs 200 ns faster than Baseline: ELO overhead = 200 ns
- If "Minimal" runs 500 ns faster than Baseline: total feature overhead = 500 ns
- The remaining gap (~17,000 ns) would then be attributed to syscall/page fault overhead

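
The isolation runs depend on the allocator reading these variables at startup. A minimal sketch of that parsing, plus the subtraction used to attribute overhead, follows. The variable names `HAKMEM_DISABLE_BIGCACHE`, `HAKMEM_DISABLE_ELO`, and `HAKMEM_EVO_POLICY` come from this plan; the struct, the helper names, and the rule that `0` or an empty string means "not set" are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag set; the real allocator may store these differently. */
typedef struct {
    int bigcache_enabled;
    int elo_enabled;
    int evo_frozen;
} HakFeatureFlagsSketch;

/* Assumed rule: a value counts as "set" unless it is missing, empty, or "0". */
static inline int value_is_set(const char *v) {
    return v != NULL && v[0] != '\0' && strcmp(v, "0") != 0;
}

/* Frozen mode is selected by the literal policy name from the plan. */
static inline int policy_is_frozen(const char *v) {
    return v != NULL && strcmp(v, "frozen") == 0;
}

/* Read all three switches once, e.g. during allocator initialization. */
static inline HakFeatureFlagsSketch read_feature_flags(void) {
    HakFeatureFlagsSketch f;
    f.bigcache_enabled = !value_is_set(getenv("HAKMEM_DISABLE_BIGCACHE"));
    f.elo_enabled = !value_is_set(getenv("HAKMEM_DISABLE_ELO"));
    f.evo_frozen = policy_is_frozen(getenv("HAKMEM_EVO_POLICY"));
    return f;
}

/* Overhead attributed to a feature: Baseline time minus time with it disabled. */
static inline double feature_overhead_ns(double baseline_ns, double disabled_ns) {
    assert(baseline_ns >= 0.0 && disabled_ns >= 0.0);
    return baseline_ns - disabled_ns;
}
```

Reading the variables once and caching the result keeps `getenv` off the allocation path, so the switches themselves should not distort the measurement. The subtraction assumes feature costs are additive; comparing the "Minimal" result against the sum of the single-feature deltas is a quick check of that assumption.
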
### Phase 2: Profiling

```bash
# Compile with debug symbols
make clean && make CFLAGS="-g -O2"

# Run with perf
perf record -g ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100
perf report

# Look for:
# - hak_alloc_at() time breakdown
# - hak_bigcache_try_get() cost
# - hak_elo_select_strategy() cost
# - mmap/munmap syscall time
```

### Phase 3: Syscall Analysis

```bash
# Count syscalls
strace -c ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10

# Compare with mimalloc
strace -c -o hakmem.strace ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10
strace -c -o mimalloc.strace ./bench_allocators --allocator mimalloc --scenario vm --iterations 10

diff hakmem.strace mimalloc.strace
```


---

## 🎯 Expected Findings

**Hypothesis 1: BigCache overhead = 5-10%**
- Hash lookup + slot iteration
- Negligible compared to total gap

**Hypothesis 2: ELO overhead = 5-10%**
- Softmax calculation
- Can be eliminated in FROZEN mode

**Hypothesis 3: mmap/munmap overhead = 60-70%**
- System call overhead
- Page fault overhead
- **This is the main gap**
- Solution: Reduce mmap/munmap calls (already done in part by BigCache)

**Hypothesis 4: Remaining gap = mimalloc's slab allocator**
- mimalloc uses a slab allocator for 2MB
- Pre-allocated, no syscalls
- hakmem uses mmap per allocation (first miss)
- **Can't compete without similar architecture**

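
The contrast in Hypothesis 4 (pre-allocated memory versus one mmap per allocation) can be illustrated with a minimal bump arena: one `mmap` up front, then allocations that are pure pointer arithmetic. This is a sketch under stated assumptions, not mimalloc's design and not existing hakmem code; `ArenaSketch` and its three functions are hypothetical names, and the sketch does not support freeing individual blocks.

```c
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS on glibc under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

typedef struct {
    unsigned char *base;  /* start of the mapping, NULL if unmapped */
    size_t capacity;      /* total bytes reserved */
    size_t used;          /* bump offset of the next free byte */
} ArenaSketch;

/* Reserve the whole arena with a single mmap call. Returns 0 on success. */
static inline int arena_init(ArenaSketch *a, size_t capacity) {
    void *p = mmap(NULL, capacity, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        a->base = NULL;
        a->capacity = 0;
        a->used = 0;
        return -1;
    }
    a->base = p;
    a->capacity = capacity;
    a->used = 0;
    return 0;
}

/* Bump-allocate `size` bytes aligned to `align` (a power of two); no syscall. */
static inline void *arena_alloc(ArenaSketch *a, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (a->used + align - 1) & ~(align - 1);
    if (start > a->capacity || size > a->capacity - start) {
        return NULL;  /* arena exhausted */
    }
    a->used = start + size;
    return a->base + start;
}

/* Return the whole arena to the OS with a single munmap call. */
static inline void arena_destroy(ArenaSketch *a) {
    if (a->base != NULL) {
        munmap(a->base, a->capacity);
    }
    a->base = NULL;
    a->capacity = 0;
    a->used = 0;
}
```

The arena removes per-allocation syscalls but not first-touch page faults: each page is still faulted in the first time it is written, so the P1 page-fault cost would need to be measured separately.
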

---

## 💡 Optimization Ideas (Phase 6.7+)

1. **FROZEN mode by default** (after learning)
   - Zero ELO overhead
   - Expected improvement: -5%

2. **BigCache optimization**
   - Direct indexing instead of linear search
   - Expected improvement: -5%

3. **Pre-allocated arena** (Phase 7?)
   - mmap large arena once
   - Suballocate from arena
   - Avoid per-allocation syscalls
   - Target: -50%

4. **Header optimization**
   - Reduce AllocHeader size (32 → 16 bytes?)
   - Use bit packing
   - Expected improvement: -2%

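
Idea 2 (direct indexing instead of linear search) could look like the sketch below: the site and size class select exactly one slot, so a lookup is one index computation and one compare. Everything here is hypothetical, including the table dimensions, the slot layout, and the names `bc_put` and `bc_try_get`; it is not the existing `hak_bigcache_try_get()` and it is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BC_SITES   64u  /* hypothetical; power of two so the site hash is a mask */
#define BC_CLASSES 4u   /* hypothetical number of size classes */

static_assert((BC_SITES & (BC_SITES - 1)) == 0, "BC_SITES must be a power of two");

typedef struct {
    void *ptr;         /* cached block, NULL when the slot is empty */
    uint64_t site_id;  /* owner, to detect two sites that mask to the same row */
} BigCacheSlotSketch;

static BigCacheSlotSketch bc_table[BC_SITES][BC_CLASSES];

/* Store a block; returns 0 if the slot is occupied (caller releases the block). */
static inline int bc_put(uint64_t site_id, unsigned cls, void *ptr) {
    assert(cls < BC_CLASSES && ptr != NULL);
    BigCacheSlotSketch *s = &bc_table[site_id & (BC_SITES - 1)][cls];
    if (s->ptr != NULL) {
        return 0;
    }
    s->ptr = ptr;
    s->site_id = site_id;
    return 1;
}

/* O(1) lookup: no slot iteration. Returns the block and empties the slot on a hit. */
static inline void *bc_try_get(uint64_t site_id, unsigned cls) {
    assert(cls < BC_CLASSES);
    BigCacheSlotSketch *s = &bc_table[site_id & (BC_SITES - 1)][cls];
    if (s->ptr == NULL || s->site_id != site_id) {
        return NULL;
    }
    void *p = s->ptr;
    s->ptr = NULL;
    return p;
}
```

The trade-off is capacity: one slot per (site, class) pair means a second block from the same site and class cannot be cached, which may lower the hit rate. Whether the faster lookup outweighs extra misses is something the Phase 1 runs would have to show.
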

---

## 📊 Success Metrics

- **Phase 6.7 Goal**: Identify top 3 overhead sources
- **Phase 7 Goal**: Reduce gap to +40% (vs +88% now)
- **Phase 8 Goal**: Reduce gap to +20% (competitive)

**Realistic limit**: Cannot beat mimalloc without slab allocator
- mimalloc: Industry-standard, 10+ years of optimization
- hakmem: Research PoC, 2 months of development
- **Target: Within 20-30% is acceptable for PoC**

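
The percentage goals above translate into absolute latencies through the same formula that produces the +88.3% gap. The two helpers below are a small worked sketch of that arithmetic; the function names are hypothetical, and the inputs are the two benchmark numbers quoted at the top of this plan.

```c
#include <assert.h>

/* Relative gap of `candidate_ns` over `baseline_ns`, in percent. */
static inline double gap_percent(double candidate_ns, double baseline_ns) {
    assert(baseline_ns > 0.0);
    return (candidate_ns - baseline_ns) / baseline_ns * 100.0;
}

/* Latency that corresponds to a target gap (in percent) over the baseline. */
static inline double target_ns(double baseline_ns, double gap_pct) {
    assert(baseline_ns > 0.0);
    return baseline_ns * (1.0 + gap_pct / 100.0);
}
```

With the mimalloc baseline of 19,964 ns, `gap_percent(37602, 19964)` gives about 88.3. The Phase 7 goal of +40% corresponds to roughly 27,950 ns, a reduction of about 9,650 ns from today's 37,602 ns. The Phase 8 goal of +20% corresponds to roughly 23,957 ns, a reduction of about 13,645 ns.
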