# Phase 9-1 Performance Investigation - Executive Summary
**Date**: 2025-11-30
**Status**: Investigation Complete
**Investigator**: Claude (Sonnet 4.5)
---
## TL;DR
**Phase 9-1 hash table optimization had ZERO performance impact because:**
1. SuperSlab is **DISABLED by default** - optimized code never runs
2. Real bottleneck is **kernel overhead (55%)** - mmap/munmap syscalls dominate
3. SuperSlab lookup is **NOT in hot path** - only 1.14% of total time
**Fix**: Address SuperSlab backend failures and kernel overhead, not lookup performance.
---
## Performance Data
### Benchmark Results
| Configuration | Throughput | Change |
|--------------|------------|---------|
| Phase 8 Baseline | 16.5 M ops/s | - |
| Phase 9-1 (SuperSlab OFF) | 16.5 M ops/s | **0%** |
| Phase 9-1 (SuperSlab ON) | 16.4 M ops/s | **0%** |
**Conclusion**: Hash table optimization made no difference.
### Perf Profile (WS8192)
| Component | CPU % | ~Cycles/op | Status |
|-----------|-------|--------|--------|
| **Kernel (mmap/munmap)** | **55%** | ~117 | **BOTTLENECK** |
| ├─ munmap / VMA splitting | 30% | ~64 | Critical issue |
| └─ mmap / page setup | 11% | ~23 | Expensive |
| **free() wrapper** | 11% | ~24 | Wrapper overhead |
| **main() benchmark loop** | 8% | ~16 | Measurement artifact |
| **unified_cache_refill** | 4% | ~9 | Page faults |
| **Fast free TLS path** | 1% | ~3 | Actual work! |
| Other | 21% | ~43 | Misc |
**Key Insight**: Only **~3 cycles per operation** go to actual allocation work; the rest is overhead (~117 cycles per operation in the kernel alone). For scale: at 16.5 M ops/s each operation takes roughly 210 cycles on an assumed ~3.5 GHz core, which is how the CPU percentages map to the cycle counts above.
---
## Root Cause Analysis
### 1. SuperSlab Disabled by Default
**Code**: `core/box/hak_core_init.inc.h:172-173`
```c
if (!getenv("HAKMEM_TINY_USE_SUPERSLAB")) {
    setenv("HAKMEM_TINY_USE_SUPERSLAB", "0", 0); // DISABLED
}
```
**Impact**: Hash table code is never executed during benchmark.
### 2. Backend Failures Trigger Legacy Path
**Debug Logs**:
```
[SS_BACKEND] shared_fail→legacy cls=7 (4 times)
[TLS_SLL_HDR_RESET] cls=6 base=0x... got=0x00 expect=0xa6
```
**Analysis**:
- Class 7 (1024 bytes) SuperSlab exhaustion
- Falls back to system malloc → mmap/munmap
- 4 failures × ~1000 allocs = ~4000 kernel syscalls
- Explains 30% munmap overhead in perf
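A minimal sketch of this fallback behavior, with illustrative stub names rather than the real hakmem entry points; the point is that every shared-SuperSlab miss becomes a fresh `mmap()`, and the matching `free()` later becomes a `munmap()`:
```c
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative stubs only; the real hakmem functions have different names. */
static void *superslab_try_alloc(int cls) {
    (void)cls;
    return NULL;                      /* simulate class 7 exhaustion (shared_fail) */
}

static void *legacy_mmap_alloc(size_t size) {
    /* One mmap per allocation; the matching munmap on free is where the
       30% VMA-splitting overhead in the profile comes from. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

void *tiny_alloc_sketch(size_t size, int cls) {
    void *p = superslab_try_alloc(cls);   /* fast path: carve from a SuperSlab */
    if (p) return p;
    return legacy_mmap_alloc(size);       /* shared_fail → legacy fallback */
}
```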
### 3. Hash Table Not in Hot Path
**Perf Evidence**:
- `hak_super_lookup()` does NOT appear in the top 20 functions
- `ss_map_lookup()` hash table code: 0% visible overhead
- The fast TLS free path does the real work yet accounts for only 1.14% of total time
**Code Path**:
```
free(ptr)
 └─ hak_tiny_free_fast_v2()   [1.14% of total time]
     ├─ Read header (class_idx)
     ├─ Push to TLS freelist  ← FAST PATH (~3 cycles)
     └─ hak_super_lookup()    ← VALIDATION ONLY (not in hot path)
```
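The structure of that fast path, as a minimal self-contained sketch (the header layout, the 8-class array, and all names here are illustrative assumptions, not the actual `hak_tiny_free_fast_v2()` internals):
```c
#include <stdint.h>

typedef struct free_node { struct free_node *next; } free_node;

/* One freelist head per size class, thread-local: pushing here is the
   ~3-cycle fast path measured above. */
static __thread free_node *tls_freelist[8];

void tiny_free_fast_sketch(void *ptr) {
    uint8_t cls = ((uint8_t *)ptr)[-1] & 0x07;   /* 1. read class index from the block header */
    free_node *node = (free_node *)ptr;
    node->next = tls_freelist[cls];              /* 2. push onto the TLS freelist */
    tls_freelist[cls] = node;
    /* 3. hak_super_lookup() is only consulted on validation/slow paths,
       which is why it never shows up in the hot-path profile. */
}
```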
---
## Where Phase 8 Analysis Went Wrong
### Phase 8 Claimed (INCORRECT)
| Claim | Reality |
|-------|---------|
| "SuperSlab lookup = 50-80 cycles" | Lookup not in hot path (0% perf profile) |
| "Major bottleneck" | Kernel overhead (55%) is real bottleneck |
| "Expected: 16.5M → 23-25M ops/s" | Actual: 16.5M → 16.5M ops/s (0% change) |
### What Was Missed
1. **No profiling before optimization** - Assumed bottleneck without evidence
2. **Didn't check default config** - SuperSlab disabled by default
3. **Ignored kernel overhead** - 55% of time in syscalls
4. **Optimized wrong thing** - Lookup is validation, not hot path
---
## Recommended Action Plan
### Priority 1: Fix SuperSlab Backend (Immediate)
**Problem**: Class 7 (1024 bytes) exhaustion → legacy fallback → kernel overhead
**Solutions**:
1. **Increase SuperSlab size**: 512KB → 2MB
- 4x more blocks per slab
- Reduces fragmentation
- **Expected**: -20% kernel overhead = +30-40% throughput
2. **Pre-allocate SuperSlabs** at startup (sketched after this list):
```c
hak_ss_prewarm_class(7, 16); // 16 SuperSlabs for class 7
```
- Eliminates startup mmap overhead
- **Expected**: -30% kernel overhead = +50-70% throughput
3. **Enable SuperSlab by default** (after fixing backend):
```c
setenv("HAKMEM_TINY_USE_SUPERSLAB", "1", 0); // Enable
```
**Expected Result**: 16.5 M ops/s → **25-35 M ops/s** (+50-110%)
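A minimal sketch of what the prewarm proposed in item 2 could do. `hak_ss_prewarm_class()` is the API suggested above, but the body, the 2MB size, and the `ss_backend_register()` hook are assumptions for illustration, not the actual implementation:
```c
#include <stddef.h>
#include <sys/mman.h>

#define SS_SIZE (2u * 1024 * 1024)   /* assumed 2MB SuperSlab size */

/* Hypothetical registration hook; the real backend call differs. */
extern void ss_backend_register(int cls, void *slab, size_t size);

void hak_ss_prewarm_class(int cls, int count) {
    for (int i = 0; i < count; i++) {
        /* Pay the mmap cost once at startup instead of on first exhaustion
           inside the benchmark loop. */
        void *slab = mmap(NULL, SS_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (slab == MAP_FAILED) return;
        ss_backend_register(cls, slab, SS_SIZE);
    }
}
```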
### Priority 2: Reduce Kernel Overhead (Short-term)
**Problem**: 55% of time in mmap/munmap syscalls
**Solutions**:
1. **Fix backend failures** (see Priority 1)
2. **Increase batch size** to amortize syscall cost
3. **Pre-allocate memory pool** to avoid runtime mmap (see the sketch after this list)
4. **Monitor VMA count**: `cat /proc/self/maps | wc -l`
**Expected Result**: Kernel overhead 55% → 10-20%
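The pool idea from item 3 can be as simple as one large mapping reserved at startup and carved with a bump pointer, so the steady state issues no mmap/munmap at all. A minimal sketch with illustrative names and sizes:
```c
#include <stddef.h>
#include <sys/mman.h>

static unsigned char *pool_base;
static size_t pool_size, pool_off;

/* Reserve the whole pool with a single mmap at startup. */
int pool_init(size_t bytes) {
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    pool_base = p;
    pool_size = bytes;
    pool_off  = 0;
    return 0;
}

/* Bump-pointer carve: no syscalls in the steady state. */
void *pool_alloc(size_t bytes) {
    bytes = (bytes + 15) & ~(size_t)15;             /* keep 16-byte alignment */
    if (pool_off + bytes > pool_size) return NULL;  /* pool exhausted */
    void *p = pool_base + pool_off;
    pool_off += bytes;
    return p;
}
```
A real pool would also need a free path and per-size-class segregation, which is exactly what a stable SuperSlab backend is meant to provide.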
### Priority 3: Optimize User-space (Long-term)
**Problem**: 11% in free() wrapper overhead
**Solutions**:
1. **Inline wrapper** more aggressively
2. **Remove stack canary** checks in hot path
3. **Optimize TLS access** (direct segment access; see the sketch below)
**Expected Result**: -5% overhead = +6-8% throughput
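A sketch combining ideas 1 and 3: force-inline the wrapper and give the thread-local freelist head the initial-exec TLS model so access is a single segment-relative load. Names are illustrative; this is not the actual hakmem wrapper:
```c
typedef struct node { struct node *next; } node;

/* initial-exec TLS model: the head is addressed directly off the thread
   pointer instead of going through __tls_get_addr. */
static __thread node *tls_head __attribute__((tls_model("initial-exec")));

/* Force inlining so the wrapper adds no call/return overhead. */
static inline __attribute__((always_inline)) void free_fast_sketch(void *ptr) {
    node *n = (node *)ptr;
    n->next = tls_head;
    tls_head = n;
}
```
Note that initial-exec TLS only works for libraries loaded at process start (e.g. via LD_PRELOAD), not ones brought in later with dlopen.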
---
## Performance Projections
### Scenario 1: Fix Backend + Prewarm (Recommended)
**Changes**:
- Fix class 7 exhaustion
- Pre-allocate SuperSlab pool
- Enable SuperSlab by default
**Expected**:
- Kernel: 55% → 10% (-45%)
- Throughput: 16.5 M → **45-50 M ops/s** (+170-200%)
### Scenario 2: Increase SuperSlab Size Only
**Changes**:
- Change default: 512KB → 2MB
- No other changes
**Expected**:
- Kernel: 55% → 35% (-20%)
- Throughput: 16.5 M → **25-30 M ops/s** (+50-80%)
### Scenario 3: Do Nothing (Status Quo)
**Result**: 16.5 M ops/s (no change)
- Hash table infrastructure exists but provides no benefit
- Kernel overhead continues to dominate
- SuperSlab backend remains unstable
---
## Lessons Learned
### What Went Well
1. **Clean implementation**: Hash table code is well-architected
2. **Box pattern compliance**: Single responsibility, clear contracts
3. **No regressions**: 0% performance change (neither better nor worse)
4. **Good infrastructure**: Enables future optimizations
### What Could Be Better
1. **Profile before optimizing**: Always run perf first
2. **Verify assumptions**: Check default configuration
3. **Focus on hot path**: Optimize what's actually slow
4. **Measure kernel time**: Don't ignore syscall overhead
### Key Takeaway
> "Premature optimization is the root of all evil. Profile first, optimize second."
> - Donald Knuth
Phase 9-1 optimized SuperSlab lookup (not in hot path) while ignoring kernel overhead (55% of runtime). Always profile before optimizing!
---
## Next Steps
### Immediate (This Week)
1. **Investigate class 7 exhaustion**:
```bash
HAKMEM_SS_DEBUG=1 ./bench_random_mixed_hakmem 10000000 8192 42 2>&1 | grep -E "cls=7|shared_fail"
```
2. **Test SuperSlab size increase**:
- Change `SUPERSLAB_SIZE_MIN` from 512KB to 2MB (sketched after this list)
- Re-run benchmark, expect +50-80% throughput
3. **Test prewarming**:
```c
hak_ss_prewarm_class(7, 16); // Pre-allocate 16 SuperSlabs
```
- Expect +50-70% throughput
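For item 2, the size change is a one-line constant edit. A hypothetical form, since this report does not show where `SUPERSLAB_SIZE_MIN` is defined:
```c
/* Hypothetical definition; the real one may take a different form. */
#define SUPERSLAB_SIZE_MIN (2u * 1024 * 1024)   /* was (512u * 1024): 512KB → 2MB */
```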
### Short-term (Next 2 Weeks)
1. **Fix backend stability**:
- Investigate fragmentation metrics
- Increase shared SuperSlab capacity
- Add telemetry for exhaustion events
2. **Enable SuperSlab by default**:
- Only after backend is stable
- Verify no regressions with full test suite
3. **Re-benchmark** with fixed backend:
- Target: 45-50 M ops/s at WS8192
- Compare to mimalloc (96.5 M ops/s)
### Long-term (Future Phases)
1. **Phase 10**: Reduce wrapper overhead (11% → 5%)
2. **Phase 11**: Architecture re-evaluation if still >2x slower than mimalloc
3. **Phase 12**: Consider hybrid approach (TLS + different backend)
---
## Files
**Investigation Report** (Full Details):
- `/mnt/workdisk/public_share/hakmem/PHASE9_PERF_INVESTIGATION.md`
**Summary** (This File):
- `/mnt/workdisk/public_share/hakmem/PHASE9_1_INVESTIGATION_SUMMARY.md`
**Perf Data**:
- `/tmp/phase9_perf.data` (perf record output)
**Related Documents**:
- `PHASE8_TECHNICAL_ANALYSIS.md` - Original (incorrect) bottleneck analysis
- `PHASE9_1_COMPLETE.md` - Implementation completion report
- `PHASE9_1_PROGRESS.md` - Detailed progress tracking
---
## Conclusion
Phase 9-1 successfully delivered clean O(1) hash table infrastructure, but **performance did not improve** because:
1. **Wrong target**: Optimized lookup (not in hot path)
2. **Real bottleneck**: Kernel overhead (55% from mmap/munmap)
3. **Backend issues**: SuperSlab exhaustion forces legacy fallback
**Recommendation**: Fix SuperSlab backend and reduce kernel overhead. Expected gain: +170-200% throughput (16.5 M → 45-50 M ops/s).
---
**Prepared by**: Claude (Sonnet 4.5)
**Date**: 2025-11-30
**Status**: Complete - Action plan provided