This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations. Key changes include: - **ACE Tracing Implementation**: - Added environment variable to enable/disable detailed logging of allocation failures. - Instrumented , , and to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure). - **Build System Fixes**: - Corrected to ensure is properly linked into , resolving an error. - **LD_PRELOAD Wrapper Adjustments**: - Investigated and understood the wrapper's behavior under , particularly its interaction with and checks. - Enabled debugging flags for environment to prevent unintended fallbacks to 's for non-tiny allocations, allowing comprehensive testing of the allocator. - **Debugging & Verification**: - Introduced temporary verbose logging to pinpoint execution flow issues within interception and routing. These temporary logs have been removed. - Created to facilitate testing of the tracing features. This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in by providing clear insights into the failure pathways.
300 lines
8.3 KiB
Markdown
300 lines
8.3 KiB
Markdown
# Phase 9-1 Performance Investigation - Executive Summary
|
||
|
||
**Date**: 2025-11-30
|
||
**Status**: Investigation Complete
|
||
**Investigator**: Claude (Sonnet 4.5)
|
||
|
||
---
|
||
|
||
## TL;DR
|
||
|
||
**Phase 9-1 hash table optimization had ZERO performance impact because:**
|
||
|
||
1. SuperSlab is **DISABLED by default** - optimized code never runs
|
||
2. Real bottleneck is **kernel overhead (55%)** - mmap/munmap syscalls dominate
|
||
3. SuperSlab lookup is **NOT in hot path** - only 1.14% of total time
|
||
|
||
**Fix**: Address SuperSlab backend failures and kernel overhead, not lookup performance.
|
||
|
||
---
|
||
|
||
## Performance Data
|
||
|
||
### Benchmark Results
|
||
|
||
| Configuration | Throughput | Change |
|
||
|--------------|------------|---------|
|
||
| Phase 8 Baseline | 16.5 M ops/s | - |
|
||
| Phase 9-1 (SuperSlab OFF) | 16.5 M ops/s | **0%** |
|
||
| Phase 9-1 (SuperSlab ON) | 16.4 M ops/s | **0%** |
|
||
|
||
**Conclusion**: Hash table optimization made no difference.
|
||
|
||
### Perf Profile (WS8192)
|
||
|
||
| Component | CPU % | Cycles | Status |
|
||
|-----------|-------|--------|--------|
|
||
| **Kernel (mmap/munmap)** | **55%** | ~117 | **BOTTLENECK** |
|
||
| ├─ munmap / VMA splitting | 30% | ~64 | Critical issue |
|
||
| └─ mmap / page setup | 11% | ~23 | Expensive |
|
||
| **free() wrapper** | 11% | ~24 | Wrapper overhead |
|
||
| **main() benchmark loop** | 8% | ~16 | Measurement artifact |
|
||
| **unified_cache_refill** | 4% | ~9 | Page faults |
|
||
| **Fast free TLS path** | 1% | ~3 | Actual work! |
|
||
| Other | 21% | ~43 | Misc |
|
||
|
||
**Key Insight**: Only **3 cycles** are spent in actual allocation work. The rest is overhead (117 cycles in kernel alone!).
|
||
|
||
---
|
||
|
||
## Root Cause Analysis
|
||
|
||
### 1. SuperSlab Disabled by Default
|
||
|
||
**Code**: `core/box/hak_core_init.inc.h:172-173`
|
||
```c
|
||
if (!getenv("HAKMEM_TINY_USE_SUPERSLAB")) {
|
||
setenv("HAKMEM_TINY_USE_SUPERSLAB", "0", 0); // DISABLED
|
||
}
|
||
```
|
||
|
||
**Impact**: Hash table code is never executed during benchmark.
|
||
|
||
### 2. Backend Failures Trigger Legacy Path
|
||
|
||
**Debug Logs**:
|
||
```
|
||
[SS_BACKEND] shared_fail→legacy cls=7 (4 times)
|
||
[TLS_SLL_HDR_RESET] cls=6 base=0x... got=0x00 expect=0xa6
|
||
```
|
||
|
||
**Analysis**:
|
||
- Class 7 (1024 bytes) SuperSlab exhaustion
|
||
- Falls back to system malloc → mmap/munmap
|
||
- 4 failures × ~1000 allocs = ~4000 kernel syscalls
|
||
- Explains 30% munmap overhead in perf
|
||
|
||
### 3. Hash Table Not in Hot Path
|
||
|
||
**Perf Evidence**:
|
||
- `hak_super_lookup()` does NOT appear in top 20 functions
|
||
- `ss_map_lookup()` hash table code: 0% visible overhead
|
||
- Fast TLS path dominates: only 1.14% total free time
|
||
|
||
**Code Path**:
|
||
```
|
||
free(ptr)
|
||
└─ hak_tiny_free_fast_v2() [1.14% total]
|
||
├─ Read header (class_idx)
|
||
├─ Push to TLS freelist ← FAST PATH (3 cycles)
|
||
└─ hak_super_lookup() ← VALIDATION ONLY (not in hot path)
|
||
```
|
||
|
||
---
|
||
|
||
## Where Phase 8 Analysis Went Wrong
|
||
|
||
### Phase 8 Claimed (INCORRECT)
|
||
|
||
| Claim | Reality |
|
||
|-------|---------|
|
||
| "SuperSlab lookup = 50-80 cycles" | Lookup not in hot path (0% perf profile) |
|
||
| "Major bottleneck" | Kernel overhead (55%) is real bottleneck |
|
||
| "Expected: 16.5M → 23-25M ops/s" | Actual: 16.5M → 16.5M ops/s (0% change) |
|
||
|
||
### What Was Missed
|
||
|
||
1. **No profiling before optimization** - Assumed bottleneck without evidence
|
||
2. **Didn't check default config** - SuperSlab disabled by default
|
||
3. **Ignored kernel overhead** - 55% of time in syscalls
|
||
4. **Optimized wrong thing** - Lookup is validation, not hot path
|
||
|
||
---
|
||
|
||
## Recommended Action Plan
|
||
|
||
### Priority 1: Fix SuperSlab Backend (Immediate)
|
||
|
||
**Problem**: Class 7 (1024 bytes) exhaustion → legacy fallback → kernel overhead
|
||
|
||
**Solutions**:
|
||
|
||
1. **Increase SuperSlab size**: 512KB → 2MB
|
||
- 4x more blocks per slab
|
||
- Reduces fragmentation
|
||
- **Expected**: -20% kernel overhead = +30-40% throughput
|
||
|
||
2. **Pre-allocate SuperSlabs** at startup:
|
||
```c
|
||
hak_ss_prewarm_class(7, 16); // 16 SuperSlabs for class 7
|
||
```
|
||
- Eliminates startup mmap overhead
|
||
- **Expected**: -30% kernel overhead = +50-70% throughput
|
||
|
||
3. **Enable SuperSlab by default** (after fixing backend):
|
||
```c
|
||
setenv("HAKMEM_TINY_USE_SUPERSLAB", "1", 0); // Enable
|
||
```
|
||
|
||
**Expected Result**: 16.5 M ops/s → **25-35 M ops/s** (+50-110%)
|
||
|
||
### Priority 2: Reduce Kernel Overhead (Short-term)
|
||
|
||
**Problem**: 55% of time in mmap/munmap syscalls
|
||
|
||
**Solutions**:
|
||
|
||
1. **Fix backend failures** (see Priority 1)
|
||
2. **Increase batch size** to amortize syscall cost
|
||
3. **Pre-allocate memory pool** to avoid runtime mmap
|
||
4. **Monitor VMA count**: `cat /proc/self/maps | wc -l`
|
||
|
||
**Expected Result**: Kernel overhead 55% → 10-20%
|
||
|
||
### Priority 3: Optimize User-space (Long-term)
|
||
|
||
**Problem**: 11% in free() wrapper overhead
|
||
|
||
**Solutions**:
|
||
|
||
1. **Inline wrapper** more aggressively
|
||
2. **Remove stack canary** checks in hot path
|
||
3. **Optimize TLS access** (direct segment access)
|
||
|
||
**Expected Result**: -5% overhead = +6-8% throughput
|
||
|
||
---
|
||
|
||
## Performance Projections
|
||
|
||
### Scenario 1: Fix Backend + Prewarm (Recommended)
|
||
|
||
**Changes**:
|
||
- Fix class 7 exhaustion
|
||
- Pre-allocate SuperSlab pool
|
||
- Enable SuperSlab by default
|
||
|
||
**Expected**:
|
||
- Kernel: 55% → 10% (-45%)
|
||
- Throughput: 16.5 M → **45-50 M ops/s** (+170-200%)
|
||
|
||
### Scenario 2: Increase SuperSlab Size Only
|
||
|
||
**Changes**:
|
||
- Change default: 512KB → 2MB
|
||
- No other changes
|
||
|
||
**Expected**:
|
||
- Kernel: 55% → 35% (-20%)
|
||
- Throughput: 16.5 M → **25-30 M ops/s** (+50-80%)
|
||
|
||
### Scenario 3: Do Nothing (Status Quo)
|
||
|
||
**Result**: 16.5 M ops/s (no change)
|
||
- Hash table infrastructure exists but provides no benefit
|
||
- Kernel overhead continues to dominate
|
||
- SuperSlab backend remains unstable
|
||
|
||
---
|
||
|
||
## Lessons Learned
|
||
|
||
### What Went Well
|
||
|
||
1. **Clean implementation**: Hash table code is well-architected
|
||
2. **Box pattern compliance**: Single responsibility, clear contracts
|
||
3. **No regressions**: 0% performance change (neither better nor worse)
|
||
4. **Good infrastructure**: Enables future optimizations
|
||
|
||
### What Could Be Better
|
||
|
||
1. **Profile before optimizing**: Always run perf first
|
||
2. **Verify assumptions**: Check default configuration
|
||
3. **Focus on hot path**: Optimize what's actually slow
|
||
4. **Measure kernel time**: Don't ignore syscall overhead
|
||
|
||
### Key Takeaway
|
||
|
||
> "Premature optimization is the root of all evil. Profile first, optimize second."
|
||
> - Donald Knuth
|
||
|
||
Phase 9-1 optimized SuperSlab lookup (not in hot path) while ignoring kernel overhead (55% of runtime). Always profile before optimizing!
|
||
|
||
---
|
||
|
||
## Next Steps
|
||
|
||
### Immediate (This Week)
|
||
|
||
1. **Investigate class 7 exhaustion**:
|
||
```bash
|
||
HAKMEM_SS_DEBUG=1 ./bench_random_mixed_hakmem 10000000 8192 42 2>&1 | grep -E "cls=7|shared_fail"
|
||
```
|
||
|
||
2. **Test SuperSlab size increase**:
|
||
- Change `SUPERSLAB_SIZE_MIN` from 512KB to 2MB
|
||
- Re-run benchmark, expect +50-80% throughput
|
||
|
||
3. **Test prewarming**:
|
||
```c
|
||
hak_ss_prewarm_class(7, 16); // Pre-allocate 16 SuperSlabs
|
||
```
|
||
- Expect +50-70% throughput
|
||
|
||
### Short-term (Next 2 Weeks)
|
||
|
||
1. **Fix backend stability**:
|
||
- Investigate fragmentation metrics
|
||
- Increase shared SuperSlab capacity
|
||
- Add telemetry for exhaustion events
|
||
|
||
2. **Enable SuperSlab by default**:
|
||
- Only after backend is stable
|
||
- Verify no regressions with full test suite
|
||
|
||
3. **Re-benchmark** with fixed backend:
|
||
- Target: 45-50 M ops/s at WS8192
|
||
- Compare to mimalloc (96.5 M ops/s)
|
||
|
||
### Long-term (Future Phases)
|
||
|
||
1. **Phase 10**: Reduce wrapper overhead (11% → 5%)
|
||
2. **Phase 11**: Architecture re-evaluation if still >2x slower than mimalloc
|
||
3. **Phase 12**: Consider hybrid approach (TLS + different backend)
|
||
|
||
---
|
||
|
||
## Files
|
||
|
||
**Investigation Report** (Full Details):
|
||
- `/mnt/workdisk/public_share/hakmem/PHASE9_PERF_INVESTIGATION.md`
|
||
|
||
**Summary** (This File):
|
||
- `/mnt/workdisk/public_share/hakmem/PHASE9_1_INVESTIGATION_SUMMARY.md`
|
||
|
||
**Perf Data**:
|
||
- `/tmp/phase9_perf.data` (perf record output)
|
||
|
||
**Related Documents**:
|
||
- `PHASE8_TECHNICAL_ANALYSIS.md` - Original (incorrect) bottleneck analysis
|
||
- `PHASE9_1_COMPLETE.md` - Implementation completion report
|
||
- `PHASE9_1_PROGRESS.md` - Detailed progress tracking
|
||
|
||
---
|
||
|
||
## Conclusion
|
||
|
||
Phase 9-1 successfully delivered clean O(1) hash table infrastructure, but **performance did not improve** because:
|
||
|
||
1. **Wrong target**: Optimized lookup (not in hot path)
|
||
2. **Real bottleneck**: Kernel overhead (55% from mmap/munmap)
|
||
3. **Backend issues**: SuperSlab exhaustion forces legacy fallback
|
||
|
||
**Recommendation**: Fix SuperSlab backend and reduce kernel overhead. Expected gain: +170-200% throughput (16.5 M → 45-50 M ops/s).
|
||
|
||
---
|
||
|
||
**Prepared by**: Claude (Sonnet 4.5)
|
||
**Date**: 2025-11-30
|
||
**Status**: Complete - Action plan provided
|