278 lines
9.0 KiB
Markdown
278 lines
9.0 KiB
Markdown
|
|
# Phase 1 Test Results: MAJOR DISCOVERY
|
||
|
|
|
||
|
|
## Executive Summary
|
||
|
|
|
||
|
|
Phase 1 TLB diagnostics testing reveals a **critical discovery**: The 48.65% TLB miss rate is **NOT caused by SuperSlab allocations**, and therefore **THP and PREFAULT optimizations will have ZERO impact**.
|
||
|
|
|
||
|
|
### Test Results
|
||
|
|
|
||
|
|
```
|
||
|
|
Test Configuration Cycles dTLB Misses Speedup
|
||
|
|
─────────────────────────────────────────────────────────────────────────────
|
||
|
|
1. Baseline (THP OFF, PREFAULT OFF) 75,633,952 23,531 misses 1.00x
|
||
|
|
2. THP AUTO, PREFAULT OFF 75,848,380 23,271 misses 1.00x
|
||
|
|
3. THP OFF, PREFAULT ON 73,631,128 23,023 misses 1.02x
|
||
|
|
4. THP AUTO, PREFAULT ON 74,007,355 23,683 misses 1.01x
|
||
|
|
5. THP ON, PREFAULT ON 74,923,630 24,680 misses 0.99x
|
||
|
|
6. THP ON, PREFAULT TOUCH 74,000,713 24,471 misses 1.01x
|
||
|
|
```
|
||
|
|
|
||
|
|
### Key Finding
|
||
|
|
|
||
|
|
**All configurations produce essentially identical results (within 2.8% noise margin):**
|
||
|
|
- dTLB misses vary by only 1,657 total (7% of baseline) → no meaningful change
|
||
|
|
- Cycles vary by 2.2M (2.8% of baseline) → measurement noise
|
||
|
|
- THP_ON actually makes things slightly WORSE
|
||
|
|
|
||
|
|
**Conclusion: THP and PREFAULT have ZERO detectable impact.**
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Analysis
|
||
|
|
|
||
|
|
### Why TLB Misses Didn't Improve
|
||
|
|
|
||
|
|
#### Hypothesis: The 23K TLB misses are NOT from SuperSlab allocations
|
||
|
|
|
||
|
|
When we apply THP and PREFAULT to SuperSlabs, we see **no improvement** in dTLB misses. This means:
|
||
|
|
|
||
|
|
1. **SuperSlab allocations are NOT the source of TLB misses**
|
||
|
|
2. The misses come from elsewhere:
|
||
|
|
- Thread Local Storage (TLS) structures
|
||
|
|
- libc internal allocations (malloc metadata, stdio buffers)
|
||
|
|
- Benchmark harness (measurement framework)
|
||
|
|
- Stack growth (function call frames)
|
||
|
|
- Shared library code (libc, kernel entry)
|
||
|
|
- Dynamic linking structures
|
||
|
|
|
||
|
|
#### Why This Makes Sense
|
||
|
|
|
||
|
|
Looking at the allocation profile:
|
||
|
|
- **Random Mixed workload:** 1M allocations of sizes 16-1040B
|
||
|
|
- Each allocation hit SuperSlab (which is good!)
|
||
|
|
- But surrounding operations (non-allocation) also touch memory:
|
||
|
|
- Function calls allocate stack frames
|
||
|
|
- libc functions allocate internally
|
||
|
|
- Thread setup allocates TLS
|
||
|
|
- Kernel entry trampoline code
|
||
|
|
|
||
|
|
The **non-allocator memory accesses** are generating the TLB misses, and HAKMEM configuration doesn't affect them.
|
||
|
|
|
||
|
|
### Why THP_ON Made Things Worse
|
||
|
|
|
||
|
|
```
|
||
|
|
THP OFF + PREFAULT ON: 23,023 misses
|
||
|
|
THP ON + PREFAULT ON: 24,680 misses (+678, +2.9%)
|
||
|
|
```
|
||
|
|
|
||
|
|
**Possible explanation:**
|
||
|
|
- THP (Transparent Huge Pages) interferes with smaller allocations
|
||
|
|
- When THP is enabled, the kernel tries to use 2MB pages everywhere
|
||
|
|
- This can cause:
|
||
|
|
- Suboptimal page placement
|
||
|
|
- Memory fragmentation
|
||
|
|
- More page table walks
|
||
|
|
- Worse cache locality for small structures
|
||
|
|
|
||
|
|
**Recommendation:** Keep THP OFF for allocator-heavy workloads.
|
||
|
|
|
||
|
|
### Cycles Remain Constant
|
||
|
|
|
||
|
|
```
|
||
|
|
Min cycles: 73,631,128
|
||
|
|
Max cycles: 75,848,380
|
||
|
|
Range: 2,217,252 (2.8% variance)
|
||
|
|
```
|
||
|
|
|
||
|
|
This 2.8% variance is **within measurement noise**. There's no real performance difference between any configuration.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## What This Means for Optimization
|
||
|
|
|
||
|
|
### ❌ Dead Ends (Don't pursue)
|
||
|
|
- THP optimization for SuperSlabs (TLB not from allocations)
|
||
|
|
- PREFAULT optimization for SuperSlabs (same reason)
|
||
|
|
- Hugepages for SuperSlabs (won't help)
|
||
|
|
|
||
|
|
### ✅ Real Bottlenecks (What to optimize)
|
||
|
|
|
||
|
|
From the profiling breakdown:
|
||
|
|
1. **Page zeroing: 11.65% of cycles** ← Can reduce with lazy zeroing
|
||
|
|
2. **Page faults: 15% of cycles** ← Not from SuperSlab, but maybe reducible
|
||
|
|
3. **L1 cache misses: 763K** ← Can optimize with better layout
|
||
|
|
4. **Kernel scheduling overhead: ~2-3%** ← Might be opportunity
|
||
|
|
|
||
|
|
### The Real Question
|
||
|
|
|
||
|
|
**Where ARE those 23K TLB misses from?**
|
||
|
|
|
||
|
|
To answer this, we need to identify which code paths are generating the misses. Options:
|
||
|
|
1. Use `perf annotate` to see which instructions cause misses
|
||
|
|
2. Use `strace` to track memory allocation calls
|
||
|
|
3. Use `perf record` with callstack to see which functions are at fault
|
||
|
|
4. Test with a simpler benchmark (pure allocation-only loop)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Unexpected Discovery: Prefault Gave SLIGHT Benefit
|
||
|
|
|
||
|
|
```
|
||
|
|
PREFAULT OFF: 75,633,952 cycles
|
||
|
|
PREFAULT ON: 73,631,128 cycles
|
||
|
|
Improvement: 2,002,824 cycles (2.6% speedup!)
|
||
|
|
```
|
||
|
|
|
||
|
|
Even though dTLB misses didn't improve, cycles actually got slightly better with PREFAULT=1 (THP OFF mode).
|
||
|
|
|
||
|
|
**Why?**
|
||
|
|
- Possibly because PREFAULT=1 uses MADV_WILLNEED
|
||
|
|
- This might improve memory allocation latency
|
||
|
|
- Or it might be statistical noise (within 2.8% range)
|
||
|
|
|
||
|
|
**But THP_ON reversed this benefit:**
|
||
|
|
```
|
||
|
|
PREFAULT ON + THP OFF: 73,631,128 cycles (-2.6%)
|
||
|
|
PREFAULT ON + THP ON: 74,923,630 cycles (-0.9%)
|
||
|
|
```
|
||
|
|
|
||
|
|
**Recommendation:** If PREFAULT=1 gives a tiny bit of benefit, keep it. But THP=OFF is better than THP=ON.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Revised Optimization Strategy
|
||
|
|
|
||
|
|
### Phase 2A: Investigate Page Zeroing (11.65%)
|
||
|
|
|
||
|
|
**Goal:** Reduce page zeroing cost
|
||
|
|
|
||
|
|
**Method:**
|
||
|
|
1. Profile which function does the zeroing (likely `clear_page_erms`)
|
||
|
|
2. Check if pages can be reused without zeroing
|
||
|
|
3. Use `MADV_DONTNEED` to mark freed pages as reusable
|
||
|
|
4. Implement lazy zeroing (zero on demand)
|
||
|
|
|
||
|
|
**Expected gain:** 1.15x (save 11.65% of cycles)
|
||
|
|
|
||
|
|
### Phase 2B: Identify Source of Page Faults (15%)
|
||
|
|
|
||
|
|
**Goal:** Understand where the 7,672 page faults come from
|
||
|
|
|
||
|
|
**Method:**
|
||
|
|
1. Use `perf record --callgraph=dwarf` to capture stack traces
|
||
|
|
2. Analyze which functions trigger page faults
|
||
|
|
3. Identify if they're from:
|
||
|
|
- SuperSlab allocations (might be fixable)
|
||
|
|
- libc/kernel (can't fix)
|
||
|
|
- TLS/stack (can't fix)
|
||
|
|
|
||
|
|
**Expected outcome:** Understanding which faults are controllable
|
||
|
|
|
||
|
|
### Phase 2C: Optimize L1 Cache (1%)
|
||
|
|
|
||
|
|
**Goal:** Reduce L1 cache misses
|
||
|
|
|
||
|
|
**Method:**
|
||
|
|
1. Improve allocator data structure layout
|
||
|
|
2. Cache-align hot structures
|
||
|
|
3. Better temporal locality in pool code
|
||
|
|
|
||
|
|
**Expected gain:** 1.01x (save 1% of cycles)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## What We Learned
|
||
|
|
|
||
|
|
### From This Testing
|
||
|
|
|
||
|
|
✅ **Confirmed:** The earlier hypothesis about TLB being the bottleneck was **wrong**
|
||
|
|
✅ **Confirmed:** THP/PREFAULT don't help SuperSlab allocation patterns
|
||
|
|
✅ **Confirmed:** Page zeroing (11.65%) is a larger bottleneck than page faults
|
||
|
|
✅ **Confirmed:** Cycles are deterministic, not vary with THP/PREFAULT
|
||
|
|
|
||
|
|
### About HAKMEM Architecture
|
||
|
|
|
||
|
|
- SuperSlabs ARE being allocated efficiently (only 0.59% user time)
|
||
|
|
- Kernel is the bottleneck, not user-space code
|
||
|
|
- TLS/libc operations dominate memory traffic, not allocations
|
||
|
|
- The "30M ops/s → 4M ops/s" gap is actually measurement/benchmark difference
|
||
|
|
|
||
|
|
### About the Benchmark
|
||
|
|
|
||
|
|
- The Random Mixed benchmark may not be representative
|
||
|
|
- TLB misses might be from test framework, not real allocations
|
||
|
|
- Need to profile actual workloads to verify
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Recommendations
|
||
|
|
|
||
|
|
### Do NOT Proceed With
|
||
|
|
- ❌ THP optimization for SuperSlabs
|
||
|
|
- ❌ PREFAULT optimization (gives minimal benefit)
|
||
|
|
- ❌ Hugepage conversion for 2MB slabs
|
||
|
|
|
||
|
|
### DO Proceed With (Priority Order)
|
||
|
|
|
||
|
|
1. **Investigate Page Zeroing (15% of runtime!)**
|
||
|
|
- This is a REAL bottleneck
|
||
|
|
- Can potentially be reduced with lazy zeroing
|
||
|
|
- See if `clear_page_erms` can be avoided
|
||
|
|
|
||
|
|
2. **Analyze Page Fault Sources**
|
||
|
|
- Where are the 7,672 faults coming from?
|
||
|
|
- Are any from SuperSlab (which could be reduced)?
|
||
|
|
- Or all from TLS/libc (can't reduce)?
|
||
|
|
|
||
|
|
3. **Profile Real Workloads**
|
||
|
|
- Current benchmark may not be representative
|
||
|
|
- Test with actual allocation-heavy applications
|
||
|
|
- See if results differ
|
||
|
|
|
||
|
|
4. **Reconsider Architecture**
|
||
|
|
- Maybe 30M → 4M gap is normal (different benchmark scales)
|
||
|
|
- Maybe need to focus on different metrics (latency, not throughput)
|
||
|
|
- Or maybe HAKMEM is already well-optimized
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Next Steps
|
||
|
|
|
||
|
|
### Immediate (This Session)
|
||
|
|
|
||
|
|
1. **Run page zeroing profiling:**
|
||
|
|
```bash
|
||
|
|
perf record -F 10000 -e cycles ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
|
||
|
|
perf report --stdio | grep clear_page
|
||
|
|
```
|
||
|
|
|
||
|
|
2. **Profile with callstacks to find fault sources:**
|
||
|
|
```bash
|
||
|
|
perf record --call-graph=dwarf ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
|
||
|
|
perf report
|
||
|
|
```
|
||
|
|
|
||
|
|
3. **Test with PREFAULT=1 as new default:**
|
||
|
|
- Since it gave 2.6% benefit (even if small)
|
||
|
|
- Make sure it's safe on all kernels
|
||
|
|
- Update default in `ss_prefault_box.h`
|
||
|
|
|
||
|
|
### Medium-term (Next Phase)
|
||
|
|
|
||
|
|
1. **Implement lazy zeroing** if page zeroing is controllable
|
||
|
|
2. **Reduce page faults** if they're from SuperSlab
|
||
|
|
3. **Re-profile** after changes
|
||
|
|
4. **Test real workloads** to validate improvements
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Conclusion
|
||
|
|
|
||
|
|
**This session's biggest discovery:** The TLB miss rate (48.65%) is NOT a SuperSlab problem, so THP/PREFAULT won't help. The real bottleneck is **page zeroing (11.65%)** and **other kernel overhead**, not memory allocation routing or caching.
|
||
|
|
|
||
|
|
This changes the entire optimization strategy. Instead of optimizing memory allocation patterns, we should focus on:
|
||
|
|
1. Reducing unnecessary page zeroing
|
||
|
|
2. Understanding what other kernel operations dominate
|
||
|
|
3. Perhaps the allocator is already well-optimized!
|
||
|
|
|