# Tiny Allocator: Extended Perf Profile (1M iterations)

**Date**: 2025-11-14
**Phase**: Tiny focused attack - 20M ops/s target
**Workload**: bench_random_mixed_hakmem 1M iterations, 256B blocks
**Throughput**: 8.65M ops/s (baseline: 8.88M from initial measurement)

---

## Executive Summary

**Goal**: Identify bottlenecks for 20M ops/s target (2.2-2.5x improvement needed)

**Key Findings**:
1. **classify_ptr remains the dominant allocator function** (3.74%) - consistent with Step 1 profile
2. **tiny_alloc_fast overhead reduced** (4.52% → 1.20%) - needs verification: drain=2048 effect or measurement variance?
3. **Kernel overhead still significant** (~37% within the Top 20, est. 50-60% overall) - but improved vs Step 1 (86%)
4. **User-space total: ~13%** - similar to Step 1

**Recommendation**: **Optimize classify_ptr** (3.74%, free path bottleneck)

---

## Perf Configuration

```bash
perf record -F 999 -g -o perf_tiny_256b_1M.data \
  -- ./out/release/bench_random_mixed_hakmem 1000000 256 42
```

**Samples**: 117 samples, 408M cycles
**Comparison**: Step 1 (500K) = 90 samples, 285M cycles
**Improvement**: +30% samples, +43% cycles (longer measurement)

---

## Top 20 Functions (Overall)

| Rank | Overhead | Function | Location | Notes |
|------|----------|----------|----------|-------|
| 1 | 5.46% | `main` | user | Benchmark loop (mmap/munmap) |
| 2 | 3.90% | `srso_alias_safe_ret` | kernel | Spectre mitigation |
| 3 | **3.74%** | **`classify_ptr`** | **user** | **Free path (pointer classification)** ✅ |
| 4 | 3.73% | `kmem_cache_alloc` | kernel | Kernel slab allocation |
| 5 | 2.94% | `do_anonymous_page` | kernel | Page fault handler |
| 6 | 2.73% | `__memset` | kernel | Kernel memset |
| 7 | 2.47% | `uncharge_batch` | kernel | Memory cgroup |
| 8 | 2.40% | `srso_alias_untrain_ret` | kernel | Spectre mitigation |
| 9 | 2.17% | `handle_mm_fault` | kernel | Memory management |
| 10 | 1.98% | `page_counter_cancel` | kernel | Memory cgroup |
| 11 | 1.96% | `mas_wr_node_store` | kernel | Maple tree (VMA management) |
| 12 | 1.95% | `asm_exc_page_fault` | kernel | Page fault entry |
| 13 | 1.94% | `__anon_vma_interval_tree_remove` | kernel | VMA tree |
| 14 | 1.90% | `vma_merge` | kernel | VMA merging |
| 15 | 1.88% | `__audit_syscall_exit` | kernel | Audit subsystem |
| 16 | 1.86% | `free_pgtables` | kernel | Page table free |
| 17 | 1.84% | `clear_page_erms` | kernel | Page clearing |
| 18 | **1.81%** | **`hak_tiny_alloc_fast_wrapper`** | **user** | **Alloc wrapper** ✅ |
| 19 | 1.77% | `__memset_avx2_unaligned_erms` | libc | User-space memset |
| 20 | 1.71% | `uncharge_folio` | kernel | Memory cgroup |

---

## User-Space Hot Paths Analysis (1%+ overhead)

### Top User-Space Functions

```
1. main:                         5.46%  (benchmark overhead)
2. classify_ptr:                 3.74%  ← FREE PATH BOTTLENECK ✅
3. hak_tiny_alloc_fast_wrapper:  1.81%  (alloc wrapper)
4. __memset (libc):              1.77%  (memset from user code)
5. tiny_alloc_fast:              1.20%  (alloc hot path)
6. hak_free_at.part.0:           1.04%  (free implementation)
7. malloc:                       0.97%  (malloc wrapper)

Total user-space overhead: ~12.78% (Top 20 only)
```

### Comparison with Step 1 (500K iterations)

| Function | Step 1 (500K) | Extended (1M) | Change |
|----------|---------------|---------------|--------|
| `classify_ptr` | 3.65% | 3.74% | **+0.09%** (stable) |
| `tiny_alloc_fast` | 4.52% | 1.20% | **-3.32%** (large drop!) |
| `hak_tiny_alloc_fast_wrapper` | 1.35% | 1.81% | +0.46% |
| `hak_free_at.part.0` | 1.43% | 1.04% | -0.39% |
| `free` | 2.89% | (not in top 20) | - |

**Notable Change**: `tiny_alloc_fast` overhead reduction (4.52% → 1.20%)

**Possible Causes**:
1. **drain=2048 default** - improved TLS cache efficiency (Step 2 implementation)
2. **Measurement variance** - short workload (1M = 116ms) has high variance
3. **Compiler optimization differences** - rebuild between measurements

**Stability**: `classify_ptr` remains consistently ~3.7% (stable bottleneck)

---

## Kernel vs User-Space Breakdown

### Top 20 Analysis

```
User-space:  4 functions, 12.78% total
  ├─ HAKMEM: 3 functions, 11.01% (main 5.46%, classify_ptr 3.74%, wrapper 1.81%)
  └─ libc:   1 function,   1.77% (__memset)

Kernel:     16 functions, 37.36% total (Top 20 only)
```

**Total Top 20**: 50.14% (remaining 49.86% in <1.71% functions)
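
The category totals can be re-derived from the Top 20 table. The sketch below hard-codes the twenty overhead values from that table and sums them by location. The `Sample` struct, the `TOP20` array, and the `sum_by_location` helper are illustrative names, not part of HAKMEM or perf.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the Top 20 table: overhead in percent plus its location tag. */
typedef struct {
    double overhead_pct;
    const char *location; /* "user", "libc", or "kernel" */
} Sample;

/* Values copied from the Top 20 table, in rank order. */
static const Sample TOP20[] = {
    {5.46, "user"},   {3.90, "kernel"}, {3.74, "user"},   {3.73, "kernel"},
    {2.94, "kernel"}, {2.73, "kernel"}, {2.47, "kernel"}, {2.40, "kernel"},
    {2.17, "kernel"}, {1.98, "kernel"}, {1.96, "kernel"}, {1.95, "kernel"},
    {1.94, "kernel"}, {1.90, "kernel"}, {1.88, "kernel"}, {1.86, "kernel"},
    {1.84, "kernel"}, {1.81, "user"},   {1.77, "libc"},   {1.71, "kernel"},
};
static const size_t TOP20_LEN = sizeof(TOP20) / sizeof(TOP20[0]);

/* Sum the overhead of every row whose location tag equals `location`. */
static double sum_by_location(const Sample *rows, size_t n, const char *location)
{
    assert(rows != NULL && location != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(rows[i].location, location) == 0) {
            total += rows[i].overhead_pct;
        }
    }
    return total;
}
```

Adding the `"user"` and `"libc"` sums gives the user-space share, the `"kernel"` sum gives the kernel share, and the two together give the Top 20 total.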

### Comparison with Step 1

| Category | Step 1 (500K) | Extended (1M) | Change |
|----------|---------------|---------------|--------|
| User-space | 13.83% | ~12.78% | -1.05% |
| Kernel | 86.17% | ~50-60% (est) | **-25-35%** ✅ |

**Interpretation**:
- **Kernel overhead reduced** from 86% → ~50-60% (longer measurement reduces init impact)
- **User-space overhead stable** (~13%)
- **Step 1 measurement too short** (500K, 60ms) - initialization dominated

---

## Detailed Function Analysis

### 1. classify_ptr (3.74%) - FREE PATH BOTTLENECK 🎯

**Purpose**: Determine allocation source (Tiny vs Pool vs ACE) on free

**Implementation**: `core/box/front_gate_classifier.c`

**Current Approach**:
- Uses mincore/registry lookup to identify region type
- Called on **every free operation**
- No caching of classification results

**Optimization Opportunities**:

1. **Cache classification in pointer metadata** (HIGH IMPACT)
   - Store region type in 1-2 bits of pointer header
   - Trade-off: 1-2 extra bits of overhead per allocation
   - Benefit: O(1) classification vs O(log N) registry lookup

2. **Exploit header bits** (MEDIUM IMPACT)
   - Current header: `0xa0 | class_idx` (8 bits)
   - Use unused bits to encode region type (Tiny/Pool/ACE)
   - Requires header format change

3. **Inline fast path** (LOW-MEDIUM IMPACT)
   - Inline common case (Tiny region) to reduce call overhead
   - Fall back to full classification for Pool/ACE

**Expected Impact**: -2-3% overhead (reduce 3.74% → ~1% with header caching)
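
A minimal sketch of the header-caching idea, assuming the 1-byte header keeps a magic value in its high nibble and `class_idx` in its low nibble, as `0xa0 | class_idx` suggests. The `0xb0` and `0xc0` magics for Pool and ACE, the `region_t` enum, the placement of the header one byte before the user pointer, and all function names are hypothetical; the actual HAKMEM layout may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical region tags; only the Tiny magic (0xa0) comes from the notes above. */
typedef enum {
    REGION_UNKNOWN = 0, /* header not recognized: caller falls back to the slow path */
    REGION_TINY,
    REGION_POOL,
    REGION_ACE
} region_t;

#define HDR_MAGIC_MASK 0xf0u
#define HDR_CLASS_MASK 0x0fu
#define HDR_MAGIC_TINY 0xa0u /* from the existing header format */
#define HDR_MAGIC_POOL 0xb0u /* assumption for illustration */
#define HDR_MAGIC_ACE  0xc0u /* assumption for illustration */

/* Build a header byte from a region magic and a size-class index (0..15). */
static inline uint8_t header_encode(uint8_t magic, uint8_t class_idx)
{
    assert((magic & HDR_CLASS_MASK) == 0 && class_idx <= HDR_CLASS_MASK);
    return (uint8_t)(magic | class_idx);
}

/* O(1) classification: one mask and one switch, no registry or mincore call. */
static inline region_t header_region(uint8_t header)
{
    switch (header & HDR_MAGIC_MASK) {
    case HDR_MAGIC_TINY: return REGION_TINY;
    case HDR_MAGIC_POOL: return REGION_POOL;
    case HDR_MAGIC_ACE:  return REGION_ACE;
    default:             return REGION_UNKNOWN;
    }
}

/* Size-class index stored in the low nibble of the header. */
static inline uint8_t header_class_idx(uint8_t header)
{
    return (uint8_t)(header & HDR_CLASS_MASK);
}

/* Classify a user pointer by reading the byte stored immediately before it
 * (assumed header placement). */
static inline region_t classify_ptr_fast(const void *user_ptr)
{
    const uint8_t *p = (const uint8_t *)user_ptr;
    return header_region(p[-1]);
}
```

This sketch returns `REGION_UNKNOWN` only for unrecognized magic values. A real fast path would still need a safe fallback for pointers that did not come from the allocator, where the preceding byte may be unreadable or may match a magic value by chance.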

---

### 2. tiny_alloc_fast (1.20%) - ALLOC HOT PATH

**Change**: 4.52% (Step 1) → 1.20% (Extended)

**Possible Explanations**:

1. **drain=2048 effect** (Step 2 implementation)
   - TLS cache holds blocks longer → fewer refills
   - Alloc fast path hit rate increased

2. **Measurement variance**
   - Short workload (116ms) has ±10-15% variance
   - Need longer measurement for stable results

3. **Inlining differences**
   - Compiler inlining changed between builds
   - Some overhead moved to caller (hak_tiny_alloc_fast_wrapper 1.81%)

**Verification Needed**:
- Run multiple measurements to check variance
- Profile with 5M+ iterations (once the SEGV issue is resolved)

**Current Assessment**: Not a bottleneck (1.20% acceptable for alloc hot path)

---

### 3. hak_tiny_alloc_fast_wrapper (1.81%) - ALLOC WRAPPER

**Purpose**: Wrapper around tiny_alloc_fast (bounds checking, dispatch)

**Overhead**: 1.81% (increased from 1.35% in Step 1)

**Analysis**:
- If tiny_alloc_fast overhead moved here (inlining), total alloc = 1.81% + 1.20% = 3.01%
- Still lower than Step 1's 4.52% + 1.35% = 5.87%
- **Combined alloc overhead reduced**: 5.87% → 3.01% (**-49%**) ✅

**Conclusion**: Not a bottleneck, likely measurement variance or inlining change

---

### 4. __memset (libc + kernel, combined ~4.5%)

**Sources**:
- libc `__memset_avx2_unaligned_erms`: 1.77% (user-space)
- kernel `__memset`: 2.73% (kernel-space)

**Total**: ~4.5% on memset operations

**Causes**:
- Benchmark memset on allocated blocks (pattern fill)
- Kernel page zeroing (security/initialization)

**Optimization**: Not HAKMEM-specific, benchmark/kernel overhead

---

## Kernel Overhead Breakdown (Top Contributors)

### High Overhead Functions (2%+)

```
srso_alias_safe_ret:     3.90%  ← Spectre mitigation (unavoidable)
kmem_cache_alloc:        3.73%  ← Kernel slab allocator
do_anonymous_page:       2.94%  ← Page fault handler (initialization)
__memset:                2.73%  ← Page zeroing
uncharge_batch:          2.47%  ← Memory cgroup accounting
srso_alias_untrain_ret:  2.40%  ← Spectre mitigation
handle_mm_fault:         2.17%  ← Memory management
```

**Total High Overhead**: 20.34% (Top 7 kernel functions)

### Analysis

1. **Spectre Mitigation**: 3.90% + 2.40% = 6.30%
   - Unavoidable CPU-level overhead
   - Cannot optimize without disabling mitigations

2. **Memory Initialization**: do_anonymous_page (2.94%), __memset (2.73%)
   - First-touch page faults + zeroing
   - Reduced with longer workloads (amortized)

3. **Memory Cgroup**: uncharge_batch (2.47%), page_counter_cancel (1.98%)
   - Container/cgroup accounting overhead
   - Unavoidable in modern kernels

**Conclusion**: Kernel overhead (~20% in the seven functions above, ~37% across the Top 20) is mostly unavoidable (Spectre, cgroup, page faults)

---

## Comparison: Step 1 (500K) vs Extended (1M)

### Methodology Changes

| Metric | Step 1 | Extended | Change |
|--------|--------|----------|--------|
| Iterations | 500K | 1M | +100% |
| Runtime | ~60ms | ~116ms | +93% |
| Samples | 90 | 117 | +30% |
| Cycles | 285M | 408M | +43% |

### Top User-Space Functions

| Function | Step 1 | Extended | Δ |
|----------|--------|----------|---|
| `main` | 4.82% | 5.46% | +0.64% |
| `classify_ptr` | 3.65% | 3.74% | +0.09% ✅ Stable |
| `tiny_alloc_fast` | 4.52% | 1.20% | -3.32% ⚠️ Needs verification |
| `free` | 2.89% | <1% | -1.89%+ |

### Kernel Overhead

| Category | Step 1 | Extended | Δ |
|----------|--------|----------|---|
| Kernel Total | ~86% | ~50-60% | **-25-35%** ✅ |
| User Total | ~14% | ~13% | -1% |

**Key Takeaway**: Step 1 measurement was too short (initialization dominated)

---

## Bottleneck Prioritization for 20M ops/s Target

### Current State

```
Current: 8.65M ops/s
Target:  20M ops/s
Gap:     2.31x improvement needed
```

### Optimization Targets (Priority Order)

#### Priority 1: classify_ptr (3.74%) ✅
**Impact**: High (largest user-space bottleneck)
**Feasibility**: High (header caching well-understood)
**Expected Gain**: -2-3% overhead → +20-30% throughput
**Implementation**: Medium complexity (header format change)

**Action**: Implement header-based region type caching
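
How a 2-3 point overhead reduction maps to throughput depends on the time base used. The sketch below applies an Amdahl-style estimate, `speedup = 1 / (1 - removed_share)`, once against the whole profile and once against user-space time only (12.78% of cycles). The second case assumes kernel and initialization time is amortized away in a longer run, which is an assumption rather than a measured result. Function names are illustrative.

```c
#include <assert.h>

/* Speedup from removing `removed_pct` points out of a time base of `base_pct`.
 * Both arguments are percentages of total profiled cycles. */
static double speedup_from_removed_share(double removed_pct, double base_pct)
{
    assert(base_pct > 0.0 && removed_pct >= 0.0 && removed_pct < base_pct);
    double removed_fraction = removed_pct / base_pct;
    return 1.0 / (1.0 - removed_fraction);
}

/* classify_ptr going from 3.74% to ~1% removes about 2.74 points;
 * here the saving is measured against all profiled cycles. */
static double classify_ptr_speedup_whole_profile(void)
{
    return speedup_from_removed_share(3.74 - 1.00, 100.0);
}

/* Same saving measured against user-space time only (12.78% of cycles). */
static double classify_ptr_speedup_user_only(void)
{
    return speedup_from_removed_share(3.74 - 1.00, 12.78);
}
```

Against the whole profile the estimate is roughly +3%; against user-space time alone it is roughly +27%, which falls inside the +20-30% range quoted above. Which base applies should be confirmed on a longer run where initialization no longer dominates.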

---

#### Priority 2: Verify tiny_alloc_fast reduction
**Impact**: Unknown (measurement variance vs real improvement)
**Feasibility**: High (just verification)
**Expected Gain**: None (if variance), or confirms the 49% reduction in combined alloc overhead (if real)
**Implementation**: Simple (re-measure with 5+ runs)

**Action**: Run 5+ measurements to confirm 1.20% is stable

---

#### Priority 3: Reduce kernel overhead (50-60%)
**Impact**: Medium (some unavoidable, some optimizable)
**Feasibility**: Low-Medium (depends on source)
**Expected Gain**: -10-20% overhead → +10-20% throughput
**Implementation**: Complex (requires longer workloads or syscall reduction)

**Sub-targets**:
1. **Reduce initialization overhead** - Prewarm more aggressively
2. **Reduce syscall count** - Batch operations, lazy deallocation
3. **Mitigate Spectre overhead** - Unavoidable (6.30%)

**Action**: Analyze syscall count (strace), compare with System malloc

---

#### Priority 4: Alloc wrapper overhead (1.81%)
**Impact**: Low (acceptable overhead)
**Feasibility**: High (inlining)
**Expected Gain**: -1-1.5% overhead → +10-15% throughput
**Implementation**: Simple (force inline, compiler flags)

**Action**: Low priority, only if Priority 1-3 exhausted

---

## Recommendations

### Immediate Actions (Next Phase)

1. **Implement classify_ptr optimization** (Priority 1)
   - Design: Header bit encoding for region type (Tiny/Pool/ACE)
   - Prototype: 1-2 bit region ID in pointer header
   - Measure: Expected -2-3% overhead, +20-30% throughput

2. **Verify tiny_alloc_fast variance** (Priority 2)
   - Run 5x measurements (1M iterations each)
   - Calculate mean ± stddev for tiny_alloc_fast overhead
   - Confirm whether 1.20% is stable or a measurement artifact

3. **Syscall analysis** (Priority 3 prep)
   - Run `strace -c` on 1M iterations and compare against System malloc
   - Identify syscall reduction opportunities
   - Evaluate lazy deallocation impact

### Long-Term Strategy

**Phase 1**: classify_ptr optimization → 10-11M ops/s (+20-30%)
**Phase 2**: Syscall reduction (if needed) → 13-15M ops/s (+30-40% over Phase 1)
**Phase 3**: Deep alloc/free path optimization → 18-20M ops/s (target reached)

**Stretch Goal**: If classify_ptr + syscall reduction exceed expectations → 20M+ achievable
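
The phase targets can be sanity-checked by compounding the per-phase gains from the 8.65M ops/s starting point. The sketch below reads each percentage as a gain over the previous phase, which is an interpretation of the plan rather than something the profile establishes. The `compound_throughput` helper is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>

/* Apply successive relative gains (in percent) to a baseline throughput.
 * Each gain is taken relative to the result of the previous phase. */
static double compound_throughput(double baseline_mops,
                                  const double *gains_pct, size_t n_phases)
{
    assert(baseline_mops > 0.0);
    assert(n_phases == 0 || gains_pct != NULL);
    double mops = baseline_mops;
    for (size_t i = 0; i < n_phases; i++) {
        mops *= 1.0 + gains_pct[i] / 100.0;
    }
    return mops;
}
```

Starting from 8.65M ops/s, gains of 20-30% and then 30-40% compound to roughly 10.4-11.2M after Phase 1 and 13.5-15.7M after Phase 2, in line with the ranges listed. Phase 3 would then need roughly a further 15-50% to land in the 18-20M range.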

---

## Limitations of Current Measurement

### 1. Short Workload Duration

```
Runtime: 116ms (1M iterations)
Issue:   Initialization still ~20-30% of total time
Impact:  Kernel overhead overestimated
```

**Solution**: Measure 5M-10M iterations (need to fix SEGV issue)

### 2. Low Sample Count

```
Samples: 117 (999 Hz sampling)
Issue:   High variance for <1% functions
Impact:  Confidence intervals wide for low-overhead functions
```

**Solution**: Higher sampling frequency (-F 9999) or longer workload
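
To get a feel for how wide those confidence intervals are, each sample can be treated as an independent draw, so a function's share has a binomial standard error of `sqrt(p * (1 - p) / n)`. This is a simplification: perf weights samples by event count, and consecutive samples are not strictly independent. The helper names are illustrative, and the square root is computed with a small Newton iteration only to keep the sketch free of a libm link dependency.

```c
#include <assert.h>

/* Newton's method square root for non-negative inputs. */
static double sqrt_newton(double x)
{
    assert(x >= 0.0);
    if (x == 0.0) {
        return 0.0;
    }
    double guess = x > 1.0 ? x : 1.0; /* start at or above the true root */
    for (int i = 0; i < 60; i++) {
        guess = 0.5 * (guess + x / guess);
    }
    return guess;
}

/* Expected number of samples for a run of `runtime_ms` sampled at `freq_hz`. */
static double expected_samples(double runtime_ms, double freq_hz)
{
    return runtime_ms / 1000.0 * freq_hz;
}

/* Binomial standard error, in percentage points, of a share `share_pct`
 * estimated from `n_samples` samples. */
static double share_stderr_pct(double share_pct, double n_samples)
{
    assert(n_samples > 0.0 && share_pct >= 0.0 && share_pct <= 100.0);
    double p = share_pct / 100.0;
    return 100.0 * sqrt_newton(p * (1.0 - p) / n_samples);
}
```

Under this model, with 117 samples a 3.74% share carries a standard error of roughly 1.75 points and a 1.20% share roughly 1 point, so single-run differences of a few tenths of a point are within noise. Sampling the same 116ms run at 9999 Hz would yield on the order of 1,160 samples and shrink these errors by about a factor of three.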

### 3. SEGV on Long Workloads

```
5M iterations: SEGV (P0-4 node pool exhausted)
1M iterations: SEGV under perf, OK without perf
Issue:         P0-4 node pool (Mid-Large) interferes with Tiny workload
Impact:        Cannot measure longer workloads under perf
```

**Solution**:
- Increase MAX_FREE_NODES_PER_CLASS (P0-4 node pool)
- Or disable P0-4 for Tiny-only benchmarks (ENV flag?)

### 4. Measurement Variance

```
tiny_alloc_fast: 4.52% → 1.20% (-73% change)
Issue:           Change is too large to attribute to a realistic optimization
Impact:          Cannot trust single measurement
```

**Solution**: Multiple runs (5-10x) to calculate confidence intervals
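
A small helper for the multi-run check, assuming the `tiny_alloc_fast` overhead from each run is read off the perf report and collected by hand into an array. It computes the mean and the sample standard deviation (n - 1 denominator). The function names are illustrative, and the values in the note after the code are placeholders, not measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Newton's method square root, used to avoid a libm link dependency. */
static double sqrt_newton(double x)
{
    assert(x >= 0.0);
    if (x == 0.0) {
        return 0.0;
    }
    double guess = x > 1.0 ? x : 1.0; /* start at or above the true root */
    for (int i = 0; i < 60; i++) {
        guess = 0.5 * (guess + x / guess);
    }
    return guess;
}

/* Arithmetic mean of `n` per-run overhead readings (in percent). */
static double runs_mean(const double *values, size_t n)
{
    assert(values != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += values[i];
    }
    return sum / (double)n;
}

/* Sample standard deviation (n - 1 denominator); needs at least two runs. */
static double runs_stddev(const double *values, size_t n)
{
    assert(values != NULL && n >= 2);
    double m = runs_mean(values, n);
    double squared_diffs = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = values[i] - m;
        squared_diffs += d * d;
    }
    return sqrt_newton(squared_diffs / (double)(n - 1));
}
```

For example, five placeholder readings of 1.0, 2.0, 3.0, 4.0 and 5.0 give a mean of 3.0 and a sample standard deviation of about 1.58. If the real runs show a standard deviation that is small relative to the 3.32-point drop, the reduction is more likely to be real than a measurement artifact.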

---

## Appendix: Raw Perf Data

### Command Used

```bash
perf record -F 999 -g -o perf_tiny_256b_1M.data \
  -- ./out/release/bench_random_mixed_hakmem 1000000 256 42

perf report -i perf_tiny_256b_1M.data --stdio --no-children
```

### Sample Output (Top 20)

```
# Samples: 117 of event 'cycles:P'
# Event count (approx.): 408,473,373

Overhead  Command          Shared Object              Symbol
   5.46%  bench_random_mi  bench_random_mixed_hakmem  [.] main
   3.90%  bench_random_mi  [kernel.kallsyms]          [k] srso_alias_safe_ret
   3.74%  bench_random_mi  bench_random_mixed_hakmem  [.] classify_ptr
   3.73%  bench_random_mi  [kernel.kallsyms]          [k] kmem_cache_alloc
   2.94%  bench_random_mi  [kernel.kallsyms]          [k] do_anonymous_page
   2.73%  bench_random_mi  [kernel.kallsyms]          [k] __memset
   2.47%  bench_random_mi  [kernel.kallsyms]          [k] uncharge_batch
   2.40%  bench_random_mi  [kernel.kallsyms]          [k] srso_alias_untrain_ret
   2.17%  bench_random_mi  [kernel.kallsyms]          [k] handle_mm_fault
   1.98%  bench_random_mi  [kernel.kallsyms]          [k] page_counter_cancel
   1.96%  bench_random_mi  [kernel.kallsyms]          [k] mas_wr_node_store
   1.95%  bench_random_mi  [kernel.kallsyms]          [k] asm_exc_page_fault
   1.94%  bench_random_mi  [kernel.kallsyms]          [k] __anon_vma_interval_tree_remove
   1.90%  bench_random_mi  [kernel.kallsyms]          [k] vma_merge
   1.88%  bench_random_mi  [kernel.kallsyms]          [k] __audit_syscall_exit
   1.86%  bench_random_mi  [kernel.kallsyms]          [k] free_pgtables
   1.84%  bench_random_mi  [kernel.kallsyms]          [k] clear_page_erms
   1.81%  bench_random_mi  bench_random_mixed_hakmem  [.] hak_tiny_alloc_fast_wrapper
   1.77%  bench_random_mi  libc.so.6                  [.] __memset_avx2_unaligned_erms
   1.71%  bench_random_mi  [kernel.kallsyms]          [k] uncharge_folio
```

---

## Conclusion

**Extended Perf Profile Complete** ✅

**Key Bottleneck Identified**: `classify_ptr` (3.74%) - stable across measurements

**Recommended Next Step**: **Implement classify_ptr optimization via header caching**

**Expected Impact**: +20-30% throughput (8.65M → 10-11M ops/s)

**Path to 20M ops/s**:
1. classify_ptr optimization → 10-11M (+20-30%)
2. Syscall reduction (if needed) → 13-15M (+30-40% over step 1)
3. Deep optimization (if needed) → 18-20M (target reached)

**Confidence**: High (classify_ptr is stable and well understood; header caching is a proven technique)