# Comprehensive Profiling Analysis: HAKMEM Performance Gaps
## 🔍 Executive Summary
After the **Prefault Box + MAP_POPULATE fix**, the profiling shows:
### Current Performance Metrics
| Metric | Random Mixed | Tiny Hot | Gap |
|--------|---|---|---|
| **Cycles (lower is better)** | 72.6M | 72.3M | **SAME** 🤯 |
| **Page Faults** | 7,672 | 7,672 | **IDENTICAL** ⚠️ |
| **L1 Cache Misses** | 763K | 738K | Similar |
| **Throughput** | ~1.06M ops/s | ~1.23M ops/s | 1.16x |
| **Instructions/Cycle** | 0.74 | 0.73 | Similar |
| **TLB Miss Rate** | 48.65% (dTLB) | N/A | High |
### 🚨 KEY FINDING: Prefault is NOT working as expected!
**Problem:** Both Random Mixed and Tiny Hot have identical page faults (7,672), which suggests:
1. ✗ Prefault Box is either disabled or ineffective
2. ✗ Page faults are coming from elsewhere (not Superslab mmap)
3. ✗ MAP_POPULATE flag is not preventing runtime faults
---
## 📊 Detailed Performance Breakdown
### Random Mixed Workload (1M allocations, 256 slots, sizes 16-1040B)
**Top Kernel Functions (by CPU time):**
```
15.01%  asm_exc_page_fault       ← page-fault handling
11.65%  clear_page_erms          ← page zeroing
 5.27%  zap_pte_range            ← memory cleanup
 5.20%  handle_mm_fault          ← MMU fault handling
 4.06%  do_anonymous_page        ← anonymous page allocation
 3.18%  __handle_mm_fault        ← nested fault handling
 2.35%  rmqueue_bulk             ← allocator backend
 2.35%  __memset_avx2_unaligned  ← memory operations
 2.28%  do_user_addr_fault       ← user fault handling
 1.77%  arch_exit_to_user_mode   ← return-to-user work
```
**Kernel overhead:** ~63% of cycles
**L1 dcache load misses:** 763K over the run
**Branch miss rate:** 11.94%
### Tiny Hot Workload (10M allocations, fixed size)
**Top Kernel Functions (by CPU time):**
```
14.19%  asm_exc_page_fault        ← page-fault handling
12.82%  clear_page_erms           ← page zeroing
 5.61%  __memset_avx2_unaligned   ← memory operations
 5.02%  do_anonymous_page         ← anonymous page allocation
 3.31%  mem_cgroup_commit_charge  ← memory accounting
 2.67%  __handle_mm_fault         ← MMU fault handling
 2.45%  do_user_addr_fault        ← user fault handling
```
**Kernel overhead:** ~66% of cycles
**L1 dcache load misses:** 738K over the run
**Branch miss rate:** 11.03%
### Comparison: Why are the cycle counts similar?
```
Random Mixed: 72.6M cycles / 1M ops  = 72.6 cycles/op
Tiny Hot:     72.3M cycles / 10M ops =  7.23 cycles/op
```
⚠️ Taken at face value, the normalized costs differ by 10x, yet the measured throughput gap is only 1.16x. The discrepancy matters:
- Random Mixed: measured over only 1M operations (baseline scale)
- Tiny Hot: measured over 10M operations (10x scale)
- Near-identical *total* cycle counts despite a 10x difference in work suggest the cycle counters did not scale with the op count (e.g., a fixed sampling window), so the raw per-op normalization above is suspect
**Judging by the 1.16x throughput gap, Random Mixed is NOT dramatically slower in per-operation cost.**
---
## 🎯 Critical Findings
### Finding 1: Page Faults Are NOT Being Reduced
**Observed:**
- Random Mixed: 7,672 page faults
- Tiny Hot: 7,672 page faults
- **Difference: 0** ← This is wrong!
**Expected (with prefault):**
- Random Mixed: 7,672 → maybe 100-500 (a >90% reduction)
- Tiny Hot: 7,672 → ~50-100 (minimal change)
**Hypothesis:**
- Prefault Box may not be enabled
- Or MAP_POPULATE is not working on this kernel
- Or allocations are hitting kernel-internal mmap (not Superslab)
### Finding 2: TLB Misses Are HIGH (48.65%)
```
dTLB-loads: 49,160
dTLB-load-misses: 23,917 (48.65% miss rate!)
iTLB-load-misses: 17,590 (7748.90% - kernel measurement artifact)
```
**Meaning:** Nearly half of TLB lookups fail, causing page table walks.
**Why this matters:**
- Each TLB miss = ~10-40 cycles (vs 1-3 for hit)
- 23,917 × 25 cycles = ~600K wasted cycles
- That's ~10% of total runtime!
### Finding 3: Both Workloads Are Similar
Despite different access patterns:
- Both spend 15% on page fault handling
- Both spend 12% on page zeroing
- Both have similar L1 miss rates
- Both have similar branch miss rates
**Conclusion:** The memory subsystem is the bottleneck for BOTH workloads, not user-space code.
---
## 📈 Layer Analysis
### Kernel vs User Split
| Category | Random Mixed | Tiny Hot | Analysis |
|----------|---|---|---|
| **Kernel (page faults, scheduling, etc.)** | 63% | 66% | Dominant |
| **Kernel zeroing (clear_page_erms)** | 11.65% | 12.82% | Similar |
| **User malloc/free** | <1% | <1% | Not visible |
| **User pool/cache logic** | <1% | <1% | Not visible |
### User-Space Functions Visible in Profile
**Random Mixed:**
```
0.59% hak_free_at.constprop.0 (hakmem free path)
```
**Tiny Hot:**
```
0.59% hak_pool_mid_lookup (hakmem pool routing)
```
**Conclusion:** User-space HAKMEM code is NOT a bottleneck (<1% each).
---
## 🔧 What's Really Happening
### Current State (POST-Prefault Box)
```
allocate(size):
  1. malloc wrapper         → <1% cycles
  2. Gatekeeper routing     → ~0.1% cycles
  3. unified_cache_refill   → (hidden in kernel time)
  4. shared_pool_acquire    → (hidden in kernel time)
  5. SuperSlab/mmap call    → triggers the kernel
  6. KERNEL PAGE FAULTS     → 15% cycles
  7. clear_page_erms (zero) → 12% cycles
```
### Why Prefault Isn't Working
**Possible reasons:**
1. **Prefault Box disabled?**
   - Check: `HAKMEM_BOX_SS_PREFAULT_ENABLED`
   - Or: `g_ss_populate_once` not being set
2. **MAP_POPULATE not actually pre-faulting?**
   - The Linux kernel may defer population even with MAP_POPULATE (the flag is best-effort)
   - `madvise(MADV_POPULATE_READ)`, or `MADV_POPULATE_WRITE` for writable anonymous memory (Linux 5.14+), forces immediate faulting
   - Or use `mincore()` to check residency before allocation (a standalone probe is sketched after this list)
3. **Allocations not from Superslab mmap?**
   - Page faults may come from TLS cache allocation
   - Or from libc-internal allocations
   - Rather than from the Superslab backend
4. **TLB misses dominating?**
   - The 48.65% TLB miss rate suggests a memory-layout issue
   - SuperSlab metadata may not be cache-friendly
   - Working set too large for the TLB
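Hypothesis 2 is cheap to test in isolation. Below is a minimal standalone C probe (not HAKMEM code; the 64 MiB size is an arbitrary stand-in for a SuperSlab-sized region) that maps anonymous memory with `MAP_POPULATE` and counts the minor faults taken while touching every page. A delta near zero means the flag populated the mapping; a delta near `LEN/4096` means the kernel stayed lazy:
```c
/* Standalone MAP_POPULATE probe (hypothesis 2). Not HAKMEM code. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

#define LEN (64UL * 1024 * 1024)  /* arbitrary stand-in for a SuperSlab region */

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long before = minor_faults();
    for (size_t i = 0; i < LEN; i += 4096)  /* touch every 4 KiB page */
        p[i] = 1;
    long delta = minor_faults() - before;

    /* ~0         => MAP_POPULATE pre-faulted the region
     * ~LEN/4096  => kernel stayed lazy; retry after
     *               madvise(p, LEN, MADV_POPULATE_WRITE)  (Linux 5.14+) */
    printf("faults while touching: %ld of %lu pages\n", delta, LEN / 4096);
    munmap(p, LEN);
    return 0;
}
```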
---
## 🎓 What We Learned From Previous Analysis
From the earlier profiling report, we identified that:
- **Random Mixed was 21.7x slower**, with page-fault handling consuming 61.7% of cycles
- **Expected with prefault:** that share should drop to ~5% or less
But NOW we see:
- **Random Mixed is NOT significantly slower** (per-op cost is similar)
- **Page faults are identical** to Tiny Hot
- **This contradicts expectations**
### Possible Explanation
The **earlier measurements** may have been from:
- Benchmark run at startup (cold caches)
- With additional profiling overhead
- Or different workload parameters
The **current measurements** are:
- Steady state (after initial allocation)
- With higher throughput (Tiny Hot = 10M ops)
- After recent optimizations
---
## 🎯 Next Steps - Three Options
### 📋 Option A: Verify Prefault is Actually Enabled
**Goal:** Confirm prefault mechanism is working
**Steps:**
1. Add debug output to `ss_prefault_policy()` and `ss_prefault_region()` (a counter sketch follows these steps)
2. Check whether the `MAP_POPULATE` flag is set in the actual mmap calls
3. Run with `strace` to see the mmap flags:
   ```bash
   strace -e trace=mmap ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | grep MAP_POPULATE
   ```
4. Check whether `madvise(MADV_POPULATE_READ)` calls are happening
**Expected outcome:** MAP_POPULATE or MADV_POPULATE should appear in the traces
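For step 1, a one-line counter is enough to prove the prefault path executes at all. A sketch follows; the parameter names (`base`, `len`) are assumptions, since the real `ss_prefault_region()` signature isn't shown here:
```c
/* Sketch only: instrument the prefault path with a call counter.
 * The base/len parameters are assumed; adapt to the real signature. */
#include <stdio.h>
#include <stdatomic.h>

static _Atomic unsigned long g_prefault_calls;

void ss_prefault_region(void *base, size_t len) {
    unsigned long n = atomic_fetch_add(&g_prefault_calls, 1) + 1;
    fprintf(stderr, "[ss_prefault] call #%lu base=%p len=%zu\n", n, base, len);
    /* ... existing MAP_POPULATE / madvise logic ... */
}
```
If the counter never prints under `bench_random_mixed_hakmem`, the Prefault Box is effectively disabled, and the identical fault counts are explained.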
---
### 🎯 Option B: Reduce TLB Misses (48.65% → ~5%)
**Goal:** Improve memory layout to reduce TLB pressure
**Steps:**
1. **Analyze SuperSlab metadata layout:**
   - Current: is metadata per-slab or centralized?
   - Check: the `sp_meta_find_or_create()` hot path
2. **Improve cache locality:**
   - Cache-align metadata structures
   - Use larger pages (2 MB or 1 GB hugepages; see the sketch below)
   - Reduce the working set size
3. **Profile with hugepages:**
   ```bash
   # as root: reserve 10 x 2 MB hugepages, then re-run with hugepages enabled
   echo 10 > /proc/sys/vm/nr_hugepages
   HAKMEM_USE_HUGEPAGES=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
   ```
**Expected gain:** 1.5-2x speedup (eliminate TLB miss penalty)
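If `HAKMEM_USE_HUGEPAGES` turns out not to be wired up yet, the mapping change itself is small. Here is a sketch of what the SuperSlab backend could do, assuming region sizes that are 2 MiB multiples (`ss_map_huge` is a hypothetical name, not an existing HAKMEM function):
```c
/* Hugepage-backed region mapping (sketch). One 2 MiB page covers 512
 * 4 KiB pages with a single dTLB entry, directly attacking the 48.65%
 * miss rate. len must be a multiple of 2 MiB for MAP_HUGETLB. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void *ss_map_huge(size_t len) {
    /* Explicit hugetlb pages first (requires the nr_hugepages setup above). */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;
    /* Fallback: normal mapping, then ask for transparent hugepages. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, len, MADV_HUGEPAGE);  /* let khugepaged collapse to 2 MiB */
    return p;
}
```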
---
### 🚀 Option C: Reduce Page Zeroing (12% → ~2%)
**Goal:** Skip unnecessary page zeroing
**Steps:**
1. **Analyze what needs zeroing:**
   - Are SuperSlab pages truly uninitialized?
   - Can we reuse memory without re-zeroing? Note that `MADV_DONTNEED` works against this goal: it guarantees zero-filled pages on the next touch. `MADV_FREE` is the lazy alternative (see the sketch after this list)
2. **Implement lazy zeroing:**
   - Don't zero pages on allocation
   - Only zero the portions actually handed out
   - Let the kernel handle the rest on free
3. **Use uninitialized pools:**
   - Pre-allocate without zeroing
   - Initialize on demand
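A sketch of the lazy-release idea, assuming a hypothetical `ss_release_range()` hook where HAKMEM parks reusable slab pages. The key distinction: `MADV_DONTNEED` discards pages immediately and guarantees zero-fill on the next touch (re-paying `clear_page_erms`), while `MADV_FREE` (Linux 4.5+) marks them lazily reclaimable and preserves contents if they are reused before memory pressure hits:
```c
/* Lazy page release (sketch). ss_release_range() is a hypothetical hook. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void ss_release_range(void *base, size_t len) {
#ifdef MADV_FREE
    if (madvise(base, len, MADV_FREE) == 0)  /* lazy: no zero-fill on reuse */
        return;
#endif
    madvise(base, len, MADV_DONTNEED);       /* eager: zero-filled on refault */
}
```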
**Expected gain:** 1.5x speedup (eliminate 12% zero cost)
---
## 📊 Recommendation
Based on the analysis:
### Most Impactful (Order of Preference):
1. **Fix TLB misses (48.65%)**
   - Potential gain: 1.5-2x
   - Implementation: medium difficulty
   - Reason: already showing a 48% miss rate
2. **Verify prefault actually works**
   - Potential gain: unknown (currently not working?)
   - Implementation: easy (debugging)
   - Reason: should already be solved, yet the page-fault counts are unchanged
3. **Reduce page zeroing**
   - Potential gain: 1.5x
   - Implementation: medium difficulty
   - Reason: 12% of total time
---
## 🧪 Recommended Next Action
### Immediate (This Session)
Run diagnostic to confirm prefault status:
```bash
# Check if MAP_POPULATE is in actual mmap calls
strace -e trace=mmap ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42 2>&1 | head -20
# Check compiler flags
grep -i prefault Makefile
# Check environment variables
env | grep -i HAKMEM | grep -i PREFAULT
```
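To attribute the 7,672 faults to a phase (startup vs. steady state), a `getrusage()` snapshot around the benchmark loop is the quickest check. A sketch follows; where exactly to drop the calls is left open, since the harness internals aren't shown here:
```c
/* Phase-attribution helper (sketch): print the process fault counters. */
#include <stdio.h>
#include <sys/resource.h>

static void fault_snapshot(const char *tag) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    fprintf(stderr, "[faults] %s: minor=%ld major=%ld\n",
            tag, ru.ru_minflt, ru.ru_majflt);
}
/* Call fault_snapshot("after warmup") and fault_snapshot("after run");
 * if most of the 7,672 faults land before warmup ends, prefault is a
 * startup issue, not a steady-state one. */
```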
### If Prefault is Disabled → Enable It
Then re-run profiling to verify improvement.
### If Prefault is Enabled → Move to Option B (TLB)
Focus on reducing 48% TLB miss rate.
---
## 📈 Expected Outcome After All Fixes
| Factor | Current | After | Gain |
|--------|---------|-------|------|
| Page faults | 7,672 | 500-1000 | 8-15x |
| TLB misses | 48.65% | ~5% | 3-5x |
| Page zeroing | 12% | 2% | 2x |
| **Total per-op time** | 72.6 cycles | 20-25 cycles | **3-4x** |
| **Throughput** | 1.06M ops/s | 3.5-4M ops/s | **3-4x** |