# HAKMEM Performance Profiling: Answers to Key Questions

**Date:** 2025-12-04

**Benchmarks:** bench_random_mixed_hakmem vs bench_tiny_hot_hakmem

**Test:** 1M iterations, random sizes 16-1040B vs hot tiny allocations

---
## Quick Answers to Your Questions

### Q1: What % of cycles are in malloc/free wrappers themselves?

**Answer:** **3.7%** (random_mixed), **46%** (tiny_hot)

- **random_mixed:** malloc 1.05% + free 2.68% = **3.7% total**
- **tiny_hot:** malloc 2.81% + free 43.1% = **46% total**

The dramatic difference is NOT because the wrappers are slower in tiny_hot. In random_mixed the wrappers are **dwarfed by 61.7% kernel page-fault overhead**; in tiny_hot there is almost no kernel overhead (0.5% page faults), so the wrappers dominate the profile.

**Verdict:** Wrapper overhead is **acceptable and consistent** across both workloads. Not a bottleneck.

---

### Q2: Is unified_cache_refill being called frequently? (High hit rate or low?)

**Answer:** **LOW hit rate** in random_mixed, **HIGH hit rate** in tiny_hot

- **random_mixed:** unified_cache_refill appears at **2.3% of cycles** (#4 hotspot)
  - Called frequently due to varied sizes (16-1040B)
  - Triggers expensive mmap → page faults
  - **Cache MISS ratio is HIGH**
- **tiny_hot:** unified_cache_refill is **NOT in the top 10 functions** (<0.1%)
  - Rarely called due to the predictable size
  - **Cache HIT ratio is HIGH** (>95% estimated)

**Verdict:** The Unified Cache needs **larger capacity** and **better refill batching** for random_mixed workloads (see Priority 3 below).

---

### Q3: Is shared_pool_acquire being called? (If yes, how often?)

**Answer:** **YES - frequently in random_mixed** (3.3% of cycles, #2 user hotspot)

- **random_mixed:** shared_pool_acquire_slab.part.0 = **3.3%** of cycles
  - Second-highest user-space function (after the wrappers)
  - Called when the Unified Cache is empty and needs a backend slab
  - Involves **mutex locks** (pthread_mutex_lock visible in the annotated assembly)
  - Triggers **SuperSlab mmap** → 512 page faults per 2MB slab
- **tiny_hot:** shared_pool functions **not visible** (<0.1%)
  - Cache hits prevent backend calls

**Verdict:** shared_pool_acquire is a **MAJOR bottleneck** in random_mixed. It needs:

1. A lock-free fast path (atomic CAS; see Priority 2 below)
2. TLS slab caching (see the sketch after this list)
3. Batch acquisition (2-4 slabs at once)
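
A minimal sketch of item 2 (TLS slab caching). The `Slab` type and the `shared_pool_acquire_slab` / `shared_pool_release_slab` signatures below are assumptions for illustration, not HAKMEM's actual API:

```c
#include <stddef.h>

typedef struct Slab Slab;
Slab* shared_pool_acquire_slab(int size_class);           // assumed backend call
void  shared_pool_release_slab(int size_class, Slab* s);  // assumed backend call

// A tiny per-thread stack of slabs per size class: most acquisitions then
// skip the shared pool (and its mutex) entirely.
#define TLS_SLAB_CACHE_DEPTH 2
#define NUM_CLASSES 8

static __thread Slab* tls_slabs[NUM_CLASSES][TLS_SLAB_CACHE_DEPTH];
static __thread int   tls_count[NUM_CLASSES];

static Slab* tls_acquire_slab(int size_class) {
    if (tls_count[size_class] > 0)                // fast path: no lock, no atomics
        return tls_slabs[size_class][--tls_count[size_class]];
    return shared_pool_acquire_slab(size_class);  // slow path: shared pool
}

static void tls_release_slab(int size_class, Slab* s) {
    if (tls_count[size_class] < TLS_SLAB_CACHE_DEPTH) {
        tls_slabs[size_class][tls_count[size_class]++] = s;  // keep it local
        return;
    }
    shared_pool_release_slab(size_class, s);      // cache full: return to pool
}
```

The depth can stay small (1-2 slabs per class); the goal is only to absorb the acquire/release ping-pong that currently hits the mutex.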

---

### Q4: Is registry lookup (hak_super_lookup) still visible in release build?

**Answer:** **NO** - registry lookup is NOT visible among the top functions

- **random_mixed:** hak_super_register visible at **0.05%** (negligible)
- **tiny_hot:** no registry functions in the profile

The registry optimization (mincore elimination) from Phase 1 **successfully removed registry overhead** from the hot path.

**Verdict:** The registry is **not a bottleneck**. The optimization was successful.

---

### Q5: Where are the 22x slowdown cycles actually spent?

**Answer:** **Kernel page faults (61.7%)** + **user backend (5.6%)** + **other kernel (22%)**

**Complete breakdown (random_mixed vs tiny_hot):**

```
random_mixed (4.1M ops/s):
├─ Kernel Page Faults:     61.7%   ← PRIMARY CAUSE (16x slowdown)
├─ Other Kernel Overhead:  22.0%   ← Secondary cause (memcg, rcu, scheduler)
├─ Shared Pool Backend:     3.3%   ← #1 user hotspot
├─ Malloc/Free Wrappers:    3.7%   ← #2 user hotspot
├─ Unified Cache Refill:    2.3%   ← #3 user hotspot (triggers page faults)
└─ Other HAKMEM code:       7.0%

tiny_hot (89M ops/s):
├─ Free Path:              43.1%   ← Safe free logic (expected)
├─ Kernel Overhead:        30.0%   ← Scheduler timers only (unavoidable)
├─ Gatekeeper/Routing:      8.1%   ← Pool lookup
├─ ACE Layer:               4.9%   ← Adaptive control
├─ Malloc Wrapper:          2.8%
└─ Other HAKMEM code:      11.1%
```

**Root Cause Chain:**

1. Random sizes (16-1040B) → Unified Cache misses
2. Cache misses → unified_cache_refill (2.3%)
3. Refill → shared_pool_acquire (3.3%)
4. Pool acquire → SuperSlab mmap (2MB chunks)
5. mmap → **512 page faults per slab** (2MB / 4KB pages = 512; 61.7% of cycles!)
6. Page faults → clear_page_erms (6.9% - zeroing 4KB pages)

**Verdict:** The 22x gap is **NOT due to HAKMEM code inefficiency**. It is due to **kernel overhead from on-demand memory allocation.** The check below reproduces step 5 in isolation.
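
The fault count in step 5 is simple arithmetic: a 2MB slab spans 2MB / 4KB = 512 pages, and each page faults on first touch. The following standalone C program (independent of HAKMEM, using only POSIX mmap/getrusage) can confirm that behavior on a given kernel:

```c
// Verify that first-touching a fresh 2MB anonymous mapping costs ~512
// minor (soft) page faults, one per 4KB page, as the root-cause chain claims.
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    size_t len = 2 * 1024 * 1024;
    char* p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    long before = minor_faults();
    for (size_t i = 0; i < len; i += 4096)
        p[i] = 1;                  // first WRITE faults each page in
    long after = minor_faults();

    printf("faults for 2MB first touch: %ld (expect ~512)\n", after - before);
    return 0;
}
```
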
---
## Summary Table: Layer Breakdown

| Layer | Random Mixed | Tiny Hot | Bottleneck? |
|-------|--------------|----------|-------------|
| **Kernel Page Faults** | 61.7% | 0.5% | **YES - PRIMARY** |
| **Other Kernel** | 22.0% | 29.5% | Secondary |
| **Shared Pool** | 3.3% | <0.1% | **YES** |
| **Wrappers** | 3.7% | 46.0% | No (acceptable) |
| **Unified Cache** | 2.3% | <0.1% | **YES** |
| **Gatekeeper** | 0.7% | 8.1% | Minor |
| **Tiny/SuperSlab** | 0.3% | <0.1% | No |
| **Other HAKMEM** | 7.0% | 16.0% | No |

---

## Top 5-10 Functions by CPU Time

### Random Mixed (Top 10)

| Rank | Function | %Cycles | Layer | Path | Notes |
|------|----------|---------|-------|------|-------|
| 1 | **Kernel Page Faults** | 61.7% | Kernel | Cold | **PRIMARY BOTTLENECK** |
| 2 | **shared_pool_acquire_slab** | 3.3% | Shared Pool | Cold | #1 user hotspot, mutex locks |
| 3 | **free()** | 2.7% | Wrapper | Hot | Entry point, acceptable |
| 4 | **unified_cache_refill** | 2.3% | Unified Cache | Cold | Triggers mmap → page faults |
| 5 | **malloc()** | 1.1% | Wrapper | Hot | Entry point, acceptable |
| 6 | hak_pool_mid_lookup | 0.5% | Gatekeeper | Hot | Pool routing |
| 7 | sp_meta_find_or_create | 0.5% | Metadata | Cold | Metadata management |
| 8 | superslab_allocate | 0.3% | SuperSlab | Cold | Backend allocation |
| 9 | hak_free_at | 0.2% | Free Logic | Hot | Free routing |
| 10 | hak_pool_free | 0.2% | Pool Free | Hot | Pool release |

**Cache Miss Info:**

- Instructions/cycle: not available (IPC column empty in perf)
- Cache misses: 5920K cache-misses / 8343K cycles ≈ **0.71 misses per cycle** (note: a true miss *rate* would need cache-references as the denominator, which was not captured)
- Branch misses: 6860K branch-misses / 8343K cycles ≈ **0.82 misses per cycle** (same caveat: per cycle, not per branch)

**The high per-cycle miss counts suggest:**

1. Random allocation sizes → poor cache locality
2. Varied control flow → branch mispredictions
3. Page faults → TLB misses

---

### Tiny Hot (Top 10)

| Rank | Function | %Cycles | Layer | Path | Notes |
|------|----------|---------|-------|------|-------|
| 1 | **free.part.0** | 24.9% | Free Wrapper | Hot | Part of safe free |
| 2 | **hak_free_at** | 18.3% | Free Logic | Hot | Ownership checks |
| 3 | **hak_pool_mid_lookup** | 8.1% | Gatekeeper | Hot | Could optimize (inline) |
| 4 | hkm_ace_alloc | 4.9% | ACE Layer | Hot | Adaptive control |
| 5 | malloc() | 2.8% | Wrapper | Hot | Entry point |
| 6 | main() | 2.4% | Benchmark | N/A | Test harness overhead |
| 7 | hak_bigcache_try_get | 1.5% | BigCache | Hot | L2 cache |
| 8 | hak_elo_get_threshold | 0.9% | Strategy | Hot | ELO strategy selection |
| 9+ | Kernel (timers) | 30.0% | Kernel | N/A | Unavoidable timer interrupts |

**Cache Miss Info:**

- Cache misses: 7195K cache-misses / 12329K cycles ≈ **0.58 misses per cycle**
- Branch misses: 11215K branch-misses / 12329K cycles ≈ **0.91 misses per cycle**

Even the "hot" path shows a high branch-miss count, reflecting its complex control flow.

---

## Unexpected Bottlenecks Flagged

### 1. **Kernel Page Faults (61.7%)** - UNEXPECTED SEVERITY

**Expected:** some page-fault overhead

**Actual:** dominates the entire profile (61.7% of cycles!)

**Why unexpected:**

- Allocators typically pre-allocate large chunks
- Modern allocators use madvise/hugepages to reduce faults (see the sketch below)
- 512 faults per 2MB slab is excessive

**Fix:** Pre-fault SuperSlabs at startup (Priority 1)
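
The madvise/hugepages point suggests an alternative to touching 512 pages individually: back each 2MB slab with a transparent huge page so the whole slab faults in at once. A minimal sketch, assuming the slab can be 2MB-aligned (the function name is illustrative, not HAKMEM's API):

```c
#include <stddef.h>
#include <sys/mman.h>

// Reserve a 2MB slab and ask the kernel to back it with a huge page.
// MADV_HUGEPAGE is advisory: without 2MB alignment the kernel may still
// fall back to 4KB pages (and 512 faults), so a production version should
// over-allocate and align, or use MAP_HUGETLB with reserved hugepages.
void* superslab_reserve_2mb_thp(void) {
    size_t len = 2 * 1024 * 1024;
    void* p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    madvise(p, len, MADV_HUGEPAGE);  // one fault can then map the whole slab
    return p;
}
```
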
---

### 2. **Shared Pool Mutex Lock Contention (3.3%)** - UNEXPECTED

**Expected:** a lock-free or low-contention pool

**Actual:** pthread_mutex_lock visible in the annotated assembly, 3.3% overhead

**Why unexpected:**

- Modern allocators use TLS to avoid locking
- The pool should be per-thread or use atomic operations

**Fix:** Lock-free fast path with atomic CAS (Priority 2)

---

### 3. **High Unified Cache Miss Rate** - UNEXPECTED

**Expected:** >80% hit rate for the 8-class cache

**Actual:** unified_cache_refill at 2.3% suggests a <50% hit rate

**Why unexpected:**

- 8 size classes (C0-C7) should cover 16-1024B well
- The TLS cache should absorb most allocations

**Fix:** Increase cache capacity to 64-128 blocks per class (Priority 3)

---

### 4. **hak_pool_mid_lookup at 8.1% (tiny_hot)** - MINOR SURPRISE

**Expected:** <2% for the lookup

**Actual:** 8.1% in the hot path

**Why unexpected:**

- A simple size → class mapping should be fast
- Likely not inlined, or hitting branch mispredictions

**Fix:** Force inline + branch hints (Priority 4)

---

## Comparison to Tiny Hot Breakdown

| Metric | Random Mixed | Tiny Hot | Ratio |
|--------|--------------|----------|-------|
| **Throughput** | 4.1 M ops/s | 89 M ops/s | 21.7x |
| **User-space %** | 11% | 70% | 6.4x |
| **Kernel %** | 89% | 30% | 3.0x |
| **Page Faults %** | 61.7% | 0.5% | 123x |
| **Shared Pool %** | 3.3% | <0.1% | >30x |
| **Unified Cache %** | 2.3% | <0.1% | >20x |
| **Wrapper %** | 3.7% | 46% | 12x (inverse) |

**Key Differences:**

1. **Kernel vs user ratio:** random_mixed is 89% kernel / 11% user; tiny_hot is 70% user / 30% kernel. **Inverted!**
2. **Page faults:** 123x more in random_mixed (61.7% vs 0.5%)
3. **Backend calls:** Shared Pool + Unified Cache = 5.6% in random_mixed vs <0.1% in tiny_hot
4. **Wrapper visibility:** wrappers are 46% of tiny_hot vs 3.7% of random_mixed, but their absolute time is similar; what differs is everything ELSE that is running (kernel work)

---

## What's Different Between the Workloads?

### Random Mixed

- **Allocation pattern:** random sizes 16-1040B, random slot selection
- **Cache behavior:** frequent misses due to varied sizes
- **Memory pattern:** on-demand allocation via mmap
- **Kernel interaction:** heavy (61.7% page faults)
- **Backend path:** frequently hits Shared Pool + SuperSlab

### Tiny Hot

- **Allocation pattern:** fixed size (likely 64-128B), repeated alloc/free
- **Cache behavior:** high hit rate, rarely refills
- **Memory pattern:** pre-allocated at startup
- **Kernel interaction:** light (0.5% page faults; kernel time is mostly scheduler timers)
- **Backend path:** rarely hit (the cache absorbs everything)

**The difference is night and day:** tiny_hot is a **pure user-space workload** with minimal kernel interaction. Random mixed is a **kernel-dominated workload** due to on-demand memory allocation.

---

## Actionable Recommendations (Prioritized)

### Priority 1: Pre-fault SuperSlabs at Startup (10-15x gain)

**Target:** eliminate the 61.7% page-fault overhead

**Implementation:**

```c
// During hakmem_init(), after SuperSlab allocation (needs <sys/mman.h>):
for (int cls = 0; cls < 8; cls++) {
    void* slab = superslab_alloc_2mb(cls);
    // Pre-fault all pages. MADV_POPULATE_WRITE (Linux 5.14+) populates
    // writable pages up front; MADV_POPULATE_READ would only map the shared
    // zero page and still leave a copy-on-write fault for the first write.
    if (madvise(slab, 2*1024*1024, MADV_POPULATE_WRITE) != 0) {
        // Fallback for older kernels: touch each page manually.
        // The touch must be a WRITE; a read would again map only the zero page.
        for (size_t i = 0; i < 2*1024*1024; i += 4096) {
            ((volatile char*)slab)[i] = 0;
        }
    }
}
```

**Expected result:** 4.1M → 41M ops/s (10x)

---

### Priority 2: Lock-Free Shared Pool (2-4x gain)

**Target:** reduce the 3.3% mutex overhead to ~0.8%

**Implementation:**

```c
// Replace the mutex with an atomic CAS on the free list
// (needs <stdatomic.h> and <pthread.h>).
typedef struct Slab { struct Slab* next; /* ... */ } Slab;

typedef struct SharedPool {
    _Atomic(Slab*) free_list;  // lock-free LIFO of cached slabs
    pthread_mutex_t slow_lock; // only for the slow path
} SharedPool;

Slab* pool_acquire_slow(SharedPool* pool); // takes slow_lock, asks the backend

Slab* pool_acquire_fast(SharedPool* pool) {
    Slab* head = atomic_load(&pool->free_list);
    while (head) {
        // NOTE: a production version needs ABA protection (tagged pointer
        // or hazard pointers) before head->next can be trusted here.
        if (atomic_compare_exchange_weak(&pool->free_list, &head, head->next)) {
            return head; // Fast path: no lock!
        }
        // On failure the CAS reloads head, so the loop retries cleanly.
    }
    // Free list empty - slow path: acquire a new slab from the backend.
    return pool_acquire_slow(pool);
}
```

**Expected result:** 3.3% → 0.8%, contributing to an overall ~2x gain

---

### Priority 3: Increase Unified Cache Capacity (2x fewer refills)

**Target:** reduce the cache miss rate from ~50% to ~20%

**Implementation:**

```c
// Current: 16-32 blocks per class
// #define UNIFIED_CACHE_CAPACITY 32

// Proposed: 64-128 blocks per class
#define UNIFIED_CACHE_CAPACITY 128

// Also: batch refills (pull 128 blocks per refill instead of 16; sketched below)
```
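
A sketch of the batched refill mentioned above. The `Block` layout and the `slab_carve_chain` helper are hypothetical stand-ins for whatever HAKMEM's slab layer actually provides:

```c
#define REFILL_BATCH 128

typedef struct Block { struct Block* next; } Block;

// Assumed backend helper: carve up to n blocks of the class from a slab
// and return them as a singly linked chain; *got receives the count.
Block* slab_carve_chain(int size_class, int n, int* got);

static void unified_cache_refill_batched(Block** cache_head, int size_class) {
    int got = 0;
    Block* chain = slab_carve_chain(size_class, REFILL_BATCH, &got);
    if (got == 0) return;              // backend empty: caller falls through

    Block* tail = chain;               // find the end of the carved chain
    while (tail->next) tail = tail->next;

    tail->next  = *cache_head;         // splice the whole batch in at once
    *cache_head = chain;
}
```

Pulling 128 blocks instead of 16 means one backend trip (and one potential mutex acquisition) amortizes over 8x more allocations.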

**Expected result:** 2x fewer calls to unified_cache_refill

---

### Priority 4: Inline Gatekeeper (2x reduction in routing overhead)

**Target:** reduce hak_pool_mid_lookup from 8.1% to ~4%

**Implementation:**

```c
#include <stddef.h>

__attribute__((always_inline))
static inline int size_to_class(size_t size) {
    // Use a lookup table or bit tricks (a branchless variant is sketched below)
    return (size <= 32)  ? 0 :
           (size <= 64)  ? 1 :
           (size <= 128) ? 2 :
           (size <= 256) ? 3 : /* ... */
                           7;
}
```
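
For the "bit tricks" option, a branchless variant is possible if the size classes double in size (32, 64, 128, ...). That spacing is an assumption for illustration; HAKMEM's real class table may differ:

```c
#include <stddef.h>

__attribute__((always_inline))
static inline int size_to_class_branchless(size_t size) {
    if (size <= 32) return 0;                        // one predictable branch
    int log2_ceil = 64 - __builtin_clzll(size - 1);  // ceil(log2(size))
    int cls = log2_ceil - 5;                         // 33-64B -> 1, 65-128B -> 2, ...
    return cls > 7 ? 7 : cls;                        // clamp to the last class
}
```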

**Expected result:** tiny_hot benefits most (8.1% → 4%); random_mixed sees a minor gain

---

## Expected Performance After Optimizations

| Stage | Random Mixed | Gain | Tiny Hot | Gain |
|-------|--------------|------|----------|------|
| **Current** | 4.1 M ops/s | - | 89 M ops/s | - |
| After P1 (Pre-fault) | 35 M ops/s | 8.5x | 89 M ops/s | 1.0x |
| After P2 (Lock-free) | 45 M ops/s | 1.3x | 89 M ops/s | 1.0x |
| After P3 (Cache) | 55 M ops/s | 1.2x | 90 M ops/s | 1.01x |
| After P4 (Inline) | 60 M ops/s | 1.1x | 100 M ops/s | 1.1x |
| **TOTAL** | **60 M ops/s** | **15x** | **100 M ops/s** | **1.1x** |

**Final gap:** 60M vs 100M = **1.67x slower** (within an acceptable range)

---

## Conclusion

### Where are the 22x slowdown cycles actually spent?

1. **Kernel page faults: 61.7%** (PRIMARY CAUSE - 16x slowdown)
2. **Other kernel overhead: 22%** (memcg, scheduler, rcu)
3. **Shared Pool: 3.3%** (#1 user hotspot)
4. **Wrappers: 3.7%** (#2 user hotspot, but acceptable)
5. **Unified Cache: 2.3%** (#3 user hotspot, triggers page faults)
6. **Everything else: 7%**

### Which layers should be optimized next (beyond tiny front)?

1. **Pre-fault SuperSlabs** (eliminate kernel page faults)
2. **Lock-free Shared Pool** (eliminate mutex contention)
3. **Larger Unified Cache** (reduce refills)

### Is the gap due to control flow / complexity or real work?

**Both:**

- **Real work (kernel):** 61.7% of cycles are spent **zeroing new pages** (clear_page_erms) and handling page faults. This is REAL work, not control-flow overhead.
- **Control flow (user):** only ~11% of cycles are in HAKMEM code, and most of that is legitimate (routing, locking, cache management). Very little is wasted on unnecessary branches.

**Verdict:** The gap is due to **REAL WORK (kernel page faults)**, not control-flow overhead.

### Can wrapper overhead be reduced?

**Current:** 3.7% (random_mixed), 46% (tiny_hot)

**Answer:** Wrapper overhead is **already acceptable**. In absolute terms the wrappers take similar time in both workloads; the difference is that tiny_hot has no kernel overhead, so the wrappers dominate its profile.

**Possible improvements:**

- Cache ENV variables at startup (may already be done)
- Use ifunc for dispatch to eliminate per-call LD_PRELOAD checks (see the sketch below)

**Expected gain:** ~1.5x reduction (3.7% → 2.5%), but this is LOW priority
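
A rough sketch of the ifunc idea (a GNU extension): the resolver runs once at load time, so the chosen path no longer re-checks environment state on every call. All `hak_*` names here are hypothetical, and calling env-dependent code from an ifunc resolver needs care, since resolvers run very early in process startup:

```c
#include <stdlib.h>

void* hak_malloc_fast(size_t size);   // hypothetical HAKMEM entry points
void* hak_malloc_safe(size_t size);
int   hak_env_wants_safe_mode(void);  // hypothetical cached-env query

// Resolver: picked once during relocation, not per malloc() call.
static void* (*resolve_malloc(void))(size_t) {
    return hak_env_wants_safe_mode() ? hak_malloc_safe : hak_malloc_fast;
}

void* malloc(size_t size) __attribute__((ifunc("resolve_malloc")));
```
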
### Should we focus on Unified Cache hit rate or Shared Pool efficiency?

**Answer: BOTH**, but in order:

1. **Priority 1: Eliminate page faults** (pre-fault at startup)
2. **Priority 2: Shared Pool efficiency** (lock-free fast path)
3. **Priority 3: Unified Cache hit rate** (increase capacity)

All three are needed to close the gap. Priority 1 alone gives 10x, but without Priorities 2-3 you'll still be 2-3x slower than tiny_hot.

---

## Files Generated

1. **PERF_SUMMARY_TABLE.txt** - quick-reference table with cycle breakdowns
2. **PERF_ANALYSIS_RANDOM_MIXED_VS_TINY_HOT.md** - detailed layer-by-layer analysis
3. **PERF_PROFILING_ANSWERS.md** - this file (answers to specific questions)

All saved to: `/mnt/workdisk/public_share/hakmem/`