# L1D Cache Miss Root Cause Analysis & Optimization Strategy

**Date**: 2025-11-19
**Status**: CRITICAL BOTTLENECK IDENTIFIED
**Priority**: P0 (blocks closure of the 3.7x performance gap)

---
## Executive Summary

**Root Cause**: Metadata-heavy access pattern with poor cache locality
**Impact**: 9.9x more L1D cache misses than System malloc (1.88M vs 0.19M per 1M ops)
**Performance Gap**: 3.7x slower (24.9M ops/s vs 92.3M ops/s)
**Expected Improvement**: 50-70% performance gain (35-40M ops/s) with the proposed optimizations
**Recommended Priority**: Implement P1 (Quick Win) immediately, P2 within 1 week

---
## Phase 1: Perf Profiling Results

### L1D Cache Miss Statistics (Random Mixed 256B, 1M iterations)

| Metric | HAKMEM | System malloc | Ratio | Impact |
|--------|--------|---------------|-------|--------|
| **L1D loads** | 111.5M | 40.8M | **2.7x** | Extra memory traffic |
| **L1D misses** | 1.88M | 0.19M | **9.9x** | 🔥 **CRITICAL** |
| **L1D miss rate** | 1.69% | 0.46% | **3.7x** | Cache inefficiency |
| **Instructions** | 275.2M | 92.3M | **3.0x** | Code bloat |
| **Cycles** | 180.9M | 44.7M | **4.0x** | Total overhead |
| **IPC** | 1.52 | 2.06 | **0.74x** | Memory-bound |

**Key Finding**: The L1D miss penalty dominates the performance gap.
- Miss penalty: ~200 cycles per miss (assuming misses that fall through to main memory; L2 hits cost far less)
- Total penalty: (1.88M - 0.19M) × 200 = **338M cycles**
- This accounts for **~75% of the performance gap** (338M / 450M)
### Throughput Comparison

```
HAKMEM:  24.88M ops/s (1M iterations)
System:  92.31M ops/s (1M iterations)
Performance: 26.9% of System malloc (3.71x slower)
```

### L1 Instruction Cache (Control)

| Metric | HAKMEM | System | Ratio |
|--------|--------|--------|-------|
| I-cache misses | 40.8K | 2.2K | 18.5x |

**Analysis**: I-cache misses are negligible in absolute terms (40.8K vs 1.88M D-cache misses), confirming that **data access patterns**, not code size, are the bottleneck.

---
## Phase 2: Data Structure Analysis

### 2.1 SuperSlab Metadata Layout Issues

**Current Structure** (from `core/superslab/superslab_types.h`):

```c
typedef struct SuperSlab {
    // Cache line 0 (bytes 0-63): Header fields
    uint32_t magic;                          // offset 0
    uint8_t  lg_size;                        // offset 4
    uint8_t  _pad0[3];                       // offset 5
    _Atomic uint32_t total_active_blocks;    // offset 8
    _Atomic uint32_t refcount;               // offset 12
    _Atomic uint32_t listed;                 // offset 16
    uint32_t slab_bitmap;                    // offset 20 ⭐ HOT
    uint32_t nonempty_mask;                  // offset 24 ⭐ HOT
    uint32_t freelist_mask;                  // offset 28 ⭐ HOT
    uint8_t  active_slabs;                   // offset 32 ⭐ HOT
    uint8_t  publish_hint;                   // offset 33
    uint16_t partial_epoch;                  // offset 34
    struct SuperSlab* next_chunk;            // offset 36
    struct SuperSlab* partial_next;          // offset 44
    // ... (continues)

    // Cache lines 1+ (bytes 72+): Remote-free queues
    _Atomic uintptr_t remote_heads[32];      // offset 72  (256 bytes)
    _Atomic uint32_t  remote_counts[32];     // offset 328 (128 bytes)
    _Atomic uint32_t  slab_listed[32];       // offset 456 (128 bytes)

    // Cache line 9+ (bytes 600+): Per-slab metadata array
    TinySlabMeta slabs[32];                  // offset 600 ⭐ HOT (512 bytes)
} SuperSlab;  // Total: 1112 bytes (18 cache lines)
```

**Size**: 1112 bytes (18 cache lines)
#### Problem 1: Hot Fields Scattered Across Cache Lines

**Hot fields accessed on every allocation**:
1. `slab_bitmap` (offset 20, cache line 0)
2. `nonempty_mask` (offset 24, cache line 0)
3. `freelist_mask` (offset 28, cache line 0)
4. `slabs[N]` (offset 600+, cache line 9+)

**Analysis**:
- The hot path loads **two cache lines minimum**: line 0 (bitmasks) + line 9+ (SlabMeta)
- With 32 slabs, `slabs[]` spans **8 cache lines** (64 bytes/line × 8 = 512 bytes)
- Random slab access causes **cache line thrashing**
#### Problem 2: TinySlabMeta Field Layout

**Current Structure**:
```c
typedef struct TinySlabMeta {
    void*    freelist;       // offset 0  ⭐ HOT (read on refill)
    uint16_t used;           // offset 8  ⭐ HOT (update on alloc/free)
    uint16_t capacity;       // offset 10 ⭐ HOT (check on refill)
    uint8_t  class_idx;      // offset 12 🔥 COLD (set once at init)
    uint8_t  carved;         // offset 13 🔥 COLD (rarely changed)
    uint8_t  owner_tid_low;  // offset 14 🔥 COLD (debug only)
} TinySlabMeta;  // Total: 16 bytes (fits in 1 cache line ✅)
```

**Issue**: Cold fields (`class_idx`, `carved`, `owner_tid_low`) occupy **4 bytes** (3 data bytes plus 1 byte of implicit padding) of the hot cache line, wasting precious L1D capacity.
---
### 2.2 TLS Cache Layout Analysis

**Current TLS Variables** (from `core/hakmem_tiny.c`):

```c
__thread void*    g_tls_sll_head[8];   // 64 bytes (1 cache line)
__thread uint32_t g_tls_sll_count[8];  // 32 bytes (0.5 cache lines)
```

**Total TLS cache footprint**: 96 bytes (2 cache lines)

**Layout**:
```
Cache Line 0: g_tls_sll_head[0-7]  (64 bytes) ⭐ HOT
Cache Line 1: g_tls_sll_count[0-7] (32 bytes) + padding (32 bytes)
```

#### Issue: Split Head/Count Access

**Access pattern on alloc**:
1. Read `g_tls_sll_head[cls]` → cache line 0 ✅
2. Read next pointer `*(void**)ptr` → separate cache line (depends on `ptr`) ❌
3. Write `g_tls_sll_head[cls] = next` → cache line 0 ✅
4. Decrement `g_tls_sll_count[cls]` → cache line 1 ❌

**Problem**: **2 cache lines touched** per allocation (head + count), vs **1 cache line** for glibc tcache (`counts[]` is rarely accessed in the hot path).
---
## Phase 3: System malloc Comparison (glibc tcache)

### glibc tcache Design Principles

**Reference Structure**:
```c
typedef struct tcache_perthread_struct {
    uint16_t counts[64];        // offset 0,   size 128 bytes (cache lines 0-1)
    tcache_entry *entries[64];  // offset 128, size 512 bytes (cache lines 2-9)
} tcache_perthread_struct;
```

**Total size**: 640 bytes (10 cache lines)

### Key Differences (HAKMEM vs tcache)

| Aspect | HAKMEM | glibc tcache | Impact |
|--------|--------|--------------|--------|
| **Metadata location** | Scattered (SuperSlab, 18 cache lines) | Compact (TLS, 10 cache lines) | **8 fewer cache lines** |
| **Hot path accesses** | 3-4 cache lines (head, count, meta, bitmap) | **1 cache line** (entries[] only) | **75% reduction** |
| **Count checks** | Every alloc/free | **Rarely** (only on refill threshold) | **Fewer loads** |
| **Indirection** | TLS → SuperSlab → SlabMeta → freelist | TLS → freelist (direct) | **2 fewer indirections** |
| **Spatial locality** | Poor (32 slabs × 16B scattered) | **Excellent** (entries[] contiguous) | **Better prefetch** |

**Root Cause Identified**: HAKMEM's SuperSlab-centric design requires **3-4 metadata loads** per allocation, vs tcache's **1 load** (just `entries[bin]`).
---
## Phase 4: Optimization Proposals

### Priority 1: Quick Wins (1-2 days, 30-40% improvement)

#### **Proposal 1.1: Separate Hot/Cold SlabMeta Fields**

**Current layout**:
```c
typedef struct TinySlabMeta {
    void*    freelist;       // 8B ⭐ HOT
    uint16_t used;           // 2B ⭐ HOT
    uint16_t capacity;       // 2B ⭐ HOT
    uint8_t  class_idx;      // 1B 🔥 COLD
    uint8_t  carved;         // 1B 🔥 COLD
    uint8_t  owner_tid_low;  // 1B 🔥 COLD
    // uint8_t _pad[1];      // 1B (implicit padding)
};  // Total: 16B
```

**Optimized layout** (cache-aligned):
```c
// HOT structure (accessed on every alloc/free)
typedef struct TinySlabMetaHot {
    void*    freelist;   // 8B ⭐ HOT
    uint16_t used;       // 2B ⭐ HOT
    uint16_t capacity;   // 2B ⭐ HOT
    uint32_t _pad;       // 4B (keep 16B alignment)
} __attribute__((aligned(16))) TinySlabMetaHot;

// COLD structure (accessed rarely, kept separate)
typedef struct TinySlabMetaCold {
    uint8_t class_idx;      // 1B 🔥 COLD
    uint8_t carved;         // 1B 🔥 COLD
    uint8_t owner_tid_low;  // 1B 🔥 COLD
    uint8_t _reserved;      // 1B (future use)
} TinySlabMetaCold;

typedef struct SuperSlab {
    // ... existing fields ...
    TinySlabMetaHot  slabs_hot[32];   // 512B (8 cache lines) ⭐ HOT
    TinySlabMetaCold slabs_cold[32];  // 128B (2 cache lines) 🔥 COLD
} SuperSlab;
```

**Expected Impact**:
- **L1D miss reduction**: -20% (8 cache lines instead of 10 for the hot path)
- **Spatial locality**: Improved (hot fields contiguous)
- **Performance gain**: +15-20%
- **Implementation effort**: 4-6 hours (refactor field access, update tests)
---
#### **Proposal 1.2: Prefetch SuperSlab Metadata**

**Target locations** (in `sll_refill_batch_from_ss`):

```c
static inline int sll_refill_batch_from_ss(int class_idx, int max_take) {
    TinyTLSSlab* tls = &g_tls_slabs[class_idx];

    // ✅ ADD: Prefetch SuperSlab hot fields (slab_bitmap, nonempty_mask, freelist_mask)
    if (tls->ss) {
        __builtin_prefetch(&tls->ss->slab_bitmap, 0, 3);  // Read, high temporal locality
    }

    TinySlabMeta* meta = tls->meta;
    if (!meta) return 0;

    // ✅ ADD: Prefetch SlabMeta hot fields (freelist, used, capacity)
    __builtin_prefetch(&meta->freelist, 0, 3);

    // ... rest of refill logic
}
```

**Prefetch in allocation path** (`tiny_alloc_fast`):

```c
static inline void* tiny_alloc_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);

    // ✅ ADD: Prefetch TLS head (likely already in L1, but hints to the CPU)
    __builtin_prefetch(&g_tls_sll_head[class_idx], 0, 3);

    void* ptr = tiny_alloc_fast_pop(class_idx);
    // ... rest
}
```

**Expected Impact**:
- **L1D miss reduction**: -10-15% (hides latency for sequential accesses)
- **Performance gain**: +8-12%
- **Implementation effort**: 2-3 hours (add prefetch calls, benchmark)
---
#### **Proposal 1.3: Merge TLS Head/Count into a Single Cache Line**

**Current layout** (2 cache lines):
```c
__thread void*    g_tls_sll_head[8];   // 64B (cache line 0)
__thread uint32_t g_tls_sll_count[8];  // 32B (cache line 1)
```

**Optimized layout** (1 cache line for hot classes):
```c
// Option A: Interleaved (head + count together)
typedef struct TLSCacheEntry {
    void*    head;      // 8B
    uint32_t count;     // 4B
    uint32_t capacity;  // 4B (adaptive sizing, was in a separate array)
} TLSCacheEntry;  // 16B per class

__thread TLSCacheEntry g_tls_cache[8] __attribute__((aligned(64)));
// Total: 128 bytes (2 cache lines), but 4 hot classes fit in 1 line!
```

**Access pattern improvement**:
```c
// Before (2 cache lines):
void* ptr = g_tls_sll_head[cls];  // Cache line 0
g_tls_sll_count[cls]--;           // Cache line 1 ❌

// After (1 cache line):
void* ptr = g_tls_cache[cls].head;  // Cache line 0
g_tls_cache[cls].count--;           // Cache line 0 ✅ (same line!)
```

**Expected Impact**:
- **L1D miss reduction**: -15-20% (1 cache line per alloc instead of 2)
- **Performance gain**: +12-18%
- **Implementation effort**: 6-8 hours (major refactor, update all TLS accesses)
---
### Priority 2: Medium Effort (3-5 days, 20-30% additional improvement)

#### **Proposal 2.1: SuperSlab Hot Field Clustering**

**Current layout** (hot fields scattered):
```c
typedef struct SuperSlab {
    uint32_t magic;                        // offset 0
    uint8_t  lg_size;                      // offset 4
    uint8_t  _pad0[3];                     // offset 5
    _Atomic uint32_t total_active_blocks;  // offset 8
    // ... 12 more bytes ...
    uint32_t slab_bitmap;                  // offset 20 ⭐ HOT
    uint32_t nonempty_mask;                // offset 24 ⭐ HOT
    uint32_t freelist_mask;                // offset 28 ⭐ HOT
    // ... scattered cold fields ...
    TinySlabMeta slabs[32];                // offset 600 ⭐ HOT
} SuperSlab;
```

**Optimized layout** (hot fields in cache line 0):
```c
typedef struct SuperSlab {
    // Cache line 0: HOT FIELDS ONLY (64 bytes)
    uint32_t slab_bitmap;                  // offset 0 ⭐ HOT
    uint32_t nonempty_mask;                // offset 4 ⭐ HOT
    uint32_t freelist_mask;                // offset 8 ⭐ HOT
    uint8_t  active_slabs;                 // offset 12 ⭐ HOT
    uint8_t  lg_size;                      // offset 13 (needed for geometry)
    uint16_t _pad0;                        // offset 14
    _Atomic uint32_t total_active_blocks;  // offset 16 ⭐ HOT
    uint32_t magic;                        // offset 20 (validation)
    uint32_t _pad1[10];                    // offset 24 (fill to 64B)

    // Cache line 1+: COLD FIELDS
    _Atomic uint32_t refcount;             // offset 64 🔥 COLD
    _Atomic uint32_t listed;               // offset 68 🔥 COLD
    struct SuperSlab* next_chunk;          // offset 72 🔥 COLD
    // ... rest of the cold fields ...

    // Trailing cache lines: SLAB METADATA (unchanged)
    TinySlabMetaHot slabs_hot[32];
} __attribute__((aligned(64))) SuperSlab;
```

**Expected Impact**:
- **L1D miss reduction**: -25% (hot fields guaranteed to share 1 cache line)
- **Performance gain**: +18-25%
- **Implementation effort**: 8-12 hours (refactor layout, regression test)
---
#### **Proposal 2.2: Reduce SlabMeta Array Size (Dynamic Allocation)**

**Problem**: The 32-slot `slabs[]` array occupies **512 bytes** (8 cache lines), but most SuperSlabs use only **1-4 slabs**.

**Solution**: Allocate `TinySlabMeta` dynamically per active slab.

**Optimized structure**:
```c
typedef struct SuperSlab {
    // ... hot fields (cache line 0) ...

    // Replace: TinySlabMeta slabs[32];  (512B)
    // With:    a pointer array          (256B = 4 cache lines)
    TinySlabMetaHot* slabs_hot[32];   // 256B (8B per pointer)

    // Cold metadata stays in the SuperSlab (no extra allocation)
    TinySlabMetaCold slabs_cold[32];  // 128B
} SuperSlab;

// Allocate hot metadata on demand (first use)
if (!ss->slabs_hot[slab_idx]) {
    ss->slabs_hot[slab_idx] = aligned_alloc(16, sizeof(TinySlabMetaHot));
}
```

**Expected Impact**:
- **L1D miss reduction**: -30% (only active slabs are loaded into cache)
- **Memory overhead**: -256B per SuperSlab (512B inline array → 256B of pointers + on-demand allocations)
- **Performance gain**: +20-28%
- **Implementation effort**: 12-16 hours (refactor metadata access, lifecycle management)
---
### Priority 3: High Impact (1-2 weeks, 40-50% additional improvement)

#### **Proposal 3.1: TLS-Local Metadata Cache (tcache-style)**

**Strategy**: Cache frequently accessed `TinySlabMeta` fields in TLS and avoid the SuperSlab indirection.

**New TLS structure**:
```c
typedef struct TLSSlabCache {
    void*    head;           // 8B ⭐ HOT (freelist head)
    uint16_t count;          // 2B ⭐ HOT (cached blocks in TLS)
    uint16_t capacity;       // 2B ⭐ HOT (adaptive capacity)
    uint16_t used;           // 2B ⭐ HOT (cached from meta->used)
    uint16_t slab_capacity;  // 2B ⭐ HOT (cached from meta->capacity)
    TinySlabMeta* meta_ptr;  // 8B 🔥 COLD (pointer to SuperSlab metadata)
} __attribute__((aligned(32))) TLSSlabCache;

__thread TLSSlabCache g_tls_cache[8] __attribute__((aligned(64)));
```

**Access pattern**:
```c
// Before (2 indirections):
TinyTLSSlab* tls = &g_tls_slabs[cls];      // 1st load
TinySlabMeta* meta = tls->meta;            // 2nd load
if (meta->used < meta->capacity) { ... }   // 3rd load (used), 4th load (capacity)

// After (direct TLS access):
TLSSlabCache* cache = &g_tls_cache[cls];         // 1st load
if (cache->used < cache->slab_capacity) { ... }  // Same cache line! ✅
```

**Synchronization** (periodically sync TLS cache → SuperSlab):
```c
// On the refill threshold (every 64 allocs)
if ((g_tls_cache[cls].count & 0x3F) == 0) {
    // Write back the TLS cache to the SuperSlab metadata
    TinySlabMeta* meta = g_tls_cache[cls].meta_ptr;
    atomic_store(&meta->used, g_tls_cache[cls].used);
}
```

**Expected Impact**:
- **L1D miss reduction**: -60% (eliminates SuperSlab access on the fast path)
- **Indirection elimination**: 3-4 loads → 1 load
- **Performance gain**: +80-120% (tcache parity)
- **Implementation effort**: 2-3 weeks (major architectural change, requires extensive testing)
---
#### **Proposal 3.2: Per-Class SuperSlab Affinity (Reduce Working Set)**

**Problem**: The Random Mixed workload accesses **8 size classes × N SuperSlabs**, causing cache thrashing.

**Solution**: Pin frequently used SuperSlabs to a hot TLS slot, evict cold ones.

**Strategy**:
1. Track access frequency per SuperSlab (LRU-like heuristic)
2. Keep **1 "hot" SuperSlab per class** in a TLS-local pointer
3. Prefetch the hot SuperSlab on class switch

**Implementation**:
```c
__thread SuperSlab* g_hot_ss[8];  // Hot SuperSlab per class

static inline void ensure_hot_ss(int class_idx) {
    if (!g_hot_ss[class_idx]) {
        g_hot_ss[class_idx] = get_current_superslab(class_idx);
        __builtin_prefetch(&g_hot_ss[class_idx]->slab_bitmap, 0, 3);
    }
}
```

**Expected Impact**:
- **L1D miss reduction**: -25% (hot SuperSlabs stay in cache)
- **Working set reduction**: 8 SuperSlabs → 1-2 SuperSlabs (cache-resident)
- **Performance gain**: +18-25%
- **Implementation effort**: 1 week (LRU tracking, eviction policy)

---
## Recommended Action Plan

### Phase 1: Quick Wins (Priority 1, 1-2 days) 🚀

**Implementation Order**:

1. **Day 1**: Proposal 1.2 (Prefetch) + Proposal 1.1 (Hot/Cold Split)
   - Morning: Add prefetch hints to refill + alloc paths (2-3 hours)
   - Afternoon: Split `TinySlabMeta` into hot/cold structs (4-6 hours)
   - Evening: Benchmark, regression test

2. **Day 2**: Proposal 1.3 (TLS Head/Count Merge)
   - Morning: Refactor the TLS cache to `TLSCacheEntry[]` (4-6 hours)
   - Afternoon: Update all TLS access sites (2-3 hours)
   - Evening: Benchmark, regression test

**Expected Cumulative Impact**:
- **L1D miss reduction**: -35-45%
- **Performance gain**: +35-50%
- **Target**: 32-37M ops/s (from 24.9M)

---
### Phase 2: Medium Effort (Priority 2, 3-5 days)

**Implementation Order**:

1. **Day 3-4**: Proposal 2.1 (SuperSlab Hot Field Clustering)
   - Refactor the `SuperSlab` layout (cache line 0 = hot fields only)
   - Update geometry calculations, regression test

2. **Day 5**: Proposal 2.2 (Dynamic SlabMeta Allocation)
   - Implement on-demand `slabs_hot[]` allocation
   - Lifecycle management (alloc on first use, free on SuperSlab destruction)

**Expected Cumulative Impact**:
- **L1D miss reduction**: -55-70%
- **Performance gain**: +70-100% (cumulative with P1)
- **Target**: 42-50M ops/s

---

### Phase 3: High Impact (Priority 3, 1-2 weeks)

**Long-term strategy**:

1. **Week 1**: Proposal 3.1 (TLS-Local Metadata Cache)
   - Major architectural change (tcache-style design)
   - Requires extensive testing and debugging

2. **Week 2**: Proposal 3.2 (SuperSlab Affinity)
   - LRU tracking, hot SuperSlab pinning
   - Working-set reduction

**Expected Cumulative Impact**:
- **L1D miss reduction**: -75-85%
- **Performance gain**: +150-200% (cumulative)
- **Target**: 60-70M ops/s (**System malloc parity!**)

---
## Risk Assessment

### Risks

1. **Correctness Risk (Proposals 1.1, 2.1)**: ⚠️ **Medium**
   - The hot/cold split may break existing assumptions
   - **Mitigation**: Extensive regression tests, AddressSanitizer validation

2. **Performance Risk (Proposal 1.2)**: ⚠️ **Low**
   - Prefetch may hurt if the memory access pattern changes
   - **Mitigation**: A/B test with a `HAKMEM_PREFETCH=0/1` env flag

3. **Complexity Risk (Proposal 3.1)**: ⚠️ **High**
   - TLS cache synchronization bugs (stale reads, lost writes)
   - **Mitigation**: Incremental rollout, extensive fuzzing

4. **Memory Overhead (Proposal 2.2)**: ⚠️ **Low**
   - Dynamic allocation adds fragmentation
   - **Mitigation**: Use a slab allocator for `TinySlabMetaHot` (fixed-size)

---
### Validation Plan

#### Phase 1 Validation (Quick Wins)

1. **Perf Stat Validation**:
   ```bash
   perf stat -e L1-dcache-loads,L1-dcache-load-misses,cycles,instructions \
       -r 10 ./bench_random_mixed_hakmem 1000000 256 42
   ```
   **Target**: L1D miss rate < 1.0% (from 1.69%)

2. **Regression Tests**:
   ```bash
   ./build.sh test_all
   ASAN_OPTIONS=detect_leaks=1 ./out/asan/test_all
   ```

3. **Throughput Benchmark**:
   ```bash
   ./bench_random_mixed_hakmem 10000000 256 42
   ```
   **Target**: > 35M ops/s (+40% from 24.9M)

#### Phase 2-3 Validation

1. **Stress Test** (1-hour continuous run):
   ```bash
   timeout 3600 ./bench_random_mixed_hakmem 100000000 256 42
   ```

2. **Multi-threaded Workload**:
   ```bash
   ./larson_hakmem 4 10000000
   ```

3. **Memory Leak Check**:
   ```bash
   valgrind --leak-check=full ./bench_random_mixed_hakmem 100000 256 42
   ```

---
## Conclusion

**L1D cache misses are the PRIMARY bottleneck** (9.9x worse than System malloc), accounting for ~75% of the performance gap. The root cause is **metadata-heavy access patterns** with poor cache locality:

1. **SuperSlab**: 18 cache lines, scattered hot fields
2. **TLS Cache**: 2 cache lines touched per alloc (head + count split)
3. **Indirection**: 3-4 metadata loads vs tcache's 1 load

**Proposed optimizations** target these issues systematically:
- **P1 (Quick Win)**: 35-50% gain in 1-2 days
- **P2 (Medium)**: +70-100% gain in 1 week
- **P3 (High Impact)**: +150-200% gain in 2 weeks (tcache parity)

**Immediate action**: Start with **Proposal 1.2 (Prefetch)** today (2-3 hours, +8-12% gain). Follow with **Proposal 1.1 (Hot/Cold Split)** tomorrow (6 hours, +15-20% gain).

**Final target**: 60-70M ops/s (System malloc parity within 2 weeks) 🎯