
# L1D Cache Miss Hotspot Diagram

## Memory Access Pattern Comparison

### Current HAKMEM (1.88M L1D misses per 1M ops)

```
┌─────────────────────────────────────────────────────────────────┐
│ Allocation Fast Path (tiny_alloc_fast)                          │
└─────────────────────────────────────────────────────────────────┘
         │
         ├─► [1] TLS Cache Access (Cache Line 0)
         │   ┌──────────────────────────────────────┐
         │   │ g_tls_sll_head[cls]    ← Load (8B)  │  ✅ L1 HIT (likely)
         │   └──────────────────────────────────────┘
         │
         ├─► [2] TLS Count Access (Cache Line 1)
         │   ┌──────────────────────────────────────┐
         │   │ g_tls_sll_count[cls]   ← Load (4B)  │  ❌ L1 MISS (~10%)
         │   └──────────────────────────────────────┘
         │
         ├─► [3] Next Pointer Deref (Random Cache Line)
         │   ┌──────────────────────────────────────┐
         │   │ *(void**)ptr           ← Load (8B)  │  ❌ L1 MISS (~40%)
         │   │ (depends on freelist block location)│     (random access)
         │   └──────────────────────────────────────┘
         │
         └─► [4] TLS Count Update (Cache Line 1)
             ┌──────────────────────────────────────┐
             │ g_tls_sll_count[cls]-- ← Store (4B) │  ❌ L1 MISS (~5%)
             └──────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ Refill Path (sll_refill_batch_from_ss)                          │
└─────────────────────────────────────────────────────────────────┘
         │
         ├─► [5] TinyTLSSlab Access
         │   ┌──────────────────────────────────────┐
         │   │ g_tls_slabs[cls]       ← Load (24B) │  ✅ L1 HIT (TLS)
         │   └──────────────────────────────────────┘
         │
         ├─► [6] SuperSlab Hot Fields (Cache Line 0)
         │   ┌──────────────────────────────────────┐
         │   │ ss->slab_bitmap        ← Load (4B)  │  ❌ L1 MISS (~30%)
         │   │ ss->nonempty_mask      ← Load (4B)  │     (same line, but
         │   │ ss->freelist_mask      ← Load (4B)  │      miss on first access)
         │   └──────────────────────────────────────┘
         │
         ├─► [7] SlabMeta Access (Cache Line 9+)
         │   ┌──────────────────────────────────────┐
         │   │ ss->slabs[idx].freelist ← Load (8B) │  ❌ L1 MISS (~50%)
         │   │ ss->slabs[idx].used     ← Load (2B) │     (600+ bytes offset
         │   │ ss->slabs[idx].capacity ← Load (2B) │      from ss base)
         │   └──────────────────────────────────────┘
         │
         └─► [8] SlabMeta Update (Cache Line 9+)
             ┌──────────────────────────────────────┐
             │ ss->slabs[idx].used++   ← Store (2B)│  ✅ HIT (same as [7])
             └──────────────────────────────────────┘

Total Cache Lines Touched: 4-5 per refill (Lines 0, 1, 9+, random freelist)
L1D Miss Rate: ~1.69% (1.88M misses / 111.5M loads)
```
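
The fast-path steps [1]-[4] above can be read as the following C sketch. It is an illustration under assumptions, not the HAKMEM source: `g_tls_sll_head` and `g_tls_sll_count` are the names used in the diagram, while `TINY_NUM_CLASSES`, `tiny_push_split`, and `tiny_pop_split` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TINY_NUM_CLASSES 8   /* assumption: 8 size classes, as in the heatmap */

/* Split layout: heads and counts live in separate arrays, so a pop has
 * to touch two different TLS cache lines before it even reaches the block. */
static _Thread_local void    *g_tls_sll_head[TINY_NUM_CLASSES];
static _Thread_local uint32_t g_tls_sll_count[TINY_NUM_CLASSES];

/* Hypothetical helper: push a free block onto the per-class list.
 * The first word of the block stores the next pointer. */
static void tiny_push_split(int cls, void *blk) {
    *(void **)blk = g_tls_sll_head[cls];
    g_tls_sll_head[cls] = blk;
    g_tls_sll_count[cls]++;
}

/* Hypothetical pop mirroring steps [1]-[4] of the diagram. */
static void *tiny_pop_split(int cls) {
    void *ptr = g_tls_sll_head[cls];               /* [1] head load  (line 0) */
    if (ptr == NULL || g_tls_sll_count[cls] == 0)  /* [2] count load (line 1) */
        return NULL;
    g_tls_sll_head[cls] = *(void **)ptr;           /* [3] next deref (random line) */
    g_tls_sll_count[cls]--;                        /* [4] count store (line 1) */
    return ptr;
}
```

Each pop touches the head array, the count array, and the popped block itself, which is where the three distinct cache lines on the allocation path come from.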

---

### Optimized HAKMEM (Target: <0.5% miss rate)

```
┌─────────────────────────────────────────────────────────────────┐
│ Allocation Fast Path (tiny_alloc_fast) - OPTIMIZED              │
└─────────────────────────────────────────────────────────────────┘
         │
         ├─► [1] TLS Cache Entry (Cache Line 0) - MERGED
         │   ┌──────────────────────────────────────┐
         │   │ g_tls_cache[cls].head  ← Load (8B)  │  ✅ L1 HIT (~95%)
         │   │ g_tls_cache[cls].count ← Load (4B)  │  ✅ SAME CACHE LINE!
         │   │ (both in same 16B struct)           │
         │   └──────────────────────────────────────┘
         │
         ├─► [2] Next Pointer Deref (Prefetched)
         │   ┌──────────────────────────────────────┐
         │   │ *(void**)ptr           ← Load (8B)  │  ✅ L1 HIT (~70%)
         │   │ __builtin_prefetch()                │     (prefetch hint!)
         │   └──────────────────────────────────────┘
         │
         └─► [3] TLS Cache Update (Cache Line 0)
             ┌──────────────────────────────────────┐
             │ g_tls_cache[cls].head  ← Store (8B) │  ✅ L1 HIT (write-back)
             │ g_tls_cache[cls].count ← Store (4B) │  ✅ SAME CACHE LINE!
             └──────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ Refill Path (sll_refill_batch_from_ss) - OPTIMIZED              │
└─────────────────────────────────────────────────────────────────┘
         │
         ├─► [4] TLS Cache Entry (Cache Line 0)
         │   ┌──────────────────────────────────────┐
         │   │ g_tls_cache[cls]       ← Load (16B) │  ✅ L1 HIT (same as [1])
         │   └──────────────────────────────────────┘
         │
         ├─► [5] SuperSlab Hot Fields (Cache Line 0) - PREFETCHED
         │   ┌──────────────────────────────────────┐
         │   │ ss->slab_bitmap        ← Load (4B)  │  ✅ L1 HIT (~85%)
         │   │ ss->nonempty_mask      ← Load (4B)  │     (prefetched +
         │   │ ss->freelist_mask      ← Load (4B)  │      cache line 0!)
         │   │ __builtin_prefetch(&ss->slab_bitmap)│
         │   └──────────────────────────────────────┘
         │
         ├─► [6] SlabMeta HOT Fields ONLY (Cache Line 2) - SPLIT
         │   ┌──────────────────────────────────────┐
         │   │ ss->slabs_hot[idx].freelist ← (8B)  │  ✅ L1 HIT (~75%)
         │   │ ss->slabs_hot[idx].used     ← (2B)  │     (hot/cold split +
         │   │ ss->slabs_hot[idx].capacity ← (2B)  │      prefetch!)
         │   │ (NO cold fields: class_idx, carved) │
         │   └──────────────────────────────────────┘
         │
         └─► [7] SlabMeta Update (Cache Line 2)
             ┌──────────────────────────────────────┐
             │ ss->slabs_hot[idx].used++ ← (2B)    │  ✅ HIT (same as [6])
             └──────────────────────────────────────┘

Total Cache Lines Touched: 2-3 per refill (Lines 0, 2, prefetched)
L1D Miss Rate: ~0.4-0.5% (target: <0.5M misses / 111.5M loads)
Projected improvement: 71-76% lower L1D miss rate ✅
```
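
Steps [1]-[3] of the optimized fast path could look like the sketch below. `g_tls_cache`, `head`, `count`, and `capacity` appear in the diagrams; the type name `TinyTLSCacheEntry`, the helper names, `TINY_NUM_CLASSES`, and the choice to prefetch the successor block (rather than the block being popped) are assumptions. The code assumes a 64-bit target and a GCC- or Clang-compatible compiler for `__builtin_prefetch`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TINY_NUM_CLASSES 8   /* assumption: 8 size classes */

/* Merged entry: head, count, and capacity share one 16-byte struct,
 * so four entries fit in a single 64-byte cache line. */
typedef struct {
    void    *head;
    uint32_t count;
    uint32_t capacity;
} TinyTLSCacheEntry;         /* hypothetical type name */

static_assert(sizeof(TinyTLSCacheEntry) == 16, "entry must stay 16 bytes");

static _Thread_local _Alignas(64) TinyTLSCacheEntry g_tls_cache[TINY_NUM_CLASSES];

/* Hypothetical helper: push a free block onto the per-class list. */
static void tiny_push_merged(int cls, void *blk) {
    TinyTLSCacheEntry *e = &g_tls_cache[cls];
    *(void **)blk = e->head;
    e->head = blk;
    e->count++;
}

/* Hypothetical pop mirroring steps [1]-[3] of the optimized diagram. */
static void *tiny_pop_merged(int cls) {
    TinyTLSCacheEntry *e = &g_tls_cache[cls];  /* [1] head + count: one line */
    void *ptr = e->head;
    if (ptr == NULL)
        return NULL;
    void *next = *(void **)ptr;                /* [2] next-pointer deref */
    if (next != NULL)
        __builtin_prefetch(next, 0, 3);        /* warm the block the next pop reads */
    e->head = next;                            /* [3] both stores hit the same line */
    e->count--;
    return ptr;
}
```

Prefetching `next` is a design choice made for this sketch: a hint for the block being popped right now would arrive too late to hide its latency, whereas a hint for its successor has a whole allocation's worth of time to complete.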

---

### System malloc (glibc tcache) - Reference (0.46% miss rate)

```
┌─────────────────────────────────────────────────────────────────┐
│ Allocation Fast Path (tcache_get)                               │
└─────────────────────────────────────────────────────────────────┘
         │
         ├─► [1] TLS tcache Entry (Cache Line 2-9)
         │   ┌──────────────────────────────────────┐
         │   │ tcache->entries[bin] ← Load (8B)    │  ✅ L1 HIT (~98%)
         │   │ (direct pointer array, no counts)   │     (1 cache line only!)
         │   └──────────────────────────────────────┘
         │
         ├─► [2] Next Pointer Deref (Random)
         │   ┌──────────────────────────────────────┐
         │   │ *(tcache_entry**)ptr ← Load (8B)    │  ❌ L1 MISS (~20%)
         │   └──────────────────────────────────────┘
         │
         └─► [3] TLS Entry Update (Cache Line 2-9)
             ┌──────────────────────────────────────┐
             │ tcache->entries[bin] ← Store (8B)   │  ✅ L1 HIT (write-back)
             └──────────────────────────────────────┘

Total Cache Lines Touched: 1-2 per allocation
L1D Miss Rate: ~0.46% (0.19M misses / 40.8M loads)

Key Insight: tcache NEVER touches counts[] in fast path!
- counts[] only accessed on refill/free threshold (every 64 ops)
- This minimizes cache footprint to 1 cache line (entries[] only)
```

---

## Cache Line Access Heatmap

### Current HAKMEM (Hot = High Miss Rate)

```
SuperSlab Structure (1112 bytes, 18 cache lines):
┌─────┬─────────────────────────────────────────────────────┐
│ Line│ Contents                                            │ Miss Rate
├─────┼─────────────────────────────────────────────────────┤
│  0  │ magic, lg_size, total_active, slab_bitmap, ...     │  🔥 30%
│  1  │ refcount, listed, next_chunk, ...                  │  🟢 <1%
│  2  │ last_used_ns, generation, lru_prev, lru_next       │  🟢 <1%
│  3-7│ remote_heads[0-31] (atomic pointers)               │  🟡 10%
│ 8-9 │ remote_counts[0-31], slab_listed[0-31]             │  🟢 <1%
│10-17│ slabs[0-31] (TinySlabMeta array, 512B)             │  🔥 50%
└─────┴─────────────────────────────────────────────────────┘

TLS Cache (96 bytes, 2 cache lines):
┌─────┬─────────────────────────────────────────────────────┐
│ Line│ Contents                                            │ Miss Rate
├─────┼─────────────────────────────────────────────────────┤
│  0  │ g_tls_sll_head[0-7] (64 bytes)                     │  🟢 <5%
│  1  │ g_tls_sll_count[0-7] (32B) + padding (32B)         │  🟡 10%
└─────┴─────────────────────────────────────────────────────┘
```
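
The line numbers in the heatmap follow from byte offsets divided by the cache line size. The helpers below make that mapping explicit, assuming 64-byte lines (the size implied by 1112 bytes spanning 18 lines). The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64u   /* assumption: 64-byte L1D lines */

/* Index of the cache line that contains a given byte offset,
 * relative to a line-aligned structure base. */
static size_t cache_line_of(size_t byte_offset) {
    return byte_offset / CACHE_LINE;
}

/* Number of cache lines a line-aligned object of this size occupies. */
static size_t cache_lines_spanned(size_t size_bytes) {
    assert(size_bytes > 0);
    return (size_bytes + CACHE_LINE - 1) / CACHE_LINE;
}
```

With these, `cache_lines_spanned(1112)` gives the 18 lines of the SuperSlab, `cache_lines_spanned(96)` gives the 2 lines of the TLS cache, and `cache_line_of(600)` gives line 9 for a field 600 bytes past the SuperSlab base.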

### Optimized HAKMEM (After Proposals 1.1 + 2.1)

```
SuperSlab Structure (1112 bytes, 18 cache lines):
┌─────┬─────────────────────────────────────────────────────┐
│ Line│ Contents                                            │ Miss Rate
├─────┼─────────────────────────────────────────────────────┤
│  0  │ slab_bitmap, nonempty_mask, freelist_mask, ...     │  🟢 5-10%
│     │ (HOT FIELDS ONLY, prefetched!)                     │  (prefetch!)
│  1  │ refcount, listed, next_chunk (COLD fields)         │  🟢 <1%
│  2-9│ slabs_hot[0-31] (HOT fields only, 512B)            │  🟡 15-20%
│     │ (freelist, used, capacity - prefetched!)           │  (prefetch!)
│10-11│ slabs_cold[0-31] (COLD: class_idx, carved, ...)    │  🟢 <1%
│12-17│ remote_heads, remote_counts, slab_listed           │  🟢 <1%
└─────┴─────────────────────────────────────────────────────┘

TLS Cache (128 bytes, 2 cache lines):
┌─────┬─────────────────────────────────────────────────────┐
│ Line│ Contents                                            │ Miss Rate
├─────┼─────────────────────────────────────────────────────┤
│  0  │ g_tls_cache[0-3] (head+count+capacity, 64B)        │  🟢 <2%
│  1  │ g_tls_cache[4-7] (head+count+capacity, 64B)        │  🟢 <2%
│     │ (merged structure, same cache line access!)        │
└─────┴─────────────────────────────────────────────────────┘
```
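
One possible layout for the `slabs_hot` / `slabs_cold` split shown above is sketched here. The field names `freelist`, `used`, `capacity`, `class_idx`, and `carved` come from the diagrams; the type names, the cold-field widths, and the helper `hot_line_of` are assumptions, and the size checks assume a 64-bit target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE   64   /* assumption: 64-byte L1D lines */
#define SLABS_PER_SS 32   /* slabs[0-31] in the heatmap */

/* Hot fields: read or written on every refill. 12 bytes of payload,
 * padded to 16 bytes, so four entries share one cache line. */
typedef struct {
    void    *freelist;
    uint16_t used;
    uint16_t capacity;
} TinySlabMetaHot;

/* Cold fields: rarely touched. Widths here are placeholders. */
typedef struct {
    uint8_t  class_idx;
    uint16_t carved;
} TinySlabMetaCold;

/* Hot array first, so refills walk one dense 512-byte region. */
typedef struct {
    TinySlabMetaHot  slabs_hot[SLABS_PER_SS];
    TinySlabMetaCold slabs_cold[SLABS_PER_SS];
} SlabMetaSplit;

static_assert(sizeof(TinySlabMetaHot) == 16, "hot entry should be 16 bytes");
static_assert(sizeof(TinySlabMetaHot) * SLABS_PER_SS == 8 * CACHE_LINE,
              "hot array should span exactly 8 cache lines");

/* Cache line, counted from the start of slabs_hot, that holds slab idx. */
static size_t hot_line_of(size_t idx) {
    return idx * sizeof(TinySlabMetaHot) / CACHE_LINE;
}
```

Because the cold fields no longer sit between hot entries, a refill that scans neighbouring slabs finds four of them per line instead of being spread across the whole metadata array.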

---

## Performance Impact Summary

### Baseline (Current)

| Metric | Value |
|--------|-------|
| L1D loads | 111.5M per 1M ops |
| L1D misses | 1.88M per 1M ops |
| Miss rate | 1.69% |
| Cache lines touched (alloc) | 3-4 |
| Cache lines touched (refill) | 4-5 |
| Throughput | 24.88M ops/s |

### After Proposals 1.1 + 1.2 + 1.3 (P1 Quick Wins)

| Metric | Current → Optimized | Improvement |
|--------|---------------------|-------------|
| Cache lines (alloc) | 3-4 → **1-2** | -50-67% |
| Cache lines (refill) | 4-5 → **2-3** | -40-50% |
| L1D miss rate | 1.69% → **1.0-1.1%** | -35-40% |
| L1D misses | 1.88M → **1.1-1.2M** | -36-41% |
| Throughput | 24.9M → **34-37M ops/s** | **+36-49%** |

### After Proposals 2.1 + 2.2 (P1+P2 Combined)

| Metric | Current → Optimized | Improvement |
|--------|---------------------|-------------|
| Cache lines (alloc) | 3-4 → **1** | -67-75% |
| Cache lines (refill) | 4-5 → **2** | -50-60% |
| L1D miss rate | 1.69% → **0.6-0.7%** | -59-65% |
| L1D misses | 1.88M → **0.67-0.78M** | -59-64% |
| Throughput | 24.9M → **42-50M ops/s** | **+69-101%** |

### After Proposal 3.1 (P1+P2+P3 Full Stack)

| Metric | Current → Optimized | Improvement |
|--------|---------------------|-------------|
| Cache lines (alloc) | 3-4 → **1** | -67-75% |
| Cache lines (refill) | 4-5 → **1-2** | -60-75% |
| L1D miss rate | 1.69% → **0.4-0.5%** | -71-76% |
| L1D misses | 1.88M → **0.45-0.56M** | -70-76% |
| Throughput | 24.9M → **60-70M ops/s** | **+141-181%** |
| **Throughput vs System** | 26.9% → **65-76%** | **🎯 miss-rate parity with tcache (0.4-0.5% vs 0.46%)** |

---
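
The percentages in the performance tables above follow from three simple ratios. The helpers below are a small sketch for reproducing them; the function names are hypothetical and the inputs are the figures quoted in the tables.

```c
#include <assert.h>

/* L1D miss rate in percent: misses / loads * 100. */
static double miss_rate_pct(double misses, double loads) {
    assert(loads > 0.0);
    return 100.0 * misses / loads;
}

/* Relative reduction in percent when a cost metric (misses, miss rate,
 * cache lines touched) drops from `before` to `after`. */
static double reduction_pct(double before, double after) {
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}

/* Relative gain in percent when throughput rises from `before` to `after`. */
static double gain_pct(double before, double after) {
    assert(before > 0.0);
    return 100.0 * (after - before) / before;
}
```

For example, `miss_rate_pct(1.88e6, 111.5e6)` evaluates to about 1.69, `reduction_pct(1.88, 0.56)` to about 70, and `gain_pct(24.9, 60.0)` to about 141, matching the baseline miss rate and the lower bounds of the full-stack row values.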

## Key Takeaways

1. **Current bottleneck**: 3-4 cache lines touched per allocation (vs 1-2 for tcache)
2. **Root cause**: Hot fields are scattered across the SuperSlab (18 cache lines), and TLS head/count sit on separate lines
3. **Quick win (P1)**: Merge TLS head/count → 35-40% lower miss rate (1 day of work)
4. **Medium win (P1+P2)**: Hot/cold split + prefetch → 59-65% lower miss rate (1 week of work)
5. **Long-term (P1+P2+P3)**: TLS metadata cache → 71-76% lower miss rate (2 weeks of work, miss-rate parity with tcache)

**Next step**: Implement Proposal 1.2 (Prefetch) TODAY (2-3 hours, +8-12% gain) 🚀
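
The contents of Proposal 1.2 are not spelled out in this diagram, so the following is only a guess at its shape, based on the PREFETCHED annotations in steps [5] and [6] of the optimized refill path. The struct and function names are stand-ins, the field widths are assumed, and `__builtin_prefetch` requires a GCC- or Clang-compatible compiler.

```c
#include <assert.h>
#include <stdint.h>

#define SLABS_PER_SS 32   /* slabs[0-31] in the heatmap */

/* Stand-in types: only the fields named in the diagrams. */
typedef struct {
    void    *freelist;
    uint16_t used;
    uint16_t capacity;
} SlabMetaSketch;

typedef struct {
    uint32_t       slab_bitmap;
    uint32_t       nonempty_mask;
    uint32_t       freelist_mask;
    SlabMetaSketch slabs[SLABS_PER_SS];
} SuperSlabSketch;

/* Hint the two lines the refill path reads first. Intended to be issued
 * early (for example when the TLS list is nearly empty) so the loads
 * overlap with the remaining fast-path work. A prefetch is only a hint:
 * it does not change program state. */
static inline void ss_prefetch_for_refill(const SuperSlabSketch *ss, unsigned idx) {
    __builtin_prefetch(&ss->slab_bitmap, 0, 3);  /* read, high temporal locality */
    __builtin_prefetch(&ss->slabs[idx], 1, 3);   /* write intent: used++ follows */
}

/* Simplified refill step: take one block from slab idx if the slab is
 * active and has room. Returns 1 on success, 0 otherwise. */
static int ss_take_one(SuperSlabSketch *ss, unsigned idx) {
    SlabMetaSketch *m = &ss->slabs[idx];
    if (!(ss->slab_bitmap & (1u << idx)) || m->used >= m->capacity)
        return 0;
    m->used++;
    return 1;
}
```

How early the hint can be issued decides how much of the miss latency it hides, so the call site matters more than the prefetch itself; any gain should be confirmed with the same L1D counters used for the baseline.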