# L1D Cache Miss Hotspot Diagram

## Memory Access Pattern Comparison

### Current HAKMEM (1.88M L1D misses per 1M ops)

```
┌─────────────────────────────────────────────────────────────────┐
│ Allocation Fast Path (tiny_alloc_fast)                          │
└─────────────────────────────────────────────────────────────────┘
    │
    ├─► [1] TLS Cache Access (Cache Line 0)
    │   ┌──────────────────────────────────────┐
    │   │ g_tls_sll_head[cls]  ← Load (8B)     │ ✅ L1 HIT (likely)
    │   └──────────────────────────────────────┘
    │
    ├─► [2] TLS Count Access (Cache Line 1)
    │   ┌──────────────────────────────────────┐
    │   │ g_tls_sll_count[cls] ← Load (4B)     │ ❌ L1 MISS (~10%)
    │   └──────────────────────────────────────┘
    │
    ├─► [3] Next Pointer Deref (Random Cache Line)
    │   ┌──────────────────────────────────────┐
    │   │ *(void**)ptr         ← Load (8B)     │ ❌ L1 MISS (~40%)
    │   │ (depends on freelist block location) │    (random access)
    │   └──────────────────────────────────────┘
    │
    └─► [4] TLS Count Update (Cache Line 1)
        ┌──────────────────────────────────────┐
        │ g_tls_sll_count[cls]-- ← Store (4B)  │ ❌ L1 MISS (~5%)
        └──────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ Refill Path (sll_refill_batch_from_ss)                          │
└─────────────────────────────────────────────────────────────────┘
    │
    ├─► [5] TinyTLSSlab Access
    │   ┌──────────────────────────────────────┐
    │   │ g_tls_slabs[cls]     ← Load (24B)    │ ✅ L1 HIT (TLS)
    │   └──────────────────────────────────────┘
    │
    ├─► [6] SuperSlab Hot Fields (Cache Line 0)
    │   ┌──────────────────────────────────────┐
    │   │ ss->slab_bitmap   ← Load (4B)        │ ❌ L1 MISS (~30%)
    │   │ ss->nonempty_mask ← Load (4B)        │    (same line, but
    │   │ ss->freelist_mask ← Load (4B)        │    miss on first access)
    │   └──────────────────────────────────────┘
    │
    ├─► [7] SlabMeta Access (Cache Line 9+)
    │   ┌──────────────────────────────────────┐
    │   │ ss->slabs[idx].freelist ← Load (8B)  │ ❌ L1 MISS (~50%)
    │   │ ss->slabs[idx].used     ← Load (2B)  │    (600+ bytes offset
    │   │ ss->slabs[idx].capacity ← Load (2B)  │    from ss base)
    │   └──────────────────────────────────────┘
    │
    └─► [8] SlabMeta Update (Cache Line 9+)
        ┌──────────────────────────────────────┐
        │ ss->slabs[idx].used++   ← Store (2B) │ ✅ HIT (same as [7])
        └──────────────────────────────────────┘

Total Cache Lines Touched: 4-5 per refill (Lines 0, 1, 9+, random freelist)
L1D Miss Rate: ~1.69% (1.88M misses / 111.5M loads)
```

---

### Optimized HAKMEM (Target: <0.5% miss rate)

```
┌─────────────────────────────────────────────────────────────────┐
│ Allocation Fast Path (tiny_alloc_fast) - OPTIMIZED              │
└─────────────────────────────────────────────────────────────────┘
    │
    ├─► [1] TLS Cache Entry (Cache Line 0) - MERGED
    │   ┌──────────────────────────────────────┐
    │   │ g_tls_cache[cls].head  ← Load (8B)   │ ✅ L1 HIT (~95%)
    │   │ g_tls_cache[cls].count ← Load (4B)   │ ✅ SAME CACHE LINE!
    │   │ (both in same 16B struct)            │
    │   └──────────────────────────────────────┘
    │
    ├─► [2] Next Pointer Deref (Prefetched)
    │   ┌──────────────────────────────────────┐
    │   │ *(void**)ptr         ← Load (8B)     │ ✅ L1 HIT (~70%)
    │   │ __builtin_prefetch()                 │    (prefetch hint!)
    │   └──────────────────────────────────────┘
    │
    └─► [3] TLS Cache Update (Cache Line 0)
        ┌──────────────────────────────────────┐
        │ g_tls_cache[cls].head  ← Store (8B)  │ ✅ L1 HIT (write-back)
        │ g_tls_cache[cls].count ← Store (4B)  │ ✅ SAME CACHE LINE!
        └──────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ Refill Path (sll_refill_batch_from_ss) - OPTIMIZED              │
└─────────────────────────────────────────────────────────────────┘
    │
    ├─► [4] TLS Cache Entry (Cache Line 0)
    │   ┌──────────────────────────────────────┐
    │   │ g_tls_cache[cls]     ← Load (16B)    │ ✅ L1 HIT (same as [1])
    │   └──────────────────────────────────────┘
    │
    ├─► [5] SuperSlab Hot Fields (Cache Line 0) - PREFETCHED
    │   ┌──────────────────────────────────────┐
    │   │ ss->slab_bitmap   ← Load (4B)        │ ✅ L1 HIT (~85%)
    │   │ ss->nonempty_mask ← Load (4B)        │    (prefetched +
    │   │ ss->freelist_mask ← Load (4B)        │    cache line 0!)
    │   │ __builtin_prefetch(&ss->slab_bitmap) │
    │   └──────────────────────────────────────┘
    │
    ├─► [6] SlabMeta HOT Fields ONLY (Cache Line 2) - SPLIT
    │   ┌──────────────────────────────────────┐
    │   │ ss->slabs_hot[idx].freelist ← (8B)   │ ✅ L1 HIT (~75%)
    │   │ ss->slabs_hot[idx].used     ← (2B)   │    (hot/cold split +
    │   │ ss->slabs_hot[idx].capacity ← (2B)   │    prefetch!)
    │   │ (NO cold fields: class_idx, carved)  │
    │   └──────────────────────────────────────┘
    │
    └─► [7] SlabMeta Update (Cache Line 2)
        ┌──────────────────────────────────────┐
        │ ss->slabs_hot[idx].used++   ← (2B)   │ ✅ HIT (same as [6])
        └──────────────────────────────────────┘

Total Cache Lines Touched: 2-3 per refill (Lines 0, 2, prefetched)
L1D Miss Rate: ~0.4-0.5% (target: <0.5M misses / 111.5M loads)
Improvement: 71-76% L1D miss reduction! ✅
```

---

### System malloc (glibc tcache) - Reference (0.46% miss rate)

```
┌─────────────────────────────────────────────────────────────────┐
│ Allocation Fast Path (tcache_get)                               │
└─────────────────────────────────────────────────────────────────┘
    │
    ├─► [1] TLS tcache Entry (Cache Line 2-9)
    │   ┌──────────────────────────────────────┐
    │   │ tcache->entries[bin] ← Load (8B)     │ ✅ L1 HIT (~98%)
    │   │ (direct pointer array, no counts)    │    (1 cache line only!)
    │   └──────────────────────────────────────┘
    │
    ├─► [2] Next Pointer Deref (Random)
    │   ┌──────────────────────────────────────┐
    │   │ *(tcache_entry**)ptr ← Load (8B)     │ ❌ L1 MISS (~20%)
    │   └──────────────────────────────────────┘
    │
    └─► [3] TLS Entry Update (Cache Line 2-9)
        ┌──────────────────────────────────────┐
        │ tcache->entries[bin] ← Store (8B)    │ ✅ L1 HIT (write-back)
        └──────────────────────────────────────┘

Total Cache Lines Touched: 1-2 per allocation
L1D Miss Rate: ~0.46% (0.19M misses / 40.8M loads)

Key Insight: tcache NEVER touches counts[] in fast path!
- counts[] only accessed on refill/free threshold (every 64 ops)
- This minimizes cache footprint to 1 cache line (entries[] only)
```

---

## Cache Line Access Heatmap

### Current HAKMEM (Hot = High Miss Rate)

```
SuperSlab Structure (1112 bytes, 18 cache lines):
┌─────┬─────────────────────────────────────────────────────┐
│ Line│ Contents                                            │ Miss Rate
├─────┼─────────────────────────────────────────────────────┤
│  0  │ magic, lg_size, total_active, slab_bitmap, ...      │ 🔥 30%
│  1  │ refcount, listed, next_chunk, ...                   │ 🟢 <1%
│  2  │ last_used_ns, generation, lru_prev, lru_next        │ 🟢 <1%
│ 3-7 │ remote_heads[0-31] (atomic pointers)                │ 🟡 10%
│ 8-9 │ remote_counts[0-31], slab_listed[0-31]              │ 🟢 <1%
│10-17│ slabs[0-31] (TinySlabMeta array, 512B)              │ 🔥 50%
└─────┴─────────────────────────────────────────────────────┘

TLS Cache (96 bytes, 2 cache lines):
┌─────┬─────────────────────────────────────────────────────┐
│ Line│ Contents                                            │ Miss Rate
├─────┼─────────────────────────────────────────────────────┤
│  0  │ g_tls_sll_head[0-7] (64 bytes)                      │ 🟢 <5%
│  1  │ g_tls_sll_count[0-7] (32B) + padding (32B)          │ 🟡 10%
└─────┴─────────────────────────────────────────────────────┘
```

### Optimized HAKMEM (After Proposals 1.1 + 2.1)

```
SuperSlab Structure (1112 bytes, 18 cache lines):
┌─────┬─────────────────────────────────────────────────────┐
│ Line│ Contents                                            │ Miss Rate
├─────┼─────────────────────────────────────────────────────┤
│  0  │ slab_bitmap, nonempty_mask, freelist_mask, ...      │ 🟢 5-10%
│     │ (HOT FIELDS ONLY, prefetched!)                      │ (prefetch!)
│  1  │ refcount, listed, next_chunk (COLD fields)          │ 🟢 <1%
│ 2-9 │ slabs_hot[0-31] (HOT fields only, 512B)             │ 🟡 15-20%
│     │ (freelist, used, capacity - prefetched!)            │ (prefetch!)
│10-11│ slabs_cold[0-31] (COLD: class_idx, carved, ...)     │ 🟢 <1%
│12-17│ remote_heads, remote_counts, slab_listed            │ 🟢 <1%
└─────┴─────────────────────────────────────────────────────┘

TLS Cache (128 bytes, 2 cache lines):
┌─────┬─────────────────────────────────────────────────────┐
│ Line│ Contents                                            │ Miss Rate
├─────┼─────────────────────────────────────────────────────┤
│  0  │ g_tls_cache[0-3] (head+count+capacity, 64B)         │ 🟢 <2%
│  1  │ g_tls_cache[4-7] (head+count+capacity, 64B)         │ 🟢 <2%
│     │ (merged structure, same cache line access!)         │
└─────┴─────────────────────────────────────────────────────┘
```

---

## Performance Impact Summary

### Baseline (Current)

| Metric | Value |
|--------|-------|
| L1D loads | 111.5M per 1M ops |
| L1D misses | 1.88M per 1M ops |
| Miss rate | 1.69% |
| Cache lines touched (alloc) | 3-4 |
| Cache lines touched (refill) | 4-5 |
| Throughput | 24.88M ops/s |

### After Proposal 1.1 + 1.2 + 1.3 (P1 Quick Wins)

| Metric | Current → Optimized | Improvement |
|--------|---------------------|-------------|
| Cache lines (alloc) | 3-4 → **1-2** | -50-67% |
| Cache lines (refill) | 4-5 → **2-3** | -40-50% |
| L1D miss rate | 1.69% → **1.0-1.1%** | -35-40% |
| L1D misses | 1.88M → **1.1-1.2M** | -36-41% |
| Throughput | 24.9M → **34-37M ops/s** | **+36-49%** |

### After Proposal 2.1 + 2.2 (P1+P2 Combined)

| Metric | Current → Optimized | Improvement |
|--------|---------------------|-------------|
| Cache lines (alloc) | 3-4 → **1** | -67-75% |
| Cache lines (refill) | 4-5 → **2** | -50-60% |
| L1D miss rate | 1.69% → **0.6-0.7%** | -59-65% |
| L1D misses | 1.88M → **0.67-0.78M** | -59-64% |
| Throughput | 24.9M → **42-50M ops/s** | **+69-101%** |

### After Proposal 3.1 (P1+P2+P3 Full Stack)

| Metric | Current → Optimized | Improvement |
|--------|---------------------|-------------|
| Cache lines (alloc) | 3-4 → **1** | -67-75% |
| Cache lines (refill) | 4-5 → **1-2** | -60-75% |
| L1D miss rate | 1.69% → **0.4-0.5%** | -71-76% |
| L1D misses | 1.88M → **0.45-0.56M** | -70-76% |
| Throughput | 24.9M → **60-70M ops/s** | **+141-181%** |
| **vs System** | 26.9% → **65-76%** | **🎯 tcache parity!** |

---

## Key Takeaways

1. **Current bottleneck**: 3-4 cache lines touched per allocation (vs tcache's 1)
2. **Root cause**: Scattered hot fields across SuperSlab (18 cache lines)
3. **Quick win**: Merge TLS head/count → -35-40% miss rate in 1 day
4. **Medium win**: Hot/cold split + prefetch → -59-65% miss rate in 1 week
5. **Long-term**: TLS metadata cache → -71-76% miss rate in 2 weeks (tcache parity!)

**Next step**: Implement Proposal 1.2 (Prefetch) TODAY (2-3 hours, +8-12% gain) 🚀
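
### Sketch: Merged TLS Entry + Prefetch

The merged TLS entry and the prefetched next-pointer dereference from the optimized fast path could look roughly like the sketch below. It is not the HAKMEM implementation: `g_tls_cache` and the head/count/capacity layout come from the diagrams above, while `TinyTLSCacheEntry`, `TINY_NUM_CLASSES`, `TINY_PREFETCH`, `tiny_cache_pop`, and `tiny_cache_push` are hypothetical names chosen for illustration. It assumes a 64-bit target, 64-byte cache lines, and blocks large enough to hold one pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 8 tiny size classes, matching g_tls_cache[0-7] in the heatmap. */
#define TINY_NUM_CLASSES 8

/* Prefetch is only a hint; fall back to a no-op on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define TINY_PREFETCH(p) __builtin_prefetch((p), 0, 3)
#else
#define TINY_PREFETCH(p) ((void)(p))
#endif

/* Hypothetical merged entry: head + count + capacity is 16 bytes on a
 * 64-bit target, so four classes share one 64-byte cache line. */
typedef struct {
    void    *head;      /* top of the per-class free list */
    uint32_t count;     /* blocks currently cached */
    uint32_t capacity;  /* maximum blocks to keep cached */
} TinyTLSCacheEntry;

/* Aligned to a cache-line boundary so entries never straddle two lines. */
static _Thread_local _Alignas(64) TinyTLSCacheEntry g_tls_cache[TINY_NUM_CLASSES];

/* Pop one block. Returns NULL when the list is empty, in which case the
 * caller would take the refill path. */
static inline void *tiny_cache_pop(int cls) {
    TinyTLSCacheEntry *e = &g_tls_cache[cls];
    void *ptr = e->head;             /* head and count sit in the same line */
    if (ptr == NULL) return NULL;
    void *next = *(void **)ptr;      /* next pointer is stored in the block */
    if (next != NULL) TINY_PREFETCH(next);  /* warm the line the next pop reads */
    e->head = next;
    e->count--;                      /* store hits the line loaded above */
    return ptr;
}

/* Push one block. Returns 1 on success, 0 when the cache is at capacity. */
static inline int tiny_cache_push(int cls, void *ptr) {
    TinyTLSCacheEntry *e = &g_tls_cache[cls];
    if (e->count >= e->capacity) return 0;
    *(void **)ptr = e->head;         /* link the block in front of the old head */
    e->head = ptr;
    e->count++;
    return 1;
}
```

Keeping `count` next to `head` means the decrement in `tiny_cache_pop` touches a line the head load has already brought in, which is the effect steps [1] and [3] of the optimized diagram rely on. Whether the prefetch of `next` pays off depends on how soon the following allocation arrives, so the hit rates quoted above should be checked by measurement.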