hakmem/docs/analysis/L1D_CACHE_MISS_EXECUTIVE_SUMMARY.md

# L1D Cache Miss Analysis - Executive Summary

**Date**: 2025-11-19
**Analyst**: Claude (Sonnet 4.5)
**Status**: ✅ ROOT CAUSE IDENTIFIED - ACTIONABLE PLAN READY

---

## TL;DR

**Problem**: HAKMEM is **3.8x slower** than System malloc (24.9M vs 92.3M ops/s)
**Root Cause**: **L1D cache misses** (9.9x more than System: 1.88M vs 0.19M per 1M ops)
**Impact**: 75% of performance gap caused by poor cache locality
**Solution**: 3-phase optimization plan (prefetch + hot/cold split + TLS merge)
**Expected Gain**: **+36-49% in 1-2 days**, **+150-200% in 2 weeks** (System parity!)

---

## Key Findings

### Performance Gap Analysis

| Metric | HAKMEM | System malloc | Ratio | Status |
|--------|---------|---------------|-------|---------|
| Throughput | 24.88M ops/s | 92.31M ops/s | **3.71x slower** | 🔴 CRITICAL |
| L1D loads | 111.5M | 40.8M | 2.73x more | 🟡 High |
| **L1D misses** | **1.88M** | **0.19M** | **🔥 9.9x worse** | 🔴 **BOTTLENECK** |
| L1D miss rate | 1.69% | 0.46% | 3.67x worse | 🔴 Critical |
| Instructions | 275.2M | 92.3M | 2.98x more | 🟡 High |
| IPC | 1.52 | 2.06 | 0.74x worse | 🟡 Memory-bound |

**Conclusion**: L1D cache misses are the **PRIMARY bottleneck**, accounting for ~75% of the performance gap (338M cycles penalty out of 450M total gap).

---

### Root Cause: Metadata-Heavy Access Pattern

#### Problem 1: SuperSlab Structure (1112 bytes, 18 cache lines)

**Current layout** - Hot fields scattered:
```
Cache Line 0:  magic, lg_size, total_active, slab_bitmap ⭐, nonempty_mask ⭐, freelist_mask ⭐
Cache Line 1:  refcount, listed, next_chunk (COLD fields)
Cache Line 9+: slabs[0-31] ⭐ (512 bytes, HOT metadata)
               ↑ 600 bytes offset from SuperSlab base!
```

**Issue**: Hot path touches **2+ cache lines** (bitmasks on line 0, SlabMeta on line 9+)
**Expected fix**: Cluster hot fields in cache line 0 → **-25% L1D misses**

---

#### Problem 2: TinySlabMeta (16 bytes, but wastes space)

**Current layout**:
```c
struct TinySlabMeta {
    void*    freelist;   // 8B ⭐ HOT
    uint16_t used;       // 2B ⭐ HOT
    uint16_t capacity;   // 2B ⭐ HOT
    uint8_t  class_idx;  // 1B 🔥 COLD (set once)
    uint8_t  carved;     // 1B 🔥 COLD (rarely changed)
    uint8_t  owner_tid;  // 1B 🔥 COLD (debug only)
    // 1B padding
};  // Total: 16B (fits in 1 cache line, but 6 bytes wasted on cold fields!)
```

**Issue**: 6 cold bytes occupy precious L1D cache, wasting **37.5% of cache line**
**Expected fix**: Split hot/cold → **-20% L1D misses**

---

#### Problem 3: TLS Cache Split (2 cache lines)

**Current layout**:
```c
__thread void* g_tls_sll_head[8];      // 64B (cache line 0)
__thread uint32_t g_tls_sll_count[8];  // 32B (cache line 1)
```

**Access pattern on alloc**:
1. Load `g_tls_sll_head[cls]` → Cache line 0 ✅
2. Load next pointer → Random cache line ❌
3. Write `g_tls_sll_head[cls]` → Cache line 0 ✅
4. Decrement `g_tls_sll_count[cls]` → Cache line 1 ❌

**Issue**: **2 cache lines** accessed per alloc (head + count separate)
**Expected fix**: Merge into `TLSCacheEntry` struct → **-15% L1D misses**

---

### Comparison: HAKMEM vs glibc tcache

| Aspect | HAKMEM | glibc tcache | Impact |
|--------|---------|--------------|---------|
| Cache lines (alloc) | **3-4** | **1** | 3-4x more misses |
| Metadata indirections | TLS → SS → SlabMeta → freelist (**3 loads**) | TLS → freelist (**1 load**) | 3x more loads |
| Count checks | Every alloc/free | Threshold-based (every 64 ops) | Frequent updates |
| Hot path cache footprint | **4-5 cache lines** | **1 cache line** | 4-5x larger |

**Insight**: tcache's design minimizes cache footprint by:
1. Direct TLS freelist access (no SuperSlab indirection)
2. Counts[] rarely accessed in hot path
3. All hot fields in 1 cache line (entries[] array)

HAKMEM can achieve similar locality with proposed optimizations.

---

## Optimization Plan

### Phase 1: Quick Wins (1-2 days, +36-49% gain) 🚀

**Priority**: P0 (Critical Path)
**Effort**: 6-8 hours implementation, 2-3 hours testing
**Risk**: Low (incremental changes, easy rollback)

#### Optimizations:

1. **Prefetch (2-3 hours)**
   - Add `__builtin_prefetch()` to refill + alloc paths
   - Prefetch SuperSlab hot fields, SlabMeta, next pointers
   - **Impact**: -10-15% L1D miss rate, +8-12% throughput

2. **Hot/Cold SlabMeta Split (4-6 hours)**
   - Separate `TinySlabMeta` into `TinySlabMetaHot` (freelist, used, capacity) and `TinySlabMetaCold` (class_idx, carved, owner_tid)
   - Keep hot fields contiguous (512B), move cold to separate array (128B)
   - **Impact**: -20% L1D miss rate, +15-20% throughput

3. **TLS Cache Merge (6-8 hours)**
   - Replace `g_tls_sll_head[]` + `g_tls_sll_count[]` with unified `TLSCacheEntry` struct
   - Merge head + count into same cache line (16B per class)
   - **Impact**: -15% L1D miss rate, +12-18% throughput

**Cumulative Impact**:
- L1D miss rate: 1.69% → **1.0-1.1%** (-35-41%)
- Throughput: 24.9M → **34-37M ops/s** (+36-49%)
- **Target**: Achieve **40% of System malloc** performance (from 27%)

---

### Phase 2: Medium Effort (1 week, +70-100% cumulative gain)

**Priority**: P1 (High Impact)
**Effort**: 3-5 days implementation
**Risk**: Medium (requires architectural changes)

#### Optimizations:

1. **SuperSlab Hot Field Clustering (3-4 days)**
   - Move all hot fields (slab_bitmap, nonempty_mask, freelist_mask, active_slabs) to cache line 0
   - Separate cold fields (refcount, listed, lru_prev) to cache line 1+
   - **Impact**: -25% L1D miss rate (additional), +18-25% throughput

2. **Dynamic SlabMeta Allocation (1-2 days)**
   - Allocate `TinySlabMetaHot` on demand (only for active slabs)
   - Replace 32-slot `slabs_hot[]` array with pointer array (256B → 32 pointers)
   - **Impact**: -30% L1D miss rate (additional), +20-28% throughput

**Cumulative Impact**:
- L1D miss rate: 1.69% → **0.6-0.7%** (-59-65%)
- Throughput: 24.9M → **42-50M ops/s** (+69-101%)
- **Target**: Achieve **50-54% of System malloc** performance

---

### Phase 3: High Impact (2 weeks, +150-200% cumulative gain)

**Priority**: P2 (Long-term, tcache parity)
**Effort**: 1-2 weeks implementation
**Risk**: High (major architectural change)

#### Optimizations:

1. **TLS-Local Metadata Cache (1 week)**
   - Cache `TinySlabMeta` fields (used, capacity, freelist) in TLS
   - Eliminate SuperSlab indirection on hot path (3 loads → 1 load)
   - Periodically sync TLS cache → SuperSlab (threshold-based)
   - **Impact**: -60% L1D miss rate (additional), +80-120% throughput

2. **Per-Class SuperSlab Affinity (1 week)**
   - Pin 1 "hot" SuperSlab per class in TLS pointer
   - LRU eviction for cold SuperSlabs
   - Prefetch hot SuperSlab on class switch
   - **Impact**: -25% L1D miss rate (additional), +18-25% throughput

**Cumulative Impact**:
- L1D miss rate: 1.69% → **0.4-0.5%** (-71-76%)
- Throughput: 24.9M → **60-70M ops/s** (+141-181%)
- **Target**: **tcache parity** (65-76% of System malloc)

---

## Recommended Immediate Action

### Today (2-3 hours):

**Implement Proposal 1.2: Prefetch Optimization**

1. Add prefetch to refill path (`core/hakmem_tiny_refill_p0.inc.h`):
   ```c
   if (tls->ss) {
       __builtin_prefetch(&tls->ss->slab_bitmap, 0, 3);
   }
   __builtin_prefetch(&meta->freelist, 0, 3);
   ```

2. Add prefetch to alloc path (`core/tiny_alloc_fast.inc.h`):
   ```c
   __builtin_prefetch(&g_tls_sll_head[class_idx], 0, 3);
   if (ptr) __builtin_prefetch(ptr, 0, 3);  // Next freelist entry
   ```

3. Build & benchmark:
   ```bash
   ./build.sh bench_random_mixed_hakmem
   perf stat -e L1-dcache-load-misses -r 10 \
     ./out/release/bench_random_mixed_hakmem 1000000 256 42
   ```

**Expected Result**: +8-12% throughput (24.9M → 27-28M ops/s) in **2-3 hours**! 🚀

---

### Tomorrow (4-6 hours):

**Implement Proposal 1.1: Hot/Cold SlabMeta Split**

1. Define `TinySlabMetaHot` and `TinySlabMetaCold` structs
2. Update `SuperSlab` to use separate arrays (`slabs_hot[]`, `slabs_cold[]`)
3. Add accessor functions for gradual migration
4. Migrate critical hot paths (refill, alloc, free)

**Expected Result**: +15-20% additional throughput (cumulative: +25-35%)

---

### Week 1 Target:

Complete **Phase 1 (Quick Wins)** by end of week:
- All 3 optimizations implemented and validated
- L1D miss rate reduced to **1.0-1.1%** (from 1.69%)
- Throughput improved to **34-37M ops/s** (from 24.9M)
- **+36-49% performance gain** 🎯

---

## Risk Mitigation

### Technical Risks:

1. **Correctness (Hot/Cold Split)**: Medium risk
   - **Mitigation**: Extensive testing (AddressSanitizer, regression tests, fuzzing)
   - Gradual migration using accessor functions (not big-bang refactor)

2. **Performance Regression (Prefetch)**: Low risk
   - **Mitigation**: A/B test with `HAKMEM_PREFETCH=0/1` env flag
   - Easy rollback (single commit)

3. **Complexity (TLS Merge)**: Medium risk
   - **Mitigation**: Update all access sites systematically (use grep to find all references)
   - Compile-time checks to catch missed migrations

4. **Memory Overhead (Dynamic Alloc)**: Low risk
   - **Mitigation**: Use slab allocator for `TinySlabMetaHot` (fixed-size, no fragmentation)

---

## Success Criteria

### Phase 1 Completion (Week 1):

- ✅ L1D miss rate < 1.1% (from 1.69%)
- ✅ Throughput > 34M ops/s (+36% minimum)
- ✅ All regression tests pass
- ✅ AddressSanitizer clean (no leaks, no buffer overflows)
- ✅ 1-hour stress test stable (100M ops, no crashes)

### Phase 2 Completion (Week 2):

- ✅ L1D miss rate < 0.7% (from 1.69%)
- ✅ Throughput > 42M ops/s (+69% minimum)
- ✅ Multi-threaded workload stable (Larson 4T)

### Phase 3 Completion (Week 3-4):

- ✅ L1D miss rate < 0.5% (from 1.69%, **tcache parity!**)
- ✅ Throughput > 60M ops/s (+141% minimum, **65% of System malloc**)
- ✅ Memory efficiency maintained (no significant RSS increase)

---

## Documentation

### Detailed Reports:

1. **`L1D_CACHE_MISS_ANALYSIS_REPORT.md`** - Full technical analysis
   - Perf profiling results
   - Data structure analysis
   - Comparison with glibc tcache
   - Detailed optimization proposals (P1-P3)

2. **`L1D_CACHE_MISS_HOTSPOT_DIAGRAM.md`** - Visual diagrams
   - Memory access pattern comparison
   - Cache line heatmaps
   - Before/after optimization flowcharts

3. **`L1D_OPTIMIZATION_QUICK_START_GUIDE.md`** - Implementation guide
   - Step-by-step code changes
   - Build & test instructions
   - Rollback procedures
   - Troubleshooting tips

---

## Next Steps

### Immediate (Today):

1. ✅ **Review this summary** with team (15 minutes)
2. 🚀 **Start Proposal 1.2 (Prefetch)** implementation (2-3 hours)
3. 📊 **Baseline benchmark** (save current L1D miss rate for comparison)

### This Week:

1. Complete **Phase 1 Quick Wins** (Prefetch + Hot/Cold Split + TLS Merge)
2. Validate **+36-49% gain** with comprehensive testing
3. Document results and plan Phase 2 rollout

### Next 2-4 Weeks:

1. **Phase 2**: SuperSlab optimization (+70-100% cumulative)
2. **Phase 3**: TLS metadata cache (+150-200% cumulative, **tcache parity!**)

---

## Conclusion

**L1D cache misses are the root cause of HAKMEM's 3.8x performance gap** vs System malloc. The proposed 3-phase optimization plan systematically addresses metadata access patterns to achieve:

- **Short-term** (1-2 days): +36-49% gain with prefetch + hot/cold split + TLS merge
- **Medium-term** (1 week): +70-100% cumulative gain with SuperSlab optimization
- **Long-term** (2 weeks): +150-200% cumulative gain, **achieving tcache parity** (60-70M ops/s)

**Recommendation**: Start with **Proposal 1.2 (Prefetch)** TODAY to get quick wins (+8-12%) and build momentum. 🚀

**Contact**: See detailed guides for step-by-step implementation instructions and troubleshooting support.

---

**Status**: ✅ READY FOR IMPLEMENTATION
**Next Action**: Begin Proposal 1.2 (Prefetch) - see `L1D_OPTIMIZATION_QUICK_START_GUIDE.md`