Files
hakmem/docs/analysis/L1D_CACHE_MISS_HOTSPOT_DIAGRAM.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 13:14:18 +09:00

272 lines
18 KiB
Markdown

# L1D Cache Miss Hotspot Diagram
## Memory Access Pattern Comparison
### Current HAKMEM (1.88M L1D misses per 1M ops)
```
┌─────────────────────────────────────────────────────────────────┐
│ Allocation Fast Path (tiny_alloc_fast) │
└─────────────────────────────────────────────────────────────────┘
├─► [1] TLS Cache Access (Cache Line 0)
│ ┌──────────────────────────────────────┐
│ │ g_tls_sll_head[cls] ← Load (8B) │ ✅ L1 HIT (likely)
│ └──────────────────────────────────────┘
├─► [2] TLS Count Access (Cache Line 1)
│ ┌──────────────────────────────────────┐
│ │ g_tls_sll_count[cls] ← Load (4B) │ ❌ L1 MISS (~10%)
│ └──────────────────────────────────────┘
├─► [3] Next Pointer Deref (Random Cache Line)
│ ┌──────────────────────────────────────┐
│ │ *(void**)ptr ← Load (8B) │ ❌ L1 MISS (~40%)
│ │ (depends on freelist block location)│ (random access)
│ └──────────────────────────────────────┘
└─► [4] TLS Count Update (Cache Line 1)
┌──────────────────────────────────────┐
│ g_tls_sll_count[cls]-- ← Store (4B) │ ❌ L1 MISS (~5%)
└──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Refill Path (sll_refill_batch_from_ss) │
└─────────────────────────────────────────────────────────────────┘
├─► [5] TinyTLSSlab Access
│ ┌──────────────────────────────────────┐
│ │ g_tls_slabs[cls] ← Load (24B) │ ✅ L1 HIT (TLS)
│ └──────────────────────────────────────┘
├─► [6] SuperSlab Hot Fields (Cache Line 0)
│ ┌──────────────────────────────────────┐
│ │ ss->slab_bitmap ← Load (4B) │ ❌ L1 MISS (~30%)
│ │ ss->nonempty_mask ← Load (4B) │ (same line, but
│ │ ss->freelist_mask ← Load (4B) │ miss on first access)
│ └──────────────────────────────────────┘
├─► [7] SlabMeta Access (Cache Line 9+)
│ ┌──────────────────────────────────────┐
│ │ ss->slabs[idx].freelist ← Load (8B) │ ❌ L1 MISS (~50%)
│ │ ss->slabs[idx].used ← Load (2B) │ (600+ bytes offset
│ │ ss->slabs[idx].capacity ← Load (2B) │ from ss base)
│ └──────────────────────────────────────┘
└─► [8] SlabMeta Update (Cache Line 9+)
┌──────────────────────────────────────┐
│ ss->slabs[idx].used++ ← Store (2B)│ ✅ HIT (same as [7])
└──────────────────────────────────────┘
Total Cache Lines Touched: 4-5 per refill (Lines 0, 1, 9+, random freelist)
L1D Miss Rate: ~1.69% (1.88M misses / 111.5M loads)
```
---
### Optimized HAKMEM (Target: <0.5% miss rate)
```
┌─────────────────────────────────────────────────────────────────┐
│ Allocation Fast Path (tiny_alloc_fast) - OPTIMIZED │
└─────────────────────────────────────────────────────────────────┘
├─► [1] TLS Cache Entry (Cache Line 0) - MERGED
│ ┌──────────────────────────────────────┐
│ │ g_tls_cache[cls].head ← Load (8B) │ ✅ L1 HIT (~95%)
│ │ g_tls_cache[cls].count ← Load (4B) │ ✅ SAME CACHE LINE!
│ │ (both in same 16B struct) │
│ └──────────────────────────────────────┘
├─► [2] Next Pointer Deref (Prefetched)
│ ┌──────────────────────────────────────┐
│ │ *(void**)ptr ← Load (8B) │ ✅ L1 HIT (~70%)
│ │ __builtin_prefetch() │ (prefetch hint!)
│ └──────────────────────────────────────┘
└─► [3] TLS Cache Update (Cache Line 0)
┌──────────────────────────────────────┐
│ g_tls_cache[cls].head ← Store (8B) │ ✅ L1 HIT (write-back)
│ g_tls_cache[cls].count ← Store (4B) │ ✅ SAME CACHE LINE!
└──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Refill Path (sll_refill_batch_from_ss) - OPTIMIZED │
└─────────────────────────────────────────────────────────────────┘
├─► [4] TLS Cache Entry (Cache Line 0)
│ ┌──────────────────────────────────────┐
│ │ g_tls_cache[cls] ← Load (16B) │ ✅ L1 HIT (same as [1])
│ └──────────────────────────────────────┘
├─► [5] SuperSlab Hot Fields (Cache Line 0) - PREFETCHED
│ ┌──────────────────────────────────────┐
│ │ ss->slab_bitmap ← Load (4B) │ ✅ L1 HIT (~85%)
│ │ ss->nonempty_mask ← Load (4B) │ (prefetched +
│ │ ss->freelist_mask ← Load (4B) │ cache line 0!)
│ │ __builtin_prefetch(&ss->slab_bitmap)│
│ └──────────────────────────────────────┘
├─► [6] SlabMeta HOT Fields ONLY (Cache Line 2) - SPLIT
│ ┌──────────────────────────────────────┐
│ │ ss->slabs_hot[idx].freelist ← (8B) │ ✅ L1 HIT (~75%)
│ │ ss->slabs_hot[idx].used ← (2B) │ (hot/cold split +
│ │ ss->slabs_hot[idx].capacity ← (2B) │ prefetch!)
│ │ (NO cold fields: class_idx, carved) │
│ └──────────────────────────────────────┘
└─► [7] SlabMeta Update (Cache Line 2)
┌──────────────────────────────────────┐
│ ss->slabs_hot[idx].used++ ← (2B) │ ✅ HIT (same as [6])
└──────────────────────────────────────┘
Total Cache Lines Touched: 2-3 per refill (Lines 0, 2, prefetched)
L1D Miss Rate: ~0.4-0.5% (target: <0.5M misses / 111.5M loads)
Improvement: 73-76% L1D miss reduction! ✅
```
---
## System malloc (glibc tcache) - Reference (0.46% miss rate)
```
┌─────────────────────────────────────────────────────────────────┐
│ Allocation Fast Path (tcache_get) │
└─────────────────────────────────────────────────────────────────┘
├─► [1] TLS tcache Entry (Cache Line 2-9)
│ ┌──────────────────────────────────────┐
│ │ tcache->entries[bin] ← Load (8B) │ ✅ L1 HIT (~98%)
│ │ (direct pointer array, no counts) │ (1 cache line only!)
│ └──────────────────────────────────────┘
├─► [2] Next Pointer Deref (Random)
│ ┌──────────────────────────────────────┐
│ │ *(tcache_entry**)ptr ← Load (8B) │ ❌ L1 MISS (~20%)
│ └──────────────────────────────────────┘
└─► [3] TLS Entry Update (Cache Line 2-9)
┌──────────────────────────────────────┐
│ tcache->entries[bin] ← Store (8B) │ ✅ L1 HIT (write-back)
└──────────────────────────────────────┘
Total Cache Lines Touched: 1-2 per allocation
L1D Miss Rate: ~0.46% (0.19M misses / 40.8M loads)
Key Insight: tcache NEVER touches counts[] in fast path!
- counts[] only accessed on refill/free threshold (every 64 ops)
- This minimizes cache footprint to 1 cache line (entries[] only)
```
---
## Cache Line Access Heatmap
### Current HAKMEM (Hot = High Miss Rate)
```
SuperSlab Structure (1112 bytes, 18 cache lines):
┌─────┬─────────────────────────────────────────────────────┐
│ Line│ Contents │ Miss Rate
├─────┼─────────────────────────────────────────────────────┤
│ 0 │ magic, lg_size, total_active, slab_bitmap, ... │ 🔥 30%
│ 1 │ refcount, listed, next_chunk, ... │ 🟢 <1%
│ 2 │ last_used_ns, generation, lru_prev, lru_next │ 🟢 <1%
│ 3-7│ remote_heads[0-31] (atomic pointers) │ 🟡 10%
│ 8-9 │ remote_counts[0-31], slab_listed[0-31] │ 🟢 <1%
│10-17│ slabs[0-31] (TinySlabMeta array, 512B) │ 🔥 50%
└─────┴─────────────────────────────────────────────────────┘
TLS Cache (96 bytes, 2 cache lines):
┌─────┬─────────────────────────────────────────────────────┐
│ Line│ Contents │ Miss Rate
├─────┼─────────────────────────────────────────────────────┤
│ 0 │ g_tls_sll_head[0-7] (64 bytes) │ 🟢 <5%
│ 1 │ g_tls_sll_count[0-7] (32B) + padding (32B) │ 🟡 10%
└─────┴─────────────────────────────────────────────────────┘
```
### Optimized HAKMEM (After Proposals 1.1 + 2.1)
```
SuperSlab Structure (1112 bytes, 18 cache lines):
┌─────┬─────────────────────────────────────────────────────┐
│ Line│ Contents │ Miss Rate
├─────┼─────────────────────────────────────────────────────┤
│ 0 │ slab_bitmap, nonempty_mask, freelist_mask, ... │ 🟢 5-10%
│ │ (HOT FIELDS ONLY, prefetched!) │ (prefetch!)
│ 1 │ refcount, listed, next_chunk (COLD fields) │ 🟢 <1%
│ 2-9│ slabs_hot[0-31] (HOT fields only, 512B) │ 🟡 15-20%
│ │ (freelist, used, capacity - prefetched!) │ (prefetch!)
│10-11│ slabs_cold[0-31] (COLD: class_idx, carved, ...) │ 🟢 <1%
│12-17│ remote_heads, remote_counts, slab_listed │ 🟢 <1%
└─────┴─────────────────────────────────────────────────────┘
TLS Cache (128 bytes, 2 cache lines):
┌─────┬─────────────────────────────────────────────────────┐
│ Line│ Contents │ Miss Rate
├─────┼─────────────────────────────────────────────────────┤
│ 0 │ g_tls_cache[0-3] (head+count+capacity, 64B) │ 🟢 <2%
│ 1 │ g_tls_cache[4-7] (head+count+capacity, 64B) │ 🟢 <2%
│ │ (merged structure, same cache line access!) │
└─────┴─────────────────────────────────────────────────────┘
```
---
## Performance Impact Summary
### Baseline (Current)
| Metric | Value |
|--------|-------|
| L1D loads | 111.5M per 1M ops |
| L1D misses | 1.88M per 1M ops |
| Miss rate | 1.69% |
| Cache lines touched (alloc) | 3-4 |
| Cache lines touched (refill) | 4-5 |
| Throughput | 24.88M ops/s |
### After Proposal 1.1 + 1.2 + 1.3 (P1 Quick Wins)
| Metric | Current → Optimized | Improvement |
|--------|---------------------|-------------|
| Cache lines (alloc) | 3-4 → **1-2** | -50-67% |
| Cache lines (refill) | 4-5 → **2-3** | -40-50% |
| L1D miss rate | 1.69% → **1.0-1.1%** | -35-40% |
| L1D misses | 1.88M → **1.1-1.2M** | -36-41% |
| Throughput | 24.9M → **34-37M ops/s** | **+36-49%** |
### After Proposal 2.1 + 2.2 (P1+P2 Combined)
| Metric | Current → Optimized | Improvement |
|--------|---------------------|-------------|
| Cache lines (alloc) | 3-4 → **1** | -67-75% |
| Cache lines (refill) | 4-5 → **2** | -50-60% |
| L1D miss rate | 1.69% → **0.6-0.7%** | -59-65% |
| L1D misses | 1.88M → **0.67-0.78M** | -59-64% |
| Throughput | 24.9M → **42-50M ops/s** | **+69-101%** |
### After Proposal 3.1 (P1+P2+P3 Full Stack)
| Metric | Current → Optimized | Improvement |
|--------|---------------------|-------------|
| Cache lines (alloc) | 3-4 → **1** | -67-75% |
| Cache lines (refill) | 4-5 → **1-2** | -60-75% |
| L1D miss rate | 1.69% → **0.4-0.5%** | -71-76% |
| L1D misses | 1.88M → **0.45-0.56M** | -70-76% |
| Throughput | 24.9M → **60-70M ops/s** | **+141-181%** |
| **vs System** | 26.9% → **65-76%** | **🎯 tcache parity!** |
---
## Key Takeaways
1. **Current bottleneck**: 3-4 cache lines touched per allocation (vs tcache's 1)
2. **Root cause**: Scattered hot fields across SuperSlab (18 cache lines)
3. **Quick win**: Merge TLS head/count → -35-40% miss rate in 1 day
4. **Medium win**: Hot/cold split + prefetch → -59-65% miss rate in 1 week
5. **Long-term**: TLS metadata cache → -71-76% miss rate in 2 weeks (tcache parity!)
**Next step**: Implement Proposal 1.2 (Prefetch) TODAY (2-3 hours, +8-12% gain) 🚀