Files
hakmem/docs/analysis/L1D_CACHE_MISS_EXECUTIVE_SUMMARY.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 13:14:18 +09:00

12 KiB

L1D Cache Miss Analysis - Executive Summary

Date: 2025-11-19 Analyst: Claude (Sonnet 4.5) Status: ROOT CAUSE IDENTIFIED - ACTIONABLE PLAN READY


TL;DR

Problem: HAKMEM is 3.8x slower than System malloc (24.9M vs 92.3M ops/s) Root Cause: L1D cache misses (9.9x more than System: 1.88M vs 0.19M per 1M ops) Impact: 75% of performance gap caused by poor cache locality Solution: 3-phase optimization plan (prefetch + hot/cold split + TLS merge) Expected Gain: +36-49% in 1-2 days, +150-200% in 2 weeks (System parity!)


Key Findings

Performance Gap Analysis

Metric HAKMEM System malloc Ratio Status
Throughput 24.88M ops/s 92.31M ops/s 3.71x slower 🔴 CRITICAL
L1D loads 111.5M 40.8M 2.73x more 🟡 High
L1D misses 1.88M 0.19M 🔥 9.9x worse 🔴 BOTTLENECK
L1D miss rate 1.69% 0.46% 3.67x worse 🔴 Critical
Instructions 275.2M 92.3M 2.98x more 🟡 High
IPC 1.52 2.06 0.74x worse 🟡 Memory-bound

Conclusion: L1D cache misses are the PRIMARY bottleneck, accounting for ~75% of the performance gap (338M cycles penalty out of 450M total gap).


Root Cause: Metadata-Heavy Access Pattern

Problem 1: SuperSlab Structure (1112 bytes, 18 cache lines)

Current layout - Hot fields scattered:

Cache Line 0:  magic, lg_size, total_active, slab_bitmap ⭐, nonempty_mask ⭐, freelist_mask ⭐
Cache Line 1:  refcount, listed, next_chunk (COLD fields)
Cache Line 9+: slabs[0-31] ⭐ (512 bytes, HOT metadata)
               ↑ 600 bytes offset from SuperSlab base!

Issue: Hot path touches 2+ cache lines (bitmasks on line 0, SlabMeta on line 9+) Expected fix: Cluster hot fields in cache line 0 → -25% L1D misses


Problem 2: TinySlabMeta (16 bytes, but wastes space)

Current layout:

struct TinySlabMeta {
    void*    freelist;   // 8B ⭐ HOT
    uint16_t used;       // 2B ⭐ HOT
    uint16_t capacity;   // 2B ⭐ HOT
    uint8_t  class_idx;  // 1B 🔥 COLD (set once)
    uint8_t  carved;     // 1B 🔥 COLD (rarely changed)
    uint8_t  owner_tid;  // 1B 🔥 COLD (debug only)
    // 1B padding
};  // Total: 16B (fits in 1 cache line, but 6 bytes wasted on cold fields!)

Issue: 6 cold bytes occupy precious L1D cache, wasting 37.5% of cache line Expected fix: Split hot/cold → -20% L1D misses


Problem 3: TLS Cache Split (2 cache lines)

Current layout:

__thread void* g_tls_sll_head[8];      // 64B (cache line 0)
__thread uint32_t g_tls_sll_count[8];  // 32B (cache line 1)

Access pattern on alloc:

  1. Load g_tls_sll_head[cls] → Cache line 0
  2. Load next pointer → Random cache line
  3. Write g_tls_sll_head[cls] → Cache line 0
  4. Decrement g_tls_sll_count[cls] → Cache line 1

Issue: 2 cache lines accessed per alloc (head + count separate) Expected fix: Merge into TLSCacheEntry struct → -15% L1D misses


Comparison: HAKMEM vs glibc tcache

Aspect HAKMEM glibc tcache Impact
Cache lines (alloc) 3-4 1 3-4x more misses
Metadata indirections TLS → SS → SlabMeta → freelist (3 loads) TLS → freelist (1 load) 3x more loads
Count checks Every alloc/free Threshold-based (every 64 ops) Frequent updates
Hot path cache footprint 4-5 cache lines 1 cache line 4-5x larger

Insight: tcache's design minimizes cache footprint by:

  1. Direct TLS freelist access (no SuperSlab indirection)
  2. Counts[] rarely accessed in hot path
  3. All hot fields in 1 cache line (entries[] array)

HAKMEM can achieve similar locality with proposed optimizations.


Optimization Plan

Phase 1: Quick Wins (1-2 days, +36-49% gain) 🚀

Priority: P0 (Critical Path) Effort: 6-8 hours implementation, 2-3 hours testing Risk: Low (incremental changes, easy rollback)

Optimizations:

  1. Prefetch (2-3 hours)

    • Add __builtin_prefetch() to refill + alloc paths
    • Prefetch SuperSlab hot fields, SlabMeta, next pointers
    • Impact: -10-15% L1D miss rate, +8-12% throughput
  2. Hot/Cold SlabMeta Split (4-6 hours)

    • Separate TinySlabMeta into TinySlabMetaHot (freelist, used, capacity) and TinySlabMetaCold (class_idx, carved, owner_tid)
    • Keep hot fields contiguous (512B), move cold to separate array (128B)
    • Impact: -20% L1D miss rate, +15-20% throughput
  3. TLS Cache Merge (6-8 hours)

    • Replace g_tls_sll_head[] + g_tls_sll_count[] with unified TLSCacheEntry struct
    • Merge head + count into same cache line (16B per class)
    • Impact: -15% L1D miss rate, +12-18% throughput

Cumulative Impact:

  • L1D miss rate: 1.69% → 1.0-1.1% (-35-41%)
  • Throughput: 24.9M → 34-37M ops/s (+36-49%)
  • Target: Achieve 40% of System malloc performance (from 27%)

Phase 2: Medium Effort (1 week, +70-100% cumulative gain)

Priority: P1 (High Impact) Effort: 3-5 days implementation Risk: Medium (requires architectural changes)

Optimizations:

  1. SuperSlab Hot Field Clustering (3-4 days)

    • Move all hot fields (slab_bitmap, nonempty_mask, freelist_mask, active_slabs) to cache line 0
    • Separate cold fields (refcount, listed, lru_prev) to cache line 1+
    • Impact: -25% L1D miss rate (additional), +18-25% throughput
  2. Dynamic SlabMeta Allocation (1-2 days)

    • Allocate TinySlabMetaHot on demand (only for active slabs)
    • Replace 32-slot slabs_hot[] array with pointer array (256B → 32 pointers)
    • Impact: -30% L1D miss rate (additional), +20-28% throughput

Cumulative Impact:

  • L1D miss rate: 1.69% → 0.6-0.7% (-59-65%)
  • Throughput: 24.9M → 42-50M ops/s (+69-101%)
  • Target: Achieve 50-54% of System malloc performance

Phase 3: High Impact (2 weeks, +150-200% cumulative gain)

Priority: P2 (Long-term, tcache parity) Effort: 1-2 weeks implementation Risk: High (major architectural change)

Optimizations:

  1. TLS-Local Metadata Cache (1 week)

    • Cache TinySlabMeta fields (used, capacity, freelist) in TLS
    • Eliminate SuperSlab indirection on hot path (3 loads → 1 load)
    • Periodically sync TLS cache → SuperSlab (threshold-based)
    • Impact: -60% L1D miss rate (additional), +80-120% throughput
  2. Per-Class SuperSlab Affinity (1 week)

    • Pin 1 "hot" SuperSlab per class in TLS pointer
    • LRU eviction for cold SuperSlabs
    • Prefetch hot SuperSlab on class switch
    • Impact: -25% L1D miss rate (additional), +18-25% throughput

Cumulative Impact:

  • L1D miss rate: 1.69% → 0.4-0.5% (-71-76%)
  • Throughput: 24.9M → 60-70M ops/s (+141-181%)
  • Target: tcache parity (65-76% of System malloc)

Today (2-3 hours):

Implement Proposal 1.2: Prefetch Optimization

  1. Add prefetch to refill path (core/hakmem_tiny_refill_p0.inc.h):

    if (tls->ss) {
        __builtin_prefetch(&tls->ss->slab_bitmap, 0, 3);
    }
    __builtin_prefetch(&meta->freelist, 0, 3);
    
  2. Add prefetch to alloc path (core/tiny_alloc_fast.inc.h):

    __builtin_prefetch(&g_tls_sll_head[class_idx], 0, 3);
    if (ptr) __builtin_prefetch(ptr, 0, 3);  // Next freelist entry
    
  3. Build & benchmark:

    ./build.sh bench_random_mixed_hakmem
    perf stat -e L1-dcache-load-misses -r 10 \
      ./out/release/bench_random_mixed_hakmem 1000000 256 42
    

Expected Result: +8-12% throughput (24.9M → 27-28M ops/s) in 2-3 hours! 🚀


Tomorrow (4-6 hours):

Implement Proposal 1.1: Hot/Cold SlabMeta Split

  1. Define TinySlabMetaHot and TinySlabMetaCold structs
  2. Update SuperSlab to use separate arrays (slabs_hot[], slabs_cold[])
  3. Add accessor functions for gradual migration
  4. Migrate critical hot paths (refill, alloc, free)

Expected Result: +15-20% additional throughput (cumulative: +25-35%)


Week 1 Target:

Complete Phase 1 (Quick Wins) by end of week:

  • All 3 optimizations implemented and validated
  • L1D miss rate reduced to 1.0-1.1% (from 1.69%)
  • Throughput improved to 34-37M ops/s (from 24.9M)
  • +36-49% performance gain 🎯

Risk Mitigation

Technical Risks:

  1. Correctness (Hot/Cold Split): Medium risk

    • Mitigation: Extensive testing (AddressSanitizer, regression tests, fuzzing)
    • Gradual migration using accessor functions (not big-bang refactor)
  2. Performance Regression (Prefetch): Low risk

    • Mitigation: A/B test with HAKMEM_PREFETCH=0/1 env flag
    • Easy rollback (single commit)
  3. Complexity (TLS Merge): Medium risk

    • Mitigation: Update all access sites systematically (use grep to find all references)
    • Compile-time checks to catch missed migrations
  4. Memory Overhead (Dynamic Alloc): Low risk

    • Mitigation: Use slab allocator for TinySlabMetaHot (fixed-size, no fragmentation)

Success Criteria

Phase 1 Completion (Week 1):

  • L1D miss rate < 1.1% (from 1.69%)
  • Throughput > 34M ops/s (+36% minimum)
  • All regression tests pass
  • AddressSanitizer clean (no leaks, no buffer overflows)
  • 1-hour stress test stable (100M ops, no crashes)

Phase 2 Completion (Week 2):

  • L1D miss rate < 0.7% (from 1.69%)
  • Throughput > 42M ops/s (+69% minimum)
  • Multi-threaded workload stable (Larson 4T)

Phase 3 Completion (Week 3-4):

  • L1D miss rate < 0.5% (from 1.69%, tcache parity!)
  • Throughput > 60M ops/s (+141% minimum, 65% of System malloc)
  • Memory efficiency maintained (no significant RSS increase)

Documentation

Detailed Reports:

  1. L1D_CACHE_MISS_ANALYSIS_REPORT.md - Full technical analysis

    • Perf profiling results
    • Data structure analysis
    • Comparison with glibc tcache
    • Detailed optimization proposals (P1-P3)
  2. L1D_CACHE_MISS_HOTSPOT_DIAGRAM.md - Visual diagrams

    • Memory access pattern comparison
    • Cache line heatmaps
    • Before/after optimization flowcharts
  3. L1D_OPTIMIZATION_QUICK_START_GUIDE.md - Implementation guide

    • Step-by-step code changes
    • Build & test instructions
    • Rollback procedures
    • Troubleshooting tips

Next Steps

Immediate (Today):

  1. Review this summary with team (15 minutes)
  2. 🚀 Start Proposal 1.2 (Prefetch) implementation (2-3 hours)
  3. 📊 Baseline benchmark (save current L1D miss rate for comparison)

This Week:

  1. Complete Phase 1 Quick Wins (Prefetch + Hot/Cold Split + TLS Merge)
  2. Validate +36-49% gain with comprehensive testing
  3. Document results and plan Phase 2 rollout

Next 2-4 Weeks:

  1. Phase 2: SuperSlab optimization (+70-100% cumulative)
  2. Phase 3: TLS metadata cache (+150-200% cumulative, tcache parity!)

Conclusion

L1D cache misses are the root cause of HAKMEM's 3.8x performance gap vs System malloc. The proposed 3-phase optimization plan systematically addresses metadata access patterns to achieve:

  • Short-term (1-2 days): +36-49% gain with prefetch + hot/cold split + TLS merge
  • Medium-term (1 week): +70-100% cumulative gain with SuperSlab optimization
  • Long-term (2 weeks): +150-200% cumulative gain, achieving tcache parity (60-70M ops/s)

Recommendation: Start with Proposal 1.2 (Prefetch) TODAY to get quick wins (+8-12%) and build momentum. 🚀

Contact: See detailed guides for step-by-step implementation instructions and troubleshooting support.


Status: READY FOR IMPLEMENTATION Next Action: Begin Proposal 1.2 (Prefetch) - see L1D_OPTIMIZATION_QUICK_START_GUIDE.md