hakmem/INVESTIGATION_RESULTS.md
Commit 52386401b3 by Moe Charm (CI): Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00


Phase 1 Quick Wins Investigation - Final Results

Investigation Date: 2025-11-05
Investigator: Claude (Sonnet 4.5)
Mission: Determine why the REFILL_COUNT optimization failed


Investigation Summary

Question Asked

Why did increasing REFILL_COUNT from 32 to 128 fail to deliver the expected +31% performance improvement?

Answer Found

The optimization targeted the wrong bottleneck.

  • Real bottleneck: superslab_refill() function (28.56% CPU)
  • Assumed bottleneck: Refill frequency (actually minimal impact)
  • Side effect: Cache pollution from larger batches (-36% performance)

Key Findings

1. Performance Results

| REFILL_COUNT | Throughput | Change | L1d Miss Rate |
|---|---|---|---|
| 32 (baseline) | 4.19 M ops/s | 0% | 12.88% |
| 64 | 2.69-3.89 M ops/s | -7% to -36% | 14.12% (+10%) |
| 128 | 2.68-4.19 M ops/s | -36% to 0% | 16.08% (+25%) |

Conclusion: REFILL_COUNT increases are HARMFUL, not helpful.


2. Bottleneck Identification 🎯

Perf profiling revealed:

```
CPU Time Breakdown:
  28.56% - superslab_refill()        ← THE PROBLEM
   3.10% - [kernel overhead]
   2.96% - [kernel overhead]
   ...    - (remaining distributed)
```

superslab_refill is 9x more expensive than any other user function.


3. Root Cause Analysis 🔍

Why REFILL_COUNT=128 Failed:

Factor 1: superslab_refill is inherently expensive

  • 238 lines of code
  • 15+ branches
  • 4 nested loops
  • 100+ atomic operations (worst case)
  • O(n) freelist scan (n=32 slabs) on every call
  • Cost: 28.56% of total CPU time

Factor 2: Cache pollution from large batches

  • REFILL=32: 12.88% L1d miss rate
  • REFILL=128: 16.08% L1d miss rate (+25% worse!)
  • Cause: 128 blocks × 128 bytes = 16KB, half of the 32KB L1d; the batch plus the working set no longer fit together
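
A quick sanity check of the Factor 2 arithmetic. `refill_footprint` is illustrative only, not a hakmem function; the constants are the figures quoted above:

```c
#include <assert.h>
#include <stddef.h>

enum { L1D_BYTES = 32 * 1024 };  /* L1d size on the benchmark machine */

/* Bytes a refill batch touches as it is handed out: blocks * block_size. */
static size_t refill_footprint(size_t blocks, size_t block_size) {
    return blocks * block_size;
}
```

At REFILL=128 the batch alone occupies half the L1d, so batch and working set evict each other; at REFILL=32 it is only 4KB.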

Factor 3: Refill frequency already low

  • Larson benchmark has FIFO pattern
  • High TLS freelist hit rate
  • Refills are rare, not frequent
  • Reducing frequency has minimal impact

Factor 4: More instructions, same cycles

  • REFILL=32: 39.6B instructions
  • REFILL=128: 61.1B instructions (+54% more work!)
  • IPC improves (1.93 → 2.86) but throughput drops
  • Paradox: better superscalar execution, but more total work
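
The paradox is easier to see in cycles. Dividing the instruction counts above by their IPCs (a hypothetical helper, not hakmem code):

```c
#include <assert.h>

/* Factor 4 in cycle terms: cycles = instructions / IPC.
 * Inputs are the perf counters quoted above. */
static double cycles_from(double instructions, double ipc) {
    return instructions / ipc;
}
```

Both runs burn roughly the same cycle budget (about 20.5B vs 21.4B cycles, within a few percent), which is expected for a fixed 2-second run. The +54% instruction count therefore shows up directly as fewer completed operations, not a longer run.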

4. memset Analysis 📊

Searched for memset calls:

```
$ grep -rn "memset" core/*.inc
core/hakmem_tiny_init.inc:514:  memset(g_slab_registry, 0, ...)
core/hakmem_tiny_intel.inc:842: memset((void*)g_obs_ready, 0, ...)
```

Findings:

  • Only 2 memset calls, both in cold paths (init code)
  • NO memset in allocation hot path
  • Previous perf reports showing memset were from different builds

Conclusion: memset removal would have ZERO impact on performance.


5. Larson Benchmark Characteristics 🧪

Pattern:

  • 2 seconds runtime
  • 4 threads
  • 1024 chunks per thread (stable working set)
  • Sizes: 8-128B (Tiny classes 0-4)
  • FIFO replacement (allocate new, free oldest)

Implications:

  • After warmup, freelists are well-populated
  • High hit rate on TLS freelist
  • Refills are infrequent
  • This pattern may NOT represent real-world workloads
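
The pattern above can be sketched in a few lines (single-threaded, fixed 64B size, system malloc as a stand-in; the real benchmark uses 4 threads and 8-128B sizes):

```c
#include <assert.h>
#include <stdlib.h>

#define CHUNKS 1024  /* stable working set per thread */

/* Larson-style FIFO loop: each step frees the oldest block in a ring
 * of CHUNKS live allocations and allocates a replacement. After the
 * first lap the allocator's freelists stay warm. Returns the number
 * of successful allocations. */
static long larson_fifo_steps(long steps) {
    void *ring[CHUNKS] = {0};
    long done = 0;
    for (long i = 0; i < steps; i++) {
        size_t slot = (size_t)(i % CHUNKS);
        free(ring[slot]);          /* free oldest (free(NULL) is a no-op on lap 1) */
        ring[slot] = malloc(64);   /* allocate newest */
        if (ring[slot]) done++;
    }
    for (size_t s = 0; s < CHUNKS; s++) free(ring[s]);
    return done;
}
```

Because every free immediately feeds the next allocation of the same size class, the TLS freelist hit rate is very high and refills are rare, exactly the property that made REFILL_COUNT tuning irrelevant.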

Detailed Bottleneck: superslab_refill()

Function Location

/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_free.inc:650-888

Complexity Metrics

  • Lines: 238
  • Branches: 15+
  • Loops: 4 nested
  • Atomic ops: 32-160 per call
  • Function calls: 15+

Execution Paths

Path 1: Adopt from Publish/Subscribe (Lines 686-750)

  • Scan up to 32 slabs
  • Multiple atomic loads per slab
  • Cost: 🔥🔥🔥🔥 HIGH

Path 2: Reuse Existing Freelist (Lines 753-792) ← PRIMARY BOTTLENECK

  • O(n) linear scan of all slabs (n=32)
  • Runs on EVERY refill
  • Multiple atomic ops per slab
  • Cost: 🔥🔥🔥🔥🔥 VERY HIGH
  • Estimated: 15-20% of total CPU
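
A stripped-down model of what Path 2 pays per refill (`ss_model_t` and `ss_scan_freelists` are hypothetical names for illustration; the real code lives in core/hakmem_tiny_free.inc):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SLABS_PER_SS 32

typedef struct {
    _Atomic(void *) freelist[SLABS_PER_SS]; /* per-slab freelist heads */
} ss_model_t;

/* O(n) scan over all 32 slabs, one atomic acquire-load each, until a
 * non-empty freelist is found. This runs on EVERY refill. Returns the
 * first non-empty slab index, or -1 if all are empty. */
static int ss_scan_freelists(ss_model_t *ss) {
    for (int i = 0; i < SLABS_PER_SS; i++) {
        if (atomic_load_explicit(&ss->freelist[i],
                                 memory_order_acquire) != NULL)
            return i;
    }
    return -1;
}
```

Even when the hit is early, the scan's branches and atomic loads dominate; when it misses, all 32 loads are wasted before falling through to the next path.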

Path 3: Use Virgin Slab (Lines 794-810)

  • Bitmap scan to find free slab
  • Initialize metadata
  • Cost: 🔥🔥🔥 MEDIUM

Path 4: Registry Adoption (Lines 812-843)

  • Scan 256 registry entries × 32 slabs
  • Thousands of atomic ops (worst case)
  • Cost: 🔥🔥🔥🔥🔥 CATASTROPHIC (if hit)

Path 6: Allocate New SuperSlab (Lines 851-887)

  • mmap() syscall (~1000+ cycles)
  • Page fault on first access
  • Cost: 🔥🔥🔥🔥🔥 CATASTROPHIC

Optimization Recommendations

🥇 P0: Freelist Bitmap (Immediate - This Week)

Problem: O(n) linear scan of 32 slabs on every refill

Solution:

```c
// Add to SuperSlab struct:
uint32_t freelist_bitmap;  // bit i = 1 if slabs[i].freelist != NULL

// In superslab_refill:
uint32_t fl_bits = tls->ss->freelist_bitmap;
if (fl_bits) {
    int idx = __builtin_ctz(fl_bits);  // O(1): find first set bit
    // Try to acquire slab[idx]...
}
```

Expected gain: +10-15% throughput (4.19 → 4.62-4.82 M ops/s)
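
The lookup is only half the story; a complete sketch also needs the maintenance side, setting a slab's bit when its freelist becomes non-empty and clearing it when it drains. Names here are hypothetical, and a real version must update the bit under the same synchronization that guards the freelist itself:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t ss_bitmap = 0;  /* bit i = 1 if slab i has a non-empty freelist */

/* Called when a free makes slab idx's freelist non-empty. */
static void slab_mark_nonempty(int idx) { ss_bitmap |=  (1u << idx); }

/* Called when a refill drains slab idx's freelist. */
static void slab_mark_empty(int idx)    { ss_bitmap &= ~(1u << idx); }

/* O(1) replacement for the 32-entry linear scan: -1 if all empty. */
static int slab_find_nonempty(void) {
    return ss_bitmap ? __builtin_ctz(ss_bitmap) : -1;
}
```

`__builtin_ctz` is the GCC/Clang builtin already used in the snippet above; on x86 it compiles to a single `tzcnt`/`bsf`.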


🥈 P1: Reduce Atomic Operations (Next Week)

Problem: 32-96 atomic ops per refill

Solutions:

  1. Batch acquire attempts (reduce from 32 to 1-3 atomics)
  2. Relaxed memory ordering where safe
  3. Cache scores before atomic acquire

Expected gain: +3-5% throughput
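
Solution 1 can be sketched with C11 atomics: detaching an entire remote freelist with one `atomic_exchange` replaces up to 32 per-node CAS pops. This is a minimal illustration with hypothetical names, not the hakmem code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct node { struct node *next; } node_t;

static _Atomic(node_t *) remote_head = NULL;

/* Producer side (e.g. a remote free): Treiber-stack push, one CAS. */
static void remote_push(node_t *n) {
    n->next = atomic_load_explicit(&remote_head, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(
               &remote_head, &n->next, n,
               memory_order_release, memory_order_relaxed)) {}
}

/* Consumer side (refill): ONE atomic op detaches the whole chain,
 * which can then be walked with plain loads. */
static node_t *remote_pop_all(void) {
    return atomic_exchange_explicit(&remote_head, NULL,
                                    memory_order_acquire);
}
```

The acquire exchange pairs with the release CAS on the push side, so the consumer sees fully linked nodes without any further synchronization.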


🥉 P2: SuperSlab Pool (Week 3)

Problem: mmap() syscall in hot path

Solution:

```c
SuperSlab* g_ss_pool[128];  // Pre-allocated pool
// Allocate from pool O(1); refill pool in background
```

Expected gain: +2-4% throughput
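
Fleshed out slightly (hypothetical names, single-threaded, `malloc` standing in for `mmap` to keep the sketch portable):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SS_POOL_CAP 128
#define SS_BYTES    (64 * 1024)  /* assumed SuperSlab size for this sketch */

static void *g_ss_pool[SS_POOL_CAP];  /* pre-allocated stack of regions */
static int   g_ss_pool_top = 0;

/* Refill side (would run in the background): top the pool up. */
static void ss_pool_fill(int count) {
    while (g_ss_pool_top < SS_POOL_CAP && count-- > 0)
        g_ss_pool[g_ss_pool_top++] = malloc(SS_BYTES);
}

/* Hot path: O(1) pop; only falls back to direct allocation
 * (mmap() in the real allocator) when the pool is empty. */
static void *ss_pool_get(void) {
    if (g_ss_pool_top > 0)
        return g_ss_pool[--g_ss_pool_top];
    return malloc(SS_BYTES);
}
```

The hot path shrinks to a bounds check and an array pop; the ~1000-cycle syscall moves off the allocation path entirely as long as the pool stays non-empty.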


🏆 Long-term: Background Refill Thread

Vision: Eliminate superslab_refill from allocation path entirely

Approach:

  • Dedicated thread keeps freelists pre-filled
  • Allocation never waits for mmap or scanning
  • Zero syscalls in hot path

Expected gain: +20-30% throughput (but high complexity)
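
A minimal pthread sketch of the idea (hypothetical names; real error handling, sizing policy, and lock-free handoff omitted). A dedicated thread keeps a pool topped up; the allocation path only pops:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define POOL_CAP 8

static void *pool[POOL_CAP];
static int   pool_n = 0, stop = 0;
static pthread_mutex_t mu          = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  need_refill = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  refilled    = PTHREAD_COND_INITIALIZER;

/* Background thread: keep the pool full, sleep until drained. */
static void *refiller(void *arg) {
    (void)arg;
    pthread_mutex_lock(&mu);
    while (!stop) {
        while (pool_n < POOL_CAP)
            pool[pool_n++] = malloc(4096);  /* stand-in for SuperSlab alloc */
        pthread_cond_broadcast(&refilled);
        pthread_cond_wait(&need_refill, &mu);
    }
    pthread_mutex_unlock(&mu);
    return NULL;
}

/* Hot path: pop from the pre-filled pool, nudge the refiller. */
static void *bg_alloc(void) {
    pthread_mutex_lock(&mu);
    while (pool_n == 0)
        pthread_cond_wait(&refilled, &mu);
    void *p = pool[--pool_n];
    pthread_cond_signal(&need_refill);
    pthread_mutex_unlock(&mu);
    return p;
}

static void bg_shutdown(pthread_t t) {
    pthread_mutex_lock(&mu);
    stop = 1;
    pthread_cond_signal(&need_refill);
    pthread_mutex_unlock(&mu);
    pthread_join(t, NULL);
    while (pool_n > 0) free(pool[--pool_n]);
}
```

A production version would hand off batches lock-free (e.g. the exchange trick from P1) rather than take a mutex on the hot path; the sketch only shows the division of labor.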


Total Expected Improvements

Conservative Estimates

| Phase | Optimization | Gain | Cumulative Throughput |
|---|---|---|---|
| Baseline | - | 0% | 4.19 M ops/s |
| Sprint 1 | Freelist bitmap | +10-15% | 4.62-4.82 M ops/s |
| Sprint 2 | Reduce atomics | +3-5% | 4.76-5.06 M ops/s |
| Sprint 3 | SS pool | +2-4% | 4.85-5.27 M ops/s |
| Total | | +16-26% | ~5.0 M ops/s |

Reality Check

Current state:

  • HAKMEM Tiny: 4.19 M ops/s
  • System malloc: 135.94 M ops/s
  • Gap: 32x slower

After optimizations:

  • HAKMEM Tiny: ~5.0 M ops/s (+19%)
  • Gap: 27x slower (still far behind)

Conclusion: These optimizations help, but fundamental redesign needed to approach System malloc performance (see Phase 6 goals).


Lessons Learned

1. Always Profile First 📊

  • Task Teacher's intuition was wrong
  • Perf revealed the real bottleneck
  • Rule: No optimization without perf data

2. Cache Effects Matter 🧊

  • Larger batches can HURT performance
  • L1 cache is precious (32KB)
  • Working set + batch must fit

3. Benchmarks Can Mislead 🎭

  • Larson has special properties (FIFO, stable)
  • Real workloads may differ
  • Rule: Test with diverse benchmarks

4. Complexity is the Enemy 🐉

  • superslab_refill is 238 lines, 15 branches
  • Compare to System tcache: 3-4 instructions
  • Rule: Simpler is faster
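
For comparison, a tcache-style fast path really is just a pointer pop (hypothetical names; glibc's actual tcache adds per-bin counters and safety checks):

```c
#include <assert.h>
#include <stddef.h>

typedef struct fnode { struct fnode *next; } fnode_t;

static _Thread_local fnode_t *tls_head = NULL;  /* per-thread freelist head */

/* Hot-path alloc: 2 loads + 1 store, no atomics, one branch. */
static void *fast_alloc(void) {
    fnode_t *n = tls_head;
    if (n == NULL) return NULL;  /* miss: fall back to slow refill */
    tls_head = n->next;
    return n;
}

/* Hot-path free: push onto the thread-local list. */
static void fast_free(void *p) {
    fnode_t *n = (fnode_t *)p;
    n->next = tls_head;
    tls_head = n;
}
```

The common case compiles to a handful of instructions; everything superslab_refill does in 238 lines happens only on the miss path.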

Next Steps

Immediate Actions (Today)

  1. Document findings (DONE - this report)
  2. DO NOT increase REFILL_COUNT beyond 32
  3. Focus on superslab_refill optimization

This Week

  1. Implement freelist bitmap (P0)
  2. Profile superslab_refill with rdtsc instrumentation
  3. A/B test freelist bitmap vs baseline
  4. Document results

Next 2 Weeks

  1. Reduce atomic operations (P1)
  2. Implement SuperSlab pool (P2)
  3. Test with diverse benchmarks (not just Larson)

Long-term (Phase 6)

  1. Study System tcache implementation
  2. Design ultra-simple fast path (3-4 instructions)
  3. Background refill thread
  4. Eliminate superslab_refill from hot path

Files Created

  1. PHASE1_REFILL_INVESTIGATION.md - Full detailed analysis
  2. PHASE1_EXECUTIVE_SUMMARY.md - Quick reference summary
  3. SUPERSLAB_REFILL_BREAKDOWN.md - Deep dive into superslab_refill
  4. INVESTIGATION_RESULTS.md - This file (final summary)

Conclusion

Why Phase 1 Failed:

  • Optimized the wrong thing (refill frequency instead of refill cost)
  • Assumed without measuring (that refills were cheap and frequent)
  • Ignored cache effects (larger batches pollute L1)
  • Trusted one benchmark (Larson is not representative)

What We Learned:

  • superslab_refill is THE bottleneck (28.56% CPU)
  • Path 2 freelist scan is the sub-bottleneck (O(n) scan)
  • memset is NOT in the hot path (wasted optimization target)
  • Data beats intuition (perf reveals truth)

What We'll Do:

  • 🎯 Focus on superslab_refill (10-15% gain available)
  • 🎯 Implement the freelist bitmap (O(n) → O(1))
  • 🎯 Profile before optimizing (always measure first)

End of Investigation


For detailed analysis, see:

  • PHASE1_REFILL_INVESTIGATION.md (comprehensive report)
  • SUPERSLAB_REFILL_BREAKDOWN.md (code-level analysis)
  • PHASE1_EXECUTIVE_SUMMARY.md (quick reference)