Files
hakmem/docs/analysis/INVESTIGATION_RESULTS.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 13:14:18 +09:00

8.8 KiB
Raw Blame History

Phase 1 Quick Wins Investigation - Final Results

Investigation Date: 2025-11-05 Investigator: Claude (Sonnet 4.5) Mission: Determine why REFILL_COUNT optimization failed


Investigation Summary

Question Asked

Why did increasing REFILL_COUNT from 32 to 128 fail to deliver the expected +31% performance improvement?

Answer Found

The optimization targeted the wrong bottleneck.

  • Real bottleneck: superslab_refill() function (28.56% CPU)
  • Assumed bottleneck: Refill frequency (actually minimal impact)
  • Side effect: Cache pollution from larger batches (-36% performance)

Key Findings

1. Performance Results

REFILL_COUNT Throughput Change L1d Miss Rate
32 (baseline) 4.19 M ops/s 0% 12.88%
64 2.69-3.89 M ops/s -7% to -36% 14.12% (+10%)
128 2.68-4.19 M ops/s -36% to 0% 16.08% (+25%)

Conclusion: REFILL_COUNT increases are HARMFUL, not helpful.


2. Bottleneck Identification 🎯

Perf profiling revealed:

CPU Time Breakdown:
  28.56% - superslab_refill()        ← THE PROBLEM
   3.10% - [kernel overhead]
   2.96% - [kernel overhead]
   ...    - (remaining distributed)

superslab_refill is 9x more expensive than any other user function.


3. Root Cause Analysis 🔍

Why REFILL_COUNT=128 Failed:

Factor 1: superslab_refill is inherently expensive

  • 238 lines of code
  • 15+ branches
  • 4 nested loops
  • 100+ atomic operations (worst case)
  • O(n) freelist scan (n=32 slabs) on every call
  • Cost: 28.56% of total CPU time

Factor 2: Cache pollution from large batches

  • REFILL=32: 12.88% L1d miss rate
  • REFILL=128: 16.08% L1d miss rate (+25% worse!)
  • Cause: 128 blocks × 128 bytes = 16KB doesn't fit in L1 (32KB total)

Factor 3: Refill frequency already low

  • Larson benchmark has FIFO pattern
  • High TLS freelist hit rate
  • Refills are rare, not frequent
  • Reducing frequency has minimal impact

Factor 4: More instructions, same cycles

  • REFILL=32: 39.6B instructions
  • REFILL=128: 61.1B instructions (+54% more work!)
  • IPC improves (1.93 → 2.86) but throughput drops
  • Paradox: better superscalar execution, but more total work

4. memset Analysis 📊

Searched for memset calls:

$ grep -rn "memset" core/*.inc
core/hakmem_tiny_init.inc:514:  memset(g_slab_registry, 0, ...)
core/hakmem_tiny_intel.inc:842: memset((void*)g_obs_ready, 0, ...)

Findings:

  • Only 2 memset calls, both in cold paths (init code)
  • NO memset in allocation hot path
  • Previous perf reports showing memset were from different builds

Conclusion: memset removal would have ZERO impact on performance.


5. Larson Benchmark Characteristics 🧪

Pattern:

  • 2 seconds runtime
  • 4 threads
  • 1024 chunks per thread (stable working set)
  • Sizes: 8-128B (Tiny classes 0-4)
  • FIFO replacement (allocate new, free oldest)

Implications:

  • After warmup, freelists are well-populated
  • High hit rate on TLS freelist
  • Refills are infrequent
  • This pattern may NOT represent real-world workloads

Detailed Bottleneck: superslab_refill()

Function Location

/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_free.inc:650-888

Complexity Metrics

  • Lines: 238
  • Branches: 15+
  • Loops: 4 nested
  • Atomic ops: 32-160 per call
  • Function calls: 15+

Execution Paths

Path 1: Adopt from Publish/Subscribe (Lines 686-750)

  • Scan up to 32 slabs
  • Multiple atomic loads per slab
  • Cost: 🔥🔥🔥🔥 HIGH

Path 2: Reuse Existing Freelist (Lines 753-792) ← PRIMARY BOTTLENECK

  • O(n) linear scan of all slabs (n=32)
  • Runs on EVERY refill
  • Multiple atomic ops per slab
  • Cost: 🔥🔥🔥🔥🔥 VERY HIGH
  • Estimated: 15-20% of total CPU

Path 3: Use Virgin Slab (Lines 794-810)

  • Bitmap scan to find free slab
  • Initialize metadata
  • Cost: 🔥🔥🔥 MEDIUM

Path 4: Registry Adoption (Lines 812-843)

  • Scan 256 registry entries × 32 slabs
  • Thousands of atomic ops (worst case)
  • Cost: 🔥🔥🔥🔥🔥 CATASTROPHIC (if hit)

Path 6: Allocate New SuperSlab (Lines 851-887)

  • mmap() syscall (~1000+ cycles)
  • Page fault on first access
  • Cost: 🔥🔥🔥🔥🔥 CATASTROPHIC

Optimization Recommendations

🥇 P0: Freelist Bitmap (Immediate - This Week)

Problem: O(n) linear scan of 32 slabs on every refill

Solution:

// Add to SuperSlab struct:
uint32_t freelist_bitmap;  // bit i = 1 if slabs[i].freelist != NULL

// In superslab_refill:
uint32_t fl_bits = tls->ss->freelist_bitmap;
if (fl_bits) {
    int idx = __builtin_ctz(fl_bits);  // O(1)! Find first set bit
    // Try to acquire slab[idx]...
}

Expected gain: +10-15% throughput (4.19 → 4.62-4.82 M ops/s)


🥈 P1: Reduce Atomic Operations (Next Week)

Problem: 32-96 atomic ops per refill

Solutions:

  1. Batch acquire attempts (reduce from 32 to 1-3 atomics)
  2. Relaxed memory ordering where safe
  3. Cache scores before atomic acquire

Expected gain: +3-5% throughput


🥉 P2: SuperSlab Pool (Week 3)

Problem: mmap() syscall in hot path

Solution:

SuperSlab* g_ss_pool[128];  // Pre-allocated pool
// Allocate from pool O(1), refill pool in background

Expected gain: +2-4% throughput


🏆 Long-term: Background Refill Thread

Vision: Eliminate superslab_refill from allocation path entirely

Approach:

  • Dedicated thread keeps freelists pre-filled
  • Allocation never waits for mmap or scanning
  • Zero syscalls in hot path

Expected gain: +20-30% throughput (but high complexity)


Total Expected Improvements

Conservative Estimates

Phase Optimization Gain Cumulative Throughput
Baseline - 0% 4.19 M ops/s
Sprint 1 Freelist bitmap +10-15% 4.62-4.82 M ops/s
Sprint 2 Reduce atomics +3-5% 4.76-5.06 M ops/s
Sprint 3 SS pool +2-4% 4.85-5.27 M ops/s
Total +16-26% ~5.0 M ops/s

Reality Check

Current state:

  • HAKMEM Tiny: 4.19 M ops/s
  • System malloc: 135.94 M ops/s
  • Gap: 32x slower

After optimizations:

  • HAKMEM Tiny: ~5.0 M ops/s (+19%)
  • Gap: 27x slower (still far behind)

Conclusion: These optimizations help, but fundamental redesign needed to approach System malloc performance (see Phase 6 goals).


Lessons Learned

1. Always Profile First 📊

  • Task Teacher's intuition was wrong
  • Perf revealed the real bottleneck
  • Rule: No optimization without perf data

2. Cache Effects Matter 🧊

  • Larger batches can HURT performance
  • L1 cache is precious (32KB)
  • Working set + batch must fit

3. Benchmarks Can Mislead 🎭

  • Larson has special properties (FIFO, stable)
  • Real workloads may differ
  • Rule: Test with diverse benchmarks

4. Complexity is the Enemy 🐉

  • superslab_refill is 238 lines, 15 branches
  • Compare to System tcache: 3-4 instructions
  • Rule: Simpler is faster

Next Steps

Immediate Actions (Today)

  1. Document findings (DONE - this report)
  2. DO NOT increase REFILL_COUNT beyond 32
  3. Focus on superslab_refill optimization

This Week

  1. Implement freelist bitmap (P0)
  2. Profile superslab_refill with rdtsc instrumentation
  3. A/B test freelist bitmap vs baseline
  4. Document results

Next 2 Weeks

  1. Reduce atomic operations (P1)
  2. Implement SuperSlab pool (P2)
  3. Test with diverse benchmarks (not just Larson)

Long-term (Phase 6)

  1. Study System tcache implementation
  2. Design ultra-simple fast path (3-4 instructions)
  3. Background refill thread
  4. Eliminate superslab_refill from hot path

Files Created

  1. PHASE1_REFILL_INVESTIGATION.md - Full detailed analysis
  2. PHASE1_EXECUTIVE_SUMMARY.md - Quick reference summary
  3. SUPERSLAB_REFILL_BREAKDOWN.md - Deep dive into superslab_refill
  4. INVESTIGATION_RESULTS.md - This file (final summary)

Conclusion

Why Phase 1 Failed:

Optimized the wrong thing (refill frequency instead of refill cost) Assumed without measuring (refill is cheap, happens often) Ignored cache effects (larger batches pollute L1) Trusted one benchmark (Larson is not representative)

What We Learned:

superslab_refill is THE bottleneck (28.56% CPU) Path 2 freelist scan is the sub-bottleneck (O(n) scan) memset is NOT in hot path (wasted optimization target) Data beats intuition (perf reveals truth)

What We'll Do:

🎯 Focus on superslab_refill (10-15% gain available) 🎯 Implement freelist bitmap (O(n) → O(1)) 🎯 Profile before optimizing (always measure first)

End of Investigation


For detailed analysis, see:

  • PHASE1_REFILL_INVESTIGATION.md (comprehensive report)
  • SUPERSLAB_REFILL_BREAKDOWN.md (code-level analysis)
  • PHASE1_EXECUTIVE_SUMMARY.md (quick reference)