
superslab_refill Bottleneck Analysis

Function: superslab_refill() in /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_free.inc:650-888
CPU Time: 28.56% (perf report)
Status: 🔴 CRITICAL BOTTLENECK


Function Complexity Analysis

Code Statistics

  • Lines of code: 238 lines (650-888)
  • Branches: ~15 major decision points
  • Loops: 4 (the Path 4 registry scan nests two of them)
  • Atomic operations: ~10+ atomic loads/stores
  • Function calls: ~15 helper functions

Complexity Score: 🔥🔥🔥🔥🔥 (Extremely complex for a "refill" operation)


Path Analysis: What superslab_refill Does

Path 1: Adopt from Publish/Subscribe (Lines 686-750)

Condition: g_ss_adopt_en == 1 (auto-enabled if remote frees seen)

Steps:

  1. Check cooldown period (lines 688-694)
  2. Call ss_partial_adopt(class_idx) (line 696)
  3. Loop 1: Scan adopted SS slabs (lines 701-710; sketched below)
    • Load remote counts atomically
    • Calculate best score
  4. Try to acquire best slab atomically (line 714)
  5. Drain remote freelist (line 716)
  6. Check if safe to bind (line 734)
  7. Bind TLS slab (line 736)

Atomic operations: 3-5 per slab × up to 32 slabs = 96-160 atomic ops

Cost estimate: 🔥🔥🔥🔥 HIGH (multi-threaded workloads only)
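
A minimal sketch of the scoring scan in step 3, assuming the adopted SuperSlab exposes per-slab atomic remote-free counters (the names below are illustrative, not the real struct layout):

// Sketch: prefer the slab with the most remote frees pending,
// since draining it yields the most usable blocks.
int best_idx = -1;
uint32_t best_score = 0;
for (int s = 0; s < slab_count; s++) {
    uint32_t remote = atomic_load_explicit(&adopted->remote_counts[s],
                                           memory_order_relaxed);
    if (remote > best_score) { best_score = remote; best_idx = s; }
}
// If best_idx >= 0: slab_try_acquire(adopted, best_idx, self_tid),
// drain its remote freelist, then bind it to TLS.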


Path 2: Reuse Existing SS Freelist (Lines 753-792)

Condition: tls->ss != NULL and at least one slab has a freelist

Steps:

  1. Get slab capacity (line 756)
  2. Loop 2: Scan all slabs (lines 757-792)
    • Check if slabs[i].freelist exists (line 763)
    • Try to acquire slab atomically (line 765)
    • Drain remote freelist if needed (line 768)
    • Check safe to bind (line 783)
    • Bind TLS slab (line 785)

Worst case: Scan all 32 slabs, attempt acquire on each
Atomic operations: 1-3 per slab × 32 = 32-96 atomic ops

Cost estimate: 🔥🔥🔥🔥🔥 VERY HIGH (most common path in Larson!)

Why this is THE bottleneck:

  • This loop runs on EVERY refill
  • Larson has 4 threads × frequent allocations
  • Each thread scans its own SS trying to find freelist
  • Atomic operations cause cache line ping-pong between threads (see the acquire sketch below)
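
The acquire step that this loop repeats is (presumably) a single CAS on an owner word; a minimal sketch, assuming a "0 = unowned" convention rather than the real TinySlabMeta layout:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Sketch of the acquire primitive: claim a slab for this thread.
// The CAS is what makes the owning cache line ping-pong under contention.
static inline bool try_acquire_owner(_Atomic uint32_t* owner, uint32_t self_tid) {
    uint32_t expected = 0;  // assumption: 0 means "unowned"
    return atomic_compare_exchange_strong_explicit(
        owner, &expected, self_tid,
        memory_order_acquire, memory_order_relaxed);
}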

Path 3: Use Virgin Slab (Lines 794-810)

Condition: tls->ss->active_slabs < capacity

Steps:

  1. Call superslab_find_free_slab(tls->ss) (line 797; sketched below)
    • Bitmap scan to find unused slab
  2. Call superslab_init_slab() (line 802)
    • Initialize metadata
    • Set up freelist/bitmap
  3. Bind TLS slab (line 805)

Cost estimate: 🔥🔥🔥 MEDIUM (bitmap scan + init)
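
superslab_find_free_slab is not shown in this document; assuming it scans a 32-bit occupancy bitmap (bit i set = slabs[i] in use), the scan reduces to a single ctz:

// Sketch: index of the first unused slab, or -1 if the SuperSlab is full.
static inline int find_free_slab_sketch(uint32_t used_bitmap) {
    uint32_t free_bits = ~used_bitmap;  // bits of unused slots
    return free_bits ? __builtin_ctz(free_bits) : -1;
}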


Path 4: Registry Adoption (Lines 812-843)

Condition: !tls->ss (no SuperSlab yet)

Steps:

  1. Loop 3: Scan registry (lines 818-842; sketched below)
    • Load entry atomically (line 820)
    • Check magic (line 823)
    • Check size class (line 824)
    • Loop 4: Scan slabs in SS (lines 828-840)
      • Try acquire (line 830)
      • Drain remote (line 832)
      • Check safe to bind (line 833)

Worst case: Scan 256 registry entries × 32 slabs each
Atomic operations: Thousands

Cost estimate: 🔥🔥🔥🔥🔥 CATASTROPHIC (if hit)
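
The shape of the nested scan, with illustrative names (g_registry and SS_MAGIC are assumptions; the real registry layout may differ):

// Sketch of the Path 4 worst case: 256 entries × 32 slabs
// = 8,192 acquire attempts, each costing 1-3 atomic ops.
for (int e = 0; e < 256; e++) {
    SuperSlab* ss = atomic_load_explicit(&g_registry[e], memory_order_acquire);
    if (!ss) continue;
    if (ss->magic != SS_MAGIC || ss->class_idx != class_idx) continue;
    for (int s = 0; s < 32; s++) {
        // try acquire, drain remote frees, check safe-to-bind...
    }
}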


Path 5: Must-Adopt Gate (Lines 845-849)

Condition: Before allocating new SS

Steps:

  1. Call tiny_must_adopt_gate(class_idx, tls)
    • Attempts sticky/hot/bench/mailbox/registry adoption

Cost estimate: 🔥🔥 LOW-MEDIUM (fast path optimization)


Path 6: Allocate New SuperSlab (Lines 851-887)

Condition: All other paths failed

Steps:

  1. Call superslab_allocate(class_idx) (line 852)
    • mmap() syscall to allocate 1MB SuperSlab
  2. Initialize first slab (line 876)
  3. Bind TLS slab (line 880)
  4. Update refcounts (lines 882-885)

Cost estimate: 🔥🔥🔥🔥🔥 CATASTROPHIC (syscall!)

Why this is expensive:

  • mmap() is a kernel syscall (~1000+ cycles; sketched below)
  • Page fault on first access
  • TLB pressure
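
For reference, the core of superslab_allocate is an anonymous mmap; a minimal sketch (the real code presumably also aligns the mapping to the 1MB SuperSlab boundary):

#include <sys/mman.h>

// Sketch: one 1MB anonymous mapping per SuperSlab. The syscall plus
// first-touch page faults are what make Path 6 so expensive.
void* raw = mmap(NULL, 1u << 20, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (raw == MAP_FAILED) { /* OOM: propagate failure to the caller */ }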

Bottleneck Hypothesis

Primary Suspects (in order of likelihood):

1. Path 2: Freelist Scan Loop (Lines 757-792) 🥇

Evidence:

  • Runs on EVERY refill
  • Scans up to 32 slabs linearly
  • Multiple atomic operations per slab
  • Cache line bouncing between threads

Why Larson hits this:

  • Larson does frequent alloc/free
  • Freelists exist after first warmup
  • Every refill scans the same SS repeatedly

Estimated CPU contribution: 15-20% of total CPU


2. Atomic Operations (Throughout) 🥈

Count:

  • Path 1: 96-160 atomic ops
  • Path 2: 32-96 atomic ops
  • Path 4: Thousands of atomic ops

Why expensive:

  • Each atomic op = cache coherency traffic
  • 4 threads × frequent operations = contention
  • Contended atomics are comparatively expensive on the AMD Ryzen test system (cross-CCX cache-coherency latency)

Estimated CPU contribution: 5-8% of total CPU


3. Path 6: mmap() Syscalls 🥉

Evidence:

  • OOM messages in logs suggest path 6 is hit occasionally
  • Each mmap() is ~1000 cycles minimum
  • Page faults add another ~1000 cycles

Frequency:

  • Larson runs for 2 seconds
  • 4 threads × allocation rate = high turnover
  • But: SuperSlabs are 1MB (reusable for many allocations)

Estimated CPU contribution: 2-5% of total CPU


4. Registry Scan (Path 4) ⚠️

Evidence:

  • Only runs if !tls->ss (rare after warmup)
  • But: if hit, scans 256 entries × 32 slabs = massive

Estimated CPU contribution: 0-3% of total CPU (depends on hit rate)


Optimization Opportunities

🔥 P0: Eliminate Freelist Scan Loop (Path 2)

Current:

for (int i = 0; i < tls_cap; i++) {
    if (tls->ss->slabs[i].freelist) {
        // Try to acquire, drain, bind...
    }
}

Problem:

  • O(n) scan where n = 32 slabs
  • Linear search every refill
  • Repeated checks of the same slabs

Solutions:

Option A: Freelist Bitmap (Best)

// Add to SuperSlab struct:
uint32_t freelist_bitmap;  // bit i = 1 if slabs[i].freelist != NULL

// In superslab_refill:
uint32_t fl_bits = tls->ss->freelist_bitmap;
if (fl_bits) {
    int idx = __builtin_ctz(fl_bits);  // Find first set bit (1-2 cycles!)
    // Try to acquire slab[idx]...
}

Benefits:

  • O(1) find instead of O(n) scan
  • No atomic ops unless freelist exists
  • Estimated speedup: 10-15% total CPU

Risks:

  • Need to maintain bitmap on free/alloc (two atomic RMW ops; sketched below)
  • Possible race conditions (can use atomic or accept false positives)
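
The maintenance itself is cheap, sketched here assuming freelist_bitmap is declared _Atomic uint32_t:

// On free: slab i's freelist becomes non-empty -> set bit i.
atomic_fetch_or_explicit(&ss->freelist_bitmap, 1u << i, memory_order_relaxed);

// On refill: slab i's freelist is fully drained -> clear bit i.
atomic_fetch_and_explicit(&ss->freelist_bitmap, ~(1u << i), memory_order_relaxed);

With relaxed ordering the bitmap is only a hint: a stale set bit costs one wasted probe, and a stale clear bit just falls through to the existing scan, so both kinds of staleness are benign.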

Option B: Last-Known-Good Index

// Add to TinyTLSSlab:
uint8_t last_freelist_idx;

// In superslab_refill:
int start = tls->last_freelist_idx;
for (int i = 0; i < tls_cap; i++) {
    int idx = (start + i) % tls_cap;  // Round-robin
    if (tls->ss->slabs[idx].freelist) {
        tls->last_freelist_idx = idx;
        // Try to acquire...
    }
}

Benefits:

  • Likely to hit on first try (temporal locality)
  • No additional atomics
  • Estimated speedup: 5-8% total CPU

Risks:

  • Still O(n) worst case
  • May not help if freelists are sparse

Option C: Intrusive Freelist of Slabs

// Add to SuperSlab:
int8_t first_freelist_slab;  // -1 = none, else index
// Add to TinySlabMeta:
int8_t next_freelist_slab;   // Intrusive linked list

// In superslab_refill:
int idx = tls->ss->first_freelist_slab;
if (idx >= 0) {
    // Try to acquire slab[idx]...
}

Benefits:

  • O(1) lookup
  • No scanning
  • Estimated speedup: 12-18% total CPU

Risks:

  • Complex to maintain
  • Intrusive list management on every free (sketched below)
  • Possible corruption if not careful
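
The list maintenance that makes Option C fragile, sketched single-threaded (the real version must also be safe against concurrent remote frees):

// Sketch: push slab i when its freelist goes non-empty (on free).
static void push_freelist_slab(SuperSlab* ss, int i) {
    ss->slabs[i].next_freelist_slab = ss->first_freelist_slab;
    ss->first_freelist_slab = (int8_t)i;
}

// Sketch: pop in superslab_refill; returns -1 if the list is empty.
static int pop_freelist_slab(SuperSlab* ss) {
    int i = ss->first_freelist_slab;
    if (i >= 0) ss->first_freelist_slab = ss->slabs[i].next_freelist_slab;
    return i;
}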

🔥 P1: Reduce Atomic Operations

Current hotspots:

  • slab_try_acquire() - CAS operation
  • atomic_load_explicit(&remote_heads[s], ...) - Cache coherency
  • atomic_load_explicit(&remote_counts[s], ...) - Cache coherency

Solutions:

Option A: Batch Acquire Attempts

// Instead of acquire → drain → release → retry,
// try multiple slabs and pick best BEFORE acquiring
uint32_t scores[32];
for (int i = 0; i < tls_cap; i++) {
    scores[i] = tls->ss->slabs[i].freelist ? 1 : 0;  // No atomics!
}
int best = find_max_index(scores);  // index of the best (first max) score
// Acquire only the best candidate, and only if a freelist was seen at all
if (best >= 0 && scores[best]) {
    SlabHandle h = slab_try_acquire(tls->ss, best, self_tid);
    // drain + bind on success...
}

Benefits:

  • Reduce atomic ops from 32-96 to 1-3
  • Estimated speedup: 3-5% total CPU

Option B: Relaxed Memory Ordering

// Change:
atomic_load_explicit(&remote_heads[s], memory_order_acquire)
// To:
atomic_load_explicit(&remote_heads[s], memory_order_relaxed)

Benefits:

  • Cheaper than acquire (no fence)
  • Safe if we re-check before binding

Risks:

  • Requires careful analysis of race conditions (see the hint-then-recheck sketch below)
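
The "re-check before binding" pattern means the relaxed load is only a hint; the value is re-read with acquire ordering once the slab is owned. A sketch (the SlabHandle `.ok` field is an assumption):

// Cheap relaxed peek; may be stale, used only as a hint.
if (atomic_load_explicit(&remote_counts[s], memory_order_relaxed) == 0)
    continue;  // probably nothing to drain on this slab

// Commit: acquire the slab, then re-load with acquire ordering
// before acting on the value.
SlabHandle h = slab_try_acquire(tls->ss, s, self_tid);
if (h.ok && atomic_load_explicit(&remote_counts[s], memory_order_acquire) > 0) {
    // now safe to drain the remote freelist
}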

🔥 P2: Optimize Path 6 (mmap)

Solutions:

Option A: Pre-allocated SuperSlab Pool

// Pre-allocate a pool of SuperSlabs (access must be thread-safe;
// see the pop sketch below)
SuperSlab* g_ss_pool[128];  // Pre-mmap'd and ready
int g_ss_pool_head = 0;

// In superslab_allocate:
if (g_ss_pool_head > 0) {
    return g_ss_pool[--g_ss_pool_head];  // O(1)!
}
// Fall back to mmap if the pool is empty

Benefits:

  • Amortize mmap cost
  • No syscalls in hot path
  • Estimated speedup: 2-4% total CPU
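
As written above, the pool is not thread-safe; a minimal fix is a mutex taken only on the refill path (a sketch, keeping the names above):

#include <pthread.h>

static pthread_mutex_t g_ss_pool_lock = PTHREAD_MUTEX_INITIALIZER;

// Sketch: thread-safe O(1) pop; NULL means "fall back to mmap".
static SuperSlab* ss_pool_pop(void) {
    SuperSlab* ss = NULL;
    pthread_mutex_lock(&g_ss_pool_lock);
    if (g_ss_pool_head > 0)
        ss = g_ss_pool[--g_ss_pool_head];
    pthread_mutex_unlock(&g_ss_pool_lock);
    return ss;
}

The lock sits on the rare refill path, not the per-allocation fast path, so contention should be negligible.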

Option B: Background Refill Thread

// Dedicated thread to refill SS pool
void* bg_refill_thread(void* arg) {
    while (1) {
        if (g_ss_pool_head < 64) {
            SuperSlab* ss = mmap(...);
            g_ss_pool[g_ss_pool_head++] = ss;
        }
        usleep(1000);  // Sleep 1ms
    }
}

Benefits:

  • ZERO mmap cost in allocation path
  • Estimated speedup: 2-5% total CPU

Risks:

  • Thread overhead
  • Complexity

🔥 P3: Fast Path Bypass

Idea: Avoid superslab_refill entirely for hot classes

Option A: TLS Freelist Pre-warming

// On thread init, pre-fill TLS freelists
void thread_init() {
    for (int cls = 0; cls < 4; cls++) {  // Hot classes
        sll_refill_batch_from_ss(cls, 128);  // Fill to capacity
    }
}

Benefits:

  • Reduces refill frequency
  • Estimated speedup: 5-10% total CPU (indirect)

Profiling TODO

To confirm hypotheses, instrument superslab_refill:

static SuperSlab* superslab_refill(int class_idx) {
    uint64_t t0 = rdtsc();  // e.g. __rdtsc() from <x86intrin.h>

    uint64_t t_adopt = 0, t_freelist = 0, t_virgin = 0, t_mmap = 0;
    int path_taken = 0;
    SuperSlab* ss = NULL;   // set by whichever path succeeds

    // Path 1: Adopt
    uint64_t t1 = rdtsc();
    if (g_ss_adopt_en) {
        // ... adopt logic (sets ss) ...
        if (adopted) { t_adopt = rdtsc() - t1; path_taken = 1; goto done; }
    }
    t_adopt = rdtsc() - t1;

    // Path 2: Freelist scan
    t1 = rdtsc();
    if (tls->ss) {
        for (int i = 0; i < tls_cap; i++) {
            // ... scan logic (sets ss) ...
            if (found) { t_freelist = rdtsc() - t1; path_taken = 2; goto done; }
        }
    }
    t_freelist = rdtsc() - t1;

    // Path 3: Virgin slab
    t1 = rdtsc();
    if (tls->ss && tls->ss->active_slabs < tls_cap) {
        // ... virgin logic (sets ss) ...
        if (found) { t_virgin = rdtsc() - t1; path_taken = 3; goto done; }
    }
    t_virgin = rdtsc() - t1;

    // Path 6: mmap
    t1 = rdtsc();
    ss = superslab_allocate(class_idx);
    t_mmap = rdtsc() - t1;
    path_taken = 6;

done:;
    uint64_t total = rdtsc() - t0;
    fprintf(stderr, "[REFILL] cls=%d path=%d total=%lu adopt=%lu freelist=%lu virgin=%lu mmap=%lu\n",
            class_idx, path_taken, total, t_adopt, t_freelist, t_virgin, t_mmap);
    return ss;
}

Run:

./larson_hakmem ... 2>&1 | grep REFILL | awk '{sub(/total=/,"",$4); sum[$3]+=$4} END {for (p in sum) print p, sum[p]}' | sort -k2 -rn

Expected output:

path=2 12500000000  ← Freelist scan dominates
path=6  3200000000  ← mmap is expensive but rare
path=3   500000000  ← Virgin slabs
path=1   100000000  ← Adopt (if enabled)

Sprint 1 (This Week): Quick Wins

  1. Profile superslab_refill with rdtsc instrumentation
  2. Confirm Path 2 (freelist scan) is dominant
  3. Implement Option A: Freelist Bitmap
  4. A/B test: expect +10-15% throughput

Sprint 2 (Next Week): Atomic Optimization

  1. Implement relaxed memory ordering where safe
  2. Batch acquire attempts (reduce atomics)
  3. A/B test: expect +3-5% throughput

Sprint 3 (Week 3): Path 6 Optimization

  1. Implement SuperSlab pool
  2. Optional: Background refill thread
  3. A/B test: expect +2-4% throughput

Total Expected Gain

Baseline:     4.19 M ops/s
After Sprint 1: 4.62-4.82 M ops/s (+10-15%)
After Sprint 2: 4.76-5.06 M ops/s (+14-21%)
After Sprint 3: 4.85-5.27 M ops/s (+16-26%)
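
These projections compound multiplicatively on the 4.19 M ops/s baseline; for the low ends (differences vs. the table are rounding):

4.19 × 1.10 ≈ 4.61   (Sprint 1, +10%)
4.61 × 1.03 ≈ 4.75   (Sprint 2, +3% on top)
4.75 × 1.02 ≈ 4.85   (Sprint 3, +2% on top)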

Conservative estimate: +15-20% total from superslab_refill optimization alone.

Even combined with other optimizations (cache tuning, etc.), System malloc parity (135 M ops/s) remains distant, but Tiny can plausibly approach 60-70 M ops/s (roughly 45-50% of System).


Conclusion

superslab_refill is a 238-line monster with:

  • 15+ branches
  • 4 nested loops
  • 100+ atomic operations (worst case)
  • Syscall overhead (mmap)

The #1 sub-bottleneck is Path 2 (freelist scan):

  • O(n) scan of 32 slabs
  • Runs on EVERY refill
  • Multiple atomics per slab
  • Est. 15-20% of total CPU time

Immediate action: Implement freelist bitmap for O(1) slab discovery.

Long-term vision: Eliminate superslab_refill from hot path entirely (background refill, pre-warmed slabs).


Next: See PHASE1_EXECUTIVE_SUMMARY.md for the action plan.