
superslab_refill Bottleneck Analysis

Function: superslab_refill() in /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_free.inc:650-888
CPU Time: 28.56% (perf report)
Status: 🔴 CRITICAL BOTTLENECK


Function Complexity Analysis

Code Statistics

  • Lines of code: 238 lines (650-888)
  • Branches: ~15 major decision points
  • Loops: 4 (the Path 4 registry scan nests two of them)
  • Atomic operations: ~10+ atomic loads/stores
  • Function calls: ~15 helper functions

Complexity Score: 🔥🔥🔥🔥🔥 (Extremely complex for a "refill" operation)


Path Analysis: What superslab_refill Does

Path 1: Adopt from Publish/Subscribe (Lines 686-750)

Condition: g_ss_adopt_en == 1 (auto-enabled if remote frees seen)

Steps:

  1. Check cooldown period (lines 688-694)
  2. Call ss_partial_adopt(class_idx) (line 696)
  3. Loop 1: Scan adopted SS slabs (lines 701-710; sketched below)
    • Load remote counts atomically
    • Calculate best score
  4. Try to acquire best slab atomically (line 714)
  5. Drain remote freelist (line 716)
  6. Check if safe to bind (line 734)
  7. Bind TLS slab (line 736)

Atomic operations: 3-5 per slab × up to 32 slabs = 96-160 atomic ops

Cost estimate: 🔥🔥🔥🔥 HIGH (multi-threaded workloads only)
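
A minimal sketch of the scoring scan in step 3, assuming the adopted SuperSlab exposes per-slab atomic remote-free counters (the names below are illustrative, not the real struct layout):

// Sketch: prefer the slab with the most remote frees pending,
// since draining it yields the most usable blocks.
int best_idx = -1;
uint32_t best_score = 0;
for (int s = 0; s < slab_count; s++) {
    uint32_t remote = atomic_load_explicit(&adopted->remote_counts[s],
                                           memory_order_relaxed);
    if (remote > best_score) { best_score = remote; best_idx = s; }
}
// If best_idx >= 0: slab_try_acquire(adopted, best_idx, self_tid),
// drain its remote freelist, then bind it to TLS.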


Path 2: Reuse Existing SS Freelist (Lines 753-792)

Condition: tls->ss != NULL and at least one slab has a freelist

Steps:

  1. Get slab capacity (line 756)
  2. Loop 2: Scan all slabs (lines 757-792)
    • Check if slabs[i].freelist exists (line 763)
    • Try to acquire slab atomically (line 765)
    • Drain remote freelist if needed (line 768)
    • Check safe to bind (line 783)
    • Bind TLS slab (line 785)

Worst case: Scan all 32 slabs, attempt acquire on each
Atomic operations: 1-3 per slab × 32 = 32-96 atomic ops

Cost estimate: 🔥🔥🔥🔥🔥 VERY HIGH (most common path in Larson!)

Why this is THE bottleneck:

  • This loop runs on EVERY refill
  • Larson has 4 threads × frequent allocations
  • Each thread scans its own SS trying to find freelist
  • Atomic operations cause cache line ping-pong between threads (see the acquire sketch below)
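
The acquire step that this loop repeats is (presumably) a single CAS on an owner word; a minimal sketch, assuming a "0 = unowned" convention rather than the real TinySlabMeta layout:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Sketch of the acquire primitive: claim a slab for this thread.
// The CAS is what makes the owning cache line ping-pong under contention.
static inline bool try_acquire_owner(_Atomic uint32_t* owner, uint32_t self_tid) {
    uint32_t expected = 0;  // assumption: 0 means "unowned"
    return atomic_compare_exchange_strong_explicit(
        owner, &expected, self_tid,
        memory_order_acquire, memory_order_relaxed);
}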

Path 3: Use Virgin Slab (Lines 794-810)

Condition: tls->ss->active_slabs < capacity

Steps:

  1. Call superslab_find_free_slab(tls->ss) (line 797; sketched below)
    • Bitmap scan to find unused slab
  2. Call superslab_init_slab() (line 802)
    • Initialize metadata
    • Set up freelist/bitmap
  3. Bind TLS slab (line 805)

Cost estimate: 🔥🔥🔥 MEDIUM (bitmap scan + init)
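
superslab_find_free_slab is not shown in this document; assuming it scans a 32-bit occupancy bitmap (bit i set = slabs[i] in use), the scan reduces to a single ctz:

// Sketch: index of the first unused slab, or -1 if the SuperSlab is full.
static inline int find_free_slab_sketch(uint32_t used_bitmap) {
    uint32_t free_bits = ~used_bitmap;  // bits of unused slots
    return free_bits ? __builtin_ctz(free_bits) : -1;
}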


Path 4: Registry Adoption (Lines 812-843)

Condition: !tls->ss (no SuperSlab yet)

Steps:

  1. Loop 3: Scan registry (lines 818-842; sketched below)
    • Load entry atomically (line 820)
    • Check magic (line 823)
    • Check size class (line 824)
    • Loop 4: Scan slabs in SS (lines 828-840)
      • Try acquire (line 830)
      • Drain remote (line 832)
      • Check safe to bind (line 833)

Worst case: Scan 256 registry entries × 32 slabs each
Atomic operations: Thousands

Cost estimate: 🔥🔥🔥🔥🔥 CATASTROPHIC (if hit)
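
The shape of the nested scan, with illustrative names (g_registry and SS_MAGIC are assumptions; the real registry layout may differ):

// Sketch of the Path 4 worst case: 256 entries × 32 slabs
// = 8,192 acquire attempts, each costing 1-3 atomic ops.
for (int e = 0; e < 256; e++) {
    SuperSlab* ss = atomic_load_explicit(&g_registry[e], memory_order_acquire);
    if (!ss) continue;
    if (ss->magic != SS_MAGIC || ss->class_idx != class_idx) continue;
    for (int s = 0; s < 32; s++) {
        // try acquire, drain remote frees, check safe-to-bind...
    }
}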


Path 5: Must-Adopt Gate (Lines 845-849)

Condition: Before allocating new SS

Steps:

  1. Call tiny_must_adopt_gate(class_idx, tls)
    • Attempts sticky/hot/bench/mailbox/registry adoption

Cost estimate: 🔥🔥 LOW-MEDIUM (fast path optimization)


Path 6: Allocate New SuperSlab (Lines 851-887)

Condition: All other paths failed

Steps:

  1. Call superslab_allocate(class_idx) (line 852)
    • mmap() syscall to allocate 1MB SuperSlab
  2. Initialize first slab (line 876)
  3. Bind TLS slab (line 880)
  4. Update refcounts (lines 882-885)

Cost estimate: 🔥🔥🔥🔥🔥 CATASTROPHIC (syscall!)

Why this is expensive:

  • mmap() is a kernel syscall (~1000+ cycles; sketched below)
  • Page fault on first access
  • TLB pressure
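
For reference, the core of superslab_allocate is an anonymous mmap; a minimal sketch (the real code presumably also aligns the mapping to the 1MB SuperSlab boundary):

#include <sys/mman.h>

// Sketch: one 1MB anonymous mapping per SuperSlab. The syscall plus
// first-touch page faults are what make Path 6 so expensive.
void* raw = mmap(NULL, 1u << 20, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (raw == MAP_FAILED) { /* OOM: propagate failure to the caller */ }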

Bottleneck Hypothesis

Primary Suspects (in order of likelihood):

1. Path 2: Freelist Scan Loop (Lines 757-792) 🥇

Evidence:

  • Runs on EVERY refill
  • Scans up to 32 slabs linearly
  • Multiple atomic operations per slab
  • Cache line bouncing between threads

Why Larson hits this:

  • Larson does frequent alloc/free
  • Freelists exist after first warmup
  • Every refill scans the same SS repeatedly

Estimated CPU contribution: 15-20% of total CPU


2. Atomic Operations (Throughout) 🥈

Count:

  • Path 1: 96-160 atomic ops
  • Path 2: 32-96 atomic ops
  • Path 4: Thousands of atomic ops

Why expensive:

  • Each atomic op = cache coherency traffic
  • 4 threads × frequent operations = contention
  • Contended atomics are comparatively expensive on the AMD Ryzen test system (cross-CCX cache-coherency latency)

Estimated CPU contribution: 5-8% of total CPU


3. Path 6: mmap() Syscalls 🥉

Evidence:

  • OOM messages in logs suggest path 6 is hit occasionally
  • Each mmap() is ~1000 cycles minimum
  • Page faults add another ~1000 cycles

Frequency:

  • Larson runs for 2 seconds
  • 4 threads × allocation rate = high turnover
  • But: SuperSlabs are 1MB (reusable for many allocations)

Estimated CPU contribution: 2-5% of total CPU


4. Registry Scan (Path 4) ⚠️

Evidence:

  • Only runs if !tls->ss (rare after warmup)
  • But: if hit, scans 256 entries × 32 slabs = massive

Estimated CPU contribution: 0-3% of total CPU (depends on hit rate)


Optimization Opportunities

🔥 P0: Eliminate Freelist Scan Loop (Path 2)

Current:

for (int i = 0; i < tls_cap; i++) {
    if (tls->ss->slabs[i].freelist) {
        // Try to acquire, drain, bind...
    }
}

Problem:

  • O(n) scan where n = 32 slabs
  • Linear search every refill
  • Repeated checks of the same slabs

Solutions:

Option A: Freelist Bitmap (Best)

// Add to SuperSlab struct:
uint32_t freelist_bitmap;  // bit i = 1 if slabs[i].freelist != NULL

// In superslab_refill:
uint32_t fl_bits = tls->ss->freelist_bitmap;
if (fl_bits) {
    int idx = __builtin_ctz(fl_bits);  // Find first set bit (1-2 cycles!)
    // Try to acquire slab[idx]...
}

Benefits:

  • O(1) find instead of O(n) scan
  • No atomic ops unless freelist exists
  • Estimated speedup: 10-15% total CPU

Risks:

  • Need to maintain bitmap on free/alloc (two atomic RMW ops; sketched below)
  • Possible race conditions (can use atomic or accept false positives)
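
The maintenance itself is cheap, sketched here assuming freelist_bitmap is declared _Atomic uint32_t:

// On free: slab i's freelist becomes non-empty -> set bit i.
atomic_fetch_or_explicit(&ss->freelist_bitmap, 1u << i, memory_order_relaxed);

// On refill: slab i's freelist is fully drained -> clear bit i.
atomic_fetch_and_explicit(&ss->freelist_bitmap, ~(1u << i), memory_order_relaxed);

With relaxed ordering the bitmap is only a hint: a stale set bit costs one wasted probe, and a stale clear bit just falls through to the existing scan, so both kinds of staleness are benign.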

Option B: Last-Known-Good Index

// Add to TinyTLSSlab:
uint8_t last_freelist_idx;

// In superslab_refill:
int start = tls->last_freelist_idx;
for (int i = 0; i < tls_cap; i++) {
    int idx = (start + i) % tls_cap;  // Round-robin
    if (tls->ss->slabs[idx].freelist) {
        tls->last_freelist_idx = idx;
        // Try to acquire...
    }
}

Benefits:

  • Likely to hit on first try (temporal locality)
  • No additional atomics
  • Estimated speedup: 5-8% total CPU

Risks:

  • Still O(n) worst case
  • May not help if freelists are sparse

Option C: Intrusive Freelist of Slabs

// Add to SuperSlab:
int8_t first_freelist_slab;  // -1 = none, else index
// Add to TinySlabMeta:
int8_t next_freelist_slab;   // Intrusive linked list

// In superslab_refill:
int idx = tls->ss->first_freelist_slab;
if (idx >= 0) {
    // Try to acquire slab[idx]...
}

Benefits:

  • O(1) lookup
  • No scanning
  • Estimated speedup: 12-18% total CPU

Risks:

  • Complex to maintain
  • Intrusive list management on every free (sketched below)
  • Possible corruption if not careful
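
The list maintenance that makes Option C fragile, sketched single-threaded (the real version must also be safe against concurrent remote frees):

// Sketch: push slab i when its freelist goes non-empty (on free).
static void push_freelist_slab(SuperSlab* ss, int i) {
    ss->slabs[i].next_freelist_slab = ss->first_freelist_slab;
    ss->first_freelist_slab = (int8_t)i;
}

// Sketch: pop in superslab_refill; returns -1 if the list is empty.
static int pop_freelist_slab(SuperSlab* ss) {
    int i = ss->first_freelist_slab;
    if (i >= 0) ss->first_freelist_slab = ss->slabs[i].next_freelist_slab;
    return i;
}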

🔥 P1: Reduce Atomic Operations

Current hotspots:

  • slab_try_acquire() - CAS operation
  • atomic_load_explicit(&remote_heads[s], ...) - Cache coherency
  • atomic_load_explicit(&remote_counts[s], ...) - Cache coherency

Solutions:

Option A: Batch Acquire Attempts

// Instead of acquire → drain → release → retry,
// try multiple slabs and pick best BEFORE acquiring
uint32_t scores[32];
for (int i = 0; i < tls_cap; i++) {
    scores[i] = tls->ss->slabs[i].freelist ? 1 : 0;  // No atomics!
}
int best = find_max_index(scores);  // index of the best (first max) score
// Acquire only the best candidate, and only if a freelist was seen at all
if (best >= 0 && scores[best]) {
    SlabHandle h = slab_try_acquire(tls->ss, best, self_tid);
    // drain + bind on success...
}

Benefits:

  • Reduce atomic ops from 32-96 to 1-3
  • Estimated speedup: 3-5% total CPU

Option B: Relaxed Memory Ordering

// Change:
atomic_load_explicit(&remote_heads[s], memory_order_acquire)
// To:
atomic_load_explicit(&remote_heads[s], memory_order_relaxed)

Benefits:

  • Cheaper than acquire (no fence)
  • Safe if we re-check before binding

Risks:

  • Requires careful analysis of race conditions (see the hint-then-recheck sketch below)
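
The "re-check before binding" pattern means the relaxed load is only a hint; the value is re-read with acquire ordering once the slab is owned. A sketch (the SlabHandle `.ok` field is an assumption):

// Cheap relaxed peek; may be stale, used only as a hint.
if (atomic_load_explicit(&remote_counts[s], memory_order_relaxed) == 0)
    continue;  // probably nothing to drain on this slab

// Commit: acquire the slab, then re-load with acquire ordering
// before acting on the value.
SlabHandle h = slab_try_acquire(tls->ss, s, self_tid);
if (h.ok && atomic_load_explicit(&remote_counts[s], memory_order_acquire) > 0) {
    // now safe to drain the remote freelist
}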

🔥 P2: Optimize Path 6 (mmap)

Solutions:

Option A: Pre-allocated SuperSlab Pool

// Pre-allocate a pool of SuperSlabs (access must be thread-safe;
// see the pop sketch below)
SuperSlab* g_ss_pool[128];  // Pre-mmap'd and ready
int g_ss_pool_head = 0;

// In superslab_allocate:
if (g_ss_pool_head > 0) {
    return g_ss_pool[--g_ss_pool_head];  // O(1)!
}
// Fall back to mmap if the pool is empty

Benefits:

  • Amortize mmap cost
  • No syscalls in hot path
  • Estimated speedup: 2-4% total CPU
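
As written above, the pool is not thread-safe; a minimal fix is a mutex taken only on the refill path (a sketch, keeping the names above):

#include <pthread.h>

static pthread_mutex_t g_ss_pool_lock = PTHREAD_MUTEX_INITIALIZER;

// Sketch: thread-safe O(1) pop; NULL means "fall back to mmap".
static SuperSlab* ss_pool_pop(void) {
    SuperSlab* ss = NULL;
    pthread_mutex_lock(&g_ss_pool_lock);
    if (g_ss_pool_head > 0)
        ss = g_ss_pool[--g_ss_pool_head];
    pthread_mutex_unlock(&g_ss_pool_lock);
    return ss;
}

The lock sits on the rare refill path, not the per-allocation fast path, so contention should be negligible.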

Option B: Background Refill Thread

// Dedicated thread to refill SS pool
void* bg_refill_thread(void* arg) {
    while (1) {
        if (g_ss_pool_head < 64) {
            SuperSlab* ss = mmap(...);
            g_ss_pool[g_ss_pool_head++] = ss;
        }
        usleep(1000);  // Sleep 1ms
    }
}

Benefits:

  • ZERO mmap cost in allocation path
  • Estimated speedup: 2-5% total CPU

Risks:

  • Thread overhead
  • Complexity

🔥 P3: Fast Path Bypass

Idea: Avoid superslab_refill entirely for hot classes

Option A: TLS Freelist Pre-warming

// On thread init, pre-fill TLS freelists
void thread_init() {
    for (int cls = 0; cls < 4; cls++) {  // Hot classes
        sll_refill_batch_from_ss(cls, 128);  // Fill to capacity
    }
}

Benefits:

  • Reduces refill frequency
  • Estimated speedup: 5-10% total CPU (indirect)

Profiling TODO

To confirm hypotheses, instrument superslab_refill:

static SuperSlab* superslab_refill(int class_idx) {
    uint64_t t0 = rdtsc();  // e.g. __rdtsc() from <x86intrin.h>

    uint64_t t_adopt = 0, t_freelist = 0, t_virgin = 0, t_mmap = 0;
    int path_taken = 0;
    SuperSlab* ss = NULL;   // set by whichever path succeeds

    // Path 1: Adopt
    uint64_t t1 = rdtsc();
    if (g_ss_adopt_en) {
        // ... adopt logic (sets ss) ...
        if (adopted) { t_adopt = rdtsc() - t1; path_taken = 1; goto done; }
    }
    t_adopt = rdtsc() - t1;

    // Path 2: Freelist scan
    t1 = rdtsc();
    if (tls->ss) {
        for (int i = 0; i < tls_cap; i++) {
            // ... scan logic (sets ss) ...
            if (found) { t_freelist = rdtsc() - t1; path_taken = 2; goto done; }
        }
    }
    t_freelist = rdtsc() - t1;

    // Path 3: Virgin slab
    t1 = rdtsc();
    if (tls->ss && tls->ss->active_slabs < tls_cap) {
        // ... virgin logic (sets ss) ...
        if (found) { t_virgin = rdtsc() - t1; path_taken = 3; goto done; }
    }
    t_virgin = rdtsc() - t1;

    // Path 6: mmap
    t1 = rdtsc();
    ss = superslab_allocate(class_idx);
    t_mmap = rdtsc() - t1;
    path_taken = 6;

done:;
    uint64_t total = rdtsc() - t0;
    fprintf(stderr, "[REFILL] cls=%d path=%d total=%lu adopt=%lu freelist=%lu virgin=%lu mmap=%lu\n",
            class_idx, path_taken, total, t_adopt, t_freelist, t_virgin, t_mmap);
    return ss;
}

Run:

./larson_hakmem ... 2>&1 | grep REFILL | awk '{sub(/total=/,"",$4); sum[$3]+=$4} END {for (p in sum) print p, sum[p]}' | sort -k2 -rn

Expected output:

path=2 12500000000  ← Freelist scan dominates
path=6  3200000000  ← mmap is expensive but rare
path=3   500000000  ← Virgin slabs
path=1   100000000  ← Adopt (if enabled)

Sprint 1 (This Week): Quick Wins

  1. Profile superslab_refill with rdtsc instrumentation
  2. Confirm Path 2 (freelist scan) is dominant
  3. Implement Option A: Freelist Bitmap
  4. A/B test: expect +10-15% throughput

Sprint 2 (Next Week): Atomic Optimization

  1. Implement relaxed memory ordering where safe
  2. Batch acquire attempts (reduce atomics)
  3. A/B test: expect +3-5% throughput

Sprint 3 (Week 3): Path 6 Optimization

  1. Implement SuperSlab pool
  2. Optional: Background refill thread
  3. A/B test: expect +2-4% throughput

Total Expected Gain

Baseline:     4.19 M ops/s
After Sprint 1: 4.62-4.82 M ops/s (+10-15%)
After Sprint 2: 4.76-5.06 M ops/s (+14-21%)
After Sprint 3: 4.85-5.27 M ops/s (+16-26%)
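
These projections compound multiplicatively on the 4.19 M ops/s baseline; for the low ends (differences vs. the table are rounding):

4.19 × 1.10 ≈ 4.61   (Sprint 1, +10%)
4.61 × 1.03 ≈ 4.75   (Sprint 2, +3% on top)
4.75 × 1.02 ≈ 4.85   (Sprint 3, +2% on top)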

Conservative estimate: +15-20% total from superslab_refill optimization alone.

Even combined with other optimizations (cache tuning, etc.), System malloc parity (135 M ops/s) remains distant, but Tiny can plausibly approach 60-70 M ops/s (roughly 45-50% of System).


Conclusion

superslab_refill is a 238-line monster with:

  • 15+ branches
  • 4 nested loops
  • 100+ atomic operations (worst case)
  • Syscall overhead (mmap)

The #1 sub-bottleneck is Path 2 (freelist scan):

  • O(n) scan of 32 slabs
  • Runs on EVERY refill
  • Multiple atomics per slab
  • Est. 15-20% of total CPU time

Immediate action: Implement freelist bitmap for O(1) slab discovery.

Long-term vision: Eliminate superslab_refill from hot path entirely (background refill, pre-warmed slabs).


Next: See PHASE1_EXECUTIVE_SUMMARY.md for the action plan.