Bottleneck Analysis Report: hakmem Tiny Pool Allocator

Date: 2025-10-26
Target: hakmem bitmap-based allocator
Baseline: mimalloc (industry standard)
Analyzed by: Deep code analysis + performance modeling


Executive Summary

Top 3 Bottlenecks with Estimated Impact

  1. TLS Magazine Hierarchy Overhead (HIGH: ~3-5 ns per allocation)

    • 3-tier indirection: TLS Magazine → TLS Active Slab → Mini-Magazine → Bitmap
    • Each tier adds cache miss risk and branching overhead
    • Expected speedup: 30-40% if collapsed to 2-tier
  2. Two-Tier Bitmap Traversal (HIGH: ~4-6 ns on bitmap path)

    • Summary bitmap scan + main bitmap scan + hint_word update
    • Cache-friendly but computationally expensive (2x CTZ, 2x bitmap updates)
    • Expected speedup: 20-30% if bypassed more often via better caching
  3. Registry Lookup on Free Path (MEDIUM: ~2-4 ns per free)

    • Hash computation + linear probe + validation on every cross-slab free
    • Could be eliminated with mimalloc-style pointer arithmetic
    • Expected speedup: 15-25% on free-heavy workloads

Performance Gap Analysis

Random Free Pattern (Bitmap's best case):

  • hakmem: 68 M ops/sec (14.7 ns/op)
  • mimalloc: 176 M ops/sec (5.7 ns/op)
  • Gap: 2.6× slower (9 ns difference)

Sequential LIFO Pattern (Free-list's best case):

  • hakmem: 102 M ops/sec (9.8 ns/op)
  • mimalloc: 942 M ops/sec (1.1 ns/op)
  • Gap: 9.2× slower (8.7 ns difference)

Key Insight: Even on favorable patterns (random), we're 2.6× slower. This means the bottleneck is NOT just the bitmap, but the entire allocation architecture.

Expected Total Speedup

  • Conservative: 2.0-2.5× (close the 2.6× gap partially)
  • Optimistic: 3.0-4.0× (with aggressive optimizations)
  • Realistic Target: 2.5× (reaching ~170 M ops/sec on random, ~250 M ops/sec on LIFO)

Critical Path Analysis

Allocation Fast Path Walkthrough

Let me trace the exact execution path for hak_tiny_alloc(16) with step-by-step cycle estimates:

// hakmem_tiny.c:557 - Entry point
void* hak_tiny_alloc(size_t size) {
    // Line 558: Initialization check
    if (!g_tiny_initialized) hak_tiny_init();        // BRANCH: ~1 cycle (predicted taken once)

    // Line 561-562: Wrapper context check
    extern int hak_in_wrapper(void);
    if (!g_wrap_tiny_enabled && hak_in_wrapper())    // BRANCH: ~1 cycle
        return NULL;

    // Line 565: Size to class conversion
    int class_idx = hak_tiny_size_to_class(size);    // INLINE: ~2 cycles (branch chain)
    if (class_idx < 0) return NULL;                  // BRANCH: ~1 cycle

    // Line 569-576: SuperSlab path (disabled by default)
    if (g_use_superslab) { /* ... */ }               // BRANCH: ~1 cycle (not taken)

    // Line 650-651: TLS Magazine initialization check
    tiny_mag_init_if_needed(class_idx);              // INLINE: ~3 cycles (conditional init)
    TinyTLSMag* mag = &g_tls_mags[class_idx];       // TLS ACCESS: ~2 cycles

    // Line 666-670: TLS Magazine fast path (BEST CASE)
    if (mag->top > 0) {                              // LOAD + BRANCH: ~2 cycles
        void* p = mag->items[--mag->top].ptr;        // LOAD + DEC + STORE: ~3 cycles
        stats_record_alloc(class_idx);               // INLINE: ~1 cycle (TLS increment)
        return p;                                     // RETURN: ~1 cycle
    }
    // TOTAL FAST PATH: ~18 cycles (~6 ns @ 3 GHz)

    // Line 673-674: TLS Active Slab lookup (MEDIUM PATH)
    TinySlab* tls = g_tls_active_slab_a[class_idx]; // TLS ACCESS: ~2 cycles
    if (!(tls && tls->free_count > 0))               // LOAD + BRANCH: ~3 cycles
        tls = g_tls_active_slab_b[class_idx];        // TLS ACCESS: ~2 cycles (if taken)

    if (tls && tls->free_count > 0) {                // BRANCH: ~1 cycle
        // Line 677-679: Remote drain check
        if (atomic_load(&tls->remote_count) >= thresh || rand() & mask) {
            tiny_remote_drain_owner(tls);            // RARE: ~50-200 cycles (if taken)
        }

        // Line 682-688: Mini-magazine fast path
        if (!mini_mag_is_empty(&tls->mini_mag)) {    // LOAD + BRANCH: ~2 cycles
            void* p = mini_mag_pop(&tls->mini_mag);  // INLINE: ~4 cycles (LIFO pop)
            if (p) {
                stats_record_alloc(class_idx);       // INLINE: ~1 cycle
                return p;                             // RETURN: ~1 cycle
            }
        }
        // MINI-MAG PATH: ~30 cycles (~10 ns)

        // Line 691-700: Batch refill from bitmap
        if (tls->free_count > 0 && mini_mag_is_empty(&tls->mini_mag)) {
            int refilled = batch_refill_from_bitmap(tls, &tls->mini_mag, 16);
            // REFILL COST: ~48 ns for 16 items = ~3 ns/item amortized
            if (refilled > 0) {
                void* p = mini_mag_pop(&tls->mini_mag);
                if (p) {
                    stats_record_alloc(class_idx);
                    return p;
                }
            }
        }
        // REFILL PATH: ~50 cycles (~17 ns) for batch + ~10 ns for next alloc

        // Line 703-713: Bitmap scan fallback
        if (tls->free_count > 0) {
            int block_idx = hak_tiny_find_free_block(tls);  // BITMAP SCAN: ~15-20 cycles
            if (block_idx >= 0) {
                hak_tiny_set_used(tls, block_idx);          // BITMAP UPDATE: ~10 cycles
                tls->free_count--;                          // STORE: ~1 cycle
                void* p = (char*)tls->base + (block_idx * bs); // COMPUTE: ~3 cycles
                stats_record_alloc(class_idx);              // INLINE: ~1 cycle
                return p;                                    // RETURN: ~1 cycle
            }
        }
        // BITMAP PATH: ~50 cycles (~17 ns)
    }

    // Line 717-718: Lock and refill from global pool (SLOW PATH)
    pthread_mutex_lock(lock);                        // LOCK: ~30-100 cycles (contended)
    // ... slow path: 200-1000 cycles (rare) ...
}

Cycle Count Summary

Path               Cycles   Latency (ns)   Frequency   Notes
TLS Magazine Hit   ~18      ~6             60-80%      Best case (cache hit)
Mini-Mag Hit       ~30      ~10            10-20%      Good case (slab-local)
Batch Refill       ~50      ~17            5-10%       Amortized 3 ns/item
Bitmap Scan        ~50      ~17            5-10%       Worst case before lock
Global Lock Path   ~300     ~100           <5%         Very rare (refill)

Weighted Average: 0.7×6 + 0.15×10 + 0.1×17 + 0.05×100 ≈ 12 ns/op (theoretical)
Measured Actual: 9.8-14.7 ns/op (consistent with the model)

Comparison with mimalloc's Approach

mimalloc achieves 1.1 ns/op on LIFO pattern by:

  1. No TLS Magazine Layer: Direct access to thread-local page free-list
  2. Intrusive Free-List: 1 load + 1 store (2 cycles) vs our 18 cycles
  3. 2MB Alignment: O(1) pointer→slab via bit-masking (no registry lookup)
  4. No Bitmap: Free-list only (trades random-access resistance for speed)
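
To make the free-list point concrete, here is an illustrative sketch (not mimalloc's actual code) of an intrusive LIFO free-list: pop and push are each one or two loads plus a store, which is why a list head that stays in L1 can approach 1-2 cycles per operation.

#include <stddef.h>

// Illustrative only: intrusive free-list in which each free block stores the
// 'next' pointer in its own first bytes, so no separate metadata is touched.
typedef struct FreeBlock { struct FreeBlock* next; } FreeBlock;

static inline void* freelist_pop(FreeBlock** head) {
    FreeBlock* b = *head;          // load head
    if (b) *head = b->next;        // load next + store new head
    return b;
}

static inline void freelist_push(FreeBlock** head, void* p) {
    FreeBlock* b = (FreeBlock*)p;
    b->next = *head;               // link block to current head
    *head = b;                     // store new head
}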

hakmem's Architecture:

Allocation Request
  ↓
TLS Magazine (2048 items)          ← 1st tier: ~6 ns (cache hit)
  ↓ (miss)
TLS Active Slab (2 per class)      ← 2nd tier: lookup cost
  ↓
Mini-Magazine (16-32 items)        ← 3rd tier: ~10 ns (LIFO pop)
  ↓ (miss)
Batch Refill (16 items)            ← 4th tier: ~3 ns amortized
  ↓ (miss)
Bitmap Scan (two-tier)             ← 5th tier: ~17 ns (expensive)
  ↓ (miss)
Global Lock + Slab Allocation      ← 6th tier: ~100+ ns (rare)

mimalloc's Architecture:

Allocation Request
  ↓
Thread-Local Page Free-List        ← 1st tier: ~1 ns (1 load + 1 store)
  ↓ (miss)
Thread-Local Page Queue            ← 2nd tier: ~5 ns (page switch)
  ↓ (miss)
Global Segment Allocation          ← 3rd tier: ~50 ns (rare)

Key Difference: mimalloc has 3 tiers, hakmem has 6 tiers. Each tier adds ~2-3 ns overhead.


Bottleneck #1: TLS Magazine Hierarchy Overhead

Location

  • File: /home/tomoaki/git/hakmem/hakmem_tiny.c
  • Lines: 650-714 (allocation fast path)
  • Impact: HIGH (affects 100% of allocations)

Code Analysis

// Line 650-651: 1st tier - TLS Magazine
tiny_mag_init_if_needed(class_idx);              // ~3 cycles (conditional check)
TinyTLSMag* mag = &g_tls_mags[class_idx];       // ~2 cycles (TLS base + offset)

// Line 666-670: TLS Magazine lookup
if (mag->top > 0) {                              // ~2 cycles (load + branch)
    void* p = mag->items[--mag->top].ptr;        // ~3 cycles (array access + decrement)
    stats_record_alloc(class_idx);               // ~1 cycle (TLS increment)
    return p;                                     // ~1 cycle
}
// TOTAL: ~12 cycles for cache hit (BEST CASE)

// Line 673-674: 2nd tier - TLS Active Slab lookup
TinySlab* tls = g_tls_active_slab_a[class_idx]; // ~2 cycles (TLS access)
if (!(tls && tls->free_count > 0))               // ~3 cycles (2 loads + branch)
    tls = g_tls_active_slab_b[class_idx];        // ~2 cycles (if miss)

// Line 682-688: 3rd tier - Mini-Magazine
if (!mini_mag_is_empty(&tls->mini_mag)) {        // ~2 cycles (load slab->mini_mag.count)
    void* p = mini_mag_pop(&tls->mini_mag);      // ~4 cycles (LIFO pop: 2 loads + 1 store)
    if (p) { stats_record_alloc(class_idx); return p; }
}
// TOTAL: ~13 cycles for mini-mag hit (MEDIUM CASE)

Why It's Slow

  1. Multiple TLS Accesses: Each tier requires TLS base lookup + offset calculation

    • g_tls_mags[class_idx] → TLS read #1
    • g_tls_active_slab_a[class_idx] → TLS read #2
    • g_tls_active_slab_b[class_idx] → TLS read #3 (conditional)
    • Cost: 2-3 cycles each × 3 = 6-9 cycles overhead
  2. Cache Line Fragmentation: TLS variables are separate arrays

    • g_tls_mags[8] = 128 KB total (2048 items × 8 bytes × 8 classes; 16 KB per class)
    • g_tls_active_slab_a[8] = 64 bytes
    • g_tls_active_slab_b[8] = 64 bytes
    • Cost: Likely span multiple cache lines → potential cache misses
  3. Branch Misprediction: Multi-tier fallback creates branch chain

    • Magazine empty? → Check active slab A
    • Slab A empty? → Check active slab B
    • Mini-mag empty? → Refill from bitmap
    • Cost: Each mispredicted branch = 10-20 cycles penalty
  4. Redundant Metadata: Magazine items store {void* ptr} separately from slab pointers

    • Magazine item: 8 bytes per pointer (2048 × 8 = 16 KB per class)
    • Slab pointers: 8 bytes × 2 per class (16 bytes)
    • Cost: Memory overhead reduces cache efficiency

Optimization: Unified TLS Cache Structure

Before (current):

// Separate TLS arrays (fragmented in memory)
static __thread TinyMagItem g_tls_mags[TINY_NUM_CLASSES][TINY_TLS_MAG_CAP];
static __thread TinySlab* g_tls_active_slab_a[TINY_NUM_CLASSES];
static __thread TinySlab* g_tls_active_slab_b[TINY_NUM_CLASSES];

After (proposed):

// Unified per-class TLS structure (cache-line aligned)
typedef struct __attribute__((aligned(64))) {
    // Hot fields: small magazine (32 slots span the first 4 cache lines)
    void* mag_items[32];        // Reduced from 2048 to 32 (still effective)
    uint16_t mag_top;           // Current magazine count
    uint16_t mag_cap;           // Magazine capacity
    uint32_t _pad0;

    // Warm fields: active slab state
    TinySlab* active_slab;      // Primary active slab (no A/B split)
    PageMiniMag* mini_mag;      // Direct pointer to slab's mini-mag
    uint64_t last_refill_tsc;   // For adaptive refill timing

    // Cold fields: batched statistics
    uint64_t stats_alloc_batch;
    uint64_t stats_free_batch;
} TinyTLSCache;                 // aligned(64); no packing needed, members are naturally aligned

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];
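
As a rough sketch of how the allocation fast path would look against this structure (assuming the field layout above and a tiny_cache_refill() helper that exists only in this proposal, not in the current sources), a single TLS access reaches both the magazine and the active slab:

// Sketch only: unified-cache fast path.
static inline void* hak_tiny_alloc_fast(int class_idx) {
    TinyTLSCache* c = &g_tls_cache[class_idx];   // single TLS access
    if (c->mag_top > 0) {
        return c->mag_items[--c->mag_top];       // load + store on the hot cache lines
    }
    return tiny_cache_refill(c, class_idx);      // hypothetical slow path: refill from c->active_slab
}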

Benefits:

  1. Single TLS access: g_tls_cache[class_idx] (not 3 separate lookups)
  2. Cache-line aligned: hot fields grouped at the front of a 64-byte-aligned struct
  3. Reduced magazine size: 32 items (not 2048) saves 15.5 KB per class
  4. Direct mini-mag pointer: No slab→mini_mag indirection

Expected Speedup: 30-40% (reduce fast path from ~12 cycles to ~7 cycles)

Risk: MEDIUM

  • Requires refactoring TLS access patterns throughout codebase
  • Magazine size reduction may increase refill frequency (trade-off)
  • Need careful testing to ensure no regression on multi-threaded workloads

Bottleneck #2: Two-Tier Bitmap Traversal

Location

  • File: /home/tomoaki/git/hakmem/hakmem_tiny.h
  • Lines: 235-269 (hak_tiny_find_free_block)
  • Impact: HIGH (affects 5-15% of allocations, but expensive when hit)

Code Analysis

// Line 235-269: Two-tier bitmap scan
static inline int hak_tiny_find_free_block(TinySlab* slab) {
    const int bw = g_tiny_bitmap_words[slab->class_idx];  // Bitmap words
    const int sw = slab->summary_words;                    // Summary words
    if (bw <= 0 || sw <= 0) return -1;

    int start_word = slab->hint_word % bw;                 // Hint optimization
    int start_sw = start_word / 64;                        // Summary word index
    int start_sb = start_word % 64;                        // Summary bit offset

    // Line 244-267: Summary bitmap scan (outer loop)
    for (int k = 0; k < sw; k++) {                         // ~sw iterations (1-128)
        int idx = start_sw + k;
        if (idx >= sw) idx -= sw;                          // Wrap-around
        uint64_t bits = slab->summary[idx];                // LOAD: ~2 cycles

        // Mask optimization (skip processed bits)
        if (k == 0) {
            bits &= (~0ULL) << start_sb;                   // BITWISE: ~1 cycle
        }
        if (idx == sw - 1 && (bw % 64) != 0) {
            uint64_t mask = (bw % 64) == 64 ? ~0ULL : ((1ULL << (bw % 64)) - 1ULL);
            bits &= mask;                                  // BITWISE: ~1 cycle
        }
        if (bits == 0) continue;                           // BRANCH: ~1 cycle (often taken)

        int woff = __builtin_ctzll(bits);                  // CTZ #1: ~3 cycles
        int word_idx = idx * 64 + woff;                    // COMPUTE: ~2 cycles
        if (word_idx >= bw) continue;                      // BRANCH: ~1 cycle

        // Line 261-266: Main bitmap scan (inner)
        uint64_t used = slab->bitmap[word_idx];            // LOAD: ~2 cycles (cache miss risk)
        uint64_t free_bits = ~used;                        // BITWISE: ~1 cycle
        if (free_bits == 0) continue;                      // BRANCH: ~1 cycle (rare)

        int bit_idx = __builtin_ctzll(free_bits);          // CTZ #2: ~3 cycles
        slab->hint_word = (uint16_t)((word_idx + 1) % bw); // UPDATE HINT: ~2 cycles
        return word_idx * 64 + bit_idx;                    // RETURN: ~1 cycle
    }
    return -1;
}
// TYPICAL COST: 15-20 cycles (1-2 summary iterations, 1 main bitmap access)
// WORST CASE: 50-100 cycles (many summary words scanned, cache misses)

Why It's Slow

  1. Two-Level Indirection: Summary → Bitmap → Block

    • Summary scan: Find word with free bits (~5-10 cycles)
    • Main bitmap scan: Find bit within word (~5 cycles)
    • Cost: 2× CTZ operations, 2× memory loads
  2. Cache Miss Risk: Bitmap can be up to 1 KB (128 words × 8 bytes)

    • Class 0 (8B): 128 words = 1024 bytes
    • Class 1 (16B): 64 words = 512 bytes
    • Class 2 (32B): 32 words = 256 bytes
    • Cost: Bitmap may not fit in L1 cache (32 KB) → L2 access (~10-20 cycles)
  3. Hint Word State: Requires update on every allocation

    • Read hint_word (~1 cycle)
    • Compute new hint (~2 cycles)
    • Write hint_word (~1 cycle)
    • Cost: 4 cycles per allocation (not amortized)
  4. Branch-Heavy Loop: Multiple branches per iteration

    • if (bits == 0) continue; (often taken when bitmap is sparse)
    • if (word_idx >= bw) continue; (rare safety check)
    • if (free_bits == 0) continue; (rare but costly)
    • Cost: Branch misprediction = 10-20 cycles each

Optimization #1: Increase Mini-Magazine Capacity

Rationale: Avoid bitmap scan by keeping more items in mini-magazine

Current:

// Line 344: Mini-magazine capacity
uint16_t mag_capacity = (class_idx <= 3) ? 32 : 16;

Proposed:

// Increase capacity to reduce bitmap scan frequency
uint16_t mag_capacity = (class_idx <= 3) ? 64 : 32;

Benefits:

  • Fewer bitmap scans (amortized over 64 items instead of 32)
  • Better temporal locality (more items cached)

Costs:

  • +256 bytes memory per slab (64 × 8 bytes pointers)
  • Slightly higher refill cost (64 items vs 32)

Expected Speedup: 10-15% (reduce bitmap scan frequency by 50%)

Risk: LOW (simple parameter change, no logic changes)

Optimization #2: Cache-Aware Bitmap Layout

Rationale: Ensure bitmap fits in L1 cache for hot classes

Current:

// Separate bitmap allocation (may be cache-cold)
slab->bitmap = (uint64_t*)hkm_libc_calloc(bitmap_size, sizeof(uint64_t));

Proposed:

// Embed small bitmaps directly in slab structure
typedef struct TinySlab {
    // ... existing fields ...

    // Embedded bitmap for small classes (≤256 bytes)
    union {
        uint64_t* bitmap_ptr;     // Large classes: heap-allocated
        uint64_t bitmap_embed[32]; // Small classes: embedded (256 bytes)
    };
    uint8_t bitmap_embedded;      // Flag: 1=embedded, 0=heap
} TinySlab;
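
Call sites would then reach the bitmap through a small accessor instead of touching slab->bitmap directly; a minimal sketch, assuming the union and flag proposed above:

// Sketch: one accessor so the rest of the bitmap code does not care whether
// the bitmap is embedded in the slab or heap-allocated.
static inline uint64_t* tiny_slab_bitmap(TinySlab* slab) {
    return slab->bitmap_embedded ? slab->bitmap_embed : slab->bitmap_ptr;
}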

Benefits:

  • Class 0-2 (8B-32B): Bitmap fits in 256 bytes (embedded)
  • Single cache line access for bitmap + slab metadata
  • No heap allocation for small classes

Expected Speedup: 5-10% (reduce cache misses on bitmap access)

Risk: MEDIUM (requires refactoring bitmap access logic)

Optimization #3: Lazy Summary Bitmap Update

Rationale: Summary bitmap update is expensive on free path

Current:

// Line 199-213: Summary update on every set_used/set_free
static inline void hak_tiny_set_used(TinySlab* slab, int block_idx) {
    // ... bitmap update ...

    // Update summary (EXPENSIVE)
    int sum_word = word_idx / 64;
    int sum_bit  = word_idx % 64;
    uint64_t has_free = ~v;
    if (has_free != 0) {
        slab->summary[sum_word] |= (1ULL << sum_bit);   // WRITE
    } else {
        slab->summary[sum_word] &= ~(1ULL << sum_bit);  // WRITE
    }
}

Proposed:

// Lazy summary update (rebuild only when scanning)
static inline void hak_tiny_set_used(TinySlab* slab, int block_idx) {
    // ... bitmap update ...
    // NO SUMMARY UPDATE (deferred)
}

static inline int hak_tiny_find_free_block(TinySlab* slab) {
    // Rebuild summary if stale (rare)
    if (slab->summary_stale) {
        rebuild_summary_bitmap(slab);  // O(N) but rare
        slab->summary_stale = 0;
    }
    // ... existing scan logic ...
}
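
The rebuild_summary_bitmap() helper referenced above is not in the current sources; a minimal sketch of what it would do, assuming the existing slab->bitmap / slab->summary layout and a summary_stale flag added to TinySlab:

// Sketch: recompute the summary from the main bitmap in one pass.
// A summary bit is set when the corresponding bitmap word still has free bits.
static void rebuild_summary_bitmap(TinySlab* slab) {
    const int bw = g_tiny_bitmap_words[slab->class_idx];
    for (int s = 0; s < slab->summary_words; s++) slab->summary[s] = 0;
    for (int w = 0; w < bw; w++) {
        if (~slab->bitmap[w] != 0) {                    // word still has at least one free bit
            slab->summary[w / 64] |= (1ULL << (w % 64));
        }
    }
}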

Benefits:

  • Eliminate summary update on 95% of operations (free path)
  • Summary rebuild cost amortized over many allocations

Expected Speedup: 15-20% on free-heavy workloads

Risk: MEDIUM (requires careful stale bit management)


Bottleneck #3: Registry Lookup on Free Path

Location

  • File: /home/tomoaki/git/hakmem/hakmem_tiny.c
  • Lines: 1102-1118 (hak_tiny_free)
  • Impact: MEDIUM (affects cross-slab frees, ~30-50% of frees)

Code Analysis

// Line 1102-1118: Free path with registry lookup
void hak_tiny_free(void* ptr) {
    if (!ptr || !g_tiny_initialized) return;

    // Line 1106-1111: SuperSlab fast path (disabled by default)
    SuperSlab* ss = ptr_to_superslab(ptr);           // BITWISE: ~2 cycles
    if (ss && ss->magic == SUPERSLAB_MAGIC) {        // LOAD + BRANCH: ~3 cycles
        hak_tiny_free_superslab(ptr, ss);            // FAST PATH: ~5 ns
        return;
    }

    // Line 1114: Registry lookup (EXPENSIVE)
    TinySlab* slab = hak_tiny_owner_slab(ptr);       // LOOKUP: ~10-30 cycles
    if (!slab) return;

    hak_tiny_free_with_slab(ptr, slab);              // FREE: ~50-200 cycles
}

// hakmem_tiny.c:395-440 - Registry lookup implementation
TinySlab* hak_tiny_owner_slab(void* ptr) {
    if (!ptr || !g_tiny_initialized) return NULL;

    if (g_use_registry) {
        // O(1) hash table lookup
        uintptr_t slab_base = (uintptr_t)ptr & ~(TINY_SLAB_SIZE - 1); // BITWISE: ~2 cycles
        TinySlab* slab = registry_lookup(slab_base);   // FUNCTION CALL: ~20-50 cycles
        if (!slab) return NULL;

        // Validation (bounds check)
        uintptr_t start = (uintptr_t)slab->base;
        uintptr_t end = start + TINY_SLAB_SIZE;
        if ((uintptr_t)ptr < start || (uintptr_t)ptr >= end) {
            return NULL;  // False positive
        }
        return slab;
    } else {
        // O(N) linear search (fallback)
        for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
            pthread_mutex_lock(lock);                  // LOCK: ~30-100 cycles
            // Search free slabs
            for (TinySlab* slab = g_tiny_pool.free_slabs[class_idx]; slab; slab = slab->next) {
                // ... bounds check ...
            }
            pthread_mutex_unlock(lock);
        }
        return NULL;
    }
}

// Line 268-288: Registry lookup (hash table linear probe)
static TinySlab* registry_lookup(uintptr_t slab_base) {
    int hash = registry_hash(slab_base);               // HASH: ~5 cycles

    for (int i = 0; i < SLAB_REGISTRY_MAX_PROBE; i++) { // Up to 8 probes
        int idx = (hash + i) & SLAB_REGISTRY_MASK;     // BITWISE: ~2 cycles
        SlabRegistryEntry* entry = &g_slab_registry[idx]; // LOAD: ~2 cycles

        if (entry->slab_base == slab_base) {           // LOAD + BRANCH: ~3 cycles
            TinySlab* owner = entry->owner;            // LOAD: ~2 cycles
            return owner;
        }

        if (entry->slab_base == 0) {                   // LOAD + BRANCH: ~2 cycles
            return NULL;  // Empty slot
        }
    }
    return NULL;
}
// TYPICAL COST: 20-30 cycles (1-2 probes, cache hit)
// WORST CASE: 50-100 cycles (8 probes, cache miss on registry array)

Why It's Slow

  1. Hash Computation: Cheap on its own, but collisions force probing

    static inline int registry_hash(uintptr_t slab_base) {
        return (slab_base >> 16) & SLAB_REGISTRY_MASK;  // Simple shift + mask
    }

    • Shift + mask = 2 cycles (acceptable)
    • BUT: Linear probing on collision adds 10-30 cycles
  2. Linear Probing: Up to 8 probes on collision

    • Each probe: Load + compare + branch (3 cycles × 8 = 24 cycles worst case)
    • Registry size: 1024 entries (8 KB array)
    • Cost: May span multiple cache lines → cache miss (10-20 cycles penalty)
  3. Validation Overhead: Bounds check after lookup

    • Load slab->base (2 cycles)
    • Compute end address (1 cycle)
    • Compare twice (2 cycles)
    • Cost: 5 cycles per free (not amortized)
  4. Global Shared State: Registry is shared across all threads

    • No cache-line alignment (false sharing risk)
    • Lock-free reads → ABA problem potential
    • Cost: Atomic load penalties (~5-10 cycles vs normal load)

Optimization #1: Enable SuperSlab by Default

Rationale: SuperSlab has O(1) pointer→slab via 2MB alignment (mimalloc-style)

Current:

// Line 81: SuperSlab disabled by default
static int g_use_superslab = 0;  // Runtime toggle

Proposed:

// Enable SuperSlab by default
static int g_use_superslab = 1;  // Always on

Benefits:

  • Eliminate registry lookup entirely: ptr & ~0x1FFFFF (1 AND operation)
  • SuperSlab free path: ~5 ns (vs ~10-30 ns registry path)
  • Better cache locality (2MB aligned pages)

Costs:

  • 2MB address space per SuperSlab (not physical memory due to lazy allocation)
  • Slightly higher memory overhead (metadata at SuperSlab level)

Expected Speedup: 20-30% on free-heavy workloads

Risk: LOW (SuperSlab already implemented and tested in Phase 6.23)

Optimization #2: Cache Last Freed Slab

Rationale: Temporal locality - next free likely from same slab

Proposed:

// Per-thread cache of last freed slab
static __thread TinySlab* t_last_freed_slab[TINY_NUM_CLASSES] = {NULL};

void hak_tiny_free(void* ptr) {
    if (!ptr) return;

    // Try cached slab first (likely hit)
    int class_idx = guess_class_from_size(ptr);  // Heuristic
    TinySlab* slab = t_last_freed_slab[class_idx];

    // Validate pointer is in this slab
    if (slab && ptr_in_slab_range(ptr, slab)) {
        hak_tiny_free_with_slab(ptr, slab);      // FAST PATH: ~5 ns
        return;
    }

    // Fallback to registry lookup (rare)
    slab = hak_tiny_owner_slab(ptr);
    if (slab) {
        t_last_freed_slab[slab->class_idx] = slab;  // Update cache
        hak_tiny_free_with_slab(ptr, slab);
    }
}
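
ptr_in_slab_range() above is a hypothetical helper; a minimal sketch, assuming slab->base and TINY_SLAB_SIZE as used elsewhere in this report:

// Sketch: cheap bounds check against the cached slab (2 loads + 2 compares).
static inline int ptr_in_slab_range(void* ptr, TinySlab* slab) {
    uintptr_t start = (uintptr_t)slab->base;
    uintptr_t p     = (uintptr_t)ptr;
    return p >= start && p < start + TINY_SLAB_SIZE;
}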

Benefits:

  • 80-90% cache hit rate (temporal locality)
  • Fast path: 2 loads + 2 compares (~5 cycles) vs registry lookup (20-30 cycles)

Expected Speedup: 15-20% on free-heavy workloads

Risk: MEDIUM (requires heuristic for class_idx guessing, may mispredict)


Bottleneck #4: Statistics Collection Overhead

Location

  • File: /home/tomoaki/git/hakmem/hakmem_tiny_stats.h
  • Lines: 59-73 (stats_record_alloc, stats_record_free)
  • Impact: LOW (already optimized to TLS batching, but still ~0.5 ns per op)

Code Analysis

// Line 59-62: Allocation statistics (inline)
static inline void stats_record_alloc(int class_idx) __attribute__((always_inline));
static inline void stats_record_alloc(int class_idx) {
    t_alloc_batch[class_idx]++;  // TLS INCREMENT: ~0.5-1 cycle
}

// Line 70-73: Free statistics (inline)
static inline void stats_record_free(int class_idx) __attribute__((always_inline));
static inline void stats_record_free(int class_idx) {
    t_free_batch[class_idx]++;   // TLS INCREMENT: ~0.5-1 cycle
}

Why It's (Slightly) Slow

  1. TLS Access Overhead: Even TLS has cost

    • TLS base register: %fs on x86-64 (implicit)
    • Offset calculation: [%fs + class_idx*4]
    • Cost: ~0.5 cycles (not zero!)
  2. Cache Line Pollution: TLS counters compete for L1 cache

    • t_alloc_batch[8] = 32 bytes
    • t_free_batch[8] = 32 bytes
    • Cost: 64 bytes of L1 cache (1 cache line)
  3. Compiler Optimization Barriers: always_inline prevents optimization

    • Forces inline (good)
    • But prevents compiler from hoisting out of loops (bad)
    • Cost: Increment inside hot loop vs once outside

Optimization: Compile-Time Statistics Toggle

Rationale: Production builds don't need exact counts

Proposed:

#ifdef HAKMEM_ENABLE_STATS
    #define STATS_RECORD_ALLOC(cls) t_alloc_batch[cls]++
    #define STATS_RECORD_FREE(cls)  t_free_batch[cls]++
#else
    #define STATS_RECORD_ALLOC(cls) ((void)0)
    #define STATS_RECORD_FREE(cls)  ((void)0)
#endif
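
Call sites in the hot path then use the macros unconditionally; a small usage sketch (the HAKMEM_ENABLE_STATS flag name is the one proposed above):

// Usage sketch: release builds without -DHAKMEM_ENABLE_STATS compile this away entirely.
static inline void example_alloc_bookkeeping(int class_idx) {
    STATS_RECORD_ALLOC(class_idx);   // expands to ((void)0) unless stats are enabled
}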

Benefits:

  • Zero overhead when stats disabled
  • Compiler can optimize away dead code

Expected Speedup: 3-5% (small but measurable)

Risk: VERY LOW (compile-time flag, no runtime impact)


Bottleneck #5: Magazine Spill/Refill Lock Contention

Location

  • File: /home/tomoaki/git/hakmem/hakmem_tiny.c
  • Lines: 880-939 (magazine spill under class lock)
  • Impact: MEDIUM (affects 5-10% of frees when magazine is full)

Code Analysis

// Line 880-939: Magazine spill (class lock held)
if (mag->top < cap) {
    // Fast path: push to magazine (no lock)
    mag->items[mag->top].ptr = ptr;
    mag->top++;
    return;
}

// Spill half under class lock
pthread_mutex_t* lock = &g_tiny_class_locks[class_idx].m;
pthread_mutex_lock(lock);                        // LOCK: ~30-100 cycles (contended)

int spill = cap / 2;  // Spill 1024 items (for 2048 cap)

for (int i = 0; i < spill && mag->top > 0; i++) {
    TinyMagItem it = mag->items[--mag->top];
    TinySlab* owner = hak_tiny_owner_slab(it.ptr);   // LOOKUP: ~20-30 cycles × 1024
    if (!owner) continue;

    // Phase 4.1: Try mini-magazine push (avoid bitmap)
    if ((owner == tls_a || owner == tls_b) && !mini_mag_is_full(&owner->mini_mag)) {
        mini_mag_push(&owner->mini_mag, it.ptr);     // FAST: ~4 cycles
        continue;
    }

    // Slow path: bitmap update
    size_t bs = g_tiny_class_sizes[owner->class_idx];
    int idx = ((uintptr_t)it.ptr - (uintptr_t)owner->base) / bs;  // DIV: ~10 cycles
    if (hak_tiny_is_used(owner, idx)) {
        hak_tiny_set_free(owner, idx);               // BITMAP: ~10 cycles
        owner->free_count++;
        // ... list management ...
    }
}

pthread_mutex_unlock(lock);
// TOTAL SPILL COST: ~50,000-100,000 cycles (1024 items × 50-100 cycles/item)
// Amortized: ~50-100 cycles (~17-33 ns) per free (when spill happens every ~1000 frees)

Why It's Slow

  1. Lock Hold Time: Lock held for entire spill (1024 items)

    • Blocks other threads from accessing class lock
    • Spill takes ~17-33 µs (50,000-100,000 cycles @ 3 GHz) → other threads stalled
    • Cost: Contention penalty on multi-threaded workloads
  2. Registry Lookup in Loop: 1024 lookups under lock

    • hak_tiny_owner_slab(it.ptr) called 1024 times
    • Each lookup: 20-30 cycles
    • Cost: 20,000-30,000 cycles just for lookups
  3. Division in Hot Loop: Block index calculation uses division

    • int idx = ((uintptr_t)it.ptr - (uintptr_t)owner->base) / bs;
    • Division is ~10 cycles on modern CPUs (not fully pipelined)
    • Cost: 10,000 cycles for 1024 divisions
  4. Large Spill Batch: 1024 items is too large

    • Amortizes lock cost well (good)
    • But increases lock hold time (bad)
    • Trade-off not optimized

Optimization #1: Reduce Spill Batch Size

Rationale: Smaller batches = shorter lock hold time = less contention

Current:

int spill = cap / 2;  // 1024 items for 2048 cap

Proposed:

int spill = 128;  // Fixed batch size (not cap-dependent)

Benefits:

  • Shorter lock hold time: ~2-4 µs (vs ~17-33 µs)
  • Better multi-thread responsiveness

Costs:

  • More frequent spills (8× more frequent)
  • Slightly higher total lock overhead

Expected Speedup: 10-15% on multi-threaded workloads

Risk: LOW (simple parameter change)

Optimization #2: Lock-Free Spill Stack

Rationale: Avoid lock entirely for spill path

Proposed:

// Per-class global spill stack (lock-free MPSC)
static atomic_uintptr_t g_spill_stack[TINY_NUM_CLASSES];

void magazine_spill_lockfree(int class_idx, void* ptr) {
    // Push to lock-free stack (Treiber-style; the block itself stores the next pointer)
    uintptr_t old_head = atomic_load_explicit(&g_spill_stack[class_idx], memory_order_relaxed);
    do {
        *((uintptr_t*)ptr) = old_head;  // Intrusive next-pointer
    } while (!atomic_compare_exchange_weak_explicit(&g_spill_stack[class_idx], &old_head,
                                                    (uintptr_t)ptr,
                                                    memory_order_release, memory_order_relaxed));
}

// Background thread drains spill stack periodically
void background_drain_spill_stack(void) {
    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        uintptr_t head = atomic_exchange_explicit(&g_spill_stack[i], 0, memory_order_acq_rel);
        if (!head) continue;

        pthread_mutex_lock(&g_tiny_class_locks[i].m);
        // ... drain to bitmap ...
        pthread_mutex_unlock(&g_tiny_class_locks[i].m);
    }
}

Benefits:

  • Zero lock contention on spill path
  • Fast atomic CAS (~5-10 cycles)

Costs:

  • Requires background thread or periodic drain
  • Slightly more complex memory management

Expected Speedup: 20-30% on multi-threaded workloads

Risk: HIGH (requires careful design of background drain mechanism)


Bottleneck #6: Branch Misprediction in Size Class Lookup

Location

  • File: /home/tomoaki/git/hakmem/hakmem_tiny.h
  • Lines: 159-182 (hak_tiny_size_to_class)
  • Impact: LOW (only 1-2 ns per allocation, but called on every allocation)

Code Analysis

// Line 159-182: Size to class lookup (branch chain)
static inline int hak_tiny_size_to_class(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return -1;  // BRANCH: ~1 cycle

    // Branch chain (8 branches for 8 classes)
    if (size <= 8) return 0;      // BRANCH: ~1 cycle
    if (size <= 16) return 1;     // BRANCH: ~1 cycle
    if (size <= 32) return 2;     // BRANCH: ~1 cycle
    if (size <= 64) return 3;     // BRANCH: ~1 cycle
    if (size <= 128) return 4;    // BRANCH: ~1 cycle
    if (size <= 256) return 5;    // BRANCH: ~1 cycle
    if (size <= 512) return 6;    // BRANCH: ~1 cycle
    return 7;  // size <= 1024
}
// TYPICAL COST: 3-5 cycles (3-4 branches taken)
// WORST CASE: 8 cycles (all branches checked)

Why It's (Slightly) Slow

  1. Unpredictable Size Distribution: Branch predictor can't learn pattern

    • Real-world allocation sizes are quasi-random
    • Size 16 most common (33%), but others vary
    • Cost: ~20-30% branch misprediction rate (~10 cycles penalty)
  2. Sequential Dependency: Each branch depends on previous

    • CPU can't parallelize branch evaluation
    • Must evaluate branches in order
    • Cost: No instruction-level parallelism (ILP)

Optimization: Branchless Lookup Table

Rationale: Use CLZ (count leading zeros) for O(1) class lookup

Proposed:

// Lookup table for size → class (branchless); indexed directly by size for 0-128
static const int8_t g_size_to_class_table[129] = {
    // size 0: invalid (also caught by the guard below)
    -1,
    // size 1-8: class 0
    0, 0, 0, 0, 0, 0, 0, 0,
    // size 9-16: class 1
    1, 1, 1, 1, 1, 1, 1, 1,
    // size 17-32: class 2
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    // size 33-64: class 3
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    // size 65-128: class 4
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
};

static inline int hak_tiny_size_to_class(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return -1;

    // Fast path: direct table lookup for small sizes
    if (size <= 128) {
        return g_size_to_class_table[size];  // LOAD: ~2 cycles (L1 cache)
    }

    // Slow path: CLZ-based for power-of-2 classes
    // size 129-256 → class 5
    // size 257-512 → class 6
    // size 513-1024 → class 7
    int bits = 64 - __builtin_clzll((unsigned long long)(size - 1)); // highest set bit position, CLZ: ~3 cycles
    return bits - 3;  // 2^8 → 5, 2^9 → 6, 2^10 → 7
}
// TYPICAL COST: 2-3 cycles (table lookup, no branches)
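
Because the table must stay in sync with the class boundaries, a test-only check against the original branch chain is cheap insurance; a sketch, assuming TINY_MAX_SIZE is 1024 as implied by the branch chain above:

#include <assert.h>

// Test-only sketch: reference implementation (the original branch chain) used
// to verify the table/CLZ version for every size in range.
static int size_to_class_ref(size_t size) {
    if (size == 0 || size > 1024) return -1;
    if (size <= 8)   return 0;
    if (size <= 16)  return 1;
    if (size <= 32)  return 2;
    if (size <= 64)  return 3;
    if (size <= 128) return 4;
    if (size <= 256) return 5;
    if (size <= 512) return 6;
    return 7;
}

static void check_size_to_class(void) {
    for (size_t s = 0; s <= 1025; s++) {
        assert(hak_tiny_size_to_class(s) == size_to_class_ref(s));
    }
}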

Benefits:

  • Branchless for common sizes (8-128B covers 80%+ of allocations)
  • Table fits in L1 cache (128 bytes = 2 cache lines)
  • Predictable performance (no branch misprediction)

Expected Speedup: 2-3% (reduce 5 cycles to 2-3 cycles)

Risk: VERY LOW (table is static, no runtime overhead)


Bottleneck #7: Remote Free Drain Overhead

Location

  • File: /home/tomoaki/git/hakmem/hakmem_tiny.c
  • Lines: 146-184 (tiny_remote_drain_locked)
  • Impact: LOW (only affects cross-thread frees, ~10-20% of workloads)

Code Analysis

// Line 146-184: Remote free drain (under class lock)
static void tiny_remote_drain_locked(TinySlab* slab) {
    uintptr_t head = atomic_exchange(&slab->remote_head, NULL, memory_order_acq_rel); // ATOMIC: ~10 cycles
    unsigned drained = 0;

    while (head) {                                   // LOOP: variable iterations
        void* p = (void*)head;
        head = *((uintptr_t*)p);                     // LOAD NEXT: ~2 cycles

        // Calculate block index
        size_t block_size = g_tiny_class_sizes[slab->class_idx]; // LOAD: ~2 cycles
        uintptr_t offset = (uintptr_t)p - (uintptr_t)slab->base; // SUBTRACT: ~1 cycle
        int block_idx = offset / block_size;         // DIVIDE: ~10 cycles

        // Skip if already free (idempotent)
        if (!hak_tiny_is_used(slab, block_idx)) continue; // BITMAP CHECK: ~5 cycles

        hak_tiny_set_free(slab, block_idx);          // BITMAP UPDATE: ~10 cycles

        int was_full = (slab->free_count == 0);      // LOAD: ~1 cycle
        slab->free_count++;                          // INCREMENT: ~1 cycle

        if (was_full) {
            move_to_free_list(slab->class_idx, slab); // LIST UPDATE: ~20-50 cycles (rare)
        }

        if (slab->free_count == slab->total_count) {
            // ... slab release logic ... (rare)
            release_slab(slab);                      // EXPENSIVE: ~1000 cycles (very rare)
            break;
        }

        g_tiny_pool.free_count[slab->class_idx]++;   // GLOBAL INCREMENT: ~1 cycle
        drained++;
    }

    if (drained) atomic_fetch_sub(&slab->remote_count, drained, memory_order_relaxed); // ATOMIC: ~10 cycles
}
// TYPICAL COST: 50-100 cycles per drained block (moderate)
// WORST CASE: 1000+ cycles (slab release)

Why It's Slow

  1. Division in Loop: Block index calculation uses division

    • int block_idx = offset / block_size;
    • Division is ~10 cycles (even on modern CPUs)
    • Cost: 10 cycles × N remote frees
  2. Atomic Operations: 2 atomic ops per drain (exchange + fetch_sub)

    • atomic_exchange at start (~10 cycles)
    • atomic_fetch_sub at end (~10 cycles)
    • Cost: 20 cycles overhead (not per-block, but still expensive)
  3. Bitmap Update: Same as allocation path

    • hak_tiny_set_free updates both bitmap and summary
    • Cost: 10 cycles per block

Optimization: Multiplication-Based Division

Rationale: Replace division with multiplication by reciprocal

Current:

int block_idx = offset / block_size;  // DIVIDE: ~10 cycles

Proposed:

// Pre-computed reciprocals (magic constants)
static const uint64_t g_tiny_block_reciprocals[TINY_NUM_CLASSES] = {
    // Computed as: (1ULL << 48) / block_size
    // Allows: block_idx = (offset * reciprocal) >> 48
    (1ULL << 48) / 8,     // Class 0: 8B    = 0x200000000000
    (1ULL << 48) / 16,    // Class 1: 16B   = 0x100000000000
    (1ULL << 48) / 32,    // Class 2: 32B   = 0x80000000000
    (1ULL << 48) / 64,    // Class 3: 64B   = 0x40000000000
    (1ULL << 48) / 128,   // Class 4: 128B  = 0x20000000000
    (1ULL << 48) / 256,   // Class 5: 256B  = 0x10000000000
    (1ULL << 48) / 512,   // Class 6: 512B  = 0x8000000000
    (1ULL << 48) / 1024,  // Class 7: 1024B = 0x4000000000
};

// Fast division using multiplication
int block_idx = (int)((offset * g_tiny_block_reciprocals[slab->class_idx]) >> 48); // MUL + SHIFT: ~3 cycles
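
Since all tiny classes here are powers of two, a per-class shift table would be even cheaper; the reciprocal form above is kept because it also generalizes to non-power-of-two classes. A sketch of the shift variant (shift values assume the 8B-1024B classes listed in this report):

// Alternative sketch: block_idx = offset >> shift (a 1-cycle shift, no multiply).
static const uint8_t g_tiny_block_shift[TINY_NUM_CLASSES] = {
    3, 4, 5, 6, 7, 8, 9, 10   // log2 of 8, 16, 32, 64, 128, 256, 512, 1024
};

// int block_idx = (int)(offset >> g_tiny_block_shift[slab->class_idx]);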

Benefits:

  • Reduce 10 cycles to 3 cycles per division
  • Saves 7 cycles per remote free

Expected Speedup: 5-10% on cross-thread workloads

Risk: VERY LOW (well-known compiler optimization, manually applied)


Profiling Plan

perf Commands to Run

# 1. CPU cycle breakdown (identify hotspots)
perf record -e cycles:u -g ./bench_comprehensive
perf report --stdio --no-children | head -100 > perf_cycles.txt

# 2. Cache miss analysis (L1d, L1i, LLC)
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,\
L1-icache-loads,L1-icache-load-misses,LLC-loads,LLC-load-misses \
./bench_comprehensive

# 3. Branch misprediction rate
perf stat -e cycles,instructions,branches,branch-misses \
./bench_comprehensive

# 4. TLB miss analysis (address translation overhead)
perf stat -e cycles,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
./bench_comprehensive

# 5. Function-level profiling (annotated source)
perf record -e cycles:u --call-graph dwarf ./bench_comprehensive
perf report --stdio --sort symbol --percent-limit 1

# 6. Memory bandwidth utilization
perf stat -e cycles,mem_load_retired.l1_hit,mem_load_retired.l1_miss,\
mem_load_retired.l2_hit,mem_load_retired.l3_hit,mem_load_retired.l3_miss \
./bench_comprehensive

# 7. Allocation-specific hotspots (focus on hak_tiny_alloc)
perf record -e cycles:u -g --call-graph dwarf -- ./bench_comprehensive
perf report --stdio | grep "hak_tiny"

Expected Hotspots to Validate

Based on code analysis, we expect to see:

  1. hak_tiny_find_free_block (15-25% of cycles)

    • Two-tier bitmap scan
    • CTZ operations
    • Cache misses on large bitmaps
  2. hak_tiny_set_used / hak_tiny_set_free (10-15% of cycles)

    • Bitmap updates
    • Summary bitmap updates
    • Write-heavy (cache line bouncing)
  3. hak_tiny_owner_slab (10-20% of cycles on free path)

    • Registry lookup
    • Hash computation
    • Linear probing
  4. tiny_mag_init_if_needed (5-10% of cycles)

    • TLS access
    • Conditional initialization
  5. stats_record_alloc / stats_record_free (3-5% of cycles)

    • TLS counter increments
    • Cache line pollution

Validation Criteria

Cache Miss Rates:

  • L1d miss rate: < 5% (good), 5-10% (acceptable), > 10% (poor)
  • LLC miss rate: < 1% (good), 1-3% (acceptable), > 3% (poor)

Branch Misprediction:

  • Misprediction rate: < 2% (good), 2-5% (acceptable), > 5% (poor)
  • Expected: 3-4% (due to unpredictable size classes)

IPC (Instructions Per Cycle):

  • IPC: > 2.0 (good), 1.5-2.0 (acceptable), < 1.5 (poor)
  • Expected: 1.5-1.8 (memory-bound, not compute-bound)

Function Time Distribution:

  • hak_tiny_alloc: 40-60% (hot path)
  • hak_tiny_free: 20-30% (warm path)
  • hak_tiny_find_free_block: 10-20% (expensive when hit)
  • Other: < 10%

Optimization Roadmap

Quick Wins (< 1 hour, Low Risk)

  1. Enable SuperSlab by Default (Bottleneck #3)

    • Change: g_use_superslab = 1;
    • Impact: 20-30% speedup on free path
    • Risk: VERY LOW (already implemented)
    • Effort: 5 minutes
  2. Disable Statistics in Production (Bottleneck #4)

    • Change: Add #ifndef HAKMEM_ENABLE_STATS guards
    • Impact: 3-5% speedup
    • Risk: VERY LOW (compile-time flag)
    • Effort: 15 minutes
  3. Increase Mini-Magazine Capacity (Bottleneck #2)

    • Change: mag_capacity = 64 (was 32)
    • Impact: 10-15% speedup (reduce bitmap scans)
    • Risk: LOW (slight memory increase)
    • Effort: 5 minutes
  4. Branchless Size Class Lookup (Bottleneck #6)

    • Change: Use lookup table for common sizes
    • Impact: 2-3% speedup
    • Risk: VERY LOW (static table)
    • Effort: 30 minutes

Total Expected Speedup: 35-53% (conservative: 1.4-1.5×)

Medium Effort (1-4 hours, Medium Risk)

  1. Unified TLS Cache Structure (Bottleneck #1)

    • Change: Merge TLS arrays into single cache-aligned struct
    • Impact: 30-40% speedup on fast path
    • Risk: MEDIUM (requires refactoring)
    • Effort: 3-4 hours
  2. Reduce Magazine Spill Batch (Bottleneck #5)

    • Change: spill = 128 (was 1024)
    • Impact: 10-15% speedup on multi-threaded
    • Risk: LOW (parameter tuning)
    • Effort: 30 minutes
  3. Cache-Aware Bitmap Layout (Bottleneck #2)

    • Change: Embed small bitmaps in slab structure
    • Impact: 5-10% speedup
    • Risk: MEDIUM (requires struct changes)
    • Effort: 2-3 hours
  4. Multiplication-Based Division (Bottleneck #7)

    • Change: Replace division with mul+shift
    • Impact: 5-10% speedup on remote frees
    • Risk: VERY LOW (well-known optimization)
    • Effort: 1 hour

Total Expected Speedup: 50-75% (conservative: 1.5-1.8×)

Major Refactors (> 4 hours, High Risk)

  1. Lock-Free Spill Stack (Bottleneck #5)

    • Change: Use atomic MPSC queue for magazine spill
    • Impact: 20-30% speedup on multi-threaded
    • Risk: HIGH (complex concurrency)
    • Effort: 8-12 hours
  2. Lazy Summary Bitmap Update (Bottleneck #2)

    • Change: Rebuild summary only when scanning
    • Impact: 15-20% speedup on free-heavy workloads
    • Risk: MEDIUM (requires careful staleness tracking)
    • Effort: 4-6 hours
  3. Collapse TLS Magazine Tiers (Bottleneck #1)

    • Change: Merge magazine + mini-mag into single LIFO
    • Impact: 40-50% speedup (eliminate tier overhead)
    • Risk: HIGH (major architectural change)
    • Effort: 12-16 hours
  4. Full mimalloc-Style Rewrite (All Bottlenecks)

    • Change: Replace bitmap with intrusive free-list
    • Impact: 5-9× speedup (match mimalloc)
    • Risk: VERY HIGH (complete redesign)
    • Effort: 40+ hours

Total Expected Speedup: 75-150% (optimistic: 1.8-2.5×)


Risk Assessment Summary

Low Risk Optimizations (Safe to implement immediately)

  • SuperSlab enable
  • Statistics compile-time toggle
  • Mini-mag capacity increase
  • Branchless size lookup
  • Multiplication division
  • Magazine spill batch reduction

Expected: 1.4-1.6× speedup, 2-3 hours effort

Medium Risk Optimizations (Test thoroughly)

  • Unified TLS cache structure
  • Cache-aware bitmap layout
  • Lazy summary update

Expected: 1.6-2.0× speedup, 6-10 hours effort

High Risk Optimizations (Prototype first)

  • Lock-free spill stack
  • Magazine tier collapse
  • Full mimalloc rewrite

Expected: 2.0-9.0× speedup, 20-60 hours effort


Estimated Speedup Summary

Conservative Target (Low + Medium optimizations)

  • Random pattern: 68 M ops/sec → 140 M ops/sec (2.0× speedup)
  • LIFO pattern: 102 M ops/sec → 200 M ops/sec (2.0× speedup)
  • Gap to mimalloc: 2.6× → 1.3× (close 50% of gap)

Optimistic Target (All optimizations)

  • Random pattern: 68 M ops/sec → 170 M ops/sec (2.5× speedup)
  • LIFO pattern: 102 M ops/sec → 450 M ops/sec (4.4× speedup)
  • Gap to mimalloc: 2.6× → 1.0× (match on random, ~2× on LIFO)

Conclusion

The hakmem allocator's 2.6× gap to mimalloc on favorable patterns (random free) is primarily due to:

  1. Architectural overhead: 6-tier allocation hierarchy vs mimalloc's 3-tier
  2. Bitmap traversal cost: Two-tier scan adds 15-20 cycles even when optimized
  3. Registry lookup overhead: Hash table lookup adds 20-30 cycles on free path

Quick wins (1-3 hours effort) can achieve 1.4-1.6× speedup. Medium effort (10 hours) can achieve 1.8-2.0× speedup. Full mimalloc-style rewrite (40+ hours) needed to match mimalloc's 1.1 ns/op.

Recommendation: Implement quick wins first (SuperSlab + stats disable + branchless lookup), measure results with perf, then decide if medium-effort optimizations are worth the complexity increase.