# Bottleneck Analysis Report: hakmem Tiny Pool Allocator

**Date**: 2025-10-26
**Target**: hakmem bitmap-based allocator
**Baseline**: mimalloc (industry standard)
**Analyzed by**: Deep code analysis + performance modeling

---

## Executive Summary

### Top 3 Bottlenecks with Estimated Impact

1. **TLS Magazine Hierarchy Overhead** (HIGH: ~3-5 ns per allocation)
   - 3-tier indirection: TLS Magazine → TLS Active Slab → Mini-Magazine → Bitmap
   - Each tier adds cache-miss risk and branching overhead
   - Expected speedup: 30-40% if collapsed to 2 tiers

2. **Two-Tier Bitmap Traversal** (HIGH: ~4-6 ns on bitmap path)
   - Summary bitmap scan + main bitmap scan + hint_word update
   - Cache-friendly but computationally expensive (2× CTZ, 2× bitmap updates)
   - Expected speedup: 20-30% if bypassed more often via better caching

3. **Registry Lookup on Free Path** (MEDIUM: ~2-4 ns per free)
   - Hash computation + linear probe + validation on every cross-slab free
   - Could be eliminated with mimalloc-style pointer arithmetic
   - Expected speedup: 15-25% on free-heavy workloads

### Performance Gap Analysis

**Random Free Pattern** (bitmap's best case):
- hakmem: 68 M ops/sec (14.7 ns/op)
- mimalloc: 176 M ops/sec (5.7 ns/op)
- **Gap**: 2.6× slower (9 ns difference)

**Sequential LIFO Pattern** (free-list's best case):
- hakmem: 102 M ops/sec (9.8 ns/op)
- mimalloc: 942 M ops/sec (1.1 ns/op)
- **Gap**: 9.2× slower (8.7 ns difference)

**Key Insight**: Even on favorable patterns (random free), we are 2.6× slower. The bottleneck is therefore NOT just the bitmap, but the allocation architecture as a whole.

### Expected Total Speedup

- Conservative: 2.0-2.5× (partially close the 2.6× gap)
- Optimistic: 3.0-4.0× (with aggressive optimizations)
- Realistic target: 2.5× (~170 M ops/sec on random, ~250 M ops/sec on LIFO)

---

## Critical Path Analysis

### Allocation Fast Path Walkthrough

Tracing the exact execution path for `hak_tiny_alloc(16)` with step-by-step cycle estimates:
```c
// hakmem_tiny.c:557 - Entry point
void* hak_tiny_alloc(size_t size) {
    // Line 558: Initialization check
    if (!g_tiny_initialized) hak_tiny_init();      // BRANCH: ~1 cycle (predicted taken once)

    // Line 561-562: Wrapper context check
    extern int hak_in_wrapper(void);
    if (!g_wrap_tiny_enabled && hak_in_wrapper())  // BRANCH: ~1 cycle
        return NULL;

    // Line 565: Size to class conversion
    int class_idx = hak_tiny_size_to_class(size);  // INLINE: ~2 cycles (branch chain)
    if (class_idx < 0) return NULL;                // BRANCH: ~1 cycle

    // Line 569-576: SuperSlab path (disabled by default)
    if (g_use_superslab) { /* ... */ }             // BRANCH: ~1 cycle (not taken)

    // Line 650-651: TLS Magazine initialization check
    tiny_mag_init_if_needed(class_idx);            // INLINE: ~3 cycles (conditional init)
    TinyTLSMag* mag = &g_tls_mags[class_idx];      // TLS ACCESS: ~2 cycles

    // Line 666-670: TLS Magazine fast path (BEST CASE)
    if (mag->top > 0) {                            // LOAD + BRANCH: ~2 cycles
        void* p = mag->items[--mag->top].ptr;      // LOAD + DEC + STORE: ~3 cycles
        stats_record_alloc(class_idx);             // INLINE: ~1 cycle (TLS increment)
        return p;                                  // RETURN: ~1 cycle
    }
    // TOTAL FAST PATH: ~18 cycles (~6 ns @ 3 GHz)

    // Line 673-674: TLS Active Slab lookup (MEDIUM PATH)
    TinySlab* tls = g_tls_active_slab_a[class_idx];   // TLS ACCESS: ~2 cycles
    if (!(tls && tls->free_count > 0))                // LOAD + BRANCH: ~3 cycles
        tls = g_tls_active_slab_b[class_idx];         // TLS ACCESS: ~2 cycles (if taken)

    if (tls && tls->free_count > 0) {                 // BRANCH: ~1 cycle
        // Line 677-679: Remote drain check
        if (atomic_load(&tls->remote_count) >= thresh || rand() & mask) {
            tiny_remote_drain_owner(tls);             // RARE: ~50-200 cycles (if taken)
        }

        // Line 682-688: Mini-magazine fast path
        if (!mini_mag_is_empty(&tls->mini_mag)) {     // LOAD + BRANCH: ~2 cycles
            void* p = mini_mag_pop(&tls->mini_mag);   // INLINE: ~4 cycles (LIFO pop)
            if (p) {
                stats_record_alloc(class_idx);        // INLINE: ~1 cycle
                return p;                             // RETURN: ~1 cycle
            }
        }
        // MINI-MAG PATH: ~30 cycles (~10 ns)

        // Line 691-700: Batch refill from bitmap
        if (tls->free_count > 0 && mini_mag_is_empty(&tls->mini_mag)) {
            int refilled = batch_refill_from_bitmap(tls, &tls->mini_mag, 16);
            // REFILL COST: ~48 ns for 16 items = ~3 ns/item amortized
            if (refilled > 0) {
                void* p = mini_mag_pop(&tls->mini_mag);
                if (p) { stats_record_alloc(class_idx); return p; }
            }
        }
        // REFILL PATH: ~50 cycles (~17 ns) for batch + ~10 ns for next alloc

        // Line 703-713: Bitmap scan fallback
        if (tls->free_count > 0) {
            int block_idx = hak_tiny_find_free_block(tls);      // BITMAP SCAN: ~15-20 cycles
            if (block_idx >= 0) {
                hak_tiny_set_used(tls, block_idx);              // BITMAP UPDATE: ~10 cycles
                tls->free_count--;                              // STORE: ~1 cycle
                void* p = (char*)tls->base + (block_idx * bs);  // COMPUTE: ~3 cycles
                stats_record_alloc(class_idx);                  // INLINE: ~1 cycle
                return p;                                       // RETURN: ~1 cycle
            }
        }
        // BITMAP PATH: ~50 cycles (~17 ns)
    }

    // Line 717-718: Lock and refill from global pool (SLOW PATH)
    pthread_mutex_lock(lock);   // LOCK: ~30-100 cycles (contended)
    // ... slow path: 200-1000 cycles (rare) ...
}
```

### Cycle Count Summary

| Path                 | Cycles | Latency (ns) | Frequency | Notes                  |
|----------------------|--------|--------------|-----------|------------------------|
| **TLS Magazine Hit** | ~18    | ~6 ns        | 60-80%    | Best case (cache hit)  |
| **Mini-Mag Hit**     | ~30    | ~10 ns       | 10-20%    | Good case (slab-local) |
| **Batch Refill**     | ~50    | ~17 ns       | 5-10%     | Amortized 3 ns/item    |
| **Bitmap Scan**      | ~50    | ~17 ns       | 5-10%     | Worst case before lock |
| **Global Lock Path** | ~300   | ~100 ns      | <5%       | Very rare (refill)     |

**Weighted Average**: 0.7×6 + 0.15×10 + 0.1×17 + 0.05×100 ≈ **12 ns/op** (theoretical)

**Measured Actual**: 9.8-14.7 ns/op (consistent with the model)

### Comparison with mimalloc's Approach

mimalloc achieves **1.1 ns/op** on the LIFO pattern by:

1. **No TLS Magazine Layer**: Direct access to the thread-local page free-list
2. **Intrusive Free-List**: 1 load + 1 store (~2 cycles) vs our ~18 cycles
3. **2MB Alignment**: O(1) pointer→slab via bit-masking (no registry lookup)
4. **No Bitmap**: Free-list only (trades random-access resistance for speed)
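For reference, the intrusive free-list pop behind that 1-load-plus-1-store figure boils down to the following. This is a minimal sketch of the general technique with hypothetical `Page`/`Block` types, not mimalloc's actual code:

```c
#include <stddef.h>

/* Sketch of an intrusive free-list fast path (hypothetical types).
 * Each free block stores the pointer to the next free block in its own
 * first 8 bytes, so there is no separate magazine array to maintain. */
typedef struct Block { struct Block* next; } Block;
typedef struct Page  { Block* free; } Page;

static inline void* page_alloc_fast(Page* page) {
    Block* b = page->free;       // 1 load
    if (!b) return NULL;         // miss: fall back to a slower path
    page->free = b->next;        // 1 load + 1 store
    return b;
}

static inline void page_free_fast(Page* page, void* p) {
    Block* b = (Block*)p;
    b->next = page->free;        // push onto the same LIFO
    page->free = b;
}
```

Compared with the hakmem fast path above, there is no per-class TLS array walk and no copy of the pointer into a magazine slot; the free block itself is the list node.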
**hakmem's Architecture**:

```
Allocation Request
    ↓
TLS Magazine (2048 items)        ← 1st tier: ~6 ns (cache hit)
    ↓ (miss)
TLS Active Slab (2 per class)    ← 2nd tier: lookup cost
    ↓
Mini-Magazine (16-32 items)      ← 3rd tier: ~10 ns (LIFO pop)
    ↓ (miss)
Batch Refill (16 items)          ← 4th tier: ~3 ns amortized
    ↓ (miss)
Bitmap Scan (two-tier)           ← 5th tier: ~17 ns (expensive)
    ↓ (miss)
Global Lock + Slab Allocation    ← 6th tier: ~100+ ns (rare)
```

**mimalloc's Architecture**:

```
Allocation Request
    ↓
Thread-Local Page Free-List      ← 1st tier: ~1 ns (1 load + 1 store)
    ↓ (miss)
Thread-Local Page Queue          ← 2nd tier: ~5 ns (page switch)
    ↓ (miss)
Global Segment Allocation        ← 3rd tier: ~50 ns (rare)
```

**Key Difference**: mimalloc has 3 tiers; hakmem has 6. Each tier adds ~2-3 ns of overhead.

---

## Bottleneck #1: TLS Magazine Hierarchy Overhead

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 650-714 (allocation fast path)
- **Impact**: HIGH (affects 100% of allocations)

### Code Analysis

```c
// Line 650-651: 1st tier - TLS Magazine
tiny_mag_init_if_needed(class_idx);          // ~3 cycles (conditional check)
TinyTLSMag* mag = &g_tls_mags[class_idx];    // ~2 cycles (TLS base + offset)

// Line 666-670: TLS Magazine lookup
if (mag->top > 0) {                          // ~2 cycles (load + branch)
    void* p = mag->items[--mag->top].ptr;    // ~3 cycles (array access + decrement)
    stats_record_alloc(class_idx);           // ~1 cycle (TLS increment)
    return p;                                // ~1 cycle
}
// TOTAL: ~12 cycles for cache hit (BEST CASE)

// Line 673-674: 2nd tier - TLS Active Slab lookup
TinySlab* tls = g_tls_active_slab_a[class_idx];  // ~2 cycles (TLS access)
if (!(tls && tls->free_count > 0))               // ~3 cycles (2 loads + branch)
    tls = g_tls_active_slab_b[class_idx];        // ~2 cycles (if miss)

// Line 682-688: 3rd tier - Mini-Magazine
if (!mini_mag_is_empty(&tls->mini_mag)) {        // ~2 cycles (load slab->mini_mag.count)
    void* p = mini_mag_pop(&tls->mini_mag);      // ~4 cycles (LIFO pop: 2 loads + 1 store)
    if (p) { stats_record_alloc(class_idx); return p; }
}
// TOTAL: ~13 cycles for mini-mag hit (MEDIUM CASE)
```

### Why It's Slow

1. **Multiple TLS Accesses**: Each tier requires a TLS base lookup + offset calculation
   - `g_tls_mags[class_idx]` → TLS read #1
   - `g_tls_active_slab_a[class_idx]` → TLS read #2
   - `g_tls_active_slab_b[class_idx]` → TLS read #3 (conditional)
   - **Cost**: 2-3 cycles each × 3 = 6-9 cycles of overhead

2. **Cache Line Fragmentation**: TLS variables live in separate arrays
   - `g_tls_mags[8]` = ~128 KB total (2048 items × 8 classes × 8 bytes)
   - `g_tls_active_slab_a[8]` = 64 bytes
   - `g_tls_active_slab_b[8]` = 64 bytes
   - **Cost**: Spans many cache lines → potential cache misses

3. **Branch Misprediction**: The multi-tier fallback creates a branch chain
   - Magazine empty? → Check active slab A
   - Slab A empty? → Check active slab B
   - Mini-mag empty? → Refill from bitmap
   - **Cost**: Each mispredicted branch = 10-20 cycles of penalty

4. **Redundant Metadata**: Magazine items store `{void* ptr}` separately from slab pointers
   - Magazine items: 8 bytes per pointer (2048 × 8 = 16 KB per class)
   - Slab pointers: 8 bytes × 2 per class (16 bytes)
   - **Cost**: Memory overhead reduces cache efficiency
### Optimization: Unified TLS Cache Structure

**Before** (current):

```c
// Separate TLS arrays (fragmented in memory); each TinyTLSMag holds
// items[TINY_TLS_MAG_CAP] plus a top index
static __thread TinyTLSMag g_tls_mags[TINY_NUM_CLASSES];
static __thread TinySlab*  g_tls_active_slab_a[TINY_NUM_CLASSES];
static __thread TinySlab*  g_tls_active_slab_b[TINY_NUM_CLASSES];
```

**After** (proposed):

```c
// Unified per-class TLS structure (cache-line aligned)
typedef struct __attribute__((aligned(64))) {
    // Hot fields first: magazine storage + counters
    void*     mag_items[32];       // Reduced from 2048 to 32 (still effective)
    uint16_t  mag_top;             // Current magazine count
    uint16_t  mag_cap;             // Magazine capacity
    uint32_t  _pad0;

    // Warm fields
    TinySlab*    active_slab;      // Primary active slab (no A/B split)
    PageMiniMag* mini_mag;         // Direct pointer to the slab's mini-mag
    uint64_t     last_refill_tsc;  // For adaptive refill timing

    // Cold fields
    uint64_t stats_alloc_batch;    // Batched statistics
    uint64_t stats_free_batch;
} TinyTLSCache;

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];
```

**Benefits**:

1. Single TLS access: `g_tls_cache[class_idx]` instead of 3 separate lookups
2. Cache-line aligned: hot fields packed at the front of the structure
3. Reduced magazine size: 32 items (not 2048) saves ~15.75 KB per class
4. Direct mini-mag pointer: no slab→mini_mag indirection

**Expected Speedup**: 30-40% (reduce the fast path from ~12 cycles to ~7 cycles)

**Risk**: MEDIUM
- Requires refactoring TLS access patterns throughout the codebase
- The smaller magazine may increase refill frequency (trade-off)
- Needs careful testing to ensure no regression on multi-threaded workloads
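A sketch of the resulting allocation fast path, assuming the proposed `TinyTLSCache` layout above (illustrative only; the field and helper names follow the proposal, not the current code):

```c
// Illustrative fast path with the unified TLS cache: one TLS read,
// one bounds check, one array pop. Refill details are omitted.
static inline void* tiny_alloc_fast(int class_idx) {
    TinyTLSCache* c = &g_tls_cache[class_idx];   // single TLS access
    if (c->mag_top > 0) {
        return c->mag_items[--c->mag_top];       // ~3-4 cycles total
    }
    // Miss: refill c->mag_items directly from c->active_slab's mini-mag
    // or bitmap (existing batch_refill_from_bitmap path), then retry.
    return NULL;  // caller falls through to the refill/slow path
}
```

The A/B active-slab probe and the separate mini-mag check disappear from the hit path; they only run inside the refill routine.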
---

## Bottleneck #2: Two-Tier Bitmap Traversal

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.h`
- **Lines**: 235-269 (`hak_tiny_find_free_block`)
- **Impact**: HIGH (affects 5-15% of allocations, but expensive when hit)

### Code Analysis

```c
// Line 235-269: Two-tier bitmap scan
static inline int hak_tiny_find_free_block(TinySlab* slab) {
    const int bw = g_tiny_bitmap_words[slab->class_idx];  // Bitmap words
    const int sw = slab->summary_words;                   // Summary words
    if (bw <= 0 || sw <= 0) return -1;

    int start_word = slab->hint_word % bw;   // Hint optimization
    int start_sw = start_word / 64;          // Summary word index
    int start_sb = start_word % 64;          // Summary bit offset

    // Line 244-267: Summary bitmap scan (outer loop)
    for (int k = 0; k < sw; k++) {                  // ~sw iterations (1-128)
        int idx = start_sw + k;
        if (idx >= sw) idx -= sw;                   // Wrap-around

        uint64_t bits = slab->summary[idx];         // LOAD: ~2 cycles

        // Mask optimization (skip processed bits)
        if (k == 0) {
            bits &= (~0ULL) << start_sb;            // BITWISE: ~1 cycle
        }
        if (idx == sw - 1 && (bw % 64) != 0) {
            uint64_t mask = (bw % 64) == 64 ? ~0ULL
                                            : ((1ULL << (bw % 64)) - 1ULL);
            bits &= mask;                           // BITWISE: ~1 cycle
        }

        if (bits == 0) continue;                    // BRANCH: ~1 cycle (often taken)

        int woff = __builtin_ctzll(bits);           // CTZ #1: ~3 cycles
        int word_idx = idx * 64 + woff;             // COMPUTE: ~2 cycles
        if (word_idx >= bw) continue;               // BRANCH: ~1 cycle

        // Line 261-266: Main bitmap scan (inner)
        uint64_t used = slab->bitmap[word_idx];     // LOAD: ~2 cycles (cache-miss risk)
        uint64_t free_bits = ~used;                 // BITWISE: ~1 cycle
        if (free_bits == 0) continue;               // BRANCH: ~1 cycle (rare)

        int bit_idx = __builtin_ctzll(free_bits);   // CTZ #2: ~3 cycles
        slab->hint_word = (uint16_t)((word_idx + 1) % bw);   // UPDATE HINT: ~2 cycles
        return word_idx * 64 + bit_idx;             // RETURN: ~1 cycle
    }
    return -1;
}
// TYPICAL COST: 15-20 cycles (1-2 summary iterations, 1 main bitmap access)
// WORST CASE: 50-100 cycles (many summary words scanned, cache misses)
```

### Why It's Slow

1. **Two-Level Indirection**: Summary → bitmap → block
   - Summary scan: find a word with free bits (~5-10 cycles)
   - Main bitmap scan: find a bit within that word (~5 cycles)
   - **Cost**: 2× CTZ operations, 2× memory loads

2. **Cache Miss Risk**: The bitmap can be up to 1 KB (128 words × 8 bytes)
   - Class 0 (8B): 128 words = 1024 bytes
   - Class 1 (16B): 64 words = 512 bytes
   - Class 2 (32B): 32 words = 256 bytes
   - **Cost**: The bitmap may not stay resident in L1 (32 KB) → L2 access (~10-20 cycles)

3. **Hint Word State**: Must be updated on every allocation
   - Read hint_word (~1 cycle)
   - Compute the new hint (~2 cycles)
   - Write hint_word (~1 cycle)
   - **Cost**: ~4 cycles per allocation (not amortized)

4. **Branch-Heavy Loop**: Multiple branches per iteration
   - `if (bits == 0) continue;` (often taken when the bitmap is sparse)
   - `if (word_idx >= bw) continue;` (rare safety check)
   - `if (free_bits == 0) continue;` (rare but costly)
   - **Cost**: Branch misprediction = 10-20 cycles each

### Optimization #1: Increase Mini-Magazine Capacity

**Rationale**: Avoid the bitmap scan by keeping more items in the mini-magazine

**Current**:
```c
// Line 344: Mini-magazine capacity
uint16_t mag_capacity = (class_idx <= 3) ? 32 : 16;
```

**Proposed**:
```c
// Increase capacity to reduce bitmap scan frequency
uint16_t mag_capacity = (class_idx <= 3) ? 64 : 32;
```

**Benefits**:
- Fewer bitmap scans (amortized over 64 items instead of 32)
- Better temporal locality (more items cached)

**Costs**:
- +256 bytes of memory per slab (64 × 8-byte pointers)
- Slightly higher refill cost (64 items vs 32)

**Expected Speedup**: 10-15% (cuts bitmap scan frequency roughly in half)

**Risk**: LOW (simple parameter change, no logic changes)

### Optimization #2: Cache-Aware Bitmap Layout

**Rationale**: Keep the bitmap on the same cache lines as the slab metadata for hot classes

**Current**:
```c
// Separate bitmap allocation (may be cache-cold)
slab->bitmap = (uint64_t*)hkm_libc_calloc(bitmap_size, sizeof(uint64_t));
```

**Proposed**:
```c
// Embed small bitmaps directly in the slab structure
typedef struct TinySlab {
    // ... existing fields ...

    // Embedded bitmap for bitmaps of ≤256 bytes (≤32 words)
    union {
        uint64_t* bitmap_ptr;        // Large bitmaps: heap-allocated
        uint64_t  bitmap_embed[32];  // Small bitmaps: embedded (256 bytes)
    };
    uint8_t bitmap_embedded;         // Flag: 1=embedded, 0=heap
} TinySlab;
```

**Benefits**:
- Classes with ≤32 bitmap words: the bitmap fits in the embedded 256 bytes
- Bitmap and slab metadata share adjacent cache lines (no pointer chase)
- No separate heap allocation for those classes

**Expected Speedup**: 5-10% (fewer cache misses on bitmap access)

**Risk**: MEDIUM (requires refactoring the bitmap access logic)
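Access sites would then go through a small accessor instead of touching `slab->bitmap` directly. A sketch under the proposed layout (the flag and field names are the ones introduced above):

```c
// Sketch: single accessor that hides whether the bitmap is embedded or
// heap-allocated, so hak_tiny_find_free_block / set_used / set_free only
// need a one-line change each.
static inline uint64_t* tiny_slab_bitmap(TinySlab* slab) {
    return slab->bitmap_embedded ? slab->bitmap_embed : slab->bitmap_ptr;
}
```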
### Optimization #3: Lazy Summary Bitmap Update

**Rationale**: The summary-bitmap update is expensive on the free path

**Current**:
```c
// Line 199-213: Summary update on every set_used/set_free
static inline void hak_tiny_set_used(TinySlab* slab, int block_idx) {
    // ... bitmap update ...

    // Update summary (EXPENSIVE)
    int sum_word = word_idx / 64;
    int sum_bit  = word_idx % 64;
    uint64_t has_free = ~v;
    if (has_free != 0) {
        slab->summary[sum_word] |= (1ULL << sum_bit);    // WRITE
    } else {
        slab->summary[sum_word] &= ~(1ULL << sum_bit);   // WRITE
    }
}
```

**Proposed**:
```c
// Lazy summary update (rebuild only when scanning)
static inline void hak_tiny_set_used(TinySlab* slab, int block_idx) {
    // ... bitmap update ...
    // NO SUMMARY UPDATE (deferred)
}

static inline int hak_tiny_find_free_block(TinySlab* slab) {
    // Rebuild the summary if stale (rare)
    if (slab->summary_stale) {
        rebuild_summary_bitmap(slab);   // O(N) but rare
        slab->summary_stale = 0;
    }
    // ... existing scan logic ...
}
```

**Benefits**:
- Eliminates the summary update on ~95% of operations (free path)
- The summary rebuild cost is amortized over many allocations

**Expected Speedup**: 15-20% on free-heavy workloads

**Risk**: MEDIUM (requires careful stale-bit management)
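The proposal above does not show who sets `summary_stale`; the natural place is the free-side bitmap update. A sketch, assuming a `summary_stale` flag is added to `TinySlab`:

```c
// Sketch: mark the summary stale whenever a "used" bit is cleared without
// maintaining the summary, so the next scan knows to rebuild it first.
static inline void hak_tiny_set_free(TinySlab* slab, int block_idx) {
    int word_idx = block_idx / 64;
    int bit_idx  = block_idx % 64;
    slab->bitmap[word_idx] &= ~(1ULL << bit_idx);   // clear "used" bit
    slab->summary_stale = 1;                        // defer the summary update
}
```

The allocation side (`set_used`) can stay lazy without a flag: a summary bit that over-reports free words is already tolerated by the scan's `if (free_bits == 0) continue;` branch.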
---

## Bottleneck #3: Registry Lookup on Free Path

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 1102-1118 (`hak_tiny_free`)
- **Impact**: MEDIUM (affects cross-slab frees, ~30-50% of frees)

### Code Analysis

```c
// Line 1102-1118: Free path with registry lookup
void hak_tiny_free(void* ptr) {
    if (!ptr || !g_tiny_initialized) return;

    // Line 1106-1111: SuperSlab fast path (disabled by default)
    SuperSlab* ss = ptr_to_superslab(ptr);        // BITWISE: ~2 cycles
    if (ss && ss->magic == SUPERSLAB_MAGIC) {     // LOAD + BRANCH: ~3 cycles
        hak_tiny_free_superslab(ptr, ss);         // FAST PATH: ~5 ns
        return;
    }

    // Line 1114: Registry lookup (EXPENSIVE)
    TinySlab* slab = hak_tiny_owner_slab(ptr);    // LOOKUP: ~10-30 cycles
    if (!slab) return;

    hak_tiny_free_with_slab(ptr, slab);           // FREE: ~50-200 cycles
}

// hakmem_tiny.c:395-440 - Registry lookup implementation
TinySlab* hak_tiny_owner_slab(void* ptr) {
    if (!ptr || !g_tiny_initialized) return NULL;

    if (g_use_registry) {
        // O(1) hash table lookup
        uintptr_t slab_base = (uintptr_t)ptr & ~(TINY_SLAB_SIZE - 1);  // BITWISE: ~2 cycles
        TinySlab* slab = registry_lookup(slab_base);   // FUNCTION CALL: ~20-50 cycles
        if (!slab) return NULL;

        // Validation (bounds check)
        uintptr_t start = (uintptr_t)slab->base;
        uintptr_t end = start + TINY_SLAB_SIZE;
        if ((uintptr_t)ptr < start || (uintptr_t)ptr >= end) {
            return NULL;  // False positive
        }
        return slab;
    } else {
        // O(N) linear search (fallback)
        for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
            pthread_mutex_lock(lock);   // LOCK: ~30-100 cycles
            // Search free slabs
            for (TinySlab* slab = g_tiny_pool.free_slabs[class_idx]; slab; slab = slab->next) {
                // ... bounds check ...
            }
            pthread_mutex_unlock(lock);
        }
        return NULL;
    }
}

// Line 268-288: Registry lookup (hash table linear probe)
static TinySlab* registry_lookup(uintptr_t slab_base) {
    int hash = registry_hash(slab_base);                   // HASH: ~5 cycles

    for (int i = 0; i < SLAB_REGISTRY_MAX_PROBE; i++) {    // Up to 8 probes
        int idx = (hash + i) & SLAB_REGISTRY_MASK;         // BITWISE: ~2 cycles
        SlabRegistryEntry* entry = &g_slab_registry[idx];  // LOAD: ~2 cycles

        if (entry->slab_base == slab_base) {               // LOAD + BRANCH: ~3 cycles
            TinySlab* owner = entry->owner;                // LOAD: ~2 cycles
            return owner;
        }
        if (entry->slab_base == 0) {                       // LOAD + BRANCH: ~2 cycles
            return NULL;  // Empty slot
        }
    }
    return NULL;
}
// TYPICAL COST: 20-30 cycles (1-2 probes, cache hit)
// WORST CASE: 50-100 cycles (8 probes, cache miss on registry array)
```

### Why It's Slow

1. **Hash Computation**: The hash itself is cheap
   ```c
   static inline int registry_hash(uintptr_t slab_base) {
       return (slab_base >> 16) & SLAB_REGISTRY_MASK;  // Simple, but...
   }
   ```
   - Shift + mask = 2 cycles (acceptable)
   - **BUT**: linear probing on collision adds 10-30 cycles

2. **Linear Probing**: Up to 8 probes on collision
   - Each probe: load + compare + branch (3 cycles × 8 = 24 cycles worst case)
   - Registry size: 1024 entries (8 KB array)
   - **Cost**: May span multiple cache lines → cache miss (10-20 cycles of penalty)

3. **Validation Overhead**: Bounds check after the lookup
   - Load slab->base (2 cycles)
   - Compute the end address (1 cycle)
   - Compare twice (2 cycles)
   - **Cost**: 5 cycles per free (not amortized)

4. **Global Shared State**: The registry is shared across all threads
   - No cache-line alignment (false-sharing risk)
   - Lock-free reads → potential ABA issues
   - **Cost**: Atomic-load penalties (~5-10 cycles vs a normal load)

### Optimization #1: Enable SuperSlab by Default

**Rationale**: SuperSlab gives O(1) pointer→slab via 2MB alignment (mimalloc-style)

**Current**:
```c
// Line 81: SuperSlab disabled by default
static int g_use_superslab = 0;  // Runtime toggle
```

**Proposed**:
```c
// Enable SuperSlab by default
static int g_use_superslab = 1;  // Always on
```

**Benefits**:
- Eliminates the registry lookup entirely: `ptr & ~0x1FFFFF` (one AND operation)
- SuperSlab free path: ~5 ns (vs ~10-30 ns on the registry path)
- Better cache locality (2MB-aligned pages)

**Costs**:
- 2MB of address space per SuperSlab (not physical memory, thanks to lazy allocation)
- Slightly higher memory overhead (metadata at the SuperSlab level)

**Expected Speedup**: 20-30% on free-heavy workloads

**Risk**: LOW (SuperSlab already implemented and tested in Phase 6.23)
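For illustration, the pointer→SuperSlab mapping that replaces the registry lookup is just an alignment mask plus a magic check. A sketch assuming 2MB-aligned SuperSlabs and the `magic` field used in the free path above; the real `ptr_to_superslab` may differ:

```c
#include <stdint.h>

#define SUPERSLAB_SIZE (2u * 1024 * 1024)   // assumed 2MB alignment

// Sketch: round the pointer down to its 2MB region and treat the start of
// that region as the SuperSlab header. One AND, one load, one compare.
static inline SuperSlab* ptr_to_superslab_sketch(void* p) {
    SuperSlab* ss = (SuperSlab*)((uintptr_t)p & ~((uintptr_t)SUPERSLAB_SIZE - 1));
    return (ss && ss->magic == SUPERSLAB_MAGIC) ? ss : NULL;
}
```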
### Optimization #2: Cache Last Freed Slab

**Rationale**: Temporal locality — the next free is likely to come from the same slab

**Proposed**:
```c
// Per-thread cache of the last freed slab
static __thread TinySlab* t_last_freed_slab[TINY_NUM_CLASSES] = {NULL};

void hak_tiny_free(void* ptr) {
    if (!ptr) return;

    // Try the cached slab first (likely hit)
    int class_idx = guess_class_from_size(ptr);   // Heuristic
    TinySlab* slab = t_last_freed_slab[class_idx];

    // Validate that the pointer lies within this slab
    if (slab && ptr_in_slab_range(ptr, slab)) {
        hak_tiny_free_with_slab(ptr, slab);        // FAST PATH: ~5 ns
        return;
    }

    // Fallback to registry lookup (rare)
    slab = hak_tiny_owner_slab(ptr);
    if (slab) {
        t_last_freed_slab[slab->class_idx] = slab;  // Update cache
        hak_tiny_free_with_slab(ptr, slab);
    }
}
```

**Benefits**:
- 80-90% cache hit rate (temporal locality)
- Fast path: 2 loads + 2 compares (~5 cycles) vs a registry lookup (20-30 cycles)

**Expected Speedup**: 15-20% on free-heavy workloads

**Risk**: MEDIUM (requires a heuristic for guessing class_idx; may mispredict)
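The `ptr_in_slab_range` check used in the proposal above is not defined in the source; a plausible one-line implementation (an assumption) is:

```c
// Assumed helper: true if ptr falls inside slab's block area.
// TINY_SLAB_SIZE and TinySlab::base are the fields used in the existing code.
static inline int ptr_in_slab_range(const void* ptr, const TinySlab* slab) {
    uintptr_t off = (uintptr_t)ptr - (uintptr_t)slab->base;
    return off < TINY_SLAB_SIZE;   // unsigned wrap makes this a single compare
}
```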
---

## Bottleneck #4: Statistics Collection Overhead

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny_stats.h`
- **Lines**: 59-73 (`stats_record_alloc`, `stats_record_free`)
- **Impact**: LOW (already optimized via TLS batching, but still ~0.5 ns per op)

### Code Analysis

```c
// Line 59-62: Allocation statistics (inline)
static inline void stats_record_alloc(int class_idx) __attribute__((always_inline));
static inline void stats_record_alloc(int class_idx) {
    t_alloc_batch[class_idx]++;   // TLS INCREMENT: ~0.5-1 cycle
}

// Line 70-73: Free statistics (inline)
static inline void stats_record_free(int class_idx) __attribute__((always_inline));
static inline void stats_record_free(int class_idx) {
    t_free_batch[class_idx]++;    // TLS INCREMENT: ~0.5-1 cycle
}
```

### Why It's (Slightly) Slow

1. **TLS Access Overhead**: Even TLS has a cost
   - TLS base register: %fs on x86-64 (implicit)
   - Offset calculation: `[%fs + class_idx*4]`
   - **Cost**: ~0.5 cycles (not zero)

2. **Cache Line Pollution**: TLS counters compete for L1 cache
   - `t_alloc_batch[8]` = 32 bytes
   - `t_free_batch[8]` = 32 bytes
   - **Cost**: 64 bytes of L1 cache (one cache line)

3. **Compiler Optimization Barriers**: `always_inline` limits optimization
   - Forces inlining (good)
   - But the increment cannot be hoisted out of hot loops (bad)
   - **Cost**: An increment inside the hot loop instead of once outside it

### Optimization: Compile-Time Statistics Toggle

**Rationale**: Production builds don't need exact counts

**Proposed**:
```c
#ifdef HAKMEM_ENABLE_STATS
  #define STATS_RECORD_ALLOC(cls) (t_alloc_batch[(cls)]++)
  #define STATS_RECORD_FREE(cls)  (t_free_batch[(cls)]++)
#else
  #define STATS_RECORD_ALLOC(cls) ((void)0)
  #define STATS_RECORD_FREE(cls)  ((void)0)
#endif
```

**Benefits**:
- Zero overhead when stats are disabled
- The compiler can optimize away the dead code

**Expected Speedup**: 3-5% (small but measurable)

**Risk**: VERY LOW (compile-time flag, no runtime impact)

---

## Bottleneck #5: Magazine Spill/Refill Lock Contention

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 880-939 (magazine spill under the class lock)
- **Impact**: MEDIUM (affects 5-10% of frees, when the magazine is full)

### Code Analysis

```c
// Line 880-939: Magazine spill (class lock held)
if (mag->top < cap) {
    // Fast path: push to magazine (no lock)
    mag->items[mag->top].ptr = ptr;
    mag->top++;
    return;
}

// Spill half under the class lock
pthread_mutex_t* lock = &g_tiny_class_locks[class_idx].m;
pthread_mutex_lock(lock);                       // LOCK: ~30-100 cycles (contended)

int spill = cap / 2;                            // Spill 1024 items (for 2048 cap)
for (int i = 0; i < spill && mag->top > 0; i++) {
    TinyMagItem it = mag->items[--mag->top];
    TinySlab* owner = hak_tiny_owner_slab(it.ptr);   // LOOKUP: ~20-30 cycles × 1024
    if (!owner) continue;

    // Phase 4.1: Try mini-magazine push (avoid the bitmap)
    if ((owner == tls_a || owner == tls_b) && !mini_mag_is_full(&owner->mini_mag)) {
        mini_mag_push(&owner->mini_mag, it.ptr);     // FAST: ~4 cycles
        continue;
    }

    // Slow path: bitmap update
    size_t bs = g_tiny_class_sizes[owner->class_idx];
    int idx = ((uintptr_t)it.ptr - (uintptr_t)owner->base) / bs;   // DIV: ~10 cycles
    if (hak_tiny_is_used(owner, idx)) {
        hak_tiny_set_free(owner, idx);               // BITMAP: ~10 cycles
        owner->free_count++;
        // ... list management ...
    }
}
pthread_mutex_unlock(lock);

// TOTAL SPILL COST: ~50,000-100,000 cycles (1024 items × 50-100 cycles/item)
// Amortized: ~50-100 cycles (~17-33 ns) per free when a spill occurs every ~1000 frees
```

### Why It's Slow

1. **Lock Hold Time**: The lock is held for the entire spill (1024 items)
   - Blocks other threads from taking the class lock
   - A spill takes ~17-33 µs at 3 GHz → other threads stall
   - **Cost**: Contention penalty on multi-threaded workloads

2. **Registry Lookup in a Loop**: 1024 lookups under the lock
   - `hak_tiny_owner_slab(it.ptr)` is called 1024 times
   - Each lookup: 20-30 cycles
   - **Cost**: 20,000-30,000 cycles just for lookups

3. **Division in the Hot Loop**: The block-index calculation uses division
   - `int idx = ((uintptr_t)it.ptr - (uintptr_t)owner->base) / bs;`
   - Division is ~10 cycles on modern CPUs (not fully pipelined)
   - **Cost**: ~10,000 cycles for 1024 divisions

4. **Large Spill Batch**: 1024 items is too large
   - Amortizes the lock cost well (good)
   - But increases lock hold time (bad)
   - The trade-off is not tuned

### Optimization #1: Reduce Spill Batch Size

**Rationale**: Smaller batches = shorter lock hold time = less contention

**Current**:
```c
int spill = cap / 2;   // 1024 items for a 2048 cap
```

**Proposed**:
```c
int spill = 128;       // Fixed batch size (not cap-dependent)
```

**Benefits**:
- Shorter lock hold time: roughly 1/8 of the current spill (~2-4 µs vs ~17-33 µs)
- Better multi-thread responsiveness

**Costs**:
- More frequent spills (8× more frequent)
- Slightly higher total lock overhead

**Expected Speedup**: 10-15% on multi-threaded workloads

**Risk**: LOW (simple parameter change)

### Optimization #2: Lock-Free Spill Stack

**Rationale**: Avoid the lock entirely on the spill path

**Proposed**:
```c
// Per-class global spill stack (lock-free MPSC)
static atomic_uintptr_t g_spill_stack[TINY_NUM_CLASSES];

void magazine_spill_lockfree(int class_idx, void* ptr) {
    // Push onto the lock-free stack
    uintptr_t old_head;
    do {
        old_head = atomic_load_explicit(&g_spill_stack[class_idx], memory_order_acquire);
        *((uintptr_t*)ptr) = old_head;   // Intrusive next-pointer in the freed block
    } while (!atomic_compare_exchange_weak_explicit(&g_spill_stack[class_idx],
                                                    &old_head, (uintptr_t)ptr,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

// A background thread (or periodic hook) drains the spill stack
void background_drain_spill_stack(void) {
    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        uintptr_t head = atomic_exchange_explicit(&g_spill_stack[i], 0, memory_order_acq_rel);
        if (!head) continue;

        pthread_mutex_lock(&g_tiny_class_locks[i].m);
        // ... drain to bitmap ...
        pthread_mutex_unlock(&g_tiny_class_locks[i].m);
    }
}
```

**Benefits**:
- Zero lock contention on the spill path
- Fast atomic CAS (~5-10 cycles)

**Costs**:
- Requires a background thread or a periodic drain
- Slightly more complex memory management

**Expected Speedup**: 20-30% on multi-threaded workloads

**Risk**: HIGH (requires careful design of the background drain mechanism)
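One way to get the "periodic drain" mentioned in the costs above without a dedicated thread (an assumption, not part of the proposal) is to piggyback on the existing slow path, which already takes the class lock:

```c
// Sketch: opportunistic drain from the allocation slow path. Because the
// slow path already holds (or is about to take) the class lock, draining
// here adds no extra lock acquisitions in the common case.
static void tiny_slow_path_drain(int class_idx) {
    uintptr_t head = atomic_exchange_explicit(&g_spill_stack[class_idx], 0,
                                              memory_order_acq_rel);
    while (head) {
        void* p = (void*)head;
        head = *((uintptr_t*)p);   // follow the intrusive next-pointer
        // return p to its owning slab's bitmap/mini-mag under the class lock
        // (same logic as the locked section of background_drain_spill_stack)
    }
}
```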
---

## Bottleneck #6: Branch Misprediction in Size Class Lookup

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.h`
- **Lines**: 159-182 (`hak_tiny_size_to_class`)
- **Impact**: LOW (only 1-2 ns per allocation, but called on every allocation)

### Code Analysis

```c
// Line 159-182: Size to class lookup (branch chain)
static inline int hak_tiny_size_to_class(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return -1;   // BRANCH: ~1 cycle

    // Branch chain (8 branches for 8 classes)
    if (size <= 8)    return 0;    // BRANCH: ~1 cycle
    if (size <= 16)   return 1;    // BRANCH: ~1 cycle
    if (size <= 32)   return 2;    // BRANCH: ~1 cycle
    if (size <= 64)   return 3;    // BRANCH: ~1 cycle
    if (size <= 128)  return 4;    // BRANCH: ~1 cycle
    if (size <= 256)  return 5;    // BRANCH: ~1 cycle
    if (size <= 512)  return 6;    // BRANCH: ~1 cycle
    return 7;                      // size <= 1024
}
// TYPICAL COST: 3-5 cycles (3-4 branches taken)
// WORST CASE: 8 cycles (all branches checked)
```

### Why It's (Slightly) Slow

1. **Unpredictable Size Distribution**: The branch predictor cannot learn the pattern
   - Real-world allocation sizes are quasi-random
   - Size 16 is the most common (~33%), but the rest vary
   - **Cost**: ~20-30% branch misprediction rate (~10 cycles of penalty each)

2. **Sequential Dependency**: Each branch depends on the previous one
   - The CPU cannot evaluate the branches in parallel
   - They must be resolved in order
   - **Cost**: No instruction-level parallelism (ILP)

### Optimization: Branchless Lookup Table

**Rationale**: Use a lookup table for common sizes and CLZ (count leading zeros) for the rest, giving O(1) class lookup

**Proposed**:
```c
// Lookup table for size → class (branchless for sizes ≤ 128).
// Index 0 is unused (size == 0 is rejected above); sizes 1-8 map to class 0,
// matching the branch-chain version.
static const uint8_t g_size_to_class_table[129] = {
    0,                                              // size 0 (unreachable)
    0, 0, 0, 0, 0, 0, 0, 0,                         // sizes 1-8    → class 0
    1, 1, 1, 1, 1, 1, 1, 1,                         // sizes 9-16   → class 1
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // sizes 17-32  → class 2
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // sizes 33-64  → class 3
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, // sizes 65-128 → class 4
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
};

static inline int hak_tiny_size_to_class(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return -1;

    // Fast path: direct table lookup for small sizes
    if (size <= 128) {
        return g_size_to_class_table[size];   // LOAD: ~2 cycles (L1 cache)
    }

    // Larger sizes: CLZ-based
    //   size 129-256  → class 5
    //   size 257-512  → class 6
    //   size 513-1024 → class 7
    return 61 - __builtin_clzll((unsigned long long)(size - 1));
}
// TYPICAL COST: 2-3 cycles (table lookup, no data-dependent branches)
```

**Benefits**:
- Branchless for common sizes (8-128B covers 80%+ of allocations)
- The table fits in L1 cache (129 bytes ≈ 2 cache lines)
- Predictable performance (no branch misprediction)

**Expected Speedup**: 2-3% (reduce ~5 cycles to 2-3 cycles)

**Risk**: VERY LOW (the table is static, no runtime overhead)
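A quick exhaustive self-check catches any table or CLZ mistake before the change ships. This sketch assumes `TINY_MAX_SIZE` is 1024 and that the original branch chain is kept around under a hypothetical name `hak_tiny_size_to_class_ref`:

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical test helper: compare the table/CLZ version against the
// original branch-chain implementation for every legal size.
static void test_size_to_class(void) {
    for (size_t size = 1; size <= 1024; size++) {
        assert(hak_tiny_size_to_class(size) == hak_tiny_size_to_class_ref(size));
    }
}
```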
---

## Bottleneck #7: Remote Free Drain Overhead

### Location

- **File**: `/home/tomoaki/git/hakmem/hakmem_tiny.c`
- **Lines**: 146-184 (`tiny_remote_drain_locked`)
- **Impact**: LOW (only affects cross-thread frees, ~10-20% of workloads)

### Code Analysis

```c
// Line 146-184: Remote free drain (under the class lock)
static void tiny_remote_drain_locked(TinySlab* slab) {
    uintptr_t head = atomic_exchange(&slab->remote_head, NULL, memory_order_acq_rel);  // ATOMIC: ~10 cycles
    unsigned drained = 0;

    while (head) {                                     // LOOP: variable iterations
        void* p = (void*)head;
        head = *((uintptr_t*)p);                       // LOAD NEXT: ~2 cycles

        // Calculate block index
        size_t block_size = g_tiny_class_sizes[slab->class_idx];   // LOAD: ~2 cycles
        uintptr_t offset = (uintptr_t)p - (uintptr_t)slab->base;   // SUBTRACT: ~1 cycle
        int block_idx = offset / block_size;                       // DIVIDE: ~10 cycles

        // Skip if already free (idempotent)
        if (!hak_tiny_is_used(slab, block_idx)) continue;          // BITMAP CHECK: ~5 cycles

        hak_tiny_set_free(slab, block_idx);                        // BITMAP UPDATE: ~10 cycles

        int was_full = (slab->free_count == 0);                    // LOAD: ~1 cycle
        slab->free_count++;                                        // INCREMENT: ~1 cycle

        if (was_full) {
            move_to_free_list(slab->class_idx, slab);              // LIST UPDATE: ~20-50 cycles (rare)
        }
        if (slab->free_count == slab->total_count) {
            // ... slab release logic ... (rare)
            release_slab(slab);                                    // EXPENSIVE: ~1000 cycles (very rare)
            break;
        }

        g_tiny_pool.free_count[slab->class_idx]++;                 // GLOBAL INCREMENT: ~1 cycle
        drained++;
    }

    if (drained)
        atomic_fetch_sub(&slab->remote_count, drained, memory_order_relaxed);  // ATOMIC: ~10 cycles
}
// TYPICAL COST: 50-100 cycles per drained block (moderate)
// WORST CASE: 1000+ cycles (slab release)
```

### Why It's Slow

1. **Division in the Loop**: The block-index calculation uses division
   - `int block_idx = offset / block_size;`
   - Division is ~10 cycles (even on modern CPUs)
   - **Cost**: 10 cycles × N remote frees

2. **Atomic Operations**: 2 atomic ops per drain (exchange + fetch_sub)
   - `atomic_exchange` at the start (~10 cycles)
   - `atomic_fetch_sub` at the end (~10 cycles)
   - **Cost**: ~20 cycles of overhead (per drain, not per block, but still notable)

3. **Bitmap Update**: Same as the allocation path
   - `hak_tiny_set_free` updates both the bitmap and the summary
   - **Cost**: ~10 cycles per block

### Optimization: Multiplication-Based Division

**Rationale**: Replace the division with multiplication by a precomputed reciprocal

**Current**:
```c
int block_idx = offset / block_size;   // DIVIDE: ~10 cycles
```

**Proposed**:
```c
// Pre-computed reciprocals: (1ULL << 48) / block_size.
// Then: block_idx = (offset * reciprocal) >> 48 (exact for power-of-2 block sizes)
static const uint64_t g_tiny_block_reciprocals[TINY_NUM_CLASSES] = {
    (1ULL << 48) / 8,      // Class 0: 8B
    (1ULL << 48) / 16,     // Class 1: 16B
    (1ULL << 48) / 32,     // Class 2: 32B
    (1ULL << 48) / 64,     // Class 3: 64B
    (1ULL << 48) / 128,    // Class 4: 128B
    (1ULL << 48) / 256,    // Class 5: 256B
    (1ULL << 48) / 512,    // Class 6: 512B
    (1ULL << 48) / 1024,   // Class 7: 1024B
};

// Fast division using multiplication
int block_idx = (int)((offset * g_tiny_block_reciprocals[slab->class_idx]) >> 48);  // MUL + SHIFT: ~3 cycles
```

**Benefits**:
- Reduce ~10 cycles to ~3 cycles per division
- Saves ~7 cycles per remote free

**Expected Speedup**: 5-10% on cross-thread workloads

**Risk**: VERY LOW (a well-known compiler optimization, applied manually)
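Wrapping the multiply-shift in a small helper makes it easy to sanity-check against plain division. The helper name is illustrative; exactness holds because all block sizes are powers of two and offsets stay below the 64 KB tiny-slab size implied by the bitmap sizes above:

```c
#include <assert.h>
#include <stdint.h>

// Illustrative helper around the multiply-shift trick above.
static inline int tiny_block_index_fast(uintptr_t offset, int class_idx) {
    return (int)((offset * g_tiny_block_reciprocals[class_idx]) >> 48);
}

// Sanity check: exact match with plain division for every block start in a
// slab (assumes a 64 KB tiny slab).
static void test_block_index_fast(void) {
    static const size_t sizes[8] = { 8, 16, 32, 64, 128, 256, 512, 1024 };
    for (int c = 0; c < 8; c++) {
        for (uintptr_t off = 0; off < 64 * 1024; off += sizes[c]) {
            assert(tiny_block_index_fast(off, c) == (int)(off / sizes[c]));
        }
    }
}
```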
---

## Profiling Plan

### perf Commands to Run

```bash
# 1. CPU cycle breakdown (identify hotspots)
perf record -e cycles:u -g ./bench_comprehensive
perf report --stdio --no-children | head -100 > perf_cycles.txt

# 2. Cache miss analysis (L1d, L1i, LLC)
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,\
L1-icache-loads,L1-icache-load-misses,LLC-loads,LLC-load-misses \
    ./bench_comprehensive

# 3. Branch misprediction rate
perf stat -e cycles,instructions,branches,branch-misses \
    ./bench_comprehensive

# 4. TLB miss analysis (address translation overhead)
perf stat -e cycles,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
    ./bench_comprehensive

# 5. Function-level profiling (annotated source)
perf record -e cycles:u --call-graph dwarf ./bench_comprehensive
perf report --stdio --sort symbol --percent-limit 1

# 6. Memory bandwidth utilization
perf stat -e cycles,mem_load_retired.l1_hit,mem_load_retired.l1_miss,\
mem_load_retired.l2_hit,mem_load_retired.l3_hit,mem_load_retired.l3_miss \
    ./bench_comprehensive

# 7. Allocation-specific hotspots (focus on hak_tiny_*)
perf record -e cycles:u --call-graph dwarf ./bench_comprehensive
perf report --stdio | grep "hak_tiny"
```

### Expected Hotspots to Validate

Based on the code analysis, we expect to see:

1. **hak_tiny_find_free_block** (15-25% of cycles)
   - Two-tier bitmap scan
   - CTZ operations
   - Cache misses on large bitmaps

2. **hak_tiny_set_used / hak_tiny_set_free** (10-15% of cycles)
   - Bitmap updates
   - Summary bitmap updates
   - Write-heavy (cache-line bouncing)

3. **hak_tiny_owner_slab** (10-20% of cycles on the free path)
   - Registry lookup
   - Hash computation
   - Linear probing

4. **tiny_mag_init_if_needed** (5-10% of cycles)
   - TLS access
   - Conditional initialization

5. **stats_record_alloc / stats_record_free** (3-5% of cycles)
   - TLS counter increments
   - Cache line pollution

### Validation Criteria

**Cache Miss Rates**:
- L1d miss rate: < 5% (good), 5-10% (acceptable), > 10% (poor)
- LLC miss rate: < 1% (good), 1-3% (acceptable), > 3% (poor)

**Branch Misprediction**:
- Misprediction rate: < 2% (good), 2-5% (acceptable), > 5% (poor)
- Expected: 3-4% (due to unpredictable size classes)

**IPC (Instructions Per Cycle)**:
- IPC: > 2.0 (good), 1.5-2.0 (acceptable), < 1.5 (poor)
- Expected: 1.5-1.8 (memory-bound, not compute-bound)

**Function Time Distribution**:
- hak_tiny_alloc: 40-60% (hot path)
- hak_tiny_free: 20-30% (warm path)
- hak_tiny_find_free_block: 10-20% (expensive when hit)
- Other: < 10%

---

## Optimization Roadmap

### Quick Wins (< 1 hour, Low Risk)

1. **Enable SuperSlab by Default** (Bottleneck #3)
   - Change: `g_use_superslab = 1;`
   - Impact: 20-30% speedup on the free path
   - Risk: VERY LOW (already implemented)
   - Effort: 5 minutes

2. **Disable Statistics in Production** (Bottleneck #4)
   - Change: wrap the counters in `#ifdef HAKMEM_ENABLE_STATS` guards
   - Impact: 3-5% speedup
   - Risk: VERY LOW (compile-time flag)
   - Effort: 15 minutes

3. **Increase Mini-Magazine Capacity** (Bottleneck #2)
   - Change: `mag_capacity = 64` (was 32)
   - Impact: 10-15% speedup (fewer bitmap scans)
   - Risk: LOW (slight memory increase)
   - Effort: 5 minutes

4. **Branchless Size Class Lookup** (Bottleneck #6)
   - Change: use a lookup table for common sizes
   - Impact: 2-3% speedup
   - Risk: VERY LOW (static table)
   - Effort: 30 minutes

**Total Expected Speedup: 35-53%** (conservative: 1.4-1.5×)
### Medium Effort (1-4 hours, Medium Risk)

5. **Unified TLS Cache Structure** (Bottleneck #1)
   - Change: merge the TLS arrays into a single cache-aligned struct
   - Impact: 30-40% speedup on the fast path
   - Risk: MEDIUM (requires refactoring)
   - Effort: 3-4 hours

6. **Reduce Magazine Spill Batch** (Bottleneck #5)
   - Change: `spill = 128` (was 1024)
   - Impact: 10-15% speedup on multi-threaded workloads
   - Risk: LOW (parameter tuning)
   - Effort: 30 minutes

7. **Cache-Aware Bitmap Layout** (Bottleneck #2)
   - Change: embed small bitmaps in the slab structure
   - Impact: 5-10% speedup
   - Risk: MEDIUM (requires struct changes)
   - Effort: 2-3 hours

8. **Multiplication-Based Division** (Bottleneck #7)
   - Change: replace division with mul+shift
   - Impact: 5-10% speedup on remote frees
   - Risk: VERY LOW (well-known optimization)
   - Effort: 1 hour

**Total Expected Speedup: 50-75%** (conservative: 1.5-1.8×)

### Major Refactors (> 4 hours, High Risk)

9. **Lock-Free Spill Stack** (Bottleneck #5)
   - Change: use an atomic MPSC stack for the magazine spill
   - Impact: 20-30% speedup on multi-threaded workloads
   - Risk: HIGH (complex concurrency)
   - Effort: 8-12 hours

10. **Lazy Summary Bitmap Update** (Bottleneck #2)
    - Change: rebuild the summary only when scanning
    - Impact: 15-20% speedup on free-heavy workloads
    - Risk: MEDIUM (requires careful staleness tracking)
    - Effort: 4-6 hours

11. **Collapse TLS Magazine Tiers** (Bottleneck #1)
    - Change: merge the magazine + mini-mag into a single LIFO
    - Impact: 40-50% speedup (eliminates tier overhead)
    - Risk: HIGH (major architectural change)
    - Effort: 12-16 hours

12. **Full mimalloc-Style Rewrite** (All Bottlenecks)
    - Change: replace the bitmap with an intrusive free-list
    - Impact: 5-9× speedup (match mimalloc)
    - Risk: VERY HIGH (complete redesign)
    - Effort: 40+ hours

**Total Expected Speedup: 75-150%** (optimistic: 1.8-2.5×)

---

## Risk Assessment Summary

### Low Risk Optimizations (safe to implement immediately)
- SuperSlab enable
- Statistics compile-time toggle
- Mini-mag capacity increase
- Branchless size lookup
- Multiplication-based division
- Magazine spill batch reduction

**Expected: 1.4-1.6× speedup, 2-3 hours of effort**

### Medium Risk Optimizations (test thoroughly)
- Unified TLS cache structure
- Cache-aware bitmap layout
- Lazy summary update

**Expected: 1.6-2.0× speedup, 6-10 hours of effort**

### High Risk Optimizations (prototype first)
- Lock-free spill stack
- Magazine tier collapse
- Full mimalloc rewrite

**Expected: 2.0-9.0× speedup, 20-60 hours of effort**

---

## Estimated Speedup Summary

### Conservative Target (Low + Medium optimizations)
- **Random pattern**: 68 M ops/sec → **140 M ops/sec** (2.0× speedup)
- **LIFO pattern**: 102 M ops/sec → **200 M ops/sec** (2.0× speedup)
- **Gap to mimalloc**: 2.6× → **1.3×** (closes ~50% of the gap)

### Optimistic Target (all optimizations)
- **Random pattern**: 68 M ops/sec → **170 M ops/sec** (2.5× speedup)
- **LIFO pattern**: 102 M ops/sec → **450 M ops/sec** (4.4× speedup)
- **Gap to mimalloc**: 2.6× → **~1.0×** (match on random, ~2× behind on LIFO)

---

## Conclusion

The hakmem allocator's 2.6× gap to mimalloc on favorable patterns (random free) is primarily due to:

1. **Architectural overhead**: a 6-tier allocation hierarchy vs mimalloc's 3 tiers
2. **Bitmap traversal cost**: the two-tier scan adds 15-20 cycles even when optimized
3. **Registry lookup overhead**: the hash-table lookup adds 20-30 cycles on the free path

**Quick wins** (1-3 hours of effort) can achieve a **1.4-1.6× speedup**. **Medium effort** (~10 hours) can achieve **1.8-2.0×**. A **full mimalloc-style rewrite** (40+ hours) is needed to match mimalloc's 1.1 ns/op.

**Recommendation**: Implement the quick wins first (SuperSlab + stats disable + branchless lookup), measure the results with `perf`, then decide whether the medium-effort optimizations are worth the added complexity.