# Async Background Worker Optimization Plan
## hakmem Tiny Pool Allocator Performance Analysis
**Date**: 2025-10-26
**Author**: Claude Code Analysis
**Goal**: Reduce instruction count by moving work to background threads
**Target**: 2-3× speedup (62 M ops/sec → 120-180 M ops/sec)
---
## Executive Summary
### Can we achieve 2-3× speedup with async background workers?
**Answer: Partially - with significant caveats**
**Expected realistic speedup**: **1.5-2.0× (62 → 93-124 M ops/sec)**
**Best-case speedup**: **2.0-2.5× (62 → 124-155 M ops/sec)**
**Gap to mimalloc remains**: 263 M ops/sec is still **4.2× faster**
### Why not 3×? Three fundamental constraints:
1. **TLS Magazine already defers most work** (60-80% hit rate)
- Fast path already ~6 ns (18 cycles) - close to theoretical minimum
- Background workers only help the remaining 20-40% of allocations
- **Maximum impact**: 20-30% improvement (not 3×)
2. **Owner slab lookup on free cannot be fully deferred**
- Cross-thread frees REQUIRE immediate slab lookup (for remote-free queue)
- Same-thread frees can be deferred, but benchmarks show 40-60% cross-thread frees
- **Deferred free savings**: Limited to 40-60% of frees only
3. **Background threads add synchronization overhead**
- Atomic refill triggers, memory barriers, cache coherence
- Expected overhead: 5-15% of total cycle budget
- **Net gain reduced** from theoretical 2.5× to realistic 1.5-2.0×
### Strategic Recommendation
**Option B (Deferred Slab Lookup)** has the best effort/benefit ratio:
- **Effort**: 4-6 hours (TLS queue + batch processing)
- **Benefit**: 25-35% faster frees (same-thread only)
- **Overall speedup**: ~1.3-1.5× (62 → 81-93 M ops/sec)
**Option C (Hybrid)** for maximum performance:
- **Effort**: 8-12 hours (Option B + background magazine refill)
- **Benefit**: 40-60% overall speedup
- **Overall speedup**: ~1.7-2.0× (62 → 105-124 M ops/sec)
---
## Part 1: Current Front-Path Analysis (perf data)
### 1.1 Overall Hotspot Distribution
**Environment**: `HAKMEM_WRAP_TINY=1` (Tiny Pool enabled)
**Workload**: `bench_comprehensive_hakmem` (1M iterations, mixed sizes)
**Total cycles**: 242 billion (242.3 × 10⁹)
**Samples**: 187K
| Function | Cycles % | Samples | Category | Notes |
|----------|---------|---------|----------|-------|
| `_int_free` | 26.43% | 49,508 | glibc fallback | For >1KB allocations |
| `_int_malloc` | 23.45% | 43,933 | glibc fallback | For >1KB allocations |
| `malloc` | 14.01% | 26,216 | Wrapper overhead | TLS check + routing |
| `__random` | 7.99% | 14,974 | Benchmark overhead | rand() for shuffling |
| `unlink_chunk` | 7.96% | 14,824 | glibc internal | Chunk coalescing |
| **`hak_alloc_at`** | **3.13%** | **5,867** | **hakmem router** | **Tiny/Pool routing** |
| **`hak_tiny_alloc`** | **2.77%** | **5,206** | **Tiny alloc path** | **TARGET #1** |
| `_int_free_merge_chunk` | 2.15% | 3,993 | glibc internal | Free coalescing |
| `mid_desc_lookup` | 1.82% | 3,423 | hakmem pool | Mid-tier lookup |
| `hak_free_at` | 1.74% | 3,270 | hakmem router | Free routing |
| **`hak_tiny_owner_slab`** | **1.37%** | **2,571** | **Tiny free path** | **TARGET #2** |
### 1.2 Tiny Pool Allocation Path Breakdown
**From perf annotate on `hak_tiny_alloc` (5,206 samples)**:
```assembly
# Entry and initialization (lines 14eb0-14edb)
4.00%: endbr64 # CFI marker
5.21%: push %r15 # Stack frame setup
3.94%: push %r14
25.81%: push %rbp # HOT: Stack frame overhead
5.28%: mov g_tiny_initialized,%r14d # TLS read
15.20%: test %r14d,%r14d # Branch check
# Size-to-class conversion (implicit, not shown in perf)
# Estimated: ~2-3 cycles (branchless table lookup)
# TLS Magazine fast path (lines 14f41-14f9f)
0.00%: mov %fs:0x0,%rsi # TLS base (rare - already cached)
0.00%: imul $0x4008,%rbx,%r10 # Class offset calculation
0.00%: mov -0x1c04c(%r10,%rsi,1),%r15d # Magazine top read
# Mini-magazine operations (not heavily sampled - fast path works!)
# Lines 15068-15131: Remote drain logic (rare)
# Most samples are in initialization, not hot path
```
**Key observation from perf**:
- **Stack frame overhead dominates** (25.81% on single `push %rbp`)
- TLS access is **NOT a bottleneck** (0.00% on most TLS reads)
- Most cycles spent in **initialization checks and setup** (first 10 instructions)
- **Magazine fast path barely appears** (suggests it's working efficiently!)
### 1.3 Tiny Pool Free Path Breakdown
**From perf annotate on `hak_tiny_owner_slab` (2,571 samples)**:
```assembly
# Entry and registry lookup (lines 14c10-14c78)
10.87%: endbr64 # CFI marker
3.06%: push %r14 # Stack frame
14.05%: shr $0x10,%r10 # Hash calculation (64KB alignment)
5.44%: and $0x3ff,%eax # Hash mask (1024 entries)
3.91%: mov %rax,%rdx # Index calculation
5.89%: cmp %r13,%r9 # Registry lookup comparison
14.31%: test %r13,%r13 # NULL check
# Linear probing (lines 14c7e-14d70)
# 8 probe attempts, each with similar overhead
# Most time spent in hash computation and comparison
```
**Key observation from perf**:
- **Hash computation + comparison is the bottleneck** (14.05% + 5.89% + 14.31% = 34.25%)
- Registry lookup is **O(1) but expensive** (~10-15 cycles per lookup)
- **Called on every free** (2,571 samples ≈ 1.37% of total cycles); the lookup shape is sketched below
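For orientation, here is a minimal sketch of the lookup shape implied by the assembly above (shift by 16 to recover the 64 KB slab base, mask down to 1024 registry slots, then linear probing). The identifiers and the 8-probe limit are illustrative assumptions, not the actual hakmem code:
```c
#include <stdint.h>
#include <stddef.h>

#define REGISTRY_SLOTS 1024           // matches the `and $0x3ff` mask
#define TINY_SLAB_SIZE (64 * 1024)    // matches the `shr $0x10` alignment

typedef struct TinySlab TinySlab;

typedef struct {
    uintptr_t base;    // 64KB-aligned slab base address (0 = empty slot)
    TinySlab* slab;
} RegistryEntry;

static RegistryEntry g_slab_registry[REGISTRY_SLOTS];

static TinySlab* owner_slab_lookup(void* ptr) {
    uintptr_t base = (uintptr_t)ptr & ~(uintptr_t)(TINY_SLAB_SIZE - 1);
    size_t idx = (base >> 16) & (REGISTRY_SLOTS - 1);     // hash = aligned base
    for (int probe = 0; probe < 8; probe++) {             // linear probing
        RegistryEntry* e = &g_slab_registry[(idx + probe) & (REGISTRY_SLOTS - 1)];
        if (e->base == base) return e->slab;
        if (e->base == 0) return NULL;                     // empty slot: miss
    }
    return NULL;
}
```
Even in this minimal form the path is a shift, a mask, and one or more load/compare pairs per probe, which matches the 10-15 cycle estimate above.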
### 1.4 Instruction Count Breakdown (Estimated)
Based on perf data and code analysis, here's the estimated instruction breakdown:
**Allocation path** (~228 instructions total as measured by perf stat):
| Component | Instructions | Cycles | % of Total | Notes |
|-----------|-------------|--------|-----------|-------|
| Wrapper overhead | 15-20 | ~6 | 7-9% | TLS check + routing |
| Size-to-class lookup | 5-8 | ~2 | 2-3% | Branchless table (fast!) |
| TLS magazine check | 8-12 | ~4 | 4-5% | Load + branch |
| **Pointer return (HIT)** | **3-5** | **~2** | **1-2%** | **Fast path: 30-45 instructions** |
| TLS slab lookup | 10-15 | ~5 | 4-6% | Miss: check active slabs |
| Mini-mag check | 8-12 | ~4 | 3-5% | LIFO pop |
| **Bitmap scan (MISS)** | **40-60** | **~20** | **18-26%** | **Summary + main bitmap + CTZ** |
| Bitmap update | 20-30 | ~10 | 9-13% | Set used + summary update |
| Pointer arithmetic | 8-12 | ~3 | 3-5% | Block index → pointer |
| Lock acquisition (rare) | 50-100 | ~30-100 | 22-44% | pthread_mutex_lock (contended) |
| Batch refill (rare) | 100-200 | ~50-100 | 44-88% | 16-64 items from bitmap |
**Free path** (~150-200 instructions estimated):
| Component | Instructions | Cycles | % of Total | Notes |
|-----------|-------------|--------|-----------|-------|
| Wrapper overhead | 10-15 | ~5 | 5-8% | TLS check + routing |
| **Owner slab lookup** | **30-50** | **~15-20** | **20-25%** | **Hash + linear probe** |
| Slab validation | 10-15 | ~5 | 5-8% | Range check (safety) |
| TLS magazine push | 8-12 | ~4 | 4-6% | Same-thread: fast! |
| **Remote free push** | **15-25** | **~8-12** | **10-13%** | **Cross-thread: atomic CAS** |
| Lock + bitmap update (spill) | 50-100 | ~30-80 | 25-50% | Magazine full (rare) |
**Critical finding**:
- **Owner slab lookup (30-50 instructions) is the #1 free-path bottleneck**
- Accounts for ~20-25% of free path instructions
- **Cannot be eliminated for cross-thread frees** (the slab is needed to push onto its remote queue; see the sketch below)
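To make the constraint concrete, here is a minimal sketch of why a cross-thread free needs the owner slab up front: the freed block is pushed onto a per-slab remote list with a CAS loop, so the slab pointer must already be resolved before the push. Field names and memory orderings are assumptions, not the actual hakmem layout:
```c
#include <stdatomic.h>

typedef struct TinySlab {
    _Atomic(void*) remote_head;   // per-slab MPSC remote-free list (assumed field)
    /* ... other slab metadata ... */
} TinySlab;

static void tiny_remote_push(TinySlab* slab, void* ptr) {
    void* old_head = atomic_load_explicit(&slab->remote_head, memory_order_relaxed);
    do {
        *(void**)ptr = old_head;  // link the freed block onto the list
    } while (!atomic_compare_exchange_weak_explicit(
                 &slab->remote_head, &old_head, ptr,
                 memory_order_release, memory_order_relaxed));
}
```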
---
## Part 2: Async Background Worker Design
### 2.1 Option A: Deferred Bitmap Consolidation
**Goal**: Push bitmap scanning to background thread, keep front-path as simple pointer bump
#### Design
```c
// Front-path (allocation): 10-20 instructions
void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    // Fast path: Magazine hit (8-12 instructions)
    if (mag->top > 0) {
        return mag->items[--mag->top].ptr; // ~3 instructions
    }
    // Slow path: Trigger background refill
    return hak_tiny_alloc_slow(class_idx); // ~5 instructions + function call
}

// Background thread: Bitmap scanning
void background_refill_magazines(void) {
    while (1) {
        for (int tid = 0; tid < MAX_THREADS; tid++) {
            for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
                TinyTLSMag* mag = &g_thread_mags[tid][class_idx];
                // Refill if below threshold (e.g., 25% full)
                if (mag->top < mag->cap / 4) {
                    // Scan bitmaps across all slabs (expensive!)
                    batch_refill_from_all_slabs(mag, 256); // 256 items at once
                }
            }
        }
        usleep(100); // 100μs sleep (tune based on load)
    }
}
```
#### Expected Performance
**Front-path savings**:
- Before: 228 instructions (magazine miss → bitmap scan)
- After: 30-45 instructions (magazine miss → return NULL + fallback)
- **Speedup**: 5-7× on miss case (but only 20-40% of allocations miss)
**Overall impact**:
- 60-80% hit TLS magazine: **No change** (already 30-45 instructions)
- 20-40% miss TLS magazine: **5-7× faster** (228 → 30-45 instructions)
- **Net speedup**: 1.0 × 0.7 + 6.0 × 0.3 = **2.5× on allocation path**
**BUT**: Background thread overhead
- CPU cost: 1 core at ~10-20% utilization (bitmap scanning)
- Memory barriers: Atomic refill triggers (5-10 cycles per trigger)
- Cache coherence: TLS magazine written by background thread (false sharing risk)
**Realistic net speedup**: **1.5-2.0× on allocations** (after overhead)
#### Pros
- **Minimal front-path changes** (magazine logic unchanged)
- **No new synchronization primitives** (use existing atomic refill triggers)
- **Compatible with existing TLS magazine** (just changes refill source)
#### Cons
- **Background thread CPU cost** (10-20% of 1 core)
- **Latency spikes** if background thread is delayed (magazine empty → fallback to pool)
- **Complex tuning** (refill threshold, batch size, sleep interval)
- **False sharing risk** (background thread writes TLS magazine `top` field)
---
### 2.2 Option B: Deferred Slab Lookup (Owner Slab Cache)
**Goal**: Eliminate owner slab lookup on same-thread frees by deferring to batch processing
#### Design
```c
// Front-path (free): 10-20 instructions
void hak_tiny_free(void* ptr) {
    // Push to thread-local deferred free queue (NO owner_slab lookup!)
    TinyDeferredFree* queue = &g_tls_deferred_free;
    // Fast path: Direct queue push (8-12 instructions)
    queue->ptrs[queue->count++] = ptr; // ~3 instructions
    // Trigger batch processing if queue is full
    if (queue->count >= 256) {
        hak_tiny_process_deferred_frees(queue); // Background or inline
    }
}

// Batch processing: Owner slab lookup (amortized cost)
void hak_tiny_process_deferred_frees(TinyDeferredFree* queue) {
    for (int i = 0; i < queue->count; i++) {
        void* ptr = queue->ptrs[i];
        // Owner slab lookup (expensive: 30-50 instructions)
        TinySlab* slab = hak_tiny_owner_slab(ptr);
        // Check if same-thread or cross-thread
        if (pthread_equal(slab->owner_tid, pthread_self())) {
            // Same-thread: Push to TLS magazine (fast)
            TinyTLSMag* mag = &g_tls_mags[slab->class_idx];
            mag->items[mag->top++].ptr = ptr;
        } else {
            // Cross-thread: Push to remote queue (already required)
            tiny_remote_push(slab, ptr);
        }
    }
    queue->count = 0;
}
```
#### Expected Performance
**Front-path savings**:
- Before: 150-200 instructions (owner slab lookup + magazine/remote push)
- After: 10-20 instructions (queue push only)
- **Speedup**: 10-15× on free path
**BUT**: Batch processing overhead
- Owner slab lookup: 30-50 instructions per free (unchanged)
- Amortized over 256 frees: ~0.12-0.20 instructions per free (negligible)
- **Net speedup**: ~10× on same-thread frees, **0× on cross-thread frees**
**Benchmark analysis** (from bench_comprehensive.c):
- Same-thread frees: 40-60% (LIFO/FIFO patterns)
- Cross-thread frees: 40-60% (interleaved/random patterns)
**Overall impact**:
- 40-60% same-thread: **10× faster** (150 → 15 instructions)
- 40-60% cross-thread: **No change** (still need immediate owner slab lookup)
- **Net speedup**: 10 × 0.5 + 1.0 × 0.5 = **5.5× on free path**
**BUT**: Deferred free delay
- Memory not reclaimed until batch processes (256 frees)
- Increased memory footprint: 256 × 8B = 2KB per thread per class
- Cache pollution: Deferred ptrs may evict useful data
**Realistic net speedup**: **1.3-1.5× on frees** (after overhead)
#### Pros
- **Large instruction savings** (10-15× on free path)
- **No background thread** (batch processes inline or on-demand)
- **Simple implementation** (just a TLS queue + batch loop)
- **Compatible with existing remote-free** (cross-thread unchanged)
#### Cons
- **Deferred memory reclamation** (256 frees delay)
- **Increased memory footprint** (2KB × 8 classes × 32 threads = 512KB)
- **Limited benefit on cross-thread frees** (40-60% of workload unaffected)
- **Batch processing latency spikes** (256 owner slab lookups at once)
---
### 2.3 Option C: Hybrid (Magazine + Deferred Processing)
**Goal**: Combine Option A (background magazine refill) + Option B (deferred free queue)
#### Design
```c
// Allocation: TLS magazine (10-20 instructions)
void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        return mag->items[--mag->top].ptr;
    }
    // Trigger background refill if needed
    if (atomic_load(&mag->refill_needed) == 0) {
        atomic_store(&mag->refill_needed, 1);
    }
    return NULL; // Fallback to next tier
}

// Free: Deferred batch queue (10-20 instructions)
void hak_tiny_free(void* ptr) {
    TinyDeferredFree* queue = &g_tls_deferred_free;
    queue->ptrs[queue->count++] = ptr;
    if (queue->count >= 256) {
        hak_tiny_process_deferred_frees(queue);
    }
}

// Background worker: Refill magazines + process deferred frees
void background_worker(void) {
    while (1) {
        // Refill magazines for any thread/class that requested it
        for (int tid = 0; tid < MAX_THREADS; tid++) {
            for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
                TinyTLSMag* mag = &g_thread_mags[tid][class_idx];
                if (atomic_load(&mag->refill_needed)) {
                    batch_refill_from_all_slabs(mag, 256);
                    atomic_store(&mag->refill_needed, 0);
                }
            }
        }
        // Process deferred frees published by each thread
        for (int tid = 0; tid < MAX_THREADS; tid++) {
            TinyDeferredFree* queue = &g_thread_deferred_free[tid]; // illustrative per-thread registry
            if (queue->count > 0) {
                hak_tiny_process_deferred_frees(queue);
            }
        }
        usleep(50); // 50μs sleep
    }
}
```
#### Expected Performance
**Front-path savings**:
- Allocation: 228 → 30-45 instructions (5-7× faster)
- Free: 150-200 → 10-20 instructions (10-15× faster)
**Overall impact** (accounting for hit rates and overhead):
- Allocations: 1.5-2.0× faster (Option A)
- Frees: 1.3-1.5× faster (Option B)
- **Net speedup**: √(2.0 × 1.5) ≈ **1.7× overall**
**Realistic net speedup**: **1.7-2.0× (62 → 105-124 M ops/sec)**
#### Pros
- **Best overall speedup** (combines benefits of both approaches)
- **Balanced optimization** (both alloc and free paths improved)
- **Single background worker** (shared thread for refill + deferred frees)
#### Cons
- **Highest implementation complexity** (both systems + worker coordination)
- **Background thread CPU cost** (15-30% of 1 core)
- **Tuning complexity** (refill threshold, batch size, sleep interval, queue size)
- **Largest memory footprint** (TLS magazines + deferred queues)
---
## Part 3: Feasibility Analysis
### 3.1 Instruction Reduction Potential
**Current measured performance** (HAKMEM_WRAP_TINY=1):
- **Instructions per op**: 228 (from perf stat: 1.4T / 6.1B ops)
- **IPC**: 4.73 (very high - compute-bound)
- **Cycles per op**: 48.2 (228 / 4.73)
- **Latency**: 16.1 ns/op @ 3 GHz
**Theoretical minimum** (mimalloc-style):
- **Instructions per op**: 15-25 (TLS pointer bump + freelist push)
- **IPC**: 4.5-5.0 (cache-friendly sequential access)
- **Cycles per op**: 4-5 (15-25 / 5.0)
- **Latency**: 1.3-1.7 ns/op @ 3 GHz
**Achievable with async background workers**:
- **Allocation path**: 30-45 instructions (magazine hit) vs 228 (bitmap scan)
- **Free path**: 10-20 instructions (deferred queue) vs 150-200 (owner slab lookup)
- **Average**: (30 + 15) / 2 = **22.5 instructions per op** (simple mean of the alloc and free midpoints)
- **IPC**: 4.5 (slightly worse due to memory barriers)
- **Cycles per op**: 22.5 / 4.5 = **5.0 cycles**
- **Latency**: 5.0 / 3.0 = **1.7 ns/op**
**Expected speedup**: 16.1 / 1.7 ≈ **9.5× (theoretical maximum)**
**BUT**: Background thread overhead
- Atomic refill triggers: +1-2 cycles per op
- Cache coherence (false sharing): +2-3 cycles per op
- Memory barriers: +1-2 cycles per op
- **Total overhead**: +4-7 cycles per op
**Realistic achievable**:
- **Cycles per op**: 5.0 (front-path) + ~5.0 (overhead) ≈ **10.0 cycles**
- **Latency**: 10.0 / 3.0 = **3.3 ns/op**
- **Throughput**: 300 M ops/sec
- **Speedup**: 16.1 / 3.3 ≈ **4.9× (theoretical)**
**Actual achievable** (accounting for partial hit rates):
- **60-80% hit magazine**: Already fast (6 ns)
- **20-40% miss magazine**: Improved (16 ns → 3.3 ns)
- **Net improvement**: 0.7 × 6 + 0.3 × 3.3 = **5.2 ns/op**
- **Speedup**: 16.1 / 5.2 ≈ **3.1× (optimistic)**
**Conservative estimate** (accounting for all overheads):
- **Net speedup**: **2.0-2.5× (62 → 124-155 M ops/sec)**
### 3.2 Comparison with mimalloc
**Why mimalloc is 263 M ops/sec (3.8 ns/op)**:
1. **Zero-initialization on allocation** (no bitmap scan ever)
- Uses sequential memory bump pointer (O(1) pointer arithmetic)
- Free blocks tracked as linked list (no scanning needed)
2. **Embedded slab metadata** (no hash lookup on free)
- Slab pointer embedded in first 16 bytes of allocation
- Owner slab lookup is single pointer dereference (3-4 cycles)
3. **TLS-local slabs** (no cross-thread remote free queues)
- Each thread owns its slabs exclusively
- Cross-thread frees go to per-thread remote queue (not per-slab)
4. **Lazy coalescing** (defers bitmap consolidation to background)
- Front-path never touches bitmaps
- Background thread scans and coalesces every 100ms
**hakmem cannot match mimalloc without fundamental redesign** because:
- Bitmap-based allocation requires scanning (cannot be O(1) pointer bump)
- Hash-based owner slab lookup requires hash computation (cannot be single dereference)
- Per-slab remote queues require immediate slab lookup on cross-thread free
**Realistic target**: **120-180 M ops/sec (6.7-8.3 ns/op)** - still **2-3× slower than mimalloc**
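For contrast, here is a minimal illustrative sketch (not mimalloc's actual code) of the two properties listed above: allocation as an O(1) freelist pop or pointer bump, and free recovering the owning page's metadata by alignment arithmetic alone, with no hash lookup. The layout and 64 KB page granularity are assumptions for the sketch:
```c
#include <stdint.h>
#include <stddef.h>

typedef struct Block { struct Block* next; } Block;

typedef struct Page {
    Block*   free_list;   // intrusive freelist of previously freed blocks
    uint8_t* bump;        // bump pointer into never-used space
    uint8_t* bump_end;
    size_t   block_size;
} Page;

#define PAGE_SIZE (64 * 1024)   // assumed page granularity; Page sits at the page start

static inline void* page_alloc(Page* p) {
    Block* b = p->free_list;
    if (b) { p->free_list = b->next; return b; }     // O(1) freelist pop
    if (p->bump + p->block_size <= p->bump_end) {    // O(1) pointer bump
        void* r = p->bump;
        p->bump += p->block_size;
        return r;
    }
    return NULL;                                     // page full
}

static inline void page_free(void* ptr) {
    // Owner metadata is recovered by alignment arithmetic alone (no registry).
    Page* p = (Page*)((uintptr_t)ptr & ~(uintptr_t)(PAGE_SIZE - 1));
    Block* b = (Block*)ptr;
    b->next = p->free_list;
    p->free_list = b;
}
```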
### 3.3 Implementation Effort vs Benefit
| Option | Effort (hours) | Speedup | Ops/sec | Gap to mimalloc | Complexity |
|--------|---------------|---------|---------|-----------------|-----------|
| **Current** | 0 | 1.0× | 62 | 4.2× slower | Baseline |
| **Option A** | 6-8 | 1.5-1.8× | 93-112 | 2.4-2.8× slower | Medium |
| **Option B** | 4-6 | 1.3-1.5× | 81-93 | 2.8-3.2× slower | Low |
| **Option C** | 10-14 | 1.7-2.2× | 105-136 | 1.9-2.5× slower | High |
| **Theoretical max** | N/A | 3.1× | 192 | 1.4× slower | N/A |
| **mimalloc** | N/A | 4.2× | 263 | Baseline | N/A |
**Best effort/benefit ratio**: **Option B (Deferred Slab Lookup)**
- **4-6 hours** of implementation
- **1.3-1.5× speedup** (25-35% faster)
- **Low complexity** (single TLS queue + batch loop)
- **No background thread** (inline batch processing)
**Maximum performance**: **Option C (Hybrid)**
- **10-14 hours** of implementation
- **1.7-2.2× speedup** (50-75% faster)
- **High complexity** (background worker + coordination)
- **Requires background thread** (CPU cost)
---
## Part 4: Recommended Implementation Plan
### Phase 1: Deferred Free Queue (4-6 hours) [Option B]
**Goal**: Eliminate owner slab lookup on same-thread frees
#### Step 1.1: Add TLS Deferred Free Queue (1 hour)
```c
// hakmem_tiny.h - Add to global state
#define DEFERRED_FREE_QUEUE_SIZE 256

typedef struct {
    void* ptrs[DEFERRED_FREE_QUEUE_SIZE];
    uint16_t count;
} TinyDeferredFree;

// TLS per-class deferred free queues
static __thread TinyDeferredFree g_tls_deferred_free[TINY_NUM_CLASSES];
```
#### Step 1.2: Modify Free Path (2 hours)
```c
// hakmem_tiny.c - Replace hak_tiny_free()
void hak_tiny_free(void* ptr) {
    if (!ptr || !g_tiny_initialized) return;

    // Try SuperSlab fast path first (existing)
    SuperSlab* ss = ptr_to_superslab(ptr);
    if (ss && ss->magic == SUPERSLAB_MAGIC) {
        hak_tiny_free_superslab(ptr, ss);
        return;
    }

    // NEW: Deferred free path (no owner slab lookup!)
    // Guess class from allocation size hint (optional optimization)
    int class_idx = guess_class_from_ptr(ptr); // heuristic
    if (class_idx >= 0) {
        TinyDeferredFree* queue = &g_tls_deferred_free[class_idx];
        queue->ptrs[queue->count++] = ptr;
        // Batch process if queue is full
        if (queue->count >= DEFERRED_FREE_QUEUE_SIZE) {
            hak_tiny_process_deferred_frees(class_idx, queue);
        }
        return;
    }

    // Fallback: Immediate owner slab lookup (cross-thread or unknown)
    TinySlab* slab = hak_tiny_owner_slab(ptr);
    if (!slab) return;
    hak_tiny_free_with_slab(ptr, slab);
}
```
#### Step 1.3: Implement Batch Processing (2-3 hours)
```c
// hakmem_tiny.c - Batch process deferred frees
static void hak_tiny_process_deferred_frees(int class_idx, TinyDeferredFree* queue) {
    pthread_mutex_t* lock = &g_tiny_class_locks[class_idx].m;
    pthread_mutex_lock(lock);
    for (int i = 0; i < queue->count; i++) {
        void* ptr = queue->ptrs[i];
        // Owner slab lookup (expensive, but amortized over batch)
        TinySlab* slab = hak_tiny_owner_slab(ptr);
        if (!slab) continue;
        // Push to magazine or bitmap
        hak_tiny_free_with_slab(ptr, slab);
    }
    pthread_mutex_unlock(lock);
    queue->count = 0;
}
```
**Expected outcome**:
- **Same-thread frees**: 10-15× faster (150 → 10-20 instructions)
- **Cross-thread frees**: Unchanged (still need immediate lookup)
- **Overall speedup**: 1.3-1.5× (25-35% faster)
- **Memory overhead**: 256 × 8B × 8 classes = 16KB per thread
---
### Phase 2: Background Magazine Refill (6-8 hours) [Option A]
**Goal**: Eliminate bitmap scanning on allocation path
#### Step 2.1: Add Refill Trigger (1 hour)
```c
// hakmem_tiny.h - Add refill trigger to TLS magazine
typedef struct {
    TinyMagItem items[TINY_TLS_MAG_CAP];
    int top;
    int cap;
    atomic_int refill_needed; // NEW: Background refill trigger
} TinyTLSMag;
```
#### Step 2.2: Modify Allocation Path (2 hours)
```c
// hakmem_tiny.c - Trigger refill on magazine miss
void* hak_tiny_alloc(size_t size) {
    // ... (existing size-to-class logic) ...
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        return mag->items[--mag->top].ptr; // Fast path: unchanged
    }
    // NEW: Trigger background refill (non-blocking)
    if (atomic_load(&mag->refill_needed) == 0) {
        atomic_store(&mag->refill_needed, 1);
    }
    // Fallback to existing slow path (TLS slab, bitmap scan, lock)
    return hak_tiny_alloc_slow(class_idx);
}
```
#### Step 2.3: Implement Background Worker (3-5 hours)
```c
// hakmem_tiny.c - Background refill thread
static void* background_refill_worker(void* arg) {
    while (g_background_worker_running) {
        // Scan all threads for refill requests
        for (int tid = 0; tid < g_max_threads; tid++) {
            for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
                TinyTLSMag* mag = &g_thread_mags[tid][class_idx];
                // Check if refill needed
                if (atomic_load(&mag->refill_needed) == 0) {
                    continue;
                }
                // Refill from bitmaps (expensive, but in background)
                pthread_mutex_t* lock = &g_tiny_class_locks[class_idx].m;
                pthread_mutex_lock(lock);
                TinySlab* slab = g_tiny_pool.free_slabs[class_idx];
                if (slab && slab->free_count > 0) {
                    int refilled = batch_refill_from_bitmap(
                        slab, &mag->items[mag->top], 256
                    );
                    mag->top += refilled;
                }
                pthread_mutex_unlock(lock);
                atomic_store(&mag->refill_needed, 0);
            }
        }
        usleep(100); // 100μs sleep (tune based on load)
    }
    return NULL;
}

// Start background worker on init
void hak_tiny_init(void) {
    // ... (existing init logic) ...
    g_background_worker_running = 1;
    pthread_create(&g_background_worker, NULL, background_refill_worker, NULL);
}
```
**Expected outcome**:
- **Allocation misses**: 5-7× faster (228 → 30-45 instructions)
- **Magazine hit rate**: Improved (background keeps magazines full)
- **Overall speedup**: +30-50% (combined with Phase 1)
- **CPU cost**: 1 core at 10-20% utilization
---
### Phase 3: Tuning and Optimization (2-3 hours)
**Goal**: Reduce overhead and maximize hit rates
#### Step 3.1: Tune Batch Sizes (1 hour)
- Test refill batch sizes: 64, 128, 256, 512
- Test deferred free queue sizes: 128, 256, 512
- Measure impact on throughput and latency variance
#### Step 3.2: Reduce False Sharing (1 hour)
```c
// Cache-align TLS magazines to avoid false sharing
typedef struct __attribute__((aligned(64))) {
    TinyMagItem items[TINY_TLS_MAG_CAP];
    int top;
    int cap;
    atomic_int refill_needed;
    // aligned(64) rounds sizeof up to a 64B multiple; no manual pad member needed
} TinyTLSMag;
```
#### Step 3.3: Add Adaptive Sleep (1 hour)
```c
// Background worker: Adaptive sleep based on load
static void* background_refill_worker(void* arg) {
    (void)arg;
    int idle_count = 0;
    while (g_background_worker_running) {
        int work_done = 0;
        // ... (refill logic) ...
        if (work_done == 0) {
            idle_count++;
            int shift = idle_count < 4 ? idle_count : 4;
            usleep(100 << shift); // Exponential backoff: 200μs .. 1.6ms
        } else {
            idle_count = 0;
            usleep(50); // Active: short sleep
        }
    }
    return NULL;
}
```
**Expected outcome**:
- **Reduced CPU cost**: 10-20% → 5-10% (adaptive sleep)
- **Better cache utilization**: Alignment reduces false sharing
- **Tuned for workload**: Batch sizes optimized for benchmarks
---
## Part 5: Expected Performance
### Before (Current)
```
HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem
Test 1: Sequential LIFO (16B)
Throughput: 105 M ops/sec
Latency: 9.5 ns/op
Test 2: Sequential FIFO (16B)
Throughput: 98 M ops/sec
Latency: 10.2 ns/op
Test 3: Random Free (16B)
Throughput: 62 M ops/sec
Latency: 16.1 ns/op
Average: 88 M ops/sec (11.4 ns/op)
```
### After Phase 1 (Deferred Free Queue)
**Expected improvement**: +25-35% (same-thread frees only)
```
Test 1: Sequential LIFO (16B) [80% same-thread]
Throughput: 135 M ops/sec (+29%)
Latency: 7.4 ns/op
Test 2: Sequential FIFO (16B) [80% same-thread]
Throughput: 126 M ops/sec (+29%)
Latency: 7.9 ns/op
Test 3: Random Free (16B) [40% same-thread]
Throughput: 73 M ops/sec (+18%)
Latency: 13.7 ns/op
Average: 111 M ops/sec (+26%) - [9.0 ns/op]
```
### After Phase 2 (Background Refill)
**Expected improvement**: +40-60% (combined)
```
Test 1: Sequential LIFO (16B)
Throughput: 168 M ops/sec (+60%)
Latency: 6.0 ns/op
Test 2: Sequential FIFO (16B)
Throughput: 157 M ops/sec (+60%)
Latency: 6.4 ns/op
Test 3: Random Free (16B)
Throughput: 93 M ops/sec (+50%)
Latency: 10.8 ns/op
Average: 139 M ops/sec (+58%) - [7.2 ns/op]
```
### After Phase 3 (Tuning)
**Expected improvement**: +50-75% (optimized)
```
Test 1: Sequential LIFO (16B)
Throughput: 180 M ops/sec (+71%)
Latency: 5.6 ns/op
Test 2: Sequential FIFO (16B)
Throughput: 168 M ops/sec (+71%)
Latency: 6.0 ns/op
Test 3: Random Free (16B)
Throughput: 105 M ops/sec (+69%)
Latency: 9.5 ns/op
Average: 151 M ops/sec (+72%) - [6.6 ns/op]
```
### Gap to mimalloc (263 M ops/sec)
| Phase | Ops/sec | Gap | % of mimalloc |
|-------|---------|-----|---------------|
| Current | 88 | 3.0× slower | 33% |
| Phase 1 | 111 | 2.4× slower | 42% |
| Phase 2 | 139 | 1.9× slower | 53% |
| Phase 3 | 151 | 1.7× slower | 57% |
| **mimalloc** | **263** | **Baseline** | **100%** |
**Conclusion**: Async background workers can achieve **1.7× speedup**, but still **1.7× slower than mimalloc** due to fundamental architecture differences.
---
## Part 6: Critical Success Factors
### 6.1 Verify with perf
After each phase, run:
```bash
HAKMEM_WRAP_TINY=1 perf record -e cycles:u -g ./bench_comprehensive_hakmem
perf report --stdio --no-children -n --percent-limit 1.0
```
**Expected changes**:
- **Phase 1**: `hak_tiny_owner_slab` drops from 1.37% → 0.5-0.7%
- **Phase 2**: `hak_tiny_find_free_block` drops from ~1% → 0.2-0.3%
- **Phase 3**: Overall cycles per op drops 40-60%
### 6.2 Measure Instruction Count
```bash
HAKMEM_WRAP_TINY=1 perf stat -e instructions,cycles,branches ./bench_comprehensive_hakmem
```
**Expected changes**:
- **Before**: 228 instructions/op, 48.2 cycles/op
- **Phase 1**: 180-200 instructions/op, 40-45 cycles/op
- **Phase 2**: 120-150 instructions/op, 28-35 cycles/op
- **Phase 3**: 100-130 instructions/op, 22-28 cycles/op
### 6.3 Avoid Synchronization Overhead
**Key principles** (see the sketch after this list):
- Use `atomic_load_explicit` with `memory_order_relaxed` for low-contention checks
- Batch operations to amortize lock costs (256+ items per batch)
- Align TLS structures to 64B to avoid false sharing
- Use exponential backoff on background thread sleep
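A minimal sketch of the relaxed-check / release-store pattern suggested by the first principle. The exact memory orderings are assumptions and would need review against the real refill protocol:
```c
#include <stdatomic.h>

typedef struct {
    _Atomic int refill_needed;
} RefillFlag;

static inline void request_refill(RefillFlag* f) {
    // Relaxed read first: in the common (already-requested) case this is a
    // plain load with no store, avoiding cache-line ping-pong between cores.
    if (atomic_load_explicit(&f->refill_needed, memory_order_relaxed) == 0) {
        atomic_store_explicit(&f->refill_needed, 1, memory_order_release);
    }
}

static inline int consume_refill_request(RefillFlag* f) {
    // Background-worker side: the acquire load pairs with the release store above.
    if (atomic_load_explicit(&f->refill_needed, memory_order_acquire) == 1) {
        atomic_store_explicit(&f->refill_needed, 0, memory_order_relaxed);
        return 1;
    }
    return 0;
}
```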
### 6.4 Focus on Front-Path
**Priority order**:
1. **TLS magazine hit**: Must remain <30 instructions (already optimal)
2. **Deferred free queue**: Must be <20 instructions (Phase 1)
3. **Background refill trigger**: Must be <10 instructions (Phase 2)
4. **Batch processing**: Can be expensive (amortized over 256 items)
---
## Part 7: Conclusion
### Can we achieve 100-150 M ops/sec with async background workers?
**Yes, but with caveats**:
- **100 M ops/sec**: Achievable with Phase 1 alone (4-6 hours)
- **150 M ops/sec**: Achievable with Phase 1+2+3 (12-17 hours)
- **180+ M ops/sec**: Unlikely without fundamental redesign
### Why the gap to mimalloc remains
**mimalloc's advantages that async workers cannot replicate**:
1. **O(1) pointer bump allocation** (no bitmap scan, even in background)
2. **Embedded slab metadata** (no hash lookup, ever)
3. **TLS-exclusive slabs** (no cross-thread remote queues)
**hakmem's fundamental constraints**:
- Bitmap-based allocation requires scanning (cannot be O(1))
- Hash-based slab registry requires computation on free
- Per-slab remote queues require immediate slab lookup
### Recommended next steps
**Short-term (4-6 hours)**: Implement **Phase 1 (Deferred Free Queue)**
- **Effort**: Low (single TLS queue + batch loop)
- **Benefit**: 25-35% speedup (62 → 81-93 M ops/sec)
- **Risk**: Low (no background thread, simple design)
**Medium-term (10-14 hours)**: Add **Phase 2 (Background Refill)**
- **Effort**: Medium (background worker + coordination)
- **Benefit**: 50-75% speedup (62 → 93-108 M ops/sec)
- **Risk**: Medium (background thread overhead, tuning complexity)
**Long-term (20-30 hours)**: Consider **fundamental redesign**
- Replace bitmap with freelist (mimalloc-style)
- Embed slab metadata in allocations (avoid hash lookup)
- Use TLS-exclusive slabs (eliminate remote queues)
- **Potential**: 3-4× speedup (approaching mimalloc)
### Final verdict
**Async background workers are a viable optimization**, but not a silver bullet:
- **Expected speedup**: 1.5-2.0× (realistic)
- **Best-case speedup**: 2.0-2.5× (with perfect tuning)
- **Gap to mimalloc**: Remains 1.7-2.0× (architectural limitations)
**Recommended approach**: Implement Phase 1 first (low effort, good ROI), then evaluate if Phase 2 is worth the complexity.
---
## Part 8: Phase 1 Implementation Results & Lessons Learned
**Date**: 2025-10-26
**Status**: FAILED - Structural design flaw identified
### Phase 1 Implementation Summary
**What was implemented**:
1. TLS Deferred Free Queue (256 items)
2. Batch processing function
3. Modified `hak_tiny_free` to push to queue
**Expected outcome**: 1.3-1.5× speedup (25-35% faster frees)
### Actual Results
| Metric | Before | After Phase 1 | Change |
|--------|--------|---------------|--------|
| 16B LIFO | 62.13 M ops/s | 63.50 M ops/s | +2.2% |
| 32B LIFO | 53.96 M ops/s | 54.47 M ops/s | +0.9% |
| 64B LIFO | 50.93 M ops/s | 50.10 M ops/s | -1.6% |
| 128B LIFO | 64.44 M ops/s | 63.34 M ops/s | -1.7% |
| **Instructions/op** | **228** | **229** | **+1** ❌ |
**Conclusion**: **Phase 1 had ZERO effect** (performance unchanged, instructions increased by 1)
### Root Cause Analysis
**Critical design flaw discovered**:
```c
void hak_tiny_free(void* ptr) {
    // SuperSlab fast path FIRST (Quick Win #1)
    SuperSlab* ss = ptr_to_superslab(ptr);
    if (ss && ss->magic == SUPERSLAB_MAGIC) {
        hak_tiny_free_superslab(ptr, ss);
        return; // ← 99% of frees exit here!
    }

    // Deferred Free Queue (NEVER REACHED!)
    queue->ptrs[queue->count++] = ptr;
    ...
}
```
**Why Phase 1 failed**:
1. **SuperSlab is enabled by default** (`g_use_superslab = 1` from Quick Win #1)
2. **99% of frees take SuperSlab fast path** (especially sequential patterns)
3. **Deferred queue is never used** → zero benefit, added overhead
4. **Push-based approach is fundamentally flawed** for this use case
### Alignment with ChatGPT Analysis
ChatGPT's analysis of a similar "Phase 4" issue identified the same structural problem:
> **"Free で加速の仕込みをするpush型は、spill が頻発する系ではコスト先払いになり負けやすい。"**
**Key insight**: **Push-based optimization on free path pays upfront cost without guaranteed benefit.**
### Lessons Learned
1. **Push vs Pull strategy**:
- **Push (Phase 1)**: Pay cost upfront on every free → wasted if not consumed
- **Pull (Phase 2)**: Pay cost only when needed on alloc → guaranteed benefit
2. **Interaction with existing optimizations**:
- SuperSlab fast path makes deferred queue unreachable
- Cannot optimize already-optimized path
3. **Measurement before optimization**:
- Should have profiled where frees actually go (SuperSlab vs registry)
- Would have caught this before implementation
### Revised Strategy: Phase 2 (Pull-based)
**Recommended approach** (from ChatGPT + our analysis):
```
Phase 2: Background Magazine Refill (pull-based)

Allocation path:
  magazine.top > 0  → return item (fast path unchanged)
  magazine.top == 0 → trigger background refill
                    → fall back to slow path

Background worker (pull-based):
  Periodically scan for refill_needed flags
  Perform bitmap scan (expensive, but in background)
  Refill magazines in batch (256 items)

Free path: NO CHANGES (zero cost increase)
```
**Expected benefits**:
- **No upfront cost on free** (major advantage over the push-based approach)
- **Guaranteed benefit on alloc** (magazine hit rate increases)
- **Amortized bitmap scan cost** (1 scan → 256 allocs)
- **Expected speedup**: 1.5-2.0× (30-50% faster)
### Decision: Revert Phase 1, Implement Phase 2
**Next steps**:
1. ✅ Document Phase 1 failure and analysis
2. ⏳ Revert Phase 1 changes (clean code)
3. ⏳ Implement Phase 2 (pull-based background refill)
4. ⏳ Measure and validate Phase 2 effectiveness
**Key takeaway**: **"Pull-based, only when needed" beats "push-based, pay the cost upfront"**
---
## Part 9: Phase 2 Implementation Results & Critical Insight
**Date**: 2025-10-26
**Status**: FAILED - Worse than baseline (Phase 1 was zero effect, Phase 2 degraded performance)
### Phase 2 Implementation Summary
**What was implemented**:
1. Global Refill Queue (per-class, lock-free read)
2. Background worker thread (bitmap scanning in background)
3. Pull-based magazine refill (check global queue on magazine miss)
4. Adaptive sleep (exponential backoff when idle)
**Expected outcome**: 1.5-2.0× speedup (228 → 100-150 instructions/op)
### Actual Results
| Metric | Baseline (no async) | Phase 1 (Push) | Phase 2 (Pull) | Change |
|--------|----------|---------|---------|--------|
| 16B LIFO | 62.13 M ops/s | 63.50 M ops/s | 62.80 M ops/s | **-1.1%** |
| 32B LIFO | 53.96 M ops/s | 54.47 M ops/s | 52.64 M ops/s | **-3.4%** ❌ |
| 64B LIFO | 50.93 M ops/s | 50.10 M ops/s | 49.37 M ops/s | **-1.5%** ❌ |
| 128B LIFO | 64.44 M ops/s | 63.34 M ops/s | 63.53 M ops/s | **+0.3%** |
| **Instructions/op** | **~228** | **229** | **306** | **+33%** ❌❌❌ |
**Conclusion**: **Phase 2 DEGRADED performance** (worse than baseline and Phase 1!)
### Root Cause Analysis
**Critical insight**: **Both Phase 1 and Phase 2 optimize the WRONG code path!**
```
Benchmark allocation pattern (LIFO):

Iteration 1:
  alloc[0..99] → Slow path: Fill TLS Magazine from slabs
  free[99..0]  → Items return to TLS Magazine (LIFO)

Iterations 2 .. 1,000,000:
  alloc[0..99] → Fast path: 100% TLS Magazine hit! (6 ns/op)
  free[99..0]  → Fast path: Return to TLS Magazine (6 ns/op)

NO SLOW PATH EVER EXECUTED AFTER FIRST ITERATION!
```
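A minimal sketch of that loop shape (not the literal bench_comprehensive code): a 100-pointer working set cycled LIFO, which a 2048-item magazine absorbs entirely after the first iteration, so the slow path never runs again:
```c
#include <stdlib.h>

#define WORKING_SET 100
#define ITERATIONS  1000000

static void lifo_pattern(void) {
    void* ptrs[WORKING_SET];
    for (long it = 0; it < ITERATIONS; it++) {
        for (int i = 0; i < WORKING_SET; i++)          // allocs: magazine pops
            ptrs[i] = malloc(16);
        for (int i = WORKING_SET - 1; i >= 0; i--)     // LIFO frees: magazine pushes
            free(ptrs[i]);
    }
}
```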
**Why Phase 2 failed worse than Phase 1**:
1. **Background worker thread consuming CPU** (extra overhead)
2. **Atomic operations on global queue** (contention + memory ordering cost)
3. **No benefit** because TLS magazine never empties (100% hit rate)
4. **Pure overhead** without any gain
### Fundamental Architecture Problem
**The async optimization strategy (Phase 1 + 2) is based on a flawed assumption**:
**Assumption**: "Slow path (bitmap scan, lock, owner lookup) is the bottleneck"
**Reality**: "Fast path (TLS magazine access) is the bottleneck"
**Evidence**:
- Benchmark working set: 100 items
- TLS Magazine capacity: 2048 items (class 0)
- Hit rate: 100% after first iteration
- Slow path execution: ~0% (never reached)
**Performance gap breakdown**:
```
hakmem Tiny Pool: 60 M ops/sec (16.7 ns/op) = 228 instructions
glibc malloc: 105 M ops/sec ( 9.5 ns/op) = ~30-40 instructions
Gap: 40% slower = ~190 extra instructions on FAST PATH
```
### Why hakmem is Slower (Architectural Differences)
**1. Bitmap-based allocation** (hakmem):
- Find free block: bitmap scan (CTZ instruction)
- Mark used: bit manipulation (OR + update summary bitmap)
- Cost: 30-40 instructions even with optimizations (contrasted in the sketch after this list)
**2. Free-list allocation** (glibc):
- Find free block: single pointer dereference
- Mark used: pointer update
- Cost: 5-10 instructions
**3. TLS Magazine access overhead**:
- hakmem: `g_tls_mags[class_idx].items[--top].ptr` (3 memory reads + index calc)
- glibc: Direct arena access (1-2 memory reads)
**4. Statistics batching** (Phase 3 optimization):
- hakmem: XOR RNG sampling (10-15 instructions)
- glibc: No stats tracking
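A side-by-side sketch of the two allocation styles compared in items 1 and 2 above; the field names and the 1-bit-per-free-block layout are illustrative, not the actual hakmem or glibc data structures:
```c
#include <stdint.h>
#include <stddef.h>

/* Bitmap-style: dependent loads, a CTZ, and two read-modify-writes per alloc. */
typedef struct {
    uint64_t summary;        // one bit per bitmap word that still has free blocks
    uint64_t words[64];      // 1 = free, 0 = used (illustrative layout)
    uint8_t* base;
    size_t   block_size;
} BitmapSlab;

static void* bitmap_alloc(BitmapSlab* s) {
    if (s->summary == 0) return NULL;
    int w   = __builtin_ctzll(s->summary);        // first word with a free bit
    int bit = __builtin_ctzll(s->words[w]);       // first free block in that word
    s->words[w] &= ~(1ULL << bit);                // mark used
    if (s->words[w] == 0)
        s->summary &= ~(1ULL << w);               // keep the summary in sync
    return s->base + ((size_t)w * 64 + bit) * s->block_size;
}

/* Freelist-style: one load and one store per alloc. */
typedef struct FLBlock { struct FLBlock* next; } FLBlock;

static void* freelist_alloc(FLBlock** head) {
    FLBlock* b = *head;
    if (b) *head = b->next;
    return b;
}
```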
### Lessons Learned
**1. Optimize the code path that actually executes**:
- ❌ Optimized slow path (99.9% never executed)
- ✅ Should optimize fast path (99.9% of operations)
**2. Async optimization only helps with cache misses**:
- Benchmark: 100% cache hit rate after warmup
- Real workload: Unknown hit rate (need profiling)
**3. Adding complexity without measurement is harmful**:
- Phase 1: +1 instruction (zero benefit)
- Phase 2: +77 instructions/op (+33% instruction count, 1-3% slower)
**4. Fundamental architectural differences matter more than micro-optimizations**:
- Bitmap vs free-list: ~10× instruction difference
- Async background work cannot bridge this gap
### Revised Understanding
**The 40% performance gap (hakmem vs glibc) is NOT due to slow-path inefficiency.**
**It's due to fundamental design choices**:
1. **Bitmap allocation** (flexible, low fragmentation) vs **Free-list** (fast, simple)
2. **Slab ownership tracking** (hash lookup on free) vs **Embedded metadata** (single dereference)
3. **Research features** (statistics, ELO, batching) vs **Production simplicity**
**These tradeoffs are INTENTIONAL for research purposes.**
### Conclusion & Next Steps
**Both Phase 1 and Phase 2 should be reverted.**
**Async optimization strategy is fundamentally flawed for this workload.**
**Actual bottleneck**: TLS Magazine fast path (99.9% of execution)
- Current: ~17 ns/op (228 instructions)
- Target: ~10 ns/op (glibc level)
- Gap: 7 ns = ~50-70 instructions
**Possible future optimizations** (NOT async):
1. **Inline TLS magazine access** (reduce function call overhead)
2. **SIMD bitmap scanning** (4-8× faster block finding; sketched after this list)
3. **Remove statistics sampling** (save 10-15 instructions)
4. **Simplified magazine structure** (single array instead of struct)
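As a sketch of item 2, one way SIMD could accelerate the scan: use AVX2 to locate the first 64-bit bitmap word that still has a free bit, then CTZ within it. This is an assumption-level illustration (4-word bitmap, 1 = free), not existing hakmem code:
```c
#include <immintrin.h>
#include <stdint.h>

// Compile with -mavx2. Returns the index of the first free block, or -1 if none.
static int find_first_free_avx2(const uint64_t bitmap[4]) {
    __m256i v    = _mm256_loadu_si256((const __m256i*)bitmap);
    __m256i zero = _mm256_setzero_si256();
    // Lanes equal to zero have no free blocks; movemask collects one bit per lane.
    __m256i empty = _mm256_cmpeq_epi64(v, zero);
    unsigned full_mask = (unsigned)_mm256_movemask_pd(_mm256_castsi256_pd(empty));
    if (full_mask == 0xF) return -1;                  // all four words fully used
    int word = __builtin_ctz(~full_mask & 0xF);       // first word with a free bit
    return word * 64 + __builtin_ctzll(bitmap[word]); // first free block index
}
```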
**Or accept reality**:
- hakmem is a research allocator with diagnostic features
- 40% slowdown is acceptable cost for flexibility
- Production use cases might have different performance profiles
**Recommended action**: Revert Phase 2, commit analysis, move on.
---
## Part 10: Phase 7.5 Failure Analysis - Inline Fast Path
**Date**: 2025-10-26
**Goal**: Reduce hak_tiny_alloc from 22.75% CPU to ~10% via inline wrapper
**Result**: **REGRESSION (-7% to -15%)**
### Implementation Approach
Created inline wrapper to handle common case (TLS magazine hit) without function call:
```c
static inline void* hak_tiny_alloc(size_t size) __attribute__((always_inline));
static inline void* hak_tiny_alloc(size_t size) {
    // Fast checks
    if (UNLIKELY(size > TINY_MAX_SIZE)) return hak_tiny_alloc_impl(size);
    if (UNLIKELY(!g_tiny_initialized)) return hak_tiny_alloc_impl(size);
    if (UNLIKELY(!g_wrap_tiny_enabled && hak_in_wrapper())) return hak_tiny_alloc_impl(size);

    // Size to class
    int class_idx = hak_tiny_size_to_class(size);

    // TLS Magazine fast path
    tiny_mag_init_if_needed(class_idx);
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (LIKELY(mag->top > 0)) {
        return mag->items[--mag->top].ptr; // Fast path!
    }
    return hak_tiny_alloc_impl(size); // Slow path
}
```
### Benchmark Results
| Test | Before (getenv fix) | After (Phase 7.5) | Change |
|------|---------------------|-------------------|--------|
| 16B LIFO | 120.55 M ops/sec | 110.46 M ops/sec | **-8.4%** |
| 32B LIFO | 88.57 M ops/sec | 79.00 M ops/sec | **-10.8%** |
| 64B LIFO | 94.74 M ops/sec | 88.01 M ops/sec | **-7.1%** |
| 128B LIFO | 122.36 M ops/sec | 104.21 M ops/sec | **-14.8%** |
| Mixed | 164.56 M ops/sec | 148.99 M ops/sec | **-9.5%** |
With `__attribute__((always_inline))`:
| Test | Always-inline Result | vs Baseline |
|------|---------------------|-------------|
| 16B LIFO | 115.89 M ops/sec | **-3.9%** |
Still slower than baseline!
### Root Cause Analysis
The inline wrapper **added more overhead than it removed**:
**Overhead Added:**
1. **Extra function calls in wrapper**:
- `hak_in_wrapper()` called on every allocation (even with UNLIKELY)
- `tiny_mag_init_if_needed()` called on every allocation
- These are function calls that happen BEFORE reaching the magazine
2. **Multiple conditional branches**:
- Size check
- Initialization check
- Wrapper guard check
- Branch misprediction cost
3. **Global variable reads**:
- `g_tiny_initialized` read every time
- `g_wrap_tiny_enabled` read every time
**Original code** (before inlining):
- One function call to `hak_tiny_alloc()`
- Inside function: direct path to magazine check (lines 685-688)
- No extra overhead
**Inline wrapper**:
- Zero function calls to enter
- But added 2 function calls inside (`hak_in_wrapper`, `tiny_mag_init_if_needed`)
- Added 3 conditional branches
- Net result: **MORE overhead, not less**
### Key Lesson Learned
**WRONG**: Function call overhead is the bottleneck (perf shows 22.75% in hak_tiny_alloc)
**RIGHT**: The 22.75% is the **code inside** the function, not the call overhead
**Micro-optimization fallacy**: Eliminating a function call (2-4 cycles) while adding:
- 2 function calls (4-8 cycles)
- 3 conditional branches (3-6 cycles)
- Multiple global reads (3-6 cycles)
Total overhead added: **10-20 cycles** vs **2-4 cycles** saved = **net loss**
### Correct Approach (Not Implemented)
To actually reduce the 22.75% CPU in hak_tiny_alloc, we should:
1. **Keep it as a regular function** (not inline)
2. **Optimize the code INSIDE**:
- Reduce stack usage (88 → 32 bytes)
- Cache globals at function entry
- Simplify control flow
- Reduce register pressure
3. **Or accept current performance**:
- Already 1.5-1.9× faster than glibc ✅
- Diminishing returns zone
- Further optimization may not be worth the risk
### Decision
**REVERTED** Phase 7.5 completely. Performance restored to 120-164 M ops/sec.
**CONCLUSION**: Stick with getenv fix. Ship what works. 🚀
---