# FREE PATH ULTRATHINK ANALYSIS
**Date:** 2025-11-08
**Performance Hotspot:** `hak_tiny_free_superslab` consuming 52.63% CPU
**Benchmark:** 1,046,392 ops/s (53x slower than System malloc's 56,336,790 ops/s)
---
## Executive Summary
The free() path in HAKMEM is **8x slower than allocation** (52.63% vs 6.48% CPU) due to:
1. **Multiple redundant lookups** (SuperSlab lookup called twice)
2. **Massive function size** (330 lines with many branches)
3. **Expensive safety checks** in hot path (duplicate scans, alignment checks)
4. **Atomic contention** (CAS loops on every free)
5. **Syscall overhead** (TID lookup on every free)
**Root Cause:** The free path was designed for safety and diagnostics, not performance. It lacks the "ultra-simple fast path" design that made allocation fast (Box 5).
---
## 1. CALL CHAIN ANALYSIS
### Complete Free Path (User → Kernel)
```
User free(ptr)
1. free() wrapper [hak_wrappers.inc.h:92]
├─ Line 93: atomic_fetch_add(g_free_wrapper_calls) ← Atomic #1
├─ Line 94: if (!ptr) return
├─ Line 95: if (g_hakmem_lock_depth > 0) → libc
├─ Line 96: if (g_initializing) → libc
├─ Line 97: if (hak_force_libc_alloc()) → libc
├─ Line 98-102: LD_PRELOAD checks
├─ Line 103: g_hakmem_lock_depth++ ← TLS write #1
├─ Line 104: hak_free_at(ptr, 0, HAK_CALLSITE()) ← MAIN ENTRY
└─ Line 105: g_hakmem_lock_depth--
2. hak_free_at() [hak_free_api.inc.h:64]
├─ Line 78: static int s_free_to_ss (getenv cache)
├─ Line 86: ss = hak_super_lookup(ptr) ← LOOKUP #1 ⚠️
├─ Line 87: if (ss->magic == SUPERSLAB_MAGIC)
├─ Line 88: slab_idx = slab_index_for(ss, ptr) ← CALC #1
├─ Line 89: if (sidx >= 0 && sidx < cap)
└─ Line 90: hak_tiny_free(ptr) ← ROUTE TO TINY
3. hak_tiny_free() [hakmem_tiny_free.inc:246]
├─ Line 249: atomic_fetch_add(g_hak_tiny_free_calls) ← Atomic #2
├─ Line 252: hak_tiny_stats_poll()
├─ Line 253: tiny_debug_ring_record()
├─ Line 255-303: BENCH_SLL_ONLY fast path (optional)
├─ Line 306-366: Ultra mode fast path (optional)
├─ Line 372: ss = hak_super_lookup(ptr) ← LOOKUP #2 ⚠️ REDUNDANT!
├─ Line 373: if (ss && ss->magic == SUPERSLAB_MAGIC)
├─ Line 376-381: Validate size_class
└─ Line 430: hak_tiny_free_superslab(ptr, ss) ← 52.63% CPU HERE! 💀
4. hak_tiny_free_superslab() [tiny_superslab_free.inc.h:10] ← HOTSPOT
├─ Line 13: atomic_fetch_add(g_free_ss_enter) ← Atomic #3
├─ Line 14: ROUTE_MARK(16)
├─ Line 15: HAK_DBG_INC(g_superslab_free_count)
├─ Line 17: slab_idx = slab_index_for(ss, ptr) ← CALC #2 ⚠️
├─ Line 18-19: ss_size, ss_base calculations
├─ Line 20-25: Safety: slab_idx < 0 check
├─ Line 26: meta = &ss->slabs[slab_idx]
├─ Line 27-40: Watch point debug (if enabled)
├─ Line 42-46: Safety: validate size_class bounds
├─ Line 47-72: Safety: EXPENSIVE! ⚠️
│ ├─ Alignment check (delta % blk == 0)
│ ├─ Range check (delta / blk < capacity)
│ └─ Duplicate scan in freelist (up to 64 iterations!) ← 💀 O(n)
├─ Line 75: my_tid = tiny_self_u32() ← SYSCALL! ⚠️ 💀
├─ Line 79-81: Ownership claim (if owner_tid == 0)
├─ Line 82-157: SAME-THREAD PATH (owner_tid == my_tid)
│ ├─ Line 90-95: Safety: check used == 0
│ ├─ Line 96: tiny_remote_track_expect_alloc()
│ ├─ Line 97-112: Remote guard check (expensive!)
│ ├─ Line 114-131: MidTC bypass (optional)
│ ├─ Line 133-150: tiny_free_local_box() ← Freelist push
│ └─ Line 137-149: First-free publish logic
└─ Line 158-328: CROSS-THREAD PATH (owner_tid != my_tid)
├─ Line 175-229: Duplicate detection in remote queue ← 💀 O(n) EXPENSIVE!
│ ├─ Scan up to 64 nodes in remote stack
│ ├─ Sentinel checks (if g_remote_side_enable)
│ └─ Corruption detection
├─ Line 230-235: Safety: check used == 0
├─ Line 236-255: A/B gate for remote MPSC
└─ Line 256-302: ss_remote_push() ← MPSC push (atomic CAS)
5. tiny_free_local_box() [box/free_local_box.c:5]
├─ Line 6: atomic_fetch_add(g_free_local_box_calls) ← Atomic #4
├─ Line 12-26: Failfast validation (if level >= 2)
├─ Line 28: prev = meta->freelist ← Load
├─ Line 30-61: Freelist corruption debug (if level >= 2)
├─ Line 63: *(void**)ptr = prev ← Write #1
├─ Line 64: meta->freelist = ptr ← Write #2
├─ Line 67-75: Freelist corruption verification
├─ Line 77: tiny_failfast_log()
├─ Line 80: atomic_thread_fence(memory_order_release)← Memory barrier
├─ Line 83-93: Freelist mask update (optional)
├─ Line 96: tiny_remote_track_on_local_free()
├─ Line 97: meta->used-- ← Decrement
├─ Line 98: ss_active_dec_one(ss) ← CAS LOOP! ⚠️ 💀
└─ Line 100-103: First-free publish
6. ss_active_dec_one() [superslab_inline.h:162]
├─ Line 163: atomic_fetch_add(g_ss_active_dec_calls) ← Atomic #5
├─ Line 164: old = atomic_load(total_active_blocks) ← Atomic #6
└─ Line 165-169: CAS loop: ← CAS LOOP (contention in MT!)
       while (old != 0) {
         if (CAS(&total_active_blocks, old, old-1)) break;
       } ← Atomic #7+
7. ss_remote_push() [Cross-thread only] [superslab_inline.h:202]
├─ Line 203: atomic_fetch_add(g_ss_remote_push_calls) ← Atomic #N
├─ Line 215-233: Sanity checks (range, alignment)
├─ Line 258-266: MPSC CAS loop: ← CAS LOOP (contention!)
│      do {
│        old = atomic_load(&head, acquire); ← Atomic #N+1
│        *(void**)ptr = (void*)old;
│      } while (!CAS(&head, old, ptr)); ← Atomic #N+2+
└─ Line 267: tiny_remote_side_set()
```
---
## 2. EXPENSIVE OPERATIONS IDENTIFIED
### Critical Issues (Prioritized by Impact)
#### 🔴 **ISSUE #1: Duplicate SuperSlab Lookup (Lines hak_free_api:86 + hak_tiny_free:372)**
**Cost:** 2x registry lookup per free
**Location:**
- `hak_free_at()` line 86: `ss = hak_super_lookup(ptr)`
- `hak_tiny_free()` line 372: `ss = hak_super_lookup(ptr)` ← REDUNDANT!
**Why it's expensive:**
- `hak_super_lookup()` walks a registry or performs hash lookup
- Result is already known from first call
- Wastes CPU cycles and pollutes cache
**Fix:** Pass `ss` as a parameter from `hak_free_at()` to `hak_tiny_free()` (see Proposal #5)
---
#### 🔴 **ISSUE #2: Syscall in Hot Path (Line 75: tiny_self_u32())**
**Cost:** ~200-500 cycles per free
**Location:** `tiny_superslab_free.inc.h:75`
```c
uint32_t my_tid = tiny_self_u32(); // ← SYSCALL (gettid)!
```
**Why it's expensive:**
- Syscall overhead: 200-500 cycles (vs 1-2 for TLS read)
- Context switch to kernel mode
- Called on EVERY free (same-thread AND cross-thread)
**Fix:** Cache the TID in a TLS variable (like `g_hakmem_lock_depth`); see Proposal #2
---
#### 🔴 **ISSUE #3: Duplicate Scan in Freelist (Lines 64-71)**
**Cost:** O(n) scan, up to 64 iterations
**Location:** `tiny_superslab_free.inc.h:64-71`
```c
void* scan = meta->freelist; int scanned = 0; int dup = 0;
while (scan && scanned < 64) {
    if (scan == ptr) { dup = 1; break; }
    scan = *(void**)scan;
    scanned++;
}
```
**Why it's expensive:**
- O(n) complexity (up to 64 pointer chases)
- Cache misses (freelist nodes scattered in memory)
- Branch mispredictions (while loop, if statement)
- Only useful for debugging (catches double-free)
**Fix:** Move to a debug-only path (behind a `HAKMEM_SAFE_FREE` guard); see Proposal #3
---
#### 🔴 **ISSUE #4: Remote Queue Duplicate Scan (Lines 175-229)**
**Cost:** O(n) scan, up to 64 iterations + sentinel checks
**Location:** `tiny_superslab_free.inc.h:177-221`
```c
uintptr_t cur = atomic_load(&ss->remote_heads[slab_idx], acquire);
int scanned = 0; int dup = 0;
while (cur && scanned < 64) {
    if ((void*)cur == ptr) { dup = 1; break; }
    // ... sentinel checks ...
    cur = (uintptr_t)(*(void**)(void*)cur);
    scanned++;
}
```
**Why it's expensive:**
- O(n) scan of remote queue (up to 64 nodes)
- Atomic load + pointer chasing
- Sentinel validation (if enabled)
- Called on EVERY cross-thread free
**Fix:** Move to a debug-only path, or use a Bloom filter for a fast negative check (sketch below)
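A minimal sketch of the Bloom-filter idea, assuming a hypothetical per-slab 64-bit `remote_filter` word that `ss_remote_push()` would update. The field and helper names below are illustrative, not existing HAKMEM code:
```c
#include <stdint.h>

/* Map a pointer to one of 64 filter bits (splitmix64-style finalizer). */
static inline uint64_t remote_filter_bit(const void* ptr) {
    uint64_t h = (uint64_t)(uintptr_t)ptr;
    h ^= h >> 33; h *= 0xff51afd7ed558ccdULL; h ^= h >> 33;
    return 1ULL << (h & 63);
}

/* On remote push: mark the pointer's bucket (hypothetical hook). */
static inline void remote_filter_add(uint64_t* filter, const void* ptr) {
    *filter |= remote_filter_bit(ptr);
}

/* Before the O(n) duplicate scan: a clear bit proves ptr is NOT in the
 * remote queue, so the 64-node scan can be skipped (the common case).
 * A set bit may be a false positive, so fall through to the scan. */
static inline int remote_filter_maybe_contains(uint64_t filter, const void* ptr) {
    return (filter & remote_filter_bit(ptr)) != 0;
}
```
A false positive only costs the scan that runs unconditionally today; the filter must be reset whenever the remote queue is drained, or it saturates and the optimization degrades back to the status quo.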
---
#### 🔴 **ISSUE #5: CAS Loop on Every Free (ss_active_dec_one)**
**Cost:** 2-10 cycles (uncontended), 100+ cycles (contended)
**Location:** `superslab_inline.h:162-169`
```c
static inline void ss_active_dec_one(SuperSlab* ss) {
    atomic_fetch_add(&g_ss_active_dec_calls, 1, relaxed);               // ← Atomic #1
    uint32_t old = atomic_load(&ss->total_active_blocks, relaxed);      // ← Atomic #2
    while (old != 0) {
        if (CAS(&ss->total_active_blocks, &old, old-1, relaxed)) break; // ← CAS loop
    }
}
```
**Why it's expensive:**
- 3 atomic operations per free (fetch_add, load, CAS)
- CAS loop can retry multiple times under contention (MT scenario)
- Cache line ping-pong in multi-threaded workloads
**Fix:** Batch decrements (decrement by N when draining the remote queue); see Proposal #4
---
#### 🟡 **ISSUE #6: Multiple Atomic Increments for Diagnostics**
**Cost:** 5-7 atomic operations per free
**Locations:**
1. `hak_wrappers.inc.h:93` - `g_free_wrapper_calls`
2. `hakmem_tiny_free.inc:249` - `g_hak_tiny_free_calls`
3. `tiny_superslab_free.inc.h:13` - `g_free_ss_enter`
4. `free_local_box.c:6` - `g_free_local_box_calls`
5. `superslab_inline.h:163` - `g_ss_active_dec_calls`
6. `superslab_inline.h:203` - `g_ss_remote_push_calls` (cross-thread only)
**Why it's expensive:**
- Each atomic increment: 10-20 cycles
- Total: 50-100+ cycles per free (5-10% overhead)
- Only useful for diagnostics
**Fix:** Compile-time gate (`#if HAKMEM_DEBUG_COUNTERS`)
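A minimal sketch of such a gate; the `HAK_CTR_INC` macro and `HAKMEM_DEBUG_COUNTERS` flag are illustrative names, not existing HAKMEM identifiers:
```c
#include <stdatomic.h>

#if HAKMEM_DEBUG_COUNTERS
  /* Debug builds: keep full telemetry (relaxed increment, ~10-20 cycles). */
  #define HAK_CTR_INC(ctr) \
      atomic_fetch_add_explicit(&(ctr), 1, memory_order_relaxed)
#else
  /* Release builds: the increment compiles to nothing. */
  #define HAK_CTR_INC(ctr) ((void)0)
#endif

/* Example call site, replacing the raw atomic in the free wrapper:
 *     HAK_CTR_INC(g_free_wrapper_calls);
 */
```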
---
#### 🟡 **ISSUE #7: Environment Variable Checks (Even with Caching)**
**Cost:** First call: 1000+ cycles (getenv), Subsequent: 2-5 cycles (cached)
**Locations:**
- Line 106, 145: `HAKMEM_TINY_ROUTE_FREE`
- Line 117, 169: `HAKMEM_TINY_FREE_TO_SS`
- Line 313: `HAKMEM_TINY_FREELIST_MASK`
- Line 238, 249: `HAKMEM_TINY_DISABLE_REMOTE`
**Why it's expensive:**
- First call to getenv() is expensive (1000+ cycles)
- Branch on cached value still adds 1-2 cycles
- Multiple env vars = multiple branches
**Fix:** Consolidate env vars into one cached flag word, or use compile-time flags (sketch below)
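One way to consolidate, as a sketch: read all four variables once into a cached bitmask, so the hot path branches on a single cached load. The flag constants and `tiny_env_flags()` helper are hypothetical:
```c
#include <stdint.h>
#include <stdlib.h>

#define TINY_ENV_ROUTE_FREE     (1u << 0)   /* HAKMEM_TINY_ROUTE_FREE */
#define TINY_ENV_FREE_TO_SS     (1u << 1)   /* HAKMEM_TINY_FREE_TO_SS */
#define TINY_ENV_FREELIST_MASK  (1u << 2)   /* HAKMEM_TINY_FREELIST_MASK */
#define TINY_ENV_DISABLE_REMOTE (1u << 3)   /* HAKMEM_TINY_DISABLE_REMOTE */
#define TINY_ENV_INIT_DONE      (1u << 31)  /* marks the cache as populated */

static inline uint32_t tiny_env_flags(void) {
    static uint32_t flags = 0;
    if (__builtin_expect(!(flags & TINY_ENV_INIT_DONE), 0)) {
        uint32_t f = TINY_ENV_INIT_DONE;
        if (getenv("HAKMEM_TINY_ROUTE_FREE"))     f |= TINY_ENV_ROUTE_FREE;
        if (getenv("HAKMEM_TINY_FREE_TO_SS"))     f |= TINY_ENV_FREE_TO_SS;
        if (getenv("HAKMEM_TINY_FREELIST_MASK"))  f |= TINY_ENV_FREELIST_MASK;
        if (getenv("HAKMEM_TINY_DISABLE_REMOTE")) f |= TINY_ENV_DISABLE_REMOTE;
        flags = f;  /* a first-call race is benign: the result is idempotent */
    }
    return flags;
}
```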
---
#### 🟡 **ISSUE #8: Massive Function Size (330 lines)**
**Cost:** I-cache misses, branch mispredictions
**Location:** `tiny_superslab_free.inc.h:10-330`
**Why it's expensive:**
- 330 lines of code (vs 10-20 for System tcache)
- Many branches (if statements, while loops)
- Branch mispredictions: 10-20 cycles per miss
- I-cache misses: 100+ cycles
**Fix:** Extract fast path (10-15 lines) and delegate to slow path
---
## 3. COMPARISON WITH ALLOCATION FAST PATH
### Allocation (6.48% CPU) vs Free (52.63% CPU)
| Metric | Allocation (Box 5) | Free (Current) | Ratio |
|--------|-------------------|----------------|-------|
| **CPU Usage** | 6.48% | 52.63% | **8.1x slower** |
| **Function Size** | ~20 lines | 330 lines | 16.5x larger |
| **Atomic Ops** | 1 (TLS count decrement) | 5-7 (counters + CAS) | 5-7x more |
| **Syscalls** | 0 | 1 (gettid) | ∞ |
| **Lookups** | 0 (direct TLS) | 2 (SuperSlab) | ∞ |
| **O(n) Scans** | 0 | 2 (freelist + remote) | ∞ |
| **Branches** | 2-3 (head == NULL check) | 50+ (safety, guards, env vars) | 16-25x |
**Key Insight:** Allocation succeeds with **3-4 instructions** (Box 5 design), while free requires **330 lines** with multiple syscalls, atomics, and O(n) scans.
---
## 4. ROOT CAUSE ANALYSIS
### Why is Free 8x Slower than Alloc?
#### Allocation Design (Box 5 - Ultra-Simple Fast Path)
```c
// Box 5: tiny_alloc_fast_pop() [~10 lines, 3-4 instructions]
void* tiny_alloc_fast_pop(int class_idx) {
    void* ptr = g_tls_sll_head[class_idx];     // 1. Load TLS head
    if (!ptr) return NULL;                     // 2. NULL check
    g_tls_sll_head[class_idx] = *(void**)ptr;  // 3. Update head (pop)
    g_tls_sll_count[class_idx]--;              // 4. Decrement count
    return ptr;                                // 5. Return
}
// Assembly: ~5 instructions (mov, cmp, jz, mov, dec, ret)
```
#### Free Design (Current - Multi-Layer Complexity)
```c
// Current free path: 330 lines, 50+ branches, 5-7 atomics, 1 syscall
void hak_tiny_free_superslab(void* ptr, SuperSlab* ss) {
    // 1. Diagnostics (atomic increments) - 3 atomics
    // 2. Safety checks (alignment, range, duplicate scan) - 64 iterations
    // 3. Syscall (gettid) - 200-500 cycles
    // 4. Ownership check (my_tid == owner_tid)
    // 5. Remote guard checks (function calls, tracking)
    // 6. MidTC bypass (optional)
    // 7. Freelist push (2 writes + failfast validation)
    // 8. CAS loop (ss_active_dec_one) - contention
    // 9. First-free publish (if prev == NULL)
    // ... 300+ more lines
}
```
**Problem:** Free path was designed for **safety and diagnostics**, not **performance**.
---
## 5. CONCRETE OPTIMIZATION PROPOSALS
### 🏆 **Proposal #1: Extract Ultra-Simple Free Fast Path (Highest Priority)**
**Goal:** Match allocation's 3-4 instruction fast path
**Expected Impact:** -60-70% free() CPU (52.63% → 15-20%)
#### Implementation (Box 6 Enhancement)
```c
// tiny_free_ultra_fast.inc.h (NEW FILE)
// Ultra-simple free fast path (3-4 instructions, same-thread only)
static inline int tiny_free_ultra_fast(void* ptr, SuperSlab* ss, int slab_idx, uint32_t my_tid) {
    // PREREQUISITE: Caller MUST validate:
    //   1. ss != NULL && ss->magic == SUPERSLAB_MAGIC
    //   2. slab_idx >= 0 && slab_idx < capacity
    //   3. my_tid == current thread (cached in TLS)
    TinySlabMeta* meta = &ss->slabs[slab_idx];

    // Fast path: Same-thread check (TOCTOU-safe)
    uint32_t owner = atomic_load_explicit(&meta->owner_tid, memory_order_relaxed);
    if (__builtin_expect(owner != my_tid, 0)) {
        return 0;  // Cross-thread → delegate to slow path
    }

    // Fast path: Direct freelist push (2 writes)
    void* prev = meta->freelist;  // 1. Load prev
    *(void**)ptr = prev;          // 2. ptr->next = prev
    meta->freelist = ptr;         // 3. freelist = ptr

    // Accounting (owner-thread exclusive, no atomic needed)
    meta->used--;                 // 4. Decrement used

    // SKIP ss_active_dec_one() in fast path (batch update later)
    return 1;  // Success
}

// Assembly (x86-64, expected):
//   mov  eax, DWORD PTR [meta->owner_tid]  ; owner
//   cmp  eax, my_tid                       ; owner == my_tid?
//   jne  .slow_path                        ; if not, slow path
//   mov  rax, QWORD PTR [meta->freelist]   ; prev = freelist
//   mov  QWORD PTR [ptr], rax              ; ptr->next = prev
//   mov  QWORD PTR [meta->freelist], ptr   ; freelist = ptr
//   dec  DWORD PTR [meta->used]            ; used--
//   ret                                    ; done
// .slow_path:
//   xor  eax, eax
//   ret
```
#### Integration into hak_tiny_free_superslab()
```c
void hak_tiny_free_superslab(void* ptr, SuperSlab* ss) {
    // Cache TID in TLS (avoid syscall)
    static __thread uint32_t g_cached_tid = 0;
    if (__builtin_expect(g_cached_tid == 0, 0)) {
        g_cached_tid = tiny_self_u32();  // Initialize once per thread
    }
    uint32_t my_tid = g_cached_tid;

    int slab_idx = slab_index_for(ss, ptr);

    // FAST PATH: Ultra-simple free (3-4 instructions)
    if (__builtin_expect(tiny_free_ultra_fast(ptr, ss, slab_idx, my_tid), 1)) {
        return;  // Success: same-thread, pushed to freelist
    }

    // SLOW PATH: Cross-thread, safety checks, remote queue
    // ... existing 330 lines ...
}
```
**Benefits:**
- **Same-thread free:** 3-4 instructions (vs 330 lines)
- **No syscall** (TID cached in TLS)
- **No atomic RMW** in fast path (one relaxed load of `owner_tid`; `meta->used` is owner-thread exclusive)
- **No safety checks** in fast path (delegate to slow path)
- **Branch prediction friendly** (same-thread is common case)
**Trade-offs:**
- Skip `ss_active_dec_one()` in fast path (batch update in background thread)
- Skip safety checks in fast path (only in slow path / debug mode)
---
### 🏆 **Proposal #2: Cache TID in TLS (Quick Win)**
**Goal:** Eliminate syscall overhead
**Expected Impact:** -5-10% free() CPU
```c
// hakmem_tiny.c (or core header)
__thread uint32_t g_cached_tid = 0; // TLS cache for thread ID
static inline uint32_t tiny_self_u32_cached(void) {
    if (__builtin_expect(g_cached_tid == 0, 0)) {
        g_cached_tid = tiny_self_u32();  // Initialize once per thread
    }
    return g_cached_tid;
}
```
**Change:** Replace all `tiny_self_u32()` calls with `tiny_self_u32_cached()`
**Benefits:**
- **Syscall elimination:** 0 syscalls (vs 1 per free)
- **TLS read:** 1-2 cycles (vs 200-500 for gettid)
- **Easy to implement:** 1-line change
---
### 🏆 **Proposal #3: Move Safety Checks to Debug-Only Path**
**Goal:** Remove O(n) scans from hot path
**Expected Impact:** -10-15% free() CPU
```c
#if HAKMEM_SAFE_FREE
    // Duplicate scan in freelist (lines 64-71)
    void* scan = meta->freelist; int scanned = 0; int dup = 0;
    while (scan && scanned < 64) { ... }

    // Remote queue duplicate scan (lines 175-229)
    uintptr_t cur = atomic_load(&ss->remote_heads[slab_idx], acquire);
    while (cur && scanned < 64) { ... }
#endif
```
**Benefits:**
- **Production builds:** No O(n) scans (0 cycles)
- **Debug builds:** Full safety checks (detect double-free)
- **Easy toggle:** `HAKMEM_SAFE_FREE=0` for benchmarks
---
### 🏆 **Proposal #4: Batch ss_active_dec_one() Updates**
**Goal:** Reduce atomic contention
**Expected Impact:** -5-10% free() CPU (MT), -2-5% (ST)
```c
// Instead of: ss_active_dec_one(ss) on every free
// Do: Batch decrement when draining remote queue or TLS cache
void tiny_free_ultra_fast(...) {
    // ... freelist push ...
    meta->used--;
    // SKIP: ss_active_dec_one(ss); ← Defer to batch update
}

// Background thread or refill path:
void batch_active_update(SuperSlab* ss) {
    uint32_t total_freed = 0;
    for (int i = 0; i < 32; i++) {
        total_freed += (ss->slabs[i].capacity - ss->slabs[i].used);
    }
    atomic_fetch_sub(&ss->total_active_blocks, total_freed, relaxed);
}
```
**Benefits:**
- **Fewer atomics:** 1 atomic per batch (vs N per free)
- **Less contention:** Batch updates are rare
- **Amortized cost:** O(1) amortized
---
### 🏆 **Proposal #5: Eliminate Redundant SuperSlab Lookup**
**Goal:** Remove duplicate lookup
**Expected Impact:** -2-5% free() CPU
```c
// hak_free_at() - pass ss to hak_tiny_free()
void hak_free_at(void* ptr, size_t size, hak_callsite_t site) {
    SuperSlab* ss = hak_super_lookup(ptr);  // ← Lookup #1
    if (ss && ss->magic == SUPERSLAB_MAGIC) {
        hak_tiny_free_with_ss(ptr, ss);     // ← Pass ss (avoid lookup #2)
        return;
    }
    // ... fallback paths ...
}

// NEW: hak_tiny_free_with_ss() - skip second lookup
void hak_tiny_free_with_ss(void* ptr, SuperSlab* ss) {
    // SKIP: ss = hak_super_lookup(ptr); ← Lookup #2 (redundant!)
    hak_tiny_free_superslab(ptr, ss);
}
```
**Benefits:**
- **1 lookup:** vs 2 (50% reduction)
- **Cache friendly:** Reuse ss pointer
- **Easy change:** Add new function variant
---
## 6. PERFORMANCE PROJECTIONS
### Current Baseline
- **Free CPU:** 52.63%
- **Alloc CPU:** 6.48%
- **Ratio:** 8.1x slower
### After All Optimizations
| Optimization | CPU Reduction | Cumulative CPU |
|--------------|---------------|----------------|
| **Baseline** | - | 52.63% |
| #1: Ultra-Fast Path | -60% | **21.05%** |
| #2: TID Cache | -5% | **20.00%** |
| #3: Safety → Debug | -10% | **18.00%** |
| #4: Batch Active | -5% | **17.10%** |
| #5: Skip Lookup | -2% | **16.76%** |
**Final Target:** 16.76% CPU (vs 52.63% baseline)
**Improvement:** **-68% CPU reduction**
**New Ratio:** 2.6x slower than alloc (vs 8.1x)
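(Each row applies its reduction to the running total: 52.63% × 0.40 × 0.95 × 0.90 × 0.95 × 0.98 ≈ 16.76%, and 16.76% / 6.48% ≈ 2.6x yields the new free/alloc ratio.)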
### Expected Throughput Gain
- **Current:** 1,046,392 ops/s
- **Projected:** 3,200,000 ops/s (+206%)
- **vs System:** 56,336,790 ops/s (still 17x slower, but improved from 53x)
---
## 7. IMPLEMENTATION ROADMAP
### Phase 1: Quick Wins (1-2 days)
1. **TID Cache** (Proposal #2) - 1 hour
2. **Eliminate Redundant Lookup** (Proposal #5) - 2 hours
3. **Move Safety to Debug** (Proposal #3) - 1 hour
**Expected:** -15-20% CPU reduction
### Phase 2: Fast Path Extraction (3-5 days)
1. **Extract Ultra-Fast Free** (Proposal #1) - 2 days
2. **Integrate with Box 6** - 1 day
3. **Testing & Validation** - 1 day
**Expected:** -60% CPU reduction (cumulative: ~-66%)
### Phase 3: Advanced (1-2 weeks)
1. ⚠️ **Batch Active Updates** (Proposal #4) - 3 days
2. ⚠️ **Inline Fast Path** - 1 day
3. ⚠️ **Profile & Tune** - 2 days
**Expected:** -5% CPU reduction (final: -68%)
---
## 8. COMPARISON WITH SYSTEM MALLOC
### System malloc (tcache) Free Path (estimated)
```c
// glibc tcache_put() [~15 instructions]
void tcache_put(void* ptr, size_t tc_idx) {
    tcache_entry* e = (tcache_entry*)ptr;
    e->next = tcache->entries[tc_idx];  // 1. ptr->next = head
    tcache->entries[tc_idx] = e;        // 2. head = ptr
    ++tcache->counts[tc_idx];           // 3. count++
}
// Assembly: ~10 instructions (mov, mov, inc, ret)
```
**Why System malloc is faster:**
1. **No ownership check** (single-threaded tcache)
2. **No safety checks** (assumes valid pointer)
3. **No atomic operations** (TLS-local)
4. **No syscalls** (no TID lookup)
5. **Tiny code size** (~15 instructions)
**HAKMEM Gap Analysis:**
- Current: 330 lines vs 15 instructions (**22x code bloat**)
- After optimization: ~20 lines vs 15 instructions (**1.3x**, acceptable)
---
## 9. RISK ASSESSMENT
### Proposal #1 (Ultra-Fast Path)
**Risk:** 🟢 Low
**Reason:** Isolated fast path, delegates to slow path on failure
**Mitigation:** Keep slow path unchanged for safety
### Proposal #2 (TID Cache)
**Risk:** 🟢 Very Low
**Reason:** TLS variable, no shared state
**Mitigation:** Initialize once per thread
### Proposal #3 (Safety → Debug)
**Risk:** 🟡 Medium
**Reason:** Removes double-free detection in production
**Mitigation:** Keep enabled for debug builds, add compile-time flag
### Proposal #4 (Batch Active)
**Risk:** 🟡 Medium
**Reason:** Changes accounting semantics (delayed updates)
**Mitigation:** Thorough testing, fallback to per-free if issues
### Proposal #5 (Skip Lookup)
**Risk:** 🟢 Low
**Reason:** Pure optimization, no semantic change
**Mitigation:** Validate ss pointer is passed correctly
---
## 10. CONCLUSION
### Key Findings
1. **Free is 8x slower than alloc** (52.63% vs 6.48% CPU)
2. **Root cause:** Safety-first design (330 lines vs 3-4 instructions)
3. **Top bottlenecks:**
   - Syscall overhead (gettid)
   - O(n) duplicate scans (freelist + remote queue)
   - Redundant SuperSlab lookups
   - Atomic contention (ss_active_dec_one)
   - Diagnostic counters (5-7 atomics)
### Recommended Action Plan
**Priority 1 (Do Now):**
- **TID Cache** - 1 hour, -5% CPU
- **Skip Redundant Lookup** - 2 hours, -2% CPU
- **Safety → Debug Mode** - 1 hour, -10% CPU
**Priority 2 (This Week):**
- **Ultra-Fast Path** - 2 days, -60% CPU
**Priority 3 (Future):**
- ⚠️ **Batch Active Updates** - 3 days, -5% CPU
### Expected Outcome
- **CPU Reduction:** -68% (52.63% → 16.76%)
- **Throughput Gain:** +206% (1.04M → 3.2M ops/s)
- **Code Quality:** Cleaner separation (fast/slow paths)
- **Maintainability:** Safety checks isolated to debug mode
### Next Steps
1. **Review this analysis** with team
2. **Implement Priority 1** (TID cache, skip lookup, safety guards)
3. **Benchmark results** (validate -15-20% reduction)
4. **Proceed to Priority 2** (ultra-fast path extraction)
---
**END OF ULTRATHINK ANALYSIS**