# FREE PATH ULTRATHINK ANALYSIS

**Date:** 2025-11-08
**Performance Hotspot:** `hak_tiny_free_superslab` consuming 52.63% CPU
**Benchmark:** 1,046,392 ops/s (53x slower than System malloc's 56,336,790 ops/s)

---

## Executive Summary

The free() path in HAKMEM is **8x slower than allocation** (52.63% vs 6.48% CPU) due to:

1. **Multiple redundant lookups** (SuperSlab lookup called twice)
2. **Massive function size** (330 lines with many branches)
3. **Expensive safety checks** in the hot path (duplicate scans, alignment checks)
4. **Atomic contention** (CAS loops on every free)
5. **Syscall overhead** (TID lookup on every free)

**Root Cause:** The free path was designed for safety and diagnostics, not performance. It lacks the "ultra-simple fast path" design that made allocation fast (Box 5).
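For context on the headline numbers: the report does not include the benchmark harness itself, but a minimal alloc/free loop of the kind that produces ops/s figures like these looks roughly like the sketch below. The 128-byte size, iteration count, and library name are illustrative assumptions, not the actual benchmark parameters.

```c
// Minimal ops/s harness sketch (assumed parameters, not the real benchmark).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const long ITERS = 10 * 1000 * 1000;   // illustrative iteration count
    const size_t SIZE = 128;               // illustrative tiny-class size
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++) {
        void* p = malloc(SIZE);            // routed to HAKMEM via LD_PRELOAD
        if (!p) return 1;
        *(volatile char*)p = 0;            // keep the pair from being elided
        free(p);                           // the path profiled in this report
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f alloc+free pairs/s\n", ITERS / sec);
    return 0;
}
```

Running the same binary with and without `LD_PRELOAD=./libhakmem.so` (hypothetical library name) gives an apples-to-apples comparison against system malloc.

---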
## 1. CALL CHAIN ANALYSIS

### Complete Free Path (User → Kernel)

```
User free(ptr)
  ↓
1. free() wrapper [hak_wrappers.inc.h:92]
   ├─ Line 93:  atomic_fetch_add(g_free_wrapper_calls)      ← Atomic #1
   ├─ Line 94:  if (!ptr) return
   ├─ Line 95:  if (g_hakmem_lock_depth > 0) → libc
   ├─ Line 96:  if (g_initializing) → libc
   ├─ Line 97:  if (hak_force_libc_alloc()) → libc
   ├─ Line 98-102: LD_PRELOAD checks
   ├─ Line 103: g_hakmem_lock_depth++                       ← TLS write #1
   ├─ Line 104: hak_free_at(ptr, 0, HAK_CALLSITE())         ← MAIN ENTRY
   └─ Line 105: g_hakmem_lock_depth--

2. hak_free_at() [hak_free_api.inc.h:64]
   ├─ Line 78: static int s_free_to_ss (getenv cache)
   ├─ Line 86: ss = hak_super_lookup(ptr)                   ← LOOKUP #1 ⚠️
   ├─ Line 87: if (ss->magic == SUPERSLAB_MAGIC)
   ├─ Line 88: slab_idx = slab_index_for(ss, ptr)           ← CALC #1
   ├─ Line 89: if (sidx >= 0 && sidx < cap)
   └─ Line 90: hak_tiny_free(ptr)                           ← ROUTE TO TINY

3. hak_tiny_free() [hakmem_tiny_free.inc:246]
   ├─ Line 249: atomic_fetch_add(g_hak_tiny_free_calls)     ← Atomic #2
   ├─ Line 252: hak_tiny_stats_poll()
   ├─ Line 253: tiny_debug_ring_record()
   ├─ Line 255-303: BENCH_SLL_ONLY fast path (optional)
   ├─ Line 306-366: Ultra mode fast path (optional)
   ├─ Line 372: ss = hak_super_lookup(ptr)                  ← LOOKUP #2 ⚠️ REDUNDANT!
   ├─ Line 373: if (ss && ss->magic == SUPERSLAB_MAGIC)
   ├─ Line 376-381: Validate size_class
   └─ Line 430: hak_tiny_free_superslab(ptr, ss)            ← 52.63% CPU HERE! 💀

4. hak_tiny_free_superslab() [tiny_superslab_free.inc.h:10] ← HOTSPOT
   ├─ Line 13: atomic_fetch_add(g_free_ss_enter)            ← Atomic #3
   ├─ Line 14: ROUTE_MARK(16)
   ├─ Line 15: HAK_DBG_INC(g_superslab_free_count)
   ├─ Line 17: slab_idx = slab_index_for(ss, ptr)           ← CALC #2 ⚠️
   ├─ Line 18-19: ss_size, ss_base calculations
   ├─ Line 20-25: Safety: slab_idx < 0 check
   ├─ Line 26: meta = &ss->slabs[slab_idx]
   ├─ Line 27-40: Watch point debug (if enabled)
   ├─ Line 42-46: Safety: validate size_class bounds
   ├─ Line 47-72: Safety: EXPENSIVE! ⚠️
   │   ├─ Alignment check (delta % blk == 0)
   │   ├─ Range check (delta / blk < capacity)
   │   └─ Duplicate scan in freelist (up to 64 iterations!)  ← 💀 O(n)
   ├─ Line 75: my_tid = tiny_self_u32()                     ← SYSCALL! ⚠️ 💀
   ├─ Line 79-81: Ownership claim (if owner_tid == 0)
   ├─ Line 82-157: SAME-THREAD PATH (owner_tid == my_tid)
   │   ├─ Line 90-95: Safety: check used == 0
   │   ├─ Line 96: tiny_remote_track_expect_alloc()
   │   ├─ Line 97-112: Remote guard check (expensive!)
   │   ├─ Line 114-131: MidTC bypass (optional)
   │   ├─ Line 133-150: tiny_free_local_box()               ← Freelist push
   │   └─ Line 137-149: First-free publish logic
   └─ Line 158-328: CROSS-THREAD PATH (owner_tid != my_tid)
       ├─ Line 175-229: Duplicate detection in remote queue ← 💀 O(n) EXPENSIVE!
       │   ├─ Scan up to 64 nodes in remote stack
       │   ├─ Sentinel checks (if g_remote_side_enable)
       │   └─ Corruption detection
       ├─ Line 230-235: Safety: check used == 0
       ├─ Line 236-255: A/B gate for remote MPSC
       └─ Line 256-302: ss_remote_push()                    ← MPSC push (atomic CAS)

5. tiny_free_local_box() [box/free_local_box.c:5]
   ├─ Line 6: atomic_fetch_add(g_free_local_box_calls)      ← Atomic #4
   ├─ Line 12-26: Failfast validation (if level >= 2)
   ├─ Line 28: prev = meta->freelist                        ← Load
   ├─ Line 30-61: Freelist corruption debug (if level >= 2)
   ├─ Line 63: *(void**)ptr = prev                          ← Write #1
   ├─ Line 64: meta->freelist = ptr                         ← Write #2
   ├─ Line 67-75: Freelist corruption verification
   ├─ Line 77: tiny_failfast_log()
   ├─ Line 80: atomic_thread_fence(memory_order_release)    ← Memory barrier
   ├─ Line 83-93: Freelist mask update (optional)
   ├─ Line 96: tiny_remote_track_on_local_free()
   ├─ Line 97: meta->used--                                 ← Decrement
   ├─ Line 98: ss_active_dec_one(ss)                        ← CAS LOOP! ⚠️ 💀
   └─ Line 100-103: First-free publish

6. ss_active_dec_one() [superslab_inline.h:162]
   ├─ Line 163: atomic_fetch_add(g_ss_active_dec_calls)     ← Atomic #5
   ├─ Line 164: old = atomic_load(total_active_blocks)      ← Atomic #6
   └─ Line 165-169: CAS loop:                               ← CAS LOOP (contention in MT!)
         while (old != 0) {
           if (CAS(&total_active_blocks, old, old-1)) break; ← Atomic #7+
         }

7. ss_remote_push() [Cross-thread only] [superslab_inline.h:202]
   ├─ Line 203: atomic_fetch_add(g_ss_remote_push_calls)    ← Atomic #N
   ├─ Line 215-233: Sanity checks (range, alignment)
   ├─ Line 258-266: MPSC CAS loop:                          ← CAS LOOP (contention!)
   │     do {
   │       old = atomic_load(&head, acquire);               ← Atomic #N+1
   │       *(void**)ptr = (void*)old;
   │     } while (!CAS(&head, old, ptr));                   ← Atomic #N+2+
   └─ Line 267: tiny_remote_side_set()
```

---

## 2. EXPENSIVE OPERATIONS IDENTIFIED

### Critical Issues (Prioritized by Impact)

#### 🔴 **ISSUE #1: Duplicate SuperSlab Lookup (Lines hak_free_api:86 + hak_tiny_free:372)**

**Cost:** 2x registry lookup per free

**Location:**
- `hak_free_at()` line 86: `ss = hak_super_lookup(ptr)`
- `hak_tiny_free()` line 372: `ss = hak_super_lookup(ptr)` ← REDUNDANT!

**Why it's expensive:**
- `hak_super_lookup()` walks a registry or performs a hash lookup
- The result is already known from the first call
- Wastes CPU cycles and pollutes the cache

**Fix:** Pass `ss` as a parameter from `hak_free_at()` to `hak_tiny_free()` (see Proposal #5)

---

#### 🔴 **ISSUE #2: Syscall in Hot Path (Line 75: tiny_self_u32())**

**Cost:** ~200-500 cycles per free

**Location:** `tiny_superslab_free.inc.h:75`

```c
uint32_t my_tid = tiny_self_u32();  // ← SYSCALL (gettid)!
```
**Why it's expensive:**
- Syscall overhead: 200-500 cycles (vs 1-2 for a TLS read)
- Context switch to kernel mode
- Called on EVERY free (same-thread AND cross-thread)

**Fix:** Cache the TID in a TLS variable (like `g_hakmem_lock_depth`; see Proposal #2)

---

#### 🔴 **ISSUE #3: Duplicate Scan in Freelist (Lines 64-71)**

**Cost:** O(n) scan, up to 64 iterations

**Location:** `tiny_superslab_free.inc.h:64-71`

```c
void* scan = meta->freelist;
int scanned = 0;
int dup = 0;
while (scan && scanned < 64) {
    if (scan == ptr) { dup = 1; break; }
    scan = *(void**)scan;
    scanned++;
}
```

**Why it's expensive:**
- O(n) complexity (up to 64 pointer chases)
- Cache misses (freelist nodes scattered in memory)
- Branch mispredictions (while loop, if statement)
- Only useful for debugging (catches double-free)

**Fix:** Move to a debug-only path (behind a `HAKMEM_SAFE_FREE` guard)

---

#### 🔴 **ISSUE #4: Remote Queue Duplicate Scan (Lines 175-229)**

**Cost:** O(n) scan, up to 64 iterations + sentinel checks

**Location:** `tiny_superslab_free.inc.h:177-221`

```c
uintptr_t cur = atomic_load(&ss->remote_heads[slab_idx], acquire);
int scanned = 0;
int dup = 0;
while (cur && scanned < 64) {
    if ((void*)cur == ptr) { dup = 1; break; }
    // ... sentinel checks ...
    cur = (uintptr_t)(*(void**)(void*)cur);
    scanned++;
}
```

**Why it's expensive:**
- O(n) scan of the remote queue (up to 64 nodes)
- Atomic load + pointer chasing
- Sentinel validation (if enabled)
- Called on EVERY cross-thread free

**Fix:** Move to a debug-only path, or use a Bloom filter for a fast negative check (sketch below)
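The Bloom-filter idea is not implemented anywhere yet; the following is a minimal sketch of how a one-word-per-slab filter could answer "definitely not in the remote queue" in O(1) before falling back to the full scan. All names here (`bloom_bit`, `bloom_add`, `bloom_maybe_contains`) are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

// Hypothetical one-word Bloom filter per slab, covering remote-queue nodes.
// The filter is ONLY a fast negative check: bit clear = ptr definitely not
// queued (skip the 64-node walk); bit set = fall through to the full scan.
static inline uint64_t bloom_bit(const void* ptr) {
    uintptr_t x = (uintptr_t)ptr >> 4;   // drop alignment bits
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;          // splitmix64-style mixer
    x ^= x >> 33;
    return 1ULL << (x & 63);             // select 1 of 64 bits
}

// Called from ss_remote_push(): atomic OR so concurrent pushers never lose bits.
static inline void bloom_add(_Atomic uint64_t* filter, const void* ptr) {
    atomic_fetch_or_explicit(filter, bloom_bit(ptr), memory_order_relaxed);
}

// Called before the duplicate scan: a clear bit proves absence.
static inline int bloom_maybe_contains(_Atomic uint64_t* filter, const void* ptr) {
    uint64_t f = atomic_load_explicit(filter, memory_order_relaxed);
    return (f & bloom_bit(ptr)) != 0;    // 0 = definitely absent
}
```

The filter may only be reset when the remote queue is observed empty (e.g. right after a drain), otherwise a stale clear bit would suppress a legitimate duplicate check. With a single 64-bit word the false-positive rate grows as the queue fills, so the win is on the common no-duplicate path; the existing scan remains the authority.

---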
#### 🔴 **ISSUE #5: CAS Loop on Every Free (ss_active_dec_one)**

**Cost:** 2-10 cycles (uncontended), 100+ cycles (contended)

**Location:** `superslab_inline.h:162-169`

```c
static inline void ss_active_dec_one(SuperSlab* ss) {
    atomic_fetch_add(&g_ss_active_dec_calls, 1, relaxed);               // ← Atomic #1
    uint32_t old = atomic_load(&ss->total_active_blocks, relaxed);      // ← Atomic #2
    while (old != 0) {
        if (CAS(&ss->total_active_blocks, &old, old-1, relaxed)) break; // ← CAS loop
    }
}
```

**Why it's expensive:**
- 3 atomic operations per free (fetch_add, load, CAS)
- The CAS loop can retry multiple times under contention (MT scenario)
- Cache line ping-pong in multi-threaded workloads

**Fix:** Batch decrements (decrement by N when draining the remote queue; see Proposal #4)

---

#### 🟡 **ISSUE #6: Multiple Atomic Increments for Diagnostics**

**Cost:** 5-7 atomic operations per free

**Locations:**
1. `hak_wrappers.inc.h:93` - `g_free_wrapper_calls`
2. `hakmem_tiny_free.inc:249` - `g_hak_tiny_free_calls`
3. `tiny_superslab_free.inc.h:13` - `g_free_ss_enter`
4. `free_local_box.c:6` - `g_free_local_box_calls`
5. `superslab_inline.h:163` - `g_ss_active_dec_calls`
6. `superslab_inline.h:203` - `g_ss_remote_push_calls` (cross-thread only)

**Why it's expensive:**
- Each atomic increment: 10-20 cycles
- Total: 50-100+ cycles per free (5-10% overhead)
- Only useful for diagnostics

**Fix:** Compile-time gate (`#if HAKMEM_DEBUG_COUNTERS`)

---

#### 🟡 **ISSUE #7: Environment Variable Checks (Even with Caching)**

**Cost:** First call: 1000+ cycles (getenv); subsequent calls: 2-5 cycles (cached)

**Locations:**
- Line 106, 145: `HAKMEM_TINY_ROUTE_FREE`
- Line 117, 169: `HAKMEM_TINY_FREE_TO_SS`
- Line 313: `HAKMEM_TINY_FREELIST_MASK`
- Line 238, 249: `HAKMEM_TINY_DISABLE_REMOTE`

**Why it's expensive:**
- The first call to getenv() is expensive (1000+ cycles)
- A branch on the cached value still adds 1-2 cycles
- Multiple env vars = multiple branches

**Fix:** Consolidate env vars or use compile-time flags (sketch below)
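Neither the counter gate nor the env-var consolidation exists yet; a minimal sketch of both patterns follows. `HAKMEM_DEBUG_COUNTERS` is the flag the Issue #6 fix names; `HAK_COUNT` and `hak_env_flag_cached` are hypothetical helpers illustrating the pattern.

```c
#include <stdatomic.h>
#include <stdlib.h>

// Issue #6 fix: diagnostic counters compile to nothing in release builds,
// so the 5-7 atomic increments per free disappear entirely.
#if HAKMEM_DEBUG_COUNTERS
#define HAK_COUNT(ctr) atomic_fetch_add_explicit(&(ctr), 1, memory_order_relaxed)
#else
#define HAK_COUNT(ctr) ((void)0)
#endif

// Issue #7 fix: one getenv() per flag per process, cached in a plain int.
// -1 = not yet read; 0/1 = cached result. (Benign race: every thread
// computes the same value, so a duplicated first read is harmless.)
static inline int hak_env_flag_cached(int* cache, const char* name, int dflt) {
    int v = *cache;
    if (v < 0) {
        const char* s = getenv(name);
        v = s ? (s[0] != '0') : dflt;
        *cache = v;
    }
    return v;
}

// Usage at a call site:
//   static int s_route_free = -1;
//   if (hak_env_flag_cached(&s_route_free, "HAKMEM_TINY_ROUTE_FREE", 1)) { ... }
```

---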
#### 🟡 **ISSUE #8: Massive Function Size (330 lines)**

**Cost:** I-cache misses, branch mispredictions

**Location:** `tiny_superslab_free.inc.h:10-330`

**Why it's expensive:**
- 330 lines of code (vs 10-20 for System tcache)
- Many branches (if statements, while loops)
- Branch mispredictions: 10-20 cycles per miss
- I-cache misses: 100+ cycles

**Fix:** Extract a fast path (10-15 lines) and delegate to a slow path

---

## 3. COMPARISON WITH ALLOCATION FAST PATH

### Allocation (6.48% CPU) vs Free (52.63% CPU)

| Metric | Allocation (Box 5) | Free (Current) | Ratio |
|--------|-------------------|----------------|-------|
| **CPU Usage** | 6.48% | 52.63% | **8.1x slower** |
| **Function Size** | ~20 lines | 330 lines | 16.5x larger |
| **Atomic Ops** | 1 (TLS count decrement) | 5-7 (counters + CAS) | 5-7x more |
| **Syscalls** | 0 | 1 (gettid) | ∞ |
| **Lookups** | 0 (direct TLS) | 2 (SuperSlab) | ∞ |
| **O(n) Scans** | 0 | 2 (freelist + remote) | ∞ |
| **Branches** | 2-3 (head == NULL check) | 50+ (safety, guards, env vars) | 16-25x |

**Key Insight:** Allocation succeeds with **3-4 instructions** (Box 5 design), while free requires **330 lines** with multiple syscalls, atomics, and O(n) scans.

---

## 4. ROOT CAUSE ANALYSIS

### Why is Free 8x Slower than Alloc?

#### Allocation Design (Box 5 - Ultra-Simple Fast Path)

```c
// Box 5: tiny_alloc_fast_pop() [~10 lines, 3-4 instructions]
void* tiny_alloc_fast_pop(int class_idx) {
    void* ptr = g_tls_sll_head[class_idx];     // 1. Load TLS head
    if (!ptr) return NULL;                     // 2. NULL check
    g_tls_sll_head[class_idx] = *(void**)ptr;  // 3. Update head (pop)
    g_tls_sll_count[class_idx]--;              // 4. Decrement count
    return ptr;                                // 5. Return
}
// Assembly: ~5 instructions (mov, cmp, jz, mov, dec, ret)
```

#### Free Design (Current - Multi-Layer Complexity)

```c
// Current free path: 330 lines, 50+ branches, 5-7 atomics, 1 syscall
void hak_tiny_free_superslab(void* ptr, SuperSlab* ss) {
    // 1. Diagnostics (atomic increments)                  - 3 atomics
    // 2. Safety checks (alignment, range, duplicate scan) - 64 iterations
    // 3. Syscall (gettid)                                 - 200-500 cycles
    // 4. Ownership check (my_tid == owner_tid)
    // 5. Remote guard checks (function calls, tracking)
    // 6. MidTC bypass (optional)
    // 7. Freelist push (2 writes + failfast validation)
    // 8. CAS loop (ss_active_dec_one)                     - contention
    // 9. First-free publish (if prev == NULL)
    // ... 300+ more lines
}
```

**Problem:** The free path was designed for **safety and diagnostics**, not **performance**.

---

## 5. CONCRETE OPTIMIZATION PROPOSALS

### 🏆 **Proposal #1: Extract Ultra-Simple Free Fast Path (Highest Priority)**

**Goal:** Match allocation's 3-4 instruction fast path
**Expected Impact:** -60-70% free() CPU (52.63% → 15-20%)

#### Implementation (Box 6 Enhancement)

```c
// tiny_free_ultra_fast.inc.h (NEW FILE)
// Ultra-simple free fast path (3-4 instructions, same-thread only)
static inline int tiny_free_ultra_fast(void* ptr, SuperSlab* ss,
                                       int slab_idx, uint32_t my_tid) {
    // PREREQUISITE: Caller MUST validate:
    //   1. ss != NULL && ss->magic == SUPERSLAB_MAGIC
    //   2. slab_idx >= 0 && slab_idx < capacity
    //   3. my_tid == current thread (cached in TLS)
    TinySlabMeta* meta = &ss->slabs[slab_idx];

    // Fast path: same-thread check (TOCTOU-safe)
    uint32_t owner = atomic_load_explicit(&meta->owner_tid, memory_order_relaxed);
    if (__builtin_expect(owner != my_tid, 0)) {
        return 0;  // Cross-thread → delegate to slow path
    }

    // Fast path: direct freelist push (2 writes)
    void* prev = meta->freelist;  // 1. Load prev
    *(void**)ptr = prev;          // 2. ptr->next = prev
    meta->freelist = ptr;         // 3. freelist = ptr

    // Accounting (owner thread only, so a plain store suffices — no atomic)
    meta->used--;                 // 4. Decrement used

    // SKIP ss_active_dec_one() in the fast path (batch update later)
    return 1;  // Success
}

// Assembly (x86-64, expected):
//   mov   eax, DWORD PTR [meta->owner_tid]  ; owner
//   cmp   eax, my_tid                       ; owner == my_tid?
//   jne   .slow_path                        ; if not, slow path
//   mov   rax, QWORD PTR [meta->freelist]   ; prev = freelist
//   mov   QWORD PTR [ptr], rax              ; ptr->next = prev
//   mov   QWORD PTR [meta->freelist], ptr   ; freelist = ptr
//   dec   DWORD PTR [meta->used]            ; used--
//   ret                                     ; done
// .slow_path:
//   xor   eax, eax
//   ret
```

#### Integration into hak_tiny_free_superslab()

```c
void hak_tiny_free_superslab(void* ptr, SuperSlab* ss) {
    // Cache TID in TLS (avoid syscall)
    static __thread uint32_t g_cached_tid = 0;
    if (__builtin_expect(g_cached_tid == 0, 0)) {
        g_cached_tid = tiny_self_u32();  // Initialize once per thread
    }
    uint32_t my_tid = g_cached_tid;

    int slab_idx = slab_index_for(ss, ptr);

    // FAST PATH: ultra-simple free (3-4 instructions)
    if (__builtin_expect(tiny_free_ultra_fast(ptr, ss, slab_idx, my_tid), 1)) {
        return;  // Success: same-thread, pushed to freelist
    }

    // SLOW PATH: cross-thread, safety checks, remote queue
    // ... existing 330 lines ...
}
```

**Benefits:**
- **Same-thread free:** 3-4 instructions (vs 330 lines)
- **No syscall** (TID cached in TLS)
- **No atomics in the fast path** (`meta->used` is written only by the owner thread)
- **No safety checks in the fast path** (delegated to the slow path)
- **Branch-prediction friendly** (same-thread is the common case)

**Trade-offs:**
- Skips `ss_active_dec_one()` in the fast path (batch update in a background thread)
- Skips safety checks in the fast path (only in the slow path / debug mode)

---

### 🏆 **Proposal #2: Cache TID in TLS (Quick Win)**

**Goal:** Eliminate syscall overhead
**Expected Impact:** -5-10% free() CPU

```c
// hakmem_tiny.c (or core header)
__thread uint32_t g_cached_tid = 0;  // TLS cache for thread ID

static inline uint32_t tiny_self_u32_cached(void) {
    if (__builtin_expect(g_cached_tid == 0, 0)) {
        g_cached_tid = tiny_self_u32();  // Initialize once per thread
    }
    return g_cached_tid;
}
```

**Change:** Replace all `tiny_self_u32()` calls with `tiny_self_u32_cached()`

**Benefits:**
- **Syscall elimination:** 0 syscalls (vs 1 per free)
- **TLS read:** 1-2 cycles (vs 200-500 for gettid)
- **Easy to implement:** 1-line change

---

### 🏆 **Proposal #3: Move Safety Checks to Debug-Only Path**

**Goal:** Remove O(n) scans from the hot path
**Expected Impact:** -10-15% free() CPU

```c
#if HAKMEM_SAFE_FREE
    // Duplicate scan in freelist (lines 64-71)
    void* scan = meta->freelist;
    int scanned = 0;
    int dup = 0;
    while (scan && scanned < 64) { ... }

    // Remote queue duplicate scan (lines 175-229)
    uintptr_t cur = atomic_load(&ss->remote_heads[slab_idx], acquire);
    while (cur && scanned < 64) { ... }
#endif
```

**Benefits:**
- **Production builds:** No O(n) scans (0 cycles)
- **Debug builds:** Full safety checks (detect double-free)
- **Easy toggle:** `HAKMEM_SAFE_FREE=0` for benchmarks

---

### 🏆 **Proposal #4: Batch ss_active_dec_one() Updates**

**Goal:** Reduce atomic contention
**Expected Impact:** -5-10% free() CPU (MT), -2-5% (ST)

```c
// Instead of: ss_active_dec_one(ss) on every free
// Do: batch reconcile when draining the remote queue or TLS cache

int tiny_free_ultra_fast(...) {
    // ... freelist push ...
    meta->used--;
    // SKIP: ss_active_dec_one(ss);  ← defer to batch update
}

// Background thread or refill path: recompute the shared counter in one
// shot from the per-slab used counts instead of decrementing per free.
// (The sum may be momentarily stale under concurrent frees; acceptable
// for a counter that drives heuristics, not correctness.)
void batch_active_update(SuperSlab* ss) {
    uint32_t active = 0;
    for (int i = 0; i < 32; i++) {
        active += ss->slabs[i].used;
    }
    atomic_store(&ss->total_active_blocks, active, relaxed);
}
```

**Benefits:**
- **Fewer atomics:** 1 atomic per batch (vs N per free)
- **Less contention:** Batch updates are rare
- **Amortized cost:** O(1) amortized

---

### 🏆 **Proposal #5: Eliminate Redundant SuperSlab Lookup**

**Goal:** Remove the duplicate lookup
**Expected Impact:** -2-5% free() CPU

```c
// hak_free_at() - pass ss to hak_tiny_free()
void hak_free_at(void* ptr, size_t size, hak_callsite_t site) {
    SuperSlab* ss = hak_super_lookup(ptr);  // ← Lookup #1
    if (ss && ss->magic == SUPERSLAB_MAGIC) {
        hak_tiny_free_with_ss(ptr, ss);     // ← Pass ss (avoid lookup #2)
        return;
    }
    // ... fallback paths ...
}

// NEW: hak_tiny_free_with_ss() - skip the second lookup
void hak_tiny_free_with_ss(void* ptr, SuperSlab* ss) {
    // SKIP: ss = hak_super_lookup(ptr);  ← Lookup #2 (redundant!)
    hak_tiny_free_superslab(ptr, ss);
}
```

**Benefits:**
- **1 lookup** vs 2 (50% reduction)
- **Cache friendly:** Reuses the ss pointer
- **Easy change:** Add a new function variant

---

## 6. PERFORMANCE PROJECTIONS

### Current Baseline
- **Free CPU:** 52.63%
- **Alloc CPU:** 6.48%
- **Ratio:** 8.1x slower

### After All Optimizations

| Optimization | CPU Reduction | Cumulative CPU |
|--------------|---------------|----------------|
| **Baseline** | - | 52.63% |
| #1: Ultra-Fast Path | -60% | **21.05%** |
| #2: TID Cache | -5% | **20.00%** |
| #3: Safety → Debug | -10% | **18.00%** |
| #4: Batch Active | -5% | **17.10%** |
| #5: Skip Lookup | -2% | **16.76%** |

**Final Target:** 16.76% CPU (vs 52.63% baseline)
**Improvement:** **-68% CPU reduction**
**New Ratio:** 2.6x slower than alloc (vs 8.1x)

### Expected Throughput Gain
- **Current:** 1,046,392 ops/s
- **Projected:** 3,200,000 ops/s (+206%)
- **vs System:** 56,336,790 ops/s (still 17x slower, but improved from 53x)
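Reading each row's reduction as relative to the previous row's cumulative value (i.e., the steps compound multiplicatively rather than adding), the table's arithmetic checks out:

$$
52.63\% \times 0.40 \times 0.95 \times 0.90 \times 0.95 \times 0.98 \approx 16.76\%,
\qquad 1 - \tfrac{16.76}{52.63} \approx 68\%.
$$

The throughput projection is consistent with a similar reading: $3{,}200{,}000 / 1{,}046{,}392 \approx 3.06$, i.e. +206%.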
---

## 7. IMPLEMENTATION ROADMAP

### Phase 1: Quick Wins (1-2 days)
1. ✅ **TID Cache** (Proposal #2) - 1 hour
2. ✅ **Eliminate Redundant Lookup** (Proposal #5) - 2 hours
3. ✅ **Move Safety to Debug** (Proposal #3) - 1 hour

**Expected:** -15-20% CPU reduction

### Phase 2: Fast Path Extraction (3-5 days)
1. ✅ **Extract Ultra-Fast Free** (Proposal #1) - 2 days
2. ✅ **Integrate with Box 6** - 1 day
3. ✅ **Testing & Validation** - 1 day

**Expected:** -60% CPU reduction (cumulative: -68%)

### Phase 3: Advanced (1-2 weeks)
1. ⚠️ **Batch Active Updates** (Proposal #4) - 3 days
2. ⚠️ **Inline Fast Path** - 1 day
3. ⚠️ **Profile & Tune** - 2 days

**Expected:** -5% CPU reduction (final: -68%)

---

## 8. COMPARISON WITH SYSTEM MALLOC

### System malloc (tcache) Free Path (estimated)

```c
// glibc tcache_put() [~15 instructions]
void tcache_put(void* ptr, size_t tc_idx) {
    tcache_entry* e = (tcache_entry*)ptr;
    e->next = tcache->entries[tc_idx];  // 1. ptr->next = head
    tcache->entries[tc_idx] = e;        // 2. head = ptr
    ++tcache->counts[tc_idx];           // 3. count++
}
// Assembly: ~10 instructions (mov, mov, inc, ret)
```

**Why System malloc is faster:**
1. **No ownership check** (single-threaded tcache)
2. **No safety checks** (assumes a valid pointer)
3. **No atomic operations** (TLS-local)
4. **No syscalls** (no TID lookup)
5. **Tiny code size** (~15 instructions)

**HAKMEM Gap Analysis:**
- Current: 330 lines vs 15 instructions (**22x code bloat**)
- After optimization: ~20 lines vs 15 instructions (**1.3x**, acceptable)

---

## 9. RISK ASSESSMENT

### Proposal #1 (Ultra-Fast Path)
**Risk:** 🟢 Low
**Reason:** Isolated fast path; delegates to the slow path on failure
**Mitigation:** Keep the slow path unchanged for safety

### Proposal #2 (TID Cache)
**Risk:** 🟢 Very Low
**Reason:** TLS variable, no shared state
**Mitigation:** Initialize once per thread

### Proposal #3 (Safety → Debug)
**Risk:** 🟡 Medium
**Reason:** Removes double-free detection in production
**Mitigation:** Keep enabled for debug builds; add a compile-time flag

### Proposal #4 (Batch Active)
**Risk:** 🟡 Medium
**Reason:** Changes accounting semantics (delayed updates)
**Mitigation:** Thorough testing; fall back to per-free updates if issues arise

### Proposal #5 (Skip Lookup)
**Risk:** 🟢 Low
**Reason:** Pure optimization, no semantic change
**Mitigation:** Validate that the ss pointer is passed correctly

---

## 10. CONCLUSION

### Key Findings
1. **Free is 8x slower than alloc** (52.63% vs 6.48% CPU)
2. **Root cause:** Safety-first design (330 lines vs 3-4 instructions)
3. **Top bottlenecks:**
   - Syscall overhead (gettid)
   - O(n) duplicate scans (freelist + remote queue)
   - Redundant SuperSlab lookups
   - Atomic contention (ss_active_dec_one)
   - Diagnostic counters (5-7 atomics)

### Recommended Action Plan

**Priority 1 (Do Now):**
- ✅ **TID Cache** - 1 hour, -5% CPU
- ✅ **Skip Redundant Lookup** - 2 hours, -2% CPU
- ✅ **Safety → Debug Mode** - 1 hour, -10% CPU

**Priority 2 (This Week):**
- ✅ **Ultra-Fast Path** - 2 days, -60% CPU

**Priority 3 (Future):**
- ⚠️ **Batch Active Updates** - 3 days, -5% CPU

### Expected Outcome
- **CPU Reduction:** -68% (52.63% → 16.76%)
- **Throughput Gain:** +206% (1.04M → 3.2M ops/s)
- **Code Quality:** Cleaner separation (fast/slow paths)
- **Maintainability:** Safety checks isolated to debug mode

### Next Steps
1. **Review this analysis** with the team
2. **Implement Priority 1** (TID cache, skip lookup, safety guards)
3. **Benchmark results** (validate the -15-20% reduction)
4. **Proceed to Priority 2** (ultra-fast path extraction)

---

**END OF ULTRATHINK ANALYSIS**