hakmem/docs/analysis/FREE_PATH_ULTRATHINK_ANALYSIS.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized
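
For reference, the guard pattern applied in the changes above looks roughly like this; the surrounding function and message text are illustrative placeholders, and only the `#if !HAKMEM_BUILD_RELEASE` form comes from this commit:

```c
#include <stdio.h>

/* Illustrative placeholder; only the guard form is taken from this commit. */
static void sp_debug_log_example(int slab_idx)
{
#if !HAKMEM_BUILD_RELEASE
    fprintf(stderr, "[SP_ACQUIRE_STAGE3] slab_idx=%d\n", slab_idx);
#else
    (void)slab_idx;   /* debug output compiles away in release builds */
#endif
}
```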

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

# FREE PATH ULTRATHINK ANALYSIS

Date: 2025-11-08
Performance Hotspot: hak_tiny_free_superslab consuming 52.63% CPU
Benchmark: 1,046,392 ops/s (53x slower than System malloc's 56,336,790 ops/s)


## Executive Summary

The free() path in HAKMEM is 8x slower than allocation (52.63% vs 6.48% CPU) due to:

  1. Multiple redundant lookups (SuperSlab lookup called twice)
  2. Massive function size (330 lines with many branches)
  3. Expensive safety checks in hot path (duplicate scans, alignment checks)
  4. Atomic contention (CAS loops on every free)
  5. Syscall overhead (TID lookup on every free)

Root Cause: The free path was designed for safety and diagnostics, not performance. It lacks the "ultra-simple fast path" design that made allocation fast (Box 5).


## 1. CALL CHAIN ANALYSIS

### Complete Free Path (User → Kernel)

```
User free(ptr)
  ↓
1. free() wrapper                          [hak_wrappers.inc.h:92]
   ├─ Line 93:  atomic_fetch_add(g_free_wrapper_calls)    ← Atomic #1
   ├─ Line 94:  if (!ptr) return
   ├─ Line 95:  if (g_hakmem_lock_depth > 0) → libc
   ├─ Line 96:  if (g_initializing) → libc
   ├─ Line 97:  if (hak_force_libc_alloc()) → libc
   ├─ Line 98-102: LD_PRELOAD checks
   ├─ Line 103: g_hakmem_lock_depth++                     ← TLS write #1
   ├─ Line 104: hak_free_at(ptr, 0, HAK_CALLSITE())      ← MAIN ENTRY
   └─ Line 105: g_hakmem_lock_depth--

2. hak_free_at()                           [hak_free_api.inc.h:64]
   ├─ Line 78:  static int s_free_to_ss (getenv cache)
   ├─ Line 86:  ss = hak_super_lookup(ptr)               ← LOOKUP #1 ⚠️
   ├─ Line 87:  if (ss->magic == SUPERSLAB_MAGIC)
   ├─ Line 88:    slab_idx = slab_index_for(ss, ptr)     ← CALC #1
   ├─ Line 89:    if (sidx >= 0 && sidx < cap)
   └─ Line 90:    hak_tiny_free(ptr)                     ← ROUTE TO TINY

3. hak_tiny_free()                         [hakmem_tiny_free.inc:246]
   ├─ Line 249: atomic_fetch_add(g_hak_tiny_free_calls)  ← Atomic #2
   ├─ Line 252: hak_tiny_stats_poll()
   ├─ Line 253: tiny_debug_ring_record()
   ├─ Line 255-303: BENCH_SLL_ONLY fast path (optional)
   ├─ Line 306-366: Ultra mode fast path (optional)
   ├─ Line 372: ss = hak_super_lookup(ptr)               ← LOOKUP #2 ⚠️ REDUNDANT!
   ├─ Line 373: if (ss && ss->magic == SUPERSLAB_MAGIC)
   ├─ Line 376-381: Validate size_class
   └─ Line 430: hak_tiny_free_superslab(ptr, ss)        ← 52.63% CPU HERE! 💀

4. hak_tiny_free_superslab()               [tiny_superslab_free.inc.h:10] ← HOTSPOT
   ├─ Line 13:  atomic_fetch_add(g_free_ss_enter)        ← Atomic #3
   ├─ Line 14:  ROUTE_MARK(16)
   ├─ Line 15:  HAK_DBG_INC(g_superslab_free_count)
   ├─ Line 17:  slab_idx = slab_index_for(ss, ptr)       ← CALC #2 ⚠️
   ├─ Line 18-19: ss_size, ss_base calculations
   ├─ Line 20-25: Safety: slab_idx < 0 check
   ├─ Line 26:  meta = &ss->slabs[slab_idx]
   ├─ Line 27-40: Watch point debug (if enabled)
   ├─ Line 42-46: Safety: validate size_class bounds
   ├─ Line 47-72: Safety: EXPENSIVE! ⚠️
   │   ├─ Alignment check (delta % blk == 0)
   │   ├─ Range check (delta / blk < capacity)
   │   └─ Duplicate scan in freelist (up to 64 iterations!) ← 💀 O(n)
   ├─ Line 75:  my_tid = tiny_self_u32()                 ← SYSCALL! ⚠️ 💀
   ├─ Line 79-81: Ownership claim (if owner_tid == 0)
   ├─ Line 82-157: SAME-THREAD PATH (owner_tid == my_tid)
   │   ├─ Line 90-95: Safety: check used == 0
   │   ├─ Line 96: tiny_remote_track_expect_alloc()
   │   ├─ Line 97-112: Remote guard check (expensive!)
   │   ├─ Line 114-131: MidTC bypass (optional)
   │   ├─ Line 133-150: tiny_free_local_box()           ← Freelist push
   │   └─ Line 137-149: First-free publish logic
   └─ Line 158-328: CROSS-THREAD PATH (owner_tid != my_tid)
       ├─ Line 175-229: Duplicate detection in remote queue ← 💀 O(n) EXPENSIVE!
       │   ├─ Scan up to 64 nodes in remote stack
       │   ├─ Sentinel checks (if g_remote_side_enable)
       │   └─ Corruption detection
       ├─ Line 230-235: Safety: check used == 0
       ├─ Line 236-255: A/B gate for remote MPSC
       └─ Line 256-302: ss_remote_push()                ← MPSC push (atomic CAS)

5. tiny_free_local_box()                   [box/free_local_box.c:5]
   ├─ Line 6:   atomic_fetch_add(g_free_local_box_calls) ← Atomic #4
   ├─ Line 12-26: Failfast validation (if level >= 2)
   ├─ Line 28:  prev = meta->freelist                    ← Load
   ├─ Line 30-61: Freelist corruption debug (if level >= 2)
   ├─ Line 63:  *(void**)ptr = prev                      ← Write #1
   ├─ Line 64:  meta->freelist = ptr                     ← Write #2
   ├─ Line 67-75: Freelist corruption verification
   ├─ Line 77:  tiny_failfast_log()
   ├─ Line 80:  atomic_thread_fence(memory_order_release)← Memory barrier
   ├─ Line 83-93: Freelist mask update (optional)
   ├─ Line 96:  tiny_remote_track_on_local_free()
   ├─ Line 97:  meta->used--                             ← Decrement
   ├─ Line 98:  ss_active_dec_one(ss)                    ← CAS LOOP! ⚠️ 💀
   └─ Line 100-103: First-free publish

6. ss_active_dec_one()                     [superslab_inline.h:162]
   ├─ Line 163: atomic_fetch_add(g_ss_active_dec_calls)  ← Atomic #5
   ├─ Line 164: old = atomic_load(total_active_blocks)   ← Atomic #6
   └─ Line 165-169: CAS loop:                            ← CAS LOOP (contention in MT!)
       while (old != 0) {
           if (CAS(&total_active_blocks, old, old-1)) break;
       }                                                  ← Atomic #7+

7. ss_remote_push() [Cross-thread only]    [superslab_inline.h:202]
   ├─ Line 203: atomic_fetch_add(g_ss_remote_push_calls) ← Atomic #N
   ├─ Line 215-233: Sanity checks (range, alignment)
   ├─ Line 258-266: MPSC CAS loop:                       ← CAS LOOP (contention!)
   │   do {
   │       old = atomic_load(&head, acquire);            ← Atomic #N+1
   │       *(void**)ptr = (void*)old;
   │   } while (!CAS(&head, old, ptr));                  ← Atomic #N+2+
    └─ Line 267: tiny_remote_side_set()
```

## 2. EXPENSIVE OPERATIONS IDENTIFIED

### Critical Issues (Prioritized by Impact)

### 🔴 ISSUE #1: Duplicate SuperSlab Lookup (Lines hak_free_api:86 + hak_tiny_free:372)

Cost: 2x registry lookup per free
Location:

  • hak_free_at() line 86: ss = hak_super_lookup(ptr)
  • hak_tiny_free() line 372: ss = hak_super_lookup(ptr) ← REDUNDANT!

Why it's expensive:

  • hak_super_lookup() walks a registry or performs hash lookup
  • Result is already known from first call
  • Wastes CPU cycles and pollutes cache

Fix: Pass ss as parameter from hak_free_at() to hak_tiny_free()


### 🔴 ISSUE #2: Syscall in Hot Path (Line 75: tiny_self_u32())

Cost: ~200-500 cycles per free
Location: tiny_superslab_free.inc.h:75

```c
uint32_t my_tid = tiny_self_u32();  // ← SYSCALL (gettid)!
```

Why it's expensive:

  • Syscall overhead: 200-500 cycles (vs 1-2 for TLS read)
  • Context switch to kernel mode
  • Called on EVERY free (same-thread AND cross-thread)

Fix: Cache TID in TLS variable (like g_hakmem_lock_depth)


### 🔴 ISSUE #3: Duplicate Scan in Freelist (Lines 64-71)

Cost: O(n) scan, up to 64 iterations
Location: tiny_superslab_free.inc.h:64-71

```c
void* scan = meta->freelist; int scanned = 0; int dup = 0;
while (scan && scanned < 64) {
    if (scan == ptr) { dup = 1; break; }
    scan = *(void**)scan;
    scanned++;
}
```

Why it's expensive:

  • O(n) complexity (up to 64 pointer chases)
  • Cache misses (freelist nodes scattered in memory)
  • Branch mispredictions (while loop, if statement)
  • Only useful for debugging (catches double-free)

Fix: Move to debug-only path (behind HAKMEM_SAFE_FREE guard)


### 🔴 ISSUE #4: Remote Queue Duplicate Scan (Lines 175-229)

Cost: O(n) scan, up to 64 iterations + sentinel checks
Location: tiny_superslab_free.inc.h:177-221

```c
uintptr_t cur = atomic_load(&ss->remote_heads[slab_idx], acquire);
int scanned = 0; int dup = 0;
while (cur && scanned < 64) {
    if ((void*)cur == ptr) { dup = 1; break; }
    // ... sentinel checks ...
    cur = (uintptr_t)(*(void**)(void*)cur);
    scanned++;
}
```

Why it's expensive:

  • O(n) scan of remote queue (up to 64 nodes)
  • Atomic load + pointer chasing
  • Sentinel validation (if enabled)
  • Called on EVERY cross-thread free

Fix: Move to debug-only path or use bloom filter for fast negative check
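
A rough sketch of the bloom-filter idea, assuming a small per-slab filter word (the `remote_filter` field, the hash, and the usage line are hypothetical, not existing HAKMEM code): bits would be set on `ss_remote_push()` and cleared when the remote queue is drained, so a zero test can skip the O(n) scan in the common non-duplicate case.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-slab filter word, e.g. ss->remote_filter[slab_idx].
 * Bits are set in ss_remote_push() and cleared when the queue is drained. */
static inline uint64_t remote_filter_bit(void* ptr) {
    uint64_t h = (uint64_t)(uintptr_t)ptr >> 4;   /* drop alignment bits */
    h ^= h >> 17;
    h *= 0x9E3779B97F4A7C15ull;                   /* cheap mix */
    return 1ull << (h & 63);
}

/* Returns 0 when the pointer is definitely not queued → skip the O(n) scan. */
static inline int remote_filter_maybe_contains(_Atomic uint64_t* filter, void* ptr) {
    uint64_t bits = atomic_load_explicit(filter, memory_order_relaxed);
    return (bits & remote_filter_bit(ptr)) != 0;
}

/* Usage sketch in the cross-thread free path:
 *   if (!remote_filter_maybe_contains(&ss->remote_filter[slab_idx], ptr))
 *       // definitely not a duplicate: skip the 64-node scan
 */
```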


### 🔴 ISSUE #5: CAS Loop on Every Free (ss_active_dec_one)

Cost: 2-10 cycles (uncontended), 100+ cycles (contended)
Location: superslab_inline.h:162-169

```c
static inline void ss_active_dec_one(SuperSlab* ss) {
    atomic_fetch_add(&g_ss_active_dec_calls, 1, relaxed);  // ← Atomic #1
    uint32_t old = atomic_load(&ss->total_active_blocks, relaxed); // ← Atomic #2
    while (old != 0) {
        if (CAS(&ss->total_active_blocks, &old, old-1, relaxed)) break; // ← CAS loop
    }
}
```

Why it's expensive:

  • 3 atomic operations per free (fetch_add, load, CAS)
  • CAS loop can retry multiple times under contention (MT scenario)
  • Cache line ping-pong in multi-threaded workloads

Fix: Batch decrements (decrement by N when draining remote queue)


### 🟡 ISSUE #6: Multiple Atomic Increments for Diagnostics

Cost: 5-7 atomic operations per free
Locations:

  1. hak_wrappers.inc.h:93 - g_free_wrapper_calls
  2. hakmem_tiny_free.inc:249 - g_hak_tiny_free_calls
  3. tiny_superslab_free.inc.h:13 - g_free_ss_enter
  4. free_local_box.c:6 - g_free_local_box_calls
  5. superslab_inline.h:163 - g_ss_active_dec_calls
  6. superslab_inline.h:203 - g_ss_remote_push_calls (cross-thread only)

Why it's expensive:

  • Each atomic increment: 10-20 cycles
  • Total: 50-100+ cycles per free (5-10% overhead)
  • Only useful for diagnostics

Fix: Compile-time gate (#if HAKMEM_DEBUG_COUNTERS)
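
A minimal sketch of such a gate, assuming a `HAKMEM_DEBUG_COUNTERS` macro and a `HAK_COUNT_INC` wrapper (both hypothetical names; the counter's exact type is also an assumption):

```c
#include <stdatomic.h>

#if HAKMEM_DEBUG_COUNTERS
#  define HAK_COUNT_INC(ctr) \
      atomic_fetch_add_explicit(&(ctr), 1, memory_order_relaxed)
#else
#  define HAK_COUNT_INC(ctr) ((void)0)   /* no atomic in release builds */
#endif

extern _Atomic unsigned long g_free_wrapper_calls;  /* type assumed for the sketch */

/* Hot path: becomes a no-op when HAKMEM_DEBUG_COUNTERS is 0. */
static inline void free_wrapper_count(void) {
    HAK_COUNT_INC(g_free_wrapper_calls);
}
```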


### 🟡 ISSUE #7: Environment Variable Checks (Even with Caching)

Cost: First call: 1000+ cycles (getenv), Subsequent: 2-5 cycles (cached)
Locations:

  • Line 106, 145: HAKMEM_TINY_ROUTE_FREE
  • Line 117, 169: HAKMEM_TINY_FREE_TO_SS
  • Line 313: HAKMEM_TINY_FREELIST_MASK
  • Line 238, 249: HAKMEM_TINY_DISABLE_REMOTE

Why it's expensive:

  • First call to getenv() is expensive (1000+ cycles)
  • Branch on cached value still adds 1-2 cycles
  • Multiple env vars = multiple branches

Fix: Consolidate env vars or use compile-time flags
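
A minimal sketch of the consolidation idea, assuming the four env vars listed above are folded into one cached bitmask (the function and flag names are hypothetical, and presence-as-enabled is used for brevity):

```c
#include <stdlib.h>

enum {
    HAK_F_ROUTE_FREE     = 1u << 0,
    HAK_F_FREE_TO_SS     = 1u << 1,
    HAK_F_FREELIST_MASK  = 1u << 2,
    HAK_F_DISABLE_REMOTE = 1u << 3,
};

/* Reads the environment once per process; hot paths test a cached bit. */
static unsigned hak_env_flags(void) {
    static int inited;       /* sketch only: add locking or do this at init time */
    static unsigned flags;
    if (!inited) {
        if (getenv("HAKMEM_TINY_ROUTE_FREE"))     flags |= HAK_F_ROUTE_FREE;
        if (getenv("HAKMEM_TINY_FREE_TO_SS"))     flags |= HAK_F_FREE_TO_SS;
        if (getenv("HAKMEM_TINY_FREELIST_MASK"))  flags |= HAK_F_FREELIST_MASK;
        if (getenv("HAKMEM_TINY_DISABLE_REMOTE")) flags |= HAK_F_DISABLE_REMOTE;
        inited = 1;
    }
    return flags;            /* one branch per flag test, no getenv in the hot path */
}
```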


### 🟡 ISSUE #8: Massive Function Size (330 lines)

Cost: I-cache misses, branch mispredictions
Location: tiny_superslab_free.inc.h:10-330

Why it's expensive:

  • 330 lines of code (vs 10-20 for System tcache)
  • Many branches (if statements, while loops)
  • Branch mispredictions: 10-20 cycles per miss
  • I-cache misses: 100+ cycles

Fix: Extract fast path (10-15 lines) and delegate to slow path


## 3. COMPARISON WITH ALLOCATION FAST PATH

### Allocation (6.48% CPU) vs Free (52.63% CPU)

| Metric | Allocation (Box 5) | Free (Current) | Ratio |
|---|---|---|---|
| CPU Usage | 6.48% | 52.63% | 8.1x slower |
| Function Size | ~20 lines | 330 lines | 16.5x larger |
| Atomic Ops | 1 (TLS count decrement) | 5-7 (counters + CAS) | 5-7x more |
| Syscalls | 0 | 1 (gettid) | — |
| Lookups | 0 (direct TLS) | 2 (SuperSlab) | — |
| O(n) Scans | 0 | 2 (freelist + remote) | — |
| Branches | 2-3 (head == NULL check) | 50+ (safety, guards, env vars) | 16-25x |

Key Insight: Allocation succeeds with 3-4 instructions (Box 5 design), while free requires 330 lines with multiple syscalls, atomics, and O(n) scans.


## 4. ROOT CAUSE ANALYSIS

### Why is Free 8x Slower than Alloc?

#### Allocation Design (Box 5 - Ultra-Simple Fast Path)

```c
// Box 5: tiny_alloc_fast_pop() [~10 lines, 3-4 instructions]
void* tiny_alloc_fast_pop(int class_idx) {
    void* ptr = g_tls_sll_head[class_idx];       // 1. Load TLS head
    if (!ptr) return NULL;                       // 2. NULL check
    g_tls_sll_head[class_idx] = *(void**)ptr;    // 3. Update head (pop)
    g_tls_sll_count[class_idx]--;                // 4. Decrement count
    return ptr;                                  // 5. Return
}
// Assembly: ~5 instructions (mov, cmp, jz, mov, dec, ret)
```

#### Free Design (Current - Multi-Layer Complexity)

```c
// Current free path: 330 lines, 50+ branches, 5-7 atomics, 1 syscall
void hak_tiny_free_superslab(void* ptr, SuperSlab* ss) {
    // 1. Diagnostics (atomic increments) - 3 atomics
    // 2. Safety checks (alignment, range, duplicate scan) - 64 iterations
    // 3. Syscall (gettid) - 200-500 cycles
    // 4. Ownership check (my_tid == owner_tid)
    // 5. Remote guard checks (function calls, tracking)
    // 6. MidTC bypass (optional)
    // 7. Freelist push (2 writes + failfast validation)
    // 8. CAS loop (ss_active_dec_one) - contention
    // 9. First-free publish (if prev == NULL)
    // ... 300+ more lines
}
```

Problem: Free path was designed for safety and diagnostics, not performance.


## 5. CONCRETE OPTIMIZATION PROPOSALS

### 🏆 Proposal #1: Extract Ultra-Simple Free Fast Path (Highest Priority)

Goal: Match allocation's 3-4 instruction fast path
Expected Impact: -60-70% free() CPU (52.63% → 15-20%)

#### Implementation (Box 6 Enhancement)

```c
// tiny_free_ultra_fast.inc.h (NEW FILE)
// Ultra-simple free fast path (3-4 instructions, same-thread only)

static inline int tiny_free_ultra_fast(void* ptr, SuperSlab* ss, int slab_idx, uint32_t my_tid) {
    // PREREQUISITE: Caller MUST validate:
    // 1. ss != NULL && ss->magic == SUPERSLAB_MAGIC
    // 2. slab_idx >= 0 && slab_idx < capacity
    // 3. my_tid == current thread (cached in TLS)

    TinySlabMeta* meta = &ss->slabs[slab_idx];

    // Fast path: Same-thread check (TOCTOU-safe)
    uint32_t owner = atomic_load_explicit(&meta->owner_tid, memory_order_relaxed);
    if (__builtin_expect(owner != my_tid, 0)) {
        return 0;  // Cross-thread → delegate to slow path
    }

    // Fast path: Direct freelist push (2 writes)
    void* prev = meta->freelist;                // 1. Load prev
    *(void**)ptr = prev;                        // 2. ptr->next = prev
    meta->freelist = ptr;                       // 3. freelist = ptr

    // Accounting (TLS, no atomic)
    meta->used--;                               // 4. Decrement used

    // SKIP ss_active_dec_one() in fast path (batch update later)

    return 1;  // Success
}

// Assembly (x86-64, expected):
//   mov    eax, DWORD PTR [meta->owner_tid]  ; owner
//   cmp    eax, my_tid                        ; owner == my_tid?
//   jne    .slow_path                         ; if not, slow path
//   mov    rax, QWORD PTR [meta->freelist]   ; prev = freelist
//   mov    QWORD PTR [ptr], rax              ; ptr->next = prev
//   mov    QWORD PTR [meta->freelist], ptr   ; freelist = ptr
//   dec    DWORD PTR [meta->used]            ; used--
//   ret                                       ; done
// .slow_path:
//   xor    eax, eax
//   ret
```

#### Integration into hak_tiny_free_superslab()

```c
void hak_tiny_free_superslab(void* ptr, SuperSlab* ss) {
    // Cache TID in TLS (avoid syscall)
    static __thread uint32_t g_cached_tid = 0;
    if (__builtin_expect(g_cached_tid == 0, 0)) {
        g_cached_tid = tiny_self_u32();  // Initialize once per thread
    }
    uint32_t my_tid = g_cached_tid;

    int slab_idx = slab_index_for(ss, ptr);

    // FAST PATH: Ultra-simple free (3-4 instructions)
    if (__builtin_expect(tiny_free_ultra_fast(ptr, ss, slab_idx, my_tid), 1)) {
        return;  // Success: same-thread, pushed to freelist
    }

    // SLOW PATH: Cross-thread, safety checks, remote queue
    // ... existing 330 lines ...
}
```

Benefits:

  • Same-thread free: 3-4 instructions (vs 330 lines)
  • No syscall (TID cached in TLS)
  • No atomics in fast path (meta->used is TLS-local)
  • No safety checks in fast path (delegate to slow path)
  • Branch prediction friendly (same-thread is common case)

Trade-offs:

  • Skip ss_active_dec_one() in fast path (batch update in background thread)
  • Skip safety checks in fast path (only in slow path / debug mode)

### 🏆 Proposal #2: Cache TID in TLS (Quick Win)

Goal: Eliminate syscall overhead
Expected Impact: -5-10% free() CPU

```c
// hakmem_tiny.c (or core header)
__thread uint32_t g_cached_tid = 0;  // TLS cache for thread ID

static inline uint32_t tiny_self_u32_cached(void) {
    if (__builtin_expect(g_cached_tid == 0, 0)) {
        g_cached_tid = tiny_self_u32();  // Initialize once per thread
    }
    return g_cached_tid;
}
```

Change: Replace all tiny_self_u32() calls with tiny_self_u32_cached()

Benefits:

  • Syscall elimination: 0 syscalls (vs 1 per free)
  • TLS read: 1-2 cycles (vs 200-500 for gettid)
  • Easy to implement: 1-line change

### 🏆 Proposal #3: Move Safety Checks to Debug-Only Path

Goal: Remove O(n) scans from hot path
Expected Impact: -10-15% free() CPU

```c
#if HAKMEM_SAFE_FREE
    // Duplicate scan in freelist (lines 64-71)
    void* scan = meta->freelist; int scanned = 0; int dup = 0;
    while (scan && scanned < 64) { ... }

    // Remote queue duplicate scan (lines 175-229)
    uintptr_t cur = atomic_load(&ss->remote_heads[slab_idx], acquire);
    while (cur && scanned < 64) { ... }
#endif
```

Benefits:

  • Production builds: No O(n) scans (0 cycles)
  • Debug builds: Full safety checks (detect double-free)
  • Easy toggle: HAKMEM_SAFE_FREE=0 for benchmarks

### 🏆 Proposal #4: Batch ss_active_dec_one() Updates

Goal: Reduce atomic contention
Expected Impact: -5-10% free() CPU (MT), -2-5% (ST)

```c
// Instead of: ss_active_dec_one(ss) on every free
// Do: Batch decrement when draining remote queue or TLS cache

void tiny_free_ultra_fast(...) {
    // ... freelist push ...
    meta->used--;
    // SKIP: ss_active_dec_one(ss);  ← Defer to batch update
}

// Background thread or refill path:
void batch_active_update(SuperSlab* ss) {
    uint32_t total_freed = 0;
    for (int i = 0; i < 32; i++) {
        TinySlabMeta* m = &ss->slabs[i];         // per-slab metadata
        total_freed += (m->capacity - m->used);  // free blocks in this slab
    }
    atomic_fetch_sub(&ss->total_active_blocks, total_freed, relaxed);
}
```

Benefits:

  • Fewer atomics: 1 atomic per batch (vs N per free)
  • Less contention: Batch updates are rare
  • Amortized cost: O(1) amortized

### 🏆 Proposal #5: Eliminate Redundant SuperSlab Lookup

Goal: Remove duplicate lookup
Expected Impact: -2-5% free() CPU

```c
// hak_free_at() - pass ss to hak_tiny_free()
void hak_free_at(void* ptr, size_t size, hak_callsite_t site) {
    SuperSlab* ss = hak_super_lookup(ptr);  // ← Lookup #1
    if (ss && ss->magic == SUPERSLAB_MAGIC) {
        hak_tiny_free_with_ss(ptr, ss);     // ← Pass ss (avoid lookup #2)
        return;
    }
    // ... fallback paths ...
}

// NEW: hak_tiny_free_with_ss() - skip second lookup
void hak_tiny_free_with_ss(void* ptr, SuperSlab* ss) {
    // SKIP: ss = hak_super_lookup(ptr);  ← Lookup #2 (redundant!)
    hak_tiny_free_superslab(ptr, ss);
}
```

Benefits:

  • 1 lookup: vs 2 (50% reduction)
  • Cache friendly: Reuse ss pointer
  • Easy change: Add new function variant

## 6. PERFORMANCE PROJECTIONS

### Current Baseline

  • Free CPU: 52.63%
  • Alloc CPU: 6.48%
  • Ratio: 8.1x slower

### After All Optimizations

| Optimization | CPU Reduction | Cumulative CPU |
|---|---|---|
| Baseline | - | 52.63% |
| #1: Ultra-Fast Path | -60% | 21.05% |
| #2: TID Cache | -5% | 20.00% |
| #3: Safety → Debug | -10% | 18.00% |
| #4: Batch Active | -5% | 17.10% |
| #5: Skip Lookup | -2% | 16.76% |
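
The cumulative column composes the per-step reductions multiplicatively:

52.63% × 0.40 × 0.95 × 0.90 × 0.95 × 0.98 ≈ 16.76%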

Final Target: 16.76% CPU (vs 52.63% baseline)
Improvement: -68% CPU reduction
New Ratio: 2.6x slower than alloc (vs 8.1x)

### Expected Throughput Gain

  • Current: 1,046,392 ops/s
  • Projected: 3,200,000 ops/s (+206%)
  • vs System: 56,336,790 ops/s (still 17x slower, but improved from 53x)

## 7. IMPLEMENTATION ROADMAP

### Phase 1: Quick Wins (1-2 days)

  1. TID Cache (Proposal #2) - 1 hour
  2. Eliminate Redundant Lookup (Proposal #5) - 2 hours
  3. Move Safety to Debug (Proposal #3) - 1 hour

Expected: -15-20% CPU reduction

### Phase 2: Fast Path Extraction (3-5 days)

  1. Extract Ultra-Fast Free (Proposal #1) - 2 days
  2. Integrate with Box 6 - 1 day
  3. Testing & Validation - 1 day

Expected: -60% CPU reduction (cumulative: -68%)

### Phase 3: Advanced (1-2 weeks)

  1. ⚠️ Batch Active Updates (Proposal #4) - 3 days
  2. ⚠️ Inline Fast Path - 1 day
  3. ⚠️ Profile & Tune - 2 days

Expected: -5% CPU reduction (final: -68%)


## 8. COMPARISON WITH SYSTEM MALLOC

### System malloc (tcache) Free Path (estimated)

```c
// glibc tcache_put() [~15 instructions]
void tcache_put(void* ptr, size_t tc_idx) {
    tcache_entry* e = (tcache_entry*)ptr;
    e->next = tcache->entries[tc_idx];      // 1. ptr->next = head
    tcache->entries[tc_idx] = e;            // 2. head = ptr
    ++tcache->counts[tc_idx];               // 3. count++
}
// Assembly: ~10 instructions (mov, mov, inc, ret)
```

Why System malloc is faster:

  1. No ownership check (single-threaded tcache)
  2. No safety checks (assumes valid pointer)
  3. No atomic operations (TLS-local)
  4. No syscalls (no TID lookup)
  5. Tiny code size (~15 instructions)

HAKMEM Gap Analysis:

  • Current: 330 lines vs 15 instructions (22x code bloat)
  • After optimization: ~20 lines vs 15 instructions (1.3x, acceptable)

## 9. RISK ASSESSMENT

### Proposal #1 (Ultra-Fast Path)

Risk: 🟢 Low
Reason: Isolated fast path, delegates to slow path on failure
Mitigation: Keep slow path unchanged for safety

### Proposal #2 (TID Cache)

Risk: 🟢 Very Low
Reason: TLS variable, no shared state
Mitigation: Initialize once per thread

### Proposal #3 (Safety → Debug)

Risk: 🟡 Medium
Reason: Removes double-free detection in production
Mitigation: Keep enabled for debug builds, add compile-time flag

### Proposal #4 (Batch Active)

Risk: 🟡 Medium
Reason: Changes accounting semantics (delayed updates)
Mitigation: Thorough testing, fallback to per-free if issues

### Proposal #5 (Skip Lookup)

Risk: 🟢 Low
Reason: Pure optimization, no semantic change
Mitigation: Validate ss pointer is passed correctly


## 10. CONCLUSION

### Key Findings

  1. Free is 8x slower than alloc (52.63% vs 6.48% CPU)
  2. Root cause: Safety-first design (330 lines vs 3-4 instructions)
  3. Top bottlenecks:
    • Syscall overhead (gettid)
    • O(n) duplicate scans (freelist + remote queue)
    • Redundant SuperSlab lookups
    • Atomic contention (ss_active_dec_one)
    • Diagnostic counters (5-7 atomics)

Priority 1 (Do Now):

  • TID Cache - 1 hour, -5% CPU
  • Skip Redundant Lookup - 2 hours, -2% CPU
  • Safety → Debug Mode - 1 hour, -10% CPU

Priority 2 (This Week):

  • Ultra-Fast Path - 2 days, -60% CPU

Priority 3 (Future):

  • ⚠️ Batch Active Updates - 3 days, -5% CPU

### Expected Outcome

  • CPU Reduction: -68% (52.63% → 16.76%)
  • Throughput Gain: +206% (1.04M → 3.2M ops/s)
  • Code Quality: Cleaner separation (fast/slow paths)
  • Maintainability: Safety checks isolated to debug mode

### Next Steps

  1. Review this analysis with team
  2. Implement Priority 1 (TID cache, skip lookup, safety guards)
  3. Benchmark results (validate -15-20% reduction)
  4. Proceed to Priority 2 (ultra-fast path extraction)

END OF ULTRATHINK ANALYSIS