hakmem/core/tiny_alloc_fast.inc.h
Moe Charm (CI) 707056b765 feat: Phase 7 + Phase 2 - Massive performance & stability improvements
Performance Achievements:
- Tiny allocations: +180-280% (21M → 59-70M ops/s random mixed)
- Single-thread: +24% (2.71M → 3.36M ops/s Larson)
- 4T stability: 0% → 95% (19/20 success rate)
- Overall: 91.3% of System malloc average (target was 40-55%) ✓

Phase 7 (Tasks 1-3): Core Optimizations
- Task 1: Header validation removal (Region-ID direct lookup)
- Task 2: Aggressive inline (TLS cache access optimization)
- Task 3: Pre-warm TLS cache (eliminate cold-start penalty)
  Result: +180-280% improvement, 85-146% of System malloc

Critical Bug Fixes:
- Fix 64B allocation crash (size-to-class +1 for header)
- Fix 4T wrapper recursion bugs (BUG #7, #8, #10, #11)
- Remove malloc fallback (30% → 50% stability)
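
The 64B fix above is a matter of arithmetic: once each block reserves space for a class-index header, a 64-byte request needs 65 bytes and no longer fits the 64-byte class. A minimal sketch, assuming a 1-byte header and a hypothetical power-of-two class table (`demo_class_sizes`, `demo_class_for_bytes`, `demo_size_to_class`, and `DEMO_HEADER_BYTES` are illustrative names, not HAKMEM's actual definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical class table: block sizes in bytes (assumption). */
static const size_t demo_class_sizes[] = {8, 16, 32, 64, 128, 256, 512, 1024};
#define DEMO_NUM_CLASSES ((int)(sizeof(demo_class_sizes) / sizeof(demo_class_sizes[0])))
#define DEMO_HEADER_BYTES 1u   /* assumed 1-byte class-index header */

/* Smallest class whose block holds `need` bytes, or -1 if none does. */
static int demo_class_for_bytes(size_t need) {
    for (int i = 0; i < DEMO_NUM_CLASSES; i++) {
        if (demo_class_sizes[i] >= need) return i;
    }
    return -1;
}

/* Header-aware mapping: the block must hold the payload plus the header. */
static int demo_size_to_class(size_t user_size) {
    return demo_class_for_bytes(user_size + DEMO_HEADER_BYTES);
}
```

Without the `+ DEMO_HEADER_BYTES` term, a 64-byte request would land in the 64-byte class and the header would push the payload one byte past the end of the block.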

Phase 2a: SuperSlab Dynamic Expansion (CRITICAL)
- Implement mimalloc-style chunk linking
- Unlimited slab expansion (no more OOM at 32 slabs)
- Fix chunk initialization bug (bitmap=0x00000001 after expansion)
  Files: core/hakmem_tiny_superslab.c/h, core/superslab/superslab_types.h
  Result: 50% → 95% stability (19/20 4T success)
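
Chunk linking can be sketched as a list of fixed-size chunks that grows when every chunk is full, instead of failing at a hard limit. This is an illustration only: `DemoChunk`, `DemoSuperSlabHead`, `demo_expand`, and `demo_acquire_slab` are hypothetical names, and the 32-bit in-use bitmap is an assumption taken from the 32-slab limit mentioned above, not the real SuperSlab layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_SLABS_PER_CHUNK 32   /* mirrors the old 32-slab limit */

/* Hypothetical chunk: a 32-bit "slab in use" bitmap plus a link. */
typedef struct DemoChunk {
    uint32_t used_bitmap;          /* bit i set => slab i is in use */
    struct DemoChunk* next;
} DemoChunk;

typedef struct {
    DemoChunk* first;
    DemoChunk* last;
    int chunk_count;
} DemoSuperSlabHead;

/* Append a fresh, zero-initialized chunk; returns NULL on OOM. */
static DemoChunk* demo_expand(DemoSuperSlabHead* h) {
    DemoChunk* c = calloc(1, sizeof(*c));   /* bitmap = 0, next = NULL */
    if (!c) return NULL;
    if (h->last) h->last->next = c; else h->first = c;
    h->last = c;
    h->chunk_count++;
    return c;
}

/* Claim one slab; returns a global slab index, or -1 on OOM. */
static int demo_acquire_slab(DemoSuperSlabHead* h) {
    int base = 0;
    for (DemoChunk* c = h->first; c; c = c->next, base += DEMO_SLABS_PER_CHUNK) {
        if (c->used_bitmap != UINT32_MAX) {
            int bit = __builtin_ctz(~c->used_bitmap);   /* lowest free slab */
            c->used_bitmap |= (uint32_t)1 << bit;
            return base + bit;
        }
    }
    DemoChunk* c = demo_expand(h);   /* every chunk is full: grow instead of failing */
    if (!c) return -1;
    c->used_bitmap = 1u;             /* claim slab 0 of the new chunk */
    return base;
}
```

The point of the sketch is that a new chunk must start from a fully initialized state before its first slab is claimed. Freeing the chunks is omitted for brevity.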

Phase 2b: TLS Cache Adaptive Sizing
- Dynamic capacity: 16-2048 slots based on usage
- High-water mark tracking + exponential growth/shrink
- Expected: +3-10% performance, -30-50% memory
  Files: core/tiny_adaptive_sizing.c/h (new)
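
High-water mark tracking with exponential growth and shrink might look like the sketch below. The 16-2048 range comes from the notes above. The 80%/20% thresholds, the adaptation trigger, and all `demo_*` / `DemoTlsCache` names are assumptions for illustration, not the contents of `tiny_adaptive_sizing.c`.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CAP_MIN 16u     /* range from the notes above */
#define DEMO_CAP_MAX 2048u

typedef struct {
    uint32_t capacity;     /* current slot limit */
    uint32_t count;        /* blocks cached right now */
    uint32_t high_water;   /* max count seen since the last adaptation */
} DemoTlsCache;

/* Call after each push so the high-water mark stays current. */
static void demo_track(DemoTlsCache* c) {
    if (c->count > c->high_water) c->high_water = c->count;
}

/* Call periodically, e.g. every N refills (the period is an assumption). */
static void demo_adapt(DemoTlsCache* c) {
    if (c->high_water * 10u >= c->capacity * 8u) {
        /* Peak usage reached 80% of capacity: double, capped at the maximum. */
        uint32_t grown = c->capacity * 2u;
        c->capacity = grown > DEMO_CAP_MAX ? DEMO_CAP_MAX : grown;
    } else if (c->high_water * 10u <= c->capacity * 2u) {
        /* Peak usage stayed at or below 20%: halve, floored at the minimum. */
        uint32_t shrunk = c->capacity / 2u;
        c->capacity = shrunk < DEMO_CAP_MIN ? DEMO_CAP_MIN : shrunk;
    }
    c->high_water = c->count;   /* start a new observation window */
}
```

Resetting the high-water mark after each adaptation means a class that was hot once but has gone cold shrinks again after a few quiet windows.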

Phase 2c: BigCache Dynamic Hash Table
- Migrate from fixed 256×8 array to dynamic hash table
- Auto-resize: 256 → 512 → 1024 → 65,536 buckets
- Improved hash function (FNV-1a) + collision chaining
  Files: core/hakmem_bigcache.c/h
  Expected: +10-20% cache hit rate
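
A dynamic table of this shape can be sketched with 64-bit FNV-1a, per-bucket chains, and doubling between 256 and 65,536 buckets. The hash constants are the standard FNV-1a offset basis and prime. The node layout, the load-factor threshold, and every `demo_*` name are assumptions for illustration, not the code in `core/hakmem_bigcache.c`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* 64-bit FNV-1a over a byte buffer (standard offset basis and prime). */
static uint64_t demo_fnv1a64(const void* data, size_t len) {
    const unsigned char* p = data;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

typedef struct DemoNode {
    uintptr_t key;            /* hypothetical cache key */
    void* value;
    struct DemoNode* next;    /* collision chain */
} DemoNode;

typedef struct {
    DemoNode** buckets;
    size_t bucket_count;      /* always a power of two */
    size_t entries;
} DemoTable;

#define DEMO_MIN_BUCKETS 256u
#define DEMO_MAX_BUCKETS 65536u

static size_t demo_index(uintptr_t key, size_t bucket_count) {
    return (size_t)(demo_fnv1a64(&key, sizeof key) & (bucket_count - 1));
}

static int demo_init(DemoTable* t) {
    t->buckets = calloc(DEMO_MIN_BUCKETS, sizeof *t->buckets);
    t->bucket_count = t->buckets ? DEMO_MIN_BUCKETS : 0;
    t->entries = 0;
    return t->buckets ? 0 : -1;
}

/* Double the bucket array and relink every node; keeps the old array on OOM. */
static void demo_grow(DemoTable* t) {
    size_t n = t->bucket_count * 2;
    DemoNode** nb = calloc(n, sizeof *nb);
    if (!nb) return;
    for (size_t i = 0; i < t->bucket_count; i++) {
        for (DemoNode* e = t->buckets[i]; e; ) {
            DemoNode* next = e->next;
            size_t j = demo_index(e->key, n);
            e->next = nb[j];
            nb[j] = e;
            e = next;
        }
    }
    free(t->buckets);
    t->buckets = nb;
    t->bucket_count = n;
}

/* Insert without a duplicate check; a repeated key shadows the older entry. */
static int demo_put(DemoTable* t, uintptr_t key, void* value) {
    /* Grow once the average chain length reaches 1 (threshold is an assumption). */
    if (t->entries >= t->bucket_count && t->bucket_count < DEMO_MAX_BUCKETS) demo_grow(t);
    DemoNode* e = malloc(sizeof *e);
    if (!e) return -1;
    size_t i = demo_index(key, t->bucket_count);
    e->key = key;
    e->value = value;
    e->next = t->buckets[i];
    t->buckets[i] = e;
    t->entries++;
    return 0;
}

static void* demo_get(const DemoTable* t, uintptr_t key) {
    for (DemoNode* e = t->buckets[demo_index(key, t->bucket_count)]; e; e = e->next) {
        if (e->key == key) return e->value;
    }
    return NULL;
}
```

Keeping the bucket count a power of two lets the index be a mask rather than a modulo, and chaining keeps lookups correct even when a resize fails and the table stays at its old size.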

Design Flaws Analysis:
- Identified 6 components with fixed-capacity bottlenecks
- SuperSlab (CRITICAL), TLS Cache (HIGH), BigCache/L2.5 (MEDIUM)
- Report: DESIGN_FLAWS_ANALYSIS.md (11 chapters)

Documentation:
- 13 comprehensive reports (PHASE*.md, DESIGN_FLAWS*.md)
- Implementation guides, test results, production readiness
- Bug fix reports, root cause analysis

Build System:
- Makefile: phase7 targets, PREWARM_TLS flag
- Auto dependency generation (-MMD -MP) for .inc files

Known Issues:
- 4T stability: 19/20 (95%) - investigating 1 failure for 100%
- L2.5 Pool dynamic sharding: design only (needs 2-3 days integration)

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 17:08:00 +09:00

543 lines
20 KiB
C

// tiny_alloc_fast.inc.h - Box 5: Allocation Fast Path (3-4 instructions)
// Purpose: Ultra-fast TLS freelist pop (inspired by System tcache & Mid-Large HAKX +171%)
// Invariant: Hit rate > 95% → 3-4 instructions, Miss → refill from backend
// Design: "Simple Front + Smart Back" - Front is dumb & fast, Back is smart
//
// Box 5-NEW: SFC (Super Front Cache) Integration
// Architecture: SFC (Layer 0, 128-256 slots) → SLL (Layer 1, unlimited) → SuperSlab (Layer 2+)
// Cascade Refill: SFC ← SLL (one-way, safe)
// Goal: +200% performance (4.19M → 12M+ ops/s)
//
// Phase 2b: Adaptive TLS Cache Sizing
// Hot classes grow to 2048 slots, cold classes shrink to 16 slots
// Expected: +3-10% performance, -30-50% TLS cache memory overhead
#pragma once
#include "tiny_atomic.h"
#include "hakmem_tiny.h"
#include "tiny_route.h"
#include "tiny_alloc_fast_sfc.inc.h" // Box 5-NEW: SFC Layer
#include "tiny_region_id.h" // Phase 7: Header-based class_idx lookup
#include "tiny_adaptive_sizing.h" // Phase 2b: Adaptive sizing
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
#include "box/front_gate_box.h"
#endif
#include <stdint.h> // uint32_t, uint64_t, uintptr_t
#include <stdio.h>
#include <stdlib.h> // getenv, abort
// Phase 7 Task 2: Aggressive inline TLS cache access
// Enable with: make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1
#ifndef HAKMEM_TINY_AGGRESSIVE_INLINE
#define HAKMEM_TINY_AGGRESSIVE_INLINE 0
#endif
#if HAKMEM_TINY_AGGRESSIVE_INLINE
#include "tiny_alloc_fast_inline.h"
#endif
// ========== Debug Counters (compile-time gated) ==========
#if HAKMEM_DEBUG_COUNTERS
// Refill-stage counters (defined in hakmem_tiny.c)
extern unsigned long long g_rf_total_calls[];
extern unsigned long long g_rf_hit_bench[];
extern unsigned long long g_rf_hit_hot[];
extern unsigned long long g_rf_hit_mail[];
extern unsigned long long g_rf_hit_slab[];
extern unsigned long long g_rf_hit_ss[];
extern unsigned long long g_rf_hit_reg[];
extern unsigned long long g_rf_mmap_calls[];
// Publish hits (defined in hakmem_tiny.c)
extern unsigned long long g_pub_mail_hits[];
extern unsigned long long g_pub_bench_hits[];
extern unsigned long long g_pub_hot_hits[];
// Free pipeline (defined in hakmem_tiny.c)
extern unsigned long long g_free_via_tls_sll[];
#endif
// ========== Box 5: Allocation Fast Path ==========
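// The Fast Allocation layer of Box Theory. Pops directly from the TLS freelist (3-4 instructions).
// Invariants:
// - If the TLS freelist is non-empty, return immediately (no lock, no sync)
// - On a miss, delegate to the Backend (Box 3: SuperSlab)
// - Cross-thread allocation is not handled here (the Backend takes care of it)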
// External TLS variables (defined in hakmem_tiny.c)
extern __thread void* g_tls_sll_head[TINY_NUM_CLASSES];
extern __thread uint32_t g_tls_sll_count[TINY_NUM_CLASSES];
// External backend functions
extern int sll_refill_small_from_ss(int class_idx, int max_take);
extern void* hak_tiny_alloc_slow(size_t size, int class_idx);
extern int hak_tiny_size_to_class(size_t size);
extern int tiny_refill_failfast_level(void);
extern const size_t g_tiny_class_sizes[];
// Global Front refill config (parsed at init; defined in hakmem_tiny.c)
extern int g_refill_count_global;
extern int g_refill_count_hot;
extern int g_refill_count_mid;
extern int g_refill_count_class[TINY_NUM_CLASSES];
// HAK_RET_ALLOC macro is now defined in core/hakmem_tiny.c
// See lines 116-152 for single definition point based on HAKMEM_TINY_HEADER_CLASSIDX
// ========== RDTSC Profiling (lightweight) ==========
#ifdef __x86_64__
static inline uint64_t tiny_fast_rdtsc(void) {
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
#else
static inline uint64_t tiny_fast_rdtsc(void) { return 0; }
#endif
// Per-thread profiling counters (enable with HAKMEM_TINY_PROFILE=1)
static __thread uint64_t g_tiny_alloc_hits = 0;
static __thread uint64_t g_tiny_alloc_cycles = 0;
static __thread uint64_t g_tiny_refill_calls = 0;
static __thread uint64_t g_tiny_refill_cycles = 0;
static int g_tiny_profile_enabled = -1; // -1: uninitialized
static inline int tiny_profile_enabled(void) {
    if (__builtin_expect(g_tiny_profile_enabled == -1, 0)) {
        const char* env = getenv("HAKMEM_TINY_PROFILE");
        g_tiny_profile_enabled = (env && *env && *env != '0') ? 1 : 0;
    }
    return g_tiny_profile_enabled;
}
// Print profiling results at exit.
// Note: the counters are thread-local, so this destructor reports only the
// counters of the thread that runs it (normally the main thread).
static void tiny_fast_print_profile(void) __attribute__((destructor));
static void tiny_fast_print_profile(void) {
    if (!tiny_profile_enabled()) return;
    if (g_tiny_alloc_hits == 0 && g_tiny_refill_calls == 0) return;
    fprintf(stderr, "\n========== Box Theory Fast Path Profile ==========\n");
    if (g_tiny_alloc_hits > 0) {
        fprintf(stderr, "[ALLOC HIT] count=%lu, avg_cycles=%lu\n",
                (unsigned long)g_tiny_alloc_hits,
                (unsigned long)(g_tiny_alloc_cycles / g_tiny_alloc_hits));
    }
    if (g_tiny_refill_calls > 0) {
        fprintf(stderr, "[REFILL] count=%lu, avg_cycles=%lu\n",
                (unsigned long)g_tiny_refill_calls,
                (unsigned long)(g_tiny_refill_cycles / g_tiny_refill_calls));
    }
    fprintf(stderr, "===================================================\n\n");
}
// ========== Fast Path: TLS Freelist Pop (3-4 instructions) ==========
// External SFC control (defined in hakmem_tiny_sfc.c)
extern int g_sfc_enabled;
// Allocation fast path (inline for zero-cost)
// Returns: pointer on success, NULL on miss (caller should try refill/slow)
//
// Box 5-NEW Architecture:
// Layer 0: SFC (128-256 slots, high hit rate) [if enabled]
// Layer 1: SLL (unlimited, existing)
// Cascade: SFC miss → try SLL → refill
//
// Assembly (x86-64, optimized):
// mov rax, QWORD PTR g_sfc_head[class_idx] ; SFC: Load head
// test rax, rax ; Check NULL
// jne .sfc_hit ; If not empty, SFC hit!
// mov rax, QWORD PTR g_tls_sll_head[class_idx] ; SLL: Load head
// test rax, rax ; Check NULL
// je .miss ; If empty, miss
// mov rdx, QWORD PTR [rax] ; Load next
// mov QWORD PTR g_tls_sll_head[class_idx], rdx ; Update head
// ret ; Return ptr
// .sfc_hit:
// mov rdx, QWORD PTR [rax] ; Load next
// mov QWORD PTR g_sfc_head[class_idx], rdx ; Update head
// ret
// .miss:
// ; Fall through to refill
//
// Expected: 3-4 instructions on SFC hit, 6-8 on SLL hit
static inline void* tiny_alloc_fast_pop(int class_idx) {
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
    void* out = NULL;
    if (front_gate_try_pop(class_idx, &out)) {
        return out;
    }
    return NULL;
#else
    // Phase 7 Task 3: Profiling overhead removed in release builds
    // In release mode, compiler can completely eliminate profiling code
#if !HAKMEM_BUILD_RELEASE
    uint64_t start = tiny_profile_enabled() ? tiny_fast_rdtsc() : 0;
#endif
    // Box 5-NEW: Layer 0 - Try SFC first (if enabled)
    // Cache g_sfc_enabled in TLS to avoid global load on every allocation
    static __thread int sfc_check_done = 0;
    static __thread int sfc_is_enabled = 0;
    if (__builtin_expect(!sfc_check_done, 0)) {
        sfc_is_enabled = g_sfc_enabled;
        sfc_check_done = 1;
    }
    if (__builtin_expect(sfc_is_enabled, 1)) {
        void* ptr = sfc_alloc(class_idx);
        if (__builtin_expect(ptr != NULL, 1)) {
            // Front Gate: SFC hit
            extern unsigned long long g_front_sfc_hit[];
            g_front_sfc_hit[class_idx]++;
            // 🚀 SFC HIT! (Layer 0)
#if !HAKMEM_BUILD_RELEASE
            if (start) {
                g_tiny_alloc_cycles += (tiny_fast_rdtsc() - start);
                g_tiny_alloc_hits++;
            }
#endif
            return ptr;
        }
        // SFC miss → try SLL (Layer 1)
    }
    // Box Boundary: Layer 1 - pop the head of the TLS SLL freelist (can be disabled via env)
    extern int g_tls_sll_enable; // set at init via HAKMEM_TINY_TLS_SLL
    if (__builtin_expect(g_tls_sll_enable, 1)) {
        void* head = g_tls_sll_head[class_idx];
        if (__builtin_expect(head != NULL, 1)) {
            // CORRUPTION DEBUG: Validate TLS SLL head before popping
            if (__builtin_expect(tiny_refill_failfast_level() >= 2, 0)) {
                size_t blk = g_tiny_class_sizes[class_idx];
                // Check alignment (must be multiple of block size)
                if (((uintptr_t)head % blk) != 0) {
                    fprintf(stderr, "[TLS_SLL_CORRUPT] cls=%d head=%p misaligned (blk=%zu offset=%zu)\n",
                            class_idx, head, blk, (uintptr_t)head % blk);
                    fprintf(stderr, "[TLS_SLL_CORRUPT] TLS freelist head is corrupted!\n");
                    abort();
                }
            }
            // Front Gate: SLL hit (fast path 3 instructions)
            extern unsigned long long g_front_sll_hit[];
            g_front_sll_hit[class_idx]++;
            // CORRUPTION DEBUG: Validate next pointer before updating head
            void* next = *(void**)head;
            if (__builtin_expect(tiny_refill_failfast_level() >= 2, 0)) {
                size_t blk = g_tiny_class_sizes[class_idx];
                if (next != NULL && ((uintptr_t)next % blk) != 0) {
                    fprintf(stderr, "[ALLOC_POP_CORRUPT] Reading next from head=%p got corrupted next=%p!\n",
                            head, next);
                    fprintf(stderr, "[ALLOC_POP_CORRUPT] cls=%d blk=%zu next_offset=%zu (expected 0)\n",
                            class_idx, blk, (uintptr_t)next % blk);
                    fprintf(stderr, "[ALLOC_POP_CORRUPT] TLS SLL head block was corrupted (use-after-free/double-free)!\n");
                    abort();
                }
                fprintf(stderr, "[ALLOC_POP] cls=%d head=%p next=%p\n", class_idx, head, next);
            }
            g_tls_sll_head[class_idx] = next; // Pop: head advances to next (= *head)
            // Optional: update count (for stats, can be disabled)
            if (g_tls_sll_count[class_idx] > 0) {
                g_tls_sll_count[class_idx]--;
            }
#if HAKMEM_DEBUG_COUNTERS
            // Track TLS freelist hits (compile-time gated, zero runtime cost when disabled)
            g_free_via_tls_sll[class_idx]++;
#endif
#if !HAKMEM_BUILD_RELEASE
            // Debug: Track profiling (release builds skip this overhead)
            if (start) {
                g_tiny_alloc_cycles += (tiny_fast_rdtsc() - start);
                g_tiny_alloc_hits++;
            }
#endif
            return head;
        }
    }
    // Fast path miss → NULL (caller should refill)
    return NULL;
#endif
}
// ========== Cascade Refill: SFC ← SLL (Box Theory boundary) ==========
// Cascade refill: Transfer blocks from SLL to SFC (one-way, safe)
// Returns: number of blocks transferred
//
// Contract:
// - Transfer ownership: SLL → SFC
// - No circular dependency: one-way only
// - Boundary clear: SLL pop → SFC push
// - Fallback safe: if SFC full, stop (no overflow)
static inline int sfc_refill_from_sll(int class_idx, int target_count) {
    int transferred = 0;
    uint32_t cap = g_sfc_capacity[class_idx];
    while (transferred < target_count && g_tls_sll_count[class_idx] > 0) {
        // Check SFC capacity before transfer
        if (g_sfc_count[class_idx] >= cap) {
            break; // SFC full, stop
        }
        // Pop from SLL (Layer 1)
        void* ptr = g_tls_sll_head[class_idx];
        if (!ptr) break; // SLL empty
        g_tls_sll_head[class_idx] = *(void**)ptr;
        g_tls_sll_count[class_idx]--;
        // Push to SFC (Layer 0)
        *(void**)ptr = g_sfc_head[class_idx];
        g_sfc_head[class_idx] = ptr;
        g_sfc_count[class_idx]++;
        transferred++;
    }
    return transferred;
}
// ========== Refill Path: Backend Integration ==========
// Refill TLS freelist from backend (SuperSlab/ACE/Learning layer)
// Returns: number of blocks refilled
//
// Box 5-NEW Architecture:
// SFC enabled: SuperSlab → SLL → SFC (cascade)
// SFC disabled: SuperSlab → SLL (direct, old path)
//
// This integrates with existing HAKMEM infrastructure:
// - SuperSlab provides memory chunks
// - ACE provides adaptive capacity learning
// - L25 provides mid-large integration
//
// Refill count is tunable via HAKMEM_TINY_REFILL_COUNT (default: 16)
// - Smaller count (8-16): better for diverse workloads, faster warmup
// - Larger count (64-128): better for homogeneous workloads, fewer refills
static inline int tiny_alloc_fast_refill(int class_idx) {
    // Phase 7 Task 3: Profiling overhead removed in release builds
    // In release mode, compiler can completely eliminate profiling code
#if !HAKMEM_BUILD_RELEASE
    uint64_t start = tiny_profile_enabled() ? tiny_fast_rdtsc() : 0;
#endif
    // Phase 2b: Check available capacity before refill
    int available_capacity = get_available_capacity(class_idx);
    if (available_capacity <= 0) {
        // Cache is full, don't refill
        return 0;
    }
    // Phase 7 Task 3: Simplified refill count (cached per-class in TLS)
    // Previous: Complex precedence logic on every miss (5-10 cycles overhead)
    // Now: Simple TLS cache lookup (1-2 cycles)
    static __thread int s_refill_count[TINY_NUM_CLASSES] = {0};
    int cnt = s_refill_count[class_idx];
    if (__builtin_expect(cnt == 0, 0)) {
        // First miss: Initialize from globals (parsed at init time)
        int v = HAKMEM_TINY_REFILL_DEFAULT; // Default from hakmem_build_flags.h
        // Precedence: per-class > hot/mid > global
        if (g_refill_count_class[class_idx] > 0) {
            v = g_refill_count_class[class_idx];
        } else if (class_idx <= 3 && g_refill_count_hot > 0) {
            v = g_refill_count_hot;
        } else if (class_idx >= 4 && g_refill_count_mid > 0) {
            v = g_refill_count_mid;
        } else if (g_refill_count_global > 0) {
            v = g_refill_count_global;
        }
        // Clamp to sane range (min: 8, max: 256)
        if (v < 8) v = 8;     // Minimum: avoid thrashing
        if (v > 256) v = 256; // Maximum: avoid excessive TLS memory
        s_refill_count[class_idx] = v;
        cnt = v;
    }
    // Phase 2b: Clamp refill count to available capacity
    if (cnt > available_capacity) {
        cnt = available_capacity;
    }
#if HAKMEM_DEBUG_COUNTERS
    // Track refill calls (compile-time gated)
    g_rf_total_calls[class_idx]++;
#endif
    // Box Boundary: Delegate to Backend (Box 3: SuperSlab)
    // This gives us ACE, Learning layer, L25 integration for free!
    // Note: g_rf_hit_slab counter is incremented inside sll_refill_small_from_ss()
    int refilled = sll_refill_small_from_ss(class_idx, cnt);
    // Phase 2b: Track refill and adapt cache size
    if (refilled > 0) {
        track_refill_for_adaptation(class_idx);
    }
    // Box 5-NEW: Cascade refill SFC ← SLL (if SFC enabled)
    // This happens AFTER SuperSlab → SLL refill, so SLL has blocks
    static __thread int sfc_check_done_refill = 0;
    static __thread int sfc_is_enabled_refill = 0;
    if (__builtin_expect(!sfc_check_done_refill, 0)) {
        sfc_is_enabled_refill = g_sfc_enabled;
        sfc_check_done_refill = 1;
    }
    if (sfc_is_enabled_refill && refilled > 0) {
        // Transfer half of refilled blocks to SFC (keep half in SLL for future)
        int sfc_target = refilled / 2;
        if (sfc_target > 0) {
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
            front_gate_after_refill(class_idx, refilled);
#else
            int transferred = sfc_refill_from_sll(class_idx, sfc_target);
            (void)transferred; // Unused, but could track stats
#endif
        }
    }
#if !HAKMEM_BUILD_RELEASE
    // Debug: Track profiling (release builds skip this overhead)
    if (start) {
        g_tiny_refill_cycles += (tiny_fast_rdtsc() - start);
        g_tiny_refill_calls++;
    }
#endif
    return refilled;
}
// ========== Combined Fast Path (Alloc + Refill) ==========
// Complete fast path allocation (inline for zero-cost)
// Returns: pointer on success, NULL on failure (OOM or size too large)
//
// Flow:
// 1. TLS freelist pop (3-4 instructions) - Hit rate ~95%
// 2. Miss → Refill from backend (~5% cases)
// 3. Refill success → Retry pop
// 4. Refill failure → Slow path (OOM or new SuperSlab allocation)
//
// Example usage:
// void* ptr = tiny_alloc_fast(64);
// if (!ptr) {
// // OOM handling
// }
static inline void* tiny_alloc_fast(size_t size) {
    // 1. Size → class index (inline, fast)
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0, 0)) {
        return NULL; // Size > 1KB, not Tiny
    }
    ROUTE_BEGIN(class_idx);
    // 2. Fast path: TLS freelist pop (3-4 instructions, 95% hit rate)
    void* ptr;
#if HAKMEM_TINY_AGGRESSIVE_INLINE
    // Task 2: Use inline macro (save 5-10 cycles, no function call)
    TINY_ALLOC_FAST_POP_INLINE(class_idx, ptr);
#else
    // Standard: Function call (preserves debugging visibility)
    ptr = tiny_alloc_fast_pop(class_idx);
#endif
    if (__builtin_expect(ptr != NULL, 1)) {
        HAK_RET_ALLOC(class_idx, ptr);
    }
    // 3. Miss: Refill from backend (Box 3: SuperSlab)
    int refilled = tiny_alloc_fast_refill(class_idx);
    if (__builtin_expect(refilled > 0, 1)) {
        // Refill success → retry pop
#if HAKMEM_TINY_AGGRESSIVE_INLINE
        TINY_ALLOC_FAST_POP_INLINE(class_idx, ptr);
#else
        ptr = tiny_alloc_fast_pop(class_idx);
#endif
        if (ptr) {
            HAK_RET_ALLOC(class_idx, ptr);
        }
    }
    // 4. Refill failure or still empty → slow path (OOM or new SuperSlab)
    // Box Boundary: Delegate to Slow Path (Box 3 backend)
    ptr = hak_tiny_alloc_slow(size, class_idx);
    if (ptr) {
        HAK_RET_ALLOC(class_idx, ptr);
    }
    return ptr; // NULL if OOM
}
// ========== Push to TLS Freelist (for free path) ==========
// Push block to TLS freelist (used by free fast path)
// This is a "helper" for Box 6 (Free Fast Path)
//
// Invariant: ptr must belong to current thread (no ownership check here)
// Caller (Box 6) is responsible for ownership verification
static inline void tiny_alloc_fast_push(int class_idx, void* ptr) {
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
    front_gate_push_tls(class_idx, ptr);
#else
    // Box Boundary: Push to TLS freelist
    *(void**)ptr = g_tls_sll_head[class_idx];
    g_tls_sll_head[class_idx] = ptr;
    g_tls_sll_count[class_idx]++;
#endif
}
// ========== Statistics & Diagnostics ==========
// Get TLS freelist stats (for debugging/profiling)
typedef struct {
    int class_idx;
    void* head;
    uint32_t count;
} TinyAllocFastStats;
static inline TinyAllocFastStats tiny_alloc_fast_stats(int class_idx) {
    TinyAllocFastStats stats = {
        .class_idx = class_idx,
        .head = g_tls_sll_head[class_idx],
        .count = g_tls_sll_count[class_idx]
    };
    return stats;
}
// Reset TLS freelist (for testing/benchmarking)
// WARNING: This leaks memory! Only use in controlled test environments.
static inline void tiny_alloc_fast_reset(int class_idx) {
    g_tls_sll_head[class_idx] = NULL;
    g_tls_sll_count[class_idx] = 0;
}
// ========== Performance Notes ==========
//
// Expected metrics (based on System tcache & HAKX +171% results):
// - Fast path hit rate: 95%+ (workload dependent)
// - Fast path latency: 3-4 instructions (a few cycles when the freelist head is in L1 cache)
// - Miss penalty: ~20-50 instructions (refill from SuperSlab)
// - Throughput improvement: +10-25% vs current multi-layer design
//
// Key optimizations:
// 1. `__builtin_expect` for branch prediction (hot path first)
// 2. `static inline` for zero-cost abstraction
// 3. TLS variables (no atomic ops, no locks)
// 4. Minimal work in fast path (defer stats/accounting to backend)
//
// Comparison with current design:
// - Current: 20+ instructions (Magazine → SuperSlab → ACE → ...)
// - New: 3-4 instructions (TLS freelist pop only)
// - Reduction: -80% instructions in hot path
//
// Inspired by:
// - System tcache (glibc malloc) - 3-4 instruction fast path
// - HAKX Mid-Large (+171%) - "Simple Front + Smart Back"
// - Box Theory - Clear boundaries, minimal coupling
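
The pop, refill, retry flow described in this file can be tried in isolation with the sketch below. It is a simplified stand-in, not the HAKMEM implementation: it has a single size class, a static arena plays the role of the SuperSlab backend, and `DemoBlock`, `demo_push`, `demo_pop`, `demo_refill`, and `demo_alloc` are hypothetical names. Like the real header, it assumes GCC or Clang (`__builtin_expect`).

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BLOCK_SIZE   64u
#define DEMO_REFILL_COUNT 16
#define DEMO_ARENA_BLOCKS 64u

/* A free block stores the next pointer in its first bytes (intrusive list). */
typedef union {
    void* next;
    unsigned char bytes[DEMO_BLOCK_SIZE];
} DemoBlock;

/* Backing store standing in for the backend; not thread-safe in this sketch. */
static DemoBlock demo_arena[DEMO_ARENA_BLOCKS];
static size_t demo_arena_used = 0;

/* Per-thread freelist head and count: no locks or atomics on the hot path. */
static _Thread_local void* demo_tls_head = NULL;
static _Thread_local unsigned demo_tls_count = 0;

static inline void demo_push(void* ptr) {
    *(void**)ptr = demo_tls_head;
    demo_tls_head = ptr;
    demo_tls_count++;
}

static inline void* demo_pop(void) {
    void* head = demo_tls_head;
    if (__builtin_expect(head != NULL, 1)) {
        demo_tls_head = *(void**)head;   /* advance head to the next free block */
        demo_tls_count--;
    }
    return head;
}

/* Move up to DEMO_REFILL_COUNT blocks from the arena onto the freelist. */
static int demo_refill(void) {
    int n = 0;
    while (n < DEMO_REFILL_COUNT && demo_arena_used < DEMO_ARENA_BLOCKS) {
        demo_push(&demo_arena[demo_arena_used++]);
        n++;
    }
    return n;
}

/* Pop; on a miss refill once and retry; NULL means the arena is exhausted. */
static void* demo_alloc(void) {
    void* p = demo_pop();
    if (__builtin_expect(p != NULL, 1)) return p;
    return demo_refill() > 0 ? demo_pop() : NULL;
}
```

Because the list is LIFO, a block that was just freed with `demo_push` is the next one returned by `demo_alloc`, which keeps recently touched memory hot in cache.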