# TLS Superslab Hint Box - Design Document

**Phase**: Headerless Performance Optimization - Phase 1
**Date**: 2025-12-03
**Status**: Design Review
**Author**: hakmem team

---

## 1. Executive Summary

The TLS Superslab Hint Box is a thread-local cache that accelerates pointer-to-SuperSlab resolution in Headerless mode. When `HAKMEM_TINY_HEADERLESS=1` is enabled, every free() operation must translate a user pointer to its owning SuperSlab. Currently this goes through `hak_super_lookup()`, a hash table lookup costing 10-50 cycles. Caching recently used SuperSlab references in thread-local storage reduces this to 2-5 cycles on a cache hit (85-95% hit rate expected).

**Expected Performance Improvement**: 15-20% throughput increase (54.60 → 64-68 Mops/s on sh8bench)

**Risk Level**: Low
- Thread-local storage eliminates cache coherency issues
- Magic number validation provides fail-safe fallback
- Self-contained Box with minimal integration surface
- Memory overhead: ~104 bytes per thread on 64-bit targets, 120 bytes with profiling counters (negligible)

---

## 2. Box Definition (Box Theory)

```
Box: TLS Superslab Hint Cache

MISSION:
  Cache recently-used SuperSlab references in TLS to accelerate
  ptr→SuperSlab resolution in Headerless mode, avoiding expensive
  hash table lookups on the critical free() path.

DESIGN:
  - Provides O(1) lookup for hot SuperSlabs (L1 cache hit, 2-5 cycles)
  - Falls back to global registry on miss (fail-safe, no data loss)
  - No ownership, no remote queues, pure read-only cache
  - FIFO eviction policy with configurable cache size (2-4 slots)

INVARIANTS:
  - hint.base <= ptr < hint.end implies hint.ss is valid
  - Miss is always safe (triggers fallback to hak_super_lookup)
  - TLS data survives only within thread lifetime
  - Cache entries are invalidated implicitly by FIFO rotation
  - Magic number check (SUPERSLAB_MAGIC) validates all pointers

BOUNDARY:
  - Input: raw user pointer (void* ptr) from free() path
  - Output: SuperSlab* or NULL (miss triggers fallback)
  - Does NOT determine class_idx (that's slab_index_for's job)
  - Does NOT perform ownership validation (that's SuperSlab's job)

PERFORMANCE:
  - Cache hit: 2-5 cycles (L1 cache hit, 4 pointer comparisons)
  - Cache miss: fallback to hak_super_lookup (10-50 cycles)
  - Expected hit rate: 85-95% for single-threaded workloads
  - Expected hit rate: 70-85% for multi-threaded workloads

THREAD SAFETY:
  - TLS storage: no sharing, no synchronization required
  - Read-only cache: never modifies SuperSlab state
  - Stale entries: caught by magic number check
```

---

## 3. Data Structures

```c
// core/box/tls_ss_hint_box.h

#ifndef TLS_SS_HINT_BOX_H
#define TLS_SS_HINT_BOX_H

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>   // size_t

// Forward declaration
struct SuperSlab;

// Cache entry for a single SuperSlab hint
// Size: 24 bytes (cache-friendly, fits in 1 cache line with metadata)
typedef struct {
    void* base;            // SuperSlab base address (aligned to 1MB or 2MB)
    void* end;             // base + superslab_size (for range check)
    struct SuperSlab* ss;  // Cached SuperSlab pointer
} TlsSsHintEntry;

// TLS hint cache configuration
// - 4 slots provide good hit rate without excessive overhead
// - Larger caches (8, 16) show diminishing returns in benchmarks
// - Smaller caches (2) may thrash on workloads with 3+ active SuperSlabs
#define TLS_SS_HINT_SLOTS 4

// Thread-local SuperSlab hint cache
// Total size on LP64: 24*4 + 8 = 104 bytes per thread in release builds,
// 120 bytes with the two stats counters (negligible overhead)
typedef struct {
    TlsSsHintEntry entries[TLS_SS_HINT_SLOTS];  // Cache entries
    uint32_t count;      // Number of valid entries (0 to TLS_SS_HINT_SLOTS)
    uint32_t next_slot;  // Next slot for FIFO rotation (wraps at TLS_SS_HINT_SLOTS)

    // Statistics (optional, for profiling builds)
    // Disabled in HAKMEM_BUILD_RELEASE to save 16 bytes per thread
#if !HAKMEM_BUILD_RELEASE
    uint64_t hits;    // Cache hit count
    uint64_t misses;  // Cache miss count
#endif
} TlsSsHintCache;

// Thread-local storage instance
// Initialized to zero by TLS semantics, formal init in tls_ss_hint_init()
extern __thread TlsSsHintCache g_tls_ss_hint;

#endif // TLS_SS_HINT_BOX_H
```

---

## 4. API Design

```c
// core/box/tls_ss_hint_box.h (continued)

/**
 * @brief Initialize TLS hint cache for current thread
 *
 * Call once per thread, typically in thread-local initialization path.
 * Safe to call multiple times (idempotent).
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~10 cycles (negligible one-time cost)
 */
static inline void tls_ss_hint_init(void);

/**
 * @brief Update hint cache with a SuperSlab reference
 *
 * Called on paths where we know the SuperSlab for a given address range:
 * - After successful tiny_alloc (cache the allocated-from SuperSlab)
 * - After superslab refill (cache the newly bound SuperSlab)
 * - After unified cache refill (cache the refilled SuperSlab)
 *
 * Duplicate detection: If the SuperSlab is already cached, no update occurs.
 * This prevents thrashing when repeatedly allocating from the same SuperSlab.
 *
 * @param ss   SuperSlab to cache (must be non-NULL, SUPERSLAB_MAGIC validated by caller)
 * @param base SuperSlab base address (1MB or 2MB aligned)
 * @param size SuperSlab size in bytes (1MB or 2MB)
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~15-20 cycles (duplicate check + FIFO rotation)
 */
static inline void tls_ss_hint_update(struct SuperSlab* ss, void* base, size_t size);

/**
 * @brief Lookup SuperSlab for given pointer (fast path)
 *
 * Called on free() entry, before falling back to hak_super_lookup().
 * Performs linear search over cached entries (4 iterations max).
 *
 * Cache hit: Returns true, sets *out_ss to cached SuperSlab pointer
 * Cache miss: Returns false, caller must use hak_super_lookup()
 *
 * @param ptr    User pointer to lookup (arbitrary alignment)
 * @param out_ss Output: SuperSlab pointer if found (only valid if return true)
 * @return true if cache hit (out_ss is valid), false if miss
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: 2-5 cycles (hit), 8-12 cycles (miss)
 *
 * NOTE: Caller MUST validate SUPERSLAB_MAGIC after successful lookup.
 *       This Box does not perform magic validation to keep fast path minimal.
 */
static inline bool tls_ss_hint_lookup(void* ptr, struct SuperSlab** out_ss);

/**
 * @brief Clear all cached hints (for testing/reset)
 *
 * Use cases:
 * - Unit tests: Reset cache between test cases
 * - Debug: Force cache cold start for profiling
 * - Thread teardown: Optional cleanup (TLS auto-cleanup on thread exit)
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~10 cycles
 */
static inline void tls_ss_hint_clear(void);

/**
 * @brief Get cache statistics (for profiling builds)
 *
 * Returns hit/miss counters for performance analysis.
 * Only available in non-release builds (HAKMEM_BUILD_RELEASE=0).
 *
 * @param hits   Output: Total cache hits
 * @param misses Output: Total cache misses
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~5 cycles (two loads)
 */
#if !HAKMEM_BUILD_RELEASE
static inline void tls_ss_hint_stats(uint64_t* hits, uint64_t* misses);
#endif
```

---

## 5. Implementation Details

```c
// core/box/tls_ss_hint_box.c (or inline in .h for header-only Box)

#include "tls_ss_hint_box.h"
#include "../hakmem_tiny_superslab.h"  // For SuperSlab, SUPERSLAB_MAGIC

// Thread-local storage definition
__thread TlsSsHintCache g_tls_ss_hint = {0};

/**
 * Initialize TLS hint cache
 * Safe to call multiple times (idempotent: it only resets state to empty)
 */
static inline void tls_ss_hint_init(void) {
    // Zero-initialization by TLS, but explicit init for clarity
    g_tls_ss_hint.count = 0;
    g_tls_ss_hint.next_slot = 0;

#if !HAKMEM_BUILD_RELEASE
    g_tls_ss_hint.hits = 0;
    g_tls_ss_hint.misses = 0;
#endif

    // Clear all entries (paranoid, but cache-friendly loop)
    for (int i = 0; i < TLS_SS_HINT_SLOTS; i++) {
        g_tls_ss_hint.entries[i].base = NULL;
        g_tls_ss_hint.entries[i].end = NULL;
        g_tls_ss_hint.entries[i].ss = NULL;
    }
}

/**
 * Update hint cache with SuperSlab reference
 * FIFO rotation: oldest entry is evicted when cache is full
 * Duplicate detection: skip if SuperSlab already cached
 */
static inline void tls_ss_hint_update(struct SuperSlab* ss, void* base, size_t size)
{
    // Sanity check: reject invalid inputs
    if (__builtin_expect(!ss || !base || size == 0, 0)) {
        return;
    }

    // Duplicate detection: check if this SuperSlab is already cached
    // This prevents thrashing when allocating from the same SuperSlab repeatedly
    for (uint32_t i = 0; i < g_tls_ss_hint.count; i++) {
        if (g_tls_ss_hint.entries[i].ss == ss) {
            return;  // Already cached, no update needed
        }
    }

    // Add to next slot (FIFO rotation)
    uint32_t slot = g_tls_ss_hint.next_slot;
    g_tls_ss_hint.entries[slot].base = base;
    g_tls_ss_hint.entries[slot].end = (char*)base + size;
    g_tls_ss_hint.entries[slot].ss = ss;

    // Advance to next slot (wrap at TLS_SS_HINT_SLOTS)
    g_tls_ss_hint.next_slot = (slot + 1) % TLS_SS_HINT_SLOTS;

    // Increment count until cache is full
    if (g_tls_ss_hint.count < TLS_SS_HINT_SLOTS) {
        g_tls_ss_hint.count++;
    }
}

/**
 * Lookup SuperSlab for pointer (fast path)
 * Linear search over cached entries (4 iterations max)
 *
 * Performance note:
 * - Linear search is faster than hash table for small N (N <= 8)
 * - Range comparison (ptr >= base && ptr < end) is 2-3 cycles
 * - Total cost: 2-5 cycles (hit), 8-12 cycles (miss with 4 entries)
 */
static inline bool tls_ss_hint_lookup(void* ptr, struct SuperSlab** out_ss)
{
    // Fast path: iterate over valid entries
    // Unrolling this loop (if count is small) is beneficial, but let compiler decide
    for (uint32_t i = 0; i < g_tls_ss_hint.count; i++) {
        TlsSsHintEntry* e = &g_tls_ss_hint.entries[i];

        // Range check: base <= ptr < end
        // Note: end is exclusive (base + size), so use < not <=
        if (ptr >= e->base && ptr < e->end) {
            // Cache hit!
            *out_ss = e->ss;

#if !HAKMEM_BUILD_RELEASE
            g_tls_ss_hint.hits++;
#endif

            return true;
        }
    }

    // Cache miss: caller must fall back to hak_super_lookup()
#if !HAKMEM_BUILD_RELEASE
    g_tls_ss_hint.misses++;
#endif

    return false;
}

/**
 * Clear all cached hints
 * Use for testing or manual reset
 */
static inline void tls_ss_hint_clear(void)
{
    g_tls_ss_hint.count = 0;
    g_tls_ss_hint.next_slot = 0;

#if !HAKMEM_BUILD_RELEASE
    // Preserve stats across clear (for cumulative profiling)
    // Uncomment to reset stats:
    // g_tls_ss_hint.hits = 0;
    // g_tls_ss_hint.misses = 0;
#endif

    // Optional: zero out entries (paranoid, not required for correctness)
    for (int i = 0; i < TLS_SS_HINT_SLOTS; i++) {
        g_tls_ss_hint.entries[i].base = NULL;
        g_tls_ss_hint.entries[i].end = NULL;
        g_tls_ss_hint.entries[i].ss = NULL;
    }
}

/**
 * Get cache statistics (profiling builds only)
 */
#if !HAKMEM_BUILD_RELEASE
static inline void tls_ss_hint_stats(uint64_t* hits, uint64_t* misses)
{
    if (hits) *hits = g_tls_ss_hint.hits;
    if (misses) *misses = g_tls_ss_hint.misses;
}
#endif
```

---

## 6. Integration Points

### 6.1 Update Points: When to Call `tls_ss_hint_update()`

The hint cache should be updated whenever we know the SuperSlab for an address range.
This happens on allocation success paths:

#### Location 1: After Successful Tiny Alloc (hakmem_tiny.c)

```c
// In hak_tiny_alloc or similar allocation path
void* ptr = tiny_allocate_from_superslab(class_idx, &ss);
if (ptr) {
#if HAKMEM_TINY_SS_TLS_HINT
    // Cache the SuperSlab we just allocated from
    // This improves free() performance for LIFO allocation patterns
    tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
#endif
    return ptr;
}
```

#### Location 2: After SuperSlab Refill (hakmem_tiny_refill.inc.h)

```c
// In tiny_refill_from_superslab or superslab_allocate
SuperSlab* ss = superslab_allocate(class_idx);
if (ss) {
    // Bind SuperSlab to thread's TLS state
    bind_superslab_to_thread(ss, class_idx);

#if HAKMEM_TINY_SS_TLS_HINT
    // Cache the newly bound SuperSlab
    // Future allocations from this SuperSlab will have cached hint
    tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
#endif
}
```

#### Location 3: Unified Cache Refill (core/front/tiny_unified_cache.c)

```c
// In unified_cache_refill_class
void* block = superslab_alloc_block(class_idx, &ss);
if (block) {
#if HAKMEM_TINY_SS_TLS_HINT
    // Cache the SuperSlab that provided this block
    tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
#endif

    // Push to unified cache
    unified_cache_push(class_idx, block);
}
```

#### Location 4: Thread-Local Init (hakmem_tiny_tls_init)

```c
// In tiny_tls_init or thread_local_init
void tiny_tls_init(void) {
    // Initialize TLS structures
    tiny_magazine_init();
    tiny_sll_init();

#if HAKMEM_TINY_SS_TLS_HINT
    // Initialize hint cache (zero-init by TLS, but explicit for clarity)
    tls_ss_hint_init();
#endif
}
```

### 6.2 Lookup Points: When to Call `tls_ss_hint_lookup()`

The hint lookup should be the **first step** in the free() path, before falling back to the registry lookup:

#### Location 1: Tiny Free Entry (core/hakmem_tiny_free.inc)

```c
// In hak_tiny_free or similar free path
void hak_tiny_free(void* ptr) {
    if (!ptr) return;

    SuperSlab* ss = NULL;

#if HAKMEM_TINY_HEADERLESS
    // Phase 1: Try TLS hint cache (fast path, 2-5 cycles on hit)
#if HAKMEM_TINY_SS_TLS_HINT
    if (!tls_ss_hint_lookup(ptr, &ss)) {
#endif
        // Phase 2: Fallback to global registry (slow path, 10-50 cycles)
        ss = hak_super_lookup(ptr);
#if HAKMEM_TINY_SS_TLS_HINT
    }
#endif

    // Validate SuperSlab (magic check)
    if (!ss || ss->magic != SUPERSLAB_MAGIC) {
        // Invalid pointer - external guard path
        hak_external_guard_free(ptr);
        return;
    }

    // Proceed with free using SuperSlab info
    int class_idx = slab_index_for(ss, ptr);
    tiny_free_to_slab(ss, ptr, class_idx);

#else
    // Header mode: read class_idx from header (1-3 cycles)
    uint8_t hdr = *((uint8_t*)ptr - 1);
    int class_idx = hdr & 0x7;
    tiny_free_to_class(class_idx, ptr);
#endif
}
```

#### Location 2: Fast Free Path (core/tiny_free_fast_v2.inc.h)

```c
// In tiny_free_fast or inline free path
static inline void tiny_free_fast(void* ptr) {
#if HAKMEM_TINY_HEADERLESS
    SuperSlab* ss = NULL;

    // Try hint cache first
#if HAKMEM_TINY_SS_TLS_HINT
    if (!tls_ss_hint_lookup(ptr, &ss)) {
#endif
        ss = hak_super_lookup(ptr);
#if HAKMEM_TINY_SS_TLS_HINT
    }
#endif

    if (__builtin_expect(!ss || ss->magic != SUPERSLAB_MAGIC, 0)) {
        // Slow path: external guard or invalid pointer
        hak_tiny_free_slow(ptr);
        return;
    }

    // Fast path: push to TLS freelist
    int class_idx = slab_index_for(ss, ptr);
    front_gate_push_tls(class_idx, ptr);

#else
    // Header mode fast path
    uint8_t hdr = *((uint8_t*)ptr - 1);
    int class_idx = hdr & 0x7;
    front_gate_push_tls(class_idx, ptr);
#endif
}
```

---

## 7. Environment Variable

```c
// In hakmem_build_flags.h or similar configuration header

// ============================================================================
// Phase 1: Headerless Optimization - TLS SuperSlab Hint Cache
// ============================================================================
// Purpose: Accelerate ptr→SuperSlab lookup in Headerless mode
// Default: 0 (disabled during development and testing)
// Target:  1 (enabled after validation in Phase 1 rollout)
//
// Performance Impact:
//   - Cache hit: 2-5 cycles (vs 10-50 cycles for hak_super_lookup)
//   - Expected hit rate: 85-95% (single-threaded), 70-85% (multi-threaded)
//   - Expected throughput improvement: 15-20%
//
// Memory Overhead:
//   - 104 bytes per thread (TLS, release build on LP64)
//   - Negligible for typical workloads (1000 threads = 104KB)
//
// Dependencies:
//   - Requires HAKMEM_TINY_HEADERLESS=1 (hint is no-op in header mode)
//   - No other dependencies (self-contained Box)

#ifndef HAKMEM_TINY_SS_TLS_HINT
#define HAKMEM_TINY_SS_TLS_HINT 0
#endif

// Validation: Hint Box only active in Headerless mode
#if HAKMEM_TINY_SS_TLS_HINT && !HAKMEM_TINY_HEADERLESS
#error "HAKMEM_TINY_SS_TLS_HINT requires HAKMEM_TINY_HEADERLESS=1"
#endif
```

---

## 8. Testing Plan

### 8.1 Unit Tests

Create `/mnt/workdisk/public_share/hakmem/tests/test_tls_ss_hint.c`:

```c
#include <stdio.h>
#include <assert.h>
#include <stdint.h>
#include "core/box/tls_ss_hint_box.h"
#include "core/hakmem_tiny_superslab.h"

// Mock SuperSlab for testing
typedef struct {
    uint32_t magic;
    void* base_addr;
    size_t size_bytes;
    uint8_t size_class;
} MockSuperSlab;

void test_hint_init(void) {
    printf("test_hint_init...\n");

    tls_ss_hint_init();

    // Verify cache is empty
    assert(g_tls_ss_hint.count == 0);
    assert(g_tls_ss_hint.next_slot == 0);

#if !HAKMEM_BUILD_RELEASE
    assert(g_tls_ss_hint.hits == 0);
    assert(g_tls_ss_hint.misses == 0);
#endif

    printf("  PASS\n");
}

void test_hint_basic(void) {
    printf("test_hint_basic...\n");

    tls_ss_hint_init();

    // Mock SuperSlab covering [0x1000000, 0x1200000)
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,  // 2MB
        .size_class = 0
    };

    // Update hint
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    // Verify cache entry
    assert(g_tls_ss_hint.count == 1);
    assert(g_tls_ss_hint.entries[0].base == ss.base_addr);
    assert(g_tls_ss_hint.entries[0].ss == (SuperSlab*)&ss);

    // Lookup should hit (within range)
    SuperSlab* out = NULL;
    assert(tls_ss_hint_lookup((void*)0x1000100, &out) == true);
    assert(out == (SuperSlab*)&ss);

    // Lookup at base should hit
    assert(tls_ss_hint_lookup((void*)0x1000000, &out) == true);
    assert(out == (SuperSlab*)&ss);

    // Lookup at end-1 should hit (base + 2MB - 1)
    assert(tls_ss_hint_lookup((void*)0x11FFFFF, &out) == true);
    assert(out == (SuperSlab*)&ss);

    // Lookup at end should miss (exclusive boundary, base + 2MB)
    assert(tls_ss_hint_lookup((void*)0x1200000, &out) == false);

    // Lookup outside range should miss
    assert(tls_ss_hint_lookup((void*)0x3000000, &out) == false);

    printf("  PASS\n");
}

void test_hint_fifo_rotation(void) {
    printf("test_hint_fifo_rotation...\n");

    tls_ss_hint_init();

    // Create 6 mock SuperSlabs (cache has 4 slots)
    MockSuperSlab ss[6];
    for (int i = 0; i < 6; i++) {
        ss[i].magic = SUPERSLAB_MAGIC;
        ss[i].base_addr = (void*)(uintptr_t)(0x1000000 + i * 0x200000);  // 2MB apart
        ss[i].size_bytes = 2 * 1024 * 1024;
        ss[i].size_class = 0;

        tls_ss_hint_update((SuperSlab*)&ss[i], ss[i].base_addr, ss[i].size_bytes);
    }

    // Cache should be full (4 slots)
    assert(g_tls_ss_hint.count == TLS_SS_HINT_SLOTS);

    // First 2 SuperSlabs should be evicted (FIFO)
    SuperSlab* out = NULL;
    assert(tls_ss_hint_lookup((void*)0x1000100, &out) == false);  // ss[0] evicted
    assert(tls_ss_hint_lookup((void*)0x1200100, &out) == false);  // ss[1] evicted

    // Last 4 SuperSlabs should be cached
    assert(tls_ss_hint_lookup((void*)0x1400100, &out) == true);   // ss[2]
    assert(out == (SuperSlab*)&ss[2]);
    assert(tls_ss_hint_lookup((void*)0x1600100, &out) == true);   // ss[3]
    assert(out == (SuperSlab*)&ss[3]);
    assert(tls_ss_hint_lookup((void*)0x1800100, &out) == true);   // ss[4]
    assert(out == (SuperSlab*)&ss[4]);
    assert(tls_ss_hint_lookup((void*)0x1A00100, &out) == true);   // ss[5]
    assert(out == (SuperSlab*)&ss[5]);

    printf("  PASS\n");
}

void test_hint_duplicate_detection(void) {
    printf("test_hint_duplicate_detection...\n");

    tls_ss_hint_init();

    // Mock SuperSlab
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,
        .size_class = 0
    };

    // Update hint 3 times with same SuperSlab
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    // Cache should have only 1 entry (duplicates ignored)
    assert(g_tls_ss_hint.count == 1);
    assert(g_tls_ss_hint.entries[0].ss == (SuperSlab*)&ss);

    printf("  PASS\n");
}

void test_hint_clear(void) {
    printf("test_hint_clear...\n");

    tls_ss_hint_init();

    // Add some entries
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,
        .size_class = 0
    };
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
    assert(g_tls_ss_hint.count == 1);

    // Clear cache
    tls_ss_hint_clear();

    // Cache should be empty
    assert(g_tls_ss_hint.count == 0);
    assert(g_tls_ss_hint.next_slot == 0);

    // Lookup should miss
    SuperSlab* out = NULL;
    assert(tls_ss_hint_lookup((void*)0x1000100, &out) == false);

    printf("  PASS\n");
}

#if !HAKMEM_BUILD_RELEASE
void test_hint_stats(void) {
    printf("test_hint_stats...\n");

    tls_ss_hint_init();

    // Add entry
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,
        .size_class = 0
    };
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    // Perform lookups
    SuperSlab* out = NULL;
    tls_ss_hint_lookup((void*)0x1000100, &out);  // Hit
    tls_ss_hint_lookup((void*)0x1000200, &out);  // Hit
    tls_ss_hint_lookup((void*)0x3000000, &out);  // Miss

    // Check stats
    uint64_t hits = 0, misses = 0;
    tls_ss_hint_stats(&hits, &misses);
    assert(hits == 2);
    assert(misses == 1);

    printf("  PASS\n");
}
#endif

int main(void) {
    printf("Running TLS SS Hint Box unit tests...\n\n");

    test_hint_init();
    test_hint_basic();
    test_hint_fifo_rotation();
    test_hint_duplicate_detection();
    test_hint_clear();

#if !HAKMEM_BUILD_RELEASE
    test_hint_stats();
#endif

    printf("\nAll tests passed!\n");
    return 0;
}
```

### 8.2 Integration Tests

#### Test 1: Build Validation

```bash
# Test 1: Build with hint disabled (baseline)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"

# Test 2: Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Test 3: Verify hint is rejected in header mode (should error)
# make clean
# make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=0 -DHAKMEM_TINY_SS_TLS_HINT=1"
# Expected: Compile error (validation check in hakmem_build_flags.h)
```

#### Test 2: Benchmark Comparison

```bash
# Build baseline (hint disabled)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"

# Run benchmarks
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench > baseline.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 17545186520809 > cfrac_baseline.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 > larson_baseline.txt

# Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Run same benchmarks
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench > hint.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 17545186520809 > cfrac_hint.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 > larson_hint.txt

# Compare results
echo "=== sh8bench ==="
grep "Mops" baseline.txt hint.txt

echo "=== cfrac ==="
grep "time:" cfrac_baseline.txt cfrac_hint.txt

echo "=== larson ==="
grep "ops/s" larson_baseline.txt larson_hint.txt
```

#### Test 3: Hit Rate Profiling

```bash
# Build with stats enabled (non-release)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1 -DHAKMEM_BUILD_RELEASE=0"

# Add stats dump at exit (in hakmem_exit.c or similar).
# Counters are thread-local, so this reports the calling thread only.
# void dump_hint_stats(void) {
#     uint64_t hits = 0, misses = 0;
#     tls_ss_hint_stats(&hits, &misses);
#     uint64_t total = hits + misses;
#     fprintf(stderr, "[TLS_HINT_STATS] hits=%llu misses=%llu hit_rate=%.2f%%\n",
#             (unsigned long long)hits, (unsigned long long)misses,
#             total ? 100.0 * (double)hits / (double)total : 0.0);
# }

# Run benchmark and check hit rate
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep TLS_HINT_STATS
# Expected: hit_rate >= 85%
```

### 8.3 Correctness Tests

```bash
# Test with external pointer (should fall back to hak_super_lookup)
# This tests that cache misses are handled correctly

# Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Run sh8bench (allocates from multiple SuperSlabs)
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench

# No crashes or assertion failures = success
echo "Correctness test passed"
```

---

## 9. Performance Expectations

### 9.1 Cycle Count Analysis

| Operation | Without Hint | With Hint (Hit) | With Hint (Miss) | Improvement |
|-----------|-------------|----------------|-----------------|-------------|
| free() lookup | 10-50 cycles | 2-5 cycles | 10-50 cycles | 80-95% |
| Range check (per entry) | N/A | 2 cycles | 2 cycles | - |
| Hash table lookup | 10-50 cycles | N/A | 10-50 cycles | - |
| Total free() cost | 15-60 cycles | 7-15 cycles (hit) | 20-65 cycles (miss) | 40-60% |

### 9.2 Expected Hit Rates

| Workload | Hit Rate | Reasoning |
|----------|----------|-----------|
| Single-threaded LIFO | 95-99% | Free() immediately after alloc() from same SuperSlab |
| Single-threaded FIFO | 85-95% | Recent allocations from 2-4 SuperSlabs |
| Multi-threaded (8 threads) | 70-85% | Shared SuperSlabs, more cache thrashing |
| Larson (high churn) | 65-80% | Many active SuperSlabs, frequent evictions |

### 9.3 Benchmark Targets

| Benchmark | Baseline (no hint) | Target (with hint) | Improvement |
|-----------|-------------------|-------------------|-------------|
| sh8bench | 54.60 Mops/s | 64-68 Mops/s | +15-20% |
| cfrac | 1.25 sec | 1.10-1.15 sec | +10-15% |
| larson (8 threads) | 6.5M ops/s | 7.5-8.0M ops/s | +15-20% |

### 9.4 Memory Overhead

| Metric | Value | Notes |
|--------|-------|-------|
| Per-thread overhead | 104 bytes | TLS cache (release build, LP64) |
| Per-thread overhead (debug) | 120 bytes | TLS cache + stats counters |
| 1000 threads | 104 KB | Negligible for server workloads |
| 10000 threads | 1.04 MB | Still negligible |

---

## 10. Risk Analysis

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| **Cache coherency issues** | Very Low | Low | TLS is thread-local, no sharing between threads |
| **Stale hint after munmap** | Low | Low | Magic check (SUPERSLAB_MAGIC) catches freed SuperSlabs |
| **Cache thrashing (many SS)** | Low | Low | 4 slots cover typical workloads; miss falls back to registry |
| **Memory overhead** | Very Low | Very Low | 104 bytes/thread, negligible for most workloads |
| **Integration bugs** | Low | Medium | Self-contained Box, clear API, comprehensive tests |
| **Hit rate lower than expected** | Low | Low | Even a 50% hit rate is expected to be a net gain; a miss adds only the 8-12 cycle scan before the registry fallback |
| **Complexity increase** | Low | Low | 150 LOC, header-only Box, minimal dependencies |

### 10.1 Failure Modes and Recovery

| Failure Mode | Detection | Recovery |
|-------------|-----------|----------|
| Stale SuperSlab pointer | Magic check (SUPERSLAB_MAGIC != expected) | Fall back to hak_super_lookup() |
| Cache miss | tls_ss_hint_lookup returns false | Fall back to hak_super_lookup() |
| Invalid hint range | ptr outside [base, end) | Linear search continues, eventually misses |
| Thread teardown | TLS cleanup by OS | No manual cleanup needed |
| SuperSlab freed | Magic number cleared | Caught by magic check in free() path |

---

## 11. Future Considerations

### 11.1 Phase 2 Integration: Global Class Map

When Phase 2 introduces a Global Class Map (pointer → class_idx lookup), the TLS Hint Box becomes the first tier in a three-tier lookup hierarchy:

```
Tier 1 (fastest): TLS Hint Cache (2-5 cycles, 85-95% hit rate)
    ↓ miss
Tier 2 (medium):  Global Class Map (5-15 cycles, 99%+ hit rate)
    ↓ miss
Tier 3 (slowest): Global SuperSlab Registry (10-50 cycles, 100% hit rate)
```

**Integration point**:

```c
SuperSlab* ss = NULL;
int class_idx = -1;

// Tier 1: TLS hint (caller validates magic, as required by the lookup API)
#if HAKMEM_TINY_SS_TLS_HINT
if (tls_ss_hint_lookup(ptr, &ss) && ss->magic == SUPERSLAB_MAGIC) {
    class_idx = slab_index_for(ss, ptr);
    goto found;
}
#endif

// Tier 2: Global class map
#if HAKMEM_TINY_CLASS_MAP
class_idx = class_map_lookup(ptr);
if (class_idx >= 0) {
    ss = hak_super_lookup(ptr);  // Still need SS for metadata
    goto found;
}
#endif

// Tier 3: Registry fallback
ss = hak_super_lookup(ptr);
if (ss && ss->magic == SUPERSLAB_MAGIC) {
    class_idx = slab_index_for(ss, ptr);
    goto found;
}

// External pointer
hak_external_guard_free(ptr);
return;

found:
tiny_free_to_class(class_idx, ptr);
```

### 11.2 Adaptive Cache Sizing

Current design uses fixed `TLS_SS_HINT_SLOTS = 4`.
Future optimization could make this adaptive:
- **Workload detection**: Track hit rate over time windows
- **Dynamic sizing**: Increase slots (4 → 8) if hit rate < 80%
- **Memory pressure**: Decrease slots (8 → 2) if memory constrained

**Implementation sketch**:

```c
#define TLS_SS_HINT_SLOTS_MAX 8

typedef struct {
    uint32_t current_slots;   // Dynamic (2, 4, 8)
    uint64_t hits_window;
    uint64_t misses_window;
} TlsSsHintAdaptive;

// Per-thread tuning state; starts at the current fixed size
static __thread TlsSsHintAdaptive g_tls_ss_hint_adaptive = { .current_slots = 4 };

void tls_ss_hint_tune(void) {
    TlsSsHintAdaptive* a = &g_tls_ss_hint_adaptive;
    uint64_t total = a->hits_window + a->misses_window;
    if (total == 0) return;  // Empty window: nothing to measure

    double hit_rate = (double)a->hits_window / (double)total;

    if (hit_rate < 0.80 && a->current_slots < TLS_SS_HINT_SLOTS_MAX) {
        a->current_slots *= 2;  // Grow cache
    } else if (hit_rate > 0.95 && a->current_slots > 2) {
        a->current_slots /= 2;  // Shrink cache
    }

    // Start a new measurement window
    a->hits_window = 0;
    a->misses_window = 0;
}
```

### 11.3 LRU vs FIFO Eviction Policy

Current design uses FIFO (simple, predictable). Alternative: LRU with move-to-front on hit.

**LRU advantages**:
- Better hit rate for workloads with temporal locality
- Commonly used SuperSlabs stay cached longer

**LRU disadvantages**:
- 2-3 extra cycles per hit (move to front)
- More complex implementation (doubly-linked list)

**Benchmark before switching**: Profile sh8bench, larson, cfrac with both policies.

### 11.4 Per-Class Hint Caches

Current design: Single cache for all classes (4 entries, any class).
Alternative: Per-class caches (1 entry per class, 8 entries total).

**Per-class advantages**:
- Guaranteed cache slot for each class
- No inter-class eviction

**Per-class disadvantages**:
- Wastes space if only 2-3 classes are active
- More TLS overhead (8 entries vs 4)

**Recommendation**: Defer until benchmarks show inter-class thrashing.
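
The per-class alternative from 11.4 can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes 8 tiny size classes (the figure given above) and that the allocation path can pass its `class_idx` to the update call. The names `PcHintEntry`, `g_pc_hint`, `pc_hint_update`, and `pc_hint_lookup` are hypothetical and not part of hakmem.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 8 tiny size classes, one hint slot per class. */
#define PC_HINT_CLASSES 8

struct SuperSlab;  /* opaque in this sketch */

typedef struct {
    uintptr_t base;        /* inclusive start of the SuperSlab range */
    uintptr_t end;         /* exclusive end (base + size) */
    struct SuperSlab* ss;  /* NULL means the slot is empty */
} PcHintEntry;

/* Hypothetical per-thread table, zero-initialized by TLS semantics. */
static __thread PcHintEntry g_pc_hint[PC_HINT_CLASSES];

/* Overwrite the slot owned by class_idx; other classes are never evicted. */
static inline void pc_hint_update(int class_idx, struct SuperSlab* ss,
                                  void* base, size_t size)
{
    if (!ss || !base || size == 0) return;
    if (class_idx < 0 || class_idx >= PC_HINT_CLASSES) return;

    PcHintEntry* e = &g_pc_hint[class_idx];
    e->base = (uintptr_t)base;
    e->end  = (uintptr_t)base + size;
    e->ss   = ss;
}

/* free() does not know the class yet, so every slot must be scanned. */
static inline bool pc_hint_lookup(const void* ptr, struct SuperSlab** out_ss)
{
    uintptr_t p = (uintptr_t)ptr;
    for (int i = 0; i < PC_HINT_CLASSES; i++) {
        const PcHintEntry* e = &g_pc_hint[i];
        if (e->ss && p >= e->base && p < e->end) {
            *out_ss = e->ss;
            return true;
        }
    }
    return false;
}
```

The sketch makes one trade-off visible: in Headerless mode the class is unknown at free() time, so the lookup still scans all 8 slots. The per-class layout therefore changes the eviction behavior (no inter-class eviction) rather than the lookup cost, which is consistent with deferring it until benchmarks show inter-class thrashing.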

### 11.5 Statistics Export API

For production monitoring, export hit rate via:

```c
// Global aggregated stats (all threads)
void hak_tls_hint_global_stats(uint64_t* total_hits, uint64_t* total_misses);

// ENV-based stats dump at exit
// HAKMEM_TLS_HINT_STATS=1 → dump to stderr at exit
```

---

## 12. Implementation Checklist

### 12.1 Phase 1a: Core Implementation (Week 1)

- [ ] Create `core/box/tls_ss_hint_box.h`
- [ ] Implement `tls_ss_hint_init()`
- [ ] Implement `tls_ss_hint_update()`
- [ ] Implement `tls_ss_hint_lookup()`
- [ ] Implement `tls_ss_hint_clear()`
- [ ] Add `HAKMEM_TINY_SS_TLS_HINT` flag to `hakmem_build_flags.h`
- [ ] Add validation check (hint requires headerless mode)

### 12.2 Phase 1b: Integration (Week 2)

- [ ] Integrate into `hakmem_tiny_free.inc` (lookup path)
- [ ] Integrate into `hakmem_tiny.c` (update path after alloc)
- [ ] Integrate into `hakmem_tiny_refill.inc.h` (update path after refill)
- [ ] Integrate into `core/front/tiny_unified_cache.c` (update path)
- [ ] Call `tls_ss_hint_init()` in thread-local init

### 12.3 Phase 1c: Testing (Week 2-3)

- [ ] Write unit tests (`tests/test_tls_ss_hint.c`)
- [ ] Run unit tests: `make test_tls_ss_hint && ./test_tls_ss_hint`
- [ ] Build validation (hint disabled, hint enabled, error check)
- [ ] Benchmark comparison (sh8bench, cfrac, larson)
- [ ] Hit rate profiling (debug build with stats)
- [ ] Correctness tests (no crashes, no assertion failures)

### 12.4 Phase 1d: Validation (Week 3)

- [ ] Benchmark: sh8bench (target: +15-20%)
- [ ] Benchmark: cfrac (target: +10-15%)
- [ ] Benchmark: larson 8 threads (target: +15-20%)
- [ ] Hit rate analysis (target: 85-95%)
- [ ] Memory overhead check (target: < 150 bytes/thread)
- [ ] Regression test: Headerless=0 mode still works

### 12.5 Phase 1e: Documentation (Week 3-4)

- [ ] Update `docs/PHASE2_HEADERLESS_INSTRUCTION.md` with hint Box
- [ ] Add Box Theory annotation to hakmem Box registry
- [ ] Write performance analysis report (before/after comparison)
- [ ] Update build instructions (`make shared EXTRA_CFLAGS=...`)

---

## 13. Rollout Plan

### Stage 1: Internal Testing (Week 1-3)

- Build with `HAKMEM_TINY_SS_TLS_HINT=1` in dev environment
- Run full benchmark suite (mimalloc-bench)
- Profile with perf/cachegrind (verify cycle count reduction)
- Fix any integration bugs

### Stage 2: Canary Deployment (Week 4)

- Enable hint Box in 5% of production traffic
- Monitor: crash rate, performance metrics, hit rate
- A/B test: Hint ON vs Hint OFF

### Stage 3: Gradual Rollout (Week 5-6)

- 25% traffic (if canary success)
- 50% traffic
- 100% traffic

### Stage 4: Default Enable (Week 7)

- Change default: `HAKMEM_TINY_SS_TLS_HINT=1`
- Update build scripts, CI/CD pipelines
- Announce in release notes

---

## 14. Success Metrics

| Metric | Baseline | Target | Measurement |
|--------|----------|--------|-------------|
| sh8bench throughput | 54.60 Mops/s | 64-68 Mops/s | +15-20% |
| cfrac runtime | 1.25 sec | 1.10-1.15 sec | -10-15% |
| larson throughput | 6.5M ops/s | 7.5-8.0M ops/s | +15-20% |
| TLS hint hit rate | N/A | 85-95% | Stats API |
| free() cycle count | 15-60 cycles | 7-15 cycles (hit) | perf/cachegrind |
| Memory overhead | 0 | < 150 bytes/thread | sizeof(TlsSsHintCache) |
| Crash rate | 0.001% | 0.001% (no regression) | Production monitoring |

---

## 15. Open Questions

1. **Q**: Should we implement per-class hint caches instead of unified cache?
   **A**: Defer until benchmarks show inter-class thrashing. Current unified design is simpler and sufficient for most workloads.

2. **Q**: Should we use LRU instead of FIFO eviction?
   **A**: Defer until benchmarks show FIFO hit rate < 80%. FIFO is simpler and avoids move-to-front cost on hits.

3. **Q**: Should we make TLS_SS_HINT_SLOTS runtime-configurable?
   **A**: No, compile-time constant allows better optimization (loop unrolling, register allocation). Consider adaptive sizing in Phase 2 if needed.

4. **Q**: Should we validate SUPERSLAB_MAGIC in tls_ss_hint_lookup()?
   **A**: No, keep lookup minimal (2-5 cycles). Caller (free() path) must validate magic. This matches existing design where hak_super_lookup() also requires caller validation.

5. **Q**: Should we export hit rate stats in production builds?
   **A**: Phase 1: No (save 16 bytes/thread). Phase 2: Add global aggregated stats API for monitoring if needed.

---

## 16. Conclusion

The TLS Superslab Hint Box is a low-risk, high-reward optimization that reduces the performance gap between Headerless mode and Header mode from 30% to ~15%. The design is self-contained, testable, and follows hakmem's Box Theory architecture.

Expected implementation time: 3-4 weeks (including testing and validation).

**Key Strengths**:
- Minimal integration surface (5 call sites)
- Self-contained Box (no dependencies beyond Headerless mode)
- Fail-safe fallback (miss → hak_super_lookup)
- Low memory overhead (104 bytes/thread in release builds on 64-bit targets)
- Proven pattern (TLS caching used in jemalloc, tcmalloc)

**Next Steps**:
1. Review this design document
2. Approve Phase 1a implementation (core Box)
3. Begin implementation with unit tests
4. Benchmark and validate in dev environment
5. Plan Phase 2 integration (Global Class Map)

---

**End of Design Document**