# TLS Superslab Hint Box - Design Document

**Phase**: Headerless Performance Optimization - Phase 1
**Date**: 2025-12-03
**Status**: Design Review
**Author**: hakmem team

---

## 1. Executive Summary

The TLS Superslab Hint Box is a thread-local cache that accelerates pointer-to-SuperSlab resolution in Headerless mode. When HAKMEM_TINY_HEADERLESS=1 is enabled, every free() operation requires translating a user pointer to its owning SuperSlab. Currently, this uses `hak_super_lookup()`, which performs a hash table lookup costing 10-50 cycles. By caching recently-used SuperSlab references in thread-local storage, we can reduce this to 2-5 cycles for cache hits (85-95% hit rate expected).

**Expected Performance Improvement**: 15-20% throughput increase (54.60 → 64-68 Mops/s on sh8bench)

**Risk Level**: Low
- Thread-local storage eliminates cache coherency issues
- Magic number validation provides fail-safe fallback
- Self-contained Box with minimal integration surface
- Memory overhead: 104 bytes per thread in release builds, 120 bytes with statistics counters (64-bit targets; negligible)

---

## 2. Box Definition (Box Theory)

```
Box: TLS Superslab Hint Cache

MISSION:
  Cache recently-used SuperSlab references in TLS to accelerate
  ptr→SuperSlab resolution in Headerless mode, avoiding expensive
  hash table lookups on the critical free() path.

DESIGN:
  - Provides O(1) lookup for hot SuperSlabs (L1 cache hit, 2-5 cycles)
  - Falls back to global registry on miss (fail-safe, no data loss)
  - No ownership, no remote queues, pure read-only cache
  - FIFO eviction policy with configurable cache size (2-4 slots)

INVARIANTS:
  - hint.base <= ptr < hint.end implies hint.ss is valid
  - Miss is always safe (triggers fallback to hak_super_lookup)
  - TLS data survives only within thread lifetime
  - Cache entries are invalidated implicitly by FIFO rotation
  - Magic number check (SUPERSLAB_MAGIC) is performed by the caller on every hit

BOUNDARY:
  - Input: raw user pointer (void* ptr) from free() path
  - Output: SuperSlab* or NULL (miss triggers fallback)
  - Does NOT determine class_idx (that's slab_index_for's job)
  - Does NOT perform ownership validation (that's SuperSlab's job)

PERFORMANCE:
  - Cache hit: 2-5 cycles (L1 cache hit, 4 pointer comparisons)
  - Cache miss: fallback to hak_super_lookup (10-50 cycles)
  - Expected hit rate: 85-95% for single-threaded workloads
  - Expected hit rate: 70-85% for multi-threaded workloads

THREAD SAFETY:
  - TLS storage: no sharing, no synchronization required
  - Read-only cache: never modifies SuperSlab state
  - Stale entries: caught by magic number check
```

---

## 3. Data Structures

```c
// core/box/tls_ss_hint_box.h

#ifndef TLS_SS_HINT_BOX_H
#define TLS_SS_HINT_BOX_H

#include <stddef.h>   // size_t (used by tls_ss_hint_update)
#include <stdint.h>
#include <stdbool.h>

// Forward declaration
struct SuperSlab;

// Cache entry for a single SuperSlab hint
// Size: 24 bytes on 64-bit targets (three pointers); the 4-entry cache spans two 64-byte cache lines
typedef struct {
    void* base;              // SuperSlab base address (aligned to 1MB or 2MB)
    void* end;               // base + superslab_size (for range check)
    struct SuperSlab* ss;    // Cached SuperSlab pointer
} TlsSsHintEntry;

// TLS hint cache configuration
// - 4 slots provide good hit rate without excessive overhead
// - Larger caches (8, 16) show diminishing returns in benchmarks
// - Smaller caches (2) may thrash on workloads with 3+ active SuperSlabs
#define TLS_SS_HINT_SLOTS 4

// Thread-local SuperSlab hint cache
// Total size: 24*4 + 8 = 104 bytes per thread on 64-bit targets (120 with stats); negligible overhead
typedef struct {
    TlsSsHintEntry entries[TLS_SS_HINT_SLOTS];  // Cache entries
    uint32_t count;          // Number of valid entries (0 to TLS_SS_HINT_SLOTS)
    uint32_t next_slot;      // Next slot for FIFO rotation (wraps at TLS_SS_HINT_SLOTS)

    // Statistics (optional, for profiling builds)
    // Disabled in HAKMEM_BUILD_RELEASE to save 16 bytes per thread
    #if !HAKMEM_BUILD_RELEASE
    uint64_t hits;           // Cache hit count
    uint64_t misses;         // Cache miss count
    #endif
} TlsSsHintCache;

// Thread-local storage instance
// Initialized to zero by TLS semantics, formal init in tls_ss_hint_init()
extern __thread TlsSsHintCache g_tls_ss_hint;

#endif // TLS_SS_HINT_BOX_H
```
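The per-thread footprint can be checked at compile time rather than by hand. The sketch below mirrors the two structs under hypothetical names (`HintEntryModel`, `HintCacheModel`) and uses a hypothetical `MODEL_WITH_STATS` switch as a stand-in for `!HAKMEM_BUILD_RELEASE`; it assumes a C11 compiler and a target where `uint64_t` needs at most 8-byte alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SLOTS 4          /* mirrors TLS_SS_HINT_SLOTS */
#ifndef MODEL_WITH_STATS
#define MODEL_WITH_STATS 1     /* hypothetical stand-in for !HAKMEM_BUILD_RELEASE */
#endif

typedef struct {
    void* base;
    void* end;
    void* ss;                  /* struct SuperSlab* in the real header */
} HintEntryModel;

typedef struct {
    HintEntryModel entries[MODEL_SLOTS];
    uint32_t count;
    uint32_t next_slot;
#if MODEL_WITH_STATS
    uint64_t hits;
    uint64_t misses;
#endif
} HintCacheModel;

/* Expected footprint: entries + two 32-bit counters (+ two 64-bit stats). */
static size_t hint_cache_expected_size(int with_stats) {
    size_t n = MODEL_SLOTS * 3 * sizeof(void*) + 2 * sizeof(uint32_t);
    if (with_stats) {
        n += 2 * sizeof(uint64_t);
    }
    return n;
}

/* Fails the build if padding ever sneaks into the entry layout. */
static_assert(sizeof(HintEntryModel) == 3 * sizeof(void*),
              "entry should be exactly three pointers");
```

With 8-byte pointers this works out to 96 + 8 = 104 bytes without the statistics counters and 120 bytes with them.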

---

## 4. API Design

```c
// core/box/tls_ss_hint_box.h (continued)

/**
 * @brief Initialize TLS hint cache for current thread
 *
 * Call once per thread, typically in thread-local initialization path.
 * Safe to call multiple times (idempotent).
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~10 cycles (negligible one-time cost)
 */
static inline void tls_ss_hint_init(void);

/**
 * @brief Update hint cache with a SuperSlab reference
 *
 * Called on paths where we know the SuperSlab for a given address range:
 * - After successful tiny_alloc (cache the allocated-from SuperSlab)
 * - After superslab refill (cache the newly bound SuperSlab)
 * - After unified cache refill (cache the refilled SuperSlab)
 *
 * Duplicate detection: If the SuperSlab is already cached, no update occurs.
 * This prevents thrashing when repeatedly allocating from the same SuperSlab.
 *
 * @param ss    SuperSlab to cache (must be non-NULL, SUPERSLAB_MAGIC validated by caller)
 * @param base  SuperSlab base address (1MB or 2MB aligned)
 * @param size  SuperSlab size in bytes (1MB or 2MB)
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~15-20 cycles (duplicate check + FIFO rotation)
 */
static inline void tls_ss_hint_update(struct SuperSlab* ss, void* base, size_t size);

/**
 * @brief Lookup SuperSlab for given pointer (fast path)
 *
 * Called on free() entry, before falling back to hak_super_lookup().
 * Performs linear search over cached entries (4 iterations max).
 *
 * Cache hit: Returns true, sets *out_ss to cached SuperSlab pointer
 * Cache miss: Returns false, caller must use hak_super_lookup()
 *
 * @param ptr     User pointer to lookup (arbitrary alignment)
 * @param out_ss  Output: SuperSlab pointer if found (only valid if return true)
 * @return true if cache hit (out_ss is valid), false if miss
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: 2-5 cycles (hit), 8-12 cycles (miss)
 *
 * NOTE: Caller MUST validate SUPERSLAB_MAGIC after successful lookup.
 *       This Box does not perform magic validation to keep fast path minimal.
 */
static inline bool tls_ss_hint_lookup(void* ptr, struct SuperSlab** out_ss);

/**
 * @brief Clear all cached hints (for testing/reset)
 *
 * Use cases:
 * - Unit tests: Reset cache between test cases
 * - Debug: Force cache cold start for profiling
 * - Thread teardown: Optional cleanup (TLS auto-cleanup on thread exit)
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~10 cycles
 */
static inline void tls_ss_hint_clear(void);

/**
 * @brief Get cache statistics (for profiling builds)
 *
 * Returns hit/miss counters for performance analysis.
 * Only available in non-release builds (HAKMEM_BUILD_RELEASE=0).
 *
 * @param hits    Output: Total cache hits
 * @param misses  Output: Total cache misses
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~5 cycles (two loads)
 */
#if !HAKMEM_BUILD_RELEASE
static inline void tls_ss_hint_stats(uint64_t* hits, uint64_t* misses);
#endif
```

---

## 5. Implementation Details

```c
// core/box/tls_ss_hint_box.c (or inline in .h for header-only Box)

#include "tls_ss_hint_box.h"
#include "../hakmem_tiny_superslab.h"  // For SuperSlab, SUPERSLAB_MAGIC

// Thread-local storage definition
__thread TlsSsHintCache g_tls_ss_hint = {0};

/**
 * Initialize TLS hint cache
 * Safe to call multiple times: every call resets the cache to the empty state
 */
static inline void tls_ss_hint_init(void) {
    // Zero-initialization by TLS, but explicit init for clarity
    g_tls_ss_hint.count = 0;
    g_tls_ss_hint.next_slot = 0;

    #if !HAKMEM_BUILD_RELEASE
    g_tls_ss_hint.hits = 0;
    g_tls_ss_hint.misses = 0;
    #endif

    // Clear all entries (paranoid, but cache-friendly loop)
    for (int i = 0; i < TLS_SS_HINT_SLOTS; i++) {
        g_tls_ss_hint.entries[i].base = NULL;
        g_tls_ss_hint.entries[i].end = NULL;
        g_tls_ss_hint.entries[i].ss = NULL;
    }
}

/**
 * Update hint cache with SuperSlab reference
 * FIFO rotation: oldest entry is evicted when cache is full
 * Duplicate detection: skip if SuperSlab already cached
 */
static inline void tls_ss_hint_update(struct SuperSlab* ss, void* base, size_t size) {
    // Sanity check: reject invalid inputs
    if (__builtin_expect(!ss || !base || size == 0, 0)) {
        return;
    }

    // Duplicate detection: check if this SuperSlab is already cached
    // This prevents thrashing when allocating from the same SuperSlab repeatedly
    for (uint32_t i = 0; i < g_tls_ss_hint.count; i++) {
        if (g_tls_ss_hint.entries[i].ss == ss) {
            return;  // Already cached, no update needed
        }
    }

    // Add to next slot (FIFO rotation)
    uint32_t slot = g_tls_ss_hint.next_slot;
    g_tls_ss_hint.entries[slot].base = base;
    g_tls_ss_hint.entries[slot].end = (char*)base + size;
    g_tls_ss_hint.entries[slot].ss = ss;

    // Advance to next slot (wrap at TLS_SS_HINT_SLOTS)
    g_tls_ss_hint.next_slot = (slot + 1) % TLS_SS_HINT_SLOTS;

    // Increment count until cache is full
    if (g_tls_ss_hint.count < TLS_SS_HINT_SLOTS) {
        g_tls_ss_hint.count++;
    }
}

/**
 * Lookup SuperSlab for pointer (fast path)
 * Linear search over cached entries (4 iterations max)
 *
 * Performance note:
 * - Linear search is faster than hash table for small N (N <= 8)
 * - Range comparison (ptr >= base && ptr < end) is 2-3 cycles; whether it compiles branch-free is up to the compiler
 * - Total cost: 2-5 cycles (hit), 8-12 cycles (miss with 4 entries)
 */
static inline bool tls_ss_hint_lookup(void* ptr, struct SuperSlab** out_ss) {
    // Fast path: iterate over valid entries
    // Unrolling this loop (if count is small) is beneficial, but let compiler decide
    for (uint32_t i = 0; i < g_tls_ss_hint.count; i++) {
        TlsSsHintEntry* e = &g_tls_ss_hint.entries[i];

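        // Range check: base <= ptr < end
        // Note: end is exclusive (base + size), so use < not <=
        // Compare as integers: relational operators on pointers into
        // different objects are undefined behavior in ISO C
        if ((uintptr_t)ptr >= (uintptr_t)e->base && (uintptr_t)ptr < (uintptr_t)e->end) {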
            // Cache hit!
            *out_ss = e->ss;

            #if !HAKMEM_BUILD_RELEASE
            g_tls_ss_hint.hits++;
            #endif

            return true;
        }
    }

    // Cache miss: caller must fall back to hak_super_lookup()
    #if !HAKMEM_BUILD_RELEASE
    g_tls_ss_hint.misses++;
    #endif

    return false;
}

/**
 * Clear all cached hints
 * Use for testing or manual reset
 */
static inline void tls_ss_hint_clear(void) {
    g_tls_ss_hint.count = 0;
    g_tls_ss_hint.next_slot = 0;

    #if !HAKMEM_BUILD_RELEASE
    // Preserve stats across clear (for cumulative profiling)
    // Uncomment to reset stats:
    // g_tls_ss_hint.hits = 0;
    // g_tls_ss_hint.misses = 0;
    #endif

    // Optional: zero out entries (paranoid, not required for correctness)
    for (int i = 0; i < TLS_SS_HINT_SLOTS; i++) {
        g_tls_ss_hint.entries[i].base = NULL;
        g_tls_ss_hint.entries[i].end = NULL;
        g_tls_ss_hint.entries[i].ss = NULL;
    }
}

/**
 * Get cache statistics (profiling builds only)
 */
#if !HAKMEM_BUILD_RELEASE
static inline void tls_ss_hint_stats(uint64_t* hits, uint64_t* misses) {
    if (hits) *hits = g_tls_ss_hint.hits;
    if (misses) *misses = g_tls_ss_hint.misses;
}
#endif
```

---

## 6. Integration Points

### 6.1 Update Points: When to Call `tls_ss_hint_update()`

The hint cache should be updated whenever we know the SuperSlab for an address range. This happens on allocation success paths:

#### Location 1: After Successful Tiny Alloc (hakmem_tiny.c)
```c
// In hak_tiny_alloc or similar allocation path
void* ptr = tiny_allocate_from_superslab(class_idx, &ss);
if (ptr) {
    #if HAKMEM_TINY_SS_TLS_HINT
    // Cache the SuperSlab we just allocated from
    // This improves free() performance for LIFO allocation patterns
    tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
    #endif
    return ptr;
}
```

#### Location 2: After SuperSlab Refill (hakmem_tiny_refill.inc.h)
```c
// In tiny_refill_from_superslab or superslab_allocate
SuperSlab* ss = superslab_allocate(class_idx);
if (ss) {
    // Bind SuperSlab to thread's TLS state
    bind_superslab_to_thread(ss, class_idx);

    #if HAKMEM_TINY_SS_TLS_HINT
    // Cache the newly bound SuperSlab
    // Future allocations from this SuperSlab will have cached hint
    tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
    #endif
}
```

#### Location 3: Unified Cache Refill (core/front/tiny_unified_cache.c)
```c
// In unified_cache_refill_class
void* block = superslab_alloc_block(class_idx, &ss);
if (block) {
    #if HAKMEM_TINY_SS_TLS_HINT
    // Cache the SuperSlab that provided this block
    tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
    #endif

    // Push to unified cache
    unified_cache_push(class_idx, block);
}
```

#### Location 4: Thread-Local Init (hakmem_tiny_tls_init)
```c
// In tiny_tls_init or thread_local_init
void tiny_tls_init(void) {
    // Initialize TLS structures
    tiny_magazine_init();
    tiny_sll_init();

    #if HAKMEM_TINY_SS_TLS_HINT
    // Initialize hint cache (zero-init by TLS, but explicit for clarity)
    tls_ss_hint_init();
    #endif
}
```

### 6.2 Lookup Points: When to Call `tls_ss_hint_lookup()`

The hint lookup should be the **first step** in the free() path, before falling back to the registry lookup:

#### Location 1: Tiny Free Entry (core/hakmem_tiny_free.inc)
```c
// In hak_tiny_free or similar free path
void hak_tiny_free(void* ptr) {
    if (!ptr) return;

    SuperSlab* ss = NULL;

    #if HAKMEM_TINY_HEADERLESS
        // Phase 1: Try TLS hint cache (fast path, 2-5 cycles on hit)
        #if HAKMEM_TINY_SS_TLS_HINT
        if (!tls_ss_hint_lookup(ptr, &ss)) {
        #endif
            // Phase 2: Fallback to global registry (slow path, 10-50 cycles)
            ss = hak_super_lookup(ptr);
        #if HAKMEM_TINY_SS_TLS_HINT
        }
        #endif

        // Validate SuperSlab (magic check)
        if (!ss || ss->magic != SUPERSLAB_MAGIC) {
            // Invalid pointer - external guard path
            hak_external_guard_free(ptr);
            return;
        }

        // Proceed with free using SuperSlab info
        int class_idx = slab_index_for(ss, ptr);
        tiny_free_to_slab(ss, ptr, class_idx);

    #else
        // Header mode: read class_idx from header (1-3 cycles)
        uint8_t hdr = *((uint8_t*)ptr - 1);
        int class_idx = hdr & 0x7;
        tiny_free_to_class(class_idx, ptr);
    #endif
}
```

#### Location 2: Fast Free Path (core/tiny_free_fast_v2.inc.h)
```c
// In tiny_free_fast or inline free path
static inline void tiny_free_fast(void* ptr) {
    #if HAKMEM_TINY_HEADERLESS
        SuperSlab* ss = NULL;

        // Try hint cache first
        #if HAKMEM_TINY_SS_TLS_HINT
        if (!tls_ss_hint_lookup(ptr, &ss)) {
        #endif
            ss = hak_super_lookup(ptr);
        #if HAKMEM_TINY_SS_TLS_HINT
        }
        #endif

        if (__builtin_expect(!ss || ss->magic != SUPERSLAB_MAGIC, 0)) {
            // Slow path: external guard or invalid pointer
            hak_tiny_free_slow(ptr);
            return;
        }

        // Fast path: push to TLS freelist
        int class_idx = slab_index_for(ss, ptr);
        front_gate_push_tls(class_idx, ptr);

    #else
        // Header mode fast path
        uint8_t hdr = *((uint8_t*)ptr - 1);
        int class_idx = hdr & 0x7;
        front_gate_push_tls(class_idx, ptr);
    #endif
}
```
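The paired `#if`/`#endif` blocks that open and close the `if` brace in the two snippets above are easy to break during later edits. One alternative is to wrap the two-step resolution in a small helper. The sketch below is an assumption, not existing hakmem code: `tiny_resolve_superslab()` is a hypothetical name, and the `SuperSlab` type and both lookup functions are replaced by stubs so the example compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef HAKMEM_TINY_SS_TLS_HINT
#define HAKMEM_TINY_SS_TLS_HINT 1
#endif

/* Stand-ins so the sketch compiles on its own; the real type and
 * functions live in the hakmem sources. */
typedef struct SuperSlab { unsigned magic; } SuperSlab;
static SuperSlab g_stub_hinted;      /* what the stub hint cache returns */
static SuperSlab g_stub_registry;    /* what the stub registry returns */
static void* g_stub_hinted_ptr;      /* the one pointer the stub cache knows */

static bool tls_ss_hint_lookup(void* ptr, SuperSlab** out_ss) {
    if (ptr != NULL && ptr == g_stub_hinted_ptr) {
        *out_ss = &g_stub_hinted;
        return true;
    }
    return false;
}

static SuperSlab* hak_super_lookup(void* ptr) {
    return ptr ? &g_stub_registry : NULL;
}

/* Hypothetical helper: TLS hint first, global registry second.
 * The caller still performs the SUPERSLAB_MAGIC check on the result. */
static inline SuperSlab* tiny_resolve_superslab(void* ptr) {
#if HAKMEM_TINY_SS_TLS_HINT
    SuperSlab* ss = NULL;
    if (tls_ss_hint_lookup(ptr, &ss)) {
        return ss;
    }
#endif
    return hak_super_lookup(ptr);
}
```

Both free paths could then start with `SuperSlab* ss = tiny_resolve_superslab(ptr);` followed by the existing magic check, which keeps the preprocessor conditionals in one place.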

---

## 7. Build Flag (Compile-Time)

```c
// In hakmem_build_flags.h or similar configuration header

// ============================================================================
// Phase 1: Headerless Optimization - TLS SuperSlab Hint Cache
// ============================================================================
// Purpose: Accelerate ptr→SuperSlab lookup in Headerless mode
// Default: 0 (disabled during development and testing)
// Target: 1 (enabled after validation in Phase 1 rollout)
//
// Performance Impact:
// - Cache hit: 2-5 cycles (vs 10-50 cycles for hak_super_lookup)
// - Expected hit rate: 85-95% (single-threaded), 70-85% (multi-threaded)
// - Expected throughput improvement: 15-20%
//
// Memory Overhead:
// - 104 bytes per thread (TLS, release build on 64-bit targets)
// - Negligible for typical workloads (1000 threads = 104KB)
//
// Dependencies:
// - Requires HAKMEM_TINY_HEADERLESS=1 (hint is no-op in header mode)
// - No other dependencies (self-contained Box)

#ifndef HAKMEM_TINY_SS_TLS_HINT
  #define HAKMEM_TINY_SS_TLS_HINT 0
#endif

// Validation: Hint Box only active in Headerless mode
#if HAKMEM_TINY_SS_TLS_HINT && !HAKMEM_TINY_HEADERLESS
  #error "HAKMEM_TINY_SS_TLS_HINT requires HAKMEM_TINY_HEADERLESS=1"
#endif
```

---

## 8. Testing Plan

### 8.1 Unit Tests

Create `tests/test_tls_ss_hint.c`:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include "core/box/tls_ss_hint_box.h"
#include "core/hakmem_tiny_superslab.h"

// Mock SuperSlab for testing
typedef struct {
    uint32_t magic;
    void* base_addr;
    size_t size_bytes;
    uint8_t size_class;
} MockSuperSlab;

void test_hint_init(void) {
    printf("test_hint_init...\n");

    tls_ss_hint_init();

    // Verify cache is empty
    assert(g_tls_ss_hint.count == 0);
    assert(g_tls_ss_hint.next_slot == 0);

    #if !HAKMEM_BUILD_RELEASE
    assert(g_tls_ss_hint.hits == 0);
    assert(g_tls_ss_hint.misses == 0);
    #endif

    printf("  PASS\n");
}

void test_hint_basic(void) {
    printf("test_hint_basic...\n");

    tls_ss_hint_init();

    // Mock SuperSlab
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,  // 2MB
        .size_class = 0
    };

    // Update hint
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    // Verify cache entry
    assert(g_tls_ss_hint.count == 1);
    assert(g_tls_ss_hint.entries[0].base == ss.base_addr);
    assert(g_tls_ss_hint.entries[0].ss == (SuperSlab*)&ss);

    // Lookup should hit (within range)
    SuperSlab* out = NULL;
    assert(tls_ss_hint_lookup((void*)0x1000100, &out) == true);
    assert(out == (SuperSlab*)&ss);

    // Lookup at base should hit
    assert(tls_ss_hint_lookup((void*)0x1000000, &out) == true);
    assert(out == (SuperSlab*)&ss);

    // Lookup at end-1 should hit (end = 0x1000000 + 0x200000 = 0x1200000)
    assert(tls_ss_hint_lookup((void*)0x11FFFFF, &out) == true);
    assert(out == (SuperSlab*)&ss);

    // Lookup at end should miss (exclusive boundary)
    assert(tls_ss_hint_lookup((void*)0x1200000, &out) == false);

    // Lookup outside range should miss
    assert(tls_ss_hint_lookup((void*)0x3000000, &out) == false);

    printf("  PASS\n");
}

void test_hint_fifo_rotation(void) {
    printf("test_hint_fifo_rotation...\n");

    tls_ss_hint_init();

    // Create 6 mock SuperSlabs (cache has 4 slots)
    MockSuperSlab ss[6];
    for (int i = 0; i < 6; i++) {
        ss[i].magic = SUPERSLAB_MAGIC;
        ss[i].base_addr = (void*)(uintptr_t)(0x1000000 + i * 0x200000);  // 2MB apart
        ss[i].size_bytes = 2 * 1024 * 1024;
        ss[i].size_class = 0;

        tls_ss_hint_update((SuperSlab*)&ss[i], ss[i].base_addr, ss[i].size_bytes);
    }

    // Cache should be full (4 slots)
    assert(g_tls_ss_hint.count == TLS_SS_HINT_SLOTS);

    // First 2 SuperSlabs should be evicted (FIFO)
    SuperSlab* out = NULL;
    assert(tls_ss_hint_lookup((void*)0x1000100, &out) == false);  // ss[0] evicted
    assert(tls_ss_hint_lookup((void*)0x1200100, &out) == false);  // ss[1] evicted

    // Last 4 SuperSlabs should be cached
    assert(tls_ss_hint_lookup((void*)0x1400100, &out) == true);   // ss[2]
    assert(out == (SuperSlab*)&ss[2]);
    assert(tls_ss_hint_lookup((void*)0x1600100, &out) == true);   // ss[3]
    assert(out == (SuperSlab*)&ss[3]);
    assert(tls_ss_hint_lookup((void*)0x1800100, &out) == true);   // ss[4]
    assert(out == (SuperSlab*)&ss[4]);
    assert(tls_ss_hint_lookup((void*)0x1A00100, &out) == true);   // ss[5]
    assert(out == (SuperSlab*)&ss[5]);

    printf("  PASS\n");
}

void test_hint_duplicate_detection(void) {
    printf("test_hint_duplicate_detection...\n");

    tls_ss_hint_init();

    // Mock SuperSlab
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,
        .size_class = 0
    };

    // Update hint 3 times with same SuperSlab
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    // Cache should have only 1 entry (duplicates ignored)
    assert(g_tls_ss_hint.count == 1);
    assert(g_tls_ss_hint.entries[0].ss == (SuperSlab*)&ss);

    printf("  PASS\n");
}

void test_hint_clear(void) {
    printf("test_hint_clear...\n");

    tls_ss_hint_init();

    // Add some entries
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,
        .size_class = 0
    };
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    assert(g_tls_ss_hint.count == 1);

    // Clear cache
    tls_ss_hint_clear();

    // Cache should be empty
    assert(g_tls_ss_hint.count == 0);
    assert(g_tls_ss_hint.next_slot == 0);

    // Lookup should miss
    SuperSlab* out = NULL;
    assert(tls_ss_hint_lookup((void*)0x1000100, &out) == false);

    printf("  PASS\n");
}

#if !HAKMEM_BUILD_RELEASE
void test_hint_stats(void) {
    printf("test_hint_stats...\n");

    tls_ss_hint_init();

    // Add entry
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,
        .size_class = 0
    };
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    // Perform lookups
    SuperSlab* out = NULL;
    tls_ss_hint_lookup((void*)0x1000100, &out);  // Hit
    tls_ss_hint_lookup((void*)0x1000200, &out);  // Hit
    tls_ss_hint_lookup((void*)0x3000000, &out);  // Miss

    // Check stats
    uint64_t hits = 0, misses = 0;
    tls_ss_hint_stats(&hits, &misses);

    assert(hits == 2);
    assert(misses == 1);

    printf("  PASS\n");
}
#endif

int main(void) {
    printf("Running TLS SS Hint Box unit tests...\n\n");

    test_hint_init();
    test_hint_basic();
    test_hint_fifo_rotation();
    test_hint_duplicate_detection();
    test_hint_clear();

    #if !HAKMEM_BUILD_RELEASE
    test_hint_stats();
    #endif

    printf("\nAll tests passed!\n");
    return 0;
}
```

### 8.2 Integration Tests

#### Test 1: Build Validation
```bash
# Test 1: Build with hint disabled (baseline)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"

# Test 2: Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Test 3: Verify hint is disabled in header mode (should error)
# make clean
# make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=0 -DHAKMEM_TINY_SS_TLS_HINT=1"
# Expected: Compile error (validation check in hakmem_build_flags.h)
```

#### Test 2: Benchmark Comparison
```bash
# Build baseline (hint disabled)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"

# Run benchmarks
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench > baseline.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 17545186520809 > cfrac_baseline.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 > larson_baseline.txt

# Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Run same benchmarks
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench > hint.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 17545186520809 > cfrac_hint.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 > larson_hint.txt

# Compare results
echo "=== sh8bench ==="
grep "Mops" baseline.txt hint.txt

echo "=== cfrac ==="
grep "time:" cfrac_baseline.txt cfrac_hint.txt

echo "=== larson ==="
grep "ops/s" larson_baseline.txt larson_hint.txt
```

#### Test 3: Hit Rate Profiling
```bash
# Build with stats enabled (non-release)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1 -DHAKMEM_BUILD_RELEASE=0"

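# Add stats dump at exit (in hakmem_exit.c or similar; needs <inttypes.h>)
# Note: the counters are thread-local, so this reports only the calling thread
# void dump_hint_stats(void) {
#     uint64_t hits = 0, misses = 0;
#     tls_ss_hint_stats(&hits, &misses);
#     uint64_t total = hits + misses;
#     fprintf(stderr, "[TLS_HINT_STATS] hits=%" PRIu64 " misses=%" PRIu64 " hit_rate=%.2f%%\n",
#             hits, misses, total ? 100.0 * (double)hits / (double)total : 0.0);
# }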

# Run benchmark and check hit rate
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep TLS_HINT_STATS
# Expected: hit_rate >= 85%
```

### 8.3 Correctness Tests

```bash
# Exercise both the hit path and the miss path (fallback to hak_super_lookup)
# under a real workload. Note: sh8bench alone does not free external
# (non-hakmem) pointers; that case needs a dedicated test.

# Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Run sh8bench (allocates from multiple SuperSlabs)
# Success = clean exit with no crashes or assertion failures
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench \
  && echo "Correctness test passed"
```

---

## 9. Performance Expectations

### 9.1 Cycle Count Analysis

| Operation | Without Hint | With Hint (Hit) | With Hint (Miss) | Improvement |
|-----------|-------------|----------------|-----------------|-------------|
| free() lookup | 10-50 cycles | 2-5 cycles | 18-62 cycles (8-12 scan + 10-50 fallback) | 80-95% (on hit) |
| Range check (per entry) | N/A | 2 cycles | 2 cycles | - |
| Hash table lookup | 10-50 cycles | N/A | 10-50 cycles | - |
| Total free() cost | 15-60 cycles | 7-15 cycles (hit) | 23-72 cycles (miss) | 40-60% |
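Because a miss pays for the TLS scan before it falls back to the registry, the average lookup cost depends on the hit rate. The sketch below spells out that arithmetic. The helper names are hypothetical, and the inputs are meant to be the cycle estimates from this document, which are estimates rather than measurements.

```c
#include <assert.h>

/* Expected ptr->SuperSlab lookup cost in cycles for a given hit rate.
 *   hit_cost:      cost of a TLS hint hit
 *   scan_cost:     cost of scanning the TLS cache and missing
 *   fallback_cost: cost of hak_super_lookup() after a miss */
static double hint_expected_cycles(double hit_rate, double hit_cost,
                                   double scan_cost, double fallback_cost) {
    return hit_rate * hit_cost
         + (1.0 - hit_rate) * (scan_cost + fallback_cost);
}

/* Hit rate above which the hint beats calling hak_super_lookup() directly.
 * From  h*hit + (1-h)*(scan + fb) < fb  it follows that
 *       h > scan / (scan + fb - hit). */
static double hint_break_even_hit_rate(double hit_cost, double scan_cost,
                                       double fallback_cost) {
    return scan_cost / (scan_cost + fallback_cost - hit_cost);
}
```

For example, with a 0.90 hit rate, a 4-cycle hit, a 10-cycle scan and a 30-cycle fallback, the model gives 0.9 * 4 + 0.1 * 40 = 7.6 cycles, compared with 30 cycles without the hint. With the same costs the break-even hit rate is 10 / 36, or roughly 28%.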

### 9.2 Expected Hit Rates

| Workload | Hit Rate | Reasoning |
|----------|----------|-----------|
| Single-threaded LIFO | 95-99% | Free() immediately after alloc() from same SuperSlab |
| Single-threaded FIFO | 85-95% | Recent allocations from 2-4 SuperSlabs |
| Multi-threaded (8 threads) | 70-85% | Shared SuperSlabs, more cache thrashing |
| Larson (high churn) | 65-80% | Many active SuperSlabs, frequent evictions |

### 9.3 Benchmark Targets

| Benchmark | Baseline (no hint) | Target (with hint) | Improvement |
|-----------|-------------------|-------------------|-------------|
| sh8bench | 54.60 Mops/s | 64-68 Mops/s | +15-20% |
| cfrac | 1.25 sec | 1.10-1.15 sec | +10-15% |
| larson (8 threads) | 6.5M ops/s | 7.5-8.0M ops/s | +15-20% |

### 9.4 Memory Overhead

| Metric | Value | Notes |
|--------|-------|-------|
| Per-thread overhead | 104 bytes | TLS cache (release build, 64-bit) |
| Per-thread overhead (debug) | 120 bytes | TLS cache + stats counters |
| 1000 threads | 104 KB | Negligible for server workloads |
| 10000 threads | 1.04 MB | Still negligible |

---

## 10. Risk Analysis

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| **Cache coherency issues** | Very Low | Low | TLS is thread-local, no sharing between threads |
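| **Stale hint after munmap** | Low | Low | Magic check (SUPERSLAB_MAGIC) catches freed SuperSlabs while the header is still mapped; if a SuperSlab is actually unmapped, its hint must be cleared first, because reading `ss->magic` on unmapped memory faults |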
| **Cache thrashing (many SS)** | Low | Low | 4 slots cover typical workloads; miss falls back to registry |
| **Memory overhead** | Very Low | Very Low | 104 bytes/thread (120 with stats), negligible for most workloads |
| **Integration bugs** | Low | Medium | Self-contained Box, clear API, comprehensive tests |
| **Hit rate lower than expected** | Low | Low | Under the estimates above, even a 50% hit rate is roughly break-even or better; each miss adds only the 8-12 cycle scan before the registry fallback |
| **Complexity increase** | Low | Low | 150 LOC, header-only Box, minimal dependencies |

### 10.1 Failure Modes and Recovery

| Failure Mode | Detection | Recovery |
|-------------|-----------|----------|
| Stale SuperSlab pointer | Magic check (SUPERSLAB_MAGIC != expected) | Routed to the external-guard/slow path (the Section 6.2 code does not retry hak_super_lookup() after a failed magic check) |
| Cache miss | tls_ss_hint_lookup returns false | Fall back to hak_super_lookup() |
| Invalid hint range | ptr outside [base, end) | Linear search continues, eventually misses |
| Thread teardown | TLS cleanup by OS | No manual cleanup needed |
| SuperSlab freed | Magic number cleared | Caught by magic check in free() path |
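
The magic check can only detect a stale hint while the SuperSlab header is still readable. If a SuperSlab's memory is ever returned to the OS, one option is to drop matching entries from the calling thread's cache before the release. The sketch below is an assumption and not part of the API in Section 4: `hint_invalidate()` and the model types are hypothetical, and hints held by other threads are not reachable this way and would need a separate mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SLOTS 4   /* mirrors TLS_SS_HINT_SLOTS */

typedef struct { void* base; void* end; void* ss; } Entry;
typedef struct {
    Entry entries[MODEL_SLOTS];
    uint32_t count;
    uint32_t next_slot;
} Cache;

/* Hypothetical helper: remove every entry that refers to `ss`.
 * Survivors are compacted to the front so that entries[0..count)
 * remains the valid range scanned by lookup. */
static void hint_invalidate(Cache* c, const void* ss) {
    uint32_t kept = 0;
    for (uint32_t i = 0; i < c->count; i++) {
        if (c->entries[i].ss != ss) {
            c->entries[kept++] = c->entries[i];
        }
    }
    if (kept == c->count) {
        return;  /* nothing matched: leave count and next_slot untouched */
    }
    for (uint32_t i = kept; i < MODEL_SLOTS; i++) {
        c->entries[i].base = NULL;
        c->entries[i].end = NULL;
        c->entries[i].ss = NULL;
    }
    c->count = kept;
    c->next_slot = kept;  /* kept < MODEL_SLOTS here, so no wrap is needed */
}
```

Compaction keeps index order but not strict insertion age once the cache has wrapped, so the next eviction after an invalidation may not hit the oldest entry. For a hint cache that only affects hit rate, not correctness.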

---

## 11. Future Considerations

### 11.1 Phase 2 Integration: Global Class Map

When Phase 2 introduces a Global Class Map (pointer → class_idx lookup), the TLS Hint Box becomes the first tier in a three-tier lookup hierarchy:

```
Tier 1 (fastest): TLS Hint Cache (2-5 cycles, 85-95% hit rate)
    ↓ miss
Tier 2 (medium): Global Class Map (5-15 cycles, 99%+ hit rate)
    ↓ miss
Tier 3 (slowest): Global SuperSlab Registry (10-50 cycles, 100% hit rate)
```

**Integration point**:
```c
SuperSlab* ss = NULL;
int class_idx = -1;

// Tier 1: TLS hint
#if HAKMEM_TINY_SS_TLS_HINT
if (tls_ss_hint_lookup(ptr, &ss) && ss->magic == SUPERSLAB_MAGIC) {
    class_idx = slab_index_for(ss, ptr);
    goto found;
}
#endif

// Tier 2: Global class map
#if HAKMEM_TINY_CLASS_MAP
class_idx = class_map_lookup(ptr);
if (class_idx >= 0) {
    ss = hak_super_lookup(ptr);  // Still need SS for metadata
    goto found;
}
#endif

// Tier 3: Registry fallback
ss = hak_super_lookup(ptr);
if (ss && ss->magic == SUPERSLAB_MAGIC) {
    class_idx = slab_index_for(ss, ptr);
    goto found;
}

// External pointer
hak_external_guard_free(ptr);
return;

found:
    tiny_free_to_class(class_idx, ptr);
```

### 11.2 Adaptive Cache Sizing

Current design uses fixed `TLS_SS_HINT_SLOTS = 4`. Future optimization could make this adaptive:

- **Workload detection**: Track hit rate over time windows
- **Dynamic sizing**: Increase slots (4 → 8) if hit rate < 80%
- **Memory pressure**: Decrease slots (8 → 2) if memory constrained

**Implementation sketch**:
```c
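#define TLS_SS_HINT_SLOTS_MAX 8

typedef struct {
    uint32_t current_slots;  // Dynamic (2, 4, 8)
    uint64_t hits_window;
    uint64_t misses_window;
} TlsSsHintAdaptive;

// Sketch only: adaptive state kept in its own TLS instance (hypothetical name)
static __thread TlsSsHintAdaptive g_tls_ss_hint_adaptive = { .current_slots = 4 };

void tls_ss_hint_tune(void) {
    TlsSsHintAdaptive* a = &g_tls_ss_hint_adaptive;
    uint64_t total = a->hits_window + a->misses_window;
    if (total == 0) {
        return;  // No lookups in this window, nothing to tune
    }

    double hit_rate = (double)a->hits_window / (double)total;

    if (hit_rate < 0.80 && a->current_slots < TLS_SS_HINT_SLOTS_MAX) {
        a->current_slots *= 2;  // Grow cache
    } else if (hit_rate > 0.95 && a->current_slots > 2) {
        a->current_slots /= 2;  // Shrink cache
    }

    // Start a fresh measurement window
    a->hits_window = 0;
    a->misses_window = 0;
}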
```

### 11.3 LRU vs FIFO Eviction Policy

Current design uses FIFO (simple, predictable). Alternative: LRU with move-to-front on hit.

**LRU advantages**:
- Better hit rate for workloads with temporal locality
- Commonly used SuperSlabs stay cached longer

**LRU disadvantages**:
- 2-3 extra cycles per hit (move to front)
- More complex implementation (doubly-linked list)

**Benchmark before switching**: Profile sh8bench, larson, cfrac with both policies.
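
For the comparison run, the LRU variant can be prototyped on the same small array: with only four entries, a hit at index `i` can shift the entries in front of it down by one slot instead of maintaining a linked list. The sketch below uses hypothetical names (`LruCache`, `lru_lookup`, `lru_insert`), is not the Phase 1 implementation, and omits duplicate detection for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LRU_SLOTS 4

typedef struct { uintptr_t base; uintptr_t end; void* ss; } LruEntry;
typedef struct { LruEntry entries[LRU_SLOTS]; uint32_t count; } LruCache;

/* Lookup with move-to-front: entries[0] is always the most recently used. */
static bool lru_lookup(LruCache* c, const void* ptr, void** out_ss) {
    uintptr_t p = (uintptr_t)ptr;
    for (uint32_t i = 0; i < c->count; i++) {
        if (p >= c->entries[i].base && p < c->entries[i].end) {
            LruEntry hit = c->entries[i];
            /* Shift entries [0, i) down by one slot, then put the hit first. */
            for (uint32_t j = i; j > 0; j--) {
                c->entries[j] = c->entries[j - 1];
            }
            c->entries[0] = hit;
            *out_ss = hit.ss;
            return true;
        }
    }
    return false;
}

/* Insert at the front; when full, the least recently used entry drops off. */
static void lru_insert(LruCache* c, void* ss, uintptr_t base, size_t size) {
    uint32_t n = (c->count < LRU_SLOTS) ? c->count : LRU_SLOTS - 1;
    for (uint32_t j = n; j > 0; j--) {
        c->entries[j] = c->entries[j - 1];
    }
    c->entries[0] = (LruEntry){ base, base + size, ss };
    if (c->count < LRU_SLOTS) {
        c->count++;
    }
}
```

A hit on the front entry costs nothing extra, while a hit at the last slot copies three 24-byte entries, so the per-hit overhead depends on how often the hot SuperSlab is already in front.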

### 11.4 Per-Class Hint Caches

Current design: Single cache for all classes (4 entries, any class).
Alternative: Per-class caches (1 entry per class, 8 entries total).

**Per-class advantages**:
- Guaranteed cache slot for each class
- No inter-class eviction

**Per-class disadvantages**:
- Wastes space if only 2-3 classes are active
- More TLS overhead (8 entries vs 4)

**Recommendation**: Defer until benchmarks show inter-class thrashing.
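
A per-class layout could look like the sketch below: one slot per size class, overwritten on allocation when the class index is known, and scanned in full on free because the class is not known yet at that point. The class count of 8 is taken from the text above; every name here is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HINT_NUM_CLASSES 8

typedef struct { uintptr_t base; uintptr_t end; void* ss; } ClassHint;
typedef struct { ClassHint slot[HINT_NUM_CLASSES]; } PerClassHints;

/* Overwrite the slot for class_idx; slots of other classes are never evicted. */
static void class_hint_update(PerClassHints* h, int class_idx,
                              void* ss, uintptr_t base, size_t size) {
    if (class_idx < 0 || class_idx >= HINT_NUM_CLASSES || !ss || size == 0) {
        return;
    }
    h->slot[class_idx] = (ClassHint){ base, base + size, ss };
}

/* On free the class is unknown, so every slot is range-checked.
 * Empty slots have base == end == 0 and can never match. */
static bool class_hint_lookup(const PerClassHints* h, const void* ptr,
                              void** out_ss) {
    uintptr_t p = (uintptr_t)ptr;
    for (int c = 0; c < HINT_NUM_CLASSES; c++) {
        if (p >= h->slot[c].base && p < h->slot[c].end) {
            *out_ss = h->slot[c].ss;
            return true;
        }
    }
    return false;
}
```

The lookup now scans eight slots instead of four, so the miss cost roughly doubles. That cost belongs in the benchmark comparison alongside the reduced inter-class eviction.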

### 11.5 Statistics Export API

For production monitoring, export hit rate via:

```c
// Global aggregated stats (all threads)
void hak_tls_hint_global_stats(uint64_t* total_hits, uint64_t* total_misses);

// ENV-based stats dump at exit
// HAKMEM_TLS_HINT_STATS=1 → dump to stderr at exit
```
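
One possible shape for the aggregated counters is sketched below. It assumes C11 atomics are available and that each thread flushes its TLS counters into the global totals at a convenient point, such as thread exit. `hak_tls_hint_flush_stats()` and `hak_tls_hint_stats_enabled()` are hypothetical helpers; only `hak_tls_hint_global_stats()` and `HAKMEM_TLS_HINT_STATS` come from the declarations above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Process-wide totals, zero-initialized at program start. */
static _Atomic uint64_t g_hint_total_hits;
static _Atomic uint64_t g_hint_total_misses;

/* Hypothetical: add one thread's counters to the totals, then reset them.
 * Relaxed ordering is enough because the values are statistics only. */
static void hak_tls_hint_flush_stats(uint64_t* tls_hits, uint64_t* tls_misses) {
    atomic_fetch_add_explicit(&g_hint_total_hits, *tls_hits,
                              memory_order_relaxed);
    atomic_fetch_add_explicit(&g_hint_total_misses, *tls_misses,
                              memory_order_relaxed);
    *tls_hits = 0;
    *tls_misses = 0;
}

/* Global aggregated stats (all threads that have flushed so far). */
void hak_tls_hint_global_stats(uint64_t* total_hits, uint64_t* total_misses) {
    if (total_hits) {
        *total_hits = atomic_load_explicit(&g_hint_total_hits,
                                           memory_order_relaxed);
    }
    if (total_misses) {
        *total_misses = atomic_load_explicit(&g_hint_total_misses,
                                             memory_order_relaxed);
    }
}

/* Hypothetical: true when HAKMEM_TLS_HINT_STATS=1 is set in the environment. */
bool hak_tls_hint_stats_enabled(void) {
    const char* v = getenv("HAKMEM_TLS_HINT_STATS");
    return v != NULL && strcmp(v, "1") == 0;
}
```

Flushing in batches keeps atomic operations off the free() fast path; the totals lag behind by whatever each live thread has not flushed yet.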

---

## 12. Implementation Checklist

### 12.1 Phase 1a: Core Implementation (Week 1)
- [ ] Create `core/box/tls_ss_hint_box.h`
- [ ] Implement `tls_ss_hint_init()`
- [ ] Implement `tls_ss_hint_update()`
- [ ] Implement `tls_ss_hint_lookup()`
- [ ] Implement `tls_ss_hint_clear()`
- [ ] Add `HAKMEM_TINY_SS_TLS_HINT` flag to `hakmem_build_flags.h`
- [ ] Add validation check (hint requires headerless mode)
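
The validation check in the last item could be enforced at compile time. A minimal sketch, assuming both flags are 0/1 preprocessor macros; the fallback defaults are assumptions added only so the fragment compiles standalone, and the real defaults belong to `hakmem_build_flags.h`:

```c
/* Assumed fallback defaults for a standalone build of this fragment. */
#ifndef HAKMEM_TINY_HEADERLESS
#define HAKMEM_TINY_HEADERLESS 0
#endif
#ifndef HAKMEM_TINY_SS_TLS_HINT
#define HAKMEM_TINY_SS_TLS_HINT 0
#endif

/* The hint Box only makes sense when pointers carry no header. */
#if HAKMEM_TINY_SS_TLS_HINT && !HAKMEM_TINY_HEADERLESS
#error "HAKMEM_TINY_SS_TLS_HINT=1 requires HAKMEM_TINY_HEADERLESS=1"
#endif
```

Building with the hint enabled and Headerless disabled should then fail immediately, which is the "error check" case of the build validation in Phase 1c.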

### 12.2 Phase 1b: Integration (Week 2)
- [ ] Integrate into `hakmem_tiny_free.inc` (lookup path)
- [ ] Integrate into `hakmem_tiny.c` (update path after alloc)
- [ ] Integrate into `hakmem_tiny_refill.inc.h` (update path after refill)
- [ ] Integrate into `core/front/tiny_unified_cache.c` (update path)
- [ ] Call `tls_ss_hint_init()` in thread-local init

### 12.3 Phase 1c: Testing (Week 2-3)
- [ ] Write unit tests (`tests/test_tls_ss_hint.c`)
- [ ] Run unit tests: `make test_tls_ss_hint && ./test_tls_ss_hint`
- [ ] Build validation (hint disabled, hint enabled, error check)
- [ ] Benchmark comparison (sh8bench, cfrac, larson)
- [ ] Hit rate profiling (debug build with stats)
- [ ] Correctness tests (no crashes, no assertion failures)

### 12.4 Phase 1d: Validation (Week 3)
- [ ] Benchmark: sh8bench (target: +15-20%)
- [ ] Benchmark: cfrac (target: +10-15%)
- [ ] Benchmark: larson 8 threads (target: +15-20%)
- [ ] Hit rate analysis (target: 85-95%)
- [ ] Memory overhead check (target: < 150 bytes/thread)
- [ ] Regression test: Headerless=0 mode still works

### 12.5 Phase 1e: Documentation (Week 3-4)
- [ ] Update `docs/PHASE2_HEADERLESS_INSTRUCTION.md` with hint Box
- [ ] Add Box Theory annotation to hakmem Box registry
- [ ] Write performance analysis report (before/after comparison)
- [ ] Update build instructions (`make shared EXTRA_CFLAGS=...`)

---

## 13. Rollout Plan

### Stage 1: Internal Testing (Week 1-3)
- Build with `HAKMEM_TINY_SS_TLS_HINT=1` in dev environment
- Run full benchmark suite (mimalloc-bench)
- Profile with perf/cachegrind (verify cycle count reduction)
- Fix any integration bugs

### Stage 2: Canary Deployment (Week 4)
- Enable the hint Box for 5% of production traffic
- Monitor: crash rate, performance metrics, hit rate
- A/B test: Hint ON vs Hint OFF

### Stage 3: Gradual Rollout (Week 5-6)
- 25% traffic (if the canary succeeds)
- 50% traffic
- 100% traffic

### Stage 4: Default Enable (Week 7)
- Change default: `HAKMEM_TINY_SS_TLS_HINT=1`
- Update build scripts, CI/CD pipelines
- Announce in release notes

---

## 14. Success Metrics

| Metric | Baseline | Target | Expected change / measurement method |
|--------|----------|--------|--------------------------------------|
| sh8bench throughput | 54.60 Mops/s | 64-68 Mops/s | +15-20% |
| cfrac runtime | 1.25 sec | 1.10-1.15 sec | 10-15% lower |
| larson throughput | 6.5M ops/s | 7.5-8.0M ops/s | +15-20% |
| TLS hint hit rate | N/A | 85-95% | Stats API |
| free() cycle count | 15-60 cycles | 7-15 cycles (hit) | perf/cachegrind |
| Memory overhead | 0 | < 150 bytes/thread | sizeof(TlsSsHintCache) |
| Crash rate | 0.001% | 0.001% (no regression) | Production monitoring |
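
The hit-rate target and the lookup-cost target are linked. As a back-of-the-envelope sketch (the function name is hypothetical, and it assumes a miss pays for the failed hint scan plus the registry lookup), the expected cost of one ptr→SuperSlab resolution can be estimated as:

```c
/* Expected cycles per ptr->SuperSlab resolution for a given hit rate.
 * Assumption: a miss costs the failed hint scan plus the registry lookup. */
static double expected_lookup_cycles(double hit_rate,
                                     double hint_cycles,
                                     double registry_cycles) {
    double miss_cycles = hint_cycles + registry_cycles;
    return hit_rate * hint_cycles + (1.0 - hit_rate) * miss_cycles;
}
```

With a 90% hit rate, a 5-cycle hint scan, and a 50-cycle `hak_super_lookup()` (the upper ends of the ranges in the Executive Summary), this gives about 10 cycles per resolution, compared with 50 cycles when every free() goes to the registry. The table's free() cycle count is higher because it includes the rest of the free path.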

---

## 15. Open Questions

1. **Q**: Should we implement per-class hint caches instead of unified cache?
   **A**: Defer until benchmarks show inter-class thrashing. Current unified design is simpler and sufficient for most workloads.

2. **Q**: Should we use LRU instead of FIFO eviction?
   **A**: Defer until benchmarks show FIFO hit rate < 80%. FIFO is simpler and avoids move-to-front cost on hits.

3. **Q**: Should we make TLS_SS_HINT_SLOTS runtime-configurable?
   **A**: No. A compile-time constant allows better optimization (loop unrolling, register allocation). Consider adaptive sizing in Phase 2 if needed.

4. **Q**: Should we validate SUPERSLAB_MAGIC in tls_ss_hint_lookup()?
   **A**: No. Keep the lookup minimal (2-5 cycles); the caller (the free() path) must validate the magic number. This matches the existing design, where hak_super_lookup() also requires caller validation.

5. **Q**: Should we export hit rate stats in production builds?
   **A**: Phase 1: No (save 16 bytes/thread). Phase 2: Add global aggregated stats API for monitoring if needed.
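
The division of responsibility in question 4 can be sketched as follows. This is an illustration under assumptions, not the actual free() path: the `SuperSlab` layout, the `SUPERSLAB_MAGIC` value, the lookup signatures, and the stub bodies are all placeholders, and `resolve_superslab()` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

#define SUPERSLAB_MAGIC 0x48414B4Du  /* placeholder value for this sketch only */

typedef struct SuperSlab {
    uint32_t magic;  /* placeholder layout: only the field this sketch needs */
} SuperSlab;

/* Stubs standing in for the real hint Box and the global registry. */
static SuperSlab* g_stub_hint;
static SuperSlab* g_stub_registry;
static SuperSlab* tls_ss_hint_lookup(void* ptr) { (void)ptr; return g_stub_hint; }
static SuperSlab* hak_super_lookup(void* ptr)   { (void)ptr; return g_stub_registry; }

/* Caller-side resolution: try the hint first, validate the magic number,
 * and fall back to the registry on a miss or a failed validation. */
static SuperSlab* resolve_superslab(void* ptr) {
    SuperSlab* ss = tls_ss_hint_lookup(ptr);
    if (ss != NULL && ss->magic == SUPERSLAB_MAGIC) {
        return ss;  /* fast path: hint hit, validated by the caller */
    }

    ss = hak_super_lookup(ptr);
    if (ss != NULL && ss->magic == SUPERSLAB_MAGIC) {
        return ss;  /* slow path: registry hit, validated by the caller */
    }
    return NULL;    /* no validated owner found for this pointer */
}
```

Keeping the magic check in the caller means both lookups share one validation rule, and the hint lookup itself stays a pure range comparison.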

---

## 16. Conclusion

The TLS Superslab Hint Box is a low-risk, high-reward optimization that is expected to reduce the performance gap between Headerless mode and Header mode from 30% to ~15%. The design is self-contained, testable, and follows hakmem's Box Theory architecture. Expected implementation time: 3-4 weeks (including testing and validation).

**Key Strengths**:
- Minimal integration surface (5 call sites)
- Self-contained Box (no dependencies)
- Fail-safe fallback (miss → hak_super_lookup)
- Low memory overhead (roughly 112-128 bytes/thread, under the 150 bytes/thread target)
- Proven pattern (TLS caching used in jemalloc, tcmalloc)

**Next Steps**:
1. Review this design document
2. Approve Phase 1a implementation (core Box)
3. Begin implementation with unit tests
4. Benchmark and validate in dev environment
5. Plan Phase 2 integration (Global Class Map)

---

**End of Design Document**