TLS Superslab Hint Box - Design Document

Phase: Headerless Performance Optimization - Phase 1
Date: 2025-12-03
Status: Design Review
Author: hakmem team


1. Executive Summary

The TLS Superslab Hint Box is a thread-local cache that accelerates pointer-to-SuperSlab resolution in Headerless mode. When HAKMEM_TINY_HEADERLESS=1 is enabled, every free() operation requires translating a user pointer to its owning SuperSlab. Currently, this uses hak_super_lookup(), which performs a hash table lookup costing 10-50 cycles. By caching recently-used SuperSlab references in thread-local storage, we can reduce this to 2-5 cycles for cache hits (85-95% hit rate expected).

Expected Performance Improvement: 15-20% throughput increase (54.60 → roughly 63-66 Mops/s on sh8bench)

Risk Level: Low

  • Thread-local storage eliminates cache coherency issues
  • Magic number validation provides fail-safe fallback
  • Self-contained Box with minimal integration surface
  • Memory overhead: ~104 bytes per thread on 64-bit targets, ~120 with profiling counters (negligible)

2. Box Definition (Box Theory)

Box: TLS Superslab Hint Cache

MISSION:
  Cache recently-used SuperSlab references in TLS to accelerate
  ptr→SuperSlab resolution in Headerless mode, avoiding expensive
  hash table lookups on the critical free() path.

DESIGN:
  - Provides O(1) lookup for hot SuperSlabs (L1 cache hit, 2-5 cycles)
  - Falls back to global registry on miss (fail-safe, no data loss)
  - No ownership, no remote queues, pure read-only cache
  - FIFO eviction policy with configurable cache size (2-4 slots)

INVARIANTS:
  - hint.base <= ptr < hint.end implies hint.ss is valid
  - Miss is always safe (triggers fallback to hak_super_lookup)
  - TLS data survives only within thread lifetime
  - Cache entries are invalidated implicitly by FIFO rotation
  - Magic number check (SUPERSLAB_MAGIC), performed by the caller, validates every returned pointer

BOUNDARY:
  - Input: raw user pointer (void* ptr) from free() path
  - Output: SuperSlab* or NULL (miss triggers fallback)
  - Does NOT determine class_idx (that's slab_index_for's job)
  - Does NOT perform ownership validation (that's SuperSlab's job)

PERFORMANCE:
  - Cache hit: 2-5 cycles (L1 cache hit, up to 4 range checks of 2 pointer comparisons each)
  - Cache miss: fallback to hak_super_lookup (10-50 cycles)
  - Expected hit rate: 85-95% for single-threaded workloads
  - Expected hit rate: 70-85% for multi-threaded workloads

THREAD SAFETY:
  - TLS storage: no sharing, no synchronization required
  - Read-only cache: never modifies SuperSlab state
  - Stale entries: caught by the caller's magic number check after a hit

3. Data Structures

// core/box/tls_ss_hint_box.h

#ifndef TLS_SS_HINT_BOX_H
#define TLS_SS_HINT_BOX_H

#include <stddef.h>   // size_t, NULL
#include <stdint.h>
#include <stdbool.h>

// Forward declaration
struct SuperSlab;

// Cache entry for a single SuperSlab hint
// Size: 24 bytes on 64-bit targets (three pointers); the 4-entry cache spans two 64-byte cache lines
typedef struct {
    void* base;              // SuperSlab base address (aligned to 1MB or 2MB)
    void* end;               // base + superslab_size (for range check)
    struct SuperSlab* ss;    // Cached SuperSlab pointer
} TlsSsHintEntry;

// TLS hint cache configuration
// - 4 slots provide good hit rate without excessive overhead
// - Larger caches (8, 16) show diminishing returns in benchmarks
// - Smaller caches (2) may thrash on workloads with 3+ active SuperSlabs
#define TLS_SS_HINT_SLOTS 4

// Thread-local SuperSlab hint cache
// Total size (64-bit): 24*4 + 8 = 104 bytes per thread; 120 bytes with the stats counters
typedef struct {
    TlsSsHintEntry entries[TLS_SS_HINT_SLOTS];  // Cache entries
    uint32_t count;          // Number of valid entries (0 to TLS_SS_HINT_SLOTS)
    uint32_t next_slot;      // Next slot for FIFO rotation (wraps at TLS_SS_HINT_SLOTS)

    // Statistics (optional, for profiling builds)
    // Disabled in HAKMEM_BUILD_RELEASE to save 16 bytes per thread
    #if !HAKMEM_BUILD_RELEASE
    uint64_t hits;           // Cache hit count
    uint64_t misses;         // Cache miss count
    #endif
} TlsSsHintCache;

// Thread-local storage instance
// Initialized to zero by TLS semantics, formal init in tls_ss_hint_init()
extern __thread TlsSsHintCache g_tls_ss_hint;

#endif // TLS_SS_HINT_BOX_H
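The per-thread footprint of these structures can be checked at compile time rather than by hand. The following is a minimal sketch, assuming a typical 64-bit (LP64) target; the `HintEntrySketch`, `HintCacheReleaseSketch`, and `HintCacheDebugSketch` names are hypothetical stand-ins that only mirror the field layout above, and `hint_cache_expected_size()` is an illustrative helper, not part of hakmem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HINT_SLOTS 4  /* mirrors TLS_SS_HINT_SLOTS */

/* Stand-in for TlsSsHintEntry: three pointers. */
typedef struct {
    void* base;
    void* end;
    void* ss;
} HintEntrySketch;

/* Release layout: entries plus the two 32-bit cursors. */
typedef struct {
    HintEntrySketch entries[HINT_SLOTS];
    uint32_t count;
    uint32_t next_slot;
} HintCacheReleaseSketch;

/* Profiling layout: adds the two 64-bit counters. */
typedef struct {
    HintEntrySketch entries[HINT_SLOTS];
    uint32_t count;
    uint32_t next_slot;
    uint64_t hits;
    uint64_t misses;
} HintCacheDebugSketch;

/* Sum of the field sizes; equals sizeof when the layout has no padding. */
static size_t hint_cache_expected_size(int with_stats) {
    size_t n = HINT_SLOTS * 3 * sizeof(void*) + 2 * sizeof(uint32_t);
    if (with_stats) {
        n += 2 * sizeof(uint64_t);
    }
    return n;
}

/* On 64-bit targets: 96 + 8 = 104 bytes, and 104 + 16 = 120 with stats. */
static_assert(sizeof(void*) != 8 || sizeof(HintCacheReleaseSketch) == 104,
              "release layout should be 104 bytes on 64-bit targets");
static_assert(sizeof(void*) != 8 || sizeof(HintCacheDebugSketch) == 120,
              "profiling layout should be 120 bytes on 64-bit targets");
```

A similar `static_assert` next to the real `TlsSsHintCache` definition would catch accidental growth if fields are added later.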

4. API Design

// core/box/tls_ss_hint_box.h (continued)

/**
 * @brief Initialize TLS hint cache for current thread
 *
 * Call once per thread, typically in thread-local initialization path.
 * Safe to call multiple times (idempotent).
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~10 cycles (negligible one-time cost)
 */
static inline void tls_ss_hint_init(void);

/**
 * @brief Update hint cache with a SuperSlab reference
 *
 * Called on paths where we know the SuperSlab for a given address range:
 * - After successful tiny_alloc (cache the allocated-from SuperSlab)
 * - After superslab refill (cache the newly bound SuperSlab)
 * - After unified cache refill (cache the refilled SuperSlab)
 *
 * Duplicate detection: If the SuperSlab is already cached, no update occurs.
 * This prevents thrashing when repeatedly allocating from the same SuperSlab.
 *
 * @param ss    SuperSlab to cache (must be non-NULL, SUPERSLAB_MAGIC validated by caller)
 * @param base  SuperSlab base address (1MB or 2MB aligned)
 * @param size  SuperSlab size in bytes (1MB or 2MB)
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~15-20 cycles (duplicate check + FIFO rotation)
 */
static inline void tls_ss_hint_update(struct SuperSlab* ss, void* base, size_t size);

/**
 * @brief Lookup SuperSlab for given pointer (fast path)
 *
 * Called on free() entry, before falling back to hak_super_lookup().
 * Performs linear search over cached entries (4 iterations max).
 *
 * Cache hit: Returns true, sets *out_ss to cached SuperSlab pointer
 * Cache miss: Returns false, caller must use hak_super_lookup()
 *
 * @param ptr     User pointer to lookup (arbitrary alignment)
 * @param out_ss  Output: SuperSlab pointer if found (only valid if return true)
 * @return true if cache hit (out_ss is valid), false if miss
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: 2-5 cycles (hit), 8-12 cycles (miss)
 *
 * NOTE: Caller MUST validate SUPERSLAB_MAGIC after successful lookup.
 *       This Box does not perform magic validation to keep fast path minimal.
 */
static inline bool tls_ss_hint_lookup(void* ptr, struct SuperSlab** out_ss);

/**
 * @brief Clear all cached hints (for testing/reset)
 *
 * Use cases:
 * - Unit tests: Reset cache between test cases
 * - Debug: Force cache cold start for profiling
 * - Thread teardown: Optional cleanup (TLS auto-cleanup on thread exit)
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~10 cycles
 */
static inline void tls_ss_hint_clear(void);

/**
 * @brief Get cache statistics (for profiling builds)
 *
 * Returns hit/miss counters for performance analysis.
 * Only available in non-release builds (HAKMEM_BUILD_RELEASE=0).
 *
 * @param hits    Output: Total cache hits
 * @param misses  Output: Total cache misses
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~5 cycles (two loads)
 */
#if !HAKMEM_BUILD_RELEASE
static inline void tls_ss_hint_stats(uint64_t* hits, uint64_t* misses);
#endif

5. Implementation Details

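// core/box/tls_ss_hint_box.c
// Note: the static inline functions below must be visible in every translation
// unit that calls them, so for a header-only Box move them into
// tls_ss_hint_box.h and keep only the TLS definition in this .c file.

#include "tls_ss_hint_box.h"
#include "../hakmem_tiny_superslab.h"  // For SuperSlab, SUPERSLAB_MAGIC

// Thread-local storage definition (exactly one definition across the program)
__thread TlsSsHintCache g_tls_ss_hint = {0};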

/**
 * Initialize TLS hint cache
 * Safe to call multiple times; every call unconditionally resets the cache to empty
 */
static inline void tls_ss_hint_init(void) {
    // Zero-initialization by TLS, but explicit init for clarity
    g_tls_ss_hint.count = 0;
    g_tls_ss_hint.next_slot = 0;

    #if !HAKMEM_BUILD_RELEASE
    g_tls_ss_hint.hits = 0;
    g_tls_ss_hint.misses = 0;
    #endif

    // Clear all entries (paranoid, but cache-friendly loop)
    for (int i = 0; i < TLS_SS_HINT_SLOTS; i++) {
        g_tls_ss_hint.entries[i].base = NULL;
        g_tls_ss_hint.entries[i].end = NULL;
        g_tls_ss_hint.entries[i].ss = NULL;
    }
}

/**
 * Update hint cache with SuperSlab reference
 * FIFO rotation: oldest entry is evicted when cache is full
 * Duplicate detection: skip if SuperSlab already cached
 */
static inline void tls_ss_hint_update(struct SuperSlab* ss, void* base, size_t size) {
    // Sanity check: reject invalid inputs
    if (__builtin_expect(!ss || !base || size == 0, 0)) {
        return;
    }

    // Duplicate detection: check if this SuperSlab is already cached
    // This prevents thrashing when allocating from the same SuperSlab repeatedly
    for (uint32_t i = 0; i < g_tls_ss_hint.count; i++) {
        if (g_tls_ss_hint.entries[i].ss == ss) {
            return;  // Already cached, no update needed
        }
    }

    // Add to next slot (FIFO rotation)
    uint32_t slot = g_tls_ss_hint.next_slot;
    g_tls_ss_hint.entries[slot].base = base;
    g_tls_ss_hint.entries[slot].end = (char*)base + size;
    g_tls_ss_hint.entries[slot].ss = ss;

    // Advance to next slot (wrap at TLS_SS_HINT_SLOTS)
    g_tls_ss_hint.next_slot = (slot + 1) % TLS_SS_HINT_SLOTS;

    // Increment count until cache is full
    if (g_tls_ss_hint.count < TLS_SS_HINT_SLOTS) {
        g_tls_ss_hint.count++;
    }
}

/**
 * Lookup SuperSlab for pointer (fast path)
 * Linear search over cached entries (4 iterations max)
 *
 * Performance note:
 * - Linear search is faster than hash table for small N (N <= 8)
 * - Range comparison (ptr >= base && ptr < end) typically costs 2-3 cycles
 * - Total cost: 2-5 cycles (hit), 8-12 cycles (miss with 4 entries)
 */
static inline bool tls_ss_hint_lookup(void* ptr, struct SuperSlab** out_ss) {
    // Compare as integers: relational operators on pointers into different
    // objects are undefined behavior in ISO C
    uintptr_t p = (uintptr_t)ptr;

    // Fast path: iterate over valid entries
    // Unrolling this loop (if count is small) is beneficial, but let compiler decide
    for (uint32_t i = 0; i < g_tls_ss_hint.count; i++) {
        TlsSsHintEntry* e = &g_tls_ss_hint.entries[i];

        // Range check: base <= ptr < end
        // Note: end is exclusive (base + size), so use < not <=
        if (p >= (uintptr_t)e->base && p < (uintptr_t)e->end) {
            // Cache hit!
            *out_ss = e->ss;

            #if !HAKMEM_BUILD_RELEASE
            g_tls_ss_hint.hits++;
            #endif

            return true;
        }
    }

    // Cache miss: caller must fall back to hak_super_lookup()
    #if !HAKMEM_BUILD_RELEASE
    g_tls_ss_hint.misses++;
    #endif

    return false;
}

/**
 * Clear all cached hints
 * Use for testing or manual reset
 */
static inline void tls_ss_hint_clear(void) {
    g_tls_ss_hint.count = 0;
    g_tls_ss_hint.next_slot = 0;

    #if !HAKMEM_BUILD_RELEASE
    // Preserve stats across clear (for cumulative profiling)
    // Uncomment to reset stats:
    // g_tls_ss_hint.hits = 0;
    // g_tls_ss_hint.misses = 0;
    #endif

    // Optional: zero out entries (paranoid, not required for correctness)
    for (int i = 0; i < TLS_SS_HINT_SLOTS; i++) {
        g_tls_ss_hint.entries[i].base = NULL;
        g_tls_ss_hint.entries[i].end = NULL;
        g_tls_ss_hint.entries[i].ss = NULL;
    }
}

/**
 * Get cache statistics (profiling builds only)
 */
#if !HAKMEM_BUILD_RELEASE
static inline void tls_ss_hint_stats(uint64_t* hits, uint64_t* misses) {
    if (hits) *hits = g_tls_ss_hint.hits;
    if (misses) *misses = g_tls_ss_hint.misses;
}
#endif

6. Integration Points

6.1 Update Points: When to Call tls_ss_hint_update()

The hint cache should be updated whenever we know the SuperSlab for an address range. This happens on allocation success paths:

Location 1: After Successful Tiny Alloc (hakmem_tiny.c)

// In hak_tiny_alloc or similar allocation path
void* ptr = tiny_allocate_from_superslab(class_idx, &ss);
if (ptr) {
    #if HAKMEM_TINY_SS_TLS_HINT
    // Cache the SuperSlab we just allocated from
    // This improves free() performance for LIFO allocation patterns
    tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
    #endif
    return ptr;
}

Location 2: After SuperSlab Refill (hakmem_tiny_refill.inc.h)

// In tiny_refill_from_superslab or superslab_allocate
SuperSlab* ss = superslab_allocate(class_idx);
if (ss) {
    // Bind SuperSlab to thread's TLS state
    bind_superslab_to_thread(ss, class_idx);

    #if HAKMEM_TINY_SS_TLS_HINT
    // Cache the newly bound SuperSlab
    // Future allocations from this SuperSlab will have cached hint
    tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
    #endif
}

Location 3: Unified Cache Refill (core/front/tiny_unified_cache.c)

// In unified_cache_refill_class
void* block = superslab_alloc_block(class_idx, &ss);
if (block) {
    #if HAKMEM_TINY_SS_TLS_HINT
    // Cache the SuperSlab that provided this block
    tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
    #endif

    // Push to unified cache
    unified_cache_push(class_idx, block);
}

Location 4: Thread-Local Init (hakmem_tiny_tls_init)

// In tiny_tls_init or thread_local_init
void tiny_tls_init(void) {
    // Initialize TLS structures
    tiny_magazine_init();
    tiny_sll_init();

    #if HAKMEM_TINY_SS_TLS_HINT
    // Initialize hint cache (zero-init by TLS, but explicit for clarity)
    tls_ss_hint_init();
    #endif
}

6.2 Lookup Points: When to Call tls_ss_hint_lookup()

The hint lookup should be the first step in the free() path, before falling back to the registry lookup:

Location 1: Tiny Free Entry (core/hakmem_tiny_free.inc)

// In hak_tiny_free or similar free path
void hak_tiny_free(void* ptr) {
    if (!ptr) return;

    SuperSlab* ss = NULL;

    #if HAKMEM_TINY_HEADERLESS
        // Phase 1: Try TLS hint cache (fast path, 2-5 cycles on hit).
        // A hit whose magic no longer matches is treated as a miss.
        #if HAKMEM_TINY_SS_TLS_HINT
        if (!tls_ss_hint_lookup(ptr, &ss) || ss->magic != SUPERSLAB_MAGIC)
        #endif
        {
            // Phase 2: Fallback to global registry (slow path, 10-50 cycles)
            ss = hak_super_lookup(ptr);
        }

        // Validate SuperSlab (magic check)
        if (!ss || ss->magic != SUPERSLAB_MAGIC) {
            // Invalid pointer - external guard path
            hak_external_guard_free(ptr);
            return;
        }

        // Proceed with free using SuperSlab info
        int class_idx = slab_index_for(ss, ptr);
        tiny_free_to_slab(ss, ptr, class_idx);

    #else
        // Header mode: read class_idx from header (1-3 cycles)
        uint8_t hdr = *((uint8_t*)ptr - 1);
        int class_idx = hdr & 0x7;
        tiny_free_to_class(class_idx, ptr);
    #endif
}

Location 2: Fast Free Path (core/tiny_free_fast_v2.inc.h)

// In tiny_free_fast or inline free path
static inline void tiny_free_fast(void* ptr) {
    #if HAKMEM_TINY_HEADERLESS
        SuperSlab* ss = NULL;

        // Try hint cache first; a hit with a stale magic is treated as a miss
        #if HAKMEM_TINY_SS_TLS_HINT
        if (!tls_ss_hint_lookup(ptr, &ss) || ss->magic != SUPERSLAB_MAGIC)
        #endif
        {
            ss = hak_super_lookup(ptr);
        }

        if (__builtin_expect(!ss || ss->magic != SUPERSLAB_MAGIC, 0)) {
            // Slow path: external guard or invalid pointer
            hak_tiny_free_slow(ptr);
            return;
        }

        // Fast path: push to TLS freelist
        int class_idx = slab_index_for(ss, ptr);
        front_gate_push_tls(class_idx, ptr);

    #else
        // Header mode fast path
        uint8_t hdr = *((uint8_t*)ptr - 1);
        int class_idx = hdr & 0x7;
        front_gate_push_tls(class_idx, ptr);
    #endif
}
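Both free paths share the same resolution order: hint first, registry second, external pointer last. The sketch below factors that order into one helper so the stale-hint case is handled in a single place. It is illustrative only: `SketchSlab`, `SKETCH_MAGIC`, and the two `stub_*` functions are hypothetical stand-ins for `SuperSlab`, `SUPERSLAB_MAGIC`, `tls_ss_hint_lookup()`, and `hak_super_lookup()`, and the real signatures may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAGIC 0x53534D47u  /* placeholder, not the real SUPERSLAB_MAGIC */

typedef struct { uint32_t magic; } SketchSlab;  /* stand-in for SuperSlab */

/* Stub state standing in for the hint cache and the global registry. */
static SketchSlab* g_stub_hint;      /* NULL means the hint cache misses */
static SketchSlab* g_stub_registry;  /* NULL means the registry has no owner */

/* Stand-in for tls_ss_hint_lookup(): reports a hit when g_stub_hint is set. */
static bool stub_hint_lookup(void* ptr, SketchSlab** out_ss) {
    (void)ptr;
    if (g_stub_hint == NULL) {
        return false;
    }
    *out_ss = g_stub_hint;
    return true;
}

/* Stand-in for hak_super_lookup(). */
static SketchSlab* stub_registry_lookup(void* ptr) {
    (void)ptr;
    return g_stub_registry;
}

/* Resolve ptr to a validated SuperSlab, or NULL for an external pointer. */
static SketchSlab* resolve_superslab(void* ptr) {
    SketchSlab* ss = NULL;
    assert(ptr != NULL);

    /* Tier 1: hint cache; the magic check rejects stale entries. */
    if (stub_hint_lookup(ptr, &ss) && ss->magic == SKETCH_MAGIC) {
        return ss;
    }

    /* Tier 2: registry, used on a miss and on a stale hint alike. */
    ss = stub_registry_lookup(ptr);
    if (ss != NULL && ss->magic == SKETCH_MAGIC) {
        return ss;
    }
    return NULL;  /* caller routes this to the external guard path */
}
```

With a helper of this shape, each call site reduces to one call plus a NULL check, and the nested `#if` blocks stay out of the free functions themselves.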

7. Build Flag (Compile-Time Configuration)

// In hakmem_build_flags.h or similar configuration header

// ============================================================================
// Phase 1: Headerless Optimization - TLS SuperSlab Hint Cache
// ============================================================================
// Purpose: Accelerate ptr→SuperSlab lookup in Headerless mode
// Default: 0 (disabled during development and testing)
// Target: 1 (enabled after validation in Phase 1 rollout)
//
// Performance Impact:
// - Cache hit: 2-5 cycles (vs 10-50 cycles for hak_super_lookup)
// - Expected hit rate: 85-95% (single-threaded), 70-85% (multi-threaded)
// - Expected throughput improvement: 15-20%
//
// Memory Overhead:
// - 104 bytes per thread (TLS, 64-bit release build)
// - Negligible for typical workloads (1000 threads = 104KB)
//
// Dependencies:
// - Requires HAKMEM_TINY_HEADERLESS=1 (hint is no-op in header mode)
// - No other dependencies (self-contained Box)

#ifndef HAKMEM_TINY_SS_TLS_HINT
  #define HAKMEM_TINY_SS_TLS_HINT 0
#endif

// Validation: Hint Box only active in Headerless mode
#if HAKMEM_TINY_SS_TLS_HINT && !HAKMEM_TINY_HEADERLESS
  #error "HAKMEM_TINY_SS_TLS_HINT requires HAKMEM_TINY_HEADERLESS=1"
#endif
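Because the flag is resolved at compile time, call sites can hide the conditional behind a single macro instead of repeating `#if` blocks. The following is a rough sketch of that idea; `TLS_SS_HINT_TRY` and `sketch_needs_registry()` are hypothetical names introduced here for illustration and do not exist in hakmem. With the flag left at its default of 0 the macro always reports a miss, so the snippet compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef HAKMEM_TINY_SS_TLS_HINT
#define HAKMEM_TINY_SS_TLS_HINT 0   /* same default as the flag header */
#endif

struct SuperSlab;  /* opaque here; only pointers are passed around */

#if HAKMEM_TINY_SS_TLS_HINT
#include "core/box/tls_ss_hint_box.h"
/* Hint enabled: forward to the real lookup. */
#define TLS_SS_HINT_TRY(ptr, out_ss) tls_ss_hint_lookup((ptr), (out_ss))
#else
/* Hint disabled: always report a miss. */
#define TLS_SS_HINT_TRY(ptr, out_ss) ((void)(ptr), (void)(out_ss), false)
#endif

/* Returns 1 when the caller must consult the registry for ptr. */
static int sketch_needs_registry(void* ptr) {
    struct SuperSlab* ss = NULL;
    assert(ptr != NULL);
    if (TLS_SS_HINT_TRY(ptr, &ss)) {
        return 0;  /* hint hit; caller still validates ss->magic */
    }
    return 1;
}
```

The disabled branch is a constant `false`, so an optimizing compiler can usually drop the hint path entirely in builds without the flag.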

8. Testing Plan

8.1 Unit Tests

Create tests/test_tls_ss_hint.c (relative to the hakmem repository root):

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include "core/box/tls_ss_hint_box.h"
#include "core/hakmem_tiny_superslab.h"

// Mock SuperSlab for testing
typedef struct {
    uint32_t magic;
    void* base_addr;
    size_t size_bytes;
    uint8_t size_class;
} MockSuperSlab;

void test_hint_init(void) {
    printf("test_hint_init...\n");

    tls_ss_hint_init();

    // Verify cache is empty
    assert(g_tls_ss_hint.count == 0);
    assert(g_tls_ss_hint.next_slot == 0);

    #if !HAKMEM_BUILD_RELEASE
    assert(g_tls_ss_hint.hits == 0);
    assert(g_tls_ss_hint.misses == 0);
    #endif

    printf("  PASS\n");
}

void test_hint_basic(void) {
    printf("test_hint_basic...\n");

    tls_ss_hint_init();

    // Mock SuperSlab
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,  // 2MB
        .size_class = 0
    };

    // Update hint
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    // Verify cache entry
    assert(g_tls_ss_hint.count == 1);
    assert(g_tls_ss_hint.entries[0].base == ss.base_addr);
    assert(g_tls_ss_hint.entries[0].ss == (SuperSlab*)&ss);

    // Lookup should hit (within range)
    SuperSlab* out = NULL;
    assert(tls_ss_hint_lookup((void*)0x1000100, &out) == true);
    assert(out == (SuperSlab*)&ss);

    // Lookup at base should hit
    assert(tls_ss_hint_lookup((void*)0x1000000, &out) == true);
    assert(out == (SuperSlab*)&ss);

    // Lookup at end-1 should hit (end = 0x1000000 + 2MB = 0x1200000)
    assert(tls_ss_hint_lookup((void*)0x11FFFFF, &out) == true);
    assert(out == (SuperSlab*)&ss);

    // Lookup at end should miss (exclusive boundary)
    assert(tls_ss_hint_lookup((void*)0x1200000, &out) == false);

    // Lookup outside range should miss
    assert(tls_ss_hint_lookup((void*)0x3000000, &out) == false);

    printf("  PASS\n");
}

void test_hint_fifo_rotation(void) {
    printf("test_hint_fifo_rotation...\n");

    tls_ss_hint_init();

    // Create 6 mock SuperSlabs (cache has 4 slots)
    MockSuperSlab ss[6];
    for (int i = 0; i < 6; i++) {
        ss[i].magic = SUPERSLAB_MAGIC;
        ss[i].base_addr = (void*)(uintptr_t)(0x1000000 + i * 0x200000);  // 2MB apart
        ss[i].size_bytes = 2 * 1024 * 1024;
        ss[i].size_class = 0;

        tls_ss_hint_update((SuperSlab*)&ss[i], ss[i].base_addr, ss[i].size_bytes);
    }

    // Cache should be full (4 slots)
    assert(g_tls_ss_hint.count == TLS_SS_HINT_SLOTS);

    // First 2 SuperSlabs should be evicted (FIFO)
    SuperSlab* out = NULL;
    assert(tls_ss_hint_lookup((void*)0x1000100, &out) == false);  // ss[0] evicted
    assert(tls_ss_hint_lookup((void*)0x1200100, &out) == false);  // ss[1] evicted

    // Last 4 SuperSlabs should be cached
    assert(tls_ss_hint_lookup((void*)0x1400100, &out) == true);   // ss[2]
    assert(out == (SuperSlab*)&ss[2]);
    assert(tls_ss_hint_lookup((void*)0x1600100, &out) == true);   // ss[3]
    assert(out == (SuperSlab*)&ss[3]);
    assert(tls_ss_hint_lookup((void*)0x1800100, &out) == true);   // ss[4]
    assert(out == (SuperSlab*)&ss[4]);
    assert(tls_ss_hint_lookup((void*)0x1A00100, &out) == true);   // ss[5]
    assert(out == (SuperSlab*)&ss[5]);

    printf("  PASS\n");
}

void test_hint_duplicate_detection(void) {
    printf("test_hint_duplicate_detection...\n");

    tls_ss_hint_init();

    // Mock SuperSlab
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,
        .size_class = 0
    };

    // Update hint 3 times with same SuperSlab
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    // Cache should have only 1 entry (duplicates ignored)
    assert(g_tls_ss_hint.count == 1);
    assert(g_tls_ss_hint.entries[0].ss == (SuperSlab*)&ss);

    printf("  PASS\n");
}

void test_hint_clear(void) {
    printf("test_hint_clear...\n");

    tls_ss_hint_init();

    // Add some entries
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,
        .size_class = 0
    };
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    assert(g_tls_ss_hint.count == 1);

    // Clear cache
    tls_ss_hint_clear();

    // Cache should be empty
    assert(g_tls_ss_hint.count == 0);
    assert(g_tls_ss_hint.next_slot == 0);

    // Lookup should miss
    SuperSlab* out = NULL;
    assert(tls_ss_hint_lookup((void*)0x1000100, &out) == false);

    printf("  PASS\n");
}

#if !HAKMEM_BUILD_RELEASE
void test_hint_stats(void) {
    printf("test_hint_stats...\n");

    tls_ss_hint_init();

    // Add entry
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,
        .size_class = 0
    };
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    // Perform lookups
    SuperSlab* out = NULL;
    tls_ss_hint_lookup((void*)0x1000100, &out);  // Hit
    tls_ss_hint_lookup((void*)0x1000200, &out);  // Hit
    tls_ss_hint_lookup((void*)0x3000000, &out);  // Miss

    // Check stats
    uint64_t hits = 0, misses = 0;
    tls_ss_hint_stats(&hits, &misses);

    assert(hits == 2);
    assert(misses == 1);

    printf("  PASS\n");
}
#endif

int main(void) {
    printf("Running TLS SS Hint Box unit tests...\n\n");

    test_hint_init();
    test_hint_basic();
    test_hint_fifo_rotation();
    test_hint_duplicate_detection();
    test_hint_clear();

    #if !HAKMEM_BUILD_RELEASE
    test_hint_stats();
    #endif

    printf("\nAll tests passed!\n");
    return 0;
}

8.2 Integration Tests

Test 1: Build Validation

# Build A: hint disabled (baseline)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"

# Build B: hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Build C: hint enabled in header mode (should fail to compile)
# make clean
# make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=0 -DHAKMEM_TINY_SS_TLS_HINT=1"
# Expected: Compile error (validation check in hakmem_build_flags.h)

Test 2: Benchmark Comparison

# Build baseline (hint disabled)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"

# Run benchmarks
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench > baseline.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 17545186520809 > cfrac_baseline.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 > larson_baseline.txt

# Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Run same benchmarks
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench > hint.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 17545186520809 > cfrac_hint.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 > larson_hint.txt

# Compare results
echo "=== sh8bench ==="
grep "Mops" baseline.txt hint.txt

echo "=== cfrac ==="
grep "time:" cfrac_baseline.txt cfrac_hint.txt

echo "=== larson ==="
grep "ops/s" larson_baseline.txt larson_hint.txt

Test 3: Hit Rate Profiling

# Build with stats enabled (non-release)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1 -DHAKMEM_BUILD_RELEASE=0"

# Add stats dump at exit (in hakmem_exit.c or similar)
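# void dump_hint_stats(void) {
#     uint64_t hits = 0, misses = 0;
#     tls_ss_hint_stats(&hits, &misses);  // counters of the calling thread only
#     uint64_t total = hits + misses;
#     fprintf(stderr, "[TLS_HINT_STATS] hits=%" PRIu64 " misses=%" PRIu64 " hit_rate=%.2f%%\n",
#             hits, misses, total ? 100.0 * hits / total : 0.0);  // PRIu64 needs <inttypes.h>
# }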

# Run benchmark and check hit rate
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep TLS_HINT_STATS
# Expected: hit_rate >= 85%
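The stats line is easy to get subtly wrong: `%lu` is not portable for `uint64_t`, and an idle thread has zero lookups. A small formatting helper, sketched below, sidesteps both. `format_hint_stats()` is a hypothetical name; only the `[TLS_HINT_STATS]` line format is taken from the profiling step above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Writes one stats line into buf and returns snprintf's result. */
static int format_hint_stats(char* buf, size_t n, uint64_t hits, uint64_t misses) {
    assert(buf != NULL);
    uint64_t total = hits + misses;
    /* Report 0% instead of dividing by zero when no lookups happened. */
    double rate = total ? 100.0 * (double)hits / (double)total : 0.0;
    return snprintf(buf, n,
                    "[TLS_HINT_STATS] hits=%" PRIu64 " misses=%" PRIu64
                    " hit_rate=%.2f%%",
                    hits, misses, rate);
}
```

Since the counters live in TLS, a dump called from an exit handler reports only the thread that runs the handler; worker threads would need to report their own counters before they exit.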

8.3 Correctness Tests

# Test with external pointer (should fall back to hak_super_lookup)
# This tests that cache misses are handled correctly

# Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Run sh8bench (allocates from multiple SuperSlabs)
# A zero exit status with no crashes or assertion failures counts as success
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench \
  && echo "Correctness test passed"

9. Performance Expectations

9.1 Cycle Count Analysis

| Operation | Without Hint | With Hint (Hit) | With Hint (Miss) | Improvement |
|---|---|---|---|---|
| free() lookup | 10-50 cycles | 2-5 cycles | 10-50 cycles | 80-95% |
| Range check (per entry) | N/A | 2 cycles | 2 cycles | - |
| Hash table lookup | 10-50 cycles | N/A | 10-50 cycles | - |
| Total free() cost | 15-60 cycles | 7-15 cycles (hit) | 20-65 cycles (miss) | 40-60% |

9.2 Expected Hit Rates

| Workload | Hit Rate | Reasoning |
|---|---|---|
| Single-threaded LIFO | 95-99% | free() immediately after alloc() from same SuperSlab |
| Single-threaded FIFO | 85-95% | Recent allocations from 2-4 SuperSlabs |
| Multi-threaded (8 threads) | 70-85% | Shared SuperSlabs, more cache thrashing |
| Larson (high churn) | 65-80% | Many active SuperSlabs, frequent evictions |
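Hit rate and cycle counts combine into an average lookup cost of h * hit + (1 - h) * (miss_scan + fallback), where h is the hit rate. The sketch below evaluates that expression and the break-even hit rate at which the hint cache costs the same as always calling the registry. The function names are hypothetical, and any inputs (for example 3.5, 10, and 30 cycles as midpoints of the 2-5, 8-12, and 10-50 cycle estimates in this document) are assumptions, not measurements.

```c
#include <assert.h>

/* Average lookup cost in cycles for hit rate h in [0, 1]. */
static double expected_lookup_cycles(double h, double hit,
                                     double miss_scan, double fallback) {
    assert(h >= 0.0 && h <= 1.0);
    /* A miss pays for the failed scan and then for the registry lookup. */
    return h * hit + (1.0 - h) * (miss_scan + fallback);
}

/* Hit rate at which the hint cache costs the same as the registry alone.
 * Derived by solving h*hit + (1-h)*(miss_scan + fallback) = fallback for h. */
static double break_even_hit_rate(double hit, double miss_scan, double fallback) {
    assert(miss_scan + fallback > hit);
    return miss_scan / (miss_scan + fallback - hit);
}
```

With the midpoint inputs, a 90% hit rate gives about 7.2 cycles per lookup against 30 for the registry alone, and the break-even hit rate is roughly 27%, so even the pessimistic Larson estimate sits well above it.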

9.3 Benchmark Targets

| Benchmark | Baseline (no hint) | Target (with hint) | Improvement |
|---|---|---|---|
| sh8bench | 54.60 Mops/s | 63-66 Mops/s | +15-20% |
| cfrac | 1.25 sec | 1.10-1.15 sec | +10-15% |
| larson (8 threads) | 6.5M ops/s | 7.5-8.0M ops/s | +15-20% |

9.4 Memory Overhead

| Metric | Value | Notes |
|---|---|---|
| Per-thread overhead | 104 bytes | TLS cache (release build, 64-bit) |
| Per-thread overhead (debug) | 120 bytes | TLS cache + stats counters |
| 1000 threads | 104 KB | Negligible for server workloads |
| 10000 threads | 1.04 MB | Still negligible |

10. Risk Analysis

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Cache coherency issues | Very Low | Low | TLS is thread-local, no sharing between threads |
| Stale hint after munmap | Low | Low | Magic check (SUPERSLAB_MAGIC) catches freed SuperSlabs only while their memory stays mapped; reading ss->magic after munmap would fault |
| Cache thrashing (many SS) | Low | Low | 4 slots cover typical workloads; miss falls back to registry |
| Memory overhead | Very Low | Very Low | 104 bytes/thread, negligible for most workloads |
| Integration bugs | Low | Medium | Self-contained Box, clear API, comprehensive tests |
| Hit rate lower than expected | Low | Low | Even 50% hit rate improves performance; a miss adds only the 8-12 cycle scan before the fallback |
| Complexity increase | Low | Low | 150 LOC, header-only Box, minimal dependencies |

10.1 Failure Modes and Recovery

| Failure Mode | Detection | Recovery |
|---|---|---|
| Stale SuperSlab pointer | Magic check (ss->magic != SUPERSLAB_MAGIC) | Fall back to hak_super_lookup() |
| Cache miss | tls_ss_hint_lookup returns false | Fall back to hak_super_lookup() |
| Invalid hint range | ptr outside [base, end) | Linear search continues, eventually misses |
| Thread teardown | TLS cleanup by OS | No manual cleanup needed |
| SuperSlab freed | Magic number cleared | Caught by magic check in free() path |

11. Future Considerations

11.1 Phase 2 Integration: Global Class Map

When Phase 2 introduces a Global Class Map (pointer → class_idx lookup), the TLS Hint Box becomes the first tier in a three-tier lookup hierarchy:

Tier 1 (fastest): TLS Hint Cache (2-5 cycles, 85-95% hit rate)
    ↓ miss
Tier 2 (medium): Global Class Map (5-15 cycles, 99%+ hit rate)
    ↓ miss
Tier 3 (slowest): Global SuperSlab Registry (10-50 cycles, 100% hit rate)

Integration point:

SuperSlab* ss = NULL;
int class_idx = -1;

// Tier 1: TLS hint
#if HAKMEM_TINY_SS_TLS_HINT
if (tls_ss_hint_lookup(ptr, &ss) && ss->magic == SUPERSLAB_MAGIC) {
    class_idx = slab_index_for(ss, ptr);
    goto found;
}
#endif

// Tier 2: Global class map
#if HAKMEM_TINY_CLASS_MAP
class_idx = class_map_lookup(ptr);
if (class_idx >= 0) {
    ss = hak_super_lookup(ptr);  // Still need SS for metadata
    goto found;
}
#endif

// Tier 3: Registry fallback
ss = hak_super_lookup(ptr);
if (ss && ss->magic == SUPERSLAB_MAGIC) {
    class_idx = slab_index_for(ss, ptr);
    goto found;
}

// External pointer
hak_external_guard_free(ptr);
return;

found:
    tiny_free_to_class(class_idx, ptr);

11.2 Adaptive Cache Sizing

Current design uses fixed TLS_SS_HINT_SLOTS = 4. Future optimization could make this adaptive:

  • Workload detection: Track hit rate over time windows
  • Dynamic sizing: Increase slots (4 → 8) if hit rate < 80%
  • Memory pressure: Decrease slots (8 → 2) if memory constrained

Implementation sketch:

#define TLS_SS_HINT_SLOTS_MAX 8

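typedef struct {
    uint32_t current_slots;  // Dynamic (2, 4, 8)
    uint64_t hits_window;
    uint64_t misses_window;
} TlsSsHintAdaptive;

// Hypothetical per-thread tuning state, kept separate from g_tls_ss_hint
static __thread TlsSsHintAdaptive g_tls_ss_hint_adaptive = { .current_slots = 4 };

void tls_ss_hint_tune(void) {
    TlsSsHintAdaptive* a = &g_tls_ss_hint_adaptive;
    uint64_t total = a->hits_window + a->misses_window;
    if (total == 0) return;  // No samples in this window

    double hit_rate = (double)a->hits_window / (double)total;

    if (hit_rate < 0.80 && a->current_slots < TLS_SS_HINT_SLOTS_MAX) {
        a->current_slots *= 2;  // Grow cache
    } else if (hit_rate > 0.95 && a->current_slots > 2) {
        a->current_slots /= 2;  // Shrink cache
    }

    a->hits_window = 0;    // Start a new window
    a->misses_window = 0;
}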

11.3 LRU vs FIFO Eviction Policy

Current design uses FIFO (simple, predictable). Alternative: LRU with move-to-front on hit.

LRU advantages:

  • Better hit rate for workloads with temporal locality
  • Commonly used SuperSlabs stay cached longer

LRU disadvantages:

  • 2-3 extra cycles per hit (move to front)
  • More complex implementation (doubly-linked list)

Benchmark before switching: Profile sh8bench, larson, cfrac with both policies.
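For a cache this small, move-to-front can also be done by shifting array entries, which may be worth including in that comparison. The following is a minimal sketch of such a variant, not the design proposed in this document; `MtfEntry`, `MtfCache`, `mtf_insert()`, and `mtf_lookup()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MTF_SLOTS 4

typedef struct { uintptr_t base, end; void* ss; } MtfEntry;
typedef struct { MtfEntry e[MTF_SLOTS]; uint32_t count; } MtfCache;

/* Insert at the front; when full, the least recently used entry (last) drops. */
static void mtf_insert(MtfCache* c, uintptr_t base, size_t size, void* ss) {
    assert(size > 0);
    uint32_t n = c->count < MTF_SLOTS ? c->count : MTF_SLOTS - 1;
    for (uint32_t i = n; i > 0; i--) {
        c->e[i] = c->e[i - 1];  /* shift older entries one slot back */
    }
    c->e[0] = (MtfEntry){ base, base + size, ss };
    if (c->count < MTF_SLOTS) {
        c->count++;
    }
}

/* Lookup with move-to-front: a hit at index i shifts entries 0..i-1 back. */
static void* mtf_lookup(MtfCache* c, uintptr_t p) {
    for (uint32_t i = 0; i < c->count; i++) {
        if (p >= c->e[i].base && p < c->e[i].end) {
            MtfEntry hit = c->e[i];
            for (uint32_t j = i; j > 0; j--) {
                c->e[j] = c->e[j - 1];
            }
            c->e[0] = hit;
            return hit.ss;
        }
    }
    return NULL;  /* miss: caller falls back to the registry */
}
```

A hit on the front entry costs nothing extra, while a hit further back copies up to three 24-byte entries, which is the per-hit overhead the comparison would need to measure.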

11.4 Per-Class Hint Caches

Current design: Single cache for all classes (4 entries, any class). Alternative: Per-class caches (1 entry per class, 8 entries total).

Per-class advantages:

  • Guaranteed cache slot for each class
  • No inter-class eviction

Per-class disadvantages:

  • Wastes space if only 2-3 classes are active
  • More TLS overhead (8 entries vs 4)

Recommendation: Defer until benchmarks show inter-class thrashing.
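For comparison, a per-class layout might look like the sketch below. `PerClassHint` and both functions are hypothetical names, and the sketch assumes 8 tiny classes as stated above. It also assumes that in Headerless mode the class is not known from the pointer alone, so the free path still scans every slot while the update path indexes directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TINY_CLASSES 8  /* "1 entry per class, 8 entries total" */

/* Hypothetical per-class cache: one address range per size class. */
typedef struct {
    uintptr_t base[TINY_CLASSES];
    uintptr_t end[TINY_CLASSES];  /* base == end == 0 marks an empty slot */
    void*     ss[TINY_CLASSES];   /* opaque stand-in for a SuperSlab pointer */
} PerClassHint;

/* Update after alloc/refill: the class is known, so the slot is indexed
 * directly and other classes are never evicted. */
void per_class_hint_update(PerClassHint* h, int class_idx,
                           uintptr_t base, uintptr_t end, void* ss) {
    assert(class_idx >= 0 && class_idx < TINY_CLASSES);
    h->base[class_idx] = base;
    h->end[class_idx]  = end;
    h->ss[class_idx]   = ss;
}

/* Lookup on free: scan all slots; report the matching class via class_out. */
void* per_class_hint_lookup(const PerClassHint* h, uintptr_t p, int* class_out) {
    for (int i = 0; i < TINY_CLASSES; i++) {
        if (p >= h->base[i] && p < h->end[i]) {
            *class_out = i;
            return h->ss[i];
        }
    }
    return NULL;  /* miss: caller falls back to hak_super_lookup() */
}
```

Note that the lookup scans 8 slots instead of 4, so a per-class layout trades a longer worst-case scan for the guarantee that a busy class cannot evict another.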

11.5 Statistics Export API

For production monitoring, export hit rate via:

// Global aggregated stats (all threads)
void hak_tls_hint_global_stats(uint64_t* total_hits, uint64_t* total_misses);

// ENV-based stats dump at exit
// HAKMEM_TLS_HINT_STATS=1 → dump to stderr at exit
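One way the aggregation could work is sketched below, assuming each thread folds its private counters into global atomics (for example at thread exit). Only `hak_tls_hint_global_stats()` and the `HAKMEM_TLS_HINT_STATS` variable come from this design; `hak_tls_hint_flush_thread_stats()`, `hak_tls_hint_stats_init()`, and the output format are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Process-wide totals; relaxed ordering is enough for monitoring counters. */
static _Atomic uint64_t g_hint_hits_total;
static _Atomic uint64_t g_hint_misses_total;

/* Hypothetical helper: fold one thread's counters into the totals and reset them. */
void hak_tls_hint_flush_thread_stats(uint64_t* hits, uint64_t* misses) {
    atomic_fetch_add_explicit(&g_hint_hits_total, *hits, memory_order_relaxed);
    atomic_fetch_add_explicit(&g_hint_misses_total, *misses, memory_order_relaxed);
    *hits = 0;
    *misses = 0;
}

/* Global aggregated stats (all threads that have flushed so far). */
void hak_tls_hint_global_stats(uint64_t* total_hits, uint64_t* total_misses) {
    *total_hits   = atomic_load_explicit(&g_hint_hits_total, memory_order_relaxed);
    *total_misses = atomic_load_explicit(&g_hint_misses_total, memory_order_relaxed);
}

/* atexit handler: print totals and hit rate to stderr. */
static void hint_stats_dump_at_exit(void) {
    uint64_t h = 0, m = 0;
    hak_tls_hint_global_stats(&h, &m);
    uint64_t total = h + m;
    double rate = total ? 100.0 * (double)h / (double)total : 0.0;
    fprintf(stderr, "[TLS_SS_HINT] hits=%llu misses=%llu hit_rate=%.1f%%\n",
            (unsigned long long)h, (unsigned long long)m, rate);
}

/* Hypothetical init hook: registers the dump only when HAKMEM_TLS_HINT_STATS=1.
 * Returns 1 if the handler was registered, 0 otherwise. */
int hak_tls_hint_stats_init(void) {
    const char* v = getenv("HAKMEM_TLS_HINT_STATS");
    if (v != NULL && strcmp(v, "1") == 0) {
        return atexit(hint_stats_dump_at_exit) == 0;
    }
    return 0;
}
```

Flushing per thread keeps the hot path free of atomic operations; the totals lag behind live counters until a thread flushes, which is usually acceptable for monitoring.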

12. Implementation Checklist

12.1 Phase 1a: Core Implementation (Week 1)

  • Create core/box/tls_ss_hint_box.h
  • Implement tls_ss_hint_init()
  • Implement tls_ss_hint_update()
  • Implement tls_ss_hint_lookup()
  • Implement tls_ss_hint_clear()
  • Add HAKMEM_TINY_SS_TLS_HINT flag to hakmem_build_flags.h
  • Add validation check (hint requires headerless mode)
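The validation check in the last item could be a compile-time guard along these lines. This is a sketch that assumes both flags are 0/1 macros defaulting to 0 when undefined; the error message and the helper `hak_tls_hint_compiled_in()` are hypothetical.

```c
#include <assert.h>

/* Assumed defaults: both features off unless the build defines them. */
#ifndef HAKMEM_TINY_HEADERLESS
#define HAKMEM_TINY_HEADERLESS 0
#endif
#ifndef HAKMEM_TINY_SS_TLS_HINT
#define HAKMEM_TINY_SS_TLS_HINT 0
#endif

/* The hint cache only serves the Headerless free() path, so enabling it
 * without Headerless mode is rejected at build time. */
#if HAKMEM_TINY_SS_TLS_HINT && !HAKMEM_TINY_HEADERLESS
#error "HAKMEM_TINY_SS_TLS_HINT=1 requires HAKMEM_TINY_HEADERLESS=1"
#endif

/* Hypothetical helper so tests can confirm which configuration was built. */
int hak_tls_hint_compiled_in(void) {
    return HAKMEM_TINY_SS_TLS_HINT;
}
```

Building with only `-DHAKMEM_TINY_SS_TLS_HINT=1` should then stop with the error message, which is the "error check" case listed under build validation in 12.3.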

12.2 Phase 1b: Integration (Week 2)

  • Integrate into hakmem_tiny_free.inc (lookup path)
  • Integrate into hakmem_tiny.c (update path after alloc)
  • Integrate into hakmem_tiny_refill.inc.h (update path after refill)
  • Integrate into core/front/tiny_unified_cache.c (update path)
  • Call tls_ss_hint_init() in thread-local init

12.3 Phase 1c: Testing (Week 2-3)

  • Write unit tests (tests/test_tls_ss_hint.c)
  • Run unit tests: make test_tls_ss_hint && ./test_tls_ss_hint
  • Build validation (hint disabled, hint enabled, error check)
  • Benchmark comparison (sh8bench, cfrac, larson)
  • Hit rate profiling (debug build with stats)
  • Correctness tests (no crashes, no assertion failures)

12.4 Phase 1d: Validation (Week 3)

  • Benchmark: sh8bench (target: +15-20%)
  • Benchmark: cfrac (target: +10-15%)
  • Benchmark: larson 8 threads (target: +15-20%)
  • Hit rate analysis (target: 85-95%)
  • Memory overhead check (target: < 150 bytes/thread)
  • Regression test: Headerless=0 mode still works

12.5 Phase 1e: Documentation (Week 3-4)

  • Update docs/PHASE2_HEADERLESS_INSTRUCTION.md with hint Box
  • Add Box Theory annotation to hakmem Box registry
  • Write performance analysis report (before/after comparison)
  • Update build instructions (make shared EXTRA_CFLAGS=...)

13. Rollout Plan

Stage 1: Internal Testing (Week 1-3)

  • Build with HAKMEM_TINY_SS_TLS_HINT=1 in dev environment
  • Run full benchmark suite (mimalloc-bench)
  • Profile with perf/cachegrind (verify cycle count reduction)
  • Fix any integration bugs

Stage 2: Canary Deployment (Week 4)

  • Enable hint Box in 5% of production traffic
  • Monitor: crash rate, performance metrics, hit rate
  • A/B test: Hint ON vs Hint OFF

Stage 3: Gradual Rollout (Week 5-6)

  • 25% traffic (if canary success)
  • 50% traffic
  • 100% traffic

Stage 4: Default Enable (Week 7)

  • Change default: HAKMEM_TINY_SS_TLS_HINT=1
  • Update build scripts, CI/CD pipelines
  • Announce in release notes

14. Success Metrics

| Metric | Baseline | Target | Measurement |
|---|---|---|---|
| sh8bench throughput | 54.60 Mops/s | 64-68 Mops/s | +15-20% |
| cfrac runtime | 1.25 sec | 1.10-1.15 sec | -10% to -15% |
| larson throughput | 6.5M ops/s | 7.5-8.0M ops/s | +15-20% |
| TLS hint hit rate | N/A | 85-95% | Stats API |
| free() cycle count | 15-60 cycles | 7-15 cycles (hit) | perf/cachegrind |
| Memory overhead | 0 | < 150 bytes/thread | sizeof(TlsSsHintCache) |
| Crash rate | 0.001% | 0.001% (no regression) | Production monitoring |
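The hit-rate and cycle-count targets are linked: the average cost per free() is a weighted mix of the hit and miss paths. The helper below is a rough model only. `hint_expected_cycles()` is a hypothetical name, and the model ignores the extra hint scan that a miss pays before falling back to `hak_super_lookup()`.

```c
#include <assert.h>

/* Expected cycles per free() for a given hit rate in [0.0, 1.0]:
 *   E = hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles */
double hint_expected_cycles(double hit_rate, double hit_cycles, double miss_cycles) {
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles;
}
```

With illustrative inputs taken from inside the ranges above (90% hit rate, 10 cycles on a hit, 40 cycles on a miss), the model gives about 13 cycles on average. The average therefore sits above the hit-only target whenever misses are non-trivial.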

15. Open Questions

  1. Q: Should we implement per-class hint caches instead of unified cache?
     A: Defer until benchmarks show inter-class thrashing. Current unified design is simpler and sufficient for most workloads.

  2. Q: Should we use LRU instead of FIFO eviction?
     A: Defer until benchmarks show FIFO hit rate < 80%. FIFO is simpler and avoids move-to-front cost on hits.

  3. Q: Should we make TLS_SS_HINT_SLOTS runtime-configurable?
     A: No, compile-time constant allows better optimization (loop unrolling, register allocation). Consider adaptive sizing in Phase 2 if needed.

  4. Q: Should we validate SUPERSLAB_MAGIC in tls_ss_hint_lookup()?
     A: No, keep lookup minimal (2-5 cycles). Caller (free() path) must validate magic. This matches existing design where hak_super_lookup() also requires caller validation.

  5. Q: Should we export hit rate stats in production builds?
     A: Phase 1: No (save 16 bytes/thread). Phase 2: Add global aggregated stats API for monitoring if needed.


16. Conclusion

The TLS Superslab Hint Box is a low-risk, high-reward optimization that reduces the performance gap between Headerless mode and Header mode from 30% to ~15%. The design is self-contained, testable, and follows hakmem's Box Theory architecture. Expected implementation time: 3-4 weeks (including testing and validation).

Key Strengths:

  • Minimal integration surface (5 call sites)
  • Self-contained Box (no dependencies)
  • Fail-safe fallback (miss → hak_super_lookup)
  • Low memory overhead (112 bytes/thread)
  • Proven pattern (TLS caching used in jemalloc, tcmalloc)

Next Steps:

  1. Review this design document
  2. Approve Phase 1a implementation (core Box)
  3. Begin implementation with unit tests
  4. Benchmark and validate in dev environment
  5. Plan Phase 2 integration (Global Class Map)

End of Design Document