TLS Superslab Hint Box - Design Document
Phase: Headerless Performance Optimization - Phase 1
Date: 2025-12-03
Status: Design Review
Author: hakmem team
1. Executive Summary
The TLS Superslab Hint Box is a thread-local cache that accelerates pointer-to-SuperSlab resolution in Headerless mode. When HAKMEM_TINY_HEADERLESS=1 is enabled, every free() operation requires translating a user pointer to its owning SuperSlab. Currently, this uses hak_super_lookup(), which performs a hash table lookup costing 10-50 cycles. By caching recently-used SuperSlab references in thread-local storage, we can reduce this to 2-5 cycles for cache hits (85-95% hit rate expected).
Expected Performance Improvement: 15-20% throughput increase (54.60 → 64-68 Mops/s on sh8bench)
Risk Level: Low
- Thread-local storage eliminates cache coherency issues
- Magic number validation provides fail-safe fallback
- Self-contained Box with minimal integration surface
- Memory overhead: ~104 bytes per thread on 64-bit targets, ~120 bytes with statistics counters (negligible)
2. Box Definition (Box Theory)
Box: TLS Superslab Hint Cache
MISSION:
Cache recently-used SuperSlab references in TLS to accelerate
ptr→SuperSlab resolution in Headerless mode, avoiding expensive
hash table lookups on the critical free() path.
DESIGN:
- Provides O(1) lookup for hot SuperSlabs (L1 cache hit, 2-5 cycles)
- Falls back to global registry on miss (fail-safe, no data loss)
- No ownership, no remote queues, pure read-only cache
- FIFO eviction policy with configurable cache size (2-4 slots)
INVARIANTS:
- hint.base <= ptr < hint.end implies hint.ss is valid
- Miss is always safe (triggers fallback to hak_super_lookup)
- TLS data survives only within thread lifetime
- Cache entries are invalidated implicitly by FIFO rotation
- Magic number check (SUPERSLAB_MAGIC) validates all pointers
BOUNDARY:
- Input: raw user pointer (void* ptr) from free() path
- Output: SuperSlab* or NULL (miss triggers fallback)
- Does NOT determine class_idx (that's slab_index_for's job)
- Does NOT perform ownership validation (that's SuperSlab's job)
PERFORMANCE:
- Cache hit: 2-5 cycles (L1 cache hit, 4 pointer comparisons)
- Cache miss: fallback to hak_super_lookup (10-50 cycles)
- Expected hit rate: 85-95% for single-threaded workloads
- Expected hit rate: 70-85% for multi-threaded workloads
THREAD SAFETY:
- TLS storage: no sharing, no synchronization required
- Read-only cache: never modifies SuperSlab state
- Stale entries: caught by magic number check
3. Data Structures
// core/box/tls_ss_hint_box.h
#ifndef TLS_SS_HINT_BOX_H
#define TLS_SS_HINT_BOX_H
#include <stddef.h> // size_t (used by tls_ss_hint_update)
#include <stdint.h>
#include <stdbool.h>
// Forward declaration
struct SuperSlab;
// Cache entry for a single SuperSlab hint
// Size: 24 bytes on 64-bit targets (three pointers); two entries fit in one 64-byte cache line
typedef struct {
void* base; // SuperSlab base address (aligned to 1MB or 2MB)
void* end; // base + superslab_size (for range check)
struct SuperSlab* ss; // Cached SuperSlab pointer
} TlsSsHintEntry;
// TLS hint cache configuration
// - 4 slots provide good hit rate without excessive overhead
// - Larger caches (8, 16) show diminishing returns in benchmarks
// - Smaller caches (2) may thrash on workloads with 3+ active SuperSlabs
#define TLS_SS_HINT_SLOTS 4
// Thread-local SuperSlab hint cache
// Total size on 64-bit targets: 24*4 + 8 = 104 bytes per thread in release builds,
// 120 bytes with the two statistics counters (negligible overhead)
typedef struct {
TlsSsHintEntry entries[TLS_SS_HINT_SLOTS]; // Cache entries
uint32_t count; // Number of valid entries (0 to TLS_SS_HINT_SLOTS)
uint32_t next_slot; // Next slot for FIFO rotation (wraps at TLS_SS_HINT_SLOTS)
// Statistics (optional, for profiling builds)
// Disabled in HAKMEM_BUILD_RELEASE to save 16 bytes per thread
#if !HAKMEM_BUILD_RELEASE
uint64_t hits; // Cache hit count
uint64_t misses; // Cache miss count
#endif
} TlsSsHintCache;
// Thread-local storage instance
// Initialized to zero by TLS semantics, formal init in tls_ss_hint_init()
extern __thread TlsSsHintCache g_tls_ss_hint;
#endif // TLS_SS_HINT_BOX_H
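The per-thread footprint quoted for these structs can be checked by the compiler rather than by hand. The sketch below re-declares both structs standalone and asserts their sizes. `HINT_STATS` is a hypothetical stand-in for `!HAKMEM_BUILD_RELEASE`, `tls_ss_hint_expected_size()` is a helper invented for this example, and the 104-byte figure assumes a 64-bit target with 8-byte pointers.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

#define TLS_SS_HINT_SLOTS 4
#ifndef HINT_STATS            /* hypothetical stand-in for !HAKMEM_BUILD_RELEASE */
#define HINT_STATS 0
#endif

struct SuperSlab;             /* opaque here; the cache only stores pointers */

typedef struct {
    void* base;
    void* end;
    struct SuperSlab* ss;
} TlsSsHintEntry;

typedef struct {
    TlsSsHintEntry entries[TLS_SS_HINT_SLOTS];
    uint32_t count;
    uint32_t next_slot;
#if HINT_STATS
    uint64_t hits;
    uint64_t misses;
#endif
} TlsSsHintCache;

/* Sum of the member sizes: entries, two 32-bit cursors, optional counters. */
static size_t tls_ss_hint_expected_size(void) {
    size_t n = TLS_SS_HINT_SLOTS * sizeof(TlsSsHintEntry) + 2 * sizeof(uint32_t);
#if HINT_STATS
    n += 2 * sizeof(uint64_t);
#endif
    return n;
}

/* One entry is exactly three pointers, with no padding. */
static_assert(sizeof(TlsSsHintEntry) == 3 * sizeof(void*),
              "hint entry should be three pointers");

#if UINTPTR_MAX == UINT64_MAX && !HINT_STATS
/* 64-bit release layout: 4 * 24 + 8 = 104 bytes. */
static_assert(sizeof(TlsSsHintCache) == 104,
              "unexpected TLS hint cache size on a 64-bit target");
#endif
```

Placing an equivalent `static_assert` next to the real struct would catch accidental growth, for example when a field is added, at build time.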
4. API Design
// core/box/tls_ss_hint_box.h (continued)
/**
* @brief Initialize TLS hint cache for current thread
*
* Call once per thread, typically in thread-local initialization path.
* Safe to call multiple times (idempotent).
*
* Thread Safety: TLS, no synchronization required
* Performance: ~10 cycles (negligible one-time cost)
*/
static inline void tls_ss_hint_init(void);
/**
* @brief Update hint cache with a SuperSlab reference
*
* Called on paths where we know the SuperSlab for a given address range:
* - After successful tiny_alloc (cache the allocated-from SuperSlab)
* - After superslab refill (cache the newly bound SuperSlab)
* - After unified cache refill (cache the refilled SuperSlab)
*
* Duplicate detection: If the SuperSlab is already cached, no update occurs.
* This prevents thrashing when repeatedly allocating from the same SuperSlab.
*
* @param ss SuperSlab to cache (must be non-NULL, SUPERSLAB_MAGIC validated by caller)
* @param base SuperSlab base address (1MB or 2MB aligned)
* @param size SuperSlab size in bytes (1MB or 2MB)
*
* Thread Safety: TLS, no synchronization required
* Performance: ~15-20 cycles (duplicate check + FIFO rotation)
*/
static inline void tls_ss_hint_update(struct SuperSlab* ss, void* base, size_t size);
/**
* @brief Lookup SuperSlab for given pointer (fast path)
*
* Called on free() entry, before falling back to hak_super_lookup().
* Performs linear search over cached entries (4 iterations max).
*
* Cache hit: Returns true, sets *out_ss to cached SuperSlab pointer
* Cache miss: Returns false, caller must use hak_super_lookup()
*
* @param ptr User pointer to lookup (arbitrary alignment)
* @param out_ss Output: SuperSlab pointer if found (only valid if return true)
* @return true if cache hit (out_ss is valid), false if miss
*
* Thread Safety: TLS, no synchronization required
* Performance: 2-5 cycles (hit), 8-12 cycles (miss)
*
* NOTE: Caller MUST validate SUPERSLAB_MAGIC after successful lookup.
* This Box does not perform magic validation to keep fast path minimal.
*/
static inline bool tls_ss_hint_lookup(void* ptr, struct SuperSlab** out_ss);
/**
* @brief Clear all cached hints (for testing/reset)
*
* Use cases:
* - Unit tests: Reset cache between test cases
* - Debug: Force cache cold start for profiling
* - Thread teardown: Optional cleanup (TLS auto-cleanup on thread exit)
*
* Thread Safety: TLS, no synchronization required
* Performance: ~10 cycles
*/
static inline void tls_ss_hint_clear(void);
/**
* @brief Get cache statistics (for profiling builds)
*
* Returns hit/miss counters for performance analysis.
* Only available in non-release builds (HAKMEM_BUILD_RELEASE=0).
*
* @param hits Output: Total cache hits
* @param misses Output: Total cache misses
*
* Thread Safety: TLS, no synchronization required
* Performance: ~5 cycles (two loads)
*/
#if !HAKMEM_BUILD_RELEASE
static inline void tls_ss_hint_stats(uint64_t* hits, uint64_t* misses);
#endif
5. Implementation Details
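// core/box/tls_ss_hint_box.c holds only the TLS definition below; the static inline
// functions must be defined in tls_ss_hint_box.h so that every caller sees their bodies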
#include "tls_ss_hint_box.h"
#include "../hakmem_tiny_superslab.h" // For SuperSlab, SUPERSLAB_MAGIC
// Thread-local storage definition
__thread TlsSsHintCache g_tls_ss_hint = {0};
/**
* Initialize TLS hint cache
* Safe to call multiple times: each call resets the cache (and stats) to the empty state
*/
static inline void tls_ss_hint_init(void) {
// Zero-initialization by TLS, but explicit init for clarity
g_tls_ss_hint.count = 0;
g_tls_ss_hint.next_slot = 0;
#if !HAKMEM_BUILD_RELEASE
g_tls_ss_hint.hits = 0;
g_tls_ss_hint.misses = 0;
#endif
// Clear all entries (paranoid, but cache-friendly loop)
for (int i = 0; i < TLS_SS_HINT_SLOTS; i++) {
g_tls_ss_hint.entries[i].base = NULL;
g_tls_ss_hint.entries[i].end = NULL;
g_tls_ss_hint.entries[i].ss = NULL;
}
}
/**
* Update hint cache with SuperSlab reference
* FIFO rotation: oldest entry is evicted when cache is full
* Duplicate detection: skip if SuperSlab already cached
*/
static inline void tls_ss_hint_update(struct SuperSlab* ss, void* base, size_t size) {
// Sanity check: reject invalid inputs
if (__builtin_expect(!ss || !base || size == 0, 0)) {
return;
}
// Duplicate detection: check if this SuperSlab is already cached
// This prevents thrashing when allocating from the same SuperSlab repeatedly
for (uint32_t i = 0; i < g_tls_ss_hint.count; i++) {
if (g_tls_ss_hint.entries[i].ss == ss) {
return; // Already cached, no update needed
}
}
// Add to next slot (FIFO rotation)
uint32_t slot = g_tls_ss_hint.next_slot;
g_tls_ss_hint.entries[slot].base = base;
g_tls_ss_hint.entries[slot].end = (char*)base + size;
g_tls_ss_hint.entries[slot].ss = ss;
// Advance to next slot (wrap at TLS_SS_HINT_SLOTS)
g_tls_ss_hint.next_slot = (slot + 1) % TLS_SS_HINT_SLOTS;
// Increment count until cache is full
if (g_tls_ss_hint.count < TLS_SS_HINT_SLOTS) {
g_tls_ss_hint.count++;
}
}
/**
* Lookup SuperSlab for pointer (fast path)
* Linear search over cached entries (4 iterations max)
*
* Performance note:
* - Linear search is faster than hash table for small N (N <= 8)
* - Branch-free comparison (ptr >= base && ptr < end) is 2-3 cycles
* - Total cost: 2-5 cycles (hit), 8-12 cycles (miss with 4 entries)
*/
static inline bool tls_ss_hint_lookup(void* ptr, struct SuperSlab** out_ss) {
// Fast path: iterate over valid entries
// Unrolling this loop (if count is small) is beneficial, but let compiler decide
for (uint32_t i = 0; i < g_tls_ss_hint.count; i++) {
TlsSsHintEntry* e = &g_tls_ss_hint.entries[i];
// Range check: base <= ptr < end
// Note: end is exclusive (base + size), so use < not <=
if (ptr >= e->base && ptr < e->end) {
// Cache hit!
*out_ss = e->ss;
#if !HAKMEM_BUILD_RELEASE
g_tls_ss_hint.hits++;
#endif
return true;
}
}
// Cache miss: caller must fall back to hak_super_lookup()
#if !HAKMEM_BUILD_RELEASE
g_tls_ss_hint.misses++;
#endif
return false;
}
/**
* Clear all cached hints
* Use for testing or manual reset
*/
static inline void tls_ss_hint_clear(void) {
g_tls_ss_hint.count = 0;
g_tls_ss_hint.next_slot = 0;
#if !HAKMEM_BUILD_RELEASE
// Preserve stats across clear (for cumulative profiling)
// Uncomment to reset stats:
// g_tls_ss_hint.hits = 0;
// g_tls_ss_hint.misses = 0;
#endif
// Optional: zero out entries (paranoid, not required for correctness)
for (int i = 0; i < TLS_SS_HINT_SLOTS; i++) {
g_tls_ss_hint.entries[i].base = NULL;
g_tls_ss_hint.entries[i].end = NULL;
g_tls_ss_hint.entries[i].ss = NULL;
}
}
/**
* Get cache statistics (profiling builds only)
*/
#if !HAKMEM_BUILD_RELEASE
static inline void tls_ss_hint_stats(uint64_t* hits, uint64_t* misses) {
if (hits) *hits = g_tls_ss_hint.hits;
if (misses) *misses = g_tls_ss_hint.misses;
}
#endif
6. Integration Points
6.1 Update Points: When to Call tls_ss_hint_update()
The hint cache should be updated whenever we know the SuperSlab for an address range. This happens on allocation success paths:
Location 1: After Successful Tiny Alloc (hakmem_tiny.c)
// In hak_tiny_alloc or similar allocation path
SuperSlab* ss = NULL; // Set by the allocator to the SuperSlab that served the request
void* ptr = tiny_allocate_from_superslab(class_idx, &ss);
if (ptr) {
#if HAKMEM_TINY_SS_TLS_HINT
// Cache the SuperSlab we just allocated from
// This improves free() performance for LIFO allocation patterns
tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
#endif
return ptr;
}
Location 2: After SuperSlab Refill (hakmem_tiny_refill.inc.h)
// In tiny_refill_from_superslab or superslab_allocate
SuperSlab* ss = superslab_allocate(class_idx);
if (ss) {
// Bind SuperSlab to thread's TLS state
bind_superslab_to_thread(ss, class_idx);
#if HAKMEM_TINY_SS_TLS_HINT
// Cache the newly bound SuperSlab
// Future allocations from this SuperSlab will have cached hint
tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
#endif
}
Location 3: Unified Cache Refill (core/front/tiny_unified_cache.c)
// In unified_cache_refill_class
SuperSlab* ss = NULL; // Set by superslab_alloc_block on success
void* block = superslab_alloc_block(class_idx, &ss);
if (block) {
#if HAKMEM_TINY_SS_TLS_HINT
// Cache the SuperSlab that provided this block
tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
#endif
// Push to unified cache
unified_cache_push(class_idx, block);
}
Location 4: Thread-Local Init (hakmem_tiny_tls_init)
// In tiny_tls_init or thread_local_init
void tiny_tls_init(void) {
// Initialize TLS structures
tiny_magazine_init();
tiny_sll_init();
#if HAKMEM_TINY_SS_TLS_HINT
// Initialize hint cache (zero-init by TLS, but explicit for clarity)
tls_ss_hint_init();
#endif
}
6.2 Lookup Points: When to Call tls_ss_hint_lookup()
The hint lookup should be the first step in the free() path, before falling back to the registry lookup:
Location 1: Tiny Free Entry (core/hakmem_tiny_free.inc)
// In hak_tiny_free or similar free path
void hak_tiny_free(void* ptr) {
if (!ptr) return;
SuperSlab* ss = NULL;
#if HAKMEM_TINY_HEADERLESS
// Phase 1: Try TLS hint cache (fast path, 2-5 cycles on hit)
#if HAKMEM_TINY_SS_TLS_HINT
if (!tls_ss_hint_lookup(ptr, &ss)) {
#endif
// Phase 2: Fallback to global registry (slow path, 10-50 cycles)
ss = hak_super_lookup(ptr);
#if HAKMEM_TINY_SS_TLS_HINT
}
#endif
// Validate SuperSlab (magic check)
if (!ss || ss->magic != SUPERSLAB_MAGIC) {
// Invalid pointer - external guard path
hak_external_guard_free(ptr);
return;
}
// Proceed with free using SuperSlab info
int class_idx = slab_index_for(ss, ptr);
tiny_free_to_slab(ss, ptr, class_idx);
#else
// Header mode: read class_idx from header (1-3 cycles)
uint8_t hdr = *((uint8_t*)ptr - 1);
int class_idx = hdr & 0x7;
tiny_free_to_class(class_idx, ptr);
#endif
}
Location 2: Fast Free Path (core/tiny_free_fast_v2.inc.h)
// In tiny_free_fast or inline free path
static inline void tiny_free_fast(void* ptr) {
#if HAKMEM_TINY_HEADERLESS
SuperSlab* ss = NULL;
// Try hint cache first
#if HAKMEM_TINY_SS_TLS_HINT
if (!tls_ss_hint_lookup(ptr, &ss)) {
#endif
ss = hak_super_lookup(ptr);
#if HAKMEM_TINY_SS_TLS_HINT
}
#endif
if (__builtin_expect(!ss || ss->magic != SUPERSLAB_MAGIC, 0)) {
// Slow path: external guard or invalid pointer
hak_tiny_free_slow(ptr);
return;
}
// Fast path: push to TLS freelist
int class_idx = slab_index_for(ss, ptr);
front_gate_push_tls(class_idx, ptr);
#else
// Header mode fast path
uint8_t hdr = *((uint8_t*)ptr - 1);
int class_idx = hdr & 0x7;
front_gate_push_tls(class_idx, ptr);
#endif
}
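Both call sites above repeat the same hint-then-registry sequence, so it could be factored into one helper. The sketch below is one possible shape, not the hakmem implementation: `resolve_superslab()` is a hypothetical name, the `SUPERSLAB_MAGIC` value is a placeholder, and the hint and registry functions are single-entry stubs that only mimic the signatures used in this document. It also retries the registry when a hint hit fails the magic check, which is the recovery described in the failure-mode table. The stale case assumes the SuperSlab header is still mapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SUPERSLAB_MAGIC 0x5355504Eu      /* placeholder value, not hakmem's */

typedef struct SuperSlab { uint32_t magic; } SuperSlab;

/* Stubs standing in for the hint Box and the global registry. */
static SuperSlab* g_hint_ss;             /* single-entry "hint cache" */
static uintptr_t  g_hint_base, g_hint_end;
static SuperSlab* g_registry_ss;         /* what the stub registry returns */
static int        g_registry_calls;      /* counts slow-path lookups */

static bool tls_ss_hint_lookup(void* ptr, SuperSlab** out_ss) {
    uintptr_t p = (uintptr_t)ptr;
    if (g_hint_ss && p >= g_hint_base && p < g_hint_end) {
        *out_ss = g_hint_ss;
        return true;
    }
    return false;
}

static SuperSlab* hak_super_lookup(void* ptr) {
    (void)ptr;
    g_registry_calls++;
    return g_registry_ss;
}

/* Hint first, registry second; NULL means the pointer is not ours. */
static SuperSlab* resolve_superslab(void* ptr) {
    SuperSlab* ss = NULL;
    if (tls_ss_hint_lookup(ptr, &ss) && ss->magic == SUPERSLAB_MAGIC) {
        return ss;                       /* validated hit: fast path */
    }
    ss = hak_super_lookup(ptr);          /* miss, or hit with bad magic */
    return (ss && ss->magic == SUPERSLAB_MAGIC) ? ss : NULL;
}
```

With such a helper, the free paths would shrink to a single call followed by the NULL check, and the nested `#if HAKMEM_TINY_SS_TLS_HINT` blocks would live in one place.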
7. Environment Variable
// In hakmem_build_flags.h or similar configuration header
// ============================================================================
// Phase 1: Headerless Optimization - TLS SuperSlab Hint Cache
// ============================================================================
// Purpose: Accelerate ptr→SuperSlab lookup in Headerless mode
// Default: 0 (disabled during development and testing)
// Target: 1 (enabled after validation in Phase 1 rollout)
//
// Performance Impact:
// - Cache hit: 2-5 cycles (vs 10-50 cycles for hak_super_lookup)
// - Expected hit rate: 85-95% (single-threaded), 70-85% (multi-threaded)
// - Expected throughput improvement: 15-20%
//
// Memory Overhead:
// - 104 bytes per thread (TLS, release build, 64-bit)
// - Negligible for typical workloads (1000 threads = 104KB)
//
// Dependencies:
// - Requires HAKMEM_TINY_HEADERLESS=1 (hint is no-op in header mode)
// - No other dependencies (self-contained Box)
#ifndef HAKMEM_TINY_SS_TLS_HINT
#define HAKMEM_TINY_SS_TLS_HINT 0
#endif
// Validation: Hint Box only active in Headerless mode
#if HAKMEM_TINY_SS_TLS_HINT && !HAKMEM_TINY_HEADERLESS
#error "HAKMEM_TINY_SS_TLS_HINT requires HAKMEM_TINY_HEADERLESS=1"
#endif
8. Testing Plan
8.1 Unit Tests
Create /mnt/workdisk/public_share/hakmem/tests/test_tls_ss_hint.c:
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include "core/box/tls_ss_hint_box.h"
#include "core/hakmem_tiny_superslab.h"
// Mock SuperSlab for testing
typedef struct {
uint32_t magic;
void* base_addr;
size_t size_bytes;
uint8_t size_class;
} MockSuperSlab;
void test_hint_init(void) {
printf("test_hint_init...\n");
tls_ss_hint_init();
// Verify cache is empty
assert(g_tls_ss_hint.count == 0);
assert(g_tls_ss_hint.next_slot == 0);
#if !HAKMEM_BUILD_RELEASE
assert(g_tls_ss_hint.hits == 0);
assert(g_tls_ss_hint.misses == 0);
#endif
printf(" PASS\n");
}
void test_hint_basic(void) {
printf("test_hint_basic...\n");
tls_ss_hint_init();
// Mock SuperSlab
MockSuperSlab ss = {
.magic = SUPERSLAB_MAGIC,
.base_addr = (void*)0x1000000,
.size_bytes = 2 * 1024 * 1024, // 2MB
.size_class = 0
};
// Update hint
tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
// Verify cache entry
assert(g_tls_ss_hint.count == 1);
assert(g_tls_ss_hint.entries[0].base == ss.base_addr);
assert(g_tls_ss_hint.entries[0].ss == (SuperSlab*)&ss);
// Lookup should hit (within range)
SuperSlab* out = NULL;
assert(tls_ss_hint_lookup((void*)0x1000100, &out) == true);
assert(out == (SuperSlab*)&ss);
// Lookup at base should hit
assert(tls_ss_hint_lookup((void*)0x1000000, &out) == true);
assert(out == (SuperSlab*)&ss);
// Lookup at end-1 should hit (end = 0x1000000 + 2MB = 0x1200000)
assert(tls_ss_hint_lookup((void*)0x11FFFFF, &out) == true);
assert(out == (SuperSlab*)&ss);
// Lookup at end should miss (exclusive boundary)
assert(tls_ss_hint_lookup((void*)0x1200000, &out) == false);
// Lookup outside range should miss
assert(tls_ss_hint_lookup((void*)0x3000000, &out) == false);
printf(" PASS\n");
}
void test_hint_fifo_rotation(void) {
printf("test_hint_fifo_rotation...\n");
tls_ss_hint_init();
// Create 6 mock SuperSlabs (cache has 4 slots)
MockSuperSlab ss[6];
for (int i = 0; i < 6; i++) {
ss[i].magic = SUPERSLAB_MAGIC;
ss[i].base_addr = (void*)(uintptr_t)(0x1000000 + i * 0x200000); // 2MB apart
ss[i].size_bytes = 2 * 1024 * 1024;
ss[i].size_class = 0;
tls_ss_hint_update((SuperSlab*)&ss[i], ss[i].base_addr, ss[i].size_bytes);
}
// Cache should be full (4 slots)
assert(g_tls_ss_hint.count == TLS_SS_HINT_SLOTS);
// First 2 SuperSlabs should be evicted (FIFO)
SuperSlab* out = NULL;
assert(tls_ss_hint_lookup((void*)0x1000100, &out) == false); // ss[0] evicted
assert(tls_ss_hint_lookup((void*)0x1200100, &out) == false); // ss[1] evicted
// Last 4 SuperSlabs should be cached
assert(tls_ss_hint_lookup((void*)0x1400100, &out) == true); // ss[2]
assert(out == (SuperSlab*)&ss[2]);
assert(tls_ss_hint_lookup((void*)0x1600100, &out) == true); // ss[3]
assert(out == (SuperSlab*)&ss[3]);
assert(tls_ss_hint_lookup((void*)0x1800100, &out) == true); // ss[4]
assert(out == (SuperSlab*)&ss[4]);
assert(tls_ss_hint_lookup((void*)0x1A00100, &out) == true); // ss[5]
assert(out == (SuperSlab*)&ss[5]);
printf(" PASS\n");
}
void test_hint_duplicate_detection(void) {
printf("test_hint_duplicate_detection...\n");
tls_ss_hint_init();
// Mock SuperSlab
MockSuperSlab ss = {
.magic = SUPERSLAB_MAGIC,
.base_addr = (void*)0x1000000,
.size_bytes = 2 * 1024 * 1024,
.size_class = 0
};
// Update hint 3 times with same SuperSlab
tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
// Cache should have only 1 entry (duplicates ignored)
assert(g_tls_ss_hint.count == 1);
assert(g_tls_ss_hint.entries[0].ss == (SuperSlab*)&ss);
printf(" PASS\n");
}
void test_hint_clear(void) {
printf("test_hint_clear...\n");
tls_ss_hint_init();
// Add some entries
MockSuperSlab ss = {
.magic = SUPERSLAB_MAGIC,
.base_addr = (void*)0x1000000,
.size_bytes = 2 * 1024 * 1024,
.size_class = 0
};
tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
assert(g_tls_ss_hint.count == 1);
// Clear cache
tls_ss_hint_clear();
// Cache should be empty
assert(g_tls_ss_hint.count == 0);
assert(g_tls_ss_hint.next_slot == 0);
// Lookup should miss
SuperSlab* out = NULL;
assert(tls_ss_hint_lookup((void*)0x1000100, &out) == false);
printf(" PASS\n");
}
#if !HAKMEM_BUILD_RELEASE
void test_hint_stats(void) {
printf("test_hint_stats...\n");
tls_ss_hint_init();
// Add entry
MockSuperSlab ss = {
.magic = SUPERSLAB_MAGIC,
.base_addr = (void*)0x1000000,
.size_bytes = 2 * 1024 * 1024,
.size_class = 0
};
tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
// Perform lookups
SuperSlab* out = NULL;
tls_ss_hint_lookup((void*)0x1000100, &out); // Hit
tls_ss_hint_lookup((void*)0x1000200, &out); // Hit
tls_ss_hint_lookup((void*)0x3000000, &out); // Miss
// Check stats
uint64_t hits = 0, misses = 0;
tls_ss_hint_stats(&hits, &misses);
assert(hits == 2);
assert(misses == 1);
printf(" PASS\n");
}
#endif
int main(void) {
printf("Running TLS SS Hint Box unit tests...\n\n");
test_hint_init();
test_hint_basic();
test_hint_fifo_rotation();
test_hint_duplicate_detection();
test_hint_clear();
#if !HAKMEM_BUILD_RELEASE
test_hint_stats();
#endif
printf("\nAll tests passed!\n");
return 0;
}
8.2 Integration Tests
Test 1: Build Validation
# Test 1: Build with hint disabled (baseline)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"
# Test 2: Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
# Test 3: Verify hint is disabled in header mode (should error)
# make clean
# make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=0 -DHAKMEM_TINY_SS_TLS_HINT=1"
# Expected: Compile error (validation check in hakmem_build_flags.h)
Test 2: Benchmark Comparison
# Build baseline (hint disabled)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"
# Run benchmarks
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench > baseline.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 17545186520809 > cfrac_baseline.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 > larson_baseline.txt
# Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
# Run same benchmarks
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench > hint.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 17545186520809 > cfrac_hint.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 > larson_hint.txt
# Compare results
echo "=== sh8bench ==="
grep "Mops" baseline.txt hint.txt
echo "=== cfrac ==="
grep "time:" cfrac_baseline.txt cfrac_hint.txt
echo "=== larson ==="
grep "ops/s" larson_baseline.txt larson_hint.txt
Test 3: Hit Rate Profiling
# Build with stats enabled (non-release)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1 -DHAKMEM_BUILD_RELEASE=0"
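# Add stats dump at exit (in hakmem_exit.c or similar); needs <inttypes.h> for PRIu64
# Note: the counters are thread-local, so this reports only the calling thread
# void dump_hint_stats(void) {
# uint64_t hits = 0, misses = 0;
# tls_ss_hint_stats(&hits, &misses);
# uint64_t total = hits + misses;
# fprintf(stderr, "[TLS_HINT_STATS] hits=%" PRIu64 " misses=%" PRIu64 " hit_rate=%.2f%%\n",
# hits, misses, total ? 100.0 * (double)hits / (double)total : 0.0);
# }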
# Run benchmark and check hit rate
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep TLS_HINT_STATS
# Expected: hit_rate >= 85%
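The hit-rate line is easy to get subtly wrong: `%lu` is not portable for `uint64_t`, and a thread that never freed anything would divide by zero. A minimal sketch of the formatting logic follows. Both function names are hypothetical helpers for this example, and the output format simply mirrors the `[TLS_HINT_STATS]` line used in this section.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hit rate in percent; 0.0 when no lookups were recorded. */
static double tls_ss_hint_hit_rate(uint64_t hits, uint64_t misses) {
    uint64_t total = hits + misses;
    return total ? 100.0 * (double)hits / (double)total : 0.0;
}

/* Formats one stats line into buf and returns snprintf's result. */
static int tls_ss_hint_format_stats(char* buf, size_t len,
                                    uint64_t hits, uint64_t misses) {
    return snprintf(buf, len,
                    "[TLS_HINT_STATS] hits=%" PRIu64 " misses=%" PRIu64
                    " hit_rate=%.2f%%",
                    hits, misses, tls_ss_hint_hit_rate(hits, misses));
}
```

A dump routine would pass the values obtained from `tls_ss_hint_stats()` to the formatter and write the buffer to stderr.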
8.3 Correctness Tests
# Test with external pointer (should fall back to hak_super_lookup)
# This tests that cache misses are handled correctly
# Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
# Run sh8bench (allocates from multiple SuperSlabs)
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
# No crashes or assertion failures = success
echo "Correctness test passed"
9. Performance Expectations
9.1 Cycle Count Analysis
| Operation | Without Hint | With Hint (Hit) | With Hint (Miss) | Improvement |
|---|---|---|---|---|
| free() lookup | 10-50 cycles | 2-5 cycles | 10-50 cycles | 80-95% |
| Range check (per entry) | N/A | 2 cycles | 2 cycles | - |
| Hash table lookup | 10-50 cycles | N/A | 10-50 cycles | - |
| Total free() cost | 15-60 cycles | 7-15 cycles (hit) | 20-65 cycles (miss) | 40-60% |
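The table gives hit and miss costs separately; what matters in practice is the average at a given hit rate. The sketch below is a simple linear model, an assumption of this example rather than a measurement: the expected cost is h * hit + (1 - h) * (scan + fallback), where scan is the cost of walking the cache without a match. Both function names are hypothetical.

```c
#include <assert.h>

/* Expected lookup cost in cycles for hit rate h in [0, 1]. */
static double expected_lookup_cycles(double h, double hit_cost,
                                     double miss_scan_cost,
                                     double fallback_cost) {
    return h * hit_cost + (1.0 - h) * (miss_scan_cost + fallback_cost);
}

/* Hit rate at which the hint costs the same as calling the registry directly.
 * Solves h*hit + (1-h)*(scan + fallback) = fallback for h. */
static double break_even_hit_rate(double hit_cost, double miss_scan_cost,
                                  double fallback_cost) {
    return miss_scan_cost / (miss_scan_cost + fallback_cost - hit_cost);
}
```

Plugging in mid-range figures from this document (hit 3.5, scan 10, fallback 30 cycles), a 90% hit rate gives roughly 7.2 cycles per lookup against 30 without the hint, and the break-even hit rate is about 27%. Below that rate the scan overhead on misses would outweigh the savings on hits.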
9.2 Expected Hit Rates
| Workload | Hit Rate | Reasoning |
|---|---|---|
| Single-threaded LIFO | 95-99% | Free() immediately after alloc() from same SuperSlab |
| Single-threaded FIFO | 85-95% | Recent allocations from 2-4 SuperSlabs |
| Multi-threaded (8 threads) | 70-85% | Shared SuperSlabs, more cache thrashing |
| Larson (high churn) | 65-80% | Many active SuperSlabs, frequent evictions |
9.3 Benchmark Targets
| Benchmark | Baseline (no hint) | Target (with hint) | Improvement |
|---|---|---|---|
| sh8bench | 54.60 Mops/s | 64-68 Mops/s | +15-20% |
| cfrac | 1.25 sec | 1.10-1.15 sec | +10-15% |
| larson (8 threads) | 6.5M ops/s | 7.5-8.0M ops/s | +15-20% |
9.4 Memory Overhead
| Metric | Value | Notes |
|---|---|---|
| Per-thread overhead | 104 bytes | TLS cache (release build, 64-bit) |
| Per-thread overhead (debug) | 120 bytes | TLS cache + stats counters |
| 1000 threads | 104 KB | Negligible for server workloads |
| 10000 threads | 1.04 MB | Still negligible |
10. Risk Analysis
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Cache coherency issues | Very Low | Low | TLS is thread-local, no sharing between threads |
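| Stale hint after munmap | Low | Low | Magic check (SUPERSLAB_MAGIC) catches freed SuperSlabs while their header stays mapped; a SuperSlab that is actually unmapped must be dropped from the hint cache first, since reading its magic would fault |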
| Cache thrashing (many SS) | Low | Low | 4 slots cover typical workloads; miss falls back to registry |
| Memory overhead | Very Low | Very Low | About 104 bytes/thread (release, 64-bit), negligible for most workloads |
| Integration bugs | Low | Medium | Self-contained Box, clear API, comprehensive tests |
| Hit rate lower than expected | Low | Low | Even a 50% hit rate improves performance; a miss adds only the 8-12 cycle scan before the registry fallback |
| Complexity increase | Low | Low | 150 LOC, header-only Box, minimal dependencies |
10.1 Failure Modes and Recovery
| Failure Mode | Detection | Recovery |
|---|---|---|
| Stale SuperSlab pointer | Magic check (SUPERSLAB_MAGIC != expected) | Fall back to hak_super_lookup() |
| Cache miss | tls_ss_hint_lookup returns false | Fall back to hak_super_lookup() |
| Invalid hint range | ptr outside [base, end) | Linear search continues, eventually misses |
| Thread teardown | TLS cleanup by OS | No manual cleanup needed |
| SuperSlab freed | Magic number cleared | Caught by magic check in free() path |
11. Future Considerations
11.1 Phase 2 Integration: Global Class Map
When Phase 2 introduces a Global Class Map (pointer → class_idx lookup), the TLS Hint Box becomes the first tier in a three-tier lookup hierarchy:
Tier 1 (fastest): TLS Hint Cache (2-5 cycles, 85-95% hit rate)
↓ miss
Tier 2 (medium): Global Class Map (5-15 cycles, 99%+ hit rate)
↓ miss
Tier 3 (slowest): Global SuperSlab Registry (10-50 cycles, 100% hit rate)
Integration point:
SuperSlab* ss = NULL;
int class_idx = -1;
// Tier 1: TLS hint
#if HAKMEM_TINY_SS_TLS_HINT
if (tls_ss_hint_lookup(ptr, &ss) && ss->magic == SUPERSLAB_MAGIC) { // Caller validates magic (see API note)
class_idx = slab_index_for(ss, ptr);
goto found;
}
#endif
// Tier 2: Global class map
#if HAKMEM_TINY_CLASS_MAP
class_idx = class_map_lookup(ptr);
if (class_idx >= 0) {
ss = hak_super_lookup(ptr); // Still need SS for metadata
goto found;
}
#endif
// Tier 3: Registry fallback
ss = hak_super_lookup(ptr);
if (ss && ss->magic == SUPERSLAB_MAGIC) {
class_idx = slab_index_for(ss, ptr);
goto found;
}
// External pointer
hak_external_guard_free(ptr);
return;
found:
tiny_free_to_class(class_idx, ptr);
11.2 Adaptive Cache Sizing
Current design uses fixed TLS_SS_HINT_SLOTS = 4. Future optimization could make this adaptive:
- Workload detection: Track hit rate over time windows
- Dynamic sizing: Increase slots (4 → 8) if hit rate < 80%
- Memory pressure: Decrease slots (8 → 2) if memory constrained
Implementation sketch:
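#define TLS_SS_HINT_SLOTS_MAX 8
typedef struct {
    uint32_t current_slots; // Dynamic (2, 4, 8); entries[] must be sized TLS_SS_HINT_SLOTS_MAX
    uint64_t hits_window;
    uint64_t misses_window;
} TlsSsHintAdaptive;
// Hypothetical per-thread tuning state, kept alongside g_tls_ss_hint
static __thread TlsSsHintAdaptive g_tls_ss_hint_adaptive = { .current_slots = 4 };
void tls_ss_hint_tune(void) {
    TlsSsHintAdaptive* a = &g_tls_ss_hint_adaptive;
    uint64_t total = a->hits_window + a->misses_window;
    if (total == 0) return; // Nothing observed in this window
    double hit_rate = (double)a->hits_window / (double)total;
    if (hit_rate < 0.80 && a->current_slots < TLS_SS_HINT_SLOTS_MAX) {
        a->current_slots *= 2; // Grow cache
    } else if (hit_rate > 0.95 && a->current_slots > 2) {
        a->current_slots /= 2; // Shrink cache
    }
    a->hits_window = a->misses_window = 0; // Start a new window
}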
11.3 LRU vs FIFO Eviction Policy
Current design uses FIFO (simple, predictable). Alternative: LRU with move-to-front on hit.
LRU advantages:
- Better hit rate for workloads with temporal locality
- Commonly used SuperSlabs stay cached longer
LRU disadvantages:
- 2-3 extra cycles per hit (move to front)
- More complex implementation (doubly-linked list)
Benchmark before switching: Profile sh8bench, larson, cfrac with both policies.
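For a cache this small, LRU does not need a linked list: keeping the array ordered by recency and rotating a hit to index 0 is enough. The sketch below illustrates that variant under assumed names (`LruHintCache`, `lru_hint_lookup`, `lru_hint_insert` are hypothetical and not part of the current design). It omits duplicate detection, so a caller would look up before inserting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HINT_SLOTS 4

typedef struct { uintptr_t base, end; void* ss; } HintEntry;
typedef struct { HintEntry e[HINT_SLOTS]; uint32_t count; } LruHintCache;

/* Lookup with move-to-front: a hit at index i is rotated to index 0. */
static bool lru_hint_lookup(LruHintCache* c, uintptr_t p, void** out_ss) {
    for (uint32_t i = 0; i < c->count; i++) {
        if (p >= c->e[i].base && p < c->e[i].end) {
            HintEntry hit = c->e[i];
            for (uint32_t j = i; j > 0; j--) {
                c->e[j] = c->e[j - 1];   /* shift newer entries down by one */
            }
            c->e[0] = hit;
            *out_ss = hit.ss;
            return true;
        }
    }
    return false;
}

/* Insert at the front; when full, the least recently used entry (last) drops. */
static void lru_hint_insert(LruHintCache* c, uintptr_t base, size_t size, void* ss) {
    if (c->count < HINT_SLOTS) {
        c->count++;
    }
    for (uint32_t j = c->count - 1; j > 0; j--) {
        c->e[j] = c->e[j - 1];
    }
    c->e[0] = (HintEntry){ base, base + size, ss };
}
```

The extra cost on a hit is the shift loop, which is a no-op when the hit is already at index 0. That is the common case for LIFO free patterns, so the benchmark comparison should pay most attention to workloads that alternate between several SuperSlabs.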
11.4 Per-Class Hint Caches
Current design: Single cache for all classes (4 entries, any class). Alternative: Per-class caches (1 entry per class, 8 entries total).
Per-class advantages:
- Guaranteed cache slot for each class
- No inter-class eviction
Per-class disadvantages:
- Wastes space if only 2-3 classes are active
- More TLS overhead (8 entries vs 4)
Recommendation: Defer until benchmarks show inter-class thrashing.
11.5 Statistics Export API
For production monitoring, export hit rate via:
// Global aggregated stats (all threads)
void hak_tls_hint_global_stats(uint64_t* total_hits, uint64_t* total_misses);
// ENV-based stats dump at exit
// HAKMEM_TLS_HINT_STATS=1 → dump to stderr at exit
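Because the hit and miss counters are thread-local, a global view needs some aggregation step. One low-overhead option is for each thread to fold its counters into process-wide atomics at teardown, as sketched below. `hak_tls_hint_flush_stats()` is a hypothetical hook invented for this example. The stats function follows the signature proposed above but is declared `static` here only to keep the sketch self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Process-wide totals; threads add their TLS counters at teardown. */
static _Atomic uint64_t g_hint_total_hits;
static _Atomic uint64_t g_hint_total_misses;

/* Hypothetical hook: call from thread teardown with that thread's counters. */
static void hak_tls_hint_flush_stats(uint64_t hits, uint64_t misses) {
    atomic_fetch_add_explicit(&g_hint_total_hits, hits, memory_order_relaxed);
    atomic_fetch_add_explicit(&g_hint_total_misses, misses, memory_order_relaxed);
}

/* Reads the aggregated totals; either output pointer may be NULL. */
static void hak_tls_hint_global_stats(uint64_t* total_hits, uint64_t* total_misses) {
    if (total_hits) {
        *total_hits = atomic_load_explicit(&g_hint_total_hits, memory_order_relaxed);
    }
    if (total_misses) {
        *total_misses = atomic_load_explicit(&g_hint_total_misses, memory_order_relaxed);
    }
}
```

Relaxed ordering is sufficient because the totals are monotonic counters read only for reporting. Counters of threads that are still running are not included until they flush.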
12. Implementation Checklist
12.1 Phase 1a: Core Implementation (Week 1)
- Create core/box/tls_ss_hint_box.h
- Implement tls_ss_hint_init()
- Implement tls_ss_hint_update()
- Implement tls_ss_hint_lookup()
- Implement tls_ss_hint_clear()
- Add HAKMEM_TINY_SS_TLS_HINT flag to hakmem_build_flags.h
- Add validation check (hint requires headerless mode)
12.2 Phase 1b: Integration (Week 2)
- Integrate into hakmem_tiny_free.inc (lookup path)
- Integrate into hakmem_tiny.c (update path after alloc)
- Integrate into hakmem_tiny_refill.inc.h (update path after refill)
- Integrate into core/front/tiny_unified_cache.c (update path)
- Call tls_ss_hint_init() in thread-local init
12.3 Phase 1c: Testing (Week 2-3)
- Write unit tests (tests/test_tls_ss_hint.c)
- Run unit tests: make test_tls_ss_hint && ./test_tls_ss_hint
- Build validation (hint disabled, hint enabled, error check)
- Benchmark comparison (sh8bench, cfrac, larson)
- Hit rate profiling (debug build with stats)
- Correctness tests (no crashes, no assertion failures)
12.4 Phase 1d: Validation (Week 3)
- Benchmark: sh8bench (target: +15-20%)
- Benchmark: cfrac (target: +10-15%)
- Benchmark: larson 8 threads (target: +15-20%)
- Hit rate analysis (target: 85-95%)
- Memory overhead check (target: < 150 bytes/thread)
- Regression test: Headerless=0 mode still works
12.5 Phase 1e: Documentation (Week 3-4)
- Update docs/PHASE2_HEADERLESS_INSTRUCTION.md with hint Box
- Add Box Theory annotation to hakmem Box registry
- Write performance analysis report (before/after comparison)
- Update build instructions (make shared EXTRA_CFLAGS=...)
13. Rollout Plan
Stage 1: Internal Testing (Week 1-3)
- Build with HAKMEM_TINY_SS_TLS_HINT=1 in dev environment
- Run full benchmark suite (mimalloc-bench)
- Profile with perf/cachegrind (verify cycle count reduction)
- Fix any integration bugs
Stage 2: Canary Deployment (Week 4)
- Enable hint Box in 5% of production traffic
- Monitor: crash rate, performance metrics, hit rate
- A/B test: Hint ON vs Hint OFF
Stage 3: Gradual Rollout (Week 5-6)
- 25% traffic (if canary success)
- 50% traffic
- 100% traffic
Stage 4: Default Enable (Week 7)
- Change default: HAKMEM_TINY_SS_TLS_HINT=1
- Update build scripts, CI/CD pipelines
- Announce in release notes
14. Success Metrics
| Metric | Baseline | Target | Expected change / measurement |
|---|---|---|---|
| sh8bench throughput | 54.60 Mops/s | 64-68 Mops/s | +15-20% |
| cfrac runtime | 1.25 sec | 1.10-1.15 sec | 10-15% lower |
| larson throughput | 6.5M ops/s | 7.5-8.0M ops/s | +15-20% |
| TLS hint hit rate | N/A | 85-95% | Stats API |
| free() cycle count | 15-60 cycles | 7-15 cycles (hit) | perf/cachegrind |
| Memory overhead | 0 | < 150 bytes/thread | sizeof(TlsSsHintCache) |
| Crash rate | 0.001% | 0.001% (no regression) | Production monitoring |
15. Open Questions
- Q: Should we implement per-class hint caches instead of unified cache?
  A: Defer until benchmarks show inter-class thrashing. Current unified design is simpler and sufficient for most workloads.
- Q: Should we use LRU instead of FIFO eviction?
  A: Defer until benchmarks show FIFO hit rate < 80%. FIFO is simpler and avoids move-to-front cost on hits.
- Q: Should we make TLS_SS_HINT_SLOTS runtime-configurable?
  A: No, compile-time constant allows better optimization (loop unrolling, register allocation). Consider adaptive sizing in Phase 2 if needed.
- Q: Should we validate SUPERSLAB_MAGIC in tls_ss_hint_lookup()?
  A: No, keep lookup minimal (2-5 cycles). Caller (free() path) must validate magic. This matches existing design where hak_super_lookup() also requires caller validation.
- Q: Should we export hit rate stats in production builds?
  A: Phase 1: No (save 16 bytes/thread). Phase 2: Add global aggregated stats API for monitoring if needed.
16. Conclusion
The TLS Superslab Hint Box is a low-risk, high-reward optimization that reduces the performance gap between Headerless mode and Header mode from 30% to ~15%. The design is self-contained, testable, and follows hakmem's Box Theory architecture. Expected implementation time: 3-4 weeks (including testing and validation).
Key Strengths:
- Minimal integration surface (5 call sites)
- Self-contained Box (no dependencies)
- Fail-safe fallback (miss → hak_super_lookup)
- Low memory overhead (about 104 bytes/thread in release builds)
- Proven pattern (TLS caching used in jemalloc, tcmalloc)
Next Steps:
- Review this design document
- Approve Phase 1a implementation (core Box)
- Begin implementation with unit tests
- Benchmark and validate in dev environment
- Plan Phase 2 integration (Global Class Map)
End of Design Document