# TLS Superslab Hint Box - Design Document

**Phase**: Headerless Performance Optimization - Phase 1
**Date**: 2025-12-03
**Status**: Design Review
**Author**: hakmem team

---

## 1. Executive Summary

The TLS Superslab Hint Box is a thread-local cache that accelerates pointer-to-SuperSlab resolution in Headerless mode. When the allocator is built with `HAKMEM_TINY_HEADERLESS=1`, every free() operation must translate a user pointer to its owning SuperSlab. Currently this goes through `hak_super_lookup()`, a hash table lookup costing 10-50 cycles. Caching recently used SuperSlab references in thread-local storage reduces this to 2-5 cycles on a cache hit (85-95% hit rate expected for single-threaded workloads).

**Expected Performance Improvement**: 17-25% throughput increase on sh8bench (54.60 → 64-68 Mops/s)

**Risk Level**: Low
- Thread-local storage eliminates cache coherency issues
- Magic number validation provides fail-safe fallback
- Self-contained Box with minimal integration surface
- Memory overhead: 104 bytes per thread in release builds on 64-bit targets (negligible)

---

## 2. Box Definition (Box Theory)

```
Box: TLS Superslab Hint Cache

MISSION:
  Cache recently-used SuperSlab references in TLS to accelerate
  ptr→SuperSlab resolution in Headerless mode, avoiding expensive
  hash table lookups on the critical free() path.

DESIGN:
  - Provides O(1) lookup for hot SuperSlabs (L1 cache hit, 2-5 cycles)
  - Falls back to global registry on miss (fail-safe, no data loss)
  - No ownership, no remote queues, pure read-only cache
  - FIFO eviction policy with configurable cache size (2-4 slots)

INVARIANTS:
  - hint.base <= ptr < hint.end implies hint.ss is valid
  - Miss is always safe (triggers fallback to hak_super_lookup)
  - TLS data survives only within thread lifetime
  - Cache entries are invalidated implicitly by FIFO rotation
  - Magic number check (SUPERSLAB_MAGIC), performed by the caller,
    validates every pointer returned by the cache

BOUNDARY:
  - Input: raw user pointer (void* ptr) from free() path
  - Output: SuperSlab* or NULL (miss triggers fallback)
  - Does NOT determine class_idx (that's slab_index_for's job)
  - Does NOT perform ownership validation (that's SuperSlab's job)

PERFORMANCE:
  - Cache hit: 2-5 cycles (L1 cache hit, up to 4 range checks)
  - Cache miss: 8-12 cycle scan, then fallback to hak_super_lookup (10-50 cycles)
  - Expected hit rate: 85-95% for single-threaded workloads
  - Expected hit rate: 70-85% for multi-threaded workloads

THREAD SAFETY:
  - TLS storage: no sharing, no synchronization required
  - Read-only cache: never modifies SuperSlab state
  - Stale entries: caught by the caller's magic number check
```
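
The structural part of the invariants above can be checked mechanically in debug builds. The following is a minimal sketch, not part of the proposed Box: `HintEntry`, `HintCache`, `hint_cache_invariants_hold()` and `hint_cache_debug_check()` are hypothetical names that only mirror the layout described in this document, and the `ss` field is an opaque `void*` stand-in for `struct SuperSlab*`. It checks the bookkeeping fields and that every valid entry describes a non-empty range with a non-NULL SuperSlab; it cannot detect a stale entry, which remains the job of the caller's magic check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HINT_SLOTS 4  /* mirrors TLS_SS_HINT_SLOTS (assumed value) */

/* Hypothetical mirror of the cache layout used in this document. */
typedef struct {
    void* base;  /* inclusive start of the SuperSlab range */
    void* end;   /* exclusive end of the SuperSlab range */
    void* ss;    /* opaque stand-in for struct SuperSlab* */
} HintEntry;

typedef struct {
    HintEntry entries[HINT_SLOTS];
    uint32_t  count;      /* number of valid entries */
    uint32_t  next_slot;  /* FIFO write position */
} HintCache;

/* Returns true when the bookkeeping fields are in range and every valid
 * entry has a non-NULL SuperSlab and a non-empty [base, end) range. */
static bool hint_cache_invariants_hold(const HintCache* c) {
    if (c->count > HINT_SLOTS || c->next_slot >= HINT_SLOTS) {
        return false;
    }
    for (uint32_t i = 0; i < c->count; i++) {
        const HintEntry* e = &c->entries[i];
        if (e->ss == NULL) {
            return false;
        }
        if ((uintptr_t)e->base >= (uintptr_t)e->end) {
            return false;
        }
    }
    return true;
}

/* Debug-only wrapper: aborts when an invariant is violated. */
static void hint_cache_debug_check(const HintCache* c) {
    assert(hint_cache_invariants_hold(c));
}
```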

---

## 3. Data Structures

```c
// core/box/tls_ss_hint_box.h

#ifndef TLS_SS_HINT_BOX_H
#define TLS_SS_HINT_BOX_H

#include <stddef.h>   // size_t, NULL
#include <stdint.h>
#include <stdbool.h>

// Forward declaration
struct SuperSlab;

// Cache entry for a single SuperSlab hint
// Size: 24 bytes on 64-bit targets (three pointers)
typedef struct {
    void* base;            // SuperSlab base address (aligned to 1MB or 2MB)
    void* end;             // base + superslab_size (for range check)
    struct SuperSlab* ss;  // Cached SuperSlab pointer
} TlsSsHintEntry;

// TLS hint cache configuration
// - 4 slots provide good hit rate without excessive overhead
// - Larger caches (8, 16) show diminishing returns in benchmarks
// - Smaller caches (2) may thrash on workloads with 3+ active SuperSlabs
#define TLS_SS_HINT_SLOTS 4

// Thread-local SuperSlab hint cache
// Total size on 64-bit targets: 24*4 + 8 = 104 bytes per thread
// (120 bytes when the statistics counters are compiled in)
typedef struct {
    TlsSsHintEntry entries[TLS_SS_HINT_SLOTS];  // Cache entries
    uint32_t count;      // Number of valid entries (0 to TLS_SS_HINT_SLOTS)
    uint32_t next_slot;  // Next slot for FIFO rotation (wraps at TLS_SS_HINT_SLOTS)

    // Statistics (optional, for profiling builds)
    // Disabled in HAKMEM_BUILD_RELEASE to save 16 bytes per thread
#if !HAKMEM_BUILD_RELEASE
    uint64_t hits;    // Cache hit count
    uint64_t misses;  // Cache miss count
#endif
} TlsSsHintCache;

// Thread-local storage instance
// Initialized to zero by TLS semantics, formal init in tls_ss_hint_init()
extern __thread TlsSsHintCache g_tls_ss_hint;

#endif // TLS_SS_HINT_BOX_H
```
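
The per-thread footprint follows directly from the layout: three pointers per entry, four entries, and two 32-bit bookkeeping fields. A small sketch of how those sizes could be pinned down at compile time is shown below. The type names (`Entry`, `CacheRelease`, `CacheWithStats`) and the helper `cache_bytes_for_threads()` are illustrative only, and the asserted sizes assume a 64-bit target with 8-byte pointers; on other targets the guarded block is skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 4  /* mirrors TLS_SS_HINT_SLOTS (assumed value) */

/* Illustrative copies of the layouts described above. */
typedef struct {
    void* base;
    void* end;
    void* ss;  /* opaque stand-in for struct SuperSlab* */
} Entry;

typedef struct {
    Entry    entries[SLOTS];
    uint32_t count;
    uint32_t next_slot;
} CacheRelease;

typedef struct {
    Entry    entries[SLOTS];
    uint32_t count;
    uint32_t next_slot;
    uint64_t hits;
    uint64_t misses;
} CacheWithStats;

/* Only meaningful where pointers are 64 bits wide. */
#if UINTPTR_MAX == UINT64_MAX
_Static_assert(sizeof(Entry) == 24, "entry: three 8-byte pointers");
_Static_assert(sizeof(CacheRelease) == 104, "release: 4*24 + 2*4 bytes");
_Static_assert(sizeof(CacheWithStats) == 120, "stats add two 8-byte counters");
#endif

/* Total TLS bytes consumed by the release-build cache for nthreads threads. */
static size_t cache_bytes_for_threads(size_t nthreads) {
    return nthreads * sizeof(CacheRelease);
}
```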

---

## 4. API Design

```c
// core/box/tls_ss_hint_box.h (continued)

/**
 * @brief Initialize TLS hint cache for current thread
 *
 * Call once per thread, typically in thread-local initialization path.
 * Safe to call multiple times; each call resets the cache to empty.
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~10 cycles (negligible one-time cost)
 */
static inline void tls_ss_hint_init(void);

/**
 * @brief Update hint cache with a SuperSlab reference
 *
 * Called on paths where we know the SuperSlab for a given address range:
 * - After successful tiny_alloc (cache the allocated-from SuperSlab)
 * - After superslab refill (cache the newly bound SuperSlab)
 * - After unified cache refill (cache the refilled SuperSlab)
 *
 * Duplicate detection: If the SuperSlab is already cached, no update occurs.
 * This prevents thrashing when repeatedly allocating from the same SuperSlab.
 *
 * @param ss   SuperSlab to cache (must be non-NULL, SUPERSLAB_MAGIC validated by caller)
 * @param base SuperSlab base address (1MB or 2MB aligned)
 * @param size SuperSlab size in bytes (1MB or 2MB)
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~15-20 cycles (duplicate check + FIFO rotation)
 */
static inline void tls_ss_hint_update(struct SuperSlab* ss, void* base, size_t size);

/**
 * @brief Lookup SuperSlab for given pointer (fast path)
 *
 * Called on free() entry, before falling back to hak_super_lookup().
 * Performs linear search over cached entries (4 iterations max).
 *
 * Cache hit: Returns true, sets *out_ss to cached SuperSlab pointer
 * Cache miss: Returns false, caller must use hak_super_lookup()
 *
 * @param ptr    User pointer to lookup (arbitrary alignment)
 * @param out_ss Output: SuperSlab pointer if found (only valid if return true)
 * @return true if cache hit (out_ss is valid), false if miss
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: 2-5 cycles (hit), 8-12 cycles (miss)
 *
 * NOTE: Caller MUST validate SUPERSLAB_MAGIC after successful lookup.
 *       This Box does not perform magic validation to keep fast path minimal.
 */
static inline bool tls_ss_hint_lookup(void* ptr, struct SuperSlab** out_ss);

/**
 * @brief Clear all cached hints (for testing/reset)
 *
 * Use cases:
 * - Unit tests: Reset cache between test cases
 * - Debug: Force cache cold start for profiling
 * - Thread teardown: Optional cleanup (TLS auto-cleanup on thread exit)
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~10 cycles
 */
static inline void tls_ss_hint_clear(void);

/**
 * @brief Get cache statistics (for profiling builds)
 *
 * Returns the calling thread's hit/miss counters for performance analysis.
 * Only available in non-release builds (HAKMEM_BUILD_RELEASE=0).
 *
 * @param hits   Output: Total cache hits
 * @param misses Output: Total cache misses
 *
 * Thread Safety: TLS, no synchronization required
 * Performance: ~5 cycles (two loads)
 */
#if !HAKMEM_BUILD_RELEASE
static inline void tls_ss_hint_stats(uint64_t* hits, uint64_t* misses);
#endif
```

---

## 5. Implementation Details

```c
// core/box/tls_ss_hint_box.c (or inline in .h for header-only Box)
// Note: the static inline functions below have internal linkage, so callers
// in other translation units only see them if they are defined in the header.
// Only the g_tls_ss_hint definition has to live in exactly one .c file.

#include "tls_ss_hint_box.h"
#include "../hakmem_tiny_superslab.h"  // For SuperSlab, SUPERSLAB_MAGIC

// Thread-local storage definition
__thread TlsSsHintCache g_tls_ss_hint = {0};

/**
 * Initialize TLS hint cache
 * Safe to call multiple times: there is no guard, each call simply
 * resets the cache to empty (discarding any cached hints)
 */
static inline void tls_ss_hint_init(void) {
    // Zero-initialization by TLS, but explicit init for clarity
    g_tls_ss_hint.count = 0;
    g_tls_ss_hint.next_slot = 0;

#if !HAKMEM_BUILD_RELEASE
    g_tls_ss_hint.hits = 0;
    g_tls_ss_hint.misses = 0;
#endif

    // Clear all entries (paranoid, but cache-friendly loop)
    for (int i = 0; i < TLS_SS_HINT_SLOTS; i++) {
        g_tls_ss_hint.entries[i].base = NULL;
        g_tls_ss_hint.entries[i].end = NULL;
        g_tls_ss_hint.entries[i].ss = NULL;
    }
}

/**
 * Update hint cache with SuperSlab reference
 * FIFO rotation: oldest entry is evicted when cache is full
 * Duplicate detection: skip if SuperSlab already cached
 */
static inline void tls_ss_hint_update(struct SuperSlab* ss, void* base, size_t size) {
    // Sanity check: reject invalid inputs
    if (__builtin_expect(!ss || !base || size == 0, 0)) {
        return;
    }

    // Duplicate detection: check if this SuperSlab is already cached
    // This prevents thrashing when allocating from the same SuperSlab repeatedly
    for (uint32_t i = 0; i < g_tls_ss_hint.count; i++) {
        if (g_tls_ss_hint.entries[i].ss == ss) {
            return;  // Already cached, no update needed
        }
    }

    // Add to next slot (FIFO rotation)
    uint32_t slot = g_tls_ss_hint.next_slot;
    g_tls_ss_hint.entries[slot].base = base;
    g_tls_ss_hint.entries[slot].end = (char*)base + size;
    g_tls_ss_hint.entries[slot].ss = ss;

    // Advance to next slot (wrap at TLS_SS_HINT_SLOTS)
    g_tls_ss_hint.next_slot = (slot + 1) % TLS_SS_HINT_SLOTS;

    // Increment count until cache is full
    if (g_tls_ss_hint.count < TLS_SS_HINT_SLOTS) {
        g_tls_ss_hint.count++;
    }
}

/**
 * Lookup SuperSlab for pointer (fast path)
 * Linear search over cached entries (4 iterations max)
 *
 * Performance note:
 * - Linear search is faster than hash table for small N (N <= 8)
 * - The range comparison (ptr >= base && ptr < end) costs about 2-3 cycles
 * - Total cost: 2-5 cycles (hit), 8-12 cycles (miss with 4 entries)
 */
static inline bool tls_ss_hint_lookup(void* ptr, struct SuperSlab** out_ss) {
    // Compare as integers: relational operators on pointers into
    // different objects are undefined behavior in C
    const uintptr_t p = (uintptr_t)ptr;

    // Fast path: iterate over valid entries
    // Unrolling this loop (count is small) may help, but let the compiler decide
    for (uint32_t i = 0; i < g_tls_ss_hint.count; i++) {
        const TlsSsHintEntry* e = &g_tls_ss_hint.entries[i];

        // Range check: base <= ptr < end
        // Note: end is exclusive (base + size), so use < not <=
        if (p >= (uintptr_t)e->base && p < (uintptr_t)e->end) {
            // Cache hit!
            *out_ss = e->ss;

#if !HAKMEM_BUILD_RELEASE
            g_tls_ss_hint.hits++;
#endif

            return true;
        }
    }

    // Cache miss: caller must fall back to hak_super_lookup()
#if !HAKMEM_BUILD_RELEASE
    g_tls_ss_hint.misses++;
#endif

    return false;
}

/**
 * Clear all cached hints
 * Use for testing or manual reset
 */
static inline void tls_ss_hint_clear(void) {
    g_tls_ss_hint.count = 0;
    g_tls_ss_hint.next_slot = 0;

#if !HAKMEM_BUILD_RELEASE
    // Preserve stats across clear (for cumulative profiling)
    // Uncomment to reset stats:
    // g_tls_ss_hint.hits = 0;
    // g_tls_ss_hint.misses = 0;
#endif

    // Optional: zero out entries (paranoid, not required for correctness)
    for (int i = 0; i < TLS_SS_HINT_SLOTS; i++) {
        g_tls_ss_hint.entries[i].base = NULL;
        g_tls_ss_hint.entries[i].end = NULL;
        g_tls_ss_hint.entries[i].ss = NULL;
    }
}

/**
 * Get cache statistics (profiling builds only)
 */
#if !HAKMEM_BUILD_RELEASE
static inline void tls_ss_hint_stats(uint64_t* hits, uint64_t* misses) {
    if (hits) *hits = g_tls_ss_hint.hits;
    if (misses) *misses = g_tls_ss_hint.misses;
}
#endif
```

---

## 6. Integration Points

### 6.1 Update Points: When to Call `tls_ss_hint_update()`

The hint cache should be updated whenever we know the SuperSlab for an address range. This happens on allocation success paths:

#### Location 1: After Successful Tiny Alloc (hakmem_tiny.c)
```c
// In hak_tiny_alloc or similar allocation path
SuperSlab* ss = NULL;
void* ptr = tiny_allocate_from_superslab(class_idx, &ss);
if (ptr) {
#if HAKMEM_TINY_SS_TLS_HINT
    // Cache the SuperSlab we just allocated from
    // This improves free() performance for LIFO allocation patterns
    tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
#endif
    return ptr;
}
```

#### Location 2: After SuperSlab Refill (hakmem_tiny_refill.inc.h)
```c
// In tiny_refill_from_superslab or superslab_allocate
SuperSlab* ss = superslab_allocate(class_idx);
if (ss) {
    // Bind SuperSlab to thread's TLS state
    bind_superslab_to_thread(ss, class_idx);

#if HAKMEM_TINY_SS_TLS_HINT
    // Cache the newly bound SuperSlab
    // Future frees of blocks from this SuperSlab will hit the cached hint
    tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
#endif
}
```

#### Location 3: Unified Cache Refill (core/front/tiny_unified_cache.c)
```c
// In unified_cache_refill_class
SuperSlab* ss = NULL;
void* block = superslab_alloc_block(class_idx, &ss);
if (block) {
#if HAKMEM_TINY_SS_TLS_HINT
    // Cache the SuperSlab that provided this block
    tls_ss_hint_update(ss, ss->base_addr, ss->size_bytes);
#endif

    // Push to unified cache
    unified_cache_push(class_idx, block);
}
```

#### Location 4: Thread-Local Init (hakmem_tiny_tls_init)
```c
// In tiny_tls_init or thread_local_init
void tiny_tls_init(void) {
    // Initialize TLS structures
    tiny_magazine_init();
    tiny_sll_init();

#if HAKMEM_TINY_SS_TLS_HINT
    // Initialize hint cache (zero-init by TLS, but explicit for clarity)
    tls_ss_hint_init();
#endif
}
```

### 6.2 Lookup Points: When to Call `tls_ss_hint_lookup()`

The hint lookup should be the **first step** in the free() path, before falling back to the registry lookup:

#### Location 1: Tiny Free Entry (core/hakmem_tiny_free.inc)
```c
// In hak_tiny_free or similar free path
void hak_tiny_free(void* ptr) {
    if (!ptr) return;

    SuperSlab* ss = NULL;

#if HAKMEM_TINY_HEADERLESS
    // Phase 1: Try TLS hint cache (fast path, 2-5 cycles on hit)
#if HAKMEM_TINY_SS_TLS_HINT
    if (!tls_ss_hint_lookup(ptr, &ss)) {
#endif
        // Phase 2: Fallback to global registry (slow path, 10-50 cycles)
        ss = hak_super_lookup(ptr);
#if HAKMEM_TINY_SS_TLS_HINT
    }
#endif

    // Validate SuperSlab (magic check)
    if (!ss || ss->magic != SUPERSLAB_MAGIC) {
        // Invalid pointer - external guard path
        hak_external_guard_free(ptr);
        return;
    }

    // Proceed with free using SuperSlab info
    int class_idx = slab_index_for(ss, ptr);
    tiny_free_to_slab(ss, ptr, class_idx);

#else
    // Header mode: read class_idx from header (1-3 cycles)
    uint8_t hdr = *((uint8_t*)ptr - 1);
    int class_idx = hdr & 0x7;
    tiny_free_to_class(class_idx, ptr);
#endif
}
```

#### Location 2: Fast Free Path (core/tiny_free_fast_v2.inc.h)
```c
// In tiny_free_fast or inline free path
static inline void tiny_free_fast(void* ptr) {
#if HAKMEM_TINY_HEADERLESS
    SuperSlab* ss = NULL;

    // Try hint cache first
#if HAKMEM_TINY_SS_TLS_HINT
    if (!tls_ss_hint_lookup(ptr, &ss)) {
#endif
        ss = hak_super_lookup(ptr);
#if HAKMEM_TINY_SS_TLS_HINT
    }
#endif

    if (__builtin_expect(!ss || ss->magic != SUPERSLAB_MAGIC, 0)) {
        // Slow path: external guard or invalid pointer
        hak_tiny_free_slow(ptr);
        return;
    }

    // Fast path: push to TLS freelist
    int class_idx = slab_index_for(ss, ptr);
    front_gate_push_tls(class_idx, ptr);

#else
    // Header mode fast path
    uint8_t hdr = *((uint8_t*)ptr - 1);
    int class_idx = hdr & 0x7;
    front_gate_push_tls(class_idx, ptr);
#endif
}
```
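
Both lookup sites repeat the same hint-then-registry-then-magic sequence, including the interleaved `#if` blocks. One way to keep that sequence in a single place is a small resolver helper, sketched below. This is an assumption-laden illustration rather than existing hakmem code: `tiny_resolve_superslab()` is a hypothetical name, the `SUPERSLAB_MAGIC` value is a placeholder, and `tls_ss_hint_lookup()` and `hak_super_lookup()` are replaced by trivial mocks (driven by the `g_mock_*` globals) so the control flow can be exercised in isolation. The helper returns NULL whenever the pointer cannot be resolved to a SuperSlab with a valid magic, leaving the slow path to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef HAKMEM_TINY_SS_TLS_HINT
#define HAKMEM_TINY_SS_TLS_HINT 1
#endif

#define SUPERSLAB_MAGIC 0x53534C42u  /* placeholder value for this sketch only */

typedef struct SuperSlab { uint32_t magic; } SuperSlab;  /* reduced mock type */

/* --- Mocks standing in for the real hint Box and global registry --- */
static SuperSlab* g_mock_hint_ss;         /* SuperSlab known to the mock hint cache */
static uintptr_t  g_mock_hint_base;       /* start of its address range */
static size_t     g_mock_hint_size;       /* length of its address range */
static SuperSlab* g_mock_registry_ss;     /* what the mock registry returns */
static int        g_mock_registry_calls;  /* counts slow-path lookups */

static bool tls_ss_hint_lookup(void* ptr, SuperSlab** out_ss) {
    uintptr_t p = (uintptr_t)ptr;
    if (g_mock_hint_ss && p >= g_mock_hint_base &&
        p - g_mock_hint_base < g_mock_hint_size) {
        *out_ss = g_mock_hint_ss;
        return true;
    }
    return false;
}

static SuperSlab* hak_super_lookup(void* ptr) {
    (void)ptr;
    g_mock_registry_calls++;
    return g_mock_registry_ss;
}

/* Hypothetical helper: hint cache first, registry on miss, then magic check.
 * Returns NULL when ptr does not resolve to a valid SuperSlab. */
static inline SuperSlab* tiny_resolve_superslab(void* ptr) {
    SuperSlab* ss = NULL;
#if HAKMEM_TINY_SS_TLS_HINT
    if (!tls_ss_hint_lookup(ptr, &ss))
#endif
    {
        ss = hak_super_lookup(ptr);
    }
    if (ss == NULL || ss->magic != SUPERSLAB_MAGIC) {
        return NULL;
    }
    return ss;
}
```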

---

## 7. Build Flag

```c
// In hakmem_build_flags.h or similar configuration header

// ============================================================================
// Phase 1: Headerless Optimization - TLS SuperSlab Hint Cache
// ============================================================================
// Purpose: Accelerate ptr→SuperSlab lookup in Headerless mode
// Default: 0 (disabled during development and testing)
// Target: 1 (enabled after validation in Phase 1 rollout)
//
// Performance Impact:
// - Cache hit: 2-5 cycles (vs 10-50 cycles for hak_super_lookup)
// - Expected hit rate: 85-95% (single-threaded), 70-85% (multi-threaded)
// - Expected throughput improvement: 15-25% (see benchmark targets)
//
// Memory Overhead:
// - 104 bytes per thread (TLS, release build, 64-bit targets)
// - Negligible for typical workloads (1000 threads = 104 KB)
//
// Dependencies:
// - Requires HAKMEM_TINY_HEADERLESS=1 (hint is no-op in header mode)
// - No other dependencies (self-contained Box)

#ifndef HAKMEM_TINY_SS_TLS_HINT
#define HAKMEM_TINY_SS_TLS_HINT 0
#endif

// Validation: Hint Box only active in Headerless mode
#if HAKMEM_TINY_SS_TLS_HINT && !HAKMEM_TINY_HEADERLESS
#error "HAKMEM_TINY_SS_TLS_HINT requires HAKMEM_TINY_HEADERLESS=1"
#endif
```

---

## 8. Testing Plan

### 8.1 Unit Tests

Create `/mnt/workdisk/public_share/hakmem/tests/test_tls_ss_hint.c`:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include "core/box/tls_ss_hint_box.h"
#include "core/hakmem_tiny_superslab.h"

// Mock SuperSlab for testing
typedef struct {
    uint32_t magic;
    void* base_addr;
    size_t size_bytes;
    uint8_t size_class;
} MockSuperSlab;

void test_hint_init(void) {
    printf("test_hint_init...\n");

    tls_ss_hint_init();

    // Verify cache is empty
    assert(g_tls_ss_hint.count == 0);
    assert(g_tls_ss_hint.next_slot == 0);

#if !HAKMEM_BUILD_RELEASE
    assert(g_tls_ss_hint.hits == 0);
    assert(g_tls_ss_hint.misses == 0);
#endif

    printf("  PASS\n");
}

void test_hint_basic(void) {
    printf("test_hint_basic...\n");

    tls_ss_hint_init();

    // Mock SuperSlab covering [0x1000000, 0x1200000)
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,  // 2MB
        .size_class = 0
    };

    // Update hint
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    // Verify cache entry
    assert(g_tls_ss_hint.count == 1);
    assert(g_tls_ss_hint.entries[0].base == ss.base_addr);
    assert(g_tls_ss_hint.entries[0].ss == (SuperSlab*)&ss);

    // Lookup should hit (within range)
    SuperSlab* out = NULL;
    assert(tls_ss_hint_lookup((void*)0x1000100, &out) == true);
    assert(out == (SuperSlab*)&ss);

    // Lookup at base should hit
    assert(tls_ss_hint_lookup((void*)0x1000000, &out) == true);
    assert(out == (SuperSlab*)&ss);

    // Lookup at end-1 should hit (0x1000000 + 0x200000 - 1)
    assert(tls_ss_hint_lookup((void*)0x11FFFFF, &out) == true);
    assert(out == (SuperSlab*)&ss);

    // Lookup at end should miss (exclusive boundary)
    assert(tls_ss_hint_lookup((void*)0x1200000, &out) == false);

    // Lookup outside range should miss
    assert(tls_ss_hint_lookup((void*)0x3000000, &out) == false);

    printf("  PASS\n");
}

void test_hint_fifo_rotation(void) {
    printf("test_hint_fifo_rotation...\n");

    tls_ss_hint_init();

    // Create 6 mock SuperSlabs (cache has 4 slots)
    MockSuperSlab ss[6];
    for (int i = 0; i < 6; i++) {
        ss[i].magic = SUPERSLAB_MAGIC;
        ss[i].base_addr = (void*)(uintptr_t)(0x1000000 + i * 0x200000);  // 2MB apart
        ss[i].size_bytes = 2 * 1024 * 1024;
        ss[i].size_class = 0;

        tls_ss_hint_update((SuperSlab*)&ss[i], ss[i].base_addr, ss[i].size_bytes);
    }

    // Cache should be full (4 slots)
    assert(g_tls_ss_hint.count == TLS_SS_HINT_SLOTS);

    // First 2 SuperSlabs should be evicted (FIFO)
    SuperSlab* out = NULL;
    assert(tls_ss_hint_lookup((void*)0x1000100, &out) == false);  // ss[0] evicted
    assert(tls_ss_hint_lookup((void*)0x1200100, &out) == false);  // ss[1] evicted

    // Last 4 SuperSlabs should be cached
    assert(tls_ss_hint_lookup((void*)0x1400100, &out) == true);   // ss[2]
    assert(out == (SuperSlab*)&ss[2]);
    assert(tls_ss_hint_lookup((void*)0x1600100, &out) == true);   // ss[3]
    assert(out == (SuperSlab*)&ss[3]);
    assert(tls_ss_hint_lookup((void*)0x1800100, &out) == true);   // ss[4]
    assert(out == (SuperSlab*)&ss[4]);
    assert(tls_ss_hint_lookup((void*)0x1A00100, &out) == true);   // ss[5]
    assert(out == (SuperSlab*)&ss[5]);

    printf("  PASS\n");
}

void test_hint_duplicate_detection(void) {
    printf("test_hint_duplicate_detection...\n");

    tls_ss_hint_init();

    // Mock SuperSlab
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,
        .size_class = 0
    };

    // Update hint 3 times with same SuperSlab
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    // Cache should have only 1 entry (duplicates ignored)
    assert(g_tls_ss_hint.count == 1);
    assert(g_tls_ss_hint.entries[0].ss == (SuperSlab*)&ss);

    printf("  PASS\n");
}

void test_hint_clear(void) {
    printf("test_hint_clear...\n");

    tls_ss_hint_init();

    // Add some entries
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,
        .size_class = 0
    };
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    assert(g_tls_ss_hint.count == 1);

    // Clear cache
    tls_ss_hint_clear();

    // Cache should be empty
    assert(g_tls_ss_hint.count == 0);
    assert(g_tls_ss_hint.next_slot == 0);

    // Lookup should miss
    SuperSlab* out = NULL;
    assert(tls_ss_hint_lookup((void*)0x1000100, &out) == false);

    printf("  PASS\n");
}

#if !HAKMEM_BUILD_RELEASE
void test_hint_stats(void) {
    printf("test_hint_stats...\n");

    tls_ss_hint_init();

    // Add entry
    MockSuperSlab ss = {
        .magic = SUPERSLAB_MAGIC,
        .base_addr = (void*)0x1000000,
        .size_bytes = 2 * 1024 * 1024,
        .size_class = 0
    };
    tls_ss_hint_update((SuperSlab*)&ss, ss.base_addr, ss.size_bytes);

    // Perform lookups
    SuperSlab* out = NULL;
    tls_ss_hint_lookup((void*)0x1000100, &out);  // Hit
    tls_ss_hint_lookup((void*)0x1000200, &out);  // Hit
    tls_ss_hint_lookup((void*)0x3000000, &out);  // Miss

    // Check stats
    uint64_t hits = 0, misses = 0;
    tls_ss_hint_stats(&hits, &misses);

    assert(hits == 2);
    assert(misses == 1);

    printf("  PASS\n");
}
#endif

int main(void) {
    printf("Running TLS SS Hint Box unit tests...\n\n");

    test_hint_init();
    test_hint_basic();
    test_hint_fifo_rotation();
    test_hint_duplicate_detection();
    test_hint_clear();

#if !HAKMEM_BUILD_RELEASE
    test_hint_stats();
#endif

    printf("\nAll tests passed!\n");
    return 0;
}
```

### 8.2 Integration Tests

#### Test 1: Build Validation
```bash
# Case 1: Build with hint disabled (baseline)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"

# Case 2: Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Case 3: Verify the hint is rejected in header mode (should error)
# make clean
# make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=0 -DHAKMEM_TINY_SS_TLS_HINT=1"
# Expected: Compile error (validation check in hakmem_build_flags.h)
```

#### Test 2: Benchmark Comparison
```bash
# Build baseline (hint disabled)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"

# Run benchmarks
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench > baseline.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 17545186520809 > cfrac_baseline.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 > larson_baseline.txt

# Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Run same benchmarks
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench > hint.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 17545186520809 > cfrac_hint.txt
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8 > larson_hint.txt

# Compare results
echo "=== sh8bench ==="
grep "Mops" baseline.txt hint.txt

echo "=== cfrac ==="
grep "time:" cfrac_baseline.txt cfrac_hint.txt

echo "=== larson ==="
grep "ops/s" larson_baseline.txt larson_hint.txt
```

#### Test 3: Hit Rate Profiling
```bash
# Build with stats enabled (non-release)
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1 -DHAKMEM_BUILD_RELEASE=0"

# Add stats dump at exit (in hakmem_exit.c or similar).
# The counters are thread-local, so this reports the calling thread only.
# Requires <inttypes.h> for PRIu64.
# void dump_hint_stats(void) {
#     uint64_t hits = 0, misses = 0;
#     tls_ss_hint_stats(&hits, &misses);
#     uint64_t total = hits + misses;
#     fprintf(stderr, "[TLS_HINT_STATS] hits=%" PRIu64 " misses=%" PRIu64 " hit_rate=%.2f%%\n",
#             hits, misses, total ? 100.0 * hits / total : 0.0);
# }

# Run benchmark and check hit rate
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench 2>&1 | grep TLS_HINT_STATS
# Expected: hit_rate >= 85%
```
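
The stats dump described in the comments above can be separated into a pure formatting step that is easy to unit-test. The sketch below is illustrative: `hint_hit_rate_percent()` and `hint_format_stats()` are hypothetical helpers, not existing hakmem functions, and they take the counters as plain arguments instead of calling `tls_ss_hint_stats()`. The output format follows the `[TLS_HINT_STATS]` line that the grep in Test 3 looks for, without the trailing newline. The sketch guards against division by zero when no lookup was recorded and uses `PRIu64` so the format string matches `uint64_t` on every platform.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hit rate in percent; 0.0 when no lookups were recorded (avoids 0/0). */
static double hint_hit_rate_percent(uint64_t hits, uint64_t misses) {
    uint64_t total = hits + misses;
    return total ? 100.0 * (double)hits / (double)total : 0.0;
}

/* Formats one stats line into buf and returns the snprintf() result. */
static int hint_format_stats(char* buf, size_t len, uint64_t hits, uint64_t misses) {
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "[TLS_HINT_STATS] hits=%" PRIu64 " misses=%" PRIu64
                    " hit_rate=%.2f%%",
                    hits, misses, hint_hit_rate_percent(hits, misses));
}
```

A real `dump_hint_stats()` would fetch the counters with `tls_ss_hint_stats()`, format them with a helper like this, and write the line to stderr. Because the counters are thread-local, a dump registered with `atexit()` would cover only the thread that runs the exit handlers.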

### 8.3 Correctness Tests

```bash
# Goal: confirm that cache misses fall back to hak_super_lookup() correctly
# and that frees resolved through the hint cache behave like the baseline

# Build with hint enabled
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"

# Run sh8bench (allocates from multiple SuperSlabs)
# No crashes or assertion failures = success
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench \
    && echo "Correctness test passed"
```

---

## 9. Performance Expectations

### 9.1 Cycle Count Analysis

| Operation | Without Hint | With Hint (Hit) | With Hint (Miss) | Improvement |
|-----------|-------------|----------------|-----------------|-------------|
| free() lookup | 10-50 cycles | 2-5 cycles | 18-62 cycles (8-12 scan + 10-50 fallback) | 80-95% (hit) |
| Range check (per entry) | N/A | 2 cycles | 2 cycles | - |
| Hash table lookup | 10-50 cycles | N/A | 10-50 cycles | - |
| Total free() cost | 15-60 cycles | 7-15 cycles (hit) | 23-72 cycles (miss) | 40-60% (hit) |

### 9.2 Expected Hit Rates

| Workload | Hit Rate | Reasoning |
|----------|----------|-----------|
| Single-threaded LIFO | 95-99% | free() follows alloc() closely, from the same SuperSlab |
| Single-threaded FIFO | 85-95% | Recent allocations from 2-4 SuperSlabs |
| Multi-threaded (8 threads) | 70-85% | Shared SuperSlabs, more cache thrashing |
| Larson (high churn) | 65-80% | Many active SuperSlabs, frequent evictions |

|
||
|
|
|
||
|
|
### 9.3 Benchmark Targets
|
||
|
|
|
||
|
|
| Benchmark | Baseline (no hint) | Target (with hint) | Improvement |
|
||
|
|
|-----------|-------------------|-------------------|-------------|
|
||
|
|
| sh8bench | 54.60 Mops/s | 64-68 Mops/s | +15-20% |
|
||
|
|
| cfrac | 1.25 sec | 1.10-1.15 sec | +10-15% |
|
||
|
|
| larson (8 threads) | 6.5M ops/s | 7.5-8.0M ops/s | +15-20% |
|
||
|
|
|
||
|
|
### 9.4 Memory Overhead
|
||
|
|
|
||
|
|
| Metric | Value | Notes |
|
||
|
|
|--------|-------|-------|
|
||
|
|
| Per-thread overhead | 112 bytes | TLS cache (release build) |
|
||
|
|
| Per-thread overhead (debug) | 128 bytes | TLS cache + stats counters |
|
||
|
|
| 1000 threads | 112 KB | Negligible for server workloads |
|
||
|
|
| 10000 threads | 1.12 MB | Still negligible |

---

## 10. Risk Analysis

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| **Cache coherency issues** | Very Low | Low | TLS is thread-local, no sharing between threads |
| **Stale hint after munmap** | Low | Low | Magic check (SUPERSLAB_MAGIC) catches freed SuperSlabs |
| **Cache thrashing (many SS)** | Low | Low | 4 slots cover typical workloads; miss falls back to registry |
| **Memory overhead** | Very Low | Very Low | 112 bytes/thread, negligible for most workloads |
| **Integration bugs** | Low | Medium | Self-contained Box, clear API, comprehensive tests |
| **Hit rate lower than expected** | Low | Low | Even a 50% hit rate improves performance; a miss adds only the hint range checks (about 5 cycles, see Section 9.1) before the registry fallback |
| **Complexity increase** | Low | Low | 150 LOC, header-only Box, minimal dependencies |

### 10.1 Failure Modes and Recovery

| Failure Mode | Detection | Recovery |
|-------------|-----------|----------|
| Stale SuperSlab pointer | Magic check (SUPERSLAB_MAGIC != expected) | Fall back to hak_super_lookup() |
| Cache miss | tls_ss_hint_lookup returns false | Fall back to hak_super_lookup() |
| Invalid hint range | ptr outside [base, end) | Linear search continues, eventually misses |
| Thread teardown | TLS cleanup by OS | No manual cleanup needed |
| SuperSlab freed | Magic number cleared | Caught by magic check in free() path |
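
The recovery paths in the table can be sketched end to end with a mock registry. Everything below is an illustration under stated assumptions: the entry layout, the `SUPERSLAB_MAGIC` value, the `tls_ss_hint_update()` signature, the single-range `hak_super_lookup()` stub, and the `resolve_superslab()` helper are all hypothetical. The magic check also assumes the SuperSlab header is still mapped when it is read.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SUPERSLAB_MAGIC   0x48414B4Du  /* placeholder value (assumption) */
#define TLS_SS_HINT_SLOTS 4

typedef struct SuperSlab { uint32_t magic; } SuperSlab;

typedef struct {
    uintptr_t base, end;   /* entry covers [base, end) */
    SuperSlab* ss;
} TlsSsHintEntry;

typedef struct {
    TlsSsHintEntry entries[TLS_SS_HINT_SLOTS];
    uint32_t count;        /* number of valid entries */
    uint32_t next_slot;    /* FIFO write cursor */
} TlsSsHintCache;

static _Thread_local TlsSsHintCache g_hint;

/* Mock registry holding one range; stands in for the real hak_super_lookup(). */
static SuperSlab* g_registry_ss;
static uintptr_t g_registry_base, g_registry_end;

static SuperSlab* hak_super_lookup(const void* ptr) {
    uintptr_t p = (uintptr_t)ptr;
    if (g_registry_ss && p >= g_registry_base && p < g_registry_end)
        return g_registry_ss;
    return NULL;
}

/* Linear range check over the valid entries; no magic validation here. */
static bool tls_ss_hint_lookup(const void* ptr, SuperSlab** out) {
    uintptr_t p = (uintptr_t)ptr;
    for (uint32_t i = 0; i < g_hint.count; i++) {
        const TlsSsHintEntry* e = &g_hint.entries[i];
        if (p >= e->base && p < e->end) {
            *out = e->ss;
            return true;
        }
    }
    return false;
}

/* FIFO insert: overwrite the slot at the write cursor. */
static void tls_ss_hint_update(SuperSlab* ss, uintptr_t base, uintptr_t end) {
    TlsSsHintEntry* e = &g_hint.entries[g_hint.next_slot];
    e->base = base;
    e->end = end;
    e->ss = ss;
    g_hint.next_slot = (g_hint.next_slot + 1) % TLS_SS_HINT_SLOTS;
    if (g_hint.count < TLS_SS_HINT_SLOTS) g_hint.count++;
}

/* Hint first; registry on a miss or a failed magic check.
 * Returns NULL for pointers that no valid SuperSlab owns. */
static SuperSlab* resolve_superslab(const void* ptr) {
    SuperSlab* ss = NULL;
    if (tls_ss_hint_lookup(ptr, &ss) && ss->magic == SUPERSLAB_MAGIC)
        return ss;
    ss = hak_super_lookup(ptr);
    return (ss && ss->magic == SUPERSLAB_MAGIC) ? ss : NULL;
}
```

Keeping the magic check in the caller rather than in `tls_ss_hint_lookup()` mirrors the answer to Open Question 4: the lookup itself stays a pure range scan.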

---

## 11. Future Considerations

### 11.1 Phase 2 Integration: Global Class Map

When Phase 2 introduces a Global Class Map (pointer → class_idx lookup), the TLS Hint Box becomes the first tier in a three-tier lookup hierarchy:

```
Tier 1 (fastest): TLS Hint Cache (2-5 cycles, 85-95% hit rate)
    ↓ miss
Tier 2 (medium): Global Class Map (5-15 cycles, 99%+ hit rate)
    ↓ miss
Tier 3 (slowest): Global SuperSlab Registry (10-50 cycles, 100% hit rate)
```

**Integration point**:
```c
SuperSlab* ss = NULL;
int class_idx = -1;

// Tier 1: TLS hint (the caller validates the magic; see Section 15, Q4)
#if HAKMEM_TINY_SS_TLS_HINT
if (tls_ss_hint_lookup(ptr, &ss) && ss->magic == SUPERSLAB_MAGIC) {
    class_idx = slab_index_for(ss, ptr);
    goto found;
}
#endif

// Tier 2: Global class map
#if HAKMEM_TINY_CLASS_MAP
class_idx = class_map_lookup(ptr);
if (class_idx >= 0) {
    ss = hak_super_lookup(ptr);  // Still need SS for metadata
    goto found;
}
#endif

// Tier 3: Registry fallback
ss = hak_super_lookup(ptr);
if (ss && ss->magic == SUPERSLAB_MAGIC) {
    class_idx = slab_index_for(ss, ptr);
    goto found;
}

// External pointer
hak_external_guard_free(ptr);
return;

found:
    tiny_free_to_class(class_idx, ptr);
```

### 11.2 Adaptive Cache Sizing

Current design uses fixed `TLS_SS_HINT_SLOTS = 4`. Future optimization could make this adaptive:

- **Workload detection**: Track hit rate over time windows
- **Dynamic sizing**: Increase slots (4 → 8) if hit rate < 80%
- **Memory pressure**: Decrease slots (8 → 2) if memory constrained

**Implementation sketch**:
```c
#define TLS_SS_HINT_SLOTS_MAX 8

typedef struct {
    uint32_t current_slots;  // Dynamic (2, 4, 8)
    uint64_t hits_window;
    uint64_t misses_window;
} TlsSsHintAdaptive;

// Sketch: assumes g_tls_ss_hint carries the TlsSsHintAdaptive fields.
void tls_ss_hint_tune(void) {
    uint64_t total = g_tls_ss_hint.hits_window + g_tls_ss_hint.misses_window;
    if (total == 0) return;  // Empty window: nothing to tune, avoid dividing by zero

    double hit_rate = (double)g_tls_ss_hint.hits_window / (double)total;

    if (hit_rate < 0.80 && g_tls_ss_hint.current_slots < TLS_SS_HINT_SLOTS_MAX) {
        g_tls_ss_hint.current_slots *= 2;  // Grow cache
    } else if (hit_rate > 0.95 && g_tls_ss_hint.current_slots > 2) {
        g_tls_ss_hint.current_slots /= 2;  // Shrink cache
    }

    // Start a fresh measurement window
    g_tls_ss_hint.hits_window = 0;
    g_tls_ss_hint.misses_window = 0;
}
```

### 11.3 LRU vs FIFO Eviction Policy

Current design uses FIFO (simple, predictable). Alternative: LRU with move-to-front on hit.

**LRU advantages**:
- Better hit rate for workloads with temporal locality
- Commonly used SuperSlabs stay cached longer

**LRU disadvantages**:
- 2-3 extra cycles per hit (move to front)
- More complex implementation (doubly-linked list)

**Benchmark before switching**: Profile sh8bench, larson, cfrac with both policies.
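
To make the trade-off concrete, the two policies can be compared on a synthetic trace. The sketch below is a toy model, not the Box implementation: it caches integer SuperSlab ids instead of address ranges, uses a plain array with `memmove` for move-to-front rather than a linked list, and all names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SLOTS 4

typedef struct {
    int ids[SLOTS];
    int count;   /* number of valid slots */
    int next;    /* FIFO write cursor (unused by LRU) */
} HintCache;

/* FIFO: a hit leaves the cache untouched; a miss overwrites the oldest slot.
 * Returns 1 on hit, 0 on miss. */
static int fifo_access(HintCache* c, int id) {
    for (int i = 0; i < c->count; i++)
        if (c->ids[i] == id) return 1;
    c->ids[c->next] = id;
    c->next = (c->next + 1) % SLOTS;
    if (c->count < SLOTS) c->count++;
    return 0;
}

/* LRU via move-to-front: slot 0 is the most recent entry and the last
 * valid slot is the eviction victim. Returns 1 on hit, 0 on miss. */
static int lru_access(HintCache* c, int id) {
    int pos = -1;
    for (int i = 0; i < c->count; i++)
        if (c->ids[i] == id) { pos = i; break; }
    int hit = (pos >= 0);
    if (!hit) {
        if (c->count < SLOTS) c->count++;
        pos = c->count - 1;   /* new entry takes the tail slot, evicting if full */
    }
    /* Shift entries [0, pos) one slot toward the tail, then place id in front. */
    memmove(&c->ids[1], &c->ids[0], (size_t)pos * sizeof c->ids[0]);
    c->ids[0] = id;
    return hit;
}

/* Replays a trace of ids through a fresh cache and counts the hits. */
static int count_hits(int (*access)(HintCache*, int), const int* trace, size_t n) {
    HintCache c = {0};
    int hits = 0;
    for (size_t i = 0; i < n; i++) hits += access(&c, trace[i]);
    return hits;
}
```

On the trace `1, 2, 3, 4, 1, 5, 1, 6, 1, 7, 1, 8` (one hot SuperSlab interleaved with a stream of cold ones), FIFO scores 3 hits and move-to-front scores 4, because FIFO eventually evicts the hot entry. Real traces may behave differently, which is why profiling both policies comes first.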

### 11.4 Per-Class Hint Caches

Current design: Single cache for all classes (4 entries, any class).
Alternative: Per-class caches (1 entry per class, 8 entries total).

**Per-class advantages**:
- Guaranteed cache slot for each class
- No inter-class eviction

**Per-class disadvantages**:
- Wastes space if only 2-3 classes are active
- More TLS overhead (8 entries vs 4)

**Recommendation**: Defer until benchmarks show inter-class thrashing.

### 11.5 Statistics Export API

For production monitoring, export hit rate via:

```c
// Global aggregated stats (all threads)
void hak_tls_hint_global_stats(uint64_t* total_hits, uint64_t* total_misses);

// ENV-based stats dump at exit
// HAKMEM_TLS_HINT_STATS=1 → dump to stderr at exit
```
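
One possible shape for the aggregation, assuming each thread folds its private counters into global atomics at some flush point (for example, thread exit). Only `hak_tls_hint_global_stats()` and the `HAKMEM_TLS_HINT_STATS` variable come from the declarations above; the flush hook, the counter names, and the enable check are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Global totals, updated only when a thread flushes (keeps the hot path atomic-free). */
static _Atomic uint64_t g_total_hits, g_total_misses;

/* Per-thread counters bumped on the lookup path (hypothetical names). */
static _Thread_local uint64_t t_hits, t_misses;

/* Hypothetical flush hook: fold this thread's counters into the totals and reset them. */
static void hak_tls_hint_flush_stats(void) {
    atomic_fetch_add_explicit(&g_total_hits, t_hits, memory_order_relaxed);
    atomic_fetch_add_explicit(&g_total_misses, t_misses, memory_order_relaxed);
    t_hits = 0;
    t_misses = 0;
}

/* Snapshot of the flushed totals; counters not yet flushed are not included. */
void hak_tls_hint_global_stats(uint64_t* total_hits, uint64_t* total_misses) {
    *total_hits = atomic_load_explicit(&g_total_hits, memory_order_relaxed);
    *total_misses = atomic_load_explicit(&g_total_misses, memory_order_relaxed);
}

/* Returns 1 when HAKMEM_TLS_HINT_STATS is set to "1", otherwise 0. */
static int hak_tls_hint_stats_enabled(void) {
    const char* v = getenv("HAKMEM_TLS_HINT_STATS");
    return v != NULL && strcmp(v, "1") == 0;
}
```

Relaxed ordering is enough here because the totals are independent monitoring counters; the snapshot may lag behind threads that have not flushed yet.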

---

## 12. Implementation Checklist

### 12.1 Phase 1a: Core Implementation (Week 1)
- [ ] Create `core/box/tls_ss_hint_box.h`
- [ ] Implement `tls_ss_hint_init()`
- [ ] Implement `tls_ss_hint_update()`
- [ ] Implement `tls_ss_hint_lookup()`
- [ ] Implement `tls_ss_hint_clear()`
- [ ] Add `HAKMEM_TINY_SS_TLS_HINT` flag to `hakmem_build_flags.h`
- [ ] Add validation check (hint requires headerless mode)
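
The last checklist item could take the form of a preprocessor guard. This is a sketch only: it assumes both flags default to 0 when the build does not define them, and the error text is illustrative.

```c
/* Assumed defaults: both features off unless the build sets them. */
#ifndef HAKMEM_TINY_HEADERLESS
#define HAKMEM_TINY_HEADERLESS 0
#endif
#ifndef HAKMEM_TINY_SS_TLS_HINT
#define HAKMEM_TINY_SS_TLS_HINT 0
#endif

/* The hint cache only makes sense on the headerless free() path. */
#if HAKMEM_TINY_SS_TLS_HINT && !HAKMEM_TINY_HEADERLESS
#error "HAKMEM_TINY_SS_TLS_HINT=1 requires HAKMEM_TINY_HEADERLESS=1"
#endif
```

Failing at compile time keeps a misconfigured build from silently shipping a hint cache that is never consulted.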

### 12.2 Phase 1b: Integration (Week 2)
- [ ] Integrate into `hakmem_tiny_free.inc` (lookup path)
- [ ] Integrate into `hakmem_tiny.c` (update path after alloc)
- [ ] Integrate into `hakmem_tiny_refill.inc.h` (update path after refill)
- [ ] Integrate into `core/front/tiny_unified_cache.c` (update path)
- [ ] Call `tls_ss_hint_init()` in thread-local init

### 12.3 Phase 1c: Testing (Week 2-3)
- [ ] Write unit tests (`tests/test_tls_ss_hint.c`)
- [ ] Run unit tests: `make test_tls_ss_hint && ./test_tls_ss_hint`
- [ ] Build validation (hint disabled, hint enabled, error check)
- [ ] Benchmark comparison (sh8bench, cfrac, larson)
- [ ] Hit rate profiling (debug build with stats)
- [ ] Correctness tests (no crashes, no assertion failures)

### 12.4 Phase 1d: Validation (Week 3)
- [ ] Benchmark: sh8bench (target: +15-20%)
- [ ] Benchmark: cfrac (target: +10-15%)
- [ ] Benchmark: larson 8 threads (target: +15-20%)
- [ ] Hit rate analysis (target: 85-95%)
- [ ] Memory overhead check (target: < 150 bytes/thread)
- [ ] Regression test: Headerless=0 mode still works

### 12.5 Phase 1e: Documentation (Week 3-4)
- [ ] Update `docs/PHASE2_HEADERLESS_INSTRUCTION.md` with hint Box
- [ ] Add Box Theory annotation to hakmem Box registry
- [ ] Write performance analysis report (before/after comparison)
- [ ] Update build instructions (`make shared EXTRA_CFLAGS=...`)

---

## 13. Rollout Plan

### Stage 1: Internal Testing (Week 1-3)
- Build with `HAKMEM_TINY_SS_TLS_HINT=1` in dev environment
- Run full benchmark suite (mimalloc-bench)
- Profile with perf/cachegrind (verify cycle count reduction)
- Fix any integration bugs

### Stage 2: Canary Deployment (Week 4)
- Enable hint Box in 5% of production traffic
- Monitor: crash rate, performance metrics, hit rate
- A/B test: Hint ON vs Hint OFF

### Stage 3: Gradual Rollout (Week 5-6)
- 25% traffic (if canary success)
- 50% traffic
- 100% traffic

### Stage 4: Default Enable (Week 7)
- Change default: `HAKMEM_TINY_SS_TLS_HINT=1`
- Update build scripts, CI/CD pipelines
- Announce in release notes

---

## 14. Success Metrics

| Metric | Baseline | Target | Measurement |
|--------|----------|--------|-------------|
| sh8bench throughput | 54.60 Mops/s | 64-68 Mops/s | mimalloc-bench run (+15-20%) |
| cfrac runtime | 1.25 sec | 1.10-1.15 sec | mimalloc-bench run (-10-15%) |
| larson throughput | 6.5M ops/s | 7.5-8.0M ops/s | mimalloc-bench run (+15-20%) |
| TLS hint hit rate | N/A | 85-95% | Stats API |
| free() cycle count | 15-60 cycles | 7-15 cycles (hit) | perf/cachegrind |
| Memory overhead | 0 | < 150 bytes/thread | sizeof(TlsSsHintCache) |
| Crash rate | 0.001% | 0.001% (no regression) | Production monitoring |

---

## 15. Open Questions

1. **Q**: Should we implement per-class hint caches instead of unified cache?
   **A**: Defer until benchmarks show inter-class thrashing. Current unified design is simpler and sufficient for most workloads.

2. **Q**: Should we use LRU instead of FIFO eviction?
   **A**: Defer until benchmarks show FIFO hit rate < 80%. FIFO is simpler and avoids move-to-front cost on hits.

3. **Q**: Should we make TLS_SS_HINT_SLOTS runtime-configurable?
   **A**: No, compile-time constant allows better optimization (loop unrolling, register allocation). Consider adaptive sizing in Phase 2 if needed.

4. **Q**: Should we validate SUPERSLAB_MAGIC in tls_ss_hint_lookup()?
   **A**: No, keep lookup minimal (2-5 cycles). Caller (free() path) must validate magic. This matches existing design where hak_super_lookup() also requires caller validation.

5. **Q**: Should we export hit rate stats in production builds?
   **A**: Phase 1: No (save 16 bytes/thread). Phase 2: Add global aggregated stats API for monitoring if needed.

---

## 16. Conclusion

The TLS Superslab Hint Box is a low-risk, high-reward optimization that reduces the performance gap between Headerless mode and Header mode from 30% to ~15%. The design is self-contained, testable, and follows hakmem's Box Theory architecture. Expected implementation time: 3-4 weeks (including testing and validation).

**Key Strengths**:
- Minimal integration surface (5 call sites)
- Self-contained Box (minimal dependencies)
- Fail-safe fallback (miss → hak_super_lookup)
- Low memory overhead (112 bytes/thread)
- Proven pattern (TLS caching used in jemalloc, tcmalloc)

**Next Steps**:
1. Review this design document
2. Approve Phase 1a implementation (core Box)
3. Begin implementation with unit tests
4. Benchmark and validate in dev environment
5. Plan Phase 2 integration (Global Class Map)

---

**End of Design Document**