# TLS Layout Plan: Unified ULTRA TLS (Phase v11b-2 Target)

## Goal

Fit the C4-C7 ULTRA hot path into **one cache line (64B)**.

## Design: TinyUltraTlsCtx

```c
#include <stdint.h>

// ============================================================================
// LINE 1: Hot fields (64B) - alloc/free hot path
// ============================================================================
typedef struct TinyUltraTlsCtx {
    // Counts (8B total, which also aligns the pointers that follow)
    uint16_t c4_count;      // 2B
    uint16_t c5_count;      // 2B
    uint16_t c6_count;      // 2B
    uint16_t c7_count;      // 2B

    // Freelist heads (32B)
    void* c4_head;          // 8B - next free slot for C4
    void* c5_head;          // 8B
    void* c6_head;          // 8B
    void* c7_head;          // 8B

    // Segment range (shared across C4-C7, 16B)
    uintptr_t seg_base;     // 8B
    uintptr_t seg_end;      // 8B

    // ========== LINE 1 END: 56B used, 8B spare ==========
    uint64_t _hot_pad;      // 8B - pad the hot line to 64B

    // ========================================================================
    // LINE 2: Cold fields (refill/retire, debug, stats)
    // ========================================================================

    // Freelist tails (for bulk push, 32B)
    void* c4_tail;          // 8B
    void* c5_tail;          // 8B
    void* c6_tail;          // 8B
    void* c7_tail;          // 8B

    // Segment metadata (16B)
    void* segment;          // 8B - owning segment pointer
    uint32_t page_idx;      // 4B - current page index
    uint32_t _cold_pad;     // 4B

    // Stats (optional, 16B)
    uint64_t alloc_count;   // 8B
    uint64_t free_count;    // 8B
} TinyUltraTlsCtx;
// Total: 128B (2 cache lines)
```

## Memory Layout

```
Offset  Field        Size  Cache Line
------  -----        ----  ----------
0x00    c4_count     2B    LINE 1 (HOT)
0x02    c5_count     2B    LINE 1
0x04    c6_count     2B    LINE 1
0x06    c7_count     2B    LINE 1
0x08    c4_head      8B    LINE 1
0x10    c5_head      8B    LINE 1
0x18    c6_head      8B    LINE 1
0x20    c7_head      8B    LINE 1
0x28    seg_base     8B    LINE 1
0x30    seg_end      8B    LINE 1
0x38    _hot_pad     8B    LINE 1
------  -----        ----  ----------
0x40    c4_tail      8B    LINE 2 (COLD)
0x48    c5_tail      8B    LINE 2
0x50    c6_tail      8B    LINE 2
0x58    c7_tail      8B    LINE 2
0x60    segment      8B    LINE 2
0x68    page_idx     4B    LINE 2
0x6C    _cold_pad    4B    LINE 2
0x70    alloc_count  8B    LINE 2
0x78    free_count   8B    LINE 2
```

## Hot Path Access Pattern

### alloc (TLS hit)

```c
// likely() is assumed to be the project's branch-hint macro.
static inline void* tiny_ultra_alloc_fast(TinyUltraTlsCtx* ctx, uint8_t class_idx) {
    // Single cache line access. Indexing from c4_count / c4_head relies on
    // the four counts and the four heads being laid out contiguously.
    uint16_t* counts = &ctx->c4_count;
    void** heads = &ctx->c4_head;
    uint8_t i = class_idx - 4;      // C4..C7 -> 0..3

    uint16_t c = counts[i];
    if (likely(c > 0)) {
        // Pop from the linked list: the next pointer lives in the first
        // word of the free slot (same convention as the free path).
        void* p = heads[i];
        heads[i] = *(void**)p;
        counts[i] = c - 1;
        return p;
    }
    return tiny_ultra_alloc_slow(ctx, class_idx);
}
```

### free (TLS push)

```c
static inline void tiny_ultra_free_fast(TinyUltraTlsCtx* ctx, void* ptr, uint8_t class_idx) {
    // Range check (seg_base/seg_end are in the same cache line)
    uintptr_t p = (uintptr_t)ptr;
    if (likely(p >= ctx->seg_base && p < ctx->seg_end)) {
        // Push to freelist (single cache line)
        void** heads = &ctx->c4_head;
        uint16_t* counts = &ctx->c4_count;
        uint8_t i = class_idx - 4;  // C4..C7 -> 0..3

        *(void**)ptr = heads[i];    // link the slot to the old head
        heads[i] = ptr;
        counts[i]++;
        return;
    }
    tiny_ultra_free_slow(ctx, ptr, class_idx);
}
```

## Comparison: Before vs After

| Metric | Current (v11b-1) | Unified (v11b-2) |
|--------|------------------|------------------|
| TLS size (C4-C7) | 3712B | 128B |
| Cache lines (hot) | ~60 | **1** |
| seg_base/end copies | 4 | 1 |
| count access | scattered | contiguous |

## Freelist Design: Linked List vs Array

**Choice: Linked List (head/tail)**

Reasons:
1. **No fixed array needed**: removes the 1KB `freelist[128]`
2. **O(1) push/pop**: the head alone is sufficient
3. **Bulk drain**: with a tail, the whole list can be returned in one operation
4. **Memory efficiency**: links exist only for slots currently in use by the list

Trade-offs:
- Harder to prefetch (an array allows sequential access)
- Spatial locality may degrade → an array version can be considered after profiling

## Implementation Notes

1. **Backward Compatibility**: keep the existing TinyC*UltraFreeTLS API while using TinyUltraTlsCtx internally
2. **Gradual Migration**: migrate C7 to the new structure first and measure the effect
3. **ENV Gate**: enable with `HAKMEM_ULTRA_UNIFIED_TLS=1`

---

**Date**: 2025-12-12

**Phase**: v11b-2 planning
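
### Appendix: Compile-Time Layout Check (sketch)

The offsets in the Memory Layout table only hold if the compiler inserts no unexpected padding and the context starts on a cache-line boundary. The following is a minimal sketch of how that could be enforced at compile time. It assumes a 64-bit target and a 64-byte cache line; the `_Alignas` on the first member, the `CACHE_LINE` macro, and the `tiny_ultra_line_of` helper are illustrative additions that are not part of the plan itself.

```c
#include <assert.h>   // static_assert (C11)
#include <stddef.h>   // offsetof, size_t
#include <stdint.h>

#define CACHE_LINE 64  // assumption: 64-byte cache lines

typedef struct TinyUltraTlsCtx {
    // Assumption: aligning the first member forces the whole struct onto a
    // cache-line boundary, so the hot fields cannot straddle two lines.
    _Alignas(CACHE_LINE) uint16_t c4_count;
    uint16_t c5_count;
    uint16_t c6_count;
    uint16_t c7_count;
    void* c4_head;
    void* c5_head;
    void* c6_head;
    void* c7_head;
    uintptr_t seg_base;
    uintptr_t seg_end;
    uint64_t _hot_pad;
    // Cold fields start here (second cache line).
    void* c4_tail;
    void* c5_tail;
    void* c6_tail;
    void* c7_tail;
    void* segment;
    uint32_t page_idx;
    uint32_t _cold_pad;
    uint64_t alloc_count;
    uint64_t free_count;
} TinyUltraTlsCtx;

static_assert(sizeof(void*) == 8, "layout assumes 64-bit pointers");
static_assert(offsetof(TinyUltraTlsCtx, c4_head) == 0x08,
              "heads must directly follow the four 2B counts");
static_assert(offsetof(TinyUltraTlsCtx, seg_end) + sizeof(uintptr_t) <= CACHE_LINE,
              "all hot fields must fit in the first cache line");
static_assert(offsetof(TinyUltraTlsCtx, c4_tail) == CACHE_LINE,
              "cold fields must start on the second cache line");
static_assert(sizeof(TinyUltraTlsCtx) == 2 * CACHE_LINE,
              "context must be exactly two cache lines");

// Hypothetical helper: 0-based cache line that a byte offset falls on.
static inline size_t tiny_ultra_line_of(size_t offset) {
    return offset / CACHE_LINE;
}
```

If a later change adds or reorders a field, the build fails at the first violated assertion instead of silently spilling the hot path onto a second cache line.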