Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS

Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
TLS Layout Plan: Unified ULTRA TLS (Phase v11b-2 Target)
Goal
Fit the C4-C7 ULTRA hot path into a single cache line (64B).
Design: TinyUltraTlsCtx
// ============================================================================
// LINE 1: Hot fields (64B) - alloc/free hot path
// ============================================================================
typedef struct TinyUltraTlsCtx {
    // Counts (8B total, padded for alignment)
    uint16_t c4_count;      // 2B
    uint16_t c5_count;      // 2B
    uint16_t c6_count;      // 2B
    uint16_t c7_count;      // 2B

    // Freelist heads (32B)
    void* c4_head;          // 8B - next free slot for C4
    void* c5_head;          // 8B
    void* c6_head;          // 8B
    void* c7_head;          // 8B

    // Segment range (shared across C4-C7, 16B)
    uintptr_t seg_base;     // 8B
    uintptr_t seg_end;      // 8B

    // ========== LINE 1 END: 56B used, 8B spare ==========
    uint64_t _hot_pad;      // 8B - align to 64B

    // ========================================================================
    // LINE 2+: Cold fields (refill/retire, debug, stats)
    // ========================================================================
    // Freelist tails (for bulk push, 32B)
    void* c4_tail;          // 8B
    void* c5_tail;          // 8B
    void* c6_tail;          // 8B
    void* c7_tail;          // 8B

    // Segment metadata (16B)
    void* segment;          // 8B - owning segment pointer
    uint32_t page_idx;      // 4B - current page index
    uint32_t _cold_pad;     // 4B

    // Stats (optional, 16B)
    uint64_t alloc_count;   // 8B
    uint64_t free_count;    // 8B
} TinyUltraTlsCtx;
// Total: 128B (2 cache lines)
Memory Layout
Offset  Field        Size  Cache Line
------  -----        ----  ----------
0x00    c4_count     2B    LINE 1 (HOT)
0x02    c5_count     2B    LINE 1
0x04    c6_count     2B    LINE 1
0x06    c7_count     2B    LINE 1
0x08    c4_head      8B    LINE 1
0x10    c5_head      8B    LINE 1
0x18    c6_head      8B    LINE 1
0x20    c7_head      8B    LINE 1
0x28    seg_base     8B    LINE 1
0x30    seg_end      8B    LINE 1
0x38    _hot_pad     8B    LINE 1
------  -----        ----  ----------
0x40    c4_tail      8B    LINE 2 (COLD)
0x48    c5_tail      8B    LINE 2
0x50    c6_tail      8B    LINE 2
0x58    c7_tail      8B    LINE 2
0x60    segment      8B    LINE 2
0x68    page_idx     4B    LINE 2
0x6C    _cold_pad    4B    LINE 2
0x70    alloc_count  8B    LINE 2
0x78    free_count   8B    LINE 2
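The offsets in this table can be checked at compile time rather than by hand. The following is a minimal sketch, assuming a C11 compiler and a 64-bit target with 8-byte pointers. The struct is copied from the design above; the `CACHE_LINE` constant is a name introduced here for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache-line size; hypothetical constant */

typedef struct TinyUltraTlsCtx {
    uint16_t  c4_count, c5_count, c6_count, c7_count;  /* 0x00-0x07 */
    void     *c4_head, *c5_head, *c6_head, *c7_head;   /* 0x08-0x27 */
    uintptr_t seg_base, seg_end;                       /* 0x28-0x37 */
    uint64_t  _hot_pad;                                /* 0x38-0x3F */
    void     *c4_tail, *c5_tail, *c6_tail, *c7_tail;   /* 0x40-0x5F */
    void     *segment;                                 /* 0x60 */
    uint32_t  page_idx, _cold_pad;                     /* 0x68, 0x6C */
    uint64_t  alloc_count, free_count;                 /* 0x70, 0x78 */
} TinyUltraTlsCtx;

/* This sketch only holds on targets with 8-byte pointers. */
static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");

/* Every hot field must end at or before the first cache-line boundary. */
static_assert(offsetof(TinyUltraTlsCtx, c4_head) == 0x08,
              "heads start right after the four counts");
static_assert(offsetof(TinyUltraTlsCtx, seg_end) + sizeof(uintptr_t) <= CACHE_LINE,
              "hot fields must fit in line 1");

/* Cold fields start exactly on the second line; the struct spans two lines. */
static_assert(offsetof(TinyUltraTlsCtx, c4_tail) == CACHE_LINE,
              "cold fields start on line 2");
static_assert(sizeof(TinyUltraTlsCtx) == 2 * CACHE_LINE,
              "total size is 128B");
```

Note that these checks cover offsets within the struct only. The hot fields share one physical cache line only if the instance itself is 64-byte aligned, for example by declaring the TLS variable with `_Alignas(64)`.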
Hot Path Access Pattern
alloc (TLS hit)
static inline void* tiny_ultra_alloc_fast(TinyUltraTlsCtx* ctx, uint8_t class_idx) {
    // Single cache line access. counts/heads treat the four adjacent
    // cN_count / cN_head fields as arrays indexed by class_idx - 4.
    uint16_t* counts = &ctx->c4_count;
    void** heads = &ctx->c4_head;
    unsigned idx = class_idx - 4u;
    uint16_t c = counts[idx];
    if (likely(c > 0)) {
        void* p = heads[idx];
        heads[idx] = *(void**)p;   // pop: advance head to the next free block
        counts[idx] = c - 1;
        return p;
    }
    return tiny_ultra_alloc_slow(ctx, class_idx);
}
free (TLS push)
static inline void tiny_ultra_free_fast(TinyUltraTlsCtx* ctx, void* ptr, uint8_t class_idx) {
    // Range check (seg_base/end in same cache line)
    uintptr_t p = (uintptr_t)ptr;
    if (likely(p >= ctx->seg_base && p < ctx->seg_end)) {
        // Push to freelist (single cache line)
        void** heads = &ctx->c4_head;
        uint16_t* counts = &ctx->c4_count;
        *(void**)ptr = heads[class_idx - 4];
        heads[class_idx - 4] = ptr;
        counts[class_idx - 4]++;
        return;
    }
    tiny_ultra_free_slow(ctx, ptr, class_idx);
}
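To make the push/pop contract concrete, here is a small round-trip sketch under stated assumptions. It uses a reduced, hypothetical context (`MiniCtx`) that holds only the hot fields as real arrays, and stores the next pointer at offset 0 of the freed block via `memcpy`. All names in it (`MiniCtx`, `mini_push`, `mini_pop`, `mini_roundtrip_ok`, `next_store`, `next_load`) are illustrative and not part of the design above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reduced context: hot fields only, indexed by class_idx - 4. */
typedef struct MiniCtx {
    uint16_t  count[4];
    void     *head[4];
    uintptr_t seg_base, seg_end;
} MiniCtx;

/* The next pointer lives in the first bytes of a free block. */
static void next_store(void *blk, void *next) { memcpy(blk, &next, sizeof next); }
static void *next_load(const void *blk) { void *n; memcpy(&n, blk, sizeof n); return n; }

/* Push: returns 1 on success, 0 if ptr lies outside the segment
 * (the real design would take the slow path instead). */
static int mini_push(MiniCtx *ctx, void *ptr, unsigned class_idx) {
    uintptr_t p = (uintptr_t)ptr;
    if (p < ctx->seg_base || p >= ctx->seg_end) return 0;
    unsigned i = class_idx - 4;
    next_store(ptr, ctx->head[i]);   /* link block to the old head */
    ctx->head[i] = ptr;
    ctx->count[i]++;
    return 1;
}

/* Pop: returns the most recently pushed block, or NULL if the list is empty. */
static void *mini_pop(MiniCtx *ctx, unsigned class_idx) {
    unsigned i = class_idx - 4;
    if (ctx->count[i] == 0) return NULL;
    void *p = ctx->head[i];
    ctx->head[i] = next_load(p);     /* advance head to the next free block */
    ctx->count[i]--;
    return p;
}

/* Returns 1 if blocks pushed as A, B come back as B, A (LIFO order)
 * and an out-of-segment pointer is rejected. */
int mini_roundtrip_ok(void) {
    static unsigned char seg[1024];  /* stand-in for a real segment */
    MiniCtx ctx = {0};
    ctx.seg_base = (uintptr_t)seg;
    ctx.seg_end  = ctx.seg_base + sizeof seg;

    void *a = seg, *b = seg + 512;
    void *outside = NULL;            /* its own address is not inside seg */
    if (!mini_push(&ctx, a, 6) || !mini_push(&ctx, b, 6)) return 0;
    if (mini_push(&ctx, &outside, 6)) return 0;

    return mini_pop(&ctx, 6) == b
        && mini_pop(&ctx, 6) == a
        && mini_pop(&ctx, 6) == NULL;
}
```

Using arrays in the sketch avoids indexing across separate struct members, which the fast paths above rely on; that trick depends on the four fields being laid out contiguously, as the memory layout table shows.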
Comparison: Before vs After
| Metric | Current (v11b-1) | Unified (v11b-2) |
|---|---|---|
| TLS size (C4-C7) | 3712B | 128B |
| Cache lines (hot) | ~60 | 1 |
| seg_base/end copies | 4 | 1 |
| count access | scattered | contiguous |
Freelist Design: Linked List vs Array
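Choice: Linked List (head/tail)
Rationale:
- No fixed array needed: removes the 1KB freelist[128] array
- O(1) push/pop: the head pointer alone is enough
- Bulk drain: with a tail pointer, the whole list can be returned in one operation
- Memory efficiency: links are stored only in the slots that are on the list
Trade-offs:
- Harder to prefetch (an array allows sequential access)
- Spatial locality may degrade
→ An array-based version can be reconsidered after profiling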
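The bulk-drain point can be sketched as follows: with both head and tail known, an entire local list is spliced onto another list by writing one next pointer, regardless of length. This is a minimal single-threaded sketch; `LocalList`, `local_push`, `drain_to_shared`, and `bulk_drain_demo` are hypothetical names, and a real shared list would need atomics or a lock.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-class list with head and tail, as in the cold fields. */
typedef struct LocalList {
    void    *head;
    void    *tail;
    unsigned count;
} LocalList;

/* The next pointer lives in the first bytes of a free block. */
static void next_store(void *blk, void *next) { memcpy(blk, &next, sizeof next); }
static void *next_load(const void *blk) { void *n; memcpy(&n, blk, sizeof n); return n; }

static void local_push(LocalList *l, void *blk) {
    next_store(blk, l->head);
    if (l->head == NULL) l->tail = blk;  /* first block pushed is also the tail */
    l->head = blk;
    l->count++;
}

/* Splice the whole local list in front of *shared_head in O(1).
 * Returns the number of blocks moved. Not thread-safe. */
static unsigned drain_to_shared(LocalList *l, void **shared_head) {
    if (l->head == NULL) return 0;
    next_store(l->tail, *shared_head);   /* tail now points at the old shared head */
    *shared_head = l->head;
    unsigned moved = l->count;
    l->head = l->tail = NULL;
    l->count = 0;
    return moved;
}

/* Demo: shared list holds 1 block, local list holds 3; after the drain
 * the shared list is walked and its length returned. */
unsigned bulk_drain_demo(void) {
    void *slot[4] = {0};                 /* four blocks, each one pointer wide */
    void *shared = &slot[0];             /* shared list starts with slot[0] */
    next_store(&slot[0], NULL);

    LocalList l = {0};
    for (int i = 1; i < 4; i++) local_push(&l, &slot[i]);
    drain_to_shared(&l, &shared);

    unsigned n = 0;
    for (void *p = shared; p != NULL; p = next_load(p)) n++;
    return n;
}
```

Without the tail pointer, the drain would have to walk the list to find its last block, turning the return into an O(n) operation.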
Implementation Notes
- Backward Compatibility: keep the existing TinyC*UltraFreeTLS API while using TinyUltraTlsCtx internally
- Gradual Migration: migrate C7 to the new structure first and measure the effect
- ENV Gate: enable with HAKMEM_ULTRA_UNIFIED_TLS=1
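One way such an ENV gate could be read is sketched below. Only the variable name `HAKMEM_ULTRA_UNIFIED_TLS` comes from this plan; the helper names `ultra_gate_parse` and `ultra_unified_tls_enabled`, the default-OFF behavior, and the rule that only the exact string "1" enables the gate are assumptions made for illustration.

```c
#include <stdlib.h>   /* getenv */

/* Hypothetical parser: only the exact string "1" turns the gate on.
 * Unset, empty, or any other value leaves it off (assumed default OFF). */
int ultra_gate_parse(const char *value) {
    return (value != NULL && value[0] == '1' && value[1] == '\0') ? 1 : 0;
}

/* Hypothetical gate: reads the environment once and caches the result
 * so the hot path pays for a single integer load after the first call.
 * -1 means "not read yet". Concurrent first calls may each call getenv,
 * but they all compute the same value. */
int ultra_unified_tls_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        cached = ultra_gate_parse(getenv("HAKMEM_ULTRA_UNIFIED_TLS"));
    }
    return cached;
}
```

Splitting the parsing from the `getenv` call keeps the decision rule testable without touching the process environment.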
Date: 2025-12-12 Phase: v11b-2 planning