Implement C6 ULTRA intrusive LIFO freelist with ENV gating:

- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS

Files:

- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):

- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
159 lines
4.8 KiB
Markdown
# TLS Layout Plan: Unified ULTRA TLS (Phase v11b-2 Target)

## Goal

Fit the C4-C7 ULTRA hot path into **1 cache line (64B)**.

## Design: TinyUltraTlsCtx

```c
// ============================================================================
// LINE 1: Hot fields (64B) - alloc/free hot path
// ============================================================================
typedef struct TinyUltraTlsCtx {
    // Counts (8B total, padded for alignment)
    uint16_t    c4_count;       // 2B
    uint16_t    c5_count;       // 2B
    uint16_t    c6_count;       // 2B
    uint16_t    c7_count;       // 2B

    // Freelist heads (32B)
    void*       c4_head;        // 8B - next free slot for C4
    void*       c5_head;        // 8B
    void*       c6_head;        // 8B
    void*       c7_head;        // 8B

    // Segment range (shared across C4-C7, 16B)
    uintptr_t   seg_base;       // 8B
    uintptr_t   seg_end;        // 8B

    // ========== LINE 1 END: 56B used, 8B spare ==========

    uint64_t    _hot_pad;       // 8B - align to 64B

    // ============================================================================
    // LINE 2+: Cold fields (refill/retire, debug, stats)
    // ============================================================================

    // Freelist tails (for bulk push, 32B)
    void*       c4_tail;        // 8B
    void*       c5_tail;        // 8B
    void*       c6_tail;        // 8B
    void*       c7_tail;        // 8B

    // Segment metadata (16B)
    void*       segment;        // 8B - owning segment pointer
    uint32_t    page_idx;       // 4B - current page index
    uint32_t    _cold_pad;      // 4B

    // Stats (optional, 16B)
    uint64_t    alloc_count;    // 8B
    uint64_t    free_count;     // 8B

} TinyUltraTlsCtx;
// Total: 128B (2 cache lines)
```

## Memory Layout

```
Offset  Field           Size    Cache Line
------  -----           ----    ----------
0x00    c4_count        2B      LINE 1 (HOT)
0x02    c5_count        2B      LINE 1
0x04    c6_count        2B      LINE 1
0x06    c7_count        2B      LINE 1
0x08    c4_head         8B      LINE 1
0x10    c5_head         8B      LINE 1
0x18    c6_head         8B      LINE 1
0x20    c7_head         8B      LINE 1
0x28    seg_base        8B      LINE 1
0x30    seg_end         8B      LINE 1
0x38    _hot_pad        8B      LINE 1
------  -----           ----    ----------
0x40    c4_tail         8B      LINE 2 (COLD)
0x48    c5_tail         8B      LINE 2
0x50    c6_tail         8B      LINE 2
0x58    c7_tail         8B      LINE 2
0x60    segment         8B      LINE 2
0x68    page_idx        4B      LINE 2
0x6C    _cold_pad       4B      LINE 2
0x70    alloc_count     8B      LINE 2
0x78    free_count      8B      LINE 2
```
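
The offsets in this table can be checked at compile time. The following is a minimal sketch, not part of the plan itself: it assumes a C11 compiler and an LP64 target (8-byte pointers), and it adds `alignas(64)` to the first member as an extra assumption, since the hot fields only share one cache line if each instance starts on a 64B boundary.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignas, alignof */
#include <stddef.h>    /* offsetof */
#include <stdint.h>

/* Same fields as the plan. alignas(64) is an added assumption so that
 * every instance starts on a cache-line boundary. */
typedef struct TinyUltraTlsCtx {
    alignas(64) uint16_t c4_count;
    uint16_t  c5_count;
    uint16_t  c6_count;
    uint16_t  c7_count;
    void*     c4_head;
    void*     c5_head;
    void*     c6_head;
    void*     c7_head;
    uintptr_t seg_base;
    uintptr_t seg_end;
    uint64_t  _hot_pad;
    void*     c4_tail;
    void*     c5_tail;
    void*     c6_tail;
    void*     c7_tail;
    void*     segment;
    uint32_t  page_idx;
    uint32_t  _cold_pad;
    uint64_t  alloc_count;
    uint64_t  free_count;
} TinyUltraTlsCtx;

/* Each check mirrors one row or boundary of the layout table. */
static_assert(sizeof(void*) == 8, "layout assumes 64-bit pointers");
static_assert(offsetof(TinyUltraTlsCtx, c4_head)    == 0x08, "heads follow counts");
static_assert(offsetof(TinyUltraTlsCtx, seg_base)   == 0x28, "segment range in LINE 1");
static_assert(offsetof(TinyUltraTlsCtx, _hot_pad)   == 0x38, "pad closes LINE 1");
static_assert(offsetof(TinyUltraTlsCtx, c4_tail)    == 0x40, "LINE 2 starts at 64");
static_assert(offsetof(TinyUltraTlsCtx, free_count) == 0x78, "last cold field");
static_assert(sizeof(TinyUltraTlsCtx) == 128, "two cache lines total");
static_assert(alignof(TinyUltraTlsCtx) == 64, "instances are line-aligned");
```

If a field is later added or reordered, the build fails at the first assertion that no longer matches the table.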

## Hot Path Access Pattern

### alloc (TLS hit)

```c
static inline void* tiny_ultra_alloc_fast(TinyUltraTlsCtx* ctx, uint8_t class_idx) {
    // Single cache line access: counts and heads both live in LINE 1
    uint16_t* counts = &ctx->c4_count;
    void** heads = &ctx->c4_head;

    uint16_t c = counts[class_idx - 4];
    if (likely(c > 0)) {
        void* p = heads[class_idx - 4];
        // Pop from linked list: advance head to the next free slot
        // (this load touches the block itself, not only the TLS line)
        heads[class_idx - 4] = *(void**)p;
        counts[class_idx - 4] = c - 1;
        return p;
    }
    return tiny_ultra_alloc_slow(ctx, class_idx);
}
```
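
The fast path above indexes `&ctx->c4_count` and `&ctx->c4_head` as if they were arrays, which relies on the four fields being adjacent with no padding. One possible variant, sketched here under the assumption that renaming the fields is acceptable, declares them as real arrays so the indexing is well-defined while the 64B hot line keeps the same size. The names `UltraHotLine`, `ultra_hot_push`, and `ultra_hot_pop` are hypothetical and do not appear in the plan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ULTRA_CLASS_MIN = 4, ULTRA_CLASS_COUNT = 4 };  /* classes C4..C7 */

/* Hypothetical array-based form of LINE 1 (same 64B footprint on LP64). */
typedef struct UltraHotLine {
    uint16_t  count[ULTRA_CLASS_COUNT];  /* 8B  */
    void*     head[ULTRA_CLASS_COUNT];   /* 32B */
    uintptr_t seg_base;                  /* 8B  */
    uintptr_t seg_end;                   /* 8B  */
    uint64_t  _hot_pad;                  /* 8B  */
} UltraHotLine;

static_assert(sizeof(UltraHotLine) == 64, "hot line must stay at 64B");

/* Push a free block: store the old head inside the block, then relink. */
static inline void ultra_hot_push(UltraHotLine* hot, void* p, uint8_t class_idx) {
    unsigned i = (unsigned)class_idx - ULTRA_CLASS_MIN;
    assert(i < ULTRA_CLASS_COUNT);
    *(void**)p = hot->head[i];
    hot->head[i] = p;
    hot->count[i]++;
}

/* Pop a block, or return NULL where the real code would take the slow path. */
static inline void* ultra_hot_pop(UltraHotLine* hot, uint8_t class_idx) {
    unsigned i = (unsigned)class_idx - ULTRA_CLASS_MIN;
    assert(i < ULTRA_CLASS_COUNT);
    if (hot->count[i] == 0) {
        return NULL;
    }
    void* p = hot->head[i];
    hot->head[i] = *(void**)p;  /* next pointer lives in the first word */
    hot->count[i]--;
    return p;
}
```

Pushing two blocks for class 6 and popping twice returns them in reverse order, which is the LIFO behaviour the plan relies on.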

### free (TLS push)

```c
static inline void tiny_ultra_free_fast(TinyUltraTlsCtx* ctx, void* ptr, uint8_t class_idx) {
    // Range check (seg_base/end in same cache line)
    uintptr_t p = (uintptr_t)ptr;
    if (likely(p >= ctx->seg_base && p < ctx->seg_end)) {
        // Push to freelist (single cache line)
        void** heads = &ctx->c4_head;
        uint16_t* counts = &ctx->c4_count;

        *(void**)ptr = heads[class_idx - 4];
        heads[class_idx - 4] = ptr;
        counts[class_idx - 4]++;
        return;
    }
    tiny_ultra_free_slow(ctx, ptr, class_idx);
}
```
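
The two-sided range check in the free path can also be written as a single unsigned comparison. This is a generic technique rather than something the plan specifies, and whether it helps over what the compiler already emits would need to be measured. The helper name `ultra_in_segment` is hypothetical, and the sketch assumes `base <= end`.

```c
#include <stdint.h>

/* Returns nonzero when base <= p < end, assuming base <= end.
 * If p < base, (p - base) wraps around to a very large unsigned value,
 * so one comparison covers both bounds. */
static inline int ultra_in_segment(uintptr_t p, uintptr_t base, uintptr_t end) {
    return (p - base) < (end - base);
}
```

An empty range (`base == end`) rejects every pointer, which matches the behaviour of the original two-sided check before a segment has been learned.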

## Comparison: Before vs After

| Metric | Current (v11b-1) | Unified (v11b-2) |
|--------|------------------|------------------|
| TLS size (C4-C7) | 3712B | 128B |
| Cache lines (hot) | ~60 | **1** |
| seg_base/end copies | 4 | 1 |
| count access | scattered | contiguous |

## Freelist Design: Linked List vs Array

**Choice: Linked List (head/tail)**

Reasons:
1. **No fixed array needed**: removes the 1KB `freelist[128]`
2. **O(1) push/pop**: the head alone is sufficient
3. **Bulk drain**: with a tail, the whole list can be returned in one operation
4. **Memory efficiency**: links are kept only in the slots in use

Trade-offs:
- Harder to prefetch (an array allows sequential access)
- Spatial locality may degrade

→ An array-based version can also be considered after profiling
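
The bulk-drain point can be illustrated with a small sketch: when a list tracks its tail, the whole chain can be spliced onto another list in constant time, without walking the nodes. The `UltraList` type and both function names are hypothetical and used only for illustration. The next pointer is assumed to live in the first word of each free block, as in the fast-path code of this plan.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-class list with head, tail, and count. */
typedef struct UltraList {
    void*    head;
    void*    tail;
    uint32_t count;
} UltraList;

/* LIFO push that also records the tail when the list was empty. */
static void ultra_list_push(UltraList* l, void* p) {
    *(void**)p = l->head;
    if (l->head == NULL) {
        l->tail = p;            /* the first node is also the last node */
    }
    l->head = p;
    l->count++;
}

/* Move every node of src to the front of dst in O(1). */
static void ultra_list_drain_into(UltraList* src, UltraList* dst) {
    if (src->head == NULL) {
        return;                 /* nothing to drain */
    }
    *(void**)src->tail = dst->head;  /* link src's last node to dst's first */
    if (dst->head == NULL) {
        dst->tail = src->tail;
    }
    dst->head = src->head;
    dst->count += src->count;
    src->head = NULL;
    src->tail = NULL;
    src->count = 0;
}
```

Without the tail, the same drain would have to walk `count` nodes to find the last one, which is why the plan keeps the tails in the cold line.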

## Implementation Notes

1. **Backward Compatibility**: Keep the existing `TinyC*UltraFreeTLS` API while using TinyUltraTlsCtx internally
2. **Gradual Migration**: Migrate C7 to the new structure first and measure the effect
3. **ENV Gate**: Enabled with `HAKMEM_ULTRA_UNIFIED_TLS=1`
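
A gate of this kind might be read once and cached, as in the sketch below. Only the variable name `HAKMEM_ULTRA_UNIFIED_TLS` comes from this plan. The function names, the default of OFF when the variable is unset, and the rule that only the exact string `1` enables the feature are assumptions made for illustration.

```c
#include <stdlib.h>

/* Assumption: only the exact value "1" enables the feature;
 * unset, empty, or any other value leaves it off. */
static int ultra_parse_gate(const char* value) {
    return value != NULL && value[0] == '1' && value[1] == '\0';
}

/* Reads the environment once and caches the result.
 * The unsynchronized cache is a simplification: concurrent first calls
 * would each compute the same value, but a production version might
 * prefer an atomic or a one-time initializer. */
static int ultra_unified_tls_enabled(void) {
    static int cached = -1;  /* -1 = not yet read */
    if (cached < 0) {
        cached = ultra_parse_gate(getenv("HAKMEM_ULTRA_UNIFIED_TLS"));
    }
    return cached;
}
```

Callers on the hot path would then branch on `ultra_unified_tls_enabled()` to choose between the existing per-class TLS and the unified context, which keeps the A/B comparison a matter of setting one variable.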

---

**Date**: 2025-12-12
**Phase**: v11b-2 planning