# TLS Layout Plan: Unified ULTRA TLS (Phase v11b-2 Target)

## Goal

Fit the C4-C7 ULTRA hot path into **one cache line (64B)**.

## Design: TinyUltraTlsCtx

```c
#include <stdint.h>

// Sizes below assume a 64-bit target (8B pointers).
// ============================================================================
// LINE 1: Hot fields (64B) - alloc/free hot path
// ============================================================================
typedef struct TinyUltraTlsCtx {
    // Counts (8B total: 4 x uint16_t, so the heads below stay 8B-aligned)
    uint16_t c4_count;      // 2B
    uint16_t c5_count;      // 2B
    uint16_t c6_count;      // 2B
    uint16_t c7_count;      // 2B

    // Freelist heads (32B)
    void* c4_head;          // 8B - next free slot for C4
    void* c5_head;          // 8B
    void* c6_head;          // 8B
    void* c7_head;          // 8B

    // Segment range (shared across C4-C7, 16B)
    uintptr_t seg_base;     // 8B
    uintptr_t seg_end;      // 8B

    // 56B used so far; the remaining 8B is padding so LINE 1 is exactly 64B
    uint64_t _hot_pad;      // 8B - align to 64B
    // ========== LINE 1 END ==========

    // ========================================================================
    // LINE 2: Cold fields (refill/retire, debug, stats)
    // ========================================================================

    // Freelist tails (for bulk push, 32B)
    void* c4_tail;          // 8B
    void* c5_tail;          // 8B
    void* c6_tail;          // 8B
    void* c7_tail;          // 8B

    // Segment metadata (16B)
    void* segment;          // 8B - owning segment pointer
    uint32_t page_idx;      // 4B - current page index
    uint32_t _cold_pad;     // 4B

    // Stats (optional, 16B)
    uint64_t alloc_count;   // 8B
    uint64_t free_count;    // 8B
} TinyUltraTlsCtx;
// Total: 128B (2 cache lines)
```

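The hot fields share one cache line only if the context itself starts on a 64-byte boundary, which the struct definition alone does not guarantee. The following is a minimal sketch of one way to declare the per-thread instance, assuming C11, a 64-bit target, and 64-byte cache lines. The names `CACHE_LINE`, `g_tiny_ultra_tls`, and `tiny_ultra_tls_get` are hypothetical and not part of the plan.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumption: 64-byte cache lines */

typedef struct TinyUltraTlsCtx {
    uint16_t  c4_count, c5_count, c6_count, c7_count;   /* 8B  */
    void     *c4_head, *c5_head, *c6_head, *c7_head;    /* 32B */
    uintptr_t seg_base, seg_end;                        /* 16B */
    uint64_t  _hot_pad;                                 /* 8B  */
    void     *c4_tail, *c5_tail, *c6_tail, *c7_tail;    /* 32B */
    void     *segment;                                  /* 8B  */
    uint32_t  page_idx, _cold_pad;                      /* 8B  */
    uint64_t  alloc_count, free_count;                  /* 16B */
} TinyUltraTlsCtx;

/* The context should cover whole cache lines so neighbours never share one. */
static_assert(sizeof(TinyUltraTlsCtx) % CACHE_LINE == 0,
              "TinyUltraTlsCtx must be a multiple of the cache line size");

/* Hypothetical per-thread instance, aligned so LINE 1 starts a cache line. */
static _Thread_local alignas(CACHE_LINE) TinyUltraTlsCtx g_tiny_ultra_tls;

/* Hypothetical accessor returning the calling thread's context. */
static inline TinyUltraTlsCtx *tiny_ultra_tls_get(void) {
    return &g_tiny_ultra_tls;
}
```

Putting the alignment on the variable rather than on the type keeps `sizeof` and the field offsets unchanged. Whether the real code should align the type instead is a separate design choice.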
## Memory Layout

```
Offset  Field        Size  Cache Line
------  -----------  ----  -------------
0x00    c4_count     2B    LINE 1 (HOT)
0x02    c5_count     2B    LINE 1
0x04    c6_count     2B    LINE 1
0x06    c7_count     2B    LINE 1
0x08    c4_head      8B    LINE 1
0x10    c5_head      8B    LINE 1
0x18    c6_head      8B    LINE 1
0x20    c7_head      8B    LINE 1
0x28    seg_base     8B    LINE 1
0x30    seg_end      8B    LINE 1
0x38    _hot_pad     8B    LINE 1
------  -----------  ----  -------------
0x40    c4_tail      8B    LINE 2 (COLD)
0x48    c5_tail      8B    LINE 2
0x50    c6_tail      8B    LINE 2
0x58    c7_tail      8B    LINE 2
0x60    segment      8B    LINE 2
0x68    page_idx     4B    LINE 2
0x6C    _cold_pad    4B    LINE 2
0x70    alloc_count  8B    LINE 2
0x78    free_count   8B    LINE 2
```

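The offsets in the table can be pinned at compile time so that a later field change cannot silently push a hot field out of the first line. This is a sketch under the same assumptions as the table (64-bit target, 64-byte cache lines). The `LINE_OF` macro is a hypothetical helper, and the struct is repeated only to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct TinyUltraTlsCtx {
    uint16_t  c4_count, c5_count, c6_count, c7_count;
    void     *c4_head, *c5_head, *c6_head, *c7_head;
    uintptr_t seg_base, seg_end;
    uint64_t  _hot_pad;
    void     *c4_tail, *c5_tail, *c6_tail, *c7_tail;
    void     *segment;
    uint32_t  page_idx, _cold_pad;
    uint64_t  alloc_count, free_count;
} TinyUltraTlsCtx;

/* Zero-based cache line index of a field (hypothetical helper). */
#define LINE_OF(field) (offsetof(TinyUltraTlsCtx, field) / 64)

/* Spot checks against the layout table. */
static_assert(offsetof(TinyUltraTlsCtx, c4_head)    == 0x08, "c4_head offset");
static_assert(offsetof(TinyUltraTlsCtx, seg_base)   == 0x28, "seg_base offset");
static_assert(offsetof(TinyUltraTlsCtx, _hot_pad)   == 0x38, "_hot_pad offset");
static_assert(offsetof(TinyUltraTlsCtx, c4_tail)    == 0x40, "cold fields start at 0x40");
static_assert(offsetof(TinyUltraTlsCtx, free_count) == 0x78, "free_count offset");

/* Hot fields end in the first line; everything else fits in the second. */
static_assert(LINE_OF(seg_end) == 0, "seg_end must stay in LINE 1");
static_assert(LINE_OF(free_count) == 1, "cold fields must fit in LINE 2");
static_assert(sizeof(TinyUltraTlsCtx) == 128, "context is two cache lines");
```

On a 32-bit target these assertions would fail, which is the intended behaviour: the layout would need to be revisited rather than assumed.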
## Hot Path Access Pattern

### alloc (TLS hit)

```c
// likely() is assumed to be the project's branch-hint macro.
static inline void* tiny_ultra_alloc_fast(TinyUltraTlsCtx* ctx, uint8_t class_idx) {
    // Single cache line access: counts and heads both live in LINE 1.
    // The four adjacent members are indexed as arrays, relying on the layout above.
    uint16_t* counts = &ctx->c4_count;
    void**    heads  = &ctx->c4_head;
    unsigned  i      = class_idx - 4u;   // C4..C7 -> 0..3

    uint16_t c = counts[i];
    if (likely(c > 0)) {
        void* p = heads[i];
        heads[i]  = *(void**)p;          // pop: advance head to the next free slot
        counts[i] = (uint16_t)(c - 1);
        return p;
    }
    return tiny_ultra_alloc_slow(ctx, class_idx);
}
```

### free (TLS push)

```c
static inline void tiny_ultra_free_fast(TinyUltraTlsCtx* ctx, void* ptr, uint8_t class_idx) {
    // Range check (seg_base/end in same cache line)
    uintptr_t p = (uintptr_t)ptr;
    if (likely(p >= ctx->seg_base && p < ctx->seg_end)) {
        // Push to freelist (single cache line)
        void** heads = &ctx->c4_head;
        uint16_t* counts = &ctx->c4_count;

        *(void**)ptr = heads[class_idx - 4];   // link: freed slot stores the old head
        heads[class_idx - 4] = ptr;
        counts[class_idx - 4]++;
        return;
    }
    tiny_ultra_free_slow(ctx, ptr, class_idx);
}
```

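To see the two fast paths working together, the sketch below runs a push/pop round trip on a small stand-in context. It is an illustration only: `MiniUltraCtx`, `mini_alloc`, `mini_free`, and `mini_round_trip` are hypothetical names, the slow paths are replaced by simple return values, and the counts and heads are declared as real arrays rather than as adjacent named members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the hot line: arrays instead of c4_..c7_ members. */
typedef struct MiniUltraCtx {
    uint16_t  count[4];            /* C4..C7 */
    void     *head[4];
    uintptr_t seg_base, seg_end;
} MiniUltraCtx;

/* Fast-path pop; returns NULL where the real code would call the slow path. */
static void *mini_alloc(MiniUltraCtx *ctx, uint8_t class_idx) {
    unsigned i = class_idx - 4u;
    assert(i < 4);
    if (ctx->count[i] == 0) return NULL;
    void *p = ctx->head[i];
    ctx->head[i] = *(void **)p;    /* the next pointer lives in the free slot */
    ctx->count[i]--;
    return p;
}

/* Fast-path push; returns 0 where the real code would call the slow path. */
static int mini_free(MiniUltraCtx *ctx, void *ptr, uint8_t class_idx) {
    unsigned  i = class_idx - 4u;
    uintptr_t p = (uintptr_t)ptr;
    assert(i < 4);
    if (p < ctx->seg_base || p >= ctx->seg_end) return 0;
    *(void **)ptr = ctx->head[i];
    ctx->head[i] = ptr;
    ctx->count[i]++;
    return 1;
}

/* Frees two slots of a fake segment, then pops them back in LIFO order. */
static int mini_round_trip(void) {
    static void *seg[16];          /* fake segment; slots at seg[0] and seg[8] */
    void *outside = NULL;          /* an address that is not in the segment */
    MiniUltraCtx ctx = {0};
    ctx.seg_base = (uintptr_t)seg;
    ctx.seg_end  = (uintptr_t)seg + sizeof seg;

    if (!mini_free(&ctx, &seg[0], 4) || !mini_free(&ctx, &seg[8], 4)) return 0;
    if (mini_free(&ctx, &outside, 4)) return 0;   /* range check must reject it */
    return mini_alloc(&ctx, 4) == (void *)&seg[8]
        && mini_alloc(&ctx, 4) == (void *)&seg[0]
        && mini_alloc(&ctx, 4) == NULL;
}
```

Declaring `count[4]` and `head[4]` as arrays yields the same byte layout as four adjacent members while keeping the indexing within a single object. Whether the real struct should switch to arrays is left open here.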
## Comparison: Before vs After

| Metric | Current (v11b-1) | Unified (v11b-2) |
|--------|------------------|------------------|
| TLS size (C4-C7) | 3712B | 128B |
| Cache lines (hot) | ~60 | **1** |
| seg_base/end copies | 4 | 1 |
| count access | scattered | contiguous |

## Freelist Design: Linked List vs Array

**Choice: Linked List (head/tail)**

Reasons:
1. **No fixed array**: removes the 1KB `freelist[128]`
2. **O(1) push/pop**: the head pointer alone is enough
3. **Bulk drain**: with a tail pointer, the whole list can be returned in one operation
4. **Memory efficiency**: link pointers live only in the slots that are actually on the freelist

Trade-offs:
- Harder to prefetch (an array would allow sequential access)
- Spatial locality may degrade

→ An array-based variant can still be considered after profiling

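The bulk-drain argument can be sketched as follows: when the tail is known, the whole local list is spliced onto another list with two pointer writes, without walking it. This is a single-threaded illustration with hypothetical names (`UltraList`, `ultra_list_push`, `ultra_list_drain`). A real shared list would need atomics or a lock, which the plan does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-class view of the head/tail/count triple. */
typedef struct UltraList {
    void    *head;
    void    *tail;
    uint16_t count;
} UltraList;

/* Push one free slot; the tail is recorded only when the list was empty. */
static void ultra_list_push(UltraList *l, void *slot) {
    *(void **)slot = l->head;
    if (l->head == NULL) l->tail = slot;
    l->head = slot;
    l->count++;
}

/* Splice the whole local list in front of *shared_head in O(1).
 * Returns the number of slots moved. */
static uint16_t ultra_list_drain(UltraList *l, void **shared_head) {
    uint16_t n = l->count;
    if (n == 0) return 0;
    *(void **)l->tail = *shared_head;   /* last local slot -> old shared head */
    *shared_head = l->head;             /* shared list now starts at local head */
    l->head = NULL;
    l->tail = NULL;
    l->count = 0;
    return n;
}
```

In this sketch the tail is written only on a push into an empty list, so most pushes leave the cold line untouched. That is one possible policy, not something the plan prescribes.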
## Implementation Notes

1. **Backward Compatibility**: keep the existing TinyC*UltraFreeTLS API while using TinyUltraTlsCtx internally
2. **Gradual Migration**: migrate C7 to the new structure first and measure the effect
3. **ENV Gate**: enabled with `HAKMEM_ULTRA_UNIFIED_TLS=1`

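A minimal sketch of how the ENV gate could be read is shown below. Only the variable name `HAKMEM_ULTRA_UNIFIED_TLS` and the value `1` come from the notes above. The function names, the rule that any other value means "off", and the caching are assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Pure helper: treats exactly "1" as enabled (assumption: anything else is off). */
static int ultra_unified_tls_parse(const char *value) {
    return value != NULL && strcmp(value, "1") == 0;
}

/* Reads the gate once and caches the result; -1 means "not read yet".
 * Not synchronized: concurrent first calls would simply compute the same value. */
static int ultra_unified_tls_enabled(void) {
    static int cached = -1;
    if (cached < 0)
        cached = ultra_unified_tls_parse(getenv("HAKMEM_ULTRA_UNIFIED_TLS"));
    return cached;
}
```

Keeping the parsing separate from `getenv` makes the rule testable without touching the process environment, and the cached read keeps the gate off the alloc/free hot path.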
---

**Date**: 2025-12-12
**Phase**: v11b-2 planning