# TLS Layout Analysis (Phase v11b-1 Current State)

## Overview

The current ULTRA TLS is spread across **independent per-class structs**, which makes poor use of the L1D cache.

## TLS Structures (ULTRA C4-C7)

### 1. TinyC7UltraFreeTLS (`tiny_c7_ultra_tls_t`)

```c
typedef struct tiny_c7_ultra_tls_t {
    uint16_t count;                        // 2B (hot)
    uint16_t _pad;                         // 2B
    void*    freelist[128];                // 1024B (128 * 8)
    // --- cold fields ---
    uintptr_t seg_base;                    // 8B
    uintptr_t seg_end;                     // 8B
    tiny_c7_ultra_segment_t* seg;          // 8B
    void*     page_base;                   // 8B
    size_t    block_size;                  // 8B
    uint32_t  page_idx;                    // 4B
    tiny_c7_ultra_page_meta_t* page_meta;  // 8B
    bool      headers_initialized;         // 1B
} tiny_c7_ultra_tls_t;
```

| Field | Size | Total |
|-------|------|-------|
| Hot (count + _pad + freelist) | 4 + 1024 | 1028B |
| Cold (seg_base...headers_initialized) | ~53B | ~53B |
| **Total** | | **~1080B (17 cache lines)** |

These figures count field sizes only. With natural alignment on a typical LP64 ABI the compiler also inserts padding (4B before `freelist`, 4B after `page_idx`, 7B after `headers_initialized`), so `sizeof` is likely 1096B, slightly more than 17 cache lines.

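
The gap between the summed field sizes and the padded size can be checked directly with `sizeof` and `offsetof`. The following is a minimal sketch, assuming an LP64 target (8-byte pointers, natural alignment). The two forward-declared types are opaque stand-ins because their definitions are not shown in this document, and the helper names `c7_field_bytes` and `c7_padding_bytes` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-ins: the real definitions are not part of this document. */
typedef struct tiny_c7_ultra_segment_t   tiny_c7_ultra_segment_t;
typedef struct tiny_c7_ultra_page_meta_t tiny_c7_ultra_page_meta_t;

typedef struct tiny_c7_ultra_tls_t {
    uint16_t count;
    uint16_t _pad;
    void*    freelist[128];
    uintptr_t seg_base;
    uintptr_t seg_end;
    tiny_c7_ultra_segment_t* seg;
    void*     page_base;
    size_t    block_size;
    uint32_t  page_idx;
    tiny_c7_ultra_page_meta_t* page_meta;
    bool      headers_initialized;
} tiny_c7_ultra_tls_t;

/* Sum of the declared field sizes, ignoring padding
 * (this is the quantity the table above approximates as ~1080B). */
static inline size_t c7_field_bytes(void) {
    return 2 * sizeof(uint16_t)
         + 128 * sizeof(void*)
         + 2 * sizeof(uintptr_t)
         + sizeof(tiny_c7_ultra_segment_t*)
         + sizeof(void*)
         + sizeof(size_t)
         + sizeof(uint32_t)
         + sizeof(tiny_c7_ultra_page_meta_t*)
         + sizeof(bool);
}

/* Bytes the compiler adds for alignment on the current target. */
static inline size_t c7_padding_bytes(void) {
    return sizeof(tiny_c7_ultra_tls_t) - c7_field_bytes();
}
```

On such a target, `offsetof(tiny_c7_ultra_tls_t, freelist)` comes out as 8 rather than 4, which is where the first 4 bytes of padding go.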
### 2. TinyC6UltraFreeTLS

```c
typedef struct TinyC6UltraFreeTLS {
    void*     freelist[128];  // 1024B (128 * 8)
    uint8_t   count;          // 1B
    uint8_t   _pad[7];        // 7B
    uintptr_t seg_base;       // 8B
    uintptr_t seg_end;        // 8B
} TinyC6UltraFreeTLS;
```

| Field | Size |
|-------|------|
| freelist | 1024B |
| count + pad | 8B |
| seg_base/end | 16B |
| **Total** | **1048B (17 cache lines)** |

### 3. TinyC5UltraFreeTLS

Same layout as C6: **1048B (17 cache lines)**

### 4. TinyC4UltraFreeTLS

```c
typedef struct TinyC4UltraFreeTLS {
    void*     freelist[64];  // 512B (64 * 8)
    uint8_t   count;         // 1B
    uint8_t   _pad[7];       // 7B
    uintptr_t seg_base;      // 8B
    uintptr_t seg_end;       // 8B
} TinyC4UltraFreeTLS;
```

| Field | Size |
|-------|------|
| freelist | 512B |
| count + pad | 8B |
| seg_base/end | 16B |
| **Total** | **536B (9 cache lines)** |

### 5. SmallMidV35TlsCtx (MID v3.5)

```c
typedef struct {
    void     *page[8];              // 64B
    uint32_t  offset[8];            // 32B
    uint32_t  capacity[8];          // 32B
    SmallPageMeta_MID_v3 *meta[8];  // 64B
} SmallMidV35TlsCtx;
```

| Field | Size |
|-------|------|
| page[8] | 64B |
| offset[8] | 32B |
| capacity[8] | 32B |
| meta[8] | 64B |
| **Total** | **192B (3 cache lines)** |

## Summary: Total TLS Footprint

| Structure | Size | Cache Lines |
|-----------|------|-------------|
| TinyC7UltraFreeTLS | 1080B | 17 |
| TinyC6UltraFreeTLS | 1048B | 17 |
| TinyC5UltraFreeTLS | 1048B | 17 |
| TinyC4UltraFreeTLS | 536B | 9 |
| SmallMidV35TlsCtx | 192B | 3 |
| **Total ULTRA (C4-C7)** | **3712B** | **~60 lines** |

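
The ULTRA total is the sum of the four C4-C7 rows (the MID context is listed but not included). As a rough sketch of that arithmetic, assuming 64-byte cache lines and each struct starting on a line boundary, the helper names below (`tls_entry_t`, `lines_for`, `total_bytes`, `total_lines`) are hypothetical and exist only for this illustration:

```c
#include <stddef.h>

/* One row of the footprint table: structure name and size in bytes. */
typedef struct {
    const char *name;
    size_t      bytes;
} tls_entry_t;

/* Sizes copied from the table above (C4-C7 only). */
static const tls_entry_t k_ultra_tls[] = {
    { "TinyC7UltraFreeTLS", 1080 },
    { "TinyC6UltraFreeTLS", 1048 },
    { "TinyC5UltraFreeTLS", 1048 },
    { "TinyC4UltraFreeTLS",  536 },
};
#define K_ULTRA_TLS_LEN (sizeof(k_ultra_tls) / sizeof(k_ultra_tls[0]))

/* Cache lines spanned by `bytes`, rounding up; 64B lines are assumed. */
static size_t lines_for(size_t bytes) {
    return (bytes + 63) / 64;
}

/* Total bytes across all entries. */
static size_t total_bytes(const tls_entry_t *t, size_t n) {
    size_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += t[i].bytes;
    return sum;
}

/* Total cache lines, rounding each structure up separately. */
static size_t total_lines(const tls_entry_t *t, size_t n) {
    size_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += lines_for(t[i].bytes);
    return sum;
}
```

Rounding each structure up separately matches how the table counts lines, since the structs are independent objects and do not share cache lines by design.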
## Problem Analysis

### 1. Minimum Fields Required on the Hot Path

| Operation | Required Fields |
|-----------|-----------------|
| alloc (TLS hit) | count, freelist[count-1] |
| free (TLS push) | count, freelist[count], seg_base/end |

**The hot path effectively needs only count + head + seg_range, about 24B.**

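
To make the field usage in the table concrete, here is a minimal sketch of an array-magazine pop and push. It is not the allocator's actual code: the type `mag_t`, the functions `mag_pop` and `mag_push`, the capacity constant, and the half-open `[seg_base, seg_end)` range convention are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAG_CAP 128  /* assumed capacity, matching the 128-slot freelists */

/* Hypothetical per-class magazine with the fields named in the table. */
typedef struct {
    void*     freelist[MAG_CAP];
    uint16_t  count;
    uintptr_t seg_base;  /* inclusive (assumed) */
    uintptr_t seg_end;   /* exclusive (assumed) */
} mag_t;

/* alloc (TLS hit): reads count and freelist[count-1] only. */
static inline void* mag_pop(mag_t* m) {
    if (m->count == 0) return NULL;  /* miss: caller takes the slow path */
    return m->freelist[--m->count];
}

/* free (TLS push): range check, then writes freelist[count] and count. */
static inline bool mag_push(mag_t* m, void* p) {
    uintptr_t a = (uintptr_t)p;
    if (a < m->seg_base || a >= m->seg_end) return false;  /* not in this segment */
    if (m->count >= MAG_CAP) return false;                 /* magazine full */
    m->freelist[m->count++] = p;
    return true;
}
```

In this sketch a pop touches `count` plus one slot near the top of the array, and a push additionally reads the two range fields, which is the small working set the estimate above refers to.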
### 2. Current Problems

1. **Huge freelist arrays**: each class keeps a 512-1024B array in TLS
2. **seg_base/end duplicated across classes**: they could be shared if C4-C7 use the same segment range
3. **Inconsistent placement of count**: it is the first field in C7 but comes after the freelist in C4-C6
4. **Cold fields mixed into the hot region**: C7's page_meta and similar fields are loaded on every access

### 3. Causes of Cache Misses

- Every alloc/free accesses a **dedicated per-class TLS struct**
- 4 classes × ~15 cache lines on average = **~60 cache lines competing for L1D**
- In mixed workloads, accesses switch randomly among C4-C7, which causes thrashing

---

## Phase TLS-UNIFY-2a: C4-C6 Unified TLS (2025-12-12)

### Implementation

The C4-C6 ULTRA TLS is merged into a single box, `TinyUltraTlsCtx`:

```c
typedef struct TinyUltraTlsCtx {
    // Hot line: counts (8B aligned)
    uint16_t c4_count;
    uint16_t c5_count;
    uint16_t c6_count;
    uint16_t _pad_count;

    // Per-class segment ranges (learned on first free)
    uintptr_t c4_seg_base, c4_seg_end;
    uintptr_t c5_seg_base, c5_seg_end;
    uintptr_t c6_seg_base, c6_seg_end;

    // Per-class array magazines
    void* c4_freelist[64];   // 512B
    void* c5_freelist[64];   // 512B
    void* c6_freelist[128];  // 1024B
} TinyUltraTlsCtx;
// Total: ~2KB per thread
```

**Changes**:
- Merged the C4/C5/C6 TLS into one struct
- Kept the array-magazine scheme (safe)
- C7 stays in its own box (already stable)
- Removed delegation to the old `TinyC4/5/6UltraFreeTLS`

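
One property of the unified layout is that the counts and segment ranges sit together ahead of the magazines. The sketch below measures that prefix with `offsetof`, assuming an LP64 target. The helper `ultra_hot_prefix_bytes` is hypothetical, and whether the prefix actually lands in a single cache line depends on the context being placed on a 64B boundary, which this document does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct TinyUltraTlsCtx {
    /* Counts (8B total with padding). */
    uint16_t c4_count;
    uint16_t c5_count;
    uint16_t c6_count;
    uint16_t _pad_count;

    /* Per-class segment ranges. */
    uintptr_t c4_seg_base, c4_seg_end;
    uintptr_t c5_seg_base, c5_seg_end;
    uintptr_t c6_seg_base, c6_seg_end;

    /* Per-class array magazines. */
    void* c4_freelist[64];
    void* c5_freelist[64];
    void* c6_freelist[128];
} TinyUltraTlsCtx;

/* Bytes occupied by counts + segment ranges, i.e. everything that
 * precedes the first magazine array. */
static inline size_t ultra_hot_prefix_bytes(void) {
    return offsetof(TinyUltraTlsCtx, c4_freelist);
}
```

With 8-byte pointers the prefix is 8B of counts plus 48B of ranges, so it is smaller than one 64B line, and the full struct is a little over 2KB, in line with the "~2KB per thread" comment.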
### A/B Test Results

| Test | v11b-1 (Phase 1) | TLS-UNIFY-2a | Diff |
|------|------------------|--------------|------|
| Mixed 16-1024B | 8.0-8.8 Mop/s | 8.5-9.0 Mop/s | +0 to 5% |
| MID 257-768B | 8.5-9.0 Mop/s | 8.1-9.0 Mop/s | ±0% |

**Result**: performance is equal or better, with no SEGV or assert failures ✅

---

**Date**: 2025-12-12
**Phase**: TLS-UNIFY-2a completed