Implement C6 ULTRA intrusive LIFO freelist with ENV gating:

- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS

Files:

- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):

- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

178 lines
4.9 KiB
Markdown

# TLS Layout Analysis (Phase v11b-1 Current State)

## Overview

The current ULTRA TLS state is scattered across **independent per-class structs**, which makes poor use of the L1D cache.

## TLS Structures (ULTRA C4-C7)

### 1. TinyC7UltraFreeTLS (`tiny_c7_ultra_tls_t`)

```c
typedef struct tiny_c7_ultra_tls_t {
    uint16_t count;                        // 2B (hot)
    uint16_t _pad;                         // 2B
    void* freelist[128];                   // 1024B (128 * 8)
    // --- cold fields ---
    uintptr_t seg_base;                    // 8B
    uintptr_t seg_end;                     // 8B
    tiny_c7_ultra_segment_t* seg;          // 8B
    void* page_base;                       // 8B
    size_t block_size;                     // 8B
    uint32_t page_idx;                     // 4B
    tiny_c7_ultra_page_meta_t* page_meta;  // 8B
    bool headers_initialized;              // 1B
} tiny_c7_ultra_tls_t;
```

| Field | Size | Total |
|-------|------|-------|
| Hot (count + freelist) | 4 + 1024 | 1028B |
| Cold (seg_base...headers_initialized) | ~53B | ~53B |
| **Total** | | **~1080B (17 cache lines)** |

### 2. TinyC6UltraFreeTLS

```c
typedef struct TinyC6UltraFreeTLS {
    void* freelist[128];  // 1024B (128 * 8)
    uint8_t count;        // 1B
    uint8_t _pad[7];      // 7B
    uintptr_t seg_base;   // 8B
    uintptr_t seg_end;    // 8B
} TinyC6UltraFreeTLS;
```

| Field | Size |
|-------|------|
| freelist | 1024B |
| count + pad | 8B |
| seg_base/end | 16B |
| **Total** | **1048B (17 cache lines)** |

### 3. TinyC5UltraFreeTLS

Same as C6: **1048B (17 cache lines)**

### 4. TinyC4UltraFreeTLS

```c
typedef struct TinyC4UltraFreeTLS {
    void* freelist[64];   // 512B (64 * 8)
    uint8_t count;        // 1B
    uint8_t _pad[7];      // 7B
    uintptr_t seg_base;   // 8B
    uintptr_t seg_end;    // 8B
} TinyC4UltraFreeTLS;
```

| Field | Size |
|-------|------|
| freelist | 512B |
| count + pad | 8B |
| seg_base/end | 16B |
| **Total** | **536B (9 cache lines)** |

### 5. SmallMidV35TlsCtx (MID v3.5)

```c
typedef struct {
    void *page[8];                  // 64B
    uint32_t offset[8];             // 32B
    uint32_t capacity[8];           // 32B
    SmallPageMeta_MID_v3 *meta[8];  // 64B
} SmallMidV35TlsCtx;
```

| Field | Size |
|-------|------|
| page[8] | 64B |
| offset[8] | 32B |
| capacity[8] | 32B |
| meta[8] | 64B |
| **Total** | **192B (3 cache lines)** |

## Summary: Total TLS Footprint

| Structure | Size | Cache Lines |
|-----------|------|-------------|
| TinyC7UltraFreeTLS | 1080B | 17 |
| TinyC6UltraFreeTLS | 1048B | 17 |
| TinyC5UltraFreeTLS | 1048B | 17 |
| TinyC4UltraFreeTLS | 536B | 9 |
| SmallMidV35TlsCtx | 192B | 3 |
| **Total ULTRA (C4-C7)** | **3712B** | **~60 lines** |
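
The per-struct sizes in the tables can be checked at compile time. The following is a minimal sketch, assuming a 64-bit target with 8-byte pointers and a 64-byte cache line; `CACHE_LINE` and `cache_lines_spanned` are hypothetical names introduced for illustration, and `SmallPageMeta_MID_v3` is only forward-declared. C7 is left out because it depends on project-specific types and alignment padding.

```c
#include <assert.h>   // static_assert (C11)
#include <stddef.h>   // size_t, offsetof
#include <stdint.h>

#define CACHE_LINE 64u  // assumption: 64-byte cache lines

typedef struct SmallPageMeta_MID_v3 SmallPageMeta_MID_v3;  // opaque here

typedef struct TinyC6UltraFreeTLS {
    void*     freelist[128];
    uint8_t   count;
    uint8_t   _pad[7];
    uintptr_t seg_base;
    uintptr_t seg_end;
} TinyC6UltraFreeTLS;

typedef struct TinyC4UltraFreeTLS {
    void*     freelist[64];
    uint8_t   count;
    uint8_t   _pad[7];
    uintptr_t seg_base;
    uintptr_t seg_end;
} TinyC4UltraFreeTLS;

typedef struct {
    void*                 page[8];
    uint32_t              offset[8];
    uint32_t              capacity[8];
    SmallPageMeta_MID_v3* meta[8];
} SmallMidV35TlsCtx;

// Cache lines covered by `bytes`, assuming the object starts on a line boundary.
static inline size_t cache_lines_spanned(size_t bytes) {
    return (bytes + CACHE_LINE - 1) / CACHE_LINE;
}

static_assert(sizeof(void*) == 8, "sketch assumes 64-bit pointers");
static_assert(sizeof(TinyC6UltraFreeTLS) == 1048, "C6 TLS size");
static_assert(sizeof(TinyC4UltraFreeTLS) == 536, "C4 TLS size");
static_assert(sizeof(SmallMidV35TlsCtx) == 192, "MID v3.5 TLS size");
static_assert(offsetof(TinyC6UltraFreeTLS, count) == 1024,
              "count sits after the freelist array");
```

Calling `cache_lines_spanned(sizeof(TinyC6UltraFreeTLS))` yields 17 under these assumptions, matching the table. An object that does not start on a line boundary can touch one more line than this helper reports.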

## Problem Analysis

### 1. Minimum Fields Required on the Hot Path

| Operation | Required Fields |
|-----------|-----------------|
| alloc (TLS hit) | count, freelist[count-1] |
| free (TLS push) | count, freelist[count], seg_base/end |

**The hot path effectively needs only count + head + seg_range, about 24B.**
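
The field accesses listed in the table can be sketched as a pop/push pair on an array magazine. This is an illustrative sketch, not the project's implementation: `ExampleMagazine`, `mag_pop`, `mag_push`, and `MAG_CAP` are hypothetical names, the layout mirrors the C6 struct, and the segment range is assumed to be half-open (`seg_base` inclusive, `seg_end` exclusive).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAG_CAP 128  // matches freelist[128] in the C6 layout

// Hypothetical stand-in for a per-class array magazine.
typedef struct {
    void*     freelist[MAG_CAP];
    uint8_t   count;
    uint8_t   _pad[7];
    uintptr_t seg_base;  // assumed inclusive
    uintptr_t seg_end;   // assumed exclusive
} ExampleMagazine;

// alloc (TLS hit): reads count and freelist[count-1].
static inline void* mag_pop(ExampleMagazine* m) {
    if (m->count == 0) {
        return NULL;  // miss: the caller would fall back to a slower path
    }
    return m->freelist[--m->count];
}

// free (TLS push): reads seg_base/seg_end, then writes freelist[count] and count.
static inline bool mag_push(ExampleMagazine* m, void* p) {
    uintptr_t addr = (uintptr_t)p;
    if (addr < m->seg_base || addr >= m->seg_end) {
        return false;  // pointer does not belong to this class's segment
    }
    if (m->count >= MAG_CAP) {
        return false;  // magazine full
    }
    m->freelist[m->count++] = p;
    return true;
}
```

Only `count`, one `freelist` slot, and the two range bounds are touched per operation, yet each slot access can land on a different cache line of the 1024B array. That is the gap between the ~24B the hot path needs and the footprint it actually drags in.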
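
### 2. Current Problems

1. **The freelist arrays are huge**: each class keeps a 512-1024B array in TLS.
2. **seg_base/end is duplicated across classes**: if C4-C7 cover the same segment range, it could be shared.
3. **count placement is inconsistent**: C7 puts it first, while C4-C6 put it after the freelist.
4. **Cold fields are mixed into the hot region**: C7's page_meta and similar fields get loaded on every access.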

### 3. Causes of Cache Misses

- Every alloc/free accesses **a TLS struct dedicated to its class**.
- 4 classes × ~15 cache lines on average ≈ **60 cache lines competing for L1D** (per the summary table above).
- Under a mixed workload, C4-C7 alternate at random, which causes thrashing.

---

## Phase TLS-UNIFY-2a: C4-C6 Unified TLS (2025-12-12)

### Implementation

The C4-C6 ULTRA TLS state is consolidated into a single box, `TinyUltraTlsCtx`:

```c
typedef struct TinyUltraTlsCtx {
    // Hot line: counts (8B aligned)
    uint16_t c4_count;
    uint16_t c5_count;
    uint16_t c6_count;
    uint16_t _pad_count;

    // Per-class segment ranges (learned on first free)
    uintptr_t c4_seg_base, c4_seg_end;
    uintptr_t c5_seg_base, c5_seg_end;
    uintptr_t c6_seg_base, c6_seg_end;

    // Per-class array magazines
    void* c4_freelist[64];   // 512B
    void* c5_freelist[64];   // 512B
    void* c6_freelist[128];  // 1024B
} TinyUltraTlsCtx;
// Total: ~2KB per thread
```

**Changes**:

- Merged the C4/C5/C6 TLS state into one struct
- Kept the array-magazine scheme (the safe option)
- C7 stays in its own box (already stable)
- Removed delegation to the old `TinyC4/5/6UltraFreeTLS`
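
One way to picture the unified box is a pop/push pair that dispatches on the class index. The sketch below is an assumption about the shape of such code, not the project's actual functions: `ultra_tls_pop_example`, `ultra_tls_push_example`, and `ultra_push_to` are hypothetical names, the context is passed explicitly rather than read from a thread-local, segment learning is omitted, and ranges are assumed half-open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct TinyUltraTlsCtx {
    uint16_t  c4_count, c5_count, c6_count, _pad_count;
    uintptr_t c4_seg_base, c4_seg_end;
    uintptr_t c5_seg_base, c5_seg_end;
    uintptr_t c6_seg_base, c6_seg_end;
    void*     c4_freelist[64];
    void*     c5_freelist[64];
    void*     c6_freelist[128];
} TinyUltraTlsCtx;

// 8B counts + 48B ranges + 2048B magazines on a 64-bit target ("~2KB").
static_assert(sizeof(void*) == 8, "sketch assumes 64-bit pointers");
static_assert(sizeof(TinyUltraTlsCtx) == 2104, "unified ctx size");

// Hypothetical helper: range-checked push into one class's magazine.
static inline int ultra_push_to(void** fl, uint16_t* count, uint16_t cap,
                                uintptr_t base, uintptr_t end, void* p) {
    uintptr_t addr = (uintptr_t)p;
    if (addr < base || addr >= end || *count >= cap) {
        return 0;  // foreign pointer or full magazine: caller falls back
    }
    fl[(*count)++] = p;
    return 1;
}

// Pop a block for class_idx (4..6); NULL on a miss or unsupported class.
static inline void* ultra_tls_pop_example(TinyUltraTlsCtx* ctx, int class_idx) {
    switch (class_idx) {
    case 4: return ctx->c4_count ? ctx->c4_freelist[--ctx->c4_count] : NULL;
    case 5: return ctx->c5_count ? ctx->c5_freelist[--ctx->c5_count] : NULL;
    case 6: return ctx->c6_count ? ctx->c6_freelist[--ctx->c6_count] : NULL;
    default: return NULL;  // C7 lives in its own box
    }
}

// Push a freed block for class_idx (4..6); returns 1 on success, 0 otherwise.
static inline int ultra_tls_push_example(TinyUltraTlsCtx* ctx, int class_idx,
                                         void* p) {
    switch (class_idx) {
    case 4: return ultra_push_to(ctx->c4_freelist, &ctx->c4_count, 64,
                                 ctx->c4_seg_base, ctx->c4_seg_end, p);
    case 5: return ultra_push_to(ctx->c5_freelist, &ctx->c5_count, 64,
                                 ctx->c5_seg_base, ctx->c5_seg_end, p);
    case 6: return ultra_push_to(ctx->c6_freelist, &ctx->c6_count, 128,
                                 ctx->c6_seg_base, ctx->c6_seg_end, p);
    default: return 0;
    }
}
```

With this layout the three counts and six range bounds share the first 56 bytes of the struct, so a class switch in a mixed workload re-reads the same leading cache line instead of a different struct per class.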

### A/B Test Results

| Test | v11b-1 (Phase 1) | TLS-UNIFY-2a | Diff |
|------|------------------|--------------|------|
| Mixed 16-1024B | 8.0-8.8 Mop/s | 8.5-9.0 Mop/s | +0~5% |
| MID 257-768B | 8.5-9.0 Mop/s | 8.1-9.0 Mop/s | ±0% |

**Result**: performance equal or better, no SEGV or assert failures ✅

---

**Date**: 2025-12-12
**Phase**: TLS-UNIFY-2a completed