Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS

Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
TLS Layout Analysis (Phase v11b-1 Current State)
Overview
The current ULTRA TLS is scattered across independent per-class structs, which makes poor use of the L1D cache.
TLS Structures (ULTRA C4-C7)
1. TinyC7UltraFreeTLS (tiny_c7_ultra_tls_t)
typedef struct tiny_c7_ultra_tls_t {
    uint16_t count;                        // 2B (hot)
    uint16_t _pad;                         // 2B
    void* freelist[128];                   // 1024B (128 * 8)
    // --- cold fields ---
    uintptr_t seg_base;                    // 8B
    uintptr_t seg_end;                     // 8B
    tiny_c7_ultra_segment_t* seg;          // 8B
    void* page_base;                       // 8B
    size_t block_size;                     // 8B
    uint32_t page_idx;                     // 4B
    tiny_c7_ultra_page_meta_t* page_meta;  // 8B
    bool headers_initialized;              // 1B
} tiny_c7_ultra_tls_t;
| Field | Size | Total |
|---|---|---|
| Hot (count + _pad + freelist) | 4 + 1024 | 1028B |
| Cold (seg_base...headers_initialized) | ~53B | ~53B |
| Total | | ~1080B (17 cache lines) |
2. TinyC6UltraFreeTLS
typedef struct TinyC6UltraFreeTLS {
    void* freelist[128];  // 1024B (128 * 8)
    uint8_t count;        // 1B
    uint8_t _pad[7];      // 7B
    uintptr_t seg_base;   // 8B
    uintptr_t seg_end;    // 8B
} TinyC6UltraFreeTLS;
| Field | Size |
|---|---|
| freelist | 1024B |
| count + pad | 8B |
| seg_base/end | 16B |
| Total | 1048B (17 cache lines) |
3. TinyC5UltraFreeTLS
Same as C6: 1048B (17 cache lines)
4. TinyC4UltraFreeTLS
typedef struct TinyC4UltraFreeTLS {
    void* freelist[64];   // 512B (64 * 8)
    uint8_t count;        // 1B
    uint8_t _pad[7];      // 7B
    uintptr_t seg_base;   // 8B
    uintptr_t seg_end;    // 8B
} TinyC4UltraFreeTLS;
| Field | Size |
|---|---|
| freelist | 512B |
| count + pad | 8B |
| seg_base/end | 16B |
| Total | 536B (9 cache lines) |
5. SmallMidV35TlsCtx (MID v3.5)
typedef struct {
    void *page[8];                  // 64B
    uint32_t offset[8];             // 32B
    uint32_t capacity[8];           // 32B
    SmallPageMeta_MID_v3 *meta[8];  // 64B
} SmallMidV35TlsCtx;
| Field | Size |
|---|---|
| page[8] | 64B |
| offset[8] | 32B |
| capacity[8] | 32B |
| meta[8] | 64B |
| Total | 192B (3 cache lines) |
Summary: Total TLS Footprint
| Structure | Size | Cache Lines |
|---|---|---|
| TinyC7UltraFreeTLS | 1080B | 17 |
| TinyC6UltraFreeTLS | 1048B | 17 |
| TinyC5UltraFreeTLS | 1048B | 17 |
| TinyC4UltraFreeTLS | 536B | 9 |
| SmallMidV35TlsCtx | 192B | 3 |
| Total ULTRA (C4-C7) | 3712B | ~60 lines |
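The cache-line counts in the table follow from rounding each struct size up to a whole number of lines. The sketch below reproduces that arithmetic; it assumes 64-byte L1D lines, and the names `cache_lines_spanned`, `ultra_total_bytes`, and `ultra_total_lines` are hypothetical helpers for illustration, not part of the codebase.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64u /* assumption: 64-byte L1D cache lines */

/* Struct sizes in bytes as listed in the summary table: C7, C6, C5, C4. */
static const size_t kUltraTlsBytes[4] = { 1080, 1048, 1048, 536 };

/* Number of cache lines a struct of `bytes` bytes spans when it starts
 * on a line boundary (ceiling division). */
static inline size_t cache_lines_spanned(size_t bytes) {
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Sum of the four ULTRA struct sizes. */
static inline size_t ultra_total_bytes(void) {
    size_t total = 0;
    for (size_t i = 0; i < 4; i++) total += kUltraTlsBytes[i];
    return total;
}

/* Sum of the per-struct line counts; each struct is rounded up on its own
 * because the structs are separate TLS objects, not one packed region. */
static inline size_t ultra_total_lines(void) {
    size_t lines = 0;
    for (size_t i = 0; i < 4; i++) lines += cache_lines_spanned(kUltraTlsBytes[i]);
    return lines;
}

/* Checks the totals against the summary table. */
static inline void footprint_self_check(void) {
    assert(ultra_total_bytes() == 3712);
    assert(ultra_total_lines() == 60);
}
```

Rounding per struct gives 17 + 17 + 17 + 9 = 60 lines, slightly more than 3712 / 64 = 58, because each struct's partially used last line still occupies a full line.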
Problem Analysis
1. Minimum fields required on the hot path
| Operation | Required Fields |
|---|---|
| alloc (TLS hit) | count, freelist[count-1] |
| free (TLS push) | count, freelist[count], seg_base/end |
The hot path effectively needs only count + head + seg_range, about 24B in total.
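A minimal sketch of the two hot-path operations from the table, written against an array magazine shaped like TinyC6UltraFreeTLS. The names `mag_tls_t`, `mag_pop`, and `mag_push` are hypothetical, and treating `seg_end` as an exclusive bound is an assumption; the real code may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAG_CAP 128 /* matches freelist[128] in the C5/C6 structs */

/* Hypothetical stand-in with the same field layout as TinyC6UltraFreeTLS. */
typedef struct {
    void*     freelist[MAG_CAP];
    uint8_t   count;
    uint8_t   _pad[7];
    uintptr_t seg_base;
    uintptr_t seg_end;  /* assumed exclusive */
} mag_tls_t;

/* alloc (TLS hit): touches only count and freelist[count-1]. */
static inline void* mag_pop(mag_tls_t* t) {
    if (t->count == 0) return NULL;      /* miss: caller takes the slow path */
    return t->freelist[--t->count];
}

/* free (TLS push): touches count, freelist[count], and seg_base/seg_end. */
static inline bool mag_push(mag_tls_t* t, void* p) {
    uintptr_t a = (uintptr_t)p;
    if (a < t->seg_base || a >= t->seg_end) return false; /* not our segment */
    if (t->count >= MAG_CAP) return false;                /* magazine full */
    t->freelist[t->count++] = p;
    return true;
}
```

Only `count`, one freelist slot, and the segment range are read or written per call, which is why the remaining slots of the 512-1024B array are cold on any single operation.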
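2. Current problems
- The freelist arrays are huge: each class keeps a 512-1024B array in TLS
- seg_base/end is duplicated across classes: if C4-C7 use the same segment range, it could be shared
- The position of count is inconsistent: C7 places it first, C4-C6 place it after the freelist
- Cold fields are mixed into the hot region: C7's page_meta and similar fields are loaded on every access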
3. Causes of cache misses
- Every alloc/free accesses a TLS struct dedicated to that class
- 4 classes × ~15 cache lines on average ≈ 60 cache lines contend for L1D
- In mixed workloads, C4-C7 are selected in random order, which causes thrashing
Phase TLS-UNIFY-2a: C4-C6 Unified TLS (2025-12-12)
Implementation
The C4-C6 ULTRA TLS is consolidated into a single box, TinyUltraTlsCtx:
typedef struct TinyUltraTlsCtx {
    // Hot line: counts (8B aligned)
    uint16_t c4_count;
    uint16_t c5_count;
    uint16_t c6_count;
    uint16_t _pad_count;
    // Per-class segment ranges (learned on first free)
    uintptr_t c4_seg_base, c4_seg_end;
    uintptr_t c5_seg_base, c5_seg_end;
    uintptr_t c6_seg_base, c6_seg_end;
    // Per-class array magazines
    void* c4_freelist[64];   // 512B
    void* c5_freelist[64];   // 512B
    void* c6_freelist[128];  // 1024B
} TinyUltraTlsCtx;
// Total: ~2KB per thread
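Changes:
- Merged the C4/C5/C6 TLS into a single struct
- Kept the array-magazine design (the safe option)
- C7 stays in its own box (already stable)
- Removed delegation to the old TinyC4/5/6UltraFreeTLS structs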
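To illustrate how a single context can serve all three classes, here is a sketch of per-class pop/push dispatch over the TinyUltraTlsCtx layout shown above. The function names `ultra_ctx_pop` and `ultra_ctx_push` are hypothetical, and the segment-range check and first-free segment learning are deliberately left out; this is not the code in tiny_ultra_tls_box.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as the TinyUltraTlsCtx definition above. */
typedef struct TinyUltraTlsCtx {
    uint16_t  c4_count, c5_count, c6_count, _pad_count;
    uintptr_t c4_seg_base, c4_seg_end;
    uintptr_t c5_seg_base, c5_seg_end;
    uintptr_t c6_seg_base, c6_seg_end;
    void*     c4_freelist[64];
    void*     c5_freelist[64];
    void*     c6_freelist[128];
} TinyUltraTlsCtx;

/* Pop a block for class 4..6; NULL on an empty magazine or unknown class. */
static inline void* ultra_ctx_pop(TinyUltraTlsCtx* ctx, int cls) {
    switch (cls) {
    case 4: return ctx->c4_count ? ctx->c4_freelist[--ctx->c4_count] : NULL;
    case 5: return ctx->c5_count ? ctx->c5_freelist[--ctx->c5_count] : NULL;
    case 6: return ctx->c6_count ? ctx->c6_freelist[--ctx->c6_count] : NULL;
    default: return NULL;
    }
}

/* Push a block for class 4..6; returns 1 on success, 0 if the magazine is
 * full or the class is not handled (caller falls back to the slow path). */
static inline int ultra_ctx_push(TinyUltraTlsCtx* ctx, int cls, void* p) {
    switch (cls) {
    case 4:
        if (ctx->c4_count >= 64) return 0;
        ctx->c4_freelist[ctx->c4_count++] = p;
        return 1;
    case 5:
        if (ctx->c5_count >= 64) return 0;
        ctx->c5_freelist[ctx->c5_count++] = p;
        return 1;
    case 6:
        if (ctx->c6_count >= 128) return 0;
        ctx->c6_freelist[ctx->c6_count++] = p;
        return 1;
    default:
        return 0;
    }
}
```

On a typical LP64 target the four counts and six segment bounds occupy the first 56 bytes of the struct, so they share one 64B cache line if the context is 64B-aligned; the full struct is 2104 bytes, in line with the "~2KB per thread" note.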
A/B test results
| Test | v11b-1 (Phase 1) | TLS-UNIFY-2a | Diff |
|---|---|---|---|
| Mixed 16-1024B | 8.0-8.8 Mop/s | 8.5-9.0 Mop/s | +0~5% |
| MID 257-768B | 8.5-9.0 Mop/s | 8.1-9.0 Mop/s | ±0% |
Result: performance equal or better; no SEGVs or assertion failures ✅
Date: 2025-12-12 Phase: TLS-UNIFY-2a completed