hakmem/docs/analysis/TLS_LAYOUT_V11B1_CURRENT.md
Moe Charm (CI) 1a8652a91a Phase TLS-UNIFY-3: C6 intrusive freelist implementation (complete)
Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS
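
The intrusive LIFO described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in tiny_c6_intrusive_freelist_box.h: the names `ifl_list_t`, `ifl_push`, `ifl_pop`, `ifl_next_store`, and `ifl_next_load` are hypothetical, and it assumes that "USER+1" means the link word sits one byte past the user pointer, which is only one possible reading. The real tiny_next_store/tiny_next_load signatures are not shown in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical offset of the link word inside a free block.
 * Assumption: "USER+1" means one byte past the user pointer. */
enum { IFL_NEXT_OFFSET = 1 };

/* Hypothetical stand-ins for tiny_next_store / tiny_next_load.
 * memcpy keeps the unaligned access well-defined in C. */
static inline void ifl_next_store(void *user, void *next) {
    memcpy((unsigned char *)user + IFL_NEXT_OFFSET, &next, sizeof next);
}

static inline void *ifl_next_load(const void *user) {
    void *next;
    memcpy(&next, (const unsigned char *)user + IFL_NEXT_OFFSET, sizeof next);
    return next;
}

/* Per-thread LIFO state: only a head pointer and a count. */
typedef struct {
    void    *head;
    uint16_t count;
} ifl_list_t;

/* Push: the freed block becomes the new head and links to the old head. */
static inline void ifl_push(ifl_list_t *l, void *user) {
    assert(user != NULL);
    ifl_next_store(user, l->head);
    l->head = user;
    l->count++;
}

/* Pop: return the most recently pushed block, or NULL when empty. */
static inline void *ifl_pop(ifl_list_t *l) {
    void *user = l->head;
    if (user == NULL) return NULL;
    l->head = ifl_next_load(user);
    l->count--;
    return user;
}
```

Because the link is stored inside the free block itself, the per-thread state shrinks to a head pointer and a count, in place of the 128-entry pointer array.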

Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 16:26:42 +09:00


TLS Layout Analysis (Phase v11b-1 Current State)

Overview

The current ULTRA TLS state is scattered across independent per-class structs, which makes poor use of the L1D cache.

TLS Structures (ULTRA C4-C7)

1. TinyC7UltraFreeTLS (tiny_c7_ultra_tls_t)

typedef struct tiny_c7_ultra_tls_t {
    uint16_t count;                       // 2B  (hot)
    uint16_t _pad;                        // 2B
    void* freelist[128];                  // 1024B (128 * 8)
    // --- cold fields ---
    uintptr_t seg_base;                   // 8B
    uintptr_t seg_end;                    // 8B
    tiny_c7_ultra_segment_t* seg;         // 8B
    void* page_base;                      // 8B
    size_t block_size;                    // 8B
    uint32_t page_idx;                    // 4B
    tiny_c7_ultra_page_meta_t* page_meta; // 8B
    bool headers_initialized;             // 1B
} tiny_c7_ultra_tls_t;
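
| Field | Size | Total |
| --- | --- | --- |
| Hot (count + freelist) | 4 + 1024 | 1028B |
| Cold (seg_base...headers_initialized) | ~53B | ~53B |
| Total | | ~1080B (17 cache lines) |

These sizes are sums of field widths and do not count alignment padding; on a typical LP64 ABI, sizeof for this struct would come to 1096B.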

2. TinyC6UltraFreeTLS

typedef struct TinyC6UltraFreeTLS {
    void*     freelist[128];  // 1024B (128 * 8)
    uint8_t   count;          // 1B
    uint8_t   _pad[7];        // 7B
    uintptr_t seg_base;       // 8B
    uintptr_t seg_end;        // 8B
} TinyC6UltraFreeTLS;

| Field | Size |
| --- | --- |
| freelist | 1024B |
| count + pad | 8B |
| seg_base/end | 16B |
| Total | 1048B (17 cache lines) |

3. TinyC5UltraFreeTLS

Same as C6: 1048B (17 cache lines)

4. TinyC4UltraFreeTLS

typedef struct TinyC4UltraFreeTLS {
    void*     freelist[64];   // 512B (64 * 8)
    uint8_t   count;          // 1B
    uint8_t   _pad[7];        // 7B
    uintptr_t seg_base;       // 8B
    uintptr_t seg_end;        // 8B
} TinyC4UltraFreeTLS;

| Field | Size |
| --- | --- |
| freelist | 512B |
| count + pad | 8B |
| seg_base/end | 16B |
| Total | 536B (9 cache lines) |

5. SmallMidV35TlsCtx (MID v3.5)

typedef struct {
    void *page[8];                   // 64B
    uint32_t offset[8];              // 32B
    uint32_t capacity[8];            // 32B
    SmallPageMeta_MID_v3 *meta[8];   // 64B
} SmallMidV35TlsCtx;

| Field | Size |
| --- | --- |
| page[8] | 64B |
| offset[8] | 32B |
| capacity[8] | 32B |
| meta[8] | 64B |
| Total | 192B (3 cache lines) |

Summary: Total TLS Footprint

| Structure | Size | Cache Lines |
| --- | --- | --- |
| TinyC7UltraFreeTLS | 1080B | 17 |
| TinyC6UltraFreeTLS | 1048B | 17 |
| TinyC5UltraFreeTLS | 1048B | 17 |
| TinyC4UltraFreeTLS | 536B | 9 |
| SmallMidV35TlsCtx | 192B | 3 |
| Total ULTRA (C4-C7) | 3712B | ~60 lines |
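
The cache-line counts in this summary follow from rounding each size up to a whole number of lines. A small sketch of that arithmetic, assuming 64-byte lines and structs that start on a line boundary (a struct that starts mid-line can touch one more line). The helper names `cache_lines`, `ultra_total_bytes`, and `ultra_total_lines` are illustrative and do not come from the codebase.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size in bytes. */
enum { CACHE_LINE_BYTES = 64 };

/* Number of cache lines needed to hold `bytes`, rounding up. */
static inline size_t cache_lines(size_t bytes) {
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Sizes copied from the summary table: C7, C6, C5, C4. */
static const size_t ultra_tls_bytes[4] = { 1080, 1048, 1048, 536 };

/* Sum of the four ULTRA struct sizes. */
static inline size_t ultra_total_bytes(void) {
    size_t total = 0;
    for (size_t i = 0; i < 4; i++) total += ultra_tls_bytes[i];
    return total;
}

/* Sum of the per-struct line counts (each struct rounded up separately). */
static inline size_t ultra_total_lines(void) {
    size_t total = 0;
    for (size_t i = 0; i < 4; i++) total += cache_lines(ultra_tls_bytes[i]);
    return total;
}
```

Each struct is rounded up separately because the four structs are separate TLS objects and do not share lines in this model.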

Problem Analysis

1. Minimum fields required on the hot path

| Operation | Required Fields |
| --- | --- |
| alloc (TLS hit) | count, freelist[count-1] |
| free (TLS push) | count, freelist[count], seg_base/end |

The hot path effectively needs only count + head + seg_range, about 24B.
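
To make the field usage concrete, here is a minimal sketch of the two hot-path operations on an array magazine. The names `mag_tls_t`, `mag_alloc`, and `mag_free` are hypothetical. The sketch also assumes that seg_base/seg_end are used as a half-open ownership range check on free, which this document does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { MAG_CAP = 128 };  /* magazine capacity, as in the C6 struct */

/* Illustrative per-class TLS magazine (hypothetical type). */
typedef struct {
    void     *freelist[MAG_CAP];
    uint8_t   count;
    uintptr_t seg_base;
    uintptr_t seg_end;
} mag_tls_t;

/* alloc (TLS hit): reads count and freelist[count-1]. */
static inline void *mag_alloc(mag_tls_t *t) {
    if (t->count == 0) return NULL;      /* miss: caller takes the slow path */
    return t->freelist[--t->count];
}

/* free (TLS push): reads count and seg_base/end, writes freelist[count].
 * Returns false when the pointer is outside the segment or the magazine
 * is full, so the caller can fall back to another path. */
static inline bool mag_free(mag_tls_t *t, void *p) {
    uintptr_t addr = (uintptr_t)p;
    if (addr < t->seg_base || addr >= t->seg_end) return false;
    if (t->count >= MAG_CAP) return false;
    t->freelist[t->count++] = p;
    return true;
}
```

Each operation touches count plus one freelist slot. The slot in use moves with count, so over time the accesses spread across the whole array.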

2. Current problems

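  1. The freelist arrays are huge: each class keeps a 512-1024B array in TLS.
  2. seg_base/end is duplicated across classes: if C4-C7 cover the same segment range, it could be shared.
  3. The placement of count is inconsistent: C7 puts it first, while C4-C6 put it after freelist.
  4. Cold fields are mixed into the hot region: C7's page_meta and similar fields are loaded on every access.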

3. Causes of cache misses

  • Every alloc/free touches the TLS struct dedicated to its class.
  • 4 classes × ~15 cache lines on average = ~60 cache lines compete for L1D.
  • Under a mixed workload, C4-C7 alternate at random and thrashing occurs.

Phase TLS-UNIFY-2a: C4-C6 Unified TLS (2025-12-12)

Implementation

The C4-C6 ULTRA TLS state is merged into a single box, TinyUltraTlsCtx:

typedef struct TinyUltraTlsCtx {
    // Hot line: counts (8B aligned)
    uint16_t c4_count;
    uint16_t c5_count;
    uint16_t c6_count;
    uint16_t _pad_count;

    // Per-class segment ranges (learned on first free)
    uintptr_t c4_seg_base, c4_seg_end;
    uintptr_t c5_seg_base, c5_seg_end;
    uintptr_t c6_seg_base, c6_seg_end;

    // Per-class array magazines
    void* c4_freelist[64];    // 512B
    void* c5_freelist[64];    // 512B
    void* c6_freelist[128];   // 1024B
} TinyUltraTlsCtx;
// Total: ~2KB per thread

Changes:

  • Merged the C4/C5/C6 TLS state into one struct
  • Kept the array-magazine scheme (safe)
  • C7 stays in its own box (already stable)
  • Removed delegation to TinyC4/5/6UltraFreeTLS
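
As a rough illustration of how one context can serve three classes, the sketch below repeats the struct above and adds a class-indexed view over it. The helpers `ultra_mag_view_t`, `ultra_view`, `ultra_push`, and `ultra_pop` are hypothetical and are not the functions in tiny_ultra_tls_box.c; segment-range learning is left out. On an LP64 target the fields sum to 8 + 48 + 2048 = 2104B, which matches the "~2KB" comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as the struct shown above. */
typedef struct TinyUltraTlsCtx {
    uint16_t c4_count, c5_count, c6_count, _pad_count;
    uintptr_t c4_seg_base, c4_seg_end;
    uintptr_t c5_seg_base, c5_seg_end;
    uintptr_t c6_seg_base, c6_seg_end;
    void *c4_freelist[64];
    void *c5_freelist[64];
    void *c6_freelist[128];
} TinyUltraTlsCtx;

/* Hypothetical view of one class's magazine inside the context. */
typedef struct {
    uint16_t *count;
    void    **slots;
    uint16_t  cap;
} ultra_mag_view_t;

/* Select the magazine for class 4, 5, or 6; returns 0 for other classes. */
static inline int ultra_view(TinyUltraTlsCtx *ctx, int cls, ultra_mag_view_t *v) {
    switch (cls) {
    case 4: v->count = &ctx->c4_count; v->slots = ctx->c4_freelist; v->cap = 64;  return 1;
    case 5: v->count = &ctx->c5_count; v->slots = ctx->c5_freelist; v->cap = 64;  return 1;
    case 6: v->count = &ctx->c6_count; v->slots = ctx->c6_freelist; v->cap = 128; return 1;
    default: return 0;  /* e.g. C7 lives in its own box */
    }
}

/* Push a freed block; returns 0 if the class is unknown or the magazine is full. */
static inline int ultra_push(TinyUltraTlsCtx *ctx, int cls, void *p) {
    ultra_mag_view_t v;
    if (!ultra_view(ctx, cls, &v) || *v.count >= v.cap) return 0;
    v.slots[(*v.count)++] = p;
    return 1;
}

/* Pop the most recently pushed block, or NULL on a miss. */
static inline void *ultra_pop(TinyUltraTlsCtx *ctx, int cls) {
    ultra_mag_view_t v;
    if (!ultra_view(ctx, cls, &v) || *v.count == 0) return NULL;
    return v.slots[--(*v.count)];
}
```

With this layout the three counts share the first 8 bytes of the struct, so a mixed C4-C6 workload reads them from a single cache line.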

A/B test results

| Test | v11b-1 (Phase 1) | TLS-UNIFY-2a | Diff |
| --- | --- | --- | --- |
| Mixed 16-1024B | 8.0-8.8 Mop/s | 8.5-9.0 Mop/s | +0~5% |
| MID 257-768B | 8.5-9.0 Mop/s | 8.1-9.0 Mop/s | ±0% |

Result: performance is equal or better, with no SEGVs or assertion failures.


Date: 2025-12-12
Phase: TLS-UNIFY-2a completed