hakmem/docs/analysis/TLS_LAYOUT_V11B1_PLAN.md
Moe Charm (CI) 1a8652a91a Phase TLS-UNIFY-3: C6 intrusive freelist implementation (complete)
Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS

Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 16:26:42 +09:00


TLS Layout Plan: Unified ULTRA TLS (Phase v11b-2 Target)

Goal

Fit the C4-C7 ULTRA hot path into a single cache line (64B).

Design: TinyUltraTlsCtx

// ============================================================================
// LINE 1: Hot fields (64B) - alloc/free hot path
// ============================================================================
typedef struct TinyUltraTlsCtx {
    // Counts (8B total: 4 x 2B, so the heads below start 8B-aligned without padding)
    uint16_t c4_count;      // 2B
    uint16_t c5_count;      // 2B
    uint16_t c6_count;      // 2B
    uint16_t c7_count;      // 2B

    // Freelist heads (32B)
    void* c4_head;          // 8B - next free slot for C4
    void* c5_head;          // 8B
    void* c6_head;          // 8B
    void* c7_head;          // 8B

    // Segment range (shared across C4-C7, 16B)
    uintptr_t seg_base;     // 8B
    uintptr_t seg_end;      // 8B

    // ========== LINE 1 END: 56B used, 8B spare ==========

    uint64_t _hot_pad;      // 8B - align to 64B

    // ============================================================================
    // LINE 2+: Cold fields (refill/retire, debug, stats)
    // ============================================================================

    // Freelist tails (for bulk push, 32B)
    void* c4_tail;          // 8B
    void* c5_tail;          // 8B
    void* c6_tail;          // 8B
    void* c7_tail;          // 8B

    // Segment metadata (16B)
    void* segment;          // 8B - owning segment pointer
    uint32_t page_idx;      // 4B - current page index
    uint32_t _cold_pad;     // 4B

    // Stats (optional, 16B)
    uint64_t alloc_count;   // 8B
    uint64_t free_count;    // 8B

} TinyUltraTlsCtx;
// Total: 128B (2 cache lines)

Memory Layout

Offset  Field           Size    Cache Line
------  -----           ----    ----------
0x00    c4_count        2B      LINE 1 (HOT)
0x02    c5_count        2B      LINE 1
0x04    c6_count        2B      LINE 1
0x06    c7_count        2B      LINE 1
0x08    c4_head         8B      LINE 1
0x10    c5_head         8B      LINE 1
0x18    c6_head         8B      LINE 1
0x20    c7_head         8B      LINE 1
0x28    seg_base        8B      LINE 1
0x30    seg_end         8B      LINE 1
0x38    _hot_pad        8B      LINE 1
------  -----           ----    ----------
0x40    c4_tail         8B      LINE 2 (COLD)
0x48    c5_tail         8B      LINE 2
0x50    c6_tail         8B      LINE 2
0x58    c7_tail         8B      LINE 2
0x60    segment         8B      LINE 2
0x68    page_idx        4B      LINE 2
0x6C    _cold_pad       4B      LINE 2
0x70    alloc_count     8B      LINE 2
0x78    free_count      8B      LINE 2
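
The offsets in the table can be checked at compile time. The following is a minimal sketch, assuming an LP64 target (8-byte pointers) and a 64B cache line. `CACHE_LINE`, `g_demo_ultra_ctx`, and `demo_ctx_is_line_aligned` are hypothetical names introduced for illustration, not part of the plan. The struct's natural alignment is only 8B, so the sketch also requests 64B alignment on the thread-local instance. Without that, the hot fields could straddle two lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumption: 64B cache lines, as stated in the Goal */

/* Same field order as the plan; comments trimmed for brevity. */
typedef struct TinyUltraTlsCtx {
    uint16_t  c4_count, c5_count, c6_count, c7_count;
    void*     c4_head;
    void*     c5_head;
    void*     c6_head;
    void*     c7_head;
    uintptr_t seg_base, seg_end;
    uint64_t  _hot_pad;
    void*     c4_tail;
    void*     c5_tail;
    void*     c6_tail;
    void*     c7_tail;
    void*     segment;
    uint32_t  page_idx, _cold_pad;
    uint64_t  alloc_count, free_count;
} TinyUltraTlsCtx;

/* These fail to compile if the layout drifts from the table (LP64 assumed). */
static_assert(offsetof(TinyUltraTlsCtx, c4_head) == 0x08, "heads start at 0x08");
static_assert(offsetof(TinyUltraTlsCtx, seg_end) + sizeof(uintptr_t) <= CACHE_LINE,
              "all hot fields fit in line 1");
static_assert(offsetof(TinyUltraTlsCtx, c4_tail) == CACHE_LINE,
              "cold fields start at line 2");
static_assert(offsetof(TinyUltraTlsCtx, free_count) == 0x78, "last field at 0x78");
static_assert(sizeof(TinyUltraTlsCtx) == 2 * CACHE_LINE, "total is 128B");

/* Hypothetical per-thread instance, aligned so that offset 0 begins a cache line. */
static _Thread_local _Alignas(CACHE_LINE) TinyUltraTlsCtx g_demo_ultra_ctx;

/* Returns 1 when the calling thread's context starts on a cache-line boundary. */
static int demo_ctx_is_line_aligned(void) {
    return ((uintptr_t)&g_demo_ultra_ctx % CACHE_LINE) == 0;
}
```

If the assertions are placed next to the struct definition, a field reorder that pushes `seg_end` past offset 0x40 becomes a build error rather than a silent performance regression.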

Hot Path Access Pattern

alloc (TLS hit)

static inline void* tiny_ultra_alloc_fast(TinyUltraTlsCtx* ctx, uint8_t class_idx) {
    // TLS access stays within one cache line: counts and heads both live in LINE 1
    uint16_t* counts = &ctx->c4_count;
    void** heads = &ctx->c4_head;
    unsigned idx = class_idx - 4;   // C4..C7 -> 0..3

    uint16_t c = counts[idx];
    if (likely(c > 0)) {
        void* p = heads[idx];
        heads[idx] = *(void**)p;    // pop: advance head to the next free block
        counts[idx] = c - 1;
        return p;
    }
    return tiny_ultra_alloc_slow(ctx, class_idx);
}

free (TLS push)

static inline void tiny_ultra_free_fast(TinyUltraTlsCtx* ctx, void* ptr, uint8_t class_idx) {
    // Range check (seg_base/end in same cache line)
    uintptr_t p = (uintptr_t)ptr;
    if (likely(p >= ctx->seg_base && p < ctx->seg_end)) {
        // Push to freelist (single cache line)
        void** heads = &ctx->c4_head;
        uint16_t* counts = &ctx->c4_count;

        *(void**)ptr = heads[class_idx - 4];
        heads[class_idx - 4] = ptr;
        counts[class_idx - 4]++;
        return;
    }
    tiny_ultra_free_slow(ctx, ptr, class_idx);
}
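
To illustrate how the two fast paths interact, here is a minimal single-threaded sketch of a free/alloc round trip. It assumes the link is stored in the first word of each free block, as in `tiny_ultra_free_fast` above. `DemoBlock`, `DemoHotCtx`, and the `demo_*` functions are hypothetical stand-ins, and the slow paths are replaced by simple return values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block type: the first word doubles as the freelist link. */
typedef union DemoBlock {
    void*         next;
    unsigned char bytes[512];   /* C6-sized payload (257-512B) */
} DemoBlock;

/* Reduced stand-in for the hot line of TinyUltraTlsCtx (index 0..3 = C4..C7). */
typedef struct DemoHotCtx {
    uint16_t  count[4];
    void*     head[4];
    uintptr_t seg_base;
    uintptr_t seg_end;
} DemoHotCtx;

/* Returns 1 if pushed, 0 if the pointer is outside the segment (slow path). */
static int demo_free_fast(DemoHotCtx* ctx, void* ptr, unsigned class_idx) {
    uintptr_t p = (uintptr_t)ptr;
    if (p < ctx->seg_base || p >= ctx->seg_end) return 0;
    unsigned i = class_idx - 4;
    *(void**)ptr = ctx->head[i];   /* link the block to the old head */
    ctx->head[i] = ptr;
    ctx->count[i]++;
    return 1;
}

/* Returns the most recently freed block, or NULL where the slow path would run. */
static void* demo_alloc_fast(DemoHotCtx* ctx, unsigned class_idx) {
    unsigned i = class_idx - 4;
    if (ctx->count[i] == 0) return NULL;
    void* p = ctx->head[i];
    ctx->head[i] = *(void**)p;     /* advance head to the next free block */
    ctx->count[i]--;
    return p;
}

/* Self-check: three frees followed by three allocs come back in LIFO order. */
static int demo_lifo_roundtrip(void) {
    static DemoBlock seg[3];       /* plays the role of the owning segment */
    static DemoBlock outside;      /* a block that is not in the segment */
    DemoHotCtx ctx = {0};
    ctx.seg_base = (uintptr_t)&seg[0];
    ctx.seg_end  = ctx.seg_base + sizeof seg;

    if (demo_free_fast(&ctx, &outside, 6)) return 0;   /* must be rejected */
    for (int k = 0; k < 3; k++)
        if (!demo_free_fast(&ctx, &seg[k], 6)) return 0;

    return demo_alloc_fast(&ctx, 6) == &seg[2]
        && demo_alloc_fast(&ctx, 6) == &seg[1]
        && demo_alloc_fast(&ctx, 6) == &seg[0]
        && demo_alloc_fast(&ctx, 6) == NULL;
}
```

Note that the pop reads the link word inside the block itself, so the "single cache line" property applies to the TLS context only. The block's own line is touched on both push and pop.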

Comparison: Before vs After

Metric                Current (v11b-1)   Unified (v11b-2)
------                ----------------   ----------------
TLS size (C4-C7)      3712B              128B
Cache lines (hot)     ~60                1
seg_base/end copies   4                  1
count access          scattered          contiguous

Freelist Design: Linked List vs Array

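Choice: Linked List (head/tail)

Rationale:

  1. No fixed array needed: removes the 1KB freelist[128] array
  2. O(1) push/pop: the head pointer alone is sufficient
  3. Bulk drain: with a tail pointer, the whole list can be returned in one operation
  4. Memory efficiency: links are stored only in the slots currently on the list

Trade-offs:

  • Harder to prefetch (an array allows sequential access)
  • Spatial locality may degrade

→ An array-based version can be reconsidered after profiling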
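
The bulk-drain point (item 3) can be sketched as follows. This is a minimal single-threaded illustration under stated assumptions, not the planned implementation. `DemoNode`, `DemoLocalList`, `DemoCentralList`, and the `demo_*` functions are hypothetical, and a real shared list would need synchronization. The key step is that the tail's link word is pointed at the destination's head, which splices the entire chain without walking it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free block: the first word is the link. */
typedef union DemoNode {
    void*         next;
    unsigned char bytes[64];
} DemoNode;

typedef struct { void* head; void* tail; uint16_t count; } DemoLocalList;
typedef struct { void* head; size_t count; } DemoCentralList;

/* LIFO push that also records the tail when the list was empty. */
static void demo_local_push(DemoLocalList* l, void* blk) {
    *(void**)blk = l->head;
    if (l->head == NULL) l->tail = blk;   /* first block is also the tail */
    l->head = blk;
    l->count++;
}

/* Moves every local block to the central list in O(1); returns blocks moved. */
static size_t demo_bulk_drain(DemoLocalList* l, DemoCentralList* c) {
    if (l->head == NULL) return 0;
    size_t n = l->count;
    *(void**)l->tail = c->head;           /* splice: local tail -> old central head */
    c->head = l->head;
    c->count += n;
    l->head = NULL;
    l->tail = NULL;
    l->count = 0;
    return n;
}

/* Self-check: returns the number of blocks drained (3) if the splice is correct. */
static size_t demo_drain_selftest(void) {
    static DemoNode blocks[3];
    static DemoNode central_old;          /* a block already on the central list */
    DemoLocalList local = {0};
    DemoCentralList central = { &central_old, 1 };
    central_old.next = NULL;

    for (int k = 0; k < 3; k++) demo_local_push(&local, &blocks[k]);
    size_t moved = demo_bulk_drain(&local, &central);

    /* Expected chain: blocks[2] -> blocks[1] -> blocks[0] -> central_old */
    int ok = moved == 3 && local.head == NULL && central.count == 4
          && central.head == &blocks[2] && blocks[0].next == &central_old;
    return ok ? moved : 0;
}
```

Maintaining the tail costs one extra branch on push, which is a reason the plan keeps the tails on the cold line and leaves the hot-path push unchanged.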

Implementation Notes

  1. Backward Compatibility: keep the existing TinyC*UltraFreeTLS API while using TinyUltraTlsCtx internally
  2. Gradual Migration: migrate C7 to the new structure first and measure the effect
  3. ENV Gate: enabled with HAKMEM_ULTRA_UNIFIED_TLS=1
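
One possible shape for the ENV gate is a lazily cached check. The sketch below is an assumption about how such a gate could look, not the project's actual gate code. `demo_parse_gate` and `demo_unified_tls_enabled` are hypothetical names, and treating an unset variable or any value other than "1" as OFF is also an assumption. The cache is thread-local so that the first read in each thread does not race with other threads.

```c
#include <stdlib.h>
#include <string.h>

/* Pure helper: only the exact string "1" enables the gate (assumed policy). */
static int demo_parse_gate(const char* value) {
    return (value != NULL && strcmp(value, "1") == 0) ? 1 : 0;
}

/* Reads HAKMEM_ULTRA_UNIFIED_TLS once per thread and caches the result. */
static int demo_unified_tls_enabled(void) {
    static _Thread_local int cached = -1;   /* -1 = not read yet */
    if (cached < 0) {
        cached = demo_parse_gate(getenv("HAKMEM_ULTRA_UNIFIED_TLS"));
    }
    return cached;
}
```

After the first call the check reduces to one thread-local load and a compare, so it can sit at the top of the alloc/free entry points while the legacy TinyC*UltraFreeTLS path remains the fallback.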

Date: 2025-12-12 Phase: v11b-2 planning