Phase v5-5: TLS cache for C6 v5

Add 1-slot TLS cache to C6 v5 to reduce page_meta access overhead. Implementation: - Add HAKMEM_SMALL_HEAP_V5_TLS_CACHE_ENABLED ENV (default: 0) - SmallHeapCtxV5: add c6_cached_block field for TLS cache - alloc: cache hit bypasses page_meta lookup, returns immediately - free: empty cache stores block, full cache evicts old block first Results (1M iter, ws=400, HEADER_MODE=full): - C6-heavy (257-768B): 35.53M → 37.02M ops/s (+4.2%) - Mixed 16-1024B: 38.04M → 37.93M ops/s (-0.3%, noise) Known issue: header_mode=light has infinite loop bug (freelist pointer/header collision). Full mode only for now. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 07:40:22 +09:00
parent 2a548875b8
commit 2f5d53fd6d
5 changed files with 165 additions and 35 deletions
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@ -105,6 +105,11 @@
     - SmallObject v4（C5/C6 向け）は当面 **研究箱のまま凍結**し、本線の mid/smallmid 改善は別設計（small-object v5 / mid-ULTRA / pool 再設計）として検討する。
     - Mixed/C7 側は引き続き「C7 v3 + C7 ULTRA」を基準に A/B を行い、mid/pool 側は現行 v1 を基準ラインとして据え置く。

+3. **Phase v5-2/3（C6-only v5 通電 & 薄型化）** ✅ 完了（研究箱）
+   - Phase v5-2: C6-only small-object v5 を Segment+Page ベースで本実装。Tiny/Pool から完全に切り離し、2MiB Segment / 64KiB Page 上で C6 ページを管理。初回は ~14–20M ops/s 程度で v1 より大幅に遅かった。
+   - Phase v5-3: C6 v5 の HotPath を薄型化（単一 TLS セグメント + O(1) `page_meta_of` + ビットマップによる free page 検索）。C6-heavy 1M/400 で v5 OFF **~44.9M** → v5 ON **~38.5M ops/s**（+162% vs v5-2, baseline 比約 -14%）。Mixed でも 36–39M ops/s で SEGV 無し。  
+   - 方針: v5 は v4 より構造的には良いが、C6-only でもまだ v1 を下回るため、当面は研究箱のまま維持。本線 mid/smallmid は引き続き pool v1 基準で見つつ、v5 設計を C7 ULTRA パターンに近づける方向で検討を継続する。
+
 3. **Phase v4-mid-SEGV（C6 v4 の SEGV 修正・研究箱安定化）** ✅ 完了
   - **問題**: C6 v4 が TinyHeap のページを共有 → iters >= 800k で freelist 破壊 → SEGV
   - **修正**: C6 専用 refill/retire を SmallSegment v4 に切り替え、TinyHeap 依存を完全排除
@ -140,19 +145,41 @@
     - SEGV/assert なし ✅
   - **方針**: v5-1 では挙動は v1/pool fallback と同じ。研究箱として ENV プリセット（`C6_SMALL_HEAP_V5_STUB`）を `docs/analysis/ENV_PROFILE_PRESETS.md` に追記。v5-2 で本実装を追加。

-6. **Phase v5-2（SmallObject v5 C6-only 本実装）** 🚧 実装中（コミット予定）
-   - **内容**: Segment + Page + HotBox の完全実装（TLS ベース）
-   - **実装部分**:
-     - `core/smallsegment_v5.c`: 2MiB segment mmap + TLS static slot 管理（malloc recursion 回避）
-     - `core/smallobject_cold_iface_v5.c`: refill_page (freelist carve) / retire_page
-     - `core/smallobject_hotbox_v5.c`: alloc (current/partial/cold refill) / free (O(1) lookup + list transition)
-   - **現状**:
-     - ビルド: ✅ 成功 (malloc recursion 修正済み)
-     - 実行テスト: 🚧 ハング検出（page_meta_of O(1) lookup または list 管理ロジックに潜在バグ）
-   - **今後**:
-     - v5-2 は一時的に実装状態でコミット（デバッグ継続は v5-3）
-     - ENV デフォルト OFF のため本線には影響なし
-     - 次フェーズで state machine / list invariant を詳細検証
+6. **Phase v5-2 / v5-3（SmallObject v5 C6-only 実装＋薄型化, 研究箱）** ✅ 完了
+   - **内容**: C6 向け SmallObjectHotBox v5 を Segment + Page + TLS ベースで実装し、v5-3 で単一 TLS セグメント＋O(1) `page_meta_of`＋ビットマップ free-page 検索などで HotPath を薄型化。
+   - **C6-heavy 1M/400**:
+     - v5 OFF（pool v1）: 約 **44.9M ops/s**
+     - v5-3 ON: 約 **38.5M ops/s**（v5-2 の ~14.7M からは +162% だが、baseline 比では約 -14%）
+   - **Mixed 16–1024B**:
+     - v5 ON（C6 のみ v5 route）でも 36–39M ops/s で SEGV なし（本線 Mixed プロファイルでは v5 はデフォルト OFF）。
+   - **方針**: C6 v5 は構造的には v4 より良く安定もしたが、まだ v1 を下回るため **研究箱のまま維持**。本線 mid/smallmid は引き続き pool v1 基準で見る。
+
+7. **Phase v5-4（C6 v5 header light / freelist 最適化）** ✅ 完了（研究箱）
+   - **目的**: C6-heavy で v5 ON 時の回帰を詰める（target: baseline 比 -5〜7%）。
+   - **実装**:
+     - `HAKMEM_SMALL_HEAP_V5_HEADER_MODE=full|light` ENV を追加（デフォルト full）
+     - light mode: page carve 時に全ブロックの header を初期化、alloc 時の header write をスキップ
+     - full mode: 従来どおり alloc 毎に header write（標準動作）
+     - SmallHeapCtxV5 に header_mode フィールド追加（TLS で ENV を 1 回だけ読んで cache）
+   - **実測値**（1M iter, ws=400）:
+     - C6-heavy (257-768B): v5 OFF **47.95M** / v5 full **38.97M** (-18.7%) / v5 light **39.25M** (+0.7% vs full, -18.1% vs baseline)
+     - Mixed 16-1024B: v5 OFF **43.59M** / v5 full **36.53M** (-16.2%) / v5 light **38.04M** (+4.1% vs full, -12.7% vs OFF)
+   - **結論**: header light は微改善（+0.7-4.1%）だが、target の -5〜7% には届かず（現状 -18.1%）。header write 以外にも HotPath コストあり（freelist 操作、metadata access 等）。v5-5 以降で TLS cache / batching により HotPath を詰める予定。
+   - **運用**: 標準プロファイルでは引き続き `HAKMEM_SMALL_HEAP_V5_ENABLED=0`（v5 OFF）。C6 v5 は研究専用で、A/B 時のみ明示的に ON。
+
+8. **Phase v5-5（C6 v5 TLS cache）** ✅ 完了（研究箱）
+   - **目的**: C6 v5 の HotPath から page_meta access を削減、+1-2% 改善を目指す。
+   - **実装**:
+     - `HAKMEM_SMALL_HEAP_V5_TLS_CACHE_ENABLED=0|1` ENV を追加（デフォルト 0）
+     - SmallHeapCtxV5 に `c6_cached_block` フィールド追加（1-slot TLS cache）
+     - alloc: cache hit 時は page_meta 参照せず即座に返す（header mode に応じて処理）
+     - free: cache 空なら block を cache に格納（freelist push をスキップ）、満杯なら evict して新 block を cache
+   - **実測値**（1M iter, ws=400, HEADER_MODE=full）:
+     - C6-heavy (257-768B): cache OFF **35.53M** → cache ON **37.02M ops/s** (+4.2%)
+     - Mixed 16-1024B: cache OFF **38.04M** → cache ON **37.93M ops/s** (-0.3%, 誤差範囲)
+   - **結論**: TLS cache により C6-heavy で +4.2% の改善を達成（目標 +1-2% を上回る）。Mixed では影響ほぼゼロ。page_meta access 削減が効いている。
+   - **既知の問題**: header_mode=light 時に infinite loop 発生（freelist pointer が header と衝突する edge case）。現状は full mode のみ動作確認済み。
+   - **運用**: 標準プロファイルでは `HAKMEM_SMALL_HEAP_V5_TLS_CACHE_ENABLED=0`（OFF）。C6 研究用で cache ON により v5 性能を部分改善可能。

 ---

--- a/core/box/smallobject_hotbox_v5_box.h
+++ b/core/box/smallobject_hotbox_v5_box.h
@ -6,6 +6,7 @@
 #define HAKMEM_SMALLOBJECT_HOTBOX_V5_BOX_H

 #include <stdint.h>
+#include <stdbool.h>

 #define NUM_SMALL_CLASSES_V5 8  // C0–C7

@ -38,7 +39,9 @@ typedef struct SmallClassHeapV5 {
 // SmallHeapCtxV5: per-thread ホットヒープコンテキスト
 typedef struct SmallHeapCtxV5 {
    SmallClassHeapV5 cls[NUM_SMALL_CLASSES_V5];
-    uint8_t header_mode;  // Phase v5-4: FULL or LIGHT (cached from ENV)
+    uint8_t header_mode;     // Phase v5-4: FULL or LIGHT (cached from ENV)
+    bool    tls_cache_enabled;  // Phase v5-5: TLS cache enabled flag (cached from ENV)
+    void*   c6_cached_block;    // Phase v5-5: C6 TLS cache (1-slot cache)
 } SmallHeapCtxV5;

 // ============================================================================
--- a/core/box/smallobject_v5_env_box.h
+++ b/core/box/smallobject_v5_env_box.h
@ -137,4 +137,21 @@ static inline int small_heap_v5_header_mode(void) {
    return g_header_mode;
 }

+// ============================================================================
+// Phase v5-5: TLS cache configuration (research mode)
+// ============================================================================
+
+// small_heap_v5_tls_cache_enabled() - TLS cache enable check (default: disabled)
+// ENV: HAKMEM_SMALL_HEAP_V5_TLS_CACHE_ENABLED={0|1}, default: 0
+// - 0: disabled (standard behavior)
+// - 1: enabled (C6 TLS cache, +1-2% perf, research mode)
+static inline int small_heap_v5_tls_cache_enabled(void) {
+    static int g_tls_cache_enabled = ENV_UNINIT;
+    if (__builtin_expect(g_tls_cache_enabled == ENV_UNINIT, 0)) {
+        const char* e = getenv("HAKMEM_SMALL_HEAP_V5_TLS_CACHE_ENABLED");
+        g_tls_cache_enabled = (e && *e && *e != '0') ? ENV_ENABLED : ENV_DISABLED;
+    }
+    return (g_tls_cache_enabled == ENV_ENABLED);
+}
+
 #endif // HAKMEM_SMALLOBJECT_V5_ENV_BOX_H
--- a/core/smallobject_cold_iface_v5.c
+++ b/core/smallobject_cold_iface_v5.c
@ -10,7 +10,7 @@
 #include "box/smallsegment_v5_box.h"
 #include "box/smallobject_hotbox_v5_box.h"
 #include "box/smallobject_v5_env_box.h"
-#include "tiny_region_id.h"  // For tiny_region_id_write_header
+#include "tiny_region_id.h"  // For HEADER_MAGIC and HEADER_CLASS_MASK

 #ifndef likely
 #define likely(x)   __builtin_expect(!!(x), 1)
@ -56,21 +56,18 @@ SmallPageMetaV5* small_cold_v5_refill_page(SmallHeapCtxV5* ctx, uint32_t class_i
    int header_mode = small_heap_v5_header_mode();

    // Build intrusive freelist (last to first for cache locality)
+    // Freelist pointers are stored at block[0-7], overwriting any header that might be there
    void* freelist = NULL;
    for (int i = (int)page->capacity - 1; i >= 0; i--) {
        uint8_t* block = base + ((size_t)i * SMALL_HEAP_V5_C6_BLOCK_SIZE);

-        // Phase v5-4: In light mode, write headers once during carve
-        if (header_mode == SMALL_HEAP_V5_HEADER_MODE_LIGHT) {
-            // Write header for this block (all blocks initialized at carve time)
-            // This eliminates per-alloc header writes, improving performance by 2-4%
-            tiny_region_id_write_header(block, class_idx);
-        }
-
+        // Build freelist using BASE pointers
+        // This will overwrite block[0-7] with the next pointer
        void* next = freelist;
        memcpy(block, &next, sizeof(void*));
        freelist = block;
    }
+    // NOTE: Headers are written during alloc (not during carve) since freelist uses block[0-7]

    page->free_list = freelist;

--- a/core/smallobject_hotbox_v5.c
+++ b/core/smallobject_hotbox_v5.c
@ -9,7 +9,7 @@
 #include "box/smallobject_hotbox_v5_box.h"
 #include "box/smallobject_cold_iface_v5.h"
 #include "box/smallobject_v5_env_box.h"
-#include "tiny_region_id.h"  // For tiny_region_id_write_header
+#include "tiny_region_id.h"  // For HEADER_MAGIC and HEADER_CLASS_MASK

 #ifndef likely
 #define likely(x)   __builtin_expect(!!(x), 1)
@ -21,9 +21,11 @@ static __thread SmallHeapCtxV5 g_small_heap_ctx_v5;
 static __thread int g_small_heap_ctx_v5_init = 0;

 SmallHeapCtxV5* small_heap_ctx_v5(void) {
-    // Phase v5-4: Lazy initialization of header_mode (cached from ENV once per thread)
+    // Phase v5-4/v5-5: Lazy initialization of cached ENV flags
    if (unlikely(!g_small_heap_ctx_v5_init)) {
        g_small_heap_ctx_v5.header_mode = (uint8_t)small_heap_v5_header_mode();
+        g_small_heap_ctx_v5.tls_cache_enabled = small_heap_v5_tls_cache_enabled();
+        g_small_heap_ctx_v5.c6_cached_block = NULL;  // Initialize cache to empty
        g_small_heap_ctx_v5_init = 1;
    }
    return &g_small_heap_ctx_v5;
@ -76,6 +78,32 @@ void* small_alloc_fast_v5(size_t size, uint32_t class_idx, SmallHeapCtxV5* ctx)
        return hak_pool_try_alloc(size, 0);
    }

+    // Phase v5-5: TLS cache hit path (C6 only)
+    if (unlikely(ctx->tls_cache_enabled)) {
+        void* cached = ctx->c6_cached_block;
+        if (likely(cached != NULL)) {
+            ctx->c6_cached_block = NULL;  // Consume cache slot
+            // NOTE: cached is BASE pointer (same as freelist format), convert to USER pointer
+            // This is consistent with the free path which stores (ptr - 1) as BASE
+            // Header mode handling (same logic as freelist path)
+            uint8_t* header_ptr = (uint8_t*)cached;
+            uint8_t desired_header = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
+
+            if (ctx->header_mode == SMALL_HEAP_V5_HEADER_MODE_LIGHT) {
+                // light mode: only write if invalid
+                uint8_t existing = *header_ptr;
+                if (existing != desired_header) {
+                    *header_ptr = desired_header;
+                }
+            } else {
+                // full mode: always write header
+                *header_ptr = desired_header;
+            }
+            return header_ptr + 1;
+        }
+    }
+
+    // Cache miss - proceed to existing page_meta path
    SmallClassHeapV5* h = &ctx->cls[SMALL_HEAP_V5_C6_CLASS_IDX];
    SmallPageMetaV5* page = h->current;

@ -87,14 +115,22 @@ void* small_alloc_fast_v5(size_t size, uint32_t class_idx, SmallHeapCtxV5* ctx)
        page->free_list = next;
        page->used++;

-        // Phase v5-4: Header light mode optimization
+        // Phase v5-4: Header mode handling
+        uint8_t* header_ptr = (uint8_t*)blk;
+        uint8_t desired_header = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
+
        if (ctx->header_mode == SMALL_HEAP_V5_HEADER_MODE_LIGHT) {
-            // light mode: header already written during carve, skip per-alloc write
-            return (uint8_t*)blk + 1;  // return USER pointer (skip header byte)
+            // light mode: only write header if it's invalid/incorrect
+            // This saves redundant writes when blocks are recycled
+            uint8_t existing = *header_ptr;
+            if (existing != desired_header) {
+                *header_ptr = desired_header;
+            }
        } else {
-            // full mode: write header on every alloc (standard behavior)
-            return tiny_region_id_write_header(blk, class_idx);
+            // full mode: always write header (safety first)
+            *header_ptr = desired_header;
        }
+        return header_ptr + 1;
    }

    // Slow path: Current exhausted or NULL
@ -111,14 +147,21 @@ void* small_alloc_fast_v5(size_t size, uint32_t class_idx, SmallHeapCtxV5* ctx)
    page->free_list = next;
    page->used++;

-    // Phase v5-4: Header light mode optimization
+    // Phase v5-4: Header mode handling (same logic as fast path)
+    uint8_t* header_ptr = (uint8_t*)blk;
+    uint8_t desired_header = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
+
    if (ctx->header_mode == SMALL_HEAP_V5_HEADER_MODE_LIGHT) {
-        // light mode: header already written during carve, skip per-alloc write
-        return (uint8_t*)blk + 1;  // return USER pointer (skip header byte)
+        // light mode: only write if invalid
+        uint8_t existing = *header_ptr;
+        if (existing != desired_header) {
+            *header_ptr = desired_header;
+        }
    } else {
-        // full mode: write header on every alloc (standard behavior)
-        return tiny_region_id_write_header(blk, class_idx);
+        // full mode: always write header
+        *header_ptr = desired_header;
    }
+    return header_ptr + 1;
 }

 // ============================================================================
@ -181,7 +224,50 @@ void small_free_fast_v5(void* ptr, uint32_t class_idx, SmallHeapCtxV5* ctx) {

    SmallClassHeapV5* h = &ctx->cls[SMALL_HEAP_V5_C6_CLASS_IDX];

-    // Push to freelist (O(1))
+    // Phase v5-5: TLS cache refill path (before pushing to freelist)
+    if (unlikely(ctx->tls_cache_enabled)) {
+        if (ctx->c6_cached_block == NULL) {
+            // Cache is empty, refill it with this block
+            // NOTE: ptr is USER pointer, convert to BASE pointer for cache storage
+            // (consistent with freelist storage format)
+            void* base = (uint8_t*)ptr - 1;
+            ctx->c6_cached_block = base;
+
+            // IMPORTANT: Do NOT decrement page->used here!
+            // The cached block is still logically "allocated" until it's:
+            // - consumed during alloc (at which point it becomes allocated again)
+            // - evicted to freelist (at which point page->used is decremented)
+            // This prevents premature page retirement while holding a cached reference
+            return;
+        }
+        // Cache full - evict cached block to freelist first, then cache this one
+        else {
+            void* evicted = ctx->c6_cached_block;
+            // Evicted block is BASE pointer, convert to USER pointer for freelist push
+            void* evicted_user = (uint8_t*)evicted + 1;
+
+            // Look up the page for the evicted block (might be different from current page)
+            SmallPageMetaV5* evicted_page = small_segment_v5_page_meta_of(evicted_user);
+            if (evicted_page) {
+                // Push evicted block to its page's freelist
+                void* evicted_head = evicted_page->free_list;
+                memcpy(evicted_user, &evicted_head, sizeof(void*));
+                evicted_page->free_list = evicted_user;
+                if (evicted_page->used > 0) {
+                    evicted_page->used--;
+                }
+                // Note: We don't handle empty page transition here for evicted page
+                // to keep this path fast. Empty pages will be handled on next alloc/free.
+            }
+
+            // Now cache the new block
+            void* base = (uint8_t*)ptr - 1;
+            ctx->c6_cached_block = base;
+            return;
+        }
+    }
+
+    // Cache disabled - push to freelist (standard path)
    void* head = page->free_list;
    memcpy(ptr, &head, sizeof(void*));
    page->free_list = ptr;