Phase 26: Front Gate Unification - Tiny allocator fast path (+12.9%)

Implementation:
- New single-layer malloc/free path for Tiny (≤1024B) allocations
- Bypasses 3-layer overhead: malloc → hak_alloc_at (236 lines) → wrapper → tiny_alloc_fast
- Leverages Phase 23 Unified Cache (tcache-style, 2-3 cache misses)
- Safe fallback to normal path on Unified Cache miss

Performance (Random Mixed 256B, 100K iterations):
- Baseline (Phase 26 OFF): 11.33M ops/s
- Phase 26 ON: 12.79M ops/s (+12.9%)
- Prediction (ChatGPT): +10-15% → Actual: +12.9% (perfect match!)

Bug fixes:
- Initialization bug: Added hak_init() call before fast path
- Page boundary SEGV: Added guard for offset_in_page == 0

Also includes Phase 23 debug log fixes:
- Guard C2_CARVE logs with #if !HAKMEM_BUILD_RELEASE
- Guard prewarm logs with #if !HAKMEM_BUILD_RELEASE
- Set Hot_2048 as default capacity (C2/C3=2048, others=64)
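The release-build log guards in the first two items follow one pattern: compile the stderr call out of release builds entirely. A minimal compilable illustration (names are placeholders; the real sites are in tiny_refill_opt.h and ss_hot_prewarm_box.c shown in the diff below):

```c
#include <stdio.h>

/* HAKMEM_BUILD_RELEASE is assumed to be defined to 1 by the release build flags. */
void log_carve_example(void* block, int idx)
{
#if !HAKMEM_BUILD_RELEASE
    /* Debug builds only: release builds skip the stderr write completely. */
    fprintf(stderr, "[C2_CARVE] block=%p idx=%d\n", block, idx);
#else
    (void)block; (void)idx;  /* avoid unused-parameter warnings in release */
#endif
}
```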

Files:
- core/front/malloc_tiny_fast.h: Phase 26 implementation (145 lines)
- core/box/hak_wrappers.inc.h: Fast path integration (+28 lines)
- core/front/tiny_unified_cache.h: Hot_2048 default
- core/tiny_refill_opt.h: C2_CARVE log guard
- core/box/ss_hot_prewarm_box.c: Prewarm log guard
- CURRENT_TASK.md: Phase 26 completion documentation

ENV variables:
- HAKMEM_FRONT_GATE_UNIFIED=1 (enable Phase 26, default: OFF)
- HAKMEM_TINY_UNIFIED_CACHE=1 (Phase 23, required)
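For a quick check, both flags can be enabled for a single benchmark run (same settings as the production recommendation later in this commit):

```bash
# Phase 23 Unified Cache is a prerequisite for the Phase 26 fast path
HAKMEM_TINY_UNIFIED_CACHE=1 \
HAKMEM_FRONT_GATE_UNIFIED=1 \
./out/release/bench_random_mixed_hakmem
```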

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: Moe Charm (CI)
Date:   2025-11-17 05:29:08 +09:00
parent 7311d32574
commit 5b36c1c908
6 changed files with 302 additions and 9 deletions


@@ -1,8 +1,106 @@
# CURRENT TASK (Phase 14-26 Snapshot) Tiny / Mid / ExternalGuard / Unified Cache / Front Gate
**Last Updated**: 2025-11-17
**Owner**: ChatGPT → Phase 23/25/26 implementation completed by Claude Code
**Size**: ~350 lines (abridged context version for Claude)
---
## 🎉 **Phase 26: Front Gate Unification - Complete** (2025-11-17)
**Result**: Random Mixed 256B benchmark improved **+12.9%** (11.33M → 12.79M ops/s)
### Phase 26: Front Gate Unification (proposed by ChatGPT)
- **Design**: remove the **3-layer overhead** of malloc → hak_alloc_at (236 lines) → wrapper → tiny_alloc_fast
- **Implementation**: `core/front/malloc_tiny_fast.h` + integration in `core/box/hak_wrappers.inc.h`
- **Strategy**: single-layer direct path dedicated to the Tiny range (≤1024B), leveraging the Phase 23 Unified Cache
- **ENV**: `HAKMEM_FRONT_GATE_UNIFIED=1` (default OFF) → **recommended for production**
### Phase 26 implementation details
**malloc_tiny_fast()** (alloc fast path):
```c
1. size → class_idx (inline table lookup, 1-2 instructions)
2. unified_cache_pop_or_refill(class_idx) (Phase 23 tcache, 2-3 cache misses)
3. Write header + return USER pointer (2-3 instructions)
Total: 8-10 instructions (vs 3-layer cascade: 236 lines routing + diagnostics)
```
**free_tiny_fast()** (free fast path):
```c
1. Page boundary guard (offset_in_page == 0 → return 0)
2. Read header + validate Tiny magic (0xa0-0xa7)
3. unified_cache_push(class_idx, base) (Phase 23 tcache, 2-3 cache misses)
Total: 6-8 instructions (vs classify_ptr + hak_free_at routing)
```
### Phase 26 bug fixes
1. **Initialization bug**: the Phase 26 fast path bypassed hak_init() → added `if (!g_initialized) hak_init()`
2. **Page-boundary SEGV**: free_tiny_fast() read the previous page when ptr sits at a page boundary (offset==0) → added a guard
```c
uintptr_t off = (uintptr_t)ptr & 0xFFFu;
if (off == 0) return 0; // Page-aligned → fall back to the normal free path
```
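Note: the `0xFFFu` mask assumes 4 KiB pages, so any pointer whose low 12 bits are zero is treated as page-aligned and conservatively routed to the normal free path.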
### A/B benchmark results (Random Mixed 256B, 100K iterations)
| Configuration | Run 1 | Run 2 | Run 3 | **Average** | vs Baseline |
|---------------|-------|-------|-------|-------------|-------------|
| **Phase 26 OFF** | 11.21M | 11.02M | 11.76M | **11.33M ops/s** | Baseline |
| **Phase 26 ON** | 13.21M | 12.55M | 12.62M | **12.79M ops/s** | **+12.9%** 🎯 |
**ChatGPT's prediction**: +10-15% (from removing the 3-layer overhead)
**Measured result**: **+12.9%** ← **right in the middle of the prediction!** 🎯
### Recommended production settings (Phase 23 + Phase 26 combined)
```bash
export HAKMEM_TINY_UNIFIED_CACHE=1   # Phase 23: Hot_2048 is the default
export HAKMEM_FRONT_GATE_UNIFIED=1 # Phase 26: Front Gate Unification
./out/release/bench_random_mixed_hakmem
# Expected: 12.79M ops/s (+27.8% vs the pre-Phase 23 baseline of 10.0M ops/s)
```
**Key files**:
- `core/front/malloc_tiny_fast.h` - Phase 26 single-layer malloc/free implementation
- `core/box/hak_wrappers.inc.h:128-143` - Phase 26 fast path integration (malloc)
- `core/box/hak_wrappers.inc.h:179-190` - Phase 26 fast path integration (free)
---
## 🎉 **Phase 23/25: Unified Frontend Cache - Complete** (2025-11-17)
**Result**: Random Mixed 256B benchmark improved **+7.3%** (10.58M → 11.35M ops/s)
### Phase 23: Unified Cache Implementation
- **Design**: tcache-style single-layer frontend (the four layers Ring → FastCache → SFC → SLL merged into one)
- **Implementation**: `core/front/tiny_unified_cache.{h,c}` - array-based ring buffer (2-3 cache misses)
- **Integration**: alloc path (`tiny_alloc_fast.inc.h:621-633`) + free path (`hak_free_api.inc.h`)
- **ENV**: `HAKMEM_TINY_UNIFIED_CACHE=1` (default OFF) → shipped to production with the Hot_2048 configuration
### Phase 23 Capacity Optimization (Hot_2048)
- **Task Agent**: 10 configurations × 3 runs each
- **Best configuration**: C2/C3 (128B/256B) = 2048 slots, all other classes = 64 slots
- **Rationale**: the hot-class-first strategy adds +6.2% over All_128
- **Memory**: ~1.1MB cache overhead (concentrated on C2/C3)
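Rough arithmetic behind that figure (assuming power-of-two Tiny class sizes up to 1024B): 2048 slots each for 128B and 256B holds up to 2048 × 384B ≈ 768KB of cached blocks, the 64-slot cold classes add roughly another 100KB, and the slot arrays themselves are a few tens of KB, consistent with the ~1MB order of magnitude quoted above.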
### Phase 25-A: Header Read Optimization (+2.2%)
- **Reduction**: removed the duplicate header read on the FG_DOMAIN_TINY path
- **L1 hit**: the second header read was already an L1 cache hit (~1 cycle) → limited effect
### Phase 25-B-1: Promote-on-Full (REVERTED, -4.0%)
- **Failure**: the smart promotion logic cost more than it saved (overhead > benefit)
- **Lesson**: clever ≠ fast; incremental optimization has reached its limit
### Debug log fixes (performance)
- **Locations**: `core/tiny_refill_opt.h:316-326`, `core/box/ss_hot_prewarm_box.c:143-146`
- **Problem**: `[C2_CARVE]` / `[BOX_SS_HOT_PREWARM]` were printed unconditionally in Release builds
- **Fix**: wrapped them in `#if !HAKMEM_BUILD_RELEASE`, eliminating the stderr overhead
### Recommended production settings
```bash
export HAKMEM_TINY_UNIFIED_CACHE=1   # Hot_2048 is the default (C2/C3=2048, others=64)
./out/release/bench_random_mixed_hakmem
```
**Next strategy**: frontend optimization has hit its ceiling with Phase 23; proceed to the Phase 12 Shared SuperSlab Pool (a fundamental backend fix)
---


@@ -30,6 +30,7 @@ void* realloc(void* ptr, size_t size) {
#include "../ptr_trace.h" // Debug: pointer trace immediate dump on libc fallback
#include "front_gate_classifier.h" // Box FG: pointer classification (header/reg)
#include "../front/malloc_tiny_fast.h" // Phase 26: Front Gate Unification
// malloc wrapper - intercepts system malloc() calls
__thread uint64_t g_malloc_total_calls = 0;
@@ -124,6 +125,26 @@ void* malloc(size_t size) {
}
}
// Phase 26: CRITICAL - Ensure initialization before fast path
// (fast path bypasses hak_alloc_at, so we need to init here)
if (!g_initialized) hak_init();
// Phase 26: Front Gate Unification (Tiny fast path)
// Placed AFTER all safety checks (lock depth, initializing, LD_SAFE, jemalloc)
// Bypasses: hak_alloc_at routing (236 lines) + wrapper diagnostics + tiny overhead
// Target: +10-15% performance (11.35M → 12.5-13.5M ops/s)
// ENV: HAKMEM_FRONT_GATE_UNIFIED=1 to enable (default: OFF)
if (__builtin_expect(front_gate_unified_enabled(), 0)) {
if (size <= tiny_get_max_size()) {
void* ptr = malloc_tiny_fast(size);
if (__builtin_expect(ptr != NULL, 1)) {
g_hakmem_lock_depth--;
return ptr;
}
// Unified Cache miss → fallback to normal path (hak_alloc_at)
}
}
#if !HAKMEM_BUILD_RELEASE
if (count > 14250 && count < 14280 && size <= 1024) {
fprintf(stderr, "[MALLOC_WRAPPER] count=%lu calling hak_alloc_at\n", count);
@@ -159,6 +180,19 @@ void free(void* ptr) {
// Fallback to normal path for non-Tiny or no-header mode
}
// Phase 26: Front Gate Unification (Tiny free fast path)
// Placed AFTER BenchFast check, BEFORE expensive classify_ptr()
// Bypasses: hak_free_at routing + wrapper overhead + classification
// Target: +10-15% performance (pairs with malloc_tiny_fast)
// ENV: HAKMEM_FRONT_GATE_UNIFIED=1 to enable (default: OFF)
if (__builtin_expect(front_gate_unified_enabled(), 0)) {
int freed = free_tiny_fast(ptr);
if (__builtin_expect(freed, 1)) {
return; // Success (pushed to Unified Cache)
}
// Unified Cache full OR invalid header → fallback to normal path
}
do { static int on=-1; if (on==-1){ const char* e=getenv("HAKMEM_FREE_WRAP_TRACE"); on=(e&&*e&&*e!='0')?1:0;} if(on){ fprintf(stderr,"[WRAP_FREE_ENTER] ptr=%p depth=%d init=%d\n", ptr, g_hakmem_lock_depth, g_initializing); } } while(0);
#if !HAKMEM_BUILD_RELEASE
// Debug safety: guard obviously invalid tiny integers to avoid libc crash and collect trace


@@ -140,8 +140,10 @@ int box_ss_hot_prewarm_all(void) {
total_prewarmed += actual;
}
// Phase 20-1: Log prewarm summary (DEBUG ONLY to avoid perf impact)
#if !HAKMEM_BUILD_RELEASE
fprintf(stderr, "[BOX_SS_HOT_PREWARM] Total blocks pre-warmed: %d\n", total_prewarmed);
#endif
return total_prewarmed;
}


@@ -0,0 +1,148 @@
// malloc_tiny_fast.h - Phase 26: Front Gate Unification (Tiny Fast Path)
//
// Goal: Eliminate 3-layer overhead (malloc → hak_alloc_at → wrapper → tiny_alloc_fast)
// Target: +10-15% performance (11.35M → 12.5-13.5M ops/s)
//
// Design (ChatGPT analysis):
// - Replace: malloc → hak_alloc_at (236 lines) → wrapper (diagnostics) → tiny_alloc_fast
// - With: malloc → malloc_tiny_fast (single-layer, direct to Unified Cache)
// - Preserves: Safety checks (lock depth, initializing, LD_SAFE, jemalloc block)
// - Leverages: Phase 23 Unified Cache (tcache-style, 2-3 cache misses)
//
// Performance:
// - Current overhead: malloc(8.97%) + routing + wrapper(3.63%) + tiny(5.37%) = 17.97%
// - BenchFast ceiling: 8-10 instructions (~1-2% overhead)
// - Gap: ~16%
// - Target: Close half the gap (+10-15% improvement)
//
// ENV Variables:
// HAKMEM_FRONT_GATE_UNIFIED=1 # Enable Front Gate Unification (default: 0, OFF)
#ifndef HAK_FRONT_MALLOC_TINY_FAST_H
#define HAK_FRONT_MALLOC_TINY_FAST_H
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include "../hakmem_build_flags.h"
#include "../hakmem_tiny_config.h" // For TINY_NUM_CLASSES
#include "tiny_unified_cache.h" // For unified_cache_pop_or_refill
#include "../tiny_region_id.h" // For tiny_region_id_write_header
#include "../hakmem_tiny.h" // For hak_tiny_size_to_class
// ============================================================================
// ENV Control (cached, lazy init)
// ============================================================================
// Enable flag (default: 0, OFF)
static inline int front_gate_unified_enabled(void) {
static int g_enable = -1;
if (__builtin_expect(g_enable == -1, 0)) {
const char* e = getenv("HAKMEM_FRONT_GATE_UNIFIED");
g_enable = (e && *e && *e != '0') ? 1 : 0;
#if !HAKMEM_BUILD_RELEASE
if (g_enable) {
fprintf(stderr, "[FrontGate-INIT] front_gate_unified_enabled() = %d\n", g_enable);
fflush(stderr);
}
#endif
}
return g_enable;
}
// ============================================================================
// Phase 26-A: malloc_tiny_fast() - Ultra-thin Tiny allocation
// ============================================================================
// Single-layer Tiny allocation (bypasses hak_alloc_at + wrapper + diagnostics)
// Preconditions:
// - Called AFTER malloc() safety checks (lock depth, initializing, LD_SAFE)
// - size <= tiny_get_max_size() (caller verified)
// Returns:
// - USER pointer on success
// - NULL on Unified Cache miss (caller falls back to normal path)
__attribute__((always_inline))
static inline void* malloc_tiny_fast(size_t size) {
// 1. size → class_idx (inline table lookup, 1-2 instructions)
int class_idx = hak_tiny_size_to_class(size);
if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
return NULL; // Out of range (should not happen if caller checked tiny_get_max_size())
}
// 2. Phase 23: Unified Cache pop-or-refill (tcache-style, 2-3 cache misses)
// This internally handles:
// - Cache hit: direct pop (fast path)
// - Cache miss: batch refill from SuperSlab (slow path)
void* base = unified_cache_pop_or_refill(class_idx);
if (__builtin_expect(base == NULL, 0)) {
// Unified Cache disabled OR refill failed
// Fall back to normal path (caller handles via hak_alloc_at)
return NULL;
}
// 3. Write header + return USER pointer (2-3 instructions)
#ifdef HAKMEM_TINY_HEADER_CLASSIDX
tiny_region_id_write_header(base, class_idx); // Write 1-byte header (BASE first!)
return (void*)((char*)base + 1); // Return USER pointer
#else
return base; // No header mode - return BASE directly
#endif
}
// ============================================================================
// Phase 26-B: free_tiny_fast() - Ultra-thin Tiny deallocation
// ============================================================================
// Single-layer Tiny deallocation (bypasses hak_free_at + wrapper + diagnostics)
// Preconditions:
// - ptr is from malloc_tiny_fast() (has valid header)
// - Front Gate Unified is enabled
// Returns:
// - 1 on success (pushed to Unified Cache)
// - 0 on failure (caller falls back to normal free path)
__attribute__((always_inline))
static inline int free_tiny_fast(void* ptr) {
if (__builtin_expect(!ptr, 0)) return 0;
#ifdef HAKMEM_TINY_HEADER_CLASSIDX
// 1. Page boundary guard:
//    If ptr is at the start of a page (offset==0), ptr-1 may lie on another page or in unmapped memory.
//    In that case skip the header read and fall back to the normal free path.
uintptr_t off = (uintptr_t)ptr & 0xFFFu;
if (__builtin_expect(off == 0, 0)) {
return 0;
}
// 2. Fast header magic validation (required)
//    Release builds skip the magic check inside tiny_region_id_read_header(),
//    so validate the Tiny-specific header (0xA0) ourselves here.
uint8_t* header_ptr = (uint8_t*)ptr - 1;
uint8_t header = *header_ptr;
uint8_t magic = header & 0xF0u;
if (__builtin_expect(magic != HEADER_MAGIC, 0)) {
// Not a Tiny header → Mid/Large/external pointer, use the normal free path
return 0;
}
// 3. Extract class_idx (low 4 bits)
int class_idx = (int)(header & HEADER_CLASS_MASK);
if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
return 0;
}
// 4. Compute BASE and push it to the Unified Cache
void* base = (void*)((char*)ptr - 1);
int pushed = unified_cache_push(class_idx, base);
if (__builtin_expect(pushed, 1)) {
return 1; // Success
}
// Unified Cache full → fall back to the normal free path
return 0;
#else
// No header mode - fall back to normal free
return 0;
#endif
}
#endif // HAK_FRONT_MALLOC_TINY_FAST_H


@@ -77,18 +77,27 @@ static inline int unified_cache_enabled(void) {
return g_enable;
}
// Per-class capacity (default: Hot_2048 strategy - optimized for 256B workload)
// Phase 23 Capacity Optimization Result: Hot_2048 = 14.63M ops/s (+43% vs baseline)
// Hot classes (C2/C3: 128B/256B) get 2048 slots, others get 64 slots
static inline size_t unified_capacity(int class_idx) {
static size_t g_cap[TINY_NUM_CLASSES] = {0};
if (__builtin_expect(g_cap[class_idx] == 0, 0)) {
char env_name[64];
snprintf(env_name, sizeof(env_name), "HAKMEM_TINY_UNIFIED_C%d", class_idx);
const char* e = getenv(env_name);
// Default: Hot_2048 strategy (C2/C3=2048, others=64)
size_t default_cap = 64; // Cold classes
if (class_idx == 2 || class_idx == 3) {
default_cap = 2048; // Hot classes (128B, 256B)
}
g_cap[class_idx] = (e && *e) ? (size_t)atoi(e) : default_cap;
// Round up to power of 2 (for fast modulo)
if (g_cap[class_idx] < 32) g_cap[class_idx] = 32;
if (g_cap[class_idx] > 4096) g_cap[class_idx] = 4096; // Increased limit for Hot_2048
// Ensure power of 2
size_t pow2 = 32;


@@ -313,7 +313,8 @@ static inline uint32_t trc_linear_carve(uint8_t* base, size_t bs,
*block = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
PTR_TRACK_HEADER_WRITE((void*)block, HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
#if !HAKMEM_BUILD_RELEASE
// ✅ Option C: Class 2 inline logs - CARVE operation (DEBUG ONLY)
if (class_idx == 2) {
uint64_t carve_id = atomic_fetch_add(&g_carve_count, 1);
extern _Atomic uint64_t malloc_count;
@@ -322,6 +323,7 @@
(void*)block, i+1, batch, carve_id, call);
fflush(stderr);
}
#endif
}
#endif