hakmem/docs/analysis/FREE_PATH_REVIEW_PACKET_CHATGPT.md

# Phase 86: Free Path Legacy Mask (NO-GO, +0.25%)

## Summary

Implemented the Phase 86 "mask-only commit" optimization for the free path (sketched below):
- Bitset mask (0x7f for C0-C6) to identify LEGACY classes
- Direct call to tiny_legacy_fallback_free_base_with_env()
- No indirect function pointers (avoids Phase 85's -0.86% regression)
- Fail-fast on LARSON_FIX=1 (cross-thread validation incompatibility)
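
A minimal sketch of the mask-only check (illustrative only: `g_free_path_legacy_mask` and `g_free_path_legacy_mask_on` are placeholder names, not the exact symbols in free_path_legacy_mask_box.h; the real gate lives in free_path_legacy_mask_box.h/.c and is refreshed from bench_profile, failing fast on LARSON_FIX=1):

```c
#include <stdint.h>

/* Placeholder globals standing in for the Phase 86 gate state. */
extern uint8_t g_free_path_legacy_mask;     /* bit i set => class Ci is LEGACY (0x7f = C0-C6) */
extern int     g_free_path_legacy_mask_on;  /* HAKMEM_FREE_PATH_LEGACY_MASK=1, refreshed once */

static inline int free_path_legacy_mask_try(void* base, int class_idx,
                                            const HakmemEnvSnapshot* env) {
    if (!g_free_path_legacy_mask_on) return 0;                      /* branch 1: mask_enabled */
    if (!(g_free_path_legacy_mask & (1u << class_idx))) return 0;   /* branch 2: has_class */
    /* Direct call, no function-pointer table (avoids Phase 85's indirect-call penalty). */
    tiny_legacy_fallback_free_base_with_env(base, (uint32_t)class_idx, env);
    return 1;
}
```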

## Results (10-run SSOT)

**NO-GO**: +0.25% improvement (threshold: +1.0%)
- Control:    51,750,467 ops/s (CV: 2.26%)
- Treatment:  51,881,055 ops/s (CV: 2.32%)
- Delta:      +0.25% (mean), -0.15% (median)

## Root Cause

Competing optimizations have plateaued:
1. Phase 9/10 MONO LEGACY (+1.89%) already captures most of the free-path benefit
2. The remaining margin is insufficient to overcome:
   - Two branch checks (mask_enabled + has_class)
   - I-cache layout tax in hot path
   - Direct function call overhead

## Phase 85 vs Phase 86

| Metric | Phase 85 | Phase 86 |
|--------|----------|----------|
| Approach | Indirect calls + table | Bitset mask + direct call |
| Result | -0.86% | +0.25% |
| Verdict | NO-GO (regression) | NO-GO (insufficient) |

Phase 86 correctly avoided the indirect-call penalty but revealed an architectural
limit: the free path cannot escape the Phase 9/10 overlay without restructuring.
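
For contrast, a hedged sketch of the Phase 85 shape that regressed (the table and dispatcher names below are illustrative, not the actual Phase 85 symbols):

```c
/* Phase 85 shape (illustrative): per-class function-pointer table.
 * The dependent load of the table entry plus the indirect call defeats
 * inlining and static prediction, the suspected source of the -0.86%. */
typedef void (*tiny_free_fn)(void* base, uint32_t class_idx, const HakmemEnvSnapshot* env);

extern tiny_free_fn g_free_route_table[8];  /* hypothetical: one entry per tiny class */

static inline void free_route_dispatch(void* base, uint32_t class_idx,
                                       const HakmemEnvSnapshot* env) {
    g_free_route_table[class_idx](base, class_idx, env);  /* indirect call per free() */
}
```

Phase 86 replaced the table with the bitset test sketched above plus a direct call, which avoided the regression but bought back only +0.25%.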

## Recommendation

The free-path optimization layer has reached its practical ceiling:
- Phase 9/10 (+1.89%) plus Phase 6/19/FASTLANE (+16-27%) ≈ 18-29% total
- Further attempts at ceremony elimination face the same constraints
- Recommend focusing on other optimization layers (the malloc path, etc.)

## Files Changed

### New
- core/box/free_path_legacy_mask_box.h (API + globals)
- core/box/free_path_legacy_mask_box.c (refresh logic)

### Modified
- core/bench_profile.h (added refresh call)
- core/front/malloc_tiny_fast.h (added Phase 86 fast path check)
- Makefile (added object files)
- CURRENT_TASK.md (documented result)

All changes conditional on HAKMEM_FREE_PATH_LEGACY_MASK=1 (default OFF).


Hakmem free-path review packet (compact)

Goal: understand the remaining fixed costs on the free path versus mimalloc/tcmalloc, within Box Theory (single boundary, reversible ENV gates).

SSOT bench conditions (current practice):

  • HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
  • ITERS=20000000 WS=400 RUNS=10
  • run via scripts/run_mixed_10_cleanenv.sh

Request:

  1. Where is the dominant fixed cost on the free path now?
  2. What structural change would give +5-10% without breaking Box Theory?
  3. What NOT to do (layout-tax pitfalls)?

Code excerpts (clipped)

core/box/tiny_free_gate_box.h

static inline int tiny_free_gate_try_fast(void* user_ptr)
{
#if !HAKMEM_TINY_HEADER_CLASSIDX
    (void)user_ptr;
    // When headers are compiled out, the Tiny Fast Path itself is not used
    return 0;
#else
    if (__builtin_expect(!user_ptr, 0)) {
        return 0;
    }

    // Layer 3a: lightweight fail-fast (always ON)
    // Obviously invalid addresses (extremely small values) are not handled on the Fast Path.
    // They are left to the Slow Path (hak_free_at + registry/header).
    {
        uintptr_t addr = (uintptr_t)user_ptr;
        if (__builtin_expect(addr < 4096, 0)) {
#if !HAKMEM_BUILD_RELEASE
            static _Atomic uint32_t g_free_gate_range_invalid = 0;
            uint32_t n = atomic_fetch_add_explicit(&g_free_gate_range_invalid, 1, memory_order_relaxed);
            if (n < 8) {
                fprintf(stderr,
                        "[TINY_FREE_GATE_RANGE_INVALID] ptr=%p\n",
                        user_ptr);
                fflush(stderr);
            }
#endif
            return 0;
        }
    }

    // Future extension point:
    //   - Run Bridge + Guard only when DIAG is ON, and skip the Fast Path
    //     when the pointer is judged to be outside Tiny management.
#if !HAKMEM_BUILD_RELEASE
    if (__builtin_expect(tiny_free_gate_diag_enabled(), 0)) {
        TinyFreeGateContext ctx;
        if (!tiny_free_gate_classify(user_ptr, &ctx)) {
            // Outside Tiny management or Bridge failure → do not use the Fast Path
            return 0;
        }
        (void)ctx;  // Log-only for now; a Guard will be inserted here in the future.
    }
#endif

    // Delegate the actual work to the existing ultra-fast free (behavior unchanged)
    return hak_tiny_free_fast_v2(user_ptr);
#endif
}

core/front/malloc_tiny_fast.h

static inline int free_tiny_fast(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return 0;

#if HAKMEM_TINY_HEADER_CLASSIDX
    // 1. Page-boundary guard:
    //    If ptr is at the start of a page (offset==0), ptr-1 may lie in another page or an unmapped region.
    //    In that case, skip the header read and fall back to the normal free path.
    uintptr_t off = (uintptr_t)ptr & 0xFFFu;
    if (__builtin_expect(off == 0, 0)) {
        return 0;
    }

    // 2. Fast header magic validation (required)
    //    Release builds omit the magic check in tiny_region_id_read_header(),
    //    so validate the Tiny-specific header (0xA0) here ourselves.
    uint8_t* header_ptr = (uint8_t*)ptr - 1;
    uint8_t header = *header_ptr;
    uint8_t magic = header & 0xF0u;
    if (__builtin_expect(magic != HEADER_MAGIC, 0)) {
        // Not a Tiny header → Mid/Large/external pointer, so take the normal free path
        return 0;
    }

    // 3. Extract class_idx (low 4 bits)
    int class_idx = (int)(header & HEADER_CLASS_MASK);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return 0;
    }

    // 4. Compute BASE and push to the Unified Cache
    void* base = tiny_user_to_base_inline(ptr);
    tiny_front_free_stat_inc(class_idx);

    // Phase FREE-LEGACY-BREAKDOWN-1: scattered counters (1. function entry)
    FREE_PATH_STAT_INC(total_calls);

    // Phase 19-3b: Consolidate ENV snapshot reads (capture once per free_tiny_fast call).
    const HakmemEnvSnapshot* env = hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;

    // Phase 9: MONO DUALHOT early-exit for C0-C3 (skip policy snapshot, direct to legacy)
    // Conditions:
    //   - ENV: HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=1
    //   - class_idx <= 3 (C0-C3)
    //   - !HAKMEM_TINY_LARSON_FIX (cross-thread handling requires full validation)
    //   - g_tiny_route_snapshot_done == 1 && route == TINY_ROUTE_LEGACY (take the existing path when this cannot be confirmed)
    if ((unsigned)class_idx <= 3u) {
        if (free_tiny_fast_mono_dualhot_enabled()) {
            static __thread int g_larson_fix = -1;
            if (__builtin_expect(g_larson_fix == -1, 0)) {
                const char* e = getenv("HAKMEM_TINY_LARSON_FIX");
                g_larson_fix = (e && *e && *e != '0') ? 1 : 0;
            }

            if (!g_larson_fix &&
                g_tiny_route_snapshot_done == 1 &&
                g_tiny_route_class[class_idx] == TINY_ROUTE_LEGACY) {
                // Direct path: Skip policy snapshot, go straight to legacy fallback
                FREE_PATH_STAT_INC(mono_dualhot_hit);
                tiny_legacy_fallback_free_base_with_env(base, (uint32_t)class_idx, env);
                return 1;
            }
        }
    }

    // Phase 10: MONO LEGACY DIRECT early-exit for C4-C7 (skip policy snapshot, direct to legacy)
    // Conditions:
    //   - ENV: HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=1
    //   - cached nonlegacy_mask: class is NOT in non-legacy mask (= ULTRA/MID/V7 not active)
    //   - g_tiny_route_snapshot_done == 1 && route == TINY_ROUTE_LEGACY (take the existing path when this cannot be confirmed)
    //   - !HAKMEM_TINY_LARSON_FIX (cross-thread handling requires full validation)
    if (free_tiny_fast_mono_legacy_direct_enabled()) {
        // 1. Check nonlegacy mask (computed once at init)
        uint8_t nonlegacy_mask = free_tiny_fast_mono_legacy_direct_nonlegacy_mask();
        if ((nonlegacy_mask & (1u << class_idx)) == 0) {
            // 2. Check route snapshot
            if (g_tiny_route_snapshot_done == 1 && g_tiny_route_class[class_idx] == TINY_ROUTE_LEGACY) {
                // 3. Check Larson fix
                static __thread int g_larson_fix = -1;
                if (__builtin_expect(g_larson_fix == -1, 0)) {
                    const char* e = getenv("HAKMEM_TINY_LARSON_FIX");
                    g_larson_fix = (e && *e && *e != '0') ? 1 : 0;
                }

                if (!g_larson_fix) {
                    // Direct path: Skip policy snapshot, go straight to legacy fallback
                    FREE_PATH_STAT_INC(mono_legacy_direct_hit);
                    tiny_legacy_fallback_free_base_with_env(base, (uint32_t)class_idx, env);
                    return 1;
                }
            }
        }
    }

    // Phase v11b-1: C7 ULTRA early-exit (skip policy snapshot for most common case)
    // Phase 4 E1: Use ENV snapshot when enabled (consolidates 3 TLS reads → 1)
    // Phase 19-3a: Remove UNLIKELY hint (snapshot is ON by default in presets, hint is backwards)
    const bool c7_ultra_free = env ? env->tiny_c7_ultra_enabled : tiny_c7_ultra_enabled_env();

    if (class_idx == 7 && c7_ultra_free) {
        tiny_c7_ultra_free(ptr);
        return 1;
    }

    // Phase POLICY-FAST-PATH-V2: Skip policy snapshot for known-legacy classes
    if (free_policy_fast_v2_can_skip((uint8_t)class_idx)) {
        FREE_PATH_STAT_INC(policy_fast_v2_skip);
        goto legacy_fallback;
    }

    // Phase v11b-1: Policy-based single switch (replaces serial ULTRA checks)
    const SmallPolicyV7* policy_free = small_policy_v7_snapshot();
    SmallRouteKind route_kind_free = policy_free->route_kind[class_idx];

    switch (route_kind_free) {
        case SMALL_ROUTE_ULTRA: {
            // Phase TLS-UNIFY-1: Unified ULTRA TLS push for C4-C6 (C7 handled above)
            if (class_idx >= 4 && class_idx <= 6) {
                tiny_ultra_tls_push((uint8_t)class_idx, base);
                return 1;
            }
            // ULTRA for other classes → fallback to LEGACY
            break;
        }

        case SMALL_ROUTE_MID_V35: {
            // Phase v11a-3: MID v3.5 free
            small_mid_v35_free(ptr, class_idx);
            FREE_PATH_STAT_INC(smallheap_v7_fast);
            return 1;
        }

        case SMALL_ROUTE_V7: {
            // Phase v7: SmallObject v7 free (research box)
            if (small_heap_free_fast_v7_stub(ptr, (uint8_t)class_idx)) {
                FREE_PATH_STAT_INC(smallheap_v7_fast);
                return 1;
            }
            // V7 miss → fallback to LEGACY
            break;
        }

        case SMALL_ROUTE_MID_V3: {
            // Phase MID-V3: delegate to MID v3.5
            small_mid_v35_free(ptr, class_idx);
            FREE_PATH_STAT_INC(smallheap_v7_fast);
            return 1;
        }

        case SMALL_ROUTE_LEGACY:
        default:
            break;
    }

legacy_fallback:
    // LEGACY fallback path
    // Phase 19-6C: Compute route once using helper (avoid redundant tiny_route_for_class)
    tiny_route_kind_t route;
    int use_tiny_heap;
    free_tiny_fast_compute_route_and_heap(class_idx, &route, &use_tiny_heap);

    // TWO-SPEED: SuperSlab registration check is DEBUG-ONLY to keep HOT PATH fast.
    // In Release builds, we trust header magic (0xA0) as sufficient validation.
#if !HAKMEM_BUILD_RELEASE
    // 5. Confirm SuperSlab registration (guards against misclassification)
    SuperSlab* ss_guard = hak_super_lookup(ptr);
    if (__builtin_expect(!(ss_guard && ss_guard->magic == SUPERSLAB_MAGIC), 0)) {
        return 0;  // not managed by hakmem → take the normal free path
    }
#endif  // !HAKMEM_BUILD_RELEASE

    // Cross-thread free detection (Larson MT crash fix, ENV gated) + TinyHeap free path
    {
        static __thread int g_larson_fix = -1;
        if (__builtin_expect(g_larson_fix == -1, 0)) {
            const char* e = getenv("HAKMEM_TINY_LARSON_FIX");
            g_larson_fix = (e && *e && *e != '0') ? 1 : 0;
#if !HAKMEM_BUILD_RELEASE
            fprintf(stderr, "[LARSON_FIX_INIT] g_larson_fix=%d (env=%s)\n", g_larson_fix, e ? e : "NULL");
            fflush(stderr);
#endif
        }

        if (__builtin_expect(g_larson_fix || use_tiny_heap, 0)) {
            // Phase 12 optimization: Use fast mask-based lookup (~5-10 cycles vs 50-100)
            SuperSlab* ss = ss_fast_lookup(base);
            // Phase FREE-LEGACY-BREAKDOWN-1: scattered counters (5. super_lookup call)
            FREE_PATH_STAT_INC(super_lookup_called);
            if (ss) {
                int slab_idx = slab_index_for(ss, base);
                if (__builtin_expect(slab_idx >= 0 && slab_idx < ss_slabs_capacity(ss), 1)) {
                    uint32_t self_tid = tiny_self_u32_local();
                    uint8_t owner_tid_low = ss_slab_meta_owner_tid_low_get(ss, slab_idx);
                    TinySlabMeta* meta = &ss->slabs[slab_idx];
                    // LARSON FIX: Use bits 8-15 for comparison (pthread TIDs aligned to 256 bytes)
                    uint8_t self_tid_cmp = (uint8_t)((self_tid >> 8) & 0xFFu);
#if !HAKMEM_BUILD_RELEASE
                    static _Atomic uint64_t g_owner_check_count = 0;
                    uint64_t oc = atomic_fetch_add(&g_owner_check_count, 1);
                    if (oc < 10) {
                        fprintf(stderr, "[LARSON_FIX] Owner check: ptr=%p owner_tid_low=0x%02x self_tid_cmp=0x%02x self_tid=0x%08x match=%d\n",
                                ptr, owner_tid_low, self_tid_cmp, self_tid, (owner_tid_low == self_tid_cmp));
                        fflush(stderr);
                    }
#endif

                    if (__builtin_expect(owner_tid_low != self_tid_cmp, 0)) {
                        // Cross-thread free → route to remote queue instead of poisoning TLS cache
#if !HAKMEM_BUILD_RELEASE
                        static _Atomic uint64_t g_cross_thread_count = 0;
                        uint64_t ct = atomic_fetch_add(&g_cross_thread_count, 1);
                        if (ct < 20) {
                            fprintf(stderr, "[LARSON_FIX] Cross-thread free detected! ptr=%p owner_tid_low=0x%02x self_tid_cmp=0x%02x self_tid=0x%08x\n",
                                    ptr, owner_tid_low, self_tid_cmp, self_tid);
                            fflush(stderr);
                        }
#endif
                        if (tiny_free_remote_box(ss, slab_idx, meta, ptr, self_tid)) {
                            // Phase FREE-LEGACY-BREAKDOWN-1: scattered counters (6. cross-thread free)
                            FREE_PATH_STAT_INC(remote_free);
                            return 1;  // handled via remote queue

core/box/tiny_front_hot_box.h

static inline int tiny_hot_free_fast(int class_idx, void* base) {
    extern __thread TinyUnifiedCache g_unified_cache[];

    // TLS cache access (1 cache miss)
    // NOTE: Range check removed - caller guarantees valid class_idx
    TinyUnifiedCache* cache = &g_unified_cache[class_idx];

#if HAKMEM_TINY_UNIFIED_LIFO_COMPILED
    // Phase 15 v1: Mode check at entry (once per call, not scattered in hot path)
    // Phase 22: Compile-out when disabled (default OFF)
    int lifo_mode = tiny_unified_lifo_enabled();

    // Phase 15 v1: LIFO vs FIFO mode switch
    if (lifo_mode) {
        // === LIFO MODE: Stack-based (LIFO) ===
        // Try push to stack (tail is stack depth)
        if (unified_cache_try_push_lifo(class_idx, base)) {
            #if !HAKMEM_BUILD_RELEASE
            extern __thread uint64_t g_unified_cache_push[];
            g_unified_cache_push[class_idx]++;
            #endif
            return 1;  // SUCCESS
        }
        // LIFO overflow → fall through to cold path
        #if !HAKMEM_BUILD_RELEASE
        extern __thread uint64_t g_unified_cache_full[];
        g_unified_cache_full[class_idx]++;
        #endif
        return 0;  // FULL
    }
#endif

    // === FIFO MODE: Ring-based (existing, default) ===
    // Calculate next tail (for full check)
    uint16_t next_tail = (cache->tail + 1) & cache->mask;

    // Branch 1: Cache full check (UNLIKELY full)
    // Hot path: cache has space (next_tail != head)
    // Cold path: cache full (next_tail == head) → drain needed
    if (TINY_HOT_LIKELY(next_tail != cache->head)) {
        // === HOT PATH: Cache has space (2-3 instructions) ===

        // Push to cache (1 cache miss for array write)
        cache->slots[cache->tail] = base;
        cache->tail = next_tail;

        // Debug metrics (zero overhead in release)
        #if !HAKMEM_BUILD_RELEASE
        extern __thread uint64_t g_unified_cache_push[];
        g_unified_cache_push[class_idx]++;
        #endif

        return 1;  // SUCCESS
    }

    // === COLD PATH: Cache full ===
    // Don't drain here - let caller handle via tiny_cold_drain_and_free()
    #if !HAKMEM_BUILD_RELEASE
    extern __thread uint64_t g_unified_cache_full[];
    g_unified_cache_full[class_idx]++;
    #endif

    return 0;  // FULL
}

core/box/tiny_legacy_fallback_box.h

static inline void tiny_legacy_fallback_free_base_with_env(void* base, uint32_t class_idx, const HakmemEnvSnapshot* env) {
    // Phase 80-1: Switch dispatch for C4/C5/C6 (branch reduction optimization)
    // Phase 83-1: Per-op branch removed via fixed-mode caching
    // C2/C3 excluded (NO-GO from Phase 77-1/79-1)
    if (tiny_inline_slots_switch_dispatch_enabled_fast()) {
        // Switch mode: Direct jump to case (zero comparison overhead for C4/C5/C6)
        switch (class_idx) {
            case 4:
                if (tiny_c4_inline_slots_enabled_fast()) {
                    if (c4_inline_push(c4_inline_tls(), base)) {
                        FREE_PATH_STAT_INC(legacy_fallback);
                        if (__builtin_expect(free_path_stats_enabled(), 0)) {
                            g_free_path_stats.legacy_by_class[class_idx]++;
                        }
                        return;
                    }
                }
                break;
            case 5:
                if (tiny_c5_inline_slots_enabled_fast()) {
                    if (c5_inline_push(c5_inline_tls(), base)) {
                        FREE_PATH_STAT_INC(legacy_fallback);
                        if (__builtin_expect(free_path_stats_enabled(), 0)) {
                            g_free_path_stats.legacy_by_class[class_idx]++;
                        }
                        return;
                    }
                }
                break;
            case 6:
                if (tiny_c6_inline_slots_enabled_fast()) {
                    if (c6_inline_push(c6_inline_tls(), base)) {
                        FREE_PATH_STAT_INC(legacy_fallback);
                        if (__builtin_expect(free_path_stats_enabled(), 0)) {
                            g_free_path_stats.legacy_by_class[class_idx]++;
                        }
                        return;
                    }
                }
                break;
            default:
                // C0-C3, C7: fall through to unified_cache push
                break;
        }
        // Switch mode: fall through to unified_cache push after miss
    } else {
        // If-chain mode (Phase 80-1 baseline): C3/C4/C5/C6 sequential checks
        // NOTE: C2 local cache (Phase 79-1 NO-GO) removed from hot path

        // Phase 77-1: C3 Inline Slots early-exit (ENV gated)
        // Try C3 inline slots SECOND (before C4/C5/C6/unified cache) for class 3
        if (class_idx == 3 && tiny_c3_inline_slots_enabled_fast()) {
            if (c3_inline_push(c3_inline_tls(), base)) {
                // Success: pushed to C3 inline slots
                FREE_PATH_STAT_INC(legacy_fallback);
                if (__builtin_expect(free_path_stats_enabled(), 0)) {
                    g_free_path_stats.legacy_by_class[class_idx]++;
                }
                return;
            }
            // FULL → fall through to C4/C5/C6/unified cache
        }

        // Phase 76-1: C4 Inline Slots early-exit (ENV gated)
        // Try C4 inline slots SECOND (before C5/C6/unified cache) for class 4
        if (class_idx == 4 && tiny_c4_inline_slots_enabled_fast()) {
            if (c4_inline_push(c4_inline_tls(), base)) {
                // Success: pushed to C4 inline slots
                FREE_PATH_STAT_INC(legacy_fallback);
                if (__builtin_expect(free_path_stats_enabled(), 0)) {
                    g_free_path_stats.legacy_by_class[class_idx]++;
                }
                return;
            }
            // FULL → fall through to C5/C6/unified cache
        }

        // Phase 75-2: C5 Inline Slots early-exit (ENV gated)
        // Try C5 inline slots SECOND (before C6 and unified cache) for class 5
        if (class_idx == 5 && tiny_c5_inline_slots_enabled_fast()) {
            if (c5_inline_push(c5_inline_tls(), base)) {
                // Success: pushed to C5 inline slots
                FREE_PATH_STAT_INC(legacy_fallback);
                if (__builtin_expect(free_path_stats_enabled(), 0)) {
                    g_free_path_stats.legacy_by_class[class_idx]++;
                }
                return;
            }
            // FULL → fall through to C6/unified cache
        }

        // Phase 75-1: C6 Inline Slots early-exit (ENV gated)
        // Try C6 inline slots THIRD (before unified cache) for class 6
        if (class_idx == 6 && tiny_c6_inline_slots_enabled_fast()) {
            if (c6_inline_push(c6_inline_tls(), base)) {
                // Success: pushed to C6 inline slots
                FREE_PATH_STAT_INC(legacy_fallback);
                if (__builtin_expect(free_path_stats_enabled(), 0)) {
                    g_free_path_stats.legacy_by_class[class_idx]++;
                }
                return;
            }
            // FULL → fall through to unified cache
        }
    } // End of if-chain mode

    const TinyFrontV3Snapshot* front_snap =
        env ? (env->tiny_front_v3_enabled ? tiny_front_v3_snapshot_get() : NULL)
            : (__builtin_expect(tiny_front_v3_enabled(), 0) ? tiny_front_v3_snapshot_get() : NULL);
    const bool metadata_cache_on = env ? env->tiny_metadata_cache_eff : tiny_metadata_cache_enabled();

    // Phase 3 C2 Patch 2: First page cache hint (optional fast-path)
    // Check if pointer is in cached page (avoids metadata lookup in future optimizations)
    if (__builtin_expect(metadata_cache_on, 0)) {
        // Note: This is a hint-only check. Even if it hits, we still use the standard path.
        // The cache will be populated during refill operations for future use.
        // Currently this just validates the cache state; actual optimization TBD.
        if (tiny_first_page_cache_hit(class_idx, base, 4096)) {
            // Future: could optimize metadata access here
        }
    }

    // Legacy fallback - Unified Cache push
    if (!front_snap || front_snap->unified_cache_on) {
        // Phase 74-3 (P0): FASTAPI path (ENV-gated)
        if (tiny_uc_fastapi_enabled()) {
            // Preconditions guaranteed:
            // - unified_cache_on == true (checked above)
            // - TLS init guaranteed by front_gate_unified_enabled() in malloc_tiny_fast.h
            // - Stats compiled-out in FAST builds
            if (unified_cache_push_fast(class_idx, HAK_BASE_FROM_RAW(base))) {
                FREE_PATH_STAT_INC(legacy_fallback);

                // Per-class breakdown (Phase 4-1)
                if (__builtin_expect(free_path_stats_enabled(), 0)) {
                    if (class_idx < 8) {
                        g_free_path_stats.legacy_by_class[class_idx]++;
                    }
                }
                return;
            }
            // FULL → fallback to slow path (rare)
        }

        // Original path (FASTAPI=0 or fallback)
        if (unified_cache_push(class_idx, HAK_BASE_FROM_RAW(base))) {
            FREE_PATH_STAT_INC(legacy_fallback);

            // Per-class breakdown (Phase 4-1)
            if (__builtin_expect(free_path_stats_enabled(), 0)) {
                if (class_idx < 8) {
                    g_free_path_stats.legacy_by_class[class_idx]++;
                }
            }
            return;
        }
    }

    // Final fallback
    tiny_hot_free_fast(class_idx, base);
}

Questions to answer (please be concrete)

  1. In these snippets, which checks/branches are still "per-op fixed taxes" on the hot free path?

    • Please point to specific lines/conditions and estimate cost (branches/instructions or dependency chain).
  2. Is tiny_hot_free_fast() already close to optimal, with the real bottleneck upstream (user->base conversion / classification / routing)?

    • If yes, what's the smallest structural refactor that removes that upstream fixed tax?
  3. Should we introduce a "commit once" plan (freeze the chosen free path) — or is branch prediction already making lazy-init checks ~free here?

    • If "commit once", where should it live to avoid runtime gate overhead (bench_profile refresh boundary vs per-op)?
  4. We have had many layout-tax regressions from code removal/reordering.

    • What patterns here are most likely to trigger layout tax if changed?
    • How would you stage a safe A/B (same binary, ENV toggle) for your proposal?
  5. If you could change just ONE of:

    • pointer classification to base/class_idx,
    • route determination,
    • unified cache push/pop structure,

    which is the highest-ROI change for +5-10% on WS=400?
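
For question 3, a minimal sketch contrasting the two candidate placements of a "commit once" plan (hedged: FreePathPlan, g_free_path_plan, and free_path_plan_refresh() are hypothetical names for illustration, not existing hakmem symbols):

#include <stdlib.h>

/* Option A: decide once at the bench_profile refresh boundary. */
typedef enum { FREE_PLAN_UNDECIDED = 0, FREE_PLAN_LEGACY, FREE_PLAN_POLICY } FreePathPlan;

static FreePathPlan g_free_path_plan = FREE_PLAN_UNDECIDED;

static void free_path_plan_refresh(void) {   /* called from the ENV-snapshot refresh, not per free() */
    const char* e = getenv("HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT");
    g_free_path_plan = (e && *e && *e != '0') ? FREE_PLAN_LEGACY : FREE_PLAN_POLICY;
}

/* Per-op cost collapses to one load + compare (vs. the lazy-init TLS getenv
 * checks currently scattered in free_tiny_fast()). */
static inline int free_path_plan_is_legacy(void) {
    return g_free_path_plan == FREE_PLAN_LEGACY;
}

(Option B is the status quo: per-op lazy-init checks, relying on branch prediction to make them ~free.)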

[packet] done