Moe Charm (CI) e1a4561992 Phase 19-3b: pass down env snapshot in hot paths

2025-12-15 12:50:16 +09:00

Phase 19-3: ENV Snapshot Consolidation — Design

0. Goal

Objective: Reduce ENV check overhead from 3+ TLS reads per operation to 1 TLS read
Expected Impact (target): -10.0 instructions/op, -4.0 branches/op, +3-8% throughput
Risk Level: MEDIUM (ENV invalidation handling required)
Box Name: EnvSnapshotConsolidationBox (Phase 19-3)

Context

Phase 19 perf analysis revealed that ENV checks are executed 3+ times per operation:

hakmem_env_snapshot_enabled(): called 5 times in malloc/free hot paths (lines 236, 403, 624, 830, 910 in malloc_tiny_fast.h)
Each call triggers:
- TLS read of g_hakmem_env_snapshot_ctor_mode
- TLS read of g_hakmem_env_snapshot_gate
- Branch prediction overhead
- Potential lazy initialization check

Current overhead: ~7% perf samples on hakmem_env_snapshot_enabled() and related ENV checks.

Phase 4 E1 Status: ENV snapshot infrastructure exists (global default OFF, but promoted ON in presets like MIXED_TINYV3_C7_SAFE). Phase 19-3 aims to:

Eliminate redundant hakmem_env_snapshot_enabled() checks (5 calls → 1 call)
Make ENV snapshot the default path (not research box)
Further consolidate ENV reads into entry-point snapshot

Phase 19-3a Result (validated)

Phase 19-3a removed the call-site UNLIKELY hint: __builtin_expect(hakmem_env_snapshot_enabled(), 0) → hakmem_env_snapshot_enabled()

Observed impact: GO (+4.42% throughput) on Mixed. This validates that the remaining ENV work is dominated by branch/layout effects, not just raw "read cost".

Phase 19-3b Result (validated)

Phase 19-3b consolidated snapshot reads by capturing env once per hot call and passing it down into nested helpers.

Observed impact: GO (+2.76% mean / +2.57% median) on Mixed 10-run (scripts/run_mixed_10_cleanenv.sh).

1. Current State Analysis

1.1 ENV Check Locations (Per-Operation)

Based on code analysis, ENV checks occur in these hot path locations:

malloc_tiny_fast() path:

Line 236: C7 ULTRA check (hakmem_env_snapshot_enabled() → hakmem_env_snapshot())

free_tiny_fast() paths:

Line 403: Front V3 snapshot check for free() (in free_tiny_fast_v4_hotcold)
Line 624: C7 ULTRA check (hakmem_env_snapshot_enabled() → hakmem_env_snapshot())
Line 830: C7 ULTRA check (duplicate in free_tiny_fast_v4_larson)
Line 910: Front V3 snapshot check for free() (in free_tiny_fast_v4_larson)

tiny_legacy_fallback_box.h:

Line 28: hakmem_env_snapshot_enabled() for front_snap + metadata_cache_on

tiny_metadata_cache_hot_box.h:

Line 64: hakmem_env_snapshot_enabled() for metadata cache effective check

1.2 TLS Read Overhead Analysis

Each hakmem_env_snapshot_enabled() call performs:

int ctor_mode = g_hakmem_env_snapshot_ctor_mode;  // TLS read #1
if (ctor_mode == 1) {
    return g_hakmem_env_snapshot_gate != 0;       // TLS read #2 (ctor path)
}
// Legacy path
if (g_hakmem_env_snapshot_gate == -1) {           // TLS read #2 (legacy path)
    // Lazy init with getenv()
}

Per-operation cost (when snapshot enabled):

5 calls × 2 TLS reads = 10 TLS reads/op
Plus: 5× branch on ctor_mode, 5× branch on snapshot enabled
Actual measurement: ~7% perf samples

Per-operation cost (when snapshot disabled - current default):

5 calls × 2-3 TLS reads = 10-15 TLS reads/op
Plus: lazy init checks, getenv() overhead on first call per thread

1.3 Redundancy Analysis

Problem: Each hot path independently checks hakmem_env_snapshot_enabled():

malloc C7 ULTRA: check at line 236
free C7 ULTRA: check at line 624 (same operation, different code path)
free front V3: check at line 403 and 910 (same snapshot needed)
Legacy fallback: check at line 28 (called from above paths)
Metadata cache: check at line 64 (called from above paths)

Redundancy: For a typical malloc+free pair:

Current: 5+ hakmem_env_snapshot_enabled() calls = 10-15 TLS reads
Optimal: 1 entry-point snapshot = 1-2 TLS reads

Gap: 8-13 redundant TLS reads per operation
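
The read counts above can be reproduced with a small model. This is only a sketch: the mock_* names are hypothetical stand-ins for the real gate variables, the counter is instrumentation that the real gate does not have, and only the ctor-mode branch from 1.2 is modelled.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two gate variables (not the real hakmem symbols). */
static __thread int mock_ctor_mode = 1;
static __thread int mock_gate = 1;
static __thread unsigned long mock_tls_reads = 0;  /* instrumentation only */

/* Mirrors the ctor-mode branch of the gate in 1.2: two modelled TLS reads per call. */
int mock_snapshot_enabled(void) {
    mock_tls_reads++;                  /* TLS read #1: ctor_mode */
    if (mock_ctor_mode == 1) {
        mock_tls_reads++;              /* TLS read #2: gate */
        return mock_gate != 0;
    }
    return 0;                          /* legacy lazy-init path omitted in this sketch */
}

/* Current shape: each of the five hot call sites evaluates the gate itself. */
unsigned long reads_with_independent_checks(void) {
    mock_tls_reads = 0;
    for (int site = 0; site < 5; site++) {
        (void)mock_snapshot_enabled();
    }
    return mock_tls_reads;
}

/* Target shape: evaluate once at entry, reuse the result at all five sites. */
unsigned long reads_with_single_capture(void) {
    mock_tls_reads = 0;
    int enabled = mock_snapshot_enabled();
    int uses = 0;
    for (int site = 0; site < 5; site++) {
        uses += enabled;               /* plain local read, no gate call */
    }
    assert(uses == 5 * enabled);
    return mock_tls_reads;
}
```

In this model the independent checks cost 10 modelled reads and the single capture costs 2, which matches the lower end of the gap quoted above.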

2. Design Options

Option A: Entry-Point Snapshot Pass-Down (Recommended)

Concept: Capture the existing HakmemEnvSnapshot pointer once at malloc/free entry, and pass it down. This avoids creating a new TLS context and automatically stays compatible with hakmem_env_snapshot_refresh_from_env() (refresh updates the snapshot in-place).

Architecture:

// At wrapper entry (malloc/free):
const HakmemEnvSnapshot* env =
    hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;

// In malloc_tiny_fast():
void* malloc_tiny_fast_with_env(size_t size, const HakmemEnvSnapshot* env) {
    // Use env->tiny_c7_ultra_enabled instead of calling hakmem_env_snapshot_enabled()
    if (env && class_idx == 7 && env->tiny_c7_ultra_enabled) {
        // Direct check, no TLS read
    }
}

Pros:

Minimal refactoring: Add context parameter to existing functions
Type safety: Compiler enforces context passing
Clear boundary: ENV decisions made at entry, logic below is pure
Easy rollback: Context parameter can be NULL (fallback to old path)

Cons:

API threading: Some hot helpers need an extra pointer parameter (env) or _with_env variants.
Register pressure: Extra parameter may affect register allocation (verify via perf stat).

Risk: LOW-MEDIUM (mechanical threading, rollback is simple)
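
A minimal, self-contained sketch of the pass-down shape follows. It assumes a reduced MockEnvSnapshot with a single field, and every mock_* function, the ROUTE_* constants, and the gate-call counter are hypothetical. In the real code the env == NULL branch falls back to the per-feature env gates, which this sketch replaces with a fixed legacy route.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced snapshot; the real HakmemEnvSnapshot has more fields. */
typedef struct {
    bool tiny_c7_ultra_enabled;
} MockEnvSnapshot;

static MockEnvSnapshot mock_snapshot = { .tiny_c7_ultra_enabled = true };
static bool mock_gate_on = true;
static int  mock_gate_calls = 0;       /* counts gate evaluations (instrumentation) */

static bool mock_snapshot_enabled(void) { mock_gate_calls++; return mock_gate_on; }
static const MockEnvSnapshot* mock_snapshot_get(void) { return &mock_snapshot; }

enum { ROUTE_LEGACY = 0, ROUTE_C7_ULTRA = 1 };

/* Nested helper: receives env and never re-evaluates the gate. */
static int mock_route_with_env(int class_idx, const MockEnvSnapshot* env) {
    if (env && class_idx == 7 && env->tiny_c7_ultra_enabled) {
        return ROUTE_C7_ULTRA;         /* direct field read */
    }
    return ROUTE_LEGACY;               /* env == NULL or feature off */
}

/* Hot function: forwards env unchanged to the helper. */
static int mock_alloc_with_env(int class_idx, const MockEnvSnapshot* env) {
    return mock_route_with_env(class_idx, env);
}

/* Entry point: the only place where the gate is evaluated. */
int mock_alloc_entry(int class_idx) {
    const MockEnvSnapshot* env =
        mock_snapshot_enabled() ? mock_snapshot_get() : NULL;
    return mock_alloc_with_env(class_idx, env);
}
```

Each call to mock_alloc_entry() evaluates the gate exactly once regardless of how many helpers consume env, and turning the gate off makes every helper take the NULL branch.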

Option B: TLS Cached Context (Alternative)

Concept: Maintain thread-local ENV context, refresh on invalidation events.

Architecture:

// Global TLS context (replaces per-call ENV checks)
static __thread FastLaneEnvCtx g_fastlane_ctx;
static __thread int g_fastlane_ctx_version = -1;  // -1 never matches a valid version, so the first call refreshes
extern int g_env_snapshot_version;  // Incremented on ENV change

static inline const FastLaneEnvCtx* fastlane_ctx_get(void) {
    if (__builtin_expect(g_fastlane_ctx_version != g_env_snapshot_version, 0)) {
        // Refresh from snapshot (rare)
        const HakmemEnvSnapshot* snap = hakmem_env_snapshot();
        g_fastlane_ctx.c7_ultra_enabled = snap->tiny_c7_ultra_enabled;
        // ... copy fields
        g_fastlane_ctx_version = g_env_snapshot_version;
    }
    return &g_fastlane_ctx;
}

// In hot path:
const FastLaneEnvCtx* ctx = fastlane_ctx_get();  // 1 TLS read + 1 branch
if (class_idx == 7 && ctx->c7_ultra_enabled) {  // Direct struct access
    // ... C7 ULTRA fast path
}

Pros:

No API changes: Existing functions unchanged
Single TLS read: Version check is fast (1 global read + 1 TLS read)
Automatic invalidation: Version bump triggers refresh
Easy integration: Drop-in replacement for hakmem_env_snapshot_enabled()

Cons:

Version management: Need global version counter + invalidation hooks
Stale data risk: If version check is missed, stale context used
Init complexity: Each thread needs lazy init + version tracking
Debugging: Harder to trace when context was last refreshed

Risk: MEDIUM (version invalidation must be bulletproof)
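
To make the invalidation requirement concrete, here is a sketch of the version check with a writer side. All mock_* and tls_* names are hypothetical, the refresh counter is instrumentation, and the sketch assumes ENV changes come from a single writer while no allocation is in flight (as in a benchmark harness); concurrent writers would need stronger synchronization than shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { bool tiny_c7_ultra_enabled; } MockEnvSnapshot;  /* hypothetical */
typedef struct { bool c7_ultra_enabled; } MockFastLaneCtx;       /* hypothetical */

static MockEnvSnapshot mock_snapshot;           /* process-wide source of truth */
static atomic_int mock_snapshot_version = 0;    /* bumped on every ENV change */

static __thread MockFastLaneCtx tls_ctx;
static __thread int tls_ctx_version = -1;       /* -1 forces a refresh on first use */
static __thread int tls_refresh_count = 0;      /* instrumentation only */

/* Writer side: update the snapshot first, then publish a new version. */
void mock_env_changed(bool c7_ultra) {
    mock_snapshot.tiny_c7_ultra_enabled = c7_ultra;
    atomic_fetch_add_explicit(&mock_snapshot_version, 1, memory_order_release);
}

/* Reader side: one version compare on the fast path, copy fields only when stale. */
const MockFastLaneCtx* mock_ctx_get(void) {
    int v = atomic_load_explicit(&mock_snapshot_version, memory_order_acquire);
    if (__builtin_expect(tls_ctx_version != v, 0)) {
        tls_ctx.c7_ultra_enabled = mock_snapshot.tiny_c7_ultra_enabled;
        tls_ctx_version = v;
        tls_refresh_count++;
    }
    return &tls_ctx;
}
```

The first call on a thread refreshes, later calls reuse the cached copy, and a version bump triggers exactly one more refresh. Any code path that changes the snapshot without bumping the version leaves every thread with stale data, which is the risk listed above.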

Option C: Init-Time Fixed (High Risk)

Concept: Read ENV once at process init, freeze configuration for lifetime.

Architecture:

// Global constants (set in constructor)
static bool g_c7_ultra_enabled_fixed;
static bool g_front_v3_enabled_fixed;

__attribute__((constructor))
static void fastlane_env_init(void) {
    const HakmemEnvSnapshot* snap = hakmem_env_snapshot();
    g_c7_ultra_enabled_fixed = snap->tiny_c7_ultra_enabled;
    g_front_v3_enabled_fixed = snap->tiny_front_v3_enabled;
}

// Hot path: direct global read (no TLS)
if (class_idx == 7 && g_c7_ultra_enabled_fixed) {
    // ... C7 ULTRA fast path
}

Pros:

Zero TLS reads: Direct global variable access
Maximum performance: Plain global loads; the compiler can constant-fold only if the flags are fixed at build time, not when they are set in a constructor
Simple implementation: No lazy init, no version tracking

Cons:

No runtime ENV changes: ENV toggles require process restart
Breaks bench_profile: putenv() in benchmarks will not work
No A/B testing: Cannot toggle ENV for same-binary comparison
Box Theory violation: No rollback/toggle capability

Risk: HIGH (breaks existing workflow, violates Box Theory)

Recommended: Option A (Entry-Point Snapshot Pass-Down)

Reasoning:

Preserves Box Theory: env==NULL → fallback to old path
Clear separation: ENV decisions at entry, pure logic below
Benchmark compatible: Works with bench_profile putenv + hakmem_env_snapshot_refresh_from_env() (snapshot updates in-place)
Performance: Removes repeated hakmem_env_snapshot_enabled() checks inside deep helpers

Trade-off acceptance:

Accept API changes (mechanical, low risk)
Accept extra parameter (register pressure acceptable for hot path)
Reject Option B's version management complexity
Reject Option C's inflexibility

3. Implementation Plan (Option A)

3.1 Box Design

Box Name: EnvSnapshotConsolidationBox (Phase 19-3)

Files:

Modified: core/front/malloc_tiny_fast.h
- Phase 19-3a: remove backwards __builtin_expect(..., 0) hints (DONE, +4.42% GO).
- Phase 19-3b: thread const HakmemEnvSnapshot* env down to eliminate repeated hakmem_env_snapshot_enabled() checks (DONE, +2.76% GO).
Modified: core/box/tiny_legacy_fallback_box.h
- Add _with_env helper (Phase 19-3b).
Modified: core/box/tiny_metadata_cache_hot_box.h
- Add _with_env helper (Phase 19-3b).
Optional: core/box/hak_wrappers.inc.h
- If needed, compute env once per wrapper entry and pass it down (removes the remaining alloc-side gate in malloc_tiny_fast_for_class()).

ENV Gate:

Base: HAKMEM_ENV_SNAPSHOT=0/1 (Phase 4 E1 gate; promoted ON in presets)

Rollback:

Snapshot behavior: set HAKMEM_ENV_SNAPSHOT=0 to fall back to per-feature env gates.
Pass-down refactor: revert the Phase 19-3b commit (or add a dedicated pass-down gate if future A/B is needed).

3.2 API Design

Pass-down API (recommended):

// Wrapper entry (malloc/free): read snapshot ONCE, pass down.
const HakmemEnvSnapshot* env =
    hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;

// Hot helpers accept optional env pointer (NULL-safe).
void* malloc_tiny_fast_with_env(size_t size, const HakmemEnvSnapshot* env);
int   free_tiny_fast_with_env(void* ptr, const HakmemEnvSnapshot* env);

3.3 Migration Plan (Incremental)

Phase 19-3a (DONE): remove backwards UNLIKELY hints at the 5 hottest call sites in core/front/malloc_tiny_fast.h.

__builtin_expect(hakmem_env_snapshot_enabled(), 0) → hakmem_env_snapshot_enabled()
Measured: GO (+4.42%)

Phase 19-3b (DONE): capture env once per hot call and pass it down into nested helpers.

In core/front/malloc_tiny_fast.h:
- free_tiny_fast() / free_tiny_fast_hot() capture env once and pass it to cold + legacy helpers.
- malloc_tiny_fast_for_class() reuses the same snapshot for tiny_policy_hot_get_route_with_env(...).
In core/box/tiny_legacy_fallback_box.h and core/box/tiny_metadata_cache_hot_box.h:
- add _with_env helpers to consume the pass-down pointer.
Measured: GO (+2.76% mean / +2.57% median) on Mixed 10-run.

Phase 19-3c (OPTIONAL): pass env down from wrapper entry to remove the remaining alloc-side gate.

The originally planned propagation of env into the legacy fallback and metadata cache helpers was already done in Phase 19-3b.

3.4 Files to Modify

core/front/malloc_tiny_fast.h
- Phase 19-3a: UNLIKELY hint removal.
- Phase 19-3b: pass-down env to cold + legacy helpers.
core/box/tiny_legacy_fallback_box.h
- Phase 19-3b: add _with_env helper + keep wrapper.
core/box/tiny_metadata_cache_hot_box.h
- Phase 19-3b: add _with_env helper + keep wrapper.
(Optional) core/box/hak_wrappers.inc.h
- Pass env down from wrapper entry (alloc-side; removes one remaining gate).
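
The "add _with_env helper + keep wrapper" pattern can be sketched as below. The function names, the tiny_metadata_cache_eff field, and the fixed return value of the legacy gate are all assumptions made for illustration; the real helpers in the two box headers may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced snapshot with one illustrative field. */
typedef struct { bool tiny_metadata_cache_eff; } MockEnvSnapshot;

static MockEnvSnapshot mock_snapshot = { .tiny_metadata_cache_eff = true };
static bool mock_gate_on = true;
static int  mock_gate_calls = 0;       /* instrumentation only */

/* Gate + snapshot lookup combined; returns NULL when the snapshot is disabled. */
static const MockEnvSnapshot* mock_env_capture(void) {
    mock_gate_calls++;
    return mock_gate_on ? &mock_snapshot : NULL;
}

/* Stand-in for the pre-snapshot per-feature gate (the env == NULL fallback). */
static bool mock_metadata_cache_legacy_gate(void) { return false; }

/* New helper: consumes the pass-down pointer and is NULL-safe. */
bool mock_metadata_cache_effective_with_env(const MockEnvSnapshot* env) {
    return env ? env->tiny_metadata_cache_eff
               : mock_metadata_cache_legacy_gate();
}

/* Old signature kept as a thin wrapper so untouched callers still compile. */
bool mock_metadata_cache_effective(void) {
    return mock_metadata_cache_effective_with_env(mock_env_capture());
}
```

Hot callers that already hold env call the _with_env variant and pay no gate check; callers that were not migrated keep the old name and the old per-call cost, so the migration can proceed one call site at a time.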

4. Safety / Box Theory

4.1 Boundary Preservation

L0 (ENV gate):

HAKMEM_ENV_SNAPSHOT=0 → env==NULL → fallback to per-feature env gates
HAKMEM_ENV_SNAPSHOT=1 → env!=NULL → snapshot-based checks
(Optional) A dedicated “pass-down gate” can be introduced for A/B safety, but avoid adding a new hot-branch unless needed.

L1 (Hot inline):

No algorithmic changes, only ENV check consolidation
Existing malloc_tiny_fast() / free_tiny_fast() logic unchanged
env is read-only (const pointer)

L2 (Cold fallback):

Cold paths unchanged (no context propagation needed)
Legacy fallback accepts optional env

L3 (Stats/Observability):

Add counter: ENV_CONSOLIDATION_STAT_INC(env_passdown_hits)
Track: pass-down hits, fallback path usage
Perf verification: reduced hakmem_env_snapshot_enabled() hot samples

4.2 Fail-Fast

NULL env handling:

All functions accept env==NULL → fallback to existing path
No crashes, no undefined behavior
Debug builds: assert(env!=NULL) only if the pass-down gate is enabled (optional)

ENV invalidation:

Snapshot refresh is handled by the existing Phase 4 E1 mechanism:
- bench_profile uses hakmem_env_snapshot_refresh_from_env() after putenv()
- Snapshot updates in-place, so the env pointer remains valid
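
Why an in-place refresh keeps a captured pointer valid can be shown with a small model. The mock_* names are hypothetical, the refresh takes the raw string that getenv() would return instead of reading the environment itself, and the parsing rule (unset, empty, or "0" means off) is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct { bool tiny_c7_ultra_enabled; } MockEnvSnapshot;  /* hypothetical */

static MockEnvSnapshot mock_snapshot;  /* one fixed storage location */

const MockEnvSnapshot* mock_snapshot_get(void) { return &mock_snapshot; }

/* Parses a raw value as getenv() would return it; NULL means unset. */
static bool mock_parse_flag(const char* raw) {
    return raw != NULL && raw[0] != '\0' && strcmp(raw, "0") != 0;
}

/* Refresh overwrites fields in place; the snapshot's address never changes. */
void mock_snapshot_refresh(const char* raw_c7_ultra) {
    mock_snapshot.tiny_c7_ultra_enabled = mock_parse_flag(raw_c7_ultra);
}
```

A pointer captured before the refresh observes the new value on its next field read. A helper that copied a field into a local before the refresh would keep the old value for the rest of that call, which is acceptable here because env is captured per hot call.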

4.3 Rollback

Runtime rollback:

HAKMEM_ENV_SNAPSHOT=0  # Disable snapshot path (falls back to per-feature env gates)

Gradual rollout:

Phase 19-3a: UNLIKELY hint removal (DONE, GO)
Phase 19-3b: hot helper pass-down (DONE, GO)
Phase 19-3c: optional wrapper-entry pass-down (alloc-side; measure)

4.4 Observability

Stats counters (debug builds):

typedef struct {
    uint64_t env_passdown_hits;   // wrapper passed non-NULL env
    uint64_t env_null_fallback;   // env==NULL, used old path
    uint64_t malloc_env_path;     // malloc used env pass-down
    uint64_t free_env_path;       // free used env pass-down
} EnvConsolidationStats;
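
One possible way to wire these counters is sketched below. The struct is the one above; the MOCK_ENV_CONSOLIDATION_STATS build switch, the global instance name, and mock_note_malloc() are hypothetical. The increments are plain (not atomic), so totals are approximate under multithreading, which is assumed to be acceptable for debug builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_ENV_CONSOLIDATION_STATS 1  /* hypothetical switch; 0 compiles counters out */

typedef struct {
    uint64_t env_passdown_hits;   /* wrapper passed non-NULL env */
    uint64_t env_null_fallback;   /* env==NULL, used old path */
    uint64_t malloc_env_path;     /* malloc used env pass-down */
    uint64_t free_env_path;       /* free used env pass-down */
} EnvConsolidationStats;

static EnvConsolidationStats g_env_consolidation_stats;

#if MOCK_ENV_CONSOLIDATION_STATS
#define ENV_CONSOLIDATION_STAT_INC(field) (g_env_consolidation_stats.field++)
#else
#define ENV_CONSOLIDATION_STAT_INC(field) ((void)0)
#endif

/* Records which path a malloc call took; env is the pass-down pointer (may be NULL). */
void mock_note_malloc(const void* env) {
    if (env) {
        ENV_CONSOLIDATION_STAT_INC(env_passdown_hits);
        ENV_CONSOLIDATION_STAT_INC(malloc_env_path);
    } else {
        ENV_CONSOLIDATION_STAT_INC(env_null_fallback);
    }
}
```

With the switch set to 0 the macro expands to a no-op, so release builds carry no extra stores in the hot path.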

Perf validation:

Before: perf record shows hakmem_env_snapshot_enabled at ~7%
After: hakmem_env_snapshot_enabled should drop to <1%
Expected: deep helpers stop calling hakmem_env_snapshot_enabled() repeatedly (single capture per hot call)

A/B testing:

# Recommended: compare baseline vs optimized commits with the same bench script
scripts/run_mixed_10_cleanenv.sh

5. Expected Performance

5.1 Instruction Reduction Estimate

Current overhead (per malloc+free operation):

5 calls to hakmem_env_snapshot_enabled():
- Each: gate loads + branches (and legacy lazy-init path on first call)
- Total: ~5 gate checks per operation across hot helpers

After Phase 19-3:

1 call at wrapper entry:
- hakmem_env_snapshot_enabled() once
- hakmem_env_snapshot() once (when enabled)
- Deep helpers use if (env) + direct field reads (no further gate checks)

Reduction:

Gate checks: ~5 → 1 (wrapper entry only)
Branches: reduce repeated gate branches inside hot helpers
Instructions: target ~-10 instructions/op (order-of-magnitude)

5.2 Branch Reduction Estimate

Current branching:

hakmem_env_snapshot_enabled(): 2 branches (ctor_mode check + gate check)
Called 5 times = 10 branches/op

After Phase 19-3:

Gate check is done once at wrapper entry; deep helpers reuse env pointer.

Reduction: 10 → 4 branches/op, i.e. -6 branches/op (conservative target: -4 branches/op, accounting for overlap)

5.3 Throughput Estimate

Phase 19-1 Design Doc (Candidate B) estimates:

Instructions: -10.0/op
Branches: -4.0/op
Throughput: +5-8%

Phase 19-3 targets (aligned with Candidate B):

Instructions: -10.0/op ✓
Branches: -4.0/op ✓
Throughput: +5-8% (expected on top of Phase 19-2 baseline)

Validation criteria:

Perf stat shows instruction count reduction: ≥8.0/op (80% of estimate)
Perf stat shows branch count reduction: ≥3.0/op (75% of estimate)
Throughput improvement: ≥4.0% (80% of the +5% lower-bound estimate)

6. Risk Assessment

6.1 Technical Risks

MEDIUM: API Signature Changes

Risk: Adding context parameter changes function signatures
Mitigation: Keep old signatures, add _with_env variants
Rollback: NULL context → fallback to old implementation
Timeline: 1 phase at a time (19-3a → 19-3b → 19-3c)

MEDIUM: ENV Invalidation

Risk: Runtime ENV changes (bench_profile putenv) may not refresh context
Mitigation: Phase 19-3 inherits Phase 4 E1 refresh mechanism
Limitation: Same as current ENV snapshot (requires explicit refresh)
Future: Add version tracking (Option B) if runtime toggle needed

LOW: Register Pressure

Risk: Extra context parameter may increase register spills
Mitigation: Context is const pointer (register-friendly)
Validation: Check perf stat for stall increases
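Rollback: Revert the Phase 19-3b commit if a regression is detected (HAKMEM_ENV_SNAPSHOT=0 disables the snapshot path but does not remove the extra parameter)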

LOW: Lazy Init Overhead

Risk: First call to fastlane_env_ctx() adds init cost
Mitigation: One-time per thread (amortized over millions of ops)
Measurement: Expected <0.1% overhead (to be verified via perf)

6.2 Performance Risks

Risk: Overhead greater than savings

Scenario: Context struct access slower than optimized TLS reads
Likelihood: LOW (struct access is 1-2 instructions, TLS read is 5-10)
Detection: Perf stat will show instruction count increase
Rollback: HAKMEM_ENV_SNAPSHOT=0 immediately reverts to the per-feature env gates

Risk: Branch predictor thrashing

Scenario: New branch patterns confuse CPU predictor
Likelihood: LOW (reducing branches helps predictor)
Detection: Branch miss rate increases in perf stat
Rollback: HAKMEM_ENV_SNAPSHOT=0 immediately reverts to the per-feature env gates

6.3 Integration Risks

Risk: Breaks bench_profile ENV refresh

Scenario: Context cached before putenv(), stale values used
Likelihood: MEDIUM (same issue as Phase 4 E1)
Mitigation: Follow Phase 4 E1 pattern (explicit refresh hook)
Validation: Run bench suite with ENV toggles

Risk: Conflicts with FastLane Direct (Phase 19-2)

Scenario: Phase 19-2 removed wrapper, context injection point unclear
Likelihood: LOW (context added at new entry point)
Mitigation: Phase 19-3 builds on Phase 19-2 baseline
Validation: A/B test with FASTLANE_DIRECT=1 + HAKMEM_ENV_SNAPSHOT=1

7. Validation Checklist

7.1 Pre-Implementation

Verify Phase 4 E1 (ENV snapshot) is stable and working
Verify Phase 19-2 (FASTLANE_DIRECT) is stable baseline
Document current hakmem_env_snapshot_enabled() call sites (5 locations)
Create test plan for ENV refresh (bench_profile compatibility)

7.2 Implementation

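Remove the backwards UNLIKELY hints at the 5 hot call sites in malloc_tiny_fast.h (Phase 19-3a, DONE)
Thread const HakmemEnvSnapshot* env through free_tiny_fast() / free_tiny_fast_hot() and malloc_tiny_fast_for_class() (Phase 19-3b, DONE)
Add _with_env helpers to tiny_legacy_fallback_box.h and tiny_metadata_cache_hot_box.h (Phase 19-3b, DONE)
(Optional) Pass env down from wrapper entry in hak_wrappers.inc.h (Phase 19-3c)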
(Optional) Add a dedicated pass-down gate if A/B within a single binary is needed
Add stats counters (debug builds)

7.3 Testing (Per Phase)

Phase 19-3a (malloc path):

Correctness: Run make test suite (all tests pass)
Perf stat: Measure instruction/branch reduction (ENV=0 vs ENV=1)
Perf record: Verify hakmem_env_snapshot_enabled samples drop
Benchmark: Mixed 10-run (expect +2-3% from malloc path alone)

Phase 19-3b (free path):

Correctness: Run make test + Larson (all tests pass)
Perf stat: Measure cumulative reduction (vs baseline)
Perf record: Verify further reduction in ENV check samples
Benchmark: Mixed 10-run (expect +3-5% cumulative)

Phase 19-3c (legacy + metadata):

Correctness: Full test suite including multithreaded
Perf stat: Verify -10.0 instr/op, -4.0 branches/op (goal)
Perf record: hakmem_env_snapshot_enabled <1% samples
Benchmark: Mixed 10-run (expect +5-8% cumulative)

7.4 A/B Test (Final Validation)

Benchmark suite:

# Run the same cleanenv script on baseline vs optimized commits
scripts/run_mixed_10_cleanenv.sh

GO/NO-GO criteria:

GO: Mean throughput +5.0% or higher (within ±20% of +5-8% estimate)
NEUTRAL: +2.0% to +5.0% → keep as research box, preset-only promotion
NO-GO: <+2.0% or regression → revert, analyze perf data
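
The three bands above can be written down as a small classifier, for example to post-process the output of the bench script. The type and function names are hypothetical, and how mean throughput is extracted from the script output is left open.

```c
#include <assert.h>

typedef enum { VERDICT_NO_GO, VERDICT_NEUTRAL, VERDICT_GO } MockVerdict;

/* Mean throughput gain in percent for a baseline/optimized pair of runs. */
double mock_gain_pct(double base_ops_per_s, double opt_ops_per_s) {
    return (opt_ops_per_s / base_ops_per_s - 1.0) * 100.0;
}

/* Applies the GO / NEUTRAL / NO-GO bands; regressions fall into NO-GO. */
MockVerdict mock_classify(double gain_pct) {
    if (gain_pct >= 5.0) return VERDICT_GO;
    if (gain_pct >= 2.0) return VERDICT_NEUTRAL;
    return VERDICT_NO_GO;
}
```

The lower bound of each band is inclusive in this sketch, so exactly +5.0% counts as GO and exactly +2.0% counts as NEUTRAL.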

Perf stat validation:

perf stat -e cycles,instructions,branches,branch-misses,L1-icache-load-misses \
  -- ./bench_random_mixed_hakmem 200000000 400 1

Expected deltas:

Instructions/op: -8.0 to -12.0 (target: -10.0)
Branches/op: -3.0 to -5.0 (target: -4.0)
Branch-miss%: unchanged or slightly better (fewer branches)
Throughput: +4.0% to +10.0% (target: +5-8%)

8. Rollout Plan

8.1 Phase 19-3a: malloc Path (Week 1)

Scope: Add context to malloc hot path

Modify malloc_tiny_fast() to accept context
Update C7 ULTRA check (line 236)
Add fastlane_env_ctx_box.h
Update wrapper.c malloc()

Timeline: 4-6 hours implementation + 2 hours testing
Risk: LOW (isolated to alloc path)
Rollback: revert Phase 19-3b commit (or set HAKMEM_ENV_SNAPSHOT=0 to disable snapshot path)

8.2 Phase 19-3b: free Path (Week 1)

Scope: Add context to free hot path

Modify free_tiny_fast() to accept context
Update C7 ULTRA checks (lines 624, 830)
Update front V3 checks (lines 403, 910)
Update wrapper.c free()

Timeline: 4-6 hours implementation + 2 hours testing
Risk: LOW-MEDIUM (more call sites than malloc)
Rollback: revert Phase 19-3b commit (or set HAKMEM_ENV_SNAPSHOT=0 to disable snapshot path)

8.3 Phase 19-3c: Legacy + Metadata (Week 2)

Scope: Propagate context to helper boxes

Update tiny_legacy_fallback_box.h (line 28)
Update tiny_metadata_cache_hot_box.h (line 64)
Add context parameter to helper functions

Timeline: 3-4 hours implementation + 2 hours testing
Risk: MEDIUM (touches multiple boxes)
Rollback: revert Phase 19-3b commit (or set HAKMEM_ENV_SNAPSHOT=0 to disable snapshot path)

8.4 Graduate (Week 2-3)

Promotion criteria:

All phases pass A/B testing (GO verdict)
Cumulative throughput gain ≥+5.0%
No correctness regressions (all tests pass)
Perf validation confirms instruction reduction

Promotion actions:

Ensure MIXED_TINYV3_C7_SAFE preset keeps HAKMEM_ENV_SNAPSHOT=1 (already)
Document in optimization roadmap
Update Box Theory index
Keep ENV default=0 (opt-in) until production validation

Rollback strategy:

Preset level: Remove from preset, keep code
Code level: revert the Phase 19-3b commit
Emergency: set HAKMEM_ENV_SNAPSHOT=0 (falls back to per-feature env gates)

9. Future Optimization Opportunities

9.1 Version-Based Invalidation (Option B)

If runtime ENV changes become important:

Add global g_env_snapshot_version counter
Increment on ENV change (bench_profile, runtime toggle)
Each thread checks version, refreshes context if stale
Overhead: +1 global read per operation (still net win vs 10 TLS reads)

9.2 Route Table Consolidation

Extend context to include pre-computed routes:

typedef struct {
    bool c7_ultra_enabled;
    bool front_v3_enabled;
    bool metadata_cache_eff;
    SmallRouteKind route_kind[8];  // Pre-computed per class
} FastLaneEnvCtx;

Benefit: Eliminate tiny_static_route_get_kind_fast() calls
Impact: Additional -3-4 instructions/op, -1-2 branches/op
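
A sketch of how the pre-computed table might be filled and consumed follows. MockRouteKind stands in for SmallRouteKind, the routing rule (class 7 goes to C7 ULTRA when enabled, everything else is legacy) is a simplification invented for this example, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for SmallRouteKind; the real enum has its own members. */
typedef enum { MOCK_ROUTE_LEGACY = 0, MOCK_ROUTE_C7_ULTRA = 1 } MockRouteKind;

typedef struct {
    bool c7_ultra_enabled;
    MockRouteKind route_kind[8];   /* one entry per size class */
} MockFastLaneCtx;

/* Cold: resolve every class once, whenever the context is (re)built. */
void mock_ctx_build_routes(MockFastLaneCtx* ctx) {
    for (int c = 0; c < 8; c++) {
        ctx->route_kind[c] = (c == 7 && ctx->c7_ultra_enabled)
                                 ? MOCK_ROUTE_C7_ULTRA
                                 : MOCK_ROUTE_LEGACY;
    }
}

/* Hot: a single indexed load replaces the per-call route computation. */
MockRouteKind mock_route_for_class(const MockFastLaneCtx* ctx, int class_idx) {
    assert(class_idx >= 0 && class_idx < 8);
    return ctx->route_kind[class_idx];
}
```

The table has to be rebuilt whenever a flag it depends on changes, so this extension inherits the same refresh requirement as the snapshot itself.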

9.3 Constructor Init (Option C Hybrid)

For production builds (no bench_profile):

Use __attribute__((constructor)) to init context at startup
Eliminate lazy init check (g_init always 1)
Benefit: -1 branch per operation (init check)
Limitation: No runtime ENV changes (production-only optimization)

10. Comparison to Phase 4 E1

Phase 4 E1 (ENV Snapshot)

What it did:

Consolidated 3 ENV reads (tiny_c7_ultra_enabled_env, tiny_front_v3_enabled, tiny_metadata_cache_enabled) into 1 snapshot struct
Result: +3.92% throughput (Mixed)
Status: Promoted in presets (global default still OFF)

Limitation:

Still calls hakmem_env_snapshot_enabled() 5 times per operation
Each call: gate loads + branches
ENV check overhead remains: ~7% perf samples