# Phase 19-3: ENV Snapshot Consolidation — Design

## 0. Goal

**Objective**: Reduce ENV check overhead from per-operation 3+ TLS reads to 1 TLS read

**Expected Impact (target)**: -10.0 instructions/op, -4.0 branches/op, +3-8% throughput

**Risk Level**: MEDIUM (ENV invalidation handling required)

**Box Name**: EnvSnapshotConsolidationBox (Phase 19-3)

### Context

Phase 19 perf analysis revealed that ENV checks are executed **3+ times per operation**:

- `hakmem_env_snapshot_enabled()`: called 5 times in malloc/free hot paths (lines 236, 403, 624, 830, 910 in malloc_tiny_fast.h)
- Each call triggers:
  - TLS read of `g_hakmem_env_snapshot_ctor_mode`
  - TLS read of `g_hakmem_env_snapshot_gate`
  - Branch prediction overhead
  - Potential lazy initialization check

**Current overhead**: ~7% perf samples on `hakmem_env_snapshot_enabled()` and related ENV checks.

**Phase 4 E1 Status**: ENV snapshot infrastructure exists (global default OFF, but promoted ON in presets like `MIXED_TINYV3_C7_SAFE`).

Phase 19-3 aims to:

1. Eliminate redundant `hakmem_env_snapshot_enabled()` checks (5 calls → 1 call)
2. Make ENV snapshot the **default path** (not research box)
3. Further consolidate ENV reads into entry-point snapshot

### Phase 19-3a Result (validated)

Phase 19-3a removed the call-site UNLIKELY hint:

`__builtin_expect(hakmem_env_snapshot_enabled(), 0)` → `hakmem_env_snapshot_enabled()`

Observed impact: **GO (+4.42% throughput)** on Mixed.

This validates that the remaining ENV work is dominated by branch/layout effects, not just raw "read cost".

### Phase 19-3b Result (validated)

Phase 19-3b consolidated snapshot reads by capturing `env` once per hot call and passing it down into nested helpers.

Observed impact: **GO (+2.76% mean / +2.57% median)** on Mixed 10-run (`scripts/run_mixed_10_cleanenv.sh`).

---

## 1. Current State Analysis

### 1.1 ENV Check Locations (Per-Operation)

Based on code analysis, ENV checks occur in these hot path locations:

**malloc_tiny_fast() path**:
1. Line 236: C7 ULTRA check (`hakmem_env_snapshot_enabled()` → `hakmem_env_snapshot()`)
2. Line 403: Front V3 snapshot check for `free()` (in `free_tiny_fast_v4_hotcold`)
3. Line 910: Front V3 snapshot check for `free()` (in `free_tiny_fast_v4_larson`)

**free_tiny_fast() paths**:
1. Line 624: C7 ULTRA check (`hakmem_env_snapshot_enabled()` → `hakmem_env_snapshot()`)
2. Line 830: C7 ULTRA check (duplicate in `free_tiny_fast_v4_larson`)

**tiny_legacy_fallback_box.h**:
- Line 28: `hakmem_env_snapshot_enabled()` for front_snap + metadata_cache_on

**tiny_metadata_cache_hot_box.h**:
- Line 64: `hakmem_env_snapshot_enabled()` for metadata cache effective check

### 1.2 TLS Read Overhead Analysis

Each `hakmem_env_snapshot_enabled()` call performs:

```c
int ctor_mode = g_hakmem_env_snapshot_ctor_mode;  // TLS read #1
if (ctor_mode == 1) {
    return g_hakmem_env_snapshot_gate != 0;        // TLS read #2 (ctor path)
}
// Legacy path
if (g_hakmem_env_snapshot_gate == -1) {            // TLS read #2 (legacy path)
    // Lazy init with getenv()
}
```

**Per-operation cost** (when snapshot enabled):
- **5 calls** × **2 TLS reads** = **10 TLS reads/op**
- Plus: 5× branch on `ctor_mode`, 5× branch on snapshot enabled
- Actual measurement: ~7% perf samples

**Per-operation cost** (when snapshot disabled - current default):
- **5 calls** × **2-3 TLS reads** = **10-15 TLS reads/op**
- Plus: lazy init checks, getenv() overhead on first call per thread

### 1.3 Redundancy Analysis

**Problem**: Each hot path independently checks `hakmem_env_snapshot_enabled()`:
- malloc C7 ULTRA: check at line 236
- free C7 ULTRA: check at line 624 (same operation, different code path)
- free front V3: check at line 403 and 910 (same snapshot needed)
- Legacy fallback: check at line 28 (called from above paths)
- Metadata cache: check at line 64 (called from above paths)

**Redundancy**: For a typical malloc+free pair:
- Current: 5+ `hakmem_env_snapshot_enabled()` calls = 10-15 TLS reads
- Optimal: 1 entry-point snapshot = 1-2 TLS reads

**Gap**: 8-13 redundant TLS reads per operation

---

## 2. Design Options

### Option A: Entry-Point Snapshot Pass-Down (Recommended)

**Concept**: Capture the existing `HakmemEnvSnapshot` pointer once at malloc/free entry, and pass it down.
This avoids creating a new TLS context and automatically stays compatible with `hakmem_env_snapshot_refresh_from_env()` (refresh updates the snapshot in-place).

**Architecture**:
```c
// At wrapper entry (malloc/free):
const HakmemEnvSnapshot* env = hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;

// In malloc_tiny_fast():
void* malloc_tiny_fast_with_env(size_t size, const HakmemEnvSnapshot* env) {
    // class_idx: size class derived from `size` (computation elided in this sketch)
    // Use env->tiny_c7_ultra_enabled instead of calling hakmem_env_snapshot_enabled()
    if (env && class_idx == 7 && env->tiny_c7_ultra_enabled) {
        // Direct check, no TLS read
    }
}
```

**Pros**:
- **Minimal refactoring**: Add context parameter to existing functions
- **Type safety**: Compiler enforces context passing
- **Clear boundary**: ENV decisions made at entry, logic below is pure
- **Easy rollback**: Context parameter can be NULL (fallback to old path)

**Cons**:
- **API threading**: Some hot helpers need an extra pointer parameter (`env`) or `_with_env` variants.
- **Register pressure**: Extra parameter may affect register allocation (verify via perf stat).

**Risk**: LOW-MEDIUM (mechanical threading, rollback is simple)

---

### Option B: TLS Cached Context (Alternative)

**Concept**: Maintain thread-local ENV context, refresh on invalidation events.

**Architecture**:
```c
// Global TLS context (replaces per-call ENV checks)
static __thread FastLaneEnvCtx g_fastlane_ctx;
static __thread int g_fastlane_ctx_version = 0;
extern int g_env_snapshot_version;  // Incremented on ENV change

static inline const FastLaneEnvCtx* fastlane_ctx_get(void) {
    if (__builtin_expect(g_fastlane_ctx_version != g_env_snapshot_version, 0)) {
        // Refresh from snapshot (rare)
        const HakmemEnvSnapshot* snap = hakmem_env_snapshot();
        g_fastlane_ctx.c7_ultra_enabled = snap->tiny_c7_ultra_enabled;
        // ... copy fields
        g_fastlane_ctx_version = g_env_snapshot_version;
    }
    return &g_fastlane_ctx;
}

// In hot path:
const FastLaneEnvCtx* ctx = fastlane_ctx_get();  // 1 TLS read + 1 branch
if (class_idx == 7 && ctx->c7_ultra_enabled) {
    // Direct struct access
}
```

**Pros**:
- **No API changes**: Existing functions unchanged
- **Single TLS read**: Version check is fast (1 global read + 1 TLS read)
- **Automatic invalidation**: Version bump triggers refresh
- **Easy integration**: Drop-in replacement for `hakmem_env_snapshot_enabled()`

**Cons**:
- **Version management**: Need global version counter + invalidation hooks
- **Stale data risk**: If version check is missed, stale context used
- **Init complexity**: Each thread needs lazy init + version tracking
- **Debugging**: Harder to trace when context was last refreshed

**Risk**: MEDIUM (version invalidation must be bulletproof)

---

### Option C: Init-Time Fixed (High Risk)

**Concept**: Read ENV once at process init, freeze configuration for lifetime.

**Architecture**:
```c
// Global constants (set in constructor)
static bool g_c7_ultra_enabled_fixed;
static bool g_front_v3_enabled_fixed;

__attribute__((constructor))
static void fastlane_env_init(void) {
    const HakmemEnvSnapshot* snap = hakmem_env_snapshot();
    g_c7_ultra_enabled_fixed = snap->tiny_c7_ultra_enabled;
    g_front_v3_enabled_fixed = snap->tiny_front_v3_enabled;
}

// Hot path: direct global read (no TLS)
if (class_idx == 7 && g_c7_ultra_enabled_fixed) {
    // ...
}
```

**Pros**:
- **Zero TLS reads**: Direct global variable access
- **Maximum performance**: Compiler can constant-fold if known at link time
- **Simple implementation**: No lazy init, no version tracking

**Cons**:
- **No runtime ENV changes**: ENV toggles require process restart
- **Breaks bench_profile**: `putenv()` in benchmarks will not work
- **No A/B testing**: Cannot toggle ENV for same-binary comparison
- **Box Theory violation**: No rollback/toggle capability

**Risk**: HIGH (breaks existing workflow, violates Box Theory)

---

### Recommended: **Option A (Entry-Point Snapshot Pass-Down)**

**Reasoning**:
1. **Preserves Box Theory**: `env==NULL` → fallback to old path
2. **Clear separation**: ENV decisions at entry, pure logic below
3. **Benchmark compatible**: Works with `bench_profile` putenv + `hakmem_env_snapshot_refresh_from_env()` (snapshot updates in-place)
4. **Performance**: Removes repeated `hakmem_env_snapshot_enabled()` checks inside deep helpers

**Trade-off acceptance**:
- Accept API changes (mechanical, low risk)
- Accept extra parameter (register pressure acceptable for hot path)
- Reject Option B's version management complexity
- Reject Option C's inflexibility

---

## 3. Implementation Plan (Option A)

### 3.1 Box Design

**Box Name**: `EnvSnapshotConsolidationBox` (Phase 19-3)

**Files**:
- Modified: `core/front/malloc_tiny_fast.h`
  - Phase 19-3a: remove backwards `__builtin_expect(..., 0)` hints (DONE, +4.42% GO).
  - Phase 19-3b: thread `const HakmemEnvSnapshot* env` down to eliminate repeated `hakmem_env_snapshot_enabled()` checks (DONE, +2.76% GO).
- Modified: `core/box/tiny_legacy_fallback_box.h`
  - Add `_with_env` helper (Phase 19-3b).
- Modified: `core/box/tiny_metadata_cache_hot_box.h`
  - Add `_with_env` helper (Phase 19-3b).
- Optional: `core/box/hak_wrappers.inc.h`
  - If needed, compute `env` once per wrapper entry and pass it down (removes the remaining alloc-side gate in `malloc_tiny_fast_for_class()`).

**ENV Gate**:
- Base: `HAKMEM_ENV_SNAPSHOT=0/1` (Phase 4 E1 gate; promoted ON in presets)

**Rollback**:
- Snapshot behavior: set `HAKMEM_ENV_SNAPSHOT=0` to fall back to per-feature env gates.
- Pass-down refactor: revert the Phase 19-3b commit (or add a dedicated pass-down gate if future A/B is needed).

### 3.2 API Design

**Pass-down API (recommended)**:
```c
// Wrapper entry (malloc/free): read snapshot ONCE, pass down.
const HakmemEnvSnapshot* env = hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;

// Hot helpers accept optional env pointer (NULL-safe).
void* malloc_tiny_fast_with_env(size_t size, const HakmemEnvSnapshot* env);
int free_tiny_fast_with_env(void* ptr, const HakmemEnvSnapshot* env);
```

### 3.3 Migration Plan (Incremental)

**Phase 19-3a (DONE)**: remove backwards UNLIKELY hints at the 5 hottest call sites in `core/front/malloc_tiny_fast.h`.
- `__builtin_expect(hakmem_env_snapshot_enabled(), 0)` → `hakmem_env_snapshot_enabled()`
- Measured: **GO (+4.42%)**

**Phase 19-3b (DONE)**: capture `env` once per hot call and pass it down into nested helpers.
- In `core/front/malloc_tiny_fast.h`:
  - `free_tiny_fast()` / `free_tiny_fast_hot()` capture `env` once and pass it to cold + legacy helpers.
  - `malloc_tiny_fast_for_class()` reuses the same snapshot for `tiny_policy_hot_get_route_with_env(...)`.
- In `core/box/tiny_legacy_fallback_box.h` and `core/box/tiny_metadata_cache_hot_box.h`:
  - add `_with_env` helpers to consume the pass-down pointer.
- Measured: **GO (+2.76% mean / +2.57% median)** on Mixed 10-run.

**Phase 19-3c (OPTIONAL)**: propagate `env` into legacy fallback + metadata cache helpers to eliminate the remaining call sites:
- (Already done in Phase 19-3b.) Optional next: pass `env` down from wrapper entry to remove the remaining alloc-side gate.

### 3.4 Files to Modify

1. `core/front/malloc_tiny_fast.h`
   - Phase 19-3a: UNLIKELY hint removal.
   - Phase 19-3b: pass-down `env` to cold + legacy helpers.
2. `core/box/tiny_legacy_fallback_box.h`
   - Phase 19-3b: add `_with_env` helper + keep wrapper.
3. `core/box/tiny_metadata_cache_hot_box.h`
   - Phase 19-3b: add `_with_env` helper + keep wrapper.
4. (Optional) `core/box/hak_wrappers.inc.h`
   - Pass `env` down from wrapper entry (alloc-side; removes one remaining gate).

---

## 4. Safety / Box Theory

### 4.1 Boundary Preservation

**L0 (ENV gate)**:
- `HAKMEM_ENV_SNAPSHOT=0` → `env==NULL` → fallback to per-feature env gates
- `HAKMEM_ENV_SNAPSHOT=1` → `env!=NULL` → snapshot-based checks
- (Optional) A dedicated “pass-down gate” can be introduced for A/B safety, but avoid adding a new hot-branch unless needed.
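
The L0 contract above (`env==NULL` falls back to the per-feature gates, `env!=NULL` reads the snapshot) can be sketched as a NULL-safe helper. This is a minimal illustration under stated assumptions, not the project's code: the one-field `HakmemEnvSnapshot`, `c7_ultra_enabled_with_env()`, `tiny_c7_ultra_enabled_env_stub()`, and the call counter are hypothetical stand-ins for the real definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down snapshot; the real HakmemEnvSnapshot has more fields. */
typedef struct {
    bool tiny_c7_ultra_enabled;
} HakmemEnvSnapshot;

/* Hypothetical stand-in for a per-feature ENV gate (the "old path").
 * The counter only exists so the sketch can show when the old path runs. */
static int  g_legacy_gate_calls = 0;
static bool g_legacy_c7_ultra   = false;

static bool tiny_c7_ultra_enabled_env_stub(void) {
    g_legacy_gate_calls++;
    return g_legacy_c7_ultra;
}

/* NULL-safe helper: with a snapshot, read the field directly;
 * without one, fall back to the per-feature gate. */
static inline bool c7_ultra_enabled_with_env(const HakmemEnvSnapshot* env) {
    if (env) {
        return env->tiny_c7_ultra_enabled;   /* snapshot path: no gate call */
    }
    return tiny_c7_ultra_enabled_env_stub(); /* fallback path */
}
```

A caller that captured `env` at wrapper entry passes it straight through; a caller with the snapshot disabled passes `NULL` and gets the pre-existing behavior, which is what keeps the rollback path intact.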

**L1 (Hot inline)**:
- No algorithmic changes, only ENV check consolidation
- Existing `malloc_tiny_fast()` / `free_tiny_fast()` logic unchanged
- `env` is read-only (const pointer)

**L2 (Cold fallback)**:
- Cold paths unchanged (no context propagation needed)
- Legacy fallback accepts optional `env`

**L3 (Stats/Observability)**:
- Add counter: `ENV_CONSOLIDATION_STAT_INC(enabled_calls)`
- Track: pass-down hits, fallback path usage
- Perf verification: reduced `hakmem_env_snapshot_enabled()` hot samples

### 4.2 Fail-Fast

**NULL env handling**:
- All functions accept `env==NULL` → fallback to existing path
- No crashes, no undefined behavior
- Debug builds: assert(`env!=NULL`) only if the pass-down gate is enabled (optional)

**ENV invalidation**:
- Snapshot refresh is handled by the existing Phase 4 E1 mechanism:
  - `bench_profile` uses `hakmem_env_snapshot_refresh_from_env()` after `putenv()`
  - Snapshot updates in-place, so the `env` pointer remains valid

### 4.3 Rollback

**Runtime rollback**:
```sh
HAKMEM_ENV_SNAPSHOT=0  # Disable snapshot path (falls back to per-feature env gates)
```

**Gradual rollout**:
1. Phase 19-3a: UNLIKELY hint removal (DONE, GO)
2. Phase 19-3b: hot helper pass-down (DONE, GO)
3. Phase 19-3c: optional wrapper-entry pass-down (alloc-side; measure)

### 4.4 Observability

**Stats counters** (debug builds):
```c
typedef struct {
    uint64_t env_passdown_hits;   // wrapper passed non-NULL env
    uint64_t env_null_fallback;   // env==NULL, used old path
    uint64_t malloc_env_path;     // malloc used env pass-down
    uint64_t free_env_path;       // free used env pass-down
} EnvConsolidationStats;
```

**Perf validation**:
- Before: `perf record` shows `hakmem_env_snapshot_enabled` at ~7%
- After: `hakmem_env_snapshot_enabled` should drop to <1%
- Expected: deep helpers stop calling `hakmem_env_snapshot_enabled()` repeatedly (single capture per hot call)

**A/B testing**:
```sh
# Recommended: compare baseline vs optimized commits with the same bench script
scripts/run_mixed_10_cleanenv.sh
```

---

## 5. Expected Performance

### 5.1 Instruction Reduction Estimate

**Current overhead** (per malloc+free operation):
- 5 calls to `hakmem_env_snapshot_enabled()`:
  - Each: gate loads + branches (and legacy lazy-init path on first call)
  - Total: **~5 gate checks** per operation across hot helpers

**After Phase 19-3**:
- 1 call at wrapper entry:
  - `hakmem_env_snapshot_enabled()` once
  - `hakmem_env_snapshot()` once (when enabled)
- Deep helpers use `if (env)` + direct field reads (no further gate checks)

**Reduction**:
- **Gate checks**: ~5 → 1 (wrapper entry only)
- **Branches**: reduce repeated gate branches inside hot helpers
- **Instructions**: target ~-10 instructions/op (order-of-magnitude)

### 5.2 Branch Reduction Estimate

**Current branching**:
- `hakmem_env_snapshot_enabled()`: 2 branches (ctor_mode check + gate check)
- Called 5 times = **10 branches/op**

**After Phase 19-3**:
- Gate check is done once at wrapper entry; deep helpers reuse `env` pointer.

**Reduction**: 10 → 4 = **-6 branches/op** (conservative estimate: -4 branches/op accounting for overlap)

### 5.3 Throughput Estimate

**Phase 19-1 Design Doc** (Candidate B) estimates:
- Instructions: -10.0/op
- Branches: -4.0/op
- Throughput: **+5-8%**

**Phase 19-3 targets** (aligned with Candidate B):
- Instructions: **-10.0/op** ✓
- Branches: **-4.0/op** ✓
- Throughput: **+5-8%** (expected on top of Phase 19-2 baseline)

**Validation criteria**:
- Perf stat shows instruction count reduction: ≥8.0/op (80% of estimate)
- Perf stat shows branch count reduction: ≥3.0/op (75% of estimate)
- Throughput improvement: ≥4.0% (80% of lower bound estimate)

---

## 6. Risk Assessment

### 6.1 Technical Risks

**MEDIUM: API Signature Changes**
- Risk: Adding context parameter changes function signatures
- Mitigation: Keep old signatures, add `_ctx` variants
- Rollback: NULL context → fallback to old implementation
- Timeline: 1 phase at a time (19-3a → 19-3b → 19-3c)

**MEDIUM: ENV Invalidation**
- Risk: Runtime ENV changes (bench_profile putenv) may not refresh context
- Mitigation: Phase 19-3 inherits Phase 4 E1 refresh mechanism
- Limitation: Same as current ENV snapshot (requires explicit refresh)
- Future: Add version tracking (Option B) if runtime toggle needed

**LOW: Register Pressure**
- Risk: Extra context parameter may increase register spills
- Mitigation: Context is const pointer (register-friendly)
- Validation: Check perf stat for stall increases
- Rollback: Disable via ENV if regression detected

**LOW: Lazy Init Overhead**
- Risk: First call to `fastlane_env_ctx()` adds init cost
- Mitigation: One-time per thread (amortized over millions of ops)
- Measurement: Should be <0.1% overhead (verified via perf)

### 6.2 Performance Risks

**Risk: Overhead greater than savings**
- Scenario: Context struct access slower than optimized TLS reads
- Likelihood: LOW (struct access is 1-2 instructions, TLS read is 5-10)
- Detection: Perf stat will show instruction count increase
- Rollback: ENV=0 immediately reverts

**Risk: Branch predictor thrashing**
- Scenario: New branch patterns confuse CPU predictor
- Likelihood: LOW (reducing branches helps predictor)
- Detection: Branch miss rate increases in perf stat
- Rollback: ENV=0 immediately reverts

### 6.3 Integration Risks

**Risk: Breaks bench_profile ENV refresh**
- Scenario: Context cached before putenv(), stale values used
- Likelihood: MEDIUM (same issue as Phase 4 E1)
- Mitigation: Follow Phase 4 E1 pattern (explicit refresh hook)
- Validation: Run bench suite with ENV toggles

**Risk: Conflicts with FastLane Direct (Phase 19-2)**
- Scenario: Phase 19-2 removed wrapper, context injection point unclear
- Likelihood: LOW (context added at new entry point)
- Mitigation: Phase 19-3 builds on Phase 19-2 baseline
- Validation: A/B test with FASTLANE_DIRECT=1 + ENV_CONSOLIDATION=1

---

## 7. Validation Checklist

### 7.1 Pre-Implementation

- [ ] Verify Phase 4 E1 (ENV snapshot) is stable and working
- [ ] Verify Phase 19-2 (FASTLANE_DIRECT) is stable baseline
- [ ] Document current `hakmem_env_snapshot_enabled()` call sites (5 locations)
- [ ] Create test plan for ENV refresh (bench_profile compatibility)

### 7.2 Implementation

- [ ] Implement `fastlane_env_ctx_box.h` (context struct + getter)
- [ ] Add `malloc_tiny_fast_ctx()` variant (Phase 19-3a)
- [ ] Add `free_tiny_fast_ctx()` variant (Phase 19-3b)
- [ ] Propagate context to `tiny_legacy_fallback_box.h` (Phase 19-3c)
- [ ] (Optional) Add a dedicated pass-down gate if A/B within a single binary is needed
- [ ] Add stats counters (debug builds)

### 7.3 Testing (Per Phase)

**Phase 19-3a (malloc path)**:
- [ ] Correctness: Run `make test` suite (all tests pass)
- [ ] Perf stat: Measure instruction/branch reduction (ENV=0 vs ENV=1)
- [ ] Perf record: Verify `hakmem_env_snapshot_enabled` samples drop
- [ ] Benchmark: Mixed 10-run (expect +2-3% from malloc path alone)

**Phase 19-3b (free path)**:
- [ ] Correctness: Run `make test` + Larson (all tests pass)
- [ ] Perf stat: Measure cumulative reduction (vs baseline)
- [ ] Perf record: Verify further reduction in ENV check samples
- [ ] Benchmark: Mixed 10-run (expect +3-5% cumulative)

**Phase 19-3c (legacy + metadata)**:
- [ ] Correctness: Full test suite including multithreaded
- [ ] Perf stat: Verify -10.0 instr/op, -4.0 branches/op (goal)
- [ ] Perf record: `hakmem_env_snapshot_enabled` <1% samples
- [ ] Benchmark: Mixed 10-run (expect +5-8% cumulative)

### 7.4 A/B Test (Final Validation)

**Benchmark suite**:
```sh
# Run the same cleanenv script on baseline vs optimized commits
scripts/run_mixed_10_cleanenv.sh
```

**GO/NO-GO criteria**:
- **GO**: Mean throughput +5.0% or higher (within ±20% of +5-8% estimate)
- **NEUTRAL**: +2.0% to +5.0% → keep as research box, preset-only promotion
- **NO-GO**: <+2.0% or regression → revert, analyze perf data

**Perf stat validation**:
```sh
perf stat -e cycles,instructions,branches,branch-misses,L1-icache-load-misses \
  -- ./bench_random_mixed_hakmem 200000000 400 1
```

**Expected deltas**:
- Instructions/op: -8.0 to -12.0 (target: -10.0)
- Branches/op: -3.0 to -5.0 (target: -4.0)
- Branch-miss%: unchanged or slightly better (fewer branches)
- Throughput: +4.0% to +10.0% (target: +5-8%)

---

## 8. Rollout Plan

### 8.1 Phase 19-3a: malloc Path (Week 1)

**Scope**: Add context to malloc hot path
- Modify `malloc_tiny_fast()` to accept context
- Update C7 ULTRA check (line 236)
- Add `fastlane_env_ctx_box.h`
- Update wrapper.c `malloc()`

**Timeline**: 4-6 hours implementation + 2 hours testing

**Risk**: LOW (isolated to alloc path)

**Rollback**: revert Phase 19-3b commit (or set `HAKMEM_ENV_SNAPSHOT=0` to disable snapshot path)

### 8.2 Phase 19-3b: free Path (Week 1)

**Scope**: Add context to free hot path
- Modify `free_tiny_fast()` to accept context
- Update C7 ULTRA checks (lines 624, 830)
- Update front V3 checks (lines 403, 910)
- Update wrapper.c `free()`

**Timeline**: 4-6 hours implementation + 2 hours testing

**Risk**: LOW-MEDIUM (more call sites than malloc)

**Rollback**: revert Phase 19-3b commit (or set `HAKMEM_ENV_SNAPSHOT=0` to disable snapshot path)

### 8.3 Phase 19-3c: Legacy + Metadata (Week 2)

**Scope**: Propagate context to helper boxes
- Update `tiny_legacy_fallback_box.h` (line 28)
- Update `tiny_metadata_cache_hot_box.h` (line 64)
- Add context parameter to helper functions

**Timeline**: 3-4 hours implementation + 2 hours testing

**Risk**: MEDIUM (touches multiple boxes)

**Rollback**: revert Phase 19-3b commit (or set `HAKMEM_ENV_SNAPSHOT=0` to disable snapshot path)

### 8.4 Graduate (Week 2-3)

**Promotion criteria**:
- All phases pass A/B testing (GO verdict)
- Cumulative throughput gain ≥+5.0%
- No correctness regressions (all tests pass)
- Perf validation confirms instruction reduction

**Promotion actions**:
1. Ensure `MIXED_TINYV3_C7_SAFE` preset keeps `HAKMEM_ENV_SNAPSHOT=1` (already)
2. Document in optimization roadmap
3. Update Box Theory index
4. Keep ENV default=0 (opt-in) until production validation

**Rollback strategy**:
- Preset level: Remove from preset, keep code
- Code level: revert the Phase 19-3b commit
- Emergency: set `HAKMEM_ENV_SNAPSHOT=0` (falls back to per-feature env gates)

---

## 9. Future Optimization Opportunities

### 9.1 Version-Based Invalidation (Option B)

If runtime ENV changes become important:
- Add global `g_env_snapshot_version` counter
- Increment on ENV change (bench_profile, runtime toggle)
- Each thread checks version, refreshes context if stale
- Overhead: +1 global read per operation (still net win vs 10 TLS reads)

### 9.2 Route Table Consolidation

Extend context to include pre-computed routes:
```c
typedef struct {
    bool c7_ultra_enabled;
    bool front_v3_enabled;
    bool metadata_cache_eff;
    SmallRouteKind route_kind[8];  // Pre-computed per class
} FastLaneEnvCtx;
```

**Benefit**: Eliminate `tiny_static_route_get_kind_fast()` calls

**Impact**: Additional -3-4 instructions/op, -1-2 branches/op

### 9.3 Constructor Init (Option C Hybrid)

For production builds (no bench_profile):
- Use `__attribute__((constructor))` to init context at startup
- Eliminate lazy init check (g_init always 1)
- Benefit: -1 branch per operation (init check)
- Limitation: No runtime ENV changes (production-only optimization)

---

## 10. Comparison to Phase 4 E1

### Phase 4 E1 (ENV Snapshot)

**What it did**:
- Consolidated 3 ENV reads (`tiny_c7_ultra_enabled_env`, `tiny_front_v3_enabled`, `tiny_metadata_cache_enabled`) into 1 snapshot struct
- Result: +3.92% throughput (Mixed)
- Status: Promoted in presets (global default still OFF)

**Limitation**:
- Still calls `hakmem_env_snapshot_enabled()` 5 times per operation
- Each call: gate loads + branches
- ENV check overhead remains: ~7% perf samples

### Phase 19-3 (ENV Snapshot Consolidation)

**What it does**:
- Eliminates repeated `hakmem_env_snapshot_enabled()` calls inside deep helpers:
  - wrapper entry does the gate check once and passes `const HakmemEnvSnapshot* env` down
  - deep helpers use `if (env)` + direct field reads

**Benefit over Phase 4 E1**:
- Phase 4 E1: Consolidated ENV **values** (3 gates → 1 snapshot)
- Phase 19-3: Consolidates ENV **checks** (5 snapshot calls → 1 context call)
- Complementary: Phase 19-3 builds on Phase 4 E1 infrastructure

**Combined impact**:
- Phase 4 E1: +3.92% (ENV value consolidation)
- Phase 19-3: +5-8% (ENV check consolidation)
- Not additive (overlap), but Phase 19-3 should subsume Phase 4 E1 gains

---

## 11. Conclusion

Phase 19-3 (ENV Snapshot Consolidation) targets a clear, measurable overhead:
- **Current**: repeated `hakmem_env_snapshot_enabled()` gate checks scattered across hot helpers
- **After**: wrapper entry gate check once + `env` pass-down
- **Reduction**: fewer gate branches + fewer loads + less code/layout churn

**Expected outcome**: +5-8% throughput (aligned with Phase 19-1 Design Candidate B estimate)

**Recommended approach**: **Option A (Entry-Point Snapshot)**
- Clear API, type-safe context passing
- Preserves Box Theory (NULL context → fallback)
- Gradual migration (3 sub-phases)
- Benchmark-compatible (bench_profile refresh works)

**Risk**: MEDIUM (API changes, ENV invalidation handling)

**Effort**: 8-12 hours (implementation) + 6-8 hours (testing)

**Timeline**: 2 weeks (3 sub-phases + A/B validation)

**Next steps**:
1. Phase 19-3a done (UNLIKELY hint removal, GO)
2. Implement Phase 19-3b (wrapper env pass-down to hot helpers)
3. A/B test (expect +1-3% incremental on top of 19-3a)
4. Implement Phase 19-3c (legacy + metadata pass-down)
5. Final A/B test
6. Graduate if GO (add to MIXED_TINYV3_C7_SAFE preset)

This positions Phase 19-3 as a **high-ROI, medium-risk** optimization with clear measurement criteria and rollback strategy.
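
The benchmark-compatibility point rests on the snapshot being refreshed in place, so that an `env` pointer captured earlier keeps observing current values. A minimal sketch of that property follows, assuming a single static snapshot object. `DEMO_TINY_C7_ULTRA`, `demo_snapshot_refresh_from_env()`, and `demo_snapshot_get()` are hypothetical names used only for illustration; they are not the real hakmem API or ENV variables.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() in callers of this sketch */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down snapshot; the real struct has more fields. */
typedef struct {
    bool tiny_c7_ultra_enabled;
} HakmemEnvSnapshot;

/* Assumption: one storage location that is overwritten on refresh,
 * never reallocated, so its address is stable. */
static HakmemEnvSnapshot g_demo_snapshot;

/* Re-read the (hypothetical) ENV variable and update the snapshot in place. */
static void demo_snapshot_refresh_from_env(void) {
    const char* v = getenv("DEMO_TINY_C7_ULTRA");
    g_demo_snapshot.tiny_c7_ultra_enabled = (v != NULL && strcmp(v, "1") == 0);
}

/* Hand out a read-only view; const restricts the caller, not the owner. */
static const HakmemEnvSnapshot* demo_snapshot_get(void) {
    return &g_demo_snapshot;
}
```

Under these assumptions, a bench harness can call `setenv()` (or `putenv()`), then the refresh function, and any helper still holding the earlier pointer reads the new value without re-capturing it.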