# Phase 19-3: ENV Snapshot Consolidation — Design
## 0. Goal
**Objective**: Reduce ENV check overhead from per-operation 3+ TLS reads to 1 TLS read
**Expected Impact (target)**: -10.0 instructions/op, -4.0 branches/op, +3-8% throughput
**Risk Level**: MEDIUM (ENV invalidation handling required)
**Box Name**: EnvSnapshotConsolidationBox (Phase 19-3)
### Context
Phase 19 perf analysis revealed that ENV checks are executed **3+ times per operation**:
- `hakmem_env_snapshot_enabled()`: called 5 times in malloc/free hot paths (lines 236, 403, 624, 830, 910 in malloc_tiny_fast.h)
- Each call triggers:
  - TLS read of `g_hakmem_env_snapshot_ctor_mode`
  - TLS read of `g_hakmem_env_snapshot_gate`
  - Branch prediction overhead
  - Potential lazy initialization check
**Current overhead**: ~7% perf samples on `hakmem_env_snapshot_enabled()` and related ENV checks.
**Phase 4 E1 Status**: ENV snapshot infrastructure exists (global default OFF, but promoted ON in presets like `MIXED_TINYV3_C7_SAFE`). Phase 19-3 aims to:
1. Eliminate redundant `hakmem_env_snapshot_enabled()` checks (5 calls → 1 call)
2. Make ENV snapshot the **default path** (not research box)
3. Further consolidate ENV reads into entry-point snapshot
### Phase 19-3a Result (validated)
Phase 19-3a removed the call-site UNLIKELY hint:
`__builtin_expect(hakmem_env_snapshot_enabled(), 0)` → `hakmem_env_snapshot_enabled()`
Observed impact: **GO (+4.42% throughput)** on Mixed.
This validates that the remaining ENV work is dominated by branch/layout effects, not just raw "read cost".
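The hint removal changes code layout, not behavior: `__builtin_expect` returns its first argument unchanged and only tells the compiler which branch to treat as the likely one. The following is a minimal sketch of the two call-site forms, assuming a stand-in gate (`hint_gate_enabled` and the `g_hint_gate` variable are hypothetical, not the real hakmem gate) and a GCC/Clang toolchain.

```c
#include <assert.h>

/* Hypothetical stand-in for hakmem_env_snapshot_enabled(); the real gate
 * lives in the hakmem sources and is not reproduced here. */
static int g_hint_gate = 1;
static inline int hint_gate_enabled(void) { return g_hint_gate != 0; }

/* Before Phase 19-3a: the call site claims the gate is usually OFF, so the
 * compiler may lay out the snapshot path as the cold branch. */
static int route_with_unlikely_hint(void) {
    if (__builtin_expect(hint_gate_enabled(), 0)) {
        return 1; /* snapshot path */
    }
    return 0;     /* per-feature gate path */
}

/* After Phase 19-3a: no hint; the compiler chooses the layout itself. */
static int route_without_hint(void) {
    if (hint_gate_enabled()) {
        return 1;
    }
    return 0;
}

/* Sanity check: the hint must never change which branch is taken. */
static void check_hint_is_semantic_noop(void) {
    for (int gate = 0; gate <= 1; gate++) {
        g_hint_gate = gate;
        assert(route_with_unlikely_hint() == route_without_hint());
    }
}
```

Because both forms compute the same result, any throughput difference between them comes from block placement and branch layout, which is consistent with the interpretation above.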
### Phase 19-3b Result (validated)
Phase 19-3b consolidated snapshot reads by capturing `env` once per hot call and passing it down into nested helpers.
Observed impact: **GO (+2.76% mean / +2.57% median)** on Mixed 10-run (`scripts/run_mixed_10_cleanenv.sh`).
---
## 1. Current State Analysis
### 1.1 ENV Check Locations (Per-Operation)
Based on code analysis, ENV checks occur in these hot path locations:
**malloc_tiny_fast() path**:
1. Line 236: C7 ULTRA check (`hakmem_env_snapshot_enabled()` → `hakmem_env_snapshot()`)
2. Line 403: Front V3 snapshot check for `free()` (in `free_tiny_fast_v4_hotcold`)
3. Line 910: Front V3 snapshot check for `free()` (in `free_tiny_fast_v4_larson`)
**free_tiny_fast() paths**:
1. Line 624: C7 ULTRA check (`hakmem_env_snapshot_enabled()` → `hakmem_env_snapshot()`)
2. Line 830: C7 ULTRA check (duplicate in `free_tiny_fast_v4_larson`)
**tiny_legacy_fallback_box.h**:
- Line 28: `hakmem_env_snapshot_enabled()` for front_snap + metadata_cache_on
**tiny_metadata_cache_hot_box.h**:
- Line 64: `hakmem_env_snapshot_enabled()` for metadata cache effective check
### 1.2 TLS Read Overhead Analysis
Each `hakmem_env_snapshot_enabled()` call performs:
```c
int ctor_mode = g_hakmem_env_snapshot_ctor_mode;  // TLS read #1
if (ctor_mode == 1) {
    return g_hakmem_env_snapshot_gate != 0;       // TLS read #2 (ctor path)
}
// Legacy path
if (g_hakmem_env_snapshot_gate == -1) {           // TLS read #2 (legacy path)
    // Lazy init with getenv()
}
```
**Per-operation cost** (when snapshot enabled):
- **5 calls** × **2 TLS reads** = **10 TLS reads/op**
- Plus: 5× branch on `ctor_mode`, 5× branch on snapshot enabled
- Actual measurement: ~7% perf samples
**Per-operation cost** (when snapshot disabled - current default):
- **5 calls** × **2-3 TLS reads** = **10-15 TLS reads/op**
- Plus: lazy init checks, getenv() overhead on first call per thread
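The read counts above are simple multiplication, and a small counting model makes the arithmetic explicit. This sketch assumes plain thread-local integers in place of the two gate variables and omits the legacy lazy-init step; all `model_*` names are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Model only: plain TLS ints stand in for the two gate variables named
 * above, and t_model_tls_reads counts how often the model touches them. */
static __thread int t_model_ctor_mode = 1;   /* 1 = constructor already ran */
static __thread int t_model_gate = 1;        /* 1 = snapshot enabled */
static __thread unsigned t_model_tls_reads = 0;

static int model_snapshot_enabled(void) {
    t_model_tls_reads++;                     /* read #1: ctor_mode */
    if (t_model_ctor_mode == 1) {
        t_model_tls_reads++;                 /* read #2: gate (ctor path) */
        return t_model_gate != 0;
    }
    t_model_tls_reads++;                     /* read #2: gate (legacy path) */
    return t_model_gate > 0;                 /* lazy init omitted in this model */
}

/* Counts modeled TLS reads for one operation with the given number of
 * independent gate checks. */
static unsigned model_reads_per_op(int call_sites) {
    assert(call_sites >= 0);
    t_model_tls_reads = 0;
    for (int i = 0; i < call_sites; i++) {
        (void)model_snapshot_enabled();
    }
    return t_model_tls_reads;
}
```

With five call sites the model reports 10 reads; with a single entry-point check it reports 2.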
### 1.3 Redundancy Analysis
**Problem**: Each hot path independently checks `hakmem_env_snapshot_enabled()`:
- malloc C7 ULTRA: check at line 236
- free C7 ULTRA: check at line 624 (same operation, different code path)
- free front V3: check at line 403 and 910 (same snapshot needed)
- Legacy fallback: check at line 28 (called from above paths)
- Metadata cache: check at line 64 (called from above paths)
**Redundancy**: For a typical malloc+free pair:
- Current: 5+ `hakmem_env_snapshot_enabled()` calls = 10-15 TLS reads
- Optimal: 1 entry-point snapshot = 1-2 TLS reads
**Gap**: 8-13 redundant TLS reads per operation
---
## 2. Design Options
### Option A: Entry-Point Snapshot Pass-Down (Recommended)
**Concept**: Capture the existing `HakmemEnvSnapshot` pointer once at malloc/free entry, and pass it down.
This avoids creating a new TLS context and automatically stays compatible with `hakmem_env_snapshot_refresh_from_env()` (refresh updates the snapshot in-place).
**Architecture**:
```c
// At wrapper entry (malloc/free): one gate check, then pass the pointer down.
const HakmemEnvSnapshot* env =
    hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;

// In malloc_tiny_fast():
void* malloc_tiny_fast_with_env(size_t size, const HakmemEnvSnapshot* env) {
    // Use env->tiny_c7_ultra_enabled instead of calling hakmem_env_snapshot_enabled().
    // (class_idx is derived from size; that step is elided here.)
    if (env && class_idx == 7 && env->tiny_c7_ultra_enabled) {
        // Direct field check, no further gate call or TLS read
    }
    // ... rest of the fast path
}
```
**Pros**:
- **Minimal refactoring**: Add context parameter to existing functions
- **Type safety**: Compiler enforces context passing
- **Clear boundary**: ENV decisions made at entry, logic below is pure
- **Easy rollback**: Context parameter can be NULL (fallback to old path)
**Cons**:
- **API threading**: Some hot helpers need an extra pointer parameter (`env`) or `_with_env` variants.
- **Register pressure**: Extra parameter may affect register allocation (verify via perf stat).
**Risk**: LOW-MEDIUM (mechanical threading, rollback is simple)
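A self-contained sketch of the pass-down idea follows. It assumes a reduced snapshot struct with a single field and a counting gate; `PdSnapshot`, `pd_snapshot_enabled`, and the other `pd_*` names are hypothetical stand-ins, and the real `HakmemEnvSnapshot` carries more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced snapshot: only the field used above is modeled. */
typedef struct {
    bool tiny_c7_ultra_enabled;
} PdSnapshot;

static PdSnapshot g_pd_snapshot = { .tiny_c7_ultra_enabled = true };
static int g_pd_gate = 1;              /* models HAKMEM_ENV_SNAPSHOT=1 */
static unsigned g_pd_gate_checks = 0;

static int pd_snapshot_enabled(void) {
    g_pd_gate_checks++;                /* count gate checks for the demo */
    return g_pd_gate != 0;
}

/* Deep helper: no gate check, only a NULL test and a field read. */
static bool pd_use_c7_ultra(int class_idx, const PdSnapshot* env) {
    return env != NULL && class_idx == 7 && env->tiny_c7_ultra_enabled;
}

/* Entry point: one gate check, then the pointer is threaded down.
 * Returns how many of the two helper calls took the C7 ULTRA route. */
static int pd_malloc_free_pair(int class_idx) {
    assert(class_idx >= 0);
    const PdSnapshot* env = pd_snapshot_enabled() ? &g_pd_snapshot : NULL;
    int ultra_hits = 0;
    ultra_hits += pd_use_c7_ultra(class_idx, env);  /* alloc side */
    ultra_hits += pd_use_c7_ultra(class_idx, env);  /* free side */
    return ultra_hits;
}
```

In this model the gate is consulted once per pair regardless of how many helpers read the snapshot, and a disabled gate simply yields `env == NULL` in every helper.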
---
### Option B: TLS Cached Context (Alternative)
**Concept**: Maintain thread-local ENV context, refresh on invalidation events.
**Architecture**:
```c
// Global TLS context (replaces per-call ENV checks)
static __thread FastLaneEnvCtx g_fastlane_ctx;
static __thread int g_fastlane_ctx_version = 0;
// Incremented on ENV change. Must start at a non-zero value so that the
// first call on each thread sees a mismatch and fills the context.
extern int g_env_snapshot_version;

static inline const FastLaneEnvCtx* fastlane_ctx_get(void) {
    if (__builtin_expect(g_fastlane_ctx_version != g_env_snapshot_version, 0)) {
        // Refresh from snapshot (rare)
        const HakmemEnvSnapshot* snap = hakmem_env_snapshot();
        g_fastlane_ctx.c7_ultra_enabled = snap->tiny_c7_ultra_enabled;
        // ... copy fields
        g_fastlane_ctx_version = g_env_snapshot_version;
    }
    return &g_fastlane_ctx;
}

// In hot path:
const FastLaneEnvCtx* ctx = fastlane_ctx_get();  // 1 TLS read + 1 branch
if (class_idx == 7 && ctx->c7_ultra_enabled) {   // Direct struct access
    // ... C7 ULTRA path
}
```
**Pros**:
- **No API changes**: Existing functions unchanged
- **Single TLS read**: Version check is fast (1 global read + 1 TLS read)
- **Automatic invalidation**: Version bump triggers refresh
- **Easy integration**: Drop-in replacement for `hakmem_env_snapshot_enabled()`
**Cons**:
- **Version management**: Need global version counter + invalidation hooks
- **Stale data risk**: If version check is missed, stale context used
- **Init complexity**: Each thread needs lazy init + version tracking
- **Debugging**: Harder to trace when context was last refreshed
**Risk**: MEDIUM (version invalidation must be bulletproof)
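The stale-data risk can be demonstrated with a small single-threaded model. This is a sketch under assumptions: the `vc_*` names are hypothetical, the "source of truth" is a plain global rather than the real snapshot, and a multi-threaded version would need atomic access to the version counter.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { bool c7_ultra_enabled; } VcCtx;

/* Process-wide source of truth plus its version (both hypothetical).
 * The version starts at 1 so a fresh thread (cached version 0) refreshes. */
static bool g_vc_source_c7_ultra = false;
static int  g_vc_env_version = 1;

/* Per-thread cached copy; version 0 means "never filled". */
static __thread VcCtx t_vc_ctx;
static __thread int t_vc_ctx_version = 0;
static __thread unsigned t_vc_refreshes = 0;

static const VcCtx* vc_ctx_get(void) {
    if (__builtin_expect(t_vc_ctx_version != g_vc_env_version, 0)) {
        t_vc_ctx.c7_ultra_enabled = g_vc_source_c7_ultra;
        t_vc_ctx_version = g_vc_env_version;
        t_vc_refreshes++;
    }
    return &t_vc_ctx;
}

/* Every writer must bump the version, or readers keep the stale copy. */
static void vc_set_c7_ultra(bool on, bool bump_version) {
    g_vc_source_c7_ultra = on;
    if (bump_version) {
        g_vc_env_version++;
    }
}
```

Calling `vc_set_c7_ultra(true, false)` models a missed invalidation hook: the cached context keeps reporting the old value until some later writer bumps the version.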
---
### Option C: Init-Time Fixed (High Risk)
**Concept**: Read ENV once at process init, freeze configuration for lifetime.
**Architecture**:
```c
// Global constants (set in constructor)
static bool g_c7_ultra_enabled_fixed;
static bool g_front_v3_enabled_fixed;

__attribute__((constructor))
static void fastlane_env_init(void) {
    const HakmemEnvSnapshot* snap = hakmem_env_snapshot();
    g_c7_ultra_enabled_fixed = snap->tiny_c7_ultra_enabled;
    g_front_v3_enabled_fixed = snap->tiny_front_v3_enabled;
}

// Hot path: direct global read (no TLS)
if (class_idx == 7 && g_c7_ultra_enabled_fixed) {
    // ... C7 ULTRA path
}
```
**Pros**:
- **Zero TLS reads**: Direct global variable access
- **Maximum performance**: Compiler can constant-fold if known at link time
- **Simple implementation**: No lazy init, no version tracking
**Cons**:
- **No runtime ENV changes**: ENV toggles require process restart
- **Breaks bench_profile**: `putenv()` in benchmarks will not work
- **No A/B testing**: Cannot toggle ENV for same-binary comparison
- **Box Theory violation**: No rollback/toggle capability
**Risk**: HIGH (breaks existing workflow, violates Box Theory)
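The "no runtime ENV changes" drawback is easy to reproduce in isolation. The sketch below assumes a POSIX environment (`setenv`) and a GCC/Clang constructor attribute; `DEMO_C7_ULTRA` is a made-up variable name and the `fixed_*` functions are hypothetical, not part of hakmem.

```c
#define _POSIX_C_SOURCE 200112L  /* setenv() is used when exercising the sketch */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Frozen copy of the flag; written only by fixed_env_init(). */
static bool g_fixed_c7_ultra = false;

/* Reads an environment flag: true only when the variable is exactly "1". */
static bool fixed_env_flag(const char* name) {
    const char* v = getenv(name);
    return v != NULL && strcmp(v, "1") == 0;
}

/* Runs once before main() on GCC/Clang; also callable by hand in tests. */
__attribute__((constructor))
static void fixed_env_init(void) {
    g_fixed_c7_ultra = fixed_env_flag("DEMO_C7_ULTRA");
}

/* Hot path: a plain global load, no gate function and no refresh hook. */
static bool fixed_c7_ultra(void) {
    return g_fixed_c7_ultra;
}
```

After `fixed_env_init()` has run, changing `DEMO_C7_ULTRA` with `setenv()` alters what `fixed_env_flag()` would return but not what `fixed_c7_ultra()` reports, which is the behavior that would break `putenv()`-based benchmark profiles.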
---
### Recommended: **Option A (Entry-Point Snapshot Pass-Down)**
**Reasoning**:
1. **Preserves Box Theory**: `env==NULL` → fallback to old path
2. **Clear separation**: ENV decisions at entry, pure logic below
3. **Benchmark compatible**: Works with `bench_profile` putenv + `hakmem_env_snapshot_refresh_from_env()` (snapshot updates in-place)
4. **Performance**: Removes repeated `hakmem_env_snapshot_enabled()` checks inside deep helpers
**Trade-off acceptance**:
- Accept API changes (mechanical, low risk)
- Accept extra parameter (register pressure acceptable for hot path)
- Reject Option B's version management complexity
- Reject Option C's inflexibility
---
## 3. Implementation Plan (Option A)
### 3.1 Box Design
**Box Name**: `EnvSnapshotConsolidationBox` (Phase 19-3)
**Files**:
- Modified: `core/front/malloc_tiny_fast.h`
  - Phase 19-3a: remove backwards `__builtin_expect(..., 0)` hints (DONE, +4.42% GO).
  - Phase 19-3b: thread `const HakmemEnvSnapshot* env` down to eliminate repeated `hakmem_env_snapshot_enabled()` checks (DONE, +2.76% GO).
- Modified: `core/box/tiny_legacy_fallback_box.h`
  - Add `_with_env` helper (Phase 19-3b).
- Modified: `core/box/tiny_metadata_cache_hot_box.h`
  - Add `_with_env` helper (Phase 19-3b).
- Optional: `core/box/hak_wrappers.inc.h`
  - If needed, compute `env` once per wrapper entry and pass it down (removes the remaining alloc-side gate in `malloc_tiny_fast_for_class()`).
**ENV Gate**:
- Base: `HAKMEM_ENV_SNAPSHOT=0/1` (Phase 4 E1 gate; promoted ON in presets)
**Rollback**:
- Snapshot behavior: set `HAKMEM_ENV_SNAPSHOT=0` to fall back to per-feature env gates.
- Pass-down refactor: revert the Phase 19-3b commit (or add a dedicated pass-down gate if future A/B is needed).
### 3.2 API Design
**Pass-down API (recommended)**:
```c
// Wrapper entry (malloc/free): read snapshot ONCE, pass down.
const HakmemEnvSnapshot* env =
    hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;

// Hot helpers accept optional env pointer (NULL-safe).
void* malloc_tiny_fast_with_env(size_t size, const HakmemEnvSnapshot* env);
int free_tiny_fast_with_env(void* ptr, const HakmemEnvSnapshot* env);
```
### 3.3 Migration Plan (Incremental)
**Phase 19-3a (DONE)**: remove backwards UNLIKELY hints at the 5 hottest call sites in `core/front/malloc_tiny_fast.h`.
- `__builtin_expect(hakmem_env_snapshot_enabled(), 0)` → `hakmem_env_snapshot_enabled()`
- Measured: **GO (+4.42%)**
**Phase 19-3b (DONE)**: capture `env` once per hot call and pass it down into nested helpers.
- In `core/front/malloc_tiny_fast.h`:
  - `free_tiny_fast()` / `free_tiny_fast_hot()` capture `env` once and pass it to cold + legacy helpers.
  - `malloc_tiny_fast_for_class()` reuses the same snapshot for `tiny_policy_hot_get_route_with_env(...)`.
- In `core/box/tiny_legacy_fallback_box.h` and `core/box/tiny_metadata_cache_hot_box.h`:
  - add `_with_env` helpers to consume the pass-down pointer.
- Measured: **GO (+2.76% mean / +2.57% median)** on Mixed 10-run.
**Phase 19-3c (OPTIONAL)**: pass `env` down from wrapper entry to remove the remaining alloc-side gate.
- The originally planned 19-3c scope (propagating `env` into the legacy fallback + metadata cache helpers) was already completed in Phase 19-3b.
### 3.4 Files to Modify
1. `core/front/malloc_tiny_fast.h`
   - Phase 19-3a: UNLIKELY hint removal.
   - Phase 19-3b: pass-down `env` to cold + legacy helpers.
2. `core/box/tiny_legacy_fallback_box.h`
   - Phase 19-3b: add `_with_env` helper + keep wrapper.
3. `core/box/tiny_metadata_cache_hot_box.h`
   - Phase 19-3b: add `_with_env` helper + keep wrapper.
4. (Optional) `core/box/hak_wrappers.inc.h`
   - Pass `env` down from wrapper entry (alloc-side; removes one remaining gate).
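The "add `_with_env` helper + keep wrapper" pattern can be sketched as follows. This is an illustration under assumptions: `WrapSnapshot`, its `metadata_cache_on` field, and the `wrap_*` functions are hypothetical, and where the real code falls back to per-feature env gates on `env == NULL`, this sketch simply reports `false`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced snapshot with one flag. */
typedef struct { bool metadata_cache_on; } WrapSnapshot;

static WrapSnapshot g_wrap_snapshot = { .metadata_cache_on = true };
static int g_wrap_gate = 1;
static unsigned g_wrap_gate_checks = 0;

/* Gate check plus snapshot fetch; counted so the demo can observe it. */
static const WrapSnapshot* wrap_snapshot_or_null(void) {
    g_wrap_gate_checks++;
    return g_wrap_gate ? &g_wrap_snapshot : NULL;
}

/* New helper: consumes the pointer the caller already holds.
 * (The real code would fall back to per-feature gates when env is NULL.) */
static bool wrap_metadata_cache_effective_with_env(const WrapSnapshot* env) {
    return env != NULL && env->metadata_cache_on;
}

/* Old signature kept as a thin wrapper, so untouched callers still build. */
static bool wrap_metadata_cache_effective(void) {
    return wrap_metadata_cache_effective_with_env(wrap_snapshot_or_null());
}
```

Callers on the hot path hold `env` already and use the `_with_env` form with no extra gate check; callers that were not migrated keep the old name and pay for one gate check per call, as before.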
---
## 4. Safety / Box Theory
### 4.1 Boundary Preservation
**L0 (ENV gate)**:
- `HAKMEM_ENV_SNAPSHOT=0` → `env==NULL` → fallback to per-feature env gates
- `HAKMEM_ENV_SNAPSHOT=1` → `env!=NULL` → snapshot-based checks
- (Optional) A dedicated “pass-down gate” can be introduced for A/B safety, but avoid adding a new hot-branch unless needed.
**L1 (Hot inline)**:
- No algorithmic changes, only ENV check consolidation
- Existing `malloc_tiny_fast()` / `free_tiny_fast()` logic unchanged
- `env` is read-only (const pointer)
**L2 (Cold fallback)**:
- Cold paths unchanged (no context propagation needed)
- Legacy fallback accepts optional `env`
**L3 (Stats/Observability)**:
- Add counter: `ENV_CONSOLIDATION_STAT_INC(enabled_calls)`
- Track: pass-down hits, fallback path usage
- Perf verification: reduced `hakmem_env_snapshot_enabled()` hot samples
### 4.2 Fail-Fast
**NULL env handling**:
- All functions accept `env==NULL` → fallback to existing path
- No crashes, no undefined behavior
- Debug builds: assert(`env!=NULL`) only if the pass-down gate is enabled (optional)
**ENV invalidation**:
- Snapshot refresh is handled by the existing Phase 4 E1 mechanism:
  - `bench_profile` uses `hakmem_env_snapshot_refresh_from_env()` after `putenv()`
  - Snapshot updates in-place, so the `env` pointer remains valid
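Why an in-place refresh keeps a previously captured pointer valid can be shown with a small model. The sketch assumes a POSIX environment (`setenv`); `RefreshSnapshot`, `refresh_snapshot_from_env`, and the `DEMO_C7_ULTRA` variable are hypothetical stand-ins for the Phase 4 E1 mechanism, whose actual implementation is not reproduced here.

```c
#define _POSIX_C_SOURCE 200112L  /* setenv() is used when exercising the sketch */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct { bool c7_ultra; } RefreshSnapshot;

/* One static object: refresh rewrites its fields, never its address. */
static RefreshSnapshot g_refresh_snapshot;

static const RefreshSnapshot* refresh_snapshot_get(void) {
    return &g_refresh_snapshot;
}

/* Stand-in for hakmem_env_snapshot_refresh_from_env(): re-reads the
 * environment and overwrites the existing snapshot object. */
static void refresh_snapshot_from_env(void) {
    const char* v = getenv("DEMO_C7_ULTRA");
    g_refresh_snapshot.c7_ultra = (v != NULL && strcmp(v, "1") == 0);
}
```

A pointer obtained before `setenv()` plus `refresh_snapshot_from_env()` observes the new field value afterwards, because only the contents change. Note that the change is not visible until the explicit refresh call, which matches the "requires explicit refresh" limitation of the existing snapshot.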
### 4.3 Rollback
**Runtime rollback**:
```sh
HAKMEM_ENV_SNAPSHOT=0 # Disable snapshot path (falls back to per-feature env gates)
```
**Gradual rollout**:
1. Phase 19-3a: UNLIKELY hint removal (DONE, GO)
2. Phase 19-3b: hot helper pass-down (DONE, GO)
3. Phase 19-3c: optional wrapper-entry pass-down (alloc-side; measure)
### 4.4 Observability
**Stats counters** (debug builds):
```c
typedef struct {
uint64_t env_passdown_hits; // wrapper passed non-NULL env
uint64_t env_null_fallback; // env==NULL, used old path
uint64_t malloc_env_path; // malloc used env pass-down
uint64_t free_env_path; // free used env pass-down
} EnvConsolidationStats;
```
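One way to keep such counters out of release builds is a macro that compiles to nothing unless a debug flag is set. The sketch below is an assumption about how `ENV_CONSOLIDATION_STAT_INC` could be shaped, not its actual definition: `DEMO_ENV_STATS`, `STAT_DEMO_INC`, and `StatDemoCounters` are hypothetical names, and only two of the four counters are modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build flag: 1 for debug builds, 0 for release builds. */
#ifndef DEMO_ENV_STATS
#define DEMO_ENV_STATS 1
#endif

typedef struct {
    uint64_t env_passdown_hits;   /* entry point passed a non-NULL env */
    uint64_t env_null_fallback;   /* env == NULL, old path used */
} StatDemoCounters;

/* Per-thread counters avoid contention on the hot path. */
static __thread StatDemoCounters t_stat_demo;

#if DEMO_ENV_STATS
#define STAT_DEMO_INC(field) (t_stat_demo.field++)
#else
#define STAT_DEMO_INC(field) ((void)0)
#endif

/* Records which way one hot call went. */
static void stat_demo_record_entry(const void* env) {
    if (env != NULL) {
        STAT_DEMO_INC(env_passdown_hits);
    } else {
        STAT_DEMO_INC(env_null_fallback);
    }
}
```

With the flag off, both branches become empty statements, so an optimizing compiler can drop the check entirely.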
**Perf validation**:
- Before: `perf record` shows `hakmem_env_snapshot_enabled` at ~7%
- After: `hakmem_env_snapshot_enabled` should drop to <1%
- Expected: deep helpers stop calling `hakmem_env_snapshot_enabled()` repeatedly (single capture per hot call)
**A/B testing**:
```sh
# Recommended: compare baseline vs optimized commits with the same bench script
scripts/run_mixed_10_cleanenv.sh
```
---
## 5. Expected Performance
### 5.1 Instruction Reduction Estimate
**Current overhead** (per malloc+free operation):
- 5 calls to `hakmem_env_snapshot_enabled()`:
  - Each: gate loads + branches (and legacy lazy-init path on first call)
  - Total: **~5 gate checks** per operation across hot helpers
**After Phase 19-3**:
- 1 call at wrapper entry:
  - `hakmem_env_snapshot_enabled()` once
  - `hakmem_env_snapshot()` once (when enabled)
- Deep helpers use `if (env)` + direct field reads (no further gate checks)
**Reduction**:
- **Gate checks**: ~5 → 1 (wrapper entry only)
- **Branches**: reduce repeated gate branches inside hot helpers
- **Instructions**: target ~-10 instructions/op (order-of-magnitude)
### 5.2 Branch Reduction Estimate
**Current branching**:
- `hakmem_env_snapshot_enabled()`: 2 branches (ctor_mode check + gate check)
- Called 5 times = **10 branches/op**
**After Phase 19-3**:
- Gate check is done once at wrapper entry; deep helpers reuse `env` pointer.
**Reduction**: 10 → 4 = **-6 branches/op** (conservative estimate: -4 branches/op accounting for overlap)
### 5.3 Throughput Estimate
**Phase 19-1 Design Doc** (Candidate B) estimates:
- Instructions: -10.0/op
- Branches: -4.0/op
- Throughput: **+5-8%**
**Phase 19-3 targets** (aligned with Candidate B):
- Instructions: **-10.0/op** ✓
- Branches: **-4.0/op** ✓
- Throughput: **+5-8%** (expected on top of Phase 19-2 baseline)
**Validation criteria**:
- Perf stat shows instruction count reduction: ≥8.0/op (80% of estimate)
- Perf stat shows branch count reduction: ≥3.0/op (75% of estimate)
- Throughput improvement: ≥4.0% (80% of the lower-bound estimate)
---
## 6. Risk Assessment
### 6.1 Technical Risks
**MEDIUM: API Signature Changes**
- Risk: Adding context parameter changes function signatures
- Mitigation: Keep old signatures, add `_with_env` variants
- Rollback: NULL context → fallback to old implementation
- Timeline: 1 phase at a time (19-3a → 19-3b → 19-3c)
**MEDIUM: ENV Invalidation**
- Risk: Runtime ENV changes (bench_profile putenv) may not refresh context
- Mitigation: Phase 19-3 inherits Phase 4 E1 refresh mechanism
- Limitation: Same as current ENV snapshot (requires explicit refresh)
- Future: Add version tracking (Option B) if runtime toggle needed
**LOW: Register Pressure**
- Risk: Extra context parameter may increase register spills
- Mitigation: Context is const pointer (register-friendly)
- Validation: Check perf stat for stall increases
- Rollback: Disable via ENV if regression detected
**LOW: Lazy Init Overhead**
- Risk: First call to `fastlane_env_ctx()` adds init cost
- Mitigation: One-time per thread (amortized over millions of ops)
- Measurement: Should be <0.1% overhead (verified via perf)
### 6.2 Performance Risks
**Risk: Overhead greater than savings**
- Scenario: Context struct access slower than optimized TLS reads
- Likelihood: LOW (struct access is 1-2 instructions, TLS read is 5-10)
- Detection: Perf stat will show instruction count increase
- Rollback: ENV=0 immediately reverts
**Risk: Branch predictor thrashing**
- Scenario: New branch patterns confuse CPU predictor
- Likelihood: LOW (reducing branches helps predictor)
- Detection: Branch miss rate increases in perf stat
- Rollback: ENV=0 immediately reverts
### 6.3 Integration Risks
**Risk: Breaks bench_profile ENV refresh**
- Scenario: Context cached before putenv(), stale values used
- Likelihood: MEDIUM (same issue as Phase 4 E1)
- Mitigation: Follow Phase 4 E1 pattern (explicit refresh hook)
- Validation: Run bench suite with ENV toggles
**Risk: Conflicts with FastLane Direct (Phase 19-2)**
- Scenario: Phase 19-2 removed wrapper, context injection point unclear
- Likelihood: LOW (context added at new entry point)
- Mitigation: Phase 19-3 builds on Phase 19-2 baseline
- Validation: A/B test with FASTLANE_DIRECT=1 + ENV_CONSOLIDATION=1
---
## 7. Validation Checklist
### 7.1 Pre-Implementation
- [ ] Verify Phase 4 E1 (ENV snapshot) is stable and working
- [ ] Verify Phase 19-2 (FASTLANE_DIRECT) is stable baseline
- [ ] Document current `hakmem_env_snapshot_enabled()` call sites (5 locations)
- [ ] Create test plan for ENV refresh (bench_profile compatibility)
### 7.2 Implementation
- [ ] Implement `fastlane_env_ctx_box.h` (context struct + getter)
- [ ] Add `malloc_tiny_fast_ctx()` variant (Phase 19-3a)
- [ ] Add `free_tiny_fast_ctx()` variant (Phase 19-3b)
- [ ] Propagate context to `tiny_legacy_fallback_box.h` (Phase 19-3c)
- [ ] (Optional) Add a dedicated pass-down gate if A/B within a single binary is needed
- [ ] Add stats counters (debug builds)
### 7.3 Testing (Per Phase)
**Phase 19-3a (malloc path)**:
- [ ] Correctness: Run `make test` suite (all tests pass)
- [ ] Perf stat: Measure instruction/branch reduction (ENV=0 vs ENV=1)
- [ ] Perf record: Verify `hakmem_env_snapshot_enabled` samples drop
- [ ] Benchmark: Mixed 10-run (expect +2-3% from malloc path alone)
**Phase 19-3b (free path)**:
- [ ] Correctness: Run `make test` + Larson (all tests pass)
- [ ] Perf stat: Measure cumulative reduction (vs baseline)
- [ ] Perf record: Verify further reduction in ENV check samples
- [ ] Benchmark: Mixed 10-run (expect +3-5% cumulative)
**Phase 19-3c (legacy + metadata)**:
- [ ] Correctness: Full test suite including multithreaded
- [ ] Perf stat: Verify -10.0 instr/op, -4.0 branches/op (goal)
- [ ] Perf record: `hakmem_env_snapshot_enabled` <1% samples
- [ ] Benchmark: Mixed 10-run (expect +5-8% cumulative)
### 7.4 A/B Test (Final Validation)
**Benchmark suite**:
```sh
# Run the same cleanenv script on baseline vs optimized commits
scripts/run_mixed_10_cleanenv.sh
```
**GO/NO-GO criteria**:
- **GO**: Mean throughput +5.0% or higher (within ±20% of +5-8% estimate)
- **NEUTRAL**: +2.0% to +5.0% → keep as research box, preset-only promotion
- **NO-GO**: <+2.0% or regression → revert, analyze perf data
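The three verdict bands can be written down as a small classifier, which removes ambiguity at the boundaries. This is a sketch: the enum and function names are hypothetical, and it assumes the boundaries are inclusive on the lower side (+5.0% is GO, +2.0% is NEUTRAL).

```c
#include <assert.h>

typedef enum { VERDICT_NO_GO, VERDICT_NEUTRAL, VERDICT_GO } VerdictKind;

/* gain_pct is the mean throughput change in percent (3.0 means +3.0%).
 * Negative values are regressions and fall into NO-GO. */
static VerdictKind verdict_classify_gain(double gain_pct) {
    if (gain_pct >= 5.0) {
        return VERDICT_GO;
    }
    if (gain_pct >= 2.0) {
        return VERDICT_NEUTRAL;
    }
    return VERDICT_NO_GO;
}
```

For example, a +3.0% mean would classify as NEUTRAL and a -1.0% mean as NO-GO under these thresholds.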
**Perf stat validation**:
```sh
perf stat -e cycles,instructions,branches,branch-misses,L1-icache-load-misses \
-- ./bench_random_mixed_hakmem 200000000 400 1
```
**Expected deltas**:
- Instructions/op: -8.0 to -12.0 (target: -10.0)
- Branches/op: -3.0 to -5.0 (target: -4.0)
- Branch-miss%: unchanged or slightly better (fewer branches)
- Throughput: +4.0% to +10.0% (target: +5-8%)
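Per-operation deltas are obtained by dividing the difference in raw `perf stat` event counts by the number of operations in the run. The helper below is a sketch with hypothetical names (`perf_delta_per_op`, `perf_within`); it assumes both runs execute the same operation count, and the event totals used to exercise it are made-up round numbers, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Per-operation delta between two runs of the same workload.
 * Negative means the optimized build executes fewer events per op. */
static double perf_delta_per_op(uint64_t base_events, uint64_t opt_events,
                                uint64_t ops) {
    assert(ops > 0);
    return ((double)opt_events - (double)base_events) / (double)ops;
}

/* True when value lies in the closed interval [lo, hi]. */
static int perf_within(double value, double lo, double hi) {
    return value >= lo && value <= hi;
}
```

As a worked example with invented totals: 20,000,000,000 instructions at baseline and 18,000,000,000 after the change, over 200,000,000 operations, gives a delta of -10.0 instructions/op, which lies inside the -12.0 to -8.0 acceptance band.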
---
## 8. Rollout Plan
### 8.1 Phase 19-3a: malloc Path (Week 1)
**Scope**: Add context to malloc hot path
- Modify `malloc_tiny_fast()` to accept context
- Update C7 ULTRA check (line 236)
- Add `fastlane_env_ctx_box.h`
- Update wrapper.c `malloc()`
**Timeline**: 4-6 hours implementation + 2 hours testing
**Risk**: LOW (isolated to alloc path)
**Rollback**: revert Phase 19-3b commit (or set `HAKMEM_ENV_SNAPSHOT=0` to disable snapshot path)
### 8.2 Phase 19-3b: free Path (Week 1)
**Scope**: Add context to free hot path
- Modify `free_tiny_fast()` to accept context
- Update C7 ULTRA checks (lines 624, 830)
- Update front V3 checks (lines 403, 910)
- Update wrapper.c `free()`
**Timeline**: 4-6 hours implementation + 2 hours testing
**Risk**: LOW-MEDIUM (more call sites than malloc)
**Rollback**: revert Phase 19-3b commit (or set `HAKMEM_ENV_SNAPSHOT=0` to disable snapshot path)
### 8.3 Phase 19-3c: Legacy + Metadata (Week 2)
**Scope**: Propagate context to helper boxes
- Update `tiny_legacy_fallback_box.h` (line 28)
- Update `tiny_metadata_cache_hot_box.h` (line 64)
- Add context parameter to helper functions
**Timeline**: 3-4 hours implementation + 2 hours testing
**Risk**: MEDIUM (touches multiple boxes)
**Rollback**: revert Phase 19-3b commit (or set `HAKMEM_ENV_SNAPSHOT=0` to disable snapshot path)
### 8.4 Graduate (Week 2-3)
**Promotion criteria**:
- All phases pass A/B testing (GO verdict)
- Cumulative throughput gain ≥+5.0%
- No correctness regressions (all tests pass)
- Perf validation confirms instruction reduction
**Promotion actions**:
1. Ensure `MIXED_TINYV3_C7_SAFE` preset keeps `HAKMEM_ENV_SNAPSHOT=1` (already)
2. Document in optimization roadmap
3. Update Box Theory index
4. Keep ENV default=0 (opt-in) until production validation
**Rollback strategy**:
- Preset level: Remove from preset, keep code
- Code level: revert the Phase 19-3b commit
- Emergency: set `HAKMEM_ENV_SNAPSHOT=0` (falls back to per-feature env gates)
---
## 9. Future Optimization Opportunities
### 9.1 Version-Based Invalidation (Option B)
If runtime ENV changes become important:
- Add global `g_env_snapshot_version` counter
- Increment on ENV change (bench_profile, runtime toggle)
- Each thread checks version, refreshes context if stale
- Overhead: +1 global read per operation (still net win vs 10 TLS reads)
### 9.2 Route Table Consolidation
Extend context to include pre-computed routes:
```c
typedef struct {
bool c7_ultra_enabled;
bool front_v3_enabled;
bool metadata_cache_eff;
SmallRouteKind route_kind[8]; // Pre-computed per class
} FastLaneEnvCtx;
```
**Benefit**: Eliminate `tiny_static_route_get_kind_fast()` calls
**Impact**: Additional -3-4 instructions/op, -1-2 branches/op
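Pre-computing the route per class turns a per-call decision into an array lookup. The sketch below assumes a two-value route enum and a rule that only class 7 can take the C7 ULTRA route; `RouteDemoKind`, `RouteDemoCtx`, and both functions are hypothetical, and the real `SmallRouteKind` values and routing rules are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

#define ROUTE_DEMO_CLASSES 8

/* Hypothetical route kinds, for illustration only. */
typedef enum { ROUTE_DEMO_LEGACY = 0, ROUTE_DEMO_C7_ULTRA = 1 } RouteDemoKind;

typedef struct {
    bool c7_ultra_enabled;
    RouteDemoKind route_kind[ROUTE_DEMO_CLASSES];  /* pre-computed per class */
} RouteDemoCtx;

/* Built once (at snapshot init or refresh), not on the hot path. */
static void route_demo_ctx_build(RouteDemoCtx* ctx, bool c7_ultra_enabled) {
    ctx->c7_ultra_enabled = c7_ultra_enabled;
    for (int c = 0; c < ROUTE_DEMO_CLASSES; c++) {
        ctx->route_kind[c] = (c == 7 && c7_ultra_enabled)
                                 ? ROUTE_DEMO_C7_ULTRA
                                 : ROUTE_DEMO_LEGACY;
    }
}

/* Hot path: one bounds-checked (debug only) array load. */
static RouteDemoKind route_demo_for_class(const RouteDemoCtx* ctx, int class_idx) {
    assert(class_idx >= 0 && class_idx < ROUTE_DEMO_CLASSES);
    return ctx->route_kind[class_idx];
}
```

The table must be rebuilt whenever the underlying flags change, so this fits most naturally next to whatever refresh hook already updates the snapshot.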
### 9.3 Constructor Init (Option C Hybrid)
For production builds (no bench_profile):
- Use `__attribute__((constructor))` to init context at startup
- Eliminate lazy init check (g_init always 1)
- Benefit: -1 branch per operation (init check)
- Limitation: No runtime ENV changes (production-only optimization)
---
## 10. Comparison to Phase 4 E1
### Phase 4 E1 (ENV Snapshot)
**What it did**:
- Consolidated 3 ENV reads (`tiny_c7_ultra_enabled_env`, `tiny_front_v3_enabled`, `tiny_metadata_cache_enabled`) into 1 snapshot struct
- Result: +3.92% throughput (Mixed)
- Status: Promoted in presets (global default still OFF)
**Limitation**:
- Still calls `hakmem_env_snapshot_enabled()` 5 times per operation
- Each call: gate loads + branches
- ENV check overhead remains: ~7% perf samples
### Phase 19-3 (ENV Snapshot Consolidation)
**What it does**:
- Eliminates repeated `hakmem_env_snapshot_enabled()` calls inside deep helpers:
  - wrapper entry does the gate check once and passes `const HakmemEnvSnapshot* env` down
  - deep helpers use `if (env)` + direct field reads
**Benefit over Phase 4 E1**:
- Phase 4 E1: Consolidated ENV **values** (3 gates → 1 snapshot)
- Phase 19-3: Consolidates ENV **checks** (5 snapshot calls → 1 context call)
- Complementary: Phase 19-3 builds on Phase 4 E1 infrastructure
**Combined impact**:
- Phase 4 E1: +3.92% (ENV value consolidation)
- Phase 19-3: +5-8% (ENV check consolidation)
- Not additive (overlap), but Phase 19-3 should subsume Phase 4 E1 gains
---
## 11. Conclusion
Phase 19-3 (ENV Snapshot Consolidation) targets a clear, measurable overhead:
- **Current**: repeated `hakmem_env_snapshot_enabled()` gate checks scattered across hot helpers
- **After**: wrapper entry gate check once + `env` pass-down
- **Reduction**: fewer gate branches + fewer loads + less code/layout churn
**Expected outcome**: +5-8% throughput (aligned with Phase 19-1 Design Candidate B estimate)
**Recommended approach**: **Option A (Entry-Point Snapshot)**
- Clear API, type-safe context passing
- Preserves Box Theory (NULL context → fallback)
- Gradual migration (3 sub-phases)
- Benchmark-compatible (bench_profile refresh works)
**Risk**: MEDIUM (API changes, ENV invalidation handling)
**Effort**: 8-12 hours (implementation) + 6-8 hours (testing)
**Timeline**: 2 weeks (3 sub-phases + A/B validation)
**Next steps**:
1. Phase 19-3a done (UNLIKELY hint removal, GO)
2. Implement Phase 19-3b (wrapper env pass-down to hot helpers)
3. A/B test (expect +1-3% incremental on top of 19-3a)
4. Implement Phase 19-3c (legacy + metadata pass-down)
5. Final A/B test
6. Graduate if GO (add to MIXED_TINYV3_C7_SAFE preset)
This positions Phase 19-3 as a **high-ROI, medium-risk** optimization with clear measurement criteria and rollback strategy.