Phase 19-3: ENV Snapshot Consolidation — Design
0. Goal
Objective: Reduce ENV check overhead from per-operation 3+ TLS reads to 1 TLS read
Expected Impact (target): -10.0 instructions/op, -4.0 branches/op, +3-8% throughput
Risk Level: MEDIUM (ENV invalidation handling required)
Box Name: EnvSnapshotConsolidationBox (Phase 19-3)
Context
Phase 19 perf analysis revealed that ENV checks are executed 3+ times per operation:
- hakmem_env_snapshot_enabled(): called 5 times in malloc/free hot paths (lines 236, 403, 624, 830, 910 in malloc_tiny_fast.h)
- Each call triggers:
  - TLS read of g_hakmem_env_snapshot_ctor_mode
  - TLS read of g_hakmem_env_snapshot_gate
  - Branch prediction overhead
  - Potential lazy initialization check
Current overhead: ~7% perf samples on hakmem_env_snapshot_enabled() and related ENV checks.
Phase 4 E1 Status: ENV snapshot infrastructure exists (global default OFF, but promoted ON in presets like MIXED_TINYV3_C7_SAFE). Phase 19-3 aims to:
- Eliminate redundant hakmem_env_snapshot_enabled() checks (5 calls → 1 call)
- Make ENV snapshot the default path (not research box)
- Further consolidate ENV reads into entry-point snapshot
Phase 19-3a Result (validated)
Phase 19-3a removed the call-site UNLIKELY hint:
__builtin_expect(hakmem_env_snapshot_enabled(), 0) → hakmem_env_snapshot_enabled()
Observed impact: GO (+4.42% throughput) on Mixed. This validates that the remaining ENV work is dominated by branch/layout effects, not just raw "read cost".
Phase 19-3b Result (validated)
Phase 19-3b consolidated snapshot reads by capturing env once per hot call and passing it down into nested helpers.
Observed impact: GO (+2.76% mean / +2.57% median) on Mixed 10-run (scripts/run_mixed_10_cleanenv.sh).
1. Current State Analysis
1.1 ENV Check Locations (Per-Operation)
Based on code analysis, ENV checks occur in these hot path locations:
malloc_tiny_fast() path:
- Line 236: C7 ULTRA check (hakmem_env_snapshot_enabled() → hakmem_env_snapshot())
- Line 403: Front V3 snapshot check for free() (in free_tiny_fast_v4_hotcold)
- Line 910: Front V3 snapshot check for free() (in free_tiny_fast_v4_larson)
free_tiny_fast() paths:
- Line 624: C7 ULTRA check (hakmem_env_snapshot_enabled() → hakmem_env_snapshot())
- Line 830: C7 ULTRA check (duplicate in free_tiny_fast_v4_larson)
tiny_legacy_fallback_box.h:
- Line 28: hakmem_env_snapshot_enabled() for front_snap + metadata_cache_on
tiny_metadata_cache_hot_box.h:
- Line 64: hakmem_env_snapshot_enabled() for metadata cache effective check
1.2 TLS Read Overhead Analysis
Each hakmem_env_snapshot_enabled() call performs:
int ctor_mode = g_hakmem_env_snapshot_ctor_mode;  // TLS read #1
if (ctor_mode == 1) {
    return g_hakmem_env_snapshot_gate != 0;       // TLS read #2 (ctor path)
}
// Legacy path
if (g_hakmem_env_snapshot_gate == -1) {           // TLS read #2 (legacy path)
    // Lazy init with getenv()
}
Per-operation cost (when snapshot enabled):
- 5 calls × 2 TLS reads = 10 TLS reads/op
- Plus: 5× branch on ctor_mode, 5× branch on snapshot enabled
- Actual measurement: ~7% perf samples
Per-operation cost (when snapshot disabled - current default):
- 5 calls × 2-3 TLS reads = 10-15 TLS reads/op
- Plus: lazy init checks, getenv() overhead on first call per thread
1.3 Redundancy Analysis
Problem: Each hot path independently checks hakmem_env_snapshot_enabled():
- malloc C7 ULTRA: check at line 236
- free C7 ULTRA: check at line 624 (same operation, different code path)
- free front V3: check at line 403 and 910 (same snapshot needed)
- Legacy fallback: check at line 28 (called from above paths)
- Metadata cache: check at line 64 (called from above paths)
Redundancy: For a typical malloc+free pair:
- Current: 5+ hakmem_env_snapshot_enabled() calls = 10-15 TLS reads
- Optimal: 1 entry-point snapshot = 1-2 TLS reads
Gap: 8-13 redundant TLS reads per operation
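The redundancy can be reproduced in a toy model. The sketch below is not the hakmem code: every name (GcEnvSnapshot, gc_snapshot_enabled, and so on) is hypothetical, and the five call sites are reduced to a loop. It only counts how often the gate is consulted in the scattered style versus the entry-point pass-down style.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real snapshot type and gate. */
typedef struct {
    int c7_ultra_enabled;
    int front_v3_enabled;
} GcEnvSnapshot;

static GcEnvSnapshot g_gc_snapshot = { 1, 1 };
static int g_gc_gate = 1;          /* 1 = snapshot path enabled */
static unsigned g_gc_gate_checks;  /* how many times the gate was consulted */

static int gc_snapshot_enabled(void) {
    g_gc_gate_checks++;
    return g_gc_gate != 0;
}

/* Before: each helper consults the gate on its own. */
static int gc_helper_scattered(void) {
    return gc_snapshot_enabled() && g_gc_snapshot.c7_ultra_enabled;
}

/* After: the helper receives the pointer captured at entry (NULL = gate off). */
static int gc_helper_with_env(const GcEnvSnapshot* env) {
    return env != NULL && env->c7_ultra_enabled;
}

/* One malloc+free pair that touches five call sites, scattered style. */
static unsigned gc_pair_scattered(void) {
    g_gc_gate_checks = 0;
    for (int site = 0; site < 5; site++) {
        (void)gc_helper_scattered();
    }
    return g_gc_gate_checks;
}

/* The same five call sites with a single gate check at entry. */
static unsigned gc_pair_passdown(void) {
    g_gc_gate_checks = 0;
    const GcEnvSnapshot* env =
        gc_snapshot_enabled() ? &g_gc_snapshot : NULL;
    for (int site = 0; site < 5; site++) {
        (void)gc_helper_with_env(env);
    }
    return g_gc_gate_checks;
}
```

In this model gc_pair_scattered() reports 5 gate checks and gc_pair_passdown() reports 1. The real saving depends on how many of the five sites a given operation actually reaches.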
2. Design Options
Option A: Entry-Point Snapshot Pass-Down (Recommended)
Concept: Capture the existing HakmemEnvSnapshot pointer once at malloc/free entry, and pass it down.
This avoids creating a new TLS context and automatically stays compatible with hakmem_env_snapshot_refresh_from_env() (refresh updates the snapshot in-place).
Architecture:
// At wrapper entry (malloc/free):
const HakmemEnvSnapshot* env =
    hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;
// In malloc_tiny_fast():
void* malloc_tiny_fast_with_env(size_t size, const HakmemEnvSnapshot* env) {
    // Use env->tiny_c7_ultra_enabled instead of calling hakmem_env_snapshot_enabled()
    if (env && class_idx == 7 && env->tiny_c7_ultra_enabled) {
        // Direct field check, no further gate reads
    }
}
Pros:
- Minimal refactoring: Add context parameter to existing functions
- Type safety: Compiler enforces context passing
- Clear boundary: ENV decisions made at entry, logic below is pure
- Easy rollback: Context parameter can be NULL (fallback to old path)
Cons:
- API threading: Some hot helpers need an extra pointer parameter (env) or _with_env variants.
- Register pressure: Extra parameter may affect register allocation (verify via perf stat).
Risk: LOW-MEDIUM (mechanical threading, rollback is simple)
Option B: TLS Cached Context (Alternative)
Concept: Maintain thread-local ENV context, refresh on invalidation events.
Architecture:
// Global TLS context (replaces per-call ENV checks)
static __thread FastLaneEnvCtx g_fastlane_ctx;
static __thread int g_fastlane_ctx_version = 0;
extern int g_env_snapshot_version;  // Incremented on ENV change
static inline const FastLaneEnvCtx* fastlane_ctx_get(void) {
    if (__builtin_expect(g_fastlane_ctx_version != g_env_snapshot_version, 0)) {
        // Refresh from snapshot (rare)
        const HakmemEnvSnapshot* snap = hakmem_env_snapshot();
        g_fastlane_ctx.c7_ultra_enabled = snap->tiny_c7_ultra_enabled;
        // ... copy fields
        g_fastlane_ctx_version = g_env_snapshot_version;
    }
    return &g_fastlane_ctx;
}
// In hot path:
const FastLaneEnvCtx* ctx = fastlane_ctx_get();  // 1 TLS read + 1 branch
if (class_idx == 7 && ctx->c7_ultra_enabled) {   // Direct struct access
Pros:
- No API changes: Existing functions unchanged
- Single TLS read: Version check is fast (1 global read + 1 TLS read)
- Automatic invalidation: Version bump triggers refresh
- Easy integration: Drop-in replacement for hakmem_env_snapshot_enabled()
Cons:
- Version management: Need global version counter + invalidation hooks
- Stale data risk: If version check is missed, stale context used
- Init complexity: Each thread needs lazy init + version tracking
- Debugging: Harder to trace when context was last refreshed
Risk: MEDIUM (version invalidation must be bulletproof)
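The "bulletproof" requirement mostly comes down to publishing the version bump so that any thread observing the new version also observes the new values. A minimal sketch of that pattern with C11 atomics follows. It is an illustration under assumptions, not the Option B implementation: all names (VerFastLaneCtx, ver_env_set_c7_ultra, ver_ctx_get) are hypothetical and only one flag is cached.

```c
#include <stdatomic.h>

typedef struct { int c7_ultra_enabled; } VerFastLaneCtx;

/* Process-wide state: the authoritative flag plus a version counter.
 * The version starts at 1 so that a thread-local version of 0 means
 * "never loaded" and forces the first refresh. */
static atomic_int  g_ver_c7_ultra = 0;
static atomic_uint g_ver_env_version = 1;

/* Per-thread cache. */
static _Thread_local VerFastLaneCtx t_ver_ctx;
static _Thread_local unsigned t_ver_ctx_version = 0;
static _Thread_local unsigned t_ver_refreshes = 0;  /* for observability only */

/* Writer side: update the value first, then bump the version with release
 * ordering so a reader that sees the new version also sees the new value. */
static void ver_env_set_c7_ultra(int on) {
    atomic_store_explicit(&g_ver_c7_ultra, on, memory_order_relaxed);
    atomic_fetch_add_explicit(&g_ver_env_version, 1u, memory_order_release);
}

/* Reader side: one acquire load of the global version plus one TLS compare
 * on the fast path; the copy happens only when the version changed. */
static const VerFastLaneCtx* ver_ctx_get(void) {
    unsigned v = atomic_load_explicit(&g_ver_env_version, memory_order_acquire);
    if (v != t_ver_ctx_version) {
        t_ver_ctx.c7_ultra_enabled =
            atomic_load_explicit(&g_ver_c7_ultra, memory_order_relaxed);
        t_ver_ctx_version = v;
        t_ver_refreshes++;
    }
    return &t_ver_ctx;
}
```

If a second update lands between the version load and the flag load, the thread caches a newer value under an older version and simply refreshes again on its next call, so the cache never stays stale. The cost is the invalidation hook: every code path that changes ENV has to go through the writer function.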
Option C: Init-Time Fixed (High Risk)
Concept: Read ENV once at process init, freeze configuration for lifetime.
Architecture:
// Global constants (set in constructor)
static bool g_c7_ultra_enabled_fixed;
static bool g_front_v3_enabled_fixed;
__attribute__((constructor))
static void fastlane_env_init(void) {
    const HakmemEnvSnapshot* snap = hakmem_env_snapshot();
    g_c7_ultra_enabled_fixed = snap->tiny_c7_ultra_enabled;
    g_front_v3_enabled_fixed = snap->tiny_front_v3_enabled;
}
// Hot path: direct global read (no TLS)
if (class_idx == 7 && g_c7_ultra_enabled_fixed) {
Pros:
- Zero TLS reads: Direct global variable access
- Maximum performance: Plain global load with no lazy-init branch (true constant folding would additionally require a build-time constant, not a constructor-set global)
- Simple implementation: No lazy init, no version tracking
Cons:
- No runtime ENV changes: ENV toggles require process restart
- Breaks bench_profile: putenv() in benchmarks will not work
- No A/B testing: Cannot toggle ENV for same-binary comparison
- Box Theory violation: No rollback/toggle capability
Risk: HIGH (breaks existing workflow, violates Box Theory)
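The bench_profile breakage is easy to demonstrate in isolation. The sketch below assumes a GCC/Clang toolchain (for the constructor attribute) and a POSIX libc (for setenv); the variable name FRZ_DEMO_C7_ULTRA and all function names are hypothetical. A flag frozen by a constructor keeps its load-time value even after the environment changes.

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv() under strict -std=c11 */
#include <stdlib.h>
#include <string.h>

static int g_frz_c7_ultra_fixed;  /* frozen once, at load time */

/* Returns 1 only when the variable is set to exactly "1". */
static int frz_env_flag(const char* name) {
    const char* v = getenv(name);
    return v != NULL && strcmp(v, "1") == 0;
}

/* Runs before main(): the environment is sampled exactly once. */
__attribute__((constructor))
static void frz_env_freeze(void) {
    g_frz_c7_ultra_fixed = frz_env_flag("FRZ_DEMO_C7_ULTRA");
}

/* Hot-path style read: a plain global load, blind to later changes. */
static int frz_c7_ultra_frozen(void) {
    return g_frz_c7_ultra_fixed;
}

/* Live read, shown only for comparison. */
static int frz_c7_ultra_live(void) {
    return frz_env_flag("FRZ_DEMO_C7_ULTRA");
}
```

Calling setenv() on FRZ_DEMO_C7_ULTRA after startup changes what frz_c7_ultra_live() returns but not what frz_c7_ultra_frozen() returns, which is the behavior that rules Option C out for same-binary A/B runs.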
Recommended: Option A (Entry-Point Snapshot Pass-Down)
Reasoning:
- Preserves Box Theory: env==NULL → fallback to old path
- Clear separation: ENV decisions at entry, pure logic below
- Benchmark compatible: Works with bench_profile putenv + hakmem_env_snapshot_refresh_from_env() (snapshot updates in-place)
- Performance: Removes repeated hakmem_env_snapshot_enabled() checks inside deep helpers
Trade-off acceptance:
- Accept API changes (mechanical, low risk)
- Accept extra parameter (register pressure acceptable for hot path)
- Reject Option B's version management complexity
- Reject Option C's inflexibility
3. Implementation Plan (Option A)
3.1 Box Design
Box Name: EnvSnapshotConsolidationBox (Phase 19-3)
Files:
- Modified: core/front/malloc_tiny_fast.h
  - Phase 19-3a: remove backwards __builtin_expect(..., 0) hints (DONE, +4.42% GO).
  - Phase 19-3b: thread const HakmemEnvSnapshot* env down to eliminate repeated hakmem_env_snapshot_enabled() checks (DONE, +2.76% GO).
- Modified: core/box/tiny_legacy_fallback_box.h
  - Add _with_env helper (Phase 19-3b).
- Modified: core/box/tiny_metadata_cache_hot_box.h
  - Add _with_env helper (Phase 19-3b).
- Optional: core/box/hak_wrappers.inc.h
  - If needed, compute env once per wrapper entry and pass it down (removes the remaining alloc-side gate in malloc_tiny_fast_for_class()).
ENV Gate:
- Base: HAKMEM_ENV_SNAPSHOT=0/1 (Phase 4 E1 gate; promoted ON in presets)
Rollback:
- Snapshot behavior: set HAKMEM_ENV_SNAPSHOT=0 to fall back to per-feature env gates.
- Pass-down refactor: revert the Phase 19-3b commit (or add a dedicated pass-down gate if future A/B is needed).
3.2 API Design
Pass-down API (recommended):
// Wrapper entry (malloc/free): read snapshot ONCE, pass down.
const HakmemEnvSnapshot* env =
    hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;
// Hot helpers accept optional env pointer (NULL-safe).
void* malloc_tiny_fast_with_env(size_t size, const HakmemEnvSnapshot* env);
int free_tiny_fast_with_env(void* ptr, const HakmemEnvSnapshot* env);
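The "_with_env helper + keep wrapper" pattern can be sketched as below. This is an illustration under assumptions, not the code in the box headers: all names (WenvSnapshot, wenv_metadata_cache_on, and so on) are hypothetical, and the per-feature legacy gate is stubbed to return 0.

```c
#include <stddef.h>

typedef struct { int metadata_cache_enabled; } WenvSnapshot;

static WenvSnapshot g_wenv_snapshot = { 1 };
static int g_wenv_gate = 1;          /* 1 = snapshot path enabled */
static unsigned g_wenv_gate_checks;  /* counts gate consultations */

static int wenv_snapshot_enabled(void) {
    g_wenv_gate_checks++;
    return g_wenv_gate != 0;
}

static const WenvSnapshot* wenv_snapshot(void) {
    return &g_wenv_snapshot;
}

/* Stub for the per-feature gate used when no snapshot is available. */
static int wenv_metadata_cache_legacy_gate(void) {
    return 0;
}

/* New variant: the caller supplies env; NULL selects the old path. */
static int wenv_metadata_cache_on_with_env(const WenvSnapshot* env) {
    if (env != NULL) {
        return env->metadata_cache_enabled;  /* direct field read, no gate */
    }
    return wenv_metadata_cache_legacy_gate();
}

/* Old signature kept as a thin wrapper so existing callers still compile.
 * It pays for one gate check and delegates to the _with_env variant. */
static int wenv_metadata_cache_on(void) {
    const WenvSnapshot* env =
        wenv_snapshot_enabled() ? wenv_snapshot() : NULL;
    return wenv_metadata_cache_on_with_env(env);
}
```

Callers that already hold env use the _with_env variant and skip the gate entirely; callers that have not been migrated keep the old behavior through the wrapper.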
3.3 Migration Plan (Incremental)
Phase 19-3a (DONE): remove backwards UNLIKELY hints at the 5 hottest call sites in core/front/malloc_tiny_fast.h.
- __builtin_expect(hakmem_env_snapshot_enabled(), 0) → hakmem_env_snapshot_enabled()
- Measured: GO (+4.42%)
Phase 19-3b (DONE): capture env once per hot call and pass it down into nested helpers.
- In core/front/malloc_tiny_fast.h:
  - free_tiny_fast() / free_tiny_fast_hot() capture env once and pass it to cold + legacy helpers.
  - malloc_tiny_fast_for_class() reuses the same snapshot for tiny_policy_hot_get_route_with_env(...).
- In core/box/tiny_legacy_fallback_box.h and core/box/tiny_metadata_cache_hot_box.h:
  - add _with_env helpers to consume the pass-down pointer.
- Measured: GO (+2.76% mean / +2.57% median) on Mixed 10-run.
Phase 19-3c (OPTIONAL): pass env down from wrapper entry to remove the remaining alloc-side gate.
- The propagation of env into the legacy fallback + metadata cache helpers, originally scoped for this phase, was already done in Phase 19-3b.
3.4 Files to Modify
- core/front/malloc_tiny_fast.h
  - Phase 19-3a: UNLIKELY hint removal.
  - Phase 19-3b: pass-down env to cold + legacy helpers.
- core/box/tiny_legacy_fallback_box.h
  - Phase 19-3b: add _with_env helper + keep wrapper.
- core/box/tiny_metadata_cache_hot_box.h
  - Phase 19-3b: add _with_env helper + keep wrapper.
- (Optional) core/box/hak_wrappers.inc.h
  - Pass env down from wrapper entry (alloc-side; removes one remaining gate).
4. Safety / Box Theory
4.1 Boundary Preservation
L0 (ENV gate):
- HAKMEM_ENV_SNAPSHOT=0 → env==NULL → fallback to per-feature env gates
- HAKMEM_ENV_SNAPSHOT=1 → env!=NULL → snapshot-based checks
- (Optional) A dedicated "pass-down gate" can be introduced for A/B safety, but avoid adding a new hot branch unless needed.
L1 (Hot inline):
- No algorithmic changes, only ENV check consolidation
- Existing malloc_tiny_fast() / free_tiny_fast() logic unchanged
- env is read-only (const pointer)
L2 (Cold fallback):
- Cold paths unchanged (no context propagation needed)
- Legacy fallback accepts optional env
L3 (Stats/Observability):
- Add counter: ENV_CONSOLIDATION_STAT_INC(enabled_calls)
- Track: pass-down hits, fallback path usage
- Perf verification: reduced hakmem_env_snapshot_enabled() hot samples
4.2 Fail-Fast
NULL env handling:
- All functions accept env==NULL → fallback to existing path
- No crashes, no undefined behavior
- Debug builds: assert(env!=NULL) only if the pass-down gate is enabled (optional)
ENV invalidation:
- Snapshot refresh is handled by the existing Phase 4 E1 mechanism:
  - bench_profile uses hakmem_env_snapshot_refresh_from_env() after putenv()
  - Snapshot updates in-place, so the env pointer remains valid
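Why an in-place refresh keeps a previously captured pointer valid can be sketched as follows. This is a single-threaded illustration under assumptions: the names (RfrEnvSnapshot, rfr_snapshot_refresh_from_env, RFR_DEMO_C7_ULTRA) are hypothetical, setenv() requires a POSIX libc, and nothing here claims to match the internals of hakmem_env_snapshot_refresh_from_env().

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv() under strict -std=c11 */
#include <stdlib.h>
#include <string.h>

typedef struct { int c7_ultra_enabled; } RfrEnvSnapshot;

/* One static instance that is never reallocated, so its address is stable
 * for the lifetime of the process. */
static RfrEnvSnapshot g_rfr_snapshot;

/* Re-reads the environment and overwrites the fields in place. */
static void rfr_snapshot_refresh_from_env(void) {
    const char* v = getenv("RFR_DEMO_C7_ULTRA");
    g_rfr_snapshot.c7_ultra_enabled = (v != NULL && strcmp(v, "1") == 0);
}

/* Hands out a read-only view of the single instance. */
static const RfrEnvSnapshot* rfr_snapshot(void) {
    return &g_rfr_snapshot;
}
```

A caller that captured env = rfr_snapshot() before a refresh sees the new field values through the same pointer afterwards. Until the refresh runs, changing the environment has no effect on the snapshot, which is the "requires explicit refresh" limitation inherited from Phase 4 E1. Concurrent refresh while other threads read the fields is outside the scope of this sketch.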
4.3 Rollback
Runtime rollback:
HAKMEM_ENV_SNAPSHOT=0 # Disable snapshot path (falls back to per-feature env gates)
Gradual rollout:
- Phase 19-3a: UNLIKELY hint removal (DONE, GO)
- Phase 19-3b: hot helper pass-down (DONE, GO)
- Phase 19-3c: optional wrapper-entry pass-down (alloc-side; measure)
4.4 Observability
Stats counters (debug builds):
typedef struct {
    uint64_t env_passdown_hits;  // wrapper passed non-NULL env
    uint64_t env_null_fallback;  // env==NULL, used old path
    uint64_t malloc_env_path;    // malloc used env pass-down
    uint64_t free_env_path;      // free used env pass-down
} EnvConsolidationStats;
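One possible shape for the counter macro is sketched below. The macro name and the struct come from this document; everything else is an assumption: the build flag ENV_CONSOLIDATION_STATS, the thread-local storage of the counters, and the helper stats_note_free() are hypothetical. The enabled_calls counter mentioned in 4.1 is not a field of this struct and is left out.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t env_passdown_hits;  /* wrapper passed non-NULL env */
    uint64_t env_null_fallback;  /* env==NULL, used old path */
    uint64_t malloc_env_path;    /* malloc used env pass-down */
    uint64_t free_env_path;      /* free used env pass-down */
} EnvConsolidationStats;

/* Hypothetical build flag: 1 in debug builds, 0 in release builds. */
#ifndef ENV_CONSOLIDATION_STATS
#define ENV_CONSOLIDATION_STATS 1
#endif

/* Per-thread counters avoid atomics on the hot path (assumption). */
static _Thread_local EnvConsolidationStats t_env_stats;

#if ENV_CONSOLIDATION_STATS
#define ENV_CONSOLIDATION_STAT_INC(field) ((void)(t_env_stats.field++))
#else
#define ENV_CONSOLIDATION_STAT_INC(field) ((void)0)  /* compiles to nothing */
#endif

/* Example call site on the free path. */
static void stats_note_free(const void* env) {
    if (env != NULL) {
        ENV_CONSOLIDATION_STAT_INC(env_passdown_hits);
        ENV_CONSOLIDATION_STAT_INC(free_env_path);
    } else {
        ENV_CONSOLIDATION_STAT_INC(env_null_fallback);
    }
}
```

With the flag set to 0 the macro expands to a no-op, so release builds carry no counter updates in the hot path.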
Perf validation:
- Before: perf record shows hakmem_env_snapshot_enabled at ~7%
- After: hakmem_env_snapshot_enabled should drop to <1%
- Expected: deep helpers stop calling hakmem_env_snapshot_enabled() repeatedly (single capture per hot call)
A/B testing:
# Recommended: compare baseline vs optimized commits with the same bench script
scripts/run_mixed_10_cleanenv.sh
5. Expected Performance
5.1 Instruction Reduction Estimate
Current overhead (per malloc+free operation):
- 5 calls to hakmem_env_snapshot_enabled():
  - Each: gate loads + branches (and legacy lazy-init path on first call)
  - Total: ~5 gate checks per operation across hot helpers
After Phase 19-3:
- 1 call at wrapper entry:
  - hakmem_env_snapshot_enabled() once
  - hakmem_env_snapshot() once (when enabled)
- Deep helpers use if (env) + direct field reads (no further gate checks)
Reduction:
- Gate checks: ~5 → 1 (wrapper entry only)
- Branches: reduce repeated gate branches inside hot helpers
- Instructions: target ~-10 instructions/op (order-of-magnitude)
5.2 Branch Reduction Estimate
Current branching:
- hakmem_env_snapshot_enabled(): 2 branches (ctor_mode check + gate check)
- Called 5 times = 10 branches/op
After Phase 19-3:
- Gate check is done once at wrapper entry; deep helpers reuse env pointer.
Reduction: 10 → 4 = -6 branches/op (conservative estimate: -4 branches/op accounting for overlap)
5.3 Throughput Estimate
Phase 19-1 Design Doc (Candidate B) estimates:
- Instructions: -10.0/op
- Branches: -4.0/op
- Throughput: +5-8%
Phase 19-3 targets (aligned with Candidate B):
- Instructions: -10.0/op ✓
- Branches: -4.0/op ✓
- Throughput: +5-8% (expected on top of Phase 19-2 baseline)
Validation criteria:
- Perf stat shows instruction count reduction: ≥8.0/op (80% of estimate)
- Perf stat shows branch count reduction: ≥3.0/op (75% of estimate)
- Throughput improvement: ≥4.0% (80% of lower bound estimate)
6. Risk Assessment
6.1 Technical Risks
MEDIUM: API Signature Changes
- Risk: Adding context parameter changes function signatures
- Mitigation: Keep old signatures, add _with_env variants
- Rollback: NULL env → fallback to old implementation
- Timeline: 1 phase at a time (19-3a → 19-3b → 19-3c)
MEDIUM: ENV Invalidation
- Risk: Runtime ENV changes (bench_profile putenv) may not refresh context
- Mitigation: Phase 19-3 inherits Phase 4 E1 refresh mechanism
- Limitation: Same as current ENV snapshot (requires explicit refresh)
- Future: Add version tracking (Option B) if runtime toggle needed
LOW: Register Pressure
- Risk: Extra context parameter may increase register spills
- Mitigation: Context is const pointer (register-friendly)
- Validation: Check perf stat for stall increases
- Rollback: Disable via ENV if regression detected
LOW: Lazy Init Overhead
- Risk: First call to fastlane_env_ctx() adds init cost
- Mitigation: One-time per thread (amortized over millions of ops)
- Measurement: Should be <0.1% overhead (to be verified via perf)
6.2 Performance Risks
Risk: Overhead greater than savings
- Scenario: Context struct access slower than optimized TLS reads
- Likelihood: LOW (struct access is 1-2 instructions, TLS read is 5-10)
- Detection: Perf stat will show instruction count increase
- Rollback: ENV=0 immediately reverts
Risk: Branch predictor thrashing
- Scenario: New branch patterns confuse CPU predictor
- Likelihood: LOW (reducing branches helps predictor)
- Detection: Branch miss rate increases in perf stat
- Rollback: ENV=0 immediately reverts
6.3 Integration Risks
Risk: Breaks bench_profile ENV refresh
- Scenario: Context cached before putenv(), stale values used
- Likelihood: MEDIUM (same issue as Phase 4 E1)
- Mitigation: Follow Phase 4 E1 pattern (explicit refresh hook)
- Validation: Run bench suite with ENV toggles
Risk: Conflicts with FastLane Direct (Phase 19-2)
- Scenario: Phase 19-2 removed wrapper, context injection point unclear
- Likelihood: LOW (context added at new entry point)
- Mitigation: Phase 19-3 builds on Phase 19-2 baseline
- Validation: A/B test with FASTLANE_DIRECT=1 + ENV_CONSOLIDATION=1
7. Validation Checklist
7.1 Pre-Implementation
- Verify Phase 4 E1 (ENV snapshot) is stable and working
- Verify Phase 19-2 (FASTLANE_DIRECT) is stable baseline
- Document current hakmem_env_snapshot_enabled() call sites (5 locations)
- Create test plan for ENV refresh (bench_profile compatibility)
7.2 Implementation
- Implement fastlane_env_ctx_box.h (context struct + getter)
- Add malloc_tiny_fast_ctx() variant (Phase 19-3a)
- Add free_tiny_fast_ctx() variant (Phase 19-3b)
- Propagate context to tiny_legacy_fallback_box.h (Phase 19-3c)
- (Optional) Add a dedicated pass-down gate if A/B within a single binary is needed
- Add stats counters (debug builds)
7.3 Testing (Per Phase)
Phase 19-3a (malloc path):
- Correctness: Run make test suite (all tests pass)
- Perf stat: Measure instruction/branch reduction (ENV=0 vs ENV=1)
- Perf record: Verify hakmem_env_snapshot_enabled samples drop
- Benchmark: Mixed 10-run (expect +2-3% from malloc path alone)
Phase 19-3b (free path):
- Correctness: Run make test + Larson (all tests pass)
- Perf stat: Measure cumulative reduction (vs baseline)
- Perf record: Verify further reduction in ENV check samples
- Benchmark: Mixed 10-run (expect +3-5% cumulative)
Phase 19-3c (legacy + metadata):
- Correctness: Full test suite including multithreaded
- Perf stat: Verify -10.0 instr/op, -4.0 branches/op (goal)
- Perf record: hakmem_env_snapshot_enabled <1% samples
- Benchmark: Mixed 10-run (expect +5-8% cumulative)
7.4 A/B Test (Final Validation)
Benchmark suite:
# Run the same cleanenv script on baseline vs optimized commits
scripts/run_mixed_10_cleanenv.sh
GO/NO-GO criteria:
- GO: Mean throughput +5.0% or higher (within ±20% of +5-8% estimate)
- NEUTRAL: +2.0% to +5.0% → keep as research box, preset-only promotion
- NO-GO: <+2.0% or regression → revert, analyze perf data
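The verdict thresholds can be encoded directly so that every A/B run is classified the same way. The sketch below is a hypothetical helper, not part of the bench scripts: the names (AbVerdict, ab_verdict) and the choice of mean throughput as input are assumptions; only the +5.0% and +2.0% cut-offs come from the criteria above.

```c
typedef enum {
    AB_VERDICT_NO_GO,    /* below +2.0% or a regression */
    AB_VERDICT_NEUTRAL,  /* +2.0% up to (but not including) +5.0% */
    AB_VERDICT_GO        /* +5.0% or higher */
} AbVerdict;

/* Classifies a run from baseline and optimized mean throughput.
 * Both inputs must use the same unit (for example M ops/s) and the
 * baseline must be positive. */
static AbVerdict ab_verdict(double baseline_tput, double optimized_tput) {
    double gain_pct = (optimized_tput - baseline_tput) / baseline_tput * 100.0;
    if (gain_pct >= 5.0) {
        return AB_VERDICT_GO;
    }
    if (gain_pct >= 2.0) {
        return AB_VERDICT_NEUTRAL;
    }
    return AB_VERDICT_NO_GO;
}
```

For example, a baseline of 100.0 against an optimized mean of 106.0 classifies as GO, 103.0 as NEUTRAL, and 99.0 as NO-GO.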
Perf stat validation:
perf stat -e cycles,instructions,branches,branch-misses,L1-icache-load-misses \
    -- ./bench_random_mixed_hakmem 200000000 400 1
Expected deltas:
- Instructions/op: -8.0 to -12.0 (target: -10.0)
- Branches/op: -3.0 to -5.0 (target: -4.0)
- Branch-miss%: unchanged or slightly better (fewer branches)
- Throughput: +4.0% to +10.0% (target: +5-8%)
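perf stat reports whole-run totals, so the per-op deltas have to be derived by dividing by the operation count. A minimal sketch of that arithmetic follows. The helper names are hypothetical, and treating the first benchmark argument (200000000) as the operation count is an assumption about bench_random_mixed_hakmem, not something this document states.

```c
#include <stdint.h>

/* Converts a whole-run counter total into a per-operation figure. */
static double perop_value(uint64_t counter_total, uint64_t ops) {
    return ops != 0 ? (double)counter_total / (double)ops : 0.0;
}

/* Per-op delta between two runs with the same operation count.
 * Negative means the optimized run executed fewer events per op. */
static double perop_delta(uint64_t baseline_total,
                          uint64_t optimized_total,
                          uint64_t ops) {
    return perop_value(optimized_total, ops) - perop_value(baseline_total, ops);
}

/* Checks a delta against an expected window such as [-12.0, -8.0]. */
static int perop_delta_in_range(double delta, double lo, double hi) {
    return delta >= lo && delta <= hi;
}
```

The totals also include process startup and benchmark setup, so the derived deltas are only meaningful when both runs use identical arguments and the run is long enough for the hot loop to dominate.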
8. Rollout Plan
8.1 Phase 19-3a: malloc Path (Week 1)
Scope: Add context to malloc hot path
- Modify malloc_tiny_fast() to accept context
- Update C7 ULTRA check (line 236)
- Add fastlane_env_ctx_box.h
- Update wrapper.c malloc()
Timeline: 4-6 hours implementation + 2 hours testing
Risk: LOW (isolated to alloc path)
Rollback: revert Phase 19-3b commit (or set HAKMEM_ENV_SNAPSHOT=0 to disable snapshot path)
8.2 Phase 19-3b: free Path (Week 1)
Scope: Add context to free hot path
- Modify free_tiny_fast() to accept context
- Update C7 ULTRA checks (lines 624, 830)
- Update front V3 checks (lines 403, 910)
- Update wrapper.c free()
Timeline: 4-6 hours implementation + 2 hours testing
Risk: LOW-MEDIUM (more call sites than malloc)
Rollback: revert Phase 19-3b commit (or set HAKMEM_ENV_SNAPSHOT=0 to disable snapshot path)
8.3 Phase 19-3c: Legacy + Metadata (Week 2)
Scope: Propagate context to helper boxes
- Update tiny_legacy_fallback_box.h (line 28)
- Update tiny_metadata_cache_hot_box.h (line 64)
- Add context parameter to helper functions
Timeline: 3-4 hours implementation + 2 hours testing
Risk: MEDIUM (touches multiple boxes)
Rollback: revert Phase 19-3b commit (or set HAKMEM_ENV_SNAPSHOT=0 to disable snapshot path)
8.4 Graduate (Week 2-3)
Promotion criteria:
- All phases pass A/B testing (GO verdict)
- Cumulative throughput gain ≥+5.0%
- No correctness regressions (all tests pass)
- Perf validation confirms instruction reduction
Promotion actions:
- Ensure MIXED_TINYV3_C7_SAFE preset keeps HAKMEM_ENV_SNAPSHOT=1 (already)
- Document in optimization roadmap
- Update Box Theory index
- Keep ENV default=0 (opt-in) until production validation
Rollback strategy:
- Preset level: Remove from preset, keep code
- Code level: revert the Phase 19-3b commit
- Emergency: set HAKMEM_ENV_SNAPSHOT=0 (falls back to per-feature env gates)
9. Future Optimization Opportunities
9.1 Version-Based Invalidation (Option B)
If runtime ENV changes become important:
- Add global g_env_snapshot_version counter
- Increment on ENV change (bench_profile, runtime toggle)
- Each thread checks version, refreshes context if stale
- Overhead: +1 global read per operation (still net win vs 10 TLS reads)
9.2 Route Table Consolidation
Extend context to include pre-computed routes:
typedef struct {
    bool c7_ultra_enabled;
    bool front_v3_enabled;
    bool metadata_cache_eff;
    SmallRouteKind route_kind[8];  // Pre-computed per class
} FastLaneEnvCtx;
Benefit: Eliminate tiny_static_route_get_kind_fast() calls
Impact: Additional -3-4 instructions/op, -1-2 branches/op
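The precompute-once, index-by-class idea can be sketched as below. The routing policy shown is purely illustrative: the real SmallRouteKind values and the logic behind tiny_static_route_get_kind_fast() are not described in this document, so the enum RtRouteKind, the struct RtEnvCtx, and both functions are hypothetical stand-ins.

```c
#include <stdbool.h>

enum { RT_NUM_CLASSES = 8 };

/* Hypothetical route kinds standing in for SmallRouteKind. */
typedef enum {
    RT_ROUTE_LEGACY,
    RT_ROUTE_FRONT_V3,
    RT_ROUTE_C7_ULTRA
} RtRouteKind;

typedef struct {
    bool c7_ultra_enabled;
    bool front_v3_enabled;
    RtRouteKind route_kind[RT_NUM_CLASSES];  /* pre-computed per class */
} RtEnvCtx;

/* Cold path: evaluate the ENV-dependent policy once per class.
 * Illustrative policy only: class 7 prefers C7 ULTRA when enabled,
 * everything else uses front V3 when enabled, otherwise legacy. */
static void rt_ctx_build(RtEnvCtx* ctx, bool c7_ultra, bool front_v3) {
    ctx->c7_ultra_enabled = c7_ultra;
    ctx->front_v3_enabled = front_v3;
    for (int cls = 0; cls < RT_NUM_CLASSES; cls++) {
        if (cls == 7 && c7_ultra) {
            ctx->route_kind[cls] = RT_ROUTE_C7_ULTRA;
        } else if (front_v3) {
            ctx->route_kind[cls] = RT_ROUTE_FRONT_V3;
        } else {
            ctx->route_kind[cls] = RT_ROUTE_LEGACY;
        }
    }
}

/* Hot path: a single indexed load; the mask keeps the index in bounds. */
static inline RtRouteKind rt_route_for_class(const RtEnvCtx* ctx, unsigned cls) {
    return ctx->route_kind[cls & (RT_NUM_CLASSES - 1)];
}
```

The table has to be rebuilt whenever the snapshot is refreshed, so this extension inherits the same invalidation question as Option B.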
9.3 Constructor Init (Option C Hybrid)
For production builds (no bench_profile):
- Use __attribute__((constructor)) to init context at startup
- Eliminate lazy init check (g_init always 1)
- Benefit: -1 branch per operation (init check)
- Limitation: No runtime ENV changes (production-only optimization)
10. Comparison to Phase 4 E1
Phase 4 E1 (ENV Snapshot)
What it did:
- Consolidated 3 ENV reads (tiny_c7_ultra_enabled_env, tiny_front_v3_enabled, tiny_metadata_cache_enabled) into 1 snapshot struct
- Result: +3.92% throughput (Mixed)
- Status: Promoted in presets (global default still OFF)
Limitation:
- Still calls hakmem_env_snapshot_enabled() 5 times per operation
- Each call: gate loads + branches
- ENV check overhead remains: ~7% perf samples
Phase 19-3 (ENV Snapshot Consolidation)
What it does:
- Eliminates repeated hakmem_env_snapshot_enabled() calls inside deep helpers:
  - wrapper entry does the gate check once and passes const HakmemEnvSnapshot* env down
  - deep helpers use if (env) + direct field reads
Benefit over Phase 4 E1:
- Phase 4 E1: Consolidated ENV values (3 gates → 1 snapshot)
- Phase 19-3: Consolidates ENV checks (5 snapshot calls → 1 context call)
- Complementary: Phase 19-3 builds on Phase 4 E1 infrastructure
Combined impact:
- Phase 4 E1: +3.92% (ENV value consolidation)
- Phase 19-3: +5-8% (ENV check consolidation)
- Not additive (overlap), but Phase 19-3 should subsume Phase 4 E1 gains
11. Conclusion
Phase 19-3 (ENV Snapshot Consolidation) targets a clear, measurable overhead:
- Current: repeated hakmem_env_snapshot_enabled() gate checks scattered across hot helpers
- After: wrapper entry gate check once + env pass-down
- Reduction: fewer gate branches + fewer loads + less code/layout churn
Expected outcome: +5-8% throughput (aligned with Phase 19-1 Design Candidate B estimate)
Recommended approach: Option A (Entry-Point Snapshot)
- Clear API, type-safe context passing
- Preserves Box Theory (NULL context → fallback)
- Gradual migration (3 sub-phases)
- Benchmark-compatible (bench_profile refresh works)
Risk: MEDIUM (API changes, ENV invalidation handling)
Effort: 8-12 hours (implementation) + 6-8 hours (testing)
Timeline: 2 weeks (3 sub-phases + A/B validation)
Next steps:
- Phase 19-3a done (UNLIKELY hint removal, GO +4.42%)
- Phase 19-3b done (env pass-down to hot helpers, including legacy + metadata helpers; GO +2.76% mean on Mixed 10-run)
- Optional: implement Phase 19-3c (wrapper-entry pass-down to remove the remaining alloc-side gate)
- Final A/B test
- Graduate if GO (MIXED_TINYV3_C7_SAFE preset already keeps HAKMEM_ENV_SNAPSHOT=1)
This positions Phase 19-3 as a high-ROI, medium-risk optimization with clear measurement criteria and rollback strategy.