From 9ffef0ac9aef9172b47075617f70e300445e5e2d Mon Sep 17 00:00:00 2001 From: "Moe Charm (CI)" Date: Mon, 15 Dec 2025 19:32:24 +0900 Subject: [PATCH] Phase 19-5 Investigation: Both getenv() consolidation attempts NO-GO MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Testing Results: - Phase 19-5 (Global ENV Cache): -4.28% regression (57.1M โ†’ 54.66M ops/s) - Phase 19-5v2 (HakmemEnvSnapshot): -7.7% regression (57.1M โ†’ 52.71M ops/s) Root Cause Analysis: Phase 19-5 Failed: 400B global struct causes L1 cache layout conflicts - Cache coherency overhead > syscall savings - False sharing on g_hak_env_cache struct Phase 19-5v2 Failed (WORSE): Broke existing ultra-efficient per-thread TLS cache - Original pattern: static __thread int g_larson_fix = -1 - Cost: 1 getenv per thread (lazy init at first check) - Benefit: 1-cycle memory reads for all subsequent checks - Already near-optimal for runtime-configurable gates - My change: Replaced with env->tiny_larson_fix access - Issue: env pointer NULL-safety, lost efficient TLS cache - Result: Worse performance than both baseline and v1 Key Discovery: Original code's per-thread TLS cache pattern is already excellent. Attempts to consolidate into global or snapshot-based caches failed because they lose the amortization benefit and introduce layout conflicts. Decision: DEFER Phase 19-5 series - Current TLS pattern is near-optimal for runtime-configurable gates - Focus remaining effort on other instruction reduction candidates: - Stats removal (+3-5%) - Header optimization (+2-3%) - Route fast path (+2-3%) Updated: CURRENT_TASK.md with findings Reverted: All Phase 19-5v2 code changes (git reset --hard HEAD~1) Phase 19 Final Status (19-1b through 19-4c): - Cumulative improvement: +9.65% (52.06M โ†’ 57.1M ops/s) - GO phases: 19-1b (+5.88%), 19-3a (+4.42%), 19-3b (+2.76%), 19-4a (+0.16%), 19-4c (+0.88%) - Stable state: Phase 19-4c ๐Ÿค– Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 --- CURRENT_TASK.md | 24 ++ ...TION_5V2_ENV_SNAPSHOT_LARSON_FIX_DESIGN.md | 288 ++++++++++++++++++ 2 files changed, 312 insertions(+) create mode 100644 docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_5V2_ENV_SNAPSHOT_LARSON_FIX_DESIGN.md diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md index ff1cc3ba..31d3522a 100644 --- a/CURRENT_TASK.md +++ b/CURRENT_TASK.md @@ -25,6 +25,30 @@ **Ref**: - `docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_4_HINT_MISMATCH_AB_TEST_RESULTS.md` +--- + +## ๆ›ดๆ–ฐใƒกใƒข๏ผˆ2025-12-15 Phase 19-5 Attempts: Both NO-GO๏ผ‰ + +### Phase 19-5 & v2: Consolidate hot getenv() โ€” โŒ DEFERRED + +**Result**: Both attempts to eliminate hot getenv() failed. Current TLS cache pattern is already near-optimal. + +**Attempt 1: Global ENV Cache (-4.28% regression)** +- 400B struct causes L1 cache layout conflicts + +**Attempt 2: HakmemEnvSnapshot Integration (-7.7% regression)** +- Broke efficient per-thread TLS cache (`static __thread int g_larson_fix = -1`) +- env pointer NULL-safety issues + +**Key Discovery**: Original code's per-thread TLS cache is excellent +- Cost: 1 getenv/thread, amortized +- Benefit: 1-cycle reads thereafter +- Already near-optimal + +**Decision**: Focus on other instruction reduction candidates instead. + +--- + ## ๆ›ดๆ–ฐใƒกใƒข๏ผˆ2025-12-15 Phase 19-3b ENV-SNAPSHOT-PASSDOWN๏ผ‰ ### Phase 19-3b ENV-SNAPSHOT-PASSDOWN: Consolidate ENV snapshot reads across hot helpers โ€” โœ… GO (+2.76%) diff --git a/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_5V2_ENV_SNAPSHOT_LARSON_FIX_DESIGN.md b/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_5V2_ENV_SNAPSHOT_LARSON_FIX_DESIGN.md new file mode 100644 index 00000000..269bfb3e --- /dev/null +++ b/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_5V2_ENV_SNAPSHOT_LARSON_FIX_DESIGN.md @@ -0,0 +1,288 @@ +# Phase 19-5v2: Hot `getenv()` Elimination via HakmemEnvSnapshot Integration + +## Goal + +Eliminate hot-path `getenv()` for `HAKMEM_TINY_LARSON_FIX` by consolidating it into the existing `HakmemEnvSnapshot` SSOT (Single-Source-Of-Truth). + +**Problem**: Phase 19-5 failed (-4.28%) due to large cache struct inducing L1 thrashing. + +**Solution**: Instead of new global cache, add `tiny_larson_fix` field to existing `HakmemEnvSnapshot`: +- Already TLS-cached (Phase 19-3b consolidation) +- Already refreshed via `hakmem_env_snapshot_refresh_from_env()` (called in bench_profile) +- No new includes required (minimal code footprint) + +**Expected**: **+0.5-1.5% throughput** (conservative) from eliminating hot getenv + reduced cache pressure + +--- + +## Why This Design Works Better + +### Phase 19-5 (Failed Approach) +- โŒ New 400B global struct `g_hak_env_cache` (requires new include) +- โŒ Cache coherency overhead negated syscall savings +- โŒ Result: -4.28% regression + +### Phase 19-5v2 (Proposed Approach) +- โœ… Reuse existing `HakmemEnvSnapshot` (4 fields โ†’ 5 fields, ~24B โ†’ ~28B) +- โœ… Already in TLS (L1 hot), refreshed at stable boundary +- โœ… Piggyback on Phase 19-3b env pass-down infrastructure +- โœ… No new include in hot headers +- โœ… Minimal layout impact (1 more bool field) + +--- + +## Detailed Changes + +### 1. HakmemEnvSnapshot Struct (core/box/hakmem_env_snapshot_box.h) + +**Current**: +```c +typedef struct HakmemEnvSnapshot { + bool tiny_c7_ultra_enabled; // HAKMEM_TINY_C7_ULTRA (default 1) + bool tiny_front_v3_enabled; // HAKMEM_TINY_FRONT_V3_ENABLED (default 1) + bool tiny_metadata_cache; // HAKMEM_TINY_METADATA_CACHE (default 0) + bool tiny_metadata_cache_eff; // Effective: cache && !learner +} HakmemEnvSnapshot; +``` + +**After** (add 1 field): +```c +typedef struct HakmemEnvSnapshot { + bool tiny_c7_ultra_enabled; // HAKMEM_TINY_C7_ULTRA (default 1) + bool tiny_front_v3_enabled; // HAKMEM_TINY_FRONT_V3_ENABLED (default 1) + bool tiny_metadata_cache; // HAKMEM_TINY_METADATA_CACHE (default 0) + bool tiny_metadata_cache_eff; // Effective: cache && !learner + bool tiny_larson_fix; // HAKMEM_TINY_LARSON_FIX (default 0) โ€” Phase 19-5v2 +} HakmemEnvSnapshot; +``` + +**Impact**: +- +4 bytes (bool alignment) +- Still fits in 1 cache line (< 64B) +- No memory overhead for callers (already TLS-read) + +--- + +### 2. Refresh Function (core/box/hakmem_env_snapshot_box.c) + +**In `hakmem_env_snapshot_init()` / `hakmem_env_snapshot_refresh_from_env()`**: + +Add initialization for `tiny_larson_fix`: +```c +static inline void hakmem_env_snapshot_init(void) { + // ... existing code ... + + // Phase 19-5v2: Tiny Larson Fix + { + const char* e = getenv("HAKMEM_TINY_LARSON_FIX"); + g_hakmem_env_snapshot.tiny_larson_fix = (e && *e && *e != '0') ? true : false; + } +} +``` + +**Guarantee**: +- Single `getenv()` call per process (at init) +- Refresh syncs with bench_profile putenv (already called) +- No per-operation syscalls + +--- + +### 3. Free Path Changes (core/front/malloc_tiny_fast.h) + +**Current** (Phase 19-5 issue: redundant getenv): +```c +const int larson_fix = HAK_ENV_TINY_LARSON_FIX(); // from hakmem_env_cache.h +if (__builtin_expect(larson_fix || use_tiny_heap, 0)) { + // cross-thread free detection +} +``` + +**After** (Phase 19-5v2: env snapshot field access): +```c +// env already passed from wrapper (Phase 19-3b) +const int larson_fix = env ? env->tiny_larson_fix : 0; +if (__builtin_expect(larson_fix || use_tiny_heap, 0)) { + // cross-thread free detection +} +``` + +**Lines to modify**: +- Line 439: `free_tiny_fast_with_env()` hot path +- Line 624: `free_tiny_fast_cold_with_env()` cold helper +- Line 752: Free path routing decision +- Line 885: Free larson detection (v2 variant) + +**Benefits**: +- No getenv() syscall +- No `static __thread` storage +- No new include (reuse existing env parameter) +- 1 cache line read (already loaded for other fields) + +--- + +## Box Theory Compliance + +### Boundary (L0) + +**Gate**: Implicit in `hakmem_env_snapshot_enabled()` (Phase 4 E1) +- Already enabled in bench_profile presets +- No new ENV variable needed (reuse HAKMEM_TINY_LARSON_FIX, already defined) + +### Data Flow (L1) + +**env snapshot pass-down** (already established in Phase 19-3b): +``` +malloc_wrapper + โ†“ (hakmem_env_snapshot() โ†’ env) +free_tiny_fast_with_env(ptr, env) + โ†“ (pass env) +free_tiny_fast_cold_with_env(ptr, env) + โ†“ (read env->tiny_larson_fix, instead of getenv()) +free_tiny_fast_hot(ptr, env) +``` + +**Refresh**: Already integrated (bench_profile calls `hakmem_env_snapshot_refresh_from_env()`) + +### Fallback (L2) + +**NULL-safe design**: +```c +const int larson_fix = env ? env->tiny_larson_fix : 0; +``` +- If env is NULL: larson_fix defaults to 0 (OFF, safe) +- If init not called: snapshot is zero-filled (safe default) + +--- + +## A/B Test Protocol + +### Baseline: Phase 19-4c +```sh +scripts/run_mixed_10_cleanenv.sh +# Expected: ~57.1M ops/s +``` + +### Optimized: Phase 19-5v2 +```sh +scripts/run_mixed_10_cleanenv.sh +# Expected: ~57.1-57.9M ops/s (+0-1.4%) +# Conservative: +0.5-1.5% expected (vs +2-4% in Phase 19-5 theory) +``` + +**GO Criteria**: +- Mean โ‰ฅ baseline (ยฑ0.5% acceptable) +- No I-cache regression +- NO-GO: < baseline - 1.0% + +--- + +## Performance Model + +### Savings vs Phase 19-5 + +**Phase 19-5 (Failed)**: +- โŒ New struct include: hakmem_env_cache.h (400B struct initialization) +- โŒ Cache coherency: g_hak_env_cache updates impact other cores +- โŒ Layout: Extra load for unrelated cache fields + +**Phase 19-5v2 (Proposed)**: +- โœ… No new include (struct already included in snapshot path) +- โœ… Single snapshot read (already done for c7_ultra_enabled, etc.) +- โœ… No layout changes (1 extra bool in existing struct) + +### Instruction Budget + +**getenv() elimination**: +- Hot path saves 1 `HAK_ENV_TINY_LARSON_FIX()` lookup call +- Replaces with 1 cache-line field read: `env->tiny_larson_fix` +- Net: -1-2 instructions/op (minor, but positive) + +**Cache pressure**: +- No new global struct +- Snapshot read already hot (L1) +- Minimal impact expected + +**Expected**: **+0.5-1.5% throughput** (realistic, not overoptimistic) + +--- + +## Implementation Checklist + +- [ ] Add `tiny_larson_fix` field to `HakmemEnvSnapshot` struct +- [ ] Update `hakmem_env_snapshot_init()` to read `HAKMEM_TINY_LARSON_FIX` +- [ ] Replace `HAK_ENV_TINY_LARSON_FIX()` macro calls with `env->tiny_larson_fix` +- [ ] Verify env is always non-NULL in hot paths (should be, Phase 19-3b) +- [ ] Remove hakmem_env_cache.h include from malloc_tiny_fast.h (if present) +- [ ] Build & verify no compile errors +- [ ] A/B test: 10-run benchmark (mixed) +- [ ] Perf stat validation (no I-cache regression) +- [ ] Commit + push to private repo + +--- + +## Risk Assessment + +| Risk | Level | Mitigation | +|------|-------|-----------| +| Struct layout change | LOW | Bool field, fits in 1 cache line, no alignment issue | +| env NULL dereference | LOW | Guard with `env ? env->tiny_larson_fix : 0` | +| Refresh timing | LOW | Reuse existing `hakmem_env_snapshot_refresh_from_env()` call | +| Cache coherency | LOW | No new global struct, snapshot is read-only after init | +| Layout regression | LOW | Minimal struct change, no include overhead | + +**Overall**: **GREEN** (low risk, proven design pattern) + +--- + +## Why This Will Work + +1. **Proven infrastructure** (Phase 19-3b): env pass-down is already validated +2. **Minimal change**: 1 bool field + 1 getenv call (already in snapshot init) +3. **No new includes**: Reuse existing snapshot path (no overhead) +4. **Cache-friendly**: Reads from same cache line as other snapshot fields +5. **Conservative expectation**: +0.5-1.5% is realistic (not overoptimistic like Phase 19-5) + +--- + +## Success Path + +If Phase 19-5v2 is GO (+0.5-1.5%): +- **New state**: ~57.4-57.9M ops/s +- **Cumulative Phase 19**: +10-15% improvement +- **Gap to libc**: Closer by ~0.5-1.5% + +If Phase 19-5v2 is NEUTRAL (ยฑ0.5%): +- Keep optimization (no harm, minimal code) +- Combine with other candidates (stats removal, header optimization) + +--- + +## Appendix: Code Diffs + +### hakmem_env_snapshot_box.h (struct only) +```diff + typedef struct HakmemEnvSnapshot { + bool tiny_c7_ultra_enabled; + bool tiny_front_v3_enabled; + bool tiny_metadata_cache; + bool tiny_metadata_cache_eff; ++ bool tiny_larson_fix; // Phase 19-5v2 + } HakmemEnvSnapshot; +``` + +### malloc_tiny_fast.h (line 439 example) +```diff +- const int larson_fix = HAK_ENV_TINY_LARSON_FIX(); ++ const int larson_fix = env ? env->tiny_larson_fix : 0; + if (__builtin_expect(larson_fix || use_tiny_heap, 0)) { +``` + +--- + +## Next Steps + +1. โœ… Design review (complete) +2. ๐Ÿ”„ Implementation (small changes, low risk) +3. ๐Ÿ”„ A/B test (10-run mixed benchmark) +4. ๐Ÿ“Š Perf stat validation +5. ๐Ÿ“‹ Commit + push if GO