diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md index ff1cc3ba..31d3522a 100644 --- a/CURRENT_TASK.md +++ b/CURRENT_TASK.md @@ -25,6 +25,30 @@ **Ref**: - `docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_4_HINT_MISMATCH_AB_TEST_RESULTS.md` +--- + +## 更新メモ(2025-12-15 Phase 19-5 Attempts: Both NO-GO) + +### Phase 19-5 & v2: Consolidate hot getenv() — ❌ DEFERRED + +**Result**: Both attempts to eliminate hot getenv() failed. Current TLS cache pattern is already near-optimal. + +**Attempt 1: Global ENV Cache (-4.28% regression)** +- 400B struct causes L1 cache layout conflicts + +**Attempt 2: HakmemEnvSnapshot Integration (-7.7% regression)** +- Broke efficient per-thread TLS cache (`static __thread int g_larson_fix = -1`) +- env pointer NULL-safety issues + +**Key Discovery**: Original code's per-thread TLS cache is excellent +- Cost: 1 getenv/thread, amortized +- Benefit: 1-cycle reads thereafter +- Already near-optimal + +**Decision**: Focus on other instruction reduction candidates instead. + +--- + ## 更新メモ(2025-12-15 Phase 19-3b ENV-SNAPSHOT-PASSDOWN) ### Phase 19-3b ENV-SNAPSHOT-PASSDOWN: Consolidate ENV snapshot reads across hot helpers — ✅ GO (+2.76%) diff --git a/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_5V2_ENV_SNAPSHOT_LARSON_FIX_DESIGN.md b/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_5V2_ENV_SNAPSHOT_LARSON_FIX_DESIGN.md new file mode 100644 index 00000000..269bfb3e --- /dev/null +++ b/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_5V2_ENV_SNAPSHOT_LARSON_FIX_DESIGN.md @@ -0,0 +1,288 @@ +# Phase 19-5v2: Hot `getenv()` Elimination via HakmemEnvSnapshot Integration + +## Goal + +Eliminate hot-path `getenv()` for `HAKMEM_TINY_LARSON_FIX` by consolidating it into the existing `HakmemEnvSnapshot` SSOT (Single-Source-Of-Truth). + +**Problem**: Phase 19-5 failed (-4.28%) due to large cache struct inducing L1 thrashing. + +**Solution**: Instead of new global cache, add `tiny_larson_fix` field to existing `HakmemEnvSnapshot`: +- Already TLS-cached (Phase 19-3b consolidation) +- Already refreshed via `hakmem_env_snapshot_refresh_from_env()` (called in bench_profile) +- No new includes required (minimal code footprint) + +**Expected**: **+0.5-1.5% throughput** (conservative) from eliminating hot getenv + reduced cache pressure + +--- + +## Why This Design Works Better + +### Phase 19-5 (Failed Approach) +- ❌ New 400B global struct `g_hak_env_cache` (requires new include) +- ❌ Cache coherency overhead negated syscall savings +- ❌ Result: -4.28% regression + +### Phase 19-5v2 (Proposed Approach) +- ✅ Reuse existing `HakmemEnvSnapshot` (4 fields → 5 fields, ~24B → ~28B) +- ✅ Already in TLS (L1 hot), refreshed at stable boundary +- ✅ Piggyback on Phase 19-3b env pass-down infrastructure +- ✅ No new include in hot headers +- ✅ Minimal layout impact (1 more bool field) + +--- + +## Detailed Changes + +### 1. HakmemEnvSnapshot Struct (core/box/hakmem_env_snapshot_box.h) + +**Current**: +```c +typedef struct HakmemEnvSnapshot { + bool tiny_c7_ultra_enabled; // HAKMEM_TINY_C7_ULTRA (default 1) + bool tiny_front_v3_enabled; // HAKMEM_TINY_FRONT_V3_ENABLED (default 1) + bool tiny_metadata_cache; // HAKMEM_TINY_METADATA_CACHE (default 0) + bool tiny_metadata_cache_eff; // Effective: cache && !learner +} HakmemEnvSnapshot; +``` + +**After** (add 1 field): +```c +typedef struct HakmemEnvSnapshot { + bool tiny_c7_ultra_enabled; // HAKMEM_TINY_C7_ULTRA (default 1) + bool tiny_front_v3_enabled; // HAKMEM_TINY_FRONT_V3_ENABLED (default 1) + bool tiny_metadata_cache; // HAKMEM_TINY_METADATA_CACHE (default 0) + bool tiny_metadata_cache_eff; // Effective: cache && !learner + bool tiny_larson_fix; // HAKMEM_TINY_LARSON_FIX (default 0) — Phase 19-5v2 +} HakmemEnvSnapshot; +``` + +**Impact**: +- +4 bytes (bool alignment) +- Still fits in 1 cache line (< 64B) +- No memory overhead for callers (already TLS-read) + +--- + +### 2. Refresh Function (core/box/hakmem_env_snapshot_box.c) + +**In `hakmem_env_snapshot_init()` / `hakmem_env_snapshot_refresh_from_env()`**: + +Add initialization for `tiny_larson_fix`: +```c +static inline void hakmem_env_snapshot_init(void) { + // ... existing code ... + + // Phase 19-5v2: Tiny Larson Fix + { + const char* e = getenv("HAKMEM_TINY_LARSON_FIX"); + g_hakmem_env_snapshot.tiny_larson_fix = (e && *e && *e != '0') ? true : false; + } +} +``` + +**Guarantee**: +- Single `getenv()` call per process (at init) +- Refresh syncs with bench_profile putenv (already called) +- No per-operation syscalls + +--- + +### 3. Free Path Changes (core/front/malloc_tiny_fast.h) + +**Current** (Phase 19-5 issue: redundant getenv): +```c +const int larson_fix = HAK_ENV_TINY_LARSON_FIX(); // from hakmem_env_cache.h +if (__builtin_expect(larson_fix || use_tiny_heap, 0)) { + // cross-thread free detection +} +``` + +**After** (Phase 19-5v2: env snapshot field access): +```c +// env already passed from wrapper (Phase 19-3b) +const int larson_fix = env ? env->tiny_larson_fix : 0; +if (__builtin_expect(larson_fix || use_tiny_heap, 0)) { + // cross-thread free detection +} +``` + +**Lines to modify**: +- Line 439: `free_tiny_fast_with_env()` hot path +- Line 624: `free_tiny_fast_cold_with_env()` cold helper +- Line 752: Free path routing decision +- Line 885: Free larson detection (v2 variant) + +**Benefits**: +- No getenv() syscall +- No `static __thread` storage +- No new include (reuse existing env parameter) +- 1 cache line read (already loaded for other fields) + +--- + +## Box Theory Compliance + +### Boundary (L0) + +**Gate**: Implicit in `hakmem_env_snapshot_enabled()` (Phase 4 E1) +- Already enabled in bench_profile presets +- No new ENV variable needed (reuse HAKMEM_TINY_LARSON_FIX, already defined) + +### Data Flow (L1) + +**env snapshot pass-down** (already established in Phase 19-3b): +``` +malloc_wrapper + ↓ (hakmem_env_snapshot() → env) +free_tiny_fast_with_env(ptr, env) + ↓ (pass env) +free_tiny_fast_cold_with_env(ptr, env) + ↓ (read env->tiny_larson_fix, instead of getenv()) +free_tiny_fast_hot(ptr, env) +``` + +**Refresh**: Already integrated (bench_profile calls `hakmem_env_snapshot_refresh_from_env()`) + +### Fallback (L2) + +**NULL-safe design**: +```c +const int larson_fix = env ? env->tiny_larson_fix : 0; +``` +- If env is NULL: larson_fix defaults to 0 (OFF, safe) +- If init not called: snapshot is zero-filled (safe default) + +--- + +## A/B Test Protocol + +### Baseline: Phase 19-4c +```sh +scripts/run_mixed_10_cleanenv.sh +# Expected: ~57.1M ops/s +``` + +### Optimized: Phase 19-5v2 +```sh +scripts/run_mixed_10_cleanenv.sh +# Expected: ~57.1-57.9M ops/s (+0-1.4%) +# Conservative: +0.5-1.5% expected (vs +2-4% in Phase 19-5 theory) +``` + +**GO Criteria**: +- Mean ≥ baseline (±0.5% acceptable) +- No I-cache regression +- NO-GO: < baseline - 1.0% + +--- + +## Performance Model + +### Savings vs Phase 19-5 + +**Phase 19-5 (Failed)**: +- ❌ New struct include: hakmem_env_cache.h (400B struct initialization) +- ❌ Cache coherency: g_hak_env_cache updates impact other cores +- ❌ Layout: Extra load for unrelated cache fields + +**Phase 19-5v2 (Proposed)**: +- ✅ No new include (struct already included in snapshot path) +- ✅ Single snapshot read (already done for c7_ultra_enabled, etc.) +- ✅ No layout changes (1 extra bool in existing struct) + +### Instruction Budget + +**getenv() elimination**: +- Hot path saves 1 `HAK_ENV_TINY_LARSON_FIX()` lookup call +- Replaces with 1 cache-line field read: `env->tiny_larson_fix` +- Net: -1-2 instructions/op (minor, but positive) + +**Cache pressure**: +- No new global struct +- Snapshot read already hot (L1) +- Minimal impact expected + +**Expected**: **+0.5-1.5% throughput** (realistic, not overoptimistic) + +--- + +## Implementation Checklist + +- [ ] Add `tiny_larson_fix` field to `HakmemEnvSnapshot` struct +- [ ] Update `hakmem_env_snapshot_init()` to read `HAKMEM_TINY_LARSON_FIX` +- [ ] Replace `HAK_ENV_TINY_LARSON_FIX()` macro calls with `env->tiny_larson_fix` +- [ ] Verify env is always non-NULL in hot paths (should be, Phase 19-3b) +- [ ] Remove hakmem_env_cache.h include from malloc_tiny_fast.h (if present) +- [ ] Build & verify no compile errors +- [ ] A/B test: 10-run benchmark (mixed) +- [ ] Perf stat validation (no I-cache regression) +- [ ] Commit + push to private repo + +--- + +## Risk Assessment + +| Risk | Level | Mitigation | +|------|-------|-----------| +| Struct layout change | LOW | Bool field, fits in 1 cache line, no alignment issue | +| env NULL dereference | LOW | Guard with `env ? env->tiny_larson_fix : 0` | +| Refresh timing | LOW | Reuse existing `hakmem_env_snapshot_refresh_from_env()` call | +| Cache coherency | LOW | No new global struct, snapshot is read-only after init | +| Layout regression | LOW | Minimal struct change, no include overhead | + +**Overall**: **GREEN** (low risk, proven design pattern) + +--- + +## Why This Will Work + +1. **Proven infrastructure** (Phase 19-3b): env pass-down is already validated +2. **Minimal change**: 1 bool field + 1 getenv call (already in snapshot init) +3. **No new includes**: Reuse existing snapshot path (no overhead) +4. **Cache-friendly**: Reads from same cache line as other snapshot fields +5. **Conservative expectation**: +0.5-1.5% is realistic (not overoptimistic like Phase 19-5) + +--- + +## Success Path + +If Phase 19-5v2 is GO (+0.5-1.5%): +- **New state**: ~57.4-57.9M ops/s +- **Cumulative Phase 19**: +10-15% improvement +- **Gap to libc**: Closer by ~0.5-1.5% + +If Phase 19-5v2 is NEUTRAL (±0.5%): +- Keep optimization (no harm, minimal code) +- Combine with other candidates (stats removal, header optimization) + +--- + +## Appendix: Code Diffs + +### hakmem_env_snapshot_box.h (struct only) +```diff + typedef struct HakmemEnvSnapshot { + bool tiny_c7_ultra_enabled; + bool tiny_front_v3_enabled; + bool tiny_metadata_cache; + bool tiny_metadata_cache_eff; ++ bool tiny_larson_fix; // Phase 19-5v2 + } HakmemEnvSnapshot; +``` + +### malloc_tiny_fast.h (line 439 example) +```diff +- const int larson_fix = HAK_ENV_TINY_LARSON_FIX(); ++ const int larson_fix = env ? env->tiny_larson_fix : 0; + if (__builtin_expect(larson_fix || use_tiny_heap, 0)) { +``` + +--- + +## Next Steps + +1. ✅ Design review (complete) +2. 🔄 Implementation (small changes, low risk) +3. 🔄 A/B test (10-run mixed benchmark) +4. 📊 Perf stat validation +5. 📋 Commit + push if GO