Phase 19-5 Investigation: Both getenv() consolidation attempts NO-GO
Testing Results: - Phase 19-5 (Global ENV Cache): -4.28% regression (57.1M → 54.66M ops/s) - Phase 19-5v2 (HakmemEnvSnapshot): -7.7% regression (57.1M → 52.71M ops/s) Root Cause Analysis: Phase 19-5 Failed: 400B global struct causes L1 cache layout conflicts - Cache coherency overhead > syscall savings - False sharing on g_hak_env_cache struct Phase 19-5v2 Failed (WORSE): Broke existing ultra-efficient per-thread TLS cache - Original pattern: static __thread int g_larson_fix = -1 - Cost: 1 getenv per thread (lazy init at first check) - Benefit: 1-cycle memory reads for all subsequent checks - Already near-optimal for runtime-configurable gates - My change: Replaced with env->tiny_larson_fix access - Issue: env pointer NULL-safety, lost efficient TLS cache - Result: Worse performance than both baseline and v1 Key Discovery: Original code's per-thread TLS cache pattern is already excellent. Attempts to consolidate into global or snapshot-based caches failed because they lose the amortization benefit and introduce layout conflicts. Decision: DEFER Phase 19-5 series - Current TLS pattern is near-optimal for runtime-configurable gates - Focus remaining effort on other instruction reduction candidates: - Stats removal (+3-5%) - Header optimization (+2-3%) - Route fast path (+2-3%) Updated: CURRENT_TASK.md with findings Reverted: All Phase 19-5v2 code changes (git reset --hard HEAD~1) Phase 19 Final Status (19-1b through 19-4c): - Cumulative improvement: +9.65% (52.06M → 57.1M ops/s) - GO phases: 19-1b (+5.88%), 19-3a (+4.42%), 19-3b (+2.76%), 19-4a (+0.16%), 19-4c (+0.88%) - Stable state: Phase 19-4c 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
This commit is contained in:
@ -25,6 +25,30 @@
|
||||
**Ref**:
|
||||
- `docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_4_HINT_MISMATCH_AB_TEST_RESULTS.md`
|
||||
|
||||
---
|
||||
|
||||
## 更新メモ(2025-12-15 Phase 19-5 Attempts: Both NO-GO)
|
||||
|
||||
### Phase 19-5 & v2: Consolidate hot getenv() — ❌ DEFERRED
|
||||
|
||||
**Result**: Both attempts to eliminate hot getenv() failed. Current TLS cache pattern is already near-optimal.
|
||||
|
||||
**Attempt 1: Global ENV Cache (-4.28% regression)**
|
||||
- 400B struct causes L1 cache layout conflicts
|
||||
|
||||
**Attempt 2: HakmemEnvSnapshot Integration (-7.7% regression)**
|
||||
- Broke efficient per-thread TLS cache (`static __thread int g_larson_fix = -1`)
|
||||
- env pointer NULL-safety issues
|
||||
|
||||
**Key Discovery**: Original code's per-thread TLS cache is excellent
|
||||
- Cost: 1 getenv/thread, amortized
|
||||
- Benefit: 1-cycle reads thereafter
|
||||
- Already near-optimal
|
||||
|
||||
**Decision**: Focus on other instruction reduction candidates instead.
|
||||
|
||||
---
|
||||
|
||||
## 更新メモ(2025-12-15 Phase 19-3b ENV-SNAPSHOT-PASSDOWN)
|
||||
|
||||
### Phase 19-3b ENV-SNAPSHOT-PASSDOWN: Consolidate ENV snapshot reads across hot helpers — ✅ GO (+2.76%)
|
||||
|
||||
@ -0,0 +1,288 @@
|
||||
# Phase 19-5v2: Hot `getenv()` Elimination via HakmemEnvSnapshot Integration
|
||||
|
||||
## Goal
|
||||
|
||||
Eliminate hot-path `getenv()` for `HAKMEM_TINY_LARSON_FIX` by consolidating it into the existing `HakmemEnvSnapshot` SSOT (Single-Source-Of-Truth).
|
||||
|
||||
**Problem**: Phase 19-5 failed (-4.28%) due to large cache struct inducing L1 thrashing.
|
||||
|
||||
**Solution**: Instead of new global cache, add `tiny_larson_fix` field to existing `HakmemEnvSnapshot`:
|
||||
- Already TLS-cached (Phase 19-3b consolidation)
|
||||
- Already refreshed via `hakmem_env_snapshot_refresh_from_env()` (called in bench_profile)
|
||||
- No new includes required (minimal code footprint)
|
||||
|
||||
**Expected**: **+0.5-1.5% throughput** (conservative) from eliminating hot getenv + reduced cache pressure
|
||||
|
||||
---
|
||||
|
||||
## Why This Design Works Better
|
||||
|
||||
### Phase 19-5 (Failed Approach)
|
||||
- ❌ New 400B global struct `g_hak_env_cache` (requires new include)
|
||||
- ❌ Cache coherency overhead negated syscall savings
|
||||
- ❌ Result: -4.28% regression
|
||||
|
||||
### Phase 19-5v2 (Proposed Approach)
|
||||
- ✅ Reuse existing `HakmemEnvSnapshot` (4 fields → 5 fields, ~24B → ~28B)
|
||||
- ✅ Already in TLS (L1 hot), refreshed at stable boundary
|
||||
- ✅ Piggyback on Phase 19-3b env pass-down infrastructure
|
||||
- ✅ No new include in hot headers
|
||||
- ✅ Minimal layout impact (1 more bool field)
|
||||
|
||||
---
|
||||
|
||||
## Detailed Changes
|
||||
|
||||
### 1. HakmemEnvSnapshot Struct (core/box/hakmem_env_snapshot_box.h)
|
||||
|
||||
**Current**:
|
||||
```c
|
||||
typedef struct HakmemEnvSnapshot {
|
||||
bool tiny_c7_ultra_enabled; // HAKMEM_TINY_C7_ULTRA (default 1)
|
||||
bool tiny_front_v3_enabled; // HAKMEM_TINY_FRONT_V3_ENABLED (default 1)
|
||||
bool tiny_metadata_cache; // HAKMEM_TINY_METADATA_CACHE (default 0)
|
||||
bool tiny_metadata_cache_eff; // Effective: cache && !learner
|
||||
} HakmemEnvSnapshot;
|
||||
```
|
||||
|
||||
**After** (add 1 field):
|
||||
```c
|
||||
typedef struct HakmemEnvSnapshot {
|
||||
bool tiny_c7_ultra_enabled; // HAKMEM_TINY_C7_ULTRA (default 1)
|
||||
bool tiny_front_v3_enabled; // HAKMEM_TINY_FRONT_V3_ENABLED (default 1)
|
||||
bool tiny_metadata_cache; // HAKMEM_TINY_METADATA_CACHE (default 0)
|
||||
bool tiny_metadata_cache_eff; // Effective: cache && !learner
|
||||
bool tiny_larson_fix; // HAKMEM_TINY_LARSON_FIX (default 0) — Phase 19-5v2
|
||||
} HakmemEnvSnapshot;
|
||||
```
|
||||
|
||||
**Impact**:
|
||||
- +4 bytes (bool alignment)
|
||||
- Still fits in 1 cache line (< 64B)
|
||||
- No memory overhead for callers (already TLS-read)
|
||||
|
||||
---
|
||||
|
||||
### 2. Refresh Function (core/box/hakmem_env_snapshot_box.c)
|
||||
|
||||
**In `hakmem_env_snapshot_init()` / `hakmem_env_snapshot_refresh_from_env()`**:
|
||||
|
||||
Add initialization for `tiny_larson_fix`:
|
||||
```c
|
||||
static inline void hakmem_env_snapshot_init(void) {
|
||||
// ... existing code ...
|
||||
|
||||
// Phase 19-5v2: Tiny Larson Fix
|
||||
{
|
||||
const char* e = getenv("HAKMEM_TINY_LARSON_FIX");
|
||||
g_hakmem_env_snapshot.tiny_larson_fix = (e && *e && *e != '0') ? true : false;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Guarantee**:
|
||||
- Single `getenv()` call per process (at init)
|
||||
- Refresh syncs with bench_profile putenv (already called)
|
||||
- No per-operation syscalls
|
||||
|
||||
---
|
||||
|
||||
### 3. Free Path Changes (core/front/malloc_tiny_fast.h)
|
||||
|
||||
**Current** (Phase 19-5 issue: redundant getenv):
|
||||
```c
|
||||
const int larson_fix = HAK_ENV_TINY_LARSON_FIX(); // from hakmem_env_cache.h
|
||||
if (__builtin_expect(larson_fix || use_tiny_heap, 0)) {
|
||||
// cross-thread free detection
|
||||
}
|
||||
```
|
||||
|
||||
**After** (Phase 19-5v2: env snapshot field access):
|
||||
```c
|
||||
// env already passed from wrapper (Phase 19-3b)
|
||||
const int larson_fix = env ? env->tiny_larson_fix : 0;
|
||||
if (__builtin_expect(larson_fix || use_tiny_heap, 0)) {
|
||||
// cross-thread free detection
|
||||
}
|
||||
```
|
||||
|
||||
**Lines to modify**:
|
||||
- Line 439: `free_tiny_fast_with_env()` hot path
|
||||
- Line 624: `free_tiny_fast_cold_with_env()` cold helper
|
||||
- Line 752: Free path routing decision
|
||||
- Line 885: Free larson detection (v2 variant)
|
||||
|
||||
**Benefits**:
|
||||
- No getenv() syscall
|
||||
- No `static __thread` storage
|
||||
- No new include (reuse existing env parameter)
|
||||
- 1 cache line read (already loaded for other fields)
|
||||
|
||||
---
|
||||
|
||||
## Box Theory Compliance
|
||||
|
||||
### Boundary (L0)
|
||||
|
||||
**Gate**: Implicit in `hakmem_env_snapshot_enabled()` (Phase 4 E1)
|
||||
- Already enabled in bench_profile presets
|
||||
- No new ENV variable needed (reuse HAKMEM_TINY_LARSON_FIX, already defined)
|
||||
|
||||
### Data Flow (L1)
|
||||
|
||||
**env snapshot pass-down** (already established in Phase 19-3b):
|
||||
```
|
||||
malloc_wrapper
|
||||
↓ (hakmem_env_snapshot() → env)
|
||||
free_tiny_fast_with_env(ptr, env)
|
||||
↓ (pass env)
|
||||
free_tiny_fast_cold_with_env(ptr, env)
|
||||
↓ (read env->tiny_larson_fix, instead of getenv())
|
||||
free_tiny_fast_hot(ptr, env)
|
||||
```
|
||||
|
||||
**Refresh**: Already integrated (bench_profile calls `hakmem_env_snapshot_refresh_from_env()`)
|
||||
|
||||
### Fallback (L2)
|
||||
|
||||
**NULL-safe design**:
|
||||
```c
|
||||
const int larson_fix = env ? env->tiny_larson_fix : 0;
|
||||
```
|
||||
- If env is NULL: larson_fix defaults to 0 (OFF, safe)
|
||||
- If init not called: snapshot is zero-filled (safe default)
|
||||
|
||||
---
|
||||
|
||||
## A/B Test Protocol
|
||||
|
||||
### Baseline: Phase 19-4c
|
||||
```sh
|
||||
scripts/run_mixed_10_cleanenv.sh
|
||||
# Expected: ~57.1M ops/s
|
||||
```
|
||||
|
||||
### Optimized: Phase 19-5v2
|
||||
```sh
|
||||
scripts/run_mixed_10_cleanenv.sh
|
||||
# Expected: ~57.1-57.9M ops/s (+0-1.4%)
|
||||
# Conservative: +0.5-1.5% expected (vs +2-4% in Phase 19-5 theory)
|
||||
```
|
||||
|
||||
**GO Criteria**:
|
||||
- Mean ≥ baseline (±0.5% acceptable)
|
||||
- No I-cache regression
|
||||
- NO-GO: < baseline - 1.0%
|
||||
|
||||
---
|
||||
|
||||
## Performance Model
|
||||
|
||||
### Savings vs Phase 19-5
|
||||
|
||||
**Phase 19-5 (Failed)**:
|
||||
- ❌ New struct include: hakmem_env_cache.h (400B struct initialization)
|
||||
- ❌ Cache coherency: g_hak_env_cache updates impact other cores
|
||||
- ❌ Layout: Extra load for unrelated cache fields
|
||||
|
||||
**Phase 19-5v2 (Proposed)**:
|
||||
- ✅ No new include (struct already included in snapshot path)
|
||||
- ✅ Single snapshot read (already done for c7_ultra_enabled, etc.)
|
||||
- ✅ No layout changes (1 extra bool in existing struct)
|
||||
|
||||
### Instruction Budget
|
||||
|
||||
**getenv() elimination**:
|
||||
- Hot path saves 1 `HAK_ENV_TINY_LARSON_FIX()` lookup call
|
||||
- Replaces with 1 cache-line field read: `env->tiny_larson_fix`
|
||||
- Net: -1-2 instructions/op (minor, but positive)
|
||||
|
||||
**Cache pressure**:
|
||||
- No new global struct
|
||||
- Snapshot read already hot (L1)
|
||||
- Minimal impact expected
|
||||
|
||||
**Expected**: **+0.5-1.5% throughput** (realistic, not overoptimistic)
|
||||
|
||||
---
|
||||
|
||||
## Implementation Checklist
|
||||
|
||||
- [ ] Add `tiny_larson_fix` field to `HakmemEnvSnapshot` struct
|
||||
- [ ] Update `hakmem_env_snapshot_init()` to read `HAKMEM_TINY_LARSON_FIX`
|
||||
- [ ] Replace `HAK_ENV_TINY_LARSON_FIX()` macro calls with `env->tiny_larson_fix`
|
||||
- [ ] Verify env is always non-NULL in hot paths (should be, Phase 19-3b)
|
||||
- [ ] Remove hakmem_env_cache.h include from malloc_tiny_fast.h (if present)
|
||||
- [ ] Build & verify no compile errors
|
||||
- [ ] A/B test: 10-run benchmark (mixed)
|
||||
- [ ] Perf stat validation (no I-cache regression)
|
||||
- [ ] Commit + push to private repo
|
||||
|
||||
---
|
||||
|
||||
## Risk Assessment
|
||||
|
||||
| Risk | Level | Mitigation |
|
||||
|------|-------|-----------|
|
||||
| Struct layout change | LOW | Bool field, fits in 1 cache line, no alignment issue |
|
||||
| env NULL dereference | LOW | Guard with `env ? env->tiny_larson_fix : 0` |
|
||||
| Refresh timing | LOW | Reuse existing `hakmem_env_snapshot_refresh_from_env()` call |
|
||||
| Cache coherency | LOW | No new global struct, snapshot is read-only after init |
|
||||
| Layout regression | LOW | Minimal struct change, no include overhead |
|
||||
|
||||
**Overall**: **GREEN** (low risk, proven design pattern)
|
||||
|
||||
---
|
||||
|
||||
## Why This Will Work
|
||||
|
||||
1. **Proven infrastructure** (Phase 19-3b): env pass-down is already validated
|
||||
2. **Minimal change**: 1 bool field + 1 getenv call (already in snapshot init)
|
||||
3. **No new includes**: Reuse existing snapshot path (no overhead)
|
||||
4. **Cache-friendly**: Reads from same cache line as other snapshot fields
|
||||
5. **Conservative expectation**: +0.5-1.5% is realistic (not overoptimistic like Phase 19-5)
|
||||
|
||||
---
|
||||
|
||||
## Success Path
|
||||
|
||||
If Phase 19-5v2 is GO (+0.5-1.5%):
|
||||
- **New state**: ~57.4-57.9M ops/s
|
||||
- **Cumulative Phase 19**: +10-15% improvement
|
||||
- **Gap to libc**: Closer by ~0.5-1.5%
|
||||
|
||||
If Phase 19-5v2 is NEUTRAL (±0.5%):
|
||||
- Keep optimization (no harm, minimal code)
|
||||
- Combine with other candidates (stats removal, header optimization)
|
||||
|
||||
---
|
||||
|
||||
## Appendix: Code Diffs
|
||||
|
||||
### hakmem_env_snapshot_box.h (struct only)
|
||||
```diff
|
||||
typedef struct HakmemEnvSnapshot {
|
||||
bool tiny_c7_ultra_enabled;
|
||||
bool tiny_front_v3_enabled;
|
||||
bool tiny_metadata_cache;
|
||||
bool tiny_metadata_cache_eff;
|
||||
+ bool tiny_larson_fix; // Phase 19-5v2
|
||||
} HakmemEnvSnapshot;
|
||||
```
|
||||
|
||||
### malloc_tiny_fast.h (line 439 example)
|
||||
```diff
|
||||
- const int larson_fix = HAK_ENV_TINY_LARSON_FIX();
|
||||
+ const int larson_fix = env ? env->tiny_larson_fix : 0;
|
||||
if (__builtin_expect(larson_fix || use_tiny_heap, 0)) {
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. ✅ Design review (complete)
|
||||
2. 🔄 Implementation (small changes, low risk)
|
||||
3. 🔄 A/B test (10-run mixed benchmark)
|
||||
4. 📊 Perf stat validation
|
||||
5. 📋 Commit + push if GO
|
||||
Reference in New Issue
Block a user