Phase 19-5 Investigation: Both getenv() consolidation attempts NO-GO

Testing Results:
- Phase 19-5 (Global ENV Cache): -4.28% regression (57.1M → 54.66M ops/s)
- Phase 19-5v2 (HakmemEnvSnapshot): -7.7% regression (57.1M → 52.71M ops/s)

Root Cause Analysis:
Phase 19-5 Failed: 400B global struct causes L1 cache layout conflicts
- Cache coherency overhead > syscall savings
- False sharing on g_hak_env_cache struct

Phase 19-5v2 Failed (WORSE): Broke existing ultra-efficient per-thread TLS cache
- Original pattern: static __thread int g_larson_fix = -1
  - Cost: 1 getenv per thread (lazy init at first check)
  - Benefit: 1-cycle memory reads for all subsequent checks
  - Already near-optimal for runtime-configurable gates
- My change: Replaced with env->tiny_larson_fix access
  - Issue: env pointer NULL-safety, lost efficient TLS cache
  - Result: Worse performance than both baseline and v1

Key Discovery:
Original code's per-thread TLS cache pattern is already excellent.
Attempts to consolidate into global or snapshot-based caches failed
because they lose the amortization benefit and introduce layout conflicts.

Decision: DEFER Phase 19-5 series
- Current TLS pattern is near-optimal for runtime-configurable gates
- Focus remaining effort on other instruction reduction candidates:
  - Stats removal (+3-5%)
  - Header optimization (+2-3%)
  - Route fast path (+2-3%)

Updated: CURRENT_TASK.md with findings
Reverted: All Phase 19-5v2 code changes (git reset --hard HEAD~1)

Phase 19 Final Status (19-1b through 19-4c):
- Cumulative improvement: +9.65% (52.06M → 57.1M ops/s)
- GO phases: 19-1b (+5.88%), 19-3a (+4.42%), 19-3b (+2.76%), 19-4a (+0.16%), 19-4c (+0.88%)
- Stable state: Phase 19-4c

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
This commit is contained in:
Moe Charm (CI)
2025-12-15 19:32:24 +09:00
parent 97b6748255
commit 9ffef0ac9a
2 changed files with 312 additions and 0 deletions

View File

@ -25,6 +25,30 @@
**Ref**: **Ref**:
- `docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_4_HINT_MISMATCH_AB_TEST_RESULTS.md` - `docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_4_HINT_MISMATCH_AB_TEST_RESULTS.md`
---
## 更新メモ2025-12-15 Phase 19-5 Attempts: Both NO-GO
### Phase 19-5 & v2: Consolidate hot getenv() — ❌ DEFERRED
**Result**: Both attempts to eliminate hot getenv() failed. Current TLS cache pattern is already near-optimal.
**Attempt 1: Global ENV Cache (-4.28% regression)**
- 400B struct causes L1 cache layout conflicts
**Attempt 2: HakmemEnvSnapshot Integration (-7.7% regression)**
- Broke efficient per-thread TLS cache (`static __thread int g_larson_fix = -1`)
- env pointer NULL-safety issues
**Key Discovery**: Original code's per-thread TLS cache is excellent
- Cost: 1 getenv/thread, amortized
- Benefit: 1-cycle reads thereafter
- Already near-optimal
**Decision**: Focus on other instruction reduction candidates instead.
---
## 更新メモ2025-12-15 Phase 19-3b ENV-SNAPSHOT-PASSDOWN ## 更新メモ2025-12-15 Phase 19-3b ENV-SNAPSHOT-PASSDOWN
### Phase 19-3b ENV-SNAPSHOT-PASSDOWN: Consolidate ENV snapshot reads across hot helpers — ✅ GO (+2.76%) ### Phase 19-3b ENV-SNAPSHOT-PASSDOWN: Consolidate ENV snapshot reads across hot helpers — ✅ GO (+2.76%)

View File

@ -0,0 +1,288 @@
# Phase 19-5v2: Hot `getenv()` Elimination via HakmemEnvSnapshot Integration
## Goal
Eliminate hot-path `getenv()` for `HAKMEM_TINY_LARSON_FIX` by consolidating it into the existing `HakmemEnvSnapshot` SSOT (Single-Source-Of-Truth).
**Problem**: Phase 19-5 failed (-4.28%) due to large cache struct inducing L1 thrashing.
**Solution**: Instead of new global cache, add `tiny_larson_fix` field to existing `HakmemEnvSnapshot`:
- Already TLS-cached (Phase 19-3b consolidation)
- Already refreshed via `hakmem_env_snapshot_refresh_from_env()` (called in bench_profile)
- No new includes required (minimal code footprint)
**Expected**: **+0.5-1.5% throughput** (conservative) from eliminating hot getenv + reduced cache pressure
---
## Why This Design Works Better
### Phase 19-5 (Failed Approach)
- ❌ New 400B global struct `g_hak_env_cache` (requires new include)
- ❌ Cache coherency overhead negated syscall savings
- ❌ Result: -4.28% regression
### Phase 19-5v2 (Proposed Approach)
- ✅ Reuse existing `HakmemEnvSnapshot` (4 fields → 5 fields, ~24B → ~28B)
- ✅ Already in TLS (L1 hot), refreshed at stable boundary
- ✅ Piggyback on Phase 19-3b env pass-down infrastructure
- ✅ No new include in hot headers
- ✅ Minimal layout impact (1 more bool field)
---
## Detailed Changes
### 1. HakmemEnvSnapshot Struct (core/box/hakmem_env_snapshot_box.h)
**Current**:
```c
typedef struct HakmemEnvSnapshot {
bool tiny_c7_ultra_enabled; // HAKMEM_TINY_C7_ULTRA (default 1)
bool tiny_front_v3_enabled; // HAKMEM_TINY_FRONT_V3_ENABLED (default 1)
bool tiny_metadata_cache; // HAKMEM_TINY_METADATA_CACHE (default 0)
bool tiny_metadata_cache_eff; // Effective: cache && !learner
} HakmemEnvSnapshot;
```
**After** (add 1 field):
```c
typedef struct HakmemEnvSnapshot {
bool tiny_c7_ultra_enabled; // HAKMEM_TINY_C7_ULTRA (default 1)
bool tiny_front_v3_enabled; // HAKMEM_TINY_FRONT_V3_ENABLED (default 1)
bool tiny_metadata_cache; // HAKMEM_TINY_METADATA_CACHE (default 0)
bool tiny_metadata_cache_eff; // Effective: cache && !learner
bool tiny_larson_fix; // HAKMEM_TINY_LARSON_FIX (default 0) — Phase 19-5v2
} HakmemEnvSnapshot;
```
**Impact**:
- +4 bytes (bool alignment)
- Still fits in 1 cache line (< 64B)
- No memory overhead for callers (already TLS-read)
---
### 2. Refresh Function (core/box/hakmem_env_snapshot_box.c)
**In `hakmem_env_snapshot_init()` / `hakmem_env_snapshot_refresh_from_env()`**:
Add initialization for `tiny_larson_fix`:
```c
static inline void hakmem_env_snapshot_init(void) {
// ... existing code ...
// Phase 19-5v2: Tiny Larson Fix
{
const char* e = getenv("HAKMEM_TINY_LARSON_FIX");
g_hakmem_env_snapshot.tiny_larson_fix = (e && *e && *e != '0') ? true : false;
}
}
```
**Guarantee**:
- Single `getenv()` call per process (at init)
- Refresh syncs with bench_profile putenv (already called)
- No per-operation syscalls
---
### 3. Free Path Changes (core/front/malloc_tiny_fast.h)
**Current** (Phase 19-5 issue: redundant getenv):
```c
const int larson_fix = HAK_ENV_TINY_LARSON_FIX(); // from hakmem_env_cache.h
if (__builtin_expect(larson_fix || use_tiny_heap, 0)) {
// cross-thread free detection
}
```
**After** (Phase 19-5v2: env snapshot field access):
```c
// env already passed from wrapper (Phase 19-3b)
const int larson_fix = env ? env->tiny_larson_fix : 0;
if (__builtin_expect(larson_fix || use_tiny_heap, 0)) {
// cross-thread free detection
}
```
**Lines to modify**:
- Line 439: `free_tiny_fast_with_env()` hot path
- Line 624: `free_tiny_fast_cold_with_env()` cold helper
- Line 752: Free path routing decision
- Line 885: Free larson detection (v2 variant)
**Benefits**:
- No getenv() syscall
- No `static __thread` storage
- No new include (reuse existing env parameter)
- 1 cache line read (already loaded for other fields)
---
## Box Theory Compliance
### Boundary (L0)
**Gate**: Implicit in `hakmem_env_snapshot_enabled()` (Phase 4 E1)
- Already enabled in bench_profile presets
- No new ENV variable needed (reuse HAKMEM_TINY_LARSON_FIX, already defined)
### Data Flow (L1)
**env snapshot pass-down** (already established in Phase 19-3b):
```
malloc_wrapper
↓ (hakmem_env_snapshot() → env)
free_tiny_fast_with_env(ptr, env)
↓ (pass env)
free_tiny_fast_cold_with_env(ptr, env)
↓ (read env->tiny_larson_fix, instead of getenv())
free_tiny_fast_hot(ptr, env)
```
**Refresh**: Already integrated (bench_profile calls `hakmem_env_snapshot_refresh_from_env()`)
### Fallback (L2)
**NULL-safe design**:
```c
const int larson_fix = env ? env->tiny_larson_fix : 0;
```
- If env is NULL: larson_fix defaults to 0 (OFF, safe)
- If init not called: snapshot is zero-filled (safe default)
---
## A/B Test Protocol
### Baseline: Phase 19-4c
```sh
scripts/run_mixed_10_cleanenv.sh
# Expected: ~57.1M ops/s
```
### Optimized: Phase 19-5v2
```sh
scripts/run_mixed_10_cleanenv.sh
# Expected: ~57.1-57.9M ops/s (+0-1.4%)
# Conservative: +0.5-1.5% expected (vs +2-4% in Phase 19-5 theory)
```
**GO Criteria**:
- Mean baseline 0.5% acceptable)
- No I-cache regression
- NO-GO: < baseline - 1.0%
---
## Performance Model
### Savings vs Phase 19-5
**Phase 19-5 (Failed)**:
- New struct include: hakmem_env_cache.h (400B struct initialization)
- Cache coherency: g_hak_env_cache updates impact other cores
- Layout: Extra load for unrelated cache fields
**Phase 19-5v2 (Proposed)**:
- No new include (struct already included in snapshot path)
- Single snapshot read (already done for c7_ultra_enabled, etc.)
- No layout changes (1 extra bool in existing struct)
### Instruction Budget
**getenv() elimination**:
- Hot path saves 1 `HAK_ENV_TINY_LARSON_FIX()` lookup call
- Replaces with 1 cache-line field read: `env->tiny_larson_fix`
- Net: -1-2 instructions/op (minor, but positive)
**Cache pressure**:
- No new global struct
- Snapshot read already hot (L1)
- Minimal impact expected
**Expected**: **+0.5-1.5% throughput** (realistic, not overoptimistic)
---
## Implementation Checklist
- [ ] Add `tiny_larson_fix` field to `HakmemEnvSnapshot` struct
- [ ] Update `hakmem_env_snapshot_init()` to read `HAKMEM_TINY_LARSON_FIX`
- [ ] Replace `HAK_ENV_TINY_LARSON_FIX()` macro calls with `env->tiny_larson_fix`
- [ ] Verify env is always non-NULL in hot paths (should be, Phase 19-3b)
- [ ] Remove hakmem_env_cache.h include from malloc_tiny_fast.h (if present)
- [ ] Build & verify no compile errors
- [ ] A/B test: 10-run benchmark (mixed)
- [ ] Perf stat validation (no I-cache regression)
- [ ] Commit + push to private repo
---
## Risk Assessment
| Risk | Level | Mitigation |
|------|-------|-----------|
| Struct layout change | LOW | Bool field, fits in 1 cache line, no alignment issue |
| env NULL dereference | LOW | Guard with `env ? env->tiny_larson_fix : 0` |
| Refresh timing | LOW | Reuse existing `hakmem_env_snapshot_refresh_from_env()` call |
| Cache coherency | LOW | No new global struct, snapshot is read-only after init |
| Layout regression | LOW | Minimal struct change, no include overhead |
**Overall**: **GREEN** (low risk, proven design pattern)
---
## Why This Will Work
1. **Proven infrastructure** (Phase 19-3b): env pass-down is already validated
2. **Minimal change**: 1 bool field + 1 getenv call (already in snapshot init)
3. **No new includes**: Reuse existing snapshot path (no overhead)
4. **Cache-friendly**: Reads from same cache line as other snapshot fields
5. **Conservative expectation**: +0.5-1.5% is realistic (not overoptimistic like Phase 19-5)
---
## Success Path
If Phase 19-5v2 is GO (+0.5-1.5%):
- **New state**: ~57.4-57.9M ops/s
- **Cumulative Phase 19**: +10-15% improvement
- **Gap to libc**: Closer by ~0.5-1.5%
If Phase 19-5v2 is NEUTRAL 0.5%):
- Keep optimization (no harm, minimal code)
- Combine with other candidates (stats removal, header optimization)
---
## Appendix: Code Diffs
### hakmem_env_snapshot_box.h (struct only)
```diff
typedef struct HakmemEnvSnapshot {
bool tiny_c7_ultra_enabled;
bool tiny_front_v3_enabled;
bool tiny_metadata_cache;
bool tiny_metadata_cache_eff;
+ bool tiny_larson_fix; // Phase 19-5v2
} HakmemEnvSnapshot;
```
### malloc_tiny_fast.h (line 439 example)
```diff
- const int larson_fix = HAK_ENV_TINY_LARSON_FIX();
+ const int larson_fix = env ? env->tiny_larson_fix : 0;
if (__builtin_expect(larson_fix || use_tiny_heap, 0)) {
```
---
## Next Steps
1. Design review (complete)
2. 🔄 Implementation (small changes, low risk)
3. 🔄 A/B test (10-run mixed benchmark)
4. 📊 Perf stat validation
5. 📋 Commit + push if GO