# Phase 40: BENCH_MINIMAL Gate Constantization Results

**Date**: 2025-12-16
**Verdict**: **NO-GO (-2.47%)**
**Status**: Reverted

## Executive Summary

Phase 40 attempted to constantize `tiny_header_mode()` in BENCH_MINIMAL mode, following the pattern that succeeded in Phase 39 (+1.98%). A/B testing instead showed an unexpected **-2.47% regression**, which led to a NO-GO verdict and a full revert of the change.

## Hypothesis

Building on Phase 39's success with gate function constantization (+1.98%), Phase 40 targeted `tiny_header_mode()` as the next highest-impact candidate based on FAST v3 perf profiling:

- **Location**: `core/tiny_region_id.h:180-211`
- **Pattern**: Lazy-init with `static int g_header_mode = -1` + `getenv()`
- **Call site**: Hot path in `tiny_region_id_write_header()` (4.56% self-time)
- **Expected gain**: +0.3~0.8% (similar to Phase 39 targets)

## Implementation

### Change: tiny_header_mode() Constantization

**File**: `/mnt/workdisk/public_share/hakmem/core/tiny_region_id.h`

```c
static inline int tiny_header_mode(void)
{
#if HAKMEM_BENCH_MINIMAL
    // Phase 40: BENCH_MINIMAL → fixed to FULL (header write enabled)
    // Rationale: Eliminates lazy-init gate check in alloc hot path
    // Expected: +0.3~0.8% (TBD after A/B test)
    return TINY_HEADER_MODE_FULL;
#else
    static int g_header_mode = -1;
    if (__builtin_expect(g_header_mode == -1, 0))
    {
        const char* e = getenv("HAKMEM_TINY_HEADER_MODE");
        // ... [original lazy-init logic] ...
    }
    return g_header_mode;
#endif
}
```

**Rationale**:
- In BENCH_MINIMAL mode, always return the constant `TINY_HEADER_MODE_FULL` (0)
- Eliminates the branch and lazy-init overhead in the hot path
- Matches default benchmark behavior (FULL mode)

## A/B Test Results

### Test Configuration

- **Benchmark**: `bench_random_mixed_hakmem_minimal`
- **Test harness**: `scripts/run_mixed_10_cleanenv.sh`
- **Parameters**: `ITERS=20000000 WS=400`
- **Method**: Git stash A/B (baseline vs treatment)

### Baseline (FAST v3 without Phase 40)

```
Run 1/10: 56789069 ops/s
Run 2/10: 56274671 ops/s
Run 3/10: 56513942 ops/s
Run 4/10: 56133590 ops/s
Run 5/10: 56634961 ops/s
Run 6/10: 54943677 ops/s
Run 7/10: 57088883 ops/s
Run 8/10: 56337157 ops/s
Run 9/10: 55930637 ops/s
Run 10/10: 56590285 ops/s

Mean: 56,323,700 ops/s
```

### Treatment (FAST v4 with Phase 40)

```
Run 1/10: 54355307 ops/s
Run 2/10: 56936372 ops/s
Run 3/10: 54694629 ops/s
Run 4/10: 54504756 ops/s
Run 5/10: 55137468 ops/s
Run 6/10: 52434980 ops/s
Run 7/10: 52438841 ops/s
Run 8/10: 54966798 ops/s
Run 9/10: 56834583 ops/s
Run 10/10: 57034821 ops/s

Mean: 54,933,856 ops/s
```

### Delta Analysis

```
Baseline:  56,323,700 ops/s
Treatment: 54,933,856 ops/s
Delta:     -1,389,844 ops/s (-2.47%)

Verdict: NO-GO (threshold: -0.5% or worse)
```
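
The delta and verdict can be reproduced from the raw runs. The following is a minimal sketch, assuming the NO-GO rule is "delta of -0.5% or worse" as stated in the verdict line; the array and function names are illustrative only and are not part of the hakmem scripts.

```c
#include <assert.h>
#include <stddef.h>

/* Raw ops/s values copied from the baseline and treatment runs above. */
static const double BASELINE_OPS[10] = {
    56789069, 56274671, 56513942, 56133590, 56634961,
    54943677, 57088883, 56337157, 55930637, 56590285
};
static const double TREATMENT_OPS[10] = {
    54355307, 56936372, 54694629, 54504756, 55137468,
    52434980, 52438841, 54966798, 56834583, 57034821
};

/* Arithmetic mean of n runs. */
static double mean_ops(const double *runs, size_t n)
{
    assert(runs != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += runs[i];
    }
    return sum / (double)n;
}

/* Relative change of treatment versus baseline, in percent. */
static double delta_percent(double baseline, double treatment)
{
    assert(baseline > 0.0);
    return (treatment - baseline) / baseline * 100.0;
}

/* Assumed decision rule: a delta of -0.5% or worse is a NO-GO. */
static int is_no_go(double delta_pct)
{
    return delta_pct <= -0.5;
}
```

Feeding the two means into `delta_percent()` gives about -2.47%, which `is_no_go()` classifies as a NO-GO under the assumed threshold.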

## Root Cause Analysis

### Why did Phase 40 fail when Phase 39 succeeded?

#### 1. Code Layout Effects (Phase 22-2 Precedent)

The regression is likely caused by **compiler code layout changes** rather than by the logic change itself:

- **LTO reordering**: Adding the `#if HAKMEM_BENCH_MINIMAL` block changes function layout
- **Instruction cache**: Small layout changes can significantly affect icache hit rates
- **Branch prediction**: Moving code changes branch addresses, which affects CPU branch predictor state

**Evidence from Phase 22-2**:
- Physical code deletion caused a **-5.16% regression** despite removing only "dead" code
- Reason: Layout changes disrupted hot path alignment and icache behavior
- Lesson: "Deleting to speed up" is unreliable with LTO

#### 2. Hot Path Already Optimized

Unlike Phase 39 targets, `tiny_header_mode()` may already be effectively optimized:

**Phase 21 Hot/Cold Split**:
```c
// Phase 21: Hot/cold split for FULL mode (ENV-gated)
if (tiny_header_hotfull_enabled()) {
    int header_mode = tiny_header_mode();
    if (__builtin_expect(header_mode == TINY_HEADER_MODE_FULL, 1)) {
        // Hot path: straight-line code (no existing_header read, no guard call)
        uint8_t desired_header = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
        *header_ptr = desired_header;
        // ... fast path ...
        return user;
    }
    // Cold path
    return tiny_region_id_write_header_slow(base, class_idx, header_ptr);
}
```

**Key observation**:
- The hot path at line 349 calls `tiny_header_mode()` and checks for `TINY_HEADER_MODE_FULL`
- This call is already **once per allocation** and **highly predictable** (always FULL in benchmarks)
- The `__builtin_expect` hint ensures the FULL branch is predicted correctly
- Compiler may already be inlining and optimizing away the branch

**Phase 39 difference**:
- Phase 39 targeted gates called on **every path** without existing optimization
- Those gates had no Phase 21-style hot/cold split
- Constantization provided genuine branch elimination

#### 3. Snapshot Caching Interaction

The `TinyFrontV3Snapshot` mechanism caches the value of `tiny_header_mode()`:

```c
// core/box/tiny_front_v3_env_box.h:13
uint8_t header_mode; // caches the value of tiny_header_mode()

// core/hakmem_tiny.c:83
.header_mode = (uint8_t)tiny_header_mode(),
```

If most allocations use the cached value from the snapshot rather than calling `tiny_header_mode()` directly, constantizing the function provides minimal benefit while still incurring layout disruption costs.
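
The interaction can be illustrated with a minimal sketch. All `demo_` and `Demo` names below are hypothetical stand-ins, not the hakmem definitions, and the call counter exists only to make the effect visible. The sketch shows the gate being evaluated once when the snapshot is built, after which the per-allocation path reads only the cached field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mode values; only FULL == 0 is taken from the report. */
enum { DEMO_HEADER_MODE_FULL = 0, DEMO_HEADER_MODE_OTHER = 1 };

/* Counts how often the gate function is actually evaluated. */
static int g_demo_gate_calls = 0;

/* Stand-in for the gate; the real one lazily reads an environment variable. */
static int demo_header_mode(void)
{
    g_demo_gate_calls++;
    return DEMO_HEADER_MODE_FULL;
}

/* Stand-in for a snapshot struct that caches the gate result. */
typedef struct {
    uint8_t header_mode;
} DemoSnapshot;

/* Build the snapshot: this is the only place the gate is called. */
static DemoSnapshot demo_snapshot_init(void)
{
    DemoSnapshot snap = { .header_mode = (uint8_t)demo_header_mode() };
    return snap;
}

/* Simulated hot path: n allocations that consult only the cached field.
 * Returns how many of them took the FULL branch. */
static int demo_alloc_many(const DemoSnapshot *snap, int n)
{
    assert(snap != NULL);
    int full_count = 0;
    for (int i = 0; i < n; i++) {
        if (snap->header_mode == DEMO_HEADER_MODE_FULL) {
            full_count++;
        }
    }
    return full_count;
}
```

Calling `demo_snapshot_init()` once and then `demo_alloc_many(&snap, 1000)` leaves `g_demo_gate_calls` at 1, so in this model a cheaper gate body saves nothing on the per-allocation path.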

## Lessons Learned

### 1. Not All Gates Are Created Equal

**Phase 39 success criteria** (gates that benefit from constantization):
- Called on **every hot path** without optimization
- No existing hot/cold split or branch prediction hints
- No snapshot caching mechanism
- Examples: `g_alloc_front_gate_enabled`, `g_alloc_prewarm_enabled`

**Phase 40 failure indicators** (gates that DON'T benefit):
- Already optimized with hot/cold split (Phase 21)
- Protected by `__builtin_expect` branch hints
- Cached in snapshot structures
- Infrequently called (once per allocation vs once per operation)

### 2. Code Layout Tax Exceeds Logic Benefit

Even when the logic change is sound, layout disruption can dominate. The figures below are estimates: the logic benefit comes from the expected +0.3~0.8% gain, and the layout tax is inferred from the measured net result rather than measured separately.

```
Logic benefit:  ~0.5%  (eliminate branch + lazy-init)
Layout tax:     ~3.0%  (icache/alignment disruption)
Net result:     -2.47% (NO-GO)
```

### 3. Perf Profile Can Be Misleading

`tiny_region_id_write_header()` showed 4.56% self-time in perf, but:
- Most of that time is **actual header write work**, not gate overhead
- The `tiny_header_mode()` call is already optimized by the compiler
- Function-level self-time does not distinguish between "work" time and "gate" time

**Better heuristic**: Only constantize gates that:
1. Appear in perf with **high instruction count** (not just time)
2. Have visible `getenv()` calls in assembly
3. Lack existing optimization (no Phase 21-style split)

## Recommendation

**REVERT Phase 40 changes completely.**

### Alternative Approaches (Future Research)

If we still want to optimize `tiny_header_mode()`:

1. **Wait for Phase 21 BENCH_MINIMAL adoption** - Constantize `tiny_header_hotfull_enabled()` instead
   - Rationale: Eliminates entire hot/cold branch, not just mode check
   - Expected: +0.5~1% (higher leverage point)

2. **Profile-guided optimization** - Let compiler optimize based on runtime profile
   - Rationale: Avoid manual layout disruption
   - Method: `gcc -fprofile-generate` → run benchmark → `gcc -fprofile-use`

3. **Assembly inspection first** - Check if gate is actually compiled as branch
   - Method: `objdump -d bench_random_mixed_hakmem_minimal | grep -A20 tiny_header_mode`
   - If already optimized away → skip constantization

## Files Modified (REVERTED)

- `/mnt/workdisk/public_share/hakmem/core/tiny_region_id.h` (lines 180-218)

## Next Steps

1. **Revert all Phase 40 changes** via `git restore`
2. **Update CURRENT_TASK.md** - Mark Phase 40 as NO-GO with analysis
3. **Document in scorecard** - Add Phase 40 as research failure for future reference
4. **Re-evaluate gate candidates** - Use stricter criteria (see Lessons Learned #1)

## Appendix: Raw Test Data

### Baseline runs
```
56789069, 56274671, 56513942, 56133590, 56634961,
54943677, 57088883, 56337157, 55930637, 56590285
```

### Treatment runs
```
54355307, 56936372, 54694629, 54504756, 55137468,
52434980, 52438841, 54966798, 56834583, 57034821
```

### Variance Analysis

**Baseline**:
- Std dev: ~586K ops/s (1.04% CV)
- Range: 2.14M ops/s (54.9M - 57.1M)

**Treatment**:
- Std dev: ~1.67M ops/s (3.04% CV)
- Range: 4.60M ops/s (52.4M - 57.0M)

**Observation**: Treatment shows a sample standard deviation roughly **2.8x** that of baseline, suggesting layout instability.

---

**Conclusion**: Phase 40 is a clear NO-GO. Revert all changes and re-focus on gates without existing optimization.