# Phase 41: ASM-First Gate Audit and Optimization - Results

**Date**: 2025-12-16
**Baseline**: FAST v3 = 55.97M ops/s (mean), 56.03M ops/s (median)
**Target**: +0.5% (56.32M+ ops/s) for GO
**Result**: **NO-GO** (-2.02% regression)

---

## Methodology: ASM-First Approach

Following Phase 40's lesson (where `tiny_header_mode()` was already optimized away by Phase 21), this phase implemented a strict **ASM inspection FIRST** methodology:

1. **Baseline measurement** before any code changes
2. **ASM inspection** to verify gates actually exist in assembly
3. **Optimization only if gates found** in hot paths
4. **Incremental testing** with proper A/B comparison

---

## Step 0: Baseline Measurement

**Command**: `make perf_fast`

### 10-Run Results (Baseline):
```
Run 1: 56.62M ops/s
Run 2: 55.62M ops/s
Run 3: 56.62M ops/s
Run 4: 56.62M ops/s
Run 5: 55.79M ops/s
Run 6: 55.42M ops/s
Run 7: 55.89M ops/s
Run 8: 56.16M ops/s
Run 9: 54.79M ops/s
Run 10: 56.17M ops/s
```

**Baseline Statistics**:
- **Mean**: 55.97M ops/s
- **Median**: 56.03M ops/s
- **Range**: 54.79M - 56.62M ops/s

---

## Step 1: ASM Inspection Results

### Target Gates (from Phase 40 preparation):
1. `mid_v3_enabled()` in `core/box/mid_hotbox_v3_env_box.h`
2. `mid_v3_debug_enabled()` in `core/box/mid_hotbox_v3_env_box.h`

### Inspection Command:
```bash
objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"
```

### Findings:

#### `mid_v3_debug_enabled()`: ✅ **FOUND in assembly**
- **Call count**: 19+ occurrences in disassembly
- **Function location**: `0x10630 <mid_v3_debug_enabled.lto_priv.0>`
- **Call sites identified** (line numbers refer to the `grep -n` output over the disassembly, not to source files):
  - Line 685: `call 10630 <mid_v3_debug_enabled.lto_priv.0>`
  - Line 705: `call 10630 <mid_v3_debug_enabled.lto_priv.0>`
  - Line 933: `call 10630 <mid_v3_debug_enabled.lto_priv.0>`
  - Lines 9378, 9385, 9403, 10890, 31533, 31540, 31554, 31867, 32748, 32755, 32774, 33047, etc.

#### `mid_v3_enabled()`: ❌ **NOT FOUND in assembly**
- **Already optimized away** by the compiler (likely inlined and dead-code eliminated)
- MID v3 is OFF by default (`g_enable = 0`), so the compiler eliminated entire blocks

### Call Site Analysis:

**Source locations of `mid_v3_debug_enabled()`**:

1. **Alloc path** (`core/box/hak_alloc_api.inc.h`):
   - Line 84: Inside `if (mid_v3_enabled() && size >= 257 && size <= 768)` block
   - Line 95: Inside same block, after class selection
   - Line 106: Inside same block, after successful allocation

2. **Free path** (`core/box/hak_free_api.inc.h`):
   - Line 252: Inside `if (lk.kind == REGION_KIND_MID_V3)` block (SSOT path)
   - Line 273: Inside same block (legacy path)

3. **Mid-hotbox v3 implementation** (`core/mid_hotbox_v3.c`):
   - Multiple debug logging calls (lines 149, 158, 258, 270, 401, 423, 464, 507, 545)

### Key Insight:

`mid_v3_debug_enabled()` appears in the assembly because its call sites sit INSIDE blocks that are already guarded by `mid_v3_enabled()`. Since `mid_v3_enabled()` returns 0 (OFF by default), these debug gates are NEVER reached at runtime. The compiler still emits the function calls, which remain in the binary as dead code.

**Pattern observed**:
```c
// In hot paths:
if (mid_v3_enabled() && ...) {  // Outer guard - optimized to "if (0)"
    // ...
    if (mid_v3_debug_enabled() && ...) {  // Inner debug gate - still in ASM!
        fprintf(stderr, ...);
    }
    // ...
}
```

---

## Step 2: Condition Reordering

**Status**: **SKIPPED** - Not applicable

**Reason**: All `mid_v3_debug_enabled()` calls are already inside `mid_v3_enabled()` guards. There are no opportunities for condition reordering to skip gate calls, because the outer gate (`mid_v3_enabled()`) is already at the top of the conditional chain.

---

## Step 3: BENCH_MINIMAL Constantization

### Implementation:

Modified `core/box/mid_hotbox_v3_env_box.h` to add compile-time constant returns for `HAKMEM_BENCH_MINIMAL`:

```c
#include <stdlib.h>  // getenv
#include "../hakmem_build_flags.h"

static inline int mid_v3_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → always OFF (research box)
    return 0;
#else
    static int g_enable = -1;  // -1 = not yet read from the environment
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_ENABLED");
        if (e && *e) {
            g_enable = (*e != '0') ? 1 : 0;
        } else {
            g_enable = 0;  // default OFF
        }
    }
    return g_enable;
#endif
}

static inline int mid_v3_debug_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → always OFF (research box)
    return 0;
#else
    static int g_debug = -1;  // -1 = not yet read from the environment
    if (__builtin_expect(g_debug == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_DEBUG");
        if (e && *e) {
            g_debug = (*e != '0') ? 1 : 0;
        } else {
            g_debug = 0;
        }
    }
    return g_debug;
#endif
}
```

### ASM Verification After Step 3:

```bash
objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"
```

**Result**: ✅ **Both gates ELIMINATED from assembly**
- No `mid_v3_debug_enabled` function in disassembly
- No call sites remaining
- Compiler successfully dead-code eliminated all MID v3 related code

### Performance Results (Step 3):

**Command**: `make perf_fast` (after Step 3 changes)

### 10-Run Results (Step 3):
```
Run 1: 54.60M ops/s
Run 2: 54.35M ops/s
Run 3: 54.11M ops/s
Run 4: 54.60M ops/s
Run 5: 54.84M ops/s
Run 6: 54.79M ops/s
Run 7: 54.53M ops/s
Run 8: 54.56M ops/s
Run 9: 55.96M ops/s
Run 10: 56.08M ops/s
```

**Step 3 Statistics**:
- **Mean**: 54.84M ops/s
- **Median**: 54.60M ops/s
- **Range**: 54.11M - 56.08M ops/s

### Comparison vs Baseline:

| Metric | Baseline | Step 3 | Delta | Percent |
|--------|----------|--------|-------|---------|
| Mean | 55.97M | 54.84M | -1.13M | **-2.02%** |
| Median | 56.03M | 54.60M | -1.43M | **-2.55%** |

**Verdict**: **NO-GO** (-2.02% regression)

---

## Root Cause Analysis: Layout Tax

### Why did constantization hurt performance?

**Hypothesis**: **Code layout tax** (same issue as Phase 40)

1. **Before Step 3**:
   - `mid_v3_debug_enabled()` exists as an outlined function (`mid_v3_debug_enabled.lto_priv.0`); `mid_v3_enabled()` has no symbol of its own (see Step 1)
   - Call sites reference the outlined function but are never executed (dead code)
   - Hot path code layout is stable

2. **After Step 3**:
   - Both gates return compile-time constant `0`
   - Compiler inlines these and eliminates entire MID v3 blocks
   - Hot path code is re-laid out by the compiler (different basic block arrangement)
   - **I-cache locality changes** → performance regression

### Precedent: Phase 40 Results

Phase 40 attempted to constantize `tiny_header_mode()`:
- **Result**: -2.47% regression
- **Cause**: Layout tax from code elimination
- **Lesson**: Removing already-optimized-away code can hurt more than help

### Why layout tax occurs:

Modern CPUs are extremely sensitive to:
- **Branch predictor state** (different code layout → different prediction patterns)
- **I-cache line alignment** (moving hot loops can cause cache line splits)
- **μop cache behavior** (LSD/DSB interactions change with layout)
- **TLB pressure** (code page mapping changes)

Even though we eliminated dead code, the **side effect of code relayout** outweighed the benefit of removing a few dead function calls.

---

## Final Decision: REVERT Step 3

**Action**: Reverted all changes to `core/box/mid_hotbox_v3_env_box.h`

```bash
git checkout core/box/mid_hotbox_v3_env_box.h
```

**Reason**: -2.02% regression is unacceptable for eliminating dead code that was never executed anyway.

---

## Lessons Learned

### 1. ASM-First Methodology Works

✅ Successfully identified that:
- `mid_v3_enabled()` was already optimized away
- `mid_v3_debug_enabled()` existed in ASM but was dead code (inside `if (0)` blocks)

### 2. Dead Code != Performance Impact

❌ **Counterintuitive finding**: Removing dead code can **hurt** performance due to layout tax

- The dead `mid_v3_debug_enabled()` calls were never executed
- But removing them caused code relayout → -2.02% regression
- **Lesson**: Leave dead code alone if it's already not executed

### 3. Layout Tax is Real and Significant

Both Phase 40 and Phase 41 hit layout tax:
- Phase 40: `tiny_header_mode()` constantization → -2.47%
- Phase 41: `mid_v3_*()` constantization → -2.02%

**Pattern**: Structural changes to inline functions → unpredictable layout effects

### 4. When to Stop Optimizing

**Stop criteria**:
1. If gate is already optimized away in ASM → Don't touch it
2. If gate appears in ASM but is never executed → **Still don't touch it** (layout risk)
3. Only optimize if gate is **executed frequently** in hot paths

### 5. ASM Inspection is Necessary but Not Sufficient

- ✅ ASM inspection told us gates exist
- ❌ ASM inspection didn't tell us they're dead code inside `if (0)` blocks
- ✅ **Need runtime profiling** (e.g., `perf record`) to confirm execution frequency

---

## Recommendations for Phase 42+

### 1. Add Runtime Profiling Step

**Before optimizing any gate**, use `perf` to verify it's actually executed:

```bash
# Profile hot functions
perf record -g -F 999 ./bench_random_mixed_hakmem_minimal
perf report --no-children --sort comm,dso,symbol

# Check if mid_v3_debug_enabled appears in profile
perf report | grep mid_v3
```

**Decision criteria**:
- If function appears in `perf report` → Worth optimizing
- If function is in ASM but NOT in `perf report` → Dead code, leave alone

### 2. Focus on Actually-Executed Gates

**Priority list** (requires profiling validation):
1. Gates that appear in `perf report` top 50 functions
2. Gates called in tight loops (identified via `perf annotate`)
3. Gates with measurable CPU time (>0.1% in profile)

### 3. Accept Dead Code in ASM

**Philosophy shift**:
- Old: "If it's in ASM, optimize it"
- New: "If it's in ASM but not executed, ignore it"

Dead code that's never executed has **near-zero runtime cost**. Removing it risks layout tax.

### 4. Test Layout Stability

Before committing any structural change:
1. Run 3× 10-run benchmarks (baseline, change, revert-verify)
2. Check if results are reproducible
3. Accept only if gain is **≥1.0%** (to overcome layout noise)

### 5. Alternative: Investigate Other Hot Gates

Instead of MID v3 gates (which are dead), profile to find:
- Tiny allocator gates that ARE executed
- Free path gates with measurable cost
- Size class routing decisions in hot paths

---

## Quantitative Summary

| Phase | Target Gate(s) | ASM Present? | Executed? | Change | Result | Verdict |
|-------|----------------|--------------|-----------|--------|--------|---------|
| Phase 21 | `tiny_header_mode()` | No (optimized away) | No | N/A | N/A | Skipped |
| Phase 40 | `tiny_header_mode()` | No | No | Constantization | -2.47% | NO-GO |
| Phase 41 Step 2 | Condition reorder | N/A | N/A | N/A | Skipped | N/A |
| Phase 41 Step 3 | `mid_v3_enabled()`, `mid_v3_debug_enabled()` | Yes (debug only) | **No** (dead code) | Constantization | **-2.02%** | **NO-GO** |

**Phase 41 Final Performance**: **55.97M ops/s** (baseline, no changes adopted)

---

## Conclusion

Phase 41 successfully demonstrated the **ASM-first gate audit methodology** and confirmed its value. However, it also revealed a critical limitation:

> **ASM presence ≠ Performance impact**

The gates we targeted (`mid_v3_debug_enabled()`) existed in assembly but were **dead code** inside `if (mid_v3_enabled())` guards that compile to `if (0)`. Attempting to eliminate this dead code via BENCH_MINIMAL constantization caused a **-2.02% layout tax regression**.

**Key Takeaway**:
- ✅ ASM inspection prevents wasting time on already-optimized gates (like Phase 21's `tiny_header_mode()`)
- ❌ But ASM inspection alone is insufficient - need **runtime profiling** to distinguish executed vs. dead code
- ⚠️ **Layout tax is a first-class optimization enemy** - structural changes risk unpredictable regressions

**Phase 42 Direction**:
1. Add `perf record/report` step to methodology
2. Target only gates that appear in runtime profiles
3. Accept dead code in ASM as zero-cost (don't fix what isn't broken)
4. Require ≥1.0% gain to overcome layout noise

**Phase 41 Verdict**: **NO-GO** - Revert all changes, baseline remains **FAST v3 = 55.97M ops/s**