# Phase 41: ASM-First Gate Audit and Optimization - Results
**Date**: 2025-12-16
**Baseline**: FAST v3 = 55.97M ops/s (mean), 56.03M ops/s (median)
**Target**: +0.5% (56.25M+ ops/s, i.e. 55.97M × 1.005) for GO
**Result**: **NO-GO** (-2.02% regression)
---
## Methodology: ASM-First Approach
Following Phase 40's lesson (where `tiny_header_mode()` was already optimized away by Phase 21), this phase implemented a strict **ASM inspection FIRST** methodology:
1. **Baseline measurement** before any code changes
2. **ASM inspection** to verify gates actually exist in assembly
3. **Optimization only if gates found** in hot paths
4. **Incremental testing** with proper A/B comparison
---
## Step 0: Baseline Measurement
**Command**: `make perf_fast`
### 10-Run Results (Baseline):
```
Run 1: 56.62M ops/s
Run 2: 55.62M ops/s
Run 3: 56.62M ops/s
Run 4: 56.62M ops/s
Run 5: 55.79M ops/s
Run 6: 55.42M ops/s
Run 7: 55.89M ops/s
Run 8: 56.16M ops/s
Run 9: 54.79M ops/s
Run 10: 56.17M ops/s
```
**Baseline Statistics**:
- **Mean**: 55.97M ops/s
- **Median**: 56.03M ops/s
- **Range**: 54.79M - 56.62M ops/s
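
The statistics above can be re-derived from the ten listed runs. The following is a minimal sketch, not part of hakmem: the names `runs_mean`, `runs_median`, and `baseline_runs` are hypothetical helpers introduced only for illustration. Calling `runs_mean(baseline_runs, 10)` gives about 55.97 and `runs_median(baseline_runs, 10)` gives about 56.03, matching the values reported above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles, ascending order. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples (n must be > 0). */
double runs_mean(const double *v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum / (double)n;
}

/* Median of n samples; sorts a private copy so the caller's order is kept. */
double runs_median(const double *v, size_t n) {
    assert(n > 0);
    double *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    memcpy(tmp, v, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double m = (n % 2) ? tmp[n / 2]
                       : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
    free(tmp);
    return m;
}

/* The ten baseline runs listed above, in M ops/s. */
const double baseline_runs[10] = {
    56.62, 55.62, 56.62, 56.62, 55.79,
    55.42, 55.89, 56.16, 54.79, 56.17
};
```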
---
## Step 1: ASM Inspection Results
### Target Gates (from Phase 40 preparation):
1. `mid_v3_enabled()` in `core/box/mid_hotbox_v3_env_box.h`
2. `mid_v3_debug_enabled()` in `core/box/mid_hotbox_v3_env_box.h`
### Inspection Command:
```bash
objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"
```
### Findings:
#### `mid_v3_debug_enabled()`: ✅ **FOUND in assembly**
- **Call count**: 19+ occurrences in disassembly
- **Function location**: `0x10630 <mid_v3_debug_enabled.lto_priv.0>`
- **Call sites identified** (line numbers refer to the `grep -n` output over the disassembly, not to source lines):
  - Line 685: `call 10630 <mid_v3_debug_enabled.lto_priv.0>`
  - Line 705: `call 10630 <mid_v3_debug_enabled.lto_priv.0>`
  - Line 933: `call 10630 <mid_v3_debug_enabled.lto_priv.0>`
  - Lines 9378, 9385, 9403, 10890, 31533, 31540, 31554, 31867, 32748, 32755, 32774, 33047, etc.
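#### `mid_v3_enabled()`: ❌ **NOT FOUND in assembly**
- **No standalone symbol**: the `static inline` function was most likely inlined into its callers, so its name no longer appears in the disassembly
- MID v3 is OFF by default (`g_enable` resolves to `0` on the first env lookup), so the guarded blocks are never entered at runtime. Because the value comes from `getenv()`, the compiler cannot fold the guard to a constant in this build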
### Call Site Analysis:
**Source locations of `mid_v3_debug_enabled()`**:
1. **Alloc path** (`core/box/hak_alloc_api.inc.h`):
   - Line 84: Inside `if (mid_v3_enabled() && size >= 257 && size <= 768)` block
   - Line 95: Inside same block, after class selection
   - Line 106: Inside same block, after successful allocation
2. **Free path** (`core/box/hak_free_api.inc.h`):
   - Line 252: Inside `if (lk.kind == REGION_KIND_MID_V3)` block (SSOT path)
   - Line 273: Inside same block (legacy path)
3. **Mid-hotbox v3 implementation** (`core/mid_hotbox_v3.c`):
   - Multiple debug logging calls (lines 149, 158, 258, 270, 401, 423, 464, 507, 545)
### Key Insight:
`mid_v3_debug_enabled()` appears in the assembly as an out-of-line function, but every call to it sits INSIDE a block that is already guarded by `mid_v3_enabled()`. Since `mid_v3_enabled()` returns 0 (OFF by default), these debug gates are NEVER executed at runtime. The compiler still emits the calls, but they are dead at runtime.
**Pattern observed**:
```c
// In hot paths:
if (mid_v3_enabled() && ...) {              // Outer guard - false at runtime (default OFF)
    // ...
    if (mid_v3_debug_enabled() && ...) {    // Inner debug gate - still in ASM!
        fprintf(stderr, ...);
    }
    // ...
}
```
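The effect of the outer guard can be checked with a small stand-alone model. This is a sketch under stated assumptions, not hakmem code: `example_debug_gate`, `example_guarded_path`, and `g_debug_gate_calls` are hypothetical stand-ins, and the outer gate's runtime value is passed in as a plain argument. With `outer_enabled == 0`, the counter stays at zero even though the call to the inner gate is present in the compiled code. With `outer_enabled == 1` and a size in the 257..768 range, the counter increments once per call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counter: how often the inner debug gate was evaluated. */
static int g_debug_gate_calls = 0;

/* Stand-in for an inner debug gate; counts each runtime evaluation. */
static int example_debug_gate(void) {
    g_debug_gate_calls++;
    return 0; /* debug logging OFF */
}

/* Stand-in for the guarded path. `outer_enabled` plays the role of the
 * outer gate's runtime value. Returns 1 if the guarded block was entered. */
static int example_guarded_path(int outer_enabled, size_t size) {
    if (outer_enabled && size >= 257 && size <= 768) {
        if (example_debug_gate()) {
            /* debug logging would go here */
        }
        return 1;
    }
    return 0;
}
```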
---
## Step 2: Condition Reordering
**Status**: **SKIPPED** - Not applicable
**Reason**: All `mid_v3_debug_enabled()` calls are already inside `mid_v3_enabled()` guards. There are no opportunities for condition reordering to skip gate calls, because the outer gate (`mid_v3_enabled()`) is already at the top of the conditional chain.
---
## Step 3: BENCH_MINIMAL Constantization
### Implementation:
Modified `core/box/mid_hotbox_v3_env_box.h` to add compile-time constant returns for `HAKMEM_BENCH_MINIMAL`:
```c
#include <stdlib.h>  // getenv
#include "../hakmem_build_flags.h"

static inline int mid_v3_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    // Lazily cached env lookup: -1 means "not read yet"
    static int g_enable = -1;
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_ENABLED");
        if (e && *e) {
            g_enable = (*e != '0') ? 1 : 0;
        } else {
            g_enable = 0;  // default OFF
        }
    }
    return g_enable;
#endif
}

static inline int mid_v3_debug_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    // Lazily cached env lookup: -1 means "not read yet"
    static int g_debug = -1;
    if (__builtin_expect(g_debug == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_DEBUG");
        if (e && *e) {
            g_debug = (*e != '0') ? 1 : 0;
        } else {
            g_debug = 0;  // default OFF
        }
    }
    return g_debug;
#endif
}
```
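Both gates apply the same parse rule to the environment value in their non-minimal branch. The sketch below isolates that rule as a pure function so it can be checked without touching the environment. The name `example_parse_gate_value` is hypothetical and does not exist in hakmem. Note one consequence of the rule: only the first character is inspected, so `"0"` and `"01"` parse as OFF, while any other non-empty string, including `"false"`, parses as ON.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the parse rule used by both gates:
 *   NULL or ""          -> 0 (default OFF)
 *   first char is '0'   -> 0
 *   anything else       -> 1 */
static int example_parse_gate_value(const char *e) {
    if (e && *e) {
        return (*e != '0') ? 1 : 0;
    }
    return 0;
}
```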
### ASM Verification After Step 3:
```bash
objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"
```
**Result**: ✅ **Both gates ELIMINATED from assembly**
- No `mid_v3_debug_enabled` function in disassembly
- No call sites remaining
- Compiler successfully dead-code eliminated all MID v3 related code
### Performance Results (Step 3):
**Command**: `make perf_fast` (after Step 3 changes)
### 10-Run Results (Step 3):
```
Run 1: 54.60M ops/s
Run 2: 54.35M ops/s
Run 3: 54.11M ops/s
Run 4: 54.60M ops/s
Run 5: 54.84M ops/s
Run 6: 54.79M ops/s
Run 7: 54.53M ops/s
Run 8: 54.56M ops/s
Run 9: 55.96M ops/s
Run 10: 56.08M ops/s
```
**Step 3 Statistics**:
- **Mean**: 54.84M ops/s
- **Median**: 54.60M ops/s
- **Range**: 54.11M - 56.08M ops/s
### Comparison vs Baseline:
| Metric | Baseline | Step 3 | Delta | Percent |
|--------|----------|--------|-------|---------|
| Mean | 55.97M | 54.84M | -1.13M | **-2.02%** |
| Median | 56.03M | 54.60M | -1.43M | **-2.55%** |
**Verdict**: **NO-GO** (-2.02% regression)
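
The percentages in the table follow from a plain relative-change calculation. A minimal sketch, with hypothetical helper names `percent_delta` and `is_go` that are not taken from the hakmem tree: `percent_delta(55.97, 54.84)` evaluates to roughly -2.02 and `percent_delta(56.03, 54.60)` to roughly -2.55, so `is_go(55.97, 54.84, 0.5)` returns 0 against the +0.5% GO threshold.

```c
#include <assert.h>

/* Relative change of `candidate` versus `baseline`, in percent. */
static double percent_delta(double baseline, double candidate) {
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}

/* GO only if the measured gain meets the threshold (in percent). */
static int is_go(double baseline, double candidate, double min_gain_pct) {
    return percent_delta(baseline, candidate) >= min_gain_pct;
}
```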
---
## Root Cause Analysis: Layout Tax
### Why did constantization hurt performance?
**Hypothesis**: **Code layout tax** (same issue as Phase 40)
1. **Before Step 3**:
   - `mid_v3_debug_enabled()` exists as an out-of-line function; `mid_v3_enabled()` has no symbol of its own
   - Call sites reference `mid_v3_debug_enabled()`, but they are never executed (dead at runtime)
   - Hot path code layout is stable
2. **After Step 3**:
   - Both gates return compile-time constant `0`
   - Compiler inlines these and eliminates entire MID v3 blocks
   - Hot path code is re-laid out by compiler (different basic block arrangement)
   - **I-cache locality changes** → performance regression
### Precedent: Phase 40 Results
Phase 40 attempted to constantize `tiny_header_mode()`:
- **Result**: -2.47% regression
- **Cause**: Layout tax from code elimination
- **Lesson**: Removing already-optimized-away code can hurt more than help
### Why layout tax occurs:
Modern CPUs are extremely sensitive to:
- **Branch predictor state** (different code layout → different prediction patterns)
- **I-cache line alignment** (moving hot loops can cause cache line splits)
- **μop cache behavior** (LSD/DSB interactions change with layout)
- **TLB pressure** (code page mapping changes)
Even though we eliminated dead code, the **side effect of code relayout** outweighed the benefit of removing a few dead function calls.
---
## Final Decision: REVERT Step 3
**Action**: Reverted all changes to `core/box/mid_hotbox_v3_env_box.h`
```bash
git checkout core/box/mid_hotbox_v3_env_box.h
```
**Reason**: -2.02% regression is unacceptable for eliminating dead code that was never executed anyway.
---
## Lessons Learned
### 1. ASM-First Methodology Works
✅ Successfully identified that:
- `mid_v3_enabled()` was already optimized away
- `mid_v3_debug_enabled()` existed in ASM but was dead code at runtime (every call sits inside a `mid_v3_enabled()` guard, which is OFF by default)
### 2. Dead Code != Performance Impact
**Counterintuitive finding**: Removing dead code can **hurt** performance due to layout tax
- The dead `mid_v3_debug_enabled()` calls were never executed
- But removing them caused code relayout → -2.02% regression
- **Lesson**: Leave dead code alone if it's already not executed
### 3. Layout Tax is Real and Significant
Both Phase 40 and Phase 41 hit layout tax:
- Phase 40: `tiny_header_mode()` constantization → -2.47%
- Phase 41: `mid_v3_*()` constantization → -2.02%
**Pattern**: Structural changes to inline functions → unpredictable layout effects
### 4. When to Stop Optimizing
**Stop criteria**:
1. If gate is already optimized away in ASM → Don't touch it
2. If gate appears in ASM but is never executed → **Still don't touch it** (layout risk)
3. Only optimize if gate is **executed frequently** in hot paths
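
The three stop criteria reduce to a small decision function. This is an illustrative sketch only: the enum `gate_action_t` and the function `example_gate_decision` are hypothetical names, and the two inputs are assumed to come from ASM inspection and runtime profiling respectively.

```c
#include <assert.h>

typedef enum {
    GATE_LEAVE_ALONE = 0,
    GATE_OPTIMIZE = 1
} gate_action_t;

/* optimized_away:       gate no longer appears in the ASM (criterion 1)
 * executed_frequently:  gate runs often in hot paths at runtime (criterion 3)
 * A gate that is present in ASM but not executed falls under criterion 2. */
static gate_action_t example_gate_decision(int optimized_away,
                                           int executed_frequently) {
    if (optimized_away) {
        return GATE_LEAVE_ALONE;   /* criterion 1: don't touch it */
    }
    if (!executed_frequently) {
        return GATE_LEAVE_ALONE;   /* criterion 2: layout risk, no gain */
    }
    return GATE_OPTIMIZE;          /* criterion 3 */
}
```

Under this model the Phase 41 debug gate maps to `example_gate_decision(0, 0)`, which yields `GATE_LEAVE_ALONE`.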
### 5. ASM Inspection is Necessary but Not Sufficient
- ✅ ASM inspection told us which gates exist
- ❌ ASM inspection didn't tell us that they are never executed (their outer guard is OFF at runtime)
- **Need runtime profiling** (e.g., `perf record`) to confirm execution frequency
---
## Recommendations for Phase 42+
### 1. Add Runtime Profiling Step
**Before optimizing any gate**, use `perf` to verify it's actually executed:
```bash
# Profile hot functions
perf record -g -F 999 ./bench_random_mixed_hakmem_minimal
perf report --no-children --sort comm,dso,symbol
# Check if mid_v3_debug_enabled appears in profile
perf report | grep mid_v3
```
**Decision criteria**:
- If function appears in `perf report` → Worth optimizing
- If function is in ASM but NOT in `perf report` → Dead code, leave alone
### 2. Focus on Actually-Executed Gates
**Priority list** (requires profiling validation):
1. Gates that appear in `perf report` top 50 functions
2. Gates called in tight loops (identified by reading the surrounding instructions in `perf annotate`)
3. Gates with measurable CPU time (>0.1% in profile)
### 3. Accept Dead Code in ASM
**Philosophy shift**:
- Old: "If it's in ASM, optimize it"
- New: "If it's in ASM but not executed, ignore it"
Dead code that's never executed has **zero runtime cost**. Removing it risks layout tax.
### 4. Test Layout Stability
Before committing any structural change:
1. Run 3× 10-run benchmarks (baseline, change, revert-verify)
2. Check if results are reproducible
3. Accept only if gain is **≥1.0%** (to overcome layout noise)
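
The acceptance rule above can be written down as a check over the three benchmark means. A minimal sketch, assuming a hypothetical `noise_pct` tolerance for how closely the revert-verify run must reproduce the baseline. That tolerance, and the names `pct_change` and `accept_structural_change`, are illustrative assumptions and not part of the documented methodology.

```c
#include <assert.h>

/* Relative change from `from` to `to`, in percent. */
static double pct_change(double from, double to) {
    assert(from > 0.0);
    return (to - from) / from * 100.0;
}

/* Returns 1 only if:
 *  (a) the revert-verify mean reproduces the baseline mean within
 *      noise_pct percent (hypothetical tolerance), and
 *  (b) the change gains at least min_gain_pct percent over the baseline. */
static int accept_structural_change(double baseline_mean,
                                    double change_mean,
                                    double revert_mean,
                                    double min_gain_pct,
                                    double noise_pct) {
    double drift = pct_change(baseline_mean, revert_mean);
    if (drift < 0.0) {
        drift = -drift;
    }
    if (drift > noise_pct) {
        return 0;  /* results not reproducible, reject */
    }
    return pct_change(baseline_mean, change_mean) >= min_gain_pct;
}
```

With the Phase 41 means (baseline 55.97, change 54.84) the function rejects the change for any non-negative `min_gain_pct`, since the measured delta is negative.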
### 5. Alternative: Investigate Other Hot Gates
Instead of MID v3 gates (which are dead), profile to find:
- Tiny allocator gates that ARE executed
- Free path gates with measurable cost
- Size class routing decisions in hot paths
---
## Quantitative Summary
| Phase | Target Gate(s) | ASM Present? | Executed? | Change | Result | Verdict |
|-------|----------------|--------------|-----------|--------|--------|---------|
| Phase 21 | `tiny_header_mode()` | No (optimized away) | No | N/A | N/A | Skipped |
| Phase 40 | `tiny_header_mode()` | No | No | Constantization | -2.47% | NO-GO |
| Phase 41 Step 2 | `mid_v3_debug_enabled()` | Yes | No | Condition reorder | N/A | Skipped |
| Phase 41 Step 3 | `mid_v3_enabled()`, `mid_v3_debug_enabled()` | Yes (debug only) | **No** (dead code) | Constantization | **-2.02%** | **NO-GO** |
**Phase 41 Final Performance**: **55.97M ops/s** (baseline, no changes adopted)
---
## Conclusion
Phase 41 successfully demonstrated the **ASM-first gate audit methodology** and confirmed its value. However, it also revealed a critical limitation:
> **ASM presence ≠ Performance impact**
The gates we targeted (`mid_v3_debug_enabled()`) existed in assembly but were **dead code at runtime**, sitting inside `if (mid_v3_enabled())` guards that are OFF by default. Attempting to eliminate this dead code via BENCH_MINIMAL constantization caused a **-2.02% layout tax regression**.
**Key Takeaway**:
- ✅ ASM inspection prevents wasting time on already-optimized gates (like Phase 21's `tiny_header_mode()`)
- ❌ But ASM inspection alone is insufficient - need **runtime profiling** to distinguish executed vs. dead code
- ⚠️ **Layout tax is a first-class optimization enemy** - structural changes risk unpredictable regressions
**Phase 42 Direction**:
1. Add `perf record/report` step to methodology
2. Target only gates that appear in runtime profiles
3. Accept dead code in ASM as zero-cost (don't fix what isn't broken)
4. Require ≥1.0% gain to overcome layout noise
**Phase 41 Verdict**: **NO-GO** - Revert all changes, baseline remains **FAST v3 = 55.97M ops/s**