hakmem/docs/analysis/PHASE41_ASM_FIRST_GATE_AUDIT_RESULTS.md
Moe Charm (CI) 7adbcdfcb6 Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement
## Summary

Completed Phase 54-60 optimization work:

**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset

**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY

**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc

**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized

## Key Metrics

- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes

## Files Added/Modified

New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h

Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py

Documentation: Phase 40-60 analysis documents

## Design Decisions

1. Profile separation (core/bench_profile.h):
   - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
   - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)

2. Box Theory compliance:
   - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
   - Single conversion points maintained
   - No physical deletions (compile-out only)

3. Lessons learned:
   - SSOT effective only where redundancy exists (Phase 60 showed limits)
   - Branch prediction extremely effective (~0 cycles for well-predicted branches)
   - Early-exit pattern valuable even when seemingly redundant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-17 06:24:01 +09:00


Phase 41: ASM-First Gate Audit and Optimization - Results

Date: 2025-12-16
Baseline: FAST v3 = 55.97M ops/s (mean), 56.03M ops/s (median)
Target: +0.5% (56.32M+ ops/s) for GO
Result: NO-GO (-2.02% regression)


Methodology: ASM-First Approach

Following Phase 40's lesson (where tiny_header_mode() turned out to have been optimized away already by Phase 21), this phase followed a strict ASM-inspection-first methodology:

  1. Baseline measurement before any code changes
  2. ASM inspection to verify gates actually exist in assembly
  3. Optimization only if gates found in hot paths
  4. Incremental testing with proper A/B comparison

Step 0: Baseline Measurement

Command: make perf_fast

10-Run Results (Baseline):

Run 1:  56.62M ops/s
Run 2:  55.62M ops/s
Run 3:  56.62M ops/s
Run 4:  56.62M ops/s
Run 5:  55.79M ops/s
Run 6:  55.42M ops/s
Run 7:  55.89M ops/s
Run 8:  56.16M ops/s
Run 9:  54.79M ops/s
Run 10: 56.17M ops/s

Baseline Statistics:

  • Mean: 55.97M ops/s
  • Median: 56.03M ops/s
  • Range: 54.79M - 56.62M ops/s
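The statistics above can be reproduced from the ten run values. The following is a minimal sketch, not part of the hakmem scripts; the helper names `bench_mean` and `bench_median` and the array `baseline_runs` are hypothetical and introduced only for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort: ascending order of doubles. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples; returns 0.0 for an empty input. */
double bench_mean(const double *v, size_t n) {
    if (n == 0) return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Median of n samples (1..64); sorts a local copy so the input is untouched. */
double bench_median(const double *v, size_t n) {
    double tmp[64];
    if (n == 0 || n > 64) return 0.0;
    memcpy(tmp, v, n * sizeof tmp[0]);
    qsort(tmp, n, sizeof tmp[0], cmp_double);
    return (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}

/* The ten baseline runs listed above, in M ops/s. */
const double baseline_runs[10] = {
    56.62, 55.62, 56.62, 56.62, 55.79,
    55.42, 55.89, 56.16, 54.79, 56.17
};
```

With an even number of runs the median is the average of the two middle values, which is why the reported median does not coincide with any single run.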

Step 1: ASM Inspection Results

Target Gates (from Phase 40 preparation):

  1. mid_v3_enabled() in core/box/mid_hotbox_v3_env_box.h
  2. mid_v3_debug_enabled() in core/box/mid_hotbox_v3_env_box.h

Inspection Command:

objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"

Findings:

mid_v3_debug_enabled(): FOUND in assembly

  • Call count: 19+ occurrences in disassembly
  • Function location: 0x10630 <mid_v3_debug_enabled.lto_priv.0>
  • Call sites identified (line numbers are positions in the grep -n output of the disassembly, not source lines):
    • Line 685: call 10630 <mid_v3_debug_enabled.lto_priv.0>
    • Line 705: call 10630 <mid_v3_debug_enabled.lto_priv.0>
    • Line 933: call 10630 <mid_v3_debug_enabled.lto_priv.0>
    • Lines 9378, 9385, 9403, 10890, 31533, 31540, 31554, 31867, 32748, 32755, 32774, 33047, etc.

mid_v3_enabled(): NOT FOUND in assembly

  • No standalone symbol and no call instructions: the static inline function was most likely inlined into its callers
  • MID v3 is OFF by default (g_enable = 0), so the inlined check is false on every call and the guarded blocks are never entered

Call Site Analysis:

Source locations of mid_v3_debug_enabled():

  1. Alloc path (core/box/hak_alloc_api.inc.h):

    • Line 84: Inside if (mid_v3_enabled() && size >= 257 && size <= 768) block
    • Line 95: Inside same block, after class selection
    • Line 106: Inside same block, after successful allocation
  2. Free path (core/box/hak_free_api.inc.h):

    • Line 252: Inside if (lk.kind == REGION_KIND_MID_V3) block (SSOT path)
    • Line 273: Inside same block (legacy path)
  3. Mid-hotbox v3 implementation (core/mid_hotbox_v3.c):

    • Multiple debug logging calls (lines 149, 158, 258, 270, 401, 423, 464, 507, 545)

Key Insight:

mid_v3_debug_enabled() appears in assembly because it is called INSIDE blocks that are already guarded by mid_v3_enabled(). Since mid_v3_enabled() returns 0 (OFF by default), these debug gates are NEVER reached at runtime. The compiler still emits the calls, but they sit in code that never runs, so they are effectively dead code.

Pattern observed:

// In hot paths:
if (mid_v3_enabled() && ...) {  // Outer guard - always false at runtime (default OFF)
    // ...
    if (mid_v3_debug_enabled() && ...) {  // Inner debug gate - still in ASM!
        fprintf(stderr, ...);
    }
    // ...
}
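
One way to confirm that an inner gate is never reached is to count its evaluations. The sketch below models the nested-gate pattern with hypothetical names (`demo_outer_enabled`, `demo_debug_enabled`, `demo_alloc`, the counter `g_debug_gate_calls`, and the environment variable `HAKMEM_DEMO_OUTER_GATE_SKETCH`); none of them exist in hakmem, and the sketch assumes GCC or Clang for `__builtin_expect` and `__attribute__((noinline))`.

```c
#include <stdlib.h>

/* Hypothetical counter: how many times the inner debug gate was evaluated. */
unsigned long g_debug_gate_calls = 0;

/* Outer gate: reads an environment variable once, default OFF. */
static inline int demo_outer_enabled(void) {
    static int g = -1;
    if (__builtin_expect(g == -1, 0)) {
        const char *e = getenv("HAKMEM_DEMO_OUTER_GATE_SKETCH"); /* hypothetical */
        g = (e && *e && *e != '0') ? 1 : 0;
    }
    return g;
}

/* Inner gate: kept out of line so it remains visible as a symbol. */
__attribute__((noinline)) int demo_debug_enabled(void) {
    g_debug_gate_calls++;
    return 0;
}

/* Hot-path stand-in: the inner gate is reachable only through the outer one. */
int demo_alloc(size_t size) {
    if (demo_outer_enabled() && size >= 257 && size <= 768) {
        if (demo_debug_enabled()) return -1;  /* would log here */
        return 1;                             /* mid path taken */
    }
    return 0;                                 /* falls through to other paths */
}
```

If the counter is still zero after a benchmark run with the variable unset, the inner gate is present in the binary but contributes no executed instructions.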

Step 2: Condition Reordering

Status: SKIPPED - Not applicable

Reason: All mid_v3_debug_enabled() calls are already inside mid_v3_enabled() guards. There are no opportunities for condition reordering to skip gate calls, because the outer gate (mid_v3_enabled()) is already at the top of the conditional chain.


Step 3: BENCH_MINIMAL Constantization

Implementation:

Modified core/box/mid_hotbox_v3_env_box.h to add compile-time constant returns for HAKMEM_BENCH_MINIMAL:

#include "../hakmem_build_flags.h"

static inline int mid_v3_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    static int g_enable = -1;
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_ENABLED");
        if (e && *e) {
            g_enable = (*e != '0') ? 1 : 0;
        } else {
            g_enable = 0;  // default OFF
        }
    }
    return g_enable;
#endif
}

static inline int mid_v3_debug_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    static int g_debug = -1;
    if (__builtin_expect(g_debug == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_DEBUG");
        if (e && *e) {
            g_debug = (*e != '0') ? 1 : 0;
        } else {
            g_debug = 0;
        }
    }
    return g_debug;
#endif
}

ASM Verification After Step 3:

objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"

Result: Both gates ELIMINATED from assembly

  • No mid_v3_debug_enabled function in disassembly
  • No call sites remaining
  • Compiler successfully dead-code eliminated all MID v3 related code

Performance Results (Step 3):

Command: make perf_fast (after Step 3 changes)

10-Run Results (Step 3):

Run 1:  54.60M ops/s
Run 2:  54.35M ops/s
Run 3:  54.11M ops/s
Run 4:  54.60M ops/s
Run 5:  54.84M ops/s
Run 6:  54.79M ops/s
Run 7:  54.53M ops/s
Run 8:  54.56M ops/s
Run 9:  55.96M ops/s
Run 10: 56.08M ops/s

Step 3 Statistics:

  • Mean: 54.84M ops/s
  • Median: 54.60M ops/s
  • Range: 54.11M - 56.08M ops/s

Comparison vs Baseline:

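| Metric | Baseline (ops/s) | Step 3 (ops/s) | Delta (ops/s) | Percent |
|--------|------------------|----------------|---------------|---------|
| Mean   | 55.97M           | 54.84M         | -1.13M        | -2.02%  |
| Median | 56.03M           | 54.60M         | -1.43M        | -2.55%  |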

Verdict: NO-GO (-2.02% regression)
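
The delta and verdict follow directly from the two means. A minimal sketch of that arithmetic, with hypothetical helper names `delta_percent` and `is_go` (the +0.5% threshold is the GO target stated at the top of this report):

```c
/* Relative change of `candidate` versus `baseline`, in percent. */
double delta_percent(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
}

/* GO only when the relative gain reaches the threshold (in percent). */
int is_go(double baseline, double candidate, double threshold_pct) {
    return delta_percent(baseline, candidate) >= threshold_pct;
}
```

For example, `delta_percent(55.97, 54.84)` evaluates to roughly -2.02, so `is_go(55.97, 54.84, 0.5)` is false.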


Root Cause Analysis: Layout Tax

Why did constantization hurt performance?

Hypothesis: Code layout tax (same issue as Phase 40)

  1. Before Step 3:

    • mid_v3_debug_enabled() exists as an out-of-line function (mid_v3_enabled() has no symbol of its own)
    • Its call sites sit in blocks that are never executed (dead code)
    • Hot path code layout is stable
  2. After Step 3:

    • Both gates return compile-time constant 0
    • Compiler inlines these and eliminates entire MID v3 blocks
    • Hot path code is re-laid out by compiler (different basic block arrangement)
    • I-cache locality changes → performance regression

Precedent: Phase 40 Results

Phase 40 attempted to constantize tiny_header_mode():

  • Result: -2.47% regression
  • Cause: Layout tax from code elimination
  • Lesson: Removing already-optimized-away code can hurt more than help

Why layout tax occurs:

Modern CPUs are extremely sensitive to:

  • Branch predictor state (different code layout → different prediction patterns)
  • I-cache line alignment (moving hot loops can cause cache line splits)
  • μop cache behavior (LSD/DSB interactions change with layout)
  • TLB pressure (code page mapping changes)

Even though we eliminated dead code, the side effect of code relayout outweighed the benefit of removing a few dead function calls.


Final Decision: REVERT Step 3

Action: Reverted all changes to core/box/mid_hotbox_v3_env_box.h

git checkout core/box/mid_hotbox_v3_env_box.h

Reason: -2.02% regression is unacceptable for eliminating dead code that was never executed anyway.


Lessons Learned

1. ASM-First Methodology Works

Successfully identified that:

  • mid_v3_enabled() was already optimized away
  • mid_v3_debug_enabled() existed in ASM but was dead code (inside if (0) blocks)

2. Dead Code != Performance Impact

Counterintuitive finding: Removing dead code can hurt performance due to layout tax

  • The dead mid_v3_debug_enabled() calls were never executed
  • But removing them caused code relayout → -2.02% regression
  • Lesson: Leave dead code alone if it's already not executed

3. Layout Tax is Real and Significant

Both Phase 40 and Phase 41 hit layout tax:

  • Phase 40: tiny_header_mode() constantization → -2.47%
  • Phase 41: mid_v3_*() constantization → -2.02%

Pattern: Structural changes to inline functions → unpredictable layout effects

4. When to Stop Optimizing

Stop criteria:

  1. If gate is already optimized away in ASM → Don't touch it
  2. If gate appears in ASM but is never executed → Still don't touch it (layout risk)
  3. Only optimize if gate is executed frequently in hot paths
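
The three stop criteria reduce to a small decision function. This is only a sketch of the rule as stated; the type `gate_action_t` and the function `gate_decide` are hypothetical names, and the two boolean inputs are assumed to come from ASM inspection and runtime profiling respectively.

```c
#include <stdbool.h>

typedef enum {
    GATE_LEAVE_ALONE,  /* criteria 1 and 2: absent from ASM, or present but not executed */
    GATE_OPTIMIZE      /* criterion 3: executed frequently in a hot path */
} gate_action_t;

/* Map the two observations onto the stop criteria. */
gate_action_t gate_decide(bool present_in_asm, bool executed_hot) {
    if (!present_in_asm) return GATE_LEAVE_ALONE;  /* already optimized away */
    if (!executed_hot)   return GATE_LEAVE_ALONE;  /* dead code: layout risk only */
    return GATE_OPTIMIZE;
}
```

Under this rule the Phase 41 gates map to `gate_decide(true, false)`, which yields `GATE_LEAVE_ALONE`.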

5. ASM Inspection is Necessary but Not Sufficient

  • ASM inspection told us gates exist
  • ASM inspection didn't tell us they're dead code inside if (0) blocks
  • Need runtime profiling (e.g., perf record) to confirm execution frequency

Recommendations for Phase 42+

1. Add Runtime Profiling Step

Before optimizing any gate, use perf to verify it's actually executed:

# Profile hot functions
perf record -g -F 999 ./bench_random_mixed_hakmem_minimal
perf report --no-children --sort comm,dso,symbol

# Check if mid_v3_debug_enabled appears in profile
perf report | grep mid_v3

Decision criteria:

  • If function appears in perf report → Worth optimizing
  • If function is in ASM but NOT in perf report → Dead code, leave alone

2. Focus on Actually-Executed Gates

Priority list (requires profiling validation):

  1. Gates that appear in perf report top 50 functions
  2. Gates called in tight loops (identified by inspecting the hot loops with perf annotate)
  3. Gates with measurable CPU time (>0.1% in profile)

3. Accept Dead Code in ASM

Philosophy shift:

  • Old: "If it's in ASM, optimize it"
  • New: "If it's in ASM but not executed, ignore it"

Dead code that's never executed has zero runtime cost. Removing it risks layout tax.

4. Test Layout Stability

Before committing any structural change:

  1. Run 3× 10-run benchmarks (baseline, change, revert-verify)
  2. Check if results are reproducible
  3. Accept only if gain is ≥1.0% (to overcome layout noise)
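
The acceptance rule above could be encoded roughly as follows. This is a sketch under assumptions: "reproducible" is interpreted here as the revert-verify mean returning to within a caller-chosen `noise_pct` of the baseline mean, which the report does not define, and the function name `accept_change` is hypothetical.

```c
/* Relative change in percent from `from` to `to`. */
static double pct_change(double from, double to) {
    return (to - from) / from * 100.0;
}

/*
 * Accept a structural change only if
 *   (a) the revert-verify mean is within noise_pct of the baseline mean
 *       (assumed meaning of "reproducible"), and
 *   (b) the gain over baseline is at least min_gain_pct (1.0 in the rule above).
 */
int accept_change(double baseline_mean, double change_mean, double revert_mean,
                  double noise_pct, double min_gain_pct) {
    double drift = pct_change(baseline_mean, revert_mean);
    if (drift < 0) drift = -drift;
    if (drift > noise_pct) return 0;  /* measurements drifted: rerun before judging */
    return pct_change(baseline_mean, change_mean) >= min_gain_pct;
}
```

Checking the revert-verify run first keeps a drifting machine state from being mistaken for a gain or a regression.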

5. Alternative: Investigate Other Hot Gates

Instead of MID v3 gates (which are dead), profile to find:

  • Tiny allocator gates that ARE executed
  • Free path gates with measurable cost
  • Size class routing decisions in hot paths

Quantitative Summary

| Phase | Target Gate(s) | ASM Present? | Executed? | Change | Result | Verdict |
|-------|----------------|--------------|-----------|--------|--------|---------|
| Phase 21 | tiny_header_mode() | No (optimized away) | No | N/A | N/A | Skipped |
| Phase 40 | tiny_header_mode() | No | No | Constantization | -2.47% | NO-GO |
| Phase 41 Step 2 | Condition reorder | N/A | N/A | N/A | Skipped | N/A |
| Phase 41 Step 3 | mid_v3_enabled(), mid_v3_debug_enabled() | Yes (debug only) | No (dead code) | Constantization | -2.02% | NO-GO |

Phase 41 Final Performance: 55.97M ops/s (baseline, no changes adopted)


Conclusion

Phase 41 successfully demonstrated the ASM-first gate audit methodology and confirmed its value. However, it also revealed a critical limitation:

ASM presence ≠ Performance impact

The gates we targeted (mid_v3_debug_enabled()) existed in assembly but were dead code inside if (mid_v3_enabled()) guards that compile to if (0). Attempting to eliminate this dead code via BENCH_MINIMAL constantization caused a -2.02% layout tax regression.

Key Takeaway:

  • ASM inspection prevents wasting time on already-optimized gates (like Phase 21's tiny_header_mode())
  • But ASM inspection alone is insufficient - need runtime profiling to distinguish executed vs. dead code
  • ⚠️ Layout tax is a first-class optimization enemy - structural changes risk unpredictable regressions

Phase 42 Direction:

  1. Add perf record/report step to methodology
  2. Target only gates that appear in runtime profiles
  3. Accept dead code in ASM as zero-cost (don't fix what isn't broken)
  4. Require ≥1.0% gain to overcome layout noise

Phase 41 Verdict: NO-GO - Revert all changes, baseline remains FAST v3 = 55.97M ops/s