
Moe Charm (CI) 7adbcdfcb6 Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement

## Summary

Completed Phase 54-60 optimization work:

**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset

**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY

**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.67× better CV than mimalloc (3.50% / 1.31%)

**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized

## Key Metrics

- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes

## Files Added/Modified

New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h

Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py

Documentation: Phase 40-60 analysis documents

## Design Decisions

1. Profile separation (core/bench_profile.h):
   - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
   - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)

2. Box Theory compliance:
   - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
   - Single conversion points maintained
   - No physical deletions (compile-out only)

3. Lessons learned:
   - SSOT effective only where redundancy exists (Phase 60 showed limits)
   - Branch prediction extremely effective (~0 cycles for well-predicted branches)
   - Early-exit pattern valuable even when seemingly redundant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-17 06:24:01 +09:00


Phase 41: ASM-First Gate Audit and Optimization - Results

Date: 2025-12-16
Baseline: FAST v3 = 55.97M ops/s (mean), 56.03M ops/s (median)
Target: +0.5% (56.32M+ ops/s) for GO
Result: NO-GO (-2.02% regression)

Methodology: ASM-First Approach

Following Phase 40's lesson (where tiny_header_mode() was already optimized away by Phase 21), this phase implemented a strict ASM inspection FIRST methodology:

1. Baseline measurement before any code changes
2. ASM inspection to verify gates actually exist in assembly
3. Optimization only if gates found in hot paths
4. Incremental testing with proper A/B comparison

Step 0: Baseline Measurement

Command: make perf_fast

10-Run Results (Baseline):

Run 1:  56.62M ops/s
Run 2:  55.62M ops/s
Run 3:  56.62M ops/s
Run 4:  56.62M ops/s
Run 5:  55.79M ops/s
Run 6:  55.42M ops/s
Run 7:  55.89M ops/s
Run 8:  56.16M ops/s
Run 9:  54.79M ops/s
Run 10: 56.17M ops/s

Baseline Statistics:

Mean: 55.97M ops/s
Median: 56.03M ops/s
Range: 54.79M - 56.62M ops/s
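
The statistics above can be re-derived from the ten listed runs. The following is a minimal sketch, not the project's tooling: the helper names `bench_mean` and `bench_median` and the array `baseline_runs` are made up for illustration, and only the sample values come from the run list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of doubles. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples (n must be > 0). */
static double bench_mean(const double *v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Median of n samples; sorts the caller's array in place. */
static double bench_median(double *v, size_t n) {
    assert(n > 0);
    qsort(v, n, sizeof v[0], cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* The ten baseline runs listed above, in M ops/s. */
static double baseline_runs[10] = {
    56.62, 55.62, 56.62, 56.62, 55.79, 55.42, 55.89, 56.16, 54.79, 56.17
};
```

Calling `bench_mean(baseline_runs, 10)` gives about 55.97 and `bench_median(baseline_runs, 10)` gives 56.025, which rounds to the 56.03 reported above.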

Step 1: ASM Inspection Results

Target Gates (from Phase 40 preparation):

mid_v3_enabled() in core/box/mid_hotbox_v3_env_box.h
mid_v3_debug_enabled() in core/box/mid_hotbox_v3_env_box.h

Inspection Command:

objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"

Findings:

`mid_v3_debug_enabled()`: ✅ FOUND in assembly

- Call count: 19+ occurrences in disassembly
- Function location: 0x10630 <mid_v3_debug_enabled.lto_priv.0>
- Call sites identified:
  - Line 685: call 10630 <mid_v3_debug_enabled.lto_priv.0>
  - Line 705: call 10630 <mid_v3_debug_enabled.lto_priv.0>
  - Line 933: call 10630 <mid_v3_debug_enabled.lto_priv.0>
  - Lines 9378, 9385, 9403, 10890, 31533, 31540, 31554, 31867, 32748, 32755, 32774, 33047, etc.

`mid_v3_enabled()`: ❌ NOT FOUND in assembly

- No standalone symbol and no call instruction: the static inline function is inlined at every call site
- MID v3 is OFF by default (g_enable resolves to 0 when HAKMEM_MID_V3_ENABLED is unset), so the guarded blocks are never entered at runtime

Call Site Analysis:

Source locations of mid_v3_debug_enabled():

- Alloc path (core/box/hak_alloc_api.inc.h):
  - Line 84: Inside if (mid_v3_enabled() && size >= 257 && size <= 768) block
  - Line 95: Inside same block, after class selection
  - Line 106: Inside same block, after successful allocation
- Free path (core/box/hak_free_api.inc.h):
  - Line 252: Inside if (lk.kind == REGION_KIND_MID_V3) block (SSOT path)
  - Line 273: Inside same block (legacy path)
- Mid-hotbox v3 implementation (core/mid_hotbox_v3.c):
  - Multiple debug logging calls (lines 149, 158, 258, 270, 401, 423, 464, 507, 545)

Key Insight:

mid_v3_debug_enabled() appears in assembly because its call sites sit INSIDE blocks guarded by mid_v3_enabled(), and that guard is resolved at runtime from an environment variable, so the compiler cannot delete the blocks. Since mid_v3_enabled() returns 0 (OFF by default), these debug gates are NEVER actually executed in the benchmark. The calls are dead at runtime, not at compile time, which is why the compiler still emits them.

Pattern observed:

// In hot paths:
if (mid_v3_enabled() && ...) {  // Outer guard - inlined; false at runtime by default
    // ...
    if (mid_v3_debug_enabled() && ...) {  // Inner debug gate - still in ASM!
        fprintf(stderr, ...);
    }
    // ...
}
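
To make the "present in ASM, never executed" situation concrete, here is a small self-contained sketch of the same guard shape. Everything in it is illustrative: `demo_outer_enabled`, `demo_debug_enabled`, `demo_alloc_path`, and the `DEMO_MID_V3_ENABLED` variable are hypothetical stand-ins, not hakmem code. The inner gate increments a counter so that a test can confirm it is never reached while the outer gate is OFF.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int g_debug_gate_calls = 0;  /* counts evaluations of the inner gate */

/* Outer gate: lazily reads a (hypothetical) env var once, default OFF. */
static inline int demo_outer_enabled(void) {
    static int g = -1;
    if (__builtin_expect(g == -1, 0)) {
        const char *e = getenv("DEMO_MID_V3_ENABLED");
        g = (e && *e && *e != '0') ? 1 : 0;
    }
    return g;
}

/* Inner debug gate: kept out of line so it stays visible as a symbol. */
__attribute__((noinline)) static int demo_debug_enabled(void) {
    g_debug_gate_calls++;
    return 0;
}

/* Simulated alloc path: the debug gate sits inside the outer guard. */
static int demo_alloc_path(size_t size) {
    if (demo_outer_enabled() && size >= 257 && size <= 768) {
        if (demo_debug_enabled()) {
            fprintf(stderr, "mid v3 alloc size=%zu\n", size);
        }
        return 1;  /* would have taken the MID v3 route */
    }
    return 0;      /* falls through to the default route */
}
```

With the environment variable unset, any number of `demo_alloc_path()` calls leaves `g_debug_gate_calls` at zero, even though a disassembly of the binary would still show the call to `demo_debug_enabled`.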

Step 2: Condition Reordering

Status: SKIPPED - Not applicable

Reason: All mid_v3_debug_enabled() calls are already inside mid_v3_enabled() guards. There are no opportunities for condition reordering to skip gate calls, because the outer gate (mid_v3_enabled()) is already at the top of the conditional chain.

Step 3: BENCH_MINIMAL Constantization

Implementation:

Modified core/box/mid_hotbox_v3_env_box.h to add compile-time constant returns for HAKMEM_BENCH_MINIMAL:

#include <stdlib.h>  // getenv
#include "../hakmem_build_flags.h"

static inline int mid_v3_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    static int g_enable = -1;
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_ENABLED");
        if (e && *e) {
            g_enable = (*e != '0') ? 1 : 0;
        } else {
            g_enable = 0;  // default OFF
        }
    }
    return g_enable;
#endif
}

static inline int mid_v3_debug_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    static int g_debug = -1;
    if (__builtin_expect(g_debug == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_DEBUG");
        if (e && *e) {
            g_debug = (*e != '0') ? 1 : 0;
        } else {
            g_debug = 0;
        }
    }
    return g_debug;
#endif
}

ASM Verification After Step 3:

objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"

Result: ✅ Both gates ELIMINATED from assembly

No mid_v3_debug_enabled function in disassembly
No call sites remaining
Compiler successfully dead-code eliminated all MID v3 related code

Performance Results (Step 3):

Command: make perf_fast (after Step 3 changes)

10-Run Results (Step 3):

Run 1:  54.60M ops/s
Run 2:  54.35M ops/s
Run 3:  54.11M ops/s
Run 4:  54.60M ops/s
Run 5:  54.84M ops/s
Run 6:  54.79M ops/s
Run 7:  54.53M ops/s
Run 8:  54.56M ops/s
Run 9:  55.96M ops/s
Run 10: 56.08M ops/s

Step 3 Statistics:

Mean: 54.84M ops/s
Median: 54.60M ops/s
Range: 54.11M - 56.08M ops/s

Comparison vs Baseline:

Metric	Baseline	Step 3	Delta	Percent
Mean	55.97M	54.84M	-1.13M	-2.02%
Median	56.03M	54.60M	-1.43M	-2.55%

Verdict: NO-GO (-2.02% regression)
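
The delta and verdict arithmetic can be sketched as follows. This is an illustrative helper under assumed names (`delta_percent`, `judge`, `verdict_t`), using the +0.5% GO threshold stated for this phase. It is not the script that produced the table.

```c
#include <assert.h>

typedef enum { VERDICT_NO_GO = 0, VERDICT_GO = 1 } verdict_t;

/* Relative change of `candidate` versus `baseline`, in percent. */
static double delta_percent(double baseline, double candidate) {
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}

/* GO only if the mean improves by at least `threshold_pct` percent. */
static verdict_t judge(double baseline_mean, double candidate_mean,
                       double threshold_pct) {
    return delta_percent(baseline_mean, candidate_mean) >= threshold_pct
               ? VERDICT_GO
               : VERDICT_NO_GO;
}
```

For the means above, `delta_percent(55.97, 54.84)` is about -2.02, so `judge(55.97, 54.84, 0.5)` returns `VERDICT_NO_GO`.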

Root Cause Analysis: Layout Tax

Why did constantization hurt performance?

Hypothesis: Code layout tax (same issue as Phase 40)

Before Step 3:
- mid_v3_debug_enabled() exists as an outlined function; mid_v3_enabled() is inlined at its call sites
- Call sites reference the debug gate but are never reached at runtime
- Hot path code layout is stable

After Step 3:
- Both gates return compile-time constant 0
- Compiler inlines these and eliminates entire MID v3 blocks
- Hot path code is re-laid out by compiler (different basic block arrangement)
- I-cache locality changes → performance regression

Precedent: Phase 40 Results

Phase 40 attempted to constantize tiny_header_mode():

Result: -2.47% regression
Cause: Layout tax from code elimination
Lesson: Removing already-optimized-away code can hurt more than help

Why layout tax occurs:

Modern CPUs are extremely sensitive to:

Branch predictor state (different code layout → different prediction patterns)
I-cache line alignment (moving hot loops can cause cache line splits)
μop cache behavior (LSD/DSB interactions change with layout)
TLB pressure (code page mapping changes)

Even though we eliminated dead code, the side effect of code relayout outweighed the benefit of removing a few dead function calls.

Final Decision: REVERT Step 3

Action: Reverted all changes to core/box/mid_hotbox_v3_env_box.h

git checkout core/box/mid_hotbox_v3_env_box.h

Reason: -2.02% regression is unacceptable for eliminating dead code that was never executed anyway.

Lessons Learned

1. ASM-First Methodology Works

✅ Successfully identified that:

mid_v3_enabled() was already optimized away
mid_v3_debug_enabled() existed in ASM but was never executed (its call sites sit inside guards that are false at runtime)

2. Dead Code != Performance Impact

❌ Counterintuitive finding: Removing dead code can hurt performance due to layout tax

The dead mid_v3_debug_enabled() calls were never executed
But removing them caused code relayout → -2.02% regression
Lesson: Leave dead code alone if it's already not executed

3. Layout Tax is Real and Significant

Both Phase 40 and Phase 41 hit layout tax:

Phase 40: tiny_header_mode() constantization → -2.47%
Phase 41: mid_v3_*() constantization → -2.02%

Pattern: Structural changes to inline functions → unpredictable layout effects

4. When to Stop Optimizing

Stop criteria:

If gate is already optimized away in ASM → Don't touch it
If gate appears in ASM but is never executed → Still don't touch it (layout risk)
Only optimize if gate is executed frequently in hot paths
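
These stop criteria amount to a three-way decision per gate. A minimal sketch, assuming a hypothetical audit record (`gate_audit_t`) filled in by hand from objdump and perf output. The type and function names are illustrative, and the profile-share cutoff is a parameter, not a fixed rule.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    GATE_SKIP_OPTIMIZED_AWAY,  /* not in ASM: nothing left to remove */
    GATE_SKIP_NEVER_EXECUTED,  /* in ASM but cold: layout risk, no gain */
    GATE_OPTIMIZE              /* in ASM and hot: candidate for work */
} gate_action_t;

/* Hypothetical audit record for one gate function. */
typedef struct {
    bool   in_asm;       /* symbol or call sites visible in the disassembly */
    double profile_pct;  /* share of samples in the runtime profile, percent */
} gate_audit_t;

/* Apply the stop criteria; min_pct is the smallest share considered hot. */
static gate_action_t gate_decide(gate_audit_t g, double min_pct) {
    assert(min_pct >= 0.0);
    if (!g.in_asm) return GATE_SKIP_OPTIMIZED_AWAY;
    if (g.profile_pct <= min_pct) return GATE_SKIP_NEVER_EXECUTED;
    return GATE_OPTIMIZE;
}
```

Under this sketch, mid_v3_enabled() (absent from ASM) and mid_v3_debug_enabled() (in ASM, zero samples) both map to a skip action. Only a gate with a measurable profile share maps to `GATE_OPTIMIZE`.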

5. ASM Inspection is Necessary but Not Sufficient

✅ ASM inspection told us gates exist
❌ ASM inspection didn't tell us the calls are never reached at runtime (their outer guard is false by default)
✅ Need runtime profiling (e.g., perf record) to confirm execution frequency

Recommendations for Phase 42+

1. Add Runtime Profiling Step

Before optimizing any gate, use perf to verify it's actually executed:

# Profile hot functions
perf record -g -F 999 ./bench_random_mixed_hakmem_minimal
perf report --no-children --sort comm,dso,symbol

# Check if mid_v3_debug_enabled appears in profile
perf report | grep mid_v3

Decision criteria:

If function appears in perf report → Worth optimizing
If function is in ASM but NOT in perf report → Dead code, leave alone

2. Focus on Actually-Executed Gates

Priority list (requires profiling validation):

Gates that appear in perf report top 50 functions
Gates called in tight loops (identified by inspecting perf annotate output)
Gates with measurable CPU time (>0.1% in profile)

3. Accept Dead Code in ASM

Philosophy shift:

Old: "If it's in ASM, optimize it"
New: "If it's in ASM but not executed, ignore it"

Dead code that's never executed has zero runtime cost. Removing it risks layout tax.

4. Test Layout Stability

Before committing any structural change:

Run 3× 10-run benchmarks (baseline, change, revert-verify)
Check if results are reproducible
Accept only if gain is ≥1.0% (to overcome layout noise)
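
One way to mechanize this check is sketched below. It is an assumption-laden illustration, not an existing script: `layout_stable_accept` is a made-up name, and `repro_tol_pct` (how close the revert-verify mean must land to the baseline mean) is an assumed knob, since no numeric reproducibility threshold is stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Arithmetic mean of n samples (n must be > 0). */
static double mean_of(const double *v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Absolute value without pulling in <math.h>. */
static double abs_d(double x) { return x < 0.0 ? -x : x; }

/*
 * Accept a structural change only if
 *   1) its mean beats the baseline mean by at least min_gain_pct percent, and
 *   2) the revert-verify mean is within repro_tol_pct percent of the baseline
 *      (repro_tol_pct is an assumed knob, not a documented threshold).
 */
static bool layout_stable_accept(const double *base, const double *change,
                                 const double *revert, size_t n,
                                 double min_gain_pct, double repro_tol_pct) {
    double b = mean_of(base, n);
    double gain_pct  = (mean_of(change, n) - b) / b * 100.0;
    double drift_pct = abs_d(mean_of(revert, n) - b) / b * 100.0;
    return gain_pct >= min_gain_pct && drift_pct <= repro_tol_pct;
}
```

Each array would hold one 10-run set (baseline, change, revert-verify), with `min_gain_pct` set to 1.0 to match the acceptance rule above.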

5. Alternative: Investigate Other Hot Gates

Instead of MID v3 gates (which are dead), profile to find:

Tiny allocator gates that ARE executed
Free path gates with measurable cost
Size class routing decisions in hot paths

Quantitative Summary

Phase	Target Gate(s)	ASM Present?	Executed?	Change	Result	Verdict
Phase 21	`tiny_header_mode()`	No (optimized away)	No	N/A	N/A	Skipped
Phase 40	`tiny_header_mode()`	No	No	Constantization	-2.47%	NO-GO
Phase 41 Step 2	Condition reorder	N/A	N/A	N/A	Skipped	N/A
Phase 41 Step 3	`mid_v3_enabled()`, `mid_v3_debug_enabled()`	Yes (debug only)	No (dead code)	Constantization	-2.02%	NO-GO

Phase 41 Final Performance: 55.97M ops/s (baseline, no changes adopted)

Conclusion

Phase 41 successfully demonstrated the ASM-first gate audit methodology and confirmed its value. However, it also revealed a critical limitation:

ASM presence ≠ Performance impact

The gates we targeted (mid_v3_debug_enabled()) existed in assembly but were never executed, because they sit inside if (mid_v3_enabled()) guards that evaluate to false at runtime by default. Attempting to eliminate this dead code via BENCH_MINIMAL constantization caused a -2.02% layout tax regression.

Key Takeaway:

✅ ASM inspection prevents wasting time on already-optimized gates (like Phase 21's tiny_header_mode())
❌ But ASM inspection alone is insufficient - need runtime profiling to distinguish executed vs. dead code
⚠️ Layout tax is a first-class optimization enemy - structural changes risk unpredictable regressions

Phase 42 Direction:

Add perf record/report step to methodology
Target only gates that appear in runtime profiles
Accept dead code in ASM as zero-cost (don't fix what isn't broken)
Require ≥1.0% gain to overcome layout noise

Phase 41 Verdict: NO-GO - Revert all changes, baseline remains FAST v3 = 55.97M ops/s
