## Summary
Completed Phase 54-60 optimization work:
**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset
**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY
**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: ~2.7× lower CV than mimalloc (1.31% vs 3.50%)
**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized
## Key Metrics
- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes
## Files Added/Modified
New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h
Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py
Documentation: Phase 40-60 analysis documents
## Design Decisions
1. Profile separation (core/bench_profile.h):
- MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
- MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)
2. Box Theory compliance:
- All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
- Single conversion points maintained
- No physical deletions (compile-out only)
3. Lessons learned:
- SSOT effective only where redundancy exists (Phase 60 showed limits)
- Branch prediction extremely effective (~0 cycles for well-predicted branches)
- Early-exit pattern valuable even when seemingly redundant
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Phase 41: ASM-First Gate Audit and Optimization - Results
Date: 2025-12-16
Baseline: FAST v3 = 55.97M ops/s (mean), 56.03M ops/s (median)
Target: +0.5% (56.32M+ ops/s) for GO
Result: NO-GO (-2.02% regression)
Methodology: ASM-First Approach
Following Phase 40's lesson (where tiny_header_mode() was already optimized away by Phase 21), this phase implemented a strict ASM inspection FIRST methodology:
- Baseline measurement before any code changes
- ASM inspection to verify gates actually exist in assembly
- Optimization only if gates found in hot paths
- Incremental testing with proper A/B comparison
Step 0: Baseline Measurement
Command: make perf_fast
10-Run Results (Baseline):
Run 1: 56.62M ops/s
Run 2: 55.62M ops/s
Run 3: 56.62M ops/s
Run 4: 56.62M ops/s
Run 5: 55.79M ops/s
Run 6: 55.42M ops/s
Run 7: 55.89M ops/s
Run 8: 56.16M ops/s
Run 9: 54.79M ops/s
Run 10: 56.17M ops/s
Baseline Statistics:
- Mean: 55.97M ops/s
- Median: 56.03M ops/s
- Range: 54.79M - 56.62M ops/s
Step 1: ASM Inspection Results
Target Gates (from Phase 40 preparation):
- mid_v3_enabled() in core/box/mid_hotbox_v3_env_box.h
- mid_v3_debug_enabled() in core/box/mid_hotbox_v3_env_box.h
Inspection Command:
objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"
Findings:
mid_v3_debug_enabled(): ✅ FOUND in assembly
- Call count: 19+ occurrences in disassembly
- Function location: 0x10630 <mid_v3_debug_enabled.lto_priv.0>
- Call sites identified (line numbers refer to the grep -n output over the disassembly):
  - Line 685: call 10630 <mid_v3_debug_enabled.lto_priv.0>
  - Line 705: call 10630 <mid_v3_debug_enabled.lto_priv.0>
  - Line 933: call 10630 <mid_v3_debug_enabled.lto_priv.0>
  - Lines 9378, 9385, 9403, 10890, 31533, 31540, 31554, 31867, 32748, 32755, 32774, 33047, etc.
mid_v3_enabled(): ❌ NOT FOUND in assembly
- Already optimized away by compiler (likely inlined and dead-code eliminated)
- MID v3 is OFF by default (g_enable = 0), so compiler eliminated entire blocks
Call Site Analysis:
Source locations of mid_v3_debug_enabled():
- Alloc path (core/box/hak_alloc_api.inc.h):
  - Line 84: Inside if (mid_v3_enabled() && size >= 257 && size <= 768) block
  - Line 95: Inside same block, after class selection
  - Line 106: Inside same block, after successful allocation
- Free path (core/box/hak_free_api.inc.h):
  - Line 252: Inside if (lk.kind == REGION_KIND_MID_V3) block (SSOT path)
  - Line 273: Inside same block (legacy path)
- Mid-hotbox v3 implementation (core/mid_hotbox_v3.c):
  - Multiple debug logging calls (lines 149, 158, 258, 270, 401, 423, 464, 507, 545)
Key Insight:
mid_v3_debug_enabled() appears in assembly because it's called INSIDE blocks that are already guarded by mid_v3_enabled(). However, since mid_v3_enabled() returns 0 (OFF by default), these debug gates are NEVER actually executed at runtime. The compiler still generates the function calls as dead code.
Pattern observed:
// In hot paths:
if (mid_v3_enabled() && ...) {            // Outer guard - optimized to "if (0)"
    // ...
    if (mid_v3_debug_enabled() && ...) {  // Inner debug gate - still in ASM!
        fprintf(stderr, ...);
    }
    // ...
}
Step 2: Condition Reordering
Status: SKIPPED - Not applicable
Reason: All mid_v3_debug_enabled() calls are already inside mid_v3_enabled() guards. There are no opportunities for condition reordering to skip gate calls, because the outer gate (mid_v3_enabled()) is already at the top of the conditional chain.
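For reference, this is roughly what condition reordering means when it does apply: put the cheap test first so that short-circuit evaluation skips the gate call entirely. The sketch below is illustrative only. demo_gate(), route_gate_first(), route_range_first(), and count_gate_calls() are hypothetical names, not hakmem functions; only the 257..768 size range is taken from the alloc path above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an ENV gate; counts how often it is called. */
static int g_gate_calls = 0;
static int g_gate_value = 0; /* 0 = feature OFF, as with MID v3 by default */

static int demo_gate(void) {
    g_gate_calls++;
    return g_gate_value;
}

/* Gate first: the gate is evaluated for every size. */
static int route_gate_first(size_t size) {
    return demo_gate() && size >= 257 && size <= 768;
}

/* Range check first: && short-circuits, so the gate is only
 * evaluated for sizes inside [257, 768]. */
static int route_range_first(size_t size) {
    return size >= 257 && size <= 768 && demo_gate();
}

/* Returns how many gate calls were made while routing n sizes. */
static int count_gate_calls(int (*route)(size_t), const size_t *sizes, size_t n) {
    assert(sizes != NULL);
    g_gate_calls = 0;
    for (size_t i = 0; i < n; i++) {
        (void)route(sizes[i]);
    }
    return g_gate_calls;
}
```

In Phase 41 there was nothing to gain from this, because the debug gate already sits behind the outer mid_v3_enabled() guard and is never reached while MID v3 is OFF.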
Step 3: BENCH_MINIMAL Constantization
Implementation:
Modified core/box/mid_hotbox_v3_env_box.h to add compile-time constant returns for HAKMEM_BENCH_MINIMAL:
#include "../hakmem_build_flags.h"

static inline int mid_v3_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    static int g_enable = -1;
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_ENABLED");
        if (e && *e) {
            g_enable = (*e != '0') ? 1 : 0;
        } else {
            g_enable = 0; // default OFF
        }
    }
    return g_enable;
#endif
}

static inline int mid_v3_debug_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    // Phase 41: BENCH_MINIMAL → fixed OFF (research box)
    return 0;
#else
    static int g_debug = -1;
    if (__builtin_expect(g_debug == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V3_DEBUG");
        if (e && *e) {
            g_debug = (*e != '0') ? 1 : 0;
        } else {
            g_debug = 0;
        }
    }
    return g_debug;
#endif
}
ASM Verification After Step 3:
objdump -d ./bench_random_mixed_hakmem_minimal | grep -n "mid_v3_enabled\|mid_v3_debug_enabled"
Result: ✅ Both gates ELIMINATED from assembly
- No mid_v3_debug_enabled function in disassembly
- No call sites remaining
- Compiler successfully dead-code eliminated all MID v3 related code
Performance Results (Step 3):
Command: make perf_fast (after Step 3 changes)
10-Run Results (Step 3):
Run 1: 54.60M ops/s
Run 2: 54.35M ops/s
Run 3: 54.11M ops/s
Run 4: 54.60M ops/s
Run 5: 54.84M ops/s
Run 6: 54.79M ops/s
Run 7: 54.53M ops/s
Run 8: 54.56M ops/s
Run 9: 55.96M ops/s
Run 10: 56.08M ops/s
Step 3 Statistics:
- Mean: 54.84M ops/s
- Median: 54.60M ops/s
- Range: 54.11M - 56.08M ops/s
Comparison vs Baseline:
| Metric | Baseline | Step 3 | Delta | Percent |
|---|---|---|---|---|
| Mean | 55.97M | 54.84M | -1.13M | -2.02% |
| Median | 56.03M | 54.60M | -1.43M | -2.55% |
Verdict: NO-GO (-2.02% regression)
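The statistics above can be re-derived from the raw run lists. The following is a minimal sketch, not the script actually used for this report: the helper names (mean_of, median_of, percent_delta) and the fixed 64-element scratch buffer are assumptions for illustration, and the arrays simply copy the two 10-run lists.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Raw results copied from the two 10-run lists (M ops/s). */
static const double BASELINE_RUNS[10] = {56.62, 55.62, 56.62, 56.62, 55.79,
                                         55.42, 55.89, 56.16, 54.79, 56.17};
static const double STEP3_RUNS[10] = {54.60, 54.35, 54.11, 54.60, 54.84,
                                      54.79, 54.53, 54.56, 55.96, 56.08};

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double mean_of(const double *v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Median via a sorted copy; the input array is left untouched. */
static double median_of(const double *v, size_t n) {
    assert(n > 0 && n <= 64);
    double tmp[64];
    memcpy(tmp, v, n * sizeof tmp[0]);
    qsort(tmp, n, sizeof tmp[0], cmp_double);
    return (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}

/* Relative change of candidate vs. baseline, in percent. */
static double percent_delta(double baseline, double candidate) {
    assert(baseline != 0.0);
    return 100.0 * (candidate - baseline) / baseline;
}
```

Calling percent_delta(mean_of(BASELINE_RUNS, 10), mean_of(STEP3_RUNS, 10)) gives roughly -2.0%, in line with the table. The median delta comes out near -2.5%; the last digit depends on whether the medians are rounded before dividing.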
Root Cause Analysis: Layout Tax
Why did constantization hurt performance?
Hypothesis: Code layout tax (same issue as Phase 40)
- Before Step 3:
  - mid_v3_enabled() and mid_v3_debug_enabled() exist as outlined functions
  - Call sites reference these functions, which are never executed (dead code)
  - Hot path code layout is stable
- After Step 3:
  - Both gates return compile-time constant 0
  - Compiler inlines these and eliminates entire MID v3 blocks
  - Hot path code is re-laid out by compiler (different basic block arrangement)
  - I-cache locality changes → performance regression
Precedent: Phase 40 Results
Phase 40 attempted to constantize tiny_header_mode():
- Result: -2.47% regression
- Cause: Layout tax from code elimination
- Lesson: Removing already-optimized-away code can hurt more than help
Why layout tax occurs:
Modern CPUs are extremely sensitive to:
- Branch predictor state (different code layout → different prediction patterns)
- I-cache line alignment (moving hot loops can cause cache line splits)
- μop cache behavior (LSD/DSB interactions change with layout)
- TLB pressure (code page mapping changes)
Even though we eliminated dead code, the side effect of code relayout outweighed the benefit of removing a few dead function calls.
Final Decision: REVERT Step 3
Action: Reverted all changes to core/box/mid_hotbox_v3_env_box.h
git checkout core/box/mid_hotbox_v3_env_box.h
Reason: -2.02% regression is unacceptable for eliminating dead code that was never executed anyway.
Lessons Learned
1. ASM-First Methodology Works
✅ Successfully identified that:
- mid_v3_enabled() was already optimized away
- mid_v3_debug_enabled() existed in ASM but was dead code (inside if (0) blocks)
2. Dead Code != Performance Impact
❌ Counterintuitive finding: Removing dead code can hurt performance due to layout tax
- The dead mid_v3_debug_enabled() calls were never executed
- But removing them caused code relayout → -2.02% regression
- Lesson: Leave dead code alone if it's already not executed
3. Layout Tax is Real and Significant
Both Phase 40 and Phase 41 hit layout tax:
- Phase 40: tiny_header_mode() constantization → -2.47%
- Phase 41: mid_v3_*() constantization → -2.02%
Pattern: Structural changes to inline functions → unpredictable layout effects
4. When to Stop Optimizing
Stop criteria:
- If gate is already optimized away in ASM → Don't touch it
- If gate appears in ASM but is never executed → Still don't touch it (layout risk)
- Only optimize if gate is executed frequently in hot paths
5. ASM Inspection is Necessary but Not Sufficient
- ✅ ASM inspection told us gates exist
- ❌ ASM inspection didn't tell us they're dead code inside if (0) blocks
- ✅ Need runtime profiling (e.g., perf record) to confirm execution frequency
Recommendations for Phase 42+
1. Add Runtime Profiling Step
Before optimizing any gate, use perf to verify it's actually executed:
# Profile hot functions
perf record -g -F 999 ./bench_random_mixed_hakmem_minimal
perf report --no-children --sort comm,dso,symbol
# Check if mid_v3_debug_enabled appears in profile
perf report | grep mid_v3
Decision criteria:
- If function appears in perf report → Worth optimizing
- If function is in ASM but NOT in perf report → Dead code, leave alone
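The decision criteria, together with the "already optimized away" case from the stop criteria, can be written down as a small classifier. This is a sketch under stated assumptions: gate_verdict_t and classify_gate() are hypothetical names, and "appears in perf report" is modeled as a non-zero sample share, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    GATE_ALREADY_ELIMINATED, /* not in ASM: nothing to do */
    GATE_DEAD_CODE,          /* in ASM, absent from profile: leave alone */
    GATE_WORTH_OPTIMIZING    /* shows up in the runtime profile */
} gate_verdict_t;

/* in_asm:      symbol or call sites found via objdump
 * profile_pct: share of samples attributed to the gate in perf report,
 *              0.0 if it does not appear at all (assumed encoding) */
static gate_verdict_t classify_gate(bool in_asm, double profile_pct) {
    assert(profile_pct >= 0.0 && profile_pct <= 100.0);
    if (!in_asm) return GATE_ALREADY_ELIMINATED;
    if (profile_pct > 0.0) return GATE_WORTH_OPTIMIZING;
    return GATE_DEAD_CODE;
}
```

Under this scheme, mid_v3_enabled() (not in ASM) classifies as GATE_ALREADY_ELIMINATED. mid_v3_debug_enabled() (in ASM) would classify as GATE_DEAD_CODE if the profile confirms it is never sampled; Phase 41 inferred this from the guards rather than measuring it.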
2. Focus on Actually-Executed Gates
Priority list (requires profiling validation):
- Gates that appear in perf report top 50 functions
- Gates called in tight loops (identified via -C context in perf annotate)
- Gates with measurable CPU time (>0.1% in profile)
3. Accept Dead Code in ASM
Philosophy shift:
- Old: "If it's in ASM, optimize it"
- New: "If it's in ASM but not executed, ignore it"
Dead code that's never executed has zero runtime cost. Removing it risks layout tax.
4. Test Layout Stability
Before committing any structural change:
- Run 3× 10-run benchmarks (baseline, change, revert-verify)
- Check if results are reproducible
- Accept only if gain is ≥1.0% (to overcome layout noise)
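One way to make this acceptance rule mechanical is sketched below. The ≥1.0% gain threshold comes from the list above; the reproducibility tolerance (repro_tol_pct) and the function names are assumptions introduced for illustration, since the report does not specify how "reproducible" is measured.

```c
#include <assert.h>
#include <stdbool.h>

/* Absolute relative difference of x vs. ref, in percent. */
static double abs_pct_diff(double ref, double x) {
    double d = 100.0 * (x - ref) / ref;
    return d < 0.0 ? -d : d;
}

/* baseline, change, revert_verify: mean throughput of the three 10-run sets.
 * min_gain_pct:  required gain of `change` over `baseline` (1.0 per the list)
 * repro_tol_pct: hypothetical tolerance; revert_verify must land this close
 *                to baseline for the measurement to count as reproducible */
static bool change_accepted(double baseline, double change, double revert_verify,
                            double min_gain_pct, double repro_tol_pct) {
    assert(baseline > 0.0);
    bool reproducible = abs_pct_diff(baseline, revert_verify) <= repro_tol_pct;
    double gain_pct = 100.0 * (change - baseline) / baseline;
    return reproducible && gain_pct >= min_gain_pct;
}
```

With the Phase 41 means (55.97M baseline, 54.84M after Step 3) the gain is negative, so the check fails regardless of the revert-verify result.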
5. Alternative: Investigate Other Hot Gates
Instead of MID v3 gates (which are dead), profile to find:
- Tiny allocator gates that ARE executed
- Free path gates with measurable cost
- Size class routing decisions in hot paths
Quantitative Summary
| Phase | Target Gate(s) | ASM Present? | Executed? | Change | Result | Verdict |
|---|---|---|---|---|---|---|
| Phase 21 | tiny_header_mode() | No (optimized away) | No | N/A | N/A | Skipped |
| Phase 40 | tiny_header_mode() | No | No | Constantization | -2.47% | NO-GO |
| Phase 41 Step 2 | Condition reorder | N/A | N/A | N/A | Skipped | N/A |
| Phase 41 Step 3 | mid_v3_enabled(), mid_v3_debug_enabled() | Yes (debug only) | No (dead code) | Constantization | -2.02% | NO-GO |
Phase 41 Final Performance: 55.97M ops/s (baseline, no changes adopted)
Conclusion
Phase 41 successfully demonstrated the ASM-first gate audit methodology and confirmed its value. However, it also revealed a critical limitation:
ASM presence ≠ Performance impact
The gates we targeted (mid_v3_debug_enabled()) existed in assembly but were dead code inside if (mid_v3_enabled()) guards that compile to if (0). Attempting to eliminate this dead code via BENCH_MINIMAL constantization caused a -2.02% layout tax regression.
Key Takeaway:
- ✅ ASM inspection prevents wasting time on already-optimized gates (like Phase 21's tiny_header_mode())
- ❌ But ASM inspection alone is insufficient - need runtime profiling to distinguish executed vs. dead code
- ⚠️ Layout tax is a first-class optimization enemy - structural changes risk unpredictable regressions
Phase 42 Direction:
- Add perf record/report step to methodology
- Target only gates that appear in runtime profiles
- Accept dead code in ASM as zero-cost (don't fix what isn't broken)
- Require ≥1.0% gain to overcome layout noise
Phase 41 Verdict: NO-GO - Revert all changes, baseline remains FAST v3 = 55.97M ops/s