Files

Moe Charm (CI) 7adbcdfcb6 Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement

## Summary

Completed Phase 54-60 optimization work:

**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset

**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY

**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc

**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized

## Key Metrics

- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes

## Files Added/Modified

New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h

Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py

Documentation: Phase 40-60 analysis documents

## Design Decisions

1. Profile separation (core/bench_profile.h):
   - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
   - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)

2. Box Theory compliance:
   - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
   - Single conversion points maintained
   - No physical deletions (compile-out only)

3. Lessons learned:
   - SSOT effective only where redundancy exists (Phase 60 showed limits)
   - Branch prediction extremely effective (~0 cycles for well-predicted branches)
   - Early-exit pattern valuable even when seemingly redundant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-17 06:24:01 +09:00

6.8 KiB

Raw Blame History

Phase 60: Alloc Pass-Down SSOT - A/B Test Results

Date: 2025-12-17 Verdict: NO-GO (-0.46%)

Executive Summary

Phase 60 attempted to reduce redundant computations in the allocation path by computing ENV snapshot, route kind, C7 ULTRA, and DUALHOT flags once at the entry point and passing them down to the allocation logic (SSOT pattern, similar to Free-side Phase 19-6C).

Result: The SSOT approach introduced a slight performance regression (-0.46%), making it a NO-GO. The added branch check if (alloc_passdown_ssot_enabled()) and the overhead of computing the context upfront outweighed any benefits from reducing duplicate calculations.

Step 0: Runtime Profiling (Prerequisite Check)

Command:

perf record -F 99 -g -- ./bench_random_mixed_hakmem_minimal 200000000 400 1
perf report --no-children | head -60

Top 50 Functions (overhead %):

31.27%  malloc
28.60%  free
21.82%  main
 4.14%  tiny_c7_ultra_alloc.constprop.0
 3.69%  free_tiny_fast_compute_route_and_heap.lto_priv.0
 3.50%  tiny_region_id_write_header.lto_priv.0
 2.16%  tiny_c7_ultra_free
 1.21%  unified_cache_push.lto_priv.0
 1.00%  hak_free_at.part.0
 0.47%  hak_force_libc_alloc.lto_priv.0
 0.46%  hak_super_lookup.part.0.lto_priv.4.lto_priv.0
 0.45%  hak_pool_try_alloc_v1_impl.part.0

Conclusion: Alloc-side functions (malloc, tiny_c7_ultra_alloc, tiny_region_id_write_header) are present in the Top 50, confirming that this Phase is worth investigating.

A/B Test Results (Mixed Benchmark, 10-run)

Baseline (HAKMEM_ALLOC_PASSDOWN_SSOT=0)

Command:

make bench_random_mixed_hakmem_minimal
BENCH_BIN=./bench_random_mixed_hakmem_minimal scripts/run_mixed_10_cleanenv.sh

Results:

Run  1: 60411170 ops/s
Run  2: 59748852 ops/s
Run  3: 59978565 ops/s
Run  4: 60709007 ops/s
Run  5: 60525102 ops/s
Run  6: 60140203 ops/s
Run  7: 58531001 ops/s
Run  8: 59976257 ops/s
Run  9: 59847921 ops/s
Run 10: 60617511 ops/s

Statistics:

Mean: 60,048,559 ops/s
Min: 58,531,001 ops/s
Max: 60,709,007 ops/s
StdDev: 597,500 ops/s
CV: 1.00%

Treatment (HAKMEM_ALLOC_PASSDOWN_SSOT=1)

Command:

make clean
make bench_random_mixed_hakmem_minimal EXTRA_CFLAGS='-DHAKMEM_ALLOC_PASSDOWN_SSOT=1'
BENCH_BIN=./bench_random_mixed_hakmem_minimal scripts/run_mixed_10_cleanenv.sh

Results:

Run  1: 60961455 ops/s
Run  2: 60006558 ops/s
Run  3: 59090044 ops/s
Run  4: 60244712 ops/s
Run  5: 60909895 ops/s
Run  6: 60470043 ops/s
Run  7: 59077611 ops/s
Run  8: 58890407 ops/s
Run  9: 60107925 ops/s
Run 10: 57966046 ops/s

Statistics:

Mean: 59,772,470 ops/s
Min: 57,966,046 ops/s
Max: 60,961,455 ops/s
StdDev: 925,965 ops/s
CV: 1.55%

Comparison

Metric	Baseline (SSOT=0)	Treatment (SSOT=1)	Delta
Mean	60,048,559 ops/s	59,772,470 ops/s	-0.46%
CV	1.00%	1.55%	+0.55pp
Min	58,531,001 ops/s	57,966,046 ops/s	-0.97%
Max	60,709,007 ops/s	60,961,455 ops/s	+0.42%

Verdict: NO-GO (regression of -0.46%, below the -1.0% threshold but still negative)

Root Cause Analysis

1. Added Branch Overhead

The SSOT approach requires a branch check at the entry point:

if (alloc_passdown_ssot_enabled()) {
    alloc_passdown_context_t ctx = alloc_passdown_context_compute(class_idx);
    return malloc_tiny_fast_for_class_ssot(size, class_idx, &ctx);
}

Even though alloc_passdown_ssot_enabled() is compile-time constant in HAKMEM_BENCH_MINIMAL, the branch itself adds overhead.

2. Duplicate Context Computation

The alloc_passdown_context_compute() function computes:

ENV snapshot (hakmem_env_snapshot())
C7 ULTRA enabled (tiny_c7_ultra_enabled_env())
DUALHOT enabled (alloc_dualhot_enabled())
Route kind (tiny_static_route_get_kind_fast() or tiny_policy_hot_get_route_with_env())

However, the original path already computes these values on-demand, and only when needed. The SSOT path computes them upfront, even if they are not used (e.g., if C7 ULTRA hits early, the route kind is not needed).

3. Pass-Down Overhead

The alloc_passdown_context_t struct is passed by pointer to malloc_tiny_fast_for_class_ssot(), which may introduce ABI overhead (register pressure, stack spills).

4. No Actual Redundancy Reduction

The original path has early exits (C7 ULTRA, DUALHOT) that avoid expensive computations in the common case. The SSOT path forces all computations upfront, negating the benefit of early exits.

Comparison with Free-Side Phase 19-6C

Free-Side Success (Phase 19-6C):

Free-side had many branches and duplicate policy snapshot calls across multiple code paths.
Pass-down eliminated these redundancies, resulting in a +1.5% improvement.

Alloc-Side Failure (Phase 60):

Alloc-side already has early exits (C7 ULTRA, DUALHOT) that avoid expensive computations.
The SSOT approach forces upfront computation, reducing the benefit of early exits.
The added branch check (if (alloc_passdown_ssot_enabled())) introduces overhead.

Conclusion: The SSOT pattern works well when there are many redundant computations across multiple code paths, but fails when the original path already has efficient early exits.

Decision

NO-GO: The SSOT approach introduces a slight regression (-0.46%) and does not provide the expected benefits. The implementation will remain OFF (default HAKMEM_ALLOC_PASSDOWN_SSOT=0), and the code will be kept as a research box for future reference.

Recommendations for Future Phases

Focus on Hot Functions: Continue profiling to identify the next hottest allocation functions (e.g., tiny_region_id_write_header, unified_cache_push).
Avoid Upfront Computation: For allocation paths with early exits, avoid forcing upfront computation. Instead, optimize the early-exit paths directly.
Branch Reduction: Investigate removing branches from the hot path (e.g., if (class_idx == 7 && c7_ultra_on)).
Inline Critical Functions: Ensure critical functions like tiny_c7_ultra_alloc are always inlined to reduce call overhead.

Box Theory Compliance

Single Conversion Point: The SSOT entry point computes the context once (compliant).
Clear Boundaries: The alloc_passdown_context_t struct defines the boundary (compliant).
Reversible: The HAKMEM_ALLOC_PASSDOWN_SSOT ENV gate allows rollback (compliant).
Performance: The approach did not improve performance (non-compliant).

Verdict: Box Theory compliant, but performance non-compliant (NO-GO).

6.8 KiB Raw Blame History

Phase 60: Alloc Pass-Down SSOT - A/B Test Results

Executive Summary

Step 0: Runtime Profiling (Prerequisite Check)

A/B Test Results (Mixed Benchmark, 10-run)

Baseline (HAKMEM_ALLOC_PASSDOWN_SSOT=0)

Treatment (HAKMEM_ALLOC_PASSDOWN_SSOT=1)

Comparison

Root Cause Analysis

1. Added Branch Overhead

2. Duplicate Context Computation

3. Pass-Down Overhead

4. No Actual Redundancy Reduction

Comparison with Free-Side Phase 19-6C

Decision

Recommendations for Future Phases

Box Theory Compliance

6.8 KiB

Raw Blame History