## Summary
Completed Phase 54-60 optimization work:
**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset
**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY
**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc
**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized
## Key Metrics
- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes
## Files Added/Modified
New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h
Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py
Documentation: Phase 40-60 analysis documents
## Design Decisions
1. Profile separation (core/bench_profile.h):
- MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
- MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)
2. Box Theory compliance:
- All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
- Single conversion points maintained
- No physical deletions (compile-out only)
3. Lessons learned:
- SSOT effective only where redundancy exists (Phase 60 showed limits)
- Branch prediction extremely effective (~0 cycles for well-predicted branches)
- Early-exit pattern valuable even when seemingly redundant
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
# Phase 60: Alloc Pass-Down SSOT - Implementation Guide
Date: 2025-12-17
Status: Implemented, NO-GO (kept as research box)
## Overview
Phase 60 implements a Single Source of Truth (SSOT) pattern for the allocation path: the ENV snapshot, route kind, C7 ULTRA flag, and DUALHOT flag are computed once at the entry point and passed down to the allocation logic.
Goal: Reduce redundant computations (ENV snapshot, route determination, etc.) by computing them once at the entry point.
Result: NO-GO (-0.46% regression). The implementation is kept as a research box with default OFF (HAKMEM_ALLOC_PASSDOWN_SSOT=0).
## Files Modified
### 1. New ENV Box
File: /mnt/workdisk/public_share/hakmem/core/box/alloc_passdown_ssot_env_box.h
Purpose: Provides the ENV gate for enabling/disabling the SSOT path.
Key Functions:
```c
// ENV gate (compile-time constant in HAKMEM_BENCH_MINIMAL)
static inline int alloc_passdown_ssot_enabled(void);
```
ENV Variable: HAKMEM_ALLOC_PASSDOWN_SSOT (default: 0, OFF)
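A gate of this kind is commonly written as a cached `getenv` lookup that collapses to a constant in minimal builds. The sketch below only illustrates that shape and is not the contents of alloc_passdown_ssot_env_box.h: `demo_parse_gate`, `demo_passdown_ssot_enabled`, and the `DEMO_BENCH_MINIMAL` macro are hypothetical names, and the rule that only a leading `1` enables the gate is an assumption.

```c
#include <stdlib.h>

/* Hypothetical sketch of an ENV gate box; demo_/DEMO_ names are not from hakmem. */

/* Pure parser: unset, or anything not starting with '1', means OFF. */
static inline int demo_parse_gate(const char* value) {
    return (value != NULL && value[0] == '1') ? 1 : 0;
}

#if defined(DEMO_BENCH_MINIMAL)
/* Minimal build: a constant, so the compiler can fold the gate branch away. */
static inline int demo_passdown_ssot_enabled(void) {
    return 0;
}
#else
/* Normal build: read the variable once and cache the answer.
 * Simplification: the cache is not synchronized across threads. */
static inline int demo_passdown_ssot_enabled(void) {
    static int cached = -1; /* -1 means "not read yet" */
    if (cached < 0) {
        cached = demo_parse_gate(getenv("HAKMEM_ALLOC_PASSDOWN_SSOT"));
    }
    return cached;
}
#endif
```

Keeping the parser separate from the cached getter makes the default-OFF behaviour easy to check without touching the process environment.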
### 2. Core Implementation
File: /mnt/workdisk/public_share/hakmem/core/front/malloc_tiny_fast.h
Key Changes:
#### a. Context Structure (Lines 92-97)
```c
// Alloc context: computed once at entry, passed down
typedef struct {
    const HakmemEnvSnapshot* env;  // ENV snapshot (NULL if snapshot disabled)
    SmallRouteKind route_kind;     // Route kind (LEGACY/ULTRA/MID/V7)
    bool c7_ultra_on;              // C7 ULTRA enabled
    bool alloc_dualhot_on;         // Alloc DUALHOT enabled (C0-C3 direct path)
} alloc_passdown_context_t;
```
#### b. Context Computation (Lines 200-220)
```c
// Phase 60: Compute context once at entry point
__attribute__((always_inline))
static inline alloc_passdown_context_t alloc_passdown_context_compute(int class_idx) {
    alloc_passdown_context_t ctx;

    // 1. ENV snapshot (once)
    ctx.env = hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;

    // 2. C7 ULTRA enabled (once)
    ctx.c7_ultra_on = ctx.env ? ctx.env->tiny_c7_ultra_enabled : tiny_c7_ultra_enabled_env();

    // 3. Alloc DUALHOT enabled (once)
    ctx.alloc_dualhot_on = alloc_dualhot_enabled();

    // 4. Route kind (once)
    if (tiny_static_route_ready_fast()) {
        ctx.route_kind = tiny_static_route_get_kind_fast(class_idx);
    } else {
        ctx.route_kind = tiny_policy_hot_get_route_with_env((uint32_t)class_idx, ctx.env);
    }

    return ctx;
}
```
#### c. SSOT Allocation Path (Lines 286-392)
```c
// Phase 60: SSOT mode allocation (uses pre-computed context)
__attribute__((always_inline))
static inline void* malloc_tiny_fast_for_class_ssot(size_t size, int class_idx,
                                                    const alloc_passdown_context_t* ctx) {
    // Stats
    tiny_front_alloc_stat_inc(class_idx);
    ALLOC_GATE_STAT_INC_CLASS(class_idx);

    // C7 ULTRA early-exit (uses ctx->c7_ultra_on)
    if (class_idx == 7 && ctx->c7_ultra_on) {
        void* ultra_p = tiny_c7_ultra_alloc(size);
        if (TINY_HOT_LIKELY(ultra_p != NULL)) {
            return ultra_p;
        }
    }

    // C0-C3 DUALHOT direct path (uses ctx->alloc_dualhot_on)
    if ((unsigned)class_idx <= 3u) {
        if (ctx->alloc_dualhot_on) {
            void* ptr = tiny_hot_alloc_fast(class_idx);
            if (TINY_HOT_LIKELY(ptr != NULL)) {
                return ptr;
            }
            return tiny_cold_refill_and_alloc(class_idx);
        }
    }

    // Routing dispatch (uses ctx->route_kind)
    const tiny_env_cfg_t* env_cfg = tiny_env_cfg();
    if (TINY_HOT_LIKELY(env_cfg->alloc_route_shape)) {
        if (TINY_HOT_LIKELY(ctx->route_kind == SMALL_ROUTE_LEGACY)) {
            void* ptr = tiny_hot_alloc_fast(class_idx);
            if (TINY_HOT_LIKELY(ptr != NULL)) {
                return ptr;
            }
            return tiny_cold_refill_and_alloc(class_idx);
        }
        return tiny_alloc_route_cold(ctx->route_kind, class_idx, size);
    }

    // Original dispatch (backward compatible)
    switch (ctx->route_kind) {
        case SMALL_ROUTE_ULTRA:
            // ... ULTRA path
            break;
        case SMALL_ROUTE_MID_V35:
            // ... MID v3.5 path
            break;
        case SMALL_ROUTE_V7:
            // ... V7 path
            break;
        case SMALL_ROUTE_LEGACY:
        default:
            break;
    }

    // LEGACY fallback
    void* ptr = tiny_hot_alloc_fast(class_idx);
    if (TINY_HOT_LIKELY(ptr != NULL)) {
        return ptr;
    }
    return tiny_cold_refill_and_alloc(class_idx);
}
```
#### d. Entry Point Dispatch (Lines 396-402)
```c
// Phase 60: Entry point dispatch
__attribute__((always_inline))
static inline void* malloc_tiny_fast_for_class(size_t size, int class_idx) {
    // Phase 60: SSOT mode (ENV gated)
    if (alloc_passdown_ssot_enabled()) {
        alloc_passdown_context_t ctx = alloc_passdown_context_compute(class_idx);
        return malloc_tiny_fast_for_class_ssot(size, class_idx, &ctx);
    }

    // Original path (backward compatible, default)
    // ... existing implementation ...
}
```
## Design Patterns
### 1. SSOT (Single Source of Truth)
Principle: Compute expensive values once at the entry point, then pass them down.
Benefits (intended):
- Avoid redundant ENV snapshot calls
- Avoid redundant route kind computations
- Reduce branch mispredictions
Actual Result: The original path already has early exits that avoid expensive computations. The SSOT approach forces upfront computation, negating the benefit of early exits.
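The effect can be seen in a toy model that counts how often an "expensive" lookup runs. This is a minimal sketch under stated assumptions, not hakmem code: every `demo_` name is hypothetical, and `demo_route_lookup` merely stands in for route determination.

```c
#include <stdbool.h>

/* Toy model: counts how often the "expensive" route lookup runs. */
static unsigned long demo_route_lookups = 0;

static int demo_route_lookup(int class_idx) {
    demo_route_lookups++;
    return class_idx & 3; /* placeholder route kind */
}

/* Original shape: early exit first, lookup only on fall-through. */
static int demo_alloc_lazy(int class_idx, bool dualhot_on) {
    if ((unsigned)class_idx <= 3u && dualhot_on) {
        return 100 + class_idx; /* fast path, no lookup */
    }
    return demo_route_lookup(class_idx);
}

/* SSOT shape: the context, including the route, is built before any early exit. */
typedef struct {
    bool dualhot_on;
    int route_kind;
} demo_ctx_t;

static int demo_alloc_ssot(int class_idx, bool dualhot_on) {
    demo_ctx_t ctx = { dualhot_on, demo_route_lookup(class_idx) };
    if ((unsigned)class_idx <= 3u && ctx.dualhot_on) {
        return 100 + class_idx; /* lookup already paid for */
    }
    return ctx.route_kind;
}
```

For class 1 with DUALHOT on, `demo_alloc_lazy` returns without touching the counter, while `demo_alloc_ssot` increments it once before taking the same fast path. Both variants return identical results; only the amount of upfront work differs.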
### 2. Pass-Down Pattern
Principle: Pass context via struct pointer to downstream functions.
Benefits (intended):
- Clear API boundary
- Avoid global state
Actual Result: Struct pass-down can introduce overhead (register pressure, stack spills), which adds to the cost of the upfront computation.
### 3. Always Inline
Principle: Use `__attribute__((always_inline))` to ensure the context computation is inlined.
Benefits:
- Reduce function call overhead
- Allow compiler to optimize across boundaries
Actual Result: Inlining works as expected, but the upfront computation overhead remains.
## Rollback Procedure
### Option 1: ENV Variable (Runtime)
Set HAKMEM_ALLOC_PASSDOWN_SSOT=0 (default).
### Option 2: Compile-Time (Build-Time)
Build without -DHAKMEM_ALLOC_PASSDOWN_SSOT=1:
```sh
make bench_random_mixed_hakmem_minimal
```
### Option 3: Code Removal (Permanent)
If the research box is no longer needed, remove:
- /mnt/workdisk/public_share/hakmem/core/box/alloc_passdown_ssot_env_box.h
- The SSOT dispatch code in `malloc_tiny_fast_for_class()` (lines 397-401)
- The `alloc_passdown_context_t` struct and related functions (lines 92-97, 200-220, and 286-392)
## Lessons Learned
### 1. Early Exits Are Powerful
The original allocation path has early exits (C7 ULTRA, DUALHOT) that avoid expensive computations in the common case. Forcing upfront computation negates these benefits.
### 2. Branch Cost
Even a single branch check (`if (alloc_passdown_ssot_enabled())`) can introduce measurable overhead in a hot path.
### 3. Pass-Down Overhead
Passing a struct by pointer can introduce overhead (register pressure, stack spills), especially when the struct contains multiple fields.
### 4. SSOT Is Not Always Better
The SSOT pattern works well when there are many redundant computations across multiple code paths (e.g., Free-side Phase 19-6C). It fails when the original path already has efficient early exits.
## Future Work
### Alternative Approaches
- Inline Critical Functions: Ensure `tiny_c7_ultra_alloc`, `tiny_region_id_write_header`, and `unified_cache_push` are always inlined.
- Branch Reduction: Remove branches from the hot path (e.g., combine `if (class_idx == 7 && c7_ultra_on)` into a single check).
- Profile-Guided Optimization (PGO): Use PGO to optimize branch prediction.
- Direct Dispatch: For common class indices (C0-C3, C7), use direct dispatch instead of switch statements.
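One possible reading of the Branch Reduction item is to fold the flag and the class test into a single per-class bitmask lookup. The sketch below is an untested idea rather than a measured optimization: `demo_ultra_mask` and `demo_takes_ultra` are hypothetical helpers, and the mask is assumed to be built once at initialization instead of on every allocation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: one bit per size class, set when that class may take the ULTRA path. */
static inline uint32_t demo_ultra_mask(bool c7_ultra_on) {
    return c7_ultra_on ? (UINT32_C(1) << 7) : UINT32_C(0);
}

/* Single test standing in for `class_idx == 7 && c7_ultra_on`.
 * Assumes 0 <= class_idx < 32. */
static inline bool demo_takes_ultra(uint32_t mask, int class_idx) {
    return ((mask >> (unsigned)class_idx) & UINT32_C(1)) != 0;
}
```

Whether this beats the original two-condition check depends on how the compiler already lowers it, so any gain would need to be confirmed by the same benchmarks used for the other phases.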
### Related Phases
- Phase 19-6C (Free-side SSOT): Successful (+1.5%) due to many redundant computations.
- Phase 43 (Branch vs Store): Branch cost is higher than store cost in hot paths.
- Phase 40/41 (ASM Analysis): Focus on functions that are actually executed at runtime.
## Box Theory Compliance
| Principle | Compliant? | Notes |
|---|---|---|
| Single Conversion Point | Yes | Entry point computes context once |
| Clear Boundaries | Yes | alloc_passdown_context_t defines the boundary |
| Reversible | Yes | ENV gate allows rollback |
| No Side Effects | Yes | Context is immutable after computation |
| Performance | No | -0.46% regression (NO-GO) |
Overall: Box Theory compliant, but performance non-compliant (NO-GO).