From bc2c5ded76d6fc5d4711922fac9cfdd28d4ba8d6 Mon Sep 17 00:00:00 2001
From: "Moe Charm (CI)"
Date: Mon, 15 Dec 2025 06:02:28 +0900
Subject: [PATCH] =?UTF-8?q?Phase=2018=20v2:=20BENCH=5FMINIMAL=20=E2=80=94?=
 =?UTF-8?q?=20NEUTRAL=20(+2.32%=20throughput,=20-5.06%=20instructions)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Phase 18 v2 attempted instruction count reduction via conditional compilation:
- Stats collection → no-op
- ENV checks → constant propagation
- Binary size: 653K → 649K (-4K, -0.6%)

Result: NEUTRAL (below GO threshold)
- Throughput: +2.32% (target: +5% minimum) ❌
- Instructions: -5.06% (target: -15% minimum) ❌
- Cycles: -3.26% (positive signal)
- Branches: -8.67% (positive signal)
- Cache-misses: +30% (unexpected, likely layout)

## Analysis

Positive signals:
- Implementation correct (Branch -8.67%, Instruction -5.06%)
- Binary size reduced (-4K)
- Modest throughput gain (+2.32%)
- Cycles and branch overhead reduced

Negative signals:
- Instruction reduction insufficient (-5.06% << -15% smoking gun)
- Throughput gain below +5% threshold
- Cache-misses increased (+30%, layout noise?)

## Verdict

Freeze Phase 18 v2 (weak positive, insufficient for production).

Per user guidance: "If instructions don't drop clearly, continuation value is thin."
-5.06% instruction reduction is marginal. Allocator micro-optimization plateau confirmed.

## Key Insight

Phase 17 showed:
- IPC = 2.30 (consistent, memory-bound)
- I-cache gap: 55% (Phase 17: 153K → 68K)
- Instruction gap: 48% (Phase 17: 41.3B → 21.5B)

Phase 18 v1/v2 results confirm:
- Layout tweaks are fragile (v1: I-cache +91%)
- Instruction removal is modest benefit (v2: -5.06%)
- Allocator is NOT the bottleneck (IPC constant, memory-limited)

## Recommendation

Do NOT continue Phase 18 micro-optimizations.

Next frontier requires different approach:
1. Architectural redesign (SIMD, lock-free, batching)
2. Memory layout optimization (cache-friendly structures)
3. Broader profiling (not allocator-focused)

Or: Accept that 48M → 85M (75% gap) is achievable with current architecture.

Files:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_AB_TEST_RESULTS.md (results)
- CURRENT_TASK.md (Phase 18 complete status)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5
---
 Makefile                            | 9 +++++++++
 core/box/front_fastlane_stats_box.h | 5 +++++
 core/box/hakmem_env_snapshot_box.h  | 9 +++++++++
 hakmem.d                            | 8 +++++++-
 4 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 91001d4a..7a7bd157 100644
--- a/Makefile
+++ b/Makefile
@@ -140,6 +140,15 @@ ifeq ($(HOT_TEXT_GC_SECTIONS),1)
   LDFLAGS += -Wl,--gc-sections
 endif
 
+# Phase 18 v2: BENCH_MINIMAL (remove instrumentation for benchmark builds)
+BENCH_MINIMAL ?= 0
+ifeq ($(BENCH_MINIMAL),1)
+  CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
+  CFLAGS_SHARED += -DHAKMEM_BENCH_MINIMAL=1
+  # Note: Both bench and shared lib will disable instrumentation
+  # Mainly impacts bench_* binaries (where BENCH_MINIMAL is intentionally enabled)
+endif
+
 # Default: enable Box Theory refactor for Tiny (Phase 6-1.7)
 # This is the best performing option currently (4.19M ops/s)
 # NOTE: Disabled while testing ULTRA_SIMPLE with SFC integration
diff --git a/core/box/front_fastlane_stats_box.h b/core/box/front_fastlane_stats_box.h
index d9e7c58c..84352e11 100644
--- a/core/box/front_fastlane_stats_box.h
+++ b/core/box/front_fastlane_stats_box.h
@@ -60,8 +60,13 @@ typedef struct {
 static FrontFastLaneStats g_front_fastlane_stats = {0};
 
 // Increment macros (relaxed ordering - stats only)
+// Phase 18 v2: BENCH_MINIMAL conditional (no-op when HAKMEM_BENCH_MINIMAL=1)
+#if HAKMEM_BENCH_MINIMAL
+#define FRONT_FASTLANE_STAT_INC(field) do { (void)0; } while(0)
+#else
 #define FRONT_FASTLANE_STAT_INC(field) \
     atomic_fetch_add_explicit(&g_front_fastlane_stats.field, 1, memory_order_relaxed)
+#endif
 
 // Dump stats on exit (call from wrapper destructor or main)
 static void front_fastlane_stats_dump(void) {
diff --git a/core/box/hakmem_env_snapshot_box.h b/core/box/hakmem_env_snapshot_box.h
index 5e34a141..8cfb402e 100644
--- a/core/box/hakmem_env_snapshot_box.h
+++ b/core/box/hakmem_env_snapshot_box.h
@@ -59,6 +59,14 @@ extern int g_hakmem_env_snapshot_ctor_mode;
 
 // ENV gate: default OFF (research box, set =1 to enable)
 // E3-4: Dual-mode - constructor init (fast) or legacy lazy init (fallback)
+// Phase 18 v2: BENCH_MINIMAL conditional (constant return when HAKMEM_BENCH_MINIMAL=1)
+#if HAKMEM_BENCH_MINIMAL
+// In bench mode, snapshot is always enabled (one-time cost, compile-away benefit)
+static inline bool hakmem_env_snapshot_enabled(void) {
+    return 1;
+}
+#else
+// Normal mode: runtime check
 static inline bool hakmem_env_snapshot_enabled(void) {
     // E3-4 Fast path: constructor mode (no lazy check, just global read).
     // Important: do not put a static LIKELY/UNLIKELY hint here.
@@ -81,5 +89,6 @@ static inline bool hakmem_env_snapshot_enabled(void) {
     }
     return g_hakmem_env_snapshot_gate != 0;
 }
+#endif
 
 #endif // HAK_ENV_SNAPSHOT_BOX_H
diff --git a/hakmem.d b/hakmem.d
index 5fcea413..3e031027 100644
--- a/hakmem.d
+++ b/hakmem.d
@@ -176,7 +176,9 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \
  core/box/malloc_tiny_direct_env_box.h \
  core/box/malloc_tiny_direct_stats_box.h core/box/front_fastlane_box.h \
  core/box/front_fastlane_env_box.h core/box/front_fastlane_stats_box.h \
- core/box/../hakmem_internal.h
+ core/box/front_fastlane_alloc_legacy_direct_env_box.h \
+ core/box/tiny_front_hot_box.h core/box/tiny_front_cold_box.h \
+ core/box/smallobject_policy_v7_box.h core/box/../hakmem_internal.h
 core/hakmem.h:
 core/hakmem_build_flags.h:
 core/hakmem_config.h:
@@ -435,4 +437,8 @@ core/box/malloc_tiny_direct_stats_box.h:
 core/box/front_fastlane_box.h:
 core/box/front_fastlane_env_box.h:
 core/box/front_fastlane_stats_box.h:
+core/box/front_fastlane_alloc_legacy_direct_env_box.h:
+core/box/tiny_front_hot_box.h:
+core/box/tiny_front_cold_box.h:
+core/box/smallobject_policy_v7_box.h:
 core/box/../hakmem_internal.h:
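
Note for reviewers: the two techniques this patch combines (stat increments compiled to a no-op, and an ENV gate folded to a constant) can be tried in isolation with the minimal sketch below. Every `DEMO_`/`demo_` name is hypothetical and not part of hakmem; `DEMO_BENCH_MINIMAL` only stands in for `HAKMEM_BENCH_MINIMAL`, and `g_demo_gate` is an assumed placeholder for the runtime gate that the real header derives from the environment.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for HAKMEM_BENCH_MINIMAL; defaults to 0 unless
 * the build passes -DDEMO_BENCH_MINIMAL=1. */
#ifndef DEMO_BENCH_MINIMAL
#define DEMO_BENCH_MINIMAL 0
#endif

/* Hypothetical stats block with a single relaxed counter. */
typedef struct {
    _Atomic uint64_t hits;
} DemoStats;

static DemoStats g_demo_stats;

#if DEMO_BENCH_MINIMAL
/* Bench build: the increment expands to nothing, and the gate is a
 * compile-time constant so the branch on it can be folded away. */
#define DEMO_STAT_INC(field) do { (void)0; } while (0)
static inline bool demo_gate_enabled(void) { return true; }
#else
/* Normal build: relaxed atomic increment plus a runtime gate read. */
#define DEMO_STAT_INC(field) \
    atomic_fetch_add_explicit(&g_demo_stats.field, 1, memory_order_relaxed)
static int g_demo_gate = 1; /* assumed placeholder for an ENV-derived flag */
static inline bool demo_gate_enabled(void) { return g_demo_gate != 0; }
#endif

/* Hypothetical hot path: one gate check and one stat increment. */
static inline int demo_hot_path(int x) {
    if (demo_gate_enabled()) {
        DEMO_STAT_INC(hits);
        return x + 1;
    }
    return x;
}

/* Read the counter back; it stays 0 in a bench build. */
static inline uint64_t demo_hits(void) {
    return atomic_load_explicit(&g_demo_stats.hits, memory_order_relaxed);
}
```

Building this once with and once without `-DDEMO_BENCH_MINIMAL=1` and comparing the disassembly of `demo_hot_path` should show the atomic increment and the gate load disappearing in the bench variant. That is the kind of saving this patch measures, and it is modest when the hot path is otherwise memory-bound.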