# Phase 18 v2: Hot Text Isolation — BENCH_MINIMAL Implementation Instructions

## Status

- Phase 18 v1 (layout + sections) → NO-GO (I-cache regression)
- Phase 18 v2 (instruction removal) → next phase
- Strategy: remove stats/ENV overhead at compile time
- Expected: +10-20% throughput, -30-40% instructions

Ref: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md`

---

## 0. Goal / Success Criteria

**GO threshold** (strict):
- Throughput: **+5% minimum** (+8% preferred)
- Instructions: **-15% minimum** (clear proof of concept)
- I-cache: automatic improvement from the smaller footprint

**NEUTRAL**:
- Throughput: within ±3% (marginal)
- Instructions: -5% to -15% (incomplete optimization)

**NO-GO**:
- Throughput: -2% or worse (regression)
- Instructions: less than a -5% reduction (failed to reduce overhead)

If instructions do not drop by at least 15%, abandon this phase (the allocator is not the bottleneck).

---

## 1. Implementation Steps

### 1.1 Add Makefile knob

**File**: `Makefile`

After the Phase 18 v1 section (around line 140), add:

```makefile
# Phase 18 v2: BENCH_MINIMAL (remove instrumentation for benchmark builds)
BENCH_MINIMAL ?= 0
ifeq ($(BENCH_MINIMAL),1)
CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
CFLAGS_SHARED += -DHAKMEM_BENCH_MINIMAL=1
# Note: both bench binaries and the shared lib will disable instrumentation.
# Mainly impacts bench_* binaries (where BENCH_MINIMAL is intentionally enabled).
endif
```

**Location**: after the `HOT_TEXT_GC_SECTIONS` section (line ~145).

### 1.2 Update front_fastlane_stats_box.h

**File**: `core/box/front_fastlane_stats_box.h`

Find the `FRONT_FASTLANE_STAT_INC` macro (currently defined as an atomic increment).
Replace the entire macro section:

```c
// Before:
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { \
        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
        } \
    } while (0)

// After:
#if HAKMEM_BENCH_MINIMAL
#define FRONT_FASTLANE_STAT_INC(stat) do { (void)0; } while (0)
#else
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { \
        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
        } \
    } while (0)
#endif
```

**Rationale**: stats collection becomes a no-op in BENCH_MINIMAL and is compiled away entirely.

### 1.3 Update malloc_tiny_fast.h free stats

**File**: `core/front/malloc_tiny_fast.h`

Search for `free_tiny_fast_mono_stat_inc_*` calls (probably in the `free_tiny_fast` function) and wrap each call:

```c
// Before:
free_tiny_fast_mono_stat_inc_hit();

// After:
#if !HAKMEM_BENCH_MINIMAL
free_tiny_fast_mono_stat_inc_hit();
#endif
```

Search pattern: `free_tiny_fast_mono_stat_` (should find ~5-10 calls).

**Rationale**: as with malloc, the free-path stats become optional in BENCH_MINIMAL.

### 1.4 Update hakmem_env_snapshot_box.h

**File**: `core/box/hakmem_env_snapshot_box.h`

Find the function `hakmem_env_snapshot_enabled()` and replace its implementation:

```c
// Before:
static inline bool hakmem_env_snapshot_enabled(void) {
    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
}

// After:
#if HAKMEM_BENCH_MINIMAL
// In bench mode, the snapshot is always enabled (one-time cost, compile-away benefit)
static inline bool hakmem_env_snapshot_enabled(void) {
    return 1;
}
#else
// Normal mode: runtime check
static inline bool hakmem_env_snapshot_enabled(void) {
    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
}
#endif
```

**Rationale**: ENV checks become compile-time constants in BENCH_MINIMAL, enabling better optimization.
### 1.5 (Optional) Update debug logging

**File**: any file with `hakmem_tls_trace_*` or `hakmem_diag_*` calls

Only if needed (lower priority). Wrap verbose logging:

```c
#if !HAKMEM_BENCH_MINIMAL
hakmem_tls_trace_alloc(ptr, size);
#endif
```

---

## 2. Build & Verify

### 2.1 Baseline build (BENCH_MINIMAL=0)

```sh
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem
```

Expected: ~653K (same as before).

### 2.2 Optimized build (BENCH_MINIMAL=1)

```sh
make clean
make -j BENCH_MINIMAL=1 bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem
```

Expected: roughly the same size.

**Note**: binary size may not change noticeably. The guards remove instructions inside hot functions at compile time; they do not remove whole sections at link time, so do not use `ls -lh` as the success metric — use the perf counters in section 3.

### 2.3 Verify compilation

Both builds must complete without errors. Check for mistakes in the conditional code:

```sh
# Errors like "undeclared identifier" mean a conditional guard is wrong
# (e.g., a symbol is defined in one #if branch but used outside it).
```

---

## 3. A/B Test Execution

### 3.1 Baseline run (BENCH_MINIMAL=0)

```sh
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```

Record:
- mean throughput
- stdev
- min/max

Run perf stat (note: `L1-icache-load-misses` is included in the event list so the I-cache figure below is actually measured):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,L1-icache-load-misses,dTLB-load-misses,minor-faults -- \
    env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
    ./bench_random_mixed_hakmem 200000000 400 1
```

Record:
- cycles
- instructions
- L1-icache-load-misses
- branch-miss %

### 3.2 Optimized run (BENCH_MINIMAL=1)

```sh
make clean
make -j BENCH_MINIMAL=1 bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```

Record the same metrics as the baseline, and run the same `perf stat` command.

### 3.3 Optional: system ceiling check

```sh
./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput"
```

(Already measured in Phase 17: ~85M ops/s.)

---

## 4. Analysis & GO/NO-GO Decision

### Compute deltas

```
Delta_throughput   = (optimized_mean - baseline_mean) / baseline_mean * 100
Delta_instructions = (optimized_instructions - baseline_instructions) / baseline_instructions * 100
Delta_icache       = (optimized_icache - baseline_icache) / baseline_icache * 100
```

### Decision logic

**GO** (ALL must hold):
- Delta_throughput ≥ +5%
- Delta_instructions ≤ -15%
- Delta_icache ≤ -10% (should follow the instruction reduction)

**NEUTRAL** (criteria marginal):
- Delta_throughput within ±3%
- Delta_instructions in [-15%, -5%]
- variance increase < 50%

**NO-GO** (any critical criterion missed):
- Delta_throughput < -2%
- Delta_instructions > -5% (failed to reduce overhead)
- variance increase > 100%

---

## 5. Reporting (required artifacts)

Create the file `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_AB_TEST_RESULTS.md` and include:

```markdown
# Phase 18 v2: BENCH_MINIMAL — A/B Test Results

| Metric | Baseline | Optimized | Delta |
|--------|----------|-----------|-------|
| Throughput (mean) | X.XXM | X.XXM | +/-X.XX% |
| Throughput (σ) | X.XXM | X.XXM | % |
| Instructions | X.XB | X.XB | +/-X.X% |
| I-cache misses | XXXK | XXXK | +/-X.X% |
| Cycles | X.XB | X.XB | +/-X.X% |

## Verdict

GO / NEUTRAL / NO-GO with reasoning
```

Update `CURRENT_TASK.md` (Phase 18 v2 status + next steps).

---

## 6. Important Notes

- Test the default (`BENCH_MINIMAL=0`) build first.
- Both BENCH_MINIMAL=0 and BENCH_MINIMAL=1 must compile successfully.
- If instructions drop by less than 15%, the optimization is incomplete (check whether the compiler was already eliminating this overhead).
- This is NOT a research-only knob; it is a valid bench build mode (safe to enable in CI).

---

## 7. If GO: Next Steps

1. **Promote BENCH_MINIMAL=1** as the default for bench builds.
2. **Prepare Phase 19** (next optimization frontier). Possible targets: SIMD prefetch, lock-free structures, allocation batching.
3. **Document learnings** about allocator overhead vs. memory latency.

---

## 8. If NO-GO: Lessons

- The allocator is memory-bound (IPC = 2.30, constant across phases).
- Instruction-count reduction does not always yield throughput gains when memory latency dominates.
- Architectural changes (cache-friendly layout, batching) are needed, rather than micro-optimizations.