# Phase 18 v2: Hot Text Isolation - BENCH_MINIMAL Design

## Context (Phase 18 v1 failed)

Phase 18 v1 attempted layout optimization using section splitting + GC (`-ffunction-sections -fdata-sections -Wl,--gc-sections`).

**Result**: CATASTROPHIC I-cache regression (+91%).

**Root cause**: Section-based splitting without explicit hot symbol ordering destroyed code locality.

**Lesson**: Layout tweaks are too fragile; **instruction count reduction is the direct path**.

---

## 0. Goal (High Impact)

Reduce the **instruction footprint** per allocation/free by removing non-essential code paths at compile time:
- Stats collection (counter increments on every operation)
- Environment variable checks (TLS lookups)
- Debug logging (conditional output)

**Expected impact**:
- Instruction count: -30-40% (benchmark workload)
- I-cache misses: automatic improvement from the smaller working set
- Throughput: +10-20% (the -48% instruction gap measured in Phase 17 should close proportionally)

**GO Criteria** (strict):
- Throughput: +5% minimum, +8% preferred
- Instructions: -15% minimum (clear, measurable reduction)
- I-cache: should improve proportionally to the instruction reduction

If instructions do not drop meaningfully, abandon this phase.

---

## 1. Strategy: BENCH_MINIMAL Build Mode

### 1.1 What to Remove

**Category A: Stats collection** (highest frequency, low value for bench)

```c
FRONT_FASTLANE_STAT_INC(malloc_total);        // ← 1 per malloc
FRONT_FASTLANE_STAT_INC(malloc_hit);          // ← 1 per malloc
FRONT_FASTLANE_STAT_INC(malloc_fallback_*);   // ← conditional
free_tiny_fast_mono_stat_inc_hit();           // ← 1 per free
```

**Category B: Environment variable checks** (per-operation cost if not short-circuited)

```c
if (hakmem_env_snapshot_enabled()) { ... }    // TLS + branch
if (front_fastlane_enabled()) { ... }         // TLS + branch
```

**Category C: Debug logging / verbose output** (rarely needed in bench)

```c
hakmem_tls_trace_alloc();                     // log path
hakmem_diag_record_fallback();
```

### 1.2 Compile-Time Gate

New build mode:

```
HAKMEM_BENCH_MINIMAL=0/1 (default 0, opt-in)
```

Activated via a Makefile knob:

```makefile
BENCH_MINIMAL ?= 0
ifeq ($(BENCH_MINIMAL),1)
CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
# ... applies to both bench binaries and shared lib
endif
```

**Important**: This is **not** a research-only knob. It is a build mode that only disables instrumentation, so it is safe to apply to both the bench binaries and the shared lib.

### 1.3 Safeguards

1. **Conditional compilation markers**:
   - Wrap stats inside `#if !HAKMEM_BENCH_MINIMAL`
   - Wrap ENV checks inside `#if !HAKMEM_BENCH_MINIMAL`
   - Ensure the code is still correct when the knob is OFF (normal operation)

2. **No behavioral changes**:
   - Fast paths must remain identical (stats are just instrumentation)
   - Slow paths (fallback, cold) can be simplified (less instrumentation)

3. **Rollback**: Single build knob, reversible

---

## 2. Implementation Plan

### Phase 2.1: Stats removal (Priority 1)

**File**: `core/box/front_fastlane_stats_box.h`

```c
#if HAKMEM_BENCH_MINIMAL
// Bench: stats macro compiles away entirely
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { (void)0; } while (0)  // ← becomes no-op
#else
// Normal: relaxed atomic counter increment
#define FRONT_FASTLANE_STAT_INC(stat) \
    atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed)
#endif
```

**File**: `core/front/malloc_tiny_fast.h`

Search for all `free_tiny_fast_mono_stat_*` calls and wrap them:

```c
#if !HAKMEM_BENCH_MINIMAL
    free_tiny_fast_mono_stat_inc_hit();
#endif
```

**Expected savings**: ~20-30 instructions/op (counter increments + memory sync)

### Phase 2.2: Environment variable check removal (Priority 1)

**File**: `core/box/hakmem_env_snapshot_box.h`

```c
#if !HAKMEM_BENCH_MINIMAL
// Normal: Check ENV at runtime
static inline bool hakmem_env_snapshot_enabled(void) {
    return g_env_snapshot_enabled;
}
#else
// Bench: Always enable snapshot (one-time cost)
static inline bool hakmem_env_snapshot_enabled(void) {
    return true;  // ← compile-time constant, eliminated by optimizer
}
#endif
```

**File**: `core/bench_profile.h`

Apply the same pattern to profile-related checks.

**Expected savings**: ~5-10 instructions/op (TLS lookups + branches)

### Phase 2.3: Debug logging removal (Priority 2, if needed)

**File**: trace/logging functions

Conditional compilation for verbose paths (these are often already guarded by `HAKMEM_BUILD_DEBUG`).

**Expected savings**: ~0-5 instructions/op (rare paths, low impact for bench)

---

## 3. Risks / Mitigations

### Risk A: Benchmark no longer representative

If we remove instrumentation, does the bench still measure the allocator?

**Mitigation**: BENCH_MINIMAL disables stats (instrumentation), not core logic.
- Fast paths remain identical.
- Only instrumentation overhead is removed.
- This is similar to how production binaries disable debug tracing.

### Risk B: Build regression

Conditional compilation could hide bugs.

**Mitigation**:
- Ensure `BENCH_MINIMAL=0` (default) is always tested first.
- Test both modes in CI if available.
- Manually verify that the code compiles cleanly in both modes.

### Risk C: Instruction count doesn't drop

If removing stats + ENV checks doesn't clearly drop instructions, it means one of:
- The compiler/LTO already optimizes these away in RELEASE builds
- The overhead is elsewhere (memory access patterns, branch misprediction)

**Mitigation**: Run perf stat with `BENCH_MINIMAL=1` to confirm the instruction reduction. If the reduction is smaller than 10%, reconsider the strategy.

---

## 4. Expected Impact

**Conservative** (stats only):
- Throughput: +5-8%
- Instructions: -20-30%
- I-cache: improvement follows from instruction reduction

**Ambitious** (stats + ENV + debug):
- Throughput: +10-20%
- Instructions: -30-40%
- I-cache: -20-30% (proportional to instruction reduction)

**System binary ceiling**: 85M ops/s
- Baseline: 48M ops/s
- Target: 52-55M ops/s (via +5-15%)
- Would be 60-68M ops/s with ambitious impact

---

## 5. Box Theory Compliance

**Box**: BenchMinimalBox
- **Boundary**: `HAKMEM_BENCH_MINIMAL=0/1` (compile-time)
- **Scope**: Instrumentation removal only (no algorithm changes)
- **Rollback**: Single build knob (default OFF, backward compatible)
- **Observability**: Perf stat before/after (instructions, I-cache, throughput)

---

## 6. Success Criteria

### GO (Proceed):
- Throughput: **+5% or more**
- Instructions: **reduced by 15% or more** (smoking gun for success)
- I-cache: should improve proportionally

### NEUTRAL (keep as research box):
- Throughput: ±3% (within noise)
- Instructions: reduced by 5% to 15% (marginal)

### NO-GO (freeze):
- Throughput: regression worse than -2%, or no improvement
- Instructions: reduced by less than 5% (failed optimization objective)

---

## 7. Files to Modify

1. **`core/box/hot_text_attrs_box.h`** (update to add BENCH_MINIMAL support)
   - Already created in Phase 18 v1

2. **`core/box/front_fastlane_stats_box.h`** (stats conditional)
   - Update macro to be a no-op when BENCH_MINIMAL=1

3. **`core/front/malloc_tiny_fast.h`** (free stats wrapper)
   - Wrap `free_tiny_fast_mono_stat_*` calls

4. **`core/box/hakmem_env_snapshot_box.h`** (ENV check simplification)
   - Make `hakmem_env_snapshot_enabled()` constant when BENCH_MINIMAL=1

5. **`core/bench_profile.h`** (profile checks)
   - Simplify profile-related checks

6. **`Makefile`**
   - Add `BENCH_MINIMAL ?= 0` knob
   - Apply `-DHAKMEM_BENCH_MINIMAL=1` when enabled

---

## 8. Notes

- **Production safety**: BENCH_MINIMAL is DISABLED by default. Production builds use full instrumentation.
- **Bench-only by intent**: The mode targets benchmark builds; shipped libraries keep the default (`BENCH_MINIMAL=0`).
- **Phase 17 learning**: The system allocator executed 48% fewer instructions than hakmem. If BENCH_MINIMAL achieves -30% to -40%, we should see +10-20% throughput.

---

## 9. Next Steps After Phase 18 v2

### If GO (+5%+):
- Promote BENCH_MINIMAL=1 as the new baseline for bench builds
- Update CURRENT_TASK with the Phase 18 v2 victory
- Prepare Phase 19 (next optimization frontier)

### If NEUTRAL:
- Investigate why instructions didn't drop (compiler already optimized?)
- Consider aggressive options: SIMD, allocation batching, lock-free

### If NO-GO:
- Acknowledge that the Phase 17 +74% gap is primarily memory-bound (IPC=2.30)
- Shift focus to broader architectural changes
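
---

Whichever outcome applies, the measurement rests on the gating pattern from Phases 2.1 and 2.2 behaving as intended in both build modes. The following is a minimal, self-contained sketch of that pattern, not the actual hakmem code: `demo_stats_t`, `g_demo_stats`, `DEMO_STAT_INC`, `g_demo_env_snapshot_enabled`, `demo_env_snapshot_enabled`, `demo_malloc_fast`, and the two counter readers are hypothetical stand-ins introduced only for illustration. Only the `HAKMEM_BENCH_MINIMAL` knob name comes from this design.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Knob from this design; defaults to OFF when the build does not set it. */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

static_assert(HAKMEM_BENCH_MINIMAL == 0 || HAKMEM_BENCH_MINIMAL == 1,
              "HAKMEM_BENCH_MINIMAL must be 0 or 1");

/* Hypothetical stats block standing in for the real counters. */
typedef struct {
    atomic_uint_fast64_t malloc_total;
    atomic_uint_fast64_t malloc_hit;
} demo_stats_t;

static demo_stats_t g_demo_stats;

/* Hypothetical runtime flag standing in for the ENV snapshot state. */
bool g_demo_env_snapshot_enabled = true;

#if HAKMEM_BENCH_MINIMAL
/* Bench build: the increment compiles to nothing. */
#define DEMO_STAT_INC(stat) do { (void)0; } while (0)
#else
/* Normal build: relaxed atomic increment on every call. */
#define DEMO_STAT_INC(stat) \
    atomic_fetch_add_explicit(&g_demo_stats.stat, 1, memory_order_relaxed)
#endif

static inline bool demo_env_snapshot_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    return true;                         /* constant, branch can be folded */
#else
    return g_demo_env_snapshot_enabled;  /* runtime load + branch */
#endif
}

/* Stand-in fast path: returns the cached block on a hit, NULL otherwise.
 * The return value is the same in both build modes; only the counter
 * updates differ. */
static inline void *demo_malloc_fast(void *cached_block) {
    DEMO_STAT_INC(malloc_total);
    if (demo_env_snapshot_enabled() && cached_block != NULL) {
        DEMO_STAT_INC(malloc_hit);
        return cached_block;
    }
    return NULL;
}

static inline uint64_t demo_stat_malloc_total(void) {
    return (uint64_t)atomic_load_explicit(&g_demo_stats.malloc_total,
                                          memory_order_relaxed);
}

static inline uint64_t demo_stat_malloc_hit(void) {
    return (uint64_t)atomic_load_explicit(&g_demo_stats.malloc_hit,
                                          memory_order_relaxed);
}
```

Compiling a file like this once with and once without `-DHAKMEM_BENCH_MINIMAL=1` and comparing the disassembly of the fast path is one quick way to check that the counter updates and the flag load are really gone before running perf stat. If the two outputs barely differ, that points toward the Risk C case where the compiler was already removing the overhead.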