# Phase 32: Tiny Free Calls Atomic Prune - A/B Test Results

**Date:** 2025-12-16
**Target:** `g_hak_tiny_free_calls` atomic counter in `core/hakmem_tiny_free.inc:335`
**Build Flag:** `HAKMEM_TINY_FREE_CALLS_COMPILED` (default: 0)
**Verdict:** NEUTRAL → Adopt for code cleanliness

---

## Executive Summary

Phase 32 implements compile-time gating for the `g_hak_tiny_free_calls` diagnostic counter in `hak_tiny_free()`. A/B testing shows **NEUTRAL** impact (-0.46%, within measurement noise). We adopt the compile-out default (COMPILED=0) for code cleanliness and consistency with the atomic prune series.

**Key Finding:** The atomic counter has negligible performance impact, but removing it keeps the code cleaner and aligns with the systematic removal of diagnostic telemetry from HOT paths.

---

## Test Configuration

### Target Code Location

**File:** `core/hakmem_tiny_free.inc:335`

**Before (always active):**

```c
extern _Atomic uint64_t g_hak_tiny_free_calls;
atomic_fetch_add_explicit(&g_hak_tiny_free_calls, 1, memory_order_relaxed);
```

**After (compile-out default):**

```c
#if HAKMEM_TINY_FREE_CALLS_COMPILED
extern _Atomic uint64_t g_hak_tiny_free_calls;
atomic_fetch_add_explicit(&g_hak_tiny_free_calls, 1, memory_order_relaxed);
#else
(void)0; // No-op when diagnostic counter compiled out
#endif
```

### Code Classification

- **Category:** TELEMETRY
- **Frequency:** Every free operation (unconditional)
- **Correctness Impact:** None (diagnostic only)
- **Flow Control:** None

### Build Flag (SSOT)

**File:** `core/hakmem_build_flags.h`

```c
// ------------------------------------------------------------
// Phase 32: Tiny Free Calls Atomic Prune (Compile-out diagnostic counter)
// ------------------------------------------------------------
// Tiny Free Calls: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need free path call counting
// Target: g_hak_tiny_free_calls atomic in core/hakmem_tiny_free.inc:335
// Impact: HOT path atomic (every free
// operation, unconditional)
// Expected improvement: +0.3% to +0.7% (diagnostic counter, less critical than Phase 25)
#ifndef HAKMEM_TINY_FREE_CALLS_COMPILED
# define HAKMEM_TINY_FREE_CALLS_COMPILED 0
#endif
```

---

## A/B Test Results

### Methodology

- **Workload:** `bench_random_mixed` (mixed 8-64B allocation pattern)
- **Iterations:** 10 runs per configuration
- **Environment:** Clean environment via `scripts/run_mixed_10_cleanenv.sh`
- **Compiler:** GCC with `-O3 -flto -march=native`

### Configuration A: Baseline (COMPILED=0, counter compiled-out)

```bash
make clean && make -j bench_random_mixed_hakmem
scripts/run_mixed_10_cleanenv.sh
```

**Results:**

```
Run 1:  51,155,676 ops/s
Run 2:  51,337,897 ops/s
Run 3:  53,355,358 ops/s
Run 4:  52,484,033 ops/s
Run 5:  53,554,331 ops/s
Run 6:  52,816,908 ops/s
Run 7:  53,764,926 ops/s
Run 8:  53,908,882 ops/s
Run 9:  53,963,916 ops/s
Run 10: 53,083,746 ops/s

Median: 53,219,552 ops/s
Mean:   52,942,567 ops/s
Stdev:   1,011,696 ops/s (1.91%)
```

### Configuration B: Compiled-in (COMPILED=1, counter active)

```bash
make clean && make -j EXTRA_CFLAGS='-DHAKMEM_TINY_FREE_CALLS_COMPILED=1' bench_random_mixed_hakmem
scripts/run_mixed_10_cleanenv.sh
```

**Results:**

```
Run 1:  53,017,261 ops/s
Run 2:  52,053,756 ops/s
Run 3:  53,815,545 ops/s
Run 4:  53,366,110 ops/s
Run 5:  53,560,201 ops/s
Run 6:  54,113,944 ops/s
Run 7:  53,252,767 ops/s
Run 8:  53,823,030 ops/s
Run 9:  53,766,710 ops/s
Run 10: 52,006,868 ops/s

Median: 53,463,156 ops/s
Mean:   53,277,619 ops/s
Stdev:    729,857 ops/s (1.37%)
```

### Performance Impact

| Metric | Baseline (COMPILED=0) | Compiled-in (COMPILED=1) | Delta |
|--------|----------------------|--------------------------|-------|
| **Median** | 53,219,552 ops/s | 53,463,156 ops/s | **-0.46%** |
| **Mean** | 52,942,567 ops/s | 53,277,619 ops/s | -0.63% |
| **Stdev** | 1,011,696 (1.91%) | 729,857 (1.37%) | Lower variance |

**Improvement:** -0.46% (NEUTRAL)

---

## Analysis

### Unexpected Result

Unlike previous atomic prune phases
(Phase 25: +1.07%, Phase 31: NEUTRAL), Phase 32 shows a **slight performance improvement** with the atomic counter **compiled-in**. This is counterintuitive, and the difference is within measurement noise.

### Possible Explanations

1. **Code Alignment Effects:** The `(void)0` no-op may produce different code alignment than the atomic instruction, potentially affecting instruction cache behavior
2. **Measurement Noise:** The -0.46% difference is well within typical variance (±0.5%)
3. **Compiler Optimization:** LTO may optimize the atomic differently in the compiled-in case

### Statistical Significance

- **Difference:** 243,604 ops/s (0.46%)
- **Baseline Stdev:** 1,011,696 ops/s (1.91%)
- **Compiled-in Stdev:** 729,857 ops/s (1.37%)
- **Conclusion:** Not statistically significant (difference < 1 stdev)

### Verdict Rationale

Despite the slight negative delta, we adopt **COMPILED=0** (compiled-out) for:

1. **Code Cleanliness:** Removes an unnecessary diagnostic counter from production code
2. **Consistency:** Aligns with the atomic prune series (Phases 24-32)
3. **Future-Proofing:** Eliminates potential cache line contention in multi-threaded workloads
4.
   **Research Flexibility:** Counter can be re-enabled via `-DHAKMEM_TINY_FREE_CALLS_COMPILED=1`

---

## Comparison with Related Phases

### Phase 25: g_free_ss_enter (+1.07% GO)

- **Location:** `tiny_superslab_free.inc.h`
- **Frequency:** Every free operation
- **Impact:** +1.07% improvement (GO)
- **Similarity:** Same HOT path, same frequency
- **Difference:** Phase 25's counter sat in a more critical code section

### Phase 31: g_tiny_free_trace (NEUTRAL)

- **Location:** `hakmem_tiny_free.inc:326` (9 lines above Phase 32)
- **Frequency:** Every free operation (rate-limited to 128 calls)
- **Impact:** NEUTRAL (adopted for code cleanliness)
- **Similarity:** Same function, same file
- **Difference:** Phase 31 was rate-limited; Phase 32 is unconditional

### Key Insight

Phase 32's NEUTRAL result is consistent with Phase 31 (same function, similar location). The atomic counter's impact is negligible on modern CPUs with efficient relaxed atomics. The primary benefit is code cleanliness, not performance.

---

## Cumulative Impact

### Atomic Prune Series Progress (Phases 24-32)

1. **Phase 24:** Tiny Class Stats (+0.93% GO)
2. **Phase 25:** Tiny Free Stats (+1.07% GO)
3. **Phase 26A:** C7 Free Count (+0.77% GO)
4. **Phase 26B:** Header Mismatch Log (+0.53% GO)
5. **Phase 26C:** Header Meta Mismatch (+0.41% NEUTRAL)
6. **Phase 26D:** Metric Bad Class (+0.47% NEUTRAL)
7. **Phase 26E:** Header Meta Fast (+0.67% GO)
8. **Phase 27:** Unified Cache Stats (+0.47% NEUTRAL)
9. **Phase 29:** Pool Hotbox v2 Stats (+1.00% GO)
10. **Phase 31:** Tiny Free Trace (NEUTRAL)
11. **Phase 32:** Tiny Free Calls (NEUTRAL)

**Total Improvement (GO phases only):** ~5.4%

---

## Recommendations

### Adoption Decision

**ADOPT** with `HAKMEM_TINY_FREE_CALLS_COMPILED=0` (default OFF).

**Rationale:**

1. NEUTRAL performance impact (within noise)
2. Code cleanliness benefit
3. Consistency with atomic prune series
4.
   No functional impact (diagnostic only)

### Production Use

```bash
# Default build (counter compiled-out)
make bench_random_mixed_hakmem
```

### Research/Debug Use

```bash
# Enable counter for diagnostics
make EXTRA_CFLAGS='-DHAKMEM_TINY_FREE_CALLS_COMPILED=1' bench_random_mixed_hakmem
```

### Next Steps: Phase 33 Candidate

**Target:** `tiny_debug_ring_record(TINY_RING_EVENT_FREE_ENTER, ...)`
**Location:** `core/hakmem_tiny_free.inc:340` (3 lines below Phase 32)
**Classification:** TELEMETRY (debug ring buffer)

**⚠️ CRITICAL:** Phase 33 requires **Step 0 verification** (Phase 30 lesson):

```bash
# Check if debug ring is ENV-gated or always-on
rg "getenv.*DEBUG_RING" core/
rg "HAKMEM.*DEBUG.*RING" core/
```

Only proceed if the debug ring is **always-on by default** (not ENV-gated).

---

## Conclusion

Phase 32 demonstrates that the `g_hak_tiny_free_calls` diagnostic counter has **negligible performance impact** on modern hardware. The NEUTRAL result (-0.46%) is within measurement noise and likely reflects code alignment effects rather than actual atomic overhead. We adopt the compile-out default (COMPILED=0) to maintain code cleanliness and consistency with the atomic prune series.

This phase reinforces the pattern established in Phase 31: diagnostic counters on HOT paths should be compile-time gated even when their runtime impact is minimal. Systematically removing diagnostic telemetry from production builds improves code clarity and eliminates potential future issues (e.g., cache line contention in multi-threaded scenarios).

---

**Phase 32 Status:** COMPLETE (NEUTRAL → Adopt for code cleanliness)
**Next Phase:** Phase 33 (`tiny_debug_ring_record`) - Step 0 verification required
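As an appendix, the compile-time gating pattern discussed throughout this report can be sketched as a self-contained unit. This is a minimal illustration, not the actual HAKMEM code: `hak_tiny_free_stub()` and `hak_tiny_free_calls_snapshot()` are hypothetical stand-ins for the real free path, and the gate defaults to 1 here (the opposite of the adopted production default) purely so the counter is observable in the sketch.

```c
// Minimal sketch of the HAKMEM_TINY_FREE_CALLS_COMPILED gating pattern.
// hak_tiny_free_stub() is a hypothetical stand-in for hak_tiny_free();
// the real deallocation logic is elided.
#include <stdatomic.h>
#include <stdint.h>

#ifndef HAKMEM_TINY_FREE_CALLS_COMPILED
#define HAKMEM_TINY_FREE_CALLS_COMPILED 1 /* enabled here so the sketch is observable */
#endif

// Relaxed atomic counter: no ordering guarantees, just a statistic.
_Atomic uint64_t g_hak_tiny_free_calls = 0;

static inline void hak_tiny_free_stub(void *ptr) {
    (void)ptr; // real free path elided in this sketch
#if HAKMEM_TINY_FREE_CALLS_COMPILED
    atomic_fetch_add_explicit(&g_hak_tiny_free_calls, 1, memory_order_relaxed);
#else
    (void)0; // counter compiled out: zero atomic traffic on the free path
#endif
}

static inline uint64_t hak_tiny_free_calls_snapshot(void) {
    return atomic_load_explicit(&g_hak_tiny_free_calls, memory_order_relaxed);
}
```

Because the gate is an `#if` rather than a runtime branch, the compiled-out configuration carries no residual load, branch, or atomic instruction, which is exactly why the A/B delta above is dominated by alignment and noise effects rather than counter cost.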