Phase 32: Tiny Free Calls Atomic Prune - A/B Test Results
Date: 2025-12-16
Target: g_hak_tiny_free_calls atomic counter in core/hakmem_tiny_free.inc:335
Build Flag: HAKMEM_TINY_FREE_CALLS_COMPILED (default: 0)
Verdict: NEUTRAL → Adopt for code cleanliness
Executive Summary
Phase 32 implements compile-time gating for the g_hak_tiny_free_calls diagnostic counter in hak_tiny_free(). A/B testing shows NEUTRAL impact (-0.46%, within measurement noise). We adopt the compile-out default (COMPILED=0) for code cleanliness and consistency with the atomic prune series.
Key Finding: The atomic counter has negligible performance impact, but removing it maintains cleaner code and aligns with the systematic removal of diagnostic telemetry from HOT paths.
Test Configuration
Target Code Location
File: core/hakmem_tiny_free.inc:335
Before (always active):
extern _Atomic uint64_t g_hak_tiny_free_calls;
atomic_fetch_add_explicit(&g_hak_tiny_free_calls, 1, memory_order_relaxed);
After (compile-out default):
#if HAKMEM_TINY_FREE_CALLS_COMPILED
extern _Atomic uint64_t g_hak_tiny_free_calls;
atomic_fetch_add_explicit(&g_hak_tiny_free_calls, 1, memory_order_relaxed);
#else
(void)0; // No-op when diagnostic counter compiled out
#endif
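Call sites that repeat this `#if`/`#else` pattern can get noisy. A minimal sketch of one way to centralize the gate behind a macro; the helper name `HAK_TINY_FREE_COUNT` and the demo function are hypothetical, not from the HAKMEM tree:

```c
#include <stdatomic.h>
#include <stdint.h>

#ifndef HAKMEM_TINY_FREE_CALLS_COMPILED
#define HAKMEM_TINY_FREE_CALLS_COMPILED 0
#endif

#if HAKMEM_TINY_FREE_CALLS_COMPILED
static _Atomic uint64_t g_hak_tiny_free_calls;
/* Relaxed increment: counts calls, imposes no ordering on other memory ops */
#define HAK_TINY_FREE_COUNT() \
    atomic_fetch_add_explicit(&g_hak_tiny_free_calls, 1, memory_order_relaxed)
#else
/* Compiled out: expands to a statement-safe no-op */
#define HAK_TINY_FREE_COUNT() ((void)0)
#endif

/* Hypothetical call site mirroring the shape of hak_tiny_free() */
static int demo_free_path(int n) {
    for (int i = 0; i < n; i++) {
        HAK_TINY_FREE_COUNT(); /* no-op by default (COMPILED=0) */
    }
    return n;
}
```

With this shape, the hot path stays a single line regardless of the build flag, and the `#if` lives in exactly one place.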
Code Classification
- Category: TELEMETRY
- Frequency: Every free operation (unconditional)
- Correctness Impact: None (diagnostic only)
- Flow Control: None
Build Flag (SSOT)
File: core/hakmem_build_flags.h
// ------------------------------------------------------------
// Phase 32: Tiny Free Calls Atomic Prune (Compile-out diagnostic counter)
// ------------------------------------------------------------
// Tiny Free Calls: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need free path call counting
// Target: g_hak_tiny_free_calls atomic in core/hakmem_tiny_free.inc:335
// Impact: HOT path atomic (every free operation, unconditional)
// Expected improvement: +0.3% to +0.7% (diagnostic counter, less critical than Phase 25)
#ifndef HAKMEM_TINY_FREE_CALLS_COMPILED
# define HAKMEM_TINY_FREE_CALLS_COMPILED 0
#endif
A/B Test Results
Methodology
- Workload: bench_random_mixed (mixed 8-64B allocation pattern)
- Iterations: 10 runs per configuration
- Environment: clean environment via scripts/run_mixed_10_cleanenv.sh
- Compiler: GCC with -O3 -flto -march=native
Configuration A: Baseline (COMPILED=0, counter compiled-out)
make clean && make -j bench_random_mixed_hakmem
scripts/run_mixed_10_cleanenv.sh
Results:
Run 1: 51,155,676 ops/s
Run 2: 51,337,897 ops/s
Run 3: 53,355,358 ops/s
Run 4: 52,484,033 ops/s
Run 5: 53,554,331 ops/s
Run 6: 52,816,908 ops/s
Run 7: 53,764,926 ops/s
Run 8: 53,908,882 ops/s
Run 9: 53,963,916 ops/s
Run 10: 53,083,746 ops/s
Median: 53,219,552 ops/s
Mean: 52,942,567 ops/s
Stdev: 1,011,696 ops/s (1.91%)
Configuration B: Compiled-in (COMPILED=1, counter active)
make clean && make -j EXTRA_CFLAGS='-DHAKMEM_TINY_FREE_CALLS_COMPILED=1' bench_random_mixed_hakmem
scripts/run_mixed_10_cleanenv.sh
Results:
Run 1: 53,017,261 ops/s
Run 2: 52,053,756 ops/s
Run 3: 53,815,545 ops/s
Run 4: 53,366,110 ops/s
Run 5: 53,560,201 ops/s
Run 6: 54,113,944 ops/s
Run 7: 53,252,767 ops/s
Run 8: 53,823,030 ops/s
Run 9: 53,766,710 ops/s
Run 10: 52,006,868 ops/s
Median: 53,463,156 ops/s
Mean: 53,277,619 ops/s
Stdev: 729,857 ops/s (1.37%)
Performance Impact
| Metric | Baseline (COMPILED=0) | Compiled-in (COMPILED=1) | Delta (baseline vs compiled-in) |
|---|---|---|---|
| Median | 53,219,552 ops/s | 53,463,156 ops/s | -0.46% |
| Mean | 52,942,567 ops/s | 53,277,619 ops/s | -0.63% |
| Stdev | 1,011,696 (1.91%) | 729,857 (1.37%) | Compiled-in shows lower variance |
Improvement: -0.46% (NEUTRAL; the compiled-out baseline measured slightly slower, within noise)
Analysis
Unexpected Result
Unlike previous atomic prune phases (Phase 25: +1.07%, Phase 31: NEUTRAL), Phase 32 shows a slight performance improvement with the atomic counter compiled-in. This is counterintuitive and within measurement noise.
Possible Explanations
- Code Alignment Effects: the (void)0 no-op may cause different code alignment than the atomic instruction, potentially affecting instruction-cache behavior
- Measurement Noise: the -0.46% difference is well within typical variance (±0.5%)
- Compiler Optimization: LTO may optimize the atomic differently in the compiled-in case
Statistical Significance
- Difference: 243,604 ops/s (0.46%)
- Baseline Stdev: 1,011,696 ops/s (1.91%)
- Compiled-in Stdev: 729,857 ops/s (1.37%)
- Conclusion: Not statistically significant (difference < 1 stdev)
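The significance rule applied above (a delta counts only if it exceeds one stdev of either sample) can be sketched as a small predicate; the function name is illustrative:

```c
#include <math.h>

/* Returns 1 if the median delta exceeds the larger of the two sample
 * stdevs, i.e. the difference is outside one-stdev measurement noise. */
static int is_significant(double median_a, double median_b,
                          double stdev_a, double stdev_b) {
    double diff = fabs(median_b - median_a);
    double noise = (stdev_a > stdev_b) ? stdev_a : stdev_b;
    return diff > noise;
}
```

Plugging in Phase 32's numbers (diff 243,604 vs baseline stdev 1,011,696) yields "not significant", matching the NEUTRAL verdict.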
Verdict Rationale
Despite the slight negative delta, we adopt COMPILED=0 (compiled-out) for:
- Code Cleanliness: Removes unnecessary diagnostic counter from production code
- Consistency: Aligns with atomic prune series (Phases 24-32)
- Future-Proofing: Eliminates potential cache line contention in multi-threaded workloads
- Research Flexibility: Counter can be re-enabled via
-DHAKMEM_TINY_FREE_CALLS_COMPILED=1
Comparison with Related Phases
Phase 25: g_free_ss_enter (+1.07% GO)
- Location: tiny_superslab_free.inc.h
- Frequency: Every free operation
- Impact: +1.07% improvement (GO)
- Similarity: Same HOT path, same frequency
- Difference: Phase 25 counter was in a more critical code section
Phase 31: g_tiny_free_trace (NEUTRAL)
- Location: hakmem_tiny_free.inc:326 (9 lines above Phase 32)
- Frequency: Every free operation (rate-limited to 128 calls)
- Impact: NEUTRAL (adopted for code cleanliness)
- Similarity: Same function, same file
- Difference: Phase 31 was rate-limited, Phase 32 is unconditional
Key Insight
Phase 32's NEUTRAL result is consistent with Phase 31 (same function, similar location). The atomic counter's impact is negligible in modern CPUs with efficient relaxed atomics. The primary benefit is code cleanliness, not performance.
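The "efficient relaxed atomics" claim can be probed with a rough single-threaded microbenchmark; timing results vary by machine, and this sketch (names hypothetical) is illustrative only, not part of the HAKMEM tree:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

static _Atomic uint64_t g_counter;

/* Times `iters` relaxed fetch_adds on an uncontended counter and returns
 * elapsed seconds. Uncontended relaxed RMW is cheap on modern CPUs (on
 * x86-64 it lowers to a single LOCK XADD); contention across cores is
 * where cache-line ping-pong would make such counters expensive. */
static double time_relaxed_adds(uint64_t iters) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++)
        atomic_fetch_add_explicit(&g_counter, 1, memory_order_relaxed);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

This also illustrates the "Future-Proofing" point in the verdict: the single-threaded cost is negligible, but the multi-threaded contention risk is what compile-out eliminates for good.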
Cumulative Impact
Atomic Prune Series Progress (Phases 24-32)
- Phase 24: Tiny Class Stats (+0.93% GO)
- Phase 25: Tiny Free Stats (+1.07% GO)
- Phase 26A: C7 Free Count (+0.77% GO)
- Phase 26B: Header Mismatch Log (+0.53% GO)
- Phase 26C: Header Meta Mismatch (+0.41% NEUTRAL)
- Phase 26D: Metric Bad Class (+0.47% NEUTRAL)
- Phase 26E: Header Meta Fast (+0.67% GO)
- Phase 27: Unified Cache Stats (+0.47% NEUTRAL)
- Phase 29: Pool Hotbox v2 Stats (+1.00% GO)
- Phase 31: Tiny Free Trace (NEUTRAL)
- Phase 32: Tiny Free Calls (NEUTRAL)
Total Improvement (GO phases only): ~5.4%
Recommendations
Adoption Decision
ADOPT with HAKMEM_TINY_FREE_CALLS_COMPILED=0 (default OFF).
Rationale:
- NEUTRAL performance impact (within noise)
- Code cleanliness benefit
- Consistency with atomic prune series
- No functional impact (diagnostic only)
Production Use
# Default build (counter compiled-out)
make bench_random_mixed_hakmem
Research/Debug Use
# Enable counter for diagnostics
make EXTRA_CFLAGS='-DHAKMEM_TINY_FREE_CALLS_COMPILED=1' bench_random_mixed_hakmem
Next Steps: Phase 33 Candidate
Target: tiny_debug_ring_record(TINY_RING_EVENT_FREE_ENTER, ...)
Location: core/hakmem_tiny_free.inc:340 (3 lines below Phase 32)
Classification: TELEMETRY (debug ring buffer)
⚠️ CRITICAL: Phase 33 requires Step 0 verification (Phase 30 lesson):
# Check if debug ring is ENV-gated or always-on
rg "getenv.*DEBUG_RING" core/
rg "HAKMEM.*DEBUG.*RING" core/
Only proceed if debug ring is always-on by default (not ENV-gated).
Conclusion
Phase 32 demonstrates that the g_hak_tiny_free_calls diagnostic counter has negligible performance impact on modern hardware. The NEUTRAL result (-0.46%) is within measurement noise and likely influenced by code alignment effects rather than actual atomic overhead.
We adopt the compile-out default (COMPILED=0) to maintain code cleanliness and consistency with the atomic prune series. This phase reinforces the pattern established in Phase 31: diagnostic counters on HOT paths should be compile-time gated, even if their runtime impact is minimal.
The systematic removal of diagnostic telemetry from production builds improves code clarity and eliminates potential future issues (e.g., cache line contention in multi-threaded scenarios).
Phase 32 Status: COMPLETE (NEUTRAL → Adopt for code cleanliness)
Next Phase: Phase 33 (tiny_debug_ring_record) - Step 0 verification required