
Moe Charm (CI) b7085c47e1 Phase 35-39: FAST build optimization complete (+7.13% cumulative)

Phase 35-A: BENCH_MINIMAL gate function elimination (GO +4.39%)
- tiny_front_v3_enabled() → constant true
- tiny_metadata_cache_enabled() → constant 0
- learner_v7_enabled() → constant false
- small_learner_v2_enabled() → constant false

Phase 36: Policy snapshot init-once (GO +0.71%)
- small_policy_v7_snapshot() version check skip in BENCH_MINIMAL
- TLS cache for policy snapshot

Phase 37: Standard TLS cache (NO-GO -0.07%)
- TLS cache for Standard build attempted
- Runtime gate overhead negates benefit

Phase 38: FAST/OBSERVE/Standard workflow established
- make perf_fast, make perf_observe targets
- Scorecard and documentation updates

Phase 39: Hot path gate constantization (GO +1.98%)
- front_gate_unified_enabled() → constant 1
- alloc_dualhot_enabled() → constant 0
- g_bench_fast_front, g_v3_enabled blocks → compile-out
- free_dispatch_stats_enabled() → constant false

Results:
- FAST v3: 56.04M ops/s (47.4% of mimalloc)
- Standard: 53.50M ops/s (45.3% of mimalloc)
- M1 target (50%): 5.5% remaining

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-16 15:01:56 +09:00

8.6 KiB


Phase 32: Tiny Free Calls Atomic Prune - A/B Test Results

Date: 2025-12-16
Target: g_hak_tiny_free_calls atomic counter in core/hakmem_tiny_free.inc:335
Build Flag: HAKMEM_TINY_FREE_CALLS_COMPILED (default: 0)
Verdict: NEUTRAL → Adopt for code cleanliness

Executive Summary

Phase 32 implements compile-time gating for the g_hak_tiny_free_calls diagnostic counter in hak_tiny_free(). A/B testing shows NEUTRAL impact (-0.46%, within measurement noise). We adopt the compile-out default (COMPILED=0) for code cleanliness and consistency with the atomic prune series.

Key Finding: The atomic counter has negligible performance impact, but removing it maintains cleaner code and aligns with the systematic removal of diagnostic telemetry from HOT paths.

Test Configuration

Target Code Location

File: core/hakmem_tiny_free.inc:335

Before (always active):

extern _Atomic uint64_t g_hak_tiny_free_calls;
atomic_fetch_add_explicit(&g_hak_tiny_free_calls, 1, memory_order_relaxed);

After (compile-out default):

#if HAKMEM_TINY_FREE_CALLS_COMPILED
extern _Atomic uint64_t g_hak_tiny_free_calls;
atomic_fetch_add_explicit(&g_hak_tiny_free_calls, 1, memory_order_relaxed);
#else
(void)0;  // No-op when diagnostic counter compiled out
#endif
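The same gating pattern can be sketched as a small standalone translation unit. This is only an illustration under stated assumptions: the macro TINY_FREE_CALLS_INC and the functions demo_tiny_free and demo_tiny_free_calls are hypothetical names that do not appear in the hakmem sources, and the real hak_tiny_free() body is not reproduced here.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Default mirrors the build flag described in this report: 0 = compiled out. */
#ifndef HAKMEM_TINY_FREE_CALLS_COMPILED
#  define HAKMEM_TINY_FREE_CALLS_COMPILED 0
#endif

#if HAKMEM_TINY_FREE_CALLS_COMPILED
static _Atomic uint64_t g_hak_tiny_free_calls;
/* Hypothetical helper macro: relaxed increment of the diagnostic counter. */
#  define TINY_FREE_CALLS_INC() \
      ((void)atomic_fetch_add_explicit(&g_hak_tiny_free_calls, 1, memory_order_relaxed))
#else
/* Compiled out: the call site expands to a no-op expression. */
#  define TINY_FREE_CALLS_INC() ((void)0)
#endif

/* Hypothetical stand-in for the free path; only the telemetry hook is shown. */
void demo_tiny_free(void *ptr) {
    TINY_FREE_CALLS_INC();
    (void)ptr;  /* the real free logic is intentionally omitted */
}

/* Reads the counter, or reports 0 when the counter is compiled out. */
uint64_t demo_tiny_free_calls(void) {
#if HAKMEM_TINY_FREE_CALLS_COMPILED
    return atomic_load_explicit(&g_hak_tiny_free_calls, memory_order_relaxed);
#else
    return 0;
#endif
}
```

Wrapping the increment in a macro keeps the call site to a single line in both configurations, which is one way to avoid scattering #if blocks through a hot function. Building with -DHAKMEM_TINY_FREE_CALLS_COMPILED=1 switches the sketch to the counting variant.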

Code Classification

Category: TELEMETRY
Frequency: Every free operation (unconditional)
Correctness Impact: None (diagnostic only)
Flow Control: None

Build Flag (SSOT)

File: core/hakmem_build_flags.h

// ------------------------------------------------------------
// Phase 32: Tiny Free Calls Atomic Prune (Compile-out diagnostic counter)
// ------------------------------------------------------------
// Tiny Free Calls: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need free path call counting
// Target: g_hak_tiny_free_calls atomic in core/hakmem_tiny_free.inc:335
// Impact: HOT path atomic (every free operation, unconditional)
// Expected improvement: +0.3% to +0.7% (diagnostic counter, less critical than Phase 25)
#ifndef HAKMEM_TINY_FREE_CALLS_COMPILED
#  define HAKMEM_TINY_FREE_CALLS_COMPILED 0
#endif

A/B Test Results

Methodology

Workload: bench_random_mixed (Mixed 8-64B allocation pattern)
Iterations: 10 runs per configuration
Environment: Clean environment via scripts/run_mixed_10_cleanenv.sh
Compiler: GCC with -O3 -flto -march=native

Configuration A: Baseline (COMPILED=0, counter compiled-out)

make clean && make -j bench_random_mixed_hakmem
scripts/run_mixed_10_cleanenv.sh

Results:

Run  1: 51,155,676 ops/s
Run  2: 51,337,897 ops/s
Run  3: 53,355,358 ops/s
Run  4: 52,484,033 ops/s
Run  5: 53,554,331 ops/s
Run  6: 52,816,908 ops/s
Run  7: 53,764,926 ops/s
Run  8: 53,908,882 ops/s
Run  9: 53,963,916 ops/s
Run 10: 53,083,746 ops/s

Median: 53,219,552 ops/s
Mean:   52,942,567 ops/s
Stdev:  1,011,696 ops/s (1.91%)

Configuration B: Compiled-in (COMPILED=1, counter active)

make clean && make -j EXTRA_CFLAGS='-DHAKMEM_TINY_FREE_CALLS_COMPILED=1' bench_random_mixed_hakmem
scripts/run_mixed_10_cleanenv.sh

Results:

Run  1: 53,017,261 ops/s
Run  2: 52,053,756 ops/s
Run  3: 53,815,545 ops/s
Run  4: 53,366,110 ops/s
Run  5: 53,560,201 ops/s
Run  6: 54,113,944 ops/s
Run  7: 53,252,767 ops/s
Run  8: 53,823,030 ops/s
Run  9: 53,766,710 ops/s
Run 10: 52,006,868 ops/s

Median: 53,463,156 ops/s
Mean:   53,277,619 ops/s
Stdev:  729,857 ops/s (1.37%)

Performance Impact

Metric	Baseline (COMPILED=0)	Compiled-in (COMPILED=1)	Delta (baseline vs compiled-in)
Median	53,219,552 ops/s	53,463,156 ops/s	-0.46%
Mean	52,942,567 ops/s	53,277,619 ops/s	-0.63%
Stdev	1,011,696 ops/s (1.91%)	729,857 ops/s (1.37%)	Compiled-in has lower variance

Improvement from compiling out the counter: -0.46% by median (NEUTRAL)
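The medians, means, and deltas above can be recomputed from the per-run listings. The sketch below assumes the delta is defined as (baseline - compiled-in) / compiled-in, which reproduces the reported -0.46% and -0.63%. The function names stat_mean, stat_median, and delta_percent are illustrative and not part of the benchmark scripts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples. */
double stat_mean(const double *v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Median of n samples; sorts a private copy so the input stays untouched. */
double stat_median(const double *v, size_t n) {
    assert(n > 0);
    double *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    memcpy(tmp, v, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double m = (n % 2) ? tmp[n / 2] : (tmp[n / 2 - 1] + tmp[n / 2]) / 2.0;
    free(tmp);
    return m;
}

/* Relative delta of a versus b, in percent: (a - b) / b * 100. */
double delta_percent(double a, double b) {
    return (a - b) / b * 100.0;
}

/* Per-run throughput (ops/s), copied from the two Results listings. */
const double runs_compiled_out[10] = {
    51155676, 51337897, 53355358, 52484033, 53554331,
    52816908, 53764926, 53908882, 53963916, 53083746
};
const double runs_compiled_in[10] = {
    53017261, 52053756, 53815545, 53366110, 53560201,
    54113944, 53252767, 53823030, 53766710, 52006868
};
```

Calling delta_percent(stat_median(runs_compiled_out, 10), stat_median(runs_compiled_in, 10)) gives the median delta, and the same call with stat_mean gives the mean delta.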

Analysis

Unexpected Result

Unlike Phase 25 (+1.07% GO), compiling out the counter did not produce a gain here: the build with the atomic counter compiled in measured slightly faster (+0.46% by median). This is counterintuitive, but the difference is within measurement noise, as it was for Phase 31 (NEUTRAL).

Possible Explanations

Code Alignment Effects: The (void)0 no-op may cause different code alignment than the atomic instruction, potentially affecting instruction cache behavior
Measurement Noise: The -0.46% difference lies within the typical ±0.5% variance and is far smaller than the run-to-run stdev measured here (1.37% to 1.91%)
Compiler Optimization: LTO may optimize the atomic differently in the compiled-in case

Statistical Significance

Difference: 243,604 ops/s (0.46%)
Baseline Stdev: 1,011,696 ops/s (1.91%)
Compiled-in Stdev: 729,857 ops/s (1.37%)
Conclusion: Not statistically significant (the difference is about 0.24 of the baseline stdev and 0.33 of the compiled-in stdev, so < 1 stdev in both cases)
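As a cross-check on the "< 1 stdev" criterion, the two run sets can also be compared with Welch's t-statistic. This is a swapped-in technique, not the method used in the report. The sketch below works with the squared statistic so that it needs no square root: t² = (mean_A - mean_B)² / (var_A/n_A + var_B/n_B), using sample variances with an n-1 denominator. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
double sample_mean(const double *v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Unbiased sample variance (n - 1 denominator). */
double sample_variance(const double *v, size_t n) {
    assert(n > 1);
    double m = sample_mean(v, n);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = v[i] - m;
        acc += d * d;
    }
    return acc / (double)(n - 1);
}

/* Squared Welch t-statistic for two independent samples. */
double welch_t_squared(const double *a, size_t na, const double *b, size_t nb) {
    double diff = sample_mean(a, na) - sample_mean(b, nb);
    double se2 = sample_variance(a, na) / (double)na
               + sample_variance(b, nb) / (double)nb;
    return diff * diff / se2;
}

/* Per-run throughput (ops/s): A = COMPILED=0, B = COMPILED=1. */
const double runs_a[10] = {
    51155676, 51337897, 53355358, 52484033, 53554331,
    52816908, 53764926, 53908882, 53963916, 53083746
};
const double runs_b[10] = {
    53017261, 52053756, 53815545, 53366110, 53560201,
    54113944, 53252767, 53823030, 53766710, 52006868
};
```

For these runs welch_t_squared(runs_a, 10, runs_b, 10) comes out near 0.72, that is |t| of roughly 0.85. A two-sided 5% test at about 16 degrees of freedom would need |t| above roughly 2.1 (t² of about 4.5), so this check agrees with the NEUTRAL verdict.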

Verdict Rationale

Despite the slight negative delta, we adopt COMPILED=0 (compiled-out) for:

Code Cleanliness: Removes unnecessary diagnostic counter from production code
Consistency: Aligns with atomic prune series (Phases 24-32)
Future-Proofing: Eliminates potential cache line contention in multi-threaded workloads
Research Flexibility: Counter can be re-enabled via -DHAKMEM_TINY_FREE_CALLS_COMPILED=1

Phase 25: g_free_ss_enter (+1.07% GO)

Location: tiny_superslab_free.inc.h
Frequency: Every free operation
Impact: +1.07% improvement (GO)
Similarity: Same HOT path, same frequency
Difference: Phase 25 counter was in more critical code section

Phase 31: g_tiny_free_trace (NEUTRAL)

Location: hakmem_tiny_free.inc:326 (9 lines above Phase 32)
Frequency: Every free operation (rate-limited to 128 calls)
Impact: NEUTRAL (adopted for code cleanliness)
Similarity: Same function, same file
Difference: Phase 31 was rate-limited, Phase 32 is unconditional

Key Insight

Phase 32's NEUTRAL result is consistent with Phase 31 (same function, similar location). The atomic counter's impact is negligible on modern CPUs with efficient relaxed atomics. The primary benefit is code cleanliness, not performance.

Cumulative Impact

Atomic Prune Series Progress (Phases 24-32)

Phase 24: Tiny Class Stats (+0.93% GO)
Phase 25: Tiny Free Stats (+1.07% GO)
Phase 26A: C7 Free Count (+0.77% GO)
Phase 26B: Header Mismatch Log (+0.53% GO)
Phase 26C: Header Meta Mismatch (+0.41% NEUTRAL)
Phase 26D: Metric Bad Class (+0.47% NEUTRAL)
Phase 26E: Header Meta Fast (+0.67% GO)
Phase 27: Unified Cache Stats (+0.47% NEUTRAL)
Phase 29: Pool Hotbox v2 Stats (+1.00% GO)
Phase 31: Tiny Free Trace (NEUTRAL)
Phase 32: Tiny Free Calls (NEUTRAL)

Total Improvement (GO phases only): ~5.0% (sum of the six GO deltas listed above: 4.97%)
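The GO-only total can be cross-checked from the per-phase deltas listed above. The sketch below computes both a plain sum and a compounded product of the six GO entries (Phases 24, 25, 26A, 26B, 26E, 29). Whether the series is meant to be summed or compounded is not stated in the report, so both are shown, and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Plain sum of percentage deltas. */
double sum_percent(const double *d, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += d[i];
    return s;
}

/* Compounded gain in percent: product of (1 + d/100) over all phases, minus 1. */
double compound_percent(const double *d, size_t n) {
    double f = 1.0;
    for (size_t i = 0; i < n; i++) f *= 1.0 + d[i] / 100.0;
    return (f - 1.0) * 100.0;
}

/* GO phases from the list above: 24, 25, 26A, 26B, 26E, 29 (percent). */
const double go_deltas[6] = { 0.93, 1.07, 0.77, 0.53, 0.67, 1.00 };
```

The sum is 4.97% and the compounded figure is about 5.07%. Each delta was measured against its own baseline at the time, so either number is only a rough estimate of the combined effect.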

Recommendations

Adoption Decision

ADOPT with HAKMEM_TINY_FREE_CALLS_COMPILED=0 (default OFF).

Rationale:

NEUTRAL performance impact (within noise)
Code cleanliness benefit
Consistency with atomic prune series
No functional impact (diagnostic only)

Production Use

# Default build (counter compiled-out)
make bench_random_mixed_hakmem

Research/Debug Use

# Enable counter for diagnostics
make EXTRA_CFLAGS='-DHAKMEM_TINY_FREE_CALLS_COMPILED=1' bench_random_mixed_hakmem

Next Steps: Phase 33 Candidate

Target: tiny_debug_ring_record(TINY_RING_EVENT_FREE_ENTER, ...)
Location: core/hakmem_tiny_free.inc:340 (just below the Phase 32 counter at line 335)
Classification: TELEMETRY (debug ring buffer)

⚠️ CRITICAL: Phase 33 requires Step 0 verification (Phase 30 lesson):

# Check if debug ring is ENV-gated or always-on
rg "getenv.*DEBUG_RING" core/
rg "HAKMEM.*DEBUG.*RING" core/

Only proceed if debug ring is always-on by default (not ENV-gated).
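One plausible reading of the Step 0 check is that an ENV-gated debug ring already costs only a cached branch when it is off by default, so compiling it out would likely gain little. The sketch below shows what such a gate might look like. Everything in it is hypothetical: the variable name HAKMEM_DEMO_DEBUG_RING, the demo_* functions, and the caching scheme are illustrations, not the actual hakmem implementation, and the cache is not thread-safe.

```c
#include <stdlib.h>

/* Hypothetical environment variable name, for illustration only. */
#define DEMO_RING_ENV "HAKMEM_DEMO_DEBUG_RING"

static int g_demo_ring_state = -1;  /* -1 = not yet resolved, otherwise 0 or 1 */
static int g_demo_env_lookups = 0;  /* number of getenv() calls made so far */
static int g_demo_ring_events = 0;  /* events that actually reached the ring */

/* ENV-gated switch: reads the environment once, then serves the cached answer. */
int demo_debug_ring_enabled(void) {
    if (g_demo_ring_state < 0) {
        const char *s = getenv(DEMO_RING_ENV);
        g_demo_env_lookups++;
        g_demo_ring_state = (s != NULL && s[0] == '1') ? 1 : 0;
    }
    return g_demo_ring_state;
}

/* Hypothetical record call: when the gate is off, the cost is one branch. */
void demo_ring_record(int event) {
    if (!demo_debug_ring_enabled()) return;
    (void)event;  /* a real ring would store the event here */
    g_demo_ring_events++;
}

/* Accessors so the behaviour can be observed from outside. */
int demo_env_lookups(void) { return g_demo_env_lookups; }
int demo_ring_events(void) { return g_demo_ring_events; }
```

If the ring turns out to be gated like this, the rg searches above should find a getenv call for it. If it records unconditionally, it falls in the same category as the Phase 32 counter and is a candidate for a compile-time gate.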

Conclusion

Phase 32 demonstrates that the g_hak_tiny_free_calls diagnostic counter has negligible performance impact on modern hardware. The NEUTRAL result (-0.46%) is within measurement noise and likely influenced by code alignment effects rather than actual atomic overhead.

We adopt the compile-out default (COMPILED=0) to maintain code cleanliness and consistency with the atomic prune series. This phase reinforces the pattern established in Phase 31: diagnostic counters on HOT paths should be compile-time gated, even if their runtime impact is minimal.

The systematic removal of diagnostic telemetry from production builds improves code clarity and eliminates potential future issues (e.g., cache line contention in multi-threaded scenarios).

Phase 32 Status: COMPLETE (NEUTRAL → Adopt for code cleanliness)
Next Phase: Phase 33 (tiny_debug_ring_record) - Step 0 verification required
