Files

Moe Charm (CI) 8052e8b320 Phase 24-26: Hot path atomic telemetry prune (+2.00% cumulative)

Summary:
- Phase 24 (alloc stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL (code cleanliness)
- Total: 11 atomics compiled-out, +2.00% improvement

Phase 24: OBSERVE tax prune (tiny_class_stats_box.h)
- Added HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)
- Wrapped 5 stats functions: uc_miss, warm_hit, shared_lock, tls_carve_*
- Result: +0.93% (baseline 56.675M vs compiled-in 56.151M ops/s)

Phase 25: Tiny free stats prune (tiny_superslab_free.inc.h)
- Added HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)
- Wrapped g_free_ss_enter atomic in free hot path
- Result: +1.07% (baseline 57.017M vs compiled-in 56.415M ops/s)

Phase 26: Hot path diagnostic atomics prune
- Added 5 compile gates for low-frequency error counters:
  - HAKMEM_TINY_C7_FREE_COUNT_COMPILED
  - HAKMEM_TINY_HDR_MISMATCH_LOG_COMPILED
  - HAKMEM_TINY_HDR_META_MISMATCH_COMPILED
  - HAKMEM_TINY_METRIC_BAD_CLASS_COMPILED
  - HAKMEM_TINY_HDR_META_FAST_COMPILED
- Result: -0.33% NEUTRAL (within noise, kept for cleanliness)

Alignment with mimalloc principles:
- "No atomics on hot path" - telemetry moved to compile-time opt-in
- Fixed per-op tax elimination
- Production builds: maximum performance (atomics compiled-out)
- Research builds: full diagnostics (COMPILED=1)

Generated with Claude Code
https://claude.com/claude-code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-16 05:35:11 +09:00

4.6 KiB

Raw Blame History

Phase 25: Tiny Free Stats Atomic Prune - Results

Objective

Compile-out g_free_ss_enter atomic counter in core/tiny_superslab_free.inc.h to reduce free path overhead, following Phase 24 pattern.

Implementation

Changes Made

Added compile gate to core/hakmem_build_flags.h:

// Phase 25: Tiny Free Stats Atomic Prune (Compile-out g_free_ss_enter)
// Tiny Free Stats: Compile gate (default OFF = compile-out)
#ifndef HAKMEM_TINY_FREE_STATS_COMPILED
#  define HAKMEM_TINY_FREE_STATS_COMPILED 0
#endif

Wrapped atomic in core/tiny_superslab_free.inc.h:

// Phase 25: Compile-out free stats atomic (default OFF)
#if HAKMEM_TINY_FREE_STATS_COMPILED
    extern _Atomic uint64_t g_free_ss_enter;
    atomic_fetch_add_explicit(&g_free_ss_enter, 1, memory_order_relaxed);
#else
    (void)0;  // No-op when compiled out
#endif

A/B Test Results

Baseline (COMPILED=0, default - atomic compiled OUT)

Run  1: 56,507,896 ops/s
Run  2: 57,333,770 ops/s
Run  3: 57,434,992 ops/s
Run  4: 57,578,038 ops/s
Run  5: 56,664,457 ops/s
Run  6: 56,524,671 ops/s
Run  7: 56,654,263 ops/s
Run  8: 57,349,250 ops/s
Run  9: 56,907,667 ops/s
Run 10: 57,211,685 ops/s

Mean:   57,016,669 ops/s
StdDev:    409,269 ops/s

Compiled-In (COMPILED=1, research - atomic compiled IN)

Run  1: 56,820,429 ops/s
Run  2: 57,373,517 ops/s
Run  3: 56,861,669 ops/s
Run  4: 56,206,268 ops/s
Run  5: 56,777,968 ops/s
Run  6: 55,020,362 ops/s
Run  7: 55,932,595 ops/s
Run  8: 56,506,976 ops/s
Run  9: 56,944,509 ops/s
Run 10: 55,708,673 ops/s

Mean:   56,415,297 ops/s
StdDev:    701,064 ops/s

Performance Impact

Delta: +601,372 ops/s (+1.07%)
Decision: GO
Rationale: Baseline (atomic compiled out) is 1.07% faster, exceeding +0.5% threshold

Analysis

Why This Works

Hot Path Tax Elimination:
- g_free_ss_enter atomic is executed on EVERY free operation
- Atomic operations have inherent overhead even with relaxed memory ordering
- Compile-out eliminates both the atomic instruction and the counter increment
Diagnostics-Only Counter:
- g_free_ss_enter is used only for debug dumps and statistics
- NOT required for correctness
- Safe to compile out in production builds
Consistent with Phase 24:
- Phase 24: Alloc path stats compile-out → +0.93%
- Phase 25: Free path stats compile-out → +1.07%
- Both confirm that even relaxed atomics have measurable overhead on hot paths

Impact Breakdown

Free Path:

Every hak_tiny_free_superslab() call saved ~2-3 cycles (atomic increment elimination)
Mixed workload: ~50% free operations
Net impact: ~1.07% throughput improvement

Code Size:

Default build (COMPILED=0): atomic code completely eliminated by compiler
Research build (COMPILED=1): atomic code present for diagnostics

Comparison with mimalloc Principles

mimalloc's "No Atomics on Hot Path" Rule:

mimalloc avoids atomics on allocation/free hot paths
Uses thread-local counters with periodic aggregation
hakmem Phase 24-25 align with this principle by making hot-path atomics opt-in

Files Modified

/mnt/workdisk/public_share/hakmem/core/hakmem_build_flags.h
- Added HAKMEM_TINY_FREE_STATS_COMPILED flag (default: 0)
/mnt/workdisk/public_share/hakmem/core/tiny_superslab_free.inc.h
- Wrapped g_free_ss_enter atomic with compile gate
- Added header include for build flags

Build Instructions

Default Build (Production - Atomic Compiled OUT)

make clean && make -j bench_random_mixed_hakmem

Research Build (Diagnostics - Atomic Compiled IN)

make clean && make -j EXTRA_CFLAGS='-DHAKMEM_TINY_FREE_STATS_COMPILED=1' bench_random_mixed_hakmem

Next Steps

Immediate

Phase 25 is GO - changes remain in codebase
Default build (COMPILED=0) is now the standard

Future Opportunities

Identify other hot-path atomics for compile-out:

Remote queue counters (g_remote_free_transitions[])
First-free transition counters (g_first_free_transitions[])
Other diagnostic-only atomics in free/alloc paths

Conclusion

Phase 25 successfully eliminated free path atomic overhead with +1.07% improvement, matching Phase 24's pattern. The compile-gate approach allows:

Production builds: Maximum performance (atomics compiled out)
Research builds: Full diagnostics (atomics available when needed)

This validates the "tax prune" strategy: even low-cost operations (relaxed atomics) accumulate measurable overhead when executed on every hot-path operation.

Status: GO (+1.07%) Date: 2025-12-16 Benchmark: bench_random_mixed (10 runs, clean env)

4.6 KiB Raw Blame History