Summary: - Phase 24 (alloc stats): +0.93% GO - Phase 25 (free stats): +1.07% GO - Phase 26 (diagnostics): -0.33% NEUTRAL (code cleanliness) - Total: 11 atomics compiled-out, +2.00% improvement Phase 24: OBSERVE tax prune (tiny_class_stats_box.h) - Added HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0) - Wrapped 5 stats functions: uc_miss, warm_hit, shared_lock, tls_carve_* - Result: +0.93% (baseline 56.675M vs compiled-in 56.151M ops/s) Phase 25: Tiny free stats prune (tiny_superslab_free.inc.h) - Added HAKMEM_TINY_FREE_STATS_COMPILED (default: 0) - Wrapped g_free_ss_enter atomic in free hot path - Result: +1.07% (baseline 57.017M vs compiled-in 56.415M ops/s) Phase 26: Hot path diagnostic atomics prune - Added 5 compile gates for low-frequency error counters: - HAKMEM_TINY_C7_FREE_COUNT_COMPILED - HAKMEM_TINY_HDR_MISMATCH_LOG_COMPILED - HAKMEM_TINY_HDR_META_MISMATCH_COMPILED - HAKMEM_TINY_METRIC_BAD_CLASS_COMPILED - HAKMEM_TINY_HDR_META_FAST_COMPILED - Result: -0.33% NEUTRAL (within noise, kept for cleanliness) Alignment with mimalloc principles: - "No atomics on hot path" - telemetry moved to compile-time opt-in - Fixed per-op tax elimination - Production builds: maximum performance (atomics compiled-out) - Research builds: full diagnostics (COMPILED=1) Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
4.6 KiB
Phase 25: Tiny Free Stats Atomic Prune - Results
Objective
Compile-out g_free_ss_enter atomic counter in core/tiny_superslab_free.inc.h to reduce free path overhead, following Phase 24 pattern.
Implementation
Changes Made
-
Added compile gate to
core/hakmem_build_flags.h:// Phase 25: Tiny Free Stats Atomic Prune (Compile-out g_free_ss_enter) // Tiny Free Stats: Compile gate (default OFF = compile-out) #ifndef HAKMEM_TINY_FREE_STATS_COMPILED # define HAKMEM_TINY_FREE_STATS_COMPILED 0 #endif -
Wrapped atomic in
core/tiny_superslab_free.inc.h:// Phase 25: Compile-out free stats atomic (default OFF) #if HAKMEM_TINY_FREE_STATS_COMPILED extern _Atomic uint64_t g_free_ss_enter; atomic_fetch_add_explicit(&g_free_ss_enter, 1, memory_order_relaxed); #else (void)0; // No-op when compiled out #endif
A/B Test Results
Baseline (COMPILED=0, default - atomic compiled OUT)
Run 1: 56,507,896 ops/s
Run 2: 57,333,770 ops/s
Run 3: 57,434,992 ops/s
Run 4: 57,578,038 ops/s
Run 5: 56,664,457 ops/s
Run 6: 56,524,671 ops/s
Run 7: 56,654,263 ops/s
Run 8: 57,349,250 ops/s
Run 9: 56,907,667 ops/s
Run 10: 57,211,685 ops/s
Mean: 57,016,669 ops/s
StdDev: 409,269 ops/s
Compiled-In (COMPILED=1, research - atomic compiled IN)
Run 1: 56,820,429 ops/s
Run 2: 57,373,517 ops/s
Run 3: 56,861,669 ops/s
Run 4: 56,206,268 ops/s
Run 5: 56,777,968 ops/s
Run 6: 55,020,362 ops/s
Run 7: 55,932,595 ops/s
Run 8: 56,506,976 ops/s
Run 9: 56,944,509 ops/s
Run 10: 55,708,673 ops/s
Mean: 56,415,297 ops/s
StdDev: 701,064 ops/s
Performance Impact
- Delta: +601,372 ops/s (+1.07%)
- Decision: GO
- Rationale: Baseline (atomic compiled out) is 1.07% faster, exceeding +0.5% threshold
Analysis
Why This Works
-
Hot Path Tax Elimination:
g_free_ss_enteratomic is executed on EVERY free operation- Atomic operations have inherent overhead even with relaxed memory ordering
- Compile-out eliminates both the atomic instruction and the counter increment
-
Diagnostics-Only Counter:
g_free_ss_enteris used only for debug dumps and statistics- NOT required for correctness
- Safe to compile out in production builds
-
Consistent with Phase 24:
- Phase 24: Alloc path stats compile-out → +0.93%
- Phase 25: Free path stats compile-out → +1.07%
- Both confirm that even relaxed atomics have measurable overhead on hot paths
Impact Breakdown
Free Path:
- Every
hak_tiny_free_superslab()call saved ~2-3 cycles (atomic increment elimination) - Mixed workload: ~50% free operations
- Net impact: ~1.07% throughput improvement
Code Size:
- Default build (COMPILED=0): atomic code completely eliminated by compiler
- Research build (COMPILED=1): atomic code present for diagnostics
Comparison with mimalloc Principles
mimalloc's "No Atomics on Hot Path" Rule:
- mimalloc avoids atomics on allocation/free hot paths
- Uses thread-local counters with periodic aggregation
- hakmem Phase 24-25 align with this principle by making hot-path atomics opt-in
Files Modified
-
/mnt/workdisk/public_share/hakmem/core/hakmem_build_flags.h- Added
HAKMEM_TINY_FREE_STATS_COMPILEDflag (default: 0)
- Added
-
/mnt/workdisk/public_share/hakmem/core/tiny_superslab_free.inc.h- Wrapped
g_free_ss_enteratomic with compile gate - Added header include for build flags
- Wrapped
Build Instructions
Default Build (Production - Atomic Compiled OUT)
make clean && make -j bench_random_mixed_hakmem
Research Build (Diagnostics - Atomic Compiled IN)
make clean && make -j EXTRA_CFLAGS='-DHAKMEM_TINY_FREE_STATS_COMPILED=1' bench_random_mixed_hakmem
Next Steps
Immediate
- Phase 25 is GO - changes remain in codebase
- Default build (COMPILED=0) is now the standard
Future Opportunities
Identify other hot-path atomics for compile-out:
- Remote queue counters (
g_remote_free_transitions[]) - First-free transition counters (
g_first_free_transitions[]) - Other diagnostic-only atomics in free/alloc paths
Conclusion
Phase 25 successfully eliminated free path atomic overhead with +1.07% improvement, matching Phase 24's pattern. The compile-gate approach allows:
- Production builds: Maximum performance (atomics compiled out)
- Research builds: Full diagnostics (atomics available when needed)
This validates the "tax prune" strategy: even low-cost operations (relaxed atomics) accumulate measurable overhead when executed on every hot-path operation.
Status: GO (+1.07%) Date: 2025-12-16 Benchmark: bench_random_mixed (10 runs, clean env)