# Hot Path Atomic Telemetry Prune - Cumulative Summary

**Project:** HAKMEM Memory Allocator - Hot Path Optimization
**Goal:** Remove all telemetry-only atomics from hot alloc/free paths
**Principle:** Follow mimalloc: no atomics and no observation in the hot path
**Status:** Phase 24+25+26+27 Complete (+2.74% cumulative), Phase 28 Audit Complete (NO-OP)

---

## Overview

This document tracks the systematic removal of telemetry-only `atomic_fetch_add/sub` operations from hot alloc/free code paths. Each phase follows a consistent pattern:

1. Identify a telemetry-only atomic (not CORRECTNESS)
2. Add a `HAKMEM_*_COMPILED` compile gate (default: 0)
3. A/B test: baseline (compiled-out) vs compiled-in
4. Verdict: GO (>+0.5%), NEUTRAL (within ±0.5%), or NO-GO (<-0.5%)
5. Document and proceed to the next candidate

---

## Completed Phases

### Phase 24: Tiny Class Stats Atomic Prune ✅ **GO (+0.93%)**

**Date:** 2025-12-15 (prior work)
**Target:** `g_tiny_class_stats_*` (per-class cache hit/miss counters)
**File:** `core/box/tiny_class_stats_box.h`
**Atomics:** 5 global counters (executed on every cache operation)
**Build Flag:** `HAKMEM_TINY_CLASS_STATS_COMPILED` (default: 0)

**Results:**
- **Baseline (compiled-out):** 57.8 M ops/s
- **Compiled-in:** 57.3 M ops/s
- **Improvement:** **+0.93%**
- **Verdict:** **GO** ✅ (keep compiled-out)

**Analysis:** High-frequency atomics (every cache hit/miss) show measurable impact. Compiling them out provides nearly a 1% improvement.

**Reference:** Pattern established in Phase 24, used as the template for all subsequent phases.

---

### Phase 25: Free Stats Atomic Prune ✅ **GO (+1.07%)**

**Date:** 2025-12-15 (prior work)
**Target:** `g_free_ss_enter` (superslab free entry counter)
**File:** `core/tiny_superslab_free.inc.h:22`
**Atomics:** 1 global counter (executed on every superslab free)
**Build Flag:** `HAKMEM_TINY_FREE_STATS_COMPILED` (default: 0)

**Results:**
- **Baseline (compiled-out):** 58.4 M ops/s
- **Compiled-in:** 57.8 M ops/s
- **Improvement:** **+1.07%**
- **Verdict:** **GO** ✅ (keep compiled-out)

**Analysis:** A single high-frequency atomic (executed on every free call) shows a >1% impact, demonstrating that even one hot-path atomic matters.

**Reference:** `docs/analysis/PHASE25_FREE_STATS_RESULTS.md` (assumed from pattern)

---

### Phase 26: Hot Path Diagnostic Atomics Prune ✅ **NEUTRAL (-0.33%)**

**Date:** 2025-12-16
**Targets:** 5 diagnostic atomics in hot-path edge cases
**Files:**
- `core/tiny_superslab_free.inc.h` (3 atomics)
- `core/hakmem_tiny_alloc.inc` (1 atomic)
- `core/tiny_free_fast_v2.inc.h` (1 atomic)

**Build Flags:** (all default: 0)
- `HAKMEM_C7_FREE_COUNT_COMPILED`
- `HAKMEM_HDR_MISMATCH_LOG_COMPILED`
- `HAKMEM_HDR_META_MISMATCH_COMPILED`
- `HAKMEM_METRIC_BAD_CLASS_COMPILED`
- `HAKMEM_HDR_META_FAST_COMPILED`

**Results:**
- **Baseline (compiled-out):** 53.14 M ops/s (±0.96 M)
- **Compiled-in:** 53.31 M ops/s (±1.09 M)
- **Improvement:** **-0.33%** (within the ±0.5% noise margin)
- **Verdict:** **NEUTRAL** ➡️ Keep compiled-out for cleanliness ✅

**Analysis:** Low-frequency atomics (executed only in error/diagnostic paths) show no measurable impact. They are kept compiled-out for code cleanliness and maintainability.

**Reference:** `docs/analysis/PHASE26_HOT_PATH_ATOMIC_PRUNE_RESULTS.md`
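These diagnostic counters use the same gate shape as Phases 24/25. A minimal sketch, assuming a hypothetical counter and helper (`g_metric_bad_class`, `note_bad_class()`); the flag name is one of the real Phase 26 gates, but the surrounding code in the alloc/free paths may differ:

```c
/* Sketch only: the flag is a real Phase 26 gate, the counter and helper are illustrative. */
#include <stdatomic.h>
#include <stdint.h>

#ifndef HAKMEM_METRIC_BAD_CLASS_COMPILED
#  define HAKMEM_METRIC_BAD_CLASS_COMPILED 0   /* default: compiled out */
#endif

#if HAKMEM_METRIC_BAD_CLASS_COMPILED
static _Atomic uint64_t g_metric_bad_class;    /* hypothetical diagnostic counter */
#endif

static inline void note_bad_class(int class_idx) {
    (void)class_idx;
#if HAKMEM_METRIC_BAD_CLASS_COMPILED
    /* Research builds: count the bad-class event. */
    atomic_fetch_add_explicit(&g_metric_bad_class, 1, memory_order_relaxed);
#endif
    /* Production builds: both the counter and the atomic RMW disappear. */
}
```

Because these counters fire only on edge cases, the gate buys cleanliness rather than measurable speed, which is consistent with the NEUTRAL verdict above.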
---

### Phase 27: Unified Cache Stats Atomic Prune ✅ **GO (+0.74%)**

**Date:** 2025-12-16
**Target:** `g_unified_cache_*` (unified cache measurement atomics)
**Files:** `core/front/tiny_unified_cache.c`, `core/front/tiny_unified_cache.h`
**Atomics:** 6 global counters (hits, misses, refill cycles, per-class variants)
**Build Flag:** `HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED` (default: 0)

**Results:**
- **Baseline (compiled-out):** 52.94 M ops/s (mean), 53.59 M ops/s (median)
- **Compiled-in:** 52.55 M ops/s (mean), 53.06 M ops/s (median)
- **Improvement:** **+0.74% (mean), +1.01% (median)**
- **Verdict:** **GO** ✅ (keep compiled-out)

**Analysis:** WARM-path atomics (cache refill operations) show a measurable impact exceeding initial expectations (+0.2-0.4% expected, +0.74% actual). This suggests refill frequency is substantial in the random_mixed benchmark. The improvement validates the Phase 23 compile-out decision.

**Path:** WARM (unified cache refill: 3 locations; cache hits: 2 locations)
**Frequency:** Medium (every cache miss triggers a refill with 4 atomic ops + ENV check)

**Reference:** `docs/analysis/PHASE27_UNIFIED_CACHE_STATS_RESULTS.md`

---

### Phase 28: Background Spill Queue Atomic Audit ✅ **NO-OP (All CORRECTNESS)**

**Date:** 2025-12-16
**Target:** Background spill queue atomics (`g_bg_spill_head`, `g_bg_spill_len`)
**Files:** `core/hakmem_tiny_bg_spill.h`, `core/hakmem_tiny_bg_spill.c`
**Atomics:** 8 atomic operations (CAS loops, queue management)
**Build Flag:** None (no compile-out candidates)

**Audit Results:**
- **CORRECTNESS Atomics:** 8/8 (100%)
- **TELEMETRY Atomics:** 0/8 (0%)
- **Verdict:** **NO-OP** (no action taken)

**Analysis:** All atomics are critical for correctness:
1. **Lock-free queue operations:** `atomic_load`, `atomic_compare_exchange_weak` for CAS loops
2. **Queue length tracking (`g_bg_spill_len`):** Used for **flow control**, NOT telemetry
   - Checked in `tiny_free_magazine.inc.h:76-77` to decide whether to queue work
   - Controls queue depth to prevent unbounded growth
   - This is an operational counter, not a debug counter

**Key Finding:** `g_bg_spill_len` is superficially similar to telemetry counters, but serves a critical role:

```c
uint32_t qlen = atomic_load_explicit(&g_bg_spill_len[class_idx], memory_order_relaxed);
if ((int)qlen < g_bg_spill_target) {  // FLOW CONTROL DECISION
    // Queue work to background spill
}
```

**Conclusion:** The background spill queue is a lock-free data structure. All of its atomics are untouchable. Phase 28 completes with **no code changes**.

**Reference:** `docs/analysis/PHASE28_BG_SPILL_ATOMIC_AUDIT.md`

---

## Cumulative Impact

| Phase | Atomics Removed | Frequency | Impact | Status |
|-------|-----------------|-----------|--------|--------|
| 24 | 5 (class stats) | High (every cache op) | **+0.93%** | GO ✅ |
| 25 | 1 (free_ss_enter) | High (every free) | **+1.07%** | GO ✅ |
| 26 | 5 (diagnostics) | Low (edge cases) | -0.33% | NEUTRAL ✅ |
| 27 | 6 (unified cache) | Medium (refills) | **+0.74%** | GO ✅ |
| **28** | **0 (bg spill)** | **N/A (all CORRECTNESS)** | **N/A** | **NO-OP ✅** |
| **Total** | **17 atomics** | **Mixed** | **+2.74%** | **✅** |

**Key Insight:** Atomic frequency matters more than count. High-frequency atomics (Phases 24+25) provide measurable benefit (+0.93%, +1.07%). Medium-frequency atomics (Phase 27, WARM path) provide substantial benefit (+0.74%). Low-frequency atomics (Phase 26) provide cleanliness but no performance gain. **Correctness atomics are untouchable** (Phase 28).
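The Phase 28 finding generalizes into a simple classification test: if any load of a counter feeds a branch, it is operational (CORRECTNESS); if it is only ever incremented and read by tooling, it is telemetry. A minimal sketch of the two shapes side by side; the `g_bg_spill_len` / `g_bg_spill_target` names follow the audit above, while the other identifiers are hypothetical:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* TELEMETRY: nothing reads this to make a decision, so it is safe to gate out. */
static _Atomic uint64_t g_free_calls_seen;        /* hypothetical counter */

/* OPERATIONAL: the load below feeds a branch, so removing it changes behavior. */
static _Atomic uint32_t g_bg_spill_len_example;   /* models g_bg_spill_len[class_idx] */
static int g_bg_spill_target_example = 64;        /* models g_bg_spill_target */

static bool should_queue_to_bg_spill(void) {
    /* Telemetry-only increment: deleting it cannot change any decision. */
    atomic_fetch_add_explicit(&g_free_calls_seen, 1, memory_order_relaxed);

    /* Flow-control read: deleting it would break queue-depth limiting. */
    uint32_t qlen = atomic_load_explicit(&g_bg_spill_len_example,
                                         memory_order_relaxed);
    return (int)qlen < g_bg_spill_target_example;
}
```

The classification question is therefore not "does it look like a counter?" but "does anything branch on it?".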
---

## Lessons Learned

### 1. Frequency Trumps Count

- **Phase 24:** 5 atomics, high frequency → +0.93% ✅
- **Phase 25:** 1 atomic, high frequency → +1.07% ✅
- **Phase 26:** 5 atomics, low frequency → -0.33% (NEUTRAL)

**Takeaway:** Focus on always-executed atomics, not just atomic count.

### 2. Edge Cases Don't Matter (Performance-Wise)

- Phase 26 atomics are in error/diagnostic paths (header mismatch, bad class, etc.)
- Rarely executed in benchmarks → no measurable impact
- Still worth compiling out for code cleanliness

### 3. Compile-Time Gates Work Well

- Pattern: `#if HAKMEM_*_COMPILED` (default: 0)
- Clean separation between research (compiled-in) and production (compiled-out)
- Easy to A/B test individual flags

### 4. Noise Margin: ±0.5%

- Benchmark variance ~1-2%
- Improvements <0.5% are within noise
- NEUTRAL verdict: keep simpler code (compiled-out)

### 5. Classification is Critical

- **Phase 28:** All atomics were CORRECTNESS (lock-free queue, flow control)
- Must distinguish between:
  - **Telemetry counters:** Observational only, safe to compile-out
  - **Operational counters:** Used for control flow decisions, UNTOUCHABLE
- Example: `g_bg_spill_len` looks like telemetry but controls queue depth limits

---

## Next Phase Candidates (Phase 29+)

### High Priority: Warm Path Atomics

1. ~~**Background Spill Queue** (Phase 28)~~ ✅ **COMPLETE (NO-OP)**
   - **Result:** All CORRECTNESS atomics, no compile-out candidates
   - **Reason:** Lock-free queue + flow control counter

### Medium Priority: Warm-ish Path Atomics

2. **Remote Target Queue** (Phase 29 candidate)
   - **Targets:** `g_remote_target_len[class_idx]` atomics
   - **File:** `core/hakmem_tiny_remote_target.c`
   - **Atomics:** `atomic_fetch_add/sub` on queue length
   - **Frequency:** Warm (remote free path)
   - **Expected Gain:** +0.1-0.3% (if telemetry)
   - **Priority:** MEDIUM (needs correctness review - similar to bg_spill)
   - **Warning:** May be flow control like `g_bg_spill_len`, needs audit

### Low Priority: Cold Path Atomics

3. **SuperSlab OS Stats** (Phase 29+)
   - **Targets:** `g_ss_os_alloc_calls`, `g_ss_os_madvise_calls`, etc.
   - **Files:** `core/box/ss_os_acquire_box.h`, `core/box/madvise_guard_box.c`
   - **Frequency:** Cold (init/mmap/madvise)
   - **Expected Gain:** <0.1%
   - **Priority:** LOW (code cleanliness only)

4. **Shared Pool Diagnostics** (Phase 30+)
   - **Targets:** `rel_c7_*`, `dbg_c7_*` (release/acquire logs)
   - **Files:** `core/hakmem_shared_pool_acquire.c`, `core/hakmem_shared_pool_release.c`
   - **Frequency:** Cold (shared pool operations)
   - **Expected Gain:** <0.1%
   - **Priority:** LOW

5. **Pool Hotbox v2 Stats** (Phase 31+)
   - **Targets:** `g_pool_hotbox_v2_stats[ci].*` counters
   - **File:** `core/hakmem_pool.c`
   - **Atomics:** ~15 stats counters (alloc_calls, free_calls, etc.)
   - **Frequency:** Medium-High (pool operations)
   - **Expected Gain:** +0.2-0.5% (if high-frequency)
   - **Priority:** MEDIUM
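The expected-gain estimates above amount to frequency estimates multiplied by the cost of one relaxed atomic increment. A standalone sketch (not HAKMEM code; file name and iteration count are illustrative) for measuring the uncontended per-increment cost on a given machine; contended counters in a multi-threaded run cost considerably more:

```c
/* atomic_cost.c (illustrative): rough cost of one relaxed fetch_add.
 * Build: cc -O2 -std=gnu11 atomic_cost.c -o atomic_cost */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static _Atomic uint64_t g_counter;

int main(void) {
    const uint64_t iters = 100000000ull;  /* 1e8 increments */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++) {
        atomic_fetch_add_explicit(&g_counter, 1, memory_order_relaxed);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per fetch_add (counter=%llu)\n",
           ns / (double)iters, (unsigned long long)atomic_load(&g_counter));
    return 0;
}
```

At roughly 50 M ops/s the per-operation budget is about 20 ns, so even a few nanoseconds per increment is a visible fraction of it; that is why the high-frequency counters dominate the cumulative table above.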
---

## Pattern Template (For Future Phases)

### Step 1: Add Build Flag

```c
// core/hakmem_build_flags.h
#ifndef HAKMEM_[NAME]_COMPILED
# define HAKMEM_[NAME]_COMPILED 0
#endif
```

### Step 2: Wrap Atomic

```c
// core/[file].c
#if HAKMEM_[NAME]_COMPILED
atomic_fetch_add_explicit(&g_[name], 1, memory_order_relaxed);
#else
(void)0; // No-op when compiled out
#endif
```

### Step 3: A/B Test

```bash
# Baseline (compiled-out, default)
make clean && make -j bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh > baseline.txt

# Compiled-in
make clean && make -j EXTRA_CFLAGS='-DHAKMEM_[NAME]_COMPILED=1' bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh > compiled_in.txt
```

### Step 4: Analyze & Verdict

```python
# baseline_avg / compiled_in_avg: mean M ops/s taken from baseline.txt / compiled_in.txt
improvement = ((baseline_avg - compiled_in_avg) / compiled_in_avg) * 100

if improvement >= 0.5:
    verdict = "GO (keep compiled-out)"
elif improvement <= -0.5:
    verdict = "NO-GO (revert, compiled-in is better)"
else:
    verdict = "NEUTRAL (keep compiled-out for cleanliness)"
```

### Step 5: Document

Create `docs/analysis/PHASE[N]_[NAME]_RESULTS.md` with:
- Implementation details
- A/B test results
- Verdict & reasoning
- Files modified

---

## Build Flag Summary

All atomic compile gates in `core/hakmem_build_flags.h`:

```c
// Phase 24: Tiny Class Stats (GO +0.93%)
#ifndef HAKMEM_TINY_CLASS_STATS_COMPILED
# define HAKMEM_TINY_CLASS_STATS_COMPILED 0
#endif

// Phase 25: Tiny Free Stats (GO +1.07%)
#ifndef HAKMEM_TINY_FREE_STATS_COMPILED
# define HAKMEM_TINY_FREE_STATS_COMPILED 0
#endif

// Phase 27: Unified Cache Stats (GO +0.74%)
#ifndef HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
# define HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED 0
#endif

// Phase 26A: C7 Free Count (NEUTRAL -0.33%)
#ifndef HAKMEM_C7_FREE_COUNT_COMPILED
# define HAKMEM_C7_FREE_COUNT_COMPILED 0
#endif

// Phase 26B: Header Mismatch Log (NEUTRAL)
#ifndef HAKMEM_HDR_MISMATCH_LOG_COMPILED
# define HAKMEM_HDR_MISMATCH_LOG_COMPILED 0
#endif

// Phase 26C: Header Meta Mismatch (NEUTRAL)
#ifndef HAKMEM_HDR_META_MISMATCH_COMPILED
# define HAKMEM_HDR_META_MISMATCH_COMPILED 0
#endif

// Phase 26D: Metric Bad Class (NEUTRAL)
#ifndef HAKMEM_METRIC_BAD_CLASS_COMPILED
# define HAKMEM_METRIC_BAD_CLASS_COMPILED 0
#endif

// Phase 26E: Header Meta Fast (NEUTRAL)
#ifndef HAKMEM_HDR_META_FAST_COMPILED
# define HAKMEM_HDR_META_FAST_COMPILED 0
#endif
```

**Default State:** All flags = 0 (compiled-out, production-ready)
**Research Use:** Set a flag to 1 to enable a specific telemetry atomic
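To keep that all-flags-zero default from regressing silently, a build-time guard could sit next to the flag block. A sketch only; `HAKMEM_PRODUCTION_BUILD` is a hypothetical configuration macro, while the gate flags are the real ones listed above:

```c
/* Sketch: refuse to build a production configuration with telemetry gates enabled.
 * HAKMEM_PRODUCTION_BUILD is hypothetical; the gate flags are the real ones above. */
#if defined(HAKMEM_PRODUCTION_BUILD)
# if HAKMEM_TINY_CLASS_STATS_COMPILED || HAKMEM_TINY_FREE_STATS_COMPILED || \
     HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
#  error "Telemetry atomics must stay compiled out in production builds"
# endif
#endif
```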
---

## Conclusion

**Total Progress (Phase 24+25+26+27+28):**
- **Performance Gain:** +2.74% (Phase 24: +0.93%, Phase 25: +1.07%, Phase 26: NEUTRAL, Phase 27: +0.74%, Phase 28: NO-OP)
- **Atomics Removed:** 17 telemetry atomics from hot/warm paths
- **Phases Completed:** 5 phases (4 with changes, 1 audit-only)
- **Code Quality:** Cleaner hot/warm paths, closer to mimalloc's zero-overhead principle
- **Next Target:** Phase 29 (remote target queue or pool hotbox v2 stats)

**Key Success Factors:**
1. Systematic audit and classification (CORRECTNESS vs TELEMETRY)
2. Consistent A/B testing methodology
3. Clear verdict criteria (GO/NEUTRAL/NO-GO)
4. Focus on high-frequency atomics for performance
5. Compile out low-frequency atomics for cleanliness

**Future Work:**
- Continue with Phase 29+ (warm/cold path atomics)
- Expected cumulative gain: +3.0-3.5% total (already at +2.74%)
- Focus on high-frequency paths; audit carefully for CORRECTNESS vs TELEMETRY
- Document all verdicts for reproducibility

**Lessons from Phase 28:**
- Not all atomic counters are telemetry
- Flow-control counters (e.g., `g_bg_spill_len`) are CORRECTNESS
- Always trace how a counter is used before classifying it

---

**Last Updated:** 2025-12-16
**Status:** Phase 24+25+26+27 Complete (+2.74%), Phase 28 Audit Complete (NO-OP)
**Maintained By:** Claude Sonnet 4.5