Target: g_pool_hotbox_v2_stats atomics (12 total) in Pool v2
Result: 0.00% impact (code path inactive by default, ENV-gated)
Verdict: NO-OP - Maintain compile-out for future-proofing

Audit Results:
- Classification: 12/12 TELEMETRY (100% observational)
- Counters: alloc_calls, alloc_fast, alloc_refill, alloc_refill_fail, alloc_fallback_v1, free_calls, free_fast, free_fallback_v1, page_of_fail_* (4 failure counters)
- Verification: All stats/logging only, zero flow control usage
- Phase 28 lesson applied: Traced all usages, confirmed no CORRECTNESS

Key Finding: Pool v2 OFF by default
- Requires HAKMEM_POOL_V2_ENABLED=1 to activate
- Benchmark never executes Pool v2 code paths
- Compile-out has zero performance impact (code never runs)

Implementation (future-ready):
- Added HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED (default: 0)
- Wrapped 13 atomic write sites in core/hakmem_pool.c
- Pattern: #if HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED ... #endif
- Expected impact if Pool v2 enabled: +0.3~0.8% (HOT+WARM atomics)

A/B Test Results:
- Baseline (COMPILED=0): 52.98 M ops/s (±0.43M, 0.81% stdev)
- Research (COMPILED=1): 53.31 M ops/s (±0.80M, 1.50% stdev)
- Delta: -0.62% (noise, not real effect - code path not active)

Critical Lesson Learned (NEW):
Phase 29 revealed ENV-gated features can appear on hot paths but never execute.
Updated audit checklist:
1. Classify atomics (CORRECTNESS vs TELEMETRY)
2. Verify no flow control usage
3. NEW: Verify code path is ACTIVE in benchmark (check ENV gates)
4. Implement compile-out
5. A/B test

Verification methods added to documentation:
- rg "getenv.*FEATURE" to check ENV gates
- perf record/report to verify execution
- Debug printf for quick validation

Cumulative Progress (Phase 24-29):
- Phase 24 (class stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL
- Phase 27 (unified cache): +0.74% GO
- Phase 28 (bg spill): NO-OP (all CORRECTNESS)
- Phase 29 (pool v2): NO-OP (inactive code path)
- Total: 17 atomics removed, +2.74% improvement

Documentation:
- PHASE29_POOL_HOTBOX_V2_AUDIT.md: Complete audit with TELEMETRY classification
- PHASE29_POOL_HOTBOX_V2_STATS_RESULTS.md: Results + new lesson learned
- ATOMIC_PRUNE_CUMULATIVE_SUMMARY.md: Updated with Phase 29 + new checklist
- PHASE29_COMPLETE.md: Completion summary with recommendations

Decision: Keep compile-out despite NO-OP
- Code cleanliness (binary size reduction)
- Future-proofing (ready when Pool v2 enabled)
- Consistency with Phase 24-28 pattern

Generated with Claude Code https://claude.com/claude-code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Hot Path Atomic Telemetry Prune - Cumulative Summary
Project: HAKMEM Memory Allocator - Hot Path Optimization
Goal: Remove all telemetry-only atomics from hot alloc/free paths
Principle: Follow mimalloc: no atomics or observation in the hot path
Status: Phase 24+25+26+27 Complete (+2.74% cumulative), Phase 28+29 Audit Complete (NO-OP x2)
Overview
This document tracks the systematic removal of telemetry-only atomic_fetch_add/sub operations from hot alloc/free code paths. Each phase follows a consistent pattern:
- Identify telemetry-only atomic (not CORRECTNESS)
- Add HAKMEM_*_COMPILED compile gate (default: 0)
- A/B test: baseline (compiled-out) vs compiled-in
- Verdict: GO (>+0.5%), NEUTRAL (±0.5%), or NO-GO (<-0.5%)
- Document and proceed to next candidate
Completed Phases
Phase 24: Tiny Class Stats Atomic Prune ✅ GO (+0.93%)
Date: 2025-12-15 (prior work)
Target: g_tiny_class_stats_* (per-class cache hit/miss counters)
File: core/box/tiny_class_stats_box.h
Atomics: 5 global counters (executed on every cache operation)
Build Flag: HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)
Results:
- Baseline (compiled-out): 57.8 M ops/s
- Compiled-in: 57.3 M ops/s
- Improvement: +0.93%
- Verdict: GO ✅ (keep compiled-out)
Analysis: High-frequency atomics (every cache hit/miss) show measurable impact. Compiling out provides nearly 1% improvement.
Reference: Pattern established in Phase 24, used as template for all subsequent phases.
Phase 25: Free Stats Atomic Prune ✅ GO (+1.07%)
Date: 2025-12-15 (prior work)
Target: g_free_ss_enter (superslab free entry counter)
File: core/tiny_superslab_free.inc.h:22
Atomics: 1 global counter (executed on every superslab free)
Build Flag: HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)
Results:
- Baseline (compiled-out): 58.4 M ops/s
- Compiled-in: 57.8 M ops/s
- Improvement: +1.07%
- Verdict: GO ✅ (keep compiled-out)
Analysis: Single high-frequency atomic (every free call) shows >1% impact. Demonstrates that even one hot-path atomic matters.
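To put the percentage in absolute terms, the two throughput figures can be converted to average time per operation. This is a rough sketch only: it assumes the reported M ops/s comes from a single benchmark thread, so that its reciprocal is a meaningful per-operation time, and `ns_per_op` is a hypothetical helper that is not part of HAKMEM.

```python
def ns_per_op(mops_per_sec: float) -> float:
    """Convert throughput in million ops/s to average nanoseconds per op."""
    # 1e9 ns per second divided by (mops_per_sec * 1e6) ops per second
    return 1e3 / mops_per_sec

baseline_ns = ns_per_op(58.4)     # compiled-out (Phase 25 baseline)
compiled_in_ns = ns_per_op(57.8)  # compiled-in
extra_ns = compiled_in_ns - baseline_ns

print(f"{baseline_ns:.2f} ns/op vs {compiled_in_ns:.2f} ns/op, extra {extra_ns:.2f} ns/op")
```

Under that assumption the single counter accounts for a little under 0.2 ns per operation, which is small in isolation but visible because it is paid on every free.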
Reference: docs/analysis/PHASE25_FREE_STATS_RESULTS.md (assumed from pattern)
Phase 26: Hot Path Diagnostic Atomics Prune ✅ NEUTRAL (-0.33%)
Date: 2025-12-16
Targets: 5 diagnostic atomics in hot-path edge cases
Files:
- core/tiny_superslab_free.inc.h (3 atomics)
- core/hakmem_tiny_alloc.inc (1 atomic)
- core/tiny_free_fast_v2.inc.h (1 atomic)
Build Flags: (all default: 0)
- HAKMEM_C7_FREE_COUNT_COMPILED
- HAKMEM_HDR_MISMATCH_LOG_COMPILED
- HAKMEM_HDR_META_MISMATCH_COMPILED
- HAKMEM_METRIC_BAD_CLASS_COMPILED
- HAKMEM_HDR_META_FAST_COMPILED
Results:
- Baseline (compiled-out): 53.14 M ops/s (±0.96M)
- Compiled-in: 53.31 M ops/s (±1.09M)
- Improvement: -0.33% (within ±0.5% noise margin)
- Verdict: NEUTRAL ➡️ Keep compiled-out for cleanliness ✅
Analysis: Low-frequency atomics (only in error/diagnostic paths) show no measurable impact. Kept compiled-out for code cleanliness and maintainability.
Reference: docs/analysis/PHASE26_HOT_PATH_ATOMIC_PRUNE_RESULTS.md
Phase 27: Unified Cache Stats Atomic Prune ✅ GO (+0.74%)
Date: 2025-12-16
Target: g_unified_cache_* (unified cache measurement atomics)
File: core/front/tiny_unified_cache.c, core/front/tiny_unified_cache.h
Atomics: 6 global counters (hits, misses, refill cycles, per-class variants)
Build Flag: HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED (default: 0)
Results:
- Baseline (compiled-out): 52.94 M ops/s (mean), 53.59 M ops/s (median)
- Compiled-in: 52.55 M ops/s (mean), 53.06 M ops/s (median)
- Improvement: +0.74% (mean), +1.01% (median)
- Verdict: GO ✅ (keep compiled-out)
Analysis: WARM path atomics (cache refill operations) show measurable impact exceeding initial expectations (+0.2-0.4% expected, +0.74% actual). This suggests refill frequency is substantial in the random_mixed benchmark. The improvement validates the Phase 23 compile-out decision.
Path: WARM (unified cache refill: 3 locations; cache hits: 2 locations)
Frequency: Medium (every cache miss triggers refill with 4 atomic ops + ENV check)
Reference: docs/analysis/PHASE27_UNIFIED_CACHE_STATS_RESULTS.md
Phase 28: Background Spill Queue Atomic Audit ✅ NO-OP (All CORRECTNESS)
Date: 2025-12-16
Target: Background spill queue atomics (g_bg_spill_head, g_bg_spill_len)
Files: core/hakmem_tiny_bg_spill.h, core/hakmem_tiny_bg_spill.c
Atomics: 8 atomic operations (CAS loops, queue management)
Build Flag: None (no compile-out candidates)
Audit Results:
- CORRECTNESS Atomics: 8/8 (100%)
- TELEMETRY Atomics: 0/8 (0%)
- Verdict: NO-OP (no action taken)
Analysis: All atomics are critical for correctness:
- Lock-free queue operations: atomic_load, atomic_compare_exchange_weak for CAS loops
- Queue length tracking (g_bg_spill_len): Used for flow control, NOT telemetry
  - Checked in tiny_free_magazine.inc.h:76-77 to decide whether to queue work
  - Controls queue depth to prevent unbounded growth
  - This is an operational counter, not a debug counter
Key Finding: g_bg_spill_len is superficially similar to telemetry counters, but serves a critical role:
uint32_t qlen = atomic_load_explicit(&g_bg_spill_len[class_idx], memory_order_relaxed);
if ((int)qlen < g_bg_spill_target) {  // FLOW CONTROL DECISION
    // Queue work to background spill
}
Conclusion: Background spill queue is a lock-free data structure. All atomics are untouchable. Phase 28 completes with no code changes.
Reference: docs/analysis/PHASE28_BG_SPILL_ATOMIC_AUDIT.md
Phase 29: Pool Hotbox v2 Stats Atomic Audit ✅ NO-OP (Code Not Active)
Date: 2025-12-16
Target: Pool Hotbox v2 stats atomics (g_pool_hotbox_v2_stats[ci].*)
Files: core/hakmem_pool.c, core/box/pool_hotbox_v2_box.h
Atomics: 12 atomic counters (alloc_calls, free_calls, alloc_fast, free_fast, etc.)
Build Flag: HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED (default: 0)
Audit Results:
- CORRECTNESS Atomics: 0/12 (0%)
- TELEMETRY Atomics: 12/12 (100%)
- Verdict: NO-OP (code path not active)
Analysis:
All 12 atomics are pure TELEMETRY (destructor dump only, no flow control). However, Pool Hotbox v2 is disabled by default via HAKMEM_POOL_V2_ENABLED environment variable, so these atomics are never executed in the benchmark.
A/B Test Results (Anomaly Detected):
- Baseline (compiled-out): 52.98 M ops/s (±0.43M)
- Compiled-in: 53.31 M ops/s (±0.80M)
- Improvement: -0.62% (compiled-in is faster!)
Root Cause: Pool v2 is OFF by default (ENV-gated):
const char* e = getenv("HAKMEM_POOL_V2_ENABLED");
g = (e && *e && *e != '0') ? 1 : 0; // Default: OFF
Result: Atomics are never incremented → compile-out has zero runtime effect.
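The gate expression is terse, so it helps to spell out which environment values turn Pool v2 on. The sketch below mirrors `(e && *e && *e != '0') ? 1 : 0` in Python for illustration; `pool_v2_gate` is a hypothetical name, and `None` stands in for `getenv()` returning NULL.

```python
from typing import Optional

def pool_v2_gate(value: Optional[str]) -> int:
    """Python mirror of the C expression (e && *e && *e != '0') ? 1 : 0."""
    if value is None:   # getenv() returned NULL: variable is unset
        return 0
    if value == "":     # first byte is the terminator: set but empty
        return 0
    # Only the first character is inspected, so "01" is OFF and "yes" is ON
    return 0 if value[0] == "0" else 1

for v in (None, "", "0", "1", "01", "yes"):
    print(repr(v), "->", pool_v2_gate(v))
```

The benchmark environment leaves the variable unset, which corresponds to the `None` case: the gate evaluates to 0 and none of the wrapped stats sites are reached.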
Why the anomaly (compiled-in measured 0.62% faster)?
- High variance (research build: 1.50% stdev vs baseline: 0.81%)
- Compiler optimization artifact (code layout, instruction cache alignment)
- Sample size (10 runs) insufficient to distinguish signal from noise
- Conclusion: Noise, not real effect
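The noise argument can be made concrete with a quick significance check on the summary statistics. This is a minimal sketch, assuming the ± values are sample standard deviations and that both builds were measured with 10 runs; `welch_t` is a hypothetical helper implementing the standard Welch t statistic.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic for two independent samples, from summary stats."""
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_a - mean_b) / standard_error

# Phase 29: compiled-in 53.31 ± 0.80, baseline 52.98 ± 0.43 (M ops/s), 10 runs each
t = welch_t(53.31, 0.80, 10, 52.98, 0.43, 10)
print(f"t = {t:.2f}")
```

The statistic comes out near 1.15, well short of the value of roughly 2 usually needed to call a difference significant at this sample size, which supports reading the 0.33 M ops/s gap as run-to-run variance.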
Decision: Keep compile-out despite NO-OP, for:
- Code cleanliness (reduces binary size)
- Future-proofing (ready if Pool v2 is enabled)
- Consistency with Phase 24-28 pattern
Key Lesson: Before A/B testing, verify code is ACTIVE:
rg "getenv.*FEATURE" && echo "⚠️ ENV-gated, may be OFF"
Updated Audit Checklist:
- ✅ Classify atomics (CORRECTNESS vs TELEMETRY)
- ✅ Verify no flow control usage
- NEW: ✅ Verify code path is ACTIVE in benchmark ← Phase 29 lesson
- Implement compile-out
- A/B test
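The new checklist step can also be scripted. The following is a sketch of the same idea as the `rg` one-liner, assuming gates are written as `getenv("NAME")` with a string literal; `find_env_gates` is a hypothetical helper and would miss gates that build the variable name at run time.

```python
import re
from typing import List

# Matches getenv("NAME") and captures NAME
GETENV_RE = re.compile(r'getenv\(\s*"([A-Za-z0-9_]+)"\s*\)')

def find_env_gates(c_source: str) -> List[str]:
    """Return environment variable names passed to getenv() as string literals."""
    return GETENV_RE.findall(c_source)

snippet = '''
const char* e = getenv("HAKMEM_POOL_V2_ENABLED");
g = (e && *e && *e != '0') ? 1 : 0;
'''
print(find_env_gates(snippet))
```

Any name this reports for a file under audit is a prompt to check whether the benchmark environment actually sets it before spending time on an A/B run.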
Reference: docs/analysis/PHASE29_POOL_HOTBOX_V2_STATS_RESULTS.md
Cumulative Impact
| Phase | Atomics Removed | Frequency | Impact | Status |
|---|---|---|---|---|
| 24 | 5 (class stats) | High (every cache op) | +0.93% | GO ✅ |
| 25 | 1 (free_ss_enter) | High (every free) | +1.07% | GO ✅ |
| 26 | 5 (diagnostics) | Low (edge cases) | -0.33% | NEUTRAL ✅ |
| 27 | 6 (unified cache) | Medium (refills) | +0.74% | GO ✅ |
| 28 | 0 (bg spill) | N/A (all CORRECTNESS) | N/A | NO-OP ✅ |
| 29 | 0 (pool v2) | N/A (code not active) | 0.00% | NO-OP ✅ |
| Total | 17 atomics | Mixed | +2.74% | ✅ |
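The totals in the last row can be reproduced from the per-phase rows. The sketch below assumes, as the table implies, that only GO verdicts count toward the gain and that NEUTRAL results are treated as noise; it also shows the compounded figure for comparison, since the phases were measured against different baselines.

```python
# (phase, atomics compiled out, measured gain in %, verdict) from the table above
phases = [
    (24, 5, 0.93, "GO"),
    (25, 1, 1.07, "GO"),
    (26, 5, -0.33, "NEUTRAL"),
    (27, 6, 0.74, "GO"),
    (28, 0, None, "NO-OP"),
    (29, 0, 0.00, "NO-OP"),
]

total_atomics = sum(count for _, count, _, _ in phases)

# Simple sum of GO phases, which is how the +2.74% headline is obtained
additive_gain = sum(gain for _, _, gain, verdict in phases if verdict == "GO")

# Multiplicative combination of the same three gains
factor = 1.0
for _, _, gain, verdict in phases:
    if verdict == "GO":
        factor *= 1 + gain / 100
compounded_gain = (factor - 1) * 100

print(total_atomics, round(additive_gain, 2), round(compounded_gain, 2))
```

The additive and compounded figures differ by about 0.02 percentage points, so the choice of method does not change the conclusion.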
Key Insights:
- Frequency matters more than count: High-frequency atomics (Phase 24+25) provide measurable benefit (+0.93%, +1.07%). Medium-frequency atomics (Phase 27, WARM path) provide substantial benefit (+0.74%). Low-frequency atomics (Phase 26) provide cleanliness but no performance gain.
- Correctness atomics are untouchable: Phase 28 showed that lock-free queues and flow control counters must not be touched.
- ENV-gated code paths need verification: Phase 29 showed that compile-out of inactive code has zero performance impact. Always verify code is active before A/B testing.
Lessons Learned
1. Frequency Trumps Count
- Phase 24: 5 atomics, high frequency → +0.93% ✅
- Phase 25: 1 atomic, high frequency → +1.07% ✅
- Phase 26: 5 atomics, low frequency → -0.33% (NEUTRAL)
Takeaway: Focus on always-executed atomics, not just atomic count.
2. Edge Cases Don't Matter (Performance-Wise)
- Phase 26 atomics are in error/diagnostic paths (header mismatch, bad class, etc.)
- Rarely executed in benchmarks → no measurable impact
- Still worth compiling out for code cleanliness
3. Compile-Time Gates Work Well
- Pattern: #if HAKMEM_*_COMPILED (default: 0)
- Clean separation between research (compiled-in) and production (compiled-out)
- Easy to A/B test individual flags
4. Noise Margin: ±0.5%
- Benchmark variance ~1-2%
- Improvements <0.5% are within noise
- NEUTRAL verdict: keep simpler code (compiled-out)
5. Classification is Critical
- Phase 28: All atomics were CORRECTNESS (lock-free queue, flow control)
- Must distinguish between:
- Telemetry counters: Observational only, safe to compile-out
- Operational counters: Used for control flow decisions, UNTOUCHABLE
- Example: g_bg_spill_len looks like telemetry but controls queue depth limits
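A first-pass triage for this distinction can be automated. The sketch below is a deliberately crude heuristic, not the audit procedure used in these phases: it flags a counter for manual review if the source contains any atomic load of it. `classify_counter` is a hypothetical helper, and the `g_free_ss_enter` line is an illustrative write site modelled on the template pattern rather than a quote from the code base.

```python
import re

def classify_counter(symbol: str, c_source: str) -> str:
    """Flag a counter for review if its value is ever read with an atomic load."""
    load_re = re.compile(
        r"atomic_load(?:_explicit)?\(\s*&?" + re.escape(symbol) + r"\b"
    )
    if load_re.search(c_source):
        return "NEEDS REVIEW (value is read; may be flow control)"
    return "TELEMETRY candidate (write-only in this source)"

bg_spill = """
uint32_t qlen = atomic_load_explicit(&g_bg_spill_len[class_idx], memory_order_relaxed);
if ((int)qlen < g_bg_spill_target) { /* queue work */ }
"""
free_stats = "atomic_fetch_add_explicit(&g_free_ss_enter, 1, memory_order_relaxed);"

print(classify_counter("g_bg_spill_len", bg_spill))
print(classify_counter("g_free_ss_enter", free_stats))
```

A "TELEMETRY candidate" result still needs a manual trace: the return value of a fetch-add can itself feed a branch, and a stats dump legitimately loads counters without making them operational.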
6. Verify Code is Active (NEW: Phase 29 Lesson)
- Phase 29: Pool v2 stats were all TELEMETRY but ENV-gated (default OFF)
- Compile-out had zero impact because code never ran
- Before A/B testing:
  - Check for getenv() gates → may be OFF by default
  - Add temporary debug printf to verify code path is hit
  - Or use perf record to check if functions are called
- Anomaly: Compiled-in was 0.62% faster (noise due to compiler artifacts, not real effect)
Next Phase Candidates (Phase 30+)
Completed Audits
- Background Spill Queue (Phase 28) ✅ COMPLETE (NO-OP)
  - Result: All CORRECTNESS atomics, no compile-out candidates
  - Reason: Lock-free queue + flow control counter
- Pool Hotbox v2 Stats (Phase 29) ✅ COMPLETE (NO-OP)
  - Result: All TELEMETRY atomics, but code path not active (ENV-gated)
  - Reason: HAKMEM_POOL_V2_ENABLED defaults to OFF
High Priority: Warm Path Atomics
- Remote Target Queue (Phase 30 candidate)
  - Targets: g_remote_target_len[class_idx] atomics
  - File: core/hakmem_tiny_remote_target.c
  - Atomics: atomic_fetch_add/sub on queue length
  - Frequency: Warm (remote free path)
  - Expected Gain: +0.1-0.3% (if telemetry)
  - Priority: MEDIUM (needs correctness review - similar to bg_spill)
  - Warning: May be flow control like g_bg_spill_len, needs audit
Low Priority: Cold Path Atomics
- SuperSlab OS Stats (Phase 30+)
  - Targets: g_ss_os_alloc_calls, g_ss_os_madvise_calls, etc.
  - Files: core/box/ss_os_acquire_box.h, core/box/madvise_guard_box.c
  - Frequency: Cold (init/mmap/madvise)
  - Expected Gain: <0.1%
  - Priority: LOW (code cleanliness only)
- Shared Pool Diagnostics (Phase 31+)
  - Targets: rel_c7_*, dbg_c7_* (release/acquire logs)
  - Files: core/hakmem_shared_pool_acquire.c, core/hakmem_shared_pool_release.c
  - Frequency: Cold (shared pool operations)
  - Expected Gain: <0.1%
  - Priority: LOW
Pattern Template (For Future Phases)
Step 1: Add Build Flag
// core/hakmem_build_flags.h
#ifndef HAKMEM_[NAME]_COMPILED
# define HAKMEM_[NAME]_COMPILED 0
#endif
Step 2: Wrap Atomic
// core/[file].c
#if HAKMEM_[NAME]_COMPILED
atomic_fetch_add_explicit(&g_[name], 1, memory_order_relaxed);
#else
(void)0; // No-op when compiled out
#endif
Step 3: A/B Test
# Baseline (compiled-out, default)
make clean && make -j bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh > baseline.txt
# Compiled-in
make clean && make -j EXTRA_CFLAGS='-DHAKMEM_[NAME]_COMPILED=1' bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh > compiled_in.txt
Step 4: Analyze & Verdict
improvement = ((baseline_avg - compiled_in_avg) / compiled_in_avg) * 100
if improvement >= 0.5:
    verdict = "GO (keep compiled-out)"
elif improvement <= -0.5:
    verdict = "NO-GO (revert, compiled-in is better)"
else:
    verdict = "NEUTRAL (keep compiled-out for cleanliness)"
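As a worked example, the same rule can be wrapped in a function and applied to the mean throughputs reported for Phases 26, 27 and 29. This is a sketch: the function name and return shape are illustrative, and results computed from the rounded means can differ from the reported percentages in the last digit.

```python
def verdict(baseline_avg: float, compiled_in_avg: float) -> tuple:
    """Return (improvement in %, verdict) using the ±0.5% thresholds."""
    improvement = (baseline_avg - compiled_in_avg) / compiled_in_avg * 100
    if improvement >= 0.5:
        return improvement, "GO"
    if improvement <= -0.5:
        return improvement, "NO-GO"
    return improvement, "NEUTRAL"

# (phase, baseline, compiled-in) mean throughput in M ops/s
runs = [(26, 53.14, 53.31), (27, 52.94, 52.55), (29, 52.98, 53.31)]
for phase, base, comp in runs:
    pct, v = verdict(base, comp)
    print(f"Phase {phase}: {pct:+.2f}% -> {v}")
```

Applied mechanically, the rule labels Phase 29 as NO-GO because -0.62% crosses the threshold. The recorded verdict is NO-OP instead, since the gated code never ran, which is why the active-path check has to come before the A/B comparison.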
Step 5: Document
Create docs/analysis/PHASE[N]_[NAME]_RESULTS.md with:
- Implementation details
- A/B test results
- Verdict & reasoning
- Files modified
Build Flag Summary
All atomic compile gates in core/hakmem_build_flags.h:
// Phase 24: Tiny Class Stats (GO +0.93%)
#ifndef HAKMEM_TINY_CLASS_STATS_COMPILED
# define HAKMEM_TINY_CLASS_STATS_COMPILED 0
#endif
// Phase 25: Tiny Free Stats (GO +1.07%)
#ifndef HAKMEM_TINY_FREE_STATS_COMPILED
# define HAKMEM_TINY_FREE_STATS_COMPILED 0
#endif
// Phase 27: Unified Cache Stats (GO +0.74%)
#ifndef HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
# define HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED 0
#endif
// Phase 26A: C7 Free Count (NEUTRAL -0.33%)
#ifndef HAKMEM_C7_FREE_COUNT_COMPILED
# define HAKMEM_C7_FREE_COUNT_COMPILED 0
#endif
// Phase 26B: Header Mismatch Log (NEUTRAL)
#ifndef HAKMEM_HDR_MISMATCH_LOG_COMPILED
# define HAKMEM_HDR_MISMATCH_LOG_COMPILED 0
#endif
// Phase 26C: Header Meta Mismatch (NEUTRAL)
#ifndef HAKMEM_HDR_META_MISMATCH_COMPILED
# define HAKMEM_HDR_META_MISMATCH_COMPILED 0
#endif
// Phase 26D: Metric Bad Class (NEUTRAL)
#ifndef HAKMEM_METRIC_BAD_CLASS_COMPILED
# define HAKMEM_METRIC_BAD_CLASS_COMPILED 0
#endif
// Phase 26E: Header Meta Fast (NEUTRAL)
#ifndef HAKMEM_HDR_META_FAST_COMPILED
# define HAKMEM_HDR_META_FAST_COMPILED 0
#endif
// Phase 29: Pool Hotbox v2 Stats (NO-OP - code not active)
#ifndef HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED
# define HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED 0
#endif
Default State: All flags = 0 (compiled-out, production-ready)
Research Use: Set flag = 1 to enable specific telemetry atomic
Conclusion
Total Progress (Phase 24+25+26+27+28+29):
- Performance Gain: +2.74% (Phase 24: +0.93%, Phase 25: +1.07%, Phase 26: NEUTRAL, Phase 27: +0.74%, Phase 28: NO-OP, Phase 29: NO-OP)
- Atomics Removed: 17 telemetry atomics from hot/warm paths
- Phases Completed: 6 phases (4 with changes, 2 audit-only)
- Code Quality: Cleaner hot/warm paths, closer to mimalloc's zero-overhead principle
- Next Target: Phase 30 (remote target queue or other ACTIVE code paths)
Key Success Factors:
- Systematic audit and classification (CORRECTNESS vs TELEMETRY)
- Consistent A/B testing methodology
- Clear verdict criteria (GO/NEUTRAL/NO-GO)
- Focus on high-frequency atomics for performance
- Compile-out low-frequency atomics for cleanliness
Future Work:
- Continue Phase 30+ (warm/cold path atomics)
- Expected cumulative gain: +3.0-3.5% total (already at +2.74%)
- Focus on high-frequency paths, audit carefully for CORRECTNESS vs TELEMETRY
- Document all verdicts for reproducibility
Lessons from Phase 28+29:
- Not all atomic counters are telemetry (Phase 28: flow control counters are CORRECTNESS)
- Flow control counters (e.g., g_bg_spill_len) are UNTOUCHABLE
- Always trace how counter is used before classifying
- Verify code path is ACTIVE before A/B testing (Phase 29: ENV-gated code has zero impact)
Last Updated: 2025-12-16
Status: Phase 24+25+26+27 Complete (+2.74%), Phase 28+29 Audit Complete (NO-OP x2)
Maintained By: Claude Sonnet 4.5