Summary:
- Phase 24 (alloc stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL (code cleanliness)
- Total: 11 atomics compiled-out, +2.00% improvement

Phase 24: OBSERVE tax prune (tiny_class_stats_box.h)
- Added HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)
- Wrapped 5 stats functions: uc_miss, warm_hit, shared_lock, tls_carve_*
- Result: +0.93% (baseline 56.675M vs compiled-in 56.151M ops/s)

Phase 25: Tiny free stats prune (tiny_superslab_free.inc.h)
- Added HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)
- Wrapped g_free_ss_enter atomic in free hot path
- Result: +1.07% (baseline 57.017M vs compiled-in 56.415M ops/s)

Phase 26: Hot path diagnostic atomics prune
- Added 5 compile gates for low-frequency error counters:
  - HAKMEM_TINY_C7_FREE_COUNT_COMPILED
  - HAKMEM_TINY_HDR_MISMATCH_LOG_COMPILED
  - HAKMEM_TINY_HDR_META_MISMATCH_COMPILED
  - HAKMEM_TINY_METRIC_BAD_CLASS_COMPILED
  - HAKMEM_TINY_HDR_META_FAST_COMPILED
- Result: -0.33% NEUTRAL (within noise, kept for cleanliness)

Alignment with mimalloc principles:
- "No atomics on hot path": telemetry moved to compile-time opt-in
- Fixed per-op tax elimination
- Production builds: maximum performance (atomics compiled-out)
- Research builds: full diagnostics (COMPILED=1)

Generated with Claude Code https://claude.com/claude-code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Phase 26: Hot Path Atomic Telemetry Prune - Audit & Plan
Date: 2025-12-16
Purpose: Identify and compile-out telemetry-only atomics in hot alloc/free paths
Pattern: Follow Phase 24 (tiny_class_stats) + Phase 25 (g_free_ss_enter)
Expected Gain: +2-3% cumulative improvement
Executive Summary
Goal: Remove all telemetry-only atomic_fetch_add/sub from hot paths (alloc/free direct paths).
Methodology:
- Audit all atomics in the core/ directory
- Classify: CORRECTNESS (keep) vs TELEMETRY (compile-out)
- Prioritize: HOT (direct alloc/free) > WARM (refill/spill) > COLD (init/shutdown)
- Implement compile gates following the Phase 24+25 pattern
- A/B test each candidate independently
Status: Phase 25 complete (+1.07% GO). Starting Phase 26.
Classification Criteria
CORRECTNESS (Do NOT touch)
- Remote queue management: remote_count, remote_head, remote_tail
- Refcount/ownership: refcount, owner, in_use, active
- Lock/synchronization: lock, mutex, head, tail (queue atomics)
- Metadata: meta->used, meta->active, meta->tls_cached
TELEMETRY (Candidate for compile-out)
- Stats counters: *_stats, *_count, *_calls
- Diagnostics: *_trace, *_debug, *_diag, *_log
- Observability: *_enter, *_exit, *_hit, *_miss, *_attempt, *_success
- Metrics: g_metric_*, g_dbg_*, g_rel_*
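The split between the two classes can be illustrated with a minimal sketch. All names here (`DEMO_TELEMETRY_COMPILED`, `demo_obj_t`, `demo_release`, `g_demo_release_calls`) are hypothetical and do not come from the hakmem sources. The refcount decrement feeds control flow and must stay. The call counter is never read by allocator logic, so it can sit behind a compile gate.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical gate for this sketch; 0 = telemetry compiled out. */
#ifndef DEMO_TELEMETRY_COMPILED
#  define DEMO_TELEMETRY_COMPILED 0
#endif

#if DEMO_TELEMETRY_COMPILED
/* TELEMETRY: never read by allocator logic, only by reports. */
static _Atomic uint64_t g_demo_release_calls;
#  define DEMO_STAT_INC(ctr) \
    ((void)atomic_fetch_add_explicit(&(ctr), 1, memory_order_relaxed))
#else
#  define DEMO_STAT_INC(ctr) ((void)0)
#endif

typedef struct {
    _Atomic uint32_t refcount; /* CORRECTNESS: decides when the object dies */
} demo_obj_t;

/* Returns true when the caller dropped the last reference. */
static bool demo_release(demo_obj_t *obj) {
    DEMO_STAT_INC(g_demo_release_calls); /* safe to compile out */
    /* Must stay: the returned value drives control flow. */
    return atomic_fetch_sub_explicit(&obj->refcount, 1,
                                     memory_order_acq_rel) == 1;
}
```

With the gate at 0 the function behaves identically. Only the fire-and-forget increment disappears.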
Phase 26 Candidates: HOT PATH TELEMETRY ATOMICS
Priority A: Direct Free Path (tiny_superslab_free.inc.h)
1. g_free_ss_enter - ALREADY DONE (Phase 25)
- Status: GO (+1.07%)
- Location: core/tiny_superslab_free.inc.h:22
- Gate: HAKMEM_TINY_FREE_STATS_COMPILED
- Verdict: Keep compiled-out (default: 0)
2. c7_free_count - NEW CANDIDATE
- Location: core/tiny_superslab_free.inc.h:51
- Code: atomic_fetch_add_explicit(&c7_free_count, 1, memory_order_relaxed);
- Purpose: Debug counter for C7 free path diagnostics
- Path: HOT (free superslab fast path)
- Expected Gain: +0.3-0.8%
- Priority: HIGH
- Action: Create Phase 26A
3. g_hdr_mismatch_log - NEW CANDIDATE
- Location: core/tiny_superslab_free.inc.h:147
- Code: atomic_fetch_add_explicit(&g_hdr_mismatch_log, 1, memory_order_relaxed);
- Purpose: Log header validation mismatches (debug only)
- Path: HOT (free path validation)
- Expected Gain: +0.2-0.5%
- Priority: HIGH
- Action: Create Phase 26B
4. g_hdr_meta_mismatch - NEW CANDIDATE
- Location: core/tiny_superslab_free.inc.h:182
- Code: atomic_fetch_add_explicit(&g_hdr_meta_mismatch, 1, memory_order_relaxed);
- Purpose: Log metadata validation failures (debug only)
- Path: HOT (free path validation)
- Expected Gain: +0.2-0.5%
- Priority: HIGH
- Action: Create Phase 26C
Priority B: Direct Alloc Path
5. g_metric_bad_class_once - NEW CANDIDATE
- Location: core/hakmem_tiny_alloc.inc:22
- Code: atomic_fetch_add_explicit(&g_metric_bad_class_once, 1, memory_order_relaxed)
- Purpose: One-shot metric for bad class index (safety check)
- Path: HOT (alloc entry gate)
- Expected Gain: +0.1-0.3%
- Priority: MEDIUM
- Action: Create Phase 26D
6. g_hdr_meta_fast - NEW CANDIDATE
- Location: core/tiny_free_fast_v2.inc.h:181
- Code: atomic_fetch_add_explicit(&g_hdr_meta_fast, 1, memory_order_relaxed);
- Purpose: Fast-path header metadata hit counter (telemetry)
- Path: HOT (free_fast_v2 path)
- Expected Gain: +0.3-0.7%
- Priority: HIGH
- Action: Create Phase 26E
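For the one-shot metric in candidate 5, a gated version might look roughly like the sketch below. It assumes the counter sits on the rejection branch of the class-index check, which the audit does not confirm. The names `DEMO_BAD_CLASS_METRIC_COMPILED`, `DEMO_NUM_CLASSES`, `g_demo_bad_class_once` and `demo_class_ok` are hypothetical. The bounds check itself is a safety check and stays. Only the counter and its log line are gated.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical gate; 0 = metric compiled out. */
#ifndef DEMO_BAD_CLASS_METRIC_COMPILED
#  define DEMO_BAD_CLASS_METRIC_COMPILED 0
#endif

#define DEMO_NUM_CLASSES 8 /* hypothetical class count */

#if DEMO_BAD_CLASS_METRIC_COMPILED
static _Atomic int g_demo_bad_class_once;
#endif

/* Returns 1 if class_idx is usable, 0 otherwise. */
static int demo_class_ok(int class_idx) {
    if (class_idx >= 0 && class_idx < DEMO_NUM_CLASSES)
        return 1; /* common case: no atomic is touched */
#if DEMO_BAD_CLASS_METRIC_COMPILED
    /* One-shot: only the first offender (previous value 0) logs. */
    if (atomic_fetch_add_explicit(&g_demo_bad_class_once, 1,
                                  memory_order_relaxed) == 0)
        fprintf(stderr, "[demo] bad class index %d\n", class_idx);
#endif
    return 0;
}
```

If the real counter is likewise reached only on the error branch, the measurable gain from gating it is likely to be small, since the common case never executes the atomic.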
Priority C: Warm Path (Refill/Spill)
7. g_bg_spill_len - BORDERLINE
- Location: core/hakmem_tiny_bg_spill.h:32,44
- Code: atomic_fetch_add_explicit(&g_bg_spill_len[class_idx], ...)
- Purpose: Background spill queue length tracking
- Path: WARM (spill path)
- Expected Gain: +0.1-0.2%
- Priority: MEDIUM
- Note: May be CORRECTNESS if queue length is used for flow control
- Action: Review code, then decide (Phase 27+)
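Why a length counter can be CORRECTNESS rather than TELEMETRY: if any code path reads the value to make a decision, compiling the counter out changes behavior. The sketch below is purely illustrative and does not reproduce the real spill queue. `DEMO_SPILL_LIMIT`, `g_demo_spill_len`, `demo_spill_try_push` and `demo_spill_pop` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SPILL_LIMIT 4u /* hypothetical back-pressure threshold */

static _Atomic uint32_t g_demo_spill_len;

/* Reserve one queue slot; refuse when the queue is already at the limit. */
static bool demo_spill_try_push(void) {
    uint32_t prev = atomic_fetch_add_explicit(&g_demo_spill_len, 1,
                                              memory_order_relaxed);
    if (prev >= DEMO_SPILL_LIMIT) {
        /* Flow control: undo the reservation and reject the push. */
        atomic_fetch_sub_explicit(&g_demo_spill_len, 1, memory_order_relaxed);
        return false;
    }
    return true;
}

/* Release one queue slot after the background worker drained an entry. */
static void demo_spill_pop(void) {
    atomic_fetch_sub_explicit(&g_demo_spill_len, 1, memory_order_relaxed);
}
```

If the review of hakmem_tiny_bg_spill.h finds a read like the `prev >= DEMO_SPILL_LIMIT` test, the counter belongs in the keep list.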
8. Unified Cache Stats - MULTIPLE ATOMICS
- Location: core/front/tiny_unified_cache.c (multiple lines)
- Variables: g_unified_cache_hits_global, g_unified_cache_misses_global, etc.
- Purpose: Unified cache hit/miss telemetry
- Path: WARM (cache layer)
- Expected Gain: +0.2-0.4%
- Priority: MEDIUM
- Action: Group into single Phase 27+ candidate
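One way to group several counters under a single gate is to keep them in one struct and route every increment through one macro. This is a sketch under assumptions. `DEMO_UNIFIED_CACHE_STATS_COMPILED`, `demo_uc_stats_t`, `DEMO_UC_STAT_INC` and `demo_uc_lookup` are hypothetical, and the one-slot cache is a toy stand-in for the real unified cache.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single gate covering all unified-cache counters. */
#ifndef DEMO_UNIFIED_CACHE_STATS_COMPILED
#  define DEMO_UNIFIED_CACHE_STATS_COMPILED 0
#endif

#if DEMO_UNIFIED_CACHE_STATS_COMPILED
typedef struct {
    _Atomic uint64_t hits;
    _Atomic uint64_t misses;
} demo_uc_stats_t;
static demo_uc_stats_t g_demo_uc_stats;
#  define DEMO_UC_STAT_INC(field)                              \
    ((void)atomic_fetch_add_explicit(&g_demo_uc_stats.field, 1, \
                                     memory_order_relaxed))
#else
#  define DEMO_UC_STAT_INC(field) ((void)0)
#endif

/* Toy one-slot cache: returns the cached pointer on hit, NULL on miss. */
static void *demo_uc_lookup(void **slot) {
    void *p = *slot;
    if (p != NULL) {
        *slot = NULL;
        DEMO_UC_STAT_INC(hits);
        return p;
    }
    DEMO_UC_STAT_INC(misses);
    return NULL;
}
```

A single gate keeps the A/B test to one build pair for the whole group, at the cost of not attributing the gain to individual counters.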
Phase 26 Implementation Plan
Phase 26A: c7_free_count Atomic Prune
Target: core/tiny_superslab_free.inc.h:51
Step 1: Add Build Flag
// core/hakmem_build_flags.h (after line 290)
// ------------------------------------------------------------
// Phase 26A: C7 Free Count Atomic Prune (Compile-out c7_free_count)
// ------------------------------------------------------------
// C7 Free Count: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need C7 free path diagnostics
// Target: c7_free_count atomic in core/tiny_superslab_free.inc.h:51
#ifndef HAKMEM_C7_FREE_COUNT_COMPILED
# define HAKMEM_C7_FREE_COUNT_COMPILED 0
#endif
Step 2: Wrap Atomic with Compile Gate
// core/tiny_superslab_free.inc.h:51
#if HAKMEM_C7_FREE_COUNT_COMPILED
extern _Atomic int c7_free_count;
int count = atomic_fetch_add_explicit(&c7_free_count, 1, memory_order_relaxed);
#else
int count = 0; // No-op when compiled out
(void)count; // Suppress unused warning
#endif
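One caveat with the fallback `int count = 0;`: if later code uses `count` to rate-limit logging, a constant 0 keeps that condition true on every call. This is an assumption, because the consumer of `count` is not shown in this audit. A hedged alternative is to place the consumer inside the same gate. `DEMO_C7_FREE_COUNT_COMPILED`, `g_demo_c7_free_count`, `demo_c7_free_diag` and the limit of 8 are hypothetical.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical gate mirroring the Phase 26A flag; 0 = compiled out. */
#ifndef DEMO_C7_FREE_COUNT_COMPILED
#  define DEMO_C7_FREE_COUNT_COMPILED 0
#endif

#if DEMO_C7_FREE_COUNT_COMPILED
static _Atomic int g_demo_c7_free_count;
#endif

/* Returns the number of diagnostic lines emitted for this free (0 or 1). */
static int demo_c7_free_diag(void *ptr) {
#if DEMO_C7_FREE_COUNT_COMPILED
    int count = atomic_fetch_add_explicit(&g_demo_c7_free_count, 1,
                                          memory_order_relaxed);
    if (count < 8) { /* hypothetical limit: log only the first 8 frees */
        fprintf(stderr, "[demo] C7 free #%d ptr=%p\n", count, ptr);
        return 1;
    }
#else
    (void)ptr; /* counter and its consumer are both compiled out */
#endif
    return 0;
}
```

In the default build neither the atomic nor the log branch remains, so no stale `count` value can reach downstream code.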
Step 3: A/B Test (Build-Level)
# Baseline (compiled-out, default)
make clean && make -j bench_random_mixed_hakmem
./bench_random_mixed_hakmem > baseline_26a.txt
# Compiled-in (for comparison)
make clean && make -j EXTRA_CFLAGS='-DHAKMEM_C7_FREE_COUNT_COMPILED=1' bench_random_mixed_hakmem
./bench_random_mixed_hakmem > compiled_in_26a.txt
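# Run full bench suite
# Rebuild the baseline first: the previous step left a compiled-in binary in place
make clean && make -j bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh > bench_26a_baseline.txt
make clean && make -j EXTRA_CFLAGS='-DHAKMEM_C7_FREE_COUNT_COMPILED=1' bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh > bench_26a_compiled.txt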
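The percentage compared against the verdict thresholds is the relative throughput change of the compiled-out build over the compiled-in build. Here is a minimal sketch of that arithmetic, assuming the per-run ops/s values have already been parsed out of the output files. `demo_mean` and `demo_gain_pct` are hypothetical helpers, not part of the bench scripts.

```c
#include <stddef.h>

/* Arithmetic mean of n throughput samples (ops/s); n must be > 0. */
static double demo_mean(const double *samples, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Relative change of the compiled-out build vs the compiled-in build, in %. */
static double demo_gain_pct(double compiled_out, double compiled_in) {
    return (compiled_out - compiled_in) / compiled_in * 100.0;
}
```

With the Phase 24 figures recorded for this work (56.675M vs 56.151M ops/s), `demo_gain_pct(56.675, 56.151)` evaluates to about 0.93.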
Step 4: Verdict
- GO: +0.5% or more → keep compiled-out (default: 0)
- NEUTRAL: ±0.5% → document, keep compiled-out for cleanliness
- NO-GO: -0.5% or worse → revert change
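The thresholds can be written down as a small classifier. This is a sketch. `demo_verdict_t` and `demo_verdict` are hypothetical names, and resolving the exact ±0.5% boundaries in favor of GO and NO-GO is an assumption, since the list above leaves the boundary case open.

```c
typedef enum { DEMO_GO, DEMO_NEUTRAL, DEMO_NO_GO } demo_verdict_t;

/* Map a measured gain (in percent) onto the GO/NEUTRAL/NO-GO rule. */
static demo_verdict_t demo_verdict(double gain_pct) {
    if (gain_pct >= 0.5)
        return DEMO_GO;      /* keep compiled-out */
    if (gain_pct <= -0.5)
        return DEMO_NO_GO;   /* revert the change */
    return DEMO_NEUTRAL;     /* within noise: document, keep for cleanliness */
}
```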
Phase 26B-E: Repeat Pattern
Follow same pattern for:
- 26B: g_hdr_mismatch_log (tiny_superslab_free.inc.h:147)
- 26C: g_hdr_meta_mismatch (tiny_superslab_free.inc.h:182)
- 26D: g_metric_bad_class_once (hakmem_tiny_alloc.inc:22)
- 26E: g_hdr_meta_fast (tiny_free_fast_v2.inc.h:181)
Each Phase:
- Add HAKMEM_[NAME]_COMPILED flag to hakmem_build_flags.h
- Wrap atomic with #if HAKMEM_[NAME]_COMPILED
- Run A/B test (baseline vs compiled-in)
- Measure improvement
- Document verdict
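When the same wrapper repeats across four call sites, a helper macro can cut the boilerplate. The sketch below is an alternative to the plan's `#if` blocks, not the plan's own pattern. It passes the flag as a compile-time constant and relies on the compiler dropping the dead branch, which optimizing builds normally do but the C standard does not guarantee. All `DEMO_*` and `g_demo_*` names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flags following the HAKMEM_[NAME]_COMPILED shape. */
#ifndef DEMO_HDR_MISMATCH_LOG_COMPILED
#  define DEMO_HDR_MISMATCH_LOG_COMPILED 0
#endif
#ifndef DEMO_HDR_META_FAST_COMPILED
#  define DEMO_HDR_META_FAST_COMPILED 0
#endif

/* flag must be a compile-time constant; with 0 the branch is dead code. */
#define DEMO_TELEMETRY_INC(flag, ctr)                                   \
    do {                                                                \
        if (flag)                                                       \
            atomic_fetch_add_explicit(&(ctr), 1, memory_order_relaxed); \
    } while (0)

static _Atomic uint32_t g_demo_hdr_mismatch_log;
static _Atomic uint32_t g_demo_hdr_meta_fast;

/* Toy free fast path: count header hits and mismatches behind their gates. */
static void demo_free_fast(int header_ok) {
    if (header_ok) {
        DEMO_TELEMETRY_INC(DEMO_HDR_META_FAST_COMPILED, g_demo_hdr_meta_fast);
    } else {
        DEMO_TELEMETRY_INC(DEMO_HDR_MISMATCH_LOG_COMPILED,
                           g_demo_hdr_mismatch_log);
    }
}
```

Unlike an `#if` block, this form keeps the counter variables declared in every build, so the preprocessor gate remains the stricter choice when the storage itself should vanish.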
Expected Cumulative Impact
| Phase | Target Atomic | File | Expected Gain | Status |
|---|---|---|---|---|
| 24 | g_tiny_class_stats_* | tiny_class_stats_box.h | +0.93% (measured) | GO ✅ |
| 25 | g_free_ss_enter | tiny_superslab_free.inc.h:22 | +1.07% (measured) | GO ✅ |
| 26A | c7_free_count | tiny_superslab_free.inc.h:51 | +0.3-0.8% | TBD |
| 26B | g_hdr_mismatch_log | tiny_superslab_free.inc.h:147 | +0.2-0.5% | TBD |
| 26C | g_hdr_meta_mismatch | tiny_superslab_free.inc.h:182 | +0.2-0.5% | TBD |
| 26D | g_metric_bad_class_once | hakmem_tiny_alloc.inc:22 | +0.1-0.3% | TBD |
| 26E | g_hdr_meta_fast | tiny_free_fast_v2.inc.h:181 | +0.3-0.7% | TBD |
| Total (24-26E) | - | - | +3.10-4.80% | - |
Conservative Estimate: +3.0% cumulative improvement from hot-path atomic prune.
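A cumulative figure can be derived from the per-phase rows in two ways: by simple addition, or by compounding the individual speedups. The sketch below does both so the two can be compared. The per-phase numbers are copied from the table (Phases 24 and 25 are measured, the rest are estimates). `demo_range_t` and the helper names are hypothetical.

```c
#include <stddef.h>

typedef struct { double lo, hi; } demo_range_t; /* gain range in percent */

static const demo_range_t k_demo_phase_gains[] = {
    {0.93, 0.93}, /* Phase 24 (measured) */
    {1.07, 1.07}, /* Phase 25 (measured) */
    {0.3, 0.8},   /* 26A */
    {0.2, 0.5},   /* 26B */
    {0.2, 0.5},   /* 26C */
    {0.1, 0.3},   /* 26D */
    {0.3, 0.7},   /* 26E */
};
#define DEMO_PHASE_COUNT \
    (sizeof k_demo_phase_gains / sizeof k_demo_phase_gains[0])

/* Additive total: plain sum of the low and high ends. */
static demo_range_t demo_total_additive(const demo_range_t *g, size_t n) {
    demo_range_t t = {0.0, 0.0};
    for (size_t i = 0; i < n; i++) {
        t.lo += g[i].lo;
        t.hi += g[i].hi;
    }
    return t;
}

/* Compounded total: multiply the (1 + gain) factors, then convert back to %. */
static demo_range_t demo_total_compounded(const demo_range_t *g, size_t n) {
    double lo = 1.0, hi = 1.0;
    for (size_t i = 0; i < n; i++) {
        lo *= 1.0 + g[i].lo / 100.0;
        hi *= 1.0 + g[i].hi / 100.0;
    }
    demo_range_t t = {(lo - 1.0) * 100.0, (hi - 1.0) * 100.0};
    return t;
}
```

At gains this small the two methods differ only by a few hundredths of a percent, so either is adequate for planning purposes.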
Next Steps
- ✅ Audit complete (this document)
- ⏳ Implement Phase 26A (c7_free_count)
- ⏳ Run A/B test (baseline vs compiled-in)
- ⏳ Document results in PHASE26A_C7_FREE_COUNT_RESULTS.md
- ⏳ Repeat for 26B-E
- ⏳ Create cumulative report
References
- Phase 24 Pattern: core/box/tiny_class_stats_box.h
- Phase 25 Pattern: core/tiny_superslab_free.inc.h:20-25
- Build Flags: core/hakmem_build_flags.h:274-290
- Mimalloc Principle: No atomics/observe in hot path
Notes
- DO NOT touch correctness atomics (remote_count, refcount, meta->used, etc.)
- ALWAYS A/B test each candidate independently (no batching)
- ALWAYS use build-level flags (compile-time, not runtime)
- FOLLOW Phase 24+25 pattern (#if COMPILED with default: 0)
- DOCUMENT all verdicts (GO/NEUTRAL/NO-GO)
mimalloc Gap Analysis: This work closes the "hot path atomic tax" gap identified in the optimization roadmap.