hakmem/docs/analysis/PHASE26_HOT_PATH_ATOMIC_AUDIT.md
Moe Charm (CI) 8052e8b320 Phase 24-26: Hot path atomic telemetry prune (+2.00% cumulative)
Summary:
- Phase 24 (alloc stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL (code cleanliness)
- Total: 11 atomics compiled-out, +2.00% improvement

Phase 24: OBSERVE tax prune (tiny_class_stats_box.h)
- Added HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)
- Wrapped 5 stats functions: uc_miss, warm_hit, shared_lock, tls_carve_*
- Result: +0.93% (baseline 56.675M vs compiled-in 56.151M ops/s)

Phase 25: Tiny free stats prune (tiny_superslab_free.inc.h)
- Added HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)
- Wrapped g_free_ss_enter atomic in free hot path
- Result: +1.07% (baseline 57.017M vs compiled-in 56.415M ops/s)

Phase 26: Hot path diagnostic atomics prune
- Added 5 compile gates for low-frequency error counters:
  - HAKMEM_TINY_C7_FREE_COUNT_COMPILED
  - HAKMEM_TINY_HDR_MISMATCH_LOG_COMPILED
  - HAKMEM_TINY_HDR_META_MISMATCH_COMPILED
  - HAKMEM_TINY_METRIC_BAD_CLASS_COMPILED
  - HAKMEM_TINY_HDR_META_FAST_COMPILED
- Result: -0.33% NEUTRAL (within noise, kept for cleanliness)

Alignment with mimalloc principles:
- "No atomics on hot path" - telemetry moved to compile-time opt-in
- Fixed per-op tax elimination
- Production builds: maximum performance (atomics compiled-out)
- Research builds: full diagnostics (COMPILED=1)

Generated with Claude Code
https://claude.com/claude-code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-16 05:35:11 +09:00


Phase 26: Hot Path Atomic Telemetry Prune - Audit & Plan

Date: 2025-12-16
Purpose: Identify and compile-out telemetry-only atomics in hot alloc/free paths
Pattern: Follow Phase 24 (tiny_class_stats) + Phase 25 (g_free_ss_enter)
Expected Gain: +2-3% cumulative improvement


Executive Summary

Goal: Remove all telemetry-only atomic_fetch_add/sub from hot paths (alloc/free direct paths).

Methodology:

  1. Audit all atomics in core/ directory
  2. Classify: CORRECTNESS (keep) vs TELEMETRY (compile-out)
  3. Prioritize: HOT (direct alloc/free) > WARM (refill/spill) > COLD (init/shutdown)
  4. Implement compile gates following Phase 24+25 pattern
  5. A/B test each candidate independently

Status: Phase 25 complete (+1.07% GO). Starting Phase 26.


Classification Criteria

CORRECTNESS (Do NOT touch)

  • Remote queue management: remote_count, remote_head, remote_tail
  • Refcount/ownership: refcount, owner, in_use, active
  • Lock/synchronization: lock, mutex, head, tail (queue atomics)
  • Metadata: meta->used, meta->active, meta->tls_cached

TELEMETRY (Candidate for compile-out)

  • Stats counters: *_stats, *_count, *_calls
  • Diagnostics: *_trace, *_debug, *_diag, *_log
  • Observability: *_enter, *_exit, *_hit, *_miss, *_attempt, *_success
  • Metrics: g_metric_*, g_dbg_*, g_rel_*
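The name patterns above can be turned into a first-pass filter. The sketch below is illustrative only: `classify_atomic_name` and the `atomic_kind_t` enum are hypothetical names, and the keep-list is checked first because a correctness atomic such as `remote_count` would otherwise match the telemetry pattern `*_count`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    ATOMIC_CORRECTNESS,   /* keep: never compile out */
    ATOMIC_TELEMETRY,     /* candidate for a compile gate */
    ATOMIC_NEEDS_REVIEW   /* no pattern matched: read the code */
} atomic_kind_t;

/* True when `name` ends with `suffix`. */
static bool ends_with(const char *name, const char *suffix) {
    size_t n = strlen(name), s = strlen(suffix);
    return n >= s && strcmp(name + n - s, suffix) == 0;
}

/* True when `name` starts with `prefix`. */
static bool starts_with(const char *name, const char *prefix) {
    return strncmp(name, prefix, strlen(prefix)) == 0;
}

/* Hypothetical first-pass classifier built from the lists in this section. */
atomic_kind_t classify_atomic_name(const char *name) {
    /* Exact names from the CORRECTNESS list; these win over any pattern. */
    static const char *const keep[] = {
        "remote_count", "remote_head", "remote_tail", "refcount", "owner",
        "in_use", "active", "lock", "mutex", "head", "tail",
        "used", "tls_cached",
    };
    /* Suffix and prefix patterns from the TELEMETRY list. */
    static const char *const tele_suffix[] = {
        "_stats", "_count", "_calls", "_trace", "_debug", "_diag", "_log",
        "_enter", "_exit", "_hit", "_miss", "_attempt", "_success",
    };
    static const char *const tele_prefix[] = { "g_metric_", "g_dbg_", "g_rel_" };

    for (size_t i = 0; i < sizeof keep / sizeof keep[0]; i++)
        if (strcmp(name, keep[i]) == 0) return ATOMIC_CORRECTNESS;
    for (size_t i = 0; i < sizeof tele_suffix / sizeof tele_suffix[0]; i++)
        if (ends_with(name, tele_suffix[i])) return ATOMIC_TELEMETRY;
    for (size_t i = 0; i < sizeof tele_prefix / sizeof tele_prefix[0]; i++)
        if (starts_with(name, tele_prefix[i])) return ATOMIC_TELEMETRY;
    return ATOMIC_NEEDS_REVIEW;
}
```

Names such as `g_hdr_meta_mismatch` or `g_bg_spill_len` match no pattern and fall into the review bucket, so name matching can only narrow the audit; it does not replace reading how each atomic's value is used.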

Phase 26 Candidates: HOT PATH TELEMETRY ATOMICS

Priority A: Direct Free Path (tiny_superslab_free.inc.h)

1. g_free_ss_enter - ALREADY DONE (Phase 25)

  • Status: GO (+1.07%)
  • Location: core/tiny_superslab_free.inc.h:22
  • Gate: HAKMEM_TINY_FREE_STATS_COMPILED
  • Verdict: Keep compiled-out (default: 0)

2. c7_free_count - NEW CANDIDATE

  • Location: core/tiny_superslab_free.inc.h:51
  • Code: atomic_fetch_add_explicit(&c7_free_count, 1, memory_order_relaxed);
  • Purpose: Debug counter for C7 free path diagnostics
  • Path: HOT (free superslab fast path)
  • Expected Gain: +0.3-0.8%
  • Priority: HIGH
  • Action: Create Phase 26A

3. g_hdr_mismatch_log - NEW CANDIDATE

  • Location: core/tiny_superslab_free.inc.h:147
  • Code: atomic_fetch_add_explicit(&g_hdr_mismatch_log, 1, memory_order_relaxed);
  • Purpose: Log header validation mismatches (debug only)
  • Path: HOT (free path validation)
  • Expected Gain: +0.2-0.5%
  • Priority: HIGH
  • Action: Create Phase 26B

4. g_hdr_meta_mismatch - NEW CANDIDATE

  • Location: core/tiny_superslab_free.inc.h:182
  • Code: atomic_fetch_add_explicit(&g_hdr_meta_mismatch, 1, memory_order_relaxed);
  • Purpose: Log metadata validation failures (debug only)
  • Path: HOT (free path validation)
  • Expected Gain: +0.2-0.5%
  • Priority: HIGH
  • Action: Create Phase 26C

Priority B: Direct Alloc Path

5. g_metric_bad_class_once - NEW CANDIDATE

  • Location: core/hakmem_tiny_alloc.inc:22
  • Code: atomic_fetch_add_explicit(&g_metric_bad_class_once, 1, memory_order_relaxed)
  • Purpose: One-shot metric for bad class index (safety check)
  • Path: HOT (alloc entry gate)
  • Expected Gain: +0.1-0.3%
  • Priority: MEDIUM
  • Action: Create Phase 26D

6. g_hdr_meta_fast - NEW CANDIDATE

  • Location: core/tiny_free_fast_v2.inc.h:181
  • Code: atomic_fetch_add_explicit(&g_hdr_meta_fast, 1, memory_order_relaxed);
  • Purpose: Fast-path header metadata hit counter (telemetry)
  • Path: HOT (free_fast_v2 path)
  • Expected Gain: +0.3-0.7%
  • Priority: HIGH
  • Action: Create Phase 26E

Priority C: Warm Path (Refill/Spill)

7. g_bg_spill_len - BORDERLINE

  • Location: core/hakmem_tiny_bg_spill.h:32,44
  • Code: atomic_fetch_add_explicit(&g_bg_spill_len[class_idx], ...)
  • Purpose: Background spill queue length tracking
  • Path: WARM (spill path)
  • Expected Gain: +0.1-0.2%
  • Priority: MEDIUM
  • Note: May be CORRECTNESS if queue length is used for flow control
  • Action: Review code, then decide (Phase 27+)
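To make the borderline case concrete, here is a minimal sketch of what "used for flow control" would look like. It does not reproduce the code in hakmem_tiny_bg_spill.h; `demo_spill_len`, `DEMO_SPILL_LIMIT`, and both functions are hypothetical. If the real counter is read like this, it is a CORRECTNESS atomic and must stay.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_SPILL_LIMIT 4            /* hypothetical queue cap */

static _Atomic int demo_spill_len;    /* hypothetical length counter */

/* The previous length decides whether the enqueue is admitted, so the
 * counter drives behavior. Compiling it out would remove the cap. */
bool demo_spill_try_enqueue(void) {
    int prev = atomic_fetch_add_explicit(&demo_spill_len, 1, memory_order_relaxed);
    if (prev >= DEMO_SPILL_LIMIT) {
        /* Over the cap: undo the reservation and reject. */
        atomic_fetch_sub_explicit(&demo_spill_len, 1, memory_order_relaxed);
        return false;
    }
    return true;
}

/* Called once per item drained from the queue. */
void demo_spill_dequeue(void) {
    atomic_fetch_sub_explicit(&demo_spill_len, 1, memory_order_relaxed);
}
```

The review question for item 7 is therefore whether any branch reads the counter's value. A counter that is only ever incremented and dumped at exit is telemetry; one that feeds a comparison like the one above is not.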

8. Unified Cache Stats - MULTIPLE ATOMICS

  • Location: core/front/tiny_unified_cache.c (multiple lines)
  • Variables: g_unified_cache_hits_global, g_unified_cache_misses_global, etc.
  • Purpose: Unified cache hit/miss telemetry
  • Path: WARM (cache layer)
  • Expected Gain: +0.2-0.4%
  • Priority: MEDIUM
  • Action: Group into single Phase 27+ candidate

Phase 26 Implementation Plan

Phase 26A: c7_free_count Atomic Prune

Target: core/tiny_superslab_free.inc.h:51

Step 1: Add Build Flag

// core/hakmem_build_flags.h (after line 290)

// ------------------------------------------------------------
// Phase 26A: C7 Free Count Atomic Prune (Compile-out c7_free_count)
// ------------------------------------------------------------
// C7 Free Count: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need C7 free path diagnostics
// Target: c7_free_count atomic in core/tiny_superslab_free.inc.h:51
#ifndef HAKMEM_C7_FREE_COUNT_COMPILED
#  define HAKMEM_C7_FREE_COUNT_COMPILED 0
#endif

Step 2: Wrap Atomic with Compile Gate

// core/tiny_superslab_free.inc.h:51
#if HAKMEM_C7_FREE_COUNT_COMPILED
    extern _Atomic int c7_free_count;
    int count = atomic_fetch_add_explicit(&c7_free_count, 1, memory_order_relaxed);
#else
    int count = 0;  // No-op when compiled out
    (void)count;    // Suppress unused warning
#endif
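An alternative way to express the same gate, shown only as a sketch, is to hide the `#if` behind a macro so the hot-path call site stays a single line. All `DEMO_*` and `demo_*` names below are hypothetical and are not part of hakmem.

```c
#include <stdatomic.h>

/* Hypothetical gate following the "default 0 = compiled out" pattern. */
#ifndef DEMO_FREE_COUNT_COMPILED
#  define DEMO_FREE_COUNT_COMPILED 0
#endif

#if DEMO_FREE_COUNT_COMPILED
static _Atomic int demo_free_count;
/* Research build: one relaxed read-modify-write per call. */
#  define DEMO_FREE_COUNT_INC() \
     ((void)atomic_fetch_add_explicit(&demo_free_count, 1, memory_order_relaxed))
#  define DEMO_FREE_COUNT_READ() \
     atomic_load_explicit(&demo_free_count, memory_order_relaxed)
#else
/* Default build: the counter and its update do not exist at all. */
#  define DEMO_FREE_COUNT_INC()  ((void)0)
#  define DEMO_FREE_COUNT_READ() 0
#endif

/* Stand-in for a free fast path: the real work is identical in both builds. */
int demo_free(int *slot) {
    DEMO_FREE_COUNT_INC();
    int old = *slot;
    *slot = 0;
    return old;
}
```

Building with `-DDEMO_FREE_COUNT_COMPILED=1` restores the counter without touching `demo_free`. The trade-off against an inline `#if` is one more macro to maintain per counter.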

Step 3: A/B Test (Build-Level)

# Baseline (compiled-out, default)
make clean && make -j bench_random_mixed_hakmem
./bench_random_mixed_hakmem > baseline_26a.txt

# Compiled-in (for comparison)
make clean && make -j EXTRA_CFLAGS='-DHAKMEM_C7_FREE_COUNT_COMPILED=1' bench_random_mixed_hakmem
./bench_random_mixed_hakmem > compiled_in_26a.txt

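# Run full bench suite
# Rebuild the default (compiled-out) binary first: the previous step left a compiled-in build
make clean && make -j bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh > bench_26a_baseline.txt
make clean && make -j EXTRA_CFLAGS='-DHAKMEM_C7_FREE_COUNT_COMPILED=1' bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh > bench_26a_compiled.txt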

Step 4: Verdict

  • GO: baseline (compiled-out) is +0.5% or more faster than compiled-in → keep compiled-out (default: 0)
  • NEUTRAL: difference within ±0.5% → document, keep compiled-out for cleanliness
  • NO-GO: baseline is -0.5% or worse → revert change
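The verdict rule can be written down as a small helper. This is a sketch: the function and enum names are hypothetical, and the gain is taken relative to the compiled-in build because that choice reproduces the Phase 24 and Phase 25 figures quoted in this document.

```c
#include <assert.h>

typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_NO_GO } verdict_t;

/* Percent gain of the compiled-out baseline over the compiled-in build.
 * Both arguments are throughputs in the same unit (e.g. M ops/s). */
double gain_percent(double baseline_ops, double compiled_in_ops) {
    assert(compiled_in_ops > 0.0);
    return (baseline_ops - compiled_in_ops) / compiled_in_ops * 100.0;
}

/* Apply the +/-0.5% thresholds from Step 4. */
verdict_t classify_gain(double gain_pct) {
    if (gain_pct >= 0.5)  return VERDICT_GO;
    if (gain_pct <= -0.5) return VERDICT_NO_GO;
    return VERDICT_NEUTRAL;
}
```

With the Phase 24 numbers (56.675M vs 56.151M ops/s), `gain_percent` gives roughly 0.93 and the verdict is GO. A delta of -0.33 falls inside the band and is classified NEUTRAL.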

Phase 26B-E: Repeat Pattern

Follow same pattern for:

  • 26B: g_hdr_mismatch_log (tiny_superslab_free.inc.h:147)
  • 26C: g_hdr_meta_mismatch (tiny_superslab_free.inc.h:182)
  • 26D: g_metric_bad_class_once (hakmem_tiny_alloc.inc:22)
  • 26E: g_hdr_meta_fast (tiny_free_fast_v2.inc.h:181)

Each Phase:

  1. Add HAKMEM_[NAME]_COMPILED flag to hakmem_build_flags.h
  2. Wrap atomic with #if HAKMEM_[NAME]_COMPILED
  3. Run A/B test (baseline vs compiled-in)
  4. Measure improvement
  5. Document verdict
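One point to watch when repeating the pattern for diagnostics such as 26B-26D is that the counter usually sits inside a validation branch. Only the counter should be gated, never the check itself. A minimal sketch, assuming a hypothetical class-index check (`demo_class_ok`, `DEMO_NUM_CLASSES`, and the gate name are illustrative, not hakmem code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical gate; default 0 compiles the metric out. */
#ifndef DEMO_BAD_CLASS_METRIC_COMPILED
#  define DEMO_BAD_CLASS_METRIC_COMPILED 0
#endif

#define DEMO_NUM_CLASSES 8   /* hypothetical number of size classes */

#if DEMO_BAD_CLASS_METRIC_COMPILED
static _Atomic int demo_bad_class_once;
#endif

/* Returns true when class_idx is usable. The bounds check is a safety
 * check and runs in every build; only the counter update is gated. */
bool demo_class_ok(int class_idx) {
    if (class_idx < 0 || class_idx >= DEMO_NUM_CLASSES) {
#if DEMO_BAD_CLASS_METRIC_COMPILED
        (void)atomic_fetch_add_explicit(&demo_bad_class_once, 1,
                                        memory_order_relaxed);
#endif
        return false;
    }
    return true;
}
```

Because the atomic here executes only on the error branch, gating it removes little work from the common path, which may explain why such counters tend to measure as neutral.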

Expected Cumulative Impact

| Phase | Target Atomic | File | Gain (measured or expected) | Status |
|-------|---------------|------|-----------------------------|--------|
| 24 | g_tiny_class_stats_* | tiny_class_stats_box.h | +0.93% (measured) | GO |
| 25 | g_free_ss_enter | tiny_superslab_free.inc.h:22 | +1.07% (measured) | GO |
| 26A | c7_free_count | tiny_superslab_free.inc.h:51 | +0.3-0.8% | TBD |
| 26B | g_hdr_mismatch_log | tiny_superslab_free.inc.h:147 | +0.2-0.5% | TBD |
| 26C | g_hdr_meta_mismatch | tiny_superslab_free.inc.h:182 | +0.2-0.5% | TBD |
| 26D | g_metric_bad_class_once | hakmem_tiny_alloc.inc:22 | +0.1-0.3% | TBD |
| 26E | g_hdr_meta_fast | tiny_free_fast_v2.inc.h:181 | +0.3-0.7% | TBD |
| Total (24-26E) | - | - | +3.10-4.80% | - |

Conservative Estimate: +3.0% cumulative improvement from hot-path atomic prune.


Next Steps

  1. Audit complete (this document)
  2. Implement Phase 26A (c7_free_count)
  3. Run A/B test (baseline vs compiled-in)
  4. Document results in PHASE26A_C7_FREE_COUNT_RESULTS.md
  5. Repeat for 26B-E
  6. Create cumulative report

References

  • Phase 24 Pattern: core/box/tiny_class_stats_box.h
  • Phase 25 Pattern: core/tiny_superslab_free.inc.h:20-25
  • Build Flags: core/hakmem_build_flags.h:274-290
  • Mimalloc Principle: No atomics/observe in hot path

Notes

  • DO NOT touch correctness atomics (remote_count, refcount, meta->used, etc.)
  • ALWAYS A/B test each candidate independently (no batching)
  • ALWAYS use build-level flags (compile-time, not runtime)
  • FOLLOW Phase 24+25 pattern (#if COMPILED with default: 0)
  • DOCUMENT all verdicts (GO/NEUTRAL/NO-GO)

mimalloc Gap Analysis: This work closes the "hot path atomic tax" gap identified in the optimization roadmap.