hakmem/docs/analysis/ATOMIC_PRUNE_CUMULATIVE_SUMMARY.md
Moe Charm (CI) 9ed8b9c79a Phase 27-28: Unified Cache stats validation + BG Spill audit
Phase 27: Unified Cache Stats A/B Test - GO (+0.74%)
- Target: g_unified_cache_* atomics (6 total) in WARM refill path
- Already implemented in Phase 23 (HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED)
- A/B validation: Baseline 52.94M vs Compiled-in 52.55M ops/s
- Result: +0.74% mean, +1.01% median (both exceed +0.5% GO threshold)
- Impact: WARM path atomics have similar impact to HOT path
- Insight: Refill frequency is substantial, ENV check overhead matters

Phase 28: BG Spill Queue Atomic Audit - NO-OP
- Target: g_bg_spill_* atomics (8 total) in background spill subsystem
- Classification: 8/8 CORRECTNESS (100% untouchable)
- Key finding: g_bg_spill_len is flow control, NOT telemetry
  - Used in queue depth limiting: if (qlen < target) {...}
  - Operational counter (affects behavior), not observational
- Lesson: Counter name ≠ purpose, must trace all usages
- Result: NO-OP (no code changes, audit documentation only)

Cumulative Progress (Phase 24-28):
- Phase 24 (class stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL
- Phase 27 (unified cache): +0.74% GO
- Phase 28 (bg spill): NO-OP (audit only)
- Total: 17 atomics removed, +2.74% improvement

Documentation:
- PHASE27_UNIFIED_CACHE_STATS_RESULTS.md: Complete A/B test report
- PHASE28_BG_SPILL_ATOMIC_AUDIT.md: Detailed CORRECTNESS classification
- PHASE28_BG_SPILL_ATOMIC_PRUNE_RESULTS.md: NO-OP verdict and lessons
- ATOMIC_PRUNE_CUMULATIVE_SUMMARY.md: Updated with Phase 27-28
- CURRENT_TASK.md: Phase 29 candidate identified (Pool Hotbox v2)

Generated with Claude Code
https://claude.com/claude-code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-16 06:12:17 +09:00


Hot Path Atomic Telemetry Prune - Cumulative Summary

Project: HAKMEM Memory Allocator - Hot Path Optimization
Goal: Remove all telemetry-only atomics from hot alloc/free paths
Principle: Follow mimalloc: No atomics/observe in hot path
Status: Phase 24+25+26+27 Complete (+2.74% cumulative), Phase 28 Audit Complete (NO-OP)


Overview

This document tracks the systematic removal of telemetry-only atomic_fetch_add/sub operations from hot alloc/free code paths. Each phase follows a consistent pattern:

  1. Identify telemetry-only atomic (not CORRECTNESS)
  2. Add HAKMEM_*_COMPILED compile gate (default: 0)
  3. A/B test: baseline (compiled-out) vs compiled-in
  4. Verdict: GO (>+0.5%), NEUTRAL (±0.5%), or NO-GO (<-0.5%)
  5. Document and proceed to next candidate

Completed Phases

Phase 24: Tiny Class Stats Atomic Prune GO (+0.93%)

Date: 2025-12-15 (prior work)
Target: g_tiny_class_stats_* (per-class cache hit/miss counters)
File: core/box/tiny_class_stats_box.h
Atomics: 5 global counters (executed on every cache operation)
Build Flag: HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)

Results:

  • Baseline (compiled-out): 57.8 M ops/s
  • Compiled-in: 57.3 M ops/s
  • Improvement: +0.93%
  • Verdict: GO (keep compiled-out)

Analysis: High-frequency atomics (executed on every cache hit/miss) show measurable impact. Compiling them out provides an improvement of nearly 1%.

Reference: Pattern established in Phase 24, used as template for all subsequent phases.


Phase 25: Free Stats Atomic Prune GO (+1.07%)

Date: 2025-12-15 (prior work)
Target: g_free_ss_enter (superslab free entry counter)
File: core/tiny_superslab_free.inc.h:22
Atomics: 1 global counter (executed on every superslab free)
Build Flag: HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)

Results:

  • Baseline (compiled-out): 58.4 M ops/s
  • Compiled-in: 57.8 M ops/s
  • Improvement: +1.07%
  • Verdict: GO (keep compiled-out)

Analysis: Single high-frequency atomic (every free call) shows >1% impact. Demonstrates that even one hot-path atomic matters.

Reference: docs/analysis/PHASE25_FREE_STATS_RESULTS.md (assumed from pattern)


Phase 26: Hot Path Diagnostic Atomics Prune NEUTRAL (-0.33%)

Date: 2025-12-16
Targets: 5 diagnostic atomics in hot-path edge cases
Files:

  • core/tiny_superslab_free.inc.h (3 atomics)
  • core/hakmem_tiny_alloc.inc (1 atomic)
  • core/tiny_free_fast_v2.inc.h (1 atomic)

Build Flags: (all default: 0)

  • HAKMEM_C7_FREE_COUNT_COMPILED
  • HAKMEM_HDR_MISMATCH_LOG_COMPILED
  • HAKMEM_HDR_META_MISMATCH_COMPILED
  • HAKMEM_METRIC_BAD_CLASS_COMPILED
  • HAKMEM_HDR_META_FAST_COMPILED

Results:

  • Baseline (compiled-out): 53.14 M ops/s (±0.96M)
  • Compiled-in: 53.31 M ops/s (±1.09M)
  • Improvement: -0.33% (within ±0.5% noise margin)
  • Verdict: NEUTRAL ➡️ Keep compiled-out for cleanliness

Analysis: Low-frequency atomics (only in error/diagnostic paths) show no measurable impact. Kept compiled-out for code cleanliness and maintainability.

Reference: docs/analysis/PHASE26_HOT_PATH_ATOMIC_PRUNE_RESULTS.md


Phase 27: Unified Cache Stats Atomic Prune GO (+0.74%)

Date: 2025-12-16
Target: g_unified_cache_* (unified cache measurement atomics)
Files: core/front/tiny_unified_cache.c, core/front/tiny_unified_cache.h
Atomics: 6 global counters (hits, misses, refill cycles, per-class variants)
Build Flag: HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED (default: 0)

Results:

  • Baseline (compiled-out): 52.94 M ops/s (mean), 53.59 M ops/s (median)
  • Compiled-in: 52.55 M ops/s (mean), 53.06 M ops/s (median)
  • Improvement: +0.74% (mean), +1.01% (median)
  • Verdict: GO (keep compiled-out)

Analysis: WARM path atomics (cache refill operations) show measurable impact exceeding initial expectations (+0.2-0.4% expected, +0.74% actual). This suggests refill frequency is substantial in the random_mixed benchmark. The improvement validates the Phase 23 compile-out decision.

Path: WARM (unified cache refill: 3 locations; cache hits: 2 locations)
Frequency: Medium (every cache miss triggers refill with 4 atomic ops + ENV check)
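
The combined cost of the atomics and the ENV check in the refill path can be pictured with a minimal sketch. This is not the code from core/front/tiny_unified_cache.c: the names demo_refill, DEMO_CACHE_MEASURE and DEMO_CACHE_MEASURE_COMPILED are hypothetical, a single counter stands in for the real set, and it is assumed that the measurement is guarded by a cached environment lookup. Under those assumptions, a compile gate removes both the branch and the atomic, whereas a runtime switch alone would still leave the branch in every refill.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical gate, mirroring the HAKMEM_*_COMPILED convention (default: 0). */
#ifndef DEMO_CACHE_MEASURE_COMPILED
#  define DEMO_CACHE_MEASURE_COMPILED 0
#endif

static _Atomic uint64_t g_demo_refill_count;   /* telemetry only, never read for decisions */

#if DEMO_CACHE_MEASURE_COMPILED
/* Runtime switch: read the environment once, then reuse the cached answer.
 * The plain static is fine for this single-threaded sketch only. */
static int demo_measure_enabled(void) {
    static int cached = -1;                    /* -1 = not yet checked */
    if (cached < 0) {
        const char *v = getenv("DEMO_CACHE_MEASURE");   /* hypothetical variable name */
        cached = (v != NULL && strcmp(v, "1") == 0);
    }
    return cached;
}
#endif

/* Stand-in for a refill routine; returns the number of blocks refilled. */
static int demo_refill(int want) {
#if DEMO_CACHE_MEASURE_COMPILED
    if (demo_measure_enabled()) {              /* branch paid on every refill */
        atomic_fetch_add_explicit(&g_demo_refill_count, 1, memory_order_relaxed);
    }
#endif
    return want;
}

/* Reader used only for reporting. */
static uint64_t demo_refill_count(void) {
    return atomic_load_explicit(&g_demo_refill_count, memory_order_relaxed);
}
```

With the gate at its default of 0, demo_refill contains neither the branch nor the atomic, and demo_refill_count() stays at zero.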

Reference: docs/analysis/PHASE27_UNIFIED_CACHE_STATS_RESULTS.md


Phase 28: Background Spill Queue Atomic Audit NO-OP (All CORRECTNESS)

Date: 2025-12-16
Target: Background spill queue atomics (g_bg_spill_head, g_bg_spill_len)
Files: core/hakmem_tiny_bg_spill.h, core/hakmem_tiny_bg_spill.c
Atomics: 8 atomic operations (CAS loops, queue management)
Build Flag: None (no compile-out candidates)

Audit Results:

  • CORRECTNESS Atomics: 8/8 (100%)
  • TELEMETRY Atomics: 0/8 (0%)
  • Verdict: NO-OP (no action taken)

Analysis: All atomics are critical for correctness:

  1. Lock-free queue operations: atomic_load, atomic_compare_exchange_weak for CAS loops
  2. Queue length tracking (g_bg_spill_len): Used for flow control, NOT telemetry
    • Checked in tiny_free_magazine.inc.h:76-77 to decide whether to queue work
    • Controls queue depth to prevent unbounded growth
    • This is an operational counter, not a debug counter

Key Finding: g_bg_spill_len is superficially similar to telemetry counters, but serves a critical role:

uint32_t qlen = atomic_load_explicit(&g_bg_spill_len[class_idx], memory_order_relaxed);
if ((int)qlen < g_bg_spill_target) {  // FLOW CONTROL DECISION
    // Queue work to background spill
}

Conclusion: Background spill queue is a lock-free data structure. All atomics are untouchable. Phase 28 completes with no code changes.
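
To make the telemetry/operational distinction concrete, here is a small self-contained sketch. It does not reproduce the real bg spill queue: DEMO_QUEUE_TARGET, g_demo_queue_len, g_demo_enqueue_attempts and the two functions are hypothetical names, and the queue itself is reduced to its length counter. One counter is only ever incremented (telemetry), the other is read to decide whether work is admitted (operational).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_QUEUE_TARGET 2u                       /* hypothetical depth limit */

static _Atomic uint32_t g_demo_queue_len;          /* operational: gates admission */
static _Atomic uint64_t g_demo_enqueue_attempts;   /* telemetry: never read for decisions */

/* Try to admit one item; returns false once the queue is at its target depth.
 * The load-then-add pair is a soft limit: concurrent callers may overshoot
 * slightly, which is acceptable for a depth hint in this sketch. */
static bool demo_try_enqueue(void) {
    atomic_fetch_add_explicit(&g_demo_enqueue_attempts, 1, memory_order_relaxed);
    uint32_t qlen = atomic_load_explicit(&g_demo_queue_len, memory_order_relaxed);
    if (qlen < DEMO_QUEUE_TARGET) {                /* flow control decision */
        atomic_fetch_add_explicit(&g_demo_queue_len, 1, memory_order_relaxed);
        return true;
    }
    return false;
}

/* Consumer side: one item has been processed. */
static void demo_drain_one(void) {
    atomic_fetch_sub_explicit(&g_demo_queue_len, 1, memory_order_relaxed);
}
```

Deleting g_demo_enqueue_attempts leaves the behavior of demo_try_enqueue unchanged. Deleting g_demo_queue_len removes the depth limit entirely, which is why a counter of that kind is classified as CORRECTNESS.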

Reference: docs/analysis/PHASE28_BG_SPILL_ATOMIC_AUDIT.md


Cumulative Impact

| Phase | Atomics Removed | Frequency | Impact | Status |
|-------|-----------------|-----------|--------|--------|
| 24 | 5 (class stats) | High (every cache op) | +0.93% | GO |
| 25 | 1 (free_ss_enter) | High (every free) | +1.07% | GO |
| 26 | 5 (diagnostics) | Low (edge cases) | -0.33% | NEUTRAL |
| 27 | 6 (unified cache) | Medium (refills) | +0.74% | GO |
| 28 | 0 (bg spill) | N/A (all CORRECTNESS) | N/A | NO-OP |
| Total | 17 atomics | Mixed | +2.74% | |

Key Insight: Atomic frequency matters more than count. High-frequency atomics (Phase 24+25) provide the largest benefit (+0.93%, +1.07%). Medium-frequency atomics (Phase 27, WARM path) still provide a measurable benefit (+0.74%). Low-frequency atomics (Phase 26) improve cleanliness but yield no performance gain. Correctness atomics are untouchable (Phase 28).
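
The totals can be reproduced from the per-phase figures. The sketch below encodes the five rows in a hypothetical demo_phase_t array and assumes, as the numbers suggest, that the +2.74% total is the plain sum of the GO phases only, with the NEUTRAL Phase 26 delta treated as noise and excluded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type holding one row of the cumulative table. */
typedef struct {
    int         phase;
    int         atomics;     /* atomics compiled out in this phase */
    double      delta_pct;   /* measured improvement in percent */
    const char *verdict;     /* "GO", "NEUTRAL", or "NO-OP" */
} demo_phase_t;

static const demo_phase_t k_demo_phases[] = {
    {24, 5,  0.93, "GO"},
    {25, 1,  1.07, "GO"},
    {26, 5, -0.33, "NEUTRAL"},
    {27, 6,  0.74, "GO"},
    {28, 0,  0.00, "NO-OP"},
};
static const size_t k_demo_phase_count =
    sizeof k_demo_phases / sizeof k_demo_phases[0];

/* Every phase contributes its removed atomics, whatever the verdict. */
static int demo_total_atomics(void) {
    int total = 0;
    for (size_t i = 0; i < k_demo_phase_count; i++) {
        total += k_demo_phases[i].atomics;
    }
    return total;
}

/* Only GO phases contribute to the cumulative gain. */
static double demo_total_go_gain(void) {
    double total = 0.0;
    for (size_t i = 0; i < k_demo_phase_count; i++) {
        if (strcmp(k_demo_phases[i].verdict, "GO") == 0) {
            total += k_demo_phases[i].delta_pct;
        }
    }
    return total;
}
```

Here demo_total_atomics() yields 17 (5 + 1 + 5 + 6 + 0) and demo_total_go_gain() yields 0.93 + 1.07 + 0.74 = 2.74. Each delta was measured against its own baseline, so the sum is an approximation rather than a single end-to-end measurement.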


Lessons Learned

1. Frequency Trumps Count

  • Phase 24: 5 atomics, high frequency → +0.93%
  • Phase 25: 1 atomic, high frequency → +1.07%
  • Phase 26: 5 atomics, low frequency → -0.33% (NEUTRAL)

Takeaway: Focus on always-executed atomics, not just atomic count.

2. Edge Cases Don't Matter (Performance-Wise)

  • Phase 26 atomics are in error/diagnostic paths (header mismatch, bad class, etc.)
  • Rarely executed in benchmarks → no measurable impact
  • Still worth compiling out for code cleanliness

3. Compile-Time Gates Work Well

  • Pattern: #if HAKMEM_*_COMPILED (default: 0)
  • Clean separation between research (compiled-in) and production (compiled-out)
  • Easy to A/B test individual flags

4. Noise Margin: ±0.5%

  • Benchmark variance ~1-2%
  • Improvements <0.5% are within noise
  • NEUTRAL verdict: keep simpler code (compiled-out)

5. Classification is Critical

  • Phase 28: All atomics were CORRECTNESS (lock-free queue, flow control)
  • Must distinguish between:
    • Telemetry counters: Observational only, safe to compile-out
    • Operational counters: Used for control flow decisions, UNTOUCHABLE
  • Example: g_bg_spill_len looks like telemetry but controls queue depth limits

Next Phase Candidates (Phase 29+)

High Priority: Warm Path Atomics

  1. Background Spill Queue (Phase 28) COMPLETE (NO-OP)
    • Result: All CORRECTNESS atomics, no compile-out candidates
    • Reason: Lock-free queue + flow control counter

Medium Priority: Warm-ish Path Atomics

  1. Remote Target Queue (Phase 29 candidate)
    • Targets: g_remote_target_len[class_idx] atomics
    • File: core/hakmem_tiny_remote_target.c
    • Atomics: atomic_fetch_add/sub on queue length
    • Frequency: Warm (remote free path)
    • Expected Gain: +0.1-0.3% (if telemetry)
    • Priority: MEDIUM (needs correctness review - similar to bg_spill)
    • Warning: May be flow control like g_bg_spill_len, needs audit

Low Priority: Cold Path Atomics

  1. SuperSlab OS Stats (Phase 29+)

    • Targets: g_ss_os_alloc_calls, g_ss_os_madvise_calls, etc.
    • Files: core/box/ss_os_acquire_box.h, core/box/madvise_guard_box.c
    • Frequency: Cold (init/mmap/madvise)
    • Expected Gain: <0.1%
    • Priority: LOW (code cleanliness only)
  2. Shared Pool Diagnostics (Phase 30+)

    • Targets: rel_c7_*, dbg_c7_* (release/acquire logs)
    • Files: core/hakmem_shared_pool_acquire.c, core/hakmem_shared_pool_release.c
    • Frequency: Cold (shared pool operations)
    • Expected Gain: <0.1%
    • Priority: LOW
  3. Pool Hotbox v2 Stats (Phase 31+)

    • Targets: g_pool_hotbox_v2_stats[ci].* counters
    • File: core/hakmem_pool.c
    • Atomics: ~15 stats counters (alloc_calls, free_calls, etc.)
    • Frequency: Medium-High (pool operations)
    • Expected Gain: +0.2-0.5% (if high-frequency)
    • Priority: MEDIUM

Pattern Template (For Future Phases)

Step 1: Add Build Flag

// core/hakmem_build_flags.h
#ifndef HAKMEM_[NAME]_COMPILED
#  define HAKMEM_[NAME]_COMPILED 0
#endif

Step 2: Wrap Atomic

// core/[file].c
#if HAKMEM_[NAME]_COMPILED
    atomic_fetch_add_explicit(&g_[name], 1, memory_order_relaxed);
#else
    (void)0;  // No-op when compiled out
#endif
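
When the same counter is touched at several call sites, the #if/#else pair can be folded into a macro so that each site stays a single line. The following is only a sketch of that variation: DEMO_STAT_COMPILED, DEMO_STAT_INC, g_demo_hits and demo_cache_lookup are hypothetical names, not identifiers from the HAKMEM sources.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag following the HAKMEM_[NAME]_COMPILED pattern (default: 0). */
#ifndef DEMO_STAT_COMPILED
#  define DEMO_STAT_COMPILED 0
#endif

#if DEMO_STAT_COMPILED
#  define DEMO_STAT_INC(counter) \
      atomic_fetch_add_explicit(&(counter), 1, memory_order_relaxed)
#else
#  define DEMO_STAT_INC(counter) ((void)0)   /* expands to nothing at the call site */
#endif

static _Atomic uint64_t g_demo_hits;         /* telemetry only */

/* Stand-in for a hot-path function with one instrumented site. */
static int demo_cache_lookup(int key) {
    DEMO_STAT_INC(g_demo_hits);
    return key * 2;
}

/* Reader for reporting; stays at zero while the flag is 0. */
static uint64_t demo_hits(void) {
    return atomic_load_explicit(&g_demo_hits, memory_order_relaxed);
}
```

Building with -DDEMO_STAT_COMPILED=1 turns every DEMO_STAT_INC site back into a relaxed atomic increment, which is the compiled-in configuration used in the A/B test.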

Step 3: A/B Test

# Baseline (compiled-out, default)
make clean && make -j bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh > baseline.txt

# Compiled-in
make clean && make -j EXTRA_CFLAGS='-DHAKMEM_[NAME]_COMPILED=1' bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh > compiled_in.txt

Step 4: Analyze & Verdict

improvement = ((baseline_avg - compiled_in_avg) / compiled_in_avg) * 100

if improvement >= 0.5:
    verdict = "GO (keep compiled-out)"
elif improvement <= -0.5:
    verdict = "NO-GO (revert, compiled-in is better)"
else:
    verdict = "NEUTRAL (keep compiled-out for cleanliness)"
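
Since Phase 27 reports both mean and median, the averaging step can be sketched as well. The helpers below are an illustration in C and are not part of scripts/run_mixed_10_cleanenv.sh: all function names and DEMO_MAX_RUNS are hypothetical, and it is assumed that the throughput of each run has already been parsed into an array of doubles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_RUNS 64   /* hypothetical upper bound on runs per configuration */

/* Three-way comparison of doubles for qsort. */
static int demo_cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean; returns 0.0 for an empty input. */
static double demo_mean(const double *v, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return n ? sum / (double)n : 0.0;
}

/* Median computed on a sorted copy, so the caller's array is left untouched.
 * Returns 0.0 if n is 0 or exceeds DEMO_MAX_RUNS. */
static double demo_median(const double *v, size_t n) {
    double buf[DEMO_MAX_RUNS];
    if (n == 0 || n > DEMO_MAX_RUNS) return 0.0;
    memcpy(buf, v, n * sizeof buf[0]);
    qsort(buf, n, sizeof buf[0], demo_cmp_double);
    return (n % 2) ? buf[n / 2] : (buf[n / 2 - 1] + buf[n / 2]) / 2.0;
}

/* Same formula as Step 4: gain of compiled-out relative to compiled-in. */
static double demo_improvement_pct(double baseline, double compiled_in) {
    return (baseline - compiled_in) / compiled_in * 100.0;
}

/* Same thresholds as Step 4. */
static const char *demo_verdict(double improvement_pct) {
    if (improvement_pct >= 0.5)  return "GO";
    if (improvement_pct <= -0.5) return "NO-GO";
    return "NEUTRAL";
}
```

Applied to the Phase 27 means, demo_improvement_pct(52.94, 52.55) comes out at roughly 0.74, which demo_verdict classifies as GO.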

Step 5: Document

Create docs/analysis/PHASE[N]_[NAME]_RESULTS.md with:

  • Implementation details
  • A/B test results
  • Verdict & reasoning
  • Files modified

Build Flag Summary

All atomic compile gates in core/hakmem_build_flags.h:

// Phase 24: Tiny Class Stats (GO +0.93%)
#ifndef HAKMEM_TINY_CLASS_STATS_COMPILED
#  define HAKMEM_TINY_CLASS_STATS_COMPILED 0
#endif

// Phase 25: Tiny Free Stats (GO +1.07%)
#ifndef HAKMEM_TINY_FREE_STATS_COMPILED
#  define HAKMEM_TINY_FREE_STATS_COMPILED 0
#endif

// Phase 27: Unified Cache Stats (GO +0.74%)
#ifndef HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
#  define HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED 0
#endif

// Phase 26A: C7 Free Count (NEUTRAL -0.33%)
#ifndef HAKMEM_C7_FREE_COUNT_COMPILED
#  define HAKMEM_C7_FREE_COUNT_COMPILED 0
#endif

// Phase 26B: Header Mismatch Log (NEUTRAL)
#ifndef HAKMEM_HDR_MISMATCH_LOG_COMPILED
#  define HAKMEM_HDR_MISMATCH_LOG_COMPILED 0
#endif

// Phase 26C: Header Meta Mismatch (NEUTRAL)
#ifndef HAKMEM_HDR_META_MISMATCH_COMPILED
#  define HAKMEM_HDR_META_MISMATCH_COMPILED 0
#endif

// Phase 26D: Metric Bad Class (NEUTRAL)
#ifndef HAKMEM_METRIC_BAD_CLASS_COMPILED
#  define HAKMEM_METRIC_BAD_CLASS_COMPILED 0
#endif

// Phase 26E: Header Meta Fast (NEUTRAL)
#ifndef HAKMEM_HDR_META_FAST_COMPILED
#  define HAKMEM_HDR_META_FAST_COMPILED 0
#endif

Default State: All flags = 0 (compiled-out, production-ready)
Research Use: Set flag = 1 to enable specific telemetry atomic


Conclusion

Total Progress (Phase 24+25+26+27+28):

  • Performance Gain: +2.74% (Phase 24: +0.93%, Phase 25: +1.07%, Phase 26: NEUTRAL, Phase 27: +0.74%, Phase 28: NO-OP)
  • Atomics Removed: 17 telemetry atomics from hot/warm paths
  • Phases Completed: 5 phases (4 with changes, 1 audit-only)
  • Code Quality: Cleaner hot/warm paths, closer to mimalloc's zero-overhead principle
  • Next Target: Phase 29 (remote target queue or pool hotbox v2 stats)

Key Success Factors:

  1. Systematic audit and classification (CORRECTNESS vs TELEMETRY)
  2. Consistent A/B testing methodology
  3. Clear verdict criteria (GO/NEUTRAL/NO-GO)
  4. Focus on high-frequency atomics for performance
  5. Compile-out low-frequency atomics for cleanliness

Future Work:

  • Continue Phase 29+ (warm/cold path atomics)
  • Expected cumulative gain: +3.0-3.5% total (already at +2.74%)
  • Focus on high-frequency paths, audit carefully for CORRECTNESS vs TELEMETRY
  • Document all verdicts for reproducibility

Lessons from Phase 28:

  • Not all atomic counters are telemetry
  • Flow control counters (e.g., g_bg_spill_len) are CORRECTNESS
  • Always trace how a counter is used before classifying it

Last Updated: 2025-12-16
Status: Phase 24+25+26+27 Complete (+2.74%), Phase 28 Audit Complete (NO-OP)
Maintained By: Claude Sonnet 4.5