Files

Moe Charm (CI) b7085c47e1 Phase 35-39: FAST build optimization complete (+7.13% cumulative)

Phase 35-A: BENCH_MINIMAL gate function elimination (GO +4.39%)
- tiny_front_v3_enabled() → constant true
- tiny_metadata_cache_enabled() → constant 0
- learner_v7_enabled() → constant false
- small_learner_v2_enabled() → constant false

Phase 36: Policy snapshot init-once (GO +0.71%)
- small_policy_v7_snapshot() version check skip in BENCH_MINIMAL
- TLS cache for policy snapshot

Phase 37: Standard TLS cache (NO-GO -0.07%)
- TLS cache for Standard build attempted
- Runtime gate overhead negates benefit

Phase 38: FAST/OBSERVE/Standard workflow established
- make perf_fast, make perf_observe targets
- Scorecard and documentation updates

Phase 39: Hot path gate constantization (GO +1.98%)
- front_gate_unified_enabled() → constant 1
- alloc_dualhot_enabled() → constant 0
- g_bench_fast_front, g_v3_enabled blocks → compile-out
- free_dispatch_stats_enabled() → constant false

Results:
- FAST v3: 56.04M ops/s (47.4% of mimalloc)
- Standard: 53.50M ops/s (45.3% of mimalloc)
- M1 target (50%): 5.5% remaining

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-16 15:01:56 +09:00

27 KiB

Raw Blame History

Hot Path Atomic Telemetry Prune - Cumulative Summary

Project: HAKMEM Memory Allocator - Hot Path Optimization Goal: Remove all telemetry-only atomics from hot alloc/free paths Principle: Follow mimalloc: No atomics/observe in hot path Status: Phase 24+25+26+27+31+32 Complete (+2.74% cumulative), Phase 28+29 NO-OP, Phase 30 Procedure Complete

Overview

This document tracks the systematic removal of telemetry-only atomic_fetch_add/sub operations from hot alloc/free code paths. Each phase follows a consistent pattern:

Identify telemetry-only atomic (not CORRECTNESS)
Add HAKMEM_*_COMPILED compile gate (default: 0)
A/B test: baseline (compiled-out) vs compiled-in
Verdict: GO (>+0.5%), NEUTRAL (±0.5%), or NO-GO (<-0.5%)
Document and proceed to next candidate

Completed Phases

Phase 24: Tiny Class Stats Atomic Prune ✅ GO (+0.93%)

Date: 2025-12-15 (prior work) Target: g_tiny_class_stats_* (per-class cache hit/miss counters) File: core/box/tiny_class_stats_box.h Atomics: 5 global counters (executed on every cache operation) Build Flag: HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)

Results:

Baseline (compiled-out): 57.8 M ops/s
Compiled-in: 57.3 M ops/s
Improvement: +0.93%
Verdict: GO ✅ (keep compiled-out)

Analysis: High-frequency atomics (every cache hit/miss) show measurable impact. Compiling out provides nearly 1% improvement.

Reference: Pattern established in Phase 24, used as template for all subsequent phases.

Phase 25: Free Stats Atomic Prune ✅ GO (+1.07%)

Date: 2025-12-15 (prior work) Target: g_free_ss_enter (superslab free entry counter) File: core/tiny_superslab_free.inc.h:22 Atomics: 1 global counter (executed on every superslab free) Build Flag: HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)

Results:

Baseline (compiled-out): 58.4 M ops/s
Compiled-in: 57.8 M ops/s
Improvement: +1.07%
Verdict: GO ✅ (keep compiled-out)

Analysis: Single high-frequency atomic (every free call) shows >1% impact. Demonstrates that even one hot-path atomic matters.

Reference: docs/analysis/PHASE25_FREE_STATS_RESULTS.md (assumed from pattern)

Phase 26: Hot Path Diagnostic Atomics Prune ✅ NEUTRAL (-0.33%)

Date: 2025-12-16 Targets: 5 diagnostic atomics in hot-path edge cases Files:

core/tiny_superslab_free.inc.h (3 atomics)
core/hakmem_tiny_alloc.inc (1 atomic)
core/tiny_free_fast_v2.inc.h (1 atomic)

Build Flags: (all default: 0)

HAKMEM_C7_FREE_COUNT_COMPILED
HAKMEM_HDR_MISMATCH_LOG_COMPILED
HAKMEM_HDR_META_MISMATCH_COMPILED
HAKMEM_METRIC_BAD_CLASS_COMPILED
HAKMEM_HDR_META_FAST_COMPILED

Results:

Baseline (compiled-out): 53.14 M ops/s (±0.96M)
Compiled-in: 53.31 M ops/s (±1.09M)
Improvement: -0.33% (within ±0.5% noise margin)
Verdict: NEUTRAL ➡️ Keep compiled-out for cleanliness ✅

Analysis: Low-frequency atomics (only in error/diagnostic paths) show no measurable impact. Kept compiled-out for code cleanliness and maintainability.

Reference: docs/analysis/PHASE26_HOT_PATH_ATOMIC_PRUNE_RESULTS.md

Phase 27: Unified Cache Stats Atomic Prune ✅ GO (+0.74%)

Date: 2025-12-16 Target: g_unified_cache_* (unified cache measurement atomics) File: core/front/tiny_unified_cache.c, core/front/tiny_unified_cache.h Atomics: 6 global counters (hits, misses, refill cycles, per-class variants) Build Flag: HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED (default: 0)

Results:

Baseline (compiled-out): 52.94 M ops/s (mean), 53.59 M ops/s (median)
Compiled-in: 52.55 M ops/s (mean), 53.06 M ops/s (median)
Improvement: +0.74% (mean), +1.01% (median)
Verdict: GO ✅ (keep compiled-out)

Analysis: WARM path atomics (cache refill operations) show measurable impact exceeding initial expectations (+0.2-0.4% expected, +0.74% actual). This suggests refill frequency is substantial in the random_mixed benchmark. The improvement validates the Phase 23 compile-out decision.

Path: WARM (unified cache refill: 3 locations; cache hits: 2 locations) Frequency: Medium (every cache miss triggers refill with 4 atomic ops + ENV check)

Reference: docs/analysis/PHASE27_UNIFIED_CACHE_STATS_RESULTS.md

Phase 28: Background Spill Queue Atomic Audit ✅ NO-OP (All CORRECTNESS)

Date: 2025-12-16 Target: Background spill queue atomics (g_bg_spill_head, g_bg_spill_len) Files: core/hakmem_tiny_bg_spill.h, core/hakmem_tiny_bg_spill.c Atomics: 8 atomic operations (CAS loops, queue management) Build Flag: None (no compile-out candidates)

Audit Results:

CORRECTNESS Atomics: 8/8 (100%)
TELEMETRY Atomics: 0/8 (0%)
Verdict: NO-OP (no action taken)

Analysis: All atomics are critical for correctness:

Lock-free queue operations: atomic_load, atomic_compare_exchange_weak for CAS loops
Queue length tracking (g_bg_spill_len): Used for flow control, NOT telemetry
- Checked in tiny_free_magazine.inc.h:76-77 to decide whether to queue work
- Controls queue depth to prevent unbounded growth
- This is an operational counter, not a debug counter

Key Finding: g_bg_spill_len is superficially similar to telemetry counters, but serves a critical role:

uint32_t qlen = atomic_load_explicit(&g_bg_spill_len[class_idx], memory_order_relaxed);
if ((int)qlen < g_bg_spill_target) {  // FLOW CONTROL DECISION
    // Queue work to background spill
}

Conclusion: Background spill queue is a lock-free data structure. All atomics are untouchable. Phase 28 completes with no code changes.

Reference: docs/analysis/PHASE28_BG_SPILL_ATOMIC_AUDIT.md

Phase 29: Pool Hotbox v2 Stats Atomic Audit ✅ NO-OP (Code Not Active)

Date: 2025-12-16 Target: Pool Hotbox v2 stats atomics (g_pool_hotbox_v2_stats[ci].*) Files: core/hakmem_pool.c, core/box/pool_hotbox_v2_box.h Atomics: 12 atomic counters (alloc_calls, free_calls, alloc_fast, free_fast, etc.) Build Flag: HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED (default: 0)

Audit Results:

CORRECTNESS Atomics: 0/12 (0%)
TELEMETRY Atomics: 12/12 (100%)
Verdict: NO-OP (code path not active)

Analysis: All 12 atomics are pure TELEMETRY (destructor dump only, no flow control). However, Pool Hotbox v2 is disabled by default via HAKMEM_POOL_V2_ENABLED environment variable, so these atomics are never executed in the benchmark.

A/B Test Results (Anomaly Detected):

Baseline (compiled-out): 52.98 M ops/s (±0.43M)
Compiled-in: 53.31 M ops/s (±0.80M)
Improvement: -0.62% (compiled-in is faster!)

Root Cause: Pool v2 is OFF by default (ENV-gated):

const char* e = getenv("HAKMEM_POOL_V2_ENABLED");
g = (e && *e && *e != '0') ? 1 : 0;  // Default: OFF

Result: Atomics are never incremented → compile-out has zero runtime effect.

Why anomaly (-0.62% faster with atomics ON)?

High variance (research build: 1.50% stdev vs baseline: 0.81%)
Compiler optimization artifact (code layout, instruction cache alignment)
Sample size (10 runs) insufficient to distinguish signal from noise
Conclusion: Noise, not real effect

Decision: NEUTRAL - Keep compile-out for:

Code cleanliness (reduces binary size)
Future-proofing (ready if Pool v2 is enabled)
Consistency with Phase 24-28 pattern

Key Lesson: Before A/B testing, verify code is ACTIVE:

rg "getenv.*FEATURE" && echo "⚠️ ENV-gated, may be OFF"

Updated Audit Checklist:

✅ Classify atomics (CORRECTNESS vs TELEMETRY)
✅ Verify no flow control usage
NEW: ✅ Verify code path is ACTIVE in benchmark ← Phase 29 lesson
Implement compile-out
A/B test

Reference: docs/analysis/PHASE29_POOL_HOTBOX_V2_STATS_RESULTS.md

Phase 30: Standard Procedure Documentation ✅ PROCEDURE COMPLETE

Date: 2025-12-16 Target: Standardization of atomic prune methodology (not a performance phase) Purpose: Codify learnings from Phase 24-29 into reusable 4-step procedure

Deliverables:

docs/analysis/PHASE30_STANDARD_PROCEDURE.md - 4-step standardized methodology
docs/analysis/ATOMIC_AUDIT_FULL.txt - Complete atomic audit (412 atomics)
docs/analysis/PHASE31_RECOMMENDED_CANDIDATES.md - Phase 31 candidate selection

4-Step Standard Procedure:

Step 0: Execution Verification (NEW - Phase 29 lesson)

Check for ENV gates (getenv() checks)
Verify execution counters > 0 in benchmark
Use perf/flamegraph to confirm code path is hit
Decision: SKIP if ENV-gated or not executed

Step 1: CORRECTNESS/TELEMETRY Classification (Phase 28 lesson)

Track all atomic usage sites
Check for if conditions (CORRECTNESS)
Verify pure telemetry usage (TELEMETRY)
Decision: DO NOT TOUCH if CORRECTNESS

Step 2: Compile-Out Implementation (Phase 24-27 pattern)

Add HAKMEM_*_COMPILED flag to hakmem_build_flags.h
Wrap atomics with #if preprocessor gates
Build-level compile-out (not link-out)

Step 3: A/B Test (build-level comparison)

Baseline (COMPILED=0): default build
Compiled-in (COMPILED=1): research build
Compare 10-run averages
Verdict: GO (+0.5%+), NEUTRAL (±0.5%), NO-GO (-0.5%+)

Audit Results (Phase 30):

Total atomics: 412 (104 TELEMETRY, 24 CORRECTNESS, 284 UNKNOWN)
HOT path: 16 atomics (5 TELEMETRY, 11 UNKNOWN)
WARM path: 10 atomics (3 TELEMETRY, 7 UNKNOWN)
COLD path: 386 atomics (remaining)

Phase 31 Candidate Selection:

TOP PRIORITY: g_tiny_free_trace (HOT path, TELEMETRY, execution verified)
Expected Impact: +0.5% to +1.0% (similar to Phase 25)
Skipped: 2 ENV-gated WARM path candidates (Phase 29 lesson applied)

Key Lesson: Step 0 (execution verification) prevents wasted effort on ENV-gated or inactive code paths. Phase 29 taught us that optimization without execution = zero impact.

Reference: docs/analysis/PHASE30_STANDARD_PROCEDURE.md, docs/analysis/PHASE31_RECOMMENDED_CANDIDATES.md

Phase 31: Tiny Free Trace Atomic Prune ✅ NEUTRAL (-0.35%)

Date: 2025-12-16 Target: g_tiny_free_trace (tiny free trace rate-limit counter) File: core/hakmem_tiny_free.inc:326 Atomics: 1 global counter (executed on every tiny free) Build Flag: HAKMEM_TINY_FREE_TRACE_COMPILED (default: 0)

Results:

Baseline (compiled-out): 53.64 M ops/s (mean), 53.80 M ops/s (median)
Compiled-in: 53.83 M ops/s (mean), 53.70 M ops/s (median)
Improvement: -0.35% (mean), +0.19% (median)
Verdict: NEUTRAL ➡️ Keep compiled-out for cleanliness ✅

Analysis: HOT path atomic (every free call entry) shows no measurable impact (-0.35% mean, +0.19% median, both within ±0.5% noise margin). Unlike Phase 25 (g_free_ss_enter: +1.07%), this trace rate-limit atomic (128 calls) does not show performance overhead. Following Phase 26 precedent (-0.33% NEUTRAL, adopted for cleanliness), Phase 31 is ADOPTED with COMPILED=0 as default.

Path: HOT (entry point of hak_tiny_free()) Frequency: High (every tiny free call, but rate-limited to 128 traces) Key Finding: Not all HOT path atomics have measurable overhead. Rate-limited trace may be optimized by compiler.

Reference: docs/analysis/PHASE31_TINY_FREE_TRACE_ATOMIC_PRUNE_RESULTS.md

Phase 32: Tiny Free Calls Atomic Prune ✅ NEUTRAL (-0.46%)

Date: 2025-12-16 Target: g_hak_tiny_free_calls (tiny free calls diagnostic counter) File: core/hakmem_tiny_free.inc:335 (9 lines after Phase 31) Atomics: 1 global counter (executed on every tiny free, unconditional) Build Flag: HAKMEM_TINY_FREE_CALLS_COMPILED (default: 0)

Results:

Baseline (compiled-out): 52.94 M ops/s (mean), 53.22 M ops/s (median)
Compiled-in: 53.28 M ops/s (mean), 53.46 M ops/s (median)
Improvement: -0.46% (mean), -0.46% (median)
Verdict: NEUTRAL ➡️ Keep compiled-out for cleanliness ✅

Analysis: HOT path atomic (every free call, 9 lines after Phase 31 target) shows no measurable impact (-0.46%, within ±0.5% noise margin). Unexpectedly, the atomic counter compiled-in performed slightly better, suggesting code alignment effects rather than atomic overhead. Following Phase 31 precedent (-0.35% NEUTRAL), Phase 32 is ADOPTED with COMPILED=0 for code cleanliness and consistency.

Path: HOT (same function as Phase 31, hak_tiny_free()) Frequency: High (every tiny free call, unconditional - no rate limit) Key Finding: Diagnostic counter has negligible performance impact on modern CPUs. NEUTRAL result reinforces Phase 31 pattern: compile-out for code cleanliness, not performance.

Reference: docs/analysis/PHASE32_TINY_FREE_CALLS_ATOMIC_PRUNE_RESULTS.md

Cumulative Impact

Phase	Atomics Removed	Frequency	Impact	Status
24	5 (class stats)	High (every cache op)	+0.93%	GO ✅
25	1 (free_ss_enter)	High (every free)	+1.07%	GO ✅
26	5 (diagnostics)	Low (edge cases)	-0.33%	NEUTRAL ✅
27	6 (unified cache)	Medium (refills)	+0.74%	GO ✅
28	0 (bg spill)	N/A (all CORRECTNESS)	N/A	NO-OP ✅
29	0 (pool v2)	N/A (code not active)	0.00%	NO-OP ✅
30	0 (procedure)	N/A (standardization)	N/A	PROCEDURE ✅
31	1 (free trace)	High (every free entry)	-0.35%	NEUTRAL ✅
32	1 (free calls)	High (every free, unconditional)	-0.46%	NEUTRAL ✅
Total	19 atomics	Mixed	+2.74%	✅

Key Insights:

Frequency matters more than count: High-frequency atomics (Phase 24+25) provide measurable benefit (+0.93%, +1.07%). Medium-frequency atomics (Phase 27, WARM path) provide substantial benefit (+0.74%). Low-frequency atomics (Phase 26) provide cleanliness but no performance gain.
Correctness atomics are untouchable: Phase 28 showed that lock-free queues and flow control counters must not be touched.
ENV-gated code paths need verification: Phase 29 showed that compile-out of inactive code has zero performance impact. Always verify code is active before A/B testing.
Standardized procedure prevents wasted effort: Phase 30 codified 4-step procedure with Step 0 (execution verification) as mandatory gate to avoid Phase 29-style no-ops.
HOT path ≠ guaranteed performance win: Phase 31 showed that even HOT path atomics may have zero measurable overhead if rate-limited or well-optimized. NEUTRAL results still justify adoption for code cleanliness (Phase 26/31 precedent).

Lessons Learned

1. Frequency Trumps Count (But Not Always)

Phase 24: 5 atomics, high frequency → +0.93% ✅
Phase 25: 1 atomic, high frequency → +1.07% ✅
Phase 26: 5 atomics, low frequency → -0.33% (NEUTRAL)
Phase 31: 1 atomic, high frequency → -0.35% (NEUTRAL)

Takeaway: Focus on always-executed atomics, not just atomic count. However, even high-frequency atomics may have zero measurable overhead if optimized (e.g., rate-limited, compiler optimization).

2. Edge Cases Don't Matter (Performance-Wise)

Phase 26 atomics are in error/diagnostic paths (header mismatch, bad class, etc.)
Rarely executed in benchmarks → no measurable impact
Still worth compiling out for code cleanliness

3. Compile-Time Gates Work Well

Pattern: #if HAKMEM_*_COMPILED (default: 0)
Clean separation between research (compiled-in) and production (compiled-out)
Easy to A/B test individual flags

4. Noise Margin: ±0.5%

Benchmark variance ~1-2%
Improvements <0.5% are within noise
NEUTRAL verdict: keep simpler code (compiled-out)

5. Classification is Critical

Phase 28: All atomics were CORRECTNESS (lock-free queue, flow control)
Must distinguish between:
- Telemetry counters: Observational only, safe to compile-out
- Operational counters: Used for control flow decisions, UNTOUCHABLE
Example: g_bg_spill_len looks like telemetry but controls queue depth limits

6. Verify Code is Active (NEW: Phase 29 Lesson)

Phase 29: Pool v2 stats were all TELEMETRY but ENV-gated (default OFF)
Compile-out had zero impact because code never ran
Before A/B testing:
1. Check for getenv() gates → may be OFF by default
2. Add temporary debug printf to verify code path is hit
3. Or use perf record to check if functions are called
Anomaly: Compiled-in was 0.62% faster (noise due to compiler artifacts, not real effect)

7. Standard Procedure is Reusable (NEW: Phase 30)

Phase 30: Codified 4-step procedure from Phase 24-29 learnings
Step 0 (execution verification): Prevents Phase 29-style wasted effort on ENV-gated code
Step 1 (classification): Prevents Phase 28-style mistakes (CORRECTNESS vs TELEMETRY)
Step 2-3 (implementation + A/B test): Proven pattern from Phase 24-27
Result: Systematic atomic audit (412 atomics), Phase 31 candidate selected with high confidence

8. NEUTRAL + Cleanliness = Valid Adoption (Phase 26/31 Pattern)

Phase 26: -0.33% NEUTRAL → Adopted for code cleanliness
Phase 31: -0.35% NEUTRAL → Adopted for code cleanliness (same precedent)
Rationale: No performance regression (within noise), reduces complexity, maintains research flexibility (COMPILED=1 available)
Takeaway: NEUTRAL verdicts justify compile-out even without performance wins

Next Phase Candidates (Phase 31+)

Completed Audits

Background Spill Queue (Phase 28) ✅ COMPLETE (NO-OP)
- Result: All CORRECTNESS atomics, no compile-out candidates
- Reason: Lock-free queue + flow control counter
Pool Hotbox v2 Stats (Phase 29) ✅ COMPLETE (NO-OP)
- Result: All TELEMETRY atomics, but code path not active (ENV-gated)
- Reason: HAKMEM_POOL_V2_ENABLED defaults to OFF
Standard Procedure Documentation (Phase 30) ✅ COMPLETE (PROCEDURE)
- Result: 4-step procedure standardized, atomic audit complete (412 atomics)
- Reason: Methodology standardization, not a performance phase

High Priority: Phase 32 Target (NEXT)

Tiny Free Trace Atomic (Phase 31) ✅ COMPLETE (NEUTRAL -0.35%)
- Result: NEUTRAL verdict, adopted for code cleanliness
- Reason: HOT path atomic with zero measurable overhead (rate-limited trace)
Tiny Free Calls Counter (Phase 32) ✅ COMPLETE (NEUTRAL -0.46%)
- Result: NEUTRAL verdict, adopted for code cleanliness
- Reason: HOT path diagnostic counter with negligible overhead (code alignment effects)

High Priority: Phase 33 Target (NEXT)

Tiny Debug Ring Record (Phase 33 - TOP PRIORITY) ⭐
- Target: tiny_debug_ring_record(TINY_RING_EVENT_FREE_ENTER, ...) (HOT path)
- File: core/hakmem_tiny_free.inc:340 (3 lines after Phase 32 target)
- Classification: TELEMETRY (debug ring buffer, event logging)
- Execution: ⚠️ REQUIRES STEP 0 VERIFICATION (Phase 30 lesson)
- Verification Required:
```
# Check if debug ring is ENV-gated or always-on
rg "getenv.*DEBUG_RING" core/
rg "HAKMEM.*DEBUG.*RING" core/
```
- Expected Gain: +0.3% to +1.0% (if always-on, similar to Phase 25/31/32)
- Priority: HIGHEST (same HOT path as Phase 31+32, same function)
- Warning: Only proceed if debug ring is always-on by default (not ENV-gated)

Medium Priority: Uncertain Candidates

P0 Class OOB Log (Phase 34 candidate)
- Target: g_p0_class_oob_log (WARM path)
- File: core/hakmem_tiny_refill_p0.inc.h:41
- Classification: TELEMETRY (error logging)
- Execution: ❓ UNCERTAIN (error path, needs verification)
- Expected Gain: ±0.0% to +0.2%
- Priority: MEDIUM (verify execution first)
Remote Target Queue (Phase 34 candidate)
- Targets: g_remote_target_len[class_idx] atomics
- File: core/hakmem_tiny_remote_target.c
- Atomics: atomic_fetch_add/sub on queue length
- Frequency: Warm (remote free path)
- Expected Gain: +0.1-0.3% (if telemetry)
- Priority: MEDIUM (needs correctness review - similar to bg_spill)
- Warning: May be flow control like g_bg_spill_len, needs audit

Low Priority: ENV-gated (SKIP)

Warm Pool Prefill Logs (SKIP - ENV-gated)
- Targets: rel_logs, dbg_logs (WARM path)
- Files: core/box/warm_pool_prefill_box.h, core/hakmem_tiny_refill.inc.h
- Classification: TELEMETRY (fprintf only)
- Execution: ❌ ENV-gated (HAKMEM_TINY_WARM_LOG=OFF by default)
- Expected Gain: 0.0% (NO-OP, Phase 29 lesson)
- Priority: SKIP (not executed in benchmark)

Low Priority: Cold Path Atomics

SuperSlab OS Stats (Phase 35+)
- Targets: g_ss_os_alloc_calls, g_ss_os_madvise_calls, etc.
- Files: core/box/ss_os_acquire_box.h, core/box/madvise_guard_box.c
- Frequency: Cold (init/mmap/madvise)
- Expected Gain: <0.1%
- Priority: LOW (code cleanliness only)

Pattern Template (For Future Phases)

Step 1: Add Build Flag

// core/hakmem_build_flags.h
#ifndef HAKMEM_[NAME]_COMPILED
#  define HAKMEM_[NAME]_COMPILED 0
#endif

Step 2: Wrap Atomic

// core/[file].c
#if HAKMEM_[NAME]_COMPILED
    atomic_fetch_add_explicit(&g_[name], 1, memory_order_relaxed);
#else
    (void)0;  // No-op when compiled out
#endif

Step 3: A/B Test

# Baseline (compiled-out, default)
make clean && make -j bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh > baseline.txt

# Compiled-in
make clean && make -j EXTRA_CFLAGS='-DHAKMEM_[NAME]_COMPILED=1' bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh > compiled_in.txt

Step 4: Analyze & Verdict

improvement = ((baseline_avg - compiled_in_avg) / compiled_in_avg) * 100

if improvement >= 0.5:
    verdict = "GO (keep compiled-out)"
elif improvement <= -0.5:
    verdict = "NO-GO (revert, compiled-in is better)"
else:
    verdict = "NEUTRAL (keep compiled-out for cleanliness)"

Step 5: Document

Create docs/analysis/PHASE[N]_[NAME]_RESULTS.md with:

Implementation details
A/B test results
Verdict & reasoning
Files modified

Build Flag Summary

All atomic compile gates in core/hakmem_build_flags.h:

// Phase 24: Tiny Class Stats (GO +0.93%)
#ifndef HAKMEM_TINY_CLASS_STATS_COMPILED
#  define HAKMEM_TINY_CLASS_STATS_COMPILED 0
#endif

// Phase 25: Tiny Free Stats (GO +1.07%)
#ifndef HAKMEM_TINY_FREE_STATS_COMPILED
#  define HAKMEM_TINY_FREE_STATS_COMPILED 0
#endif

// Phase 27: Unified Cache Stats (GO +0.74%)
#ifndef HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
#  define HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED 0
#endif

// Phase 26A: C7 Free Count (NEUTRAL -0.33%)
#ifndef HAKMEM_C7_FREE_COUNT_COMPILED
#  define HAKMEM_C7_FREE_COUNT_COMPILED 0
#endif

// Phase 26B: Header Mismatch Log (NEUTRAL)
#ifndef HAKMEM_HDR_MISMATCH_LOG_COMPILED
#  define HAKMEM_HDR_MISMATCH_LOG_COMPILED 0
#endif

// Phase 26C: Header Meta Mismatch (NEUTRAL)
#ifndef HAKMEM_HDR_META_MISMATCH_COMPILED
#  define HAKMEM_HDR_META_MISMATCH_COMPILED 0
#endif

// Phase 26D: Metric Bad Class (NEUTRAL)
#ifndef HAKMEM_METRIC_BAD_CLASS_COMPILED
#  define HAKMEM_METRIC_BAD_CLASS_COMPILED 0
#endif

// Phase 26E: Header Meta Fast (NEUTRAL)
#ifndef HAKMEM_HDR_META_FAST_COMPILED
#  define HAKMEM_HDR_META_FAST_COMPILED 0
#endif

// Phase 29: Pool Hotbox v2 Stats (NO-OP - code not active)
#ifndef HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED
#  define HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED 0
#endif

// Phase 31: Tiny Free Trace (NEUTRAL -0.35%)
#ifndef HAKMEM_TINY_FREE_TRACE_COMPILED
#  define HAKMEM_TINY_FREE_TRACE_COMPILED 0
#endif

// Phase 32: Tiny Free Calls (NEUTRAL -0.46%)
#ifndef HAKMEM_TINY_FREE_CALLS_COMPILED
#  define HAKMEM_TINY_FREE_CALLS_COMPILED 0
#endif

Default State: All flags = 0 (compiled-out, production-ready) Research Use: Set flag = 1 to enable specific telemetry atomic

Conclusion

Total Progress (Phase 24+25+26+27+28+29+30+31+32):

Performance Gain: +2.74% (Phase 24: +0.93%, Phase 25: +1.07%, Phase 26: NEUTRAL, Phase 27: +0.74%, Phase 28: NO-OP, Phase 29: NO-OP, Phase 30: PROCEDURE, Phase 31: NEUTRAL, Phase 32: NEUTRAL)
Atomics Removed: 19 telemetry atomics from hot/warm paths (17 compiled-out + 1 Phase 31 + 1 Phase 32)
Phases Completed: 9 phases (4 with performance changes, 2 audit-only, 1 standardization, 2 cleanliness)
Code Quality: Cleaner hot/warm paths, closer to mimalloc's zero-overhead principle
Methodology: 4-step standard procedure validated (Phase 30-31-32)
Next Target: Phase 33 (tiny_debug_ring_record, HOT path, REQUIRES STEP 0 VERIFICATION)

Key Success Factors:

Systematic audit and classification (CORRECTNESS vs TELEMETRY)
Consistent A/B testing methodology
Clear verdict criteria (GO/NEUTRAL/NO-GO)
Focus on high-frequency atomics for performance
Compile-out low-frequency atomics for cleanliness
NEW: Step 0 execution verification (Phase 30 standard procedure)

Future Work:

Immediate: Phase 33 (tiny_debug_ring_record, HOT path, same location as Phase 31+32)
CRITICAL: Phase 33 requires Step 0 verification (ENV gate check) before proceeding
Expected cumulative gain: +2.74% (stable, no further performance gains expected from Phase 31+32 NEUTRAL results)
Follow Phase 30 standard procedure for all future candidates
Focus on execution-verified, high-frequency paths
Document all verdicts for reproducibility
Accept NEUTRAL verdicts for code cleanliness (Phase 26/31/32 pattern)

Lessons from Phase 28+29+30+31+32:

Not all atomic counters are telemetry (Phase 28: flow control counters are CORRECTNESS)
Flow control counters (e.g., g_bg_spill_len) are UNTOUCHABLE
Always trace how counter is used before classifying
Verify code path is ACTIVE before A/B testing (Phase 29: ENV-gated code has zero impact)
Standard procedure prevents repeated mistakes (Phase 30: Step 0 gate prevents Phase 29-style no-ops)
Not all HOT path atomics have measurable overhead (Phase 31: -0.35% NEUTRAL, Phase 32: -0.46% NEUTRAL)
NEUTRAL verdicts justify adoption for code cleanliness (Phase 26/31/32 precedent)
Code alignment matters: Phase 32 showed compiled-in was faster (code layout effects, not atomic overhead)

Last Updated: 2025-12-16 Status: Phase 24-27+31+32 Complete (+2.74%), Phase 28-29 NO-OP, Phase 30 Procedure Complete Next Phase: Phase 33 (tiny_debug_ring_record, HOT path, REQUIRES STEP 0 VERIFICATION) Maintained By: Claude Sonnet 4.5

27 KiB Raw Blame History

Hot Path Atomic Telemetry Prune - Cumulative Summary

Overview

Completed Phases

Phase 24: Tiny Class Stats Atomic Prune ✅ GO (+0.93%)

Phase 25: Free Stats Atomic Prune ✅ GO (+1.07%)

Phase 26: Hot Path Diagnostic Atomics Prune ✅ NEUTRAL (-0.33%)

Phase 27: Unified Cache Stats Atomic Prune ✅ GO (+0.74%)

Phase 28: Background Spill Queue Atomic Audit ✅ NO-OP (All CORRECTNESS)

Phase 29: Pool Hotbox v2 Stats Atomic Audit ✅ NO-OP (Code Not Active)

Phase 30: Standard Procedure Documentation ✅ PROCEDURE COMPLETE

Phase 31: Tiny Free Trace Atomic Prune ✅ NEUTRAL (-0.35%)

Phase 32: Tiny Free Calls Atomic Prune ✅ NEUTRAL (-0.46%)

Cumulative Impact

Lessons Learned

1. Frequency Trumps Count (But Not Always)

2. Edge Cases Don't Matter (Performance-Wise)

3. Compile-Time Gates Work Well

4. Noise Margin: ±0.5%

5. Classification is Critical

6. Verify Code is Active (NEW: Phase 29 Lesson)

7. Standard Procedure is Reusable (NEW: Phase 30)

8. NEUTRAL + Cleanliness = Valid Adoption (Phase 26/31 Pattern)

Next Phase Candidates (Phase 31+)

Completed Audits

High Priority: Phase 32 Target (NEXT)

High Priority: Phase 33 Target (NEXT)

Medium Priority: Uncertain Candidates

Low Priority: ENV-gated (SKIP)

Low Priority: Cold Path Atomics

Pattern Template (For Future Phases)

Step 1: Add Build Flag

Step 2: Wrap Atomic

Step 3: A/B Test

Step 4: Analyze & Verdict

Step 5: Document

Build Flag Summary

Conclusion

27 KiB

Raw Blame History