# Phase 26: Hot Path Atomic Telemetry Prune - Complete Results
**Date:** 2025-12-16
**Status:** ✅ COMPLETE (NEUTRAL verdict, keep compiled-out for cleanliness)
**Pattern:** Followed Phase 24 (tiny_class_stats) + Phase 25 (g_free_ss_enter)
**Impact:** -0.33% (compiled-out relative to compiled-in; NEUTRAL, within ±0.5% noise margin)
---
## Executive Summary
**Goal:** Systematically compile out all telemetry-only `atomic_fetch_add/sub` operations from the hot alloc/free paths.
**Method:**
- Audited all 200+ atomics in `core/` directory
- Identified 5 high-priority hot-path telemetry atomics
- Implemented compile gates for each (default: OFF)
- Ran A/B test: baseline (compiled-out) vs compiled-in
**Results:**
- **Baseline (compiled-out):** 53.14 M ops/s (±0.96M)
- **Compiled-in (all atomics):** 53.31 M ops/s (±1.09M)
- **Difference:** -0.33% (compiled-out relative to compiled-in; NEUTRAL, within noise margin)
**Verdict:** **NEUTRAL** - keep compiled-out for code cleanliness
- Atomics have negligible impact on this benchmark
- Compiled-out version is cleaner and more maintainable
- Consistent with mimalloc principle: no telemetry in hot path
---
## Phase 26 Implementation Details
### Phase 26A: `c7_free_count` Atomic Prune
**Target:** `core/tiny_superslab_free.inc.h:51`
**Code:**
```c
static _Atomic int c7_free_count = 0;
int count = atomic_fetch_add_explicit(&c7_free_count, 1, memory_order_relaxed);
```
**Purpose:** Debug counter for C7 free path diagnostics (log first C7 free)
**Implementation:**
```c
// Phase 26A: Compile-out c7_free_count atomic (default OFF)
#if HAKMEM_C7_FREE_COUNT_COMPILED
    static _Atomic int c7_free_count = 0;
    int count = atomic_fetch_add_explicit(&c7_free_count, 1, memory_order_relaxed);
    if (count == 0) {
    #if !HAKMEM_BUILD_RELEASE && HAKMEM_DEBUG_VERBOSE
        fprintf(stderr, "[C7_FIRST_FREE] ptr=%p base=%p slab_idx=%d\n", ptr, base, slab_idx);
    #endif
    }
#else
    (void)0;  // No-op when compiled out
#endif
```
**Build Flag:** `HAKMEM_C7_FREE_COUNT_COMPILED` (default: 0)
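
The gate pattern above can be tried in isolation. The following is a minimal, self-contained sketch, not the HAKMEM source: `DEMO_FREE_COUNT_COMPILED` and `demo_free_hook` are hypothetical names chosen for illustration, and the `-1` return value exists only so the compiled-out case is observable.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical gate for this sketch; it mirrors the HAKMEM_*_COMPILED pattern. */
#ifndef DEMO_FREE_COUNT_COMPILED
#  define DEMO_FREE_COUNT_COMPILED 0   /* default: compiled out */
#endif

/* Returns the number of frees counted before this call,
 * or -1 when the counter is compiled out. */
static int demo_free_hook(void *ptr) {
#if DEMO_FREE_COUNT_COMPILED
    static _Atomic int demo_free_count = 0;
    int count = atomic_fetch_add_explicit(&demo_free_count, 1, memory_order_relaxed);
    if (count == 0) {
        /* One-shot diagnostic on the very first call. */
        fprintf(stderr, "[DEMO_FIRST_FREE] ptr=%p\n", ptr);
    }
    return count;
#else
    (void)ptr;   /* no atomic, no branch, no static storage */
    return -1;
#endif
}
```

Building with `-DDEMO_FREE_COUNT_COMPILED=1` restores the counter and the first-call log; with the default, the function body contains no atomic operation at all.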
---
### Phase 26B: `g_hdr_mismatch_log` Atomic Prune
**Target:** `core/tiny_superslab_free.inc.h:153`
**Code:**
```c
static _Atomic uint32_t g_hdr_mismatch_log = 0;
uint32_t n = atomic_fetch_add_explicit(&g_hdr_mismatch_log, 1, memory_order_relaxed);
```
**Purpose:** Log header validation mismatches (debug diagnostics)
**Implementation:**
```c
// Phase 26B: Compile-out g_hdr_mismatch_log atomic (default OFF)
#if HAKMEM_HDR_MISMATCH_LOG_COMPILED
static _Atomic uint32_t g_hdr_mismatch_log = 0;
uint32_t n = atomic_fetch_add_explicit(&g_hdr_mismatch_log, 1, memory_order_relaxed);
#else
uint32_t n = 0; // No-op when compiled out
#endif
```
**Build Flag:** `HAKMEM_HDR_MISMATCH_LOG_COMPILED` (default: 0)
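
One consequence of the `n = 0` fallback is worth checking at each call site. The excerpt does not show how `n` is consumed; assuming it feeds a "log only the first few events" check, the compiled-out build sees `n == 0` on every event, so such a check would always pass. The sketch below illustrates this under that assumption. `DEMO_MISMATCH_LOG_COMPILED`, `DEMO_LOG_LIMIT`, `demo_mismatch_seq`, and `demo_should_log` are hypothetical names, not HAKMEM identifiers.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical gate for this sketch (default: compiled out). */
#ifndef DEMO_MISMATCH_LOG_COMPILED
#  define DEMO_MISMATCH_LOG_COMPILED 0
#endif

#define DEMO_LOG_LIMIT 8u   /* hypothetical cap on logged events */

/* Sequence number of the current event: 0, 1, 2, ... when compiled in,
 * always 0 when compiled out. */
static uint32_t demo_mismatch_seq(void) {
#if DEMO_MISMATCH_LOG_COMPILED
    static _Atomic uint32_t demo_mismatch_log = 0;
    return atomic_fetch_add_explicit(&demo_mismatch_log, 1, memory_order_relaxed);
#else
    return 0;
#endif
}

/* Hypothetical consumer: allow logging only for the first DEMO_LOG_LIMIT events. */
static int demo_should_log(uint32_t n) {
    return n < DEMO_LOG_LIMIT;
}
```

If the real consumer is shaped like `demo_should_log`, the surrounding log statement should itself sit behind a debug-only gate so that the compiled-out build cannot log without a limit.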
---
### Phase 26C: `g_hdr_meta_mismatch` Atomic Prune
**Target:** `core/tiny_superslab_free.inc.h:195`
**Code:**
```c
static _Atomic uint32_t g_hdr_meta_mismatch = 0;
uint32_t n = atomic_fetch_add_explicit(&g_hdr_meta_mismatch, 1, memory_order_relaxed);
```
**Purpose:** Log metadata validation failures (debug diagnostics)
**Implementation:**
```c
// Phase 26C: Compile-out g_hdr_meta_mismatch atomic (default OFF)
#if HAKMEM_HDR_META_MISMATCH_COMPILED
static _Atomic uint32_t g_hdr_meta_mismatch = 0;
uint32_t n = atomic_fetch_add_explicit(&g_hdr_meta_mismatch, 1, memory_order_relaxed);
#else
uint32_t n = 0; // No-op when compiled out
#endif
```
**Build Flag:** `HAKMEM_HDR_META_MISMATCH_COMPILED` (default: 0)
---
### Phase 26D: `g_metric_bad_class_once` Atomic Prune
**Target:** `core/hakmem_tiny_alloc.inc:24`
**Code:**
```c
static _Atomic int g_metric_bad_class_once = 0;
if (atomic_fetch_add_explicit(&g_metric_bad_class_once, 1, memory_order_relaxed) == 0) {
    fprintf(stderr, "[ALLOC_1024_METRIC] bad class_idx=%d size=%zu\n", class_idx, req_size);
}
```
**Purpose:** One-shot metric for bad class index (safety check)
**Implementation:**
```c
// Phase 26D: Compile-out g_metric_bad_class_once atomic (default OFF)
#if HAKMEM_METRIC_BAD_CLASS_COMPILED
    static _Atomic int g_metric_bad_class_once = 0;
    if (atomic_fetch_add_explicit(&g_metric_bad_class_once, 1, memory_order_relaxed) == 0) {
        fprintf(stderr, "[ALLOC_1024_METRIC] bad class_idx=%d size=%zu\n", class_idx, req_size);
    }
#else
    (void)0;  // No-op when compiled out
#endif
```
**Build Flag:** `HAKMEM_METRIC_BAD_CLASS_COMPILED` (default: 0)
---
### Phase 26E: `g_hdr_meta_fast` Atomic Prune
**Target:** `core/tiny_free_fast_v2.inc.h:183`
**Code:**
```c
static _Atomic uint32_t g_hdr_meta_fast = 0;
uint32_t n = atomic_fetch_add_explicit(&g_hdr_meta_fast, 1, memory_order_relaxed);
```
**Purpose:** Fast-path header metadata hit counter (telemetry)
**Implementation:**
```c
// Phase 26E: Compile-out g_hdr_meta_fast atomic (default OFF)
#if HAKMEM_HDR_META_FAST_COMPILED
static _Atomic uint32_t g_hdr_meta_fast = 0;
uint32_t n = atomic_fetch_add_explicit(&g_hdr_meta_fast, 1, memory_order_relaxed);
#else
uint32_t n = 0; // No-op when compiled out
#endif
```
**Build Flag:** `HAKMEM_HDR_META_FAST_COMPILED` (default: 0)
---
## A/B Test Methodology
### Build Configurations
**Baseline (compiled-out, default):**
```bash
make clean
make -j bench_random_mixed_hakmem
# All Phase 26 flags default to 0 (compiled-out)
```
**Compiled-in (all atomics enabled):**
```bash
make clean
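# Double quotes: inside them, backslash-newline is a real line continuation
# (inside single quotes the backslashes and newlines would be passed literally)
make -j \
  EXTRA_CFLAGS="-DHAKMEM_C7_FREE_COUNT_COMPILED=1 \
    -DHAKMEM_HDR_MISMATCH_LOG_COMPILED=1 \
    -DHAKMEM_HDR_META_MISMATCH_COMPILED=1 \
    -DHAKMEM_METRIC_BAD_CLASS_COMPILED=1 \
    -DHAKMEM_HDR_META_FAST_COMPILED=1" \
  bench_random_mixed_hakmem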
```
### Benchmark Protocol
**Workload:** `bench_random_mixed_hakmem` (mixed alloc/free, realistic workload)
**Runs:** 10 iterations per configuration
**Environment:** Clean environment (no ENV overrides)
**Script:** `./scripts/run_mixed_10_cleanenv.sh`
---
## Detailed Results
### Baseline (Compiled-Out, Default)
```
Run 1: 52,461,094 ops/s
Run 2: 51,925,957 ops/s
Run 3: 51,350,083 ops/s
Run 4: 53,636,515 ops/s
Run 5: 52,748,470 ops/s
Run 6: 54,275,764 ops/s
Run 7: 53,780,940 ops/s
Run 8: 53,956,030 ops/s
Run 9: 53,599,190 ops/s
Run 10: 53,628,420 ops/s
Average: 53,136,246 ops/s
StdDev: 963,465 ops/s (±1.81%)
```
### Compiled-In (All Atomics Enabled)
```
Run 1: 53,293,891 ops/s
Run 2: 50,898,548 ops/s
Run 3: 51,829,279 ops/s
Run 4: 54,060,593 ops/s
Run 5: 54,067,053 ops/s
Run 6: 53,704,313 ops/s
Run 7: 54,160,166 ops/s
Run 8: 53,985,836 ops/s
Run 9: 53,687,837 ops/s
Run 10: 53,420,216 ops/s
Average: 53,310,773 ops/s
StdDev: 1,087,011 ops/s (±2.04%)
```
### Statistical Analysis
**Difference (compiled-out minus compiled-in):** 53,136,246 - 53,310,773 = **-174,527 ops/s**
**Relative difference:** (-174,527 / 53,310,773) * 100 = **-0.33%**
**Noise Margin:** ±0.5%
**Conclusion:** NEUTRAL (difference within noise margin)
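
The figures above can be reproduced from the per-run data. The sketch below recomputes the mean, the sample standard deviation (the reported StdDev values match the n-1 form), and the relative difference. It also computes Welch's t statistic as an additional check; that statistic is not part of the original protocol, which relies on the ±0.5% threshold. All function names are illustrative.

```c
#include <stddef.h>

/* Per-run throughput (ops/s), copied from the Detailed Results section. */
static const double baseline_runs[10] = {
    52461094, 51925957, 51350083, 53636515, 52748470,
    54275764, 53780940, 53956030, 53599190, 53628420
};
static const double compiled_in_runs[10] = {
    53293891, 50898548, 51829279, 54060593, 54067053,
    53704313, 54160166, 53985836, 53687837, 53420216
};

/* Newton's method square root, so the sketch builds without linking libm. */
static double ab_sqrt(double v) {
    if (v <= 0.0) return 0.0;
    double r = v;
    for (int i = 0; i < 64; i++) r = 0.5 * (r + v / r);
    return r;
}

static double ab_mean(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (double)n;
}

/* Sample variance (divides by n - 1). */
static double ab_variance(const double *x, size_t n) {
    double m = ab_mean(x, n), ss = 0.0;
    for (size_t i = 0; i < n; i++) ss += (x[i] - m) * (x[i] - m);
    return ss / (double)(n - 1);
}

static double ab_stddev(const double *x, size_t n) {
    return ab_sqrt(ab_variance(x, n));
}

/* Relative difference of a versus b, in percent. */
static double ab_rel_diff_percent(double a, double b) {
    return (a - b) / b * 100.0;
}

/* Welch's t statistic for the difference of two sample means. */
static double ab_welch_t(const double *a, size_t na, const double *b, size_t nb) {
    double se2 = ab_variance(a, na) / (double)na + ab_variance(b, nb) / (double)nb;
    return (ab_mean(a, na) - ab_mean(b, nb)) / ab_sqrt(se2);
}
```

With these inputs, `ab_rel_diff_percent(ab_mean(baseline_runs, 10), ab_mean(compiled_in_runs, 10))` comes out at about -0.33, and `ab_welch_t(baseline_runs, 10, compiled_in_runs, 10)` at roughly -0.38, far from any conventional significance threshold and consistent with the NEUTRAL conclusion.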
---
## Verdict & Recommendations
### NEUTRAL ➡️ Keep Compiled-Out ✅
**Why NEUTRAL?**
- Difference (-0.33%) is well within ±0.5% noise margin
- The ±1 standard-deviation ranges of the two configurations overlap almost entirely
- These atomics are rarely executed (debug/edge cases only)
- Run-to-run standard deviation (~2%) exceeds the observed difference
**Why Keep Compiled-Out?**
1. **Code Cleanliness:** Removes dead telemetry code from production builds
2. **Maintainability:** Clearer hot path without diagnostic clutter
3. **Mimalloc Principle:** No telemetry/observe in hot path (consistency)
4. **Conservative Choice:** When neutral, prefer simpler code
5. **Future Benefit:** May slightly reduce binary size and icache pressure (not measured in this phase)
**Default Settings:** All Phase 26 flags remain **0** (compiled-out)
---
## Cumulative Phase 24+25+26 Impact
| Phase | Target | File | Impact | Status |
|-------|--------|------|--------|--------|
| **24** | `g_tiny_class_stats_*` | tiny_class_stats_box.h | **+0.93%** | GO ✅ |
| **25** | `g_free_ss_enter` | tiny_superslab_free.inc.h:22 | **+1.07%** | GO ✅ |
| **26A** | `c7_free_count` | tiny_superslab_free.inc.h:51 | -0.33% (26A-E combined) | NEUTRAL |
| **26B** | `g_hdr_mismatch_log` | tiny_superslab_free.inc.h:153 | (bundled) | NEUTRAL |
| **26C** | `g_hdr_meta_mismatch` | tiny_superslab_free.inc.h:195 | (bundled) | NEUTRAL |
| **26D** | `g_metric_bad_class_once` | hakmem_tiny_alloc.inc:24 | (bundled) | NEUTRAL |
| **26E** | `g_hdr_meta_fast` | tiny_free_fast_v2.inc.h:183 | (bundled) | NEUTRAL |
**Cumulative Improvement:** **+2.00%** (Phase 24: +0.93% + Phase 25: +1.07%)
- Phase 26 contributes no measurable gain (NEUTRAL); its benefit is code cleanliness
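
The +2.00% figure adds the per-phase gains. If each phase was measured against the previous phase's build (an assumption; this report does not say), the gains compound rather than add. The sketch below shows that the two views differ by only about 0.01 percentage points at this scale. `compound_gain_percent` is an illustrative helper, not project code.

```c
/* Combine per-phase gains (in percent) by compounding:
 * total = ((1 + g1/100) * (1 + g2/100) * ... - 1) * 100 */
static double compound_gain_percent(const double *gains_pct, int n) {
    double factor = 1.0;
    for (int i = 0; i < n; i++) {
        factor *= 1.0 + gains_pct[i] / 100.0;
    }
    return (factor - 1.0) * 100.0;
}
```

For the gains `{0.93, 1.07, 0.0}` (Phases 24, 25, 26) the compounded result is approximately +2.01%, so the additive +2.00% is an adequate summary.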
---
## Next Steps: Phase 27+ Candidates
### Warm Path Candidates (Expected: +0.1-0.4% each)
1. **Unified Cache Stats** (warm path, multiple atomics)
- `g_unified_cache_hits_global`
- `g_unified_cache_misses_global`
- `g_unified_cache_refill_cycles_global`
- **File:** `core/front/tiny_unified_cache.c`
- **Priority:** MEDIUM
- **Expected Gain:** +0.2-0.4%
2. **Background Spill Queue** (warm path, refill/spill)
- `g_bg_spill_len` (may be CORRECTNESS - needs review)
- **File:** `core/hakmem_tiny_bg_spill.h`
- **Priority:** MEDIUM (pending classification)
- **Expected Gain:** +0.1-0.2% (if telemetry)
### Cold Path Candidates (Low Priority)
- SS allocation stats (`g_ss_os_alloc_calls`, `g_ss_os_madvise_calls`, etc.)
- Shared pool diagnostics (`rel_c7_*`, `dbg_c7_*`)
- Debug logs (`g_hak_alloc_at_trace`, `g_hak_free_at_trace`)
- **Expected Gain:** <0.1% (cold path, low frequency)
---
## Lessons Learned
### Why Did Phase 26 Show NEUTRAL While Phases 24+25 Showed GO?
1. **Execution Frequency:**
- Phase 24 (`g_tiny_class_stats_*`): Every cache hit/miss (hot)
- Phase 25 (`g_free_ss_enter`): Every superslab free (hot)
- Phase 26: Only edge cases (header mismatch, C7 first-free, bad class) - **rarely executed**
2. **Benchmark Characteristics:**
- `bench_random_mixed_hakmem` mostly hits happy paths
- Phase 26 atomics are in error/diagnostic paths (rarely taken)
- Removing code that is rarely executed yields no measurable performance benefit
3. **Implication:**
- Hot path frequency matters more than atomic count
- Focus future work on **always-executed** atomics
- Edge-case atomics: compile-out for cleanliness, not performance
---
## Build Flag Reference
All Phase 26 flags are defined in `core/hakmem_build_flags.h` (lines 293-340):
```c
// Phase 26A: C7 Free Count
#ifndef HAKMEM_C7_FREE_COUNT_COMPILED
# define HAKMEM_C7_FREE_COUNT_COMPILED 0
#endif
// Phase 26B: Header Mismatch Log
#ifndef HAKMEM_HDR_MISMATCH_LOG_COMPILED
# define HAKMEM_HDR_MISMATCH_LOG_COMPILED 0
#endif
// Phase 26C: Header Meta Mismatch
#ifndef HAKMEM_HDR_META_MISMATCH_COMPILED
# define HAKMEM_HDR_META_MISMATCH_COMPILED 0
#endif
// Phase 26D: Metric Bad Class
#ifndef HAKMEM_METRIC_BAD_CLASS_COMPILED
# define HAKMEM_METRIC_BAD_CLASS_COMPILED 0
#endif
// Phase 26E: Header Meta Fast
#ifndef HAKMEM_HDR_META_FAST_COMPILED
# define HAKMEM_HDR_META_FAST_COMPILED 0
#endif
```
**Usage (research builds only):**
```bash
make EXTRA_CFLAGS='-DHAKMEM_C7_FREE_COUNT_COMPILED=1' bench_random_mixed_hakmem
```
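
Because these flags are meant for research builds only, a compile-time guard could reject them in release builds. The sketch below is one possible shape and is not taken from `hakmem_build_flags.h`. It assumes `HAKMEM_BUILD_RELEASE` is a 0/1 macro (as its use in the Phase 26A snippet suggests); `PHASE26_ANY_COMPILED` and `phase26_enabled_mask` are hypothetical helpers.

```c
/* Defaults repeated here so the sketch is self-contained. */
#ifndef HAKMEM_BUILD_RELEASE
#  define HAKMEM_BUILD_RELEASE 0          /* assumed 0/1 macro */
#endif
#ifndef HAKMEM_C7_FREE_COUNT_COMPILED
#  define HAKMEM_C7_FREE_COUNT_COMPILED 0
#endif
#ifndef HAKMEM_HDR_MISMATCH_LOG_COMPILED
#  define HAKMEM_HDR_MISMATCH_LOG_COMPILED 0
#endif
#ifndef HAKMEM_HDR_META_MISMATCH_COMPILED
#  define HAKMEM_HDR_META_MISMATCH_COMPILED 0
#endif
#ifndef HAKMEM_METRIC_BAD_CLASS_COMPILED
#  define HAKMEM_METRIC_BAD_CLASS_COMPILED 0
#endif
#ifndef HAKMEM_HDR_META_FAST_COMPILED
#  define HAKMEM_HDR_META_FAST_COMPILED 0
#endif

/* Hypothetical helper: true if any Phase 26 gate is enabled. */
#define PHASE26_ANY_COMPILED \
    (HAKMEM_C7_FREE_COUNT_COMPILED || HAKMEM_HDR_MISMATCH_LOG_COMPILED || \
     HAKMEM_HDR_META_MISMATCH_COMPILED || HAKMEM_METRIC_BAD_CLASS_COMPILED || \
     HAKMEM_HDR_META_FAST_COMPILED)

/* Fail the build if a research-only gate leaks into a release build. */
#if HAKMEM_BUILD_RELEASE && PHASE26_ANY_COMPILED
#  error "Phase 26 telemetry gates are intended for research builds only"
#endif

/* Bitmask of enabled gates (bit 0 = 26A ... bit 4 = 26E), e.g. for a startup banner. */
static unsigned phase26_enabled_mask(void) {
    return (HAKMEM_C7_FREE_COUNT_COMPILED     ? 1u << 0 : 0u) |
           (HAKMEM_HDR_MISMATCH_LOG_COMPILED  ? 1u << 1 : 0u) |
           (HAKMEM_HDR_META_MISMATCH_COMPILED ? 1u << 2 : 0u) |
           (HAKMEM_METRIC_BAD_CLASS_COMPILED  ? 1u << 3 : 0u) |
           (HAKMEM_HDR_META_FAST_COMPILED     ? 1u << 4 : 0u);
}
```

With the defaults, `phase26_enabled_mask()` returns 0; building with `-DHAKMEM_C7_FREE_COUNT_COMPILED=1` sets bit 0.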
---
## Files Modified
### 1. Build Flags
- `core/hakmem_build_flags.h` (lines 293-340): 5 new compile gates
### 2. Hot Path Files
- `core/tiny_superslab_free.inc.h` (lines 51, 153, 195): 3 atomics wrapped
- `core/hakmem_tiny_alloc.inc` (line 24): 1 atomic wrapped
- `core/tiny_free_fast_v2.inc.h` (line 183): 1 atomic wrapped
### 3. Documentation
- `docs/analysis/PHASE26_HOT_PATH_ATOMIC_AUDIT.md` (audit plan)
- `docs/analysis/PHASE26_HOT_PATH_ATOMIC_PRUNE_RESULTS.md` (this file)
---
## Conclusion
**Phase 26 Status:** ✅ **COMPLETE** (NEUTRAL verdict)
**Key Outcomes:**
1. Successfully compiled out 5 hot-path telemetry atomics
2. Verified NEUTRAL impact (-0.33%, within noise)
3. Kept compiled-out for code cleanliness and maintainability
4. Established pattern for future atomic prune phases
5. Identified next candidates for Phase 27+ (unified cache stats)
**Cumulative Progress (Phase 24+25+26):**
- **Performance:** +2.00% (Phase 24: +0.93%, Phase 25: +1.07%, Phase 26: NEUTRAL)
- **Code Quality:** Removed 12 hot-path telemetry atomics (7 from 24+25, 5 from 26)
- **mimalloc Alignment:** Hot path now cleaner, closer to mimalloc's zero-overhead principle
**Next Actions:**
- Phase 27: Target unified cache stats (warm path, +0.2-0.4% expected)
- Continue systematic atomic audit and prune
- Document all verdicts for future reference
---
**Date Completed:** 2025-12-16
**Engineer:** Claude Sonnet 4.5
**Review Status:** Ready for integration