Summary:
- Phase 24 (alloc stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL (code cleanliness)
- Total: 11 atomics compiled-out, +2.00% improvement

Phase 24: OBSERVE tax prune (tiny_class_stats_box.h)
- Added HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)
- Wrapped 5 stats functions: uc_miss, warm_hit, shared_lock, tls_carve_*
- Result: +0.93% (baseline 56.675M vs compiled-in 56.151M ops/s)

Phase 25: Tiny free stats prune (tiny_superslab_free.inc.h)
- Added HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)
- Wrapped g_free_ss_enter atomic in free hot path
- Result: +1.07% (baseline 57.017M vs compiled-in 56.415M ops/s)

Phase 26: Hot path diagnostic atomics prune
- Added 5 compile gates for low-frequency error counters:
  - HAKMEM_TINY_C7_FREE_COUNT_COMPILED
  - HAKMEM_TINY_HDR_MISMATCH_LOG_COMPILED
  - HAKMEM_TINY_HDR_META_MISMATCH_COMPILED
  - HAKMEM_TINY_METRIC_BAD_CLASS_COMPILED
  - HAKMEM_TINY_HDR_META_FAST_COMPILED
- Result: -0.33% NEUTRAL (within noise, kept for cleanliness)

Alignment with mimalloc principles:
- "No atomics on hot path": telemetry moved to compile-time opt-in
- Fixed per-op tax elimination
- Production builds: maximum performance (atomics compiled-out)
- Research builds: full diagnostics (COMPILED=1)

Generated with Claude Code https://claude.com/claude-code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Phase 21: Tiny Header HotFull (alloc header write hot/cold split)
Status: ✅ GO (default ON / opt-out)
Problem statement
tiny_region_id_write_header() runs on every allocation and is on the hot path.
Even when the steady-state configuration is the default (header mode = FULL, guard disabled),
the function still carries:
- runtime mode selection (FULL/LIGHT/OFF)
- guard gate (tiny_guard_is_enabled()), even when it is OFF
- extra branches/code for “bench-only” experimentation modes
This is exactly the kind of per-op fixed tax that stays visible after Phase 6–10 consolidation.
Goal
Keep semantics identical, but make the common case fast path behave like:
*(uint8_t*)base = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
return (uint8_t*)base + 1;
Box Theory framing
- This is a refactor inside the TinyHeaderBox (no new global layers).
- Boundary is a single conversion point: tiny_region_id_write_header() decides “hot-full vs slow-path” once, then either returns or calls a cold helper.
- Rollback is easy: keep the old implementation behind an ENV gate.
Proposed implementation
1) Add a dedicated ENV gate (rollback handle)
ENV (default ON / opt-out):
HAKMEM_TINY_HEADER_HOTFULL=0/1
Meaning:
- 0: disable hot/cold split (revert to unified logic)
- 1 (or unset): enable hot/cold split (hot-full + cold helper)
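The opt-out meaning of the gate can be sketched as a small parser. This is an illustration only: the helper names tiny_header_hotfull_enabled_from() and tiny_header_hotfull_enabled() are hypothetical, and treating any value other than "0" as enabled is an assumption, since the text above defines only 0, 1, and unset. The sketch does not cache the result. How and where the value is cached is left to the project's own ENV conventions.

```c
#include <stdlib.h>

/* Hypothetical helper (not from the source): interpret the raw value of
 * HAKMEM_TINY_HEADER_HOTFULL. Unset or empty means enabled (opt-out gate);
 * exactly "0" disables. Any other value is ASSUMED to mean enabled. */
static int tiny_header_hotfull_enabled_from(const char* value) {
    if (value == NULL || value[0] == '\0') {
        return 1; /* unset or empty: default ON */
    }
    return !(value[0] == '0' && value[1] == '\0');
}

/* Hypothetical wrapper: reads the process environment on every call.
 * A real hot path would not call getenv() per allocation. */
static int tiny_header_hotfull_enabled(void) {
    return tiny_header_hotfull_enabled_from(getenv("HAKMEM_TINY_HEADER_HOTFULL"));
}
```

Splitting the string interpretation from the getenv() call keeps the default-ON rule testable without touching the process environment.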
2) Hot path: FULL mode only + no guard call
In core/tiny_region_id.h:
- Keep tiny_header_mode() as-is (do not re-introduce global env-cache SSOT patterns).
- In tiny_region_id_write_header():
  - Compute int header_mode = tiny_header_mode();
  - If HAKMEM_TINY_HEADER_HOTFULL=1 and header_mode == TINY_HEADER_MODE_FULL:
    - write header byte unconditionally
    - return (uint8_t*)base + 1
    - do not call tiny_guard_is_enabled() on this hot path
  - Otherwise, delegate to cold helper (below)
Rationale:
- FULL is the default for performance profiles.
- Guard is a debug tool; when it must be enabled, we pay the slow path cost explicitly.
3) Cold helper: everything else (LIGHT/OFF + guard)
Add a cold noinline helper, e.g.:
__attribute__((cold,noinline))
static void* tiny_region_id_write_header_slow(void* base, int class_idx, int header_mode);
This helper contains:
- LIGHT/OFF store-elision logic
- allocation-side guard hook
- any debug-only plumbing (already under #if !HAKMEM_BUILD_RELEASE)
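Taken together, steps 2 and 3 might look roughly like the sketch below. It is not the project's implementation. The values of HEADER_MAGIC and HEADER_CLASS_MASK, the enum values, the g_mode/g_hotfull stand-ins, and the LIGHT/OFF behavior inside the slow helper are all assumptions made so the example compiles on its own. The guard hook is shown only as a comment. __attribute__ and __builtin_expect assume GCC or Clang.

```c
#include <stdint.h>

/* ASSUMED values for illustration; the real constants live in the project. */
#define HEADER_MAGIC      0xA0u
#define HEADER_CLASS_MASK 0x0Fu

/* ASSUMED enum values; only the names FULL/LIGHT/OFF come from the text. */
enum { TINY_HEADER_MODE_FULL = 0, TINY_HEADER_MODE_LIGHT = 1, TINY_HEADER_MODE_OFF = 2 };

/* Stand-ins for project state so the sketch is self-contained. */
static int g_mode = TINY_HEADER_MODE_FULL; /* result of tiny_header_mode() */
static int g_hotfull = 1;                  /* HAKMEM_TINY_HEADER_HOTFULL   */
static int tiny_header_mode(void) { return g_mode; }
static int tiny_header_hotfull_enabled(void) { return g_hotfull; }

/* Cold helper: everything except the hot FULL case. */
__attribute__((cold, noinline))
static void* tiny_region_id_write_header_slow(void* base, int class_idx, int header_mode) {
    uint8_t* hdr = (uint8_t*)base;
    uint8_t want = (uint8_t)(HEADER_MAGIC | ((unsigned)class_idx & HEADER_CLASS_MASK));
    if (header_mode == TINY_HEADER_MODE_FULL) {
        *hdr = want;                 /* reached when the gate is 0 */
    } else if (header_mode == TINY_HEADER_MODE_LIGHT) {
        if (*hdr != want) *hdr = want; /* ASSUMED: elide the store if already correct */
    }
    /* TINY_HEADER_MODE_OFF: ASSUMED to skip the store entirely. */
    /* The allocation-side guard hook would be called here (omitted). */
    return hdr + 1;
}

/* Hot path: one decision, then either the minimal store or the cold call. */
static inline void* tiny_region_id_write_header(void* base, int class_idx) {
    int header_mode = tiny_header_mode();
    if (__builtin_expect(tiny_header_hotfull_enabled() &&
                         header_mode == TINY_HEADER_MODE_FULL, 1)) {
        *(uint8_t*)base = (uint8_t)(HEADER_MAGIC | ((unsigned)class_idx & HEADER_CLASS_MASK));
        return (uint8_t*)base + 1;   /* no guard call on this path */
    }
    return tiny_region_id_write_header_slow(base, class_idx, header_mode);
}
```

Passing header_mode into the slow helper avoids a second tiny_header_mode() call, which matches the three-argument signature proposed above.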
Safety invariants
- Header byte remains correct for all classes (C0–C7).
- Returned pointer remains base + 1.
- Free path classification remains unchanged.
- When HAKMEM_TINY_HEADER_HOTFULL=1, non-FULL or guard-enabled configurations must still work via the slow helper.
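One way to exercise the first two invariants is a small checker that can be pointed at any header writer, for example once with the gate at 0 and once at 1. The sketch below is hypothetical: the constant values, the header_writer_fn type, and reference_write_header() are assumptions for illustration. Recovering the class with hdr & HEADER_CLASS_MASK is also assumed, since the free path is not shown in this document.

```c
#include <stdint.h>

/* ASSUMED values for illustration; the real constants live in the project. */
#define HEADER_MAGIC      0xA0u
#define HEADER_CLASS_MASK 0x0Fu

/* Hypothetical writer signature used only by this checker. */
typedef void* (*header_writer_fn)(void* base, int class_idx);

/* Reference writer: the two-line fast path from the Goal section. */
static void* reference_write_header(void* base, int class_idx) {
    *(uint8_t*)base = (uint8_t)(HEADER_MAGIC | ((unsigned)class_idx & HEADER_CLASS_MASK));
    return (uint8_t*)base + 1;
}

/* Returns how many of the classes C0..C7 violate an invariant:
 * header byte equals magic|class, returned pointer is base + 1,
 * and the class can be recovered from the byte (ASSUMED decode). */
static int count_header_invariant_violations(header_writer_fn writer) {
    int violations = 0;
    for (int class_idx = 0; class_idx <= 7; class_idx++) {
        uint8_t block[16] = {0};
        void* user = writer(block, class_idx);
        uint8_t expected =
            (uint8_t)(HEADER_MAGIC | ((unsigned)class_idx & HEADER_CLASS_MASK));
        int ok = (user == (void*)(block + 1))
              && (block[0] == expected)
              && ((block[0] & HEADER_CLASS_MASK) == (unsigned)class_idx);
        if (!ok) {
            violations++;
        }
    }
    return violations;
}
```

A result of 0 from count_header_invariant_violations() for both gate settings would indicate that the split did not change the observable header contract for these classes.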
A/B plan (same-binary)
Command:
scripts/run_mixed_10_cleanenv.sh
A:
HAKMEM_TINY_HEADER_HOTFULL=0
B:
HAKMEM_TINY_HEADER_HOTFULL=1
Perf counters (optional, but recommended):
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,iTLB-load-misses,dTLB-load-misses
GO/NO-GO
- GO: Mixed 10-run mean +1.0% or more
- NEUTRAL: within ±1.0%
- NO-GO: -1.0% or worse
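The thresholds above reduce to a small calculation over the per-run throughputs of A and B. The sketch below is illustrative: the function and enum names are hypothetical, and it reads the rules as GO at exactly +1.0% and NO-GO at exactly -1.0%, with NEUTRAL strictly between them.

```c
#include <stddef.h>

/* Hypothetical verdict type for the A/B comparison. */
typedef enum { VERDICT_NO_GO = -1, VERDICT_NEUTRAL = 0, VERDICT_GO = 1 } ab_verdict;

/* Arithmetic mean of n per-run throughput samples (e.g. ops/s). */
static double mean_ops(const double* runs, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += runs[i];
    }
    return n ? sum / (double)n : 0.0;
}

/* Relative change of B against A, in percent. mean_a must be non-zero. */
static double ab_delta_percent(double mean_a, double mean_b) {
    return (mean_b - mean_a) / mean_a * 100.0;
}

/* Apply the thresholds: +1.0% or more is GO, -1.0% or worse is NO-GO,
 * anything strictly in between is NEUTRAL. */
static ab_verdict ab_classify(double delta_percent) {
    if (delta_percent >= 1.0) return VERDICT_GO;
    if (delta_percent <= -1.0) return VERDICT_NO_GO;
    return VERDICT_NEUTRAL;
}
```

For a Mixed 10-run, mean_ops() would be called once on the ten A samples and once on the ten B samples, and the two means passed to ab_delta_percent().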
Risks
- Code-size/layout sensitivity: hot/cold split can help or hurt depending on placement.
  - Mitigation: keep hot path strictly minimal; mark slow helper cold,noinline.
- If profiles rely on HAKMEM_TINY_HEADER_MODE=LIGHT/OFF in release runs:
  - Mitigation: hot-full triggers only for FULL; other modes remain supported (slow path).