Moe Charm (CI) 8052e8b320 Phase 24-26: Hot path atomic telemetry prune (+2.00% cumulative)

Summary:
- Phase 24 (alloc stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL (code cleanliness)
- Total: 11 atomics compiled-out, +2.00% improvement

Phase 24: OBSERVE tax prune (tiny_class_stats_box.h)
- Added HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)
- Wrapped 5 stats functions: uc_miss, warm_hit, shared_lock, tls_carve_*
- Result: +0.93% (baseline 56.675M vs compiled-in 56.151M ops/s)

Phase 25: Tiny free stats prune (tiny_superslab_free.inc.h)
- Added HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)
- Wrapped g_free_ss_enter atomic in free hot path
- Result: +1.07% (baseline 57.017M vs compiled-in 56.415M ops/s)

Phase 26: Hot path diagnostic atomics prune
- Added 5 compile gates for low-frequency error counters:
  - HAKMEM_TINY_C7_FREE_COUNT_COMPILED
  - HAKMEM_TINY_HDR_MISMATCH_LOG_COMPILED
  - HAKMEM_TINY_HDR_META_MISMATCH_COMPILED
  - HAKMEM_TINY_METRIC_BAD_CLASS_COMPILED
  - HAKMEM_TINY_HDR_META_FAST_COMPILED
- Result: -0.33% NEUTRAL (within noise, kept for cleanliness)

Alignment with mimalloc principles:
- "No atomics on hot path" - telemetry moved to compile-time opt-in
- Fixed per-op tax elimination
- Production builds: maximum performance (atomics compiled-out)
- Research builds: full diagnostics (COMPILED=1)

Generated with Claude Code
https://claude.com/claude-code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-16 05:35:11 +09:00

Phase 21: Tiny Header HotFull (alloc header write hot/cold split)

Status: ✅ GO (default ON / opt-out)

Problem statement

tiny_region_id_write_header() runs on every allocation and is on the hot path. Even when the steady-state configuration is the default (header mode = FULL, guard disabled), the function still carries:

- runtime mode selection (FULL/LIGHT/OFF)
- guard gate (tiny_guard_is_enabled()), even when it is OFF
- extra branches/code for “bench-only” experimentation modes

This is exactly the kind of per-op fixed tax that stays visible after Phase 6–10 consolidation.

Goal

Keep semantics identical, but make the common-case fast path reduce to:

*(uint8_t*)base = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
return (uint8_t*)base + 1;

Box Theory framing

- This is a refactor inside the TinyHeaderBox (no new global layers).
- Boundary is a single conversion point: tiny_region_id_write_header() decides “hot-full vs slow-path” once, then either returns or calls a cold helper.
- Rollback is easy: keep the old implementation behind an ENV gate.

Proposed implementation

1) Add a dedicated ENV gate (rollback handle)

ENV (default ON / opt-out):

HAKMEM_TINY_HEADER_HOTFULL=0/1

Meaning:

- 0: disable hot/cold split (revert to unified logic)
- 1 (or unset): enable hot/cold split (hot-full + cold helper)
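The default-ON / opt-out semantics above can be sketched as follows. This is a minimal illustration, not the project's actual gate: the helper names hotfull_parse() and tiny_header_hotfull_enabled() are hypothetical, and the function-local cache is an assumption about how the lookup might be kept off the per-allocation path.

```c
#include <stdlib.h>

/* Hypothetical helper: maps the raw ENV value to the gate state.
 * NULL (unset) or an empty string means "default ON". */
static int hotfull_parse(const char *v) {
    if (v == NULL || *v == '\0') return 1;  /* unset/empty: default ON */
    return (*v != '0') ? 1 : 0;             /* "0" opts out, anything else is ON */
}

/* Hypothetical accessor: reads HAKMEM_TINY_HEADER_HOTFULL once and caches it.
 * The unsynchronized cache is benign here only because every thread would
 * compute the same value; a real implementation may prefer a different scheme. */
static int tiny_header_hotfull_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        cached = hotfull_parse(getenv("HAKMEM_TINY_HEADER_HOTFULL"));
    }
    return cached;
}
```

Splitting the parse from the cached lookup keeps the opt-out rule testable without touching the process environment.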

2) Hot path: FULL mode only + no guard call

In core/tiny_region_id.h:

- Keep tiny_header_mode() as-is (do not re-introduce global env-cache SSOT patterns).
- In tiny_region_id_write_header():
  - Compute int header_mode = tiny_header_mode();
  - If HAKMEM_TINY_HEADER_HOTFULL=1 and header_mode == TINY_HEADER_MODE_FULL:
    - write header byte unconditionally
    - return (uint8_t*)base + 1
    - do not call tiny_guard_is_enabled() on this hot path
  - Otherwise, delegate to cold helper (below)

Rationale:

- FULL is the default for performance profiles.
- Guard is a debug tool; when it must be enabled, we pay the slow path cost explicitly.

3) Cold helper: everything else (LIGHT/OFF + guard)

Add a cold noinline helper, e.g.:

__attribute__((cold,noinline))
static void* tiny_region_id_write_header_slow(void* base, int class_idx, int header_mode);

This helper contains:

- LIGHT/OFF store-elision logic
- allocation-side guard hook
- any debug-only plumbing (already under #if !HAKMEM_BUILD_RELEASE)
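Steps 2 and 3 together might look like the sketch below. Several things here are assumptions made for illustration: the constant values for HEADER_MAGIC and HEADER_CLASS_MASK, the enum values, the g_sketch_* globals that stand in for the real mode, gate, and guard queries, and the exact LIGHT/OFF behavior (LIGHT skips the store when the byte is already correct, OFF skips it entirely). How a guard-enabled run gets routed to the slow helper is not spelled out in this document, so the sketch leaves that to configuration.

```c
#include <stdint.h>

/* Assumed values for illustration; the real constants live in the project headers. */
#define HEADER_MAGIC      0xA0u
#define HEADER_CLASS_MASK 0x0Fu

enum { TINY_HEADER_MODE_FULL = 0, TINY_HEADER_MODE_LIGHT = 1, TINY_HEADER_MODE_OFF = 2 };

/* Hypothetical stand-ins for the real runtime queries. */
static int g_sketch_header_mode = TINY_HEADER_MODE_FULL;
static int g_sketch_hotfull     = 1;  /* HAKMEM_TINY_HEADER_HOTFULL state */
static int g_sketch_guard       = 0;  /* guard enabled? */
static int g_sketch_slow_calls  = 0;  /* counts slow-helper entries (sketch only) */
static int g_sketch_guard_hits  = 0;  /* counts guard-hook entries (sketch only) */

static inline int tiny_header_mode(void)            { return g_sketch_header_mode; }
static inline int tiny_header_hotfull_enabled(void) { return g_sketch_hotfull; }
static inline int tiny_guard_is_enabled(void)       { return g_sketch_guard; }

/* Cold helper: every non-hot-full case, plus the guard hook. */
__attribute__((cold, noinline))
static void *tiny_region_id_write_header_slow(void *base, int class_idx, int header_mode) {
    uint8_t *p = (uint8_t *)base;
    uint8_t want = (uint8_t)(HEADER_MAGIC | ((unsigned)class_idx & HEADER_CLASS_MASK));
    g_sketch_slow_calls++;
    if (header_mode == TINY_HEADER_MODE_FULL) {
        *p = want;                    /* unified logic: always store */
    } else if (header_mode == TINY_HEADER_MODE_LIGHT) {
        if (*p != want) *p = want;    /* assumed: elide the store if already correct */
    }                                 /* assumed: OFF performs no store */
    if (tiny_guard_is_enabled()) {
        g_sketch_guard_hits++;        /* placeholder for the allocation-side guard hook */
    }
    return p + 1;
}

/* Hot entry point: decide once, then either return or call the cold helper. */
static inline void *tiny_region_id_write_header(void *base, int class_idx) {
    int header_mode = tiny_header_mode();
    if (tiny_header_hotfull_enabled() && header_mode == TINY_HEADER_MODE_FULL) {
        *(uint8_t *)base = (uint8_t)(HEADER_MAGIC | ((unsigned)class_idx & HEADER_CLASS_MASK));
        return (uint8_t *)base + 1;   /* no guard call on this path */
    }
    return tiny_region_id_write_header_slow(base, class_idx, header_mode);
}
```

With the gate set to 0, every call goes through the slow helper, which is the rollback behavior described in step 1.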

Safety invariants

- Header byte remains correct for all classes (C0–C7).
- Returned pointer remains base + 1.
- Free path classification remains unchanged.
- When HAKMEM_TINY_HEADER_HOTFULL=1, non-FULL or guard-enabled configurations must still work via the slow helper.
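The first three invariants lend themselves to a small check that can be run against any header writer. The sketch below is hypothetical: the constant values, the header_writer_fn type, and both sketch_* functions are illustrative names, and it assumes the free path classifies a block by reading the byte immediately before the user pointer.

```c
#include <stdint.h>

/* Assumed values for illustration; the real constants live in the project headers. */
#define HEADER_MAGIC        0xA0u
#define HEADER_CLASS_MASK   0x0Fu
#define SKETCH_TINY_CLASSES 8      /* C0..C7 */

/* Hypothetical signature shared by any writer under test. */
typedef void *(*header_writer_fn)(void *base, int class_idx);

/* Reference writer: the two-line fast path from the Goal section. */
static void *sketch_reference_writer(void *base, int class_idx) {
    *(uint8_t *)base = (uint8_t)(HEADER_MAGIC | ((unsigned)class_idx & HEADER_CLASS_MASK));
    return (uint8_t *)base + 1;
}

/* Returns 0 when all invariants hold for every class, otherwise a negative
 * code naming the first violated invariant. */
static int sketch_check_header_invariants(header_writer_fn write_header) {
    for (int c = 0; c < SKETCH_TINY_CLASSES; c++) {
        uint8_t block[16] = {0};
        uint8_t *user = (uint8_t *)write_header(block, c);

        if (user != block + 1) return -1;   /* returned pointer must be base + 1 */

        uint8_t hdr = user[-1];             /* what the free path is assumed to read */
        if ((hdr & (uint8_t)~HEADER_CLASS_MASK) != HEADER_MAGIC) return -2;  /* magic intact */
        if ((hdr & HEADER_CLASS_MASK) != (unsigned)c) return -3;  /* class decodes back */
    }
    return 0;
}
```

Passing both the gate-on and gate-off writers through the same checker is one way to confirm that the split did not change what the free path observes.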

A/B plan (same-binary)

Command (run once per gate setting, using the same binary):

scripts/run_mixed_10_cleanenv.sh

- Baseline: HAKMEM_TINY_HEADER_HOTFULL=0
- Treatment: HAKMEM_TINY_HEADER_HOTFULL=1

Perf counters (optional, but recommended):

perf stat -e cycles,instructions,branches,branch-misses,cache-misses,iTLB-load-misses,dTLB-load-misses

GO/NO-GO

- GO: Mixed 10-run mean improves by +1.0% or more
- NEUTRAL: change stays within ±1.0%
- NO-GO: Mixed 10-run mean regresses by -1.0% or worse
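The thresholds above can be applied mechanically to the two sets of run results. The sketch below assumes throughput (ops/s) samples where higher is better; the type and function names are hypothetical, and treating exactly +1.0% as GO and exactly -1.0% as NO-GO is one reading of the boundaries.

```c
#include <stddef.h>

typedef enum { VERDICT_NO_GO = -1, VERDICT_NEUTRAL = 0, VERDICT_GO = 1 } ab_verdict;

/* Arithmetic mean of n samples; returns 0.0 for an empty set. */
static double sketch_mean(const double *v, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return n ? sum / (double)n : 0.0;
}

/* Relative change of the treatment mean versus the baseline mean, in percent.
 * The caller must pass a non-zero baseline. */
static double sketch_delta_pct(double baseline_mean, double treatment_mean) {
    return (treatment_mean - baseline_mean) / baseline_mean * 100.0;
}

/* Applies the GO / NEUTRAL / NO-GO thresholds. */
static ab_verdict sketch_verdict(double delta_pct) {
    if (delta_pct >= 1.0)  return VERDICT_GO;
    if (delta_pct <= -1.0) return VERDICT_NO_GO;
    return VERDICT_NEUTRAL;
}
```

Feeding the ten HAKMEM_TINY_HEADER_HOTFULL=0 results in as the baseline and the ten HAKMEM_TINY_HEADER_HOTFULL=1 results as the treatment yields the verdict directly.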

Risks

- Code-size/layout sensitivity: hot/cold split can help or hurt depending on placement.
  - Mitigation: keep hot path strictly minimal; mark slow helper cold,noinline.
- If profiles rely on HAKMEM_TINY_HEADER_MODE=LIGHT/OFF in release runs:
  - Mitigation: hot-full triggers only for FULL; other modes remain supported (slow path).