Goal: Eliminate 5 duplicate getenv("HAKMEM_TINY_LARSON_FIX") calls
- Create unified TLS cache box: tiny_larson_fix_tls_box.h
- Replace 5 separate static __thread blocks with single helper
Result: -1.34% throughput (54.55M → 53.82M ops/s)
- Expected: +0.3-0.7%
- Actual: -1.34%
- Decision: NO-GO, reverted immediately
Root cause: Compiler optimization works better with separate-scope TLS caches
- Each scope gets independent optimization
- Function call overhead outweighs duplication savings
- Rare case where duplication is optimal
Key learning: Not all code duplication is inefficient. Per-scope TLS
caching can outperform centralized caching when compiler can optimize
each scope independently.
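The per-scope pattern that won can be sketched as follows. This is a hedged illustration, not the actual HAKMEM code: the helper name and the "first char is '1'" parse are assumptions; only the ENV variable name comes from the log above. Each call site keeps its own `__thread` cache, so the compiler can treat the cached value as scope-local.

```c
#include <stdlib.h>

/* Illustrative sketch of one of the five per-scope caches: each call site
 * has its own static __thread slot, read from getenv() at most once per
 * thread. The helper name and parse rule are assumptions. */
static inline int larson_fix_enabled_scope_a(void) {
    static __thread int cached = -1;  /* -1 = not yet read in this thread */
    if (cached < 0) {
        const char* v = getenv("HAKMEM_TINY_LARSON_FIX");
        cached = (v && v[0] == '1') ? 1 : 0;
    }
    return cached;
}
```

The unified-box variant moved this into one shared helper; per the numbers above, losing the independent per-scope optimization cost more than the deduplication saved.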
Phase 20 (Proposal): WarmPool SlabIdx Hint (avoid O(cap) scan on warm hit)
TL;DR
unified_cache_refill() already does batch refill (up to 128/256/512 blocks) and already has the Warm Pool to avoid registry scans.
The remaining hot cost on the warm-hit path is typically:
1. tiny_warm_pool_pop(class_idx) (O(1))
2. for (i = 0..cap) scan to find slab_idx where tiny_get_class_from_ss(ss, i) == class_idx (O(cap))
3. ss_tls_bind_one(...) and optional TLS carve
This Phase 20 reduces step (2) by storing a slab_idx hint alongside the warm pool entry.
Expected ROI: +1–4% on Mixed (only if warm-hit rate is high and cap scan is nontrivial).
Box Theory framing
New box
WarmPoolEntryBox: “warm entry = (SuperSlab*, slab_idx_hint)”
- Responsibility: store + retrieve warm candidates with a verified slab index hint
- No side effects outside warm pool memory (pure stack operations)
Boundary (single conversion point)
The only “conversion” is in unified_cache_refill():
WarmPoolEntryBox.pop(class_idx) → (ss, slab_idx_hint)
→ validate hint once
→ ss_tls_bind_one(class_idx, tls, ss, slab_idx, tid)
No other code paths should re-infer slab_idx.
Rollback
ENV gate: HAKMEM_WARM_POOL_SLABIDX_HINT=0/1 (default 0, opt-in)
- OFF: current TinyWarmPool behavior (ss only, scan to find slab_idx)
- ON: use entry-with-hint fast path; fall back to scan if the hint is invalid
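A minimal sketch of the gate check, following the same per-thread getenv caching pattern that Phase 19 validated. The ENV variable name is from this proposal; the helper name is illustrative, not the real API.

```c
#include <stdlib.h>

/* Hedged sketch: one-time, thread-local read of the opt-in gate.
 * Default is 0 (OFF), matching the rollback contract above. */
static inline int warm_slabidx_hint_enabled(void) {
    static __thread int g = -1;  /* -1 = not yet read in this thread */
    if (g < 0) {
        const char* v = getenv("HAKMEM_WARM_POOL_SLABIDX_HINT");
        g = (v && v[0] == '1') ? 1 : 0;
    }
    return g;
}
```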
Design details
Data structure
Add a parallel “entry” type without changing SuperSlab layout:
typedef struct {
SuperSlab* ss;
uint16_t slab_idx_hint; // [0..cap-1], or 0xFFFF for “unknown”
} TinyWarmEntry;
Implementation options:
1. Replace TinyWarmPool.slabs[] with TinyWarmEntry entries[]
2. Dual arrays (slabs[] + slab_idx[])
Prefer (2) if you want minimal diff risk.
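Option (2) might look like the sketch below. This is a hedged illustration, assuming nothing about the real TinyWarmPool layout: the capacity, field names, and LIFO discipline are all placeholders; only the parallel-array idea and the 0xFFFF "unknown" sentinel come from the text above.

```c
#include <stdint.h>
#include <stddef.h>

#define WARM_CAP_SKETCH 8            /* illustrative capacity, not the real one */
typedef struct SuperSlab SuperSlab;  /* opaque for this sketch */

/* Dual arrays: slabs[] keeps the existing layout, slab_idx[] carries the
 * parallel hint (0xFFFF = unknown). */
typedef struct {
    SuperSlab* slabs[WARM_CAP_SKETCH];
    uint16_t   slab_idx[WARM_CAP_SKETCH];
    int        top;                  /* LIFO stack top */
} TinyWarmPoolSketch;

static int warm_push(TinyWarmPoolSketch* p, SuperSlab* ss, uint16_t hint) {
    if (p->top >= WARM_CAP_SKETCH) return 0;  /* full: caller handles overflow */
    p->slabs[p->top]    = ss;
    p->slab_idx[p->top] = hint;
    p->top++;
    return 1;
}

static SuperSlab* warm_pop(TinyWarmPoolSketch* p, uint16_t* hint_out) {
    if (p->top == 0) return NULL;             /* empty = warm miss */
    p->top--;
    *hint_out = p->slab_idx[p->top];
    return p->slabs[p->top];
}
```

Keeping slabs[] untouched means the OFF path compiles to exactly the current behavior, which is what makes this the minimal-diff choice.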
Push contract
When pushing a warm candidate, compute slab_idx once (cold-ish context) and store the hint:
int slab_idx = ss_find_first_slab_for_class(ss, class_idx); // may scan
warm_pool_push(class_idx, ss, slab_idx);
This moves scan cost to the producer side (where it is already scanning registry / iterating slabs).
Pop contract (hot)
TinyWarmEntry e = warm_pool_pop_entry(class_idx);
if (!e.ss) return NULL;
int slab_idx = e.slab_idx_hint;
if (slab_idx != 0xFFFF && tiny_get_class_from_ss(e.ss, slab_idx) == class_idx) {
// fast: validated
} else {
// fallback: scan to find slab_idx (rare)
slab_idx = ss_find_first_slab_for_class(e.ss, class_idx);
if (slab_idx < 0) return NULL; // fail-fast → warm miss fallback
}
Fail-fast rules
- If hint invalid and scan fails → treat as warm miss (fallback to existing refill path)
- Never bind TLS to a slab_idx that doesn’t match class
Minimal observability
Add only coarse counters (ENV-gated, or debug builds only):
- warm_hint_hit (hint valid)
- warm_hint_miss (hint invalid → scanned)
- warm_hint_scan_fail (should be ~0; fail-fast signal)
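The three counters could be kept in one thread-local struct, sketched below. The counter names follow this proposal; the struct, storage, and update helper are assumptions (a real build would likely gate this behind the ENV flag or a debug macro).

```c
#include <stdint.h>

/* Hedged sketch of the coarse observability counters. */
typedef struct {
    uint64_t warm_hint_hit;        /* hint valid, fast path taken */
    uint64_t warm_hint_miss;       /* hint invalid, fell back to scan */
    uint64_t warm_hint_scan_fail;  /* scan also failed; should stay ~0 */
} WarmHintStats;

static __thread WarmHintStats g_warm_hint_stats;

static inline void warm_hint_count(int hint_valid, int scan_failed) {
    if (hint_valid)        g_warm_hint_stats.warm_hint_hit++;
    else if (scan_failed)  g_warm_hint_stats.warm_hint_scan_fail++;
    else                   g_warm_hint_stats.warm_hint_miss++;
}
```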
A/B plan
Step 0: prove this is worth doing
Before implementing, validate that warm pool is used and the slab scan is nontrivial:
- perf record / perf report, focused on unified_cache_refill
- confirm the slab scan loop shows up (e.g. tiny_get_class_from_ss / loop body)
If unified_cache_refill is < ~3% total cycles in Mixed, ROI will be capped.
Step 1: implement behind ENV
- HAKMEM_WARM_POOL_SLABIDX_HINT=0/1 (default 0)
- Keep behavior identical when OFF.
Step 2: Mixed 10-run
Use the existing cleanenv harness:
HAKMEM_WARM_POOL_SLABIDX_HINT=0 scripts/run_mixed_10_cleanenv.sh
HAKMEM_WARM_POOL_SLABIDX_HINT=1 scripts/run_mixed_10_cleanenv.sh
GO/NO-GO:
- GO: mean +1.0% or more
- NEUTRAL: ±1.0%
- NO-GO: -1.0% or worse
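The thresholds above can be encoded as a tiny classifier, for example when post-processing the 10-run means. This is an illustrative helper, not part of the harness; the function name is made up.

```c
#include <string.h>

/* Hedged sketch: classify the mean throughput delta (in percent, ON vs OFF)
 * against the GO/NO-GO thresholds stated above. */
static const char* ab_verdict(double delta_pct) {
    if (delta_pct >= 1.0)  return "GO";
    if (delta_pct <= -1.0) return "NO-GO";
    return "NEUTRAL";
}
```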
Step 3: perf sanity
If GO, confirm:
- branches/op down (or stable)
- no iTLB explosion
Why “Segment Batch Refill Layer” is probably redundant here
unified_cache_refill() already batches heavily (up to 512) and already has Warm Pool + PageBox + carve boxes.
If you want a Phase 20 with big ROI, aim at the remaining O(cap) scan and any remaining registry scans, not adding another batch layer on top of an existing batch layer.